- KonferenzbeitragNN2SQL: Let SQL Think for Neural Networks(BTW 2023, 2023) Schüle, Maximilian Emanuel; Kemper, Alfons; Neumann, Thomas; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, GottfriedAlthough database systems perform well in data access and manipulation, their relational model hinders data scientists from formulating machine learning algorithms in SQL. Nevertheless, we argue that modern database systems perform well for machine learning algorithms expressed in relational algebra. To overcome the barrier of the relational model, this paper shows how to transform data into a relational representation for training neural networks in SQL: We first describe building blocks for data transformation in SQL. Then, we compare an implementation for model training using array data types to the one using a relational representation in SQL-92 only. The evaluation proves the suitability of modern database systems for matrix algebra, although specialised array data types perform better than matrices in relational representation.
- KonferenzbeitragBTW 2023 - Complete proceedings(BTW 2023, 2023) Köhnen, Christoph; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
- KonferenzbeitragWannaDB: Ad-hoc SQL Queries over Text Collections(BTW 2023, 2023) Hättasch, Benjamin; Bodensohn, Jan-Micha; Vogel, Liane; Urban, Matthias; Binnig, Carsten; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfriedn this paper, we propose a new system called WannaDB that allows users to interactively perform structured explorations of text collections in an ad-hoc manner. Extracting structured data from text is a classical problem where a plenitude of approaches and even industry-scale systems already exists. However, these approaches lack in the ability to support the ad-hoc exploration of texts using structured queries. The main idea of WannaDB is to include user interaction to support ad-hoc SQL queries over text collections using a new two-phased approach. First, a superset of information nuggets from the texts is extracted using existing extractors such as named entity recognizers. Then, the extractions are interactively matched to a structured table definition as requested by the user based on embeddings. In our evaluation, we show that WannaDB is thus able to extract structured data from a broad range of (real-world) text collections in high quality without the need to design extraction pipelines upfront.
- KonferenzbeitragWitness Generation for JSON Schema Patterns(BTW 2023, 2023) Köhnen, Christoph; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, GottfriedJSON Schema is a schema language for the popular data exchange format JSON. This paper introduces an approach to convert regular expressions, which appear in ECMA-262 syntax in JSON Schema, into an alternative syntax such that they may be compiled to finite-state automata.This is a step towards generating witnesses, i.e., JSON instances which are valid w.r.t. the given JSON Schema specification. Specifically, we address the challenge that the ECMA-262 pattern syntax uses anchor symbols to mark the beginning and end of a word, which is not compatible with available libraries for automata manipulation. We implement an algorithm proposed by Dominik Freydenberger to convert regular expressions into brics syntax. We show that we successfully address over 97% of the patterns found in a collection of thousands of JSON Schema specifications collected from GitHub.
- KonferenzbeitragEfficient handling of recursive relationships in ORM frameworks using Entity Framework Core as an example(BTW 2023, 2023) Killisch, Benjamin Uwe; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, GottfriedORM frameworks are a popular method to bridge the differences between object-oriented programming and relational data management. At the same time, recursive relationships are present in many schemas to represent tree-like or net-like structures. This paper discusses how to efficiently build and execute queries for data with recursive relationships in ORM frameworks. Five possible solutions are conceived and then implemented in Entity Framework Core (EF Core), while making sure that they can be used like regular LINQ queries. Next, the solutions are tested with different SQL dialects. The results of these tests are then analyzed by a variety of test parameters. This analysis shows that queries with recursive common table expressions and queries using key loading are the most efficient. Queries with auxiliary property, vertical unrolling or horizontal unrolling are either too slow or only usable under particular circumstances. The analysis also shows that the performance of the solutions is always dependent on the circumstances, especially the SQL dialect.
- KonferenzbeitragWhich Rules Entail this Fact? - An Efficient Approach Using RDBMSs(BTW 2023, 2023) Gutberlet, Tim; Sauerbier, Janik; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, GottfriedIn this paper, we focus on the problem of identifying all rules that entail a certain target fact given a knowledge graph and a set of previously learned rules. This problem is relevant in the context of link prediction and explainability. We propose an efficient approach using relational database technology including indexing, filtering and pre-computing methods. Our experiments demonstrate the efficiency of our approach and the effect of various optimizations on different datasets like YAGO3-10, WN18RR and FB15k-237 using rules learned by the bottom up rule learner AnyBURL.
- KonferenzbeitragExplainable Data Matching: Selecting Representative Pairs with Active Learning Pair-Selection Strategies(BTW 2023, 2023) Laskowski, Lukas; Sold, Florian; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, GottfriedIn both research and enterprise, dirty data poses numerous challenges. Many data cleaning pipelines include a data deduplication step that detects and removes entries within a given dataset which refer to the same real-world entity. Throughout the development of such deduplication techniques, data scientists have to make sense of the large result sets that their matching solutions generate to quickly identify changes in behavior or to discover opportunities for improvements. We propose an approach that aims to select a small subset of pairs from the result set of a data matching solution which is representative of the matching solution’s overall behavior. To evaluate our approach, we show that the performance of a matching solution trained on pairs selected according to our strategy outperforms a randomly selected subset of pairs.
- KonferenzbeitragTo Iterate Is Human, to Recurse Is Divine --- Mapping Iterative Python to Recursive SQL(BTW 2023, 2023) Fischer, Tim; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, GottfriedWriting complex algorithms and iterative computations in SQL is difficult at best, commonly leading to code that intermingles looping control flow with database access. This yields programs with control flow that rapidly hops in and out of the database, with each roundtrip incurring significant overhead. We present the ByePy compiler, which can compile entire Python functions directly to plain recursive SQL:1999 queries. By doing so, the compilation eliminates all but a single roundtrip, leading to runtime speedups of up to an order of magnitude.
- KonferenzbeitragOptimizing Query Processing in PostgreSQL Through Learned Optimizer Hints(BTW 2023, 2023) Thiessat, Jerome; Woltmann, Lucas; Hartmann, Claudio; Habich, Dirk; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, GottfriedQuery optimization in database systems is an important aspect and despite decades of research, it isstill far from being solved. Nowadays, query optimizers usually provide hints to be able to steer theoptimization on a query-by-query basis. However, setting the best-fitting hints is challenging. To tacklethat, we present a learning-based approach to predict the best-fitting hints for each incoming query. Inparticular, our learning approach is based on simple gradient boosting, where we learn one modelper query context for fine-grained predictions rather than a single global context-agnostic model asproposed in related work. We demonstrate the efficiency as well as effectiveness of our learning-basedapproach using the open-source database system PostgreSQL and show that our approach outperformsrelated work in that context.