- TextdokumentUsing FALCES against bias in automated decisions by integrating fairness in dynamic model ensembles(BTW 2021, 2021) Lässig, Nico; Oppold, Sarah; Herschel, Melanie; Kai-Uwe Sattler; Melanie Herschel; Wolfgang LehnerAs regularly reported in the media, automated classifications and decisions based on machine learning models can cause unfair treatment of certain groups of a general population. Classically, the machine learning models are designed to make highly accurate decisions in general. When one machine learning model is not sufficient to define the possibly complex boundary between classes, multiple specialized" models are used within a model ensemble to further boost accuracy. In particular
- TextdokumentAggregate-based Training Phase for ML-based Cardinality Estimation(BTW 2021, 2021) Woltmann, Lucas; Hartmann, Claudio; Habich, Dirk; Lehner, Wolfgang; Kai-Uwe Sattler; Melanie Herschel; Wolfgang LehnerCardinality estimation is a fundamental task in database query processing and optimization. As shown in recent papers, machine learning (ML)-based approaches may deliver more accurate cardinality estimations than traditional approaches. However, a lot of training queries have to be executed during the model training phase to learn a data-dependent ML model making it very time-consuming. Many of those training or example queries use the same base data, have the same query structure, and only differ in their selective predicates. To speed up the model training phase, our core idea is to determine a predicate-independent pre-aggregation of the base data and to execute the example queries over this pre-aggregated data. Based on this idea, we present a specific aggregate-based training phase for ML-based cardinality estimation approaches in this paper. As we are going to show with different workloads in our evaluation, we are able to achieve an average speedup of 63 with our aggregate-based training phase and thus outperform indexes.
- TextdokumentCluster Flow - an Advanced Concept for Ensemble-Enabling, Interactive Clustering(BTW 2021, 2021) Obermeier, Sandra; Beer, Anna; Wahl, Florian; Seidl, Thomas; Kai-Uwe Sattler; Melanie Herschel; Wolfgang LehnerEven though most clustering algorithms serve knowledge discovery in fields other than computer science, most of them still require users to be familiar with programming or data mining to some extent. As that often prevents efficient research, we developed an easy to use, highly explainable clustering method accompanied by an interactive tool for clustering. It is based on intuitively understandable kNN graphs and the subsequent application of adaptable filters, which can be combined ensemble-like and iteratively and prune unnecessary or misleading edges. For a first overview of the data, fully automatic predefined filter cascades deliver robust results. A selection of simple filters and combination methods that can be chosen interactively yield very good results on benchmark datasets compared to various algorithms.
- TextdokumentBTW 2021 - Komplettband(BTW 2021, 2021) Kai-Uwe Sattler; Melanie Herschel; Wolfgang Lehner
- TextdokumentOptimized Theta-Join Processing(BTW 2021, 2021) Weise, Julian; Schmidl, Sebastian; Papenbrock, Thorsten; Kai-Uwe Sattler; Melanie Herschel; Wolfgang LehnerThe Theta-Join is a powerful operation to connect tuples of different relational tables based on arbitrary conditions. The operation is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is an expensive operation, because basically all database management systems (DBMSs) translate theta-joins into a Cartesian product with a post-filter for non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, bloom-filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <, ?, ?, ?, >. In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is baked into our distributed in-memory database system prototype A2DB. Our evaluation on various real-world and synthetic datasets shows that A2DB significantly outperforms existing single-machine DBMSs including PostgreSQL and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries.
- TextdokumentExploring Memory Access Patterns for Graph Processing Accelerators(BTW 2021, 2021) Dann, Jonas; Ritter, Daniel; Fröning, Holger; Kai-Uwe Sattler; Melanie Herschel; Wolfgang LehnerRecent trends in business and technology (e.g., machine learning, social network analysis) benefit from storing and processing growing amounts of graph-structured data in databases and data science platforms. FPGAs as accelerators for graph processing with a customizable memory hierarchy promise solving performance problems caused by inherent irregular memory access patterns on traditional hardware (e.g., CPU). However, developing such hardware accelerators is yet time-consuming and difficult and benchmarking is non-standardized, hindering comprehension of the impact of memory access pattern changes and systematic engineering of graph processing accelerators. In this work, we propose a simulation environment for the analysis of graph processing accelerators based on simulating their memory access patterns. Further, we evaluate our approach on two state-of-the-art FPGA graph processing accelerators and show reproducibility, comparablity, as well as the shortened development process by an example. Not implementing the cycle-accurate internal data flow on accelerator hardware like FPGAs significantly reduces the implementation time, increases the benchmark parameter transparency, and allows comparison of graph processing approaches.
- TextdokumentUmbra as a Time Machine(BTW 2021, 2021) Karnowski, Lukas; Schüle, Maximilian E.; Kemper, Alfons; Neumann, Thomas; Kai-Uwe Sattler; Melanie Herschel; Wolfgang LehnerOnline lexicons such as Wikipedia rely on incremental edits that change text strings marginally. To support text versioning inside of the Umbra database system, this study presents the implementation of a dedicated data type. This versioning data type is designed for maximal throughput as it stores the latest string as a whole and computes previous ones using backward diffs. Using this data type for Wikipedia articles, we achieve a compression rate of up to 5% and outperform the traditional text data type, when storing each version as one tuple individually, by an order of magnitude.
- TextdokumentPrecise, Compact, and Fast Data Access Counters for Automated Physical Database Design(BTW 2021, 2021) Brendle, Michael; Weber, Nick; Valiyev, Mahammad; May, Norman; Schulze, Robert; Böhm, Alexander; Moerkotte, Guido; Grossniklaus, Michael; Kai-Uwe Sattler; Melanie Herschel; Wolfgang LehnerToday's database management systems offer numerous tuning knobs that allow an adaptation of the database behavior to specific customer needs
- TextdokumentData Management in Multi-Agent Simulation Systems(BTW 2021, 2021) Glake, Daniel; Panse, Fabian; Ritter, Norbert; Clemen, Thomas; Lenfers, Ulfia; Kai-Uwe Sattler; Melanie Herschel; Wolfgang LehnerMulti-agent simulations are an upcoming trend to deal with the urgent need to predict complex situations as they arise in many real-life areas, such as disaster or traffic management. Such simulations require large amounts of heterogeneous data ranging from spatio-temporal to standard object properties. This and the increasing demand for large scale and real-time simulations pose many challenges for data management. In this paper, we present the architecture of a typical agent-based simulation system, describe several data management challenges that arise in such a data ecosystem, and discuss their current solutions within our multi-agent simulation system MARS.
- TextdokumentSilentium! Run-Analyse-Eradicate the Noise out of the DB/OS Stack(BTW 2021, 2021) Mauerer, Wolfgang; Ramsauer, Ralf; Lucas, Edson; Lohmann, Daniel; Scherzinger, Stefanie; Kai-Uwe Sattler; Melanie Herschel; Wolfgang LehnerWhen multiple tenants compete for resources, database performance tends to suffer. Yet there are several scenarios where guaranteed sub-millisecond latencies are crucial, such as in real-time scenarios, IoT, or when operating in safety-critical environments. In this paper, we study how to make query latencies deterministic in the face of noise (whether caused by other tenants or unrelated operating systems tasks). We perform controlled experiments with an in-memory database in a multi-tenant setting, where we successively eradicate noisy interference from within the system software stack, to the point where the engine runs close to bare-metal on the underlying hardware. We show that we can achieve query latencies comparable to the database engine running as the sole tenant, but without noticeably impacting the workload of competing tenants. We discuss these results in the context of ongoing efforts to build custom operating systems for database workloads, and point out that for certain use cases, the margin for improvement is rather narrow. In fact, for scenarios like ours, existing operating systems might just be good enough, provided that they are expertly configured. We then critically discuss these findings in the light of a broader family of database systems (e.g., including disk-based), and the technological disruption of the advances in modern hardware.