P311 - BTW2021- Datenbanksysteme für Business, Technologie und Web
Autor*innen mit den meisten Dokumenten
Neueste Veröffentlichungen
- TextdokumentUsing FALCES against bias in automated decisions by integrating fairness in dynamic model ensembles(BTW 2021, 2021) Lässig, Nico; Oppold, Sarah; Herschel, MelanieAs regularly reported in the media, automated classifications and decisions based on machine learning models can cause unfair treatment of certain groups of a general population. Classically, the machine learning models are designed to make highly accurate decisions in general. When one machine learning model is not sufficient to define the possibly complex boundary between classes, multiple specialized" models are used within a model ensemble to further boost accuracy. In particular
- TextdokumentCluster Flow - an Advanced Concept for Ensemble-Enabling, Interactive Clustering(BTW 2021, 2021) Obermeier, Sandra; Beer, Anna; Wahl, Florian; Seidl, ThomasEven though most clustering algorithms serve knowledge discovery in fields other than computer science, most of them still require users to be familiar with programming or data mining to some extent. As that often prevents efficient research, we developed an easy to use, highly explainable clustering method accompanied by an interactive tool for clustering. It is based on intuitively understandable kNN graphs and the subsequent application of adaptable filters, which can be combined ensemble-like and iteratively and prune unnecessary or misleading edges. For a first overview of the data, fully automatic predefined filter cascades deliver robust results. A selection of simple filters and combination methods that can be chosen interactively yield very good results on benchmark datasets compared to various algorithms.
- TextdokumentAggregate-based Training Phase for ML-based Cardinality Estimation(BTW 2021, 2021) Woltmann, Lucas; Hartmann, Claudio; Habich, Dirk; Lehner, WolfgangCardinality estimation is a fundamental task in database query processing and optimization. As shown in recent papers, machine learning (ML)-based approaches may deliver more accurate cardinality estimations than traditional approaches. However, a lot of training queries have to be executed during the model training phase to learn a data-dependent ML model making it very time-consuming. Many of those training or example queries use the same base data, have the same query structure, and only differ in their selective predicates. To speed up the model training phase, our core idea is to determine a predicate-independent pre-aggregation of the base data and to execute the example queries over this pre-aggregated data. Based on this idea, we present a specific aggregate-based training phase for ML-based cardinality estimation approaches in this paper. As we are going to show with different workloads in our evaluation, we are able to achieve an average speedup of 63 with our aggregate-based training phase and thus outperform indexes.
- TextdokumentOptimized Theta-Join Processing(BTW 2021, 2021) Weise, Julian; Schmidl, Sebastian; Papenbrock, ThorstenThe Theta-Join is a powerful operation to connect tuples of different relational tables based on arbitrary conditions. The operation is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is an expensive operation, because basically all database management systems (DBMSs) translate theta-joins into a Cartesian product with a post-filter for non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, bloom-filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <, ?, ?, ?, >. In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is baked into our distributed in-memory database system prototype A2DB. Our evaluation on various real-world and synthetic datasets shows that A2DB significantly outperforms existing single-machine DBMSs including PostgreSQL and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries.
- TextdokumentUmbra as a Time Machine(BTW 2021, 2021) Karnowski, Lukas; Schüle, Maximilian E.; Kemper, Alfons; Neumann, ThomasOnline lexicons such as Wikipedia rely on incremental edits that change text strings marginally. To support text versioning inside of the Umbra database system, this study presents the implementation of a dedicated data type. This versioning data type is designed for maximal throughput as it stores the latest string as a whole and computes previous ones using backward diffs. Using this data type for Wikipedia articles, we achieve a compression rate of up to 5% and outperform the traditional text data type, when storing each version as one tuple individually, by an order of magnitude.
- TextdokumentExploring Memory Access Patterns for Graph Processing Accelerators(BTW 2021, 2021) Dann, Jonas; Ritter, Daniel; Fröning, HolgerRecent trends in business and technology (e.g., machine learning, social network analysis) benefit from storing and processing growing amounts of graph-structured data in databases and data science platforms. FPGAs as accelerators for graph processing with a customizable memory hierarchy promise solving performance problems caused by inherent irregular memory access patterns on traditional hardware (e.g., CPU). However, developing such hardware accelerators is yet time-consuming and difficult and benchmarking is non-standardized, hindering comprehension of the impact of memory access pattern changes and systematic engineering of graph processing accelerators. In this work, we propose a simulation environment for the analysis of graph processing accelerators based on simulating their memory access patterns. Further, we evaluate our approach on two state-of-the-art FPGA graph processing accelerators and show reproducibility, comparablity, as well as the shortened development process by an example. Not implementing the cycle-accurate internal data flow on accelerator hardware like FPGAs significantly reduces the implementation time, increases the benchmark parameter transparency, and allows comparison of graph processing approaches.
- TextdokumentBTW 2021 - Komplettband(BTW 2021, 2021)
- TextdokumentPrecise, Compact, and Fast Data Access Counters for Automated Physical Database Design(BTW 2021, 2021) Brendle, Michael; Weber, Nick; Valiyev, Mahammad; May, Norman; Schulze, Robert; Böhm, Alexander; Moerkotte, Guido; Grossniklaus, MichaelToday's database management systems offer numerous tuning knobs that allow an adaptation of the database behavior to specific customer needs
- TextdokumentSilentium! Run-Analyse-Eradicate the Noise out of the DB/OS Stack(BTW 2021, 2021) Mauerer, Wolfgang; Ramsauer, Ralf; Lucas, Edson; Lohmann, Daniel; Scherzinger, StefanieWhen multiple tenants compete for resources, database performance tends to suffer. Yet there are several scenarios where guaranteed sub-millisecond latencies are crucial, such as in real-time scenarios, IoT, or when operating in safety-critical environments. In this paper, we study how to make query latencies deterministic in the face of noise (whether caused by other tenants or unrelated operating systems tasks). We perform controlled experiments with an in-memory database in a multi-tenant setting, where we successively eradicate noisy interference from within the system software stack, to the point where the engine runs close to bare-metal on the underlying hardware. We show that we can achieve query latencies comparable to the database engine running as the sole tenant, but without noticeably impacting the workload of competing tenants. We discuss these results in the context of ongoing efforts to build custom operating systems for database workloads, and point out that for certain use cases, the margin for improvement is rather narrow. In fact, for scenarios like ours, existing operating systems might just be good enough, provided that they are expertly configured. We then critically discuss these findings in the light of a broader family of database systems (e.g., including disk-based), and the technological disruption of the advances in modern hardware.
- TextdokumentB²-Tree(BTW 2021, 2021) Schmeißer, Josef; Schüle, Maximilian E.; Leis, Viktor; Neumann, Thomas; Kemper, AlfonsRecently proposed index structures, that combine trie-based and comparison-based search mechanisms, considerably improve retrieval throughput for in-memory database systems. However, most of these index structures allocate small memory chunks when required. This stands in contrast to block-based index structures, that are necessary for disk-accesses of beyond main-memory database systems such as Umbra. We therefore present the B²-tree. The outer structure is identical to that of an ordinary B+-tree. It still stores elements in a dense array in sorted order, enabling efficient range scan operations. However, B²-tree is composed of multiple trees, each page integrates another trie-based search tree, which is used to determine a small memory region where a sought entry may be found. An embedded tree thereby consists of decision nodes, which operate on a single byte at a time, and span nodes, which are used to store common prefixes. This architecture usually accesses fewer cache lines than a vanilla B+-tree as our performance evaluation proved. As a result, the B²-tree is able to answer point queries considerably faster.
- «
- 1 (current)
- 2
- 3
- »