P311 - BTW 2021 - Datenbanksysteme für Business, Technologie und Web
Listing by publication date (1-10 of 23)
- Silentium! Run-Analyse-Eradicate the Noise out of the DB/OS Stack (BTW 2021). Mauerer, Wolfgang; Ramsauer, Ralf; Lucas, Edson; Lohmann, Daniel; Scherzinger, Stefanie.
  When multiple tenants compete for resources, database performance tends to suffer. Yet there are several scenarios where guaranteed sub-millisecond latencies are crucial, such as real-time scenarios, IoT, or safety-critical environments. In this paper, we study how to make query latencies deterministic in the face of noise (whether caused by other tenants or unrelated operating system tasks). We perform controlled experiments with an in-memory database in a multi-tenant setting, where we successively eradicate noisy interference from within the system software stack, to the point where the engine runs close to bare metal on the underlying hardware. We show that we can achieve query latencies comparable to the database engine running as the sole tenant, without noticeably impacting the workload of competing tenants. We discuss these results in the context of ongoing efforts to build custom operating systems for database workloads, and point out that for certain use cases, the margin for improvement is rather narrow. In fact, for scenarios like ours, existing operating systems might just be good enough, provided that they are expertly configured. We then critically discuss these findings in the light of a broader family of database systems (e.g., including disk-based ones) and the technological disruption caused by advances in modern hardware.
- Extended Affinity Propagation Clustering for Multi-source Entity Resolution (BTW 2021). Lerm, Stefan; Saeedi, Alieh; Rahm, Erhard.
  Entity resolution is the data integration task of identifying matching entities (e.g., products, customers) in one or several data sources. Previous approaches for matching and clustering entities across multiple (>2) sources either treated the different sources as a single source or assumed that the individual sources are duplicate-free, so that only matches between sources have to be found. In this work, we propose and evaluate a general Multi-Source Clean Dirty (MSCD) scheme with an arbitrary combination of clean (duplicate-free) and dirty sources. For this purpose, we extend the constraint-based clustering algorithm Affinity Propagation (AP) to entity clustering with clean and dirty sources (MSCD-AP). We also consider a hierarchical version for improved scalability. Our evaluation considers a full range of datasets containing 0% to 100% clean sources. We compare our proposed algorithms with other clustering schemes in terms of both match quality and runtime.
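The clean-source constraint at the heart of the MSCD setting can be illustrated with a toy greedy merge. This is a hypothetical stand-in for illustration only, not the paper's MSCD-AP algorithm: entities are (source, id) pairs, and two entities from the same clean (duplicate-free) source must never end up in the same cluster.

```python
def cluster_entities(pairs, clean_sources):
    """Greedily merge entity clusters by descending similarity.

    pairs: iterable of (entity_a, entity_b, similarity), where an
    entity is a (source, id) tuple. Returns a list of clusters (sets).
    """
    clusters = []

    def find(entity):
        # return the cluster containing entity, creating a singleton if new
        for c in clusters:
            if entity in c:
                return c
        c = {entity}
        clusters.append(c)
        return c

    for a, b, _sim in sorted(pairs, key=lambda p: -p[2]):
        ca, cb = find(a), find(b)
        if ca is cb:
            continue
        merged = ca | cb
        # clean-source constraint: at most one entity per clean source
        clean = [src for src, _ in merged if src in clean_sources]
        if len(clean) == len(set(clean)):
            ca |= cb
            clusters.remove(cb)
    return clusters
```

Even a high-similarity pair from the same clean source stays split, while cross-source merges proceed normally.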
- Optimized Theta-Join Processing (BTW 2021). Weise, Julian; Schmidl, Sebastian; Papenbrock, Thorsten.
  The theta-join is a powerful operation to connect tuples of different relational tables based on arbitrary conditions. The operation is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is an expensive operation, because essentially all database management systems (DBMSs) translate theta-joins into a Cartesian product with a post-filter for non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, Bloom filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <, ≤, ≠, ≥, >. In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is baked into our distributed in-memory database system prototype A2DB. Our evaluation on various real-world and synthetic datasets shows that A2DB significantly outperforms existing single-machine DBMSs, including PostgreSQL, and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries.
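The pruning intuition for a single inequality predicate can be sketched with textbook sorting and binary search (a generic sketch, not A2DB's actual multi-predicate, distributed algorithm): for a condition like R.a < S.b, sorting S lets us locate the first qualifying tuple once per probe, after which every later tuple matches without any per-pair filtering.

```python
import bisect

def theta_join_naive(R, S, theta):
    # what most DBMSs fall back to: Cartesian product plus post-filter
    return [(r, s) for r in R for s in S if theta(r, s)]

def less_than_join(R, S_sorted):
    # assumes S_sorted is sorted ascending; join condition is r < s
    out = []
    for r in R:
        first = bisect.bisect_right(S_sorted, r)  # first s with s > r
        out.extend((r, s) for s in S_sorted[first:])
    return out
```

The naive version inspects |R| x |S| pairs; the sorted version skips all non-matching candidates below the binary-search cutoff.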
- Umbra as a Time Machine (BTW 2021). Karnowski, Lukas; Schüle, Maximilian E.; Kemper, Alfons; Neumann, Thomas.
  Online lexicons such as Wikipedia rely on incremental edits that change text strings only marginally. To support text versioning inside the Umbra database system, this study presents the implementation of a dedicated data type. This versioning data type is designed for maximal throughput, as it stores the latest string as a whole and computes previous versions using backward diffs. Using this data type for Wikipedia articles, we achieve a compression rate of up to 5% and outperform the traditional text data type, which stores each version individually as one tuple, by an order of magnitude.
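The backward-diff scheme described above can be sketched as follows; this is a minimal Python model built on stdlib difflib opcodes, not Umbra's actual data type implementation:

```python
import difflib

class VersionedText:
    """Store the latest string in full; recover older versions by
    replaying backward diffs from the newest one."""

    def __init__(self, text):
        self.latest = text
        self._diffs = []  # _diffs[i] rebuilds version i from version i+1

    def update(self, new_text):
        # record the patch that transforms new_text back into the
        # current latest (i.e., a backward diff)
        ops = difflib.SequenceMatcher(None, new_text, self.latest).get_opcodes()
        patch = [(i1, i2, self.latest[j1:j2])
                 for tag, i1, i2, j1, j2 in ops if tag != "equal"]
        self._diffs.append(patch)
        self.latest = new_text

    def version(self, n):
        """Reconstruct version n (0 = oldest)."""
        text = self.latest
        for patch in reversed(self._diffs[n:]):
            out, pos = [], 0
            for i1, i2, repl in patch:
                out.append(text[pos:i1])
                out.append(repl)
                pos = i2
            out.append(text[pos:])
            text = "".join(out)
        return text
```

Reading the latest version is free; older versions cost one patch application per intervening edit, which fits the access pattern of a lexicon where recent versions dominate.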
- B²-Tree (BTW 2021). Schmeißer, Josef; Schüle, Maximilian E.; Leis, Viktor; Neumann, Thomas; Kemper, Alfons.
  Recently proposed index structures that combine trie-based and comparison-based search mechanisms considerably improve retrieval throughput for in-memory database systems. However, most of these index structures allocate small memory chunks on demand. This stands in contrast to block-based index structures, which are necessary for disk access in beyond-main-memory database systems such as Umbra. We therefore present the B²-tree. Its outer structure is identical to that of an ordinary B+-tree, and it still stores elements in a dense, sorted array, enabling efficient range scan operations. In addition, however, each page of the B²-tree embeds a trie-based search tree, which is used to determine a small memory region in which a sought entry may be found. Such an embedded tree consists of decision nodes, which operate on a single byte at a time, and span nodes, which store common prefixes. As our performance evaluation shows, this architecture usually accesses fewer cache lines than a vanilla B+-tree. As a result, the B²-tree is able to answer point queries considerably faster.
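The idea of embedding a byte-level search structure inside a sorted page can be illustrated with a degenerate one-level "trie": a jump table on the first key byte that narrows the binary search to a small slice of the sorted array. The real B²-tree embeds full trees of decision and span nodes per page; this toy keeps only the narrowing effect.

```python
import bisect

class Page:
    def __init__(self, keys):
        self.keys = sorted(keys)  # dense sorted array, as in a B+-tree
        self.jump = {}            # first byte -> (lo, hi) slice of self.keys
        for idx, k in enumerate(self.keys):
            b = k[0]
            lo, _ = self.jump.get(b, (idx, idx))
            self.jump[b] = (lo, idx + 1)

    def lookup(self, key):
        # consult the embedded byte index first, then binary-search
        # only the narrowed region
        lo, hi = self.jump.get(key[0], (0, 0))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i < hi and self.keys[i] == key
```

The point-query path touches the jump table plus a few array slots instead of bisecting the whole page, which is where the cache-line savings come from.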
- Combining Programming-by-Example with Transformation Discovery from large Databases (BTW 2021). Özmen, Aslihan; Esmailoghli, Mahdi; Abedjan, Ziawasch.
  Data transformation discovery is one of the most tedious tasks in data preparation. In particular, generating transformation programs for semantic transformations is tricky, because additional sources for look-up operations are necessary. Current systems for semantic transformation discovery face two major problems: either they follow a program synthesis approach that only scales to a small set of input tables, or they rely on extracting transformation functions from large corpora, which requires identifying exact transformations in those resources and is prone to noisy data. In this paper, we combine both approaches to benefit from large corpora and the sophistication of program synthesis. To do so, we devise an ensemble of retrieval and pruning strategies that extracts the most relevant tables for a given transformation task. The extracted resources can then be processed by a program synthesis engine to generate more accurate transformation results than the state of the art.
- Exploring Memory Access Patterns for Graph Processing Accelerators (BTW 2021). Dann, Jonas; Ritter, Daniel; Fröning, Holger.
  Recent trends in business and technology (e.g., machine learning, social network analysis) benefit from storing and processing growing amounts of graph-structured data in databases and data science platforms. FPGAs, as accelerators for graph processing with a customizable memory hierarchy, promise to solve the performance problems caused by the inherently irregular memory access patterns on traditional hardware (e.g., CPUs). However, developing such hardware accelerators is still time-consuming and difficult, and benchmarking is non-standardized, which hinders both understanding the impact of memory access pattern changes and the systematic engineering of graph processing accelerators. In this work, we propose a simulation environment for the analysis of graph processing accelerators based on simulating their memory access patterns. Further, we evaluate our approach on two state-of-the-art FPGA graph processing accelerators and demonstrate reproducibility and comparability, as well as a shortened development process, by example. Not implementing the cycle-accurate internal data flow on accelerator hardware such as FPGAs significantly reduces the implementation time, increases the transparency of benchmark parameters, and allows comparison of graph processing approaches.
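The core idea, replaying an accelerator's memory trace against a model rather than implementing the cycle-accurate design in hardware, can be illustrated with a minimal direct-mapped cache model (a toy sketch, not the paper's simulation environment):

```python
class DirectMappedCache:
    """Count hits and misses for a replayed address trace."""

    def __init__(self, num_lines, line_size):
        self.num_lines, self.line_size = num_lines, line_size
        self.tags = [None] * num_lines  # cached line number per slot
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        slot = line % self.num_lines
        if self.tags[slot] == line:
            self.hits += 1
        else:
            self.tags[slot] = line  # evict whatever was there
            self.misses += 1
```

Feeding it a sequential trace versus a random one already exposes the kind of irregular-access penalty the paper studies, without writing a single line of RTL.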
- Tracing the History of the Baltic Sea Oxygen Level (BTW 2021). Auge, Tanja; Heuer, Andreas.
  To guarantee the reproducibility of research results, large research communities, conferences, and journals increasingly demand the provision of original research data. Since this is often not possible or desired, a certain tact and sensitivity are needed. With our method, which combines provenance and evolution, we can identify the source tuples necessary to reconstruct a query result, even in temporal databases. To avoid dirty data caused by the inverse evolution, we introduce the what-provenance, which remembers the data types of the source relation.
- Flexible data partitioning schemes for parallel merge joins in semantic web queries (BTW 2021). Warnke, Benjamin; Rehan, Muhammad Waqas; Fischer, Stefan; Groppe, Sven.
  In the context of the Semantic Web, large amounts of data must be preprocessed and stored so that they can be queried efficiently later. The key technology in this area is the triple store, in which all information is stored in the form of (subject, predicate, object) triples. Depending on the triple patterns used within the queries, very different value distributions can be observed within these datasets. Currently, these properties are only exploited implicitly during join optimization, in the form of histograms or similar techniques. This paper proposes a new way to take advantage of these different distributions by using different partitioning schemes at runtime. This means that an optimal partitioning scheme can be chosen depending on the data access, in order to improve query performance. In our experiments, we achieve speedups of up to a factor of 5.92 compared to no partitioning, and a performance improvement of up to 81% compared to a non-optimal number of partitions.
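Two classic partitioning schemes a triple store could choose between at runtime can be sketched generically (these are standard textbook techniques, not the paper's actual implementation): hash partitioning spreads skewed values evenly, while range partitioning keeps each partition ordered, which suits parallel merge joins.

```python
import bisect

def hash_partition(triples, key, n):
    # key: tuple position to partition on (0=subject, 1=predicate, 2=object)
    parts = [[] for _ in range(n)]
    for t in triples:
        parts[hash(t[key]) % n].append(t)
    return parts

def range_partition(triples, key, bounds):
    # bounds: sorted split points; partition i receives values v with
    # v <= bounds[i], and values above all bounds go to the last partition
    parts = [[] for _ in range(len(bounds) + 1)]
    for t in triples:
        parts[bisect.bisect_left(bounds, t[key])].append(t)
    return parts
```

A query optimizer could pick hash partitioning for heavily skewed predicates and range partitioning when the downstream operator is a merge join over sorted runs.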
- Applying Machine Learning Models to Scalable DataFrames with Grizzly (BTW 2021). Kläbe, Steffen; Hagedorn, Stefan.
  The popular Python Pandas framework provides an easy-to-use DataFrame API that enables a broad range of users to analyze their data. However, Pandas faces severe scalability issues in terms of runtime and memory consumption, limiting the usability of the framework. In this paper, we present Grizzly, a replacement for Python Pandas. Instead of bringing the data to the operators as Pandas does, Grizzly ships program complexity to database systems by transpiling the DataFrame API to SQL code. Additionally, Grizzly offers user-friendly support for combining different data sources, user-defined functions, and applying machine learning models directly inside the database system. Our evaluation shows that Grizzly significantly outperforms Pandas, as well as state-of-the-art frameworks for distributed Python processing, in several use cases.
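The transpilation idea can be sketched with a hypothetical lazy DataFrame that records operations and only emits SQL on demand. The class and method names below are illustrative assumptions, not Grizzly's actual API:

```python
class LazyFrame:
    """Records DataFrame-style operations and emits SQL on demand,
    instead of executing them eagerly in Python."""

    def __init__(self, table, filters=None, cols=None):
        self.table, self.filters, self.cols = table, filters or [], cols

    def filter(self, cond):
        # returns a new frame; nothing is executed yet
        return LazyFrame(self.table, self.filters + [cond], self.cols)

    def project(self, cols):
        return LazyFrame(self.table, self.filters, cols)

    def to_sql(self):
        # fold the recorded operations into a single query the
        # database system can optimize and execute as a whole
        sel = ", ".join(self.cols) if self.cols else "*"
        sql = f"SELECT {sel} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql
```

Because the whole pipeline collapses into one query, the data never has to leave the database, which is precisely what avoids Pandas-style memory blowup.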