P311 - BTW2021- Datenbanksysteme für Business, Technologie und Web

https://dl.gi.de/handle/20.500.12116/35791

Auflistung nach:

1 - 10 von 23

Textdokument
Towards Resilient Data Management for the Internet of Moving Things
(BTW 2021, 2021) Paz, Elena Beatriz Ouro; Zacharatou, Eleni Tzirita; Markl, Volker
Mobile devices have become ubiquitous; smartphones, tablets and wearables are essential commodities for many people. The ubiquity of mobile devices combined with their ever increasing capabilities, open new possibilities for Internet-of-Things (IoT) applications where mobile devices act as both data generators as well as processing nodes. However, deploying a stream processing system (SPS) over mobile devices is particularly challenging as mobile devices change their position within the network very frequently and are notoriously prone to transient disconnections. To deal with faults arising from disconnections and mobility, existing fault tolerance strategies in SPS are either checkpointing-based or replication-based. Checkpointing-based strategies are too heavyweight for mobile devices, as they save and broadcast state periodically, even when there are no failures. On the other hand, replication-based strategies cannot provide fault tolerance at the level of the data source, as the data source itself cannot be always replicated. Finally, existing systems exclude mobile devices from data processing upon a disconnection even when the duration of the disconnection is very short, thus failing to exploit the computing capabilities of the offline devices. This paper proposes a buffering-based reactive fault tolerance strategy to handle transient disconnections of mobile devices that both generate and process data, even in cases where the devices move through the network during the disconnection. The main components of our strategy are: (a) a circular buffer that stores the data which are generated and processed locally during a device disconnection, (b) a query-aware buffer replacement policy, and (c) a query restart process that ensures the correct forwarding of the buffered data upon re-connection, taking into account the new network topology. We integrate our fault tolerance strategy with NebulaStream, a novel stream processing system specifically designed for the IoT. We evaluate our strategy using a custom benchmark based on real data, exhibiting reduction in data loss and query runtime compared to the baseline NebulaStream.
Textdokument
Tracing the History of the Baltic Sea Oxygen Level
(BTW 2021, 2021) Auge, Tanja; Heuer, Andreas
In order to guarantee the reproducibility of research results, large research communities, conferences and journals increasingly demand the provision of original research data. Since this is often not possible or desired, a certain tact and sensitivity is needed. With our method, combining provenance and evolution, we can identify the source tuples necessary for the reconstruction of a query result also in temporal databases. To avoid dirty data caused by the inverse evolution, we introduced the what-provenance, which remembers the data types of the source relation.
Textdokument
Extended Affinity Propagation Clustering for Multi-source Entity Resolution
(BTW 2021, 2021) Lerm, Stefan; Saeedi, Alieh; Rahm, Erhard
Entity resolution is the data integration task of identifying matching entities (e.g. products, customers) in one or several data sources. Previous approaches for matching and clustering entities between multiple (>2) sources either treated the different sources as a single source or assumed that the individual sources are duplicate-free, so that only matches between sources have to be found. In this work we propose and evaluate a general Multi-Source Clean Dirty (MSCD) scheme with an arbitrary combination of clean (duplicate-free) and dirty sources. For this purpose, we extend a constraint-based clustering algorithm called Affinity Propagation (AP) for entity clustering with clean and dirty sources (MSCD-AP). We also consider a hierarchical version of it for improved scalability. Our evaluation considers a full range of datasets containing 0% to 100% of clean sources. We compare our proposed algorithms with other clustering schemes in terms of both match quality and runtime.
Textdokument
Data Management in Multi-Agent Simulation Systems
(BTW 2021, 2021) Glake, Daniel; Panse, Fabian; Ritter, Norbert; Clemen, Thomas; Lenfers, Ulfia
Multi-agent simulations are an upcoming trend to deal with the urgent need to predict complex situations as they arise in many real-life areas, such as disaster or traffic management. Such simulations require large amounts of heterogeneous data ranging from spatio-temporal to standard object properties. This and the increasing demand for large scale and real-time simulations pose many challenges for data management. In this paper, we present the architecture of a typical agent-based simulation system, describe several data management challenges that arise in such a data ecosystem, and discuss their current solutions within our multi-agent simulation system MARS.
Textdokument
B²-Tree
(BTW 2021, 2021) Schmeißer, Josef; Schüle, Maximilian E.; Leis, Viktor; Neumann, Thomas; Kemper, Alfons
Recently proposed index structures, that combine trie-based and comparison-based search mechanisms, considerably improve retrieval throughput for in-memory database systems. However, most of these index structures allocate small memory chunks when required. This stands in contrast to block-based index structures, that are necessary for disk-accesses of beyond main-memory database systems such as Umbra. We therefore present the B²-tree. The outer structure is identical to that of an ordinary B+-tree. It still stores elements in a dense array in sorted order, enabling efficient range scan operations. However, B²-tree is composed of multiple trees, each page integrates another trie-based search tree, which is used to determine a small memory region where a sought entry may be found. An embedded tree thereby consists of decision nodes, which operate on a single byte at a time, and span nodes, which are used to store common prefixes. This architecture usually accesses fewer cache lines than a vanilla B+-tree as our performance evaluation proved. As a result, the B²-tree is able to answer point queries considerably faster.
Textdokument
The Data Lake Architecture Framework
(BTW 2021, 2021) Giebler, Corinna; Gröger, Christoph; Hoos, Eva; Eichler, Rebecca; Schwarz, Holger; Mitschang, Bernhard
During recent years, data lakes emerged as a way to manage large amounts of heterogeneous data for modern data analytics. Although various work on individual aspects of data lakes exists, there is no comprehensive data lake architecture yet. Concepts that describe themselves as a “data lake architecture” are only partial. In this work, we introduce the data lake architecture framework. It supports the definition of data lake architectures by defining nine architectural aspects, i.e., perspectives on a data lake, such as data storage or data modeling, and by exploring the interdependencies between these aspects. The included methodology helps to choose appropriate concepts to instantiate each aspect. To evaluate the framework, we use it to configure an exemplary data lake architecture for a real-world data lake implementation. This final assessment shows that our framework provides comprehensive guidance in the configuration of a data lake architecture.
Textdokument
Combining Programming-by-Example with Transformation Discovery from large Databases
(BTW 2021, 2021) özmen, Aslihan; Esmailoghli, Mahdi; Abedjan, Ziawasch
Data transformation discovery is one of the most tedious tasks in data preparation. In particular, the generation of transformation programs for semantic transformations is tricky because additional sources for look-up operations are necessary. Current systems for semantic transformation discovery face two major problems: either they follow a program synthesis approach that only scales to a small set of input tables, or they rely on extraction of transformation functions from large corpora, which requires the identification of exact transformations in those resources and is prone to noisy data. In this paper, we try to combine approaches to benefit from large corpora and the sophistication of program synthesis. To do so, we devise a retrieval and pruning strategy ensemble that extracts the most relevant tables for a given transformation task. The extracted resources can then be processed by a program synthesis engine to generate more accurate transformation results than state-of-the-art.
Textdokument
Optimized Theta-Join Processing
(BTW 2021, 2021) Weise, Julian; Schmidl, Sebastian; Papenbrock, Thorsten
The Theta-Join is a powerful operation to connect tuples of different relational tables based on arbitrary conditions. The operation is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is an expensive operation, because basically all database management systems (DBMSs) translate theta-joins into a Cartesian product with a post-filter for non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, bloom-filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <, ?, ?, ?, >. In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is baked into our distributed in-memory database system prototype A2DB. Our evaluation on various real-world and synthetic datasets shows that A2DB significantly outperforms existing single-machine DBMSs including PostgreSQL and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries.
Textdokument
Umbra as a Time Machine
(BTW 2021, 2021) Karnowski, Lukas; Schüle, Maximilian E.; Kemper, Alfons; Neumann, Thomas
Online lexicons such as Wikipedia rely on incremental edits that change text strings marginally. To support text versioning inside of the Umbra database system, this study presents the implementation of a dedicated data type. This versioning data type is designed for maximal throughput as it stores the latest string as a whole and computes previous ones using backward diffs. Using this data type for Wikipedia articles, we achieve a compression rate of up to 5% and outperform the traditional text data type, when storing each version as one tuple individually, by an order of magnitude.
Textdokument
Precise, Compact, and Fast Data Access Counters for Automated Physical Database Design
(BTW 2021, 2021) Brendle, Michael; Weber, Nick; Valiyev, Mahammad; May, Norman; Schulze, Robert; Böhm, Alexander; Moerkotte, Guido; Grossniklaus, Michael
Today's database management systems offer numerous tuning knobs that allow an adaptation of the database behavior to specific customer needs

Auflistung P311 - BTW2021- Datenbanksysteme für Business, Technologie und Web nach Erscheinungsdatum

Treffer pro Seite

Sortieroptionen