GI Digital Library :: Auflistung nach Autor:in "Kai-Uwe Sattler"

Auflistung nach Autor:in "Kai-Uwe Sattler"

1 - 10 von 23

Textdokument
Aggregate-based Training Phase for ML-based Cardinality Estimation
(BTW 2021, 2021) Woltmann, Lucas; Hartmann, Claudio; Habich, Dirk; Lehner, Wolfgang
Cardinality estimation is a fundamental task in database query processing and optimization. As shown in recent papers, machine learning (ML)-based approaches may deliver more accurate cardinality estimations than traditional approaches. However, a lot of training queries have to be executed during the model training phase to learn a data-dependent ML model making it very time-consuming. Many of those training or example queries use the same base data, have the same query structure, and only differ in their selective predicates. To speed up the model training phase, our core idea is to determine a predicate-independent pre-aggregation of the base data and to execute the example queries over this pre-aggregated data. Based on this idea, we present a specific aggregate-based training phase for ML-based cardinality estimation approaches in this paper. As we are going to show with different workloads in our evaluation, we are able to achieve an average speedup of 63 with our aggregate-based training phase and thus outperform indexes.
Textdokument
Applying Machine Learning Models to Scalable DataFrames with Grizzly
(BTW 2021, 2021) Kläbe, Steffen; Hagedorn, Stefan
The popular Python Pandas framework provides an easy-to-use DataFrame API that enables a broad range of users to analyze their data. However, Pandas faces severe scalability issues in terms of runtime and memory consumption, limiting the usability of the framework. In this paper we present Grizzly, a replacement for Python Pandas. Instead of bringing data to the operators like Pandas, Grizzly ships program complexity to database systems by transpiling the DataFrame API to SQL code. Additionally, Grizzly offers user-friendly support for combining different data sources, user-defined functions, and applying Machine Learning models directly inside the database system. Our evaluation shows that Grizzly significantly outperforms Pandas as well as state-of-the-art frameworks for distributed Python processing in several use cases.
Textdokument
BTW 2021 - Komplettband
(BTW 2021, 2021)
Textdokument
B²-Tree
(BTW 2021, 2021) Schmeißer, Josef; Schüle, Maximilian E.; Leis, Viktor; Neumann, Thomas; Kemper, Alfons
Recently proposed index structures, that combine trie-based and comparison-based search mechanisms, considerably improve retrieval throughput for in-memory database systems. However, most of these index structures allocate small memory chunks when required. This stands in contrast to block-based index structures, that are necessary for disk-accesses of beyond main-memory database systems such as Umbra. We therefore present the B²-tree. The outer structure is identical to that of an ordinary B+-tree. It still stores elements in a dense array in sorted order, enabling efficient range scan operations. However, B²-tree is composed of multiple trees, each page integrates another trie-based search tree, which is used to determine a small memory region where a sought entry may be found. An embedded tree thereby consists of decision nodes, which operate on a single byte at a time, and span nodes, which are used to store common prefixes. This architecture usually accesses fewer cache lines than a vanilla B+-tree as our performance evaluation proved. As a result, the B²-tree is able to answer point queries considerably faster.
Textdokument
Cluster Flow - an Advanced Concept for Ensemble-Enabling, Interactive Clustering
(BTW 2021, 2021) Obermeier, Sandra; Beer, Anna; Wahl, Florian; Seidl, Thomas
Even though most clustering algorithms serve knowledge discovery in fields other than computer science, most of them still require users to be familiar with programming or data mining to some extent. As that often prevents efficient research, we developed an easy to use, highly explainable clustering method accompanied by an interactive tool for clustering. It is based on intuitively understandable kNN graphs and the subsequent application of adaptable filters, which can be combined ensemble-like and iteratively and prune unnecessary or misleading edges. For a first overview of the data, fully automatic predefined filter cascades deliver robust results. A selection of simple filters and combination methods that can be chosen interactively yield very good results on benchmark datasets compared to various algorithms.
Textdokument
Combining Programming-by-Example with Transformation Discovery from large Databases
(BTW 2021, 2021) özmen, Aslihan; Esmailoghli, Mahdi; Abedjan, Ziawasch
Data transformation discovery is one of the most tedious tasks in data preparation. In particular, the generation of transformation programs for semantic transformations is tricky because additional sources for look-up operations are necessary. Current systems for semantic transformation discovery face two major problems: either they follow a program synthesis approach that only scales to a small set of input tables, or they rely on extraction of transformation functions from large corpora, which requires the identification of exact transformations in those resources and is prone to noisy data. In this paper, we try to combine approaches to benefit from large corpora and the sophistication of program synthesis. To do so, we devise a retrieval and pruning strategy ensemble that extracts the most relevant tables for a given transformation task. The extracted resources can then be processed by a program synthesis engine to generate more accurate transformation results than state-of-the-art.
Textdokument
The Data Lake Architecture Framework
(BTW 2021, 2021) Giebler, Corinna; Gröger, Christoph; Hoos, Eva; Eichler, Rebecca; Schwarz, Holger; Mitschang, Bernhard
During recent years, data lakes emerged as a way to manage large amounts of heterogeneous data for modern data analytics. Although various work on individual aspects of data lakes exists, there is no comprehensive data lake architecture yet. Concepts that describe themselves as a “data lake architecture” are only partial. In this work, we introduce the data lake architecture framework. It supports the definition of data lake architectures by defining nine architectural aspects, i.e., perspectives on a data lake, such as data storage or data modeling, and by exploring the interdependencies between these aspects. The included methodology helps to choose appropriate concepts to instantiate each aspect. To evaluate the framework, we use it to configure an exemplary data lake architecture for a real-world data lake implementation. This final assessment shows that our framework provides comprehensive guidance in the configuration of a data lake architecture.
Textdokument
Data Management in Multi-Agent Simulation Systems
(BTW 2021, 2021) Glake, Daniel; Panse, Fabian; Ritter, Norbert; Clemen, Thomas; Lenfers, Ulfia
Multi-agent simulations are an upcoming trend to deal with the urgent need to predict complex situations as they arise in many real-life areas, such as disaster or traffic management. Such simulations require large amounts of heterogeneous data ranging from spatio-temporal to standard object properties. This and the increasing demand for large scale and real-time simulations pose many challenges for data management. In this paper, we present the architecture of a typical agent-based simulation system, describe several data management challenges that arise in such a data ecosystem, and discuss their current solutions within our multi-agent simulation system MARS.
Textdokument
Exploring Memory Access Patterns for Graph Processing Accelerators
(BTW 2021, 2021) Dann, Jonas; Ritter, Daniel; Fröning, Holger
Recent trends in business and technology (e.g., machine learning, social network analysis) benefit from storing and processing growing amounts of graph-structured data in databases and data science platforms. FPGAs as accelerators for graph processing with a customizable memory hierarchy promise solving performance problems caused by inherent irregular memory access patterns on traditional hardware (e.g., CPU). However, developing such hardware accelerators is yet time-consuming and difficult and benchmarking is non-standardized, hindering comprehension of the impact of memory access pattern changes and systematic engineering of graph processing accelerators. In this work, we propose a simulation environment for the analysis of graph processing accelerators based on simulating their memory access patterns. Further, we evaluate our approach on two state-of-the-art FPGA graph processing accelerators and show reproducibility, comparablity, as well as the shortened development process by an example. Not implementing the cycle-accurate internal data flow on accelerator hardware like FPGAs significantly reduces the implementation time, increases the benchmark parameter transparency, and allows comparison of graph processing approaches.
Textdokument
Extended Affinity Propagation Clustering for Multi-source Entity Resolution
(BTW 2021, 2021) Lerm, Stefan; Saeedi, Alieh; Rahm, Erhard
Entity resolution is the data integration task of identifying matching entities (e.g. products, customers) in one or several data sources. Previous approaches for matching and clustering entities between multiple (>2) sources either treated the different sources as a single source or assumed that the individual sources are duplicate-free, so that only matches between sources have to be found. In this work we propose and evaluate a general Multi-Source Clean Dirty (MSCD) scheme with an arbitrary combination of clean (duplicate-free) and dirty sources. For this purpose, we extend a constraint-based clustering algorithm called Affinity Propagation (AP) for entity clustering with clean and dirty sources (MSCD-AP). We also consider a hierarchical version of it for improved scalability. Our evaluation considers a full range of datasets containing 0% to 100% of clean sources. We compare our proposed algorithms with other clustering schemes in terms of both match quality and runtime.

Auflistung nach Autor:in "Kai-Uwe Sattler"

Treffer pro Seite

Sortieroptionen