Listing by author "Papenbrock, Thorsten"
- Text document: An Actor Database System for Akka (BTW 2019 – Workshopband, 2019). Schmidl, Sebastian; Schneider, Frederic; Papenbrock, Thorsten.
  System architectures for data-centric applications commonly consist of two tiers: an application tier and a data tier. The fact that these tiers typically do not share a common format for data is referred to as the object-relational impedance mismatch. To mitigate this, we develop an actor database system that enables the implementation of application logic inside the data storage runtime. The actor model also allows for easy distribution of both data and computation across multiple nodes in a cluster. More specifically, we propose the concept of domain actors, which provide a type-safe, SQL-like interface for developing the actors of our database system, and the concept of Functors for building queries that retrieve data contained in multiple actor instances. Our experiments demonstrate the feasibility of encapsulating data into domain actors by evaluating their memory overhead and performance. We also discuss how our proposed actor database system framework solves some of the challenges that arise from the design of distributed databases, such as data partitioning, failure handling, and concurrent query processing. (See the domain-actor sketch below the listing.)
- Conference paper: Data Profiling - Efficient Discovery of Dependencies (Ausgezeichnete Informatikdissertationen 2017, 2018). Papenbrock, Thorsten.
- Text document: Data Profiling – Effiziente Entdeckung Struktureller Abhängigkeiten (BTW 2019, 2019). Papenbrock, Thorsten.
  Data is an indispensable asset, not only in computer science but also in many other scientific disciplines. It serves the exchange, linking, and storage of knowledge and is therefore essential in research and industry. Unfortunately, data is often not sufficiently documented to be used directly: the metadata that describes the structure, and thus the access patterns, of the digital information is missing. Computer scientists and experts from other disciplines therefore spend a lot of time analyzing and preparing data structurally. Since the search for metadata is a highly complex task, however, many algorithmic approaches already fail on small amounts of data. In the dissertation on which this summary is based, we present three novel discovery algorithms for important yet hard-to-find types of metadata: unique column combinations, functional dependencies, and inclusion dependencies. The proposed algorithms clearly outperform the previous state of the art in runtime and resource consumption and thereby make considerably larger datasets usable. Because applying such algorithms is not easy for users from other fields, we additionally develop the Metanome tool for intuitive data analysis. Metanome offers not only the algorithms proposed in this work but also discovery algorithms for other types of metadata. Using the use case of schema normalization, we finally demonstrate how the discovered metadata can be used effectively.
- Conference paper: DPQL: The Data Profiling Query Language (BTW 2023, 2023). Seeger, Marcian; Schmidl, Sebastian; Vielhauer, Alexander; Papenbrock, Thorsten.
  Data profiling describes the activity of extracting implicit metadata, such as schema descriptions, data types, and various kinds of data dependencies, from a given data set. The considerable number of research papers about novel metadata types and ever-faster data profiling algorithms emphasizes the importance of data profiling in practice. Unfortunately, though, the current state of data profiling research fails to address practical application needs: typical data profiling algorithms (i.e., challenging-to-operate structures) discover all (i.e., too many) minimal (i.e., the wrong) data dependencies within minutes to hours (i.e., too long). Consequently, if we look at the practical success of our research, we find that data profiling targets data cleaning, but most cleaning systems still use only hand-picked dependencies; data profiling targets query optimization, but hardly any query optimizer uses modern discovery algorithms for dependency extraction; data profiling targets data integration, but the application of automatically discovered dependencies for matching purposes is yet to be shown; and the list goes on. We aim to solve this profiling-and-application disconnect with a novel data profiling engine that integrates modern profiling techniques for various types of data dependencies and provides applications with a versatile, intuitive, and declarative Data Profiling Query Language (DPQL). The DPQL enables applications to specify precisely which dependencies are needed, which not only refines the results and makes the data profiling process more accessible but also enables much faster and (in terms of dependency types and selections) holistic profiling runs. We expect that integrating modern data profiling techniques and the post-processing of their results under a single application endpoint will result in a series of significant algorithmic advances, new pruning concepts, and a profiling engine with innovative components for workload auto-configuration, query optimization, and parallelization. With this paper, we present the first version of the DPQL syntax and introduce a fundamentally new line of research in data profiling. (See the dependency-query sketch below the listing.)
- Conference paper: Duplicate detection on GPUs (Datenbanksysteme für Business, Technologie und Web (BTW 2013), 2013). Forchhammer, Benedikt; Papenbrock, Thorsten; Stening, Thomas; Viehmeier, Sven; Draisbach, Uwe; Naumann, Felix.
  With the ever-increasing volume of data and the ability to integrate different data sources, data quality problems abound. Duplicate detection, as an integral part of data cleansing, is essential in modern information systems. We present a complete duplicate detection workflow that utilizes the capabilities of modern graphics processing units (GPUs) to increase the efficiency of finding duplicates in very large datasets. Our solution covers several well-known algorithms for pair selection, attribute-wise similarity comparison, record-wise similarity aggregation, and clustering. We redesigned these algorithms to run memory-efficiently and in parallel on the GPU. Our experiments demonstrate that the GPU-based workflow is able to outperform a CPU-based implementation on large, real-world datasets. For instance, the GPU-based algorithm deduplicates a dataset with 1.8m entities 10 times faster than a common CPU-based algorithm using comparably priced hardware. (See the deduplication-workflow sketch below the listing.)
- Journal article: Ein Datenbankkurs mit 6000 Teilnehmern (Informatik-Spektrum: Vol. 37, No. 4, 2014). Naumann, Felix; Jenders, Maximilian; Papenbrock, Thorsten.
  In the summer semester of 2013, we offered the course Datenmanagement mit SQL on openHPI, the online education platform of the Hasso Plattner Institute. Of the more than 6,000 participants, 1,641 received a certificate after seven weeks and 2,074 a confirmation of participation. The course followed the usual structure of an introductory database lecture and covered the fundamentals of ER modeling, relational design, and relational algebra, as well as a detailed introduction to SQL. The lecture content was broken up into short video units, each of which concluded with small self-tests. Alongside each topic block, participants had to solve homework assignments online and, at the end of the course, take an exam. We report on our experiences in running this first German database MOOC. In particular, we discuss the differences from a classical lecture and describe the, at times, difficult handling of thousands of participants. We thereby want to give all interested readers a look behind the scenes of a free online course and offer practical advice to all instructors who are planning such a course themselves.
- Conference paper: Fast Approximate Discovery of Inclusion Dependencies (Datenbanksysteme für Business, Technologie und Web (BTW 2017), 2017). Kruse, Sebastian; Papenbrock, Thorsten; Dullweber, Christian; Finke, Moritz; Hegner, Manuel; Zabel, Martin; Zöllner, Christian; Naumann, Felix.
  Inclusion dependencies (INDs) are relevant to several data management tasks, such as foreign key detection and data integration, and their discovery is a core concern of data profiling. However, n-ary IND discovery is computationally expensive, so existing algorithms often perform poorly on complex datasets. To this end, we present Faida, the first approximate IND discovery algorithm. Faida combines probabilistic and exact data structures to approximate the INDs in relational datasets. In fact, Faida is guaranteed to find all INDs; only with low probability might false positives occur due to the approximation. This small inaccuracy comes in favor of significantly increased performance, though. In our evaluation, we show that Faida scales to very large datasets and outperforms the state-of-the-art algorithm by a factor of up to six in terms of runtime without reporting any false positives. This shows that Faida strikes a good balance between efficiency and correctness. (See the approximate-IND sketch below the listing.)
- Conference paper: A Hybrid Approach for Efficient Unique Column Combination Discovery (Datenbanksysteme für Business, Technologie und Web (BTW 2017), 2017). Papenbrock, Thorsten; Naumann, Felix.
  Unique column combinations (UCCs) are groups of attributes in relational datasets that contain no value entry more than once. Hence, they indicate keys and serve data management tasks such as schema normalization, data integration, and data cleansing. Because the unique column combinations of a particular dataset are usually unknown, UCC discovery algorithms have been proposed to find them. All previous such discovery algorithms are, however, inapplicable to datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present the hybrid discovery algorithm HyUCC, which uses the same discovery techniques as the recently proposed functional dependency discovery algorithm HyFD: a hybrid combination of fast approximation techniques and efficient validation techniques. With it, the algorithm discovers all minimal unique column combinations in a given dataset. HyUCC not only outperforms all existing approaches, it also scales to much larger datasets. (See the UCC-validation sketch below the listing.)
- Conference paper: HYPEX: Hyperparameter Optimization in Time Series Anomaly Detection (BTW 2023, 2023). Schmidl, Sebastian; Wenig, Phillip; Papenbrock, Thorsten.
  In many domains, such as data cleaning, machine learning, pattern mining, or anomaly detection, a system's performance depends significantly on the selected configuration hyperparameters. However, manual configuration of hyperparameters is particularly difficult because it requires an in-depth understanding of the problem at hand and the system's internal behavior. While automatic methods for hyperparameter optimization exist, they require labeled training datasets and many trials to assess a system's performance before the system can be applied to production data. Hence, automatic methods merely shift the human effort from parameter optimization to the labeling of datasets, which is still complex and time-consuming. In this paper, we therefore propose a novel hyperparameter optimization framework called HYPEX that learns promising default parameters and explainable parameter rules from synthetically generated datasets, without the need for manually labeled datasets. HYPEX's learned parameter model enables the easy adjustment of a system's configuration to new, unlabeled, and unseen datasets. We demonstrate the capabilities of HYPEX in the context of time series anomaly detection because anomaly detection algorithms suffer from a general lack of labeled datasets and are particularly sensitive to parameter changes. In our evaluation, we show that our hyperparameter suggestions on unseen data significantly improve an algorithm's performance compared to existing manual hyperparameter optimization approaches and are often competitive with the optimal performance achieved with Bayesian optimization. (See the hyperparameter sketch below the listing.)
- Text document: Optimized Theta-Join Processing (BTW 2021, 2021). Weise, Julian; Schmidl, Sebastian; Papenbrock, Thorsten.
  The theta-join is a powerful operation to connect tuples of different relational tables based on arbitrary conditions. The operation is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is an expensive operation, because basically all database management systems (DBMSs) translate theta-joins into a Cartesian product with a post-filter for non-matching tuple pairs. This seems to be necessary because most join optimization techniques, such as indexing, hashing, Bloom filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <, ≤, ≠, ≥, and >. In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is baked into our distributed in-memory database system prototype A2DB. Our evaluation on various real-world and synthetic datasets shows that A2DB significantly outperforms existing single-machine DBMSs, including PostgreSQL, and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries. (See the inequality-join sketch below the listing.)
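The sketches below illustrate, under stated assumptions, some of the techniques described in the entries above; they are rough approximations, not the authors' implementations. The first relates to "An Actor Database System for Akka": a single classic-Akka actor encapsulates the tuples of one relation and answers a typed, SQL-like selection message. The message and actor names (Insert, SelectWhere, CustomerActor) are assumptions for this sketch, not the paper's actual interfaces, and the akka-actor library is assumed to be on the classpath.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Illustrative tuple type and typed query messages (not the paper's actual API).
final case class Customer(id: Int, name: String, city: String)
final case class Insert(row: Customer)
final case class SelectWhere(predicate: Customer => Boolean)

// A "domain actor": encapsulates the data of one relation and serves typed queries.
class CustomerActor extends Actor {
  private var rows = Vector.empty[Customer]

  override def receive: Receive = {
    case Insert(row)            => rows :+= row
    case SelectWhere(predicate) => sender() ! rows.filter(predicate)
  }
}

object DomainActorDemo extends App {
  val system    = ActorSystem("actor-db")
  val customers = system.actorOf(Props[CustomerActor](), "customers")

  customers ! Insert(Customer(1, "Alice", "Potsdam"))
  customers ! Insert(Customer(2, "Bob", "Berlin"))

  implicit val timeout: Timeout = Timeout(3.seconds)
  val result = Await.result(
    (customers ? SelectWhere(_.city == "Potsdam")).mapTo[Vector[Customer]],
    3.seconds
  )
  println(result) // Vector(Customer(1,Alice,Potsdam))

  system.terminate()
}
```

In a real distributed deployment, closures would not serialize across nodes, so query predicates would be expressed as data; the paper's Functor concept for queries spanning multiple actor instances is omitted here.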
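For "DPQL: The Data Profiling Query Language", the following stand-in illustrates the underlying idea of declarative dependency queries: the application states which dependencies it needs instead of post-filtering a complete discovery result by hand. The case class, column names, and filter are plain-Scala assumptions; they do not show actual DPQL syntax or the profiling engine.

```scala
// An illustration of declarative dependency selection (not DPQL syntax).
object DependencyQuerySketch extends App {
  // A discovered functional dependency lhs -> rhs.
  final case class Fd(lhs: Set[String], rhs: String)

  val discovered = List(
    Fd(Set("zip"), "city"),
    Fd(Set("first_name", "last_name", "birthday"), "customer_id"),
    Fd(Set("email"), "customer_id")
  )

  // "Query": all FDs that determine customer_id with at most two LHS columns,
  // e.g. as key candidates for a matching task.
  val keyCandidates =
    discovered.filter(fd => fd.rhs == "customer_id" && fd.lhs.size <= 2)

  keyCandidates.foreach(println) // Fd(Set(email),customer_id)
}
```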
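For "Duplicate detection on GPUs", this CPU-side sketch walks through the four workflow stages named in the abstract: pair selection, attribute-wise similarity, record-wise aggregation, and a clustering step reduced to simple thresholding. Field names, weights, and thresholds are assumptions; the paper's GPU kernels and exact similarity measures are not reproduced.

```scala
// A CPU-side sketch of the duplicate detection workflow stages.
final case class Record(id: Int, name: String, city: String)

object DedupSketch extends App {
  val records = List(
    Record(1, "Thorsten Papenbrock", "Potsdam"),
    Record(2, "T. Papenbrock",       "Potsdam"),
    Record(3, "Felix Naumann",       "Potsdam")
  )

  // 1) Pair selection: sorted-neighborhood style blocking on a sort key.
  val windowSize = 2
  val sorted = records.sortBy(_.name)
  val candidatePairs =
    sorted.sliding(windowSize).flatMap { w =>
      for (a <- w; b <- w if a.id < b.id) yield (a, b)
    }.toSet

  // 2) Attribute-wise similarity: Jaccard similarity over character trigrams.
  def trigrams(s: String): Set[String] =
    s.toLowerCase.replaceAll("\\s+", " ").sliding(3).toSet
  def jaccard(a: String, b: String): Double = {
    val (x, y) = (trigrams(a), trigrams(b))
    if (x.isEmpty && y.isEmpty) 1.0 else (x & y).size.toDouble / (x | y).size
  }

  // 3) Record-wise aggregation: weighted average of attribute similarities.
  def recordSim(a: Record, b: Record): Double =
    0.7 * jaccard(a.name, b.name) + 0.3 * jaccard(a.city, b.city)

  // 4) "Clustering" reduced to thresholding for this sketch.
  val duplicates = candidatePairs.filter { case (a, b) => recordSim(a, b) > 0.5 }
  duplicates.foreach { case (a, b) => println(s"duplicate pair: ${a.id} ~ ${b.id}") }
}
```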
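For "Fast Approximate Discovery of Inclusion Dependencies", the sketch below shows one way to check a unary IND approximately: each column is summarized by the set of hashes of its values, and A ⊆ B is accepted whenever every hash of A also appears among B's hashes. Hash collisions can only add spurious inclusions, never hide real ones, which mirrors the guarantee described in the abstract. This is an illustration only, not Faida's actual probabilistic data structures.

```scala
// Approximate unary IND checking via hashed value sets (illustrative only).
object ApproxIndSketch extends App {
  type Column = Seq[String]

  // Summarize a column by the hashes of its distinct values.
  def signature(column: Column): Set[Int] =
    column.iterator.map(_.hashCode).toSet

  // Approximate test for the unary IND "dependent ⊆ referenced".
  def probablyIncluded(dependent: Column, referenced: Column): Boolean =
    signature(dependent).subsetOf(signature(referenced))

  val orderCustomerIds = Seq("1", "2", "2", "3")       // dependent column
  val customerIds      = Seq("1", "2", "3", "4", "5")  // referenced column

  println(probablyIncluded(orderCustomerIds, customerIds)) // true
  println(probablyIncluded(customerIds, orderCustomerIds)) // false
}
```

An exact validation pass would re-check accepted candidates against the actual data to eliminate the rare false positives.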
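For "A Hybrid Approach for Efficient Unique Column Combination Discovery", this sketch shows only the validation side: a column combination is a UCC iff the projection of every record onto these columns is distinct. Real discovery algorithms additionally prune the exponential search space of candidate combinations; this is not HyUCC's actual validation strategy.

```scala
// Per-candidate uniqueness check for UCC discovery (validation side only).
object UccValidationSketch extends App {
  type Row = Vector[String]

  // True iff no two rows agree on all columns in `combination`.
  def isUnique(rows: Seq[Row], combination: Seq[Int]): Boolean = {
    val seen = scala.collection.mutable.HashSet.empty[Seq[String]]
    rows.forall(row => seen.add(combination.map(row)))
  }

  val rows: Seq[Row] = Seq(
    Vector("1", "Alice", "Potsdam"),
    Vector("2", "Bob",   "Potsdam"),
    Vector("3", "Bob",   "Berlin")
  )

  println(isUnique(rows, Seq(0)))    // true: column 0 alone is a key
  println(isUnique(rows, Seq(2)))    // false: "Potsdam" repeats
  println(isUnique(rows, Seq(1, 2))) // true: (name, city) is unique here
}
```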
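For "HYPEX: Hyperparameter Optimization in Time Series Anomaly Detection", the toy sketch below conveys the basic idea of learning a default hyperparameter from synthetic, auto-labeled data: generate series with injected spike anomalies, score a simple detector for several window sizes, and keep the window size that works best on average. The detector, metric, and parameter grid are assumptions for this sketch; HYPEX's actual models and parameter rules are far more elaborate.

```scala
import scala.util.Random

// Learning a default parameter from synthetic, auto-labeled datasets (toy version).
object DefaultParamSketch extends App {
  val rng = new Random(42)

  // Synthetic series: Gaussian noise with one injected spike at a known index.
  def makeSeries(length: Int): (Array[Double], Int) = {
    val series = Array.fill(length)(rng.nextGaussian())
    val anomalyAt = rng.nextInt(length - 20) + 10
    series(anomalyAt) += 8.0
    (series, anomalyAt)
  }

  // Detector under test: index with the largest deviation from the mean of its
  // preceding window; `window` is the hyperparameter we want a default for.
  def detect(series: Array[Double], window: Int): Int =
    (window until series.length).maxBy { i =>
      val slice = series.slice(i - window, i)
      math.abs(series(i) - slice.sum / window)
    }

  // Accuracy of a window size over many synthetic, auto-labeled datasets.
  def score(window: Int, trials: Int = 50): Double = {
    val hits = (1 to trials).count { _ =>
      val (series, anomalyAt) = makeSeries(200)
      detect(series, window) == anomalyAt
    }
    hits.toDouble / trials
  }

  val grid = Seq(5, 10, 20, 40)
  val best = grid.maxBy(score(_))
  println(s"learned default window size: $best")
}
```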
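For "Optimized Theta-Join Processing", this last sketch shows the general pruning idea for inequality joins: for a predicate such as R.a < S.b, sorting S on b lets each R-tuple skip all S-tuples that cannot match instead of testing the full Cartesian product. It is a generic single-node illustration, not A2DB's distributed algorithm.

```scala
// Inequality join (R.a < S.b) with sort-based candidate pruning.
object InequalityJoinSketch extends App {

  // Binary search for the first position whose value is strictly greater than `key`.
  def upperBound(sorted: Vector[Int], key: Int): Int = {
    var lo = 0
    var hi = sorted.length
    while (lo < hi) {
      val mid = (lo + hi) / 2
      if (sorted(mid) <= key) lo = mid + 1 else hi = mid
    }
    lo
  }

  val r = Vector(3, 8, 15)           // R.a values
  val s = Vector(9, 1, 4, 20, 4, 12) // S.b values

  // Join condition: R.a < S.b. Only qualifying pairs are ever materialized.
  val sortedS = s.sorted
  val joined = r.flatMap { a =>
    val from = upperBound(sortedS, a)
    sortedS.drop(from).map(b => (a, b))
  }

  println(joined)
  // Vector((3,4), (3,4), (3,9), (3,12), (3,20), (8,9), (8,12), (8,20), (15,20))
}
```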