Listing by author "Schmidl, Sebastian"
1 - 4 of 4
- Text document: An Actor Database System for Akka (BTW 2019 – Workshopband, 2019)
  Schmidl, Sebastian; Schneider, Frederic; Papenbrock, Thorsten
  System architectures for data-centric applications commonly comprise two tiers: an application tier and a data tier. That these tiers typically do not share a common data format is referred to as the object-relational impedance mismatch. To mitigate this, we develop an actor database system that enables the implementation of application logic inside the data storage runtime. The actor model also allows for easy distribution of both data and computation across multiple nodes in a cluster. More specifically, we propose the concept of domain actors, which provide a type-safe, SQL-like interface for developing the actors of our database system, and the concept of Functors for building queries that retrieve data contained in multiple actor instances. Our experiments demonstrate the feasibility of encapsulating data into domain actors by evaluating their memory overhead and performance. We also discuss how our proposed actor database system framework addresses some of the challenges that arise in the design of distributed databases, such as data partitioning, failure handling, and concurrent query processing.
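  As a rough illustration of the domain-actor idea, the following minimal Scala sketch assumes Akka Typed; the `User` type and the `Insert`/`Select` protocol are invented for illustration and are not the paper's actual domain-actor or Functor interface. It shows an actor that encapsulates a private data partition and answers selection-style queries through a type-safe message protocol:

  ```scala
  import akka.actor.typed.{ActorRef, Behavior}
  import akka.actor.typed.scaladsl.Behaviors

  // Hypothetical domain actor: encapsulates a partition of "user" tuples and
  // answers SQL-like selection requests via typed messages (illustrative only).
  object UserActor {
    final case class User(id: Int, name: String, age: Int)

    sealed trait Command
    final case class Insert(user: User) extends Command
    final case class Select(predicate: User => Boolean,
                            replyTo: ActorRef[Seq[User]]) extends Command

    def apply(store: Vector[User] = Vector.empty): Behavior[Command] =
      Behaviors.receiveMessage {
        case Insert(user) =>
          apply(store :+ user)          // the state is the actor's private data partition
        case Select(predicate, replyTo) =>
          replyTo ! store.filter(predicate) // evaluate the selection locally, no shared state
          Behaviors.same
      }
  }
  ```

  Because each actor owns its partition exclusively, such actors can be placed on different cluster nodes, which is the distribution property the abstract refers to.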
- Conference paper: DPQL: The Data Profiling Query Language (BTW 2023, 2023)
  Seeger, Marcian; Schmidl, Sebastian; Vielhauer, Alexander; Papenbrock, Thorsten
  Data profiling describes the activity of extracting implicit metadata, such as schema descriptions, data types, and various kinds of data dependencies, from a given data set. The considerable number of research papers about novel metadata types and ever-faster data profiling algorithms emphasizes the importance of data profiling in practice. Unfortunately, though, the current state of data profiling research fails to address practical application needs: typical data profiling algorithms (i.e., challenging-to-operate structures) discover all (i.e., too many) minimal (i.e., the wrong) data dependencies within minutes to hours (i.e., too long). Consequently, if we look at the practical success of our research, we find that data profiling targets data cleaning, but most cleaning systems still use only hand-picked dependencies; data profiling targets query optimization, but hardly any query optimizer uses modern discovery algorithms for dependency extraction; data profiling targets data integration, but the application of automatically discovered dependencies for matching purposes is yet to be shown; and the list goes on. We aim to solve this profiling-and-application disconnect with a novel data profiling engine that integrates modern profiling techniques for various types of data dependencies and provides applications with a versatile, intuitive, and declarative Data Profiling Query Language (DPQL). The DPQL enables applications to specify precisely which dependencies are needed, which not only refines the results and makes the data profiling process more accessible but also enables much faster and (in terms of dependency types and selections) holistic profiling runs. We expect that integrating modern data profiling techniques and the post-processing of their results under a single application endpoint will result in a series of significant algorithmic advances, new pruning concepts, and a profiling engine with innovative components for workload auto-configuration, query optimization, and parallelization. With this paper, we present the first version of the DPQL syntax and introduce a fundamentally new line of research in data profiling.
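  To make the notion of "specifying precisely which dependencies are needed" concrete, here is an illustrative Scala sketch; the `FD` type and `query` helper are hypothetical stand-ins for the idea of a declarative profiling query, not DPQL itself, whose actual syntax is defined in the paper:

  ```scala
  // Illustrative only: represent discovered functional dependencies (lhs -> rhs)
  // and let an application select exactly the ones it needs, instead of
  // receiving every minimal dependency in the data set.
  final case class FD(lhs: Set[String], rhs: String)

  def query(discovered: Seq[FD])(wanted: FD => Boolean): Seq[FD] =
    discovered.filter(wanted)

  val discovered = Seq(
    FD(Set("zip"), "city"),
    FD(Set("first_name", "last_name"), "email")
  )

  // E.g., a data-cleaning application asks only for FDs with a single-column
  // left-hand side that determine "city":
  val needed = query(discovered)(fd => fd.lhs.size == 1 && fd.rhs == "city")
  ```

  The point of DPQL, as the abstract argues, is to push such selections into the engine itself so that pruning can happen during discovery rather than in post-processing.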
- Conference paper: HYPEX: Hyperparameter Optimization in Time Series Anomaly Detection (BTW 2023, 2023)
  Schmidl, Sebastian; Wenig, Phillip; Papenbrock, Thorsten
  In many domains, such as data cleaning, machine learning, pattern mining, or anomaly detection, a system's performance depends significantly on the selected hyperparameter configuration. However, manual configuration of hyperparameters is particularly difficult because it requires an in-depth understanding of the problem at hand and of the system's internal behavior. While automatic methods for hyperparameter optimization exist, they require labeled training datasets and many trials to assess a system's performance before the system can be applied to production data. Hence, automatic methods merely shift the human effort from parameter optimization to labeling datasets, which is still complex and time-consuming. In this paper, we therefore propose a novel hyperparameter optimization framework called HYPEX that learns promising default parameters and explainable parameter rules from synthetically generated datasets, without the need for manually labeled datasets. HYPEX's learned parameter model enables the easy adjustment of a system's configuration to new, unlabeled, and unseen datasets. We demonstrate the capabilities of HYPEX in the context of time series anomaly detection, because anomaly detection algorithms suffer from a general lack of labeled datasets and are particularly sensitive to parameter changes. In our evaluation, we show that our hyperparameter suggestions on unseen data significantly improve an algorithm's performance compared to existing manual hyperparameter optimization approaches and are often competitive with the optimal performance achieved with Bayesian optimization.
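  The core idea of learning a promising default from synthetic, self-labeled data can be sketched as follows; the toy detector, the `synthetic` generator, and all parameter values are illustrative assumptions, not the paper's method or models:

  ```scala
  import scala.util.Random

  // Toy detector: the index whose value deviates most from the mean of the
  // preceding `window` points (illustrative; not one of the paper's algorithms).
  def detect(xs: Array[Double], window: Int): Int =
    xs.indices.maxBy { i =>
      val lo = math.max(0, i - window)
      val mean = xs.slice(lo, i + 1).sum / (i + 1 - lo)
      math.abs(xs(i) - mean)
    }

  // Synthetic, labeled series: Gaussian noise with one injected point anomaly,
  // so the ground-truth label comes for free, without manual annotation.
  def synthetic(n: Int, anomalyAt: Int, seed: Int): (Array[Double], Int) = {
    val rnd = new Random(seed)
    val xs = Array.fill(n)(rnd.nextGaussian())
    xs(anomalyAt) += 8.0
    (xs, anomalyAt)
  }

  val trials = Seq.tabulate(10)(k => synthetic(500, 40 + 40 * k, seed = k))
  val candidates = Seq(5, 10, 20, 50)
  // Learned default = the window size that recovers the injected anomaly most often.
  val defaultWindow =
    candidates.maxBy(w => trials.count { case (xs, a) => detect(xs, w) == a })
  ```

  HYPEX additionally derives explainable parameter rules from such trials; the sketch only shows the simplest outcome, a single learned default.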
- Text document: Optimized Theta-Join Processing (BTW 2021, 2021)
  Weise, Julian; Schmidl, Sebastian; Papenbrock, Thorsten
  The theta-join is a powerful operation that connects tuples of different relational tables based on arbitrary conditions. The operation is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is an expensive operation, because basically all database management systems (DBMSs) translate theta-joins into a Cartesian product with a post-filter for non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, Bloom filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <, ≤, ≠, ≥, >. In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is baked into our distributed in-memory database system prototype A2DB. Our evaluation on various real-world and synthetic datasets shows that A2DB significantly outperforms existing single-machine DBMSs, including PostgreSQL, and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries.
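  One way to see why inequality predicates admit pruning beyond the Cartesian product is the following Scala sketch; it is an assumed illustration of sort-based pruning for a single `<` predicate, not A2DB's actual parallel, distributed algorithm:

  ```scala
  // Illustrative sketch: evaluate R JOIN S ON r.a < s.b without the Cartesian
  // product by sorting S once and, per r-tuple, emitting only the suffix of S
  // that can still match. Highly selective predicates skip most of S this way.
  def lessThanJoin(r: Seq[Int], s: Seq[Int]): Seq[(Int, Int)] = {
    val sorted = s.sorted.toIndexedSeq
    r.flatMap { a =>
      // Binary search for the first value of s strictly greater than a.
      var lo = 0
      var hi = sorted.length
      while (lo < hi) {
        val mid = (lo + hi) >>> 1
        if (sorted(mid) <= a) lo = mid + 1 else hi = mid
      }
      sorted.drop(lo).map(b => (a, b)) // every remaining b satisfies a < b
    }
  }

  // lessThanJoin(Seq(3, 7), Seq(1, 5, 9)) == Seq((3,5), (3,9), (7,9))
  ```

  The per-tuple loop is embarrassingly parallel, which hints at how such pruning can be combined with distribution over CPU cores and compute nodes.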