DPQL: The Data Profiling Query Language

Abstract: Data profiling describes the activity of extracting implicit metadata, such as schema descriptions, data types, and various kinds of data dependencies, from a given data set. The considerable amount of research papers about novel metadata types and ever-faster data profiling algorithms emphasize the importance of data profiling in practice. Unfortunately, though, the current state of data profiling research fails to address practical application needs: Typical data profiling algorithms (i. e., challenging to operate structures) discover all (i. e., too many) minimal (i. e., the wrong) data dependencies within minutes to hours (i. e., too long). Consequently, if we look at the practical success of our research, we find that data profiling targets data cleaning, but most cleaning systems still use only hand-picked dependencies; data profiling targets query optimization, but hardly any query optimizer uses modern discovery algorithms for dependency extraction; data profiling targets data integration, but the application of automatically discovered dependencies for matching purposes is yet to be shown -and the list goes on. We aim to solve the profiling-and-application-disconnect with a novel data profiling engine that integrates modern profiling techniques for various types of data dependencies and provides the applications with a versatile, intuitive, and declarative Data Profiling Query Language (DPQL). The DPQL enables applications to specify precisely what dependencies are needed, which not only refines the results and makes the data profiling process more accessible but also enables much faster and (in terms of dependency types and selections) holistic profiling runs. We expect that integrating modern data profiling techniques and the post-processing of their results under a single application endpoint will result in a series of significant algorithmic advances, new pruning concepts, and a profiling engine with innovative components for workload auto-configuration, query optimization, and parallelization. With this paper, we present the first version of the DPQL syntax and introduce a fundamentally new line of research in data profiling.

Seeger, Marcian; Schmidl, Sebastian; Vielhauer, Alexander; Papenbrock, Thorsten (2023): DPQL: The Data Profiling Query Language. BTW 2023. DOI: 10.18420/BTW2023-19. Bonn: Gesellschaft für Informatik e.V.. ISBN: 978-3-88579-725-8. pp. 391-415. Dresden, Germany. 06.-10. März 2023

Schlagwörter

data profiling , query language , functional dependencies , unique column combinations , inclusion dependencies

DOI

10.18420/BTW2023-19

Sammlungen

P331 - BTW2023- Datenbanksysteme für Business, Technologie und Web

Komplettanzeige

DPQL: The Data Profiling Query Language

Volltext URI

Dokumententyp

Dateien

Zusatzinformation

Datum

Autor:innen

Zeitschriftentitel

ISSN der Zeitschrift

Bandtitel

Quelle

Verlag

Zusammenfassung

Beschreibung

Schlagwörter

Zitierform

DOI

Tags

Sammlungen