Logo des Repositoriums
 

DPQL: The Data Profiling Query Language

dc.contributor.authorSeeger, Marcian
dc.contributor.authorSchmidl, Sebastian
dc.contributor.authorVielhauer, Alexander
dc.contributor.authorPapenbrock, Thorsten
dc.contributor.editorKönig-Ries, Birgitta
dc.contributor.editorScherzinger, Stefanie
dc.contributor.editorLehner, Wolfgang
dc.contributor.editorVossen, Gottfried
dc.date.accessioned2023-02-23T13:59:48Z
dc.date.available2023-02-23T13:59:48Z
dc.date.issued2023
dc.description.abstractAbstract: Data profiling describes the activity of extracting implicit metadata, such as schema descriptions, data types, and various kinds of data dependencies, from a given data set. The considerable amount of research papers about novel metadata types and ever-faster data profiling algorithms emphasize the importance of data profiling in practice. Unfortunately, though, the current state of data profiling research fails to address practical application needs: Typical data profiling algorithms (i. e., challenging to operate structures) discover all (i. e., too many) minimal (i. e., the wrong) data dependencies within minutes to hours (i. e., too long). Consequently, if we look at the practical success of our research, we find that data profiling targets data cleaning, but most cleaning systems still use only hand-picked dependencies; data profiling targets query optimization, but hardly any query optimizer uses modern discovery algorithms for dependency extraction; data profiling targets data integration, but the application of automatically discovered dependencies for matching purposes is yet to be shown -and the list goes on. We aim to solve the profiling-and-application-disconnect with a novel data profiling engine that integrates modern profiling techniques for various types of data dependencies and provides the applications with a versatile, intuitive, and declarative Data Profiling Query Language (DPQL). The DPQL enables applications to specify precisely what dependencies are needed, which not only refines the results and makes the data profiling process more accessible but also enables much faster and (in terms of dependency types and selections) holistic profiling runs. We expect that integrating modern data profiling techniques and the post-processing of their results under a single application endpoint will result in a series of significant algorithmic advances, new pruning concepts, and a profiling engine with innovative components for workload auto-configuration, query optimization, and parallelization. With this paper, we present the first version of the DPQL syntax and introduce a fundamentally new line of research in data profiling.en
dc.identifier.doi10.18420/BTW2023-19
dc.identifier.isbn978-3-88579-725-8
dc.identifier.urihttps://dl.gi.de/handle/20.500.12116/40323
dc.language.isoen
dc.publisherGesellschaft für Informatik e.V.
dc.relation.ispartofBTW 2023
dc.relation.ispartofseriesLecture Notes in Informatics (LNI) - Proceedings, Volume P-331
dc.subjectdata profiling
dc.subjectquery language
dc.subjectfunctional dependencies
dc.subjectunique column combinations
dc.subjectinclusion dependencies
dc.titleDPQL: The Data Profiling Query Languageen
dc.typeText/Conference Paper
gi.citation.endPage415
gi.citation.publisherPlaceBonn
gi.citation.startPage391
gi.conference.date06.-10. März 2023
gi.conference.locationDresden, Germany

Dateien

Originalbündel
1 - 1 von 1
Vorschaubild nicht verfügbar
Name:
B4-2.pdf
Größe:
355.63 KB
Format:
Adobe Portable Document Format