DPQL: The Data Profiling Query Language

Seeger, Marcian; Schmidl, Sebastian; Vielhauer, Alexander; Papenbrock, Thorsten

dc.contributor.author	Seeger, Marcian
dc.contributor.author	Schmidl, Sebastian
dc.contributor.author	Vielhauer, Alexander
dc.contributor.author	Papenbrock, Thorsten
dc.contributor.editor	König-Ries, Birgitta
dc.contributor.editor	Scherzinger, Stefanie
dc.contributor.editor	Lehner, Wolfgang
dc.contributor.editor	Vossen, Gottfried
dc.date.accessioned	2023-02-23T13:59:48Z
dc.date.available	2023-02-23T13:59:48Z
dc.date.issued	2023
dc.description.abstract	Data profiling describes the activity of extracting implicit metadata, such as schema descriptions, data types, and various kinds of data dependencies, from a given data set. The considerable number of research papers about novel metadata types and ever-faster data profiling algorithms emphasizes the importance of data profiling in practice. Unfortunately, though, the current state of data profiling research fails to address practical application needs: Typical data profiling algorithms (i.e., challenging to operate structures) discover all (i.e., too many) minimal (i.e., the wrong) data dependencies within minutes to hours (i.e., too long). Consequently, if we look at the practical success of our research, we find that data profiling targets data cleaning, but most cleaning systems still use only hand-picked dependencies; data profiling targets query optimization, but hardly any query optimizer uses modern discovery algorithms for dependency extraction; data profiling targets data integration, but the application of automatically discovered dependencies for matching purposes is yet to be shown, and the list goes on. We aim to solve the profiling-and-application-disconnect with a novel data profiling engine that integrates modern profiling techniques for various types of data dependencies and provides the applications with a versatile, intuitive, and declarative Data Profiling Query Language (DPQL). The DPQL enables applications to specify precisely what dependencies are needed, which not only refines the results and makes the data profiling process more accessible but also enables much faster and (in terms of dependency types and selections) holistic profiling runs.
We expect that integrating modern data profiling techniques and the post-processing of their results under a single application endpoint will result in a series of significant algorithmic advances, new pruning concepts, and a profiling engine with innovative components for workload auto-configuration, query optimization, and parallelization. With this paper, we present the first version of the DPQL syntax and introduce a fundamentally new line of research in data profiling.	en
dc.identifier.doi	10.18420/BTW2023-19
dc.identifier.isbn	978-3-88579-725-8
dc.identifier.uri	https://dl.gi.de/handle/20.500.12116/40323
dc.language.iso	en
dc.publisher	Gesellschaft für Informatik e.V.
dc.relation.ispartof	BTW 2023
dc.relation.ispartofseries	Lecture Notes in Informatics (LNI) - Proceedings, Volume P-331
dc.subject	data profiling
dc.subject	query language
dc.subject	functional dependencies
dc.subject	unique column combinations
dc.subject	inclusion dependencies
dc.title	DPQL: The Data Profiling Query Language	en
dc.type	Text/Conference Paper
gi.citation.endPage	415
gi.citation.publisherPlace	Bonn
gi.citation.startPage	391
gi.conference.date	March 6-10, 2023
gi.conference.location	Dresden, Germany

Files

Original bundle

Name: B4-2.pdf
Size: 355.63 KB
Format: Adobe Portable Document Format

Collections

P331 - BTW2023- Datenbanksysteme für Business, Technologie und Web