Browsing by keyword "unique column combinations"
- Conference paper: DPQL: The Data Profiling Query Language (BTW 2023, 2023). Seeger, Marcian; Schmidl, Sebastian; Vielhauer, Alexander; Papenbrock, Thorsten. Abstract: Data profiling describes the activity of extracting implicit metadata, such as schema descriptions, data types, and various kinds of data dependencies, from a given data set. The considerable number of research papers about novel metadata types and ever-faster data profiling algorithms emphasizes the importance of data profiling in practice. Unfortunately, though, the current state of data profiling research fails to address practical application needs: Typical data profiling algorithms (i.e., challenging to operate structures) discover all (i.e., too many) minimal (i.e., the wrong) data dependencies within minutes to hours (i.e., too long). Consequently, if we look at the practical success of our research, we find that data profiling targets data cleaning, but most cleaning systems still use only hand-picked dependencies; data profiling targets query optimization, but hardly any query optimizer uses modern discovery algorithms for dependency extraction; data profiling targets data integration, but the application of automatically discovered dependencies for matching purposes is yet to be shown, and the list goes on. We aim to solve the profiling-and-application disconnect with a novel data profiling engine that integrates modern profiling techniques for various types of data dependencies and provides the applications with a versatile, intuitive, and declarative Data Profiling Query Language (DPQL). The DPQL enables applications to specify precisely what dependencies are needed, which not only refines the results and makes the data profiling process more accessible but also enables much faster and (in terms of dependency types and selections) holistic profiling runs.
We expect that integrating modern data profiling techniques and the post-processing of their results under a single application endpoint will result in a series of significant algorithmic advances, new pruning concepts, and a profiling engine with innovative components for workload auto-configuration, query optimization, and parallelization. With this paper, we present the first version of the DPQL syntax and introduce a fundamentally new line of research in data profiling.
- Conference paper: A Hybrid Approach for Efficient Unique Column Combination Discovery (Datenbanksysteme für Business, Technologie und Web (BTW 2017), 2017). Papenbrock, Thorsten; Naumann, Felix. Abstract: Unique column combinations (UCCs) are groups of attributes in relational datasets that contain no value combination more than once. Hence, they indicate keys and serve data management tasks, such as schema normalization, data integration, and data cleansing. Because the unique column combinations of a particular dataset are usually unknown, UCC discovery algorithms have been proposed to find them. All previous such discovery algorithms are, however, inapplicable to datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present the hybrid discovery algorithm HyUCC, which uses the same discovery techniques as the recently proposed functional dependency discovery algorithm HyFD: a hybrid combination of fast approximation techniques and efficient validation techniques. With it, the algorithm discovers all minimal unique column combinations in a given dataset. HyUCC not only outperforms all existing approaches, it also scales to much larger datasets.
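
To make the notion of a minimal unique column combination concrete, the following is a minimal sketch of a naive, level-wise search. It is not the hybrid algorithm from the paper: it simply tests every candidate combination in order of increasing size, which is exponential in the number of attributes and only suitable for toy data. The function names `is_unique` and `minimal_uccs` and the sample rows are hypothetical, chosen for illustration, and null values are not treated specially.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no value combination for `columns` occurs twice in `rows`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, attributes):
    """Enumerate all minimal UCCs by brute force, smallest candidates first."""
    found = []
    for size in range(1, len(attributes) + 1):
        for candidate in combinations(attributes, size):
            # A superset of a known UCC is unique but not minimal, so skip it.
            if any(set(ucc) <= set(candidate) for ucc in found):
                continue
            if is_unique(rows, candidate):
                found.append(candidate)
    return found


# Hypothetical sample data: no single column is unique on its own.
rows = [
    {"first": "Ada", "last": "Lovelace", "city": "London"},
    {"first": "Ada", "last": "Byron", "city": "London"},
    {"first": "Alan", "last": "Turing", "city": "London"},
    {"first": "Alan", "last": "Lovelace", "city": "Leeds"},
]

print(minimal_uccs(rows, ["first", "last", "city"]))
# → [('first', 'last'), ('last', 'city')]
```

Because candidates are visited by increasing size, every non-minimal unique combination already has a minimal subset in `found` when it is reached, so the superset check is sufficient to keep the result minimal.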