A Hybrid Approach for Efficient Unique Column Combination Discovery

Papenbrock, Thorsten; Naumann, Felix

dc.contributor.author	Papenbrock, Thorsten
dc.contributor.author	Naumann, Felix
dc.contributor.editor	Mitschang, Bernhard
dc.contributor.editor	Nicklas, Daniela
dc.contributor.editor	Leymann, Frank
dc.contributor.editor	Schöning, Harald
dc.contributor.editor	Herschel, Melanie
dc.contributor.editor	Teubner, Jens
dc.contributor.editor	Härder, Theo
dc.contributor.editor	Kopp, Oliver
dc.contributor.editor	Wieland, Matthias
dc.date.accessioned	2017-06-20T20:24:28Z
dc.date.available	2017-06-20T20:24:28Z
dc.date.issued	2017
dc.description.abstract	Unique column combinations (UCCs) are groups of attributes in relational datasets that contain no value-entry more than once. Hence, they indicate keys and serve data management tasks, such as schema normalization, data integration, and data cleansing. Because the unique column combinations of a particular dataset are usually unknown, UCC discovery algorithms have been proposed to find them. All previous such discovery algorithms are, however, inapplicable to datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present the hybrid discovery algorithm HyUCC, which uses the same discovery techniques as the recently proposed functional dependency discovery algorithm HyFD: a hybrid combination of fast approximation techniques and efficient validation techniques. With it, the algorithm discovers all minimal unique column combinations in a given dataset. HyUCC not only outperforms all existing approaches, it also scales to much larger datasets.	en
dc.identifier.isbn	978-3-88579-659-6
dc.identifier.pissn	1617-5468
dc.language.iso	en
dc.publisher	Gesellschaft für Informatik, Bonn
dc.relation.ispartof	Datenbanksysteme für Business, Technologie und Web (BTW 2017)
dc.relation.ispartofseries	Lecture Notes in Informatics (LNI) - Proceedings, Volume P-265
dc.subject	unique column combinations
dc.subject	data profiling
dc.subject	metadata
dc.subject	hybrid
dc.title	A Hybrid Approach for Efficient Unique Column Combination Discovery	en
dc.type	Text/Conference Paper
gi.citation.endPage	204
gi.citation.startPage	195
gi.conference.date	March 6-10, 2017
gi.conference.location	Stuttgart
gi.conference.sessiontitle	Data Integration
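
The abstract's definition of a unique column combination can be made concrete with a small sketch. The code below is a naive, level-wise enumeration written purely for illustration; it is not the HyUCC algorithm, and the function names `is_unique` and `minimal_uccs` as well as the sample rows are assumptions introduced here. It checks every column combination by increasing size, which is exactly the exhaustive strategy that stops scaling on wide datasets.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no value combination over `columns` occurs twice."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, num_columns):
    """Naive level-wise search (illustrative only, not HyUCC).

    Candidates are tested by increasing size; any candidate that contains
    an already-found UCC is skipped, because it is unique but not minimal.
    """
    found = []
    for size in range(1, num_columns + 1):
        for candidate in combinations(range(num_columns), size):
            if any(set(ucc) <= set(candidate) for ucc in found):
                continue
            if is_unique(rows, candidate):
                found.append(candidate)
    return found


# Hypothetical sample relation: (first name, last name, age)
rows = [
    ("Alice", "Smith", 30),
    ("Alice", "Jones", 30),
    ("Bob", "Smith", 25),
]

print(minimal_uccs(rows, 3))  # prints [(0, 1), (1, 2)]
```

No single column is unique in this sample, but the pairs (first name, last name) and (last name, age) are, so both are reported as minimal UCCs. The number of candidates grows exponentially with the number of columns, which suggests why exhaustive approaches struggle on datasets with more than 50 attributes.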

Files

Original bundle

Name: paper13.pdf
Size: 751.21 KB
Format: Adobe Portable Document Format

Collections

P265 - BTW2017 - Datenbanksysteme für Business, Technologie und Web