
Duplicate Table Discovery with Xash

dc.contributor.author: Koch, Maximilian
dc.contributor.author: Esmailoghli, Mahdi
dc.contributor.author: Auer, Sören
dc.contributor.author: Abedjan, Ziawasch
dc.contributor.editor: König-Ries, Birgitta
dc.contributor.editor: Scherzinger, Stefanie
dc.contributor.editor: Lehner, Wolfgang
dc.contributor.editor: Vossen, Gottfried
dc.date.accessioned: 2023-02-23T13:59:48Z
dc.date.available: 2023-02-23T13:59:48Z
dc.date.issued: 2023
dc.description.abstract: Data lakes are typically lightly curated and, as such, prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive, as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, to the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which acts like a Bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to other hash functions, such as SimHash and other competitors, Xash results in fewer false positive candidates. [en]
dc.identifier.doi: 10.18420/BTW2023-18
dc.identifier.isbn: 978-3-88579-725-8
dc.identifier.uri: https://dl.gi.de/handle/20.500.12116/40322
dc.language.iso: en
dc.publisher: Gesellschaft für Informatik e.V.
dc.relation.ispartof: BTW 2023
dc.relation.ispartofseries: Lecture Notes in Informatics (LNI) - Proceedings, Volume P-331
dc.subject: data discovery
dc.subject: data lakes
dc.subject: duplicate table detection
dc.title: Duplicate Table Discovery with Xash [en]
dc.type: Text/Conference Paper
gi.citation.endPage: 390
gi.citation.publisherPlace: Bonn
gi.citation.startPage: 367
gi.conference.date: March 6-10, 2023
gi.conference.location: Dresden, Germany
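
Illustrative sketch (not part of the record): the abstract describes a two-stage idea, namely a compact per-table signature that filters candidates cheaply, followed by an exact comparison that tolerates reordered rows and columns. The Python below is a minimal sketch of that general pattern only. It does not implement Xash; the per-cell hash is a generic Bloom-filter-style stand-in built on BLAKE2b, and all names and constants (`cell_mask`, `table_signature`, `same_table`, `find_duplicates`, `SIGNATURE_BITS`, `BITS_PER_VALUE`) are hypothetical choices made for illustration.

```python
import hashlib
from collections import Counter
from itertools import permutations

SIGNATURE_BITS = 128  # assumed signature width, chosen for illustration
BITS_PER_VALUE = 3    # assumed number of bit positions drawn per cell value


def cell_mask(value: str) -> int:
    """Map a cell value to a bit mask with up to BITS_PER_VALUE bits set."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=16).digest()
    mask = 0
    for i in range(BITS_PER_VALUE):
        # Two digest bytes select one bit position in the signature.
        position = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIGNATURE_BITS
        mask |= 1 << position
    return mask


def table_signature(rows) -> int:
    """OR all cell masks together; the result ignores row and column order."""
    signature = 0
    for row in rows:
        for value in row:
            signature |= cell_mask(str(value))
    return signature


def same_table(a, b) -> bool:
    """Exact check ignoring row and column order.

    Brute force over column orders, so only suitable for narrow tables.
    Cells are compared as strings, so 1 and "1" count as equal.
    """
    if len(a) != len(b):
        return False
    if not a:
        return True
    width = len(a[0])
    if any(len(r) != width for r in a) or any(len(r) != width for r in b):
        return False
    target = Counter(tuple(str(v) for v in r) for r in a)
    for order in permutations(range(width)):
        if Counter(tuple(str(r[i]) for i in order) for r in b) == target:
            return True
    return False


def find_duplicates(tables):
    """tables: dict of name -> list of rows. Returns sorted duplicate name pairs."""
    by_signature = {}
    for name, rows in tables.items():
        by_signature.setdefault(table_signature(rows), []).append(name)
    pairs = []
    # Only tables sharing a signature are candidates; verify each pair exactly.
    for names in by_signature.values():
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                if same_table(tables[names[i]], tables[names[j]]):
                    pairs.append(tuple(sorted((names[i], names[j]))))
    return sorted(pairs)


example = {
    "t1": [[1, "a"], [2, "b"]],
    "t2": [["b", 2], ["a", 1]],  # t1 with rows and columns swapped
    "t3": [[1, "a"], [2, "c"]],  # differs in one cell
}
print(find_duplicates(example))  # prints [('t1', 't2')]
```

Tables with different signatures cannot contain the same set of cell values, so they are skipped without a full comparison. Equal signatures only mark candidates, which is why the exact check still runs; fewer false positive candidates means fewer of these expensive checks.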

Files

Original bundle
Name: B4-1.pdf
Size: 590.96 KB
Format: Adobe Portable Document Format