Fast Approximate Discovery of Inclusion Dependencies

Kruse, Sebastian; Papenbrock, Thorsten; Dullweber, Christian; Finke, Moritz; Hegner, Manuel; Zabel, Martin; Zöllner, Christian; Naumann, Felix; Mitschang, Bernhard; Nicklas, Daniela; Leymann, Frank; Schöning, Harald; Herschel, Melanie; Teubner, Jens; Härder, Theo; Kopp, Oliver; Wieland, Matthias

Dates: 2017-06-20; 2017-06-20; 2017
ISBN: 978-3-88579-659-6

Abstract: Inclusion dependencies (INDs) are relevant to several data management tasks, such as foreign key detection and data integration, and their discovery is a core concern of data profiling. However, n-ary IND discovery is computationally expensive, so existing algorithms often perform poorly on complex datasets. To address this, we present F, the first approximate IND discovery algorithm. F combines probabilistic and exact data structures to approximate the INDs in relational datasets. F is guaranteed to find all INDs; false positives can occur only with low probability, as a result of the approximation. This slight inaccuracy is traded for significantly increased performance. In our evaluation, we show that F scales to very large datasets and outperforms the state-of-the-art algorithm by a factor of up to six in terms of runtime without reporting any false positives. This shows that F strikes a good balance between efficiency and correctness.

Language: en
Keywords: inclusion dependencies; data profiling; dependency; discovery; metadata; approximation
Type: Text/Conference Paper
ISSN: 1617-5468
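
The one-sided error described in the abstract (every true IND is found, while false positives are possible) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it is a simplified stand-in that handles only unary INDs and replaces each column by a set of lossy hash buckets. The function names, the `num_buckets` parameter, and the sample columns are all hypothetical and chosen for illustration only.

```python
import hashlib
from itertools import permutations


def hash_value(value: str, num_buckets: int) -> int:
    """Map a value to one of num_buckets buckets (lossy but deterministic)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def approximate_unary_inds(columns: dict[str, list[str]], num_buckets: int = 1 << 16):
    """Return candidate INDs (dependent, referenced) based on hashed value sets.

    A true IND always survives, because equal values land in the same bucket.
    A non-IND can survive only through hash collisions (a false positive).
    """
    hashed = {
        name: {hash_value(v, num_buckets) for v in values}
        for name, values in columns.items()
    }
    return {
        (dep, ref)
        for dep, ref in permutations(hashed, 2)
        if hashed[dep] <= hashed[ref]  # subset test on bucket sets
    }


def exact_unary_inds(columns: dict[str, list[str]]):
    """Reference result: INDs checked on the actual value sets."""
    exact = {name: set(values) for name, values in columns.items()}
    return {
        (dep, ref)
        for dep, ref in permutations(exact, 2)
        if exact[dep] <= exact[ref]
    }


# Hypothetical sample data: three columns from two tables.
columns = {
    "orders.customer_id": ["1", "2", "2"],
    "customers.id": ["1", "2", "3"],
    "customers.name": ["ann", "bob", "cy"],
}

exact = exact_unary_inds(columns)
approx = approximate_unary_inds(columns)

# No false negatives: every exact IND is also an approximate candidate.
print(exact <= approx)

# With a single bucket every value collides, so all six ordered column
# pairs are reported: the extreme case of false positives.
coarse = approximate_unary_inds(columns, num_buckets=1)
print(len(coarse))
```

Shrinking `num_buckets` makes collisions, and therefore false positives, more likely, but it can never remove a true IND from the result. That is the same direction of error the abstract claims for F.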