Browsing by Author "Kruse, Sebastian"
1 - 2 of 2
- Conference paper: Fast Approximate Discovery of Inclusion Dependencies (Datenbanksysteme für Business, Technologie und Web (BTW 2017), 2017). Kruse, Sebastian; Papenbrock, Thorsten; Dullweber, Christian; Finke, Moritz; Hegner, Manuel; Zabel, Martin; Zöllner, Christian; Naumann, Felix. Abstract: Inclusion dependencies (INDs) are relevant to several data management tasks, such as foreign key detection and data integration, and their discovery is a core concern of data profiling. However, n-ary IND discovery is computationally expensive, so existing algorithms often perform poorly on complex datasets. To this end, we present FAIDA, the first approximate IND discovery algorithm. FAIDA combines probabilistic and exact data structures to approximate the INDs in relational datasets. FAIDA is guaranteed to find all INDs; false positives can occur only with low probability, as a result of the approximation. This slight inaccuracy is traded for significantly increased performance. In our evaluation, we show that FAIDA scales to very large datasets and outperforms the state-of-the-art algorithm by a factor of up to six in terms of runtime without reporting any false positives. This shows that FAIDA strikes a good balance between efficiency and correctness.
- Conference paper: Scaling out the discovery of inclusion dependencies (Datenbanksysteme für Business, Technologie und Web (BTW 2015), 2015). Kruse, Sebastian; Papenbrock, Thorsten; Naumann, Felix. Abstract: Inclusion dependencies are among the most important database dependencies. In addition to their most prominent application, foreign key discovery, inclusion dependencies are an important input to data integration, query optimization, and schema redesign. With their discovery being a recurring data profiling task, previous research has proposed different algorithms to discover all inclusion dependencies within a given dataset. However, none of the proposed algorithms is designed to scale out, i.e., none can be distributed across multiple nodes in a computer cluster to increase performance. On large datasets with many inclusion dependencies, these algorithms can therefore take days to complete, even on high-performance computers. We introduce SINDY, an algorithm that efficiently discovers all unary inclusion dependencies of a given relational dataset in a distributed fashion and that is not limited by main-memory capacity. We give a practical implementation of SINDY that builds upon the map-reduce-style framework Stratosphere and conduct several experiments showing that SINDY can process huge datasets several times faster than its competitors while scaling with the number of cluster nodes.