Auflistung nach Autor:in "Kolb, Lars"
1 - 4 von 4
Treffer pro Seite
Sortieroptionen
- ZeitschriftenartikelIterative Computation of Connected Graph Components with MapReduce(Datenbank-Spektrum: Vol. 14, No. 2, 2014) Kolb, Lars; Sehili, Ziad; Rahm, ErhardThe use of the MapReduce framework for iterative graph algorithms is challenging. To achieve high performance it is critical to limit the amount of intermediate results as well as the number of necessary iterations. We address these issues for the important problem of finding connected components in large graphs. We analyze an existing MapReduce algorithm, CC-MR, and present techniques to improve its performance including a memory-based connection of subgraphs in the map phase. Our evaluation with several large graph datasets shows that the improvements can substantially reduce the amount of generated data by up to a factor of 8.8 and runtime by up to factor of 3.5.
- ZeitschriftenartikelParallel Entity Resolution with Dedoop(Datenbank-Spektrum: Vol. 13, No. 1, 2013) Kolb, Lars; Rahm, ErhardWe provide an overview of Dedoop (Deduplication with Hadoop), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.
- KonferenzbeitragParallel sorted neighborhood blocking with MapeReduce(Datenbanksysteme für Business, Technologie und Web (BTW), 2011) Kolb, Lars; Thor, Andreas; Rahm, ErhardCloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce jobs or apply a tailored data replication.
- KonferenzbeitragPrivacy preserving record linkage with ppjoin(Datenbanksysteme für Business, Technologie und Web (BTW 2015), 2015) Sehili, Ziad; Kolb, Lars; Borgs, Christian; Schnell, Rainer; Rahm, ErhardPrivacy-preserving record linkage (PPRL) becomes increasingly important to match and integrate records with sensitive data. PPRL not only has to preserve the anonymity of the persons or entities involved but should also be highly efficient and scalable to large datasets. We therefore investigate how to adapt PPJoin, one of the fastest approaches for regular record linkage, to PPRL resulting in a new approach called P4Join. The use of bit vectors for PPRL also allows us to devise a parallel execution of P4Join on GPUs. We evaluate the new approaches and compare their efficiency with a PPRL approach based on multibit trees.