
Parallel Entity Resolution with Dedoop

dc.contributor.author: Kolb, Lars
dc.contributor.author: Rahm, Erhard
dc.date.accessioned: 2018-01-10T13:18:53Z
dc.date.available: 2018-01-10T13:18:53Z
dc.date.issued: 2013
dc.description.abstract: We provide an overview of Dedoop (Deduplication with Hadoop), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.
dc.identifier.pissn: 1610-1995
dc.identifier.uri: https://dl.gi.de/handle/20.500.12116/11667
dc.publisher: Springer
dc.relation.ispartof: Datenbank-Spektrum: Vol. 13, No. 1
dc.relation.ispartofseries: Datenbank-Spektrum
dc.subject: Blocking
dc.subject: Data skew
dc.subject: Entity resolution
dc.subject: Hadoop
dc.subject: Load balancing
dc.subject: MapReduce
dc.title: Parallel Entity Resolution with Dedoop
dc.type: Text/Journal Article
gi.citation.endPage: 32
gi.citation.startPage: 23