An Efficient Blocking Technique for Reference Matching using MapReduce

Paradies, Marcus

Journal article

Document type

Text/Journal Article

Date

2011

Authors

Paradies, Marcus

Source

Datenbank-Spektrum: Vol. 11, No. 1

Publisher

Springer

Abstract

Document clustering has become an increasingly important task in data mining and information retrieval. With growing data volumes, CPU- and memory-efficient clustering algorithms are receiving considerable attention in the research community. To deal with huge amounts of data (e.g., documents from Wikipedia or CiteSeerX, which are several GB in size), distributed clustering techniques have been designed to provide scalable and flexible approaches. We study the problem of document clustering in the area of entity matching, where documents from various data sources are matched together. More specifically, we focus on a common optimization technique called blocking, which reduces the enormous search space by clustering the data sources into smaller groups and performing comparisons only within each group. In this article, we describe our experiences and findings in applying the MapReduce framework to huge bibliographic data sets in order to provide a flexible, scalable and easy-to-use blocking technique that reduces the search space for entity matching.
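The blocking idea described in the abstract can be illustrated with a minimal, single-process sketch in Python. It is not the article's implementation: the blocking key (the first three alphanumeric characters of the lowercased title), the record layout, and all function names are assumptions made for illustration, and the map, shuffle, and reduce steps are simulated in memory rather than run on a MapReduce cluster.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first three alphanumeric characters of the title."""
    chars = [ch for ch in record["title"].lower() if ch.isalnum()]
    return "".join(chars[:3])


def map_phase(records):
    """Map step: emit one (blocking key, record) pair per input record."""
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    """Group mapped pairs by key, as a MapReduce runtime does between phases."""
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_phase(blocks):
    """Reduce step: emit candidate pairs only within each block."""
    for key in sorted(blocks):
        for left, right in combinations(blocks[key], 2):
            yield left["id"], right["id"]


def candidate_pairs(records):
    """Run the simulated map, shuffle, and reduce steps end to end."""
    return list(reduce_phase(shuffle(map_phase(records))))


# Made-up sample records, used only to exercise the sketch.
records = [
    {"id": 1, "title": "MapReduce: Simplified Data Processing"},
    {"id": 2, "title": "Mapreduce simplified data processing on large clusters"},
    {"id": 3, "title": "Entity Matching Survey"},
    {"id": 4, "title": "Entity matching: a survey"},
]
pairs = candidate_pairs(records)
```

Comparing all four sample records with each other would require six comparisons; with this key, only the pairs that share a block are compared. A real blocking key would have to be chosen so that true matches rarely fall into different blocks.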

Paradies, Marcus (2011): An Efficient Blocking Technique for Reference Matching using MapReduce. Datenbank-Spektrum: Vol. 11, No. 1. Springer. PISSN: 1610-1995. pp. 47-49

Keywords

Cloud computing, Entity matching, Hierarchical clustering

Collections

Datenbank Spektrum 11(1) - March 2011
