Extended Affinity Propagation Clustering for Multi-source Entity Resolution

dc.contributor.author: Lerm, Stefan
dc.contributor.author: Saeedi, Alieh
dc.contributor.author: Rahm, Erhard
dc.contributor.editor: Kai-Uwe Sattler
dc.contributor.editor: Melanie Herschel
dc.contributor.editor: Wolfgang Lehner
dc.date.accessioned: 2021-03-16T07:57:09Z
dc.date.available: 2021-03-16T07:57:09Z
dc.date.issued: 2021
dc.description.abstract: Entity resolution is the data integration task of identifying matching entities (e.g. products, customers) in one or several data sources. Previous approaches for matching and clustering entities between multiple (>2) sources either treated the different sources as a single source or assumed that the individual sources are duplicate-free, so that only matches between sources have to be found. In this work we propose and evaluate a general Multi-Source Clean Dirty (MSCD) scheme with an arbitrary combination of clean (duplicate-free) and dirty sources. For this purpose, we extend a constraint-based clustering algorithm called Affinity Propagation (AP) for entity clustering with clean and dirty sources (MSCD-AP). We also consider a hierarchical version of it for improved scalability. Our evaluation considers a full range of datasets containing 0% to 100% of clean sources. We compare our proposed algorithms with other clustering schemes in terms of both match quality and runtime. [en]
dc.identifier.doi: 10.18420/btw2021-11
dc.identifier.isbn: 978-3-88579-705-0
dc.identifier.pissn: 1617-5468
dc.identifier.uri: https://dl.gi.de/handle/20.500.12116/35794
dc.language.iso: en
dc.publisher: Gesellschaft für Informatik, Bonn
dc.relation.ispartof: BTW 2021
dc.relation.ispartofseries: Lecture Notes in Informatics (LNI) - Proceedings, Volume P-311
dc.subject: Entity Resolution
dc.subject: Clustering
dc.subject: Affinity Propagation
dc.subject: MSCD-AP
dc.title: Extended Affinity Propagation Clustering for Multi-source Entity Resolution [en]
gi.citation.endPage: 236
gi.citation.startPage: 217
gi.conference.date: 13.-17. September 2021
gi.conference.location: Dresden
gi.conference.sessiontitle: Data Integration, Semantic Data Management, Streaming

Files

Original bundle
Name: A3-1.pdf
Size: 2.87 MB
Format: Adobe Portable Document Format