P311 - BTW2021- Datenbanksysteme für Business, Technologie und Web
Listing of P311 - BTW2021- Datenbanksysteme für Business, Technologie und Web by keyword "Clustering"
Results 1 - 3 of 3
- Text document: Cluster Flow - an Advanced Concept for Ensemble-Enabling, Interactive Clustering (BTW 2021, 2021)
  Obermeier, Sandra; Beer, Anna; Wahl, Florian; Seidl, Thomas
  Even though most clustering algorithms serve knowledge discovery in fields other than computer science, most of them still require users to be familiar with programming or data mining to some extent. As that often prevents efficient research, we developed an easy-to-use, highly explainable clustering method accompanied by an interactive tool for clustering. It is based on intuitively understandable kNN graphs and the subsequent application of adaptable filters, which can be combined iteratively in an ensemble-like fashion and which prune unnecessary or misleading edges. For a first overview of the data, fully automatic predefined filter cascades deliver robust results. A selection of simple filters and combination methods that can be chosen interactively yields very good results on benchmark datasets compared to various algorithms.
- Text document: Extended Affinity Propagation Clustering for Multi-source Entity Resolution (BTW 2021, 2021)
  Lerm, Stefan; Saeedi, Alieh; Rahm, Erhard
  Entity resolution is the data integration task of identifying matching entities (e.g., products, customers) in one or several data sources. Previous approaches for matching and clustering entities between multiple (>2) sources either treated the different sources as a single source or assumed that the individual sources are duplicate-free, so that only matches between sources have to be found. In this work, we propose and evaluate a general Multi-Source Clean Dirty (MSCD) scheme with an arbitrary combination of clean (duplicate-free) and dirty sources. For this purpose, we extend a constraint-based clustering algorithm called Affinity Propagation (AP) for entity clustering with clean and dirty sources (MSCD-AP). We also consider a hierarchical version of it for improved scalability. Our evaluation considers a full range of datasets containing 0% to 100% of clean sources. We compare our proposed algorithms with other clustering schemes in terms of both match quality and runtime.
- Text document: Multi-Party Privacy Preserving Record Linkage in Dynamic Metric Space (BTW 2021, 2021)
  Sehili, Ziad; Rohde, Florens; Franke, Martin; Rahm, Erhard
  We propose and evaluate several approaches for multi-party privacy-preserving record linkage (MP-PPRL) for multiple data sources. To reduce the number of comparisons and thereby improve scalability, we propose a new pivot-based metric space approach that dynamically adapts the selection of pivots for additional sources and growing data volume. We investigate so-called early and late clustering schemes that either cluster matching records per additional source or holistically for all sources. A comprehensive evaluation for different datasets confirms the high effectiveness and efficiency of the proposed methods.
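
The pivot-based metric space idea mentioned in the last entry rests on the triangle inequality: for any pivot p, |d(q, p) - d(o, p)| <= d(q, o), so a record whose pivot distances differ from the query's by more than the match threshold cannot be a match and need not be compared directly. The following is a minimal sketch of that general filtering principle only, not the authors' method. It assumes, purely for illustration, that records are encoded as bit vectors compared by Hamming distance; the function names, pivots, records, and threshold are hypothetical, and the dynamic adaptation of pivots described in the abstract is not shown.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors stored as ints (a metric)."""
    return bin(a ^ b).count("1")


def pivot_filtered_matches(records, pivots, query, threshold):
    """Return (matches, comparisons) for records within `threshold` of `query`.

    A record is skipped when the pivot-based lower bound on its distance to
    the query already exceeds `threshold` (triangle inequality).
    """
    # Index: distance from every record to every pivot, computed once.
    index = [[hamming(record, pivot) for pivot in pivots] for record in records]
    query_to_pivots = [hamming(query, pivot) for pivot in pivots]

    matches = []
    comparisons = 0  # number of direct record-to-query distance computations
    for record, record_to_pivots in zip(records, index):
        # Largest |d(q, p) - d(o, p)| over all pivots is a lower bound on d(q, o).
        lower_bound = max(
            abs(qp - rp) for qp, rp in zip(query_to_pivots, record_to_pivots)
        )
        if lower_bound > threshold:
            continue  # pruned without a direct comparison
        comparisons += 1
        if hamming(query, record) <= threshold:
            matches.append(record)
    return matches, comparisons


# Hypothetical toy data: four 8-bit records and two pivots.
records = [0b11110000, 0b11110001, 0b00001111, 0b10101010]
pivots = [0b11000000, 0b00000011]
query = 0b11110000

matches, comparisons = pivot_filtered_matches(records, pivots, query, threshold=1)
print([format(m, "08b") for m in matches], comparisons)
```

With this toy data, two of the four records are pruned by the pivot bound, so only two direct comparisons are made. The filter never discards a true match, because the bound it uses never overestimates the actual distance.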