Explainable Data Matching: Selecting Representative Pairs with Active Learning Pair-Selection Strategies

Laskowski, Lukas; Sold, Florian

Conference paper

Document type

Text/Conference Paper

Files

C4-7.pdf (795.43 KB)

Date

2023

Authors

Laskowski, Lukas

Sold, Florian

Source

BTW 2023

Publisher

Gesellschaft für Informatik e.V.

Abstract

In both research and enterprise, dirty data poses numerous challenges. Many data cleaning pipelines include a data deduplication step that detects and removes entries within a given dataset that refer to the same real-world entity. While developing such deduplication techniques, data scientists have to make sense of the large result sets that their matching solutions generate, both to quickly identify changes in behavior and to discover opportunities for improvement. We propose an approach that selects a small subset of pairs from the result set of a data matching solution such that the subset is representative of the matching solution’s overall behavior. To evaluate our approach, we show that a matching solution trained on pairs selected according to our strategy outperforms one trained on a randomly selected subset of pairs.
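
The abstract does not state which pair-selection strategies the paper uses. As an illustration only, the following minimal Python sketch shows one common active learning strategy, uncertainty sampling, next to the random baseline that the evaluation compares against. It assumes that each candidate pair carries a similarity score in [0, 1] and that the matcher decides with a fixed threshold. The names `ScoredPair`, `select_uncertain_pairs`, `select_random_pairs`, and the sample data are hypothetical and not taken from the paper.

```python
import random
from typing import List, Tuple

# A scored pair: (left record id, right record id, similarity score in [0, 1]).
# This representation is an assumption made for illustration.
ScoredPair = Tuple[str, str, float]


def select_uncertain_pairs(
    pairs: List[ScoredPair], threshold: float, k: int
) -> List[ScoredPair]:
    """Return the k pairs whose score lies closest to the decision threshold."""
    # sorted() is stable, so pairs with equal distance keep their input order.
    return sorted(pairs, key=lambda p: abs(p[2] - threshold))[:k]


def select_random_pairs(
    pairs: List[ScoredPair], k: int, seed: int = 0
) -> List[ScoredPair]:
    """Baseline: draw k pairs uniformly at random, without replacement."""
    return random.Random(seed).sample(pairs, k)


# Hypothetical result set of a matching solution (illustrative values only).
result_set: List[ScoredPair] = [
    ("a1", "b1", 0.98),
    ("a2", "b2", 0.52),
    ("a3", "b3", 0.05),
    ("a4", "b4", 0.47),
    ("a5", "b5", 0.80),
]

uncertain = select_uncertain_pairs(result_set, threshold=0.5, k=2)
baseline = select_random_pairs(result_set, k=2)
```

With a threshold of 0.5, the uncertainty-based selection returns the pairs scored 0.52 and 0.47, that is, the decisions the matcher is least sure about. Whether such pairs are also representative of a matcher's overall behavior is the question the paper addresses; this sketch makes no claim about its results.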

Laskowski, Lukas; Sold, Florian (2023): Explainable Data Matching: Selecting Representative Pairs with Active Learning Pair-Selection Strategies. BTW 2023. DOI: 10.18420/BTW2023-77. Bonn: Gesellschaft für Informatik e.V. ISBN: 978-3-88579-725-8. pp. 1099-1104. Dresden, Germany. March 6-10, 2023

Keywords

Entity Resolution, Data Matching, ExplainableDM, Pair Selection, Benchmark

DOI

10.18420/BTW2023-77

Collections

P331 - BTW2023 - Datenbanksysteme für Business, Technologie und Web
