Building Scalable Machine Learning Solutions for Data Cleaning

Ilyas, Ihab

dc.contributor.author	Ilyas, Ihab
dc.contributor.editor	Grust, Torsten
dc.contributor.editor	Naumann, Felix
dc.contributor.editor	Böhm, Alexander
dc.contributor.editor	Lehner, Wolfgang
dc.contributor.editor	Härder, Theo
dc.contributor.editor	Rahm, Erhard
dc.contributor.editor	Heuer, Andreas
dc.contributor.editor	Klettke, Meike
dc.contributor.editor	Meyer, Holger
dc.date.accessioned	2019-04-11T07:21:21Z
dc.date.available	2019-04-11T07:21:21Z
dc.date.issued	2019
dc.description.abstract	Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. In this talk I discuss why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions. The talk focuses on two main problems: (1) entity consolidation, which is arguably the most difficult data curation challenge because it is notoriously complex and hard to scale; and (2) using probabilistic inference to suggest data repairs for identified errors and anomalies using our new system called HoloClean. Both problems have been challenging researchers and practitioners for decades due to the fundamentally combinatorial explosion in the space of solutions and the lack of ground truth. There’s a large body of work on these problems in both academia and industry. Techniques have included human curation, rules-based systems, and automatic discovery of clusters using predefined thresholds on record similarity. Unfortunately, none of these techniques alone has been able to provide sufficient accuracy and scalability. The talk aims to provide deeper insight into the entity consolidation and data repair problems and discusses how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.	en
dc.identifier.doi	10.18420/btw2019-02
dc.identifier.isbn	978-3-88579-683-1
dc.identifier.pissn	1617-5468
dc.identifier.uri	https://dl.gi.de/handle/20.500.12116/21704
dc.language.iso	en
dc.publisher	Gesellschaft für Informatik, Bonn
dc.relation.ispartof	BTW 2019
dc.relation.ispartofseries	Lecture Notes in Informatics (LNI) – Proceedings, Volume P-289
dc.title	Building Scalable Machine Learning Solutions for Data Cleaning	en
gi.citation.endPage	28
gi.citation.startPage	27
gi.conference.date	March 4-8, 2019
gi.conference.location	Rostock
gi.conference.sessiontitle	Invited Talks

Files

Original bundle

Name: A2-1.pdf
Size: 95.56 KB
Format: Adobe Portable Document Format

Collections

P289 - BTW2019 - Datenbanksysteme für Business, Technologie und Web