Towards Learned Metadata Extraction for Data Lakes

Langenecker, Sven; Sturm, Christoph; Schalles, Christian; Binnig, Carsten

dc.contributor.author	Langenecker, Sven
dc.contributor.author	Sturm, Christoph
dc.contributor.author	Schalles, Christian
dc.contributor.author	Binnig, Carsten
dc.contributor.editor	Kai-Uwe Sattler
dc.contributor.editor	Melanie Herschel
dc.contributor.editor	Wolfgang Lehner
dc.date.accessioned	2021-03-16T07:57:10Z
dc.date.available	2021-03-16T07:57:10Z
dc.date.issued	2021
dc.description.abstract	An important task for enabling the efficient exploration of available data in a data lake is to annotate the available data sources with semantic type information. To reduce the manual overhead of annotation, learned approaches for automatic metadata extraction on structured data sources have recently been proposed. While initial results of these learned approaches seem promising, it is still unclear how well they generalize to new, unseen data in real-world data lakes. In this paper, we aim to tackle this question and, as a first contribution, present the results of a study that applies Sato, a recent approach based on deep learning, to a real-world data set. In our study, we show that Sato is able to extract semantic data types for only about 10% of the columns of the real-world data set. These results show a general limitation of deep learning approaches, which often provide near-perfect performance on available training and testing data but fail in real settings because training data and real data often differ strongly. Hence, as a second contribution, we propose a new direction of using weak supervision and present results of an initial prototype we built to generate labeled training data with low manual effort, in order to improve the performance of learned semantic type extraction approaches on new, unseen data sets.	en
dc.identifier.doi	10.18420/btw2021-17
dc.identifier.isbn	978-3-88579-705-0
dc.identifier.pissn	1617-5468
dc.identifier.uri	https://dl.gi.de/handle/20.500.12116/35800
dc.language.iso	en
dc.publisher	Gesellschaft für Informatik, Bonn
dc.relation.ispartof	BTW 2021
dc.relation.ispartofseries	Lecture Notes in Informatics (LNI) - Proceedings, Volume P-311
dc.subject	data lakes
dc.subject	dataset discovery and search
dc.subject	semantic type detection
dc.title	Towards Learned Metadata Extraction for Data Lakes	en
gi.citation.endPage	336
gi.citation.startPage	325
gi.conference.date	13.-17. September 2021
gi.conference.location	Dresden
gi.conference.sessiontitle	Data Integration, Semantic Data Management, Streaming

Files

Name: A3-23.pdf
Size: 949.06 KB
Format: Adobe Portable Document Format

Collections

P311 - BTW2021- Datenbanksysteme für Business, Technologie und Web