SportsTables: A new Corpus for Semantic Type Detection

dc.contributor.author: Langenecker, Sven
dc.contributor.author: Sturm, Christoph
dc.contributor.author: Schalles, Christian
dc.contributor.author: Binnig, Carsten
dc.contributor.editor: König-Ries, Birgitta
dc.contributor.editor: Scherzinger, Stefanie
dc.contributor.editor: Lehner, Wolfgang
dc.contributor.editor: Vossen, Gottfried
dc.date.accessioned: 2023-02-23T14:00:16Z
dc.date.available: 2023-02-23T14:00:16Z
dc.date.issued: 2023
dc.description.abstract [en]: Table corpora such as VizNet or TURL, which contain annotated semantic types per column, are important for building machine learning models for the task of automatic semantic type detection. However, there is a huge discrepancy between the corpora used for training and testing and real-world data lakes, since the latter contain a large fraction of numerical data that is not present in existing corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our corpus SportsTables has on average approximately 86% numerical columns, posing new challenges to existing semantic type detection models, which have mainly targeted non-numerical columns so far. To demonstrate this effect, we show the results of a first study using a state-of-the-art approach for semantic type detection on our new corpus and demonstrate significant performance differences in predicting semantic types for textual and numerical data.
dc.identifier.doi: 10.18420/BTW2023-68
dc.identifier.isbn: 978-3-88579-725-8
dc.identifier.uri: https://dl.gi.de/handle/20.500.12116/40377
dc.language.iso: en
dc.publisher: Gesellschaft für Informatik e.V.
dc.relation.ispartof: BTW 2023
dc.relation.ispartofseries: Lecture Notes in Informatics (LNI) - Proceedings, Volume P-331
dc.subject: Semantic Type Detection
dc.subject: Column Annotated Corpora
dc.title [en]: SportsTables: A new Corpus for Semantic Type Detection
dc.type: Text/Conference Paper
gi.citation.endPage: 1008
gi.citation.publisherPlace: Bonn
gi.citation.startPage: 995
gi.conference.date: 6-10 March 2023
gi.conference.location: Dresden, Germany

Files

Original bundle
Name: C3-08.pdf
Size: 630.32 KB
Format: Adobe Portable Document Format