ExtracTable: Extracting Tables from Raw Data Files

dc.contributor.author: Hübscher, Leonardo
dc.contributor.author: Jiang, Lan
dc.contributor.author: Naumann, Felix
dc.contributor.editor: König-Ries, Birgitta
dc.contributor.editor: Scherzinger, Stefanie
dc.contributor.editor: Lehner, Wolfgang
dc.contributor.editor: Vossen, Gottfried
dc.date.accessioned: 2023-02-23T13:59:49Z
dc.date.available: 2023-02-23T13:59:49Z
dc.date.issued: 2023
dc.description.abstract: Raw data, especially in text files, comes in many shapes and forms, often tailored toward human readability. Such files include preambles and footnotes, are formatted visually, and in general do not follow CSV guidelines. The ability to easily ingest such files into data systems opens up many opportunities for data analysis and processing. With ExtracTable, we present a system that can automatically ingest a large variety of raw data files, including text files and poorly structured CSV files, by detecting row patterns and thus separating their values into coherent columns. We manually annotated 957 files of a wide variety containing 1208 tables. We show experimentally that ExtracTable can correctly parse 90% of all lines in structured files and 76% of all lines in files with a visual layout only, significantly outperforming the state of the art.
dc.identifier.doi: 10.18420/BTW2023-20
dc.identifier.isbn: 978-3-88579-725-8
dc.identifier.uri: https://dl.gi.de/handle/20.500.12116/40325
dc.language.iso: en
dc.publisher: Gesellschaft für Informatik e.V.
dc.relation.ispartof: BTW 2023
dc.relation.ispartofseries: Lecture Notes in Informatics (LNI) - Proceedings, Volume P-331
dc.title: ExtracTable: Extracting Tables from Raw Data Files
dc.type: Text/Conference Paper
gi.citation.endPage: 438
gi.citation.publisherPlace: Bonn
gi.citation.startPage: 417
gi.conference.date: 6-10 March 2023
gi.conference.location: Dresden, Germany

Files

Original bundle
Name: B4-3.pdf
Size: 1.09 MB
Format: Adobe Portable Document Format