Conference Paper

ExtracTable: Extracting Tables from Raw Data Files

Full-Text URI

Document Type

Text/Conference Paper

Additional Information

Date

2023

Journal Title

Journal ISSN

Volume Title

Source

Publisher

Gesellschaft für Informatik e.V.

Abstract

Raw data, especially in text files, comes in many shapes and forms, often tailored toward human readability. Such files include preambles and footnotes, are formatted visually, and in general do not follow CSV guidelines. The ability to easily ingest them into data systems opens up many opportunities for data analysis and processing. With ExtracTable, we present a system that can automatically ingest a large variety of raw data files, including text files and poorly structured CSV files, by detecting row patterns and thus separating their values into coherent columns. We manually annotated 957 files of a wide variety, containing 1208 tables. We show experimentally that ExtracTable can correctly parse 90% of all lines in structured files and 76% of all lines in files with a visual layout only, significantly outperforming the state of the art.
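
To make the idea of separating visually formatted lines into columns concrete, here is a minimal Python sketch. It is not the ExtracTable algorithm, which the abstract describes only as detecting row patterns. It is a much simpler illustrative heuristic that assumes space-aligned columns and splits at character positions that are blank in every line. The function name `split_visual_columns` and the sample data are hypothetical.

```python
def split_visual_columns(lines):
    """Split space-aligned lines into cells.

    Simplifying assumption (not from the paper): a column boundary is any
    character position that holds a space in every non-empty line.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if not rows:
        return []

    # Pad all rows to the same width so positions can be compared.
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]

    # Mark positions that are blank in every row.
    is_gap = [all(row[i] == " " for row in padded) for i in range(width)]

    # Collect maximal runs of non-gap positions as column spans.
    spans = []
    start = None
    for i, gap in enumerate(is_gap):
        if not gap and start is None:
            start = i
        elif gap and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, width))

    return [[row[a:b].strip() for a, b in spans] for row in padded]


# Hypothetical sample of a visually formatted table.
sample = [
    "Name      Qty  Price",
    "Apple       3   1.20",
    "Kiwi       12   0.45",
]
table = split_visual_columns(sample)
```

A heuristic this simple breaks as soon as a preamble, footnote, or second table shares the file, since a single long line removes every common gap. Those are the cases the abstract names as motivation for a more robust approach.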

Description

Hübscher, Leonardo; Jiang, Lan; Naumann, Felix (2023): ExtracTable: Extracting Tables from Raw Data Files. BTW 2023. DOI: 10.18420/BTW2023-20. Bonn: Gesellschaft für Informatik e.V. ISBN: 978-3-88579-725-8. pp. 417-438. Dresden, Germany. March 6-10, 2023

Keywords

Citation

Tags