ExtracTable: Extracting Tables from Raw Data Files
dc.contributor.author | Hübscher, Leonardo | |
dc.contributor.author | Jiang, Lan | |
dc.contributor.author | Naumann, Felix | |
dc.contributor.editor | König-Ries, Birgitta | |
dc.contributor.editor | Scherzinger, Stefanie | |
dc.contributor.editor | Lehner, Wolfgang | |
dc.contributor.editor | Vossen, Gottfried | |
dc.date.accessioned | 2023-02-23T13:59:49Z | |
dc.date.available | 2023-02-23T13:59:49Z | |
dc.date.issued | 2023 | |
dc.description.abstract | Raw data, especially in text files, comes in many shapes and forms, often tailored toward human readability. These files include preambles and footnotes, are formatted visually, and in general do not follow CSV guidelines. The ability to easily ingest such files into data systems opens up many opportunities for data analysis and processing. With ExtracTable, we present a system that can automatically ingest a large variety of raw data files, including text files and poorly structured CSV files, by detecting row patterns and thus separating their values into coherent columns. We manually annotated 957 files of a wide variety, containing 1208 tables. We show experimentally that ExtracTable can correctly parse 90% of all lines in structured files and 76% of all lines in files with a visual layout only, significantly outperforming the state of the art. | en |
dc.identifier.doi | 10.18420/BTW2023-20 | |
dc.identifier.isbn | 978-3-88579-725-8 | |
dc.identifier.uri | https://dl.gi.de/handle/20.500.12116/40325 | |
dc.language.iso | en | |
dc.publisher | Gesellschaft für Informatik e.V. | |
dc.relation.ispartof | BTW 2023 | |
dc.relation.ispartofseries | Lecture Notes in Informatics (LNI) - Proceedings, Volume P-331 | |
dc.title | ExtracTable: Extracting Tables from Raw Data Files | en |
dc.type | Text/Conference Paper | |
gi.citation.endPage | 438 | |
gi.citation.publisherPlace | Bonn | |
gi.citation.startPage | 417 | |
gi.conference.date | March 6-10, 2023 | |
gi.conference.location | Dresden, Germany | |
Files
Original bundle
- Name: B4-3.pdf
- Size: 1.09 MB
- Format: Adobe Portable Document Format