Seamless Integration of Parquet Files into Data Processing

dc.contributor.author: Rey, Alice
dc.contributor.author: Freitag, Michael
dc.contributor.author: Neumann, Thomas
dc.contributor.editor: König-Ries, Birgitta
dc.contributor.editor: Scherzinger, Stefanie
dc.contributor.editor: Lehner, Wolfgang
dc.contributor.editor: Vossen, Gottfried
dc.date.accessioned: 2023-02-23T13:59:46Z
dc.date.available: 2023-02-23T13:59:46Z
dc.date.issued: 2023
dc.description.abstract [en]: Relational database systems are still the most powerful tool for data analysis. However, the steps necessary to bring existing data into the database make them unattractive for data exploration, especially when the data is stored in data lakes where users often use Parquet files, a binary column-oriented file format. This paper presents a fast Parquet framework that tackles these problems without costly ETL steps. We incrementally collect information during query execution. We create statistics that enhance future queries. In addition, we split the file into chunks for which we store the data ranges. We call these synopses. They allow us to skip entire sections in future queries. We show that these techniques only add minor overhead to the first query and are of benefit for future requests. Our evaluation demonstrates that our implementation can achieve comparable results to database relations and that we can outperform existing systems by up to an order of magnitude.
dc.identifier.doi: 10.18420/BTW2023-12
dc.identifier.isbn: 978-3-88579-725-8
dc.identifier.uri: https://dl.gi.de/handle/20.500.12116/40316
dc.language.iso: en
dc.publisher: Gesellschaft für Informatik e.V.
dc.relation.ispartof: BTW 2023
dc.relation.ispartofseries: Lecture Notes in Informatics (LNI) - Proceedings, Volume P-331
dc.title [en]: Seamless Integration of Parquet Files into Data Processing
dc.type: Text/Conference Paper
gi.citation.endPage: 258
gi.citation.publisherPlace: Bonn
gi.citation.startPage: 235
gi.conference.date: March 6-10, 2023
gi.conference.location: Dresden, Germany

Files

Original bundle
Name: B3-1.pdf
Size: 639.66 KB
Format: Adobe Portable Document Format