Conference Paper

Seamless Integration of Parquet Files into Data Processing

Document Type

Text/Conference Paper

Date

2023

Publisher

Gesellschaft für Informatik e.V.

Abstract

Relational database systems are still the most powerful tool for data analysis. However, the steps necessary to bring existing data into the database make them unattractive for data exploration, especially when the data is stored in data lakes where users often use Parquet files, a binary column-oriented file format. This paper presents a fast Parquet framework that tackles these problems without costly ETL steps. We incrementally collect information during query execution. We create statistics that enhance future queries. In addition, we split the file into chunks for which we store the data ranges. We call these synopses. They allow us to skip entire sections in future queries. We show that these techniques only add minor overhead to the first query and are of benefit for future requests. Our evaluation demonstrates that our implementation can achieve comparable results to database relations and that we can outperform existing systems by up to an order of magnitude.
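
The synopsis idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class names `Synopsis` and `ChunkedColumn`, the fixed chunk size, and the in-memory integer list standing in for a Parquet column are all assumptions made for illustration. The sketch records each chunk's minimum and maximum the first time a scan touches it, then uses those ranges to skip chunks in later range queries.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Synopsis:
    """Value range observed for one chunk (hypothetical structure)."""
    lo: int
    hi: int


class ChunkedColumn:
    """In-memory stand-in for one file column, split into fixed-size chunks."""

    def __init__(self, values: List[int], chunk_size: int) -> None:
        self.chunks = [
            values[i:i + chunk_size] for i in range(0, len(values), chunk_size)
        ]
        # Synopses are collected lazily, as a side effect of query execution.
        self.synopses: Dict[int, Synopsis] = {}
        # Counts chunks actually read; kept only to make the effect visible.
        self.chunks_read = 0

    def scan_range(self, lo: int, hi: int) -> List[int]:
        """Return all values v with lo <= v <= hi, skipping chunks when possible."""
        result: List[int] = []
        for idx, chunk in enumerate(self.chunks):
            syn = self.synopses.get(idx)
            if syn is not None and (syn.hi < lo or syn.lo > hi):
                # The stored range proves no value in this chunk can match.
                continue
            self.chunks_read += 1
            if syn is None:
                # First visit: remember the chunk's data range for future queries.
                self.synopses[idx] = Synopsis(min(chunk), max(chunk))
            result.extend(v for v in chunk if lo <= v <= hi)
        return result


col = ChunkedColumn(list(range(100)), chunk_size=10)
first = col.scan_range(42, 47)   # no synopses yet: every chunk is read
reads_after_first = col.chunks_read
second = col.scan_range(42, 47)  # synopses exist: non-matching chunks are skipped
reads_after_second = col.chunks_read
```

In this toy setup the first query reads all ten chunks and the repeated query reads only one more, which mirrors the abstract's claim that the extra work is paid once and benefits later requests. How much real data can be skipped depends on how well values cluster within chunks.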

Description

Rey, Alice; Freitag, Michael; Neumann, Thomas (2023): Seamless Integration of Parquet Files into Data Processing. BTW 2023. DOI: 10.18420/BTW2023-12. Bonn: Gesellschaft für Informatik e.V. ISBN: 978-3-88579-725-8. pp. 235-258. Dresden, Germany. March 6-10, 2023
