Seamless Integration of Parquet Files into Data Processing

Relational database systems are still the most powerful tool for data analysis. However, the steps necessary to bring existing data into the database make them unattractive for data exploration, especially when the data is stored in data lakes where users often use Parquet files, a binary column-oriented file format. This paper presents a fast Parquet framework that tackles these problems without costly ETL steps. We incrementally collect information during query execution. We create statistics that enhance future queries. In addition, we split the file into chunks for which we store the data ranges. We call these synopses. They allow us to skip entire sections in future queries. We show that these techniques only add minor overhead to the first query and are of benefit for future requests. Our evaluation demonstrates that our implementation can achieve comparable results to database relations and that we can outperform existing systems by up to an order of magnitude.
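The chunk-skipping idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes an in-memory integer column instead of a real Parquet file, and the names `Synopsis`, `ChunkedColumn`, `range_query`, and `chunks_scanned` are hypothetical. The first query scans every chunk and records each chunk's minimum and maximum as a by-product; later queries skip chunks whose recorded range cannot contain a match.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Synopsis:
    """Value range observed in one chunk (hypothetical structure)."""
    lo: int
    hi: int


class ChunkedColumn:
    """A column split into fixed-size chunks with lazily built synopses."""

    def __init__(self, values: Sequence[int], chunk_size: int) -> None:
        self.chunks: List[List[int]] = [
            list(values[i:i + chunk_size])
            for i in range(0, len(values), chunk_size)
        ]
        # No synopsis exists until a query has scanned the chunk.
        self.synopses: List[Optional[Synopsis]] = [None] * len(self.chunks)
        self.chunks_scanned = 0  # counter kept for illustration only

    def range_query(self, lo: int, hi: int) -> List[int]:
        """Return all values v with lo <= v <= hi."""
        result: List[int] = []
        for idx, chunk in enumerate(self.chunks):
            syn = self.synopses[idx]
            if syn is not None and (syn.hi < lo or syn.lo > hi):
                continue  # the stored range proves no value can match
            self.chunks_scanned += 1
            if syn is None:
                # Collect the data range as a by-product of this scan.
                self.synopses[idx] = Synopsis(min(chunk), max(chunk))
            result.extend(v for v in chunk if lo <= v <= hi)
        return result


if __name__ == "__main__":
    col = ChunkedColumn(list(range(100)), chunk_size=10)
    col.range_query(42, 47)      # first query: scans every chunk
    first = col.chunks_scanned
    col.range_query(42, 47)      # repeat query: synopses allow skipping
    print(first, col.chunks_scanned - first)
```

In this sketch the extra work on the first query is one `min` and one `max` per scanned chunk, which mirrors the abstract's claim of minor first-query overhead only in spirit; the actual cost in a real system depends on details not given here.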

Rey, Alice; Freitag, Michael; Neumann, Thomas (2023): Seamless Integration of Parquet Files into Data Processing. BTW 2023. DOI: 10.18420/BTW2023-12. Bonn: Gesellschaft für Informatik e.V. ISBN: 978-3-88579-725-8. pp. 235-258. Dresden, Germany. March 6-10, 2023

DOI

10.18420/BTW2023-12

Collections

P331 - BTW2023- Datenbanksysteme für Business, Technologie und Web
