Seamless Integration of Parquet Files into Data Processing

Authors: Rey, Alice; Freitag, Michael; Neumann, Thomas
Editors: König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Date: 2023-02-23
Year: 2023
ISBN: 978-3-88579-725-8
URL: https://dl.gi.de/handle/20.500.12116/40316
DOI: 10.18420/BTW2023-12
Type: Text/Conference Paper
Language: en

Abstract

Relational database systems are still the most powerful tool for data analysis. However, the steps necessary to bring existing data into the database make them unattractive for data exploration, especially when the data is stored in data lakes where users often use Parquet files, a binary column-oriented file format. This paper presents a fast Parquet framework that tackles these problems without costly ETL steps. We incrementally collect information during query execution. We create statistics that enhance future queries. In addition, we split the file into chunks for which we store the data ranges. We call these synopses. They allow us to skip entire sections in future queries. We show that these techniques only add minor overhead to the first query and are of benefit for future requests. Our evaluation demonstrates that our implementation can achieve comparable results to database relations and that we can outperform existing systems by up to an order of magnitude.
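
The synopsis idea in the abstract (per-chunk data ranges collected during a first query and used to skip chunks later) can be illustrated with a minimal sketch. This is not the paper's implementation: the class `ChunkedColumn`, the helper `split_into_chunks`, the fixed chunk size, and the use of in-memory integer lists in place of a real Parquet reader are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def split_into_chunks(values: List[int], chunk_size: int) -> List[List[int]]:
    """Split a column into fixed-size chunks (hypothetical chunking policy)."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


@dataclass
class ChunkedColumn:
    """Hypothetical stand-in for one column of a file, stored as chunks."""
    chunks: List[List[int]]
    # chunk index -> (min, max); filled in lazily while queries scan chunks
    synopses: Dict[int, Tuple[int, int]] = field(default_factory=dict)
    # counts chunks actually read, so the effect of skipping is observable
    chunks_read: int = 0

    def scan_range(self, lo: int, hi: int) -> List[int]:
        """Return all values v with lo <= v <= hi, skipping chunks when possible."""
        result: List[int] = []
        for i, chunk in enumerate(self.chunks):
            synopsis = self.synopses.get(i)
            if synopsis is not None:
                chunk_min, chunk_max = synopsis
                if chunk_max < lo or chunk_min > hi:
                    continue  # known range cannot overlap the predicate: skip
            self.chunks_read += 1
            if synopsis is None and chunk:
                # first visit: record the data range as a side effect of the scan
                self.synopses[i] = (min(chunk), max(chunk))
            result.extend(v for v in chunk if lo <= v <= hi)
        return result


if __name__ == "__main__":
    column = ChunkedColumn(split_into_chunks(list(range(100)), 10))
    column.scan_range(25, 34)   # first query: no synopses yet, every chunk is read
    print(column.chunks_read)
    column.scan_range(25, 34)   # repeated query: non-overlapping chunks are skipped
    print(column.chunks_read)
```

In this sketch the first query pays only for computing a minimum and maximum per chunk it reads anyway, while later range queries touch only the chunks whose stored range overlaps the predicate. A real system would additionally have to decide chunk boundaries, handle non-integer types and nulls, and persist the synopses, none of which is modeled here.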