Applying Machine Learning Models to Scalable DataFrames with Grizzly

Authors: Kläbe, Steffen; Hagedorn, Stefan
Editors: Kai-Uwe Sattler; Melanie Herschel; Wolfgang Lehner
Date: 2021-03-16
Year: 2021
ISBN: 978-3-88579-705-0
ISSN: 1617-5468
DOI: 10.18420/btw2021-10
URL: https://dl.gi.de/handle/20.500.12116/35793
Language: en

Abstract

The popular Python Pandas framework provides an easy-to-use DataFrame API that enables a broad range of users to analyze their data. However, Pandas faces severe scalability issues in terms of runtime and memory consumption, limiting the usability of the framework. In this paper we present Grizzly, a replacement for Python Pandas. Instead of bringing data to the operators like Pandas, Grizzly ships program complexity to database systems by transpiling the DataFrame API to SQL code. Additionally, Grizzly offers user-friendly support for combining different data sources, user-defined functions, and applying Machine Learning models directly inside the database system. Our evaluation shows that Grizzly significantly outperforms Pandas as well as state-of-the-art frameworks for distributed Python processing in several use cases.
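
The abstract's central idea, transpiling DataFrame operations to SQL so that the database does the work, can be illustrated with a small sketch. This is not Grizzly's API: the `LazyFrame` class, its `filter`, `select`, `to_sql`, and `collect` methods, and the `events` table are all hypothetical names invented for illustration, and SQLite stands in for the database system. The sketch only shows the general pattern of recording operations lazily and emitting one SQL query when results are requested.

```python
import sqlite3


class LazyFrame:
    """Hypothetical lazy DataFrame: records operations, emits SQL on demand."""

    def __init__(self, table, columns="*", predicates=()):
        self.table = table
        self.columns = columns
        self.predicates = tuple(predicates)

    def filter(self, predicate):
        # No data is touched here; the predicate is only recorded.
        return LazyFrame(self.table, self.columns, self.predicates + (predicate,))

    def select(self, *cols):
        # Projection is recorded as a column list for the SELECT clause.
        return LazyFrame(self.table, ", ".join(cols), self.predicates)

    def to_sql(self):
        # Transpile the recorded operations into a single SQL statement.
        sql = f"SELECT {self.columns} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql

    def collect(self, conn):
        # Only now is the query shipped to the database for execution.
        return conn.execute(self.to_sql()).fetchall()


# Illustrative data in an in-memory SQLite database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (actor TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [("a", 1), ("b", 5), ("c", 9)]
)

query = LazyFrame("events").filter("score > 3").select("actor")
sql_text = query.to_sql()
rows = query.collect(conn)
print(sql_text)
print(sorted(rows))
```

In this pattern the Python process never materializes the full table; only the filtered and projected result crosses the connection, which is the intuition behind shipping program complexity to the database rather than data to the operators.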