Optimized Theta-Join Processing

The theta-join is a powerful operation that connects tuples of different relational tables based on arbitrary conditions. It is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is expensive, because virtually all database management systems (DBMSs) translate theta-joins into a Cartesian product followed by a post-filter that removes non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, Bloom filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <, ≤, ≠, ≥, >. In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is integrated into our distributed in-memory database system prototype A2DB. Our evaluation on various real-world and synthetic datasets shows that A2DB significantly outperforms existing single-machine DBMSs, including PostgreSQL, and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries.
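To make the baseline strategy described in the abstract concrete, the following minimal Python sketch evaluates a theta-join as a Cartesian product followed by a post-filter. It is an illustration only and not the algorithm implemented in A2DB; the helper `naive_theta_join`, the `OPS` table, and the sample `employees` relation are hypothetical names and data introduced for this example.

```python
from itertools import product
from operator import lt, le, ne, ge, gt

# Hypothetical lookup table (not part of A2DB): inequality operators by symbol.
OPS = {"<": lt, "<=": le, "!=": ne, ">=": ge, ">": gt}


def naive_theta_join(left, right, predicates):
    """Baseline theta-join: Cartesian product followed by a post-filter.

    `predicates` is a list of (left_attr, op_symbol, right_attr) triples;
    a tuple pair is kept only if every predicate holds.
    """
    return [
        (l, r)
        for l, r in product(left, right)  # enumerates all |left| * |right| pairs
        if all(OPS[op](l[a], r[b]) for a, op, b in predicates)
    ]


# Made-up sample relation for a data-cleaning style check.
employees = [
    {"name": "Ada", "salary": 90, "tax": 20},
    {"name": "Bob", "salary": 70, "tax": 25},
    {"name": "Cy", "salary": 50, "tax": 10},
]

# Self-join: pairs where the first employee earns more but pays less tax.
violations = naive_theta_join(
    employees,
    employees,
    [("salary", ">", "salary"), ("tax", "<", "tax")],
)
```

In this sketch, all nine candidate pairs are compared even though only one pair satisfies both predicates. That quadratic cost for a highly selective query is the situation the abstract describes as the motivation for pruning join candidates early.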

Weise, Julian; Schmidl, Sebastian; Papenbrock, Thorsten (2021): Optimized Theta-Join Processing. BTW 2021. DOI: 10.18420/btw2021-03. Gesellschaft für Informatik, Bonn. PISSN: 1617-5468. ISBN: 978-3-88579-705-0. pp. 59-78. Database Technology. Dresden. 13.-17. September 2021

Keywords

theta-join, query optimization, distributed computing, actor programming

DOI

10.18420/btw2021-03

Collections

P311 - BTW2021 - Datenbanksysteme für Business, Technologie und Web
