Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems

Rohrmann, Till; Schelter, Sebastian; Rabl, Tilmann; Markl, Volker

dc.contributor.author	Rohrmann, Till
dc.contributor.author	Schelter, Sebastian
dc.contributor.author	Rabl, Tilmann
dc.contributor.author	Markl, Volker
dc.contributor.editor	Mitschang, Bernhard
dc.contributor.editor	Nicklas, Daniela
dc.contributor.editor	Leymann, Frank
dc.contributor.editor	Schöning, Harald
dc.contributor.editor	Herschel, Melanie
dc.contributor.editor	Teubner, Jens
dc.contributor.editor	Härder, Theo
dc.contributor.editor	Kopp, Oliver
dc.contributor.editor	Wieland, Matthias
dc.date.accessioned	2017-06-20T20:24:29Z
dc.date.available	2017-06-20T20:24:29Z
dc.date.issued	2017
dc.description.abstract	In recent years, the generated and collected data is increasing at an almost exponential rate. At the same time, the data’s value has been identified in terms of insights that can be provided. However, retrieving the value requires powerful analysis tools, since valuable insights are buried deep in large amounts of noise. Unfortunately, analytic capacities did not scale well with the growing data. Many existing tools run only on a single computer and are limited in terms of data size by its memory. A very promising solution to deal with large-scale data is scaling systems and exploiting parallelism. In this paper, we propose Gilbert, a distributed sparse linear algebra system, to decrease the imminent lack of analytic capacities. Gilbert offers a MATLAB®-like programming language for linear algebra programs, which are automatically executed in parallel. Transparent parallelization is achieved by compiling the linear algebra operations first into an intermediate representation. This language-independent form enables high-level algebraic optimizations. Different optimization strategies are evaluated and the best one is chosen by a cost-based optimizer. The optimized result is then transformed into a suitable format for parallel execution. Gilbert generates execution plans for Apache Spark® and Apache Flink®, two massively parallel dataflow systems. Distributed matrices are represented by square blocks to guarantee a well-balanced trade-off between data parallelism and data granularity. An exhaustive evaluation indicates that Gilbert is able to process varying amounts of data exceeding the memory of a single computer on clusters of different sizes. Two well-known machine learning (ML) algorithms, namely PageRank and Gaussian non-negative matrix factorization (GNMF), are implemented with Gilbert. The performance of these algorithms is compared to optimized implementations based on Spark and Flink. Even though Gilbert is not as fast as the optimized algorithms, it simplifies the development process significantly due to its high-level programming abstraction.	en
dc.identifier.isbn	978-3-88579-659-6
dc.identifier.pissn	1617-5468
dc.language.iso	en
dc.publisher	Gesellschaft für Informatik, Bonn
dc.relation.ispartof	Datenbanksysteme für Business, Technologie und Web (BTW 2017)
dc.relation.ispartofseries	Lecture Notes in Informatics (LNI) - Proceedings, Volume P-265
dc.subject	Dataflow Optimization
dc.subject	Linear Algebra
dc.subject	Distributed Dataflow Systems
dc.title	Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems	en
dc.type	Text/Conference Paper
gi.citation.endPage	288
gi.citation.startPage	269
gi.conference.date	6-10 March 2017
gi.conference.location	Stuttgart
gi.conference.sessiontitle	Streaming and Dataflows

Files

Name: paper18.pdf
Size: 1.44 MB
Format: Adobe Portable Document Format

Collections

P265 - BTW2017 - Datenbanksysteme für Business, Technologie und Web