Logo des Repositoriums

Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems

dc.contributor.authorRohrmann, Till
dc.contributor.authorSchelter, Sebastian
dc.contributor.authorRabl, Tilmann
dc.contributor.authorMarkl, Volker
dc.contributor.editorMitschang, Bernhard
dc.contributor.editorNicklas, Daniela
dc.contributor.editorLeymann, Frank
dc.contributor.editorSchöning, Harald
dc.contributor.editorHerschel, Melanie
dc.contributor.editorTeubner, Jens
dc.contributor.editorHärder, Theo
dc.contributor.editorKopp, Oliver
dc.contributor.editorWieland, Matthias
dc.description.abstractIn recent years, the generated and collected data is increasing at an almost exponential rate. At the same time, the data’s value has been identified in terms of insights that can be provided. However, retrieving the value requires powerful analysis tools, since valuable insights are buried deep in large amounts of noise. Unfortunately, analytic capacities did not scale well with the growing data. Many existing tools run only on a single computer and are limited in terms of data size by its memory. A very promising solution to deal with large-scale data is scaling systems and exploiting parallelism. In this paper, we propose Gilbert, a distributed sparse linear algebra system, to decrease the imminent lack of analytic capacities. Gilbert offers a MATLAB®-like programming language for linear algebra programs, which are automatically executed in parallel. Transparent parallelization is achieved by compiling the linear algebra operations first into an intermediate representation. This language- independent form enables high-level algebraic optimizations. Di erent optimization strategies are evaluated and the best one is chosen by a cost-based optimizer. The optimized result is then transformed into a suitable format for parallel execution. Gilbert generates execution plans for Apache Spark® and Apache Flink®, two massively parallel dataflow systems. Distributed matrices are represented by square blocks to guarantee a well-balanced trade-o between data parallelism and data granularity. An exhaustive evaluation indicates that Gilbert is able to process varying amounts of data exceeding the memory of a single computer on clusters of different sizes. Two well known machine learning (ML) algorithms, namely PageRank and Gaussian non-negative matrix factorization (GNMF), are implemented with Gilbert. The performance of these algorithms is compared to optimized implementations based on Spark and Flink. Even though Gilbert is not as fast as the optimized algorithms, it simplifies the development process significantly due to its high-level programming abstraction.en
dc.publisherGesellschaft für Informatik, Bonn
dc.relation.ispartofDatenbanksysteme für Business, Technologie und Web (BTW 2017)
dc.relation.ispartofseriesLecture Notes in Informatics (LNI) - Proceedings, Volume P-265
dc.subjectDataflow Optimization
dc.subjectLinear Algebra
dc.subjectDistributed Dataflow Systems
dc.titleGilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systemsen
dc.typeText/Conference Paper
gi.conference.date6.-10. März 2017
gi.conference.sessiontitleStreaming and Dataflows


1 - 1 von 1
1.44 MB
Adobe Portable Document Format