- On-the-fly Reconfiguration of Query Plans for Stateful Stream Processing Engines (BTW 2019, 2019). Bartnik, Adrian; Del Monte, Bonaventura; Rabl, Tilmann; Markl, Volker. Stream Processing Engines (SPEs) must tolerate the dynamic nature of unbounded data streams and provide means to quickly adapt to fluctuations in the data rate. However, many major SPEs provide very little functionality to adjust the execution of a potentially infinite streaming query at runtime. Each modification requires a complete query restart, which involves an expensive redistribution of query state and may require external systems to guarantee correct processing semantics. This results in significant downtime, which increases the operational cost of those SPEs. We present a modification protocol that enables modifying specific operators as well as the data flow of a running query while ensuring exactly-once processing semantics. We provide an implementation for Apache Flink that enables stateful operator migration across machines, the introduction of new operators into a running query, and changes to a specific operator based on external triggers. Our results on two benchmarks show that migrating operators for queries with small state is as fast as using Flink's savepoint mechanism. Migrating operators in the presence of large state even outperforms the savepoint mechanism by a factor of more than 2.3. Introducing and replacing operators at runtime is performed in less than 10 s. Our modification protocol demonstrates the general feasibility of runtime modifications and opens the door for many other modification use cases, such as online algorithm tweaking and up- or downscaling operator instances.
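The migration step in such a protocol can be illustrated with a minimal pause-snapshot-restore sketch. All names here are hypothetical; the paper's Flink implementation additionally has to handle in-flight tuples and rerouting to preserve exactly-once semantics:

```python
import pickle

class CountOperator:
    """A toy stateful operator: counts tuples per key."""
    def __init__(self, state=None):
        self.state = state if state is not None else {}
        self.paused = False

    def process(self, key):
        if self.paused:
            raise RuntimeError("operator is paused; buffer upstream tuples")
        self.state[key] = self.state.get(key, 0) + 1

def migrate(old_op, new_node_factory):
    """Pause -> snapshot state -> restore on the target node.
    A real protocol must also drain or replay in-flight tuples and
    reroute the data flow before resuming; that part is omitted here."""
    old_op.paused = True                            # 1. stop processing at the source
    blob = pickle.dumps(old_op.state)               # 2. serialize operator state
    new_op = new_node_factory(pickle.loads(blob))   # 3. restore on the target machine
    return new_op                                   # 4. reroute data flow, then resume
```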
- Eliminating the Bandwidth Bottleneck of Central Query Dispatching Through TCP Connection Hand-Over (BTW 2019, 2019). Klauck, Stefan; Plauth, Max; Knebel, Sven; Strobl, Marius; Santry, Douglas; Eggert, Lars. In scale-out database architectures, client queries must be routed to individual backend database servers for processing. In dynamic database systems, where backend servers join and leave a cluster or data partitions move between servers, clients do not know which server to send queries to. Using a central dispatcher, all queries and responses are routed via a single node. In a system with many high-performance backends, such a central node can become the system bottleneck. This paper compares three different approaches for query dispatching in terms of scaling network throughput and processing flexibility. Off-the-shelf TCP/HTTP load-balancers cannot dispatch individual queries arriving over a single connection to different backend servers, unless they are extended to understand the database wire protocol. For small response sizes up to 4 KB, a purpose-built query dispatcher delivers the highest throughput. For larger responses (i.e., BLOBs or data sets for external analysis), a novel approach for network proxying that transparently maps TCP connections between backend servers performs best. We propose hybrid query dispatching that performs a TCP connection hand-over on demand when returning large database results.
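The bandwidth argument can be made concrete with a toy model. All names are illustrative and only the 4 KB crossover comes from the abstract; the point is that a purely central dispatcher relays every response byte, while a hand-over policy lets large responses bypass it:

```python
def route(queries, backends, handover_threshold=4096):
    """Toy model of hybrid dispatching: small responses are relayed through
    the central dispatcher; for large responses the TCP connection is
    'handed over' so the backend answers the client directly. Returns the
    number of response bytes crossing the dispatcher under each policy."""
    central = hybrid = 0
    for i, q in enumerate(queries):
        response = backends[i % len(backends)](q)   # round-robin for the toy
        central += len(response)                    # central: relay everything
        if len(response) <= handover_threshold:
            hybrid += len(response)                 # small: relay via dispatcher
        # large: connection handed over; bytes bypass the dispatcher
    return central, hybrid
```

With one 10-byte and one 10,000-byte response, the central policy carries all 10,010 bytes while the hybrid policy carries only 10.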
- Ja-(zu-)SQL: Evaluation einer SQL-Skriptsprache für Hauptspeicherdatenbanksysteme (BTW 2019, 2019). Schüle, Maximilian; Passing, Linnea; Kemper, Alfons; Neumann, Thomas. Large volumes of data in industry and science are classically managed in database systems. To use this data for machine learning and data analysis, a time-consuming, cyclic transform-and-transfer process is required, since the data must be available in other formats or simply on other platforms for these purposes. Research in recent years has addressed the task of integrating data management and data analysis in a single system in order to avoid expensive data transfers. This work examines the capability of stored procedures for implementing data mining algorithms in database systems. The basis for this is HyPerScript, the PL/SQL-like scripting language of the main-memory database system HyPer. In particular, we evaluate prototype implementations of five algorithms that work entirely without a separate data analysis system.
- LinDP++: Generalizing Linearized DP to Crossproducts and Non-Inner Joins (BTW 2019, 2019). Radke, Bernhard; Neumann, Thomas. Choosing the best join order is one of the main tasks of query optimization, as join ordering can easily affect query execution times by large factors. Finding the optimal join order is NP-hard in general, which means that the best known algorithms have exponential worst-case complexity. As a consequence, only relatively modest problems can be solved exactly, which is a problem for today's large, machine-generated queries. Two developments have improved the situation: If we disallow crossproducts, graph-based DP algorithms have pushed the boundary of solvable problems to a few dozen relations. Beyond that, the linearized DP strategy, where an optimal left-deep plan is used to linearize the search space of a subsequent DP, has proven to work very well up to a hundred relations or more. However, these strategies have limitations: Graph-based DP intentionally does not consider implicit crossproducts, which is almost always acceptable but sometimes undesirable, as in some cases such crossproducts are beneficial. Even more severe, linearized DP can handle neither crossproducts nor non-inner joins, which is a serious limitation. Large queries with, e.g., outer joins are quite common, and having to fall back on simple greedy heuristics in this case is highly undesirable. In this work we remove both limitations: First, we generalize the underlying linearization strategy to handle non-inner joins, which allows us to linearize the search space of arbitrary queries. And second, we explicitly recognize potential crossproduct opportunities and expose them to the join ordering strategies by augmenting the query graph.
This results in a very generic join ordering framework that can handle arbitrary queries and produces excellent results over the whole range of query sizes.
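The linearization idea can be sketched as an interval DP over a fixed relation order: only contiguous subranges of the order are considered, shrinking the search space to O(n^2) subproblems solvable in O(n^3) time. This is a simplified illustration with a naive cost model, not the paper's algorithm, which builds on IKKBZ-derived linearizations, real cost functions, and the non-inner-join generalization:

```python
import math

def linearized_dp(cards, order):
    """Interval DP over a fixed relation order: dp over contiguous
    subranges order[i..j] only. `cards[r]` is the estimated cardinality
    of relation r; the toy cost model sums intermediate-result sizes
    and ignores join selectivities entirely."""
    n = len(order)
    size = [[0] * n for _ in range(n)]    # size[i][j]: cardinality of order[i..j]
    cost = [[0.0] * n for _ in range(n)]
    plan = [[None] * n for _ in range(n)]
    for i in range(n):
        size[i][i] = cards[order[i]]
        plan[i][i] = order[i]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            size[i][j] = math.prod(cards[order[k]] for k in range(i, j + 1))
            cost[i][j] = math.inf
            for k in range(i, j):         # split between order[i..k] and order[k+1..j]
                c = cost[i][k] + cost[k + 1][j] + size[i][j]
                if c < cost[i][j]:
                    cost[i][j] = c
                    plan[i][j] = (plan[i][k], plan[k + 1][j])
    return plan[0][n - 1], cost[0][n - 1]
```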
- Waves of Misery After Index Creation (BTW 2019, 2019). Glombiewski, Nikolaus; Seeger, Bernhard; Graefe, Goetz. After creation of a new b-tree, the ordinary course of database updates and index maintenance causes waves of node splits. Thus, a new index may at first speed up database query processing, but then the first "wave of misery" requires effort for frequent node splits and imposes spikes of buffer pool contention and of I/O. Waves of misery continue over multiple instances, although eventually the waves widen, flatten, and spread further apart. Free space left in each node during index creation fails to prevent the problem; it merely delays the onset of the first wave. We have found a theoretically sound way to avoid these waves of misery, as well as some simple and practical means to reduce their amplitude to negligible levels. Experiments demonstrate that these techniques are effective. Waves of misery occur in databases and in key-value stores, in primary and in secondary b-tree indexes, after load operations, and after b-tree reorganization or rebuild. The same remedies apply with equal effect.
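The synchronized-split effect is easy to reproduce in a toy simulation (all parameters illustrative, unrelated to the paper's experiments): leaves that all start at the same fill factor reach capacity at roughly the same time, so their first splits arrive bunched together as a wave rather than spread uniformly:

```python
import random

def simulate_splits(num_leaves=1000, capacity=100, fill=0.9,
                    inserts_per_step=5000, steps=40, seed=1):
    """Toy model of post-creation b-tree behavior: every leaf starts at the
    same fill factor, inserts land on uniformly random leaves, and a leaf
    that overflows splits into two half-full leaves. Returns the number of
    splits per time step; plotting it shows the clustered 'waves'."""
    random.seed(seed)
    leaves = [int(capacity * fill)] * num_leaves
    waves = []
    for _ in range(steps):
        splits = 0
        for _ in range(inserts_per_step):
            i = random.randrange(len(leaves))
            leaves[i] += 1
            if leaves[i] > capacity:          # overflow -> node split
                total = leaves[i]
                leaves[i] = total // 2
                leaves.append(total - total // 2)
                splits += 1
        waves.append(splits)
    return waves
```

Varying the initial fill per leaf instead of using one uniform value desynchronizes the first splits in this toy model; whether that matches the paper's actual remedy is not claimed here.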
- BTW 2019 – Datenbanksysteme für Business, Technologie und Web (BTW 2019, 2019). Grust, Torsten; Naumann, Felix; Böhm, Alexander; Lehner, Wolfgang; Härder, Theo; Rahm, Erhard; Heuer, Andreas; Klettke, Meike; Meyer, Holger (eds.).
- Big graph analysis by visually created workflows (BTW 2019, 2019). Rostami, M. Ali; Peukert, Eric; Wilke, Moritz; Rahm, Erhard. The analysis of large graphs has received considerable attention recently, but current solutions are typically hard to use. In this demonstration paper, we report on an effort to improve the usability of the open-source system Gradoop for processing and analyzing large graphs. This is achieved by integrating Gradoop into the popular open-source software KNIME so that graph analysis workflows can be created visually, without the need for coding. We outline the integration approach and discuss what will be demonstrated.
- When is Harry Potter's birthday? – Question Answering on Linked Data (BTW 2019, 2019). Steinmetz, Nadine; Arning, Ann-Katrin; Sattler, Kai-Uwe. Question Answering (QA) systems provide a user-friendly way to access any type of knowledge base, and especially Linked Data sources, to gain insight into semantic information. But understanding natural language, transferring it to a formal query, and finding the correct answer is a complex task. The challenge is even harder when the QA system aims to be easily adaptable regarding the underlying information. This goal can be achieved by an approach that is independent of the knowledge base: the respective data can then be replaced or updated without changes to the system itself. With this paper we present our QA approach and a demonstrator that can process natural language questions about general knowledge (not specific to a topic or domain).
- Processing Large Raster and Vector Data in Apache Spark (BTW 2019, 2019). Hagedorn, Stefan; Birli, Oliver; Sattler, Kai-Uwe. Spatial data processing frameworks are in many cases limited to vector data only. However, an important type of spatial data is raster data, which is produced by sensors on satellites but also by high-resolution cameras taking pictures of nanostructures, such as chips on wafers. Often the raster data sets become large and need to be processed in parallel in a cluster environment. In this paper we demonstrate our STARK framework with its support for raster data and its functionality to combine raster and vector data in filter and join operations. To save engineers from the burden of learning a programming language, queries can be formulated in SQL in a web interface. In the demonstration, users can use this web interface to inspect examples of raster data using our extended SQL queries on an Apache Spark cluster.
- jHound: Large-Scale Profiling of Open JSON Data (BTW 2019, 2019). Möller, Mark Lukas; Berton, Nicolas; Klettke, Meike; Scherzinger, Stefanie; Störl, Uta. We present jHound, a tool for profiling large collections of JSON data, and apply it to thousands of data sets holding open government data. jHound reports key characteristics of JSON documents, such as their nesting depth. As we show, jHound can help detect structural outliers and, most importantly, badly encoded documents: it can pinpoint certain cases of documents that use string-typed values where other native JSON datatypes would have been a better match. Moreover, we can detect certain cases of maladaptively structured JSON documents, which obviously do not comply with good data modeling practices. By interactively exploring particular example documents, we hope to inspire discussions in the community about what makes a good JSON encoding.
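The kinds of checks described can be sketched in a few lines. This is a hypothetical mini-profiler, not jHound's actual implementation: it tracks maximum nesting depth and flags string values that would parse as a native JSON number, one symptom of a badly encoded document:

```python
import json

def profile(doc):
    """Collect two example profiling metrics over a JSON document:
    maximum nesting depth, and the count of string-typed leaf values
    that look like numbers (i.e., a native number type would fit better)."""
    stats = {"max_depth": 0, "numeric_strings": 0, "values": 0}

    def walk(node, depth):
        stats["max_depth"] = max(stats["max_depth"], depth)
        if isinstance(node, dict):
            for v in node.values():
                walk(v, depth + 1)
        elif isinstance(node, list):
            for v in node:
                walk(v, depth + 1)
        else:                                  # leaf value
            stats["values"] += 1
            if isinstance(node, str):
                try:
                    float(node)                # "42" parses -> suspicious encoding
                    stats["numeric_strings"] += 1
                except ValueError:
                    pass

    walk(json.loads(doc) if isinstance(doc, str) else doc, 0)
    return stats
```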