- ZeitschriftenartikelCollecting and visualizing data lineage of Spark jobs(Datenbank-Spektrum: Vol. 21, No. 3, 2021) Schoenenwald, Alexander; Kern, Simon; Viehhauser, Josef; Schildgen, JohannesMetadata management constitutes a key prerequisite for enterprises as they engage in data analytics and governance. Today, however, the context of data is often only manually documented by subject matter experts, and lacks completeness and reliability due to the complex nature of data pipelines. Thus, collecting data lineage—describing the origin, structure, and dependencies of data—in an automated fashion increases quality of provided metadata and reduces manual effort, making it critical for the development and operation of data pipelines. In our practice report, we propose an end-to-end solution that digests lineage via (Py‑)Spark execution plans. We build upon the open-source component Spline , allowing us to reliably consume lineage metadata and identify interdependencies. We map the digested data into an expandable data model, enabling us to extract graph structures for both coarse- and fine-grained data lineage. Lastly, our solution visualizes the extracted data lineage via a modern web app, and integrates with BMW Group’s soon-to-be open-sourced Cloud Data Hub.
- ZeitschriftenartikelFeature Engineering Techniques and Spatio-Temporal Data Processing(Datenbank-Spektrum: Vol. 21, No. 3, 2021) Forke, Chris-Marian; Tropmann-Frick, MarinaMore and more applications nowadays use spatio-temporal data for different purposes. In order to be processed and used efficiently, this unique type of data requires special handling. This paper summarizes methods and approaches for feature selection of spatio-temporal data and machine learning algorithms for spatio-temporal data engineering. Furthermore, it highlights relevant work in specific domains. The range of possible approaches for data processing is quite wide. However, in order to use these approaches with the spatio-temporal data in a meaningful and practical way, individual data processing steps need to be adapted. One of the most important steps is feature engineering.
- ZeitschriftenartikelKurz erklärt: Measuring Data Changes in Data Engineering and their Impact on Explainability and Algorithm Fairness(Datenbank-Spektrum: Vol. 21, No. 3, 2021) Klettke, Meike; Lutsch, Adrian; Störl, UtaData engineering is an integral part of any data science and ML process. It consists of several subtasks that are performed to improve data quality and to transform data into a target format suitable for analysis. The quality and correctness of the data engineering steps is therefore important to ensure the quality of the overall process. In machine learning processes requirements such as fairness and explainability are essential. The answers to these must also be provided by the data engineering subtasks. In this article, we will show how these can be achieved by logging, monitoring and controlling the data changes in order to evaluate their correctness. However, since data preprocessing algorithms are part of any machine learning pipeline, they must obviously also guarantee that they do not produce data biases. In this article we will briefly introduce three classes of methods for measuring data changes in data engineering and present which research questions still remain unanswered in this area.
- ZeitschriftenartikelThe Collaborative Research Center FONDA(Datenbank-Spektrum: Vol. 21, No. 3, 2021) Leser, Ulf; Hilbrich, Marcus; Draxl, Claudia; Eisert, Peter; Grunske, Lars; Hostert, Patrick; Kainmüller, Dagmar; Kao, Odej; Kehr, Birte; Kehrer, Timo; Koch, Christoph; Markl, Volker; Meyerhenke, Henning; Rabl, Tilmann; Reinefeld, Alexander; Reinert, Knut; Ritter, Kerstin; Scheuermann, Björn; Schintke, Florian; Schweikardt, Nicole; Weidlich, MatthiasToday’s scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating research is the reduction of development, adaptation, and maintenance times of DAWs. We describe the design and setup of the Collaborative Research Center (CRC) 1404 “FONDA -– Foundations of Workflows for Large-Scale Scientific Data Analysis”, in which roughly 50 researchers jointly investigate new technologies, algorithms, and models to increase the portability, adaptability, and dependability of DAWs executed over distributed infrastructures. We describe the motivation behind our project, explain its underlying core concepts, introduce FONDA’s internal structure, and sketch our vision for the future of workflow-based scientific data analysis. We also describe some lessons learned during the “making of” a CRC in Computer Science with strong interdisciplinary components, with the aim to foster similar endeavors.
- ZeitschriftenartikelOn Methods and Measures for the Inspection of Arbitrarily Oriented Subspace Clusters(Datenbank-Spektrum: Vol. 21, No. 3, 2021) Kazempour, Daniyal; Winter, Johannes; Kröger, Peer; Seidl, ThomasWhen using arbitrarily oriented subspace clustering algorithms one obtains a partitioning of a given data set and for each partition its individual subspace. Since clustering is an unsupervised machine learning task, we may not have “ground truth” labels at our disposal or do not wish to rely on them. What is needed in such cases are internal measure which permits a label-less analysis of the obtained subspace clustering. In this work, we propose methods for revising clusters obtained from arbitrarily oriented correlation clustering algorithms. Initial experiments conducted reveal improvements in the clustering results compared to the original clustering outcome. Our proposed approach is simple and can be applied as a post-processing step on arbitrarily oriented correlation clusterings.
- ZeitschriftenartikelPerformance Evaluation of Policy-Based SQL Query Classification for Data-Privacy Compliance(Datenbank-Spektrum: Vol. 21, No. 3, 2021) Schwab, Peter K.; Röckl, Jonas; Langohr, Maximilian S.; Meyer-Wegener, KlausData science must respect privacy in many situations. We have built a query repository with automatic SQL query classification according to data-privacy directives. It can intercept queries that violate the directives, since a JDBC proxy driver inserted between the end-users’ SQL tooling and the target data consults the repository for the compliance of each query. Still, this slows down query processing. This paper presents two optimizations implemented to increase classification performance and describes a measurement environment that allows quantifying the induced performance overhead. We present measurement results and show that our optimized implementation significantly reduces classification latency. The query metadata (QM) is stored in both relational and graph-based databases. Whereas query classification can be done in a few ms on average using relational QM, a graph-based classification is orders of magnitude more expensive at 137 ms on average. However, the graphs contain more precise information, and thus in some cases the final decision requires to check them, too. Our optimizations considerably reduce the number of graph-based classifications and, thus, decrease the latency to 0.35 ms in $$87\%$$ 87 % of the classification cases.
- ZeitschriftenartikelSeason- and Trend-aware Symbolic Approximation for Accurate and Efficient Time Series Matching(Datenbank-Spektrum: Vol. 21, No. 3, 2021) Kegel, Lars; Hartmann, Claudio; Thiele, Maik; Lehner, WolfgangProcessing and analyzing time series datasets have become a central issue in many domains requiring data management systems to support time series as a native data type. A core access primitive of time series is matching, which requires efficient algorithms on-top of appropriate representations like the symbolic aggregate approximation (SAX) representing the current state of the art. This technique reduces a time series to a low-dimensional space by segmenting it and discretizing each segment into a small symbolic alphabet. Unfortunately, SAX ignores the deterministic behavior of time series such as cyclical repeating patterns or a trend component affecting all segments, which may lead to a sub-optimal representation accuracy. We therefore introduce a novel season- and a trend-aware symbolic approximation and demonstrate an improved representation accuracy without increasing the memory footprint. Most importantly, our techniques also enable a more efficient time series matching by providing a match up to three orders of magnitude faster than SAX.
- ZeitschriftenartikelContinuous Training and Deployment of Deep Learning Models(Datenbank-Spektrum: Vol. 21, No. 3, 2021) Prapas, Ioannis; Derakhshan, Behrouz; Mahdiraji, Alireza Rezaei; Markl, VolkerDeep Learning (DL) has consistently surpassed other Machine Learning methods and achieved state-of-the-art performance in multiple cases. Several modern applications like financial and recommender systems require models that are constantly updated with fresh data. The prominent approach for keeping a DL model fresh is to trigger full retraining from scratch when enough new data are available. However, retraining large and complex DL models is time-consuming and compute-intensive. This makes full retraining costly, wasteful, and slow. In this paper, we present an approach to continuously train and deploy DL models. First, we enable continuous training through proactive training that combines samples of historical data with new streaming data. Second, we enable continuous deployment through gradient sparsification that allows us to send a small percentage of the model updates per training iteration. Our experimental results with LeNet5 on MNIST and modern DL models on CIFAR-10 show that proactive training keeps models fresh with comparable—if not superior—performance to full retraining at a fraction of the time. Combined with gradient sparsification, sparse proactive training enables very fast updates of a deployed model with arbitrarily large sparsity, reducing communication per iteration up to four orders of magnitude, with minimal—if any—losses in model quality. Sparse training, however, comes at a price; it incurs overhead on the training that depends on the size of the model and increases the training time by factors ranging from 1.25 to 3 in our experiments. Arguably, a small price to pay for successfully enabling the continuous training and deployment of large DL models.
- ZeitschriftenartikelNews(Datenbank-Spektrum: Vol. 21, No. 3, 2021)
- ZeitschriftenartikelDissertationen(Datenbank-Spektrum: Vol. 21, No. 3, 2021)