P289 - BTW2019 - Datenbanksysteme für Business, Technologie und Web

https://dl.gi.de/handle/20.500.12116/21526

Modern techniques for transaction-oriented database recovery
(BTW 2019, 2019) Sauer, Caetano
Transaction-oriented database recovery has been a “solved problem” for at least 25 years, since the introduction of the ARIES methods for logging and recovery. However, recent technological developments have created a need for new software architectures that can better exploit the efficiency of modern hardware. In the context of recovery, new algorithms are required to effectively accommodate the exponential decrease in main-memory cost, the advent of flash memory, the rapid expansion into many-core CPUs, the ever-increasing capacity of magnetic disks, and, in the long term, the potential adoption of non-volatile memory. In our research, we evaluated a variety of new software techniques for efficient transaction-oriented database recovery, focusing on availability and architectural simplicity. The techniques presented here differ from most recent work in the field in that they aim to be hardware-agnostic, supporting different memory and storage configurations with the same software, as well as fully functional in comparison with traditional database systems, e.g., by supporting media recovery, index management, larger-than-memory datasets, and arbitrary access structures with structural modifications.
Processing Large Raster and Vector Data in Apache Spark
(BTW 2019, 2019) Hagedorn, Stefan; Birli, Oliver; Sattler, Kai-Uwe
Spatial data processing frameworks are in many cases limited to vector data. However, an important type of spatial data is raster data, which is produced by sensors on satellites but also by high-resolution cameras taking pictures of nanostructures, such as chips on wafers. Raster data sets often become large and need to be processed in parallel in a cluster environment. In this paper, we demonstrate our STARK framework with its support for raster data and its functionality for combining raster and vector data in filter and join operations. To save engineers from the burden of learning a programming language, queries can be formulated in SQL in a web interface. In the demonstration, users can use this web interface to inspect examples of raster data using our extended SQL queries on an Apache Spark cluster.
MSDataStream – Connecting a Bruker Mass Spectrometer to the Internet
(BTW 2019, 2019) Zoun, Roman; Schallert, Kay; Broneske, David; Fenske, Wolfram; Pinnecke, Marcus; Heyer, Robert; Brehmer, Sven; Benndorf, Dirk; Saake, Gunter
Metaproteomics is the biological study of the proteins of whole communities comprising thousands of species using tandem mass spectrometry. However, it still follows a sequential, non-parallelizable workflow. Hence, researchers have to wait for hours or even days until the measurement data are available. In our demo, we show a way to reduce the smallest unit of the workflow to a minimum in order to realize a near-real-time stream processing system on a fast data architecture.
DICE: Density-based Interactive Clustering and Exploration
(BTW 2019, 2019) Kazempour, Daniyal; Kazakov, Maksim; Kröger, Peer; Seidl, Thomas
Clustering algorithms mostly follow a fixed pipeline: the user provides input data and hyperparameter values, the algorithm is executed, and the output files are generated or visualized. In our work, we provide an early prototype of an interactive density-based clustering tool named DICE, in which users can change the hyperparameter settings and immediately observe the resulting clusters. Furthermore, users can browse through each detected cluster and obtain statistics as well as a convex hull profile for each cluster. DICE also keeps track of the chosen settings, enabling the user to review which hyperparameter values have been chosen previously. DICE can be used not only in the scientific context of analyzing data, but also in didactic settings in which students can learn in an exploratory fashion how a density-based clustering algorithm such as DBSCAN behaves.
Database-Supported Video Game Engines: Data-Driven Map Generation
(BTW 2019, 2019) O'Grady, Daniel
Video game engines can benefit greatly from being tightly coupled with database systems. To make this point and exemplify the similarities between database and game engine technology, we demonstrate a data-driven approach to generating maps for video games, expressed purely in SQL. The demonstration will feature a live database-supported game of this kind that is playable on-site.
Efficient Data-Parallel Cumulative Aggregates for Large-Scale Machine Learning
(BTW 2019, 2019) Boehm, Matthias; Evfimievski, Alexandre; Reinwald, Berthold
Cumulative aggregates are often overlooked yet important operations in large-scale machine learning (ML) systems. Examples are prefix sums and more complex aggregates, but also preprocessing techniques such as the removal of empty rows or columns. These operations are challenging to parallelize over distributed, blocked matrices—as commonly used in ML systems—due to recursive data dependencies. However, computing prefix sums is a classic example of a presumably sequential operation that can be efficiently parallelized via aggregation trees. In this paper, we describe an efficient framework for data-parallel cumulative aggregates over distributed, blocked matrices. The basic idea is a self-similar operator composed of a forward cascade that reduces the data size by orders of magnitude per iteration until the data fits in local memory, a local cumulative aggregate over the partial aggregates, and a backward cascade to produce the final result. We also generalize this framework for complex cumulative aggregates of sum-product expressions, and characterize the class of supported operations. Finally, we describe the end-to-end compiler and runtime integration into SystemML, and the use of cumulative aggregates in other operations. Our experiments show that this framework achieves both high performance for moderate data sizes and good scalability.
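The forward, local, and backward steps described in this abstract can be illustrated with a minimal single-level sketch. It assumes a one-dimensional input split into plain Python lists and runs in a single process; the function name `blocked_cumsum` is hypothetical, and this is not the SystemML implementation, which cascades recursively over distributed, blocked matrices.

```python
from itertools import accumulate


def blocked_cumsum(blocks):
    """Cumulative sum over a list of blocks (each block is a list of numbers)."""
    # Forward step: reduce every block to one partial aggregate (its sum).
    # Each block is independent here, so this step could run in parallel.
    block_sums = [sum(block) for block in blocks]

    # Local step: exclusive prefix sum over the much smaller list of
    # partial aggregates, giving the offset that precedes each block.
    offsets = [0] + list(accumulate(block_sums))[:-1]

    # Backward step: per-block cumulative sum shifted by the block's offset.
    # Again, each block can be processed independently.
    return [
        [offset + value for value in accumulate(block)]
        for block, offset in zip(blocks, offsets)
    ]


blocks = [[1, 2, 3], [4, 5], [6]]
print(blocked_cumsum(blocks))  # [[1, 3, 6], [10, 15], [21]]
```

The only sequential work is the prefix sum over one value per block. If that list were itself too large for local memory, the same three steps could be applied to it again, which is the self-similar structure the abstract refers to.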
Understanding Trolls with Efficient Analytics of Large Graphs in Neo4j
(BTW 2019, 2019) Allen, David; Hodler, Amy; Hunger, Michael; Knobloch, Martin; Lyon, William; Needham, Mark; Voigt, Hannes
Analytics of large graph data sets has become an important means of understanding and influencing the world. The use of graph database technology in the International Consortium of Investigative Journalists’ (ICIJ) investigation of the Panama Papers and Paradise Papers, or in cancer research, illustrates how analysing graph-structured data helps to uncover important but hidden relationships. A very current example in that regard shows how graph analytics can help shed light on the operations of social media troll networks, e.g. on Twitter. In similar fashion, graph analytics can help enterprises to unearth hidden patterns and structures within connected data and to make more accurate predictions and faster decisions. All this requires efficient graph analytics that is well integrated with the management of graph data. The Neo4j Graph Platform provides such an environment. It provides transactional processing and analytical processing of graph data, including data management and analytics tooling. A central element for graph analytics in the Graph Platform is the Neo4j Graph Algorithms library, which provides efficiently implemented, parallel versions of common graph algorithms, integrated and optimized for the Neo4j transactional database. In this paper, we describe the design and integration of the Neo4j Graph Algorithms, demonstrate the utility of our approach with a Twitter troll analysis, and showcase its performance with a few experiments on large graphs.
Explore FREDDY: Fast Word Embeddings in Database Systems
(BTW 2019, 2019) Günther, Michael; Thiele, Maik; Lehner, Wolfgang; Yanakiev, Zdravko
Word embeddings encode many semantic as well as syntactic features and are therefore useful in many tasks, especially in Natural Language Processing and Information Retrieval. FREDDY (Fast woRd EmbedDings Database sYstems) is an extended PostgreSQL database system that allows the user to analyze structured knowledge in the database relations together with unstructured text corpora encoded as word embeddings by introducing novel operations for similarity calculation and analogy inference. Approximation techniques support these operations to perform fast similarity computations on high-dimensional vector spaces. This demo allows exploring the powerful query capabilities of FREDDY on different database schemas and a variety of word embeddings generated on different text corpora. From a systems perspective, the user is able to examine the impact of multiple approximation techniques and their parameters for similarity search on query execution time and precision.
The Borda Social Choice Movie Recommender
(BTW 2019, 2019) Kastner, Johannes; Ranitovic, Nemanja; Endres, Markus
In this demo paper, we present a recommender system that exploits the Borda social choice voting rule for clustering recommendations in order to produce comprehensible results for a user. In contrast to existing clustering techniques such as k-means, the overhead of normalizing and preparing the preferred user data is dropped. In our demo showcase, we facilitate a comparison of our clustering approach with the well-known k-means++ using traditional distance measures.
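For readers unfamiliar with the Borda rule named in this abstract, a minimal sketch of the voting rule itself may help. It assumes every voter ranks the same items from best to worst; the function names and the placeholder items "A", "B", and "C" are hypothetical. It shows only the score aggregation, not the clustering approach of the paper.

```python
from collections import defaultdict


def borda_scores(rankings):
    """Aggregate ranked lists with the Borda count (best item listed first)."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # An item at 0-based position p in a list of n items earns n - 1 - p points.
            scores[item] += n - 1 - position
    return dict(scores)


def borda_order(rankings):
    """Items sorted by descending Borda score; ties are broken alphabetically."""
    scores = borda_scores(rankings)
    return sorted(scores, key=lambda item: (-scores[item], item))


rankings = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["B", "C", "A"],
]
print(borda_scores(rankings))  # {'A': 3, 'B': 5, 'C': 1}
print(borda_order(rankings))   # ['B', 'A', 'C']
```

Because the rule works on rank positions only, it needs no normalization of the underlying values, which fits the abstract's remark about dropped preparation overhead.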
The Best of Both Worlds: Combining Hand-Tuned and Word-Embedding-Based Similarity Measures for Entity Resolution
(BTW 2019, 2019) Chen, Xiao; Campero Durand, Gabriel; Zoun, Roman; Broneske, David; Li, Yang; Saake, Gunter
Recently, word embedding has become a beneficial technique for diverse natural language processing tasks, especially after the successful introduction of several popular neural word embedding models, such as word2vec, GloVe, and FastText. Entity resolution (ER), i.e., the task of identifying digital records that refer to the same real-world entity, has also been shown to benefit from word embedding. However, the use of word embeddings does not lead to a one-size-fits-all solution, because it cannot provide an accurate result for values without any semantic meaning, such as numerical values. In this paper, we propose combining general word embedding with traditional hand-picked similarity measures for solving ER tasks, with the aim of selecting the most suitable similarity measure for each attribute based on its properties. We provide some guidelines on how to choose suitable similarity measures for different types of attributes and evaluate our proposed hybrid method on both synthetic and real datasets. Experiments show that a hybrid method that relies on correctly selecting the required similarity measures can outperform methods that adopt purely traditional or purely word-embedding-based similarity measures.
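The idea of choosing a similarity measure per attribute can be sketched as follows. This is an illustrative assumption, not the method evaluated in the paper: the two-dimensional `TOY_EMBEDDINGS`, the relative-difference measure for numbers, the plain averaging, and all function names are hypothetical stand-ins.

```python
import math

# Hypothetical toy vectors; a real system would load word2vec, GloVe, or FastText.
TOY_EMBEDDINGS = {
    "laptop": [0.9, 0.1],
    "notebook": [0.8, 0.2],
    "banana": [0.1, 0.9],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_similarity(a, b):
    # Words without a vector fall back to exact string comparison.
    if a in TOY_EMBEDDINGS and b in TOY_EMBEDDINGS:
        return cosine(TOY_EMBEDDINGS[a], TOY_EMBEDDINGS[b])
    return 1.0 if a == b else 0.0


def numeric_similarity(a, b):
    # Relative difference mapped to [0, 1]; equal values score 1.0.
    largest = max(abs(a), abs(b))
    return 1.0 if largest == 0 else max(0.0, 1.0 - abs(a - b) / largest)


def record_similarity(r1, r2):
    """Average per-attribute similarity, picking the measure by attribute type."""
    sims = []
    for key in r1:
        a, b = r1[key], r2[key]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            sims.append(numeric_similarity(a, b))  # hand-picked measure
        else:
            sims.append(embedding_similarity(str(a), str(b)))  # embedding-based
    return sum(sims) / len(sims) if sims else 0.0


r1 = {"product": "laptop", "price": 1000}
r2 = {"product": "notebook", "price": 950}
r3 = {"product": "banana", "price": 1}
print(record_similarity(r1, r2) > record_similarity(r1, r3))  # True
```

Numeric values are routed to a numeric measure because, as the abstract notes, embeddings carry no useful signal for values without semantic meaning.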
