Browsing by author "Zoun, Roman"
1 - 4 of 4
- Text document: The Best of Both Worlds: Combining Hand-Tuned and Word-Embedding-Based Similarity Measures for Entity Resolution (BTW 2019, 2019)
  Chen, Xiao; Campero Durand, Gabriel; Zoun, Roman; Broneske, David; Li, Yang; Saake, Gunter
  Recently, word embedding has become a beneficial technique for diverse natural language processing tasks, especially after the successful introduction of several popular neural word embedding models, such as word2vec, GloVe, and FastText. Entity resolution (ER), i.e., the task of identifying digital records that refer to the same real-world entity, has also been shown to benefit from word embedding. However, word embeddings are not a one-size-fits-all solution, because they cannot provide accurate results for values without any semantic meaning, such as numerical values. In this paper, we propose combining general word embeddings with traditional hand-picked similarity measures for solving ER tasks, with the aim of selecting the most suitable similarity measure for each attribute based on its properties. We provide guidelines on how to choose suitable similarity measures for different types of attributes and evaluate our proposed hybrid method on both synthetic and real datasets. Experiments show that a hybrid method that correctly selects the required similarity measures can outperform methods that rely purely on traditional or purely on word-embedding-based similarity measures.
- Journal article: GridTables: A One-Size-Fits-Most H2TAP Data Store (Datenbank-Spektrum: Vol. 20, No. 1, 2020)
  Pinnecke, Marcus; Campero Durand, Gabriel; Broneske, David; Zoun, Roman; Saake, Gunter
  Heterogeneous Hybrid Transactional Analytical Processing ($\mathrm{H}^{2}$TAP) database systems have been developed to match the requirements for low-latency analysis of real-time operational data. Due to technical challenges, these systems are hard to architect, non-trivial to engineer, and complex to administer. Current research has proposed excellent solutions to many of those challenges in isolation; a unified engine that optimizes performance by combining these solutions is still missing. In this concept paper, we suggest a highly flexible and adaptive data structure (called gridtable) to physically organize sparse but structured records in the context of $\mathrm{H}^{2}$TAP. For this, we focus on the design of an efficient, highly flexible storage layout that is built from scratch for mixed query workloads. The key challenges we address are: (1) partial storage in different memory locations, and (2) the ability to optimize for mixed OLTP/OLAP access patterns. To guarantee safe and well-specified data definition and manipulation, as well as fast querying with no compromises on performance, we propose two dedicated access paths to the storage. In this paper, we explore the architecture and internals of gridtables, showing design goals, concepts, and trade-offs. We close with open research questions and challenges that must be addressed in order to take advantage of the flexibility of our solution.
- Text document: MSDataStream – Connecting a Bruker Mass Spectrometer to the Internet (BTW 2019, 2019)
  Zoun, Roman; Schallert, Kay; Broneske, David; Fenske, Wolfram; Pinnecke, Marcus; Heyer, Robert; Brehmer, Sven; Benndorf, Dirk; Saake, Gunter
  Metaproteomics is the biological study of the proteins of whole communities, comprising thousands of species, using tandem mass spectrometry. However, it still follows a sequential, non-parallelizable workflow. Hence, researchers have to wait for hours or even days until the measurement data are available. In our demo, we show a way to reduce the smallest unit of the workflow to a minimum in order to realize a near-real-time stream processing system on a fast data architecture.
- Text document: Protobase: It's About Time for Backend/Database Co-Design (BTW 2019, 2019)
  Pinnecke, Marcus; Campero, Gabriel; Zoun, Roman; Broneske, David; Saake, Gunter
  In this interactive demonstration, we show the current state of Protobase, our main-memory analytic document store that is designed from scratch to enable rapid prototyping of efficient microservices that perform analytics and explorations on (third-party) JSON-like documents stored in a novel columnar binary-encoded format, called the Cabin file format. In contrast to other solutions, our database system exposes neither a particular query language nor a fixed REST API to its clients. Instead, the entire user-defined backend logic, whose user code is written in Python, is placed inside a sandbox that runs in the system's process. Protobase in turn exposes a user-defined REST API that the (frontend) application interacts with. Thus, our system acts as a backend server while at the same time avoiding full exposure of its database to the clients. Consequently, a Protobase instance (database + user code + REST API) serves as the entire microservice, potentially minimizing the number of systems running in a typical analytic software stack. In terms of execution performance, Protobase therefore takes the inter-process communication overhead between backend and database system out of the picture and heavily utilizes columnar binary document storage to scale up for analytic queries. Both features lead to a notable performance gain for non-trivial services, potentially minimizing the number of required nodes in a cloud setting, too. In our demo, we give an overview of Protobase's internals, highlight major design decisions, and show how to prototype a scholarly search engine managing the Microsoft Academic Graph, a real-world scientific paper graph of roughly 154 million documents.
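
The hybrid idea described in the first entry (The Best of Both Worlds), choosing a similarity measure per attribute according to its type, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors in `TOY_EMBEDDINGS`, the attribute kinds ("semantic", "string", "numeric"), the function names, and the plain averaging of scores are all assumptions made for illustration. A real system would load word2vec, GloVe, or FastText vectors instead of the hand-written ones.

```python
import math
from difflib import SequenceMatcher

# Hypothetical toy embeddings; stand-ins for real pretrained word vectors.
TOY_EMBEDDINGS = {
    "laptop": (0.9, 0.1, 0.0),
    "notebook": (0.8, 0.2, 0.1),
    "phone": (0.1, 0.9, 0.2),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def string_similarity(a, b):
    """Traditional character-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def embedding_similarity(a, b):
    """Cosine similarity of word vectors; falls back to string similarity
    when a word is missing from the (toy) vocabulary."""
    va, vb = TOY_EMBEDDINGS.get(a.lower()), TOY_EMBEDDINGS.get(b.lower())
    if va is None or vb is None:
        return string_similarity(a, b)
    return cosine(va, vb)


def numeric_similarity(a, b):
    """Relative closeness of two numbers, clamped to [0, 1]."""
    a, b = float(a), float(b)
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / largest)


# Assumed mapping from attribute kind to similarity measure.
MEASURES = {
    "semantic": embedding_similarity,
    "string": string_similarity,
    "numeric": numeric_similarity,
}


def record_similarity(r1, r2, attribute_kinds):
    """Average the per-attribute scores, each computed with the measure
    selected for that attribute's kind."""
    scores = [MEASURES[kind](r1[attr], r2[attr])
              for attr, kind in attribute_kinds.items()]
    return sum(scores) / len(scores)


record_a = {"category": "laptop", "price": "999"}
record_b = {"category": "notebook", "price": "1000"}

hybrid = record_similarity(
    record_a, record_b, {"category": "semantic", "price": "numeric"})
string_only = record_similarity(
    record_a, record_b, {"category": "string", "price": "string"})
```

With these toy inputs, the hybrid configuration scores the pair much higher than the string-only configuration, because "laptop" and "notebook" share few characters and "999" and "1000" share none, even though the two records plausibly describe the same product.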