Browsing by author "Boden, Christoph"
1 - 2 of 2
- Journal article: The Berlin Big Data Center (BBDC) (it - Information Technology: Vol. 60, No. 5-6, 2018). Boden, Christoph; Rabl, Tilmann; Markl, Volker
  The last decade has been characterized by the collection and availability of unprecedented amounts of data, due to rapidly decreasing storage costs and the omnipresence of sensors and data-producing global online services. To process and analyze this data deluge, novel distributed data processing systems based on the dataflow paradigm, such as Apache Hadoop, Apache Spark, and Apache Flink, were built and have been scaled to tens of thousands of machines. However, writing efficient implementations of data analysis programs on these systems requires a deep understanding of systems programming, which prevents large groups of data scientists and analysts from using this technology efficiently. In this article, we present some of the main achievements of the research carried out by the Berlin Big Data Center (BBDC). We introduce the two domain-specific languages Emma and LARA, which are deeply embedded in Scala and enable declarative specification and automatic parallelization of data analysis programs; the PEEL Framework for transparent and reproducible benchmark experiments of distributed data processing systems; and approaches to foster the interpretability of machine learning models. Finally, we provide an overview of the challenges to be addressed in the second phase of the BBDC.
- Journal article: Fact-Aware Document Retrieval for Information Extraction (Datenbank-Spektrum: Vol. 12, No. 2, 2012). Boden, Christoph; Löser, Alexander; Nagel, Christoph; Pieper, Stephan
  Exploiting textual information from large document collections such as the Web with structured queries is a frequently requested but still unsolved requirement of many users. We present BlueFact, a framework for efficiently retrieving documents containing structured, factual information from a full-text index. This is an essential building block for information extraction systems that enable ad-hoc analytical queries on unstructured text data as well as knowledge harvesting in a digital archive scenario.
  Our approach is based on the observation that documents share a set of common grammatical structures and words for expressing facts. Our system observes these keyword phrases using structural, syntactic, lexical and semantic features in an iterative, cost-effective training process and systematically queries the search engine index with these automatically generated phrases. Next, BlueFact retrieves a list of document identifiers, combines observed keywords as evidence for factual information and infers the relevance of each document identifier. Finally, we forward the documents in the order of their estimated relevance to an information extraction service. That way, BlueFact can efficiently retrieve all the structured, factual information contained in an indexed collection of text documents.
  We report results of a comprehensive experimental evaluation over 20 different fact types on the Reuters News Corpus Volume I (RCV1). BlueFact’s scoring model and feature generation methods significantly outperform existing approaches in terms of fact retrieval performance. BlueFact fires significantly fewer queries against the index, requires significantly less execution time and achieves very high fact recall across different domains.