Datenbank Spektrum 19(2) - July 2019

Recent publications

1 - 10 of 10
  • Journal article
    Using the Semantic Web as a Source of Training Data
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019) Bizer, Christian; Primpeli, Anna; Peeters, Ralph
    Deep neural networks are increasingly used for tasks such as entity resolution, sentiment analysis, and information extraction. Because these methods are data-hungry, large training sets are needed for them to play to their strengths. Millions of websites have started to annotate structured data within HTML pages using the schema.org vocabulary. Popular types of entities that are annotated are products, reviews, events, people, hotels, and other local businesses [12]. These semantic annotations are used by all major search engines to display rich snippets in search results, which is also the main driver behind the wide-scale adoption of the annotation techniques. This article explores the potential of using semantic annotations from large numbers of websites as training data for supervised entity resolution, sentiment analysis, and information extraction methods. After giving an overview of the types of structured data that are available on the Semantic Web, we focus on the task of product matching in e-commerce and explain how semantic annotations can be used to gather a large training dataset for product matching. The dataset consists of more than 20 million pairs of offers referring to the same products. The offers were extracted from 43 thousand e-shops that provide schema.org annotations including some form of product identifier, such as manufacturer part numbers (MPNs), global trade item numbers (GTINs), or stock keeping units (SKUs). The dataset, which we offer for public download, is orders of magnitude larger than the Walmart-Amazon [7], Amazon-Google [10], and Abt-Buy [10] datasets that are widely used to evaluate product matching methods. We verify the utility of the dataset as training data by using it to replicate the recent result of Mudgal et al. [15] stating that embeddings and RNNs outperform traditional symbolic matching methods on tasks involving less structured data.
    After the case study on product data matching, we focus on sentiment analysis and information extraction and discuss how semantic annotations from the Web can be used as training data within both tasks.
  • Journal article
    QUALM: Ganzheitliche Messung und Verbesserung der Datenqualität in der Textanalyse [QUALM: Holistic Measurement and Improvement of Data Quality in Text Analysis]
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019) Kiefer, Cornelia; Reimann, Peter; Mitschang, Bernhard
    Existing approaches to measuring and improving the quality of text data in text analysis have three major drawbacks. Evaluation metrics such as accuracy measure quality reliably, but they (1) depend on gold annotations that are laborious to create by hand and (2) offer no starting points for improving quality. Early domain-specific data quality methods for unstructured text data do without gold annotations and do offer starting points for improving data quality. However, these methods were developed only for limited application areas and (3) therefore do not take into account the specifics of many analysis tools used in text analysis processes. In this paper, we present the QUALM concept for high-quality mining of text data (QUALity Mining), which addresses these three drawbacks. The goal of QUALM is to increase the quality of the analysis results, e.g., with respect to the accuracy of a text classification, by measuring and improving data quality. To this end, QUALM offers a set of QUALM data quality methods. QUALM indicators capture data quality holistically, based on the fit between the input data and the specifics of the analysis tools, such as the features, training data, and semantic resources (for example, dictionaries or taxonomies) they use. Each indicator has a matching modifier that can change both the data and the specifics of the analysis tools in order to increase data quality. In a first evaluation of QUALM, we show for concrete analysis tools and datasets that applying the QUALM data quality methods also goes hand in hand with an increase in the quality of the analysis results in terms of the evaluation metric accuracy.
    To this end, the fit between the input data and the specifics of the analysis tools is increased with concrete QUALM modifiers that, for example, expand abbreviations or automatically suggest suitable training data based on text similarity metrics.
  • Journal article
    Measuring and Facilitating Data Repeatability in Web Science
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019) Risch, Julian; Krestel, Ralf
    Accessible and reusable datasets are a necessity for repeatable research. This requirement poses a problem particularly for web science, since scraped data comes in various formats and can change due to the dynamic character of the web. Further, usage of web data is typically restricted by copyright protection or privacy regulations, which hinder publication of datasets. To alleviate these problems and reach what we define as “partial data repeatability”, we present a process that consists of multiple components. Researchers need to distribute only a scraper and not the data itself to comply with legal limitations. If a dataset is re-scraped for repeatability after some time, the integrity of different versions can be checked based on fingerprints. Moreover, fingerprints are sufficient to identify which parts of the data have changed and by how much. We evaluate an implementation of this process with a dataset of 250 million online comments collected from five different news discussion platforms. We re-scraped the dataset after pausing for one year and show that less than ten percent of the data has actually changed. These experiments demonstrate that providing a scraper and fingerprints enables recreating a dataset and supports the repeatability of web science experiments.
  • Journal article
    Towards Semantic Integration of Federated Research Data
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019) Chamanara, Javad; Kraft, Angelina; Auer, Sören; Koepler, Oliver
    Digitization of the research (data) lifecycle has created a galaxy of data nodes that are often characterized by sparse interoperability. With the start of the European Open Science Cloud in November 2018 and facing the upcoming call for the creation of the National Research Data Infrastructure (NFDI), researchers and infrastructure providers will need to harmonize their data efforts. In this article, we propose a recently initiated proof-of-concept towards a network of semantically harmonized Research Data Management (RDM) systems. This includes a network of research data management and publication systems with semantic integration at three levels, namely, data, metadata, and schema. As such, an ecosystem for agile, evolutionary ontology development and for the community-driven definition of quality criteria and classification schemes for scientific domains will be created. In contrast to the classical data repository approach, this process will allow for cross-repository as well as cross-domain data discovery, integration, and collaboration and will lead to open and interoperable data portals throughout the scientific domains. At the joint lab of L3S research center and TIB Leibniz Information Center for Science and Technology in Hanover, we are developing a solution based on a customized distribution of CKAN called the Leibniz Data Manager (LDM). LDM uses CKAN’s harvesting functionality to exchange metadata using the DCAT vocabulary. By adding the concept of semantic schema to LDM, it will contribute to realizing the FAIR paradigm. Describing the variables of a dataset, together with their attributes and relationships, will improve findability and accessibility, and these descriptions can be processed by humans or machines across scientific domains.
    We argue that it is crucial for the RDM development in Germany that domain-specific data silos should be the exception, and that a semantically linked network of generic and domain-specific research data systems and services at national, regional, and organization levels should be promoted within the NFDI initiative.
  • Journal article
    A Link is not Enough – Reproducibility of Data
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019) Pawlik, Mateusz; Hütter, Thomas; Kocher, Daniel; Mann, Willi; Augsten, Nikolaus
    Although many works in the database community use open data in their experimental evaluation, repeating the empirical results of previous works remains a challenge. This holds true even if the source code or binaries of the tested algorithms are available. In this paper, we argue that providing access to the raw, original datasets is not enough. Real-world datasets are rarely processed without modification. Instead, the data is adapted to the needs of the experimental evaluation in the data preparation process. We show that the details of the data preparation process matter and that subtle differences during data conversion can have a large impact on runtime results. We introduce a data reproducibility model, identify three levels of data reproducibility, report on our own experience, and exemplify our best practices.
  • Journal article
    Transforming Heterogeneous Data into Knowledge for Personalized Treatments—A Use Case
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019) Vidal, Maria-Esther; Endris, Kemele M.; Jozashoori, Samaneh; Sakor, Ahmad; Rivas, Ariam
    Big data has grown exponentially in the last decade; it is expected to grow at a faster rate in the coming years as a result of advances in the technologies for data generation and ingestion. For instance, in the biomedical domain, a wide variety of methods are available for data ingestion, e.g., liquid biopsies and medical imaging, and the collected data can be represented using myriad formats, e.g., FASTQ and Nifti. In order to extract and manage valuable knowledge and insights from big data, the problem of integrating structured and unstructured data needs to be effectively solved. In this paper, we devise a knowledge-driven approach able to transform disparate data into knowledge from which actions can be taken. The proposed framework resorts to computational extraction methods for mining knowledge from data sources, e.g., clinical notes, images, or scientific publications. Moreover, controlled vocabularies are utilized to annotate entities, and a unified schema describes the meaning of these entities in a knowledge graph; entity linking methods discover links to existing knowledge graphs, e.g., DBpedia and Bio2RDF. A federated query engine enables the exploration of the linked knowledge graphs, while knowledge discovery methods allow for uncovering patterns in the knowledge graphs. The proposed framework is used in the context of the EU H2020 funded project iASiS with the aim of paving the way for accurate diagnostics and personalized treatments.
  • Journal article
    BTW 2019 – Datenbanksysteme im Zeitalter der Künstlichen Intelligenz, Data Science und neuen Hardware [BTW 2019: Database Systems in the Age of Artificial Intelligence, Data Science, and New Hardware]
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019) Heuer, Andreas; Klettke, Meike; Meyer, Holger
  • Journal article
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019) Dittrich, Jens; Naumann, Felix; Ritter, Norbert; Härder, Theo
  • Journal article
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019)
  • Journal article
    (Datenbank-Spektrum: Vol. 19, No. 2, 2019)
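
The fingerprint-based integrity check described in the abstract of "Measuring and Facilitating Data Repeatability in Web Science" above can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of SHA-256, the whitespace normalization, the comment-ID keys, and the function names `fingerprint`, `fingerprint_dataset`, and `compare_versions` are all assumptions made for illustration. The sketch shows only the general idea that two scrapes can be compared through per-record fingerprints without sharing the records themselves.

```python
import hashlib
from typing import Dict, List, Union


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of a whitespace-normalized record.

    Assumption: collapsing whitespace before hashing, so that purely
    cosmetic differences between scrapes do not count as changes.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def fingerprint_dataset(records: Dict[str, str]) -> Dict[str, str]:
    """Map each (hypothetical) record ID to the fingerprint of its text."""
    return {record_id: fingerprint(body) for record_id, body in records.items()}


def compare_versions(
    old_fp: Dict[str, str], new_fp: Dict[str, str]
) -> Dict[str, Union[List[str], float]]:
    """Compare two fingerprint maps without access to the underlying texts.

    Reports which record IDs changed, disappeared, or are new, plus the
    fraction of the old version that was changed or removed.
    """
    old_ids, new_ids = set(old_fp), set(new_fp)
    changed = sorted(i for i in old_ids & new_ids if old_fp[i] != new_fp[i])
    removed = sorted(old_ids - new_ids)
    added = sorted(new_ids - old_ids)
    fraction = (len(changed) + len(removed)) / len(old_ids) if old_ids else 0.0
    return {
        "changed": changed,
        "removed": removed,
        "added": added,
        "changed_fraction": fraction,
    }
```

Under these assumptions, a researcher would publish the scraper together with the output of `fingerprint_dataset`, and anyone re-scraping later could call `compare_versions` on the published and the freshly computed fingerprints to see which records differ.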