Listing by keyword "Information extraction"
1 - 5 of 5
- Journal article: A Brief Tutorial on How to Extract Information from User-Generated Content (UGC) (KI - Künstliche Intelligenz: Vol. 27, No. 1, 2013). Egger, Marc; Lang, André.
  In this brief tutorial, we provide an overview of how to investigate text-based user-generated content for information that is relevant in the corporate context. We structure the overall process along three stages: collection, analysis, and visualization. For each stage, we outline challenges and basic techniques for extracting information at different levels of granularity.
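The three-stage process outlined in the tutorial (collection, analysis, visualization) can be illustrated as a toy pipeline. The posts and the keyword counter below are illustrative stand-ins, not the authors' implementation:

```python
from collections import Counter

def collect():
    # Stand-in for the collection stage: scraping forums, reviews, or posts.
    return ["great battery life", "battery drains fast", "love the screen"]

def analyze(posts):
    # Coarsest analysis granularity: raw keyword frequencies.
    words = Counter(w for post in posts for w in post.split())
    return words.most_common(2)

def visualize(top):
    # Stand-in for the visualization stage: a textual summary.
    return ", ".join(f"{w} ({n})" for w, n in top)

print(visualize(analyze(collect())))  # -> battery (2), great (1)
```

Finer-grained techniques (entity recognition, sentiment analysis) would slot into the `analyze` stage of the same skeleton.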
- Journal article: Dictionary learning for transcriptomics data reveals type-specific gene modules in a multi-class setting (it - Information Technology: Vol. 62, No. 3-4, 2020). Rams, Mona; Conrad, Tim.
  Extracting information from large biological datasets is a challenging task due to their size, high dimensionality, noise, and errors. Gene expression data contains information about which gene products have been formed by a cell, and thus which genes have been read to activate a particular biological process. Understanding which of these gene products relate to which processes can, for example, give insights into how diseases evolve and hints on how to fight them. Next-generation RNA sequencing emerged over a decade ago and is now state of the art in gene expression analysis. However, analyzing these large, complex datasets remains challenging, and many existing methods do not take the underlying structure of the data into account. In this paper, we present a new approach to RNA-sequencing data analysis based on dictionary learning. Dictionary learning is a sparsity-enforcing method that has been widely used in many fields, such as image processing, pattern classification, and signal denoising. We show how, for RNA-sequencing data, the atoms of the dictionary matrix can be interpreted as modules of genes that either capture patterns specific to different types or represent modules reused across different scenarios. We evaluate our approach on four large datasets with samples from multiple types. A Gene Ontology term analysis, a standard tool for understanding gene functions, shows that the gene sets found agree with the biological context of the sample types. Further, we find that the sparse representations of samples in terms of the dictionary can be used to identify type-specific differences.
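The core idea of interpreting dictionary atoms as gene modules can be sketched in miniature. This is not the paper's method; it is a 1-sparse toy coding step, and the module names and expression profiles are invented for illustration:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_atom(sample, dictionary):
    """Pick the atom (candidate gene module) most correlated with the sample:
    a 1-sparse code, the simplest possible form of sparse coding."""
    scores = {name: dot(sample, atom) for name, atom in dictionary.items()}
    return max(scores, key=scores.get)

# Hypothetical expression profiles over 4 genes; each atom is a "module".
dictionary = {
    "module_immune": [1.0, 1.0, 0.0, 0.0],
    "module_cycle":  [0.0, 0.0, 1.0, 1.0],
}
sample = [0.9, 1.1, 0.1, 0.0]  # resembles the immune module
print(best_atom(sample, dictionary))  # -> module_immune
```

Real dictionary learning jointly optimizes the atoms and the sparse codes over many samples; the sketch only shows why an atom that explains a sample well can be read as a type-specific module.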
- Conference paper: Extraction of Information from Invoices – Challenges in the Extraction Pipeline (INFORMATIK 2023 - Designing Futures: Zukünfte gestalten, 2023). Thiée, Lukas-Walter; Krieger, Felix; Funk, Burkhardt.
  Invoice data is key information for business processes. To use this data and create business value, the information must be captured in a digital, structured form. Leveraging digital tools and AI/ML is state of the art in extracting information from invoices. However, existing approaches are trained on specific languages and layouts, and by focusing on individual performance metrics they neglect to demonstrate the full pipeline from raw data to processable information. In this paper, we investigate the types of information found on invoices and address the challenges in the extraction pipeline. We contribute a morphological framework for the problematization and design of such a pipeline as part of a design science study.
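One stage of such a pipeline, mapping raw invoice text to structured fields, can be sketched minimally. Production systems use AI/ML layout models rather than regular expressions; the field names and patterns here are illustrative assumptions:

```python
import re

def extract_fields(raw_text):
    """Toy field-extraction stage: regex patterns for a fixed layout.
    Real pipelines must handle varying languages and layouts."""
    patterns = {
        "invoice_no": r"Invoice\s*(?:No\.?|#)\s*([A-Z0-9-]+)",
        "date":       r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
        "total":      r"Total\s*:?\s*([0-9]+\.[0-9]{2})",
    }
    return {field: (m.group(1) if (m := re.search(p, raw_text)) else None)
            for field, p in patterns.items()}

invoice = "Invoice No. INV-2023-042  Date: 2023-05-01  Total: 149.90"
print(extract_fields(invoice))
```

The sketch makes the paper's point concrete: the hard part is not any single pattern but the pipeline around it, from raw scans to validated, processable records.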
- Journal article: Introduction to Information Extraction: Basic Notions and Current Trends (Datenbank-Spektrum: Vol. 12, No. 2, 2012). Balke, Wolf-Tilo.
  Transforming unstructured or semi-structured information into structured knowledge is one of the big challenges of today's knowledge society. While this abstract goal is still unreached, and probably unreachable, intelligent information extraction techniques are considered key ingredients on the way to generating and representing knowledge for a wide variety of applications. This is especially true for the current efforts to turn the World Wide Web, the world's largest collection of information, into the world's largest knowledge base. This introduction gives a broad overview of the major topics and current trends in information extraction.
- Journal article: Using the Semantic Web as a Source of Training Data (Datenbank-Spektrum: Vol. 19, No. 2, 2019). Bizer, Christian; Primpeli, Anna; Peeters, Ralph.
  Deep neural networks are increasingly used for tasks such as entity resolution, sentiment analysis, and information extraction. As these methods are rather training-data hungry, large training sets are necessary for them to play to their strengths. Millions of websites have started to annotate structured data within HTML pages using the schema.org vocabulary. Popular types of annotated entities are products, reviews, events, people, hotels, and other local businesses [12]. These semantic annotations are used by all major search engines to display rich snippets in search results, which is the main driver behind the wide-scale adoption of the annotation techniques. This article explores the potential of using semantic annotations from large numbers of websites as training data for supervised entity resolution, sentiment analysis, and information extraction methods. After giving an overview of the types of structured data available on the Semantic Web, we focus on the task of product matching in e-commerce and explain how semantic annotations can be used to gather a large training dataset for product matching. The dataset consists of more than 20 million pairs of offers referring to the same products. The offers were extracted from 43 thousand e-shops that provide schema.org annotations including some form of product identifier, such as manufacturer part numbers (MPNs), global trade item numbers (GTINs), or stock keeping units (SKUs). The dataset, which we offer for public download, is orders of magnitude larger than the Walmart-Amazon [7], Amazon-Google [10], and Abt-Buy [10] datasets that are widely used to evaluate product matching methods. We verify the utility of the dataset as training data by using it to replicate the recent result of Mudgal et al. [15] that embeddings and RNNs outperform traditional symbolic matching methods on tasks involving less structured data. After the case study on product data matching, we turn to sentiment analysis and information extraction and discuss how semantic annotations from the Web can be used as training data for both tasks.
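The kind of schema.org annotation the article harvests can be illustrated with a minimal sketch that pulls a JSON-LD Product record, including the identifiers (MPN, GTIN) used to pair offers, out of an HTML page. The page snippet and its values are invented for illustration:

```python
import json
import re

# Hypothetical product page carrying a schema.org annotation as JSON-LD.
html = '''
<script type="application/ld+json">
{"@type": "Product", "name": "USB-C Cable 1m",
 "mpn": "UC-100", "gtin13": "4006381333931"}
</script>
'''

def extract_product(page):
    """Toy harvester: grab the first JSON-LD block and keep it if it
    describes a Product. Real crawls also parse Microdata and RDFa."""
    m = re.search(r'<script type="application/ld\+json">(.*?)</script>',
                  page, re.DOTALL)
    data = json.loads(m.group(1)) if m else {}
    return data if data.get("@type") == "Product" else None

product = extract_product(html)
print(product["mpn"])  # -> UC-100
```

Offers from different shops sharing the same MPN or GTIN can then be treated as matching pairs, which is how such annotations become distant supervision for product matching.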