Listing by keyword "data quality"
1 - 6 of 6
- Conference paper: A data quality assessment tool for agricultural structured data as support for smart farming (43. GIL-Jahrestagung, Resiliente Agri-Food-Systeme, 2023)
  Schroth, Christof; Kelbert, Patricia; Vollmer, Anna Maria
  In the field of precision farming or smart farming, more and more sensors are used and produce massive amounts of data. Examples are machinery, weather stations, or georeferenced data, which can be used, among other things, by Artificial Intelligence decision support systems to improve or facilitate farmers' daily work tasks. Even when (Internet of Things) sensor data are transferred from machines to farm management information systems without issues, the data still contain errors such as missing, implausible, or incorrect values. In this paper, we present an automated data quality assessment (DQA) tool based on the ISO 25012 standard. We describe how we developed this tool with support from practitioners who produce agricultural data in the context of the EU Horizon 2020 project DEMETER. Additionally, we highlight some of the requirements we collected for such a tool and briefly discuss how we addressed them. For example, we learned that in the context of developing smart farming services, the data quality dimensions Accuracy, Completeness, Consistency, and Credibility are the most important ones for practitioners such as farmers, digital service providers, or machine suppliers. We therefore included them in the DQA tool, which is implemented in Python and released under the open-source Apache 2.0 license. Individual parameters (e.g., thresholds or time lengths) can be provided as input for the calculations to meet different users' needs. The output of the DQA is provided in machine-readable JSON format and can be used for further analysis, e.g., to improve the quality of the data collection or the follow-up data analysis. This can help practitioners develop more valuable smart farming services.
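To make the described output concrete, here is a minimal sketch of the kind of check such a DQA tool performs: completeness and plausibility over a series of sensor readings, with user-supplied threshold parameters and machine-readable JSON output. The function name, fields, and thresholds are illustrative assumptions, not the tool's actual API.

```python
import json

# Hypothetical illustration of a DQA-style check: completeness (missing
# values) and plausibility (threshold violations) over sensor readings,
# reported as JSON. Names and thresholds are assumptions, not the tool's API.

def assess_quality(readings, min_plausible, max_plausible):
    """Score completeness and plausibility of a list of sensor readings."""
    total = len(readings)
    missing = sum(1 for r in readings if r is None)
    present = [r for r in readings if r is not None]
    implausible = sum(1 for r in present if not (min_plausible <= r <= max_plausible))
    return {
        "completeness": (total - missing) / total if total else 0.0,
        "plausibility": (len(present) - implausible) / len(present) if present else 0.0,
        "records": total,
        "missing": missing,
        "implausible": implausible,
    }

# Soil temperature in °C with two gaps and one implausible spike.
report = assess_quality([12.1, None, 11.8, 95.0, None, 12.4], -20.0, 50.0)
print(json.dumps(report, indent=2))
```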
- Conference paper: Do We Need Real Data? - Testing and Training Algorithms with Artificial Geolocation Data (INFORMATIK 2019: 50 Jahre Gesellschaft für Informatik – Informatik für Gesellschaft, 2019)
  Kaiser, Jan; Bavendiek, Kai; Schupp, Sibylle
  As big data becomes increasingly important, so do algorithms that operate on geolocation data. Privacy requirements and the cost of collecting large sets of geolocation data, however, make it difficult to test those algorithms with real data. Artificially generated data sets therefore present an appealing alternative. This paper explores the use of two types of neural networks as generators of geolocation data and introduces a method based on the Turing Test to determine whether generated geolocation data is indistinguishable from real data. In an extensive evaluation we apply the method to data generated by our own implementation of neural networks as well as the widely used BerlinMOD generator on the one hand, and to the four most prominent data sets of real geolocation data, covering a total of 65 million records, on the other hand. The experiments show that in eleven of twelve cases artificial data sets can be told apart from real ones. We conclude that, at present, the generators we tested provide no safe replacement for real data.
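The abstract summarizes the Turing-Test-based method without specifying it; the sketch below shows one common machine-judge variant of such a test (an assumption, not the authors' exact protocol): train a classifier to separate real from generated points, and read held-out accuracy near 0.5 as "indistinguishable", clearly above 0.5 as "can be told apart".

```python
# Minimal indistinguishability test with a machine discriminator.
# The Gaussian point clouds stand in for real and generated (lat, lon) data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
real = rng.normal(loc=[52.52, 13.40], scale=[0.05, 0.08], size=(5000, 2))
fake = rng.normal(loc=[52.52, 13.40], scale=[0.05, 0.05], size=(5000, 2))

X = np.vstack([real, fake])
y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"discriminator accuracy: {acc:.3f}  (near 0.5 would mean indistinguishable)")
```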
- Conference paper: FAIR is not enough -- A Metrics Framework to ensure Data Quality through Data Preparation (BTW 2023, 2023)
  Restat, Valerie; Klettke, Meike; Störl, Uta
  Data-driven systems and machine-learning-based decisions are becoming increasingly important and have an impact on our everyday lives. The prerequisite for this is good data quality, which must be ensured by preprocessing the data. For domain experts, however, two difficulties arise: on the one hand, they have to choose from a multitude of different tools and algorithms; on the other hand, there is no uniform evaluation method for data quality. For this reason, we present the design of a framework of metrics that allows for a flexible evaluation of data quality and data preparation results.
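A minimal sketch of the metrics-framework idea: register metric functions and evaluate a data set before and after preparation. The registry design and the two metrics shown are illustrative assumptions, not the authors' actual framework.

```python
# Toy metrics registry: decorate functions to register them, then evaluate
# all registered metrics on raw and prepared data to compare quality.
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]
METRICS: Dict[str, Callable[[List[Row]], float]] = {}

def metric(name):
    def register(fn):
        METRICS[name] = fn
        return fn
    return register

@metric("completeness")
def completeness(rows):
    cells = [v for row in rows for v in row.values()]
    return sum(v is not None for v in cells) / len(cells)

@metric("uniqueness")
def uniqueness(rows):
    keys = [tuple(sorted(r.items())) for r in rows]
    return len(set(keys)) / len(keys)

def evaluate(rows):
    return {name: fn(rows) for name, fn in METRICS.items()}

raw = [{"id": 1, "city": None}, {"id": 1, "city": None}, {"id": 2, "city": "Ulm"}]
prepared = [{"id": 1, "city": "Bonn"}, {"id": 2, "city": "Ulm"}]
print("before:", evaluate(raw))
print("after: ", evaluate(prepared))
```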
- Conference paper: A High Quality Data Pipeline for Reasonable-Scale Machine Learning (Softwaretechnik-Trends Band 42, Heft 4, 2022)
  Faragó, David
  Data quality (especially correctness) plays a critical role in the success of a machine learning (ML) project. This paper describes a data pipeline for creating high-quality data, using Key Information Extraction (KIE) from invoices – one of the most popular tasks in Intelligent Document Processing (IDP) – as an example. The tasks of each data pipeline step are listed, showing the decisions and technology involved. The focus is on practicality: doing ML at reasonable scale, i.e., with as little cost (people and hardware) as possible, and a concern for practice more than for achieving high scores on a metric that is not grounded in practical use. Contributions:
  1. an extended list of quality dimensions, with simple definitions;
  2. an overview of a data pipeline, exemplified on KIE;
  3. for each pipeline step, a list of tasks, showing decisions, pitfalls, and technology involved;
  4. in particular, how to use the state-of-the-art contrastive model CLIP to solve difficult selection and reduction tasks on images (see the sketch after this entry);
  5. a tool for labeling key information on images;
  6. a labeling guide for invoices.
  Most contributions can easily be transferred to other supervised learning tasks.
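A hedged sketch of using CLIP zero-shot for the selection task contribution 4 mentions: rank candidate images by how well they match a text prompt and keep only likely invoices. The checkpoint, prompts, and threshold are assumptions; the paper's exact setup may differ.

```python
# Zero-shot image selection with CLIP via Hugging Face transformers.
# Requires: pip install transformers torch pillow
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def invoice_score(path: str) -> float:
    """Probability that the image matches 'a scanned invoice' over the alternative prompt."""
    image = Image.open(path)
    inputs = processor(text=["a scanned invoice", "a photo of something else"],
                       images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return probs[0, 0].item()

# Keep documents CLIP considers likely invoices (threshold is an assumption).
candidates = ["doc1.png", "doc2.png"]
selected = [p for p in candidates if invoice_score(p) > 0.8]
```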
- Conference paper: Improving Data Quality of Programme of Measures for the Water Framework Directive in Saxony (EnviroInfo 2022, 2022)
  Hosenfeld, Friedhelm; Dimmer, Roland; Mattes, Christoph
  A web application is presented that supports the responsible authorities in the management of measures for the WFD (Water Framework Directive) in Saxony. The web application enables different authorities to maintain WFD measures data in a common database. The central data management supports the LfULG in implementing the WFD and fulfilling the EU reporting obligations. A key requirement is the improvement of data quality, implemented through comprehensive consistency and completeness checks, input rules, and support functions for geometry creation. During data acquisition, the spatial data are verified for consistency with the attribute data.
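A hypothetical illustration of the kind of spatial-versus-attribute consistency check the abstract describes: verifying that a measure's recorded length agrees with its captured line geometry. The field names, tolerance, and record layout are assumptions, not the application's actual data model.

```python
# Consistency and completeness check for one WFD measure record,
# using shapely for the geometry length (coordinates assumed in meters).
from shapely.geometry import LineString

def check_measure(record, tolerance=0.05):
    """Return a list of data quality findings for one measure record."""
    findings = []
    # Completeness: required attributes must be present.
    for field in ("measure_id", "length_m", "geometry"):
        if record.get(field) is None:
            findings.append(f"missing required field: {field}")
    geom, length = record.get("geometry"), record.get("length_m")
    # Consistency: attribute length vs. geometry length.
    if geom is not None and length is not None:
        deviation = abs(geom.length - length) / length
        if deviation > tolerance:
            findings.append(f"length deviates {deviation:.0%} from geometry")
    return findings

record = {"measure_id": "M-042", "length_m": 100.0,
          "geometry": LineString([(0, 0), (130, 0)])}
print(check_measure(record))  # ['length deviates 30% from geometry']
```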
- Text document: Quality Indicators for Text Data (BTW 2019 – Workshopband, 2019)
  Kiefer, Cornelia
  Textual data sets vary in quality. They have different characteristics, such as the average sentence length or the number of spelling mistakes and abbreviations. These text characteristics influence the quality of text mining results and can be measured automatically by means of quality indicators. We present indicators that we implemented based on natural language processing libraries such as Stanford CoreNLP and NLTK. We discuss design decisions in the implementation of exemplary indicators and provide all indicators on GitHub. In the evaluation, we investigate free texts from production, news, prose, tweets, and chat data and show that the suggested indicators predict the quality of two text mining modules.
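A minimal sketch of one indicator named in the abstract, average sentence length, implemented with NLTK; this mirrors the idea, not the author's exact implementation on GitHub.

```python
# Average sentence length as a text quality indicator, using NLTK.
# Assumes NLTK is installed; the 'punkt' models are downloaded on first run.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize, word_tokenize

def avg_sentence_length(text: str) -> float:
    """Average number of tokens per sentence in the given text."""
    sentences = sent_tokenize(text)
    if not sentences:
        return 0.0
    return sum(len(word_tokenize(s)) for s in sentences) / len(sentences)

print(avg_sentence_length("Short text. This one is a little bit longer."))
```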