A High Quality Data Pipeline for Reasonable-Scale Machine Learning

Faragó, David

Conference paper

Document type

Text/Conference Paper

Files

3_TAV_STT_Farago.pdf (1.61 MB)

Date

2022

Authors

Faragó, David

Source

Softwaretechnik-Trends, Volume 42, Issue 4

FG TAV: Report and contributions from the meeting of the GI-Fachgruppe Test, Analyse und Verifikation von Software (TAV 47), 3-4 November 2022, Munich

Publisher

Gesellschaft für Informatik e.V.

Abstract

Data quality, especially correctness, plays a critical role in the success of a machine learning (ML) project. This paper describes a data pipeline for creating high-quality data, using Key Information Extraction (KIE) from invoices as an example, one of the most popular tasks in Intelligent Document Processing (IDP). The tasks of each data pipeline step are listed, showing the decisions and technology involved. The focus is on practicality: doing ML at reasonable scale, i.e., with as little cost (people and hardware) as possible, and caring more about practical use than about high scores on a metric that is not grounded in practice. Contributions: (1) an extended list of quality dimensions, with simple definitions; (2) an overview of a data pipeline, exemplified on KIE; (3) for each pipeline step, a list of tasks, showing the decisions, pitfalls, and technology involved; (4) in particular, how to use the state-of-the-art contrastive model CLIP to solve difficult selection and reduction tasks on images; (5) a tool for labeling key information on images; (6) a labeling guide for invoices. Most contributions can easily be transferred to other supervised learning tasks.

Faragó, David (2022): A High Quality Data Pipeline for Reasonable-Scale Machine Learning. Softwaretechnik-Trends, Volume 42, Issue 4. Bonn: Gesellschaft für Informatik e.V. PISSN: 0720-8928. pp. 18-23. FG TAV: Report and contributions from the meeting of the GI-Fachgruppe Test, Analyse und Verifikation von Software (TAV 47), 3-4 November 2022, Munich

Keywords

data quality, data-centric AI, data pipeline, reasonable-scale ML, IDP, KIE on invoices

Collections

Softwaretechnik-Trends 42(4) - 2022
