A High Quality Data Pipeline for Reasonable-Scale Machine Learning

Faragó, David

dc.contributor.author	Faragó, David
dc.date.accessioned	2023-01-25T14:36:12Z
dc.date.available	2023-01-25T14:36:12Z
dc.date.issued	2022
dc.description.abstract	Data quality (especially correctness) plays a critical role in the success of a machine learning (ML) project. This paper describes a data pipeline for creating high-quality data, using as an example Key Information Extraction (KIE) from invoices, one of the most popular tasks in Intelligent Document Processing (IDP). The tasks of each data pipeline step are listed, showing the decisions and technology involved. The focus is on practicality: doing ML at reasonable scale, i.e. with as little cost (people and hardware) as possible, and with more concern for practical use than for achieving high scores on a metric that is not grounded in it. Contributions: 1. an extended list of quality dimensions, with simple definitions; 2. an overview of a data pipeline, exemplified on KIE; 3. for each pipeline step, a list of tasks, showing the decisions, pitfalls, and technology involved; 4. in particular, how to use the state-of-the-art contrastive model CLIP to solve difficult selection and reduction tasks on images; 5. a tool for labeling key information on images; 6. a labeling guide for invoices. Most contributions can easily be transferred to other supervised learning tasks.	en
dc.identifier.pissn	0720-8928
dc.identifier.uri	https://dl.gi.de/handle/20.500.12116/40159
dc.language.iso	en
dc.publisher	Gesellschaft für Informatik e.V.
dc.relation.ispartof	Softwaretechnik-Trends Volume 42, Issue 4
dc.relation.ispartofseries	Softwaretechnik-Trends
dc.subject	data quality
dc.subject	data-centric AI
dc.subject	data pipeline
dc.subject	reasonable-scale ML
dc.subject	IDP
dc.subject	KIE on invoices
dc.title	A High Quality Data Pipeline for Reasonable-Scale Machine Learning	en
dc.type	Text/Conference Paper
gi.citation.endPage	23
gi.citation.publisherPlace	Bonn
gi.citation.startPage	18
gi.conference.sessiontitle	FG TAV: Report and contributions from the meeting of the GI special interest group Test, Analysis and Verification of Software (TAV 47), 3-4 November 2022, Munich

Files

Original bundle

Name: 3_TAV_STT_Farago.pdf
Size: 1.61 MB
Format: Adobe Portable Document Format

Collections

Softwaretechnik-Trends 42(4) - 2022