Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups

Treder-Tschechlov, Dennis; Reimann, Peter; Schwarz, Holger; Mitschang, Bernhard

dc.contributor.author	Treder-Tschechlov, Dennis
dc.contributor.author	Reimann, Peter
dc.contributor.author	Schwarz, Holger
dc.contributor.author	Mitschang, Bernhard
dc.contributor.editor	König-Ries, Birgitta
dc.contributor.editor	Scherzinger, Stefanie
dc.contributor.editor	Lehner, Wolfgang
dc.contributor.editor	Vossen, Gottfried
dc.date.accessioned	2023-02-23T13:59:47Z
dc.date.available	2023-02-23T13:59:47Z
dc.date.issued	2023
dc.description.abstract	To benchmark novel classification algorithms, these algorithms should be evaluated on data with characteristics that also appear in real-world use cases. Important data characteristics that often lead to challenges for classification approaches are multi-class imbalance and heterogeneous groups. Real-world data that exhibit these characteristics are usually not publicly available, e.g., because they contain sensitive patient information or because of other privacy concerns. Further, the manifestations of the characteristics cannot be controlled specifically on real-world data. A more rigorous approach is to synthetically generate data such that different manifestations of the characteristics can be controlled. However, existing data generators are not able to generate data that feature both data characteristics, i.e., multi-class imbalance and heterogeneous groups. In this paper, we propose an approach that fills this gap, as it allows data that exhibit both characteristics to be generated synthetically. In particular, we make use of a taxonomy model that organizes real-world entities in domain-specific heterogeneous groups to generate data reflecting the characteristics of these groups. In addition, we incorporate probability distributions to reflect the imbalances of multiple classes and groups from real-world use cases. Our approach is applicable in different domains, as taxonomies are the simplest form of knowledge models and thus are available in many domains. The evaluation shows that our approach can generate data that feature the data characteristics multi-class imbalance and heterogeneous groups and that it allows different manifestations of these characteristics to be controlled.	en
dc.identifier.doi	10.18420/BTW2023-16
dc.identifier.isbn	978-3-88579-725-8
dc.identifier.uri	https://dl.gi.de/handle/20.500.12116/40320
dc.language.iso	en
dc.publisher	Gesellschaft für Informatik e.V.
dc.relation.ispartof	BTW 2023
dc.relation.ispartofseries	Lecture Notes in Informatics (LNI) - Proceedings, Volume P-331
dc.subject	Machine Learning
dc.subject	Classification
dc.subject	Data Generation
dc.subject	Real-world Data Characteristics
dc.title	Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups	en
dc.type	Text/Conference Paper
gi.citation.endPage	351
gi.citation.publisherPlace	Bonn
gi.citation.startPage	329
gi.conference.date	6-10 March 2023
gi.conference.location	Dresden, Germany

Files

Original bundle

Name: B3-5.pdf
Size: 1.27 MB
Format: Adobe Portable Document Format

Collections

P331 - BTW2023 - Datenbanksysteme für Business, Technologie und Web