Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups

Novel classification algorithms should be benchmarked on data whose characteristics also appear in real-world use cases. Two characteristics that often challenge classification approaches are multi-class imbalance and heterogeneous groups. Real-world data that exhibit these characteristics are usually not publicly available, e.g., because they contain sensitive patient information or raise other privacy concerns. Furthermore, the manifestations of these characteristics cannot be controlled in real-world data. A more rigorous approach is to generate data synthetically so that different manifestations of the characteristics can be controlled. However, existing data generators cannot generate data that feature both characteristics, i.e., multi-class imbalance and heterogeneous groups. In this paper, we propose an approach that fills this gap by synthetically generating data that exhibit both characteristics. In particular, we use a taxonomy model that organizes real-world entities into domain-specific heterogeneous groups to generate data reflecting the characteristics of these groups. In addition, we incorporate probability distributions to reflect the imbalances of multiple classes and groups found in real-world use cases. Our approach is applicable in different domains, as taxonomies are the simplest form of knowledge models and are thus available in many domains. The evaluation shows that our approach can generate data that feature multi-class imbalance and heterogeneous groups and that it allows different manifestations of these characteristics to be controlled.
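The idea of combining a taxonomy with probability distributions can be illustrated with a minimal sketch. This is not the authors' generator: the two-level taxonomy, the weights, the Gaussian feature model, and all names (`TAXONOMY`, `generate`) are assumptions made purely for illustration. Each sample first draws a group, then a class within that group, and finally features around a class center that lies close to its group's center.

```python
import random

# Hypothetical two-level taxonomy (illustrative values only):
# group -> (group weight, {class name: class weight within the group}).
TAXONOMY = {
    "group_a": (0.7, {"a1": 0.80, "a2": 0.15, "a3": 0.05}),
    "group_b": (0.3, {"b1": 0.60, "b2": 0.40}),
}


def generate(n, taxonomy, n_features=2, seed=0):
    """Return n samples as (features, group, class) tuples."""
    rng = random.Random(seed)
    groups = list(taxonomy)
    group_weights = [taxonomy[g][0] for g in groups]

    # Assumed feature model: every group has its own center, and each
    # class center is a small random shift of its group's center.
    group_centers = {
        g: [rng.uniform(-10, 10) for _ in range(n_features)] for g in groups
    }
    class_centers = {}
    for g in groups:
        for c in taxonomy[g][1]:
            class_centers[c] = [m + rng.gauss(0, 1) for m in group_centers[g]]

    samples = []
    for _ in range(n):
        # Draw the group first, then a class inside that group, so the
        # imbalance is controlled on both levels of the taxonomy.
        g = rng.choices(groups, weights=group_weights)[0]
        classes = list(taxonomy[g][1])
        class_weights = [taxonomy[g][1][c] for c in classes]
        c = rng.choices(classes, weights=class_weights)[0]
        x = [rng.gauss(mu, 0.5) for mu in class_centers[c]]
        samples.append((x, g, c))
    return samples


samples = generate(2000, TAXONOMY)
```

Changing the weights in `TAXONOMY` changes how strongly groups and classes are imbalanced, which mirrors, in a very simplified way, the controllability the abstract describes.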

Treder-Tschechlov, Dennis; Reimann, Peter; Schwarz, Holger; Mitschang, Bernhard (2023): Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups. BTW 2023. DOI: 10.18420/BTW2023-16. Bonn: Gesellschaft für Informatik e.V. ISBN: 978-3-88579-725-8. pp. 329-351. Dresden, Germany. March 6-10, 2023

Keywords

Machine Learning, Classification, Data Generation, Real-world Data Characteristics

DOI

10.18420/BTW2023-16

Collections

P331 - BTW2023 - Datenbanksysteme für Business, Technologie und Web
