Conference paper

Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups

Document type

Text/Conference Paper

Date

2023

Publisher

Gesellschaft für Informatik e.V.

Abstract

Novel classification algorithms should be benchmarked on data whose characteristics also appear in real-world use cases. Two data characteristics that often pose challenges for classification approaches are multi-class imbalance and heterogeneous groups. Real-world data that exhibit these characteristics are usually not publicly available, e.g., because they constitute sensitive patient information or because of privacy concerns. Furthermore, the manifestations of these characteristics cannot be specifically controlled in real-world data. A more rigorous approach is to generate data synthetically so that different manifestations of the characteristics can be controlled. However, existing data generators cannot generate data that feature both characteristics, i.e., multi-class imbalance and heterogeneous groups. In this paper, we propose an approach that fills this gap by synthetically generating data that exhibit both characteristics. In particular, we use a taxonomy model that organizes real-world entities into domain-specific heterogeneous groups to generate data reflecting the characteristics of these groups. In addition, we incorporate probability distributions to reflect the imbalances of multiple classes and groups found in real-world use cases. Our approach is applicable in different domains, as taxonomies are the simplest form of knowledge models and are thus available in many domains. The evaluation shows that our approach can generate data that feature both multi-class imbalance and heterogeneous groups, and that it allows different manifestations of these characteristics to be controlled.
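The general idea of combining a taxonomy with probability distributions can be illustrated with a minimal sketch. This is not the authors' generator: the taxonomy, the Zipf-like weighting, the Gaussian features, and every name below (`TAXONOMY`, `zipf_weights`, `generate`, `group_skew`, `class_skew`) are assumptions made purely for illustration. Heterogeneity is modeled here only as group-specific feature centres, and imbalance as skewed sampling weights over groups and over the classes within each group.

```python
import random
from collections import Counter

# Hypothetical two-level taxonomy: group -> classes (names are illustrative only).
TAXONOMY = {
    "engines": ["E1", "E2", "E3"],
    "gearboxes": ["G1", "G2"],
    "pumps": ["P1", "P2", "P3", "P4"],
}


def zipf_weights(n, s):
    """Unnormalized Zipf-like weights 1/rank**s; a larger s gives stronger imbalance."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def generate(n_samples, n_features=4, group_skew=1.0, class_skew=1.5, seed=0):
    """Return a list of (features, group, class) tuples drawn from the taxonomy."""
    rng = random.Random(seed)
    groups = list(TAXONOMY)
    group_w = zipf_weights(len(groups), group_skew)

    # Assumption: each group has its own feature centre, and each class is a
    # small random offset from the centre of the group it belongs to.
    group_centre = {
        g: [rng.uniform(-10, 10) for _ in range(n_features)] for g in groups
    }
    class_centre = {
        c: [m + rng.gauss(0, 1.0) for m in group_centre[g]]
        for g in groups
        for c in TAXONOMY[g]
    }

    rows = []
    for _ in range(n_samples):
        # First pick a group, then a class inside it, both with skewed weights.
        g = rng.choices(groups, weights=group_w)[0]
        classes = TAXONOMY[g]
        c = rng.choices(classes, weights=zipf_weights(len(classes), class_skew))[0]
        x = [rng.gauss(m, 0.5) for m in class_centre[c]]
        rows.append((x, g, c))
    return rows


data = generate(1000)
group_counts = Counter(g for _, g, _ in data)
class_counts = Counter(c for _, _, c in data)
```

In this sketch, `group_skew` and `class_skew` are the knobs that control how strongly groups and classes are imbalanced; setting either to 0 yields uniform sampling at that level.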

Description

Treder-Tschechlov, Dennis; Reimann, Peter; Schwarz, Holger; Mitschang, Bernhard (2023): Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups. BTW 2023. DOI: 10.18420/BTW2023-16. Bonn: Gesellschaft für Informatik e.V. ISBN: 978-3-88579-725-8. pp. 329-351. Dresden, Germany. March 6-10, 2023
