
Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups

dc.contributor.author: Treder-Tschechlov, Dennis
dc.contributor.author: Reimann, Peter
dc.contributor.author: Schwarz, Holger
dc.contributor.author: Mitschang, Bernhard
dc.contributor.editor: König-Ries, Birgitta
dc.contributor.editor: Scherzinger, Stefanie
dc.contributor.editor: Lehner, Wolfgang
dc.contributor.editor: Vossen, Gottfried
dc.date.accessioned: 2023-02-23T13:59:47Z
dc.date.available: 2023-02-23T13:59:47Z
dc.date.issued: 2023
dc.description.abstract: To benchmark novel classification algorithms, these algorithms should be evaluated on data with characteristics that also appear in real-world use cases. Important data characteristics that often lead to challenges for classification approaches are multi-class imbalance and heterogeneous groups. Real-world data that comprise these characteristics are usually not publicly available, e.g., because they contain sensitive patient information or are subject to privacy concerns. Further, the manifestations of the characteristics cannot be specifically controlled in real-world data. A more rigorous approach is to synthetically generate data such that different manifestations of the characteristics can be controlled. However, existing data generators are not able to generate data that feature both data characteristics, i.e., multi-class imbalance and heterogeneous groups. In this paper, we propose an approach that fills this gap, as it allows the synthetic generation of data that exhibit both characteristics. In particular, we make use of a taxonomy model that organizes real-world entities in domain-specific heterogeneous groups to generate data reflecting the characteristics of these groups. In addition, we incorporate probability distributions to reflect the imbalances of multiple classes and groups from real-world use cases. Our approach is applicable in different domains, as taxonomies are the simplest form of knowledge models and thus are available in many domains. The evaluation shows that our approach can generate data that feature the data characteristics multi-class imbalance and heterogeneous groups and that it allows different manifestations of these characteristics to be controlled. [en]
dc.identifier.doi: 10.18420/BTW2023-16
dc.identifier.isbn: 978-3-88579-725-8
dc.identifier.uri: https://dl.gi.de/handle/20.500.12116/40320
dc.language.iso: en
dc.publisher: Gesellschaft für Informatik e.V.
dc.relation.ispartof: BTW 2023
dc.relation.ispartofseries: Lecture Notes in Informatics (LNI) - Proceedings, Volume P-331
dc.subject: Machine Learning
dc.subject: Classification
dc.subject: Data Generation
dc.subject: Real-world Data Characteristics
dc.title: Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups [en]
dc.type: Text/Conference Paper
gi.citation.endPage: 351
gi.citation.publisherPlace: Bonn
gi.citation.startPage: 329
gi.conference.date: 6-10 March 2023
gi.conference.location: Dresden, Germany

Files

Original bundle
Name: B3-5.pdf
Size: 1.27 MB
Format: Adobe Portable Document Format