Title: Measuring Gender Bias in German Language Generation
Authors: Kraft, Angelie; Zorn, Hans-Peter; Fecht, Pascal; Simon, Judith; Biemann, Chris; Usbeck, Ricardo
Editors: Demmler, Daniel; Krupka, Daniel; Federrath, Hannes
Date: 2022-09-28 (published 2022)
ISBN: 978-3-88579-720-3
ISSN: 1617-5468
DOI: 10.18420/inf2022_108
URI: https://dl.gi.de/handle/20.500.12116/39481
Language: en
Keywords: gender bias; stereotypes; regard; natural language generation; gpt-2; gpt-3; german

Abstract: Most existing methods to measure social bias in natural language generation are specified for English language models. In this work, we developed a German regard classifier based on a newly crowd-sourced dataset. Our model meets the test set accuracy of the original English version. With the classifier, we measured binary gender bias in two large language models. The results indicate a positive bias toward female subjects for a German version of GPT-2 and similar tendencies for GPT-3. Yet, upon qualitative analysis, we found that positive regard partly corresponds to sexist stereotypes. Our findings suggest that the regard classifier should not be used as a single measure but, instead, combined with more qualitative analyses.