Quality Indicators for Text Data

Author: Kiefer, Cornelia
Editors: Meyer, Holger; Ritter, Norbert; Thor, Andreas; Nicklas, Daniela; Heuer, Andreas; Klettke, Meike
Date: 2019-04-15
Year: 2019
ISBN: 978-3-88579-684-8
URL: https://dl.gi.de/handle/20.500.12116/21801
DOI: 10.18420/btw2019-ws-15
ISSN: 1617-5468
Language: en
Keywords: data quality; text data quality; text mining; text analysis; quality indicators for text data

Abstract: Textual data sets vary in quality. They differ in characteristics such as the average sentence length or the number of spelling mistakes and abbreviations. These text characteristics influence the quality of text mining results, and they can be measured automatically by means of quality indicators. We present indicators that we implemented based on natural language processing libraries such as Stanford CoreNLP and NLTK. We discuss design decisions in the implementation of exemplary indicators and provide all indicators on GitHub. In the evaluation, we investigate free texts from production, news, prose, tweets and chat data and show that the suggested indicators predict the quality of two text mining modules.
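
To make the idea of an automatically measured indicator concrete, here is a minimal sketch of two of the characteristics named in the abstract: average sentence length and the share of abbreviations. It is not the paper's implementation, which builds on Stanford CoreNLP and NLTK. It uses only the Python standard library, a naive regular-expression sentence splitter, and a small abbreviation list; the function names and the list are hypothetical and chosen for illustration only.

```python
import re

# Hypothetical abbreviation list, for illustration only; a real indicator
# would rely on a curated lexicon or an NLP library.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "approx.", "no."}


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace.

    Assumption: this is a rough stand-in for a proper sentence tokenizer.
    It will also split after abbreviations such as 'etc.'.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def avg_sentence_length(text):
    """Return the mean number of whitespace-separated tokens per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def abbreviation_ratio(text, abbreviations=ABBREVIATIONS):
    """Return the fraction of tokens that appear in the abbreviation list."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip(",;:()") in abbreviations)
    return hits / len(tokens)


if __name__ == "__main__":
    sample = "The pump failed. It was replaced today!"
    print(avg_sentence_length(sample))
    print(abbreviation_ratio("Check valve no. 3 approx. daily"))
```

Both functions return plain numbers, so they could be computed per document and compared across corpora such as news text and chat data. The naive splitter is the weakest part of the sketch: abbreviation-heavy text distorts the sentence count, which is one reason to use a tokenizer from an NLP library instead.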