Quality Indicators for Text Data

Kiefer, Cornelia

Text document

Files

C2-5.pdf (199.5 KB)

Date

2019

Authors

Kiefer, Cornelia

Source

BTW 2019 – Workshopband

Workshop on Big (and Small) Data in Science and Humanities (BigDS 2019)

Publisher

Gesellschaft für Informatik, Bonn

Abstract

Textual data sets vary in quality. They differ in characteristics such as average sentence length or the number of spelling mistakes and abbreviations. These characteristics influence the quality of text mining results and can be measured automatically by means of quality indicators. We present indicators that we implemented using natural language processing libraries such as Stanford CoreNLP and NLTK. We discuss design decisions in the implementation of exemplary indicators and provide all indicators on GitHub. In the evaluation, we investigate free texts from production, news, prose, tweets, and chat data and show that the suggested indicators predict the quality of two text mining modules.
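
As a rough illustration of the kind of measurement the abstract describes, the following Python sketch computes two simple indicators: average sentence length and the share of tokens that are abbreviations. It is not the paper's implementation, which builds on Stanford CoreNLP and NLTK. The function names, the regex-based sentence splitter, and the small abbreviation list are all assumptions made for this example.

```python
import re

# Hypothetical abbreviation list, for illustration only (not from the paper).
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "approx.", "no."}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    This also splits after abbreviations such as 'no.', which a real
    sentence splitter from an NLP library would handle better.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def average_sentence_length(text):
    """Mean number of whitespace-separated tokens per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def abbreviation_ratio(text):
    """Fraction of whitespace-separated tokens found in ABBREVIATIONS."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in ABBREVIATIONS)
    return hits / len(tokens)
```

Both functions return a single number per text, so values can be compared across corpora such as news articles and chat logs. Swapping the naive splitter for a library tokenizer would change the absolute values but not the idea.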

Kiefer, Cornelia (2019): Quality Indicators for Text Data. BTW 2019 – Workshopband. DOI: 10.18420/btw2019-ws-15. Gesellschaft für Informatik, Bonn. PISSN: 1617-5468. ISBN: 978-3-88579-684-8. pp. 145-154. Workshop on Big (and Small) Data in Science and Humanities (BigDS 2019). Rostock, March 4-8, 2019.

Keywords

data quality, text data quality, text mining, text analysis, quality indicators for text data

DOI

10.18420/btw2019-ws-15

Collections

P290 - BTW2019 - Datenbanksysteme für Business, Technologie und Web - Workshopband
