YAWN: A Semantically Annotated Wikipedia XML Corpus

Schenkel, Ralf; Suchanek, Fabian; Kasneci, Gjergji; Kemper, Alfons; Schöning, Harald; Rose, Thomas; Jarke, Matthias; Seidl, Thomas; Quix, Christoph; Brochhaus, Christoph

2020-02-11
2007
ISBN: 978-3-88579-197-3
https://dl.gi.de/handle/20.500.12116/31804

The paper presents YAWN, a system that converts the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms that annotate pages and links with concepts from the WordNet thesaurus. The annotation process exploits the category information in Wikipedia, which is a high-quality, manually assigned source of information; it also extracts additional information from lists and utilizes the invocations of templates with named parameters. We give examples of how such annotations can be exploited for high-precision queries.

en
Text/Conference Paper
ISSN: 1617-5468
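
The category-based annotation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `CATEGORY_TO_CONCEPT` table is a hypothetical stand-in for a real WordNet lookup, and the function name, the nested-tag layout, the `source` attribute, and the sample page are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a WordNet lookup: Wikipedia category -> concept tag.
CATEGORY_TO_CONCEPT = {
    "German physicists": "physicist",
    "Nobel laureates in Physics": "laureate",
}


def annotate_page(title, categories, text):
    """Wrap the page text in one nested element per category that maps to a concept."""
    article = ET.Element("article", {"title": title})
    parent = article
    for category in categories:
        concept = CATEGORY_TO_CONCEPT.get(category)
        if concept is None:
            continue  # no known concept for this category; skip it
        # Each mapped category becomes a self-explaining tag around the content.
        parent = ET.SubElement(parent, concept, {"source": category})
    body = ET.SubElement(parent, "body")
    body.text = text
    return article


page = annotate_page(
    "Max Planck",
    ["German physicists", "1858 births", "Nobel laureates in Physics"],
    "Max Planck was a German theoretical physicist.",
)
xml_string = ET.tostring(page, encoding="unicode")
print(xml_string)
```

With tags of this kind, a structural query such as the path `physicist/laureate/body` selects only pages annotated with both concepts, which is the sort of high-precision query the abstract refers to.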