Logo des Repositoriums
 
Konferenzbeitrag

Improving web page classification by integrating neighboring pages via a topic

Lade...
Vorschaubild

Volltext URI

Dokumententyp

Text/Conference Paper

Zusatzinformation

Datum

2010

Zeitschriftentitel

ISSN der Zeitschrift

Bandtitel

Verlag

Gesellschaft für Informatik e.V.

Zusammenfassung

This paper applies a topic model to represent the feature space for learning the Web page classification model. Latent Dirichlet Allocation (LDA) algorithm is applied to generate a probabilistic topic model consisting of term features clustered into a set of latent topics. Words assigned into the same topic are semantically related. In addition, we propose a method to integrate the additional term features obtained from neighboring pages (i.e., parent and child pages) to further improve the performance of the classification model. In the experiments, we evaluated among three different feature representations: (1) applying the simple BOW model, (2) applying the topic model on current page, and (3) integrating the neighboring pages via the topic model. From the experimental results, the approach of integrating current page with the neighboring pages via the topic model yielded the best performance with the F1 measure of 84.51%; an improvement of 23.31% over the BOW model.

Beschreibung

Sriurai, Wongkot; Meesad, Phayung; Haruechaiyasak, Choochart (2010): Improving web page classification by integrating neighboring pages via a topic. 10th International Conferenceon Innovative Internet Community Systems (I2CS) – Jubilee Edition 2010 –. Bonn: Gesellschaft für Informatik e.V.. PISSN: 1617-5468. ISBN: 978-3-88579-259-8. pp. 238-246. Regular Research Papers. Bangkok, Thailand. June 3-5, 2010

Schlagwörter

Zitierform

DOI

Tags