A Comparative Analysis on Machine Learning Techniques for Research Metadata: the ARDUOUS Case Study

Yadav, Dipendra; Tonkin, Emma; Stoev, Teodor; Yordanova, Kristina

dc.contributor.author	Yadav, Dipendra
dc.contributor.author	Tonkin, Emma
dc.contributor.author	Stoev, Teodor
dc.contributor.author	Yordanova, Kristina
dc.contributor.editor	Klein, Maike
dc.contributor.editor	Krupka, Daniel
dc.contributor.editor	Winter, Cornelia
dc.contributor.editor	Gergeleit, Martin
dc.contributor.editor	Martin, Ludger
dc.date.accessioned	2024-10-21T18:24:26Z
dc.date.available	2024-10-21T18:24:26Z
dc.date.issued	2024
dc.description.abstract	The rapid increase in research publications necessitates effective methods for organizing and analyzing large volumes of textual data. This study evaluates various combinations of embedding models, dimensionality reduction techniques, and clustering algorithms applied to metadata from papers accepted at the ARDUOUS (Annotation of useR Data for UbiquitOUs Systems) workshop over a period of 7 years. The analysis encompasses different types of keywords, including All Keywords (a comprehensive set of all extracted keywords), Multi-word Keywords (phrases consisting of two or more words), Existing Keywords (keywords already present in the metadata), and Single-word Keywords (individual words). The study found that the highest silhouette scores were achieved with 3, 4, and 5 clusters across all keyword types. Principal Component Analysis (PCA) and Independent Component Analysis (ICA) were identified as the most effective dimensionality reduction techniques, while DistilBERT embeddings consistently yielded high scores. Clustering algorithms such as k-means, k-medoids, and Gaussian Mixture Models (GMM) demonstrated robustness in forming well-defined clusters. These findings provide valuable insights into the main topics covered in the workshop papers and suggest optimal methodologies for analyzing research metadata, thereby enhancing the understanding of semantic relationships in textual data.	en
dc.identifier.doi	10.18420/inf2024_37
dc.identifier.eissn	2944-7682
dc.identifier.isbn	978-3-88579-746-3
dc.identifier.issn	2944-7682
dc.identifier.pissn	1617-5468
dc.identifier.uri	https://dl.gi.de/handle/20.500.12116/45196
dc.language.iso	en
dc.publisher	Gesellschaft für Informatik e.V.
dc.relation.ispartof	INFORMATIK 2024
dc.relation.ispartofseries	Lecture Notes in Informatics (LNI) - Proceedings, Volume P-352
dc.subject	Keyword Extraction
dc.subject	Clustering Techniques
dc.subject	Dimensionality Reduction
dc.subject	ARDUOUS Workshop
dc.subject	Natural Language Processing
dc.subject	Contextual Embeddings
dc.subject	Research Metadata Analysis
dc.title	A Comparative Analysis on Machine Learning Techniques for Research Metadata: the ARDUOUS Case Study	en
dc.type	Text/Conference Paper
gi.citation.endPage	509
gi.citation.publisherPlace	Bonn
gi.citation.startPage	499
gi.conference.date	24-26 September 2024
gi.conference.location	Wiesbaden
gi.conference.sessiontitle	8th International Workshop on Annotation of useR Data for UbiquitOUs Systems
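
The abstract reports that silhouette scores were used to compare clusterings and that 3, 4, and 5 clusters scored highest. The following is a minimal sketch of that kind of selection step, not the authors' implementation: it uses plain k-means on hand-written 2-D toy points that stand in for dimensionality-reduced keyword embeddings. The function names, the toy data, and the farthest-first initialization are all assumptions made for illustration.

```python
import math


def farthest_first_centers(points, k):
    """Pick k starting centers deterministically (illustrative choice)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute a silhouette of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy stand-in for reduced embeddings: three small, well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]

# Score each candidate cluster count and keep the best one.
scores = {k: silhouette_score(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real metadata the input would be embedding vectors (for example from DistilBERT) after a reduction step such as PCA or ICA, and the same loop over candidate cluster counts would apply to other algorithms named in the abstract, such as k-medoids or Gaussian Mixture Models.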

Files

Original bundle

Name: Yadav_et_al_A_Comparative_Analysis.pdf
Size: 3.12 MB
Format: Adobe Portable Document Format

Collections

P352 - INFORMATIK 2024 - Lock in or log out? Wie digitale Souveränität gelingt