A comparative empirical evaluation of semantic clustering algorithms on static word embeddings - TUdományos DOkumentumok Közös Keresője

A comparative empirical evaluation of semantic clustering algorithms on static word embeddings

Metadata

Content:	https://unipub.lib.uni-corvinus.hu/12636/
Archive:	Corvinus Kutatások
Collection:	Status = Published; Subject = Computer science; Type = Article
Title:	A comparative empirical evaluation of semantic clustering algorithms on static word embeddings
Creator:	Asemi, Asefeh; Kiani Shahvandy, Rajab; Houshangi, Mahdi
Publisher:	Elsevier
Date:	2026
Subject:	Computer science
Description:	Objective: This study conducts a comprehensive empirical evaluation of semantic clustering algorithms to identify the most effective approach for automatically organizing and extracting meaning from textual data. By systematically comparing the performance of K-means, K-medoids, and DBSCAN on word embeddings from GloVe and Wiki models, it provides data-driven insights for optimizing Natural Language Processing (NLP) pipelines in information management systems. The research suggests a practical framework for selecting clustering algorithms and embedding models based on specific operational objectives, such as document clustering, knowledge base construction, and content-based recommendation.
Design/Methodology/Approach: The investigation employed a two-phase methodology. In the first phase, predefined word lists were transformed into numerical vectors using pre-trained GloVe and Wiki models. K-means, K-medoids, and DBSCAN were then applied, with performance evaluated via the Silhouette Score and the Davies-Bouldin Index, complemented by Principal Component Analysis (PCA) for visualization. Results were benchmarked against manually curated semantic groupings. In the second phase, the findings were validated on a large-scale corpus of 303 research articles to assess scalability and real-world applicability.
Results/Discussion: Under the evaluated configurations, K-means combined with GloVe embeddings produced comparatively higher semantic coherence and more interpretable cluster structures than the alternative methods considered. K-medoids demonstrated robustness against outliers but yielded less compact groupings. While DBSCAN proved effective for outlier identification, it consistently underperformed in forming semantically meaningful clusters. The GloVe model significantly outperformed Wiki embeddings in generating precise and interpretable clusters, whereas Wiki produced broader, less distinct groupings. Large-scale validation confirmed these results, with K-means successfully identifying dominant research themes in a corpus of academic literature, including digital library adoption (43.2%), reference services (15.2%), and research data management (8.9%). Under the evaluated corpus characteristics and parameter settings, DBSCAN classified most documents as outliers, indicating limited suitability for this specific balanced document collection.
Conclusions: K-means and K-medoids emerge as comparatively effective algorithms under the evaluated conditions. The study underscores the critical influence of vector representation models, with GloVe embeddings providing superior semantic distinction compared to Wiki. These findings offer clear, actionable guidance for selecting clustering methods in NLP applications, highlighting the necessity of aligning algorithmic choice with specific dataset characteristics and information management goals.
Originality/Value: This research moves beyond theoretical descriptions by delivering a rigorous, empirical comparison that elucidates the crucial interaction between algorithm selection and embedding models for semantic tasks. The findings provide practitioners with a context-dependent decision matrix: K-means with GloVe is effective under the studied conditions for taxonomy development and thematic categorization, whereas DBSCAN is preferable for outlier detection in noisy data. By demonstrating that GloVe's global statistical approach yields more distinct clusters than Wiki's contextual model for this purpose, the study contributes a practical, evidence-based framework for enhancing semantic analysis in real-world information systems. © 2026 The Authors.
Language:	English
Type:	Article; PeerReviewed
Format:	application/pdf
Identifier:	https://unipub.lib.uni-corvinus.hu/12636/1/1-s2.0-S2667096826000091-main.pdf
	Asemi, Asefeh (ORCID: https://orcid.org/0000-0003-1667-4408), Kiani Shahvandy, Rajab and Houshangi, Mahdi (ORCID: https://orcid.org/0000-0002-5406-1162) (2026) A comparative empirical evaluation of semantic clustering algorithms on static word embeddings. International Journal of Information Management Data Insights, 6 (1). DOI 10.1016/j.jjimei.2026.100396 <https://doi.org/10.1016/j.jjimei.2026.100396>
Relation:	https://unipub.lib.uni-corvinus.hu/12636/ https://doi.org/10.1016/j.jjimei.2026.100396
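
The abstract names the Silhouette Score and the Davies-Bouldin Index as the measures used to compare clusterings. As a rough illustration of what those two measures compute, the following is a minimal pure-Python sketch. It is not the authors' code and uses none of their data: the six 2-D points stand in for embedding vectors, and the function names and the two label assignments (`good`, `mixed`) are hypothetical choices made for this example.

```python
import math


def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_score(points, labels):
    """Mean silhouette coefficient (range -1..1, higher is better).

    Assumes at least two clusters and no cluster made only of duplicates.
    """
    clusters = group_by_label(points, labels)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin_index(points, labels):
    """Davies-Bouldin index (>= 0, lower is better)."""
    clusters = group_by_label(points, labels)
    centroids = {
        key: tuple(sum(axis) / len(members) for axis in zip(*members))
        for key, members in clusters.items()
    }
    # scatter: mean distance of a cluster's members to its centroid
    scatter = {
        key: sum(math.dist(p, centroids[key]) for p in members) / len(members)
        for key, members in clusters.items()
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(clusters)


# Toy stand-ins for embedding vectors: two visibly separate groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]   # labels that follow the two groups
mixed = [0, 1, 0, 1, 0, 1]  # labels that cut across the groups

for name, labels in (("good", good), ("mixed", mixed)):
    print(name, silhouette_score(points, labels), davies_bouldin_index(points, labels))
```

On this toy input the labelling that follows the two groups scores a higher silhouette and a lower Davies-Bouldin value than the mixed labelling, which is the direction in which both measures are read when comparing algorithms.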