Implementation of DBSCAN and Latent Dirichlet Allocation for Topic Modeling of Undergraduate Theses at the Faculty of Computer Science, Universitas Jember
Abstract
A thesis is part of the “Tri Dharma of Higher Education” in Indonesia and must be completed by a student before graduating. Despite their importance, theses are often not organized by topic in a way that maximizes users' access to them. Previous research has shown that DBSCAN handles noise and clusters text data well, and that Latent Dirichlet Allocation is effective at extracting latent topics from a collection of documents. This study aims to identify the best epsilon (eps) and minimum points (minpts) values for clustering thesis topics with DBSCAN, based on analysis and silhouette scores, and to analyze thesis topics through Latent Dirichlet Allocation. The experimental scenarios use two measurement metrics, Euclidean Distance and Cosine Similarity. The research proceeds through data collection, data selection, data pre-processing, term weighting, clustering experiments and evaluation, and topic modeling, and ends with an analysis of topics, clusters, and noise. The best clustering is obtained at an epsilon of 0.6 with 2 minimum points, which produces 81 clusters and 143 noise points. This clustering reaches a Silhouette Score of 0.095, a Dunn Index of 0.595, and an average Coherence Score of 0.385. Topic modeling reveals several thesis topics, with the greatest interest in Machine Learning, Software Construction, and IT/IS Evaluation. The experimental results show that the best clustering has a silhouette score close to 0. This is because the theses come from the same scientific field, which causes the data to follow a leptokurtic distribution. Clustering with DBSCAN can group the data specifically by research objective, method, and object. Consequently, some documents are labeled as noise not because they are poorly written, but because no sufficiently similar documents exist to group them with.
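To make the roles of eps and minpts concrete, the following is a minimal pure-Python sketch of DBSCAN under stated assumptions, not the code used in this study. The toy vectors stand in for term-weighted documents, the names `cosine_distance`, `euclidean_distance`, and `dbscan` are hypothetical, and Cosine Similarity is assumed to be turned into a distance as one minus the similarity. A point counts itself among its neighbors, which is also an assumption.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]
NOISE = -1  # label given to documents that join no cluster


def cosine_distance(a: Vector, b: Vector) -> float:
    """Assumed conversion: 1 - cosine similarity (1.0 if a vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Straight-line distance between two vectors."""
    return math.dist(a, b)


def dbscan(points: List[Vector], eps: float, min_pts: int,
           dist: Callable[[Vector, Vector], float]) -> List[int]:
    """Return one cluster label per point; NOISE marks unclustered points."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means not yet visited

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return [label if label is not None else NOISE for label in labels]


# Toy term-weight vectors (hypothetical): two pairs of similar documents
# and one document that shares no terms with the others.
docs = [
    [1.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.8, 0.0],
    [0.0, 0.0, 0.9, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

labels_cosine = dbscan(docs, eps=0.6, min_pts=2, dist=cosine_distance)
labels_euclid = dbscan(docs, eps=0.6, min_pts=2, dist=euclidean_distance)
print(labels_cosine, labels_euclid)
```

In this toy data the last document is labeled as noise only because no other document lies within eps of it, which mirrors the observation above that noise reflects a lack of similar documents rather than poor writing.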