To test the sensitivity of the feature representation models, they were trained on two different datasets. The first is the KIPRIS patent dataset; the second is the concatenation of the WoS and KIPRIS datasets, which results in a larger text corpus.
As the results in Table 1 suggest, the FastText mean document embedding methods generally yield better NMI, AMI, and ARI scores when trained on the larger corpus. Training on the second, less relevant corpus raised the scores across the metrics for all clustering approaches except the modified DEC, where the initial smaller corpus performed substantially better, particularly in terms of ARI. The weighted mean FastText document embedding, although harder to judge on this point, performed comparatively poorly against the simple mean FastText model; we therefore consider the weighted variant less sound for this task and do not recommend it. In comparison, for Doc2Vec the larger corpus yielded much better results in almost all clustering approaches, with the ARI difference reaching 0.13 for the modified DEC. Overall, Doc2Vec outperformed the FastText-based embeddings: the highest ARI among the FastText-based methods, 0.504, achieved with the modified DEC on the smaller corpus, is much lower than the 0.759 reached with Doc2Vec.

We observed that DBSCAN is unable to reach stable clustering results, as it fails to determine the correct number of clusters (k); this result is far from optimal and will not be discussed further. K-means and hierarchical agglomerative clustering are among the most popular methods in the field. Hierarchical agglomerative clustering can be very useful, as its dendrogram can support the choice of the number of clusters. In this task, however, it was outmatched by k-means, which yielded an ARI of 0.742 against 0.633 for hierarchical agglomerative clustering, although it is worth noting that k-means performance can be affected by its random initialization. The best clustering match in this study was achieved with the modified DEC, which reached an ARI of 0.759. We also observe that further normalization of the inputs to the modified DEC, denoted by "DEC (scaled)" in Table 1, degraded performance significantly.
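To make the two FastText document-embedding variants and the reported metrics concrete, the sketch below builds a simple mean and a weighted mean embedding with gensim and scores a k-means clustering with ARI, NMI, and AMI from scikit-learn. The toy corpus, the hyper-parameters, and the choice of IDF as the weighting scheme are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of mean vs. weighted-mean FastText document embeddings.
# Corpus, hyper-parameters, and IDF weighting are assumptions for illustration.
import numpy as np
from gensim.models import FastText
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

docs = [["patent", "semiconductor", "wafer"],
        ["semiconductor", "etching", "process"],
        ["neural", "network", "training"],
        ["deep", "network", "embedding"]]
true_labels = [0, 0, 1, 1]          # toy ground truth for the metrics

ft = FastText(sentences=docs, vector_size=50, window=3, min_count=1, epochs=20)

# Simple mean document embedding: average the FastText word vectors.
mean_emb = np.array([np.mean([ft.wv[w] for w in d], axis=0) for d in docs])

# Weighted mean variant: here each word vector is weighted by its IDF.
tfidf = TfidfVectorizer(analyzer=lambda d: d).fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
weighted_emb = np.array([
    np.average([ft.wv[w] for w in d], axis=0, weights=[idf[w] for w in d])
    for d in docs])

# Cluster one variant and score it with the three metrics from Table 1.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mean_emb)
print("ARI:", adjusted_rand_score(true_labels, pred))
print("NMI:", normalized_mutual_info_score(true_labels, pred))
print("AMI:", adjusted_mutual_info_score(true_labels, pred))
```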
In addition to Doc2Vec and FastText, pre-trained BERT embeddings were also put to the test to observe how these state-of-the-art models perform on the clustering task. However, using these models without fine-tuning or further training resulted in lower-than-expected clustering performance in this study.
As observed in Table 2, the best result was achieved with the pre-trained uncased BERT A12 model, while SciBERT is unable to reach an ARI of 0.2. As SciBERT was trained on full-document corpora rather than abstracts or short texts, this behavior can be expected. BERT pre-trained embeddings are regarded as general language representations (Devlin et al., 2018).
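For reference, the following is a minimal sketch of how pre-trained BERT embeddings can be extracted for clustering without any fine-tuning, in the spirit of the experiments behind Table 2. The bert-base-uncased checkpoint and the attention-mask mean pooling are assumptions for illustration; they are not necessarily the exact "uncased A12" configuration or pooling strategy used here.

```python
# Sketch: frozen pre-trained BERT embeddings fed to a clustering algorithm.
# Checkpoint name and mean-pooling strategy are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

abstracts = ["A method for etching semiconductor wafers.",
             "Training deep neural networks with embeddings."]

with torch.no_grad():
    batch = tokenizer(abstracts, padding=True, truncation=True,
                      return_tensors="pt")
    hidden = model(**batch).last_hidden_state       # (batch, seq_len, 768)
    # Mean-pool over real tokens only, masking out padding positions.
    mask = batch["attention_mask"].unsqueeze(-1)    # (batch, seq_len, 1)
    doc_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

pred = KMeans(n_clusters=2, n_init=10,
              random_state=0).fit_predict(doc_emb.numpy())
```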
Hence, BERT and similar representations encode many general dimensions that make these embeddings highly competitive across a variety of tasks, especially supervised tasks where the representations can be fully exploited. Conversely, the same representations are more complex and therefore less suitable for clustering, or more generally for narrowly specified tasks. In this study, where the objective is to cluster documents by topic, it is understandable that such general embeddings fall short; extracting features for better task-specific representations should be preferred in such tasks. All in all, it can be concluded that the shortcomings of the BERT models in this task derive not only from the number of features (the dimensionality of the vectors) and the resulting difficulty of clustering, but also from the complexity, or in this case the generality, of the representations learned from the (pre)training corpora. Even though the results are suboptimal, the clustering algorithms behave consistently: DEC remains superior to k-means and hierarchical agglomerative clustering, while DBSCAN cannot perform any sort of clustering on these vectors.
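For completeness, the sketch below outlines the core of a DEC-style objective: soft Student's-t assignments between embedded points and cluster centroids, sharpened by a target distribution and trained with a KL-divergence loss. The encoder architecture, the per-step target update, and the random stand-in data are illustrative assumptions and do not reproduce our modified DEC.

```python
# Core of a DEC-style objective (after Xie et al.'s original formulation).
# Encoder size, schedule, and data are assumptions, not our modified DEC.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 50)                  # stand-in document embeddings
n_clusters, latent = 4, 10

encoder = nn.Sequential(nn.Linear(50, 32), nn.ReLU(), nn.Linear(32, latent))
centroids = nn.Parameter(torch.randn(n_clusters, latent))
opt = torch.optim.Adam(list(encoder.parameters()) + [centroids], lr=1e-3)

def soft_assign(z):
    # Student's t kernel: q_ij proportional to (1 + ||z_i - mu_j||^2)^-1
    d2 = torch.cdist(z, centroids) ** 2
    q = 1.0 / (1.0 + d2)
    return q / q.sum(dim=1, keepdim=True)

for step in range(100):
    q = soft_assign(encoder(X))
    # Target distribution sharpens confident assignments: p_ij ~ q_ij^2 / f_j.
    # DEC updates the target periodically; recomputing each step simplifies.
    p = (q ** 2 / q.sum(dim=0)).detach()
    p = p / p.sum(dim=1, keepdim=True)
    loss = torch.sum(p * torch.log(p / q))   # KL(P || Q)
    opt.zero_grad()
    loss.backward()
    opt.step()

labels = soft_assign(encoder(X)).argmax(dim=1)
```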