Purpose: In this paper, we construct a heterogeneous graph from the citation relations and basic information of papers, centered on "Paper mills" papers under retraction observation, and we train graph neural network models and classifiers on this graph to classify paper nodes.
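The heterogeneous graph described above can be sketched as typed nodes connected by typed edges. The node and edge types below (paper/author/journal, cites/written_by/published_in) are illustrative assumptions about what "basic information" might include, not the authors' exact schema:

```python
from collections import defaultdict

def build_hetero_graph(papers):
    """Build a heterogeneous graph as a mapping from edge type
    (src_type, relation, dst_type) to a list of (src, dst) pairs.

    papers: list of dicts with keys 'id', 'authors', 'journal', 'cites'.
    """
    edges = defaultdict(list)
    for p in papers:
        # Citation relations between papers
        for ref in p.get("cites", []):
            edges[("paper", "cites", "paper")].append((p["id"], ref))
        # Basic-information relations (illustrative)
        for a in p.get("authors", []):
            edges[("paper", "written_by", "author")].append((p["id"], a))
        if "journal" in p:
            edges[("paper", "published_in", "journal")].append((p["id"], p["journal"]))
    return dict(edges)

papers = [
    {"id": "p1", "authors": ["a1"], "journal": "j1", "cites": ["p2"]},
    {"id": "p2", "authors": ["a1", "a2"], "journal": "j1", "cites": []},
]
g = build_hetero_graph(papers)
```

A graph neural network library (e.g. one supporting heterogeneous message passing) would then consume these typed edge lists to learn node representations.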
Design/methodology/approach: Our proposed citation-network-based "Paper mills" detection model (PDCN for short) integrates textual features extracted from paper titles with the BERT model and structural features obtained by analyzing the heterogeneous graph with a heterogeneous graph attention network. These features are then classified with a LightGBM classifier to identify "Paper mills" papers.
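The fusion step described above can be sketched as concatenating a per-paper text embedding with a per-paper structural embedding before classification. The toy embeddings and the stand-in threshold "classifier" below are placeholders for the BERT, HAN, and LightGBM components, not the authors' actual implementation:

```python
def fuse_features(text_emb, struct_emb):
    """Concatenate a paper's text embedding with its structural embedding."""
    return text_emb + struct_emb  # list concatenation

def toy_classifier(features, threshold=0.5):
    """Stand-in for the LightGBM classifier: flags a paper when the
    mean fused-feature value exceeds a threshold (illustrative only)."""
    score = sum(features) / len(features)
    return 1 if score > threshold else 0

text_emb = [0.9, 0.8]    # e.g. pooled BERT title embedding (toy values)
struct_emb = [0.7, 0.6]  # e.g. HAN node embedding (toy values)
fused = fuse_features(text_emb, struct_emb)
label = toy_classifier(fused)  # 1 = flagged as a paper-mill candidate
```

In practice the fused vectors for all labeled paper nodes would form the training matrix for a gradient-boosting model such as LightGBM.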
Findings: On our custom dataset, the PDCN model achieves an accuracy of 81.85% and an F1-score of 80.49% on the "Paper mills" detection task, a significant improvement over several baseline models.
Research limitations: We used only the article title as the textual feature and did not extract features from the full text of each article.
Practical implications: The PDCN model can effectively identify "Paper mills" papers and is suitable for automated "Paper mills" detection during the review process.
Originality/value: We incorporated both text and citation detection into the “Paper mills” identification process. Additionally, the PDCN model offers a basis for judgment and scientific guidance in recognizing “Paper mills” papers.
Jun Zhang, Jianhua Liu, Haihong E, Tianyi Hu, Xiaodong Qiao, ZiChen Tang. A paper mill detection model based on citation manipulation paradigm[J]. Journal of Data and Information Science. DOI: 10.2478/jdis-2025-0003