A paper mill detection model based on citation manipulation paradigm

  • 1. School of Computer Science (National Model Software School), Beijing University of Posts and Telecommunications, Beijing 100876, China
  • 2. Beijing Wanfang Data Co., Ltd., Beijing 100080, China
Haihong E (Email: ehaihong@bupt.edu.cn).

Received date: 2024-05-10

Revised date: 2024-10-08

Accepted date: 2024-11-05

Online published: 2025-01-03

Abstract

Purpose: We build a heterogeneous graph from the citation relations and basic information of papers, centered on the “paper mill” papers under withdrawal observation, and train graph neural network models and classifiers on this graph to classify paper nodes.
Design/methodology/approach: The proposed citation-network-based “paper mill” detection model (PDCN for short) combines textual features, extracted from paper titles with the BERT model, with structural features, extracted from the heterogeneous graph with the heterogeneous graph attention network model. LGBM classifiers then use these features to identify “paper mill” papers.
Findings: On our custom dataset, the PDCN model achieves an accuracy of 81.85% and an F1-score of 80.49% on the “paper mill” detection task, a significant improvement over several baseline models.
Research limitations: We used only article titles as text features and did not extract features from the full text of the articles.
Practical implications: The PDCN model can effectively identify “paper mill” papers and is suitable for automated “paper mill” detection during the review process.
Originality/value: We incorporate both text-based and citation-based detection into the “paper mill” identification process. The PDCN model also offers a basis for judgment and scientific guidance in recognizing “paper mill” papers.
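
The feature combination and the two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the text and structural features are merged, so concatenation is assumed here; the vectors are made-up toy values rather than BERT or graph attention outputs; `toy_classifier` is a hypothetical threshold rule standing in for the LGBM classifiers; and F1 is computed for the positive (“paper mill”) class only, which may differ from the averaging used in the paper.

```python
from typing import List, Sequence, Tuple


def fuse_features(text_vec: Sequence[float], struct_vec: Sequence[float]) -> List[float]:
    """Concatenate a title embedding with a graph-structure embedding.

    Concatenation is an assumption; the abstract only says the two
    feature types are integrated.
    """
    return list(text_vec) + list(struct_vec)


def toy_classifier(features: Sequence[float]) -> int:
    """Hypothetical stand-in for the LGBM classifiers: threshold on the mean."""
    return 1 if sum(features) / len(features) > 0.5 else 0


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1 of the positive class) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy values only: four paper nodes, label 1 = "paper mill" paper.
text_vecs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]
struct_vecs = [[0.8], [0.1], [0.2], [0.3]]
labels = [1, 0, 1, 0]

fused = [fuse_features(t, s) for t, s in zip(text_vecs, struct_vecs)]
preds = [toy_classifier(x) for x in fused]
accuracy, f1 = accuracy_and_f1(labels, preds)
print(f"accuracy={accuracy:.4f}, f1={f1:.4f}")
```

On this toy data the stand-in rule misses one positive paper, so accuracy and F1 diverge. This is why the two figures are reported separately in the findings.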

Cite this article

Jun Zhang, Jianhua Liu, Haihong E, Tianyi Hu, Xiaodong Qiao, ZiChen Tang. A paper mill detection model based on citation manipulation paradigm[J]. Journal of Data and Information Science, 0: 20250103-20250103. DOI: 10.2478/jdis-2025-0003
