1 Introduction
1.1 Current situation of “Paper mills”
1.2 Current difficulties in detecting “Paper mills”
1.3 Our contribution
2 Related work
2.1 Manual discovery of “Paper mills” features
2.2 Automatic identification of “Paper mills”
2.3 Shortcomings and improvements of related work
3 Pattern discovery and feature mining
3.1 Citation pattern classification
Figure 1. Three “Paper mills” citation patterns.
3.2 Meta-path design
Figure 2. The process of constructing meta-paths from heterogeneous graphs.
3.3 Matching of citation patterns and meta-paths
Table 1. Matching table for citation patterns and meta-paths.
Pattern Classification | Specific modalities | Meta-paths | |||||
---|---|---|---|---|---|---|---|
1-PP* | 2-PPP | 3-PBP | 4-PJP | 5-PAuP | 6-PUP | ||
Manipulation of References | Circular citation of papers in a journal | ![]() | ![]() | ||||
Cross referencing in a journal | ![]() | ![]() | ![]() | ||||
Cross-referencing of papers between journals | ![]() | ![]() | ![]() | ||||
Papers within journals citing the same papers | ![]() | ![]() | ![]() | ||||
Irrelevant Citations | Citing papers that are not relevant to the topic | ![]() | |||||
Carrying citations directly from cited references | ![]() | ![]() | |||||
Aggregation of Cited Papers | Same publisher | ![]() | ![]() | ||||
Same journal | ![]() | ![]() | |||||
Same academic society | ![]() | ![]() |
Note*: The 1-PP meta-path represents a direct citation between papers, so it matches every specific pattern.
4 Methodology
Figure 3. The framework of our proposed PDCN model.
4.1 Text feature extraction module
4.2 Heterogeneous graph attention network module
Figure 4. Hierarchical attention structure.
4.2.1 Node-level attention
4.2.2 Semantic-level attention
4.2.3 Large-graph training methods
4.3 Classifier module
4.3.1 Gradient-based one-side sampling
4.3.2 Exclusive feature bundling
5 Data source and collection
5.1 Dataset acquisition
Figure 5. Dataset construction process.
Figure 6. “Paper mills” cases.
5.2 Data processing methods
Table 2. Heterogeneous graph dataset details.
Edge (A-B) | Num of A | Num of B | Meta-path | Num of Meta-path | Feature dimension | Training set | Validation set | Testing set |
---|---|---|---|---|---|---|---|---|
Paper-Paper | 25,900 | 25,900 | PP | 549,452 | 768 | 14,258 | 3,872 | 7,770 |
Paper-Paper | 25,900 | 25,900 | PPP | 1,339,889 | | | | |
Paper-Journal | 25,900 | 3,226 | PJP | 2,742,868 | | | | |
Paper-Publisher | 25,900 | 285 | PBP | 81,173,164 | | | | |
Paper-Auxiliary_Paper | 25,900 | ~5,000,000 | PUP | 62,193 | | | | |
Paper-Auxiliary_Paper | 25,900 | ~5,000,000 | PUP5 | 45,854 | | | | |
Paper-Academic Society | 25,900 | 16,010 | PAuP | 18,016 | | | | |
5.3 Overview of the dataset
6 Experimental results and analysis
6.1 Evaluation metrics
6.2 Baseline models
6.3 PDCN model parameter settings
6.4 Experimental results
6.4.1 Comparison experiment
Table 3. Comparison of experimental results.
Model | Precision | Recall | F1-score | NMI | ARI |
---|---|---|---|---|---|
RGCN | 0.047 | 0.794 | 0.089 | 0.013 | -0.01 |
HGT | 0.095 | 0.303 | 0.145 | 0.023 | 0.091 |
GIN | 0.333 | 0.954 | 0.494 | 0.325 | 0.438 |
RGAT | 0.029 | 0.845 | 0.058 | 0.001 | 0.007 |
LDA-Title | 0.381 | 0.1039 | 0.1633 | | |
LDA-Abstract | 0.549 | 0.437 | 0.487 | | |
LDA-Full-text | 0.438 | 0.259 | 0.326 | | |
PDCN | 0.819 | 0.795 | 0.805 | 0.626 | 0.788 |
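
The F1-score column in Table 3 can be cross-checked against the Precision and Recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` and the dictionary `table3` are illustrative, and the numbers are copied from the table rather than recomputed from raw predictions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in Table 3.
table3 = {
    "RGCN": (0.047, 0.794, 0.089),
    "HGT": (0.095, 0.303, 0.145),
    "GIN": (0.333, 0.954, 0.494),
    "RGAT": (0.029, 0.845, 0.058),
    "LDA-Title": (0.381, 0.1039, 0.1633),
    "LDA-Abstract": (0.549, 0.437, 0.487),
    "LDA-Full-text": (0.438, 0.259, 0.326),
    "PDCN": (0.819, 0.795, 0.805),
}

for model, (p, r, reported) in table3.items():
    # Print the recomputed value next to the reported one for comparison.
    print(f"{model:14s} recomputed={f1_score(p, r):.3f} reported={reported}")
```

Where a recomputed value differs slightly from the reported one, rounding of the published precision and recall or a different averaging scheme are possible causes; this check cannot tell them apart.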
6.4.2 Ablation experiment
Table 4. Results of the ablation experiments.
Model | Precision | Recall | F1-score | NMI | ARI |
---|---|---|---|---|---|
H | 0.101 | 0.865 | 0.186 | 0.091 | 0.129 |
T+H | 0.057 | 0.857 | 0.108 | 0.038 | 0.039 |
T+L | 0.231 | 0.041 | 0.069 | 0.018 | 0.058 |
H+L | 0.428 | 0.439 | 0.434 | 0.234 | 0.414 |
6.5 Model interpretability analysis
6.5.1 Meta-path interpretability analysis
Table 5. Meta-path weights in the heterogeneous graph attention network.
Name of Meta-path | Meta-path weights |
---|---|
PP | 0.0049 |
PPP | 0.0052 |
PJP | 0.4129 |
PBP | 0.0052 |
PUP | 0.0111 |
PUP5 | 0.3353 |
PAuP | 0.2255 |
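
To read Table 5 at a glance, the weights can be ranked and summed. The sketch below assumes the values are normalized semantic-level attention weights (they add up to roughly 1, as a softmax output would); the variable names are illustrative and the numbers are copied from the table.

```python
# Meta-path weights as listed in Table 5.
weights = {
    "PP": 0.0049,
    "PPP": 0.0052,
    "PJP": 0.4129,
    "PBP": 0.0052,
    "PUP": 0.0111,
    "PUP5": 0.3353,
    "PAuP": 0.2255,
}

# Sanity check on the normalization assumption: the total should be close to 1.
total = sum(weights.values())

# Rank meta-paths from most to least influential.
ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)

# Combined weight of the three highest-ranked meta-paths.
top3_weight = sum(weight for _, weight in ranked[:3])

for name, weight in ranked:
    print(f"{name:5s} {weight:.4f}")
print(f"total={total:.4f} top3={top3_weight:.4f}")
```

Under this reading, PJP, PUP5, and PAuP carry nearly all of the attention mass, while PP, PPP, PBP, and PUP each contribute about 1% or less.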