1 Introduction
2 Background
2.1 LLMs and ChatGPT
2.2 Research quality and UK REF2021
3 Methods
3.1 Article selection
3.2 Article scoring by my judgements
3.3 ChatGPT 4 REF D configuration and scores
3.4 Analyses
4 Results
4.1 RQ1: Can ChatGPT 4.0 understand the REF research quality evaluation task in the sense of producing plausible outputs?
4.2 RQ2: Does ChatGPT 4.0 allocate the full range of REF research quality scores?
Table 1. The scores given by ChatGPT-4 REF D and me to 51 of my open access articles.

| Score | GPT (n) | GPT % (of 765) | Me (n) | Me % (of 51) |
|---|---|---|---|---|
| 1* | 0 | 0.0% | 2 | 4% |
| 1.5* | 0 | 0.0% | 3 | 6% |
| 2* | 14 | 1.8% | 12 | 24% |
| 2.33* | 1 | 0.1% | 0 | 0% |
| 2.5* | 2 | 0.3% | 9 | 18% |
| 2.67* | 2 | 0.3% | 0 | 0% |
| 2.75* | 0 | 0.0% | 1 | 2% |
| 3* | 509 | 66.5% | 8 | 16% |
| 3.33* | 9 | 1.2% | 0 | 0% |
| 3.5* | 14 | 1.8% | 7 | 14% |
| 3.67* | 15 | 2.0% | 0 | 0% |
| 4* | 199 | 26.0% | 9 | 18% |
| Total | 765 | 100.0% | 51 | 100% |
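For concreteness, a distribution like the one in Table 1 can be tabulated directly from the raw scores. The sketch below is a minimal illustration only, assuming the ChatGPT scores are held in a flat list `gpt_scores` (one entry per article per run, 765 in total here) and the author scores in a list `my_scores` (51 entries); the variable names and example values are hypothetical.

```python
from collections import Counter

def score_distribution(scores):
    """Count each distinct star rating and express it as a share of all scores."""
    counts = Counter(scores)
    total = len(scores)
    return {score: (n, 100 * n / total) for score, n in sorted(counts.items())}

# Hypothetical inputs: one ChatGPT score per (article, run) and one author score per article.
gpt_scores = [3.0, 4.0, 3.0, 2.0, 4.0, 3.0]   # would hold 765 values in the real data
my_scores = [2.5, 3.0, 4.0, 2.0, 3.5, 1.5]    # would hold 51 values in the real data

for label, dist in [("ChatGPT", score_distribution(gpt_scores)),
                    ("Author", score_distribution(my_scores))]:
    print(label)
    for score, (n, pct) in dist.items():
        print(f"  {score}*: {n} ({pct:.1f}%)")
```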
4.3 RQ3/4/5/6: Is ChatGPT 4.0 REF D consistent and accurate in its REF quality scoring?
Table 2. Pearson correlations for 51 of my open access articles, comparing my initial scores with scores from ChatGPT-4 REF D.

| Correlation | All articles | Articles scored 2.5+ by me | Articles scored 3+ by me |
|---|---|---|---|
| GPT average vs. author (95% CI) | 0.509 (0.271, 0.688) | 0.200 (-0.148, 0.504) | 0.246 (-0.175, 0.590) |
| GPT vs. author, average of the 15 per-run correlations (fraction of 95% CIs excluding 0) | 0.281 (8/15) | 0.102 (1/15) | 0.128 (1/15) |
| GPT vs. GPT (average of the 105 run-pair correlations) | 0.245 | 0.194 | 0.215 |
| Sample size (articles) | 51 | 34 | 24 |
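The three correlation types in Table 2 can be reproduced, under assumptions, along the following lines. The sketch assumes a 51×15 array `gpt` of ChatGPT scores (rows = articles, columns = repeated runs) and a length-51 vector `author` of my scores; these names, and the use of a Fisher z interval for the 95% CI, are illustrative assumptions rather than a record of the exact calculation used.

```python
from itertools import combinations

import numpy as np
from scipy.stats import pearsonr

def fisher_ci(r, n):
    """Approximate 95% confidence interval for a Pearson correlation via the Fisher z transform."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    return np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)

def table2_correlations(gpt, author):
    """gpt: (articles x runs) array of ChatGPT scores; author: per-article author scores."""
    n_articles, n_runs = gpt.shape

    # Row 1: correlate the per-article mean ChatGPT score with the author's score.
    r_mean, _ = pearsonr(gpt.mean(axis=1), author)
    ci_mean = fisher_ci(r_mean, n_articles)

    # Row 2: correlate each individual run with the author, then average;
    # also count how many of the per-run 95% CIs exclude zero (both limits same sign).
    per_run = [pearsonr(gpt[:, j], author)[0] for j in range(n_runs)]
    excluding_zero = sum(1 for r in per_run if np.prod(fisher_ci(r, n_articles)) > 0)

    # Row 3: average correlation over all pairs of runs (105 pairs for 15 runs).
    run_pairs = [pearsonr(gpt[:, j], gpt[:, k])[0]
                 for j, k in combinations(range(n_runs), 2)]

    return {
        "mean GPT vs. author": (r_mean, ci_mean),
        "average of per-run correlations": (np.mean(per_run), f"{excluding_zero}/{n_runs}"),
        "average GPT vs. GPT": np.mean(run_pairs),
    }
```

The subset columns would then follow by masking, e.g. `table2_correlations(gpt[author >= 2.5], author[author >= 2.5])` for the articles I scored 2.5 or higher.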
Figure 1. The average REF star rating given by the REF D GPT plotted against my prior evaluation of the REF score of each of my 51 open access articles.
Figure 2. The range of REF star ratings given by the REF D GPT plotted against my prior evaluation of the REF score of each of my 51 open access articles. The area of each bubble is proportional to the number of times ChatGPT gave the y-axis score to the x-axis article. My REF scores are marked on the x-axis.
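For reference, a bubble chart of the kind described in Figure 2 can be drawn with standard plotting tools. The sketch below only illustrates the chart format, assuming a hypothetical list of (author score, ChatGPT score, count) triples called `bubbles`; matplotlib's `s` argument scales marker area, matching the caption's area-proportional convention.

```python
import matplotlib.pyplot as plt

# Hypothetical (author score, ChatGPT score, count) triples; the real chart would use,
# for each article, how often ChatGPT gave it each star rating across the repeated runs.
bubbles = [(2.0, 3.0, 9), (2.5, 3.0, 12), (3.0, 3.0, 10), (3.5, 4.0, 8), (4.0, 4.0, 11)]

x, y, counts = zip(*bubbles)
plt.scatter(x, y, s=[40 * c for c in counts], alpha=0.5)  # marker *area* grows with count
plt.xlabel("My REF score")
plt.ylabel("ChatGPT REF score")
plt.title("Range of ChatGPT scores per article (bubble area = frequency)")
plt.show()
```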