1 Introduction
2 Methods
2.1 Data
2.2 GPT prompt setup
2.3 GPT system prompt variations
2.4 GPT score extraction
2.5 Analysis
3 Results
3.1 Input and averaging length comparisons: ChatGPT 3.5-turbo, 4o, and 4o-mini on truncated texts, abstracts, and titles
Figure 1. ChatGPT 3.5-turbo score prediction correlations against human scores for 51 information science article full texts (truncated), article titles and abstracts, or just titles. Averages over n iterations and confidence intervals are calculated as described in the Methods.
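To make the averaging behind Figure 1 concrete, the sketch below correlates the human scores with GPT scores averaged over the first n iterations, for increasing n. The array layout, function name, and the Fisher z confidence interval are illustrative assumptions, not the paper's method; the actual averaging and confidence interval procedure is the one described in the Methods.

```python
import numpy as np
from scipy.stats import spearmanr

def correlation_by_iterations(gpt_scores, human_scores, max_n=30):
    """Spearman correlation of iteration-averaged GPT scores with human scores.

    gpt_scores: (articles, iterations) array of raw GPT scores, e.g. 51 x 30.
    human_scores: one human score per article.
    Returns (n, correlation, ci_low, ci_high) tuples; the 95% interval uses a
    Fisher z approximation, which is an assumption for illustration only.
    """
    gpt_scores = np.asarray(gpt_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    k = len(human_scores)
    results = []
    for n in range(1, max_n + 1):
        avg = gpt_scores[:, :n].mean(axis=1)   # average of the first n iterations
        rho, _ = spearmanr(avg, human_scores)
        z, se = np.arctanh(rho), 1.0 / np.sqrt(k - 3)
        results.append((n, rho, np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)))
    return results
```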
Table 1. Spearman correlations between human scores and model average scores (over 30 iterations) for 51 information science articles. Values above 0.75 are highlighted in bold.
Spearman correlation | GPT-3.5 turbo: Abstracts | GPT-3.5 turbo: Truncated text | GPT-4o-mini: Abstracts | GPT-4o-mini: Truncated text | GPT-4o: Abstracts | GPT-4o: Truncated text | Human
---|---|---|---|---|---|---|---
GPT-3.5 turbo: Titles | 0.439 | 0.444 | 0.359 | 0.499 | 0.539 | 0.589 | 0.434
GPT-3.5 turbo: Abstracts | **1.000** | **0.757** | 0.700 | 0.718 | **0.875** | **0.774** | 0.674
GPT-3.5 turbo: Truncated text | | **1.000** | 0.672 | 0.686 | 0.732 | **0.783** | 0.625
GPT-4o-mini: Abstracts | | | **1.000** | 0.608 | 0.729 | 0.653 | 0.571
GPT-4o-mini: Truncated text | | | | **1.000** | **0.813** | **0.801** | 0.506
GPT-4o: Abstracts | | | | | **1.000** | **0.858** | 0.678
GPT-4o: Truncated text | | | | | | **1.000** | 0.675
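Table 1 can be reproduced along the following lines, assuming one column of human scores and one column of 30-iteration average scores per model and input combination; the function name and the toy numbers in the usage example are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def spearman_matrix(score_columns):
    """Pairwise Spearman correlations between score columns (Table 1 style).

    score_columns maps a label (e.g. "Human", "GPT-4o: Abstracts") to one score
    per article; GPT columns are assumed to already be 30-iteration means.
    """
    return pd.DataFrame(score_columns).corr(method="spearman").round(3)

# Toy usage with made-up numbers, not the study's data:
print(spearman_matrix({
    "Human": [3, 2, 4, 3, 2],
    "GPT-4o: Abstracts": [3.2, 2.5, 3.9, 3.0, 2.4],
}))
```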
Table 2. Average human scores and model average scores.
 | Human | GPT-3.5 turbo: Titles | GPT-3.5 turbo: Abstracts | GPT-3.5 turbo: Truncated text | GPT-4o-mini: Abstracts | GPT-4o-mini: Truncated text | GPT-4o: Abstracts | GPT-4o: Truncated text
---|---|---|---|---|---|---|---|---
Mean score | 2.75 | 2.49 | 2.75 | 3.03 | 2.93 | 3.22 | 2.99 | 3.16
3.2 Model comparison: ChatGPT 3.5-turbo, ChatGPT 4o and ChatGPT 4o-mini on abstracts
Figure 2. ChatGPT 4o-mini, ChatGPT 3.5-turbo and ChatGPT 4o score prediction correlations against human scores for 51 information science article titles and abstracts. Averages over n iterations and confidence intervals are calculated as described in the Methods.
Figure 3. ChatGPT 4o score predictions based on abstracts (average of 30) against human scores (from the author) for 51 information science article titles and abstracts. |
3.3 Comparison of prompt strategies
Figure 4. ChatGPT 4o score predictions based on abstracts (average of 30) against human scores (from the author) for 51 information science article titles and abstracts with seven different system prompts. Strategies 1-5 are abbreviations of Strategy 6, the full REF instructions, and Strategy 0 is a brief instruction without a request for justification. |
3.4 Individual score level accuracy
Table 3. Mean Absolute Deviations (MAD) for direct predictions and predictions with linear regression for each model and input. The Improve column gives the percentage reduction in MAD compared to the baseline strategy of assigning each article the overall human average, 2.75.
Model and input | Direct: MAD | Direct: Improve | Regression: Intercept | Regression: Coefficient | Regression: MAD | Regression: Improve
---|---|---|---|---|---|---
GPT-3.5 turbo: Titles | 0.68 | 6% | -1.16 | 1.57 | 0.63 | 13%
GPT-3.5 turbo: Abstracts | 0.60 | 17% | -3.46 | 2.26 | 0.51 | 30%
GPT-3.5 turbo: Truncated text | 0.70 | 4% | -7.49 | 3.38 | 0.55 | 24%
GPT-4o-mini: Abstracts | 0.63 | 13% | -3.32 | 2.07 | 0.59 | 19%
GPT-4o-mini: Truncated text | 0.75 | -3% | -2.44 | 1.61 | 0.60 | 17%
GPT-4o: Abstracts | 0.62 | 14% | -3.40 | 2.05 | 0.50 | 31%
GPT-4o: Truncated text | 0.69 | 5% | -4.44 | 2.28 | 0.50 | 31%
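A minimal sketch of how the columns of Table 3 can be derived from one set of iteration-averaged GPT scores and the human scores. The 2.75 baseline follows the table caption; the simple in-sample least-squares fit is an assumption for illustration, since the table itself does not specify how the regression was fitted.

```python
import numpy as np

def mad_table_row(gpt_avg, human, baseline=2.75):
    """Direct and regression MAD with percentage improvement (Table 3 style).

    gpt_avg: iteration-averaged GPT scores, one per article.
    human: human scores, one per article.
    baseline: constant prediction used as the reference (overall human mean).
    """
    gpt_avg = np.asarray(gpt_avg, dtype=float)
    human = np.asarray(human, dtype=float)
    baseline_mad = np.mean(np.abs(human - baseline))

    # Direct: treat the GPT score itself as the predicted quality score.
    direct_mad = np.mean(np.abs(human - gpt_avg))

    # Regression: least-squares line mapping GPT scores to human scores
    # (fitted in-sample here, which is an illustrative assumption).
    coef, intercept = np.polyfit(gpt_avg, human, 1)
    reg_mad = np.mean(np.abs(human - (intercept + coef * gpt_avg)))

    def improve(mad):
        # Percentage reduction in MAD relative to the constant-baseline MAD.
        return round(100 * (baseline_mad - mad) / baseline_mad)

    return {
        "Direct: MAD": round(direct_mad, 2),
        "Direct: Improve": f"{improve(direct_mad)}%",
        "Regression: Intercept": round(intercept, 2),
        "Regression: Coefficient": round(coef, 2),
        "Regression: MAD": round(reg_mad, 2),
        "Regression: Improve": f"{improve(reg_mad)}%",
    }
```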
4 Discussion
4.1 Comparisons with prior work
Figure 5. ChatGPT 4 (web interface) score prediction correlations against human scores for 51 information science article titles and abstracts. Averages over n iterations and confidence intervals are calculated as described in the Methods (data from Thelwall, 2024).