Evaluating the quality of academic research is necessary for many national research evaluation exercises, such as in the UK and New Zealand (Buckle & Creedy, 2024; Sivertsen, 2017), as well as for appointments, promotions, and tenure (Pontika et al., 2022; Tierney & Bensimon, 1996). These evaluations are expensive and reduce the time that academics can spend conducting research, so automated alternatives or support are desirable. The need for shortcuts has given rise to the field of bibliometrics/scientometrics, which has, controversially (MacRoberts & MacRoberts, 2018), developed a wide range of academic impact or quality indicators, typically based on citation counts (Moed, 2006). Attempts to use bibliometric information to directly score journal articles for research quality have obtained mixed results, however: article-level Pearson correlations between traditional machine learning (e.g., extreme gradient boosting) predictions and expert scores vary between 0.028 (Art and Design) and 0.562 (Clinical Medicine) for REF2021 (Thelwall et al., 2023). It is therefore logical to assess whether LLMs can be more accurate at predicting quality scores, given that they are relatively accurate at a wide range of language processing tasks (Bang et al., 2023; Kocoń et al., 2023) and have been proposed for scientometric purposes (Bornmann & Lepori, 2024). In fact, one funder is now using LLMs to support the initial triage of biomedical grant proposals, creating a pool of apparently weaker submissions for human reviewers to consider for rejection before the full peer review process. The LLMs were not reliable enough to be used without human judgement, but they identified weak submissions at above-chance rates and could save time by flagging these for human triage (Carbonell Cortés, 2024).
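To make the evaluation statistic cited above concrete, the following minimal sketch shows how an article-level Pearson correlation between predicted and expert quality scores can be computed. The scores are invented purely for illustration (a REF-style 1-4 quality scale is assumed); they do not reproduce any reported result.

```python
# Illustrative only: correlate hypothetical model predictions with
# hypothetical expert quality scores at the article level.
from scipy.stats import pearsonr

expert_scores = [4, 3, 3, 2, 4, 1, 2, 3]                      # hypothetical expert ratings (1-4)
predicted_scores = [3.6, 2.9, 3.1, 2.2, 3.8, 1.5, 2.7, 2.4]   # hypothetical model predictions

r, p_value = pearsonr(expert_scores, predicted_scores)
print(f"Article-level Pearson correlation: r = {r:.3f} (p = {p_value:.3f})")
```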