AI for Science of Science

Select

Is Google Gemini better than ChatGPT at evaluating research quality?

Mike Thelwall

Journal of Data and Information Science. 2025, 10(2): 1-5. https://doi.org/10.2478/jdis-2025-0014

Abstract (294) PDF (319KB) ( 234 ) RichHTML ( 162 )

Google Gemini 1.5 Flash scores were compared with ChatGPT 4o-mini on evaluations of (a) 51 of the author’s journal articles and (b) up to 200 articles in each of 34 field-based Units of Assessment (UoAs) from the UK Research Excellence Framework (REF) 2021. From (a), the results suggest that Gemini 1.5 Flash, unlike ChatGPT 4o-mini, may work better when fed with a PDF or article full text, rather than just the title and abstract. From (b), Gemini 1.5 Flash seems to be marginally less able to predict an article’s research quality (using a departmental quality proxy indicator) than ChatGPT 4o-mini, although the differences are small, and both have similar disciplinary variations in this ability. Averaging multiple runs of Gemini 1.5 Flash improves the scores.

Select

Journal Quality Factors from ChatGPT: More meaningful than Impact Factors?

Mike Thelwall, Kayvan Kousha

Journal of Data and Information Science. 2025, 10(2): 106-123. https://doi.org/10.2478/jdis-2025-0016

Abstract (280) PDF (1079KB) ( 135 ) RichHTML ( 99 )

Purpose: Journal Impact Factors and other citation-based indicators are widely used and abused to help select journals to publish in or to estimate the value of a published article. Nevertheless, citation rates primarily reflect scholarly impact rather than other quality dimensions, including societal impact, originality, and rigour. In response to this deficit, Journal Quality Factors (JQFs) are defined and evaluated. These are average quality score estimates given to a journal’s articles by ChatGPT.
Design/methodology/approach: JQFs were compared with Polish, Norwegian and Finnish journal ranks and with journal citation rates for 1,300 journals with 130,000 articles from 2021 in large monodisciplinary journals in the 25 out of 27 Scopus broad fields of research for which it was possible. Outliers were also examined.
Findings: JQFs correlated positively and mostly strongly (median correlation: 0.641) with journal ranks in 24 out of the 25 broad fields examined, indicating a nearly science-wide ability for ChatGPT to estimate journal quality. Journal citation rates had similarly high correlations with national journal ranks, however, so JQFs are not a universally better indicator. An examination of journals with JQFs not matching their journal ranks suggested that abstract styles may affect the result, such as whether the societal contexts of research are mentioned.
Research limitations: Different journal rankings may have given different findings because there is no agreed meaning for journal quality.
Practical implications: The results suggest that JQFs are plausible as journal quality indicators in all fields and may be useful for the (few) research and evaluation contexts where journal quality is an acceptable proxy for article quality, and especially for fields like mathematics for which citations are not strong indicators of quality.
Originality/value: This is the first attempt to estimate academic journal value with a Large Language Model.

Select

Evaluating research quality with Large Language Models: An analysis of ChatGPT’s effectiveness with different settings and inputs

Mike Thelwall

Journal of Data and Information Science. 2025, 10(1): 7-25. https://doi.org/10.2478/jdis-2025-0011

Abstract (486) PDF (602KB) ( 237 ) RichHTML ( 264 )

Purpose: Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises, appointments and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process.

Design/methodology/approach: This article assesses which ChatGPT inputs (full text without tables, figures, and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts.

Findings: The optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66), and 4o-mini (0.66).

Research limitations: The data is a convenience sample of the work of a single author, it only includes one field, and the scores are self-evaluations.

Practical implications: The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.

Originality/value: This is the first systematic comparison of the impact of different prompts, parameters and inputs for ChatGPT research quality evaluations.

Select

JDIS: Bridging Data-Driven Science of Science and the Research Ecosystem

Liying Yang, Ronald Rousseau, Ping Meng

Journal of Data and Information Science. 2025, 10(1): 1-3. https://doi.org/10.2478/jdis-2025-0017

Abstract (207) PDF (156KB) ( 204 ) RichHTML ( 180 )

Select

Detecting LLM-assisted writing in scientific communication: Are we there yet?

Teddy Lazebnik, Ariel Rosenfeld

Journal of Data and Information Science. 2024, 9(3): 4-13. https://doi.org/10.2478/jdis-2024-0020

Abstract (466) PDF (354KB) ( 138 ) RichHTML ( 229 )

Large Language Models (LLMs), exemplified by ChatGPT, have significantly reshaped text generation, particularly in the realm of writing assistance. While ethical considerations underscore the importance of transparently acknowledging LLM use, especially in scientific communication, genuine acknowledgment remains infrequent. A potential avenue to encourage accurate acknowledging of LLM-assisted writing involves employing automated detectors. Our evaluation of four cutting-edge LLM-generated text detectors reveals their suboptimal performance compared to a simple ad-hoc detector designed to identify abrupt writing style changes around the time of LLM proliferation. We contend that the development of specialized detectors exclusively dedicated to LLM-assisted writing detection is necessary. Such detectors could play a crucial role in fostering more authentic recognition of LLM involvement in scientific communication, addressing the current challenges in acknowledgment practices.

Select

Can ChatGPT evaluate research quality?

Mike Thelwall

Journal of Data and Information Science. 2024, 9(2): 1-21. https://doi.org/10.2478/jdis-2024-0013

Abstract (1184) PDF (526KB) ( 373 ) RichHTML ( 744 )

Purpose: Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task.

Design/methodology/approach: Test the extent to which ChatGPT-4 can assess the quality of journal articles using a case study of the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements.

Findings: ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author’s significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations.

Research limitations: The data is self-evaluations of a convenience sample of articles from one academic in one field.

Practical implications: Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use.

Originality/value: This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.

Please choose a citation manager

Content to export

模态框（Modal）标题

Please choose a citation manager

Content to export