Special Collections
AI for Science of Science

As the fields of artificial intelligence and scientometrics converge, new methodologies and insights are emerging that can revolutionize how we analyze and evaluate scientific output.


We are pleased to announce a special topic dedicated to the theme "AI for Science of Science." We invite researchers, practitioners, and thought leaders to submit original research articles, case studies, and reviews that explore the following topics:

● Innovative AI techniques for measuring science impact

● Applications of AI in knowledge discovery, bibliometric analysis, and technology forecasting

● AI for enhancing research evaluation processes

● Ethical considerations and challenges in AI applications in scientometrics

● Datasets for evaluating the performance of AI tools in the context of Science of Science


Submission Instructions

Submissions should be made through our online platform https://mc03.manuscriptcentral.com/jdis by June 30, 2026. When submitting, please select the type AI4SoS. Peer review begins as soon as a manuscript is submitted, and papers that pass review are published online immediately, without waiting for any issue publication deadline. For more information, please contact us at jdis@mail.las.ac.cn.


Join us in exploring the transformative potential of AI in the field of scientometrics!

  • Research Papers
    Mike Thelwall
    Journal of Data and Information Science. 2024, 9(2): 1-21. https://doi.org/10.2478/jdis-2024-0013

    Purpose: Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task.

    Design/methodology/approach: Test the extent to which ChatGPT-4 can assess the quality of journal articles using a case study of the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements.

    Findings: ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author’s significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations.

    Research limitations: The data is self-evaluations of a convenience sample of articles from one academic in one field.

    Practical implications: Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use.

    Originality/value: This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.

  • Research Papers
    Mike Thelwall
    Journal of Data and Information Science. 2025, 10(1): 7-25. https://doi.org/10.2478/jdis-2025-0011

    Purpose: Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process.

    Design/methodology/approach: This article assesses which ChatGPT inputs (full text without tables, figures, and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts.

    Findings: The optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66).

    Research limitations: The data is a convenience sample of the work of a single author, it only includes one field, and the scores are self-evaluations.

    Practical implications: The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.

    Originality/value: This is the first systematic comparison of the impact of different prompts, parameters and inputs for ChatGPT research quality evaluations.

  • Perspectives
    Teddy Lazebnik, Ariel Rosenfeld
    Journal of Data and Information Science. 2024, 9(3): 4-13. https://doi.org/10.2478/jdis-2024-0020

    Large Language Models (LLMs), exemplified by ChatGPT, have significantly reshaped text generation, particularly in the realm of writing assistance. While ethical considerations underscore the importance of transparently acknowledging LLM use, especially in scientific communication, genuine acknowledgment remains infrequent. A potential avenue to encourage accurate acknowledging of LLM-assisted writing involves employing automated detectors. Our evaluation of four cutting-edge LLM-generated text detectors reveals their suboptimal performance compared to a simple ad-hoc detector designed to identify abrupt writing style changes around the time of LLM proliferation. We contend that the development of specialized detectors exclusively dedicated to LLM-assisted writing detection is necessary. Such detectors could play a crucial role in fostering more authentic recognition of LLM involvement in scientific communication, addressing the current challenges in acknowledgment practices.

  • Research Notes
    Mike Thelwall
    Journal of Data and Information Science. 2025, 10(2): 1-5. https://doi.org/10.2478/jdis-2025-0014

    Google Gemini 1.5 Flash scores were compared with ChatGPT 4o-mini on evaluations of (a) 51 of the author’s journal articles and (b) up to 200 articles in each of 34 field-based Units of Assessment (UoAs) from the UK Research Excellence Framework (REF) 2021. From (a), the results suggest that Gemini 1.5 Flash, unlike ChatGPT 4o-mini, may work better when fed with a PDF or article full text, rather than just the title and abstract. From (b), Gemini 1.5 Flash seems to be marginally less able to predict an article’s research quality (using a departmental quality proxy indicator) than ChatGPT 4o-mini, although the differences are small, and both have similar disciplinary variations in this ability. Averaging multiple runs of Gemini 1.5 Flash improves the scores.

  • Research Papers
    Pachisa Kulkanjanapiban, Tipawan Silwattananusarn, Maya Lambovska
    Journal of Data and Information Science. 2025, 10(4): 146-196. https://doi.org/10.2478/jdis-2025-0036

    Purpose: This study aims to analyze academic research on Artificial Intelligence (AI) applications and tools in academic libraries, focusing on publications from the Scopus database between 2014 and 2024.

    Design/methodology/approach: The study adheres to the PRISMA protocol, using VOSviewer, Bibliometrix, and RStudio’s Biblioshiny function for bibliographic analysis and visualization.

    Findings: The study highlights how the potential of AI in academic libraries may be increased by changing user needs and technical advancements. It comprises four thematic clusters: foundational technologies (machine learning, natural language processing, and automation), emerging innovations (generative AI), user-centric applications (chatbots), and the importance of AI literacy. It also reveals research gaps in automation and strategic AI integration, providing recommendations for improving library services.

    Research limitations: The study is limited to articles published between 2014 and 2024 in the Scopus database, potentially excluding previous foundational work and research from other sources.

    Practical implications: The study offers policymakers and library practitioners insightful information on effectively utilizing AI tools.

    Originality/value: The study discovers the role of artificial intelligence (AI) in modernizing academic libraries, identifying research gaps, and providing strategic insights to improve technology and user experience.

  • Research Papers
    Mike Thelwall, Kayvan Kousha
    Journal of Data and Information Science. 2025, 10(2): 106-123. https://doi.org/10.2478/jdis-2025-0016

    Purpose: Journal Impact Factors and other citation-based indicators are widely used and abused to help select journals to publish in or to estimate the value of a published article. Nevertheless, citation rates primarily reflect scholarly impact rather than other quality dimensions, including societal impact, originality, and rigour. In response to this deficit, Journal Quality Factors (JQFs) are defined and evaluated. These are average quality score estimates given to a journal’s articles by ChatGPT.

    Design/methodology/approach: JQFs were compared with Polish, Norwegian and Finnish journal ranks and with journal citation rates for 1,300 journals with 130,000 articles from 2021 in large monodisciplinary journals in the 25 out of 27 Scopus broad fields of research for which it was possible. Outliers were also examined.

    Findings: JQFs correlated positively and mostly strongly (median correlation: 0.641) with journal ranks in 24 out of the 25 broad fields examined, indicating a nearly science-wide ability for ChatGPT to estimate journal quality. Journal citation rates had similarly high correlations with national journal ranks, however, so JQFs are not a universally better indicator. An examination of journals with JQFs not matching their journal ranks suggested that abstract styles may affect the result, such as whether the societal contexts of research are mentioned.

    Research limitations: Different journal rankings may have given different findings because there is no agreed meaning for journal quality.

    Practical implications: The results suggest that JQFs are plausible as journal quality indicators in all fields and may be useful for the (few) research and evaluation contexts where journal quality is an acceptable proxy for article quality, and especially for fields like mathematics for which citations are not strong indicators of quality.

    Originality/value: This is the first attempt to estimate academic journal value with a Large Language Model.
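
The JQF definition above (the average quality score given to a journal's articles) can be sketched in a few lines. The journal names and scores are invented, and the function name `journal_quality_factors` is hypothetical; the abstract does not describe the authors' own implementation.

```python
from collections import defaultdict
from statistics import mean


def journal_quality_factors(article_scores):
    """Return the mean article score per journal.

    `article_scores` is an iterable of (journal_name, score) pairs,
    one pair per scored article.
    """
    by_journal = defaultdict(list)
    for journal, score in article_scores:
        by_journal[journal].append(score)
    return {journal: mean(scores) for journal, scores in by_journal.items()}


# Hypothetical article-level scores (not data from the article).
scores = [
    ("Journal A", 3.0),
    ("Journal A", 2.5),
    ("Journal B", 1.5),
    ("Journal B", 2.0),
    ("Journal B", 2.5),
]

jqfs = journal_quality_factors(scores)
print(jqfs)
```

The resulting per-journal values could then be correlated with an external journal ranking within each field, which is the comparison the abstract describes.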

  • Research Papers
    Hui Nie, Zhao-hui Long, Ze-jun Fang, Lu-qiong Gao
    Journal of Data and Information Science. 2025, 10(4): 291-315. https://doi.org/10.2478/jdis-2025-0046

    Purpose: This study aims to integrate large language models (LLMs) with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud, addressing the limitations of traditional approaches in long-text semantic parsing, model interpretability, and multi-source data fusion, thereby providing regulatory agencies with intelligent auditing tools.

    Design/methodology/approach: Analyzing 5,304 Chinese listed firms’ annual reports (2015-2020) from the CSMAD database, this study leverages the Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors, developing textual semantic features. It integrates 19 financial indicators, 11 governance metrics, and linguistic characteristics (tone, readability) with fraud prediction models optimized through a group of Gradient Boosted Decision Tree (GBDT) algorithms. SHAP value analysis in the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial, governance, and textual features on fraud likelihood.

    Findings: The study found that LLMs effectively distill lengthy annual reports into semantic summaries, while GBDT algorithms (AUC > 0.850) outperform the traditional Logistic Regression model in fraud detection. Multimodal fusion improved performance by 7.4%, with financial, governance, and textual features providing complementary signals. SHAP analysis revealed financial distress, governance conflicts, and narrative patterns (e.g., tone anchoring, semantic thresholds) as key fraud indicators, highlighting managerial intent in report language.

    Research limitations: This study identifies three key limitations: 1) lack of interpretability for semantic features, 2) absence of granular fraud-type differentiation, and 3) unexplored comparative validation with other deep learning methods. Future research will address these gaps to enhance fraud detection precision and model transparency.

    Practical implications: The developed semantic-enhanced evaluation model provides a quantitative tool for assessing listed companies’ information disclosure quality and enables practical implementation through its derivative real-time monitoring system. This advancement significantly strengthens capital market risk early warning capabilities, offering actionable insights for securities regulation.

    Originality/value: This study presents three key innovations: 1) A novel “chunking-summarization-embedding” framework for efficient semantic compression of lengthy annual reports (30,000 words); 2) Demonstration of LLMs’ superior performance in financial text analysis, outperforming traditional methods by 19.3%; 3) A novel “language-psychology-behavior” triad model for analyzing managerial fraud motives.

  • Editorial
    Liying Yang, Ronald Rousseau, Ping Meng
    Journal of Data and Information Science. 2025, 10(1): 1-3. https://doi.org/10.2478/jdis-2025-0017
  • Research Article
    Álvaro Cabezas-Clavijo, Pavel Sidorenko-Bautista
    Journal of Data and Information Science. 2026, 11(2): 102-116. https://doi.org/10.1515/jdis-2025-0326
    Abstract
    Purpose

    This study evaluates the reliability of eight generative artificial intelligence chatbots—including ChatGPT, Claude, Gemini, and DeepSeek—when functioning as autonomous agents for academic bibliographic generation, specifically assessing their accuracy within a university research framework.

    Design/methodology/approach

    Using a standardized prompting methodology, 400 references were generated and analyzed across five core knowledge areas: Health, Engineering, Experimental Sciences, Social Sciences, and Humanities. Each agent’s output was rigorously audited against five formal criteria (authorship, year, title, source, and location) and categorized by error frequency and document type.

    Findings

    Results indicate a significant reliability gap, with only 26.5% of references entirely accurate and nearly 40% flawed or fabricated. While Grok and DeepSeek avoided hallucinations, Copilot, Perplexity, and Claude showed the highest failure rates, particularly when generating journal article citations.

    Research limitations

    The study focuses on the free versions of these AI agents, so results may vary with paid models or future architectural updates that integrate real-time web browsing more effectively.

    Practical implications

    These findings underscore the critical risks of uncritical reliance on AI agents for academic tasks, highlighting an urgent need for enhanced information literacy and the development of specialized critical thinking skills to navigate AI-mediated research.

    Originality/value

    This original and unpublished research provides a pioneering comparative analysis of multiple AI agents as research intermediaries, revealing structural limitations in their generative logic and offering a unique benchmark for the reliability of AI-driven bibliographic data in higher education.

  • Editorials
    The JDIS Editors
    Journal of Data and Information Science. 2025, 10(2): 176-178. https://doi.org/10.2478/jdis-2025-0019
  • Research Article
    Zhichao Ba, Yujie Zhang, Biao Zhang, Gang Li
    Journal of Data and Information Science. 2026, 11(2): 10-35. https://doi.org/10.1515/jdis-2025-0296
    Abstract
    Purpose

    While the imperative of enterprise digital transformation (EDT) has been widely acknowledged, a systematic understanding of its intricate network of antecedents and consequences remains fragmented. This study proposes a novel knowledge representation framework that leverages large language models (LLMs) to construct a variable relational network (VRN), offering a panoramic, micro-level perspective on EDT.

    Design/methodology/approach

    We extract five types of variable relationships from a vast corpus of academic publications on EDT to generate the VRN. Subsequently, we apply network topology analysis to uncover the temporal and regional characteristics of the VRN. Its hierarchical structure is then analyzed through K-shell decomposition.

    Findings

    Our results show that, over the past two decades, the scale of the VRN has experienced rapid growth, driven collectively by multi-layered external factors such as the rapid advancement of digital technologies, and its internal connections have become increasingly tight. Regional comparisons of the VRN reveal that different economies, shaped by institutional theories, exhibit distinct transformation paradigms while striving toward common goals. K-shell analysis uncovers a clear hierarchical structure, distinguishing peripheral, intermediate, and core variables, with these layers corresponding to varying degrees of strategic significance and transformation maturity.

    Research limitations

    The study’s limitations primarily concern the accuracy of the VRN, which depends on the LLM’s extraction performance and its potential for hallucinations, which may introduce noise into the network topology.

    Practical implications

    The VRN and its network topology structure serve as a diagnostic tool for strategic decision-making; enterprises and policymakers can also use these insights to design targeted support programs.

    Originality/value

    This study contributes a data-driven, LLM-assisted framework for mapping the evolving and multidimensional landscape of enterprise digital transformation, thereby validating and extending the theoretical boundaries of EDT.
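
K-shell decomposition, named in the Design/methodology/approach above, assigns each node a layer by pruning low-degree nodes in rounds. The following is a minimal sketch of the standard pruning procedure on a toy undirected graph. The node labels are placeholders, the function name `k_shell` is hypothetical, and nothing here reproduces the authors' network or code.

```python
def k_shell(adjacency):
    """Assign each node its k-shell index by iterative pruning.

    `adjacency` maps each node to the set of its neighbours and is
    assumed to be symmetric (an undirected graph).
    """
    # Work on a copy so the caller's graph is left untouched.
    graph = {node: set(neighbours) for node, neighbours in adjacency.items()}
    shell = {}
    k = 0
    while graph:
        # Nodes whose remaining degree is at most k belong to shell k.
        removable = [n for n, nbrs in graph.items() if len(nbrs) <= k]
        if not removable:
            # Nothing left to prune at this level, so move to the next shell.
            k += 1
            continue
        for node in removable:
            shell[node] = k
            # Remove the node and detach it from its remaining neighbours.
            for neighbour in graph.pop(node):
                graph[neighbour].discard(node)
    return shell


# Toy graph: a triangle A-B-C with one pendant node D attached to A.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

shells = k_shell(toy)
print(shells)
```

In this toy graph the pendant node falls in shell 1 and the triangle forms the core at shell 2, which mirrors the peripheral versus core distinction the abstract draws.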

  • Research Article
    Mike Thelwall, Ralph Schroeder, Meena Dhanda
    Journal of Data and Information Science. 2026, 11(2): 154-164. https://doi.org/10.1515/jdis-2025-0390
    Abstract
    Purpose

    It has become increasingly likely that Large Language Models (LLMs) will be used to score the quality of academic publications to support research assessment goals in the future. This may cause problems for fields with competing paradigms since there is a risk that one may be favoured, causing long term harm to the reputation of the other.

    Design/methodology/approach

    To test whether paradigm favouritism is plausible, study 1 uses ChatGPT to evaluate up to 100 journal articles from each of eight pairs of competing sociology paradigms (1,490 altogether). Each article was assessed by prompting ChatGPT to take one of five roles: paradigm follower, opponent, antagonistic follower, antagonistic opponent, or neutral. Study 2 involved five pairs of more tightly defined paradigms.

    Findings

    Articles were scored highest by ChatGPT when it followed the aligning paradigm, and lowest when it was told to devalue it and to follow the opposing paradigm. Broadly similar patterns occurred for most of the paradigm pairs. Follower ChatGPTs displayed only a small amount of favouritism compared to neutral ChatGPTs, but articles evaluated by an opposing paradigm ChatGPT had a substantial disadvantage in some cases.

    Research limitations

    The data covers a single field and LLM.

    Practical implications

    The results confirm that LLM instructions for research evaluation should be carefully designed to ensure that they are paradigm-neutral to avoid accidentally resolving conflicts between paradigms on a technicality by devaluing one side’s contributions.

    Originality/value

    This is the first demonstration that LLMs can be prompted to show a partiality for academic paradigms.

  • Research Article
    Chongjun Xi, Xiaoting Chen, Dongmei Ye
    Journal of Data and Information Science. https://doi.org/10.1515/jdis-2025-0455
    Accepted: 2026-06-01
    Abstract
    Purpose

    To address the limitations of traditional patent metrics in capturing technical substance and the high cost of expert review, this study proposes a hybrid evaluation framework integrating Large Language Models (LLMs) with machine learning to achieve automated, highly accurate identification of high-value patents.

    Design/methodology/approach

    Adopting a “Virtual Assessor” paradigm, we constructed a dataset based on the China Patent Gold Awards. The study integrated semantic scores from three diverse LLMs (DeepSeek, Qwen, GLM) under zero-shot and few-shot prompt strategies into a Stacking ensemble learning model (combining XGBoost, Random Forest, and SVM) to predict patent value across nine comparative experimental setups.

    Findings

    Direct LLM evaluation revealed a “Knowledge Injection Paradox,” where explicit expert prior knowledge caused negative transfer and reduced accuracy due to over-conditioning. However, the Stacking model successfully rectified these biases, transforming subjective LLM evaluations into robust predictive features. The hybrid model achieved over 97% accuracy in identifying high-value patents, demonstrating strong robustness even in high-noise environments.

    Research limitations

    The study relies on a binary classification of extreme samples (Gold Award vs. non-awarded), potentially oversimplifying the continuous distribution of patent value. Furthermore, the interpretability of the “black box” feature fusion mechanism requires further exploration.

    Practical implications

    The proposed framework offers IP managers and policymakers a scalable, cost-effective tool for automated patent screening, effectively bridging the gap between qualitative expert intuition and quantitative data precision.

    Originality/value

    This research introduces a “Semantic Enhancement + Algorithmic Rectification” paradigm. It empirically demonstrates how machine learning can correct LLM hallucinations and biases, marking a significant shift from data-driven perception to AI-driven cognitive decision-making in patent valuation.