Enhancing emerging technology discovery in nanomedicine by integrating innovative sentences using BERT and NLDA

Yifan Wang; Xiaoping Liu; Xiang-Li Zhu

doi:10.2478/jdis-2024-0031

Journal of Data and Information Science, 2024, Vol. 9, Issue 4: 155-195


Research Papers


Yifan Wang²,
Xiaoping Liu¹,²,†,
Xiang-Li Zhu¹,²


¹National Science Library, Chinese Academy of Sciences, Beijing 100190, P.R. China
²Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing, 100190, P.R. China

† Xiaoping Liu (Email: liuxp@mail.las.ac.cn).

Received date: 2024-06-16

Revised date: 2024-10-12

Accepted date: 2024-10-16

Online published: 2024-10-28


Abstract

Purpose: Nanomedicine has significant potential to revolutionize biomedicine and healthcare through innovations in diagnostics, therapeutics, and regenerative medicine. This study aims to develop a novel framework that integrates advanced natural language processing, noise-free topic modeling, and multidimensional bibliometrics to systematically identify emerging nanomedicine technology topics from scientific literature.

Design/methodology/approach: The framework involves collecting full-text articles from PubMed Central and nanomedicine-related metrics from the Web of Science for the period 2013-2023. A fine-tuned BERT model is employed to extract key informative sentences. Noiseless Latent Dirichlet Allocation (NLDA) is applied to model interpretable topics from the cleaned corpus. Additionally, we develop and apply metrics for novelty, innovation, growth, impact, and intensity to quantify the emergence of novel technological topics.

Findings: By applying this methodology to nanomedical publications, we identify an increasing emphasis on research aligned with global health priorities, particularly inflammation and biomaterial interactions in disease research. This methodology provides deeper insights through full-text analysis and leads to a more robust discovery of emerging technologies.

Research limitations: One limitation of this study is its reliance on the existing scientific literature, which may introduce publication biases and language constraints. Additionally, manual annotation of the dataset, while thorough, is subjective and time-consuming. Future research could address these limitations by incorporating more diverse data sources and automating the annotation process.

Practical implications: The methodology presented can be adapted to explore emerging technologies in other scientific domains. It allows for tailored assessment criteria based on specific contexts and objectives, enabling more precise analysis and decision-making in various fields.

Originality/value: This study offers a comprehensive framework for identifying emerging technologies in nanomedicine, combining theoretical insights and practical applications. Its potential for adaptation across scientific disciplines enhances its value for future research and decision-making in technology discovery.

Keywords: Bibliometrics; Nanomedicine; Emerging technologies; BERT; Topic modeling

Cite this article

Yifan Wang, Xiaoping Liu, Xiang-Li Zhu. Enhancing emerging technology discovery in nanomedicine by integrating innovative sentences using BERT and NLDA[J]. Journal of Data and Information Science, 2024, 9(4): 155-195. DOI: 10.2478/jdis-2024-0031

1 Introduction

With breakthroughs in materials science, nanotechnology has been widely used in various industries. Nanotechnology, defined as the study and application of structures ranging in size from 1 to 100 nanometers (nm), has received significant attention (Rawat et al., 2006). Notably, nanoparticles have distinctive optical, electronic, and structural properties that distinguish them from macromolecules (Sinha et al., 2006) and offer unique capabilities at cellular, atomic, and molecular levels (Rawat et al., 2006). Therefore, nanotechnology has the potential to revolutionize the landscape of medical and healthcare solutions, particularly in disease diagnosis, treatment, and prevention at the cellular level, giving rise to the term “nanomedicine” (Bayford et al., 2017).

Nanomedicine encompasses two fundamental domains: nanomedicine-based diagnostics and nanotechnology-based therapeutics. Implementing nanomedicine enables the early detection, precise diagnosis, effective treatment, and monitoring of infections (Archakov, 2010). Nanotechnology has significantly advanced restorative diagnostics by introducing highly efficient and precise instrumentation for both in vitro and in vivo diagnostics (W. Zhang et al., 2018). Pathologists have used nanoparticles to distinguish malignant growth cells in biopsies by exploiting the distinct light dissipation and absorption spectra of these cells (Zitka et al., 2012). Moreover, nanoparticles are useful in medical imaging because of their comparable size to biomolecules, such as enzymes and receptors, enabling their use as probes for tracking cell motility and molecular changes (Cuenca et al., 2006; Rawat et al., 2006). Consequently, nanoparticle-based molecular imaging probes and contrast agents have become increasingly prominent, surpassing other single-molecule-based contrast agents. Imaging agents encompass various contrast materials, including those with fluorescent and radioactive properties (Bennett et al., 2008). Furthermore, nanoparticles have emerged as effective Magnetic Resonance Imaging (MRI) contrast agents, such as superparamagnetic iron oxide nanoparticles (SPION), which exhibit superior performance compared with conventional superparamagnetic reagents (D. Kim et al., 2018).

Regarding the therapeutic realm of nanotechnology, significant global research has focused on ZnO nanoparticles, single-walled carbon nanotubes, and antibiotic-coated nanoparticles (Baptista, 2014). Notably, an important area in nanomedicine research involves the targeted design of anticancer drugs, exploiting the distinctive features of tumors, such as endothelial cell openings ranging from 200 nm to 1.2 µm, permitting nanoparticle penetration into the extravascular space (Hobbs et al., 1998). Various nanodevices, such as dendrimers, ceramic nanoparticles, and carbon nanotubes, can target cancer cells (Sugiyama, 2015). Functionalization of nanoparticles through monoclonal antibodies or cell surface receptor ligands exploits specific receptors on cancer cell surfaces for targeted drug delivery (Mitra et al., 2006). Ligand-coordinated nanoparticle binding to tumor cell receptors represents a promising approach for particle therapy (Misra et al., 2010). Drug-loaded nanoparticles act as carriers, forming nanocapsules or nanospheres, thereby improving drug delivery (C.-Y. Zhao et al., 2018). Oral administration of zinc oxide nanoparticles has demonstrated efficacy in treating diabetes by affecting serum insulin, glucose tolerance, and blood glucose levels (P. Yang et al., 2023).

Furthermore, nanotechnology has significantly affected stem cell biology research, offering new avenues for regenerative medicine (Sanchez et al., 2011). Recently, diverse nanostructured materials have been used to reconstruct human organs, including bone, muscle, and skin (van Rijt & Habibovic, 2017). Mesenchymal stem cells (MSCs) have demonstrated significant potential for proliferation and differentiation into cardiomyocytes within AuNP-mixed polycaprolactone (PCL) scaffolds, showing promise for myocardial infarction repair (Jain, 2008). In addition, magnetic nanoparticles can isolate and group stem cells, whereas nanocarriers binding to biomolecules can modulate stem cell proliferation and differentiation.

Nanomedicine has demonstrated unique potential and interdisciplinary characteristics in modern medical care, with its transformative impact spanning various fields, including diagnosis, treatment, and regenerative medicine (Shi et al., 2017). Therefore, employing bibliometric tools to monitor research trends and scientific results in this field is essential (Betker et al., 2013). Identifying emerging technologies in the field constitutes an imperative strategy to afford countries and businesses opportunities to surmount innovation challenges, achieve cutting-edge technological breakthroughs, and cultivate growing markets (Ahmed et al., 2017). The identification of emerging technologies in the field of nanomedicine not only summarizes the current technological status but also highlights emerging technological fields that may influence nanomedicine. This process thus facilitates the discovery of technologies that urgently require development, providing superior solutions for timely intervention and improved clinical outcomes (Dundar et al., 2020a).

Numerous experts and scholars in the field of bibliometrics have contributed to the discovery of emerging technologies. For example, the Foresight and Understanding from Scientific Exposition (FUSE) project, funded by the Intelligence Advanced Research Projects Activity (IARPA) in 2011, aimed to identify emerging topics from scientific, technological, and patent literature (McKeown et al., 2016; Porter et al., 2019). Other scholars have focused on emerging technology forecasting in the medical-device industry (AlSumait et al., 2009). Boyack et al. (2014) used citation analysis to identify technology topics in two nanotechnology domains. Chen et al. (2018) detected and visualized emerging trends and transient patterns in scientific literature. Ke (2017) redefined the metrics of fundamentality and novelty to investigate the technological impacts of biomedical research. M. Kim et al. (2018) used a data mining approach to compare the emergence of industrial robots in various countries and assess discrepancies between nations. Jiang et al. (2024) proposed an innovative lexical-level framework to identify and understand emerging technologies (ETs) from the perspective of Technical Knowledge Flow (TKF); they constructed and interpreted multidimensional indicators reflecting the attributes of ETs by comprehensively analyzing knowledge absorption, growth, and diffusion, thereby evaluating emergence scores at the terminology level and identifying ETs accurately. Using deep learning and weak signal analysis, Ebadi et al. (2022) proposed a multi-layer quantitative approach to help strategic planners and domain experts better identify and monitor emerging technology trends. Wang et al. (2024) measured the innovative development path of emerging technologies by improving triple reference networks, proposing a compression method of reference networks from the perspective of data sets, and using technical interference degree as the main path to extract reference network importance criteria. Li et al. (2019) proposed a novel approach to identifying emerging topics in science and technology by combining two large-scale scientific literature models based on direct citation and co-citation, and using a difference function to reward emerging and rapidly growing clusters in order to nominate emerging topics. These findings emphasize the role of scientific publications as surrogates for scientific research and technological development (Xu et al., 2018).

Nevertheless, most studies have focused on quantifying publications and citation networks (Huang et al., 2022) or analyzing abstracts or titles using text-mining techniques (Abbasi et al., 2022; Ding et al., 2022). Only a few scholars have conducted analyses by examining the full text of the literature (Cao et al., 2023; Richard et al., 2022). Full-text analysis captures details and complexities in an article that are often overlooked or only briefly mentioned in titles and abstracts. By analyzing the full text, researchers are able to gain an in-depth understanding of the article’s core arguments, research methodology, experimental results, and their interpretations, resulting in a richer and more precise understanding of the research topic (Tosatto et al., 2022). This study utilizes advanced deep-learning techniques to extract relevant information from full-text literature. Subsequently, we applied Noiseless Latent Dirichlet Allocation (NLDA) topic modeling to the refined information and established metrics to identify emerging technologies (Churchill & Singh, 2021).

In general, the innovations of this research are primarily encapsulated in the following dimensions:

Firstly, this study proposes a technology framework grounded in innovative content extraction, utilizing SciBERT for the classified extraction of key information and identifying emerging technologies through NLDA modeling of the extracted data. This holistic approach eliminates redundant data, enhancing both the explanatory power and accuracy of the topic model.

Secondly, utilizing full-text processing of scientific articles, this study diverges from traditional methodologies that prioritize abstracts or titles, thereby uncovering cutting-edge advances in nanomedicine. This in-depth textual analysis facilitates a comprehensive discovery of emerging technologies by capturing nuanced details and core essences of research potentially overlooked in summaries or headlines.

Additionally, this study develops multidimensional metrics for assessing technological emergence, incorporating measures of novelty, innovation, growth, impact, and thematic intensity. Used collectively, these indicators quantify the emergence of new technological domains, offering a comprehensive perspective on technological advances in nanomedicine.

Finally, the study incorporates a cross-country analysis and a time series analysis to examine the global distribution and identify decade-long trends in innovative nanomedical research, respectively. These analyses yield valuable insights into the geographical and temporal dynamics of innovation within nanomedical research.

The remainder of this paper is structured as follows. Section 2 reviews related work. Section 3 introduces a framework based on information extraction (IE) and topic modeling. Section 4 presents the experimental results and discusses the nanomedicine literature dataset for 2013-2023. Section 5 summarizes the work and provides policy implications.

2 Related Work

2.1 Information extraction

Numerous studies have emphasized the significant impact of the Internet’s rapid expansion on the scientific and technical literature across all domains over the years (Nasar et al., 2018; C. Zhang et al., 2022). A report by the International Association of Scientific, Technical, and Medical Publishers revealed that the number of published research articles has increased by an annual average of 4%-5%, and approximately 2.2 million new scientific articles were published in 2016 (Canagarajah, 2022; Khanna et al., 2022). Consequently, extracting essential insights from an extensive body of literature while disregarding background information has become an important concern. Boegel et al. (2022) characterized this endeavor as Information Extraction (IE). Moreover, numerous studies have highlighted the increasing application of IE techniques in various fields, such as medicine, genetics, and other biological sciences (Seeger et al., 2022).

Among the various methodologies, Arshamian et al. (2013) adopted a rule-based approach to extract core ideas from research papers. In contrast, Gupta and B. Hu et al. (2014) applied sentence matching, dependency trees, and seed rules to identify focal points, techniques, and domains. Machine learning methods have gained prominence in recent years. Schneider et al. (2018) used preposition disambiguation with naive Bayes classification for entity extraction. I. C. Kim et al. (2014) proposed a support vector machine-based framework that utilized a linear kernel function during training to extract sentence-level key insights. Several scholars have adopted a structured learning approach with residuals to process a corpus of 400 articles from ACM and ACL using various training algorithms, including Perceptron, AdaGrad, and SVM (Bianchi et al., 2016; Peng et al., 2014).

Experts in this field have adopted diverse methodologies for IE. However, existing studies have primarily focused on abstracts, with a limited exploration of full-text articles. Therefore, a significant contribution of this study is the utilization of an advanced deep-learning approach. Herein, the BERT model was used for comprehensive IE from all articles instead of the conventional practice of abstract-based extraction, marking an innovation (Song et al., 2023).

2.2 Topic model

Topic models represent a category of unsupervised machine learning techniques devised to elucidate the fundamental topics inherent in a collection of documents (Gruen & Hornik, 2011; Jelodar et al., 2019; Zhao et al., 2011). They demonstrate exceptional proficiency in condensing a vast corpus of documents into a concise summary represented as a topic or cluster of related terms.

The inception of topic modeling can be traced back to the 1990s, when Letsche & Berry (1997) introduced a Latent Semantic Indexing (LSI) model, demonstrating how latent semantic analysis can be used in the automated indexing and retrieval of documents from vast databases. Subsequently, in 1999, Thomas Hofmann made a significant contribution to the field of topic modeling by introducing Probabilistic Latent Semantic Indexing (pLSI). Latent Dirichlet Allocation (LDA), proposed by Blei et al. in 2001, has emerged as a seminal contribution. LDA draws inspiration from pLSI (Blei et al., 2003) and develops a unique generative model based on the Dirichlet distribution (Nigam et al., 2000). This model features more comprehensive assumptions regarding text generation than its counterparts, rendering LDA the most widely used probabilistic topic model.

LDA has undergone numerous adaptations and applications in various facets of natural language processing. Jelodar et al. (2019) thoroughly investigated LDA and its variations in their survey of LDA-based topic models. For example, Lafferty and Blei developed a Correlated Topic Model (CTM) to address the challenges of interrelated topics (Blei & Lafferty, 2007). Recently, topic modeling has been improved by integrating Word2Vec and other word-embedding models with innovative natural language processing techniques (Zhang et al., 2018). Hu et al. (2018) presented the Biterm Topic Model (BTM), designed for short texts like tweets and social media posts, demonstrating exceptional performance in short text modeling. Fu et al. (2018) introduced the Embedding Topic Model (ETM), where words and topics are represented as vectors within the embedding space. Thompson (2020) later suggested using the BERT language model to generate topics by applying k-means clustering to tokens derived from BERT-extracted context vectors. Furthermore, Churchill & Singh (2021) proposed a Topic-Noise Model that integrated the modeling of topic and noise distributions, resulting in superior noise filtration. They combined a Topic-Noise Discriminator (TND) and LDA to formulate an integrated Topic-Noise Model, known as NLDA, which demonstrated improved consistency compared to LDA. This is supported by quantitative and qualitative assessments of topic interpretability, facilitated by illustrative topics and human assessments (Churchill & Singh, 2023, 2022b).

The NLDA model was used in this study to process the text data obtained through IE. This approach effectively filters out extraneous text information, producing more interpretable topics for human understanding. Subsequently, the supporting documents for each topic were identified based on the document-probability matrix generated by this model.
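
The step of identifying supporting documents from a document-probability matrix can be illustrated with a minimal sketch. The text does not state the assignment rule, so the sketch assumes that each document supports the single topic for which it has the highest probability; the function name `supporting_documents` and the toy matrix are hypothetical.

```python
def supporting_documents(doc_topic, doc_ids):
    """Group document ids by their highest-probability topic.

    doc_topic: list of rows, one per document, each a list of topic probabilities.
    Assumption: a document supports only its single most probable topic.
    """
    n_topics = len(doc_topic[0])
    support = {k: [] for k in range(n_topics)}
    for doc_id, row in zip(doc_ids, doc_topic):
        best_topic = max(range(n_topics), key=row.__getitem__)
        support[best_topic].append(doc_id)
    return support


# Toy document-probability matrix: 3 documents, 3 topics.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.3, 0.2],
]
support = supporting_documents(doc_topic, ["A", "B", "C"])
print(support)
```

A probability threshold instead of the maximum would let one document support several topics; which rule fits better depends on how sharply the model separates topics.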

2.3 Identifying technology emergence

Academic approaches to identifying emerging technologies involve qualitative methods based on specialized knowledge and quantitative text analysis (Bishop, 2006). Technical approaches include inventive problem-solving theory, scenario management, bibliometric analysis, and data mining (Arora et al., 2013; Breitzman & Thomas, 2015; Holmes & Ferrill, 2005; Keenan, 2003; Li et al., 2019; Noh et al., 2016). Various studies have focused on developing metrics for emerging technologies from multiple perspectives. Rotolo et al. (2015) refined five attributes to characterize emerging technologies: radical novelty, rapid growth, coherence, prominent impact, and the presence of uncertainty and ambiguity. Scholars have identified emerging technology and emergent topics by analyzing citations and co-citations in scientific data, highlighting the novelty and growth of emerging technologies (Sidaway, 2020; Vatanasakdakul et al., 2023). Other scholars have utilized publication and author counts to identify burgeoning areas in science (Long et al., 1980; Uddin & Khan, 2016; D. Zhao & Strotmann, 2011). Chakraborty & Pradeep (2017) introduced a model for detecting emerging clusters founded on four identifying characteristics: the public sector ratio, scientific index, originality index, and reference index. Other attributes developed by scholars encompass data validity, data availability, implementation cost, ease of use, and methodological adaptability (Dash et al., 2007; Gokhberg et al., 2013; Guderian, 2019; Lee et al., 2018).

Furthermore, frameworks rooted in time series evolution analysis are employed in identifying emerging technologies, aiming to comprehend the dynamic nature of technological progress. For instance, Xu et al. (2021) developed a framework employing thematic models to predict the emergence of new technologies through the analysis of sequential text data. Shen et al. (2020) proposed a deep learning model, SeriesNet, which predicts future technology trends by learning from time series data at multiple ranges and levels, utilizing sophisticated analytical techniques. Mercer and Keogh (2022) developed another predictive model for monitoring outlier behavior in time series, facilitating the discovery of new technologies in industrial and natural sciences. Additionally, Ebadi et al. (2022) combined deep learning with weak signal analysis in a multi-layer quantitative approach to predicting technological trends, including advances in the field of hypersonic motion.

In summary, methods for identifying emerging technologies exhibit significant diversity. In this context, our research has developed discovery metrics for emerging technologies in nanomedicine. These metrics stem from information extracted from technology topic support documents through topic modeling, ultimately facilitating the identification of emerging technologies.

3 Methodology

This section elucidates four dimensions of research pertaining to the discovery of emerging technologies.

The first dimension involves data preprocessing, including acquisition, partitioning, cleansing, and textual annotation. Data were obtained from established repositories of scientific literature in biomedicine. These raw data are transmuted through meticulous curation into a structured corpus amenable to computational recognition. A subset of this data was annotated manually by specific rules in anticipation of the subsequent model training and processing.

The second dimension involves the extraction and condensation of important information. In the knowledge extraction process, a machine learning model was trained on the annotated data obtained from the preliminary phase. Subsequently, this model was used to predict crucial insights embedded within the entirety of the literature data. These distilled fragments of significant knowledge were then grouped according to the article from which each fragment originated. This undertaking was executed to lay the foundation for ensuing model recognition.

The third dimension focuses on identifying technology topics. The key knowledge corpus produced in the previous phase serves as the substratum. Potential technology topics are identified from this corpus using a combination of expert interviews and topic modeling of the relevant literature. This process encompasses the thorough elimination of unnecessary terminology.

The final dimension involved developing and quantifying indicators relevant to emergent technological topics. Indicators are devised through an in-depth review of the relevant literature. Subsequently, these sub-indicators were computed for each thematic facet and combined into composite indicators. This cumulative metric enhances the ability to identify emergent technological topics, thereby facilitating the formulation of judicious policy directives and strategies for technological advancements.

3.1 Data collection

We obtained data from PubMed Central (PMC) and Web of Science (WOS) databases. PMC is an open-access archive containing full-text biomedical and life science journal articles from the National Library of Medicine (NLM) of the National Institutes of Health (NIH) (Roberts, 2001). Currently, it comprises an extensive collection of thousands of journal articles, totaling over 8 million full-text articles, spanning centuries of research in biomedical and life sciences from the late 1700s to the present (Hsiao & Torvik, 2023; Lin et al., 2023). This content comprised articles published in scholarly journals, peer-reviewed author manuscripts, and preprint versions of articles publicly available before peer review (Michaleff et al., 2011). Therefore, we used this database for our comprehensive literature dataset.


Figure 1. A framework of data acquisition and analysis.

The WOS is a valuable resource for citation data in bibliometric analyses (Falagas et al., 2008). Furthermore, WOS offers bibliometric software to produce comprehensive statistical insights (Harzing & Alakangas, 2016). Therefore, this database is frequently used by researchers as a primary resource for extensive bibliometric analysis.

Drawing on the research methods of past scholars, we established the time span for the data collection process as 2013-2023, as this period witnessed a 7.39-fold increase in global annual publications, from 1,851 to 13,683, offering a rich resource for research (Tao et al., 2023). The data collection process consisted of two distinct phases. In the initial phase, a search was conducted in PMC using the keyword “nanomedicine”. Web crawlers were developed using Python Selenium and the Beautiful Soup library to retrieve and analyze HTML content. This process facilitated the extraction of relevant information from the literature, including titles, full texts, and DOI numbers. Notably, we extracted content from full-text articles after the abstract and before the references. The extracted content formed the basis of our text database. Subsequently, we used a similar approach in WOS using the Science Citation Index Expanded (SCI-Expanded) journal categories. The subsequent dataset comprises titles, authors, keywords, publication years, and DOI numbers. In addition, the dataset incorporates supplementary metrics, including the number of citations, citation density (number of citations per year), affiliations, countries of affiliation, journals of publication, and associated impact factors (IF) based on the Journal Citation Reports. This information is brought together in our indicator database.

3.2 Data preprocessing

The text database and indicator database obtained in the previous step were preprocessed as follows. First, the two databases were matched based on the DOI numbers of the articles. After matching, 2,083 full-text articles and their corresponding indicators were obtained. The articles were then split into sentences using the NLTK package. An automated sentence-splitting toolkit was preferred over simple punctuation-based splitting because it splits sentences based on context and yields more precise results (Schmitt et al., 2019; Yao, 2019). Sentences containing only formulas and special symbols were removed, and the remaining sentences (totaling 673,642) were stored in a database and linked back to their source literature. We used the criteria described in Table 1 to distinguish between academically innovative and non-innovative sentences.
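
The DOI matching and symbol-only filtering described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the record layout, the function names, and the heuristic for detecting formula-only sentences (no run of at least three letters) are assumptions, and sentence splitting with NLTK is taken as already done.

```python
import re


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip a resolver prefix so both sources compare equal."""
    doi = doi.strip().lower()
    return re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)


def match_by_doi(text_records, indicator_records):
    """Keep only articles present in both databases, merged on the normalized DOI."""
    indicators = {normalize_doi(r["doi"]): r for r in indicator_records}
    matched = []
    for rec in text_records:
        key = normalize_doi(rec["doi"])
        if key in indicators:
            matched.append({**indicators[key], **rec, "doi": key})
    return matched


def has_words(sentence: str, min_letters: int = 3) -> bool:
    """Assumed heuristic: a sentence with no run of `min_letters` letters
    is treated as consisting only of formulas or special symbols."""
    return re.search(r"[A-Za-z]{%d,}" % min_letters, sentence) is not None


# Hypothetical records; the sentences are assumed to be pre-split.
texts = [
    {"doi": "https://doi.org/10.1000/ABC",
     "sentences": ["Gold nanoparticles improved uptake.", "E = mc^2", "± 0.05 (%)"]},
    {"doi": "10.1000/zzz", "sentences": ["Unmatched article."]},
]
indicators = [{"doi": "10.1000/abc", "citations": 12}]

matched = match_by_doi(texts, indicators)
kept = [s for s in matched[0]["sentences"] if has_words(s)]
print(len(matched), kept)
```

Normalizing the DOI before joining avoids losing matches to differences in case or resolver prefixes between the two sources.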

Table 1. Criteria for distinguishing sentences.

Sentences with academic innovation	Sentences that lack innovation
Introduce new ideas, unique approaches, or novel methods.	Furnish common background information or definitions.
Display original discoveries, experimental results, or undocumented data.	Elucidate routine procedures and methods that are commonly used in this field.
Propose unique applications, materials, or techniques.	Provide a summary of well-established information or results available in the literature.

A sentence database consisting of 2,276 innovative utterances and 7,724 non-innovative utterances was obtained after labeling. This dataset was used to train the upcoming deep-learning models.

3.3 Information extraction (IE)

It has been empirically demonstrated that fine-tuned BERT models produce superior results across various natural language processing tasks, such as text categorization and automated question-answering (Devlin et al., 2019). The advantages of using BERT include accelerated development, reduced data prerequisites, and improved outcomes (Bello et al., 2023; Reimers & Gurevych, 2019). Therefore, we selected the fine-tuned BERT model to categorize sentences as innovative or non-innovative in the context of IE from an innovative perspective.

To optimize outcomes while saving valuable training time, it is imperative to make an informed decision regarding the pre-trained model. SciBERT was used as the pre-trained model in our study; it was initially trained on a random sample of 1.14 million papers from Semantic Scholar (Lobanova et al., 2023). This corpus comprised 18% papers from computer science and 82% from various biomedical fields. Using the pre-trained parameters of this model as initial weights not only conserves time and resources but also improves performance by transferring its generalized linguistic representations, obtained from extensive text data, to subsequent classification tasks (Shen et al., 2023; Thierry et al., 2023). Consequently, it achieves state-of-the-art performance across a wide spectrum of scientific domain natural language processing (NLP) tasks. Further evaluation details can be found in the associated paper, with the evaluation code and data available in the associated repository.

Our approach to fine-tuning BERT based on the pre-trained model unfolds as follows: Initially, the model was implemented using the Keras package and integrated with the SciBERT pre-trained model. Subsequently, we thoroughly preprocessed the input data, involving the tokenization of the text and its conversion into three essential embedding vector representations: word, position, and type embeddings. These embedding vectors are fundamental components that enable the BERT model to comprehend the text.

We tuned several hyperparameters during the training phase to ensure training effectiveness; Table 2 lists the final values. The learning rate of the optimizer was set to 2e-5, a standard value for fine-tuning pre-trained models. The batch size was set to 32 to expedite the training process, and eight iterations were performed to ensure that the model converged.

The standard BERT model accepts a maximum sequence length of 512 tokens, so longer inputs were handled explicitly. Text longer than 512 tokens but shorter than 1,024 tokens is truncated, and only the first 512 tokens are retained. This approach is simple and efficient, but it may exclude potentially critical information in the second half of the text. We therefore adopted a different strategy for texts with more than 1,024 tokens: the text is divided into multiple segments, each containing no more than 512 tokens, and each segment is processed independently. Text shorter than 512 tokens is padded to the required length of 512 tokens. Together, truncation, chunking, and padding give every input a consistent length while retaining as much semantic information as possible.

For sentence classification, a fully connected layer was added after the final layer of the BERT model; it selects the output category with the highest probability. We used softmax rather than sigmoid as the activation function because of scalability concerns. The fully connected layer takes as input the multidimensional vector of the [CLS] token obtained after training. Unlike the pre-training [CLS] token, which contains no information, the post-training token encapsulates information about the entire current utterance (Jeong et al., 2020). To assess the performance and generalization capability of the model, we used ten-fold cross-validation. Figure 2 illustrates that the base BERT model consists of 12 transformer layers and 12 attention heads.
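The length-handling rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name `prepare_segments`, the padding id `PAD_ID`, and the treatment of texts of exactly 512 or 1,024 tokens are assumptions, and special tokens such as [CLS] and [SEP] are ignored for brevity.

```python
from typing import List

MAX_LEN = 512           # maximum sequence length of standard BERT
CHUNK_THRESHOLD = 1024  # texts longer than this are split instead of truncated
PAD_ID = 0              # hypothetical padding token id (assumption)


def prepare_segments(token_ids: List[int]) -> List[List[int]]:
    """Return one or more fixed-length segments for a tokenized text."""
    n = len(token_ids)
    if n <= MAX_LEN:
        # Short text: pad on the right up to MAX_LEN.
        return [token_ids + [PAD_ID] * (MAX_LEN - n)]
    if n <= CHUNK_THRESHOLD:
        # Medium-length text: keep only the first MAX_LEN tokens.
        return [token_ids[:MAX_LEN]]
    # Long text: split into consecutive chunks of at most MAX_LEN tokens and
    # pad the last chunk so that every segment has the same length.
    segments = []
    for start in range(0, n, MAX_LEN):
        chunk = token_ids[start:start + MAX_LEN]
        segments.append(chunk + [PAD_ID] * (MAX_LEN - len(chunk)))
    return segments
```

Under these assumptions, a 1,300-token text yields three segments, whereas an 800-token text yields a single truncated segment.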

Table 2. BERT model hyperparameterization.

Hyperparameter	Values	Description
batch_size	32	Number of samples per batch during training
max_seq_length	500	Maximum length of input sequences
learning_rate	2e-5	Learning rate, subject to tuning
train_epochs	20	Number of training epochs
warmup_proportion	0.1	Proportion of total training steps for learning rate warm-up
adam_epsilon	1e-8	Epsilon value for the Adam optimizer
dropout_prob	0.1	Dropout probability in dropout layers


Figure 2. A framework of the BERT model.

During the prediction phase, the model with the highest F1 score was selected. Following IE, we aggregated all sentences predicted as innovative by article, yielding a refined version of the literature content.

3.4 Topic identification

Noiseless Latent Dirichlet Allocation (NLDA) represents a substantial advance in topic modeling, particularly in addressing the brevity and noise prevalent in short text data (Churchill & Singh, 2021). NLDA enhances topic modeling by integrating the traditional latent Dirichlet allocation (LDA) framework with a noise-filtering model, the Topic Noise Discriminator (TND). Its principal advantage lies in its capacity to simultaneously identify meaningful topic content and separate it from noise, thereby generating a more coherent and differentiated topic distribution. In addition to identifying topic content, NLDA constructs a detailed noise distribution through which random and context-specific noise are systematically filtered out, significantly enhancing the purity and interpretability of the topics (Churchill & Singh, 2023). In particular, NLDA works with any generative topic model through a pre-trained TND noise distribution to enable a two-tier analysis. This technique refines the topic output while retaining the high-quality topic structure inherent in LDA, thereby generating topics that are both diverse and reflective of the underlying data (Churchill & Singh, 2022b).

Detailed architectural depictions of the LDA and TND models are presented in Figure 3. In our notation, D denotes a dataset comprising M documents or posts represented by $D=\left\{d_{0}, d_{1}, \ldots, d_{M-1}\right\} $. Each document d is a compilation of N words, expressed as $d=\left\{w_{0}, w_{1}, \ldots, w_{N-1}\right\}$. A topic, denoted by t, encompasses a set of λ words, expressed as $t=\left\{w_{0}, w_{1}, \ldots, w_{\lambda-1}\right\}$, with these words characterized by their coherence and interpretability. The collection of topics is encapsulated in a topic set, $T=\left\{t_{0}, t_{1}, \ldots, t_{k-1}\right\}$, whereas a complementary noise set, $H=\left\{w_{0}, w_{1}, \ldots, w_{p-1}\right\}$, is used to detect the presence of the noisy elements.


Figure 3. LDA (a) and TND (b) graphical models.

In contrast to the comprehensive generative process for LDA elucidated by previous scholars (Blei et al., 2003), we present a concise overview of high-level generative processes using our notation.

For each document d in D:

(1) Stochastically determine the word count N for d.

(2) Sample the topic distribution θ from the Dirichlet distribution, conditioned on the parameter α.

(3) For each word w_i where 0 ≤ i < N:

(a) Draw a topic z_i from the distribution θ.

(b) Select a word w_i according to the conditional probability of w_i given the topic z_i, conditioned on the parameter β.
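The generative steps above can be illustrated with a short standard-library sketch. It is not the authors' code: the function names are hypothetical, the word count is drawn uniformly around `mean_length` (the text only states that N is chosen stochastically), and the topic-word distributions are assumed to have been drawn beforehand from a Dirichlet distribution with parameter β.

```python
import random
from typing import List


def sample_dirichlet(alpha: List[float], rng: random.Random) -> List[float]:
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word: List[List[float]], vocab: List[str],
                      alpha: float, mean_length: int,
                      rng: random.Random) -> List[str]:
    """Generate one document following steps (1) to (3)."""
    k = len(topic_word)
    # (1) Choose the word count N (uniform around mean_length: an assumption).
    n_words = rng.randint(max(1, mean_length - 2), mean_length + 2)
    # (2) Sample the document's topic distribution theta ~ Dirichlet(alpha).
    theta = sample_dirichlet([alpha] * k, rng)
    words = []
    for _ in range(n_words):
        # (3a) Draw a topic z_i from theta.
        z = rng.choices(range(k), weights=theta)[0]
        # (3b) Draw a word w_i from the word distribution of topic z_i.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return words


rng = random.Random(0)
vocab = ["nanoparticle", "tumor", "polymer", "graphene"]
# Topic-word distributions, each drawn from Dirichlet(beta) with beta = 0.5.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(3)]
print(generate_document(topic_word, vocab, alpha=0.5, mean_length=10, rng=rng))
```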

Importantly, TND not only generates topics but also provides a valuable noise distribution that can be readily adapted to work with other topic models (Churchill & Singh, 2023). Using a pre-trained topic model that approximates topics through topic word distributions, we can use the pre-trained noise distribution from TND in the integration process to accomplish probabilistic removal of noisy words similar to the process in TND. The generative process for the TND can be summarized as follows:

For each document d in D:

(1) Stochastically determine the word count N for d.

(2) Sample the topic distribution θ from the Dirichlet distribution, conditioned on the parameter α.

(3) For each word w_i, where 0 ≤ i < N:

(a) Draw a topic z_i from the topic distribution θ.

(b) Select a word either from z_i or from the noise distribution H, as dictated by the beta distribution and conditioned on the parameter α.

(c) If drawing from z_i, choose w_i according to the conditional probability of w_i given the topic z_i, conditioned on the parameter β₀.

(d) If drawing from H, select w_i based on the conditional probability of w_i given H, and conditioned on parameter β₁.
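The topic-versus-noise decision in step (b) is the only structural difference from the LDA process, and it can be sketched for a single word as follows. This is an illustrative reading rather than the TND implementation: the function name and the Beta parameters `switch_a` and `switch_b` are hypothetical, and the sketch assumes that a probability is first drawn from the Beta distribution and then used for a coin flip.

```python
import random
from typing import List, Tuple


def generate_tnd_word(theta: List[float], topic_word: List[List[float]],
                      noise_word: List[float], vocab: List[str],
                      switch_a: float, switch_b: float,
                      rng: random.Random) -> Tuple[str, str]:
    """Generate one word and report whether it came from a topic or from noise."""
    # (a) Draw a topic z_i from the document's topic distribution theta.
    z = rng.choices(range(len(theta)), weights=theta)[0]
    # (b) Decide between topic and noise: draw a probability from a Beta
    #     distribution and flip a coin with it (parameterization assumed).
    p_topic = rng.betavariate(switch_a, switch_b)
    if rng.random() < p_topic:
        # (c) Topic word: draw w_i from the word distribution of topic z_i.
        return rng.choices(vocab, weights=topic_word[z])[0], "topic"
    # (d) Noise word: draw w_i from the noise distribution H.
    return rng.choices(vocab, weights=noise_word)[0], "noise"
```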

Within NLDA, we combine the noise distribution produced by TND with LDA to develop an LDA variant that exhibits a reduced incidence of noisy words in its topics.

To implement NLDA, the noise distribution H was trained on the dataset D using TND, whereas the LDA model was trained on a dataset D², which was not necessarily identical to D. Subsequently, we fused the noise distribution from TND with the topic-word distribution from LDA to produce a topic set. For each topic t within T, we decide whether to exclude a word w_i by drawing from a beta distribution parameterized by the frequencies of the word in the LDA topic distribution and in the noise distribution, analogous to the way TND distinguishes topic words from noise words. To make the noise distribution more transferable across different LDA parameter settings, we incorporated a topic weight parameter σ into the beta distribution calculation to downsample or oversample the noise distribution. Equation 1 demonstrates how σ is used to scale the noise distribution based on k, the number of topics in the LDA model.

(1)$\operatorname{Beta}\left(\sqrt{\theta_{t}^{i}+\beta_{1}}, \sqrt{H_{i}(\sigma / k)}\right)$
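To make Equation 1 concrete, the sketch below applies it to the candidate words of one topic. It is a hedged illustration, not the gdtm implementation: the function name, the rule of keeping a word when the Beta draw is at least 0.5, and the small floor applied to words that are absent from the noise distribution are all assumptions.

```python
import math
import random
from typing import Dict, List


def filter_topic(topic_counts: Dict[str, float], noise_counts: Dict[str, float],
                 beta1: float, sigma: float, k: int, rng: random.Random,
                 keep_threshold: float = 0.5) -> List[str]:
    """Keep the words of one topic that survive the Beta draw of Equation 1."""
    kept = []
    for word, theta_ti in topic_counts.items():
        # First Beta parameter: sqrt(theta_t^i + beta_1).
        a = math.sqrt(theta_ti + beta1)
        # Second Beta parameter: sqrt(H_i * sigma / k); the floor keeps it
        # positive for words missing from the noise distribution (assumption).
        b = math.sqrt(max(noise_counts.get(word, 0.0) * (sigma / k), 1e-6))
        # Keep the word when the draw favors the topic (threshold assumed).
        if rng.betavariate(a, b) >= keep_threshold:
            kept.append(word)
    return kept
```

Increasing σ relative to k enlarges the second parameter and therefore removes more words, which corresponds to oversampling the noise distribution.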

Taken together, NLDA offers a practical and efficient solution to the long-standing challenge of noise in topic modeling. It separates noise from topic content through a refined model design, which enables researchers to capture and analyze the hidden topics in text data more clearly. This motivated our choice of NLDA for topic identification.

Following the initial IE step, we thoroughly processed the textual data extracted from the paper during this phase of the experimental workflow. This data served as our primary source for conducting comprehensive topic modeling using the NLDA algorithm. Our data preparation protocol involves two critical steps: word segmentation and comprehensive data cleansing. Word segmentation involves dividing the text into individual words or tokens, facilitating subsequent analysis. Data cleansing procedures were used to remove inconsistencies, irregularities, or extraneous elements from the text to ensure the highest data quality.

The NLDA algorithm was implemented using the “gdtm” package, a sophisticated utility for topic modeling tasks. To ensure the reliability and robustness of our findings, the algorithm was executed 20 times. In each run, we varied the number of topics from 2 to 50, covering a wide spectrum of topic configurations. This comprehensive exploration enabled us to evaluate the model’s performance across various topic scenarios systematically.

After identifying the optimal number of topics, we applied the NLDA model to preprocessed textual data under this configuration. This step produced two important outputs: the topic-word matrix, providing insights into the word distribution within each topic, and the document-topic matrix, unveiling the associations between individual documents and topics. These matrices served as the fundamental component of our topic modeling framework, facilitating the extraction of meaningful and interpretable insights from the textual data extracted from the paper.

3.5 Emerging technology identification

The NLDA model can reveal significant latent topics by removing noisy words. This prompts the question of how emerging topics can be extracted from these latent ones. Previous studies have revealed that emerging topics generally exhibit three key characteristics: novelty, growth, and impact (S. Xu et al., 2021). This study uses originality and innovation to represent novelty, employs growth rate to reflect growth, and assesses topic strength and significance to evaluate impact. Given that the NLDA model can provide a document-topic probability matrix, we first evaluated η_j, which represents the average weight of each topic j. A document is classified as a supporting document for a given topic when the weight of the topic j in the document d exceeds η_j.
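The supporting-document rule can be expressed directly on the document-topic matrix. The following sketch is illustrative only: the function name is hypothetical, and η_j is assumed here to be averaged over all documents in the matrix.

```python
from typing import List


def supporting_documents(doc_topic: List[List[float]]) -> List[List[int]]:
    """For each topic j, return the indices of documents whose weight exceeds eta_j."""
    n_docs = len(doc_topic)
    n_topics = len(doc_topic[0])
    support = []
    for j in range(n_topics):
        # eta_j: average weight of topic j over all documents (assumption).
        eta_j = sum(row[j] for row in doc_topic) / n_docs
        # A document supports topic j when its weight for j exceeds eta_j.
        support.append([d for d, row in enumerate(doc_topic) if row[j] > eta_j])
    return support
```

For example, with three documents weighted [0.8, 0.2], [0.4, 0.6], and [0.3, 0.7] over two topics, only the first document supports the first topic, while the second and third support the second topic.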

a. Innovation

More recent research is generally regarded as more innovative. Thus, the average publication time of the papers supporting a topic indicates its Innovation (Small et al., 2014). This calculation is outlined in Equation 2, where N_j signifies the degree of Innovation of the topic j, T_i represents the publication time of paper i, and N indicates the total number of papers supporting the topic.

(2)$N_{j}=\sum_{i=1}^{N} \frac{T_{i}}{N}$

b. Originality

Highly novel research findings should contain more valuable elements (Keenan, 2003). Therefore, Originality can serve as an indicator of a topic’s novelty. It is computed using Equation 3, where I_j denotes the Originality of the topic j, A_i represents the number of innovative sentences in the paper i, and N is the total number of papers supporting the topic.

(3)$I_{j}=\sum_{i=1}^{N} \frac{A_{i}}{N}$
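Equations 2 and 3 are both simple averages over the papers supporting a topic. A minimal sketch, with hypothetical function names and invented input values used purely for illustration:

```python
from typing import List


def innovation(publication_years: List[float]) -> float:
    """Equation 2: mean publication time of the papers supporting a topic."""
    return sum(publication_years) / len(publication_years)


def originality(innovative_sentence_counts: List[float]) -> float:
    """Equation 3: mean number of innovative sentences per supporting paper."""
    return sum(innovative_sentence_counts) / len(innovative_sentence_counts)


# Illustrative values only (not taken from the study's data).
print(innovation([2019, 2021, 2023]))
print(originality([80, 90, 70]))
```

A topic whose supporting papers were published more recently receives a higher Innovation value, and a topic whose papers contain more innovative sentences receives a higher Originality value.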

c. Growth

We measure growth by the annual growth rate of the paper output related to the topic. Following a previous study (Noh et al., 2016), and to mitigate the impact of factors such as changes in publication volume over time, we adopted the average annual growth rate of papers over the past three years, as illustrated in Equations 4 and 5. Here, YGR_i represents the growth rate of papers in year i, P_i indicates the publication volume of papers in year i, and G signifies the growth rate of the topic.

(4)$Y G R_{\mathrm{i}}=\frac{P_{i+1}-P_{i}}{P_{i}}$

(5)$G=\frac{\sum_{t=0}^{2} Y G R_{i-t}}{3}$
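Equations 4 and 5 can be sketched as follows, assuming the yearly publication counts of a topic are given in chronological order. The function names and the example counts are hypothetical.

```python
from typing import List


def yearly_growth_rates(counts: List[float]) -> List[float]:
    """Equation 4: growth rate between each pair of consecutive years.

    Raises ZeroDivisionError if a year has no publications.
    """
    return [(counts[i + 1] - counts[i]) / counts[i]
            for i in range(len(counts) - 1)]


def topic_growth(counts: List[float], window: int = 3) -> float:
    """Equation 5: mean of the most recent `window` yearly growth rates."""
    recent = yearly_growth_rates(counts)[-window:]
    return sum(recent) / len(recent)


# Illustrative counts for four consecutive years (not from the study).
print(topic_growth([10, 20, 30, 60]))
```

With counts of 10, 20, 30, and 60, the yearly growth rates are 1.0, 0.5, and 1.0, so G is their mean.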

d. Significance

The significance of scientific topics is primarily reflected in the citations of related papers (Iorfino et al., 2017). However, citations tend to accumulate over time. Therefore, to analyze the influence of scientific research topics on current academic research, we integrated a time weight t_i into the measurement of academic influence, as specified in Equation 6. Here, t_i denotes the time weight applied to the citation frequency of topic-related papers in year i, and n is the number of years the dataset covers. More recent research generally has more influence on the field, so the numerator 2i in Equation 6 makes t_i increase with i, giving the most recent year the highest weight. The denominator n(n+1) is a normalization factor that ensures that the weights of all years sum to 1; because it rescales every weight by the same constant, it does not change their relative proportions. The method for measuring the influence of scientific topics is presented in Equation 7, where B_s represents the academic influence of the topic and C_i represents the citation frequency of papers related to the topic in year i.

(6)$t_{i}=\frac{2 i}{n(n+1)}$

(7)$B_{s}=\sum_{i=1}^{n} t_{i} * C_{i}$
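A short sketch of Equations 6 and 7 shows that the time weights sum to 1 and that later years count more. The function names and citation counts are illustrative assumptions; the citation list is assumed to be ordered from the earliest year (i = 1) to the most recent (i = n).

```python
from typing import List


def time_weights(n: int) -> List[float]:
    """Equation 6: t_i = 2i / (n(n+1)) for i = 1..n."""
    return [2 * i / (n * (n + 1)) for i in range(1, n + 1)]


def significance(citations_by_year: List[float]) -> float:
    """Equation 7: time-weighted sum of yearly citation counts."""
    weights = time_weights(len(citations_by_year))
    return sum(t * c for t, c in zip(weights, citations_by_year))


# For n = 4 the weights are 0.1, 0.2, 0.3, and 0.4.
print(time_weights(4))
# The same total citations score higher when they occur in later years.
print(significance([40, 0, 0, 0]), significance([0, 0, 0, 40]))
```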

e. Topic Strength

Topic intensity is a widely used quantitative index for determining whether a scientific research topic is a research hotspot (Blei et al., 2003). Consistent with Equation 8, it is calculated by dividing the sum of the weights of the topic across the supporting documents by the number of those documents. In Equation 8, η_j denotes the topic intensity value of the topic j, N represents the total number of supporting documents, and $\eta_{j}^{(d)}$ denotes the weight of the topic j in the document d. Topic intensity reflects the importance of a topic in the corpus: a higher average weight per document indicates greater significance. This measure can therefore be used to represent the impact of a particular topic.

(8)$\eta_{j}=\frac{\sum_{d=1}^{N} \eta_{j}^{(d)}}{N}$

To address the significant variation in the indicator levels, we standardized the indicator values using forward deviation standardization, that is, min-max normalization to the range [0, 1], as described in Equation 9, where Y_i indicates the standardized indicator value, X_i is the original indicator value, and X_max and X_min are the maximum and minimum indicator values, respectively.

(9)$Y_{i}=\frac{X_{i}-X_{\min }}{X_{\max }-X_{\min }}$

The individual indicators were combined into a composite index to identify emerging technologies. The Criteria Importance Through Intercriteria Correlation (CRITIC) method has been used extensively in library and information science (Diakoulaki et al., 1995; Peng et al., 2020). CRITIC provides objective weighting by evaluating the contrast intensity of each indicator and its conflict with the other indicators, thereby producing reproducible analyses. An indicator whose values vary more widely and correlate less with the other indicators carries more information and receives a higher weight, as per Equation 10, where CR_j signifies the information content of the indicator j, δ_j denotes the standard deviation of the indicator j, n represents the number of indicators, and r_ij represents the correlation coefficient between indicators i and j. We normalized the information content CR_j, as per Equation 11, to obtain the objective weight W_j of the indicator j. Finally, the comprehensive index of each topic was calculated as the weighted sum of its normalized indicator values. Topics with a higher comprehensive index are considered emerging research topics.

(10)$C R_{j}=\delta_{j} \sum_{i=1}^{n}\left(1-r_{i j}\right)$

(11)$W_{j}=\frac{C R_{j}}{\sum_{j=1}^{n} C R_{j}}$
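Equations 9 to 11 and the final weighted sum can be combined into a compact sketch. It is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the sample standard deviation is used for δ_j, the Pearson coefficient is used for r_ij, and each indicator is assumed to be non-constant.

```python
import math
import statistics
from typing import List


def min_max(column: List[float]) -> List[float]:
    """Equation 9: rescale one indicator to [0, 1] (assumes max > min)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two indicators."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def critic_weights(columns: List[List[float]]) -> List[float]:
    """Equations 10 and 11: information content CR_j, normalized to W_j."""
    info = []
    for col_j in columns:
        # Conflict of indicator j with all indicators: sum of (1 - r_ij).
        conflict = sum(1.0 - pearson(col_i, col_j) for col_i in columns)
        info.append(statistics.stdev(col_j) * conflict)
    total = sum(info)
    return [c / total for c in info]


def composite_index(columns: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum of the normalized indicators for each topic (row)."""
    return [sum(w * col[r] for w, col in zip(weights, columns))
            for r in range(len(columns[0]))]
```

Each element of `columns` holds one indicator across all topics, so five lists of twenty values would correspond to the setting of this study.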

4 Results

4.1 BERT-based classification performance

During this phase of our study, we systematically annotated a dataset of 10,000 sentences, distinguishing between innovative and non-innovative sentences. This dataset served as the training corpus for our model. Throughout the training, we consistently optimized various hyperparameters to improve the metrics. Table 2 lists the finalized BERT hyperparameters.

Subsequently, ten-fold cross-validation was conducted, during which accuracy, precision, recall, and F1 were calculated and averaged. We also implemented a variety of traditional machine learning text classification methods and calculated the same indicators for them. The quantitative results of these evaluations are shown in Table 3.

Table 3. Model evaluation indicators.

Model	Accuracy	Precision	Recall	F1
BERT + Classifier	0.836	0.864	0.853	0.858
TF-IDF + SVM	0.524	0.537	0.530	0.533
TF-IDF + KNN	0.600	0.617	0.593	0.604
TF-IDF + NB	0.582	0.553	0.569	0.560
Word2Vec + SVM	0.650	0.658	0.614	0.635
Word2Vec + KNN	0.696	0.628	0.603	0.615
Word2Vec + NB	0.623	0.609	0.634	0.621
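As a quick consistency check on Table 3, F1 can be recomputed from the reported precision and recall. The sketch assumes that F1 is the harmonic mean of the averaged precision and recall; if it was instead averaged over folds, small rounding differences are to be expected. The function name is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# BERT + Classifier row of Table 3: precision 0.864, recall 0.853.
print(round(f1_score(0.864, 0.853), 3))
```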

These results support the effectiveness of our model in identifying innovative sentences. Compared with the strongest traditional machine learning baseline on each metric, our model is higher by about 14 percentage points in accuracy and by about 21 to 22 percentage points in precision, recall, and F1. We used the optimized model to predict innovative sentences across the full collection of research papers, extracting 101,619 innovative sentences. We then organized these sentences by publication and linked them to the publication metadata. We analyzed the distribution of innovative sentences, focusing on two aspects: the average number of innovative sentences per article and the percentage of innovative sentences per article. Figure 4 illustrates the results.


Figure 4. Innovation sentence distribution trends.

Our analysis revealed that both the number and the percentage of innovative sentences per article were relatively stable. Articles generally contained 70-90 innovative sentences. Over time, the average number of innovative sentences peaked at 91 in 2015 and dropped to 68 in 2017. The proportion of innovative sentences among all sentences fluctuated between 30% and 45%, reaching a peak of 44.00% in 2016 and declining to 34.40% in 2017.

Furthermore, we conducted a cross-country analysis by compiling statistics on the number and proportion of innovative sentences per article. To ensure the robustness of our findings, we excluded countries with insufficient article representation to minimize the impact of potential outliers. Table 4 summarizes the top 20 countries ranked by the number of innovative sentences per article. The United States leads in this regard, with 91 innovative sentences per article, whereas Mexico has the highest proportion of innovative sentences per article, at 47%.

Table 4. Regional distribution of innovation indicators.

Region	Average count	Proportion
USA	91.0	0.35
Australia	89.5	0.40
Poland	87.5	0.33
France	86.5	0.44
Greece	85.5	0.42
Iraq	83.5	0.42
Denmark	83.0	0.38
Jordan	82.0	0.35
Kenya	81.0	0.34
Hungary	81.0	0.38
Malaysia	79.0	0.38
India	75.5	0.39
Mexico	75.5	0.47
Thailand	75.5	0.25
China	74.5	0.40
Germany	74.0	0.41
Nepal	73.5	0.44
Brazil	73.0	0.40
Italy	71.0	0.37
Netherlands	71.0	0.45

4.2 NLDA topic extraction results

The “gdtm” package was used in the present study to build an NLDA model. A series of experiments was performed to determine the optimal hyperparameters for the model. Table 5 presents the results.

Table 5. NLDA model hyperparameterization.

Hyperparameter	Values	Description
num_topics (int)	20	Number of topics
Alpha (int)	50	Alpha parameter of TND
Beta (float)	0.01	Beta parameter of TND
Iterations (int)	200	Number of training iterations
random_seed	54	Random seed to ensure consistent results

To assess the significance and relevance of each topic, we initially identified the first ten topic words associated with each topic. Subsequently, words ranked from positions 11 to 40 were generated. In a concerted effort to ensure the accuracy and validity of our findings, we engaged specialists in the field of nanobiomedicine for their expert assessment. Through this collaborative process, we determined that the optimal number of topics in our study was 20. Tables 6a and b indicate the detailed outcomes, including the topic words and their corresponding weights.

Table 6 (a). NLDA Topic-Word Distribution.

Topic 1: Nanostructures and Polymer Chemistry		Topic 2: Graphene and Nanocomposites in Drug Delivery		Topic 3: Nanomaterial Quality and Surface Properties		Topic 4: Nanoparticles in Cancer Research and Therapy
Polymers	Structures	Bacteria	Diffraction	Nanoparticles	Solution	Nanoparticles	Treatment
Nanoparticles	Nanostructures	Graphene	Surface	Nanomedicines	Surface	Transfection	Delivery
Copolymer	Solution	Nanoparticles	Solution	Information	Factors	Migration	Concentration
Treatment	Polymerization	Spectroscopy	Treatment	Product	Quality	Solution	Exosomes
Surface	Polymerization	Nanocomposites	Coating	Treatment	Molecules	Pathway	Surface
Topic 5: Ultrafast Centrifugation and Nanomaterial Preparation		Topic 6: Antitumor Therapy and Tumor Stem Cells		Topic 7: Nanoparticles and Drug Delivery		Topic 8: Photothermal Therapy and Nanomaterials in Cancer
Microscope	Centrifugation	Survival	Metastasis	Nanoparticles	Solubility	Cancer	Treatment
Supernatant	Reagent	Resistance	Cancer	Curcumin	Solution	Nanoparticles	Conversion
Nanoparticles	Solution	Chemotherapy	Treatments	Encapsulation	Content	Nanosheets	Photothermal
Minutes	Transmission	Antitumor	Cancers	Hydrogel	Surface	Solution	Hypoxia
Microplate	Concentration	Nanoparticles	Doxorubicin	Storage	Treatment	Species	Performance
Topic 9: Antidiabetic Nanodrug Carriers and Biomedical Applications		Topic 10: Animal Studies and Biomedical Protocols		Topic 11: Nanoparticle Surface Characterization and Complex Structures		Topic 12: Inflammation and Biomaterial Interactions in Disease Research
Extract	Insulin	Animals	Staining	Adsorption	Solution	Inflammation	Solution
Nanoparticles	Surface	Biodistribution	Committee	Nanoparticles	Treatment	Nanoparticles	Treatment
Dendrimers	Delivery	Nanoparticles	Liver	Surface	Complexes	Macrophage	Responses
Activities	Treatment	Sections	Solution	Nanotubes	Structures	Collagen	Surface
Solution	Concentration	Group	Surface	Results	Material	Cytokines	Inflammatory

Table 6 (b). NLDA Topic-Word Distribution.

Topic 13: Nanotechnology in Vaccine Research and Infectious Diseases		Topic 14: Liposomes and Receptors in Drug Delivery		Topic 15: Nanoparticles in Materials Science		Topic 16: Nanocarriers for Hormone Release and Antitumor Drugs
Nanoparticles	Treatment	Peptides	Affinity	Nanoparticles	Silicon	Nanocarriers	Loading
Authors	Concentration	Antibodies	Solution	Channel	Treatment	Nanoparticles	Micelle
Science	Diseases	Vesicles	Receptor	Solution	Diffusion	Carrier	Circulation
Solution	Results	Nanoparticles	Membranes	Surface	Results	Dialysis	Prodrug
Nanotechnology	Disease	Conjugation	Treatment	Parameters	Frequency	Antitumor	Solution
Topic 17: Luminescence and Spectroscopy in Nanomaterials		Topic 18: Endocytosis and Cell Signaling with Nanoparticles		Topic 19: Methods for Nanoparticle Preparation and Purification		Topic 20: Nanoparticle Permeation Properties in Oncology Treatments
Emission	Wavelength	Internalization	Pathway	Dispersion	Product	Transport	Circulation
Nanoparticles	Spectrum	Endocytosis	Nucleus	Nanoparticles	Solution	Nanoparticles	Solution
Excitation	Signals	Nanoparticles	Surface	Centrifugation	Calibration	Penetration	Spheroids
Resonance	Solution	Mechanisms	Treatment	Ethanol	Treatment	Vessels	Surface
Resolution	Fluorescent	Solution	Pathways	Compound	Surface	Permeability	Treatment

These topics capture the key themes that underpin the domain of nanobiomedicine. We distilled the main theme of each topic through expert discussion and consensus.

Topic 1 focuses on research related to nanostructures and polymer chemistry, with particular emphasis on nanostructures, polymer chemistry, and delivery control. Topic 2 focuses on bacteria, graphene, and nanocomposites and their applications in drug delivery and biomaterials. Topic 3 highlights the quality and product information of nanomaterials, including the quality assessment, surface properties, and methods for various nanomaterial applications. Topic 4 examines the use of nanoparticles in transfection, cell migration, and signaling pathways, with particular emphasis on cancer research and therapy. Topic 5 discusses ultrafast centrifugation, colloidal dispersion, the preparation of nanomaterials, and experimental methods related to microbiology.

Topic 6 concentrates on antitumor therapy, drug tolerance, and tumor stem cells, emphasizing different therapeutic approaches to tumors. Topic 7 focuses on nanoparticles, hormone release, and nanogels, emphasizing drug delivery and characterization of nanopreparations. Topic 8 covers photothermal therapy, photosensitizers, and the use of nanomaterials in cancer therapy, with a special emphasis on research related to biocompatibility and properties. Topic 9 focuses on antidiabetic nanodrug carriers, insulin delivery, and applications of bioactive molecules, with an emphasis on biomedical research. Topic 10 deals with animal studies, biodistribution, immunohistostaining, and laboratory manipulation, emphasizing protocols in biomedical experiments. Topic 11 focuses on adsorption, surface characterization, and the complex structures of nanoparticles, with particular applications in chemical and physical research.

Topic 12 is concerned with inflammation, cellular responses, and biomaterial interactions and has specific applications in disease research and therapy. Topic 13 covers nanotechnology, vaccine research, and infectious diseases with specific applications in vaccine development and infections. Topic 14 examines antibodies, liposomes, receptors, and cell permeation mechanisms, with particular applications in drug delivery and molecular recognition. Topic 15 discusses transport, deposition, and nanoparticles for materials science applications, focusing on material preparation and physical characterization. Topic 16 describes nanocarriers, hormone release, and antitumor drugs, with applications in drug delivery and nanomaterials. Topic 17 delves into luminescence, excitation, and spectroscopy, particularly highlighting optical techniques in nanomaterial research and fluorescence applications. Topic 18 deals with endocytosis, cell signaling pathways, and nanoparticles in biology, particularly their role in intracellular processes. Topic 19 discusses methods for dispersing, precipitating, dissolving, and purifying nanoparticles and their applications in chemical synthesis and biomaterials. Topic 20 deals with the permeation properties of nanoparticles in blood circulation, tissue retention, and oncology treatments, emphasizing therapeutic strategies. Collectively, these topics provide a broad summary of nanomaterial research, covering a wide range of research and application areas.

In addition, Table 7 shows the results of LDA topic modeling using only three metadata elements of papers in the field of nanomedicine: the article title, abstract, and keywords. The dataset comprises 5,000 articles, and the number of topics was set to 12 based on perplexity analysis (Baimakhanbetov, 2023; Wang et al., 2019). The topic-keyword distribution in Table 7 appears disorganized, and the specialized vocabulary is scattered, making it difficult to summarize the semantic content of the topics.

Table 7. LDA Topic-Word Distribution.

Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6
protein	delivery	nanoparticles	nps	antibacterial	applications
corona	cell	silver	nanoparticles	cells	properties
cell	nanoparticles	agnps	delivery	bacteria	review
proteins	review	activity	brain	activity	materials
bone	drugs	nm	size	vivo	titanium
Topic 7	Topic 8	Topic 9	Topic 10	Topic 11	Topic 12
exosomes	based	tumor	wound	clinical	nanosheets
cells	high	cancer	materials	cancer	surface
tumour	cell	therapy	tissue	therapy	go
nanoparticles	clinical	cells	healing	photothermal	mxenes
drug	detection	pdt	bone	agents	applications

4.3 Identification of emerging technology topics

The NLDA model was used in this study as the primary tool for obtaining the document-topic probability matrix. This matrix served as the basis for determining the supporting documents for each topic. We then quantified five sub-indicators for each topic: innovation, originality, growth, significance, and topic strength. These sub-indicators were standardized, and the normalized values were used to calculate a comprehensive indicator representing the emergence of each topic.

This composite indicator, termed the “degree of emergence,” was calculated with weights determined by the CRITIC method and combines the novelty, growth, and impact of each topic. The degree of emergence of each topic is presented in Table 8, together with the sub-indicator values, offering an overview of the emergent attributes of each topic.

Table 8. Results of the calculation of thematic indicators.

	Innovation	Originality	Growth	Significance	Topic strength	Emerging degree
Topic 12	0.793	0.446	0.390	0.913	1.000	0.722
Topic 8	1.000	0.489	0.698	0.507	0.567	0.614
Topic 6	0.632	0.591	0.616	0.709	0.375	0.595
Topic 4	0.580	0.296	1.000	0.592	0.499	0.569
Topic 2	0.145	0.566	0.191	0.730	0.777	0.533
Topic 16	0.557	0.769	0.581	0.359	0.426	0.529
Topic 9	0.351	0.342	0.144	0.882	0.376	0.474
Topic 7	0.530	0.315	0.391	0.709	0.284	0.465
Topic 3	0.272	0.000	0.342	1.000	0.256	0.427
Topic 5	0.403	1.000	0.426	0.103	0.180	0.417
Topic 13	0.310	0.217	0.379	0.768	0.046	0.383
Topic 15	0.000	0.124	0.000	0.777	0.609	0.363
Topic 11	0.090	0.213	0.360	0.753	0.044	0.343
Topic 14	0.292	0.238	0.222	0.684	0.071	0.341
Topic 1	0.264	0.263	0.262	0.506	0.268	0.334
Topic 18	0.208	0.479	0.238	0.540	0.000	0.333
Topic 10	0.329	0.758	0.404	0.000	0.246	0.330
Topic 20	0.358	0.089	0.206	0.705	0.117	0.328
Topic 17	0.216	0.332	0.447	0.392	0.156	0.318
Topic 19	0.273	0.544	0.255	0.127	0.007	0.242

Twenty topics were evaluated on five sub-indicators (innovation, originality, growth, significance, and topic strength), which together characterize the development and importance of each topic.

In Table 8, we observe distinct trends in various research topics within the field of nanomedicine, as indicated by their performance on the different indicators. Notably, Topic 8, “Photothermal Therapy and Nanomaterials in Cancer,” and Topic 12, “Inflammation and Biomaterial Interactions in Disease Research,” had the highest Innovation scores. This indicates that their supporting papers are comparatively recent, that is, these topics have received significant research attention in recent years.

Furthermore, Topic 5, “Ultrafast Centrifugation and Nanomaterial Preparation,” received the highest Originality score, meaning that its supporting papers contain the most innovative sentences on average. On the growth trajectory, Topic 4, “Nanoparticles in Cancer Research and Therapy,” and Topic 8, “Photothermal Therapy and Nanomaterials in Cancer,” exhibited the highest growth rates. This reflects their rapid expansion and development within the biomedical research landscape.

The impact-related indicators add further detail. Topic 3, “Nanomaterial Quality and Surface Properties,” and Topic 12, “Inflammation and Biomaterial Interactions in Disease Research,” had the highest Significance scores. In addition, Topic 12 obtained the highest topic strength, indicating its significant influence on the scientific community. In contrast, Topic 10, “Animal Studies and Biomedical Protocols,” had the lowest Significance score, and Topic 18, “Endocytosis and Cell Signaling with Nanoparticles,” had the lowest topic strength, indicating relatively lower research attention in these areas. Disruptive innovations may be necessary for these topics to gain prominence in the field.

The comprehensive index provides an overview of the research landscape. Topic 12, “Inflammation and Biomaterial Interactions in Disease Research,” and Topic 8, “Photothermal Therapy and Nanomaterials in Cancer,” claimed the top two positions, indicating that these research areas are at the forefront of emerging topics. The prominence of Topic 12 may be attributed to the profound impact of the COVID-19 pandemic on preventive medicine and infectious disease research, which has resulted in significant advances in vaccine development and infection studies. Concurrently, nanomedicine has made significant advances in cancer treatment in recent years, which may explain why Topic 8 stands out.

In this study, we identified emerging research topics by selecting those with an emerging degree above the average threshold. Table 9 provides a summary of our final results on emerging theme identification.

Table 9. Emerging technologies in nanomedicine.

Topic id	Topic meaning
12	Inflammation and Biomaterial Interactions in Disease Research
8	Photothermal Therapy and Nanomaterials in Cancer
6	Antitumor Therapy and Tumor Stem Cells
4	Nanoparticles in Cancer Research and Therapy
2	Graphene and Nanocomposites in Drug Delivery
16	Animal Studies and Biomedical Protocols
9	Antidiabetic Nanodrug Carriers and Biomedical Applications
7	Nanoparticles and Drug Delivery
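The selection rule described above (keep the topics whose emerging degree exceeds the average) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `select_emerging` is a hypothetical helper name, and the scores in the usage note are invented placeholders, not values from this study.

```python
from statistics import mean


def select_emerging(scores):
    """Return (threshold, selected) for a {topic_id: emerging_degree} mapping.

    The threshold is the arithmetic mean of all emerging-degree scores;
    topics strictly above it are kept, sorted from highest to lowest score.
    """
    threshold = mean(scores.values())
    selected = [(topic, s) for topic, s in scores.items() if s > threshold]
    selected.sort(key=lambda pair: pair[1], reverse=True)
    return threshold, selected
```

For example, with placeholder scores `{1: 0.2, 2: 0.6, 3: 0.4, 4: 0.8}` the threshold is about 0.5, so topics 4 and 2 would be retained, in that order.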

To validate our results, we reviewed previous bibliometric and review studies on nanomedicine and summarized the emerging technologies they identified, as shown in Table 10 (Bragazzi, 2019; Chang et al., 2015; Dundar et al., 2020b; Ledet & Mandal, 2012; Sandhiya et al., 2009; Yeo, 2013). Comparing Table 10 with the emerging technologies we identified, we observed some overlap, which supports the credibility of our identification method and results.

Table 10. Emerging nanomedicine technology keywords classification.

Category	Keywords
Nanomaterial Applications	Nanoparticles; Nanotubes; Nanofibers; Nanocrystals; Nanoshells; Metal nanoparticles (e.g., gold, silver); Carbon nanomaterials (e.g., graphene, fullerenes); Quantum dots; Polymeric nanoparticles; Nucleic acid nanostructures
Drug Delivery Systems	Drug-loaded nanoparticles; Nanomicelles; Nanoemulsions; Nanocapsules; Liposomes; Polymeric drug conjugates; Targeted drug delivery; Controlled release systems; pH-responsive nanocarriers; Temperature-responsive nanocarriers
Diagnostics and Imaging	Molecular imaging; Magnetic resonance imaging (MRI); Positron emission tomography (PET); Single photon emission computed tomography (SPECT); Optical imaging; Fluorescence imaging; Raman spectroscopy imaging; Upconversion nanoparticle imaging; Multimodal imaging
Cancer diagnosis and treatment	Nanoparticles; Hypoxia therapy; Tumor-targeted therapy; Nano-immunomodulators
Nanotherapy for infectious diseases	Nano-antimicrobials; Inflammation; Antimicrobial peptides; Nano-vaccines; Nano-antibiotics
Rational Design of Antimicrobial Peptides	Antimicrobial peptides (AMPs); Peptide design; Membrane disruption mechanisms; Antibiotic resistance
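One rough way to make the comparison between Table 9 and Table 10 repeatable is a word-level match between topic labels and literature keywords. The sketch below is an illustration only and is not the procedure used in this study; `tokens` and `keyword_overlap` are hypothetical helper names, and a keyword counts as a match only when every one of its words appears in the topic label.

```python
import re


def tokens(text):
    """Lower-case a label and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_overlap(topic_labels, keywords):
    """Map each topic id to the keywords fully contained in its label.

    topic_labels: {topic_id: label}; keywords: iterable of keyword strings.
    Topics without any matching keyword are omitted from the result.
    """
    hits = {}
    for topic_id, label in topic_labels.items():
        label_tokens = tokens(label)
        matched = sorted(
            k for k in keywords if tokens(k) and tokens(k) <= label_tokens
        )
        if matched:
            hits[topic_id] = matched
    return hits
```

Applied to a subset of the two tables, Topic 12 would match “Inflammation”, Topic 4 “Nanoparticles”, and Topic 2 “Graphene”. Exact word matching is deliberately strict, so near-synonyms (for example “antitumor” versus “tumor-targeted”) would still need manual judgment.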

In summary, our analysis highlights the dynamic landscape of nanomedicine research, wherein certain specific topics have gained attention owing to their novelty, innovation, growth, and impact. These findings underscore the evolving priorities and emerging areas of significance in the field that have been influenced by recent global events and scientific breakthroughs.

5 Discussion

This study introduced a comprehensive framework that integrates state-of-the-art natural language processing, topic modeling, and bibliometric techniques to identify emerging technologies in the scientific literature. Using the SciBERT pre-trained model to extract informative sentences and NLDA to identify potential topics, our method overcomes the limitations of existing methods that rely solely on raw text, titles, or abstracts. Our proposed multidimensional metrics provide a quantifiable assessment of the emergence of novel topics across various dimensions, such as novelty, innovation, growth, impact, and intensity.

When applied to a nanomedicine publication dataset, our methodology revealed an upward trend in vaccine development and research related to antitumor therapies, emphasizing their status as cutting-edge research domains. This trend aligns with the significant impact of the COVID-19 pandemic on stimulating infection control measures and fostering innovation in vaccine development. In summary, our data-driven approach provided actionable technical intelligence.

5.1 Theoretical contributions

In terms of technology, our approach uses advanced natural language processing models to extract knowledge from a diverse range of textual data more effectively. This approach presents significant advantages in terms of ease and accuracy compared with the traditional machine learning methods used by scholars for IE. Furthermore, we mitigate the noise word interference encountered in conventional LDA models by using NLDA topic modeling for improved topic recognition, thereby enhancing the interpretability of topic identification methods (Churchill & Singh, 2022a). This shift toward a data-driven approach has improved the efficiency of uncovering critical insights within the vast scientific literature.

Moreover, we adopted a diverse set of evaluation metrics. We introduced innovativeness and topic intensity indicators based on the IE and topic identification outcomes of this study. Together with the other indicators, they provide a comprehensive quantitative perspective on the emergence of technology topics. The CRITIC methodology integrates these indicators into a holistic measure, thereby furnishing a more nuanced assessment of the emergence level of each technology topic (Diakoulaki et al., 1995). These multidimensional indicators expand the toolkit available for researchers investigating emerging technologies.
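For readers unfamiliar with CRITIC, the weighting idea can be sketched as follows. This is a minimal illustration of the general method (Diakoulaki et al., 1995), not the implementation used in this study: it assumes every indicator is benefit-type (higher is better), uses min-max normalization and the sample standard deviation, and all function names are hypothetical.

```python
from statistics import mean, stdev


def min_max(column):
    """Scale a column to [0, 1]; a constant column becomes all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def critic_weights(matrix):
    """Return (weights, normalized_columns) for rows=topics, cols=indicators.

    Information content C_j = sigma_j * sum_{k != j} (1 - r_jk);
    each weight is C_j divided by the sum of all C_j.
    """
    columns = [min_max(list(col)) for col in zip(*matrix)]
    info = []
    for j, col_j in enumerate(columns):
        conflict = sum(
            1.0 - pearson(col_j, col_k)
            for k, col_k in enumerate(columns)
            if k != j
        )
        info.append(stdev(col_j) * conflict)
    total = sum(info)
    if total == 0:  # degenerate input: fall back to equal weights
        return [1.0 / len(columns)] * len(columns), columns
    return [c / total for c in info], columns


def composite_scores(matrix):
    """Weighted sum of normalized indicators, one score per topic."""
    weights, columns = critic_weights(matrix)
    return [sum(w * v for w, v in zip(weights, row)) for row in zip(*columns)]
```

Each row of `matrix` would hold one topic's novelty, innovation, growth, impact, and intensity values. Indicators that vary more across topics and correlate less with the others receive larger weights, which is what lets the composite index avoid double-counting redundant indicators.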

These theoretical contributions serve as templates for other researchers, assisting them in systematically discovering and analyzing emerging technologies. In addition to nanomedicine, these contributions have the potential to shape the way emerging trends are studied in various disciplines, thereby contributing to a deeper theoretical comprehension of the evolution of knowledge and technology.

5.2 Application contributions

Furthermore, this study offers several practical contributions, particularly in applying these findings (L. Yang et al., 2023). Identifying emerging technologies in nanomedicine provides practical insights for decision-makers, stakeholders, and policymakers. It is important to use technical methods to identify technology topics with a high degree of emergence in practical domains. For example, our study identified “Inflammation and Biomaterial Interactions in Disease Research” and “Photothermal Therapy and Nanomaterials in Cancer” as topics of notable prominence, implying their importance for further research and investment. Policymakers can use this information to make informed funding decisions, prioritizing areas with a higher potential for scientific progress and societal impact (J. S. Lee et al., 2022). Industry stakeholders can identify untapped market opportunities and focus their research and development efforts on areas with high growth potential.

Moreover, our cross-country analysis offers a global perspective on the distribution of innovative research in nanomedicine, and the temporal analysis reveals trends in innovative research in the field over the past decade. Together, these perspectives can support international collaboration and a more efficient allocation of resources.

5.3 Limitations and future research

Although our study provides valuable insights, it is crucial to acknowledge its limitations and identify potential avenues for future research. One notable limitation is our reliance on the existing scientific literature, which may introduce publication biases and language constraints. Furthermore, despite the meticulous manual annotation of the dataset, it remains susceptible to subjectivity and can be time-consuming. Future studies should consider exploring larger datasets, incorporating additional data sources, and automating the annotation process to address and mitigate these limitations effectively.

It is important to note that this research is confined to the domain of nanomedicine, and the identified topics are specific to this field. This methodology can be adapted and extended to investigate emerging technologies in other scientific domains. Furthermore, assessment criteria for emerging topics can be tailored to specific contexts and objectives, facilitating a more precise analysis.

In summary, this study presents a comprehensive framework for identifying emerging technologies, providing theoretical and practical contributions with far-reaching implications. These limitations serve as a roadmap for future research, extending the potential application of this methodology to various scientific disciplines and decision-making contexts.

Funding information

This work was supported by the National Natural Science Foundation of China (Project No. 22342011).

Acknowledgement

The authors would like to thank the anonymous reviewers for their valuable comments.

Author contributions

Yifan Wang (wyf_bjfu@163.com): Conceptualization (Lead), Data curation (Lead), Formal analysis (Lead), Methodology (Lead), Software (Lead), Writing - original draft (Lead), Writing - review & editing (Equal);

Xiaoping Liu (liuxp@mail.las.ac.cn): Conceptualization (Supporting), Funding acquisition (Equal), Project administration (Lead), Supervision (Lead), Writing - review & editing (Lead);

Xiang-Li Zhu (Zhuxl@mail.las.ac.cn): Funding acquisition (Supporting), Project administration (Supporting), Supervision (Supporting), Writing - review & editing (Supporting).

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References


Abbasi K. R., Hussain, Haddad A. M., Salman, & Ozturk (2022). The role of Financial Development and Technological Innovation towards Sustainable Development in Pakistan: Fresh insights from consumption and territory-based emissions. Technological Forecasting and Social Change, 176, 121444. https://doi.org/10.1016/j.techfore.2021.121444

Ahmed M. B., Zhou J. L., Huu H. N., Guo, Thomaidis N. S., & Xu (2017). Progress in the biological and chemical treatment technologies for emerging contaminant removal from wastewater: A critical review. Journal of Hazardous Materials, 323, 274-298. 14th International Conference on Environmental Science and Technology (CEST). https://doi.org/10.1016/j.jhazmat.2016.04.045

AlSumait, Barbara, Gentle, & Domeniconi (2009). Topic Significance Ranking of LDA Generative Models. In W. Buntine, M. Grobelnik, D. Mladenic, & J. ShaweTaylor (Eds.), Machine Learning and Knowledge Discovery in Databases, Pt I (Vol. 5781, pp. 67-82). Springer-Verlag Berlin. https://www.webofscience.com/wos/alldb/summary/46d8858c-2434-4cff-87d7-03b63c4b5a1a-b16f04c5/times-cited-descending/1

Archakov A. I. (2010). Nanobiotechnologies in Medicine: Nanodiagnostics and Nanodrugs. Biochemistry Moscow-Supplement Series B-Biomedical Chemistry, 4(1), 2-14. https://doi.org/10.1134/S1990750810010026

Arora S. K., Porter A. L., Youtie, & Shapira (2013). Capturing new developments in an emerging technology: An updated search strategy for identifying nanotechnology research outputs. Scientometrics, 95(1), 351-370. https://doi.org/10.1007/s11192-012-0903-6

Arshamian, Iannilli, Gerber J. C., Willander, Persson, Seo H.-S., Hummel, & Larsson (2013). The functional neuroanatomy of odor evoked autobiographical memories cued by odors and words. Neuropsychologia, 51(1), 123-131. https://doi.org/10.1016/j.neuropsychologia.2012.10.023

Baimakhanbetov (2023). Determination of the Optimal Number of Topics in the LDA Model When Working with Large Arrays of Text Data. 2023 IEEE International Conference on Smart Information Systems and Technologies (SIST), 332-336. https://ieeexplore.ieee.org/abstract/document/10223563/

Baptista P. V. (2014). Nanodiagnostics: Leaving the research lab to enter the clinics? Diagnosis, 1(4), 305-309. https://doi.org/10.1515/dx-2014-0055

Bayford, Rademacher, Roitt, & Wang S. X. (2017). Emerging applications of nanotechnology for diagnosis and therapy of disease: A review. Physiological Measurement, 38(8), 183-203. https://doi.org/10.1088/1361-6579/aa7182

Bello, Ng S.-C., & Leung M.-F. (2023). A BERT Framework to Sentiment Analysis of Tweets. Sensors, 23(1), 506. https://doi.org/10.3390/s23010506

Bennett K. M., Zhou, Sumner J. P., Dodd S. J., Bouraoud, Doi, Star R. A., & Koretsky A. P. (2008). MRI of the basement membrane using charged nanoparticles as contrast agents. Magnetic Resonance in Medicine, 60(3), 564-574. https://doi.org/10.1002/mrm.21684

Betker J. L., Gomez, & Anchordoquy T. J. (2013). The effects of lipoplex formulation variables on the protein corona and comparisons with in vitro transfection efficiency. Journal of Controlled Release, 171(3), 261-268. https://doi.org/10.1016/j.jconrel.2013.07.024

Bianchi, Hachem, & Iutzeler (2016). A Coordinate Descent Primal-Dual Algorithm and Application to Distributed Asynchronous Optimization. IEEE Transactions on Automatic Control, 61(10), 2947-2957. https://doi.org/10.1109/TAC.2015.2512043

Bishop P. C. (2006). Tech mining: Exploiting new technologies for competitive advantage. Technological Forecasting and Social Change, 73(1), 91-93. https://doi.org/10.1016/j.techfore.2005.08.001

Blei D. M., & Lafferty J. D. (2007). A Correlated Topic Model of Science. Annals of Applied Statistics, 1(1), 17-35. https://doi.org/10.1214/07-AOAS114

Blei D. M., Ng A. Y., & Jordan M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4-5), 993-1022. 18th International Conference on Machine Learning. https://doi.org/10.1162/jmlr.2003.3.4-5.993

Boegel P. M., Augenstein, Levin-Keitel, & Upham (2022). An interdisciplinary perspective on scaling in transitions: Connecting actors and space. Environmental Innovation and Societal Transitions, 42, 170-183. https://doi.org/10.1016/j.eist.2021.12.009

Boyack K. W., Klavans, Small, & Ungar (2014). Characterizing the emergence of two nanotechnology topics using a contemporaneous global micro-model of science. Journal of Engineering and Technology Management, 32, 147-159. https://doi.org/10.1016/j.jengtecman.2013.07.001

Bragazzi N. L. (2019). Nanomedicine: Insights from a bibliometrics-based analysis of emerging publishing and research trends. Medicina, 55(12), 785.

Breitzman, & Thomas (2015). The Emerging Clusters Model: A tool for identifying emerging technologies across multiple patent systems. Research Policy, 44(1), 195-205. https://doi.org/10.1016/j.respol.2014.06.006

Canagarajah (2022). Language diversity in academic writing: Toward decolonizing scholarly publishing. Journal of Multicultural Discourses, 17(2), 107-128. https://doi.org/10.1080/17447143.2022.2063873

Cao, Cheng, & Liao (2023). A comparison study of topic modeling based literature analysis by using full texts and abstracts of scientific articles: A case of COVID-19 research. Library Hi Tech, 41(2), 543-569. https://doi.org/10.1108/LHT-03-2022-0144

Chakraborty, & Pradeep (2017). Atomically Precise Clusters of Noble Metals: Emerging Link between Atoms and Nanoparticles. Chemical Reviews, 117(12), 8208-8271. https://doi.org/10.1021/acs.chemrev.6b00769

Chang E. H., Harford J. B., Eaton M. A. W., Boisseau P. M., Dube, Hayeshi, Swai, & Lee D. S. (2015). Nanomedicine: Past, present and future - A global perspective. Biochemical and Biophysical Research Communications, 468(3), 511-517. https://doi.org/10.1016/j.bbrc.2015.10.136

Chen, Wang, Li, & Sun (2018). Modeling Scientific Influence for Research Trending Topic Prediction. Thirty-Second AAAI Conference on Artificial Intelligence / Thirtieth Innovative Applications of Artificial Intelligence Conference / Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, 2111-2118.

Churchill, & Singh (2021). Topic-Noise Models: Modeling Topic and Noise Distributions in Social Media Post Collections. In J. Bailey, P. Miettinen, Y. S. Koh, D. Tao, & X. Wu (Eds.), 2021 21st IEEE International Conference on Data Mining (ICDM 2021) (pp. 71-80). IEEE Computer Soc. https://doi.org/10.1109/ICDM51629.2021.00017

Churchill, & Singh (2022a). Dynamic Topic Model (Title) AND Blei (Author) - 2 - All Databases.

Churchill, & Singh (2022b). Dynamic Topic-Noise Models for Social Media. In J. Gama, T., Y., E. Chen, Y. Zheng, & F. Teng (Eds.), Advances in Knowledge Discovery and Data Mining, Pakdd 2022, PT II (Vol. 13281, pp. 429-443). Springer International Publishing Ag. https://doi.org/10.1007/978-3-031-05936-0_34

Churchill, & Singh (2023). Using topic-noise models to generate domain-specific topics across data sources. Knowledge and Information Systems, 65(5), 2159-2186. https://doi.org/10.1007/s10115-022-01805-2

Cuenca A. G., Jiang, Hochwald S. N., Delano, Cance W. G., & Grobmyer S. R. (2006). Emerging implications of nanotechnology on cancer diagnostics and therapeutics. Cancer, 107(3), 459-466. https://doi.org/10.1002/cncr.22035

Dash, Monalisa, Brown, & Daim T. U. (2007). Exploring the relationship between research funding and science innovation indicators in emerging technologies. In D. F. Kocaoglu, T. R. Anderson, & T. U. Daim (Eds.), Picmet ’07: Portland International Center for Management of Engineering and Technology, Vols 1-6, Proceedings: Management of Converging Technologies (pp. 1623-1636). Picmet. https://doi.org/10.1109/PICMET.2007.4349487

Devlin, Chang M.-W., Lee, & Toutanova (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Naacl HLT 2019), Vol. 1, 4171-4186.

Diakoulaki, Mavrotas, & Papayannakis (1995). Determining Objective Weights in Multiple Criteria Problems: The Critic Method. Computers & Operations Research, 22(7), 763-770. https://doi.org/10.1016/0305-0548(94)00059-H

Ding, Liu, Zheng, & Li (2022). Digital Economy, Technological Innovation and High-Quality Economic Development: Based on Spatial Effect and Mediation Effect. Sustainability, 14(1), 216. https://doi.org/10.3390/su14010216

Dundar, Mechler, Alcaraz J.-P., Henehan, Prakash, Lal, & Martin D. K. (2020a). Reflections on Emerging Technologies in Nanomedicine. Erciyes Medical Journal, 42(4), 370-379. https://doi.org/10.14744/etd.2020.68542

Dundar, Mechler, Alcaraz J.-P., Henehan, Prakash, Lal, & Martin D. K. (2020b). Reflections on Emerging Technologies in Nanomedicine. Erciyes Medical Journal, 42(4), 370-379. https://doi.org/10.14744/etd.2020.68542

Ebadi, Auger, & Gauthier (2022). Detecting emerging technologies and their evolution using deep learning and weak signal analysis. Journal of Informetrics, 16(4), 101344. https://doi.org/10.1016/j.joi.2022.101344

Falagas M. E., Pitsouni E. I., Malietzis G. A., & Pappas (2008). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: Strengths and weaknesses. Faseb Journal, 22(2), 338-342. https://doi.org/10.1096/fj.07-9492LSF

, Sun, Wu, Cui, & Huang J. Z. (2018). Weakly supervised topic sentiment joint model with word embeddings. Knowledge-Based Systems, 147, 43-54. https://doi.org/10.1016/j.knosys.2018.02.012

Gokhberg, Fursov, Miles, & Perani (2013). Developing and using indicators of emerging and enabling technologies. In F. Gault (Ed.), Handbook of Innovation Indicators and Measurement (pp. 349-380). Edward Elgar Publishing Ltd. https://www.webofscience.com/wos/alldb/summary/2d946d63-1904-4423-905a-c02f116af127-b1743764/times-cited-descending/1

Gruen, & Hornik (2011). Topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1-30.

Guderian C. C. (2019). Identifying Emerging Technologies with Smart Patent Indicators: The Example of Smart Houses. International Journal of Innovation and Technology Management, 16(2), 1950040. https://doi.org/10.1142/S0219877019500408

Harzing A.-W., & Alakangas (2016). Google Scholar, Scopus and the Web of Science: A longitudinal and cross-disciplinary comparison. Scientometrics, 106(2), 787-804. https://doi.org/10.1007/s11192-015-1798-9

Hobbs S. K., Monsky W. L., Yuan, Roberts W. G., Griffith, Torchilin V. P., & Jain R. K. (1998). Regulation of transport pathways in tumor vessels: Role of tumor type and microenvironment. Proceedings of the National Academy of Sciences of the United States of America, 95(8), 4607-4612. https://doi.org/10.1073/pnas.95.8.4607

Holmes, & Ferrill (2005). The application of operation and technology roadmapping to aid Singaporean SMEs identify and select emerging technologies. Technological Forecasting and Social Change, 72(3), 349-357. Conference on Managing Emerging Technologies in Asia. https://doi.org/10.1016/j.techfore.2004.08.010

Hsiao T.-K., & Torvik V. I. I. (2023). OpCitance: Citation contexts identified from the PubMed Central open access articles. Scientific Data, 10(1), 243. https://doi.org/10.1038/s41597-023-02134-x

, Lu, Li, & Chen (2014). Convolutional Neural Network Architectures for Matching Natural Language Sentences. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (Nips 2014) (Vol. 27). Neural Information Processing Systems (nips). https://www.webofscience.com/wos/alldb/summary/ec795464-c332-4541-aace-1c8d839125ec-b170f65e/times-cited-descending/1

, Wang, & Li (2018). Online Biterm Topic Model based short text stream classification using short text expansion and concept drifting detection. Pattern Recognition Letters, 116, 187-194. https://doi.org/10.1016/j.patrec.2018.10.018

Huang, Li, Zou, Jiang, Porter A. L., & Zhang (2022). Technology life cycle analysis: From the dynamic perspective of patent citation networks. Technological Forecasting and Social Change, 181, 121760. https://doi.org/10.1016/j.techfore.2022.121760

Iorfino, Davenport T. A., Ospina-Pinillos, Hermens D. F., Cross, Burns, & Hickie I. B. (2017). Using New and Emerging Technologies to Identify and Respond to Suicidality Among Help-Seeking Young People: A Cross-Sectional Study. Journal of Medical Internet Research, 19(7), e247. https://doi.org/10.2196/jmir.7897

Jain K. K. (2008). Nanomedicine: Application of nanobiotechnology in medical practice. Medical Principles and Practice, 17(2), 89-101. https://doi.org/10.1159/000112961

Jelodar, Wang, Yuan, Feng, Jiang, Li, & Zhao (2019). Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimedia Tools and Applications, 78(11), 15169-15211. https://doi.org/10.1007/s11042-018-6894-4

Jeong, Jang, Park, & Choi (2020). A context-aware citation recommendation model with BERT and graph convolutional networks. Scientometrics, 124(3), 1907-1922. https://doi.org/10.1007/s11192-020-03561-y

Jiang, Yang, & Gao (2024). Multidimensional indicators to identify emerging technologies: Perspective of technological knowledge flow. Journal of Informetrics, 18(1), 101483. https://doi.org/10.1016/j.joi.2023.101483

, Meng, Finley, Wang, Chen, Ma, Ye, & Liu T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30 (NIPS 2017) (Vol. 30). Neural Information Processing Systems (nips). https://www.webofscience.com/wos/alldb/summary/136756d2-4ad9-4a5b-a24e-5e54a64b9508-b16f6061/times-cited-descending/1

Keenan (2003). Identifying emerging generic technologies at the national level: The UK experience. Journal of Forecasting, 22(2-3), 129-160. https://doi.org/10.1002/for.849

Khanna, Ball, Alperin J. P., & Willinsky (2022). Recalibrating the scope of scholarly publishing: A modest step in a vast decolonization process. Quantitative Science Studies, 3(4), 912-930. https://doi.org/10.1162/qss_a_00228

Kim, Kim, Park Y. I., Lee, & Hyeon (2018). Recent Development of Inorganic Nanoparticles for Biomedical Imaging. ACS Central Science, 4(3), 324-336. https://doi.org/10.1021/acscentsci.7b00574

Kim I. C., Le D. X., & Thoma G. R. (2014). Automated Method for Extracting “Citation Sentences” from Online Biomedical Articles Using SVM-based Text Summarization Technique. 2014 Ieee International Conference on Systems, Man and Cybernetics (SMC), 1991-1996.

Kim, Baek, & Song (2018). Topic Diffusion Analysis of a Weighted Citation Network in Biomedical Literature. Journal of the Association for Information Science and Technology, 69(2), 329-342. https://doi.org/10.1002/asi.23960

Ledet, & Mandal T. K. (2012). Nanomedicine: Emerging therapeutics for the 21st century. US Pharm, 37(3), 7-11.

Lee, Kwon, Kim, & Kwon (2018). Early identification of emerging technologies: A machine learning approach using multiple patent indicators. Technological Forecasting and Social Change, 127, 291-303. https://doi.org/10.1016/j.techfore.2017.10.002

Lee J. S., Jung, Roh, Heo, Lee, & Lee J. H. (2022). Risk-based uncertainty assessment to identify key sustainability hurdles for emerging CO₂ utilization technologies. Green Chemistry, 24(11), 4588-4605. https://doi.org/10.1039/d2gc00514j

Letsche T. A., & Berry M. W. (1997). Large-scale information retrieval with latent semantic indexing. Information Sciences, 100(1-4), 105-137. https://doi.org/10.1016/S0020-0255(97)00044-3

, Xie, Jiang, Zhou, & Huang (2019). Identifying and monitoring the development trends of emerging technologies using patent analysis and Twitter data mining: The case of perovskite solar cell technology. Technological Forecasting and Social Change, 146, 687-705. Portland International Conference on Management of Engineering and Technology (PIMCET). https://doi.org/10.1016/j.techfore.2018.06.004

Lin, Hou, Mishra, Yao, Huo, Yang, Wang, Shih, & Peng (2023). Enhancing thoracic disease detection using chest X-rays from PubMed Central Open Access. Computers in Biology and Medicine, 159, 106962. https://doi.org/10.1016/j.compbiomed.2023.106962

Lobanova, Bakhtin, & Sergienko (2023). Identifying and Visualizing Trends in Science, Technology, and Innovation Using SciBERT. IEEE Transactions on Engineering Management. https://doi.org/10.1109/TEM.2023.3306569

Long, Mcginnis, & Allison (1980). The Problem of Junior-Authored Papers in Constructing Citation Counts. Social Studies of Science, 10(2), 127-143.

McKeown, Daume, Chaturvedi, Paparrizos, Thadani, Barrio, Biran, Bothe, Collins, Fleischmann K. R., Gravano, Jha, King, McInerney, Moon, Neelakantan, O’Seaghdha, Radev, Templeton, & Teufel (2016). Predicting the Impact of Scientific Concepts Using Full-Text Features. Journal of the Association for Information Science and Technology, 67(11), 2684-2696. https://doi.org/10.1002/asi.23612

Mercer, & Keogh (2022). Matrix Profile XXV: Introducing Novelets: A Primitive that Allows Online Detection of Emerging Behaviors in Time Series. In X. Zhu, S. Ranka, M. T. Thai, T. Washio, & X. Wu (Eds.), 2022 IEEE International Conference on Data Mining (ICDM) (pp. 338-347). IEEE. https://doi.org/10.1109/ICDM54844.2022.00044

Michaleff Z. A., Costa L. O. P., Moseley A. M., Maher C. G., Elkins M. R., Herbert R. D., & Sherrington (2011). CENTRAL, PEDro, PubMed, and EMBASE Are the Most Comprehensive Databases Indexing Randomized Controlled Trials of Physical Therapy Interventions. Physical Therapy, 91(2), 190-197. https://doi.org/10.2522/ptj.20100116

Misra, Acharya, & Sahoo S. K. (2010). Cancer nanotechnology: Application of nanotechnology in cancer therapy. Drug Discovery Today, 15(19-20), 842-850. https://doi.org/10.1016/j.drudis.2010.08.006

Mitra, Nan, Line B. R., & Ghandehari (2006). Nanocarriers for nuclear imaging and radiotherapy of cancer. Current Pharmaceutical Design, 12(36), 4729-4749. 2nd Nanomedicine and Drug Delivery Symposium. https://doi.org/10.2174/138161206779026317

Nasar, Jaffry S. W., & Malik M. K. (2018). Information extraction from scientific articles: A survey. Scientometrics, 117(3), 1931-1990. https://doi.org/10.1007/s11192-018-2921-5

Nigam, McCallum A. K., Thrun, & Mitchell (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3), 103-134. https://doi.org/10.1023/A:1007692713085

Noh, Song Y.-K., & Lee (2016). Identifying emerging core technologies for the future: Case study of patents published by leading telecommunication organizations. Telecommunications Policy, 40(10-11), 956-970. https://doi.org/10.1016/j.telpol.2016.04.003

Peng, Chen, & Kong (2014). A clipping dual coordinate descent algorithm for solving support vector machines. Knowledge-Based Systems, 71, 266-278. https://doi.org/10.1016/j.knosys.2014.08.005

Peng, Zhang, & Luo (2020). Pythagorean fuzzy MCDM method based on CoCoSo and CRITIC with score function for 5G industry evaluation. Artificial Intelligence Review, 53(5), 3813-3847. https://doi.org/10.1007/s10462-019-09780-x

Porter A. L., Garner, Carley S. F., & Newman N. C. (2019). Emergence scoring to identify frontier R&D topics and key players. Technological Forecasting and Social Change, 146, 628-643. Portland International Conference on Management of Engineering and Technology (PIMCET). https://doi.org/10.1016/j.techfore.2018.04.016

Rawat, Singh, Saraf, & Saraf (2006). Nanocarriers: Promising vehicle for bioactive drugs. Biological & Pharmaceutical Bulletin, 29(9), 1790-1798. https://doi.org/10.1248/bpb.29.1790

Reimers, & Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019): Proceedings of the Conference, 3982-3992.

Richard P. O., Violette P. D., Bhindi, Breau R. H., Kassouf, Lavallee L. T., Jewett, Kachura J. R., Kapoor, Noel-Lamy, Ordon, Pautler S. E., Pouliot, So, Rendon R. A., Tanguay, Collins, Kandi, Shayegan, … Finelli (2022). Canadian Urological Association guideline: Management of small renal masses - Full-text. CUAJ-Canadian Urological Association Journal, 16(2), E61-E75. https://doi.org/10.5489/cuaj.7763

Roberts R. J. (2001). PubMed Central: The GenBank of the published literature. Proceedings of the National Academy of Sciences of the United States of America, 98(2), 381-382. https://doi.org/10.1073/pnas.98.2.381

Rotolo, Hicks, & Martin B. R. (2015). What is an emerging technology? Research Policy, 44(10), 1827-1843. https://doi.org/10.1016/j.respol.2015.06.006

Sanchez

, Belleville

, Popall

, & Nicole

(2011). Applications of advanced hybrid organic-inorganic nanomaterials: From laboratory to market. Chemical Society Reviews, 40(2), 696-753. https://doi.org/10.1039/c0cs00136h

Sandhiya

, Dkhar

S. A.

, & Surendiran

(2009). Emerging trends of nanomedicine - an overview. Fundamental & Clinical Pharmacology, 23(3), 263-269. https://doi.org/10.1111/j.1472-8206.2009.00692.x

Schmitt

, Kubler

, Robert

, Papadakis

, & LeTraon

(2019). A Replicable Comparison Study of NER Software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate. In M. Alsmirat & Y. Jararweh (Eds.), 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS) (pp. 338-343). IEEE. https://doi.org/10.1109/snams.2019.8931850

Schneider, Hwang J. D., Srikumar, Prange, Blodgett, Moeller S. R., Stern, Bitan, & Abend (2018). Comprehensive Supersense Disambiguation of English Prepositions and Possessives. In I. Gurevych & Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Vol 1 (pp. 185-196). Association for Computational Linguistics. https://www.webofscience.com/wos/alldb/summary/5fb302dc-1378-4381-b3e8-b88a7755e218-b17102cf/times-cited-descending/1

Seeger P. M., Yahouni, & Alpan (2022). Literature review on using data mining in production planning and scheduling within the context of cyber physical systems. Journal of Industrial Information Integration, 28, 100371. https://doi.org/10.1016/j.jii.2022.100371

Shen, Liu, Lin, Huang, Zhang, Liu, Feng, & Wang (2023). SsciBERT: A pre-trained language model for social science texts. Scientometrics, 128(2), 1241-1263. https://doi.org/10.1007/s11192-022-04602-4

Shen, Zhang, Lu, Xu, & Xiao (2020). A novel time series forecasting model with deep learning. Neurocomputing, 396, 302-313. https://doi.org/10.1016/j.neucom.2018.12.084

Shi, Kantoff P. W., Wooster, & Farokhzad O. C. (2017). Cancer nanomedicine: Progress, challenges and opportunities. Nature Reviews Cancer, 17(1), 20-37. https://doi.org/10.1038/nrc.2016.108

Sidaway J. D. (2020). Frontier Assemblages: The Emergent Politics of Resource Frontiers in Asia. Singapore Journal of Tropical Geography, 41(3), 470-472. https://doi.org/10.1111/sjtg.12330

Sinha, Kim G. J., Nie, & Shin D. M. (2006). Nanotechnology in cancer therapeutics: Bioconjugated nanoparticles for drug delivery. Molecular Cancer Therapeutics, 5(8), 1909-1917. https://doi.org/10.1158/1535-7163.MCT-06-0141

Small, Boyack K. W., & Klavans (2014). Identifying emerging topics in science and technology. Research Policy, 43(8), 1450-1467. https://doi.org/10.1016/j.respol.2014.02.005

Song, Luan, & Liang (2023). Identification of emerging technology topics (ETTs) using BERT-based model and sematic analysis: A perspective of multiple-field characteristics of patented inventions (MFCOPIs). Scientometrics, 128(11), 5883-5904. https://doi.org/10.1007/s11192-023-04819-x

Sugiyama (2015). Arrangement of Gold Nanoparticles onto a Slit-Type DNA Nanostructure in Various Patterns. In Artificially Controllable Nanodevices Constructed By DNA Origami Technology: Photofunctionalization and Single-Molecule Analysis (pp. 67-73). Springer-Verlag Berlin. https://doi.org/10.1007/978-4-431-55769-2_5

Tao, Yuan, Zheng, Jiang, Chen, Zhang, Zhou, Zhu, & Deng (2023). Bibliometric and visualized analysis of cancer nanomedicine from 2013 to 2023. Drug Delivery and Translational Research. https://doi.org/10.1007/s13346-023-01485-7

Thierry, Bao B.-K., & Ali (2023). RAR-SB: Research article recommendation using SciBERT with BiGRU. Scientometrics. https://doi.org/10.1007/s11192-023-04840-0

Thompson (2020, May). Topic Modeling with Contextualized Word Representation Clusters.

Tosatto, Bonacina, Signori, Pellicciari, Cecchi, Cornaggia C. M., & Piscitelli (2022). Spin of information and inconsistency between abstract and full text in RCTs investigating upper limb rehabilitation after stroke: An overview study. Restorative Neurology and Neuroscience, 40(3), 195-207. https://doi.org/10.3233/RNN-211247

Uddin, & Khan (2016). The impact of author-selected keywords on citation counts. Journal of Informetrics, 10(4), 1166-1177. https://doi.org/10.1016/j.joi.2016.10.004

van Rijt, & Habibovic (2017). Enhancing regenerative approaches with nanoparticles. Journal of the Royal Society Interface, 14(129), 20170093. https://doi.org/10.1098/rsif.2017.0093

Vatanasakdakul, Aoun, & Defiandry (2023). Social Commerce Adoption: A Consumer’s Perspective to an Emergent Frontier. Human Behavior and Emerging Technologies, 2023, 3239491. https://doi.org/10.1155/2023/3239491

Wang, Wang, Zhang, Wang, & Mao (2019). Optimization of Topic Recognition Model for News Texts Based on LDA. J. Digit. Inf. Manag., 17(5), 257.

Wang, Liang, Ye, Chen, & Liu (2024). Disruptive development path measurement for emerging technologies based on the patent citation network. Journal of Informetrics, 18(1), 101493. https://doi.org/10.1016/j.joi.2024.101493

, Bu, Ding, Yang, Zhang, Yu, & Sun (2018). Understanding the formation of interdisciplinary research from the perspective of keyword evolution: A case study on joint attention. Scientometrics, 117(2), 973-995. https://doi.org/10.1007/s11192-018-2897-1

, Hao, Yang, Lu, & An (2021). A topic models based framework for detecting and forecasting emerging technologies. Technological Forecasting and Social Change, 162, 120366. https://doi.org/10.1016/j.techfore.2020.120366

Yang, Sun, & Zhang (2023). Microreactor Technology: Identifying Focus Fields and Emerging Trends by Using CiteSpace II. Chempluschem, 88(1). https://doi.org/10.1002/cplu.202200349

Yang, Ren, & Yang (2023). Nanoparticles in the New Era of Cardiovascular Therapeutics: Challenges and Opportunities. International Journal of Molecular Sciences, 24(6), 5205. https://doi.org/10.3390/ijms24065205

Yao (2019). Automated Sentiment Analysis of Text Data with NLTK. 2018 International Symposium on Power Electronics and Control Engineering (ISPECE 2018), 1187, 052020. https://doi.org/10.1088/1742-6596/1187/5/052020

Yeo (2013). Emerging Technology in Evaluation of Nanomedicine. Molecular Pharmaceutics, 10(6), 2091-2092. https://doi.org/10.1021/mp400264n

Zhang, Zhao, & Zhang (2022). Enhancing keyphrase extraction from academic articles with their reference information. Scientometrics, 127(2), 703-731. https://doi.org/10.1007/s11192-021-04230-4

Zhang, Wang, Yu D. X., Xiao, & He (2018). Application of nanodiagnostics and nanotherapy to CNS diseases. Nanomedicine, 13(18). https://doi.org/10.2217/nnm-2018-0163

Zhang, Lu, Liu, Porter, Chen, & Zhang (2018). Does deep learning help topic extraction? A kernel k-means clustering method with word embedding. Journal of Informetrics, 12(4), 1099-1117. https://doi.org/10.1016/j.joi.2018.09.004

Zhao C.-Y., Cheng, Yang, & Tian Z.-M. (2018). Nanotechnology for Cancer Therapy Based on Chemotherapy. Molecules, 23(4), 826. https://doi.org/10.3390/molecules23040826

Zhao, & Strotmann (2011). Counting First, Last, or All Authors in Citation Analysis: A Comprehensive Comparison in the Highly Collaborative Stem Cell Research Field. Journal of the American Society for Information Science and Technology, 62(4), 654-676. https://doi.org/10.1002/asi.21495

Zhao W. X., Jiang, Weng, He, Lim E.-P., Yan, & Li (2011). Comparing Twitter and Traditional Media Using Topic Models. In Advances in Information Retrieval: 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011. Proceedings 33 (pp. 338-349). Springer Berlin Heidelberg.

Zitka, Ryvolova, Hubalek, Eckschlager, Adam, & Kizek (2012). From Amino Acids to Proteins as Targets for Metal-based Drugs. Current Drug Metabolism, 13(3), 306-320. https://doi.org/10.2174/138920012799320437

