Research Paper

Using Machine Reading to Understand Alzheimer’s and Related Diseases from the Literature

  • Satoshi Tsutsui 1 ,
  • Yi Bu 1 ,
  • Ying Ding , 2
  • 1School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408, USA
  • 2School of Information Management, Wuhan University, Wuhan 430072, China
Corresponding author: Ying Ding (E-mail: dingying@indiana.edu).

Received date: 2017-10-16

  Request revised date: 2017-11-04

  Accepted date: 2017-11-12

  Online published: 2017-12-06

Copyright

Open Access

Abstract

Purpose: This paper aims to better understand a large number of papers in the medical domain of Alzheimer’s disease (AD) and related diseases using the machine reading approach.
Design/methodology/approach: The study uses the topic modeling method to obtain an overview of the field, and employs open information extraction to further comprehend the field at a specific fact level.
Findings: Several topics within the AD research field are identified, such as the Human Immunodeficiency Virus (HIV)/Acquired Immune Deficiency Syndrome (AIDS), which can help answer the question of how AIDS/HIV and AD are very different yet related diseases.
Research limitations: Some manual data cleaning could improve the study, such as removing incorrect facts found by open information extraction.
Practical implications: This study uses the literature to answer specific questions on a scientific domain, which can help domain experts find interesting and meaningful relations among entities in a similar manner, such as to discover relations between AD and AIDS/HIV.
Originality/value: Both the overview and specific information from the literature are obtained using two distinct methods in a complementary manner. This combination is novel because previous work has only focused on one of them, and thus provides a better way to understand an important scientific field using data-driven methods.

Cite this article

Satoshi Tsutsui, Yi Bu, Ying Ding. Using Machine Reading to Understand Alzheimer’s and Related Diseases from the Literature[J]. Journal of Data and Information Science, 2017, 2(4): 81-94. DOI: 10.1515/jdis-2017-0021

1 Introduction

Alzheimer’s disease (AD) is a neurodegenerative disorder that causes dementia, where in its early stage people cannot remember recent events, and gradually have more difficulty managing daily life tasks or even recognizing family and friends. AD is to date incurable (Alzheimer’s Association, 2015), although a great deal of resources are used to treat symptoms. The large number of sufferers from mild to severe cases creates a high social impact and tremendous costs for medical care that serves to manage symptoms instead of offering a cure. It is estimated that 4.7 million Americans have dementia, and the total number in 2050 is projected to be 13.8 million (Hebert et al., 2013). The economic costs of dementia were estimated to be $818 billion in 2015 (Prince et al., 2015). Due to its incurability, the costs for several sectors in the society, such as long-term care, home services, and non-professional caregivers, are greater than the cost of direct medical care (Bullock, 2004; Winblad et al., 2016; Yokoyama et al., 2016). Due to the severity and increasing number of people suffering from dementia, it is very important to promote further research on AD.
The information currently available on the disease, however, is overwhelming. For example, simply searching “Alzheimer” in PubMed returns more than 90,000 articles (https://www.ncbi.nlm.nih.gov/pubmed/?term=alzheimer). Because no single researcher can read even a fraction of these papers, computational techniques could be a partial solution. Toward this end, this paper presents a case study that uses machine reading to help gather data from the large amount of literature.
The machine reading technique of data mining is based on the idea that machines can integrate and summarize information for humans by reading and understanding large amounts of texts (Hirschberg & Manning, 2015). Previous work related to machine reading and AD used computational techniques such as topic modeling to read and take main points from a large number of papers, but its general purpose is to get overview information (Hughes et al., 2014; Lee et al., 2015; Song, Heo, & Lee, 2015; Sorensen, 2009; Sorensen, Seary, & Riopelle, 2010). Indeed, in order to understand the domain of AD, we first need an overview that will help to identify key details, such as what kind of major topics are in the literature. This will help determine what kind of specific information is needed. For instance, unless we are aware of a group or individuals that discuss AIDS within AD papers, we cannot understand how AIDS is related to AD (addressed in Section 4.2). This study first uses a topic modeling technique (Blei, Ng, & Jordan, 2003) to obtain major topics within AD literature.
Once the overview is obtained, specific questions are addressed with a focus on how AD is connected to AIDS/HIV. Answering these questions requires techniques to read the content of related papers, recognize key entities mentioned in the text, and identify the relations among these entities. Gathering entities and relations from texts is called information extraction (IE), which conventionally assumes pre-defined target information. For example, extracting gene-disease interactions information is a common IE task, but it requires that the target genes, diseases, and types of expected interactions among them are already identified. Yet it is not advisable to limit the types of information being sought in advance, as this weeds out potentially significant topics or details that can be helpful to the search. Moreover, even after the information being sought is fixed, it can change frequently depending on how we have understood the literature at that point, or due to shifts in perspectives on the topic. For example, we might be interested in which pathway contains a particular gene after a gene-disease extractor found that the gene is associated with AD. In this case, we need to build the pathway-gene extractor again if conventional IE techniques are employed. Therefore, an IE technique is used that does not require pre-defined targets, called open information extraction (Open IE) (Fader, Soderland, & Etzioni, 2011).
Open IE is an information extraction technique applied in natural language processing (Fader, Zettlemoyer, & Etzioni, 2014; Mausam, 2016) that gathers facts in the form of triples <subject, predicate, object>. For example, given the sentence, “Alzheimer is strongly correlated with the apoe genotype,” it extracts <Alzheimer, is strongly correlated with, the apoe genotype>. These extracted facts are used to answer specific questions based on Latent Dirichlet Allocation (LDA) results (Blei et al., 2003). Note that it is always possible to trace the data back to specific sentences mentioning the answer. This provides an advantage because the information can be confirmed from reliable sources, that is, specific sentences in scientific papers.
The combination of the two methods of topic modeling and Open IE is complementary, in that Open IE answers key questions raised by topic modeling overviews. Topic modeling, specifically LDA, has been applied to a wide range of fields to reveal hidden topics from textual data (DiMaggio, Nag, & Blei, 2013; Hall, Jurafsky, & Manning, 2008; Hu et al., 2015). It is often difficult to make sense of each topic, however, even with substantial domain knowledge. This is because LDA only outputs topics as distributions over terms, and does not provide information on how the terms in a topic are linked together. In contrast, Open IE can indicate how terms are specifically linked in texts. It was developed as a natural language processing (NLP) technique, and is applied to NLP tasks such as question answering. Yet when trying to understand a large collection of texts, searchers do not always have specific questions to ask in advance. An overview is therefore needed to identify specific questions. Combining LDA and Open IE is complementary, as LDA provides the overview, which helps to infer specific questions that are then answered by Open IE.
The rest of this paper is organized as follows. Section 2 briefly summarizes the key related work. Section 3 describes the proposed machine reading approach using the two distinct methods of LDA and Open IE to better comprehend large collections of literature both at the overview and specific levels. Section 4 presents the results of this approach for the medical domain of Alzheimer’s disease, whose related papers are far beyond what a single researcher can read. Due to the high social and fiscal impact of the disease, the need for further research is urgent. Finally, Section 5 concludes the paper.

2 Related Work

Computational techniques have been extensively used to understand a scientific domain, and applications to the topic of AD have also gathered a great deal of attention (Hughes et al., 2014; Lee et al., 2015; Song, Heo, & Lee, 2015; Sorensen, 2009; Sorensen et al., 2010). Sorensen (2009), for instance, investigated the productivity and impact of the top 100 AD researchers using citation analysis, and identified the role of AD within the field of neurodegenerative diseases. In defining an AD-specific h-index ranking to obtain an overview of the whole domain, Sorensen et al. (2010) focused on co-authorship networks and revealed major author communities. Chen et al. (2014) studied cholinesterase inhibitors within AD research, and analyzed research trends. Hughes et al. (2014) explored collaboration networks using papers from the Alzheimer Disease Center to reveal the impact of this organization. Lee et al. (2015) analyzed networks of medical entities to provide an overview of AD research, but did not investigate specific relations between entities. Song et al. (2015) studied AD literature both at the topic and entity levels; to our knowledge, this research is the only one that considers relations between entities, yet it only used 54 pre-defined relations defined in the Unified Medical Language System (https://www.nlm.nih.gov/research/umls/). Overall, the related studies introduced so far mainly address overview information of the field of AD, while this paper has the additional purpose of obtaining specific information without fixing the information being sought in advance.
Obtaining specific information from texts has been well studied as information extraction in natural language processing. Information extraction (IE) is a task that automatically extracts structured information from texts. For example, many IE systems can extract entities such as genes, diseases, drugs, and relations between them from the medical literature (e.g. Song et al., 2015). However, because these systems require pre-defined relations, they are not effective when the relations of interest are not known in advance. Open IE systems (Mausam, 2016) overcome this issue by using raw textual phrases as relations, and have been applied to several NLP tasks such as question answering (Fader et al., 2014). To the best of our knowledge, this paper is the first to use Open IE to understand large amounts of literature in a specific medical domain in combination with topic modeling methods.

3 Methodology

The methodology of this paper is visually summarized in Figure 1. First, we collected literature with a focus that can be specified by key terms, specific periods, and/or target journals. After consulting domain experts, we collected a set of PubMed papers relevant to AD (the query was performed in October 2015) using the search terms Alzheimer, Mild cognitive impairment, Dementia, Significant memory concern, and Subjective memory complaint, without any constraint on publication year. The earliest paper obtained was published in 1945. The data set collected includes 160,091 papers with MeSH terms, which are the index terms given to the PubMed articles and abstracts. Second, we applied LDA to keywords. Two types of keywords were employed: MeSH terms and genes mentioned in the abstracts. These genes were extracted with a dictionary-based extractor using LingPipe (http://alias-i.com/lingpipe/index.html) and the NCBI human gene list (https://www.ncbi.nlm.nih.gov/gene). In addition, we applied LDA to subsets of the papers with a narrower focus by year range, hoping to obtain more interesting observations than by looking at all papers. Specifically, we made two subsets of papers published in the last decade (2005-2014) and the previous decade (1995-2004). Following previous work such as Zhang et al. (2017), we set the number of topics to five when using LDA because the purpose is to get a macro-level overview. Next, we applied Open IE, specifically ReVerb (Fader et al., 2011), to the textual content (e.g. abstracts or full text) of the papers. Each extracted triple is linked to a specific sentence in a paper. We then examined the topic modeling results and derived questions from them. Finally, we answered these questions using the extracted triples. This part can be iterative because an answer to a question could raise new questions.
Figure 1. Conceptual sketch of methodology.
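The link between each extracted triple and its source sentence can be kept in a small relational schema. The following is a minimal sketch, not the authors' actual database: it assumes SQLite (via Python's standard sqlite3 module), and the table names, column names, and the paper identifier "PMID-EXAMPLE" are hypothetical. The example sentence is the one used in the Introduction.

```python
import sqlite3

# Hypothetical input: (paper id, sentence, triples extracted from that sentence).
ROWS = [
    (
        "PMID-EXAMPLE",  # placeholder identifier, not a real PubMed ID
        "Alzheimer is strongly correlated with the apoe genotype.",
        [("Alzheimer", "is strongly correlated with", "the apoe genotype")],
    ),
]


def build_db(rows):
    """Create an in-memory database that stores sentences and their triples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sentence (
            id   INTEGER PRIMARY KEY,
            pmid TEXT NOT NULL,
            text TEXT NOT NULL
        );
        CREATE TABLE triple (
            id          INTEGER PRIMARY KEY,
            subject     TEXT NOT NULL,
            predicate   TEXT NOT NULL,
            object      TEXT NOT NULL,
            sentence_id INTEGER NOT NULL REFERENCES sentence(id)
        );
        CREATE INDEX triple_subject ON triple(subject);
        """
    )
    for pmid, text, triples in rows:
        cur = conn.execute(
            "INSERT INTO sentence (pmid, text) VALUES (?, ?)", (pmid, text)
        )
        sentence_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO triple (subject, predicate, object, sentence_id) "
            "VALUES (?, ?, ?, ?)",
            [(s, p, o, sentence_id) for s, p, o in triples],
        )
    conn.commit()
    return conn


def trace(conn, subject):
    """Return every triple with this subject together with its source sentence."""
    return conn.execute(
        "SELECT t.subject, t.predicate, t.object, s.pmid, s.text "
        "FROM triple AS t JOIN sentence AS s ON s.id = t.sentence_id "
        "WHERE t.subject = ?",
        (subject,),
    ).fetchall()
```

Calling trace(build_db(ROWS), "Alzheimer") returns the triple along with the paper identifier and the sentence it came from, which is what allows an answer to be checked against the original text.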

4 Results and Discussion

4.1 Overview

Based on the methods described above, we obtained 1,469,008 triples and organized them in a relational database so that they can be traced back to a specific sentence in a paper. The extracted triples (http://homes.soic.indiana.edu/stsutsui/machine_reading/data/index.html) and a simple demo website (http://homes.soic.indiana.edu/stsutsui/machine_reading) were then made available. The topic modeling results for MeSH terms and genes are shown in Tables 1 and 2, respectively. For each period and each topic, the top five terms are shown, and topics are ranked by their popularity (for details on calculating topic popularity, see Chen et al. (2017)). The only immediately noticeable observation is that the fifth topic in 1995-2004 from the MeSH terms is clearly related to AIDS/HIV. However, the results can provide more relevant observations if the characteristics of LDA are considered.
A basic characteristic of LDA is that it provides each topic a distribution of terms. This means the first term in a topic is its most representative term. Moreover, LDA can represent each paper as a distribution of topics, which enables the ranking of topics by popularity. From these characteristics, the popularity observations from MeSH terms and genes respectively are presented.
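These two characteristics can be illustrated with a toy implementation. The sketch below is a small collapsed Gibbs sampler for LDA written only for illustration; it is not the implementation used in this study, and the keyword lists are invented. Topic popularity is computed here as the mean topic proportion over all papers, which is an assumption made for this sketch; the measure actually used is described in Chen et al. (2017).

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of keywords."""
    rng = random.Random(seed)
    vocab = sorted({term for doc in docs for term in doc})
    term_id = {term: i for i, term in enumerate(vocab)}
    n_terms = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]             # counts per paper
    topic_term = [[0] * n_terms for _ in range(num_topics)]  # counts per topic
    topic_total = [0] * num_topics
    assign = []
    for d, doc in enumerate(docs):                           # random initialization
        assign.append([])
        for term in doc:
            k = rng.randrange(num_topics)
            assign[d].append(k)
            doc_topic[d][k] += 1
            topic_term[k][term_id[term]] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                v, k = term_id[term], assign[d][i]
                doc_topic[d][k] -= 1                         # remove current assignment
                topic_term[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_term[t][v] + beta)
                    / (topic_total[t] + n_terms * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k                             # resample the topic
                doc_topic[d][k] += 1
                topic_term[k][v] += 1
                topic_total[k] += 1
    # Each paper as a distribution over topics.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    # Each topic as a ranked list of terms (most representative first).
    ranked_terms = [
        [vocab[v] for v in sorted(range(n_terms), key=lambda v: -topic_term[k][v])]
        for k in range(num_topics)
    ]
    return theta, ranked_terms


def rank_topics(theta):
    """Rank topics by mean proportion over papers (assumed popularity measure)."""
    num_topics = len(theta[0])
    popularity = [sum(row[k] for row in theta) / len(theta) for k in range(num_topics)]
    order = sorted(range(num_topics), key=lambda k: -popularity[k])
    return order, popularity


# Invented keyword lists standing in for the MeSH terms of four papers.
DOCS = [
    ["Tau proteins", "Amyloid", "Neurons"],
    ["Amyloid", "Tau proteins", "Neurons", "Amyloid"],
    ["Caregivers", "Nursing homes", "Depression"],
    ["Nursing homes", "Caregivers"],
]

theta, ranked_terms = lda_gibbs(DOCS, num_topics=2)
order, popularity = rank_topics(theta)
for rank, k in enumerate(order, start=1):
    print(rank, round(popularity[k], 3), ranked_terms[k][:3])
```

The first term printed for each topic plays the role of the most representative term discussed above, and the order of the printed rows corresponds to the popularity ranking used in Tables 1 and 2.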
In MeSH term topic modeling, Huntington’s disease (HD) always appears in the first topic, whether the ranking covers all years, the last decade, or the decade before it (Table 1). This means HD has consistently held a certain popularity within the AD literature. Moreover, Creutzfeldt-Jakob syndrome (CJS), which is also a neurodegenerative disease like AD, was popular in the period 1995-2004 but less so recently: it is the first term in the first topic of that period, but in 2005-2014 it appears only in the fourth topic.
Table 1 MeSH LDA results ranked by popularity (top five terms per topic).
All (1945-2015)
  • 1st topic: Huntington disease; Parkinson disease; Neurons; Cerebral cortex; Nerve tissue proteins
  • 2nd topic: Mental disorders; Caregivers; Dementia, vascular; AIDS dementia complex; Schizophrenia
  • 3rd topic: Tau proteins; Amyloid beta-protein precursor; Neurodegenerative diseases; Brain diseases; Amyloid
  • 4th topic: Aging; Cognition; Cholinesterase inhibitors; Memory disorders; Neuropsychological tests
  • 5th topic: Creutzfeldt-Jakob syndrome; Apolipoproteins E; Magnetic resonance imaging; Nursing homes; Genetic predisposition to disease
1995-2004
  • 1st topic: Creutzfeldt-Jakob syndrome; Apolipoproteins E; Huntington disease; Tau proteins; Magnetic resonance imaging
  • 2nd topic: Amyloid beta-protein precursor; Neurons; Membrane proteins; Nerve tissue proteins; Neurodegenerative diseases
  • 3rd topic: Parkinson disease; Caregivers; Neuropsychological tests; Cognition; Memory
  • 4th topic: Cholinesterase inhibitors; Dementia, vascular; Memory disorders; Nootropic agents; Schizophrenia
  • 5th topic: AIDS dementia complex; Aging; Peptide fragments; HIV-1; HIV infections
2005-2014
  • 1st topic: Neurons; Peptide fragments; Tau proteins; Amyloid beta-protein precursor; Huntington disease
  • 2nd topic: Aging; Neuropsychological tests; Magnetic resonance imaging; Memory disorders; Memory
  • 3rd topic: Cognition; Neurodegenerative diseases; Mental disorders; Nursing homes; Amyotrophic lateral sclerosis
  • 4th topic: Parkinson disease; Caregivers; Amyloid; Creutzfeldt-Jakob syndrome; Depression
  • 5th topic: Cholinesterase inhibitors; Neuroprotective agents; Dementia, vascular; Frontotemporal dementia; Amyloid precursor protein secretases
In gene topic modeling, APP and APOE are always the most popular genes: they appear as the top terms of the first and second topics regardless of time period (Table 2). Beyond the first two topics, HTT became more popular in the recent period: it is the top term of the third topic in 2005-2014, whereas it was the top term of the fifth topic in 1995-2004.
Table 2 Gene LDA results ranked by popularity (top five genes per topic).
All (1945-2015)
  • 1st topic: APP; TNF; BDNF; BACE1; NGF
  • 2nd topic: APOE; BCHE; CAT; CA3; CA1
  • 3rd topic: INS; HTT; PSEN1; CA1; APP
  • 4th topic: MS; PRNP; GFAP; ALB; MDD1
  • 5th topic: MAPT; SDS; SST; GRN; PSD
1995-2004
  • 1st topic: APP; MS; CD4; SPY; PSEN1
  • 2nd topic: APOE; BDNF; ACT; A2M; LDLR
  • 3rd topic: PRNP; INS; NGF; TF; TTR
  • 4th topic: TNF; PSEN1; GFAP; MAPT; BCHE
  • 5th topic: HTT; CA1; SDS; ALB; CA3
2005-2014
  • 1st topic: APOE; INS; TNF; APP; ACE
  • 2nd topic: APP; GFAP; CD4; CAT; NOTCH3
  • 3rd topic: HTT; PRNP; CA1; BACE1; NGF
  • 4th topic: MS; BDNF; MAPT; BCHE; SYP
  • 5th topic: PSEN1; APP; APOE; ALB; PSEN2
Another characteristic of LDA is that highly co-occurring terms constitute a topic. Moreover, it sometimes distinguishes the term co-occurrences in different contexts by having the same term in multiple topics. From these characteristics, observations of a term in different contexts for MeSH terms and genes are presented, respectively.
In MeSH terms, CJS appears in three different topics: the first topic in the period 1995-2004, the fourth topic in the period 2005-2014, and the fifth topic in all years. We examined the co-occurring terms: magnetic resonance imaging (MRI) co-occurs with CJS in the 1995-2004 topic, whereas the term caregivers co-occurs with it in 2005-2014. MRI is a brain-imaging technique, while a caregiver is the person who takes care of the patient or person suffering from the disease. This observation indicates that CJS is studied in two contexts: brain imaging research and patient care. Interestingly, in the topic for all years the two contexts are merged into one topic, because it contains both MRI and nursing homes, where patients are given care.
We also found that APP and APOE appear in multiple topics and at multiple ranks in the gene results. For example, APP or APOE is always the top gene in the first and second topics, but APP also appears as the fifth gene in the third topic in all years, and APP and APOE appear as the second and third genes in the fifth topic in 2005-2014 (Table 2). This observation indicates that APP and APOE can be studied in a context where each gene itself is the key to the topic, but also in a context where it is secondary to the topic.

4.2 Question Answering Examples

This section demonstrates the power of Open IE by answering questions specific to AD that are inferred from the LDA results and their interpretations. For example, a topic related to AIDS/HIV is found within the AD literature, and Open IE can tell how AIDS/HIV is actually related to AD, as shown later in this section. Open IE can also answer more basic questions such as the definition of terms, which could help researchers with limited knowledge of AD (e.g. information scientists or scientometricians who expect to study AD from the literature) to understand the domain from the literature alone. For example, the results discussed in Section 4.1 cannot be interpreted without knowing what HD, CJS, MRI, APP, and APOE are. Therefore, we first show how Open IE can answer these simple questions in order to understand these basic terms.
We first use the example of Huntington’s disease (HD), which in the previous section was observed to consistently hold a certain popularity within the AD literature. The immediate question “What is Huntington’s disease?” can be answered by searching triples with the pattern <Huntington disease, is, ?x>, where ?x is a wildcard that matches any phrase. In this case, 446 distinct triples were found. The two most frequent answers and three other randomly sampled results are shown in Table 3. The number inside the parentheses represents the number of triples that matched the pattern. From these answers, it can be seen that HD is an inherited neurodegenerative disorder.
Table 3 Question answering example: What is Huntington’s disease?
Question: What is Huntington’s disease?
Answer examples (number of matching triples in parentheses):
  • an inherited neurodegenerative disorder (44)
  • a neurodegenerative disorder (28)
  • a hereditary brain disease (2)
  • an incurable genetic neurodegenerative disorder (1)
  • a complex, single gene (1) (wrong extraction)
Extraction from texts does not always result in correct data, so some manual inspection is required. For instance, the answer “a complex, single gene” in Table 3 is a wrong extraction. One possible method to reduce errors is to only use highly frequent triples. However, low-frequency triples also carry interesting and important facts. For example, “an incurable genetic neurodegenerative disorder” is mentioned only once in our corpus, but this answer carries the important fact that HD is incurable. Therefore, we decided not to rely on the frequency of triples, but instead to use manually inspected results. Another issue is that Open IE does not provide normalization, so term variants are treated as distinct terms. This could be an issue when triples are searched with a pattern. For example, HD needs to be added as a term variant to Huntington’s disease searches if we want to get complete results. Yet these term variants can be obtained by querying terms that have the same definition. For example, checking other subjects defined as “an inherited neurodegenerative disorder” leads us to another term variant, Huntington’s disease.
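The pattern search and the term-variant lookup described above can be sketched in a few lines. This is an illustrative sketch over invented toy triples, not the query code used in this study; the function names are hypothetical, and matching is assumed to be case-insensitive exact string comparison, with None playing the role of a wildcard such as ?x.

```python
from collections import Counter

# Invented toy triples; the real corpus contains 1,469,008 extracted triples.
TRIPLES = [
    ("Huntington disease", "is", "an inherited neurodegenerative disorder"),
    ("Huntington disease", "is", "an inherited neurodegenerative disorder"),
    ("Huntington disease", "is", "a neurodegenerative disorder"),
    ("HD", "is", "an inherited neurodegenerative disorder"),
    ("Huntington's disease", "is", "an inherited neurodegenerative disorder"),
    ("Alzheimer", "is strongly correlated with", "the apoe genotype"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    def same(value, wanted):
        return wanted is None or value.lower() == wanted.lower()

    return [
        t for t in triples
        if same(t[0], subject) and same(t[1], predicate) and same(t[2], obj)
    ]


def answers(triples, subject, predicate="is"):
    """Answer '<subject, predicate, ?x>' with (answer, frequency) pairs."""
    hits = match(triples, subject=subject, predicate=predicate)
    return Counter(t[2] for t in hits).most_common()


def term_variants(triples, definition, predicate="is"):
    """Find subjects that share a definition, i.e. '<?x, predicate, definition>'."""
    hits = match(triples, predicate=predicate, obj=definition)
    return sorted({t[0] for t in hits})


print(answers(TRIPLES, "Huntington disease"))
print(term_variants(TRIPLES, "an inherited neurodegenerative disorder"))
```

Because frequency alone is not a reliable filter, as noted above, the counts returned by answers are best treated as a way to order candidates for manual inspection rather than as a cut-off.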
The same approach is also applied herein to find definitions of CJS, MRI, APP, and APOE (Table 4). These definitions are complementary in allowing more LDA results to be interpreted. It can be confirmed that, for example, APP and APOE are strongly related to AD. This fact coincides with the observation in Section 4.1, that they are always the most popular genes.
Table 4 What are CJS, MRI, APP, and APOE?
Term: Definitions
  • AD: a neurodegenerative disorder; a genetically complex and heterogeneous disorder
  • CJS: a rare neurodegenerative disease; a fatal neurodegenerative illness; an incurable disease
  • MRI: a useful diagnostic marker; a promising AD biomarker; the most widely used and less invasive medical imaging technique
  • APP: a transmembrane glycoprotein; an extremely complex molecule; the key player in AD pathogenesis
  • APOE: the major apolipoprotein; the only confirmed susceptibility gene for AD; the most prevalent and best established genetic risk factor for late-onset AD
This study also addresses more interesting questions than term definitions. Suppose we know that HD is also a neurodegenerative disease like AD. The next natural question is, “How is AD related to HD?” We first queried <AD, ?x, HD>, but could not find relevant relations. We then extended the pattern to <AD, ?r1, ?x> & <?x, ?r2, HD>, which is equivalent to finding two-step paths from node AD to node HD on a directed graph. It is also possible to find a path the other way, from HD to AD, which gives similar results. The query resulted in 408 paths with 61 distinct middle nodes. Some of them are shown in Figure 2. It can be observed that AD and HD share some symptoms such as cognitive impairment, depression, and vascular dysfunction. Neuronal death is also common to both AD and HD. Moreover, Figure 2 indicates that the apoe genotype does not affect HD while it is strongly correlated with AD. This negative relation shows the advantage of using Open IE in addition to LDA, which relies on co-occurrences, because co-occurrences cannot distinguish positive from negative relations. One could, assuming that the extracted knowledge is complete, regard relations that are not found as negative relations; however, negative relations that are mentioned explicitly in papers could provide more important facts than others.
Figure 2. Two-step paths from AD to HD.
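The two-step pattern <AD, ?r1, ?x> & <?x, ?r2, HD> can be sketched as a path search over a directed graph whose edges are triples. The code below is a minimal illustration, not the query code used in this study; the toy triples are invented, and nodes are assumed to match by exact string, so term variants would have to be normalized beforehand.

```python
from collections import defaultdict

# Invented toy triples, loosely modeled on the relations discussed in the text.
TRIPLES = [
    ("AD", "is strongly correlated with", "apoe genotype"),
    ("apoe genotype", "does not affect", "HD"),
    ("AD", "causes", "neuronal death"),
    ("neuronal death", "occurs in", "HD"),
    ("AD", "is", "a neurodegenerative disorder"),
]


def two_step_paths(triples, source, target):
    """Find all paths source -r1-> middle -r2-> target."""
    outgoing = defaultdict(list)  # subject -> [(predicate, object), ...]
    for subject, predicate, obj in triples:
        outgoing[subject].append((predicate, obj))

    paths = []
    for r1, middle in outgoing.get(source, []):
        for r2, end in outgoing.get(middle, []):
            if end == target:
                paths.append((source, r1, middle, r2, target))
    return paths


for path in two_step_paths(TRIPLES, "AD", "HD"):
    print(" | ".join(path))
```

Keeping the predicates r1 and r2 in each path is what preserves explicit negative relations such as "does not affect", which a co-occurrence count would not distinguish from a positive one.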
Another important observation from the LDA results relates to AIDS and HIV, which are distinct conditions (HIV underlies AIDS, in that people with AIDS generally have HIV, but HIV infection is not the full-blown AIDS disease in terms of symptoms and treatment). A domain expert working on AD brain imaging whom we consulted did not know how HIV/AIDS is related to AD. Similar to the HD example, we therefore queried triples with <AD, ?r1, ?x> & <?x, ?r2, HIV>. The results shown in Figure 3 confirm meaningful facts: HIV has symptoms similar to those of AD, such as delirium and dementia, but does not infect neurons or endothelial cells, whereas AD affects them.
Figure 3. Two-step paths from AD to HIV.
Some questions need additional resources to answer. For example, this study was able to confirm that the apoe gene is strongly correlated with AD (as seen in the previous section), yet now we are interested in other genes that are also highly correlated with AD. The natural way to find answers is to search triples with the pattern <?x, correlated with, AD> & <?x, is, gene>. But this query gave no results because it is rare for researchers to write a sentence that contains “[a gene name] is a gene.” To solve this issue, we used the additional resource of NCBI human genes (http://www.ncbi.nlm.nih.gov/gene/) to restrict ?x in the pattern. In other words, the final query is <?x, correlated with, AD> where ?x is in the gene list. This query found 62 answers including APP, BDNF, and CR2.
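Restricting ?x with an external gene list can be sketched as a simple set-membership filter. The code below is an illustration under stated assumptions, not the query used in this study: the gene set is a tiny stand-in for the NCBI human gene list, "GENEX" is a made-up symbol, the triples are invented, and the rule for skipping negated predicates is a naive heuristic added for this sketch.

```python
# Tiny stand-in for the NCBI human gene list; "GENEX" is a made-up symbol.
GENE_SYMBOLS = {"APP", "APOE", "BDNF", "CR2", "GENEX"}

# Invented toy triples.
TRIPLES = [
    ("APP", "is correlated with", "AD"),
    ("BDNF", "was significantly correlated with", "AD"),
    ("CR2", "is correlated with", "AD"),
    ("GENEX", "is not correlated with", "AD"),  # explicit negative relation
    ("age", "is correlated with", "AD"),        # subject is not a gene
    ("APOE", "is correlated with", "HD"),       # different object
]


def is_positive_correlation(predicate):
    """Naive check: mentions 'correlated with' and contains no 'not'."""
    words = predicate.lower().split()
    return "correlated" in words and "with" in words and "not" not in words


def correlated_genes(triples, gene_symbols, target="AD"):
    """Answer '<?x, correlated with, target>' where ?x is in the gene list."""
    return sorted(
        {
            subject.upper()
            for subject, predicate, obj in triples
            if obj == target
            and is_positive_correlation(predicate)
            and subject.upper() in gene_symbols
        }
    )


print(correlated_genes(TRIPLES, GENE_SYMBOLS))
```

The membership test replaces the unanswerable sub-pattern <?x, is, gene>, so no sentence ever has to state that a symbol is a gene.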
Some questions raised by the topic modeling results cannot be answered by Open IE. These are “why” questions, because Open IE simply extracts facts mentioned in the papers. For example, we observed that CJS was popular in the period 1995-2004, but not in 2005-2014. Open IE, however, cannot answer why this change happened. A doctoral student working on AD whom we consulted suggested that the reason might be that Stanley B. Prusiner was awarded the Nobel Prize in 1997 for his discovery of prions, the pathogen of CJS, which created an upsurge in interest. However, this cause-and-effect relation cannot be verified, as Open IE can by no means infer it.

5 Conclusions

This paper provides a case study of using the machine reading method to understand the domain of Alzheimer’s disease (AD), and its relation to other diseases such as HIV and AIDS. The number of papers on AD is overwhelmingly high, yet there is a vital need for further research that may help find the causes of the disease as well as a cure. We demonstrate that machine reading helps identify specific information that offers a better understanding, guided by overviews provided by topic modeling. Using LDA and Open IE in a mutually complementary way, the topic modeling results first reveal that AD and HIV/AIDS are connected. Based on this observation, when querying the Open IE extractions, the two diseases are found to have different mechanisms but share some symptoms such as dementia.
This study has several implications. First of all, it shows that the literature on a topic can answer specific questions relating to it, which has not been attempted in the literature to date. From the perspective of Alzheimer’s disease, the approach provided in this article could help domain experts find important relations between entities in a similar manner as this study identified relations between AD and HIV/AIDS. Methodologically, this approach can serve as a preliminary knowledge extraction step for literature-based knowledge discovery if future researchers hope to construct a curated knowledge base for a specific purpose.
One limitation of this approach is that we need to manually clean the data, for example by removing false extractions. Moreover, this study is not able to answer abstract questions when the answers are not written explicitly in texts. In the future, it would be helpful to develop a method to automatically detect false extractions. We could also integrate existing medical knowledge bases to answer more complex or nuanced questions.

Author Contributions

S. Tsutsui (stsutsui@indiana.edu) designed the research framework and the methods, analyzed the data, and wrote the manuscript. Y. Bu (buyi@iu.edu) wrote the manuscript. Y. Ding (dingying@indiana.edu, corresponding author) instructed the research team and revised the manuscript.

The authors have declared that no competing interests exist.

[1]
Alzheimer’s Association. (2015). 2015 Alzheimer’s disease facts and figures. Alzheimer’s & Dementia, 11(3), 332-384.

[2]
Blei D.M., Ng A.Y., & Jordan M.I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993-1022.

[3]
Bullock R. (2004). The needs of the caregiver in the long-term treatment of Alzheimer disease. Alzheimer Disease & Associated Disorders, 18 Suppl 1, S17-S23.

[4]
Chen B., Tsutsui S., Ding Y., & Ma F. (2017). Understanding the topic evolution in a scientific domain: An exploratory study for the field of information retrieval. Journal of Informetrics, 11(4), 1175-1189.

[5]
DiMaggio P., Nag M., & Blei D. (2013). Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding. Poetics, 41(6), 570-606.Topic modeling provides a valuable method for identifying the linguistic contexts that surround social institutions or policy domains. This article uses Latent Dirichlet Allocation (LDA) to analyze how one such policy domain, government assistance to artists and arts organizations, was framed in almost 8000 articles. These comprised all articles that referred to government support for the arts in the U.S. published in five U.S. newspapers between 1986 and 1997-a period during which such assistance, once noncontroversial, became a focus of contention. We illustrate the strengths of topic modeling as a means of analyzing large text corpora, discuss the proper choice of models and interpretation of model results, describe means of validating topic-model solutions, and demonstrate the use of topic models in combination with other statistical tools to estimate differences between newspapers in the prevalence of different frames. Throughout, we emphasize affinities between the topic-modeling approach and such central concepts in the study of culture as framing, polysemy, heteroglossia, and the relationality of meaning. (C) 2013 Elsevier B.V. All rights reserved.

[6]
Fader A., Soderland S., & Etzioni O. (2011). Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1535-1545). Stroudsburg, PA: Association for Computational Linguistics. Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary. This paper shows that the output of state-of-the-art Open IE systems is rife with uninformative and incoherent extractions. To overcome these problems, we introduce two simple syntactic and lexical constraints on binary relations expressed by verbs. We implemented the constraints in the ReVerb Open IE system, which more than doubles the area under the precision-recall curve relative to previous extractors such as TextRunner and WOEpos. More than 30% of ReVerb's extractions are at precision 0.8 or higher, compared to virtually none for earlier systems. The paper concludes with a detailed analysis of ReVerb's errors, suggesting directions for future work.

[7]
Fader A., Zettlemoyer L., & Etzioni O. (2014). Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1156-1165). New York: ACM. We consider the problem of open-domain question answering (Open QA) over massive knowledge bases (KBs). Existing approaches use either manually curated KBs like Freebase or KBs automatically extracted from unstructured text. In this paper, we present OQA, the first approach to leverage both curated and extracted KBs. A key technical challenge is designing systems that are robust to the high variability in both natural language questions and massive KBs. OQA achieves robustness by decomposing the full Open QA problem into smaller sub-problems including question paraphrasing and query reformulation. OQA solves these sub-problems by mining millions of rules from an unlabeled question corpus and across multiple KBs. OQA then learns to integrate these rules by performing discriminative training on question-answer pairs using a latent-variable structured perceptron algorithm. We evaluate OQA on three benchmark question sets and demonstrate that it achieves up to twice the precision and recall of a state-of-the-art Open QA system.

[8]
Hall D., Jurafsky D., & Manning C.D. (2008). Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 363-371). Stroudsburg, PA: Association for Computational Linguistics. How can the development of ideas in a scientific field be studied over time? We apply unsupervised topic modeling to the ACL Anthology to analyze historical trends in the field of Computational Linguistics from 1978 to 2006. We induce topic clusters using Latent Dirichlet Allocation, and examine the strength of each topic over time. Our methods find trends in the field including the rise of probabilistic methods starting in 1988, a steady increase in applications, and a sharp decline of research in semantics and understanding between 1978 and 2001, possibly rising again after 2001. We also introduce a model of the diversity of ideas, topic entropy, using it to show that COLING is a more diverse conference than ACL, but that both conferences as well as EMNLP are becoming broader over time. Finally, we apply Jensen-Shannon divergence of topic distributions to show that all three conferences are converging in the topics they cover.

[9]
Hebert L.E., Weuve J., Scherr P.A., & Evans D.A. (2013). Alzheimer disease in the United States (2010-2050) estimated using the 2010 census. Neurology, 80(19), 1778-1783. To provide updated estimates of Alzheimer disease (AD) dementia prevalence in the United States from 2010 through 2050. Probabilities of AD dementia incidence were calculated from a longitudinal, population-based study including substantial numbers of both black and white participants. Incidence probabilities for single year of age, race, and level of education were calculated using weighted logistic regression and AD dementia diagnosis from 2,577 detailed clinical evaluations of 1,913 people obtained from stratified random samples of previously disease-free individuals in a population of 10,800. These were combined with US mortality, education, and new US Census Bureau estimates of current and future population to estimate current and future numbers of people with AD dementia in the United States. We estimated that in 2010, there were 4.7 million individuals aged 65 years or older with AD dementia (95% confidence interval [CI] = 4.0-5.5). Of these, 0.7 million (95% CI = 0.4-0.9) were between 65 and 74 years, 2.3 million were between 75 and 84 years (95% CI = 1.7-2.9), and 1.8 million were 85 years or older (95% CI = 1.4-2.2). The total number of people with AD dementia in 2050 is projected to be 13.8 million, with 7.0 million aged 85 years or older. The number of people in the United States with AD dementia will increase dramatically in the next 40 years unless preventive measures are developed.

[10]
Hirschberg J., & Manning C.D. (2015). Advances in natural language processing. Science, 349(6245), 261-266.

[11]
Hu B., Dong X., Zhang C., Bowman T.D., Ding Y., Milojević S., ... & Larivière V. (2015). A lead-lag analysis of the topic evolution patterns for preprints and publications. Journal of the Association for Information Science and Technology, 66(12), 2643-2656.

[12]
Hughes M.E., Peeler J., Hogenesch J.B., & Trojanowski J.Q. (2014). The growth and impact of Alzheimer disease centers as measured by social network analysis. JAMA Neurology, 71(4), 412-420. IMPORTANCE: Alzheimer disease (AD) is a neurodegenerative disorder with no effective therapies. In 1984, the National Institute on Aging created the first 5 AD centers (ADCs) in an effort to coordinate research efforts into the pathology and treatment of the disease. Since that time, the ADC program has expanded to include 27 centers in major medical schools throughout the United States. A major aim of ADCs is to develop shared resources, such as tissue samples and patient populations, and thereby promote large-scale, high-impact studies that go beyond the capabilities of any single investigator or institution working in isolation. OBJECTIVE: To quantitatively evaluate the performance of the ADC program over the past 25 years. DESIGN AND SETTING: We systematically harvested every article published by ADC investigators and used social network analysis to analyze copublication networks. RESULTS: A total of 12170 ADC papers were published from 1985 through 2012. The frequency of collaborations has increased greatly from the time that the ADCs were started until the present, even after the expansion of ADCs and the recruitment of new investigators plateaued. Moreover, the collaborations established within the context of the ADC program are increasingly interinstitutional, consistent with the overall goal of the program to catalyze multicenter research teams. Most important, we determined that collaborative multi-ADC research articles are consistently of higher impact than AD articles as a whole. CONCLUSIONS AND RELEVANCE: The ADC program has successfully fostered high-impact, multiuniversity collaborations; we suggest that its structural and administrative features could be replicated in other fields of patient-oriented research.

[13]
Lee D., Kim W.C., Charidimou A., & Song M. (2015). A bird’s-eye view of Alzheimer’s disease research: Reflecting different perspectives of indexers, authors, or citers in mapping the field. Journal of Alzheimer’s Disease, 45(4), 1207-1222. During the last 30 years, Alzheimer's disease (AD) research, aiming to understand the pathophysiology and to improve the diagnosis, management, and, ultimately, treatment of the disease, has grown rapidly. Recently, some studies have used simple bibliometric approaches to investigate research trends and advances in the field. In our study, we map the AD research field by applying entitymetrics, an extended concept of bibliometrics, to capture viewpoints of indexers, authors, or citers. Using the full-text documents with reference section retrieved from PubMed Central, we constructed four types of networks: MeSH-MeSH (MM), MeSH-Citation-MeSH (MCM), Keyphrase-Keyphrase (KK), and Keyphrase-Citation-Keyphrase (KCK) networks. The working hypothesis was that MeSH, keyphrase, and citation relationships reflect the views of indexers, authors, and/or citers, respectively. In comparative network and centrality analysis, we found that those views are different: indexers emphasize amyloid-related entities, including methodological terms, while authors focus on specific biomedical terms, including clinical syndromes. The more dense and complex networks of citing relationships reported in our study, to a certain extent reflect the impact of basic science discoveries in AD. However, none of these could have had clinical relevance for patients without close collaboration between investigators in translational and clinical-related AD research (reflected in indexers and authors' networks). Our approach has relevance for researchers in the field, since they can identify relations between different developments which are not otherwise evident. These developments, combined with advanced visualization techniques, might aid the discovery of novel interactions between genes and pathways or be used as a resource to advance clinical drug development.

[14]
Mausam M. (2016). Open information extraction systems and downstream applications. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. Retrieved on November 4, 2017, from .

[15]
Prince M., Wimo A., Guerchet M., Ali G., Wu Y., & Prina M. (2015). World Alzheimer Report 2015: The global impact of dementia: An analysis of prevalence, incidence, cost and trends. London: Alzheimer’s Disease International.

[16]
Song M., Heo G.E., & Lee D. (2015). Identifying the landscape of Alzheimer’s disease research with network and content analysis. Scientometrics, 102(1), 905-927. Alzheimer disease (AD) is one of degenerative brain diseases, whose cause is hard to be diagnosed accurately. As the number of AD patients has increased, researchers have strived to understand the disease and develop its treatment, such as medical experiments and literature analysis. In the area of literature analysis, several traditional studies analyzed the literature at the macro level like author, journal, and institution. However, analysis of the literature both at the macro level and micro level will allow for better recognizing the AD research field. Therefore, in this study we adopt a more comprehensive approach to analyze the AD literature, which consists of productivity analysis (year, journal/proceeding, author, and Medical Subject Heading terms), network analysis (co-occurrence frequency, centrality, and community) and content analysis. To this end, we collect metadata of 96,081 articles retrieved from PubMed. We specifically perform the concept graph-based network analysis applying the five centrality measures after mapping the semantic relationship between the UMLS concepts from the AD literature. We also analyze the time-series topical trend using the Dirichlet multinomial regression topic modeling technique. The results indicate that the year 2013 is the most productive year and Journal of Alzheimer Disease the most productive journal. In discovery of the core biological entities and their relationships resided in the AD related PubMed literature, the relationship with glycogen storage disease is founded most frequently mentioned. In addition, we analyze 16 main topics of the AD literature and find a noticeable increasing trend in the topic of transgenic mouse.

[17]
Song M., Kim W.C., Lee D., Heo G.E., & Kang K.Y. (2015). PKDE4J: Entity and relation extraction for public knowledge discovery. Journal of Biomedical Informatics, 57, 320-332. Due to an enormous number of scientific publications that cannot be handled manually, there is a rising interest in text-mining techniques for automated information extraction, especially in the biomedical field. Such techniques provide effective means of information search, knowledge discovery, and hypothesis generation. Most previous studies have primarily focused on the design and performance improvement of either named entity recognition or relation extraction. In this paper, we present PKDE4J, a comprehensive text-mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework. Starting with the Stanford CoreNLP, we developed the system to cope with multiple types of entities and relations. The system also has fairly good performance in terms of accuracy as well as the ability to configure text-processing components. We demonstrate its competitive performance by evaluating it on many corpora and found that it surpasses existing systems with average F-measures of 85% for entity extraction and 81% for relation extraction.

[18]
Sorensen A.A. (2009). Alzheimer’s disease research: Scientific productivity and impact of the top 100 investigators in the field. Journal of Alzheimer’s Disease, 16(3), 451-465. The online availability of scientific-literature databases and natural-language-processing (NLP) algorithms has enabled large-scale bibliometric studies within the field of scientometrics. Using NLP techniques and Thomson ISI reports, an initial analysis of the role of Alzheimer's disease (AD) within the neurosciences as well as a summary of the various research foci within the AD scientific community are presented. Citation analyses and productivity filters are applied to post-1984, AD-specific subsets of the PubMed and Thomson ISI Web-of-Science literature bases to algorithmically identify a pool of the top AD researchers. From the initial pool of AD investigators, top-100 rankings are compiled to assess productivity and impact. One of the impact and productivity metrics employed is an AD-specific H-index. Within the AD-specific H-index ranking, there are many cases of multiple AD investigators with similar or identical H-indices. In order to facilitate differentiation among investigators with equal or near-equal H indices, two derivatives of the H-index are proposed: the Second-Tier H-index and the Scientific Following H-index. Winners of two prestigious AD-research awards are highlighted, membership to the Institute of Medicine of the US National Academy of Sciences is acknowledged, and an analysis of highly-productive, high-impact, AD-research collaborations is presented.

[19]
Sorensen A.A., Seary A., & Riopelle K. (2010). Alzheimer’s disease research: A COIN study using co-authorship network analytics. Procedia-Social and Behavioral Sciences, 2(4), 6582-6586.

[20]
Winblad B., Amouyel P., Andrieu S., Ballard C., Brayne C., Brodaty H., ... & Zetterberg H. (2016). Defeating Alzheimer’s disease and other dementias: A priority for European science and society. The Lancet Neurology, 15(5), 455-532. Dementia encompasses a range of neurological disorders characterised by memory loss and cognitive impairment. Alzheimer's disease (AD) is the most common form of dementia, accounting for 50-70% of cases. The most common early symptom of dementia is difficulty in remembering recent events. As the disorder develops, a wide range of other symptoms can emerge, such as disorientation, mood swings, confusion, more serious memory loss, behavioural changes, difficulties in speaking and swallowing, and problems with walking. Progressive accumulation of disability, with deterioration in multiple cognitive domains, interferes with daily functioning, including social and professional functioning. Thus, dementia substantially affects the daily lives of patients, their families, and wider society. Increasing age is the most important risk factor for AD and other dementias, and as life expectancy increases and demographic ageing occurs in populations around the world, the number of people with dementia is expected to increase. In 2015, almost 47 million people worldwide were estimated to be affected by dementia, and the numbers are expected to reach 75 million by 2030, and 131 million by 2050, with the greatest increase expected in low-income and middle-income countries. In 2012 and 2015, the World Health Organization (WHO) presented reports in which it acknowledged this trend (sometimes described in terms of a fast-growing epidemic) and concluded that AD and other dementias should be regarded as a global public health priority.

[21]
Yokoyama J.S., Wang Y., Schork A.J., Thompson W.K., Karch C.M., Cruchaga C., ... & Desikan R.S. (2016). Association between genetic traits for immune-mediated diseases and Alzheimer disease. JAMA Neurology, 73(6), 691-697. IMPORTANCE: Late-onset Alzheimer disease (AD), the most common form of dementia, places a large burden on families and society. Although epidemiological and clinical evidence suggests a relationship between inflammation and AD, their relationship is not well understood and could have implications for treatment and prevention strategies. OBJECTIVE: To determine whether a subset of genes involved with increased risk of inflammation are also associated with increased risk for AD. DESIGN, SETTING, AND PARTICIPANTS: In a genetic epidemiology study conducted in July 2015, we systematically investigated genetic overlap between AD (International Genomics of Alzheimer's Project stage 1) and Crohn disease, ulcerative colitis, rheumatoid arthritis, type 1 diabetes, celiac disease, and psoriasis using summary data from genome-wide association studies at multiple academic clinical research centers. P values and odds ratios from genome-wide association studies of more than 100,000 individuals were from previous comparisons of patients vs respective control cohorts. Diagnosis for each disorder was previously established for the parent study using consensus criteria. MAIN OUTCOMES AND MEASURES: The primary outcome was the pleiotropic (conjunction) false discovery rate P value. Follow-up for candidate variants included neuritic plaque and neurofibrillary tangle pathology; longitudinal Alzheimer's Disease Assessment Scale cognitive subscale scores as a measure of cognitive dysfunction (Alzheimer's Disease Neuroimaging Initiative); and gene expression in AD vs control brains (Gene Expression Omnibus data). RESULTS: Eight single-nucleotide polymorphisms (false discovery rate P < .05) were associated with both AD and immune-mediated diseases. Of these, rs2516049 (closest gene HLA-DRB5; conjunction false discovery rate P = .04 for AD and psoriasis, 5.37 × 10^-5 for AD, and 6.03 × 10^-15 for psoriasis) and rs12570088 (closest gene IPMK; conjunction false discovery rate P = .009 for AD and Crohn disease, P = 5.73 × 10^-6 for AD, and 6.57 × 10^-5 for Crohn disease) demonstrated the same direction of allelic effect between AD and the immune-mediated diseases. Both rs2516049 and rs12570088 were significantly associated with neurofibrillary tangle pathology (P = .01352 and .03151, respectively); rs2516049 additionally correlated with longitudinal decline on Alzheimer's Disease Assessment Scale cognitive subscale scores (β [SE], 0.405 [0.190]; P = .03). Regarding gene expression, HLA-DRA and IPMK transcript expression was significantly altered in AD brains compared with control brains (HLA-DRA: β [SE], 0.155 [0.024]; P = 1.97 × 10^-10; IPMK: β [SE], -0.096 [0.013]; P = 7.57 × 10^-13). CONCLUSIONS AND RELEVANCE: Our findings demonstrate genetic overlap between AD and immune-mediated diseases and suggest that immune system processes influence AD pathogenesis and progression.

[22]
Zhang C., Bu Y., Ding Y., & Xu J. (2017). Understanding scientific collaboration: Homophily, transitivity, and preferential attachment. Journal of the Association for Information Science and Technology. Retrieved on November 4, 2017, from .
