Research Paper

A Bootstrapping-based Method to Automatically Identify Data-usage Statements in Publications

  • Qiuzi Zhang,
  • Qikai Cheng,
  • Yong Huang & Wei Lu
  • School of Information Management, Wuhan University, Wuhan 430072, China

Received date: 2016-01-21

Revised date: 2016-02-19

Online published: 2016-03-15

Supported by

This work was supported by the National Natural Science Foundation of China (Grant No.: 71473183).

Abstract

Purpose: Our study proposes a bootstrapping-based method to automatically extract data-usage statements from academic texts.

Design/methodology/approach: The extraction method starts with seed entities and iteratively learns patterns and data-usage statements from unlabeled text. In each iteration, candidate patterns are constructed and scored, and the selected ones are added to the pattern list on the basis of their score. Three seed-selection strategies are also proposed in this paper.
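The iterative loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each sentence has already been reduced to a (subject, relation, object) triple, that a pattern is simply the relation phrase, and that patterns are ranked with the RlogF score used in earlier bootstrapping work by Riloff. The function names, thresholds, and toy sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

# Assumed sentence representation: (subject, relation, object).
Triple = Tuple[str, str, str]


def rlogf(known_hits: int, total_hits: int) -> float:
    """RlogF score: (F / N) * log2(F), where F counts known entities
    among the N distinct objects a pattern extracts. Assumed scoring."""
    if known_hits == 0 or total_hits == 0:
        return 0.0
    return (known_hits / total_hits) * math.log2(known_hits)


def bootstrap(
    triples: Iterable[Triple],
    seeds: Set[str],
    iterations: int = 2,
    patterns_per_iter: int = 1,
    min_score: float = 0.5,
):
    """Alternate between learning patterns from known entities and
    learning new entities from the accepted patterns."""
    triples = list(triples)
    entities = set(seeds)
    patterns: List[str] = []

    # Group the distinct objects each relation phrase co-occurs with.
    objects_by_relation = defaultdict(set)
    for _, relation, obj in triples:
        objects_by_relation[relation].add(obj)

    for _ in range(iterations):
        # Score every relation that is not yet an accepted pattern.
        scored = []
        for relation, objs in objects_by_relation.items():
            if relation in patterns:
                continue
            score = rlogf(len(objs & entities), len(objs))
            if score >= min_score:
                scored.append((score, relation))

        # Keep the best-scoring candidates; ties break alphabetically.
        scored.sort(key=lambda item: (-item[0], item[1]))
        new_patterns = [relation for _, relation in scored[:patterns_per_iter]]
        if not new_patterns:
            break

        patterns.extend(new_patterns)
        # Objects matched by the new patterns become known entities.
        for relation in new_patterns:
            entities |= objects_by_relation[relation]

    # A triple matched by an accepted pattern counts as a data-usage statement.
    statements = [t for t in triples if t[1] in patterns]
    return patterns, entities, statements


# Toy corpus, invented for illustration only.
toy_triples = [
    ("we", "evaluate on", "MNIST"),
    ("we", "evaluate on", "CIFAR-10"),
    ("we", "evaluate on", "Reuters-21578"),
    ("experiments", "are conducted on", "Reuters-21578"),
    ("experiments", "are conducted on", "TREC"),
    ("experiments", "are conducted on", "MNIST"),
    ("we", "propose", "a new model"),
]
toy_seeds = {"MNIST", "CIFAR-10"}

if __name__ == "__main__":
    learned_patterns, learned_entities, found = bootstrap(toy_triples, toy_seeds)
    print(learned_patterns)
    print(sorted(learned_entities))
    print(len(found))
```

On this toy corpus the relation "are conducted on" is accepted only in the second round, after the first accepted pattern has added a new entity to the known set. This is the effect the iterative design is meant to produce: each round of accepted patterns widens the evidence available for scoring the next round.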

Findings: The performance of the method is verified by means of experiments on real data collected from computer science journals. The results show that the method can achieve satisfactory performance regarding precision of extraction and extensibility of obtained patterns.

Research limitations: While the triple representation of sentences is effective and efficient for extracting data-usage statements, it is unable to handle complex sentences. Additional features that can address complex sentences should thus be explored in the future.

Practical implications: Extracting data-usage statements supports data-repository construction and facilitates research on data-usage tracking, dataset-based scholar search, and dataset evaluation.

Originality/value: To the best of our knowledge, this paper is among the first to address the important task of automatically extracting data-usage statements from real data.


http://ir.las.ac.cn/handle/12502/8479

Cite this article

Qiuzi Zhang, Qikai Cheng, Yong Huang & Wei Lu. A Bootstrapping-based Method to Automatically Identify Data-usage Statements in Publications[J]. Journal of Data and Information Science, 2016, 1(1): 69-85. DOI: 10.20309/jdis.201606
