Expert Review

Big Metadata, Smart Metadata, and Metadata Capital: Toward Greater Synergy Between Data Science and Metadata

  • Jane Greenberg†
  • College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, USA
Corresponding author: Jane Greenberg (E-mail: ).

Online published: 2017-08-25

Open Access

Abstract

Purpose: The purpose of the paper is to provide a framework for addressing the disconnect between metadata and data science. Data science cannot progress without metadata research. This paper takes steps toward advancing the synergy between metadata and data science, and identifies pathways for developing a more cohesive metadata research agenda in data science.

Design/methodology/approach: This paper identifies factors that challenge metadata research in the digital ecosystem, defines metadata and data science, and presents the concepts big metadata, smart metadata, and metadata capital as part of a metadata lingua franca connecting to data science.

Findings: The “utilitarian nature” and “historical and traditional views” of metadata are identified as two intersecting factors that have inhibited metadata research. Big metadata, smart metadata, and metadata capital are presented as part of a metadata lingua franca to help frame research in the data science research space.

Research limitations: There are additional, intersecting factors to consider that likely inhibit metadata research, and other significant metadata concepts to explore.

Practical implications: The immediate contribution of this work is that it may elicit response, critique, revision, or, more significantly, motivate research. The work presented can encourage more researchers to consider the significance of metadata as a research worthy topic within data science and the larger digital ecosystem.

Originality/value: Although metadata research has not kept pace with other data science topics, there is little attention directed to this problem. This is surprising, given that metadata is essential for data science endeavors. This examination synthesizes original and prior scholarship to provide new grounding for metadata research in data science.

Cite this article

Jane Greenberg. Big Metadata, Smart Metadata, and Metadata Capital: Toward Greater Synergy Between Data Science and Metadata[J]. Journal of Data and Information Science, 2017, 2(3): 19-36. DOI: 10.1515/jdis-2017-0012

Jane Greenberg is the Alice B. Kroeger Professor and Director of the Metadata Research Center (http://cci.drexel.edu/mrc/) at the College of Computing and Informatics, Drexel University. Her research activities focus on metadata, knowledge organization/semantics, linked data, data science, and economics. She serves on the advisory board of the Dublin Core Metadata Initiative (DCMI) and the steering committee for the NSF Northeast Big Data Innovation Hub (NEBDIH) (http://nebigdatahub.org/). She is a principal investigator on the NSF Spoke initiative, “A Licensing Model and Ecosystem for Data Sharing,” and the principal investigator for the Metadata Capital Initiative (MetaDataCAPT’L) and the Helping Interdisciplinary Vocabulary Engineering (HIVE) linked data project. She is also a co-principal investigator (CoPI) for Drexel’s NSF Industry/University Collaborative Research Center (NSF-I/UCRC), the Center for Visual and Decision Informatics (CVDI) (http://www.nsfcvdi.org/). Her research has been funded by the US National Science Foundation (NSF), National Institutes of Health (NIH), Institute of Museum and Library Services (IMLS), CVDI, National Consortium for Data Science, Microsoft Research, National Library of Medicine, Library of Congress, and OCLC Online Computer Library Center, among other organizational and private sponsors. She has received numerous awards and honors for her research and leadership; most recently she was recognized as a 2016 ELATE at Drexel® Fellow, a 2014 Data Science Fellow at the National Consortium for Data Science, and, in 2012, she held a Chair of Excellence at the University of Juan Carlos III, Madrid, Spain.

1 Introduction

Metadata, as a form of data affixed to other data, is indispensable to data science and the interconnected domain of data analytics. Metadata describes data, provides context, and is vital for accurate data interpretation and use by both humans and machines. Given this dependency, it is logical to conclude that metadata innovation ought to have progressed in tandem with advances in big data and data science. To this end, leading data science journals and conferences have been increasing coverage of metadata research and development (R&D). Examples include Data Science Journal (v. 15, 2016) dedicated to data models, and a recent issue of the Journal of Data and Information Science (JDIS) with Li and Sugimoto’s (2017) work on long-term maintenance of metadata. Another example is the 2nd IEEE Workshop on Big Data Metadata and Management (BDMM, 2017; http://cci.drexel.edu/bigdata/bigdata2017/Workshops.html), to be hosted at the 2017 IEEE International Conference on Big Data in Boston, Massachusetts, USA. Nevertheless, despite such important developments, metadata R&D has not kept pace within the greater framework of data science.
Research literature and notable reports reveal the metadata research lag in data science. For example, Smith et al. (2014) call for metadata research, emphasizing that “current big data ecosystems lack a principled approach to metadata management.” Another clear example is the US report entitled The Federal Big Data Research and Development Strategic Plan (NITRD, 2016). This report stresses the need for research on metadata frameworks to ensure data trustworthiness, and identifies a myriad of metadata-related research topics, many of which are found in similar governmental and disciplinary reports worldwide (e.g. ERAC Secretariat, 2016; Ilevbare, Athanassopoulou, & Wooldridge, 2017). Collectively, these examples make clear the absence of a coherent metadata research agenda in data science. This gap raises questions about why there is a disconnect between metadata and the larger sphere of data science research, and how to address this challenge.
This article considers these questions as steps toward advancing the synergy between metadata and data science. The following section describes metadata and data science, followed by discussion of two intersecting factors that challenge metadata research in the digital ecosystem. Next, the paper introduces the concepts big metadata, smart metadata, and metadata capital. These concepts are presented as contributions to the metadata lingua franca connecting to the data science space. The conclusion summarizes key discussion points and considers next steps for advancing metadata research in the area of data science.

2 Metadata and Data Science Defined

Exploring the interconnection between metadata and data science requires a review of these two concepts.

2.1 Metadata: A Value-added Language

Metadata has been loosely defined and popularized as data about data, information about information. More comprehensive definitions address metadata as structured data supporting functions associated with an object, an object being any “entity, form, or mode” (Greenberg, 2005, 2010; Lytras, Sicilia, & Cechinel, 2013). Examples of metadata functions include data discovery, access, use, provenance tracking, authenticity and security verification, preservation management, and other activities throughout the data lifecycle (UK Data Archive, 2012). Metadata researchers draw on these functions to create typologies identifying different metadata types (Méndez & van Hooland, 2013). Business and data warehousing typologies generally include business and technical metadata, and, at times, also include process and operational metadata (Dong et al., 2016; Shankaranarayanan & Even, 2006; Vaduva & Dittrich, 2001). The digital library, archive, and repository communities have categories, such as descriptive, technical, preservation, provenance, and usage metadata (Zeng & Qin, 2016). In these cases, the “types” of metadata connect to the lifecycle of the object being represented or tracked.
To understand the full extent of metadata, it is important to recognize that the adjectival label “metadata” is not always used when, in fact, the data of interest has a meta status. In other words, data that is meta, an abstraction of another object, is not always labeled as metadata. Common examples include provenance data, linked data, contextual data, and authenticity data. These data exist only because of the actuality of other objects, and can only occur as a result of an object’s activity.
Beyond labeling and categorization, metadata can more universally be thought of as value-added language that serves as an integrated layer in an information system. When appropriately placed and accessible, by human or machine, metadata language eloquently enables the interplay between an object, such as data, and the desired activity, such as discovery, access, provenance tracking, calculation, or other directives. To understand metadata research opportunities in data science, it is useful to also review the meaning of data science, as follows.
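The typology described above can be illustrated with a minimal sketch. It assumes the five digital library categories named in this section; the class names, the object identifier, and the attribute names and values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class MetadataType(Enum):
    # Categories named in the digital library, archive, and repository typology.
    DESCRIPTIVE = "descriptive"
    TECHNICAL = "technical"
    PRESERVATION = "preservation"
    PROVENANCE = "provenance"
    USAGE = "usage"


@dataclass
class MetadataRecord:
    object_id: str  # hypothetical identifier of the object being described
    statements: dict = field(default_factory=dict)

    def add(self, mtype: MetadataType, attribute: str, value: str) -> None:
        # Group each attribute under the metadata type it belongs to.
        self.statements.setdefault(mtype, {})[attribute] = value

    def types_present(self) -> set:
        # Report which metadata types this record actually carries.
        return set(self.statements)


# Illustrative values only; none of these come from a real system.
record = MetadataRecord(object_id="dataset-001")
record.add(MetadataType.DESCRIPTIVE, "title", "Stream temperature readings")
record.add(MetadataType.TECHNICAL, "format", "text/csv")
record.add(MetadataType.PROVENANCE, "derived_from", "sensor-export-17")
```

Grouping statements by type makes it straightforward to ask which lifecycle functions a given record can support, for example whether provenance tracking is possible at all.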

2.2 Data Science: Leveraging Data to Gain New Insights

Data science is an interdisciplinary field that targets studying and leveraging data to gain insights. A data science undertaking may enable one to predict a phenomenon or automate decision-making. The Data Science Association defines data science as the “scientific study of the creation, validation and transformation of data to create meaning” (Data Science Association, 2017). Data science draws upon the full range of data (small, big, static, structured, unstructured, or streaming), and applies scientific and statistical methodologies to learn from data (van der Aalst, 2016).
Data science has many aspects, and the collection of definitions reveals different emphases. For example, Dhar (2013) focuses on the predictive capabilities of data, emphasizing application of statistical methods. Stanton (2012) offers a broader definition, explaining that data science encompasses a full range of activities, including the “collection, preparation, analysis, visualization, management, and preservation of large collections of information.” The unifying factor across various definitions is the “science” that comprises defining appropriate questions, selecting and obtaining suitable data, and applying the correct, and at times innovative, modeling and statistical methods.
The “science” of data science indicates a methodological and systematic approach to leveraging data as part of studying a problem or a phenomenon. Data science endeavors rely not only on data, but also on accurate description of the data, hence metadata. Given the reliance on metadata, one would anticipate appropriate support for, and recognition of, the value of research addressing metadata processes, applications, and societal impacts. Unfortunately, there are a number of key impediments to understanding the scientific merit of metadata research. These impediments are reviewed below in the context of challenges to metadata R&D.

3 Challenges to Metadata Research

Information and library science, computer science, and a number of disciplinary domains (e.g. biology, medicine, materials science, and geography to name a few) support a generally tightly-knit, robust metadata community through interest groups within a larger association, several targeted conferences, and focused publications, such as the International Journal of Metadata, Semantics, and Ontologies. Despite strong, historical grounding, metadata research in data science, and the larger digital ecosystem, is restrained by not being considered a true scientific endeavor. More specifically, challenges to metadata research stem, to a large degree, from two intersecting factors: 1) the utilitarian nature of metadata, and 2) historical and traditional perceptions of metadata.

3.1 The Utilitarian Nature of Metadata

Metadata is generally viewed as a practical application relating to cataloging, indexing, database development, and the recording of digital transactions. This point is underscored in “Metadata in Everyday Life,” the first section of Understanding Metadata (Riley, 2017), the new primer from NISO (National Information Standards Organization, http://www.niso.org/home/). To be clear, seeking pragmatic solutions with metadata is vital to nearly any digital undertaking; however, a pragmatic emphasis can challenge research opportunities. An example here is the rationalist approach pursued for schema design. That is, data dictionaries and metadata application development are commonly based on practical experience, rather than substantive empirical or theoretical approaches.
Another utilitarian aspect affecting perceptions of metadata stems from the pressing need for metadata to accommodate the exponential growth of data and the larger digital ecosystem, which limits resources (time, personnel, and finances) that could otherwise be allocated toward deeper metadata research analyses and theoretical development (Greenberg, 2009). As noted above, there is a robust metadata research community; however, the pragmatic strength and necessities of metadata have very likely impeded development of a more rigorous metadata agenda in data science.

3.2 Historical and Traditional Perceptions of Metadata

Metadata carries baggage similar to that of cataloging (Coleman, 2005; Tennant, 2002). Specific criticisms address the Semantic Web, with claims that ontologies cannot support automatic reasoning (Shirky, 2005), that the mark-up is excessive (Manian, 2011), and that the goals underlying linked data are unrealistic. There is also the concept of “metacrap,” coined by Doctorow (2001), referring to the impossibility of “exhaustive, reliable metadata” due to “insurmountable obstacles,” and proclamations that automated methods will take over, obviating the need to investigate metadata (Dimitrova, 2004). The metadata community has internal critics as well, as demonstrated by Beall’s “Dublin Core: An Obituary” (2004), and his later piece, “Dublin Core is Still Dead” (2014). Both articles lambaste the Dublin Core metadata standard, despite the fact that it is one of the most universally adopted, cross-disciplinary, and internationally used metadata standards.
Traditional perceptions are further reflected in differing opinions about metadata and what constitutes a science. An illustrative example is found in “There Is No Science of Data,” a discussion on Visual Business Intelligence: A blog by Stephen Few, wherein the author states, “Metadata is a rather simple concept that doesn’t seem to require scientific study” (Few, 2017; see Figure 1).
Figure 1. Visual Business Intelligence: A blog by Stephen Few (January 23, 2017).
Few (the blog author) has over 30 years of experience in business intelligence and information design, and his viewpoint clearly illustrates that many simply equate metadata with the nuts-and-bolts of an information system, rather than a research-worthy topic. Few continues his blog discussion by observing that information science and data science are also misnomers, even though a discussion contributor named Konrad responds, “Actually there is a whole academic discipline dedicated to the study of information...” (Konrad, January 24, 2017 at 1:11 am). Further, this participant references the Wikipedia entry for information science, which is substantive, with credible references confirming the existence of the discipline. Although continued discussion of what is a science extends beyond the scope of this JDIS metadata-focused paper, it is important to recognize that differing opinions impact views on what is a research-worthy topic.

3.3 Summary: Moving Past the Impediments

Overall, the discussion above provides insight into why metadata research faces impediments in data science, and other disciplines. Nevertheless, the value of metadata cannot be denied. In fact, the significance of metadata became a mainstream media topic with Edward Snowden’s whistleblowing on the US government’s surveillance of personal phone record metadata, without individual consent or knowledge of this activity (Greenwald, 2013). In advocating for greater attention to metadata research, the following section presents three concepts to foster dialogue about metadata, and help provide a framework for metadata-focused research in data science.

4 Metadata Concepts Relating to Data Science

Every domain has its lingua franca; that is, a language that community members understand and use to correspond. A domain specific language may select or co-opt concepts or terms from another community, and tailor terms for their own needs. Zeng and Qin’s “Metadata Research Landscape” (Chapter 2, in Metadata, 2016) helps document the current metadata research lingua franca, covering metadata architecture, modeling, semantics, and data-driven aspects. Commonly used research methods are also part of this lingua franca. Examples include content analyses that generally target metadata quality (e.g. Zavalina, 2011), cross-walk analyses for metadata scheme development (e.g. Gaitanou et al., 2016), experiments comparing automatic metadata generation approaches, and semantic mapping assessments (e.g. Vlachidis et al., 2013; White, Willis, & Greenberg, 2014). The existing lingua franca forms an important footing for metadata R&D in data science. In advancing the dialogue and advocating for further metadata research, the following section presents three concepts that provide a framework for metadata-focused research in data science.

4.1 Big Metadata

The data science enterprise has been motivated by the availability of massive amounts of digital data and new capacities for data-driven solutions. These ideas are central to the “fourth paradigm,” a dimension coined by Microsoft Research visionary Jim Gray, and captured by Hey, Tansley, and Tolle (2009), to explain the growing, unprecedented opportunity for data-driven science. Metadata is a vital component of the fourth paradigm, although the significance of metadata is often overlooked or only noted in a limited way. Metadata can garner new research attention if it is understood as big metadata.
Big metadata is both a first-class object and an auxiliary associated with the wide, seemingly countless variety of data formats, types, and genres. Simon’s piece, “Too Big to Ignore: The Business Case for Big Data” (2013), validates the importance of metadata for big data. He also confirms the existence of “big metadata” in reference to the wide diversity of data. The concept big metadata has also appeared in the research literature. For example, Smith et al. (2014) discuss big metadata in relation to the US government’s trove of big data; and Zhao et al. (2014) identify big metadata as a vital data source that can give insight into real world traffic problems.
Beyond an association with big data diversity and size, big metadata reflects the wide range of data lifecycle activities found among projects, settings, and systems. Data lifecycle scenarios extend from simple (data creation, capture, storage, and preservation) to complex (data use, reuse, repurposing, and modification), using both human and automatic processes. And, at the data lifecycle meta level is the metadata lifecycle, which generates big metadata.
Big metadata is defined in Table 1 in terms of volume, velocity, variety, variability, and value, building on the common 5Vs used to define big data.

Table 1 The five Vs of big metadata.

Five Vs: Definition

Volume: The quantity and usefulness of metadata generated daily confirms the existence of big metadata. At times metadata is less than or equal to the extent of the data it describes in size (bytes). At other times the metadata exceeds the data being described or tracked, due to the complexity of the data lifecycle activity. Linked data offers an example, with metadata renderings that can be larger than the volume of data object(s) being represented. Like big data, not all big metadata is useful, and a challenge is to identify the big metadata that is useful for data science and analytic endeavors.

Velocity: Metadata is generated via automatic processes at immense speed, correlating with the rate of digital transactions. For example, searching Google, answering an email, purchasing an item online, and day-to-day office activities such as word processing all generate log data, as well as associated metadata.

Variety: Metadata reflects the wide variety of data formats, types, and genres, along with the extensive range of data and metadata lifecycles. In addition, the different types of metadata (e.g. discovery, technical, preservation, etc.) as well as unique domain-specific metadata requirements intensify the variety.

Variability: There is an unmistakable unevenness of metadata across the digital ecosystem. Lack of uniformity is extensive for data descriptions across different domains, systems, and processes. This unevenness can even be profound within domains, given economic factors supporting metadata generation, competing standards, or, simply, differing adoption policies. For example, two organizations may use the same metadata standard, but have different implementation practices. Even when standardization is imposed, organizations, processes, and human activity can contribute to inconsistencies.

Value: If data is the new black gold* (akin to petroleum, which requires purification but is also a money maker), then metadata is the new platinum: a malleable substance that keeps its toughness and can serve as a catalyst, sparking a reaction. Metadata, as the new platinum, can be modified while remaining a strong, independent data type. Metadata stands as a durable data object that triggers various functions (the catalyst) and achieves results (a reaction). Metadata is vital to accurate data interpretation and use by both humans and machines, and the value of metadata for data science endeavors cannot be overstated or diminished.

Note. *Singh (2013) identified data as the new black gold on Wired.com.
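The volume observation in Table 1, that metadata can exceed the data it describes, can be made concrete with a small sketch. The observation value, the attribute names, and the `urn:example:` identifiers below are all hypothetical; the description is a plain dictionary standing in for a linked data rendering, not an actual RDF serialization.

```python
import json

# A hypothetical single observation: the data object itself.
observation = "21.4"

# A hypothetical description of that observation. Every key and
# identifier here is invented for illustration.
description = {
    "subject": "urn:example:observation:1",
    "unit": "degree Celsius",
    "measured_by": "urn:example:sensor:17",
    "measured_at": "2017-08-25T10:00:00Z",
    "derived_from": "urn:example:export:3",
}

# Compare the size in bytes of the data and of its serialized metadata.
data_bytes = len(observation.encode("utf-8"))
metadata_bytes = len(json.dumps(description).encode("utf-8"))
ratio = metadata_bytes / data_bytes

print(f"data: {data_bytes} bytes, metadata: {metadata_bytes} bytes")
```

Even with only five descriptive statements, the serialized description is many times larger than the four-byte value it describes.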

Table 1 draws from the commonly applied 5Vs (Marr, 2014), although other big data frameworks with nuanced or even different criteria likely apply to big metadata. Clearly, data science is not limited to big data; however, exploring the framework above is warranted inasmuch as it helps define big metadata and identify research pathways. Smart metadata, discussed in the following section, offers another fresh insight into metadata in the area of data science.

4.2 Smart Metadata

Metadata is inherently smart data because it provides context and meaning for data. One of the earliest uses of “smart metadata” was for a special session entitled “Smart Metadata” at the 2003 Dublin Core Conference, Seattle, Washington (DCMI, 2003). Themes in this special session included interoperable metadata, Semantic Web support, accessibility, and ontologies. Around the same time, van Hemel et al. (2003) promoted the idea of smart metadata in reference to the Semantic Web and the use of the Resource Description Framework (RDF) for topic maps. In 2007, Kogen, Miller, and Schobbe (2007) of the Microsoft Corporation used the term smart metadata as part of a patent description for a technique supporting metadata field management in a taxonomy system. Since that time, there does not appear to be a clear path for using the term “smart metadata” although research and discussions acknowledge metadata as a value-added factor supporting smart search, and as an enabler or characteristic of the Semantic Web and linked data (e.g. Fatima, Luca, & Wilson, 2014; Oh, Yi, & Jang, 2015). Zeng underscores this point in her work on smart data in the humanities, specifically in a recent discussion segment entitled “How to Transform Big Data into Smart Data?”, where she identifies Semantic Web standards along with other semantic technologies (Zeng, 2017) as part of the solution.
A related aspect of smart metadata is the alignment with smart technology, including smart mobile devices and appliances. Examples include mobile health technology, such as the Fitbit, tracking heart rate, calories burned, and miles walked or run; smart buildings, using sensors to control lighting or the heating, ventilation, and air conditioning (HVAC) unit; the innovation of smart cities, powered by a smart grid and interlinking to the Internet of Things (IoT); or the more recently proposed phenomenon, the Internet of Everything (IoE). From smart technology to the more encompassing smart environment, there is reliance on the collection of data, including metadata, feeding data-driven algorithms and launching intelligent, actionable processes.
Smart metadata has received attention within smart technology research. For example, Abbasi, Vassilopoulou, and Stergioulas (2017) used the phrase “smart metadata” to identify research directions and new tools supporting better use of digital media and the larger IoT. Contractor et al. (2015) refer to smart metadata in their analysis of the Learning Content Hub, a content management system supporting automatic metadata assignment, and the use of analytics to build customized educational applications. Similarly, researchers identify smart metadata as part of their design for a personalized, recommendation engine for TV programs (Thyagaraju & Kulkarni, 2011). In all these cases, metadata is smart in that it enables an action that draws on the data being represented or tracked. The action depends on good quality metadata that is accessible, preserved over time, and trusted. These ideas translate into the principles presented in Figure 2, forming a smart metadata matrix.
Figure 2. Smart metadata matrix of principles.
Principles of smart metadata (Figure 2) are defined as follows:
  • Good quality: Smart metadata is good quality metadata. A number of researchers have identified criteria that define good quality metadata. Bruce and Hillmann (2004) present one of the most well-known sets of criteria for determining metadata quality. Figure 2 identifies five of Bruce and Hillmann’s criteria that are essential for smart metadata. Good quality metadata is also trusted metadata, and produced by a reliable source.
  • Accessible: Smart metadata is accessible, along with the data being represented, to support data-driven activities. There are multiple system levels connecting metadata and access. First, metadata specifies technical requirements for accessing and using data, such as technologies needed. Second, metadata indicates the access policy, such as required permissions, rights, and other protocols that enable metadata and data access and use. Third, metadata, as a “smart” asset, is accessible along with the data being represented, so that both data sources, metadata and data, can be used for data science inquiries.
  • Actionable: Smart metadata is actionable. That is, smart metadata is formatted so that it can be ingested and understood by humans and/or machines, as required, to invoke or execute an operation. The consumable state of smart metadata also needs to be reflected in the data being represented.
  • Preserved: Smart metadata is preserved in a useful manner. This step is critical for identifying data patterns over time. Big data is volatile and metadata is often modified, enriched, or deleted in sync with change. Interpreting data change over time is difficult or even impossible when previously generated metadata is absent. Metadata must be preserved by a trusted, dependable source; this includes the preservation of metadata vocabularies, such as data dictionaries and attribute descriptions. Research on metadata longevity (Li & Sugimoto, 2017; Sugimoto et al., 2016) has resulted in a framework solution for preserving metadata. Additionally, Kunze et al. (2016a; 2016b) present complementary work developing a persistence vocabulary. These are significant initiatives that can help further formalize our understanding of preservation as a principle for smart metadata.
  • Trust: The last smart metadata principle, trust, connects across all principles, although it primarily links with quality and preservation. As noted above, good quality metadata is trusted metadata, and produced by a reliable source; and metadata that is preserved must be overseen and maintained by a dependable entity.
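The five principles above can be read as a checklist applied to a metadata record. The following is a minimal sketch under stated assumptions, not a method from the cited literature: each check is a deliberately simple stand-in (for example, required fields being present stands in for quality, and it does not implement Bruce and Hillmann’s criteria), and all field names and source names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class SmartMetadataCheck:
    good_quality: bool
    accessible: bool
    actionable: bool
    preserved: bool
    trusted: bool

    def unmet(self) -> list:
        # Names of the principles the record fails, in declaration order.
        return [name for name, ok in vars(self).items() if not ok]


def is_serializable(record: dict) -> bool:
    # Stand-in for "actionable": the record can be emitted in a
    # machine-ingestible form (here, JSON).
    try:
        json.dumps(record)
        return True
    except (TypeError, ValueError):
        return False


def assess(record: dict, trusted_sources: set) -> SmartMetadataCheck:
    required = ("title", "creator")  # hypothetical required fields
    return SmartMetadataCheck(
        # Stand-in for quality: required fields are present and non-empty.
        good_quality=all(record.get(key) for key in required),
        # Technical requirements and access policy are both stated.
        accessible=bool(record.get("format")) and bool(record.get("access_policy")),
        actionable=is_serializable(record),
        # Previously generated metadata has been retained.
        preserved=bool(record.get("prior_versions")),
        # The record comes from a source the consumer already trusts.
        trusted=record.get("source") in trusted_sources,
    )


# Illustrative record; every value is invented.
record = {
    "title": "Stream temperature readings",
    "creator": "Field team",
    "format": "text/csv",
    "access_policy": "open",
    "prior_versions": ["v1"],
    "source": "repository-a",
}
result = assess(record, trusted_sources={"repository-a"})
print(result.unmet())
```

In practice each stand-in would be replaced by a domain-appropriate assessment; the sketch only shows how the principles can be evaluated together and reported per record.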
The smart metadata principles defined here qualify metadata as value-added data. The next section of this paper explores value relating to metadata more thoroughly through the concept metadata capital.

4.3 Metadata Capital

“Metadata capital” is a concept that emerged through research on data and metadata reuse in the Dryad data repository (Greenberg, Swauger, & Feinstein, 2013). Capital, broadly speaking, is understood as an asset with value; and the value may be financial, intellectual, social, or defined in other ways. Capital is most commonly associated with finance and wealth, and draws from work such as Adam Smith’s An Inquiry into the Nature and Causes of the Wealth of Nations, published in 1776 (Smith, 1776). Smith’s (1776) emphasis is on market value, or financial aspects. The financial component of capital has been explored more specifically through the Metadata Capital Initiative. This research was predicated on the fact that value, as a financial indicator, can be measured. The incipient effort has chiefly applied the modified capital gains equation (Greenberg, Swauger, & Feinstein, 2013; Greenberg et al., 2014a; 2014b), and calculated costs associated with metadata creation and reuse to determine value. Specifically, metadata reuse demonstrates a greater return on investment (ROI) by adding value to the initial metadata cost.
It is important to point out that cost and value are not always aligned; this is because a product can cost more than it is worth, or be assigned a price that is below its worth. Even so, financial cost can be calculated. The metadata capital work postulates that when a purchased item is reused, over time, it is worth more than its original cost. Analogies to consider include a top-end stainless steel pot that is used over and over, without any change, and always supporting cooking to perfection; or an antique chest that has been passed down generation after generation, and is used to store sweaters in the summer, while also serving as a piece of furniture, becoming more valuable with age.
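The intuition that reuse adds value to the initial metadata cost can be sketched with simple arithmetic. This is an illustrative assumption only and is not the modified capital gains equation applied in the cited studies: it treats each reuse as avoiding the cost of creating the metadata again, less a hypothetical per-reuse overhead. The function and parameter names are invented for this example.

```python
def reuse_savings(creation_cost: float, reuse_count: int,
                  reuse_overhead: float = 0.0) -> float:
    # Each reuse avoids re-creating the metadata, minus the overhead
    # of locating and adapting the existing record.
    return reuse_count * (creation_cost - reuse_overhead)


def reuse_roi(creation_cost: float, reuse_count: int,
              reuse_overhead: float = 0.0) -> float:
    # Savings expressed relative to the initial investment.
    return reuse_savings(creation_cost, reuse_count, reuse_overhead) / creation_cost


# Hypothetical figures: metadata costing 100 units to create,
# reused 4 times at an overhead of 10 units per reuse.
savings = reuse_savings(100.0, 4, 10.0)
roi = reuse_roi(100.0, 4, 10.0)
print(savings, roi)
```

Under these assumptions the return grows linearly with the number of reuses and is zero when the metadata is never reused.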
As stated above, capital, wealth, and value do not solely apply to financial matters, despite the fact that much of the big data and data science coverage is associated with economic incentive and opportunity. The broader interpretation of capital extends to knowledge (intellectual capital), to friendships and other personal and professional relationships (social capital), and to other areas, including some still likely to be discovered. Drawing on this broader context of value, a formalized definition of “metadata capital,” originally published in the Bulletin of the Association for Information Science and Technology (Greenberg, 2014), is as follows.
1. An asset that contains contextual knowledge about content.
a. Content is the data or information contained in any information object (any “entity, form, or mode”).
b. Context is who, what, where, when, how, why, etc., which can be captured via metadata attributes (Kunze, 2001).
2. A product or service generated by human labor and/or machine-driven processes with value that increases over time or that enables the value increase of other assets.
3. A good (a service facilitator) supporting a range of functions such as discovery, provenance tracking, rights management, authentication, preservation and other functions associated with lifecycle management and access.
4. A public good if the product (metadata) is open, following which the services can be open.
Metadata capital is defined as an asset, a product, a good, including a public good, which enables gain through knowledge, access, and services. Under this broader interpretation, metadata capital connects to the promise of big data and the unprecedented opportunity to address real world problems in energy, health, and the environment (Greenberg & Garoufallou, 2013). Metadata is essential for using data to compare new energy approaches; track the progression and decline of a health crisis, such as the Ebola virus; or study climate change.
The biggest challenge with metadata valuation in this broader spectrum, and even with financial aspects, is the formidable task of substantiating value. In pursuing metadata capital as a financial topic, costs can be identified, or at least estimated, by adding system expenses, staff salaries distributed by hours dedicated to metadata tasks, and other incurred costs. However, determining exactly where to begin measuring cost is not an easy task. Does cost start with the metadata system design, the salary of the person who had the idea to build the metadata system, the person or team that implemented workflow design, or the cost of the code library that allowed the system to be built? Assessing social and intellectual value is even more daunting. How can we determine long-term consequences for metadata created today that allows for a major health discovery five or ten years from now?
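The cost side of this calculation can be illustrated with a minimal sketch. It assumes that cost is simply the sum of system expenses, staff salaries prorated by the hours dedicated to metadata tasks, and other incurred costs, as described above. The names `StaffEffort` and `estimate_metadata_cost`, and all figures, are hypothetical and chosen only for illustration; the sketch does not address where cost measurement should begin, nor does it attempt to capture social or intellectual value.

```python
from dataclasses import dataclass


@dataclass
class StaffEffort:
    """Hypothetical record of one person's time spent on metadata tasks."""
    hourly_rate: float      # salary expressed per hour (assumed figure)
    metadata_hours: float   # hours dedicated to metadata tasks


def estimate_metadata_cost(system_expenses, staff, other_costs=0.0):
    """Add system expenses, salaries prorated by metadata hours, and other costs."""
    staff_cost = sum(s.hourly_rate * s.metadata_hours for s in staff)
    return system_expenses + staff_cost + other_costs


# Illustrative numbers only; none of them come from the paper.
team = [
    StaffEffort(hourly_rate=40.0, metadata_hours=100.0),
    StaffEffort(hourly_rate=55.0, metadata_hours=20.0),
]
total = estimate_metadata_cost(system_expenses=5000.0, staff=team, other_costs=750.0)
print(total)
```

Even in this simplified form, the result depends entirely on which expenses are counted as inputs, which is the boundary problem raised in the questions above.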
There are more questions than answers in pinpointing or even approximating the value of metadata; it is a predicament that underscores a significant challenge and invites research. Metadata capital requires further study, including drawing upon valuation and appraisal frameworks from other disciplines. What frameworks exist for measuring value across the domains of energy, health, and the environment? How do people assess the value of knowledge, personal friendships, and professional contacts? Although there is no single answer, drawing upon valuation research from other domains can help chart metadata research directions and, in the future, demonstrate the value of metadata entrenched in data science.

5 Summary and Conclusions

Metadata, while applauded by many, has not been vigorously pursued as a research topic in data science, compared to statistical modeling, algorithm testing, data mining, and visualization. To be clear, there is metadata research; however, metadata-focused scientific and scholarly output in data science venues has not kept pace with these other topics. Articulating a problem is one of the first steps to addressing a challenge. This paper pursued initial steps to addressing this challenge in the following ways:
1) The “utilitarian nature” and “historical and traditional views” of metadata were identified as two intersecting factors that have inhibited metadata research. Having a clear understanding of barriers is important for addressing existing challenges. The pragmatic nature of metadata is paramount, and applied research ought to be shared more, rather than minimized. Additionally, fundamental approaches can interconnect with applied work, as research matures. As for traditional views, there is always an opportunity for change. The cultural shift taking place in data sharing is evidence of change. Sharing metadata research impacts in data science can attract more interest and support. Although this paper identified two key factors, there are very likely other intersecting factors to consider in future work.
2) Contextual definitions were presented for both “metadata” and “data science” to help further dialogue and research on metadata in the data science domain. As noted above, articulating a problem is a first step to addressing a challenge. A second step is to define the context. Both metadata and data science were defined as part of this goal. The definitions given draw from other published work, synthesizing common themes and ideas. Given the work pursued in this paper, defining these two terms was an obvious choice, although it is likely that providing contextual definitions for additional terms will aid future research.
3) Big metadata, smart metadata, and metadata capital were presented as part of a metadata lingua franca to help frame research in the data science research space. These concepts are not commonly discussed in data science, although they appear in research, and the examination of these concepts integrates original work in this paper, along with ideas and outcomes from other scholarship to provide grounding. Admittedly, the concepts introduced may warrant refinement, and there are other significant metadata concepts that also deserve focus. Even so, the presentation of these terms together can offer support and provide a pathway for metadata research within data science.
The immediate contribution of this work is, simply, that it may elicit response, critique, or revision. A more impactful contribution is that this work may motivate more researchers to consider the significance of metadata as a topic worthy of research within data science and the larger digital ecosystem. In a recent discussion, my colleague at Drexel University, Dr. Rosina Weber, asked me, “Can you imagine data science without metadata?” I cannot think of a statement more profound than this to motivate next steps. This question needs to be considered by anyone who applauds or dismisses the value of metadata.
Data science cannot progress without metadata research; and while an extensive range of metadata topics are important, researchers need to ask: which metadata topics are most pressing to pursue? In other words, let’s prioritize metadata research so that data science can successfully address our most significant societal challenges, and more fully contribute to the greater good. In conclusion, the framework presented in this paper, defining big metadata, smart metadata, and metadata capital, can help researchers, across multiple disciplines, prioritize next steps and collectively advance metadata research in data science.

The author has declared that no competing interests exist.

[1]
Abbasi, M., Vassilopoulou, P., & Stergioulas, L. (2017). Technology roadmap for the creative industries. Creative Industries Journal, 10(1), 40-58.

[2]
Beall, J. (2004). Dublin Core: An obituary. Library Hi Tech News, 21(8), 40-41.

[3]
Beall, J. (2014). Dublin Core is still dead. Library Hi Tech News, 31(9), 11-13.

[4]
Bruce, T.R., & Hillmann, D.I. (2004). The continuum of metadata quality: Defining, expressing, exploiting. ALA Editions. Retrieved on July 31, 2017, from .

[5]
Coleman, A.S. (2005). From cataloging to metadata: Dublin Core records for the library catalog. Cataloging & Classification Quarterly, 40(3-4), 153-181.

[6]
Contractor, D., Negi, S., Popat, K., Ikbal, S., Prasad, B., Kakaraparthy, S., Sengupta, B., Vedula, S., & Kumar, V. (2015). Smarter learning content management using the Learning Content Hub. IBM Journal of Research and Development, 59(6), 3:1-3:9.

[7]
Data Science Association (DSA). (2017). About data science. Retrieved on June 18, 2017, from .

[8]
DCMI. (2003). Special session: Smart metadata. In 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice-Metadata Research & Applications, Seattle, Washington. Retrieved on June 30, 2017, from .

[9]
Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64.

[10]
Dimitrova, N. (2004, October-December). Is it time for a moratorium on metadata? IEEE Multimedia, 11(4), 10-17.

[11]
Doctorow, C. (2001). Metacrap: Putting the torch to seven straw-men of the meta-utopia. Retrieved on June 28, 2017, from .

[12]
Dong, R., Su, F., Yang, S., Xu, L., Cheng, X., & Chen, W. (2016, September). Design and application on metadata management for information supply chain. In the 16th International Symposium on Communications and Information Technologies (ISCIT) (pp. 393-396). Washington, DC: IEEE Computer Society Press.

[13]
ERAC Secretariat. (2016). European Research Area and Innovation Committee. European Union. Brussels, February 3, 2016. Retrieved on June 18, 2017, from .

[14]
Fatima, A., Luca, C., & Wilson, G. (2014, March). New framework for semantic search engine. In 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation (UKSim) (pp. 446-451). Washington, DC: IEEE Computer Society Press.

[15]
Few, S. (2017). Visual business intelligence: A blog by Stephen Few. There is no science of data, January 23, 2017. Retrieved on July 7, 2017, from2560.

[16]
Gaitanou, P., Gergatsoulis, M., Spanoudakis, D., Bountouri, L., & Papatheodorou, C. (2016). Mapping the hierarchy of EAD to VRA Core 4.0 through CIDOC CRM. In the 10th International Conference on Metadata and Semantics Research (MTSR 2016) (pp. 193-204). Cham, Switzerland: Springer International Publishing.

[17]
Greenberg, J. (2005). Understanding metadata and metadata schemes. Cataloging & Classification Quarterly, 40(3-4), 17-36.

[18]
Greenberg, J. (2009). Theoretical considerations of lifecycle modeling: An analysis of the Dryad repository demonstrating automatic metadata propagation, inheritance, and value system adoption. Cataloging & Classification Quarterly, 47(3-4), 380-402.

[19]
Greenberg, J. (2009). Metadata and digital information. In M.J. Bates & M.N. Maack (Eds.), Encyclopedia of Library and Information Sciences (pp. 3610-3623). Boca Raton, FL: CRC Press.

[20]
Greenberg, J. (2014). Metadata capital: Raising awareness, exploring a new concept. Bulletin of the Association for Information Science and Technology, 40(4), 30-33.

[21]
Greenberg, J., & Garoufallou, E. (2013). Change and a future for metadata. In MTSR-2013: Proceedings of the 7th Metadata and Semantics Research Conference (pp. 1-5). Cham, Switzerland: Springer International Publishing.

[22]
Greenberg, J., Murillo, A.P., Ogletree, A., Boyles, R., Martin, N., & Romeo, C. (2014a). Metadata capital: Automating metadata workflows in the NIEHS Viral Vector Core Laboratory. In MTSR-2014: Proceedings of the 8th Metadata and Semantics Research Conference (pp. 1-13). Cham, Switzerland: Springer International Publishing.

[23]
Greenberg, J., Ogletree, A., Murillo, A.P., Caruso, T.P., & Huang, H. (2014b). Metadata capital: Simulating the predictive value of self-generated health information (SGHI). In 2014 IEEE International Conference on Big Data (pp. 31-36). Washington, DC: IEEE Computer Society Press.

[24]
Greenberg, J., Swauger, S., & Feinstein, E.M. (2013). Metadata capital in a data repository. In DC-2013: the International Conference on Dublin Core and Metadata Applications (pp. 140-150). Lisbon, Portugal: Dublin Core Metadata Initiative.

[25]
Greenwald, G. (2013). Edward Snowden: The whistleblower behind the NSA surveillance revelations. The Guardian. Retrieved on June 18, 2017, from

[26]
Hey, T., Tansley, S., & Tolle, K. (2009). The fourth paradigm. Redmond, Washington: Microsoft Research.

[27]
Ilevbare, I., Athanassopoulou, I., & Wooldridge, J. (2017). UK Workshop on Data Metrology and Standards. The National Physical Laboratory and partners at the University of Huddersfield and University of Cambridge. March, 2017. Retrieved on June 18, 2017, from .

[28]
Kogan, D.E., Miller, P.C., & Schobbe, G.A. (2007). Techniques to manage metadata fields for a taxonomy system. US 20080301096 A1. (Also published as WO2008150619A1). Retrieved on June 28, 2017, from .

[29]
Kunze, J. (2001). A metadata kernel for electronic permanence. In International Conference on Dublin Core and Metadata Applications, North America, DC2001. Retrieved on July 31, 2017, from .

[30]
Kunze, J., Calvert, S., DeBarry, J., Hanlon, M., Janée, G., & Sweat, S. (2016a). Persistence statements: Describing digital stickiness. California Digital Library. Retrieved on July 20, 2017, from .

[31]
Kunze, J., DeBarry, J., Hanlon, M., Scout, C., & Sweat, S. (2016b). A vocabulary for persistence. In SciDataCon 2016. September 11-13, 2016, Denver, Colorado. Retrieved on July 21, 2017, from .

[32]
Li, C., & Sugimoto, S. (2017). Provenance description of metadata vocabularies for the long-term maintenance of metadata. Journal of Data and Information Science, 2(2), 41-55.

[33]
Lytras, M.D., Sicilia, M.Á., & Cechinel, C. (2013). The value and cost of metadata (chapter I.3). In M.A. Sicilia (Ed.), Handbook of Metadata, Semantics and Ontologies (pp. 41-62). Hackensack, NJ: World Scientific Publishing Company.

[34]
Manian, D. (2011, Nov. 11). Our pointless pursuit of semantic value. Retrieved on June 29, 2017, from

[35]
Marr, B. (2014). Big data: The 5 Vs everyone must know. LinkedIn: Big data. Retrieved on June 18, 2017, from

[36]
Méndez, E., & van Hooland, S. (2013). Metadata typology and metadata uses (chapter I.2). In M.A. Sicilia (Ed.), Handbook of Metadata, Semantics and Ontologies (pp. 9-40). Hackensack, NJ: World Scientific Publishing Company.

[37]
NITRD. (2016). The Federal Big Data Research and Development Strategic Plan. The Networking and Information Technology Research and Development Program, May 2016. Retrieved on June 15, 2017, from

[38]
Oh, S.G., Yi, M., & Jang, W. (2015). Deploying linked open vocabulary (LOV) to enhance library linked data. Journal of Information Science Theory and Practice, 2(2), 6-15.

[39]
Riley, J. (2017). Understanding metadata. Bethesda, MD: NISO Press.

[40]
Shankaranarayanan, G., & Even, A. (2006). The metadata enigma. Communications of the ACM, 49(2), 88-94.

[41]
Shirky, C. (2005). Ontology is overrated: Categories, links, and tags. Economics & Culture, Media & Community. Retrieved on June 20, 2017, from .

[42]
Simon, P. (2013). Too big to ignore: The business case for big data (Vol. 72). Hoboken, NJ: John Wiley & Sons.

[43]
Singh, A. (2013). Is big data the new black gold? Wired. Retrieved on July 7, 2017, from .

[44]
Smith, A. (1776). An inquiry into the nature and causes of the wealth of nations. London: W. Strahan and T. Cadell.

[45]
Smith, K., Seligman, L., Rosenthal, A., Kurcz, C., Greer, M., Macheret, C., ... & Eckstein, A. (2014). Big metadata: The need for principled metadata management in big data ecosystems. In Proceedings of Workshop on Data Analytics in the Cloud (pp. 1-4). New York: ACM.

[46]
Stanton, J.M. (2012). Introduction to data science. Syracuse University. Retrieved on June 6, 2017, from

[47]
Sugimoto, S., Li, C., Nagamori, M., & Greenberg, J. (2016). Permanence and temporal interoperability of metadata in the linked open data environment. In Proceedings of the International Conference on Dublin Core and Metadata Applications 2016 (pp. 45-54). Retrieved on June 28, 2017, from .

[48]
Tennant, R. (2002). MARC must die. Library Journal, 127(17), 26-27.

[49]
Thyagaraju, G.S., & Kulkarni, U.P. (2011). Family aware TV program and settings recommender. International Journal of Computer Applications, 29(4), 1-18.

[50]
UK Data Archive. (2012). Research data lifecycle. Retrieved on June 15, 2017, from .

[51]
Vaduva, A., & Dittrich, K.R. (2001). Metadata management for data warehousing: Between vision and reality. In 2001 International Symposium on Database Engineering and Applications (pp. 129-135). Washington, DC: IEEE Computer Society Press.

[52]
van der Aalst, W. (2016). Process mining: Data science in action. Berlin: Springer-Heidelberg.

[53]
van Hemel, S., Paepen, B., & Engelen, J. (2003). Smart search in newspaper archives using topic maps. In Proceedings of the 7th ICCC/IFIP International Conference on Electronic Publishing. Retrieved on June 29, 2017, from .

[54]
Vlachidis, A., Binding, C., May, K., & Tudhope, D. (2013). Automatic metadata generation in an archaeological digital library: Semantic annotation of grey literature. In Computational Linguistics (pp. 187-202). Berlin: Springer-Heidelberg.

[55]
White, H., Willis, C., & Greenberg, J. (2014). HIVEing: The effect of a semantic web technology on inter-indexer consistency. Journal of Documentation, 70(3), 307-329.

[56]
Zavalina O.L.(2011.

[57]
Zeng, M.L. (2017). Smart data for digital humanities. Journal of Data and Information Science, 2(1), 1-12.

[58]
Zeng, M.L., & Qin, J. (2016). Metadata. New York: Neal-Schuman Publishers, Inc.

[59]
Zhao, X., Ma, H., Zhang, H., Tang, Y., & Fu, G. (2014, October). Metadata extraction and correction for large-scale traffic surveillance videos. In 2014 IEEE International Conference on Big Data (Big Data) (pp. 412-420). Washington, DC: IEEE Computer Society Press.

Copyright © 2023 All rights reserved Journal of Data and Information Science