Improving Metadata, Improving Research

The machine learning team at IMPACT trained a machine learning model to process dataset abstracts. The model suggests relevant metadata keywords by leveraging the ability of Word2Vec models to embed the meaning of words into numeric vectors.

Metadata. It’s critical to our knowledge-based society, but it’s something people rarely, if ever, think about. Like rebar inside concrete, metadata provides underlying structure that increases our ability to locate and access relevant data. Here’s an example. The dataset IRS 1C LIS3 Standard Products contains over ten years of detailed data. However, unless you are already familiar with IRS 1C, how can you know if that extensive collection of data is relevant to the knowledge you are seeking to discover?

Enter that overlooked workhorse, metadata. The dataset IRS 1C LIS3 Standard Products is manually tagged with the metadata keywords: Earth science, land surface, surface radiative properties, erosion sedimentation, and geomorphic landform processes. It is this metadata, not the actual data contained in the dataset, that allows search engines such as NASA’s Earthdata Search client to connect you with this potentially valuable collection of data. Given the increasing importance of data to our society, robust and accurate metadata across multiple parameters is essential.

Image
Architecture of IMPACT's GCMD Keyword Tagger tool.


Given the importance of metadata and the subjectivity that arises from human curation, how can we efficiently verify the accuracy of metadata? If the keywords listed above for IRS 1C LIS3 Standard Products are inaccurate, the dataset will appear in the wrong search results, impeding data discovery and research efforts. To address this need, the machine learning team at IMPACT trained a machine learning model to process dataset abstracts. The model suggests relevant metadata keywords by leveraging the ability of Word2Vec models to embed the meaning of words into numeric vectors. This approach uses machine learning techniques to provide subject matter experts with automated keyword suggestions that complement the hand curation process.
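
To make the idea concrete, here is a minimal Python sketch of how embedding-based keyword suggestion can work in principle. It is not IMPACT's implementation: the three-dimensional vectors are made-up toy values standing in for trained Word2Vec embeddings, and the function names (embed, cosine, suggest_keywords) and the candidate keyword list are hypothetical. The sketch assumes an abstract can be represented by averaging the vectors of its words and that candidate keywords can be ranked by cosine similarity to that average.

```python
import math

# Toy 3-dimensional "embeddings" (assumption: made-up values for illustration).
# Real Word2Vec vectors are learned from a corpus and have many more dimensions.
TOY_VECTORS = {
    "erosion":       [0.9, 0.1, 0.0],
    "sedimentation": [0.8, 0.2, 0.1],
    "landform":      [0.7, 0.3, 0.0],
    "surface":       [0.6, 0.3, 0.2],
    "cloud":         [0.1, 0.9, 0.1],
    "aerosol":       [0.0, 0.8, 0.3],
    "ocean":         [0.1, 0.2, 0.9],
    "salinity":      [0.0, 0.1, 0.95],
}

# Hypothetical candidate keywords; a real tool would draw these from GCMD.
CANDIDATE_KEYWORDS = ["erosion sedimentation", "cloud aerosol", "ocean salinity"]


def embed(text, vectors):
    """Average the vectors of the known words in `text`; unknown words are skipped."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_keywords(abstract, keywords, vectors, top_n=2):
    """Rank candidate keywords by similarity to the abstract's averaged vector."""
    abstract_vec = embed(abstract, vectors)
    if abstract_vec is None:
        return []
    scored = []
    for keyword in keywords:
        keyword_vec = embed(keyword, vectors)
        if keyword_vec is not None:
            scored.append((cosine(abstract_vec, keyword_vec), keyword))
    scored.sort(reverse=True)  # highest similarity first
    return [keyword for _, keyword in scored[:top_n]]


abstract = "Imagery of surface erosion and landform change."
print(suggest_keywords(abstract, CANDIDATE_KEYWORDS, TOY_VECTORS))
```

With these toy values, the erosion-related keyword ranks first for the sample abstract, because the abstract's averaged vector points in nearly the same direction as that keyword's vector. In practice, the suggestions would still go to a subject matter expert for review rather than being applied automatically.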

The research underlying this effort produced a valuable insight: machine learning models produce more accurate results when they are trained on a corpus aligned with the subject matter of the datasets being tagged. Muthukumaran Ramasubramanian, the lead developer on the project, explains:

"There is value in collecting domain-specific embeddings, especially for domain-related tasks such as scientific keyword recommendation. The word embeddings we made from 22,000 documents we collected from AGU did better than embeddings built from Wikipedia articles with 6 times the vocabulary size."

Image
The word embedding models used by the GCMD Keyword Tagger achieve higher accuracies.

This research effort has produced not only the GCMD Keyword Tagger, a tool that allows dataset curators to select metadata keywords from NASA’s Global Change Master Directory (GCMD) set of keywords, but also a conference paper at the recent IEEE SoutheastCon 2020: “ES2Vec: Earth Science Metadata Keyword Assignment using Domain-Specific Word Embeddings.”

The alpha release of the GCMD Keyword Tagger tool is currently being tested by an IMPACT metadata validation team. Access to the alpha release is available at the website below. Keywords can be generated from specific collection-level descriptions in NASA’s CMR or from a long-form description supplied by the user.

Learn more about the GCMD Keyword Classifier.

Check out Muthukumaran's LinkedIn profile.
 
