Automating Metadata Review


IMPACT has released pyQuARC, an automatic assessment tool for Earth observation metadata built to improve metadata quality. The tool was developed as part of the Analysis and Review of the CMR (ARC) project, which is tasked with assessing NASA’s metadata records in the Common Metadata Repository (CMR) for correctness, completeness, and consistency. pyQuARC incorporates a robust set of metadata quality criteria developed by the ARC team and is the culmination of the lessons the team learned in automating metadata quality assessment to the greatest extent possible.

Ensuring high-quality metadata records is essential to scientific research, as ARC team lead Jeanné le Roux explains:

"Metadata management can have a direct effect on a scientist’s experience in finding, accessing, and using data. Since metadata is the connection point between users and data, metadata that is well maintained helps lower barriers to data use."

High-quality metadata that includes a direct data access point and ample contextual information about the data (such as user documentation and compatible software) helps scientists get to the actual science faster, rather than spending time hunting for information and resources. Rich metadata also allows for more complex and niche searches across ever-increasing data volumes.

Metadata records important information about the data, such as the date and time a camera captures when a picture is taken. The metadata, rather than the data itself, is what is indexed by online data catalogs and other applications that connect users to data. Inaccurate metadata can connect users with data that does not, in fact, match their search criteria; incomplete metadata can make data difficult or impossible to find. Given the importance of high-quality metadata, it must be regularly assessed and updated as needed.

pyQuARC is a Python package that streamlines metadata quality assessment by performing automated quality checks. It employs a metadata quality assessment framework that specifies a common set of assessment criteria. In addition to basic validation checks (e.g., adherence to the metadata schema, use of controlled vocabularies, and link checking), pyQuARC flags opportunities to improve or add contextual metadata information that helps users connect to, access, and better understand relevant data products. pyQuARC also ensures that information common to both data product metadata and the corresponding file-level metadata is consistent and compatible.
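As a rough illustration of the kinds of checks described above, the following Python sketch runs a completeness check, a controlled-vocabulary check, and a product-versus-file consistency check over metadata records held as dictionaries. It is not pyQuARC's implementation or API: the field names (`ShortName`, `DataAccessURL`, `DataFormat`, `TemporalStart`, `TemporalEnd`), the vocabulary list, and every function name are assumptions made up for this example.

```python
from datetime import datetime

# Hypothetical controlled vocabulary; real keyword lists are far larger.
ALLOWED_FORMATS = {"NetCDF-4", "HDF5", "GeoTIFF"}
DATE_FORMAT = "%Y-%m-%d"


def check_required(record, field):
    """Completeness: the field must be present and non-empty."""
    if not record.get(field):
        return f"{field}: missing or empty"
    return None


def check_vocabulary(record, field, allowed):
    """Correctness: a present value must come from the controlled list."""
    value = record.get(field)
    if value is not None and value not in allowed:
        return f"{field}: '{value}' is not in the controlled vocabulary"
    return None


def check_granule_within_collection(collection, granule):
    """Consistency: the file-level time range must sit inside the product's."""
    c_start = datetime.strptime(collection["TemporalStart"], DATE_FORMAT)
    c_end = datetime.strptime(collection["TemporalEnd"], DATE_FORMAT)
    g_start = datetime.strptime(granule["TemporalStart"], DATE_FORMAT)
    g_end = datetime.strptime(granule["TemporalEnd"], DATE_FORMAT)
    if g_start < c_start or g_end > c_end:
        return "granule temporal extent falls outside the collection's extent"
    return None


def assess(collection, granule):
    """Run every check and return only the findings that flagged a problem."""
    findings = [
        check_required(collection, "ShortName"),
        check_required(collection, "DataAccessURL"),
        check_vocabulary(collection, "DataFormat", ALLOWED_FORMATS),
        check_granule_within_collection(collection, granule),
    ]
    return [finding for finding in findings if finding is not None]


# Made-up records: no access URL, a non-standard format name,
# and a file-level record dated after the product's end date.
collection = {
    "ShortName": "EXAMPLE_L2",
    "DataFormat": "netcdf",
    "TemporalStart": "2020-01-01",
    "TemporalEnd": "2020-12-31",
}
granule = {"TemporalStart": "2021-01-05", "TemporalEnd": "2021-01-06"}

for finding in assess(collection, granule):
    print(finding)
```

Each check returns either a short finding or nothing, so new checks can be added to the list in `assess` without changing how results are collected.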


As open source software, pyQuARC is designed to be customizable so that it can accommodate quality checks unique to different needs. For instance, pyQuARC can be configured to support other metadata standards in use by the Earth science community. New checks can be added, and existing checks can be modified as needed. Moreover, pyQuARC provides a framework that can be customized to check any type of metadata, not just metadata describing Earth science data products. Other science disciplines, or any industry with data that needs to be cataloged, can adopt the concepts contained in pyQuARC to assist with metadata management.

Slesa Adhikari, a pyQuARC developer, notes the importance of such an extensible toolset:

"As the world is moving to more data-driven approaches, it is important that users have access to the data that they are looking for. To make data more searchable and accurate based on the search criteria, the data metadata needs to be valid, accurate, and have all the contextual information. pyQuARC is significant as it helps in this assessment of metadata and thus in making the metadata better."

pyQuARC is highly customizable. Users can add custom rules for their metadata by modifying two JSON files, without touching the code base at all. Because pyQuARC is open source, everyone has access to the backend: they can extend its functionality with their own code or integrate it into their own applications. Anyone looking for any kind of data benefits from improved accessibility and searchability of data, which is the final goal of pyQuARC.

To adapt pyQuARC for your own metadata validation, first install pyQuARC. Then create two files, a rules override JSON for your custom rules and a check messages override JSON for their error messages, and use them as specified in the “Install/User’s Guide” section of the README.md.
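The idea of customizing checks through two JSON files rather than through code can be sketched as follows. This is a generic illustration of the override pattern, not pyQuARC's actual file schema or API: the file names, the keys (`field`, `check`, `severity`), and the rule IDs are all hypothetical, and the README remains the authority on the real format.

```python
import json
import tempfile
from pathlib import Path

# Built-in defaults a tool might ship with (hypothetical content).
DEFAULT_RULES = {
    "title_present": {"field": "Title", "check": "required", "severity": "error"},
}
DEFAULT_MESSAGES = {
    "title_present": "The record has no title.",
}


def load_with_overrides(defaults, override_path):
    """Return the defaults updated with entries from a user-supplied JSON file."""
    merged = dict(defaults)
    merged.update(json.loads(Path(override_path).read_text(encoding="utf-8")))
    return merged


def run_rules(record, rules, messages):
    """Apply each 'required' rule and return (severity, message) pairs."""
    results = []
    for rule_id, rule in rules.items():
        if rule["check"] == "required" and not record.get(rule["field"]):
            results.append((rule["severity"], messages.get(rule_id, rule_id)))
    return results


# Write the two override files that a user would normally author by hand.
workdir = Path(tempfile.mkdtemp())
rules_file = workdir / "rules_override.json"
messages_file = workdir / "messages_override.json"
rules_file.write_text(json.dumps({
    "doi_present": {"field": "DOI", "check": "required", "severity": "warning"},
}), encoding="utf-8")
messages_file.write_text(json.dumps({
    "doi_present": "Consider adding a DOI so the data can be cited.",
}), encoding="utf-8")

rules = load_with_overrides(DEFAULT_RULES, rules_file)
messages = load_with_overrides(DEFAULT_MESSAGES, messages_file)

# The record has a title but no DOI, so only the custom rule fires.
results = run_rules({"Title": "Example product"}, rules, messages)
for severity, message in results:
    print(f"{severity}: {message}")
```

Keeping rules and their messages in separate files means that wording can be revised without touching rule logic, and both can change without editing the checking code.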

GitHub repo: description and details, as well as a technical user guide

User guide: how to add a new rule, and a list of available checks
