Principal Investigator: Ziheng Sun, George Mason University
At present, AI/ML (artificial intelligence/machine learning) research in Earth science lacks efficient workflow management, which makes ML workflows difficult to share, replicate, reuse, and scale. Most of the time, scientists manage their ML workflows on their own. Because ML models are uncertain, complex, and varied, researchers working alone struggle to track and control their workflows, especially when big Earth data is involved.
To make AI-based Earth science workflows more shareable, replicable, reusable, and scalable, this project will further develop GeoWeaver, an existing open-source workflow management system (WfMS), into a stable operational platform for managing, sharing, replicating, and reusing ML workflows within NASA's Earth Observing System Data and Information System (EOSDIS).
GeoWeaver is a web-based, multi-user WfMS that aims to help AI practitioners in the Earth system sciences integrate their experiment programs and computing resources into ad hoc automated workflows. Existing shell scripts, command lines, Python code, Jupyter notebooks, and Google Earth Engine code can all be managed in GeoWeaver. Users can export their workflows and rerun them on a GeoWeaver instance installed elsewhere, as long as that instance can connect to the required storage and computational resources, such as Google Cloud, Amazon EC2, Amazon S3, and NASA data services.
A federated catalog will be built in GeoWeaver to archive, index, publish, and search all the public workflows created in GeoWeaver. The workflows can then integrate resources across different facilities as long as they are accessible. EOSDIS facilities will be able to deploy their own GeoWeaver instances to take advantage of internal computational and storage resources (e.g., the data storage in each Distributed Active Archive Center).
Meanwhile, workflows can also be transferred to and reused at another facility via the GeoWeaver federated catalog. GeoWeaver will let scientists integrate and reuse existing EOSDIS resources, public and commercial computing resources, and their own resources anywhere, anytime.
To demonstrate the usability of GeoWeaver, this project will deploy a public GeoWeaver instance in the cloud and build a hybrid air quality workflow that combines a conventional air quality model, the Community Multiscale Air Quality (CMAQ) model, with AI models to monitor and predict the air quality index in California. The workflow will use Moderate Resolution Imaging Spectroradiometer (MODIS) and Visible Infrared Imaging Radiometer Suite (VIIRS) products. The use case will showcase the feasibility of GeoWeaver in real-world, big-data-based ML applications.