ESDS Program

Pangeo ML: Open Source Tools and Pipelines for Scalable Machine Learning Using NASA Earth Observation Data

Principal Investigator: Joseph Hamman, CarbonPlan

This project will develop machine learning (ML) applications and open source technologies that meet specific computational needs of researchers and applied science practitioners. Over three years, building on the Pangeo ecosystem, the team will develop new high-level tools that serve a broad range of ML applications. The primary focus is the extract-transform-load (ETL) pattern, which is ubiquitous in ML workflows yet takes a functionally distinct form in the geosciences.

The project is motivated by the idea that geoscientific ML workflows are unique in many respects (dimensionality, types of data transformations, and data volume) and plans to provide new tooling and resources to streamline the process of using Earth observation (EO) data in deep-learning frameworks (e.g., TensorFlow). The team will build on the robust open-source scientific Python ecosystem already familiar to scientists, the Pangeo Project’s recent success in developing scalable cloud-based data analysis environments, and the team’s collective experience providing guidance on ML best practices for geoscience (e.g., EarthML).

The primary focus will be on the preprocessing steps required to develop ML pipelines that use EO data, filling in key missing steps between the software libraries commonly used in geoscientific exploratory data analysis (e.g., Xarray, Dask, Intake) and the libraries commonly used for deep learning (e.g., TensorFlow, PyTorch). Planned development will make it easier to combine datasets from multiple sources and will provide high-level data pipeline tools for efficiently loading and processing batches of data for ML training and inference.
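To make the batching step concrete, the following is a minimal sketch of the kind of preprocessing described above: cutting a gridded (time, y, x) field into fixed-size patches and grouping them into batches that a deep-learning framework could consume. It is not the project's tooling. The function name `iter_patches`, its parameters, and the synthetic array are assumptions made for illustration, and a plain NumPy array stands in for what would typically be an Xarray or Dask-backed EO dataset.

```python
import numpy as np


def iter_patches(field, patch_size, batch_size):
    """Yield batches of non-overlapping square patches from a (time, y, x) array.

    Hypothetical helper for illustration only. Each batch has shape
    (n, patch_size, patch_size) with n <= batch_size. Grid cells at the
    edges that do not fill a whole patch are dropped.
    """
    n_time, n_y, n_x = field.shape
    batch = []
    for t in range(n_time):
        for y0 in range(0, n_y - patch_size + 1, patch_size):
            for x0 in range(0, n_x - patch_size + 1, patch_size):
                batch.append(field[t, y0:y0 + patch_size, x0:x0 + patch_size])
                if len(batch) == batch_size:
                    yield np.stack(batch)
                    batch = []
    if batch:  # final batch, possibly smaller than batch_size
        yield np.stack(batch)


# Synthetic stand-in for an EO variable: 4 time steps on a 10 x 12 grid.
field = np.arange(4 * 10 * 12, dtype=np.float32).reshape(4, 10, 12)
batches = list(iter_patches(field, patch_size=4, batch_size=5))
```

Each yielded array could then be converted to a framework tensor for training or inference. A generator is used here so that only one batch is held in memory at a time, which matters when the underlying dataset is far larger than this toy array.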

The development of these tools is motivated by pressing research questions that seek to integrate NASA EO data in cutting-edge machine learning applications. The project includes two science applications to help focus efforts: one motivated by NASA's Surface Water and Ocean Topography (SWOT) mission and another that aims to improve the skill of macroscale hydrologic models. The tools themselves will provide much-needed functionality to the open-source software ecosystem and enable ML applications across the geosciences.

Furthermore, these developments will provide key improvements in our ability to share, reproduce, and scale ML workflows in the geosciences. The project will provide clear links between the ETL workflow tools developed under this effort and other ACCESS Program elements, including data analytics in the cloud, analytics-optimized data stores, and high-value open-source science tools.

The project will enable real-world machine-learning applications that draw on EO data from multiple NASA missions and instruments, e.g., Surface Water and Ocean Topography (SWOT), Jason, the Moderate Resolution Imaging Spectroradiometer (MODIS), and Landsat, as well as various in-situ and model-generated datasets.

Last Updated: Apr 15, 2021