Pangeo ML: Open Source Tools and Pipelines for Scalable Machine Learning Using NASA Earth Observation Data

Principal Investigator: Dr. Max Jones (CarbonPlan)

The Pangeo-ML project has built on the foundations of the Pangeo Project to develop high-level tools that serve a broad range of machine learning (ML) applications, improving the workflows of researchers and data scientists working with complex multi-dimensional datasets. The Pangeo-ML team has both supported the open-source scientific Python ecosystem and filled in key missing steps between the software libraries commonly used in geoscientific exploratory data analysis and deep learning.

Project Objectives

Expand the interoperability of the scientific Python ecosystem to simplify construction of preprocessing pipelines for ML applications
Develop new software interfaces between Xarray and ML libraries
Expand the open-source documentation for ML applications in the geosciences

Update

The Pangeo-ML project has developed ML applications and open source software that improve the ML workflows of researchers and data scientists working with complex multi-dimensional datasets. At the core of the project is the idea that geoscientific ML workflows are unique in many respects (dimensionality, types of data transformations, and data volume) and require new tooling and resources to streamline the process of using Earth observation (EO) data in deep-learning frameworks (e.g., TensorFlow).

In addition to supporting the continued development and maintenance of the Pangeo software ecosystem, the Pangeo-ML project has filled in key missing steps between the software libraries commonly used in geoscientific exploratory data analysis and deep learning.

The Pangeo community has demonstrated that a collection of open-source scientific Python software can run highly parallel, cloud-native workflows and unlock scientific insights interactively on datasets larger than 10 terabytes (TB). However, these pipelines still typically require data to be transformed into cloud-native storage formats first. To remove that step, the Pangeo-ML team has contributed targeted developments to the Kerchunk, Filesystem Spec, Dask, and Intake projects that allow cloud-native processing at scale on data stored in archival file formats (e.g., NetCDF, HDF5, GeoTIFF).
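The idea of processing archival files in place can be illustrated with a toy reference index: a mapping from chunk keys to byte ranges inside an existing file, so that a reader fetches only the bytes it needs without converting the file. The sketch below is loosely modeled on that concept and does not use the Kerchunk API. The file layout, the `memory://archive.bin` location, the `temperature/N` keys, and the `read_chunk` helper are all assumptions made for illustration.

```python
import io
import struct

# Stand-in for an archival file: a header followed by two binary chunks of
# four little-endian float32 values each. Real NetCDF/HDF5 layouts are far
# more complex; this is only a toy (assumed layout).
HEADER = b"FAKE-ARCHIVE-HDR"
chunk0 = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
chunk1 = struct.pack("<4f", 5.0, 6.0, 7.0, 8.0)
archive = io.BytesIO(HEADER + chunk0 + chunk1)

# Reference index: chunk key -> [location, byte offset, byte length].
# The location string is hypothetical and is not dereferenced here.
refs = {
    "temperature/0": ["memory://archive.bin", len(HEADER), len(chunk0)],
    "temperature/1": [
        "memory://archive.bin",
        len(HEADER) + len(chunk0),
        len(chunk1),
    ],
}


def read_chunk(key, fileobj=archive, index=refs):
    """Read one chunk by seeking straight to its byte range."""
    _location, offset, length = index[key]
    fileobj.seek(offset)
    raw = fileobj.read(length)
    # Each float32 value occupies 4 bytes.
    return list(struct.unpack(f"<{length // 4}f", raw))


# Only the second chunk's bytes are read; the header and first chunk are skipped.
second = read_chunk("temperature/1")
```

Against object storage, the seek-and-read step would typically become a ranged request for the same offset and length, which is what lets chunks be fetched in parallel from the original file.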

One of the key objectives of the Pangeo-ML project is to simplify data preprocessing pipelines through improved interoperability within the scientific Python ecosystem. By improving the integration between the HoloViz suite of tools (e.g., hvPlot, GeoViews, HoloViews, Datashader, SpatialPandas) and the broader scientific Python ecosystem (e.g., Zarr, Xarray, Rioxarray), the Pangeo-ML team has simplified the interactive exploration of Earth science and ML datasets. Further integration improvements between Xarray, Dask, and the Pytroll Satpy and Pyresample libraries have simplified preprocessing pipelines that rely on common tasks such as geographic resampling.

Another key objective of the project is to develop new software interfaces between Xarray and machine learning libraries. Our team has been developing the Xbatcher library to simplify batch data generation from Xarray datasets and support direct integration with popular machine learning frameworks like TensorFlow and PyTorch through lazy batch generation, parallel loading, caching, and data loaders.
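As a rough illustration of what lazy batch generation over a labeled, multi-dimensional dataset involves, the sketch below yields one index selector per batch without touching any data. It is a simplified stand-in written with the standard library only, not Xbatcher's implementation or API. The `batch_selectors` function, its parameters, and the example dimension names are hypothetical.

```python
from itertools import product


def batch_selectors(dim_sizes, batch_dims):
    """Lazily yield one {dim: slice} selector per non-overlapping batch.

    dim_sizes:  full length of each named dimension, e.g. {"time": 6, "x": 4}
    batch_dims: requested batch length along each batched dimension
    Trailing partial batches are dropped in this sketch.
    """
    names = list(batch_dims)
    # Start offsets along each batched dimension, stepping by the batch length.
    starts_per_dim = [
        range(0, dim_sizes[name] - batch_dims[name] + 1, batch_dims[name])
        for name in names
    ]
    # The generator computes each selector only when it is requested.
    for starts in product(*starts_per_dim):
        yield {
            name: slice(start, start + batch_dims[name])
            for name, start in zip(names, starts)
        }


selectors = batch_selectors({"time": 6, "x": 4}, {"time": 3, "x": 2})
first = next(selectors)      # selector for the first batch only
remaining = list(selectors)  # the rest are produced on demand
```

With an Xarray dataset `ds`, each selector could be applied as `ds.isel(**selector)`, so data are loaded only for the batch currently being consumed by the training loop.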

Alongside the development of the Pangeo, Pytroll, and HoloViz suites of tools, the Pangeo-ML team has been developing machine learning applications that motivate and guide tool development. Specific application examples include a biomass mapping ML workflow using Landsat and ICESat/GLAS, a hydrometeorological data assimilation project using FluxNet, a climate downscaling application, and an estimation of ocean surface currents from remote sensing observations.

Major Accomplishments

Contributed to or led the development of software releases for numerous open-source software projects, including new libraries like Xbatcher and Kerchunk and foundational packages like Xarray and Dask
Developed machine learning applications to motivate and guide tool development, such as an open-source climate downscaling pipeline and sea surface current estimation from remote sensing observations
Engaged and supported the open-source community through expanded documentation, tutorials, talks, and workshops related to scalable machine learning workflows

For More Information

Pangeo

Publications and Presentations

Durant, M. (2023). Variable chunking in Zarr (presentation). ESIP Cloud Computing Cluster.

Jones, M., et al. (2023). Pangeo ML: Open Source Tools and Pipelines for Scalable Machine Learning Using NASA Earth Observation Data (poster). NASA Earth Science Data Systems Working Group (ESDSWG).

Jones, M., et al. (2023). Pangeo ML: Open Source Tools and Pipelines for Scalable Machine Learning Using NASA Earth Observation Data (presentation). NASA ESDSWG.

Jones, M., et al. (2023). Xbatcher - A Python Package That Simplifies Feeding Xarray Data Objects to Machine Learning Libraries (poster). Earth Science Information Partners (ESIP) January Meeting.

Jones, M., et al. (2023). Xbatcher - A Python Package That Simplifies Feeding Xarray Data Objects to Machine Learning Libraries (presentation). American Meteorological Society (AMS) Annual Meeting.

Sun, Z., Sandoval, L., …, Bednar, J.A., et al. (2022). A review of Earth Artificial Intelligence. Computers & Geosciences, 159. doi:10.1016/j.cageo.2022.105034

Chegwidden, O., et al. (2022). Global downscaled climate projections from CMIP6: open data and tools for climate risk applications (presentation). AGU Fall Meeting.

Durant, M. (2022). Access all the data in one cloud-friendly way with kerchunk! (presentation). ESDIS Technology Spotlight Webinar Series.

Hagen, R., et al. (2022). Building open source downscaling pipelines for the cloud. CarbonPlan.

Chegwidden, O., et al. (2022). Open data and tools for multiple methods of global climate downscaling. CarbonPlan.

Cherian, D., Banihirwe, A., et al. (2022). Xarray: Friendly, Interactive, and Scalable Scientific Data Analysis (tutorial). SciPy.

Bednar, J., et al. (2022). hvPlot and Holoviz: Visualize all your data easily, from notebooks to dashboards (tutorial). SciPy.

Liquet, M. (2022). Easily build interactive and static maps with hvPlot (presentation). GeoPython.

Durant, M. (2022). All you need is Zarr (presentation). PyData Global.

Kennedy, M. (2022). Pangeo data ecosystem (Abernathey and Hamman). Talk Python Podcast.

Macey, T. (2022). Building A Community And Technology Stack For Scalable Big Data Geoscience At Pangeo - Episode 358 (Abernathey and Hamman). Podcast.__init__.

Chiao, C., et al. (2022). Using LiDAR to estimate forest biomass. CarbonPlan.

Stern, C., Abernathey, R., et al. (2021). Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data Production. Frontiers in Climate, 3:782909. doi:10.3389/fclim.2021.782909

Hamman, J. (2021). Pangeo-ML: Open science tools for machine learning in the geosciences (Ignite-talk). ESIP Summer Meeting.

Hamman, J. (2021). Transforming weather research with cloud computing (presentation). NOAA SAB / Decadal Priorities for Weather Research.

Hamman, J. (2021). What’s all the fuss about Zarr? (presentation). NCAR/CISL Seminar.

Abernathey, R.P., et al. (2021). Cloud-Native Repositories for Big Scientific Data. Computing in Science and Engineering, 23(2): 26-35. doi:10.1109/MCSE.2021.3059437

Hamman, J., et al. (2021). Pangeo ML: Open Source Tools and Pipelines for Scalable Machine Learning Using NASA Earth Observation Data (poster). NASA ESDSWG.

Hamman, J., et al. (2021). Pangeo ML: Open Source Tools and Pipelines for Scalable Machine Learning Using NASA Earth Observation Data (presentation). NASA ESDSWG.

Bednar, J.A. (2021). Using hvPlot for interactive plotting of Xarray, Pandas, and Dask data in Jupyter (presentation). ESIP Summer Meeting.

Bednar, J.A., et al. (2021). HoloViz – Visualize all your data easily, from notebooks to dashboards (presentation). SciPy.

Bednar, J.A. (2021). Datashader for visualizing geospatial data (presentation). Workshop on Scaling Geospatial Vector Data. Dask Distributed Summit.

Stevens, J.-L. (2022). Seeing the needle AND the haystack: single-datapoint selection for billion-point datasets (presentation). PyCon DE / PyData Berlin.

Rudiger, P. & Liquet, M. (2022). Easily build interactive plots and apps with hvPlot (presentation). PyCon DE / PyData Berlin.

Rudiger, P. (2021). Build polished, data-driven applications directly from your Pandas or Xarray pipelines (presentation). PyData Global.
