Any scientist will tell you that downloading, cleaning, and organizing data takes a lot of time. Chelle Gentemann, senior scientist at Farallon Institute, estimates that about 80 percent of her and other scientists’ time is spent in these stages of research. Spending so many days and hours downloading and cleaning data slows the speed of science, says Gentemann. What if scientists could use this valuable time interpreting and writing up their findings?
The Pangeo project is helping the Earth science community analyze data in the cloud so they can spend less time downloading and managing data. The project is partially funded by NASA's Advancing Collaborative Connections for Earth System Science (ACCESS) Program, which develops technologies to effectively manage, discover, and utilize NASA’s archive of Earth observations for scientific research and applications.
Time spent downloading data to a local computer is top of mind for a lot of researchers who are using data from new Earth-observing satellites, which generate high-resolution data that can mean very large file sizes. For example, an instrument onboard NASA’s Ice, Cloud, and land Elevation Satellite-2 (ICESat-2) measures the height of the world’s ice, land, and water by sending 10,000 pulses of light to Earth every second. This mission produces about 1,000 gigabytes, or 1 terabyte (TB), of data per day. This is a lot of data to manage, but it’s still smaller than the upcoming joint NASA-Indian Space Research Organisation (ISRO) Synthetic Aperture Radar (NISAR) mission, which is slated to launch in early 2022. NISAR is expected to add as much as 85 TB of data each day to NASA’s Earth Observing System Data and Information System (EOSDIS) archives. The file sizes and volume of data available make downloading data for analysis not just inconvenient but increasingly impractical.
This is compelling scientists to change the way they work.
NASA’s Earth Science Data Systems (ESDS) Program is facilitating this change by migrating datasets to the cloud. As Earth science data and computations move into the cloud, researchers and commercial users will be able to do more than ever, enabling new science and application of large-scale analytics. The Earthdata Cloud will create opportunities for innovation, such as in machine learning and artificial intelligence.
Collaboration in the Cloud
Pangeo technologies supported NASA-sponsored hackweeks in the summer of 2019 that focused on working with ICESat-2 data. Anthony Arendt, senior research scientist at the University of Washington and principal investigator of the Pangeo project, realized that because of the size of the ICESat-2 dataset, researchers were spending a lot of time developing code just to access the data. It would be more efficient if people could share what they’ve learned, he thought, so that no one would have to start from scratch.
Pangeo’s collaborative tools allow researchers to access, process, and analyze NASA data in the commercial cloud without having to download the data. Its ecosystem of interconnected open-source tools uses software from Project Jupyter, which allows users to create and share collaborative workflows in open-source notebooks that contain code, equations, and visualizations.
“With Jupyter notebooks users can develop blocks of code and run them in real-time and get really interactive,” Arendt said. “JupyterHub takes this interactivity to the next level and deploys a server that allows multiple users to generate their own Jupyter notebook environment and be connected to a broader ecosystem of tools that are consistent across the whole network or server.”
An Ecosystem of Tools
Instead of building one computational application programming interface, or API, to do data access, computation, and visualization, Pangeo has developed an ecosystem of interconnected tools that together provide all of these services. “Under the umbrella of Pangeo, we’ve developed tools that do one thing and they do it well,” said Joseph Hamman, project scientist at the University Corporation for Atmospheric Research and co-principal investigator for the Pangeo project. “We work as a community to make sure these tools play well with each other within the modular ecosystem.”
One major component of the Pangeo ecosystem is Xarray, an open-source Python library that makes it easier to work with labeled multidimensional arrays. The Network Common Data Form, or netCDF, is a file format used for sharing scientific data in multidimensional arrays and is a standard format for EOSDIS data.
The Xarray library allows researchers to answer questions that cut across the dimensions of a dataset, without having to load the entire file that’s being analyzed. For example, if a researcher wanted to know the temperature at a one-square-kilometer location every day for a year, Xarray would use a process called “lazy loading,” which reads just the metadata of the files (rather than the entire files) until the analysis is executed. This allows researchers to analyze large files by reading only the sections they need.
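To make the idea of lazy loading concrete, here is a minimal sketch in plain Python. It is a toy illustration of the general technique, not Xarray's actual implementation: the flat binary file layout, the `LazyCube` class, and the `point_series` method are all hypothetical names invented for this example. Opening the file reads only a small header (the "metadata"), and the temperature values for a single grid cell are read from disk only when they are requested.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("<3I")  # days, rows, cols stored as little-endian uint32
VALUE = struct.Struct("<d")    # one float64 temperature value


def write_toy_file(path, days, rows, cols):
    """Write a synthetic (day, row, col) temperature cube to a flat binary file."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(days, rows, cols))
        for d in range(days):
            for r in range(rows):
                for c in range(cols):
                    # Made-up value so that each cell is easy to verify.
                    f.write(VALUE.pack(d + 0.01 * r + 0.0001 * c))


class LazyCube:
    """Hypothetical lazy reader: only the header is read when the file is opened."""

    def __init__(self, path):
        self.path = path
        self.bytes_read = 0
        with open(path, "rb") as f:
            self.days, self.rows, self.cols = HEADER.unpack(f.read(HEADER.size))
        self.bytes_read += HEADER.size

    def point_series(self, row, col):
        """Return the value at one grid cell for every day, reading only those bytes."""
        series = []
        with open(self.path, "rb") as f:
            for d in range(self.days):
                index = (d * self.rows + row) * self.cols + col
                f.seek(HEADER.size + index * VALUE.size)
                series.append(VALUE.unpack(f.read(VALUE.size))[0])
                self.bytes_read += VALUE.size
        return series


def demo():
    """Build a one-year toy file and pull out a single location's daily series."""
    days, rows, cols = 365, 20, 20
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_temperature.bin")
        write_toy_file(path, days, rows, cols)
        cube = LazyCube(path)
        series = cube.point_series(row=3, col=7)
        return series, cube.bytes_read, os.path.getsize(path)


if __name__ == "__main__":
    series, bytes_read, file_size = demo()
    print(f"read {bytes_read} of {file_size} bytes for {len(series)} daily values")
```

In this toy case the reader touches a few kilobytes of a file that is a little over a megabyte. With Xarray itself, a comparable workflow would typically open a dataset with `xarray.open_dataset` and then select the location of interest by its coordinate labels, leaving the library to decide which parts of the file to read.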
During the ICESat-2 hackweek, researchers collaboratively developed an open-source library called icepyx. The library allows researchers to quickly access ICESat-2 data with just a couple of lines of code, as opposed to the 50 or 100 lines people were writing themselves to access it. Pangeo is prototyping tools to help scientists transition their work once the ICESat-2 data are moved to the cloud.
These community tools are designed for the new era of data analysis and have the potential to change the workflow of many Earth scientists. “A lot of people may not realize what’s possible with their data until they know these tools exist,” said Arendt. Instead of every scientist having to teach themselves how to access data in the cloud, these resources can streamline the process, so that scientists can focus on what’s important.
So far the Pangeo project has been focused on tools for processing large datasets, but many researchers are now interested in using the Pangeo ecosystem for machine learning and developing training datasets on new Earth science data. Joseph Hamman was recently awarded an ACCESS grant to build on the Pangeo ecosystem to develop high-level tools that serve a broad range of machine learning applications.
The Pangeo team is also working to integrate the same ecosystem of tools on top of NASA’s Common Metadata Repository (CMR). CMR is a spatial and temporal metadata registry that stores metadata from a variety of science disciplines, providing a uniform view of NASA’s diverse data holdings. Pangeo’s integration with CMR will enable researchers to access and explore NASA datasets stored in the cloud with an unprecedented level of interactivity via Jupyter notebooks.
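As a rough illustration of what a spatial and temporal metadata query against CMR can look like, the sketch below builds a granule search URL using only the Python standard library. It is not Pangeo's integration, which the article does not describe in detail. The helper name `build_granule_query`, the choice of the ICESat-2 product `ATL06`, and the date range and bounding box are assumptions chosen for the example; the code only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Public CMR granule search endpoint (JSON response format).
CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.json"


def build_granule_query(short_name, start, end, bbox, page_size=10):
    """Build a CMR granule search URL filtered by product, time range, and area.

    bbox is (west, south, east, north) in decimal degrees.
    This helper is hypothetical and exists only for illustration.
    """
    west, south, east, north = bbox
    params = {
        "short_name": short_name,
        "temporal": f"{start},{end}",
        "bounding_box": f"{west},{south},{east},{north}",
        "page_size": page_size,
    }
    return f"{CMR_GRANULE_SEARCH}?{urlencode(params)}"


# Example values are assumptions: one month of ATL06 over a small box in Greenland.
url = build_granule_query(
    "ATL06",
    "2019-06-01T00:00:00Z",
    "2019-06-30T23:59:59Z",
    (-50.0, 68.0, -48.0, 70.0),
)

if __name__ == "__main__":
    print(url)
```

The response to such a query is metadata describing matching files, not the data itself, which is what lets a notebook find the relevant granules before reading any of them.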