NASA has one of the world’s largest repositories of Earth science data, and over the next six years it will get much, much larger. By 2025, several new high-data-volume missions will be launched, requiring the Earth Observing System Data and Information System (EOSDIS) archive to grow almost eight-fold, from 32 petabytes to 247 petabytes.
Two missions in particular, NASA-Indian Space Research Organisation Synthetic Aperture Radar (NISAR) and Surface Water and Ocean Topography (SWOT), which are scheduled to be launched in 2021 or 2022, will generate 86 and 20 terabytes of data each day, respectively. With the impending arrival of these new missions, the need to effectively archive and process significantly larger data volumes will require new data management technologies and architectures.
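The figures above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration, not an official projection. It assumes decimal units (1 petabyte = 1,000 terabytes) and that each mission delivers its quoted daily volume every day of the year. The function names are made up for this example.

```python
# Figures quoted in the article.
ARCHIVE_START_PB = 32      # current EOSDIS archive size
ARCHIVE_2025_PB = 247      # projected archive size by 2025
NISAR_TB_PER_DAY = 86
SWOT_TB_PER_DAY = 20

# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000


def growth_factor(start_pb, end_pb):
    """Return how many times larger the archive becomes."""
    return end_pb / start_pb


def yearly_volume_pb(tb_per_day, days=365):
    """Convert a constant daily rate in TB to a yearly volume in PB."""
    return tb_per_day * days / TB_PER_PB


if __name__ == "__main__":
    factor = growth_factor(ARCHIVE_START_PB, ARCHIVE_2025_PB)
    combined = yearly_volume_pb(NISAR_TB_PER_DAY + SWOT_TB_PER_DAY)
    print(f"Archive growth: {factor:.1f}x")
    print(f"NISAR + SWOT per year: {combined:.1f} PB")
```

Under these assumptions, the two missions together would add more data in a single year than the entire 32-petabyte archive holds today.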
To meet these needs, multiple components of EOSDIS are being migrated to a commercial cloud environment via Amazon Web Services (AWS), and a few components have already made the transition. NASA's Global Hydrology Resource Center Distributed Active Archive Center (GHRC DAAC), located at NASA's Marshall Space Flight Center, is the first of NASA's twelve DAACs to migrate its data to AWS. You can read more background on the Earthdata Cloud Evolution page.
NASA’s Earth Science Data and Information System (ESDIS) Project, which manages EOSDIS, has several years of experience managing large datasets in the cloud and used this experience, along with input from data users and the DAACs, in the development of the Earthdata Cloud. Moving data to the cloud does more than save money on hardware and data storage: it also enables researchers to run analyses that would be impractical on a local computer.
In a recent Data Chat, EOSDIS System Architect Dr. Christopher Lynnes said, "What we’re trying to accomplish with the Earthdata Cloud migration is to support an analysis-in-place user experience that will enable users to do their work in the cloud (do complete dataset analyses if they want to) without them having to schlep the data all over the place just to get it down to their computer."
ESDIS has gone to great lengths to ensure data are analysis-ready and to make cloud computing attractive to researchers. However, in some cases, researchers may feel the need to download data from the cloud to external locations, which is called data egress. Data egress is free to users, as part of NASA’s open data, services, and software policies. Egress traffic is managed by ESDIS to ensure researchers get access to the data they need.
ESDIS used several sources of information to model how much data egress to expect, including the experience the project has with cloud data storage and feedback solicited from data users and the DAACs. ESDIS has also invested in resources to improve data access and cloud computation.
How Was Data Egress Calculated?
In 2016, before the Earthdata Cloud evolution began, the Alaska Satellite Facility DAAC (ASF DAAC), in partnership with NASA’s Jet Propulsion Laboratory, began prototyping what a commercial cloud environment could look like, using synthetic aperture radar (SAR) data from the European Space Agency’s Sentinel-1 mission. ASF DAAC’s prototype system managed all aspects of the data lifecycle for Sentinel-1 SAR data in AWS, including data ingest, storage, distribution, and on-demand product generation. This project, called Getting Ready for NISAR (GRFN), provided ESDIS with insight into how data similar to NISAR’s was used in a commercial cloud and how much data egress occurred.
ESDIS used data from GRFN and 25 years of experience with data egress from the DAACs to model data egress.
The criteria for selecting datasets for the commercial cloud were developed to optimize data use in the cloud and minimize data egress. Some datasets are a better fit for cloud-based distribution than others. Datasets with large file sizes are good candidates for cloud migration because they can be prohibitively time-intensive to download and costly to store. In addition to dataset size, popularity and value to research were part of the selection criteria.
Because the current impetus for cloud development involves the upcoming SWOT and NISAR missions, datasets that would be useful to the communities that will use SWOT and NISAR data were prioritized. The “data lake” concept, where datasets are co-located and stored in one central repository, allows for advanced computing, such as machine learning, on multiple related datasets.
"We used statistics on past data egress to model how much to expect in the future, and we planned for even more egress than we modeled," said Katie Baynes, system architect at ESDIS. ESDIS also put in place a data egress traffic-shaping measure in case egress exceeds expectations at any given point in time.
Minimizing Data Infrastructure Costs
Traditional on-premises computing systems are connected to the network through cable and hardware and are restricted by purchased bandwidth that can become saturated and slow, requiring new hardware to meet increased network demand. Adding that hardware is typically costly and time-consuming. The Earthdata Cloud avoids this and provides more flexibility based on users’ demand.
EOSDIS and NASA pay only for the storage and services actually used. Custom egress management capabilities provide tightly coupled network controls based on demand and budget, in order to prevent wasting money, as can be the case with a physical network architecture. The Earthdata Cloud also avoids costs associated with supporting and replacing computer hardware and software as they age.
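The article does not describe how ESDIS implements its egress controls. To give a sense of how demand-based traffic shaping can work in general, here is a minimal sketch of a token bucket, a common generic technique. It is not the ESDIS mechanism. The class name, the rate, and the capacity values are all hypothetical.

```python
class TokenBucket:
    """Generic token-bucket limiter (illustrative; not the ESDIS mechanism).

    Tokens represent bytes that may be sent. They refill at a steady rate
    up to a fixed capacity, which allows short bursts but caps the
    long-run average egress rate.
    """

    def __init__(self, rate_bytes_per_s, capacity_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes   # start with a full bucket
        self.last = 0.0                # time of the previous update, in seconds

    def allow(self, nbytes, now):
        """Return True and spend tokens if nbytes may be sent at time `now`."""
        elapsed = now - self.last
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


if __name__ == "__main__":
    # Hypothetical numbers: 10 bytes/s sustained, bursts of up to 100 bytes.
    bucket = TokenBucket(rate_bytes_per_s=10, capacity_bytes=100)
    for t, size in [(0, 80), (1, 50), (3, 50)]:
        print(t, size, bucket.allow(size, now=t))
```

Time is passed in explicitly rather than read from a clock so that the behavior is easy to follow and test. A request that is refused can be retried later, once enough tokens have accumulated.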
ESDIS participates in AWS special pricing programs for egress and storage costs, significantly lowering overall costs to store and access data.
There are several benefits to using AWS for the Earthdata Cloud, including flexibility, co-located data, and improved speed of data computation and egress. But the main benefit of cloud migration is that it enables the next generation of Earth science research.
Resources for Cloud Computation
In preparation for the cloud migration, NASA's Earth Science Data Systems (ESDS) Program has been supporting the cultivation of resources so that researchers can take full advantage of data available in the cloud, especially using new technologies such as machine learning.
In 2017, NASA's Advancing Collaborative Connections for Earth System Science (ACCESS) Program, part of ESDS, supported five projects to improve and expand the overall use of large and complex Earth science datasets.
One of the projects developed as part of ACCESS is the Pangeo Project, which facilitates community-driven tools for analysis of Earth observation data in the cloud. Pangeo developed a repository of tutorial materials on GitHub for one of the first datasets available on the Earthdata Cloud, Multi-Scale Ultra High Resolution Sea Surface Temperature (MUR SST).
EOSDIS provides resources to help users directly access subsets of data stored on the cloud, using the Open-source Project for a Network Data Access Protocol (OPeNDAP) software. OPeNDAP allows users to explore and download only those parts of the data in which they are interested, using the tools with which they are comfortable.
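As a rough illustration of what "only those parts of the data" means in practice, an OPeNDAP (DAP2) request appends a constraint expression to a dataset URL, naming a variable followed by one bracketed index range per dimension. The sketch below only builds such a URL and does not contact any server. The host, file path, variable name, and index ranges are hypothetical placeholders, and the helper functions are invented for this example.

```python
def hyperslab(start, stop, stride=1):
    """Format one DAP2 index range as [start:stride:stop] (stop is inclusive)."""
    return f"[{start}:{stride}:{stop}]"


def subset_url(base_url, variable, slabs, suffix=".ascii"):
    """Build a DAP2 subset request URL.

    base_url : dataset URL on an OPeNDAP server (hypothetical here)
    variable : name of the variable to subset
    slabs    : one (start, stop) or (start, stop, stride) tuple per dimension
    suffix   : response type, e.g. ".ascii" for text or ".dods" for binary
    """
    constraint = variable + "".join(hyperslab(*s) for s in slabs)
    return f"{base_url}{suffix}?{constraint}"


if __name__ == "__main__":
    # Hypothetical dataset URL and variable name, for illustration only.
    url = subset_url(
        "https://opendap.example.org/data/granule.nc",
        "analysed_sst",
        [(0, 0), (1000, 1100), (2000, 2200)],  # time, lat, lon index ranges
    )
    print(url)
```

Some HTTP clients require the square brackets to be percent-encoded. Libraries that speak OPeNDAP directly usually build these requests for the user, so this form is mainly useful for understanding what is sent over the wire.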
EOSDIS has also developed webinar tutorials for Earth scientists on working in the cloud. Tools available for researchers to access cloud data are explained in more detail in the article EOSDIS Data in the Cloud: User Requirements.
NASA Technical Report: Archive Management of NASA Earth Observation Data to Support Cloud Analysis