With the impending arrival of new, high-data-volume missions, effectively archiving and processing significantly larger data volumes will require new data management technologies and architectures that are more cost-effective, flexible, and scalable than traditional on-premises systems. To meet these needs, the Earth Science Data Systems (ESDS) Program has adopted a strategic vision to develop and operate multiple components of the Earth Observing System Data and Information System (EOSDIS) in a commercial cloud environment.
EOSDIS provides end-to-end capabilities for managing NASA Earth science data from satellites, aircraft, in-situ measurements, and other sources. EOSDIS data are now migrating into the Earthdata Cloud, the first and currently the largest cloud project at NASA. This migration benefits users by giving them new ways to access NASA’s collection of Earth science datasets, improves the efficiency of data systems operations, increases user autonomy, maximizes flexibility, and offers shared services and controls. The Earthdata Cloud is a key component of the ESDS Transform to Open Science (TOPS) program, which provides the visibility, advocacy, and community resources to support and enable the shift to open science. TOPS, in turn, is part of NASA’s Open-Source Science Initiative, which promotes the open sharing of software, data, and knowledge (algorithms, papers, documents, ancillary information) as early as possible in the scientific process.
Data in the Cloud
The Earthdata Cloud architecture went operational in July 2019 and, soon thereafter, key EOSDIS services, such as NASA's Common Metadata Repository (CMR) and Earthdata Search, were deployed within it. Since then, efforts to increase the amount of data and services available in the cloud have continued. For example, NASA’s Global Imagery Browse Services (GIBS) is transitioning to the cloud and, as of January 2022, 5% of its total imagery layers (20% by volume), 5% of its ingest "handlers" (i.e., the code used to pull imagery from DAACs in an Intelligence Community Directive-compliant manner), and 80% of its on-premises imagery archive have been transferred to the cloud. (The system is expected to be 100% in the cloud by Fall/Winter 2022.) Further, many of NASA’s EOSDIS Distributed Active Archive Centers (DAACs) have made considerable progress moving the data archives they manage into the cloud. As of February 2022, EOSDIS DAACs have migrated more than 1 petabyte (PB) into the Earthdata Cloud, with more data being added weekly.
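One way to see which collections have reached the cloud is to ask the CMR. The Python sketch below only builds a search URL and makes no network request. It assumes the CMR's public search endpoint at `https://cmr.earthdata.nasa.gov/search` and its `cloud_hosted` collection parameter, neither of which is described in this article, so check the CMR API documentation before relying on them. The function name and the example keyword are hypothetical.

```python
from urllib.parse import urlencode

# Assumed public CMR search endpoint; it is not stated in this article.
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_cloud_collection_query(keyword, page_size=10):
    """Build a CMR collection-search URL limited to cloud-hosted collections."""
    params = {
        "keyword": keyword,
        "cloud_hosted": "true",  # assumed CMR parameter for cloud-hosted data
        "page_size": page_size,
    }
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params)}"


# Hypothetical keyword, used only to show the shape of the resulting URL.
url = build_cloud_collection_query("sea surface temperature")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a JSON listing of matching collections if the assumptions above hold.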
Benefits to Data Users and the Scientific Community
Moving EOSDIS data to the cloud has numerous benefits for data users and EOSDIS, including:
- Easy access to data: Data users will be able to access data directly in the cloud, removing the need to download large volumes of data. (Note: Users will still have the ability to download data if they choose.)
- Rapid deployment: Users can bring their algorithms and processing software to the cloud and work directly with the data in the cloud, simplifying procurement and hardware support while expediting science discovery.
- Scalability: The size and use of the archive can expand easily and rapidly as needed.
- Flexibility: Mission needs can dictate options for selecting operating systems, programming languages, databases, and other criteria to enable the best use of mission data.
- Reduced redundancy: The use of a common infrastructure with cloud-native services will reduce redundant tools and services, enable sharing, and enforce the use of community standards as well as uniform policies and processes.
- Cost effectiveness: EOSDIS and NASA pay only for the storage and services used. Along with scalability benefits, this allows the amount of storage or services to be continually adjusted to ensure that data and services are effectively provided at the lowest possible cost to NASA and EOSDIS. Note: Under NASA’s full and open data policy, all NASA data will continue to be free to access and download. This means that users will be able to employ these cloud-based EOSDIS-provided services to discover, search, access, and download data at no cost. However, users who wish to store data in their own Amazon Web Services (AWS) cloud instance or cloud storage are responsible for covering these costs. For more information on this topic, see “Understanding and Managing Costs in the AWS Cloud,” one of several tutorials on how to get started in the AWS cloud.
Earthdata Cloud also benefits the scientific community that uses NASA Earth science data for research. By making NASA data, algorithmic code, and metadata available in the cloud, the scientific processes of NASA researchers will become more transparent and their results more reproducible, which in turn lends clarity and validity to the scientific process. In addition, the use of standardized software and code makes it easier for new users to learn to interact with the data and become more involved in the scientific process.
As of September 2021, Earthdata Cloud holds more than 59 PB of data. According to estimates from ESDS, that amount is expected to grow considerably in the coming years to more than 148 PB in 2023, 205 PB in 2024, and 250 PB in 2025. As the volume of data in the EOSDIS archive continues to grow, the EOSDIS archive’s data ingest rate is expected to increase dramatically along with it. By the end of the decade, the volume of data in the EOSDIS archive is expected to surpass 320 PB.
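To put these projections in perspective, the short Python sketch below computes the growth implied between the stated milestones. The volumes and years come from the figures above. The helper name `annualized_growth` and the use of a compound annual rate are illustrative choices, not an ESDS method, and treating the September 2021 figure as the 2021 milestone is an assumption.

```python
# Projected archive volumes in petabytes (PB), taken from the text above.
# Treating the September 2021 figure as the "2021" milestone is an assumption.
projected_pb = {2021: 59, 2023: 148, 2024: 205, 2025: 250}


def annualized_growth(start_pb, end_pb, years):
    """Return the compound annual growth rate between two volumes."""
    return (end_pb / start_pb) ** (1 / years) - 1


# Compare each milestone with the next one and record the implied yearly rate.
milestones = sorted(projected_pb.items())
growth_by_period = {}
for (y0, v0), (y1, v1) in zip(milestones, milestones[1:]):
    rate = annualized_growth(v0, v1, y1 - y0)
    growth_by_period[(y0, y1)] = rate
    print(f"{y0} -> {y1}: {v0} PB -> {v1} PB, about {rate:.0%} per year")
```

Under these assumptions, the relative growth rate slows from one period to the next even as the absolute volume keeps rising.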
This anticipated growth in both the data ingest rate and the overall archive volume poses an array of challenges for distributing and analyzing the data currently stored and disseminated through physical servers on-premises at EOSDIS DAACs. Therefore, for Earthdata Cloud to meet users’ needs, NASA's Earth Science Data and Information System (ESDIS) is working to ensure it provides services in several key areas, including:
- Data acquisition from data providers (such as NASA science teams).
- Data ingest: The system must support multi-mission and multi-discipline data ingest.
- Data validation and processing.
- Data archive: The system must preserve and protect NASA Earth observation data.
- Data distribution, including disaster recovery: The system must support distribution of data, subsetting, and visualization, and must be adaptable to future technologies.
- Metadata: The harvest, creation, and publication of dataset metadata to the CMR.
- Data management: The system must meet the information lifecycle needs, in both development and execution, of NASA mission-based Earth science datasets.
- Metrics: Publication of metrics to the ESDIS Metrics System (EMS), which collects and organizes various metrics from the DAACs and other data providers.
For example, NASA implemented Cumulus, which provides a range of functionality in the cloud, including data acquisition from providers (such as NASA science teams); data ingest, including validation and processing; the harvest, creation, and publication of dataset metadata to the CMR; the storage and distribution of data, including disaster recovery; and publication of metrics to the EMS, which collects and organizes various metrics from the DAACs and other data providers.
Further, Cumulus is integrated with the NASA-Compliant General Application Platform (NGAP), a custom-built, cloud-optimized platform that provides highly flexible cloud-native infrastructure, NASA-compliant IT security controls, networking services, and business cost control in Amazon Web Services (AWS).
Moving the collective data archive from the DAACs into the cloud puts NASA Earth observation data “close to compute,” giving users improved access to data, the ability to use large datasets more efficiently, and the ability to conduct a broader range of research. This move will not change existing methods of user interaction with EOSDIS data, but it does require new methods of accessing NASA data that differ from those used with on-premises platforms. Further, as more datasets migrate to the cloud, the DAACs will continue to serve as the gateways to EOSDIS data holdings and provide a wide range of support services for users.
Earthdata Cloud Evolution
As the volume of data from NASA missions increases, so will the need for data management and archive technologies that are adaptable and scalable. Earthdata Cloud possesses these attributes, which will serve NASA’s forthcoming high-data-volume missions, such as the Surface Water and Ocean Topography (SWOT) mission, which is scheduled for launch in 2022, and the NASA-Indian Space Research Organization Synthetic Aperture Radar (NISAR) mission, which is expected to launch in 2023. NISAR is expected to add 85 TB of data to the EOSDIS archive each day and as much as 140 PB of data over its scheduled three-year lifespan. To accommodate this volume, NASA's Alaska Satellite Facility DAAC (ASF DAAC) worked collaboratively with NASA’s Jet Propulsion Laboratory to test and prototype ways of archiving and distributing NISAR data using the commercial cloud. This project, known as Getting Ready for NISAR (GRFN), is now complete and has successfully demonstrated key components for efficiently handling NISAR volumes in a commercial cloud. ASF DAAC is also archiving and distributing Sentinel-1 data from the European Commission’s Copernicus Program into NASA-managed cloud accounts.
These missions are presenting NASA with unrivaled opportunities to further develop and test systems and architectures for providing improved data management and user access to the unprecedented volumes of data future Earth science missions are expected to generate.
Achieving the Potential of Cloud Computing through Partnerships
To capitalize on the benefits that cloud computing offers to the scientific community, ESDS’s Interagency Implementation and Advanced Concepts Team (IMPACT) has executed Space Act Agreements (SAAs), legal agreements between NASA and another party to work collaboratively on a project or technology, with private companies, including AWS, Google, and Microsoft. In the process, NASA is building a network of technical experts whose knowledge it can leverage for the benefit of Earth science data users. These agreements have led to the following:
- NASA’s agreement with AWS has resulted in collaborations to improve the discovery, access, and use of NASA science datasets; the creation of data storage and staging areas to facilitate the community evaluation of data products; and workshops to expand the use of cloud-computing resources.
- NASA’s collaboration with Google has led to investigations into the transfer, storage, and value of making large volumes of NASA science datasets available on the Google Cloud and Google Earth Engine; making NASA Earth science data accessible to users via the Google Cloud Public Dataset search engine and Earth Engine Catalog; and growing NASA’s artificial intelligence (AI) capabilities through joint efforts with NASA’s Frontier Development Lab (FDL) Challenges and SpaceML projects.
- NASA’s partnership with Microsoft has launched investigations into the value of making high-value NASA science datasets available on Azure; cost and performance evaluations of data storage methods and technologies and support analytics; the exploration of strategies to enable cloud-based analytics to promote science in the cloud; and analysis of the approaches to build and share training datasets for AI at scale.
In addition, NASA is considering agreements with additional companies (Esri, IBM, and Nvidia) to accelerate the development, delivery, and adoption of AI to further NASA’s science research and applications; explore new opportunities in cloud technologies that enable and accelerate open science; and collaborate on effective joint solutions that promote open science and a better experience for those who use NASA data.
Continuing NASA’s Tradition of Free and Open Data
NASA Earth science data have been freely and openly available to all users since EOSDIS became operational in 1994. Under NASA’s full and open data policy, all NASA mission data (along with the algorithms, metadata, and documentation associated with these data) must be freely available to the public. This means that anyone, anywhere in the world, can access the more than 71 PB of NASA Earth science data without restriction. Further, since 2015, the data systems software developed through NASA awards and research and technology grants is available as open-source software, which means the software’s source code is freely available for inspection, modification, and enhancement. This allows software and code to be available more broadly and shared collaboratively with diverse groups to accelerate software development.
The concept of open science builds on the philosophy and spirit of the open-source software movement, and endeavors to create a collaborative culture enabled by technology that empowers the open sharing of data, information, and knowledge within the scientific community and the public to accelerate scientific research and understanding. As the term implies, open science aims to make scientific findings as transparent as possible by making all elements of a claimed discovery readily accessible, which enables results to be repeated and validated. Now, a new scientific paradigm—open-source science—is emerging from the open science concept.
Open-source science takes these notions of openness and transparency even further by applying them to the entire scientific process. Its goal is to accelerate discovery by conducting science openly, from project initiation through implementation. The result is the inclusion of a wider, more diverse community in the scientific process as close to the start of research activities as possible, which engenders trust in the scientific process. It also represents a cultural shift that encourages collaboration and participation among practitioners of diverse backgrounds, including scientific discipline, gender, ethnicity, and expertise. Open-source science is more equitable science.
Open-source science is a foundational objective of NASA’s Science Mission Directorate (SMD) and SMD's ESDS Program. Along with the wide dissemination and use of openly available Earth-observing data, the SMD promotes and facilitates the full and open sharing of all metadata, documentation, models, images, and research results achieved using these data and makes available the source code used to generate, manipulate, and analyze them. Open-source science will also be a key attribute of NASA’s Earth System Observatory (ESO), a new set of Earth-focused satellite missions that will work in tandem to provide a holistic view of Earth and collect key information to guide efforts related to climate change, natural hazard mitigation, fighting forest fires, and improving real-time agricultural processes.
As part of its commitment to open-source science, NASA will make all ESO mission data, code, and supporting documents available as early in the mission life cycle as feasible. Given the expected high volume of ESO data, these data will be stored in the Earthdata Cloud and tools will be provided for working with these data directly in the cloud environment. This strategy will expand the ability of global research teams to collaboratively work with and conduct research using more NASA Earth science data than ever before, and the result will be the availability of these data to a broader, more diverse global community of users with the attendant increase in opportunities for scientific discovery.