NASA’s Earth Science Data and Information System (ESDIS) Project and its constituent Distributed Active Archive Centers (DAACs) continue to migrate data in NASA’s Earth Observing System Data and Information System (EOSDIS) collection from physical servers into the cloud. This effort is called Cumulus and has been detailed in earlier articles in this series. The benefits of this evolution to worldwide EOSDIS data users are significant and include the ability to work with more data more efficiently than ever before.
A key element in this process is determining user requirements to gain a better understanding of how users will interact with data in the cloud, the types of analyses they intend to conduct, and options for architecting the EOSDIS cloud environment to best facilitate data use. This is an important undertaking since the EOSDIS data collection is about to become much larger.
From about 27.5 petabytes (PB) at the end of Fiscal Year 2018, the volume of the collection is forecast to grow to as much as 250 PB by 2025. This growth is due to the extremely high volume of data expected from upcoming missions such as the Surface Water and Ocean Topography (SWOT) mission, a joint effort of NASA and the French, Canadian, and United Kingdom space agencies, and the joint NASA-Indian Space Research Organisation Synthetic Aperture Radar (NISAR) mission, both of which are currently scheduled for launch in 2021. NISAR, for example, is expected to generate approximately 3 terabytes (TB), or about 3,000 gigabytes (GB), of Level 0 data each day. For comparison, the five instruments aboard the Terra Earth observing satellite generate about 195 GB of Level 0 data each day, according to NASA’s Earth Observing System. For most data users, the current practice of downloading data onto an individual machine for analysis simply won’t work for collections this large, which have earned the name “Big Data.”
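To put these daily volumes in perspective, the arithmetic can be worked through in a few lines. This is a rough sketch only: the 100 megabit-per-second link speed is an assumed figure chosen for illustration and does not come from ESDIS, and decimal units (1 TB = 1,000 GB) are used to match the text.

```python
# Daily Level 0 volumes quoted in the article (decimal units).
GB_PER_TB = 1000
nisar_gb_per_day = 3 * GB_PER_TB   # about 3 TB per day
terra_gb_per_day = 195             # all five Terra instruments combined

# How many times more Level 0 data NISAR is expected to produce per day.
ratio = nisar_gb_per_day / terra_gb_per_day

# ASSUMPTION: a sustained 100 Mbps connection, chosen only for illustration.
ASSUMED_LINK_MBPS = 100
bits_per_day = nisar_gb_per_day * 1e9 * 8
download_hours = bits_per_day / (ASSUMED_LINK_MBPS * 1e6) / 3600

print(f"NISAR/Terra daily volume ratio: {ratio:.1f}x")
print(f"Hours to download one day of NISAR Level 0 data: {download_hours:.1f}")
```

Under that assumed link speed, downloading a single day of NISAR Level 0 data would take longer than the day it took to collect, which is the practical reason bulk download does not scale.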
A primary objective of hosting EOSDIS data in the cloud is to “level the playing field” so anyone can work with these Big Data collections. The ideal user experience (UX) allows data users to work next to EOSDIS data in the cloud, meaning that a user can simply point their analysis software to a data location in the cloud and begin analyzing without the need to transfer or download data. After completing their analyses, a user can view or download the results. An integral part of facilitating this is preprocessing these data into Analysis Ready Data (ARD), which enables end-users to begin working with data immediately.
This would be a straightforward process if all EOSDIS data users interacted with data the same way. However, the millions of individual EOSDIS data users will interact with cloud-based data in different ways depending on their research and analysis requirements as well as their individual level of experience working with EOSDIS data. Some will conduct all their work inside the cloud, some will download data for analysis outside the cloud, and some will work in a hybridized fashion partially inside and outside the cloud. The ESDIS Project must be aware of these uses and have data architecture and systems ready to support these interactions.
Specifically, ESDIS and the DAACs are collecting end-user input to determine:
- What kind of analyses will be conducted?
- What do data users consider ARD and how much preprocessing can ESDIS do?
- Where will users analyze these data—in the cloud, outside of the cloud, or somewhere in-between?
- What support products (such as primers, webpages, webinars, or tutorials) will users need?
Sources for this information include the annual EOSDIS American Customer Satisfaction Index (ACSI) surveys, feedback from webinars, various early-adopter programs, interaction with data users at applications workshops and science meetings, and input from DAAC User Working Groups.
In general, four primary types of users, each with a distinct UX, are likely to use EOSDIS data (see illustration at right). Users who have their own algorithms (End-User Algorithm [left red box in illustration]) may require only data preprocessing, such as subsetting or reformatting, and can then work with data inside or outside the cloud, depending on the amount of data they are using.
Users who do not have their own algorithms or who are new to EOSDIS data may just want a usable answer from a data query without having to worry about conducting their own analysis on the data. For these users, Analysis-As-A-Service and Visualization-As-A-Service can be offered, both of which will almost always result in a smaller amount of data displayed as some sort of statistical model or as imagery. An example of Visualization-As-A-Service is the Giovanni data visualization application created by the Goddard Earth Sciences Data and Information Services Center (GES DISC), which is an EOSDIS DAAC. These users (End-User Interpretation and Data Exploration [right two red boxes in illustration]) can work with data inside or outside the cloud.
A fourth user type is less traditional and will conduct their work completely inside the cloud using cloud-optimized data analysis applications (End-User Cloud-Native Analysis [red box in center of illustration]). For these users, EOSDIS would restructure ARD into a cloud-based storage form for working with cloud-native algorithms (an Analytics Optimized Data Store). These structures can include highly scalable file systems and databases or simple data cubes. The overall objective is to re-aggregate the data to facilitate rapid data processing and analysis inside the cloud.
Knowing user requirements also helps in designing the best architecture for storing, organizing, and accessing these data, and ESDIS and the DAACs are looking at different approaches to address this. Currently, the only cloud system approved by the NASA Office of the Chief Information Officer is Amazon Web Services (AWS), and Cumulus is optimized to use AWS. AWS provides data storage through its Simple Storage Service, or S3. S3 buckets are inexpensive and designed to store large volumes of data. The catch is that S3 is an object store, not a file system, so an application cannot simply open a file and seek within it as it would on a local disk. This means that it can be difficult to get a specific segment of bytes from a file.
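Object stores such as S3 do accept HTTP byte-range requests, but the client has to know in advance which bytes it wants. The sketch below shows only the mechanics of such a request: `range_header` builds a standard HTTP `Range` header, and `serve_range` is a hypothetical stand-in for the object store that operates on in-memory bytes so the example runs without network access or AWS credentials.

```python
def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def serve_range(obj: bytes, header: dict) -> bytes:
    """Hypothetical stand-in for an object store answering a range request."""
    spec = header["Range"].split("=", 1)[1]
    first, last = (int(part) for part in spec.split("-"))
    return obj[first:last + 1]


# A fake 1,024-byte "object" whose byte values repeat 0..255.
stored_object = bytes(range(256)) * 4

header = range_header(offset=300, length=8)
segment = serve_range(stored_object, header)

print(header)          # the header a client would send
print(list(segment))   # the eight bytes returned
```

The request itself is simple; the hard part is knowing that the bytes of interest start at offset 300, which is information a file format normally keeps inside the file.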
One solution to enable efficient in-place S3 data analysis is to use an Open-source Project for a Network Data Access Protocol, or OPeNDAP, server as an intermediary to retrieve file segments from the S3 bucket. As its name implies, OPeNDAP is an open-source protocol that provides a discipline-neutral means of requesting and providing data, and allows end-users to access the data they require using applications they possess and with which they are familiar. The OPeNDAP server stores a map (“Byte layout map” in illustration) of how the S3 bucket is organized, so it knows which bytes to retrieve from the file stored in the S3 bucket based on what the client’s application is requesting.
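The idea of a byte layout map can be illustrated with a small, self-contained sketch. This is not OPeNDAP server code: the variable names, offsets, and binary layout are all hypothetical, and the "object" is a few packed numbers held in memory. It only shows how a stored map lets a server turn a request for a named variable into a read of specific bytes.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteRange:
    offset: int  # position of the first byte within the stored object
    length: int  # number of bytes to read


# Hypothetical object: four little-endian floats followed by two ints.
stored_object = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0) + struct.pack("<2i", 10, 20)

# Hypothetical byte layout map: variable name -> where its bytes live.
layout_map = {
    "sea_surface_height": ByteRange(offset=0, length=16),
    "time": ByteRange(offset=16, length=8),
}


def fetch_variable(obj: bytes, layout: dict, name: str) -> bytes:
    """Return only the bytes belonging to the requested variable."""
    rng = layout[name]
    return obj[rng.offset:rng.offset + rng.length]


time_values = struct.unpack("<2i", fetch_variable(stored_object, layout_map, "time"))
height_values = struct.unpack(
    "<4f", fetch_variable(stored_object, layout_map, "sea_surface_height")
)
print(time_values, height_values)
```

Because the map is consulted on the server side, the client can keep asking for variables by name while only the required bytes are read from storage.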
Another potential architecture solution being developed uses an environment called Pangeo. Pangeo is a community promoting open, reproducible, and scalable science, and the Pangeo Project serves as a coordination point between scientists, software, and computing infrastructure. The Pangeo mission is to cultivate an ecosystem in which the next generation of open-source analysis tools for Big Data geoscience datasets can be developed, distributed, and sustained. The Pangeo software ecosystem includes open-source tools such as xarray, Dask, and Jupyter.
For EOSDIS cloud efforts, the Pangeo team proposes integrating open-source tools from EOSDIS DAACs and existing EOSDIS-wide data discovery applications, such as the Common Metadata Repository (CMR) and Global Imagery Browse Services (GIBS), around a central JupyterHub (see illustration at right). Python-based open-source packages from the PyData ecosystem (such as Dask and xarray) will be used to facilitate scientific data analysis within the cloud, with data analysis accelerated using a Python-based application programming interface (API). A user can issue API commands from a laptop to trigger processing in the cloud where the data are stored. Rather than downloading data in bulk for processing on a laptop, only final results (for example, figures for scientific publications or subsets of imagery) are downloaded. This, in turn, should reduce data use and data egress costs for the user.
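The pattern described here, in which work is split into chunks, each chunk is processed close to the data, and only a small result is returned, can be sketched with the Python standard library alone. The example below is a conceptual illustration and is not Dask, xarray, or Pangeo code; the function names and the in-memory list standing in for a large dataset are assumptions made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(chunk):
    """Reduce one chunk to a tiny partial result: (sum, count)."""
    return sum(chunk), len(chunk)


def chunked_mean(values, chunk_size, workers=4):
    """Compute a mean by combining per-chunk partial results."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Each worker handles one chunk; only (sum, count) pairs come back.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(chunk_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Stand-in for a large dataset that would normally stay in the cloud.
data = list(range(1, 1001))
result = chunked_mean(data, chunk_size=100)
print(result)
```

In a real deployment the workers would run next to the stored data and the laptop would receive only the final number, which is what keeps transfer volumes and egress costs low.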
The Pangeo Project is part of NASA’s Earth Science Data Systems’ Advancing Collaborative Connections for Earth System Science (ACCESS) 2017 Program. The ACCESS Program develops and implements technologies to effectively manage, discover, and utilize NASA’s archive of Earth observations for scientific research and applications, and complements work by ESDIS and the DAACs in various areas. The five projects selected for ACCESS 2017 are designed to improve and expand the overall use of NASA’s Earth science data by leveraging modern techniques for discovering, managing, and analyzing large and complex Earth science datasets. Technologies developed through these projects may be incorporated into EOSDIS cloud efforts.
As the EOSDIS data collection continues to become larger and more complex, the cloud will create new avenues for research using Big Data in this collection. Having basic information ahead of time about general user requirements enables EOSDIS and the DAACs to begin putting data in the right formats and structures and to start developing the resources to help users efficiently use these data. This information also will play a critical role in developing a cloud-optimized architecture to best facilitate interdisciplinary work and research in the cloud next to EOSDIS data. Stay tuned!