The anticipated growth in the volume of data in NASA's Earth Observing System Data and Information System (EOSDIS) poses new challenges for the Distributed Active Archive Centers (DAACs) tasked with archiving, curating, and distributing them.
To address these challenges, the DAACs are moving their data holdings from on-premises archives to the cloud. The transfer of EOSDIS data to the cloud is currently taking place but, when complete, the result will be vast collections of Earth observation data that are “close to compute,” meaning users will find it easier to discover, access, and manage data, as well as analyze large datasets more efficiently, thereby enabling a broader range of research.
However, as Physical Oceanography DAAC (PO.DAAC) Project Scientist Dr. Jinbo Wang acknowledges, moving DAAC-held data to the cloud means that members of the Earth observation science community will have to move to the cloud with it.
“After unlocking the potential of cloud computing, the next level is moving the code to the data,” said Wang. “Here the majority of the science community is far behind. There are some forerunners or early adopters, such as the members of the Pangeo community, who work to develop software and infrastructure to enable Big Data geoscience research. But there is still a long way to go to educate the majority of the scientific community and bring them into the cloud computing field.”
To do that, Wang and his PO.DAAC team and science colleagues have started a coding club to promote the use of cloud computing for scientific research. Participants include scientists covering the Sentinel-6 Michael Freilich satellite mission, the Estimating the Circulation and Climate of the Ocean (ECCO) project, the Gravity Recovery and Climate Experiment Follow-On (GRACE-FO) satellite mission, the Surface Water and Ocean Topography (SWOT) satellite mission, the Salinity and Stratification at the Sea Ice Edge (SASSIE) airborne mission, and the Group for High Resolution Sea Surface Temperature (GHRSST) project, as well as members of the PO.DAAC engineering team and NASA's Jet Propulsion Laboratory (JPL) Artificial Intelligence and Analytics Group.
The club, which began meeting in March of this year, assembles weekly, and organizers regularly invite cloud experts to answer members’ questions and provide solutions to members’ computing problems. However, the club’s meetings aren’t led by technical experts.
“The PO.DAAC project science team is setting the agenda because we wanted to learn from the science community’s perspective,” said Wang. “So the experience is more relatable to and shareable with the community, and the outcome is more practical and useful for scientific applications.”
In fact, ensuring the meetings are practically useful is Wang’s primary objective.
“Every day we use our laptops and never stop to think how the laptop works,” said Wang. “We should aim for the same level of comfort and familiarity with cloud computing so that scientists can focus on their work rather than admiring the technology. Once reaching that stage, we can say that the community has adopted cloud computing.”
To reach that level of comfort and familiarity, the coding club helps members of the PO.DAAC science community understand the basics of Amazon Web Services (AWS), the cloud platform NASA uses to ingest, archive, distribute, and manage the Earth science data in NASA’s EOSDIS collection. It also helps researchers determine if using the cloud is the best approach for their work.
“To begin, there needs to be a rigorous analysis of cloud computing and its costs from the scientific perspective,” said Wang. “Is it worth it? Is it cost effective to embrace this? No matter how wonderful cloud computing is, we will not get everybody to use the cloud. Our goal is to share our own experience and inform the community about the advantages and disadvantages in using the cloud and help them make informed decisions.”
If researchers decide it is worth using the cloud, they then learn the practices associated with getting started in the cloud, such as learning how to set up a collaborative workspace.
“These are basic steps for a cloud engineer, but for scientists it's a giant leap,” said Wang. “We were often lost in the jargon and the acronyms and desperately needed a translator when listening to cloud experts; we do not know enough about the cloud infrastructure to troubleshoot very elementary problems by ourselves or even ask the right questions. Many universities and institutions now start to provide cloud support for their community, but a lot of the researchers that I interact with are still wandering to find the start line. All they need is a gentle nudge toward that line, which might be just three steps away.”
A Successful Case
That was certainly true for Dr. Ian Fenty, Principal Investigator with the ECCO Consortium’s team at JPL, who, despite feeling comfortable around most computer systems and being very competent with complex nonlinear physics and numerical systems, didn’t know where or how to begin.
“Operating in the cloud environment is not that hard once you know how to do it,” he said. “It's just that no one knew what those steps were. The purpose and motivation of the coding club was to figure things out together.”
Fenty described his understanding of cloud computing prior to joining the club as “zero.” However, he felt compelled to participate in the coding club because he wanted to make use of the latest big datasets from NASA, and because a complete collection of the latest ECCO products is distributed by PO.DAAC.
“The ways we’ve been managing and grappling with large datasets are no longer tenable,” he said. “Datasets are much larger and cannot be manipulated on single workstations or even small clusters. They're growing so large that even on supercomputers it’s become a real challenge to process and analyze them. It was rumored that with cloud resources we might be able to get our hands into the data again in ways that lead us to new insights.”
Participating in the coding club provided Fenty and his colleagues with the information they needed to use those big datasets, and it removed some of the uncertainty about cloud computing in the process.
“Within NASA there is significant interest in archiving and distributing data via the cloud, but there are questions about accessing the data, if cloud resources can be used to analyze the data in the cloud itself, and so on,” he said. “None of us really knew the answers to these questions, but the coding club allowed us to get a sense of what working in the cloud actually looked like and took the mystery out of a lot of nebulous concepts.”
As for more tangible benefits, learning to use the cloud has made working with large datasets much easier, and in Fenty’s case, significantly faster.
“One of the projects I work on involves a global ocean climate model, the output of which is itself a large dataset,” he said. “The process of transforming the raw model output into a format that's useful for other scientists is a perfect application for the cloud because it involves operating on more than 1 million files. Before, we processed files 10 at a time on a supercomputer and it took more than one week. Now, we are using the cloud to transform files 3,000 at a time and it only takes 30 minutes. Plus, since PO.DAAC is archiving and distributing NASA’s physical oceanography datasets in the cloud, delivering our datasets to PO.DAAC is going to be much, much easier.”
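As a rough check on the figures Fenty quotes, the short sketch below compares the increase in concurrency with the reduction in wall-clock time. It assumes that "more than one week" means at least seven days, so the computed speedup is a lower bound rather than a measured value. The variable names are illustrative and do not come from the ECCO workflow.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption: "more than one week" is treated as exactly 7 days, so the
# resulting speedup is a lower bound, not a measured value.

MINUTES_PER_DAY = 24 * 60

files_at_once_before = 10    # files processed at a time on the supercomputer
files_at_once_cloud = 3_000  # files processed at a time in the cloud
minutes_before = 7 * MINUTES_PER_DAY  # at least one week
minutes_cloud = 30

# Ratio of files handled at once, and ratio of elapsed time.
concurrency_gain = files_at_once_cloud / files_at_once_before
wall_clock_speedup = minutes_before / minutes_cloud

print(f"Concurrency gain: {concurrency_gain:.0f}x")
print(f"Wall-clock speedup: at least {wall_clock_speedup:.0f}x")
```

Under these assumptions the two ratios are of the same order, which is what one would expect if the processing time scales roughly with the number of files handled at once.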
Also, now that Fenty and his colleagues are getting more familiar with cloud computing’s capabilities and resources, they’ve begun to investigate the diversity of cloud-ready services and tools at their disposal.
“You can use the cloud like a regular computer—you can just turn it on and have a virtual computer with 1 or 48 CPUs,” he said. “Then there are other services that Amazon has set up to meet different business needs. We're just beginning to explore the range of cloud services and how we can adapt them for our scientific use.”
Fenty credits the cloud computing gains he and his team have made to the collegial nature of the club and the leadership of Wang, who, he says, has a vision for how the cloud is going to change the field of physical oceanography.
“Jinbo’s enthusiasm for the potential of the cloud is contagious and he is committed to getting other scientists at JPL on board,” Fenty said. “Therefore, the club is kind of a roundtable where everyone shares the information they learned within the last week or two weeks that someone else might find useful. It's the idea that you're going to go much further working together.”
And according to Fenty, some of the club’s members have gone pretty far.
“I have an intern who went from total beginner to full cloud deployment of our ocean model in about six weeks. This wouldn't have been possible without the club, where people can share their experience.”
Fenty also credits Dr. Nadya Vinogradova-Shiffer, Program Scientist and Manager of NASA's Physical Oceanography Program, and Kevin Murphy, Chief Science Data Officer for NASA's Science Mission Directorate, for encouraging the members of the ECCO Consortium to explore how cloud computing resources could benefit the consortium’s research.
Benefitting from the Community
Yet, while Fenty might credit Wang for the coding club’s successes, Wang says the club would not have formed without his experience with Openscapes, a National Center for Ecological Analysis & Synthesis (NCEAS) initiative that offers open data science mentorship, teaching, and coaching.
Openscapes has provided its services to NASA in two ways. First, it established a cohort of DAAC mentors who participated in a so-called “train-the-trainer” program, in which DAAC mentors trained other DAAC personnel to expand the pool of experts. Second, it created a Champions Cohort, a group of scientists whom Openscapes teaches about the cloud paradigm and helps move their science workflows to the cloud.
This second method is similar to Wang's approach with the coding club.
“The coding club aims to transfer the knowledge learned from the Openscapes train-the-trainer sessions to the mission scientists, but with a very specific practical goal: help the NASA missions’ science teams bring their workflow to the cloud,” said Wang. “We want to focus on solving practical problems by cloud computing while avoiding learning cloud infrastructure as much as possible. Lessons from Openscapes helped us get started.”
Now, Openscapes is working to build on its success with PO.DAAC to establish a larger framework for supporting the specific needs of the NASA community.
“This coding club is very complementary to some of the other training provided through the Openscapes effort,” said Dr. Catalina Oaida, an Applied Science Systems Engineer at JPL who has been working with Openscapes. “So, we're part of that community interacting with other DAAC mentors and we're creating both training materials and identifying common needs across NASA’s DAACs to help users adjust to the cloud paradigm and interact with the data.”
To that end, Openscapes is developing resources that address cloud computing issues common to all the DAACs, such as the process of cloud on-boarding, searching and accessing data, using the tools and services developed by DAAC science teams, and using the Common Metadata Repository (CMR)—the management system for EOSDIS Earth science metadata.
DAACs Encouraged to Start Their Own Coding Clubs
Given Openscapes’ experience, Wang recommends that other DAAC science teams consider establishing coding clubs of their own.
“I would suggest that each DAAC science team launch their own coding club, regardless of where they are in the process,” said Wang. “Each DAAC may have its own community needs and unique institutional cloud infrastructure and requirements. Also, try to keep it small to increase engagement and avoid running it like a seminar. It is extremely satisfying to watch the ECCO team demonstrate the practicality of cloud computing for their project in a short time,” he said. “The cloud is a great tool for open science and efficient collaboration. It also has great potential to bring equity and inclusion into scientific computing.”
Wang cautioned, however, that these outcomes are not a given.
“We need a clear path to achieve that goal,” he said. “The coding club hoped to provide a data point for shedding light on the path toward it by solving one practical problem at a time and sharing our experiences with the community through forums like the Earthdata webinar series. It has been a fun experience with a wonderful team. I look forward to our next chapter.”
- PO.DAAC has produced a variety of information detailing the transition of its data to the cloud, as well as resources to help guide data users in discovering, accessing, and utilizing cloud data. To see it, visit the Cloud Data page on the PO.DAAC website.
- PO.DAAC personnel Jinbo Wang and Ed Armstrong recently hosted a webinar titled “Moving Code to the Data: Analyzing Sea Level Rise Using Earth Data in the Cloud,” which is accessible via the NASA Earthdata YouTube Channel.