Data Chat: Katie Baynes

For NASA EOSDIS System Architect Katie Baynes, having NASA Earth observing data in the commercial cloud will foster not only a new community of data users, but also new communal ways of using these data.
Josh Blumenfeld

It doesn’t take long for Katie Baynes to bring up the community aspect of her work evolving data and services in NASA’s Earth Observing System Data and Information System (EOSDIS) into the commercial cloud. As she observes, this evolution will lead to not only new discoveries in the Earth sciences, but also more collaborative ways of working with data to accomplish these discoveries. 

Along with her fellow EOSDIS system architect Dr. Christopher Lynnes, Baynes is leading teams to develop a system capable of efficiently archiving and distributing one of the world’s largest collections of Earth observing data—a collection that is expected to grow in volume from approximately 35 petabytes (PB) to almost 250 PB over the next five years.

The influx of high-volume data from the NASA-Indian Space Research Organisation Synthetic Aperture Radar (NISAR) and the Surface Water Ocean Topography (SWOT) missions (both of which are scheduled for launch in 2021 or 2022) is providing not only challenges, but also the prospect of an exciting new era of Earth science research based on a communal use of Big Data. As Baynes points out, going together into this new era of data use has many advantages.

Let’s start with NISAR and SWOT, two missions that will add more data by volume to the EOSDIS archive than any previous Earth observing missions. How are you and the EOSDIS Distributed Active Archive Centers (DAACs) preparing for these massive amounts of data?

We’ve spent several years preparing for these missions, and have been doing a lot of load testing to make sure we can maintain the downloads we expect to get for these products along with cost modeling for these downloads. One of the most interesting facts I’ve found is that the amount of data collected by NASA’s EOSDIS on a daily basis right now is going to be multiplied by five once we start receiving data from NISAR and SWOT. This means that if we are collecting, hypothetically, 25 terabytes (TB) of data per day right now, we soon will be collecting 125 TB of data every day. The actual figure is projected to be around 126 TB of data every day. When you think about this in terms of economies of scale and orders of magnitude of change, this is really a staggering amount of change in the volume of data we’ll be dealing with.

The evolution of data in NASA's EOSDIS collection is expected to lead to a new paradigm in how data are used. Rather than individual data users downloading data from multiple sources for analysis on their local machines (top image), Big Data will be most efficiently used by working close to the data and conducting analyses in the cloud, then downloading only the results (bottom image). This also will facilitate more communal use of data in the cloud by teams of researchers. NASA EOSDIS graphic.

But this is not just about load testing and preparing for the onslaught of data volume; it’s also about transforming the mindset of tomorrow’s researchers. For instance, it may not be feasible to look at these data on local disks, so we’re looking at ways to reduce the data transfer. Maybe you don’t need to download several terabytes of data to get your job done. Maybe you can go into the cloud, look at just the data you need, perform your analyses and processing on these data within the cloud, and then simply download your results. We want data users to learn how to see our data as a large pool they can visit and spend time at and not as a bucket they carry with them to their local machine.

What are some of the benefits to EOSDIS DAACs and EOSDIS data users of putting EOSDIS data and EOSDIS services, like Earthdata Search and Global Imagery Browse Services (GIBS), into the commercial cloud?

I think there are synergies that can be found from a technological perspective in getting everyone on the same page using the same framework, systems, and interaction patterns across the DAACs. This can only lead to more time for science using these data and services. By this I mean that less time is being spent dealing with data management problems at the individual DAACs and more time can be spent [by the DAACs] focusing on the science use cases of their unique user communities. For our end-users, benefits include the ability to do larger-scale compute. Having the data together in the cloud will allow for a more coherent and complete picture for those looking to interact with our data.

In the recent Data Chat with your fellow EOSDIS system architect Dr. Christopher Lynnes, he notes that he is handling the data use side of the EOSDIS cloud evolution while you are responsible for the ingest and archive side. How does the work that you’re doing complement the work being done by Dr. Lynnes and his team?

My team and I are working to get all the data that [EOSDIS] services will act on into the commercial cloud, so our work as part of the ingest and archive activity enables [Dr. Lynnes’] work. We put the data in a place and make it accessible in a way that [his work on] services and transformations and extractions and mosaicking and reformatting and regridding of data products can be done as simply and as uniformly as possible. We try to enable this by making sure that data products that historically have been stored at different DAACs using different principles, file paths, and access policies are stored in a more uniform way by using uniform tools.

Once the EOSDIS data collection evolves to the commercial cloud, how will this change, or evolve, the roles of the DAACs?

The DAACs are stewards of the data products that they have in their catalog, and this will not change. The data might move, but the people will remain at these centers of excellence to provide support for the data that they have always been responsible for. They will still be maintaining their websites and the data-use tools that are specific to their user community for interacting with the data they distribute. Really, the only change is that they will be moving towards a centralized ingest and archive system and a centralized storage facility for the data. It’s a change, but I think it’s probably a change for the better since this will allow the DAACs more time to focus on the specific needs of their discrete user communities.

A key element of facilitating the evolution of EOSDIS data into the commercial cloud is the Cumulus effort, which you lead. What is this and what is the status of this effort?

Cumulus is an open source ingest and archive service built for use in Amazon Web Services, or AWS. The intent is to use the same ingest and archive system for all data that NASA’s EOSDIS is migrating into the commercial cloud. One of the guiding principles of Cumulus is that it uses AWS as much as possible to take out some of the redundant tasking. Cumulus also is being developed in an open source manner, open not just to our team members, but to the entire world.

From the start, Cumulus was a collaborative effort involving several EOSDIS DAACs. At this time, eight of the 12 EOSDIS DAACs are engaged with the Cumulus community as integrators, developers, or operators and are participating in working groups to make sure that this effort serves as an archive system that can broadly serve the needs of all the DAACs.

In a recent presentation, you had a slide noting that Cumulus is an organizational shift, not just a technology shift. What do you mean by this?

You know, historically, I think we have had a model that was more of a heads-down situation—“I’m a developer at a DAAC, I have a problem in front of me, and I look to my small group of colleagues at the DAAC or at universities to solve this problem.”

As we move into the cloud and to a more open, collaborative model, this situation will change to one where we’re talking not just within our existing small academic sphere, but with a broader range of colleagues in a wider range of interest areas. This will bring in a wider range of thought to problem-solving and investigation. We’ll realize that collaboration across great distances and across scientific disciplines is possible in unique and exciting ways.

This is really about building trust at an organizational level, trust that goes beyond an individual DAAC or immediate physical colleagues and moves it to trust and collaboration with, for example, everyone working within the EOSDIS and seeing them all as potential collaborators and as sources for good ideas to help solve the particular problem you are trying to solve—and maybe bring in some problems you might not have even thought about.

Ironically, the current situation that we’re in, with being forced to stay at home and interact remotely through video more often than we’re interacting with colleagues on a face-to-face basis, is similar to the model we’ve been working towards with EOSDIS data for the past 10 years. This use of distributed teams to use data collaboratively is one of our goals, and I feel that now more and more people are starting to catch on not only about how this can be done, but also that productive remote collaboration and problem-solving is not only possible, but fruitful.

What is the structure of the team working to evolve EOSDIS data to the cloud? How will this work influence future NASA missions?

This effort has three primary pillars of activity. There’s ingest and archive, which is my piece, and there’s data use, which is [Chris Lynnes’] piece. The third piece is the underlying platform on which all of this work depends. This work is spread through [NASA’s Earth Science Data and Information System (ESDIS) Project], the DAACs, outside contractors, and even some universities.

The work we do will also influence how future NASA and joint NASA missions are planned, especially how upcoming missions like NISAR and SWOT are looking at doing their data processing in the cloud. There are other missions that will generate lessons-learned and a pathway to doing more cloud-native processing as we go forward. I think that we can provide input that could help them with their decision-making process and serve as pathfinders to that end. Our work on the NISAR and SWOT missions is already helping to forge that path.

Twenty years ago, EOSDIS data users were still receiving physical tapes of data in the mail. Ten years ago, they were downloading EOSDIS data using the internet. Now, data and services are becoming available through the commercial cloud. What do you see as the next step in this evolution?

I think that you’ll start seeing a move from information, to knowledge, to wisdom. The problem space is naturally evolving away from hoarding scientific knowledge as proprietary work and toward a more open model where we reward things such as creating tools and software that allow end-users to come in and acquire understanding by dynamically exploring the data in real-time.

Another exciting avenue for development involves utilizing machine learning, which is creating [research] opportunities where we can sort of guide the data and guide the machines as they do their work, but also allow the machines to find patterns on their own that were not immediately evident. With Big Data collections like we’re developing and planning for, the human eye can’t look at all the data to parse out these hidden patterns. I’m really excited about pattern discovery and what we can glean from this sort of information about Earth’s changing processes and the direction in which these processes might be going.

Over the next five years, what are you most excited about in terms of how EOSDIS data will be archived and used?

I’m most excited not just about the NASA science that is going on, but also the enablement of science using data products that we already have that can be mashed together or fused together into longer continuity products or into more impactful, higher-level data products that are not just useful to the NASA community, but are also useful for entities like the U.S. Department of Agriculture or [the Federal Emergency Management Agency].

I think that as we see our climate evolve it will be crucial to put data into terms that are understandable not just to NASA and generally-focused science users, but to communities beyond this, like decision-makers and policy-makers and influential global leadership. I feel that the work that we do is just a small piece of what could change our future and lead to a brighter tomorrow. I think that we can provide the data that will help mitigate the impacts of climate change. This is what makes my work important to me.

How will the commercial cloud help facilitate this?

My favorite advantage of the cloud is not the cloud itself, but what it has allowed us to do as a community by working together. Getting people to work together toward a common goal on a common problem after decades of working on it individually has been a really, really rewarding experience for me. I think we’re all benefitting from the diversity of thought that is brought to the table when we try to consider everyone’s needs.

There’s a good quote that says “when you want to go fast, go alone; when you want to go far, go together.” I think that has been an inspiration to me. We were going fast alone, but I think that now we’re enabling a future where we can go far together.

