Applying Machine Learning to Harmful Algal Blooms

The open-source Cyanobacteria Finder (CyFi) uses machine learning to pinpoint areas that may contain harmful algal blooms in lakes, reservoirs, rivers, and other small water bodies.

The application of machine learning to satellite imagery is making it easier to identify areas that might contain high concentrations of cyanobacteria in smaller waterways. The open-source CyFi (Cyanobacteria Finder) Python package quickly flags the highest-risk areas of these harmful algal blooms (HABs) in lakes, reservoirs, and rivers.

The Problem of HABs

HABs occur when colonies of algae (simple plants that live in the sea and freshwater) grow out of control to form blooms. These blooms can affect aquatic environments in many ways. Some produce toxins that can kill fish, small mammals, and birds and can cause human illness (or death, in extreme cases). Even nontoxic algal blooms can harm aquatic environments by using up oxygen in the water, clogging the gills of fish and invertebrates, and smothering coral and vegetation. Other blooms discolor water, form smelly piles on beaches, and contaminate drinking water. A great resource for learning more about HABs and relevant NASA data collections is the Water Quality Data Pathfinder.

Satellite image of heart-shaped lake with dark lake water and bright green algal blooms along the shoreline, especially at southern end of lake.
Green colors along the shore of Lake St. Clair near Detroit, USA, are algal blooms. This image was acquired by the Operational Land Imager (OLI) instrument aboard Landsat 8 on July 28, 2015. Credit: NASA Landsat Image Gallery.

The Tricky Tracking of Cyanobacteria (in Small Waterways)

Cyanobacteria, also called blue-green algae, are single-celled organisms that live in fresh, brackish, and marine water and use sunlight to produce food. In warm, nutrient-rich environments, cyanobacteria can multiply quickly, creating blooms that spread across the water's surface.

Sensors aboard orbiting Earth observation satellites can detect blooms through visible changes in water color or changes in detected wavelengths of light. The newest and most advanced satellite mission collecting data that can be used to observe HABs—NASA's Plankton, Aerosol, Cloud, ocean Ecosystem (PACE) mission—launched on February 8, 2024. Ocean color and related data in NASA's Earth science collection are archived at and distributed by NASA's Ocean Biology Distributed Active Archive Center (OB.DAAC).

Satellite detection of water color and changes in color works well for identifying and tracking HABs in the ocean, along coasts, or in large waterways. Blooms in lakes and other smaller inland water bodies, however, are much harder to spot in satellite imagery because many satellite sensors cover large viewing areas at low resolution. Blooms in small water bodies generally need to be monitored manually at the source, which is a time-intensive process.

How Does CyFi Address This?

CyFi combines high-resolution satellite imagery with machine learning to flag areas in smaller water bodies that are likely to have the highest concentrations of cyanobacteria and the greatest risk of blooms. The satellite imagery used in CyFi is from the ESA (European Space Agency) Sentinel-2 MultiSpectral Instrument (MSI), which provides imagery with a resolution as high as 10 meters.

Satellite image of heart-shaped lake with dark water and green areas along shore indicating algal blooms; red dots mark the areas where CyFi indicates the highest probability of a bloom.
Image from the CyFi Demo Deck using the Landsat image above overlaid with a stylized view of cyanobacteria severity estimates. Red dots indicate areas most likely to contain a bloom. Credit: Base map image from the NASA Landsat Image Gallery; CyFi overlay by DrivenData.

CyFi searches for and downloads publicly available satellite imagery around points near or on small waterways. These data are then entered into a machine learning model. For each point, the model provides a cyanobacteria severity level based on World Health Organization (WHO) guidelines and an estimated density of cyanobacteria in cells per milliliter (mL) for detailed analysis.
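As a rough illustration of how an estimated density could be turned into a severity level, here is a minimal Python sketch. The function name and the cutoff values (20,000 and 100,000 cells per mL) are assumptions made for this example, not values stated in this article; CyFi's documentation and WHO guidance are the places to confirm the thresholds actually applied.

```python
# Illustrative cutoffs in cells per mL. These values are an ASSUMPTION for
# this sketch; check CyFi's documentation for the thresholds it applies.
MODERATE_MIN = 20_000
HIGH_MIN = 100_000


def severity_from_density(cells_per_ml: float) -> str:
    """Map an estimated cyanobacteria density to a coarse severity label."""
    if cells_per_ml < 0:
        raise ValueError("density cannot be negative")
    if cells_per_ml < MODERATE_MIN:
        return "low"
    if cells_per_ml < HIGH_MIN:
        return "moderate"
    return "high"


# Hypothetical density estimates for three sampling points.
for density in (5_000, 45_000, 250_000):
    print(density, severity_from_density(density))
```

Keeping the cutoffs as named constants makes it easy to swap in whichever thresholds a monitoring program has adopted.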

With a few lines of Python code, users can generate cyanobacteria estimates for a batch of points or for a single point. Through the CyFi Explorer, users can view those estimates alongside Sentinel-2 imagery.
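Preparing a set of sampling points for a batch of estimates might look something like the sketch below. It uses only the Python standard library and does not call CyFi itself. The column names (latitude, longitude, date), the helper name, and the example coordinates are assumptions for illustration; CyFi's documentation describes the input format and commands it actually expects.

```python
import csv
import io
from datetime import date


def build_sample_points_csv(points):
    """Return CSV text with one row per (latitude, longitude, date) point."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Column names are an ASSUMPTION about the expected input format.
    writer.writerow(["latitude", "longitude", "date"])
    for lat, lon, day in points:
        if not -90 <= lat <= 90 or not -180 <= lon <= 180:
            raise ValueError(f"coordinates out of range: {lat}, {lon}")
        writer.writerow([lat, lon, day.isoformat()])
    return buffer.getvalue()


# Hypothetical sampling points on or near small water bodies.
points = [
    (42.45, -82.70, date(2023, 7, 28)),
    (35.60, -78.70, date(2023, 9, 25)),
]
csv_text = build_sample_points_csv(points)
print(csv_text)
```

Validating coordinates before writing the file catches swapped latitude and longitude values early, before any imagery is searched for or downloaded.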

CyFi is most accurate at low and high cyanobacteria densities. For low densities, CyFi can help better allocate ground sampling resources by deprioritizing water bodies where blooms are likely absent. For areas with high cyanobacteria densities, CyFi can flag water bodies where severe blooms are more likely.
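One way to act on these strengths is to sort water bodies into sampling queues by predicted severity. The Python sketch below is a hypothetical triage step, not part of CyFi: the severity labels, action names, and water body names are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from a predicted severity label to a field action.
ACTION_BY_SEVERITY = {
    "low": "deprioritize",      # blooms likely absent
    "moderate": "review",       # estimates are less certain in this range
    "high": "sample first",     # severe blooms more likely
}


def triage(predictions):
    """Group water body names by action, given (name, severity) pairs."""
    queues = defaultdict(list)
    for name, severity in predictions:
        queues[ACTION_BY_SEVERITY[severity]].append(name)
    return dict(queues)


# Made-up predictions for four water bodies.
predictions = [
    ("Lake A", "low"),
    ("Reservoir B", "high"),
    ("Pond C", "moderate"),
    ("Lake D", "high"),
]
plan = triage(predictions)
print(plan)
```

Routing the middle category to manual review reflects the point above that estimates are most reliable at the low and high ends.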

Where Did CyFi Come From?

CyFi was developed as part of the Tick Tick Bloom: Harmful Algal Detection Challenge conducted by DrivenData. The challenge was created on behalf of NASA with collaboration from NOAA, the U.S. Environmental Protection Agency (EPA), the U.S. Geological Survey (USGS), the U.S. Department of Defense (DOD) Defense Innovation Unit, Berkeley AI Research, and Microsoft AI for Earth. The machine learning model was trained and evaluated using in situ measurements of cyanobacteria density from across the U.S. The training set contained 8,979 observations and the testing/evaluation set contained 4,035 observations.

The CyFi algorithm is open-source and available on GitHub. This enables anyone to reuse, update, or contribute to its development.
