Croissant is a metadata format for describing datasets used in machine learning (ML). It was designed to make it easier for ML practitioners to work with datasets across ML platforms and repositories, and it is being developed in collaboration with the MLCommons Croissant Working Group.
Croissant provides enough metadata for ML platforms to load a dataset, so platform users can incorporate Croissant datasets into the training or evaluation of a model with just a few lines of code. Croissant support can also be added to the tools ML practitioners commonly use (e.g., for data preprocessing, analysis, or labeling). Besides helping developers work with ML datasets across platforms, Croissant facilitates dataset discovery: once dataset publishers generate Croissant metadata and dataset repositories support the format, dataset search engines can help users discover and use datasets regardless of where they were published. Croissant dataset descriptions can be created or changed with a visual editor and a Python library. Detailed information about the Croissant launch can be found on the MLCommons website and in a Google Research blog post.
Croissant is designed as a modular and extensible format: its core specification can be extended to cover relevant ML concepts and to integrate with other platforms and tools. One such extension is the Croissant Responsible AI (RAI) vocabulary, which captures RAI concerns around biases, fairness, robustness, and the use of human labeling. The geospatial use case for RAI in the Croissant RAI specification includes contributions by members of NASA's Interagency Implementation and Advanced Concepts Team (IMPACT).
To further incorporate geospatial data for AI, representatives from IMPACT, along with a proposed working group, will explore a Geo-Croissant extension built on the Croissant Core and RAI specification.
Proposed Geo-Croissant Specification
Croissant Core and the RAI extension support the efficient representation of metadata and RAI attributes. They also enhance processing in an end-to-end workflow. However, certain crucial characteristics that are required to define Earth observation datasets for AI are missing. These include:
- Spatial reference information
- Nested data attributes (file formats such as netCDF4, HDF5, Zarr)
- Interoperability with existing cloud-native geospatial data formats
- Geographical biases
- Region-restricted data access (i.e., compatibility with NASA Distributed Active Archive Centers [DAACs])
- Data-fusion opportunities with datasets of other modalities (e.g., tabular, graph)
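To make the gaps listed above more concrete, here is a minimal sketch of how a dataset description might carry a few of them as extra metadata properties. No Geo-Croissant vocabulary is defined in this text, so the `geocr:` prefix, every property name under it, the placeholder dataset name, and the helper functions are hypothetical and serve only as illustration. `EPSG:4326` is the standard identifier for the WGS 84 coordinate reference system.

```python
import json

# Hypothetical property names: the "geocr:" prefix and these keys are NOT part
# of any published specification; they only illustrate the listed gaps.
HYPOTHETICAL_GEO_KEYS = (
    "geocr:coordinateReferenceSystem",  # spatial reference information
    "geocr:boundingBox",                # spatial extent of the data
    "geocr:accessRegion",               # region-restricted data access
)


def build_example_metadata():
    """Return a dataset description with hypothetical geospatial properties."""
    return {
        "@type": "sc:Dataset",
        "name": "example-earth-observation-dataset",  # placeholder name
        "geocr:coordinateReferenceSystem": "EPSG:4326",
        # Assumed order: [west, south, east, north] in degrees.
        "geocr:boundingBox": [-180.0, -90.0, 180.0, 90.0],
        "geocr:accessRegion": "unrestricted",
    }


def missing_geo_keys(metadata):
    """List the hypothetical geospatial keys that a description lacks."""
    return [key for key in HYPOTHETICAL_GEO_KEYS if key not in metadata]


if __name__ == "__main__":
    metadata = build_example_metadata()
    print(json.dumps(metadata, indent=2))
    print("Missing keys:", missing_geo_keys(metadata))
```

A check such as `missing_geo_keys` is one way a conversion tool could flag datasets that still lack spatial information; the actual properties and validation rules would be decided by the working group.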
We envision a standard way of defining geospatial datasets for AI through a new specification called Geo-Croissant. Developing Geo-Croissant will also involve building tools and platforms for converting existing datasets to the Croissant format, so that they can be used directly with machine learning and deep learning frameworks such as PyTorch, TensorFlow, Keras, and Hugging Face.
Geospatial datasets keep growing, with some approaching petabyte scale and distributed across multiple archives, so there is a need for fast and efficient input/output data transfers. To accomplish this, Geo-Croissant will use metadata (data about data) to make data discoverable and to provide access to the data when it is required for training. It is also important to abide by responsible Geo-AI practices, because location is key information: the attributes of the data change with location. In addition, sampling strategy and geospatial bias are significant data-centric concepts that can introduce inaccuracies when training a model. The Geo-Croissant specification will represent such information efficiently to enhance data processing in an end-to-end workflow.
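As a rough illustration of the geospatial bias mentioned above, the sketch below computes the share of samples that fall in each latitude band. This is one simple diagnostic among many, not a method prescribed by Geo-Croissant. The function name, the 30-degree default band width, and the sample latitudes are assumptions made for the example, and the band width is assumed to divide 180 evenly.

```python
import math
from collections import Counter


def latitude_band_shares(latitudes, band_width=30.0):
    """Return the fraction of samples per latitude band.

    Keys are the lower (southern) edge of each band in degrees.
    Assumes band_width divides 180 evenly.
    """
    if not latitudes:
        return {}
    counts = Counter()
    for lat in latitudes:
        if not -90.0 <= lat <= 90.0:
            raise ValueError(f"latitude out of range: {lat}")
        # Shift to [0, 180] so flooring works the same in both hemispheres.
        lower = math.floor((lat + 90.0) / band_width) * band_width - 90.0
        # Fold the exact north pole (90 degrees) into the topmost band.
        lower = min(lower, 90.0 - band_width)
        counts[lower] += 1
    total = len(latitudes)
    return {band: counts[band] / total for band in sorted(counts)}


if __name__ == "__main__":
    # Made-up sample latitudes, heavily weighted toward one band.
    sample_latitudes = [10.0, 20.0, 40.0, -50.0]
    for band, share in latitude_band_shares(sample_latitudes).items():
        print(f"{band:+.0f} to {band + 30.0:+.0f} degrees: {share:.0%}")
```

A strongly uneven distribution across bands is a hint that the sampling strategy may under-represent some regions, which is the kind of information a dataset description could record alongside the data.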
Those interested in contributing to the development of Geo-Croissant are encouraged to contact Rajat Shinde (rajat.shinde@uah.edu).