Summary
This document defines Hierarchical Data Format 5 (HDF5), a data model, file format, and I/O library designed for storing, exchanging, managing, and archiving complex data including scientific, engineering, and remote sensing data.
Status
The HDF5 Data Model, File Format and Library–HDF5 1.6 was approved for use in NASA Earth Science Data Systems in January 2007.
As of July 2024, ESCO is reviewing the status of the HDF5 1.6 approval. Since it was first approved, there have been a number of new releases and the HDF5 1.6 libraries are no longer supported. Newer releases have introduced new functionality to the HDF libraries and changes to the underlying file format. New library versions have retained some backwards compatibility with older file format versions. Updated recommendations should be available later in 2024.
NASA Earth Science Community Recommendations for Use
Strengths
HDF and HDF-EOS data formats, software libraries, and application programming interfaces (APIs) have been widely used for NASA Earth observation mission data for many years. The latest version of HDF, HDF5, is the current or planned data format for missions including the Orbiting Carbon Observatory-2 (OCO-2) and the Joint Polar Satellite System (JPSS), totaling many tens of terabytes of data. Users cite many strengths, including:
- Widespread planned use for NASA Earth science data
- Data users can read only the data they need rather than the whole file
- Data producers can put images, tables, multidimensional arrays, etc. into the same file
- Users do not need to be concerned with the platform in which the data are produced
- Its small set of primary structures (groups and datasets) keeps file design simple
- Ample metadata can be attached to the file, groups, and datasets, making the file self-describing
- Data files can be internally compressed using different schemes, reducing storage requirements
- The ability to store data compactly, yet allow the data to be read on any platform
- Source code for writing and reading data in the format is widely and publicly available
- Supported by many third-party applications such as Interactive Data Language (IDL) and MATLAB
- Support for a rich set of data types including composite and user-defined data types
- Support for extensions and profiles, including HDF-EOS5
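Several of the strengths listed above (groups and datasets, attached metadata, internal compression, and partial reads) can be illustrated with a short example. The following is a minimal sketch only, assuming the third-party Python packages h5py and NumPy are installed; neither is prescribed by this document. The file name, the group name "swath", the dataset name "radiance", and all attribute values are hypothetical and chosen purely for illustration.

```python
import os
import tempfile

import h5py          # assumption: third-party HDF5 binding for Python
import numpy as np   # assumption: used only to build sample data


def write_example(path):
    """Create a small HDF5 file with one group, one compressed dataset, and attributes."""
    with h5py.File(path, "w") as f:
        # Metadata attached to the file itself (hypothetical value).
        f.attrs["title"] = "Illustrative granule"

        # A group is a container for datasets and other groups.
        grp = f.create_group("swath")
        grp.attrs["instrument"] = "hypothetical-sensor"

        # A 1000 x 200 array whose values equal their flattened index.
        data = np.arange(1000 * 200, dtype="float32").reshape(1000, 200)

        # Chunked storage with gzip compression applied inside the file.
        dset = grp.create_dataset(
            "radiance",
            data=data,
            chunks=(100, 200),
            compression="gzip",
            compression_opts=4,
        )
        dset.attrs["units"] = "W m-2 sr-1 um-1"


def read_subset(path):
    """Read a 2 x 3 block of the dataset without loading the whole array."""
    with h5py.File(path, "r") as f:
        dset = f["swath/radiance"]
        block = dset[10:12, 0:3]   # only the chunks covering this slice are read
        return block, dset.attrs["units"], dset.compression


if __name__ == "__main__":
    example_path = os.path.join(tempfile.mkdtemp(), "example.h5")
    write_example(example_path)
    block, units, compression = read_subset(example_path)
    print(block.shape, units, compression)
```

The reader names the dataset by its path inside the file and slices it like an array, so the cost of the read depends on the requested region rather than on the size of the file.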
Weaknesses
HDF5 is undeniably complex and has a steep learning curve. However, users also applaud the quality of the documentation and help-desk support available. Third-party tools with HDF5 support, such as IDL and MATLAB, also help hide complexity from users. Users have expressed concern about the availability of long-term support for HDF5 and related tools, but this concern is somewhat alleviated by the availability of the source code.
Applicability
HDF5 is used for data archive and distribution. The strengths cited above, together with the availability of analysis tools, make the format suitable for data analysis as well. The new netCDF 4.0 will include the capability to use HDF5 as the data storage layer for the netCDF API, with the addition of many new features available in HDF5 such as user-defined types, multiple unlimited dimensions, and per-variable data compression. This merger of the two formats will further extend the HDF5 user community.
Limitations
A major limitation of HDF5 is the loss of backward compatibility with HDF4 and earlier versions. Also, unlike less complex formats, HDF5 files cannot be read directly; users must go through the HDF5 software library. Of greater concern are recent postings on a mailing list discussing the use of netCDF and HDF5 in high-performance computing applications with thousands of processors using parallel I/O, which warn of the danger of file corruption during parallel I/O if a client dies at a particular time. The HDF Group is aware of this problem and is addressing it.
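To make the point about direct reading concrete: an HDF5 file is a binary container, and about the only thing that can be checked without the library is its 8-byte format signature. The HDF5 file format places this signature at byte offset 0, or at 512, 1024, 2048, and so on when a user block precedes it. The sketch below uses only the Python standard library to locate that signature. The function name and the search cap max_offset are assumptions made for this illustration, and the check says nothing about the groups, datasets, or metadata inside, which still require the HDF5 library.

```python
import os
import tempfile

# The 8-byte signature that opens an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def find_hdf5_signature(path, max_offset=1 << 20):
    """Return the byte offset of the HDF5 signature, or None if not found.

    Candidate offsets are 0, 512, 1024, 2048, ... The max_offset cap is an
    arbitrary limit chosen for this sketch, not part of the format.
    """
    with open(path, "rb") as f:
        offset = 0
        while offset <= max_offset:
            f.seek(offset)
            if f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return offset
            # Step through 0, 512, then keep doubling.
            offset = 512 if offset == 0 else offset * 2
    return None


if __name__ == "__main__":
    # Build a stand-in file: a 512-byte user block followed by the signature.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
    with open(demo_path, "wb") as out:
        out.write(b"\x00" * 512 + HDF5_SIGNATURE)
    print(find_hdf5_signature(demo_path))
```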
Summary
Overall, HDF5 is a widely used data format with a well-defined specification that provides a standard way of storing and working with science data. The ESDS-RFC-007 TWG thus recommends its endorsement by the SPG as an Earth Science Data Systems Standard.