Summary
This document defines Hierarchical Data Format 5 (HDF5), a data model, file format and I/O library designed for storing, exchanging, managing and archiving complex data including scientific, engineering, and remote sensing data.
Status
The HDF5 Data Model, File Format and Library (HDF5 1.6) was approved in January 2007 as a standard recommended for use in NASA Earth Science Data Systems.
NASA Earth Science Community Recommendations for Use
Strengths
HDF and HDF-EOS data formats, software libraries and application programming interfaces (APIs) have been widely used for NASA Earth observation mission data for many years. The latest version of HDF, HDF5, is the current or planned data format for missions including Orbiting Carbon Observatory 2 (OCO-2) and Joint Polar Satellite System (JPSS), totaling many tens of terabytes of data. Users cite many strengths, including:
- Widespread planned use for NASA Earth science data.
- Data users can read only the data that they need, not the whole file. Data producers can put images, tables, multidimensional arrays, etc. into the same file.
- Users do not need to be concerned with the platform on which the data were produced.
- Its small set of primary structures, i.e., groups and datasets, keeps the file design simple.
- Ample metadata can be added to the file, groups and datasets, making the file self-describing.
- Data files can be internally compressed using different schemes, making data storage and use more efficient.
- The ability to store data compactly, yet allow it to be read on any platform.
- Source code for writing and reading data in the format is widely and publicly available.
- Supported by many third-party applications such as Interactive Data Language (IDL) and MATLAB.
- Support for a rich set of data types including composite and user-defined data types.
- Support for extensions and profiles, including HDF-EOS5.
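Several of the strengths listed above (groups and datasets, attached metadata, internal compression, compound data types, and reading only a subset of the data) can be illustrated with a short sketch. It assumes the third-party h5py and NumPy Python packages, which this document does not itself mention; the file name, group name, dataset names, and attribute values are all hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def write_example(path):
    """Write a small file containing one group, two datasets, and attributes."""
    with h5py.File(path, "w") as f:
        # File-level metadata (hypothetical value).
        f.attrs["title"] = "Hypothetical granule"

        # Groups organize the file much like directories.
        grp = f.create_group("swath")

        # A 2-D array stored in chunks and compressed internally with gzip.
        image = np.arange(100 * 200, dtype="f4").reshape(100, 200)
        dset = grp.create_dataset(
            "radiance", data=image, chunks=(10, 200), compression="gzip"
        )
        dset.attrs["units"] = "W m-2 sr-1 um-1"  # dataset-level metadata

        # A small table stored in the same file using a compound data type.
        table_type = np.dtype([("scan", "i4"), ("quality", "f8")])
        table = np.zeros(3, dtype=table_type)
        table["scan"] = [0, 1, 2]
        table["quality"] = [0.9, 0.8, 0.95]
        grp.create_dataset("scan_table", data=table)


def read_subset(path):
    """Read a 2 x 5 block of the image, its units, and one table column."""
    with h5py.File(path, "r") as f:
        dset = f["swath/radiance"]
        block = dset[10:12, 0:5]  # only this region is read from the dataset
        units = dset.attrs["units"]
        scans = f["swath/scan_table"][:]["scan"]
    return block, units, scans


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")
    write_example(path)
    block, units, scans = read_subset(path)

print(block.shape, units, scans.tolist())
```

The slice in read_subset is the point of interest: the reader asks for a small region and the library locates and decompresses only the chunks that cover it.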
Weaknesses
HDF5 is undeniably complex and has a steep learning curve. However, users also applaud the quality of documentation and help-desk support available. Third-party tools with HDF5 support, such as IDL and MATLAB, also help hide complexity from users. Users have expressed concern about the availability of long-term support for HDF5 and related tools, but this concern is somewhat alleviated by the availability of the source code.
Applicability
HDF5 is used for data archive and distribution. The strengths cited above, together with the availability of analysis tools, make the format suitable for data analysis as well. The new netCDF 4.0 will include the capability to use HDF5 as the data storage layer for the netCDF API, adding many new features available in HDF5 such as user-defined types, multiple unlimited dimensions, and per-variable data compression. This merger of the two formats will further extend the HDF5 user community.
Limitations
A major limitation of HDF5 is the loss of backward compatibility with HDF4 and earlier versions. Also, unlike less complex formats, users cannot read HDF5 files directly without using the HDF5 software library. Of greater concern are recent postings on a mailing list discussing the use of netCDF and HDF5 in high-performance computing applications with thousands of processors using parallel I/O, which warn of the danger of file corruption during parallel I/O if a client dies at a particular time. The HDF Group is aware of this problem and is addressing it.
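To illustrate the point above about not reading HDF5 files directly: about the only thing that plain file I/O can do easily is recognize an HDF5 file by its 8-byte format signature, which the file format specification places at byte offset 0 or, when a user block precedes it, at offset 512, 1024, 2048, and so on. The sketch below uses only the Python standard library; the function name and the test files are hypothetical, and anything beyond this check (groups, datasets, compressed chunks) is better left to the HDF5 library.

```python
import os
import tempfile

# 8-byte HDF5 format signature: \211 H D F \r \n \032 \n
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the signature is found at offset 0, 512, 1024, 2048, ..."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset + len(HDF5_SIGNATURE) <= size:
            f.seek(offset)
            if f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Candidate offsets are 0, then 512 and successive doublings.
            offset = 512 if offset == 0 else offset * 2
    return False


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for real files: only the leading bytes matter here.
    plain = os.path.join(tmp, "plain.bin")
    with open(plain, "wb") as f:
        f.write(HDF5_SIGNATURE + b"\x00" * 64)

    with_user_block = os.path.join(tmp, "user_block.bin")
    with open(with_user_block, "wb") as f:
        f.write(b"#" * 512 + HDF5_SIGNATURE + b"\x00" * 64)

    text_file = os.path.join(tmp, "notes.txt")
    with open(text_file, "wb") as f:
        f.write(b"not an HDF5 file\n" * 100)

    result_plain = looks_like_hdf5(plain)
    result_user_block = looks_like_hdf5(with_user_block)
    result_text = looks_like_hdf5(text_file)

print(result_plain, result_user_block, result_text)
```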
Overall, HDF5 is a widely used data format with a well-defined specification that provides a standard way of storing and working with science data. The ESDS-RFC-007 TWG thus recommends its endorsement by the SPG as an Earth Science Data Systems Standard.