Code examples for processing HDF5 files.
HDF5 is a file format for storing data that is highly extensible and flexible. For example, you can store a large number of images in a single HDF5 file.
- Stands for "Hierarchical Data Format".
- Current version is 5.
- Open-source and free.
- The core library is written in C and can be used directly from C, C++, and Java. Wrappers exist for several other languages, including Python (h5py).
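To read a file, open it in read mode. For example, to inspect the dataset "data" in "myfile.h5":
import h5py

# Open the file read-only; the context manager closes it automatically.
with h5py.File('myfile.h5', 'r') as f:
    print(list(f.keys()))    # names of the top-level groups and datasets
    print(f['data'].shape)   # shape of the dataset named 'data'
    print(f['data'][:])      # read the whole dataset into a NumPy array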
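To save data, you need to create a new file object in write mode. For example, to create a new file called "myfile.h5":
import h5py
import numpy as np

# 'w' creates the file, overwriting any existing file with the same name.
with h5py.File('myfile.h5', 'w') as f:
    # 1-D dataset holding 100 integers
    data_set = f.create_dataset('data', (100,), dtype='i')
    data_set[:] = np.arange(100)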
- Groups (a concept similar to directories)
  - Groups can contain datasets and other groups.
- Datasets (a concept similar to files)
  - Shape (ex. 1D, 2D, 5D)
  - Datatype (ex. float, int32)
  - Attributes (ex. chunking, compression)
  - Data (ex. data[:])
  - Subdatasets (ex. subdataset[:])
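The group and dataset hierarchy above can be sketched with h5py roughly as follows. The file, group, dataset, and attribute names are made up for illustration, and the file is kept in memory (h5py's `core` driver with `backing_store=False`) so nothing is written to disk.

```python
import h5py
import numpy as np

# Hypothetical names throughout; the in-memory driver avoids touching the disk.
with h5py.File("structure_demo.h5", "w", driver="core", backing_store=False) as f:
    # Groups behave like directories and can be nested.
    experiment = f.create_group("experiment")
    images = experiment.create_group("images")

    # Datasets behave like files; each one has a shape and a datatype.
    frame = images.create_dataset("frame0", data=np.zeros((4, 3), dtype="float32"))

    # Attributes are small pieces of metadata attached to a group or dataset.
    frame.attrs["units"] = "counts"

    # Path-style indexing walks the hierarchy, like a file path.
    same_frame = f["experiment/images/frame0"]

    group_names = list(experiment.keys())
    frame_shape = same_frame.shape
    frame_dtype = str(same_frame.dtype)
    frame_units = same_frame.attrs["units"]
    first_row = same_frame[0, :].tolist()  # slicing reads only the requested part
```

Everything that is needed after the `with` block is copied into plain Python objects first, because the dataset handles stop working once the file is closed.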
Linear (contiguous):
- Data is stored in a single file.
- Data is stored as one contiguous block, not split into chunks.
Chunked:
- Data is split into multiple fixed-size chunks.
- Each chunk is stored as a separate block within the same file.
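As a minimal sketch of the two layouts in h5py: a dataset created without a `chunks` argument (and without compression or resizing) uses the linear (contiguous) layout, while passing a chunk shape makes it chunked. The dataset names and the 100 x 100 chunk shape are arbitrary choices for illustration.

```python
import h5py

# In-memory file with hypothetical names; nothing is written to disk.
with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as f:
    # No chunks argument: the data is laid out as one contiguous block.
    contiguous = f.create_dataset("contiguous", shape=(1000, 1000), dtype="f4")

    # Explicit chunk shape: the data is split into 100 x 100 element chunks.
    chunked = f.create_dataset(
        "chunked", shape=(1000, 1000), dtype="f4", chunks=(100, 100)
    )

    contiguous_chunks = contiguous.chunks  # None for a contiguous dataset
    chunked_chunks = chunked.chunks        # the chunk shape for a chunked dataset
```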
Chunk size must strike a balance between:
- maximizing I/O speed.
- minimizing I/O of unused data.
- minimizing per-chunk I/O overhead.
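The trade-off can be made concrete with a little arithmetic. The sketch below uses plain Python (no HDF5 calls) and assumes that a chunk is always read from disk as a whole and that the data is uncompressed. The helper names `chunks_touched` and `bytes_read` are invented for this example.

```python
import math

def chunks_touched(start, stop, chunk_shape):
    """Count the chunks that overlap the half-open selection [start, stop)."""
    count = 1
    for lo, hi, size in zip(start, stop, chunk_shape):
        first = lo // size        # index of the first chunk hit on this axis
        last = (hi - 1) // size   # index of the last chunk hit on this axis
        count *= last - first + 1
    return count

def bytes_read(start, stop, chunk_shape, itemsize):
    """Bytes fetched from disk, assuming every touched chunk is read in full."""
    return chunks_touched(start, stop, chunk_shape) * math.prod(chunk_shape) * itemsize

# Read one row of a 1000 x 1000 float32 dataset (4000 useful bytes).
row_start, row_stop = (0, 0), (1, 1000)

for shape in [(100, 100), (10, 10), (1, 1000)]:
    n = chunks_touched(row_start, row_stop, shape)
    b = bytes_read(row_start, row_stop, shape, itemsize=4)
    print(f"chunk shape {shape}: {n} chunks, {b} bytes read")
```

With 100 x 100 chunks the row read touches 10 chunks and fetches 400000 bytes for 4000 useful ones. With 10 x 10 chunks it fetches only 40000 bytes but needs 100 separate chunk lookups. A 1 x 1000 chunk matches this access pattern exactly, but would be a poor fit for column reads.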
A filter transforms data as it is written to or read from disk, most commonly to compress it.
- Can be applied to datasets.
- It is a layer between the program and the data.
Program <- Filter (CPU) <- data (Disk).
Examples:
- Gzip (compression filter)
- ScaleOffset (stores each value as an offset from the minimum; the minimum is added back on reading)
- Szip (compression filter)
- Shuffle (reorders the bytes of the data so that similar bytes sit together, which helps compression)
- Fletcher32 (checksum)
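A minimal sketch of enabling some of these filters through h5py's `create_dataset` keyword arguments. The file and dataset names, the chunk shape, and the gzip level of 4 are arbitrary choices for illustration. Szip is left out because it is not available in every HDF5 build.

```python
import h5py
import numpy as np

values = np.arange(10000, dtype="int32").reshape(100, 100)

# In-memory file with hypothetical names; nothing is written to disk.
with h5py.File("filter_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "filtered",
        data=values,
        chunks=(10, 100),     # filters operate per chunk, so the layout is chunked
        shuffle=True,         # byte shuffle, applied before compression
        compression="gzip",   # gzip compression filter
        compression_opts=4,   # gzip level (0-9)
        fletcher32=True,      # checksum to detect corrupted chunks
    )

    # The filter pipeline is transparent: reading returns the original values.
    round_trip_ok = bool(np.array_equal(dset[...], values))
    filter_settings = (
        dset.shuffle,
        dset.compression,
        dset.compression_opts,
        dset.fletcher32,
    )
```

The program only ever sees the decoded values. The filters run on each chunk between the program and the disk, as in the diagram above.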