Data Loader Support
To facilitate the user experience, we plan to provide default data loaders for different use scenarios. Currently, Nifti and H5 formats are supported. Other use cases and image formats require a customised data loader (add a link to the tutorial).
Data Format
There are some prerequisites on the data:
- Data must be split into train / val / test sets beforehand and stored in separate directories, although the val and test sets are optional.
- Each image or label is in 3D. Image has shape (width, height, depth); label has shape (width, height, depth) or (width, height, depth, num_labels).
- The data do not have to be of the same shape: all will be resized to the same shape before being fed in. To prevent unexpected effects, we recommend pre-processing all images to the desired shape.
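The shape prerequisites above can be checked before training. The following is a minimal sketch, not part of the data loaders themselves; the function name count_labels is hypothetical, and it works on shape tuples only, so it makes no assumption about how the arrays are loaded.

```python
from typing import Optional, Sequence


def count_labels(
    image_shape: Sequence[int], label_shape: Optional[Sequence[int]] = None
) -> int:
    """Validate shapes against the prerequisites and return the number of label types.

    Returns 0 when no label is given, 1 for a 3D label and num_labels for a 4D label.
    (Hypothetical helper for illustration only.)
    """
    if len(image_shape) != 3:
        raise ValueError(
            f"image must have shape (width, height, depth), got {tuple(image_shape)}"
        )
    if label_shape is None:
        return 0  # labels are optional
    if len(label_shape) == 3:
        return 1  # a single label type without the trailing axis
    if len(label_shape) == 4:
        return label_shape[3]  # (width, height, depth, num_labels)
    raise ValueError(
        "label must have shape (width, height, depth) or "
        f"(width, height, depth, num_labels), got {tuple(label_shape)}"
    )
```

For example, count_labels((64, 64, 32), (64, 64, 32, 4)) reports four label types, while a 2D image shape raises a ValueError.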
Supported scenarios
Unpaired images (e.g. single-modality inter-subject registration)
- Case 1-1 multiple independent images.
- Case 1-2 multiple independent images and corresponding labels.
Grouped unpaired images (e.g. single-modality intra-subject registration)
- Case 2-1 multiple subjects each with multiple images.
- Case 2-2 multiple subjects each with multiple images and corresponding labels.
Paired images (e.g. two-modality intra-subject registration)
- Case 3-1 multiple paired images.
- Case 3-2 multiple paired images and corresponding labels.
Sampling during training
Sampling for multiple labels
Whenever corresponding labels are available and there are multiple types of labels, e.g. the segmentation of different organs in a CT image, two options are available:
- During one epoch, each image is sampled only once, and one of its labels is randomly sampled each time. (Default)
- During one epoch, each image is paired with each available label. So, if an image has four types of labels, it will be sampled four times, each time with a different label.
When using multiple labels, it is the user's responsibility to ensure the labels are ordered consistently, such that the same label_idx in (width, height, depth, label_idx) corresponds to the same type of landmark or ROI across all labels.
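The two label sampling options can be sketched as follows. This is an illustration under stated assumptions rather than the loader's actual implementation: the function name label_pairs_for_epoch and the mode values "sample" and "all" are hypothetical, and every image is assumed to have at least one label.

```python
import random
from typing import List, Optional, Tuple


def label_pairs_for_epoch(
    num_labels_per_image: List[int], mode: str = "sample", seed: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Return the (image_idx, label_idx) pairs visited during one epoch.

    mode="sample": each image appears once with one randomly chosen label.
    mode="all":    each image appears once per available label.
    """
    rng = random.Random(seed)
    pairs: List[Tuple[int, int]] = []
    for image_idx, num_labels in enumerate(num_labels_per_image):
        if mode == "sample":
            pairs.append((image_idx, rng.randrange(num_labels)))
        elif mode == "all":
            pairs.extend((image_idx, label_idx) for label_idx in range(num_labels))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return pairs
```

With two images that have four and two label types respectively, mode="all" yields six pairs per epoch, whereas mode="sample" yields two.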
Sampling for multiple subjects each with multiple images
When multiple subjects each with multiple images are available, the following sampling methods are supported:
- Inter-subject: one image is sampled from subject A as the moving image, and another image is sampled from a different subject B as the fixed image.
- Intra-subject: two images are sampled from the same subject. In this case, we can specify:
a) moving image always has a smaller index, e.g. at an earlier time;
b) moving image always has a larger index, e.g. at a later time; or
c) no constraint on the order.
For the first two options, the intra-subject images are sorted by name in ascending order to represent ordered sequential images, such as time-series data.
*Multiple label sampling is also supported once an image pair is sampled. In case there are no consistent label types defined between subjects, an option is available to turn off the label contribution to the loss for those inter-subject image pairs.
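The inter-subject and intra-subject sampling rules above can be sketched in plain Python. The function names and the order values "forward", "backward" and "unconstrained" are hypothetical and chosen for illustration only; the sketch assumes each subject has at least two images for intra-subject sampling and that at least two subjects exist for inter-subject sampling.

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_intra_subject_pair(
    images: List[str], order: str = "unconstrained", rng: Optional[random.Random] = None
) -> Tuple[str, str]:
    """Return (moving, fixed) image names sampled from a single subject."""
    rng = rng or random.Random()
    ordered = sorted(images)  # ascending sort by name, e.g. by acquisition time
    i, j = sorted(rng.sample(range(len(ordered)), 2))  # two distinct indices, i < j
    if order == "forward":  # option a): moving image has the smaller index
        return ordered[i], ordered[j]
    if order == "backward":  # option b): moving image has the larger index
        return ordered[j], ordered[i]
    if order == "unconstrained":  # option c): either direction
        if rng.random() < 0.5:
            i, j = j, i
        return ordered[i], ordered[j]
    raise ValueError(f"unknown order: {order!r}")


def sample_inter_subject_pair(
    subjects: Dict[str, List[str]], rng: Optional[random.Random] = None
) -> Tuple[str, str]:
    """Return (moving, fixed) image names sampled from two different subjects."""
    rng = rng or random.Random()
    subject_a, subject_b = rng.sample(sorted(subjects), 2)  # two distinct subjects
    return rng.choice(subjects[subject_a]), rng.choice(subjects[subject_b])
```

Note that sorting by name is lexicographic, so names such as obs2 and obs10 only sort chronologically if they are zero-padded.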
Examples (folder structure and filename requirement)
In the following, we take the train directory as an example to show how the files should be stored.
Nifti Data Format
We assume each .nii.gz file contains only one tensor, which is either an image or a label.
Unpaired data
This is the simplest case. Data are assumed to be stored under train/images and train/labels directories.
Nifti Case 1-1 Images only
We only have images without any labels and all images are considered to be independent samples. So all data should be stored under train/images, e.g.:
- train
  - images
    - subject1.nii.gz
    - subject2.nii.gz
    - ...
(It is also ok if the data are further grouped into different directories under images as we will directly scan all nifti files under train/images.)
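The recursive scan mentioned above can be approximated with the standard library. This is a minimal sketch, assuming that every file ending in .nii.gz below the directory counts as one sample; the function name find_nifti_files is hypothetical and not part of any library.

```python
from pathlib import Path
from typing import List


def find_nifti_files(root: str) -> List[Path]:
    """Return all .nii.gz files under root, including nested directories, sorted."""
    return sorted(p for p in Path(root).rglob("*.nii.gz") if p.is_file())
```

Calling find_nifti_files("train/images") would then list the files regardless of how they are grouped into sub-directories.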
Nifti Case 1-2 Images with labels
In this case, we have both images and labels. So all images should be stored under train/images and all labels should be stored under train/labels. The corresponding image file name and label file name should be exactly the same, e.g.:
- train
  - images
    - subject1.nii.gz
    - subject2.nii.gz
    - ...
  - labels
    - subject1.nii.gz
    - subject2.nii.gz
    - ...
Grouped unpaired images
Nifti Case 2-1 Images only
We have images without any labels, but images are grouped under different subjects/groups, e.g. time-series observations for each subject/group. For instance, the data set can be the CT scans of multiple patients (subjects/groups) where each patient has multiple scans acquired at different time points. So all data should be stored under train/images and the leaf directories (directories that do not have sub-directories) must represent different subjects/groups, e.g.:
- train
  - images
    - subject1
      - obs1.nii.gz
      - obs2.nii.gz
      - ...
    - subject2
      - obs1.nii.gz
      - obs2.nii.gz
      - ...
    - ...
(It is also ok if the data are grouped into different directories, but the leaf directories will be considered as different subjects/groups.)
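The leaf-directory rule can be illustrated with a short sketch that groups files by the directories that contain no sub-directories. The function name group_by_leaf_directory is hypothetical, and using the path relative to the root as the group identifier is an assumption made here for illustration.

```python
from pathlib import Path
from typing import Dict, List


def group_by_leaf_directory(root: str, suffix: str = ".nii.gz") -> Dict[str, List[str]]:
    """Map each leaf directory (relative to root) to its sorted image file names."""
    root_path = Path(root)
    groups: Dict[str, List[str]] = {}
    directories = [root_path] + [p for p in root_path.rglob("*") if p.is_dir()]
    for directory in sorted(directories):
        children = list(directory.iterdir())
        if any(child.is_dir() for child in children):
            continue  # has sub-directories, so it is not a leaf
        files = sorted(c.name for c in children if c.name.endswith(suffix))
        if files:
            # Assumption: the relative path identifies the subject/group.
            groups[directory.relative_to(root_path).as_posix()] = files
    return groups
```

Files placed directly in a directory that also has sub-directories are skipped by this sketch, since only leaf directories are treated as subjects/groups.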
Nifti Case 2-2 Images with labels
We have both images and labels. So all images should be stored under train/images and all labels should be stored under train/labels. The leaf directories will be considered as different subjects/groups and the corresponding image file name and label file name should be exactly the same, e.g.:
- train
  - images
    - subject1
      - obs1.nii.gz
      - obs2.nii.gz
      - ...
    - ...
  - labels
    - subject1
      - obs1.nii.gz
      - obs2.nii.gz
      - ...
    - ...
Paired images
In this case, images are paired, for example, to represent multimodal moving and fixed image pairs to register. Data are assumed to be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels directories.
Nifti Case 3-1 Images only
We only have paired images without any labels. So all data should be stored under train/moving_images and train/fixed_images, and the images corresponding to the same subject should have exactly the same name, e.g.:
- train
  - moving_images
    - subject1.nii.gz
    - subject2.nii.gz
    - ...
  - fixed_images
    - subject1.nii.gz
    - subject2.nii.gz
    - ...
(It is ok if the data are further grouped into different directories under train/moving_images and train/fixed_images as we will directly scan all nifti files under them.)
Nifti Case 3-2 Images with labels
We have both images and labels. So all data should be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels. The images and labels corresponding to the same subjects/groups should have exactly the same names, e.g.:
- train
  - moving_images
    - subject1.nii.gz
    - subject2.nii.gz
    - ...
  - fixed_images
    - subject1.nii.gz
    - subject2.nii.gz
    - ...
  - moving_labels
    - subject1.nii.gz
    - subject2.nii.gz
    - ...
  - fixed_labels
    - subject1.nii.gz
    - subject2.nii.gz
    - ...
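Because paired data rely on identical names across directories, it can help to check the layout before training. The sketch below is an illustration only: the function name missing_pairs is hypothetical, and it assumes that files are matched by their path relative to each directory, which is stricter than matching by file name alone.

```python
from pathlib import Path
from typing import Dict, List, Sequence, Set

PAIRED_DIRS = ("moving_images", "fixed_images", "moving_labels", "fixed_labels")


def missing_pairs(
    split_dir: str, dir_names: Sequence[str] = PAIRED_DIRS, suffix: str = ".nii.gz"
) -> Dict[str, List[str]]:
    """For each directory, list the files present elsewhere but missing from it."""
    found: Dict[str, Set[str]] = {}
    for name in dir_names:
        directory = Path(split_dir) / name
        if directory.is_dir():
            found[name] = {
                p.relative_to(directory).as_posix()
                for p in directory.rglob(f"*{suffix}")
            }
        else:
            found[name] = set()  # a missing directory contributes no files
    everything: Set[str] = set().union(*found.values())
    return {name: sorted(everything - paths) for name, paths in found.items()}
```

For the images-only case, the same check can be limited to two directories by passing dir_names=("moving_images", "fixed_images"). An empty list for every directory means the layout is consistent.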
H5 Data Format
Each .h5 file is similar to a dictionary, having multiple key-value pairs. Hierarchical multi-level h5 indexing is not used. Each value is either image or label.
Unpaired images
H5 Case 1-1 Images only
Each key corresponds to one image, e.g. {"subject1": data1, "subject2": data2, ...}. All data should be stored under train/images, in either a single h5 file or multiple h5 files.
H5 Case 1-2 Images with labels
Each key corresponds to one subject. Data can be stored in two h5 files (one for images and one for labels), and the keys in the two files should be the same, e.g.:
- train
  - images
    - data.h5 (keys = ["subject1", "subject2", ...])
  - labels
    - data.h5 (keys = ["subject1", "subject2", ...])
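The requirement that image and label files share the same keys can be checked with a small helper. This is a sketch only: compare_keys is a hypothetical name, and it operates on plain iterables of key names, which could for instance be obtained from the keys() of an opened h5py.File.

```python
from typing import Iterable, List, Tuple


def compare_keys(
    image_keys: Iterable[str], label_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (keys without a label, keys without an image), both sorted."""
    image_set, label_set = set(image_keys), set(label_keys)
    return sorted(image_set - label_set), sorted(label_set - image_set)
```

Two empty lists indicate that every image has a label stored under the same key, and vice versa.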
Grouped unpaired images
H5 Case 2-1 Images only
Similar to case 1-1 above, but the keys in this case have to share the same format, like subject%d-%d, where %d represents a number. For instance, subject3-2 corresponds to the second observation of subject 3. Otherwise, the file structure is the same as in case 1-1, e.g.:
- train
  - images
    - part1.h5 (keys = ["subject1-1", "subject1-2", "subject2-1", ...])
    - part2.h5
    - ...
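The key format can be parsed with a regular expression to recover the grouping. The sketch below assumes the literal prefix subject followed by two integers separated by a hyphen, as in the examples above; the names KEY_PATTERN and group_keys are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: keys look like "subject<subject id>-<observation id>".
KEY_PATTERN = re.compile(r"^subject(\d+)-(\d+)$")


def group_keys(keys: Iterable[str]) -> Dict[int, List[str]]:
    """Group keys by subject id, ordering each group by observation id."""
    groups: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for key in keys:
        match = KEY_PATTERN.match(key)
        if match is None:
            raise ValueError(f"key does not follow subject%d-%d: {key!r}")
        subject_id, observation_id = int(match.group(1)), int(match.group(2))
        groups[subject_id].append((observation_id, key))
    # Sort numerically so that observation 10 comes after observation 2.
    return {
        subject_id: [key for _, key in sorted(observations)]
        for subject_id, observations in sorted(groups.items())
    }
```

Parsing the numbers explicitly avoids the lexicographic ordering problem in which subject1-10 would otherwise sort before subject1-2.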
H5 Case 2-2 Images with labels
Similar to case 1-2 and 2-1 above, the keys have to share the same format like subject%d-%d and the keys for images and labels should be consistent.
- train
  - images
    - part1.h5 (keys = ["subject1-1", "subject1-2", ...])
    - part2.h5 (keys = ["subject101-1", "subject101-2", ...])
    - ...
  - labels
    - part1.h5 (keys = ["subject1-1", "subject1-2", ...])
    - part2.h5 (keys = ["subject101-1", "subject101-2", ...])
    - ...
Paired images
In this case, data are paired. Data are assumed to be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels directories.
H5 Case 3-1 Images only
We only have paired images without any labels. So all data should be stored under train/moving_images and train/fixed_images, and the keys corresponding to the same subject should be the same, e.g.:
- train
  - moving_images
    - part1.h5 (keys = ["subject1", "subject2", ...])
    - part2.h5
    - ...
  - fixed_images
    - part1.h5 (keys = ["subject1", "subject2", ...])
    - part2.h5
    - ...
H5 Case 3-2 Images with labels
We have both images and labels. So all data should be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels. The keys corresponding to the same subject should be the same, e.g.:
- train
  - moving_images
    - data.h5 (keys = ["subject1", "subject2", ...])
  - fixed_images
    - data.h5 (keys = ["subject1", "subject2", ...])
  - moving_labels
    - data.h5 (keys = ["subject1", "subject2", ...])
  - fixed_labels
    - data.h5 (keys = ["subject1", "subject2", ...])