jHDF - Pure Java HDF5 library

This project is a pure Java implementation for accessing HDF5 files. It is written from the file format specification and does not use any HDF Group code; it is not a wrapper around the C libraries. The file format specification is available from the HDF Group here, and more information on the format is available on Wikipedia. I presented a webinar about jHDF for the HDF Group, which is available on YouTube; the example code and slides can be found here.

The intention is to make a clean Java API to access HDF5 data. Currently, the project is targeting HDF5 read-only compatibility. For progress see the change log. Java 8, 11, 17 and 21 are officially supported.

Here is an example of reading a dataset with jHDF (see ReadDataset.java)

try (HdfFile hdfFile = new HdfFile(Paths.get("/path/to/file.hdf5"))) {
	Dataset dataset = hdfFile.getDatasetByPath("/path/to/dataset");
	// data will be a Java array with the dimensions of the HDF5 dataset
	Object data = dataset.getData();
}

For an example of traversing the tree inside an HDF5 file see PrintTree.java. For more examples see the package io.jhdf.examples.
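
Here is a rough sketch of such a traversal (a minimal sketch only: the file path is a placeholder, and it assumes the open HdfFile can be used directly as the root group, as PrintTree.java does). A Group can be iterated to visit its children, and sub-groups can be recursed into:

import io.jhdf.HdfFile;
import io.jhdf.api.Group;
import io.jhdf.api.Node;
import java.nio.file.Paths;

public class PrintTreeSketch {
	public static void main(String[] args) {
		try (HdfFile hdfFile = new HdfFile(Paths.get("/path/to/file.hdf5"))) {
			// The open file acts as the root group
			printGroup(hdfFile);
		}
	}

	private static void printGroup(Group group) {
		for (Node node : group) { // iterate the group's children
			System.out.println(node.getPath());
			if (node instanceof Group) {
				printGroup((Group) node); // recurse into sub-groups
			}
		}
	}
}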

Why should I use jHDF?

  • Easy integration with JVM based projects. The library is available on Maven Central, and GitHub Packages, so using it should be as easy as adding any other dependency. To use the libraries supplied by the HDF Group you need to load native code, which means you need to handle this in your build, and it complicates distribution of your software on multiple platforms.
  • The API design intends to be familiar to Java programmers, so hopefully it works as you might expect. (If this is not the case, open an issue with suggestions for improvement)
  • No use of JNI, so you avoid all the issues associated with calling native code from the JVM.
  • Fully debuggable: you can step through the library with a Java debugger.
  • Provides access to dataset ByteBuffers, to allow custom reading logic or integration with other libraries (see the sketch after this list).
  • Integration with Java logging via SLF4J
  • Performance? Maybe. The library uses Java NIO MappedByteBuffers, which should provide fast file access. In addition, when accessing chunked datasets the library is parallelized to take advantage of modern CPUs, and jHDF also allows parallel reading of multiple datasets or multiple files. I have seen cases where jHDF is significantly faster than the C libraries, but as with all performance questions it is case specific, so you will need to run your own tests on the cases you care about. If you do run tests, please post the results so everyone can benefit.
  • Security - jHDF is pure Java and therefore benefits from the memory safety provided by the JVM. The HDF Group's library is written in non-memory-safe languages and is therefore susceptible to memory-related security bugs.
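
A minimal sketch of the raw-buffer route mentioned in the list above (the paths are placeholders):

try (HdfFile hdfFile = new HdfFile(Paths.get("/path/to/file.hdf5"))) {
	Dataset dataset = hdfFile.getDatasetByPath("/path/to/dataset");
	// The raw bytes backing the dataset, for custom decoding or hand-off to another library
	ByteBuffer buffer = dataset.getDataBuffer();
}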

Why should I not use jHDF?

  • If you want to write HDF5 files. Currently this is not supported; full read-only compatibility is the current goal, and write support will come later. If you would be interested in this please comment on, or react to, issue #354.
  • If jHDF does not yet support a feature you need. If this is the case you should receive an UnsupportedHdfException; open an issue and support can be added. For scheduling, the features which will allow the most files to be read are prioritized. If you really want a new feature, feel free to work on it and open a PR; any help is much appreciated.
  • If you want to read slices of chunked datasets (slicing of contiguous datasets has been supported since v0.6.6). This is an excellent feature of HDF5, and one reason why it is well suited to large datasets. Support will be added in the future, but it is not currently possible. If you would be interested in this please comment on, or react to, issue #52.
  • If you want to read datasets larger than can fit in a Java array (i.e. Integer.MAX_VALUE elements). This issue would also be addressed by slicing.

Why did I start jHDF?

Mostly it's a challenge: HDF5 is a fairly complex file format with lots of flexibility, and writing a library to access it is interesting. Also, as a widely used file format for storing scientific, engineering, and commercial data, it seems like a good idea to be able to read HDF5 files with more than one library. In particular, JVM languages are among the most widely used, so having a native HDF5 implementation seems useful.

Developing jHDF

  • Fork this repository and clone your fork
  • Inside the jhdf directory run ./gradlew build (./gradlew.bat build on Windows); this will fetch dependencies and run the build and tests.
  • Import the Gradle project jhdf into your IDE.
  • Make your changes and add tests.
  • Run ./gradlew check to run the build and tests.
  • Once you have made any changes please open a pull request.

To see other available Gradle tasks run ./gradlew tasks

If you have read this far please consider starring this repo. If you are using jHDF in a commercial product please consider making a donation. Thanks!

jhdf's People

Contributors

alexa-corista, bradh, dependabot[bot], github-actions[bot], jamesmudd, jczogalla, obermeier, peridox

jhdf's Issues

Error Reading Enum Data Type

Describe the bug
When loading Canada_Population.h5
io.jhdf.examples.ReadDataset.main(new String[] { "/eclipse-workspace/jhdf5/dataset/Canada_Population.h5", "/Record/Data/data" });

I get an error:

Exception in thread "main" io.jhdf.exceptions.HdfException: Failed to load children of group '/Record/' at address '800'
at io.jhdf.GroupImpl.getChild(GroupImpl.java:271)
at io.jhdf.GroupImpl.getByPath(GroupImpl.java:281)
at io.jhdf.GroupImpl.getByPath(GroupImpl.java:287)
at io.jhdf.GroupImpl.getDatasetByPath(GroupImpl.java:297)
at io.jhdf.HdfFile.getDatasetByPath(HdfFile.java:239)
at io.jhdf.examples.ReadDataset.main(ReadDataset.java:28)
at ReadDatasetTest.main(ReadDatasetTest.java:7)
Caused by: io.jhdf.exceptions.HdfException: Failed to read '/Record/Properties'
at io.jhdf.GroupImpl$ChildrenLazyInitializer.createOldStyleGroup(GroupImpl.java:135)
at io.jhdf.GroupImpl$ChildrenLazyInitializer.initialize(GroupImpl.java:57)
at io.jhdf.GroupImpl$ChildrenLazyInitializer.initialize(GroupImpl.java:1)
at org.apache.commons.lang3.concurrent.LazyInitializer.get(LazyInitializer.java:102)
at io.jhdf.GroupImpl.getChild(GroupImpl.java:269)
... 6 more
Caused by: io.jhdf.exceptions.HdfException: Failed to read object header at address: 20896
at io.jhdf.ObjectHeader$ObjectHeaderV1.<init>(ObjectHeader.java:115)
at io.jhdf.ObjectHeader$ObjectHeaderV1.<init>(ObjectHeader.java:82)
at io.jhdf.ObjectHeader.readObjectHeader(ObjectHeader.java:338)
at io.jhdf.GroupImpl$ChildrenLazyInitializer.createOldStyleGroup(GroupImpl.java:132)
... 10 more
Caused by: io.jhdf.exceptions.UnsupportedHdfException: Enumerated data type is not yet supported
at io.jhdf.object.datatype.DataType.readDataType(DataType.java:63)
at io.jhdf.object.datatype.CompoundDataType.<init>(CompoundDataType.java:67)
at io.jhdf.object.datatype.DataType.readDataType(DataType.java:59)
at io.jhdf.object.message.DataTypeMessage.<init>(DataTypeMessage.java:24)
at io.jhdf.object.message.Message.readMessage(Message.java:99)
at io.jhdf.object.message.Message.readObjectHeaderV1Message(Message.java:54)
at io.jhdf.ObjectHeader$ObjectHeaderV1.readMessages(ObjectHeader.java:121)
at io.jhdf.ObjectHeader$ObjectHeaderV1.<init>(ObjectHeader.java:110)
... 13 more
To Reproduce
Run: io.jhdf.examples.ReadDataset.main(new String[] { "/eclipse-workspace/jhdf5/dataset/Canada_Population.h5", "/Record/Data/data" });
Canada_Population.h5.zip

Expected behaviour
Show the data in the Object data = dataset.getData();

Please complete the following information:

  • jhdf version: v0.5.0
  • Java version: jdk1.8.0_144
  • OS (Windows, Mac, Linux): Mac
  • Stack trace: see up top

Possible Collaboration

Interesting work @jamesmudd! We've been working on a pure Java HDF5 read implementation for quite a few years now, for many of the reasons you describe in the README (https://github.com/jamesmudd/jhdf#why-did-i-start-jhdf), especially the point that "as a widely used file format for storing scientific, engineering, and commercial data, it would seem like a good idea to be able to read HDF5 files with more than one library". It is a challenge, for sure! The goal of our reader is to read HDF5 into our Common Data Model, so likely not all of the code is applicable to the work here, but there might be some interesting parts in there that could be useful.

Currently we only handle Superblock 2 and lower features, but we are looking to start adding Superblock 3 features. Anyway, I wanted to at least reach out, say "hi!", and offer to help when/where I can. If you are interested in what we have done, it can be found in the ucar.nc2.internal.iosp.hdf5 package.

Add support for getting flat data

Add an extra method Dataset#getDataFlat() which would always return a 1D array.

This would be helpful for integration with other n-dimensional Java libraries which can accept data in this form, and it would reduce the need to reshape into an nD array only to unroll back to a 1D array.
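
Until such a method exists, callers can flatten the result of getData() themselves. A minimal sketch, assuming purely for illustration a rectangular 2D double dataset:

// Flatten a nested double[][] returned by Dataset#getData() into a 1D array
double[][] data2d = (double[][]) dataset.getData();
int rows = data2d.length;
int cols = rows == 0 ? 0 : data2d[0].length;
double[] flat = new double[rows * cols];
for (int i = 0; i < rows; i++) {
	System.arraycopy(data2d[i], 0, flat, i * cols, cols); // copy each row to its offset
}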

Add support for bit field datasets

There are some example files available at https://zenodo.org/record/1310934; currently they fail to open with jHDF because they contain bitfield attributes.

The error is

Caused by: io.jhdf.exceptions.HdfException: Unrecognized data class = 4
	at io.jhdf.object.datatype.DataType.readDataType(DataType.java:55)
	at io.jhdf.object.message.AttributeMessage.<init>(AttributeMessage.java:55)

Global heap does not support reuse of elements

Describe the bug
When referring to a global heap index more than once in one dataset, the second reference becomes the empty string. This is because the global heap object keeps a byte buffer as its data element, and after reading it once its position is at the limit, leading to an empty string the second time around.

To Reproduce
Use the attached file: var-length-strings-reused.zip
HDFView shows that values are reused multiple times and there are no empty strings. With jhdf, each value is present only once and the rest of the values are empty.

Expected behaviour
The output from jhdf should match the output from HDFView and resolve all global heap references accordingly.

Please complete the following information:

  • jhdf version: 0.4.8
  • Java version: 1.8
  • Stack trace/problem site: VariableLengthDatasetReader, l. 59 ff

Additional context
We see four possible fixes (a minimal illustration of option 1 follows this list):

  1. Reset the byte buffer after decoding. Simplest, but also slow (multiple decodings).
  2. Store the data as a byte array instead of a buffer. Still has the limitations of 1).
  3. Decode to a String directly when reading the global heap, since the type/charset does not change within one dataset.
  4. Lazy decoding: decode once, and store the decoded value in the object, getting rid of the byte buffer in the process. Needs more logic in the VariableLengthDatasetReader, but keeps the decoding to a minimum.
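
The following fragment (plain java.nio code, not jHDF code) illustrates the mechanism behind option 1: decoding leaves the buffer position at the limit, so without a rewind a second decode of the same buffer yields an empty string.

ByteBuffer element = ByteBuffer.wrap("hello".getBytes(StandardCharsets.US_ASCII));
String first = StandardCharsets.US_ASCII.decode(element).toString();  // "hello", position is now at the limit
element.rewind();                                                      // without this the next decode sees no remaining bytes
String second = StandardCharsets.US_ASCII.decode(element).toString(); // "hello" again instead of ""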

Exception when reading attribute data

I have this code to read an attribute of an HDF5:

HdfFile(file).use {
    val config = it.getAttribute("model_config").data as String
    // use config...
}

jhdf is able to read the attribute (i.e. it.getAttribute("model_config")) but reading its data (i.e. .data) throws this exception: https://pastebin.com/i6U9mc1R. Maybe it's worth noting that other tools (e.g. Netron) can load the file without errors, so I don't think there is a problem with the file.

I have attached the HDF5 file to this issue. I have also attached the log file. I am using jhdf version 0.4.6.
model2.zip
axon_20190915110753.log

Create HdfFile object from InputStream without creating temp file

I have HDF5 files from different sources (e.g. a website, S3, etc.). I have a Java InputStream they are read into (I could easily make a byte array or similar from them), but I want to create an HdfFile object from that InputStream. I don't want to write out a local file for them.

I see fromInputStream is available, but that writes out a temp file which hangs around until the end of the application. I have a long-running application that will be up for months at a time, so I don't want the disk to fill up from reading many files.

The feature request is the ability to create an HdfFile object without creating a temp/local file.
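
For reference, the existing route described above looks roughly like this (a sketch only, assuming the factory lives on HdfFile as the issue suggests; openSourceStream() is a hypothetical helper standing in for however the stream is obtained, e.g. from S3 or an HTTP client):

try (InputStream in = openSourceStream(); // hypothetical helper, not part of jHDF
	 HdfFile hdfFile = HdfFile.fromInputStream(in)) {
	// fromInputStream currently copies the stream to a temporary file and opens that
	Dataset dataset = hdfFile.getDatasetByPath("/path/to/dataset");
	Object data = dataset.getData();
}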

Add easier way to dereference references

Discussed in #311

Probably add a method to HdfFile like getNodeByAddress(long address) which would access the node directly without using the file structure. Then once you have a reference you can quickly dereference it.

Add support for reference data type

Hi, I have a MATLAB v7.3 file, which is HDF5 based, but it uses the reference data type, which is currently not available in jhdf. Can this be added? I can provide the test files.

Thanks

Support for nested compound datasets

Describe the bug
jHDF fails to read nested compound datasets.

To Reproduce
Attached is an HDF5 file called master.ph5 with the following structure:

    Experiment_g
        Experiment_t
    Maps_g
        Das_g_1001X1077
        Das_g_1001X1040
        Das_g_1001X1021
        … etc.
    Receivers_g
        Das_g_1001X1389
        Das_g_1001X1398
        Das_g_1001X88
        … etc.
    Reports_g
        Report_t
    Responses_g
        NODE_1C_RESP
        Response_t
    Sorts_g
        Array_t_1001
        Sort_t

When reading the data from the /Experiment_g/Experiment_t dataset using the following program, I get this stack trace:

    import io.jhdf.HdfFile;
    import io.jhdf.api.Dataset;
    import java.io.File;


    public class ph5API {
        public static void main(String[] args) {
            String path = "master.ph5";
            File f = new File(path);
            try (HdfFile hdfFile = new HdfFile(f)) {
                Dataset dataset = hdfFile.getDatasetByPath("/Experiment_g/Experiment_t");
                // data will be a Java array with the dimensions of the HDF5 dataset
                Object data = dataset.getData();
            }
        }
    }

Stack trace:

Exception in thread "main" io.jhdf.exceptions.HdfException: DatasetReader was passed a type it cant fill. Type: io.jhdf.object.datatype.CompoundDataType
	at io.jhdf.dataset.DatasetReader.readDataset(DatasetReader.java:165)
	at io.jhdf.dataset.CompoundDatasetReader.readDataset(CompoundDatasetReader.java:48)
	at io.jhdf.dataset.DatasetBase.getData(DatasetBase.java:133)
	at edu.iris.ph5api.ph5API.main(ph5API.java:28)

System info:

  • jhdf version 0.5.4
  • Java 1.8
  • macOS Catalina

Additional context:
Existing branch containing fix at https://github.com/jamesmudd/jhdf/tree/nested-compound

Related HDF5 file with nested compound datasets:
master.ph5.zip

Typo in Dataset.isVariableLentgh()

Describe the bug
Typo in Dataset.isVariableLentgh()

To Reproduce
N/A

Expected behaviour
isVariableLength()

Please complete the following information:
N/A

Additional context

Performance of chunked dataset reads is poor

Currently when you read large chunked datasets the performance is bad. This is because the output array is filled element by element, resulting in lots of chunk lookups and index calculations.

This could be improved very significantly if, when a chunk is accessed, all its available data were copied to the output array (see the sketch below).
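
A toy illustration of the intended improvement (the chunk layout is invented purely for the demo): copy each chunk's contiguous run into the output in one call rather than element by element.

// Bulk-copy each chunk into the output array instead of per-element lookups
int chunkSize = 4;
int[] output = new int[12];
int[][] chunks = { {0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11} };
for (int c = 0; c < chunks.length; c++) {
	System.arraycopy(chunks[c], 0, output, c * chunkSize, chunkSize); // one copy per chunk
}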

Error reading dataset with null terminated strings

Describe the bug
I'm encountering an issue when trying to read the following dataset from the attached HDF5 file (master.ph5).

Dataset ds = hdfFile.getDatasetByPath("/Experiment_g/Sorts_g/Array_t_001");
System.out.println((LinkedHashMap) ds.getData());

results in the following stacktrace:

Exception in thread "main" java.lang.IndexOutOfBoundsException
	at java.nio.Buffer.checkIndex(Buffer.java:540)
	at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
	at io.jhdf.object.datatype.StringData$NullTerminated.setBufferLimit(StringData.java:101)
	at io.jhdf.dataset.DatasetReader.fillFixedLengthStringData(DatasetReader.java:347)
	at io.jhdf.dataset.DatasetReader.readDataset(DatasetReader.java:140)
	at io.jhdf.dataset.CompoundDatasetReader.readDataset(CompoundDatasetReader.java:50)
	at io.jhdf.dataset.DatasetReader.readDataset(DatasetReader.java:168)
	at io.jhdf.dataset.DatasetBase.getData(DatasetBase.java:132)

System info:

  • jhdf version 0.5.5 (Developmental)
  • Java 1.8
  • macOS Catalina

Additional context:
Originally referenced at #157 (comment)

Related HDF5 file with nested compound datasets:
master.ph5.zip

Reading fixed length string dataset ignores charset

Describe the bug/missing feature
When reading a dataset of type fixed length string (i.e. StringData), the charset is ignored and the bytes are always decoded as US_ASCII (DatasetReader l. 336). It seems that HDFView has the same problem.

To Reproduce
Use the attached file: utf8-fixed-length.zip
HDFView and jhdf will both show broken characters instead of umlauts.

Expected behaviour
DatasetReader should take the string type's charset into account.

Please complete the following information:

  • jhdf version: 0.4.8
  • Java version: 1.8
  • Stack trace/problem site: DatasetReader l. 336

Additional context
StringData knows its charset, and the call to the private method fillFixedLengthStringData (DatasetReader l. 134) could use it as a parameter.

Error running on Java 8

When running on Java 8 the following error is received

Exception in thread "main" java.lang.NoSuchMethodError: java.nio.ByteBuffer.rewind()Ljava/nio/ByteBuffer;
	at io.jhdf.Superblock.readSuperblock(Superblock.java:83)
	at io.jhdf.HdfFile.<init>(HdfFile.java:78)
	at com.company.PrintTree.main(PrintTree.java:33)

This occurs because the jar was compiled using Java 11, where ByteBuffer.rewind() returns ByteBuffer rather than Buffer, so the compiled call site does not exist on a Java 8 runtime.

Infinite Loop in DeflatePipelineFilter

There is a problem in DeflatePipelineFilter.java:
while (! inflater.finished()) {
int read = inflater.inflate(buffer);
...

This loop will never exit in cases where read == 0. When I try to load a particular compressed HDF5 file, I quickly run into this case, and in a debugger I can see that inflater.needsInput() == true. Does this mean that compressedData is not self-contained, and we need bytes from the next block? Does this mean that the whole scheme is broken where Inflater is only scoped to decode a single Chunk? Or maybe it's just that it really is finished, but the native code never set finished=true?

I'm going to try simply exiting the loop when read == 0 and inflater.needsInput() to see what effect that has, but I suspect it's just incorrect to assume that you can decompress a chunk at a time. It may be that the Inflater needs to be allocated once and used for the whole DataSet, so that it can resume decompression using leftovers from the previous call as it processes Chunks.
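
As a sketch only (not the jHDF implementation), a loop along these lines cannot spin forever, because it bails out when the Inflater makes no progress and is asking for input that is not available; handling of DataFormatException is omitted:

Inflater inflater = new Inflater();
inflater.setInput(compressedData);
byte[] buffer = new byte[4096];
ByteArrayOutputStream out = new ByteArrayOutputStream();
while (!inflater.finished()) {
	int read = inflater.inflate(buffer);
	if (read == 0 && inflater.needsInput()) {
		break; // no more input available for this chunk; stop instead of looping forever
	}
	out.write(buffer, 0, read);
}
inflater.end();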

Steps to reproduce the behaviour:

  • This is problematic. I'm not sure I can find a file that shows the problem that I'm allowed to share. I do have one file that consistently fails, but it's 38MB compressed. I'll verify next week whether I can share it, or maybe we'll find a smaller, more generic sample that shows the issue.

Please complete the following information:

  • jhdf 0.6.3
  • AdoptOpenJDK 8u265
  • Windows

Cleanup temp files when file is closed

If you read an HdfFile from an InputStream, a temp file is created. Currently this is only cleaned up when the JVM exits; an improvement would be to also clean it up when the HdfFile is closed.

Constant in GroupInfoMessage is wrong

Describe the bug
The Constant GroupInfoMessage#ESTIMATED_ENTRY_INFORMATION_PRESENT probably should be 1 instead of 0. ;-)

Please complete the following information:

  • jhdf version 0.5.0

Only v3 and v4 data layout messages are supported. Detected version = 1

I got io.jhdf.exceptions.UnsupportedHdfException: Only v3 and v4 data layout messages are supported. Detected version = 1 while working with a *.he5 file.
I tried to iterate the children of the group /HDFEOS/GRIDS/OMI Column Amount O3/Data Fields/ (the file will be attached) with the code below.

The code is in Kotlin, but I guess it's clear and easy to understand.
The path of the Node is /HDFEOS/GRIDS/OMI Column Amount O3/Data Fields/

     // check if `Node` is `Group`
     if (node !is Group) return
     
    // iteration operation of the group
    // it - is a child element of group
    node.forEach {
        logDepth(it, depth + 1, maxDepth, prefix)
    }

To Reproduce

Steps to reproduce the behaviour:

  • HDF file is here
  • Try to get access to data under path /HDFEOS/GRIDS/OMI Column Amount O3/Data Fields/

Expected behaviour
It should work.

Following information

  • jhdf: 0.5.7
  • Java version: 1.8.0_152
  • KotlinJVM: 1.3.72
  • OS: MacOs
    Stack trace + logging:
Size: 36116430 [main] INFO io.jhdf.HdfFile  - jHDF version: 0.5.7
3 [main] INFO io.jhdf.HdfFile  - Opening HDF5 file '/Users/alexnuts/repositories/ozon_nasa_data_parser/src/main/resources/OMI-Aura_L3-OMTO3e_2020m0827_v883-2020m0827t205457.he5'...
18 [main] DEBUG io.jhdf.HdfFile  - Found valid signature at offset = 0
18 [main] DEBUG io.jhdf.Superblock  - Version of superblock is = 0
29 [main] DEBUG io.jhdf.ObjectHeader  - Creating lazy object header at address: 928
31 [main] DEBUG io.jhdf.GroupImpl  - Created root group of file 'OMI-Aura_L3-OMTO3e_2020m0827_v883-2020m0827t205457.he5'
31 [main] INFO io.jhdf.HdfFile  - Opened HDF5 file '/Users/alexnuts/repositories/ozon_nasa_data_parser/src/main/resources/OMI-Aura_L3-OMTO3e_2020m0827_v883-2020m0827t205457.he5'
57 [main] INFO io.jhdf.GroupImpl  - Lazy loading children of '//'
57 [main] DEBUG io.jhdf.ObjectHeader  - Lazy initializing object header at address: 928
65 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.SymbolTableMessage@1cf4f579
65 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.NilMessage@18769467
65 [main] DEBUG io.jhdf.ObjectHeader  - Read object header from address: 944
126 [main] DEBUG io.jhdf.GroupImpl  - Loading 'old' style group
130 [main] DEBUG io.jhdf.ObjectHeader  - Creating lazy object header at address: 1576
130 [main] DEBUG io.jhdf.GroupImpl  - Created group '/HDFEOS/'
130 [main] DEBUG io.jhdf.ObjectHeader  - Creating lazy object header at address: 4504
130 [main] DEBUG io.jhdf.GroupImpl  - Created group '/HDFEOS INFORMATION/'
--GroupImpl::HDFEOS
132 [main] INFO io.jhdf.GroupImpl  - Lazy loading children of '/HDFEOS/'
132 [main] DEBUG io.jhdf.ObjectHeader  - Lazy initializing object header at address: 1576
132 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.SymbolTableMessage@16f65612
132 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.NilMessage@311d617d
132 [main] DEBUG io.jhdf.ObjectHeader  - Read object header from address: 1592
132 [main] DEBUG io.jhdf.GroupImpl  - Loading 'old' style group
132 [main] DEBUG io.jhdf.ObjectHeader  - Creating lazy object header at address: 2552
132 [main] DEBUG io.jhdf.GroupImpl  - Created group '/HDFEOS/ADDITIONAL/'
132 [main] DEBUG io.jhdf.ObjectHeader  - Creating lazy object header at address: 5752
132 [main] DEBUG io.jhdf.GroupImpl  - Created group '/HDFEOS/GRIDS/'
----GroupImpl::ADDITIONAL
132 [main] INFO io.jhdf.GroupImpl  - Lazy loading children of '/HDFEOS/ADDITIONAL/'
132 [main] DEBUG io.jhdf.ObjectHeader  - Lazy initializing object header at address: 2552
132 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.SymbolTableMessage@7c53a9eb
132 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.NilMessage@ed17bee
132 [main] DEBUG io.jhdf.ObjectHeader  - Read object header from address: 2568
133 [main] DEBUG io.jhdf.GroupImpl  - Loading 'old' style group
133 [main] DEBUG io.jhdf.ObjectHeader  - Creating lazy object header at address: 3528
133 [main] DEBUG io.jhdf.GroupImpl  - Created group '/HDFEOS/ADDITIONAL/FILE_ATTRIBUTES/'
------GroupImpl::FILE_ATTRIBUTES
133 [main] INFO io.jhdf.GroupImpl  - Lazy loading children of '/HDFEOS/ADDITIONAL/FILE_ATTRIBUTES/'
133 [main] DEBUG io.jhdf.ObjectHeader  - Lazy initializing object header at address: 3528
133 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.ObjectHeaderContinuationMessage@2a33fae0
133 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.SymbolTableMessage@707f7052
141 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: OrbitNumber
141 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=OrbitNumber, dataType=io.jhdf.object.datatype.FixedPoint@43556938, dataSpace=io.jhdf.object.message.DataSpace@3d04a311]
141 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: OrbitPeriod
141 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=OrbitPeriod, dataType=io.jhdf.object.datatype.FloatingPoint@7a46a697, dataSpace=io.jhdf.object.message.DataSpace@5f205aa]
143 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: InstrumentName
143 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=InstrumentName, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@47089e5f]
143 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: ProcessLevel
143 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=ProcessLevel, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@4141d797]
143 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GranuleMonth
143 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GranuleMonth, dataType=io.jhdf.object.datatype.FixedPoint@68f7aae2, dataSpace=io.jhdf.object.message.DataSpace@4f47d241]
143 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GranuleDay
143 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GranuleDay, dataType=io.jhdf.object.datatype.FixedPoint@4c3e4790, dataSpace=io.jhdf.object.message.DataSpace@38cccef]
143 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GranuleYear
143 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GranuleYear, dataType=io.jhdf.object.datatype.FixedPoint@5679c6c6, dataSpace=io.jhdf.object.message.DataSpace@27ddd392]
144 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GranuleDayOfYear
144 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GranuleDayOfYear, dataType=io.jhdf.object.datatype.FixedPoint@19e1023e, dataSpace=io.jhdf.object.message.DataSpace@7cef4e59]
144 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: TAI93At0zOfGranule
144 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=TAI93At0zOfGranule, dataType=io.jhdf.object.datatype.FloatingPoint@64b8f8f4, dataSpace=io.jhdf.object.message.DataSpace@2db0f6b2]
144 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: PGEVersion
144 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=PGEVersion, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@3cd1f1c8]
144 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: StartUTC
144 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=StartUTC, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@3a4afd8d]
144 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: EndUTC
144 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=EndUTC, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@1996cd68]
145 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: Period
145 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=Period, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@3339ad8e]
145 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.NilMessage@555590
145 [main] DEBUG io.jhdf.ObjectHeader  - Read object header from address: 3544
145 [main] DEBUG io.jhdf.GroupImpl  - Loading 'old' style group
----GroupImpl::GRIDS
145 [main] INFO io.jhdf.GroupImpl  - Lazy loading children of '/HDFEOS/GRIDS/'
145 [main] DEBUG io.jhdf.ObjectHeader  - Lazy initializing object header at address: 5752
145 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.SymbolTableMessage@6d1e7682
145 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.NilMessage@424c0bc4
145 [main] DEBUG io.jhdf.ObjectHeader  - Read object header from address: 5768
146 [main] DEBUG io.jhdf.GroupImpl  - Loading 'old' style group
146 [main] DEBUG io.jhdf.ObjectHeader  - Creating lazy object header at address: 5856
146 [main] DEBUG io.jhdf.GroupImpl  - Created group '/HDFEOS/GRIDS/OMI Column Amount O3/'
------GroupImpl::OMI Column Amount O3
146 [main] INFO io.jhdf.GroupImpl  - Lazy loading children of '/HDFEOS/GRIDS/OMI Column Amount O3/'
146 [main] DEBUG io.jhdf.ObjectHeader  - Lazy initializing object header at address: 5856
146 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.ObjectHeaderContinuationMessage@3c679bde
146 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.SymbolTableMessage@16b4a017
146 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GCTPProjectionCode
146 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GCTPProjectionCode, dataType=io.jhdf.object.datatype.FixedPoint@8807e25, dataSpace=io.jhdf.object.message.DataSpace@2a3046da]
147 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: Projection
147 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=Projection, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@2a098129]
147 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GridOrigin
147 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GridOrigin, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@198e2867]
147 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GridSpacing
147 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GridSpacing, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@12f40c25]
147 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GridSpacingUnit
147 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GridSpacingUnit, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@3ada9e37]
147 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GridSpan
147 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GridSpan, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@5cbc508c]
148 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: GridSpanUnit
148 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=GridSpanUnit, dataType=StringData{paddingType=NULL_TERMINATED, charset=US-ASCII}, dataSpace=io.jhdf.object.message.DataSpace@3419866c]
148 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: NumberOfLongitudesInGrid
148 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=NumberOfLongitudesInGrid, dataType=io.jhdf.object.datatype.FixedPoint@63e31ee, dataSpace=io.jhdf.object.message.DataSpace@68fb2c38]
148 [main] DEBUG io.jhdf.object.message.AttributeMessage  - Read attribute: NumberOfLatitudesInGrid
148 [main] DEBUG io.jhdf.object.message.Message  - Read message: AttributeMessage [name=NumberOfLatitudesInGrid, dataType=io.jhdf.object.datatype.FixedPoint@567d299b, dataSpace=io.jhdf.object.message.DataSpace@2eafffde]
148 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.NilMessage@59690aa4
148 [main] DEBUG io.jhdf.ObjectHeader  - Read object header from address: 5872
148 [main] DEBUG io.jhdf.GroupImpl  - Loading 'old' style group
148 [main] DEBUG io.jhdf.ObjectHeader  - Creating lazy object header at address: 5960
148 [main] DEBUG io.jhdf.GroupImpl  - Created group '/HDFEOS/GRIDS/OMI Column Amount O3/Data Fields/'
--------GroupImpl::Data Fields
148 [main] INFO io.jhdf.GroupImpl  - Lazy loading children of '/HDFEOS/GRIDS/OMI Column Amount O3/Data Fields/'
148 [main] DEBUG io.jhdf.ObjectHeader  - Lazy initializing object header at address: 5960
149 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.SymbolTableMessage@6842775d
149 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.NilMessage@574caa3f
149 [main] DEBUG io.jhdf.ObjectHeader  - Read object header from address: 5976
149 [main] DEBUG io.jhdf.GroupImpl  - Loading 'old' style group
149 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.FillValueMessage@64cee07
149 [main] WARN io.jhdf.object.message.Message  - After reading message (FillValueMessage) buffer still has 12 bytes remaining
149 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.FillValueOldMessage@1761e840
149 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.DataTypeMessage@6c629d6e
149 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.DataSpaceMessage@5ecddf8f
150 [main] DEBUG io.jhdf.object.message.Message  - Read message: io.jhdf.object.message.FilterPipelineMessage@5f5a92bb
Exception in thread "main" io.jhdf.exceptions.HdfException: Failed to load children for group '/HDFEOS/GRIDS/OMI Column Amount O3/Data Fields/' at address '5960'
	at io.jhdf.GroupImpl.getChildren(GroupImpl.java:243)
	at io.jhdf.GroupImpl.iterator(GroupImpl.java:264)
	at com.alexnuts.nasa.ozon.RunKt.logDepth(Run.kt:332)
	at com.alexnuts.nasa.ozon.RunKt.logDepth$default(Run.kt:265)
	at com.alexnuts.nasa.ozon.RunKt.logDepth(Run.kt:283)
	at com.alexnuts.nasa.ozon.RunKt.logDepth$default(Run.kt:265)
	at com.alexnuts.nasa.ozon.RunKt.logDepth(Run.kt:283)
	at com.alexnuts.nasa.ozon.RunKt.logDepth$default(Run.kt:265)
	at com.alexnuts.nasa.ozon.RunKt.logDepth(Run.kt:283)
	at com.alexnuts.nasa.ozon.RunKt.logDepth$default(Run.kt:265)
	at com.alexnuts.nasa.ozon.RunKt.logDepth(Run.kt:283)
	at com.alexnuts.nasa.ozon.RunKt.logDepth$default(Run.kt:265)
	at com.alexnuts.nasa.ozon.Run.main(Run.kt:33)
Caused by: io.jhdf.exceptions.HdfException: Failed to read '/HDFEOS/GRIDS/OMI Column Amount O3/Data Fields/ColumnAmountO3'
	at io.jhdf.GroupImpl$ChildrenLazyInitializer.createOldStyleGroup(GroupImpl.java:136)
	at io.jhdf.GroupImpl$ChildrenLazyInitializer.initialize(GroupImpl.java:58)
	at io.jhdf.GroupImpl$ChildrenLazyInitializer.initialize(GroupImpl.java:43)
	at org.apache.commons.lang3.concurrent.LazyInitializer.get(LazyInitializer.java:106)
	at io.jhdf.GroupImpl.getChildren(GroupImpl.java:240)
	... 12 more
Caused by: io.jhdf.exceptions.HdfException: Failed to read object header at address: 43392
	at io.jhdf.ObjectHeader$ObjectHeaderV1.<init>(ObjectHeader.java:116)
	at io.jhdf.ObjectHeader$ObjectHeaderV1.<init>(ObjectHeader.java:76)
	at io.jhdf.ObjectHeader.readObjectHeader(ObjectHeader.java:339)
	at io.jhdf.GroupImpl$ChildrenLazyInitializer.createOldStyleGroup(GroupImpl.java:133)
	... 16 more
Caused by: io.jhdf.exceptions.UnsupportedHdfException: Only v3 and v4 data layout messages are supported. Detected version = 1
	at io.jhdf.object.message.DataLayoutMessage.createDataLayoutMessage(DataLayoutMessage.java:32)
	at io.jhdf.object.message.Message.readMessage(Message.java:107)
	at io.jhdf.object.message.Message.readObjectHeaderV1Message(Message.java:54)
	at io.jhdf.ObjectHeader$ObjectHeaderV1.readMessages(ObjectHeader.java:122)
	at io.jhdf.ObjectHeader$ObjectHeaderV1.<init>(ObjectHeader.java:111)
	... 19 more

Null or space padded fixed length strings of length 0 cannot be read

Describe the bug
When reading string data that contains empty strings with null or space padding, the sub-buffer's limit is set to -1, causing the reading to fail.

To Reproduce
Steps to reproduce the behaviour:

  • File demonstrating the issue:
    empty_string_error.zip
  • Reading the attribute "dictionary" from the dataset "a2" causes the error

Expected behaviour
Empty strings should be readable as well

Please complete the following information:

  • jhdf version 0.5.6
  • Java version 1.8
  • OS (Windows, Mac, Linux): Windows
  • Stack trace:

java.lang.IndexOutOfBoundsException
at java.nio.Buffer.checkIndex(Buffer.java:540)
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
at io.jhdf.object.datatype.StringData$NullPadded.setBufferLimit(StringData.java:114)
at io.jhdf.dataset.DatasetReader.fillFixedLengthStringData(DatasetReader.java:374)
at io.jhdf.dataset.DatasetReader.readDataset(DatasetReader.java:151)
at io.jhdf.AttributeImpl.getData(AttributeImpl.java:72)

Additional context
A solution would be to check for i >= 0 in the while loops in both NullPadded and SpacePadded (see the sketch below).
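
In outline, the proposed guard looks like this (a sketch of the idea rather than the actual jHDF code):

// Scan backwards over null padding; the added i >= 0 check means a zero-length
// string ends up with limit 0 instead of indexing below the buffer and throwing
ByteBuffer buffer = ByteBuffer.wrap(new byte[0]); // an empty, null-padded element
int i = buffer.limit() - 1;
while (i >= 0 && buffer.get(i) == 0) {
	i--;
}
buffer.limit(i + 1);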

Bitshuffle Filter

jHDF is currently missing the bitshuffle filter. When I try to open the primary dataset in the following HDF5 file
12_ID_B_Sagbe_2M_SAXS_Eiger9M.h5.zip
with compressed data, I get

Exception in thread "main" io.jhdf.exceptions.HdfFilterException: A required filter is not available: name='bitshuffle; see https://github.com/kiyo-masui/bitshuffle' id=32008

at io.jhdf.filter.FilterManager.getPipeline(FilterManager.java:90)

It would be great if this could be implemented.

Multidimensional fixed length string datasets only read the first "row"

Describe the bug
When reading a multidimensional dataset of type fixed length string, the same buffer position is read again and again.

To Reproduce
When opening the file multidim_string_datasest.zip, the result is three arrays with the content ["a1", "a2"]. This seems to be because DatasetReader#fillFixedLengthStringData is called recursively with the same buffer each time, manually setting the position to the i-th element.
HDFView can read this just fine.

Please complete the following information:

  • jhdf version: 0.5.0
  • Java version: 1.8

Additional context
One possible fix is to use sub-buffers for each recursive call.
The other would be to store the position before reading and set the expected position after the reading is done, maybe like so:

for (int i = 0; i < dims[0]; i++) {
  int pos = buffer.position();
  ByteBuffer elementBuffer = Utils.createSubBuffer(buffer, stringLength);
  stringPaddingHandler.setBufferLimit(elementBuffer);
  Array.set(data, i, charset.decode(elementBuffer).toString());
  buffer.position(pos + stringLength);
}

Should support lzf compression.

What is the suggestion?

Many .h5 files use lzf compression, which is not supported by the library, but easy to add.

How would it be an improvement?

This would allow opening files that require lzf compression.

Additional context

I have prepared a patch which provides the necessary support. It uses a pure Java LZF compression library, and can therefore easily be incorporated with a one-line change to the Gradle config, a one-line change to the filter manager, and the new filter class, which is also included.

0001-Adding-LZF-support-to-jhdf.patch.zip

Support for Multi dimension array data types

I don't quite know if this is a bug or a suggestion.

What is the suggestion?

I urgently need to read an HDF5 file that seems to require a currently unsupported feature. An HdfException: Multi dimension array data types are not supported is thrown. The error message does not mention which column of the compound dataset causes the problem, but I guess the array columns are the problem.

How would it be an improvement?

I cannot change the source format, so support for this feature is required for me to use jHDF in my code, and it would enrich the catalogue of supported features. If someone has an idea about the implementation, I would write the tests for my cases.

Additional context

  • Example: test_MultiDimensionArray.zip

  • Code: (Same for DATASET1 & DATASET2)

    HdfFile hdf = new HdfFile(PATH);
    Node node = hdf.getByPath("GROUP1/GROUP2/DATASET1");
    NodeType type = node.getType();
    assert (type == NodeType.DATASET);
    Dataset dataset = (Dataset)node;
    Map<String,Object> data = (Map<String,Object>)dataset.getData();
    
  • Log (thrown for the call of getData()):

    io.jhdf.exceptions.HdfException: Multi dimension array data types are not supported
      at io.jhdf.object.datatype.ArrayDataType.fillData(ArrayDataType.java:71)
      at io.jhdf.dataset.DatasetReader.readDataset(DatasetReader.java:64)
      at io.jhdf.dataset.CompoundDatasetReader.readDataset(CompoundDatasetReader.java:49)
      at io.jhdf.object.datatype.CompoundDataType.fillData(CompoundDataType.java:100)
      at io.jhdf.dataset.DatasetReader.readDataset(DatasetReader.java:64)
      at io.jhdf.dataset.DatasetBase.getData(DatasetBase.java:129)
    
  • Environment:

    • jhdf version: 0.6.4
    • Java version: 13
    • OS: Windows

Handle publishing after Bintray/JCenter close down

It was recently announced Bintray/JCenter will close 1st May 2021.

https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/?utm_source=mkto&utm_medium=email&utm_campaign=bintray-sunset&utm_content=global-02-2021

jHDF is currently published on bintray https://bintray.com/jamesmudd/jhdf/jhdf/ and the releases are typically pushed from there to Maven central.

The Gradle script needs to be updated to publish directly to Maven Central; as part of this work, publishing should be moved to GitHub Actions.

Add Dataset method to get the filter pipeline for display

It would be useful to add a method to Dataset to get the filter pipeline. As only chunked datasets can have filters, a "no filters" constant would be returned by other Dataset types.

Should be able to access the filter name, id and filter data.

HdfFile is not API

Describe the bug

I get

Access restriction: The type 'HdfFile' is not API (restriction on required library '~/.m2/repository/io/jhdf/jhdf/0.6.5/jhdf-0.6.5.jar')

To Reproduce

The first line of the README in fact:

try (HdfFile hdfFile = new HdfFile(file)) {

I used @eclipse-m2e to add it to the .target platform

	<location includeDependencyDepth="none" includeSource="true" missingManifest="generate" type="Maven">
		<dependencies>
			<dependency>
				<groupId>io.jhdf</groupId>
				<artifactId>jhdf</artifactId>
				<version>0.6.5</version>
				<type>jar</type>
			</dependency>
		</dependencies>
	</location>

Expected behavior

No access rule error.

Restructure to split out extra filters and their dependencies

jHDF now depends on compress-lzf and lz4-java to offer support for additional HDF5 filters. This is good when these filters are needed, but it makes the base jHDF dependencies larger for people who don't require the extra filters.

The idea would be to split out a separate Gradle project, jhdf-extra-filters, containing these dependencies and the required filter implementations and tests. It would produce a separate jar artifact. When this extra jar is present on the classpath the extra filters would be available; the service loading for this is already in place.

The benefit would be that people who don't require the extra filters are not forced to take the dependencies.

IllegalArgumentException: Argument is not an array for compound attribute

First of all, thanks for all your work.

Describe the bug

I am trying to read an HDF5 file group with a compound scalar attribute, but I get a java.lang.IllegalArgumentException when calling attribute.getData().

To Reproduce

Steps to reproduce the behaviour:

  • Example file: test.zip

  • Code:

    HdfFile hdf = new HdfFile(PATH);
    Node node = hdf.getByPath("GROUP");
    Attribute attribute = node.getAttribute("VERSION");
    System.out.println(attribute.getJavaType()); // gives "interface java.util.Map"
    Object data = attribute.getData();
    

Expected behaviour

I'd like the compound attribute data available for further processing.

Please complete the following information:

  • jhdf version: 0.6.4

  • Java version: 13

  • OS: Windows

  • Stack trace if available:

    java.lang.IllegalArgumentException: Argument is not an array
       at java.base/java.lang.reflect.Array.get(Native Method)
       at io.jhdf.dataset.DatasetReader.readDataset(DatasetReader.java:67)
       at io.jhdf.AttributeImpl.getData(AttributeImpl.java:73)
    

Add support for reading slices of data

It would be nice to be able to read subsets of datasets, and probably to offer iterators through datasets returning slices. The way to specify slices needs to be figured out.

Add support for variable length datasets

Describe the bug
Trying to load a Variable length dataset results in an exception.

To Reproduce
Steps to reproduce the behaviour:

  • Attached file Canada_Population.h5.zip
  • Read the /Record/Labels/Names dataset, which is a compound dataset containing a variable length string.

Expected behaviour
Load the dataset.

Please complete the following information:

  • jhdf version: master
  • Java version 1.8
  • OS (Windows, Mac, Linux): Linux

Additional context
This was noted while working on #121

`getDatasetByPath` fails to traverse external links

Describe the bug
If you call Dataset dataset = hdfFile.getDatasetByPath(/path/to/dataset) and it needs to traverse an external link, it will fail.

Expected behaviour
It should work and return the dataset

Please complete the following information:

  • jhdf version: master
  • Java version: 11
  • OS (Windows, Mac, Linux): Linux

String Data Padding not handled correctly

Describe the bug
In the data type StringData, the class bits responsible for the padding type are interpreted bit-wise instead of as the value of the lowest two order bits. Both the padding and the charset are determined by the short value of their four respective bits (see https://support.hdfgroup.org/HDF5/doc/H5.format.html#DatatypeMessage). While the charset is handled properly, the padding is not.
This error may well go undetected, since the padding/termination is ignored, as can be seen in DatasetReader#fillFixedLengthStringData, where the actual String value is trimmed.
This might produce errors if, for example, the string actually ends with null bytes on purpose and is marked as space padded, or the other way round.
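
A small illustration of the distinction (variable names are illustrative; the value mapping follows the datatype message spec linked above, where 0 = null terminated, 1 = null padded and 2 = space padded):

int classBits = 2;                  // example: a space padded string type
int paddingType = classBits & 0b11; // read as a value: 2, i.e. space padded
// Treating the individual bits as independent flags instead can misclassify
// the padding, which is the behaviour described above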

To Reproduce
The file in padding_problem.zip has a string attribute named "Test" that is marked as space padded, but is interpreted as null padded. This results in a wrong string length of 10, which is countered by the trimming in the DatasetReader.

Expected behaviour
The padding/termination should both be read correctly from the file as described in https://support.hdfgroup.org/HDF5/doc/H5.format.html#DatatypeMessage, as well as being taken into account when resolving the actual string value.

Please complete the following information:

  • jhdf version 0.5.0
  • Java version 1.8

HDF5 File content parse error

Describe the bug
When reading 64-bit integer data from an HDF5 file, I cannot get the data right.

To Reproduce
Steps to reproduce the behaviour:

  • I cannot upload the HDF5 file, because github told me that "Something went really wrong, and we can't process that file" when uploading the file.

  • The Java code is like below:

    String date = "20170103";
    File file = new File("test.h5");

    HdfFile hdfFile = new HdfFile(file);
    DatasetBase dataset = (DatasetBase) hdfFile.getByPath(date);
    IntBuffer buffer = dataset.getDataBuffer().asIntBuffer();
    int data_size = (int) dataset.getSize();

    int[] data = new int[data_size];
    buffer.get(data);
    hdfFile.close();
    System.out.println(Arrays.toString(data));
    

Expected behaviour
The first int value of the array "data" is 1237122510, while the actual value is 2017010309250121. The Python library h5py can read it correctly.

Please complete the following information:

  • jhdf version 0.5.6
  • Java version jdk1.8.0_171
  • OS (Windows, Mac, Linux) Windows

Additional information
The software HDFView 3.0 shows that the data type in the file is "64-bit integer", while Java has 32-bit int and 64-bit long. But using a LongBuffer I still cannot get the numbers right.
How can I share the hdf5 file with you?
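
Since HDFView reports 64-bit integers, each value spans eight bytes, so an IntBuffer view sees 32-bit halves of the stored values. A minimal sketch of reading via getData() instead (assuming jHDF maps the 64-bit integer type to long and that the dataset is one-dimensional; the path is taken from the snippet above):

try (HdfFile hdfFile = new HdfFile(new File("test.h5"))) {
	Dataset dataset = hdfFile.getDatasetByPath("/20170103");
	// For a 64-bit integer dataset getData() should give long values,
	// so 2017010309250121 is not split across two 32-bit ints
	long[] data = (long[]) dataset.getData();
	System.out.println(Arrays.toString(data));
}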

Cannot read empty arrays in var-length int dataset

The dataset from the h5py documentation (https://docs.h5py.org/en/stable/special.html#arbitrary-vlen-data) cannot be read by jhdf. The same happens for a smaller example with an empty array in the middle:

with h5py.File('test.hdf5', 'w') as f:
    dt = h5py.vlen_dtype(np.dtype('int32'))
    dset = f.create_dataset('vlen_int', (3,), dtype=dt)
    dset[0] = [1,2,3]
    dset[1] = []
    dset[2] = [1,2,3,4,5]

this exception is thrown:

Exception in thread "main" io.jhdf.exceptions.HdfException: Error reading global heap at address 0
	at io.jhdf.GlobalHeap.<init>(GlobalHeap.java:76)
	at io.jhdf.dataset.VariableLengthDatasetReader.lambda$readDataset$0(VariableLengthDatasetReader.java:58)
	at java.util.HashMap.computeIfAbsent(HashMap.java:1127)
	at io.jhdf.dataset.VariableLengthDatasetReader.readDataset(VariableLengthDatasetReader.java:57)
	at io.jhdf.object.datatype.VariableLength.fillData(VariableLength.java:101)
	at io.jhdf.dataset.DatasetReader.readDataset(DatasetReader.java:64)
	at io.jhdf.dataset.DatasetBase.getData(DatasetBase.java:128)
Caused by: io.jhdf.exceptions.HdfException: Global heap signature 'GCOL' not matched, at address 0
	at io.jhdf.GlobalHeap.<init>(GlobalHeap.java:47)
	... 7 more

The reason is that h5py encodes empty arrays with global heap index 0 and address 0.
HDFView and h5py can read the result.

Changing VariableLengthDatasetReader l. 56 ff to

for (GlobalHeapId globalHeapId : getGlobalHeapIds(buffer, type.getSize(), hdfFc, getTotalPoints(dimensions))) {
	if (globalHeapId.getIndex() == 0) {
		elements.add(ByteBuffer.allocate(0)); //TODO: constant?
	} else {
		GlobalHeap heap = heaps.computeIfAbsent(globalHeapId.getHeapAddress(),
				address -> new GlobalHeap(hdfFc, address));

		ByteBuffer bb = heap.getObjectData(globalHeapId.getIndex());
		elements.add(bb);
	}
}

fixes the problem.

About the new memory-mapped-file benchmark

What is the suggestion?

Hi, I came across your library since I am also developing a non-native HDF5 reading library (for .NET) and noticed your new memory-mapped file benchmark. I have some comments on it:

  • Memory-mapped files work best for random file access, because of the reduced number of system calls. However, your benchmark just copies a large sequential block of data, where I guess MMF will not shine.
  • Both benchmarked methods allocate an array per iteration, so GC overhead will become a significant part of the benchmark result.
  • The file system cache will hold the whole file in memory right after it has been written to disk, i.e. your benchmarks are not measuring disk access performance but memory access performance. Maybe this is intended, but I would prefer to have it mentioned at least.

Hopefully you find some of the points above useful :-) If not, please ignore this issue, I just wanted to write down my thoughts on this benchmark.
