inm-6 / h5py_wrapper

A wrapper to conveniently store nested python dictionaries in hdf5 files.

License: GNU General Public License v2.0

Python 100.00%

python hdf5 h5py

h5py_wrapper's People

alperyeg carloscanova vahidrostami lzehl mschmidt87 jakobj jschuecker juliasprenger midick whigg hannahbos fsurmont

h5py_wrapper's Issues

add python3 support

We should make sure this module runs safely under both Python 2 and Python 3.

Fix order of imports and version check

We should order the imports alphabetically and, more importantly, move the h5py version check after all imports to comply with PEP 8.

Enable storage of Python's type long

Since Python's long integer type is not a native NumPy type, it is converted to an object and thus has no native HDF5 equivalent.

I discussed this with @jakobj: One workaround would be to store it as a string and store the type information and recreate the long type when loading from h5. @mschmidt87, what do you think?
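The workaround discussed above could look roughly like the following sketch. It is only an illustration under stated assumptions, not the wrapper's code: the helper names encode_int and decode_int and the 'long' type tag are hypothetical, and the (payload, tag) pair stands in for a dataset value plus an attribute stored next to it. Integers that fit into a signed 64-bit range are kept as they are; larger ones are stored as decimal strings.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def encode_int(value):
    """Return (payload, type_tag) for an integer of arbitrary size."""
    if INT64_MIN <= value <= INT64_MAX:
        # fits into a native 64-bit integer, store unchanged
        return value, 'int'
    # too large for int64: store the decimal representation as a string
    return str(value), 'long'


def decode_int(payload, type_tag):
    """Recreate the integer from the stored payload and its type tag."""
    if type_tag == 'long':
        return int(payload)
    return payload


big = 2**70
payload, tag = encode_int(big)
restored = decode_int(payload, tag)
```

The type tag is what makes the round trip unambiguous: without it, a stored string of digits could not be told apart from a genuine string value.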

unicode strings cannot be stored

Storing a list of unicode strings leads to the following h5py error:
TypeError: No conversion path for dtype: dtype('<U1').

I haven't understood the background completely. If there is no workaround in h5py, we should at least build in some error catching / conversion code. This error can be reproduced with the additional test added in my personal branch unicode_strings.
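One possible form of such conversion code is sketched below, assuming that UTF-8 encoded byte strings can be stored where unicode strings cannot. The helper names encode_strings and decode_strings are hypothetical, and the sketch does not touch h5py at all; it only shows the conversion that would happen before saving and after loading.

```python
def encode_strings(strings):
    """Encode a list of text strings to UTF-8 byte strings."""
    return [s.encode('utf-8') for s in strings]


def decode_strings(byte_strings):
    """Decode a list of UTF-8 byte strings back to text strings."""
    return [b.decode('utf-8') for b in byte_strings]


original = [u'a', u'\u00e4', u'\u4e2d']
stored = encode_strings(original)   # what would be written to the file
loaded = decode_strings(stored)     # what the user would get back
```

For the decoding step to be applied only where it is needed, the original type would have to be recorded alongside the data.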

Specification of path within hdf5 file inconsistent between load and save function

Both when loading from and when saving to a file, the user can specify a path within the HDF5 file. The corresponding keyword arguments are named differently in the save and load functions:

save(filename, d, write_mode='a', overwrite_dataset=False, resize=False, dict_label='', compression=None)

load(filename, path='', lazy=False)

So, we have dict_label in the save function and path in the load function. In my opinion, we should name the arguments consistently across the two functions and opt for calling it path.

This will of course break user code so we should only include this in the next major release.
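A transition period could soften the break. The sketch below is a stand-in for save, not the actual implementation: it accepts the proposed path keyword and keeps dict_label as a deprecated alias that emits a DeprecationWarning. The function only returns the resolved path so that the behaviour can be checked without writing a file.

```python
import warnings


def save(filename, d, write_mode='a', path='', dict_label=None):
    """Stand-in for save(): `path` is the new keyword, `dict_label` the old one."""
    if dict_label is not None:
        warnings.warn("'dict_label' is deprecated, use 'path' instead",
                      DeprecationWarning, stacklevel=2)
        if path:
            raise TypeError("specify either 'path' or 'dict_label', not both")
        path = dict_label
    # A real implementation would write `d` to `filename` below `path` here.
    return path


new_style = save('data.h5', {'a': 1}, path='group/sub')
```

Old call sites such as save('data.h5', d, dict_label='group/sub') keep working during the deprecation period and the alias can be dropped in the following major release.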

Files created with Python2 cannot be read using Python3

I discovered a compatibility problem between files created with Python 2 and Python 3:

If a file has been created with Python 2, it cannot be opened with Python 3.
The reason is that key_type and value_type are read in as byte-strings by Python 3 so that the wrapper complains that e.g. it does not support data type b'int' or cannot correctly read a string key.

Here is a minimal reproducer:
Execute with Python 2:

import h5py_wrapper as h5w
d = {'a': 1}
h5w.save('h5w_bug.h5', d, 'w')

Execute with Python 3:

import h5py_wrapper as h5w
d = h5w.load('h5w_bug.h5')
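A possible fix is to normalize the type information right after reading it. The sketch below assumes that key_type and value_type arrive either as text or as byte strings; the helper name as_text is hypothetical and the UTF-8 default is an assumption.

```python
def as_text(value, encoding='utf-8'):
    """Return `value` as a text string, decoding byte strings if necessary."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Attribute as Python 3 might read it from a file written by Python 2
value_type = as_text(b'int')
```

Applying such a helper at the single place where the attributes are read would keep the remaining type checks unchanged.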

Add coveralls support

Conversion script hangs instead of displaying help message when called without arguments

If the conversion script is executed without arguments, usage information should be printed. Instead, the script tries to read from stdin continuously. I think this is caused by docopt, since only optional arguments are specified. We should change at least <files> to non-optional.
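The effect of a required positional argument can be illustrated with the standard library. Note that the sketch below uses argparse rather than docopt, so it is a swapped-in technique and not the script's actual interface; the program name is hypothetical as well.

```python
import argparse


def build_parser():
    """Build a parser whose `files` argument is mandatory."""
    parser = argparse.ArgumentParser(
        prog='convert_h5file',  # hypothetical script name
        description='Convert files to the current format.')
    # nargs='+' requires at least one file name on the command line
    parser.add_argument('files', nargs='+', help='files to convert')
    return parser


args = build_parser().parse_args(['old_a.h5', 'old_b.h5'])
```

Called with an empty argument list, this parser prints a usage message and exits instead of waiting for input.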

Explicitly store type of value

Currently, we do not explicitly store the type of the value of the dataset.
This restricts functionality. For instance, it is not possible to distinguish between None and the string 'None', because we store None as the string 'None'.

Furthermore, implementing this enables us to retrieve lists as lists again instead of Numpy arrays.
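The idea can be sketched as follows, assuming the type name is stored as an attribute next to the dataset. The helper names encode_value and decode_value are hypothetical, and a tuple stands in for the array-like object that would come back from the file.

```python
def encode_value(value):
    """Return (payload, value_type); value_type would be stored as an attribute."""
    if value is None:
        return 'None', 'NoneType'
    return value, type(value).__name__


def decode_value(payload, value_type):
    """Restore the original Python object using the stored type name."""
    if value_type == 'NoneType':
        return None
    if value_type == 'list':
        # array-like payload is converted back to a plain list
        return list(payload)
    return payload


none_roundtrip = decode_value(*encode_value(None))
string_roundtrip = decode_value(*encode_value('None'))
list_roundtrip = decode_value((1, 2, 3), 'list')
```

Both None and 'None' produce the same payload here; only the stored type name tells them apart.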

Improve warning in case of deprecated file

If the wrapper fails to load a file because of a KeyError, we display a message informing the user that this probably occurred because the file was created by an old release version.

This only works if the user loads the entire file. If the user specifies a path to a deeper substructure of the stored file, by using load('foo.h5', path='path/to/substructure') or load('foo.h5', 'path/to/substructure'), the KeyError is caught by the load function and the message is not displayed.

We should improve the error handling so that the message is also displayed in this case.

replace assert with unittest.assert...

The tests use the plain assert statement. It should be replaced with the assertion methods of the unittest module, which provide more information when a test fails.
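The difference in diagnostic output can be seen in a small comparison. The dictionaries below are made up for illustration; the sketch only captures the messages of the two kinds of failure.

```python
import unittest

expected = {'a': 1, 'b': [1, 2, 3]}
loaded = {'a': 1, 'b': [1, 2, 4]}

plain_message = None
unittest_message = None

# Plain assert: the failure only says that the comparison was false
try:
    assert expected == loaded
except AssertionError as err:
    plain_message = str(err)

# unittest assertion: the failure message describes how the dicts differ
try:
    unittest.TestCase().assertDictEqual(expected, loaded)
except AssertionError as err:
    unittest_message = str(err)
```

Printing unittest_message shows both dictionaries together with a line-by-line diff, which is usually enough to locate the offending key.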

Implement testsuite with unittest

Currently, the tests are just a collection of functions which are executed.
It would be better to design it as a proper testsuite using the unittest library.

supported data types not consistent?

The following code runs (I'm able to save and load):

import h5py_wrapper as h5w
import numpy as np

d = { 'np_int64': [ np.int64(10) ] }

h5w.save('file.name', d, write_mode='w')
h5w.load('file.name')

However, if I don't put the np.int64(10) value in a list, so d = { 'np_int64': np.int64(10) }, I get an error when trying to load the file:

NotImplementedError: Unsupported data type: int64.

I think this happens because the list of supported data types (valuetype_dict) is incomplete, although that should not be a problem, since it is possible to save the values.

include more complex tests

As far as I can see, our test dictionary is quite simple. It might be a good idea to include more complex examples (e.g., lists of dicts, sets, tuples).

get_previous_version should clean up

After retrieving and unpacking the tar file, get_previous_version should clean up and remove the tar file as it is not needed any more.
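A minimal sketch of the clean-up step follows. The helper name unpack_and_remove is hypothetical and does not reflect how get_previous_version is actually written; the demo part builds a small archive in a temporary directory so that the block is self-contained.

```python
import os
import shutil
import tarfile
import tempfile


def unpack_and_remove(tar_path, target_dir):
    """Extract `tar_path` into `target_dir`, then delete the archive."""
    with tarfile.open(tar_path) as tar:
        tar.extractall(target_dir)
    os.remove(tar_path)


# Demo: create a small archive, unpack it and check that it is gone afterwards
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, 'wrapper.py')
with open(source, 'w') as f:
    f.write('# placeholder\n')
archive = os.path.join(work_dir, 'release.tar.gz')
with tarfile.open(archive, 'w:gz') as tar:
    tar.add(source, arcname='release/wrapper.py')

target = os.path.join(work_dir, 'unpacked')
unpack_and_remove(archive, target)
archive_left = os.path.exists(archive)
unpacked_ok = os.path.isfile(os.path.join(target, 'release', 'wrapper.py'))
shutil.rmtree(work_dir)  # remove the demo directory again
```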

Allow storage of complex numbers

Currently, complex numbers are not supported by the wrapper. Since this is certainly a common use case, we should remedy this.

Fix Python 2.7 tests

The test suite currently fails for the Python 2.7 tests. This seems to be a problem with the conversion script.

Benchmark wrapper in comparison with release 0.0.1 and work on speed

I did a quick speed test comparing the current master (25834fb) with release version 0.0.1 and discovered a significant slowdown of the loading routine.

We should do some proper benchmarking, investigate bottlenecks that we apparently introduced and think about solutions. Either we can improve the speed of the implementation in general or we could provide options to the user to e.g. circumvent the time-consuming type-checking in the loading routine.
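A simple starting point for such a benchmark is the timeit module. In the sketch below the two loader functions are placeholders that merely mimic loading with and without per-item type checks; in a real benchmark they would call the load routine of the two versions on the same file. The helper name best_time is hypothetical.

```python
import timeit


def best_time(func, repeat=5, number=100):
    """Return the best per-call time in seconds over several repetitions."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def load_without_checks():
    # placeholder for a loading routine that skips type checking
    return {str(i): i for i in range(1000)}


def load_with_checks():
    # placeholder for a loading routine that inspects every value
    return {str(i): int(i) for i in range(1000) if isinstance(i, int)}


results = {
    'without type checks': best_time(load_without_checks),
    'with type checks': best_time(load_with_checks),
}
```

Taking the minimum over several repetitions reduces the influence of other processes on the measurement.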

Better API function names

In my opinion, the current function names of the API are not very well chosen. add_to_h5 and load_h5 are not particularly intuitive. How about save and load, like in numpy?
This changes the user interface, so it should be discussed thoroughly, and we would need to deprecate the current functions for a while before removing them completely.
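The deprecation period could be implemented with thin aliases that warn and forward to the new functions. The sketch below is an assumption about how this might look: deprecated_alias is a hypothetical helper and load is a stand-in that only echoes its arguments.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns about `old_name` and calls `new_func`."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            '{}() is deprecated, use {}() instead'.format(
                old_name, new_func.__name__),
            DeprecationWarning, stacklevel=2)
        return new_func(*args, **kwargs)
    wrapper.__name__ = old_name
    return wrapper


def load(filename, path=''):
    # stand-in for the real loading routine
    return {'filename': filename, 'path': path}


# The old name stays available but points users to the new one
load_h5 = deprecated_alias(load, 'load_h5')
```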

Add information about supported python version to setup.py

The setup.py script currently lacks information about the supported Python versions.
I am copying our discussion from a pull request of python-dicthash:

When you check numpy on PyPI (https://pypi.python.org/pypi/numpy/1.13.0rc1), you see that they explicitly list the Python versions that this package is compatible with:

Requires Python: >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*

Maybe we need to add something similar? Alternatively we could just create a new release with Python3 support?
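With setuptools, this information is declared through the python_requires keyword of setup(). The fragment below is a sketch: the version specifier is simply copied from the numpy example and the classifiers are assumptions, so both would have to be adapted to what the wrapper really supports.

```python
# Keyword arguments that could be passed on as setuptools.setup(**SETUP_KWARGS)
SETUP_KWARGS = dict(
    name='h5py_wrapper',
    # assumption: same range as in the numpy example above
    python_requires='>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*',
    classifiers=[
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
    ],
)
```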

ValueError: malformed node or string: <ast.Name object at 0x7fcfa1870970>

when trying to load the file. This can be reproduced using the following test (using pytest):

import pytest
import numpy as np
import h5py_wrapper as h5

from unittest import TestCase


def test_saving_and_consecutive_loading_of_numpy_string_keys(tmpdir):
    file = 'test.h5'
    keys = ['a', 'b', 'c']
    
    # this makes the keys numpy strings
    keys = np.atleast_1d(keys)
    output = {key: i for i, key in enumerate(keys)}

    # create temporary directory to save test file into
    tmp_test = tmpdir.mkdir('tmp_test')
    with tmp_test.as_cwd():
        h5.save(file, output)
        # use a name that does not shadow the built-in input()
        loaded = h5.load(file)
        TestCase().assertDictEqual(output, loaded)

The problem seems to be line 296 in wrapper.py where key_type is compared to ['str', 'unicode', 'string_'], but a numpy string leads to key_type = 'str_'.
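A possible fix is to extend the list of type names that are treated as strings. The sketch below is not the code from wrapper.py: restore_key is a hypothetical helper, and it assumes, as the traceback suggests, that keys of all other types are restored with ast.literal_eval.

```python
import ast

# 'str_' is the type name reported for numpy string keys
STRING_KEY_TYPES = ('str', 'unicode', 'string_', 'str_')


def restore_key(raw_key, key_type):
    """Return the key as a string or evaluate it as a Python literal."""
    if key_type in STRING_KEY_TYPES:
        return raw_key
    # non-string keys such as ints or tuples are stored as their repr
    return ast.literal_eval(raw_key)


string_key = restore_key('a', 'str_')
int_key = restore_key('1', 'int')
tuple_key = restore_key('(1, 2)', 'tuple')
```

Without the 'str_' entry, the key 'a' would be passed to ast.literal_eval, which rejects bare names and raises the ValueError shown above.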