inm-6 / h5py_wrapper
A wrapper to conveniently store nested python dictionaries in hdf5 files.
License: GNU General Public License v2.0
We should make sure this module can be safely run with Python 2 and 3.
We should order imports alphabetically, but more importantly, move the h5py version check after all imports to be PEP8 compatible.
Since Python's long integer type is not a native numpy type it is converted to an object and thus has no native HDF5 equivalent.
I discussed this with @jakobj: One workaround would be to store it as a string and store the type information and recreate the long type when loading from h5. @mschmidt87, what do you think?
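The string-based workaround suggested above could look roughly like the following. This is a minimal sketch, not the wrapper's implementation: the helper names encode_int and decode_int and the 'long' type tag are hypothetical, and the 64-bit bounds are an assumption about what can be stored natively.

```python
# Bounds of a signed 64-bit integer, assumed here to be the largest
# integer type that can be stored natively.
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def encode_int(value):
    """Return (payload, type_tag) for an integer (hypothetical helper).

    Integers that fit into 64 bits are passed through unchanged;
    larger ones are stored as a decimal string tagged 'long'.
    """
    if INT64_MIN <= value <= INT64_MAX:
        return value, 'int'
    return str(value), 'long'


def decode_int(payload, type_tag):
    """Invert encode_int using the stored type tag."""
    if type_tag == 'long':
        return int(payload)
    return payload


big = 2**70
payload, tag = encode_int(big)
restored = decode_int(payload, tag)
```

The type tag is what makes the round trip unambiguous: without it, a stored string of digits could not be told apart from a genuine string value.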
Storing a list of unicode strings leads to the following h5py error:
TypeError: No conversion path for dtype: dtype('<U1')
I haven't understood the background completely. If there is no workaround in h5py, we should at least build in some error catching / conversion code. This error can be reproduced with the additional test added in my personal branch unicode_strings.
Both when loading from and saving to a file, the user can specify paths within the hdf5 file. They are specified with keyword arguments that are named differently in the load and save functions:
save(filename, d, write_mode='a', overwrite_dataset=False, resize=False, dict_label='', compression=None)
load(filename, path='', lazy=False)
So, we have dict_path in the save function and path in the load function. In my opinion, we should name the arguments consistently across the two functions and opt for calling it path.
This will of course break user code, so we should only include this in the next major release.
I discovered a problem with compatibility of files between Python 2 and 3:
If a file has been created with Python 2, it cannot be opened with Python 3.
The reason is that key_type and value_type are read in as byte-strings by Python 3, so that the wrapper complains that e.g. it does not support data type b'int' or cannot correctly read a string key.
Here is a minimal reproducer:
Execute with Python 2:
import h5py_wrapper as h5w
d = {'a': 1}
h5w.save('h5w_bug.h5', d, 'w')
Execute with Python 3:
import h5py_wrapper as h5w
d = h5w.load('h5w_bug.h5')
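One possible direction for a fix is to normalize the type attributes to text right after reading them. The sketch below assumes Python 3; the helper name to_text is hypothetical and the bytes value merely imitates what the report describes.

```python
def to_text(value, encoding='utf-8'):
    """Return a text string whether value was read as bytes or str.

    Hypothetical helper: attributes written by Python 2 reportedly
    come back as bytes under Python 3.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Imitates the key_type attribute of a file written with Python 2.
key_type_from_file = b'int'
normalized = to_text(key_type_from_file)
```

Applying such a helper to both key_type and value_type before they are compared with the list of supported types would let both byte-strings and regular strings pass the same checks.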
If the conversion script is executed without arguments, usage information should be printed. Instead, the script tries to read from stdin continuously. I think this is caused by docopt since only optional arguments are specified. We should change at least <files> to non-optional.
Currently, we do not explicitly store the type of the value of the dataset.
This restricts functionality. For instance, it is not possible to distinguish between a NoneType and the string 'None', because we store a None as the string 'None'.
Furthermore, implementing this enables us to retrieve lists as lists again instead of Numpy arrays.
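To illustrate the idea, here is a minimal sketch of storing a type tag next to each value. The function names pack and unpack are hypothetical, and the tag is simply the Python type name; the wrapper may choose a different scheme.

```python
def pack(value):
    """Return (payload, value_type) for storage (hypothetical helper).

    None has no stored equivalent here, so it is written as the string
    'None'; the type tag keeps it distinguishable from a real string.
    """
    value_type = type(value).__name__
    payload = 'None' if value is None else value
    return payload, value_type


def unpack(payload, value_type):
    """Recreate the original value from payload and type tag."""
    if value_type == 'NoneType':
        return None
    if value_type == 'list':
        # The payload may come back as an array-like; restore a list.
        return list(payload)
    return payload


none_roundtrip = unpack(*pack(None))
string_roundtrip = unpack(*pack('None'))
# A tuple stands in for the array-like returned when reading a stored list.
list_roundtrip = unpack((1, 2, 3), 'list')
```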
If the wrapper fails to load a file because of a KeyError, we display a message informing the user that this probably occurred because the file has been created by an old release version.
This only works if the user loads the entire file. If the user specifies a path to a deeper substructure of the stored file, by using load('foo.h5', path='path/to/substructure') or load('foo.h5', 'path/to/substructure'), the KeyError is caught by the load function and the message is not displayed.
We should improve the error handling so that the message is also displayed in this case.
The tests use the regular assert. This should be replaced with the asserts from the unittest module to provide more information in the case of failed tests.
Currently, the tests are just a collection of functions which are executed.
It would be better to design it as a proper testsuite using the unittest library.
The following code runs (I'm able to save and load):
import h5py_wrapper as h5w
import numpy as np
d = { 'np_int64': [ np.int64(10) ] }
h5w.save('file.name', d, write_mode='w')
h5w.load('file.name')
However, if I don't put the np.int64(10) value in a list, so d = { 'np_int64': np.int64(10) }, I get an error when trying to load the file:
NotImplementedError: Unsupported data type: int64.
I think it's because the list of supported data types (valuetype_dict) is not complete, although this shouldn't be a problem since it's possible to save the values.
As far as I can see, our test dictionary is quite simple. It might be a good idea to include more complex examples (e.g., lists of dicts, sets, tuples).
After retrieving and unpacking the tar file, get_previous_version should clean up and remove the tar file as it is not needed any more.
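The cleanup step could be as small as the sketch below. It only uses the standard library; the function name unpack_and_remove is hypothetical, and the archive built at the bottom is a stand-in for the downloaded release.

```python
import os
import tarfile
import tempfile


def unpack_and_remove(tar_path, target_dir):
    """Extract tar_path into target_dir, then delete the archive."""
    with tarfile.open(tar_path) as tar:
        tar.extractall(target_dir)
    # The archive is no longer needed once its content is unpacked.
    os.remove(tar_path)


# Build a small stand-in archive to demonstrate the behaviour.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, 'wrapper.py')
with open(source, 'w') as f:
    f.write('# old release\n')

archive = os.path.join(work_dir, 'release.tar.gz')
with tarfile.open(archive, 'w:gz') as tar:
    tar.add(source, arcname='old_release/wrapper.py')

target = os.path.join(work_dir, 'unpacked')
os.makedirs(target)
unpack_and_remove(archive, target)
```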
Currently, complex numbers are not supported by the wrapper. Since this is certainly a common use case, we should remedy this.
The testsuite currently fails for the Python 2.7 tests. This seems to be a problem with the conversion script.
I did a quick speed test comparing the current master (25834fb) with release version 0.0.1 and discovered a significant slowdown of the loading routine.
We should do some proper benchmarking, investigate bottlenecks that we apparently introduced and think about solutions. Either we can improve the speed of the implementation in general or we could provide options to the user to e.g. circumvent the time-consuming type-checking in the loading routine.
In my opinion, the current function names of the API are not very well chosen. add_to_h5 and load_h5 are not particularly intuitive. How about save and load like in numpy?
This changes the user interface, so it should be well discussed, and we would need to deprecate the current functions for a while before removing them completely.
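A deprecation period for the old names could be sketched as follows. The helper deprecated_alias is hypothetical, and the save function here is only a placeholder with a simplified signature, not the wrapper's real one.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns and forwards to new_func."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            '{}() is deprecated, use {}() instead'.format(
                old_name, new_func.__name__),
            DeprecationWarning, stacklevel=2)
        return new_func(*args, **kwargs)
    wrapper.__name__ = old_name
    return wrapper


def save(filename, d):
    """Placeholder for the renamed function (simplified signature)."""
    return filename, d


# The old name stays available but emits a warning when called.
add_to_h5 = deprecated_alias(save, 'add_to_h5')
```

With this pattern, existing user code keeps working during the transition, and the aliases can be dropped in a later major release.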
The setup.py script is currently lacking information about the supported Python versions.
I copy our discussion from a pull request of python-dicthash:
When you check numpy on pypi (https://pypi.python.org/pypi/numpy/1.13.0rc1) you see that they explicitly list the Python versions that this package is compatible with:
Requires Python: >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*
Maybe we need to add something similar? Alternatively we could just create a new release with Python3 support?
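For reference, setuptools exposes this information through the python_requires keyword and the trove classifiers. The fragment below is a sketch only: the version range simply mirrors the numpy example above and is an assumption, not a statement about which versions the wrapper actually supports.

```python
# Keyword arguments that would be handed to setuptools.setup().
# The version range is an assumption copied from the numpy example.
SETUP_KWARGS = dict(
    name='h5py_wrapper',
    python_requires='>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*',
    classifiers=[
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
    ],
)

# In setup.py this would be used as: setup(**SETUP_KWARGS)
```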
Doctests fail on the current examples for load_h5 and add_to_h5 due to missing imports.
Currently, it is not possible to store tuples with complex shape, such as
((1, 2), (3, 4, 5))
because h5py converts this into a numpy array with dtype=Object.
For numpy arrays, we handle this case (for 2D arrays) but to apply this to tuples (and sets) as well, we first need to fix #11.
With #59, the wrapper supports Python 3. Although it actually supports Python 3.5, we only test for Python 3.4 for now because conda reports a package conflict for Python 3.5 and h5py.
We should keep an eye on this and remedy this once the problem is solved on the conda level.
We should set up Travis now that we have a proper test suite.
The API reference seems to be broken: https://h5py-wrapper.readthedocs.io/en/latest/api_reference.html
Did anything change or did this never work properly?
Currently, the wrapper requires every user to have the quantities package installed.
We should check for the availability of the package and adapt functionality if it's not found since the wrapper itself does not require it.
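A common pattern for such an optional dependency is a guarded import. The sketch below assumes that the wrapper only needs to recognize Quantity objects; the flag HAS_QUANTITIES and the helper is_quantity are hypothetical names.

```python
try:
    import quantities as pq
    HAS_QUANTITIES = True
except ImportError:
    # The package is optional: fall back gracefully when it is missing.
    pq = None
    HAS_QUANTITIES = False


def is_quantity(value):
    """True only if quantities is installed and value is a Quantity."""
    return HAS_QUANTITIES and isinstance(value, pq.Quantity)


plain_float_is_quantity = is_quantity(1.0)
```

Code paths that deal with physical units would then check is_quantity first and otherwise treat the value like any other supported type.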
The testsuite creates h5 files, which are currently stored in the same folder as the test script.
The conversion script downloads an old release version and stores it in the current working directory.
In both cases, we should use the /tmp directory of the system.
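Rather than hard-coding /tmp, the standard tempfile module can pick the system's temporary directory. The following is a minimal sketch; the file written here is only a stand-in for a call such as h5w.save(path, d).

```python
import os
import shutil
import tempfile

# Create a private directory below the system's temporary directory.
tmp_dir = tempfile.mkdtemp(prefix='h5w_test_')
try:
    path = os.path.join(tmp_dir, 'data.h5')
    # Stand-in for writing a real h5 file into the temporary directory.
    with open(path, 'wb') as f:
        f.write(b'placeholder')
    existed = os.path.exists(path)
finally:
    # Remove the directory and everything in it, even if a test failed.
    shutil.rmtree(tmp_dir)
```

This keeps both the test folder and the current working directory free of leftover files.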
We need to gitignore (note to self).
A function that only loads the skeleton of a data file would be very useful for investigating the structure of a dataset or file without the effort of loading the actual data.
Hi, I just came across the following issue:
If you have numpy strings as keys of a dictionary and you save this dictionary, the h5py_wrapper raises the error
ValueError: malformed node or string: <ast.Name object at 0x7fcfa1870970>
when trying to load the file. This can be reproduced using the following test (using pytest):
import pytest
import numpy as np
import h5py_wrapper as h5
from unittest import TestCase
def test_saving_and_consecutive_loading_of_numpy_string_keys(tmpdir):
    file = 'test.h5'
    keys = ['a', 'b', 'c']
    # this makes the keys numpy strings
    keys = np.atleast_1d(keys)
    output = {key: i for i, key in enumerate(keys)}
    # create temporary directory to save test file into
    tmp_test = tmpdir.mkdir('tmp_test')
    with tmp_test.as_cwd():
        h5.save(file, output)
        input = h5.load(file)
    TestCase().assertDictEqual(output, input)
The problem seems to be line 296 in wrapper.py, where key_type is compared to ['str', 'unicode', 'string_'], but a numpy string leads to key_type = 'str_'.
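The failure mode and a possible fix can be sketched without h5py. The sketch assumes that non-string keys are restored with ast.literal_eval, which is consistent with the reported error message but not confirmed by the report; restore_key and STRING_KEY_TYPES are hypothetical names.

```python
import ast

# The list from the report, extended by 'str_' for numpy strings.
STRING_KEY_TYPES = ('str', 'unicode', 'string_', 'str_')


def restore_key(raw_key, key_type):
    """Recreate a dictionary key from its stored name and type tag."""
    if key_type in STRING_KEY_TYPES:
        # String keys are used as they are.
        return raw_key
    # Other keys (numbers, tuples, ...) are parsed from their repr.
    return ast.literal_eval(raw_key)


numpy_string_key = restore_key('a', 'str_')
tuple_key = restore_key('(1, 2)', 'tuple')
```

Without 'str_' in the list, the key 'a' would be passed to ast.literal_eval, which raises a ValueError for a bare name.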