
tftables's Introduction

tftables allows convenient access to HDF5 files with TensorFlow. A class for reading batches of data out of arrays or tables is provided. A secondary class wraps both the primary reader and a TensorFlow FIFOQueue for straightforward streaming of data from HDF5 files into TensorFlow operations.

The library is backed by multitables for high-speed reading of HDF5 datasets. multitables is based on PyTables (tables), so this library can make use of any compression algorithms that PyTables supports.
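
For instance, here is a minimal sketch of writing a compressed HDF5 array with PyTables that tftables can then stream. The file name and data are hypothetical; blosc is one of the compressors PyTables supports.

import numpy as np
import tables

# Write an array compressed with blosc (file name and data are examples).
# tftables/multitables can stream this file like any other HDF5 dataset.
filters = tables.Filters(complib='blosc', complevel=5)
with tables.open_file('example.h5', mode='w') as h5_file:
    h5_file.create_carray('/', 'data',
                          obj=np.random.rand(1000, 10),
                          filters=filters)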

Licence

This software is distributed under the MIT licence. See the LICENSE.txt file for details.

Installation

pip install tftables

Alternatively, to install from HEAD, run

pip install git+https://github.com/ghcollin/tftables.git

You can also download or clone the repository and run

python setup.py install

tftables depends on multitables, numpy, and tensorflow. The package is compatible with the latest versions of Python 2 and 3.

Quick start

An example of accessing a table in an HDF5 file:

import tftables
import tensorflow as tf

with tf.device('/cpu:0'):
    # This function preprocesses the batches before they
    # are loaded into the internal queue.
    # You can cast data, or do one-hot transforms.
    # If the dataset is a table, this function is required.
    def input_transform(tbl_batch):
        labels = tbl_batch['label']
        data = tbl_batch['data']

        # num_labels (the number of classes) must be defined by the user.
        truth = tf.to_float(tf.one_hot(labels, num_labels, 1, 0))
        data_float = tf.to_float(data)

        return truth, data_float

    # Open the HDF5 file and create a loader for a dataset.
    # The batch_size defines the length (in the outer dimension)
    # of the elements (batches) returned by the reader.
    # Takes a function as input that pre-processes the data.
    loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                                   dataset_path='/internal/h5/path',
                                   input_transform=input_transform,
                                   batch_size=20)

# To get the data, we dequeue it from the loader.
# TensorFlow tensors are returned in the same order as in input_transform.
truth_batch, data_batch = loader.dequeue()

# These tensors can then be used in your network,
# where my_network is defined by the user.
result = my_network(truth_batch, data_batch)

with tf.Session() as sess:

    # This context manager starts and stops the internal threads and
    # processes used to read the data from disk and store it in the queue.
    with loader.begin(sess):
        for _ in range(num_iterations):  # num_iterations is user-defined
            sess.run(result)

If the dataset is an array instead of a table, then input_transform can be omitted if no pre-processing is required. If only a single pass through the dataset is desired, you should pass cyclic=False to load_dataset, as in the sketch below.
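
A minimal sketch under those assumptions (the file name and internal path are hypothetical; that dequeue then returns a single array batch tensor is an assumption here):

import tftables

# An array dataset needs no input_transform, and cyclic=False
# requests a single pass over the data.
loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/array',
                               batch_size=20,
                               cyclic=False)
array_batch = loader.dequeue()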

Examples

See the how-to for more in-depth documentation, and the unit tests for complete examples.

Documentation

Online documentation is available. A how-to guide gives a basic overview of the library.

Offline documentation can be built from the docs folder using sphinx.

tftables's Issues

hdf5 files created by h5py

Does tftables support HDF5 files created by h5py? I have such files that contain multiple datasets (that is, numpy arrays).

It seems tftables is built on top of 'tables' (PyTables), which is a different library from h5py.

Cannot use same FIFOLoader twice.

I get an error whenever I try to use the same FIFOLoader twice. The error, as far as I can tell, comes from these lines:

if self.monitor_thread is not None:
    raise Exception("This loader has already been started.")

and the fact that the monitor thread is only closed, but not set to None, when loader.stop(sess) is called. A toy sketch of the pattern is below.
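
A toy illustration of that pattern (hypothetical code, not the actual tftables implementation): if stop() reset the attribute, the already-started check would pass again on reuse.

import threading

class RestartableLoader:
    # Toy illustration only; not the actual tftables implementation.
    def __init__(self):
        self.monitor_thread = None

    def start(self):
        if self.monitor_thread is not None:
            raise Exception("This loader has already been started.")
        self.monitor_thread = threading.Thread(target=lambda: None)
        self.monitor_thread.start()

    def stop(self):
        self.monitor_thread.join()
        self.monitor_thread = None  # resetting here would permit reuse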

Is preprocessing and augmentation possible?

Hi, thank you for this library. This is exactly what I was looking for. I have an HDF5 file for the PASCAL VOC dataset for object detection. In this file I have two datasets: one contains variable-length images, and the other contains segmentation masks for each of the object categories present in the image. I wish to feed the data in the following way:
Read an image and its corresponding masks, resize and augment them using a custom warp function, and then feed them into the FIFO queue for training. You can find the full question here. Could you please help me with this problem?
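
For reference, arbitrary TensorFlow ops can be applied inside input_transform before a batch enters the queue. A hedged sketch (the column names and target size are hypothetical, and it assumes the images are stored with fixed dimensions):

import tensorflow as tf

def input_transform(tbl_batch):
    images = tf.to_float(tbl_batch['images'])  # hypothetical column name
    masks = tf.to_float(tbl_batch['masks'])    # hypothetical column name
    # Resize the whole NHWC batch to a fixed spatial size.
    images = tf.image.resize_images(images, [256, 256])
    masks = tf.image.resize_images(masks, [256, 256])
    # A simple per-image augmentation: random horizontal flips.
    images = tf.map_fn(tf.image.random_flip_left_right, images)
    return images, masks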

what is the 'dataset_path' ?

def input_transform(tbl_batch):
    labels = tbl_batch['non_nodule']
    data = tbl_batch['nodule']

    return labels, data

loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/path',
                               input_transform=input_transform,
                               batch_size=16)

In my h5_file, 'nodule' and 'non_nodule' are the keys.
What does dataset_path mean?
If my HDF5 file name is taki.h5 and its path is /home/data/lunit/taki.h5, is it

filename = taki.h5
dataset_path = /home/data/lunit/

or

filename = /home/data/lunit/taki.h5
dataset_path = nodule

What is the correct answer?
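
One way to see what dataset_path can be is to list the internal paths in the file with PyTables. A minimal sketch using the file name from the question:

import tables

# Print the internal dataset paths in the HDF5 file; each printed
# _v_pathname is a candidate value for dataset_path.
with tables.open_file('/home/data/lunit/taki.h5', mode='r') as h5_file:
    for node in h5_file.walk_nodes('/', classname='Leaf'):
        print(node._v_pathname)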

Multiple h5 files

My dataset is formatted as a few dozen h5 files instead of one h5 file with internal directories. Is it possible to load them into one queue without merging them into a single file?

I have some questions

I have some questions...

My first question is the same dataset_path question as in the issue above: given taki.h5 at /home/data/lunit/taki.h5, which values of filename and dataset_path are correct?

My second question: is the following usage possible, feeding the dequeued batches through placeholders?

def input_transform(tbl_batch):
    labels = tbl_batch['non_nodule']
    data = tbl_batch['nodule']

    return labels, data

loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/path',
                               input_transform=input_transform,
                               batch_size=16)

result = my_network(self.x, self.y)
# self.x = tf.placeholder(tf.float32, [16,68,68,68,3])
# self.y = tf.placeholder(tf.float32, [16,2])

with tf.Session() as sess:
    with loader.begin(sess):
        for _ in range(num_iterations): 
            truth_batch, data_batch = loader.dequeue()
            feed_dict = { self.x : data_batch, self.y : truth_batch }

            sess.run(result, feed_dict=feed_dict)

Is this possible? Please answer.

Thank you!
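
For comparison, the Quick start section above builds the graph once: dequeue() is called outside the training loop, and no feed_dict is needed (my_network and num_iterations are user-defined):

truth_batch, data_batch = loader.dequeue()
result = my_network(truth_batch, data_batch)

with tf.Session() as sess:
    with loader.begin(sess):
        for _ in range(num_iterations):
            sess.run(result)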

Shuffle data while cycling?

Do you see a possibility to shuffle the data while reading/cycling through it, either at the tftables or the multitables level?

As I don't see an option related to random access, I assume you store your training data already shuffled?

HDF5 dataset path question

From @DolanDack :

Hi, I am brand new to Python and I am trying to use your code to train in TensorFlow using datasets other than MNIST. When I am following your guides there is this line: array_batch_placeholder = reader.get_batch('/internal/h5_path/to/array')

I do not really get what this pathway refers to. TensorFlow is quite different from R and Matlab, which I have been using until now, and I cannot really check the variables by executing small pieces of the code. Is the /internal/h5_path/to/array something I should provide? I know that in reader = tftables.open_file(filename='path/to/h5_file', batch_size=20) I use the h5 file path.

I tried to understand by following the more applicable examples that you offer, but to be honest I feel quite lost.

Code hangs on training

I'm reading multiple datasets from a single file, and after a certain number of iterations the code hangs indefinitely (I let it go overnight just to be absolutely certain). I have to Ctrl+C out of it and I get the following exception. It looks like a hang somewhere in multitables? Maybe from the queue not being populated quickly enough?

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 389, in _Streamer__read_process
    with sync.do(cbuf.put_direct(), i, (i+read_size) % len(ary)) as put_ary:
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 136, in __enter__
    with self.sync.barrier_in.wait(*self.index):
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 87, in __enter__
    self.sync.cvar.wait()
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 262, in wait
    return self._wait_semaphore.acquire(True, timeout)

_pickle.PicklingError: Can't pickle

Hi, when I ran the unit tests on my system, I got the following error:

_pickle.PicklingError: Can't pickle <function Streamer.__read_process at 0x000002012024BF28>: attribute lookup Streamer.__read_process on multitables failed

Please help me.

Queuing from multiple datasets?

Awesome package!!

Is it possible to load/dequeue data samples from multiple datasets (which may be inside the same HDF5 file)? For example, let's say we have filename=/path/to/h5_file.h5, which contains two tables: /path/to/table/1 and /path/to/table/2. Both tables contain the columns data and labels, as in the main README example.

I can make a loader for any individual table as suggested in the README:

loader_dataset1 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/1',
                                        input_transform=input_transform, ...)

But would I have to create an entirely different loader to handle the second table? Like this:

loader_dataset2 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/2',
                                        input_transform=input_transform, ...)

Then I would have to load the batches from each table separately and alternate between them on every training iteration:

truth_batch1, data_batch1 = loader_dataset1.dequeue()
truth_batch2, data_batch2 = loader_dataset2.dequeue()

Is there a better way of doing this? I could imagine concatenating both tables into a single table (and thus using a single loader). For clarity, it would make sense to keep the tables separate, but if that is the only solution, merging them is certainly possible. Do you have any suggestions?
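
A hedged sketch of one alternative, using the lower-level reader API (paths and column names follow the example above; that get_batch on a table returns a dictionary of column tensors is an assumption here):

import tensorflow as tf
import tftables

# One reader can serve batches from several datasets in the same file,
# all fed through a single FIFO loader.
reader = tftables.open_file(filename='path/to/h5_file.h5', batch_size=20)

batch1 = reader.get_batch('/path/to/table/1')
batch2 = reader.get_batch('/path/to/table/2')

inputs = [tf.to_float(batch1['data']), tf.to_float(batch1['labels']),
          tf.to_float(batch2['data']), tf.to_float(batch2['labels'])]

loader = reader.get_fifoloader(queue_size=2, inputs=inputs)
data1, labels1, data2, labels2 = loader.dequeue()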

ZeroDivisionError when `FileReader.get_batch` is called without `block_size`

I get a ZeroDivisionError when I try to read a batch without specifying block_size. My HDF5 file was created with h5py and is a simple table.

Here is my code and the error message:

In [2]: reader = tftables.open_file('/home/yngve/table.h5', batch_size=5)

In [3]: a = reader.get_batch('/train/images', n_procs=3)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-3-05142b89f50c> in <module>()
----> 1 a = reader.get_batch('/train/images', n_procs=3)

~/anaconda3/envs/tf/lib/python3.6/site-packages/tftables.py in get_batch(self, path, **kw_args)
    264         block_size = queue.block_size
    265         # get an example for finding data types and row sizes.
--> 266         example = self.streamer.get_remainder(path, block_size)
    267         batch_type = example.dtype
    268         inner_shape = example.shape[1:]

~/anaconda3/envs/tf/lib/python3.6/site-packages/multitables.py in get_remainder(self, path, block_size)
    459         :return: A copy of the remainder elements as a numpy array.
    460         """
--> 461         return self.__get_batch(path, length=block_size, last=True)
    462 
    463     class Queue:

~/anaconda3/envs/tf/lib/python3.6/site-packages/multitables.py in __get_batch(self, path, length, last)
    444 
    445         if last:
--> 446             example = h5_node[length*(len(h5_node)//length):].copy()
    447         else:
    448             example = h5_node[:length].copy()

ZeroDivisionError: integer division or modulo by zero

The error is resolved if I specify block_size. The code

In [2]: reader = tftables.open_file('/home/yngve/table.h5', batch_size=5)

In [3]: a = reader.get_batch('/train/images', block_size=5,  n_procs=3)

does not give any errors. I have tested this code with the FIFOQueue and it does indeed give the expected result.

System info:
OS: Ubuntu 16.04
Python version: 3.6
Tensorflow version: 1.5.0
Tftables version: 1.1.2 (latest from pip)

Performance gain of tftables

Hi,
I have implemented an HDF5 stream based on your documentation. What performance gain should I expect from tftables in general? I do not see much gain in my case; i.e., the speed of my example is the same as before.

Thanks

repeated same data

It seems to send the same data into the graph multiple times. I used a dataset from here: http://download.nexusformat.org/sphinx/examples/h5py/#id4 and don't know for sure what is in it. However, putting print statements in the yield statements of tftables' get_batch()/read_batch() outputs the same data several times. I don't see why this would be expected?
