vahidk / tfrecord

TFRecord reader for PyTorch

Home Page: https://twitter.com/VahidK

License: MIT License

Python 100.00%

dataset loader pytorch tensorflow tfrecord

tfrecord's Issues

Losing shape information

Hi,

I am trying to load the ImageNet2012 tfrecords (loaded using tensorflow datasets).

I can load the tfrecords with TensorFlow without problems. However, when I use this library I lose the shape information of the images (they come back as flattened lists). Is there any way to resolve this? Other than that, the library works great!

Thanks

RuntimeError: Failed to read the record.

Hi guys, thanks for your contribution!

When I use your source code, I get this error:
Traceback (most recent call last):
  File "D:/coding/SIFA-master/model.py", line 316, in
    data = next(iter(loader))
  File "D:\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 346, in __next__
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "D:\Anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "D:\Anaconda3\lib\site-packages\tfrecord\io_utils.py", line 106, in tfrecord_loader
    for record in record_iterator:
  File "D:\Anaconda3\lib\site-packages\tfrecord\io_utils.py", line 73, in tfrecord_iterator
    yield from read_records()
  File "D:\Anaconda3\lib\site-packages\tfrecord\io_utils.py", line 67, in read_records
    raise RuntimeError("Failed to read the record.")
RuntimeError: Failed to read the record.

Here is my code:

from tfrecord.torch.dataset import TFRecordDataset
from torch.utils.data import DataLoader

tfrecord_path = "./data/training_A/image_01.tfrecords"

dataset = TFRecordDataset(tfrecord_path, None, {"dsize_dim0":"int", "dsize_dim1":"int", "dsize_dim2":"int",
"lsize_dim0":"int", "lsize_dim0":"int", "lsize_dim0":"int",
"data_vol":"str", "label_vol":"str",
})

loader = DataLoader(dataset, batch_size=1)

data = next(iter(loader))

print(data)

Have you fixed this error, or could you help me find my mistake? Thanks!

Infinite loop when reading multiple tfrecords

Hi, thanks for your awesome tfrecord

loader = tfrecord.multi_tfrecord_loader(
    tfrecord_pattern, index_pattern, splits, description
)
qbar = tqdm(enumerate(loader))
for i, record in qbar:
    # do something
    # infinite loop [never ends]
# do something else [never reached]

I came across a problem when loading multiple tfrecords with tfrecord.multi_tfrecord_loader.
The loop never seems to end.
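
If the loader really does yield records forever, one way to keep the loop bounded is to cap how many records are drawn from it. A minimal sketch using itertools.islice; endless_loader is a hypothetical stand-in for the iterator returned by multi_tfrecord_loader, and max_records is an assumed budget chosen by the caller:

```python
import itertools


def endless_loader():
    """Hypothetical stand-in for a loader that keeps yielding records forever."""
    step = 0
    while True:
        yield {"index": step}
        step += 1


def take(loader, max_records):
    """Yield at most max_records items from a possibly infinite iterator."""
    return itertools.islice(loader, max_records)


max_records = 1000  # assumed budget, e.g. the known number of records per epoch
seen = sum(1 for _ in take(endless_loader(), max_records))
print(seen)  # 1000
```

The same wrapper could be placed around the real loader before handing it to tqdm, so the code after the loop is reached.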

Google Cloud Bucket support

PyTorch XLA has torch_xla.utils.tf_record_reader.TfRecordReader, which supports loading from GCB. When I use the GCB location for TFRecordDataset, I get

FileNotFoundError: [Errno 2] No such file or directory: 'gs://xxx'

I wonder whether this repo also supports it.

How to transform the datasets?

Hi! It's nice of you to provide this, but I still have some problems.
I see this code in datasets.py:

worker_info = torch.utils.data.get_worker_info()
if worker_info is not None:
    shard = worker_info.id, worker_info.num_workers
    np.random.seed(worker_info.seed % np.iinfo(np.uint32).max)
else:
    shard = None
it = reader.tfrecord_loader(
    self.data_path, self.index_path, self.description, shard)
if self.shuffle_queue_size:
    it = iterator_utils.shuffle_iterator(it, self.shuffle_queue_size)

Does 'iterator_utils' have other methods for transforming records, like calling the map function in TensorFlow? If it does, where can I find more information about how to use this with tfrecord?
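
Other code on this page passes a transform callable to TFRecordDataset, which receives each record as a dict of features. Assuming that is the only hook available, several map-style steps can be chained into one callable. A sketch in plain Python; compose, scale_label, and add_flag are illustrative names, not part of the library:

```python
from functools import reduce


def compose(*steps):
    """Chain functions that each take and return a feature dict."""
    def transform(features):
        return reduce(lambda acc, step: step(acc), steps, features)
    return transform


def scale_label(features):
    # Example step: rescale a numeric field.
    features["label"] = features["label"] * 2
    return features


def add_flag(features):
    # Example step: derive a new field from an existing one.
    features["is_positive"] = features["label"] > 0
    return features


transform = compose(scale_label, add_flag)
print(transform({"label": 3}))  # {'label': 6, 'is_positive': True}
```

The resulting callable could then be passed as transform=transform when constructing the dataset.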

Creating the index file of a tfrecord file fails with "Failed to parse TFRecord"

/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py:125: RuntimeWarning: 'tfrecord.tools.tfrecord2idx' found in sys.modules after import of package 'tfrecord.tools', but prior to execution of 'tfrecord.tools.tfrecord2idx'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
Failed to parse TFRecord.

The TFRecord's compression_type is GZIP.
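
If the index tool reads the file as raw, uncompressed records (an assumption based on the error above), a GZIP-compressed file would have to be decompressed before indexing. A sketch using only the standard library; is_gzip and decompress_to are hypothetical helpers:

```python
import gzip
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def decompress_to(src_path, dst_path):
    """Write an uncompressed copy of a gzip-compressed file."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

The index could then be built from the decompressed copy rather than the original file.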

How to extract the string type data?

How to read tf.VarLenFeature in the records?

My tfrecords file contains some string arrays stored as tf.VarLenFeature, but I don't know how to read them in PyTorch.

Random access

Are there plans to add support for random access of individual samples? If an index file is supplied, this could be done pretty efficiently.
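
A sketch of what index-based random access could look like. It assumes that each line of the index file starts with the byte offset of a record (the `[ 0 122978]` output in another issue on this page suggests offset and size columns) and that the file uses the standard TFRecord framing: an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. load_offsets and read_record_at are hypothetical helpers, and the CRCs are not verified:

```python
import struct


def load_offsets(index_path):
    """Read the first column (byte offsets) of a whitespace-separated index file."""
    with open(index_path) as f:
        return [int(line.split()[0]) for line in f if line.strip()]


def read_record_at(data_file, offset):
    """Return the raw payload of the record starting at the given byte offset."""
    data_file.seek(offset)
    header = data_file.read(12)  # 8-byte length + 4-byte CRC of the length
    if len(header) != 12:
        raise EOFError("truncated record header")
    (length,) = struct.unpack("<Q", header[:8])
    payload = data_file.read(length)
    if len(payload) != length:
        raise EOFError("truncated record payload")
    return payload
```

The returned bytes would still need to be parsed as a tf.train.Example before use.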

Creating index file from multiple tfrecord files?

How might one go about creating an index file before using MultiTFRecordDataset?

In my case, the dataset consists of 20 files in the directory data/proc_data/IMDB/unsup/bt-0.9/0/.

Thanks!
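
One approach is to build one index file per data file. The sketch below only plans the file pairs with the standard library; the indexing call in the trailing comment, tfrecord2idx.create_index(data_path, index_path), is taken from another issue on this page. plan_index_files and the ".index" suffix are assumptions:

```python
from pathlib import Path


def plan_index_files(data_dir, index_dir, suffix=".index"):
    """Pair every file in data_dir with an index path of the same name in index_dir."""
    data_dir, index_dir = Path(data_dir), Path(index_dir)
    return [
        (str(path), str(index_dir / (path.name + suffix)))
        for path in sorted(data_dir.iterdir())
        if path.is_file()
    ]


# Assumed usage with the library (not executed here):
# from tfrecord.tools import tfrecord2idx
# for data_path, index_path in plan_index_files("data", "data/index"):
#     tfrecord2idx.create_index(data_path, index_path)
```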

RuntimeError: stack expects each tensor to be equal size, but got [14] at entry 0 and [13] at entry 3

Can you help me fix this error (RuntimeError: stack expects each tensor to be equal size, but got [14] at entry 0 and [13] at entry 3)? I have tfrec files in which the image names have different lengths. I think that is the reason.

CODE:

import torch
import cv2

from tfrecord.torch.dataset import TFRecordDataset
import numpy as np

def decode_image(features):
    features["image"] = cv2.imdecode(features["image"], -1)
    return features

tfrecord_path = "train_dataset/tfrecords/ld_train01-2140.tfrec"
index_path = None
description = {"image": "byte", "target": "int", "image_name": "byte"}

dataset = TFRecordDataset(tfrecord_path, index_path, description, transform = decode_image)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

data = next(iter(loader))
print(data)
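
The default collate function stacks every field into one tensor, which fails when a field such as image_name has a different length in each sample. One possible workaround is a custom collate_fn that keeps each field as a plain list. A minimal sketch in pure Python; collate_as_lists is a hypothetical name:

```python
def collate_as_lists(batch):
    """Group a list of feature dicts by key without stacking the values."""
    return {key: [sample[key] for sample in batch] for key in batch[0]}


# Variable-length values stay intact instead of triggering a stack error.
batch = [
    {"image_name": b"img_0001.jpg", "target": 1},
    {"image_name": b"img_02.jpg", "target": 0},
]
print(collate_as_lists(batch)["target"])  # [1, 0]
```

It could be passed as torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate_as_lists); fixed-size fields can then be stacked by hand where needed.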

Still RuntimeError: Failed to read the record. But I can provide details how TFRecord is generated

Traceback (most recent call last):
  File "torchRead.py", line 11, in
    data = next(temp)
  File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/tfrecord/reader.py", line 86, in tfrecord_loader
    for record in record_iterator:
  File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/tfrecord/reader.py", line 53, in tfrecord_iterator
    yield from read_records()
  File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/tfrecord/reader.py", line 47, in read_records
    raise RuntimeError("Failed to read the record.")
RuntimeError: Failed to read the record.

Here is the definition of _float_feature:
def _float_feature(value):  # used
    return tf.train.Feature(float_list=tf.train.FloatList(value=list(value)))

example = tf.train.Example(features=tf.train.Features(
    feature={
        "ori_image": _float_feature(image.reshape(-1)),
        "aug_image": _float_feature(aug_image.reshape(-1)),
    }))

Using this config I can read out the file with TensorFlow:
aug_record_spec = {
    "ori_image": tf.FixedLenFeature([224 * 224 * 3], tf.float32),
    "aug_image": tf.FixedLenFeature([224 * 224 * 3], tf.float32),
}

I know the error is raised by the check
file.readinto(datum_bytes_view) != length
I also printed the two values:
the first is 1204282 = 2 * 602141, which cannot be decomposed any further;
the second is 1048576 = 2^20.

I hope you will look into this. Thanks very much!
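
Two float32 arrays of 224 * 224 * 3 values occupy 1204224 bytes, which is consistent with 1204282 being the record length. If the 2^20 value is the number of bytes that a single readinto call returned (an assumption; the report does not say which number is which), the read was short rather than the file being corrupt. A sketch of a loop that keeps reading until the buffer is full; read_exactly is a hypothetical helper, not the library's code:

```python
def read_exactly(file, buffer):
    """Fill buffer completely from file, tolerating short reads.

    Returns the number of bytes read, which is less than len(buffer)
    only if the end of the file was reached first.
    """
    view = memoryview(buffer)
    total = 0
    while total < len(view):
        count = file.readinto(view[total:])
        if not count:  # 0 or None means no more data is available
            break
        total += count
    return total
```

A reader could then compare read_exactly(file, datum_bytes_view) with length instead of relying on a single readinto call.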

Error when loading string feature

I'm getting some errors when loading a byte feature from Waymo's open data motion dataset (I can't share the data because of the license, but it's available here: https://waymo.com/open/).

Minimal reproducible example:

import torch
from tfrecord.torch.dataset import TFRecordDataset

tfrecord_path = "uncompressed_tf_example_training_training_tfexample.tfrecord-00000-of-01000"
index_path = None
scenario_features = {
        'scenario/id': 'byte',
}

dataset = TFRecordDataset(tfrecord_path, index_path, scenario_features)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

data = next(iter(loader))
print(data)

Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-32-b161469c8e99> in <module>
     11 loader = torch.utils.data.DataLoader(dataset, batch_size=32)
     12 
---> 13 data = next(iter(loader))
     14 print(data)

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    515             if self._sampler_iter is None:
    516                 self._reset()
--> 517             data = self._next_data()
    518             self._num_yielded += 1
    519             if self._dataset_kind == _DatasetKind.Iterable and \

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    555     def _next_data(self):
    556         index = self._next_index()  # may raise StopIteration
--> 557         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    558         if self._pin_memory:
    559             data = _utils.pin_memory.pin_memory(data)

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     33         else:
     34             data = next(self.dataset_iter)
---> 35         return self.collate_fn(data)
     36 
     37 

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     71         return batch
     72     elif isinstance(elem, container_abcs.Mapping):
---> 73         return {key: default_collate([d[key] for d in batch]) for key in elem}
     74     elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
     75         return elem_type(*(default_collate(samples) for samples in zip(*batch)))

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in <dictcomp>(.0)
     71         return batch
     72     elif isinstance(elem, container_abcs.Mapping):
---> 73         return {key: default_collate([d[key] for d in batch]) for key in elem}
     74     elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
     75         return elem_type(*(default_collate(samples) for samples in zip(*batch)))

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     61                 raise TypeError(default_collate_err_msg_format.format(elem.dtype))
     62 
---> 63             return default_collate([torch.as_tensor(b) for b in batch])
     64         elif elem.shape == ():  # scalars
     65             return torch.as_tensor(batch)

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     53             storage = elem.storage()._new_shared(numel)
     54             out = elem.new(storage)
---> 55         return torch.stack(batch, 0, out=out)
     56     elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
     57             and elem_type.__name__ != 'string_':

RuntimeError: stack expects each tensor to be equal size, but got [16] at entry 0 and [14] at entry 4

From TensorFlow I'm able to load it using this definition:

'scenario/id': tf.io.FixedLenFeature((), tf.string, default_value=None),

Loading a multi-tfrecord multi-split dataset

I have downloaded the open_image_v4 dataset from Tensorflow using the tfds utility (as well as many other similar datasets), and now I have a folder with the following structure:

open_image_v4
├── open_images_v4-test.tfrecord-00001-of-00512
├── open_images_v4-test.tfrecord-00002-of-00512
├── ...
├── open_images_v4-train.tfrecord-00001-of-01024
├── ...
└── open_images_v4-validation.tfrecord-00001-of-00128
├── ...

As you can see, I have 3 splits: train, test and validation. However, each of those splits is itself sharded into multiple files.

I created an index file for each of those TFrecord using GNU parallel:

parallel -j19 python3 -m tfrecord.tools.tfrecord2idx {} index/{}.index ::: *.tfrecord-*

From skimming over the reader code, it seems that you only support datasets that all fit in a single .tfrecord file.
Did I miss something or is there a workaround I could employ to read large subsets (train, valid, test) of the dataset?

I initially thought it might be possible to provide the input pattern as:

tfrecord_pattern = "/tmp/{split}_{idx}.tfrecord"

or in the case of my above file system:

tfrecord_pattern = "/tmp/open_images_v4-{split}.tfrecord-{idx:05d}-of-{total:05d}"

if that may be of any help for future PRs.
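
Other issues on this page construct MultiTFRecordDataset from a pattern with a single {} placeholder and a splits dict whose keys are substituted into it. Under that assumption, every shard of one split can be listed as its own entry with an equal weight. A sketch; shard_splits is a hypothetical helper and the prefix default mirrors the file names above:

```python
def shard_splits(filenames, split_name, prefix="open_images_v4-"):
    """Map each shard file of one split to an equal sampling weight."""
    shards = sorted(
        name[len(prefix):]
        for name in filenames
        if name.startswith(prefix + split_name + ".")
    )
    return {shard: 1.0 / len(shards) for shard in shards}


names = [
    "open_images_v4-test.tfrecord-00001-of-00512",
    "open_images_v4-train.tfrecord-00001-of-01024",
    "open_images_v4-train.tfrecord-00002-of-01024",
]
print(shard_splits(names, "train"))
```

The matching data pattern would then be the directory plus "open_images_v4-{}", with an analogous pattern for the index files.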

How to load int64 type?

Hi, thanks for this package!

Are int64/float64 data types supported?

I'm trying to load some int64 data but when I specify description = {"field_name": "int"} it just returns a 32-bit int. I also tried using torch.int64, int64, and long, but these all return a KeyError:

KeyError                                  Traceback (most recent call last)
<ipython-input-17-ccc38984de18> in <module>
     96 loader = torch.utils.data.DataLoader(dataset, batch_size=32)
     97 
---> 98 data = next(iter(loader))
     99 print(data)

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    515             if self._sampler_iter is None:
    516                 self._reset()
--> 517             data = self._next_data()
    518             self._num_yielded += 1
    519             if self._dataset_kind == _DatasetKind.Iterable and \

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    555     def _next_data(self):
    556         index = self._next_index()  # may raise StopIteration
--> 557         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    558         if self._pin_memory:
    559             data = _utils.pin_memory.pin_memory(data)

~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     26             for _ in possibly_batched_index:
     27                 try:
---> 28                     data.append(next(self.dataset_iter))
     29                 except StopIteration:
     30                     break

~/.venv/wenv/lib/python3.8/site-packages/tfrecord/reader.py in example_loader(data_path, index_path, description, shard)
    203         example.ParseFromString(record)
    204 
--> 205         yield extract_feature_dict(example.features, description, typename_mapping)
    206 
    207 

~/.venv/wenv/lib/python3.8/site-packages/tfrecord/reader.py in extract_feature_dict(features, description, typename_mapping)
    147             raise KeyError(f"Key {key} doesn't exist (select from {all_keys})!")
    148 
--> 149         processed_features[key] = get_value(typename, typename_mapping, key)
    150 
    151     return processed_features

~/.venv/wenv/lib/python3.8/site-packages/tfrecord/reader.py in get_value(typename, typename_mapping, key)
    128 
    129         def get_value(typename, typename_mapping, key):
--> 130             return process_feature(features[key], typename,
    131                                    typename_mapping, key)
    132     else:

~/.venv/wenv/lib/python3.8/site-packages/tfrecord/reader.py in process_feature(feature, typename, typename_mapping, key)
    100 
    101     if typename is not None:
--> 102         tf_typename = typename_mapping[typename]
    103         if tf_typename != inferred_typename:
    104             reversed_mapping = {v: k for k, v in typename_mapping.items()}

KeyError: 'long'

Issue with shuffle=True, when using with pytorch dataloader

Hi,

Thank you for the very useful repo.

ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True

Is there a way to use shuffling with this interface? Also, is there a way to precompute
the length of the dataset?
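
DataLoader rejects shuffle=True for iterable datasets, so any shuffling has to happen inside the iterator. The dataset code quoted in another issue on this page calls iterator_utils.shuffle_iterator with a shuffle_queue_size, which suggests a buffer-based shuffle. The sketch below illustrates that general technique in plain Python; it is not the library's implementation, and buffered_shuffle is a hypothetical name:

```python
import random


def buffered_shuffle(iterator, buffer_size, rng=random):
    """Yield items in approximately random order using a fixed-size buffer.

    buffer_size must be at least 1; larger buffers mix the stream better.
    """
    buffer = []
    for item in iterator:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase
            continue
        # Emit a random buffered item and put the new item in its place.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    rng.shuffle(buffer)  # drain whatever is left in random order
    yield from buffer
```

Every input item is emitted exactly once, so the stream length is unchanged; only the order differs.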

Thanks

ModuleNotFoundError: No module named 'tfrecord.torch'

As title suggests.

Probably tfrecord/__init__.py needs to be edited to also include

from .torch import *

and tfrecord/torch/__init__.py needs to include

from .dataset import TFRecordDataset, MultiTFRecordDataset

shard not used in MultiTFRecordDataset

Hi, I just noticed that MultiTFRecordDataset does not use shard, whereas TFRecordDataset does. Is this the intended behavior? It very well could be; there could be some dark magic which I am not aware of. However, to me it seems it is not. If it is not, I can submit a PR to fix it.

Thanks :)

No need to specify individual dtypes when loading

It seems cumbersome that the user needs to (re)specify the dtype of the feature they are trying to load from their TFRecord file. Within tf.example, we can extract both the tf_typename and the value from a specific example using the following:

field = example.features.feature[key].ListFields()[0]
tf_typename, value = field[0].name, field[1].value

Here, the key is the feature name under which the data was saved. Additionally, we assume each key has only one field.

MultiTFRecordDataset cannot stop

When we use MultiTFRecordDataset in a DataLoader, the loader reads data forever and never stops.

Adding new samples into an existing tfrecord?

Is it possible to add new samples into an existing tfrecord?

How to read TFRecs from GCS bucket in Colab?

Every time I try to read single or multiple tfrecords from a publicly available GCS bucket, a FileNotFoundError is raised, whereas the same path used in TensorFlow gives the expected output.

This is the error I am getting:-

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-20-2a5acbdc128e> in <module>()
     11                                transform=transforms)
     12 loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE)
---> 13 data = next(iter(loader))
     14 print(data)

6 frames
/usr/local/lib/python3.6/dist-packages/tfrecord/reader.py in tfrecord_iterator(data_path, index_path, shard)
     42         file (for a single record).
     43     """
---> 44     file = io.open(data_path, "rb")
     45 
     46     length_bytes = bytearray(8)

FileNotFoundError: [Errno 2] No such file or directory: 'gs://flowers-public/tfrecords-jpeg-192x192-2/flowers05-230.tfrec'

This is the colab notebook which I was trying to implement.

Please correct me, if I'm wrong somewhere.

for the reference

Error related to parameter 'splits' in MultiTFRecordDataset

Hi, I'm running experiments with PyTorch on the YouTube-8M dataset.

My data preprocessing code is:

    tfrecord_pattern = "/media/canvolcnao/新加卷/ML/Dataset/YT8M/test/{}.tfrecord"
    # index_pattern = "/path/to/{}.index"
    splits={}
    val_set_len=3844
    tem=''
    for i in range(val_set_len):
        splits['validate%(num)04d'%{'num':i}]=1/val_set_len
    description = {"labels": "byte", "video_id": "int","mean_rgb":"list","mean_audio":"list"}
    dataset = MultiTFRecordDataset(tfrecord_pattern, splits,description)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32)
    data = next(iter(loader))
    print(data)

However, a weird error occurred:

Traceback (most recent call last):
  File "/home/canvolcnao/ZCZ/SourceCode/VSCode/Python/ML/YT8M/SNR/dataPre.py", line 42, in <module>
    data = next(iter(loader))
  File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 277, in __iter__
    return _SingleProcessDataLoaderIter(self)
  File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 381, in __init__
    self._dataset_kind, self._dataset, self._auto_collation, self._collate_fn, self._drop_last)
  File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 41, in create_fetcher
    return _utils.fetch._IterableDatasetFetcher(dataset, auto_collation, collate_fn, drop_last)
  File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 21, in __init__
    self.dataset_iter = iter(dataset)
  File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/tfrecord/torch/dataset.py", line 129, in __iter__
    self.data_pattern, self.index_pattern, self.splits, self.description)
  File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/tfrecord/reader.py", line 215, in multi_tfrecord_loader
    for split in splits.keys()]
  File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/tfrecord/reader.py", line 215, in <listcomp>
    for split in splits.keys()]
AttributeError: 'dict' object has no attribute 'format'

I can't figure this out, since there doesn't seem to be any 'dict' object whose 'format' attribute is being accessed.
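
Other calls on this page pass four leading arguments, for example MultiTFRecordDataset(tfrecord_pattern, None, train_splits, description), with the index pattern in second position. If that order holds for the installed version (an assumption), the three-argument call above binds the splits dict to index_pattern, and calling .format on it would raise exactly this AttributeError. The stand-in function below only mimics that argument order to show the effect; it is not the library's code:

```python
def multi_loader_stub(data_pattern, index_pattern, splits, description=None):
    """Hypothetical stand-in mimicking the assumed positional argument order."""
    paths = []
    for split in splits:
        data_path = data_pattern.format(split)
        # index_pattern may be None when no index files exist.
        index_path = index_pattern.format(split) if index_pattern is not None else None
        paths.append((data_path, index_path))
    return paths


splits = {"validate0000": 0.5, "validate0001": 0.5}
description = {"labels": "byte", "video_id": "int"}

try:
    # Index pattern omitted: the splits dict lands in the index_pattern slot.
    multi_loader_stub("/data/{}.tfrecord", splits, description)
except AttributeError as error:
    print(error)  # 'dict' object has no attribute 'format'

# Passing None explicitly keeps every argument in its intended slot.
print(multi_loader_stub("/data/{}.tfrecord", None, splits, description))
```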

Error while making tfrecords

Hi, thanks for sharing this repo with us. I am trying to create tfrecords but I get this error:

OverflowError: Python int too large to convert to C long

I am passing the images as raw bytes. Any hint as to why this is happening?
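
A traceback further down this page locates the same OverflowError at masked = np.uint32(masked) in the writer's masked_crc. One plausible cause, assuming the intermediate value exceeds 32 bits before the conversion, is a platform where the C long is 32 bits wide (as on Windows). The sketch below computes the TFRecord masked CRC with plain Python integers and reduces the result to 32 bits explicitly. The table-based crc32c is a slow illustrative implementation, not the crc32c package:

```python
def _build_table():
    """Precompute the byte-wise table for CRC-32C (Castagnoli, reflected)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data):
    """Compute CRC-32C of a bytes-like object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """Apply the TFRecord mask: rotate right by 15 bits, add a constant, keep 32 bits."""
    crc = crc32c(data)
    rotated = (crc >> 15) | (crc << 17)
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF
```

With NumPy, applying the same & 0xFFFFFFFF mask before the np.uint32 conversion should keep the value in range.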

iterator_utils.sample_iterators() hangs indefinitely.

I added a bunch of debug prints to see why MultiTFRecordDataset fails for tfrecords >=4. It seems when iterating over the data loader, first a function call is made for reader.multi_tfrecord_loader() which in the end calls iterator_utils.sample_iterators().

I added print statements inside sample_iterators() and this part of the code is where the program gets stuck:

while True:
        choice = np.random.choice(len(ratios), p=ratios)
        print('made my choice, yielding')
        yield next(iterators[choice])

This should be run as many times as the batch size (I used 10 to debug). When using only 2 tfrecord files, that's exactly what happens. Here's the output -

iter called
Length of loaders is 2
iter obtained, returning
iterators 2
Calling iteration
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding

But when I try with 4 or more tfrecords, it gets stuck in between and the data loading never finishes. Here's the output-

iter called
Length of loaders is 4
iter obtained, returning
iterators 4
Calling iteration
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding

The number of steps after which it hangs varies. It's random, sometimes hangs after 2 yield calls, at other times after 4 (as above) in my re-runs.

I have a feeling this has to do with how the sub-processes are being spawned. Any ideas why it's happening and how to fix it?

Thanks in advance!

For reference, here's the code I am using:-

import cv2
import numpy as np
import torch

from tfrecord.torch.dataset import MultiTFRecordDataset

tfrecord_path = "/om/user/xboix/data/ImageNet/train-00277-of-01024"

def resize_image(features):
    # get BGR image from bytes
    img = cv2.imdecode(features["image/encoded"], -1)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    features["image/encoded"] = cv2.resize(img, (224,224), interpolation = cv2.INTER_AREA)
    features["image/encoded"] = np.transpose(features["image/encoded"], (2, 0, 1))
    return features

train_splits = {}
tfrecord_pattern='/om/user/xboix/data/ImageNet/{}'
for i in range(4):
    key = "train-%05d-of-01024"%i
    value = 1.0
    train_splits[key] = value
    

description = {"image/encoded": "byte", "image/class/label": "int", "image/height": "int", "image/width": "int"}
train_dataset = MultiTFRecordDataset(tfrecord_pattern, None, train_splits, description, transform=resize_image)
train_loader_tfr = torch.utils.data.DataLoader(train_dataset, batch_size=10)

for data in train_loader_tfr:
    break

Is the byte image data the output of `open('xxx.jpg','rb').read()`?

tfrecord/tfrecord/reader.py

Line 116 in 74b2d24

value = np.frombuffer(value[0], dtype=np.uint8)

When I use open('xxx.jpg','rb').read() to read the image, I get the wrong shape:

data = next(iter(loader))
print(data['image'].shape)

torch.Size([1, 12751])

What should I do?

MultiTFRecordDataset keeps iterating through dataset(s)

If we try to iterate through multiple datasets at the same time using torch's DataLoader,

import os
import numpy as np
import torch.utils.data

import tfrecord
import tfrecord.torch.dataset
from tfrecord.tools import tfrecord2idx


# Create tmp dir for saving file(s)
rootdir = "tmp/"
if not os.path.isdir(rootdir):
    os.makedirs(rootdir)

# Create and write records to file(s)
data_pattern = os.path.join(rootdir, "dataset{}.tfrecord")
index_pattern = os.path.join(rootdir, "dataset{}.index")
for i in range(2):
    data_path = data_pattern.format(i)
    writer = tfrecord.TFRecordWriter(data_path)
    for idx in range(50):
        writer.write({
            "image": (np.random.bytes(20), "byte"),
            "label": ((np.random.rand(1)*(10**i)).tolist(), "float"),
            "index": ([idx], "int")
        })
    writer.close()

    # Create index
    tfrecord2idx.create_index(data_path, index_pattern.format(i))

splits = {"0": 0.8, "1": 0.2}
description = {"image": "byte", "label": "float"}
dataset = tfrecord.torch.dataset.MultiTFRecordDataset(data_pattern,
    index_pattern=index_pattern, splits=splits, description=description)

loader = torch.utils.data.DataLoader(dataset, batch_size=10)
# data = next(iter(loader))
# print(data["label"])

for idx, batch in enumerate(loader):
    print(idx, [arr.size() for arr in batch.values()])

Current behavior

43 [torch.Size([10, 20]), torch.Size([10, 1])]
44 [torch.Size([10, 20]), torch.Size([10, 1])]
45 [torch.Size([10, 20]), torch.Size([10, 1])]
46 [torch.Size([10, 20]), torch.Size([10, 1])]
47 [torch.Size([10, 20]), torch.Size([10, 1])]
48 [torch.Size([10, 20]), torch.Size([10, 1])]
^CTraceback (most recent call last):
  File "test.py", line 42, in <module>
    for idx, batch in enumerate(loader):
  File "/Users/ayushkarnawat/miniconda3/envs/tfrecord/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/Users/ayushkarnawat/miniconda3/envs/tfrecord/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/Users/ayushkarnawat/miniconda3/envs/tfrecord/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/iterator_utils.py", line 19, in sample_iterators
    yield next(iterators[choice])
  File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/iterator_utils.py", line 8, in cycle
    for element in iterator_fn():
  File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/reader.py", line 97, in tfrecord_loader
    if key not in example.features.feature:
KeyboardInterrupt

Expected behavior

Since there are a total of 100 records between the two files, there should be exactly 10 batches each of size 10.
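
The traceback shows sample_iterators drawing from iterators that are wrapped in cycle, which would explain why iteration never ends. A sketch of a finite alternative that still samples by ratio but drops each source once it is exhausted. This illustrates one possible behavior, not the library's code, and sample_until_exhausted is a hypothetical name:

```python
import random


def sample_until_exhausted(iterators, ratios, rng=random):
    """Sample from several iterators by weight until all of them are exhausted."""
    active = [(iter(it), ratio) for it, ratio in zip(iterators, ratios)]
    while active:
        weights = [ratio for _, ratio in active]
        choice = rng.choices(range(len(active)), weights=weights)[0]
        try:
            yield next(active[choice][0])
        except StopIteration:
            # This source is done; keep sampling from the remaining ones.
            del active[choice]
```

With two sources of 50 records each, this yields exactly 100 records, which a DataLoader with batch_size=10 would group into 10 batches.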

TypeError: object of type 'TFRecordDataset' has no len()

The __len__() method is not implemented on TFRecordDataset, which prevents using torch functions such as random_split().

Some questions

Hello @vahidk and thank you for this awesome repository. I would like to ask:

To my understanding, TFRecordDataset returns an infinite iterator over the TFRecord. How can I change this behavior? This is needed for the evaluation step. If I somehow got it wrong and it is finite, how do I make it finite?
I noticed that when I read a large TFRecord with TFRecordDataset, iterating over the dataset is a lot slower than iterating over a small TFRecord. This is not a surprise of course, but is it better to split the large TFRecord into smaller ones and use MultiTFRecordDataset instead? If so, how can you split TFRecords?

Thank you again.

Can't create tfrecord, and load bytes in tfrecord not correct

I tried to create a tfrecord with the following code (I have installed crc32c with pip):

writer = tfrecord.TFRecordWriter(r"I:\testdata\patch_test2\a.tfrecord")
writer.write({
    "label": (1.4, "float")
})

But it returns an error:

Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "D:\software\Anaconda3\envs\pt36\lib\site-packages\tfrecord\writer.py", line 45, in write
    self.file.write(TFRecordWriter.masked_crc(length_bytes))
  File "D:\software\Anaconda3\envs\pt36\lib\site-packages\tfrecord\writer.py", line 55, in masked_crc
    masked = np.uint32(masked)
OverflowError: Python int too large to convert to C long

So I created the tfrecord with TensorFlow instead, using the following example:

example = tf.train.Example(features=tf.train.Features(feature={
    "img_name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_name.encode('utf-8')])),
    'patch_name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[patch_name.encode('utf-8')])),
    'img_source_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_source_raw])),
    'img_mask_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_mask_raw])),
    'img_shape': tf.train.Feature(int64_list=tf.train.Int64List(value=shape))
}))
writer.write(example.SerializeToString())

But when I load this data with tfrecord, the result is not what I expect:

loader = tfrecord.tfrecord_loader(p, None, {
    'label': 'int',
    'img_name': 'byte',
    'img_raw': 'byte',
    'img_shape': 'int'})

for record in loader:
    # print(record)
    print('img_shape:{}'.format(record['img_shape']))
    print('label:{}'.format(record['label']))
    print('img_name:{}'.format(record['img_name']))

The printed result looks like this:

img_shape:[62 86 75  3]
label:[1]
img_name:[112 114 101 112 114 111  99 101 115 115  95 112  97 100  95  85  78  69
  48  48  48  57  50  95  49 120  49 120  49  95  86  79  73  46 110 112
 121  95  49]
img_raw:[185 235  66 ...  15  39 190]
img_shape:[62 86 75  3]

The img_name becomes a numpy array rather than a bytes object.
So I would like to know how to create a tfrecord file with this library and how to load bytes. Thanks!
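
The values printed for img_name are the UTF-8 byte codes of the original string, so the array can be converted back into text. A sketch using the record shown above; the list below is copied from that output, and with a NumPy uint8 array, arr.tobytes() would give the same bytes object:

```python
# Byte values copied from the printed img_name above.
img_name_codes = [
    112, 114, 101, 112, 114, 111, 99, 101, 115, 115, 95, 112, 97, 100, 95, 85, 78, 69,
    48, 48, 48, 57, 50, 95, 49, 120, 49, 120, 49, 95, 86, 79, 73, 46, 110, 112,
    121, 95, 49,
]

img_name = bytes(img_name_codes).decode("utf-8")
print(img_name)  # preprocess_pad_UNE00092_1x1x1_VOI.npy_1
```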

Why does the `splits` parameter of MultiTFRecordDataset need sampling probabilities?

Should all the shards of the dataset be iterated instead, with the sampling handed over to the DataLoader?

Cannot use tfrecord

Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.7/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/anaconda3/lib/python3.7/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
ValueError: source code string cannot contain null bytes

error in example

Hello,
Thanks for the library. I am following the example from the README exactly.

import tfrecord
path = '/tmp/xyz.tfrecord'
writer = tfrecord.TFRecordWriter(path)
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
           {'tokens': [[0, 0, 1], [0, 1, 0], [1, 0, 0]], 'seq_labels': [0, 1, 1]})
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
           {'tokens': [[0, 0, 1], [1, 0, 0]], 'seq_labels': [0, 1]})
writer.close()

import tfrecord

context_description = {"length": "int", "label": "int"}
sequence_description = {"tokens": "int ", "seq_labels": "int"}
loader = tfrecord.tfrecord_loader(path, None,
                                context_description,
                                sequence_description=sequence_description)

for context, sequence_feats in loader:
    print(context["label"])
    print(sequence_feats["seq_labels"])

But I got the following error:

Traceback (most recent call last):
  File "<input>", line 5, in <module>
  File "/home/chen/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tfrecord/writer.py", line 55, in write
    record = TFRecordWriter.serialize_tf_sequence_example(datum, sequence_datum)
  File "/home/chen/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tfrecord/writer.py", line 148, in serialize_tf_sequence_example
    features = {key: serialize_repeated(value, dtype) for key, (value, dtype) in features_datum.items()}
  File "/home/chen/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tfrecord/writer.py", line 148, in <dictcomp>
    features = {key: serialize_repeated(value, dtype) for key, (value, dtype) in features_datum.items()}
ValueError: too many values to unpack (expected 2)

Would you like to take a look at this? Many thanks!

Error happened when reading the index

"tfrecord/reader.py", line 56
index = np.loadtxt(index_path, dtype=np.int64)[:, 0]
IndexError: too many indices for array.

Printing 'np.loadtxt(index_path, dtype=np.int64)' gives '[ 0 122978]'.
Thank you for your help!
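
The printed value suggests the index file has a single row: np.loadtxt returns a 1-D array in that case, so [:, 0] fails. Passing ndmin=2 to np.loadtxt is one way to keep the result two-dimensional. A dependency-free sketch of reading the first column follows, assuming each line holds a record offset followed by a length (the issue shows "0 122978" but does not label the columns). The name read_offsets is hypothetical.

```python
import io


def read_offsets(lines):
    """Return the first column of an index file as a list of ints.

    Works the same way for one line or many, unlike slicing a 1-D array.
    """
    offsets = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            offsets.append(int(fields[0]))
    return offsets


single_row = io.StringIO("0 122978\n")
print(read_offsets(single_row))
```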

Get length of the tfrecord dataset

Can we get the length of the tfrecord dataset?
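
One possible approach, sketched under the assumption that the file uses the standard TFRecord framing (8-byte little-endian length, 4-byte CRC of the length, payload, 4-byte CRC of the payload): hop from header to header and count, without decoding any payload. The helpers write_record and count_records are hypothetical, and the CRC fields are zero placeholders because this sketch never verifies them.

```python
import io
import struct


def write_record(stream, payload):
    """Append one record using TFRecord framing; CRC fields are zero placeholders."""
    stream.write(struct.pack("<Q", len(payload)))  # 8-byte little-endian length
    stream.write(b"\x00" * 4)                      # CRC of the length (not computed here)
    stream.write(payload)
    stream.write(b"\x00" * 4)                      # CRC of the payload (not computed here)


def count_records(stream):
    """Count records by skipping over each payload."""
    count = 0
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return count
        (length,) = struct.unpack("<Q", header)
        stream.seek(4 + length + 4, io.SEEK_CUR)  # length CRC + payload + payload CRC
        count += 1


buf = io.BytesIO()
for payload in (b"a", b"bcd", b""):
    write_record(buf, payload)
buf.seek(0)
n = count_records(buf)
print(n)
```

If an index file exists, counting its non-empty lines should give the same number with far less I/O.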

Scaling to numerous TFRecordDataset with shuffling

Hi,

First of all, thanks a lot for this great library.

Context: Based on your library, I am trying to implement a PyTorch version of the recent few-shot image classification benchmark https://github.com/google-research/meta-dataset. The idea is quite simple: each category of a dataset has its own .tfrecords file (https://github.com/google-research/meta-dataset/blob/d6574b42c0f501225f682d651c631aef24ad0916/meta_dataset/data/reader.py#L283). Therefore, a TFRecordDataset is created for each class of the dataset, and each is shuffled with a buffer of 1000 samples. Importantly, in this setting, some datasets have over 1000 classes, yet their pipeline somehow scales pretty well in terms of memory.

Problem: If I want to reproduce this with the current library, each TFRecordDataset will load 1000 (processed) samples into memory for each of the 1000 classes in order to shuffle them, which quickly becomes intractable. In an attempt to solve this, I tried to postpone the decoding until later in the pipeline, so that all each dataset stores in memory is raw records, which hopefully require less memory. Specifically, my idea was to replace

tfrecord/tfrecord/reader.py

Line 205 in d721799

yield extract_feature_dict(example.features, description, typename_mapping)

by:

yield record

and perform the parsing only for the current batch of data. Unfortunately, parsing a record seems to work properly only if that record was the last one yielded by the iterator. Specifically, consider the following simple modification of TFRecordDataset:

        it = reader.tfrecord_loader(data_path=self.data_path,
                                    index_path=self.index_path,
                                    description=self.description,
                                    shard=shard,
                                    sequence_description=self.sequence_description)
        while True:
            it1 = next(it)
            it2 = next(it)
            yield it2

Then downstream parsing of the yielded raw record works just fine. However, a simple modification:

        while True:
            it1 = next(it)
            it2 = next(it)
            yield it1

will result in google.protobuf.message.DecodeError: Error parsing message down the line.

My questions are:

Do you think my overall approach is suited to making the problem scalable?
Do you have any idea why I'm facing the problem I just described?

Thanks a lot !
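
One plausible explanation, not confirmed anywhere in this thread, is that the iterator yields a view into a buffer that it overwrites on the next read. The sketch below reproduces that behaviour with a hypothetical reused_buffer_reader and shows that copying with bytes() before advancing keeps an earlier record intact.

```python
def reused_buffer_reader(payloads):
    """Yield views into a single buffer that is overwritten on every step."""
    buffer = bytearray(max(len(p) for p in payloads))
    view = memoryview(buffer)
    for payload in payloads:
        view[:len(payload)] = payload   # overwrite the shared buffer in place
        yield view[:len(payload)]       # hand out a view, not a copy


payloads = [b"first-record", b"SECOND-RECORD"]
reader = reused_buffer_reader(payloads)

first_view = next(reader)
first_copy = bytes(first_view)  # snapshot taken before the reader advances
next(reader)                    # the shared buffer now holds the second record

print(first_copy)               # still the first record
print(bytes(first_view))        # no longer the first record
```

If this is what happens, storing bytes(record) in the shuffle buffer would keep raw records valid at the cost of one copy per record.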

is there any way to process the TFRecordDataset?

Hi, thanks for your contributions! I have a problem when building torch.utils.data.DataLoader(TFRecordDataset, batch_size): the data in my tfrecords files is encoded, so the records have different lengths.
Is there any way to process the TFRecordDataset, or can I only preprocess in the feature descriptor?
Thank you so much
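
A common way to batch variable-length records is to pad them inside a custom function passed as the DataLoader's collate_fn argument. The pure-Python sketch below shows only the padding step; pad_batch is a hypothetical helper, and converting the result to tensors is left out.

```python
def pad_batch(sequences, pad_value=0):
    """Pad variable-length sequences to the longest one.

    Returns the padded rows together with the original lengths so that
    downstream code can mask out the padding.
    """
    lengths = [len(seq) for seq in sequences]
    width = max(lengths, default=0)
    padded = [list(seq) + [pad_value] * (width - len(seq)) for seq in sequences]
    return padded, lengths


padded, lengths = pad_batch([[1, 2, 3], [4]])
print(padded, lengths)
```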

__getitem__(index)

Hey,
how about implementing __getitem__() on the tfrecord dataset?
I use that function quite often to determine exactly which sample I want to look at.
regards
btw. great work!
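
Random access is possible in principle when the byte offset of every record is known, which is what an index file provides. A sketch under the assumption of standard TFRecord framing (8-byte little-endian length, 4-byte CRC, payload, 4-byte CRC); build_index and read_record_at are hypothetical helpers, not tfrecord functions.

```python
import io
import struct


def frame(payload):
    """Wrap a payload in TFRecord framing with zero placeholder CRCs."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


def build_index(stream):
    """Return the byte offset at which each record header starts."""
    offsets = []
    stream.seek(0)
    while True:
        start = stream.tell()
        header = stream.read(8)
        if len(header) < 8:
            return offsets
        (length,) = struct.unpack("<Q", header)
        stream.seek(4 + length + 4, io.SEEK_CUR)  # skip CRCs and payload
        offsets.append(start)


def read_record_at(stream, offset):
    """Seek straight to one record and return its payload bytes."""
    stream.seek(offset)
    (length,) = struct.unpack("<Q", stream.read(8))
    stream.seek(4, io.SEEK_CUR)  # skip the CRC of the length
    return stream.read(length)


buf = io.BytesIO(frame(b"zero") + frame(b"one") + frame(b"two!"))
offsets = build_index(buf)
print(read_record_at(buf, offsets[1]))
```

A map-style dataset could call something like read_record_at from its __getitem__, with the offsets loaded once in the constructor.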

How to load data in multiple processes?

OverflowError: Python int too large to convert to C long

import tfrecord

writer = tfrecord.TFRecordWriter("E:/File/package_by_mmh/train_set_128/train.tfrecord")
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
             {'tokens': ([[0, 0, 1], [0, 1, 0], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1, 1], 'int')})
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
             {'tokens': ([[0, 0, 1], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1], 'int')})
writer.close()

When using the above example, an error occurs.

Traceback (most recent call last):
  File "E:/File/package_by_mmh/DeepChroma/testTFrcord.py", line 4, in <module>
    writer.write({'length': (3, 'int'), 'label': (1, 'int')},
  File "E:\Anaconda3\lib\site-packages\tfrecord\writer.py", line 60, in write
    self.file.write(TFRecordWriter.masked_crc(length_bytes))
  File "E:\Anaconda3\lib\site-packages\tfrecord\writer.py", line 70, in masked_crc
    masked = np.uint32(masked)
OverflowError: Python int too large to convert to C long
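
The overflow most likely comes from the checksum masking step: the TFRecord format rotates the 32-bit CRC right by 15 bits and adds the constant 0xa282ead8, and in plain Python integers that sum can exceed 32 bits. On Windows a C long is 32 bits wide, so converting the oversized value fails. A sketch of keeping the arithmetic inside 32 bits with an explicit bit mask follows; mask_crc and unmask_crc are hypothetical names and this is not the library's code.

```python
MASK_DELTA = 0xA282EAD8
UINT32 = 0xFFFFFFFF


def mask_crc(crc):
    """Rotate right by 15 bits and add the delta, wrapping at 32 bits."""
    rotated = ((crc >> 15) | (crc << 17)) & UINT32
    return (rotated + MASK_DELTA) & UINT32


def unmask_crc(masked):
    """Inverse of mask_crc: subtract the delta and rotate left by 15 bits."""
    rotated = (masked - MASK_DELTA) & UINT32
    return ((rotated >> 17) | (rotated << 15)) & UINT32


# The result always fits in an unsigned 32-bit integer.
print(hex(mask_crc(0xFFFFFFFF)))
```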

Question about reading binary data

Some TFRecord files do not contain tf.Example messages; each record is a raw binary payload instead.

The following works well in tf.data.TFRecordDataset.

dataset = tf.data.TFRecordDataset(FILENAME, compression_type='')
data = next(dataset.as_numpy_iterator())
sc = Scenario()
sc.ParseFromString(data)
print(sc.timestamps_seconds)

When I try to use tfrecord (this repo) to replace it, I found that data here is a dictionary.

dataset = TFRecordDataset(tfrecord_path, None, compression_type=None)
data = next(iter(dataset))
print(data)  # {}

The following is a sample from the Waymo motion prediction dataset:
Google Drive
Official Protobuf
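
A sketch of reading the raw payloads directly, bypassing Example parsing altogether. It assumes uncompressed files with standard TFRecord framing and ignores the CRC fields; iter_raw_records is a hypothetical helper, and the sample payloads are stand-ins for serialized Scenario messages.

```python
import io
import struct


def iter_raw_records(stream):
    """Yield each record payload as bytes, without interpreting it."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        (length,) = struct.unpack("<Q", header)
        stream.read(4)                  # CRC of the length, ignored in this sketch
        payload = stream.read(length)
        stream.read(4)                  # CRC of the payload, ignored in this sketch
        yield payload


def frame(payload):
    """Wrap a payload in TFRecord framing with zero placeholder CRCs."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


stream = io.BytesIO(frame(b"scenario-1") + frame(b"scenario-2"))
raw = list(iter_raw_records(stream))
print(raw)
```

Each yielded item could then be handed to sc.ParseFromString(...) exactly as in the TensorFlow snippet above.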

Map values to list

Piggybacking off the comment "This is a breaking change. Let's not combine it with cleanup. It has to be made as a separate PR. Unless there's a use case for this I prefer to keep the old way."

Originally posted by @vahidk in #27

For example, if the user inputs a value of type int (i.e. "index": (5, "int")) without enclosing it within [], then the following error occurs:

Traceback (most recent call last):
  File "test.py", line 25, in <module>
    "index": (idx, "int")
  File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/writer.py", line 41, in write
    record = TFRecordWriter.serialize_tf_example(datum)
  File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/writer.py", line 83, in serialize_tf_example
    features = {key: serialize[dtype](value) for key, (value, dtype) in datum.items()}
  File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/writer.py", line 83, in <dictcomp>
    features = {key: serialize[dtype](value) for key, (value, dtype) in datum.items()}
  File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/writer.py", line 80, in <lambda>
    int64_list=example_pb2.Int64List(value=f))
TypeError: Value must be iterable

Obviously, the user can simply wrap their value within [], but it shouldn't break code if that is not given.
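
A minimal sketch of the normalization being proposed, assuming it would run on each value before serialization. The helper as_list is hypothetical; bytes and strings are deliberately treated as single values rather than as sequences.

```python
def as_list(value):
    """Wrap a scalar in a list; pass lists and tuples through as lists."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


print(as_list(5))        # a bare int becomes a one-element list
print(as_list([1, 2]))   # an existing list is left alone
```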

Reading split tfrecords

I downloaded the tensorflow flowers dataset. It has two files

tf_flowers-train.tfrecord-00000-of-00002
tf_flowers-train.tfrecord-00001-of-00002

How do I read it into a dataset with tfrecord?

Thanks.

tfrecord reader could also support the SequenceExample protobuf

Looks like tfrecord_loader assumes that the tfrecord files are of type Example.

The drawback of Example is that it doesn't support arrays of arrays, which might be a common use case in the ML world. One can get away with it for now by flattening the array while writing and then having PyTorch do a reshape(-1, DIM).
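
The flatten-then-reshape workaround can be sketched in plain Python as follows; flatten and unflatten are hypothetical helpers, and dim plays the role of DIM above.

```python
def flatten(rows):
    """Concatenate equally sized rows into one flat list for writing."""
    return [x for row in rows for x in row]


def unflatten(flat, dim):
    """Rebuild rows of width dim, the equivalent of reshape(-1, dim)."""
    if dim <= 0 or len(flat) % dim != 0:
        raise ValueError("flat length must be a positive multiple of dim")
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]


tokens = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
flat = flatten(tokens)
restored = unflatten(flat, 3)
print(flat, restored)
```

This only works when every row has the same width, which is the limitation SequenceExample would remove.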

However, it might be worth adding SequenceExample, which seems to be a superset of Example.

Looks like, we can even replace the Example with SequenceExample and everything should work the way it does, plus we might get new features with a little more effort.

Let's assume we have three features

int_feature : Integer : int64_list
arr_feature : Array(Float) : float_list
arr_arr_feature : Array(Array(Float)) : float_list

Currently arr_arr_feature cannot be accessed in the given framework. Below, however, I go through an example of how arr_arr_feature could be made accessible with SequenceExample.

# current code
example = example_pb2.Example()
example.ParseFromString(record)
example.features.feature['int_feature'].int64_list.value
example.features.feature['arr_feature'].float_list.value
# not possible to access arr_arr_feature

can be replaced by

# suggested code
example = example_pb2.SequenceExample()
example.ParseFromString(record)
# replace .features with .context
example.context.feature['int_feature'].int64_list.value
example.context.feature['arr_feature'].float_list.value
[ft.float_list.value for ft in example.feature_lists.feature_list['arr_arr_feature'].feature]

I'm not totally impressed with the way I've done [ft.float_list.value for ft in example.feature_lists.feature_list.....].

Let me know if you like the idea, I can create a PR for the same.


Shuffling the dataset

How do we shuffle the dataset while training with these tfrecord files?
I have found that the PyTorch IterableDataset doesn't have a shuffle option.
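
Streaming datasets are usually shuffled approximately with a fixed-size buffer: fill the buffer, then repeatedly emit a random element and replace it with the next incoming one. The sketch below shows the generic technique with a hypothetical shuffle_buffer generator; it is not a description of how tfrecord implements shuffling.

```python
import random


def shuffle_buffer(iterable, buffer_size, seed=None):
    """Yield every item exactly once in a locally randomized order."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)        # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]              # emit a random buffered item
        buffer[idx] = item             # and put the new item in its slot
    rng.shuffle(buffer)                # drain what is left at the end
    yield from buffer


shuffled = list(shuffle_buffer(range(100), buffer_size=10, seed=0))
print(shuffled[:10])
```

A larger buffer gives a better shuffle at the cost of more memory; a buffer as large as the dataset gives a full shuffle.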

Unsupported type list

Hey there!

I tried to use the TensorFlow feature_description, which looks a little different from your tfrecord feature_description.
It shows an error "unhashable type: 'list'". Do you know how to solve the problem?

FEATURE_DESCRIPTION = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.string),
}

compared to your FEATURE_DESCRIPTION:

description = {
    "image": "bytes",
}

The error from the FEATURE_DESCRIPTION is as follows:

/usr/local/lib/python3.7/dist-packages/tfrecord/reader.py in process_feature(feature, typename, typename_mapping, key)
    107 
    108     if typename is not None:
--> 109         tf_typename = typename_mapping[typename]
    110         if tf_typename != inferred_typename:
    111             reversed_mapping = {v: k for k, v in typename_mapping.items()}

TypeError: unhashable type: 'list'

Iterating over multiple tf record files once but randomly

I'm trying to train ImageNet from scratch and tfrecords are much faster for that. The dataset exists in multiple tf record files. With one single tf record it is straightforward, I just use TFRecordDataset. However, I am unsure how to use the MultiTFRecordDataset for this purpose. For some reason, the data loader hangs and goes into an infinite loop.

Is this the intended purpose of MultiTFRecordDataset?

Thanks in advance!
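
For a single randomized pass over several files, one generic option is to draw from a randomly chosen source at each step and drop sources as they run out, so the iteration ends once every file has been read. This is a sketch of that idea with a hypothetical interleave_once generator, where each source stands for one per-file record iterator; it is not how MultiTFRecordDataset is implemented.

```python
import random


def interleave_once(sources, seed=None):
    """Draw from randomly chosen sources until every one is exhausted."""
    rng = random.Random(seed)
    active = [iter(source) for source in sources]
    while active:
        idx = rng.randrange(len(active))
        try:
            yield next(active[idx])
        except StopIteration:
            active.pop(idx)  # this source is finished; never pick it again


mixed = list(interleave_once([range(0, 3), range(10, 12), []], seed=1))
print(mixed)
```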

Empty feature support

Hello, Vahid,

Thank you for such a library! I am trying to use it to read TFRecords that I previously created into a DataLoader. However, I found that there is a problem with empty features (in particular, negative samples for object detection); examples are provided below. Could this be related to a lack of support for VarLenFeature (I use it to write these records)? In any case, may I get some suggestions and contribute a solution? Thank you.

Sincerely,
Dmitry.

Supported case:

feature {
    key: "image/object/class/label"
    value {
      int64_list {
        value: 1
        value: 1
        value: 1
      }
    }
  }

Unsupported case:

feature {
    key: "image/object/class/label"
    value {
      int64_list {
      }
    }
  }

Error:

IndexError                                Traceback (most recent call last)

~/lib/python3.6/site-packages/tfrecord/reader.py in example_loader(data_path, index_path, description, shard, compression_type)
    221         example.ParseFromString(record)
    222 
--> 223         yield extract_feature_dict(example.features, description, typename_mapping)
    224 
    225 

~/lib/python3.6/site-packages/tfrecord/reader.py in extract_feature_dict(features, description, typename_mapping)
    154             raise KeyError(f"Key {key} doesn't exist (select from {all_keys})!")
    155 
--> 156         processed_features[key] = get_value(typename, typename_mapping, key)
    157 
    158     return processed_features

~/lib/python3.6/site-packages/tfrecord/reader.py in get_value(typename, typename_mapping, key)
    136         def get_value(typename, typename_mapping, key):
    137             return process_feature(features[key], typename,
--> 138                                    typename_mapping, key)
    139     else:
    140         raise TypeError(f"Incompatible type: features should be either of type "

~/lib/python3.6/site-packages/tfrecord/reader.py in process_feature(feature, typename, typename_mapping, key)
    114 
    115     if inferred_typename == "bytes_list":
--> 116         value = np.frombuffer(value[0], dtype=np.uint8)
    117     elif inferred_typename == "float_list":
    118         value = np.array(value, dtype=np.float32)

IndexError: list index (0) out of range
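
The traceback ends at value[0], which cannot work on an empty value list. A sketch of a guard that returns an empty result instead; decode_feature is a hypothetical stand-in that uses plain Python types, not the library's process_feature.

```python
def decode_feature(kind, values):
    """Decode one feature's value list, tolerating empty lists."""
    values = list(values)
    if kind == "bytes_list":
        # Indexing values[0] on an empty list raises IndexError,
        # so fall back to empty bytes instead.
        return values[0] if values else b""
    if kind in ("float_list", "int64_list"):
        return values  # an empty list is a valid result here
    raise ValueError("unknown feature kind: " + repr(kind))


print(decode_feature("int64_list", []))
print(decode_feature("bytes_list", []))
```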

variable 'shard' referenced before assignment in torch.dataset

It looks like shard is referenced inside TFRecordDataset before it is assigned.

It's raised here:

tfrecord/tfrecord/torch/dataset.py

Line 23 in f7e7549

self.data_path, self.index_path, self.description, shard)
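
This kind of error typically appears when a variable is assigned only inside a conditional branch. A sketch of the usual fix, assuming shard is derived from the object returned by torch.utils.data.get_worker_info(), which is None in the main process. The function compute_shard is hypothetical, and SimpleNamespace stands in for the real worker info object.

```python
from types import SimpleNamespace


def compute_shard(worker_info):
    """Return (worker_id, num_workers), or None when loading in the main process."""
    shard = None  # assigned up front so the name always exists
    if worker_info is not None:
        shard = (worker_info.id, worker_info.num_workers)
    return shard


print(compute_shard(None))
print(compute_shard(SimpleNamespace(id=1, num_workers=4)))
```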