vahidk / tfrecord
TFRecord reader for PyTorch
Home Page: https://twitter.com/VahidK
License: MIT License
Hi,
I am trying to load the ImageNet2012 tfrecords (loaded using tensorflow datasets).
I can load the tfrecords using Tensorflow fine, however, when I use this library I lose the shape information of the images (they come back as flattened lists). Is there any way to resolve this? Other than that the library works great!
Thanks
Hi guys, thanks for your contribution!
When I use your source code, I get this error:
Traceback (most recent call last):
File "D:/coding/SIFA-master/model.py", line 316, in <module>
data = next(iter(loader))
File "D:\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 346, in __next__
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "D:\Anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py", line 28, in fetch
data.append(next(self.dataset_iter))
File "D:\Anaconda3\lib\site-packages\tfrecord\io_utils.py", line 106, in tfrecord_loader
for record in record_iterator:
File "D:\Anaconda3\lib\site-packages\tfrecord\io_utils.py", line 73, in tfrecord_iterator
yield from read_records()
File "D:\Anaconda3\lib\site-packages\tfrecord\io_utils.py", line 67, in read_records
raise RuntimeError("Failed to read the record.")
RuntimeError: Failed to read the record.
Here is my code:
from tfrecord.torch.dataset import TFRecordDataset
from torch.utils.data import DataLoader
tfrecord_path = "./data/training_A/image_01.tfrecords"
dataset = TFRecordDataset(tfrecord_path, None, {"dsize_dim0":"int", "dsize_dim1":"int", "dsize_dim2":"int",
"lsize_dim0":"int", "lsize_dim0":"int", "lsize_dim0":"int",
"data_vol":"str", "label_vol":"str",
})
loader = DataLoader(dataset, batch_size=1)
data = next(iter(loader))
print(data)
Have you fixed this error, or could you help me find my mistake? Thanks!
Hi, thanks for your awesome tfrecord
loader = tfrecord.multi_tfrecord_loader(
tfrecord_pattern, index_pattern, splits, description
)
qbar = tqdm(enumerate(loader))
for i, record in qbar:
# do something
# infinity loop, [never ends]
# do something else [never enter this part]
I came across a problem when loading multiple tfrecords with tfrecord.multi_tfrecord_loader.
The loop never seems to end.
PyTorch XLA has torch_xla.utils.tf_record_reader.TfRecordReader, which supports loading from GCS. When I use a GCS location for TFRecordDataset, I get
FileNotFoundError: [Errno 2] No such file or directory: 'gs://xxx'
I wonder whether this repo also supports it.
Hi! It's nice of you to provide this, but I still have some questions.
I see this code in datasets.py:
worker_info = torch.utils.data.get_worker_info()
if worker_info is not None:
    shard = worker_info.id, worker_info.num_workers
    np.random.seed(worker_info.seed % np.iinfo(np.uint32).max)
else:
    shard = None
it = reader.tfrecord_loader(
    self.data_path, self.index_path, self.description, shard)
if self.shuffle_queue_size:
    it = iterator_utils.shuffle_iterator(it, self.shuffle_queue_size)
Does iterator_utils have other methods for transforms, like calling the map function in TensorFlow? If so, where can I find more information about how to use this tfrecord library?
/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py:125: RuntimeWarning: 'tfrecord.tools.tfrecord2idx' found in sys.modules after import of package 'tfrecord.tools', but prior to execution of 'tfrecord.tools.tfrecord2idx'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
Failed to parse TFRecord.
the TFRecord's compression_type is GZIP
How to extract the string type data?
How to read tf.VarLenFeature in the records?
My tfrecords file contains some string arrays in the form of tf.VarLenFeature, but I don't know how to read them in PyTorch.
Are there plans to add support for random access of individual samples? If an index file is supplied, this could be done pretty efficiently.
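A rough sketch of what index-based random access could look like, assuming the first column of the index file is the byte offset at which each record starts. The helper name `read_record_at` is hypothetical and not part of the library; the record layout is the standard TFRecord framing, and the CRC fields are skipped rather than verified.

```python
import struct


def read_record_at(data_path, offset):
    """Read the raw payload of the record that starts at byte `offset`.

    TFRecord framing: uint64 length, uint32 CRC of the length,
    payload bytes, uint32 CRC of the payload.
    """
    with open(data_path, "rb") as f:
        f.seek(offset)
        header = f.read(12)  # length (8 bytes) + length CRC (4 bytes)
        if len(header) != 12:
            raise EOFError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = f.read(length)
        if len(payload) != length:
            raise EOFError("truncated record payload")
        return payload
```

The returned bytes would still need to be parsed (for example as a tf.Example) before they are usable as features.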
How might one go about creating an index file before using MultiTFRecordDataset?
In my case, the dataset is 20 files in the directory data/proc_data/IMDB/unsup/bt-0.9/0/.
Thanks!
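One possible approach is to create one index per file in a loop. The sketch below only pairs each data file with an index path; `index_jobs` and the ".index" suffix are hypothetical choices. The actual indexing call in the usage comment is `tfrecord2idx.create_index`, which another post in this thread uses.

```python
import os


def index_jobs(data_dir, index_dir, suffix=".index"):
    """Pair every file in `data_dir` with an index path in `index_dir`."""
    jobs = []
    for name in sorted(os.listdir(data_dir)):
        src = os.path.join(data_dir, name)
        if os.path.isfile(src):
            jobs.append((src, os.path.join(index_dir, name + suffix)))
    return jobs


# Usage sketch (needs the tfrecord package installed):
#
#   from tfrecord.tools import tfrecord2idx
#   os.makedirs("index", exist_ok=True)
#   for src, dst in index_jobs("data/proc_data/IMDB/unsup/bt-0.9/0", "index"):
#       tfrecord2idx.create_index(src, dst)
```

With this naming, the index pattern handed to MultiTFRecordDataset would be something like "index/{}.index", assuming the split names are the data file names.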
Can you help me fix this error (RuntimeError: stack expects each tensor to be equal size, but got [14] at entry 0 and [13] at entry 3)? I have tfrec files in which the image names have different lengths. I think that is the reason.
CODE:
import torch
import cv2
from tfrecord.torch.dataset import TFRecordDataset
import numpy as np
def decode_image(features):
features["image"] = cv2.imdecode(features["image"], -1)
return features
tfrecord_path = "train_dataset/tfrecords/ld_train01-2140.tfrec"
index_path = None
description = {"image": "byte", "target": "int", "image_name": "byte"}
dataset = TFRecordDataset(tfrecord_path, index_path, description, transform = decode_image)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)
data = next(iter(loader))
print(data)
Traceback (most recent call last):
File "torchRead.py", line 11, in <module>
data = next(temp)
File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
data.append(next(self.dataset_iter))
File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/tfrecord/reader.py", line 86, in tfrecord_loader
for record in record_iterator:
File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/tfrecord/reader.py", line 53, in tfrecord_iterator
yield from read_records()
File "/home/cym/anaconda3/envs/tfread/lib/python3.6/site-packages/tfrecord/reader.py", line 47, in read_records
raise RuntimeError("Failed to read the record.")
RuntimeError: Failed to read the record.
Here is the definition of _float_feature
def _float_feature(value): # used
return tf.train.Feature(float_list=tf.train.FloatList(value=list(value)))
example = tf.train.Example(features=tf.train.Features(
feature={
"ori_image": _float_feature(image.reshape(-1)),
"aug_image": _float_feature(aug_image.reshape(-1)),
}))
Using this config I can read out the file with TensorFlow
aug_record_spec = {
"ori_image": tf.FixedLenFeature([224 * 224 * 3], tf.float32),
"aug_image": tf.FixedLenFeature([224 * 224 * 3], tf.float32),
}
I know the error is raised by this check:
file.readinto(datum_bytes_view) != length
I also printed the two values:
The first is 1204282 = 2 * 602141, which cannot be factored any further.
The second is 1048576 = 2^20.
Hope you will notice this, thanks very much~~
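The 2^20 figure is exactly 1 MiB, which hints at either a fixed-size buffer or a short read. That is an assumption, not something confirmed in this thread. A minimal sketch of a read helper that avoids both problems by sizing the buffer to the record and looping until it is full (`read_exact` is a hypothetical name):

```python
def read_exact(file, length):
    """Read exactly `length` bytes, looping because readinto() may return fewer."""
    buffer = bytearray(length)  # sized to the record, not to a fixed capacity
    view = memoryview(buffer)
    filled = 0
    while filled < length:
        n = file.readinto(view[filled:])
        if not n:  # 0 or None: end of file or no data available
            raise EOFError(f"expected {length} bytes, got {filled}")
        filled += n
    return bytes(buffer)
```

This works with any binary file object that implements `readinto()`, such as the result of `open(path, "rb")`.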
I'm getting some errors when loading a byte feature from Waymo's open data motion dataset (I can't share the data because of the license, but it's available here: https://waymo.com/open/).
Minimal reproducible example:
import torch
from tfrecord.torch.dataset import TFRecordDataset
tfrecord_path = "uncompressed_tf_example_training_training_tfexample.tfrecord-00000-of-01000"
index_path = None
scenario_features = {
'scenario/id': 'byte',
}
dataset = TFRecordDataset(tfrecord_path, index_path, scenario_features)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)
data = next(iter(loader))
print(data)
Error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-32-b161469c8e99> in <module>
11 loader = torch.utils.data.DataLoader(dataset, batch_size=32)
12
---> 13 data = next(iter(loader))
14 print(data)
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
515 if self._sampler_iter is None:
516 self._reset()
--> 517 data = self._next_data()
518 self._num_yielded += 1
519 if self._dataset_kind == _DatasetKind.Iterable and \
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
555 def _next_data(self):
556 index = self._next_index() # may raise StopIteration
--> 557 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
558 if self._pin_memory:
559 data = _utils.pin_memory.pin_memory(data)
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
33 else:
34 data = next(self.dataset_iter)
---> 35 return self.collate_fn(data)
36
37
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
71 return batch
72 elif isinstance(elem, container_abcs.Mapping):
---> 73 return {key: default_collate([d[key] for d in batch]) for key in elem}
74 elif isinstance(elem, tuple) and hasattr(elem, '_fields'): # namedtuple
75 return elem_type(*(default_collate(samples) for samples in zip(*batch)))
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in <dictcomp>(.0)
71 return batch
72 elif isinstance(elem, container_abcs.Mapping):
---> 73 return {key: default_collate([d[key] for d in batch]) for key in elem}
74 elif isinstance(elem, tuple) and hasattr(elem, '_fields'): # namedtuple
75 return elem_type(*(default_collate(samples) for samples in zip(*batch)))
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
61 raise TypeError(default_collate_err_msg_format.format(elem.dtype))
62
---> 63 return default_collate([torch.as_tensor(b) for b in batch])
64 elif elem.shape == (): # scalars
65 return torch.as_tensor(batch)
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
53 storage = elem.storage()._new_shared(numel)
54 out = elem.new(storage)
---> 55 return torch.stack(batch, 0, out=out)
56 elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
57 and elem_type.__name__ != 'string_':
RuntimeError: stack expects each tensor to be equal size, but got [16] at entry 0 and [14] at entry 4
From TensorFlow I'm able to load it using this definition:
'scenario/id': tf.io.FixedLenFeature((), tf.string, default_value=None),
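The stack error comes from the default collate function trying to stack byte arrays of different lengths (16 and 14 here). One workaround is a custom `collate_fn` for the DataLoader that leaves such fields as a Python list. This is a minimal sketch: `collate_keep_bytes` and its `byte_keys` argument are hypothetical, and it assumes byte features arrive as uint8 arrays or lists of ints.

```python
def collate_keep_bytes(batch, byte_keys=("scenario/id",)):
    """Collate a list of feature dicts without stacking variable-length byte fields.

    Fields named in `byte_keys` are converted to `bytes` and returned as a
    plain list. Every other field is returned as a list of per-sample values,
    so stacking those into tensors is left to the caller.
    """
    out = {}
    for key in batch[0]:
        values = [sample[key] for sample in batch]
        if key in byte_keys:
            values = [bytes(v) for v in values]
        out[key] = values
    return out
```

It would be passed as `torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate_keep_bytes)`.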
I have downloaded the open_image_v4 dataset from TensorFlow using the tfds utility (as well as many other similar datasets), and now I have a folder with the following structure:
open_image_v4
├── open_images_v4-test.tfrecord-00001-of-00512
├── open_images_v4-test.tfrecord-00002-of-00512
├── ...
├── open_images_v4-train.tfrecord-00001-of-01024
├── ...
├── open_images_v4-validation.tfrecord-00001-of-00128
└── ...
As you can see, I have 3 splits: train, test and validation. However, each of those splits is itself sharded into subfiles.
I created an index file for each of those TFRecord files using GNU parallel:
parallel -j19 python3 -m tfrecord.tools.tfrecord2idx {} index/{}.index ::: *.tfrecord-*
From skimming over the reader code, it seems that you only support datasets that all fit in a single .tfrecord file.
Did I miss something or is there a workaround I could employ to read large subsets (train, valid, test) of the dataset?
I initially thought it might be possible to provide the input pattern as:
tfrecord_pattern = "/tmp/{split}_{idx}.tfrecord"
or in the case of my above file system:
tfrecord_pattern = "/tmp/open_images_v4-{split}.tfrecord-{idx:05d}-of-{total:05d}"
if that may be of any help for future PRs.
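A possible workaround, sketched under the assumption that MultiTFRecordDataset fills a single "{}" placeholder in the pattern with each key of the splits dict (as other posts in this thread do): treat every shard suffix as its own split. `build_splits` is a hypothetical helper and the equal weights are an arbitrary choice.

```python
import os


def build_splits(data_dir, prefix):
    """Map each shard suffix to an equal sampling weight.

    For prefix "open_images_v4-train.tfrecord-", the file
    "open_images_v4-train.tfrecord-00001-of-01024" yields key "00001-of-01024".
    """
    suffixes = sorted(
        name[len(prefix):]
        for name in os.listdir(data_dir)
        if name.startswith(prefix)
    )
    if not suffixes:
        raise FileNotFoundError(f"no files starting with {prefix!r} in {data_dir!r}")
    weight = 1.0 / len(suffixes)
    return {suffix: weight for suffix in suffixes}


# Usage sketch (paths are illustrative):
#
#   prefix = "open_images_v4-train.tfrecord-"
#   splits = build_splits("open_image_v4", prefix)
#   data_pattern = os.path.join("open_image_v4", prefix + "{}")
#   index_pattern = os.path.join("open_image_v4", "index", prefix + "{}.index")
```

Building one dict per split (train, test, validation) gives three separate datasets, each reading all of its shards.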
Hi, thanks for this package!
Are int64/float64 data types supported?
I'm trying to load some int64 data, but when I specify description = {"field_name": "int"} it just returns a 32-bit int. I also tried using torch.int64, int64, and long, but these all return a KeyError:
KeyError Traceback (most recent call last)
<ipython-input-17-ccc38984de18> in <module>
96 loader = torch.utils.data.DataLoader(dataset, batch_size=32)
97
---> 98 data = next(iter(loader))
99 print(data)
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
515 if self._sampler_iter is None:
516 self._reset()
--> 517 data = self._next_data()
518 self._num_yielded += 1
519 if self._dataset_kind == _DatasetKind.Iterable and \
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
555 def _next_data(self):
556 index = self._next_index() # may raise StopIteration
--> 557 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
558 if self._pin_memory:
559 data = _utils.pin_memory.pin_memory(data)
~/.venv/wenv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
26 for _ in possibly_batched_index:
27 try:
---> 28 data.append(next(self.dataset_iter))
29 except StopIteration:
30 break
~/.venv/wenv/lib/python3.8/site-packages/tfrecord/reader.py in example_loader(data_path, index_path, description, shard)
203 example.ParseFromString(record)
204
--> 205 yield extract_feature_dict(example.features, description, typename_mapping)
206
207
~/.venv/wenv/lib/python3.8/site-packages/tfrecord/reader.py in extract_feature_dict(features, description, typename_mapping)
147 raise KeyError(f"Key {key} doesn't exist (select from {all_keys})!")
148
--> 149 processed_features[key] = get_value(typename, typename_mapping, key)
150
151 return processed_features
~/.venv/wenv/lib/python3.8/site-packages/tfrecord/reader.py in get_value(typename, typename_mapping, key)
128
129 def get_value(typename, typename_mapping, key):
--> 130 return process_feature(features[key], typename,
131 typename_mapping, key)
132 else:
~/.venv/wenv/lib/python3.8/site-packages/tfrecord/reader.py in process_feature(feature, typename, typename_mapping, key)
100
101 if typename is not None:
--> 102 tf_typename = typename_mapping[typename]
103 if tf_typename != inferred_typename:
104 reversed_mapping = {v: k for k, v in typename_mapping.items()}
KeyError: 'long'
Hi,
Thank you for the very useful repo.
ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
Is there a way to use shuffling with this interface? Also, is there a way to precompute the length of the dataset?
Thanks
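DataLoader rejects shuffle=True for iterable-style datasets, so streaming readers usually shuffle through a bounded buffer instead. The dataset code quoted elsewhere in this thread passes a `shuffle_queue_size` to `iterator_utils.shuffle_iterator`, which looks like that kind of mechanism. The sketch below illustrates the general technique only and is not the library's implementation; `shuffle_buffer` is a hypothetical name.

```python
import random


def shuffle_buffer(iterable, buffer_size, rng=None):
    """Yield items in a locally shuffled order using a bounded buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = rng or random.Random()
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item...
        buffer[idx] = item       # ...and put the new item in its slot
    rng.shuffle(buffer)          # drain what is left at the end
    yield from buffer
```

A larger buffer gives a better shuffle at the cost of memory. Every item is still yielded exactly once.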
As title suggests.
Probably tfrecord/__init__.py needs to be edited to also include from .torch import *, and tfrecord/torch/__init__.py should contain from .dataset import TFRecordDataset, MultiTFRecordDataset.
Hi, I just noticed, that MultiTFRecordDataset does not use shard, whereas TFRecordDataset does. Is this a correct behavior? It very well could be, there could be some dark magic which I am not aware of, however, to me it seems it is not. If it is not, I can submit a PR to fix it.
Thanks :)
It seems cumbersome that the user needs to (re)specify the dtype of the feature they are trying to load from their TFRecord file. Within tf.example, we can extract both the tf_typename and the value from a specific example using the following:
field = example.features.feature[key].ListFields()[0]
tf_typename, value = field[0].name, field[1].value
Here, the key represents the feature name under which the data was saved. Additionally, we assume each key has only one field.
When we use MultiTFRecordDataset in a DataLoader, the loader reads data forever and never stops.
Is it possible to add new samples into an existing tfrecord?
Every time I try to read single or multiple tfrecords from any publicly available GCS bucket, it raises a FileNotFoundError, whereas the same path used in TensorFlow gives the expected output.
This is the error I am getting:
FileNotFoundError Traceback (most recent call last)
<ipython-input-20-2a5acbdc128e> in <module>()
11 transform=transforms)
12 loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE)
---> 13 data = next(iter(loader))
14 print(data)
6 frames
/usr/local/lib/python3.6/dist-packages/tfrecord/reader.py in tfrecord_iterator(data_path, index_path, shard)
42 file (for a single record).
43 """
---> 44 file = io.open(data_path, "rb")
45
46 length_bytes = bytearray(8)
FileNotFoundError: [Errno 2] No such file or directory: 'gs://flowers-public/tfrecords-jpeg-192x192-2/flowers05-230.tfrec'
This is the colab notebook which I was trying to implement.
Please correct me, if I'm wrong somewhere.
for the reference
Hi, I'm running experiments with PyTorch on the Youtube8M dataset.
My data preprocessing code is:
tfrecord_pattern = "/media/canvolcnao/新加卷/ML/Dataset/YT8M/test/{}.tfrecord"
# index_pattern = "/path/to/{}.index"
splits={}
val_set_len=3844
tem=''
for i in range(val_set_len):
splits['validate%(num)04d'%{'num':i}]=1/val_set_len
description = {"labels": "byte", "video_id": "int","mean_rgb":"list","mean_audio":"list"}
dataset = MultiTFRecordDataset(tfrecord_pattern, splits,description)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)
data = next(iter(loader))
print(data)
However, a weird error occurred:
Traceback (most recent call last):
File "/home/canvolcnao/ZCZ/SourceCode/VSCode/Python/ML/YT8M/SNR/dataPre.py", line 42, in <module>
data = next(iter(loader))
File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 277, in __iter__
return _SingleProcessDataLoaderIter(self)
File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 381, in __init__
self._dataset_kind, self._dataset, self._auto_collation, self._collate_fn, self._drop_last)
File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 41, in create_fetcher
return _utils.fetch._IterableDatasetFetcher(dataset, auto_collation, collate_fn, drop_last)
File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 21, in __init__
self.dataset_iter = iter(dataset)
File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/tfrecord/torch/dataset.py", line 129, in __iter__
self.data_pattern, self.index_pattern, self.splits, self.description)
File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/tfrecord/reader.py", line 215, in multi_tfrecord_loader
for split in splits.keys()]
File "/home/canvolcnao/anaconda3/lib/python3.7/site-packages/tfrecord/reader.py", line 215, in <listcomp>
for split in splits.keys()]
AttributeError: 'dict' object has no attribute 'format'
I can't figure this out, since no 'dict' object appears to be invoking a 'format' attribute.
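A likely explanation, based on the other calls in this thread that pass four positional arguments (data pattern, index pattern, splits, description): with only three arguments, the splits dict lands in the index pattern slot, and formatting it fails. The function below is a stand-in that mirrors only that argument order. It is not the library's code, and the paths are illustrative.

```python
def multi_loader_stub(data_pattern, index_pattern, splits, description=None):
    """Stand-in with the same positional order as the calls quoted in this thread.

    It only formats the patterns, which is enough to reproduce the error.
    """
    return [
        (
            data_pattern.format(split),
            index_pattern.format(split) if index_pattern is not None else None,
        )
        for split in splits.keys()
    ]


splits = {"validate0000": 0.5, "validate0001": 0.5}
description = {"labels": "int"}

# Three positional arguments: `splits` fills the index_pattern slot.
try:
    multi_loader_stub("/data/{}.tfrecord", splits, description)
except AttributeError as err:
    print("reproduced:", err)

# Passing None for the index pattern keeps every argument in its slot.
paths = multi_loader_stub("/data/{}.tfrecord", None, splits, description)
print(paths[0])
```

If that is indeed the cause, calling MultiTFRecordDataset(tfrecord_pattern, None, splits, description) should get past this error.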
Hi, thanks for sharing this repo with us. I am trying to create tfrecords but I get this error:
OverflowError: Python int too large to convert to C long
I am passing images as raw bytes when creating the records. Any hint as to why this is happening?
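Tracebacks for this error elsewhere in the thread point at `masked = np.uint32(masked)` in writer.py. One plausible cause, which is an assumption rather than something confirmed here, is that the intermediate Python int exceeds 32 bits before the conversion, and platforms where a C long is 32 bits (such as Windows) reject it. The sketch below shows the TFRecord CRC masking with every step reduced modulo 2^32. The function names are hypothetical, and computing the CRC-32C itself is left to a package such as crc32c.

```python
import struct

MASK_DELTA = 0xA282EAD8
UINT32_MASK = 0xFFFFFFFF


def mask_crc(crc):
    """Apply the TFRecord CRC masking to a 32-bit CRC value.

    Every intermediate result is reduced modulo 2**32, so the value
    always fits in an unsigned 32-bit integer on any platform.
    """
    crc &= UINT32_MASK
    rotated = ((crc >> 15) | (crc << 17)) & UINT32_MASK
    return (rotated + MASK_DELTA) & UINT32_MASK


def pack_masked_crc(crc):
    """Serialize the masked CRC as 4 little-endian bytes."""
    return struct.pack("<I", mask_crc(crc))
```

Using `struct.pack` avoids the NumPy conversion altogether.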
I added a bunch of debug prints to see why MultiTFRecordDataset fails for tfrecords >=4. It seems when iterating over the data loader, first a function call is made for reader.multi_tfrecord_loader() which in the end calls iterator_utils.sample_iterators().
I added print statements inside sample_iterators() and this part of the code is where the program gets stuck:
while True:
choice = np.random.choice(len(ratios), p=ratios)
print('made my choice, yielding')
yield next(iterators[choice])
This should be run as many times as the batch size (I used 10 to debug). When using only 2 tfrecord files, that's exactly what happens. Here's the output -
iter called
Length of loaders is 2
iter obtained, returning
iterators 2
Calling iteration
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
But when I try with 4 or more tfrecords, it gets stuck in between and the data loading never finishes. Here's the output-
iter called
Length of loaders is 4
iter obtained, returning
iterators 4
Calling iteration
made my choice, yielding
made my choice, yielding
made my choice, yielding
made my choice, yielding
The number of steps after which it hangs varies. It's random, sometimes hangs after 2 yield calls, at other times after 4 (as above) in my re-runs.
I have a feeling this has to do with how the sub-processes are being spawned. Any ideas why it's happening and how to fix it?
Thanks in advance!
For reference, here's the code I am using:-
import cv2
import numpy as np
import torch
from tfrecord.torch.dataset import MultiTFRecordDataset
tfrecord_path = "/om/user/xboix/data/ImageNet/train-00277-of-01024"
def resize_image(features):
# get BGR image from bytes
img = cv2.imdecode(features["image/encoded"], -1)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
features["image/encoded"] = cv2.resize(img, (224,224), interpolation = cv2.INTER_AREA)
features["image/encoded"] = np.transpose(features["image/encoded"], (2, 0, 1))
return features
train_splits = {}
tfrecord_pattern='/om/user/xboix/data/ImageNet/{}'
for i in range(4):
key = "train-%05d-of-01024"%i
value = 1.0
train_splits[key] = value
description = {"image/encoded": "byte", "image/class/label": "int", "image/height": "int", "image/width": "int"}
train_dataset = MultiTFRecordDataset(tfrecord_pattern, None, train_splits, description, transform=resize_image)
train_loader_tfr = torch.utils.data.DataLoader(train_dataset, batch_size=10)
for data in train_loader_tfr:
break
Line 116 in 74b2d24
When I use open('xxx.jpg','rb').read() to read the image, I get the wrong shape:
data = next(iter(loader))
print(data['image'].shape)
torch.Size([1, 12751])
What should I do?
If we try to iterate through multiple datasets at the same time using torch's DataLoader, the iteration does not stop after a single pass over the data:
import os
import numpy as np
import torch.utils.data
import tfrecord
import tfrecord.torch.dataset
from tfrecord.tools import tfrecord2idx
# Create tmp dir for saving file(s)
rootdir = "tmp/"
if not os.path.isdir(rootdir):
os.makedirs(rootdir)
# Create and write records to file(s)
data_pattern = os.path.join(rootdir, "dataset{}.tfrecord")
index_pattern = os.path.join(rootdir, "dataset{}.index")
for i in range(2):
data_path = data_pattern.format(i)
writer = tfrecord.TFRecordWriter(data_path)
for idx in range(50):
writer.write({
"image": (np.random.bytes(20), "byte"),
"label": ((np.random.rand(1)*(10**i)).tolist(), "float"),
"index": ([idx], "int")
})
writer.close()
# Create index
tfrecord2idx.create_index(data_path, index_pattern.format(i))
splits = {"0": 0.8, "1": 0.2}
description = {"image": "byte", "label": "float"}
dataset = tfrecord.torch.dataset.MultiTFRecordDataset(data_pattern,
index_pattern=index_pattern, splits=splits, description=description)
loader = torch.utils.data.DataLoader(dataset, batch_size=10)
# data = next(iter(loader))
# print(data["label"])
for idx, batch in enumerate(loader):
print(idx, [arr.size() for arr in batch.values()])
43 [torch.Size([10, 20]), torch.Size([10, 1])]
44 [torch.Size([10, 20]), torch.Size([10, 1])]
45 [torch.Size([10, 20]), torch.Size([10, 1])]
46 [torch.Size([10, 20]), torch.Size([10, 1])]
47 [torch.Size([10, 20]), torch.Size([10, 1])]
48 [torch.Size([10, 20]), torch.Size([10, 1])]
^CTraceback (most recent call last):
File "test.py", line 42, in <module>
for idx, batch in enumerate(loader):
File "/Users/ayushkarnawat/miniconda3/envs/tfrecord/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/Users/ayushkarnawat/miniconda3/envs/tfrecord/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/Users/ayushkarnawat/miniconda3/envs/tfrecord/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
data.append(next(self.dataset_iter))
File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/iterator_utils.py", line 19, in sample_iterators
yield next(iterators[choice])
File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/iterator_utils.py", line 8, in cycle
for element in iterator_fn():
File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/reader.py", line 97, in tfrecord_loader
if key not in example.features.feature:
KeyboardInterrupt
Since there are a total of 100 records between the two files, there should be exactly 10 batches each of size 10.
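The traceback passes through `cycle` and `sample_iterators`, which suggests the files are re-read indefinitely. Until the loader can stop on its own, one workaround is to cap the number of batches with `itertools.islice`. In this sketch `endless_batches` is a stand-in for the real loader, not library code.

```python
import itertools
import math


def endless_batches():
    """Stand-in for a loader that never raises StopIteration."""
    step = 0
    while True:
        yield {"step": step}
        step += 1


num_records = 100  # 2 files x 50 records, as in the example above
batch_size = 10
num_batches = math.ceil(num_records / batch_size)

seen = 0
for batch in itertools.islice(endless_batches(), num_batches):
    seen += 1  # a training or evaluation step would go here

print(seen)  # 10
```

Because the files are sampled at random, ten batches drawn this way are not guaranteed to contain every record exactly once.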
The len() method is not implemented on TFRecordDataset, which prevents using torch functions such as random_split().
Hello @vahidk and thank you for this awesome repository. I would like to ask:
To my understanding, TFRecordDataset returns an infinite iterator over the TFRecord. How can I change this behavior? this is needed for the evaluation step. If I somehow got it wrong and it is finite, how do I make it finite?
I noticed that when I try to read a large TFRecord with TFRecordDataset, iterating over the dataset is a lot slower than when I iterate over a small TFRecord. This is not a surprise of course, but is it better to somehow split the large TFRecord to smaller ones and use MultiTFRecordDataset instead? If so, how can you split TFRecords?
Thank you again.
I tried to create a tfrecord with the following code, and I have installed crc32c with pip:
writer = tfrecord.TFRecordWriter(r"I:\testdata\patch_test2\a.tfrecord")
writer.write({
"label": (1.4, "float")
})
But it returns an error:
Traceback (most recent call last):
File "<input>", line 3, in <module>
File "D:\software\Anaconda3\envs\pt36\lib\site-packages\tfrecord\writer.py", line 45, in write
self.file.write(TFRecordWriter.masked_crc(length_bytes))
File "D:\software\Anaconda3\envs\pt36\lib\site-packages\tfrecord\writer.py", line 55, in masked_crc
masked = np.uint32(masked)
OverflowError: Python int too large to convert to C long
So I created the tfrecord with TensorFlow instead, following this example:
example = tf.train.Example(features=tf.train.Features(feature={
"img_name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_name.encode('utf-8')])),
'patch_name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[patch_name.encode('utf-8')])),
'img_source_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_source_raw])),
'img_mask_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_mask_raw])),
'img_shape': tf.train.Feature(int64_list=tf.train.Int64List(value=shape))
}))
writer.write(example.SerializeToString())
But when I load this data with tfrecord, it returns an unexpected result:
loader = tfrecord.tfrecord_loader(p, None, {
'label': 'int',
'img_name': 'byte',
'img_raw': 'byte',
'img_shape': 'int'})
for record in loader:
# print(record)
print('img_shape:{}'.format(record['img_shape']))
print('label:{}'.format(record['label']))
print('img_name:{}'.format(record['img_name']))
The printed result looks like this:
img_shape:[62 86 75 3]
label:[1]
img_name:[112 114 101 112 114 111 99 101 115 115 95 112 97 100 95 85 78 69
48 48 48 57 50 95 49 120 49 120 49 95 86 79 73 46 110 112
121 95 49]
img_raw:[185 235 66 ... 15 39 190]
img_shape:[62 86 75 3]
The img_name becomes a numpy array rather than bytes.
So, I want to know how to create a tfrecord with tfrecord and how to load bytes. Thanks!
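The array printed for img_name looks like the UTF-8 byte values of the original string, so it can be turned back into text with `bytes(...)`. A minimal sketch, assuming the values are a list of ints or a uint8 array (`to_text` is a hypothetical helper):

```python
def to_text(values, encoding="utf-8"):
    """Turn a sequence of byte values (list or uint8 array) back into a string."""
    return bytes(values).decode(encoding)


img_name = [112, 114, 101, 112, 114, 111, 99, 101, 115, 115]  # first 10 values printed above
print(to_text(img_name))  # preprocess
```

For raw image data, `bytes(values)` without the decode step gives back the original byte string.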
Should all the shards of the dataset be iterated, and then the sampling is handed over to the DataLoader?
Traceback (most recent call last):
File "/home/anaconda3/lib/python3.7/runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/home/anaconda3/lib/python3.7/runpy.py", line 109, in _get_module_details
__import__(pkg_name)
ValueError: source code string cannot contain null bytes
Hello,
Thanks for the library. I am following the example from the README exactly.
import tfrecord
path = '/tmp/xyz.tfrecord'
writer = tfrecord.TFRecordWriter(path)
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
{'tokens': [[0, 0, 1], [0, 1, 0], [1, 0, 0]], 'seq_labels': [0, 1, 1]})
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
{'tokens': [[0, 0, 1], [1, 0, 0]], 'seq_labels': [0, 1]})
writer.close()
import tfrecord
context_description = {"length": "int", "label": "int"}
sequence_description = {"tokens": "int", "seq_labels": "int"}
loader = tfrecord.tfrecord_loader(path, None,
context_description,
sequence_description=sequence_description)
for context, sequence_feats in loader:
print(context["label"])
print(sequence_feats["seq_labels"])
but I had the following error
Traceback (most recent call last):
File "<input>", line 5, in <module>
File "/home/chen/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tfrecord/writer.py", line 55, in write
record = TFRecordWriter.serialize_tf_sequence_example(datum, sequence_datum)
File "/home/chen/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tfrecord/writer.py", line 148, in serialize_tf_sequence_example
features = {key: serialize_repeated(value, dtype) for key, (value, dtype) in features_datum.items()}
File "/home/chen/anaconda3/envs/learning_to_simulate/lib/python3.7/site-packages/tfrecord/writer.py", line 148, in <dictcomp>
features = {key: serialize_repeated(value, dtype) for key, (value, dtype) in features_datum.items()}
ValueError: too many values to unpack (expected 2)
Would you like to take a look at this? Many thanks!
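The traceback shows the writer unpacking each sequence entry as `(value, dtype)`. A bare list such as the three-row tokens value therefore fails with "too many values to unpack". A later post in this thread writes sequence features as (value, dtype) tuples, which gets past this step. The sketch below reproduces only the unpacking with plain dicts; `unpack_features` is a stand-in, not library code.

```python
def unpack_features(features_datum):
    """Mimic the `for key, (value, dtype) in ...items()` loop from the traceback."""
    return {key: (dtype, value) for key, (value, dtype) in features_datum.items()}


bare = {"tokens": [[0, 0, 1], [0, 1, 0], [1, 0, 0]], "seq_labels": [0, 1, 1]}
try:
    unpack_features(bare)
except ValueError as err:
    print("bare lists fail:", err)

tupled = {
    "tokens": ([[0, 0, 1], [0, 1, 0], [1, 0, 0]], "int"),
    "seq_labels": ([0, 1, 1], "int"),
}
print(unpack_features(tupled)["seq_labels"])  # ('int', [0, 1, 1])
```

Note that a bare list with exactly two elements would unpack without an error and be silently misread as a value and a dtype.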
"tfrecord/reader.py", line 56
index = np.loadtxt(index_path, dtype=np.int64)[:, 0]
IndexError: too many indices for array.
When I print np.loadtxt(index_path, dtype=np.int64), the result is [ 0 122978].
Thank you for help!
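The printed value has a single row, which fits an index file with only one record: np.loadtxt returns a 1-D array in that case, so `[:, 0]` fails. Passing `ndmin=2` to np.loadtxt should keep the result two-dimensional. As a dependency-free alternative, here is a sketch that parses the index by hand and behaves the same for one line or many (`load_index` is a hypothetical helper):

```python
def load_index(index_path):
    """Parse an index file into a list of (offset, size) pairs.

    Assumes each non-empty line holds two whitespace-separated integers.
    """
    pairs = []
    with open(index_path) as f:
        for line in f:
            fields = line.split()
            if fields:
                pairs.append((int(fields[0]), int(fields[1])))
    return pairs
```

The offsets are then `[offset for offset, _ in load_index(index_path)]`.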
Can we get length of the tfrecord dataset?
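If an index file exists, its number of lines should equal the number of records. Without one, the records can be counted by hopping from header to header. This is a minimal sketch for uncompressed files; `count_records` is a hypothetical helper, and it relies only on the standard TFRecord framing.

```python
import os
import struct


def count_records(data_path):
    """Count records by hopping from header to header without reading payloads."""
    count = 0
    size = os.path.getsize(data_path)
    with open(data_path, "rb") as f:
        while f.tell() < size:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if len(header) != 12:
                raise EOFError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            f.seek(length + 4, os.SEEK_CUR)  # skip the payload and its CRC
            count += 1
        if f.tell() != size:
            raise EOFError("truncated record payload")
    return count
```

The count could then back a `__len__` method on a thin wrapper around the dataset.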
Hi,
First of all, thanks a lot for this great library.
Context: Based on your library, I am trying to implement a pytorch version of recent Few-Shot Image classification benchmark https://github.com/google-research/meta-dataset. The idea is quite simple: each category of a dataset has its own .tfrecords. https://github.com/google-research/meta-dataset/blob/d6574b42c0f501225f682d651c631aef24ad0916/meta_dataset/data/reader.py#L283. Therefore, a TFRecordDataset is created for each class of the dataset, and each is shuffled with 1000 samples. Importantly, in this setting, some datasets have over 1000 classes, but somehow their pipeline scales pretty well in terms of memory.
Problem: If I want to reproduce this with the current library, each TFRecordDataset will load 1000 (processed) samples in memory for each of the 1000 classes to be able to shuffle them, which rapidly becomes intractable. In an attempt to solve this, I tried to postpone the decoding until later in the pipeline, so that all that is stored in memory by each dataset is raw records, which hopefully require less memory. Specifically, my idea was to replace
Line 205 in d721799
yield record
and perform the parsing operations only for the current batch of data. Unfortunately, I am facing the issue that the parsing of a record seems to only work properly if the record parsed was the last one yielded by the iterator. Specifically, consider the following simple modification of TFRecordDataset:
it = reader.tfrecord_loader(data_path=self.data_path,
index_path=self.index_path,
description=self.description,
shard=shard,
sequence_description=self.sequence_description)
while True:
it1 = next(it)
it2 = next(it)
yield it2
Then downstream parsing of the yielded raw record will work just fine. However, a simple modification :
while True:
it1 = next(it)
it2 = next(it)
yield it1
will result in google.protobuf.message.DecodeError: Error parsing message down the line.
My questions are:
Thanks a lot !
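One explanation that fits this symptom, offered as an assumption rather than a confirmed diagnosis, is that the iterator yields a view into a buffer it reuses for the next record. Any record held across a `next()` call would then be overwritten. The sketch below demonstrates the effect with a stand-in reader and shows that copying with `bytes(...)` before advancing preserves the data.

```python
def reusing_reader(records):
    """Stand-in reader that yields views into one shared buffer."""
    buffer = bytearray(max(len(r) for r in records))
    for record in records:
        buffer[:len(record)] = record
        yield memoryview(buffer)[:len(record)]


it = reusing_reader([b"first", b"other"])
held = next(it)       # a view, not a copy
copied = bytes(held)  # snapshot taken before advancing
next(it)              # the shared buffer is overwritten here

print(bytes(held))    # b'other' (the held view changed underneath us)
print(copied)         # b'first'
```

If this is what happens, yielding `bytes(record)` instead of `record` should let raw records be stored and parsed later.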
Hi, thanks for your contributions! I have a problem when I create a torch.utils.data.DataLoader(TFRecordDataset, batch_size): the data in my tfrecords files is encoded, so the records have different lengths.
Is there any way to process the TFRecordDataset output, or can I only preprocess in the feature descriptor?
thank you so much
Hey,
How about implementing __getitem__() on the tfrecord dataset?
I use that function quite often to determine exactly which sample I want to look at.
regards
btw. great work!
How to load data in multiple processes?
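The dataset code quoted elsewhere in this thread builds `shard = (worker_info.id, worker_info.num_workers)` from `torch.utils.data.get_worker_info()`. That suggests passing `num_workers` to the DataLoader is meant to give each worker process its own slice of the file. I believe an index file is needed for that to take effect, but treat this as an assumption. The sketch below shows one way such a slice could be computed; it is not taken from the library.

```python
def shard_bounds(num_records, shard_idx, shard_count):
    """Return the [start, end) record range handled by one worker."""
    if not 0 <= shard_idx < shard_count:
        raise ValueError("shard_idx must be in [0, shard_count)")
    start = num_records * shard_idx // shard_count
    end = num_records * (shard_idx + 1) // shard_count
    return start, end


for worker_id in range(3):
    print(worker_id, shard_bounds(10, worker_id, 3))
```

With 10 records and 3 workers this gives the ranges (0, 3), (3, 6) and (6, 10), so every record is read by exactly one worker.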
import tfrecord
writer = tfrecord.TFRecordWriter("E:/File/package_by_mmh/train_set_128/train.tfrecord")
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
{'tokens': ([[0, 0, 1], [0, 1, 0], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1, 1], 'int')})
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
{'tokens': ([[0, 0, 1], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1], 'int')})
writer.close()
When using the above example, the following error occurs:
Traceback (most recent call last):
File "E:/File/package_by_mmh/DeepChroma/testTFrcord.py", line 4, in <module>
writer.write({'length': (3, 'int'), 'label': (1, 'int')},
File "E:\Anaconda3\lib\site-packages\tfrecord\writer.py", line 60, in write
self.file.write(TFRecordWriter.masked_crc(length_bytes))
File "E:\Anaconda3\lib\site-packages\tfrecord\writer.py", line 70, in masked_crc
masked = np.uint32(masked)
OverflowError: Python int too large to convert to C long
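The traceback points at the conversion `np.uint32(masked)`. A likely cause, stated here as an assumption, is that the intermediate masked value exceeds 32 bits and NumPy refuses to convert it on platforms where a C long is 32 bits (such as Windows). The sketch below shows the TFRecord CRC masking step done with plain Python integers that are truncated to 32 bits at every step; `mask_crc` is a hypothetical name, and the CRC32C computation itself is not shown.

```python
MASK_DELTA = 0xA282EAD8  # constant used by the TFRecord format for CRC masking


def mask_crc(crc):
    """Rotate the 32-bit CRC right by 15 bits and add the delta, modulo 2**32."""
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + MASK_DELTA) & 0xFFFFFFFF


masked_zero = mask_crc(0)
masked_max = mask_crc(0xFFFFFFFF)
```

Because the result always fits in 32 bits, it can be packed with `struct.pack("<I", value)` or converted to `np.uint32` without overflow.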
Some TFRecord files do not contain tf.Example messages; they yield a raw binary payload instead.
The following works well with tf.data.TFRecordDataset.
dataset = tf.data.TFRecordDataset(FILENAME, compression_type='')
data = next(dataset.as_numpy_iterator())
sc = Scenario()
sc.ParseFromString(data)
print(sc.timestamps_seconds)
When I try to use tfrecord (this repo) to replace it, I find that data here is a dictionary.
dataset = TFRecordDataset(tfrecord_path, None, compression_type=None)
data = next(iter(dataset))
print(data) # {}
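For files whose records are not tf.Example messages, one workaround is to read the raw payloads directly. A TFRecord file frames each record as a little-endian uint64 length, a uint32 masked CRC of the length, the payload, and a uint32 masked CRC of the payload. The sketch below is a minimal reader under that layout; `iter_raw_records` and `write_raw_records` are hypothetical helpers, the reader skips CRC verification, and the writer zeroes the CRCs, so its output is for this demonstration only.

```python
import os
import struct
import tempfile


def iter_raw_records(path):
    """Yield each record's payload bytes; CRC fields are skipped, not verified."""
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 masked CRC of the length
            if len(header) < 12:
                return
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            f.read(4)            # uint32 masked CRC of the payload
            yield payload


def write_raw_records(path, payloads):
    """Write payloads with the same framing but zeroed CRCs (demo only)."""
    with open(path, "wb") as f:
        for p in payloads:
            f.write(struct.pack("<Q", len(p)) + b"\x00" * 4 + p + b"\x00" * 4)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tfrecord")
    write_raw_records(demo_path, [b"abc", b"", b"hello"])
    recovered = list(iter_raw_records(demo_path))
```

Each yielded payload could then be handed to the relevant protobuf class, for example `sc.ParseFromString(payload)` as in the TensorFlow snippet above.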
The following is a sample from waymo motion prediction dataset
Google Drive
Official Protobuf
Piggybacking off the comment "This is a breaking change. Let's not combine it with cleanup. It has to be made as a separate PR. Unless there's a use case for this I prefer to keep the old way."
Originally posted by @vahidk in #27
For example, if the user inputs a value of type int (e.g. "index": (5, "int")) without enclosing it within [], then the following error occurs:
Traceback (most recent call last):
File "test.py", line 25, in <module>
"index": (idx, "int")
File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/writer.py", line 41, in write
record = TFRecordWriter.serialize_tf_example(datum)
File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/writer.py", line 83, in serialize_tf_example
features = {key: serialize[dtype](value) for key, (value, dtype) in datum.items()}
File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/writer.py", line 83, in <dictcomp>
features = {key: serialize[dtype](value) for key, (value, dtype) in datum.items()}
File "/Users/ayushkarnawat/Documents/dev/python_workspace/tfrecord/tfrecord/writer.py", line 80, in <lambda>
int64_list=example_pb2.Int64List(value=f))
TypeError: Value must be iterable
Obviously, the user can simply wrap their value within [], but it shouldn't break code if that is not given.
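One way the writer could tolerate bare scalars is to normalize every value to a list before building the feature. The helper below is a minimal sketch with a hypothetical name, `as_list`; it handles only Python lists, tuples, and scalars, and NumPy arrays would need an extra branch.

```python
def as_list(value):
    """Wrap a scalar in a list; return lists and tuples as a new list."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]  # ints, floats, bytes, and str are treated as single values


wrapped_scalar = as_list(5)
unchanged_list = as_list([1, 2])
wrapped_bytes = as_list(b"ab")
```

Treating `bytes` as a single value rather than as an iterable of integers matches how a bytes feature is usually meant to be stored.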
I downloaded the tensorflow flowers dataset. It has two files
How do I read it into a dataset with tfrecord?
Thanks.
Looks like tfrecord_loader assumes that the tfrecord files are of type Example.
The drawback of Example is that it doesn't support arrays of arrays (which might be a common use case in the ML world). One can get away with it for now by flattening the array while writing, and then having PyTorch do a reshape(-1, DIM).
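The flatten-then-reshape workaround can be sketched without any dependency as follows. `flatten` and `unflatten` are hypothetical helpers, and the approach assumes every inner array has the same length `dim`; ragged inner arrays cannot be recovered this way.

```python
def flatten(rows):
    """Concatenate equal-length rows into one flat list for writing."""
    return [x for row in rows for x in row]


def unflatten(flat, dim):
    """Regroup a flat list into rows of length dim, like reshape(-1, dim)."""
    if len(flat) % dim != 0:
        raise ValueError("flat length is not a multiple of dim")
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]


rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
flat = flatten(rows)
restored = unflatten(flat, dim=2)
```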
However, it might be worth adding SequenceExample, which seems to be a superset of Example.
It looks like we could even replace Example with SequenceExample and everything should work the way it does now, plus we might get new features with a little more effort.
Let's assume we have three features:
int_feature: Integer: int64_list
arr_feature: Array(Float): float_list
arr_arr_feature: Array(Array(Float)): float_list
Currently arr_arr_feature cannot be accessed in the given framework. However, below I go through an example of how arr_arr_feature could be accessed if we use SequenceExample.
# current code
example = example_pb2.Example()
example.ParseFromString(record)
example.features.feature['int_feature'].int64_list.value
example.features.feature['arr_feature'].float_list.value
# not possible to access arr_arr_feature
can be replaced by
# suggested code
example = example_pb2.SequenceExample()
example.ParseFromString(record)
# replace .features with .context
example.context.feature['int_feature'].int64_list.value
example.context.feature['arr_feature'].float_list.value
[ft.float_list.value for ft in example.feature_lists.feature_list['arr_arr_feature'].feature]
I'm not totally happy with the way I've written [ft.float_list.value for ft in example.feature_lists.feature_list.....].
Let me know if you like the idea, I can create a PR for the same.
More stuff here :
How do we shuffle the dataset while training with these TFRecord files?
I have found that the PyTorch IterableDataset doesn't have a shuffle option.
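Because an iterable dataset has no random access, shuffling is usually approximated with a fixed-size buffer: fill the buffer, then repeatedly emit a random element and replace it with the next incoming one. The sketch below illustrates that technique in plain Python; `buffered_shuffle` is a hypothetical name, and it is not claimed to be how the library shuffles internally.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Approximately shuffle a stream using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random buffered item
        buffer[idx] = item               # and replace it with the new one
    rng.shuffle(buffer)                  # drain what is left in random order
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
```

A larger buffer gives a shuffle closer to uniform at the cost of more memory, which is the trade-off raised in the memory discussion earlier on this page.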
Hey there!
I tried to use the TensorFlow feature_description, which looks a little different from your tfrecord feature_description.
It shows the error "Unhashable type list". Do you know how to solve the problem?
FEATURE_DESCRIPTION = {
'image': tf.io.FixedLenFeature([], tf.string),
'label': tf.io.FixedLenFeature([], tf.string),
}
compared to your FEATURE_DESCRIPTION:
description = {
"image": "bytes",
}
The error from the FEATURE_DESCRIPTION is as follows:
/usr/local/lib/python3.7/dist-packages/tfrecord/reader.py in process_feature(feature, typename, typename_mapping, key)
107
108 if typename is not None:
--> 109 tf_typename = typename_mapping[typename]
110 if tf_typename != inferred_typename:
111 reversed_mapping = {v: k for k, v in typename_mapping.items()}
TypeError: unhashable type: 'list'
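The traceback shows the description value being used as a dictionary key (`typename_mapping[typename]`). A `tf.io.FixedLenFeature` is a named tuple whose first field is a shape list, so it cannot be hashed, which would explain the error. The sketch below reproduces this with a stand-in named tuple; `FeatureSpec`, `lookup`, and the contents of `typename_mapping` are placeholders, not the library's actual values.

```python
from collections import namedtuple

# Stand-in for tf.io.FixedLenFeature: a tuple that contains a list (the shape).
FeatureSpec = namedtuple("FeatureSpec", ["shape", "dtype"])

# Placeholder mapping keyed by plain strings, as in the traceback above.
typename_mapping = {"plain_string_key": "some_list_type"}


def lookup(typename):
    """Look up a type name, returning the error text instead of raising."""
    try:
        return typename_mapping[typename]
    except TypeError as exc:
        return f"TypeError: {exc}"


string_result = lookup("plain_string_key")
spec_result = lookup(FeatureSpec([], "string"))
```

This suggests the description passed to this library should map each key to a plain string type name, as in the second description shown in the post, rather than to TensorFlow feature objects.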
I'm trying to train ImageNet from scratch and tfrecords are much faster for that. The dataset exists in multiple tf record files. With one single tf record it is straightforward, I just use TFRecordDataset. However, I am unsure how to use the MultiTFRecordDataset for this purpose. For some reason, the data loader hangs and goes into an infinite loop.
Is this the intended purpose of MultiTFRecordDataset?
Thanks in advance!
Hello, Vahid,
Thank you for this library! I am trying to use it to read TFRecords that I previously created into a dataloader. However, I found that there is a problem with empty features (in particular, negative samples for object detection); examples are provided below. Could this be related to a lack of support for VarLenFeature (I use it to write these records)? In any case, may I get some suggestions and contribute a solution? Thank you.
Sincerely,
Dmitry.
Supported case:
feature {
key: "image/object/class/label"
value {
int64_list {
value: 1
value: 1
value: 1
}
}
}
Unsupported case:
feature {
key: "image/object/class/label"
value {
int64_list {
}
}
}
Error:
IndexError Traceback (most recent call last)
~/lib/python3.6/site-packages/tfrecord/reader.py in example_loader(data_path, index_path, description, shard, compression_type)
221 example.ParseFromString(record)
222
--> 223 yield extract_feature_dict(example.features, description, typename_mapping)
224
225
~/lib/python3.6/site-packages/tfrecord/reader.py in extract_feature_dict(features, description, typename_mapping)
154 raise KeyError(f"Key {key} doesn't exist (select from {all_keys})!")
155
--> 156 processed_features[key] = get_value(typename, typename_mapping, key)
157
158 return processed_features
~/lib/python3.6/site-packages/tfrecord/reader.py in get_value(typename, typename_mapping, key)
136 def get_value(typename, typename_mapping, key):
137 return process_feature(features[key], typename,
--> 138 typename_mapping, key)
139 else:
140 raise TypeError(f"Incompatible type: features should be either of type "
~/lib/python3.6/site-packages/tfrecord/reader.py in process_feature(feature, typename, typename_mapping, key)
114
115 if inferred_typename == "bytes_list":
--> 116 value = np.frombuffer(value[0], dtype=np.uint8)
117 elif inferred_typename == "float_list":
118 value = np.array(value, dtype=np.float32)
IndexError: list index (0) out of range
Looks like inside TFRecordDataset, shard is used before it is defined.
The error is raised here:
tfrecord/tfrecord/torch/dataset.py
Line 23 in f7e7549
How to specify multiple tfrecord files in the example of MultiTFRecordDataset?