
torchdatasets's Introduction

👋 Hi there!

Hey, I'm Simon. From time to time I create open source projects (mostly Machine Learning) and try to help out on Stack Overflow.

Email · GitHub · Stack Overflow · LinkedIn · Google Scholar · Kaggle

Contact

Please do not hesitate to contact me via any of the channels listed above.

torchdatasets's People

Contributors

bearpaw, carloalbertobarbano, imgbotapp, joanna-janos, neverix


torchdatasets's Issues

Pickle support for Storage will be removed in torch 1.5

Hi!
This is an awesome package, but I ran into a warning while using the Pickle cacher:

***/site-packages/torch/storage.py:34: FutureWarning: pickle support for Storage will be removed in 1.5. Use `torch.save` instead
  warnings.warn("pickle support for Storage will be removed in 1.5. Use `torch.save` instead", FutureWarning)

As far as I understand, it can be fixed by creating a new cacher that uses torch.save and torch.load. I was planning to open this as a PR, but I couldn't figure out how to set up the test environment, so this is an issue instead.
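For reference, here is a minimal sketch of what such a cacher might look like. It assumes the same __contains__/__setitem__/__getitem__ protocol that torchdata's built-in cachers follow; the class name and the .pt extension are made up for illustration:

import pathlib

import torch

class TensorCache:
    """Disk cacher that uses torch.save/torch.load instead of pickle (sketch)."""

    def __init__(self, path: pathlib.Path):
        self.path = path
        self.path.mkdir(parents=True, exist_ok=True)

    def __contains__(self, index: int) -> bool:
        # Cacher protocol: has this sample already been cached?
        return (self.path / f"{index}.pt").is_file()

    def __setitem__(self, index: int, data) -> None:
        torch.save(data, self.path / f"{index}.pt")

    def __getitem__(self, index: int):
        return torch.load(self.path / f"{index}.pt")

If I read the API right, it would be used as dataset.cache(TensorCache(pathlib.Path("./cache"))).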

Metaclass issue with Python 3.7/3.8

The metaclass trick doesn't seem to work on later versions of Python.
Any ideas?
This can easily be checked on Colab, for example (please see below).

Thanks for the amazing work here!

!pip install torchdata
!python --version
Collecting torchdata
  Downloading torchdata-0.2.0-py3-none-any.whl (27 kB)
Requirement already satisfied: torch>=1.2.0 in /usr/local/lib/python3.7/dist-packages (from torchdata) (1.9.0+cu102)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch>=1.2.0->torchdata) (3.7.4.3)
Installing collected packages: torchdata
Successfully installed torchdata-0.2.0
Python 3.7.12

import torchdata as td
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-4-d739c1dd990c> in <module>()
----> 1 import torchdata as td

1 frames

/usr/local/lib/python3.7/dist-packages/torchdata/__init__.py in <module>()
     58 """
     59 
---> 60 from . import cachers, datasets, maps, modifiers, samplers
     61 from ._version import __version__
     62 from .datasets import Dataset, Iterable

/usr/local/lib/python3.7/dist-packages/torchdata/datasets.py in <module>()
    155 
    156 
--> 157 class Iterable(TorchIterable, _DatasetBase, metaclass=MetaIterable):
    158     r"""`torch.utils.data.IterableDataset` **dataset with extended capabilities**.
    159 

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

Document gotcha when using DataLoader with workers

Using .cache() (with the default in-memory cacher) does nothing when the dataset is used in a multi-process DataLoader. This is a gotcha that should probably be pointed out in the documentation and the tutorial, as it is easy to overlook.

In my case I was dropping torchdata's cache into a program that already had its DataLoaders defined. The DataLoaders were initialized with a positive int in num_workers. It took me a while to figure out why the cache didn't seem to work at all.
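A minimal sketch that reproduces the gotcha, assuming the td.Dataset/.cache() API from the README (the toy dataset stands in for an expensive __getitem__):

import torch
import torchdatasets as td

class Squares(td.Dataset):
    """Toy dataset standing in for an expensive __getitem__."""

    def __init__(self):
        super().__init__()

    def __len__(self):
        return 8

    def __getitem__(self, index):
        return torch.tensor(index ** 2)

dataset = Squares().cache()  # default in-memory cacher

# With num_workers > 0, each worker process holds its own copy of `dataset`,
# so anything cached inside a worker never reaches the parent process and is
# discarded when the workers shut down at the end of the epoch.
loader = torch.utils.data.DataLoader(dataset, num_workers=2)
for _ in loader:
    pass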

python3.6 order (MRO) for bases type, GenericMeta, _DataPipeMeta

Traceback (most recent call last):
  File "test-dataset.py", line 1, in <module>
    from dataset import func
  File "/root/torch-cache-test/dataset/func.py", line 3, in <module>
    from . import nocache
  File "/root/torch-cache-test/dataset/nocache.py", line 1, in <module>
    from . import utils
  File "/root/torch-cache-test/dataset/utils.py", line 8, in <module>
    import torchdatasets as td
  File "/opt/conda/lib/python3.6/site-packages/torchdatasets/__init__.py", line 60, in <module>
    from . import cachers, datasets, maps, modifiers, samplers
  File "/opt/conda/lib/python3.6/site-packages/torchdatasets/datasets.py", line 164, in <module>
    class MetaIterableWrapper(MetaIterable, GenericMeta, _typing._DataPipeMeta): pass
TypeError: Cannot create a consistent method resolution
order (MRO) for bases type, GenericMeta, _DataPipeMeta

metaclass conflict

I got a TypeError: metaclass conflict when importing torchdata.

Python Version: Python 3.6.13
PyTorch:

torch==1.7.1+cu110
torchaudio==0.7.2
torchdata==0.2.0
torchvision==0.8.2+cu110
In [1]: import torchdata as td
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-d739c1dd990c> in <module>
----> 1 import torchdata as td

~/git/envs/pyenv/lib/python3.6/site-packages/torchdata/__init__.py in <module>
     58 """
     59 
---> 60 from . import cachers, datasets, maps, modifiers, samplers
     61 from ._version import __version__
     62 from .datasets import Dataset, Iterable

~/git/envs/pyenv/lib/python3.6/site-packages/torchdata/datasets.py in <module>
    155 
    156 
--> 157 class Iterable(TorchIterable, _DatasetBase, metaclass=MetaIterable):
    158     r"""`torch.utils.data.IterableDataset` **dataset with extended capabilities**.
    159 

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
In [2]: type(TorchIterable)
Out[2]: typing.GenericMeta

Dataset inherits from torch's Dataset, which already uses the GenericMeta metaclass, but it also specifies MetaDataset as its metaclass, which causes the conflict.

You can reproduce it with this:

class M_A(type):
    pass

class M_B(type):
    pass

class A(metaclass=M_A):
    pass

# Raises TypeError: metaclass conflict, because type(C) must be a
# (non-strict) subclass of both M_A (inherited via A) and M_B
class C(A, metaclass=M_B):
    pass
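
The toy example can be fixed the same way, by deriving a combined metaclass from both (which is exactly what the wrapper below does for the library):

class M_AB(M_A, M_B):
    pass

class C(A, metaclass=M_AB):  # no conflict: M_AB subclasses both metaclasses
    pass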

This can be solved with a wrapper:

class MetaDatasetWrapper(MetaDataset, GenericMeta): pass

class Dataset(TorchDataset, _DatasetBase, metaclass=MetaDatasetWrapper):
    ...

Apparent mismatch between official pip version `0.2.0` and GitHub tagged version of `0.2.0`

First, thanks for an elegant library that has saved me a significant amount of time over the past couple years.

Now, the problem: since the name change, I've been trying to refactor some old code to work with newer versions of torch (specifically torch==1.8.1+cu101). While doing so, I seem to have uncovered a confusing issue that shows up upon installation of torchdatasets: neither pip install torchdatasets nor pip install torchdatasets==0.2.0 results in a version of the repo identical to the one tagged as 0.2.0 on GitHub. This became a problem for me when I tried simply importing:

(ffcv-test) jrose3@serrep3:/media/data/jacob/GitHub/ffcv/examples/cifar$ python
Python 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:42:07)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torchdatasets as torchdata
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/data/conda/jrose3/envs/ffcv-test/lib/python3.8/site-packages/torchdatasets/__init__.py", line 60, in <module>
    from . import cachers, datasets, maps, modifiers, samplers
  File "/media/data/conda/jrose3/envs/ffcv-test/lib/python3.8/site-packages/torchdatasets/datasets.py", line 28, in <module>
    from torch.utils.data import _typing
ImportError: cannot import name '_typing' from 'torch.utils.data' (/media/data/conda/jrose3/envs/ffcv-test/lib/python3.8/site-packages/torch/utils/data/__init__.py)

Examining the two implicated Python scripts (one in torch and one in torchdatasets), I realized that the module torch.utils.data._typing isn't actually introduced until torch==1.9.0, while I'm currently using torch==1.8.1; as far as I can tell, the only stated requirement for the torchdatasets library is torch>=1.2.0 in requirements.txt.

Looking further into the torchdatasets file that relies on torch.utils.data._typing, namely torchdatasets/datasets.py, I found that it's only used once, for a comically unnecessary type hint in a placeholder class's definition!

class MetaIterableWrapper(MetaIterable, GenericMeta, _typing._DataPipeMeta): pass

My assumption is that this was introduced as part of an effort to integrate the new torch DataPipe pattern, but at some point it leaked into the main repo and broke assumptions necessary for a smooth install. Since I can only find it in my locally installed pip version and not on GitHub, I have no clear way of tracking down when it was introduced or by whom.

My recommendation is to remove these 2 lines from the torchdatasets/datasets.py hosted on pip for version 0.2.0 (I'm not sure if they can be revised without bumping the version as well). Thoughts?
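For completeness, an alternative to deleting the lines could be guarding the import so the DataPipe metaclass only participates when torch actually provides it. This is just a sketch, untested against the packaged wheel, using only the symbols from the tracebacks above:

try:
    from torch.utils.data import _typing

    class MetaIterableWrapper(MetaIterable, GenericMeta, _typing._DataPipeMeta): pass
except ImportError:  # torch < 1.9 has no torch.utils.data._typing
    class MetaIterableWrapper(MetaIterable, GenericMeta): pass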

HDF5 Support

Thanks for this amazing library. I was wondering: for large datasets with millions of images, would it make sense to cache into a single file (e.g., HDF5) instead of creating millions of cache files? Do you have any plans to support the HDF5 format? Thanks!
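For what it's worth, a single-file cacher seems expressible with h5py. Below is only a sketch: it assumes torchdata's __contains__/__setitem__/__getitem__ cacher protocol, one CPU tensor per sample, and a single process (HDF5 files need extra care with multi-worker DataLoaders):

import h5py
import torch

class HDF5Cache:
    """Cache every sample as a separate dataset inside one HDF5 file (sketch)."""

    def __init__(self, path):
        # "a" opens read/write and creates the file if it does not exist yet
        self.file = h5py.File(path, "a")

    def __contains__(self, index):
        return str(index) in self.file

    def __setitem__(self, index, data):
        # Assumes `data` is a CPU tensor
        self.file.create_dataset(str(index), data=data.numpy())

    def __getitem__(self, index):
        return torch.from_numpy(self.file[str(index)][...])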

AttributeError: 'Subset' object has no attribute 'map'

Hello! Thanks so much for this wonderful library.

This is my first time using it, and I'm following the README example and a Stack Overflow post to apply transformations after splitting the data. However, I get the above error when I run train_dataset.map(train_transform). Does the wrapper still work, or did I make a mistake somewhere?

import torch
import torchvision
import torchdatasets as td

dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder('./root'))

total_num = len(dataset)
train_num = int(0.7 * total_num)
val_num = int(0.2 * total_num)
test_num = total_num - train_num - val_num
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(
    dataset, (train_num, val_num, test_num)
)

# train_transform is defined elsewhere; this line raises the AttributeError
train_dataset = train_dataset.map(train_transform)
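
A possible workaround (an untested sketch): random_split returns plain torch.utils.data.Subset objects, which know nothing about torchdata, so the splits lose .map. Re-wrapping each split should bring it back:

# Subset is still a torch Dataset, so it can be wrapped again
train_dataset = td.datasets.WrapDataset(train_dataset).map(train_transform)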

Multiple concatenation with logical or operator yields nested concatenation

Concatenation of two datasets with the logical operator works as intended:
concat_2 = images | images

Concatenating more datasets (concat_3 = images | images | images), however, yields a nested concatenated dataset.

The code is equivalent to:
concat_3 = torchdata.datasets.ConcatDataset([torchdata.datasets.ConcatDataset([images, images]), images])

I'd argue a more intuitive result would be something equivalent to this instead:
concat_3 = torchdata.datasets.ConcatDataset([images, images, images])

In short, using the | operator on an already concatenated dataset should append the new dataset to the existing list of concatenations instead of creating a nested concatenated dataset; a rough sketch follows.
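
A flattening __or__ on the dataset base class could produce that result. This is a sketch only; it assumes the concatenated datasets are kept in a .datasets attribute, as in torch.utils.data.ConcatDataset:

def __or__(self, other):
    """Concatenate, unpacking operands that are already concatenations."""
    def unpack(dataset):
        if isinstance(dataset, torchdata.datasets.ConcatDataset):
            return list(dataset.datasets)  # assumed attribute name
        return [dataset]
    return torchdata.datasets.ConcatDataset(unpack(self) + unpack(other))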

support for pytorch 1.3.0

First, this is a great package! It seems to bring the one cool part of TensorFlow (tf.data) to PyTorch. Thanks for the effort.

Is torch 1.3.0 supported in the current release?

Benchmark

Can you share some benchmarks comparing using this library vs. not using it?

Thank you.

Support stratified subsampler

Hi! From what I can see, there is currently no simple way in PyTorch to perform stratified subsampling of the training dataset.
I think it fits this library's scope perfectly.
Let me know what you think about it; a rough sketch of the idea follows.
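
To make the idea concrete, here is a rough sketch of stratified subsampling over plain indices (a hypothetical helper, not library code):

import collections
import random

def stratified_indices(labels, fraction, seed=0):
    """Pick `fraction` of the indices from every class, so the subsample
    keeps (approximately) the original label distribution."""
    rng = random.Random(seed)
    by_class = collections.defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for indices in by_class.values():
        chosen.extend(rng.sample(indices, max(1, int(len(indices) * fraction))))
    return sorted(chosen)

# e.g.: subset = torch.utils.data.Subset(dataset, stratified_indices(labels, 0.1))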

pip install doesn't work in Google Colab

I was introduced to this library when I asked this question on Stack Overflow.

I was able to do a pip install and get my work done on my local machine.
But I also need to be able to share some of the code with a teammate via Google Colab.
So I put the local Jupyter notebook into Colab and tried !pip install torchdata.
Turns out that doesn't work; it gives the following error message:

ERROR: Could not find a version that satisfies the requirement torchdata (from versions: none)
ERROR: No matching distribution found for torchdata

Details of the Colab environment

OS: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9 (default, Apr 18 2020, 01:56:04)  [GCC 8.4.0]
numpy version: 1.18.4
future version: 0.16.0
PyTorch version: 1.5.0+cu101
Torchvision Version: 0.6.0+cu101

Is there any way I can get the library to work in Colab?
Thank you.
