szymonmaszke / torchdatasets
PyTorch dataset extended with map, cache etc. (tensorflow.data like)
License: MIT License
Hi there, I love the repo and wanted to try it out on a project, but when I import torchdata it gives me an error. I installed the latest version of torchdata using pip install --upgrade torchdata and yet it throws:
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
The metaclass trick doesn't seem to work on later versions of Python.
Any ideas?
This can be easily checked on Colab, for example (please see below).
Thanks for the amazing work here
!pip install torchdata
!python --version
Collecting torchdata
Downloading torchdata-0.2.0-py3-none-any.whl (27 kB)
Requirement already satisfied: torch>=1.2.0 in /usr/local/lib/python3.7/dist-packages (from torchdata) (1.9.0+cu102)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch>=1.2.0->torchdata) (3.7.4.3)
Installing collected packages: torchdata
Successfully installed torchdata-0.2.0
Python 3.7.12
import torchdata as td
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-d739c1dd990c> in <module>()
----> 1 import torchdata as td
1 frames
/usr/local/lib/python3.7/dist-packages/torchdata/__init__.py in <module>()
58 """
59
---> 60 from . import cachers, datasets, maps, modifiers, samplers
61 from ._version import __version__
62 from .datasets import Dataset, Iterable
/usr/local/lib/python3.7/dist-packages/torchdata/datasets.py in <module>()
155
156
--> 157 class Iterable(TorchIterable, _DatasetBase, metaclass=MetaIterable):
158 r"""`torch.utils.data.IterableDataset` **dataset with extended capabilities**.
159
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
Traceback (most recent call last):
File "test-dataset.py", line 1, in <module>
from dataset import func
File "/root/torch-cache-test/dataset/func.py", line 3, in <module>
from . import nocache
File "/root/torch-cache-test/dataset/nocache.py", line 1, in <module>
from . import utils
File "/root/torch-cache-test/dataset/utils.py", line 8, in <module>
import torchdatasets as td
File "/opt/conda/lib/python3.6/site-packages/torchdatasets/__init__.py", line 60, in <module>
from . import cachers, datasets, maps, modifiers, samplers
File "/opt/conda/lib/python3.6/site-packages/torchdatasets/datasets.py", line 164, in <module>
class MetaIterableWrapper(MetaIterable, GenericMeta, _typing._DataPipeMeta): pass
TypeError: Cannot create a consistent method resolution
order (MRO) for bases type, GenericMeta, _DataPipeMeta
Can you share some benchmarks comparing runs with this library against runs without it?
Thank you.
I was introduced to this library when I asked this question on stack overflow.
I was able to do a pip install and get my work done on my local machine.
But, I also need to be able to share some of the code with a teammate via Google Colab.
So, I put the local Jupyter notebook into Colab and tried !pip install torchdata.
Turns out that doesn't work. It gives the following error message.
ERROR: Could not find a version that satisfies the requirement torchdata (from versions: none)
ERROR: No matching distribution found for torchdata
Details of the Colab environment
OS: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]
numpy version: 1.18.4
future version: 0.16.0
PyTorch version: 1.5.0+cu101
Torchvision Version: 0.6.0+cu101
Is there any way I can get the library to work in Colab?
Thank you.
Using .cache() (with the default memory cacher) does nothing when the Dataset is used in a multi-process DataLoader. This is a gotcha that should probably be pointed out in the documentation and the tutorial, as it is easy to overlook.
In my case I was dropping torchdata's cache into a program that already had the DataLoaders defined. The DataLoaders were initialized with a positive int in num_workers. It took me a while to figure out why the cache didn't seem to work at all.
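A minimal sketch of a possible workaround (an assumption on my part, not an official recommendation): switch to a disk-based cacher such as torchdata.cachers.Pickle, since every worker process sees the same cache directory, whereas each worker holds its own private copy of an in-memory cache.

import pathlib
import torch
import torchdata as td
import torchvision

# base_dataset is a placeholder; any torch Dataset works here
base_dataset = torchvision.datasets.ImageFolder("./images")

# Disk-based cache shared through the filesystem, so num_workers > 0 still benefits
dataset = td.datasets.WrapDataset(base_dataset).cache(
    td.cachers.Pickle(pathlib.Path("./cache"))
)

loader = torch.utils.data.DataLoader(dataset, num_workers=4)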
Hi!
This is an awesome package, but I ran into a warning while using the Pickle cacher:
***/site-packages/torch/storage.py:34: FutureWarning: pickle support for Storage will be removed in 1.5. Use `torch.save` instead
warnings.warn("pickle support for Storage will be removed in 1.5. Use `torch.save` instead", FutureWarning)
As far as I understand, it can be fixed by creating a new cacher that uses torch.save and torch.load. I was planning to open this as a PR, but I couldn't figure out how to set up the test environment, so this is an issue instead.
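A rough sketch of what such a cacher could look like (assuming the same __contains__/__setitem__/__getitem__ protocol as the built-in Pickle cacher; the class name and file extension are made up):

import pathlib
import torch

class TorchSaveCacher:
    # Caches each sample on disk via torch.save/torch.load instead of pickle.
    def __init__(self, path: pathlib.Path, extension: str = ".pt"):
        self.path = path
        self.extension = extension
        self.path.mkdir(parents=True, exist_ok=True)

    def _file(self, index):
        return self.path / f"{index}{self.extension}"

    def __contains__(self, index):
        return self._file(index).is_file()

    def __setitem__(self, index, data):
        torch.save(data, self._file(index))

    def __getitem__(self, index):
        return torch.load(self._file(index))

Usage would then presumably mirror the Pickle cacher, e.g. dataset.cache(TorchSaveCacher(pathlib.Path("./cache"))).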
I got TypeError: metaclass conflict when importing torchdata.
Python Version: Python 3.6.13
PyTorch:
torch==1.7.1+cu110
torchaudio==0.7.2
torchdata==0.2.0
torchvision==0.8.2+cu110
In [1]: import torchdata as td
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-1-d739c1dd990c> in <module>
----> 1 import torchdata as td
~/git/envs/pyenv/lib/python3.6/site-packages/torchdata/__init__.py in <module>
58 """
59
---> 60 from . import cachers, datasets, maps, modifiers, samplers
61 from ._version import __version__
62 from .datasets import Dataset, Iterable
~/git/envs/pyenv/lib/python3.6/site-packages/torchdata/datasets.py in <module>
155
156
--> 157 class Iterable(TorchIterable, _DatasetBase, metaclass=MetaIterable):
158 r"""`torch.utils.data.IterableDataset` **dataset with extended capabilities**.
159
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
type(TorchIterable)
Out[2]: typing.GenericMeta
Dataset inherits from torch's Dataset, which already uses the GenericMeta metaclass, but it also specifies MetaDataset as its metaclass, which causes the conflict.
You can reproduce it with this:
class M_A(type):
    pass

class M_B(type):
    pass

class A(metaclass=M_A):
    pass

class C(A, metaclass=M_B):
    pass
This can be solved with a wrapper:
class MetaDatasetWrapper(MetaDataset, GenericMeta): pass
class Dataset(TorchDataset, _DatasetBase, metaclass=MetaDatasetWrapper):
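For the toy example above, the analogous fix would be a combined metaclass (M_AB is a name made up purely for illustration):

class M_AB(M_A, M_B):
    pass

class C(A, metaclass=M_AB):
    pass  # no conflict: M_AB is a subclass of both metaclasses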
Thanks for this amazing library. I was wondering, for large datasets with millions of images, would it make sense to cache into a single file (e.g., HDF5) instead of creating millions of cache files? Do you have any plan to support the HDF5 format? Thanks!
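A minimal sketch of what a single-file cacher could look like (an assumption: it follows the same __contains__/__setitem__/__getitem__ protocol as the built-in cachers, stores CPU tensors only, and is used from a single process, since concurrent HDF5 writes from DataLoader workers are not safe):

import h5py
import torch

class HDF5Cacher:
    def __init__(self, path):
        # One HDF5 file holds every cached sample, keyed by sample index.
        self.file = h5py.File(path, "a")

    def __contains__(self, index):
        return str(index) in self.file

    def __setitem__(self, index, data):
        self.file.create_dataset(str(index), data=data.numpy())

    def __getitem__(self, index):
        return torch.from_numpy(self.file[str(index)][()])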
Hi @szymonmaszke, is it possible to concatenate two datasets that use different map functions? I checked the docs but I am not sure.
Thank you in advance!
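If I understand the concatenation semantics correctly (a sketch, not verified against the docs), each wrapped dataset keeps its own map and the | operator only concatenates them, so something like this should work:

import torch
import torchdata as td

# Toy dataset used purely for illustration.
class Numbers(torch.utils.data.Dataset):
    def __init__(self, values):
        self.values = values

    def __getitem__(self, index):
        return self.values[index]

    def __len__(self):
        return len(self.values)

first = td.datasets.WrapDataset(Numbers([1, 2, 3])).map(lambda x: x * 10)
second = td.datasets.WrapDataset(Numbers([4, 5, 6])).map(lambda x: x + 100)
combined = first | second  # samples keep the map of the dataset they came from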
First, thanks for an elegant library that has saved me a significant amount of time over the past couple years.
Now, the problem: since the name change, I've been trying to refactor some old code to work with newer versions of torch (specifically torch==1.8.1+cu101). While doing so, I seem to have uncovered a confusing issue that shows up upon installation of torchdatasets. The problem is that neither pip install torchdatasets nor pip install torchdatasets==0.2.0 results in a version of the repo identical to the one tagged as 0.2.0 on GitHub. This became a problem for me when I tried simply importing:
(ffcv-test) jrose3@serrep3:/media/data/jacob/GitHub/ffcv/examples/cifar$ python
Python 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:42:07)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torchdatasets as torchdata
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/media/data/conda/jrose3/envs/ffcv-test/lib/python3.8/site-packages/torchdatasets/__init__.py", line 60, in <module>
from . import cachers, datasets, maps, modifiers, samplers
File "/media/data/conda/jrose3/envs/ffcv-test/lib/python3.8/site-packages/torchdatasets/datasets.py", line 28, in <module>
from torch.utils.data import _typing
ImportError: cannot import name '_typing' from 'torch.utils.data' (/media/data/conda/jrose3/envs/ffcv-test/lib/python3.8/site-packages/torch/utils/data/__init__.py)
Examining the two implicated Python files (one in torch and one in torchdatasets), I realized that torch.utils.data._typing isn't actually introduced into torch until torch==1.9.0, while I'm currently using torch==1.8.1, and as far as I can tell the only stated requirement for the torchdatasets library is torch>=1.2.0, listed in requirements.txt.
Looking further into the torchdatasets file that relies on torch.utils.data._typing, namely torchdatasets/datasets.py, I found that it's only used once, for a comically unnecessary type hint in a placeholder class's definition:
class MetaIterableWrapper(MetaIterable, GenericMeta, _typing._DataPipeMeta): pass
My assumption is that this was introduced as part of an effort to integrate the new torch data pipe pattern, but at some point it leaked into the main repo and broke a bunch of other, significant assumptions necessary to install smoothly. Since I can only find it via my locally installed pip version and not on GitHub, I have no clear way of tracking down when it was introduced or by whom.
My recommendation is removing these 2 lines from the file torchdatasets/datasets.py hosted on pip for version 0.2.0 (I'm not sure if these can be revised without updating the version as well). Thoughts?
Hi! From what I can see, there is currently no simple way in PyTorch to perform stratified subsampling of the training dataset.
I think it fits this library's scope perfectly.
Let me know what you think about it.
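As a stop-gap sketch (dataset and labels are placeholders here, and scikit-learn is not a dependency of this library), a stratified subsample can be drawn by splitting indices with stratification and wrapping them in a Subset:

import torch
from sklearn.model_selection import train_test_split

# Keep a stratified 10% subsample of the training set.
indices, _ = train_test_split(
    list(range(len(labels))), train_size=0.1, stratify=labels, random_state=0
)
subsample = torch.utils.data.Subset(dataset, indices)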
Hello! Thanks so much for this wonderful library.
This is my first time using it, and I'm following the README example and StackOverflow post to apply transformations after splitting the data. However, I am getting the above error when I try to run train_dataset.map(train_transform). Does the wrapper still work, or did I make a mistake somewhere?
dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder('./root'))
total_num = len(dataset)
train_num = int(0.7 * total_num)
val_num = int(0.2 * total_num)
test_num = total_num - train_num - val_num
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(
    dataset, (train_num, val_num, test_num)
)
train_dataset = train_dataset.map(train_transform)
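If the error is an AttributeError on map (an assumption on my part, since the exact traceback isn't quoted here), a likely cause is that torch.utils.data.random_split returns plain Subset objects, which don't carry the wrapper's methods. Re-wrapping each split is one possible workaround:

train_dataset = td.datasets.WrapDataset(train_dataset).map(train_transform)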
Concatenation of two datasets with the logical operator works as intended:
concat_2 = images | images
While concatenation of more datasets (concat_3 = images | images | images) yields a nested concatenated dataset.
The code is equivalent to:
concat_3 = torchdata.datasets.ConcatDataset([torchdata.datasets.ConcatDataset([images, images]), images])
I'd argue a more intuitive result would be something equivalent to this instead:
concat_3 = torchdata.datasets.ConcatDataset([images, images, images])
In short, concatenating with the | operator onto an already concatenated dataset should add the new dataset to the list of concatenations, instead of creating a nested concatenated dataset.
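A minimal sketch of the proposed behaviour with toy classes (names are illustrative, not the actual torchdata implementation):

class Concat:
    def __init__(self, datasets):
        self.datasets = list(datasets)

    def __or__(self, other):
        # Flatten: extend the existing list instead of nesting another Concat.
        others = other.datasets if isinstance(other, Concat) else [other]
        return Concat(self.datasets + others)

class Single:
    def __init__(self, name):
        self.name = name

    def __or__(self, other):
        return Concat([self, other])

concat_3 = Single("a") | Single("b") | Single("c")
print(len(concat_3.datasets))  # 3, no nesting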
Hi, question as in the title.
What should I do to use torchdata?
First, this is a great package! It seems to bring the only cool part of TensorFlow (tf.data) to PyTorch. Thanks for the effort.
Is torch 1.3.0 supported in the current release?