ufoym / imbalanced-dataset-sampler Goto Github PK

2.2K 31.0 264.0 185 KB

A (PyTorch) imbalanced dataset sampler for oversampling low frequent classes and undersampling high frequent ones.

License: MIT License

Python 100.00%

pytorch imbalanced-data image-classification data-sampling

imbalanced-dataset-sampler's Issues

AttributeError: 'ConcatDataset' object has no attribute 'img_norm_cfg'

When I run test.py, I get this error:
File "tools/test.py", line 211, in
main()
File "tools/test.py", line 181, in main
outputs = single_gpu_test(model, data_loader, args.show, args.log_dir)
File "tools/test.py", line 39, in single_gpu_test
model.module.show_result(data, result, dataset.img_norm_cfg, dataset='DOTA1_5')
AttributeError: 'ConcatDataset' object has no attribute 'img_norm_cfg'

How can I solve this problem?

ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

Hi, I am using BERT for multi-label classification.
The dataset is imbalanced, so I use ImbalancedDatasetSampler as the sampler.

The training data has been tokenized;
each sample has ids, a mask, and a label:

(tensor([ 101, 112, 872, 4761, 6887, 1914, 840, 1914, 7353, 6818, 3300, 784,
720, 1408, 136, 1506, 1506, 3300, 4788, 2357, 5456, 119, 119, 119,
4696, 4638, 741, 677, 1091, 4638, 872, 1420, 1521, 119, 119, 119,
872, 2157, 6929, 1779, 4788, 2357, 3221, 686, 4518, 677, 3297, 1920,
4638, 4788, 2357, 117, 1506, 1506, 117, 7745, 872, 4638, 1568, 2124,
3221, 6432, 2225, 1217, 2861, 4478, 4105, 2357, 3221, 686, 4518, 677,
3297, 1920, 4638, 4105, 2357, 1568, 119, 119, 119, 1506, 1506, 1506,
112, 112, 4268, 4268, 117, 1961, 4638, 1928, 1355, 5456, 106, 2769,
812, 1920, 2812, 7370, 3488, 2094, 6963, 6206, 5436, 677, 3341, 2769,
4692, 1168, 3312, 1928, 5361, 7027, 3300, 1928, 1355, 119, 119, 119,
671, 2137, 3221, 8584, 809, 1184, 1931, 1168, 4638, 117, 872, 6432,
3221, 679, 3221, 136, 138, 4495, 4567, 140, 102, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0]),
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
tensor(0))

With the built-in samplers imported by

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

everything works:

batch_size=3
dataloader_train_o = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size,
    # **kwargs
)

However, when I replace the sampler with ImbalancedDatasetSampler:

batch_size=3
dataloader_train_o = DataLoader(
    dataset_train,
    sampler=ImbalancedDatasetSampler(dataset_train),
    batch_size=batch_size,
    # **kwargs
)

the following error is printed:

ValueError Traceback (most recent call last)
File D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pandas\core\frame.py:3892, in DataFrame._ensure_valid_index(self, value)
3891 try:
-> 3892 value = Series(value)
3893 except (ValueError, NotImplementedError, TypeError) as err:

File D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pandas\core\series.py:451, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
450 else:
--> 451 data = sanitize_array(data, index, dtype, copy)
453 manager = get_option("mode.data_manager")

File D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pandas\core\construction.py:601, in sanitize_array(data, index, dtype, copy, raise_cast_failure, allow_2d)
599 subarr = maybe_infer_to_datetimelike(subarr)
--> 601 subarr = _sanitize_ndim(subarr, data, dtype, index, allow_2d=allow_2d)
603 if isinstance(subarr, np.ndarray):
604 # at this point we should have dtype be None or subarr.dtype == dtype

File D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pandas\core\construction.py:652, in _sanitize_ndim(result, data, dtype, index, allow_2d)
651 return result
--> 652 raise ValueError("Data must be 1-dimensional")
653 if is_object_dtype(dtype) and isinstance(dtype, ExtensionDtype):
654 # i.e. PandasDtype("O")

ValueError: Data must be 1-dimensional

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
Input In [49], in <cell line: 5>()
2 from torchsampler import ImbalancedDatasetSampler
4 batch_size=3
5 dataloader_train_o = DataLoader(
6 dataset_train,
----> 7 sampler=ImbalancedDatasetSampler(dataset_train),
8 batch_size=batch_size,
9 # **kwargs
10 )
12 dataloader_validation_o = DataLoader(
13 dataset_val,
14 sampler=SequentialSampler(dataset_val),
15 batch_size=batch_size,
16 # **kwargs
17 )

File D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torchsampler\imbalanced.py:37, in ImbalancedDatasetSampler.__init__(self, dataset, labels, indices, num_samples, callback_get_label)
35 # distribution of classes in the dataset
36 df = pd.DataFrame()
---> 37 df["label"] = self._get_labels(dataset) if labels is None else labels
38 df.index = self.indices
39 df = df.sort_index()

File D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pandas\core\frame.py:3655, in DataFrame.__setitem__(self, key, value)
3652 self._setitem_array([key], value)
3653 else:
3654 # set column
-> 3655 self._set_item(key, value)

File D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pandas\core\frame.py:3832, in DataFrame._set_item(self, key, value)
3822 def _set_item(self, key, value) -> None:
3823 """
3824 Add series to DataFrame in specified column.
3825
(...)
3830 ensure homogeneity.
3831 """
-> 3832 value = self._sanitize_column(value)
3834 if (
3835 key in self.columns
3836 and value.ndim == 1
3837 and not is_extension_array_dtype(value)
3838 ):
3839 # broadcast across multiple columns if necessary
3840 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):

File D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pandas\core\frame.py:4528, in DataFrame._sanitize_column(self, value)
4515 def _sanitize_column(self, value) -> ArrayLike:
4516 """
4517 Ensures new columns (which go into the BlockManager as new blocks) are
4518 always copied and converted into an array.
(...)
4526 numpy.ndarray or ExtensionArray
4527 """
-> 4528 self._ensure_valid_index(value)
4530 # We should never get here with DataFrame value
4531 if isinstance(value, Series):

File D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\pandas\core\frame.py:3894, in DataFrame._ensure_valid_index(self, value)
3892 value = Series(value)
3893 except (ValueError, NotImplementedError, TypeError) as err:
-> 3894 raise ValueError(
3895 "Cannot set a frame with no defined index "
3896 "and a value that cannot be converted to a Series"
3897 ) from err
3899 # GH31368 preserve name of index
3900 index_copy = value.index.copy()

ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

Does this work for multi-label classification?

This is great work!
But I would like to know: does this work for multi-label classification, for example with BCELoss in PyTorch?
Thanks.

ConcatDataset support

Thanks for the great work!

I tried to combine two datasets using "dataset = dataset1+dataset2", and it gives me this error:
AttributeError: 'ConcatDataset' object has no attribute 'get_labels'

Is there any workaround?
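
One possible workaround, sketched under assumptions: the traceback in the BERT issue above shows the sampler's constructor accepting a `labels` argument, so the labels of the parts could be gathered by hand and passed in. `ToyDataset` and `gather_labels` are hypothetical names used for illustration, and each part is assumed to expose `get_labels()`.

```python
from itertools import chain


class ToyDataset:
    """Hypothetical stand-in for a dataset that exposes get_labels()."""

    def __init__(self, labels):
        self._labels = list(labels)

    def __len__(self):
        return len(self._labels)

    def get_labels(self):
        return self._labels


def gather_labels(datasets):
    """Concatenate labels in the order the datasets are concatenated."""
    return list(chain.from_iterable(d.get_labels() for d in datasets))


dataset1 = ToyDataset([0, 0, 0, 1])
dataset2 = ToyDataset([1, 2])
all_labels = gather_labels([dataset1, dataset2])
print(all_labels)

# Assumed usage with the real classes (not executed here):
# sampler = ImbalancedDatasetSampler(dataset1 + dataset2, labels=all_labels)
```

Whether the installed version of the package accepts `labels` should be checked before relying on this.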

callback_get_label no longer works as described

callback_get_label: a callback-like function which takes two arguments - dataset and index

This no longer seems to be the case.

Please document what the new usage looks like, because the above commit breaks a lot of my existing code.

@hwany-j

Error when call ImbalancedDatasetSampler function

The following error occurred when creating the DataLoader. I am working on Google Colab.

Code
train_dataset = DataLoader(train_dataset, sampler=ImbalancedDatasetSampler(train_dataset), batch_size = BATCH_SIZE, drop_last=True )

Error

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 5, in
train_dataset = DataLoader(train_dataset, sampler=ImbalancedDatasetSampler(train_dataset),batch_size = BATCH_SIZE,
File "/content/drive/My Drive/Research_Shanto/Shanto/Packages/imbalanced-dataset-sampler-master/torchsampler/imbalanced.py", line 32, in __init__
label = self._get_label(dataset, idx)
File "/content/drive/My Drive/Research_Shanto/Shanto/Packages/imbalanced-dataset-sampler-master/torchsampler/imbalanced.py", line 53, in _get_label
raise NotImplementedError
NotImplementedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
stb = value.render_traceback()
AttributeError: 'NotImplementedError' object has no attribute 'render_traceback'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
File "/usr/lib/python3.6/inspect.py", line 1490, in getinnerframes
frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
File "/usr/lib/python3.6/inspect.py", line 1448, in getframeinfo
filename = getsourcefile(frame) or getfile(frame)
File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
if getattr(getmodule(object, filename), 'loader', None) is not None:
File "/usr/lib/python3.6/inspect.py", line 725, in getmodule
file = getabsfile(object, _filename)
File "/usr/lib/python3.6/inspect.py", line 709, in getabsfile
return os.path.normcase(os.path.abspath(_filename))
File "/usr/lib/python3.6/posixpath.py", line 383, in abspath
cwd = os.getcwd()
OSError: [Errno 107] Transport endpoint is not connected

I really hope the community sees this issue and can suggest a solution.

Could you explain your way of sampling in details?

Thanks so much for your implementation. But I have several questions:

In the picture below, it seems that the class with fewer samples is sampled repeatedly, while the class with more samples is subsampled. So what is the difference between your method and the traditional method?

In each epoch, is each image sampled only once? I ask because you mentioned that your method avoids creating a new balanced dataset.

'MyDataset' object has no attribute 'get_labels'

When I try to use my own Dataset class, I get the error 'MyDataset' object has no attribute 'get_labels' and cannot proceed.

The content of the Dataloader is as follows, and there is nothing strange about it.
It processes the image data and label data in .npz format.

class MyDataset(data.Dataset):
    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        label = self.labels[index]

        if self.transform is not None:
            image = self.transform(image=image)["image"]

        return image, label

train_dataset = MyDataset(train_imgs, train_labels, transform=transform)
train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                               sampler=ImbalancedDatasetSampler(train_dataset),
                                               batch_size= batch_size,
                                               shuffle=True,
                                               num_workers=2)

Is there something wrong with the code?
I don't think it's a typo.

How can I fix it so that it works correctly?
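
A possible fix, as a sketch: the `_get_labels` code quoted elsewhere on this page calls `dataset.get_labels()` for a generic `Dataset`, so adding that method to the class may be enough. The version below leaves out the `torch` base class and simplifies the transform call so that it runs on its own; in real code the class would still subclass `data.Dataset`. The file names are invented.

```python
class MyDataset:
    """Sketch of the dataset above with the missing get_labels() method.

    Assumption: in real code this subclasses torch.utils.data.Dataset.
    """

    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]

    def get_labels(self):
        # One label per sample, in index order.
        return list(self.labels)


train_dataset = MyDataset(["a.png", "b.png", "c.png"], [0, 0, 1])
print(train_dataset.get_labels())
```

Separately, `DataLoader` treats `sampler` and `shuffle=True` as mutually exclusive, so `shuffle=True` would need to be dropped once a sampler is passed.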

how to extend to distributed training?

If I want to train the model on multiple GPUs, how should I implement it?

Feature Request: Force a ratio while sampling

Could you please implement a feature that lets us choose the sampling ratio? I believe the sampler currently always weights classes equally.

Something like this: https://discuss.pytorch.org/t/how-to-handle-imbalanced-classes/11264/2

But in-built.
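
As a rough sketch of what such a ratio could look like (this is not an existing feature of the package): give every sample the weight `ratio[label] / count[label]`, so that the total weight of each class equals its requested share. `ratio_weights` is a hypothetical helper, and the resulting list is assumed to be usable as the `weights` of PyTorch's `WeightedRandomSampler`.

```python
from collections import Counter


def ratio_weights(labels, ratio):
    """Per-sample weights whose per-class totals match `ratio`."""
    counts = Counter(labels)
    return [ratio[label] / counts[label] for label in labels]


labels = [0] * 90 + [1] * 10
# Ask for roughly 70% class 0 and 30% class 1 among the drawn samples.
weights = ratio_weights(labels, {0: 0.7, 1: 0.3})

share_of_class_1 = sum(w for w, l in zip(weights, labels) if l == 1)
print(round(share_of_class_1, 3))
```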

Does this work for binary classification tasks?

Does val_dataset need balanced sampler?

I don't know whether the val_dataset also needs a balanced sampler.
Thanks.

Alternative for IterableDatasets

Hello.

I'm opening this issue to make users aware that I've just released a small package for sampling from IterableDatasets. It's thus complementary to this package which only works with batch datasets.

Latest commit cannot get correct labels from ImageFolder dataset

ImageFolder.imgs returns a list of 2-element tuples, such as:

[
('classA/img01.jpg', 0),
('classB/img01.jpg', 1),
...
('classN/img01.jpg', N-1)
]

If we use dataset.imgs[:][1], we do not get the labels of all samples.

The problem is at this line:

imbalanced-dataset-sampler/torchsampler/imbalanced.py

Line 46 in 756b0b6

return dataset.imgs[:][1]
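
The difference is easy to see on a plain list shaped like `ImageFolder.imgs`; the paths below are made up for illustration.

```python
# Same shape as ImageFolder.imgs: a list of (path, class_index) tuples.
imgs = [
    ("classA/img01.jpg", 0),
    ("classA/img02.jpg", 0),
    ("classB/img01.jpg", 1),
]

# imgs[:] is just a copy of the list, so [1] picks the second *sample*.
second_sample = imgs[:][1]

# Taking element 1 of every tuple yields one label per sample.
labels = [label for _, label in imgs]

print(second_sample)
print(labels)
```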

No epoch exists here

Imbalanced dataset sampler w/ crossentropyloss not importing on google colab

Hello, I'm trying to import the module on Google Colab, but I'm getting a "no module found" error. Has anyone fixed this?
I also tried installing it with !pip install torchsampler.

Thanks

ERROR label_to_count not callable

Hi !

I noticed that the last commit, ad50e22, introduced some bugs.

Steps to reproduce

`
import torch
import torchvision
from torchsampler import ImbalancedDatasetSampler

# `transform` and `args` are defined elsewhere in the reporter's script
mnist = torchvision.datasets.MNIST('.', train=True, download=True, transform=transform)
train_loader_b = torch.utils.data.DataLoader(
    mnist,
    sampler=ImbalancedDatasetSampler(mnist),
    batch_size=args.batch_size,
)
`

`
TypeError Traceback (most recent call last)
in
1 train_loader= torch.utils.data.DataLoader(
2 mnist,
----> 3 sampler=ImbalancedDatasetSampler(mnist),
4 batch_size=args.batch_size,
5 )

~/.local/lib/python3.8/site-packages/torchsampler/imbalanced.py in __init__(self, dataset, indices, num_samples, callback_get_label)
34 label_to_count = df["label"].value_counts()
35
---> 36 weights = 1.0 / label_to_count(df["label"])
37
38 self.weights = torch.DoubleTensor(weights)

TypeError: 'Series' object is not callable
`

I think label_to_count is now a pandas Series and therefore can't be called.

Any idea how to fix it? (I will give it a try soon.)
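
For reference, a sketch of what line 36 appears to intend: look up each label's count by indexing rather than by calling. It uses `collections.Counter` instead of pandas so that it runs without extra packages; with a pandas Series the lookup would likewise use square brackets, as in the `label_to_count[df["label"]]` line quoted elsewhere on this page. The label values are invented.

```python
from collections import Counter

labels = [0, 0, 0, 1, 2, 2]
label_to_count = Counter(labels)

# A mapping of counts is indexed, not called: label_to_count(label) raises TypeError.
weights = [1.0 / label_to_count[label] for label in labels]
print(weights)
```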

Too much time cost

Training is many times slower than before when I use this sampler.

(self.indices[i] for i in torch.multinomial(self.weights, self.num_samples, replacement=True))
This expression seems to take too much time!

Does anyone have a solution?

Is it possible to implement this sampler in segmentation model?

I have an imbalanced dataset consisting of 5 classes of images paired with pixel-wise annotated masks.
My 'mask' is an array that has the same pixel size as the image and class number assigned in the corresponding pixel.
But it seems the imbalanced-dataset-sampler is designed for "labels" rather than mask arrays.
Can I modify my dataset function to fit this sampler?
(my mask array only contains 0 and a specific class number at a time)

class Dataset(BaseDataset):

    def __init__(self,
                 image_df,
                 mask_list,
                 preprocessing=None,
                 ):
        self.images_dir = image_df.all_path
        self.masks_dir = mask_list
        self.class_values = list(range(len(CLASSES)))
        self.preprocessing = preprocessing

    def __getitem__(self, i):
        image = cv2.imread(self.images_dir[i])
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        mask = cv2.imread(self.masks_dir[i], 0)

        masks = [(mask == v) for v in self.class_values]
        mask = np.stack(masks, axis=-1).astype('float')

        if self.preprocessing:
            sample = self.preprocessing(image=image, mask=mask)
            image, mask = sample['image'], sample['mask']

        return image, mask

    def __len__(self):
        return len(self.images_dir)
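
One way this might be adapted, as a sketch only: since each mask holds just 0 and a single class number, an image-level label can be derived from the mask and then returned from a `get_labels()` method or passed to the sampler. `mask_to_label` is a hypothetical helper, and the masks are shown as nested lists rather than arrays read with `cv2`.

```python
def mask_to_label(mask):
    """Return the single non-zero class in a mask, or 0 for pure background."""
    classes = {value for row in mask for value in row if value != 0}
    if len(classes) > 1:
        raise ValueError(f"expected one foreground class, found {sorted(classes)}")
    return classes.pop() if classes else 0


masks = [
    [[0, 0], [0, 3]],   # contains class 3
    [[0, 0], [0, 0]],   # background only
    [[1, 1], [0, 1]],   # contains class 1
]
labels = [mask_to_label(m) for m in masks]
print(labels)
```

This balances images by their foreground class; it does not balance pixel counts within a mask.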

Suggestion: Change y limits of learning curves in README

If I can make one small quick suggestion, perhaps the two following plots in your README should have the same y axis limits.

Can I use this sampler in balanced dataset?

Should apply dataset.target_transform

In the sampler, when getting the labels of the dataset and counting the frequencies of each, the dataset target transform should be applied. Target transforms sometimes change the label (e.g. by grouping multiple of the original classes) and might affect the frequency counts for each label.
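
A small sketch of the effect, with an invented `target_transform` that merges two classes: counting raw labels and counting transformed labels give different frequencies, and therefore different weights.

```python
from collections import Counter

raw_labels = [0, 0, 0, 0, 1, 2]


def target_transform(label):
    """Hypothetical transform: merge classes 1 and 2 into one class."""
    return 1 if label in (1, 2) else label


raw_counts = Counter(raw_labels)
transformed_counts = Counter(target_transform(l) for l in raw_labels)

print(raw_counts)
print(transformed_counts)
```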

Get label as the last entry in the tensor dataset

imbalanced-dataset-sampler/torchsampler/imbalanced.py

Line 51 in 01cb129

return dataset.tensors[1]

I think this should be return dataset.tensors[-1]. I have run into issues when the TensorDataset object is composed of more than one dataframe, with the last one corresponding to the labels.
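
A toy illustration, with plain lists standing in for the tensors of a three-field dataset (ids, mask, label):

```python
# Stand-in for TensorDataset.tensors with three fields per sample.
input_ids = [[101, 872, 102], [101, 4638, 102]]
mask = [[1, 1, 1], [1, 1, 1]]
labels = [0, 1]
tensors = (input_ids, mask, labels)

print(tensors[1] is mask)      # index 1 is the mask, not the labels
print(tensors[-1] is labels)   # the last entry is the label field
```

This only helps when the labels are stored last, which is an assumption about how the dataset was built.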

better description of "epoch"

Thanks for the good implementation! I was a bit confused about the description so I'd like to comment.

Then in each epoch, the loader will sample the entire dataset and weigh your samples inversely to your class appearing probability.

But this is no longer what we normally call an epoch, right? We do not iterate over all data points in an epoch, because some data points that belong to majority classes are never drawn. Technically, for each "epoch" as defined by PyTorch, the loader draws as many data points as the original dataset contains, and each sample is picked with probability inversely proportional to its class frequency.
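
To make this concrete, here is a sketch using only the standard library; `random.choices` stands in for the `torch.multinomial(..., replacement=True)` call quoted elsewhere on this page, and the class sizes are invented.

```python
import random
from collections import Counter

labels = [0] * 95 + [1] * 5
counts = Counter(labels)
weights = [1.0 / counts[label] for label in labels]

rng = random.Random(0)
# One "epoch": draw len(labels) indices with replacement.
drawn = rng.choices(range(len(labels)), weights=weights, k=len(labels))

unique_drawn = set(drawn)
print(len(drawn), len(unique_drawn))
print(Counter(labels[i] for i in drawn))
```

The number of draws equals the dataset size, but the number of distinct samples seen is smaller, because minority samples repeat and many majority samples are skipped.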

pip install error

I followed your instructions to install the package:
git clone https://github.com/ufoym/imbalanced-dataset-sampler.git
cd imbalanced-dataset-sampler
python setup.py install
pip install .

But when I ran "pip install .", I got the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/miniconda3/envs/pytorch/lib/python3.7/site-packages/torchsampler-0.1-py3.7.egg'

How can I resolve it?

Unable to use the package in google colab

I tried importing the package in Colab, but an error prompted me to install torchsampler. I then ran !pip install torchsampler and the following error popped up:

ERROR: Could not find a version that satisfies the requirement torchsampler (from versions: none) ERROR: No matching distribution found for torchsampler.

I would be really grateful for a quick fix.

Implementation for Pytorch-geometric dataset

I have added a few lines that make the sampler work with a pytorch-geometric dataset. Since pytorch-geometric data is saved as a list before being loaded by a pytorch-geometric DataLoader, the modification is pretty simple.
Hope this could be helpful to someone.

Best,

Anna

from typing import Callable

import pandas as pd
import torch
import torch.utils.data
import torchvision


class ImbalancedDatasetSampler(torch.utils.data.sampler.Sampler):
    """Samples elements randomly from a given list of indices for imbalanced dataset

    Arguments:
        indices: a list of indices
        num_samples: number of samples to draw
        callback_get_label: a callback-like function which takes two arguments - dataset and index
    """

    def __init__(self, dataset, indices: list = None, num_samples: int = None, callback_get_label: Callable = None):
        # if indices is not provided, all elements in the dataset will be considered
        self.indices = list(range(len(dataset))) if indices is None else indices

        # define custom callback
        self.callback_get_label = callback_get_label

        # if num_samples is not provided, draw `len(indices)` samples in each iteration
        self.num_samples = len(self.indices) if num_samples is None else num_samples

        # distribution of classes in the dataset
        df = pd.DataFrame()
        df["label"] = self._get_labels(dataset)
        df.index = self.indices
        df = df.sort_index()

        label_to_count = df["label"].value_counts()

        weights = 1.0 / label_to_count[df["label"]]

        self.weights = torch.DoubleTensor(weights.to_list())

    def _get_labels(self, dataset):
        if self.callback_get_label:
            return self.callback_get_label(dataset)
        elif isinstance(dataset, torchvision.datasets.MNIST):
            return dataset.train_labels.tolist()
        elif isinstance(dataset, torchvision.datasets.ImageFolder):
            return [x[1] for x in dataset.imgs]
        elif isinstance(dataset, torchvision.datasets.DatasetFolder):
            return dataset.samples[:][1]
        elif isinstance(dataset, torch.utils.data.Subset):
            return dataset.dataset.imgs[:][1]
        elif isinstance(dataset, torch.utils.data.Dataset):
            return dataset.get_labels()
        elif isinstance(dataset, list):
            return [dataset[i].y.item() for i in range(len(dataset))]  # here the modification
        else:
            raise NotImplementedError

    def __iter__(self):
        return (self.indices[i] for i in torch.multinomial(self.weights, self.num_samples, replacement=True))

    def __len__(self):
        return self.num_samples

Doesn't work with concatenated dataset

Using ImbalancedDatasetSampler on a dataset built with ConcatDataset([datasetA, datasetB, datasetC]) raises:

AttributeError: 'ConcatDataset' object has no attribute 'get_labels'

ModuleNotFoundError: No module named 'torchsampler'

Thanks for sharing! When I run
"from torchsampler import ImbalancedDatasetSampler"

ModuleNotFoundError Traceback (most recent call last)
in
8 # os.environ["CUDA_VISIBLE_DEVICES"]="1"
9
---> 10 from torchsampler import ImbalancedDatasetSampler

ModuleNotFoundError: No module named 'torchsampler'

I get this error. How can I solve this problem?

imbalanced data set not reading in correctly to torch

Hello,

I'm getting the following error trying to use the Imbalanced data sampler.

ValueError Traceback (most recent call last)
in
----> 1 train_loader = DataLoader(roof, batch_size=10, sampler = ImbalancedDatasetSampler)

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, dataset, batch_size, shuffle, sampler, batch_sampler, num_workers, collate_fn, pin_memory, drop_last, timeout, worker_init_fn, multiprocessing_context)
217 if batch_size is not None and batch_sampler is None:
218 # auto_collation without custom batch_sampler
--> 219 batch_sampler = BatchSampler(sampler, batch_size, drop_last)
220
221 self.batch_size = batch_size

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\sampler.py in __init__(self, sampler, batch_size, drop_last)
184 raise ValueError("sampler should be an instance of "
185 "torch.utils.data.Sampler, but got sampler={}"
--> 186 .format(sampler))
187 if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or
188 batch_size <= 0:

ValueError: sampler should be an instance of torch.utils.data.Sampler, but got sampler=<class 'sampler.ImbalancedDatasetSampler'>

Was I supposed to save the sampler.py file in a special location? I saved it in my directory and it imports.
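
The message ends with `got sampler=<class ...>`, which suggests that the class itself was passed rather than an instance. A minimal sketch of the difference, with hypothetical stand-in classes so that it runs without PyTorch:

```python
class BaseSampler:
    """Stand-in for torch.utils.data.Sampler."""


class MySampler(BaseSampler):
    """Stand-in for a sampler that needs the dataset to build its weights."""

    def __init__(self, dataset):
        self.dataset = dataset


roof = ["sample0", "sample1"]

print(isinstance(MySampler, BaseSampler))        # the class itself
print(isinstance(MySampler(roof), BaseSampler))  # an instance built from the dataset
```

Under that reading, the argument in the issue would become `sampler=ImbalancedDatasetSampler(roof)`, and where the file is saved would not matter.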

Difference with WeightedRandomSampler

What is the difference between this sampler and WeightedRandomSampler in pytorch?
Is it only that in WeightedRandomSampler we need to give the weights and num_samples as input? But, here we give dataset as input?

Thanks

Subset sampling entire dataset

Hi everyone,

I have a question concerning the use of subsets with this sampler. According to the code, it chooses samples from all entries in the parent dataset:

imbalanced-dataset-sampler/torchsampler/imbalanced.py

Lines 49 to 50 in e9dd2de

 elif isinstance(dataset, torch.utils.data.Subset): 

 return dataset.dataset.imgs[:][1]

Shouldn't it only sample from the entries of the chosen subset, i.e. those in dataset.indices? When I try to run _get_labels as is, I get a length mismatch. Is my implementation of the subset unusual, or should this be changed? Returning only the labels corresponding to dataset.indices solved the problem for me:

        elif isinstance(dataset, torch.utils.data.Subset):
            return [dataset.dataset.imgs[ind][1] for ind in dataset.indices]

Does it work in Yolov5？

Can I use this in YOLOv5 (https://github.com/ultralytics/yolov5) to address the imbalance between foreground and background?

A new sampler for your reference

https://github.com/zzw-zwzhang/Yoneed/blob/main/sampler.py#L15

NotImplemented Error while running ImbalancedDatasetSampler

I followed the steps in the README exactly, yet I am getting a NotImplementedError. There is no explanation for the error either.

Here's my code:
`import numpy as np
import torch
from torchvision import transforms
from torchsampler import ImbalancedDatasetSampler

batch_size = 128
val_split = 0.2
shuffle_dataset = True
random_seed = 42

dataset_size = len(melanoma_dataset)
indices = list(range(dataset_size))
split = int(np.floor(val_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, test_indices = indices[split:], indices[:split]

train_loader = torch.utils.data.DataLoader(melanoma_dataset, batch_size=batch_size, sampler=ImbalancedDatasetSampler(melanoma_dataset))
test_loader = torch.utils.data.DataLoader(melanoma_dataset, batch_size=batch_size, sampler=test_sampler)`

	elif isinstance(dataset, torch.utils.data.Subset):
	return dataset.dataset.imgs[:][1]