cleanlab / cleanlab

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Home Page: https://cleanlab.ai

License: GNU Affero General Public License v3.0

Python 100.00%
weak-supervision data-cleaning data-quality data-science noisy-labels data-centric-ai out-of-distribution-detection outlier-detection active-learning data-labeling

cleanlab's Issues

GridSearchCV seems to work, but how does it work internally?

Let's say my classifier is LearningWithNoisyLabels(GridSearchCV(estimator=RandomForestClassifier(), param_grid=..., cv=...), cv_n_folds=...)

What will happen here? Will I get the best parameters from GridSearchCV's cross-validation, and then re-train this model on the best set of data using LearningWithNoisyLabels's cross-validation? Could this produce bad results?
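For concreteness, the construction being described (a hedged sketch; the grid values are placeholders):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from cleanlab.classification import LearningWithNoisyLabels

inner = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid={'n_estimators': [100, 300]},  # placeholder grid
    cv=3,
)
clf = LearningWithNoisyLabels(clf=inner, cv_n_folds=5)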

fasttext.py line 247, TypeError

Anaconda3\lib\site-packages\cleanlab\models\fasttext.py", line 247, in
psx = [[p for _, p in sorted(list(zip(*l)), key=lambda x: x[0])] for l in list(zip(*pred))]
TypeError: zip argument #2 must support iteration

What is the algorithm behind this repo?

Hi, I am very interested in this package. I have a small clean dataset and a large noisy dataset, and I think this package can help my model learn better on the whole dataset.
What is the algorithm behind all this? Could you post a link or a paper? Many thanks!

Can the class number be computed from psx rather than s?

Thanks for your contributions!
In the function cleanlab.latent_estimation.compute_confident_joint, if K is None, I wonder if I can replace this line with K = np.array(psx).shape[1] to avoid bugs when one or more classes in the dataset have zero samples.
There may be other lines using this kind of code. Is the toolkit compatible with this situation? Thanks!
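A minimal illustration of the proposed change (toy arrays of my own, not from the library):

import numpy as np

psx = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.7, 0.1]])   # 2 samples, 3 classes
s = np.array([0, 1])                # class 2 has zero samples in s

K_from_s = len(np.unique(s))         # -> 2, misses the empty class
K_from_psx = np.array(psx).shape[1]  # -> 3, counts every class column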

Regarding a tensorflow CNN example

Hi sir,
I have implemented a TensorFlow CNN example using your rank pruning algorithm. Could you please check the e-mail I sent (to [email protected]; subject: Regarding the confident learning library) and reply to my questions? I would be very grateful.
Hopefully my code is a positive contribution to this repository.

Program running time

Hello, I want to select samples with wrong labels, and I run the program with

ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=numpy_array_of_predicted_probabilities,
    sorted_index_method='normalized_margin',  # Orders label errors
)

The shape of psx is (390000, 70). Can you tell me how long the program should take? I have already waited 2 hours; maybe I made a mistake?

Getting NaN values with LearningWithNoisyLabels

Hi,

I am trying to apply your "cleanlab, in a nutshell 🌰" tutorial to my own dataset, but at the learning step I get the following error:

# Wrap around any classifier (scikit-learn, PyTorch, TensorFlow, FastText, etc.)
lnl = LearningWithNoisyLabels(clf=LogisticRegression()) 
lnl.fit(X=X_train_data, s=train_noisy_labels) 

$ ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Regarding my training data:

  • X_train_data is a (47,100) numpy array from TFIDF extractor
  • train_noisy_labels is a (47,) numpy array from LabelEncoder (multi-class setting)

I have also tried the XGBoost classifier (initially I intended to use XGBoost); in that case the training seems to finish correctly, but when I get the predictions with predict_proba, the predictions array contains only NaN values.
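A quick diagnostic sketch (plain numpy, nothing cleanlab-specific) to check whether the NaNs originate in the features themselves:

import numpy as np

print(np.isnan(X_train_data).any())  # any NaN in the TFIDF features?
print(np.isinf(X_train_data).any())  # any infinite values?
print(X_train_data.dtype, X_train_data.shape)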

Thanks in advance

Validation dataset

Hello, if confident learning is applied only to the training dataset, should the validation set be kept the same so the comparison is fair?

Assigned pulearning doesn't change the results much

From what I understand, if I assign pulearning=1 (in a binary classification problem), it should imply that the class of ones has no noise. Still, after training I get the following output for confident_joint, est_py, est_nm, and est_inv respectively:

[[ 3216. 1179.]
[16989. 14594.]]

[0.79313136 0.20686864]

[[0.15916852 0.07474799]
[0.84083148 0.92525201]]

[[0.73174061 0.53791597]
[0.26825939 0.46208403]]

Is there any other way to make sure that no data points from class 1 are considered noisy?
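One workaround that comes to mind (a hedged sketch, not an official cleanlab API) is to filter the flagged indices by the given label after the fact:

import numpy as np

# ordered_label_errors: indices returned by get_noise_indices
# s: the given (noisy) labels
ordered_label_errors = ordered_label_errors[s[ordered_label_errors] != 1]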

Thanks.

ValueError: operands could not be broadcast together with shapes (20000,9140) (401,)

Hello, I am using cleanlab to clean my dataset.
My dataset contains about 450,000 samples across 9,140 classes. When I use just the first 20,000 samples for cleaning, I get an error:

(20000,)
(20000, 9140)
/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in true_divide
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
File "clean_base_on_clean_lab.py", line 24, in
sorted_index_method='normalized_margin', # Orders label errors
File "/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/cleanlab/pruning.py", line 342, in get_noise_indices
multi_label=multi_label,
File "/opt/meituan/develop/lixianyang/miniconda3/lib/python3.7/site-packages/cleanlab/latent_estimation.py", line 337, in compute_confident_joint
psx_bool = (psx >= thresholds - 1e-6)
ValueError: operands could not be broadcast together with shapes (20000,9140) (401,)

My input s is a numpy.ndarray with shape (20000,),
and psx is a numpy.ndarray with shape (20000, 9140).
I don't understand why the error says the shapes (20000,9140) and (401,) cannot be broadcast together. Where does (401,) come from?
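A hedged diagnostic (my guess at the cause, not verified): cleanlab derives one threshold per class that actually appears in s, so comparing these two counts may explain the (401,):

import numpy as np

print(len(np.unique(s)))  # if this prints 401, only 401 of the 9140
                          # classes appear in the first 20000 labels
print(psx.shape[1])       # 9140 columns that the thresholds are broadcast against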

How can cleanlab deal with too many labels?

Hi, I'm using cleanlab to prune noisy labels in my dataset. However, I found cleanlab to be ineffective here. There are 3,900 classes in my dataset with almost 30 million examples, and the psx matrix is sparse. What can I do in this situation?

Multi-label Classification, getting Y class error.

Good day,
I am trying to use cleanlab to train a multi-label classifier, where my target classes are encoded using scikit-learn's MultiLabelBinarizer.
I have a total of five classes and 20,000 images for training.
I build up a scikit-learn BaseEstimator and wrap a Keras Resnet50 model inside it.
Now, when using LearningWithNoisyLabels from cleanlab.classification, I am getting the following error:

  File "c:/Users/ASUS/Desktop/cleanlab mangoes/clean_training.py", line 16, in <module>
    lnl.fit(X, y)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\cleanlab\classification.py", line 267, in fit
    assert_inputs_are_valid(X, s, psx)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\cleanlab\util.py", line 41, in assert_inputs_are_valid
    ensure_2d=False,
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\sklearn\utils\validation.py", line 807, in check_X_y
    y = column_or_1d(y, warn=True)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)
  File "C:\ProgramData\Anaconda3\envs\tf_gpu\lib\site-packages\sklearn\utils\validation.py", line 847, in column_or_1d
    "got an array of shape {} instead.".format(shape))
ValueError: y should be a 1d array, got an array of shape (25767, 5) instead.

my code used for training is as follow:

from sk_resnet import ResnetTrainer
from cleanlab.classification import LogReg
from cleanlab.classification import LearningWithNoisyLabels
import pandas as pd 

df = pd.read_csv('train_labels.csv')
values = df.values
X = values[:,0]
y = values[:,1:]


lnl = LearningWithNoisyLabels(clf=ResnetTrainer(batch_size=8, epochs=10))
lnl.fit(X, y)

I hope the authors of cleanlab can provide a simple example of how to use cleanlab for multi-label training.
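In the meantime, a hedged sketch of the multi-label entry point that does appear elsewhere in this tracker (get_noise_indices with multi_label=True, where s is a list of label lists rather than a binarized matrix; variable names are placeholders):

from cleanlab.pruning import get_noise_indices

# s: one list of class indices per example, e.g. [[0, 2], [1], [0, 1, 4], ...]
# psx: (n, K) matrix of out-of-sample predicted probabilities
ordered_label_errors = get_noise_indices(
    s=list_of_label_lists,
    psx=psx,
    multi_label=True,
)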

cross-validated train

@cgnorthcutt
Hello:
Thank you for your approach ^^, but I have a question for you.
Your approach uses 4-fold cross-validation, where every fold trains on the same number of classes, and at the end the 4 softmax.npy files are concatenated.
But for a given training set, is it OK to train on all the data and use the resulting softmax.npy, or will that cause problems? And what kind of data does your method work well on, or not?

Multilabel Scenario

How would get_noise_indices work in a multi-label scenario?
psx would remain the same, an n×m probability array. How should s be represented in this case?

May I set the sample_weight parameter manually?

Hi, it's a pretty nice lib. When I try to train a model with cleanlab, I find that I am not allowed to set sample_weight manually. It is clever to set the sample weights according to the ratio of positive and negative samples; however, I need to tune this parameter for better performance.
Thank you for your contribution to this lib.

Conceptually does it make sense to run gridsearchCV on models that are wrapped with your api?

Hi,

First of all, really like the paper and this implementation.

  1. If one wanted to tune the hyper-parameters of the classifier, is it reasonable to use, say, GridSearchCV on top of this wrapper? Say

GridSearchCV(estimator=LearningWithNoisyLabels(clf=LogisticRegression()), param_grid={})

Does this make sense, given that there could potentially be fewer samples N available to determine the noise?

  2. If you find an optimal hyper-parameter, is it guaranteed to still be optimal after the de-noising?

Thanks,

Will
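For what it's worth, since LearningWithNoisyLabels follows the scikit-learn estimator API, a non-empty grid would address the inner classifier's parameters with the clf__ prefix (a hedged sketch; the grid values are placeholders):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from cleanlab.classification import LearningWithNoisyLabels

search = GridSearchCV(
    estimator=LearningWithNoisyLabels(clf=LogisticRegression()),
    param_grid={'clf__C': [0.1, 1.0, 10.0]},  # clf__ routes to the inner LogisticRegression
    cv=3,
)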

number of mislabeled samples

Hello,
thank you for your contributions! I have a question:
how can I get more of the possibly mislabeled samples, with high probability?
Can I just enlarge frac_noise to a larger number? The number of returned indices does not seem to be linear in frac_noise.
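For reference, frac_noise is a keyword argument of get_noise_indices; a hedged sketch of setting it explicitly (values are illustrative):

from cleanlab.pruning import get_noise_indices

ordered_label_errors = get_noise_indices(
    s=noisy_labels,
    psx=predicted_probabilities,
    sorted_index_method='normalized_margin',
    frac_noise=1.0,  # fraction of the estimated per-class noise to return
)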

Errors found by cleanlab are mostly actually correct

I used the method from the tutorial:

ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=numpy_array_of_predicted_probabilities,
    sorted_index_method='normalized_margin',  # Orders label errors
)

The outputs that are supposed to be label errors are actually labeled correctly. What steps could I take to figure out the reason?

_compute_confident_joint_multi_label() fails for multilabel case with varying label counts per data point

Problem:

_compute_confident_joint_multi_label() produces a wrong count of unique labels K in line 192 when the labels consist of lists with varying label counts per data point. The problem lies in the function np.unique(), which only works if the lists of labels all have the same length.

Example:

s = [[1, 2], [3]]
np.unique(s)

>>> array([list([1, 2]), list([3])], dtype=object)

You can see that np.unique() counts unique lists instead of the entries in the lists. The desired output would be:

>>> array([1, 2, 3])

The problem lies in the fact that numpy fails at flattening the list of lists when the lists have varying lengths.

Proposed Solution:

Explicitly flatten s before passing it to np.unique():

s = [[1, 2], [2, 3]]
s_flat = [i for l in s for i in l]
np.unique(s_flat)

>>> array([1, 2, 3])

Using keras model for label noise prediction

I have built the custom class as described in the README:

# Imports needed by this snippet (not shown in the original):
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, GlobalMaxPool1D, Dense, Dropout
from keras.utils import to_categorical

class CustomKerasModel(Sequential):

    def __init__(self, name=None, max_features=20000, batch_size=128, epochs=10, validation_split=0.1):
        super(CustomKerasModel, self).__init__(name=name)
        self.add(Embedding(max_features, 128))
        self.add(Bidirectional(LSTM(32, return_sequences=True)))
        self.add(GlobalMaxPool1D())
        self.add(Dense(20, activation="relu"))
        self.add(Dropout(0.05))
        self.add(Dense(num_classes, activation="softmax"))  # num_classes assumed defined elsewhere
        self.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        self.batch_size = batch_size
        self.epochs = epochs
        self.validation_split = validation_split

    def fit(self, X, y):
        y_one_hot = to_categorical(y)
        # A Sequential subclass has no self.model attribute; call the parent class's fit
        return super(CustomKerasModel, self).fit(X, y_one_hot, batch_size=self.batch_size,
                                                 epochs=self.epochs, validation_split=self.validation_split)

    def score(self, X, y):
        return super(CustomKerasModel, self).evaluate(X, y, batch_size=128)[1]

and then to predict the label I am using

from cleanlab.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba
est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba(
    X=X,
    s=y,
    clf = CustomKerasModel(max_features=max_features,batch_size=batch_size, epochs=epochs, validation_split=0.1)
)

But it's not working; I am getting the following error:

~/anaconda3/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
    238     memo[id(x)] = y
    239     for key, value in x.items():
--> 240         y[deepcopy(key, memo)] = deepcopy(value, memo)
    241     return y
    242 d[dict] = _deepcopy_dict

~/anaconda3/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    167                     reductor = getattr(x, "__reduce_ex__", None)
    168                     if reductor:
--> 169                         rv = reductor(4)
    170                     else:
    171                         reductor = getattr(x, "__reduce__", None)

TypeError: can't pickle _thread.RLock objects

Could anyone please provide a solution to this problem, or the right way to use
a Keras model for label noise prediction?

what is the actual role of inv_noise_matrix?

I see that inv_noise_matrix and noise_matrix are calculated throughout the code, but I can't see them used anywhere in the final pruning function, which is the main purpose. I see that the indices to remove are calculated using the confident joint (cj), but cj is itself calculated from psx. inv_noise_matrix is even passed to get_noise_indices, yet I don't see it actually used inside that function. I understand this is also explained in the paper. Could you explain why it is calculated in the code even though it is not used in the code's main function? Maybe I'm missing something.

What format of s and psx should be inputted into get_noise_indices() in a multi-label scenario

I have 20 samples with multi-label and 5 classes, such as:
[[2, 3, 4], [1, 3, 4, 5], [1, 3, 4], [1, 2, 3, 4, 5], [2, 3, 5], [1, 2, 4], [1, 3, 4, 5], [1, 3], [1, 5], [1, 3, 4, 5], [2, 3, 4], [3, 4], [4], [1, 3, 4], [2, 3, 4, 5], [1, 4], [3, 4], [3, 5], [2, 3, 5], [2, 5]]
I passed this label list and a probability matrix as psx (shape=(20,5)) into get_noise_indices().
However, the error is:
File "C:\Users\Anaconda2\envs\tf18\lib\site-packages\cleanlab\pruning.py", line 342, in get_noise_indices
multi_label=multi_label,
File "C:\Users\Anaconda2\envs\tf18\lib\site-packages\cleanlab\latent_estimation.py", line 303, in compute_confident_joint
calibrate=calibrate,
File "C:\Users\Anaconda2\envs\tf18\lib\site-packages\cleanlab\latent_estimation.py", line 216, in _compute_confident_joint_multi_label
multi_label=True,
File "C:\Users\Anaconda2\envs\tf18\lib\site-packages\cleanlab\latent_estimation.py", line 121, in calibrate_confident_joint
confident_joint.T / confident_joint.sum(axis=1) * s_counts
ValueError: operands could not be broadcast together with shapes (5,5) (6,)

Is there something wrong with my inputs?
What formats of s and psx are correct in this multi-label scenario?
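A hedged guess at the (5,5) vs (6,) mismatch: cleanlab assumes integer class labels 0..K-1, so labels drawn from 1..5 may look like K=6 classes (0 through 5) to parts of the code. Re-indexing before the call might help:

# shift labels 1..5 down to 0..4 (hedged suggestion, untested)
s_zero_indexed = [[label - 1 for label in row] for row in s]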

Example notebooks throw errors

The notebooks 'iris_simple_example.ipynb' and 'classifier_comparison.ipynb' in the examples folder both throw an error when run.

The error is:

NameError: name 'psx' is not defined

It seems like the out-of-sample predicted probabilities (psx) are not being computed through cross-validation?

I am using Python 3.7.3 on Windows 10 with sklearn 0.21.2.

Return "margin" for false-classifications

It would be great if get_noise_indices() returned not only the ordered indices but also the ordering criterion (e.g., the margin), perhaps as a tuple, or via another function that returns both, so as not to break the API.

I am looking to convey a "confidence" that a label is a misclassification.

Sparse matrix support

Hi!
I was wondering why utils.assert_inputs_are_valid does not accept sparse matrices (such as scipy's csc/csr matrices) as input.
I understand that some models do not work well with sparse matrices, which could cause errors during training, but I think it's reasonable to leave that for the user to handle.
What are your thoughts?

image segmentation with tensorflow?

Hi,

Thanks for this excellent work and cleanlab.

When I try this library on my image semantic segmentation task with TensorFlow, I hit an error:

model = LearningWithNoisyLabels(clf=get_unet())

File "/usr/local/lib/python3.6/dist-packages/cleanlab/classification.py", line 178, in init
'The classifier (clf) must define a .predict_proba() method.')
ValueError: The classifier (clf) must define a .predict_proba() method.

Is there any advice?

thanks.
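One direction that might help (a hedged sketch, not an official cleanlab recipe): wrap the Keras model in a minimal scikit-learn-style estimator that exposes predict_proba. Here get_unet() is assumed to return a compiled Keras model whose output is a softmax over classes.

import numpy as np

class UnetWrapper:
    def __init__(self, model):
        self.model = model

    def fit(self, X, y, sample_weight=None):
        self.model.fit(X, y)
        return self

    def predict_proba(self, X):
        # Keras predict() already returns class probabilities for a softmax output
        return self.model.predict(X)

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)

model = LearningWithNoisyLabels(clf=UnetWrapper(get_unet()))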

Error when using LearningWithNoisyLabels()

Hey @cgnorthcutt ,

I am currently trying to improve my model's classification accuracy. I know for sure that my labels have 20% noise. I am using sklearn's RandomForestClassifier, and my data is ~600K samples with 500 features.

Environment Details :

  • Mac
  • Python 3.5.6
  • Sklearn 0.21.3

Code:

ln1 = LearningWithNoisyLabels(clf=RandomForestClassifier(class_weight='balanced'), seed=2)
ln1.fit(X=x_train.values, s=y_train['prediction'].values)

Error Faced :

TypeError                                 Traceback (most recent call last)
<ipython-input-7-2636b16750dc> in <module>
      3 
      4 ln1=  LearningWithNoisyLabels(clf=RandomForestClassifier(class_weight='balanced'),seed = 2)
----> 5 ln1.fit(X = x_train.values,s=y_train['prediction'].values)

/anaconda3/envs/cleanlabel/lib/python3.5/site-packages/cleanlab/classification.py in fit(self, X, s, psx, thresholds, noise_matrix, inverse_noise_matrix)
    296             confident_joint = self.confident_joint,
    297             prune_method = self.prune_method,
--> 298             converge_latent_estimates = self.converge_latent_estimates,
    299         ) 
    300 

TypeError: get_noise_indices() got an unexpected keyword argument 'converge_latent_estimates'

Exploring the code, it seems the function get_noise_indices() has no such argument.

get_noise_indices parameter requirements?

Hi,

I'm stuck using pruning.get_noise_indices to find label errors in my dataset. I have a set of image data, 392 images, and I calculated its psx using my model, with the shape (372, 5026). (I didn't do cross-validation, but I think that's not the problem for now?) y is just a vector of shape (372,) containing the labels of the images (all 0).

Then if I use psx and y as input, I get the following error:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/Users/zijingwu/Library/Python/3.7/lib/python/site-packages/cleanlab/pruning.py", line 170, in _prune_by_count
if s_counts[k] <= MIN_NUM_PER_CLASS: # No prune if not MIN_NUM_PER_CLASS
IndexError: index 948 is out of bounds for axis 0 with size 1
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/zijingwu/PycharmProjects/inception/cleaner.py", line 302, in
run_detection()
File "/Users/zijingwu/PycharmProjects/inception/cleaner.py", line 292, in run_detection
ordered_label_errors = cleanlab.pruning.get_noise_indices(y_test, psx, prune_method=prune_method)
File "/Users/zijingwu/Library/Python/3.7/lib/python/site-packages/cleanlab/pruning.py", line 419, in get_noise_indices
noise_masks_per_class = p.map(_prune_by_count, range(K))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
IndexError: index 948 is out of bounds for axis 0 with size 1

I had a similar shape-mismatch error with another dataset before. However, if I reshape psx to have the same number of columns as the number of unique values in y, the method works, despite giving incorrect results. So I'm wondering: what shape should psx be? From my understanding of the cleanlab paper, (372, 5026) should be the correct shape.
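A hedged reading of the IndexError above: the per-class counts (s_counts) are built from the labels, so a label vector with a single unique value yields s_counts of size 1, which index 948 then overruns:

import numpy as np

print(len(np.unique(y)))  # 1 here, since every label is 0
print(psx.shape[1])       # 5026; psx should be (n, K), with K equal to the
                          # number of classes the labels can actually take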

Thanks in advance for any help! :)

'psx' is not defined in get_noise_indices() - issue for WINDOWS python users

This is my code:

if __name__ == '__main__':
    .
    .
    .
    est_py, est_nm, est_inv, confident_joint, my_psx=estimate_py_noise_matrices_and_cv_pred_proba(
    X=X_train,
    s=train_labels_with_errors,
    clf = GaussianNB()
    )
    label_errors = get_noise_indices(train_labels_with_errors,my_psx,verbose=1)

I'm still getting this error, even when psx is declared as global in pruning.py:

Traceback (most recent call last):
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\site-packages\cleanlab\pruning.py", line 109, in _prune_by_count
    noise_mask = np.zeros(len(psx), dtype=bool)
NameError: name 'psx' is not defined


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\Jacopo\.vscode\extensions\ms-python.python-2019.11.49689\pythonFiles\ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "c:\Users\Jacopo\.vscode\extensions\ms-python.python-2019.11.49689\pythonFiles\lib\python\old_ptvsd\ptvsd\__main__.py", line 432, in main
    run()
  File "c:\Users\Jacopo\.vscode\extensions\ms-python.python-2019.11.49689\pythonFiles\lib\python\old_ptvsd\ptvsd\__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\Users\Jacopo\Google Drive\TesiLulli\MachineLearning_Python\3classi\3 classi no BNP\CleanLab\dirty.py", line 28, in <module>
    label_errors = get_noise_indices(train_labels_with_errors,my_psx,verbose=1)
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\site-packages\cleanlab\pruning.py", line 336, in get_noise_indices
    noise_masks_per_class = p.map(_prune_by_count, range(K))
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\Jacopo\AppData\Local\Programs\Python\Python37\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
NameError: name 'psx' is not defined

Model (classifier) Selection if true label is unknown

I really like that cleanlab can be used with any classifier and any dataset distribution. My question: in a real-world scenario, the label noise is unknown, so one cannot really compute accuracy scores (as in your classifier-comparison example); how would one choose which classifier to use? Due to the unknown error rate, it is hard to compare model performance using the traditional binary-classification metrics: F1, AUC, etc.
Do you have any advice on that? Thank you very much!

What is the correct format of s?

I have 14312 training examples, and the number of classes is 14. We know that the format of psx is n*m, where n is the number of examples and m is the number of classes; thus the shape of my psx numpy array is (14312, 14).
I don't know the expected format of s. I guess it should be (14312,), and my s numpy array looks like:
['6' '2' '10' ... '5' '0' '4']
But when I tried to run the following code:

ordered_label_errors = get_noise_indices(
    s=noisy_labels,
    psx=predicted_probabilities,
    prune_method='prune_by_class')

I got the following error:
Traceback (most recent call last):
File "run.py", line 18, in
prune_method='prune_by_class')
File "/nethome/shao42/.local/lib/python3.5/site-packages/cleanlab/pruning.py", line 342, in get_noise_indices
multi_label=multi_label,
File "/nethome/shao42/.local/lib/python3.5/site-packages/cleanlab/latent_estimation.py", line 357, in compute_confident_joint
confident_joint = calibrate_confident_joint(confident_joint, s)
File "/nethome/shao42/.local/lib/python3.5/site-packages/cleanlab/latent_estimation.py", line 121, in calibrate_confident_joint
confident_joint.T / confident_joint.sum(axis=1) * s_counts
ValueError: operands could not be broadcast together with shapes (0,0) (14,)

So I don't know what is wrong. Is it due to the format of s?
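One thing worth checking (a hedged guess): the labels shown are strings, and cleanlab expects integer class labels in 0..K-1; string labels silently failing the internal comparisons would be consistent with the empty (0,0) confident joint:

import numpy as np

noisy_labels = np.asarray(noisy_labels).astype(int)  # '6' -> 6, etc.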

Can we detect an image that does not have a true class?

Suppose I'm classifying cats and dogs, but in the training data there are sometimes images of a bird incorrectly labeled as cat or dog. A bird image cannot be labeled as dog or cat; both are wrong. The only course of action is to delete the image; you cannot fix the label.

Would cleanlab be able to detect bird images in a dataset of cats and dogs?

One mistake in an algorithm in the paper

This is excellent work.
In Algorithm 1 of Section C ("The confident joint and joint algorithms") of the paper, the code in part 2 reads:

for i := 1 to m do

This may be a mistake for:

for i := 1 to n do

Suggestion: Memory consumption / time complexity in README

Hi, first of all thanks for open-sourcing!

I tried running this on a production dataset with ~200k examples classified into ~5k categories. I'd like to identify labeling errors with the simple example given in the README, ordered_label_errors = get_noise_indices(...). Although I am on a ~50 GB machine, I run out of memory.

It would be great if you could write something in the README about the memory and time complexity of the method, so that users can get a rough understanding of what it can and cannot be used on.

Also, I would appreciate some info for my example: is the number of categories the problem here?
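As a rough back-of-envelope (my own estimate, not from the README):

n, K = 200_000, 5_000
print(n * K * 8 / 1e9)  # ~8 GB just for a float64 psx matrix
print(n * K * 1 / 1e9)  # ~1 GB more for each same-shaped boolean mask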

ValueError: operands could not be broadcast together with shapes

Dataset - UCI skin segmentation https://www.openml.org/d/1502
uci_skin_segmentaion.csv.zip

Error -

File "/Users/naveen/anaconda3/lib/python3.6/site-packages/cleanlab/latent_estimation.py", line 687, in estimate_py_noise_matrices_and_cv_pred_proba
    seed = seed,
  File "/Users/naveen/anaconda3/lib/python3.6/site-packages/cleanlab/latent_estimation.py", line 593, in estimate_confident_joint_and_cv_pred_proba
    return_list_of_converging_cj_matrices = return_list_of_converging_cj_matrices,
  File "/Users/naveen/anaconda3/lib/python3.6/site-packages/cleanlab/latent_estimation.py", line 319, in estimate_confident_joint_from_probabilities
    confident_joint = calibrate_confident_joint(confident_joint, s, psx)
  File "/Users/naveen/anaconda3/lib/python3.6/site-packages/cleanlab/latent_estimation.py", line 111, in calibrate_confident_joint
    calibrated_cj = (confident_joint.T / confident_joint.sum(axis=1) * s_counts).T
ValueError: operands could not be broadcast together with shapes (2,2) (3,)

Batch-based noisy label estimation

Thanks for sharing such a good tool!

I have a quite large dataset (10M samples) with 100K classes, so the probability matrix for the whole dataset would take around 4 TB and won't fit into my machine's RAM. Does it make sense to split the dataset into batches and run the algorithm batch by batch?
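The arithmetic checks out (my own sanity check, assuming float32 probabilities):

n, K = 10_000_000, 100_000
print(n * K * 4 / 1e12)  # 4.0 TB for the full float32 psx matrix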

In LearningWithNoisyLabels(), does it train only on confident examples (by dropping the label errors found by cleanlab)?

Dear Curtis,

I am a bit confused: does LearningWithNoisyLabels() train only on confident examples, i.e., by dropping the label errors found by cleanlab?

I see that in your state-of-the-art CIFAR-10 application, cleanlab is used to find the label errors in CIFAR-10, and then the errors are removed and the model is trained on the cleaned data via Co-Teaching...

I hope to understand your approach better. Thank you very much!

Question about the paper found on arXiv

Dear:
I'm sorry, but I am confused. When I read the paper "Confident Learning: Estimating Uncertainty in Dataset Labels", in Section 3.1, why is the confusion matrix C defined given y_k and the predictions argmax(expr)? Is it just like a normalized pivot table?

Available for Semantic Segmentation?

Hi, this is obviously a great Python package, and I'm doing research on pathological image segmentation with noisy labels. Is it suitable for my research?
