
Comments (16)

cgnorthcutt commented on May 20, 2024

I see. Indeed, that's the issue. Your noisy label vector s should contain all the classes represented in your psx predicted probability matrix, i.e., it should also pass these assertions:

import numpy as np

# s must contain one of every class that psx has a column for.
assert len(np.unique(s)) == np.shape(psx)[1]  # number of classes match
assert len(s) == len(psx)                     # one row of psx per label

In other words, the number of classes in s should equal the number of columns in psx; otherwise the meaning of psx's columns is ambiguous.

from cleanlab.

cgnorthcutt commented on May 20, 2024

Hi @alskdwq , thanks for your question. The shape of psx should be (number of examples, num_classes). So if you have 392 images (what you typed above), then your psx should be shape (392, num_classes), not (372, num_classes).

This bit of code:

if s_counts[k] <= MIN_NUM_PER_CLASS:

prevents you from pruning ALL the examples in one of your classes. This only happens when something is terribly wrong: your model is producing bad probabilities, or you didn't get out-of-sample predicted probabilities.

Also, in your case num_classes >> num_examples. Are you training from scratch? If so, your problem is under-specified: there isn't enough data for that many classes. If you are fine-tuning a pre-trained model, then you should be okay.

Finally, how did you get your predicted probabilities if you didn't use cross-validation? If you are trying to get the predicted probabilities for a holdout set, and you trained on a different set (or used a fine-tuned model), then you're fine. But if you trained on the set you intend to clean, that won't work: your predicted probabilities have been tuned to minimize loss on those examples, so they aren't accurate at all.
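For reference, out-of-sample predicted probabilities are usually obtained with cross-validation, so each row of psx comes from a model that never trained on that example. A minimal sketch with scikit-learn (toy data and a placeholder classifier, not the poster's actual model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))     # toy feature matrix
s = rng.integers(0, 3, size=120)  # toy noisy labels for 3 classes

# Each row of psx is predicted by a fold model that held out that example.
psx = cross_val_predict(
    LogisticRegression(max_iter=1000), X, s,
    cv=5, method="predict_proba",
)
assert psx.shape == (len(s), len(np.unique(s)))
```

Any classifier exposing predict_proba works here; the key point is that psx is never produced by a model that already fit the examples being cleaned.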


alskdwq commented on May 20, 2024

Hi @cgnorthcutt ,

I can't retrain my model at the moment, but I'm very interested in your paper, thanks for sharing! Also thanks for the second piece of advice; it's much easier than the solution I had in mind, and I have the algorithm working now. All label errors were found, it's really cool!


tbass134 commented on May 20, 2024

I had a similar problem, my original labels were [1,4]. Converting the labels to be [0,1] fixed this issue for me.


alskdwq commented on May 20, 2024

Hi @cgnorthcutt , thanks for replying!
My apologies in advance: I wrote the number of images wrong. It should be 392 images, consistent with the shapes that follow.

I'm fine-tuning a pretrained model, so I'm just using the output the model returned to me; I didn't do cross-validation on the model output. Do you mean the error is caused by having too little data for that many classes, or by having some bad probabilities in psx, or both? The dataset I'm using was used to train my model; I'm not sure if this could be the cause, since you mentioned out-of-sample probabilities?

Thanks


cgnorthcutt commented on May 20, 2024

If you fine-tuned on the dataset (trained on it) prior to getting the predicted probabilities, then they are not held-out predicted probabilities. Try printing some examples out and see if they make sense. Either the probabilities are highly biased due to training, or there are no label errors to find.

What is the accuracy of your model on the current labels?


alskdwq commented on May 20, 2024

Hi @cgnorthcutt ,

The accuracy is 93%. I checked the output and it seems normal to me. There are label errors present, i.e., images that were deliberately mislabeled for testing purposes, and the model detects them correctly.

Do you think using a smaller number of classes would prevent the problem?


cgnorthcutt commented on May 20, 2024

Can you please share some of the predicted probabilities for some of the examples you know are label errors? In particular, what is the argmax label, and what is the predicted probability of the given label?


alskdwq avatar alskdwq commented on May 20, 2024

@cgnorthcutt

argmax label | p(argmax)  | p(class 174)
-------------|------------|--------------
176          | 0.7955323  | 0.0021725714
176          | 0.5158586  | 0.0036430184
1630         | 0.4700611  | 0.00039289624
176          | 0.5621327  | 0.07247438
176          | 0.7427883  | 0.003874809
176          | 0.6412345  | 0.0067541124
176          | 0.56932026 | 0.009997657
2191         | 0.94495004 | 1.6172016e-05
176          | 0.778387   | 0.0026943416
170          | 0.43431783 | 0.087170854
176          | 0.77016187 | 0.0041504777
0            | 0.36024565 | 0.00020584161
176          | 0.82254297 | 0.002936906
176          | 0.7759305  | 0.0054184226
166          | 0.8750435  | 0.0010437327
176          | 0.48362723 | 0.07118266
175          | 0.22730595 | 0.17336911
166          | 0.9755789  | 2.1407172e-05
170          | 0.8209886  | 0.0005263921
166          | 0.94358516 | 1.8767782e-05
176          | 0.8548886  | 0.0042036837
176          | 0.7480143  | 0.0022242914
176          | 0.68197525 | 0.0031096272
176          | 0.68588895 | 0.004278623
176          | 0.63312715 | 0.0040740413
166          | 0.95521224 | 6.281376e-05

These are some of the label errors in the dataset. All of them were manually labeled as class 174 (in reality they are not). The first column is the model's predicted (argmax) label, the second is the probability of that prediction, and the third is the predicted probability of the given class, 174.


cgnorthcutt commented on May 20, 2024

Okay, these look great. So the only other place I can think something might have gone wrong is how you set up your input.

Make sure that your noisy label vector, s, has the following property:

import numpy as np

# Class labels must be exactly 0, 1, ..., num_classes - 1, with no gaps.
assert all(np.unique(s) == np.arange(len(np.unique(s))))

In a future version of cleanlab, I'll handle this sort of thing internally, but for now your labels must be formatted 0, 1, 2, ..., num_classes - 1.

So labels [0, 1, 2, 1, 2] would be okay if you had 3 classes.
But labels [0, 1, 3, 1, 0], would not be okay if you had 3 classes.
Labels [0, 1, 3, 1, 2], would be okay though (if you had 4 classes).
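For labels that don't already satisfy this (e.g. tbass134's [1, 4] above), numpy's np.unique with return_inverse gives the remap in one line. A small sketch, where s_raw is a made-up noisy-label vector:

```python
import numpy as np

s_raw = np.array([1, 4, 4, 1, 4])  # labels like tbass134's [1, 4]
classes, s = np.unique(s_raw, return_inverse=True)
# classes -> [1, 4]; s -> [0, 1, 1, 0, 1]

# Now s satisfies cleanlab's expected 0..K-1 format.
assert all(np.unique(s) == np.arange(len(np.unique(s))))
# After cleaning, recover the original label values with classes[s].
```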


alskdwq commented on May 20, 2024

Hi @cgnorthcutt ,

My label vector passed the assertion; it's a vector made up of all zeros, even though there are 5026 classes. (I think this is fine, right?)


alskdwq commented on May 20, 2024

I see, thank you so much for the answer!


cgnorthcutt commented on May 20, 2024

@alskdwq Are you able to use the same vector of noisy labels that you used to get psx? That's the natural workflow. If for some reason you aren't able to do that and must use only the vector of zero labels, I'd love to hear why, so I can understand whether that's a use case cleanlab needs to support in the future. But hopefully this unblocks you; please confirm.


alskdwq commented on May 20, 2024

My scenario is this: I have a small dataset of roughly 400 manually labeled images. Most of these images are labeled (and in reality are) class 174, but a small fraction are not 174; because they look similar, they were also labeled 174 by mistake. The model I'm using is a deep learning model trained to classify over 5000 classes. So in my case it's hard to change the noisy label vector. My workaround is probably to add more dummy data to the dataset to cover the model's predicted classes for the erroneous labels, then extract the covered classes' columns from the full output matrix and use that as psx.

IMO, it would be nicer if cleanlab supported different psx and label-vector sizes, as this would really add to its flexibility. But anyway, cleanlab is an awesome tool, and I really appreciate your work!


cgnorthcutt commented on May 20, 2024

@alskdwq Thank you for sharing. Your case is a special one of great interest to me: all your noise is in one class.

If you are able to re-train your model: a neat solution in your case is to change all your labels to 0 or 1: 1 if the label is 174 and 0 otherwise. Now you have a simple binary classification task, and for each example you can see whether it's incorrectly labeled as 174 when it's actually not 174, and also whether it's incorrectly labeled as not-174 when it actually is. Then you can clean your dataset, reset back to the original labels with all 5000 classes, and train your final model. I've published work related to this (Northcutt, Lu, Chuang, UAI 2017): https://arxiv.org/pdf/1705.01936.pdf. Feel free to reach out privately if you'd like to discuss collaboration.
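The relabeling step above is a one-liner in numpy. A sketch with made-up label values (only 174 comes from the discussion):

```python
import numpy as np

s_full = np.array([174, 12, 174, 4999, 174])  # toy 5000-class noisy labels
s_binary = (s_full == 174).astype(int)        # 1 = labeled 174, 0 = anything else
# Clean using s_binary, then restore s_full before the final training run.
```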

If you cannot re-train your model and must use the pretrained model with 5000 classes: an easy solution is to remove all the columns in psx except 174 and sum those probabilities to form a new column. Now you have a two-column psx, with column 0 being the sum of all columns except 174 and column 1 being the 174 column. Then all you need to do is add around 20 examples of the other class to your all-zeros label vector (change all the existing zeros to 1, and give the dummy examples label 0).
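A sketch of that two-column reduction, using toy random probabilities in place of the real model output (the shapes and the Dirichlet draw are illustrative assumptions, not the poster's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_classes, target = 400, 5000, 174
psx_full = rng.dirichlet(np.ones(n_classes), size=n_examples)  # rows sum to 1

p_target = psx_full[:, target]            # probability of class 174
p_rest = psx_full.sum(axis=1) - p_target  # sum of every other column
psx = np.column_stack([p_rest, p_target]) # column 0: not-174, column 1: 174

# Labels: the examples originally labeled 174 become class 1; dummy examples
# known to be not-174 (added separately) would get class 0.
s = np.ones(n_examples, dtype=int)
```

The two-column psx still sums to 1 per row, so it behaves like the output of a binary classifier.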


jwmueller commented on May 20, 2024

Closing this issue due to inactivity. Feel free to re-open if you still have questions!

The latest version of cleanlab prints more informative error messages when inputs are malformatted.
You can also check out our new FAQ section, which clearly explains the supported input formats.

