Comments (16)
I see. Indeed that's the issue. Your noisy label vector s should contain all the labels represented by your psx predicted probability matrix, i.e. they should also pass these assertions:

import numpy as np
# the number of distinct labels in s must equal the number of columns in psx
assert len(np.unique(s)) == np.shape(psx)[1]
# one row of predicted probabilities per example
assert len(s) == len(psx)

In other words, the number of classes in s should be the same as the number of classes in psx, otherwise the meaning of psx may be ambiguous.
Hi @alskdwq, thanks for your question. The shape of psx should be (number of examples, num_classes). So if you have 392 images (what you typed above), then your psx should have shape (392, num_classes), not (372, num_classes).
This bit of code here:

if s_counts[k] <= MIN_NUM_PER_CLASS:

prevents you from pruning ALL the examples in one of your classes. This only happens when something is terribly wrong -- your model is producing bad probabilities, or you didn't get out-of-sample predicted probabilities.
Also, in your case num_classes >> num_examples. Are you training from scratch? If so, your problem is under-specified -- not enough data for that many classes. If you are fine-tuning a pre-trained model, then you should be okay.
Finally, how did you get your predicted probs if you didn't use cross-validation? If you are trying to get the predicted probs for a held-out set, and you trained on a different set (or used a fine-tuned model), then you're fine. But if you trained on the set you intend to clean, that won't work: your predicted probabilities have been tuned to minimize loss on those very examples, so they aren't accurate at all.
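For reference, here is a minimal sketch of getting out-of-sample predicted probabilities via cross-validation, using scikit-learn's cross_val_predict (the toy data and LogisticRegression are just stand-ins for your real features, labels, and model):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Stand-ins for your real features X and noisy labels s.
X, s = make_classification(n_samples=500, n_classes=3, n_informative=5)

# Each row of psx is predicted by a model that never saw that example
# during training, which is what cleanlab expects.
psx = cross_val_predict(LogisticRegression(max_iter=1000), X, s,
                        cv=5, method="predict_proba")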
Hi @cgnorthcutt,
I can't retrain my model at the moment, but I'm very interested in your paper -- thanks for sharing! Also, thanks for the second piece of advice; it's much easier than the solution I had in mind, and I have the algorithm working now. All the label errors were found -- it's really cool!
I had a similar problem; my original labels were [1, 4]. Converting the labels to [0, 1] fixed this issue for me.
Hi @cgnorthcutt, thanks for replying!
My apologies; I wrote the number of images wrong above -- it should be 392 images, consistent with the shapes that follow.
I'm fine-tuning a pretrained model and just using the probabilities the model returns, so I didn't do cross-validation on the model output. Do you mean the error is caused by having too little data for that many classes, by having some bad probabilities in psx, or both? The dataset I'm using was used to train my model; could that be the cause, since you mentioned out-of-sample probabilities?
Thanks
If you fine-tuned on the dataset (trained on it) prior to getting the predicted probabilities, then they are not held-out predicted probabilities. Try printing some examples out and see if they make sense. Either the probabilities are highly biased due to training, or there are no label errors to find.
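For example, something like this (assuming s and psx as defined above; the loop bound is arbitrary):

import numpy as np

# Print a few rows of psx next to the given labels for a sanity check.
for i in range(10):
    pred = int(np.argmax(psx[i]))
    print(f"given: {s[i]}  argmax: {pred}  "
          f"p(argmax): {psx[i, pred]:.3f}  p(given): {psx[i, s[i]]:.3f}")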
What is the accuracy of your model on the current labels?
Hi @cgnorthcutt,
The accuracy is 93%. I checked the outputs and they seem normal to me. There are label errors present, i.e. images that were deliberately mislabeled for testing purposes, which the model classifies correctly.
Do you think using a smaller number of classes would prevent the problem?
Can you please share the predicted probabilities for some of the examples you know are label errors? In particular, what is the argmax label, and what is the predicted probability of the given label?
@cgnorthcutt
argmax label    p(argmax)      p(class 174)
176             0.7955323      0.0021725714
176             0.5158586      0.0036430184
1630            0.4700611      0.00039289624
176             0.5621327      0.07247438
176             0.7427883      0.003874809
176             0.6412345      0.0067541124
176             0.56932026     0.009997657
2191            0.94495004     1.6172016e-05
176             0.778387       0.0026943416
170             0.43431783     0.087170854
176             0.77016187     0.0041504777
0               0.36024565     0.00020584161
176             0.82254297     0.002936906
176             0.7759305      0.0054184226
166             0.8750435      0.0010437327
176             0.48362723     0.07118266
175             0.22730595     0.17336911
166             0.9755789      2.1407172e-05
170             0.8209886      0.0005263921
166             0.94358516     1.8767782e-05
176             0.8548886      0.0042036837
176             0.7480143      0.0022242914
176             0.68197525     0.0031096272
176             0.68588895     0.004278623
176             0.63312715     0.0040740413
166             0.95521224     6.281376e-05
These are some of the label errors in the dataset; all of them are manually labeled as 174 (in reality they are not). The first column is the model's predicted label for the image, the second is that prediction's probability, and the last is the predicted probability of class 174.
Okay, these look great. That means the only other place I can think something might go wrong is how you set up your input.
Make sure that your noisy labels vector, s, has the following property:

import numpy as np
assert all(np.unique(s) == np.arange(len(np.unique(s))))

In a future version of cleanlab, I'll handle this sort of thing internally, but for now your labels must be formatted 0, 1, 2, 3, ..., num_classes-1.
So labels [0, 1, 2, 1, 2] would be okay if you had 3 classes.
But labels [0, 1, 3, 1, 0] would not be okay if you had 3 classes.
Labels [0, 1, 3, 1, 2] would be okay though (if you had 4 classes).
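If your labels don't have this property, a quick remap with np.unique will produce it (a sketch; the example labels mirror the [1, 4] case mentioned above):

import numpy as np

s = np.array([1, 4, 4, 1, 4])  # labels that skip values, like the [1, 4] case above
classes, s_remapped = np.unique(s, return_inverse=True)
# classes    -> array([1, 4])            keep this to map cleaned labels back later
# s_remapped -> array([0, 1, 1, 0, 1])   consecutive labels 0 .. num_classes-1
assert all(np.unique(s_remapped) == np.arange(len(np.unique(s_remapped))))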
Hi @cgnorthcutt,
My label vector passed the assertion; it's a vector made up of all 0s, even though there are 5026 classes. (I think this is fine, right?)
I see, thank you so much for the answer!
@alskdwq Are you able to use the same vector of noisy labels that you used to get psx? That's the natural workflow, and if for some reason you aren't able to do that and must use only the vector of zero labels, I'd love to hear why, so I can understand if that's a use case that cleanlab needs to support in the future. But hopefully this unblocks you -- please confirm.
My scenario is that I have a small dataset of roughly 400 manually labeled images. Most of these images are labeled class 174 and really are class 174, but a small fraction are not 174 and, because they look similar, were also labeled 174 by mistake. The model I'm using is a deep learning model trained to classify over 5000 classes. So in my case it's hard to change the noisy label vector; my workaround is probably to add more dummy data to the dataset to cover the classes the model predicts for the mislabeled examples, then extract those classes' columns from the full output matrix and use that as psx.
IMO, it would be nicer if cleanlab could support different psx and label vector sizes, as this would really add to its flexibility. But anyway, cleanlab is an awesome tool and I really appreciate your work!
@alskdwq Thank you for sharing. Your case is a special one of great interest to me -- all your noise is in one class.
If you are able to re-train your model: a neat solution in your case is to change all your labels to 0 or 1: 1 if the label is 174 and 0 otherwise. Now you have a simple binary classification task, and for each example you can see whether it's incorrectly labeled as 174 when it's actually not 174, and also whether it's incorrectly labeled as not-174 when it actually is. Then you can clean your dataset, reset back to the original labels with all 5000 classes, and train your final model. I've published work related to this (Northcutt, Wu, Chuang, UAI 2017): https://arxiv.org/pdf/1705.01936.pdf. Feel free to reach out privately if you'd like to discuss collaboration.
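In code, the relabeling step might look like this (a sketch with made-up labels):

import numpy as np

labels = np.array([174, 174, 980, 174, 2301, 174])  # hypothetical original labels
# Collapse the 5000-class problem to "is it 174 or not".
s_binary = (labels == 174).astype(int)  # -> array([1, 1, 0, 1, 0, 1])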
If you cannot re-train your model and must use the pretrained model on 5000 classes: an easy solution is to remove all the columns in psx except for 174 and sum those probabilities to form a new column -- so now you have a two-column psx, with column 0 being the sum of all columns except 174, and column 1 being the 174 column. Then all you need to do is add 20 or so examples of the other class to your all-zeros label vector (change all the zeros to 1, and label the dummy examples 0).
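A minimal sketch of that construction, assuming psx is your full (n_examples, 5000+) matrix, dummy_indices is a hypothetical name for the positions of the added non-174 examples, and get_noise_indices is the cleanlab 1.x entry point used in this thread:

import numpy as np
from cleanlab.pruning import get_noise_indices  # cleanlab 1.x API

target = 174
p_target = psx[:, target]  # probability of class 174 for every example
# Column 0: summed probability of every class except 174; column 1: class 174.
psx_binary = np.column_stack([psx.sum(axis=1) - p_target, p_target])

# The real examples were all labeled 174 -> binary label 1;
# the ~20 added dummy examples of other classes -> binary label 0.
s_binary = np.ones(len(psx_binary), dtype=int)
s_binary[dummy_indices] = 0  # dummy_indices: hypothetical positions of the dummy examples

label_error_mask = get_noise_indices(s=s_binary, psx=psx_binary)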
Closing this issue due to lack of activity. Feel free to re-open if you still have questions!
The latest version of cleanlab prints more informative error messages when the inputs are malformed.
You can also check out our new FAQ section, which clearly explains the supported input formats.