
Comments (17)

lukeyeager commented on May 13, 2024

I'm not sure I have enough information to help you. I'm not saying it couldn't be a bug in DIGITS, but it certainly sounds like you may have set up your datasets incorrectly. Can you give me some more detail on how you created your dataset? I assume that you've looked at the instructions here?

Dezmon commented on May 13, 2024

I looked over the instructions and have run both examples, LeNet on MNIST and AlexNet on ImageNet data, and they worked the way they are supposed to.

I'm happy to believe it is me, but something does seem off with the accuracy reported. I have image sequences (medical, video-like, all very similar in appearance), and each sequence is labeled as class one or two. I split the sequences into a training and a validation group (half in each). Finally, the sequences are split into individual frames, and these are my images. Images in the training set come strictly from training sequences, and images in the validation set come strictly from validation sequences.
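
For reference, my split logic is roughly the following (a simplified sketch, not my actual script; the `Sequence` objects with a `.frames` list are stand-ins for how my data is really stored):

```python
import random

def split_by_sequence(sequences, val_fraction=0.5, seed=0):
    """Assign whole sequences to train or val, so that frames from the
    same sequence (which look nearly identical) never end up in both
    sets."""
    seqs = list(sequences)
    random.Random(seed).shuffle(seqs)
    n_val = int(len(seqs) * val_fraction)
    val_seqs, train_seqs = seqs[:n_val], seqs[n_val:]
    train_frames = [f for s in train_seqs for f in s.frames]
    val_frames = [f for s in val_seqs for f in s.frames]
    return train_frames, val_frames
```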

I guess my issue boils down to this: after training, the DIGITS graph shows 100% accuracy, but when I test individual images from the validation set I get misclassifications.

Update: I have now been running with a batch size of 1 as opposed to 'default', and so far training looks a lot more normal (i.e. it is not converging yet, the validation loss is effectively constant, the training loss is oscillating, and accuracy is a little over 50%). Maybe this is related to having a large N compared to the number of classes?

lukeyeager commented on May 13, 2024

maybe this is related to having a large N compared to the number of classes?

That does seem to make sense. Let me know if that fixes the problem for you - I could add a check to save others from this problem in the future.

something does seem off with the accuracy reported

I agree. I'll try to look into it. Hopefully today, if I find the time.

Dezmon commented on May 13, 2024

Great, thank you. I won't have a result with my data for a couple of days, but I can also try to reproduce the problem with a subset of the ImageNet data if that would be helpful.

drozdvadym commented on May 13, 2024

@Dezmon
are you training the network with images of the same size (for example, 256x256)?

Dezmon commented on May 13, 2024

Yes, the images get scaled/padded to 256x256, the same as with the ImageNet data. I am relying on DIGITS to do that for me, but it works fine with the ImageNet JPGs (and when I test single images from my data), so I'm assuming that part is OK.
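
For what it's worth, my mental model of that scale/pad step is something like this Pillow sketch (my own approximation, not DIGITS's actual resize code):

```python
from PIL import Image

def to_square(path, size=256, pad=False):
    """Approximate the two resize modes: 'squash' rescales directly to
    size x size; 'pad' preserves the aspect ratio and fills the rest of
    a square canvas with black."""
    img = Image.open(path).convert('RGB')
    if not pad:
        return img.resize((size, size), Image.BILINEAR)
    img.thumbnail((size, size), Image.BILINEAR)  # resize in place, keep aspect
    canvas = Image.new('RGB', (size, size))      # black background
    w, h = img.size
    canvas.paste(img, ((size - w) // 2, (size - h) // 2))
    return canvas
```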

lukeyeager commented on May 13, 2024

Hmm, I may have messed up my data manipulation for testing somewhere (see BVLC/caffe#2255). I'll get back to you on this.

lukeyeager commented on May 13, 2024

@Dezmon, I just upgraded to a newer version of caffe and changed the way that I do image preprocessing. Will you upgrade your DIGITS and NVIDIA/caffe installations, and then see if that fixes the issue for you?
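
In the meantime, if you want to sanity-check a single image outside of DIGITS, the important thing is to apply the same transforms at test time that the training database saw. Roughly, in pycaffe (the file names and the 'prob' output blob name here are placeholders for whatever your model uses):

```python
import caffe
import numpy as np

net = caffe.Net('deploy.prototxt', 'snapshot.caffemodel', caffe.TEST)

# Mirror the training-time preprocessing: per-channel mean subtraction,
# HWC -> CHW transpose, [0,1] -> [0,255] scaling, RGB -> BGR swap.
mu = np.load('mean.npy').mean(1).mean(1)  # assumes the binaryproto mean
                                          # was converted to .npy first
t = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
t.set_transpose('data', (2, 0, 1))
t.set_mean('data', mu)
t.set_raw_scale('data', 255)
t.set_channel_swap('data', (2, 1, 0))

img = caffe.io.load_image('some_validation_frame.png')  # placeholder path
net.blobs['data'].data[...] = t.preprocess('data', img)
prob = net.forward()['prob'][0]
print(prob.argmax(), prob.max())
```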

Dezmon commented on May 13, 2024

I did a fresh install and build of Caffe (NVIDIA's fork) and DIGITS, and rebuilt the dataset from the raw PNGs, and I'm still seeing the problem. That said, I think it has to do with batch size (and some lack of basic understanding on my part). With batch sizes other than the network default I get very different behavior; mostly the loss explodes, which is not a great outcome but at least is understandable.

[screenshot: DIGITS training plot]

thatguymike commented on May 13, 2024

Can you post your DB build page with the distribution of classes? This looks like an overfitting problem of some sort.

Dezmon commented on May 13, 2024

Sure, I'll post it, and yes, the data is a little unbalanced. But how would overfitting drive the validation accuracy to 100% and the validation loss to 0? Training I would understand, but validation doesn't make sense to me. I am new to ML, though, and open to suggestions.

I have now reproduced it with ImageNet data. I took two classes from the full dataset (n04404412 and n04409515), which are separated into training (2,600) and test (26) directories. I trained using the defaults for AlexNet and got the same behavior: 100% accuracy shown, but testing individual images from the validation directory gives mixed results.

Here is the training plot for the two-class ImageNet data:
[screenshot: two-class ImageNet training plot]

Here is my data:
[screenshot: dataset class distribution]

thatguymike commented on May 13, 2024

That is a tiny amount of data for a pretty large network; you are going to overfit quickly. Better would be to attempt fine-tuning from a fully trained network. More importantly, you have a TINY number of validation images; generally we shoot for >10%, more like 25%, of the number of training images.
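
The fine-tuning itself is cheap to try: initialize the solver's net from an existing caffemodel before training. A rough pycaffe sketch (the paths are placeholders, and you would also rename the final layer in your train prototxt and set its num_output to your two classes so it gets fresh weights):

```python
import caffe

solver = caffe.SGDSolver('solver.prototxt')      # placeholder path
# Layers whose names match the pretrained model are initialized from it;
# renamed layers (e.g. the new final classifier) start from scratch.
solver.net.copy_from('bvlc_alexnet.caffemodel')  # placeholder path
solver.solve()
```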

Dezmon commented on May 13, 2024

Hi Mike, the little set was just to show another example of the problem, so it could be reproduced on a standard image set.

For my data (the first training plot in this thread) I am using 22,838 training images and 25,549 validation images (much more than 25%) per class. Is this still a tiny amount of data? I thought ImageNet used only 1,000 per class.

thatguymike commented on May 13, 2024

That should be working better. My hunch is still that you are overfitting your data. ImageNet has ~1,000 images per class, but ~1.2M base training images. Still, I would expect better performance. For example, we have taken Pascal VOC crops and trained on those from scratch successfully, but starting from a pretrained AlexNet/CaffeNet network does produce better overall results.

Let's look at batch sizes and learning rates carefully. Generally, if you change the batch size you also need to adjust your learning rate and decays. Alex Krizhevsky talks about this in Section 5 of his "One Weird Trick" paper.

Still, you are getting high training accuracy, and your two loss curves look correct. Your training and validation sets have no overlap in samples, correct?
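
To make the batch-size point concrete, here is the heuristic as I read Section 5 (my sketch, not something DIGITS applies for you; the paper suggests multiplying the learning rate by sqrt(k) when the batch size changes by a factor of k, though some people use linear scaling instead):

```python
import math

def scaled_lr(base_lr, base_batch, new_batch, rule='sqrt'):
    """Scale the learning rate when the batch size changes by a factor
    k = new_batch / base_batch. 'sqrt' follows the sqrt(k) heuristic from
    Krizhevsky's "One weird trick" paper; 'linear' is the alternative
    linear-scaling rule."""
    k = new_batch / float(base_batch)
    return base_lr * (math.sqrt(k) if rule == 'sqrt' else k)

# e.g. AlexNet's default batch size of 256 at lr 0.01, run at batch size 1:
print(scaled_lr(0.01, 256, 1))  # 0.01 * sqrt(1/256) = 0.000625
```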

Dezmon commented on May 13, 2024

That is correct. I'm not expecting this to work all that well, and I'll add a lot more data once I understand the tools a little better. I'm just trying to figure out the high reported accuracy and the subsequent poor single-image prediction performance. I will re-read his paper.

Dezmon commented on May 13, 2024

You are correct: adjusting the learning rate up by his recommended sqrt(k) gives very different network performance (exploding training loss, woohoo :/ ). Should I close this, since the problem appears to come up only with pathologically un- or under-trained networks that for some reason report low validation loss and 100% accuracy?

Thank you both very much for your help.

lukeyeager commented on May 13, 2024

No problem. I'll look into handling the learning rate vs. batch size adjustment automatically in the future.

