Comments (26)

sguada commented on April 29, 2024

A test score of 0.001 means the network is guessing at random, i.e. it is not learning anything. Shuffling the training data is important; otherwise each batch could contain images of only one class.

In general the test score should be above 0.01 after approximately the first 5000 iterations. So if the loss doesn't decrease and the test score doesn't increase, that is a sign that your network is not learning.

You should check your prototxt files and your leveldb.
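
For reference, a minimal standalone sketch of shuffling a "<image path> <label>" list before building the leveldb (the filenames are placeholders, not part of the Caffe recipe):

// Minimal sketch (not part of Caffe): shuffle the "<path> <label>" lines of a
// training list so that consecutive leveldb entries mix classes.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main() {
  std::ifstream in("train.txt");            // placeholder input list
  std::vector<std::string> lines;
  for (std::string line; std::getline(in, line); ) {
    if (!line.empty()) lines.push_back(line);
  }
  std::shuffle(lines.begin(), lines.end(), std::mt19937(1234));
  std::ofstream out("train_shuffled.txt");  // placeholder output list
  for (const std::string& line : lines) {
    out << line << "\n";
  }
  std::cout << "Shuffled " << lines.size() << " entries." << std::endl;
  return 0;
}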

Yangqing commented on April 29, 2024

convert_imageset.cpp does contain an error that should be fixed. Thanks for the heads-up! Adding one additional db->Write() call should do the trick.

The ImageNet images stored in the leveldb are 256x256. They are uncompressed and stored as raw pixels, which is why you are seeing a db size larger than that of the original images (roughly 1,281,167 x 256 x 256 x 3 bytes, i.e. about 235 GiB, consistent with the ~236 GB figure reported below).

IMHO, I am not very keen on per-epoch random shuffling, mainly for the following reasons:

(1) Epochs are simply a notion we use to track the progress of training and are not really enforced - there is no explicit constraint that we train exactly 90 epochs, not one image more or less. Thus it may not be that useful to enforce epoch boundaries.

(2) Per-epoch random shuffling of a leveldb would really hurt speed, since randomly shuffling a leveldb on a single machine involves random hard-disk access, which is very slow (unless one uses an SSD).

(3) Doing a one-time shuffle and then going through all the data sequentially seems to give reasonable speed, and I haven't observed a faster convergence rate with random re-shuffling on smaller benchmark datasets.

Yangqing

huangjunshi commented on April 29, 2024

Hi @niuchuang, basically this problem just disappeared after several trials without many modifications, and it has never happened again in the last year... The only thing I can remember is that the initial value of the bias was changed to 0.7 (or even 0.5) for all layers where it was originally 1. Usually, the loss should drop below 6.90 after about 2,000 - 3,000 iterations (batch size 256). Another observation that should be helpful is that the mean gradient of the loss w.r.t. the weights/bias in the FC8 layer should be around 10^-5 - 10^-6 (you may need to write this check yourself). If it is smaller than 10^-6, such as 10^-7, training usually ends in a bad solution or even fails to converge.
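
A minimal sketch of that kind of check, assuming access to a parameter blob's gradient buffer (Caffe's Blob exposes cpu_diff() and count()); the usage lines are hypothetical:

// Minimal sketch: mean absolute gradient of a parameter blob.
// Assumes the gradient is available as a raw float array, e.g. from
// Blob<float>::cpu_diff() with Blob<float>::count() elements.
#include <cmath>
#include <cstddef>

float MeanAbsGradient(const float* diff, std::size_t count) {
  double sum = 0.0;
  for (std::size_t i = 0; i < count; ++i) {
    sum += std::fabs(diff[i]);
  }
  return count > 0 ? static_cast<float>(sum / count) : 0.0f;
}

// Hypothetical usage inside the training loop:
//   const caffe::Blob<float>& fc8_w = *net.params()[fc8_weight_index];
//   LOG(INFO) << "FC8 mean |grad|: "
//             << MeanAbsGradient(fc8_w.cpu_diff(), fc8_w.count());
//   // expect roughly 1e-5 to 1e-6; around 1e-7 often signals trouble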

kloudkl commented on April 29, 2024

Train on the CIFAR dataset to investigate the effect of shuffling versus not shuffling.

huangjunshi commented on April 29, 2024

Thanks, @sguada and @kloudkl!

Sorry that I did not make the question clear. Yangqing's instructions do not do any shuffling; what I meant is that I followed everything in his instructions and additionally shuffled the data when constructing the leveldb.

I have already checked the proto files against Alex's paper and am almost sure that the configuration is correct. Now I am checking the code that constructs the leveldb, which I think is the only difference between the "mnist demo" and the "imagenet demo".

BTW, could you tell me the size of the imagenet training data stored in leveldb? In my case it is about 236.2G, which seems strange given that the original images total about 60G.

shelhamer commented on April 29, 2024

The convert_imageset utility source discusses shuffling. Perhaps a note should be added to the recipe.

kloudkl commented on April 29, 2024

@mianba120, your problem may not have been caused by the order of the data. The ImageNet dataset is really huge and not suitable for debugging if you are using all of the images. Successful training can be seen in the comments of #33.
Out of concern about shuffling the data before each training epoch, I just looked into the code and found that in caffe/solver.cpp

template <typename Dtype>
void Solver<Dtype>::Solve(const char* resume_file)  {
  ...
  while (iter_++ < param_.max_iter()) {
  ...
  } // while (iter_++ < param_.max_iter())
} // void Solver<Dtype>::Solve(const char* resume_file)

the iterations of different epochs are not separated. In caffe/proto/caffe.proto, there is no definition of epochs:

message SolverParameter {
  optional int32 max_iter = 7; // the maximum number of iterations

Therefore, setting max_iter entails computing expected_epochs * iterations_per_epoch, which is a little inconvenient and indeed produced an error in the original imagenet.prototxt (again, see the comments of #33). For example, with 1,281,167 training images and a minibatch of 256, one epoch is ceil(1281167 / 256) = 5005 iterations, so 90 epochs correspond to 450,450 iterations. If max_iter % iterations_per_epoch != 0, I am afraid that the last partial epoch, consisting of max_iter % iterations_per_epoch iterations, would introduce bias into the training dataset.

Although the typo has been fixed in commit b31b316, it suggests that a better design would be to set max_epoch and let iterations_per_epoch = ceil(data_size / data_size_per_minibatch). Then in caffe/solver.cpp we would have the chance to shuffle the data before each epoch to make the gradients more random and accelerate the optimization process.

template <typename Dtype>
void Solver<Dtype>::Solve(const char* resume_file)  {
  ...
  while (epoch_++ < param_.max_epoch()) {
    PreEpoch(...); // Shuffle data and some other stuff
    for (size_t i = 0; i < iterations_per_epoch; ++i) {
      iter_++;
      ...
    } // for (size_t i = 0; i < iterations_per_epoch; ++i)
    ...
  } // while (epoch_++ < param_.max_epoch())
} // void Solver<Dtype>::Solve(const char* resume_file)

After the change, it is no longer necessary to remember to do shuffling in each example recipe or in any other application.
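
A purely hypothetical sketch of that flow (none of this is existing Caffe code; it only illustrates the epoch/batch split and the ceil computation above):

// Hypothetical illustration of the proposed per-epoch shuffle; not Caffe code.
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

int main() {
  const std::size_t data_size = 1281167;   // ImageNet training images
  const std::size_t batch_size = 256;
  const std::size_t iterations_per_epoch =
      (data_size + batch_size - 1) / batch_size;  // ceil(data_size / batch_size) = 5005
  const int max_epoch = 90;

  std::vector<std::size_t> order(data_size);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::mt19937 rng(1701);

  for (int epoch = 0; epoch < max_epoch; ++epoch) {
    std::shuffle(order.begin(), order.end(), rng);  // PreEpoch: reshuffle once per epoch
    for (std::size_t i = 0; i < iterations_per_epoch; ++i) {
      // Feed samples order[i * batch_size] ... order[i * batch_size + batch_size - 1]
      // (the epoch's last batch is partial) and run one SGD step here.
    }
  }
  return 0;
}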

huangjunshi commented on April 29, 2024

@kloudkl, thanks so much for your help. I will check the discussion in #33.

As you pointed out:

"If max_iter % iterations_per_epoch != 0, I am afraid that the last partial epoch, consisting of max_iter % iterations_per_epoch iterations, would introduce bias into the training dataset."

I think some code should also be refined in examples/convert_imageset.cpp, starting from line 82:

int main(int argc, char** argv) {
  // ... (open the leveldb, then loop over the listed image files) ...
    if (++count % 1000 == 0) {
      // Every 1000 files, flush the accumulated batch to the leveldb.
      db->Write(leveldb::WriteOptions(), batch);
      LOG(ERROR) << "Processed " << count << " files.";
      delete batch;
      batch = new leveldb::WriteBatch();
    }
  }
  // The loop ends here; whatever remains in `batch` is never written.
  delete db;
  return 0;
}

Here, I think it drops training images 1,281,001 - 1,281,167, as I cannot find any place where the last, partial batch containing those images is written.
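
A rough sketch of the extra flush (reusing the db, batch, and count variables from the snippet above; not a verbatim patch), so the end of main() would look roughly like:

  // After the file loop: flush whatever is left in the final, partial batch
  // (e.g. the last 167 images when count reaches 1,281,167 and batches hold 1000).
  if (count % 1000 != 0) {
    db->Write(leveldb::WriteOptions(), batch);
    LOG(ERROR) << "Processed " << count << " files.";
  }
  delete batch;
  delete db;
  return 0;
}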

Regarding your second part, I think it would be better to separate the "epoch loop" and the "batch loop"; however, PreEpoch() may be time-consuming, so I suppose the shuffling should be done offline.

Also, could you tell me the file size of the imagenet training data stored in leveldb? I need to make sure my training data is correct. :)

palmforest commented on April 29, 2024

I may have exactly the same problem as @mianba120 described. I followed the new version of the imagenet training recipe, with the Caffe package updated on Feb. 10, and I also did the shuffling with 'convert_imageset'.
However, the testing score remains 0.001 and the loss keeps increasing slowly.

The current log:
I0212 19:31:26.893280 13373 solver.cpp:207] Iteration 12940, lr = 0.01
I0212 19:31:26.903733 13373 solver.cpp:65] Iteration 12940, loss = 6.91592
I0212 19:32:18.686200 13373 solver.cpp:207] Iteration 12960, lr = 0.01
I0212 19:32:18.696670 13373 solver.cpp:65] Iteration 12960, loss = 6.90673
I0212 19:33:03.411830 13373 solver.cpp:207] Iteration 12980, lr = 0.01
I0212 19:33:03.422310 13373 solver.cpp:65] Iteration 12980, loss = 6.91012
I0212 19:33:47.450816 13373 solver.cpp:207] Iteration 13000, lr = 0.01
I0212 19:33:47.461206 13373 solver.cpp:65] Iteration 13000, loss = 6.90916
I0212 19:33:47.461225 13373 solver.cpp:87] Testing net
I0212 19:36:39.351974 13373 solver.cpp:114] Test score #0: 0.001
I0212 19:36:39.352042 13373 solver.cpp:114] Test score #1: 6.90934
I0212 19:37:30.193382 13373 solver.cpp:207] Iteration 13020, lr = 0.01
I0212 19:37:30.203824 13373 solver.cpp:65] Iteration 13020, loss = 6.90191
I0212 19:38:14.958088 13373 solver.cpp:207] Iteration 13040, lr = 0.01
I0212 19:38:14.968525 13373 solver.cpp:65] Iteration 13040, loss = 6.90839
I0212 19:39:08.426115 13373 solver.cpp:207] Iteration 13060, lr = 0.01
I0212 19:39:08.436560 13373 solver.cpp:65] Iteration 13060, loss = 6.9094
I0212 19:39:50.263488 13373 solver.cpp:207] Iteration 13080, lr = 0.01
I0212 19:39:50.273931 13373 solver.cpp:65] Iteration 13080, loss = 6.90351
I0212 19:40:35.237869 13373 solver.cpp:207] Iteration 13100, lr = 0.01
I0212 19:40:35.248314 13373 solver.cpp:65] Iteration 13100, loss = 6.90753

I am using Ubuntu 12.04, K20. The mnist demo works well with my Caffe setup.

I am wondering what the problem is. For the training and validation images, I just converted them to 256x256 in the simplest way. Has anyone succeeded in training imagenet by resizing like this? Should I instead do the same as "The images are reshaped so that the shorter side has length 256, and the centre 256x256 part is cropped for training", as mentioned in http://decaf.berkeleyvision.org/about ?

huangjunshi commented on April 29, 2024

Hi @palmforest, this is mianba120 (I have changed my username...). I fixed this problem accidentally, though I have no idea how to give you the exact answer. However, from my successful case on imagenet I found:

  1. It is OK to directly resize the images to 256x256 without cropping. The shuffling is not that important, which our group has verified with Alex's convnet implementation.
  2. The original Caffe code (master branch) is also OK. The loss can start dropping to 6.89 at about 2,000 iterations, and the testing accuracy is about 0.002 at 2,000 iterations.
  3. The mean update of the weights (the fully-connected layers may be smaller) is about 10^(-7) - 10^(-8), although 10^(-5) - 10^(-6) should be normal.

Overall, I think you may try:

  1. Use the original Caffe code.
  2. My environment: g++ 4.6, CUDA 5.5, and MKL downloaded directly from Intel's official website. The driver for my GTX Titan is NVIDIA-Linux-x86_64-319.82.run; I guess the driver version may be important, as my failure case was based on NVIDIA-Linux-x86_64-331.20.run.
  3. Lastly, though I don't know why it finally worked, you may try running the training several times and stopping a run if the loss doesn't drop below 6.89 by 3,000 iterations. I think sometimes I may just be unlucky and get a bad random initialization.

Anyway, if you want, I can send you the log of imagenet. Also, you may try my branch.... (not an advertisement...)

Good luck!

palmforest commented on April 29, 2024

Hi @huangjunshi, many thanks for sharing your views and experience. I am using the original Caffe code and have tried to re-run the training 3 times... but the testing score remains 0.001 after 5000 iterations... I am still trying my luck... :)

Could you please share your log on training imagenet? My email address is [email protected]

I am also curious about bad/good random initialization. Theoretically, random initialization should not lead to a problem like this; has anyone else met this problem with Caffe?

niuchuang commented on April 29, 2024

Hi @palmforest, I have met the same problem now, and I think you must have solved it. I would really appreciate it if you could share your wisdom!

niuchuang commented on April 29, 2024

Hi @huangjunshi, I have met the same problem now, and I think you must have solved it. I would really appreciate it if you could share your wisdom!

jnhwkim commented on April 29, 2024

In my case, the first iteration where the loss dropped below 6.9 was 4,500; however, it drops rapidly after that.
I'm getting a loss of about 3.7 and an accuracy of around 25% at iteration 15,500. If you have some time, be patient and let it do its thing.

stevenluzheng commented on April 29, 2024

Hi Kim,

I met the same problem as you describe in your mail; did you finally solve it?

jnhwkim commented on April 29, 2024

@stevenluzheng Let it run for a few hours. I did nothing, but after 2 days it is now getting over 50% accuracy.

stevenluzheng commented on April 29, 2024

Thanks Kim

Actually, I have done 25K iterations since yesterday, and the accuracy still remains at 0.39. I use my own dataset to train Caffe; it seems the training fails for some uncertain reason.

BTW, do you use your own dataset to train Caffe? If so, how many pictures do you use in the training set and the val set respectively?

jnhwkim commented on April 29, 2024

@stevenluzheng I used the ilsvrc12 dataset, which has 1,281,167 images for training. I heard that it will take 6 days to reach a sufficient accuracy.

stevenluzheng commented on April 29, 2024

Oh... I think you are using the ImageNet dataset to train your Caffe model; that is a huge dataset. I guess a huge dataset can train the network sufficiently, while a small dataset might not make a deep network converge. I only use 300 pictures for training and 40 pictures for validation...

Did you ever use your own dataset to train and validate Caffe before?

acpn commented on April 29, 2024

Hi guys, I have the same problem, but in my case loss = 8.5177. I'm trying to use the LFW dataset to train my net; for this I wrote my own .prototxt file and followed the paper by Guosheng Hu for the architecture. Does anyone have any idea?

stevenluzheng commented on April 29, 2024

acpn:
Try FaceScrub; LFW is only used for the evaluation challenge, not for training. BTW, please use CASIA-webface - it gives a workable model and most of us use it.

acpn commented on April 29, 2024

Hi, thanks stevenluzheng, but I'm trying to reproduce the results of this paper: http://arxiv.org/abs/1504.02351.

aTnT commented on April 29, 2024

I had a similar problem to the one described here, but in my case the root cause was that I was using labels (in the filename-and-labels .txt file) starting at 1 instead of 0.
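
A minimal sketch of that fix, assuming a "<path> <label>" list file (the filenames are placeholders):

// Minimal sketch: rewrite a "<path> <label>" list so labels start at 0, not 1.
#include <fstream>
#include <string>

int main() {
  std::ifstream in("train_1based.txt");   // placeholder input list
  std::ofstream out("train_0based.txt");  // placeholder output list
  std::string path;
  int label;
  while (in >> path >> label) {
    out << path << " " << (label - 1) << "\n";
  }
  return 0;
}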

fucevin commented on April 29, 2024

The BN implementation from cuDNN has no accuracy problem, at least for cuDNN 5. I trained ResNet-50 to a top-1 accuracy of 75%.

neftaliw commented on April 29, 2024

What @huangjunshi said basically did it for me, except that I changed every bias=1 to bias=0.1. But how are we supposed to know this from the very incomplete documentation on Caffe's website? The tutorials are meant for people who know nothing about Caffe and are just getting into deep learning, yet they leave a lot of things out, and this bias issue is their own mistake - did they even test the tutorials before publishing them?

Dror370 commented on April 29, 2024

Hi all,
I suggest you check the SoftmaxWithLoss layer and make sure it is defined correctly.
If you define it with phase: TRAIN, it does not work properly;
the layer definition should not include any phase...
