Comments (17)
Thanks for your work on this! I have not had a chance to look at this in detail yet, but I can say this is not redundant with current efforts. I'll check back when I've had a closer look, but I look forward to seeing this as a pull request once it's polished.
from caffe.
Intel MKL cannot be used on some kinds of Linux.
Looking forward to more work on this.
In src/caffe/util/math_functions.cpp, line 289:
- // FIXME check if boundaries are handled in the same way ?
- boost::uniform_real random_distribution(a, b);
No, the boost:: and std::uniform_real interval is [a, b), while Intel MKL's is [a, b]. Besides, boost::uniform_real has been deprecated in favor of uniform_real_distribution. How about this workaround:
using boost::variate_generator;
using boost::mt19937;
using boost::random::uniform_real_distribution;
Caffe::random_generator_t &generator = Caffe::vsl_stream();
Dtype epsilon = 1e-5; // or 1e-4, 1e-6, different values may cause some tests to fail or pass
variate_generator<mt19937, uniform_real_distribution<Dtype> > rng(generator, uniform_real_distribution<Dtype>(a, b + epsilon));
for (int i = 0; i < n; ++i) {  // fill all n entries of r
  do {
    r[i] = rng();
  } while (r[i] > b);  // reject samples that land in (b, b + epsilon]
}
Great to see this moving, and glad that you found/understood the source of the problem.
Stack Overflow indicates that a less hacky way of fixing this is to use std::nextafter, or for better compatibility, boost::math::nextafter:
http://stackoverflow.com/questions/16224446/stduniform-real-distribution-inclusive-range
I am not a git guru (I am more of an hg guy); in which branch is @openwzdh (https://github.com/openwzdh) working?
Should I switch to that branch to try helping out, or import it into my own?
This is good progress. Thanks for the commit @rodrigob and debugging @kloudkl!
Let's develop this port in the boost-eigen branch I have just pushed. I have included the initial commit by @rodrigob.
To continue development, please make commits in your fork then pull request to this branch. I will review and merge the requests.
Please rebase any work on the latest bvlc/caffe boost-eigen before requesting a pull; I'd rather keep the history clean of merge noise.
Is the plan to completely get rid of MKL?
Just as a suggestion: it would be nice to be able to switch between different BLAS libraries, e.g. by having a BLASFactory that spits out whatever BLAS library is available on the system.
You can change the Makefile include and library settings to make it work with a different BLAS.
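For illustration, such a build-time switch might look like the following Makefile fragment (the variable names and library lists are hypothetical, not Caffe's actual build files):

```makefile
# Hypothetical BLAS switch; names are illustrative only.
BLAS ?= open                    # one of: mkl, open, atlas

ifeq ($(BLAS), mkl)
    LIBRARIES    += mkl_rt
    COMMON_FLAGS += -DUSE_MKL
else ifeq ($(BLAS), open)
    LIBRARIES    += openblas
else
    LIBRARIES    += cblas atlas
endif
```

Invoking `make BLAS=mkl` (or `open`, `atlas`) would then pick the backend at build time rather than at runtime.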
Please note that on Debian systems, selecting the BLAS implementation is done via
sudo update-alternatives --config libblas.so
Such a decision is certainly not meant to be made at application runtime.
The ideal case for integration is that performance of the MKL and boost-eigen implementations are comparable and boost-eigen is made the default. If the MKL vs. boost/eigen differences can be insulated cleanly enough it would be nice to offer both by a build switch.
We need benchmarking to move forward, and comparisons by anyone with both MKL and boost/eigen would be welcome. @Yangqing @jeffdonahue, would comparing train/test of the ImageNet model do it, or are there more comparisons to be done?
The CPU is too slow to train on a dataset as large as ImageNet. The most likely use case is to train on a GPU first and then deploy the model on devices without one. Besides benchmarking the runtime of the complete pipeline, microbenchmarking the math functions and profiling to find the hotspots would also be helpful.
Agreed, real training of ImageNet or any contemporary architecture and dataset is infeasible on CPU; sorry my suggestion was not more precise. I think benchmarking training minibatches or epochs is still indicative of performance. I second microbenchmarking too, as a further detail. If the speed of the full pipeline is close enough, that suffices.
I have just benchmarked on the MNIST dataset using the heads of both the boost-eigen branch and master. The three experiments used CPU mode with boost-eigen, CPU mode with MKL, and GPU mode, respectively. The CPU is an Intel Core i7-3770 @ 3.40GHz × 8 and the GPU is an NVIDIA GTX 560 Ti. Note that the CPU code under-utilized the available cores, using only a single thread.
After training 10000 iterations, the final learning rate, training loss, testing accuracy (Test score 0) and testing loss (Test score 1) of boost-eigen and MKL were exactly the same. The training time of boost-eigen was 26m25.259s and that of MKL was 26m43.919s. Considering the fluctuations of data I/O costs, there was no significant performance difference. The results were a little surprising, so you may want to double-check on your own machine.
On the GTX 560 Ti, training took 85.5% less time than the faster CPU mode (boost-eigen) and produced a slightly better model in terms of training loss, testing accuracy and testing loss.
Because the training runs also included testing iterations, this benchmark demonstrates that there is no need to keep depending on a proprietary library that brings no benefit, only excess code and a redundant maintenance burden. It is time to merge this branch directly into master.
cd data
time ./train_mnist.sh
CPU boost-eigen
I0207 12:54:18.161139 14107 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 12:54:18.163564 14107 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 12:54:18.166762 14107 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 12:54:18.169086 14107 solver.cpp:66] Iteration 10000, loss = 0.0033857
I0207 12:54:18.169108 14107 solver.cpp:84] Testing net
I0207 12:54:25.810292 14107 solver.cpp:111] Test score #0: 0.9909
I0207 12:54:25.810333 14107 solver.cpp:111] Test score #1: 0.0285976
I0207 12:54:25.811945 14107 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 12:54:25.815465 14107 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 12:54:25.818124 14107 solver.cpp:78] Optimization Done.
I0207 12:54:25.818137 14107 train_net.cpp:34] Optimization Done.
real 26m25.259s
user 26m26.499s
sys 0m0.724s
CPU MKL
I0207 13:34:29.381631 4691 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 13:34:29.384047 4691 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:34:29.387784 4691 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:34:29.390490 4691 solver.cpp:66] Iteration 10000, loss = 0.0033857
I0207 13:34:29.390512 4691 solver.cpp:84] Testing net
I0207 13:34:37.038708 4691 solver.cpp:111] Test score #0: 0.9909
I0207 13:34:37.038748 4691 solver.cpp:111] Test score #1: 0.0285976
I0207 13:34:37.040276 4691 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:34:37.043890 4691 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:34:37.046598 4691 solver.cpp:78] Optimization Done.
I0207 13:34:37.046612 4691 train_net.cpp:34] Optimization Done.
real 26m43.919s
user 26m45.056s
sys 0m0.768s
GPU
I0207 13:40:54.950667 24846 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 13:40:54.962781 24846 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:40:54.967131 24846 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:40:54.970029 24846 solver.cpp:66] Iteration 10000, loss = 0.00247615
I0207 13:40:54.970067 24846 solver.cpp:84] Testing net
I0207 13:40:56.242010 24846 solver.cpp:111] Test score #0: 0.991
I0207 13:40:56.242048 24846 solver.cpp:111] Test score #1: 0.0284187
I0207 13:40:56.242781 24846 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:40:56.246444 24846 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:40:56.249151 24846 solver.cpp:78] Optimization Done.
I0207 13:40:56.249166 24846 train_net.cpp:34] Optimization Done.
real 3m50.187s
user 3m3.219s
sys 0m50.039s
It would be good to have a benchmark with larger networks such as ImageNet, as MNIST might be too small to show a significant difference on any platform. That said, I believe boost-eigen gives performance comparable to MKL, and we should in general move to open-source libraries in the long run.
Yangqing
Would it help to replace some of the code with parallel for loops? Eigen does not exploit the several cores present in most workstations, except for matrix-matrix multiplication. For example, the ReLU layer (or any simple activation function) performs an independent operation for every neuron; it can be made faster with #pragma omp parallel for.
@aravindhm, I had the same idea just after observing that training on the CPU is single-threaded, and experimented with parallelizing via OpenMP. But the test accuracy turned out to stay at the random-guess level. Then I realized that there was a conflict between OpenMP and BLAS, and that the correct solution is to take advantage of a multi-threaded BLAS library such as OpenBLAS. See my reference from #79 above.
The updated benchmark exploiting multi-threaded OpenBLAS showed a great speed-up: training on a multi-core CPU can be as fast as, or even faster than, training on a GPU. It now becomes more realistic to benchmark with a larger-scale dataset.
cd data
OPENBLAS_NUM_THREADS=8 OMP_NUM_THREADS=8 time ./train_mnist.sh
CPU boost-eigen
I0207 18:41:32.068876 8664 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 18:41:32.071004 8664 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 18:41:32.074946 8664 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 18:41:32.078304 8664 solver.cpp:66] Iteration 10000, loss = 0.00375376
I0207 18:41:32.078330 8664 solver.cpp:84] Testing net
I0207 18:41:33.663113 8664 solver.cpp:111] Test score #0: 0.9911
I0207 18:41:33.663157 8664 solver.cpp:111] Test score #1: 0.0282938
I0207 18:41:33.664984 8664 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 18:41:33.668848 8664 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 18:41:33.671816 8664 solver.cpp:78] Optimization Done.
I0207 18:41:33.671834 8664 train_net.cpp:34] Optimization Done.
768.49user 538.33system 5:27.77elapsed 398%CPU
CPU MKL
I0207 19:00:01.696180 27157 solver.cpp:207] Iteration 10000, lr = 0.00594604
I0207 19:00:01.696760 27157 solver.cpp:65] Iteration 10000, loss = 0.00308708
I0207 19:00:01.696787 27157 solver.cpp:87] Testing net
I0207 19:00:02.968822 27157 solver.cpp:114] Test score #0: 0.9905
I0207 19:00:02.968865 27157 solver.cpp:114] Test score #1: 0.0284175
I0207 19:00:02.970607 27157 solver.cpp:129] Snapshotting to lenet_iter_10000
I0207 19:00:02.974674 27157 solver.cpp:136] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 19:00:02.979106 27157 solver.cpp:129] Snapshotting to lenet_iter_10000
I0207 19:00:02.984369 27157 solver.cpp:136] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 19:00:02.990788 27157 solver.cpp:81] Optimization Done.
I0207 19:00:02.990809 27157 train_net.cpp:34] Optimization Done.
1121.49user 18.71system 4:45.62elapsed 399%CPU
A proposal has been made at #97 - please kindly discuss there. Closing this to reduce duplicates.