jcjohnson / cnn-benchmarks
Benchmarks for popular CNN models
License: MIT License
Hi Justin,
May I know what "(nn)" means in your table, for example TITAN X (nn)?
Thank you.
Do you have a Caffe version of ResNet-34/50? Can you share it? Thank you.
Hello Johnson,
Thank you for the nice tutorial.
I was able to change the code in cnn-benchmarks.lua to run it on a multi-GPU system (8 GPUs).
It gave me a speed-up of 3x.
I am facing issues with the "ResNet-x" model weights. I am able to work with the AlexNet, GoogLeNet, VGG-16 and VGG-19 model weights, but ResNet is giving me problems at the 224x224 image size.
Can you please let me know about it?
Also, I am not able to find model weights for the Inception-V3 graph. Can you add Inception-V3 to the models folder?
Thank you for your help.
Regards,
Bhushan
Saying that the GTX 1080 > Maxwell Titan X is misleading. The metric should be time per epoch (or time to convergence), not forward + backward. The extra 4GB on the Maxwell Titan X lets it run larger batches, which makes it much faster than the GTX 1080 for training.
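The commenter's point can be sketched numerically. All batch sizes and per-batch times below are made up purely to illustrate the arithmetic; they are not measurements:

```python
# Hypothetical numbers, purely to illustrate the commenter's metric: a card
# that is slower per batch can still finish an epoch sooner if its extra
# memory allows a larger batch size.
import math

def epoch_seconds(n_images, batch_size, seconds_per_batch):
    return math.ceil(n_images / batch_size) * seconds_per_batch

n = 1_281_167                                 # ImageNet-1k training images
time_8gb  = epoch_seconds(n, 16, 0.40)        # smaller batch, faster batches
time_12gb = epoch_seconds(n, 32, 0.70)        # larger batch, slower batches
print(time_8gb > time_12gb)                   # True: larger batch wins here
```

Whether the larger batch actually wins depends on the real per-batch times, which is exactly why per-epoch timing is the fairer metric here.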
I have a Titan X Pascal, an Intel i5-6600, and 16GB of RAM, and I am running Torch7 on Ubuntu 14.04. The NVIDIA driver version is 375.20, with CUDA Toolkit 8.0 and cuDNN v5.1.
I ran the same test with the same VGG-16 network from Caffe (imported via loadcaffe) as you did. However, for a forward pass my setup needs 80ms, which is double the time it apparently needs in your benchmark.
I also generated a batch of 16 images with 3 channels and size 224x224. The relevant code is:
-- Assumes loadcaffe, cutorch, and cudnn are installed.
require 'cutorch'
require 'cudnn'
local loadcaffe = require 'loadcaffe'

local model = loadcaffe.load(
    "/home/.../Models/VGG16/VGG_ILSVRC_16_layers_deploy.prototxt",
    "/home/.../Models/VGG16/VGG_ILSVRC_16_layers.caffemodel",
    "cudnn")
model:cuda()  -- move the weights to the GPU

for i = 1, 50 do
  local input = torch.randn(16, 3, 224, 224):type("torch.CudaTensor")
  cutorch.synchronize()
  local timer = torch.Timer()
  model:forward(input)
  cutorch.synchronize()  -- wait for the GPU before stopping the timer
  local deltaT = timer:time().real
  print("Forward time: " .. deltaT)
end
The output is:
Forward time: 0.96536016464233
Forward time: 0.10063600540161
Forward time: 0.096444129943848
Forward time: 0.089151859283447
Forward time: 0.082037925720215
Forward time: 0.082045078277588
Forward time: 0.079913139343262
Forward time: 0.080273866653442
Forward time: 0.080694913864136
Forward time: 0.082727193832397
Forward time: 0.082070827484131
Forward time: 0.079407930374146
Forward time: 0.080456018447876
Forward time: 0.083559989929199
Forward time: 0.082060098648071
Forward time: 0.081624984741211
Forward time: 0.080413103103638
Forward time: 0.083755016326904
Forward time: 0.083209037780762
...
Did you do anything additional to get that speed? Or am I doing something wrong here?
Or is it maybe because I am using Ubuntu 14.04 (although your GTX 1080 running on Ubuntu 14.04 also only needs 60ms)?
The benchmark does not include results for ResNet-200 on the GTX 1080 Ti (11GB). Can anyone add it?
Hi,
Are the speed numbers divided by the mini-batch size (= 16)?
From what I can see (https://github.com/jcjohnson/cnn-benchmarks/blob/master/cnn_benchmark.lua#L46), they are not normalized. Is that correct?
Thanks,
Vadim
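If the reported numbers are indeed per minibatch (as the linked line suggests), the per-image figure is just the batch time divided by 16. A trivial sketch, using the benchmark's VGG-16 forward number as an example:

```python
# Hypothetical per-image normalization, assuming the reported time is for a
# whole minibatch; 62.30 ms is the VGG-16 forward number from the benchmark.
batch_time_ms = 62.30
batch_size = 16

per_image_ms = batch_time_ms / batch_size
print(round(per_image_ms, 3))  # 3.894 ms per image
```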
Is there any plan for new CNN architectures - DenseNet, NASNet, etc.?
These two new architectures are able to reduce model size while achieving better results than ResNet. It would be very interesting to see how they perform in the benchmark.
ubuntu@ip-Address:~/cnn-benchmarks$ th convert_model.lua -input_caffemodel VGG16_SOD_finetune.caffemodel -input_prototxt VGG16_SOD_finetune_deploy.prototxt -backend nn
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 538683157
Successfully loaded VGG16_SOD_finetune.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8-SOD100: 1 1 4096 100
/home/ubuntu/torch/install/bin/luajit: cannot open <> in mode w at /home/ubuntu/torch/pkg/torch/lib/TH/THDiskFile.c:649
stack traceback:
[C]: at 0x7f1d7a3567e0
[C]: in function 'DiskFile'
/home/ubuntu/torch/install/share/lua/5.1/torch/File.lua:385: in function 'save'
convert_model.lua:36: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d50
ubuntu@ip-Address:~/cnn-benchmarks$
I get the same error with or without -backend nn. Not sure how to resolve it.
Dear everyone,
Can anyone explain "the benchmark structure" to me? For each model (e.g. AlexNet, the ResNet models), will there be a different benchmark structure? If so, what is the benchmark structure of the AlexNet model?
Many thanks, everyone.
Best
Any plans to benchmark object detection architectures like YOLO, SSD, or R-CNN?
Hi, this is not an issue; I am just not sure where else to ask questions. If there is a better place for questions, please let me know. Thanks.
Does the "forward" timing in the benchmark mean the time the network takes to run inference on a SINGLE image, or the time to process a whole batch (e.g. 16 images)? Take AlexNet on the Maxwell Titan X as an example, which is 7.09ms. How can I deduce images/sec from it? Is it 1/7.09ms ≈ 141 images/sec, or 1/(7.09ms/16) ≈ 2256 images/sec? According to what NVIDIA announced, AlexNet on the Maxwell Titan X runs at 450 images/sec, yet neither of these two values (141 and 2256) is close to that, which is quite confusing.
http://cdn.wccftech.com/wp-content/uploads/2016/01/NVIDIA-Drive-PX-2-Specifications.jpg
Thanks, and I look forward to a reply.
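The two interpretations in the question differ only in whether the 7.09 ms covers one image or the whole minibatch of 16; a quick sketch of both:

```python
# The reported 7.09 ms forward time for AlexNet (Maxwell Titan X) under the
# two interpretations discussed above; the minibatch size is 16.
forward_s = 7.09e-3
batch = 16

per_image_rate = 1 / forward_s        # if 7.09 ms covers a single image
per_batch_rate = batch / forward_s    # if 7.09 ms covers the whole batch

print(round(per_image_rate))   # 141 images/sec
print(round(per_batch_rate))   # 2257 images/sec
```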
Just a placeholder to add this card once it's available :)
@jcjohnson hi, really nice benchmark!
I am working on Torch optimization for Intel platforms, Xeon and Xeon Phi. Our optimized version is much faster than the original Torch CPU backend, and we are trying to upstream it: https://github.com/intel/torch.
The claim that "The Pascal Titan X with cuDNN is 49x to 74x faster than dual Xeon E5-2630 v3 CPUs" is somewhat misleading :(
This is because a) Pascal is the latest generation of GPU, while the Xeon E5 v3 is about 4 years old, and b) our intel-torch yields much better performance, competitive with GPUs.
We are happy to provide benchmark numbers for Intel's latest hardware platforms, Xeon E5-2699 v4 and Xeon Phi 7250 (KNL), and also the upcoming platforms (SKL/KNM).
Could you please update these numbers once we have finished? :)
I want to cite this result, but I cannot find any paper about it.
Thanks.
You stated that AlexNet is imported from Caffe, while the ResNets are implemented in Torch.
Is that a fair speed comparison?
On OpenCL-Caffe, there are performance metrics claiming speeds of about 4ms per image for training AlexNet on a Radeon R9 290X. Considering this GPU is much weaker than a GTX 1080, these figures seem very strange compared with the 20ms in your tests.
What's your take on this?
These are very intriguing and detailed results for people who really can't afford much hardware, but I'd really like to see this kind of detailed testing on new gear like Tesla GPUs and Google TPUs, since even testing them in the cloud by myself is really costly and time-consuming...
Justin, can you add information about the motherboard, disks, RAM, power supply, processor, etc.?
Thank you very much for the results. Is there any plan to add results for GoogLeNet v1/v2/v3/v4? Or even SqueezeNet and CaffeNet (AlexNet with the LRN layers replaced)?
Hey Justin,
I am going to work on radiology X-ray images at 2048x2048 resolution for my next project. Do you have any insight into what batch size can fit in a 12GB Titan X for these higher-resolution images?
I am trying to decide whether I can get away with a Titan X or need to bite the bullet and get a card with 24GB of VRAM.
Thanks!
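There is no substitute for measuring, but a hedged back-of-envelope shows why 2048x2048 is demanding. Assuming a hypothetical VGG-style first conv layer with 64 output channels kept at full resolution in float32, the activations of that single layer already cost 1 GiB per image:

```python
# Hedged back-of-envelope, not a measurement: assuming a hypothetical
# VGG-style first conv layer with 64 output channels kept at full
# 2048x2048 resolution, float32. Weights, gradients, other layers, and
# cuDNN workspace all come on top, so this is only a lower bound.
H = W = 2048
c1_act_bytes = 64 * H * W * 4        # conv1 activations for ONE image
print(c1_act_bytes / 2**30)          # 1.0 GiB per image, one layer only
```

Under those assumptions, even a batch size in the low single digits would push past 12GB for a VGG-style net; aggressive early downsampling or a smaller architecture would change the picture entirely.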
So I was trying to look at the speed I get on some nets with TF and PyTorch (Maxwell Titan X GPU).
I was just trying the forward pass, and I get similar results (TF usually being slightly slower) for the resnet-* architectures. But if I try VGG-16, I get something much worse.
Model | your benchmark | pytorch (me) | tf (me) | Reported MatConvnet from here (interpolated, probably Pascal) |
---|---|---|---|---|
resnet-50 | 55.75 | 48.3 | 57.6 | ~40 |
vgg16 | 62.30 | 113 | 169 | ~80 |
I am a bit surprised that all the others show a sharp increase (more than twice as slow on average) from resnet-50 to vgg16, but your benchmark does not.
I'm having issues using convert_model.lua to convert VGG_ILSVRC_19_layers.caffemodel to a .t7 file for use with fast_neural_style.lua.
Here is an example of my terminal command:
sudo th convert_model.lua -input_prototxt '/home/david/Deepstyle/cnn-benchmarks/VGG_ILSVRC_19_layers_deploy.prototxt' -input_caffemodel '/home/david/Deepstyle/cnn-benchmarks/VGG_ILSVRC_19_layers.caffemodel'
It seems to load and begin the process but here is an example of the error I receive:
Any help would be greatly appreciated! I can't seem to locate the /tmp directory listed (to resolve the error 'cannot open <> in mode w').
David
Hi, thanks for testing so many models for speed, but I would like to know which dataset you use.
Hi Justin,
In fact, the Pascal architecture is much slower than Maxwell. Could you please look at my benchmarks below? The point is to measure older-architecture GPUs with DNN libraries optimized for that specific architecture - CUDA 7.5 and cuDNN v4.0.
Measured in CNTK and Caffe under Windows 7 and Ubuntu.
GTX 980 Ti - CUDA 7.5 with cuDNN 4.0
Caffe Performance: 5242 imgs/s
GTX 980 Ti - CUDA 8.0 RC with cuDNN 5.1
Caffe Performance: 4183 imgs/s
GTX 1080 - CUDA 7.5 with cuDNN 4.0
Caffe Performance: Not Applicable
GTX 1080 - CUDA 8.0 RC with cuDNN 5.1
Caffe Performance: 4628 imgs/s
Best Regards,
Ondrej
"We benchmark all models with a minibatch size of 16 and an image size of 224 x 224; this allows direct comparisons between models, and allows all but the ResNet-200 model to run on the GTX 1080, which has only 8GB of memory."
I'm curious whether there is an arithmetic approximation of the memory usage for a given model.
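A common first-order approximation (a sketch, not necessarily what this repository does) is to sum activation sizes over layers and double the total for the backward pass; parameter, gradient, and workspace buffers would be added on top. The layer shapes below are hypothetical, just to show the arithmetic:

```python
# A sketch of the usual first-order approximation (not this repository's
# method): sum activation sizes over layers, doubled for the backward pass;
# parameters and cuDNN workspace would be added on top.
def activation_bytes(shapes, batch=16, dtype_bytes=4, train=True):
    total = sum(batch * c * h * w * dtype_bytes for (c, h, w) in shapes)
    return total * (2 if train else 1)  # backward keeps the activations

# Hypothetical shapes: the first two conv stages of a VGG-like net at 224x224.
shapes = [(64, 224, 224), (64, 224, 224), (128, 112, 112), (128, 112, 112)]
print(activation_bytes(shapes) / 2**20)  # 1176.0 MiB for these four layers
```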
Thanks for your kind cnn-benchmarks. I have two options for setting up my computer:
Of these options, which one is the better choice in terms of speed and configuration?
Hello, I would like to ask you to emphasize in the README which framework you are using. Initially I missed that the benchmarks were run in Torch, and that created some confusion.
I would suggest adding a section saying that, or making "All benchmarks were run in Torch" bold.
Thanks
@jcjohnson could you please merge my pull request: #23. Or reject it ;)
It helps with the bulk of the work when profiling different setups.
Also, I profiled a bunch of different cloud GPUs here: https://github.com/rejunity/cnn-benchmarks. Do you think it makes sense to integrate those results into your main repository too?