convnet-benchmarks

Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.

Machine: 6-core Intel Core i7-5930K CPU @ 3.50GHz + NVIDIA Titan X + Ubuntu 14.04 x86_64

##Imagenet Winners Benchmarking

I pick some popular ImageNet models and clock the time for a full forward + backward pass, averaging over 10 runs. I ignore dropout and softmax layers.
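The measurement procedure described above could be sketched roughly as follows. This is a minimal illustration, not the harness used to produce these numbers: `time_forward_backward`, `forward`, `backward`, and `sync` are hypothetical names, and the optional `sync` hook stands in for whatever device synchronization a GPU library needs, since GPU kernels are typically launched asynchronously.

```python
import time
from statistics import mean


def time_forward_backward(forward, backward, runs=10, warmup=1, sync=None):
    """Average forward and backward wall-clock times over `runs` iterations.

    `forward` and `backward` are zero-argument callables (hypothetical
    stand-ins for a model's passes). `sync`, if given, is called before
    each clock read so that asynchronous device work has finished.
    Returns (total_ms, forward_ms, backward_ms).
    """
    def now():
        if sync is not None:
            sync()
        return time.perf_counter()

    # Warm-up iterations are executed but not timed.
    for _ in range(warmup):
        forward()
        backward()

    forward_ms, backward_ms = [], []
    for _ in range(runs):
        t0 = now()
        forward()
        t1 = now()
        backward()
        t2 = now()
        forward_ms.append((t1 - t0) * 1000.0)
        backward_ms.append((t2 - t1) * 1000.0)

    fwd, bwd = mean(forward_ms), mean(backward_ms)
    return fwd + bwd, fwd, bwd
```

In practice, `forward` and `backward` would wrap a library's forward and backward calls on a fixed input batch, so that only the passes themselves are timed.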

###AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| NervanaSys-16 | ConvLayer | 97 | 30 | 67 |
| NervanaSys-32 | ConvLayer | 109 | 31 | 78 |
| fbfft | SpatialConvolutionCuFFT | 136 | 45 | 91 |
| cudaconvnet2\* | ConvLayer | 177 | 42 | 135 |
| CuDNN (R2) \* | cudnn.SpatialConvolution | 231 | 70 | 161 |
| Caffe (native) | ConvolutionLayer | 324 | 121 | 203 |
| Torch-7 (native) | SpatialConvolutionMM | 342 | 132 | 210 |

###Overfeat [fast] - Input 128x3x231x231

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| NervanaSys-16 | ConvLayer | 364 | 119 | 245 |
| NervanaSys-32 | ConvLayer | 410 | 126 | 284 |
| fbfft | SpatialConvolutionCuFFT | 407 | 139 | 268 |
| cudaconvnet2\* | ConvLayer | 723 | 176 | 547 |
| CuDNN (R2) \* | cudnn.SpatialConvolution | 810 | 234 | 576 |
| Caffe | ConvolutionLayer | 823 | 355 | 468 |
| Torch-7 (native) | SpatialConvolutionMM | 878 | 379 | 499 |

###OxfordNet [Model-A] - Input 64x3x224x224

| Library | Class | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| NervanaSys-16 | ConvLayer | 530 | 166 | 364 |
| NervanaSys-32 | ConvLayer | 629 | 173 | 456 |
| fbfft | SpatialConvolutionCuFFT | 1092 | 355 | 737 |
| cudaconvnet2\* | ConvLayer | 1229 | 408 | 821 |
| CuDNN (R2) \* | cudnn.SpatialConvolution | 1099 | 342 | 757 |
| Caffe | ConvolutionLayer | 1068 | 323 | 745 |
| Torch-7 (native) | SpatialConvolutionMM | 1105 | 350 | 755 |

##Layer-wise Benchmarking

###Spatial Convolution layer (3D input 3D output, densely connected)

forward + backprop (wrt input and weights)

| Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| fbfft | SpatialConvolutionCuFFT | 256 | 101 | 155 |
| cuda-convnet2 \* | ConvLayer | 977 | 201 | 776 |
| cuda-convnet\*\* | pylearn2.cuda_convnet | 1077 | 312 | 765 |
| CuDNN R2 \* | cudnn.SpatialConvolution | 1019 | 269 | 750 |
| Theano | CorrMM | 1225 | 407 | 818 |
| Caffe | ConvolutionLayer | 1231 | 396 | 835 |
| Torch-7 | SpatialConvolutionMM | 1295 | 418 | 877 |
| DeepCL | ConvolutionLayer | 6280 | 2648 | 3632 |
| cherry-picking\*\*\*\* | best per layer | 234 | 79 | 155 |

The table below has NOT been updated for the Titan X. Its numbers were measured on a Titan Black and are kept only for informational and legacy purposes.

| Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|---|
| Theano (experimental)\*\*\* | conv2d_fft | 1178 | 304 | 874 |
| Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 |
| ccv | ccv_convnet_layer | 809+bw | 809 | |
| Theano (legacy) | conv2d | 70774 | 3833 | 66941 |

- \* indicates that the library was tested with Torch bindings of the specific kernels.
- \*\* indicates that the library was tested with Pylearn2 bindings.
- \*\*\* This is an experimental module that uses FFT to compute convolutions. According to @benanne, it uses a lot of memory.
- \*\*\*\* The last row shows the results obtainable by choosing the best-performing library for each layer.

Layer configurations:

- L1 - Input: 128x128, Batch-size: 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
- L2 - Input: 64x64, Batch-size: 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
- L3 - Input: 32x32, Batch-size: 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
- L4 - Input: 16x16, Batch-size: 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
- L5 - Input: 13x13, Batch-size: 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1

The table is ranked by the total time of the forward + backward calls, summed over all five layers (L1 + L2 + L3 + L4 + L5).
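To get a feel for the relative size of these five layers, the sketch below computes the output size and the forward multiply-accumulate count of each one. It assumes square inputs and kernels and no zero-padding, which the layer list does not state explicitly; `conv_output_size` and `forward_macs` are illustrative helper names, not part of any benchmarked library.

```python
# (name, input size, input maps, output maps, kernel size); stride is 1x1 throughout
LAYERS = [
    ("L1", 128, 3, 96, 11),
    ("L2", 64, 64, 128, 9),
    ("L3", 32, 128, 128, 9),
    ("L4", 16, 128, 128, 7),
    ("L5", 13, 384, 384, 3),
]
BATCH = 128


def conv_output_size(size, kernel, stride=1, pad=0):
    """Output width/height of a convolution (pad=0 is an assumption here)."""
    return (size + 2 * pad - kernel) // stride + 1


def forward_macs(batch, size, in_maps, out_maps, kernel, stride=1):
    """Multiply-accumulates for one forward pass of a dense convolution."""
    out = conv_output_size(size, kernel, stride)
    return batch * out * out * in_maps * out_maps * kernel * kernel


for name, size, in_maps, out_maps, kernel in LAYERS:
    out = conv_output_size(size, kernel)
    macs = forward_macs(BATCH, size, in_maps, out_maps, kernel)
    print(f"{name}: output {out}x{out}, {macs / 1e9:.1f} GMACs forward")
```

Under these assumptions the arithmetic cost differs considerably between layers, which is one reason the per-layer timings in the breakdown are not uniform.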

#####Breakdown

forward

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| fbfft | SpatialConvolutionCuFFT | 57 | 27 | 6 | 2 | 9 | 101 |
| cuda-convnet2 \* | ConvLayer | 36 | 113 | 40 | 4 | 8 | 201 |
| cuda-convnet\*\* | pylearn2.cuda_convnet | 38 | 183 | 68 | 7 | 16 | 312 |
| CuDNN R2 | cudnn.SpatialConvolution | 56 | 143 | 53 | 6 | 11 | 269 |
| Theano | CorrMM | 91 | 143 | 121 | 24 | 28 | 407 |
| Caffe | ConvolutionLayer\<Dtype\> | 93 | 136 | 116 | 24 | 27 | 396 |
| Torch-7 | nn.SpatialConvolutionMM | 94 | 149 | 123 | 24 | 28 | 418 |
| DeepCL | ConvolutionLayer | 738 | 1241 | 518 | 47 | 104 | 2648 |
| cherry-picking\*\*\*\* | best per layer | 36 | 27 | 6 | 2 | 8 | 79 |
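The cherry-picking row can be reproduced from the per-layer numbers above by taking the minimum of each column. The snippet below is a small sketch of that computation; `FORWARD_MS` simply transcribes the forward table, and the function and variable names are illustrative.

```python
# Forward times in ms for layers L1..L5, transcribed from the table above.
FORWARD_MS = {
    "fbfft": [57, 27, 6, 2, 9],
    "cuda-convnet2": [36, 113, 40, 4, 8],
    "cuda-convnet": [38, 183, 68, 7, 16],
    "CuDNN R2": [56, 143, 53, 6, 11],
    "Theano": [91, 143, 121, 24, 28],
    "Caffe": [93, 136, 116, 24, 27],
    "Torch-7": [94, 149, 123, 24, 28],
    "DeepCL": [738, 1241, 518, 47, 104],
}


def best_per_layer(times):
    """Return a (library, time_ms) pair for the fastest library on each layer."""
    num_layers = len(next(iter(times.values())))
    winners = []
    for i in range(num_layers):
        lib = min(times, key=lambda name: times[name][i])
        winners.append((lib, times[lib][i]))
    return winners


winners = best_per_layer(FORWARD_MS)
best_times = [ms for _, ms in winners]
print(winners)
print(best_times, sum(best_times))  # [36, 27, 6, 2, 8] 79
```

The same function applies unchanged to the backward numbers.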

backward (gradInput + gradWeight)

Columns L1, L2, L3, L4, L5, Total are times in milliseconds

| Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
|---|---|---|---|---|---|---|---|
| fbfft | SpatialConvolutionCuFFT | 76 | 45 | 12 | 4 | 18 | 155 |
| cuda-convnet2 \* | ConvLayer | 103 | 467 | 162 | 15 | 29 | 776 |
| cuda-convnet\*\* | pylearn2.cuda_convnet | 136 | 433 | 147 | 15 | 34 | 765 |
| CuDNN R2 | cudnn.SpatialConvolution | 139 | 401 | 159 | 19 | 32 | 750 |
| Theano | CorrMM | 179 | 405 | 174 | 29 | 31 | 818 |
| Caffe | ConvolutionLayer\<Dtype\> | 200 | 405 | 172 | 28 | 30 | 835 |
| Torch-7 | nn.SpatialConvolutionMM | 206 | 432 | 178 | 29 | 32 | 877 |
| DeepCL | ConvolutionLayer | 484 | 2144 | 747 | 59 | 198 | 3632 |
| cherry-picking\*\*\*\* | best per layer | 76 | 45 | 12 | 4 | 18 | 155 |
