Evaluation of Deep Learning Toolkits

Abstract. In this study, I evaluate some popular deep learning toolkits. The candidates are listed in alphabetical order: Caffe, CNTK, TensorFlow, Theano, and Torch. This is a dynamic document and the evaluation, to the best of my knowledge, is based on the current state of their code.

I also provide ratings in some areas because for a lot of people, ratings are useful. However, keep in mind that ratings are inherently subjective [1].

If you find something wrong or inadequate, please help improve this document by filing an issue.

Table of contents

Modeling Capability
Interfaces
Model Deployment
Performance
Architecture
Ecosystem
Cross-platform

Modeling Capability

In this section, we evaluate each toolkit's ability to train common and state-of-the-art networks without writing too much code. Some of these networks are:

ConvNets: AlexNet, OxfordNet (VGG), GoogLeNet
RecurrentNets: plain RNN, LSTM/GRU, bidirectional RNN
Sequential modeling with attention.

In addition, we also evaluate the flexibility to create a new type of model.

Caffe

Caffe, started in late 2013, is perhaps the first mainstream industry-grade deep learning toolkit, a position it earned through its excellent (at the time) convnet implementation. It is still the most popular toolkit within the computer vision community, with many extensions being actively added.

However, its support for recurrent networks and language modeling in general is poor, due to its legacy architecture, whose limitations are detailed in the architecture section.

CNTK

CNTK is a deep learning system started by the speech researchers who kicked off the deep learning craze. It has since grown into a more general, platform-independent deep learning system, but it is still better known in the speech community than in the general deep learning community.

In CNTK (as in TensorFlow and Theano), a network is specified as a symbolic graph of vector operations, such as matrix add/multiply or convolution. A layer is just a composition of those operations. The fine granularity of the building blocks (operations) allows users to invent new complex layer types without implementing them in a low-level language (as in Caffe).
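To make the "layer as a composition of operations" idea concrete, here is a minimal sketch in plain Python. It is a toy symbolic graph written for illustration only: the `Node` class and the helper names (`placeholder`, `matmul`, `add`, `relu`, `dense_layer`, `evaluate`) are assumptions of this example and are not the API of CNTK, TensorFlow, or Theano.

```python
# Illustrative only: a toy symbolic graph, not the API of any real toolkit.

class Node:
    """A symbolic node: an operation name plus the nodes it depends on."""
    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.name = name

def placeholder(name):
    return Node("input", name=name)

def matmul(w, x):
    return Node("matmul", (w, x))

def add(a, b):
    return Node("add", (a, b))

def relu(a):
    return Node("relu", (a,))

def dense_layer(x, w, b):
    """A 'layer' is just a composition of fine-grained operations."""
    return relu(add(matmul(w, x), b))

def evaluate(node, feed):
    """Recursively evaluate a node given values for the placeholders."""
    if node.op == "input":
        return feed[node.name]
    args = [evaluate(i, feed) for i in node.inputs]
    if node.op == "matmul":  # matrix (list of rows) times vector
        w, x = args
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    if node.op == "add":
        a, b = args
        return [ai + bi for ai, bi in zip(a, b)]
    if node.op == "relu":
        return [max(0.0, ai) for ai in args[0]]
    raise ValueError("unknown op: " + node.op)

x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
y = dense_layer(x, w, b)  # builds the graph; nothing is computed yet
result = evaluate(
    y, {"x": [1.0, 2.0], "w": [[1.0, -1.0], [0.5, 0.5]], "b": [0.0, 1.0]}
)
```

A new layer type in this style is one more Python function that combines existing operations; no new low-level kernel has to be written.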

More to be written ...

TensorFlow

State-of-the-art models

RNN API and implementation are suboptimal. The team also commented about it here and here.
Bidirectional RNN not available yet
No 3D convolution, which is useful for video recognition

New models. Since TF uses the symbolic-graph-of-vector-operations approach, specifying a new network is fairly easy.

The public release of TF doesn’t yet support loop and condition controls in the graph definition. This makes RNN implementations less than ideal: they have to use Python loops, so no graph compiler optimization can be made.

Google's white paper claims this capability, but the details are still being worked out.
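The consequence of using a Python loop can be sketched with a toy graph. The code below is not TensorFlow code; `Node`, `rnn_step`, `unroll`, and `count_nodes` are hypothetical names used only to show that a loop executed at graph-construction time adds fresh nodes for every time step, so the graph grows with the sequence length.

```python
# Illustrative only: toy graph nodes, not TensorFlow's API.

class Node:
    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)

def rnn_step(h_prev, x_t):
    """One recurrent step as graph nodes: h_t = tanh(h_prev + x_t)."""
    return Node("tanh", (Node("add", (h_prev, x_t)),))

def unroll(num_steps):
    """Build the recurrence with a Python loop. The loop runs while the
    graph is being constructed, so each time step adds its own nodes."""
    h = Node("h0")
    for _ in range(num_steps):
        h = rnn_step(h, Node("x"))
    return h

def count_nodes(node, seen=None):
    """Count the distinct nodes reachable from `node`."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(i, seen) for i in node.inputs)

# Graph size for sequences of length 1, 10, and 100.
sizes = {t: count_nodes(unroll(t)) for t in (1, 10, 100)}
```

In this toy model each step contributes three nodes (an input, an add, and a tanh), so the graph size is linear in the number of steps, and the graph itself contains no loop construct that a compiler could treat as a single unit.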

Theano

State-of-the-art models. Theano has implementations of most state-of-the-art networks, either in the form of a higher-level framework (e.g. Blocks, Keras, etc.) or in pure Theano. In fact, many recent research ideas (e.g. attentional models) started here.

New models. Theano pioneered the trend of using a symbolic graph for programming a network. Theano's symbolic API supports a looping construct, called scan, which makes implementing RNNs easy and efficient. Users don't always have to define a new model at the tensor operations level. There are a few higher-level frameworks, mentioned above, which make model definition and training simpler.
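The idea behind a scan-style construct can be sketched in plain Python. The `scan` helper below is a hypothetical stand-in written for this example; it is not `theano.scan` and does not share its signature. It only shows the division of labor: the user supplies a per-step function, and the looping itself belongs to the construct.

```python
import math

def scan(step, sequence, initial):
    """Apply `step` along `sequence`, threading the state through and
    collecting every intermediate state.
    Hypothetical helper for illustration, not theano.scan."""
    state, outputs = initial, []
    for x_t in sequence:
        state = step(state, x_t)
        outputs.append(state)
    return outputs

def rnn_step(h_prev, x_t, w_h=0.5, w_x=1.0):
    """Scalar recurrent step: h_t = tanh(w_h * h_prev + w_x * x_t)."""
    return math.tanh(w_h * h_prev + w_x * x_t)

# Run the recurrence over a three-element sequence.
hidden_states = scan(rnn_step, [0.0, 1.0, -1.0], initial=0.0)
```

Because the recurrence is expressed as one construct rather than as a hand-written loop, a symbolic framework is in a position to represent it as a single graph element and optimize it as a whole.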

Torch

State-of-the-art models

Excellent for conv nets. It's worth noting that temporal convolution can be done in TensorFlow/Theano via conv2d but that's a trick. The native interface for temporal convolution in Torch makes it slightly more intuitive to use.
Rich set of RNNs available through a non-official extension [2]
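The conv2d trick mentioned above amounts to treating a sequence as an image of height 1. The sketch below shows this on nested Python lists; the function names are assumptions of this example, it handles a single channel with no padding and stride 1, and it is not the conv2d of any particular toolkit.

```python
def conv2d(image, kernel):
    """'Valid' 2D cross-correlation on nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def conv1d_via_conv2d(sequence, kernel):
    """Temporal convolution expressed as conv2d on a height-1 'image'."""
    return conv2d([sequence], [kernel])[0]

def conv1d(sequence, kernel):
    """Direct temporal convolution, for comparison."""
    k = len(kernel)
    return [sum(sequence[t + d] * kernel[d] for d in range(k))
            for t in range(len(sequence) - k + 1)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
taps = [1.0, 0.0, -1.0]
via_2d = conv1d_via_conv2d(signal, taps)
direct = conv1d(signal, taps)
```

Both paths give the same result; the difference is that the 2D route forces the user to add and remove a dummy dimension, which is the extra step a native temporal convolution interface avoids.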

New models. In Torch, there are multiple ways (stack of layers or graph of layers) to define a network, but essentially a network is defined as a graph of layers. Because of this coarser granularity, Torch is sometimes considered less flexible: for new layer types, users have to implement the full forward, backward, and gradient input update.

However, defining a new layer in Torch is much easier than in Caffe because you don't have to program in C++. Plus, in Torch, the difference between new layer definition and network definition is minimal. In Caffe, layers are defined in C++ while networks are defined via Protobuf.
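As a rough picture of what implementing a full layer involves, here is a Python analogy of a layer written at that granularity. The method names mirror the `updateOutput` / `updateGradInput` / `accGradParameters` convention of Torch's nn modules, but this is plain Python for illustration, not Torch code; the `Scale` layer and its attributes are made up for this example.

```python
class Scale:
    """Toy layer y = w * x with one learnable scalar w.
    A Python analogy of a layer-level interface, not Torch code."""

    def __init__(self, w):
        self.w = w
        self.grad_w = 0.0

    def updateOutput(self, x):
        # Forward pass.
        return [self.w * xi for xi in x]

    def updateGradInput(self, x, grad_output):
        # Backward pass: gradient of the loss with respect to the input.
        return [self.w * g for g in grad_output]

    def accGradParameters(self, x, grad_output):
        # Accumulate the gradient of the loss with respect to w.
        self.grad_w += sum(xi * g for xi, g in zip(x, grad_output))

layer = Scale(w=2.0)
x = [1.0, -3.0]
y = layer.updateOutput(x)
grad_in = layer.updateGradInput(x, [1.0, 1.0])
layer.accGradParameters(x, [1.0, 1.0])
```

All three methods have to be written by hand and kept mutually consistent, which is the cost of the coarser granularity described above.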

Hence, in my opinion, Torch is as flexible as Theano.

Left: graph model of CNTK/Theano/TensorFlow; Right: graph model of Caffe/Torch