rnn's Introduction

rnn: recurrent neural networks

Note: this repository is deprecated in favor of

This is a Recurrent Neural Network library that extends Torch's nn. You can use it to build RNNs, LSTMs, GRUs, BRNNs, BLSTMs, and so on. This library includes documentation for the following objects:

Modules that consider successive calls to forward as different time-steps in a sequence :

Modules that forward entire sequences through a decorated AbstractRecurrent instance :

  • AbstractSequencer : an abstract class inherited by Sequencer, Repeater, RecurrentAttention, etc.;
  • Sequencer : applies an encapsulated module to all elements in an input sequence (Tensor or Table);
  • SeqLSTM : a very fast version of nn.Sequencer(nn.FastLSTM) where the input and output are tensors;
  • SeqLSTMP : SeqLSTM with a projection layer;
  • SeqGRU : a very fast version of nn.Sequencer(nn.GRU) where the input and output are tensors;
  • SeqBRNN : Bidirectional RNN based on SeqLSTM;
  • BiSequencer : used for implementing Bidirectional RNNs and LSTMs;
  • BiSequencerLM : used for implementing Bidirectional RNNs and LSTMs for language models;
  • Repeater : repeatedly applies the same input to an AbstractRecurrent instance;
  • RecurrentAttention : a generalized attention model for REINFORCE modules;

Miscellaneous modules and criterions :

  • MaskZero : zeroes the output and gradOutput rows of the decorated module for commensurate input rows which are tensors of zeros;
  • TrimZero : same behavior as MaskZero, but more efficient when the input contains many zero-masked rows;
  • LookupTableMaskZero : extends nn.LookupTable to support zero indexes for padding. Zero indexes are forwarded as tensors of zeros;
  • MaskZeroCriterion : zeroes the gradInput and err rows of the decorated criterion for commensurate input rows which are tensors of zeros;
  • SeqReverseSequence : reverses an input sequence on a specific dimension;

Criterions used for handling sequential inputs and targets :

  • SequencerCriterion : sequentially applies the same criterion to a sequence of inputs and targets (Tensor or Table).
  • RepeaterCriterion : repeatedly applies the same criterion with the same target on a sequence.

To install this repository:

git clone git@github.com:Element-Research/rnn.git
cd rnn
luarocks make rocks/rnn-scm-1.rockspec

Note that luarocks install rnn now installs instead.


The following are example training scripts using this package :

External Resources


If you use rnn in your work, we'd really appreciate it if you could cite the following paper:

Léonard, Nicholas, Sagar Waghmare, Yang Wang, and Jin-Hwa Kim. rnn: Recurrent Library for Torch. arXiv preprint arXiv:1511.07889 (2015).

Any significant contributor to the library will also get added as an author to the paper. A significant contributor is anyone who added at least 300 lines of code to the library.


Most issues can be resolved by updating the various dependencies:

luarocks install torch
luarocks install nn
luarocks install dpnn
luarocks install torchx

If you are using CUDA :

luarocks install cutorch
luarocks install cunn
luarocks install cunnx

And don't forget to update this package :

git clone git@github.com:Element-Research/rnn.git
cd rnn
luarocks make rocks/rnn-scm-1.rockspec

If that doesn't fix it, open an issue on GitHub.


An abstract class inherited by Recurrent, LSTM and GRU. The constructor takes a single argument :

rnn = nn.AbstractRecurrent([rho])

Argument rho is the maximum number of steps to backpropagate through time (BPTT). Sub-classes can set this to a large number like 99999 (the default) if they want to backpropagate through the entire sequence whatever its length. Setting a lower value of rho is useful when long sequences are forward propagated, but we only wish to backpropagate through the last rho steps, which means that the remainder of the sequence doesn't need to be stored (so no additional cost).

[recurrentModule] getStepModule(step)

Returns a module for time-step step. This is used internally by sub-classes to obtain copies of the internal recurrentModule. These copies share parameters and gradParameters but each have their own output, gradInput and any other intermediate states.


This is a method reserved for internal use by Recursor when doing backward propagation. It sets the object's output attribute to point to the output at time-step step. This method was introduced to solve a very annoying bug.


Decorates the internal recurrentModule with MaskZero. The output Tensor (or table thereof) of the recurrentModule will have each row (i.e. samples) zeroed when the commensurate row of the input is a tensor of zeros.

The nInputDim argument must specify the number of non-batch dims in the first Tensor of the input. In the case of an input table, the first Tensor is the first one encountered when doing a depth-first search.

Calling this method makes it possible to pad sequences with different lengths in the same batch with zero vectors.

When a sample time-step is masked (i.e. input is a row of zeros), then the hidden state is effectively reset (i.e. forgotten) for the next non-mask time-step. In other words, it is possible to separate unrelated sequences with a masked element.
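The row-wise masking described above can be illustrated without Torch. The following is a minimal plain-Lua sketch, not the library's implementation: isZeroRow and maskZeroRows are hypothetical helpers, and rows are plain Lua tables of numbers rather than tensors.

```lua
-- Hypothetical helpers (not part of rnn): zero every output row whose
-- matching input row contains only zeros, as MaskZero is described to do.
local function isZeroRow(row)
   for _, v in ipairs(row) do
      if v ~= 0 then return false end
   end
   return true
end

local function maskZeroRows(input, output)
   local masked = {}
   for i, outRow in ipairs(output) do
      local zero = isZeroRow(input[i])
      local newRow = {}
      for j, v in ipairs(outRow) do
         newRow[j] = zero and 0 or v
      end
      masked[i] = newRow
   end
   return masked
end

-- batch of 3 samples; the second one is zero padding
local input  = { {0.5, -1.0}, {0, 0}, {2.0, 0} }
local output = { {0.1, 0.2, 0.3}, {0.4, 0.5, 0.6}, {0.7, 0.8, 0.9} }
local masked = maskZeroRows(input, output)
```

Only the second row of masked is zeroed. The third sample is kept because its input row is not entirely zero.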


Decorates the internal recurrentModule with TrimZero.

[output] updateOutput(input)

Forward propagates the input for the current step. The outputs or intermediate states of the previous steps are used recurrently. This is transparent to the caller as the previous outputs and intermediate states are memorized. This method also increments the step attribute by 1.

updateGradInput(input, gradOutput)

Like backward, this method should be called in the reverse order of forward calls used to propagate a sequence. So for example :

-- inputs, gradOutputs and nStep are assumed to be defined by the caller
rnn = nn.LSTM(10, 10) -- AbstractRecurrent instance
local outputs, gradInputs = {}, {}
for i=1,nStep do -- forward propagate sequence
   outputs[i] = rnn:forward(inputs[i])
end

for i=nStep,1,-1 do -- backward propagate sequence in reverse order
   gradInputs[i] = rnn:backward(inputs[i], gradOutputs[i])
end


The reverse order implements backpropagation through time (BPTT).

accGradParameters(input, gradOutput, scale)

Like updateGradInput, but for accumulating gradients w.r.t. parameters.


This method goes hand in hand with forget. It is useful when the current time-step is greater than rho, at which point it starts recycling the oldest recurrentModule sharedClones, such that they can be reused for storing the next step. This offset is used for modules like nn.Recurrent that use a different module for the first step. Default offset is 0.


This method brings back all states to the start of the sequence buffers, i.e. it forgets the current sequence. It also resets the step attribute to 1. It is highly recommended to call forget after each parameter update. Otherwise, the previous state will be used to activate the next, which will often lead to instability. This is caused by the previous state being the result of now changed parameters. It is also good practice to call forget at the start of each new sequence.


This method sets the maximum number of time-steps for which to perform backpropagation through time (BPTT). So say you set this to rho = 3 time-steps, feed-forward for 4 steps, and then backpropagate, only the last 3 steps will be used for the backpropagation. If your AbstractRecurrent instance is wrapped by a Sequencer, this will be handled automatically by the Sequencer. Otherwise, setting this to a large value (e.g. 9999999) is good for most, if not all, cases.
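As a rough illustration of the truncation, the plain-Lua sketch below computes which time-steps take part in backpropagation for a given rho. The helper bpttRange is hypothetical and only mirrors the arithmetic of the example above; it is not a function of the library.

```lua
-- Hypothetical helper: returns the first and last time-step that take part
-- in backpropagation when nStep steps were forwarded with a limit of rho.
local function bpttRange(nStep, rho)
   local first = math.max(1, nStep - rho + 1)
   return first, nStep
end

-- the example from the text: rho = 3, 4 forward steps
local first, last = bpttRange(4, 3)

-- with a very large rho, the whole sequence is used
local firstAll, lastAll = bpttRange(4, 9999999)
```

With rho = 3 and 4 forward steps, only steps 2 to 4 are used, so the state of step 1 does not need to be kept.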


This method was deprecated Jan 6, 2016. Since then, by default, AbstractRecurrent instances use the backwardOnline behaviour. See updateGradInput for details.


In training mode, the network remembers all previous rho (number of time-steps) states. This is necessary for BPTT.


During evaluation, since there is no need to perform BPTT at a later time, only the previous step is remembered. This is very efficient memory-wise, such that evaluation can be performed using potentially infinite-length sequences.


References :

A composite Module for implementing Recurrent Neural Networks (RNN), excluding the output layer.

The nn.Recurrent(start, input, feedback, [transfer, rho, merge]) constructor takes 6 arguments:

  • start : the size of the output (excluding the batch dimension), or a Module that will be inserted between the input Module and transfer module during the first step of the propagation. When start is a size (a number or torch.LongTensor), then this start Module will be initialized as nn.Add(start) (see Ref. A).
  • input : a Module that processes input Tensors (or Tables). Output must be of same size as start (or its output in the case of a start Module), and same size as the output of the feedback Module.
  • feedback : a Module that feeds back the previous output Tensor (or Tables) up to the merge module.
  • merge : a table Module that merges the outputs of the input and feedback Module before being forwarded through the transfer Module.
  • transfer : a non-linear Module used to process the output of the merge module, or in the case of the first step, the output of the start Module.
  • rho : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Due to the vanishing gradients effect, references A and B recommend rho = 5 (or lower). Defaults to 99999.

An RNN is used to process a sequence of inputs. Each step in the sequence should be propagated by its own forward (and backward), one input (and gradOutput) at a time. Each call to forward keeps a log of the intermediate states (the input and many Module.outputs) and increments the step attribute by 1. Method backward must be called in reverse order of the sequence of calls to forward in order to backpropagate through time (BPTT). This reverse order is necessary to return a gradInput for each call to forward.

The step attribute is only reset to 1 when a call to the forget method is made, in which case the Module is ready to process the next sequence (or batch thereof). Note that the longer the sequence, the more memory will be required to store all the output and gradInput states (one for each time step).

To use this module with batches, we suggest using different sequences of the same size within a batch and calling updateParameters every rho steps and forget at the end of the sequence.

Note that calling the evaluate method turns off long-term memory; the RNN will only remember the previous output. This allows the RNN to handle long sequences without allocating any additional memory.

For a simple concise example of how to make use of this module, please consult the simple-recurrent-network.lua training script.

Decorate it with a Sequencer

Note that any AbstractRecurrent instance can be decorated with a Sequencer such that an entire sequence (a table) can be presented with a single forward/backward call. This is actually the recommended approach as it allows RNNs to be stacked and makes the rnn conform to the Module interface. Each call to forward can then be followed by its own immediate call to backward, as each input to the model is an entire sequence, i.e. a table of tensors where each tensor represents a time-step.

seq = nn.Sequencer(module)

The simple-sequencer-network.lua training script is equivalent to the above mentioned simple-recurrent-network.lua script, except that it decorates the rnn with a Sequencer which takes a table of inputs and gradOutputs (the sequence for that batch). This lets the Sequencer handle the looping over the sequence.

You should only think about using the AbstractRecurrent modules without a Sequencer if you intend to use it for real-time prediction. Actually, you can even use an AbstractRecurrent instance decorated by a Sequencer for real time prediction by calling Sequencer:remember() and presenting each time-step input as {input}.

Other decorators can be used such as the Repeater or RecurrentAttention. The Sequencer is only the most common one.


References :

This is an implementation of a vanilla Long Short-Term Memory module. We used Ref. A's LSTM as a blueprint for this module as it was the most concise. Yet it is also the vanilla LSTM described in Ref. C.

The nn.LSTM(inputSize, outputSize, [rho]) constructor takes 3 arguments:

  • inputSize : a number specifying the size of the input;
  • outputSize : a number specifying the size of the output;
  • rho : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999.


The actual implementation corresponds to the following algorithm:

i[t] = σ(W[x->i]x[t] + W[h->i]h[t-1] + W[c->i]c[t-1] + b[1->i])      (1)
f[t] = σ(W[x->f]x[t] + W[h->f]h[t-1] + W[c->f]c[t-1] + b[1->f])      (2)
z[t] = tanh(W[x->c]x[t] + W[h->c]h[t-1] + b[1->c])                   (3)
c[t] = f[t]c[t-1] + i[t]z[t]                                         (4)
o[t] = σ(W[x->o]x[t] + W[h->o]h[t-1] + W[c->o]c[t] + b[1->o])        (5)
h[t] = o[t]tanh(c[t])                                                (6)

where W[s->q] is the weight matrix from s to q, t indexes the time-step, b[1->q] are the biases leading into q, σ() is Sigmoid, x[t] is the input, i[t] is the input gate (eq. 1), f[t] is the forget gate (eq. 2), z[t] is the input to the cell (which we call the hidden) (eq. 3), c[t] is the cell (eq. 4), o[t] is the output gate (eq. 5), and h[t] is the output of this module (eq. 6). Also note that the weight matrices from cell to gate vectors are diagonal W[c->s], where s is i,f, or o.
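To make the data flow of equations (1) to (6) concrete, here is a minimal plain-Lua sketch of a single time-step for one hidden unit. It is not the library's implementation: lstmStep and the weight field names (w.xi for W[x->i], w.ci for the diagonal peephole weight W[c->i], w.bi for b[1->i], and so on) are hypothetical, and everything is a scalar so that no Torch tensors are needed.

```lua
local function sigmoid(x) return 1 / (1 + math.exp(-x)) end
local function tanh(x)
   local e = math.exp(2 * x)
   return (e - 1) / (e + 1)
end

-- One LSTM time-step on scalars. w holds hypothetical weights named after
-- the equations: w.xi = W[x->i], w.hi = W[h->i], w.ci = W[c->i], w.bi = b[1->i], ...
local function lstmStep(w, x, hPrev, cPrev)
   local i = sigmoid(w.xi * x + w.hi * hPrev + w.ci * cPrev + w.bi)  -- (1) input gate
   local f = sigmoid(w.xf * x + w.hf * hPrev + w.cf * cPrev + w.bf)  -- (2) forget gate
   local z = tanh(w.xc * x + w.hc * hPrev + w.bc)                    -- (3) input to the cell
   local c = f * cPrev + i * z                                       -- (4) cell
   local o = sigmoid(w.xo * x + w.ho * hPrev + w.co * c + w.bo)      -- (5) output gate
   local h = o * tanh(c)                                             -- (6) output
   return h, c
end

-- every missing weight or bias reads as 0, so each gate is 0.5 and z is 0
local w = setmetatable({}, { __index = function() return 0 end })
local h, c = lstmStep(w, 1.0, 0.0, 2.0)
```

Note that the output gate (5) peeks at the new cell c[t], while the input and forget gates use c[t-1].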

As you can see, unlike Recurrent, this implementation isn't generic enough that it can take arbitrary component Module definitions at construction. However, the LSTM module can easily be adapted through inheritance by overriding the different factory methods :

  • buildGate : builds generic gate that is used to implement the input, forget and output gates;
  • buildInputGate : builds the input gate (eq. 1). Currently calls buildGate;
  • buildForgetGate : builds the forget gate (eq. 2). Currently calls buildGate;
  • buildHidden : builds the hidden (eq. 3);
  • buildCell : builds the cell (eq. 4);
  • buildOutputGate : builds the output gate (eq. 5). Currently calls buildGate;
  • buildModel : builds the actual LSTM model which is used internally (eq. 6).

Note that we recommend decorating the LSTM with a Sequencer (refer to this for details).


A faster version of the LSTM. Basically, the input, forget and output gates, as well as the hidden state, are computed in one fell swoop.

Note that FastLSTM does not use peephole connections between cell and gates. The algorithm from LSTM changes as follows:

i[t] = σ(W[x->i]x[t] + W[h->i]h[t-1] + b[1->i])                      (1)
f[t] = σ(W[x->f]x[t] + W[h->f]h[t-1] + b[1->f])                      (2)
z[t] = tanh(W[x->c]x[t] + W[h->c]h[t-1] + b[1->c])                   (3)
c[t] = f[t]c[t-1] + i[t]z[t]                                         (4)
o[t] = σ(W[x->o]x[t] + W[h->o]h[t-1] + b[1->o])                      (5)
h[t] = o[t]tanh(c[t])                                                (6)

i.e. omitting the summands W[c->i]c[t−1] (eq. 1), W[c->f]c[t−1] (eq. 2), and W[c->o]c[t] (eq. 5).


This is a static attribute of the FastLSTM class. The default value is false. Setting usenngraph = true will force all newly instantiated instances of FastLSTM to use nngraph's nn.gModule to build the internal recurrentModule which is cloned for each time-step.

Recurrent Batch Normalization

This extends the FastLSTM class to enable faster convergence during training by zero-centering the input-to-hidden and hidden-to-hidden transformations. It reduces the internal covariate shift between time steps. It is an implementation of Cooijmans et al.'s Recurrent Batch Normalization. The hidden-to-hidden transition of each LSTM cell is normalized according to

i[t] = σ(BN(W[x->i]x[t]) + BN(W[h->i]h[t-1]) + b[1->i])                      (1)
f[t] = σ(BN(W[x->f]x[t]) + BN(W[h->f]h[t-1]) + b[1->f])                      (2)
z[t] = tanh(BN(W[x->c]x[t]) + BN(W[h->c]h[t-1]) + b[1->c])                   (3)
c[t] = f[t]c[t-1] + i[t]z[t]                                                 (4)
o[t] = σ(BN(W[x->o]x[t]) + BN(W[h->o]h[t-1]) + b[1->o])                      (5)
h[t] = o[t]tanh(c[t])                                                        (6)

where the batch normalizing transform is:

  BN(hd; gamma, beta) = beta + gamma * (hd - E(hd)) / sqrt(E(σ(hd)) + eps)

where hd is a vector of (pre)activations to be normalized, and gamma and beta are model parameters that determine the standard deviation and mean of the normalized activation, respectively. eps is a regularization hyperparameter that keeps the division numerically stable, and E(hd) and E(σ(hd)) are the estimates of the mean and variance in the mini-batch, respectively. The authors recommend initializing gamma to a small value and found 0.1 to be the value that did not cause vanishing gradients. beta, the shift parameter, is null by default.
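The batch normalizing transform above can be sketched in plain Lua for a mini-batch of scalar pre-activations. This is only an illustration under stated assumptions: batchNorm is a hypothetical helper, the variance is the biased mini-batch estimate, and no running statistics are tracked.

```lua
-- Hypothetical helper: normalize a mini-batch of scalar pre-activations hd
-- with scale gamma, shift beta and stabilizer eps.
local function batchNorm(hd, gamma, beta, eps)
   local n, mean, var = #hd, 0, 0
   for _, v in ipairs(hd) do mean = mean + v end
   mean = mean / n
   for _, v in ipairs(hd) do var = var + (v - mean) ^ 2 end
   var = var / n   -- biased mini-batch variance estimate (an assumption)
   local out = {}
   for k, v in ipairs(hd) do
      out[k] = beta + gamma * (v - mean) / math.sqrt(var + eps)
   end
   return out
end

-- gamma = 0.1 as recommended in the text, beta = 0
local out = batchNorm({1, 2, 3, 4}, 0.1, 0, 1e-5)
```

With beta = 0 the normalized values are centered on zero, and gamma = 0.1 scales their standard deviation to roughly 0.1.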

To turn on batch normalization during training, do: = true
lstm = nn.FastLSTM(inputsize, outputsize, [rho, eps, momentum, affine])

where momentum is the same as gamma in the equation above (defaults to 0.1), eps is defined above, and affine is a boolean that determines whether the learnable affine transform is turned off (false) or on (true, the default).


References :

This is an implementation of Gated Recurrent Units module.

The nn.GRU(inputSize, outputSize [,rho [,p [, mono]]]) constructor takes 3 arguments, like nn.LSTM, or 4 arguments when dropout is used:

  • inputSize : a number specifying the size of the input;
  • outputSize : a number specifying the size of the output;
  • rho : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999;
  • p : dropout probability for inner connections of GRUs.
  • mono : Monotonic sample for dropouts inside GRUs. Only needed in a TrimZero + BGRU(p>0) situation.


The actual implementation corresponds to the following algorithm:

z[t] = σ(W[x->z]x[t] + W[s->z]s[t-1] + b[1->z])            (1)
r[t] = σ(W[x->r]x[t] + W[s->r]s[t-1] + b[1->r])            (2)
h[t] = tanh(W[x->h]x[t] + W[hr->c](s[t-1]r[t]) + b[1->h])  (3)
s[t] = (1-z[t])h[t] + z[t]s[t-1]                           (4)

where W[s->q] is the weight matrix from s to q, t indexes the time-step, b[1->q] are the biases leading into q, σ() is Sigmoid, x[t] is the input and s[t] is the output of the module (eq. 4). Note that unlike the LSTM, the GRU has no cells.
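The four GRU equations can be traced with a small plain-Lua sketch for a single hidden unit. This is an illustration, not the library's code: gruStep and the weight field names (w.xz for W[x->z], w.sz for W[s->z], w.hr for the weight applied to s[t-1]r[t], and so on) are hypothetical.

```lua
local function sigmoid(x) return 1 / (1 + math.exp(-x)) end
local function tanh(x)
   local e = math.exp(2 * x)
   return (e - 1) / (e + 1)
end

-- One GRU time-step on scalars, following equations (1) to (4).
local function gruStep(w, x, sPrev)
   local z = sigmoid(w.xz * x + w.sz * sPrev + w.bz)     -- (1)
   local r = sigmoid(w.xr * x + w.sr * sPrev + w.br)     -- (2)
   local h = tanh(w.xh * x + w.hr * (sPrev * r) + w.bh)  -- (3)
   local s = (1 - z) * h + z * sPrev                     -- (4)
   return s
end

-- every missing weight or bias reads as 0: z = 0.5 and h = 0
local w = setmetatable({}, { __index = function() return 0 end })
local s = gruStep(w, 1.0, 4.0)
```

With all weights at zero, z[t] is 0.5 and h[t] is 0, so the output is half of the previous state. A large positive bias on z[t] would instead copy s[t-1] through unchanged.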

The GRU was benchmarked on the PennTreeBank dataset using the recurrent-language-model.lua script. It slightly outperformed FastLSTM. However, since LSTMs have more parameters than GRUs, a dataset larger than PennTreeBank might change the result. Don't be too hasty in judging which one is the better of the two (see Ref. C and D).

                Memory   examples/s
    FastLSTM      176M        16.5K 
    GRU            92M        15.8K

Memory is measured by the size of the dp.Experiment save file. examples/s is measured by the training speed at 1 epoch, so it may have a disk IO bias.


RNN dropout (see Ref. E and F) was also benchmarked on the PennTreeBank dataset using the recurrent-language-model.lua script. The details can be found in the script. In the benchmark, GRU uses dropout after the LookupTable, while BGRU, which stands for Bayesian GRU, uses dropout on inner connections (named as in Ref. F), but not after the LookupTable.

As Yarin Gal (Ref. F) mentioned, p = 0.25 is a recommended value for a first attempt.



To implement the GRU, a simple module was added that cannot be built using only nn modules.

module = nn.SAdd(addend, negate)

Applies a single scalar addition to the incoming data, i.e. y_i = x_i + b, then negates all components if negate is true. This is used to implement s[t] = (1-z[t])h[t] + z[t]s[t-1] of the GRU (see Equation (4) above).

nn.SAdd(-1, true)

Here, if the incoming data is z[t], then the output becomes -(z[t]-1)=1-z[t]. Notice that nn.Mul() multiplies a scalar which is a learnable parameter.
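The arithmetic of nn.SAdd(-1, true) can be checked with a few lines of plain Lua. The function sadd below is a hypothetical stand-in that works on a Lua table of numbers; it is not the module itself.

```lua
-- Hypothetical stand-in for the arithmetic described for nn.SAdd(addend, negate):
-- add a scalar to every component, then optionally negate the result.
local function sadd(x, addend, negate)
   local y = {}
   for i, v in ipairs(x) do
      y[i] = v + addend
      if negate then y[i] = -y[i] end
   end
   return y
end

local z = {0.25, 0.5, 0.75}
local oneMinusZ = sadd(z, -1, true)   -- -(z - 1) = 1 - z, i.e. {0.75, 0.5, 0.25}
```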


References :

This is an implementation of the Multi-Function Recurrent Unit module.

The nn.MuFuRu(inputSize, outputSize [,ops [,rho]]) constructor takes 2 required arguments, plus optional arguments:

  • inputSize : a number specifying the dimension of the input;
  • outputSize : a number specifying the dimension of the output;
  • ops: a table of strings, representing which composition operations should be used. The table can be any subset of {'keep', 'replace', 'mul', 'diff', 'forget', 'sqrt_diff', 'max', 'min'}. By default, all composition operations are enabled.
  • rho : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999;

The Multi-Function Recurrent Unit generalizes the GRU by allowing weightings of arbitrary composition operators to be learned. As in the GRU, the reset gate is computed based on the current input and previous hidden state, and used to compute a new feature vector:

r[t] = σ(W[x->r]x[t] + W[s->r]s[t-1] + b[1->r])            (1)
v[t] = tanh(W[x->v]x[t] + W[sr->v](s[t-1]r[t]) + b[1->v])  (2)

where W[a->b] denotes the weight matrix from activation a to b, t denotes the time step, b[1->a] is the bias for activation a, and s[t-1]r[t] is the element-wise multiplication of the two vectors.

Unlike in the GRU, rather than computing a single update gate (z[t] in GRU), MuFuRu computes a weighting over an arbitrary number of composition operators.

A composition operator is any differentiable operator which takes two vectors of the same size, the previous hidden state, and a new feature vector, and returns a new vector representing the new hidden state. The GRU implicitly defines two such operations, keep and replace, defined as keep(s[t-1], v[t]) = s[t-1] and replace(s[t-1], v[t]) = v[t].

Ref. A proposes 6 additional operators, which all operate element-wise:

  • mul(x,y) = x * y
  • diff(x,y) = x - y
  • forget(x,y) = 0
  • sqrt_diff(x,y) = 0.25 * sqrt(|x - y|)
  • max(x,y)
  • min(x,y)

The weightings of each operation are computed via a softmax from the current input and previous hidden state, similar to the update gate in the GRU. The produced hidden state is then the element-wise weighted sum of the output of each operation.

p^[t][j] = W[x->pj]x[t] + W[s->pj]s[t-1] + b[1->pj]           (3)
(p[t][1], ..., p[t][J]) = softmax(p^[t][1], ..., p^[t][J])     (4)
s[t] = sum(p[t][j] * op[j](s[t-1], v[t]))                      (5)

where p[t][j] is the weighting for operation j at time step t, and the sum in equation 5 is over all J operators.
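The weighted combination of composition operators can be sketched in plain Lua for a single hidden unit. This is an illustration only: ops, softmax and compose are hypothetical names, the pre-softmax scores are passed in directly instead of being computed from weight matrices, and all values are scalars.

```lua
-- The composition operators listed above, applied to one element.
local ops = {
   keep      = function(s, v) return s end,
   replace   = function(s, v) return v end,
   mul       = function(s, v) return s * v end,
   diff      = function(s, v) return s - v end,
   forget    = function(s, v) return 0 end,
   sqrt_diff = function(s, v) return 0.25 * math.sqrt(math.abs(s - v)) end,
   max       = function(s, v) return math.max(s, v) end,
   min       = function(s, v) return math.min(s, v) end,
}

-- Numerically stable softmax over a list of scores.
local function softmax(scores)
   local maxScore = -math.huge
   for _, p in ipairs(scores) do maxScore = math.max(maxScore, p) end
   local exps, total = {}, 0
   for j, p in ipairs(scores) do
      exps[j] = math.exp(p - maxScore)
      total = total + exps[j]
   end
   for j = 1, #exps do exps[j] = exps[j] / total end
   return exps
end

-- names: ordered list of operator names; scores: the matching pre-softmax
-- scores; sPrev: previous hidden state; v: new feature value.
local function compose(names, scores, sPrev, v)
   local p, s = softmax(scores), 0
   for j, name in ipairs(names) do
      s = s + p[j] * ops[name](sPrev, v)
   end
   return s
end

-- equal scores for keep and replace: the result is the average of s[t-1] and v[t]
local s = compose({'keep', 'replace'}, {0, 0}, 1.0, 3.0)
```

Restricting the operators to keep and replace recovers a GRU-style interpolation between the previous state and the new feature vector.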


This module decorates a module to be used within an AbstractSequencer instance. It does this by making the decorated module conform to the AbstractRecurrent interface, which this class inherits, like the LSTM and Recurrent classes.

rec = nn.Recursor(module[, rho])

For each successive call to updateOutput (i.e. forward), this decorator will create a stepClone() of the decorated module. So for each time-step, it clones the module. Both the clone and original share parameters and gradients w.r.t. parameters. However, for modules that already conform to the AbstractRecurrent interface, the clone and original module are one and the same (i.e. no clone).

Examples :

Let's assume I want to stack two LSTMs. I could use two sequencers :

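lstm = nn.Sequential() -- layer sizes are illustrative
   :add(nn.Sequencer(nn.LSTM(100, 100))):add(nn.Sequencer(nn.LSTM(100, 100)))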

Using a Recursor, I make the same model with a single Sequencer :

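lstm = nn.Sequencer( -- layer sizes are illustrative
   nn.Recursor(nn.Sequential():add(nn.LSTM(100, 100)):add(nn.LSTM(100, 100))))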

Actually, the Sequencer will wrap any non-AbstractRecurrent module automatically, so I could simplify this further to :

lstm = nn.Sequencer( -- layer sizes are illustrative
   nn.Sequential():add(nn.LSTM(100, 100)):add(nn.LSTM(100, 100)))

I can also add a Linear between the two LSTMs. In this case, a Linear will be cloned (and have its parameters shared) for each time-step, while the LSTMs will do whatever cloning internally :

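lstm = nn.Sequencer( -- layer sizes are illustrative
   nn.Sequential():add(nn.LSTM(100, 100)):add(nn.Linear(100, 100)):add(nn.LSTM(100, 100)))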

AbstractRecurrent instances like Recursor, Recurrent and LSTM are expected to manage time-steps internally. Non-AbstractRecurrent instances can be wrapped by a Recursor to have the same behavior.

Every call to forward on an AbstractRecurrent instance like Recursor will increment the self.step attribute by 1, using a shared parameter clone for each successive time-step (for a maximum of rho time-steps, which defaults to 9999999). In this way, backward can be called in reverse order of the forward calls to perform backpropagation through time (BPTT). This is exactly what AbstractSequencer instances do internally. The backward call, which is actually divided into calls to updateGradInput and accGradParameters, decrements by 1 the self.updateGradInputStep and self.accGradParametersStep respectively, starting at self.step. Successive calls to backward will decrement these counters and use them to backpropagate through the appropriate internal step-wise shared-parameter clones.

Anyway, in most cases, you will not have to deal with the Recursor object directly as AbstractSequencer instances automatically decorate non-AbstractRecurrent instances with a Recursor in their constructors.

For a concrete example of its use, please consult the simple-recurrent-network.lua training script.


An extremely general container for implementing pretty much any type of recurrence.

rnn = nn.Recurrence(recurrentModule, outputSize, nInputDim, [rho])

Unlike Recurrent, this module doesn't manage separate modules like inputModule, startModule, mergeModule and the like. Instead, it only manages a single recurrentModule, which should output a Tensor or table : output(t) given an input table : {input(t), output(t-1)}. Using a mix of Recursor (say, via Sequencer) with Recurrence, one can implement pretty much any type of recurrent neural network, including LSTMs and RNNs.

For the first step, the Recurrence forwards a Tensor (or table thereof) of zeros through the recurrent layer (like LSTM, unlike Recurrent). So it needs to know the outputSize, which is either a number or torch.LongStorage, or table thereof. The batch dimension should be excluded from the outputSize. Instead, the size of the batch dimension (i.e. number of samples) will be extrapolated from the input using the nInputDim argument. For example, say that our input is a Tensor of size 4 x 3 where 4 is the number of samples, then nInputDim should be 1. As another example, if our input is a table of table [...] of tensors where the first tensor (depth first) is the same as in the previous example, then our nInputDim is also 1.

As an example, let's use Sequencer and Recurrence to build a Simple RNN for language modeling :

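rho = 5
hiddenSize = 10
outputSize = 5 -- num classes
nIndex = 10000

-- recurrent module : maps {input(t), output(t-1)} to output(t)
-- (the ParallelTable/CAddTable/Sigmoid wiring is an assumed, typical choice)
rm = nn.Sequential()
   :add(nn.ParallelTable()
      :add(nn.LookupTable(nIndex, hiddenSize))
      :add(nn.Linear(hiddenSize, hiddenSize)))
   :add(nn.CAddTable())
   :add(nn.Sigmoid())

-- the final LogSoftMax is likewise an assumption for a classification output
rnn = nn.Sequencer(
   nn.Sequential()
      :add(nn.Recurrence(rm, hiddenSize, 1))
      :add(nn.Linear(hiddenSize, outputSize))
      :add(nn.LogSoftMax())
)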

Note : We could very well reimplement the LSTM module using the newer Recursor and Recurrent modules, but that would mean breaking backwards compatibility for existing models saved on disk.


Ref. A : Regularizing RNNs by Stabilizing Activations

This module implements the norm-stabilization criterion:

ns = nn.NormStabilizer([beta])

This module regularizes the hidden states of RNNs by minimizing the difference between the L2-norms of consecutive steps. The cost function is defined as :

loss = beta * 1/T sum_t( ||h[t]|| - ||h[t-1]|| )^2

where T is the number of time-steps. Note that we do not divide the gradient by T such that the chosen beta can scale to different sequence sizes without being changed.

The sole argument beta is defined in ref. A. Since we don't divide the gradients by the number of time-steps, the default value of beta=1 should be valid for most cases.
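The cost function above can be evaluated by hand with a short plain-Lua sketch. The helpers l2norm and normStabilizerLoss are hypothetical, hidden states are plain Lua tables, and because h[0] is not available here the sum starts at t = 2, which is an assumption about how the boundary is handled.

```lua
local function l2norm(h)
   local sum = 0
   for _, v in ipairs(h) do sum = sum + v * v end
   return math.sqrt(sum)
end

-- hs: table of hidden-state vectors h[1..T]. The sum starts at t = 2
-- because no h[0] is given (an assumption of this sketch).
local function normStabilizerLoss(hs, beta)
   local T, total = #hs, 0
   for t = 2, T do
      local d = l2norm(hs[t]) - l2norm(hs[t - 1])
      total = total + d * d
   end
   return beta * total / T
end

-- norms are 5, 10 and 10, so only the first transition is penalized
local loss = normStabilizerLoss({ {3, 4}, {6, 8}, {6, 8} }, 1)
```

A sequence whose hidden states all have the same L2-norm has zero loss, whatever the direction of each state.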

This module should be added between RNNs (or LSTMs or GRUs) to provide better regularization of the hidden states. For example :

local stepmodule = nn.Sequential()
local rnn = nn.Sequencer(stepmodule)

To use it with SeqLSTM you can do something like this :

local rnn = nn.Sequential()


This abstract class implements a light interface shared by subclasses like : Sequencer, Repeater, RecurrentAttention, BiSequencer and so on.


The nn.Sequencer(module) constructor takes a single argument, module, which is the module to be applied from left to right, on each element of the input sequence.

seq = nn.Sequencer(module)

This Module is a kind of decorator used to abstract away the intricacies of AbstractRecurrent modules. While an AbstractRecurrent instance requires a sequence to be presented one input at a time, each with its own call to forward (and backward), the Sequencer forwards an input sequence (a table) into an output sequence (a table of the same length). It also takes care of calling forget on AbstractRecurrent instances.

Input/Output Format

The Sequencer requires inputs and outputs to be of shape seqlen x batchsize x featsize :

  • seqlen is the number of time-steps that will be fed into the Sequencer.
  • batchsize is the number of examples in the batch. Each example is its own independent sequence.
  • featsize is the size of the remaining non-batch dimensions. So this could be 1 for language models, or c x h x w for convolutional models, etc.

[Figure: Hello Fuzzy]

Above is an example input sequence for a character level language model. Its seqlen is 5, which means that it contains sequences of 5 time-steps. The opening { and closing } illustrate that the time-steps are elements of a Lua table, although it also accepts full Tensors of shape seqlen x batchsize x featsize. The batchsize is 2 as there are two independent sequences : { H, E, L, L, O } and { F, U, Z, Z, Y }. The featsize is 1 as there is only one feature dimension per character and each such character is of size 1. So the input in this case is a table of seqlen time-steps where each time-step is represented by a batchsize x featsize Tensor.


Above is another example of a sequence (input or output). It has a seqlen of 4 time-steps. The batchsize is again 2 which means there are two sequences. The featsize is 3 as each time-step of each sequence has 3 variables. So each time-step (element of the table) is represented again as a tensor of size batchsize x featsize. Note that while in both examples the featsize encodes one dimension, it could encode more.
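The seqlen x batchsize x featsize layout can be reproduced with nested Lua tables. The sketch below is only an illustration: toTimeSteps is a hypothetical helper, and characters are stored as strings, whereas a real model would use Tensors of indices or features.

```lua
-- Hypothetical helper: turn a batch of equal-length strings into a table of
-- seqlen time-steps, each holding batchsize rows of featsize = 1 values.
local function toTimeSteps(batch)
   local seqlen = #batch[1]
   local steps = {}
   for t = 1, seqlen do
      steps[t] = {}
      for b, str in ipairs(batch) do
         assert(#str == seqlen, "all sequences must have the same length")
         steps[t][b] = { str:sub(t, t) }   -- one feature per character
      end
   end
   return steps
end

local steps = toTimeSteps({ "HELLO", "FUZZY" })
-- steps[1] is { {"H"}, {"F"} } and steps[5] is { {"O"}, {"Y"} }
```

The outer table indexes time-steps, the middle one indexes the independent sequences of the batch, and the inner one holds the features.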


For example, rnn, an instance of nn.AbstractRecurrent, can forward an input sequence one forward call at a time:

input = {torch.randn(3,4), torch.randn(3,4), torch.randn(3,4)}

Equivalently, we can use a Sequencer to forward the entire input sequence at once:

seq = nn.Sequencer(rnn)

We can also forward Tensors instead of Tables :

-- seqlen x batchsize x featsize
input = torch.randn(3,3,4)


The Sequencer can also take non-recurrent Modules (i.e. non-AbstractRecurrent instances) and apply them to each input to produce an output table of the same length. This is especially useful for processing variable length sequences (tables).

Internally, the Sequencer expects the decorated module to be an AbstractRecurrent instance. When this is not the case, the module is automatically decorated with a Recursor module, which makes it conform to the AbstractRecurrent interface.

Note : this is due to a recent update (27 Oct 2015); before it, AbstractRecurrent and non-AbstractRecurrent instances each needed to be decorated by their own Sequencer. The update, which introduced the Recursor decorator, allows a single Sequencer to wrap any type of module: AbstractRecurrent, non-AbstractRecurrent, or a composite structure of both types. Nevertheless, existing code shouldn't be affected by the change.

For a concise example of its use, please consult the simple-sequencer-network.lua training script.


The remember([mode]) method controls when the Sequencer forgets previous sequences. When mode='neither' (the default behavior of the class), the Sequencer will additionally call forget before each call to forward. When mode='both' (the default when calling this function), the Sequencer will never call forget, in which case it is up to the user to call forget between independent sequences. This behavior is only applicable to decorated AbstractRecurrent modules. Accepted values for argument mode are as follows :

  • 'eval' only affects evaluation (recommended for RNNs)
  • 'train' only affects training
  • 'neither' affects neither training nor evaluation (default behavior of the class)
  • 'both' affects both training and evaluation (recommended for LSTMs)
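The four modes can be read as a small decision table. The sketch below is only one reading of the list above, written in plain Lua rather than taken from the library's code; the function name forgetsBeforeForward is hypothetical.

```lua
-- Returns true when, under the given remember mode, the Sequencer would
-- call forget before a forward pass. `training` is true during training
-- and false during evaluation.
local function forgetsBeforeForward(mode, training)
   if mode == 'both' then
      return false            -- remembers in training and evaluation
   elseif mode == 'neither' then
      return true             -- forgets before every forward
   elseif mode == 'train' then
      return not training     -- remembers only while training
   elseif mode == 'eval' then
      return training         -- remembers only while evaluating
   end
   error('unknown mode: ' .. tostring(mode))
end

for _, mode in ipairs({'eval', 'train', 'neither', 'both'}) do
   print(mode, forgetsBeforeForward(mode, true), forgetsBeforeForward(mode, false))
end
```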


The forget method calls the decorated AbstractRecurrent module's forget method.


This module is a faster version of nn.Sequencer(nn.FastLSTM(inputsize, outputsize)) :

seqlstm = nn.SeqLSTM(inputsize, outputsize)

Each time-step is computed as follows (same as FastLSTM):

i[t] = σ(W[x->i]x[t] + W[h->i]h[t-1] + b[1->i])                     (1)
f[t] = σ(W[x->f]x[t] + W[h->f]h[t-1] + b[1->f])                     (2)
z[t] = tanh(W[x->c]x[t] + W[h->c]h[t-1] + b[1->c])                  (3)
c[t] = f[t]c[t-1] + i[t]z[t]                                        (4)
o[t] = σ(W[x->o]x[t] + W[h->o]h[t-1] + b[1->o])                     (5)
h[t] = o[t]tanh(c[t])                                               (6)
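To make the six equations concrete, here is a minimal sketch of a single time-step with scalar input, hidden state and cell. It is written in plain Lua and is not the library's implementation; the weight table w and its field names (xi, hi, bi, ...) are assumptions chosen to mirror the subscripts in the equations.

```lua
local function sigmoid(x) return 1 / (1 + math.exp(-x)) end

-- tanh is written out because math.tanh is not available in every Lua version.
local function tanh(x)
   local a, b = math.exp(x), math.exp(-x)
   return (a - b) / (a + b)
end

-- One LSTM time-step for scalars. `w` holds hypothetical scalar weights:
-- w.xi, w.hi, w.bi for the input gate, and likewise for f, c and o.
local function lstmStep(w, x, hPrev, cPrev)
   local i = sigmoid(w.xi * x + w.hi * hPrev + w.bi)   -- (1) input gate
   local f = sigmoid(w.xf * x + w.hf * hPrev + w.bf)   -- (2) forget gate
   local z = tanh(w.xc * x + w.hc * hPrev + w.bc)      -- (3) candidate
   local c = f * cPrev + i * z                         -- (4) new cell
   local o = sigmoid(w.xo * x + w.ho * hPrev + w.bo)   -- (5) output gate
   local h = o * tanh(c)                               -- (6) new hidden state
   return h, c
end

-- With all weights at zero every gate is 0.5 and the candidate is 0,
-- so the cell is simply halved.
local w = {xi = 0, hi = 0, bi = 0, xf = 0, hf = 0, bf = 0,
           xc = 0, hc = 0, bc = 0, xo = 0, ho = 0, bo = 0}
local h, c = lstmStep(w, 1.0, 0.0, 2.0)
print(h, c)
```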

A notable difference is that this module expects the input and gradOutput to be tensors instead of tables. The default shape is seqlen x batchsize x inputsize for the input and seqlen x batchsize x outputsize for the output :

input = torch.randn(seqlen, batchsize, inputsize)
gradOutput = torch.randn(seqlen, batchsize, outputsize)

output = seqlstm:forward(input)
gradInput = seqlstm:backward(input, gradOutput)

Note that if you prefer to transpose the first two dimensions (i.e. batchsize x seqlen instead of the default seqlen x batchsize) you can set seqlstm.batchfirst = true following initialization.

For variable length sequences, set seqlstm.maskzero = true. This is equivalent to calling maskZero(1) on a FastLSTM wrapped by a Sequencer:

fastlstm = nn.FastLSTM(inputsize, outputsize)
fastlstm:maskZero(1)
seqfastlstm = nn.Sequencer(fastlstm)

For maskzero = true, input sequences are expected to be separated by a tensor of zeros for one time-step.

The seqlstm:toFastLSTM() method generates a FastLSTM instance initialized with the parameters of the seqlstm instance. Note however that the resulting parameters will not be shared (nor can they ever be).

Like the FastLSTM, the SeqLSTM does not use peephole connections between cell and gates (see FastLSTM for details).

Like the Sequencer, the SeqLSTM provides a remember method.

Note that a SeqLSTM cannot replace FastLSTM in code that decorates it with an AbstractSequencer or Recursor as this would be equivalent to Sequencer(Sequencer(FastLSTM)). You have been warned.



lstmp = nn.SeqLSTMP(inputsize, hiddensize, outputsize)

The SeqLSTMP is a subclass of SeqLSTM. It differs in that after computing the hidden state h[t] (eq. 6), it is projected onto r[t] using a simple linear transform (eq. 7). The computation of the gates also uses the previous such projection r[t-1] (eq. 1, 2, 3, 5). This differs from SeqLSTM which uses h[t-1] instead of r[t-1].

The computation of a time-step outlined in SeqLSTM is replaced with the following:

i[t] = σ(W[x->i]x[t] + W[r->i]r[t-1] + b[1->i])                     (1)
f[t] = σ(W[x->f]x[t] + W[r->f]r[t-1] + b[1->f])                     (2)
z[t] = tanh(W[x->c]x[t] + W[r->c]r[t-1] + b[1->c])                  (3)
c[t] = f[t]c[t-1] + i[t]z[t]                                        (4)
o[t] = σ(W[x->o]x[t] + W[r->o]r[t-1] + b[1->o])                     (5)
h[t] = o[t]tanh(c[t])                                               (6)
r[t] = W[h->r]h[t]                                                  (7)

The algorithm is outlined in ref. A and benchmarked with state of the art results on the Google billion words dataset in ref. B. SeqLSTMP can be used with a hiddensize >> outputsize such that the effective size of the memory cells c[t] and gates i[t], f[t] and o[t] can be much larger than the actual input x[t] and output r[t]. For fixed inputsize and outputsize, the SeqLSTMP will be able to remember much more information than the SeqLSTM.
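The trade-off can be made tangible by counting the weights implied by the equations. The sketch below is arithmetic derived from equations (1) to (7) as written, not from the library's internal parameter layout (which may differ); the function names lstmParams and lstmpParams are hypothetical.

```lua
-- SeqLSTM: four blocks (i, f, c, o), each with an input matrix, a
-- recurrent matrix on h[t-1] and a bias, all of width outputsize.
local function lstmParams(inputsize, outputsize)
   return 4 * (outputsize * inputsize + outputsize * outputsize + outputsize)
end

-- SeqLSTMP: the four blocks have width hiddensize and recur on r[t-1]
-- (size outputsize); eq. 7 adds a hiddensize -> outputsize projection.
local function lstmpParams(inputsize, hiddensize, outputsize)
   local gates = 4 * (hiddensize * inputsize + hiddensize * outputsize + hiddensize)
   local projection = outputsize * hiddensize
   return gates + projection
end

-- Same inputsize and outputsize, but four times as many memory cells.
print(lstmParams(128, 128))
print(lstmpParams(128, 512, 128))
```

In this example the input and output stay at 128 units while the cell c[t] grows to 512 units, which is the situation the paragraph above describes.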


This module is a faster version of nn.Sequencer(nn.GRU(inputsize, outputsize)) :

seqGRU = nn.SeqGRU(inputsize, outputsize)

Usage of SeqGRU differs from GRU in the same manner as SeqLSTM differs from LSTM. Therefore see SeqLSTM for more details.


brnn = nn.SeqBRNN(inputSize, outputSize, [batchFirst], [merge])

A bi-directional RNN that uses SeqLSTM. Internally contains a 'fwd' and 'bwd' module of SeqLSTM. Expects an input shape of seqlen x batchsize x inputsize. By setting [batchFirst] to true, the input shape can be batchsize x seqLen x inputsize. Merge module defaults to CAddTable(), summing the outputs from each output layer.


input = torch.rand(1, 1, 5)
brnn = nn.SeqBRNN(5, 5)
print(brnn:forward(input))

Prints an output of a 1x1x5 tensor.


Applies encapsulated fwd and bwd rnns to an input sequence in forward and reverse order. It is used for implementing Bidirectional RNNs and LSTMs.

brnn = nn.BiSequencer(fwd, [bwd, merge])

The input to the module is a sequence (a table) of tensors and the output is a sequence (a table) of tensors of the same length. Applies a fwd rnn (an AbstractRecurrent instance) to each element in the sequence in forward order and applies the bwd rnn in reverse order (from last element to first element). The bwd rnn defaults to:

bwd = fwd:clone()

For each step (in the original sequence), the outputs of both rnns are merged together using the merge module (defaults to nn.JoinTable(1,1)). If merge is a number, it specifies the JoinTable constructor's nInputDim argument, such that the merge module is then initialized as :

merge = nn.JoinTable(1,merge)

Internally, the BiSequencer is implemented by decorating a structure of modules that makes use of 3 Sequencers for the forward, backward and merge modules.
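The forward/backward/merge structure can be sketched without any Torch modules. In the plain-Lua example below, the fwd and bwd rnns are replaced by a toy running sum (an assumption), and the merge is a simple pairing that stands in for JoinTable; the names run and biSequence are hypothetical.

```lua
-- Runs `step` over the sequence, threading a state through it. When
-- `reverse` is true the sequence is visited from last to first, but each
-- output is still stored at its original position.
local function run(step, seq, reverse)
   local out, state = {}, 0
   local first, last, inc = 1, #seq, 1
   if reverse then first, last, inc = #seq, 1, -1 end
   for t = first, last, inc do
      state = step(seq[t], state)
      out[t] = state
   end
   return out
end

-- Toy recurrent step used for both directions: a running sum.
local function sumStep(x, state) return state + x end

local function biSequence(seq)
   local f = run(sumStep, seq, false)
   local b = run(sumStep, seq, true)
   local merged = {}
   for t = 1, #seq do
      merged[t] = {f[t], b[t]}   -- stands in for the JoinTable merge
   end
   return merged
end

local out = biSequence({1, 2, 3})
for t, pair in ipairs(out) do print(t, pair[1], pair[2]) end
```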

Similarly to a Sequencer, the sequences in a batch must have the same size. But the sequence length of each batch can vary.

Note : make sure you call brnn:forget() after each call to updateParameters(). Alternatively, one could call brnn.bwdSeq:forget() so that only bwd rnn forgets. This is the minimum requirement, as it would not make sense for the bwd rnn to remember future sequences.


Applies encapsulated fwd and bwd rnns to an input sequence in forward and reverse order. It is used for implementing Bidirectional RNNs and LSTMs for Language Models (LM).

brnn = nn.BiSequencerLM(fwd, [bwd, merge])

The input to the module is a sequence (a table) of tensors and the output is a sequence (a table) of tensors of the same length. Applies a fwd rnn (an AbstractRecurrent instance) to the first N-1 elements in the sequence in forward order. Applies the bwd rnn in reverse order to the last N-1 elements (from the last element to the second element). This is the main difference between this module and the BiSequencer. The latter cannot be used for language modeling because the bwd rnn would be trained to predict the input it had just been fed.


The bwd rnn defaults to:

bwd = fwd:clone()

While the fwd rnn will output representations for the last N-1 steps, the bwd rnn will output representations for the first N-1 steps. The missing outputs for each rnn (the first step for the fwd, the last step for the bwd) will be filled with zero Tensors of the same size as the corresponding rnn's outputs. This way they can be merged. If nn.JoinTable is used (the default), then the first and last output elements will be padded with zeros for the missing fwd and bwd rnn outputs, respectively.
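The alignment and zero padding are easier to see on a toy example. In the plain-Lua sketch below, both rnns are replaced by a running sum (an assumption), so the merged pair at step t summarizes every element except element t itself; the function name biSequenceLM is hypothetical.

```lua
local function biSequenceLM(seq)
   local n = #seq
   local fwd, bwd = {}, {}

   -- fwd reads elements 1..n-1; what it has seen after element t is
   -- aligned with step t+1. Step 1 has no left context: zero padding.
   fwd[1] = 0
   local state = 0
   for t = 1, n - 1 do
      state = state + seq[t]
      fwd[t + 1] = state
   end

   -- bwd reads elements n..2; what it has seen after element t is
   -- aligned with step t-1. Step n has no right context: zero padding.
   bwd[n] = 0
   state = 0
   for t = n, 2, -1 do
      state = state + seq[t]
      bwd[t - 1] = state
   end

   local merged = {}
   for t = 1, n do
      merged[t] = {fwd[t], bwd[t]}   -- stands in for the JoinTable merge
   end
   return merged
end

local out = biSequenceLM({1, 2, 3, 4})
for t, pair in ipairs(out) do print(t, pair[1], pair[2]) end
```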

For each step (in the original sequence), the outputs of both rnns are merged together using the merge module (defaults to nn.JoinTable(1,1)). If merge is a number, it specifies the JoinTable constructor's nInputDim argument, such that the merge module is then initialized as :

merge = nn.JoinTable(1,merge)

Similarly to a Sequencer, the sequences in a batch must have the same size. But the sequence length of each batch can vary.

Note that LMs implemented with this module will not be classical LMs as they won't measure the probability of a word given the previous words. Instead, they measure the probability of a word given the surrounding words, i.e. context. While for mathematical reasons you may not be able to use this to measure the probability of a sequence of words (like a sentence), you can still measure the pseudo-likelihood of such a sequence (see this for a discussion).


This Module is a decorator similar to Sequencer. It differs in that the sequence length is fixed beforehand and the input is repeatedly forwarded through the wrapped module to produce an output table of length nStep:

r = nn.Repeater(module, nStep)

Argument module should be an AbstractRecurrent instance. This is useful for implementing models like RCNNs, which are repeatedly presented with the same input.
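A minimal sketch of the idea, in plain Lua: the same input is presented nStep times to a stateful step function, which stands in for the AbstractRecurrent instance. The names makeAccumulator and repeatInput are hypothetical.

```lua
-- Toy recurrent module: keeps a running sum of everything it is fed.
local function makeAccumulator()
   local state = 0
   return function(input)
      state = state + input
      return state
   end
end

-- Feeds the same input to `step` nStep times and collects the outputs.
local function repeatInput(step, input, nStep)
   local outputs = {}
   for t = 1, nStep do
      outputs[t] = step(input)
   end
   return outputs
end

local outputs = repeatInput(makeAccumulator(), 2, 3)
print(#outputs, outputs[1], outputs[2], outputs[3])
```

The outputs differ from step to step even though the input does not, because the wrapped step carries state.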


References :

This module can be used to implement the Recurrent Attention Model (RAM) presented in Ref. A :

ram = nn.RecurrentAttention(rnn, action, nStep, hiddenSize)

rnn is an AbstractRecurrent instance. Its input is {x, z} where x is the input to the ram and z is an action sampled from the action module. The output size of the rnn must be equal to hiddenSize.

action is a Module that uses a REINFORCE module (ref. B) like ReinforceNormal, ReinforceCategorical, or ReinforceBernoulli to sample actions given the previous time-step's output of the rnn. During the first time-step, the action module is fed with a Tensor of zeros of size input:size(1) x hiddenSize. It is important to understand that the sampled actions do not receive gradients backpropagated from the training criterion. Instead, a reward is broadcast from a reward criterion like VRClassReward to the action's REINFORCE module, which will backpropagate gradients computed from the output samples and the reward. Therefore, the action module's outputs are only used internally, within the RecurrentAttention module.

nStep is the number of actions to sample, i.e. the number of elements in the output table.

hiddenSize is the output size of the rnn. This variable is necessary to generate the zero Tensor to sample an action for the first step (see above).

A complete implementation of Ref. A is available here.


This module zeroes the output rows of the decorated module for commensurate input rows which are tensors of zeros.

mz = nn.MaskZero(module, nInputDim)

The output Tensor (or table thereof) of the decorated module will have each row (samples) zeroed when the commensurate row of the input is a tensor of zeros.

The nInputDim argument must specify the number of non-batch dims in the first Tensor of the input. In the case of an input table, the first Tensor is the first one encountered when doing a depth-first search.

This decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.

Caveat: MaskZero does not guarantee that the output and gradInput tensors of the internal modules of the decorated module will also be zeroed when the input is zero. MaskZero only affects the immediate gradInput and output of the module that it encapsulates. However, for most modules, the gradient update for that time-step will be zero because backpropagating a gradient of zeros will typically yield zeros all the way to the input. In this respect, modules to avoid encapsulating inside a MaskZero are AbstractRecurrent instances, as gradients flow between different time-steps internally. Instead, call the AbstractRecurrent.maskZero method to encapsulate the internal recurrentModule.
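The row-masking behavior can be sketched in plain Lua with nested tables standing in for a batchsize x featsize Tensor (an assumption). Only the forward direction is shown; the names isZeroRow, maskZero and addOne are hypothetical.

```lua
local function isZeroRow(row)
   for _, v in ipairs(row) do
      if v ~= 0 then return false end
   end
   return true
end

-- Applies `module` (a function from a batch of rows to a batch of rows)
-- and then zeroes every output row whose input row is all zeros.
local function maskZero(module, input)
   local output = module(input)
   for b, row in ipairs(input) do
      if isZeroRow(row) then
         for j = 1, #output[b] do output[b][j] = 0 end
      end
   end
   return output
end

-- Toy module: adds 1 to every element, so an unmasked zero row would
-- otherwise produce a non-zero output row.
local function addOne(batch)
   local out = {}
   for b, row in ipairs(batch) do
      out[b] = {}
      for j, v in ipairs(row) do out[b][j] = v + 1 end
   end
   return out
end

local out = maskZero(addOne, {{0, 0}, {1, 2}})
print(out[1][1], out[1][2], out[2][1], out[2][2])
```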


WARNING : only use this module if your input contains lots of zeros. In almost all cases, MaskZero will be faster, especially with CUDA.

Ref. A : TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing

The usage is the same as for MaskZero.

mz = nn.TrimZero(module, nInputDim)

The only difference from MaskZero is that it reduces computational cost by shrinking the batch to its non-zero rows, if any, when sequences of varying lengths are provided in the input. Note that when the lengths are consistent, MaskZero will be faster, because TrimZero has an operational cost.

In short, the result is the same as MaskZero's; however, TrimZero is faster than MaskZero only when sentence lengths vary considerably.

In practice, e.g. for a language model, TrimZero is expected to be about 30% faster than MaskZero. (You can test this using test/test_trimzero.lua.)
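The difference from masking can be sketched as gather, compute, scatter. The plain-Lua example below is an illustration of that idea and not the library's implementation; nested tables stand in for Tensors, and the names isZeroRow, addOne and trimZero (with its outputSize argument) are hypothetical.

```lua
local function isZeroRow(row)
   for _, v in ipairs(row) do
      if v ~= 0 then return false end
   end
   return true
end

-- Toy module: adds 1 to every element of every row.
local function addOne(batch)
   local out = {}
   for b, row in ipairs(batch) do
      out[b] = {}
      for j, v in ipairs(row) do out[b][j] = v + 1 end
   end
   return out
end

-- Runs `module` only on the non-zero rows, then scatters the results
-- back into a zero-filled output of the full batch size.
local function trimZero(module, input, outputSize)
   local kept, index = {}, {}
   for b, row in ipairs(input) do
      if not isZeroRow(row) then
         kept[#kept + 1] = row
         index[#index + 1] = b
      end
   end
   local output = {}
   for b = 1, #input do
      output[b] = {}
      for j = 1, outputSize do output[b][j] = 0 end
   end
   if #kept > 0 then
      local trimmed = module(kept)
      for k, b in ipairs(index) do output[b] = trimmed[k] end
   end
   return output, #kept
end

local out, nKept = trimZero(addOne, {{0, 0}, {1, 2}, {0, 0}, {3, 4}}, 2)
print(nKept, out[2][1], out[4][2])
```

Here the module only processes 2 of the 4 rows; the bookkeeping around it is the operational cost mentioned above.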


This module extends nn.LookupTable to support zero indexes. Zero indexes are forwarded as zero tensors.

lt = nn.LookupTableMaskZero(nIndex, nOutput)

The output Tensor will have each row zeroed when the commensurate row of the input is a zero index.

This lookup table makes it possible to pad sequences with different lengths in the same batch with zero vectors.
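As a rough sketch of the lookup behavior, the plain-Lua function below returns a row of zeros for index 0 and the corresponding weight row otherwise. The weight table and the name lookupMaskZero are hypothetical stand-ins for the module's parameters.

```lua
-- `weight` is an nIndex x nOutput nested table; `indices` is a list of
-- integers in 0..nIndex, where 0 marks padding.
local function lookupMaskZero(weight, indices)
   local nOutput = #weight[1]
   local output = {}
   for b, idx in ipairs(indices) do
      local row = {}
      for j = 1, nOutput do
         if idx == 0 then
            row[j] = 0
         else
            row[j] = weight[idx][j]
         end
      end
      output[b] = row
   end
   return output
end

local weight = {{0.1, 0.2}, {0.3, 0.4}}
local out = lookupMaskZero(weight, {2, 0, 1})
print(out[1][1], out[2][1], out[3][1])
```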


This criterion zeroes the err and gradInput rows of the decorated criterion for commensurate input rows which are tensors of zeros.

mzc = nn.MaskZeroCriterion(criterion, nInputDim)

The gradInput Tensor (or table thereof) of the decorated criterion will have each row (samples) zeroed when the commensurate row of the input is a tensor of zeros. The err will also disregard such zero rows.

The nInputDim argument must specify the number of non-batch dims in the first Tensor of the input. In the case of an input table, the first Tensor is the first one encountered when doing a depth-first search.

This decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.
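The effect on err and gradInput can be sketched with a squared-error loss in plain Lua. The loss choice, the nested-table batches and the names isZeroRow and maskedSquaredError are assumptions made for illustration.

```lua
local function isZeroRow(row)
   for _, v in ipairs(row) do
      if v ~= 0 then return false end
   end
   return true
end

-- Sum of squared errors over the batch, skipping rows whose input is
-- all zeros; those rows also receive a zero gradient.
local function maskedSquaredError(input, target)
   local err, gradInput = 0, {}
   for b, row in ipairs(input) do
      gradInput[b] = {}
      local masked = isZeroRow(row)
      for j, v in ipairs(row) do
         if masked then
            gradInput[b][j] = 0
         else
            local d = v - target[b][j]
            err = err + d * d
            gradInput[b][j] = 2 * d
         end
      end
   end
   return err, gradInput
end

-- The first row is padding, so its (large) error is disregarded.
local err, grad = maskedSquaredError({{0, 0}, {1, 2}}, {{5, 5}, {0, 0}})
print(err, grad[1][1], grad[2][1], grad[2][2])
```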


reverseSeq = nn.SeqReverseSequence(dim)

Reverses an input tensor on a specified dimension. The reversal dimension can be no larger than three.


input = torch.Tensor({{1,2,3,4,5}, {6,7,8,9,10}})
reverseSeq = nn.SeqReverseSequence(1)
print(reverseSeq:forward(input))

Gives us an output of torch.Tensor({{6,7,8,9,10},{1,2,3,4,5}})


This Criterion is a decorator:

c = nn.SequencerCriterion(criterion, [sizeAverage])

Both the input and target are expected to be a sequence, either as a table or Tensor. For each step in the sequence, the corresponding elements of the input and target will be applied to the criterion. The output of forward is the sum of all individual losses in the sequence. This is useful when used in conjunction with a Sequencer.

If sizeAverage is true (default is false), the output loss and gradInput are averaged over each time-step.
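The summation and the sizeAverage option can be sketched in a few lines of plain Lua. Scalars stand in for the per-step inputs and targets, and the names sequencerCriterion and squaredError are hypothetical.

```lua
-- Applies `criterion` to each (input, target) pair and sums the losses;
-- with sizeAverage the sum is divided by the number of time-steps.
local function sequencerCriterion(criterion, inputs, targets, sizeAverage)
   local total = 0
   for t = 1, #inputs do
      total = total + criterion(inputs[t], targets[t])
   end
   if sizeAverage then
      total = total / #inputs
   end
   return total
end

local function squaredError(x, y) return (x - y) ^ 2 end

local summed = sequencerCriterion(squaredError, {1, 2, 3}, {0, 0, 0}, false)
local averaged = sequencerCriterion(squaredError, {1, 2, 3}, {0, 0, 0}, true)
print(summed, averaged)
```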


This Criterion is a decorator:

c = nn.RepeaterCriterion(criterion)

The input is expected to be a sequence (table or Tensor). A single target is repeatedly applied using the same criterion to each element in the input sequence. The output of forward is the sum of all individual losses in the sequence. This is useful for implementing models like RCNNs, which are repeatedly presented with the same target.
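In sketch form, the only change from a per-step criterion is that one target is reused at every step. The plain-Lua example below uses scalars and a squared-error loss as assumptions; the names repeaterCriterion and squaredError are hypothetical.

```lua
-- Applies `criterion` to every element of `inputs` against the same
-- single target and sums the losses.
local function repeaterCriterion(criterion, inputs, target)
   local total = 0
   for t = 1, #inputs do
      total = total + criterion(inputs[t], target)
   end
   return total
end

local function squaredError(x, y) return (x - y) ^ 2 end

local loss = repeaterCriterion(squaredError, {1, 2, 3}, 2)
print(loss)
```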

rnn's People


achalddave, ahoy196, anoidgit, bartvm, cheng6076, dkurt, douwekiela, ethanabrooks, eywalker, frugs, gheinrich, guillitte, hughperkins, ivendrov, jbboin, jnhwkim, joostvdoorn, nhynes, nicholas-leonard, peterroelants, rachtsingh, raymondhs, robotsorcerer, rohanpadhye, sennendoko, supakjk, suryabhupa, temerick, vgire, ywelement

rnn's Issues

Adding Examples for Different RNN Structures


I wanted to suggest adding examples for different architectures using Recurrent structures.

For example:
Link to the blog
Image Link.

I've seen that you've added some examples in some of the issues (#51, #21).
Maybe adding them explicitly in the examples folder can reduce confusion.


LSTM numerical gradient check

Haven't been able to figure out what's going on here... With the new LinearBias I can get one step to give the correct gradient, but as soon as I have multiple steps weird stuff starts happening. The numerical gradients are generally much, much larger (1e3 vs. 1e-3). Any ideas as to what is happening?

nn = require 'nn'
require 'rnn'
require 'optim'

hiddenSize = 2
nIndex = 2
r = nn.LSTM(hiddenSize, hiddenSize)

rnn = nn.Sequential()
rnn:add(r)
rnn:add(nn.Linear(hiddenSize, nIndex))

criterion = nn.ClassNLLCriterion()

function f(x)
  -- Do the forward prop
  -- With or without fastBackward doesn't matter
  r.fastBackward = false
  local err = 0
  for i = 1, inputs:size(1) do
     local output = rnn:forward(inputs[i])
     err = err + criterion:forward(output, targets[i])
     local gradOutput = criterion:backward(output, targets[i])
     rnn:backward(inputs[i], gradOutput)
  end
  return err, grads
end

parameters, grads = rnn:getParameters()

-- This works:
-- targets = torch.Tensor{1}:resize(1, 1)
-- inputs = torch.randn(1, 2)

targets = torch.Tensor{1, 2}:resize(2, 1)
inputs = torch.randn(2, 2)
local err, dC, dC_est = optim.checkgrad(f, parameters:clone())
-- Print the exact and numerical gradients side by side
print(torch.cat(dC:view(dC:size(1), 1), dC_est:view(dC_est:size(1), 1), 2))
assert(err < 0.0001, "failed")

nn.Periodic decorator

For example, apply ZeroGrad every 3 forwards:

nn.Periodic(nn.ZeroGrad(), 3)

Could be useful for truncated BPTT.
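nn.Periodic is only a proposal here, so the following is a plain-Lua sketch of the requested behavior rather than an existing module: a wrapper that runs an action on every n-th call and does nothing otherwise. The name periodic and the counter-based design are assumptions.

```lua
-- Wraps `action` so that it only runs on every n-th call.
local function periodic(action, n)
   local count = 0
   return function(...)
      count = count + 1
      if count % n == 0 then
         return action(...)
      end
   end
end

-- Stand-in for zeroing gradients: just counts how often it fired.
local fired = 0
local zeroEveryThird = periodic(function() fired = fired + 1 end, 3)

for step = 1, 7 do
   zeroEveryThird()
end
print(fired)
```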

Reproduce RAM results

Thanks for the nice package for Torch.
I was trying to reproduce your RAM results on the MNIST dataset (< 1% error). I am not able to reproduce such results: the early-stopping criterion ends training after ~800 epochs at 98.6%. Could you provide the exact parameters that were used for learning?

The second question is about using LSTM. How can this model be used with LSTM units? Is it enough to replace nn.Recurrent with nn.LSTM, or is something more needed (I am not able to run such a model)?

Question on memory usage for LSTM?

Platform: OSX 10.10.3.
Revision: Reasonably current (17/5) git pull of Torch7 and rnn.
Runtime: CPU (not GPU - yet).

Background: I'm modelling events that happen over a 30-day window, with each window divided into 3 minute time-steps. So each distinct user has 30*24*(60/3)=14,400 events with a lot of these simply encoding the fact that nothing happened. I've got 60k users to look at initially and an 80:20 split between train and test, so 48k training users and 12k test users.

Problem: I inevitably run out of memory about 60 minutes after training - running into the 1 GB luajit cap. Adding collectgarbage() inside the forward / backward loop helped but only to delay the out of memory from 1 minute to 60 minutes. I'm using Tensors throughout my code, not tables.

My planned solutions: I'm working on restructuring my code to load the training data on the fly from storage and then using subsets of the data as mini-batches to hopefully remain within the 1 GB limit as well as evaluating more compressed / efficient ways of modelling the events themselves (but part of my research is to see how good LSTM is at extracting features from the events through time without having to preprocess those events..).

Question: Is this behaviour (exceeding the 1 GB luajit mem limit) simply expected behaviour in the current LSTM code for a dataset of this size - that it's using a table somewhere to maintain state (e.g. unrolling through time etc.) or is it more likely that there is a bug somewhere in my code manifesting itself as this problem and RNN / LSTM should have a stable / reasonable mem usage profile?

Sequencer remember/forget with "eval" mode


Recent changes to the Sequencer remember/forget mechanism introduced modes like "both" and "eval", which is very convenient. However, in "eval" mode, a forward step during evaluation will set the maximum number of BPTT steps (rho value) to the size of the input. Then, a subsequent epoch of training on a sequence of different size will fail in the backward step. Before the change, remember() worked fine.

The reason is probably the setting of rho in the recurrent module (in this case LSTM), which then causes the backward step during training to stop before reaching the beginning of the sequence. See LSTM:updateGradInputThroughTime().

Note: I know that the README says it is recommended to set mode="both" for LSTM, but I prefer the "eval" mode because each training example is independent. In any case, I suppose both modes should be possible for any AbstractRecurrent instance.

A minimal working example with LSTMs:

lstm = nn.LSTM(5,5)
seq = nn.Sequencer(lstm)
inputTrain = {torch.randn(5), torch.randn(5), torch.randn(5)}
inputEval = {torch.randn(5)}

modes = {'both', 'eval'}
for i, mode in ipairs(modes) do
  print('\nmode: ' .. mode)
  seq:remember(mode)

  -- do one epoch of training
  seq:training()
  seq:forward(inputTrain)
  seq:backward(inputTrain, inputTrain)

  -- evaluate
  seq:evaluate()
  seq:forward(inputEval)

  -- do another epoch of training
  seq:training()
  seq:forward(inputTrain)
  -- this will fail when mode = 'eval'
  seq:backward(inputTrain, inputTrain)
end

Could you look into that?

Many thanks for your help.

LSTM implementation

Question: Why does the LSTM implementation not inherit from Recurrent? It seems that (something like) this is equivalent to the current LSTM implementation, but avoids a lot of code duplication.

nn = require 'nn'
require 'rnn'

local hiddenSize = 2
local nIndex = 2

-- A silly hack to make sure LSTM.recurrentModule is fed zeros at step 1
local Start = torch.class('nn.Start', 'nn.Identity')

function Start:updateOutput(input)
  self.output = {input, torch.zeros(2), torch.zeros(2)}
  return self.output
end

function Start:updateGradInput(input, gradOutput)
  self.gradInput = gradOutput[1]
  return self.gradInput
end

-- The LSTM network
-- The input and feedback modules are unused
-- Merge basically turns {input, {output, cell}} into {input, output, cell}
-- The transfer module is the full LSTM module
local r = nn.Recurrent(nn.Start(), nn.Identity(), nn.Identity(),
                       nn.LSTM(hiddenSize, hiddenSize).recurrentModule,
                       9999, nn.FlattenTable())

local rnn = nn.Sequential()
rnn:add(r)
rnn:add(nn.SelectTable(1))  -- Since both the output and the cell are given
rnn:add(nn.Linear(hiddenSize, nIndex))

Growing weights


Thanks for the great recurrent package. I'm still new to Torch, so the problem is probably in my usage. However, when I try to stack LSTMs into a multilayer network by using Sequencer, the weights grow constantly. Without Sequencer, with only one recurrent layer, it works fine.


    recurrent = nn.LSTM(inSize, hiddenSize1, rho)
    recurrent2 = nn.LSTM(hiddenSize1, hiddenSize2, rho)
    recurrent.scales = torch.Tensor(rho):fill(1)
    recurrent2.scales = torch.Tensor(rho):fill(1)
    linear = nn.Linear(hiddenSize2, outSize)

    model = nn.Sequential()
    sequencer = nn.Sequencer(recurrent)
    sequencer2 = nn.Sequencer(recurrent2)
    sequencer3 = nn.Sequencer(linear)
    model:add(sequencer)
    model:add(sequencer2)
    model:add(sequencer3)


    for t = 1, (trainSize - batchSize) do

        -- load new sample
        local inputs, targets, gradOutputs = {}, {}, {}
        for step = 1, rho do
            local index = t + step
            inputs[step] = inputData:sub(1, inputData:size(1), index, index + batchSize - 1):transpose(1, 2)
            targets[step] = outputData:sub(1, outputData:size(1), index, index + batchSize - 1):transpose(1, 2)
        end

        local outputs = model:forward(inputs)

        for step = 1, rho do
            local err = criterion:forward(outputs[step], targets[step])
            trainError = trainError + err
            gradOutputs[step] = criterion:backward(outputs[step], targets[step])
        end

        model:backward(inputs, gradOutputs)
    end

I have tried lowering the learning rate (1e-2, 1e-3, 1e-4), and different values for scales (1, 1/rho), batch size (1, 10) and rho (1, 5, 10, 100), but none of these modifications seemed to work. Any ideas?

Getting rnn ready for torch

The objective is to build a general recurrent library for torch

  • isRecursable test in updateOutput (first pass only)
  • Module:sharedClone()
  • BidirectionalSequencer (for BRNN/BLSTM)
  • SequencerCriterion (for Sequencers)
  • doc
    • AbstractRecurrent
    • LSTM
    • Sequencer
    • Repeater
    • RepeaterCriterion
  • more unit tests (they need as much unit tests as possible to find corner cases)
    • Sequencer
    • Repeater (assert not sequential containing recurrence)
    • LSTM (recycle and such)
    • BidirectionalSequencer
  • examples
    • dp.PennTreeBank
    • dp example (pure nn, no dp.Models)
  • Module:sequencer()

testJacobian fails on Missing gradInput

When trying to test a simple network (to get the last state of an LSTM)

local sequenceLength = math.random(5, 10)
local vectorSize = math.random(5, 10)
local hiddenSize = math.random(5, 10)

local input = torch.rand(sequenceLength, vectorSize)

-- testJacobian doesn't support table inputs so use a split table module on the input.
local module = nn.Sequential()
module:add(nn.SplitTable(1, 2))
module:add(nn.Sequencer(nn.FastLSTM(vectorSize, hiddenSize)))

nn.Jacobian.testJacobian(module, input)

nn.Jacobian.testJacobian fails at

Am I missing something ?
Can you provide a pointer to solve this ?


Recurrent Neural Network update on sequence of length 1

The Recurrent class of 'rnn' does not allow training updates on sequences of length 1. Minimal example:

  require 'rnn'

  x = torch.rand(200)
  target = torch.rand(1)

  rho = 5
  hiddenSize = 100
  -- RNN
  r = nn.Recurrent(
     hiddenSize, nn.Linear(200,hiddenSize), 
     nn.Linear(hiddenSize, hiddenSize), nn.Sigmoid(), 
     rho
  )

  seq = nn.Sequential()
  seq:add(r)
  seq:add(nn.Linear(hiddenSize, 1))

  criterion = nn.MSECriterion()

  output = seq:forward(x)
  err = criterion:forward(output,target)
  gradOutput = criterion:backward(output,target)


As far as I understand this should not be an issue, yet when run this gives something like:

  /Users/hroosterhuis/torch/install/bin/luajit: /Users/hroosterhuis/torch/install/share/lua/5.1/nn/Add.lua:62: bad argument #1 to 'size' (dimension 1 out of range of 0D tensor at /Users/hroosterhuis/torch/pkg/torch/generic/Tensor.c:17)
  stack traceback:
  [C]: in function 'size'
  /Users/hroosterhuis/torch/install/share/lua/5.1/nn/Add.lua:62: in function 'accGradParameters'
  ...s/hroosterhuis/torch/install/share/lua/5.1/nn/Module.lua:53: in function 'accUpdateGradParameters'
  ...oosterhuis/torch/install/share/lua/5.1/rnn/Recurrent.lua:247: in function 'accUpdateGradParametersThroughTime' in function 'backwardUpdateThroughTime' in function 'updateParameters'
  ...roosterhuis/torch/install/share/lua/5.1/nn/Container.lua:31: in function 'updateParameters'
  testRNN.lua:26: in main chunk
  [C]: in function 'dofile'
  ...huis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
  [C]: at 0x0103773780

Prefer if examples dont need dp

I reckon it would be easier to use the examples if they didn't require one to learn dp to understand them fully.

Basically, one currently has to learn both rnn and dp together to understand the examples. I reckon the learning curve would be shorter if the examples in rnn assumed only knowledge of rnn and core Torch libraries, such as nn.

Implementing many-to-one RNN

Thanks for the great RNN package! I am trying to implement a many-to-one RNN, an LSTM specifically, where each sequence of inputs only produces a single output, and I am finding it difficult to use the rnn package for this case. This is useful e.g. in sentiment analysis, where each review (with a set of words) gets mapped to a sentiment (positive or negative).

Any help regarding this would be appreciated. I am not sure if I should have raised an issue for this since this is a personal question but hoping others would benefit too.

Thanks a lot for your help!

Multiple batches LSTM

Does the LSTM support multiple batches directly (for performance benchmarking)? I tried to implement this. It didn't raise an error but the results seem to be inconsistent. (I am also confused because the LSTM code file states that the expected input is either 1D or 2D, but in the Penn Tree Bank sample multiple batches can be used. Here I want to time forward and backward separately, though.) If I input identical sequences in one batch, I get different outputs.
BTW: Thanks for making this available to the public!

Here is a minimal code example of what I mean (the MaskZero part can also be removed):

require "rnn"
require "cunn"


batch_size= 2
maxLen = 4
wordVec = 5
nWords = 100
mode = 'CPU'

-- create random data with zeros as empty indicator
inp1 = torch.ceil(torch.rand(batch_size, maxLen)*nWords) -- 
labels = torch.ceil(torch.rand(batch_size)*2) -- create labels of 1s and 2s

-- not all sequences have the same length, 0 placeholder
for i=1, batch_size do
    n_zeros = torch.random(maxLen-2) 
    inp1[{{i},{1, n_zeros}}] = torch.zeros(n_zeros)
end

-- make the second sequence the same as the first
inp1[{{2},{}}] = inp1[{{1},{}}]:clone()

lstm = nn.Sequential()
lstm:add(nn.LookupTableMaskZero(10000, wordVec, batch_size))  -- convert indices to word vectors
lstm:add(nn.SplitTable(1))  -- convert tensor to list of subtensors
lstm:add(nn.Sequencer(nn.MaskZero(nn.LSTM(wordVec, wordVec), 1))) -- Seq to Seq', 0-Seq to 0-Seq

if mode == 'GPU' then
    labels = labels:cuda()
    inp1 = inp1:cuda()
end

out = lstm:forward(inp1)

print('input 1', inp1[1])
print('lstm out 1', out[1])  

print('input 2', inp1[2])  -- should be the same as above
print('lstm out 2', out[2])  --  should be the same as above


input 1   0
[torch.DoubleTensor of size 4]

lstm out 1   0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000
-0.0226  0.0012  0.1373  0.0064  0.0766
 0.1174  0.1793  0.0684  0.0029  0.0138
[torch.DoubleTensor of size 4x5]

input 2   0
[torch.DoubleTensor of size 4]

lstm out 2   0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000
-0.0325  0.0143  0.2019  0.0113  0.1202
 0.1606  0.2348  0.1093  0.0045  0.0208
[torch.DoubleTensor of size 4x5]

time series prediction

hi all,
just wondering if this library can be used for either multivariate or univariate time series prediction.
Is there already an example of this?

Many thanks,

Sequencer Problem since last update

Recently I updated the RNN package and the, up to this point, working script wouldn't run anymore.
To ensure the problem is not (completely) on my side I tested the script in the README

require 'rnn'

batchSize = 8
rho = 5
hiddenSize = 10
nIndex = 10000

mlp = nn.Sequential()
   :add(nn.Recurrent(
      hiddenSize, nn.LookupTable(nIndex, hiddenSize), 
      nn.Linear(hiddenSize, hiddenSize), nn.Sigmoid(), 
      rho)
   )
   :add(nn.Linear(hiddenSize, nIndex))

rnn = nn.Sequencer(mlp)

criterion = nn.SequencerCriterion(nn.ClassNLLCriterion())

-- dummy dataset (task is to predict next item, given previous)
sequence = torch.randperm(nIndex)

offsets = {}
for i=1,batchSize do
   table.insert(offsets, math.ceil(math.random()*batchSize))
end
offsets = torch.LongTensor(offsets)

lr = 0.1
i = 1
while true do
   -- prepare inputs and targets
   local inputs, targets = {},{}
   for step=1,rho do
      -- a batch of inputs
      table.insert(inputs, sequence:index(1, offsets))
      -- increment indices
      offsets:add(1)
      for j=1,batchSize do
         if offsets[j] > nIndex then
            offsets[j] = 1
         end
      end
      -- a batch of targets
      table.insert(targets, sequence:index(1, offsets))
   end

   local outputs = rnn:forward(inputs)
   local err = criterion:forward(outputs, targets)
   print(i, err/rho)
   i = i + 1
   local gradOutputs = criterion:backward(outputs, targets)
   rnn:backward(inputs, gradOutputs)
   rnn:updateParameters(lr)
   rnn:zeroGradParameters()
end

After adding the missing ')' in line 12 after rho (is this a bug?), the script should run.
Instead it gave me the same error message as my private script, which Mr. Léonard himself had only recently corrected (the problem with the Sequencer):

.../sebastian/Torch/install/share/lua/5.1/rnn/Recurrent.lua:148: expecting at least one updateOutput
stack traceback:
        [C]: in function 'assert'
        .../sebastian/Torch/install/share/lua/5.1/rnn/Recurrent.lua:148: in function 'updateGradInputThroughTime' in function 'backwardUpdateThroughTime' in function 'updateParameters'
        ...e/sebastian/Torch/install/share/lua/5.1/nn/Container.lua:34: in function 'func'
        ...e/sebastian/Torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
        ...e/sebastian/Torch/install/share/lua/5.1/nn/Container.lua:34: in function 'updateParameters' in function 'updateParameters'
        ...e/sebastian/Torch/install/share/lua/5.1/nn/Container.lua:34: in function 'func'
        ...e/sebastian/Torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
        ...e/sebastian/Torch/install/share/lua/5.1/nn/Container.lua:34: in function 'updateParameters'
        torchrnntest.lua:55: in main chunk
        [C]: in function 'dofile'
        ...tian/Torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
        [C]: at 0x00405ea0

Every luarock was updated to the latest version.

I looked in the code and I think it happened in commit 1fc81a4
with changing:

-         if step > 1 then
-            self.gradCells[step-1] = gradCell
-         end
+         self.gradCells[step-1] = gradCell

I tried manipulating it but I think there is also a change in the sequencer part.

Can anybody reproduce the error?

I'm pretty (totally) new to Git and I'm just getting started with torch and rnn so please forgive me if this is no bug or if I am doing anything wrong.

Thank you Mr. Léonard for providing this awesome repo!

backward() vs backwardThroughTime()

I'm confused about how to use this library in such a way that it is API-compatible with the rest of torch. For other parts of torch, eg for simple MLPs, it seems that the standard pattern for training models is to do something like:

local parameters, gradParameters = model:getParameters()
local inputs, targets = getMiniBatch()

local function fEval(x)
   if parameters ~= x then parameters:copy(x) end
   local output = model:forward(inputs)
   local err = criterion:forward(output, targets)
   local df_do = criterion:backward(output, targets)
   model:backward(inputs, df_do)
   return err, gradParameters
end

optim.optimMethod(fEval, parameters)

I'd like to be able to use this package's RNN code, but train using training code I already have. However, it seems that backward() doesn't do something and we have to call backwardThroughTime() instead.

  1. Is this true? Why not make the RNN stuff API-compatible so that we can call backward()?
  2. If I'm going to add a backwardThroughTime() call to my training code, where do I put it? Do I call backward() and then backwardThroughTime()? Suppose there are LookupTable layers below the RNN. After calling backwardThroughTime(), would I need to call backward() on them?


Sharing params using clone() on Sequencer(LSTM())

I have the following architecture:

    lstm_seq = nn.Sequential()
    lstm_seq:add(nn.Linear(n_hid, n_hid))

    parallel_flows = nn.ParallelTable()
    for f=1, 2 do

    lstm = nn.Sequential()

If I check the parameters by using:

   w, dw = lstm:getParameters()

I get inconsistent sizes (dw seems to have almost twice the number of params as w).
However, when I turn off sharing params (in lstm_seq:clone()), the sizes are consistent. Do you have any idea why?


backward not working for LookupTable

I have an example using LookupTable where backward returns empty tensors (not computed).
Is this a bug?
gradInputs:
1 : DoubleTensor - size: 2x5
2 :
1 : DoubleTensor - empty
2 : DoubleTensor - empty
3 : DoubleTensor - empty

code :

require 'nn'
require 'rnn'
require 'cutorch'
require 'cunn'

batchSize = 2
rho = 3
embeddingSize = 4
dictionarySize = 10

inputs, targets = {}, {} -- inputs and outputs
for i = 1, nbfeatures do
local featureTensor=torch.Tensor(batchSize,1)
for j=1,batchSize do
table.insert(inputs, featureTensor)
for i = nbfeatures+1, rho+nbfeatures do
local measure=torch.Tensor(batchSize)
for j=1,batchSize do
table.insert(inputs, measure)
for i = 1, rho do
local measure=torch.Tensor(batchSize)
for j=1,batchSize do
table.insert(targets, measure)

b1:add(nn.JoinTable(2)) -- ->Tensor(batchSize X nbfeatures)
c:add(b2) -- ->{tensorF , {list of tensor(i)}}


p:add(nn.Sequencer(nn.LookupTable(dictionarySize, embeddingSize))) -- ->ListofTensor(batchSize X embeddingSize)
SliceList=nn.ConcatTable() -- purpose: create a list tensor created by joining tensorF & tensor(i)
for i=1, rho do
local Slice =nn.Sequential()
local cc=nn.ConcatTable() -- contains the 2 tensors to join
local a=nn.Sequential()
a:add(nn.SelectTable(2)) -- we select list of tensor(i)
a:add(nn.SelectTable(i)) -- we select a tensor(i)
local b=nn.Sequential()
b:add(nn.SelectTable(1)) -- we select tensorF
Slice:add(nn.JoinTable(2)) -- we create a single tensor = tensorF & tensor(i)
model:add(nn.Sequencer(nn.FastLSTM(embeddingSize+nbfeatures, embeddingSize, rho)))
model:add(nn.Sequencer(nn.Linear(embeddingSize, dictionarySize)))

criterion = nn.SequencerCriterion(nn.ClassNLLCriterion())

prediction = model:forward(inputsA)
err = criterion:forward(prediction, targets)
print('err=' .. err)
gradOutputs = criterion:backward(prediction, targets)
gradInputs=model:backward(inputsA, gradOutputs)

Recurrent/Modules With Table Output


I just noticed that Recurrent apparently can't be used with modules whose outputs are tables rather than tensors. The error comes in AbstractRecurrent.lua:48 where 'new' can't be called on a table. Was this intentional?


Adding minimalistic examples

This is not a bug report, but it would be useful to have a few minimal working toy examples showing usage with and without Sequencer.

Bug in new AbstractRecurrent? updateGradInputStep not reset

in method updateGradInput, I find:
self.updateGradInputStep = self.updateGradInputStep or self.step

The first BPTT pass is fine. However, if I do a second BPTT pass, self.updateGradInputStep is already initialized (it is 1 after the first pass), so it is not set to self.step again and is decremented further, even to negative values. Shouldn't updateGradInputStep be reset at some point between BPTT passes?
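To make the concern concrete, here is a toy sketch in plain Lua. It is not the library's code: `Toy`, `backwardStep`, `runSequence` and the reset inside `forget` are hypothetical stand-ins, and only the `x = x or y` idiom is taken from the line quoted above.

```lua
-- Toy stand-in for a recurrent module (hypothetical, not rnn's implementation).
local Toy = {}
Toy.__index = Toy

function Toy.new()
   return setmetatable({step = 1}, Toy)
end

-- each forward call advances the time-step counter
function Toy:forward()
   self.step = self.step + 1
end

-- one step of the backward pass, using the idiom quoted above
function Toy:backwardStep()
   self.updateGradInputStep = self.updateGradInputStep or self.step
   self.updateGradInputStep = self.updateGradInputStep - 1
   return self.updateGradInputStep
end

-- assumption: clearing the counter between sequences lets the next pass re-initialise it
function Toy:forget()
   self.step = 1
   self.updateGradInputStep = nil
end

-- n forward steps followed by n backward steps; returns the final counter value
local function runSequence(m, n)
   for _ = 1, n do m:forward() end
   local last
   for _ = 1, n do last = m:backwardStep() end
   return last
end

local stale = Toy.new()
local staleFirst = runSequence(stale, 3)    -- counter ends at 1
local staleSecond = runSequence(stale, 3)   -- never re-initialised, ends below zero

local fresh = Toy.new()
local freshFirst = runSequence(fresh, 3)
fresh:forget()
local freshSecond = runSequence(fresh, 3)   -- re-initialised, ends at 1 again

print(staleFirst, staleSecond, freshFirst, freshSecond)
```

In this toy, the counter only returns to a sensible value when something sets it back to nil between passes. Whether the library is expected to do that in forget() or elsewhere is exactly the open question.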

Question: does this package work with optim? (yes)

Thank you for this awesome package. Much wow.

I cannot use dp which all examples are based on (#60).
So I have to make sure optim is supported for architectures like these:


The second one was suggested for a many-to-one transducer in #21.

AbstractRecurrent says that BPTT happens in updateParameters(). But optim manipulates the parameters directly and never calls updateParameters().

UPDATE: Sorry, a little confused about how similar stuff happens in different places.
Apparently, as long as you decorate with the Sequencer container, optim should work fine.

optim just wants you to call backward() on your net and provide the gradParams. backward() on your Sequencer should handle all the BPTT:

Module:backward() -> 
    Sequencer:updateGradInput() -> BPTT( LSTM:updateGradInput() )
    Sequencer:accGradParameters() -> BPTT( LSTM:accGradParameters() )

If someone with a bit more insight could verify this, I would be super happy. Thanks!

Going to close this issue then.
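For readers unsure what "optim just wants you to call backward() and provide the gradParams" amounts to, here is a minimal sketch of that closure contract in plain Lua. Everything in it is an assumption made for illustration: the one-parameter model, the data, and `toySgd`, which is a hypothetical stand-in for an optim routine and not optim's code.

```lua
-- Toy 1-parameter "model": prediction = w * input (plain Lua tables, no Torch).
local params = {0.0}       -- flat parameter vector, in the role of getParameters()'s first result
local gradParams = {0.0}   -- matching flat gradient vector

local inputs, targets = {1, 2, 3}, {2, 4, 6}

-- feval follows the closure contract: feval(x) -> loss, dloss/dx
local function feval(x)
   gradParams[1] = 0                        -- gradients accumulate, so reset them first
   local loss = 0
   for t = 1, #inputs do                    -- "forward" over every time-step
      local err = x[1] * inputs[t] - targets[t]
      loss = loss + 0.5 * err * err
      gradParams[1] = gradParams[1] + err * inputs[t]   -- "backward" for this step
   end
   return loss, gradParams
end

-- Hypothetical optimiser: it only calls feval and edits x in place.
-- It never calls anything like updateParameters() on the model.
local function toySgd(f, x, lr)
   local loss, grad = f(x)
   for i = 1, #x do x[i] = x[i] - lr * grad[i] end
   return loss
end

local firstLoss, lastLoss
for i = 1, 50 do
   lastLoss = toySgd(feval, params, 0.05)
   firstLoss = firstLoss or lastLoss
end
print(firstLoss, lastLoss, params[1])
```

The point of the sketch is only that the optimiser sees a loss and a gradient vector. Anything that must happen through time has to be finished inside feval before it returns.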

Unit test fails.

The current repository (commit fdc3b21) fails several unit tests.

$ th -lrnn -e "dofile('test/test.lua'); rnn.test()"
Running 18 tests
________*_____**__  ==> Done Completed 1876 asserts in 18 tests with 22 errors
Recurrence fwd err 2
 TensorEQ(==) violation   val=0.1917265403829, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:2134: in function <test/test.lua:2105>

Recurrence fwd err 3
 TensorEQ(==) violation   val=0.18550556665035, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:2134: in function <test/test.lua:2105>

Recurrence bwd err 1
 TensorEQ(==) violation   val=0.19031328301695, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:2141: in function <test/test.lua:2105>

Recurrence bwd err 2
 TensorEQ(==) violation   val=0.12799707568363, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:2141: in function <test/test.lua:2105>

Recurrence bwd err 3
 TensorEQ(==) violation   val=0.16196581750727, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:2141: in function <test/test.lua:2105>

 Function call failed expecting at least one updateOutput
stack traceback:
    [C]: in function 'assert' in function 'updateGradInputThroughTime' in function 'backwardUpdateThroughTime' in function 'updateParameters'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'func'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'updateParameters' in function 'updateParameters'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'func'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'updateParameters'
    test/test.lua:2144: in function <test/test.lua:2105>
    [C]: in function 'xpcall'
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:186: in function '_run'
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:161: in function 'run'
    test/test.lua:2399: in function 'test'
    [string "dofile('test/test.lua'); rnn.test()"]:1: in main chunk
    [C]: in function 'pcall'
    ...lvin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:117: in main chunk
    [C]: at 0x0101c582f0

Recursor(Recurrent) bwd err 1
 TensorEQ(==) violation   val=0.0055855133430601, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1955: in function <test/test.lua:1905>

Recursor(Recurrent) fwd err 2
 TensorEQ(==) violation   val=0.20695973077736, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1954: in function <test/test.lua:1905>

Recursor(Recurrent) bwd err 2
 TensorEQ(==) violation   val=0.063109996521219, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1955: in function <test/test.lua:1905>

Recursor(Recurrent) fwd err 3
 TensorEQ(==) violation   val=0.18835985089823, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1954: in function <test/test.lua:1905>

Recursor(Recurrent) bwd err 3
 TensorEQ(==) violation   val=0.087992354034426, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1955: in function <test/test.lua:1905>

Recursor(Recurrent) fwd err 4
 TensorEQ(==) violation   val=0.20735178234061, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1954: in function <test/test.lua:1905>

Recursor(Recurrent) bwd err 4
 TensorEQ(==) violation   val=0.062143137089905, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1955: in function <test/test.lua:1905>

Recursor(Recurrent) fwd err 5
 TensorEQ(==) violation   val=0.21002205166161, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1954: in function <test/test.lua:1905>

Recursor(Recurrent) bwd err 5
 TensorEQ(==) violation   val=0.072276017437007, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1955: in function <test/test.lua:1905>

 Function call failed
/Users/Calvin/torch/install/share/lua/5.1/rnn/Recurrent.lua:148: expecting at least one updateOutput
stack traceback:
    [C]: in function 'assert'
    /Users/Calvin/torch/install/share/lua/5.1/rnn/Recurrent.lua:148: in function 'updateGradInputThroughTime' in function 'backwardUpdateThroughTime' in function 'updateParameters'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'func'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'updateParameters' in function 'updateParameters'
    test/test.lua:1958: in function <test/test.lua:1905>
    [C]: in function 'xpcall'
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:186: in function '_run'
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:161: in function 'run'
    test/test.lua:2399: in function 'test'
    [string "dofile('test/test.lua'); rnn.test()"]:1: in main chunk
    [C]: in function 'pcall'
    ...lvin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:117: in main chunk
    [C]: at 0x0101c582f0

Repeater(Recursor) output err
 TensorEQ(==) violation   val=0.17625401664154, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1136: in function <test/test.lua:1078>

Repeater(Recursor) output err
 TensorEQ(==) violation   val=0.18676221856625, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1136: in function <test/test.lua:1078>

Repeater(Recursor) output err
 TensorEQ(==) violation   val=0.18772510482446, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1136: in function <test/test.lua:1078>

Repeater(Recursor) output err
 TensorEQ(==) violation   val=0.18784248574533, condition=1e-07
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1136: in function <test/test.lua:1078>

Repeater(Recursor) gradInput err
 TensorEQ(==) violation   val=0.073757910583675, condition=1e-06
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:61: in function 'assertTensorEq'
    test/test.lua:1138: in function <test/test.lua:1078>

 Function call failed
/Users/Calvin/torch/install/share/lua/5.1/rnn/Recurrent.lua:148: expecting at least one updateOutput
stack traceback:
    [C]: in function 'assert'
    /Users/Calvin/torch/install/share/lua/5.1/rnn/Recurrent.lua:148: in function 'updateGradInputThroughTime' in function 'backwardUpdateThroughTime' in function 'updateParameters'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'func'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'updateParameters' in function 'updateParameters'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'func'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    /Users/Calvin/torch/install/share/lua/5.1/nn/Container.lua:34: in function 'updateParameters'
    test/test.lua:1141: in function <test/test.lua:1078>
    [C]: in function 'xpcall'
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:186: in function '_run'
    /Users/Calvin/torch/install/share/lua/5.1/torch/Tester.lua:161: in function 'run'
    test/test.lua:2399: in function 'test'
    [string "dofile('test/test.lua'); rnn.test()"]:1: in main chunk
    [C]: in function 'pcall'
    ...lvin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:117: in main chunk
    [C]: at 0x0101c582f0


LSTM Example

I'm working on a simple LSTM example, and I have an issue on model validation.
My idea is to train using mini-batch and validate at each epoch. The validation is made example by example and not in batch, so that I can use the same code base for prediction.

However I got this error:

th lstm_early_stop.lua
error for iteration 100 is 0.11727129280201 
/Users/fabiofumarola/torch/install/bin/luajit: ...biofumarola/torch/install/share/lua/5.1/nn/CAddTable.lua:12: inconsistent tensor size at /Users/fabiofumarola/torch/pkg/torch/lib/TH/generic/THTensorMath.c:456
stack traceback:
    [C]: in function 'add'
    ...biofumarola/torch/install/share/lua/5.1/nn/CAddTable.lua:12: in function 'updateOutput'
    ...iofumarola/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
    ...ofumarola/torch/install/share/lua/5.1/nn/ConcatTable.lua:11: in function 'updateOutput'
    ...iofumarola/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
    ...s/fabiofumarola/torch/install/share/lua/5.1/rnn/LSTM.lua:162: in function 'updateOutput'
    ...iofumarola/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    lstm_early_stop.lua:85: in function 'validate'
    lstm_early_stop.lua:106: in main chunk
    [C]: in function 'dofile'
    ...rola/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
    [C]: at 0x010e1c1180

I suppose that I am missing something about the initialisation of the LSTM's internal states. Can someone help me with this?

require 'rnn'
require 'optim'

batchSize = 50
rho = 10
hiddenSize = 64
inputSize = 4
outputSize = 1

seriesSize = 10000
seriesEval = 1000

model = nn.Sequential()
model:add(nn.FastLSTM(inputSize, hiddenSize, rho))
--model:add(nn.Linear(inputSize, hiddenSize))
model:add(nn.Linear(hiddenSize, outputSize))

criterion = nn.MSECriterion()

-- dummy dataset (task predict the next item)
dataset = torch.randn(seriesSize, inputSize)
evalset = torch.randn(seriesEval, inputSize)

-- define the index of the batch elements
offsets = {}
for i= 1, batchSize do
   table.insert(offsets, math.ceil(math.random() * batchSize))
end
offsets = torch.LongTensor(offsets)

-- method to compute a batch
function nextBatch()
    -- get a batch of inputs
    local inputs = dataset:index(1, offsets)
    -- shift the batch indexes by one, wrapping around at the end of the series
    offsets:add(1)
    for j=1,batchSize do
        if offsets[j] > seriesSize then
            offsets[j] = 1
        end
    end
    -- a batch of targets
    local targets = dataset[{{},{outputSize}}]:index(1,offsets)
    return inputs, targets
end

-- get weights and loss wrt weights from the model
x, dl_dx = model:getParameters()

-- In the following code, we define a closure, feval, which computes
-- the value of the loss function at a given point x, and the gradient of
-- that function with respect to x. x is the vector of trainable weights;
-- feval extracts a mini-batch via the nextBatch method
feval = function(x_new)
    -- copy the weights if they have changed
    if x ~= x_new then
        x:copy(x_new)
    end

    -- select a training batch
    local inputs, targets = nextBatch()

    -- reset gradients (gradients are always accumulated, to accommodate
    -- batch methods)
    dl_dx:zero()

    -- evaluate the loss function and its derivative wrt x, given mini batch
    local prediction = model:forward(inputs)
    local loss_x = criterion:forward(prediction, targets)
    model:backward(inputs, criterion:backward(prediction, targets))

    return loss_x, dl_dx
end

--function for validation
validate = function(data)
    local maxPosition = data:size()[1] - 1
    local cumulatedError = 0
    for i = 1, maxPosition do
        local x = data[i]
        local y = torch.DoubleTensor{data[i+1][4]}
        local prediction = model:forward(x)
        local err = criterion:forward(prediction, y)
        cumulatedError = cumulatedError + err
    end
    return cumulatedError / maxPosition
end

sgd_params = {
   learningRate = 0.1,
   learningRateDecay = 1e-4,
   weightDecay = 0,
   momentum = 0
}

lr = 0.1
for i = 1, 10e3 do
    -- train a mini_batch of batchSize in parallel
    _, fs = optim.sgd(feval,x, sgd_params)

    if sgd_params.evalCounter % 100 == 0 then
        print('error for iteration ' .. sgd_params.evalCounter  .. ' is ' .. fs[1] / rho)
        local validationError = validate(evalset)
        print('error on validation ' .. validationError)
    end
end

Strange interaction with nn.Container

Below is a minimal piece of code demonstrating a strange bug I have experienced.

Here, I have a typical LSTM sequence encoder. I also have a dummy class that inherits from nn.Container. The following code crashes. However, if you move the parent:__init(self) line to go after mapper:forward(), then everything is ok. It fails in dpnn.Module:

   if moduleClones then
      assert(self.modules == nil)
      self.modules = modules
      clone.modules = moduleClones

Since the container has a member called modules, this code crashes. It seems like the self pointer is wrong here or something. Any ideas about what is going on?
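One possible reading, not confirmed anywhere in this thread, is that the colon in parent:__init(self) matters: in Lua, `obj:method(arg)` is shorthand for `obj.method(obj, arg)`, so the class table itself would be passed as `self`. The plain-Lua sketch below shows that effect with hypothetical `Parent` and `Child` tables. It does not use torch.class, so treat it as an analogy and not as a diagnosis.

```lua
-- Hypothetical minimal class setup; Parent and Child are illustrative names only.
local Parent = {}
Parent.__index = Parent

function Parent.__init(self)
   self.modules = {}
end

local Child = setmetatable({}, {__index = Parent})
Child.__index = Child

-- Colon call: Parent itself becomes `self`; the instance is an ignored extra argument.
local a = setmetatable({}, Child)
Parent:__init(a)
local instanceHasOwnModules = rawget(a, 'modules') ~= nil     -- false
local classGotModules = rawget(Parent, 'modules') ~= nil      -- true

-- Every object inheriting from Parent now sees a non-nil `modules` via __index.
local other = setmetatable({}, Child)
local inheritedModules = other.modules ~= nil                 -- true

-- Dot call: the instance is passed as `self`, as intended.
local b = setmetatable({}, Child)
Parent.__init(b)
local dotCallInitialised = rawget(b, 'modules') ~= nil        -- true

print(instanceHasOwnModules, classGotModules, inheritedModules, dotCallInitialised)
```

If something similar happens here, a check such as assert(self.modules == nil) could fail on objects that never set modules themselves, because the lookup falls through to the class table.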

require 'nn'
require 'rnn'

local vocabSize = 25
local embeddingDim = 10
local rnnHidSize = 15

local lstm = nn.Sequencer(nn.LSTM(embeddingDim, rnnHidSize))
local mapper = nn.Sequential():add(nn.LookupTable(vocabSize,embeddingDim)):add(nn.SplitTable(2)):add(lstm):add(nn.SelectTable(-1))

--this is a minibatch of 'sentences'
local data = torch.rand(32,16):mul(vocabSize):ceil()

local NoopContainer, parent = torch.class('nn.NoopContainer', 'nn.Container')

function NoopContainer:__init()
    parent:__init(self)
    local length = 12
    local dd = torch.rand(32,length):mul(vocabSize):ceil()
    mapper:forward(dd)
end
local noop = nn.NoopContainer()

Inconsistent behavior of the Sequencer module

Our model contains a Sequencer that wraps a dropout module.
In the testing phase we call model:evaluate(), but the sharedClones field of the Sequencer is not updated correctly: the train field is set to false only in the first module of sharedClones and remains true in the rest. As a result, the dropout module keeps its training behavior instead of its testing behavior.
Could you please check it out?
Many thanks for your help,
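The behaviour described above can be pictured with a small plain-Lua toy. The tables and functions below are hypothetical and do not reproduce the Sequencer's implementation. They only contrast "switch the first clone" with "switch every clone".

```lua
-- Hypothetical container holding per-time-step clones (not rnn's Sequencer).
local function newDropoutLike()
   return {train = true}
end

local seq = {
   train = true,
   sharedClones = {newDropoutLike(), newDropoutLike(), newDropoutLike()},
}

-- Behaviour described in the report: only the first clone is switched.
local function evaluateFirstOnly(s)
   s.train = false
   s.sharedClones[1].train = false
end

-- Assumed fix: switch every clone, so each time-step uses test-time behaviour.
local function evaluateAll(s)
   s.train = false
   for _, clone in ipairs(s.sharedClones) do
      clone.train = false
   end
end

-- number of clones still flagged as training
local function countTraining(s)
   local n = 0
   for _, clone in ipairs(s.sharedClones) do
      if clone.train then n = n + 1 end
   end
   return n
end

evaluateFirstOnly(seq)
local leftInTraining = countTraining(seq)   -- clones 2 and 3 are still training
evaluateAll(seq)
local afterFix = countTraining(seq)         -- none left in training mode
print(leftInTraining, afterFix)
```

A quick check along the lines of countTraining, run on the real model after model:evaluate(), might help confirm whether every clone was actually switched.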

Help understanding what encInSeq, decInSeq, decOutSeq are in the encoder and decoder example.

Hello! I just found this library and have been trying to use seq2seq learning to flip a sequence of numbers. However, I'm stuck trying to understand how each of these tensors in the encoder-decoder example...


...each relate to a seq2seq model like this:
Image of Model
My current understanding is that encInSeq is the input tensor given to the encoder network, decInSeq is the input tensor given to the decoder network, and decOutSeq is the expected output tensor of the decoder. Is this correct? The fact that these tensors are 2x3 and 2x4 doesn't seem to match this understanding. I'm sorry if this question seems obvious, but I have been trying to figure this out for a few days now. (I'm new to using Torch for RNNs.) Thanks!
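For what it's worth, here is how the three sequences are commonly laid out for a "reverse the numbers" task, sketched in plain Lua tables. The `GO` and `EOS` token ids and the `makeExample` helper are assumptions made for this illustration and are not taken from the library's example.

```lua
-- Assumed special tokens; the ids are hypothetical.
local GO, EOS = 11, 12

-- Build one training triple for the "reverse the sequence" task.
local function makeExample(seq)
   local encIn, decIn, decOut = {}, {GO}, {}
   for i = 1, #seq do
      encIn[i] = seq[i]
      local reversed = seq[#seq - i + 1]
      decIn[i + 1] = reversed      -- decoder input is the target shifted right by one step
      decOut[i] = reversed         -- decoder target at the same step
   end
   decOut[#seq + 1] = EOS          -- the decoder is asked to finish with EOS
   return encIn, decIn, decOut
end

local encIn, decIn, decOut = makeExample({1, 2, 3})
print(table.concat(encIn, ' '))    -- 1 2 3
print(table.concat(decIn, ' '))    -- 11 3 2 1
print(table.concat(decOut, ' '))   -- 3 2 1 12
```

Under these assumptions the encoder input has 3 steps and both decoder sequences have 4. With a batch of two such examples that would give 2x3 and 2x4 tensors, if the first dimension is the batch.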


Hey Nicholas,
Thanks for adding in the BiSequencer. I see that code contains nn.ReverseTable() but I'm unable to find it in the nn or nnx packages. Am I missing something? I get the following error, of course:

rnn/BiSequencer.lua:48: attempt to call field 'ReverseTable' (a nil value)

GRU init problem

probably a dumb issue, but still can't figure it out -
trying to init a GRU (e.g., r = nn.GRU(1,1)) gives an 'attempt to call field 'GRU' (a nil value)' error.

Same code for LSTM works fine. Would appreciate any help.

How to use MaskZero with LSTM and nn.ClassNLLCriterion for variable length sequences

Hi Guys,
I tried to use an LSTM to handle variable-length sequences, but I could not get it to work with MaskZero. Could you please help me out? Thanks a lot!!

Here is a minimal code example of what I mean:

require 'rnn'
require 'optim'

inSize = 20
batchSize = 2
hiddenSize = 10
seqLengthMax = 11
numSeq = 30

x, y1 = {}, {}

for i = 1, numSeq do
   local seqLength = torch.random(1,seqLengthMax)
   local temp = torch.zeros(seqLengthMax, inSize)
   local targets ={}
   if seqLength == seqLengthMax then
         targets = (torch.rand(seqLength)*numTargetClasses):ceil()
      targets =,(torch.rand(seqLength)*numTargetClasses):ceil())
      temp[{{seqLengthMax-seqLength+1,seqLengthMax}}] = torch.randn(seqLength,inSize)
   table.insert(x, temp)
   table.insert(y1, targets)

model = nn.Sequencer(
      :add(nn.MaskZero(nn.Linear(hiddenSize, numTargetClasses),1))

criterion = nn.SequencerCriterion(nn.MaskZero(nn.ClassNLLCriterion(),1))

output = model:forward(x)

err = criterion:forward(output, y1)

return undeclared variable

Not sure if it matters. In LSTM.lua, lines 279 and 300 return a global variable gradInput, which I guess is nil since it is undeclared.
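As background for the guess above: in plain Lua, reading a name that is neither a local nor an assigned global gives nil and raises no error. The sketch below shows this with a hypothetical `compute` function, plus one way to surface such reads while debugging. `withStrictGlobals` is an illustrative helper, not part of Torch or rnn.

```lua
local function compute()
   local result = 42          -- the value we meant to return
   return reslt               -- typo: resolved as an unset global, which is nil
end

local value = compute()
print(value)                  -- nil

-- Illustrative helper: temporarily make reads of undeclared globals raise an error.
local function withStrictGlobals(fn)
   local old = getmetatable(_G)
   setmetatable(_G, {__index = function(_, name)
      error('undeclared global: ' .. tostring(name), 2)
   end})
   local ok, err = pcall(fn)
   setmetatable(_G, old)      -- restore the previous behaviour
   return ok, err
end

local ok, err = withStrictGlobals(compute)
print(ok, err)
```

Running the suspicious code path under a guard like this would show whether the returned gradInput is really an undeclared global at those lines.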

recurrent-visual-attention.lua issue

Hello, thank you for the great project!
After training a model using recurrent-visual-attention.lua:

  1. How do I use the model to get a picture's visual attention?
  2. When I ran:
    th recurrent-visual-attention.lua --cuda --xpPath /home/silva/save/silva-XPS-8300:1441270258:1.dat
    I got the following error:
    /home/silva/torch/install/bin/luajit: /home/silva/torch/install/share/lua/5.1/torch/File.lua:262: unknown Torch class <optim.ConfusionMatrix>
    stack traceback:
    [C]: in function 'error'
    /home/silva/torch/install/bin/luajit: /home/silva/torch/install/share/lua/5.1/torch/File.lua:262: unknown Torch class <optim.ConfusionMatrix>

rnn with cltorch?

Hi, are you planning to add support for using this library with OpenCL through cltorch and clnn?

Possible problem in Sequencer backward on inputs of different lengths


I just updated the rnn package and am now having an error from the backward step in the Sequencer module. I have a network with some LSTMs in it, wrapped inside Sequencer modules. When I try to train my network on inputs of variable length, I get this error:
Sequencer.lua:81: gradOutput should have as many elements as input

If I train my networks on inputs all the same length, I don't run into the error.

I wasn't able to replicate the problem in a simple contained network, so I cannot provide a minimal working example. However, this problem did not occur prior to update.
I do notice that before updating, my network wasn't using Recursor modules (they did not exist in my version), and after updating it is using them. For example, if I add a non-recurrent module inside a Sequencer (e.g. Dropout), it gets printed with Recursor when I print the model (nn.Sequencer @ nn.Recursor @ nn.Dropout), whereas before updating it was printed without it (nn.Sequencer @ nn.Dropout).

I know this is not a very detailed description, but do you have any idea what might be the source of this problem?
I will keep debugging this.

Dropout "sizes do not match" error on backward pass

I am trying to train a stacked LSTM by calling forward/backward one step at a time. I need to do this because I run out of memory if I use a Sequencer on the whole mini-batch of sequences. If I just run the forward pass all is well; but when I add in the backward pass I get the error:

...Dropout.lua:42: bad argument #2 to 'cmul' (sizes do not match at /home/ubuntu/torch/extra/cutorch/lib/THC/

Here is a code snippet for my model and feval function (I am using Optim for training):

model = nn.Sequential()
model:add(nn.LookupTable(vocab_size, 512))
model:add(nn.FastLSTM(512, opt.rnn_size, opt.seq_length))
model:add(nn.FastLSTM(opt.rnn_size, opt.rnn_size, opt.seq_length))
model:add(nn.FastLSTM(opt.rnn_size, opt.rnn_size, opt.seq_length))
model:add(nn.Linear(opt.rnn_size, vocab_size))
criterion = nn.ClassNLLCriterion()

x, dl_dx = model:getParameters()

function feval(x_new)
    if x ~= x_new then
        x:copy(x_new)
    end

    ------------------ get minibatch -------------------
    local inputs, targets = loader:next_batch(1)
    ------------------- forward pass -------------------

    local loss_x = 0

    outputs = {}

    for i = 1,opt.seq_length do
        local lst = model:forward(inputs[i])
        table.insert(outputs, lst)
        loss_x = loss_x + criterion:forward(lst, targets[i])
    end

    loss_x = loss_x / opt.seq_length

    for i = opt.seq_length,1,-1 do
        model:backward(inputs[i], criterion:backward(outputs[i], targets[i]))
    end

    dl_dx = torch.clamp(dl_dx,-opt.grad_clip,opt.grad_clip)

    return loss_x, dl_dx
end

AbstractRecurrent forget() execute unwanted assertion?

in this section of the code

function AbstractRecurrent:forget(offset)
   offset = offset or 0
   if self.train ~= false then
      -- bring all states back to the start of the sequence buffers
      local lastStep = self.step - 1

      if lastStep > self.rho + offset then
         local i = 1 + offset

forget() checks the boolean value self.train, but since AbstractRecurrent's parent class is nn.Container, there is no self.train, so the code inside the if ends up executing whether you called training() or evaluate().
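The Lua semantics behind this observation can be checked in isolation. In the sketch below the module tables are hypothetical. It only shows how `train ~= false` behaves when the field is unset, and it says nothing about whether training() or evaluate() actually set the field in the library.

```lua
-- Same test as in the quoted forget(): true unless train is exactly false.
local function takesTrainingBranch(module)
   return module.train ~= false
end

-- A stricter alternative that treats an unset field as "not training".
local function strictlyTraining(module)
   return module.train == true
end

local unset = {}                     -- field never assigned: train is nil
local training = {train = true}
local evaluating = {train = false}

local r1 = takesTrainingBranch(unset)        -- true, because nil ~= false
local r2 = takesTrainingBranch(training)     -- true
local r3 = takesTrainingBranch(evaluating)   -- false
local r4 = strictlyTraining(unset)           -- false
print(r1, r2, r3, r4)
```

So an unset field is enough to enter the branch, while an explicit false skips it. Printing self.train just before the check would show which of the three cases applies in practice.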

How to check out the parameters?


It seems that the library has not implemented parameters() from nn.Module, so I can hardly access the parameters of the RNN modules without changing the code of the library itself.

I wonder if there is any way to inspect the parameters from outside the class, like getParameters() in nn.Module?


Simple LSTM sequence to single output for multiple batches

We tried to implement an LSTM for a batch of multiple sequences, similar to the one for the imdb dataset. For some reason the backpropagation does not work. Here is a minimal sample of the code that causes the error, together with the corresponding error message. I used a Sequencer for the recurrent part, which is supposed to be compatible with the forward and backward functions (according to the documentation). Thank you for providing this module by the way :).

require "rnn"
require "cunn"

batch_size= 5
maxLen = 17
wordVec = 128
nWords = 10000
mode = 'GPU'

inp1 = torch.ceil(torch.rand(batch_size, maxLen)*nWords) -- 
labels = torch.ceil(torch.rand(batch_size)*2) -- create labels of 1s and 2s

lstm = nn.Sequential()
lstm:add(nn.LookupTable(nWords,wordVec, batch_size))  -- convert indices to word vectors
lstm:add(nn.SplitTable(1))  -- convert tensor to list of subtensors
lstm:add(nn.Sequencer(nn.LSTM(wordVec, wordVec))) -- lstm, no batch size here
lstm:add(nn.JoinTable(1)) -- stack list to tensor
lstm:add(nn.View(batch_size, -1, 128)) -- reshape tensor arbitrary y (maxLen)
lstm:add(nn.Mean(2))  -- average over words
lstm:add(nn.Linear(wordVec, 2)) -- project to two classes

criterion = nn.ClassNLLCriterion()

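if mode == 'GPU' then
    -- the model and criterion must live on the same device as the data
    lstm:cuda()
    criterion:cuda()
    labels = labels:cuda()
    inp1 = inp1:cuda()
end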

out = lstm:forward(inp1)

print('out', #out)  -- prints (batch_size, classes), here 5,2
print('labels', labels)  -- vector of 1s and 2s with len batch size here 5

out_crit = criterion:forward(out, labels)
print('loss', out_crit) -- scalar

gradOut = criterion:backward(out, labels)

print('gradout', #gradOut)  -- same as out 5,2

lstm:backward(inp1, gradOut) -- does not work
torch/install/share/lua/5.1/torch/Tensor.lua:460: expecting a contiguous tensor
stack traceback:
    [C]: in function 'assert'
    /home/../torch/install/share/lua/5.1/torch/Tensor.lua:460: in function 'view'
    /home/../torch/install/share/lua/5.1/nn/View.lua:85: in function 'updateGradInput'
    /home/../torch/install/share/lua/5.1/nn/Module.lua:31: in function 'backward'
    /home/../torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
    [string "require "rnn"..."]:45: in main chunk
    [C]: in function 'xpcall'
    /home/../torch/install/share/lua/5.1/itorch/main.lua:179: in function </home/../torch/install/share/lua/5.1/itorch/main.lua:143>

Numerical gradient check fails?

So this is my first time toying around with Torch modules and the like, so there's a big chance I'm overlooking something obvious. I was trying to implement an attention model, but when testing the gradients using optim.checkgrad they didn't match. I later realised that even for this simple model, I can't get them to match:

nn = require 'nn'
require 'rnn'
require 'optim'

hiddenSize = 2
nIndex = 2
r = nn.Recurrent(hiddenSize, nn.LookupTable(nIndex, hiddenSize),
                 nn.Linear(hiddenSize, hiddenSize))

rnn = nn.Sequential()
rnn:add(r) -- recurrent layer feeds the output projection
rnn:add(nn.Linear(hiddenSize, nIndex))

criterion = nn.ClassNLLCriterion()

function f(x)
  -- Do the forward prop
  local err = 0
  for i = 1, sequence:size(1) - 1 do
     local output = rnn:forward(sequence[i])
     err = err + criterion:forward(output, sequence[i + 1])
     local gradOutput = criterion:backward(output, sequence[i + 1])
     rnn:backward(sequence[i], gradOutput)
  end
  return err, grads
end

parameters, grads = rnn:getParameters()

sequence = torch.Tensor{1, 2, 1, 2}:resize(4, 1)
local err = optim.checkgrad(f, parameters:clone())

This gives me errors anywhere between 0.1 and 0.01, which is way too big. After some digging I got to these lines in Recurrent.lua. Removing them seems to fix the problem: the gradient error falls to around 1e-7.

         -- startModule's gradParams shouldn't be step-averaged
         -- as it is used only once. So un-step-average it
         local params, gradParams = self.startModule:parameters()
         if gradParams then
            for i,gradParam in ipairs(gradParams) do

I fail to see where gradParams get averaged, so I don't really understand the logic behind these lines. They seem to just scale the gradients for the initial hidden states with the number of steps?
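
For reference, a numerical gradient check of the kind discussed in this issue compares the analytic gradient with a finite-difference estimate. The following is a minimal plain-Lua sketch on a toy objective. The names f, checkGradient, and eps are illustrative, and the error measure used here (largest absolute difference) is an assumption that may differ from what optim.checkgrad reports.

```lua
-- Central-difference gradient check on a toy objective (plain Lua).
-- f, checkGradient, and eps are illustrative names, not optim's API.
local function f(x)
   -- f(x) = sum_i x_i^2, with analytic gradient 2 * x_i
   local value, grad = 0, {}
   for i = 1, #x do
      value = value + x[i] * x[i]
      grad[i] = 2 * x[i]
   end
   return value, grad
end

local function checkGradient(fn, x, eps)
   eps = eps or 1e-6
   local _, analytic = fn(x)
   local maxDiff = 0
   for i = 1, #x do
      local saved = x[i]
      x[i] = saved + eps
      local plus = fn(x)   -- only the first return value (the loss) is kept
      x[i] = saved - eps
      local minus = fn(x)
      x[i] = saved         -- restore before moving on
      local numeric = (plus - minus) / (2 * eps)
      maxDiff = math.max(maxDiff, math.abs(numeric - analytic[i]))
   end
   return maxDiff
end

local diff = checkGradient(f, {0.5, -1.25, 2.0})
print(diff)
```

A check like this is only meaningful if the function recomputes both the loss and the gradient from the argument it receives, starting from freshly zeroed gradient accumulators on every call.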

Nested Sequencer

Hey guys,

I am currently writing and testing some minimal examples of this repo.
I have trouble understanding how I should handle Sequencers in models, e.g. nested Sequencers.

Can anybody tell me how I get the inner LSTMs to work?

require 'nn'
require 'rnn'

local inputsize = 10
local outputsize = 12

local inputdata_t = torch.rand(10)

local innermodel = nn.Sequential()
  :add(nn.FastLSTM(inputsize, 2))
  :add(nn.FastLSTM(2, 5))

local model = 

model =  nn.Recurrence(model, 12, 1)

local inputs = {}
for ii=1,3 do
 table.insert(inputs, inputdata_t[ii])
end

local outputs = model:forward(inputs)

I guess I need to wrap the complete module in another Sequencer? But how do I use different rhos for different modules in the same model?

Is there some kind of hold unit which catches the outputs?

Thx for helping

Encoder-Decoder Architectures


I was looking to implement an encoder-decoder LSTM architecture. The problem I have is that there doesn't seem to be a good way to pass the output of the encoder network to the decoder network as the hidden state.

More precisely, in LSTM:updateOutput, prevOutput is initialized to zero:

   if self.step == 1 then
      prevOutput = self.zeroTensor

However, I would need a way to pass in output[-1] from the encoder network into the decoder network as prevOutput. Of course, I will also need the gradients to flow back into the encoder properly.

Is there a way to achieve this setup with your current architecture?

Thanks a lot!

Fast Tensor to Sequence Creation

There are a few datasets available saved as tensors. I've been converting them to sequences with something like the following:

    -- conversion loop
    local pixels = {}
    for k = 1, raster_size do
        pixels[k] = torch.Tensor({raw_data[k]})
    end


   sequenced_layer = rnn.Sequencer(...)

However, the conversion loop seems to be relatively slow. If converting Tensor data to sequences is a common operation, perhaps a faster Lua/C helper function would be a useful feature.
