Efficient, reusable RNNs and LSTMs for torch

License: MIT License

torch-rnn's Introduction

torch-rnn

torch-rnn provides high-performance, reusable RNN and LSTM modules for torch7, and uses these modules for character-level language modeling similar to char-rnn.

You can find documentation for the RNN and LSTM modules here; they have no dependencies other than torch and nn, so they should be easy to integrate into existing projects.

Compared to char-rnn, torch-rnn is up to 1.9x faster and uses up to 7x less memory. For more details see the Benchmark section below.

Installation

Docker Images

Cristian Baldi has prepared Docker images for both CPU-only mode and GPU mode; you can find them here.

System setup

You'll need to install the header files for Python 2.7 and the HDF5 library. On Ubuntu you should be able to install like this:

sudo apt-get -y install python2.7-dev
sudo apt-get install libhdf5-dev

Python setup

The preprocessing script is written in Python 2.7; its dependencies are in the file requirements.txt. You can install these dependencies in a virtual environment like this:

virtualenv .env                  # Create the virtual environment
source .env/bin/activate         # Activate the virtual environment
pip install -r requirements.txt  # Install Python dependencies
# Work for a while ...
deactivate                       # Exit the virtual environment

Lua setup

The main modeling code is written in Lua using torch; you can find installation instructions here. You'll need the following Lua packages:

torch
nn
optim
lua-cjson
torch-hdf5

After installing torch, you can install / update these packages by running the following:

# Install most things using luarocks
luarocks install torch
luarocks install nn
luarocks install optim
luarocks install lua-cjson

# We need to install torch-hdf5 from GitHub
git clone https://github.com/deepmind/torch-hdf5
cd torch-hdf5
luarocks make hdf5-0-0.rockspec

CUDA support (Optional)

To enable GPU acceleration with CUDA, you'll need to install CUDA 6.5 or higher and the following Lua packages:

cutorch
cunn

You can install / update them by running:

luarocks install cutorch
luarocks install cunn

OpenCL support (Optional)

To enable GPU acceleration with OpenCL, you'll need to install the following Lua packages:

cltorch
clnn

You can install / update them by running:

luarocks install cltorch
luarocks install clnn

OSX Installation

Jeff Thompson has written a very detailed installation guide for OSX that you can find here.

Usage

To train a model and use it to generate new text, you'll need to follow three simple steps:

Step 1: Preprocess the data

You can use any text file for training models. Before training, you'll need to preprocess the data using the script scripts/preprocess.py; this will generate an HDF5 file and JSON file containing a preprocessed version of the data.

If you have training data stored in my_data.txt, you can run the script like this:

python scripts/preprocess.py \
  --input_txt my_data.txt \
  --output_h5 my_data.h5 \
  --output_json my_data.json

This will produce files my_data.h5 and my_data.json that will be passed to the training script.

There are a few more flags you can use to configure preprocessing; read about them here.

Step 2: Train the model

After preprocessing the data, you'll need to train the model using the train.lua script. This will be the slowest step. You can run the training script like this:

th train.lua -input_h5 my_data.h5 -input_json my_data.json

This will read the data stored in my_data.h5 and my_data.json, run for a while, and save checkpoints to files with names like cv/checkpoint_1000.t7.

You can change the RNN model type, hidden state size, and number of RNN layers like this:

th train.lua -input_h5 my_data.h5 -input_json my_data.json -model_type rnn -num_layers 3 -rnn_size 256

By default this will run in GPU mode using CUDA; to run in CPU-only mode, add the flag -gpu -1.

To run with OpenCL, add the flag -gpu_backend opencl.

There are many more flags you can use to configure training; read about them here.

Step 3: Sample from the model

After training a model, you can generate new text by sampling from it using the script sample.lua. Run it like this:

th sample.lua -checkpoint cv/checkpoint_10000.t7 -length 2000

This will load the trained checkpoint cv/checkpoint_10000.t7 from the previous step, sample 2000 characters from it, and print the results to the console.

By default the sampling script will run in GPU mode using CUDA; to run in CPU-only mode add the flag -gpu -1 and to run in OpenCL mode add the flag -gpu_backend opencl.

There are more flags you can use to configure sampling; read about them here.

Benchmarks

To benchmark torch-rnn against char-rnn, we use each to train LSTM language models for the tiny-shakespeare dataset with 1, 2 or 3 layers and with an RNN size of 64, 128, 256, or 512. For each we use a minibatch size of 50, a sequence length of 50, and no dropout. For each model size and for both implementations, we record the forward/backward times and GPU memory usage over the first 100 training iterations, and use these measurements to compute the mean time and memory usage.

All benchmarks were run on a machine with an Intel i7-4790k CPU, 32 GB main memory, and a Titan X GPU.

Below we show the forward/backward times for both implementations, as well as the mean speedup of torch-rnn over char-rnn. We see that torch-rnn is faster than char-rnn at all model sizes, with smaller models giving a larger speedup; for a single-layer LSTM with 128 hidden units, we achieve a 1.9x speedup; for larger models we achieve about a 1.4x speedup.

Below we show the GPU memory usage for both implementations, as well as the mean memory saving of torch-rnn over char-rnn. Again torch-rnn outperforms char-rnn at all model sizes, but here the savings become more significant for larger models: for models with 512 hidden units, we use 7x less memory than char-rnn.

TODOs

Get rid of Python / JSON / HDF5 dependencies?

torch-rnn's Issues

About Multi-threading

Is this project multi-threaded?
If not, can someone guide me on how to make it multi-threaded?

Are the defaults supposed to be a 1:1 match for char-rnn?

I'm trying the shakespeare dataset on torch-rnn and char-rnn using all defaults in both cases. I'm training without a GPU. Before doing any kind of tweaking of model size and/or dropout it seems char-rnn gets a much better training loss and validation loss. Should the loss be approximately the same for these two models? torch-rnn is stuck at about 1.6 to 1.7 where char-rnn manages a considerably better loss with fewer epochs.

A quick examination of sampled text seems to show that char-rnn has performed better as well: Almost every word sampled is an actual word, torch-rnn hasn't got the vocabulary right yet.

Is this just a question of different defaults for model parameters/hyperparameters, or is something wrong?

Checkpoint saving error when writing to json file

Hey, this is the error I'm getting when checkpoints are being saved while training the RNN.

/Users/yuweixu/torch/install/bin/luajit: ./util/utils.lua:52: attempt to index local 'f' (a nil value)
stack traceback:
    ./util/utils.lua:52: in function 'write_json'
    train.lua:219: in main chunk
    [C]: in function 'dofile'
    ...eixu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x01087e1ad0

I believe it is because:
https://github.com/jcjohnson/torch-rnn/blob/master/train.lua#L217
is creating a directory rather than a json file.

Invalid multinomial distribution (sum of probabilities <= 0)

After training a new 3 x 1024 model (dropout=0.1, batchnorm=1) for a couple of days, I got the following error:

$ th mysample.lua -checkpoint /nas/doc/nn/checkpoint_200.t7.reset.t7 -length 2000 -gpu -1
Loading /nas/doc/nn/checkpoint_200.t7.reset.t7
Loaded
/home/alekz/torch/install/bin/luajit: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /home/alekz/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:120)
stack traceback:
[C]: at 0x7ffad932e1d0
[C]: in function 'multinomial'
./LanguageModel.lua:196: in function 'sample'
mysample.lua:43: in main chunk
[C]: in function 'dofile'
...lekz/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d70

What can be the problem?

How to use a trained char-based RNN language model?

A dumb question: after training a char-based RNN language model on a large data set (say, millions of words), how could one use the model for different tasks like sentiment analysis?

Most of the examples I've seen so far combine model training and sentiment analysis in one go (normally a LookupTable as the first layer for word embeddings, then two LSTM layers after it, etc.). I was wondering whether it would be more efficient to reuse a pre-trained RNN model and repurpose it for classification tasks.

Is it possible and how? Thanks!

Here are some sentiment analysis examples:
https://github.com/Element-Research/rnn/blob/master/examples/sequence-to-one.lua
https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py

How to extend this code to deal with variable length sequences?

Hi jcjohnson,

First of all, thanks for sharing!

I'm looking for a way of using this fast rnn code for speech processing. But it seems that the current code only works for fixed length sequences. Do you know how to extend it for variable length sequences? Thanks!

BTW, is there a typo at row 105 of VanillaRNN.lua? : "h: Sequence of hidden states, of shape (T, N, H)" => (N, T, H)?

Best,
Jun

attempt to call method 'resetSize' (a nil value)

My friend and I are both encountering this error when attempting to run train.lua. Any advice would be appreciated.

/share/apps/torch/20151009/intel/bin/luajit: ./LanguageModel.lua:81: attempt to call method 'resetSize' (a nil value)
stack traceback:
        ./LanguageModel.lua:81: in function 'forward'
        train.lua:117: in function 'opfunc'
        ...e/apps/torch/20151009/intel/share/lua/5.1/optim/adam.lua:33: in function 'adam'
        train.lua:174: in main chunk
        [C]: in function 'dofile'
        ...rch/20151009/intel/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
        [C]: at 0x00405430

Maybe need a `paths.mkdir('cv')` somewhere?

I get:

./util/utils.lua:52: attempt to index local 'f' (a nil value)

I think that it's because the cv directory hasn't been created?

cannot convert 'struct THCudaTensor *' to 'float'

I'm just trying to set this up and when I go to run train I run into this issue:

Running with CUDA on GPU 0
/home/alex/torch/install/bin/luajit: /home/alex/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
/home/alex/torch/install/share/lua/5.1/nn/THNN.lua:109: bad argument #5 to 'v' (cannot convert 'struct THCudaTensor *' to 'float')

Seems like someone made this an issue on the torch/nn repo: torch/nn#694
Any ideas?

Implementing seq2seq learning?

Is there any simple way to implement a forwardConnect and backwardConnect method to create a seq2seq network similar to the encoder-decoder-coupling example found here? Thanks

Sampling outputs garbage, encoding problem?

My model outputs only garbage when sampling (tiny shakespeare):
th sample.lua -checkpoint cv/checkpoint_4000.t7 -length 500

yAmPhaiOp!Rhe.c&Mr;vSghiBKXezpy me Bfe;
WZP:LMa.XZodWeF'
zux'd I reaRecheboKhxFk3So go ear h,UdFyboxRfZvEHxapJzPpy qmZYBHYQ SwR ze? fra 
...

I followed the installation procedure, prepared the tiny shakespeare data and trained the model as explained.

prepro
python scripts/preprocess.py --input_txt data/tiny-shakespeare.txt --output_h5 data/tiny-shakespeare.h5 --output_json data/tiny-shakespeare.json

training
th train.lua -input_h5 data/tiny-shakespeare.h5 -input_json data/tiny-shakespeare.json

training is blazingly fast, but when sampling, the model outputs garbage, even when trained longer (e.g. 17k steps, i.e., 50 epochs). At 4000 steps char-rnn outputs are great already.

Am I missing something obvious?

scripts/preprocess.py should sort tokens lexicographically

Currently, indexes are assigned to tokens in order of first occurrence. If the text is changed (think of fixing a typo, or continuing to train a pre-trained model on a different corpus), the indexes might be reassigned, which will break subsequent training initialized from a checkpoint.
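To illustrate the difference, here is a minimal Python sketch, not the actual code of scripts/preprocess.py. The function names are hypothetical, and the 1-based indexing is an assumption borrowed from the JSON output shown elsewhere on this page. It compares first-occurrence indexing with sorted indexing on two texts that contain the same characters in a different order.

```python
def build_vocab_first_occurrence(text):
    """Assign 1-based indices in the order characters first appear (hypothetical helper)."""
    token_to_idx = {}
    for ch in text:
        if ch not in token_to_idx:
            token_to_idx[ch] = len(token_to_idx) + 1
    return token_to_idx


def build_vocab_sorted(text):
    """Assign 1-based indices in sorted (code point) order (hypothetical helper)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)), start=1)}


original = "hello world"
edited = "world hello"  # same character set, different order of first occurrence

# First-occurrence indexing depends on where each character first shows up.
print(build_vocab_first_occurrence(original) == build_vocab_first_occurrence(edited))

# Sorted indexing depends only on the set of characters.
print(build_vocab_sorted(original) == build_vocab_sorted(edited))
```

Note that sorting only keeps indices stable while the set of characters stays the same; adding or removing a character would still shift the indices that come after it.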

Nuance for setup: `sudo apt-get install -y python2.7-dev`

Hi Justin,

Slight nuance for setup: I think one needs to also do sudo apt-get -y install python2.7-dev, prior to running the pip install command.

Also, it failed with 'cant install numpy', so maybe need to reorder the requirements.txt file? But I installed like this:

pip install Cython
pip install numpy
pip install -r requirements.txt

(or perhaps re-order the requirements.txt file? or file a bug report with

I also needed to:

sudo apt-get install libhdf5-dev

...prior to installing the hdf5 luarocks. I think?

init_from fails

Ran the training for a while before quitting it with ctrl+c. Restarting throws this error:

th train.lua -init_from cv/checkpoint_80000.t7 -gpu -1
Running in CPU mode
Initializing from cv/checkpoint_7000.t7
Epoch 1.00 / 50, i = 1 / 17800, loss = 12.962780
Epoch 1.01 / 50, i = 2 / 17800, loss = 11.756465
Epoch 1.01 / 50, i = 3 / 17800, loss = 11.562723
Epoch 1.01 / 50, i = 4 / 17800, loss = 11.154325
Epoch 1.01 / 50, i = 5 / 17800, loss = 10.792833
Epoch 1.02 / 50, i = 6 / 17800, loss = 10.426821
Epoch 1.02 / 50, i = 7 / 17800, loss = 10.035604
Epoch 1.02 / 50, i = 8 / 17800, loss = 9.824880
Epoch 1.03 / 50, i = 9 / 17800, loss = 9.523673
Epoch 1.03 / 50, i = 10 / 17800, loss = 9.234539
Epoch 1.03 / 50, i = 11 / 17800, loss = 9.018742
Epoch 1.03 / 50, i = 12 / 17800, loss = 8.913129
Epoch 1.04 / 50, i = 13 / 17800, loss = 8.635086
/Users/spardy/torch/install/bin/luajit: /Users/spardy/torch/install/share/lua/5.1/nn/Container.lua:69:
In 1 module of nn.Sequential:
...rs/spardy/torch/install/share/lua/5.1/nn/LookupTable.lua:62: index out of range at /Users/spardy/torch/pkg/torch/lib/TH/generic/THTensorMath.c:156
stack traceback:
[C]: in function 'index'
...rs/spardy/torch/install/share/lua/5.1/nn/LookupTable.lua:62: in function <...rs/spardy/torch/install/share/lua/5.1/nn/LookupTable.lua:56>
[C]: in function 'xpcall'
/Users/spardy/torch/install/share/lua/5.1/nn/Container.lua:65: in function 'rethrowErrors'
/Users/spardy/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
train.lua:124: in function 'opfunc'
/Users/spardy/torch/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
train.lua:181: in main chunk
[C]: in function 'dofile'
...ardy/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x010f1f3d50

I get the same error, at the same index no matter what restart file I choose. Restarting the training from scratch does not give any errors and I am able to sample from any of the restart files without error.

Sampling: incremental printing of generated text

While using sample.lua to generate text to inspect how the quality is changing between checkpoints, it would be nice if each character could be printed as it's generated (like in char-rnn) so you don't have to wait for the full sample to be created before you see any output. This wait is particularly painful if you need to generate large amounts of text to see how it's going; for my metadata training, I need 10-20k characters to see it sample from several different authors, and it takes something like an hour to do that much.

sample.lua calls LanguageModel.lua, and inside sample, I think this could be done by adding a print call using the next_char variable?

Memory allocation/leak issue

This is a weird one, and took me a while to track down. I believe torch/torch-rnn has a memory leak that survives the process ending (?!). Crazy talk, I know - but try the following:

Train a network, any network.
In a terminal, run 'top'
Create a loop of hundreds or thousands of sample actions (I was running 'th sample.lua -checkpoint words/20160323-0001/cv/cp_w512_d2_s3_b10_1000.t7 -start_text "hgiusdhss unwiieorw 99jsowlr4 " -length 135 -temperature 1' and comparing subtle variations in output)
Watch memory creep up. On my system, it takes about 95-105 iterations for total memory usage to creep from less than 1 GB to *GB, and to start using swap. I can do this in under two minutes.

I've waited hours and days for the allocated memory to free up, and it doesn't. Luajit/Lua is not running in any way. If I let this go on long enough, Ubuntu starts killing processes due to lack of memory - eventually the system halts or I reboot it.

My system has 8 GB of RAM and a Nvidia GTX 750 Ti GPU, is running Ubuntu 15.10, and is using gcc/g++ 4.9.3 with NVIDIA 352.63, Cuda 7.5.18 and CudNN 4.0. I installed torch7 from the git repo, and installed it with alternately luajit 2.1rc1 and lua52 - both display this issue.

I noticed this first when sampling trained networks, but I believe it occurs in training as well; it's harder to detect because training iterations are far longer.

Checkpoints not being written

th train.lua -input_h5 data.h5 -input_json data.json -rnn_size 2068 -dropout 0.3 -num_layers 3 -checkpoint_every 100

comes to a standstill when it's time to write a checkpoint. I also tried -checkpoint_every 1000.

Comparing 'likeliness' of some input against trained model

Hi Justin,

Apologies that I'm not much of a machine learning guy, but is it possible to use torch-rnn to compare the likeliness of an input text against a model trained using some corpus? So for example, with an RNN trained on Shakespeare, this function would return a low numerical score for the input 'The cat sat on the mat' (intuitively: this sentence was not likely written by Shakespeare), and a high numerical score for the input 'O Romeo, where art thou?' (intuitively: this sentence was likely written by Shakespeare). Obviously both inputs are too short to make a strong distinction, but I hope the question makes sense.

Thank you for your amazing work,
Chris
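One common way to get such a score is the average log-probability per character that a language model assigns to the input. The Python sketch below uses a tiny add-one-smoothed bigram model as a stand-in for a trained RNN. All names are hypothetical and nothing here calls torch-rnn; with a real model, the per-character probabilities would come from the network's softmax output instead.

```python
import math
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count character-to-next-character transitions (toy stand-in for a trained RNN)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    vocab = sorted(set(corpus))
    return counts, vocab


def avg_log_prob(text, counts, vocab):
    """Mean log-probability per transition; higher means 'more like the corpus'."""
    if len(text) < 2:
        raise ValueError("need at least two characters to score")
    total = 0.0
    steps = 0
    for prev, nxt in zip(text, text[1:]):
        row = counts.get(prev, Counter())
        # Add-one smoothing; the extra +1 reserves some mass for unseen characters.
        p = (row[nxt] + 1) / (sum(row.values()) + len(vocab) + 1)
        total += math.log(p)
        steps += 1
    return total / steps


counts, vocab = train_bigram("thou art thou art thou")
print(avg_log_prob("thou art", counts, vocab))  # text resembling the corpus
print(avg_log_prob("zzqx zz", counts, vocab))   # text unlike the corpus
```

Averaging over the length keeps short and long inputs comparable; summing instead would penalise longer inputs regardless of style.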

sample.lua flag of -sample 0 raises error

Hi!

I've been working with this code for a few weeks now, and it's fantastic! I've found a possible issue - when I pass a -sample 0 flag to sample.lua (to force argmax output vice softmax), it fails with the following error:

~/torch-rnn$ th sample.lua -checkpoint /tmp/cp_512_2_130800.t7 -length 20000 -sample 0
~/torch/install/bin/luajit: ./LanguageModel.lua:185: bad argument #1 to 'copy' (torch.*Tensor expected, got nil)
stack traceback:
[C]: in function 'copy'
./LanguageModel.lua:185: in function 'sample'
sample.lua:41: in main chunk
[C]: in function 'dofile'
...five/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405ea0

Gradually increasing calculation time

What can the gradually increasing calculation time be attributed to?

$ th mytrain.lua -input_h5 data/input.txt.h5 -input_json data/input.txt.json -gpu -1
Running in CPU mode
New model created
number of parameters in the model: 21297834
Epoch 1.00 / 50, i = 1 / 10050, loss = 3.801416, time = 236.572506
Epoch 1.01 / 50, i = 2 / 10050, loss = 3.950289, time = 231.743912
Epoch 1.01 / 50, i = 3 / 10050, loss = 4.013554, time = 260.275881
Epoch 1.02 / 50, i = 4 / 10050, loss = 4.037642, time = 232.696755
Epoch 1.02 / 50, i = 5 / 10050, loss = 3.753084, time = 237.560788
Epoch 1.03 / 50, i = 6 / 10050, loss = 3.104723, time = 305.237002
Epoch 1.03 / 50, i = 7 / 10050, loss = 3.063986, time = 486.629820
Epoch 1.04 / 50, i = 8 / 10050, loss = 2.992056, time = 705.056393
Epoch 1.04 / 50, i = 9 / 10050, loss = 2.963458, time = 757.198806
Epoch 1.05 / 50, i = 10 / 10050, loss = 2.858334, time = 779.982924
. . .
Epoch 1.10 / 50, i = 21 / 10050, loss = 2.618500, time = 1102.372405
Epoch 1.11 / 50, i = 22 / 10050, loss = 2.627150, time = 1093.363219
Epoch 1.11 / 50, i = 23 / 10050, loss = 2.656708, time = 1124.799454
Epoch 1.12 / 50, i = 24 / 10050, loss = 2.622511, time = 1146.338546
. . . etc.

The processing time for all models I tried before (< 3 x 512) was very stable. But this one (3 x 1024) shows this behavior. The used memory is always the same (12G), with no swapping, the same OMP_NUM_THREADS=4, and the same input.txt. Am I missing anything?

Updated training?

I'm very new to the whole world of neural networks so please forgive any silly questions.

Is there a way to update a network by training it on new text? The undocumented -init_from flag looks like it might do that, but I can't quite be sure.

Sampling can't be run in CPU mode without cunn and cutorch

#9 solves this

Thank you !

Never found the exact way to just say thank you on Github.
Maybe it is missing some basic "like" button :)

So here it is : thank you for sharing this amazing work.

Alex

Using non-ASCII characters in start_text crashes sample.lua

Having non-ASCII characters in arguments, particularly UTF-8 (which is my terminal encoding), breaks the sample.lua start_text functionality. Here's sample output, where I try to initialize the network with the Russian word for "test":

th sample.lua -checkpoint models/test/checkpoint_27350.t7 -length 1000 -sample 1 -gpu -1 -temperature 1 -start_text тест
/home/vostrosa/torch/install/bin/luajit: ./LanguageModel.lua:129: Got invalid idx
stack traceback:
        [C]: in function 'assert'
        ./LanguageModel.lua:129: in function 'encode_string'
        ./LanguageModel.lua:174: in function 'sample'
        sample.lua:41: in main chunk
        [C]: in function 'dofile'
        ...ator/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405ec0

This is very unfortunate, since most of the datasets I train networks on consist mostly of Russian UTF-8 encoded text, and I'm unable to preseed the network. My guess is that it treats UTF-8 as a single-byte encoding, which would explain why it yields invalid indices.
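The guess above can be illustrated in Python without touching torch-rnn. This sketch assumes a vocabulary keyed by whole characters (variable names are hypothetical) and shows that splitting the same text byte by byte produces tokens the vocabulary has never seen. It does not confirm what LanguageModel.lua actually does.

```python
text = "тест"
as_bytes = text.encode("utf-8")

print(len(text))      # number of code points
print(len(as_bytes))  # number of bytes; each Cyrillic letter here takes two bytes

# Vocabulary built per character, 1-based (an assumption for this sketch).
token_to_idx = {ch: i for i, ch in enumerate(sorted(set(text)), start=1)}

# Treating every byte as its own one-character token, as a byte-oriented loop would.
byte_tokens = [bytes([b]).decode("latin-1") for b in as_bytes]
missing = [tok for tok in byte_tokens if tok not in token_to_idx]
print(len(missing))   # how many byte-level tokens have no index
```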

dropout enabled during sampling!

Going through the code, I found that dropout layers aren't disabled during sampling. This certainly isn't expected behaviour.

After training, I manually set all the dropout layers p variable to 0, and the probability distributions for a given initiation sequence stopped changing. Without doing this, I'd get different results across runs.

This presumably also means that dropout isn't being disabled during validation.

Can anyone confirm that this is a bug? If it is, I can do a fix in LanguageModel.lua.
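For readers unfamiliar with why this matters, here is a generic inverted-dropout sketch in plain Python. It is not Torch's implementation, and the function and variable names are hypothetical. It shows that outputs are randomised in training mode and pass through unchanged in evaluation mode, which is the behaviour one would expect during sampling and validation.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout: while training, zero each value with probability p and
    scale the survivors by 1/(1-p); otherwise return the values unchanged."""
    if not training or p == 0.0:
        return list(values)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]


hidden = [0.5, -1.2, 0.3, 0.8]
rng = random.Random(0)

train_out = dropout(hidden, 0.5, True, rng)    # every entry is either zeroed or doubled
eval_out_1 = dropout(hidden, 0.5, False, rng)  # identical to the input
eval_out_2 = dropout(hidden, 0.5, False, rng)  # identical again, so runs are repeatable

print(train_out)
print(eval_out_1 == eval_out_2 == hidden)
```

In Torch's nn package, calling evaluate() on a module is the usual way to switch dropout layers to pass-through mode; whether the sampling and validation code does so is exactly what this issue asks.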

scripts/preprocess.py --encoding flag not documented

The --encoding flag isn't documented. It should probably be added to flags.md.

Thanks for this!

preprocess.py doesn't handle unicode character sequences correctly

Great project!

I just ran into one small problem with text containing emojis. These are currently not encoded correctly by preprocess.py:

Test 😀!

Outputs the following json:

{"idx_to_token": {"1": "T", "2": "e", "3": "s", "4": "t", "5": " ", "6": "\ud83d", "7": "\ude00", "8": "!", "9": "\n"}, "token_to_idx": {"!": 8, " ": 5, "e": 2, "\ude00": 7, "\n": 9, "s": 3, "T": 1, "\ud83d": 6, "t": 4}}

As you can see, the emoji has been broken into two characters: \ud83d and \ude00. cjson throws an error when it attempts to decode this since \ud83d is not a valid unicode character.

I prototyped a fix in Python3.3+ based on this SO question that I can submit a pull request for, but that requires updating print and unrelated code for Python 3 as well. I'm not sure what the proper fix is for Python 2.x.
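As a rough illustration of the Python 3 behaviour mentioned above (a sketch only, not the prototyped fix, which is not shown here): Python 3.3+ strings iterate by code point, so the emoji stays a single token. Writing the JSON with ensure_ascii=False emits it as a literal character rather than as \u escapes. Whether lua-cjson accepts that output is not verified here.

```python
import json

text = "Test 😀!\n"

# Collect tokens in order of first occurrence, one per code point.
tokens = []
for ch in text:
    if ch not in tokens:
        tokens.append(ch)

idx_to_token = {str(i): ch for i, ch in enumerate(tokens, start=1)}

# ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes.
encoded = json.dumps(idx_to_token, ensure_ascii=False)
print(encoded)

# For comparison: the same emoji takes two 16-bit code units in UTF-16,
# which is where the \ud83d and \ude00 halves come from.
print("😀".encode("utf-16-be").hex())
```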

Can torch-rnn be used with 0 test and validation data?

When I set --val_frac and --test_frac to "0" or to a value that produces val/test sizes less than batch_size * seq_length, I'm getting index errors:

Set to "0":

Running in CPU mode
/home/alekz/torch/install/bin/luajit: bad argument #2 to '?' (too many indices provided at /tmp/luarocks_torch-scm-1-7800/torch7/generic/Tensor.c:929)
stack traceback:
[C]: at 0x7fb3edaf0b40
[C]: in function '__index'
./util/DataLoader.lua:30: in function '__init'
/home/alekz/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/alekz/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'DataLoader'
train.lua:75: in main chunk
[C]: in function 'dofile'
...lekz/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d70

Set to 0.0001:

/home/alekz/torch/install/bin/luajit: bad argument #2 to '?' (end index out of bound at /tmp/luarocks_torch-scm-1-7800/torch7/generic/Tensor.c:967)
stack traceback:
[C]: at 0x7fcff2850b40
[C]: in function '__index'
./util/DataLoader.lua:30: in function '__init'
/home/alekz/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/alekz/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'DataLoader'
train.lua:74: in main chunk
[C]: in function 'dofile'
...lekz/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d70
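The arithmetic behind the second case can be sketched in Python. The helper names and the corpus size are hypothetical, and the batch size and sequence length of 50 are taken from the benchmark settings in the README rather than confirmed as train.lua defaults. How the scripts actually compute split sizes is also an assumption here.

```python
def split_sizes(total_chars, val_frac, test_frac):
    """Return (train, val, test) sizes in characters (assumed splitting rule)."""
    val = int(val_frac * total_chars)
    test = int(test_frac * total_chars)
    return total_chars - val - test, val, test


def num_batches(split_size, batch_size=50, seq_length=50):
    """Number of full minibatches of batch_size sequences of seq_length characters."""
    return split_size // (batch_size * seq_length)


train, val, test = split_sizes(1000000, 0.0001, 0.0001)  # hypothetical corpus size
print(val, num_batches(val))      # roughly a hundred characters, not enough for one batch
print(train, num_batches(train))  # the training split still has plenty
```

A split needs at least batch_size * seq_length characters (2500 with the values above) to form a single batch, so a fraction that yields fewer characters leaves that split empty.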

preprocess.py uses np.uint32 when there are 256 items in the vocabulary

256 different values should fit in a byte, but I believe this line has a bug, and goes to uint32 when there are more than 255:

https://github.com/jcjohnson/torch-rnn/blob/master/scripts/preprocess.py#L47
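Whether 256 values fit in a byte depends on where the indices start. The JSON output shown in the emoji issue above uses keys starting at "1". If indices are 1-based, a vocabulary of 256 tokens needs the index 256, which exceeds the uint8 maximum of 255. The Python sketch below (hypothetical helper names, not the code at the linked line) makes the boundary explicit for both conventions.

```python
def smallest_uint_bits(max_index):
    """Return the narrowest unsigned width (8, 16 or 32 bits) that can hold max_index."""
    for bits in (8, 16, 32):
        if max_index <= 2 ** bits - 1:
            return bits
    raise ValueError("index does not fit in 32 bits")


def bits_for_vocab(vocab_size, one_based=True):
    """Width needed for the largest index of a vocabulary of the given size."""
    max_index = vocab_size if one_based else vocab_size - 1
    return smallest_uint_bits(max_index)


print(bits_for_vocab(255))                   # 1-based, largest index 255
print(bits_for_vocab(256))                   # 1-based, largest index 256
print(bits_for_vocab(256, one_based=False))  # 0-based, largest index 255
```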

Stateful RNN mod in sampling

Will just removing the self:resetStates() lines in the sample function in LanguageModel.lua make it a stateful RNN, thus preserving long-term context?

Sample h5 and json files don't work

I believe the sample tiny_shakespeare.h5 and tiny_shakespeare.json files are from an earlier version of this repo and will break when training. I re-ran the preprocess.py script and it now works.

Instructions step 2, should be only one hyphen before each option I think?

th train.lua -input_h5 my_data.h5 -input_json my_data.json

The initial bias of the forget gate

I'm still trying to understand how it all works together. A question: how/where is the initial bias of the forget gate set? According to some papers (e.g. http://www.datascienceassn.org/content/empirical-exploration-recurrent-network-architectures), it should be set between 1 and 2 to "encourage learning". In char-rnn it's set to 1. But I can't find it in torch-rnn.

Can I add softmax layer at the top?

Hi jcjohnson,
Thanks for sharing the code. It's really helpful.
I want to add nn.LogSoftMax() as the top layer but couldn't find a way to do it because of the batch and depth sizes.
Is it possible to add an nn.LogSoftMax module?

Thanks in advance.

error on `train.lua`: ';' expected near ')' at line 579

I installed all dependencies and preprocessed a text file (I also tried with the provided shakespeare.txt). However, train.lua throws an error. What could be the cause of this?

➜  torch-rnn git:(master) ✗ th train.lua -input_h5 my_data.h5 -input_json my_data.json
/home/tom/gits/torch/install/bin/luajit: /home/tom/gits/torch/install/share/lua/5.1/trepl/init.lua:363: /home/tom/gits/torch/install/share/lua/5.1/trepl/init.lua:363: /home/tom/gits/torch/install/share/lua/5.1/hdf5/ffi.lua:56: ';' expected near ')' at line 579
stack traceback:
    [C]: in function 'error'
    /home/tom/gits/torch/install/share/lua/5.1/trepl/init.lua:363: in function 'require'
    train.lua:6: in main chunk
    [C]: in function 'dofile'
    ...gits/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
    [C]: at 0x00405d60

OpenCL backend slower than CPU

Running the tinyshakespeare dataset with the default settings, I get timings of around 0.3s/iteration with CPU, but using the OpenCL backend I get more like 2.6s/iteration. These timings seem to be similar whether or not benchmarking is enabled. Running char-rnn the timings are approximately reversed (around 3s/iteration CPU, 0.3s/iteration GPU).

Running on OS X 10.10, with a Radeon 5770.

libhdf5-dev unavailable on latest Ubuntu?

I was trying to install this today, and 'sudo apt-get install libhdf5-dev' failed with "Unable to locate package libhdf5-dev". Apparently libhdf5-dev doesn't exist at all on the most recent Ubuntu version, or something like that? Is there any good workaround for this? I'm a Linux newbie so I could be overlooking something.

cannot find lua package "optim"

$ luarocks install optim

Error: No results matching query were found.

Is it possible to save checkpoints without resetting/clearing the state of the model?

From what I understand, without resetting the internal states, "torch.save" saves everything, so the checkpoint file can be gigabytes in size. However, char-rnn manages it somehow (I'm currently trying to understand how it's done, but I'm not sure it's compatible with torch-rnn).

Is it possible to implement a similar way of saving in torch-rnn?

Training stops at 10th iteration

I'm attempting to train a moderately sized dataset using torch-rnn. When I tested on a smaller dataset, it ran successfully and I was able to sample a checkpoint. However, now that I am on the larger dataset, it gets to the 10th iteration (i = 10 / 482600), and then it stops printing; it reached i = 10 in about 10 seconds, but has hung on i = 10 for 60+ minutes. I've tried on a couple of different datasets and training parameters and am getting similar behaviour.

Is the lack of printing expected behaviour? (i.e. code is still running but it doesn't print), or is this an error?

Any help or suggestions would be most appreciated.

[SOLVED] Sorry for the notifications everybody!

Cutorch illegal memory access was encountered

Hi,

First, thanks a lot for open sourcing this project :)

I don't know much at all about Torch, CUDA, etc. so I could be missing something. However some fruitless googling has led me here hoping you can help.

I'm training a model (3 layers, 512 rnn size, batch size 64, sequence length 128) on an AWS GPU instance using the AMI from https://github.com/brotchie/torch-ubuntu-gpu-ec2-install. After a little while, the training process crashes with the following error:

Epoch 1.31 / 50, i = 12999 / 2086300, loss = 1.071470
Epoch 1.31 / 50, i = 13000 / 2086300, loss = 1.066751
val_loss = 2.1398722386657
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-4337/cutorch/lib/THC/generic/THCStorage.cu line=48 error=77 : an illegal memory access was encountered /usr/local/bin/luajit: cuda runtime error (77) : an illegal memory access was encountered at /tmp/luarocks_cutorch-scm-1-4337/cutorch/lib/THC/generic/THCStorage.c:147

I'll try using the OpenCL backend but after reading this I don't look forward to it... #11

Thanks

[feature request] Character-level features

I recently trained torch-rnn neural nets on ANSI art: rodarmor.com/artnet

I didn't do color, because there didn't seem to be an efficient or elegant way of representing color in the training data.

My feature request is for something like character-level features, so that there is some way to tell the neural net that a character has some feature, in this case color.

I'm not sure the best way to support this though. Perhaps a preprocessing mode where you specify that you want the preprocessor to consume pairs of characters, and the first in the pair is the character, and the second in the pair is the feature.

For encoding a red "hello" followed by a blue "frank", I could feed it in as:

hReRlRlRoRfBrBaBnBkB

The framework doesn't have to know anything about what the features represent; it would just concatenate the character and the features when providing input to the neural network.
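A minimal Python sketch of the proposed pair format follows. It is hypothetical and not part of torch-rnn or its preprocessing script: it splits the example string into (character, feature) pairs and builds one input vector per pair by concatenating a one-hot vector for the character with a one-hot vector for the feature.

```python
def split_pairs(encoded):
    """Split 'hReR...' into [(char, feature), ...]; expects an even-length string."""
    if len(encoded) % 2 != 0:
        raise ValueError("expected an even number of characters")
    return list(zip(encoded[0::2], encoded[1::2]))


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode(pairs):
    """Concatenate a one-hot character vector and a one-hot feature vector per pair."""
    chars = sorted({c for c, _ in pairs})
    feats = sorted({f for _, f in pairs})
    return [
        one_hot(chars.index(c), len(chars)) + one_hot(feats.index(f), len(feats))
        for c, f in pairs
    ]


pairs = split_pairs("hReRlRlRoRfBrBaBnBkB")
vectors = encode(pairs)
print("".join(c for c, _ in pairs))  # the text without features
print("".join(f for _, f in pairs))  # the feature attached to each character
print(len(vectors[0]))               # distinct characters plus distinct features
```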

Install tutorial for Mac users

I've written up a rather detailed set of installation instructions for Mac users:
http://www.jeffreythompson.org/blog/2016/03/25/torch-rnn-mac-install/

If this is helpful in any way, please feel free to link to and/or use!

Resume training from checkpoints

One feature in char-rnn is the ability to take a .t7 checkpoint and resume training with it. This is very helpful for recovering from crashes, OOMs, system shutdowns (say, from a cat stepping on one's laptop power button several days into training), accidental C-cs, etc. It can also be useful for training on one dataset and then training on another.

cmd:option('-init_from', '', 'initialize network parameters from checkpoint at this path')

(Yes, I know I can use char-rnn, but I like torch-rnn's efficiency and this seems to be the main feature I'm missing from char-rnn at the moment.)

I have a question in util.gradcheck

Here
https://github.com/jcjohnson/torch-rnn/blob/master/util/gradcheck.lua#L11
instead of this:

return math.abs(x - y) / math.max(math.abs(x) + math.abs(y), h)

shouldn't it be this?

return math.abs(x - y) / math.max(math.abs(x) , math.abs(y), h)
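Both expressions are usable definitions of relative error; they differ only in the normalisation. A Python transcription of the two lines (the default value of h is an assumption for this sketch) shows how they compare on a few inputs.

```python
def rel_error_sum(x, y, h=1e-12):
    """|x - y| divided by |x| + |y|, as in the first expression."""
    return abs(x - y) / max(abs(x) + abs(y), h)


def rel_error_max(x, y, h=1e-12):
    """|x - y| divided by the larger of |x| and |y|, as in the second expression."""
    return abs(x - y) / max(abs(x), abs(y), h)


for x, y in [(1.0, 1.001), (1.0, -1.0), (0.0, 0.0)]:
    print(x, y, rel_error_sum(x, y), rel_error_max(x, y))
```

Since max(|x|, |y|) <= |x| + |y| <= 2 * max(|x|, |y|), the two results never differ by more than a factor of two. The sum-based form is bounded by 1 and the max-based form by 2, so the choice mainly affects which threshold a gradient check should use.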

Add support for bidirectional RNNs

Hi, I am wondering how to implement bidirectional RNNs by internally changing the order in which the sequence is fed, instead of reversing the sequence beforehand.

Taking LSTM as an example, can one just change x[{{}, t}] to x[{{}, T-t+1}] in the following two lines?
https://github.com/jcjohnson/torch-rnn/blob/master/LSTM.lua#L155
https://github.com/jcjohnson/torch-rnn/blob/master/LSTM.lua#L243

For info, I think can install torch-hdf5 as `hdf5`?

Hi Justin,

Nice library :-)

For hdf5, seems like one can install it using luarocks install hdf5? This will install: https://raw.githubusercontent.com/torch/rocks/master/hdf5-20-0.rockspec , which I think brings down latest master branch?

Batchnorm vs dropout

I wonder if anybody has made any comparisons between batch normalization and dropout, and/or a combination of them. I've seen some papers on that, they state that for text processing the difference is not as pronounced as, for example, for images, but I did not see any proper comparisons.

torch-rnn does not make use of multiple CPU cores

I've got a 4-core CPU with hyper-threading and no GPU. With char-rnn, htop shows that all 8 threads are almost 100% busy, but torch-rnn uses only one thread (all other cores are completely idle), which makes it several times slower than char-rnn in CPU-only mode. Is it possible to add multi-core support to torch-rnn?

Element-Research rnn

Hi @jcjohnson,

So I work on the Element-Research/rnn package. I have a couple of questions:

why did you choose to build your own LSTM/RNN implementations?
what does torch-rnn package do better? What does rnn do worse?
would you be open to merging these repositories?

The last questions came out from the NVIDIA GPU Tech Conference this week. If you prefer, we can talk offline.

Regards,
Nicholas Leonard