okuchaiev / f-lm

Language Modeling

License: MIT License


f-lm's Introduction

F-LM

Language modeling. This codebase contains implementations of the G-LSTM and F-LSTM cells from [1]. It may also contain some ongoing experiments.

This code was forked from https://github.com/rafaljozefowicz/lm and contains the "BIGLSTM" language model baseline from [2].

The current code runs on TensorFlow r1.5 and supports multi-GPU data parallelism using synchronized gradient updates.
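In synchronized data parallelism, each GPU holds a copy of the model and computes gradients on its own mini-batch, and one combined update is then applied to the shared weights. The sketch below illustrates the idea in plain Python. It is not the repository's TensorFlow code: the function names are hypothetical, averaging is assumed as the way of combining gradients, and plain SGD stands in for the optimizers the code supports.

```python
def average_gradients(per_gpu_grads):
    """Average one gradient vector per GPU into a single gradient."""
    num_gpus = len(per_gpu_grads)
    return [sum(vals) / float(num_gpus) for vals in zip(*per_gpu_grads)]


def sync_sgd_step(weights, per_gpu_grads, learning_rate):
    """Apply one update shared by all replicas (plain SGD for simplicity)."""
    avg = average_gradients(per_gpu_grads)
    return [w - learning_rate * g for w, g in zip(weights, avg)]


weights = [1.0, -2.0]
# Pretend each of 4 GPUs computed a gradient on its own mini-batch.
per_gpu_grads = [[0.5, 1.0], [1.5, 1.0], [1.0, 3.0], [1.0, -1.0]]
weights = sync_sgd_step(weights, per_gpu_grads, learning_rate=0.5)
print(weights)
```

Because every replica waits for the combined gradient before updating, all copies of the weights stay identical after each step.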

Perplexity

~~On One Billion Words benchmark using 8 GPUs in one DGX-1, BIG G-LSTM G4 was able to achieve 24.29 after 2 weeks of training and 23.36 after 3 weeks.~~

On 02/06/2018 we found an issue with our experimental setup that makes the perplexity numbers listed in the paper invalid.

See current numbers in the table below.

On a DGX Station, after 1 week of training using all 4 GPUs (Tesla V100) and a batch size of 256 per GPU:

Model	Perplexity	Steps	WPS
BIGLSTM	35.1	~0.99M	~33.8K
BIG F-LSTM F512	36.3	~1.67M	~56.5K
BIG G-LSTM G4	40.6	~1.65M	~56K
BIG G-LSTM G2	36	~1.37M	~47.1K
BIG G-LSTM G8	39.4	~1.7M	~58.5K
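As a rough sanity check, the Steps and WPS columns can be related to each other. The sketch below assumes (this is not stated in the README) that each training step consumes batch_size × num_steps tokens on each GPU and that the run lasted exactly one week. The helper name is hypothetical.

```python
# Rough consistency check between the Steps and WPS columns.
# Assumption (not from the README): one step consumes
# batch_size * num_steps tokens on each of the num_gpus GPUs.

WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800, the max_time used in the run command


def approx_words_per_second(steps, batch_size=256, num_steps=20, num_gpus=4,
                            seconds=WEEK_SECONDS):
    """Estimate throughput from the total number of training steps."""
    tokens_per_step = batch_size * num_steps * num_gpus
    return steps * tokens_per_step / float(seconds)


if __name__ == "__main__":
    reported_steps = {
        "BIGLSTM": 0.99e6,
        "BIG F-LSTM F512": 1.67e6,
        "BIG G-LSTM G4": 1.65e6,
    }
    for name, steps in reported_steps.items():
        print("%s: ~%.1fK words/s" % (name, approx_words_per_second(steps) / 1e3))
```

For these three rows the estimate lands within a couple of percent of the reported WPS, which suggests the assumption is roughly right. It remains an approximation, since the step counts in the table are themselves rounded.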

Dependencies

TensorFlow r1.5
Python 2.7 (should work with Python 3 too)
1B Word Benchmark Dataset (see https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark to get the data)

To run

Assuming the data directory is /raid/okuchaiev/Data/LM/LM1B/1-billion-word-language-modeling-benchmark-r13output/, execute:

export CUDA_VISIBLE_DEVICES=0,1,2,3

# One week in seconds. SECONDS is a special bash variable that keeps counting up after assignment, so a different name is used here.
MAX_TIME=604800
LOGSUFFIX=FLSTM-F512-1week

# Train on 4 GPUs, appending all output to a log file.
python /home/okuchaiev/repos/f-lm/single_lm_train.py --logdir=/raid/okuchaiev/Workspace/LM/GLSTM-G4/$LOGSUFFIX --num_gpus=4 --datadir=/raid/okuchaiev/Data/LM/LM1B/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,float16_rnn=False,max_time=$MAX_TIME,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256,fact_size=512 >> train_$LOGSUFFIX.log 2>&1

# Evaluate the trained model on 1 GPU with batch_size=1.
python /home/okuchaiev/repos/f-lm/single_lm_train.py --logdir=/raid/okuchaiev/Workspace/LM/GLSTM-G4/$LOGSUFFIX --num_gpus=1 --mode=eval_full --datadir=/raid/okuchaiev/Data/LM/LM1B/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,float16_rnn=False,max_time=$MAX_TIME,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=1,fact_size=512

To use the G-LSTM cell, specify the num_of_groups parameter.
To use the F-LSTM cell, specify the fact_size parameter.

Note that the current data reader may miss some tokens when constructing mini-batches, which can have a minor effect on the final perplexity.

For the most accurate results, use batch_size=1 and num_steps=1 in evaluation. Thanks to Ciprian for noticing this.
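To see why batch_size=1 and num_steps=1 give the most faithful evaluation, consider one common way a reader can lose tokens: the token stream is cut into fixed blocks of batch_size × num_steps tokens and the incomplete tail is discarded. The sketch below is a simplified illustration under that assumption, not the repository's actual data reader, and the function name is hypothetical.

```python
def tokens_covered(num_tokens, batch_size, num_steps):
    """Count tokens that fit into full (batch_size x num_steps) mini-batches.

    Assumption for illustration: an incomplete final mini-batch is dropped.
    """
    block = batch_size * num_steps
    return (num_tokens // block) * block


num_tokens = 1000003  # arbitrary example size, not a dataset statistic
for batch_size, num_steps in [(256, 20), (1, 20), (1, 1)]:
    covered = tokens_covered(num_tokens, batch_size, num_steps)
    print("batch_size=%d num_steps=%d -> %d tokens skipped"
          % (batch_size, num_steps, num_tokens - covered))
```

With a block size of 1, every token fits into some mini-batch, so nothing is skipped under this model.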

To change hyper-parameters

The command accepts an additional argument, --hpconfig, which overrides various hyper-parameters, including:

batch_size=128 - batch size per GPU. Global batch size = batch_size*num_gpus
num_steps=20 - number of LSTM cell timesteps
num_shards=8 - embedding and softmax matrices are split into this many shards
num_layers=1 - number of LSTM layers
learning_rate=0.2 - learning rate for optimizer
max_grad_norm=10.0 - maximum acceptable gradient norm for LSTM layers
keep_prob=0.9 - dropout keep probability
optimizer=0 - which optimizer to use: Adagrad(0), Momentum(1), Adam(2), RMSProp(3), SGD(4)
vocab_size=793470 - vocabulary size
emb_size=512 - size of the embedding (should be the same as projected_size)
state_size=2048 - LSTM cell size
projected_size=512 - LSTM projection size
num_sampled=8192 - number of samples for the sampled softmax used in training
do_summaries=False - generate weight and grad stats for Tensorboard
max_time=180 - max time (in seconds) to run
fact_size - to use F-LSTM cell, this should be set to factor size
num_of_groups=0 - to use G-LSTM cell, this should be set to number of groups
save_model_every_min=30 - how often to checkpoint
save_summary_every_min=16 - how often to save summaries
use_residual=False - whether to use LSTM residual connections
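
The --hpconfig value is a comma-separated list of name=value overrides. The sketch below shows one way such a string could be applied on top of defaults. It is an illustration in plain Python, not the parser the repository uses; `DEFAULTS` and `apply_hpconfig` are hypothetical names, and only a subset of the parameters listed above is included.

```python
# Subset of the defaults listed above, for illustration only.
DEFAULTS = {
    "batch_size": 128,
    "num_steps": 20,
    "num_layers": 1,
    "learning_rate": 0.2,
    "max_grad_norm": 10.0,
    "keep_prob": 0.9,
    "do_summaries": False,
    "num_of_groups": 0,
}


def apply_hpconfig(hpconfig, defaults=DEFAULTS):
    """Return a copy of defaults with 'name=value,...' overrides applied.

    Each value is converted to the type of the corresponding default.
    """
    params = dict(defaults)
    for item in filter(None, hpconfig.split(",")):
        name, _, raw = item.partition("=")
        if name not in params:
            raise KeyError("unknown hyper-parameter: %s" % name)
        default = params[name]
        if isinstance(default, bool):
            params[name] = raw.strip().lower() == "true"
        else:
            params[name] = type(default)(raw)
    return params


params = apply_hpconfig("batch_size=256,num_layers=2,max_grad_norm=1,num_of_groups=4")
num_gpus = 4  # passed separately via --num_gpus
print("global batch size: %d" % (params["batch_size"] * num_gpus))
```

Parameters that are not overridden keep their defaults, and the global batch size follows from batch_size*num_gpus as described in the list.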

Feedback

Forked code and GLSTM/FLSTM cells: [email protected]

References

[1] Factorization tricks for LSTM networks, ICLR 2017 workshop.
[2] Exploring the Limits of Language Modeling


f-lm's Issues

Why has the download link for the pre-trained model been removed?

Can it still be used with the latest repo code? Thanks.

using G-LSTM with dynamic-rnn

Hey. Thanks for the amazing article!

I'm trying to use G-LSTM for my cell in dynamic_rnn and I got this error:
File "/language_model.py", line 30, in __init__
loss = self._forward(i, xs[i], ys[i], lengths[i])
File "/language_model.py", line 121, in _forward
inputs=x)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 574, in dynamic_rnn
dtype=dtype)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 737, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2770, in while_loop
result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2599, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2549, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 722, in _time_step
(output, new_state) = call_cell()
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 708, in <lambda>
call_cell = lambda: cell(input_t, state)
File "/factorized_lstm_cells.py", line 172, in call
self._get_input_for_group(m_prev, group_id, self._group_shape[0])], axis=1)
File "/factorized_lstm_cells.py", line 129, in _get_input_for_group
name="GLSTMinputGroupCreation")
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 547, in slice
return gen_array_ops.slice(input, begin, size, name=name)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2896, in _slice
name=name)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 499, in apply_op
repr(values), type(values).__name__))
TypeError: Expected int32 passed to parameter 'size' of op 'Slice', got [None, 128] of type 'list' instead.

It looks like it fails on the size=[inpt.get_shape()[0].value, group_size] line, because the input size (apparently both batch size and time) is dynamic.
I think it could be handled by passing batch_size directly to the cell, but if there is a better solution, I'd be grateful if you'd tell me.

Checkpoint required?

Is a checkpoint required to run the model? It keeps printing out "No checkpoint file found. Waiting...".

Error when loading checkpoint model

After training a G-LSTM, I got an error when evaluating it:

W tensorflow/core/framework/op_kernel.cc:993] Not found: Key model/lstm_0/lstm_cell/biases not found in checkpoint

This error occurs when restoring the ckpt model.

How can I solve this issue?

Can't restore model from pre-trained model link

Hi, I am trying to use the pre-trained model for evaluation, but I am seeing an error while restoring the model parameters. Is the code up to date with it?

This is the error that I see. I tried searching for some of the missing parameters in the graph.pbtxt file, but they weren't there. I tested with both the head commit and d98fb11.

$ python3 single_lm_train.py --logdir=/path/to/my/logdir --num_gpus=2 --datadir=/path/to/my/datadir --mode=eval_full --hpconfig run_profiler=False,float16_rnn=False,max_time=$SECONDS,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=4,num_of_groups=0
*****HYPER PARAMETERS*****
{'batch_size': 4, 'num_steps': 20, 'num_shards': 8, 'num_layers': 2, 'learning_rate': 0.2, 'max_grad_norm': 1.0, 'num_delayed_steps': 150, 'keep_prob': 0.9, 'optimizer': 0, 'vocab_size': 793470, 'emb_size': 1024, 'state_size': 8192, 'projected_size': 1024, 'num_sampled': 8192, 'num_gpus': 2, 'float16_rnn': False, 'float16_non_rnn': False, 'average_params': True, 'run_profiler': False, 'do_summaries': False, 'max_time': 1303, 'fact_size': None, 'fnon_linearity': 'none', 'num_of_groups': 0}
**************************
Not using groups
Not using fnonlinearities
Not using groups
Not using fnonlinearities
Not using groups
Not using fnonlinearities
Not using groups
Not using fnonlinearities
Averaging parameters for evaluation.
2017-12-23 11:35:51.468529: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2017-12-23 11:35:51.747194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:17:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2017-12-23 11:35:51.970520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.31GiB
2017-12-23 11:35:51.971259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2017-12-23 11:35:51.971284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 
2017-12-23 11:35:51.971289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y 
2017-12-23 11:35:51.971292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y 
2017-12-23 11:35:51.971299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2017-12-23 11:35:51.971303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2017-12-23 11:35:52.541605: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
2017-12-23 11:35:52.542993: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage not found in checkpoint
2017-12-23 11:35:52.544005: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_1/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
2017-12-23 11:35:52.544978: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_1/LSTMCell/W_0/ExponentialMovingAverage not found in checkpoint
2017-12-23 11:35:52.669979: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:52.772370: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:52.863129: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:53.007704: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:53.021356: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.951154: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.955047: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.959807: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.959976: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.967513: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:55.552041: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:55.576411: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:55.582257: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:55.858505: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint

...

Thanks

pretrain model

Hi, could you please share the pre-trained model?
