f-lm's Introduction

F-LM

Language modeling. This codebase contains implementations of the G-LSTM and F-LSTM cells from [1]. It may also contain some ongoing experiments.

This code was forked from https://github.com/rafaljozefowicz/lm and contains the "BIGLSTM" language model baseline from [2].

The current code runs on TensorFlow r1.5 and supports multi-GPU data parallelism using synchronized gradient updates.

Perplexity

On the One Billion Words benchmark, using 8 GPUs in one DGX-1, BIG G-LSTM G4 achieved a perplexity of 24.29 after 2 weeks of training and 23.36 after 3 weeks.

On 02/06/2018 we found an issue with our experimental setup which makes the perplexity numbers listed in the paper invalid.

See current numbers in the table below.

On DGX Station, after 1 week of training using all 4 GPUs (Tesla V100) and batch size of 256 per GPU:

Model             Perplexity   Steps     WPS
BIGLSTM           35.1         ~0.99M    ~33.8K
BIG F-LSTM F512   36.3         ~1.67M    ~56.5K
BIG G-LSTM G4     40.6         ~1.65M    ~56K
BIG G-LSTM G2     36           ~1.37M    ~47.1K
BIG G-LSTM G8     39.4         ~1.7M     ~58.5K
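
As a rough sanity check, the Steps and WPS columns can be related to each other. The sketch below assumes that every training step consumes batch_size * num_steps tokens on each GPU, with num_steps=20 taken from the training command in this README; that relationship is an assumption for illustration, not something the table states.

```python
# Sketch: relate the "Steps" and "WPS" columns of the table above.
# Assumption: one step processes batch_size * num_steps tokens per GPU.

def estimated_wps(steps, batch_size, num_steps, num_gpus, seconds):
    """Estimate words per second from the step count of a fixed-length run."""
    tokens_per_step = batch_size * num_steps * num_gpus
    return steps * tokens_per_step / float(seconds)

ONE_WEEK = 7 * 24 * 60 * 60  # 604800 seconds

# Approximate step counts copied from the table above.
runs = {
    "BIGLSTM": 0.99e6,
    "BIG F-LSTM F512": 1.67e6,
    "BIG G-LSTM G4": 1.65e6,
}

for name, steps in sorted(runs.items()):
    wps = estimated_wps(steps, batch_size=256, num_steps=20,
                        num_gpus=4, seconds=ONE_WEEK)
    print("%-16s ~%.1fK words/s" % (name, wps / 1000.0))
```

Under this assumption the estimates land within a few percent of the reported WPS values, which suggests the two columns describe the same one-week runs.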

Dependencies

To run

Assuming the data directory is /raid/okuchaiev/Data/LM/LM1B/1-billion-word-language-modeling-benchmark-r13output/, execute:

export CUDA_VISIBLE_DEVICES=0,1,2,3

# One week, in seconds. The name SECONDS is avoided because bash treats it as a
# built-in timer that keeps counting up after it is assigned.
MAX_TIME=604800
LOGSUFFIX=FLSTM-F512-1week

python /home/okuchaiev/repos/f-lm/single_lm_train.py --logdir=/raid/okuchaiev/Workspace/LM/GLSTM-G4/$LOGSUFFIX --num_gpus=4 --datadir=/raid/okuchaiev/Data/LM/LM1B/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,float16_rnn=False,max_time=$MAX_TIME,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256,fact_size=512 >> train_$LOGSUFFIX.log 2>&1

python /home/okuchaiev/repos/f-lm/single_lm_train.py --logdir=/raid/okuchaiev/Workspace/LM/GLSTM-G4/$LOGSUFFIX --num_gpus=1 --mode=eval_full --datadir=/raid/okuchaiev/Data/LM/LM1B/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,float16_rnn=False,max_time=$MAX_TIME,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=1,fact_size=512

  • To use the G-LSTM cell, specify the num_of_groups parameter.
  • To use the F-LSTM cell, specify the fact_size parameter.

Note that the current data reader may miss some tokens when constructing mini-batches, which can have a minor effect on the final perplexity.

For the most accurate results, use batch_size=1 and num_steps=1 in evaluation. Thanks to Ciprian for noticing this.
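
One simple way such token loss can arise is when a reader only emits complete batch_size x num_steps windows and discards the incomplete tail. The sketch below illustrates that mechanism only; it is an assumption, and the actual reader in this repository may lose tokens for different reasons. The function name and the corpus size are hypothetical.

```python
# Sketch: count tokens that fit into complete mini-batches.
# Assumption: the reader drops any tail that does not fill a whole
# (batch_size x num_steps) window. This is NOT taken from the repository code.

def tokens_covered(num_tokens, batch_size, num_steps):
    """Number of tokens that fall inside complete mini-batches."""
    window = batch_size * num_steps
    return (num_tokens // window) * window

num_tokens = 1000003  # hypothetical corpus size, for illustration only

for batch_size, num_steps in [(256, 20), (1, 20), (1, 1)]:
    used = tokens_covered(num_tokens, batch_size, num_steps)
    print("batch_size=%d num_steps=%d skipped=%d"
          % (batch_size, num_steps, num_tokens - used))
```

With batch_size=1 and num_steps=1 the window is a single token, so under this model nothing can be skipped, which is consistent with the recommendation above.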

To change hyper-parameters

The command accepts an additional argument, --hpconfig, which allows overriding various hyper-parameters, including:

  • batch_size=128 - batch size per GPU. Global batch size = batch_size*num_gpus
  • num_steps=20 - number of LSTM cell timesteps
  • num_shards=8 - embedding and softmax matrices are split into this many shards
  • num_layers=1 - number of LSTM layers
  • learning_rate=0.2 - learning rate for optimizer
  • max_grad_norm=10.0 - maximum acceptable gradient norm for LSTM layers
  • keep_prob=0.9 - dropout keep probability
  • optimizer=0 - which optimizer to use: Adagrad(0), Momentum(1), Adam(2), RMSProp(3), SGD(4)
  • vocab_size=793470 - vocabulary size
  • emb_size=512 - size of the embedding (should be same as projected_size)
  • state_size=2048 - LSTM cell size
  • projected_size=512 - LSTM projection size
  • num_sampled=8192 - number of samples (training uses sampled softmax)
  • do_summaries=False - generate weight and grad stats for Tensorboard
  • max_time=180 - max time (in seconds) to run
  • fact_size - to use F-LSTM cell, this should be set to factor size
  • num_of_groups=0 - to use G-LSTM cell, this should be set to number of groups
  • save_model_every_min=30 - how often to checkpoint
  • save_summary_every_min=16 - how often to save summaries
  • use_residual=False - whether to use LSTM residual connections
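
To make the override format concrete, here is a minimal sketch of how a comma-separated name=value string could be applied on top of defaults. It is not the parser used by single_lm_train.py; the function parse_hpconfig and the reduced defaults dictionary are hypothetical and only mirror a few entries from the list above.

```python
# Sketch: apply a "name=value,name=value" override string to default
# hyper-parameters. Hypothetical helper, not the repository's own parser.

def parse_hpconfig(spec, defaults):
    """Return a copy of `defaults` with the overrides in `spec` applied."""
    params = dict(defaults)
    for item in filter(None, spec.split(",")):
        name, _, raw = item.partition("=")
        if name not in params:
            raise KeyError("unknown hyper-parameter: %s" % name)
        default = params[name]
        if isinstance(default, bool):
            # bool("False") would be True, so compare the text explicitly.
            params[name] = (raw == "True")
        else:
            # Reuse the type of the default value (int or float here).
            params[name] = type(default)(raw)
    return params

# A few defaults copied from the list above.
defaults = {
    "batch_size": 128,
    "num_steps": 20,
    "learning_rate": 0.2,
    "max_grad_norm": 10.0,
    "num_of_groups": 0,
    "use_residual": False,
}

hp = parse_hpconfig("batch_size=256,max_grad_norm=1,num_of_groups=4", defaults)
num_gpus = 4
print("per-GPU batch size:", hp["batch_size"])
print("global batch size: ", hp["batch_size"] * num_gpus)
```

Values that are not overridden keep their defaults, and the global batch size follows the batch_size*num_gpus rule stated in the list.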

Feedback

Forked code and GLSTM/FLSTM cells: [email protected]

References

f-lm's People

Contributors

deeplearningathome, okuchaiev, rafaljozefowicz

f-lm's Issues

using G-LSTM with dynamic-rnn

Hey. Thanks for the amazing article!

I'm trying to use G-LSTM for my cell in dynamic_rnn and I got this error:
File "/language_model.py", line 30, in __init__
loss = self._forward(i, xs[i], ys[i], lengths[i])
File "/language_model.py", line 121, in _forward
inputs=x)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 574, in dynamic_rnn
dtype=dtype)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 737, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2770, in while_loop
result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2599, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2549, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 722, in _time_step
(output, new_state) = call_cell()
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 708, in <lambda>
call_cell = lambda: cell(input_t, state)
File "/factorized_lstm_cells.py", line 172, in __call__
self._get_input_for_group(m_prev, group_id, self._group_shape[0])], axis=1)
File "/factorized_lstm_cells.py", line 129, in _get_input_for_group
name="GLSTMinputGroupCreation")
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 547, in slice
return gen_array_ops.slice(input, begin, size, name=name)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2896, in _slice
name=name)
File "/.pyenv/versions/tflow/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 499, in apply_op
repr(values), type(values).__name__))
TypeError: Expected int32 passed to parameter 'size' of op 'Slice', got [None, 128] of type 'list' instead.

It looks like it fails because of the size=[inpt.get_shape()[0].value, group_size] line: the input shape (apparently both batch size and time) is dynamic, so the static batch dimension is None.
I think it could be handled by passing batch_size directly to the cell, but if there is a better solution, I'd be grateful if you'd tell me.
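
One possible direction is to slice only along the feature axis, so the group extraction never needs to know the batch size. In TensorFlow, tf.slice accepts -1 in its size argument to mean "all remaining elements in this dimension", so size=[-1, group_size] may be worth trying; that change is untested against this repository. The sketch below shows the idea with plain Python lists rather than tensors, and the helper name input_for_group is hypothetical.

```python
# Sketch: extract one feature group from every row of a batch without
# referring to the number of rows. Plain Python lists stand in for tensors.

def input_for_group(batch, group_id, group_size):
    """Return the columns belonging to `group_id` for each row in `batch`."""
    start = group_id * group_size
    return [row[start:start + group_size] for row in batch]

# Two batches with different numbers of rows but the same feature width (4).
small_batch = [[1, 2, 3, 4]]
large_batch = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]

for batch in (small_batch, large_batch):
    print(input_for_group(batch, group_id=1, group_size=2))
```

The same function handles both batches because only the column range depends on the group, which is the property a dynamic batch dimension requires.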

Checkpoint required?

Is a checkpoint required to run the model? It keeps printing out "No checkpoint file found. Waiting...".

Error when loading checkpoint model

After training a G-LSTM, I got an error when evaluating it:

W tensorflow/core/framework/op_kernel.cc:993] Not found: Key model/lstm_0/lstm_cell/biases not found in checkpoint

This error occurs when restoring the ckpt model.

How can I solve this issue?

Can't restore model from pre-trained model link

Hi, I am trying to use the pre-trained model for evaluation, but I am seeing an error while restoring the model parameters. Is the code up to date with it?

This is the error that I see. I tried searching for some of the missing parameters in the graph.pbtxt file, but they weren't there. I tested with both the head commit and d98fb11.

$ python3 single_lm_train.py --logdir=/path/to/my/logdir --num_gpus=2 --datadir=/path/to/my/datadir --mode=eval_full --hpconfig run_profiler=False,float16_rnn=False,max_time=$SECONDS,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=4,num_of_groups=0
*****HYPER PARAMETERS*****
{'batch_size': 4, 'num_steps': 20, 'num_shards': 8, 'num_layers': 2, 'learning_rate': 0.2, 'max_grad_norm': 1.0, 'num_delayed_steps': 150, 'keep_prob': 0.9, 'optimizer': 0, 'vocab_size': 793470, 'emb_size': 1024, 'state_size': 8192, 'projected_size': 1024, 'num_sampled': 8192, 'num_gpus': 2, 'float16_rnn': False, 'float16_non_rnn': False, 'average_params': True, 'run_profiler': False, 'do_summaries': False, 'max_time': 1303, 'fact_size': None, 'fnon_linearity': 'none', 'num_of_groups': 0}
**************************
Not using groups
Not using fnonlinearities
Not using groups
Not using fnonlinearities
Not using groups
Not using fnonlinearities
Not using groups
Not using fnonlinearities
Averaging parameters for evaluation.
2017-12-23 11:35:51.468529: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2017-12-23 11:35:51.747194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:17:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2017-12-23 11:35:51.970520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.31GiB
2017-12-23 11:35:51.971259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2017-12-23 11:35:51.971284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 
2017-12-23 11:35:51.971289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y 
2017-12-23 11:35:51.971292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y 
2017-12-23 11:35:51.971299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2017-12-23 11:35:51.971303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2017-12-23 11:35:52.541605: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
2017-12-23 11:35:52.542993: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage not found in checkpoint
2017-12-23 11:35:52.544005: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_1/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
2017-12-23 11:35:52.544978: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_1/LSTMCell/W_0/ExponentialMovingAverage not found in checkpoint
2017-12-23 11:35:52.669979: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:52.772370: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:52.863129: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:53.007704: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:53.021356: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.951154: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.955047: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.959807: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.959976: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:54.967513: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:55.552041: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:55.576411: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:55.582257: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint
	 [[Node: save/RestoreV2_9 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_9/tensor_names, save/RestoreV2_9/shape_and_slices)]]
2017-12-23 11:35:55.858505: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage not found in checkpoint

...

Thanks
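
When debugging "Key ... not found in checkpoint" errors like the ones above, it can help to compare the variable names the graph requests with the names actually stored in the checkpoint. In TensorFlow 1.x the stored names can usually be listed with tf.train.NewCheckpointReader(path).get_variable_to_shape_map(). The sketch below covers only the comparison step in plain Python; compare_keys is a hypothetical helper and the checkpoint names are invented for illustration.

```python
# Sketch: find variable names that the graph expects but the checkpoint lacks,
# and names stored in the checkpoint that the graph never asks for.

def compare_keys(graph_keys, checkpoint_keys):
    """Return (missing, unused) as two sorted lists of variable names."""
    graph_keys, checkpoint_keys = set(graph_keys), set(checkpoint_keys)
    missing = sorted(graph_keys - checkpoint_keys)
    unused = sorted(checkpoint_keys - graph_keys)
    return missing, unused

# Names the evaluation graph asked for, copied from the log above.
graph_keys = [
    "model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage",
    "model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage",
]

# Hypothetical checkpoint contents, invented purely for illustration.
checkpoint_keys = [
    "model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage",
    "model/model/lstm_0/LSTMCell/biases/ExponentialMovingAverage",
]

missing, unused = compare_keys(graph_keys, checkpoint_keys)
print("missing from checkpoint:", missing)
print("unused in checkpoint:   ", unused)
```

A missing name paired with a similar-looking unused name often points to a variable scope or naming change between the code that wrote the checkpoint and the code that reads it.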
