
pytorch_gbw_lm's Introduction

PyTorch Large-Scale Language Model

A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset

Latest Results

  • 39.98 Perplexity after 5 training epochs using LSTM Language Model with Adam Optimizer
  • Trained in ~26 hours using 1 Nvidia V100 GPU (~5.1 hours per epoch) with 2048 batch size (~10.7 GB GPU memory)

Previous Results

  • 46.47 Perplexity after 5 training epochs on a 1-layer, 2048-unit, 256-projection LSTM Language Model [3]
  • Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
  • Implemented Sampled Softmax and Log-Uniform Sampler functions
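For reference, the perplexity numbers above are the exponential of the mean per-word cross-entropy loss; a tiny sketch of that conversion (the loss value below is hypothetical, chosen to match the latest result):

```python
import math

# Perplexity is exp(mean per-word cross-entropy in nats), so a reported
# perplexity of ~39.98 corresponds to a mean loss of ln(39.98) ~= 3.69 nats/word.
mean_cross_entropy = 3.69  # hypothetical value, nats per word
perplexity = math.exp(mean_cross_entropy)
print(f"perplexity = {perplexity:.2f}")  # ~40
```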

GPU Hardware Requirement

Type                 LM Memory Size   GPU
w/o tied weights     ~9 GB            Nvidia 1080 Ti, Nvidia Titan X
w/ tied weights [6]  ~7 GB            Nvidia 1070 or higher
  • There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
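A minimal sketch of how embedding/softmax weight tying typically looks in PyTorch (hypothetical module for illustration, not the repository's exact model code):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal sketch: share the embedding matrix with the output projection."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        # Tying: the decoder reuses the embedding weight, so only one
        # (vocab_size x hidden_size) matrix is stored instead of two.
        self.decoder.weight = self.embed.weight

    def forward(self, tokens, hidden=None):
        output, hidden = self.rnn(self.embed(tokens), hidden)
        return self.decoder(output), hidden
```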

Hyper-Parameters [3]

Parameter                  Value
# Epochs                   5
Training Batch Size        128
Evaluation Batch Size      1
BPTT                       20
Embedding Size             256
Hidden Size                2048
Projection Size            256
Tied Embedding + Softmax   False
# Layers                   1
Optimizer                  AdaGrad
Learning Rate              0.10
Gradient Clipping          1.00
Dropout                    0.01
Weight-Decay (L2 Penalty)  1e-6
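As an illustration only, the table above could be collected into a single training configuration like the one below; the key names are hypothetical and are not the repository's actual command-line flags:

```python
# Hypothetical mapping of the hyper-parameter table to a config dictionary.
config = {
    "epochs": 5,
    "train_batch_size": 128,
    "eval_batch_size": 1,
    "bptt": 20,                # truncated BPTT length in tokens
    "embed_size": 256,
    "hidden_size": 2048,
    "proj_size": 256,
    "tied": False,             # embedding/softmax weights not tied
    "num_layers": 1,
    "optimizer": "adagrad",
    "lr": 0.10,
    "clip": 1.00,              # gradient clipping norm
    "dropout": 0.01,
    "weight_decay": 1e-6,      # L2 penalty
}
```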

Setup - Torch Data Format

  1. Download Google Billion Word Dataset for Torch - Link
  2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
  3. Install Cython framework and build Log_Uniform Sampler
  4. Convert the Torch data tensors to PyTorch tensor format (requires PyTorch v0.4.1)

I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start position and length of each independent sentence. This preprocessing step and the "train_data.sid" file speed up loading of the massive training data; a minimal loading sketch follows the tensor descriptions below.

  • Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
  • Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
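A minimal sketch of slicing a single sentence out of the preprocessed tensors, assuming both files deserialize with torch.load (the actual on-disk format is determined by process_gbw.py, so treat the file handling here as an assumption):

```python
import torch

# Assumption: train_data.pt holds the (#words x 2) corpus tensor and
# train_data.sid holds the (#sentences x 2) (start position, length) tensor.
corpus = torch.load("train_data.pt")       # columns: (sentence id, word id)
sentence_ids = torch.load("train_data.sid")

start, length = sentence_ids[0].tolist()   # first sentence
first_sentence_word_ids = corpus[start:start + length, 1]
print(first_sentence_word_ids)
```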

Setup - Original Data Format

  1. Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.

References

  1. Exploring the Limits of Language Modeling (GitHub)
  2. Factorization Tricks for LSTM Networks (GitHub)
  3. Efficient Softmax Approximation for GPUs (GitHub)
  4. Candidate Sampling
  5. Torch GBW
  6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

pytorch_gbw_lm's People

Contributors

  • rdspring1


pytorch_gbw_lm's Issues

state of the art performance?

Nice work! I have a question regarding the result:
In the paper "Exploring the Limits of Language Modeling", a test perplexity of 54.1 is reported using LSTM-512-512. Does that mean 2 layers are used in the paper, while your result is obtained with 4 layers? If so, what makes the difference?

missing train_data.pt

It seems that process_gbw.py is looking for train_data.pt but couldn't find it. Are there any instructions on how to create this file (or is it part of the downloaded dataset)?

Thanks!
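For what it's worth, step 4 of the setup (converting the Torch tensors to PyTorch format under PyTorch v0.4.1) is presumably what produces train_data.pt. A hedged sketch using the legacy load_lua loader available in PyTorch 0.4.x (the actual conversion script may differ):

```python
# Sketch only: assumes PyTorch v0.4.1, where torch.utils.serialization.load_lua
# can read Torch7 .th7 files. The real conversion step may differ.
import torch
from torch.utils.serialization import load_lua

data = load_lua("train_data.th7")   # Torch7 tensor from the Torch GBW dataset
torch.save(data, "train_data.pt")   # PyTorch-native file expected by process_gbw.py
```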

build Log_Uniform Sampler

Hi

I have Cython installed, but I'm not sure how to do the "build Log_Uniform Sampler" step.
Could you give more detail about which commands I should run?

I tried running python setup.py install but I got the following error:

running install
running build
running build_ext
building 'log_uniform' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/home/goncalo/.virtualenvs/nmtpy/include/python3.5m -c log_uniform.cpp -o build/temp.linux-x86_64-3.5/log_uniform.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
log_uniform.cpp:608:31: fatal error: numpy/arrayobject.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

So I'm not sure if I'm doing the right thing.
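The missing numpy/arrayobject.h header usually means the NumPy include directory was not passed to the compiler. A common fix, shown as a hedged sketch (the repository's actual setup.py and source file names may differ), is to add numpy.get_include() to the extension's include_dirs:

```python
# setup.py sketch: make the NumPy C headers visible to the Cython extension.
import numpy
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "log_uniform",
    sources=["log_uniform.pyx"],          # assumed source file name
    language="c++",
    extra_compile_args=["-std=c++11"],
    include_dirs=[numpy.get_include()],   # fixes "numpy/arrayobject.h: No such file"
)
setup(ext_modules=cythonize([ext]))
```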

missing dataset

The Torch Data Format link is broken. Could you offer a Google Drive link? Thank you!

Pretrained Model?

Nice work! It's so tragic that when I type "pytorch language models", this is not the first repo that shows up!

Do you plan to release the pre-trained model?

(I see it takes roughly 3 days...so probably it's ok)

Preprocess problem

It seems that torch.load() cannot load train_data.th7. I cannot figure out how to "run 'process_gbw.py' on the 'train_data.th7' file to create the 'train_data.sid' file".

sample_ids being ignored?

Hi! Thanks for your code. I've been reading through it to understand the approach, and I've noticed that the output of sampled is actually always a zero long-tensor:

https://github.com/rdspring1/PyTorch_GBW_LM/blob/master/lm/model.py#L68-L69

Is this the way it is supposed to work? My understanding was that the sampled softmax obtains its speed-up by computing the loss over only a sample of the entire vocabulary. But the way it's set up, the loss would always be computed with respect to the same target (0).

Or is there something else I might be missing?

greetings!
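For context, a minimal sketch of how a sampled-softmax loss is commonly formed (hypothetical tensors and shapes, not the repository's model.py): the true class logit is placed in a fixed column among the sampled negatives, so the cross-entropy target can legitimately be a constant index even though the sampled word ids change every step.

```python
import torch
import torch.nn.functional as F

# Sketch: batch of hidden states h (B x d), true word embeddings, sampled negatives.
B, d, num_sampled = 4, 8, 16
h = torch.randn(B, d)
true_w = torch.randn(B, d)               # embedding rows of the true words
sampled_w = torch.randn(num_sampled, d)  # embedding rows of sampled negative words

true_logits = (h * true_w).sum(dim=1, keepdim=True)  # (B, 1)
sampled_logits = h @ sampled_w.t()                   # (B, num_sampled)
logits = torch.cat([true_logits, sampled_logits], dim=1)

# The true class always sits in column 0, so the target is a constant 0 tensor
# even though the sampled vocabulary changes at every step.
targets = torch.zeros(B, dtype=torch.long)
loss = F.cross_entropy(logits, targets)
```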

Resume Training?

Hi, I am wondering whether it is possible to resume training from a saved checkpoint. Based on the code, I think I just need to re-define the scheduler myself. Is there anything you think I missed?

Thank you so much for your code btw.
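A generic save/resume pattern for reference (hedged sketch with a hypothetical tiny model standing in for the repository's objects; the actual checkpoint contents may differ):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the repository's model and optimizer.
model = nn.Linear(4, 4)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)

# Save everything needed to resume.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 2}, "checkpoint.pt")

# Later: rebuild the objects, then restore their states before continuing training.
model = nn.Linear(4, 4)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)
state = torch.load("checkpoint.pt")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
start_epoch = state["epoch"] + 1
```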

RuntimeError: inconsistent tensor size

I have a problem:
load word frequency mapping - complete
loaded tensor torch.Size([798949912])
loaded tensor torch.Size([798949912, 3])
#sentences 798949912
load train data - complete
#sentences 6073
load test data - complete
Traceback (most recent call last):
File "main.py", line 195, in
train()
File "main.py", line 157, in train
for batch, item in enumerate(train_loader):
File "/home/xxxx/PyTorch_LM/lm/fast_gbw.py", line 89, in batch_generator
tracker_list[idx] = self.add(seq_length, source, target, idx, tracker)
File "/home/xxxx/lm/PyTorch_LM/lm/fast_gbw.py", line 124, in add
source[curr:batch_end, batch_idx] = self.corpus[seq_start:seq_end]
RuntimeError: inconsistent tensor size, expected tensor [19] and src [798949911] to have the same number of elements, but got 19 and 798949911 elements respectively at /pytorch/torch/lib/TH/generic/THTensorCopy.c:86

Nondeterministic result?

Hi, I was trying to run your example, but the result is non-deterministic each time (even if I set dropout=0.0). Is that expected?
(BTW, I'm using GBWStream to read the dataset with deterministic=True, I can post the code if you want to take a look)
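Some sources of nondeterminism can be pinned down with explicit seeding and deterministic cuDNN settings; a hedged sketch follows (this does not guarantee bit-exact results for every op, and any randomness inside the negative sampler would need its own seed):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 1234):
    """Seed Python, NumPy, and PyTorch RNGs and request deterministic cuDNN kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()
```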

how to build Log_Uniform Sampler?

On my MacBook, I ran 'python setup.py install' or 'python setup.py build_ext --inplace' in the log_uniform folder and got this error:

➜  log_uniform git:(master) ✗ ~/miniconda3/bin/python setup.py install
running install
running build
running build_ext
building 'log_uniform' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.7
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/lib/python3.7/site-packages/numpy/core/include -I/Users/gaoxianglu/miniconda3/include/python3.7m -c log_uniform.cpp -o build/temp.macosx-10.7-x86_64-3.7/log_uniform.o -std=c++11
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead
      [-Wstdlibcxx-not-found]
log_uniform.cpp:635:10: fatal error: 'ios' file not found
#include "ios"
         ^~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1

I installed the Xcode command line tools, but the error still exists.
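On macOS, the missing 'ios' header usually comes from the C++ standard library selection; following the compiler's own hint, a hedged tweak is to pass -stdlib=libc++ in the extension flags (same shape as the setup.py sketch earlier, with the extension and source names again assumed):

```python
# setup.py sketch for macOS: tell clang to use libc++ so the C++ standard
# headers ("ios" etc.) are found.
import numpy
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "log_uniform",
    sources=["log_uniform.pyx"],          # assumed source file name
    language="c++",
    include_dirs=[numpy.get_include()],
    extra_compile_args=["-std=c++11", "-stdlib=libc++"],
    extra_link_args=["-stdlib=libc++"],
)
setup(ext_modules=cythonize([ext]))
```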

TypeError: iteration over a 0-d tensor

File "main_dev.py", line 99, in repackage_hidden
return [repackage_hidden(state) for state in h]
File "/Users/admin/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 381, in iter
raise TypeError('iteration over a 0-d tensor')
TypeError: iteration over a 0-d tensor

Have you encountered this kind of issue before?
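This error typically means repackage_hidden tried to iterate over a plain tensor (newer PyTorch versions return hidden states as tensors rather than iterable Variables). A common fix, shown as a hedged sketch rather than the repository's exact code:

```python
import torch

def repackage_hidden(h):
    """Detach hidden states from their history so BPTT does not backprop through them."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    # LSTM hidden state is a tuple (h, c); recurse into containers.
    return tuple(repackage_hidden(v) for v in h)
```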
