
pytorch_gbw_lm's Introduction

PyTorch Large-Scale Language Model

A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset

Latest Results

  • 39.98 Perplexity after 5 training epochs using LSTM Language Model with Adam Optimizer
  • Trained in ~26 hours using 1 Nvidia V100 GPU (~5.1 hours per epoch) with 2048 batch size (~10.7 GB GPU memory)

Previous Results

  • 46.47 Perplexity after 5 training epochs on a 1-layer, 2048-unit, 256-projection LSTM Language Model [3]
  • Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
  • Implemented Sampled Softmax and Log-Uniform Sampler functions
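For reference, the perplexity numbers above are the exponential of the mean per-word cross-entropy loss; a tiny sketch of that conversion (the loss value below is hypothetical, chosen to match the latest result):

```python
import math

# Perplexity is exp(mean per-word cross-entropy in nats), so a reported
# perplexity of ~39.98 corresponds to a mean loss of ln(39.98) ~= 3.69 nats/word.
mean_cross_entropy = 3.69  # hypothetical value, nats per word
perplexity = math.exp(mean_cross_entropy)
print(f"perplexity = {perplexity:.2f}")  # ~40
```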

GPU Hardware Requirement

Type                 LM Memory Size   GPU
w/o tied weights     ~9 GB            Nvidia 1080 Ti, Nvidia Titan X
w/ tied weights [6]  ~7 GB            Nvidia 1070 or higher
  • There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
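A minimal sketch of how embedding/softmax weight tying typically looks in PyTorch (hypothetical module for illustration, not the repository's exact model code):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal sketch: share the embedding matrix with the output projection."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        # Tying: the decoder reuses the embedding weight, so only one
        # (vocab_size x hidden_size) matrix is stored instead of two.
        self.decoder.weight = self.embed.weight

    def forward(self, tokens, hidden=None):
        output, hidden = self.rnn(self.embed(tokens), hidden)
        return self.decoder(output), hidden
```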

Hyper-Parameters [3]

Parameter                  Value
# Epochs                   5
Training Batch Size        128
Evaluation Batch Size      1
BPTT                       20
Embedding Size             256
Hidden Size                2048
Projection Size            256
Tied Embedding + Softmax   False
# Layers                   1
Optimizer                  AdaGrad
Learning Rate              0.10
Gradient Clipping          1.00
Dropout                    0.01
Weight-Decay (L2 Penalty)  1e-6
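As an illustration only, the table above could be collected into a single training configuration like the one below; the key names are hypothetical and are not the repository's actual command-line flags:

```python
# Hypothetical mapping of the hyper-parameter table to a config dictionary.
config = {
    "epochs": 5,
    "train_batch_size": 128,
    "eval_batch_size": 1,
    "bptt": 20,                # truncated BPTT length in tokens
    "embed_size": 256,
    "hidden_size": 2048,
    "proj_size": 256,
    "tied": False,             # embedding/softmax weights not tied
    "num_layers": 1,
    "optimizer": "adagrad",
    "lr": 0.10,
    "clip": 1.00,              # gradient clipping norm
    "dropout": 0.01,
    "weight_decay": 1e-6,      # L2 penalty
}
```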

Setup - Torch Data Format

  1. Download Google Billion Word Dataset for Torch - Link
  2. Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
  3. Install Cython framework and build Log_Uniform Sampler
  4. Convert the Torch data tensors to PyTorch tensor format (requires PyTorch v0.4.1)

I leverage the GBW data preprocessed for the Torch framework (see Torch GBW [5]). Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start position and length of each independent sentence. This preprocessing step and the "train_data.sid" file speed up loading of the massive training data; a minimal loading sketch follows the tensor descriptions below.

  • Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
  • Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
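A minimal sketch of slicing a single sentence out of the preprocessed tensors, assuming both files deserialize with torch.load (the actual on-disk format is determined by process_gbw.py, so treat the file handling here as an assumption):

```python
import torch

# Assumption: train_data.pt holds the (#words x 2) corpus tensor and
# train_data.sid holds the (#sentences x 2) (start position, length) tensor.
corpus = torch.load("train_data.pt")       # columns: (sentence id, word id)
sentence_ids = torch.load("train_data.sid")

start, length = sentence_ids[0].tolist()   # first sentence
first_sentence_word_ids = corpus[start:start + length, 1]
print(first_sentence_word_ids)
```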

Setup - Original Data Format

  1. Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.

References

  1. Exploring the Limits of Language Modeling (GitHub)
  2. Factorization Tricks for LSTM Networks (GitHub)
  3. Efficient Softmax Approximation for GPUs (GitHub)
  4. Candidate Sampling
  5. Torch GBW
  6. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

pytorch_gbw_lm's People

Contributors

  • rdspring1


pytorch_gbw_lm's Issues

state of the art performance?

Nice work! I have a question regarding the result:
In the paper "Exploring the Limits of Language Modeling", a test perplexity of 54.1 is reported using LSTM-512-512. Does that mean 2 layers are used in the paper, while your result is obtained with 4 layers? If so, what makes the difference?

missing train_data.pt

It seems that process_gbw.py is looking for train_data.pt but couldn't find it. Are there any instructions on how to create this file (or is it part of the downloaded dataset)?

Thanks!
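For what it's worth, step 4 of the setup (converting the Torch tensors to PyTorch format under PyTorch v0.4.1) is presumably what produces train_data.pt. A hedged sketch using the legacy load_lua loader available in PyTorch 0.4.x (the actual conversion script may differ):

```python
# Sketch only: assumes PyTorch v0.4.1, where torch.utils.serialization.load_lua
# can read Torch7 .th7 files. The real conversion step may differ.
import torch
from torch.utils.serialization import load_lua

data = load_lua("train_data.th7")   # Torch7 tensor from the Torch GBW dataset
torch.save(data, "train_data.pt")   # PyTorch-native file expected by process_gbw.py
```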

build Log_Uniform Sampler

Hi

I have Cython installed, but I'm not sure how to do the "build Log_Uniform Sampler" step.
Could you give more detail about which commands I should run?

I tried running python setup.py install but I got the following error:

running install
running build
running build_ext
building 'log_uniform' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/home/goncalo/.virtualenvs/nmtpy/include/python3.5m -c log_uniform.cpp -o build/temp.linux-x86_64-3.5/log_uniform.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
log_uniform.cpp:608:31: fatal error: numpy/arrayobject.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

So I'm not sure if I'm doing the right thing.
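The missing numpy/arrayobject.h header usually means the NumPy include directory was not passed to the compiler. A common fix, shown as a hedged sketch (the repository's actual setup.py and source file names may differ), is to add numpy.get_include() to the extension's include_dirs:

```python
# setup.py sketch: make the NumPy C headers visible to the Cython extension.
import numpy
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "log_uniform",
    sources=["log_uniform.pyx"],          # assumed source file name
    language="c++",
    extra_compile_args=["-std=c++11"],
    include_dirs=[numpy.get_include()],   # fixes "numpy/arrayobject.h: No such file"
)
setup(ext_modules=cythonize([ext]))
```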

missing dataset

The Torch Data Format link is broken. Could you offer a Google Drive link? Thank you!

Pretrained Model?

Nice work! It's so tragic that when I type "pytorch language models", this is not the first repo that shows up!

Do you plan to release the pre-trained model?

(I see it takes roughly 3 days...so probably it's ok)

Preprocess problem

It seems that torch.load() cannot load train_data.th7. I cannot figure out how to "run 'process_gbw.py' on the 'train_data.th7' file to create the 'train_data.sid' file".

sample_ids being ignored?

Hi! Thanks for your code. I've been reading through it to understand the approach, and I've noticed that the output of sampled is actually always a zero long-tensor:

https://github.com/rdspring1/PyTorch_GBW_LM/blob/master/lm/model.py#L68-L69

Is this the way it is supposed to work? My understanding was that the sampled softmax obtains its speed-up by computing the loss over only a sample of the entire vocabulary. But the way it's set up, the loss would always be computed with respect to the same target (0).

Or is there something else I might be missing?

greetings!
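For context, a minimal sketch of how a sampled-softmax loss is commonly formed (hypothetical tensors and shapes, not the repository's model.py): the true class logit is placed in a fixed column among the sampled negatives, so the cross-entropy target can legitimately be a constant index even though the sampled word ids change every step.

```python
import torch
import torch.nn.functional as F

# Sketch: batch of hidden states h (B x d), true word embeddings, sampled negatives.
B, d, num_sampled = 4, 8, 16
h = torch.randn(B, d)
true_w = torch.randn(B, d)               # embedding rows of the true words
sampled_w = torch.randn(num_sampled, d)  # embedding rows of sampled negative words

true_logits = (h * true_w).sum(dim=1, keepdim=True)  # (B, 1)
sampled_logits = h @ sampled_w.t()                   # (B, num_sampled)
logits = torch.cat([true_logits, sampled_logits], dim=1)

# The true class always sits in column 0, so the target is a constant 0 tensor
# even though the sampled vocabulary changes at every step.
targets = torch.zeros(B, dtype=torch.long)
loss = F.cross_entropy(logits, targets)
```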

Resume Training?

Hi, I am wondering whether it is possible to resume training from a saved checkpoint. Based on the code, I think I just need to re-define the scheduler myself. Is there anything you think I missed?

Thank you so much for your code btw.
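A generic save/resume pattern for reference (hedged sketch with a hypothetical tiny model standing in for the repository's objects; the actual checkpoint contents may differ):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the repository's model and optimizer.
model = nn.Linear(4, 4)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)

# Save everything needed to resume.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 2}, "checkpoint.pt")

# Later: rebuild the objects, then restore their states before continuing training.
model = nn.Linear(4, 4)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)
state = torch.load("checkpoint.pt")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
start_epoch = state["epoch"] + 1
```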

RuntimeError: inconsistent tensor size

I have a problem:
load word frequency mapping - complete
loaded tensor torch.Size([798949912])
loaded tensor torch.Size([798949912, 3])
#sentences 798949912
load train data - complete
#sentences 6073
load test data - complete
Traceback (most recent call last):
File "main.py", line 195, in
train()
File "main.py", line 157, in train
for batch, item in enumerate(train_loader):
File "/home/xxxx/PyTorch_LM/lm/fast_gbw.py", line 89, in batch_generator
tracker_list[idx] = self.add(seq_length, source, target, idx, tracker)
File "/home/xxxx/lm/PyTorch_LM/lm/fast_gbw.py", line 124, in add
source[curr:batch_end, batch_idx] = self.corpus[seq_start:seq_end]
RuntimeError: inconsistent tensor size, expected tensor [19] and src [798949911] to have the same number of elements, but got 19 and 798949911 elements respectively at /pytorch/torch/lib/TH/generic/THTensorCopy.c:86

Nondeterministic result?

Hi, I was trying to run your example, but the result is non-deterministic each time (even if I set dropout=0.0). Is that expected?
(BTW, I'm using GBWStream to read the dataset with deterministic=True, I can post the code if you want to take a look)
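Some sources of nondeterminism can be pinned down with explicit seeding and deterministic cuDNN settings; a hedged sketch follows (this does not guarantee bit-exact results for every op, and any randomness inside the negative sampler would need its own seed):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 1234):
    """Seed Python, NumPy, and PyTorch RNGs and request deterministic cuDNN kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()
```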

how to build Log_Uniform Sampler?

On my MacBook, I ran 'python setup.py install' or 'python setup.py build_ext --inplace' in the log_uniform folder and got this error:

➜  log_uniform git:(master) ✗ ~/miniconda3/bin/python setup.py install
running install
running build
running build_ext
building 'log_uniform' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.7
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/lib/python3.7/site-packages/numpy/core/include -I/Users/gaoxianglu/miniconda3/include/python3.7m -c log_uniform.cpp -o build/temp.macosx-10.7-x86_64-3.7/log_uniform.o -std=c++11
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead
      [-Wstdlibcxx-not-found]
log_uniform.cpp:635:10: fatal error: 'ios' file not found
#include "ios"
         ^~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1

I installed the Xcode command line tools, but the error still exists.
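On macOS, the missing 'ios' header usually comes from the C++ standard library selection; following the compiler's own hint, a hedged tweak is to pass -stdlib=libc++ in the extension flags (same shape as the setup.py sketch earlier, with the extension and source names again assumed):

```python
# setup.py sketch for macOS: tell clang to use libc++ so the C++ standard
# headers ("ios" etc.) are found.
import numpy
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "log_uniform",
    sources=["log_uniform.pyx"],          # assumed source file name
    language="c++",
    include_dirs=[numpy.get_include()],
    extra_compile_args=["-std=c++11", "-stdlib=libc++"],
    extra_link_args=["-stdlib=libc++"],
)
setup(ext_modules=cythonize([ext]))
```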

TypeError: iteration over a 0-d tensor

File "main_dev.py", line 99, in repackage_hidden
return [repackage_hidden(state) for state in h]
File "/Users/admin/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 381, in iter
raise TypeError('iteration over a 0-d tensor')
TypeError: iteration over a 0-d tensor

Have you encountered this kind of issue before?
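This error typically means repackage_hidden tried to iterate over a plain tensor (newer PyTorch versions return hidden states as tensors rather than iterable Variables). A common fix, shown as a hedged sketch rather than the repository's exact code:

```python
import torch

def repackage_hidden(h):
    """Detach hidden states from their history so BPTT does not backprop through them."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    # LSTM hidden state is a tuple (h, c); recurse into containers.
    return tuple(repackage_hidden(v) for v in h)
```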
