rdspring1 / pytorch_gbw_lm
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset
License: Apache License 2.0
On my MacBook, I ran 'python setup.py install' (and also 'python setup.py build_ext --inplace') in the log_uniform folder and got this error:
➜ log_uniform git:(master) ✗ ~/miniconda3/bin/python setup.py install
running install
running build
running build_ext
building 'log_uniform' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.7
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/lib/python3.7/site-packages/numpy/core/include -I/Users/gaoxianglu/miniconda3/include/python3.7m -c log_uniform.cpp -o build/temp.macosx-10.7-x86_64-3.7/log_uniform.o -std=c++11
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead
[-Wstdlibcxx-not-found]
log_uniform.cpp:635:10: fatal error: 'ios' file not found
#include "ios"
^~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1
I installed the Xcode command line tools, but the error persists.
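For anyone hitting the same wall: the clang warning itself points at the likely fix. One workaround (an assumption, not verified on this exact setup) is to pass '-stdlib=libc++' through the build environment so the C++ extension uses the libc++ standard library:

```shell
# Untested workaround: force clang to use libc++ for the C++ extension.
# The deployment-target flag is an extra assumption; adjust as needed.
CFLAGS="-stdlib=libc++ -mmacosx-version-min=10.9" \
CXXFLAGS="-stdlib=libc++ -mmacosx-version-min=10.9" \
python setup.py build_ext --inplace
```

If that doesn't take effect, the same flags can be added to extra_compile_args in setup.py instead.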
It seems torch.load() cannot load train_data.th7?
I cannot figure out how to "run process_gbw.py on the train_data.th7 file to create the train_data.sid file."
I have a problem:
load word frequency mapping - complete
loaded tensor torch.Size([798949912])
loaded tensor torch.Size([798949912, 3])
#sentences 798949912
load train data - complete
#sentences 6073
load test data - complete
Traceback (most recent call last):
  File "main.py", line 195, in <module>
    train()
  File "main.py", line 157, in train
    for batch, item in enumerate(train_loader):
  File "/home/xxxx/PyTorch_LM/lm/fast_gbw.py", line 89, in batch_generator
    tracker_list[idx] = self.add(seq_length, source, target, idx, tracker)
  File "/home/xxxx/lm/PyTorch_LM/lm/fast_gbw.py", line 124, in add
    source[curr:batch_end, batch_idx] = self.corpus[seq_start:seq_end]
RuntimeError: inconsistent tensor size, expected tensor [19] and src [798949911] to have the same number of elements, but got 19 and 798949911 elements respectively at /pytorch/torch/lib/TH/generic/THTensorCopy.c:86
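For context, the error is the slice-assignment size rule: both sides must cover the same number of elements. The src size of 798949911 (nearly the whole corpus) hints that the sentence boundaries were never applied, so seq_start:seq_end spanned almost the entire corpus tensor. A minimal sketch of the rule, with hypothetical sizes rather than this repo's actual data:

```python
import torch

# Slice assignment requires both sides to have the same element count.
corpus = torch.arange(100)
source = torch.zeros(20, 4, dtype=torch.long)

seq_start, seq_end = 10, 29            # 19 corpus elements
curr, batch_end, batch_idx = 0, 19, 0  # 19 destination slots
source[curr:batch_end, batch_idx] = corpus[seq_start:seq_end]  # OK: 19 == 19

# If seq_end instead ran to the end of the corpus, the right-hand side
# would cover ~799M elements and raise the "inconsistent tensor size" error.
```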
Hi! Thanks for your code. I've been reading through it to understand the approach, and I've noticed that the output of sampled is actually always a zero long-tensor:
https://github.com/rdspring1/PyTorch_GBW_LM/blob/master/lm/model.py#L68-L69
Is this the way it's supposed to work? I understood that sampled softmax obtains its speed-up by computing the loss on only a sample of the entire vocabulary, but the way it's set up, the loss is always computed with respect to the same target (0).
Or is there something else I might be missing?
Greetings!
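In case it helps future readers: a common sampled-softmax layout (an assumption about this code's intent, not a verified reading of model.py) puts the true word's logit in column 0 and the sampled words' logits after it, so a target of 0 for every row is correct by construction; the loss still varies with which words were sampled:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, n_sampled = 3, 5
true_logits = torch.randn(batch, 1)            # logit of each example's true word
sample_logits = torch.randn(batch, n_sampled)  # logits of the sampled words
logits = torch.cat([true_logits, sample_logits], dim=1)

# Target index 0 for every row: the true word always sits in column 0.
targets = torch.zeros(batch, dtype=torch.long)
loss = F.cross_entropy(logits, targets)
```

Under that layout, the all-zeros target tensor is not a bug; it just encodes "the true class is the first column of the reduced softmax."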
After running 'python3 setup.py build_ext --inplace', I still get ImportError: cannot import name 'LogUniformSampler'. It seems the log_uniform module is not built correctly.
Any suggestion?
Thanks!
Hi, I'd like to use your language model for my research. I can't train it because the link to the Google Billion Word Dataset for Torch
is down. Is there a mirror somewhere?
Nice work! It's so tragic that when I type "pytorch language models", this is not the first repo that shows up!
Do you plan to release the pre-trained model?
(I see it takes roughly 3 days...so probably it's ok)
The Torch Data Format link is broken. Could you offer a Google Drive link? Thank you!
Hi, I was trying to run your example, but the result is non-deterministic each time (even if I set dropout=0.0). Is that expected?
(BTW, I'm using GBWStream to read the dataset with deterministic=True; I can post the code if you want to take a look.)
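If it's useful, here is a standard PyTorch seeding recipe (generic, not specific to this repo) that removes most remaining nondeterminism once dropout is already off:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Seed every RNG that typically affects a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)           # no-op without a GPU
    torch.backends.cudnn.deterministic = True  # trade speed for determinism
    torch.backends.cudnn.benchmark = False

set_seed(0)
a = torch.randn(3)
set_seed(0)
b = torch.randn(3)  # identical to a after reseeding
```

Note that some CUDA kernels are nondeterministic regardless of seeding, so small run-to-run differences on GPU can remain even with this in place.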
  File "main_dev.py", line 99, in repackage_hidden
    return [repackage_hidden(state) for state in h]
  File "/Users/admin/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 381, in __iter__
    raise TypeError('iteration over a 0-d tensor')
TypeError: iteration over a 0-d tensor
Have you run into this kind of problem before?
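This usually comes from the PyTorch 0.4 change where hidden states became plain Tensors, so a tuple-only repackage_hidden ends up trying to iterate over a Tensor. A commonly used fix (an assumption about what main_dev.py's version looks like) is to type-check and detach:

```python
import torch

def repackage_hidden(h):
    """Detach hidden states from their gradient history.

    Handles both a single Tensor (e.g. a GRU's hidden state) and nested
    tuples/lists of Tensors (e.g. an LSTM's (h, c) pair).
    """
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

h = torch.randn(2, 4, requires_grad=True)
state = repackage_hidden((h, h))  # works for LSTM-style tuples...
single = repackage_hidden(h)      # ...and bare tensors alike
```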
Nice work! I have a question regarding the result:
In the paper "Exploring the limits of language modeling", a test ppl of 54.1 is reported using LSTM-512-512. Does that mean 2 layers are used in the paper, while your result is obtained from 4 layers? If so, what makes the difference?
It seems that process_gbw.py is looking for train_data.pt but can't find it. Are there any instructions on how to create this file (or does it come with the downloaded dataset)?
Thanks!
The dataset seems to be offline. Is it possible to re-upload it?
Hi, I am wondering whether it is possible to resume training from a saved checkpoint? Based on the code, I think I just need to re-define the scheduler myself. Is there anything you think I missed?
Thank you so much for your code btw.
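Not the author, but a generic resume pattern (all names here — model, optimizer, scheduler — are placeholders, not this repo's actual variables) is to checkpoint and restore all three state dicts, which avoids re-deriving the scheduler's state by hand:

```python
import os
import tempfile
import torch
import torch.nn as nn

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["epoch"]  # resume from the next epoch

# Tiny demonstration with placeholder components.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
save_checkpoint(path, model, optimizer, scheduler, epoch=3)
resumed_epoch = load_checkpoint(path, model, optimizer, scheduler)
```

Restoring the scheduler's own state dict means its internal step counter picks up where it left off, rather than being re-derived from the epoch number.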
Hi,
I have Cython installed, but I'm not sure how to do the "build Log_Uniform Sampler" step. Could you be more specific about which commands I should run?
I tried 'python setup.py install', but I got the following error:
running install
running build
running build_ext
building 'log_uniform' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/home/goncalo/.virtualenvs/nmtpy/include/python3.5m -c log_uniform.cpp -o build/temp.linux-x86_64-3.5/log_uniform.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
log_uniform.cpp:608:31: fatal error: numpy/arrayobject.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
So I'm not sure if I'm doing the right thing.
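The missing numpy/arrayobject.h almost always means the extension isn't given NumPy's include directory. A hedged sketch of what the Extension definition could look like (the filename and flags are assumptions from the compile log above; the actual fix is the include_dirs line):

```python
import numpy
from setuptools import Extension

# Hypothetical extension spec; only include_dirs is the essential fix.
ext = Extension(
    "log_uniform",
    sources=["log_uniform.cpp"],
    language="c++",
    extra_compile_args=["-std=c++11"],
    include_dirs=[numpy.get_include()],  # resolves numpy/arrayobject.h
)
```

numpy.get_include() returns the directory inside the installed NumPy package that contains numpy/arrayobject.h, so the header resolves regardless of virtualenv layout.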