rdspring1 / pytorch_gbw_lm
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset
License: Apache License 2.0
On my MacBook, I ran 'python setup.py install' (and also 'python setup.py build_ext --inplace') in the log_uniform folder and got this error:
➜ log_uniform git:(master) ✗ ~/miniconda3/bin/python setup.py install
running install
running build
running build_ext
building 'log_uniform' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.7
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/include -arch x86_64 -I/Users/gaoxianglu/miniconda3/lib/python3.7/site-packages/numpy/core/include -I/Users/gaoxianglu/miniconda3/include/python3.7m -c log_uniform.cpp -o build/temp.macosx-10.7-x86_64-3.7/log_uniform.o -std=c++11
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead
[-Wstdlibcxx-not-found]
log_uniform.cpp:635:10: fatal error: 'ios' file not found
#include "ios"
^~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1
I installed the Xcode command line tools, but the error persists.
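For anyone hitting the same wall: the clang warning itself points at the likely fix. One workaround (an assumption, not verified on this exact setup) is to pass '-stdlib=libc++' through the build environment so the C++ extension uses the libc++ standard library:

```shell
# Untested workaround: force clang to use libc++ for the C++ extension.
# The deployment-target flag is an extra assumption; adjust as needed.
CFLAGS="-stdlib=libc++ -mmacosx-version-min=10.9" \
CXXFLAGS="-stdlib=libc++ -mmacosx-version-min=10.9" \
python setup.py build_ext --inplace
```

If that doesn't take effect, the same flags can be added to extra_compile_args in setup.py instead.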
It seems torch.load() cannot load train_data.th7?
I cannot figure out how to "run process_gbw.py on the train_data.th7 file to create the train_data.sid file."
I have a problem:
load word frequency mapping - complete
loaded tensor torch.Size([798949912])
loaded tensor torch.Size([798949912, 3])
#sentences 798949912
load train data - complete
#sentences 6073
load test data - complete
Traceback (most recent call last):
  File "main.py", line 195, in <module>
    train()
  File "main.py", line 157, in train
    for batch, item in enumerate(train_loader):
  File "/home/xxxx/PyTorch_LM/lm/fast_gbw.py", line 89, in batch_generator
    tracker_list[idx] = self.add(seq_length, source, target, idx, tracker)
  File "/home/xxxx/lm/PyTorch_LM/lm/fast_gbw.py", line 124, in add
    source[curr:batch_end, batch_idx] = self.corpus[seq_start:seq_end]
RuntimeError: inconsistent tensor size, expected tensor [19] and src [798949911] to have the same number of elements, but got 19 and 798949911 elements respectively at /pytorch/torch/lib/TH/generic/THTensorCopy.c:86
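For context, the error is the slice-assignment size rule: both sides must cover the same number of elements. The src size of 798949911 (nearly the whole corpus) hints that the sentence boundaries were never applied, so seq_start:seq_end spanned almost the entire corpus tensor. A minimal sketch of the rule, with hypothetical sizes rather than this repo's actual data:

```python
import torch

# Slice assignment requires both sides to have the same element count.
corpus = torch.arange(100)
source = torch.zeros(20, 4, dtype=torch.long)

seq_start, seq_end = 10, 29            # 19 corpus elements
curr, batch_end, batch_idx = 0, 19, 0  # 19 destination slots
source[curr:batch_end, batch_idx] = corpus[seq_start:seq_end]  # OK: 19 == 19

# If seq_end instead ran to the end of the corpus, the right-hand side
# would cover ~799M elements and raise the "inconsistent tensor size" error.
```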
Hi! Thanks for your code. I've been reading through it to understand the approach, and I've noticed that the output of sampled is actually always a zero long-tensor:
https://github.com/rdspring1/PyTorch_GBW_LM/blob/master/lm/model.py#L68-L69
Is this the way it's supposed to work? I understood that sampled softmax obtains its speed-up by computing the loss on only a sample of the entire vocabulary, but the way it's set up, the loss is always computed with respect to the same target (0).
Or is there something else I might be missing?
Greetings!
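In case it helps future readers: a common sampled-softmax layout (an assumption about this code's intent, not a verified reading of model.py) puts the true word's logit in column 0 and the sampled words' logits after it, so a target of 0 for every row is correct by construction; the loss still varies with which words were sampled:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, n_sampled = 3, 5
true_logits = torch.randn(batch, 1)            # logit of each example's true word
sample_logits = torch.randn(batch, n_sampled)  # logits of the sampled words
logits = torch.cat([true_logits, sample_logits], dim=1)

# Target index 0 for every row: the true word always sits in column 0.
targets = torch.zeros(batch, dtype=torch.long)
loss = F.cross_entropy(logits, targets)
```

Under that layout, the all-zeros target tensor is not a bug; it just encodes "the true class is the first column of the reduced softmax."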
After running 'python3 setup.py build_ext --inplace', I still get ImportError: cannot import name 'LogUniformSampler'. It seems the log_uniform module is not built correctly.
Any suggestion?
Thanks!
Hi, I'd like to use your language model for my research. I can't train it because the link to the Google Billion Word Dataset for Torch
is down. Is there a mirror somewhere?
Nice work! It's so tragic that when I type "pytorch language models", this is not the first repo that shows up!
Do you plan to release the pre-trained model?
(I see it takes roughly 3 days...so probably it's ok)
The Torch Data Format link is broken. Could you offer a Google Drive link? Thank you!
Hi, I was trying to run your example, but the result is non-deterministic each time (even if I set dropout=0.0). Is that expected?
(BTW, I'm using GBWStream to read the dataset with deterministic=True; I can post the code if you want to take a look.)
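If it's useful, here is a standard PyTorch seeding recipe (generic, not specific to this repo) that removes most remaining nondeterminism once dropout is already off:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Seed every RNG that typically affects a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)           # no-op without a GPU
    torch.backends.cudnn.deterministic = True  # trade speed for determinism
    torch.backends.cudnn.benchmark = False

set_seed(0)
a = torch.randn(3)
set_seed(0)
b = torch.randn(3)  # identical to a after reseeding
```

Note that some CUDA kernels are nondeterministic regardless of seeding, so small run-to-run differences on GPU can remain even with this in place.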
  File "main_dev.py", line 99, in repackage_hidden
    return [repackage_hidden(state) for state in h]
  File "/Users/admin/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 381, in __iter__
    raise TypeError('iteration over a 0-d tensor')
TypeError: iteration over a 0-d tensor
Have you run into this kind of problem before?
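This usually comes from the PyTorch 0.4 change where hidden states became plain Tensors, so a tuple-only repackage_hidden ends up trying to iterate over a Tensor. A commonly used fix (an assumption about what main_dev.py's version looks like) is to type-check and detach:

```python
import torch

def repackage_hidden(h):
    """Detach hidden states from their gradient history.

    Handles both a single Tensor (e.g. a GRU's hidden state) and nested
    tuples/lists of Tensors (e.g. an LSTM's (h, c) pair).
    """
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

h = torch.randn(2, 4, requires_grad=True)
state = repackage_hidden((h, h))  # works for LSTM-style tuples...
single = repackage_hidden(h)      # ...and bare tensors alike
```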
Nice work! I have a question regarding the result:
In the paper "Exploring the limits of language modeling", a test ppl of 54.1 is reported using LSTM-512-512. Does that mean 2 layers are used in the paper, while your result is obtained from 4 layers? If so, what makes the difference?
It seems that process_gbw.py is looking for train_data.pt but can't find it. Are there any instructions on how to create this file (or does it come with the downloaded dataset)?
Thanks!
The dataset seems to be offline. Is it possible to re-upload it?
Hi, I am wondering whether it is possible to resume training from a saved checkpoint? Based on the code, I think I just need to re-define the scheduler myself. Is there anything you think I missed?
Thank you so much for your code btw.
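Not the author, but a generic resume pattern (all names here — model, optimizer, scheduler — are placeholders, not this repo's actual variables) is to checkpoint and restore all three state dicts, which avoids re-deriving the scheduler's state by hand:

```python
import os
import tempfile
import torch
import torch.nn as nn

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["epoch"]  # resume from the next epoch

# Tiny demonstration with placeholder components.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
save_checkpoint(path, model, optimizer, scheduler, epoch=3)
resumed_epoch = load_checkpoint(path, model, optimizer, scheduler)
```

Restoring the scheduler's own state dict means its internal step counter picks up where it left off, rather than being re-derived from the epoch number.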
Hi,
I have Cython installed, but I'm not sure how to do the "build Log_Uniform Sampler" step. Could you be more specific about which commands I should run?
I tried 'python setup.py install', but I got the following error:
running install
running build
running build_ext
building 'log_uniform' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/home/goncalo/.virtualenvs/nmtpy/include/python3.5m -c log_uniform.cpp -o build/temp.linux-x86_64-3.5/log_uniform.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
log_uniform.cpp:608:31: fatal error: numpy/arrayobject.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
So I'm not sure if I'm doing the right thing.
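The missing numpy/arrayobject.h almost always means the extension isn't given NumPy's include directory. A hedged sketch of what the Extension definition could look like (the filename and flags are assumptions from the compile log above; the actual fix is the include_dirs line):

```python
import numpy
from setuptools import Extension

# Hypothetical extension spec; only include_dirs is the essential fix.
ext = Extension(
    "log_uniform",
    sources=["log_uniform.cpp"],
    language="c++",
    extra_compile_args=["-std=c++11"],
    include_dirs=[numpy.get_include()],  # resolves numpy/arrayobject.h
)
```

numpy.get_include() returns the directory inside the installed NumPy package that contains numpy/arrayobject.h, so the header resolves regardless of virtualenv layout.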