artetxem / undreamt
Unsupervised Neural Machine Translation
License: GNU General Public License v3.0
Hello
I have a question about the number of iterations. As I understand it, the number of lines read = number of iterations × batch size. In your paper, however, the number of iterations is set to 300,000 and the batch size to 50, so the total number of lines read is 15 million. Isn't that too few to cover the whole News Crawl dataset? Does it mean that only part of the dataset is read, and only once (i.e. the number of epochs is at most 1)? I would appreciate some clarification on this.
Thank you
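The arithmetic in the question above can be checked with a short sketch. The figures (300,000 iterations, batch size 50) come from the post. The assumption that each iteration consumes exactly one batch of sentences from a single corpus is an illustration only, and the corpus size below is a hypothetical placeholder, not a measured line count for News Crawl.

```python
def sentences_read(iterations, batch_size):
    """Total sentences consumed, assuming one batch per iteration."""
    return iterations * batch_size


def epochs_covered(iterations, batch_size, corpus_lines):
    """Fraction of the corpus seen; values below 1.0 mean less than one epoch."""
    return sentences_read(iterations, batch_size) / corpus_lines


total = sentences_read(300_000, 50)
print(total)  # 15000000

# Hypothetical corpus size, chosen only to illustrate the calculation.
hypothetical_corpus_lines = 100_000_000
print(epochs_covered(300_000, 50, hypothetical_corpus_lines))  # 0.15
```

Under these assumptions, a result below 1.0 would mean that training stops before a full pass over the corpus.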
I am running on an RTX 2080 Ti with Python 3.6 and PyTorch 1.0.1, but the following error shows up.
STEP 1000 x 15
Source to target (backtranslation)
Traceback (most recent call last):
File "train.py", line 20, in <module>
undreamt.train.main_train()
File "/home/{mypath}/undreamt-master/undreamt/train.py", line 307, in main_train
logger.log(step)
File "/home/{mypath}/Translation/undreamt-master/undreamt/train.py", line 415, in log
.format(self.trainer.perplexity_per_word(), self.trainer.total_time(),
File "/home/{mypath}/Translation/undreamt-master/undreamt/train.py", line 357, in perplexity_per_word
return np.exp(self.loss/self.trg_word_count)
File "/home/{mypath}/anaconda3/envs/Translation/lib/python3.6/site-packages/torch/tensor.py", line 450, in __array__
return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Does anyone have a solution, please?
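One possible workaround, sketched under assumptions: the traceback suggests that `self.loss` is a CUDA tensor by the time `np.exp` is called, so converting it to a plain Python number first should avoid the implicit NumPy conversion. The standalone functions below are hypothetical and only mirror the name `perplexity_per_word` from the traceback; they are not the repository's code. The sketch relies on `Tensor.item()`, which PyTorch provides for single-element tensors from version 0.4 onward.

```python
import math


def to_python_float(value):
    """Return a plain float from a number or a single-element tensor-like object."""
    # torch.Tensor exposes .item(), which copies a one-element tensor to the host.
    if hasattr(value, "item"):
        return float(value.item())
    return float(value)


def perplexity_per_word(loss, trg_word_count):
    """Perplexity computed on host-side floats, so no tensor reaches NumPy."""
    return math.exp(to_python_float(loss) / trg_word_count)


print(perplexity_per_word(0.0, 10))  # 1.0
```

Because the helper only checks for an `item()` method, it accepts plain floats as well, so it should behave the same on CPU and GPU runs.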
I'm running the code on a machine with Python 3.6, PyTorch 0.3.1 and CUDA 8.0.
Training has finished, and I use the resulting model for translation as follows:
python translate.py MODEL_PREFIX.final.src2trg.pth < INPUT.TXT > OUTPUT.TXT
And I got the following error.
Does anyone have a solution or any insight, please?
Thank you.
Hi Mikel,
I preprocessed, trained and translated using the original hyperparameters and the details given in the original paper, tested on newstest2014, and got a BLEU score of 10.97. What could the problem be?
Thank you.
Hi, I am using undreamt as a regular NMT system. After running the following command:
python train.py --src2trg ../nematus/test/data/norig_out.hi ../nematus/test/data/norig_out.en --src_vocabulary ../nematus/test/data/norig_out.hi.json --trg_vocabulary ../nematus/test/data/norig_out.en.json --embedding_size 300 --learn_encoder_embeddings --disable_denoising --save MODEL_PREFIX --cuda
I get this error:
Traceback (most recent call last):
File "train.py", line 16, in <module>
import undreamt.train
File "/home1/kamalpcs17/undreamt/undreamt/train.py", line 425
print(line, file=f)
^
SyntaxError: invalid syntax
(My python version is 2.7)
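The `print(line, file=f)` call is Python 3 syntax, which Python 2.7 rejects unless `from __future__ import print_function` is in effect, and the other reports in this list run the code on Python 3.6. A small guard like the following would fail early with a clearer message. It is a hypothetical addition, not part of the repository, and it would need to sit at the top of the entry script, before any other imports, to take effect.

```python
import sys

MIN_VERSION = (3, 0)  # assumption: any Python 3 interpreter is acceptable


def check_python_version(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if not check_python_version():
    sys.exit("This code needs Python 3; found %d.%d" % sys.version_info[:2])
```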
Hello,
May I know how to use this code for semi-supervised learning where I have an additional parallel corpus?
Thank you!
I am very curious whether it can be used for text classification.
Hello. I have one question. In your paper, the encoder embeddings are cross-lingual embeddings that are kept fixed, while the decoder embeddings are randomly initialized and updated during training. Which embeddings are used at inference time, the cross-lingual ones or the updated ones, and why?
Hello @artetxem,
I'm training an unsupervised NMT model using your settings. Could you give me a pointer on how to resume training from a previous checkpoint?
Hello,
It seems that my GPU runs out of memory (CUDA runtime error) when I train the model. I use a K80 and changed the batch size to 20. Is there something wrong with that? Could the problem be the corpus? I use the Europarl German and English corpora with fastText word vectors.
Thank you
I'm running the code on a machine with Python 3.6, PyTorch 0.3.1, a K80 and CUDA 8.0, as described in the README.txt.
CUDA_VISIBLE_DEVICES=1 python3 train.py --src ~/IWT15/mono/euro.tc.en --trg ~/IWT15/mono/euro.tc.de --src_embeddings vecmap/data/euro.tc40.en.map --trg_embeddings vecmap/data/euro.tc40.de.map --save eurotc40_en2de --cuda
I got the following fatal error.
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1522182087074/work/torch/lib/THC/generic/THCStorage.cu:58
I'm sure there is only one process on that GPU, and it requires more than 12 GB of memory. I tried very small bilingual word embeddings (17 MB for the source language and 40 MB for the target language) and batch_size = 2, but the error still occurs.
Does anyone have a solution or any insight, please?
Thank you.
PS: the program runs smoothly on CPU.
Hi Mikel!
I applied all the steps that your toolkit and paper require to an Urdu-English corpus, but I get a very poor BLEU score, such as 0.5 or 0.9.
Data preprocessing:
Step 1) On the monolingual data I applied tokenization, truecasing and cleaning to sentence lengths of 1-50 with Moses.
Step 2) I trained word embeddings with word2vec (parameters: epochs = 5, window_size = 5, dimension = 300), then applied MUSE for alignment and mapped them into a shared space with VecMap.
The size of my corpus is 13k. (Is that enough?)
My first question is whether this toolkit supports the Urdu language.
Second, I used the toolkit's default parameters.
If any parameters affect model training, kindly share which ones.
Hello.
I ran the experiment on a Tibetan-Chinese corpus of nearly 300k (30w), and the results are very bad.
(Most of the translated text reads fluently, but it is totally unrelated to the source text.)
I did the experiment according to your paper, using BPE and VecMap (objective nearly 34%).
Can I ask how large your training corpus is?
I wonder whether this is because my corpus is not big enough, or whether there is something wrong with the mapping.
Thanks again in advance!
Hi, I am running an experiment on unsupervised Slovak-English translation. I get a "nan" error during beam search decoding: the variable "logprobs" at L87 of translator.py contains only "nan" values.
https://github.com/artetxem/undreamt/blob/master/undreamt/translator.py#L87
I followed the same settings as in your paper (training bilingual word embeddings with vecmap and running for 300k iterations). Does anyone know how to fix it? Thanks!
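To narrow down where the NaN values first appear, one option is to check intermediate scores before they reach beam search. The helper below is a generic sketch over plain Python floats. The function name is hypothetical, and applying it to `logprobs` would require converting the tensor to a list of floats first.

```python
import math


def first_nan_index(values):
    """Return the index of the first NaN in `values`, or None if there is none."""
    for i, v in enumerate(values):
        if math.isnan(v):
            return i
    return None


print(first_nan_index([-0.1, -2.3, float("nan")]))  # 2
```

Running such a check on the loss at each training step, as well as at decoding time, could show whether the values were already invalid in the saved model or only become invalid during decoding.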
Hi,
when using the train.py script on my corpora (incl. embeddings mapped with vecmap), the following error message appears (using the --cuda option):
Traceback (most recent call last):
File "train.py", line 20, in <module>
undreamt.train.main_train()
File "/tmp/undreamt/undreamt/undreamt/train.py", line 189, in main_train
bidirectional=not args.disable_bidirectional, layers=args.layers, dropout=args.dropout))
File "/tmp/undreamt/undreamt/undreamt/devices.py", line 22, in gpu
return x.cuda() if x is not None else None
File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in cuda
return self._apply(lambda t: t.cuda(device))
File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 146, in _apply
module._apply(fn)
File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in _apply
self.flatten_parameters()
File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 111, in flatten_parameters
params = rnn.get_parameters(fn, handle, fn.weight_buf)
File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 165, in get_parameters
assert filter_dim_a.prod() == filter_dim_a[0]
AssertionError
I'm using PyTorch version 0.3.1 via conda. Could that be a problem? README.md shows that 0.3 was tested.
Many thanks in advance + cheers,
Stefan
I have monolingual corpora for two languages and a small parallel corpus too. How can I use that parallel corpus while training on the monolingual corpora in an unsupervised way?