artetxem / undreamt

Unsupervised Neural Machine Translation

License: GNU General Public License v3.0

Python 100.00%

undreamt's Issues

Question about the number of iterations

Hello

I have one question about the number of iterations. As I understand it, the number of lines read = the number of iterations × the batch size. In your paper, however, the number of iterations is set to 300,000 and the batch size to 50, so the total number of lines read is 15 million. Isn't that too small to cover the whole News Crawl dataset? Does it mean that only part of the dataset is read, and only once (that is, the epoch count is 1)? I would appreciate some clarification on this.

Thank you
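
The arithmetic in the question can be checked with a short sketch. The iteration count and batch size are the figures quoted above, and the calculation follows the question's own assumption that each iteration reads one batch. The corpus size is a hypothetical placeholder, since the issue does not say how many lines the News Crawl data contains.

```python
# Figures quoted in the question above.
iterations = 300000
batch_size = 50

# Lines consumed if every iteration reads exactly one batch.
lines_read = iterations * batch_size
print(lines_read)  # 15000000

# Hypothetical corpus size, for illustration only (not taken from the paper).
corpus_lines = 100000000

# Fraction of one pass over the corpus under that assumption.
epochs = lines_read / corpus_lines
print(epochs)  # 0.15
```

With any corpus larger than 15 million lines, the same calculation gives a value below 1, which is the situation the question describes.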

TypeError: can't convert CUDA tensor to numpy.

I am running on an RTX 2080 Ti with Python 3.6 and PyTorch 1.0.1, but this error shows up:

STEP 1000 x 15
Source to target (backtranslation)
Traceback (most recent call last):
  File "train.py", line 20, in <module>
    undreamt.train.main_train()
  File "/home/{mypath}/undreamt-master/undreamt/train.py", line 307, in main_train
    logger.log(step)
  File "/home/{mypath}/Translation/undreamt-master/undreamt/train.py", line 415, in log
    .format(self.trainer.perplexity_per_word(), self.trainer.total_time(),
  File "/home/{mypath}/Translation/undreamt-master/undreamt/train.py", line 357, in perplexity_per_word
    return np.exp(self.loss/self.trg_word_count)
  File "/home/{mypath}/anaconda3/envs/Translation/lib/python3.6/site-packages/torch/tensor.py", line 450, in __array__
    return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Are there any solutions, please?
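
The error message itself points at the usual fix: copy the value off the GPU before handing it to NumPy. Below is a minimal sketch of that idea, assuming PyTorch 0.4 or later, where a scalar tensor exposes `.item()`. The helper names are hypothetical and not part of undreamt, and the code duck-types on `.item()` so that it also runs without a GPU.

```python
import math


def to_python_float(value):
    """Return a plain float from a number or a scalar tensor-like object."""
    # Scalar PyTorch tensors (0.4+) and NumPy scalars expose .item(), which
    # returns a host-side Python number even when the tensor lives on the GPU.
    if hasattr(value, "item"):
        return float(value.item())
    return float(value)


def perplexity_per_word(loss, word_count):
    """Hypothetical stand-in for the method named in the traceback."""
    return math.exp(to_python_float(loss) / word_count)


print(perplexity_per_word(6.0, 3))
```

Applied to the traceback above, the same effect might be had by calling `.item()` (or `.cpu()`) on `self.loss` before `np.exp`. That `self.loss` is a CUDA tensor at that point is inferred from the error message, not checked against the code.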

cuda runtime error (59)

I'm running the code on a machine with Python 3.6, PyTorch 0.3.1 and CUDA 8.0.
Training has finished, and I use the resulting model for translation as follows:
python translate.py MODEL_PREFIX.final.src2trg.pth < INPUT.TXT > OUTPUT.TXT
I then get the following error.

Have any solutions or insights, please?

Thank you.

invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

different result

Hi Mikel,

I preprocessed, trained and translated following the original hyperparameters and the details in the original paper, tested on newstest2014, and got a BLEU score of 10.97. What might be going wrong?

Thank you.

Error while using undreamt as regular NMT : print(line, file=f) invalid syntax

Hi, I am using undreamt as a regular NMT system. After running the following command:

python train.py --src2trg ../nematus/test/data/norig_out.hi ../nematus/test/data/norig_out.en --src_vocabulary ../nematus/test/data/norig_out.hi.json --trg_vocabulary ../nematus/test/data/norig_out.en.json --embedding_size 300 --learn_encoder_embeddings --disable_denoising --save MODEL_PREFIX --cuda

I get this error:

Traceback (most recent call last):
  File "train.py", line 16, in <module>
    import undreamt.train
  File "/home1/kamalpcs17/undreamt/undreamt/train.py", line 425
    print(line, file=f)
                    ^
SyntaxError: invalid syntax

(My Python version is 2.7.)
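
`print(line, file=f)` is Python 3 syntax. Under Python 2.7 the parser rejects it unless `from __future__ import print_function` is in effect, which matches the error above. Below is a minimal sketch of a version guard, assuming the code targets Python 3; `require_python3` is a hypothetical helper, not part of undreamt.

```python
import sys


def require_python3(version_info=None):
    """Raise a readable error when run under Python 2."""
    if version_info is None:
        version_info = sys.version_info
    if version_info[0] < 3:
        raise RuntimeError(
            "Python 3 is required, found %d.%d" % (version_info[0], version_info[1])
        )


require_python3()  # passes silently on Python 3
```

A guard like this only helps if it lives in a file that Python 2 can still parse and that runs before any Python 3 only module is imported.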

semi-supervised learning

Hello,

May I know how to use this code for semi-supervised learning where I have an additional parallel corpus?
Thank you!

Can it be used for text classification?

I am very curious whether it can be used for text classification.

one question

Hello. I have one question. In your paper, the encoder embeddings are cross-lingual and kept fixed, while the decoder embeddings are randomly initialized and updated during training. Which embeddings are used at inference time, the cross-lingual ones or the updated ones, and why?

How to resume training from a checkpoint?

Hello @artetxem,
I'm training an unsupervised NMT system using your settings. Could you give me a pointer on how to resume training from a previous checkpoint?

out of memory

Hello,
It seems that my GPU runs out of memory (CUDA runtime error) when I train the model. I use a K80 and changed the batch size to 20. Is there something wrong with that? Or could the problem be with the corpus? I use the Europarl German and English corpus with fastText word vectors.

Thank you

CUDA Out of Memory Error Even with small batch size and embedding size.

I'm running the code on a machine with python 3.6, pytorch 0.3.1, K80 and CUDA 8.0 as described in the README.txt.

CUDA_VISIBLE_DEVICES=1 python3 train.py --src ~/IWT15/mono/euro.tc.en --trg ~/IWT15/mono/euro.tc.de --src_embeddings vecmap/data/euro.tc40.en.map --trg_embeddings vecmap/data/euro.tc40.de.map --save eurotc40_en2de --cuda

And I met the following fatal error.

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1522182087074/work/torch/lib/THC/generic/THCStorage.cu:58

I'm sure there is only one process on that GPU, and it requires more than 12 GB of memory. I tried very small bilingual word embeddings (17 MB for the source language and 40 MB for the target language) and batch_size = 2, but the error still occurs.

Have any solutions or insights, please?

Thank you.

PS: the program runs smoothly on CPU.
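
One thing worth checking is how much memory the model parameters alone need, because embedding and output matrices grow with the vocabulary size rather than with the batch size. Below is a rough sketch of that arithmetic, assuming 32-bit floats. The vocabulary size and embedding size are hypothetical placeholders, not values from this issue, and the estimate ignores activations, gradients and optimizer state.

```python
BYTES_PER_FLOAT32 = 4


def matrix_megabytes(rows, cols):
    """Size in MiB of a dense float32 matrix with the given shape."""
    return rows * cols * BYTES_PER_FLOAT32 / (1024.0 * 1024.0)


# Hypothetical values, for illustration only.
vocabulary_size = 1000000
embedding_size = 300

print(matrix_megabytes(vocabulary_size, embedding_size))
```

This does not diagnose the error above; it only shows why reducing the batch size may not help when vocabulary-sized matrices dominate.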

Issue Urdu-English translation

Hi Mikel!
I applied all the steps that your toolkit requires, as described in the paper, to an Urdu-English corpus, but I get very poor BLEU scores, such as 0.5 or 0.9.
Data preprocessing:
Step 1) On the monolingual data I applied tokenization, truecasing and cleaning to a sentence length of 1-50 with Moses.
Step 2) I trained word embeddings with word2vec (parameters: epochs=5, window_size=5, dimension=300), then applied MUSE for alignment and mapped them to a shared space with Vecmap.
The size of my corpus is 13k. Is that enough?
My first question is whether this toolkit supports the Urdu language.
Second, I used the toolkit's default parameters.
If the parameters affect model training, could you please share suitable values?

Some questions...

Hello.
I ran the experiment on a Tibetan-Chinese corpus of nearly 300k (30w) in size, and the result is very bad.
(Most of the translated text reads smoothly, but it is totally irrelevant to the source text.)

I did the experiment according to your paper, using BPE and Vecmap (objective nearly 34%).
Can I ask how large your training corpus is?
I wonder whether this is because my corpus is not big enough, or because something is wrong with the mapping.

Thanks again in advance!

"nan" error occurs for the variable "logprobs" in L87 of translator.py

Hi, I am running an experiment for unsupervised Slovak-English translation. I get a "nan" error during beam search decoding. That is, the variable "logprobs" in L87 of translator.py contains all "nan".
https://github.com/artetxem/undreamt/blob/master/undreamt/translator.py#L87

I follow the same setting as in your paper (train bilingual word embeddings with vecmap and run for 300k iterations). Does anyone know how to fix it? Thanks!
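
When tracking down where the values first become NaN, a small check on the scores at each decoding step can help narrow the search. Below is a minimal sketch in plain Python, assuming the log-probabilities have already been converted to nested lists of floats; `first_nan_position` is a hypothetical helper, not part of undreamt.

```python
import math


def first_nan_position(rows):
    """Return (row, column) of the first NaN in a list of lists, or None."""
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if math.isnan(value):
                return (i, j)
    return None


scores = [[-0.5, -1.2], [float("nan"), -0.3]]
print(first_nan_position(scores))  # (1, 0)
```

If every entry is already NaN at the first decoding step, the problem more likely lies in the trained weights or the inputs than in the beam search itself, though that is a guess rather than something established in this issue.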

AssertionError

Hi,

when using the train.py script on my corpora (incl. mapped embeddings with vecmap) the following error message appears (using the --cuda option):

Traceback (most recent call last):
  File "train.py", line 20, in <module>
    undreamt.train.main_train()
  File "/tmp/undreamt/undreamt/undreamt/train.py", line 189, in main_train
    bidirectional=not args.disable_bidirectional, layers=args.layers, dropout=args.dropout))
  File "/tmp/undreamt/undreamt/undreamt/devices.py", line 22, in gpu
    return x.cuda() if x is not None else None
  File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 146, in _apply
    module._apply(fn)
  File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in _apply
    self.flatten_parameters()
  File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 111, in flatten_parameters
    params = rnn.get_parameters(fn, handle, fn.weight_buf)
  File "/tmp/anaconda3/envs/cupy/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 165, in get_parameters
    assert filter_dim_a.prod() == filter_dim_a[0]
AssertionError

I'm using pytorch in version 0.3.1 via conda - could that be a problem? README.md shows that 0.3 was tested.

Many thanks in advance + cheers,

Stefan

How to use small parallel corpora in unsupervised training

I have monolingual corpora for two languages and a small parallel corpus too. How can I use that parallel corpus alongside the monolingual corpora when training in an unsupervised way?
