cybertronai / transformer-xl
Forked from kimiyoung/transformer-xl
Training Transformer-XL on 128 GPUs
License: Apache License 2.0
Steps to repro: start multi-machine run, ^C during training. After eval & load best checkpoint, it throws:
Traceback (most recent call last):
File "train.py", line 745, in <module>
main()
File "train.py", line 729, in main
model = torch.load(model_f, map_location=lambda storage, loc: storage.cuda(args.local_rank))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
return _load(f, map_location, pickle_module)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 532, in _load
magic_number = pickle_module.load(f)
EOFError: Ran out of input
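For what it's worth, this EOFError is what pickle raises when the checkpoint file is empty or truncated, which is what an interrupted write can leave behind. A minimal sketch of the failure plus a simple size guard (the path below is made up for illustration):

import os
import torch

path = '/tmp/model-best.pt'    # hypothetical checkpoint path
open(path, 'wb').close()       # zero-byte file stands in for a truncated checkpoint

if os.path.getsize(path) == 0:
    print('checkpoint is empty or truncated, skipping load')
else:
    model = torch.load(path)   # on an empty file this raises EOFError: Ran out of input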
Hello,
I have a question about adjusting the learning rate with LAMB.
In your case you have a fixed learning rate of 0.000125, and then you multiply or divide by some factor to get the base learning rate, depending on the number of GPUs:
one_machine = 'base_lr': 0.000125 * 5 / 3 = 0.00020833333
sixteen_machines = 'base_lr': 0.000125 / 4 = 0.00003125
Then you apply another equation to get the final learning rate:
BASE_LR_BATCHSIZE = 32
total_gpus = num_gpus_per_machine * config.machines
global_batch_size = config.batch_size * total_gpus
# linear LR scaling (https://arxiv.org/abs/1706.02677)
lr = config.base_lr * (global_batch_size / BASE_LR_BATCHSIZE)
This means that with 16 nodes on Amazon we get a bigger batch size and a bigger learning rate:
0.00020833333 * (96 * 16 * 8 / 32) = 0.07999999872
While a single node on Amazon gets a smaller batch size and a smaller learning rate:
0.00003125 * (96 * 1 * 8 / 32) = 0.00075
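To make the arithmetic above concrete, here is a small sketch (assuming, as in the numbers above, a per-GPU batch size of 96, 8 GPUs per machine, and BASE_LR_BATCHSIZE = 32):

BASE_LR_BATCHSIZE = 32

def scaled_lr(base_lr, machines, gpus_per_machine=8, batch_size=96):
    # linear LR scaling: lr = base_lr * (global_batch_size / BASE_LR_BATCHSIZE)
    global_batch_size = batch_size * machines * gpus_per_machine
    return base_lr * (global_batch_size / BASE_LR_BATCHSIZE)

print(scaled_lr(0.000125 * 5 / 3, machines=16))  # ~0.08
print(scaled_lr(0.000125 / 4, machines=1))       # 0.00075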
My questions are:
Thanks a lot.
Hi,
It was not clear to me from the article what your final PPL results are for each model. Can you share them too?
At first glance I actually thought you achieved the same or comparable PPL results, but I am not sure about that now. Can you clarify?
Do you have a baseline model with PPL comparable to the original base model?
Can someone use what you did as a baseline for smaller-scale research (4-8 "commodity" GPUs, for example)?
Extra detail on total training time:
I noticed that you count in tokens instead of steps, where tokens_per_global_batch = global_batch_size * seq_len.
Using the parameters in the script, a simple calculation yields, in steps (a small sketch of this calculation follows the tables):
config | num gpus | max tokens | seq len | base batch | global batch size | tokens per batch | required steps | PPL |
---|---|---|---|---|---|---|---|---|
single machine | 1 | 1.8B | 128 | 32 | 32 | 4096 | 439453.125 | ? |
single machine | 2 | 1.8B | 128 | 32 | 64 | 8192 | 219726.5625 | ? |
single machine | 4 | 1.8B | 128 | 32 | 128 | 16384 | 109863.2813 | ? |
Comparing to the base_wiki103 config from the original repo (they used only data parallelism), we get:
config | num gpus | tokens | seq len | base batch | global batch size | tokens per batch | steps | PPL |
---|---|---|---|---|---|---|---|---|
original-base-wt103 | don't care | 1.92B | 150 | don't care | 64 | 9600 | 200000 | 24 |
=> They trained on many more tokens.
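A small sketch of the steps-from-tokens calculation behind both tables:

def required_steps(max_tokens, seq_len, global_batch_size):
    # one optimizer step consumes global_batch_size * seq_len tokens
    tokens_per_batch = global_batch_size * seq_len
    return max_tokens / tokens_per_batch

print(required_steps(1.8e9, seq_len=128, global_batch_size=32))   # 439453.125 (single machine, 1 GPU)
print(required_steps(1.92e9, seq_len=150, global_batch_size=64))  # 200000.0 (original-base-wt103)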
If your results are really comparable, the model you present here is worth using as a baseline for future Transformer-XL experiments because it's faster. Right?
After switching to pytorch_april_patched and installing the requirements (pip install -r requirements.txt):
Producing dataset wiki...
encoding file testdata/wikiextracted/AA/wiki_01.txt ...
Traceback (most recent call last):
File "train.py", line 1036, in <module>
eval(f'test_{g.args.test}()')
File "<string>", line 1, in <module>
File "train.py", line 940, in test_checkpoint_wiki
data_setup()
File "train.py", line 333, in data_setup
g.corpus = get_lm_corpus(g.args.data, g.args.dataset, use_bpe=g.args.bpe)
File "/home/ubuntu/data_utils.py", line 381, in get_lm_corpus
corpus = Corpus(datadir, dataset, use_bpe, **kwargs)
File "/home/ubuntu/data_utils.py", line 309, in __init__
self.valid = self.vocab.encode_file(valid_path, ordered=True)
File "/home/ubuntu/utils/vocabulary.py", line 204, in encode_file
tokens: List[int] = self.tokenizer.encode(text) + [self.EOT]
File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
return self.convert_tokens_to_ids(self.tokenize(text))
File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
token = ''.join(self.byte_encoder[ord(b)] for b in token)
File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8212
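For context, 8212 is the Unicode code point of the em dash, and byte_encoder in the GPT-2 tokenizer maps byte values 0-255 only, so looking up a raw character outside that range fails. A minimal illustration (not the library code):

ch = '\u2014'                     # em dash; ord(ch) == 8212
print(ord(ch))                    # 8212 -- too large to be a byte_encoder key
print(list(ch.encode('utf-8')))   # [226, 128, 148] -- all < 256, so byte values would work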
In https://github.com/cybertronai/transformer-xl/blob/master/mem_transformer.py#L106 , I am wondering why only key and value are layer-normed, while query is not? In other variants (such as RelMultiHeadAttn), the qkv computation is implemented by a single self.qkv_net layer.
if self.pre_lnorm:
    ##### layer normalization
    c = self.layer_norm(c)

head_q = self.q_net(h)
head_k, head_v = torch.chunk(self.kv_net(c), 2, -1)
Sometimes this is fixed by
000003e-05', '--n_layer', '18', '--d_model', '1024', '--n_head', '16', '--d_head', '64', '--d_inner', '4096', '--dropout', '0.1', '--dropatt', '0.1', '--optim', 'lamb', '--warmup_tokens', '0', '--tgt_len', '384', '--mem_len', '384', '--eval_tgt_len', '128', '--fp16', '--dynamic_loss_scale', '--init_std', '0.005', '--div_val', '1', '--bpe', '--checkpoint_each_epoch', '1', '--checkpoint', '/ncluster/runs/txl.09/model-best.pt', '--scheduler', 'constant']' returned non-zero exit status 1.
Traceback (most recent call last):
File "train.py", line 22, in <module>
from tensorboardX import SummaryWriter
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/__init__.py", line 5, in <module>
from .torchvis import TorchVis
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/torchvis.py", line 11, in <module>
from .writer import SummaryWriter
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/writer.py", line 27, in <module>
from .event_file_writer import EventFileWriter
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
from .proto import event_pb2
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX\"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')
TypeError: __new__() got an unexpected keyword argument 'serialized_options'
(pytorch_april) ubuntu@ip-172-31-44-63:~$ source activate pytorch_p36
(pytorch_p36) ubuntu@ip-172-31-44-63:~$ NCCL_DEBUG=VERSION NCCL_MIN_NRINGS=16 NCCL_MAX_NRINGS=16 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=172.31.44.63 --master_port=6016 train.py --seed 1111 --data data/wikiextracted --dataset wiki --log_interval 100 --eval_interval 500 --max_tokens 1500000000 --logdir /ncluster/runs/wiki.09 --lr 0.000375 --batch_size 6 --eta_min 3.7500000000000003e-05 --n_layer 18 --d_model 1024 --n_head 16 --d_head 64 --d_inner 4096 --dropout 0.1 --dropatt 0.1 --optim lamb --warmup_tokens 0 --tgt_len 384 --mem_len 384 --eval_tgt_len 128 --fp16 --dynamic_loss_scale --init_std 0.005 --div_val 1 --bpe --checkpoint_each_epoch 1 --checkpoint /ncluster/runs/txl.09/model-best.pt --scheduler constant > >(tee -a /tmp/ncluster/0.wiki-dpwp8/10.out) 2> >(tee -a /tmp/ncluster/0.wiki-dpwp8/10.out >&2); echo $? > /tmp/ncluster/0.wiki-dpwp8/
Hi, the train.py file in the master branch imports the eval.py file, and eval.py imports two functions that don't exist in the current repository. Could you update the code to fix this discrepancy or give some suggestions on how to bypass it? Thanks
from generate import generate_text, prepare_git_context
from search import hidden_to_softmax
Hello,
The generation example in the readme looks impressive. Is there a script for sampling from the model? I have been trying to generate text from the Transformer-XL model but the output is nowhere near your example. Any help would be appreciated @yaroslavvb . Thanks.
Hi,
Thanks for the awesome script.
I have made a modification to get the average of the loss and ppl across all machines.
Currently, your script only stores the best validation loss on the master machine; it ignores the validation loss on the rest of the machines.
A simple solution is to call the following function after computing the training and validation loss:
import torch
import torch.distributed as dist

def reduce_tensor(value):
    # average a python scalar across all workers
    tensor = torch.Tensor([value]).type(torch.cuda.FloatTensor)
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.reduce_op.SUM)
    rt /= dist.get_world_size()
    return rt.item()
Of course, you also have to change both the train and eval functions to:
# compute average loss over last logging interval
cur_loss = train_loss / elapsed_steps
if dist.get_world_size() > 1:
    reduced_cur_loss = reduce_tensor(cur_loss)
else:
    reduced_cur_loss = cur_loss
and
# Log all the things.
mean_loss = total_loss / total_len
if dist.get_world_size() > 1:
    reduced_mean_loss = reduce_tensor(mean_loss)
else:
    reduced_mean_loss = mean_loss
Now it's every 30 mins, should be every 5 mins
Things tried
conda create --name=pytorch_11 --clone pytorch_p36
source activate pytorch_11
pip install torch==1.1.0
python -c "import torch"
On brand new machine
conda create --name=pytorch_11 --clone pytorch_p36
source activate pytorch_11
conda install pytorch torchvision -c pytorch
python -c "import torch"
Both fail with
Traceback (most recent call last):
File "check_versions.py", line 12, in <module>
import torch.optim as optim
File "/home/ubuntu/anaconda3/envs/pytorch_11/lib/python3.6/site-packages/torch/__init__.py", line 222, in <module>
class BoolStorage(_C.BoolStorageBase, _StorageBase):
AttributeError: module 'torch._C' has no attribute 'BoolStorageBase'
cc @8enmann
Some kind of weird interaction with conda clone and new pytorch install
pytorch/pytorch#20403
Hello,
I think you have 3 issues in model loading:
1. If someone loads a model from a specific checkpoint, the model will be reloaded from best-model anyway, which simply overrides it.
2. The loading of best-model is written after the "train" function, which means it will never execute.
3. Loading best-model assumes the checkpoint exists on all nodes, not only the first node.
I would recommend taking a look at
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
for how to map the checkpoint from GPU 0 on the master node to all other nodes.
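A minimal sketch (following the DDP tutorial's map_location pattern, not the repo's code) of remapping a checkpoint saved on GPU 0 of the master node onto each worker's local GPU:

import torch

def load_on_rank(path, local_rank):
    # remap tensors that were saved on cuda:0 to this process's own device
    map_location = {'cuda:0': f'cuda:{local_rank}'}
    return torch.load(path, map_location=map_location)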
Thanks.
The long local run is hanging with all 8 processes having identical stack trace and 100% nvidia-smi GPU utilization
#8 0x00007f5178571f92 in cuMemcpyDtoHAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9 0x00007f517d0984bf in ?? ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#10 0x00007f517d075573 in ?? ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#11 0x00007f517d0aed86 in cudaMemcpyAsync ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#12 0x00007f518ba39566 in at::native::_local_scalar_dense_cuda(at::Tensor const&)::{lambda()#1}::operator()() const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007f518ba3bbb7 in at::native::_local_scalar_dense_cuda(at::Tensor const&) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007f518aa70902 in at::CUDAType::_local_scalar_dense(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007f517d8e5685 in torch::autograd::VariableType::_local_scalar_dense(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#16 0x00007f517fb0392a in at::native::item(at::Tensor const&) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#17 0x00007f517fe0de15 in at::TypeDefault::item(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#18 0x00007f517dadf418 in torch::autograd::VariableType::item(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#19 0x00007f51be448756 in torch::autograd::dispatch_to_CLong(at::Tensor const&) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#20 0x00007f51be4499f0 in torch::autograd::THPVariable_item(_object*, _object*) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#21 0x000055dd363e1bda in _PyCFunction_FastCallDict ()
There's only one place in training which uses .item():
train_loss += loss.float().item()
Figure out if that's connected, and maybe change the code to not use item() here.
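If it is connected, one hedged sketch of avoiding the per-step host sync is to accumulate the loss on-device and only call .item() at the logging interval:

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
train_loss = torch.zeros(1, device=device)    # on-device accumulator
log_interval = 100

for step in range(1, 1001):
    loss = torch.rand(1, device=device)       # stand-in for the real training loss
    train_loss += loss.float().detach()       # stays on the GPU, no sync per step
    if step % log_interval == 0:
        # a single .item() per interval instead of one per step
        print(step, (train_loss / log_interval).item())
        train_loss.zero_()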
Hello,
How did you determine the max_tokens value?
Is it the total number of tokens in the dataset?
Thanks in advance.
Skipping the optimizer step (https://github.com/cybertronai/transformer-xl/blob/master/fp16_opt.py#L439 ) in the very first iteration causes an empty state dict.
This results in a KeyError from https://github.com/cybertronai/transformer-xl/blob/master/train.py#L536 .
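A hedged sketch of the failure mode described above: before the first successful optimizer step, optimizer.state has no per-parameter entries, so any code that looks up a state entry (the 'exp_avg' key below is just an illustrative Adam example) hits a KeyError.

import torch

p = torch.nn.Parameter(torch.zeros(2))
opt = torch.optim.Adam([p])

print(len(opt.state))        # 0 -- nothing is stored until a step actually runs
entry = opt.state[p]         # state is a defaultdict, so this is just an empty dict
print(entry.get('exp_avg'))  # None; entry['exp_avg'] would raise KeyError here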
Seeing a lot of prints like
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([512])
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([3, 512])
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([3])
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([50257])
and
OVERFLOW! Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
skipped iteration!
See /ncluster/runs.new/ben-bpe2/info.log
Repro:
python launch.py --config=one_machine_fp16_small --name ben-bpe2
I have a question regarding the evaluate_and_log function. Shouldn't model_to_reset be passed to the evaluate function instead of model?
idea for babysitter job, cc @8enmann
job = ncluster.make_job(name=args.name,
                        run_name=f"{args.name}",
                        num_tasks=config.machines,
                        image_name=config.image_name,
                        instance_type=config.instance_type,
                        spot=not args.nospot,
                        skip_setup=args.skip_setup)

# separate watchdog task; pass through the launcher's AWS credentials
killer_task = ncluster.make_task()
killer_task.run(f'export AWS_ACCESS_KEY_ID={os.environ["AWS_ACCESS_KEY_ID"]}')
killer_task.run(f'export AWS_SECRET_ACCESS_KEY={os.environ["AWS_SECRET_ACCESS_KEY"]}')
killer_task.run(f'export AWS_DEFAULT_REGION={os.environ["AWS_DEFAULT_REGION"]}')
killer_task.run(f'python hung_job_killer.py --watchdir={job.logdir} '
                f'--instances={",".join(t.name for t in job.tasks)}')
The hung_job_killer would check watch_dir on a regular basis and kill all instances in --instances if watch_dir had no modifications for an hour (a rough sketch is below).
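A hedged sketch of what hung_job_killer.py could look like; everything here is hypothetical, including the kill_instances stub, which would reuse the termination logic shown after this:

import argparse
import os
import time

STALE_SECONDS = 3600        # "no modifications for an hour"
CHECK_INTERVAL = 300

def newest_mtime(watchdir):
    # most recent modification time of any file under watchdir
    times = [os.path.getmtime(os.path.join(root, f))
             for root, _, files in os.walk(watchdir) for f in files]
    return max(times, default=0)

def kill_instances(names):
    # stub: would call something like the kill() logic below
    print('would terminate:', names)

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--watchdir')
    p.add_argument('--instances')   # comma-separated instance names
    args = p.parse_args()
    while True:
        if time.time() - newest_mtime(args.watchdir) > STALE_SECONDS:
            kill_instances(args.instances.split(','))
            return
        time.sleep(CHECK_INTERVAL)

if __name__ == '__main__':
    main()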
Killing can be done with a subset of the logic from the ncluster command-line tool. Currently lookup_instances has special logic for exact_match which kicks in when the fragment is wrapped in ''; this should probably be a keyword argument instead:
def kill(fragment=''):
    # u is ncluster's util module; fragment selects instances by name substring
    instances = u.lookup_instances(fragment, valid_states=['running', 'stopped'])
    instances_to_kill = []
    for i in instances:
        state = i.state['Name']
        if LIMIT_TO_CURRENT_USER and i.key_name != u.get_keypair_name():
            print(f"Skipping instance launched with key {i.key_name}, use reallykill to kill")
            continue
        print(u.get_name(i), i.instance_type, i.key_name,
              state if state == 'stopped' else '')
        instances_to_kill.append(i)

    action = 'terminating'
    if not _check_instance_found(instances, fragment):
        return
    ec2_client = u.get_ec2_client()
    # confirmation prompt (not shown in the original snippet, added so `answer` is defined)
    answer = input(f"{action} {len(instances_to_kill)} instances, proceed? (y/N) ")
    if answer.lower() == "y":
        instance_ids = [i.id for i in instances_to_kill]
        response = ec2_client.terminate_instances(InstanceIds=instance_ids)
        assert u.is_good_response(response), response
        print(f"{action}: success")
    else:
        print("Didn't get y, doing nothing")