
transformer-xl's Issues

Multi-machine test is broken

Steps to repro: start a multi-machine run and hit ^C during training. After eval, loading the best checkpoint throws:

Traceback (most recent call last):
  File "train.py", line 745, in <module>
    main()
  File "train.py", line 729, in main
    model = torch.load(model_f, map_location=lambda storage, loc: storage.cuda(args.local_rank))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 532, in _load
    magic_number = pickle_module.load(f)
EOFError: Ran out of input
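
For context, here is a defensive-loading sketch (assuming the model_f path and args.local_rank from the traceback; this is not the repo's actual fix) that skips empty or partially written checkpoint files instead of letting torch.load hit EOFError:

import os
import torch

def load_best_checkpoint(model_f, local_rank):
    # A partially written or empty checkpoint file is what produces "Ran out of input".
    if not os.path.exists(model_f) or os.path.getsize(model_f) == 0:
        print(f"checkpoint {model_f} missing or empty, skipping reload")
        return None
    return torch.load(model_f,
                      map_location=lambda storage, loc: storage.cuda(local_rank))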

Correctly adjust LR with LAMB

Hello,

I have a question regarding adjusting the learning rate with LAMB.

In your case you have a fixed learning rate of 0.000125, and then you divide or multiply by some factor to get the base learning rate depending on the number of GPUs:

one_machine = 'base_lr': 0.000125 * 5 / 3 = 0.00020833333
sixteen_machines = 'base_lr': 0.000125 / 4 = 0.00003125

Then you apply another equation to get the final learning rate:

BASE_LR_BATCHSIZE = 32
total_gpus = num_gpus_per_machine * config.machines
global_batch_size = config.batch_size * total_gpus

# linear LR scaling (https://arxiv.org/abs/1706.02677)
lr = config.base_lr * (global_batch_size / BASE_LR_BATCHSIZE)

This means that with 16 nodes on Amazon we get a bigger batch size and a bigger learning rate:
0.00020833333 * (96 * 16 * 8 / 32) = 0.07999999872
while a single node on Amazon gets a smaller batch size and a smaller learning rate:
0.00003125 * (96 * 1 * 8 / 32) = 0.00075
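
The two computations above, written out as a small sketch (assuming 8 GPUs per machine and the per-GPU batch of 96 used in my numbers; these are not necessarily the repo's config values):

BASE_LR_BATCHSIZE = 32
NUM_GPUS_PER_MACHINE = 8

def final_lr(base_lr, machines, per_gpu_batch_size):
    total_gpus = NUM_GPUS_PER_MACHINE * machines
    global_batch_size = per_gpu_batch_size * total_gpus
    # linear LR scaling (https://arxiv.org/abs/1706.02677)
    return base_lr * (global_batch_size / BASE_LR_BATCHSIZE)

print(final_lr(0.000125 * 5 / 3, machines=16, per_gpu_batch_size=96))  # ~0.08
print(final_lr(0.000125 / 4, machines=1, per_gpu_batch_size=96))       # 0.00075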

My questions are:

  1. Why is BASE_LR_BATCHSIZE 32 and not 96?
  2. If I want to train the model on x nodes with a per-GPU batch size of y, how can I determine the correct base_lr?

Thanks a lot.

Share PPL results

Hi,
It was not clear to me from the article what your final PPL results are for each model.
Can you share them too?

At first glance I actually thought that you achieved the same or comparable PPL results, but I am not sure about that now. Can you clarify?

Do you have a baseline model with comparable PPL to the original base model?

Could someone use what you did as a baseline for smaller-scale research (4-8 "commodity" GPUs, for example)?

Extra detail on total training time:
I noticed that you count in tokens instead of steps,
where tokens_per_global_batch = global_batch_size * seq_len.
Using the parameters in the script, a simple calculation yields, in steps:

config         | num gpus | max tokens | seq len | base batch | global batch size | tokens per batch | required steps | PPL
single machine | 1        | 1.8B       | 128     | 32         | 32                | 4096             | 439453.125     | ?
single machine | 2        | 1.8B       | 128     | 32         | 64                | 8192             | 219726.5625    | ?
single machine | 4        | 1.8B       | 128     | 32         | 128               | 16384            | 109863.2813    | ?
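
The "required steps" column is just max tokens divided by tokens per batch (a small sketch using the table's values):

max_tokens = 1.8e9
seq_len = 128
base_batch = 32  # per GPU

for num_gpus in (1, 2, 4):
    global_batch = base_batch * num_gpus
    tokens_per_batch = global_batch * seq_len
    print(num_gpus, tokens_per_batch, max_tokens / tokens_per_batch)
    # 1 -> 4096 tokens/batch, 439453.125 steps; 2 -> 8192, 219726.5625; 4 -> 16384, 109863.28125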

Comparing with the base_wiki103 config from the original repo
(they used only data parallelism) we get:

config              | num gpus   | tokens | seq len | base batch | global batch size | tokens per batch | steps  | PPL
original-base-wt103 | don't care | 1.92B  | 150     | don't care | 64                | 9600             | 200000 | 24

So they trained on many more tokens.
If your results are really comparable, the model you present here is worth using as a baseline for future Transformer-XL experiments because it's faster, right?

GPT-2 encoder breaks in new version of PyTorch/huggingface

After switching to the pytorch_april_patched environment and running pip install -r requirements.txt:

Producing dataset wiki...
encoding file testdata/wikiextracted/AA/wiki_01.txt ...
Traceback (most recent call last):
  File "train.py", line 1036, in <module>
    eval(f'test_{g.args.test}()')
  File "<string>", line 1, in <module>
  File "train.py", line 940, in test_checkpoint_wiki
    data_setup()
  File "train.py", line 333, in data_setup
    g.corpus = get_lm_corpus(g.args.data, g.args.dataset, use_bpe=g.args.bpe)
  File "/home/ubuntu/data_utils.py", line 381, in get_lm_corpus
    corpus = Corpus(datadir, dataset, use_bpe, **kwargs)
  File "/home/ubuntu/data_utils.py", line 309, in __init__
    self.valid = self.vocab.encode_file(valid_path, ordered=True)
  File "/home/ubuntu/utils/vocabulary.py", line 204, in encode_file
    tokens: List[int] = self.tokenizer.encode(text) + [self.EOT]
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
    return self.convert_tokens_to_ids(self.tokenize(text))
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8212
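
A minimal sketch of the likely failure mode (an assumption about the cause, not a confirmed diagnosis): the GPT-2 byte encoder only maps byte values 0-255, so calling ord() on a str character such as an em dash (code point 8212) misses the table, while encoding to UTF-8 bytes first stays in range:

byte_encoder = {i: chr(i) for i in range(256)}   # simplified stand-in for bytes_to_unicode()
token = "\u2014"                                 # em dash, ord() == 8212
try:
    ''.join(byte_encoder[ord(b)] for b in token)
except KeyError as e:
    print("KeyError:", e)                        # KeyError: 8212
print(''.join(byte_encoder[b] for b in token.encode('utf-8')))  # UTF-8 bytes are always 0-255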

unexpected keyword argument 'serialized_options' in some envs

Sometimes this is fixed by switching conda environments, as in the log below:

000003e-05', '--n_layer', '18', '--d_model', '1024', '--n_head', '16', '--d_head', '64', '--d_inner', '4096', '--dropout', '0.1', '--dropatt', '0.1', '--optim', 'lamb', '--warmup_tokens', '0', '--tgt_len', '384', '--mem_len', '384', '--eval_tgt_len', '128', '--fp16', '--dynamic_loss_scale', '--init_std', '0.005', '--div_val', '1', '--bpe', '--checkpoint_each_epoch', '1', '--checkpoint', '/ncluster/runs/txl.09/model-best.pt', '--scheduler', 'constant']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "train.py", line 22, in <module>
    from tensorboardX import SummaryWriter
  File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/torchvis.py", line 11, in <module>
    from .writer import SummaryWriter
  File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/writer.py", line 27, in <module>
    from .event_file_writer import EventFileWriter
  File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
    from .proto import event_pb2
  File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
    from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
  File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
    from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
  File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
    from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
  File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
    serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX\"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')
TypeError: __new__() got an unexpected keyword argument 'serialized_options'
(pytorch_april) ubuntu@ip-172-31-44-63:~$ source activate pytorch_p36
(pytorch_p36) ubuntu@ip-172-31-44-63:~$ NCCL_DEBUG=VERSION NCCL_MIN_NRINGS=16 NCCL_MAX_NRINGS=16  python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=172.31.44.63 --master_port=6016 train.py --seed 1111 --data data/wikiextracted --dataset wiki  --log_interval 100 --eval_interval 500 --max_tokens 1500000000 --logdir /ncluster/runs/wiki.09 --lr 0.000375 --batch_size 6 --eta_min 3.7500000000000003e-05 --n_layer 18 --d_model 1024 --n_head 16 --d_head 64 --d_inner 4096 --dropout 0.1 --dropatt 0.1 --optim lamb --warmup_tokens 0 --tgt_len 384 --mem_len 384 --eval_tgt_len 128 --fp16 --dynamic_loss_scale --init_std 0.005 --div_val 1 --bpe --checkpoint_each_epoch 1 --checkpoint /ncluster/runs/txl.09/model-best.pt --scheduler constant > >(tee -a /tmp/ncluster/0.wiki-dpwp8/10.out) 2> >(tee -a /tmp/ncluster/0.wiki-dpwp8/10.out >&2); echo $? > /tmp/ncluster/0.wiki-dpwp8/

Module Not Found Error

Hi, train.py in the master branch imports eval.py, and eval.py imports two functions that don't exist in the current repository. Could you update the code to fix this discrepancy, or suggest how to bypass it? Thanks.

from generate import generate_text, prepare_git_context
from search import hidden_to_softmax
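
A possible workaround until the repo is updated (a hypothetical guard, not existing repo code) is to make the imports optional, so train.py can still run without the text-generation helpers:

try:
    from generate import generate_text, prepare_git_context
    from search import hidden_to_softmax
except ImportError:
    # generate.py / search.py are missing; disable the features that need them.
    generate_text = prepare_git_context = hidden_to_softmax = None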

Generating text from the model

Hello,

The generation example in the readme seems impressive. Is there a script for sampling from the model? I have been trying to generate text from the Transformer-XL model, but the output is nowhere near your example. Any help would be appreciated @yaroslavvb. Thanks.
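
In case it helps as a starting point, here is a generic temperature/top-k sampling loop (a sketch with hypothetical model and tokenizer objects; it is not a script that ships with this repo and assumes the model returns logits of shape [batch, seq, vocab]):

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, tokenizer, prompt, steps=200, temperature=0.9, top_k=40, device='cuda'):
    tokens = tokenizer.encode(prompt)
    for _ in range(steps):
        x = torch.tensor([tokens], device=device)
        logits = model(x)[0, -1] / temperature       # logits for the last position
        top_vals, top_idx = logits.topk(top_k)       # restrict to the k most likely tokens
        probs = F.softmax(top_vals, dim=-1)
        next_token = top_idx[torch.multinomial(probs, 1)].item()
        tokens.append(next_token)
    return tokenizer.decode(tokens)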

Reduce Loss

Hi,

Thanks for the awesome script.

I have made a modification to average the loss and ppl across all machines.
Currently, your script only stores the best value of the validation loss on the master machine; it ignores the validation loss on the rest of the machines.
A simple solution I made is to call the following function after computing the training and validation loss:

import torch
import torch.distributed as dist

def reduce_tensor(value):
    # Average a scalar across all workers in the process group.
    tensor = torch.Tensor([value]).type(torch.cuda.FloatTensor)
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.reduce_op.SUM)
    rt /= dist.get_world_size()
    return rt.item()

Of course, you have to change both the train and eval functions to:

            # compute average loss over last logging interval
            cur_loss = train_loss / elapsed_steps
            if dist.get_world_size() > 1:
                reduced_cur_loss = reduce_tensor(cur_loss)
            else:
                reduced_cur_loss = cur_loss

and

    # Log all the things.
    mean_loss = total_loss / total_len
    if dist.get_world_size() > 1:
        reduced_mean_loss = reduce_tensor(mean_loss)
    else:
        reduced_mean_loss = mean_loss

Figure out how to install pytorch 1.1 in new env on AWS

Things tried

conda create --name=pytorch_11 --clone pytorch_p36
source activate pytorch_11
pip install torch==1.1.0
python -c "import torch"

On a brand new machine

conda create --name=pytorch_11 --clone pytorch_p36
source activate pytorch_11
conda install pytorch torchvision -c pytorch
python -c "import torch"

Both fail with

Traceback (most recent call last):
  File "check_versions.py", line 12, in <module>
    import torch.optim as optim
  File "/home/ubuntu/anaconda3/envs/pytorch_11/lib/python3.6/site-packages/torch/__init__.py", line 222, in <module>
    class BoolStorage(_C.BoolStorageBase, _StorageBase):
AttributeError: module 'torch._C' has no attribute 'BoolStorageBase'

cc @8enmann

Some kind of weird interaction between conda clone and a fresh pytorch install:
pytorch/pytorch#20403

Incorrect model loading

Hello,

I think you have 3 issues in model loading:

  1. If someone loads a model from a specific checkpoint, the model will then be reloaded from best-model, which simply overrides it.

  2. The loading of best-model is placed after the "train" function, which means it will never execute.

  3. Loading best-model assumes the checkpoint exists on all nodes, not only on the first node.
    I would recommend taking a look at
    https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
    for how to map a checkpoint saved on GPU 0 of the master node to the other nodes; see the sketch below.
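
A minimal sketch of rank-aware loading along the lines of that tutorial (assuming a model_f path, an initialized process group, and args.local_rank; this is not the repo's code):

import torch
import torch.distributed as dist

def load_best_model(model_f, local_rank):
    # Wait until rank 0 has finished writing the checkpoint before other ranks read it.
    if dist.is_initialized():
        dist.barrier()
    # Remap tensors saved from the master's GPU 0 onto this process's GPU.
    map_location = {'cuda:0': f'cuda:{local_rank}'}
    return torch.load(model_f, map_location=map_location)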

Thanks.

local 8-GPU run hangs in .item after 3 days

The long local run is hanging, with all 8 processes showing an identical stack trace and 100% GPU utilization in nvidia-smi:

#8  0x00007f5178571f92 in cuMemcpyDtoHAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007f517d0984bf in ?? ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#10 0x00007f517d075573 in ?? ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#11 0x00007f517d0aed86 in cudaMemcpyAsync ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#12 0x00007f518ba39566 in at::native::_local_scalar_dense_cuda(at::Tensor const&)::{lambda()#1}::operator()() const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007f518ba3bbb7 in at::native::_local_scalar_dense_cuda(at::Tensor const&) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007f518aa70902 in at::CUDAType::_local_scalar_dense(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007f517d8e5685 in torch::autograd::VariableType::_local_scalar_dense(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#16 0x00007f517fb0392a in at::native::item(at::Tensor const&) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#17 0x00007f517fe0de15 in at::TypeDefault::item(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#18 0x00007f517dadf418 in torch::autograd::VariableType::item(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#19 0x00007f51be448756 in torch::autograd::dispatch_to_CLong(at::Tensor const&) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#20 0x00007f51be4499f0 in torch::autograd::THPVariable_item(_object*, _object*) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#21 0x000055dd363e1bda in _PyCFunction_FastCallDict ()

There's only one place in training that uses .item():

train_loss += loss.float().item()

Figure out whether that's connected, and maybe change the code to not call item() here.
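
One possible mitigation (a sketch, not the repo's training loop): keep the running loss on the GPU and synchronize only at logging time, so the blocking host-device copy in .item() happens once per interval instead of every step.

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
log_interval = 100
train_loss = torch.zeros(1, device=device)

for step in range(1, 1001):
    loss = torch.rand(1, device=device)                # stand-in for the real loss
    train_loss += loss.float().detach()                # stays on device, no sync
    if step % log_interval == 0:
        cur_loss = (train_loss / log_interval).item()  # single sync per interval
        print(f"step {step}: loss {cur_loss:.4f}")
        train_loss.zero_()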

How to determine max_tokens ?

Hello,

How did you determine the max_tokens value?
Is it the total number of tokens in the dataset?

Thanks in advance.

FP16 warnings

Seeing a lot of prints like

FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([512])
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([3, 512])
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([3])
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([50257])

and

OVERFLOW! Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
skipped iteration!

See /ncluster/runs.new/ben-bpe2/info.log
Repro:
python launch.py --config=one_machine_fp16_small --name ben-bpe2
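
For context, this is roughly what dynamic loss scaling does on overflow (a generic sketch of the mechanism, not FP16_Optimizer's actual code): the update is skipped and the loss scale is halved, which matches the 2097152.0 -> 1048576.0 line above.

loss_scale = 2097152.0

def step_with_dynamic_scaling(grads_overflowed: bool) -> bool:
    """Return True if the optimizer step was applied, False if it was skipped."""
    global loss_scale
    if grads_overflowed:
        loss_scale /= 2.0
        print(f"OVERFLOW! Skipping step. Reducing loss scale to {loss_scale}")
        return False
    # ...otherwise unscale gradients by loss_scale and apply the real optimizer step
    return True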

babysitter job to automatically kill hung jobs

idea for babysitter job, cc @8enmann

job = ncluster.make_job(name=args.name,
                            run_name=f"{args.name}",
                            num_tasks=config.machines,
                            image_name=config.image_name,
                            instance_type=config.instance_type,
                            spot=not args.nospot,
                            skip_setup=args.skip_setup)

killer_task = ncluster.make_task()
killer_task.run(f'export AWS_ACCESS_KEY_ID={os.environ["AWS_ACCESS_KEY_ID"]}')
killer_task.run(f'export AWS_SECRET_ACCESS_KEY={os.environ["AWS_SECRET_ACCESS_KEY"]}')
killer_task.run(f'export AWS_DEFAULT_REGION={os.environ["AWS_DEFAULT_REGION"]}')
killer_task.run(f'python hung_job_killer.py --watchdir={job.logdir} --instances={",".join(t.name for t in job.tasks)}')

The hung_job_killer would check --watchdir on a regular basis and kill all instances in --instances if the watch dir had no modifications for an hour.
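
A sketch of that hung_job_killer script (hypothetical; it does not exist in the repo, and the kill step would reuse the logic below):

import argparse
import os
import time

STALE_SECONDS = 3600  # an hour with no log updates counts as hung

def latest_mtime(path):
    # Most recent modification time of any file under path (or of path itself if empty).
    mtimes = [os.path.getmtime(os.path.join(root, f))
              for root, _, files in os.walk(path) for f in files]
    return max(mtimes, default=os.path.getmtime(path))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--watchdir', required=True)
    parser.add_argument('--instances', required=True)  # comma-separated instance names
    args = parser.parse_args()
    while True:
        if time.time() - latest_mtime(args.watchdir) > STALE_SECONDS:
            print(f"{args.watchdir} has been idle for an hour, killing {args.instances}")
            # call the kill logic below on each name in args.instances
            break
        time.sleep(60)

if __name__ == '__main__':
    main()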

Killing can be done with a subset of the logic from the ncluster command-line tool. Currently lookup_instances has special logic for exact_match that kicks in when the fragment is wrapped in ''; this should probably be a keyword argument instead:

def kill(fragment=''):
  instances = u.lookup_instances(fragment, valid_states=['running', 'stopped'])
  instances_to_kill = []
  for i in instances:
    state = i.state['Name']
    if LIMIT_TO_CURRENT_USER and i.key_name != u.get_keypair_name():
      print(f"Skipping instance launched with key {i.key_name}, use reallykill to kill")
      continue
    print(u.get_name(i), i.instance_type, i.key_name,
          state if state == 'stopped' else '')
    instances_to_kill.append(i)

  action = 'terminating'
  if not _check_instance_found(instances, fragment):
    return

  answer = input(f"{action} {len(instances_to_kill)} instances, proceed? (y/n) ")  # confirm before terminating
  ec2_client = u.get_ec2_client()
  if answer.lower() == "y":
    instance_ids = [i.id for i in instances_to_kill]
    response = ec2_client.terminate_instances(InstanceIds=instance_ids)

    assert u.is_good_response(response), response
    print(f"{action}: success")
  else:
    print("Didn't get y, doing nothing")
