cybertronai / transformer-xl
Forked from kimiyoung/transformer-xl
Training Transformer-XL on 128 GPUs
License: Apache License 2.0
Steps to repro: start multi-machine run, ^C during training. After eval & load best checkpoint, it throws:
Traceback (most recent call last):
File "train.py", line 745, in <module>
main()
File "train.py", line 729, in main
model = torch.load(model_f, map_location=lambda storage, loc: storage.cuda(args.local_rank))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
return _load(f, map_location, pickle_module)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/serialization.py", line 532, in _load
magic_number = pickle_module.load(f)
EOFError: Ran out of input
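For what it's worth, this EOFError is what pickle raises when the checkpoint file is empty or truncated, which is what an interrupted write can leave behind. A minimal sketch of the failure plus a simple size guard (the path below is made up for illustration):

import os
import torch

path = '/tmp/model-best.pt'    # hypothetical checkpoint path
open(path, 'wb').close()       # zero-byte file stands in for a truncated checkpoint

if os.path.getsize(path) == 0:
    print('checkpoint is empty or truncated, skipping load')
else:
    model = torch.load(path)   # on an empty file this raises EOFError: Ran out of input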
Hello,
I have a question about adjusting the learning rate with LAMB.
In your case you have a fixed learning rate of 0.000125, and then you multiply or divide by some factor to get the base learning rate, depending on the number of GPUs:
one_machine = 'base_lr': 0.000125 * 5 / 3 = 0.00020833333
sixteen_machines = 'base_lr': 0.000125 / 4 = 0.00003125
Then you apply another equation to get the final learning rate:
BASE_LR_BATCHSIZE = 32
total_gpus = num_gpus_per_machine * config.machines
global_batch_size = config.batch_size * total_gpus
# linear LR scaling (https://arxiv.org/abs/1706.02677)
lr = config.base_lr * (global_batch_size / BASE_LR_BATCHSIZE)
This means that with 16 nodes on Amazon we get a bigger batch size and a bigger learning rate:
0.00020833333 * (96 * 16 * 8 / 32) = 0.07999999872
While a single node on Amazon gets a smaller batch size and a smaller learning rate:
0.00003125 * (96 * 1 * 8 / 32) = 0.00075
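To make the arithmetic above concrete, here is a small sketch (assuming, as in the numbers above, a per-GPU batch size of 96, 8 GPUs per machine, and BASE_LR_BATCHSIZE = 32):

BASE_LR_BATCHSIZE = 32

def scaled_lr(base_lr, machines, gpus_per_machine=8, batch_size=96):
    # linear LR scaling: lr = base_lr * (global_batch_size / BASE_LR_BATCHSIZE)
    global_batch_size = batch_size * machines * gpus_per_machine
    return base_lr * (global_batch_size / BASE_LR_BATCHSIZE)

print(scaled_lr(0.000125 * 5 / 3, machines=16))  # ~0.08
print(scaled_lr(0.000125 / 4, machines=1))       # 0.00075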
My questions are:
Thanks a lot.
Hi,
It was not clear to me from the article what your final PPL results are for each model. Can you share them too?
At first glance I actually thought you achieved the same or comparable PPL results, but I am not sure about that now. Can you clarify?
Do you have a baseline model with PPL comparable to the original base model?
Can someone use what you did as a baseline for smaller-scale research (4-8 "commodity" GPUs, for example)?
Extra detail on total training time:
I noticed that you count in tokens instead of steps, where tokens_per_global_batch = global_batch_size * seq_len.
Using the parameters in the script, a simple calculation yields, in steps (a small sketch of this calculation follows the tables):
config | num gpus | max tokens | seq len | base batch | global batch size | tokens per batch | required steps | PPL |
---|---|---|---|---|---|---|---|---|
single machine | 1 | 1.8B | 128 | 32 | 32 | 4096 | 439453.125 | ? |
single machine | 2 | 1.8B | 128 | 32 | 64 | 8192 | 219726.5625 | ? |
single machine | 4 | 1.8B | 128 | 32 | 128 | 16384 | 109863.2813 | ? |
Comparing to the base_wiki103 config from the original repo (they used only data parallelism), we get:
config | num gpus | tokens | seq len | base batch | global batch size | tokens per batch | steps | PPL |
---|---|---|---|---|---|---|---|---|
original-base-wt103 | don't care | 1.92B | 150 | don't care | 64 | 9600 | 200000 | 24 |
=> They trained on many more tokens.
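A small sketch of the steps-from-tokens calculation behind both tables:

def required_steps(max_tokens, seq_len, global_batch_size):
    # one optimizer step consumes global_batch_size * seq_len tokens
    tokens_per_batch = global_batch_size * seq_len
    return max_tokens / tokens_per_batch

print(required_steps(1.8e9, seq_len=128, global_batch_size=32))   # 439453.125 (single machine, 1 GPU)
print(required_steps(1.92e9, seq_len=150, global_batch_size=64))  # 200000.0 (original-base-wt103)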
If your results are really comparable, the model you present here is worth using as a baseline for future Transformer-XL experiments because it's faster. Right?
After switching to pytorch_april_patched and installing the requirements (pip install -r requirements.txt):
Producing dataset wiki...
encoding file testdata/wikiextracted/AA/wiki_01.txt ...
Traceback (most recent call last):
File "train.py", line 1036, in <module>
eval(f'test_{g.args.test}()')
File "<string>", line 1, in <module>
File "train.py", line 940, in test_checkpoint_wiki
data_setup()
File "train.py", line 333, in data_setup
g.corpus = get_lm_corpus(g.args.data, g.args.dataset, use_bpe=g.args.bpe)
File "/home/ubuntu/data_utils.py", line 381, in get_lm_corpus
corpus = Corpus(datadir, dataset, use_bpe, **kwargs)
File "/home/ubuntu/data_utils.py", line 309, in __init__
self.valid = self.vocab.encode_file(valid_path, ordered=True)
File "/home/ubuntu/utils/vocabulary.py", line 204, in encode_file
tokens: List[int] = self.tokenizer.encode(text) + [self.EOT]
File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
return self.convert_tokens_to_ids(self.tokenize(text))
File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
token = ''.join(self.byte_encoder[ord(b)] for b in token)
File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8212
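For context, 8212 is the Unicode code point of the em dash, and byte_encoder in the GPT-2 tokenizer maps byte values 0-255 only, so looking up a raw character outside that range fails. A minimal illustration (not the library code):

ch = '\u2014'                     # em dash; ord(ch) == 8212
print(ord(ch))                    # 8212 -- too large to be a byte_encoder key
print(list(ch.encode('utf-8')))   # [226, 128, 148] -- all < 256, so byte values would work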
In https://github.com/cybertronai/transformer-xl/blob/master/mem_transformer.py#L106 , I am wondering why only key and value are layer-normed, while query is not? In other variants (such as RelMultiHeadAttn), the qkv computation is implemented by a single self.qkv_net layer.
if self.pre_lnorm:
    ##### layer normalization
    c = self.layer_norm(c)

head_q = self.q_net(h)
head_k, head_v = torch.chunk(self.kv_net(c), 2, -1)
Sometimes this is fixed by
000003e-05', '--n_layer', '18', '--d_model', '1024', '--n_head', '16', '--d_head', '64', '--d_inner', '4096', '--dropout', '0.1', '--dropatt', '0.1', '--optim', 'lamb', '--warmup_tokens', '0', '--tgt_len', '384', '--mem_len', '384', '--eval_tgt_len', '128', '--fp16', '--dynamic_loss_scale', '--init_std', '0.005', '--div_val', '1', '--bpe', '--checkpoint_each_epoch', '1', '--checkpoint', '/ncluster/runs/txl.09/model-best.pt', '--scheduler', 'constant']' returned non-zero exit status 1.
Traceback (most recent call last):
File "train.py", line 22, in <module>
from tensorboardX import SummaryWriter
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/__init__.py", line 5, in <module>
from .torchvis import TorchVis
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/torchvis.py", line 11, in <module>
from .writer import SummaryWriter
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/writer.py", line 27, in <module>
from .event_file_writer import EventFileWriter
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
from .proto import event_pb2
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
File "/home/ubuntu/anaconda3/envs/pytorch_april/lib/python3.6/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX\"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')
TypeError: __new__() got an unexpected keyword argument 'serialized_options'
(pytorch_april) ubuntu@ip-172-31-44-63:~$ source activate pytorch_p36
(pytorch_p36) ubuntu@ip-172-31-44-63:~$ NCCL_DEBUG=VERSION NCCL_MIN_NRINGS=16 NCCL_MAX_NRINGS=16 python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=172.31.44.63 --master_port=6016 train.py --seed 1111 --data data/wikiextracted --dataset wiki --log_interval 100 --eval_interval 500 --max_tokens 1500000000 --logdir /ncluster/runs/wiki.09 --lr 0.000375 --batch_size 6 --eta_min 3.7500000000000003e-05 --n_layer 18 --d_model 1024 --n_head 16 --d_head 64 --d_inner 4096 --dropout 0.1 --dropatt 0.1 --optim lamb --warmup_tokens 0 --tgt_len 384 --mem_len 384 --eval_tgt_len 128 --fp16 --dynamic_loss_scale --init_std 0.005 --div_val 1 --bpe --checkpoint_each_epoch 1 --checkpoint /ncluster/runs/txl.09/model-best.pt --scheduler constant > >(tee -a /tmp/ncluster/0.wiki-dpwp8/10.out) 2> >(tee -a /tmp/ncluster/0.wiki-dpwp8/10.out >&2); echo $? > /tmp/ncluster/0.wiki-dpwp8/
Hi, the train.py file in the master branch imports the eval.py file, and eval.py imports two functions that don't exist in the current repository. Could you update the code to fix this discrepancy or give some suggestions on how to bypass it? Thanks
from generate import generate_text, prepare_git_context
from search import hidden_to_softmax
Hello,
The generation example in the readme looks impressive. Is there a script for sampling from the model? I have been trying to generate text from the Transformer-XL model but the output is nowhere near your example. Any help would be appreciated @yaroslavvb . Thanks.
Hi,
Thanks for the awesome script.
I have made a modification to get the average of the loss and ppl across all machines.
Currently, your script only stores the best validation loss on the master machine; it ignores the validation loss on the rest of the machines.
A simple solution is to call the following function after computing the training and validation loss:
import torch
import torch.distributed as dist

def reduce_tensor(value):
    # average a python scalar across all workers
    tensor = torch.Tensor([value]).type(torch.cuda.FloatTensor)
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.reduce_op.SUM)
    rt /= dist.get_world_size()
    return rt.item()
Of course, you also have to change both the train and eval functions to:
# compute average loss over last logging interval
cur_loss = train_loss / elapsed_steps
if dist.get_world_size() > 1:
    reduced_cur_loss = reduce_tensor(cur_loss)
else:
    reduced_cur_loss = cur_loss
and
# Log all the things.
mean_loss = total_loss / total_len
if dist.get_world_size() > 1:
    reduced_mean_loss = reduce_tensor(mean_loss)
else:
    reduced_mean_loss = mean_loss
Now it's every 30 mins, should be every 5 mins
Things tried
conda create --name=pytorch_11 --clone pytorch_p36
source activate pytorch_11
pip install torch==1.1.0
python -c "import torch"
On brand new machine
conda create --name=pytorch_11 --clone pytorch_p36
source activate pytorch_11
conda install pytorch torchvision -c pytorch
python -c "import torch"
Both fail with
Traceback (most recent call last):
File "check_versions.py", line 12, in <module>
import torch.optim as optim
File "/home/ubuntu/anaconda3/envs/pytorch_11/lib/python3.6/site-packages/torch/__init__.py", line 222, in <module>
class BoolStorage(_C.BoolStorageBase, _StorageBase):
AttributeError: module 'torch._C' has no attribute 'BoolStorageBase'
cc @8enmann
Some kind of weird interaction with conda clone and new pytorch install
pytorch/pytorch#20403
Hello,
I think you have 3 issues in model loading:
1. If someone loads a model from a specific checkpoint, the model will be reloaded from best-model anyway, which simply overrides it.
2. The loading of best-model is written after the "train" function, which means it will never execute.
3. Loading best-model assumes the checkpoint exists on all nodes, not only the first node.
I would recommend taking a look at
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
for how to map the checkpoint from GPU 0 on the master node to all other nodes.
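A minimal sketch (following the DDP tutorial's map_location pattern, not the repo's code) of remapping a checkpoint saved on GPU 0 of the master node onto each worker's local GPU:

import torch

def load_on_rank(path, local_rank):
    # remap tensors that were saved on cuda:0 to this process's own device
    map_location = {'cuda:0': f'cuda:{local_rank}'}
    return torch.load(path, map_location=map_location)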
Thanks.
The long local run is hanging with all 8 processes having identical stack trace and 100% nvidia-smi GPU utilization
#8 0x00007f5178571f92 in cuMemcpyDtoHAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9 0x00007f517d0984bf in ?? ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#10 0x00007f517d075573 in ?? ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#11 0x00007f517d0aed86 in cudaMemcpyAsync ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#12 0x00007f518ba39566 in at::native::_local_scalar_dense_cuda(at::Tensor const&)::{lambda()#1}::operator()() const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007f518ba3bbb7 in at::native::_local_scalar_dense_cuda(at::Tensor const&) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007f518aa70902 in at::CUDAType::_local_scalar_dense(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007f517d8e5685 in torch::autograd::VariableType::_local_scalar_dense(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#16 0x00007f517fb0392a in at::native::item(at::Tensor const&) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#17 0x00007f517fe0de15 in at::TypeDefault::item(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#18 0x00007f517dadf418 in torch::autograd::VariableType::item(at::Tensor const&) const ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#19 0x00007f51be448756 in torch::autograd::dispatch_to_CLong(at::Tensor const&) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#20 0x00007f51be4499f0 in torch::autograd::THPVariable_item(_object*, _object*) ()
from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#21 0x000055dd363e1bda in _PyCFunction_FastCallDict ()
There's only one place in training which uses .item():
train_loss += loss.float().item()
Figure out if that's connected, and maybe change the code to not use item() here.
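If it is connected, one hedged sketch of avoiding the per-step host sync is to accumulate the loss on-device and only call .item() at the logging interval:

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
train_loss = torch.zeros(1, device=device)    # on-device accumulator
log_interval = 100

for step in range(1, 1001):
    loss = torch.rand(1, device=device)       # stand-in for the real training loss
    train_loss += loss.float().detach()       # stays on the GPU, no sync per step
    if step % log_interval == 0:
        # a single .item() per interval instead of one per step
        print(step, (train_loss / log_interval).item())
        train_loss.zero_()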
Hello,
How did you determine the max_tokens value?
Is it the total number of tokens in the dataset?
Thanks in advance.
Skipping the optimizer step (https://github.com/cybertronai/transformer-xl/blob/master/fp16_opt.py#L439 ) in the very first iteration causes an empty state dict.
This results in a KeyError from https://github.com/cybertronai/transformer-xl/blob/master/train.py#L536 .
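A hedged sketch of the failure mode described above: before the first successful optimizer step, optimizer.state has no per-parameter entries, so any code that looks up a state entry (the 'exp_avg' key below is just an illustrative Adam example) hits a KeyError.

import torch

p = torch.nn.Parameter(torch.zeros(2))
opt = torch.optim.Adam([p])

print(len(opt.state))        # 0 -- nothing is stored until a step actually runs
entry = opt.state[p]         # state is a defaultdict, so this is just an empty dict
print(entry.get('exp_avg'))  # None; entry['exp_avg'] would raise KeyError here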
Seeing a lot of prints like
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([512])
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([3, 512])
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([3])
FP16_Optimizer received torch.cuda.HalfTensor with torch.Size([50257])
and
OVERFLOW! Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
skipped iteration!
See /ncluster/runs.new/ben-bpe2/info.log
Repro:
python launch.py --config=one_machine_fp16_small --name ben-bpe2
I have a question regarding the evaluate_and_log function. Shouldn't model_to_reset be passed to the evaluate function instead of model?
idea for babysitter job, cc @8enmann
job = ncluster.make_job(name=args.name,
                        run_name=f"{args.name}",
                        num_tasks=config.machines,
                        image_name=config.image_name,
                        instance_type=config.instance_type,
                        spot=not args.nospot,
                        skip_setup=args.skip_setup)

# separate watchdog task; pass through the launcher's AWS credentials
killer_task = ncluster.make_task()
killer_task.run(f'export AWS_ACCESS_KEY_ID={os.environ["AWS_ACCESS_KEY_ID"]}')
killer_task.run(f'export AWS_SECRET_ACCESS_KEY={os.environ["AWS_SECRET_ACCESS_KEY"]}')
killer_task.run(f'export AWS_DEFAULT_REGION={os.environ["AWS_DEFAULT_REGION"]}')
killer_task.run(f'python hung_job_killer.py --watchdir={job.logdir} '
                f'--instances={",".join(t.name for t in job.tasks)}')
The hung_job_killer would check watch_dir on a regular basis and kill all instances in --instances if watch_dir had no modifications for an hour (a rough sketch is below).
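A hedged sketch of what hung_job_killer.py could look like; everything here is hypothetical, including the kill_instances stub, which would reuse the termination logic shown after this:

import argparse
import os
import time

STALE_SECONDS = 3600        # "no modifications for an hour"
CHECK_INTERVAL = 300

def newest_mtime(watchdir):
    # most recent modification time of any file under watchdir
    times = [os.path.getmtime(os.path.join(root, f))
             for root, _, files in os.walk(watchdir) for f in files]
    return max(times, default=0)

def kill_instances(names):
    # stub: would call something like the kill() logic below
    print('would terminate:', names)

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--watchdir')
    p.add_argument('--instances')   # comma-separated instance names
    args = p.parse_args()
    while True:
        if time.time() - newest_mtime(args.watchdir) > STALE_SECONDS:
            kill_instances(args.instances.split(','))
            return
        time.sleep(CHECK_INTERVAL)

if __name__ == '__main__':
    main()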
Killing can be done with a subset of the logic from the ncluster command-line tool. Currently lookup_instances has special logic for exact_match which kicks in when the fragment is wrapped in ''; this should probably be a keyword argument instead:
def kill(fragment=''):
    # u is ncluster's util module; fragment selects instances by name substring
    instances = u.lookup_instances(fragment, valid_states=['running', 'stopped'])
    instances_to_kill = []
    for i in instances:
        state = i.state['Name']
        if LIMIT_TO_CURRENT_USER and i.key_name != u.get_keypair_name():
            print(f"Skipping instance launched with key {i.key_name}, use reallykill to kill")
            continue
        print(u.get_name(i), i.instance_type, i.key_name,
              state if state == 'stopped' else '')
        instances_to_kill.append(i)

    action = 'terminating'
    if not _check_instance_found(instances, fragment):
        return
    ec2_client = u.get_ec2_client()
    # confirmation prompt (not shown in the original snippet, added so `answer` is defined)
    answer = input(f"{action} {len(instances_to_kill)} instances, proceed? (y/N) ")
    if answer.lower() == "y":
        instance_ids = [i.id for i in instances_to_kill]
        response = ec2_client.terminate_instances(InstanceIds=instance_ids)
        assert u.is_good_response(response), response
        print(f"{action}: success")
    else:
        print("Didn't get y, doing nothing")