lucidrains / reformer-pytorch
Reformer, the efficient Transformer, in Pytorch
License: MIT License
Hi, thanks for this codebase. I'm looking to adapt Reformer for SQuAD, and I'm not sure exactly how. Do you think it will look something like this, from the pytorch-transformers GitHub? https://github.com/huggingface/transformers/blob/master/examples/run_squad.py
I'm having trouble seeing how I would pass two inputs (context and query) into the model's forward() function instead of one. Thanks!
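One common way to feed a question and a context to a model whose forward() takes a single sequence is to concatenate them with a separator token, as BERT-style QA pipelines do. The sketch below only illustrates that packing step in plain Python; `build_qa_input`, `sep_id` and `pad_id` are hypothetical names, not part of reformer-pytorch.

```python
def build_qa_input(question_ids, context_ids, sep_id, pad_id, max_len):
    """Pack question and context into one fixed-length sequence.

    Returns (ids, mask, context_start): mask is True on real tokens, and
    context_start is the offset to add to answer-span positions.
    """
    budget = max_len - len(question_ids) - 1  # room left for the context
    if budget <= 0:
        raise ValueError("question does not fit in max_len")
    ids = list(question_ids) + [sep_id] + list(context_ids[:budget])
    mask = [True] * len(ids)
    context_start = len(question_ids) + 1
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [False] * n_pad, context_start


ids, mask, start = build_qa_input([5, 6], [7, 8, 9], sep_id=1, pad_id=0, max_len=8)
print(ids, mask, start)
```

The returned mask marks the padded tail, and answer start/end labels would be shifted by `context_start` before computing a span loss.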
Hi! Thanks for implementing a new interesting model, Reformer.
Have you had a chance to report results on imagenet64 and enwik8-64K? It would be great if you could validate the model on them, at least to check that the code is implemented as intended.
Thank you
Hello @lucidrains and thanks for the nice implementation!
I have a question about using LSH in the contextual attention in a decoder, which can't have Q equal to K.
In that case some queries might end up in buckets without keys, as the paper mentions, but there are ways around that - like picking the next best bucket (2nd or 3rd argmax) or just zeroing attention for that query. Do you think that could work?
In the context of MT, for example, if we can only use LSH in self-attention, then the encoder could read some 64k tokens, but the decoder would be limited to outputting partial sequences (which sounds reasonable, but does not exploit the full capabilities of the Reformer).
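The "next best bucket" idea above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the repository's implementation: it uses the angular hash from the Reformer paper (argmax over the concatenation of xR and -xR), hashes keys and queries separately, and lets a query fall back to its highest-ranked bucket that actually contains a key. All function names are hypothetical.

```python
import random


def random_rotations(dim, n_buckets, seed=0):
    # One random projection row per pair of buckets (xR and -xR).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_buckets // 2)]


def ranked_buckets(vec, rotations):
    # Scores over [xR; -xR]; the best bucket is the argmax, the rest are fallbacks.
    proj = [sum(v * r for v, r in zip(vec, row)) for row in rotations]
    scores = proj + [-p for p in proj]
    return sorted(range(len(scores)), key=lambda b: -scores[b])


def assign_queries(queries, keys, rotations):
    # Keys always go to their best bucket.
    key_buckets = {}
    for i, k in enumerate(keys):
        key_buckets.setdefault(ranked_buckets(k, rotations)[0], []).append(i)
    # Each query takes its best-ranked bucket that is not empty of keys.
    assignment = []
    for q in queries:
        chosen = next(b for b in ranked_buckets(q, rotations) if b in key_buckets)
        assignment.append((chosen, key_buckets[chosen]))
    return assignment


rot = random_rotations(dim=4, n_buckets=8, seed=1)
rng = random.Random(2)
keys = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
queries = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(10)]
for bucket, key_ids in assign_queries(queries, keys, rot):
    print(bucket, key_ids)
```

Whether this fallback keeps attention quality acceptable is an open question here; the sketch only shows that every query can be given a non-empty set of candidate keys.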
The LSH attention is considerably slower than the vanilla full attention; around 8 times slower in my experiments (with sequences shorter than 512 time steps).
I know LSH is supposed to be slower than vanilla attention since there's all the bucketing overhead, and for smaller sequences we can use full attention. But I wonder if we can still get some improvements that will affect running times on longer sequences.
@lucidrains if you believe there's not much to optimize, then please just close this issue.
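A rough back-of-envelope count may help explain why LSH attention loses on short sequences. The sketch assumes each query scores the keys in its own chunk and the previous chunk, once per hash round, and ignores hashing and sorting overhead entirely, so the real break-even point is probably later than this estimate. The function names are illustrative only.

```python
def full_attention_scores(seq_len):
    # Every query is scored against every key.
    return seq_len * seq_len


def lsh_attention_scores(seq_len, bucket_size=64, n_hashes=8):
    # Assumption: own chunk plus previous chunk, repeated for each hash round.
    return seq_len * 2 * bucket_size * n_hashes


def crossover_length(bucket_size=64, n_hashes=8):
    # Sequence length at which both counts are equal.
    return 2 * bucket_size * n_hashes


for n in (512, 4096, 65536):
    ratio = lsh_attention_scores(n) / full_attention_scores(n)
    print(f"seq_len={n}: LSH/full score count = {ratio:.3f}")
```

With bucket_size = 64 and n_hashes = 8, this count alone puts the crossover at 1024 tokens, which is consistent with LSH being slower below 512 time steps.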
I noticed there is no dropout in the feedforward layers, nor after the attention. Is there any reason for that?
Thank you for putting in the time to do this. I have a bunch of ideas for it.
I crudely ported your example training script to use the pytorch-lightning library, and when I attempted to use distributed data parallel I ran into a crash. The problem may be down in the revtorch library, but I want to hand the script off to you so you can take a look and decide where the issue is.
You can reproduce the crash by supplying the --distributed flag to the script with any number of GPUs.
Epoch 1: 0%| | 0/1451 [00:00<?, ?batch/s]Traceback (most recent call last):
File "example/train_lightning.py", line 166, in <module>
main()
File "example/train_lightning.py", line 161, in main
trainer.fit(model)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 687, in fit
mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 331, in ddp_train
self.run_pretrain_routine(model)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 829, in run_pretrain_routine
self.train()
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 332, in train
self.run_training_epoch()
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 386, in run_training_epoch
output = self.run_training_batch(batch, batch_idx)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 506, in run_training_batch
loss = optimizer_closure()
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 489, in optimizer_closure
model_ref.backward(self.use_amp, closure_loss, optimizer)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/core/hooks.py", line 154, in backward
loss.backward()
File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 77, in apply
return self._forward_cls.backward(self, *args)
File "/opt/conda/lib/python3.6/site-packages/revtorch/revtorch.py", line 161, in backward
y, dy = ctx.reversible_blocks[i].backward_pass(y, dy)
File "/opt/conda/lib/python3.6/site-packages/revtorch/revtorch.py", line 89, in backward_pass
gy1.backward(dy2)
File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Expected to mark a variable ready only once. This error is caused by use of a module parameter outside the `forward` function. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. If this is the case, it knows they won't receive gradients in a backward pass. If any of those parameters are then used outside `forward`, this error condition is triggered. You can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.
script:
from reformer_pytorch import ReformerLM
import tqdm
import gzip
import numpy as np
import torch.optim as optim
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import Trainer
import os
import torch
from torch import nn
from torchvision import transforms
import argparse
import pytorch_lightning as pl
# constants
NUM_BATCHES = int(1e5)
BATCH_SIZE = 4
GRADIENT_ACCUMULATE_EVERY = 4
LEARNING_RATE = 1e-4
VALIDATE_EVERY = 100
SEQ_LEN = 4096
# helpers
def cycle(loader):
    while True:
        for data in loader:
            yield data
with gzip.open('./data/enwik8.gz') as file:
    X = np.fromstring(file.read(int(95e6)), dtype=np.uint8)
    trX, vaX = np.split(X, [int(90e6)])
    data_train, data_val = torch.from_numpy(trX), torch.from_numpy(vaX)
class TextSamplerDataset(Dataset):
    def __init__(self, data, seq_len):
        super().__init__()
        self.data = data
        self.seq_len = seq_len
    def __getitem__(self, index):
        rand_start = torch.randint(0, self.data.size(0) - self.seq_len - 1, (1,))
        full_seq = self.data[rand_start: rand_start + self.seq_len + 1].long()
        return full_seq[0:-1], full_seq[1:]
    def __len__(self):
        return self.data.size(0) // self.seq_len
class ReformerTrainer(pl.LightningModule):
    def __init__(self, batch_size=4, distributed_mode=False):
        super(ReformerTrainer, self).__init__()
        self.batch_size = batch_size
        self.distributed_mode = distributed_mode
        # instantiate model
        self.model = ReformerLM(
            emb = 512,
            depth = 6,
            max_seq_len = SEQ_LEN,
            num_tokens = 256,
            heads = 8,
            bucket_size = 64,
            n_hashes = 4,
            ff_chunks = 10,
            lsh_dropout = 0.1,
            weight_tie = True,
            causal = True,
            use_full_attn = False # set this to true for comparison with full attention
        )
    def forward(self, x):
        pred = self.model(x).transpose(1, 2)
        return pred
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y, reduction='mean')
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}
    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}
    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': F.cross_entropy(y_hat, y)}
    def test_end(self, outputs):
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        tensorboard_logs = {'test_loss': avg_loss}
        return {'avg_test_loss': avg_loss, 'log': tensorboard_logs}
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=LEARNING_RATE)
    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        dataset = TextSamplerDataset(data_train, SEQ_LEN)
        if self.distributed_mode:
            dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
            dataloader = DataLoader(dataset, sampler=dist_sampler, batch_size=self.batch_size)
        else:
            dataloader = DataLoader(dataset, batch_size=self.batch_size)
        return dataloader
    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        dataset = TextSamplerDataset(data_val, SEQ_LEN)
        if self.distributed_mode:
            dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
            dataloader = DataLoader(dataset, sampler=dist_sampler, batch_size=self.batch_size)
        else:
            dataloader = DataLoader(dataset, batch_size=self.batch_size)
        return dataloader
    @pl.data_loader
    def test_dataloader(self):
        dataset = TextSamplerDataset(data_val, SEQ_LEN)
        if self.distributed_mode:
            dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
            dataloader = DataLoader(dataset, sampler=dist_sampler, batch_size=self.batch_size)
        else:
            dataloader = DataLoader(dataset, batch_size=self.batch_size)
        return dataloader
def main():
    parser = argparse.ArgumentParser("reformer-lightning example")
    parser.add_argument("--gpus", default=1, help="gpus to use")
    parser.add_argument("-d", "--distributed", default=False, action="store_true",
                        help="activates distributed using data distributed parallel")
    parser.add_argument("-b", "--batch_size", type=int, default=4, help="batch_size")
    args = parser.parse_args()
    model = ReformerTrainer(args.batch_size, args.distributed)
    # most basic trainer, uses good defaults
    if args.distributed:
        trainer = Trainer(gpus=args.gpus, distributed_backend='ddp', accumulate_grad_batches=GRADIENT_ACCUMULATE_EVERY)
    else:
        trainer = Trainer(gpus=args.gpus, distributed_backend='dp', accumulate_grad_batches=GRADIENT_ACCUMULATE_EVERY)
    trainer.fit(model)
    trainer.test()
if __name__ == "__main__":
    main()
Hello! Thank you for this repo!
Can you please implement sampling in the example file, so we'd be able to generate outputs from the model (as done in GPT-2)?
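For readers wondering what such sampling could look like, here is a minimal top-k sampling loop in plain Python. It is a sketch, not code from this repository: `next_logits` stands in for any callable that maps the tokens generated so far to a list of logits for the next token (for example, a thin wrapper around a trained language model), and both function names are hypothetical.

```python
import math
import random


def sample_top_k(logits, k=5, temperature=1.0, rng=random):
    # Keep the k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits (shifted by the max for numerical stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_logits, prompt, steps, **kwargs):
    # Autoregressive loop: append one sampled token per step.
    seq = list(prompt)
    for _ in range(steps):
        seq.append(sample_top_k(next_logits(seq), **kwargs))
    return seq


# Toy "model" that always prefers token 2.
print(generate(lambda seq: [0.0, 1.0, 5.0], prompt=[0], steps=3, k=2))
```

Setting k=1 reduces this to greedy decoding; larger k and temperature give more varied output.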
Hi, can you release an autoregressive example first? Such as a reformer language model example.
So we can test it in an all-round way.
Thanks much.
I have noticed that in the original Transformer paper, the decoder blocks alternate between masked self-attention and regular attention over the outputs of the encoder.
This allows the model to attend to the previously generated words before choosing which parts of the encoder output to pay attention to.
Please, correct me if I'm wrong, but in the Reformer module and ReformerLM modules of this repository, I believe that if the encoder outputs are passed as an extra input to the decoder (with the "keys" parameters, like in the translation example), the model will only apply attention to it, and will not perform any self-attention on the decoder input, so every step of the decoder the model is only able to use the current word as query, and is not able to look at the previously generated words, right?
And what's more, the "causal" flag will apply a mask on the attention from the encoder keys, so when the model generates the word in position N we only pay attention to the Keys in position < N
Would it make sense to allow the choice to combine self-attention (k == v == q), where the mask should be applied if we want it to be causal, with regular attention(q != k == v) using the passed keys, where causality no longer makes a lot of sense because we might want to be able to focus on a word at the end of the sentence if the language has a different word ordering?
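The distinction drawn above between the two attention types can be made concrete by looking at their masks. The sketch below is plain Python and purely illustrative (the helper names are hypothetical): decoder self-attention uses a lower-triangular causal mask, while attention over encoder keys only needs to hide encoder padding and has no positional restriction.

```python
def causal_self_attention_mask(n_dec):
    # Row q, column k: decoder position q may attend to positions k <= q.
    return [[k <= q for k in range(n_dec)] for q in range(n_dec)]


def cross_attention_mask(n_dec, enc_padding_mask):
    # Every decoder position may attend to every real (non-pad) encoder position.
    return [list(enc_padding_mask) for _ in range(n_dec)]


for row in causal_self_attention_mask(3):
    print(row)
for row in cross_attention_mask(2, [True, True, False]):
    print(row)
```

Under this view, applying the causal mask to encoder keys would hide source words that appear later in the sentence, which is the concern raised above.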
Hi,
In the sequence → sequence example:
x = torch.randint(0, 20000, (1, DE_SEQ_LEN)).long().cuda()
yi = torch.randint(0, 20000, (1, EN_SEQ_LEN)).long().cuda()
enc_keys = encoder(x)
yo = decoder(yi, keys = enc_keys)
what is yi? I assume that the decoder only needs the enc_keys as the output from the encoder?
Thanks.
stotal_hashes should be total_hashes.
Thanks for the library! I'm pretty new to NLP, and I'm trying to adapt Reformer to do BERT-like masked language pretraining for long sequences.
Is it as simple as setting causal=False in the ReformerLM class?
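Beyond a non-causal model, masked language pretraining also needs inputs with some tokens hidden and labels that are scored only at those positions. The following is a simplified sketch of that data step in plain Python: it always substitutes the mask token (BERT additionally keeps or randomizes some selected tokens), `mask_tokens` and `mask_id` are hypothetical names, and -100 is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random


def mask_tokens(ids, mask_id, ignore_index=-100, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modelling."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)       # hide the token from the model
            labels.append(tok)           # and ask the model to predict it
        else:
            inputs.append(tok)
            labels.append(ignore_index)  # position does not contribute to the loss
    return inputs, labels


inputs, labels = mask_tokens(list(range(10, 30)), mask_id=1, rng=random.Random(0))
print(inputs)
print(labels)
```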
Hey @lucidrains - thanks for your amazing implementation - it's super cool - I learned so much from it!! I just had the following questions about reformer_pytorch/reformer_pytorch.py -
Thanks again for your work,
Ankit
Testing around with simple next token prediction. Sample data is separate sentences with padding. Is there a way to apply an input mask for the paddings?
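As an illustration of what such a padding mask looks like, the sketch below pads variable-length sentences to a common length and builds a matching boolean mask in plain Python. `pad_batch` is a hypothetical helper; converting the result to a boolean tensor and passing it to the model as an input mask is left out so the example stays dependency-free.

```python
def pad_batch(sentences, pad_id=0):
    # Pad every sentence to the longest one; mask is True on real tokens.
    max_len = max(len(s) for s in sentences)
    ids = [list(s) + [pad_id] * (max_len - len(s)) for s in sentences]
    mask = [[True] * len(s) + [False] * (max_len - len(s)) for s in sentences]
    return ids, mask


ids, mask = pad_batch([[4, 5, 6], [7], [8, 9]])
print(ids)
print(mask)
```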
Hi there, I've pre-trained a Reformer for 4 days on 500MB of text data, just to try how it works. Now I'm trying to use it for fine-tuning and each epoch is taking a huge amount of time. I'm using a nice GPU (the one you were jealous about :P ) but it's still taking too long, as you can see below. There's no comparison with a normal BERT, for example, which needs only a couple of seconds for fine-tuning, while this one is taking hours.
EPOCH: 0%| | 0/40 [00:00<?, ?it/s]
Training epoch 0: 0%| | 0/1041 [00:00<?, ?it/s]
Training epoch 0: 0%| | 1/1041 [00:13<3:46:44, 13.08s/it]
Training epoch 0: 0%| | 2/1041 [00:24<3:39:14, 12.66s/it]
Training epoch 0: 0%| | 3/1041 [00:36<3:33:28, 12.34s/it]
Training epoch 0: 0%| | 4/1041 [00:48<3:31:05, 12.21s/it]
Training epoch 0: 0%| | 5/1041 [01:00<3:29:03, 12.11s/it]
Training epoch 0: 1%| | 6/1041 [01:11<3:26:42, 11.98s/it]
Training epoch 0: 1%| | 7/1041 [01:23<3:24:39, 11.88s/it]
Training epoch 0: 1%| | 8/1041 [01:35<3:25:09, 11.92s/it]
Training epoch 0: 1%| | 9/1041 [01:46<3:22:59, 11.80s/it]
Training epoch 0: 1%| | 10/1041 [01:58<3:23:07, 11.82s/it]
Training epoch 0: 1%| | 11/1041 [02:11<3:25:52, 11.99s/it]
Training epoch 0: 1%| | 12/1041 [02:23<3:25:39, 11.99s/it]
Training epoch 0: 1%| | 13/1041 [02:34<3:21:48, 11.78s/it]
Training epoch 0: 1%| | 14/1041 [02:46<3:23:27, 11.89s/it]
Training epoch 0: 1%| | 15/1041 [02:57<3:19:09, 11.65s/it]
Training epoch 0: 2%| | 16/1041 [03:10<3:22:35, 11.86s/it]
Training epoch 0: 2%| | 17/1041 [03:22<3:22:47, 11.88s/it]
Training epoch 0: 2%| | 18/1041 [03:33<3:22:16, 11.86s/it]
Training epoch 0: 2%| | 19/1041 [03:45<3:23:15, 11.93s/it]
Training epoch 0: 2%| | 20/1041 [03:57<3:20:54, 11.81s/it]
Training epoch 0: 2%| | 21/1041 [04:09<3:19:35, 11.74s/it]
Training epoch 0: 2%| | 22/1041 [04:21<3:22:12, 11.91s/it]
Training epoch 0: 2%| | 23/1041 [04:32<3:20:29, 11.82s/it]
Training epoch 0: 2%| | 24/1041 [04:44<3:16:36, 11.60s/it]
Training epoch 0: 2%| | 25/1041 [04:56<3:18:51, 11.74s/it]
Training epoch 0: 2%| | 26/1041 [05:07<3:17:10, 11.66s/it]
Training epoch 0: 3%| | 27/1041 [05:18<3:15:37, 11.58s/it]
Training epoch 0: 3%| | 28/1041 [05:30<3:15:43, 11.59s/it]
Training epoch 0: 3%| | 29/1041 [05:42<3:16:18, 11.64s/it]
Training epoch 0: 3%| | 30/1041 [05:54<3:16:54, 11.69s/it]
Training epoch 0: 3%| | 31/1041 [06:05<3:12:38, 11.44s/it]
Training epoch 0: 3%| | 32/1041 [06:16<3:11:49, 11.41s/it]
Training epoch 0: 3%| | 33/1041 [06:27<3:11:52, 11.42s/it]
Training epoch 0: 3%| | 34/1041 [06:39<3:13:15, 11.51s/it]
Training epoch 0: 3%| | 35/1041 [06:50<3:10:34, 11.37s/it]
Training epoch 0: 3%| | 36/1041 [07:02<3:12:29, 11.49s/it]
Training epoch 0: 4%| | 37/1041 [07:13<3:11:37, 11.45s/it]
Training epoch 0: 4%| | 38/1041 [07:24<3:09:23, 11.33s/it]
Training epoch 0: 4%| | 39/1041 [07:36<3:09:00, 11.32s/it]
Training epoch 0: 4%| | 40/1041 [07:47<3:09:20, 11.35s/it]
Training epoch 0: 4%| | 41/1041 [07:58<3:08:17, 11.30s/it]
Do you know which may be the problem? I've created this class for NER:
class ReformerForTokenClassification(nn.Module):
    def __init__(self, num_labels, model_dim, depth,
                 n_tokens, maxlen, heads, weights_file, n_hashes, dropout=0.2):
        super(ReformerForTokenClassification, self).__init__()
        self.num_labels = num_labels
        self.model_dim = model_dim
        self.reformer = ReformerLM(n_tokens, model_dim, depth, maxlen, heads,
                                   n_hashes, return_embeddings=True)
        model_dict = self.reformer.state_dict()
        pretrained_dict = torch.load(weights_file)
        weights_dict = {k:v for k, v in pretrained_dict.items() if 'to_logits' not in k}
        self.reformer.load_state_dict(weights_dict)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.model_dim, self.num_labels)
    def forward(self, input_ids=None, labels=None):
        outputs = self.reformer(input_ids)
        sequence_output = self.dropout(outputs)
        logits = self.classifier(sequence_output)
        outputs = (logits, outputs[2:])
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss, outputs[0], outputs[1])
        return outputs
model = ReformerForTokenClassification(num_labels=9, model_dim=768, depth=12, maxlen=512, n_tokens=tokenizer.vocab_size,
                                       heads=8, n_hashes=4, weights_file='ckpts_pequeño_oscar/model_state_dict.pt')
I did a few training runs of a simple Reformer module with different parameters and logged the GPU memory usage.
Of course, depending on your machine or other things these values can vary, but I thought it might be useful as a visual guide:
dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 1: 452 MB
dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 8: 992 MB
dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 16: 1584 MB
dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 32: 2866 MB
dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 64: 4606 MB
dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 128: 9788 MB
dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 1: 538 MB
dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 8: 1580 MB
dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 16: 2870 MB
dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 32: 4582 MB
dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 64: 9276 MB
dim = 512, seq_len = 1024, depth = 1, heads = 1, batch_size = 1: 682 MB
dim = 512, seq_len = 1024, depth = 1, heads = 1, batch_size = 8: 2904 MB
dim = 512, seq_len = 1024, depth = 1, heads = 1, batch_size = 16: 4634 MB
dim = 512, seq_len = 1024, depth = 1, heads = 1, batch_size = 32: 9310 MB
dim = 512, seq_len = 2048, depth = 1, heads = 1, batch_size = 1: 992 MB
dim = 512, seq_len = 2048, depth = 1, heads = 1, batch_size = 8: 4644 MB
dim = 512, seq_len = 2048, depth = 1, heads = 1, batch_size = 16: 9256 MB
dim = 512, seq_len = 4096, depth = 1, heads = 1, batch_size = 1: 1602 MB
dim = 512, seq_len = 4096, depth = 1, heads = 1, batch_size = 8: 8810 MB
dim = 512, seq_len = 4096, depth = 1, heads = 1, batch_size = 10: 10976 MB
dim = 512, seq_len = 8192, depth = 1, heads = 1, batch_size = 1: 2884 MB
dim = 512, seq_len = 8192, depth = 1, heads = 1, batch_size = 5: 11396 MB
dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 8: 992 MB
dim = 512, seq_len = 256, depth = 2, heads = 1, batch_size = 8: 1054 MB
dim = 512, seq_len = 256, depth = 4, heads = 1, batch_size = 8: 1142 MB
dim = 512, seq_len = 256, depth = 6, heads = 1, batch_size = 8: 1220 MB
dim = 512, seq_len = 256, depth = 12, heads = 1, batch_size = 8: 1512 MB
dim = 512, seq_len = 256, depth = 24, heads = 1, batch_size = 8: 2056 MB
dim = 512, seq_len = 256, depth = 24, heads = 1, batch_size = 16: 2680 MB
dim = 128, seq_len = 256, depth = 12, heads = 1, batch_size = 8: 566 MB
dim = 128, seq_len = 256, depth = 12, heads = 2, batch_size = 8: 576 MB
dim = 128, seq_len = 256, depth = 12, heads = 4, batch_size = 8: 616 MB
dim = 128, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 732 MB
dim = 128, seq_len = 256, depth = 12, heads = 16, batch_size = 8: 1000 MB
dim = 32, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 644 MB
dim = 64, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 670 MB
dim = 128, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 732 MB
dim = 256, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 918 MB
dim = 512, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 1516 MB
dim = 1024, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 3552 MB
dim = 512, seq_len = 4096, depth = 6, heads = 8, batch_size = 8: 9672 MB
dim = 128, seq_len = 4096, depth = 12, heads = 8, batch_size = 8: 6270 MB
dim = 512, seq_len = 8192, depth = 12, heads = 8, batch_size = 1: 3628 MB
dim = 512, seq_len = 8192, depth = 12, heads = 8, batch_size = 4: 10048 MB
dim = 128, seq_len = 1024, depth = 6, heads = 4, batch_size = 32: 4608 MB
dim = 128, seq_len = 1024, depth = 6, heads = 4, batch_size = 64: 8052 MB
dim = 128, seq_len = 1024, depth = 6, heads = 4, batch_size = 80: 9990 MB
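These measurements can be used for rough capacity planning. As a sketch, the snippet below fits a straight line through two of the readings for dim = 512, seq_len = 256, depth = 1, heads = 1 (batch sizes 1 and 8) and compares its extrapolation with the measured values at batch sizes 16 and 32. The linear model is an assumption made for illustration; the larger batch sizes in the list suggest growth is not perfectly linear.

```python
def fit_line(p0, p1):
    # Straight line through two (batch_size, MB) points: returns (slope, intercept).
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    return slope, y0 - slope * x0


# (batch_size -> MB) readings copied from the list above.
measured = {1: 452, 8: 992, 16: 1584, 32: 2866}
slope, intercept = fit_line((1, measured[1]), (8, measured[8]))


def predict(batch_size):
    return intercept + slope * batch_size


for bs in (16, 32):
    print(f"batch_size={bs}: predicted {predict(bs):.0f} MB, measured {measured[bs]} MB")
```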
Hey I am planning to start building a codebase to both train and evaluate this model on the GLUE tasks.
Are you working on this already? If not, I can write some code and make a pull request.
Hi, I've been working with transformer models for a couple of months using the transformers library (https://huggingface.co/transformers/index.html), and I want to try your ReformerLM to see if I can train a good language model for Spanish using this new technology and your library.
First of all, thank you for developing this library and implementing the Reformer, since the code provided by Google together with the paper was not very "usable": it wasn't organized into parametric classes.
The thing is, I don't understand what kind of tokenization you're using in this model. The example doesn't seem to have any tokenization step, and there is no tokenizer class for training your own tokenizer with BPE or another method. Maybe I'm misunderstanding something about the model, but I'd like to know how you handle this when feeding inputs to the model.
Thank you in advance for your response. If I manage to make the ReformerLM work, I'll try to build a GPT-2-style generator based on this architecture instead of the Transformer architecture, so that we can expand this library. Regards, Alejandro.
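One reading of the enwik8 training script quoted earlier on this page is that it needs no tokenizer because it works on raw bytes: the file is loaded as uint8 values and the model is created with num_tokens = 256. A byte-level encode/decode pair, sketched here in plain Python with hypothetical helper names, shows what that implies for non-ASCII text such as Spanish.

```python
def encode(text):
    # Each UTF-8 byte becomes one token id in the range 0..255.
    return list(text.encode("utf-8"))


def decode(ids):
    # Invalid byte sequences (possible in sampled output) are replaced, not raised.
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode("año")
print(ids, decode(ids))
```

Accented characters take two bytes each, so byte-level sequences are somewhat longer than the character count; a subword tokenizer such as BPE would be an alternative with a larger vocabulary.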
hi, I wonder if 't' should be 'kv_len'?
return v.view(b, h, t, -1).transpose(1, 2).contiguous()
https://github.com/lucidrains/reformer-pytorch/blob/fde8efefe34ce30cf872e67b86e268b103d5a49b/reformer_pytorch/reformer_pytorch.py#L472
Can you please explain this line ?
It seems you add extra 128 units to the input sequence why is that?
It would be great to publish your results on
https://paperswithcode.com/sota
It would increase awareness and use of your paper.
When I tried the example "A full Reformer image → caption", I found that the encoder was wrong, because the Reformer class has no arguments axial_position_emb = True, axial_position_shape = (32, 32), axial_position_dims = (256, 256), while ReformerLM has them. Please help me verify this issue.
First of all, the deepspeed implementation is awesome! I trained on 4 V100s and got an 8.5X boost, and 20X with fp16 turned on, compared to just one GPU.
I trained a model on 300MB dialogue dataset for 2 epochs but the generated samples weren't good. I'm quite sure I messed up with the code somehow since I come from a programming background and not ML.
Here's my code: https://pastebin.com/V1t5Ctg7
lr = 0.0004, bs=32, vocab_size=2000
Here are some samples: https://pastebin.com/yCL0vVdv
From my experiments with other architectures (GPT-2 from scratch, LSTM), it should generate decent samples after feeding this data so something must be wrong somewhere.
Thanks for the cool library!
I'm working on a seq2seq demo using it, and I'd like to visualize the attention weights, but it isn't clear how to get them out of the ReformerLM class. Can you point me in the right direction?
When are we going to see pretrained reformer models? I don't have the compute or the dataset for doing it myself - but this seems to be strictly better of a technique for training NLP models than previous transformers
Hi, first of all thanks for the great work!
I was wondering how the GLUE example is supposed to do any classification. There is no classification head added anywhere in the example for the classification tasks. From what I see, the example simply takes the argmax(-1) of the [batch, sequence_length, number_of_tokens] output of the ReformerLM, which is nonsense, isn't it? I would expect to set (1) return_embeddings = True and (2) causal=False, then take one output token, i.e. [:,0,:], and add maybe two Linear layers on top of that like BERT does, or average over all the tokens, since our task is classification.
Are my assumptions right, or am I getting something wrong?
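The two pooling options mentioned above (first token versus average over tokens) can be sketched without any framework. The helpers below are hypothetical and operate on one sequence of per-token embedding vectors; a classification head would then be applied to the pooled vector.

```python
def first_token_pool(embeddings):
    # Use the embedding at position 0, as BERT does with its [CLS] token.
    return list(embeddings[0])


def masked_mean_pool(embeddings, mask):
    # Average only over real tokens so that padding does not dilute the result.
    kept = [vec for vec, keep in zip(embeddings, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


emb = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
print(first_token_pool(emb))
print(masked_mean_pool(emb, [True, True, False]))
```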
Hi @lucidrains
I'm currently testing the generate function of the TrainingWrapper class.
When I use DeepSpeed and I try to generate a sequence it gives me the following error:
AttributeError: 'DeepSpeedLight' object has no attribute 'generate'
Is it because Generation can only be done outside DeepSpeed Engine?
Thank you very much, once again! :)
Thanks so much for this implementation! I suspect it will be very helpful for the community.
I'm trying to use the ReformerLM with return_embeddings=True and it looks like it's effectively a no-op. When I try your Colab with return_embeddings=True, model.out is just Identity() and no embedding is returned from the forward pass.
Am I using this option incorrectly or is it still a TODO?
Hi @lucidrains
I updated the code to use your implementation of the EncDec architecture, but I ran out of memory when I set the input_mask and the context_mask accordingly in order to mask the Pad indexes.
In the previous implementation where I used this:
encoder = TrainingWrapper(encoder, ignore_index=PAD_IDX).cuda()
decoder = TrainingWrapper(decoder, ignore_index=PAD_IDX).cuda()
encoder_engine, encoder_optimizer, trainloader, _ = deepspeed.initialize(args=cmd_args, model=encoder, optimizer=encoder_optimizer, model_parameters=encoder_params, training_data=train_dataset, dist_init_required=True)
decoder_engine, decoder_optimizer, _, _ = deepspeed.initialize(args=cmd_args, model=decoder, optimizer=decoder_optimizer, model_parameters=encoder_params, dist_init_required=False)
for src, trg in dataset:
    encoder_engine.train()
    decoder_engine.train()
    src = src.to(encoder_engine.local_rank)
    trg = trg.to(decoder_engine.local_rank)
    enc_keys = encoder_engine(src)
    loss = decoder_engine(trg, keys = enc_keys, return_loss = True)
    loss.backward()
    decoder_engine.step()
    encoder_engine.step()
instead of this:
enc_dec_engine, enc_dec_optimizer, trainloader, _ = deepspeed.initialize(args=cmd_args, model=enc_dec, optimizer=enc_dec_optimizer, model_parameters=enc_dec_params, training_data=train_dataset)
for src, trg in dataset:
    enc_dec_engine.train()
    src = src.to(enc_dec_engine.local_rank)
    trg = trg.to(enc_dec_engine.local_rank)
    enc_input_mask = torch.tensor([[1 for idx in smpl if idx != PAD_IDX] for smpl in src]).bool().to(device)
    context_mask = torch.tensor([[1 for idx in smpl if idx != PAD_IDX] for smpl in trg]).bool().to(device)
    loss = enc_dec(src, trg, return_loss = True, enc_input_mask = enc_input_mask, context_mask=context_mask)
    loss.backward()
    enc_dec_engine.step()
I didn't run out of memory, and I was assuming that the loss was computed excluding the pad index. Did I make a mistake? What is the best way to ignore the pad index?
Does using the input_mask have the same outcome as setting the pad index to ignore in the TrainingWrapper?
If they are equivalent, is there a way to use ignore_index in the training wrapper instead of masking, to save some memory in your EncDec implementation as well?
Thank you in advance,
Cal
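One detail in the mask-building lines above may be worth checking, independent of the memory question. The comprehension `[1 for idx in smpl if idx != PAD_IDX]` filters pad positions out instead of marking them, so each row ends up only as long as its number of real tokens. The plain-Python sketch below contrasts that pattern with an elementwise comparison (the tensor equivalent would be `src != PAD_IDX`); the PAD_IDX value and helper names are assumptions for illustration.

```python
PAD_IDX = 0  # assumed padding id for this example


def filtered_mask(batch, pad_idx=PAD_IDX):
    # Pattern from the snippet above: pad positions are dropped, rows become ragged.
    return [[1 for tok in row if tok != pad_idx] for row in batch]


def padding_mask(batch, pad_idx=PAD_IDX):
    # One boolean per position, same shape as the batch.
    return [[tok != pad_idx for tok in row] for row in batch]


batch = [[5, 6, 0, 0], [7, 8, 9, 0]]
print([len(row) for row in filtered_mask(batch)])
print(padding_mask(batch))
```

Note also that an ignore index in the loss and an attention mask do different jobs: the first removes pad positions from the loss only, while the second stops real tokens from attending to padding.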
Hi there, I'm trying to pretrain a ReformerLM for Spanish on a single Nvidia P100 16GB GPU, and even when restricting the embedding dimension, the number of heads, etc., I still get a memory error. I'm using the script in https://github.com/lucidrains/reformer-pytorch/blob/master/pretraining/self-supervised.py for that, and my configuration is the following:
tokenizer.max_len=128
model = ReformerLM(
    num_tokens=tokenizer.vocab_size,
    dim=128,
    depth=1,
    heads=1,
    max_seq_len=tokenizer.max_len,
    causal=True,
    n_hashes=2,
    ff_chunks=10000
)
trainer = ReformerTrainer(dataset, model, tokenizer, train_batch_size=1, eval_batch_size=1)
I've reduced the number of hashes, the max len, I've increased the ff_chunks... I've tried everything that's supposed to reduce the memory usage, but it's still not working. Have you been able to make the code in the link above work? @lucidrains If so, please tell me how... Just in case, my GPU is completely free before I start training, and the trainer tries to use about 19GB of memory...
Greetings! I've found this repo very useful, flexible and easy to use. Thanks for putting it out. I've been playing with it for a text generation problem.
I want to generate a reply given the previous history of a conversation. Here's how I'm encoding the sequence:
<bos><speaker1>Hello, how are you?<speaker2>Great! What about you?<eos>
As can be seen from the enwik8 example, for the input we need to drop the last token from the sequence, and for the target we take the second token through the last token. So for the above example:
inp: <bos><speaker1>Hello, how are you?<speaker2>Great! What about you?
targets: <speaker1>Hello, how are you?<speaker2>Great! What about you?<eos>
I'm calculating the loss only on the last portion, i.e. on Great! What about you?<eos>.
This is unlike other models, e.g. GPT-2 or the Trax implementation of Reformer, where you just feed the same sequence as input and targets and the model handles the rest.
When I trained the model with the above encoding, it only generated <eos> tokens. So I removed the <eos> token and trained again, but then the last token was always some punctuation, so the model only generated punctuation.
Is this really an issue, or am I doing something wrong? Also, can we make it more like the Trax implementation, where we just feed the same sequence for both input and targets?
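The shifting and reply-only loss described above can be written down compactly. The sketch below is plain Python over string tokens, with hypothetical helper names; it shows the input/target shift and a boolean mask that selects only the targets after the last `<speaker2>` marker, which is where the loss would be applied.

```python
def shift_for_lm(tokens):
    # Input drops the last token; target drops the first.
    return tokens[:-1], tokens[1:]


def reply_loss_mask(targets, reply_marker):
    # True only for target positions after the last occurrence of reply_marker.
    last = max(i for i, tok in enumerate(targets) if tok == reply_marker)
    return [i > last for i in range(len(targets))]


tokens = ["<bos>", "<speaker1>", "Hello", "<speaker2>", "Great", "<eos>"]
inp, tgt = shift_for_lm(tokens)
print(inp)
print(tgt)
print(reply_loss_mask(tgt, "<speaker2>"))
```

A wrapper that accepts one sequence and performs this shift internally would give the "same sequence for input and targets" interface mentioned above.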
Thanks for your great job!
When I test this model with the following code:
import torch
from reformer_pytorch import ReformerLM
from torch.nn import functional as F
model = ReformerLM(
    num_tokens=20000,
    dim=1024,
    depth=24,
    max_seq_len=1024,
    heads=16,
    lsh_dropout=0.1,
    emb_dim=1024, # embedding factorization for further memory savings
    causal=True, # auto-regressive or not
    bucket_size=64, # average size of qk per bucket, 64 was recommended in paper
    n_hashes=8, # 4 is permissible per author, 8 is the best but slower
    ff_chunks=200, # number of chunks for feedforward layer, make higher if there are memory issues
    weight_tie=False, # tie parameters of each layer for no memory per additional depth
    attn_chunks=8, # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
    num_mem_kv=0, # persistent learned memory key values, from all-attention paper
    twin_attention=False, # both branches of the reversible network will be attention
    use_full_attn=True, # use full self attention, for comparison
    full_attn_thres=128, # use full attention if context length is less than set value
    use_scale_norm=False # use scale norm from 'Transformers without tears' paper
).cuda()
model = torch.nn.DataParallel(model)
model.train()
x = torch.randint(0, 20000, (8, 1024)).long().cuda()
y = torch.randint(0, 20000, (8, 1024)).long().cuda()
pred = model(x)
loss = F.cross_entropy(pred.transpose(1, 2), y, reduction='mean')
loss.backward()
import ipdb
ipdb.set_trace()
Without model = torch.nn.DataParallel(model), 7616M of memory is used.
But after I add model = torch.nn.DataParallel(model), it runs out of memory (OOM), even though each of the 8 GPUs has 16 GB of memory.
I think it may be a problem with revtorch?
I'm running the example script (with no change) on an Nvidia V100 and the training seems to go very slowly. Each batch takes a few seconds (≈ 3.2 sec). What could be the problem?
EDIT: After some comparisons with other models, it doesn't seem to be relatively slow. But I still wanted to know other people's experiences.
Greetings! Your repository is a very welcome contribution. I tried to follow the examples in this repo but ran into some problems. While trying to modify enwik8_simple, I didn't understand how to:
Thanks a lot!
I did a parameter change test similar to the one offered in torchtest with a pytorch_reformer model (using reformer_pytorch==0.12.7).
I got the following result:
utils.VariablesChangeException: #OK: 31 #wrong: 2 Parameters:
E reformer.reformer.layers.reversible_blocks.0.f_block.fn.fn.mem_kv
E reformer.reformer.layers.reversible_blocks.1.f_block.fn.fn.mem_kv
indicating that the mem_kv parameters are not updated during an optimizer step.
I looked at the code of LSHAttention and see this line:
keys = default(keys, torch.empty(b, 0, e, dtype=mem.dtype, device=device))
I think that the call to torch.empty should include the requires_grad=True parameter.
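A check like the one described above can be approximated without torchtest by looking at gradients after a single backward pass. This is only a sketch: `TinyModel` and `params_without_grad` are hypothetical stand-ins, not part of reformer_pytorch, and the `unused` parameter simply models a parameter that never takes part in the forward pass.

```python
import torch
from torch import nn


class TinyModel(nn.Module):  # hypothetical stand-in, not the Reformer
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        # Registered as a parameter but never touched in forward().
        self.unused = nn.Parameter(torch.zeros(4))

    def forward(self, x):
        return self.used(x)


def params_without_grad(model):
    """Names of parameters whose gradient is missing or all zero after backward()."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is None or p.grad.abs().sum() == 0
    ]


torch.manual_seed(0)
model = TinyModel()
model(torch.randn(2, 4)).sum().backward()
stale = params_without_grad(model)
print(stale)
```

A parameter listed here will not be changed by an optimizer step, which is the symptom the torchtest report describes.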
If I try to pass a simple tensor, that does not require grad, through a Reformer, it won't allow me to do backpropagation.
x = torch.randn(batch_size, 256, 512).cuda()
pred = reformer(x)
loss = criterion(pred, torch.ones_like(pred))
optimizer.zero_grad()
loss.backward()
optimizer.step()
If I set x.requires_grad = True just before passing it to the Reformer model, it works though.
I am not sure why this happens, but it makes me think that the model is not being optimized at all during training :/
Is there any example of question answering on a whole document (like SQuAD on a paragraph)?
Thanks
Mahesh
hi!
The default value of dropout in the __init__ method of LSHAttention is 0, and there is nowhere to change it:
class LSHSelfAttention(nn.Module):
    def __init__(self, emb, heads = 8, bucket_size = 64, n_hashes = 8, causal = False, **kwargs):
        # init position
        self.lsh_attn = LSHAttention(bucket_size=bucket_size, causal=causal, **kwargs)
class Reformer(nn.Module):
    def __init__(self, emb, depth, max_seq_len, num_tokens = 10000, heads = 8, bucket_size = 64, n_hashes = 8, ff_chunks = 100, causal = False, weight_tie = False):
        # Never pass the dropout parameters so the dropout can't be changed
        get_attn = lambda: LSHSelfAttention(emb, heads, bucket_size, n_hashes, causal = causal)
Thanks!
Hi Lucidrains
First of all thanks for the contribution. You are doing an awesome job here.
I'm trying to implement the Seq2Seq model using DeepSpeed since I will have 32k seq_len as input. This is my code:
` CODE:
class GenomeToMolDataset(Dataset):
    def __init__(self, data, src_lang, trg_lang):
        super().__init__()
        self.data = data
        self.src_lang = src_lang
        self.trg_lang = trg_lang
    def __getitem__(self, index):
        #print(index)
        pair = self.data[index]
        #print('src:',pair[0])
        #print('\n\ntrg:',pair[1])
        src = torch.tensor(indexesFromSentence(self.src_lang,pair[0]))
        trg = torch.tensor(indexesFromSentence(self.trg_lang,pair[1]))
        print('src:', src)
        print('trg:', trg)
        return src,trg
    def __len__(self):
        return len(self.data)
train_dataset = GenomeToMolDataset(tr_pairs, input_lang, target_lang)
test_dataset = GenomeToMolDataset(ts_pairs, input_lang, target_lang)
encoder = ReformerLM(
num_tokens = input_lang.n_words,
emb_dim = emb_dim,#128,
dim = dim,#512,
bucket_size = bucket_size, # 16,
depth = depth, # 6,
heads = heads, # 8,
n_hashes= n_hashes,
max_seq_len = VIR_SEQ_LEN,
ff_chunks = ff_chunks, #400, # number of chunks for feedforward layer, make higher if there are memory issues
attn_chunks = attn_chunks, #16, # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
#weight_tie = True,
fixed_position_emb = True,
return_embeddings = True # return output of last attention layer
).cuda()
decoder = ReformerLM(
num_tokens = target_lang.n_words,
emb_dim = emb_dim, # 128,
dim = dim, # 512,
bucket_size = bucket_size, #16,
depth = depth, #6,
heads = heads, #8,
n_hashes= n_hashes,
ff_chunks = ff_chunks, # 400, # number of chunks for feedforward layer, make higher if there are memory issues
attn_chunks = attn_chunks, # 16, # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
max_seq_len = MOL_SEQ_LEN,
fixed_position_emb = True,
causal = True
).cuda()
encoder_optimizer = RangerLars(encoder.parameters()) # torch.optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = RangerLars(decoder.parameters()) # torch.optim.Adam(decoder.parameters(), lr=learning_rate)
if use_apex:
    encoder, encoder_optimizer = amp.initialize(encoder, encoder_optimizer, opt_level='O1')
    decoder, decoder_optimizer = amp.initialize(decoder, decoder_optimizer, opt_level='O1')
encoder = TrainingWrapper(encoder).cuda()
#encoder.cuda()
decoder = TrainingWrapper(decoder).cuda()
#decoder.cuda()
encoder_params = filter(lambda p: p.requires_grad, encoder.parameters())
decoder_params = filter(lambda p: p.requires_grad, decoder.parameters())
encoder_engine, encoder_optimizer, trainloader, _ = deepspeed.initialize(args=cmd_args, model=encoder, optimizer=encoder_optimizer, model_parameters=encoder_params, training_data=train_dataset, dist_init_required=True)
decoder_engine, decoder_optimizer, _, _ = deepspeed.initialize(args=cmd_args, model=decoder, optimizer=decoder_optimizer, model_parameters=decoder_params, dist_init_required=False)
# training
VALIDATE_EVERY = 1
SAVE_EVERY = 10
SAVE_DIR = './saved_model/'
_, encoder_client_sd = encoder_engine.load_checkpoint(SAVE_DIR+'encoder/', None)
_, decoder_client_sd = decoder_engine.load_checkpoint(SAVE_DIR+'decoder/', None) #args.ckpt_id
for i, pair in enumerate(trainloader):
    src = pair[0]
    trg = pair[1]
    encoder_engine.train()
    decoder_engine.train()
    src = src.to(encoder_engine.local_rank)
    trg = trg.to(decoder_engine.local_rank)
    print(src.shape)
    print(src.dtype)
    print(trg.shape)
    print(trg.dtype)
    enc_keys = encoder_engine(src)
    loss = decoder_engine(trg, keys = enc_keys, return_loss = True) # (1, 4096, 20000)
    encoder_engine.backward(loss)
    decoder_engine.backward(loss)
    encoder_engine.step()
    decoder_engine.step()
    print('Training Loss:',loss.item())
    if i % VALIDATE_EVERY == 0:
        encoder.eval()
        decoder.eval()
        with torch.no_grad():
            ts_src,ts_trg = random.choice(test_dataset)[:-1]
            enc_keys = encoder(ts_src.to(device))
            loss = decoder(ts_trg, keys=enc_keys, return_loss = True)
            print(f'\tValidation Loss: {loss.item()}')
    if i % SAVE_EVERY == 0:  # save on every SAVE_EVERY-th step
        encoder_client_sd['step'] = i
        decoder_client_sd['step'] = i
        ckpt_id = loss.item()
        encoder_engine.save_checkpoint(SAVE_DIR+'encoder/', ckpt_id, client_sd = encoder_client_sd)
        decoder_engine.save_checkpoint(SAVE_DIR+'decoder/', ckpt_id, client_sd = decoder_client_sd)`
The issue I'm having is with the nn.Embedding layer: it expects Long integers as input, but DeepSpeed seems to work only with Floats. It raises this error:
RuntimeError: expected device cuda:0 and dtype Float but got device cuda:0 and dtype Long
If I cast the inputs to float, the Embedding layer raises the opposite error.
How can I use your ReformerLM as an encoder-decoder with DeepSpeed in this case? Is there any way I can work around the Embedding issue?
Thank you,
Cal
Hi Author,
Thanks for your work on the Reformer implementation in PyTorch. Could you share the image generation task examples? Thanks.
Hi,
Here's an error when attempting to pull and run train.py from your repo:
$ python3 train.py
train.py:42: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
X = np.fromstring(file.read(int(95e6)), dtype=np.uint8)
Traceback (most recent call last):
File "train.py", line 43, in <module>
trX, vaX = np.split(X, [int(90e6), int(5e6)])
ValueError: too many values to unpack (expected 2)
python3
Python 3.7.5 (default, Nov 20 2019, 09:21:52)
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
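The ValueError above is consistent with how `np.split` treats a list of indices: N split points always yield N + 1 pieces, so two indices cannot be unpacked into two names. The sketch below reproduces this on a small array; the 90/5 sizes are scaled-down stand-ins for the 90e6/5e6 values in train.py, and using a single split point is one possible fix, assuming a 90M/5M train/validation split was intended.

```python
import numpy as np

raw = bytes(range(95))                  # stand-in for file.read(int(95e6))
X = np.frombuffer(raw, dtype=np.uint8)  # frombuffer replaces the deprecated fromstring

# Two indices give three pieces: X[:90], X[90:5] (empty) and X[5:].
pieces = np.split(X, [90, 5])
print(len(pieces))

# One index gives exactly two pieces: X[:90] and X[90:].
trX, vaX = np.split(X, [90])
print(trX.shape, vaX.shape)
```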
Hello,
I am trying to understand input and output to your Reformer. I am following your example code
import torch
from reformer_pytorch import Reformer
model = Reformer(
emb = 512,
depth = 12,
max_seq_len = 8192,
num_tokens= 20000,
heads = 8,
lsh_dropout = 0.1,
causal = True, # auto-regressive or not
bucket_size = 64, # average size of qk per bucket, 64 was recommended in paper
n_hashes = 8, # should keep at 8 per paper
ff_chunks = 200, # number of chunks for feedforward layer
weight_tie = False, # tie parameters of each layer for no memory per additional depth
attn_chunks = 8, # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
use_full_attn = False # use full self attention, for comparison
).cuda()
x = torch.randint(0, 20000, (1, 8192)).long().cuda()
y = model(x)
print(x.shape)
print(y.shape)
Output:
torch.Size([1, 8192])
torch.Size([1, 8192, 20000])
So, as I understand it, the input x is a 2D tensor of size [batch_size, seq_length] and the output y is a 3D tensor of size [batch_size, seq_length, num_tokens]. I wonder why there is a mismatch like this?
Comparing to the official Transformer code, I have a simple example:
import torch
from torch.nn.modules.transformer import *
transformer_model = Transformer(nhead=16, num_encoder_layers=12)
src = torch.rand((10, 32, 512))
tgt = torch.rand((20, 32, 512))
out = transformer_model(src, tgt)
print(out.shape)
Output:
torch.Size([20, 32, 512])
As you can see, the input and output of the Transformer model are 3D tensors of size [seq_length, batch_size, emb_size]. Is there any way I can do the same thing with your Reformer implementation?
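The shape difference in the two examples above is what one would expect when a model embeds integer token ids itself and projects back to vocabulary scores, whereas nn.Transformer works on already-embedded vectors. The sketch below shows only that bookkeeping in plain PyTorch; `to_embedding` and `to_logits` are illustrative names, the sizes are arbitrary, and no Reformer layers are involved.

```python
import torch
from torch import nn

num_tokens, dim = 100, 16

to_embedding = nn.Embedding(num_tokens, dim)  # integer ids -> vectors
to_logits = nn.Linear(dim, num_tokens)        # vectors -> one score per vocabulary entry

token_ids = torch.randint(0, num_tokens, (2, 10))  # (batch, seq_len)
hidden = to_embedding(token_ids)                   # (batch, seq_len, dim)
logits = to_logits(hidden)                         # (batch, seq_len, num_tokens)
predicted_ids = logits.argmax(dim=-1)              # back to (batch, seq_len)

# nn.Transformer is sequence-first by default, so swap the first two axes to match it.
seq_first = hidden.transpose(0, 1)                 # (seq_len, batch, dim)
```

The ReformerLM snippet quoted elsewhere on this page sets return_embeddings = True to get the output of the last layer instead of logits, which may be closer to the nn.Transformer behaviour.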
Hi @lucidrains, in an encoder-decoder setting, take the decoder input to be the target, and denote the encoder input length by S and the decoder input length by T. The sizes of input_mask and input_attn_mask should then be NxT and TxT. It is unclear whether context_mask should be NxS (padding) or TxS (memory).
Hi @lucidrains
I was wondering if it would make sense to you if I created a pull request where the user can choose between GELU and MISH as the activation function.
The explanation of MISH can be found here:
The GitHub is here:
And the discussion can be found here:
Here there is a little benchmark:
If I'm not mistaken, there is only one place in the reformer_pytorch library where you use GELU_, namely in the FeedForward layer, so I could add a parameter to the constructor as a flag.
Let me know what you think about it.
Thank you,
Cal
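The constructor flag proposed above could look roughly like the sketch below. Mish is defined as x * tanh(softplus(x)); everything else here is an assumption for illustration: the `FeedForward` class, its `mult` and `activation` arguments, and the string keys are hypothetical and are not the library's actual API.

```python
import torch
import torch.nn.functional as F
from torch import nn


class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""

    def forward(self, x):
        return x * torch.tanh(F.softplus(x))


class FeedForward(nn.Module):  # hypothetical layer, not the library's class
    def __init__(self, dim, mult=4, activation="gelu"):
        super().__init__()
        activations = {"gelu": nn.GELU, "mish": Mish}
        if activation not in activations:
            raise ValueError(f"unknown activation: {activation!r}")
        self.net = nn.Sequential(
            nn.Linear(dim, dim * mult),
            activations[activation](),
            nn.Linear(dim * mult, dim),
        )

    def forward(self, x):
        return self.net(x)


ff = FeedForward(dim=8, activation="mish")
out = ff(torch.randn(2, 5, 8))  # output keeps the input shape (2, 5, 8)
```

Keeping GELU as the default means existing configurations would behave as before, and Mish is only used when requested.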
This is a follow-up to my comment in #50. How do you make a prediction for a test example with the encoder-decoder, after training with the code block mentioned in that issue?
Any chance you can provide a trained model? It takes extremely long for me to train on my own.
RuntimeError Traceback (most recent call last)
in
21
22 x = torch.randint(0, 20000, (1, 8192)).long().cuda()
---> 23 y = model(x) # (1, 8192, 20000)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x, **kwargs)
499
500 x = self.to_model_dim(x)
--> 501 x = self.reformer(x, **kwargs)
502 return self.to_logits(x)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x, keys)
479 x = torch.cat([x, x], dim = -1)
480 self.set_reversible_args(keys = keys)
--> 481 x = self.layers(x)
482 return torch.stack(x.chunk(2, dim=-1)).sum(dim=0)
483
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/revtorch/revtorch.py in forward(self, x)
197 :return: Output tensor
198 '''
--> 199 x = _ReversibleModuleFunction.apply(x, self.reversible_blocks, self.eagerly_discard_variables)
200 return x
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/revtorch/revtorch.py in forward(ctx, x, reversible_blocks, eagerly_discard_variables)
144 for block in reversible_blocks:
145 assert (isinstance(block, ReversibleBlock))
--> 146 x = block(x)
147 ctx.y = x.detach() #not using ctx.save_for_backward(x) saves us memory by beeing able to free ctx.y earlier in the backward pass
148 ctx.reversible_blocks = reversible_blocks
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/revtorch/revtorch.py in forward(self, x)
47 with torch.no_grad():
48 self._init_seed('f')
---> 49 y1 = x1 + self.f_block(x2)
50 self._init_seed('g')
51 y2 = x2 + self.g_block(y1)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x)
80 def forward(self, x):
81 x = self.norm(x)
---> 82 return self.fn(x)
83
84 class Chunk(nn.Module):
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x)
105
106 def forward(self, x):
--> 107 return self.fn(x, *self.args, **self.kwargs)
108
109 # LSH attention as described in https://openreview.net/pdf?id=rkgNKkHtvB
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)
~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x, keys)
394
395 mem = self.mem_kv.expand(b, m, e)
--> 396 keys = default(keys, torch.empty(b, 0, e, dtype=mem.dtype, device=device))
397
398 kv_len = t + m + keys.shape[1]
RuntimeError: sizes must be non-negative
.....................................................................................
This error occurs when using exactly the same example you have in your README...
import torch
from reformer_pytorch import ReformerLM
model = ReformerLM(
num_tokens= 20000,
dim = 1024,
depth = 12,
max_seq_len = 8192,
heads = 8,
lsh_dropout = 0.1,
emb_dim = 128, # embedding factorization for further memory savings
causal = True, # auto-regressive or not
bucket_size = 64, # average size of qk per bucket, 64 was recommended in paper
n_hashes = 4, # 4 is permissible per author, 8 is the best but slower
ff_chunks = 200, # number of chunks for feedforward layer, make higher if there are memory issues
weight_tie = False, # tie parameters of each layer for no memory per additional depth
attn_chunks = 8, # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
num_mem_kv = 128, # persistent learned memory key values, from all-attention paper
twin_attention = False, # both branches of the reversible network will be attention
use_full_attn = False, # use full self attention, for comparison
full_attn_thres = 1024, # use full attention if context length is less than set value
use_scale_norm = False # use scale norm from 'Transformers without tears' paper
).cuda()
x = torch.randint(0, 20000, (1, 8192)).long().cuda()
y = model(x) # (1, 8192, 20000)
.....................................................................