
reformer-pytorch's People

Contributors

abhishekyana, dependabot[bot], gulby, hichiaty, ilya16, justindujardin, lucidrains, maclandrol, pabloppp, wangyu1997, zbloss

reformer-pytorch's Issues

Results

Hi! Thanks for implementing a new interesting model, Reformer.

Have you had a chance to report results on imagenet64 and enwik8-64K? It would be great if you could validate the model on them, at least to confirm that the code is implemented as it should be.

Thank you

Contextual attention in decoder, where Q != K

Hello @lucidrains and thanks for the nice implementation!
I have a question about using LSH in the contextual attention in a decoder, which can't have Q equal to K.

In that case some queries might end up in buckets without keys, as the paper mentions, but there are ways around that - like picking the next best bucket (2nd or 3rd argmax) or just zeroing attention for that query. Do you think that could work?
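A toy illustration of the "just zeroing attention for that query" fallback mentioned above (purely illustrative, not code from this repo; the bucket ids below are made up):

import torch

# Hypothetical per-position bucket ids produced by LSH for queries and keys.
q_buckets = torch.tensor([0, 1, 2, 2])               # 4 queries; query 1 lands in bucket 1
k_buckets = torch.tensor([0, 2, 2])                  # 3 keys; no key falls in bucket 1
scores = torch.randn(4, 3)                           # raw attention scores (queries x keys)

# Mask out every (query, key) pair whose buckets differ.
same_bucket = q_buckets[:, None] == k_buckets[None, :]
scores = scores.masked_fill(~same_bucket, float('-inf'))

# A query whose row is entirely -inf (bucket 1 here) would produce NaNs in the
# softmax, so its attention is zeroed out instead.
attn = torch.softmax(scores, dim=-1)
has_key = same_bucket.any(dim=-1, keepdim=True)
attn = torch.where(has_key, attn, torch.zeros_like(attn))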

In the context of MT, for example, if we can only use LSH in self-attention, then the encoder could even read some 64k tokens, but the decoder would be limited to outputting partial sequences (which sounds reasonable, but does not exploit the full capabilities of the Reformer).

Optimizing LSH attention runtime

The LSH attention is considerably slower than the vanilla full attention; around 8 times slower in my experiments (with sequences shorter than 512 time steps).

I know LSH is supposed to be slower than vanilla attention since there's all the bucketing overhead, and for smaller sequences we can use full attention. But I wonder if we can still get some improvements that will affect running times on longer sequences.
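For what it's worth, the constructor flags that appear in other issues here already allow falling back to exact attention for short contexts, if the flags behave as their inline comments say; a minimal sketch (parameter values are just examples):

import torch
from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 256,
    dim = 512,
    depth = 6,
    max_seq_len = 4096,
    heads = 8,
    causal = True,
    n_hashes = 4,          # fewer hash rounds run faster, at some cost in LSH accuracy
    full_attn_thres = 512  # sequences shorter than this use full attention instead of LSH
)

x = torch.randint(0, 256, (1, 256)).long()
y = model(x)               # short input, so this forward pass should use full attention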

@lucidrains if you believe there's not much to optimize, then please just close this issue.

Runtime error when attempting to use data distributed parallel

Thank you for putting in the time to do this. I have a bunch of ideas for it.

I crudely ported your example training script to the pytorch-lightning library and ran into a crash when I attempted to use distributed data parallel. The problem may be down in the revtorch library, but I wanted to hand the script off to you while reporting it, so you can play with it and decide where the issue is.

You can get the crash by supplying the --distributed flag to the script with any number of GPUs.

Epoch 1:   0%|                                                                                                                                                                         | 0/1451 [00:00<?, ?batch/s]Traceback (most recent call last):
  File "example/train_lightning.py", line 166, in <module>
    main()
  File "example/train_lightning.py", line 161, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 687, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 331, in ddp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 829, in run_pretrain_routine
    self.train()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 332, in train
    self.run_training_epoch()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 386, in run_training_epoch
    output = self.run_training_batch(batch, batch_idx)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 506, in run_training_batch
    loss = optimizer_closure()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 489, in optimizer_closure
    model_ref.backward(self.use_amp, closure_loss, optimizer)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/core/hooks.py", line 154, in backward
    loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/opt/conda/lib/python3.6/site-packages/revtorch/revtorch.py", line 161, in backward
    y, dy = ctx.reversible_blocks[i].backward_pass(y, dy)
  File "/opt/conda/lib/python3.6/site-packages/revtorch/revtorch.py", line 89, in backward_pass
    gy1.backward(dy2)
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Expected to mark a variable ready only once. This error is caused by use of a module parameter outside the `forward` function. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. If this is the case, it knows they won't receive gradients in a backward pass. If any of those parameters are then used outside `forward`, this error condition is triggered. You can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.

script:

from reformer_pytorch import ReformerLM

import tqdm
import gzip
import numpy as np
import torch.optim as optim
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import Trainer

import os

import torch
from torch import nn
from torchvision import transforms

import argparse

import pytorch_lightning as pl

# constants

NUM_BATCHES = int(1e5)
BATCH_SIZE = 4
GRADIENT_ACCUMULATE_EVERY = 4
LEARNING_RATE = 1e-4
VALIDATE_EVERY = 100

SEQ_LEN = 4096

# helpers

def cycle(loader):
    while True:
        for data in loader:
            yield data

with gzip.open('./data/enwik8.gz') as file:
    X = np.fromstring(file.read(int(95e6)), dtype=np.uint8)
    trX, vaX = np.split(X, [int(90e6)])
    data_train, data_val = torch.from_numpy(trX), torch.from_numpy(vaX)

class TextSamplerDataset(Dataset):
    def __init__(self, data, seq_len):
        super().__init__()
        self.data = data
        self.seq_len = seq_len

    def __getitem__(self, index):
        rand_start = torch.randint(0, self.data.size(0) - self.seq_len - 1, (1,))
        full_seq = self.data[rand_start: rand_start + self.seq_len + 1].long()
        return full_seq[0:-1], full_seq[1:]

    def __len__(self):
        return self.data.size(0) // self.seq_len

class ReformerTrainer(pl.LightningModule):

    def __init__(self, batch_size=4, distributed_mode=False):
        super(ReformerTrainer, self).__init__()
        self.batch_size = batch_size
        self.distributed_mode = distributed_mode
        # instantiate model
        self.model = ReformerLM(
            emb = 512,
            depth = 6,
            max_seq_len = SEQ_LEN,
            num_tokens = 256,
            heads = 8,
            bucket_size = 64,
            n_hashes = 4,
            ff_chunks = 10,
            lsh_dropout = 0.1,
            weight_tie = True,
            causal = True,
            use_full_attn = False # set this to true for comparison with full attention
        )

    def forward(self, x):
        pred = self.model(x).transpose(1, 2)
        return pred

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y, reduction='mean')
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}
    
    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}
        
    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': F.cross_entropy(y_hat, y)}
    
    def test_end(self, outputs):
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        tensorboard_logs = {'test_loss': avg_loss}
        return {'avg_test_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=LEARNING_RATE)

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        dataset = TextSamplerDataset(data_train, SEQ_LEN)
        if self.distributed_mode:
            dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
            dataloader = DataLoader(dataset, sampler=dist_sampler, batch_size=self.batch_size)
        else:
            dataloader = DataLoader(dataset, batch_size=self.batch_size)
        return dataloader

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        dataset = TextSamplerDataset(data_val, SEQ_LEN)
        if self.distributed_mode:
            dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
            dataloader = DataLoader(dataset, sampler=dist_sampler, batch_size=self.batch_size)
        else:
            dataloader = DataLoader(dataset, batch_size=self.batch_size)
        return dataloader

    @pl.data_loader
    def test_dataloader(self):
        dataset = TextSamplerDataset(data_val, SEQ_LEN)
        if self.distributed_mode:
            dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
            dataloader = DataLoader(dataset, sampler=dist_sampler, batch_size=self.batch_size)
        else:
            dataloader = DataLoader(dataset, batch_size=self.batch_size)
        return dataloader

def main():
    
    parser = argparse.ArgumentParser("reformer-lightning example")
    parser.add_argument("--gpus", default=1, help="gpus to use")
    parser.add_argument("-d", "--distributed", default=False, action="store_true",
                        help="activates distributed using data distributed parallel")
    parser.add_argument("-b", "--batch_size", type=int, default=4, help="batch_size")
    args = parser.parse_args()

    model = ReformerTrainer(args.batch_size, args.distributed)

    # most basic trainer, uses good defaults
    if args.distributed:
        trainer = Trainer(gpus=args.gpus, distributed_backend='ddp', accumulate_grad_batches=GRADIENT_ACCUMULATE_EVERY)
    else:
        trainer = Trainer(gpus=args.gpus, distributed_backend='dp', accumulate_grad_batches=GRADIENT_ACCUMULATE_EVERY)
    trainer.fit(model)
    trainer.test()


if __name__ == "__main__":
    main()

Implement sampling

Hello! Thank you for this repo!

Can you please implement sampling in the example file, so we'd be able to generate outputs from the model (as done in GPT-2)?
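In case it helps, here is a minimal autoregressive sampling sketch (not from the repo). It assumes `model` is a trained ReformerLM with causal = True that returns logits of shape (batch, seq_len, num_tokens), and `prime` is a 1D LongTensor of prompt token ids:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, prime, steps, temperature=1.0, top_k=50):
    model.eval()
    out = prime.unsqueeze(0)                            # (1, prompt_len)
    for _ in range(steps):
        logits = model(out)[:, -1, :] / temperature     # logits for the next token only
        topk_vals, topk_idx = logits.topk(top_k, dim=-1)
        probs = F.softmax(topk_vals, dim=-1)
        next_token = topk_idx.gather(-1, torch.multinomial(probs, 1))
        out = torch.cat((out, next_token), dim=-1)      # append and feed back in
    return out.squeeze(0)

# Note: for long generations you would also need to crop `out` to the model's max_seq_len.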

example

Hi, can you release an autoregressive example first? Such as a reformer language model example.
So we can test it in an all-round way.
Thanks much.

Combining Attention & Self-Attention in Reformer ?

I have noticed that in the original Transformers paper, in the decoder blocks they alternate between masked self-attention and regular-attention over the outputs of the encoder:
[figure: decoder block from the original Transformer paper, alternating masked self-attention with attention over the encoder outputs]
This allows the model to take attention to the previously generated words before choosing what parts of the decoder output to pay attention to.

Please correct me if I'm wrong, but in the Reformer and ReformerLM modules of this repository, I believe that if the encoder outputs are passed as an extra input to the decoder (with the "keys" parameter, like in the translation example), the model will only attend over those keys and will not perform any self-attention on the decoder input. So at every decoder step the model can only use the current word as the query and cannot look back at the previously generated words, right?

What's more, the "causal" flag will apply a mask to the attention over the encoder keys, so when the model generates the word at position N it only attends to the keys at positions < N.

Would it make sense to allow combining self-attention (k == v == q), where the mask should be applied if we want it to be causal, with regular attention (q != k == v) over the passed keys, where causality no longer makes much sense, because we might want to focus on a word at the end of the sentence if the language has a different word ordering?

Sequence to sequence example

Hi,

In the sequence → sequence example:

x = torch.randint(0, 20000, (1, DE_SEQ_LEN)).long().cuda()
yi = torch.randint(0, 20000, (1, EN_SEQ_LEN)).long().cuda()

enc_keys = encoder(x)
yo = decoder(yi, keys = enc_keys)

what is yi? I assume that the decoder only needs the enc_keys as the output from the encoder?

Thanks.
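(For context, one common reading of that snippet is teacher forcing: yi is the target-language sentence fed to the decoder as input, and the decoder output yo is trained to predict the next target token at each position. A hedged sketch, reusing x, encoder and decoder from the example; the variable names below are illustrative only:)

import torch
import torch.nn.functional as F

target = torch.randint(0, 20000, (1, EN_SEQ_LEN + 1)).long().cuda()
yi = target[:, :-1]                            # decoder input: <bos> w1 ... w_{n-1}
labels = target[:, 1:]                         # what the decoder should predict: w1 ... w_n <eos>

enc_keys = encoder(x)
yo = decoder(yi, keys = enc_keys)              # (1, EN_SEQ_LEN, num_tokens)
loss = F.cross_entropy(yo.transpose(1, 2), labels)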

BERT-like masked language pre-training?

I'm pretty new to NLP, and trying to adapt reformer to do BERT-like masked language pretraining for long sequences.
Is it as simple as setting causal=False in ReformerLM class ?
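For reference, setting causal=False only removes the autoregressive mask; a BERT-style setup also needs the masked-token objective on top. A rough sketch (MASK_ID, PAD_ID and the 15% rate are assumptions to adapt to your tokenizer, and `model` is a ReformerLM built with causal = False):

import torch
import torch.nn.functional as F

MASK_ID, PAD_ID, MASK_PROB = 4, 0, 0.15

def mlm_step(model, tokens):
    labels = tokens.clone()
    mask = (torch.rand(tokens.shape, device=tokens.device) < MASK_PROB) & (tokens != PAD_ID)
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)                      # (batch, seq_len, num_tokens)
    labels = labels.masked_fill(~mask, -100)       # score only the masked positions
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)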

Some questions about reformer implementation

Hey @lucidrains - thanks for your amazing implementation - it's super cool - I learned so much from it!! I just had the following questions about reformer_pytorch/reformer_pytorch.py -

  1. line 30: I think it should be c.sum() instead of c.sum(dim=-1). Otherwise wouldn't chunked_sum(tensor) just be tensor.sum(dim=-1)?
  2. line 72: In class Chunk(nn.Module), I understand that the intermediate large-dimentional activations of FeedForward Layer will not be stored in the memory due to the use of no_grad() in forward pass of G in the residual block. But I dont see why the large intermediate gradients will not be stored during G.backward()? Shouldn't the Chunk module have a custom forward() and backward() in which chunking is also done in backward() with no_grad() so that intermediate large-dimensional gradients of previous chunk are not stored in memory?
  3. line 112: In hash_vectors() shouldn't one use make_unit_length() to make vecs unit length as line 192 says "Hashing operates on unit-length vectors...".
  4. line 169: ticker = torch.arange(self.n_hashes * seqlen, device=device).unsqueeze(0). Can we expand ticker to buckets.shape ? Maybe its not required due to broadcasting, but can make the code more clear.
  5. line 273 : A comment describing why UnsortLogits class is necessary would've been great. But maybe I am requesting this only because this custom backprop stuff is new to me :).

Thanks again for your work,
Ankit

Applying input mask?

Testing around with simple next token prediction. Sample data is separate sentences with padding. Is there a way to apply an input mask for the paddings?
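One option, assuming your padding id is 0 and your installed version accepts the input_mask keyword on the forward pass (a boolean mask that is False at padded positions); a sketch with made-up sizes:

import torch
from reformer_pytorch import ReformerLM

PAD_ID = 0

model = ReformerLM(num_tokens = 256, dim = 512, depth = 2, max_seq_len = 1024,
                   heads = 8, causal = True)

x = torch.randint(1, 256, (2, 1024)).long()
x[:, 800:] = PAD_ID                        # pretend the tail of each sequence is padding
input_mask = (x != PAD_ID)                 # (batch, seq_len) bool, False where padded
y = model(x, input_mask = input_mask)      # padded positions are excluded from attention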

Using a pre-trained REFORMER for fine-tuning takes soooo looong

Hi there, I've pre-trained a Reformer for 4 days on 500MB of text data, just to see how it works. Now I'm trying to use it for fine-tuning, and each epoch is taking a huge amount of time... I'm using a nice GPU (the one you were jealous about :P ) but it's still taking too long, as you can see below. Compared to a normal BERT, for example, there's no comparison: the latter needs only a couple of seconds for fine-tuning, while this one takes hours.

EPOCH: 0%| | 0/40 [00:00<?, ?it/s]
Training epoch 0: 0%| | 0/1041 [00:00<?, ?it/s]
Training epoch 0: 0%| | 1/1041 [00:13<3:46:44, 13.08s/it]
Training epoch 0: 0%| | 2/1041 [00:24<3:39:14, 12.66s/it]
Training epoch 0: 0%| | 3/1041 [00:36<3:33:28, 12.34s/it]
...
Training epoch 0: 4%|▍ | 40/1041 [07:47<3:09:20, 11.35s/it]
Training epoch 0: 4%|▍ | 41/1041 [07:58<3:08:17, 11.30s/it]

Do you know what the problem may be? I've created this class for NER:
class ReformerForTokenClassification(nn.Module):

    def __init__(self, num_labels, model_dim, depth,
                 n_tokens, maxlen, heads, weights_file, n_hashes, dropout=0.2):
        super(ReformerForTokenClassification, self).__init__()
        self.num_labels = num_labels
        self.model_dim = model_dim
        self.reformer = ReformerLM(n_tokens, model_dim, depth, maxlen, heads,
                                   n_hashes, return_embeddings=True)
        model_dict = self.reformer.state_dict()
        pretrained_dict = torch.load(weights_file)
        weights_dict = {k: v for k, v in pretrained_dict.items() if 'to_logits' not in k}
        self.reformer.load_state_dict(weights_dict)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.model_dim, self.num_labels)

    def forward(self, input_ids=None, labels=None):
        outputs = self.reformer(input_ids)
        sequence_output = self.dropout(outputs)
        logits = self.classifier(sequence_output)
        outputs = (logits, outputs[2:])

        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss, outputs[0], outputs[1])

        return outputs

model = ReformerForTokenClassification(num_labels=9, model_dim=768, depth=12, maxlen=512,
                                       n_tokens=tokenizer.vocab_size, heads=8, n_hashes=4,
                                       weights_file='ckpts_pequeño_oscar/model_state_dict.pt')

GPU Memory Benchmark

I did a few training runs of a simple Reformer module with different parameters and logged the GPU memory usage.

Of course, these values can vary depending on your machine and other factors, but I thought they might be useful as a rough guide:

dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 1: 452 MB

dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 8: 992 MB

dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 16: 1584 MB

dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 32: 2866 MB

dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 64: 4606 MB

dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 128: 9788 MB


dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 1: 538 MB

dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 8: 1580 MB

dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 16: 2870 MB

dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 32: 4582 MB

dim = 512, seq_len = 512, depth = 1, heads = 1, batch_size = 64: 9276 MB


dim = 512,seq_len = 1024, depth = 1, heads = 1, batch_size = 1: 682 MB

dim = 512, seq_len = 1024, depth = 1, heads = 1, batch_size = 8: 2904 MB

dim = 512, seq_len = 1024, depth = 1, heads = 1, batch_size = 16: 4634 MB

dim = 512, seq_len = 1024, depth = 1, heads = 1, batch_size = 32: 9310 MB


dim = 512, seq_len = 2048, depth = 1, heads = 1, batch_size = 1: 992 MB

dim = 512, seq_len = 2048, depth = 1, heads = 1, batch_size = 8: 4644 MB

dim = 512, seq_len = 2048, depth = 1, heads = 1, batch_size = 16: 9256 MB


dim = 512, seq_len = 4096, depth = 1, heads = 1, batch_size = 1: 1602 MB

dim = 512, seq_len = 4096, depth = 1, heads = 1, batch_size = 8: 8810 MB

dim = 512, seq_len = 4096, depth = 1, heads = 1, batch_size = 10: 10976 MB


dim = 512, seq_len = 8192, depth = 1, heads = 1, batch_size = 1: 2884 MB

dim = 512, seq_len = 8192, depth = 1, heads = 1, batch_size = 5: 11396 MB


dim = 512, seq_len = 256, depth = 1, heads = 1, batch_size = 8: 992 MB

dim = 512, seq_len = 256, depth = 2, heads = 1, batch_size = 8: 1054 MB

dim = 512, seq_len = 256, depth = 4, heads = 1, batch_size = 8: 1142 MB

dim = 512, seq_len = 256, depth = 6, heads = 1, batch_size = 8: 1220 MB

dim = 512, seq_len = 256, depth = 12, heads = 1, batch_size = 8: 1512 MB

dim = 512, seq_len = 256, depth = 24, heads = 1, batch_size = 8: 2056 MB

dim = 512, seq_len = 256, depth = 24, heads = 1, batch_size = 16: 2680 MB


dim = 128, seq_len = 256, depth = 12, heads = 1, batch_size = 8: 566 MB

dim = 128, seq_len = 256, depth = 12, heads = 2, batch_size = 8: 576 MB

dim = 128, seq_len = 256, depth = 12, heads = 4, batch_size = 8: 616 MB

dim = 128, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 732 MB

dim = 128, seq_len = 256, depth = 12, heads = 16, batch_size = 8: 1000 MB


dim = 32, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 644 MB

dim = 64, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 670 MB

dim = 128, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 732 MB

dim = 256, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 918 MB

dim = 512, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 1516 MB

dim = 1024, seq_len = 256, depth = 12, heads = 8, batch_size = 8: 3552 MB


dim = 512, seq_len = 4096, depth = 6, heads = 8, batch_size = 8: 9672 MB

dim = 128, seq_len = 4096, depth = 12, heads = 8, batch_size = 8: 6270 MB

dim = 512, seq_len = 8192, depth = 12, heads = 8, batch_size = 1: 3628 MB

dim = 512, seq_len = 8192, depth = 12, heads = 8, batch_size = 4: 10048 MB

dim = 128, seq_len = 1024, depth = 6, heads = 4, batch_size = 32: 4608 MB

dim = 128, seq_len = 1024, depth = 6, heads = 4, batch_size = 64: 8052 MB

dim = 128, seq_len = 1024, depth = 6, heads = 4, batch_size = 80: 9990 MB

GLUE Training/Evaluation Codebase

Hey I am planning to start building a codebase to both train and evaluate this model on the GLUE tasks.

Are you working on this already? If not, I can write some code and make a pull request.

Don't understand how REFORMER tokenizes words

Hi, I've been working with transformer models through the transformers library (https://huggingface.co/transformers/index.html) for a couple of months, and I want to try your ReformerLM to see if I can train a good language model for Spanish using this new technology and your library.

First of all, thank you for developing this library and for implementing the Reformer, since the code provided by Google together with the paper was not very "usable", in that it wasn't implemented as parametric classes in an ordered manner. So thanks for the effort.

The thing is, I don't understand what kind of tokenization you're using in this model: in the example there doesn't seem to be any tokenization step, nor do you have any tokenizer class to train your own tokenizer with BPE or another method. Maybe I'm getting something wrong about the model, but I'd like to know how you deal with this to feed inputs to the model.

Thank you in advance for your response. If I manage to make the ReformerLM work, I'll try to make a generator like GPT-2 based on this architecture instead of the standard Transformer, so that we can expand this library. Regards, Alejandro.
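For what it's worth, the enwik8 training script earlier in these issues suggests the answer for the examples: there is no subword tokenizer at all, every raw byte of the text is used directly as a token id in [0, 256), which is why num_tokens = 256 there. Any external tokenizer (BPE, SentencePiece, ...) can be plugged in instead, as long as num_tokens matches its vocabulary size. A tiny sketch of the byte-level scheme:

import torch

text = "adiós mundo"
token_ids = torch.tensor(list(text.encode('utf-8'))).long().unsqueeze(0)   # (1, seq_len)
decoded = bytes(token_ids.squeeze(0).tolist()).decode('utf-8')             # round-trips back to the text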

A full Reformer image → caption example was wrong

When I try this example, "A full Reformer image → caption", I found that the encoder was wrong because the arguments axial_position_emb = True, axial_position_shape = (32, 32), axial_position_dims = (256, 256) do not exist in the Reformer class, while ReformerLM has them. Please help me verify this issue.

Generation doesn't seem right

First of all, the DeepSpeed implementation is awesome! I trained on 4 V100s and got an 8.5X boost, and 20X with fp16 turned on, compared to just one GPU.

I trained a model on a 300MB dialogue dataset for 2 epochs, but the generated samples weren't good. I'm quite sure I messed up the code somehow, since I come from a programming background and not ML.

Here's my code: https://pastebin.com/V1t5Ctg7
lr = 0.0004, bs=32, vocab_size=2000

Here are some samples: https://pastebin.com/yCL0vVdv

From my experiments with other architectures (GPT-2 from scratch, LSTM), it should generate decent samples after feeding this data so something must be wrong somewhere.

Visualizing attention weights?

Thanks for the cool library!

I'm working on a seq2seq demo using it, and I'd like to visualize the attention weights, but it isn't clear how to get them out of the ReformerLM class. Can you point me in the right direction?

Pretrained models?

When are we going to see pretrained Reformer models? I don't have the compute or the dataset to do it myself, but this seems to be a strictly better technique for training NLP models than previous transformers.

Glue example

Hi, first of all thanks for the great work!

I was wondering how the GLUE example is supposed to do any classification. There is no classification head added anywhere in the example for the classification tasks. From what I see, it simply takes the argmax(-1) of the [batch, sequence_length, number_of_tokens] output of the ReformerLM, which is nonsense, isn't it? I would expect to 1. set return_embeddings = True and 2. set causal = False, then take one output token, i.e. [:, 0, :], and add maybe 2 Linear layers on top of that like BERT does, or average over all the tokens, since our task is classification.

Are my assumptions right, or am I getting something wrong?
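For reference, a minimal sketch of the kind of head described above (an illustration, not code from the repo); it assumes the backbone is a ReformerLM built with return_embeddings = True and causal = False:

import torch
from torch import nn

class ReformerClassifier(nn.Module):
    def __init__(self, reformer_lm, dim, num_classes):
        super().__init__()
        self.backbone = reformer_lm              # ReformerLM(..., return_embeddings = True, causal = False)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, num_classes))

    def forward(self, x):
        emb = self.backbone(x)                   # (batch, seq_len, dim)
        pooled = emb.mean(dim = 1)               # or emb[:, 0] for a CLS-style first token
        return self.head(pooled)                 # (batch, num_classes)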

DeepSpeed and Generate Method

Hi @lucidrains

I'm currently testing the generate function of the TrainingWrapper class.
When I use DeepSpeed and I try to generate a sequence it gives me the following error:
AttributeError: 'DeepSpeedLight' object has no attribute 'generate'

Is it because Generation can only be done outside DeepSpeed Engine?

Thank you very much, once again! :)

`return_embedding` seems to be a no-op

Thanks so much for this implementation! I suspect it will be very helpful for the community. 😄

I'm trying to use the ReformerLM with return_embeddings=True and it looks like it's effectively a no-op. When I try your Colab with return_embeddings=True, model.out is just Identity() and no embedding is returned from the forward pass.

Am I using this option incorrectly or is it still a TODO?

Is enc_input_mask equal to set pad_idx and ignore_idx?

Hi @lucidrains
I updated the code to use your implementation of the EncDec architecture, but I ran out of memory when I set the input_mask and the context_mask in order to mask the pad indexes.
In the previous implementation I used this:

    encoder = TrainingWrapper(encoder, ignore_index=PAD_IDX).cuda()
    decoder = TrainingWrapper(decoder, ignore_index=PAD_IDX).cuda()

    encoder_engine, encoder_optimizer, trainloader, _ = deepspeed.initialize(args=cmd_args, model=encoder, optimizer=encoder_optimizer, model_parameters=encoder_params, training_data=train_dataset, dist_init_required=True)

    decoder_engine, decoder_optimizer, _, _ = deepspeed.initialize(args=cmd_args, model=decoder, optimizer=decoder_optimizer, model_parameters=encoder_params, dist_init_required=False)

    for src, trg in dataset:
            encoder_engine.train()
            decoder_engine.train()
            src = src.to(encoder_engine.local_rank)
            trg = trg.to(decoder_engine.local_rank)
            enc_keys = encoder_engine(src)
            loss = decoder_engine(trg, keys = enc_keys, return_loss = True)   
            loss.backward()
            decoder_engine.step()
            encoder_engine.step()

instead of this:

enc_dec_engine, enc_dec_optimizer, trainloader, _ = deepspeed.initialize(args=cmd_args, model=enc_dec, optimizer=enc_dec_optimizer, model_parameters=enc_dec_params, training_data=train_dataset)

    for src, trg in dataset:
            enc_dec_engine.train()
            
            src = src.to(enc_dec_engine.local_rank)
            trg = trg.to(enc_dec_engine.local_rank)

            enc_input_mask = torch.tensor([[1 for idx in smpl if idx != PAD_IDX] for smpl in src]).bool().to(device)
            context_mask = torch.tensor([[1 for idx in smpl if idx != PAD_IDX] for smpl in trg]).bool().to(device)

            loss = enc_dec(src, trg, return_loss = True, enc_input_mask = enc_input_mask, context_mask=context_mask)

            loss.backward()

            enc_dec_engine.step()

I didn't run out of memory, and I was assuming that the loss was computed excluding the pad index; did I make a mistake? What is the best way to ignore the pad index?
Does using the input_mask have the same outcome as setting the pad index as ignore_index in the TrainingWrapper?
If they are equivalent, is there a way to use ignore_index in the TrainingWrapper instead of the masking techniques, to save some memory with your EncDec implementation as well?

Thank you in advance,
Cal

MEMORY ISSUES in self-supervised.py

Hi there, I'm trying to pretrain a ReformerLM for Spanish on a single Nvidia P100 16GB GPU, and even when restricting the embedding dimension, the number of heads, etc., I still get a memory error. I'm using the script at https://github.com/lucidrains/reformer-pytorch/blob/master/pretraining/self-supervised.py for that, and my configuration is the following:

tokenizer.max_len = 128
model = ReformerLM(
    num_tokens=tokenizer.vocab_size,
    dim=128,
    depth=1,
    heads=1,
    max_seq_len=tokenizer.max_len,
    causal=True,
    n_hashes=2,
    ff_chunks=10000
)

trainer = ReformerTrainer(dataset, model, tokenizer, train_batch_size=1, eval_batch_size=1)

I've reduced the number of hashes, the max len, I've increased the ff_chunks... I've tried everything that's supposed to reduce the memory usage, but it's still not working. Have you been able to make the code in the link above work? @lucidrains If so, please tell me how... Just in case, my GPU is completely free before I start training, and the trainer tries to use about 19GB of memory...

Model only generates <eos> tokens, Text Generation,

Greetings! I've found this repo very useful, flexible and easy to use. Thanks for putting it out. I've been playing with this repo for a text generation problem.

I want to generate a reply given previous history of conversation. Here's how I'm encoding the sequence.
<bos><speaker1>Hello, how are you?<speaker2>Great! What about you?<eos>.
As can be seen from the enwik8 example, when giving input we need to drop the last token from the sequence, and for the target we take from the 2nd token to the last token. So for the above example:
inp: <bos><speaker1>Hello, how are you?<speaker2>Great! What about you?
targets: <speaker1>Hello, how are you?<speaker2>Great! What about you?<eos>
I'm calculating loss only on the last portion i.e on Great! What about you?<eos>.

This is unlike other models, e.g. GPT-2 or the Trax implementation of Reformer, where you just feed the same sequence as input and targets, and it handles the rest.
So when I trained the model with the above encoding, the model only generated <eos> tokens. So I removed the <eos> token and trained again, but then the last token was always some punctuation, so the model was only generating punctuation.

Is this really an issue or am I doing something wrong? Also, can we make it more like the Trax implementation, where we just feed the same sequence for both input and targets?
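For what it's worth, the Trax-style convenience asked about in the last question can be approximated with a small wrapper that does the shift internally, so callers pass the same sequence as both input and target (a sketch; `pad_id` is an assumption, and `model` is any module returning per-position logits):

import torch.nn.functional as F

def lm_loss(model, seq, pad_id = 0):
    inp, target = seq[:, :-1], seq[:, 1:]
    logits = model(inp)                                   # (batch, seq_len - 1, num_tokens)
    return F.cross_entropy(logits.transpose(1, 2), target, ignore_index = pad_id)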

torch.nn.DataParallel causes strange GPU memory overflow

Thanks for your great work!
When I test this model with the following code:

import torch
from reformer_pytorch import ReformerLM
from torch.nn import functional as F

model = ReformerLM(
    num_tokens=20000,
    dim=1024,
    depth=24,
    max_seq_len=1024,
    heads=16,
    lsh_dropout=0.1,
    emb_dim=1024,  # embedding factorization for further memory savings
    causal=True,  # auto-regressive or not
    bucket_size=64,  # average size of qk per bucket, 64 was recommended in paper
    n_hashes=8,  # 4 is permissible per author, 8 is the best but slower
    ff_chunks=200,  # number of chunks for feedforward layer, make higher if there are memory issues
    weight_tie=False,  # tie parameters of each layer for no memory per additional depth
    attn_chunks=8,  # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
    num_mem_kv=0,  # persistent learned memory key values, from all-attention paper
    twin_attention=False,  # both branches of the reversible network will be attention
    use_full_attn=True,  # use full self attention, for comparison
    full_attn_thres=128,  # use full attention if context length is less than set value
    use_scale_norm=False  # use scale norm from 'Transformers without tears' paper
).cuda()

model = torch.nn.DataParallel(model)
model.train()
x = torch.randint(0, 20000, (8, 1024)).long().cuda()
y = torch.randint(0, 20000, (8, 1024)).long().cuda()
pred = model(x)
loss = F.cross_entropy(pred.transpose(1, 2), y, reduction='mean')
loss.backward()
import ipdb
ipdb.set_trace()

Without model = torch.nn.DataParallel(model), 7616M of memory is used.
But after I add model = torch.nn.DataParallel(model), it causes OOM, even though each of the 8 GPUs has 16GB of memory.
I think maybe it is a problem with revtorch?

Is training slow?

I'm running the example script (with no changes) on an Nvidia V100 and training seems to go very slowly. Each batch takes a few seconds (≈3.2 sec). What could be the problem?

EDIT: After some comparisons with other models, it doesn't seem to be relatively slow. But I still wanted to know other people's experiences.

Script to easily train text generation models a la gpt-2-simple repo

Greetings! Your repository is a very welcome contribution. I tried to follow the examples in this repo but faced some problems. Trying to modify enwik8_simple, I didn't understand how to:

  1. Load my custom data into examples (I have a poetry dataset).
  2. Generate output from a start prefix until an end token.

Thanks a lot!

parameter change test on reformer model

I did a parameter change test similar to the one offered in torchtest with a pytorch_reformer model (using reformer_pytorch==0.12.7).

I got the following result:

utils.VariablesChangeException: #OK: 31 #wrong: 2 Parameters:
E reformer.reformer.layers.reversible_blocks.0.f_block.fn.fn.mem_kv
E reformer.reformer.layers.reversible_blocks.1.f_block.fn.fn.mem_kv

indicating that the mem_kv parameters are not updated during an optimizer step.
I looked at the code of LSHAttention and see this line:

keys = default(keys, torch.empty(b, 0, e, dtype=mem.dtype, device=device))
I think that the call to torch.empty should include the requires_grad=True parameter.

Missing grad_fn when passing a simple tensor through the reformer module.

If I try to pass a simple tensor, that does not require grad, through a Reformer, it won't allow me to do backpropagation.

x = torch.randn(batch_size, 256, 512).cuda()  
pred = reformer(x)
loss = criterion(pred, torch.ones_like(pred))
optimizer.zero_grad()
loss.backward()
optimizer.step()

[screenshot of the resulting error]

If I do x.requires_grad = True just before passing it to the reformer model it works though.

I am not sure why this happens, but makes me think that the model is not being optimized at all when training :/

Set dropout parameters

hi!
The default value of dropout in the init method of LSHAttention is 0, and there is nowhere to change it:

class LSHSelfAttention(nn.Module):
    def __init__(self, emb, heads = 8, bucket_size = 64, n_hashes = 8, causal = False, **kwargs):
        # init position
        self.lsh_attn = LSHAttention(bucket_size=bucket_size, causal=causal, **kwargs)

class Reformer(nn.Module):
    def __init__(self, emb, depth, max_seq_len, num_tokens = 10000, heads = 8, bucket_size = 64, n_hashes = 8, ff_chunks = 100, causal = False, weight_tie = False):
         # Never pass the dropout parameters so the dropout can't be changed
        get_attn = lambda: LSHSelfAttention(emb, heads, bucket_size, n_hashes, causal = causal)

Thanks!

DeepSpeed and nn.Embedding issue

Hi Lucidrains
First of all thanks for the contribution. You are doing an awesome job here.

I'm trying to implement the Seq2Seq model using DeepSpeed since I will have 32k seq_len as input. This is my code:

 class GenomeToMolDataset(Dataset):
    def __init__(self, data, src_lang, trg_lang):
        super().__init__()
        self.data = data
        self.src_lang = src_lang
        self.trg_lang = trg_lang

    def __getitem__(self, index):
        #print(index)
        pair = self.data[index]
        #print('src:',pair[0])
        #print('\n\ntrg:',pair[1])
        src = torch.tensor(indexesFromSentence(self.src_lang,pair[0]))
        trg = torch.tensor(indexesFromSentence(self.trg_lang,pair[1]))
        print('src:', src)
        print('trg:', trg)
        return src,trg

    def __len__(self):
        return len(self.data)

train_dataset = GenomeToMolDataset(tr_pairs, input_lang, target_lang)
test_dataset = GenomeToMolDataset(ts_pairs, input_lang, target_lang)

encoder = ReformerLM(
    num_tokens = input_lang.n_words,
    emb_dim = emb_dim,#128,
    dim = dim,#512,
    bucket_size = bucket_size, # 16,
    depth = depth, # 6,
    heads = heads, # 8,
    n_hashes= n_hashes,
    max_seq_len = VIR_SEQ_LEN,
    ff_chunks = ff_chunks, #400,      # number of chunks for feedforward layer, make higher if there are memory issues
    attn_chunks = attn_chunks, #16,    # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
    #weight_tie = True,
    fixed_position_emb = True,
    return_embeddings = True # return output of last attention layer
).cuda()

decoder = ReformerLM(
    num_tokens = target_lang.n_words,
    emb_dim = emb_dim, # 128,
    dim = dim, # 512,
    bucket_size = bucket_size, #16,
    depth = depth, #6,
    heads = heads, #8,
    n_hashes= n_hashes,
    ff_chunks = ff_chunks, # 400,      # number of chunks for feedforward layer, make higher if there are memory issues
    attn_chunks = attn_chunks, # 16,    # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
    max_seq_len = MOL_SEQ_LEN,
    fixed_position_emb = True,
    causal = True
).cuda()

encoder_optimizer = RangerLars(encoder.parameters()) # torch.optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = RangerLars(decoder.parameters()) # torch.optim.Adam(decoder.parameters(), lr=learning_rate)

if use_apex:
    encoder, encoder_optimizer = amp.initialize(encoder, encoder_optimizer, opt_level='O1')
    decoder, decoder_optimizer = amp.initialize(decoder, decoder_optimizer, opt_level='O1')

encoder = TrainingWrapper(encoder).cuda()
#encoder.cuda()

decoder = TrainingWrapper(decoder).cuda()
#decoder.cuda()

encoder_params = filter(lambda p: p.requires_grad, encoder.parameters())
decoder_params = filter(lambda p: p.requires_grad, decoder.parameters())

encoder_engine, encoder_optimizer, trainloader, _ = deepspeed.initialize(args=cmd_args, model=encoder, optimizer=encoder_optimizer, model_parameters=encoder_params, training_data=train_dataset, dist_init_required=True)
decoder_engine, decoder_optimizer, _, _ = deepspeed.initialize(args=cmd_args, model=decoder, optimizer=decoder_optimizer, model_parameters=encoder_params, dist_init_required=False)

# training
VALIDATE_EVERY = 1
SAVE_EVERY = 10
SAVE_DIR = './saved_model/'
_, encoder_client_sd = encoder_engine.load_checkpoint(SAVE_DIR+'encoder/', None)
_, decoder_client_sd = decoder_engine.load_checkpoint(SAVE_DIR+'decoder/', None) #args.ckpt_id 
for i, pair in enumerate(trainloader):
    src = pair[0]
    trg = pair[1]
    encoder_engine.train()
    decoder_engine.train()
    src = src.to(encoder_engine.local_rank)
    trg = trg.to(decoder_engine.local_rank)
    
    print(src.shape)
    print(src.dtype)
    print(trg.shape)
    print(trg.dtype)

    enc_keys = encoder_engine(src)
    loss = decoder_engine(trg, keys = enc_keys, return_loss = True)   # (1, 4096, 20000)
    encoder_engine.backward(loss)
    decoder_engine.backward(loss)
    encoder_engine.step()
    decoder_engine.step()
    print('Training Loss:',loss.item())       

    if i % VALIDATE_EVERY == 0:
        encoder.eval()
        decoder.eval()
        with torch.no_grad():
            ts_src,ts_trg = random.choice(test_dataset)[:-1]
            enc_keys = encoder(ts_src.to(device))
            loss = decoder(ts_trg, keys=enc_keys, return_loss = True)
            print(f'\tValidation Loss: {loss.item()}')

    if i % SAVE_EVERY:
        encoder_client_sd['step'] = i
        decoder_client_sd['step'] = i
        ckpt_id = loss.item()
        encoder_engine.save_checkpoint(SAVE_DIR+'encoder/', ckpt_id, client_sd = encoder_client_sd)
        decoder_engine.save_checkpoint(SAVE_DIR+'decoder/', ckpt_id, client_sd = decoder_client_sd)

The issue I'm having is with the nn.Embedding layer, since it wants Long integers as input but DeepSpeed works only with floats, and it prompts this error:
RuntimeError: expected device cuda:0 and dtype Float but got device cuda:0 and dtype Long

If I cast the inputs to float, then the Embedding layer raises the opposite error.

How can I use your ReformerLM as an encoder-decoder with DeepSpeed in this case? Is there any way I can work around the Embedding issue?

Thank you,
Cal

Image Generation Example

Hi Author,

Thanks for your work on the Reformer implementation in PyTorch. May I ask, could you share an example of the image generation task? Thanks.

Error with example train.py

Hi,

Here's an error when attempting to pull and run train.py from your repo:

$ python3 train.py 
train.py:42: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  X = np.fromstring(file.read(int(95e6)), dtype=np.uint8)
Traceback (most recent call last):
  File "train.py", line 43, in <module>
    trX, vaX = np.split(X, [int(90e6), int(5e6)])
ValueError: too many values to unpack (expected 2)

python3
Python 3.7.5 (default, Nov 20 2019, 09:21:52) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.

Explain input and output of reformer

Hello,
I am trying to understand the input and output of your Reformer. I am following your example code:

import torch
from reformer_pytorch import Reformer

model = Reformer(
    emb = 512,
    depth = 12,
    max_seq_len = 8192,
    num_tokens= 20000,
    heads = 8,
    lsh_dropout = 0.1,
    causal = True,        # auto-regressive or not
    bucket_size = 64,     # average size of qk per bucket, 64 was recommended in paper
    n_hashes = 8,         # should keep at 8 per paper
    ff_chunks = 200,      # number of chunks for feedforward layer
    weight_tie = False,   # tie parameters of each layer for no memory per additional depth
    attn_chunks = 8,        # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
    use_full_attn = False   # use full self attention, for comparison
).cuda()

x = torch.randint(0, 20000, (1, 8192)).long().cuda()
y = model(x)
print(x.shape)
print(y.shape)

Output:

torch.Size([1, 8192])
torch.Size([1, 8192, 20000])

So as I understand, the input x is a 2D tensor of size [batch_size, seq_length] and the output y is a 3D tensor of size [batch_size, seq_length, num_tokens]. I wonder why there is a mismatch like this?

Comparing to the official Transformer code, I have a simple example:

import torch
from torch.nn.modules.transformer import *
transformer_model = Transformer(nhead=16, num_encoder_layers=12)
src = torch.rand((10, 32, 512))
tgt = torch.rand((20, 32, 512))
out = transformer_model(src, tgt)
print(out.shape)

Output:
torch.Size([20, 32, 512])

As you can see, the input and output of the Transformer model are 3D tensors of size [seq_length, batch_size, emb_size]. Is there any way I can do the same thing with your Reformer implementation?
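For context, x holds token ids, so it is [batch_size, seq_length]; the model embeds each id internally and returns one score per vocabulary entry per position, hence [batch_size, seq_length, num_tokens]. A small sketch of working with those shapes, reusing y from the snippet above (the permute is only to mirror nn.Transformer's [seq, batch, ...] ordering):

probs = y.softmax(dim = -1)         # per-position distribution over the 20000 tokens
next_tokens = y.argmax(dim = -1)    # greedy decoding: (1, 8192)

y_seq_first = y.permute(1, 0, 2)    # (8192, 1, 20000), seq-first like nn.Transformer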

definition of input_attn_mask and context_mask

Hi @lucidrains, in an encoder-decoder setting, consider the input to the decoder as the target, and denote the encoder input length S and the decoder input length T. The sizes of input_mask and input_attn_mask should be N×T and T×T. It is unclear whether context_mask should be N×S (padding) or T×S (memory).

MISH as ActivationFunction

Hi @lucidrains
I was wondering whether it would make sense to you if I created a pull request where the user can choose between GELU and Mish as the activation function.
The explanation of MISH can be found here:

The GitHub is here:

And the discussion can be found here:

Here there is a little benchmark:
[benchmark figure comparing activation functions]

If I'm not mistaken, there is only one place in the reformer_pytorch library where you use GELU, in the FeedForward layer; I could add a parameter to the constructor as a flag.
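For reference, Mish is simple enough to drop in as its own module (a sketch; where to thread the flag through is up to the PR):

import torch
from torch import nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish(x) = x * tanh(softplus(x)), per the Mish paper
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))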

Let me know what would you think about it.

Thank you,
Cal

No pretrained model

Any chance you can provide a trained model? It takes extremely long for me to train on my own.

The first example in your README doesn't work :(

Here's the error:

RuntimeError Traceback (most recent call last)
in
21
22 x = torch.randint(0, 20000, (1, 8192)).long().cuda()
---> 23 y = model(x) # (1, 8192, 20000)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x, **kwargs)
499
500 x = self.to_model_dim(x)
--> 501 x = self.reformer(x, **kwargs)
502 return self.to_logits(x)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x, keys)
479 x = torch.cat([x, x], dim = -1)
480 self.set_reversible_args(keys = keys)
--> 481 x = self.layers(x)
482 return torch.stack(x.chunk(2, dim=-1)).sum(dim=0)
483

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/revtorch/revtorch.py in forward(self, x)
197 :return: Output tensor
198 '''
--> 199 x = _ReversibleModuleFunction.apply(x, self.reversible_blocks, self.eagerly_discard_variables)
200 return x

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/revtorch/revtorch.py in forward(ctx, x, reversible_blocks, eagerly_discard_variables)
144 for block in reversible_blocks:
145 assert (isinstance(block, ReversibleBlock))
--> 146 x = block(x)
147 ctx.y = x.detach() #not using ctx.save_for_backward(x) saves us memory by beeing able to free ctx.y earlier in the backward pass
148 ctx.reversible_blocks = reversible_blocks

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/revtorch/revtorch.py in forward(self, x)
47 with torch.no_grad():
48 self._init_seed('f')
---> 49 y1 = x1 + self.f_block(x2)
50 self._init_seed('g')
51 y2 = x2 + self.g_block(y1)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x)
80 def forward(self, x):
81 x = self.norm(x)
---> 82 return self.fn(x)
83
84 class Chunk(nn.Module):

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x)
105
106 def forward(self, x):
--> 107 return self.fn(x, *self.args, **self.kwargs)
108
109 # LSH attention as described in https://openreview.net/pdf?id=rkgNKkHtvB

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
475 result = self._slow_forward(*input, **kwargs)
476 else:
--> 477 result = self.forward(*input, **kwargs)
478 for hook in self._forward_hooks.values():
479 hook_result = hook(self, input, result)

~/miniconda/envs/distilbert_env/lib/python3.6/site-packages/reformer_pytorch/reformer_pytorch.py in forward(self, x, keys)
394
395 mem = self.mem_kv.expand(b, m, e)
--> 396 keys = default(keys, torch.empty(b, 0, e, dtype=mem.dtype, device=device))
397
398 kv_len = t + m + keys.shape[1]

RuntimeError: sizes must be non-negative
.....................................................................................

This error occurs when using exactly the same example you have in your README...
import torch
from reformer_pytorch import ReformerLM

model = ReformerLM(
    num_tokens = 20000,
    dim = 1024,
    depth = 12,
    max_seq_len = 8192,
    heads = 8,
    lsh_dropout = 0.1,
    emb_dim = 128,          # embedding factorization for further memory savings
    causal = True,          # auto-regressive or not
    bucket_size = 64,       # average size of qk per bucket, 64 was recommended in paper
    n_hashes = 4,           # 4 is permissible per author, 8 is the best but slower
    ff_chunks = 200,        # number of chunks for feedforward layer, make higher if there are memory issues
    weight_tie = False,     # tie parameters of each layer for no memory per additional depth
    attn_chunks = 8,        # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
    num_mem_kv = 128,       # persistent learned memory key values, from all-attention paper
    twin_attention = False, # both branches of the reversible network will be attention
    use_full_attn = False,  # use full self attention, for comparison
    full_attn_thres = 1024, # use full attention if context length is less than set value
    use_scale_norm = False  # use scale norm from 'Transformers without tears' paper
).cuda()

x = torch.randint(0, 20000, (1, 8192)).long().cuda()
y = model(x) # (1, 8192, 20000)
.....................................................................
