enformer-pytorch's Introduction

Enformer - Pytorch

Implementation of Enformer, DeepMind's attention network for predicting gene expression, in PyTorch. This repository also contains the means to fine-tune pretrained models for your downstream tasks. The original TensorFlow Sonnet code can be found here.

Update: fine-tuned for predicting pseudobulk chromatin accessibility here

Install

$ pip install enformer-pytorch

Usage

import torch
from enformer_pytorch import Enformer

model = Enformer.from_hparams(
    dim = 1536,
    depth = 11,
    heads = 8,
    output_heads = dict(human = 5313, mouse = 1643),
    target_length = 896,
)
    
seq = torch.randint(0, 5, (1, 196_608)) # for ACGTN, in that order (-1 for padding)
output = model(seq)

output['human'] # (1, 896, 5313)
output['mouse'] # (1, 896, 1643)

You can also directly pass in the sequence as a one-hot encoding, which must be float valued

import torch
from enformer_pytorch import Enformer, seq_indices_to_one_hot

model = Enformer.from_hparams(
    dim = 1536,
    depth = 11,
    heads = 8,
    output_heads = dict(human = 5313, mouse = 1643),
    target_length = 896,
)

seq = torch.randint(0, 5, (1, 196_608))
one_hot = seq_indices_to_one_hot(seq)

output = model(one_hot)

output['human'] # (1, 896, 5313)
output['mouse'] # (1, 896, 1643)

Finally, one can fetch the embeddings, for fine-tuning and otherwise, by setting the return_embeddings flag to True on forward

import torch
from enformer_pytorch import Enformer, seq_indices_to_one_hot

model = Enformer.from_hparams(
    dim = 1536,
    depth = 11,
    heads = 8,
    output_heads = dict(human = 5313, mouse = 1643),
    target_length = 896,
)

seq = torch.randint(0, 5, (1, 196_608))
one_hot = seq_indices_to_one_hot(seq)

output, embeddings = model(one_hot, return_embeddings = True)

embeddings # (1, 896, 3072)

For training, you can directly pass the head and target in to get the Poisson loss

import torch
from enformer_pytorch import Enformer, seq_indices_to_one_hot

model = Enformer.from_hparams(
    dim = 1536,
    depth = 11,
    heads = 8,
    output_heads = dict(human = 5313, mouse = 1643),
    target_length = 200,
).cuda()

seq = torch.randint(0, 5, (196_608 // 2,)).cuda()
target = torch.randn(200, 5313).cuda()

loss = model(
    seq,
    head = 'human',
    target = target
)

loss.backward()

# after much training

corr_coef = model(
    seq,
    head = 'human',
    target = target,
    return_corr_coef = True
)

corr_coef # pearson R, used as a metric in the paper

Pretrained Model

DeepMind has released the weights for their TensorFlow Sonnet Enformer model! I have ported it over to PyTorch and uploaded it to 🤗 Hugging Face (~1GB). There are still some rounding errors that seem to be accruing across the layers, resulting in an absolute error as high as 0.5. However, the correlation coefficients look good, so I am releasing the 'rough'ly working version. I will keep working on figuring out where the numerical errors are happening (it may be the attention pooling module, as I noticed the attention logits are pretty high).

Update: John St. John did some work and found that the enformer-official-rough model hits the reported marks in the paper - human Pearson R of 0.625 for validation, and 0.65 for test.

Update: As of version 0.8.0, if you use the from_pretrained function to load the pretrained model, it will automatically use precomputed gamma positions to address a difference between the TensorFlow and PyTorch implementations of xlogy. This should resolve the numerical discrepancy above. If you are fine-tuning further and not using the from_pretrained function, please make sure to set use_tf_gamma = True when using .from_hparams to instantiate the Enformer
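For example, a minimal sketch (mine, not from the official docs) of instantiating with use_tf_gamma outside of from_pretrained, using the same hyperparameters as the examples above:

import torch
from enformer_pytorch import Enformer

model = Enformer.from_hparams(
    dim = 1536,
    depth = 11,
    heads = 8,
    output_heads = dict(human = 5313, mouse = 1643),
    target_length = 896,
    use_tf_gamma = True   # match the tensorflow xlogy behavior, per the note above
)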

$ pip install 'enformer-pytorch>=0.5'

Loading the model

from enformer_pytorch import from_pretrained

enformer = from_pretrained('EleutherAI/enformer-official-rough')

Quick sanity check on a single human validation point

$ python test_pretrained.py
# 0.5963 correlation coefficient on a validation sample

This is all made possible thanks to HuggingFace's custom model feature.

You can also override the target_length parameter when loading, if you are working with shorter sequence lengths

from enformer_pytorch import from_pretrained

model = from_pretrained('EleutherAI/enformer-official-rough', target_length = 128, dropout_rate = 0.1)

# do your fine-tuning

To save memory when fine-tuning a large Enformer model

from enformer_pytorch import from_pretrained

enformer = from_pretrained('EleutherAI/enformer-official-rough', use_checkpointing = True)

# finetune enformer on a limited budget

Fine-tuning

This repository will also allow for easy fine-tuning of Enformer.

Fine-tuning on new tracks

import torch
from enformer_pytorch import from_pretrained
from enformer_pytorch.finetune import HeadAdapterWrapper

enformer = from_pretrained('EleutherAI/enformer-official-rough')

model = HeadAdapterWrapper(
    enformer = enformer,
    num_tracks = 128,
    post_transformer_embed = False   # by default, embeddings are taken from after the final pointwise block w/ conv -> gelu - but if you'd like the embeddings right after the transformer block with a learned layernorm, set this to True
).cuda()

seq = torch.randint(0, 5, (1, 196_608 // 2,)).cuda()
target = torch.randn(1, 200, 128).cuda()  # 128 tracks

loss = model(seq, target = target)
loss.backward()

Fine-tuning on contextual data (cell type, transcription factor, etc.)

import torch
from enformer_pytorch import from_pretrained
from enformer_pytorch.finetune import ContextAdapterWrapper

enformer = from_pretrained('EleutherAI/enformer-official-rough')
    
model = ContextAdapterWrapper(
    enformer = enformer,
    context_dim = 1024
).cuda()

seq = torch.randint(0, 5, (1, 196_608 // 2,)).cuda()

target = torch.randn(1, 200, 4).cuda()  # 4 tracks
context = torch.randn(4, 1024).cuda()   # 4 contexts for the different 'tracks'

loss = model(
    seq,
    context = context,
    target = target
)

loss.backward()

Finally, there is also a way to use attention aggregation from a set of context embeddings (or a single context embedding). Simply use the ContextAttentionAdapterWrapper

import torch
from enformer_pytorch import from_pretrained
from enformer_pytorch.finetune import ContextAttentionAdapterWrapper

enformer = from_pretrained('EleutherAI/enformer-official-rough')
    
model = ContextAttentionAdapterWrapper(
    enformer = enformer,
    context_dim = 1024,
    heads = 8,              # number of heads in the cross attention
    dim_head = 64           # dimension per head
).cuda()

seq = torch.randint(0, 5, (1, 196_608 // 2,)).cuda()

target = torch.randn(1, 200, 4).cuda()      # 4 tracks
context = torch.randn(4, 16, 1024).cuda()   # 4 contexts for the different 'tracks', each with 16 tokens

context_mask = torch.ones(4, 16).bool().cuda() # optional context mask, in example, include all context tokens

loss = model(
    seq,
    context = context,
    context_mask = context_mask,
    target = target
)

loss.backward()

Data

You can use the GenomeIntervalDataset to easily fetch sequences of any length from a .bed file, with a greater context length dynamically computed if specified

import torch
import polars as pl
from enformer_pytorch import Enformer, GenomeIntervalDataset

filter_train = lambda df: df.filter(pl.col('column_4') == 'train')

ds = GenomeIntervalDataset(
    bed_file = './sequences.bed',                       # bed file - columns 0, 1, 2 must be <chromosome>, <start position>, <end position>
    fasta_file = './hg38.ml.fa',                        # path to fasta file
    filter_df_fn = filter_train,                        # filter dataframe function
    return_seq_indices = True,                          # return nucleotide indices (ACGTN) or one hot encodings
    shift_augs = (-2, 2),                               # random shift augmentations from -2 to +2 basepairs
    context_length = 196_608,
    # this can be longer than the interval designated in the .bed file,
    # in which case it will take care of lengthening the interval on either sides
    # as well as proper padding if at the end of the chromosomes
    chr_bed_to_fasta_map = {
        'chr1': 'chromosome1',  # if the chromosome name in the .bed file is different than the key name in the fasta file, you can rename them on the fly
        'chr2': 'chromosome2',
        'chr3': 'chromosome3',
        # etc etc
    }
)

model = Enformer.from_hparams(
    dim = 1536,
    depth = 11,
    heads = 8,
    output_heads = dict(human = 5313, mouse = 1643),
    target_length = 896,
)

seq = ds[0] # (196608,)
pred = model(seq, head = 'human') # (896, 5313)

To return the random shift value, as well as whether the reverse complement was applied (in case you need to reverse the corresponding ChIP-seq target data), just set return_augs = True when initializing the GenomeIntervalDataset

import torch
import polars as pl
from enformer_pytorch import Enformer, GenomeIntervalDataset

filter_train = lambda df: df.filter(pl.col('column_4') == 'train')

ds = GenomeIntervalDataset(
    bed_file = './sequences.bed',                       # bed file - columns 0, 1, 2 must be <chromosome>, <start position>, <end position>
    fasta_file = './hg38.ml.fa',                        # path to fasta file
    filter_df_fn = filter_train,                        # filter dataframe function
    return_seq_indices = True,                          # return nucleotide indices (ACGTN) or one hot encodings
    shift_augs = (-2, 2),                               # random shift augmentations from -2 to +2 basepairs
    rc_aug = True,                                      # use reverse complement augmentation with 50% probability
    context_length = 196_608,
    return_augs = True                                  # return the augmentation meta data
)

seq, rand_shift_val, rc_bool = ds[0] # (196608,), (1,), (1,)
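If it helps, below is a minimal sketch (mine) of wiring the dataset into a standard torch DataLoader; the batch size is arbitrary, and filter_df_fn is omitted on the assumption that it is optional (otherwise pass one as in the examples above):

import torch
from torch.utils.data import DataLoader
from enformer_pytorch import Enformer, GenomeIntervalDataset

ds = GenomeIntervalDataset(
    bed_file = './sequences.bed',
    fasta_file = './hg38.ml.fa',
    return_seq_indices = True,
    context_length = 196_608
)

model = Enformer.from_hparams(
    dim = 1536,
    depth = 11,
    heads = 8,
    output_heads = dict(human = 5313, mouse = 1643),
    target_length = 896,
).cuda()

loader = DataLoader(ds, batch_size = 2, shuffle = True)

for seq in loader:                               # (2, 196608) sequence indices
    pred = model(seq.cuda(), head = 'human')     # (2, 896, 5313)
    # compute a loss against your own targets here and call .backward()
    break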

Appreciation

Special thanks goes out to EleutherAI for providing the resources to retrain the model, during a time when the official model from DeepMind had not been released yet.

Thanks also goes out to @johahi for finding that there are numerical differences between the torch and tensorflow implementations of xlogy. He provided a fix for this difference, which is adopted in this repository as of v0.8.0

Todo

  • script to load weights from trained tensorflow enformer model to pytorch model
  • add loss wrapper with poisson loss
  • move the metrics code over to pytorch as well
  • train enformer model
  • build context manager for fine-tuning with unfrozen enformer but with frozen batchnorm
  • allow for plain fine-tune with fixed static context
  • allow for fine tuning with only unfrozen layernorms (technique from fine tuning transformers)
  • fix handling of 'N' in sequence, figure out representation of N in basenji barnyard
  • take care of shift augmentation in GenomicIntervalDataset
  • speed up str_to_seq_indices
  • add to EleutherAI huggingface (done thanks to Niels)
  • offer some basic training utils, as gradient accumulation will be needed for fine tuning

Citations

@article {Avsec2021.04.07.438649,
    author  = {Avsec, {\v Z}iga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R. and Grabska-Barwinska, Agnieszka and Taylor, Kyle R. and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R.},
    title   = {Effective gene expression prediction from sequence by integrating long-range interactions},
    elocation-id = {2021.04.07.438649},
    year    = {2021},
    doi     = {10.1101/2021.04.07.438649},
    publisher = {Cold Spring Harbor Laboratory},
    URL     = {https://www.biorxiv.org/content/early/2021/04/08/2021.04.07.438649},
    eprint  = {https://www.biorxiv.org/content/early/2021/04/08/2021.04.07.438649.full.pdf},
    journal = {bioRxiv}
}
@misc{liu2022convnet,
    title   = {A ConvNet for the 2020s},
    author  = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
    year    = {2022},
    eprint  = {2201.03545},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

enformer-pytorch's People

Contributors

jstjohn, julien-c, lucidrains, vance-raiti, yang-dongxu


enformer-pytorch's Issues

rounding errors fixed?

Thanks for making this library!

In the README there is a comment:

Update: John St. John did some work and found that the enformer-official-rough model hits the reported marks in the paper - human pearson R of 0.625 for validation, and 0.65 for test.

Does this mean that the earlier concern below has been fixed?

There are still some rounding errors that seem to be accruing across the layers, resulting in an absolute error as high as 0.5. However, correlation coefficient look good so I am releasing the 'rough'ly working version. Will keep working on figuring out where the numerical errors are happening (it may be the attention pooling module, as I noticed the attention logits are pretty high).

Thank you! It's not 100% clear if we can go ahead and use the library without worrying about the rounding errors.

error loading enformer package

I am trying to install the enformer package but seem to be getting the following error:

>>> import torch
>>> from enformer_pytorch import Enformer
Traceback (most recent call last):
  File "/sc/arion/projects/ad-omics/clakhani/conda/envs/enformer_lightning/lib/python3.9/site-packages/transformers/utils/import_utils.py", line 905, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/sc/arion/projects/ad-omics/clakhani/conda/envs/enformer_lightning/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/sc/arion/projects/ad-omics/clakhani/conda/envs/enformer_lightning/lib/python3.9/site-packages/transformers/modeling_utils.py", line 76, in <module>
    from accelerate import dispatch_model, infer_auto_device_map, init_empty_weights
ImportError: cannot import name 'dispatch_model' from 'accelerate' (/sc/arion/projects/ad-omics/clakhani/conda/envs/enformer_lightning/lib/python3.9/site-packages/accelerate/__init__.py)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/sc/arion/projects/ad-omics/clakhani/conda/envs/enformer_lightning/lib/python3.9/site-packages/enformer_pytorch/__init__.py", line 2, in <module>
    from enformer_pytorch.modeling_enformer import Enformer, SEQUENCE_LENGTH, AttentionPool
  File "/sc/arion/projects/ad-omics/clakhani/conda/envs/enformer_lightning/lib/python3.9/site-packages/enformer_pytorch/modeling_enformer.py", line 14, in <module>
    from transformers import PreTrainedModel
  File "<frozen importlib._bootstrap>", line 1055, in _handle_fromlist
  File "/sc/arion/projects/ad-omics/clakhani/conda/envs/enformer_lightning/lib/python3.9/site-packages/transformers/utils/import_utils.py", line 895, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/sc/arion/projects/ad-omics/clakhani/conda/envs/enformer_lightning/lib/python3.9/site-packages/transformers/utils/import_utils.py", line 907, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'dispatch_model' from 'accelerate' (/sc/arion/projects/ad-omics/clakhani/conda/envs/enformer_lightning/lib/python3.9/site-packages/accelerate/__init__.py)

I simply cloned an existing PyTorch environment in Conda (using CUDA 11.1 and torch 1.10) and then pip installed the Hugging Face packages and the enformer package:

pip install transformers
pip install datasets
pip install accelerate
pip install tokenizers
pip install enformer-pytorch

Any idea why I'm getting this error?

metric for enformer

Hello, can I ask how you found that the human Pearson R is 0.625 for validation and 0.65 for test? I couldn't find this information in the paper. Is it recorded anywhere else?

enformer TF pretrained weights

Hello!

Thanks for this wonderful resource. I was wondering whether you could point me to how to obtain the model weights for the original TF version of Enformer, or to the actual weights if they are stored somewhere easily accessible. I see the model on TF Hub but am not sure exactly how to extract the weights - I seem to be running into some issues, potentially because the original code is Sonnet-based and the model is always loaded as a custom user object.

Much appreciated!

Initializing AttentionPool weights with 2 * Identity matrix, again!

Thank you for the quick fix for my recent issue (#21)! nn.init.dirac_ is a great solution.

However, there's one more thing: we have to initialize the weight with an identity matrix multiplied by 2!

Of note, the official DeepMind Sonnet implementation of Enformer uses snt.initializers.Identity with gain=2 (please refer to the code lines here and here in the DeepMind implementation).

Thanks,
Dohoon
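For reference, here is a minimal sketch (mine, not the repository's code) of the 2 * identity initialization described above, assuming the attention logits come from a 1x1 nn.Conv2d named to_attn_logits as in this repository's AttentionPool:

import torch
from torch import nn

dim = 1536

to_attn_logits = nn.Conv2d(dim, dim, 1, bias = False)
nn.init.dirac_(to_attn_logits.weight)    # identity mapping across channels
with torch.no_grad():
    to_attn_logits.weight.mul_(2.)       # gain of 2, as in the paper / Sonnet initializer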

example data files

Hi, in the README you mention using sequences.bed and hg38.ml.fa files to build the GenomeIntervalDataset, but I can't find these example data files. Could you provide links to them? Thanks!

Distribution of embedding values

I'm using the embeddings generated by Enformer for another project, and while I'm getting meaningful results from my own linear output heads, I was struck by the distribution of the embedding values themselves.

If I run the following

import matplotlib.pyplot as plt
import numpy as np
import torch
from enformer_pytorch import Enformer

enformer = Enformer.from_pretrained('EleutherAI/enformer-official-rough')
seq = torch.randint(0, 5, (2, 196_608)) # for ACGTN, in that order (-1 for padding)
output, embeddings = enformer(seq, return_embeddings=True)

plt.hist(embeddings.detach().numpy().ravel(), bins=np.linspace(-0.5, 0.5, 200), density=True);
plt.xlim([-0.5, 0.5]);

I see the following plot (a histogram of the embedding values, described below).

I can confirm I get a very similar distribution of values using real sequences instead of random noise.

I'm struck by the sharp cutoff at around -0.16, the U-shaped distribution for negative values, the strong peak at zero, and the tail of positive values. Naively I would have expected something more normally distributed around zero, but I guess I don't have a lot of experience looking at the outputs of transformers. I'm curious whether this distribution is to be expected from the model, or at least of no concern to experts. I'm also curious whether any DeepMind-provided embedding values have the same distribution. I haven't downloaded any, but could try looking into that if it were of broader interest.

As noted above, I do get alright predictions from my own linear output heads, but I'm struggling to match the distribution of my outputs to the distribution of target values, so I was curious whether this could be one contributing reason, though there could be many other factors at play that are specific to my application.

Gradient checkpoint

Hi, I noticed you have a version of the trunk with gradient checkpointing for the transformer layers. Is it safe to use it like this, given that there are many non-zero dropout layers in both the attention and feed-forward layers? Since the checkpointing API runs the forward pass twice on the checkpointed segment of the model and dropout drops values arbitrarily, does that mean the output values won't match?
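For context, a minimal sketch (not from this repository) of torch.utils.checkpoint, whose default preserve_rng_state = True stashes and restores the RNG state so that dropout masks are replayed identically in the recomputed forward pass:

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(64, 64), nn.Dropout(0.4), nn.ReLU())

x = torch.randn(8, 64, requires_grad = True)
out = checkpoint(block, x)   # preserve_rng_state defaults to True
out.sum().backward()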

Models in training splits

Hey,

Is there a way of getting the models trained on each training split, as mentioned in the "Model training and evaluation" paragraph of the Enformer paper?

Thanks!

Reverse Complement Fails when return_seq_indices=True

Hi, thanks for this adaptation.

Reverse complementing when return_seq_indices is set to True fails, because the seq that is passed to seq_indices_reverse_complement is a string, while the function treats it as a tensor of indices. A simple fix would be to call str_to_seq_indices first.

        if self.return_seq_indices:
            if self.rc_aug and coin_flip():
                seq = seq_indices_reverse_complement(seq)

            return str_to_seq_indices(seq)

Correlation Coefficient Calculation

I ran the test_pretrained.py script to calculate the correlation coefficient on a validation sample and got 0.5963 as expected. However, when I inspected the target and predictions, the shapes were each (896, 5313), i.e. missing the batch dimension. The pearson_corr_coef function computes similarity over dim=1, so the calculated number 0.5963 is actually a measure of correlation over the different cell lines, rather than over the track positions per cell line. When you unsqueeze the batch dimension, the correlation is calculated over track positions and yields a value of 0.4721. This is the way Enformer reports correlation, so does it make sense to update the README and test_pretrained.py with this procedure? Also, were the reported correlation coefficients of 0.625 and 0.65 on the validation/test sets calculated on samples with a missing batch dimension? If so, a recalculation would be necessary. Am I missing something?

Here is the modified test_pretrained.py script I have used:

import torch
from enformer_pytorch import Enformer

enformer = Enformer.from_pretrained('EleutherAI/enformer-official-rough').cuda()
enformer.eval()

data = torch.load('./data/test-sample.pt')
seq, target = data['sequence'].cuda(), data['target'].cuda()
print(seq.shape) # torch.Size([131072, 4])
print(target.shape) # torch.Size([896, 5313])
seq = seq.unsqueeze(0)
target = target.unsqueeze(0)

# Note: you will find prediction shape is also `torch.Size([896, 5313])`.

with torch.no_grad():
    corr_coef = enformer(
        seq,
        target = target,
        return_corr_coef = True,
        head = 'human'
    )

print(corr_coef) # tensor([0.4721], device='cuda:0')
assert corr_coef > 0.1

Initializing AttentionPool weights with 2 * Identity matrix.

Hi, thanks for this great implementation. I'm learning a lot from it :)

I noticed that commit 0dc46e4 makes the internal AttentionPool weight initialized with randomly-sampled values rather than the 2 * identity matrix specified in the Enformer paper. (If there's something I am missing, please let me know!)

Indeed, this will not affect the performance of Enformer loaded with pretrained parameters, but I think it may lead to slightly worse (according to the paper) performance when trained from scratch.

Perhaps some simple manual weight initialization like

self.to_attn_logits = nn.Conv2d(dim, dim, 1, bias = False)
self.to_attn_logits.weight.data.zero_()                        # zero out the 1x1 conv weight
self.to_attn_logits.weight.data.squeeze().fill_diagonal_(2)    # then set the channel diagonal to 2

will do.

If you think it'll be okay, please let me know then I'll open a PR right away.

Thanks again for this great repo!

Best,
Dohoon

How to load the pre-trained Enformer model?

Hi, I encountered a problem when trying to load the pre-trained enformer model.

from enformer_pytorch import Enformer
model = Enformer.from_pretrained("EleutherAI/enformer-preview")

AttributeError Traceback (most recent call last)
Input In [3], in
1 from enformer_pytorch import Enformer
----> 2 model = Enformer.from_pretrained("EleutherAI/enformer-preview")

AttributeError: type object 'Enformer' has no attribute 'from_pretrained'

Is there a recommended approach to extract embeddings for the starting and ending regions?

The input sequence length of 196,608 bp is given as the input to the network, out of which embeddings are extracted for the middle 114,688 bp region, mapped to 896 bins. That means that extracting embeddings from region 0 to 196,608 would miss the embeddings for the first 40,960 bp. So how should we extract the embeddings for region 0 of the input sequence? Is there a recommended approach?

Thanks

Hard-coded input sequence length to the transformer blocks when using use_tf_gamma = True

Hi! Thanks for your amazing code. I am trying to use the pre-trained model, but I found that when I set use_tf_gamma = True, I can only use the precomputed gamma positions for an input sequence of length 1536. Will you fix that later?
Also, the sanity check will fail. After running this:

python test_pretrained.py 
Traceback (most recent call last):
  File "/home/ubuntu/enformer-pytorch/test_pretrained.py", line 11, in <module>
    corr_coef = enformer(
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/enformer-pytorch/enformer_pytorch/modeling_enformer.py", line 450, in forward
    x = trunk_fn(x)
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/enformer-pytorch/enformer_pytorch/modeling_enformer.py", line 144, in forward
    return self.fn(x, **kwargs) + x
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/envs/seq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/enformer-pytorch/enformer_pytorch/modeling_enformer.py", line 269, in forward
    positions = get_positional_embed(n, self.num_rel_pos_features, device, use_tf_gamma = self.use_tf_gamma)
  File "/home/ubuntu/enformer-pytorch/enformer_pytorch/modeling_enformer.py", line 123, in get_positional_embed
    embeddings = torch.cat(embeddings, dim = -1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 2047 but got size 3071 for tensor number 2 in the list.

The program raises the error above because the input sequence length for the transformer block is 1024 for the test sample.

minor question on extracting sequence embeddings

Thanks for adapting the Enformer model to PyTorch.
Recently, I've needed to extract sequence embeddings from Enformer. Do I have to use one-hot encoding as input, or can I use sequences directly instead?
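As shown in the Usage section above, integer sequence indices work directly; a minimal sketch (mine) combining the pretrained model with return_embeddings:

import torch
from enformer_pytorch import from_pretrained

enformer = from_pretrained('EleutherAI/enformer-official-rough')

seq = torch.randint(0, 5, (1, 196_608))                  # integer indices, no one-hot needed
_, embeddings = enformer(seq, return_embeddings = True)

embeddings # (1, 896, 3072)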

Would using FlashAttention(2) help?

Hello!

Thank you for all your contributions.

I noticed that you have a built-in implementation of attention. I've been running into some memory restrictions when using the full model. Do you think it would run faster / with less memory if the built-in attention were swapped for FlashAttention?

Thanks!

Problems during fine-tuning on my own ATAC-seq data

Hi, I'm now trying to fine-tune the pretrained model on my own ATAC-seq data using HeadAdapterWrapper. Due to device limitations and long training times, I set finetune_last_n_layers_only to 2 and use Adam as the optimizer. I use CosineAnnealingLR as the learning rate scheduler and set my learning rate from 1e-3 to 1e-5. However, the results don't seem good, so may I ask what learning rate, optimizer and other hyperparameters you used? Thanks very much.

Computing Contribution Scores

From the paper:

To better understand what sequence elements Enformer is utilizing when making predictions, we computed two different gene expression contribution scores — input gradients (gradient × input) and attention weights

I was just wondering how to compute input gradients and fetch the attention matrix for a given input. I'm not well versed in PyTorch, so I'm sorry if this is a noob question.
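For what it's worth, here is a minimal sketch (mine) of gradient x input with this model, using a one-hot input so gradients can flow back to the sequence; the scalar being differentiated (one human track summed over positions) is only an example, and attention weights would need forward hooks, which are not shown:

import torch
from enformer_pytorch import from_pretrained, seq_indices_to_one_hot

enformer = from_pretrained('EleutherAI/enformer-official-rough')
enformer.eval()

seq = torch.randint(0, 5, (1, 196_608))
one_hot = seq_indices_to_one_hot(seq).requires_grad_(True)    # gradients will flow to the one-hot input

out = enformer(one_hot)['human'][..., 0].sum()                # example scalar: one track, summed over positions
out.backward()

contribution = (one_hot.grad * one_hot).sum(dim = -1)         # gradient x input, shape (1, 196608)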

Fail to read .bed file

Hello,
I would like to read in a .bed file using the GenomeIntervalDataset, but I keep getting the following error: "ComputeError: invalid utf-8 sequence in csv." Do I need to decode the binary .bed file before loading it with polars.read_csv()?

Many thanks in advance !

Using EleutherAI/enformer-official-rough PyTorch implementation to just get human output head

Hi @lucidrains,

Thank you so much for your efforts in releasing the PyTorch version of the Enformer model! I am really excited to use it for my particular implementation.

I was wondering if it is possible to use the pre-trained Hugging Face model to get just the human output head. The reason is that inference takes a few minutes, and since I only need human data, this would help make my implementation a bit smoother. Is there a way to do this elegantly with the current codebase, or would I need to rewrite some functions to allow for this? From what I have seen so far, it doesn't seem that this modularity is possible yet.

The way I have set up my inference currently is as follows:

class EnformerInference:
    def __init__(self, data_path: str, model_path="EleutherAI/enformer-official-rough"):
        if torch.cuda.is_available():
            device = torch.device("cuda")
        else:
            device = torch.device("cpu")
        self.device = device
        self.model = Enformer.from_pretrained(model_path)
        self.data = EnformerDataLoader(pd.read_csv(data_path, sep="\t")) # returns a one hot encoded torch.Tensor representation of the sequence of interest
                                                                                                                          

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x.to(self.device))

Any guidance on this would be greatly appreciated, thank you!

Fine-tuning without freezing transformer parameters leads to poor performance

Dear team,

As far as I can understand, the current fine-tuning feature freezes the transformer and only fine-tunes the linear heads. I wonder, has anyone tried fine-tuning the whole model without freezing the transformer parameters? I tried, but the performance is not good; for example, when I simply fine-tune the model on the original Basenji training set, the correlation score decreases from 0.6 to 0.1. Has anyone encountered a similar issue, and what is the potential reason for this?
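One commonly suggested mitigation (also listed as a planned context manager in the Todo above) is to keep the BatchNorm layers in eval mode while the rest of the network is unfrozen; a minimal sketch (mine, not a feature of this repository):

import torch
from torch import nn
from enformer_pytorch import from_pretrained

enformer = from_pretrained('EleutherAI/enformer-official-rough')
enformer.train()

# keep batchnorm running statistics frozen while fine-tuning everything else
# (the affine weights and biases still receive gradients)
for module in enformer.modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        module.eval()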

Minor potential typo in `FastaInterval` class

Hello, first off thanks so much for this incredible repository, it's greatly accelerating a project I am working on!

I've been using the GenomeIntervalDataset class and noticed a minor potential typo in the FastaInterval class when I was trying to fetch a sequence with a negative start position and got an empty tensor back. It looks like there is logic for clipping the start position at 0 and padding the sequence here:

if start < 0:
    left_padding = -start
    start = 0

if end > chromosome_length:
    right_padding = end - chromosome_length
    end = chromosome_length
but it wasn't being used in my case, as it sits inside an enclosing if clause that I wasn't triggering: https://github.com/lucidrains/enformer-pytorch/blob/ab29196d535802c8a04929534c5860fb55d06056/enformer_pytorch/data.py#LL128C9-L128C82. If I unindent that logic, everything works fine for me.

If it was unintentional to have the clipping inside that if clause I'd be happy to submit a trivial PR to fix the indentation.

Thanks again for all your work

AttentionPool bug?

Looking at the attention pool class did you mean to have

self.pool_fn = Rearrange('b d (n p) -> b d n p', p = self.pool_size)

instead of

self.pool_fn = Rearrange('b d (n p) -> b d n p', p = 2)

Here's the full class

class AttentionPool(nn.Module):
    def __init__(self, dim, pool_size = 2):
        super().__init__()
        self.pool_size = pool_size
        self.pool_fn = Rearrange('b d (n p) -> b d n p', p = 2)
        self.to_attn_logits = nn.Conv2d(dim, dim, 1, bias = False)

    def forward(self, x):
        b, _, n = x.shape
        remainder = n % self.pool_size
        needs_padding = remainder > 0

        if needs_padding:
            x = F.pad(x, (0, remainder), value = 0)
            mask = torch.zeros((b, 1, n), dtype = torch.bool, device = x.device)
            mask = F.pad(mask, (0, remainder), value = True)

        x = self.pool_fn(x)
        logits = self.to_attn_logits(x)

        if needs_padding:
            mask_value = -torch.finfo(logits.dtype).max
            logits = logits.masked_fill(self.pool_fn(mask), mask_value)

        attn = logits.softmax(dim = -1)

        return (x * attn).sum(dim = -1)

Rounding errors in the positional encodings

Hi, many thanks for your efforts.

I investigated the rounding errors a bit, and to me it seems like they mostly accumulate in the gamma positional encodings. With the same inputs (rate, concentration and x are torch.allclose), I get different results after the xlogy for tensorflow and pytorch, and because these are rather large deviations/values, this then seems to propagate. I'm unsure where TF and pytorch differ for xlogy here...

So as a "dirty" fix, I precomputed the gammas in TF, saved them, and loaded them in your pytorch implementation, and I manage to get the same results (up to some numerical issues) with enformer-pytorch as when I compute them for some test sequences with the TF reference implementation.

For instance, for this test_sequence.npy.zip: the maximum of the output of the current pytorch implementation is 474.7, with the precomputed gammas 555.6, and the reference implementation has 555.5.

Feel free to have a look at my fork for all of this, where I integrated the precomputed gammas in your implementation.

How to retrieve the sequence of genes?

Hi, since Enformer can be used to predict gene expression values, I am looking for the index of the model output that describes the relationship between genes and genome regions. Could you please tell me how to figure this out? Thanks.

Training process

Hello. Can you share the details of training the neural model? Did you train it yourself? Did you collect the training data from the Basenji dataset files? I am unable to reproduce the claimed results during training.

AssertionError: if using tf gamma, only sequence length of 1536 allowed for now

@lucidrains I think this is not getting set properly. As per the docs, from_pretrained should set use_tf_gamma. Then the pre-trained model should be able to accept a sequence length of 196_608, or am I missing something?

import torch
import polars as pl
from torch import nn
from enformer_pytorch import from_pretrained,GenomeIntervalDataset
from enformer_pytorch.finetune import HeadAdapterWrapper

enformer = from_pretrained('EleutherAI/enformer-official-rough')

enformer(seq) # seq of length 196_608

I get the AssertionError above with this.

Error in fine-tuning

Hi,

Thanks for the nice repo.
I was trying to fine-tune the pre-trained model and used the code below.

import torch
from enformer_pytorch import Enformer
from enformer_pytorch.finetune import HeadAdapterWrapper

enformer = Enformer.from_pretrained('EleutherAI/enformer-official-rough')

model = HeadAdapterWrapper(
    enformer = enformer,
    num_tracks = 128,
    post_transformer_embed = False   # by default, embeddings are taken from after the final pointwise block w/ conv -> gelu - but if you'd like the embeddings right after the transformer block with a learned layernorm, set this to True
).cuda()

seq = torch.randint(0, 5, (1, 196_608 // 2,)).cuda()
target = torch.randn(1, 200, 128).cuda()  # 128 tracks

loss = model(seq, target = target)
loss.backward()

But it gives an error like the one below.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-7-369343718439>](https://localhost:8080/#) in <module>
----> 1 loss = model(seq, target = target)
      2 loss.backward()

8 frames
[/usr/local/lib/python3.8/dist-packages/enformer_pytorch/modeling_enformer.py](https://localhost:8080/#) in forward(self, x)
    172 
    173         if seq_len < target_len:
--> 174             raise ValueError(f'sequence length {seq_len} is less than target length {target_len}')
    175 
    176         trim = (target_len - seq_len) // 2

ValueError: sequence length 768 is less than target length 896

The error seems to come from the different sequence length, so I used 229_376 to make 896, but it's also not working. Do you know of any solution for this?

Thanks,

How to retrieve nucleotide (ACTG) information from the dataset

Hi, I am interested in how to modify this code to get the DNA slice information of each gene/peak.

import torch
import polars as pl
from enformer_pytorch import Enformer, GenomeIntervalDataset

filter_train = lambda df: df.filter(pl.col('column_4') == 'train')

ds = GenomeIntervalDataset(
    bed_file = './sequences.bed',                       # bed file - columns 0, 1, 2 must be <chromosome>, <start position>, <end position>
    fasta_file = './hg38.ml.fa',                        # path to fasta file
    filter_df_fn = filter_train,                        # filter dataframe function
    return_seq_indices = True,                          # return nucleotide indices (ACGTN) or one hot encodings
    shift_augs = (-2, 2),                               # random shift augmentations from -2 to +2 basepairs
    rc_aug = True,                                      # use reverse complement augmentation with 50% probability
    context_length = 196_608,
    return_augs = True                                  # return the augmentation meta data
)

seq, rand_shift_val, rc_bool = ds[0] # (196608,), (1,), (1,)

I intend to only retrieve the DNA slice information from the bed file. For example, for gene a, the DNA sequence is ACTG...

Inverse mapping takes me too much time, so I wonder how to directly get the DNA sequence information. Thanks.

Minor typo in data.py

Hi!

I have just started working with this repository, thank you for making this! I am trying to use the GenomeIntervalDataset to supply a sequence in FASTA format, and I think I found a spelling mistake that creates an error. In the file enformer_pytorch/data.py, line 189 reads df = pl.read_csv(str(bed_path), sep = '\t', has_headers = False). However, according to the polars documentation, it should be has_header.

This is the error I got that led me to this: TypeError: read_csv() got an unexpected keyword argument 'has_headers'.

Please let me know if you can change this!

CUDA out of memory in unexpected cases

Hi! Thank you so much for providing this implementation and documentation. I have been looking for a PyTorch implementation of Enformer for a while.

When I run the provided test_pretrained.py script using 8GB of GPU memory, the model behaves as expected. The input sequence dimension is [131072, 4], which means the input DNA sequence is 131072 bp long.

However, when I started writing a script to take the embeddings of some genomic sequences, which according to the Enformer paper and the default parameters are of length 196608 bp, I started getting CUDA out of memory errors. This happens even with a batch size of 1. It also happens when I increase my memory request to 32GB (instead of 8GB as before). If I run the same code to get embeddings for input sequences (batch size of 1 up to a maximum of 8), it can run on a CPU with 32GB of memory.

What confuses me is that test_pretrained.py seems to run fine with just 8GB on CUDA, but when I just try to get embeddings for 1 input sequence (which should theoretically be less computation than what is done in test_pretrained.py), I get this error. I have been looking at various platforms for a fix but have not been successful. Do you have any suggestions about why this is the case?

Thank you so much. Below is a snapshot of my code to get embeddings:

import torch
from enformer_pytorch import Enformer, seq_indices_to_one_hot, GenomeIntervalDataset
from torch.utils.data import DataLoader


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Enformer.from_pretrained('enformer-official-rough')
model = model.to(device)


ds = GenomeIntervalDataset(
    bed_file = '/data/gnomad/v2.1.1_hg19/variants/MAF/chr1_maf.bed.gz',                       
    fasta_file = '/data/ucsc/sequence/hg19/chr1.fa.gz',                        
    return_seq_indices = True,                          
    shift_augs = None,                            
    context_length = 196_608,
    chr_bed_to_fasta_map = {
        'chr1': 'chr1'
    }
)
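One thing worth checking in a script like the one above: wrap inference in torch.no_grad() (as test_pretrained.py does) so that activations are not kept around for a backward pass; a minimal sketch (mine) continuing from the dataset and model defined above:

from torch.utils.data import DataLoader

loader = DataLoader(ds, batch_size = 1)

model.eval()
with torch.no_grad():                                                      # avoid storing activations for backward
    for seq in loader:
        _, embeddings = model(seq.to(device), return_embeddings = True)   # embeddings: (1, 896, 3072)
        break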

The pytorch-enformer has much lower performance than the tensorflow-enformer

Dear developers, thank you very much for developing the pytorch version of enformer!
However, when I use the example data provided by Basenji and my own ATAC-seq dataset, the predictive power of pytorch-enformer (measured by the Pearson correlation coefficient) drops by around 0.2 compared to tensorflow-enformer (0.3 vs 0.5 on the validation set). I tried to initialize my model weights following the weight initialization scheme of tensorflow-enformer, but things didn't get radically better. Can anyone tell me why?

Best,
Eli
