esm's Introduction

Evolutionary Scale Modeling

Update April 2023: Code for the two simultaneous preprints on protein design is now released! Code for "Language models generalize beyond natural proteins" is under examples/lm-design/. Code for "A high-level programming language for generative protein design" is under examples/protein-programming-language/.

This repository contains code and pre-trained weights for Transformer protein language models from the Meta Fundamental AI Research Protein Team (FAIR), including our state-of-the-art ESM-2 and ESMFold, as well as MSA Transformer, ESM-1v for predicting variant effects and ESM-IF1 for inverse folding. Transformer protein language models were introduced in the 2019 preprint of the paper "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences". ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks. ESMFold harnesses the ESM-2 language model to generate accurate structure predictions end to end directly from the sequence of a protein.

In November 2022, we released v0 of the ESM Metagenomic Atlas, an open atlas of 617 million predicted metagenomic protein structures. The Atlas was updated in March 2023 in collaboration with EBI. The new v2023_02 adds another 150 million predicted structures to the Atlas, as well as pre-computed ESM2 embeddings. Bulk download, the blog post, and the resources provided on the Atlas website are documented in this README.

In December 2022, we released two simultaneous preprints on protein design.

  • "Language models generalize beyond natural proteins" (PAPER, CODE) uses ESM2 to design de novo proteins. The code and data associated with the preprint can be found here.
  • "A high-level programming language for generative protein design" (PAPER, CODE) uses ESMFold to design proteins according to a high-level programming language.

Citation

For ESM2, ESMFold and ESM Atlas:

@article{lin2023evolutionary,
  title = {Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author = {Zeming Lin and Halil Akin and Roshan Rao and Brian Hie and Zhongkai Zhu and Wenting Lu and Nikita Smetanin and Robert Verkuil and Ori Kabeli and Yaniv Shmueli and Allan dos Santos Costa and Maryam Fazel-Zarandi and Tom Sercu and Salvatore Candido and Alexander Rives},
  journal = {Science},
  volume = {379},
  number = {6637},
  pages = {1123-1130},
  year = {2023},
  doi = {10.1126/science.ade2574},
  URL = {https://www.science.org/doi/abs/10.1126/science.ade2574},
  note = {Earlier versions as preprint: bioRxiv 2022.07.20.500902},
}

For transformer protein language models:

@article{rives2021biological,
  title={Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences},
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C Lawrence and Ma, Jerry and others},
  journal={Proceedings of the National Academy of Sciences},
  volume={118},
  number={15},
  pages={e2016239118},
  year={2021},
  publisher={National Acad Sciences},
  note={bioRxiv 10.1101/622803},
  doi={10.1073/pnas.2016239118},
  url={https://www.pnas.org/doi/full/10.1073/pnas.2016239118},
}

Main models you should use

| Shorthand | esm.pretrained. | Dataset | Description |
|---|---|---|---|
| ESM-2 | esm2_t36_3B_UR50D() esm2_t48_15B_UR50D() | UR50 (sample UR90) | SOTA general-purpose protein language model. Can be used to predict structure, function and other protein properties directly from individual sequences. Released with Lin et al. 2022 (Aug 2022 update). |
| ESMFold | esmfold_v1() | PDB + UR50 | End-to-end single sequence 3D structure predictor (Nov 2022 update). |
| ESM-MSA-1b | esm_msa1b_t12_100M_UR50S() | UR50 + MSA | MSA Transformer language model. Can be used to extract embeddings from an MSA. Enables SOTA inference of structure. Released with Rao et al. 2021 (ICML'21 version, June 2021). |
| ESM-1v | esm1v_t33_650M_UR90S_1() ... esm1v_t33_650M_UR90S_5() | UR90 | Language model specialized for prediction of variant effects. Enables SOTA zero-shot prediction of the functional effects of sequence variations. Same architecture as ESM-1b, but trained on UniRef90. Released with Meier et al. 2021. |
| ESM-IF1 | esm_if1_gvp4_t16_142M_UR50() | CATH + UR50 | Inverse folding model. Can be used to design sequences for given structures, or to predict functional effects of sequence variation for given structures. Enables SOTA fixed backbone sequence design. Released with Hsu et al. 2022. |

For a complete list of available models, with details and release notes, see Pre-trained Models.

Usage

Quick start

An easy way to get started is to load ESM or ESMFold through the HuggingFace transformers library, which has simplified the ESMFold dependencies and provides a standardized API and tools to work with state-of-the-art pretrained models.

Alternatively, ColabFold has integrated ESMFold so that you can easily run it directly in the browser on a Google Colab instance.

We also provide an API which you can access through curl or on the ESM Metagenomic Atlas web page.

curl -X POST --data "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL" https://api.esmatlas.com/foldSequence/v1/pdb/
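
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names `build_fold_request` and `fold_sequence` are illustrative and not part of the esm package, and the 120-second timeout is an arbitrary choice; the endpoint may enforce limits that are not described here.

```python
import urllib.request

# Endpoint taken from the curl example above.
ESM_FOLD_URL = "https://api.esmatlas.com/foldSequence/v1/pdb/"


def build_fold_request(sequence: str) -> urllib.request.Request:
    """Build a POST request whose body is the raw amino-acid sequence."""
    return urllib.request.Request(
        ESM_FOLD_URL, data=sequence.encode("ascii"), method="POST"
    )


def fold_sequence(sequence: str, timeout: float = 120.0) -> str:
    """Send the request and return the response body as text (PDB format)."""
    request = build_fold_request(sequence)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (performs a network call, so it is left commented out):
# pdb_text = fold_sequence("MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")
# with open("result.pdb", "w") as f:
#     f.write(pdb_text)
```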

For ESM-MSA-1b, ESM-IF1, or any of the other models you can use the original implementation from our repo directly via the instructions below.

Getting started with this repo

As a prerequisite, you must have PyTorch installed to use this repository.

You can use this one-liner for installation, using the latest release of esm:

pip install fair-esm  # latest release, OR:
pip install git+https://github.com/facebookresearch/esm.git  # bleeding edge, current repo main branch

To use the ESMFold model, make sure you start from an environment with Python <= 3.9 and PyTorch installed. Then add the [esmfold] option to your pip install, which will install the dependencies for OpenFold automatically. OpenFold installation requires nvcc.

pip install "fair-esm[esmfold]"
# OpenFold and its remaining dependency
pip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'
pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307'

NOTE: If the OpenFold installation fails, please double-check that nvcc is available and that a CUDA-compatible version of PyTorch has been installed.

Alternatively, we provide the esmfold conda environment, which can be built via conda env create -f environment.yml.

We also support PyTorch Hub, which removes the need to clone and/or install this repository yourself:

import torch
model, alphabet = torch.hub.load("facebookresearch/esm:main", "esm2_t33_650M_UR50D")

After pip install, you can load and use a pretrained model as follows:

import torch
import esm

# Load ESM-2 model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein2 with mask","KALTARQQEVFDLIRD<mask>ISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein3",  "K A <mask> I S Q"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
batch_lens = (batch_tokens != alphabet.padding_idx).sum(1)

# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
token_representations = results["representations"][33]

# Generate per-sequence representations via averaging
# NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
sequence_representations = []
for i, tokens_len in enumerate(batch_lens):
    sequence_representations.append(token_representations[i, 1 : tokens_len - 1].mean(0))

# Look at the unsupervised self-attention map contact predictions
import matplotlib.pyplot as plt
for (_, seq), tokens_len, attention_contacts in zip(data, batch_lens, results["contacts"]):
    plt.matshow(attention_contacts[: tokens_len, : tokens_len])
    plt.title(seq)
    plt.show()
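
The slice `1 : tokens_len - 1` in the averaging loop above skips the beginning-of-sequence token and the end-of-sequence token that surround the residues, and `batch_lens` keeps padding out of the count. The pure-Python sketch below illustrates that indexing on a toy padded example. The token strings, the vectors, and the `mean_over_residues` helper are made up for illustration and stand in for the tensors used above.

```python
PAD = "<pad>"


def mean_over_residues(token_vectors, tokens):
    """Average per-token vectors over residue positions only.

    tokens[0] is assumed to be the BOS token and the last non-padding
    token the EOS token, so residues occupy positions 1 .. n_tokens - 2.
    """
    n_tokens = sum(1 for t in tokens if t != PAD)  # same role as batch_lens
    residue_vectors = token_vectors[1 : n_tokens - 1]
    dim = len(residue_vectors[0])
    return [
        sum(v[d] for v in residue_vectors) / len(residue_vectors)
        for d in range(dim)
    ]


tokens = ["<cls>", "K", "A", "L", "<eos>", PAD, PAD]
vectors = [
    [9.0, 9.0],  # BOS, excluded
    [1.0, 2.0],  # K
    [3.0, 4.0],  # A
    [5.0, 6.0],  # L
    [9.0, 9.0],  # EOS, excluded
    [0.0, 0.0],  # padding, excluded
    [0.0, 0.0],  # padding, excluded
]
print(mean_over_residues(vectors, tokens))  # [3.0, 4.0]
```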

ESMFold Structure Prediction

After installing with the [esmfold] option, you can use the ESMFold structure prediction model as follows:

import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Optionally, uncomment to set a chunk size for axial attention. This can help reduce memory.
# Lower sizes will have lower memory requirements at the cost of reduced speed.
# model.set_chunk_size(128)

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
# Multimer prediction can be done with chains separated by ':'

with torch.no_grad():
    output = model.infer_pdb(sequence)

with open("result.pdb", "w") as f:
    f.write(output)

import biotite.structure.io as bsio
struct = bsio.load_structure("result.pdb", extra_fields=["b_factor"])
print(struct.b_factor.mean())  # this will be the pLDDT
# 88.3

Besides esm.pretrained.esmfold_v1(), which is the best-performing model and the one we recommend, we also provide esm.pretrained.esmfold_v0(), which was used for the experiments in Lin et al. 2022.

We also provide a command line interface (esm-fold) that efficiently predicts structures in bulk from a FASTA file using ESMFold:

usage: esm-fold [-h] -i FASTA -o PDB [--num-recycles NUM_RECYCLES]
                [--max-tokens-per-batch MAX_TOKENS_PER_BATCH]
                [--chunk-size CHUNK_SIZE] [--cpu-only] [--cpu-offload]

optional arguments:
  -h, --help            show this help message and exit
  -i FASTA, --fasta FASTA
                        Path to input FASTA file
  -o PDB, --pdb PDB     Path to output PDB directory
  --num-recycles NUM_RECYCLES
                        Number of recycles to run. Defaults to number used in
                        training (4).
  --max-tokens-per-batch MAX_TOKENS_PER_BATCH
                        Maximum number of tokens per gpu forward-pass. This
                        will group shorter sequences together for batched
                        prediction. Lowering this can help with out of memory
                        issues, if these occur on short sequences.
  --chunk-size CHUNK_SIZE
                        Chunks axial attention computation to reduce memory
                        usage from O(L^2) to O(L). Equivalent to running a for
                        loop over chunks of of each dimension. Lower values
                        will result in lower memory usage at the cost of
                        speed. Recommended values: 128, 64, 32. Default: None.
  --cpu-only            CPU only
  --cpu-offload         Enable CPU offloading

The command will make one prediction for every sequence in the FASTA file. Multimers can be predicted and should be entered in the FASTA file as a single sequence, with chains separated by a ":" character.
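
For instance, a two-chain complex can be written as one FASTA record by joining the chain sequences with ":". The helper below is a hypothetical convenience function, not part of esm, and the chain sequences are placeholders.

```python
def multimer_fasta_record(name, chains):
    """Return one FASTA record with the chains joined by ':'."""
    if not chains or any(not chain for chain in chains):
        raise ValueError("every chain needs a non-empty sequence")
    return f">{name}\n{':'.join(chains)}\n"


record = multimer_fasta_record("complex1", ["MKTVRQERLK", "KALTARQQEV"])
print(record, end="")
# >complex1
# MKTVRQERLK:KALTARQQEV
```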

By default, predictions will be batched together so that shorter sequences are predicted simultaneously. This can be disabled by setting --max-tokens-per-batch=0. Batching can significantly improve prediction speed on shorter sequences.
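
The batching idea can be pictured as greedy grouping under a token budget. The sketch below is an illustration only, not the esm-fold implementation: the function name, the sort by length, and the rule that a batch closes once adding a sequence would exceed the budget are all assumptions.

```python
def batch_by_token_budget(sequences, max_tokens_per_batch):
    """Greedily group (name, seq) pairs so each batch holds at most
    max_tokens_per_batch residues. A sequence longer than the budget
    gets a batch of its own."""
    batches, current, current_tokens = [], [], 0
    for name, seq in sorted(sequences, key=lambda item: len(item[1])):
        if current and current_tokens + len(seq) > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append((name, seq))
        current_tokens += len(seq)
    if current:
        batches.append(current)
    return batches


seqs = [("a", "M" * 30), ("b", "M" * 50), ("c", "M" * 400), ("d", "M" * 20)]
for batch in batch_by_token_budget(seqs, max_tokens_per_batch=100):
    print([name for name, _ in batch])
# ['d', 'a', 'b']
# ['c']
```

With a budget of 0, this sketch places every sequence in its own batch, which matches the intent of disabling batching.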

The --cpu-offload flag can be useful for making predictions on longer sequences. It will attempt to offload some parameters to the CPU RAM, rather than storing on GPU.

Finally, the models from the ablation experiments over LMs of varying sizes (Lin et al. 2022, Table S1) are released as esm.pretrained.esmfold_structure_module_only_*(). We don't recommend using these models for structure prediction.

Compute embeddings in bulk from FASTA

We provide a command line interface (esm-extract) that efficiently extracts embeddings in bulk for a FASTA file using an ESM model:

usage: esm-extract [-h] [--toks_per_batch TOKS_PER_BATCH]
                   [--repr_layers REPR_LAYERS [REPR_LAYERS ...]] --include
                   {mean,per_tok,bos,contacts}
                   [{mean,per_tok,bos,contacts} ...]
                   [--truncation_seq_length TRUNCATION_SEQ_LENGTH]
                   model_location fasta_file output_dir

Extract per-token representations and model outputs for sequences in a FASTA
file

positional arguments:
  model_location        PyTorch model file OR name of pretrained model to
                        download (see README for models)
  fasta_file            FASTA file on which to extract representations
  output_dir            output directory for extracted representations

optional arguments:
  -h, --help            show this help message and exit
  --toks_per_batch TOKS_PER_BATCH
                        maximum batch size
  --repr_layers REPR_LAYERS [REPR_LAYERS ...]
                        layers indices from which to extract representations
                        (0 to num_layers, inclusive)
  --include {mean,per_tok,bos,contacts} [{mean,per_tok,bos,contacts} ...]
                        specify which representations to return
  --truncation_seq_length TRUNCATION_SEQ_LENGTH
                        truncate sequences longer than the given value

The following commands extract embeddings from layers 0, 32, and 33 (the final layer) of the ESM-2 model for a FASTA file. Use either the installed esm-extract command or scripts/extract.py from a clone of this repository:

esm-extract esm2_t33_650M_UR50D examples/data/some_proteins.fasta \
  examples/data/some_proteins_emb_esm2 --repr_layers 0 32 33 --include mean per_tok
python scripts/extract.py esm2_t33_650M_UR50D examples/data/some_proteins.fasta \
  examples/data/some_proteins_emb_esm2 --repr_layers 0 32 33 --include mean per_tok

A cuda device is optional and will be auto-detected.

Directory some_proteins_emb_esm2/ now contains one .pt file per FASTA sequence; use torch.load() to load them. scripts/extract.py has flags that determine what's included in the .pt file:

  • --repr_layers (default: final only) selects which layers to include embeddings from.
  • --include specifies what embeddings to save. You can use the following:
    • per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
    • mean includes the embeddings averaged over the full sequence, per layer.
    • bos includes the embeddings from the beginning-of-sequence token. (NOTE: Don't use with the pre-trained models - we trained without bos-token supervision)

CPU offloading for inference with large models

If you want to load very large models like 15B and/or do inference on long sequences on your machine, regular GPU inference may lead to OOM errors. We show how to load the model with Fairscale's Fully Sharded Data Parallel (FSDP) and use its CPU offloading feature. This allows inference with large models on a single GPU. Please check out examples/esm2_infer_fairscale_fsdp_cpu_offloading.py for more details.

Zero-shot variant prediction

See "examples/variant-prediction/" for code and pre-trained weights for the ESM-1v models described in Language models enable zero-shot prediction of the effects of mutations on protein function. (Meier et al. 2021).

Note that ESM-2 could be used for variant prediction as well, and is expected to have similar performance to ESM-1v.

Inverse folding

See "examples/inverse_folding/" for a detailed user guide. The ESM-IF1 model is described as GVPTransformer in Learning inverse folding from millions of predicted structures. (Hsu et al. 2022).

We also provide a colab notebook for the sequence design and sequence scoring functionalities.

The ESM-IF1 inverse folding model is built for predicting protein sequences from their backbone atom coordinates. We provide scripts here 1) to sample sequence designs for a given structure and 2) to score sequences for a given structure.

Trained with 12M protein structures predicted by AlphaFold2, the ESM-IF1 model consists of invariant geometric input processing layers followed by a sequence-to-sequence transformer, and achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues. The model is also trained with span masking to tolerate missing backbone coordinates and therefore can predict sequences for partially masked structures.

Sample sequence designs for a given structure

The environment setup is described in this subsection of examples/inverse_folding.

To sample sequences for a given structure in PDB or mmCIF format, use the sample_sequences.py script. The input file can have either .pdb or .cif as suffix.

For example, to sample 3 sequence designs for the golgi casein kinase structure (PDB 5YH2; PDB Molecule of the Month from January 2022), we can run the following command from the esm root directory:

python examples/inverse_folding/sample_sequences.py examples/inverse_folding/data/5YH2.pdb \
  --chain C --temperature 1 --num-samples 3 --outpath examples/inverse_folding/output/sampled_sequences.fasta

The sampled sequences will be saved in a fasta format to the specified output file.

The temperature parameter controls the sharpness of the probability distribution for sequence sampling. Higher sampling temperatures yield more diverse sequences but likely with lower native sequence recovery. The default sampling temperature is 1. To optimize for native sequence recovery, we recommend sampling with low temperature such as 1e-6.
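
To make the effect of temperature concrete, the sketch below applies temperature scaling to a toy set of logits before a softmax. It assumes the usual convention of dividing the logits by the temperature. The logits are invented, and this illustrates the concept rather than reproducing the code in sample_sequences.py.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # toy scores for three candidate amino acids
for t in (1.0, 0.1, 1e-6):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches zero, nearly all probability mass moves to the highest-scoring residue, so sampling behaves like taking the argmax at each position.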

Scoring sequences

To score the conditional log-likelihoods for sequences conditioned on a given structure, use the score_log_likelihoods.py script.

For example, to score the sequences in examples/inverse_folding/data/5YH2_mutated_seqs.fasta according to the structure in examples/inverse_folding/data/5YH2.pdb, we can run the following command from the esm root directory:

python examples/inverse_folding/score_log_likelihoods.py examples/inverse_folding/data/5YH2.pdb \
  examples/inverse_folding/data/5YH2_mutated_seqs.fasta --chain C \
  --outpath examples/inverse_folding/output/5YH2_mutated_seqs_scores.csv

The conditional log-likelihoods are saved in CSV format at the specified output path. Each output value is the log-likelihood averaged over all amino acids in a sequence.
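
In other words, a score is the mean of the per-residue log-probabilities. The sketch below shows that arithmetic on made-up probabilities. `average_log_likelihood` is a hypothetical helper that assumes natural logarithms, and the perplexity line is a standard transformation added for intuition, not an output of the script.

```python
import math


def average_log_likelihood(residue_probs):
    """Mean natural-log probability over the residues of one sequence."""
    return sum(math.log(p) for p in residue_probs) / len(residue_probs)


probs = [0.5, 0.25, 0.5, 0.25]  # invented p(residue | structure) values
score = average_log_likelihood(probs)
print(round(score, 4))             # -1.0397
print(round(math.exp(-score), 4))  # 2.8284 (perplexity)
```

Because the value is averaged over residues, scores stay comparable across sequences of different lengths.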

For more information, see "./examples/inverse_folding/" for a detailed user guide.

ESM Metagenomic Atlas

Please visit the ESM Metagenomic Atlas website, and see our blog post to learn more.

Bulk download instructions are available in a separate README here.

The Atlas resources include a page to fold a sequence using ESMFold, searching a subset of the ESM Atlas by structure or sequence, as well as an API to access those resources programmatically.

Foldseek provides search against the Atlas without the length limitation here.

Notebooks

Inverse folding - predicting or scoring sequences based on backbone structures

The ESM-IF1 inverse folding model predicts protein sequences from their backbone atom coordinates, trained with 12M protein structures predicted by AlphaFold2. This notebook guides you through examples of sampling sequences, calculating conditional log-likelihoods, and extracting encoder output as structure representation.

Supervised variant prediction - training a classifier on the embeddings

To help you get started with using the embeddings, this jupyter notebook tutorial shows how to train a supervised variant predictor using embeddings from ESM-1. You can adopt a similar protocol to train a model for any downstream task, even with limited data. First you can obtain the embeddings for examples/data/P62593.fasta either by downloading the precomputed embeddings as instructed in the notebook or by running the following:

# Obtain the embeddings
python scripts/extract.py esm1v_t33_650M_UR90S_1 examples/data/P62593.fasta \
  examples/data/P62593_emb_esm1v --repr_layers 33 --include mean

Then, follow the remaining instructions in the tutorial. You can also run the tutorial in a colab notebook.

Alternatively, use the newer instructions for zero-shot variant prediction, which predicts mutational effects without any supervised training.

Unsupervised contact prediction

This jupyter notebook tutorial demonstrates contact prediction with both the ESM-2 and MSA Transformer (ESM-MSA-1) models. Contact prediction is based on a logistic regression over the model's attention maps. This methodology is based on our ICLR 2021 paper, Transformer protein language models are unsupervised structure learners. (Rao et al. 2020) The MSA Transformer (ESM-MSA-1) takes a multiple sequence alignment (MSA) as input, and uses the tied row self-attention maps in the same way. See MSA Transformer. (Rao et al. 2021).

To get unsupervised attention-based contacts, call model.predict_contacts(tokens) or model(tokens, return_contacts=True).

ESMStructuralSplitDataset and self-attention contact prediction

This jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset, and how to compute the self-attention map unsupervised contact predictions using ESM-2.

Available Models and Datasets

Pre-trained Models

| Shorthand | esm.pretrained. | #layers | #params | Dataset | Embedding Dim | Model URL (automatically downloaded to ~/.cache/torch/hub/checkpoints) |
|---|---|---|---|---|---|---|
| ESM-2 | esm2_t48_15B_UR50D | 48 | 15B | UR50/D 2021_04 | 5120 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t48_15B_UR50D.pt |
| | esm2_t36_3B_UR50D | 36 | 3B | UR50/D 2021_04 | 2560 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt |
| | esm2_t33_650M_UR50D | 33 | 650M | UR50/D 2021_04 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt |
| | esm2_t30_150M_UR50D | 30 | 150M | UR50/D 2021_04 | 640 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t30_150M_UR50D.pt |
| | esm2_t12_35M_UR50D | 12 | 35M | UR50/D 2021_04 | 480 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t12_35M_UR50D.pt |
| | esm2_t6_8M_UR50D | 6 | 8M | UR50/D 2021_04 | 320 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t6_8M_UR50D.pt |
| ESMFold | esmfold_v1 | 48 (+36) | 690M (+3B) | UR50/D 2021_04 | - | https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v1.pt |
| | esmfold_v0 | 48 (+36) | 690M (+3B) | UR50/D 2021_04 | - | https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v0.pt |
| | esmfold_structure_module_only_* | 0 (+various) | various | UR50/D 2021_04 | - | https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_structure_module_only_* |
| ESM-IF1 | esm_if1_gvp4_t16_142M_UR50 | 20 | 124M | CATH 4.3 + predicted structures for UR50 | 512 | https://dl.fbaipublicfiles.com/fair-esm/models/esm_if1_gvp4_t16_142M_UR50.pt |
| ESM-1v | esm1v_t33_650M_UR90S_[1-5] | 33 | 650M | UR90/S 2020_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1v_t33_650M_UR90S_1.pt |
| ESM-MSA-1b | esm_msa1b_t12_100M_UR50S | 12 | 100M | UR50/S + MSA 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm_msa1b_t12_100M_UR50S.pt |
| ESM-MSA-1 | esm_msa1_t12_100M_UR50S | 12 | 100M | UR50/S + MSA 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm_msa1_t12_100M_UR50S.pt |
| ESM-1b | esm1b_t33_650M_UR50S | 33 | 650M | UR50/S 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt |
| ESM-1 | esm1_t34_670M_UR50S | 34 | 670M | UR50/S 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt |
| | esm1_t34_670M_UR50D | 34 | 670M | UR50/D 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt |
| | esm1_t34_670M_UR100 | 34 | 670M | UR100 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt |
| | esm1_t12_85M_UR50S | 12 | 85M | UR50/S 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt |
| | esm1_t6_43M_UR50S | 6 | 43M | UR50/S 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt |

Here is a chronological list of the released models and the paper they were introduced in:

| Shorthand | Release Notes |
|---|---|
| ESM-1 | Released with Rives et al. 2019 (Aug 2020 update). |
| ESM-1b | Released with Rives et al. 2019 (Dec 2020 update). See Appendix B. |
| ESM-MSA-1 | Released with Rao et al. 2021 (Preprint v1). |
| ESM-MSA-1b | Released with Rao et al. 2021 (ICML'21 version, June 2021). |
| ESM-1v | Released with Meier et al. 2021. |
| ESM-IF1 | Released with Hsu et al. 2022. |
| ESM-2 | Released with Lin et al. 2022. |

ESM Structural Split Dataset

This is a five-fold cross validation dataset of protein domain structures that can be used to measure generalization of representations across different levels of structural dissimilarity. The dataset implements structural holdouts at the family, superfamily, and fold level. The SCOPe database is used to classify domains. Independently for each level of structural hold-out, the domains are split into 5 equal sets, i.e. five sets of folds, superfamilies, or families. This ensures that for each of the five partitions, structures having the same classification do not appear in both the train and test sets. For a given classification level each structure appears in a test set once, so that in the cross validation experiment each of the structures will be evaluated exactly once.
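
The holdout logic can be sketched as a grouped five-fold split: whole classification groups, not individual domains, are assigned to partitions. The code below is a simplified illustration with invented domain and superfamily labels. The round-robin assignment is an assumption and is not the procedure used to build the released splits.

```python
from collections import defaultdict


def grouped_k_fold(domain_to_group, k=5):
    """Yield (train, test) domain lists so that no group spans both."""
    groups = sorted(set(domain_to_group.values()))
    fold_of_group = {g: i % k for i, g in enumerate(groups)}  # round-robin
    members = defaultdict(list)
    for domain, group in sorted(domain_to_group.items()):
        members[fold_of_group[group]].append(domain)
    for fold in range(k):
        test = members[fold]
        train = [d for f in range(k) if f != fold for d in members[f]]
        yield train, test


# 40 toy domains spread over 10 toy superfamilies.
domains = {f"d{i}": f"superfamily_{i % 10}" for i in range(40)}
for train, test in grouped_k_fold(domains):
    train_groups = {domains[d] for d in train}
    test_groups = {domains[d] for d in test}
    assert not train_groups & test_groups  # structural holdout is respected
```

Each domain lands in exactly one test set, so every structure is evaluated once across the five partitions.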

The dataset provides 3d coordinates, distance maps, and secondary structure labels. For further details on the construction of the dataset see Rives et al. 2019 Appendix A.10.

This jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset.

ESMStructuralSplitDataset, upon initializing, will download splits and pkl. We also provide msas for each of the domains. The data can be directly downloaded below.

| Name | Description | URL |
|---|---|---|
| splits | train/valid splits | https://dl.fbaipublicfiles.com/fair-esm/structural-data/splits.tar.gz |
| pkl | pkl objects containing sequence, SSP labels, distance map, and 3d coordinates | https://dl.fbaipublicfiles.com/fair-esm/structural-data/pkl.tar.gz |
| msas | a3m files containing MSA for each domain | https://dl.fbaipublicfiles.com/fair-esm/structural-data/msas.tar.gz |

Pre-training Dataset Split

The split files establishing which UniRef50 clusters were used as held-out evaluation set for pre-training in Rives et al. 2019 and Rao et al. 2021 can be found here:

These files contain only the UniRef50 IDs and UniRef100 IDs corresponding to the UniRef database, 2018-03 release, which is released by the UniProt Consortium under a Creative Commons Attribution (CC BY 4.0) License.

Comparison to related works

| Task | Unsupervised contact prediction | | | Structure Prediction | |
|---|---|---|---|---|---|
| Test set | Large valid | CASP14 | CAMEO (Apr-Jun 2022) | CASP14 | CAMEO (Apr-Jun 2022) |
| Gremlin (Potts) | 39.3 | | | | |
| TAPE | 11.2 | | | | |
| ProtBert-BFD | 34.1 | | | | |
| Prot-T5-XL-BFD | 35.6 | | | 46.1 | 62.6 |
| Prot-T5-XL-Ur50 (3B) | 47.9 | | | 49.8 | 69.4 |
| ESM-1 | 33.7 | | | | |
| ESM-1b | 41.1 | 24.4 | 39 | 41.6 | 64.5 |
| ESM-1v | 35.3 | | | | |
| ESM-MSA-1b | 57.4 | | | | |
| ESM-2 (8M) | 15.9 | 9.8 | 15.7 | 36.7 | 48.1 |
| ESM-2 (35M) | 28.8 | 16.4 | 28.4 | 41.4 | 56.4 |
| ESM-2 (150M) | 42.2 | 26.8 | 40.1 | 49.0 | 64.9 |
| ESM-2 (700M) | 50.1 | 32.5 | 47.6 | 51.3 | 70.1 |
| ESM-2 (3B) | 52.7 | 34.0 | 49.9 | 52.5 | 71.8 |
| ESM-2 (15B) | 54.5 | 37.0 | 51.7 | 55.4 | 72.1 |

Comparison to related protein language models on structure prediction tasks.

  • All contact numbers are the top-L,LR precision metric, where long range means sequence separation of at least 24 residues
  • For unsupervised contact prediction, a sparse linear combination of the attention heads is used to directly predict protein contacts, fitted with logistic regression on 20 structures. For more details on the method, see Rao et al. 2020.
  • For structure prediction, an AlphaFold2 structure module is trained directly from the frozen language model embeddings. For more details on the method, see Lin et al. 2022.
  • Direct coupling analysis methods (Gremlin, mfDCA, Psicov) and ESM-MSA-1 use the trRosetta MSAs, while other methods predict from single sequence.
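
The top-L long-range precision from the first bullet can be computed as follows for a single protein of length L: keep residue pairs with sequence separation of at least 24, rank them by predicted contact probability, take the L highest, and report the fraction that are true contacts. The pure-Python sketch below assumes square nested-list inputs and uses invented toy data. It is not the evaluation code from the papers.

```python
def top_l_long_range_precision(pred, true, min_sep=24):
    """Precision of the L highest-scoring pairs (i, j) with j - i >= min_sep.

    pred: L x L predicted contact probabilities.
    true: L x L booleans marking real contacts.
    """
    length = len(pred)
    pairs = [
        (pred[i][j], bool(true[i][j]))
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    if not pairs:
        raise ValueError("no pairs at the requested separation")
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    top = pairs[:length]
    return sum(hit for _, hit in top) / len(top)


L = 40
# Toy prediction: confident about pairs exactly 24 or 25 residues apart.
pred = [[0.9 if j - i in (24, 25) else 0.1 for j in range(L)] for i in range(L)]
# Toy ground truth: only pairs exactly 24 residues apart are in contact.
true = [[j - i == 24 for j in range(L)] for i in range(L)]
print(top_l_long_range_precision(pred, true))  # 0.4
```

In this toy case 16 of the top 40 pairs are true contacts, which gives 0.4. The table reports such values as percentages.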

Citations

If you find the models useful in your research, we ask that you cite the relevant paper:

@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v4},
  journal={PNAS}
}

For the self-attention contact prediction:

@article{rao2020transformer,
  author = {Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander},
  title={Transformer protein language models are unsupervised structure learners},
  year={2020},
  doi={10.1101/2020.12.15.422761},
  url={https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1},
  journal={bioRxiv}
}

For the MSA Transformer:

@article{rao2021msa,
  author = {Rao, Roshan and Liu, Jason and Verkuil, Robert and Meier, Joshua and Canny, John F. and Abbeel, Pieter and Sercu, Tom and Rives, Alexander},
  title={MSA Transformer},
  year={2021},
  doi={10.1101/2021.02.12.430858},
  url={https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1},
  journal={bioRxiv}
}

For variant prediction using ESM-1v:

@article{meier2021language,
  author = {Meier, Joshua and Rao, Roshan and Verkuil, Robert and Liu, Jason and Sercu, Tom and Rives, Alexander},
  title = {Language models enable zero-shot prediction of the effects of mutations on protein function},
  year={2021},
  doi={10.1101/2021.07.09.450648},
  url={https://www.biorxiv.org/content/10.1101/2021.07.09.450648v1},
  journal={bioRxiv}
}

For inverse folding using ESM-IF1:

@article{hsu2022learning,
	author = {Hsu, Chloe and Verkuil, Robert and Liu, Jason and Lin, Zeming and Hie, Brian and Sercu, Tom and Lerer, Adam and Rives, Alexander},
	title = {Learning inverse folding from millions of predicted structures},
	year = {2022},
	doi = {10.1101/2022.04.10.487779},
	url = {https://www.biorxiv.org/content/early/2022/04/10/2022.04.10.487779},
	journal = {ICML}
}

For the ESM-2 language model and ESMFold:

@article{lin2022language,
  title={Language models of protein sequences at the scale of evolution enable accurate structure prediction},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Sal and others},
  journal={bioRxiv},
  year={2022},
  publisher={Cold Spring Harbor Laboratory}
}

Much of this code builds on the fairseq sequence modeling framework. We use fairseq internally for our protein language modeling research. We highly recommend trying it out if you'd like to pre-train protein language models from scratch.

Additionally, if you would like to use the variant prediction benchmark from Meier et al. (2021), we provide a bibtex file with citations for all data in ./examples/variant-prediction/mutation_data.bib. You can cite each paper individually, or add all citations in bulk using the LaTeX command:

\nocite{wrenbeck2017deep,klesmith2015comprehensive,haddox2018mapping,romero2015dissecting,firnberg2014comprehensive,deng2012deep,stiffler2015evolvability,jacquier2013capturing,findlay2018comprehensive,mclaughlin2012spatial,kitzman2015massively,doud2016accurate,pokusaeva2019experimental,mishra2016systematic,kelsic2016rna,melnikov2014comprehensive,brenan2016phenotypic,rockah2015systematic,wu2015functional,aakre2015evolving,qi2014quantitative,matreyek2018multiplex,bandaru2017deconstruction,roscoe2013analyses,roscoe2014systematic,mavor2016determination,chan2017correlation,melamed2013deep,starita2013activity,araya2012fundamental}

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

ESM Metagenomic Atlas (also referred to as “ESM Metagenomic Structure Atlas” or “ESM Atlas”) data is available under a CC BY 4.0 license for academic and commercial use. Copyright (c) Meta Platforms, Inc. All Rights Reserved. Use of the ESM Metagenomic Atlas data is subject to the Meta Open Source Terms of Use and Privacy Policy.

esm's People

Contributors

alexperiments, amorehead, andersoncarlosfs, brettkoonce, brianhie, chloechsu, cutecutecat, ebetica, eric-tc-wong, jacoberts, jasoniliu, joshim5, kiramt, latticetower, luoyunan, moritzschaefer, naailkhan28, nikitos9000, rmrao, robert-verkuil, sbonner0, scandido, tomsercu, w3ntinglu, walid0925, yaoyinying


esm's Issues

Pretrained model unable to predict correctly in the test example?

I'm assuming I'm doing something wrong in the following, but I can't really see what it could be so hopefully you guys can help point out what it is!

I'm hoping to use your model in my research, but first I wanted to validate how it works. So rather than looking at the embedding layer, I'm looking at the actual output of the model, which I assume can be passed through a softmax function to get token probabilities, from which I can get the predicted amino acids by taking the argmax.

However, when I do that, the returned probabilities make no sense. What I'm doing is given in the code below. What am I doing wrong?

import torch
import esm
import numpy as np
# Load 34 layer model
model, alphabet = esm.pretrained.esm1_t34_670M_UR50S()
batch_converter = alphabet.get_batch_converter()

# Prepare data (two protein sequences)
data = [("protein1", "MYLYQKIKN"), ("protein2", "MNAKYD")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

aa = alphabet.all_toks
# model.model_version
# Compute logits and per-position predictions (on CPU)
with torch.no_grad():
    results = model(batch_tokens)
    logits = results['logits']
    prob = torch.softmax(logits, dim=2)
    pred = torch.argmax(prob, dim=2)
    pred_str = aa[pred[0, 0]]
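
For reference, the decoding step described above (softmax, then argmax, then a token lookup) can be sketched without any library. This is only an illustration: the four-token vocabulary and the logit values are made up and are not ESM's alphabet or outputs. Unlike the snippet above, which looks up only `pred[0, 0]`, it decodes every position. If the batch converter prepends a start token (an assumption worth checking), position 0 is not a residue.

```python
import math

# Hypothetical toy vocabulary and logits (not ESM's real alphabet or outputs).
toy_toks = ["<cls>", "M", "Y", "L"]
toy_logits = [
    [4.0, 0.1, 0.2, 0.3],  # position 0
    [0.0, 3.0, 1.0, 0.5],  # position 1
    [0.2, 0.1, 2.5, 0.4],  # position 2
]

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits, toks):
    """Return the most probable token at every position."""
    decoded = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        decoded.append(toks[best])
    return decoded

predicted = decode(toy_logits, toy_toks)
print(predicted)
```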

MSA Column attention's softmax axis is not the same as the padding_mask

esm/esm/axial_attention.py

Lines 205 to 225 in 5680ba7

q = self.q_proj(x).view(num_rows, num_cols, batch_size, self.num_heads, self.head_dim)
k = self.k_proj(x).view(num_rows, num_cols, batch_size, self.num_heads, self.head_dim)
v = self.v_proj(x).view(num_rows, num_cols, batch_size, self.num_heads, self.head_dim)
q *= self.scaling
attn_weights = torch.einsum("icnhd,jcnhd->hcnij", q, k)
if self_attn_mask is not None:
    raise NotImplementedError
if self_attn_padding_mask is not None:
    attn_weights = attn_weights.masked_fill(
        self_attn_padding_mask.permute(2, 0, 1).unsqueeze(0).unsqueeze(3),
        -10000,
    )
attn_probs = attn_weights.softmax(-1)
attn_probs = self.dropout_module(attn_probs)
context = torch.einsum("hcnij,jcnhd->icnhd", attn_probs, v)
context = context.contiguous().view(num_rows, num_cols, batch_size, embed_dim)
output = self.out_proj(context)
return output, attn_probs

If I understand the einsum correctly, attn_weights has shape [head_size, seq_len, batch_size, msa_row_size, msa_row_size], or [H, C, B, R, R].

If so, shouldn't we take the softmax over the 'C' axis? i.e.

attn_probs = attn_weights.softmax(1)

Thank you!
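
The shape reading in this question can be checked without running the model. The sketch below infers the einsum output shape from the index string alone. `einsum_output_shape` is a helper written for this illustration (it is not part of esm or torch), the sizes are arbitrary, and the padding-mask layout `(N, R, C)` is an assumption. Under that reading, the last axis of `attn_weights` is the key-row index `j`, which is also the axis the permuted mask lines up with. This is a reading of the quoted lines, not a statement from the maintainers.

```python
def einsum_output_shape(spec, *shapes):
    """Infer the output shape of an einsum from its spec and operand shapes."""
    inputs, output = spec.split("->")
    sizes = {}
    for labels, shape in zip(inputs.split(","), shapes):
        if len(labels) != len(shape):
            raise ValueError(f"{labels!r} does not match shape {shape}")
        for label, size in zip(labels, shape):
            if sizes.setdefault(label, size) != size:
                raise ValueError(f"inconsistent size for index {label!r}")
    return tuple(sizes[label] for label in output)

# Hypothetical sizes: rows R, columns C, batch N, heads H, head dimension D.
R, C, N, H, D = 5, 7, 2, 3, 4
q_shape = (R, C, N, H, D)  # indices i c n h d
k_shape = (R, C, N, H, D)  # indices j c n h d

attn_shape = einsum_output_shape("icnhd,jcnhd->hcnij", q_shape, k_shape)
print(attn_shape)  # (3, 7, 2, 5, 5), i.e. (H, C, N, R, R); last axis is j

# Assumed mask layout (N, R, C): permute(2, 0, 1) gives (C, N, R), and the
# unsqueezes at positions 0 and 3 give (1, C, N, 1, R), aligned with j.
mask_shape = (1, C, N, 1, R)
```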

ESM Vs other Transformers

Hi,

Congratulations on your great work and on releasing the pretrained models.

In your work you compared your results against SeqVec, our early work, and I was wondering whether you plan to compare against our new work, ProtTrans?
https://github.com/agemagician/ProtTrans

By making a quick comparison, it seems ProtBert-BFD model performs better than Transformer-34 on SS8 with only 63% of Transformer-34 capacity.

I believe more analysis is needed here, especially to see how RoBERTa-style models compare to other transformers, including XLNet, ALBERT, BERT, ELECTRA, Transformer-XL, etc.

Unable to load many sequences

Bug description
I am trying to embed 10k+ protein sequences (about 250 aa residues each, all stored in a single fasta file foo). This errs out. Running the same command on head -n1000 foo > bar works as expected. Is there a limit here?

Reproduction steps
Try to run the model on 10k sequences.

Expected behavior
Embed them.

Logs

python esm/extract.py esm1b_t33_650M_UR50S foo my_reprs/ --repr_layers 33 --include mean
Traceback (most recent call last):
  File "esm/extract.py", line 144, in <module>
    main(args)
  File "esm/extract.py", line 71, in main
    dataset = FastaBatchedDataset.from_file(args.fasta_file)
  File ".../tmp/fair-esm/esm/esm/data.py", line 52, in from_file
    assert len(set(sequence_labels)) == len(sequence_labels)
AssertionError
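
The assertion in the traceback compares the number of distinct labels with the total number of labels, so it fails when two records in the file share a header. The sketch below lists repeated headers in a FASTA file. `find_duplicate_labels` is a hypothetical helper, and treating everything after `>` as the label is an assumption about how the labels are read.

```python
from collections import Counter
import io

def find_duplicate_labels(handle):
    """Return {label: count} for FASTA headers that occur more than once."""
    labels = [line[1:].strip() for line in handle if line.startswith(">")]
    return {label: n for label, n in Counter(labels).items() if n > 1}

# Toy FASTA text standing in for the real file.
toy_fasta = io.StringIO(">seqA\nMKT\n>seqB\nMNA\n>seqA\nMYL\n")
duplicates = find_duplicate_labels(toy_fasta)
print(duplicates)
```

On a real file, the same function can be called as `find_duplicate_labels(open("foo"))`.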

MSA transformer

Thank you for sharing these excellent results. I'm hoping to clarify on the number of sequences you are sampling in training the MSA transformer (as in the subsample strategy section).

As I understand it (please correct me if I'm wrong), you are using 1024 as the sequence length L, so the number of sequences you are sampling is only N/L = 2^14/1024 = 16? And are you keeping to this number in the later supervised contact prediction as well (you mentioned 256 are used in unsupervised prediction)?

Any suggestions for extracting embeddings for sequences with > 1024 residues?

Could I split the sequence into 1024 length chunks, run each separately with the BOS and EOS tokens occurring in the first and last chunks, concatenate the resulting embeddings, then take the average?

Since the model was trained on random crops of sequences longer than 1024, it seems like this should work, but I want to make sure.

Also, some warning that your sequence is too long might be helpful, since as of now trying to embed a larger than 1024 length sequence while running on gpu results in the unhelpful "device-side assert triggered" CUDA runtime error.
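
The chunking idea proposed above can be sketched as follows. This is an illustration, not a recommendation from the authors. The 1022-residue budget is an assumption that leaves room for two special tokens within 1024 positions, and both helper names are hypothetical. Averaging the concatenated per-residue embeddings is equivalent to a length-weighted mean of the per-chunk means, which is what the second helper computes.

```python
MAX_RESIDUES = 1022  # assumed per-chunk budget, leaving room for BOS and EOS

def split_into_chunks(sequence, max_len=MAX_RESIDUES):
    """Split a sequence into consecutive, non-overlapping chunks of at most max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]

def length_weighted_mean(chunk_means, chunk_lengths):
    """Combine per-chunk mean vectors so that every residue counts equally."""
    total = sum(chunk_lengths)
    dim = len(chunk_means[0])
    return [
        sum(mean[d] * n for mean, n in zip(chunk_means, chunk_lengths)) / total
        for d in range(dim)
    ]

toy_sequence = "A" * 2500
chunks = split_into_chunks(toy_sequence)
print([len(c) for c in chunks])  # [1022, 1022, 456]
```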

fine-tune model with language model head

Hi

I would like to fine-tune esm with a language model head. I tried

import torch
model = torch.hub.load("facebookresearch/esm", "modelWithLMHead", "esm1_t34_670M_UR50S")  

but I got
RuntimeError: Cannot find callable modelWithLMHead in hubconf

Is there a simple way to do this? Thanks

Inconsistent dimension when generate contact maps and maximum sequence length

First of all, thank you for your fantastic work! I tried your embeddings on some basic protein family tasks and the results are amazing!

However, I have some issues when I try to generate contact map predictions.

1. From your code, the batch_converter function pads sequences to the maximum length, which in my case is 600. However, the generated contact maps have dimensions [batch_size, 599 or 600, 599 or 600]. I am wondering why this happens.
2. The other issue is the maximum sequence length the model can process. I am working with protein families whose members can be over 30,000 residues long, and when I try to use the model on such a protein, the position encoding seems to fail. What is the maximum protein length this model can handle, so that I can set a reasonable cutoff?

Fine-tune ESM model

Hi,
I want to fine-tune the model on my own dataset. Is there a way to run your model for this task?

Provide pre-training code?

Hi there!

I'm trying to compare ESM to UniRep, the embedding from the Church lab, for variant function prediction. Eventually, there are a few proteins our lab would like to optimize, and ESM has some advantages over UniRep. I need to "evolutionarily fine tune" ESM, as the Church lab does for UniRep: refine the global model's weights by continuing training on a small neighborhood (~100k sequences) around the target protein.

Could y'all provide any of the code you used in the pre-training task? Eg, your implementations of noising / masking, your loss function, or your gradient descent function?
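
For readers looking for a starting point, a generic BERT-style masking routine can be sketched in a few lines. This is not the authors' pre-training code. The 15% selection rate and the 80/10/10 split are the standard BERT recipe, and whether ESM pre-training used exactly these values is an assumption. `mask_tokens` and its parameters are hypothetical names.

```python
import random

def mask_tokens(tokens, vocab, mask_token="<mask>", mask_prob=0.15, rng=None):
    """BERT-style corruption: returns (corrupted tokens, target positions)."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    targets = []
    for pos in range(len(tokens)):
        if rng.random() >= mask_prob:
            continue  # position is not selected for prediction
        targets.append(pos)
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = mask_token         # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, targets

amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
sequence = list("MKTVRQERLKSIVRILERSK")
corrupted, targets = mask_tokens(sequence, amino_acids, rng=random.Random(0))
```

A masked language modeling loss would then be a cross-entropy between the model's logits and the original tokens, evaluated only at `targets`.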

Thank you, I think ESM is super cool!
Best,
Jacob

"invalid dimensions for input" When Running with ONNXRuntime

Summary:

I'm looking to use ONNX and ONNXRuntime to speed up our workloads to use ESM-1b.

I converted the serialized model esm1b_t33_650M_UR50S.pt to a .onnx graph using torch.onnx then explicitly applied extended optimizations (including conversion to float16) using onnxruntime_tools.

When trying to run an ONNXRuntime inference session with the following inputs:

data = [
    ("protein1", "VLAGG"),
    ("protein2", "KALTARQ"),
]

I get:

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: input.1 for the following indices
index: 1 Got: 8 Expected: 9
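
The two numbers in the error differ by exactly one. One arithmetic reading, offered as an assumption rather than a confirmed diagnosis, is that the graph was traced with a fixed token axis and that the tokenizer used at inference adds one fewer special token than the one used at export. The sketch only reproduces the arithmetic; `padded_token_length` and its flags are illustrative names.

```python
def padded_token_length(sequences, prepend_bos=True, append_eos=True):
    """Token-axis length of a batch padded to its longest sequence."""
    return max(len(s) for s in sequences) + int(prepend_bos) + int(append_eos)

sequences = ["VLAGG", "KALTARQ"]
with_bos_and_eos = padded_token_length(sequences)
with_bos_only = padded_token_length(sequences, append_eos=False)
print(with_bos_and_eos, with_bos_only)  # 9 8
```

Comparing `batch_tokens.shape` in the export script and in the inference script would confirm or rule out this reading.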

ONNX Graph Conversion and ONNXRuntime Inference Code:

Before the ONNXRuntime inference code, I'd like to share my PyTorch-to-ONNX conversion process, just in case the issue resides there:

export MODEL_PATH=/mnt/models/esm1b/esm1b_t33_650M_UR50S.pt # model file downloaded locally
export CONVERTED_GRAPH_PATH=/tmp/models/onnx_esm/graph.onnx # intermediate storage for the graph and the external data binaries ("/tmp/models/onnx_esm" must be created beforehand)
export OPTIMIZED_GRAPH_PATH=/mnt/models/onnx_esm/graph.onnx # final form of the graph, encapsulated within a single 1.3G file ("/mnt/models/onnx_esm" must be created beforehand)

python convert_onnx_esm.py --model-path $MODEL_PATH --converted-model-path $CONVERTED_GRAPH_PATH
python -m onnxruntime_tools.optimizer_cli --float16 --opt_level 99 --use_gpu --model_type bert --hidden_size 1024 --num_heads 16 --input $CONVERTED_GRAPH_PATH --output $OPTIMIZED_GRAPH_PATH

This is the source code for convert_onnx_esm.py:

import os
import torch
import torch.onnx
import argparse
from esm.pretrained import load_model_and_alphabet_local


parser = argparse.ArgumentParser()

parser.add_argument("--model-path", type=str, required=True)
parser.add_argument("--converted-model-path", type=str, required=True)
args = parser.parse_args()

model, alphabet = load_model_and_alphabet_local(args.model_path)
batch_converter = alphabet.get_batch_converter()

data = [
    ("protein1", "VLAGG"),
    ("protein2", "KALTARQ"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

with torch.no_grad():
    torch.onnx.export(model,
        batch_tokens,
        args.converted_model_path,
        use_external_data_format=True,
        opset_version=12,
        do_constant_folding=True
    )

Now for the inference code which produced the exception:

import os
import numpy as np
import argparse
from onnxruntime import (
    GraphOptimizationLevel,
    InferenceSession,
    SessionOptions,
    get_device,
)
from esm.data import Alphabet

provider = "CUDAExecutionProvider" if get_device() == "GPU" else "CPUExecutionProvider"

parser = argparse.ArgumentParser()

parser.add_argument("--optimized-model-path", type=str, required=True)
args = parser.parse_args()

options = SessionOptions()
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
model = InferenceSession(args.optimized_model_path, options)
model.set_providers([provider])
alphabet = Alphabet.from_architecture("protein_bert_base")
batch_converter = alphabet.get_batch_converter()

data = [
    ("protein1", "VLAGG"),
    ("protein2", "KALTARQ"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

output = model.run(None, {"input.1": batch_tokens.numpy()}) # input name "input.1" should be the default when exporting with `torch.onnx.export()` 

Environment:

Cuda 11.2, CudNN=8.1.1.33, and Python 3.8.5 with packages:

fair-esm==0.3.0
onnx==1.8.1
onnxconverter-common==1.6.0
onnxruntime-gpu==1.7.0
onnxruntime-tools==1.6.0

Note: If we can find a solution for this issue, I was wondering if it's a good idea for me to clean up and add the model conversion and the inference example as a contribution to the repo.

Running by CPU mode, GPU is occupied

When running the ESM model in CPU mode, I found that GPU memory was still occupied (about 1.7 GB).

import torch
import esm

# Load ESM-1b model
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()

# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
token_representations = results["representations"][33]

Broken pip install

Hello,

thank you for the amazing work!

I would just like to let you know that the latest modification of the README has broken pip install: setup.py lists README.rst among its data_files, but the repo now has README.md instead.

Best regards,
Raman

Quick start example with GPU

In your quick start example you have an example for CPU:

# Extract per-residue embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[34])
token_embeddings = results["representations"][34]

If I wanted to extract per-residue embeddings on a GPU, how would I change the above lines?

Thanks

MSA file

Hi again.

Thank you for adding the MSA-1 model.

I am testing whether the MSA model gives a better representation of proteins for some downstream tasks.

Do you have any suggestions on how to prepare a proper MSA file (a3m) for a specific protein so that it fits the model? For example, which tools and procedures did you use?

Thanks.

Odd Prec?

I have a small question. As I understand it, predicting long-range residue contacts is more difficult than predicting short-range ones, but the metrics reported in this paper show just the opposite. If anyone can explain this, I would be very grateful.

Fine Tuning using the ESM model

Hi!
Thanks for making this model public! I have had a good amount of success using it to predict function, similar to what your example colab shows. I was wondering if you had any insight into how I could use the model as is to fine-tune on more specific data. I think I could pretty reasonably set up a masking function by setting some values to 0 and using the logits and softmax to get predictions. But if I wanted to do next-residue prediction, would I have to add a lot to the current model, or is there an output I could use? E.g., if a sequence is MYA, I want to use only the information before the predicted amino acid and not the entire sequence: to predict M, I would use only the fact that it comes after a start codon; to predict Y, I would use the previous hidden state, which knows there is a start codon and M; and so on for the whole sequence.

Supervised contact prediction network

Hi guys,

Congrats on the excellent work and great results. Great to see that sequence embeddings work indeed.

May I ask: in your MSA Transformer supervised contact prediction network, did you use the outer concatenation of the query sequence embedding (possibly with the symmetrized row self-attention maps) as the only input, to demonstrate that the information content of the sequence embedding can replace traditional MSA-derived features? Or did you still include all the RaptorX features as input to the ResNet, as stated in Rives et al. 2020? If the latter, did you conduct an ablation study like the one in Rives et al. 2020 to see how much the sequence embedding contributes to the improved contact precision?

Thanks in advance.

Adding the embedders to bio_embeddings

Hey folks :)

Great work! As I mentioned on Twitter, it'd be nice to add your models to bio_embeddings. The purpose of the pipeline is to make it easy for less tech-savvy bio/informaticians to use protein LMs. Since you use torch, this should be quite straightforward, as we already have some transformer models.

Out of the box, you get the whole "read FASTA in, make run reproducible", project, viz & embedding annotation transfer (goPredSim) pipelines. Edit: oh, and the auto-batching of large sequence files between GPU/CPU (which is not at all intuitive for the avg user), + per-sequence vs per-AA representations (looking through closed issues, #2 )

I noticed you have some variant prediction code, maybe it makes sense to include that as a pipeline step if it is sensible?

I'll link this to our issue for integration so that we can cross-follow the status: sacdallago/bio_embeddings#62

[Question] AssertionError using extract.py

Hello, I was trying to use

python extract.py esm1b_t33_650M_UR50S ../test/test.fasta my_reprs/ --repr_layers 0 32 33 --include mean per_tok

And it returns:

Traceback (most recent call last):
  File "extract.py", line 134, in <module>
    main(args)
  File "extract.py", line 66, in main
    dataset = FastaBatchedDataset.from_file(args.fasta_file)
  File "/home/jsun/msa_transformer/esm/esm/data.py", line 52, in from_file
    assert len(set(sequence_labels)) == len(sequence_labels)
AssertionError

I guess there is a problem with my FASTA file (about 350 sequences of roughly 450 amino acids each). Are there any restrictions on the FASTA file? Can anyone please help me with this problem?
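
The failing line asserts that all sequence labels are distinct, so one restriction that can be read off the traceback is that no two records may share a header. Assuming that is the cause here, the sketch below rewrites repeated headers by appending a counter. `uniquify_fasta_labels` is a hypothetical helper, not part of esm.

```python
import io

def uniquify_fasta_labels(lines):
    """Yield FASTA lines, appending _2, _3, ... to repeated header labels."""
    seen = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            label = line[1:].strip()
            seen[label] = seen.get(label, 0) + 1
            if seen[label] > 1:
                label = f"{label}_{seen[label]}"
            line = ">" + label
        yield line

# Toy FASTA text standing in for test.fasta.
toy_fasta = io.StringIO(">sp|P1\nMKT\n>sp|P1\nMNA\n>sp|P2\nMYL\n")
fixed = list(uniquify_fasta_labels(toy_fasta))
print("\n".join(fixed))
```

The sketch does not guard against a renamed label colliding with a header that already ends in the same suffix, so the output is worth re-checking on real data.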

ESM-MSA-1 from pytorch hub

Hi there,

Thanks for making this work public.

I was trying to load the MSA-1 model from pytorch hub, but it seems that it is not on the list yet.

Could you please add the msa-1 model to hubconf.py?

Thanks a lot again.

Embedding proteins in batches with MSA transformer?

Passing batches of proteins to the other ESM transformers seems to work fine, but with the MSA transformer, it seems like a specific error is raised -

image

Is there a supported way to encode batches of proteins in a single forward pass?

Decoding an embedding back to a sequence

Thank you for this resource! I am currently trying to predict a protein sequence from an embedding.
Is it possible to load the decoder of the ESM model? Is there an example snippet I could use?

decode method for batch_converter

Hi

I have been working with some of the models from the Hugging Face library, and their tokenizers have a .decode() method. It seems that 'batch_converter' is the esm equivalent of Hugging Face's 'tokenizer', but I can't find a similar method to map back from ids to strings. Is there one?
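
As an illustration of what such a decode step involves, the sketch below maps ids back to a string through a token list. The token list and the set of special tokens are made up for the example. With esm, the list would presumably come from `alphabet.all_toks`, and `decode_ids` is a hypothetical helper rather than a library method.

```python
# Hypothetical token list; with esm it would come from `alphabet.all_toks`.
toy_toks = ["<cls>", "<pad>", "<eos>", "<unk>", "L", "A", "G", "V"]
SPECIAL = {"<cls>", "<pad>", "<eos>", "<unk>"}

def decode_ids(ids, toks, skip_special=True):
    """Map a list of token ids back to a string."""
    pieces = [toks[i] for i in ids]
    if skip_special:
        pieces = [p for p in pieces if p not in SPECIAL]
    return "".join(pieces)

decoded = decode_ids([0, 7, 4, 5, 6, 6, 2, 1], toy_toks)
print(decoded)  # VLAGG
```

For a tensor of ids, calling `.tolist()` on one row first gives the plain list of integers this helper expects.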

Thanks

Proteins longer than 1024 causes an exception on the CPU and poisons the GPU

Bug description

Having a sequence longer than 1022 residues causes an unspecific exception on both CPU and GPU. On the CPU, it says IndexError: index out of range in self. On the Quadro RTX 8000 I tested, it causes a CUDA error: device-side assert triggered, which causes all further attempts to embed sequences of any length to fail with the same error.

I'm aware that esm was trained with sequences of less than 1024 amino acids; I'm opening this issue because this does not seem to be mentioned in the repo nor the paper, and from the error message in the exception it's hard to figure out what is wrong. I'd also be interested in how you'd suggest handling user input with longer sequences: Should this simply error with a message that this is not supported or do you suggest another way of handling this (I've seen #21 (comment) listing some strategies)?

Reproduction steps

Pretty much the readme example, only with a long sequence added:

c="MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGGAYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKRGVYCCREHEHEIAWYTERSEKSYELQTPFEIKLAKKFDTFNGECPNFVFPLNSIIKTIQPRVEKKKLDGFMGRIRSVYPVASPNECNQMCLSTLMKCDHCGETSWQTGDFVKATCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSLAEYHNESGLKTILRKGGRTIAFGGCVFSYVGCHNKCAYWVPRASANIGCNHTGVVGEGSEGLNDNLLEILQKEKVNINIVGDFKLNEEIAIILASFSASTSAFVETVKGLDYKAFKQIVESCGNFKVTKGKAKKGAWNIGEQKSILSPLYAFASEAARVVRSIFSRTLETAQNSVRVLQKAAITILDGISQYSLRLIDAMMFTSDLATNNLVVMAYITGGVVQLTSQWLTNIFGTVYEKLKPVLDWLEEKFKEGVEFLRDGWEIVKFISTCACEIVGGQIVTCAKEIKESVQTFFKLVNKFLALCADSIIIGGAKLKALNLGETFVTHSKGLYRKCVKSREETGLLMPLKAPKEIIFLEGETLPTEVLTEEVVLKTGDLQPLEQPTSEAVEAPLVGTPVCINGLMLLEIKDTEKYCALAPNMMVTNNTFTLKGGAPTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEKCSAYTVELGTEVNEFACVVADAVIKTLQPVSELLTPLGIDLDEWSMATYYLFDESGEFKLASHMYCSFYPPDEDEEEGDCEEEEFEPSTQYEYGTEDDYQGKPLEFGATSAALQPEEEQEEDWLDDDSQQTVGQQDGSEDNQTTTIQTIVEVQPQLEMELTPVVQTIEVNSFSGYLKLTDNVYIKNADIVEEAKKVKPTVVVNAANVYLKHGGGVAGALNKATNNAMQVESDDYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSAYENFNQHEVLLAPLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKLVSSFLEMKSEKQVEQKIAEIPKEEVKPFITESKPSVEQRKQDDKKIKACVEEVTTTLEETKFLTENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKALRKVPTDNYITTYPGQGLNGYTVEEAKTVLKKCKSAFYILPSIISNEKQEILGTVSWNLREMLAHAEETRKLMPVCVETKAIVSTIQRKYKGIKIQEGVVDYGARFYFYTSKTTVASLINTLNDLNETLVTMPLGYVTHGLNLEEAARYMRSLKVPATVSVSSPDAVTAYNGYLTSSSKTPEEHFIETISLAGSYKDWSYSGQSTQLGIEFLKRGDKSVYYTSNPTTFHLDGEVITFDNLKTLLSLREVRTIKVFTTVDNINLHTQVVDMSMTYGQQFGPTYLDGADVTKIKPHNSHEGKTFYVLPNDDTLRVEAFEYYHTTDPSFLGRYMSALNHTKKWKYPQVNGLTSIKWADNNCYLATALLTLQQIELKFNPPALQDAYYRARAGEAANFCALILAYCNKTVGELGDVRETMSYLFQHANLDSCKRVLNVVCKTCGQQQTTLKGVEAVMYMGTLSYEQFKKGVQIPCTCGKQATKYLVQQESPFVMMSAPPAQYELKHGTFTCASEYTGNYQCGHYKHITSKETLYCIDGALLTKSSEYKGPITDVFYKENSYTTTIKPVTYKLDGVVCTEIDPKLDNYYKKDNSYFTEQPIDLVPNQPYPNASFDNFKFVCDNIKFADDLNQLTGYKKPASRELKVTFFPDLNGDVVAIDYKHYTPSFKKGAKLLHKPIVWHVNNA
TNKATYKPNTWCIRCLWSTKPVETSNSFDVLKSEDAQGMDNLACEDLKPVSEEVVENPTIQKDVLECNVKTTEVVGDIILKPANNSLKITEEVGHTDLMAAYVDNSSLTIKKPNELSRVLGLKTLATHGLAAVNSVPWDTIANYAKPFLNKVVSTTTNIVTRCLNRVCTNYMPYFFTLLLQLCTFTRSTNSRIKASMPTTIAKNTVKSVGKFCLEASFNYLKSPNFSKLINIIIWFLLLSVCLGSLIYSTAALGVLMSNLGMPSYCTGYREGYLNSTNVTIATYCTGSIPCSVCLSGLDSLDTYPSLETIQITISSFKWDLTAFGLVAEWFLAYILFTRFFYVLGLAAIMQLFFSYFAVHFISNSWLMWLIINLVQMAPISAMVRMYIFFASFYYVWKSYVHVVDGCNSSTCMMCYKRNRATRVECTTIVNGVRRSFYVYANGGKGFCKLHNWNCVNCDTFCAGSTFISDEVARDLSLQFKRPINPTDQSSYIVDSVTVKNGSIHLYFDKAGQKTYERHSLSHFVNLDNLRANNTKGSLPINVIVFDGKSKCEESSAKSASVYYSQLMCQPILLLDQALVSDVGDSAEVAVKMFDAYVNTFSSTFNVPMEKLKTLVATAEAELAKNVSLDNVLSTFISAARQGFVDSDVETKDVVECLKLSHQSDIEVTGDSCNNYMLTYNKVENMTPRDLGACIDCSARHINAQVAKSHNIALIWNVKDFMSLSEQLRKQIRSAAKKNNLPFKLTCATTRQVVNVVTTKIALKGGKIVNNWLKQLIKVTLVFLFVAAIFYLITPVHVMSKHTDFSSEIIGYKAIDGGVTRDIASTDTCFANKHADFDTWFSQRGGSYTNDKACPLIAAVITREVGFVVPGLPGTILRTTNGDFLHFLPRVFSAVGNICYTPSKLIEYTDFATSACVLAAECTIFKDASGKPVPYCYDTNVLEGSVAYESLRPDTRYVLMDGSIIQFPNTYLEGSVRVVTTFDSEYCRHGTCERSEAGVCVSTSGRWVLNNDYYRSLPGVFCGVDAVNLLTNMFTPLIQPIGALDISASIVAGGIVAIVVTCLAYYFMRFRRAFGEYSHVVAFNTLLFLMSFTVLCLTPVYSFLPGVYSVIYLYLTFYLTNDVSFLAHIQWMVMFTPLVPFWITIAYIICISTKHFYWFFSNYLKRRVVFNGVSFSTFEEAALCTFLLNKEMYLKLRSDVLLPLTQYNRYLALYNKYKYFSGAMDTTSYREAACCHLAKALNDFSNSGSDVLYQPPQTSITSAVLQSGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQSAVKRTIKGTHHWLLLTILTSLLVLVQSTQWSLFFFLYENAFLPFAMGIIAMSAFAMMFVKHKHAFLCLFLLPSLATVAYFNMVYMPASWVMRIMTWLDMVDTSLSGFKLKDCVMYASAVVLLILMTARTVYDDGARRVWTLMNVLTLVYKVYYGNALDQAISMWALIISVTSNYSGVVTTVMFLARGIVFMCVEYCPIFFITGNTLQCIMLVYCFLGYFCTCYFGLFCLLNRYFRLTLGVYDYLVSTQEFRYMNSQGLLPPKNSIDAFKLNIKLLGVGGKPCIKVATVQSKMSDVKCTSVVLLSVLQQLRVESSSKLWAQCVQLHNDILLAKDTTEAFEKMVSLLSVLLSMQGAVDINKLCEEMLDNRATLQAIASEFSSLPSYAAFATAQEAYEQAVANGDSEVVLKKLKKSLNVAKSEFDRDAAM
QRKLEKMADQAMTQMYKQARSEDKRAKVTSAMQTMLFTMLRKLDNDALNNIINNARDGCVPLNIIPLTTAAKLMVVIPDYNTYKNTCDGTTFTYASALWEIQQVVDADSKIVQLSEISMDNSPNLAWPLIVTALRANSAVKLQNNELSPVALRQMSCAAGTTQTACTDDNALAYYNTTKGGRFVLALLSDLQDLKWARFPKSDGTGTIYTELEPPCRFVTDTPKGPKVKYLYFIKGLNNLNRGMVLGSLAATVRLQAGNATEVPANSTVLSFCAFAVDAAKAYKDYLASGGQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDHPNPKGFCDLKGKYVQIPTTCANDPVGFTLKNTVCTVCGMWKGYGCSCDQLREPMLQSADAQSFLNGFAV"

import torch
import esm

# File downloaded with `wget https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt`
model, alphabet = esm.pretrained.load_model_and_alphabet_local("../main/esm1b_t33_650M_UR50S.pt")
batch_converter = alphabet.get_batch_converter()

# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein3", c[:1025]), # <== Setting this to 1022 makes the code pass
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True) # CPU error occurs here

model = model.to(torch.device("cuda"))

with torch.no_grad():
    results = model(batch_tokens.to(torch.device("cuda")), repr_layers=[33], return_contacts=True) # GPU error occurs here

# This was initially working, but no longer does:
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
with torch.no_grad():
    results = model(batch_tokens.to(torch.device("cuda")), repr_layers=[33], return_contacts=True) # GPU error occurs here

Expected behavior

Either a way to handle long sequences, or an error message that explains the length limit, ideally with a note in the readme. That error also really shouldn't poison the GPU in ways that I need to restart the process before I can do any proper computation again, but not sure if you can do anything about it or if that's an issue with torch and/or cuda
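
The second option, an explicit error before the model is ever called, could look like the sketch below. It is an illustration rather than the library's behavior. The 1022-residue limit is the value reported in this issue and is an assumption for any other model, and `check_lengths` and `SequenceTooLongError` are hypothetical names. Because the check runs on plain Python strings, it fails before any tensor reaches the GPU.

```python
MAX_RESIDUES = 1022  # limit reported in this issue; an assumption for other models

class SequenceTooLongError(ValueError):
    """Raised before the model is called, so no device state is touched."""

def check_lengths(data, max_len=MAX_RESIDUES):
    """Validate (label, sequence) pairs and name every offender in the error."""
    too_long = [(label, len(seq)) for label, seq in data if len(seq) > max_len]
    if too_long:
        details = ", ".join(f"{label} ({n} residues)" for label, n in too_long)
        raise SequenceTooLongError(
            f"sequences longer than {max_len} residues are not supported: {details}"
        )
    return data

ok = check_lengths([("protein1", "MKTVRQ"), ("protein3", "A" * 1022)])
```

In the reproduction above, the call would sit just before `batch_converter(data)`.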

Logs

CPU:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/esm/model.py", line 131, in forward
    x = x + self.embed_positions(tokens)
  File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/esm/modules.py", line 225, in forward
    return F.embedding(
  File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

GPU:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[...]
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/esm/model.py", line 149, in forward
    if not padding_mask.any():
RuntimeError: CUDA error: device-side assert triggered

Trying sequences shorter than 1024 afterwards:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
RuntimeError: CUDA error: device-side assert triggered

Additional context

Ubuntu 18.04, python 3.8, torch 1.7.1, cuda 10.2, Driver Version 455.23.05, pip install -U git+https://github.com/facebookresearch/esm with 537ad6a

Version the lib

Would it be possible to add a git tag to the repo, and so version the lib, so that we can package it on conda-forge or similar?

Model fine tuning example

Hi,

Thanks for making the model public! I would like to fine-tune the model for a downstream task; however, your colab only runs inference with the pre-trained model, without any fine-tuning, it seems. Could you advise me on the best way to go about fine-tuning this model, or provide an example script?

Thanks so much.

[Question] Supervised contact map and secondary structure models

Hello,

Are you planning on releasing the model (architecture and weights) that you have trained in a supervised fashion for predicting contact maps and secondary structure in "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences"?

Thank you for making all this awesome work available!

Meanings of tokens in alphabet

I notice that there are several special tokens in the alphabet that are neither amino acids nor gaps. What do they mean?
image

Special Symbols In Proteins

Does the ESM model deal with special symbols for proteins?

Does it deal with input sequences with gaps? For example, sequence = ---AB----C?

Does it deal with ambiguous residues like BZJX?

Thank you!

ESM-MSA-1 unsupervised contact prediction Bug

Bug overview
Unsupervised contact map prediction from ESM-MSA-1 seems bugged.

Bug description
When I generate the contact map, it does not seem able to predict any long-range or even medium-range contacts. I tested it on CASP14 targets, and the performance seems much worse than expected based on the results reported in the manuscript.

Additional information
The following code was used to generate the output:
import torch

model, alphabet = torch.hub.load("facebookresearch/esm", "esm_msa1_t12_100M_UR50S")
batch_converter = alphabet.get_batch_converter()
batch_labels, batch_strs, batch_tokens = batch_converter(data)  # `data` is not shown in the report
with torch.no_grad():
    contact = model.predict_contacts(batch_tokens)

Contact-map

Hi
Thanks again for your great repositories;
I have a question: how did you calculate the contact map in your pretrained model? Did you use any supervised datasets, or just the embedded representations of the amino acids and pairwise scoring?
Thanks
Nasser

Error in Loading State Dict

I am running into an error when trying to reload a fine-tuned version of the models. After further-training, models were saved using the below code:

# Load model
model, alphabet = torch.hub.load("facebookresearch/esm", "esm1_t12_85M_UR50S")

# Training code

# Save model
torch.save(model.state_dict(), BEST_MODEL)

Upon trying to reload the model using the below

# Load model
model, alphabet = torch.hub.load("facebookresearch/esm", "esm1_t12_85M_UR50S")
model.load_state_dict(torch.load(BEST_MODEL))

I run into the error

RuntimeError                              Traceback (most recent call last)
<ipython-input-4-b6ed0ebd023b> in <module>
----> 1 model.load_state_dict(torch.load("esm1_t12_85M_UR50S-Best.pt"))

~/anaconda3/envs/c10_stability/lib/python3.8/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1049
   1050         if len(error_msgs) > 0:
-> 1051             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
   1052                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1053         return _IncompatibleKeys(missing_keys, unexpected_keys)

This is a fairly standard PyTorch error that you get when trying to load the wrong state_dict into a model. Upon further inspection, it looks like all keys in the saved state_dict have a "model." prefix on their names while the keys in the model itself do not (I attached a file with the full error, which shows this mismatch in key names).

What do you recommend for bypassing this problem? Is there a specific GitHub repo tag that I should be trying to work off of? It looks like there might have been some change to the torch.hub models between my original training script and trying to load now. Would it be reasonable to just remove the "model." prefix in the saved state_dict?

Thank you!

StateDictLoadError.txt
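
On the last question, one way to check whether dropping the prefix is enough is to rename the keys before calling load_state_dict. This is a minimal sketch, assuming the only mismatch is a leading "model." on the saved keys. strip_prefix is a hypothetical helper, not part of esm or torch, and the key names and integer values below are made-up stand-ins for real parameter names and tensors.

```python
from collections import OrderedDict

def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    renamed = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in renamed:
            # Two saved keys would collapse into one name; refuse to guess.
            raise KeyError(f"duplicate key after stripping prefix: {new_key}")
        renamed[new_key] = value
    return renamed

# Toy stand-in for a saved checkpoint (real values would be tensors).
saved = OrderedDict([
    ("model.embed_tokens.weight", 0),
    ("model.layers.0.fc1.bias", 1),
])
fixed = strip_prefix(saved)
print(list(fixed))
```

With a real checkpoint, the result would be passed on as model.load_state_dict(strip_prefix(torch.load(BEST_MODEL))). If load_state_dict still reports missing or unexpected keys afterwards, the mismatch is more than a prefix.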

Extracting per-residue/per-protein embeddings on GPU

Hey,

Thank you for doing this research, which is needed to address many biotech problems.

Is there any plan to add support for extracting per-residue embeddings on GPU (multi-GPU)?

...

# Extract per-residue embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[34])

...

I have another question: how can I use the ESM embeddings to get a per-protein vector?
Is it enough to apply mean(dim=0)?
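
On the per-protein question, averaging the per-residue vectors over the sequence dimension is a common way to get one vector per protein. Here is a minimal sketch with plain Python lists standing in for tensors. It assumes one special token is prepended and that any end or padding tokens follow the residues. mean_pool and its parameters are hypothetical names, not part of esm.

```python
def mean_pool(token_vectors, seq_len, n_prefix=1):
    """Average the per-residue vectors of one sequence, skipping special tokens.

    token_vectors: list of per-position vectors (each a list of floats)
    seq_len: number of residues in the sequence
    n_prefix: number of special tokens placed before the first residue (assumed 1)
    """
    residues = token_vectors[n_prefix : n_prefix + seq_len]
    dim = len(residues[0])
    return [sum(vec[d] for vec in residues) / len(residues) for d in range(dim)]

# Positions: start token, residue 1, residue 2, end token, padding.
token_vectors = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [0.0, 0.0]]
protein_vector = mean_pool(token_vectors, seq_len=2)
print(protein_vector)
```

Applying mean(dim=0) to the whole row would also average in the special and padding positions, which is why the slice is taken first.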

Thanks,
Piotr

OOM on Colab

When trying to run
python extract.py esm1_t34_670M_UR50S examples/P62593.fasta examples/P62593_reprs/ --repr_layers 34 --include mean
I get:

tcmalloc: large alloc 2676842496 bytes == 0x548b6000 @  0x7f739765cb6b 0x7f739767c379 0x7f734850974e 0x7f734850b7b6 0x7f7382f74ba5 0x7f7392c2f1d9 0x551555 0x5a9dac 0x50a433 0x50beb4 0x507be4 0x508ec2 0x5a4c61 0x5a4fb8 0x4e012e 0x50a461 0x50beb4 0x507be4 0x588e5c 0x59fd0e 0x50d256 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x509900 0x50a2fd 0x50cc96 0x5095c8 0x50a2fd
tcmalloc: large alloc 2676842496 bytes == 0xf418c000 @  0x7f739765cb6b 0x7f739767c379 0x7f734850974e 0x7f734850b7b6 0x7f7382f74ba5 0x7f7392c2f1d9 0x551555 0x5a9dac 0x50a433 0x50beb4 0x507be4 0x508ec2 0x5a4c61 0x5a4fb8 0x4e012e 0x50a461 0x50beb4 0x507be4 0x588e5c 0x59fd0e 0x50d256 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x509900 0x50a2fd 0x50cc96 0x5095c8 0x50a2fd

I'm guessing that the Colab GPU (a T4 with 15 GB of memory in my case) is unable to pull the entire model into memory? Anybody else running into this?
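
A quick sanity check on the number in the log, assuming the allocation holds 4-byte float32 values (the log does not say so). tcmalloc "large alloc" lines are generally informational messages about host memory rather than GPU out-of-memory errors, so this arithmetic only shows what the allocation might correspond to.

```python
alloc_bytes = 2676842496      # size reported in the tcmalloc message
bytes_per_float32 = 4         # assumption: values are stored as float32

n_values = alloc_bytes // bytes_per_float32
size_gib = alloc_bytes / 2**30

print(n_values)               # 669210624
print(round(size_gib, 2))     # size of the allocation in GiB
```

Roughly 669 million values matches the 670M in the model name esm1_t34_670M_UR50S, so one plausible reading is that the message is a single copy of the model weights being allocated in host memory.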

[Question] Similarity search using the embeddings of the training dataset (uniparc)

Hi,

For people working in the field of protein science, it'd be useful to find the sequences/structures in uniparc that are similar to a given sequence in the esm embedding space, as a great alternative to the existing protein sequence-based search tools and methods.

Is the easiest way to do that 1) download uniparc 2) get esm-1b embeddings 3) build a kNN index via Faiss or pynndescent, or are you planning to release a script and/or a kNN/faiss index to facilitate that somehow?
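
Step 3 can be sketched without any index library as a brute-force cosine-similarity search. This is only an illustration of the idea on toy 3-dimensional vectors. At UniParc scale an approximate index such as Faiss or pynndescent would replace the linear scan. The identifiers and vectors below are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query, database, k=2):
    """Return the k (id, similarity) pairs most similar to query, best first."""
    scored = [(seq_id, cosine(query, vec)) for seq_id, vec in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Toy embedding database: sequence id -> embedding vector.
database = {
    "seqA": [1.0, 0.0, 0.0],
    "seqB": [0.9, 0.1, 0.0],
    "seqC": [0.0, 1.0, 0.0],
}
hits = nearest([1.0, 0.0, 0.0], database, k=2)
print(hits)
```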

I can imagine that you have been already using a Faiss instance to do that internally, but the question is whether you'd like to release it or not 😄

Cheers.

Two questions about ESM_MSA

Hi,
Thanks so much for sharing your great work.
I have two questions, and I would appreciate answers to both:
1: After downloading esm1b and esm-MSA, I had no problem loading esm1b, but I get the error below when I try to load esm-MSA:

in esm_msa1_t12_100M_UR50S
return load_model_and_alphabet_hub("esm_msa1_t12_100M_UR50S")
in load_model_and_alphabet_hub
model_data = load_hub_workaround(url)
in load_hub_workaround
f"{torch.hub.get_dir()}/checkpoints/{fn}",
AttributeError: module 'torch.hub' has no attribute 'get_dir'

2: Also, in the MSA Transformer paper, did you use 1024 sequences as input for the section below? If so, how did you manage GPU memory? The contact prediction example uses only 64 sequences, which works fine, but is there any trick we should use when increasing this to 1024?

import itertools
from typing import List, Tuple
from Bio import SeqIO

def read_msa(filename: str, nseq: int) -> List[Tuple[str, str]]:
    """ Reads the first nseq sequences from an MSA file, automatically removes insertions."""
    return [(record.description, remove_insertions(str(record.seq)))
            for record in itertools.islice(SeqIO.parse(filename, "fasta"), nseq)]
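
read_msa relies on a remove_insertions helper that is not shown above. Here is a minimal sketch of what such a helper could look like, assuming a3m-style conventions in which lowercase letters and "." mark insertions relative to the query. The exact helper used in the example may differ, and treating "*" as removable is also an assumption.

```python
import string

# Characters treated as insertions: lowercase letters, plus "." and "*" (assumed).
_INSERTION_CHARS = string.ascii_lowercase + ".*"
_DELETE_TABLE = str.maketrans("", "", _INSERTION_CHARS)

def remove_insertions(sequence: str) -> str:
    """Drop insertion characters so every row of the MSA has the same length."""
    return sequence.translate(_DELETE_TABLE)

cleaned = remove_insertions("MK-tvrQE.R")
print(cleaned)
```

Uppercase residues and "-" gap characters are kept, so aligned columns stay in register across rows.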

Thanks so much

Embedding not reproducible? [token_representation] giving different results each time

Hello,

I was trying to get the embedding representations from the MSA transformer.
The code:

import torch
import esm

# Load MSA Transformer model
model, alphabet = esm.pretrained.esm_msa1_t12_100M_UR50S()
batch_converter = alphabet.get_batch_converter()

# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYIIVATPRGYVLAGG"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[12], return_contacts=True)
token_representations = results["representations"][12]

Each time that I run the code, token_representations is different. Is this the expected behavior?

ModuleNotFoundError: fused_layer_norm_cuda

Following the QuickStart instruction, I load the model through PyTorch Hub.

model, alphabet = torch.hub.load("facebookresearch/esm", "esm1b_t33_650M_UR50S")

Exception experienced:

ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

Double checked torch.__version__ as 1.7.0. Can provide more info if needed.

How come the networks include <cls>, <eos>, <unk> and other similar tokens?

This is not an issue, so I apologize for putting it here, but I didn't really know where else to ask.

I have been testing out the various pretrained networks you have trained in this repository, and they seem very interesting and I might use them in a paper I'm working on, so I would like to understand it in detail.
One thing I do not understand about the networks is why they include so many special tokens?
I get that you need the masking token, and similarly the padding token for handling proteins batched together with various sizes.
The cls and eos are used just before and after a protein, but seem unnecessary for proteins unless I'm missing something?
The unk token should signal that an amino acid is unknown, if I understand correctly, but isn't X generally the catch-all for unknown amino acids in protein notation? So what is the use case here?
And similarly for the last few tokens used which I have no good guess for.

should stop token be in sequence representation?

The example code in the Quick Start section of the github readme page shows this excerpt:

sequence_representations = []
for i, (_, seq) in enumerate(data):
    sequence_representations.append(token_representations[i, 1 : len(seq) + 1].mean(0))

The sequence_representations will then include the last position of token_representations, which appears to be the stop token.
Is this intended?
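
The slice bounds can be checked with plain Python lists. This sketch assumes the token layout is one start token, then the residues, then one end token, then padding, which may not hold for every model in the repository. The token strings are illustrative.

```python
seq = "MKTV"

# Assumed token layout for one sequence: <cls>, residues, <eos>, then padding.
tokens = ["<cls>"] + list(seq) + ["<eos>", "<pad>", "<pad>"]

# Same slice as in the Quick Start excerpt: positions 1 .. len(seq), end exclusive.
selected = tokens[1 : len(seq) + 1]
print(selected)
```

Because Python slices exclude the end index, position len(seq) + 1 (the end token under this layout) is not part of the slice, and only the residue positions are averaged.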

Decoding SSP labels

I took a look at the SSP labels that come with the Structural Split dataset and those include (T, E, X, B, H, G, S, I, -). In the paper, it says these labels were pulled from Joosten et al 2010 where the labels correspond to (BEGHITS). What does the X character represent?

How can I extract attention maps?

Hi there,
I hope you are well.
I have a question: given a sequence as input to esm1b, how can I extract the 660 attention maps, one for each head in each layer?
Thanks so much
