
songlab-cal / tape

638 stars · 22 watchers · 131 forks · 860 KB

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.

Home Page: https://www.biorxiv.org/content/10.1101/676825v1

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 97.36%, TeX 1.48%, Shell 1.16%
Topics: deep-learning, protein-sequences, protein-structure, semi-supervised-learning, benchmark, language-modeling, dataset, pytorch

tape's People

Contributors: chrislengerich, rmrao, tdiethe, thomas-a-neil


tape's Issues

Language modeling targets

In the language modeling task, I think the goal is to predict the next amino acid? But when constructing the dataset, the targets look exactly the same as the original inputs (except for the padding value). If I have not misunderstood, could you check that part?
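For context, a common autoregressive setup (an assumption about what the dataset may be doing here, not a confirmed description of tape's internals) is to store targets identical to the inputs and perform the one-position shift inside the loss computation:

import torch
import torch.nn.functional as F

def next_token_loss(logits, targets, ignore_index=-1):
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len), equal to the inputs.
    # The shift happens here: position i is scored against the token at position i + 1.
    shifted_logits = logits[:, :-1].contiguous()
    shifted_targets = targets[:, 1:].contiguous()
    return F.cross_entropy(
        shifted_logits.view(-1, shifted_logits.size(-1)),
        shifted_targets.view(-1),
        ignore_index=ignore_index,  # skip padding positions
    )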

UniRep pretrained weights

Hi,

I would need some clarification as to what the babbler-1900 checkpoint is exactly.
Is this the actual UniRep model from the original paper, or just the UniRep architecture trained on Pfam? Internally, the LM task points to Pfam, as it does for every other model.

The reason is that I want to evaluate the perplexity of UniRep on some sequences.

Corrupted stability dataset (raw format)

I downloaded stability.tar.gz using the link on the README page and uncompressed it with Archive Manager on Ubuntu 18.04.
stability_train.json and stability_valid.json look OK,
but stability_test looks corrupted.

Corrupted raw remote homology dataset

In the raw remote homology dataset, remote_homology_test_family_holdout.json is a binary file. Similar to #49, it appears to be an archive containing remote_homology_test_fold_holdout.json and remote_homology_test_superfamily_holdout.json.
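A quick way to check the issue's guess (a sketch only, assuming the file really is a tar archive, which is not verified here) is to probe and extract it with Python's tarfile module:

import tarfile

path = "remote_homology_test_family_holdout.json"  # the suspicious binary file

if tarfile.is_tarfile(path):
    with tarfile.open(path) as archive:
        print(archive.getnames())  # expect the fold/superfamily holdout JSON files
        archive.extractall("remote_homology_extracted")
else:
    print("Not a tar archive; inspect the first bytes to identify the format.")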

Visualization

I stopped training your 'transformer' model on my language modeling task, and in the 'results' folder I have several files, including the log file.

Could you give an example of how to view the results saved inside? I tried to open the log with tensorboardX, but it turns out not to contain any data to show.

Furthermore, I saw that the repo contains a file called 'visualization.py', which I believe is meant for this, but I have not understood how to use it.
Thanks a lot!

UniRep model pooled output returns 3800 vector values

Hey,

I would like to use protein embedding based on the pre-trained UniRep model.
I used the code below:

import torch
from tape import UniRepModel, TAPETokenizer

device = 'cuda:0'

# Load the pre-trained UniRep (babbler-1900) weights and move the model to the GPU
model = UniRepModel.from_pretrained('babbler-1900')
model.cuda(device=device)

tokenizer = TAPETokenizer(vocab='unirep')

sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor(
    [tokenizer.encode(sequence)],
    device=device
)

output = model(token_ids)

sequence_output = output[0]  # per-residue embeddings
pooled_output = output[1]    # sequence-level embedding

What I discovered is that the pooled output shape is (X, 3800) instead of (X, 1900), as I expected. Could you tell me why it is doubled?

When I used ProteinBertModel with bert-base in the same way, I got the expected shape.

Thanks for clarification.

Piotr
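For reference, the UniRepModel code quoted further down this page builds the pooled output as torch.cat(hidden_states, 1), i.e. the final hidden state and the final cell state concatenated, which would explain the 3800 dimensions. A minimal sketch of two ways to get a 1900-dimensional embedding instead (the ordering of the concatenation and the mean-pooling choice are assumptions):

import torch
from tape import UniRepModel, TAPETokenizer

model = UniRepModel.from_pretrained('babbler-1900').eval()
tokenizer = TAPETokenizer(vocab='unirep')

token_ids = torch.tensor([tokenizer.encode('GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ')])
with torch.no_grad():
    sequence_output, pooled_output = model(token_ids)[:2]

hidden_1900 = pooled_output[:, :1900]   # assumption: the final hidden state is the first half
avg_1900 = sequence_output.mean(dim=1)  # mean over residues, as the UniRep paper describes
print(hidden_1900.shape, avg_1900.shape)  # both (1, 1900)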

tape-train error

When I use tape-train transformer masked_language_modeling --batch_size 16 --learning_rate 0.001 --warmup_steps 100, it works. But when I use tape-train lstm language_modeling --batch_size 16 --learning_rate 0.001 --warmup_steps 100, an error occurs. I want to know how to train the LSTM model.

Reading output_filename.npz

I tried to produce the file 'output_filename.npz' using the command:

tape-embed transformer my_input.fasta output_filename.npz bert-base --tokenizer iupac.

The file is correctly created, but when I try to import it, I can't access the data correctly.

First (following the instructions in the README file), I load the file with the command:

arrays = np.load('output_filename.npz', allow_pickle=True)

The file loads correctly. Then I created a list of the keys (in my case the protein sequence ids) and of the values (the dictionaries of the 'pooled' and 'avg' vectors associated with each protein id):

keys = list(arrays.keys())
values = []
for key in keys:
    values.append(arrays[key])

But in my case the value for each key is a 0-dimensional numpy array, not a dictionary as indicated in the README file. I can print it, but I have no way to access the 'pooled' and 'avg' fields as I would like. What am I doing wrong?

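When numpy saves a Python dict as an entry of an .npz archive, it comes back as a 0-dimensional object array, and calling .item() on it recovers the dict. A minimal sketch of accessing the 'pooled' and 'avg' fields under that assumption:

import numpy as np

arrays = np.load('output_filename.npz', allow_pickle=True)

for seq_id in arrays.files:
    entry = arrays[seq_id].item()  # unwrap the 0-d object array into a dict
    pooled = entry['pooled']       # fixed-size sequence-level embedding
    avg = entry['avg']             # average of the per-residue embeddings
    print(seq_id, pooled.shape, avg.shape)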

trRosetta problem with tape_embed

Hello,

I'm trying to get embeddings with trRosetta as:

tape-embed trrosetta my_fasta_file.fasta my_outout.npz xaa

But I get the error below. I installed tape first from the pip repo and then from GitHub, but the error persists.

Traceback (most recent call last):
  File "/truba/home/sunsal/Apps/anaconda3/envs/tape_new/bin/tape-embed", line 8, in <module>
    sys.exit(run_embed())
  File "/truba/home/sunsal/Apps/anaconda3/envs/tape_new/lib/python3.7/site-packages/tape/main.py", line 234, in run_embed
    training.run_embed(**embed_args)
  File "/truba/home/sunsal/Apps/anaconda3/envs/tape_new/lib/python3.7/site-packages/tape/training.py", line 629, in run_embed
    model_type, task_spec.name, model_config_file, from_pretrained)
  File "/truba/home/sunsal/Apps/anaconda3/envs/tape_new/lib/python3.7/site-packages/tape/registry.py", line 214, in get_task_model
    model_cls = task_spec.get_model(model_name)
  File "/truba/home/sunsal/Apps/anaconda3/envs/tape_new/lib/python3.7/site-packages/tape/registry.py", line 44, in get_model
    return self.models[model_name]
KeyError: 'trrosetta'

f-string syntax fails with Python 3.5

Thanks for the interesting package, and looking forward to trying it!

On import, I got the following:

>>> from tape import ProteinBertModel
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Continuum\anaconda3\envs\tape_proteins\lib\site-packages\tape\__init__.py", line 1, in <module>
    from . import datasets  # noqa: F401
  File "D:\Continuum\anaconda3\envs\tape_proteins\lib\site-packages\tape\datasets.py", line 31
    raise ValueError(f"Unrecognized datafile type {data_file.suffix}")
                                                                    ^
SyntaxError: invalid syntax
>>> from tape import TAPETokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Continuum\anaconda3\envs\tape_proteins\lib\site-packages\tape\__init__.py", line 1, in <module>
    from . import datasets  # noqa: F401
  File "D:\Continuum\anaconda3\envs\tape_proteins\lib\site-packages\tape\datasets.py", line 31
    raise ValueError(f"Unrecognized datafile type {data_file.suffix}")
                                                                    ^
SyntaxError: invalid syntax

System info

{'commit_hash': '2eec187d3',
 'commit_source': 'installation',
 'default_encoding': 'cp437',
 'ipython_path': 'D:\\Continuum\\anaconda3\\envs\\tape_proteins\\lib\\site-packages\\IPython',
 'ipython_version': '7.9.0',
 'os_name': 'nt',
 'platform': 'Windows-10-10.0.17134-SP0',
 'sys_executable': 'D:\\Continuum\\anaconda3\\envs\\tape_proteins\\python.exe',
 'sys_platform': 'win32',
 'sys_version': '3.5.5 | packaged by conda-forge | (default, Jul 24 2018, '
                '01:52:17) [MSC v.1900 64 bit (AMD64)]'}

Assertion Error in Tape-Embed

I am getting an error from tape-embed when using a GPU. I am using a batch size of 4 because anything larger causes GPU memory issues. Please advise.

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [256,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
Traceback (most recent call last):
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/bin/tape-embed", line 8, in <module>
    sys.exit(run_embed())
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/tape/main.py", line 234, in run_embed
    training.run_embed(**embed_args)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/tape/training.py", line 642, in run_embed
    outputs = runner.forward(batch, no_loss=True)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/tape/training.py", line 86, in forward
    outputs = self.model(**batch)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/tape/models/modeling_bert.py", line 449, in forward
    chunks=None)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/tape/models/modeling_bert.py", line 346, in forward
    layer_outputs = layer_module(hidden_states, attention_mask)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/tape/models/modeling_bert.py", line 285, in forward
    attention_outputs = self.attention(hidden_states, attention_mask)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/tape/models/modeling_bert.py", line 242, in forward
    self_outputs = self.self(input_tensor, attention_mask)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/6000341/saby2k13/pyDPPI/kit/medver/lib/python3.6/site-packages/tape/models/modeling_bert.py", line 180, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: device-side assert triggered

CUDA oom with batch size 1

I ran this command (with NVIDIA Apex installed):

tape-train-distributed transformer masked_language_modeling --nproc_per_node 4 --batch_size 1024 --gradient_accumulation_steps 256

and got the following output:

RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 2; 7.43 GiB total capacity; 6.81 GiB already allocated; 6.94 MiB free; 6.94 GiB reserved in total by PyTorch)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/tape/utils/distributed_utils.py", line 38, in _wrap
    fn(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/tape/main.py", line 191, in run_train
    training.run_train(**train_args)
  File "/opt/conda/lib/python3.7/site-packages/tape/training.py", line 536, in run_train
    break
  File "/opt/conda/lib/python3.7/site-packages/tape/utils/utils.py", line 244, in __exit__
    raise RuntimeError(message)
RuntimeError: CUDA out of memory. Reduce batch size or increase gradient_accumulation_steps to divide each batch over more forward passes.
        Hyperparameters:
                batch_size per backward-pass: 1024
                gradient_accumulation_steps: 256
                n_gpu: 4
                batch_size per (gpu * forward-pass): 1

I got the same output with tape-train as with tape-train-distributed. I'm running on a machine with 4 GPUs. How can I be running out of memory if the batch size (per GPU, per forward pass) is 1? What can I do to actually run the training code on my machine?

Fine-Tuning of Bert model

I have a protein sequence dataset (FASTA file) and I want to fine-tune your transformer model adding a task that works on this dataset.

Following the instructions in the "adding_task.py" file, I generated the new task correctly, and when I enter the command "python adding_task transformer my_new_task", it executes correctly.

However, when the training phase starts, an error is reported.


Apparently, the "J" token is not contained in the IUPAC vocabulary. I checked and it is indeed missing, so I tried to add it manually, but the problem is not resolved, as if the change were not detected.
Is it possible to make such a change? If so, how can I make the program recognize it?

And is it possible to force, for example, the 'convert_to_ids' function in the tokenizers.py file to map the token J to the "<unk>" token in the IUPAC vocab? Even this change does not seem to work.
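One possible workaround, sketched below (not a maintainer-endorsed fix), is to map residue codes that are missing from the IUPAC vocabulary onto a token it does contain, such as X, before tokenizing:

from tape import TAPETokenizer

tokenizer = TAPETokenizer(vocab='iupac')

def sanitize(sequence: str) -> str:
    # Map the ambiguous residue code 'J' (absent from the IUPAC vocab) onto 'X'
    # (assumption: 'X' is an acceptable stand-in for your task).
    return sequence.replace('J', 'X')

token_ids = tokenizer.encode(sanitize('MKJLLVLG'))  # hypothetical example sequence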

Loss and perplexity

Hello, are the loss and perplexity printed after each iteration of tape-train computed on the training set or on the validation set?

batch sizes

Thank you very much for your great repo! This is really helpful.

I have a question: in your paper, especially in Section "A.3 Training Details", you wrote

specific batch sizes for each model at each protein length are available in our repository

however, I could not find these. Could you point me to the corresponding information in this repo?

Thank you.

Unable to locate data_utils module

Thank you for this incredibly useful and well-designed library! I couldn’t find the data_utils module that is imported in the preprocessing scripts. Would it be possible to include that in the repo as well? Many thanks in advance!

Question: Is ProteinBertModel a Devlin 2018 BERT trained on Pfam?

Hi, I got a little bit confused. Maybe you already mentioned it and I didn't find it.

As far as I understand the pre-trained ProteinBertModel (Transformer) is from huggingface, so it is the model described by Devlin 2018.

In your paper Evaluating Protein Transfer Learning with TAPE however the Transformer is a model as described by Vaswani 2017.

So, did you train the Devlin 2018 Transformer again on Pfam, as you did before for the Vaswani 2017 Transformer?

Best
Marc

Embed not finishing for large files

I am trying to embed the whole human proteome on my Mac, but I am running into the error below. It works fine for small files.

tape-embed transformer y.fasta human.npz bert-base
20/01/24 01:09:11 - INFO - tape.training - device: cpu n_gpu: 1
20/01/24 01:09:11 - INFO - tape.models.modeling_utils - loading configuration file https://s3.amazonaws.com/proteindata/pytorch-models/bert-base-config.json from cache at /Users/sabby/.cache/torch/protein_models/6373bfcc1f9755cd1b90c75a4c82e5b6ace8db121253e86653b69c3fec08ed04.05edb4ed225e1907a3878f9d68b275d79e025b667555aa94a086e27cb5c591e0
20/01/24 01:09:11 - INFO - tape.models.modeling_utils - Model config {
"attention_probs_dropout_prob": 0.1,
"base_model": "transformer",
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"input_size": 768,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 8192,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"num_labels": -1,
"output_attentions": false,
"output_hidden_states": false,
"output_size": 768,
"pruned_heads": {},
"torchscript": false,
"type_vocab_size": 1,
"vocab_size": 30
}

20/01/24 01:09:11 - INFO - tape.models.modeling_utils - loading weights file https://s3.amazonaws.com/proteindata/pytorch-models/bert-base-pytorch_model.bin from cache at /Users/sabby/.cache/torch/protein_models/6d307d719bd3cf0453747b0ae7362b8297eb16a045ac50aee4ed9e8890afe773.8206daaea9be2736b6ccde432df9dc3dbb8c3233b47f07688d6ff38d74258d22
0%| | 0/20 [00:00<?, ?it/s]
Killed: 9

Is this related to memory? Please advise.
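A practical workaround (a sketch only, assuming memory is indeed the limit; the file names and chunk size are illustrative) is to split the proteome into smaller FASTA chunks with Biopython and embed each chunk separately:

from Bio import SeqIO

records = list(SeqIO.parse("y.fasta", "fasta"))
chunk_size = 1000  # tune to whatever size finished successfully for you

for i in range(0, len(records), chunk_size):
    chunk_path = f"y_chunk_{i // chunk_size}.fasta"
    SeqIO.write(records[i:i + chunk_size], chunk_path, "fasta")
    # then, per chunk, run:
    #   tape-embed transformer y_chunk_<n>.fasta human_chunk_<n>.npz bert-base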

secondary structure dictionary

Hello,

I'm training a model for secondary structure prediction using three classes, but I'm not sure how you encode those classes in the data (ss3). Could you please help me with this?

Many thanks!

Javier.

Question about onehot contact prediction model

Hi,
I was able to get the onehot model working for contact prediction (after making a few changes, like setting defaults in the config class); however, the performance is much lower (~0.06 precision@L/5) than reported in the paper.

If I'm understanding the code correctly, the pytorch version of the contact prediction head uses a simple linear classifier, whereas the original implementation also included convolutional layers, as described in the paper.

Is this understanding correct?

Thanks!

2 problems - SOLVED

Hello, I wanted to let you know that I encountered two problems and resolved them for myself; you may want to fold the fixes back into the repo.

  1. After running the first command I was told "from Bio import SeqIO: No module named Bio" - solved with pip install biopython.
  2. "ValueError: Object arrays cannot be saved when allow_pickle=False" - solved by changing the source code at tape/utils/utils.py line 283 to pass "allow_pickle=True".

Thanks for sharing your code :)

CUDA out of memory issue with contact map and masked_language_model training

Just wondering whether anyone has encountered the "CUDA out of memory" issue before. When I tried training the contact_prediction model and the masked_language_model with the following batch_size configurations, I received the "CUDA out of memory" error.

For example:

by having the following batch size configuration for masked_language_model:

Hyperparameters:
        batch_size per backward-pass: 256
        gradient_accumulation_steps: 64
        n_gpu: 4
        batch_size per (gpu * forward-pass): 2

I receive the following error:

RuntimeError: CUDA out of memory. Tried to allocate 386.00 MiB (GPU 6; 15.78 GiB total capacity; 14.32 GiB already allocated; 89.44 MiB free; 14.63 GiB reserved in total by PyTorch)
RuntimeError: CUDA out of memory. Reduce batch size or increase gradient_accumulation_steps to divide each batch over more forward passes.

The error asks me to reduce the number of samples per forward pass (currently 2 in the configuration above). I tried this as well; however, it only works for masked_language_model training. For contact_prediction I still get the CUDA memory error even when I set the batch_size per forward pass to 1. Has anyone encountered and resolved this before? It prevents me from using a larger "batch per forward pass" configuration.

P.S. This seems to be independent of how many GPUs I use: as long as the batch_size per forward pass is greater than 1, the optimization of the two models stops somewhere in the middle of training, with or without the --fp16 flag.

I did some searching online and found an Apex CUDA memory-leak issue that might be related (not sure, though). I also tried calling torch.cuda.empty_cache() before each forward pass to free unneeded CUDA memory during training. I did see drops in the GPU memory allocation with watch -n 1 nvidia-smi (e.g., from 5000 MB to 2000 MB); however, I still got the same out-of-memory error during training even when the GPU memory usage reported by nvidia-smi was around 2000 MB.

My current GPU has 16 GB of memory. By my visual check, the training program usually dies before the actual GPU usage displayed in nvidia-smi reaches 16 GB, yet the error is still "CUDA out of memory".

'bert-base' for ProteinBertForMaskedLM does not seem to work

I'm trying to predict masked tokens. However, the predictions are of low quality.
As a sanity check, I'm not masking any tokens and checking whether the model predicts back the original tokens with reasonable accuracy (see the code below).
However, I'm encountering two issues:

  1. The predictions vary, even when the model is put in eval() mode
  2. The predictions are of very low quality (near 0% accuracy)

Can you please tell me whether loading 'bert-base' for masked language modeling is not supported yet, or otherwise whether I am doing something wrong?

Thanks!

import numpy as np
import torch
from tape import TAPETokenizer,  ProteinBertForMaskedLM

tokenizer = TAPETokenizer(vocab='iupac')  # iupac is the vocab for TAPE models, use unirep for the UniRep model

model = ProteinBertForMaskedLM.from_pretrained('bert-base')
model.eval() # deactivate dropout


# Pfam Family: Hexapep, Clan: CL0536
sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = tokenizer.encode(sequence)
input_tensor = torch.tensor([token_ids])

# Predict all tokens
with torch.no_grad():
    outputs = model(input_tensor)
    predictions = outputs[0]
    logits = predictions[0] # only one sequence
    
pred_ids = logits.detach().numpy().argmax(1)
pred_tokens = tokenizer.convert_ids_to_tokens(pred_ids)
pred_seq = tokenizer.convert_tokens_to_string(pred_tokens)
print(pred_seq)
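For reference, a minimal way to quantify the sanity check above (continuing from that snippet, so the variable names token_ids and pred_ids are the same) is to compare the predicted ids against the input ids directly:

import numpy as np

# token_ids and pred_ids come from the snippet above
accuracy = float(np.mean(np.asarray(token_ids) == pred_ids))
print(f"token recovery accuracy: {accuracy:.1%}")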

bert-base embeddings

I tried to run the following command:
tape-embed unirep my_input.fasta output_filename.npz bert-base --tokenizer unirep --full_sequence_embed

Why is the bert-base embedding not working, and why do I only get 1900-dimensional embeddings?

thanks!

Training UniRep language model error

Hi,

When I try to train a language model with UniRep I get the following error:
"RuntimeError: shape '[-1,26]' is invalid for input of size 2800"

Looking at the UniRepForLM Class, it says in a comment:
"# TODO: Fix this for UniRep - UniRep changes the size of the targets"

Does that mean that currently, it is not possible to train a UniRep model for a language modelling task?

evaluation

The model was successfully trained with tape-train, and at the end a folder was created under the results folder. In this folder I could not find anything about the evaluation of the results, such as accuracy. What should I do to get the evaluation?

Examples for using code (not from command line?)

Hi - would it be possible to provide some examples of how to use the models (pretrained or novel) to train or fine-tune on new tasks, or on the existing tasks, other than from the command line?

(The models and classes (e.g. the tokenizer) are not 1:1 with huggingface's format, which makes applying them to new problems hard. Even a simple example of training on a given set of sequences and labels, in code, would be great - currently I'm running into issues applying these models, or running my own on top of them, e.g. when loading the data.)
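As a starting point, here is a minimal in-code sketch (not an official example; the training loop, the toy labels, and the mean-pooling choice are assumptions) that embeds sequences with the pre-trained ProteinBertModel and fits a small regression head on top:

import torch
from torch import nn
from tape import ProteinBertModel, TAPETokenizer

tokenizer = TAPETokenizer(vocab='iupac')
bert = ProteinBertModel.from_pretrained('bert-base')
bert.eval()  # use the frozen encoder as a feature extractor

sequences = ['GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ', 'MKTAYIAKQR']  # toy data
labels = torch.tensor([[0.5], [1.2]])                               # toy regression targets

def embed(seq):
    ids = torch.tensor([tokenizer.encode(seq)])
    with torch.no_grad():
        sequence_output, pooled_output = bert(ids)[:2]
    return sequence_output.mean(dim=1)  # average over residues -> (1, hidden_size)

features = torch.cat([embed(s) for s in sequences])  # (n_sequences, hidden_size)

head = nn.Linear(features.size(-1), 1)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(head(features), labels)
    loss.backward()
    optimizer.step()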

Add Bepler's model

Hi, I'm currently working with Bepler's model and would like a PyTorch version to evaluate on TAPE. Would you consider adding that model later? Thanks!

Reverse sequence for LSTM model

Within LSTMEncoder, when computing the reverse outputs through the LSTM with reversed inputs (i.e. the reverse_lstm layers), reverse_sequence does not return anything when input_mask=None (see below). This is because of incorrect indentation on L127.

def reverse_sequence(self, sequence, input_mask):
    if input_mask is None:
        idx = torch.arange(sequence.size(1) - 1, -1, -1)
        reversed_sequence = sequence.index_select(1, idx, device=sequence.device)
    else:
        sequence_lengths = input_mask.sum(1)
        reversed_sequence = []
        for seq, seqlen in zip(sequence, sequence_lengths):
            idx = torch.arange(seqlen - 1, -1, -1, device=seq.device)
            seq = seq.index_select(0, idx)
            seq = F.pad(seq, [0, 0, 0, sequence.size(1) - seqlen])
            reversed_sequence.append(seq)
        reversed_sequence = torch.stack(reversed_sequence, 0)
        return reversed_sequence  # note: indented inside the else branch (the L127 bug), so the input_mask=None path returns None
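A minimal standalone sketch of the corrected control flow (an illustration of the fix the issue describes, not a patch against the repo; the device handling and padding conversion are adjusted so it runs on its own):

import torch
import torch.nn.functional as F

def reverse_sequence(sequence, input_mask=None):
    """Reverse a (batch, seq_len, features) tensor along the sequence dimension."""
    if input_mask is None:
        idx = torch.arange(sequence.size(1) - 1, -1, -1, device=sequence.device)
        reversed_sequence = sequence.index_select(1, idx)
    else:
        sequence_lengths = input_mask.sum(1).long().tolist()
        reversed_sequence = []
        for seq, seqlen in zip(sequence, sequence_lengths):
            idx = torch.arange(seqlen - 1, -1, -1, device=seq.device)
            rev = seq.index_select(0, idx)
            rev = F.pad(rev, [0, 0, 0, sequence.size(1) - seqlen])
            reversed_sequence.append(rev)
        reversed_sequence = torch.stack(reversed_sequence, 0)
    return reversed_sequence  # returned on both branches, unlike the snippet above

# Quick check of the previously broken input_mask=None path:
x = torch.arange(6, dtype=torch.float).reshape(1, 3, 2)
print(reverse_sequence(x))  # rows appear in reverse order along dim 1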

tape-embed fails under PyTorch 1.5.0

I was trying to run tape-embed, but received the following error message (everything went fine when I ran it with the --no_cuda flag):

(protein) wbogud@cuda:~/projects/protein$ time tape-embed transformer ../data/test.fasta embeddings.npz models/tape/bert-base/
20/04/22 16:12:11 - INFO - tape.training -   device: cuda n_gpu: 4
20/04/22 16:12:11 - INFO - tape.models.modeling_utils -   loading configuration file models/tape/bert-base/config.json
20/04/22 16:12:11 - INFO - tape.models.modeling_utils -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "base_model": "transformer",
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "input_size": 768,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 8192,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": -1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_size": 768,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 1,
  "vocab_size": 30
}

20/04/22 16:12:11 - INFO - tape.models.modeling_utils -   loading weights file models/tape/bert-base/pytorch_model.bin
  0%|                                                                                                               | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/wbogud/anaconda3/envs/protein/bin/tape-embed", line 8, in <module>
    sys.exit(run_embed())
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/main.py", line 234, in run_embed
    training.run_embed(**embed_args)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/training.py", line 642, in run_embed
    outputs = runner.forward(batch, no_loss=True)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/training.py", line 86, in forward
    outputs = self.model(**batch)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/models/modeling_bert.py", line 443, in forward
    dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration

Downgrading to PyTorch 1.4.0 solved the issue.

Could the error be related to a known issue of PyTorch 1.5.0 described at https://github.com/pytorch/pytorch/releases/tag/v1.5.0? (torch.nn.parallel.DistributedDataParallel does not work in Single-Process Multi-GPU mode)

decode fluorescence sequences

I'm trying to decode the sequences in the fluorescence benchmark dataset, but when I use the default tokenizer it returns nonsensical sequences.
I've included an example below.

Can you tell us how to decode the sequences?

import dotenv
import os
from pathlib import Path
import json


dotenv.load_dotenv()
data_path = Path(os.getenv("DATA_ROOT"))

task = "fluorescence"
fpath = data_path / f"tape/{task}/{task}_train.json"
with open(fpath,'r') as f:
    train = json.load(f)
    
from tape import TAPETokenizer
tokenizer = TAPETokenizer(vocab='iupac')

''.join(tokenizer.convert_ids_to_tokens(train[0]['primary']))
>>> 'RIFDDKESFUUOHKUDKCFCUMFGIERURFDFDFC<unk>SXFIKSKIEHBSSFIKOUOVOSKUSSKRXFUPBERQXOCGLIPGCEEIR<unk>LODFXUPDQSHEEICCFMXISQ<unk>DUIEDFCSKUMQHDKIFHCEIDCFMHKFGIKDXMXMRGMUXHL<unk>CIPIMFHIUMEIHQGIHDCFRUPK<unk>CGXPPMSOHFCFOUKKOCMGXKRSPR<unk>KRICOMDIQCGLUKKDEUS<unk><unk>FHSGFLCDQXI'

tape-embed fails with more than 100k sequences in fasta file

tape-embed works well with 1,000 sequences but produces the following error with FASTA files containing a large number of sequences (more than roughly 10k):
0%| | 0/49 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/bin/tape-embed", line 8, in <module>
    sys.exit(run_embed())
  File "/lib/python3.6/site-packages/tape/main.py", line 232, in run_embed
    training.run_embed(**embed_args)
  File "/lib/python3.6/site-packages/tape/training.py", line 637, in run_embed
    for batch in tqdm(valid_loader, total=len(valid_loader)):
  File "/lib/python3.6/site-packages/tqdm/std.py", line 1102, in __iter__
    for obj in iterable:
  File "/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/lib/python3.6/site-packages/tape/datasets.py", line 211, in __getitem__
    item = self.data[index]
  File "/lib/python3.6/site-packages/tape/datasets.py", line 95, in __getitem__
    record = self._records[key]
  File "/lib/python3.6/site-packages/Bio/File.py", line 430, in __getitem__
    raise ValueError("Key did not match (%s vs %s)" % (key, key2))
ValueError: Key did not match (bsnpaso1144 vs bsnph4874)

0%| | 0/49 [00:11<?, ?it/s]

How to train models on new data (FASTA)

I have some specific data that I want the model to train on; it is stored in FASTA format. How can I use the provided interfaces to train the models on my custom data?
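As a rough starting point, here is a sketch (the file name is hypothetical, and wiring the tokenized records into tape's dataset registry is a separate step not shown) that reads FASTA records with Biopython and tokenizes them with TAPETokenizer:

import numpy as np
from Bio import SeqIO
from tape import TAPETokenizer

tokenizer = TAPETokenizer(vocab='iupac')

def iter_tokenized(fasta_path):
    # Yield (id, token_id_array) pairs for every record in the FASTA file.
    for record in SeqIO.parse(fasta_path, "fasta"):
        yield record.id, np.asarray(tokenizer.encode(str(record.seq)))

for seq_id, token_ids in iter_tokenized("my_custom_data.fasta"):  # hypothetical file name
    print(seq_id, token_ids.shape)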

Data licenses

Would you be able to add licenses for the datasets?
In particular, the pre-trained models may themselves require licenses, depending on the dataset(s) they were trained on?

Questions: About train BERT from scratch parameter?

Hi, I want to train a BERT from scratch, but I have run into a problem with the parameters for tape-train-distributed transformer masked_language_modeling --batch_size BS --learning_rate LR --fp16 --warmup_steps WS --nproc_per_node NGPU --gradient_accumulation_steps NSTEPS, because when I train from scratch the log shows a problem,


and my parameters are:

DATA_DIR=data/
BS=1024
LR=0.0005
NGPU=8
GA=32

tape-train-distributed transformer masked_language_modeling \
    --data_dir ${DATA_DIR} \
    --batch_size ${BS} \
    --learning_rate ${LR} \
    --warmup_steps 5000 \
    --fp16 \
    --nproc_per_node ${NGPU} \
    --gradient_accumulation_steps ${GA}

So I would like to know what parameters you used.

thank you!

UniRep or BERT fine tuning examples?

I'm just beginning to work with the TAPE package, specifically with the hope of testing both UniRep and BERT protein embeddings in my system. I anticipate that I will need to do some fine tuning of the base models first, but I'm a bit unsure how to approach that. Are there any examples of this approach for either of these models?

tape-eval error

When I run tape-eval transformer secondary_structure results/secondary_structure_transformer_20-03-10-17-00-07_837748/ --metrics mae --split casp12 it gives me an error about the metric not existing in the metric_name_mapping.


I used tape-train transformer secondary_structure --batch_size=4 --gradient_accumulation_steps=4 --num_train_epochs=1 to quickly train a model to test out tape-eval.

I've tried adding the following to the top of main.py

from metrics import *
from . import metrics

and this to the top of training.py:

from .metrics import *

since it seems the metrics were not getting registered, but I still got the same error message. Printing cls.metric_name_mapping at line 194 of tape/registry.py shows an empty dictionary.

UniRep's representation

I've looked at the UniRep implementation and I'm a bit confused about the representation it uses for, e.g., stability predictions. It seems to use the last hidden state for the prediction - but doesn't the UniRep paper suggest taking the mean of all the hidden states?

For reference, the stability task for UniRep:

@registry.register_task_model('fluorescence', 'unirep')
@registry.register_task_model('stability', 'unirep')
class UniRepForValuePrediction(UniRepAbstractModel):

    def __init__(self, config):
        super().__init__(config)

        self.unirep = UniRepModel(config)
        self.predict = ValuePredictionHead(config.hidden_size * 2)

        self.init_weights()

    def forward(self, input_ids, input_mask=None, targets=None):

        outputs = self.unirep(input_ids, input_mask=input_mask)

        sequence_output, pooled_output = outputs[:2]
        outputs = self.predict(pooled_output, targets) + outputs[2:]
        # (loss), prediction_scores, (hidden_states)
        return outputs

Notice that it uses the pooled output, which we can see comes from this model:

@registry.register_task_model('embed', 'unirep')
class UniRepModel(UniRepAbstractModel):

    def __init__(self, config: UniRepConfig):
        super().__init__(config)
        self.embed_matrix = nn.Embedding(config.vocab_size, config.input_size)
        self.encoder = mLSTM(config)
        self.output_hidden_states = config.output_hidden_states
        self.init_weights()

    def forward(self, input_ids, input_mask=None):
        if input_mask is None:
            input_mask = torch.ones_like(input_ids)

        # fp16 compatibility
        input_mask = input_mask.to(dtype=next(self.parameters()).dtype)
        embedding_output = self.embed_matrix(input_ids)

        encoder_outputs = self.encoder(embedding_output, mask=input_mask)
        sequence_output = encoder_outputs[0]
        hidden_states = encoder_outputs[1]
        pooled_outputs = torch.cat(hidden_states, 1)

        outputs = (sequence_output, pooled_outputs)
        return outputs

Here the pooled output appears to be the concatenation of the last cell state and the last hidden state.

Why does TAPE use this concatenated representation, when the UniRep paper uses a mean over all the hidden states as its representation?
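For comparison, a mean-over-positions representation in the spirit of the UniRep paper can be computed from UniRepModel's sequence output (a sketch only; the masked averaging is an assumption about how you would want to handle padding):

import torch
from tape import UniRepModel, TAPETokenizer

model = UniRepModel.from_pretrained('babbler-1900').eval()
tokenizer = TAPETokenizer(vocab='unirep')

token_ids = torch.tensor([tokenizer.encode('GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ')])
input_mask = torch.ones_like(token_ids)

with torch.no_grad():
    sequence_output, pooled_output = model(token_ids, input_mask=input_mask)[:2]

# Average the per-position hidden states, ignoring padded positions.
mask = input_mask.unsqueeze(-1).float()
mean_hidden = (sequence_output * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 1900)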

Masked Language Modelling with a pretrained language model

Hi,

I'm interested in continuing to train an already pre-trained model on the masked language modeling task.

When I pass the --from_pretrained flag it gives an error. I can see that create_train_parser has no add_argument for --from_pretrained. Would it be as simple as adding the ability to read the --from_pretrained argument in main.py, after which the pre-trained model would be loaded properly as the starting point for further training?

Thanks.

Scott
