mheinzinger / ProstT5
Bilingual Language Model for Protein Sequence and Structure
License: MIT License
Hello!
I just noticed a couple of things while using translate and wanted to bring some attention to them in case they are issues.
```python
is_3Di = False if int(args.is_3Di) == 0 else True
split_char = args.split_char
id_field = args.id
half_precision = False if int(args.half) == 0 else True
is_3Di = False if int(args.is_3Di) == 0 else True
print(f"is_3Di is {is_3Di}. (0=expect input to be 3Di, 1= input is AA")
```
Two small things here: the `is_3Di` assignment is duplicated, and the printed message looks inverted; per the parsing (and the comment further down), `1` means the input is 3Di and `0` means AA. The message is also missing its closing parenthesis.
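As a side note, the 0/1-to-bool conversions could be written more directly, which also makes it easier to keep the log message in sync with the flag. A minimal sketch (the parser below is hypothetical and only mirrors the two flags discussed here, not the script's full argument list):

```python
import argparse

# Hypothetical minimal parser mirroring the script's 0/1-style flags
parser = argparse.ArgumentParser()
parser.add_argument("--is_3Di", type=int, default=0)
parser.add_argument("--half", type=int, default=0)
args = parser.parse_args(["--is_3Di", "1"])

is_3Di = bool(args.is_3Di)        # type=int already converted the string
half_precision = bool(args.half)
print(f"is_3Di is {is_3Di} (1 = input is 3Di, 0 = input is AA)")
```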
```python
if is_3Di: # if we go from 3Di (start/s) to AA (target/t)
    prefix_s2t = "<fold2AA>"
    # don't generate 3Di or rare/ambig. AAs when outputting AA
    noGood = "acdefghiklmnpqrstvwyXBZ"
```
```python
if is_3Di:
    sequences[ uniprot_id ] += ''.join( line.split() ).replace("-","").lower()
else:
    sequences[ uniprot_id ] += ''.join( line.split() ).replace("-","")
```
```shell
python translate_clean.py --input /path/to/some_AA_sequences.fasta --output /path/to/output_directory --half 1 --is_3Di 0
```
There is a similar misnaming for predict_3Di_encoderOnly.py, where the documented command refers to a different script name:
```shell
python predict_3Di.py --input /path/to/some_AA_sequences.fasta --output /path/to/some_3Di_sequences.fasta --half 1
```
All the best,
Logan
Hi @mheinzinger,
I was using ProstT5 on some protein sequences and noticed it was not predicting 3Di sequences for proteins longer than 1000 residues. What is the reason and rationale for this?
Are these limits due to the way the model was trained? If so, what are the other limitations?
Regards
Rakesh
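If the 1000-residue cap turns out to be an inference-script limit rather than a hard model constraint, a sliding-window workaround is sometimes used for long inputs. A purely hypothetical sketch (window and overlap sizes are arbitrary, and the per-window predictions would still need to be stitched back together):

```python
def chunk_sequence(seq: str, window: int = 1000, overlap: int = 100):
    """Split a long sequence into overlapping windows (hypothetical workaround).

    Returns (start_offset, subsequence) pairs covering the whole input.
    """
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    step = window - overlap
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append((start, seq[start:start + window]))
        if start + window >= len(seq):
            break
    return chunks

chunks = chunk_sequence("M" * 2500)
print([(s, len(c)) for s, c in chunks])  # [(0, 1000), (900, 1000), (1800, 700)]
```

Note that 3Di predictions near window boundaries may disagree between overlapping chunks, which is one reason to prefer a native long-sequence mode if the authors can recommend one.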
Hi, thank you for this approach, absolutely fascinating! I'm hoping to use this in a workflow to assess and triage eukaryotic genome annotations using Foldseek. The protein sequences range from small (<50 amino acids) to very large (many kbp). I see a figure in the manuscript where the range of amino acids evaluated was up to 500, but I wanted to check whether you have any information or details for larger genes, or whether the results might be similar to those for the input database?
Anecdotally, for a small handful of genes up to 1.5 kbp that I've checked manually, the results from ProstT5 -> Foldseek are similar to those I get from a Foldseek search with the corresponding structure predicted by AlphaFold3.
Thank you for any thoughts on this!
First of all, I would like to express my appreciation for your work. I use your models for protein embedding extraction and then use the embeddings in downstream tasks like AMP classification, protein localization, etc. We are currently writing a comparative study of PFMs on different tasks, and ProtT5 seems to be the best among the ones under examination (except for some cases where ESM2t45 outperforms it slightly, but ProtT5 is substantially faster).
I would like to ask whether you expect the ProstT5 model to perform better on tasks like protein localization than ProtT5-XL? I am going to run some experiments and can report back if you are interested.
I would also like to ask your opinion on why ProtT5-XXL seems to perform worse than ProtT5-XL. Is it a tuning issue, a lack of sufficient training data, or something else? What do you think?
Best regards,
Elias G. University of Crete - FORTH
Hello!
Great work! Is there a version of the dataset (or a way to generate one) with an additional column for UniProt ID or equivalent?
I followed the install instructions and attempted to run the example converting a protein sequence to a 3Di embedding. I am getting the following error: `'T5Tokenizer' object has no attribute 'to'`. I have tried uninstalling transformers and installing previous versions, which did not resolve the issue. Please advise on troubleshooting this problem.
Hi! First of all many thanks for developing such a useful tool.
I am running into the following problem while executing the code provided in this repository to convert AA to 3Di (screenshot pasted below).
My GPU is only compatible with CUDA 11.5, so I am trying to run ProstT5 with PyTorch 1.11.0. Is this possible, or might the error be related to this?
Many thanks in advance, and sorry if I am not being very specific; I am new to this field.
Hello @mheinzinger and thanks for sharing this interesting model.
I don't have practical experience with encoder-decoder LMs such as T5, so I am still trying to figure out the right way to get logits for the inverse folding task, instead of sampling sequences as in the example shown in the README.
I have tried to run
```python
outputs = model.decoder(ids_backtranslation.input_ids, attention_mask=ids_backtranslation.attention_mask)
logits = model.lm_head(outputs.last_hidden_state)
```
The shape of the logits seems right, i.e. [batch, L+2, vocab_size], but when I sample from these logits as a sanity check I get essentially random tokens and no recovery of the target sequence from which I extracted the 3Di tokens ...
If I call the model itself, I get the error `ValueError: You have to specify either input_ids or inputs_embeds`,
but in the case of inverse folding, we should not provide anything to the encoder and should only get the structure information through the decoder, at least as far as I understand ...
Can you help please?
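For what it's worth, in a T5-style encoder-decoder the conditioning sequence (the 3Di tokens, in the inverse-folding direction) normally goes through the encoder, and per-position logits come from teacher forcing the decoder on the target sequence; calling `model.decoder` alone drops the encoder cross-attention entirely, which would explain the near-random samples. A shape-check sketch using a tiny randomly initialized T5 (not the ProstT5 checkpoint, so the token values here are meaningless):

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

# Tiny random T5 purely to check shapes; ProstT5 itself would be loaded
# with T5ForConditionalGeneration.from_pretrained(...).
config = T5Config(vocab_size=128, d_model=32, d_ff=64, d_kv=8,
                  num_layers=2, num_heads=2, decoder_start_token_id=0)
model = T5ForConditionalGeneration(config).eval()

src = torch.randint(0, 128, (1, 10))  # "3Di" tokens -> encoder input
tgt = torch.randint(0, 128, (1, 12))  # "AA" tokens  -> teacher-forced target

with torch.no_grad():
    # Passing labels makes the model build decoder_input_ids by shifting the
    # labels right, and returns one logit vector per target position.
    out = model(input_ids=src, labels=tgt)

print(out.logits.shape)  # torch.Size([1, 12, 128])
```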
Hello,
Did you guys try / record the baseline round-trip accuracy for the test set?
Best,
Logan
Hi,
Thank you for the great tool. When I run the script predict_3Di_encoderOnly.py in full-precision mode, I get the following error: `AttributeError: 'T5EncoderModel' object has no attribute 'full'`
```
Downloading spiece.model: 100%|██████████| 238k/238k [00:00<00:00, 78.0MB/s]
Downloading added_tokens.json: 100%|██████████| 283/283 [00:00<00:00, 129kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 1.00MB/s]
Downloading tokenizer_config.json: 100%|██████████| 2.40k/2.40k [00:00<00:00, 1.10MB/s]
Downloading: https://rostlab.org/~deepppi/prostt5/cnn_chkpnt/model.pt
Traceback (most recent call last):
  File "/n/scratch3/users/s/sez10/_RESTORE/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 383, in <module>
    main()
  File "/n/scratch3/users/s/sez10/_RESTORE/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 378, in main
    output_probs,
  File "/n/scratch3/users/s/sez10/_RESTORE/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 201, in get_embeddings
    model = model.full()
  File "/home/sez10/miniconda3_2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1270, in __getattr__
    type(self).__name__, name))
AttributeError: 'T5EncoderModel' object has no attribute 'full'
```
Specifically, the command I am running is: `python predict_3Di_encoderOnly.py --input test.fasta --output test_res.fasta --model prstt5_model --half 0`
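For context on the error itself: PyTorch modules expose `.half()` and `.float()` for precision casting, but there is no `.full()` method, which matches the `model = model.full()` line in the traceback. A minimal illustration:

```python
import torch

layer = torch.nn.Linear(4, 4)

layer = layer.half()           # cast parameters to float16 (half precision)
print(layer.weight.dtype)      # torch.float16

layer = layer.float()          # cast back to float32 (full precision)
print(layer.weight.dtype)      # torch.float32

# nn.Module has no .full(), so the attribute lookup fails,
# matching the AttributeError in the traceback above.
print(hasattr(layer, "full"))  # False
```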
Thanks again.
Best,
Sam
Hello, I have a few questions which might be very basic to others.
Can we convert the 3Di tokens produced by foldseek into the actual 3D structure format like PDB files?
How can I compare two sequences of predicted and true 3Di tokens? Can I use something like the TM Score for that? I am working on a model to predict 3Di tokens from amino acids and am searching for the best metric to evaluate the model.
I would be very grateful to whoever can answer my questions.
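On the metric question, one simple baseline, offered purely as a hypothetical starting point: since a predicted 3Di string is aligned position-by-position with the true one for the same protein, per-residue identity is easy to compute (TM-score, by contrast, is defined on 3D coordinates rather than on 3Di strings):

```python
def per_residue_identity(pred: str, true: str) -> float:
    """Fraction of positions where predicted and true 3Di tokens match.

    Assumes both strings describe the same protein, hence equal length.
    """
    if len(pred) != len(true):
        raise ValueError("3Di strings must have equal length")
    if not true:
        return 0.0
    return sum(p == t for p, t in zip(pred, true)) / len(true)

print(per_residue_identity("dvvl", "dvcl"))  # 0.75
```

A substitution-matrix-based score (Foldseek ships a 3Di substitution matrix) would be a natural refinement, since some 3Di states are more similar than others.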
Hi,
Sorry for opening another issue. I was wondering if you changed the model weights recently?
For some reason, after running predict_3Di_encoderOnly.py, the cnn_chkpnt/model.pt file is not an actual weights file but a bunch of HTML code. See the attachment (I changed the name from model.pt to model.txt because it wouldn't let me upload it otherwise). Oddly, I am having a difficult time reproducing this error; sometimes the file downloads correctly, sometimes it does not.
As a result, when I run predict_3Di_encoderOnly.py I get the following error.
```
Downloading spiece.model: 100%|██████████| 238k/238k [00:00<00:00, 133MB/s]
Downloading added_tokens.json: 100%|██████████| 283/283 [00:00<00:00, 145kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 1.48MB/s]
Downloading tokenizer_config.json: 100%|██████████| 2.40k/2.40k [00:00<00:00, 1.93MB/s]
Downloading: https://rostlab.org/~deepppi/prostt5/cnn_chkpnt/model.pt
Traceback (most recent call last):
  File "/n/scratch/users/s/sez10/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 383, in <module>
    main()
  File "/n/scratch/users/s/sez10/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 378, in main
    output_probs,
  File "/n/scratch/users/s/sez10/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 194, in get_embeddings
    predictor = load_predictor()
  File "/n/scratch/users/s/sez10/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 173, in load_predictor
    state = torch.load(checkpoint_p, map_location=device)
  File "/home/sez10/miniconda3_2/lib/python3.7/site-packages/torch/serialization.py", line 795, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/sez10/miniconda3_2/lib/python3.7/site-packages/torch/serialization.py", line 1002, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
```
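A small defensive check before `torch.load` could catch this failure mode early, since a saved HTML error page starts with `<` while a real checkpoint does not. A hedged sketch (the file names are illustrative, not paths the script uses):

```python
from pathlib import Path

def looks_like_html(path) -> bool:
    """Heuristic: a real pickle/zip checkpoint never starts with '<'."""
    head = Path(path).read_bytes()[:64].lstrip()
    return head.startswith(b"<")

# Simulate the bad download: an HTML error page saved as model.pt
bad = Path("model.pt")
bad.write_bytes(b"<!DOCTYPE html><html>...</html>")
print(looks_like_html(bad))  # True -> re-download instead of torch.load
bad.unlink()
```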
Thank you all for the great tool!
Best Regards,
Sam
Hi @mheinzinger!
Thank you for the great software! I am starting with amino acid sequences and want to generate PDB files to run Foldseek's easy-cluster function.
Once I have the 3Di from the amino acid sequences using the following command:
```shell
python translate_clean.py --input /path/to/some_AA_sequences.fasta --output /path/to/output_directory --half 1 --is_3Di 0
```
How do I convert the 3Di to PDB format?
Looking forward to your reply!
I'm trying to convert 3Di tokens from ProstT5 and the associated amino acids into Foldseek databases using generate_foldseek_db.py. However, when I try this, I receive an error like `Cannot open index file testdb.index.0`. Do you happen to know if I need a specific Foldseek release for this? For reference, I'm using version 2.8bd520. The test files I've been using are attached as well, with superfluous .txt endings so GitHub allows me to upload them.