mheinzinger / ProstT5
Bilingual Language Model for Protein Sequence and Structure
License: MIT License
Hello!
I just noticed a couple of things while using translate and wanted to bring some attention to them in case they are issues.
```python
is_3Di = False if int(args.is_3Di) == 0 else True
split_char = args.split_char
id_field = args.id
half_precision = False if int(args.half) == 0 else True
is_3Di = False if int(args.is_3Di) == 0 else True
print(f"is_3Di is {is_3Di}. (0=expect input to be 3Di, 1= input is AA")
```
Two small things here: the `is_3Di` assignment is duplicated, and the printed message looks inverted; per the parsing (and the comment further down), `1` means the input is 3Di and `0` means AA. The message is also missing its closing parenthesis.
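As a side note, the 0/1-to-bool conversions could be written more directly, which also makes it easier to keep the log message in sync with the flag. A minimal sketch (the parser below is hypothetical and only mirrors the two flags discussed here, not the script's full argument list):

```python
import argparse

# Hypothetical minimal parser mirroring the script's 0/1-style flags
parser = argparse.ArgumentParser()
parser.add_argument("--is_3Di", type=int, default=0)
parser.add_argument("--half", type=int, default=0)
args = parser.parse_args(["--is_3Di", "1"])

is_3Di = bool(args.is_3Di)        # type=int already converted the string
half_precision = bool(args.half)
print(f"is_3Di is {is_3Di} (1 = input is 3Di, 0 = input is AA)")
```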
```python
if is_3Di: # if we go from 3Di (start/s) to AA (target/t)
    prefix_s2t = "<fold2AA>"
    # don't generate 3Di or rare/ambig. AAs when outputting AA
    noGood = "acdefghiklmnpqrstvwyXBZ"
```
```python
if is_3Di:
    sequences[ uniprot_id ] += ''.join( line.split() ).replace("-","").lower()
else:
    sequences[ uniprot_id ] += ''.join( line.split() ).replace("-","")
```
```shell
python translate_clean.py --input /path/to/some_AA_sequences.fasta --output /path/to/output_directory --half 1 --is_3Di 0
```
There is a similar misnaming for predict_3Di_encoderOnly.py, where the documented command refers to a different script name:
```shell
python predict_3Di.py --input /path/to/some_AA_sequences.fasta --output /path/to/some_3Di_sequences.fasta --half 1
```
All the best,
Logan
Hi @mheinzinger,
I was using ProstT5 on some protein sequences and noticed it was not predicting 3Di sequences for proteins longer than 1000 residues. What is the reason and rationale for this?
Are these limits due to the way the model was trained? If so, what are the other limitations?
Regards
Rakesh
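If the 1000-residue cap turns out to be an inference-script limit rather than a hard model constraint, a sliding-window workaround is sometimes used for long inputs. A purely hypothetical sketch (window and overlap sizes are arbitrary, and the per-window predictions would still need to be stitched back together):

```python
def chunk_sequence(seq: str, window: int = 1000, overlap: int = 100):
    """Split a long sequence into overlapping windows (hypothetical workaround).

    Returns (start_offset, subsequence) pairs covering the whole input.
    """
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    step = window - overlap
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append((start, seq[start:start + window]))
        if start + window >= len(seq):
            break
    return chunks

chunks = chunk_sequence("M" * 2500)
print([(s, len(c)) for s, c in chunks])  # [(0, 1000), (900, 1000), (1800, 700)]
```

Note that 3Di predictions near window boundaries may disagree between overlapping chunks, which is one reason to prefer a native long-sequence mode if the authors can recommend one.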
Hi, thank you for this approach, absolutely fascinating! I'm hoping to use this in a workflow to assess and triage eukaryotic genome annotations using Foldseek. The protein sequences range from small (<50 amino acids) to very large (many kbp). I see a figure in the manuscript where the range of amino acids evaluated was up to 500, but I wanted to check whether you have any information or details for larger genes, or whether the results might be similar to those for the input database?
Anecdotally, for a small handful of genes up to 1.5 kbp that I've checked manually, the results from ProstT5 -> Foldseek are similar to those I get from a Foldseek search with the corresponding structure predicted by AlphaFold3.
Thank you for any thoughts on this!
First of all, I would like to express my appreciation for your work. I use your models for protein embedding extraction and then use the embeddings in downstream tasks like AMP classification, protein localization, etc. We are currently writing a comparative study of PFMs on different tasks, and ProtT5 seems to be the best among the ones under examination (except for some cases where ESM2t45 outperforms it slightly, but ProtT5 is substantially faster).
I would like to ask whether you expect the ProstT5 model to perform better on tasks like protein localization than ProtT5-XL? I am going to run some experiments and can report back if you are interested.
I would also like to ask your opinion on why ProtT5-XXL seems to perform worse than ProtT5-XL. Is it a tuning issue, a lack of sufficient training data, or something else? What do you think?
Best regards,
Elias G. University of Crete - FORTH
Hello!
Great work! Is there a version of the dataset (or a way to generate one) with an additional column for UniProt ID or equivalent?
I followed the install instructions and attempted to run the example converting a protein sequence to a 3Di embedding. I am getting the following error: `'T5Tokenizer' object has no attribute 'to'`. I have tried uninstalling transformers and installing previous versions, which did not resolve the issue. Please advise on troubleshooting this problem.
Hi! First of all many thanks for developing such a useful tool.
I am running into the following problem while executing the code provided in this repository to convert AA to 3Di (screenshot pasted below).
My GPU is only compatible with CUDA 11.5, so I am trying to run ProstT5 with PyTorch 1.11.0. Is this possible, or might the error be related to this?
Many thanks in advance, and sorry if I am not being very specific; I am new to this field.
Hello @mheinzinger and thanks for sharing this interesting model.
I don't have practical experience with encoder-decoder LMs such as T5, so I am still trying to figure out the right way to get logits for the inverse folding task, instead of sampling sequences as in the example shown in the README.
I have tried to run
```python
outputs = model.decoder(ids_backtranslation.input_ids, attention_mask=ids_backtranslation.attention_mask)
logits = model.lm_head(outputs.last_hidden_state)
```
The shape of the logits seems right, i.e. [batch, L+2, vocab_size], but when I sample from these logits as a sanity check I get essentially random tokens and no recovery of the target sequence from which I extracted the 3Di tokens ...
If I call the model itself, I get the error `ValueError: You have to specify either input_ids or inputs_embeds`,
but in the case of inverse folding, we should not provide anything to the encoder and should only get the structure information through the decoder, at least as far as I understand ...
Can you help please?
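For what it's worth, in a T5-style encoder-decoder the conditioning sequence (the 3Di tokens, in the inverse-folding direction) normally goes through the encoder, and per-position logits come from teacher forcing the decoder on the target sequence; calling `model.decoder` alone drops the encoder cross-attention entirely, which would explain the near-random samples. A shape-check sketch using a tiny randomly initialized T5 (not the ProstT5 checkpoint, so the token values here are meaningless):

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

# Tiny random T5 purely to check shapes; ProstT5 itself would be loaded
# with T5ForConditionalGeneration.from_pretrained(...).
config = T5Config(vocab_size=128, d_model=32, d_ff=64, d_kv=8,
                  num_layers=2, num_heads=2, decoder_start_token_id=0)
model = T5ForConditionalGeneration(config).eval()

src = torch.randint(0, 128, (1, 10))  # "3Di" tokens -> encoder input
tgt = torch.randint(0, 128, (1, 12))  # "AA" tokens  -> teacher-forced target

with torch.no_grad():
    # Passing labels makes the model build decoder_input_ids by shifting the
    # labels right, and returns one logit vector per target position.
    out = model(input_ids=src, labels=tgt)

print(out.logits.shape)  # torch.Size([1, 12, 128])
```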
Hello,
Did you guys try / record the baseline round-trip accuracy for the test set?
Best,
Logan
Hi,
Thank you for the great tool. When I run the script predict_3Di_encoderOnly.py in full-precision mode, I get the following error: `AttributeError: 'T5EncoderModel' object has no attribute 'full'`
```
Downloading spiece.model: 100%|██████████| 238k/238k [00:00<00:00, 78.0MB/s]
Downloading added_tokens.json: 100%|██████████| 283/283 [00:00<00:00, 129kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 1.00MB/s]
Downloading tokenizer_config.json: 100%|██████████| 2.40k/2.40k [00:00<00:00, 1.10MB/s]
Downloading: https://rostlab.org/~deepppi/prostt5/cnn_chkpnt/model.pt
Traceback (most recent call last):
  File "/n/scratch3/users/s/sez10/_RESTORE/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 383, in <module>
    main()
  File "/n/scratch3/users/s/sez10/_RESTORE/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 378, in main
    output_probs,
  File "/n/scratch3/users/s/sez10/_RESTORE/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 201, in get_embeddings
    model = model.full()
  File "/home/sez10/miniconda3_2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1270, in __getattr__
    type(self).__name__, name))
AttributeError: 'T5EncoderModel' object has no attribute 'full'
```
Specifically, the command I am running is: `python predict_3Di_encoderOnly.py --input test.fasta --output test_res.fasta --model prstt5_model --half 0`
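For context on the error itself: PyTorch modules expose `.half()` and `.float()` for precision casting, but there is no `.full()` method, which matches the `model = model.full()` line in the traceback. A minimal illustration:

```python
import torch

layer = torch.nn.Linear(4, 4)

layer = layer.half()           # cast parameters to float16 (half precision)
print(layer.weight.dtype)      # torch.float16

layer = layer.float()          # cast back to float32 (full precision)
print(layer.weight.dtype)      # torch.float32

# nn.Module has no .full(), so the attribute lookup fails,
# matching the AttributeError in the traceback above.
print(hasattr(layer, "full"))  # False
```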
Thanks again.
Best,
Sam
Hello, I have a few questions which might be very basic to others.
Can we convert the 3Di tokens produced by foldseek into the actual 3D structure format like PDB files?
How can I compare two sequences of predicted and true 3Di tokens? Can I use something like the TM Score for that? I am working on a model to predict 3Di tokens from amino acids and am searching for the best metric to evaluate the model.
I would be very grateful to whoever can answer my questions.
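On the metric question, one simple baseline, offered purely as a hypothetical starting point: since a predicted 3Di string is aligned position-by-position with the true one for the same protein, per-residue identity is easy to compute (TM-score, by contrast, is defined on 3D coordinates rather than on 3Di strings):

```python
def per_residue_identity(pred: str, true: str) -> float:
    """Fraction of positions where predicted and true 3Di tokens match.

    Assumes both strings describe the same protein, hence equal length.
    """
    if len(pred) != len(true):
        raise ValueError("3Di strings must have equal length")
    if not true:
        return 0.0
    return sum(p == t for p, t in zip(pred, true)) / len(true)

print(per_residue_identity("dvvl", "dvcl"))  # 0.75
```

A substitution-matrix-based score (Foldseek ships a 3Di substitution matrix) would be a natural refinement, since some 3Di states are more similar than others.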
Hi,
Sorry for opening another issue. I was wondering if you changed the model weights recently?
For some reason, after running predict_3Di_encoderOnly.py, the cnn_chkpnt/model.pt file is not an actual weights file but a bunch of HTML code. See the attachment (I changed the name from model.pt to model.txt because it wouldn't let me upload it otherwise). Oddly, I am having a difficult time reproducing this error; sometimes the file downloads correctly, sometimes it does not.
As a result, when I run predict_3Di_encoderOnly.py I get the following error.
```
Downloading spiece.model: 100%|██████████| 238k/238k [00:00<00:00, 133MB/s]
Downloading added_tokens.json: 100%|██████████| 283/283 [00:00<00:00, 145kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 1.48MB/s]
Downloading tokenizer_config.json: 100%|██████████| 2.40k/2.40k [00:00<00:00, 1.93MB/s]
Downloading: https://rostlab.org/~deepppi/prostt5/cnn_chkpnt/model.pt
Traceback (most recent call last):
  File "/n/scratch/users/s/sez10/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 383, in <module>
    main()
  File "/n/scratch/users/s/sez10/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 378, in main
    output_probs,
  File "/n/scratch/users/s/sez10/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 194, in get_embeddings
    predictor = load_predictor()
  File "/n/scratch/users/s/sez10/Alzheimer_project/ProstT5/scripts/predict_3Di_encoderOnly.py", line 173, in load_predictor
    state = torch.load(checkpoint_p, map_location=device)
  File "/home/sez10/miniconda3_2/lib/python3.7/site-packages/torch/serialization.py", line 795, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/sez10/miniconda3_2/lib/python3.7/site-packages/torch/serialization.py", line 1002, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
```
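A small defensive check before `torch.load` could catch this failure mode early, since a saved HTML error page starts with `<` while a real checkpoint does not. A hedged sketch (the file names are illustrative, not paths the script uses):

```python
from pathlib import Path

def looks_like_html(path) -> bool:
    """Heuristic: a real pickle/zip checkpoint never starts with '<'."""
    head = Path(path).read_bytes()[:64].lstrip()
    return head.startswith(b"<")

# Simulate the bad download: an HTML error page saved as model.pt
bad = Path("model.pt")
bad.write_bytes(b"<!DOCTYPE html><html>...</html>")
print(looks_like_html(bad))  # True -> re-download instead of torch.load
bad.unlink()
```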
Thank you all for the great tool!
Best Regards,
Sam
Hi @mheinzinger!
Thank you for the great software! I am starting with amino acid sequences and want to generate PDB files to run Foldseek's easy-cluster function.
Once I have the 3Di from the amino acid sequences using the following command:
```shell
python translate_clean.py --input /path/to/some_AA_sequences.fasta --output /path/to/output_directory --half 1 --is_3Di 0
```
How do I convert the 3Di to PDB format?
Looking forward to your reply!
I'm trying to convert 3Di tokens from ProstT5 and the associated amino acids into Foldseek databases using generate_foldseek_db.py. However, when I try this, I receive an error like `Cannot open index file testdb.index.0`. Do you happen to know if I need a specific Foldseek release for this? For reference, I'm using version 2.8bd520. The test files I've been using are attached as well, with superfluous .txt endings so GitHub allows me to upload them.