Comments (7)
Hi,
yes, I recorded this and tried to recover the file you are asking for. (I assume "base" roundtrip accuracy means the roundtrip accuracy when generating only a single candidate sequence, i.e., without generating/filtering until a roundtrip accuracy of, e.g., >70 was reached.) Hope it helps:
roundtrip_statistics.csv
Best, Michael
PS: there is also a column called "PPL" because I tried to find some correlation with perplexity, but the roundtrip-accuracy direction appeared more promising and easier to interpret to me. That being said, the PPL field always says -666 because I did not compute it for this run.
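The difference between "base" roundtrip accuracy and generate-and-filter can be sketched as follows. This is a minimal illustration, not the code used to produce the CSV: `generate`, `back_translate`, and `similarity` are hypothetical callables standing in for the forward translation, the reverse translation, and whatever similarity measure is used, and the default threshold of 70 is only the example value mentioned above.

```python
from typing import Callable, Optional, Tuple


def roundtrip_filter(
    source: str,
    generate: Callable[[str], str],        # hypothetical: source -> candidate
    back_translate: Callable[[str], str],  # hypothetical: candidate -> source alphabet
    similarity: Callable[[str, str], float],
    threshold: float = 70.0,
    max_tries: int = 10,
) -> Tuple[Optional[str], float, int]:
    """Sample candidates until one round-trips above `threshold`.

    Returns (best_candidate, best_score, tries_used). With max_tries=1 this
    reduces to the "base" case: a single candidate and no filtering.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    best, best_score = None, float("-inf")
    tries = 0
    for tries in range(1, max_tries + 1):
        candidate = generate(source)
        score = similarity(source, back_translate(candidate))
        if score > best_score:
            best, best_score = candidate, score
        if best_score > threshold:
            break  # good enough, stop sampling
    return best, best_score, tries
```

If no candidate clears the threshold within `max_tries`, the function still returns the best candidate seen, so the caller can decide whether to keep or discard it.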
from prostt5.
Thanks so much for sharing, this is perfect. I assume similarity is the accuracy %, i.e., correct if same token, incorrect if not, correct/total?
from prostt5.
Also, is it okay to report this as SOTA round trip accuracy? The average was 70.6 in the file you sent. We have a BERT-like model that is getting 70+% on the same data, so I think this is a great comparison. SAProt can't do this task because they didn't train on filling in their structure tokens.
from prostt5.
I assume similarity is the accuracy %, i.e., correct if same token, incorrect if not, correct/total?
Ah, no, sorry, I should have added this: the similarity was computed with the Foldseek substitution matrix together with a global alignment.
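To make the distinction from per-token accuracy concrete, here is a small sketch of a global (Needleman-Wunsch) alignment score under a substitution matrix. Several things here are assumptions and not taken from the thread: the toy matrix (match +4, mismatch -1) is a placeholder and not Foldseek's actual 3Di matrix, the linear gap penalty is arbitrary, and normalizing by the reference's self-score is just one plausible way to get a percentage.

```python
from typing import Dict, Tuple

SubMatrix = Dict[Tuple[str, str], int]


def toy_matrix(alphabet: str, match: int = 4, mismatch: int = -1) -> SubMatrix:
    """Placeholder substitution matrix; NOT the real Foldseek 3Di matrix."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def global_alignment_score(a: str, b: str, sub: SubMatrix, gap: int = -2) -> int:
    """Needleman-Wunsch score with a linear gap penalty (assumed value)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + sub[(a[i - 1], b[j - 1])]
            gap_in_b = prev[j] + gap
            gap_in_a = curr[j - 1] + gap
            curr.append(max(substitute, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


def normalized_similarity(ref: str, pred: str, sub: SubMatrix, gap: int = -2) -> float:
    """Assumed normalization: alignment score relative to ref aligned to itself."""
    self_score = global_alignment_score(ref, ref, sub, gap)
    return 100.0 * global_alignment_score(ref, pred, sub, gap) / self_score


sub = toy_matrix("ADV")
print(global_alignment_score("DVAD", "DAAD", sub))  # one mismatch
print(global_alignment_score("DVAD", "DVD", sub))   # one gap
```

Unlike a position-by-position comparison, this score gives partial credit for substitutions between similar states (with a real matrix) and tolerates length differences through gaps.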
Also, is it okay to report this as SOTA round trip accuracy?
Oh, thanks for asking :) - So if I had to redo this right now, I would go for our dedicated 3Di predictor. It is essentially just a 2-layer CNN trained on top of embeddings from ProstT5's encoder. So if you do not need a distribution over solutions but only a single solution (as you usually do if you want to use 3Di for searching remote homologs, or for roundtrip-based filtering), I would go for this one. The rationale is that it is MUCH faster: you do not need to decode token-by-token but instead have a single forward pass through the encoder, which translates all amino acids in the input to 3Di tokens (probably similar to your BERT-like model). Sorry for not having it in the paper yet, but this is ongoing work, and in the next iteration of the paper we'll also describe this CNN.
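As a rough picture of why this is a single pass, the sketch below runs a toy 2-layer 1D CNN over per-residue embeddings and takes an argmax per position, so every residue gets its 3Di state at once. It is dependency-free Python with random weights, purely for illustration: the kernel size, hidden width, embedding size, and the `predict_3di` helper are all assumptions, not the actual architecture or API of the ProstT5 predictor.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]
Layer = Tuple[List[Matrix], List[float]]  # (weights C_out x K x C_in, bias C_out)

# The 3Di alphabet has 20 states; ProstT5 writes them in lower case.
THREE_DI_STATES = "acdefghiklmnpqrstvwy"


def conv1d_same(x: Matrix, weights: List[Matrix], bias: List[float]) -> Matrix:
    """1D convolution along the sequence with zero padding (odd kernel size).

    x is L x C_in (one vector per residue); the result is L x C_out.
    """
    kernel = len(weights[0])
    pad = kernel // 2
    length = len(x)
    out = []
    for pos in range(length):
        row = []
        for w, b in zip(weights, bias):
            acc = b
            for offset in range(kernel):
                src = pos + offset - pad
                if 0 <= src < length:  # positions outside the sequence count as zero
                    acc += sum(wi * xi for wi, xi in zip(w[offset], x[src]))
            row.append(acc)
        out.append(row)
    return out


def relu(x: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in x]


def random_layer(c_out: int, kernel: int, c_in: int, rng: random.Random) -> Layer:
    """Random weights as a stand-in for trained parameters."""
    weights = [[[rng.uniform(-1, 1) for _ in range(c_in)]
                for _ in range(kernel)] for _ in range(c_out)]
    return weights, [0.0] * c_out


def predict_3di(embeddings: Matrix, layer1: Layer, layer2: Layer) -> str:
    """One forward pass: conv -> ReLU -> conv -> per-residue argmax."""
    hidden = relu(conv1d_same(embeddings, *layer1))
    logits = conv1d_same(hidden, *layer2)
    return "".join(
        THREE_DI_STATES[max(range(len(row)), key=row.__getitem__)] for row in logits
    )


rng = random.Random(0)
embeddings = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(6)]  # 6 residues
layer1 = random_layer(16, 3, 8, rng)
layer2 = random_layer(len(THREE_DI_STATES), 3, 16, rng)
print(predict_3di(embeddings, layer1, layer2))  # one 3Di letter per residue
```

An autoregressive decoder would instead need one step per output token, each depending on the previous ones, which is where the speed difference comes from.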
For SAProt: maybe I would have to re-read the paper, but didn't they have these mixed tokens of "Aa" (upper-case amino acids, lower-case 3Di, or vice versa, not sure), where you could mask out either the 3Di ("A?") or the amino acid ("?a") and reconstruct it (so you could mask out all 3Di tokens and ask the model to reconstruct them)?
from prostt5.
Thanks for the info! Yes, I think the 3Di predictor is a great comparison to our BERT-like model. Is there a checkpoint available for it yet? If not I will await the next iteration of the paper.
You are correct about the mixed tokens; it is the same approach we took. However, they chose to mask only the amino acid portions during training, so their model cannot recover masked 3Di tokens any better than random chance. Our approach was to mask the amino acid portion, the 3Di portion, or both, so it is quite good at this round-trip task.
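The three masking modes described here can be illustrated on the "Aa" mixed tokens from earlier in the thread. This is only a sketch: the upper-case amino acid / lower-case 3Di convention, the "?" mask symbol, the 15% default rate, and the function names are assumptions for illustration and do not reflect either model's actual tokenizer or training code.

```python
import random
from typing import List, Optional

MASK = "?"  # placeholder mask symbol, as used informally in this thread


def mixed_tokens(aa: str, three_di: str) -> List[str]:
    """Pair each residue with its 3Di state: upper-case AA + lower-case 3Di."""
    if len(aa) != len(three_di):
        raise ValueError("amino acid and 3Di strings must have equal length")
    return [a.upper() + d.lower() for a, d in zip(aa, three_di)]


def mask_tokens(
    tokens: List[str],
    mode: str,                      # "aa", "3di", or "both"
    rate: float = 0.15,             # assumed masking rate
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Mask the chosen half (or both halves) of randomly selected tokens."""
    if mode not in ("aa", "3di", "both"):
        raise ValueError("mode must be 'aa', '3di', or 'both'")
    rng = rng or random.Random()
    masked = []
    for token in tokens:
        aa, di = token[0], token[1]
        if rng.random() < rate:
            if mode in ("aa", "both"):
                aa = MASK
            if mode in ("3di", "both"):
                di = MASK
        masked.append(aa + di)
    return masked


tokens = mixed_tokens("MKV", "dvl")
print(mask_tokens(tokens, mode="3di", rate=1.0))  # ['M?', 'K?', 'V?']
```

Masking every 3Di half (`mode="3di"`, `rate=1.0`) corresponds to the round-trip setting discussed above, where the model has to reconstruct all structure tokens from the amino acids alone.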
from prostt5.
Yes, you can run the script I linked above to get predictions from our dedicated 3Di predictor (it already downloads the checkpoint); you just have to give it an input FASTA file, as explained under predict_3Di_encoderOnly.py here: https://github.com/mheinzinger/ProstT5/tree/main/scripts
Cool, great to hear that you got promising results from this approach :) - let me know once you have a pre-print up
from prostt5.
Thanks for the help! Will do :)
from prostt5.