Hello, Did you guys try / record the base roundtrip accuracy for the


Base Roundtrip Accuracy about prostt5 HOT 7 CLOSED

lhallee commented on August 16, 2024

Base Roundtrip Accuracy

from prostt5.

Comments (7)

mheinzinger commented on August 16, 2024

Hi,
yes, I recorded this and tried to recover the file you are asking for (I assume "base" roundtrip accuracy means the roundtrip accuracy when generating only a single candidate sequence, i.e., without generating/filtering until a roundtrip accuracy of, e.g., >70 was reached). Hope it helps:
roundtrip_statistics.csv
Best, Michael
P.S.: there is also a column called "PPL" because I tried to find some correlation with perplexity, but the "roundtrip accuracy" direction appeared more promising and easier to interpret to me. That being said, the PPL field always says -666 because I did not compute it for this run.
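For readers who want to summarize a file like this, here is a minimal sketch. It assumes the CSV has columns named `similarity` and `PPL`; the exact header names, the `id` column, and the example rows are assumptions made for illustration and are not taken from the real `roundtrip_statistics.csv`. The only detail taken from the comment above is that a PPL of -666 means "not computed".

```python
import csv
import io
import statistics

PPL_NOT_COMPUTED = -666.0  # sentinel described in the comment above


def summarize(csv_text):
    """Return (mean similarity, list of PPL values that were actually computed)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    similarities = [float(r["similarity"]) for r in rows]
    # Drop the sentinel so it can never leak into a perplexity average.
    ppl = [float(r["PPL"]) for r in rows if float(r["PPL"]) != PPL_NOT_COMPUTED]
    return statistics.mean(similarities), ppl


# Made-up rows for illustration only; these are not values from the real file.
example = "id,similarity,PPL\nseq1,40.0,-666\nseq2,60.0,-666\n"
mean_similarity, valid_ppl = summarize(example)
print(mean_similarity, valid_ppl)
```

Filtering the sentinel explicitly matters because averaging a column full of -666 would silently produce a meaningless perplexity.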


lhallee commented on August 16, 2024

Thanks so much for sharing, this is perfect. I assume similarity is the accuracy %, i.e., correct if same token, incorrect if not, correct/total?


lhallee commented on August 16, 2024

Also, is it okay to report this as SOTA round trip accuracy? The average was 70.6 from what you sent. We have a bert-like model that is getting 70+% on the same data so I think this is a great comparison. SAProt can't do this task because they didn't train on filling their structure tokens.


mheinzinger commented on August 16, 2024

> I assume similarity is the accuracy % IE correct if same token, incorrect if not, correct/total?

Ah, no, sorry, I should have added this: I used the Foldseek substitution matrix together with global alignment.
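To make the difference between the two readings concrete, here is a minimal sketch that contrasts per-token accuracy (correct/total) with a global alignment score (Needleman-Wunsch with a linear gap penalty). The scoring function `toy_score` and the gap penalty are placeholders and are not the Foldseek 3Di substitution matrix. The thread also does not say how the alignment score was normalized into the reported similarity, so no normalization is shown.

```python
def token_accuracy(a, b):
    """Fraction of positions with identical tokens (needs equal, non-zero length)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def global_alignment_score(a, b, score, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # a[i-1] aligned to a gap
                curr[j - 1] + gap,                        # b[j-1] aligned to a gap
            ))
        prev = curr
    return prev[-1]


def toy_score(x, y):
    # Placeholder scoring, NOT the Foldseek 3Di substitution matrix.
    return 2 if x == y else -1


print(token_accuracy("DDVLV", "DDPLV"))
print(global_alignment_score("DDVLV", "DDPLV", toy_score))
print(global_alignment_score("DVLV", "DDVLV", toy_score))
```

Unlike plain token accuracy, the alignment score is defined for sequences of different lengths, and a real substitution matrix would penalize a swap between similar 3Di states less than a swap between dissimilar ones.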

> Also, is it okay to report this as SOTA round trip accuracy?

Oh, thanks for asking :) - So if I had to redo this right now, I would go for our dedicated 3Di predictor. It is essentially just a 2-layer CNN trained on top of embeddings from ProstT5's encoder. So if you do not need a distribution over solutions but only a single solution (as you usually do if you want to use 3Di for searching remote homologs, or if you want to use it for roundtrip-based filtering), I would go for this one. The rationale is that it is MUCH faster: you do not need to decode token by token, but instead run a single forward pass through the encoder, which translates all amino acids in the input to 3Di tokens (probably similar to your BERT-like model). Sorry for not having it in the paper yet, but this is ongoing work, and in the next iteration of the paper we'll also describe this CNN.
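The "single forward pass" idea can be illustrated with a toy, dependency-free sketch: two same-padded 1D convolutions over per-residue embeddings, followed by an argmax per residue over the 20 3Di states. This is not the ProstT5 CNN. The layer sizes, kernel width, random weights, and fake embeddings are all assumptions for illustration; the only thing it is meant to show is that every residue gets its 3Di prediction at once, with no token-by-token decoding.

```python
import random

THREE_DI = "ACDEFGHIKLMNPQRSTVWY"  # 20 letters used here as labels for the 3Di states


def conv1d(x, weights, bias):
    """Same-padded 1D convolution. x: L x C_in, weights: C_out x K x C_in (K odd)."""
    k = len(weights[0])
    pad = k // 2
    c_in = len(x[0])
    zero = [0.0] * c_in
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for i in range(len(x)):
        window = padded[i:i + k]
        out.append([
            b + sum(w[t][c] * window[t][c] for t in range(k) for c in range(c_in))
            for w, b in zip(weights, bias)
        ])
    return out


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def random_layer(rng, c_out, k, c_in):
    """Random (untrained) weights and zero biases, for illustration only."""
    w = [[[rng.uniform(-1, 1) for _ in range(c_in)] for _ in range(k)]
         for _ in range(c_out)]
    return w, [0.0] * c_out


def predict_3di(embeddings, layer1, layer2):
    """One pass over all residues: conv -> ReLU -> conv -> per-residue argmax."""
    hidden = relu(conv1d(embeddings, *layer1))
    logits = conv1d(hidden, *layer2)
    return "".join(
        THREE_DI[max(range(len(row)), key=row.__getitem__)] for row in logits
    )


rng = random.Random(0)
emb_dim, hidden_dim, kernel = 8, 4, 3  # illustrative sizes only
layer1 = random_layer(rng, hidden_dim, kernel, emb_dim)
layer2 = random_layer(rng, len(THREE_DI), kernel, hidden_dim)
fake_embeddings = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(6)]
prediction = predict_3di(fake_embeddings, layer1, layer2)
print(prediction)
```

The output always has exactly one 3Di letter per input residue, which is what makes this kind of predictor directly comparable to a BERT-like per-position model.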

For SAProt: I might have to re-read the paper, but didn't they have these mixed tokens of "Aa" (upper-case amino acids, lower-case 3Di, or vice versa, not sure) where you could mask out either the 3Di part ("A?") or the amino acid part ("?a") and reconstruct it (so you could mask out all 3Di tokens and ask the model to reconstruct them)?


lhallee commented on August 16, 2024

Thanks for the info! Yes, I think the 3Di predictor is a great comparison to our BERT-like model. Is there a checkpoint available for it yet? If not I will await the next iteration of the paper.

You are correct about the mixed tokens; it is the same approach we took. However, they chose to mask only the amino acid portions during training, so their model cannot recover masked 3Di tokens any better than random chance. Our approach was to mask the amino acid portion, the 3Di portion, or both, so it is quite good at this round-trip task.
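The masking schemes discussed here can be sketched as follows. This is an illustration only, not SAProt's tokenizer or our training code: the "?" mask symbol is the placeholder used earlier in this thread, and the function names, the casing convention, and the masking rate are assumptions.

```python
import random

MASK = "?"  # placeholder symbol used earlier in this thread


def mixed_tokens(aa, three_di):
    """Pair each amino acid (upper case) with its 3Di state (lower case)."""
    if len(aa) != len(three_di):
        raise ValueError("both tracks must have the same length")
    return [a.upper() + s.lower() for a, s in zip(aa, three_di)]


def mask_track(tokens, track):
    """Mask one track at every position: 'aa' gives '?d', '3di' gives 'M?'."""
    if track == "aa":
        return [MASK + t[1] for t in tokens]
    if track == "3di":
        return [t[0] + MASK for t in tokens]
    raise ValueError("track must be 'aa' or '3di'")


def random_mask(tokens, rate, rng, modes=("aa", "3di", "both")):
    """Training-style masking: each selected position masks AA, 3Di, or both."""
    out = []
    for t in tokens:
        if rng.random() >= rate:
            out.append(t)  # position left unmasked
            continue
        mode = rng.choice(modes)
        aa = MASK if mode in ("aa", "both") else t[0]
        sd = MASK if mode in ("3di", "both") else t[1]
        out.append(aa + sd)
    return out


tokens = mixed_tokens("MKV", "dvl")
print(tokens)                     # ['Md', 'Kv', 'Vl']
print(mask_track(tokens, "3di"))  # ['M?', 'K?', 'V?']
print(random_mask(tokens, 1.0, random.Random(0), modes=("aa",)))  # ['?d', '?v', '?l']
```

Restricting `modes` to `("aa",)` mimics a regime where only amino acids are ever masked, in which case the model never sees a masked 3Di slot during training. Masking the whole 3Di track with `mask_track(tokens, "3di")` corresponds to the sequence-to-structure half of the round-trip task.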


mheinzinger commented on August 16, 2024

Yes, you can run the script I linked above to get predictions from our dedicated 3Di predictor (it already downloads the checkpoint); you just have to give some input fasta file as explained here under predict_3Di_encoderOnly.py : https://github.com/mheinzinger/ProstT5/tree/main/scripts

Cool, great to hear that you got promising results from this approach :) - let me know once you have a pre-print up


lhallee commented on August 16, 2024

Thanks for the help! Will do :)

