
Comments (7)

mheinzinger commented on August 16, 2024

Hi,
yes, I recorded this and tried to recover the file you are asking for. I assume "base" roundtrip accuracy means the roundtrip accuracy when generating only a single candidate sequence, i.e., without generating and filtering until a roundtrip accuracy of, e.g., >70 is reached. Hope it helps:
roundtrip_statistics.csv
Best, Michael
P.S.: there is also a column called "PPL" because I tried to find some correlation with perplexity, but the roundtrip-accuracy direction appeared more promising and easier to interpret to me. That said, the PPL field always says -666 because I did not compute it for this run.
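For anyone summarizing a file like this, here is a minimal sketch of averaging the accuracy column while treating -666 as "not computed" rather than as a real perplexity. The column names `roundtrip_accuracy` and `PPL` and the example rows are assumptions for illustration; the actual header of roundtrip_statistics.csv is not shown in this thread.

```python
import csv
import io
from statistics import mean

# Sentinel described above: PPL was not computed for this run.
PPL_SENTINEL = -666.0


def summarize(csv_text, acc_col="roundtrip_accuracy", ppl_col="PPL"):
    """Average the accuracy column and skip sentinel PPL values.

    Column names are hypothetical defaults; pass the real ones if they differ.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    accs = [float(r[acc_col]) for r in rows]
    ppls = [float(r[ppl_col]) for r in rows if float(r[ppl_col]) != PPL_SENTINEL]
    return {
        "n": len(accs),
        "mean_accuracy": mean(accs) if accs else None,
        "mean_ppl": mean(ppls) if ppls else None,  # None if nothing was computed
    }


# Made-up rows for illustration only; not values from the shared file.
example = "id,roundtrip_accuracy,PPL\nseq1,50.0,-666\nseq2,75.0,-666\n"
stats = summarize(example)
```

Filtering the sentinel explicitly avoids accidentally reporting a mean perplexity of -666.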

from prostt5.

lhallee commented on August 16, 2024

Thanks so much for sharing, this is perfect. I assume similarity is the accuracy percentage, i.e., a position counts as correct if the tokens match and incorrect otherwise, reported as correct/total?


lhallee commented on August 16, 2024

Also, is it okay to report this as SOTA roundtrip accuracy? The average was 70.6 in the file you sent. We have a BERT-like model that is getting 70+% on the same data, so I think this is a great comparison. SAProt can't do this task because it was not trained to fill in its structure tokens.


mheinzinger commented on August 16, 2024

> I assume similarity is the accuracy %, i.e., correct if same token, incorrect if not, correct/total?

Ah, no, sorry, I should have mentioned this: I used the Foldseek substitution matrix together with global alignment.
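To make the difference between the two measures concrete, here is a minimal sketch that contrasts position-wise token accuracy with a global (Needleman-Wunsch) alignment score. The scoring function `toy_score` and the linear gap penalty are placeholders chosen for illustration; they are not the Foldseek 3Di substitution matrix, and the normalization used to turn the alignment score into the reported similarity is not specified in this thread.

```python
def token_accuracy(pred, ref):
    """Fraction of positions with identical tokens (equal lengths required)."""
    if len(pred) != len(ref):
        raise ValueError("token accuracy needs equal-length sequences")
    if not ref:
        return 0.0
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def global_alignment_score(a, b, score, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty (assumed value)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two tokens
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            )
        prev = curr
    return prev[-1]


def toy_score(x, y):
    """Placeholder scoring; a real run would look up a substitution matrix."""
    return 2 if x == y else -1


acc = token_accuracy("DVVLC", "DVALC")
aln = global_alignment_score("DVVLC", "DVALC", toy_score)
aln_shorter = global_alignment_score("DVLC", "DVVLC", toy_score)
```

Unlike token accuracy, the alignment score is defined for sequences of different lengths and can give partial credit for similar but non-identical states when a real substitution matrix is used.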

> Also, is it okay to report this as SOTA round trip accuracy?

Oh, thanks for asking :) - If I had to redo this right now, I would go for our dedicated 3Di predictor. It is essentially just a 2-layer CNN trained on top of embeddings from ProstT5's encoder. So if you do not need a distribution over solutions but only a single solution (as you usually do if you want to use 3Di for searching remote homologs, or for roundtrip-based filtering), I would go for this one. The rationale is that it is MUCH faster: you do not need to decode token by token, but instead run a single forward pass through the encoder, which translates all amino acids in the input to 3Di tokens (probably similar to your BERT-like model). Sorry for not having it in the paper yet; this is ongoing work, and we will describe this CNN in the next iteration of the paper.

For SAProt: I may have to re-read the paper, but didn't they have mixed tokens such as "Aa" (upper-case amino acid, lower-case 3Di, or vice versa, I am not sure), where you could mask out either the 3Di part ("A?") or the amino acid part ("?a") and reconstruct it? That way you could mask out all 3Di tokens and ask the model to reconstruct them.


lhallee commented on August 16, 2024

Thanks for the info! Yes, I think the 3Di predictor is a great comparison to our BERT-like model. Is there a checkpoint available for it yet? If not, I will wait for the next iteration of the paper.

You are correct about the mixed tokens; it is the same approach we took. However, they chose to mask only the amino acid portion during training, so their model cannot recover masked 3Di tokens any better than random chance. Our approach was to mask the amino acid portion, the 3Di portion, or both, so the model is quite good at this roundtrip task.
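As a rough illustration of the mixed-token masking discussed here, the sketch below builds "Aa"-style tokens and masks either track. The `?` placeholder follows the notation used earlier in this thread; the helper names and the upper/lower-case convention are assumptions, and this is not the actual preprocessing code of SAProt or of our model.

```python
MASK = "?"  # placeholder from the thread's notation; real vocabularies may differ


def mixed_tokens(aa_seq, tdi_seq):
    """Pair each amino acid (upper case) with its 3Di state (lower case)."""
    if len(aa_seq) != len(tdi_seq):
        raise ValueError("amino acid and 3Di sequences must have equal length")
    return [a.upper() + t.lower() for a, t in zip(aa_seq, tdi_seq)]


def mask_track(tokens, track):
    """Mask one track in every token: 'aa', '3di', or 'both'."""
    if track not in ("aa", "3di", "both"):
        raise ValueError("track must be 'aa', '3di', or 'both'")
    masked = []
    for tok in tokens:
        aa, tdi = tok[0], tok[1]
        if track in ("aa", "both"):
            aa = MASK
        if track in ("3di", "both"):
            tdi = MASK
        masked.append(aa + tdi)
    return masked


tokens = mixed_tokens("MKV", "dvl")
only_aa_known = mask_track(tokens, "3di")   # model must reconstruct 3Di
only_3di_known = mask_track(tokens, "aa")   # model must reconstruct amino acids
```

A model that only ever sees the `"aa"` case during training has no signal for filling in the 3Di half, which is the limitation described above.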


mheinzinger commented on August 16, 2024

Yes, you can run the script I linked above to get predictions from our dedicated 3Di predictor (it already downloads the checkpoint). You just have to provide an input FASTA file, as explained under predict_3Di_encoderOnly.py here: https://github.com/mheinzinger/ProstT5/tree/main/scripts

Cool, great to hear that you got promising results from this approach :) - let me know once you have a preprint up.


lhallee commented on August 16, 2024

Thanks for the help! Will do :)
