Comments (5)
In case others face the same issue: one should also use the generate method for inference to compute logits.
It goes as follows:
import torch

ids_backtranslation = tokenizer.batch_encode_plus(["<fold2AA>" + " " + seq_3di], add_special_tokens=True, padding="longest", return_tensors='pt').to(model.device)
outputs = model.generate(ids_backtranslation.input_ids, attention_mask=ids_backtranslation.attention_mask, max_length=len(seq_3di)+1, min_length=len(seq_3di)+1, output_scores=True, return_dict_in_generate=True, repetition_penalty=repetition_penalty)
# outputs.scores is a tuple with one (batch, vocab) tensor per generated step;
# for a single sequence, concatenating yields a (length, vocab) logits matrix
logits = torch.cat(outputs.scores).cpu()
One thing which seemed off in your example: if I did not add +1 to the expected length (here for a single example), the output would be one residue shorter than the length expected from the 3Di encoding.
Any corrections to what I came up with would be greatly appreciated. As a sanity check, the recovery against the sequence I computed the 3Di from is reasonable, i.e. >40%, so it does not look buggy to me.
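The >40% recovery sanity check mentioned above can be made explicit with a small, self-contained identity calculation between the back-translated sequence and the native one (a sketch only; `recovery` is a hypothetical helper, not part of ProstT5, and it assumes both sequences have equal length):

```python
def recovery(native_aa: str, backtranslated_aa: str) -> float:
    """Fraction of positions where the back-translated residue
    matches the native residue (ungapped, position-wise identity)."""
    if len(native_aa) != len(backtranslated_aa):
        raise ValueError("sequences must have equal length for position-wise recovery")
    matches = sum(a == b for a, b in zip(native_aa, backtranslated_aa))
    return matches / len(native_aa)

# Example: 5 of 8 residues recovered
print(recovery("MKTAYIAK", "MKTGYIGQ"))  # -> 0.625
```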
from prostt5.
Thanks for sharing the details on how you got the scores. From what I remember, I used similar logic at one point, so I do not immediately see what to change.
Only one thing on the +1 offset: maybe double-check, but the decoder should not need those special prefixes which indicate the direction of translation ("<fold2AA>", "<AA2fold>", etc.). Those prefixes are only added to the encoder input, to tell the model up front how to interpret the input and how to optimally embed it for the translation direction you are interested in.
That being said: I think there is a special token added to the decoder to kick off the translation (<s>, if I am not mistaken), but this should get stripped off automatically when you do something like decoded_translations = tokenizer.batch_decode(translations, skip_special_tokens=True).
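To build intuition for what skip_special_tokens=True does, here is a toy re-implementation of the filtering step (for illustration only; the real filtering happens inside the tokenizer, and the special-token set below is an assumption, not ProstT5's actual vocabulary):

```python
# Assumed special-token set; the real one comes from the tokenizer's vocabulary.
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>", "<unk>"}

def decode_skip_special(tokens: list[str]) -> str:
    """Drop special tokens, then join the remaining residue tokens
    back into a plain sequence string."""
    kept = [t for t in tokens if t not in SPECIAL_TOKENS]
    return "".join(kept)

# The decoder's kick-off token and the end-of-sequence token are dropped,
# leaving only the residues:
print(decode_skip_special(["<s>", "M", "K", "T", "</s>"]))  # -> MKT
```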
Sounds interesting; let me know in case you hit any problems along the way.
Regarding finetuning, I would recommend considering a parameter-efficient variant, with which we have made good experiences previously.
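The comment does not name a specific method; LoRA is one common parameter-efficient option, used here only to illustrate the savings. LoRA freezes a pretrained weight matrix W (d_out x d_in) and trains a low-rank update B @ A of rank r, so each adapted matrix contributes d_out*r + r*d_in trainable parameters instead of d_out*d_in. A back-of-the-envelope sketch (the hidden size is hypothetical):

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted matrix:
    B has shape (d_out, r), A has shape (r, d_in)."""
    return d_out * r + r * d_in

d = 1024  # hypothetical hidden size of one attention projection
r = 8     # LoRA rank
full = d * d
lora = lora_trainable_params(d, d, r)
print(f"full finetune: {full} params; LoRA r={r}: {lora} params "
      f"({100 * lora / full:.2f}% of full)")
```

For a 1024x1024 projection this trains 16,384 parameters instead of 1,048,576, about 1.6% of the full matrix, which is the usual motivation for parameter-efficient finetuning of large models like ProstT5.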
@mheinzinger Thanks for the advice. Here I am specifically thinking about finetuning the encoder-decoder model on sequence-3Di pairs, not on e.g. supervised fitness prediction with the encoder alone (as with e.g. ESM2). It could be interesting to tune the model to antibodies, for example, or to retrain from scratch, but I guess starting from the general pretrained model would still be an advantage.
For inference, it looks like it is easiest to stick to model.generate(), along with the special tokens which indicate what processing is expected (e.g. which translation direction the encoder should embed for).
Additionally, thanks for sharing the batch_decode method, which takes care of dropping special tokens from the sequence outputs.
I am considering trying to finetune ProstT5 for other protein types; I might come back with a few questions, if you don't mind!
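As a concrete sketch of the prefix handling discussed above: ProstT5 expects residues to be space-separated (so each becomes its own token), 3Di states in lower case and amino acids in upper case, with the direction prefix prepended on the encoder side only. The helper below is hypothetical (name and structure are mine, not from the ProstT5 repo), but the prefixes "<AA2fold>" and "<fold2AA>" match the ones used in this thread:

```python
import re

def format_prostt5_input(seq: str, direction: str) -> str:
    """Format one sequence for the ProstT5 encoder (illustrative helper).

    - "aa2fold": AA -> 3Di translation, "<AA2fold>" prefix, upper-case residues,
      rare amino acids (U, Z, O, B) mapped to X;
    - "fold2aa": 3Di -> AA back-translation, "<fold2AA>" prefix, lower-case 3Di.
    Residues are space-separated so each is tokenized individually.
    """
    if direction == "aa2fold":
        prefix = "<AA2fold>"
        seq = re.sub(r"[UZOB]", "X", seq.upper())
    elif direction == "fold2aa":
        prefix = "<fold2AA>"
        seq = seq.lower()
    else:
        raise ValueError("direction must be 'aa2fold' or 'fold2aa'")
    return prefix + " " + " ".join(seq)

print(format_prostt5_input("dvq", "fold2aa"))  # -> <fold2AA> d v q
print(format_prostt5_input("MKU", "aa2fold"))  # -> <AA2fold> M K X
```

The formatted string would then go through tokenizer.batch_encode_plus as in the snippet at the top of the thread.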
Related Issues (19)
- Dataset version with IDs? HOT 11
- Converting the AA sequences to 3Di and getting the output in PDB format for Foldseek input? HOT 4
- Base Roundtrip Accuracy HOT 7
- translate.py documentation HOT 1
- AttributeError: 'T5EncoderModel' object has no attribute 'full' HOT 4
- Back Translate 3Di Tokens to PDB Format HOT 1
- ProstT5 for tasks like AMP classification or protein localization. Also protT5 xxl results. HOT 1
- Issue downloading model weights HOT 1
- Protein sequence length limit HOT 1
- Translation result is unexpected! HOT 1
- Foldseek not liking generate_foldseek_db.py output TSVs HOT 6
- 'T5Tokenizer' object has no attribute 'to' HOT 1
- Problem while trying to convert AA to 3Diaa HOT 2
- How well does this scale with length of amino acids? HOT 1
- Probability cutoff? HOT 6
- Sequence conservation model described in the paper HOT 2
- Entry error between aa and 3di string HOT 3
- translate.py with default settings generates long stretch of simple repeats HOT 1