
Comments (5)

adrienchaton commented on July 17, 2024

In case others face the same issue: one should also use the generate method at inference time to compute logits.

It goes as follows:

import torch

# The "<fold2AA>" prefix tells the encoder to translate 3Di -> amino acids.
ids_backtranslation = tokenizer.batch_encode_plus(
    ["<fold2AA>" + " " + seq_3di],
    add_special_tokens=True,
    padding="longest",
    return_tensors="pt",
).to(model.device)

# +1 on the lengths, otherwise the output came out one residue short (see note below).
outputs = model.generate(
    ids_backtranslation.input_ids,
    attention_mask=ids_backtranslation.attention_mask,
    max_length=len(seq_3di) + 1,
    min_length=len(seq_3di) + 1,
    output_scores=True,
    return_dict_in_generate=True,
    repetition_penalty=repetition_penalty,
)

# outputs.scores is a tuple with one logits tensor per generated position.
logits = torch.cat(outputs.scores).cpu()

One thing that seemed off in your example: if I didn't add +1 to the expected length (here only one example), the output was one residue shorter than the length expected from the 3Di encoding.

Any corrections on what I came up with would be greatly appreciated. As a sanity check, the recovery against the sequence I computed the 3Di from is reasonable, i.e. >40%, so it does not seem buggy to me.
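For reference, each entry of outputs.scores is the logits over the vocabulary for one generated position. A minimal pure-Python sketch (synthetic numbers, no model or torch needed) of how such per-position logits map to probabilities and a per-residue confidence:

```python
import math

def softmax(logits):
    """Convert one position's logits to probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# generate() returns one logits vector per generated token; here we
# fake two positions over a tiny 4-token vocabulary.
scores = [
    [2.0, 0.5, 0.1, -1.0],  # logits for residue 1
    [0.2, 3.0, 0.0, 0.5],   # logits for residue 2
]

probs = [softmax(pos) for pos in scores]
# Per-residue confidence = probability assigned to the argmax token.
confidences = [max(p) for p in probs]
print(confidences)
```

With real outputs you would apply the same softmax along the vocabulary dimension of the concatenated logits tensor.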

from prostt5.

mheinzinger commented on July 17, 2024

Thanks for sharing the details on how you got the scores. From what I remember, I used a similar logic at one point, so I would not immediately see anything to change.

Only one thing on the +1 offset: maybe double-check, but the decoder should not need those special prefixes which indicate the direction of translation ("<fold2AA>" etc.). Those prefixes are only added to the encoder input, to tell the model up front how to interpret its input and how to optimally embed it for the translation direction you are interested in.
That being said: I think there is a special token added to the decoder to kick off the translation (<s> if I am not mistaken), but this should get stripped off automatically when you do something like decoded_translations = tokenizer.batch_decode(translations, skip_special_tokens=True).
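The stripping mentioned here is just the skip_special_tokens=True behaviour of batch_decode. As a toy illustration of what that flag does (hypothetical 5-token vocabulary, not the real ProstT5 tokenizer):

```python
# Toy vocabulary: id -> token; ids 0 and 1 play the role of the
# special tokens (<s> / </s>) that frame the decoder output.
vocab = {0: "<s>", 1: "</s>", 2: "A", 3: "L", 4: "G"}
special_ids = {0, 1}

def toy_batch_decode(batches, skip_special_tokens=True):
    """Mimic tokenizer.batch_decode: drop special ids, join tokens."""
    out = []
    for ids in batches:
        toks = [vocab[i] for i in ids
                if not (skip_special_tokens and i in special_ids)]
        out.append("".join(toks))
    return out

translations = [[0, 2, 3, 4, 1]]  # <s> A L G </s>
print(toy_batch_decode(translations))  # the framing tokens are gone
```

The real tokenizer does the same filtering internally, which is why the decoder start token never shows up in the decoded sequences.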


mheinzinger commented on July 17, 2024

Sounds interesting, let me know in case you hit any problems along the way.
Regarding finetuning, I would recommend considering a parameter-efficient version, with which we have had good experience previously.
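One widely used parameter-efficient scheme is LoRA, which freezes a pretrained weight matrix W and learns only a low-rank update B·A. A minimal numeric sketch of that idea (synthetic 2x2 weights, not tied to any particular library or to ProstT5 itself):

```python
def matmul(A, B):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

# Frozen pretrained weight (never updated during finetuning).
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Trainable low-rank factors with rank r = 1; for large layers this
# means training far fewer parameters than the full matrix.
B = [[0.5], [0.0]]   # shape (2, r)
A = [[1.0, 2.0]]     # shape (r, 2)
scale = 1.0          # the alpha / r scaling factor

# Effective weight used at inference: W + scale * (B @ A).
delta = matmul(B, A)
W_eff = [[w + scale * d for w, d in zip(wr, dr)]
         for wr, dr in zip(W, delta)]
print(W_eff)
```

In practice one would use a library such as peft rather than hand-rolling this, but the underlying update is the same.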


adrienchaton commented on July 17, 2024

@mheinzinger Thanks for the advice. Here I am specifically thinking about finetuning the encoder-decoder model jointly on sequence-3Di pairs, not about e.g. supervised fitness prediction with the encoder alone (as with ESM2). It could be interesting to tune the model to antibodies, for example, or to retrain from scratch, though I guess starting from the general pretrained model would still be an advantage.


adrienchaton commented on July 17, 2024

For inference it looks like it is easiest to stick to model.generate(), along with the special tokens that indicate which translation direction is expected (e.g. sequence-to-structure or structure-to-sequence).

Additionally, thanks for sharing the batch_decode method, which takes care of dropping special tokens from the output sequences.

I am considering trying to finetune ProstT5 for other protein types; I might come back with a few questions, if you don't mind!


