Comments (8)
Hi, a little reminder.
Shall I ignore the warning from above?
Where does this message get printed?
Cheers, Cedric
from titan.
Hi @cirdeCyL, sorry for the long cycle time.
First of all, you should always expect problems ;)
However, this is a logging message, so it might be possible to ignore it safely. I can't judge from a distance whether you really can.
Here's some context: TITAN models rely on a class called SMILESLanguage from the pytoda library. This class is responsible for 1) preprocessing the SMILES (padding, start/stop tokens, canonicalizing, etc.), 2) splitting the strings into tokens, and 3) converting them into integers that can be embedded via nn.Embedding. If you finetune TITAN on your own dataset you should, ideally, set all those values where the warning is raised to the same value as during pretraining. If you don't do that, you make it harder for the model to transfer to your dataset. This can get quite drastic, e.g., if you bring molecules that have tokens that are not available in the SMILES language.
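The three steps above (tokenize, add start/stop tokens and pad, map to integers) can be sketched in plain Python. This is an illustration, not pytoda's actual code; the vocabulary and the token regex here are toy assumptions, while the real SMILESLanguage builds its vocabulary from the data:

```python
import re

# Hypothetical minimal vocabulary; pytoda derives the real one from the dataset.
VOCAB = {"<PAD>": 0, "<START>": 1, "<STOP>": 2, "<UNK>": 3,
         "C": 4, "O": 5, "N": 6, "(": 7, ")": 8, "=": 9}

# Toy SMILES tokenizer: multi-character atoms first, then single characters.
TOKEN_RE = re.compile(r"Cl|Br|[A-Za-z]|\d|\(|\)|=|#|\[|\]|@|\+|-")

def encode(smiles, padding_length=12):
    """Tokenize, surround with start/stop tokens, pad/crop, map to integer ids."""
    tokens = ["<START>"] + TOKEN_RE.findall(smiles) + ["<STOP>"]
    ids = [VOCAB.get(t, VOCAB["<UNK>"]) for t in tokens]
    # Crop to padding_length, then right-pad with the <PAD> id.
    ids = ids[:padding_length]
    return ids + [VOCAB["<PAD>"]] * (padding_length - len(ids))

print(encode("C(=O)N"))  # -> [1, 4, 7, 9, 5, 8, 6, 2, 0, 0, 0, 0]
```

The resulting integer list is what would be fed to an nn.Embedding layer; a token outside the vocabulary falls back to the UNK id, which is roughly the failure mode described above.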
I'd say in most cases you can safely ignore this warning, but if you experience issues with your model, these messages are a good hint as to where to start debugging.
Hmm, maybe I am stupid.
So I am trying to finetune the model. I don't get a warning for each example of ligand and sequence. The message above occurs once at the beginning, and then the model just runs through. So I can't set those values where the warning is raised, because the message does not say for which example it was raised.
I am also not sure what you mean by "set those values (where the warning is raised) to the same value as during pretraining".
Do you mean not to use finetuning and to train directly from scratch? I suspect that the model would perform worse in that case.
Haha, no worries, you are not stupid. The warnings are raised intentionally upon dataset setup, where the provided configuration is compared to the previous one.
Hence, you don't see logging messages during training except in very drastic instances, e.g., you set padding_length to 20 but then provide a molecule that has 200 tokens. In that case, the molecule is cropped (only the first 20 tokens are fed to the model) and you will see a warning during training.
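The cropping behavior can be illustrated with a small standalone sketch (this is not pytoda's actual implementation, just the idea): tokens beyond padding_length are silently dropped, shorter inputs are right-padded.

```python
def pad_or_crop(token_ids, padding_length, pad_id=0):
    """Crop a token-id list to padding_length, or right-pad it with pad_id.
    Illustrative only; the real logic lives in pytoda's transforms."""
    cropped = token_ids[:padding_length]
    return cropped + [pad_id] * (padding_length - len(cropped))

print(pad_or_crop([1, 2, 3, 4, 5, 6, 7, 8], 5))  # -> [1, 2, 3, 4, 5] (tail dropped)
print(pad_or_crop([1, 2, 3], 5))                 # -> [1, 2, 3, 0, 0]
```

So with padding_length=20 and a 200-token molecule, 90% of the molecule never reaches the model, which is why that case is warned about at training time.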
The message occurs twice since you set up a train and a test dataset. The reported problems are about configuration parameters (like whether to surround the sequences with <START> and <STOP> tokens). These params can be configured via the parameter .json (--params_filepath), so you should have the control to fix those, I think. This is what I meant when saying "set those values (where the warning is raised) to the same value as during pretraining".
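One way to act on that advice is to diff the two parameter files before finetuning. A minimal sketch, assuming the params are plain JSON dicts; the key names below (padding_length, add_start_and_stop, canonical) are illustrative, so check them against your own pretraining params file:

```python
def mismatched_params(pretrain, finetune,
                      keys=("padding_length", "add_start_and_stop", "canonical")):
    """Return {key: (pretraining_value, finetuning_value)} for every key
    whose value differs between the two configurations."""
    return {k: (pretrain.get(k), finetune.get(k))
            for k in keys if pretrain.get(k) != finetune.get(k)}

# In practice, load both dicts with json.load from the two params .json files.
print(mismatched_params({"padding_length": 500, "add_start_and_stop": True},
                        {"padding_length": 200, "add_start_and_stop": True}))
# -> {'padding_length': (500, 200)}
```

Any key reported here is a candidate source of the dataset-setup warnings and should be aligned with the pretraining value.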
Hope this helps 👋🏼
Perfect. I will try it out.
I was thinking about increasing the token size in the param file, but I was afraid that this might change the model, and that is what I tried to avoid.
Thank you a lot.
Hi @jannisborn, I am interested in how you converted the epitope.csv file (which contains the amino acid sequences of the epitopes) into the epitope.smi file (which contains the SMILES of the epitopes as input for training or finetuning the model). Could you please elaborate on how you prepared the .smi file? Which tool did you use for generating the epitope.smi files? The warning is probably due to different methods of generating the .smi file here.
Thanks for your reply in advance!
Lihua
Hi Lihua,
This was done with RDKit:
from rdkit import Chem

# Parse the peptide sequence as FASTA into an RDKit molecule, then write SMILES.
mol = Chem.MolFromFASTA('EFG')
smi = Chem.MolToSmiles(mol)
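For a whole file, the same call just needs to be applied per row. A sketch of the epitope.csv-to-epitope.smi conversion, under assumptions: the CSV has (sequence, name) columns and the output uses the tab-separated "SMILES<TAB>identifier" layout that .smi readers typically expect; check both against the actual TITAN data. The sequence-to-SMILES step is passed in as a function so the I/O logic runs standalone:

```python
import csv
import io

def csv_to_smi(csv_text, seq_to_smiles):
    """Read 'sequence,name' CSV rows and emit one 'SMILES<TAB>name' line each."""
    reader = csv.reader(io.StringIO(csv_text))
    lines = [f"{seq_to_smiles(seq)}\t{name}" for seq, name in reader]
    return "\n".join(lines) + "\n"

# With RDKit installed, one would pass:
#   seq_to_smiles = lambda s: Chem.MolToSmiles(Chem.MolFromFASTA(s))
# Here a stub converter is used so the sketch runs without RDKit.
print(csv_to_smi("EFG,epitope_1\n", lambda s: f"SMILES({s})"))
```

In practice you would read epitope.csv from disk and write the returned string to epitope.smi.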
Thanks for your reply. Will try this.
Best,
Lihua