
titan's Introduction

Python package License: MIT

TITAN

TITAN - Tcr epITope bimodal Attention Networks

Installation

The library itself has few dependencies (see setup.py) with loose requirements.

Create a virtual environment and install dependencies

python -m venv --system-site-packages venv
source venv/bin/activate
pip install -r requirements.txt

Install in editable mode for development:

pip install -e .

Data structure

For data handling, we make use of the pytoda package. If you bring your own data, it needs to adhere to the following format:

  • tcrs.csv A .csv file containing two columns, one for the tcr sequences and one for their IDs.
  • epitopes.csv A .csv file containing two columns, one for the epitope sequences and one for their IDs. This can optionally also be a .smi file (tab-separated) with the SMILES sequences of the epitopes.
  • train.csv A .csv file containing three columns, one for TCR IDs, one for epitope IDs and one for the labels. This data is used for training.
  • test.csv A .csv file containing three columns, one for TCR IDs, one for epitope IDs and one for the labels. This data is used for testing.

NOTE: tcrs.csv and epitopes.csv need to contain all TCRs and epitopes used during training and testing. Duplicates are not allowed, neither in sequences nor in IDs. All data can be found at https://ibm.box.com/v/titan-dataset .
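As a rough illustration of the layout described above, the following sketch writes minimal versions of the three files. The column order and the example sequences are assumptions for illustration only; check the provided dataset for the exact format expected by pytoda.

```python
import csv

# tcrs.csv: one TCR sequence and one ID per row (sequences are made up).
tcrs = [("CASSLAPGATNEKLFF", 0), ("CASSVGQGYEQYF", 1)]
with open("tcrs.csv", "w", newline="") as f:
    csv.writer(f).writerows(tcrs)

# epitopes.csv: one epitope sequence and one ID per row.
epitopes = [("GILGFVFTL", 0), ("NLVPMVATV", 1)]
with open("epitopes.csv", "w", newline="") as f:
    csv.writer(f).writerows(epitopes)

# train.csv: TCR ID, epitope ID, binary binding label.
train = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(train)
```

Note that every ID referenced in train.csv (and test.csv) must resolve to a row in tcrs.csv or epitopes.csv, per the note above.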

Example usages

Train a TITAN model

The TITAN model uses the architecture published in the paccmann_predictor package. Example parameter files are given in the params folder.

python3 scripts/flexible_training.py \
name_of_training_data_files.csv \
name_of_testing_data_files.csv \
path_to_tcr_file.csv \
path_to_epitope_file.csv/.smi \
path_to_store_trained_model \
path_to_parameter_file \
training_name \
bimodal_mca

Finetune an existing TITAN model

To load a TITAN model after pretraining and finetune it on another dataset, use the semifrozen_finetuning.py script. Use the parameter number_of_tunable_layers to control the number of layers which will be tuned, the rest will be frozen. Model will freeze epitope input channel first and the final dense layers last. Do not change the input data type (i.e. SMILES or amino acids) between pretraining and finetuning.

python3 scripts/semifrozen_finetuning.py \
name_of_training_data_files.csv \
name_of_testing_data_files.csv \
path_to_tcr_file.csv \
path_to_epitope_file.smi \
path_to_pretrained_model \
path_to_store_model \
training_name \
path_to_parameter_file \
bimodal_mca
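The freezing order described above (epitope input channel frozen first, final dense layers frozen last) can be sketched as follows. This is a minimal illustration, not the script's actual implementation; the layer names and their ordering are made-up assumptions.

```python
def freeze_layers(layers, number_of_tunable_layers):
    """Keep only the last `number_of_tunable_layers` entries trainable.

    `layers` is an ordered list of layer names, running from the epitope
    input channel (frozen first) to the final dense layers (frozen last).
    Returns the list of layer names that stay tunable.
    """
    n_frozen = max(len(layers) - number_of_tunable_layers, 0)
    frozen = set(layers[:n_frozen])
    return [name for name in layers if name not in frozen]

# Hypothetical layer ordering, for illustration only.
layers = [
    "epitope_embedding",   # frozen first
    "epitope_attention",
    "tcr_embedding",
    "tcr_attention",
    "dense_1",
    "dense_out",           # frozen last
]

print(freeze_layers(layers, 2))  # ['dense_1', 'dense_out']
```

With number_of_tunable_layers=2, only the two final dense layers would be updated during finetuning; everything closer to the epitope input stays frozen.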

Run trained TITAN model on data

A trained model is provided in trained_model. The model is pretrained on BindingDB and finetuned using the semifrozen setting, on full TCR sequences and with SMILES encoding of epitopes. All parameters can be found in the parameter files provided.

python3 scripts/flexible_model_eval.py \
name_of_test_data_file.csv \
path_to_tcr_file.csv \
path_to_epitope_file.smi \
path_to_trained_model_folder \
bimodal_mca \
save_name

Evaluate K-NN baseline on cross validation

The script scripts/knn_cv.py runs the K-NN baseline model of the paper and performs a cross validation. The script can be used in two modes, shared and separate. Shared is the default mode. In separate mode, the TCR and epitope sequences for training and testing don't need to be in the same file, but can be split across two files. To use this mode, simply provide additional paths via the -test_tcr and -test_ep arguments.

python3 scripts/knn_cv.py \
-d path_to_data_folder \
-tr name_of_training_data_files.csv \
-te name_of_testing_data_files.csv \
-f 10 \
-ep path_to_epitope_file.csv \
-tcr path_to_tcr_file.csv \
-r path_to_result_folder \
-k 25

Type python3 scripts/knn_cv.py -h for help. The data in data_folder needs to be structured as:

data_path
├── fold0
│   ├── name_of_training_data_files.csv
│   ├── name_of_testing_data_files.csv
...
├── fold9
│   ├── name_of_training_data_files.csv
│   ├── name_of_testing_data_files.csv
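The idea behind a K-NN baseline of this kind can be sketched as below: predict the label of a test TCR-epitope pair by majority vote over the k most similar training pairs. Here sequence similarity is an illustrative summed edit distance; the paper's exact distance and vote weighting may differ, and the sequences are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (standard DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def knn_predict(test_tcr, test_ep, train, k=25):
    """Majority vote over the k training pairs closest in summed edit distance.

    `train` is a list of (tcr, epitope, label) tuples.
    """
    nearest = sorted(
        train,
        key=lambda t: edit_distance(test_tcr, t[0]) + edit_distance(test_ep, t[1]),
    )[:k]
    votes = sum(label for _, _, label in nearest)
    return 1 if votes * 2 >= len(nearest) else 0

# Toy training pairs for illustration only.
train = [
    ("CASSLAPGATNEKLFF", "GILGFVFTL", 1),
    ("CASSVGQGYEQYF", "GILGFVFTL", 0),
    ("CASSLAPGATNEKLYF", "GILGFVFTL", 1),
]
print(knn_predict("CASSLAPGATNEKLFF", "GILGFVFTL", train, k=3))  # 1
```

In the cross-validation setup above, this prediction would be repeated for every test pair in each of the 10 folds.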

Data Handling

To generate full TCR sequences from a CDR3 sequence and V and J segment names, the cdr3_to_full_seq.py script can be used. The script relies on the user having downloaded FASTA files containing the names of the V and J segments with their respective sequences, called V_segment_sequences.fasta and J_segment_sequences.fasta. These can be downloaded from IMGT.org. Header names must be provided to the script to adapt to different formats of the input file.

python3 scripts/cdr3_to_full_seq.py \
directory_with_VJ_segment_fasta_files \
path_to_file_with_input_sequences.csv \
v_seq_header \
j_seq_header \
cdr3_header \
path_to_output_file.csv
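Conceptually, the reconstruction joins the V segment, the CDR3, and the J segment, where the CDR3 overlaps the end of the V segment and the start of the J segment. The sketch below is a naive overlap merge on made-up toy sequences; it is not the script's actual logic, which follows IMGT junction conventions.

```python
def stitch_full_sequence(v_seq, cdr3, j_seq):
    """Naively stitch V + CDR3 + J by maximal prefix/suffix overlap.

    Assumes the CDR3 begins with a suffix of the V segment and ends with
    a prefix of the J segment (a simplification of IMGT junction rules).
    """
    # Longest suffix of v_seq that is a prefix of cdr3.
    v_overlap = 0
    for n in range(1, min(len(v_seq), len(cdr3)) + 1):
        if v_seq.endswith(cdr3[:n]):
            v_overlap = n
    # Longest prefix of j_seq that is a suffix of cdr3.
    j_overlap = 0
    for n in range(1, min(len(j_seq), len(cdr3)) + 1):
        if j_seq.startswith(cdr3[-n:]):
            j_overlap = n
    return v_seq[: len(v_seq) - v_overlap] + cdr3 + j_seq[j_overlap:]

# Made-up toy sequences, far shorter than real V/J segments.
print(stitch_full_sequence("MGTSLLCA", "CASSF", "SFGQGT"))  # MGTSLLCASSFGQGT
```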

Citation

If you use titan in your projects, please cite the following:

@article{weber2021titan,
    author = {Weber, Anna and Born, Jannis and Rodriguez Martinez, Maria},
    title = "{TITAN: T-cell receptor specificity prediction with bimodal attention networks}",
    journal = {Bioinformatics},
    volume = {37},
    number = {Supplement_1},
    pages = {i237-i244},
    year = {2021},
    month = {07},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btab294},
    url = {https://doi.org/10.1093/bioinformatics/btab294}
}

titan's People

Contributors

annaweber209, jannisborn


titan's Issues

Need help with running semifrozen_finetuning.py

Hi,

I want to try your models. When I run the semifrozen_finetuning.py script using the data you provided, I get the following error messages.

Maybe you can help solve it? Thanks a lot in advance.
(error screenshot omitted)

Have a nice day!

Best,
Lihua

The logger gives some advice, but I don't understand it.

Expected Behavior

So I was trying out TITAN with my own dataset.
Essentially, I preprocessed my files to have the same column names, labelling my CDR3s and epitopes and further transforming the epitopes into SMILES.
So when running the semifrozen training, I expected to have no problems.

Actual Behavior

Instead I got this output:

    Here are the problems:
    Provided arg add_start_and_stop:True does not match the smiles_language value: False NOTE: smiles_language value takes preference!!
    Provided arg padding:True does not match the smiles_language value: False NOTE: smiles_language value takes preference!!
    Provided arg padding_length:500 does not match the smiles_language value: None NOTE: smiles_language value takes preference!!
    To get rid of this, adapt the smiles_language *offline*, feed it ready for intended usage, and adapt the constructor args to be identical with their equivalents in the language object
    Since you provided a smiles_language, the following parameters to this class will be ignored: canonical, augment, kekulize, all_bonds_explicit, selfies, sanitize, all_hs_explicit, remove_bonddir, remove_chirality, randomize, add_start_and_stop, padding, padding_length, device.

The same warnings are then printed a second time.

Steps to Reproduce the Problem

I was looking into the code and couldn't find where this message gets printed.

It would be very helpful if you could help me understand the message and how I should preprocess my data.

Thanks a lot beforehand,
Cedric

Running

Hi there,

Many thanks for developing TITAN and making it public!

This sounds very interesting and I would like to give it a go, but I already fail to use the provided test data. Specifically, I'm looking at the Run trained TITAN model on data section, which I've translated to:

python3 scripts/flexible_model_eval.py \
    tutorial/data/test_small.csv \
    tutorial/data/tcr_full.csv \
    tutorial/data/epitopes.csv \
    ../TITAN-dataset/trained_model/ \
    bimodal_mca test20230303

I'm unsure about ../TITAN-dataset/trained_model/, which points to the externally provided data (https://ibm.box.com/v/titan-dataset ), but I couldn't find a model in the repo providing the code. Anyway, the above fails with:

RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [128, 1, 500, 26, 32]

Would it be possible to expand the documentation to show how to use https://ibm.box.com/v/titan-dataset?

Many thanks,
Andreas

pytoda version is not correct

Hi, I was trying to use TITAN for TCR epitope prediction by calling the flexible_model_eval.py script. My TCR and epitope inputs are both .csv files. I got an error:
(error screenshot omitted)

I tried changing protein_language to protein_languages in flexible_model_eval.py, but got another error:
(error screenshot omitted)

I installed all the requirements listed in requirements.txt, but I doubt pytoda==1.0.2 is the correct version for this script.
Could you help check and test the code?

Ligand with MHC

Hi,

First thank you for compiling and providing a benchmark 10-CV dataset.
The MHC molecular information is also important during the binding of the pMHC complex to the TCR.
Could you please provide additional information on the MHC-I molecule (HLA-X*XX:XX, e.g. HLA-A*02:01) corresponding to each epitope-TCR pair?

Thanks

Question regarding pretraining

Hi,

Thanks for your help with the python version issue.

I had a question about a workflow for training models from scratch. From what I gather, the flexible_training.py script permits training of a TITAN model. Does this script pretrain the model on BindingDB? I assume that from there, one would finetune this model on TCR sequence data and epitope data of choice, e.g. using the semifrozen_finetuning.py script?

Best,

Paul

input output data size not match

Hi,

I'm using flexible_model_eval.py to predict on my own input data, but the output prediction has fewer rows than the input, and I don't know which rows are dropped. There should be no receptors or ligands in the affinity file that are missing from the sequence files.

Part of the reason may be that drop_last=True is set in the DataLoader, but setting it to False still doesn't solve the problem completely. Could you help check the issue?

Thanks

Bindingdb dataset

As I was reviewing the project, I noticed that you have provided access to the code and the VDJdb dataset. However, the BindingDB data was not included. Would it be possible for you to share this processed BindingDB dataset with me?

Running/ training on own dataset

Hi Team,

So I have a dataset with training and test examples, and I want to check out your model.
What and how shall I approach this?
Shall I still go for the semifrozen model and finetune it, or shall I train it from scratch?

I would be grateful if you could give some instructions or tips on how to do it.

Best cheers,
Cedric

Data unavailable!

Hi!
I am trying to use your model. However, when I attempt to use the "fine-tuning model", it requires the training data.
Upon clicking on the data link in the README I got this:

(screenshot of the broken link omitted)

Would it be possible to get another link?

Thanks in advance!
