ddiem-ri-4d / epitcr

epiTCR: a highly sensitive predictor for TCR–peptide binding

Home Page: https://github.com/ddiem-ri-4D/epiTCR

License: MIT License

epitcr's Introduction

epiTCR

epiTCR is a highly sensitive predictor for TCR-peptide binding. It takes TCR CDR3b sequences and peptide sequences as input. Users can optionally also provide the full-length MHC. The output is the predicted binding probability.

This repository contains the code and the data used to train the epiTCR model.

Requirements

python >= 3.0.0
numpy 1.22.4
scikit-learn 1.1.2

For other requirements, please see the env_requirements.txt file (here).

Run epiTCR

Users can run epiTCR in two modes: (i) train a new model and make predictions with the newly trained model, or (ii) make predictions with our pre-trained model.

Train a new model and make prediction

The main module of epiTCR is epiTCR.py. Users can train the epiTCR model (with or without MHC) and make predictions on their data by running:

python3 epiTCR.py --trainfile data/splitData/withMHC/train/train.csv --testfile data/splitData/withMHC/test/test01.csv --chain cem

given that:

--trainfile is a comma-separated (CSV) file containing columns for TCR, epitope, binder, and/or full-length MHC (as reported by IMGT). See example training data.
--testfile is a CSV file containing columns for TCR, epitope, and/or full-length MHC (as reported by IMGT). See example test file.
--chain specifies the chain(s) to use. Available options are ce (cdr3b+epitope) and cem (cdr3b+epitope+mhc). The default is ce.
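
As a rough illustration of the input format described above, the sketch below writes a small test file with Python's csv module. The column names (CDR3b, epitope, MHC), the helper write_input_csv, and the sample sequences are assumptions made for illustration only; check the example files in the repository for the exact headers epiTCR expects. A training file would additionally need the binder label column.

```python
import csv
import io

def write_input_csv(rows, handle, with_mhc=False):
    """Write TCR/epitope rows as CSV. Column names are assumed, not confirmed."""
    fields = ["CDR3b", "epitope"] + (["MHC"] if with_mhc else [])
    writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return fields

# Illustrative sequences only; replace with your own data.
rows = [
    {"CDR3b": "CASSLGQAYEQYF", "epitope": "GILGFVFTL"},
    {"CDR3b": "CASSIRSSYEQYF", "epitope": "GILGFVFTL"},
]

buffer = io.StringIO()  # swap for open("my_test.csv", "w", newline="") to write a file
fields = write_input_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

A file written this way without the MHC column would correspond to the ce option; the cem option needs the MHC column as well.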

The prediction output is printed to standard output (stdout) or written to a file specified with the --outfile option. For more information, see the section Prediction output below.

Run prediction using the pre-trained model

Users can also apply our pre-trained model directly to their data using the module predict.py. TCR-epitope or TCR-pMHC binding prediction can be run with:

python3 predict.py --testfile data/splitData/withMHC/test/test01.csv --modelfile models/rdforestWithMHCModel.pickle --chain cem

given that:

--testfile is a CSV file containing columns for TCR, epitope, and/or full-length MHC (as reported by IMGT). See example input file.
--modelfile specifies the full path of the file containing the trained model, which should be a pickle file. The default model is models/rdforestWithMHCModel.pickle.
--chain specifies the chain(s) to use. Available options are ce (cdr3b+epitope) and cem (cdr3b+epitope+mhc). The default is ce.
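
The options listed above can be pictured as an argparse parser along the following lines. This is a hypothetical reconstruction from the documentation, not the actual predict.py source; in particular, whether --testfile is required and the default for --outfile are assumptions.

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the documented flags; not the real predict.py code.
    parser = argparse.ArgumentParser(prog="predict.py")
    parser.add_argument("--testfile", required=True,
                        help="CSV with TCR, epitope and optionally full-length MHC")
    parser.add_argument("--modelfile", default="models/rdforestWithMHCModel.pickle",
                        help="path to the pickled trained model")
    parser.add_argument("--chain", choices=["ce", "cem"], default="ce",
                        help="ce (cdr3b+epitope) or cem (cdr3b+epitope+mhc)")
    parser.add_argument("--outfile", default=None,
                        help="optional output file (assumed default: print to stdout)")
    return parser

args = build_parser().parse_args(
    ["--testfile", "data/splitData/withMHC/test/test01.csv", "--chain", "cem"]
)
print(args.modelfile, args.chain)
```

With a parser like this, a misspelled flag such as --model_file is not matched to --modelfile and is reported as an unrecognized argument.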

Prediction output

The epiTCR prediction output is a table with up to four columns: the CDR3b sequence, the epitope sequence, the full-length MHC (when provided), and the predicted binding probability of the corresponding complex. The example output file is here.
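
Because the output is a table of probabilities, one simple way to use it is to sort the rows from most to least likely binder. The sketch below assumes the output is CSV and that the probability column is named predict_proba (a name mentioned by a user in the issues, not confirmed by this README); the other column names, the helper rank_by_probability, and the sample rows are placeholders, not real predictions.

```python
import csv
import io

# Made-up rows in an assumed layout; the probabilities are placeholders.
sample_output = """CDR3b,epitope,predict_proba
CASSLGQAYEQYF,GILGFVFTL,0.12
CASSIRSSYEQYF,GILGFVFTL,0.87
CASSPGTGGYEQYF,NLVPMVATV,0.55
"""

def rank_by_probability(handle, column="predict_proba"):
    """Return rows sorted from highest to lowest predicted binding probability."""
    rows = list(csv.DictReader(handle))
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)

ranked = rank_by_probability(io.StringIO(sample_output))
for row in ranked:
    print(row["CDR3b"], row["epitope"], row["predict_proba"])
```

To rank a real output file, pass an open file handle instead of the in-memory sample.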

Contact

For questions or feedback, please open an Issue.

Citation

Please cite this paper if it helps your research:

@article{10.1093/bioinformatics/btad284,
    author = {Pham, My-Diem Nguyen and Nguyen, Thanh-Nhan and Tran, Le Son and Nguyen, Que-Tran Bui and Nguyen, Thien-Phuc Hoang and Pham, Thi Mong Quynh and Nguyen, Hoai-Nghia and Giang, Hoa and Phan, Minh-Duy and Nguyen, Vy},
    title = "{epiTCR: a highly sensitive predictor for TCR–peptide binding}",
    journal = {Bioinformatics},
    volume = {39},
    number = {5},
    pages = {btad284},
    year = {2023},
    month = {04},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btad284},
    url = {https://doi.org/10.1093/bioinformatics/btad284},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/5/btad284/50204900/btad284.pdf},
}

epitcr's Issues

Question about binder in input file.

Hello
Thanks for your great tool.
I have a question about the input file. When I run epiTCR without MHC in prediction mode using the pre-trained model, I find that I must add the binder column. Is this meant to be my own assumption about TCR and epitope binding? And in the result file, should I rely on the predict_proba and binder_pred information?
Just some simple questions, and thanks for your patience!

error in predict.py

Hello

Thank you for the great tool. I am getting the error below when I run predict.py.
What am I doing wrong?

Traceback (most recent call last):
  File "predict.py", line 37, in <module>
    auc_test, acc_test, sens_test, spec_test = Model.predicMLModel(model_rf, test, pX_test, py_test, args.outfile)
  File "/epiTCR-main/src/modules/model.py", line 78, in predicMLModel
    y_rf_test_proba = model.predict_proba(X_test)
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 865, in predict_proba
    X = self._validate_X_predict(X)
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 599, in _validate_X_predict
    X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py", line 579, in _validate_data
    self._check_feature_names(X, reset=reset)
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py", line 506, in _check_feature_names
    raise ValueError(message)
ValueError: The feature names should match those that were passed during fit.
Feature names seen at fit time, yet now missing:
- F1000
- F1001
- F1002
- F1003
- F1004
- ...
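
The message above says the test matrix lacks columns that the model saw during fitting. One possible explanation, which the traceback alone does not confirm, is that the model file and the --chain option disagree (for example, a model trained with MHC used with the default ce). A generic way to inspect such a mismatch is to compare the two lists of feature names, as in the sketch below. The helper diff_feature_names and the toy sizes are hypothetical; in practice the names could come from scikit-learn's feature_names_in_ attribute on the fitted model and from the columns of the test DataFrame.

```python
def diff_feature_names(fit_names, input_names):
    """Return (missing, unexpected) feature names, preserving their order."""
    fit_set, input_set = set(fit_names), set(input_names)
    missing = [name for name in fit_names if name not in input_set]
    unexpected = [name for name in input_names if name not in fit_set]
    return missing, unexpected

# Toy sizes chosen only to keep the example small.
fit_names = [f"F{i}" for i in range(6)]    # e.g. list(model.feature_names_in_)
input_names = [f"F{i}" for i in range(4)]  # e.g. list(X_test.columns)

missing, unexpected = diff_feature_names(fit_names, input_names)
print("missing:", missing)
print("unexpected:", unexpected)
```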

unrecognized arguments

Hi there,
I am trying to run this program for the first time. However, I am running into an issue with the test data because it didn't seem to recognize the model file.
This is the error message I receive.
predict.py: error: unrecognized arguments: --model_file models/rdforestWithMHCModel.pickle
I unzipped the file and downloaded all dependencies - do you have any idea what the issue could be?

Best, Zita

License

Hey @ddiem-ri-4D, awesome work. What license have you released the code under? I'd love to try it out at work. Cheers, J.

Does not work for peptides longer than 11 amino acids

Hello,

Every time I try to predict binding of a peptide longer than 11 amino acids, I get the error: "could not broadcast input array from shape (25,20) into shape (11,20)".

Any clue how I might solve that?

Thanks
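
The shapes in the message suggest that each peptide is encoded into a fixed 11 x 20 matrix, so a 25-residue peptide does not fit. That reading is inferred from the error text alone, not from the epiTCR source. Under that assumption, a minimal workaround sketch is to separate out over-length peptides before running prediction; the helper split_by_length and the constant below are hypothetical.

```python
MAX_EPITOPE_LEN = 11  # assumption inferred from the (11, 20) shape in the error message

def split_by_length(epitopes, max_len=MAX_EPITOPE_LEN):
    """Separate epitopes that fit the assumed limit from those that exceed it."""
    kept = [seq for seq in epitopes if len(seq) <= max_len]
    skipped = [seq for seq in epitopes if len(seq) > max_len]
    return kept, skipped

# The 25-mer is a placeholder standing in for an over-length peptide.
epitopes = ["GILGFVFTL", "NLVPMVATV", "A" * 25]
kept, skipped = split_by_length(epitopes)
print(len(kept), "kept;", len(skipped), "skipped")
```

This only avoids the error; it does not make the model applicable to longer peptides.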

About the datasets: TCR sequences almost never start with C and end with F

Most of the TCR sequences I have downloaded from databases begin with C and end with F, and some articles also mention that this is more consistent with the characteristics of CDR3 sequences. Why do almost all TCR sequences in your dataset begin with A and end with no consistent residue?
@nttvy

Does epiTCR also provide the binding strength between an antigen sequence and a TCR, so that the tool can be used to rank antigens for immunogenicity?

Describe the input file

I am trying to use this tool to predict whether a specific CDR3b sequence binds to an antigen. However, when I run predict.py with just this information, it throws an error. I don't have MHC or HLA information.