
epitcr's Introduction

epiTCR

epiTCR is a highly sensitive predictor for TCR-peptide binding. It takes TCR CDR3b sequences and peptide sequences as input; users can optionally also provide the full-length MHC. The output is the predicted binding probability.

This repository contains the code and the data to train the epiTCR model.

Requirements

python >= 3.0.0
numpy 1.22.4
scikit-learn 1.1.2

For other requirements, please see the env_requirements.txt file.

Run epiTCR

Users can run epiTCR in two modes: (i) train a new model and make predictions using the newly trained model, or (ii) make predictions using our pre-trained model.

Train a new model and make prediction

The main module of epiTCR is epiTCR.py. Users can train the epiTCR model (with or without MHC) and make predictions on their data by running:

python3 epiTCR.py --trainfile data/splitData/withMHC/train/train.csv --testfile data/splitData/withMHC/test/test01.csv --chain cem

given that:

  • --trainfile is a comma-separated (CSV) file containing columns for TCR, epitope, binder, and optionally the full-length MHC (as reported by IMGT). See example training data.
  • --testfile is a CSV file containing columns for TCR, epitope, and optionally the full-length MHC (as reported by IMGT). See example test file.
  • --chain specifies the chain(s) to use. Available options are ce (cdr3b+epitope) and cem (cdr3b+epitope+mhc). The default is ce.
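
Before running the tool, it can help to confirm that an input file has the columns the chosen --chain mode needs. The following is a minimal sketch, not part of epiTCR: the header names CDR3b, epitope, MHC and binder are assumptions made for illustration, so check the example files for the exact headers.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only; check the example
# files in the repository for the headers epiTCR actually expects.
REQUIRED_COLUMNS = {
    "ce": ["CDR3b", "epitope"],
    "cem": ["CDR3b", "epitope", "MHC"],
}


def missing_columns(handle, chain="ce", training=False):
    """Return the required columns that are absent from a CSV header."""
    if chain not in REQUIRED_COLUMNS:
        raise ValueError(f"chain must be 'ce' or 'cem', got {chain!r}")
    required = list(REQUIRED_COLUMNS[chain])
    if training:
        required.append("binder")  # label column described for --trainfile
    # Read only the first row (the header); an empty file yields no columns.
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Tiny in-memory example with made-up sequences.
sample = io.StringIO("CDR3b,epitope,binder\nCASSAAAF,AAAAAAAAA,1\n")
print(missing_columns(sample, chain="cem", training=True))  # ['MHC']
```

For a file on disk, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")`.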

The prediction output is printed to standard output (stdout) or written to a file specified with the --outfile option. For more information, see the section Prediction output below.

Run prediction using the pre-trained model

Users can also apply our pre-trained model directly to their data using the module predict.py. TCR-epitope or TCR-pMHC binding prediction can be run with:

python3 predict.py --testfile data/splitData/withMHC/test/test01.csv --modelfile models/rdforestWithMHCModel.pickle --chain cem

given that:

  • --testfile is a CSV file containing columns for TCR, epitope, and optionally the full-length MHC (as reported by IMGT). See example input file.
  • --modelfile specifies the full path to the trained model file, which should be a pickle file. The default model is models/rdforestWithMHCModel.pickle.
  • --chain specifies the chain(s) to use. Available options are ce (cdr3b+epitope) and cem (cdr3b+epitope+mhc). The default is ce.
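
When predictions are launched from a script, the command line can be assembled programmatically. The sketch below is illustrative rather than part of epiTCR: build_predict_command is a hypothetical helper, and it assumes that predict.py accepts --outfile in the same way as epiTCR.py. Note that the model option is spelled --modelfile, without an underscore.

```python
import shlex


def build_predict_command(testfile, modelfile, chain="ce", outfile=None):
    """Assemble the argument list for predict.py (hypothetical helper)."""
    if chain not in ("ce", "cem"):
        raise ValueError(f"chain must be 'ce' or 'cem', got {chain!r}")
    cmd = [
        "python3", "predict.py",
        "--testfile", testfile,
        "--modelfile", modelfile,
        "--chain", chain,
    ]
    if outfile is not None:
        # Assumption: predict.py takes --outfile like epiTCR.py does.
        cmd += ["--outfile", outfile]
    return cmd


cmd = build_predict_command(
    "data/splitData/withMHC/test/test01.csv",
    "models/rdforestWithMHCModel.pickle",
    chain="cem",
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to execute the prediction.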

Prediction output

The epiTCR prediction output is a table with up to four columns: the CDR3b sequences, the epitope sequences, the full-length MHC (when provided), and the binding probability for the corresponding complexes. See the example output file.
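
As a rough illustration of working with this table, the sketch below keeps the rows whose binding probability reaches a cutoff. It makes several assumptions: the output is a CSV file, the probability column is named predict_proba, the other headers are CDR3b and epitope, and 0.5 is an arbitrary threshold rather than a recommended one. The rows are made-up values, not real predictions.

```python
import csv
import io


def likely_binders(handle, threshold=0.5, proba_col="predict_proba"):
    """Return the rows whose predicted probability is at least `threshold`."""
    selected = []
    for row in csv.DictReader(handle):
        if float(row[proba_col]) >= threshold:
            selected.append(row)
    return selected


# Made-up rows for illustration only; column names are assumptions.
example = io.StringIO(
    "CDR3b,epitope,predict_proba\n"
    "CASSAAAF,AAAAAAAAA,0.91\n"
    "CASSGGGF,GGGGGGGGG,0.12\n"
)
for row in likely_binders(example):
    print(row["CDR3b"], row["epitope"], row["predict_proba"])
```

The cutoff should be chosen for the application at hand, since raising it trades sensitivity for specificity.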

Contact

For more questions or feedback, please simply post an Issue.

Citation

Please cite this paper if it helps your research:

@article{10.1093/bioinformatics/btad284,
    author = {Pham, My-Diem Nguyen and Nguyen, Thanh-Nhan and Tran, Le Son and Nguyen, Que-Tran Bui and Nguyen, Thien-Phuc Hoang and Pham, Thi Mong Quynh and Nguyen, Hoai-Nghia and Giang, Hoa and Phan, Minh-Duy and Nguyen, Vy},
    title = "{epiTCR: a highly sensitive predictor for TCR–peptide binding}",
    journal = {Bioinformatics},
    volume = {39},
    number = {5},
    pages = {btad284},
    year = {2023},
    month = {04},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btad284},
    url = {https://doi.org/10.1093/bioinformatics/btad284},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/5/btad284/50204900/btad284.pdf},
}

epitcr's People

Contributors

ddiem-ri-4d, nttvy, kamurani


epitcr's Issues

Question about binder in input file.

Hello
Thanks for your great tool.
I have a question about the input file. When I run epiTCR without MHC in prediction mode using the pre-trained model, I find that I must add the binder column. Is this column my own assumption about TCR and epitope binding? And in the result file, should I rely on the predict_proba and binder_pred information?
Some simple questions, and thanks for your patience!

error in predict.py

Hello

Thank you for the great tool. I am getting the error below when I run predict.py.
What am I doing wrong?

Traceback (most recent call last):
  File "predict.py", line 37, in <module>
    auc_test, acc_test, sens_test, spec_test = Model.predicMLModel(model_rf, test, pX_test, py_test, args.outfile)
  File "/epiTCR-main/src/modules/model.py", line 78, in predicMLModel
    y_rf_test_proba = model.predict_proba(X_test)
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 865, in predict_proba
    X = self._validate_X_predict(X)
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 599, in _validate_X_predict
    X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py", line 579, in _validate_data
    self._check_feature_names(X, reset=reset)
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py", line 506, in _check_feature_names
    raise ValueError(message)
ValueError: The feature names should match those that were passed during fit.
Feature names seen at fit time, yet now missing:

  • F1000
  • F1001
  • F1002
  • F1003
  • F1004
  • ...

unrecognized arguments

Hi there,
I am trying to run this program for the first time. However, I am running into an issue with the test data because it didn't seem to recognize the model file.
This is the error message I receive.
predict.py: error: unrecognized arguments: --model_file models/rdforestWithMHCModel.pickle
I unzipped the file and downloaded all dependencies - do you have any idea what the issue could be?

Best, Zita

License

Hey @ddiem-ri-4D, awesome work. What license have you released the code under? I'd love to try it out at work. Cheers, J.

Does not work for peptides longer than 11 amino acids

Hello,

Every time I try to predict binding of a peptide longer than 11 amino acids, I get the error: "could not broadcast input array from shape (25,20) into shape (11,20)".

Any clue how I might solve that ?

Thanks
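
One way to narrow this down is to separate long peptides from the rest of the input before running the prediction. This is a minimal sketch only: the limit of 11 residues is inferred from the error message above and is not confirmed behavior of epiTCR, and split_by_epitope_length is a hypothetical helper.

```python
# Assumption: the limit of 11 is read off the error message, not documented.
MAX_EPITOPE_LENGTH = 11


def split_by_epitope_length(epitopes, max_length=MAX_EPITOPE_LENGTH):
    """Split epitopes into those within the length limit and those above it."""
    accepted, too_long = [], []
    for epitope in epitopes:
        if len(epitope) <= max_length:
            accepted.append(epitope)
        else:
            too_long.append(epitope)
    return accepted, too_long


# Made-up peptides of length 9, 11 and 25.
peptides = ["A" * 9, "A" * 11, "A" * 25]
accepted, too_long = split_by_epitope_length(peptides)
print(len(accepted), "within limit;", len(too_long), "too long")
```

Rows in the second group can then be set aside and reported, rather than causing the whole run to fail.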

Describe the input file

I am trying to use this tool to predict whether a specific CDR3b sequence binds to an antigen. However, when I run predict.py with just this information, it throws an error. I don't have MHC or HLA information.
