Git Product home page Git Product logo

mtb-amr-classification-cnn's Introduction

Pipline for developing machine learning (Random Forest, Logistic Regression, Deep CNN) models of mycobacterium tuberculosis (MTB) drug resistance prediction

Introduction

This pipeline automates the process of machine learning (ML) model development for MTB first-line drug (RIF, INH, PZA and EMB) resistance classification. It covers each step from DNA-seq data download to ML model validation. In addition, the last step is to evaluate a statistical rule-based method on the same datasets used to validate ML models. Please refer to our paper Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN for more details. The following is the guide for running this pipeline. alt text

Data preparation

Assume the working directory is /mnt/MTB_AMR_Pre and the scripts and original input files needed for running this pipeline are under the working directory.

Install SRA-Toolkit for downloading fastq files. (SRA-Tools introduction)

Download DNA-seq fastq files of isolates with SRA assessions listed in 'uniqueSRA.json' into directory 'fastqDump' (a sample uniqueSRA.json is in the folder sample_input_files):

python fasterq_download.py -f uniqueSRA.json -o fastqDump 

Run Ariba in docker:

docker run --rm -it -v /mnt/MTB_AMR_Pre:/data  sangerpathogens/ariba  /bin/bash
python -m pip install joblib

Run Ariba for isolates listed in 'uniqueSRA.json' to output report files of genetic data and intermediate results, e.g., bam, vcf, contig files, into output directory 'aribaResult_withBam'.
Provide the directory of fastq files using -i.
Give the number of threads you want to use using -n.

cd  /data
ariba getref card out.card
python runAribaInLoop_withBam.py -f uniqueSRA.json -i fastqDump -o aribaResult_withBam -n 8 

Get summary from Ariba result for isolates listed in 'uniqueSRA.json':

python run_summary_inLoop.py

Training-data-creation-for-traditional-ML-methods (put the sample_input_files phenotype.tsv and lineage.xls in the working directory)

Select AMR genes, known variants and novel variants on coding regions that are detected on at least one sample as genetic features, and add 20 lineages together as the input feature set.
Generate files of feature matrices, labels and SRA accessions in the same sample order for each drug based on phenotype and lineage availability.

python get_feature_vector.py

Traditional ML

Random Forest and Logistic Regression

Read features and labels from the output files of last step.
Output multiple metrics (e.g. f-measure, sensitivity, specificity) to evaluate RF and LR models (10-fold CV).

python RF_LR_validation_multiMetricCalculated.py

Multi-input 1D CNN

Feature selection:

Use 80% of samples to get the importance score for each of the features from the previous step, 20% for validation to find the best feature importance cutoff that maximizes F score.

python select_important_feaures.py  > feature_selection_tunning_output.txt

Build and validate 1D CNN models.

As multi-inputs of the first layer, variant features are converted to normalized base counts of fixed length (21) of DNA fragments centered at focal variants' loci.
Build our 1D CNN architecture.
Train and test 4 models for the 4 first-line TB drugs, respectively.

generateInput4Conv1D_withMultiInput_N_createCNN_trainNtest_on4drugs.py

Add coverage as an additional feature.

generateInput4Conv1D_withMultiInput_N_createCNN_trainNtest_on4drugs_withCoverage.py

Evaluate a rule-based method Mykobe

Run Mykrobe for isolates listed in 'uniqueSRA.json' in docker to get its predictions:

docker run --rm -it -v /mnt/MTB_AMR_Pre:/mnt  phelimb/mykrobe_predictor /bin/bash
python -m pip install joblib
python run_Mykrobe_inLoop.py

Evaluate performance of a simple Ariba-based method and Mykrobe on 4 sets of data that were also used to train and test the ML models for the 4 drugs, respectively:

AMR_prediction_validation_Ariba_Mykrobe.py

Citation

If you use code or idea here, please cite:

Kuang, X., Wang, F., Hernandez, K. M., Zhang, Z., & Grossman, R. L. (2022). Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN. Scientific Reports, 12(1), 1-10.

mtb-amr-classification-cnn's People

Contributors

kuangxy3 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.