CNV-P: A machine-learning framework for filtering copy number variations

CNV-P is a novel post-processing approach for CNV filtering.

Cite this as
Wang T, Sun J, Zhang X, Wang W, Zhou Q. 2021. CNV-P: a machine-learning framework for predicting high confident copy number variations. PeerJ 9:e12564 https://doi.org/10.7717/peerj.12564

Prerequisites:

python3
sklearn
matplotlib
pysam
pandas
numpy

Install with conda

conda install python=3.7.0
conda install -c anaconda scikit-learn=0.21
conda install -c conda-forge matplotlib
conda install -c bioconda pysam
conda install -c anaconda pandas
conda install -c anaconda numpy
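
After installation, a quick check can confirm that the prerequisites are importable. The snippet below is a minimal sketch and not part of CNV-P; the helper name `missing_modules` is hypothetical, and the module names are the import names of the packages listed above.

```python
import importlib.util

# Import names of the prerequisites listed above
# (the scikit-learn package is imported as "sklearn").
REQUIRED_MODULES = ["sklearn", "matplotlib", "pysam", "pandas", "numpy"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```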

Getting started

1. CNV Predicting

Run "python script/CNV-P_predict_main.py -h" to see the usage:

usage: CNV-P_predict_main.py [-h] [-m model] -b bamfile -s CNVcaller -bed
                             BEDfile -bas basfile [-sam Samplename]
                             [-o outdir]
optional arguments:
  -h, --help            show this help message and exit
  -m model, --model model
                        which model you want to used
  -b bamfile, --bam bamfile
                        file provides features
  -s CNVcaller, --soft CNVcaller
                        which CNV caller you used
  -bed bedfile, --CNV_bed bedfile
                        the format of input file
  -bas basfile, --basfile basfile
                        file that provide mean insert-size and sequencing
                        depth
  -sam Samplename, --Samplename Samplename
                        Samplename that use as prefix of result
  -o outdir, --output outdir
                         output directory

1.1 input parameters

model: one of RF (Random Forest), GBC (Gradient Boosting Classifier) or SVM (Support Vector Machine)
CNVcaller: Lumpy, Manta, Pindel, Delly and breakdancer are currently supported; other callers need a model to be trained first (see 2. training for other CNV callers)
bamfile: the BAM file should be generated by a read aligner that supports partial read alignments, such as BWA-MEM
bedfile: this file should have 5 columns: chromosome, start, end, length of CNV, type of CNV (DUP:1, DEL:0)
for example (test_data/HG002.Lumpy.fil.mer.bed):

chr19	350768	351961	1194	1
chr19	434243	434587	345	1
chr19	566222	569347	3126	0
chr19	878739	879857	1119	1
chr19	1182660	1183097	438	0
chr19	1572816	1573149	334	0
chr19	2033040	2033182	143	0
chr19	2713161	2714159	999	0
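
Before running the prediction, it can help to check that a BED file follows the 5-column layout described above. The following is a minimal sketch, not part of CNV-P; the function name `read_cnv_bed` is hypothetical, and it assumes tab-separated columns with no header line, as in the example.

```python
import io


def read_cnv_bed(handle):
    """Parse 5-column CNV BED lines into (chrom, start, end, length, cnv_type) tuples."""
    records = []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(fields)}")
        chrom = fields[0]
        start, end, length, cnv_type = (int(value) for value in fields[1:])
        if cnv_type not in (0, 1):
            raise ValueError(f"line {line_no}: CNV type must be 1 (DUP) or 0 (DEL)")
        records.append((chrom, start, end, length, cnv_type))
    return records


# Usage with the first two rows of the example above.
example = "chr19\t350768\t351961\t1194\t1\nchr19\t434243\t434587\t345\t1\n"
for record in read_cnv_bed(io.StringIO(example)):
    print(record)
```

In the example rows, the length column equals end - start + 1; the sketch does not enforce this, since the text above does not state it as a rule.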

basfile: this file should have 4 columns: Samplename, median value of insert size, standard deviation of insert size, coverage
for example (test_data/HG002.bam.bas):

Samplename	median_insert_size	insert_size_median_sd	coverage
HG002	568.177944	163.819637	35.41
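
The bas file can be read with the standard library. This is a minimal sketch, not part of CNV-P; the function name `read_bas` is hypothetical, and it assumes a tab-separated file with the header line shown in the example.

```python
import csv
import io


def read_bas(handle):
    """Return {sample name: {column name: float}} from a tab-separated bas file."""
    stats = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        stats[row["Samplename"]] = {
            "median_insert_size": float(row["median_insert_size"]),
            "insert_size_median_sd": float(row["insert_size_median_sd"]),
            "coverage": float(row["coverage"]),
        }
    return stats


# Usage with the example above.
example = (
    "Samplename\tmedian_insert_size\tinsert_size_median_sd\tcoverage\n"
    "HG002\t568.177944\t163.819637\t35.41\n"
)
print(read_bas(io.StringIO(example))["HG002"]["coverage"])
```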

1.2 output

samplename.feature.txt: Extracted feature matrix.
samplename.pre.prop.txt: The prediction result and probability score. It contains 7 columns:

ChrID: Chromosome (e.g. chr3, chrY)
start: Start coordinate on the chromosome 
end: End coordinate on the chromosome
length: Length of the CNV
CNV_type: Type of CNV (DUP:1, DEL:0)
class: Prediction result (true CNV: 1, false CNV: 0)
probability_score: Probability that this CNV is true
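
The prediction file can be post-filtered on the last two columns. The sketch below is not part of CNV-P and rests on assumptions that are not confirmed above: that the file is tab-separated, that it may start with a header line (controlled by the hypothetical `has_header` argument), and that the columns appear in the order listed. The function name and the default threshold are illustrative only.

```python
import csv
import io

# Column order as listed above.
COLUMNS = ["ChrID", "start", "end", "length", "CNV_type", "class", "probability_score"]


def high_confidence_cnvs(handle, min_score=0.5, has_header=True):
    """Keep rows predicted as true CNVs (class 1) with probability_score >= min_score."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumption: first line holds the column names
    kept = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, fields))
        is_true_cnv = int(float(row["class"])) == 1
        if is_true_cnv and float(row["probability_score"]) >= min_score:
            kept.append(row)
    return kept
```

Raising `min_score` trades recall for precision; the value to use depends on the application.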

1.3 running example

python script/CNV-P_predict_main.py -m RF -b Test_data/HG002.test.bam -s Lumpy -bed Test_data/HG002.Lumpy.fil.mer.bed -bas Test_data/HG002.bam.bas -sam HG002 -o Test_data/out/
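
When several samples need to be processed, the command line can be assembled programmatically. This is a minimal sketch, not part of CNV-P: the helper `predict_command` is hypothetical, it only builds the argument list from the flags shown in the example above, and it does not run anything by itself.

```python
import shlex


def predict_command(bam, bed, bas, sample, outdir, caller="Lumpy", model="RF"):
    """Build the argument list for script/CNV-P_predict_main.py using the flags shown above."""
    return [
        "python", "script/CNV-P_predict_main.py",
        "-m", model,
        "-b", bam,
        "-s", caller,
        "-bed", bed,
        "-bas", bas,
        "-sam", sample,
        "-o", outdir,
    ]


cmd = predict_command(
    bam="Test_data/HG002.test.bam",
    bed="Test_data/HG002.Lumpy.fil.mer.bed",
    bas="Test_data/HG002.bam.bas",
    sample="HG002",
    outdir="Test_data/out/",
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.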


2. training for other CNV callers

To train a model for another CNV caller, first use 'CNV-P_featureExtract_main.py' to extract features:

python script/CNV-P_featureExtract_main.py -b test-data/HG002.test.bam -bed test-data/HG002.Lumpy.fil.mer.bed -bas test-data/HG002.bam.bas -sam HG002 -o test-data/out/

Then run "script/CNV-P_training_main.py" to train a model.
Run "python script/CNV-P_training_main.py -h" to see the usage:

usage: CNV-P_training_main.py [-h] [-m model] -s CNVcaller -fea featuresfile
                              -lab labelfile [-o outdir]
optional arguments:
  -h, --help            show this help message and exit
  -m model, --model model
                        which model you want to used
  -s CNVcaller, --soft CNVcaller
                        which CNV caller you used
  -fea featuresfile, --features featuresfile
                        file that provide traing features
  -lab labelfile, --labelfile labelfile
                        file that provide CNV label, true CNVs labeled as
                        1,false CNVs labeled as 0, The order should
                        corresponds to CNV_bed file(-bed/--CNV_bed) one to one
  -o outdir, --output outdir
                         output directory

2.1 input parameters

featuresfile: file that provides the training features, produced by 'CNV-P_featureExtract_main.py'
labelfile: one column; true CNVs are labeled 1 and false CNVs are labeled 0
for example (see test-data/HG002.Lumpy.chr1.label.txt):

1
1
0
0
0
0
1
0
1
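
Because the labels must correspond one to one with the CNVs in the BED file, a sanity check before training can catch mismatched files. The following is a minimal sketch, not part of CNV-P; the function names `read_labels` and `check_label_count` are hypothetical.

```python
import io


def read_labels(handle):
    """Read a one-column label file; every non-blank line must be 0 or 1."""
    labels = []
    for line_no, line in enumerate(handle, start=1):
        value = line.strip()
        if not value:
            continue  # skip blank lines
        if value not in ("0", "1"):
            raise ValueError(f"line {line_no}: label must be 0 or 1, got {value!r}")
        labels.append(int(value))
    return labels


def check_label_count(labels, n_cnvs):
    """Raise if the number of labels differs from the number of CNVs in the BED file."""
    if len(labels) != n_cnvs:
        raise ValueError(f"{len(labels)} labels for {n_cnvs} CNVs")


# Usage with the nine labels of the example above.
labels = read_labels(io.StringIO("1\n1\n0\n0\n0\n0\n1\n0\n1\n"))
check_label_count(labels, n_cnvs=9)
print(sum(labels), "true CNVs out of", len(labels))
```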

2.2 output

CNVcaller.model.train_model.m: the classifier you trained
CNV-P_CNVcaller_model_Classifier.ROC.pdf, CNV-P_CNVcaller_model_Classifier.ROC.png: the ROC curve from 10-fold cross-validation

2.3 running example

python script/CNV-P_training_main.py -s Lumpy -fea test-data/HG002.Lumpy.chr1.feature.txt -lab test-data/HG002.Lumpy.chr1.label.txt -o test-data/out/

Please help us improve CNV-P by reporting bugs or ideas on how to make things better.



Comparison with CNV-JACG, MetaSV and hard cutoff method

We compared the performance of CNV-P with that of CNV-JACG (Zhuang et al. 2020), MetaSV (Mohiyuddin et al. 2015) and the hard cutoff method on the same datasets. Since MetaSV currently does not support Delly's output, only four CNV detection tools (Lumpy, Manta, Pindel, and breakdancer) were taken into consideration. CNV-JACG was run with default parameters. MetaSV was run in complete mode. For the hard cutoff method, we used SR and RP as the evidence supporting the existence of a CNV; accordingly, SR and RP counts greater than 2, 5, and 10 were evaluated as hard cutoffs. SURVIVOR (Jeffares et al. 2017) was used to merge fragments with 80% overlap after filtering by CNV-P, CNV-JACG, MetaSV and the hard cutoff method.
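
The hard cutoff baseline can be pictured as a simple threshold on read support. The sketch below is an illustration only, not the code used for the comparison: the text does not say whether SR and RP were thresholded separately or combined, so the sketch assumes their sum is compared with the cutoff, and the function name `passes_hard_cutoff` is hypothetical.

```python
def passes_hard_cutoff(sr, rp, cutoff):
    """Return True if the combined SR and RP support exceeds the cutoff.

    Assumption: support is the sum of SR and RP counts; the source does not
    specify whether the two counts were combined or checked individually.
    """
    return (sr + rp) > cutoff


# Evaluate one hypothetical call (SR = 3, RP = 4) against the three cutoffs.
for cutoff in (2, 5, 10):
    print(cutoff, passes_hard_cutoff(sr=3, rp=4, cutoff=cutoff))
```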


Figure: Process framework

Comparison with CNV-JACG, MetaSV and hard cutoff method in NA12878 and HG002.

Sample   Method          Precision  Recall  F1-score
NA12878  RAW             0.6032     1.0000  0.7525
NA12878  Hard_Cutoff_2   0.6197     0.9792  0.7590
NA12878  Hard_Cutoff_5   0.7145     0.8630  0.7818
NA12878  Hard_Cutoff_10  0.7780     0.6976  0.7356
NA12878  CNV-JACG        0.6828     0.7496  0.7146
NA12878  MetaSV          0.7094     0.8817  0.7862
NA12878  CNV-P           0.9007     0.7977  0.8461
HG002    RAW             0.2054     1.0000  0.3408
HG002    Hard_Cutoff_2   0.4026     0.9729  0.5695
HG002    Hard_Cutoff_5   0.5740     0.8653  0.6901
HG002    Hard_Cutoff_10  0.6642     0.7482  0.7037
HG002    CNV-JACG        0.5443     0.7076  0.6153
HG002    MetaSV          0.5917     0.8274  0.6900
HG002    CNV-P           0.7078     0.7516  0.7290
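
The F1-score column is consistent with the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). The snippet below is a minimal sketch for recomputing it from the other two columns; the function name `f1_score` is ours and the formula is the standard definition, not something stated in the text above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute the F1-score of CNV-P on NA12878 from its precision and recall.
print(round(f1_score(0.9007, 0.7977), 4))  # 0.8461
```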
