pughlab / viscap Goto Github PK

VisCap is an open flexible, software program targeted to clinical laboratories for inference and visualization of germline copy number variants (CNVs) from targeted next-generation sequencing data. For algorithmic details, see Pugh TJ et al. Genet Med. 2016 Jul;18(7):712-9.

R 100.00%

copy-number-variation germline

viscap's Introduction

Documentation

For updated and more detailed documentation, please refer to our Github Wiki page at https://github.com/pughlab/VisCap/wiki

Introduction

VisCap is a copy number detection and visualization tool written in R (www.r-project.org) for analysis of next-generation sequencing data derived from hybrid-capture experiments. It requires installation of the R libraries "gplots", "zoo", and “cluster”. We reported validation of this algorithm for detection of germline variation in Genetics in Medicine:

VisCap: inference and visualization of germ-line copy-number variants from targeted clinical sequencing data.
Pugh TJ, Amr SS, Bowser MJ, Gowrisankar S, Hynes E, Mahanta LM, Rehm HL, Funke B, Lebo MS.
Genet Med. 2015 Dec 17. doi: 10.1038/gim.2015.156. [Epub ahead of print]
PMID: 26681316
www.ncbi.nlm.nih.gov/pubmed/26681316

Input

As input, VisCap reads interval summary files generated by the Genome Analysis Toolkit (GATK)"DepthOfCoverage" tool (www.broadinstitute.org/gatk). For each sample, these files contain a summary of the total coverage of each genome region interval listed in a reference interval list. For our purposes, this interval list contains the coordinates of each target (usually an exon) captured by a specific bait set. Importantly, the bed file provided to GATK DepthOfCoverage must not contain any overlapping intervals as these are merged into single intervals resulting in a DepthOfCoverage file containing few intervals than the original bed file.

Algorithm

The initial step in the VisCap program is to generate a matrix of all intervals captured (~1,383 for OtoGenome) and the fractional of total coverage assigned to these interval within each sample from a batch (10 samples for production runs). When different interval lists are used to make the coverage files for samples within a batch, VisCap will use only the intervals common to all samples. Therefore, inclusion of even a single coverage file with a small number of targets will result in only these targets being considered across the entire batch. To normalize fractional coverage values for copy number detection, each is divided by the median for that target across the entire batch. These are stored as a matrix of log2 ratios and written to a file (log2_ratio_table.xls).

The X-chromosome requires further normalization as there are significant fractional coverage differences between males and females. Depending on the balance of males and females in the batch, males may display single-copy loss of chromosome X or females may display a gain. These patterns are evident from two clusters of boxplots. The clusters are detected informatically by removing outlier probes and then partitioning all samples around two medoids, a more robust version of K-means clustering. Each cluster is then normalized to zero by subtracting the median of each cluster from each sample. To facilitate review of this procedure, boxplots of log2 ratios are generated before and after subtraction of the cluster medians (QC_chrX_pre-scale.pdf and QC_chrX_post-scale.pdf).

Boxplots are constructed using the standard "boxplot()" command within R. This provides a visual representation of the distribution of log2 ratios from each sample (QC_cnv_boxplot.pdf) as well as a numeric summary used for subsequent thresholding and quality control (QC_cnv_boxplots.xls). The boxplot output summary includes: upper whisker, Q3, median, Q1, and lower whisker. Upper (lower) whiskers are the greatest (lowest) values that fall within Q3 (Q1) plus a quantity (default: 3) times the interquartile range (Q3 - Q1). A sample fails quality control if either of the boxplot whiskers extends beyond the expected theoretical log2 ratio for a single-copy gain (0.58) or a single-copy loss (-1). Failed samples are identified in the boxplot summary output, and VisCap automatically repeats the analysis until all samples pass quality control. The complete set of output files is generated for each iteration and stored in separate folders. For further quality control, log2 ratios across the entire batch are visualized in a single heatmap-based output (QC_cnv_heatmap.pdf).

The thresholding strategy for determining if a log2 ratio represents a gain or a loss is dependent on the boxplot whiskers and a set of fixed thresholds. These fixed thresholds were established based on the analysis of positive control samples (see validation report) and they represent the minimum log2 ratio for gains (0.40) and maximum log2 ratio for losses (-0.55) at which a call is made. A copy number variant is called when a log2 ratio exceeds both the upper (lower) fixed threshold and the upper (lower) whisker for the sample. The minimum number of consecutive exons with log2 ratios outside of these thresholds to call a copy number variant can also be set by the user (default: 1).

Output

For each sample, candidate copy number variants are output as a text data table (*.cnvs.xls). All calls from all intervals across all samples are also written to a file (cnv_boxplot_outliers.xls, losses -1, copy-neutral 0, gains 1). To facilitate visual review of the data for each sample, log2 ratios are plotted by relative genome order (i.e. rank order, not to scale on genome) and data points supporting copy number variants are color-coded (gains: red, losses: blue). This plot is overlaid with guidelines depicting the two sets of thresholds used for copy number variation (whiskers: dashed, fixed: solid black) and the theoretical log2 ratios (solid light gray) for single-copy gains and single-copy losses.

viscap's People

Stargazers

Watchers

Forkers

arnavaz sameet asmirnov239 biocyberman jesus807 zzygyx9119 teresaperinan inambioinfo luckyluke95 leefall rghu geng-lee

viscap's Issues

Error running VisCap

I am trying to run VisCap for one of my sample and getting following error, which i can not understand

"/data/Duplication/VisCap-master/MD36//MD36.sample_interval_summary"
Error in mat.cov[grep("X", rownames(mat.cov), invert = TRUE), ] :
incorrect number of dimensions
Calls: make_matrix_from_cov_files -> colSums -> is.data.frame
Execution halted

From the error i could understand there might be a problem with my sample_interval_summary file but i followed the steps shown, and my bed file also seems correct so please help me solve this error

bed file

chr1	46654345	46654440	GENE_ID=POMGNT1_AMPL7155142781	0	+
chr1	46654426	46654514	GENE_ID=POMGNT1_AMPL7154573365	0	+
chr1	46654472	46654555	GENE_ID=POMGNT1_AMPL7154539719	0	+
chr1	46654536	46654657	GENE_ID=POMGNT1_AMPL7153085900	0	+
chr1	46654876	46655005	GENE_ID=POMGNT1_AMPL7153021413	0	+
chr1	46654952	46655042	GENE_ID=POMGNT1_AMPL7155142780	0	+
chr1	46655037	46655159	GENE_ID=POMGNT1_AMPL7153018090	0	+
chr1	46655135	46655257	GENE_ID=POMGNT1_AMPL7153085862	0	+
chr1	46655443	46655571	GENE_ID=POMGNT1_AMPL7154537337	0	+
...

sample_interval_summary file

Target	total_coverage	average_coverage	MDDNA-201_total_cvg	MDDNA-201_mean_cvg	MDDNA-201_granular_Q1	MDDNA-201_granular_median	MDDNA-201_granular_Q3	MDDNA-201_%_above_15
chr1:46654346-46654657	802	2.57	802	2.57	3	4	5	0.0
chr1:46654877-46655257	2309	6.06	2309	6.06	4	7	10	0.0
chr1:46655444-46655667	1410	6.29	1410	6.29	5	8	9	0.0
chr1:46656036-46656267	2314	9.97	2314	9.97	9	12	12	6.9
chr1:46656358-46656479	365	2.99	365	2.99	4	4	4	0.0
chr1:46657702-46657902	444	2.21	444	2.21	1	5	5	0.0
chr1:46657927-46658123	1251	6.35	1251	6.35	2	12	12	0.0
chr1:46658157-46658283	1688	13.29	1688	13.29	13	14	15	22.8
chr1:46658538-46658666	514	3.98	514	3.98	5	5	5	0.0
chr1:46658797-46659080	2914	10.26	2914	10.26	6	9	16	28.5

Warning messages from VisCap

Hello, thanks for letting your project open.
I'm trying to test it and I'm facing this warnings:

Warning messages:
1: In image.default(z = matrix(z, ncol = 1), col = col, breaks = tmpbreaks,  :
  unsorted 'breaks' will be sorted before use
2: In matrix(ncol = 2, byrow = TRUE, data = c("Date", date(), "VisCap command",  :
  data length [35] is not a sub-multiple or multiple of the number of rows [18]

I don't understand this sentences, it can cause some error in my results?

Cheers,
George Carvalho

Viscap running only for small panel

Hi, I successfully ran Viscap for a number of individuals sequenced with Trusight cancer panel (~90 genes) Now, I´m trying to run Viscap for eight individuals sequenced with Trusight One Expanded (~6700 genes), but it's been 9 days since the program is running and only an empty QC_cnv_heatmap.pdf file has been created so far. For the cancer panel it took ~2 minutes to complete the run, so I think there is some error in the run with TrusighOne Expanded. Or is this expected?

Max samples per analysis

I am involved in a project of targeted sequencing where I want to analyze 384 samples for CNV. How many samples can I analyze per run? Could I run all the samples in just one analysis? Thank you

Test samples

Dear all,

Could anyone deposit test datas which works well to test my Viscap installation and scripts please ?
I have read the paper and do not find some datas in the supplementary datas...
I have different error messages and do not know if I do something wrong.

Thanks in advance for your help.

Best

Configuration file

Default paths are hard-coded. While these are overwritten by command line arguments or a winDialog() call, a separate configuration file would be a better approach.

Running under Linux

Hi,

I'm really looking forward to try VisCap on the samples from our lab. However, for a non-expert R user it is hard to figure out how to run and specify the .cfg file under Linux.
I would be grateful if you could add a short explanation to your Wiki.

Thanks a lot!

getting started?

Hi
VisCap seems to be an interesting tool. I want to try it out but am still not clear how to start.

The first question is: What is the exact command of gatk DepthOfCoverage should I run? There are many options to it as you may see.

So a sample command that your team used can be a good reference.

Secondly, some more details will help a lot:

What is the requirements for the BED to use with GATK.
What kind of samples are usable with VisCap? I have gene panel data captured by bait/probe hybridization and sequenced with Ion Proton. Is it usable with VisCap?
At minimum, how many control samples are required?
Do I need to remove duplicates before running GATK and VisCap?

Your answers will be much appreciated.

Kind regards
Vang

Missing gene annotation in the output files

Hi,

I am testing the VisCap. It is working well but I can not get the gene name in the file "cnv_boxplot_outliers.xls".

My interval file looks like the following:

@hd VN:1.5 SO:coordinate
@sq SN:chrM LN:16569 M5:c68f52674c9fb33aef52dcf399755519 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr3 LN:198022430 M5:641e4338fa8d52a5b781bd2a2c08d3c3 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr9 LN:141213431 M5:3e273117f15e0a400f01055d9f393768 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr10 LN:135534747 M5:988c28e000e84c26d552359af1ea2e1d UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr11 LN:135006516 M5:98c59049a2df285c76ffb1c6db8f8b96 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr12 LN:133851895 M5:51851ac0e1a115847ad36449b0015864 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr13 LN:115169878 M5:283f8d7892baa81b510a015719ca7b0b UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr14 LN:107349540 M5:98f3cae32b2a2e9524bc19813927542e UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr15 LN:102531392 M5:e5645a794a8238215b2cd77acb95a078 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr16 LN:90354753 M5:fc9b1a7b42b97a864f56b348b06095e6 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr17 LN:81195210 M5:351f64d4f4f9ddd45b35336ad97aa6de UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr18 LN:78077248 M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr19 LN:59128983 M5:1aacd71f30db8e561810913e0b72636d UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr20 LN:63025520 M5:0dec9660ec1efaaf33281c0d5ea2560f UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr21 LN:48129895 M5:2979a6085bfe28e3ad6f552f361ed74d UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chr22 LN:51304566 M5:a718acaa6135fdca8357d5bfe94211dd UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chrX LN:155270560 M5:7e0e2e580297b7764e31dbc80c2540dd UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
@sq SN:chrY LN:59373566 M5:1fa3474750af0948bdf97d5a0ee52e51 UR:file:/data1/ref/Homo_sapiens.GRCh37.75.genome.fa
chr1 10292320 10292502 + KIF1B
chr1 10316296 10316391 + KIF1B
chr1 10318542 10318740 + KIF1B

Is this the correct interval file format ? Could you please show me an example of the corrected interval file ?

Best,

Rodrigo

'chr' confusion

I have noticed getting always normal results with mislabeled plots when 'chr' word is within the interval list file provided, or in GATK DepthOfCoverage output files (*sample_interval_summary).
Please be sure that these files are without 'chr' word (i.e., chr14 should be modified to 14, using for example sed command).

Error in running Viscap

Dear all,

I want to use VisCap for my run but I have the following error message :

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’: lowess

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’: as.Date, as.Date.numeric

[1] "dpt/./2336.sample_interval_summary" Error in matrix(tab[, col_name], dimnames = list(rownames(tab), new_head)) : length of 'dimnames' [1] not equal to array extent Calls: make_matrix_from_cov_files -> matrix Execution halted

Does anybody have an idea please ?

Thanks for considering my request

Error in running Viscap

Dear all,

I want to use VisCap for my run but I have the following error message :

[1] "./coverage_files/run_11.sample_interval_summary"
Error in matrix(tab[, col_name], dimnames = list(rownames(tab), new_head)) : 
 length of 'dimnames' [1] not equal to array extent
Calls: make_matrix_from_cov_files -> matrix
Exécution arrêtée

I have generated correct output files (I think...) from GATK v.3.4 DepthOfCov but it seems that something is wrong and I really do not understand why...
Here are my files from GATK_output ::
run_11.zip

And here is my command line to run Viscap :

Rscript VisCap.R ./coverage_files output_viscap/ VisCap.cfg

and my VisCap.cfg and VisCap.R files
VisCap.zip

Does anybody have an idea please ?

I don't know where to find test files (generated by GATK DepofCov, and bed file) in order to test if the problem does not come from my files...

Thank you in advance

bedfile reader

The function for reading the bedfile searches through the bed directory and all its subdirectories.
Multiple bed files might be read into the list.

pughlab / viscap Goto Github PK

viscap's Introduction

Documentation

Introduction

Input

Algorithm

Output

viscap's People

Stargazers

Watchers

Forkers

viscap's Issues

Recommend Projects

Recommend Topics

Recommend Org