yuchaojiang / codex2 Goto Github PK

View Code? Open in Web Editor NEW

37.0 37.0 12.0 79.6 MB

Full-spectrum copy number variation detection by high-throughput DNA sequencing

License: GNU General Public License v2.0

R 100.00%

codex2's People

Contributors

Stargazers

Watchers

Forkers

fatcc baiyuanxiang bwubb globalbiotech n-damo xiaohuaniu0032 wenjingli3367 adamluo12 stevi-x mqondisi wook2014 johnnywang92

codex2's Issues

CODEX2 for low coverage WGS

We have a low pass WGS data set and I found none of them can pass the sd threshold during normalization:
nonCNV.index=apply(z.codex,1,sd) < 0.15

summary(apply(z.codex,1,sd))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.2869  0.5146  0.5487  0.5481  0.5822  0.736

What might be the concerns to increase the sd's threshold? say at median=0.55 or other fixed threshold?

The low coverage issue also found in the qc step where 104221 out of 135087 exons were excluded. (102876 due to extreme low coverage), I seem have to reduce the minimum coverage threshold as well. Any suggestions for it?

the meaning of the final result

Dear Dr.J,
Thank you for your work in developing CODEX2. As mentioned in title, how to understand the title 'sample_name chr cnv st_bp ed_bp length_kb st_exon ed_exon raw_cov norm_cov copy_no lratio mBIC' in the final result.?
Best regards
Li'an Lin

failed in loading hg38

library(CODEX2)
library(BSgenome.Hsapiens.UCSC.hg38)

Attaching package: ‘BSgenome.Hsapiens.UCSC.hg38’

The following object is masked from ‘package:BSgenome.Hsapiens.UCSC.hg19’:

Hsapiens

The following object is masked from ‘package:BSgenome.Hsapiens.UCSC.hg19’: Hsapiens

gc <- getgc(ref, genome = BSgenome.Hsapiens.UCSC.hg38)
Error: object 'ref' not found

BSgenome problem

I was following commands in the manual. Everything was fine until this command mapp <- getmapp(ref, genome = genome). I got an error message: Error in getmapp(ref, genome = genome) :
no slot of name "provider_version" for this object of class "BSgenome".

Thanks for any advice!

Some bugs and suggestion

Line 11 and Line 25 in qc.R:
You can replace end(ref) - start(ref)+1 by width(ref), for ref in of type IRranges
Line 67-68 in normalize.R:
Offset might be L[s,] , not log(L[s,]), because L[,] is already the logarithm of (Nmat * fhat * betahatmat)
as in line 54
Line 92 in codex_targeted.R:
The qcObj here is created across all chromosomes, but when call the qc function in qc.R
it seems that your qc function is for one chromosome, as in line 24,"rep(chr, length(ref))"
so when applied to targeted sequencing, we should alter the qc.R in the package
from "rep(chr, length(ref))" to 'chr' in line 24 and call it with 'chr.all' in condex_targeted.R

problem and question with CODEX2

Good morning,
I'm Pasquale.
I have some major issues with using CODEX.
We conducted a sequencing of only 1 ZC3H10 gene on 96 samples (controls and patients). from the trace on IGV there seems to be a CNV, so I started using the CODEX tools.
I wonder ask yoi:

does it work even with such a small portion (about 1kb)? why is it worth testing samples?
because if I use chr <- 12 it calculates the contents of GC and mapp, but then the rest doesn't work for me, while if I use chr <- NC_000012.12 the rest works for me but not GC and mapp??
also this step gives me problems

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
length_thresh = c(20, 2000), mapp_thresh = 0.9,
gc_thresh = c(20, 80))
Excluded 8 exons due to extreme coverage.
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'end': argument "ref" is missing, with no default

I would be grateful if you give me some suggestions.

getgc() function

chrtemp variable is named as chrX which causes paste('chr',chrtemp,sep='') to name it as "chrchrX" and fail.

getgc =function (ref, genome = NULL) {
  if(is.null(genome)){genome=BSgenome.Hsapiens.UCSC.hg19}
  gc=rep(NA,length(ref))
  for(chr in unique(seqnames(ref))){
    chr="chrX"
    message("Getting GC content for chr ", chr, sep = "")
    chr.index=which(as.matrix(seqnames(ref))==chr)
    ref.chr=IRanges(start= start(ref)[chr.index] , end = end(ref)[chr.index])
    if (chr == "X" | chr == "x" | chr == "chrX" | chr == "chrx") {
      chrtemp <- 'chrX'
    } else if (chr == "Y" | chr == "y" | chr == "chrY" | chr == 
               "chry") {
      chrtemp <- 'chrY'
    } else {
      chrtemp <- as.numeric(mapSeqlevels(as.character(chr), "NCBI")[1])
    }
    if (length(chrtemp) == 0) message("Chromosome cannot be found in NCBI Homo sapiens database!")
    chrm <- unmasked(genome[[paste('chr',chrtemp,sep='')]])
    seqs <- Views(chrm, ref.chr)
    af <- alphabetFrequency(seqs, baseOnly = TRUE, as.prob = TRUE)
    gc[chr.index] <- round((af[, "G"] + af[, "C"]) * 100, 2)
  }
  gc
}

I fix this on my data changing chrtemp <- 'chrX' to chrtemp <- 'X'

Let my know if this makes sense to you or it is just me!

Issues with qc function

Any ideas why I am getting this error?

Excluded 5316 exons due to extreme coverage.
Excluded 12333 exons due to extreme exonic length.
Excluded 12988 exons due to extreme mappability.
Excluded NA exons due to extreme GC content.
After taking union, excluded NA out of 217488 exons in QC.
Error: logical subscript contains NAs

segmentCBS error

Hi,

I am using the software on WES samples using h38.
The hole software runs well until this part:

> finalcall.CBS <- segmentCBS(Y_qc[chr.index,],
+                             Yhat, optK = which.max(BIC),
+                             K = 1:5,
+                             sampname_qc = sampname_qc,
+                             ref_qc = ranges(ref_qc)[chr.index],
+                             chr = chr, lmax = 400, mode = "integer")
Segmenting sample 1: WES_val18_1..
Error in rep(1:num, rep(lmax, num)) : invalid 'times' argument

First I tried 60 samples, in a second test I only used 7 samples.
Any ideas why I am getting this error?

Thank you for your help and kind regards,
Nicolas

missing or infinite values in inputs are not allowed

Hi,
I follow this tutorial: http://htmlpreview.github.io/?https://github.com/yuchaojiang/CODEX2/blob/master/demo/CODEX2.html

I have ~180 bam files that came from the same lab, but in different time periods (batches of 9-15 files).
When I analyze them together I get error:

Error in smooth.spline(gc_qc, z) : 
  missing or infinite values in inputs are not allowed

(maybe it comes from:

> pseudo.sample
  1:861321-861393   1:865534-865716   1:866418-866469   1:871151-871276   1:874419-874509   1:874654-874840   1:876523-876686   1:877938-878438 
              Inf               Inf               Inf               Inf               Inf               Inf               Inf               Inf 
  1:878632-878757   1:879077-879188   1:879287-879533   1:880073-880180   1:880436-880526   1:880897-881033   1:881552-881666   1:881781-881925 
              Inf

)

are there some methods, to firstly run "batches" separately, and then merge the results and follow the tutorial?

Thanks for any advice,
Damian

CODEX2 targeted Error: subscript out of bounds

I am trying to run CODEX2 on couple of targeted resequencing samples. the bed file is broken down into intervals of 250 basepairs for the regions of interest. I got the targeted demo code from git and initiated the run. Below is my log and error that I ran into. Any help understanding this error and fixing it would be great!

Thanks!

`Getting GC content for chr 1
Getting GC content for chr 10
Getting GC content for chr 11
Getting GC content for chr 12
Getting GC content for chr 13
Getting GC content for chr 14
Getting GC content for chr 15
Getting GC content for chr 16
Getting GC content for chr 17
Getting GC content for chr 18
Getting GC content for chr 19
Getting GC content for chr 2
Getting GC content for chr 20
Getting GC content for chr 21
Getting GC content for chr 22
Getting GC content for chr 3
Getting GC content for chr 4
Getting GC content for chr 5
Getting GC content for chr 6
Getting GC content for chr 7
Getting GC content for chr 8
Getting GC content for chr 9
Getting GC content for chr X
Getting GC content for chr Y
Getting mappability for chr1
Getting mappability for chr10
Getting mappability for chr11
Getting mappability for chr12
Getting mappability for chr13
Getting mappability for chr14
Getting mappability for chr15
Getting mappability for chr16
Getting mappability for chr17
Getting mappability for chr18
Getting mappability for chr19
Getting mappability for chr2
Getting mappability for chr20
Getting mappability for chr21
Getting mappability for chr22
Getting mappability for chr3
Getting mappability for chr4
Getting mappability for chr5
Getting mappability for chr6
Getting mappability for chr7
Getting mappability for chr8
Getting mappability for chr9
Getting mappability for chrX
Getting mappability for chrY

Getting coverage for sample Sample1: read length 97.
Getting coverage for sample Sample2: read length 97.
Error in Y[, 1:5] : subscript out of bounds
Calls: head
Execution halted
`

Error in genome[[chrtemp]] : no such sequence, getgc -> unmasked -> [[ -> [[

Hi,
I tried CODEX2 for some sample sets. The first two sets was just fine.
For the third set, I got the following error.

....
normalize

Getting GC content for chr chr1
Getting GC content for chr chr10
Getting GC content for chr chr11
Getting GC content for chr chr12
Getting GC content for chr chr13
Getting GC content for chr chr14
Getting GC content for chr chr15
Getting GC content for chr chr16
Getting GC content for chr chr17
Getting GC content for chr chr18
Getting GC content for chr chr19
Getting GC content for chr chr2
Getting GC content for chr chr20
Getting GC content for chr chr21
Getting GC content for chr chr22
Getting GC content for chr chr3
Getting GC content for chr chr4
Getting GC content for chr chr5
Getting GC content for chr chr6
Getting GC content for chr chr7
Getting GC content for chr chr8
Getting GC content for chr chr9
Getting GC content for chr chrM
Error in genome[[chrtemp]] : no such sequence
Calls: getgc -> unmasked -> [[ -> [[
In addition: Warning message:
In getgc(ref, genome = genome) : NAs introduced by coercion
Execution halted

Could you please suggest what could be the possible causes of this?
Many thanks.
--duangdao.

An error is occurring-obtaining GC content and mappability stage

Hi,
I am using CODEX2 (version: 1.3.0). I have used BSgenome.Hsapiens.UCSC.hg38 as the reference genome.
After running the command for obtaining GC content, the following error has been occurring:

Getting GC content for chr chr1
Getting GC content for chr chr10
Getting GC content for chr chr11
Getting GC content for chr chr12
Getting GC content for chr chr13
Getting GC content for chr chr14
Getting GC content for chr chr15
Getting GC content for chr chr16
Getting GC content for chr chr17
Getting GC content for chr chr18
Getting GC content for chr chr19
Getting GC content for chr chr2
Getting GC content for chr chr20
Getting GC content for chr chr21
Getting GC content for chr chr22
Getting GC content for chr chr3
Getting GC content for chr chr4
Getting GC content for chr chr5
Getting GC content for chr chr6
Getting GC content for chr chr7
Getting GC content for chr chr8
Getting GC content for chr chr9
Getting GC content for chr chrM
Error in genome[[paste("chr", chrtemp, sep = "")]] : no such sequence
In addition: Warning message:
In getgc(ref, genome = genome) : NAs introduced by coercion

In addition, when I ran the command to obtain mappability, the following error has been occurring:

Error in getmapp(ref, genome = genome) :
no slot of name "metadata" for this object of class "BSgenome"

With reference to the above errors, can you please suggest some ways to fix the issue?

getgc() failing on sex chromosomes

I have an issue with getgc() when I run the CODEX2 targeted demo. It runs fine with my .bed file, but fails when it gets to chrX. I have triple-checked the .bed file and verified that it runs for all lines up to the first chrX entry. The results from the pipeline (using a .bed file with only autosomal regions) look great. How can I get it to run with chrX and chrY?

Commands and output:

Initialize Variables

dirPath=getwd()
bamFile <- list.files(dirPath, pattern = '.q1plus..bam$')
bamdir <- file.path(dirPath, bamFile)
sampname <- substr(bamFile,20,27)
norm_index <- c(1,4:7,10:50)

bedFile <- list.files(dirPath, pattern = 'exons_25bp_pad.bed')

bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, sampname = sampname, projectname = "q1plus")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname

ref
GRanges object with 55562 ranges and 0 metadata columns:
seqnames ranges strand

[1] chr1 948928-948981 *
[2] chr1 949338-949883 *
[3] chr1 955527-955778 *
[4] chr1 957555-957867 *
[5] chr1 970631-970729 *
... ... ... ...
[55558] chrY 14959138-14959277 *
[55559] chrY 14968239-14968446 *
[55560] chrY 14968532-14968795 *
[55561] chrY 14969465-14969611 *
[55562] chrY 14971178-14971366 *

seqinfo: 24 sequences from an unspecified genome; no seqlengths

Verify chromosome names match exactly

genome = BSgenome.Hsapiens.UCSC.hg19
head(seqnames(genome),25)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY" "chrM"

unique(read.table(file="exons_25bp_pad_MSOC.bed")[,1])
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"

unique(read.table(file="exons_25bp_pad_MSOC.bed")[,1]) == head(seqnames(genome),24)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Call getgc()

gc <- getgc(ref, genome = BSgenome.Hsapiens.UCSC.hg19)
Getting GC content for chr chr1
Getting GC content for chr chr2
Getting GC content for chr chr3
Getting GC content for chr chr4
Getting GC content for chr chr5
Getting GC content for chr chr6
Getting GC content for chr chr7
Getting GC content for chr chr8
Getting GC content for chr chr9
Getting GC content for chr chr10
Getting GC content for chr chr11
Getting GC content for chr chr12
Getting GC content for chr chr13
Getting GC content for chr chr14
Getting GC content for chr chr15
Getting GC content for chr chr16
Getting GC content for chr chr17
Getting GC content for chr chr18
Getting GC content for chr chr19
Getting GC content for chr chr20
Getting GC content for chr chr21
Getting GC content for chr chr22
Getting GC content for chr chrX
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'unmasked': no such sequence

hg38

Hi,

is Codex2 able to handle hg38 genome ?
'cos during installation progress it means '(BSgenome.Hsapiens.UCSC.hg19)'

thx

Need help to understand the choice of optK

Hello Dr Yuchao

Thanks to you for your kind and clear reply. I hope you are enjoying this lovely Christmas time :-)
May I ask you one more question?
In general, the model which has the lowest BIC and RSS evaluated as the most efficient one.
However, CODEX2 determines the latent factor K which has the highest value of BIC.

Although I guess it is because of CODEX2 may have some different working principles from other programing models,
I couldn't clear what the difference is.

So could you give me some clues to clarify why in CODEX2 the K that has the highest value of BIC (instead of the lowest) is recommended as optK?

Demo code is this:

finalcall.CBS <- segmentCBS(Y_qc[chr.index,],  # recommended
                            Yhat.ns, optK = which.max(BIC.ns),
                            K = 1:5,
                            sampname_qc = colnames(Y_qc),
                            ref_qc = ranges(ref_qc)[chr.index],
                            chr = chr, lmax = 400, mode = "integer")

Difference between getmapp.R and mapp.R

I noticed there two different versions of function for calculating mappability and they take different inputs as arguments. What is the difference between those two? If I am using mouse genome, which one should I use?

Thanks

how to run CODEX2 with a pair of Tomor-Normal bam？

now,I have sevaral pairs of sorted.markup.bam to call CNV
In CODEX2.Rmd , I found following command to fill in my Tumor bam:
bamFile <- list.files(dirPath, pattern = '.bam$')

But in the part of ###Running CODEX2 with negative control samples###

{

Running CODEX2 with negative control samples

For the case-control scenario, the normal sample index is known (samples without spike-in signals).

# Below are pre-computed demo dataset, stored as part of the CODEX2 R-package.
normObj <- normalize_codex2_ns(Y_qc = Y_qc[chr.index,],
                               gc_qc = gc_qc[chr.index], 
                               K = 1:5, norm_index = norm_index_demo,
                               N = N)
Yhat.ns <- normObj$Yhat; fGC.hat.ns <- normObj$fGC.hat;
beta.hat.ns <- normObj$beta.hat; g.hat.ns <- normObj$g.hat; h.hat.ns <- normObj$h.hat
AIC.ns <- normObj$AIC; BIC.ns <- normObj$BIC; RSS.ns <- normObj$RSS

Choose the number of latent Poisson factors. BIC is used as the model selection metric by default.

choiceofK(AIC.ns, BIC.ns, RSS.ns, K = 1:5 , filename = "codex2_ns_choiceofK.pdf")

par(mfrow = c(1, 3))
plot(1:5, RSS.ns, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, AIC.ns, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, BIC.ns, type = "b", xlab = "Number of latent variables", pch=20)
par(mfrow = c(1,1))

}

it seem that no normal bam can be filled in.
How can I run CODEX2 with a pair of Tomor-Normal bam?
Is it ture that Normal Bam can be the input of CODEX2?
Sincerely hoping to receive your reply.

how to set sampname_qc correctly

I followed the pipeline provided in http://htmlpreview.github.io/?https://github.com/yuchaojiang/CODEX2/blob/master/demo/CODEX2.html. The output result like
. Obviously, it miss the true sample name,how do i fix it or how to set sampname_qc parameter correctly?

finalcall.CBS <- segmentCBS(Y_qc[chr.index,], # recommended Yhat.ns, optK = which.max(BIC.ns), K = 1:5, sampname_qc = paste('sample',1:ncol(Y_qc),sep=''), ref_qc = ranges(ref_qc)[chr.index], chr = chr, lmax = 400, mode = "integer")

error with BSgenome.Hsapiens.UCSC.hg38

Hi,

chrX and chrY chromosomes do not seem to be recognized:

gc <- getgc(ref, genome = genome)
Getting GC content for chr chr1
Getting GC content for chr chr10
Getting GC content for chr chr11
Getting GC content for chr chr12
Getting GC content for chr chr13
Getting GC content for chr chr14
Getting GC content for chr chr15
Getting GC content for chr chr16
Getting GC content for chr chr17
Getting GC content for chr chr18
Getting GC content for chr chr19
Getting GC content for chr chr2
Getting GC content for chr chr20
Getting GC content for chr chr21
Getting GC content for chr chr22
Getting GC content for chr chr3
Getting GC content for chr chr4
Getting GC content for chr chr5
Getting GC content for chr chr6
Getting GC content for chr chr7
Getting GC content for chr chr8
Getting GC content for chr chr9
Getting GC content for chr chrX
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'unmasked': no such sequence

mapp <- getmapp(ref, genome = genome)
Getting mappability for chr1
Getting mappability for chr10
Getting mappability for chr11
Getting mappability for chr12
Getting mappability for chr13
Getting mappability for chr14
Getting mappability for chr15
Getting mappability for chr16
Getting mappability for chr17
Getting mappability for chr18
Getting mappability for chr19
Getting mappability for chr2
Getting mappability for chr20
Getting mappability for chr21
Getting mappability for chr22
Getting mappability for chr3
Getting mappability for chr4
Getting mappability for chr5
Getting mappability for chr6
Getting mappability for chr7
Getting mappability for chr8
Getting mappability for chr9
Getting mappability for chrX
Getting mappability for chrY
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))
Error in FUN(X[[i]], ...) : object of type 'closure' is not subsettable

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
length_thresh = c(20, 2000), mapp_thresh = 0.9,
gc_thresh = c(20, 80))
Excluded 260 exons due to extreme coverage.
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'end': argument "ref" is missing, with no default
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
Error in getListElement(x, i, ...) :
IRanges objects don't support [[, as.list(), lapply(), or unlist() at
the moment

seqnames(bambedObj$ref)
factor-Rle of length 32821 with 24 runs
Lengths: 2605 1405 1462 1529 1069 ... 1768 1611 1386 1709 202
Values : chr1 chr10 chr11 chr12 chr13 ... chr7 chr8 chr9 chrX chrY
Levels(24): chr1 chr10 chr11 chr12 chr13 chr14 ... chr7 chr8 chr9 chrX chrY

codex2_targeted.R, getcoverage error

Hi,
This function return an error:

CODEX2/targeted_sequencing/codex2_targeted.R

Line 22 in dcd1be5

coverageObj=getcoverage(bambedObj,mapqthres=20)

> coverageObj=getcoverage(bambedObj,mapqthres=20)
Error in message("Getting coverage for chr ", chr, sep = "") :
argument "chr" is missing, with no default

?getcoverage
getcoverage {CODEX2}
(...)
getcoverage(bambedObj, mapqthres, chr)

Cheers
Damian Loska

Need Help

Hello Dr Yuchao

I am having a few issues while running CODEX2. I have not used CODEX before and am unfamiliar with the Poisson model. I am sorry if i am asking some apparent questions.

My situation is this: I have 2 Truth Samples (samples whose CNV details I know are correct). I have a test batch of 35 samples.
I have run CODEX2 in these manners:
Run 1:
Sample BAMs taken - 35 from test batch | K = 1:5 | chr: 17
Output:

	sample_name	chr	cnv	st_bp	ed_bp	length_kb	st_exon	ed_exon	raw_cov	norm_cov	copy_no	lratio	mBIC
cnv1	Sample 1	chr17	dup	7687395	7689351	1.957	1307	1311	474	341	3	21.683	4.216
cnv2	Sample 1	chr17	del	19348563	19353709	5.147	799	800	49	104	1	18.038	4.88
cnv3	Sample 3	chr17	dup	3094629	3121197	26.569	1150	1156	2117	1548	3	84.334	66.263
cnv4	Sample 3	chr17	dup	28673501	28681954	8.454	874	875	206	129	3	19.022	68.109
cnv5	Sample 4	chr17	dup	73626838	73627110	0.273	532	533	604	446	3	21.891	4.485
cnv6	Sample 4	chr17	dup	19348563	19353709	5.147	799	800	389	279	3	18.219	5.588
cnv7	Sample 6	chr17	del	41209048	41215988	6.941	210	212	252	443	1	46.84	29.398

Run 2:
Sample BAMs taken - 35 from test batch | K = 1:10 | chr: 17
Output:

	sample_name	chr	cnv	st_bp	ed_bp	length_kb	st_exon	ed_exon	raw_cov	norm_cov	copy_no	lratio	mBIC
cnv1	Sample 1	chr17	dup	7687395	7689351	1.957	1307	1311	474	341	3	21.683	4.216
cnv2	Sample 1	chr17	del	19348563	19353709	5.147	799	800	49	104	1	18.038	4.88
cnv3	Sample 3	chr17	dup	3094629	3121197	26.569	1150	1156	2117	1548	3	84.334	66.263
cnv4	Sample 3	chr17	dup	28673501	28681954	8.454	874	875	206	129	3	19.022	68.109
cnv5	Sample 4	chr17	dup	73626838	73627110	0.273	532	533	604	446	3	21.891	4.485
cnv6	Sample 4	chr17	dup	19348563	19353709	5.147	799	800	389	279	3	18.219	5.588
cnv7	Sample 6	chr17	del	41209048	41215988	6.941	210	212	252	443	1	46.84	29.398

Run 3:
Sample BAMs taken - 35 from test batch | K = 1:5 | chr: 13 and 17
Output:

	sample_name	chr	cnv	st_bp	ed_bp	length_kb	st_exon	ed_exon	raw_cov	norm_cov	copy_no	lratio	mBIC
cnv1	Sample 1	chr13	dup	7687395	7689351	1.957	903	905	317	218	3	19.527	3.53
cnv2	Sample 1	chr17	dup	78403552	78410018	6.467	581	582	167	97	3	19.21	6.351
cnv3	Sample 4	chr13	dup	67647500	68331876	684.377	217	219	715	538	3	20.896	4.269
cnv4	Sample 4	chr17	del	10278985	10284601	5.617	924	925	122	208	1	19.442	8.308
cnv5	Sample 4	chr13	dup	75485511	75489741	4.231	228	229	439	319	3	18.492	11.985

Run 4:
Sample BAMs taken - 18 from test batch and 2 Truth Sample (named TS1 and TS2) | K = 1:20 | All Chr
Output:

	sample_name	chr	cnv	st_bp	ed_bp	length_kb	st_exon	ed_exon	raw_cov	norm_cov	copy_no	lratio	mBIC
cnv1	TS1	chr1	del	14862604	153993261	139130.658	791	836	6452	9534	1	295.137	279.455
cnv2	TS2	chr1	dup	14862604	153230197	138367.594	791	833	13222	10416	3	152.839	136.733
cnv3	Sample 2	chr1	del	14862604	154221443	139358.84	791	838	5468	10091	1	1255.645	1247.018
cnv4	Sample 4	chr1	dup	45609814	50117179	4507.366	558	559	269	182	3	18.066	1.993
cnv5	Sample 5	chr1	dup	151849769	151900169	50.401	353	354	820	601	3	31.968	15.849

Run 5:
Sample BAMs taken - 18 from test batch and 2 Truth Sample (named TS1 and TS2) | K = 1:5 | chr 17
Output:

	sample_name	chr	cnv	st_bp	ed_bp	length_kb	st_exon	ed_exon	raw_cov	norm_cov	copy_no	lratio	mBIC
cnv1	Sample 4	chr17	dup	5036717	5041065	4.349	127	131	980	755	3	19.839	2.346
cnv2	Sample 4	chr17	dup	8480533	8508320	27.788	305	307	301	205	3	19.54	4.606
cnv3	Sample 4	chr17	dup	36874077	36875856	1.78	590	591	751	574	3	17.492	5
cnv4	Sample 6	chr17	del	41209048	41215988	6.941	802	804	252	440	1	45.34	27.665

I am not able to get concurrent results from the above outputs.

I am getting the same output in Run 1 and Run 2, even if there is a change in K.
In Run 3, when I introduce 2 chromosomes, Chr 13 and 17, the output doesn't contain some of the observed events in Run 1 and 2.
In Run 4, Output states TS1 has del and TS2 has dup in Chr 1, which does not match the True CNV output.
In Run 5, when I only mention Chr 17, the output doesn't consider the BRCA1 deletion present in TS1, as per the True CNV output.

The code is as follows:

#Preprocessing ####
library(CODEX2)

# Set the path to the directory containing the BAM files 
dirPath = file.path('','home','bioinfo','Nilesh','CODEX2','Attempt_18B_Truth')

# Get the names of all BAM files in the directory
bamFile <- list.files(dirPath, pattern = '*.bam$')

# Create a new path by combining the directory path and the BAM file names
bamdir <- file.path(dirPath, bamFile)

# Extract the sample name from the BAM file name
sampname <- substr(bamFile,1,4)

# Set the path to the BED file
bedFile <- file.path(dirPath, "test_codex.bed")

# Create an object containing the BAM, BED, sample name, and project name
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, 
                       sampname = sampname, projectname = "CODEX2_18B-Truth")

# Extract the BAM directory, sample name, reference, and project name from the object
bamdir <- bambedObj$bamdir
sampname <- bambedObj$sampname
ref <- bambedObj$ref
projectname <- bambedObj$projectname
#------------------------------------------------------------------------------#

#GC Content and Mappability ####
genome = BSgenome.Hsapiens.UCSC.hg19 #hg19
#-- Defining Funcion getgc --#
# Define a function to calculate the GC content of a set of genomic ranges
getgc = function (ref, genome = NULL) {
  # If the genome argument is NULL, set it to the hg19 genome
  if(is.null(genome)){genome = BSgenome.Hsapiens.UCSC.hg19}
  
  # Create an empty vector to store the GC content for each range
  gc = rep(NA, length(ref))
  
  # Loop over each unique chromosome in the input ranges
  for (chr in unique(seqnames(ref))) {
    # Print a message to show which chromosome we are currently processing
    message("Getting GC content for chr ", chr, sep = "")
    
    # Get the indices of all ranges on the current chromosome
    chr.index = which(as.matrix(seqnames(ref)) == chr)
    
    # Create a new set of ranges that only includes the ranges on the current chromosome
    ref.chr = IRanges(start = start(ref)[chr.index], end = end(ref)[chr.index])
    
    # Check if the current chromosome is the X or Y chromosome and set the chrtemp variable accordingly
    if (chr == "X" | chr == "x" | chr == "chrX" | chr == "chrx") {
      chrtemp <- 'X'
    } else if (chr == "Y" | chr == "y" | chr == "chrY" | chr == "chry") {
      chrtemp <- 'Y'
    } else {
      chrtemp <- as.numeric(mapSeqlevels(as.character(chr), "NCBI")[1])
    }
    
    # If the current chromosome cannot be found in the NCBI Homo sapiens database, print a warning message
    if (length(chrtemp) == 0) message("Chromosome cannot be found in NCBI Homo sapiens database!")
    
    # Get the unmasked sequence for the current chromosome from the genome
    chrm <- unmasked(genome[[paste('chr',chrtemp,sep='')]])
    
    # Create a new set of sequences from the genome that only includes the ranges on the current chromosome
    seqs <- Views(chrm, ref.chr)
    
    # Calculate the base frequency for each range on the current chromosome
    af <- alphabetFrequency(seqs, baseOnly = TRUE, as.prob = TRUE)
    
    # Calculate the GC content for each range on the current chromosome
    gc[chr.index] <- round((af[, "G"] + af[, "C"]) * 100, 2)
  }
  
  # Return the GC content vector
  gc
}
#-- getgc Function end --#
# Calculate the GC content of the input ranges
gc <- getgc(ref, genome = genome)

# Calculate the mappability of the input ranges
mapp <- getmapp(ref, genome = genome)

# Create a vector of gene names for each range
gene = rep(NA, length(ref))

# Loop over each unique chromosome in the input ranges
for (chr in as.matrix(unique(seqnames(ref)))) {
  # Get the indices of all ranges on the current chromosome
  chr.index = which(seqnames(ref) == chr)
  
  # Generate a gene name for each range on the current chromosome by concatenating the chromosome name, 
  # the string "_gene_", and the index of the range divided by 30 and rounded up
  gene[chr.index] = paste(chr, '_gene_', ceiling(chr.index / 30), sep = '')
}

# Update the input ranges to include the GC content, mappability, and gene name for each range
values(ref) <- cbind(values(ref), DataFrame(gc, mapp, gene))
#------------------------------------------------------------------------------#

#Raw read Depth ####
# Calculate the coverage of the alignments
coverageObj <- getcoverage(bambedObj, mapqthres = 20)

# Get the coverage matrix from the coverage object
Y <- coverageObj$Y

# Write the coverage matrix to a CSV file
write.csv(Y, file = paste(projectname, '_coverage.csv', sep=''), quote = FALSE)

# Print the first four columns of the coverage matrix
head(Y[,1:4])

#------------------------------------------------------------------------------#

#Quality Control ####
# Perform quality control on the alignment data
qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
            length_thresh = c(20, 2000), mapp_thresh = 0.9,
            gc_thresh = c(20, 80))

# Get the filtered coverage matrix, sample names, and genomic ranges from the qc object
Y_qc <- qcObj$Y_qc
sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc
qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc

# Write the quality control matrix to a file
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)

#------------------------------------------------------------------------------#

#Estimating library size factor for each sample ####
# Create a matrix of non-zero values from the quality-controlled coverage data
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]

# Calculate the pseudo-sample median for each row in the non-zero matrix
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})

# Divide each column in the non-zero matrix by the corresponding value in the pseudo-sample vector
# and take the median of the resulting values
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

# Create a scatter plot of N against the total sum of reads in each sample
# and save the plot as a PDF file
pdf("Batch_Library_Size_Factor.pdf")
plot(N, apply(Y,2,sum), xlab='Estimated library size factor', ylab='Total sum of reads')
dev.off() 
#------------------------------------------------------------------------------#

#Running CODEX2 without specifying negative control samples/regions ####
# This can be run for One chromosome 
chr <- 'chr17'
#Or for Multiple chromosomes
#c('chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chr20','chr21','chr22','chrX','chrY')

chr.index <- which(seqnames(ref_qc)==chr)
normObj.null <- normalize_null(Y_qc = Y_qc[chr.index,],
                               gc_qc = gc_qc[chr.index],
                               K = 1:5, N = N)
Yhat.null <- normObj.null$Yhat
AIC.null <- normObj.null$AIC; BIC.null <- normObj.null$BIC
RSS.null <- normObj.null$RSS

#Choose the number of latent Poisson factors. BIC is used as the model selection metric by default. 
choiceofK(AIC.null, BIC.null, RSS.null, K = 1:5, filename = "codex2_null_choiceofK_1-5_for_18B-Truth_chr17.pdf")
#------------------------------------------------------------------------------#

#Running segmentation by CODEX2 ####
finalcall.CBS <- segmentCBS(Y_qc[chr.index,],  # recommended
                            Yhat.null, optK = which.max(BIC.null),
                            K = 1:5,
                            sampname_qc = colnames(Y_qc),
                            ref_qc = ranges(ref_qc)[chr.index],
                            chr = chr, lmax = 400, mode = "integer")
#------------------------------------------------------------------------------#

#Exporting Results ####
write.csv(finalcall.CBS, "./finalcall_CBS_K1-5_for_18B-Truth_chr17.csv", row.names = TRUE)

1 Could you explain what I am doing wrong?

2 I observed that there are two versions of the code equation for pseudo.sample, as highlighted below:

CODEX2/targeted_sequencing/CODEX2_targeted_demo.R

Line 54 in 6fb70da

pseudo.sample <- apply(Y.nonzero,1,function(x){prod(x)^(1/length(x))})

CODEX2/demo/CODEX2.Rmd

Line 108 in 6fb70da

pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})

Could you explain to me the difference between the equations?

3 Is it possible for gene names to be taken from the bed file directly, if a gene name column is provided?

Problem with getcoverage

Hello,
I have Problems with running CODEX2. I want to use it for targeted panel CNV detection.
Somehow it does not find the coverage.

for: head(Y[,1:5]) it fives results like this:

129514_S1 136591_S9 136633_S1 136691_S3 136744_S2
1:26126702-26126925 0 0 0 0 0
1:26127514-26127672 0 0 0 0 0
1:26128487-26128629 0 0 0 0 0
1:26131613-26131787 0 0 0 0 0
1:26135051-26135301 0 0 0 0 0
1:26135497-26135662 0 0 0 0 0

which leads to an error in the next step:

Excluded NA exons due to extreme coverage.
Excluded 0 exons due to extreme exonic length.
Excluded 4 exons due to extreme mappability.
Excluded 1 exons due to extreme GC content.
After taking union, excluded NA out of 1160 exons in QC.
Error: logical subscript contains NAs

As fare as i can see the ref looks ok:

GRanges object with 1160 ranges and 3 metadata columns:
seqnames ranges strand | gc mapp gene
|
[1] 1 26126702-26126925 * | 86.16 1 1_gene_1
[2] 1 26127514-26127672 * | 54.72 1 1_gene_1
[3] 1 26128487-26128629 * | 51.75 1 1_gene_1
[4] 1 26131613-26131787 * | 61.14 1 1_gene_1
[5] 1 26135051-26135301 * | 65.34 1 1_gene_1
... ... ... ... . ... ... ...
[1156] X 153608030-153608175 * | 63.7 1 X_gene_39
[1157] X 153608282-153608400 * | 49.58 1 X_gene_39
[1158] X 153608574-153608748 * | 60.57 1 X_gene_39
[1159] X 153609093-153609183 * | 53.85 1 X_gene_39
[1160] X 153609222-153609578 * | 59.1 1 X_gene_39

The code i was running looks like this and is basically the same as the in the Demo

bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile,
sampname = sampname, projectname = "Codex_test")

bamdir <- bambedObj$bamdir

ref <- bambedObj$ref

sampname <- bambedObj$sampname

projectname <- bambedObj$projectname

gc <- getgc(ref)

mapp <- getmapp(ref)

gene=rep(NA,length(ref))

print(gene)

for(chr in as.matrix(unique(seqnames(ref)))){
chr.index=which(seqnames(ref)==chr)
gene[chr.index]=paste(chr,'gene',ceiling(chr.index/30),sep='')
}

values(ref) <- cbind(values(ref), DataFrame(gc, mapp, gene))

coverageObj <- getcoverage(bambedObj, mapqthres = 20)

Y <- coverageObj$Y

write.csv(Y, file = paste(projectname, '_coverage.csv', sep=''), quote = FALSE)
head(Y[,1:5])

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, Inf),
length_thresh = c(20, Inf), mapp_thresh = 0.9,
gc_thresh = c(20, 80))

Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc

does anybody have an idea what the problem could be?
thanks a lot in advance

First Pass - Normalizing Step - Error in smooth.spline

Dear Developer,

First, I'd like to thank you for all your efforts.

I sequenced some specific regions of chromosome 2, 5 and 12 using a custom panel. So my goal is to call CNV from targeted sequencing data.

n. Regions: some 3600
n, Samples: 16
Reference: HG38
R version: 3.6.3

I easily go through your demo (that perfectly fulfill my needs) up to the normalization step
Now, I designed my panel in order to use sequenced regions in chromosomes 2 and 5 as negative control region.

Despite this, I'd like to go for a 2-pass run calling CNV without considering negative control regions first. In this way, I can figure out exactly which regions on chromosome 12 I've to use as cnv_index in the second run.

I faced a problem trying to normalize my data without considering negative control regions.
This command:

chr <- 'chr12'
chr.index <- which(seqnames(ref_qc)==chr)
normObj.null <- normalize_null(Y_qc = Y_qc[chr.index,],
                               gc_qc = gc_qc[chr.index],
                               K = 1:5, N = N)

returns:

Computing normalization with k = 1 latent factors ...
Error in smooth.spline(gc_qc, z) : 
  'tol' must be strictly positive and finite
In addition: Warning message:
In matrix(nrow = nrow(Y_qc), ncol = ncol(Y_qc), data = N, byrow = TRUE) :
  data length exceeds size of matrix

After a little bit of debugging, I found that, due to problems with the library or during amplification phase, I totally missed some of the regions I was expecting to cover (some 50 small regions). I thought that maybe 0s in the coverage matrix were causing the problem.
I removed them but I still get the error message.

Am I missing something?

Thanks in advance,
Davide

Error when installing

I get the following error when I try to install. It also kills the R session after. I have all of the packages listed: Rsamtools, GenomicRanges, BSgenome.Hsapiens.UCSC.hg19, S4Vectors, IRanges, Biostrings, GenomeInfoDb installed already and can load them without any problems

> devtools::install_github("yuchaojiang/CODEX2/package")
Downloading GitHub repo yuchaojiang/CODEX2@master
Skipping 7 packages not available: Rsamtools, GenomicRanges, BSgenome.Hsapiens.UCSC.hg19, S4Vectors, IRanges, Biostrings, GenomeInfoDb
✔  checking for file ‘/tmp/RtmpQrsnVv/remotes3f3115da431d/yuchaojiang-CODEX2-e47af62/package/DESCRIPTION’ ... OK
─  preparing ‘CODEX2’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  looking to see if a ‘data/datalist’ file should be added
─  building ‘CODEX2_1.3.0.tar.gz’ (10.3s)
   
* installing *source* package ‘CODEX2’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
/home/maherm/R/lib64/R/bin/INSTALL: line 34: 31405 Done                    echo 'tools:::.install_packages()'
     31406 Killed                  | R_DEFAULT_PACKAGES= LC_COLLATE=C "${R_HOME}/bin/R" $myArgs --slave --args ${args}
Error: Failed to install 'CODEX2' from GitHub:
  (converted from warning) installation of package ‘/tmp/RtmpQrsnVv/file3f3133e61203/CODEX2_1.3.0.tar.gz’ had non-zero exit status
> 
> 
> Terminated

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /home/maherm/R/lib64/R/lib/libRblas.so
LAPACK: /home/maherm/R/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.1

chr in bambedObj is NULL / missing chr in getcoverage() in codex2_targeted.R

Dear CODEX2 developers,

I recently run CODEX2 v1.3.0 successfully in "exome mode" without controls, but I was unable to run your approach for targeted sequencing given in the codex2_targeted.R sccript.
First, I suppose that the getcoverage() call in line 22 misses argument chr. However, setting chr to bambedObj$chr fails in my case, as bambedObj$chr is NULL.

The bambedObj seems to be created as expected, at least without warnings and yielding:

>names(bambedObj)
[1] "bamdir"      "sampname"    "ref"         "projectname"

>bambedObj$ref
GRanges object with 4083 ranges and 0 metadata columns:
         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
     [1]        1   [1634345, 1634438]      *
     [2]        1   [1634518, 1634708]      *
     [3]        1   [1634914, 1635063]      *
     [4]        1   [1635262, 1635379]      *
     [5]        1   [1635477, 1635585]      *
     ...      ...                  ...    ...
  [4079]       22 [40804797, 40804846]      *
  [4080]       22 [40804936, 40805022]      *
  [4081]       22 [40805253, 40805376]      *
  [4082]       22 [40805468, 40805529]      *
  [4083]       22 [40805685, 40805763]      *
  -------
  seqinfo: 21 sequences from an unspecified genome; no seqlengths

How can bambedObj$chr be initialized?

Many thanks in advance!

How to set parameters *_index in function normalize_codex2_nr and normalize_codex2_ns?

Hi, Is there some suggestions about how to set parameters norm_index or cnv_index in function normalize_codex2_nr and normalize_codex2_ns ? Thanks.

Error in genome[[chrtemp]] : no such sequence

Hi Yuchao,
When I tried to run the getgc function, I always get the error below. Can you please advise what's wrong? Thanks.

Getting GC content for chr 1
Getting GC content for chr 10
Getting GC content for chr 11
Getting GC content for chr 12
Getting GC content for chr 13
Getting GC content for chr 14
Getting GC content for chr 15
Getting GC content for chr 16
Getting GC content for chr 17
Getting GC content for chr 18
Getting GC content for chr 19
Getting GC content for chr 2
Getting GC content for chr 20
Getting GC content for chr 21
Getting GC content for chr 22
Getting GC content for chr 3
Getting GC content for chr 4
Getting GC content for chr 5
Getting GC content for chr 6
Getting GC content for chr 7
Getting GC content for chr 8
Getting GC content for chr 9
Getting GC content for chr M
Error in genome[[chrtemp]] : no such sequence
Calls: getgc -> unmasked -> [[ -> [[
Execution halted

Best,
Zeyan

Differences between CODEX and CODEX2?

Hi,

The CODEX page redirects me here, and could you please tell the major differences between version one and two?

If possible, I think it is better to make an introduction of the improvements in the README documents.

Thanks a lot.

Problem running qc

Hi,

I run qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000), length_thresh = c(20, 2000), mapp_thresh = 0.9, gc_thresh = c(20,80))

and get the following error

Error in apply(Y_qc, 1, median) : dim(X) must have a positive length

what is the issue?

Thanks

Need Help for Normalization w/ all chromosomes

Hello Dr Yuchao

Thanks to you for creating CODEX2, I am using this wonderful program well for work.
During running CODEX2, I had a difficulty in making a decision to make a code for normalization using all chromosomes.
So, I ask you if there's any problem with the code I used or if there is a better way.

My situation is this: I have 45 samples (without CNV details), and I want to identify which sample has germ-line variations on some specific genes in chr9. As I don’t have a specified negative control samples/regions, I want to use all chromosomes in all samples for better normalization. I checked the demo code “Running CODEX2 without specifying negative control samples/regions“, and have a question how to modify this code to use all chromosomes in all samples for normalization.

The one trial I made is:
normObj.null <- normalize_null(Y_qc = Y_qc, gc_qc = gc_qc, K = 1:4, N = N)

(cf. Given demo code is:
normObj.null <- normalize_null(Y_qc = Y_qc[chr.index,], gc_qc = gc_qc[chr.index], K = 1:5, N = N))

The whole code I used is as follows:

library(BiocManager)
library(CODEX2)

dirPath <- '/BiO/home/jekim/Data/HL_TJP2/Total-E'
bamFile <- list.files(dirPath, pattern = '*.bam$')
bamdir <- file.path(dirPath, bamFile)
bedFile <- file.path('/BiO/home/jekim/CNVkit', 'Exome_Xgen.bed')

sampname <- substr(bamFile,0,10)
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, 
                       sampname = sampname, projectname = "E4-all")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname

genome = BSgenome.Hsapiens.UCSC.hg19

getgc <- function (ref, genome = NULL) 
{
    if (is.null(genome)) {
        genome = BSgenome.Hsapiens.UCSC.hg19
    }
    gc = rep(NA, length(ref))
    for (chr in unique(seqnames(ref))) {
        message("Getting GC content for chr ", chr, sep = "")
        chr.index = which(as.matrix(seqnames(ref)) == chr)
        ref.chr = IRanges(start = start(ref)[chr.index], end = end(ref)[chr.index])
        if (chr == "X" | chr == "x" | chr == "chrX" | chr == 
            "chrx") {
            chrtemp <- "X"
        }
        else if (chr == "Y" | chr == "y" | chr == "chrY" | chr == 
            "chry") {
            chrtemp <- "Y"
        }
        else {
            chrtemp <- as.numeric(mapSeqlevels(as.character(chr), 
                "NCBI")[1])
        }
        if (length(chrtemp) == 0) 
            message("Chromosome cannot be found in NCBI Homo sapiens database!")
        chrm <- unmasked(genome[[paste("chr", chrtemp, sep = "")]])
        seqs <- Views(chrm, ref.chr)
        af <- alphabetFrequency(seqs, baseOnly = TRUE, as.prob = TRUE)
        gc[chr.index] <- round((af[, "G"] + af[, "C"]) * 100, 
            2)
     }
      gc
}

gc <- getgc(ref, genome = genome)
mapp <- getmapp(ref, genome = genome)

values(ref) <- cbind(values(ref), DataFrame(gc, mapp))
coverageObj <- getcoverage(bambedObj, mapqthres = 20)

Y <- coverageObj$Y
write.csv(Y, file = paste(projectname, '_coverage.csv', sep=''), quote = FALSE)

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
            length_thresh = c(20, 2000), mapp_thresh = 0.9,
            gc_thresh = c(20, 80))

Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)

Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

normObj.null <- normalize_null(Y_qc = Y_qc,
                               gc_qc = gc_qc,
                               K = 1:4, N = N)

Yhat.null <- normObj.null$Yhat
AIC.null <- normObj.null$AIC; BIC.null <- normObj.null$BIC
RSS.null <- normObj.null$RSS
choiceofK(AIC.null, BIC.null, RSS.null, K = 1:4 , filename = "E4-all_choiceofK.pdf")
write.table(Yhat.null, file = paste( projectname,'_Yhat.null_test.txt',
                                     sep=''), sep='\t', quote=F, row.names=F)

chro <- c('chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18',
     'chr19','chr20','chr21','chr22','chrX','chrY')

for (chr in chro){
    chr.index <- which(seqnames(ref_qc)==chr)        
    finalcall.CBS <- segmentCBS(Y_qc[chr.index,],
Yhat.null, optK = which.max(BIC.null), K = 1:4,
sampname_qc = sampname_qc,
ref_qc = ranges(ref_qc)[chr.index],
chr = chr, lmax = 400, mode = "integer")   
write.table(finalcall.CBS, file = paste( projectname, chr, '_no_filter_test.txt',
                                     sep=''), sep='\t', quote=F, row.names=F)
 filter1 <- finalcall.CBS$length_kb<=200
    filter2 <- finalcall.CBS$length_kb/(finalcall.CBS$ed_exon-finalcall.CBS$st_exon+1)<50
    finalcall.CBS.filter <- finalcall.CBS[filter1 & filter2, ]
    filter3 <- finalcall.CBS.filter$lratio>40
    filter4 <- (finalcall.CBS.filter$ed_exon-finalcall.CBS.filter$st_exon)>1
    finalcall.CBS.filter=finalcall.CBS.filter[filter3|filter4,]
    # save filtered results
    write.table(finalcall.CBS.filter, file = paste( projectname, chr, '_filter.txt',
                                     sep=''), sep='\t', quote=F, row.names=F)
    }

and the Y_qc, gc_qc look like this:

Is this code okay for normalization - though it doesn’t have chr indexing “Y_qc = Y_qc[chr.index,], gc_qc = gc_qc[chr.index]”?
Could you suggest me a better solution for this-using all chromosomes in all samples for normalization?

Questions about "Running CODEX2 without specifying negative control samples/regions"

Hi, I have 49 uncorrelated samples and run in batch with CODEX2 without specifying negative control samples/regions. But the results is quit terrible, the number of cnv were few and only detected chr1,chr2 and chr3, and even some samples could not find any cnv (One of the sample is NA12878). I was wondering where is wrong in my code:

library(CODEX2)
library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.

###################################################################################
# Pre-processing

bamFile <- list.files('/storage/wujh/Project/CNV_analysis/CODEX/bam', pattern = '*.bam$') 
bamdir <- file.path('/storage/wujh/Project/CNV_analysis/CODEX/bam', bamFile)
sampname <- substr(bamFile,1,8)
bedFile <- file.path('/home/sxuan/codex2', 'AgV6.bed')
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, 
                       sampname = sampname, projectname = 'CODEX2_test')
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname


###################################################################################
# Getting GC content and mappability

gc <- getgc(ref, genome = BSgenome.Hsapiens.UCSC.hg19)
mapp <- getmapp(ref, genome = BSgenome.Hsapiens.UCSC.hg19)
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))


####################################################################################
# Getting raw read depth

coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y
write.csv(Y, file = paste(projectname, '_coverage.csv', sep=''), quote = FALSE)


####################################################################################
# Quality control

print('Quality control')
qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, Inf),
            length_thresh = c(20, Inf), mapp_thresh = 0.9,
            gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)


###################################################################################
# Running CODEX2
print('Running CODEX2')

Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){prod(x)^(1/length(x))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

# Running CODEX2 without negative control samples/regions

print('Running CODEX2 without negative control samples/regions')
chr <- c(c(1:22),'X','Y')

chr.index <- which(seqnames(ref_qc)==chr)
norm_index <- c(1:length(sampname))

normObj.null <- normalize_null(Y_qc = Y_qc[chr.index,],
                           gc_qc = gc_qc[chr.index],
                           K = 1:5, N = N)
Yhat.null <- normObj.null$Yhat
AIC.null <- normObj.null$AIC; BIC.null <- normObj.null$BIC
RSS.null <- normObj.null$RSS

#Choose the number of latent Poisson factors. BIC is used as the model selection metric by default.
choiceofK(AIC.null, BIC.null, RSS.null, K = 1:5 , filename = "codex2_null_choiceofK.pdf")
par(mfrow = c(1, 3))
plot(1:5, RSS.null, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, AIC.null, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, BIC.null, type = "b", xlab = "Number of latent variables", pch=20)
par(mfrow = c(1,1))

##################################################################################
# Running segmentation by CODEX2
print('Running segmentation by CODEX2')
# recommended
finalcall.CBS <- segmentCBS(Y_qc[chr.index,],  
                            Yhat.null, optK = which.max(BIC.null),
                            K = 1:5,
                            sampname_qc = paste('sample',1:ncol(Y_qc),sep=''),
                            ref_qc = ranges(ref_qc)[chr.index],
                            chr = chr, lmax = 400, mode = "integer")
write.table(finalcall.CBS, file=paste('norm_null.final.cnv',sep=''), sep='\t', quote=F)

# not recommended
# finalcall.HMM <- segmentHMM(Y_qc[chr.index,],  
#                             Yhat.ns, optK = which.max(BIC.ns),
#                             K = 1:5,
#                             sampname_qc = paste('sample',1:ncol(Y_qc),sep=''),
#                             ref_qc = ranges(ref_qc)[chr.index],
#                             chr = chr, mode = "integer")


# Post-segmentation pruning and filtering are recommended based on CNV length (filter1), 
# length per exon (filter2), likelihood ratio (filter3), and number of exons (filter4)

filter1 <- finalcall.CBS$length_kb<=200
filter2 <- finalcall.CBS$length_kb/(finalcall.CBS$ed_exon-finalcall.CBS$st_exon+1)<50
finalcall.CBS.filter <- finalcall.CBS[filter1 & filter2, ]

filter3 <- finalcall.CBS.filter$lratio>40
filter4 <- (finalcall.CBS.filter$ed_exon-finalcall.CBS.filter$st_exon)>1
finalcall.CBS.filter=finalcall.CBS.filter[filter3|filter4,]

write.table(finalcall.CBS.filter, file='_norm_null.final.filtered.cnv', sep='\t', quote=F)

Any suggestions will be greatly appreciated !

some questions about CODEX2

Dear Dr.Jiang,
I am a student from China.I'm using CODEX2 to analyze copy number changes in human samples by target capture sequencing, but I am so confused by the bad results.I didn't make correct input files and there's no online tutorial of target capture sequencing.Therefore ,there are several questions I hope you could help me out :
1.how the bam files are made?Could you tell me if the way bam file of target capture sequencing data are made is the same as the WES?Are they only sorted?What are the requirements for the same batch of tested bam files e.g the difference of the read depth ?
2.how's the bed files cooked?I have the exact position of the capture region of target capture sequencing.
3.In addition,I don't know why the result shows that the duplicate and the deletion exist on the same chromosome of the same sample with high quality.
Thank you very much in advance.

Best regards,
Ziying Zhou

CODEX2, getbambed, wrong example

Hi,
in the example here:
https://github.com/yuchaojiang/CODEX2/blob/master/targeted_sequencing/codex2_targeted.R#L15-L20
there's an error when running this code:
library(CODEX2)

chr=1
# get bam directories, read in bed file, get sample names
bambedObj=getbambed(bamdir=bamdir,
                    bedFile=bedFile,
                    sampname=sampname,
                    projectname=projectname,chr)

Error in getbambed(bamdir = bamdir, bedFile = bedFile, sampname = sampname, :
unused argument (chr)

Should I use getbambed from CODEX?

EDIT: in the whole example file "codex2_targeted.R" the commands from CODEX and CODEX2 are mixed (This is my first time I use CODEX and/or CODEX2 so I'm a bit confused)

Cheers
Damian Loska

yuchaojiang / codex2 Goto Github PK

codex2's People

Contributors

Stargazers

Watchers

Forkers

codex2's Issues

The following object is masked from ‘package:BSgenome.Hsapiens.UCSC.hg19’: Hsapiens

Initialize Variables

Verify chromosome names match exactly

Call getgc()

Running CODEX2 with negative control samples

Recommend Projects

Recommend Topics

Recommend Org