Git Product home page Git Product logo

codex2's People

Contributors

yuchaojiang avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

codex2's Issues

CODEX2 for low coverage WGS

We have a low pass WGS data set and I found none of them can pass the sd threshold during normalization:
nonCNV.index=apply(z.codex,1,sd) < 0.15

summary(apply(z.codex,1,sd))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.2869  0.5146  0.5487  0.5481  0.5822  0.736

What might be the concerns to increase the sd's threshold? say at median=0.55 or other fixed threshold?

The low coverage issue also found in the qc step where 104221 out of 135087 exons were excluded. (102876 due to extreme low coverage), I seem have to reduce the minimum coverage threshold as well. Any suggestions for it?

the meaning of the final result

Dear Dr.J,
Thank you for your work in developing CODEX2. As mentioned in title, how to understand the title 'sample_name chr cnv st_bp ed_bp length_kb st_exon ed_exon raw_cov norm_cov copy_no lratio mBIC' in the final result.?
Best regards
Li'an Lin

failed in loading hg38

library(CODEX2)
library(BSgenome.Hsapiens.UCSC.hg38)

Attaching package: ‘BSgenome.Hsapiens.UCSC.hg38’

The following object is masked from ‘package:BSgenome.Hsapiens.UCSC.hg19’:

Hsapiens

The following object is masked from ‘package:BSgenome.Hsapiens.UCSC.hg19’: Hsapiens

gc <- getgc(ref, genome = BSgenome.Hsapiens.UCSC.hg38)
Error: object 'ref' not found

BSgenome problem

I was following commands in the manual. Everything was fine until this command mapp <- getmapp(ref, genome = genome). I got an error message: Error in getmapp(ref, genome = genome) :
no slot of name "provider_version" for this object of class "BSgenome".

Thanks for any advice!

Some bugs and suggestion

Line 11 and Line 25 in qc.R:
You can replace end(ref) - start(ref)+1 by width(ref), for ref in of type IRranges
Line 67-68 in normalize.R:
Offset might be L[s,] , not log(L[s,]), because L[,] is already the logarithm of (Nmat * fhat * betahatmat)
as in line 54
Line 92 in codex_targeted.R:
The qcObj here is created across all chromosomes, but when call the qc function in qc.R
it seems that your qc function is for one chromosome, as in line 24,"rep(chr, length(ref))"
so when applied to targeted sequencing, we should alter the qc.R in the package
from "rep(chr, length(ref))" to 'chr' in line 24 and call it with 'chr.all' in condex_targeted.R

problem and question with CODEX2

Good morning,
I'm Pasquale.
I have some major issues with using CODEX.
We conducted a sequencing of only 1 ZC3H10 gene on 96 samples (controls and patients). from the trace on IGV there seems to be a CNV, so I started using the CODEX tools.
I wonder ask yoi:

  • does it work even with such a small portion (about 1kb)? why is it worth testing samples?
  • because if I use chr <- 12 it calculates the contents of GC and mapp, but then the rest doesn't work for me, while if I use chr <- NC_000012.12 the rest works for me but not GC and mapp??
  • also this step gives me problems

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
length_thresh = c(20, 2000), mapp_thresh = 0.9,
gc_thresh = c(20, 80))
Excluded 8 exons due to extreme coverage.
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'end': argument "ref" is missing, with no default

I would be grateful if you give me some suggestions.

getgc() function

chrtemp variable is named as chrX which causes paste('chr',chrtemp,sep='') to name it as "chrchrX" and fail.

getgc =function (ref, genome = NULL) {
  if(is.null(genome)){genome=BSgenome.Hsapiens.UCSC.hg19}
  gc=rep(NA,length(ref))
  for(chr in unique(seqnames(ref))){
    chr="chrX"
    message("Getting GC content for chr ", chr, sep = "")
    chr.index=which(as.matrix(seqnames(ref))==chr)
    ref.chr=IRanges(start= start(ref)[chr.index] , end = end(ref)[chr.index])
    if (chr == "X" | chr == "x" | chr == "chrX" | chr == "chrx") {
      chrtemp <- 'chrX'
    } else if (chr == "Y" | chr == "y" | chr == "chrY" | chr == 
               "chry") {
      chrtemp <- 'chrY'
    } else {
      chrtemp <- as.numeric(mapSeqlevels(as.character(chr), "NCBI")[1])
    }
    if (length(chrtemp) == 0) message("Chromosome cannot be found in NCBI Homo sapiens database!")
    chrm <- unmasked(genome[[paste('chr',chrtemp,sep='')]])
    seqs <- Views(chrm, ref.chr)
    af <- alphabetFrequency(seqs, baseOnly = TRUE, as.prob = TRUE)
    gc[chr.index] <- round((af[, "G"] + af[, "C"]) * 100, 2)
  }
  gc
}

I fix this on my data changing chrtemp <- 'chrX' to chrtemp <- 'X'

Let my know if this makes sense to you or it is just me!

Issues with qc function

Any ideas why I am getting this error?

Excluded 5316 exons due to extreme coverage.
Excluded 12333 exons due to extreme exonic length.
Excluded 12988 exons due to extreme mappability.
Excluded NA exons due to extreme GC content.
After taking union, excluded NA out of 217488 exons in QC.
Error: logical subscript contains NAs

segmentCBS error

Hi,

I am using the software on WES samples using h38.
The hole software runs well until this part:

> finalcall.CBS <- segmentCBS(Y_qc[chr.index,],
+                             Yhat, optK = which.max(BIC),
+                             K = 1:5,
+                             sampname_qc = sampname_qc,
+                             ref_qc = ranges(ref_qc)[chr.index],
+                             chr = chr, lmax = 400, mode = "integer")
Segmenting sample 1: WES_val18_1..
Error in rep(1:num, rep(lmax, num)) : invalid 'times' argument

First I tried 60 samples, in a second test I only used 7 samples.
Any ideas why I am getting this error?

Thank you for your help and kind regards,
Nicolas

missing or infinite values in inputs are not allowed

Hi,
I follow this tutorial: http://htmlpreview.github.io/?https://github.com/yuchaojiang/CODEX2/blob/master/demo/CODEX2.html

I have ~180 bam files that came from the same lab, but in different time periods (batches of 9-15 files).
When I analyze them together I get error:

Error in smooth.spline(gc_qc, z) : 
  missing or infinite values in inputs are not allowed

(maybe it comes from:

> pseudo.sample
  1:861321-861393   1:865534-865716   1:866418-866469   1:871151-871276   1:874419-874509   1:874654-874840   1:876523-876686   1:877938-878438 
              Inf               Inf               Inf               Inf               Inf               Inf               Inf               Inf 
  1:878632-878757   1:879077-879188   1:879287-879533   1:880073-880180   1:880436-880526   1:880897-881033   1:881552-881666   1:881781-881925 
              Inf  

)

are there some methods, to firstly run "batches" separately, and then merge the results and follow the tutorial?

Thanks for any advice,
Damian

CODEX2 targeted Error: subscript out of bounds

I am trying to run CODEX2 on couple of targeted resequencing samples. the bed file is broken down into intervals of 250 basepairs for the regions of interest. I got the targeted demo code from git and initiated the run. Below is my log and error that I ran into. Any help understanding this error and fixing it would be great!

Thanks!

`Getting GC content for chr 1
Getting GC content for chr 10
Getting GC content for chr 11
Getting GC content for chr 12
Getting GC content for chr 13
Getting GC content for chr 14
Getting GC content for chr 15
Getting GC content for chr 16
Getting GC content for chr 17
Getting GC content for chr 18
Getting GC content for chr 19
Getting GC content for chr 2
Getting GC content for chr 20
Getting GC content for chr 21
Getting GC content for chr 22
Getting GC content for chr 3
Getting GC content for chr 4
Getting GC content for chr 5
Getting GC content for chr 6
Getting GC content for chr 7
Getting GC content for chr 8
Getting GC content for chr 9
Getting GC content for chr X
Getting GC content for chr Y
Getting mappability for chr1
Getting mappability for chr10
Getting mappability for chr11
Getting mappability for chr12
Getting mappability for chr13
Getting mappability for chr14
Getting mappability for chr15
Getting mappability for chr16
Getting mappability for chr17
Getting mappability for chr18
Getting mappability for chr19
Getting mappability for chr2
Getting mappability for chr20
Getting mappability for chr21
Getting mappability for chr22
Getting mappability for chr3
Getting mappability for chr4
Getting mappability for chr5
Getting mappability for chr6
Getting mappability for chr7
Getting mappability for chr8
Getting mappability for chr9
Getting mappability for chrX
Getting mappability for chrY

Getting coverage for sample Sample1: read length 97.
Getting coverage for sample Sample2: read length 97.
Error in Y[, 1:5] : subscript out of bounds
Calls: head
Execution halted
`

Error in genome[[chrtemp]] : no such sequence, getgc -> unmasked -> [[ -> [[

Hi,
I tried CODEX2 for some sample sets. The first two sets was just fine.
For the third set, I got the following error.

....
normalize

Getting GC content for chr chr1
Getting GC content for chr chr10
Getting GC content for chr chr11
Getting GC content for chr chr12
Getting GC content for chr chr13
Getting GC content for chr chr14
Getting GC content for chr chr15
Getting GC content for chr chr16
Getting GC content for chr chr17
Getting GC content for chr chr18
Getting GC content for chr chr19
Getting GC content for chr chr2
Getting GC content for chr chr20
Getting GC content for chr chr21
Getting GC content for chr chr22
Getting GC content for chr chr3
Getting GC content for chr chr4
Getting GC content for chr chr5
Getting GC content for chr chr6
Getting GC content for chr chr7
Getting GC content for chr chr8
Getting GC content for chr chr9
Getting GC content for chr chrM
Error in genome[[chrtemp]] : no such sequence
Calls: getgc -> unmasked -> [[ -> [[
In addition: Warning message:
In getgc(ref, genome = genome) : NAs introduced by coercion
Execution halted

Could you please suggest what could be the possible causes of this?
Many thanks.
--duangdao.

An error is occurring-obtaining GC content and mappability stage

Hi,
I am using CODEX2 (version: 1.3.0). I have used BSgenome.Hsapiens.UCSC.hg38 as the reference genome.
After running the command for obtaining GC content, the following error has been occurring:

Getting GC content for chr chr1
Getting GC content for chr chr10
Getting GC content for chr chr11
Getting GC content for chr chr12
Getting GC content for chr chr13
Getting GC content for chr chr14
Getting GC content for chr chr15
Getting GC content for chr chr16
Getting GC content for chr chr17
Getting GC content for chr chr18
Getting GC content for chr chr19
Getting GC content for chr chr2
Getting GC content for chr chr20
Getting GC content for chr chr21
Getting GC content for chr chr22
Getting GC content for chr chr3
Getting GC content for chr chr4
Getting GC content for chr chr5
Getting GC content for chr chr6
Getting GC content for chr chr7
Getting GC content for chr chr8
Getting GC content for chr chr9
Getting GC content for chr chrM
Error in genome[[paste("chr", chrtemp, sep = "")]] : no such sequence
In addition: Warning message:
In getgc(ref, genome = genome) : NAs introduced by coercion

In addition, when I ran the command to obtain mappability, the following error has been occurring:

Error in getmapp(ref, genome = genome) :
no slot of name "metadata" for this object of class "BSgenome"

With reference to the above errors, can you please suggest some ways to fix the issue?

getgc() failing on sex chromosomes

I have an issue with getgc() when I run the CODEX2 targeted demo. It runs fine with my .bed file, but fails when it gets to chrX. I have triple-checked the .bed file and verified that it runs for all lines up to the first chrX entry. The results from the pipeline (using a .bed file with only autosomal regions) look great. How can I get it to run with chrX and chrY?

Commands and output:

Initialize Variables

dirPath=getwd()
bamFile <- list.files(dirPath, pattern = '.q1plus..bam$')
bamdir <- file.path(dirPath, bamFile)
sampname <- substr(bamFile,20,27)
norm_index <- c(1,4:7,10:50)

bedFile <- list.files(dirPath, pattern = 'exons_25bp_pad.bed')

bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, sampname = sampname, projectname = "q1plus")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname

ref
GRanges object with 55562 ranges and 0 metadata columns:
seqnames ranges strand

[1] chr1 948928-948981 *
[2] chr1 949338-949883 *
[3] chr1 955527-955778 *
[4] chr1 957555-957867 *
[5] chr1 970631-970729 *
... ... ... ...
[55558] chrY 14959138-14959277 *
[55559] chrY 14968239-14968446 *
[55560] chrY 14968532-14968795 *
[55561] chrY 14969465-14969611 *
[55562] chrY 14971178-14971366 *


seqinfo: 24 sequences from an unspecified genome; no seqlengths

Verify chromosome names match exactly

genome = BSgenome.Hsapiens.UCSC.hg19
head(seqnames(genome),25)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY" "chrM"

unique(read.table(file="exons_25bp_pad_MSOC.bed")[,1])
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"

unique(read.table(file="exons_25bp_pad_MSOC.bed")[,1]) == head(seqnames(genome),24)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Call getgc()

gc <- getgc(ref, genome = BSgenome.Hsapiens.UCSC.hg19)
Getting GC content for chr chr1
Getting GC content for chr chr2
Getting GC content for chr chr3
Getting GC content for chr chr4
Getting GC content for chr chr5
Getting GC content for chr chr6
Getting GC content for chr chr7
Getting GC content for chr chr8
Getting GC content for chr chr9
Getting GC content for chr chr10
Getting GC content for chr chr11
Getting GC content for chr chr12
Getting GC content for chr chr13
Getting GC content for chr chr14
Getting GC content for chr chr15
Getting GC content for chr chr16
Getting GC content for chr chr17
Getting GC content for chr chr18
Getting GC content for chr chr19
Getting GC content for chr chr20
Getting GC content for chr chr21
Getting GC content for chr chr22
Getting GC content for chr chrX
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'unmasked': no such sequence

hg38

Hi,

is Codex2 able to handle hg38 genome ?
'cos during installation progress it means '(BSgenome.Hsapiens.UCSC.hg19)'

thx

Need help to understand the choice of optK

Hello Dr Yuchao

Thanks to you for your kind and clear reply. I hope you are enjoying this lovely Christmas time :-)
May I ask you one more question?
In general, the model which has the lowest BIC and RSS evaluated as the most efficient one.
However, CODEX2 determines the latent factor K which has the highest value of BIC.

Although I guess it is because of CODEX2 may have some different working principles from other programing models,
I couldn't clear what the difference is.

So could you give me some clues to clarify why in CODEX2 the K that has the highest value of BIC (instead of the lowest) is recommended as optK?

Demo code is this:

finalcall.CBS <- segmentCBS(Y_qc[chr.index,],  # recommended
                            Yhat.ns, optK = which.max(BIC.ns),
                            K = 1:5,
                            sampname_qc = colnames(Y_qc),
                            ref_qc = ranges(ref_qc)[chr.index],
                            chr = chr, lmax = 400, mode = "integer")

Difference between getmapp.R and mapp.R

I noticed there two different versions of function for calculating mappability and they take different inputs as arguments. What is the difference between those two? If I am using mouse genome, which one should I use?

Thanks

how to run CODEX2 with a pair of Tomor-Normal bam?

now,I have sevaral pairs of sorted.markup.bam to call CNV
In CODEX2.Rmd , I found following command to fill in my Tumor bam:
bamFile <- list.files(dirPath, pattern = '
.bam$')

But in the part of ###Running CODEX2 with negative control samples###

{

Running CODEX2 with negative control samples

For the case-control scenario, the normal sample index is known (samples without spike-in signals).

# Below are pre-computed demo dataset, stored as part of the CODEX2 R-package.
normObj <- normalize_codex2_ns(Y_qc = Y_qc[chr.index,],
                               gc_qc = gc_qc[chr.index], 
                               K = 1:5, norm_index = norm_index_demo,
                               N = N)
Yhat.ns <- normObj$Yhat; fGC.hat.ns <- normObj$fGC.hat;
beta.hat.ns <- normObj$beta.hat; g.hat.ns <- normObj$g.hat; h.hat.ns <- normObj$h.hat
AIC.ns <- normObj$AIC; BIC.ns <- normObj$BIC; RSS.ns <- normObj$RSS

Choose the number of latent Poisson factors. BIC is used as the model selection metric by default.

choiceofK(AIC.ns, BIC.ns, RSS.ns, K = 1:5 , filename = "codex2_ns_choiceofK.pdf")
par(mfrow = c(1, 3))
plot(1:5, RSS.ns, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, AIC.ns, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, BIC.ns, type = "b", xlab = "Number of latent variables", pch=20)
par(mfrow = c(1,1))

}

it seem that no normal bam can be filled in.
How can I run CODEX2 with a pair of Tomor-Normal bam?
Is it ture that Normal Bam can be the input of CODEX2?
Sincerely hoping to receive your reply.

how to set sampname_qc correctly

I followed the pipeline provided in http://htmlpreview.github.io/?https://github.com/yuchaojiang/CODEX2/blob/master/demo/CODEX2.html. The output result like
image. Obviously, it miss the true sample name,how do i fix it or how to set sampname_qc parameter correctly?

finalcall.CBS <- segmentCBS(Y_qc[chr.index,], # recommended Yhat.ns, optK = which.max(BIC.ns), K = 1:5, sampname_qc = paste('sample',1:ncol(Y_qc),sep=''), ref_qc = ranges(ref_qc)[chr.index], chr = chr, lmax = 400, mode = "integer")

error with BSgenome.Hsapiens.UCSC.hg38

Hi,

chrX and chrY chromosomes do not seem to be recognized:

gc <- getgc(ref, genome = genome)
Getting GC content for chr chr1
Getting GC content for chr chr10
Getting GC content for chr chr11
Getting GC content for chr chr12
Getting GC content for chr chr13
Getting GC content for chr chr14
Getting GC content for chr chr15
Getting GC content for chr chr16
Getting GC content for chr chr17
Getting GC content for chr chr18
Getting GC content for chr chr19
Getting GC content for chr chr2
Getting GC content for chr chr20
Getting GC content for chr chr21
Getting GC content for chr chr22
Getting GC content for chr chr3
Getting GC content for chr chr4
Getting GC content for chr chr5
Getting GC content for chr chr6
Getting GC content for chr chr7
Getting GC content for chr chr8
Getting GC content for chr chr9
Getting GC content for chr chrX
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'unmasked': no such sequence

mapp <- getmapp(ref, genome = genome)
Getting mappability for chr1
Getting mappability for chr10
Getting mappability for chr11
Getting mappability for chr12
Getting mappability for chr13
Getting mappability for chr14
Getting mappability for chr15
Getting mappability for chr16
Getting mappability for chr17
Getting mappability for chr18
Getting mappability for chr19
Getting mappability for chr2
Getting mappability for chr20
Getting mappability for chr21
Getting mappability for chr22
Getting mappability for chr3
Getting mappability for chr4
Getting mappability for chr5
Getting mappability for chr6
Getting mappability for chr7
Getting mappability for chr8
Getting mappability for chr9
Getting mappability for chrX
Getting mappability for chrY
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))
Error in FUN(X[[i]], ...) : object of type 'closure' is not subsettable

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
length_thresh = c(20, 2000), mapp_thresh = 0.9,
gc_thresh = c(20, 80))
Excluded 260 exons due to extreme coverage.
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'end': argument "ref" is missing, with no default
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
Error in getListElement(x, i, ...) :
IRanges objects don't support [[, as.list(), lapply(), or unlist() at
the moment

seqnames(bambedObj$ref)
factor-Rle of length 32821 with 24 runs
Lengths: 2605 1405 1462 1529 1069 ... 1768 1611 1386 1709 202
Values : chr1 chr10 chr11 chr12 chr13 ... chr7 chr8 chr9 chrX chrY
Levels(24): chr1 chr10 chr11 chr12 chr13 chr14 ... chr7 chr8 chr9 chrX chrY

codex2_targeted.R, getcoverage error

Hi,
This function return an error:

coverageObj=getcoverage(bambedObj,mapqthres=20)

> coverageObj=getcoverage(bambedObj,mapqthres=20)
Error in message("Getting coverage for chr ", chr, sep = "") :
argument "chr" is missing, with no default

?getcoverage
getcoverage {CODEX2}
(...)
getcoverage(bambedObj, mapqthres, chr)

Cheers
Damian Loska

Need Help

Hello Dr Yuchao

I am having a few issues while running CODEX2. I have not used CODEX before and am unfamiliar with the Poisson model. I am sorry if i am asking some apparent questions.

My situation is this: I have 2 Truth Samples (samples whose CNV details I know are correct). I have a test batch of 35 samples.
I have run CODEX2 in these manners:
Run 1:
Sample BAMs taken - 35 from test batch | K = 1:5 | chr: 17
Output:

  sample_name chr cnv st_bp ed_bp length_kb st_exon ed_exon raw_cov norm_cov copy_no lratio mBIC
cnv1 Sample 1 chr17 dup 7687395 7689351 1.957 1307 1311 474 341 3 21.683 4.216
cnv2 Sample 1 chr17 del 19348563 19353709 5.147 799 800 49 104 1 18.038 4.88
cnv3 Sample 3 chr17 dup 3094629 3121197 26.569 1150 1156 2117 1548 3 84.334 66.263
cnv4 Sample 3 chr17 dup 28673501 28681954 8.454 874 875 206 129 3 19.022 68.109
cnv5 Sample 4 chr17 dup 73626838 73627110 0.273 532 533 604 446 3 21.891 4.485
cnv6 Sample 4 chr17 dup 19348563 19353709 5.147 799 800 389 279 3 18.219 5.588
cnv7 Sample 6 chr17 del 41209048 41215988 6.941 210 212 252 443 1 46.84 29.398

Run 2:
Sample BAMs taken - 35 from test batch | K = 1:10 | chr: 17
Output:

  sample_name chr cnv st_bp ed_bp length_kb st_exon ed_exon raw_cov norm_cov copy_no lratio mBIC
cnv1 Sample 1 chr17 dup 7687395 7689351 1.957 1307 1311 474 341 3 21.683 4.216
cnv2 Sample 1 chr17 del 19348563 19353709 5.147 799 800 49 104 1 18.038 4.88
cnv3 Sample 3 chr17 dup 3094629 3121197 26.569 1150 1156 2117 1548 3 84.334 66.263
cnv4 Sample 3 chr17 dup 28673501 28681954 8.454 874 875 206 129 3 19.022 68.109
cnv5 Sample 4 chr17 dup 73626838 73627110 0.273 532 533 604 446 3 21.891 4.485
cnv6 Sample 4 chr17 dup 19348563 19353709 5.147 799 800 389 279 3 18.219 5.588
cnv7 Sample 6 chr17 del 41209048 41215988 6.941 210 212 252 443 1 46.84 29.398

Run 3:
Sample BAMs taken - 35 from test batch | K = 1:5 | chr: 13 and 17
Output:

  sample_name chr cnv st_bp ed_bp length_kb st_exon ed_exon raw_cov norm_cov copy_no lratio mBIC
cnv1 Sample 1 chr13 dup 7687395 7689351 1.957 903 905 317 218 3 19.527 3.53
cnv2 Sample 1 chr17 dup 78403552 78410018 6.467 581 582 167 97 3 19.21 6.351
cnv3 Sample 4 chr13 dup 67647500 68331876 684.377 217 219 715 538 3 20.896 4.269
cnv4 Sample 4 chr17 del 10278985 10284601 5.617 924 925 122 208 1 19.442 8.308
cnv5 Sample 4 chr13 dup 75485511 75489741 4.231 228 229 439 319 3 18.492 11.985

Run 4:
Sample BAMs taken - 18 from test batch and 2 Truth Sample (named TS1 and TS2) | K = 1:20 | All Chr
Output:

  sample_name chr cnv st_bp ed_bp length_kb st_exon ed_exon raw_cov norm_cov copy_no lratio mBIC
cnv1 TS1 chr1 del 14862604 153993261 139130.658 791 836 6452 9534 1 295.137 279.455
cnv2 TS2 chr1 dup 14862604 153230197 138367.594 791 833 13222 10416 3 152.839 136.733
cnv3 Sample 2 chr1 del 14862604 154221443 139358.84 791 838 5468 10091 1 1255.645 1247.018
cnv4 Sample 4 chr1 dup 45609814 50117179 4507.366 558 559 269 182 3 18.066 1.993
cnv5 Sample 5 chr1 dup 151849769 151900169 50.401 353 354 820 601 3 31.968 15.849

Run 5:
Sample BAMs taken - 18 from test batch and 2 Truth Sample (named TS1 and TS2) | K = 1:5 | chr 17
Output:

  sample_name chr cnv st_bp ed_bp length_kb st_exon ed_exon raw_cov norm_cov copy_no lratio mBIC
cnv1 Sample 4 chr17 dup 5036717 5041065 4.349 127 131 980 755 3 19.839 2.346
cnv2 Sample 4 chr17 dup 8480533 8508320 27.788 305 307 301 205 3 19.54 4.606
cnv3 Sample 4 chr17 dup 36874077 36875856 1.78 590 591 751 574 3 17.492 5
cnv4 Sample 6 chr17 del 41209048 41215988 6.941 802 804 252 440 1 45.34 27.665

I am not able to get concurrent results from the above outputs.

  • I am getting the same output in Run 1 and Run 2, even if there is a change in K.
  • In Run 3, when I introduce 2 chromosomes, Chr 13 and 17, the output doesn't contain some of the observed events in Run 1 and 2.
  • In Run 4, Output states TS1 has del and TS2 has dup in Chr 1, which does not match the True CNV output.
  • In Run 5, when I only mention Chr 17, the output doesn't consider the BRCA1 deletion present in TS1, as per the True CNV output.

The code is as follows:

#Preprocessing ####
library(CODEX2)

# Set the path to the directory containing the BAM files 
dirPath = file.path('','home','bioinfo','Nilesh','CODEX2','Attempt_18B_Truth')

# Get the names of all BAM files in the directory
bamFile <- list.files(dirPath, pattern = '*.bam$')

# Create a new path by combining the directory path and the BAM file names
bamdir <- file.path(dirPath, bamFile)

# Extract the sample name from the BAM file name
sampname <- substr(bamFile,1,4)

# Set the path to the BED file
bedFile <- file.path(dirPath, "test_codex.bed")

# Create an object containing the BAM, BED, sample name, and project name
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, 
                       sampname = sampname, projectname = "CODEX2_18B-Truth")

# Extract the BAM directory, sample name, reference, and project name from the object
bamdir <- bambedObj$bamdir
sampname <- bambedObj$sampname
ref <- bambedObj$ref
projectname <- bambedObj$projectname
#------------------------------------------------------------------------------#

#GC Content and Mappability ####
genome = BSgenome.Hsapiens.UCSC.hg19 #hg19
#-- Defining Funcion getgc --#
# Define a function to calculate the GC content of a set of genomic ranges
getgc = function (ref, genome = NULL) {
  # If the genome argument is NULL, set it to the hg19 genome
  if(is.null(genome)){genome = BSgenome.Hsapiens.UCSC.hg19}
  
  # Create an empty vector to store the GC content for each range
  gc = rep(NA, length(ref))
  
  # Loop over each unique chromosome in the input ranges
  for (chr in unique(seqnames(ref))) {
    # Print a message to show which chromosome we are currently processing
    message("Getting GC content for chr ", chr, sep = "")
    
    # Get the indices of all ranges on the current chromosome
    chr.index = which(as.matrix(seqnames(ref)) == chr)
    
    # Create a new set of ranges that only includes the ranges on the current chromosome
    ref.chr = IRanges(start = start(ref)[chr.index], end = end(ref)[chr.index])
    
    # Check if the current chromosome is the X or Y chromosome and set the chrtemp variable accordingly
    if (chr == "X" | chr == "x" | chr == "chrX" | chr == "chrx") {
      chrtemp <- 'X'
    } else if (chr == "Y" | chr == "y" | chr == "chrY" | chr == "chry") {
      chrtemp <- 'Y'
    } else {
      chrtemp <- as.numeric(mapSeqlevels(as.character(chr), "NCBI")[1])
    }
    
    # If the current chromosome cannot be found in the NCBI Homo sapiens database, print a warning message
    if (length(chrtemp) == 0) message("Chromosome cannot be found in NCBI Homo sapiens database!")
    
    # Get the unmasked sequence for the current chromosome from the genome
    chrm <- unmasked(genome[[paste('chr',chrtemp,sep='')]])
    
    # Create a new set of sequences from the genome that only includes the ranges on the current chromosome
    seqs <- Views(chrm, ref.chr)
    
    # Calculate the base frequency for each range on the current chromosome
    af <- alphabetFrequency(seqs, baseOnly = TRUE, as.prob = TRUE)
    
    # Calculate the GC content for each range on the current chromosome
    gc[chr.index] <- round((af[, "G"] + af[, "C"]) * 100, 2)
  }
  
  # Return the GC content vector
  gc
}
#-- getgc Function end --#
# Calculate the GC content of the input ranges
gc <- getgc(ref, genome = genome)

# Calculate the mappability of the input ranges
mapp <- getmapp(ref, genome = genome)

# Create a vector of gene names for each range
gene = rep(NA, length(ref))

# Loop over each unique chromosome in the input ranges
for (chr in as.matrix(unique(seqnames(ref)))) {
  # Get the indices of all ranges on the current chromosome
  chr.index = which(seqnames(ref) == chr)
  
  # Generate a gene name for each range on the current chromosome by concatenating the chromosome name, 
  # the string "_gene_", and the index of the range divided by 30 and rounded up
  gene[chr.index] = paste(chr, '_gene_', ceiling(chr.index / 30), sep = '')
}

# Update the input ranges to include the GC content, mappability, and gene name for each range
values(ref) <- cbind(values(ref), DataFrame(gc, mapp, gene))
#------------------------------------------------------------------------------#

#Raw read Depth ####
# Calculate the coverage of the alignments
coverageObj <- getcoverage(bambedObj, mapqthres = 20)

# Get the coverage matrix from the coverage object
Y <- coverageObj$Y

# Write the coverage matrix to a CSV file
write.csv(Y, file = paste(projectname, '_coverage.csv', sep=''), quote = FALSE)

# Print the first four columns of the coverage matrix
head(Y[,1:4])

#------------------------------------------------------------------------------#

#Quality Control ####
# Perform quality control on the alignment data
qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
            length_thresh = c(20, 2000), mapp_thresh = 0.9,
            gc_thresh = c(20, 80))

# Get the filtered coverage matrix, sample names, and genomic ranges from the qc object
Y_qc <- qcObj$Y_qc
sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc
qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc

# Write the quality control matrix to a file
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)

#------------------------------------------------------------------------------#

#Estimating library size factor for each sample ####
# Create a matrix of non-zero values from the quality-controlled coverage data
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]

# Calculate the pseudo-sample median for each row in the non-zero matrix
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})

# Divide each column in the non-zero matrix by the corresponding value in the pseudo-sample vector
# and take the median of the resulting values
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

# Create a scatter plot of N against the total sum of reads in each sample
# and save the plot as a PDF file
pdf("Batch_Library_Size_Factor.pdf")
plot(N, apply(Y,2,sum), xlab='Estimated library size factor', ylab='Total sum of reads')
dev.off() 
#------------------------------------------------------------------------------#

#Running CODEX2 without specifying negative control samples/regions ####
# This can be run for One chromosome 
chr <- 'chr17'
#Or for Multiple chromosomes
#c('chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chr20','chr21','chr22','chrX','chrY')

chr.index <- which(seqnames(ref_qc)==chr)
normObj.null <- normalize_null(Y_qc = Y_qc[chr.index,],
                               gc_qc = gc_qc[chr.index],
                               K = 1:5, N = N)
Yhat.null <- normObj.null$Yhat
AIC.null <- normObj.null$AIC; BIC.null <- normObj.null$BIC
RSS.null <- normObj.null$RSS

#Choose the number of latent Poisson factors. BIC is used as the model selection metric by default. 
choiceofK(AIC.null, BIC.null, RSS.null, K = 1:5, filename = "codex2_null_choiceofK_1-5_for_18B-Truth_chr17.pdf")
#------------------------------------------------------------------------------#

#Running segmentation by CODEX2 ####
finalcall.CBS <- segmentCBS(Y_qc[chr.index,],  # recommended
                            Yhat.null, optK = which.max(BIC.null),
                            K = 1:5,
                            sampname_qc = colnames(Y_qc),
                            ref_qc = ranges(ref_qc)[chr.index],
                            chr = chr, lmax = 400, mode = "integer")
#------------------------------------------------------------------------------#

#Exporting Results ####
write.csv(finalcall.CBS, "./finalcall_CBS_K1-5_for_18B-Truth_chr17.csv", row.names = TRUE)

1 Could you explain what I am doing wrong?

2 I observed that there are two versions of the code equation for pseudo.sample, as highlighted below:

pseudo.sample <- apply(Y.nonzero,1,function(x){prod(x)^(1/length(x))})

pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})

Could you explain to me the difference between the equations?

3 Is it possible for gene names to be taken from the bed file directly, if a gene name column is provided?

Problem with getcoverage

Hello,
I have Problems with running CODEX2. I want to use it for targeted panel CNV detection.
Somehow it does not find the coverage.

for: head(Y[,1:5]) it fives results like this:

129514_S1 136591_S9 136633_S1 136691_S3 136744_S2
1:26126702-26126925 0 0 0 0 0
1:26127514-26127672 0 0 0 0 0
1:26128487-26128629 0 0 0 0 0
1:26131613-26131787 0 0 0 0 0
1:26135051-26135301 0 0 0 0 0
1:26135497-26135662 0 0 0 0 0

which leads to an error in the next step:

Excluded NA exons due to extreme coverage.
Excluded 0 exons due to extreme exonic length.
Excluded 4 exons due to extreme mappability.
Excluded 1 exons due to extreme GC content.
After taking union, excluded NA out of 1160 exons in QC.
Error: logical subscript contains NAs

As fare as i can see the ref looks ok:

GRanges object with 1160 ranges and 3 metadata columns:
seqnames ranges strand | gc mapp gene
|
[1] 1 26126702-26126925 * | 86.16 1 1_gene_1
[2] 1 26127514-26127672 * | 54.72 1 1_gene_1
[3] 1 26128487-26128629 * | 51.75 1 1_gene_1
[4] 1 26131613-26131787 * | 61.14 1 1_gene_1
[5] 1 26135051-26135301 * | 65.34 1 1_gene_1
... ... ... ... . ... ... ...
[1156] X 153608030-153608175 * | 63.7 1 X_gene_39
[1157] X 153608282-153608400 * | 49.58 1 X_gene_39
[1158] X 153608574-153608748 * | 60.57 1 X_gene_39
[1159] X 153609093-153609183 * | 53.85 1 X_gene_39
[1160] X 153609222-153609578 * | 59.1 1 X_gene_39

The code i was running looks like this and is basically the same as the in the Demo

bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile,
sampname = sampname, projectname = "Codex_test")

bamdir <- bambedObj$bamdir

ref <- bambedObj$ref

sampname <- bambedObj$sampname

projectname <- bambedObj$projectname

gc <- getgc(ref)

mapp <- getmapp(ref)

gene=rep(NA,length(ref))

print(gene)

for(chr in as.matrix(unique(seqnames(ref)))){
chr.index=which(seqnames(ref)==chr)
gene[chr.index]=paste(chr,'gene',ceiling(chr.index/30),sep='')
}

values(ref) <- cbind(values(ref), DataFrame(gc, mapp, gene))

coverageObj <- getcoverage(bambedObj, mapqthres = 20)

Y <- coverageObj$Y

write.csv(Y, file = paste(projectname, '_coverage.csv', sep=''), quote = FALSE)
head(Y[,1:5])

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, Inf),
length_thresh = c(20, Inf), mapp_thresh = 0.9,
gc_thresh = c(20, 80))

Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc

does anybody have an idea what the problem could be?
thanks a lot in advance

First Pass - Normalizing Step - Error in smooth.spline

Dear Developer,

First, I'd like to thank you for all your efforts.

I sequenced some specific regions of chromosome 2, 5 and 12 using a custom panel. So my goal is to call CNV from targeted sequencing data.

n. Regions: some 3600
n, Samples: 16
Reference: HG38
R version: 3.6.3

I easily go through your demo (that perfectly fulfill my needs) up to the normalization step
Now, I designed my panel in order to use sequenced regions in chromosomes 2 and 5 as negative control region.

Despite this, I'd like to go for a 2-pass run calling CNV without considering negative control regions first. In this way, I can figure out exactly which regions on chromosome 12 I've to use as cnv_index in the second run.

I faced a problem trying to normalize my data without considering negative control regions.
This command:

chr <- 'chr12'
chr.index <- which(seqnames(ref_qc)==chr)
normObj.null <- normalize_null(Y_qc = Y_qc[chr.index,],
                               gc_qc = gc_qc[chr.index],
                               K = 1:5, N = N)

returns:

Computing normalization with k = 1 latent factors ...
Error in smooth.spline(gc_qc, z) : 
  'tol' must be strictly positive and finite
In addition: Warning message:
In matrix(nrow = nrow(Y_qc), ncol = ncol(Y_qc), data = N, byrow = TRUE) :
  data length exceeds size of matrix

After a little bit of debugging, I found that, due to problems with the library or during amplification phase, I totally missed some of the regions I was expecting to cover (some 50 small regions). I thought that maybe 0s in the coverage matrix were causing the problem.
I removed them but I still get the error message.

Am I missing something?

Thanks in advance,
Davide

Error when installing

I get the following error when I try to install. It also kills the R session after. I have all of the packages listed: Rsamtools, GenomicRanges, BSgenome.Hsapiens.UCSC.hg19, S4Vectors, IRanges, Biostrings, GenomeInfoDb installed already and can load them without any problems

> devtools::install_github("yuchaojiang/CODEX2/package")
Downloading GitHub repo yuchaojiang/CODEX2@master
Skipping 7 packages not available: Rsamtools, GenomicRanges, BSgenome.Hsapiens.UCSC.hg19, S4Vectors, IRanges, Biostrings, GenomeInfoDb
✔  checking for file ‘/tmp/RtmpQrsnVv/remotes3f3115da431d/yuchaojiang-CODEX2-e47af62/package/DESCRIPTION’ ... OK
─  preparing ‘CODEX2’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  looking to see if a ‘data/datalist’ file should be added
─  building ‘CODEX2_1.3.0.tar.gz’ (10.3s)
   
* installing *source* package ‘CODEX2’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
/home/maherm/R/lib64/R/bin/INSTALL: line 34: 31405 Done                    echo 'tools:::.install_packages()'
     31406 Killed                  | R_DEFAULT_PACKAGES= LC_COLLATE=C "${R_HOME}/bin/R" $myArgs --slave --args ${args}
Error: Failed to install 'CODEX2' from GitHub:
  (converted from warning) installation of package ‘/tmp/RtmpQrsnVv/file3f3133e61203/CODEX2_1.3.0.tar.gz’ had non-zero exit status
> 
> 
> Terminated
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /home/maherm/R/lib64/R/lib/libRblas.so
LAPACK: /home/maherm/R/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.1

chr in bambedObj is NULL / missing chr in getcoverage() in codex2_targeted.R

Dear CODEX2 developers,

I recently run CODEX2 v1.3.0 successfully in "exome mode" without controls, but I was unable to run your approach for targeted sequencing given in the codex2_targeted.R sccript.
First, I suppose that the getcoverage() call in line 22 misses argument chr. However, setting chr to bambedObj$chr fails in my case, as bambedObj$chr is NULL.

The bambedObj seems to be created as expected, at least without warnings and yielding:

>names(bambedObj)
[1] "bamdir"      "sampname"    "ref"         "projectname"

>bambedObj$ref
GRanges object with 4083 ranges and 0 metadata columns:
         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
     [1]        1   [1634345, 1634438]      *
     [2]        1   [1634518, 1634708]      *
     [3]        1   [1634914, 1635063]      *
     [4]        1   [1635262, 1635379]      *
     [5]        1   [1635477, 1635585]      *
     ...      ...                  ...    ...
  [4079]       22 [40804797, 40804846]      *
  [4080]       22 [40804936, 40805022]      *
  [4081]       22 [40805253, 40805376]      *
  [4082]       22 [40805468, 40805529]      *
  [4083]       22 [40805685, 40805763]      *
  -------
  seqinfo: 21 sequences from an unspecified genome; no seqlengths

How can bambedObj$chr be initialized?

Many thanks in advance!

Error in genome[[chrtemp]] : no such sequence

Hi Yuchao,
When I tried to run the getgc function, I always get the error below. Can you please advise what's wrong? Thanks.

Getting GC content for chr 1
Getting GC content for chr 10
Getting GC content for chr 11
Getting GC content for chr 12
Getting GC content for chr 13
Getting GC content for chr 14
Getting GC content for chr 15
Getting GC content for chr 16
Getting GC content for chr 17
Getting GC content for chr 18
Getting GC content for chr 19
Getting GC content for chr 2
Getting GC content for chr 20
Getting GC content for chr 21
Getting GC content for chr 22
Getting GC content for chr 3
Getting GC content for chr 4
Getting GC content for chr 5
Getting GC content for chr 6
Getting GC content for chr 7
Getting GC content for chr 8
Getting GC content for chr 9
Getting GC content for chr M
Error in genome[[chrtemp]] : no such sequence
Calls: getgc -> unmasked -> [[ -> [[
Execution halted

Best,
Zeyan

Differences between CODEX and CODEX2?

Hi,

The CODEX page redirects me here, and could you please tell the major differences between version one and two?

If possible, I think it is better to make an introduction of the improvements in the README documents.

Thanks a lot.

Problem running qc

Hi,

I run qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000), length_thresh = c(20, 2000), mapp_thresh = 0.9, gc_thresh = c(20,80))

and get the following error

Error in apply(Y_qc, 1, median) : dim(X) must have a positive length

what is the issue?

Thanks

Need Help for Normalization w/ all chromosomes

Hello Dr Yuchao

Thanks to you for creating CODEX2, I am using this wonderful program well for work.
During running CODEX2, I had a difficulty in making a decision to make a code for normalization using all chromosomes.
So, I ask you if there's any problem with the code I used or if there is a better way.

My situation is this: I have 45 samples (without CNV details), and I want to identify which sample has germ-line variations on some specific genes in chr9. As I don’t have a specified negative control samples/regions, I want to use all chromosomes in all samples for better normalization. I checked the demo code “Running CODEX2 without specifying negative control samples/regions“, and have a question how to modify this code to use all chromosomes in all samples for normalization.

The one trial I made is:
normObj.null <- normalize_null(Y_qc = Y_qc, gc_qc = gc_qc, K = 1:4, N = N)

(cf. Given demo code is:
normObj.null <- normalize_null(Y_qc = Y_qc[chr.index,], gc_qc = gc_qc[chr.index], K = 1:5, N = N))

The whole code I used is as follows:

library(BiocManager)
library(CODEX2)
dirPath <- '/BiO/home/jekim/Data/HL_TJP2/Total-E'
bamFile <- list.files(dirPath, pattern = '*.bam$')
bamdir <- file.path(dirPath, bamFile)
bedFile <- file.path('/BiO/home/jekim/CNVkit', 'Exome_Xgen.bed')
sampname <- substr(bamFile,0,10)
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, 
                       sampname = sampname, projectname = "E4-all")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname

genome = BSgenome.Hsapiens.UCSC.hg19

getgc <- function (ref, genome = NULL) 
{
    if (is.null(genome)) {
        genome = BSgenome.Hsapiens.UCSC.hg19
    }
    gc = rep(NA, length(ref))
    for (chr in unique(seqnames(ref))) {
        message("Getting GC content for chr ", chr, sep = "")
        chr.index = which(as.matrix(seqnames(ref)) == chr)
        ref.chr = IRanges(start = start(ref)[chr.index], end = end(ref)[chr.index])
        if (chr == "X" | chr == "x" | chr == "chrX" | chr == 
            "chrx") {
            chrtemp <- "X"
        }
        else if (chr == "Y" | chr == "y" | chr == "chrY" | chr == 
            "chry") {
            chrtemp <- "Y"
        }
        else {
            chrtemp <- as.numeric(mapSeqlevels(as.character(chr), 
                "NCBI")[1])
        }
        if (length(chrtemp) == 0) 
            message("Chromosome cannot be found in NCBI Homo sapiens database!")
        chrm <- unmasked(genome[[paste("chr", chrtemp, sep = "")]])
        seqs <- Views(chrm, ref.chr)
        af <- alphabetFrequency(seqs, baseOnly = TRUE, as.prob = TRUE)
        gc[chr.index] <- round((af[, "G"] + af[, "C"]) * 100, 
            2)
     }
      gc
}
gc <- getgc(ref, genome = genome)
mapp <- getmapp(ref, genome = genome)
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))
coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y
write.csv(Y, file = paste(projectname, '_coverage.csv', sep=''), quote = FALSE)
qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
            length_thresh = c(20, 2000), mapp_thresh = 0.9,
            gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)
normObj.null <- normalize_null(Y_qc = Y_qc,
                               gc_qc = gc_qc,
                               K = 1:4, N = N)
Yhat.null <- normObj.null$Yhat
AIC.null <- normObj.null$AIC; BIC.null <- normObj.null$BIC
RSS.null <- normObj.null$RSS
choiceofK(AIC.null, BIC.null, RSS.null, K = 1:4 , filename = "E4-all_choiceofK.pdf")
write.table(Yhat.null, file = paste( projectname,'_Yhat.null_test.txt',
                                     sep=''), sep='\t', quote=F, row.names=F)
chro <- c('chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18',
     'chr19','chr20','chr21','chr22','chrX','chrY')
for (chr in chro){
    chr.index <- which(seqnames(ref_qc)==chr)        
    finalcall.CBS <- segmentCBS(Y_qc[chr.index,],
Yhat.null, optK = which.max(BIC.null), K = 1:4,
sampname_qc = sampname_qc,
ref_qc = ranges(ref_qc)[chr.index],
chr = chr, lmax = 400, mode = "integer")   
write.table(finalcall.CBS, file = paste( projectname, chr, '_no_filter_test.txt',
                                     sep=''), sep='\t', quote=F, row.names=F)
 filter1 <- finalcall.CBS$length_kb<=200
    filter2 <- finalcall.CBS$length_kb/(finalcall.CBS$ed_exon-finalcall.CBS$st_exon+1)<50
    finalcall.CBS.filter <- finalcall.CBS[filter1 & filter2, ]
    filter3 <- finalcall.CBS.filter$lratio>40
    filter4 <- (finalcall.CBS.filter$ed_exon-finalcall.CBS.filter$st_exon)>1
    finalcall.CBS.filter=finalcall.CBS.filter[filter3|filter4,]
    # save filtered results
    write.table(finalcall.CBS.filter, file = paste( projectname, chr, '_filter.txt',
                                     sep=''), sep='\t', quote=F, row.names=F)
    }

and the Y_qc, gc_qc look like this:
image
image

image

  1. Is this code okay for normalization - though it doesn’t have chr indexing “Y_qc = Y_qc[chr.index,], gc_qc = gc_qc[chr.index]”?
  2. Could you suggest me a better solution for this-using all chromosomes in all samples for normalization?

Questions about "Running CODEX2 without specifying negative control samples/regions"

Hi, I have 49 uncorrelated samples and run in batch with CODEX2 without specifying negative control samples/regions. But the results is quit terrible, the number of cnv were few and only detected chr1,chr2 and chr3, and even some samples could not find any cnv (One of the sample is NA12878). I was wondering where is wrong in my code:

library(CODEX2)
library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.

###################################################################################
# Pre-processing

bamFile <- list.files('/storage/wujh/Project/CNV_analysis/CODEX/bam', pattern = '*.bam$') 
bamdir <- file.path('/storage/wujh/Project/CNV_analysis/CODEX/bam', bamFile)
sampname <- substr(bamFile,1,8)
bedFile <- file.path('/home/sxuan/codex2', 'AgV6.bed')
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, 
                       sampname = sampname, projectname = 'CODEX2_test')
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname


###################################################################################
# Getting GC content and mappability

gc <- getgc(ref, genome = BSgenome.Hsapiens.UCSC.hg19)
mapp <- getmapp(ref, genome = BSgenome.Hsapiens.UCSC.hg19)
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))


####################################################################################
# Getting raw read depth

coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y
write.csv(Y, file = paste(projectname, '_coverage.csv', sep=''), quote = FALSE)


####################################################################################
# Quality control

print('Quality control')
qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, Inf),
            length_thresh = c(20, Inf), mapp_thresh = 0.9,
            gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)


###################################################################################
# Running CODEX2
print('Running CODEX2')

Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){prod(x)^(1/length(x))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

# Running CODEX2 without negative control samples/regions

print('Running CODEX2 without negative control samples/regions')
chr <- c(c(1:22),'X','Y')

chr.index <- which(seqnames(ref_qc)==chr)
norm_index <- c(1:length(sampname))

normObj.null <- normalize_null(Y_qc = Y_qc[chr.index,],
                           gc_qc = gc_qc[chr.index],
                           K = 1:5, N = N)
Yhat.null <- normObj.null$Yhat
AIC.null <- normObj.null$AIC; BIC.null <- normObj.null$BIC
RSS.null <- normObj.null$RSS

#Choose the number of latent Poisson factors. BIC is used as the model selection metric by default.
choiceofK(AIC.null, BIC.null, RSS.null, K = 1:5 , filename = "codex2_null_choiceofK.pdf")
par(mfrow = c(1, 3))
plot(1:5, RSS.null, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, AIC.null, type = "b", xlab = "Number of latent variables", pch=20)
plot(1:5, BIC.null, type = "b", xlab = "Number of latent variables", pch=20)
par(mfrow = c(1,1))

##################################################################################
# Running segmentation by CODEX2
print('Running segmentation by CODEX2')
# recommended
finalcall.CBS <- segmentCBS(Y_qc[chr.index,],  
                            Yhat.null, optK = which.max(BIC.null),
                            K = 1:5,
                            sampname_qc = paste('sample',1:ncol(Y_qc),sep=''),
                            ref_qc = ranges(ref_qc)[chr.index],
                            chr = chr, lmax = 400, mode = "integer")
write.table(finalcall.CBS, file=paste('norm_null.final.cnv',sep=''), sep='\t', quote=F)

# not recommended
# finalcall.HMM <- segmentHMM(Y_qc[chr.index,],  
#                             Yhat.ns, optK = which.max(BIC.ns),
#                             K = 1:5,
#                             sampname_qc = paste('sample',1:ncol(Y_qc),sep=''),
#                             ref_qc = ranges(ref_qc)[chr.index],
#                             chr = chr, mode = "integer")


# Post-segmentation pruning and filtering are recommended based on CNV length (filter1), 
# length per exon (filter2), likelihood ratio (filter3), and number of exons (filter4)

filter1 <- finalcall.CBS$length_kb<=200
filter2 <- finalcall.CBS$length_kb/(finalcall.CBS$ed_exon-finalcall.CBS$st_exon+1)<50
finalcall.CBS.filter <- finalcall.CBS[filter1 & filter2, ]

filter3 <- finalcall.CBS.filter$lratio>40
filter4 <- (finalcall.CBS.filter$ed_exon-finalcall.CBS.filter$st_exon)>1
finalcall.CBS.filter=finalcall.CBS.filter[filter3|filter4,]

write.table(finalcall.CBS.filter, file='_norm_null.final.filtered.cnv', sep='\t', quote=F)

Any suggestions will be greatly appreciated !

some questions about CODEX2

Dear Dr.Jiang,
I am a student from China.I'm using CODEX2 to analyze copy number changes in human samples by target capture sequencing, but I am so confused by the bad results.I didn't make correct input files and there's no online tutorial of target capture sequencing.Therefore ,there are several questions I hope you could help me out :
1.how the bam files are made?Could you tell me if the way bam file of target capture sequencing data are made is the same as the WES?Are they only sorted?What are the requirements for the same batch of tested bam files e.g the difference of the read depth ?
2.how's the bed files cooked?I have the exact position of the capture region of target capture sequencing.
3.In addition,I don't know why the result shows that the duplicate and the deletion exist on the same chromosome of the same sample with high quality.
Thank you very much in advance.

Best regards,
Ziying Zhou

CODEX2, getbambed, wrong example

Hi,
in the example here:
https://github.com/yuchaojiang/CODEX2/blob/master/targeted_sequencing/codex2_targeted.R#L15-L20
there's an error when running this code:
library(CODEX2)

chr=1
# get bam directories, read in bed file, get sample names
bambedObj=getbambed(bamdir=bamdir,
                    bedFile=bedFile,
                    sampname=sampname,
                    projectname=projectname,chr)

Error in getbambed(bamdir = bamdir, bedFile = bedFile, sampname = sampname, :
unused argument (chr)

Should I use getbambed from CODEX?

EDIT: in the whole example file "codex2_targeted.R" the commands from CODEX and CODEX2 are mixed (This is my first time I use CODEX and/or CODEX2 so I'm a bit confused)

Cheers
Damian Loska

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.