Git Product home page Git Product logo

hmzdelfinder's Introduction

HMZDelFinder

CNV calling algorithm for detection of rare, homozygous and hemizygous deletions from whole exome sequencing data

Prerequisites

  • R in version >= 3.0.1

Following R libraries are required to run HMZDelFinder:

  • RCurl (version >= 1.95.4.7)
  • gdata (version >= 2.17.0)
  • data.table (version >= 1.9.6)
  • DNAcopy (version >= 1.36.0)
  • GenomicRanges (version >= 1.14.4)
  • parallel (version >= 3.0.1)
  • Hmisc (version >= 3.16.0)
  • matrixStats (version >= 0.50.2)
  • Rsubread (version >= 1.20.3)

To install missing packages, run the code from the appropriate sections ('install missing packages from ...') at example/example_run.R

Running HMZDelFinder

  • Working example that runs HMZDelFinder on 50 samples from 1000 genomes is available at example/example_run.R
  • The code was tested on Linux and may not work properly on other platforms.

Format of input files

BED file

Tab delimited file without header and four columns:

  • Chromosome
  • Start
  • Stop
  • Gene symbol

RPKM files

Tab delimited file with a header and two columns:

  • count // the number of reads overlapping with each capture target
  • RPKM // the RPKM value for each capture target

IMPORTANT: The number of rows and the order of capture targets have to correspond to the number of rows and the order defined in the BED file.

To generate RPKM files from BAM files, see comments at example/example_run.R.

VCF files

VCF files are required for AOH analysis and further filtering of identfied deletion calls. We assume that all files are single sample VCFs compressed with bz2. In general, VCF should follow the standard VCF format, however, the following columns are the most important:

  • CHROM // Chromosome
  • POS // Position
  • FILTER // Only variants with "PASS" in the FILTER column are used for AOH analysis.
  • FORMAT // 9-th VCF column containing definition of the last column
  • SAMPLE // 10-th VCF column with the genotype data and the information on the total number of reads ('DP') and the number of variant reads (e.g. 'VR')

NOTE: Please note that to calculate B-allele frequency (needed for AOH analysis) it is required that in the last column of VCF, both total number of reads and the number of variant reads are reported for every variant. Moreover, all multiallelic sites should be filtered out. Such VCFs can be generated, e.g. by Atlas2 variant caller.

Format of output files

Object returned by runHMZDelFinder(...), contains the following items:

  • filteredCalls // the list of calls after AOH and deletion size filtering
  • allCalls // the list of calls before AOH and deletion size filtering
  • bedOrdered // the data.table containing ordered coordinates of coverage targets
  • rpkmDtOrdered // the data.table containing RPKM data for all samples; rowNames corresponds to sample identifiers and columns to coverage targets.

Format of filteredCalls/allCalls

Both objects are data.frames with the following columns:

  • Chr // Chromosome
  • Start // Start position of deletion call
  • Stop // End position of deletion call
  • Genes // Comma separated list of genes encompassed by deletion
  • Start_idx // Index of first target
  • Mark_num // Number of targets that indicate deletion
  • Exon_num // Number of exons afftected (total number of targets encompassed by deletion)
  • FID // Sample identifier
  • Length // Length of deletion
  • BAB // Internal sample name (used only for BHCMG samples)
  • project // Project name (used only for BHCMG samples)
  • PoorSample // TRUE if the number of calls in the sample > 98 quantile
  • posKey // (chr+start+stop)
  • key // (sampleId+chr+start+stop)
  • inAOH_1000 // TRUE if deletion overlap with any AOH region greater than 1000bp
  • ZScore // z-score
  • OverlapCnt // number of overlapping calls in other samples
  • PerSampleNr // number of calls in this sample

hmzdelfinder's People

Contributors

tgambin avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

hmzdelfinder's Issues

plotDeletion() error

Hi,
I think you have something wrong in plotDeletion function.
when I run:

lapply(1:nrow(results$filteredCalls),function(i){
  plotDeletion(results$filteredCalls, i, results$bedOrdered, results$rpkmDtOrdered,  lowRPKMthreshold, plotsDir, mainText=""  )})

I get error:

Error in !calls$inAOH_1000[other_idx] : invalid argument type

I discovered that when I'm setting:
minAOHsize <- 500
the hardcoded variable calls$inAOH_1000 becomes calls$inAOH_500. After changing it manually in the code, everything works.
so when I'm using: minAOHsize <- 500
I have to manually change it:

	if (length(other_idx)>0){
		noAOH <- calls$FID[other_idx][which(!calls$inAOH_1000[other_idx])]
	}

on this:

	if (length(other_idx)>0){
		noAOH <- calls$FID[other_idx][which(!calls$inAOH_500[other_idx])]
	}

(however at the end there are printed such info:

[[121]]
null device
1

[[122]]
null device
1

[[123]]
null device
1

but at least I obtain all 123 plots).

Thanks
Damian Loska

calcRPKMsFromBAMs()

Dear sir, madam,
I keep running into the following problem. It is not to me clear what this means and how to resolve. I hope you can help me further.

command: calcRPKMsFromBAMs(bedFile,bamFiles , sampleNames, rpkmDir,4)

In fread(bedFile) :
Bumped column 1 to type character on data row 213577, field contains 'X'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
2: In mclapply(1:length(bamFiles), function(i) { :
all scheduled cores encountered errors in user code

error running example

warning errors when running example

Warning messages:
1: In mccollect(p) : 1 parallel job did not deliver a result
2: In mccollect(p) : 1 parallel job did not deliver a result
3: In [.data.table(rpkmDtOrdered, , , with = F) :
i and j are both missing so ignoring the other arguments. This warning will be upgraded to error in future.
4: In mccollect(p) : 1 parallel job did not deliver a result
5: In mccollect(p) : 1 parallel job did not deliver a result
6: In mccollect(p) : 1 parallel job did not deliver a result
7: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter
8: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter
9: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter
10: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter
11: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter
12: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter
13: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter
14: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter
15: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter
16: In plot.xy(xy.coords(x, y), type = type, ...) :
"alpha" is not a graphical parameter

very few deletions call for HMZDelFinder

I have used this HMZDelfinder for 16 Exome Samples and I just got few 5-6 Deletion Calls. Do you have any idea what could be possible reason for this ? I have used TGP bed file provided for you. I assumed that I can use same BED file for all the other analysis.

calcRPKMsFromBAMs()

Dear Tomasz,

I am trying to test HMZDelFinder on our undiagnosed exome samples. To start with I am not an R expert - trying to learn by using it :). I used the R/3.2 version on slurm clusters and the 1000G example worked fine. When trying to run with R/3.0.1 which you recommend the library data.table is not available on that version. Nevertheless, I am having most probably a memory issue when trying to run the calcRPKMSFromBAMS to create the RPKM files from my data. For the record I used initially 12 exome bam files and then a single one for check. this is the error I get on both cases:

*** caught segfault ***
address 0x2aaee8021000, cause 'invalid permissions'
*** glibc detected *** /gpfs/buildsets/eb161004/software/R/3.2.3-intel-2016a-libX11-1.6.3/lib64/R/bin/exec/R: free(): invalid next size (normal): 0x00002aaee8002ee0 ***

Any clues on what is going wrong here?
Somethings wrong when the featureCounts function takes place. I tried the command separately for featurecounts and I get the same error.

Best
Athina

Broken Pipe Filtering for AOH

Hi,

I received an error from HMZDelFinder while filtering for AOH.I called variants using GATK Best Practices and zipped the files using bzip2 filename or bzip2 -z filename. All VCF files are in v4.2. Files extensions are .bz2.  I receive the error listed below but interestingly enough I can bzcat the file paths from the command line with no problem.Can you please help me solve this issue?

[1] "[step 1 out of 7] ******  AOH data ******"[1] "Reading VCFs and generating AOH data using CBS ..."  |                                                                                         |   0%bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA_1.10019.vcf.bz2, output file = (stdout)bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA-1_S1.vcf.bz2, output file = (stdout)bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA-2_S1.vcf.bz2, output file = (stdout)bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA_2.10018.vcf.bz2, output file = (stdout)bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA_3.10020.vcf.bz2, output file = (stdout)bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA-3_S1.vcf.bz2, output file = (stdout)bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA-5_S1.vcf.bz2, output file = (stdout)bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA-4_S1.vcf.bz2, output file = (stdout)bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA-6_S1.vcf.bz2, output file = (stdout)bzcat: I/O or other error, bailing out.  Possible reason follows.bzcat: Broken pipe        Input file = /home/ricardo/Downloads/DMA/ATLAS/DMA-7_S1.vcf.bz2, output file = (stdout)

Error after Step - 7

Hi,

It works perfectly fine without AOH analysis, The moment I have VCF to perform AOH, it dies after completing 7th step.

Any help on this would be appreciated.
Vivek

[1] "[step 1 out of 7] ****** AOH data ******"
[1] "Reading VCFs and generating AOH data using CBS ..."
|======================================================================| 100%
[1] "Extending AOHs by length of uncertain regions ..."
|======================================================================| 100%
[1] "[step 2 out of 7] ****** Preparing RPKM data ******"
[1] "Reading RPKM files ..."
|======================================================================| 100%
[1] "Removing empty elements ..."
[1] "Creating matrix ..."
|======================================================================| 100%
[1] "Creating data.table (may take a while)..."
[1] "[step 3 out of 7] ****** SELECTING CANDIDATE EXONS ******"
[1] "Computing initial statistics for each exon..."
|======================================================================| 100%
[1] "Selecting a subset of exons with potentiall deletions (may take a while)..."
[1] "Calculating number of potential deletions for each sample (may take a while)..."
[1] "Detetermining 2% of samples with the highest number of deletions"
[1] "Removing these samples (may take a while)..."
[1] "Recomputing per-exon statistics without low quality samples ..."
|======================================================================| 100%
[1] "[step 4 out of 7] ****** DELETION CALLING ******"
[1] "Calling deletions"
|======================================================================| 100%
[1] "[step 5 out of 7] ****** MERGING CALLS FROM CONSECUTIVE EXONS ******"
[1] "[step 6 out of 7] ****** ANNOTATING CALLS ******"
[1] "[step 7 out of 7] ****** OVERLAPPING WITH AOH REGIONS AND FILTERING ******"
|======================================================================| 100%
Error in 1:filteredCalls$Exon_num[i] : NA/NaN argument
Calls: runHMZDelFinder ... sapply -> sapply -> lapply -> FUN -> mean -> sapply
Execution halted

Merging calls error

Hello,
I've had some problems generating rpkms, but now I generated them using Conifer and adjusting the output. However now when I run the software I get an error on the fifth step:

[1] "[step 1 out of 7] ****** AOH data ******" [1] "Skipping AOH analysis ..." [1] "[step 2 out of 7] ****** Preparing RPKM data ******" [1] "Reading RPKM files ..." |======================================================================| 100% [1] "Removing empty elements ..." [1] "Creating matrix ..." |======================================================================| 100% [1] "Creating data.table (may take a while)..." [1] "[step 3 out of 7] ****** SELECTING CANDIDATE EXONS ******" [1] "Computing initial statistics for each exon..." |======================================================================| 100% [1] "Selecting a subset of exons with potentiall deletions (may take a while)..." [1] "Calculating number of potential deletions for each sample (may take a while)..." [1] "Detetermining 2% of samples with the highest number of deletions" [1] "Removing these samples (may take a while)..." [1] "Recomputing per-exon statistics without low quality samples ..." |======================================================================| 100% [1] "[step 4 out of 7] ****** DELETION CALLING ******" [1] "Calling deletions" |======================================================================| 100% [1] "[step 5 out of 7] ****** MERGING CALLS FROM CONSECUTIVE EXONS ******" Error in strsplit(v$V4, ",") : non-character argument
I don't know how to program in R, which is why I can't really figure out what the problem is.
How can I continue? I'm sooo close!!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.