Git Product home page Git Product logo

duphold's Introduction

Build Status

Actions Status

duphold: uphold your DUP and DEL calls

The paper describing duphold is available here

SV callers like lumpy look at split-reads and pair distances to find structural variants. This tool is a fast way to add depth information to those calls. This can be used as additional information for filtering variants; for example we will be skeptical of deletion calls that do not have lower than average coverage compared to regions with similar gc-content.

In addition, duphold will annotate the SV vcf with information from a SNP/Indel VCF. For example, we will not believe a large deletion that has many heterozygote SNP calls.

duphold takes a bam/cram, a VCF/BCF of SV calls, and a fasta reference and it updates the FORMAT field for a single sample with:

  • DHFC: fold-change for the variant depth relative to the rest of the chromosome the variant was found on
  • DHBFC: fold-change for the variant depth relative to bins in the genome with similar GC-content.
  • DHFFC: fold-change for the variant depth relative to Flanking regions.

It also adds GCF to the INFO field indicating the fraction of G or C bases in the variant.

After annotating with duphold, a sensible way to filter to high-quality variants is:

bcftools view -i '(SVTYPE = "DEL" & FMT/DHFFC[0] < 0.7) | (SVTYPE = "DUP" & FMT/DHBFC[0] > 1.3)' $svvcf

In our evaluations, DHFFC works best for deletions and DHBFC works slightly better for duplications. For genomes/samples with more variable coverage, DHFFC should be the most reliable.

SNP/Indel annotation

NOTE it is strongly recommended to use BCF for the --snp argument as otherwise VCF parsing will be a bottleneck.

  • A DEL call with many HETs is unlikely to be valid.

When the user specifies a --snp VCF, duphold finds the appropriate sample in that file and extracts high (> 20) quality, bi-allelic SNP calls and for each SV, it reports the number of hom-refs, heterozygote, hom-alt, unknown, and low-quality snp calls in the region of the event. This information is stored in 5 integers in DHGT.

When a SNP/Indel VCF/BCF is given, duphold will annotate each DEL/DUP call with:

  • DHGT: counts of [0] Hom-ref, [1] Het, [2] Homalt, [3] Unknown, [4] low-quality variants in the event. A heterozygous deletion may have more hom-alt SNP calls. A homozygous deletion may have only unknown or low-quality SNP calls.

In practice, this has had limited benefit for us. The depth changes are more informative.

Performance

Speed

duphold runtime depends almost entirely on how long it takes to parse the BAM/CRAM files; it is relatively independent of the number of variants evaluated. It will also run quite a bit faster on CRAM than on BAM. It can be < 20 minutes of CPU time for a 30X CRAM.

Accuracy

Evaluting on the genome in a bottle truthset for DEL calls larger than 300 bp:

method FDR FN FP TP-call precision recall recall-% FP-%
unfiltered 0.054 276 86 1496 0.946 0.844 100.000 100.000
DHBFC < 0.7 0.018 298 27 1474 0.982 0.832 98.529 31.395
DHFFC < 0.7 0.021 289 32 1483 0.979 0.837 99.131 37.209

Note that filtering on DHFFC < 0.7 retains 99.1% of true positives and removes 62.8% (100 - 37.2) of false positives

This was generated using truvari.py with the command:

truvari.py --sizemax 15000000 -s 300 -S 270 -b HG002_SVs_Tier1_v0.6.DEL.vcf.gz -c $dupholded_vcf -o $out \
   --passonly --pctsim=0  -r 20 --giabreport -f $fasta --no-ref --includebed HG002_SVs_Tier1_v0.6.bed -O 0.6

For deletions >= 1KB, duphold does even better:

method FDR FN FP TP-call precision recall recall-% FP-%
unfiltered 0.073 46 38 486 0.927 0.914 100.000 100.000
DHBFC < 0.7 0.012 54 6 478 0.988 0.898 98.354 15.789
DHFFC < 0.7 0.012 53 6 479 0.988 0.900 98.560 15.789

Note that filtering on DHFFC < 0.7 retains 98.5% of DEL calls that are also in the truth-set (TPs) and removes 84.2% (100 - 15.8) of calls not in the truth-set (FPs)

The truvari.py command used for this is the same as above except for: -s 1000 -S 970

Install

duphold is distributed as a static binary here.

Usage

duphold -s $gatk_vcf -t 4 -v $svvcf -b $cram -f $fasta -o $output.bcf
duphold --snp $gatk_bcf --threads 4 --vcf $svvcf --bam $cram --fasta $fasta --output $output.bcf

--snp can be a multi-sample VCF/BCF. duphold will be much faster with a BCF, especially if the snp/indel file contains many (>20 or so) samples.

the threads are decompression threads so increasing up to about 4 works.

Full usage is available with duphold -h

duphold runs on a single-sample, but you can install smoove and run smoove duphold to parallelize across many samples.

Examples

Duplication

Here is a duplication with clear change in depth (DHBFC)

image

duphold annotated this with

  • DHBFC: 1.79

where together these indicate rapid (DUP-like) change in depth at the break-points and a coverage that 1.79 times higher than the mean for the genome--again indicative of a DUP. Together, these recapitulate (or anticipate) what we see on visual inspection.

Deletion

A clear deletion will have rapid drop in depth at the left and increase in depth at the right and a lower mean coverage.

image

duphold annotated this with:

  • DHBFC: 0.6

These indicate that both break-points are consistent with a deletion and that the coverage is ~60% of expected. So this is a clear deletion.

BND

when lumpy decides that a cluster of evidence does not match a DUP or DEL or INV, it creates a BND with 2 lines in the VCF. Sometimes these are actual deletions. For example:

image

shows where a deletion is bounded by 2 BND calls. duphold annotates this with:

  • DHBFC: 0.01

indicating a homozygous deletion with clear break-points.

Tuning and Env vars

The default flank is 1000 bases. If the environment variable DUPHOLD_FLANK is set to an integer, that can be used instead. In our experiments, this value should be large enough that duphold can get a good estimate of depth, but small enough that it is unlikely to extend into an unmapped region or another event. This may be lowered for genomes with poor assemblies.

If the sample name in your bam does not match the one in the VCF (tisk, tisk). You can use DUPHOLD_SAMPLE_NAME environment variable to set the name to use.

Acknowledgements

I stole the idea of annotating SVs with depth-change from Ira Hall.

duphold's People

Contributors

brentp avatar cbrueffer avatar roryk avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

duphold's Issues

Not able to run duphold

Dear Brent,

First of all i want a appreciate great work that you doing for the community.

I have a couple of questions about how to run Duphold. As i understand written in Nim. I tried to run duphold straightforward from the bash. No luck.

[ealiyev@hpclogin1 duphold-0.0.1]$ duphold
-bash: duphold: command not found

I tried to run like that:

[ealiyev@hpclogin1 duphold-0.0.1]$ nim --run duphold
Hint: used config file '/gpfs/software/genomics/nim/0.17.2/config/nim.cfg' [Conf]
Hint: used config file '/gpfs/projects/bioinfo/najeeb/withKhalid/SVtoolsCompare/sandbox/duphold-0.0.1/nim.cfg' [Conf]
Error: invalid command: 'duphold'

And also i tried to compile from the source.
[ealiyev@hpclogin1 duphold-0.0.1]$ nim compile nim.cfg
Hint: used config file '/gpfs/software/genomics/nim/0.17.2/config/nim.cfg' [Conf]
Hint: used config file '/gpfs/projects/bioinfo/najeeb/withKhalid/SVtoolsCompare/sandbox/duphold-0.0.1/nim.cfg' [Conf]
Hint: system [Processing]
Hint: nim [Processing]
nim.cfg(1, 1) Error: undeclared identifier: 'threads'

thanks in advance for your help.

error setting DHD in VCF when annotating

Brent;
Thanks for duphold, this is a brilliant idea to help improve SV filtering and I'm working to incorporate into bcbio right now.

When running validation cases I'm hitting an issue that I'm not sure how to fix/work around. It complains with:

error setting DHD in VCF

and exits with a 1 error code before producing any variants.

I put togther a small test case that has this issue. It's self-contained, except you need to edit REF to point at a GRCh37 fasta file (I couldn't figure out how to make it work with a reduced fasta file to include). I'm running with the v0.0.2 pre-built binary packaged up into conda.

https://s3.amazonaws.com/chapmanb/testcases/duphold_setting_DHD.tar.gz

The command is fairly vanilla:

duphold --threads 4 --vcf HCC2218BL-HCC2218-svs-smoove.genotyped.vcf.gz --bam HCC2218C-sort.bam --fasta $REF | bgzip -c > HCC2218BL-HCC2218-svs-smoove.genotyped-duphold.vcf.gz

and I get the same error whether using a bgzipped/tabix VCF or bcf input.

Do you have any clues about what I'm doing wrong? I can see the error message and check in the code (

if variant.format.set("DHD", dhd) != Status.OK:
) but don't have a good clue why trying to set the calculated value would fail. Thanks much.

SIGSEGV: Illegal storage access. (Attempt to read from nil?)

sorry, I've got a new bug with the binary you sent me :-)

( the conditions are the same as previously : #22 )

+ /ccc/work/cont007/fg0019/lindenbp/packages/duphold/duphold --snp /ccc/scratch/cont007/fg0113/lindenbp/20190428.TROUCARD.MANTA/work/bd/c70293fb517483436faf17dd3b2e61/B00I4BY.snv.bcf --vcf jeter.bcf --bam /ccc/genostore/cont007/fg0073/fg0073/analyse/projet_TROUCARD_605/ANALYSE/ANALYSE_T605_B00I4BY_HJWJJALXX_2_hs37d5/MAPPING_B00I4BY/reliable.realign/T605_DA_B00I4BY_HJWJJALXX_hs37d5_MERGE_PE_2.reliable.realign.bam --fasta /ccc/work/cont007/fg/fg/biobank/by-taxonid/9606/hs37d5/hs37d5_all_chr.fasta --threads 4 -o jeter.duphold.vcf
I, [2019-05-01T15:41:17] -- duphold: starting to read snps for chrom: chr1
I, [2019-05-01T15:41:17] -- duphold: done reading 4143 bi-allelic snps for chrom: chr1
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

trying to run with gdb:

(...)
I, [2019-05-01T15:45:31] -- duphold: starting to read snps for chrom: chr1
I, [2019-05-01T15:45:31] -- duphold: done reading 4143 bi-allelic snps for chrom: chr1
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaaab8039 in collectZCT_EN6T32AMm3va9bsrdxtF0cg ()
(gdb) bt
#0  0x00002aaaaaab8039 in collectZCT_EN6T32AMm3va9bsrdxtF0cg ()
#1  0x00002aaaaaab8b9b in collectCTBody_XHio9cMpnLoH7GyCj1Z9besg_2 ()
#2  0x00002aaaaaab8db6 in rawNewObj_ehkAaLROrd0Hc9aLROWt1nQ.constprop.33 ()
#3  0x00002aaaaaab9609 in newObj ()
#4  0x00002aaaaaaf54aa in annotate_5vOmVhlMmJnT4dAxyYRfyg ()
#5  0x00002aaaaaaf7235 in main_KwKibQAZqlBKGrljFq9agmQ ()
#6  0x00002aaaaaaf9477 in NimMain ()
#7  0x00002aaaaaaaf1dd in main ()

P.

How is DHSP computed?

Hi Brent,

How is the count of spanning read-pairs (DHSP) defined in duphold? Is it the number of discordantly aligned read pairs that flank a deletion event? If so, I find it odd that I generally observe that deletions attain a DHSP of 0, while discordantly aligned read pairs were one of the signals used to call them. Furthermore, IGV clearly shows that several of these deletions are flanked by read pairs with a significantly larger insert size than expected.

Thanks for your time.

calling copy number based on duphold values

Hi,

I used duphold to filter out false positives and it works pretty well.
Thanks a lot for making this nice programme!

I was thinking that Duphold can be useful in studying duplications with multiple levels of copies (usually called multi-allelic CNVs, mCNVs).
For instance, DHBFC 2.5 would mean 5 copies or DHBFC 3.5 would mean 7 copies.
However, sometimes DHBFC can be somewhat intermediate like 2.22... which could mean 4.5 copies, and it does not make sense, as copy numbers should be discrete numbers.

So I was wondering whether you have some ideas on this... or plan to add additional features to call (diploid) copy numbers using DHBFC... :)

Lim

Exit without any useful information

Hi @brentp
I ran duphold (version: 0.1.1) with following command, and a part of the log file (includes the head and the end lines). However, the program always failed and i can only find one error information from the log file.

command
/python2/bin/duphold --vcf ERR340341.vcf --bam ERR340341.sort.dedup.bam --fasta reference.fa --threads 4 --output ERR340341.ann.vcf --drop

log
[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)[W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.)[W::vcf_parse] Contig '3' is not defined in the header. (Quick workaround: index the file with tabix.) ..... I, [2018-12-10T09:54:10] -- duphold: starting read of bam values for chrom: NW_017216928.1 I, [2018-12-10T09:54:10] -- duphold: finished reading:NW_017216928.1. starting gc-content, discordant sorting I, [2018-12-10T09:54:10] -- duphold: finished gc-content, sorting

error
hts-nim/vcf bcf_read error:1

The ERR340341.vcf file was generated using lumpyexpress, a simplified wrapper for standard analyses.

I really do not know what happened to above command, any suggestion will be greatly appreciated.

Best wishes,
Zhuqing

Problem Annotating Manta Tandem Duplications

I believe I found a bug. When I try to annotate a vcf that contains a tandem duplication called by Manta Duphold fails to add fold changes to the call.

Here's an example:

chr17   41632594        MantaDUP:TANDEM:1:37095:37095:6:0:0     T       <DUP:TANDEM>    39      PASS    END=41633246;SVTYPE=DUP;SVLEN=652;IMPRECISE;CIPOS=-202,202;CIEND=-271,272;AN=6;AC=1     GT:FT:GQ:PL:PR  0/1:PASS:36:86,0,276:19,10

If I change ALT to <DUP> Duphold has no issue annotating

chr17   41632594        MantaDUP:TANDEM:1:37095:37095:6:0:0     T       <DUP>   39      PASS    END=41633246;SVTYPE=DUP;SVLEN=652;IMPRECISE;CIPOS=-202,202;CIEND=-271,272;AN=6;AC=1;GCF=0.471669        GT:FT:GQ:PL:PR:DHFC:DHFFC:DHBFC 0/1:PASS:36:86,0,276:19,10:1.27586:1.02778:1.27586

command using version 0.1.4:
duphold -v test.vcf --drop --threads 1 --output test.duphold.vcf --fasta $REFERENCE --bam crams/sample.1.cram

IndexError with Svaba caller vcf

weird variant:Variant(10:167066314 N/[10:167066555[N)
fatal.nim(39) sysFatal
Error: unhandled exception: index 167065294 not in 0 .. 130694993 [IndexError]

any suggestions to get rid of the above-mentioned error?

Couldn't find sample from bam or ENV in vcf

I'm getting an error when trying to use duphold

~/projects/lumpy$ duphold --vcf maddog_sorted.vcf --bam maddog_bam_trim_bwaMEM_sort_dedupped.bam -f ~/projects/data/celegans/c_elegans.PRJNA13758.WS265.genomic.fa 
couldn't find sample from bam:maddog_bam or ENV in vcf which had:maddog

I try to set DUPHOLD_SAMPLE_NAME environment variable but still get an error:

~/projects/umpy$ DUPHOLD_SAMPLE_NAME=maddog
~/projects/lumpy$ duphold --vcf maddog_sorted.vcf --bam maddog_bam_trim_bwaMEM_sort_dedupped.bam -f ~/projects/data/celegans/c_elegans.PRJNA13758.WS265.genomic.fa 
couldn't find sample from bam:maddog_bam or ENV in vcf which had:maddog

what size flanking region is used to calculate DHFFC?

I'm looking at a 214 bp deletion that shows a clear drop in coverage on IGV for several individuals, yet DHFFC for this variant has been scored at >1 in all my heterozygotes, sometimes up to 15. This deletion is in the middle of a 3 kb region surrounded by gaps in the alignments.
If the flanking region used to calculate DHFFC is only about 300bp, then I would expect DHFFC to be <1, so does this mean coverage within the variant is being compared to much larger flanking regions?
And is this parameter modifiable? Thank you!

BND enhancements

if the read orientations do not match a traditional variant, then lumpy will report 2 BND elements.

duphold should check if 2 BND elements are within ~10MB on the same chrom and if so, report the depth change between them rather than at the actual event.

error: "couldn't find sample from bam:or ENV in vcf"

I got the error: "couldn't find sample from bam:or ENV in vcf which had /path/samplename_guppy.fastq_Q7_500_hg19_sorted.bam" from running
duphold -v /path/samplename_Q7_500_m3_sniffles_genotyped_sorted.vcf.gz -b /path/samplename_guppy.fastq_Q7_500_hg19_sorted.bam -f /path/samplename_Q7_500.fasta -o duphold_out.vcf

I'm not sure what this error is trying to get at, and would appreciate your help!

Thanks!

Can I set the parameter --snp as null?

Hi @brentp,

Very useful software. Thank you. Just out of interest, can I set the --snp parameter as NULL if I do not care about the landscape of the SNPs in the structure variations?

Best wishes,
Zhuqing

Range error in version 0.2

Hi @brentp !

Thanks for supporting duphold!
We are using it to annotate calls made by CNVkit in bcbio.

Since the latest update to 0.2 one of our tests fails
(duphold 0.1.4 works just fine given the same input).
I tracked the variant which causes the failure:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  H-2_sorted
chr7    500     .       N       <DEL>   .       .       IMPRECISE;SVTYPE=DEL;END=33440015;SVLEN=-33439515;FOLD_CHANGE=0.621507;FOLD_CHANGE_LOG=-0.686157;PROBES=50;

We call duphold this way:

#!/bin/bash
duphold --threads 4 \
--vcf test.vcf.gz \
--bam test.bam \
--fasta /path/genomes/Hsapiens/hg38/seq/hg38.fa \
-o test_sorted-call-effects-duphold.vcf.gz

The error message:

fatal.nim(39)            sysFatal
Error: unhandled exception: value out of range [RangeError]

Could you please take a look, why it breaks?

Sergey

Error running Duphold, "expected AD field in snps VCF"

I'm running duphold v 0.2.1

The command I am running is:
~/install_crap/duphold/duphold -v ./results.vcf.gz -b /efs/WGS/data/WGS/ILMN_exptA/b37/KC_downsampledBAM_and_VCF/NA24385/40x/S7508/NA24385.40x.S7508.aligned.deduped.sort.bam -f /efs/WGS/data/reference/human/human_g1k_v37_modified.fasta/human_g1k_v37_modified.fasta -s ./x.vcf.gz -t 96 -o ./duphold.vcf

It spins for a while, then returns the error:
"expected AD field in snps VCF"

the thing is, my VCF files have the AD field present..... I created a small vcf to test this more directly, and here are the full contents of the x.vcf.gz

`##fileformat=VCFv4.3
##reference=human_g1k_v37_modified
##octopus=<version=0.6.3-beta_HEAD_5961a546,command="octopus --reference /home/kcibul/wgs_resources/data/reference/human/human_g1k_v37_modified.fasta
/human_g1k_v37_modified.fasta --reads NA24385.40x.S7508.aligned.deduped.sort.bam -t regions.bed --forest-file /home/kcibul/wgs_resources/data/referen
ce/forests/DC/germline.v0.7.0.forest -o NA24385.40x.S7508.octopus.tmp.vcf.gz --threads 192 -X 10000MB -B 38400 MB --sequence-error-model /home/kcibul
/wgs_resources/data/reference/octopus_err_models/novaseq.4a38e55.model --annotations AD ADP AF ARF BQ CRF DP FRF GC GQ MC MF MQ MQ0 QD QUAL SB STR_LE
NGTH STR_PERIOD --max-indel-errors 32 --duplicate-read-detection-policy AGGRESSIVE --max-haplotypes=400 --min-forest-quality=8",options="--allow-mark
ed-duplicates no --allow-octopus-duplicates no --allow-pileup-candidates-from-likely-misaligned-reads no --allow-qc-fails no --allow-reads-with-good-
decoy-supplementary-alignments no --allow-reads-with-good-unplaced-or-unlocalized-supplementary-alignments no --allow-secondary-alignments no --allow
-supplementary-alignments no --annotations[0] AD --annotations[1] ADP --annotations[2] AF --annotations[3] ARF --annotations[4] BQ --annotations[5] C
RF --annotations[6] DP --annotations[7] FRF --annotations[8] GC --annotations[9] GQ --annotations[10] MC --annotations[11] MF --annotations[12] MQ --
annotations[13] MQ0 --annotations[14] QD --annotations[15] QUAL --annotations[16] SB --annotations[17] STR_LENGTH --annotations[18] STR_PERIOD --asse
mble-all no --assembler-mask-base-quality 10 --backtrack-level NONE --bad-region-tolerance NORMAL --bamout-type MINI --caller population --consider-u
nmapped-reads no --contig-output-order REFERENCE_INDEX --contig-ploidies[0] Y=1 --contig-ploidies[1] chrY=1 --contig-ploidies[2] MT=1 --contig-ploidi
es[3] chrM=1 --denovo-filter-expression QUAL < 50 | PP < 40 | GQ < 20 | MQ < 30 | AF < 0.1 | SB > 0.95 | BQ < 20 | DP < 10 | DC > 1 | MF > 0.2 | FRF \

0.5 | MP < 30 | MQ0 > 2 --denovo-indel-mutation-rate 1e-09 --denovo-snv-mutation-rate 1.3e-08 --denovos-only no --disable-adapter-masking no --disa
ble-assembly-candidate-generator no --disable-call-filtering no --disable-denovo-variant-discovery no --disable-downsampling no --disable-inactive-fl
ank-scoring no --disable-overlap-masking no --disable-pileup-candidate-generator no --disable-read-preprocessing no --disable-repeat-candidate-genera
tor no --dont-model-mapping-quality no --dont-protect-reference-haplotype no --downsample-above 1000 --downsample-target 500 --dropout-concentration
2 --duplicate-read-detection-policy AGGRESSIVE --extension-level NORMAL --fallback-kmer-gap 10 --fast no --filter-expression QUAL < 10 | MQ < 10 | MP
< 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 --forest-file germline.v0.7.0.forest --good-base-quality 20 --haplotype-holdout-threshold 2500 --hap
lotype-overflow 200000 --ignore-unmapped-contigs no --indel-heterozygosity 0.0001 --keep-unfiltered-calls no --kmer-sizes[0] 10 --kmer-sizes[1] 15 --
kmer-sizes[2] 20 --lagging-level NORMAL --mask-3prime-shifted-soft-clipped-heads no --mask-inverted-soft-clipping no --mask-soft-clipped-bases no --m
ask-soft-clipped-boundary-bases 2 --max-assemble-region-overlap 200 --max-bubbles 30 --max-clones 3 --max-genotypes 5000 --max-haplotypes 400 --max-h
oldout-depth 20 --max-indel-errors 32 --max-joint-genotypes 1000000 --max-open-read-files 250 --max-phylogeny-size 3 --max-read-length 10000 --max-re
ference-cache-footprint 10GB --max-region-to-assemble 600 --max-somatic-haplotypes 2 --max-variant-size 2000 --max-vb-seeds 12 --min-bubble-score 2 -
-min-candidate-credible-vaf-probability 0.75 --min-clone-frequency 0.01 --min-credible-somatic-frequency 0.01 --min-denovo-posterior 3 --min-expected
-somatic-frequency 0.03 --min-forest-quality 8 --min-good-bases 20 --min-kmer-prune 2 --min-mapping-quality 20 --min-phase-score 10 --min-pileup-base
-quality 20 --min-protected-haplotype-posterior 100 --min-refcall-posterior 2 --min-somatic-posterior 0.5 --min-variant-posterior 1 --model-posterior
UnknownType(N7octopus7options20ModelPosteriorPolicyE) --no-adapter-contaminated-reads no --no-reads-with-decoy-supplementary-alignments no --no-read
s-with-distant-segments no --no-reads-with-unmapped-segments no --no-reads-with-unplaced-or-unlocalized-supplementary-alignments no --normal-contamin
ation-risk LOW --num-fallback-kmers 10 --one-based-indexing no --organism-ploidy 2 --output NA24385.40x.S7508.octopus.tmp.vcf.gz --read-linkage PAIRE
D --reads[0] NA24385.40x.S7508.aligned.deduped.sort.bam --refcall-block-merge-quality 10 --refcall-filter-expression QUAL < 2 | GQ < 20 | MQ < 10 | D
P < 10 | MF > 0.2 --reference human_g1k_v37_modified.fasta --regions-file regions.bed --resolve-symlinks no --sequence-error-model /home/kcibul/wgs_r
esources/data/reference/octopus_err_models/novaseq.4a38e55.model --sites-only no --snp-heterozygosity 0.001 --snp-heterozygosity-stdev 0.01 --somatic
-credible-mass 0.9 --somatic-filter-expression QUAL < 2 | GQ < 20 | MQ < 30 | SMQ < 40 | SB > 0.9 | SD > 0.9 | BQ < 20 | DP < 3 | MF > 0.2 | NC > 1 |
FRF > 0.5 --somatic-indel-mutation-rate 1e-06 --somatic-snv-mutation-rate 0.0001 --somatics-only no --split-long-reads no --target-read-buffer-footp
rint 38.4kB --temp-directory-prefix octopus-temp --threads 192 --tumour-germline-concentration 1.5 --use-filtered-source-candidates no --use-independ
ent-genotype-priors no --use-preprocessed-reads-for-filtering no --use-same-read-profile-for-all-samples no --use-uniform-genotype-priors no --varian
t-discovery-mode ILLUMINA --very-fast no">
##INFO=<ID=RFQUAL_ALL,Number=1,Type=Float,Description="Combined quality score for call using product of sample RFQUAL">
##INFO=<ID=STR_PERIOD,Number=1,Type=String,Description="Period of overlapping STR">
##INFO=<ID=STR_LENGTH,Number=1,Type=String,Description="Length of overlapping STR">
##INFO=<ID=QD,Number=1,Type=String,Description="QUAL divided by DP">
##INFO=<ID=MQ0,Number=1,Type=String,Description="Number of reads overlapping the call with mapping quality zero">
##INFO=<ID=MQ,Number=1,Type=String,Description="Mean mapping quality of reads overlapping the call">
##INFO=<ID=GC,Number=1,Type=String,Description="GC bias of the reference in a window centred on the call">
##INFO=<ID=CRF,Number=1,Type=String,Description="Fraction of clipped reads covering the call">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined depth across samples">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position on CHROM">
##INFO=<ID=MP,Number=1,Type=Float,Description="Model posterior">
##INFO=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality (phred-scaled)">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Number of read overlapping the call">
##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="RMS mapping quality">
##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set">
##FORMAT=<ID=PQ,Number=1,Type=Integer,Description="Phasing quality">
##FORMAT=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">
##FORMAT=<ID=ADP,Number=1,Type=String,Description="Number of reads overlapping the position that could be assigned to an allele">
##FORMAT=<ID=AF,Number=1,Type=String,Description="Empirical minor allele frequency of ALT alleles (AD / ADP)">
##FORMAT=<ID=ARF,Number=1,Type=String,Description="Fraction of reads overlapping the call that cannot be assigned to a unique haplotype">
##FORMAT=<ID=BQ,Number=1,Type=String,Description="Median base quality of reads supporting each allele">
##FORMAT=<ID=FRF,Number=1,Type=String,Description="Fraction of reads filtered for calling">
##FORMAT=<ID=MC,Number=1,Type=String,Description="Number of mismatches at variant position in reads supporting variant haplotype">
##FORMAT=<ID=MF,Number=1,Type=String,Description="Fraction of reads with mismatches at variant position">
##FORMAT=<ID=SB,Number=1,Type=String,Description="Strand bias of reads based on haplotype support">
##FORMAT=<ID=RFQUAL,Number=1,Type=Float,Description="Empirical quality score from random forest classifier">
##FORMAT=<ID=FT,Number=1,Type=String,Description="Sample genotype filter indicating if this genotype was 'called'">
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RF8.6,Description="All filters passed">
##FILTER=<ID=RF,Description="Random Forest filtered">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA24385.40x.S7508
1 10583 . G A 201.31 PASS RFQUAL_ALL=2.76;STR_PERIOD=0;STR_LENGTH=0;QD=3.097;MQ0=25;MQ=46.104;GC=0.710;CRF=0.111;AC=1;A
N=2;DP=65;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|0:201:65:51:10583:100:16:76:0.211:0.074:37:0.198:0:0.000
:0.244:2.76:RF
1 10616 . CCGCCGTTGCAAAGGCGCGCCG C 356.94 RF;RF8.6 RFQUAL_ALL=4.51;STR_PERIOD=5;STR_LENGTH=12;QD=15.519;MQ0=25;MQ=30.110
;GC=0.710;CRF=0.222;AC=2;AN=2;DP=23;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|1:110:23:42:10583:100:81:81:1.00
0:0.580:.:0.716:47:0.580:.:4.51:RF
1 12783 . G A 1271 RF;RF8.6 RFQUAL_ALL=7.23;STR_PERIOD=7;STR_LENGTH=15;QD=18.420;MQ0=353;MQ=15.787;GC=0.588;CRF=0
.000;AC=2;AN=2;DP=69;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|1:110:69:24:12783:100:150:150:1.000:0.000:37:0.6
08:0:0.000:.:7.23:RF
`

I ran pyvcf on this VCF file, and the FORMAT AD field appears in it.
I ran duphold on VCF files generated by Dragen and Sentieon, they both return the same error.

Any advice for how to fix this? I am eager to apply duphold to a clinical product I am developing, but this roadblock has me blocked.

Thanks!
John Major

local fold-change

currently there is a global fold-change and a GC fold-change.

currently, if a deletion falls into a relatively high-coverage area it may not appear to have any coverage loss (FC~1) but it could be lower than the surrounding bases. This is what the eye sees when looking at SVs.

this would add a local fold change. it would look left and right of the event by ~500 bases and determine if the inside of the event differed from the outside.

this would help to annotate events that have fall into regions (e.g. megabase) that are largely different than the rest of the genome.

Wrong tag

FYI, you tagged the 0.1.5 release using the 0.1.15 tag. Should be an easy fix at least :)

How the coverage is estimated

Hi,
Is it safe to use coverage filtering for small deletions/duplications?
Are the soft clipped bases added to variant coverage?

Best,
Kasia

Cutoff for duplications

The cutoff for deletions is DHFFC 0.7. What is the recommended DHBFC cutoff for duplications?

No error generated but duphold output missing last few lines.

I am running duphold on about 5000 crams. About 120 or so are finishing but the output vcf is not complete. It ends prematurely near the end of chromosome 22. There are only a few lines left to process of the imput vcf.

The last full vcf line output varies but all near the end.
1 50796757
24 50797069
24 50797489
7 50797585
7 50797605
4 50797606
11 50798225
14 50798764
25 50798783
20 50799200

here is an example of the end of one file...

chr22   50797069        rs201907533     GA      G       .       PASS    AVGCOV=7;MINCOV=7;ALTCOV=0;ZYG=hom;COVRATIO=1;CHI2=0;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na;ANN=G|intron_variant|MODIFIER|RPL23AP82|ENSG00000184319|transcript|ENST00000480246.1|processed_transcript|2/2|n.333-1579delA||||||INFO_REALIGN_3_PRIME;CAF=[0.9876,0.0124];COMMON=1;INT;KGPROD;KGPhase1;KGPilot123;RS=201907533;RSPOS=51235498;SAO=0;SSR=0;VC=DIV;VP=0x05000008000110001c000200;WGT=1;dbSNPBuildID=137;GCF=0.254902   GT:AD:DP:DHFC:DHFFC:DHBFC:DHSP  0/0:.:.:0.868421:1.17857:0.825:0
chr22   5079748

Any suggestions?

duphold use with gVCF files

Hello @brentp,

Thanks for this awesome tool!
It would be great, if duphold could be used to compute (B Allele freq /DHGT field) from gVCF files(where non-variant sites are represented as blocks to save space) directly, rather than be limited to regular VCF files.

Thanks and let us know if you have plans to implement such a feature.

version controlled docker image?

Would it be possible for you to automate the creation of versioned duphold docker images hosted on docker hub? The smoove docker image does contain duphold but the latest image(v0.2.3) only has duphold v0.1.1

If not no worries. It should be easy for me to make one just thought it would simplify updating to newer versions of duphold as they're released.

Thanks!
-Alex

SV Size and duphold

How small and SV is duphold useful for? The sensitivity and specificity example in the README gave an example with SVs >300bp. How does the duphold work as well in the 50-300bp range? Is there a minimum size SV that the stats are effective?

v0.1.0: Error while calculating read depth of variants located beyond the first chromosome

Hi Brett,

After annotating a VCF file using duphold v.0.1.0, I noticed that all variants located on the second chromosome (SL3.0ch01 in my case) and beyond have DHFC, DHFFC, and DHBFC values of 1. The values of the GCF field have changed as well in v.0.1.0, compared to v.0.0.9. Below, you can find a small section of the VCF files obtained after running v.0.0.9 and v.0.1.0 that illustrate the problem.

v_0_0_9.txt
v_0_1_0.txt

The GCF, DHFC, DHFFC, and DHBFC values computed by v0.1.0 for the variants located on ch01 seem to be erroneous. The first variant of ch01 (ch01:3588-3970; shown completely in the image below) clearly has a median read depth of 0 and consists almost completely out of N's (grey bases), A's (green bases) and T's (red bases):

first_region_chr01

Do you have any idea what could have happened here? If needed, I can put together a small test case that reproduces the issue.

How to install duphold

Hello I am new in programming and I wanted to use your tool to search for variants but my problem is that I can't install the tool on my linux environment because I don't understand how to install a binary.
Please is there a command to install the tool
thank you
Assane

Not every line was annotated?

Hi,

I use duphold to calculate SV depth, but I find that not every vcf line was annotated with the tags, how to interpret this situation?

chr17	57986589	7_2	N	[chr17:41256075[N	0	.	SVTYPE=BND;STRANDS=--:8;IMPRECISE;CIPOS=-237,5;CIEND=-207,9;CIPOS95=-48,1;CIEND95=-29,3;MATEID=7_1;EVENT=7;SECONDARY;SU=8;PE=8;SR=0;GCF=0.437831	GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:ASC:RP:AP:AB:DHFC:DHFFC:DHBFC	./.:8:8:0:.:.:.:0:0:0:0:0:0:0:0:0:0:.:0:0:0
chr17	59934382	9_2	N	]chr1:59536384]N	0	.	SVTYPE=BND;STRANDS=-+:4;CIPOS=-9,8;CIEND=-9,8;CIPOS95=0,0;CIEND95=0,0;MATEID=9_1;EVENT=9;SECONDARY;SU=4;PE=0;SR=4	GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:ASC:RP:AP:AB	./.:4:0:4:.:.:.:0:0:0:0:0:0:0:0:0:0:.

Best,
xiucz

svtype = variant.info.get("SVTYPE", "SV")

I am getting the following error and I am not sure how to go around this. I used the following command to annotate the my vcf.

duphold -v macope2_sorted.vcf -b ../macOpe2.sorted.bam -f /data/okendojo/datashare/macOpeProject/macOpe2Assembly.fasta -t 24 -o mc.vcf but I cannot get the SVTYPE column information.

(samplot) [okendojo@cn0798 bcgFile]$ samplot vcf --vcf mc.vcf  -d test -O png -b ../macOpe2.sorted.bam 
Traceback (most recent call last):
  File "/vf/users/okendojo/conda/envs/samplot/bin/samplot", line 10, in <module>
    sys.exit(main())
  File "/vf/users/okendojo/conda/envs/samplot/lib/python3.10/site-packages/samplot/__main__.py", line 31, in main
    args.func(parser, args, extra_args)
  File "/vf/users/okendojo/conda/envs/samplot/lib/python3.10/site-packages/samplot/samplot_vcf.py", line 1133, in vcf
    commands, table_data = generate_commands(
  File "/vf/users/okendojo/conda/envs/samplot/lib/python3.10/site-packages/samplot/samplot_vcf.py", line 949, in generate_commands
    svtype = variant.info.get("SVTYPE", "SV")
  File "pysam/libcbcf.pyx", line 2711, in pysam.libcbcf.VariantRecordInfo.get
ValueError: Invalid header

IndexDefect error in Version 0.2.1 - call for update in bioconda

Hi @brentp ,

Thanks for writing this useful tool.

I installed smoove 0.2.8 through bioconda, then used duphold in it (the version is 0.2.1) to calculate adjacent read depth of CNV regions. Unfortunately, an error throws out: "fatal.nim(49) sysFatal
Error: unhandled exception: index -1 not in 0 .. 159345972 [IndexDefect]".
Then I switched to use the latest version of duphold - v0.2.3 with exactly the same command line, and it worked. I do not know the reason for this error, but is it possible to update duphold or smoove in bioconda so we can use the latest duphold through it?

Many thanks in advance.

Best,
Jiahong

Variant at start of chromosome, illegal storage access

Brent;
Thanks for all the work and fixes with duphold. The 0.0.3 version is running cleanly on most of our samples now and I'm starting to try and build up test cases so we can look at including filters based on the annotations.

I ran into one edge case that's causing duphold to fail with:

SIGSEGV: Illegal storage access. (Attempt to read from nil?)

This appears to happen with reads right at the start of a chromosome and I have a reproducible test case here:

https://s3.amazonaws.com/chapmanb/testcases/duphold_sisegv_storage_access.tar.gz

It contains the variant and BAM files, you'll just need to adjust REF in the shell script to point to a GRCh37.fa reference.

This fails in the same way with both 0.0.3 and the 0.0.4 release from yesterday. Please let me know if I can provide anything else to help with debugging and thanks again.

DHGT Missing for Manta DEL/DUP and Delly DEL

I am trying to add DHGT to Manta, SMOOVE/Lumpy and Delly calls along with the depth based metrics provided by Duphold. I am able to get DHGT to both Smoove/Lumpy and Delly calls. DHSP is added to Smoove, Lumpy and Manta DEL calls.

I am not seeing DHGT added to Manta DUP calls. Is there something obvious I am missing?

Thanks.


MantaDUP:

9	28671504	MantaDUP:TANDEM:9028:0:1:0:0:0	G	<DUP:TANDEM>	999	PASS	END=28762797;SVTYPE=DUP;SVLEN=91293;GCF=0.394593	GT:FT:GQ:PL:PR:SR:DHFC:DHFFC:DHBFC	0/1:PASS:999:999,0,999:37,19:46,20:2.03448:2.45833:2.03448
9	85566393	MantaDUP:TANDEM:9236:0:1:0:0:0	T	<DUP:TANDEM>	999	PASS	END=85631054;SVTYPE=DUP;SVLEN=64661;GCF=0.400807	GT:FT:GQ:PL:PR:SR:DHFC:DHFFC:DHBFC	0/1:PASS:995:999,0,992:39,20:48,28:2.03448:1.90323:2.03448

MantaDEL:

8	111308271	MantaDEL:8866:0:1:0:0:0	T	<DEL>	999	PASS	END=111311921;SVTYPE=DEL;SVLEN=-3650;GCF=0.425637	GT:FT:GQ:PL:PR:SR:DHFC:DHFFC:DHBFC:DHSP	1/1:PASS:94:999,97,0:0,27:0,20:0:0:0:30
9	11733925	MantaDEL:8967:0:1:0:0:0	C	<DEL>	999	PASS	END=11740933;SVTYPE=DEL;SVLEN=-7008;CIPOS=0,2;CIEND=0,2;HOMLEN=2;HOMSEQ=CA;GCF=0.391497	GT:FT:GQ:PL:PR:SR:DHFC:DHFFC:DHBFC:DHSP	1/1:PASS:117:999,120,0:0,27:0,26:0:0:0:32

Command:

duphold     --threads 12     --output GRCm39_sim_manta.vcf.gz     --vcf GRCm39_sim_manta_sorted.vcf     --bam GRCm39_sim.bam     --fasta Mus_musculus.GRCm39.dna.primary_assembly.fa     --snp GRCm39_sim_SNP_INDEL_filtered_unannotated_final.bcf

Tool version:

  version: 0.2.1

  Usage: duphold [options]

Options:
  -v --vcf <path>           path to sorted SV VCF/BCF
  -b --bam <path>           path to indexed BAM/CRAM
  -f --fasta <path>         indexed fasta reference.
  -s --snp <path>           optional path to snp/indel VCF/BCF with which to annotate SVs. BCF is highly recommended as it's much faster to parse.
  -t --threads <int>        number of decompression threads. [default: 4]
  -o --output <string>      output VCF/BCF (default is VCF to stdout) [default: -]
  -d --drop                 drop all samples from a multi-sample --vcf *except* the sample in --bam. useful for parallelization by sample followed by merge.
  -h --help                 show help

duphold and clever: wrong annotation?

Hi brentp,

I used duphold to remove FP from our callset from different callers before merging with survivor.
In one merged deletion (supported by 4 callers) by survivor; one (clever) was not properly visualized by IGV (see image below; first sample above). They all have DHFFC=0, except for clever (0.642857). I removed deletions with DHFFC>0.7, so now I am just wondering if some calls from clever were not properly annotated by duphold because it was not also properly visualized by IGV? Will adding the missing END field solve this issue?

I hope my question was clear, if not please let me know for further clarification.

image

--Sample name function can be useful

Dear Brent,

First i want to say thank for that beatiful tool. We are implementing duphold filtration step in our cnv calling pipeline and it really helps to remove FPs. Unfortunately we came across with one issue when sample name in our snp vcf file is not the same as a SM tag in the BAM file. (Filename used for SAMPLE column in vcf file). Is it possible to add extra input parameter that will specify sample name to look for in snp vcf file?

support for long reads

currently, duphold just looks at read.start and read.stop, but can parse cigar ops if read-length > 300 and avoid the discordant read info.

from @wdecoster

IndexError from Smoove calls

Hi Brent,

I have calls from 4 SV-callers for all my 5 samples then ran DUPHOLD before merging.
In 2/5 samples, DUPHOLD failed on predictions from 1 SV-caller (SMOOVE), the rest were fine.
This was the error message I got:

fatal.nim(39)            sysFatal
Error: unhandled exception: index -1 not in 0 .. 185590 [IndexError]  

Is there any way I can find out why this happening?
I am using this current version: quay.io/biocontainers/duphold:0.2.1--h516909a_1

Thanks in advance!

The "REF prefixes differ" followed by Failed to merge alleles

Any suggestions for this error?

 tail smoove_duphold_ont_manta.sh.o3870689
[smoove] 2022/03/31 11:14:06 [duphold] finished
[smoove] 2022/03/31 11:14:07 [duphold] finished
[smoove] 2022/03/31 11:14:07 [duphold] finished
[smoove] 2022/03/31 11:14:07 [duphold] finished
[smoove] 2022/03/31 11:14:08 [duphold] finished
[smoove] 2022/03/31 11:14:08 starting bcftools merge
[smoove] 2022/03/31 11:35:58 The REF prefixes differ: T vs N (1,1)
Failed to merge alleles at chr22:17756433 in tmp.22/tempclean-539213974/dh132899912smoove-duphold.bcf

Below are the variants at that location. 3 Refs have an N and one has a T which appears to impact the merge step.

chr22   17756433        chr22:17756433:FG       N       <INS:SVSIZE=74:AGGREGATED>
chr22   17756433        chr22:17756433:FG.0     N       <INS:SVSIZE=74:BREAKPOINT1>
chr22   17756433        chr22:17756433:FG.1     N       <INS:SVSIZE=74:BREAKPOINT2>
chr22   17756433        chr22:17756433:OG       T       ]chr2:102886770]T

"expected GQ field in snps VCF" while (I think !) my vcf contains FORMAT/GQ

Hi Brent,

I'm running duphold for the 1st time with snps.vcf computed with bcftools mpileup+call.
My SV vcf was computed with manta.

I'm getting the following error:

+ /ccc/work/cont007/fg0019/lindenbp/packages/duphold/duphold --snp /ccc/scratch/cont007/fg0113/lindenbp/20190428.TROUCARD.MANTA/work/bd/c70293fb517483436faf17dd3b2e61/B00I4BY.snv.bcf --vcf jeter.bcf --bam /ccc/genostore/cont007/fg0073/fg0073/analyse/projet_TROUCARD_605/ANALYSE/ANALYSE_T605_B00I4BY_HJWJJALXX_2_hs37d5/MAPPING_B00I4BY/reliable.realign/T605_DA_B00I4BY_HJWJJALXX_hs37d5_MERGE_PE_2.reliable.realign.bam --fasta /ccc/work/cont007/fg/fg/biobank/by-taxonid/9606/hs37d5/hs37d5_all_chr.fasta --threads 4 -o jeter.duphold.vcf
I, [2019-05-01T14:48:04] -- duphold: starting to read snps for chrom: chr1
expected GQ field in snps VCF

  1. unless I'm wrong, my snv.bcf do contains the FORMAT/GQ field:
$ bcftools view /ccc/scratch/cont007/fg0113/lindenbp/20190428.TROUCARD.MANTA/work/bd/c70293fb517483436faf17dd3b2e61/B00I4BY.snv.bcf | grep -v '##' | cut -f 9 | sort | uniq
FORMAT
GT:PL:DP:AD:GQ
$

what am I doing wrong ?

  1. it seems that for this error, duphold doesn't exit with failure/-1 because the next statement was executed while I'm running with set -e

  2. minor: in the doc, it should be specified that snp.bcf should be indexed.

thank your for your help

Logging information ended up in output file

Hi Brent,

I ran the following command on a bunch of (single sample) vcf files and their bam file:
ls *.vcf | parallel 'duphold --bam {.}.bam --fasta genome.fna.gz --vcf {} > {.}_dh.vcf'

I'm now post-processing these duphold-annotated vcfs for filtering using bcftools and tabix, and get a bunch of warnings and error which I'm looking into.

The first one is generated by bcftools sort:

[E::vcf_parse_format] Incorrect number of FORMAT fields at chr15:101764387

grepping for that line with -C 5 gives me:

chr15	101759833	23348	N	<DEL>	.	PASS	IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101760124;STD_quant_start=45;STD_quant_stop=78;Kurtosis_quant_start=2;Kurtosis_quant_stop=1;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-291;STRANDS=+-;RE=14;REF_strand=0,0;AF=1;GCF=0.592466	GT:DR:DV:DHFC:DHFFC:DHBFC	1/1:0:14:1.37143:0.979592:1.2973
chr15	101760689	23349	N	<DEL>	.	PASS	IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101760777;STD_quant_start=39;STD_quant_stop=36;Kurtosis_quant_start=-1;Kurtosis_quant_stop=-1;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-88;STRANDS=+-;RE=19;REF_strand=1,0;AF=0.95;GCF=0.550562	GT:DR:DV:DHFC:DHFFC:DHBFC	1/1:1:19:1.17143:0.836735:1.10811
chr15	101762149	23350	N	<DEL>	.	PASS	IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101762529;STD_quant_start=62;STD_quant_stop=173;Kurtosis_quant_start=0;Kurtosis_quant_stop=0;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-380;STRANDS=+-;RE=16;REF_strand=0,3;AF=0.842105;GCF=0.585302	GT:DR:DV:DHFC:DHFFC:DHBFC	1/1:3:16:1.14286:0.851064:1.08108
chr15	101763139	23351	N	<DEL>	.	PASS	IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101763229;STD_quant_start=19;STD_quant_stop=19;Kurtosis_quant_start=6;Kurtosis_quant_stop=5;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-90;STRANDS=+-;RE=12;REF_strand=1,1;AF=0.857143;GCF=0.571429	GT:DR:DV:DHFC:DHFFC:DHBFC	1/1:2:12:0.942857:0.717391:0.891892
chr15	101764281	23352	N	<DEL>	.	PASS	PRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101764346;STD_quant_start=7;STD_quant_stop=6;Kurtosis_quant_start=0;Kurtosis_quant_stop=0;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-65;STRANDS=+-;RE=26;REF_strand=2,0;AF=0.928571;GCF=0.606061	GT:DR:DV:DHFC:DHFFC:DHBFC	1/1:2:26:0.857143:0.769231:0.810811
chr15	101764387	23353	N	<DEL>	.	PASS	PRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=chr15;END=101764484;STD_quant_start=3;STD_quant_stop=2;Kurtosis_quant_start=-1;Kurtosis_quant_stop=-1;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-97;STRANDS=+-;RE=22;REF_strand=0,0;AF=1;GCF=0.622449	GT:DR:DV:DHFC:DHFFC:DHBFC	1/1:0:22:0.828571:0.763158:0.783784I, [2018-11-25T16:58:37] -- duphold: no mates found. assuming single end reads
I, [2018-11-25T16:58:38] -- duphold: starting read of bam values for chrom: chr1
I, [2018-11-25T17:02:22] -- duphold: finished reading:chr1. starting gc-content, discordant sorting
I, [2018-11-25T17:02:29] -- duphold: finished gc-content, sorting
I, [2018-11-25T17:02:30] -- duphold: starting read of bam values for chrom: chr2
I, [2018-11-25T17:07:40] -- duphold: finished reading:chr2. starting gc-content, discordant sorting
<etc>

I suck at reading instructions, and it seems be that my input SV vcf file is not sorted. I'll repeat things there and see if the error reproduces, but I'm quite surprised to see these logging messages in the output file.

Cheers,
Wouter

missing insertion

Hi @brentp
I have a question about missing insertion variants.
I used sample HG002 for variant calling with smoove using the command below. But it seems strange to me that I can't find the variants of Insertions? Do you have any idea why? does it have a notation instead of INS? Thank you in advance for your answer.
Note: I also used a WGS of my samples and I still haven't found the Insertions.

$ singularity run smoove.sif smoove call \
-x \
--name my_project \
--exclude ${BED} \
--fasta ${hg38} \
-p 1 \
--genotype \
 ${input} 

How does duphold treat N tracts

I'm curious - how does duphold treat N tracts occurring between scaffolds in less-developed assemblies? These regions would have zero coverage, but should likely be ignored when performing fold change calculations.

Cheers,

David

duphold for MantaDUP(SVTYPE=DUP) are missing, INS are present

Hello @brentp,

Thank you for the awesome tool! We are trying to use this for SV QC for different callers on several of our samples.
Bit confused why duphold fields (DHFFC,DHBFC,DHGT,DHSP,DHFC) were missing for all our MantaDUP (seemingly high quals) calls while being present for most of MantaINS calls ?

screen shots are provided below

Thank you

MantaDUP

image

MantaINS
image

Ignoring bases with 0 coverage while calculating read depth

Hi Brent,

Thanks for developing duphold, I'm trying to use it to filter false positive deletions called in tomato samples that are evolutionary distant from the reference genome.

After filtering deletions using DHBFC < 0.7 and DHFFC < 0.7, I could still find some remaining ones containing large regions of zero coverage, but which are clearly false positives when observed by eye:

deletion_filtered_4

I presume that these regions lower the median depth of these regions, causing them to survive the filtering step.

Is there an option to ignore bases with zero coverage while calculating DHBFC and DHFFC? I noticed that the median procedure in duphold.nim has a skip_zeros parameter, but I'm not sure how I can set it to true and if this yields the result I intend to achieve.

Thanks in advance for your time.

setting DUPHOLD_SAMPLE_NAME doesn't work

Hello @brentp,

Thank you for this tool!

I had e.g. errors like this
[duphold] couldn't find sample:sqcudn971184 in snps vcf which had:sqcudn971184.187147

due to bam file and snps VCF file differing in sample names(in bams the names are truncated). Previously, have solved this issue with renaming the header of the snp VCF(removing the .187147). However this time tried to set up like this
export DUPHOLD_SAMPLE_NAME='sqcudn971184' or just this DUPHOLD_SAMPLE_NAME='sqcudn971184' which doesn't work.

I would appreciate any leads on this.

DHSP for duplications?

Hi,
DHSP (count of discordant reads that support the event) is very useful when we are evaluating a deletion.
Why we do not have DHSP for duplications also? Could you implement it?
Thanks for you time developing this tool.

Duphold with Nanopore data?

Hi,

I used duphold on some VCFs containing SVs called with Sniffles where the alignment was done using minimap2. The data used is all nanopore longread data. I was wondering if duphold is supposed to work properly on long read data? As quite a bit of the results seemed off/incorrect with very high numbers. Here's a sample of some of the results (parsed to obtain the info relevant for me):

SVtype | #CHROM | POS | End | SVLength | Precision | RE | Genotype | Consequence | GeneSymbol | EnsemblID | DHFC | DHFFC | DHBFC

DEL | Chr3 | 21648500 | 21946887 | -298387 | IMPRECISE | 10 | 01/Jan | coding_sequence_variant&5_prime_UTR_variant&intron_variant&feature_truncation | ARHGAP17 | ENSSSCG00000031488 | 181.481 | 101.031 | 178.182
DEL | Chr5 | 90902052 | 90907562 | -5510 | PRECISE | 39 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | CDK17 | ENSSSCG00000028182 | 0.963636 | 0.386861 | 0.946429
DEL | Chr2 | 122165466 | 122173123 | -7657 | PRECISE | 31 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | COMMD10 | ENSSSCG00000014223 | 161.818 | 143.548 | 161.818
INS | Chr6 | 13229037 | 13229037 | 46 | PRECISE | 21 | 0/1 | coding_sequence_variant&feature_elongation | GLG1 | ENSSSCG00000030420 | 0.814815 | 1 | 0.814815
DEL | Chr10 | 66054310 | 66056761 | -2451 | PRECISE | 62 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | ITIH5 | ENSSSCG00000011129 | 0.909091 | 0.438596 | 0.925926
DEL | Chr2 | 28721000 | 28772243 | -51243 | PRECISE | 69 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | KIAA1549L | ENSSSCG00000013311 | 143.636 | 0.153696 | 143.636
INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 130 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815
INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 143 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815
DEL | ChrX | 90375498 | 90376075 | -577 | PRECISE | 232 | 01/Jan | coding_sequence_variant&intron_variant&feature_truncation | MID2 | ENSSSCG00000012564 | 117.857 | 0.103774 | 117.857
INS | Chr13 | 135062890 | 135063265 | 1039 | IMPRECISE | 20 | 0/1 | coding_sequence_variant&intron_variant&feature_elongation | MUC13 | ENSSSCG00000011862 | 0.75 | 0.976744 | 0.792453

Thanks in advance,

Kind regards,
Anne

SNP annotation filters

Annotating with --snp FILE might add a step in getting a potentially large BCF w/genotypes ready for duphold. Does it make sense to add another option for filtering the SNP BCF that will be used for DHET and DHHU annotations?

Adding a full bcftools-style --include option w/ argument may be overkill. A step down could be to include --filter FLT0[;FLT1...] or just --pass.

I believe a duphold removes indels from a SNP BCF, drops genotypes from other samples to speed up parsing, and drops SNPs where QUAL < 20. If adding an option doesn't make sense, adding v.FILTER == "PASS" might be a good default for most uses.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.