Git Product home page Git Product logo

dsbs's Introduction

DSBS analyzer

DSBS analyzer is a pipeline to analyzing Double Strand Bisulfite Sequencing data, which could simultaneously identify SNVs and evaluate DNA methylation levels in a single base resolution. In DSBS, bisulfite-converted Watson strand and reverse complement of bisulfite-converted Crick strand derived from the same double-strand DNA fragment were sequenced in read 1 and read 2, and aligned to the same position on reference genome. By simultaneous analyzing the sequence of read 1 and read 2, the sequence and DNA methylation state of DNA fragment could be deduced.

Schematic


DSBS analyzer Usage

QuickStart

python3 DSBS_analyzer.py --config DSBS.config -a /data_path/*_1.fq.gz -b /data_path/*_2.fq.gz -o /outdir &

python3 DSBS_analyzer.py

                                                
             m                                      m                           
             |                                      |                           
        AGCTACGT  bisulfite treat   -->  A  G {T} TACGT        AGTTACGT         
        ||||||||-------------------->|   |  |  |  |||||                         
        TCGATGCA        (C>T)       -->  T {T} G  ATGCA        AACTAGCT         
              |                                      |                        
              m                                      m                        
        C>T  1) bisulfite conversion ?                                          
             2) mutation ?                                                      
             3) sequencing errors ?                                             
    
usage: DSBS_analyzer.py [-h] [--config CONFIG] -a [FQ1 [FQ1 ...]] -b
                                  [FQ2 [FQ2 ...]] [-r REF] 
                                  [-k [KNOWNSITES [KNOWNSITES ...]]] [-d DEPTH] 
				  [-q CLEANQUALITY] [--gapsize GAPSIZE] [--fastquniq]  
				  [--remove_tmp] [--bissnp] [-t CPU] [--pp PP] -o
                                  OUTDIR [--qsub]

  -h, --help            show this help message and exit
  --config CONFIG       config file
  -a [FQ1 [FQ1 ...]], --fq1 [FQ1 [FQ1 ...]]
                        Fastq1(s), Space separation
  -b [FQ2 [FQ2 ...]], --fq2 [FQ2 [FQ2 ...]]
                        Fastq2(s), Space separation
  -r REF, --ref REF     refrence
  -k [KNOWNSITES [KNOWNSITES ...]], --knownsites [KNOWNSITES [KNOWNSITES ...]]
                        knownsites
  -d DEPTH, --depth DEPTH
                        the minimum depth , default is 4
  -q CLEANQUALITY, --cleanquality CLEANQUALITY
                        quality when clening the fastq
  --gapsize GAPSIZE     the size of insert or del, default is 3
  --fastquniq           rmdup fastq before cleanning fastq
  --remove_tmp          rm temp file
  --bissnp              bisSNP
  -t CPU, --cpu CPU     thread, default is 20
  --pp PP               1,execute ,0,skip. 1)qualimap, 2)clean, 3)align,
                        4)dealBam, 5)call mutation and methylation , default 11111
  -o OUTDIR, --outdir OUTDIR
                        outdir
  --qsub                use qsub

config

cat DSBS.config

Here is the example of config file which configured the path of softwares and databases and software operating environment. Y For those necessary softwares and databases, you are on your own. It is important to note that each software needs to have execute permission(chmod a+x soft).

#soft

java:/public/soft/jdk1.8.0_111/bin/java:soft
java1.6:/public/soft/jdk1.7.0_45/bin/java:soft
python3:/public/soft/python3:soft

fastqc:/public/soft/fastqc:soft
qualimap:/public/soft/qualimap_v2.2.1/qualimap:soft

#clean,rmdup
fastuniq:/public/soft/FastUniq/fastuniq:soft
cutadapt:/public/soft/cutadapt-1.8.1/bin/cutadapt:soft
filter_fq:/public/soft/filter_fq.py:soft
stdmap_awk:/public/soft/stdmap.awk:soft
diffmap_awk:/public/soft/diffmap.awk:soft

bsmap:/public/soft/bsmap2.7/bsmap:soft
bissnp:/public/soft/BisSNP/BisSNP-0.82.2.jar:soft

#qsub
job_sub_py_2:/public/soft/job_sub_py_2.py:soft

#index, sort, merge, split 
samtools:/public/soft/samtools-1.8/samtools:soft
picard:/public/soft/picard-2.17.3/picard.jar:soft


#call SNP and  methy
DSBS:/public/soft/DSBS.py:soft
#resource
refDir:/public/database/hg19/:resource
ref:/public/database/hg19/hg19.fa:resource
1000G_omni:/public/database/1000G_omni2.5.hg19.sites.vcf:resource
1000G_phase1:/public/database/1000G_phase1.snps.high_confidence.hg19.sites.vcf:resource
dbsnp:/public/database/dbsnp_138.hg19.vcf:resource
dbsnp_gz:/public/database/dbsnp_138.hg19.vcf.gz:resource
hapmap:/public//database/hapmap_3.3.hg19.sites.vcf:resource

Dependencies

DSBS depends on

DSBS Usage

QuickStart

python DSBS.py example.bam -g chr1.fa -d dbsnp138.vcf.gz --chr chr1 -o outdir -q

Index

One time only, it is necessary to index a reference sequence and the dbsnp file。 The commands:

bgzip dbsnp_138.hg19.vcf
tabix -s 1 -b 2 dbsnp_138.hg19.vcf.gz (tabix -s 1 -b 2 -p vcf dbsnp_138.hg19.vcf.gz )

python DSBS.py

usage: DSBS.py [-h] [-v] [--maxDistance MAXDISTANCE] [--maxLen MAXLEN]
               [--mixLen MIXLEN] [--mixReadQual MIXREADQUAL]
               [--mixReadQualN MIXREADQUALN] [--maxN MAXN]
               [--maxSeqErr MAXSEQERR] [--maxSnp MAXSNP] [--maxIndel MAXINDEL]
               [--maxBp MAXBP] [--maxMut MAXMUT] [--minQual MINQUAL]
               [--minVaf MINVAF] [--secAlign] [--debug] [-q] [-c COVERAGE]
               [-p CPU] [-o OUTDIR] -g GENOMEFILE -d DBSNP [--Chr CHR]
               bamFile
	       
positional arguments:
bam                           input BAM file
optional arguments:
-h, --help                    show this help message and exit
-v, --version                 show program's version number and exit
--maxDistance MAXDISTANCE     the maxmum distance of the start postions of the
                              paired reads in reference genome, default is 50
--maxLen MAXLEN               the maxmum length of a read, default is 200
--minLen MINLEN               the minimum length of a read, default is 50
--minReadQual MINREADQUAL     the minimum quality for assessing the read 
                              overall qualities, default is 20
--minReadQualN MINREADQUALN   the minimum ratio which quality is greater than
                              the minReadQual, default is 0.6
--maxN MAXN                   the maxmum number of N within a read, default is 5
--maxSeqErr MAXSEQERR         the maxmum number of sequencing errors within a paired
                              reads, default is 10
--maxSnp MAXSNP               the maxmum number of snp within a paired reads,
                              default is 5
--maxIndel MAXINDEL           the maxmum number of indel within a paired reads,
                              default is 1
--maxBp MAXBP                 a certain range for allowing the currence of certain
                              mutation ,default is 50
--maxMut MAXMUT               the maxmum number of allowing the currence of
                              mutation within a certain range ,default is 3
--minQual MINQUAL             the minimum quality for call Variation, default is 20
--minVaf MINVAF               the minimum value of allele, default is 0.5
--window WINDOW               ignoring the snps in a certain range of a indel, the
                              range is 5bp
--minCoverage MINCOVERAGE     coverage or minimum number of reads desired,
                              default is 8
--maxCoverage MAXCOVERAGE     coverage or maxmum number of reads desired,
                              default is 500
--strand                      only consider the correct comparison direction,   
                              read1=++ && read2=-- || read1=-+ && read2=+-
--secAlign                    consider non-optimal alignments
--CpG                         output CpG
--CHG                         output CHG
--CHH                         output CHH
--debug                       dubug mode
-q, --quite                   quiet mode
-p CPU, --cpu CPU             the number of working threads, default is 50
-o OUTDIR, --outDir OUTDIR    the outDir
-g GENOMEFILE, --genomeFile GENOMEFILE
                            the chromosome reference fasta
-d DBSNP, --dbsnp DBSNP       the dbsnp file
--Chr CHR                    chromosome

Single jobs

python3 DSBS.py chr1.merge.sorted.rmdup.realigned.recal.bam --maxBp 50 --minVaf 0.1 -q -p 40 -o outdir -g chr1.fa --Chr chr1 -d dbsnp_138.hg19.vcf.gz --secAlign

Batch jobs

use job_sub_py_2.py which based on PBS to deliver the jobs

for i in {1..22} X Y M; do python3 job_sub_py_2.py --jobname chr$i --cpu 2 --work "python3 /public/soft/DSBS.py /datapath/chr$i.merge.sorted.rmdup.realigned.recal.bam --maxBp 50 --minVaf 0.1 -q --cpu 40 -o /outdir -g ~/public/database/hg19/chr$i.fa --Chr chr$i -d /public/database/dbsnp_138.hg19.vcf.gz " done

Contributions and suggestions for new features are welcome, as are bug reports! Please create a new issue for any of these, including example reports where possible.

dsbs's People

Contributors

tianguolangzi avatar

Stargazers

youbai avatar Joey avatar Yance Feng avatar Jiansen Lu avatar  avatar

Watchers

 avatar

Forkers

vpbrendel

dsbs's Issues

Run fails to generate any results

hi
I am using the tool for hg38 reference genome aligned BAM file but i get this error message and no output

DSBS chr1.bam --Chr chr1 -p 1 --maxBp 50 --minVaf 0.1 -d $DBSNP_VCF -g chr1.fa -o $PWD

**INFOR: chr1 665 was finished! Used time are 0S!
INOFR: chr1 666 66600000 66700000 is processesing
INFOR: thread 666, snpcons 0.
INFOR: thread 666, indelcons 0.
INFOR: thread 666, methycons 0.
INFOR: thread 666, CpGCon 0.
INFOR: thread 666, CHGCon 0.
INFOR: thread 666, CHHCon 0.
INFOR: thread 666, CpG_appear 0.
INFOR: thread 666, CpG_disappear 0.
INFOR: thread 666, HemimethyCon 0.
INFOR: thread 666, seqERROR 0.
INFOR: thread 666, the seqERROR number in Hemimethy 0.
INFOR: thread 666, the seqERROR number in variation 0.
INFOR: thread 666, the seqERROR number in methyaltion 0.
INFOR: chr1 666 was finished! Used time are 0S!
**

Failed with empty .txt file

Hello, I am trying to call snp based on a bam file output from bismark pipeline. Here is the cmd and python version 3.7.7:
/home/wanghong/tools/ENTER/bin/python /home/jianmei_zhong/softwares/DSBS/DSBS.py -g /Scratch/Database/genomic/hg19/default/hg19.fa -d /home/jianmei_zhong/softwares/refdb/dbsnp_138.hg19.vcf.gz -o ./ --ref hg19 --secAlign --CpG --CHG --CHH S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean_bismark_bt2_pe.sorted.bam.sortByName.bam
in which the bam file was generated via bismark /Scratch/Database/bismark_db/hg19/default/ --fastq --seedmms 1 --multicore 2 -p 4 --quiet -o /Scratch/analysis/Ironman/171106_E00490_0258_HFG32CCXY/Ironman-LC002/04.clean_alignment --temp_dir /Scratch/analysis/tmp/bismark_temp.S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean.1511716648/ -1 /Scratch/analysis/Ironman/171106_E00490_0258_HFG32CCXY/Ironman-LC002/02.clean_fastq/S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean.fastq.gz -2 /Scratch/analysis/Ironman/171106_E00490_0258_HFG32CCXY/Ironman-LC002/02.clean_fastq/S01-DPM1-LC002-131-FE1-IRM_S1_R2_001.clean.fastq.gz. But i got 9 empty files and the log showed messages like "Error: chhList, chr1 2492 249200000-249250621 is wrong!":

S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean_bismark_bt2_pe.sorted.bam.sortByName.chhFile.txt        S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean_bismark_bt2_pe.sorted.bam.sortByName.Indel.txt
S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean_bismark_bt2_pe.sorted.bam.sortByName.CpG_appear.txt     S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean_bismark_bt2_pe.sorted.bam.sortByName.Methylation.txt
S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean_bismark_bt2_pe.sorted.bam.sortByName.CpG_disappear.txt  S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean_bismark_bt2_pe.sorted.bam.sortByName.Snp.txt
S01-DPM1-LC002-131-FE1-IRM_S1_R1_001.clean_bismark_bt2_pe.sorted.bam.sortByName.cpgFile.txt```

Did I miss some settings ?

Best,

Jianmei Zhong

call somatic mutations from methylation sequencing data

Hello, I am very interested in how to detect somatic variation in methylation sequencing data, I read your article carefully, it says:

To deal with the challenge of calling somatic mutations from methylation sequencing data, we will update the computational pipeline in identifying somatic mutations from the double strand bisulfite sequencing data in the future.

So I'm wondering if somatic variant detection is now possible with DSBS, or do you know of any other tool or method that can accomplish this task?

Thanks!

test

it is just a test.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.