Git Product home page Git Product logo

sksv's Introduction

SKSV

ultrafast structural variation detection from circular consensus sequencing reads

Getting Start

git clone https://github.com/ydLiu-HIT/SKSV.git
cd SKSV/skeleton/
make   ## built ./SKSV-skeleton
cd ..

./SKSV index ref.fa index_route   # build index 
./SKSV aln index_route read.fq    # skeleton-alignment
./SKSV call in.svseg ref.fa out.vcf work_dir # call variants using in.svseg file
    
## add to enviroment path
vim ~/.bashrc
export PATH="path:$PATH"  # path is the folder where you download SKSV

Introduction

Circular consensus sequencing (CCS) reads are promising for the comprehensive detection of structural variants (SVs). However, alignment-based SV calling pipelines are computationally intensive due to the generation of complete read-alignments and its post-processing. Herein, we propose a SKeleton-based analysis toolkit for Structural Variation detection (SKSV). Benchmarks on real and simulated datasets demonstrate that SKSV has an order of magnitude of faster speed than state-of-the-art SV calling ap-proaches, moreover, it enables to well-handle various types of SVs with higher F1 scores.

Dependence

python3
Biopython
numpy
samtools  # samtools was used to extract read sequence in order to generate allele sequence.

Synopsis

Reference genome indexing

SKSV index ref.fa <index_route>

Skeleton alignment

SKSV aln <index_route> read.fa/fq -o skeleton.svseg

Variant calling

SKSV call skeleton.svseg ref.fa output.vcf workdir

Commands and options

Genome index

Usage:     SKSV-skeleton index [Options] <Reference> <HashIndexDir>
<Reference>            Sequence of reference genome, in FASTA format
<HashIndexDir>         The directory to store deBGA index

Skeleton alignment

Usage:     SKSV-skeleton aln [options] <index_route> <read.fa/fq>

    <index_route>                 The path of RdBG index.
    <read.fq/fa>                  The input reads in fasta or fastq format.

Algorithm options:

    -t --thread           [INT]    Number of threads. [4]
    -k --index_kmer       [INT]    RdBG-index kmer length. [22]
    -s --seeding_kmer     [INT]    K-mer length of seeding process (no long than RdBG-index kmer). [22]
    -d --hash_kmer        [INT]    K-mer length of local hash process. [7]
    -B --batch_size       [INT]    The number of reads to be processed in one loop. [100000]
    -l --sv_lim           [INT]    The minimum length to consider a structure variant. [30]
    -m --error_model      [INT]    Data type for different error model. [1]
    -w --block            [INT]    The window length when extend node in the skeleton. [30]
    -x --edit_dis         [INT]    The maximum edit distance for Landua-Vishkin. [0]
    -e --seed_step        [INT]    The interval of seeding. [20]
    -M --extend_mx        [INT]    The maximum length for an anchor. [5000]
    -N --top_N            [INT]    Max allowed skeleton paths. [4]
    -g --rd_mx_gap        [INT]    Maximum allowed distance in read that two anchor can be connected. [1000]
    -G --rf_mx_gap        [INT]    Maximum allowed distance in reference that two anchor can be connected. [1000]
    -Y --min_chain_score  [INT]    The minimum skeleton chaining score to be processed. [100]
    -p --secondary_ratio  [FLAOT]  Minimum secondary-to-primary alignment score ratio. [0.8]
    -P --overlap_ratio    [FLAOT]  Minimum overlap ratio to consider tow skeletons one as primary one as secondary. [0.4]

Output options:

    -o --output    	  [STR]    Output file for SV signatures in svseg format. [sk.svseg]
    -h --help                      Show detailed usage.

Variants calling

positional arguments:
  input                 Skeletons from SKSV-skeleton in svseg fromat.
  ref                   The reference genome in fasta format.
  output                Output VCF format file.
  work_dir              Work-directory for distributed jobs

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit
  -t THREADS, --threads THREADS
                        Number of threads to use.[16]
  -b BATCHES, --batches BATCHES
                        Batch of genome segmentation interval.[10000000]
  -S SAMPLE, --sample SAMPLE
                        Sample name/id
  --retain_work_dir     Enable to retain temporary folder and files.

Collection of SV signatures:
  -p MAX_SPLIT_PARTS, --max_split_parts MAX_SPLIT_PARTS
                        Maximum number of split segments a read may be aligned before it is ignored.[7]
  --merge_del_threshold MERGE_DEL_THRESHOLD
                        Maximum distance of deletion signals to be merged.[500]
  --merge_ins_threshold MERGE_INS_THRESHOLD
                        Maximum distance of insertion signals to be merged.[500]

Generation of SV clusters:
  -s MIN_SUPPORT, --min_support MIN_SUPPORT
                        Minimum number of reads that support a SV to be reported.[10]
  -l MIN_SIZE, --min_size MIN_SIZE
                        Minimum size of SV to be reported.[30]
  -L MAX_SIZE, --max_size MAX_SIZE
                        Maximum size of SV to be reported.[100000]
  -sl MIN_SIGLENGTH, --min_siglength MIN_SIGLENGTH
                        Minimum length of SV signal to be extracted.[10]

Computing genotypes:
  --genotype            Enable to generate genotypes.
  --gt_round GT_ROUND   Maximum round of iteration for alignments searching if perform genotyping.[500]
  
Generate allele sequence:
  --read READ           Raw reads in bgzip fasta/fastq format with index by samtools for extract inserted sequence for insertions.
  --print_allele_seq    Enable to print inserted and deleted sequence for an insertion and deletion variants. If --print_allele_seq was set, --read is needed.

Advanced:
  --max_cluster_bias_INS MAX_CLUSTER_BIAS_INS
                        Maximum distance to cluster read together for insertion.[800]
  --diff_ratio_merging_INS DIFF_RATIO_MERGING_INS
                        Do not merge breakpoints with basepair identity more than [0.9] for insertion.
  --max_cluster_bias_DEL MAX_CLUSTER_BIAS_DEL
                        Maximum distance to cluster read together for deletion.[1000]
  --diff_ratio_merging_DEL DIFF_RATIO_MERGING_DEL
                        Do not merge breakpoints with basepair identity more than [0.5] for deletion.
  --max_cluster_bias_INV MAX_CLUSTER_BIAS_INV
                        Maximum distance to cluster read together for inversion.[500]
  --max_cluster_bias_DUP MAX_CLUSTER_BIAS_DUP
                        Maximum distance to cluster read together for duplication.[500]
  --max_cluster_bias_TRA MAX_CLUSTER_BIAS_TRA
                        Maximum distance to cluster read together for translocation.[50]
  --diff_ratio_filtering_TRA DIFF_RATIO_FILTERING_TRA
                        Filter breakpoints with basepair identity less than [0.6] for translocation.

Datasets

We implemented the benchmarks on a server with AMD Ryzen 3950X CPU and 120GB RAM, running Linux Ubuntu 16.04.

The real HG002 PacBio CCS reads were downloaded from: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_SequelII_CCS_11kb/reads/.

The GIAB ground truth set and corresponding high confidence region set were downloaded from: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NIST_SV_v0.6/.

The new draft difficult medically relevant gene Structural Variant benchmark from GIAB (Zook, et al., 2016) that includes some SVs in segmental duplications and corresponding benchmark bed files were download from https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_SV_benchmark_v0.01/

Truvari (v2.0.0, https://github.com/spiralgenetics/truvari) was used to assess the precision, recall, and F1 score (Zook, et al., 2020)..

Reference

Zook, J.M., et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data 2016;3.

Zook, J.M., et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol 2020.

Contact

For advising, bug reporting and requiring help, please post on Github Issues or contact [email protected].

sksv's People

Contributors

ydliu-hit avatar

Stargazers

laichunxiao avatar Weriss avatar Zhigui Bao avatar Michael Alonge avatar Junhao avatar ZekunYin avatar Dengfeng Guan avatar Dan Browne avatar Billy Rowell avatar slp avatar  avatar  avatar JiangTao avatar

Watchers

James Cloos avatar JiangTao avatar  avatar

Forkers

quanrd

sksv's Issues

Is it possible to run SKSV on bam files?

Hi,

I'm trying to test your tool on several datasets, and it runs smoothly on the fastq inputs.My only concern is that, is there anyway to run SKSV on some already aligned results? Like bam files from NGMLR or minimap2.

Thank you.

您好,我最近正尝试在一些Longread数据上测试您的SKSV软件。依照您Readme中的说明,我成功在fastq文件上运行了SKSV,不过对于一些数据,我可能希望能跳过SKSV的align步骤,而使用我已有的NGMLR和minimap2 align出的bam文件作为sv call的输入。SKSV有可能在bam文件上完成SV call的操作吗?您的文章中提到SVSK在call步骤中采用了cuteSV的部分**,据我了解cuteSV可以接受bam文件输入,但是似乎SKSV的call步骤输入是svseg文件。svseg和bam可以互相转换吗?

非常感谢

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.