Git Product home page Git Product logo

fermikit's Introduction

Build Status

Introduction

FermiKit is a de novo assembly based variant calling pipeline for deep Illumina resequencing data. It assembles reads into unitigs, maps them to the reference genome and then calls variants from the alignment to an accuracy comparable to conventional mapping based pipelines (see evaluation in the tex directory). The assembly does not only encode SNPs and short INDELs, but also retains long deletions, novel sequence insertions, translocations and copy numbers. It is a heavily reduced representation of raw data. Storing, distributing and analyzing assemblies is much faster and cheaper at an acceptable loss of information.

FermiKit is not a prototype. It is a practical pipeline targeting large-scale data and has been used to process hundreds of human samples. On a modern server with 16 CPU cores, FermiKit can assemble 30-fold human reads in one day with about 85GB RAM at the peak. The subsequent mapping and variant calling only take half an hour.

Installation and Usage

The only library dependency of FermiKit is zlib. To compile on Linux or Mac:

git clone --recursive https://github.com/lh3/fermikit.git
cd fermikit
make

This creates a fermikit/fermi.kit directory containing all the executables. You can copy the fermi.kit directory anywhere and invoke the pipeline by specifying absolute or relative path:

# assembly reads into unitigs (-s specifies the genome size and -l the read length)
fermi.kit/fermi2.pl unitig -s3g -t16 -l150 -p prefix reads.fq.gz > prefix.mak
make -f prefix.mak
# call small variants and structural variations
fermi.kit/run-calling -t16 bwa-indexed-ref.fa prefix.mag.gz | sh

This generates prefix.mag.gz for the final assembly and prefix.flt.vcf.gz for filtered SNPs and short INDELs and prefix.sv.vcf.gz for long deletions, novel sequence insertions and complex structural variations. If you have multiple FASTQ files and want to trim adapters before assembly:

fermi.kit/fermi2.pl unitig -s3g -t16 -l150 -p prefix \
    "fermi.kit/seqtk mergepe r1.fq r2.fq | fermi.kit/trimadap-mt -p4" > prefix.mak

It is also possible to call SNPs and short INDELs from multiple BAMs at the same time and produce a multi-sample VCF:

fermi.kit/htsbox pileup -cuf ref.fa pre1.srt.bam pre2.srt.bam > out.raw.vcf
fermi.kit/k8 fermi.kit/hapdip.js vcfsum -f out.raw.vcf > out.flt.vcf

Limitations

FermiKit does not use paired-end information during assembly, which potentially leads to loss of power. In evaluations, the loss is minor for germline samples and even without pair information, FermiKit is more sensitive to short INDELs and long deletions. Furthermore, with longer upcoming Illumina reads, it is actually preferred to merge overlapping ends in a pair before assembly and treat the merged reads as regular single-end reads (see AllPaths-LG and DISCOVAR).

Another technical limitation of FermiKit is that the error correction phase may take excessive RAM when the error rate is unusually high. In practice, this concern is also minor. I have assembled ~270 human samples and none of them require more than ~90GB RAM.

Running FermiKit twice on the same dataset under the same setting is likely to result in two slightly different assemblies. Please see bfc/count.c for the cause in BFC. Unitig construction also has a random factor under the multi-threading mode. Nonetheless, FermiKit should call the same variants from the same assembly.

fermikit's People

Contributors

lh3 avatar mlin avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

fermikit's Issues

joint calling

Hi Heng,

Great tool. Do you have any advice for joint calling of multiple samples?

I obtained a reasonable looking call set by first running your run-calling script on each sample individually. Then I ran the following on the bams produced:

fermi.kit/htsbox pileup -cuf hs37d5.fa *.bam | bcftools view -Oz output.vcf.gz

Is this a sensible approach? Obviously filtering still needs to performed.

cheers,

Jared

Error in last command of run-calling

I'm getting an error in the last cmd of the run-calling script, specifically with the htsbox abreak command. The perl script, run-calling, has this syntax for the last command:
$cmd .= "$root/htsbox abreak -cu $prefix.unsrt.sam.gz | gzip -1 > $prefix.sv.vcf.gz;\n";

Which looks like it is passing "-cu" flag to htsbox abreak, however those don't appear to be valid options:

Usage: htscmd abreak [options] <aln.sam>|<aln.bam>

Options: -b assume the input is BAM (default is SAM)
-l INT exclude contigs shorter than INT [150]
-q INT exclude alignments with mapQ below INT [10]
-m FLOAT exclude alignments overlapping another long alignment by FLOAT fraction [0.5]
-g INT join alignments separated by a gap shorter than INT bp [500]

Note: recommended BWA-SW setting is '-b9 -q16 -r1 -w500'

Thanks.
Jon

High RAM usage on simulated samples

Hi,

I have 50x simulated 1000g data with ART (http://www.niehs.nih.gov/research/resources/software/biostatistics/art/). I'm trying to run fermikit on this data and our cluster is killing the job when it passes 170G RAM. Do you have any suggestions for decreasing memory usage?

fermi.kit/fermi2.pl unitig -s3g -t16 -l126 -p art_50x_fermikit "fermi.kit/seqtk mergepe NA12877_50x_1.fq.gz NA12877_50x_2.fq.gz" > art_50x_fermikit.mak
make -f art_50x_fermikit.mak
fermi.kit/run-calling -t16 reference/hg19_random.fa art_50x_fermikit.mag.gz | sh

The only thing in the log is the following:

bash -c '/u/mtaschuk/git/fermikit/fermi.kit/bfc -s 3g -t 16 <(~/git/fermikit/fermi.kit/seqtk mergepe NA12877_50x_1.fq.gz NA12877_50x_2.fq.gz) <(~/git/fermikit/fermi.kit/seqtk mergepe NA12877_50x_1.fq.gz NA12877_50x_2.fq.gz) 2> art_50x_fermikit.ec.fq.gz.log | gzip -1 > art_50x_fermikit.ec.fq.gz'

I'm using NA12877 vcf from GiaB, converted to fasta reference, and then simulating using ART with the following characteristics:

  • read length: 126
  • mean fragment length: 500
  • standard deviation: 120
  • average coverage: 50x
  • paired end
  • error model: HiSeq 2500

Sorting; low sensitivity

Dear Heng,
Looks like a great tool and I tried to use it for a subset of the Anopheles 1000 Genomes data (https://www.malariagen.net/apps/ag1000g/phase1-AR3/index.html).
The first problem I encountered is at the sorting stage: once the first sam was made, fermi proceeded to sort it and generated ~150k tiny sam files - I had to kill it eventually.
The second issue: I ran clean up manually and genotyped with GATK, as GATK had been originally used to genotype after BWA-mapping the same reads to reference. When we compare the BWA and Fermikit genotypes, the Fermi pipeline results in ~18x fewer variants. What would be the key parameters that could help with this, please?
Best,
Chris

using fermikit to generate a reference graph

We are working on generating sequence graphs for use as the reference system in vg. I'm completing a first pass at bluntifying the overlap graph that can be pulled out of fermikit's intermediate mag files, which is a necessary step for our alignment and indexing algorithms. We have done a lot of testing on bidirectional string graphs, and so the use of these should be straightforward. It is just an issue of cleaning up the bluntification algorithm and understanding the best way to make the assembly graphs.

The ideal pattern would be to generate a pan genomic reference by assembling all the sequence data from a set of samples into a graph. Reading the Fermi paper it seems that some of the steps in that assembly process might not perform well in a multi sample assembly. Specifically the bubble popping algorithm in fermi has a diploid assumption. Has this been resolved in fermi2 and fermikit? Is it configurable, and if not would of be an easy change for me to make?

I am curious if there might be other contradictory aspects to this approach. Any advice would be welcome.

Error in make

When I run the command make -f prefix.mak, I get the error
fermi.kit/bfc -1s 3g -k 61 -t 16 prefix.ec.fq.gz 2> prefix.flt.fq.gz.log | gzip -1 > prefix.flt.fq.gz
/bin/bash: line 1: 13973 Segmentation fault fermi.kit/bfc -1s 3g -k 61 -t 16 prefix.ec.fq.gz 2> prefix.flt.fq.gz.log
13974 Done | gzip -1 > prefix.flt.fq.gz
make: *** [prefix.flt.fq.gz] Error 139
How can I solve this?

Can FermiKit be run on Amplicon data from targeted re-sequencing

Hi Heng,

Always interested to test out new software that you put out. I am doing a lot of targeted resequencing with Illumina TruSight Amplicon data of tumour-only samples. I have been looking for some assembly-based methods since there are some tandem-duplications and large indels in some of our samples that are difficult to detect with standard methods. I just tested this out on one of my samples and at the end I have no variants at all. The sorted bam file doesn't appear to have anything aligned against the reference.

Just wanted to check whether in principal fermikit will work on this data before I dive too deep into debugging and trying alternate parameters from the standard defaults on the Readme.

Cheers

Classify complex SV calling from fermikit ?

Hi Heng,

Thank you for developing the fermikit. I am trying your fermikit tools on a 30x whole genome sequencing population data. I successfully run the fermikit so far. But I wonder is there a way to classify the complex structure variation besides deletion and insertion called in fermikit? I just start this population level CNV project and very new to this kind of analysis. Thank you for help.

Cheers,
wenbin

Error in running fermikit

When I try to run the software using:
fermi.kit/run-calling -t16 ~/reference.genome/chr19_new.fa prefix.mag.gz | sh, I get the error
ERROR: failed to locate the BWA index. Please run 'fermi.kit/bwa index -p ~/reference.genome/chr19_new.fa ref.fa'.
When I run this command on my .fa file, I get the error
[user@login2 test_fermikit]$ fermi.kit/bwa index -p ~/reference.genome/chr19_new.fa a.fa
[bwa_index] Pack FASTA... [bns_fasta2bntseq] Failed to allocate 0 bytes at bntseq.c line 303: Success

(ha)ploidy

is it possible to change the ploidy for the variant calling steps (can't see a flag)?

"seqtk mergepe" is correct for multiple Lane fq DATA ?

Hi Heng,
I have one sample with PE100 sequence data, and it's multi-lane PE data like "sample.L01.read_1.fq.gz, sample.L01.read_2.fq.gz, sample.L02.read_1.fq.gz sample.L02.read_1.fq.gz". I want to call small variants and structural variations by using all clean data.
how should do?

pipeline questions

Hi Heng,

Few quick questions. Starting from BAM,

  • should duplicates be removed when converting to fastq?
  • seeing as you are not using pair information, does it matter if reads in the input fastq are in the order they come from the sorted BAM file? i.e. can I skip a bamshuf step?
  • you have htsbox abreak taking unsorted SAM as input. Is unsorted required, or can it be sorted?

Cheers,
Shane

Call SV jointly?

Hi Heng,

You mentioned the SNP and small InDels can be called jointly. I am wondering can you call structural variation (SV) jointly? For example, if I have multiple samples, how do I know if a large inversion is common or not. Thanks!

-Jinliang

GFA output?

Could the cleaned assembly graphs be output in GFA? What would need to be written?

make: nothing to be done for all

Hi,

I ran the first step to get DO50374T.mak.
fermi2.pl unitig -s3g -t16 -l150 -p DO50374T DO50374T.fq > DO50374T.mak
Then run make:
make -f DO50374T.mak

But it continue to say: make: Nothing to be done for 'all'. So what is the reason of the problem? below is the content of DO50374T.mak. Thanks!

PREFIX=DO50365T

EXE_FERMI2=/home/xiz978/anaconda3/bin/fermi2
EXE_ROPEBWT2=/home/xiz978/anaconda3/bin/ropebwt2
EXE_BFC=/home/xiz978/anaconda3/bin/bfc
GENOME_SIZE=3g
K_UNITIG=51
K_CLEAN=56
K_TRIM=61
K_MERGE=76
N_THREADS=16

INPUT=cat DO50365T.fq.gz

SHELL:=/bin/bash
export SHELLOPTS:=errexit:pipefail

all:$(PREFIX).mag.gz

$(PREFIX).ec.fq.gz:
    bash -e -o pipefail -c '$(EXE_BFC) -s $(GENOME_SIZE) -t $(N_THREADS) <($(INPUT)) <($(INPUT)) 2> [email protected] | gzip -1 > $@'

$(PREFIX).flt.fq.gz:$(PREFIX).ec.fq.gz
    $(EXE_BFC) -1s $(GENOME_SIZE) -k $(K_TRIM) -t $(N_THREADS) $< 2> [email protected] | gzip -1 > $@

$(PREFIX).flt.fmd:$(PREFIX).flt.fq.gz
    $(EXE_ROPEBWT2) -dNCr $< > $@ 2> [email protected]

$(PREFIX).pre.gz:$(PREFIX).flt.fmd
    $(EXE_FERMI2) assemble -l $(K_UNITIG) -m $(K_MERGE) -t $(N_THREADS) $< 2> [email protected] | gzip -1 > $@

$(PREFIX).mag.gz:$(PREFIX).pre.gz
    $(EXE_FERMI2) simplify -CSo $(K_CLEAN) -m $(K_MERGE) -T $(K_UNITIG) $< 2> [email protected] | gzip -1 > $@

fermikit-0.12.tar.bz2 has weird Mac OS extensions

Extracting fermikit-0.12.tar.bz2 on Linux causes tar to fail with a non-zero exit status due to non-standard BSD extensions in the tar file.

$ tar xf fermikit-0.12.tar.bz2 
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown extended header keyword 'SCHILY.ino'
tar: Ignoring unknown extended header keyword 'SCHILY.nlink'

See Fix Tar Errors on OS X and superuser.

The fix is to set set an environment variable before creating the tar file on Mac OS.

export COPYFILE_DISABLE=true

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.