
kraken's Introduction

Kraken taxonomic sequence classification system

Please see the Kraken webpage or the Kraken manual for information on installing and operating Kraken. A local copy of the Kraken manual is also present here in the docs/ directory (MANUAL.html and MANUAL.markdown).

kraken's People

Contributors

blankenberg, derrickwood, dfornika, explodingcabbage, jenniferlu717, smdabdoub


kraken's Issues

Kraken does not classify some 100% matching sequences

Hi Derrick,

I was testing Kraken on some sequences when I noted that Kraken completely misses some sequences which have a 100% identity match in the database.

Version: zip file of the master branch from yesterday (18.05.2015)
Database building: standard (as shown in the manual), full bacteria, no database shrinkage

Example:

>from_BacSu_Natto_195
CTTGAATGGATGCAGCAACTTGAGTGTCAGTAGACGTAACATCAATGTCGCAAGAATCTC
TAACGATAATCATTTCTTCAGATGTTTGTTTGTTAAGGCTTAGCTGATCAAAATCTTGAA
ATGCGTTAGCATCTTCTCCTCCAGAACAAACGTCATGACAAACCGCTTCTCTTTTATAGC
CGCTACCCGGGTGTGTACATGAATTATCTAGGGCAACCCATGAGTAAGGTCTTGATTCCA
TTTGTTTTCCTCCTTTCGTTTACTTTAGTAGATGATGACAGTTCATATTTTGTATAAGCA
AATATCATGGTTTTAGATACCTATTTTAAATATTATCGTTTTTTCTTTGACGTACGCGAT
CTCTCAGTGTTGTTCGACGTCTTTTTTGCGGCGCTGCTTCTTGTTCTGCTTTTTCAGCAT
CTGCCTTTTTCTTATTGACTCTTTCAACAAATTCGTTAGCCTGTTTTTGAAGGTCATTTA
CCATTGTAATAATCGCTTGTTTTTGAGCTTCTGATAAGTTGCGTTTTTTATCTTCTGTTA
CATTGTTTTTGTTCAGTACATCTGAAACAAGAAGTTTCGTTAGGGGGTCGATATTTGCTT
CTAGTTCTTCTTTGTTTTTTTGCAGTGCATCTTTTTCTTCGGACATGTTATTCACCTCGA
CTTCTATAAAATTAGAAAGAAAGGGCTACAGAATGTCAGCTTCTGTTATAAGAGCGATTA
ATGTTTGGGTTAAGAGCTGAACGGATGCCATGATGTCTGTAGCTGATAAAGTGATTGTGA
TATTATAGCAGTTTATAATTTTGATTTTTACTTTCTTTCGGCTATGTGCTGTAGAGCGTG
CTATCAGATCACTCGCAACAATGTCTGCTATATCGCTATCGCTGATAAGAAGCTGGGTTG
TAACAGTGATTAAAGCTGAAACTGCTGATTGCAATGAAAGTGCAAATGTGGTGTCTGATT

to which kraken says

$ kraken --db /scratch1/tmp/bachtmp/krakentest/bacteria bug.fasta >kraoutbug.txt
1 sequences (0.00 Mbp) processed in 0.001s (76.1 Kseq/m, 73.10 Mbp/m).
  0 sequences classified (0.00%)
  1 sequences unclassified (100.00%)

whereas performing a simple fasta36 search confirms that the data should be in the database:

# fasta36 bug.fasta ../krakentest/bacteria/library/Bacteria/Bacillus_subtilis_natto_BEST195_uid183001/NC_017196.fna
FASTA searches a protein or DNA sequence data bank
 version 36.3.4 May, 2011
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

Query: bug.fasta
  1>>>from_BacSu_Natto_195 - 960 nt
Library: ../krakentest/bacteria/library/Bacteria/Bacillus_subtilis_natto_BEST195_uid183001/NC_017196.fna
  4091591 residues in     1 sequences

The best scores are:                                      opt bits E(28)
gi|428277412|ref|NC_017196.1| Bacillus subtili (4091591) [f] 4800 523.7 8.6e-149
...
>>gi|428277412|ref|NC_017196.1| Bacillus subtilis subsp.  (4091591 nt)
 initn: 4800 init1: 4800 opt: 4800  Z-score: 2729.5  bits: 523.7 E(28): 8.6e-149
banded Smith-Waterman score: 4800; 100.0% identity (100.0% similar) in 960 nt overlap (1-960:1240841-1241800)

                                             10        20        30
from_B                               CTTGAATGGATGCAGCAACTTGAGTGTCAG
                                     ::::::::::::::::::::::::::::::
gi|428 CAATTGTAATGACAGCAGTTTGGAGGGCGGCTTGAATGGATGCAGCAACTTGAGTGTCAG
          1240820   1240830   1240840   1240850   1240860   1240870

               40        50        60        70        80        90
from_B TAGACGTAACATCAATGTCGCAAGAATCTCTAACGATAATCATTTCTTCAGATGTTTGTT
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|428 TAGACGTAACATCAATGTCGCAAGAATCTCTAACGATAATCATTTCTTCAGATGTTTGTT
          1240880   1240890   1240900   1240910   1240920   1240930

I first suspected that this was a bug with organisms containing multiple FASTA files (Natto has 2 .fna files), but a quick check with another bacterium having 6 .fna files found all 6 test sequences.

Any idea what's causing this bug?

Best,
Bastien

fastq detection fails with paired reads

Hi,

I get an error when I use paired reads in fastq files. I think something may be missing in the parsing options when --paired is set.


 /home/lp113/soft/bin/kraken --paired --db /home/lp113/soft/kraken/db/minikraken_20140330 --quick  --preload --min-hits 3 --threads 1 --out - --classified-out kraken_out /home/lp113/bcbio-nextgen/tests/test_automated_output/trim/Hsapiens_Mmusculus_1_trimmed.fq /home/lp113/bcbio-nextgen/tests/test_automated_output/trim/Hsapiens_Mmusculus_rep2_trimmed.fq 
Loading database... read_merger.pl: mismatched mate pair names ('HWI-ST1233:94:C0YE6ACXX:2:1302:19463:27765' & 'HWI-ST1233:94:C0YE6ACXX:2:1302:19463:27765g')
complete.
classify: malformed fasta file - expected header char > not found
0 sequences (0.00 Mbp) processed in 0.000s (0.0 Kseq/m, 0.00 Mbp/m).
  0 sequences classified (-nan%)
  0 sequences unclassified (-nan%)

Having trouble adding the sequences to create a custom database

Hi - I am trying to build a custom KRAKEN database. I am following all the instructions provided in the manual, but I am not able to add the sequences to the database. Every time I get the following error message: "Can't add "/data1/home/sandeep/HMP_Dataset/Simulations/Genomes/CC_121.fa": sequence is missing GI number"

Below are the different modifications I tried on the header of the sequence.

It would be great if you could let me know what might be the reason for this error message.

>GI:686246107 Staphylococcus aureus ADL-101
gttacAAGCGCATTTTCGTTCAGTCAACTACTGCCAATATAACTTCGTAGAGCATAGAAC
ATTGATTTATGTACCAGCCTGATCAACATATAAATATAAATTTTTATGTTTCACGTAAAA

>686246107 Staphylococcus aureus ADL-101
gttacAAGCGCATTTTCGTTCAGTCAACTACTGCCAATATAACTTCGTAGAGCATAGAAC
ATTGATTTATGTACCAGCCTGATCAACATATAAATATAAATTTTTATGTTTCACGTAAAA

>CC_121|kraken:taxid|1308698 Staphylococcus aureus ADL-101
gttacAAGCGCATTTTCGTTCAGTCAACTACTGCCAATATAACTTCGTAGAGCATAGAAC
ATTGATTTATGTACCAGCCTGATCAACATATAAATATAAATTTTTATGTTTCACGTAAAA

>kraken:taxid|1308698 Staphylococcus aureus ADL-101
gttacAAGCGCATTTTCGTTCAGTCAACTACTGCCAATATAACTTCGTAGAGCATAGAAC
ATTGATTTATGTACCAGCCTGATCAACATATAAATATAAATTTTTATGTTTCACGTAAAA
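For reference, the `kraken:taxid` form in the manual is embedded in an ordinary FASTA header: the line must still begin with `>`, and the taxid is the third pipe-delimited field. A reconstruction using the reporter's own sequence name and taxid:

```
>CC_121|kraken:taxid|1308698 Staphylococcus aureus ADL-101
gttacAAGCGCATTTTCGTTCAGTCAACTACTGCCAATATAACTTCGTAGAGCATAGAAC
ATTGATTTATGTACCAGCCTGATCAACATATAAATATAAATTTTTATGTTTCACGTAAAA
```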

Multi-fasta file support for --add-to-library

We are attempting to build a large kraken database out of complete and draft bacterial genomes and have run into file-system issues when all the data are loaded into the kraken database as separate individual FASTA files. The NCBI bacterial assembly folder currently has over 1 million .fna files; this clobbers most file systems when they are all dumped into the same directory.

I attempted to load a concatenated multi-record as a test. kraken-build splits the data up, then treats the concatenated record as an RNA (.ffn) record.

I can probably hack in something to make kraken-build accept multi-record FASTA, but I'm just curious: is there a specific reason why multi-record FASTA isn't supported, or is it problematic? I couldn't find anything on the mailing list.

Problems during installation

Dear researcher,

I just started using Kraken. After downloading the database, a problem appeared during the build.

Found jellyfish v1.1.11
tar: citations.dmp: Cannot change ownership to uid 9019, gid 583: Invalid argument
tar: Skipping to next header
tar: Archive contains “\tn\tuncultur” where numeric off_t value expected
tar: Archive contains “n\t11\tn\t1” where numeric mode_t value expected
tar: Archive contains “don\n350657\tn” where numeric time_t value expected
tar: Archive contains “\tn\t0\tn\t” where numeric uid_t value expected
tar: Archive contains “\tn\t1\tn\t” where numeric gid_t value expected
tar: pecies\tn\tUAon\t11\tn\t1\tn\t11\tn\t1\tn\t0\tn\t1\tn\t1\tn\t0\tn\tunculturedon\n350656\tn\t75661\tn\tspecies\tn\tUAon\t11\tn\t1\t: Unknown file type “ ”, extracted as normal file
tar: pecies n UAon 11 n 1 n 11 n 1 n 0 n 1 n 1 n 0 n unculturedon
350656 n 75661 n species n UAon 11 n 1 : implausibly old time stamp 1970-01-01 07:59:59
tar: Skipping to next header

gzip: stdin: invalid compressed data--crc error

gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now

A mail (in /var/spool/mail/root) gave some hints that the problem may be hardware-related.

WARNING: at kernel/rh_taint.c:13 mark_hardware_unsupported+0x39/0x40() (Not tainted)
:Hardware name: ThinkServer RD640
:Your hardware is unsupported. Please do not report bugs, panics, oopses, etc.,
on this hardware.

No such directory ‘genomes/Bacteria’.

Hi Derrick,

Can you please help on the following issue:

bhaley@NextSeq-Server:/$ ./kraken-build --standard --db //kraken_DB
Found jellyfish v1.1.11
--2016-03-08 17:40:58-- ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz
=> ‘all.fna.tar.gz’
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 130.14.250.11, 2607:f220:41e:250::10
Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/Bacteria ...
No such directory ‘genomes/Bacteria’.

PacBio FASTQ

Hi,
Would it be possible to use Kraken on PacBio FASTQ reads to remove contaminations?

Thank you in advance.

Michal

Accept reads in BAM format

Does Kraken, in its current form, accept BAM as input? This would allow already-mapped reads to be classified. Some places (including our center) prefer BAM for holding unmapped reads as well, due to the flexibility of the format compared to FASTQ, as described [here](http://blastedbio.blogspot.com/2011/10/fastq-must-die-long-live-sambam.html). This is obviously just an enhancement, but it might enable more users.

Error during db build: "find: -printf: unknown primary or operator"

I'm building a custom db following the instructions on the kraken website using the scripts provided by Mick Watson (http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/). Everything works fine up to the moment when I try to build the db using the following command:
karsten$ kraken-build --build --threads 24 --work-on-disk --db kraken_20160720
Kraken build set to minimize RAM usage.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
find: -printf: unknown primary or operator

Copied from the script, the line in question looks like:
KRAKEN_HASH_SIZE=$(find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -printf '%s\n' | perl -nle '$sum += $_; END {print int(1.15 * $sum)}')

I'm using Mac OS X 10.11.6 with Perl v5.18.2.

So, what is wrong with the printf command?

Cheers,

Karsten
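The immediate problem is that `-printf` is a GNU findutils extension; the BSD find shipped with macOS doesn't implement it, hence "unknown primary or operator". A portable sketch (demo files invented here; same 1.15 scaling as the script) sums the bytes by piping the matched files through `wc -c` instead:

```shell
# Build a tiny demo library/ so the snippet is self-contained.
mkdir -p library/demo
printf 'ACGT' > library/demo/a.fna
printf 'ACGTACGT' > library/demo/b.fa

# Portable: no -printf; cat the matched files and count bytes instead.
KRAKEN_HASH_SIZE=$(find library/ \( -name '*.fna' -o -name '*.fa' -o -name '*.ffn' \) \
  -exec cat {} + | wc -c | awk '{print int(1.15 * $1)}')
echo "$KRAKEN_HASH_SIZE"   # 1.15 * 12 bytes -> 13

rm -r library
```

On BSD/macOS, `stat -f %z` per file would also work, but `cat | wc -c` avoids per-platform `stat` flag differences.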

Regarding custom database

Dear Derrick,

Thanks for developing kraken pipeline for the scientific community.
I have couple of queries regarding the kraken database.

  1. I believe kraken currently uses RefSeq genomes for bacteria, archaea, viruses, and human. Can I include fungal, plant, and nematode sequences in the database?
  2. Instead of RefSeq genomes, can I use RefSeq protein to build my custom database?

Cheers,
Ram

Building custom database without gi_taxid_nucl.dmp

Hi,

I'm getting errors with version 0.10.5b when I prepare a fasta file as described in the manual and run kraken-build. First, kraken-build complains about a missing taxonomy/gi_taxid_nucl.dmp, an 8 GiB file that is not needed in this case. When I create it as an empty file, there is an mmap error.

Without gi_taxid_nucl.dmp:

kraken-build --db mydb --threads 20 --build
Kraken build set to minimize disk writes.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
Hash size not specified, using '81910966608'
K-mer set created. [2h4m52.942s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
K-mer set sorted. [11h41m59.065s]
Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [4m29.637s]
Creating seqID to taxID map (step 5 of 6)...
make_seqid_to_taxid_map: unable to open taxonomy/gi_taxid_nucl.dmp: No such file or directory

With empty gi_taxid_nucl.dmp:

kraken-build --db mydb --threads 20 --build
Kraken build set to minimize disk writes.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, GI number to seqID map already complete.
Creating seqID to taxID map (step 5 of 6)...
make_seqid_to_taxid_map: unable to mmap taxonomy/gi_taxid_nucl.dmp: Invalid argument

Ideally, kraken-build would skip the GI-based steps when that map isn't needed.

Option "--report" for primary kraken script

Derrick,

Firstly, I just want to say that Kraken is a really nice piece of software, with good documentation and command line interface. Thank you!

I have an enhancement request. The typical use case is:

% kraken --db DB > kraken.out
OR
% kraken --db DB --outfile kraken.out
THEN
% kraken-report --db DB kraken.out > kraken.report

I was thinking an option like this:

% kraken --db DB --outfile kraken.out --report kraken.report --mpa-report kraken.mpa

That way the user doesn't have to re-specify the --db parameter, and the end result is available in one command.

(Maybe a --filter option as well in case that makes sense wrt kraken-filter)

Torsten

which jellyfish version

Hi,
Which Jellyfish version is compatible with kraken:

> conda search jellyfish
Fetching package metadata ...............
jellyfish                    0.5.1                    py27_0  conda-forge     
                             0.5.1                    py34_0  conda-forge     
                             0.5.1                    py35_0  conda-forge     
                             0.5.6                    py35_0  conda-forge     
                             0.5.6                    py27_0  conda-forge     
                             0.5.6                    py34_0  conda-forge     
                             1.1.11                        0  bioconda        
                             1.1.11                        1  bioconda        
                             2.2.3                         0  bioconda        
                             2.2.3                         1  bioconda        
                          *  2.2.6                         0  bioconda  

Thank you in advance.

Michal

The ftp url in download_taxonomy.sh needs to be updated

The url for NCBI's ftp site appears to have changed causing the command kraken-build --download-taxonomy --db name to fail.

Line 27 of download_taxonomy.sh currently reads:
NCBI_SERVER="ftp.ncbi.nih.gov"

Changing it to:
NCBI_SERVER="ftp.ncbi.nlm.nih.gov"
and rebuilding Kraken resolved the issue.

Thanks,
Adam

"kraken-build --download-library human" ends up with empty Human library

H_sapiens library folders and sizes:

0B ./CHR_01
0B ./CHR_02
0B ./CHR_03
0B ./CHR_04
0B ./CHR_05
0B ./CHR_06
0B ./CHR_07
0B ./CHR_08
0B ./CHR_09
0B ./CHR_10
0B ./CHR_11
0B ./CHR_12
0B ./CHR_13
0B ./CHR_14
0B ./CHR_15
0B ./CHR_16
0B ./CHR_17
0B ./CHR_18
0B ./CHR_19
0B ./CHR_20
0B ./CHR_21
0B ./CHR_22
0B ./CHR_MT
0B ./CHR_Un
0B ./CHR_X
0B ./CHR_Y

set_lcas: database in improper format

On a MacPro with 128 GB RAM, kraken-build fails at the last step, using kraken-0.10.16 (from GitHub) and jellyfish 1.1.11. This has happened with two different databases, one big and one small. Any suggestions? Thanks

>kraken-build --build --threads 14 --db ensemblgenomes --jellyfish-hash-size 14500M --max-db-size 100
Kraken build set to minimize disk writes.
Skipping step 1, k-mer set already exists.
Reducing database size (step 2 of 6)...
Shrinking DB to use only 8232020649 of the 15480282244 k-mers
Written 8232020649/8232020649 k-mers to new file
Database reduced. [1h21m16.000s]
Sorting k-mer set (step 3 of 6)...
K-mer set sorted. [1h32m56.000s]
Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [4m47.000s]
Creating seqID to taxID map (step 5 of 6)...
640200 sequences mapped to taxa. [2.000s]
Setting LCAs in database (step 6 of 6)...
set_lcas: database in improper format

Assigning taxonomy from NCBI genome downloads

I recently downloaded all genomes available on NCBI in hopes of making a database as comprehensive as possible for determining taxonomy in a largely unknown eukaryote/prokaryote metagenomic dataset. However, it appears that there are currently no GI numbers assigned to the files downloaded from NCBI, and the headers look like this:

>NC_008801.1 Monodelphis domestica chromosome 1, MonDom5, whole genome shotgun sequence
AtcctcccccccaccaccaccccagcATGCAGGCCGCCACCATCTTATCCACCAGGCCGCCCCGGTGCGTGGC

rather than

>gi|701219395|ref|NC_025403.1| Achimota virus 1, complete genome
ACCAGAGGGAAAATATAACAATGTCGTTTTATAGCGATGTAAATAATACTTATGTAGGCCCGAAAGTGC

I noticed in a previous issue that you mentioned you are working on a better solution that will allow inclusion of sequences that lack GI numbers but have only accession numbers, so hopefully that can help fix this down the road. In the meantime, I was wondering if you have any ideas for a workaround that would be easier than individually assigning taxonomy for the 60,000+ genomes I am trying to make into a database. Any thoughts are greatly appreciated.

Best,
Eric

Query Annotation based on mini kraken and custom kraken Databases

Dear Derrick,

Query regarding Annotation:
My metagenomics forward and reverse fastq files have 20 million reads. After removing plant-like reads from my input fastq files (using the fastq_screen pipeline), I had 4 million reads. I then provided this fastq file (4 million reads) as input to the metAMOS pipeline. The FCP option annotated those reads, but neither the custom kraken databases nor minikraken annotated them as expected. What could be the reason?

But for the initial fastq files (with 20 million reads), the kraken custom DB based on the nt database annotated correctly.

I tried four different databases with metAMOS pipeline.

  1. Using minikraken database (DB size 4.5GB), for these 4 million reads I received an output with no hits in annotation.

  2. Using custom kraken database (Bacterial, Viral, Archaeal, Fungal) (DB size 105GB), for these 4 million reads.

  3. Using custom kraken database (nt database from ncbi) (DB size 604GB), for these 4 million reads.

  4. Using FCP database, for these 4 million reads.

Kraken Standard Database

Hello,
I'm trying to build the standard kraken database. I gave the job 140 GB of memory, running on 16 threads.

I get this error after kraken has downloaded the gi to taxid mapping file and the taxonomy dump from NCBI.

Found jellyfish v1.1.11
--2016-06-23 16:32:01--  ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
           => “gi_taxid_nucl.dmp.gz.2”
Resolving ftp.ncbi.nih.gov... 130.14.250.10, 2607:f220:41e:250::12
Connecting to ftp.ncbi.nih.gov|130.14.250.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/taxonomy ... done.
==> SIZE gi_taxid_nucl.dmp.gz ... 1405001825
==> PASV ... done.    ==> RETR gi_taxid_nucl.dmp.gz ... done.
Length: 1405001825 (1.3G) (unauthoritative)

100%[=================================================================>] 1,405,001,825 7.42M/s   in 4m 42s  

2016-06-23 16:36:45 (4.74 MB/s) - “gi_taxid_nucl.dmp.gz.2” saved [1405001825]

Downloaded GI to taxon map
--2016-06-23 16:36:45--  ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
           => “taxdump.tar.gz”
Resolving ftp.ncbi.nih.gov... 130.14.250.12, 2607:f220:41e:250::12
Connecting to ftp.ncbi.nih.gov|130.14.250.12|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/taxonomy ... done.
==> SIZE taxdump.tar.gz ... 36397155
==> PASV ... done.    ==> RETR taxdump.tar.gz ... done.
Length: 36397155 (35M) (unauthoritative)

100%[===================================================================>] 36,397,155  10.2M/s   in 5.2s    

2016-06-23 16:36:51 (6.62 MB/s) - “taxdump.tar.gz” saved [36397155]

Downloaded taxonomy tree data

gzip: gi_taxid_nucl.dmp.gz: unexpected end of file

Is there something I'm missing?

Thanks.

Better prediction on genome assemblies than reads

Hello everybody,

I assembled genomes from known species with paired reads.
Then, I used kraken to confirm the species of my genomes.
Generally, using kraken on the assembled genomes gives good predictions.
However, using kraken on the reads gives less accurate predictions.

For example, from my kraken output, I have this line:
C id 288681 201 86661:3 288681:31 86661:36 A:31 86661:23 0:47
I don't understand why, for this read, the assigned taxon number is 288681 when the majority of k-mers are assigned to 86661. Has anyone had the same issue?
Thanks!
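For what it's worth, the per-taxon totals in that hit string can be tallied directly from column 5 of the output line (a sketch using the line above; note that, per the Kraken paper, classification picks the leaf of the highest-weighted root-to-leaf path, so if 288681 lies below 86661 in the taxonomy the two counts reinforce each other rather than compete):

```shell
# Tally k-mer hits per label from column 5 of a Kraken output line.
line=$(printf 'C\tid\t288681\t201\t86661:3 288681:31 86661:36 A:31 86661:23 0:47')
totals=$(printf '%s\n' "$line" | awk -F'\t' '{
  n = split($5, hits, " ")      # "taxid:count" pairs, space-separated
  for (i = 1; i <= n; i++) {
    split(hits[i], kv, ":")     # A = ambiguous base, 0 = no hit
    sum[kv[1]] += kv[2]
  }
  print sum["86661"], sum["288681"]
}')
echo "$totals"   # 62 31
```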

Does kraken work with jellyfish version 1.1.11?

I try to build the database but it does not work with the latest Jellyfish 1 program:

Found jellyfish v1.1.11
Kraken requires jellyfish version 1

Is the checking script making a mistake, or which version should I use instead?

Paired-end reads problem

kraken --paired --fastq-input --gzip-compressed -db /db/kraken --threads 1 --out --classified-out kraken_out file.R1.gz file.R2.gz
kraken: --paired requires exactly two filenames

OK, now I added a comma between the two filenames as follows:
kraken --paired --fastq-input --gzip-compressed -db /db/kraken --threads 1 --out --classified-out kraken_out file.R1.gz,file.R2.gz

read_merger.pl: kraken_out does not exist
classify: malformed fasta file - expected header char > not found
0 sequences (0.00 Mbp) processed in 0.000s (0.0 Kseq/m, 0.00 Mbp/m).
0 sequences classified (-nan%)
0 sequences unclassified (-nan%)

Kraken-translate error

Hi, I built a custom database and searched with kraken, but when I try to translate the kraken output (kraken-translate --db myKrakenDB --mpa-format sequences.kraken), I get the following output:

Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 116, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 117, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 118, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 119, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 120, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 121, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 122, <> line 1.
Use of uninitialized value $_ in string eq at /Users/castrolab01/programs/kraken/kraken-translate line 123, <> line 1.
Use of uninitialized value in transliteration (tr///) at /Users/castrolab01/programs/kraken/kraken-translate line 95, <> line 1.
Use of uninitialized value $taxid in numeric gt (>) at /Users/castrolab01/programs/kraken/kraken-translate line 101, <> line 1.
Use of uninitialized value $taxid in hash element at /Users/castrolab01/programs/kraken/kraken-translate line 104, <> line 1.
Use of uninitialized value $taxid in hash element at /Users/castrolab01/programs/kraken/kraken-translate line 110, <> line 1.
r1 root

genome sequences for RefSeq bacteria moved on the NCBI ftp

Dear Dr. Wood,

It's impossible to build a new bacteria database using "kraken-build" because the genome sequences for RefSeq bacteria have moved on the FTP site, as you can see in the NCBI readme file: ftp://ftp.ncbi.nlm.nih.gov/genomes/README.txt

Have you planned to update Kraken with the new path of the bacteria genomes?

Best regards,
Nicolas

db_sort: unable to mmap database.jdb: Cannot allocate memory

Hi Derrick,

./kraken-build --threads 20 --build --jellyfish-hash-size 2400M --db kraken_bv_072516/

I am working on a server with 128 GB of RAM and 3 TB of disk space.

The database.jdb.tmp file is around ~400 GB, and I still end up getting:

Kraken build set to minimize RAM usage.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
K-mer set created. [6h45m35.859s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: unable to mmap database.jdb: Cannot allocate memory

I do not want to compromise the sensitivity of the results, which is why I am avoiding --max-db-size; I also tried the --work-on-disk flag but ended up with a similar result.

Thanks.
Best
Sid

The databases need to include "Plasmids" too

Derrick,

I notice that my pure bacterial samples often still have, say, 5% of reads unclassified. When I assemble these reads, they turn out to be bacterial plasmids.

The problem is that some of the Bacteria folders have chromosomes and plasmids, but there are also many separately submitted plasmids which are in a different Plasmids folder at NCBI:

ftp://ftp.ncbi.nih.gov/genomes/Plasmids/

It would be great to add support for this in the download tools, and in MiniKraken.

Building custom database

Dear Derrick,

I have installed jellyfish version 1.1.11 to build a custom database with fungi, bacteria, and viruses. I followed the instructions for adding libraries to the library folder.

At the step of building the custom database from the library, I executed the command below and got the messages shown in the terminal. I left it running for more than 10 hours and still saw only the same message, "Hash size not specified, using '12654893177'".

Is everything fine, or am I missing something needed to build the custom database?

wenchenaafc@wenchenaafc:~/metAMOS-1.5rc3/kraken_custom$ kraken-build --build --db customDB --threads 1 --kmer-len 31 --minimizer-len 15
Kraken build set to minimize disk writes.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.11
Hash size not specified, using '12654893177'

classify slow on MacOS Sierra

Kraken is running very slowly after updating to macOS Sierra (from 10.9). Even using 14 threads, 'classify' uses only 2% of CPU resources when running on fastq files. This happens whether the database is in RAM or on a ramdisk. I also tried re-compiling kraken after the update; no difference. Has anyone else had this problem?

kraken can't find jellyfish when running within an HPC cluster

kraken and jellyfish are installed to /kraken_install and /jellyfish-1.1.11, and all directories containing the relevant executables have been successfully added to PATH in .bashrc.

[ans74@hpc~]$ kraken-build
Must select a task option.
Usage: kraken-build [task option] [options]
 
Task options (exactly one must be selected):
--download-taxonomy        Download NCBI taxonomic information

[ans74@hpc~]$ jellyfish
Too few arguments
Usage: jellyfish <cmd> [options] arg...
Where <cmd> is one of: count, stats, histo, dump, merge, query, cite, qhisto, qdump, qmerge, jf.
Options:
--version        Display version

running my script results in:

[ans74@hpc ~]$ cat kraken_3*
/hpc/ans74/kraken_install/check_for_jellyfish.sh: line 27: jellyfish: command not found

however, running (not submitting) check_for_jellyfish.sh works!

[ans74@hpc ~]$ ./kraken_install/check_for_jellyfish.sh
Found jellyfish v1.1.11

Also, manually running the command I am trying to submit to the HPC works:

[ans74@hpc ~]$ kraken-build --standard --db /hpc/scratch/ans74/kraken_database/
Found jellyfish v1.1.11
--2017-03-20 15:04:16--  ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
            => “gi_taxid_nucl.dmp.gz”
Resolving ftp.ncbi.nih.gov... 130.14.250.12, 2607:f220:41e:250::13
Connecting to ftp.ncbi.nih.gov|130.14.250.12|:21... connected.
Logging in as anonymous ... ^C

Is this somehow related to how kraken is written?
Does it make any sense that the executables are on the PATH via .bashrc and work from there, but the script, when submitted, isn't able to detect jellyfish from that same directory?
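One likely explanation (an assumption; scheduler behavior varies): batch jobs typically run in non-interactive shells that never read ~/.bashrc, so PATH additions made there are invisible to the submitted script, even though they work at the interactive prompt. Exporting PATH inside the job script itself sidesteps that; the directories below are adapted from the paths in the report:

```shell
# Put the install directories on PATH inside the job script itself,
# instead of relying on ~/.bashrc being sourced by the batch shell.
export PATH="/hpc/ans74/kraken_install:/hpc/ans74/jellyfish-1.1.11:$PATH"

# Sanity-check visibility before launching the real kraken-build command.
if command -v jellyfish >/dev/null 2>&1; then
  echo "jellyfish found"
else
  echo "jellyfish still missing"
fi
```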

mmap() on low-memory systems

Hi, first I'd like to say that Kraken is a really well-written program!

I found that kraken (the classification part) does not succeed on systems where the amount of main memory (+ swap) is smaller than the index (database.kdb). However, I believe this should be possible via memory mapping, in particular in this case because the data needs to only be read by the program which allows the OS to do efficient swapping. While it should be technically possible, I cannot make any comment about whether it would be efficient.

Issue: In the file quickfile.cpp, you correctly use the parameters PROT_READ and MAP_SHARED to trigger this kind of read-only access. However, it seems the database file is always opened in read-write mode, and I don't know why. IMO the correct way would be to use the read-only flags and warn the user if this results in inefficient memory-access behavior, or to require a parameter like '--force-memory-overcommit' in the classification program.

Cheers, Johannes

set_lcas: error opening taxonomy/nodes.dmp: No such file or directory

What should I do about this error?

Kraken build set to minimize disk writes.
Creating k-mer set (step 1 of 6)...
Found jellyfish v1.1.10
Hash size not specified, using '9096'
K-mer set created. [0.037s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
K-mer set sorted. [1m46.611s]
Creating GI number to seqID map (step 4 of 6)...
GI number to seqID map created. [0.008s]
Creating seqID to taxID map (step 5 of 6)...
5 sequences mapped to taxa. [0.005s]
Setting LCAs in database (step 6 of 6)...
set_lcas: error opening taxonomy/nodes.dmp: No such file or directory

new release any time soon?

Hi there. Your current master code looks fairly stable and has not seen many updates lately, whereas 0.10.5-beta (the last tagged version) has a build issue. Looking at the commit history, this seems to have been fixed very quickly. However, since there is no newer tag, I am stuck using master, which I would like to avoid in my production setup. If there are no reasons against it, could you please release and tag a new version?
Thanks a lot for a great tool!
Florian

Environmental variables for default --db and --threads

Specifying the full --db path and --threads each time is a bit tedious. What would you think about having (optional) environment variables for these defaults:

KRAKEN_DB_DEFAULT=/usr/local/share/kraken/db/mini_kraken
KRAKEN_THREADS_DEFAULT=8

Yes, I could write a wrapper script for these for the typical user. If I get time I could generate a pull request, but until then this Issue report will remind me.
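As a stopgap, the defaulting itself is a one-liner in POSIX shell; the variable names below are the ones proposed above, and the wrapper command is only sketched in a comment:

```shell
# Default the variables only if the environment hasn't already set them.
: "${KRAKEN_DB_DEFAULT:=/usr/local/share/kraken/db/mini_kraken}"
: "${KRAKEN_THREADS_DEFAULT:=8}"

# A wrapper script would then end with something like:
#   exec kraken --db "$KRAKEN_DB_DEFAULT" --threads "$KRAKEN_THREADS_DEFAULT" "$@"
echo "db=$KRAKEN_DB_DEFAULT threads=$KRAKEN_THREADS_DEFAULT"
```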

Number of distinct kmers mapping to a species

Q1 . Is it somehow possible to get an output field in the kraken-report which summarizes the distinct number of non overlapping kmers (or overlapping if the former calculation is complex) that were directly assigned to a particular taxonomic node. This could give an idea about the total amount of coverage one sees for a particular taxonomic node. While we are at it, for each taxonomic node how about computing the ratio of observed kmers assigned for the node / total number of kmer assigned for that node .

Q2. Is there a good strategy to get rid of low-complexity k-mers? They are especially troublesome when analyzing metatranscriptomic data.
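On Q2: as a crude stopgap (not a substitute for a proper DUST/entropy masker), reads dominated by a single base can be dropped before classification. A sketch with an arbitrary 70% threshold, assuming FASTA headers contain no spaces:

```shell
# Drop FASTA records whose most frequent base exceeds 70% of the
# sequence length -- a very crude low-complexity pre-filter.
low_complexity_filter() {
  awk 'BEGIN { RS = ">"; ORS = "" }
  NR > 1 {
    seq = ""
    for (i = 2; i <= NF; i++) seq = seq $i   # fields 2..NF are sequence lines
    n = length(seq); max = 0
    for (j = 1; j <= n; j++) cnt[substr(seq, j, 1)]++
    for (b in cnt) { if (cnt[b] > max) max = cnt[b]; delete cnt[b] }
    if (n > 0 && max / n <= 0.7) print ">" $0
  }' "$@"
}
```

Usage: low_complexity_filter input.fa > filtered.fa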

Silva and Greengenes support

Hi Derrick,
We talked about this topic during 2014 and I know you are working on it. I'm very excited to see you released a new kraken version! I really think kraken could be one of the best solutions for metabarcoding/metagenomics analysis.
Are you planning to release a guide or a script to create a kraken database from Silva, Greengenes or other custom databases?

paired and unpaired reads

Hi,
I want to analyze a set of reads with kraken.
After quality trimming my paired-end data, some reads lost their mates (the mate became too short). I now have one fastq file with left mates, one with right mates, and one with singletons.
Could you tell me whether it is possible to include all the reads (paired and singleton) in the kraken analysis? Should I concatenate the paired-read files with the singleton file, or should I run kraken separately on the paired and unpaired reads?
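Both can be included: each line of kraken's per-read output is independent, so one common pattern is to classify the pairs and the singletons in separate runs and concatenate the outputs. A sketch (file names are placeholders; --paired is kraken's flag for mate-pair input):

```shell
# Classify pairs and singletons separately, then merge the per-read
# output lines (each kraken output line stands on its own).
classify_all() {
  db=$1
  kraken --db "$db" --paired reads_R1.fq reads_R2.fq > pairs.kraken
  kraken --db "$db" reads_singletons.fq > singletons.kraken
  cat pairs.kraken singletons.kraken > all.kraken
}
```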

Thank you in advance for your reply,

Installation Bug

Hi,
great job with kraken!
I just want to mention that your install script returns 1 on a successful installation.

It is because of these lines:

for file in $KRAKEN_DIR/kraken*
do
  [ -x "$file" ] && echo "  $file"
done

One of the files being checked is not executable, so the final [ -x ] test fails, and its non-zero status becomes the script's exit status.
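A minimal fix is to make sure the failing test is not the script's last command, e.g. by ending with an explicit success status (sketch; KRAKEN_DIR default is a placeholder):

```shell
KRAKEN_DIR=${KRAKEN_DIR:-/usr/local/kraken}   # wherever the scripts were installed

for file in "$KRAKEN_DIR"/kraken*
do
  [ -x "$file" ] && echo "  $file"
done
true   # don't let the last [ -x ] test set the script's exit status
```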

best,
Peter

human and plasmids db not downloaded

kraken-build --standard --db $DBNAME
Only the Bacteria and Viruses libraries are built; plasmids and human are not downloaded.
I am not sure whether this is a bug or the intended behavior.
kraken-build --download-library plasmids --db testdb
--2016-07-08 15:06:33-- ftp://ftp.ncbi.nlm.nih.gov/genomes/Plasmids/plasmids.all.fna.tar.gz
=> ‘plasmids.all.fna.tar.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 2607:f220:41e:250::10
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/Plasmids ...
No such directory ‘genomes/Plasmids’.

Build speeds for some users are really, REALLY slow

From Kraken v0.10.0 to v0.10.2, the default build operation manipulated data that was almost entirely on disk. Some users found this to cause their builds to take extraordinarily long times, often forcing the user to give up on the build. As of v0.10.3, the default build operation is to work in RAM as much as possible, which dramatically lessens the number of random access disk I/O operations. However, some people may not have enough extra RAM to use this mode of operation, and the other mode (accessible via --work-on-disk) may not work given the I/O configuration on the user's computer. I do have some plans in mind to break the build process down into more manageable parts for such users, and I hope to have them available in future versions.

Usage information should reflect KRAKEN_* env vars

The usage information for the kraken* tools should inform the user of the values of $KRAKEN_DB_PATH $KRAKEN_DEFAULT_DB $KRAKEN_NUM_THREADS in the usage information:

 --db NAME               Name for Kraken DB
 --threads NUM           Number of threads

For example, if they are set:

 --db NAME               Name for Kraken DB  (default=/var/db/kraken/minikraken)
 --threads NUM           Number of threads (default=8)
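For illustration, a sketch of usage text that interpolates the current environment (kraken's scripts are actually Perl; the idea carries over directly):

```shell
# Print usage lines with the current env-var values as defaults.
print_usage() {
  echo " --db NAME               Name for Kraken DB (default=${KRAKEN_DEFAULT_DB:-none})"
  echo " --threads NUM           Number of threads (default=${KRAKEN_NUM_THREADS:-1})"
}
```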

set_lcas segfaults with custom database

After building a custom database from a single multi-fasta file which I constructed using the described syntax with kraken:taxid header entries, kraken-build (v0.10.5b) fails in step 6:

kraken-build --db mydb --build
Kraken build set to minimize disk writes.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, GI number to seqID map already complete.
Skipping step 5, seqID to taxID map already complete.
Setting LCAs in database (step 6 of 6)...
Processed 108029 sequencesxargs: process cat was terminated by signal 13.
/net/programs/Debian-7-x86_64/kraken-0.10.5b/bin/build_kraken_db.sh: line 197: 13939 Done                  find library/ '(' -name '*.fna' -o -name '*.fa' -o -name '*.ffn' ')' -print0
     13940 Exit 125                | xargs -0 cat
     13941 Segmentation fault      | set_lcas $MEMFLAG -x -d database.kdb -i database.idx -n taxonomy/nodes.dmp -t $KRAKEN_THREAD_CT -m seqid2taxid.map -F /dev/fd/0

I isolated the respective call and ran set_lcas manually, which led to a plain segfault:

set_lcas -x -d database.kdb -i database.idx -n taxonomy/nodes.dmp -t 1 -m seqid2taxid.map -F library/added/1Y0AnEaxod.fna
Processed 108029 sequences
Segmentation fault

This refers to the same run that is described in issue #34. FASTA headers look like:

grep '^>' library/added/1Y0AnEaxod.fna | head
>NZ_AQYU01000016.1|kraken:taxid|35400
>NZ_AQYU01000015.1|kraken:taxid|35400
>NZ_AQYU01000017.1|kraken:taxid|35400
>NZ_AQYU01000014.1|kraken:taxid|35400
>NZ_AQYU01000018.1|kraken:taxid|35400
>NZ_AQYU01000011.1|kraken:taxid|35400
>NZ_AQYU01000010.1|kraken:taxid|35400
>NZ_AQYU01000009.1|kraken:taxid|35400
>NZ_AQYU01000012.1|kraken:taxid|35400
>NZ_AQYU01000008.1|kraken:taxid|35400

The taxid is mostly on the species level and I downloaded the latest gi_taxid_nucl.dmp from the NCBI ftp site to be able to run kraken-build on the custom dataset.
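For reference, the headers above follow the documented |kraken:taxid|&lt;id&gt; convention; a sketch of how such headers can be generated from a two-column seqid-to-taxid table (file layout assumed, not from the issue):

```shell
# args: 1) "seqid<TAB>taxid" table  2) FASTA file; rewritten FASTA on stdout
add_taxid_headers() {
  awk 'NR == FNR { tax[$1] = $2; next }
  /^>/ {
    id = substr($1, 2)
    if (id in tax) { print ">" id "|kraken:taxid|" tax[id]; next }
  }
  { print }' "$1" "$2"
}
```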

Plasmids?

When building a bacterial genome database, do you include the plasmids at the same place in the taxonomic tree or only include them in plasmid builds?

make_seqid_to_taxid_map fails when gi2seqid.map is too large

In Kraken version 0.10.5b, when the FASTA sequences and the corresponding gi2seqid.map are large (in my case ~150 k sequences), the make_seqid_to_taxid_map binary fails with std::size_error or similar. I suspect this is due to the integer size type used for the container?

I simply replaced this step with an awk snippet followed by GNU sort -k1,1, which seems to produce identical output.
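For others hitting this, a workaround sketch along those lines (column layouts assumed: gi2seqid.map is gi&lt;TAB&gt;seqid, gi_taxid_nucl.dmp is gi&lt;TAB&gt;taxid; verify against your files before use):

```shell
# Usage: make_seqid2taxid gi2seqid.map gi_taxid_nucl.dmp > seqid2taxid.map
make_seqid2taxid() {
  # load the (smaller) gi->seqid map into memory, then stream the huge
  # gi->taxid dump past it; emit "seqid<TAB>taxid", sorted by seqid
  awk 'NR == FNR { seqid[$1] = $2; next }
       ($1 in seqid) { print seqid[$1] "\t" $2 }' "$1" "$2" |
  sort -k1,1
}
```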

Regards, Johannes
