Comments (9)
Dear @BenoitGoutorbe ,
We are also using k-mer matching for 16S V3-V4-based classification in our lab. We use the NCBI curated RefSeq Targeted Loci project FASTA files from https://www.ncbi.nlm.nih.gov/refseq/targetedloci/ and build a Kraken DB out of these. Ideally, you restrict the FASTA files to your amplicon sequences.
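One way to restrict reference sequences to an amplicon is an in-silico PCR step. The following is only a minimal sketch, not the pipeline used in our lab: it assumes primers match exactly (apart from IUPAC degenerate bases), that references are in the forward orientation, and that the primer sites are kept in the output. The function names and the toy primers in the usage example are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a DNA string, including IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(sequence, fwd_primer, rev_primer):
    """Return the region from the forward primer site through the
    reverse primer site (both included), or None if either is missing."""
    sequence = sequence.upper()
    fwd = re.search(primer_regex(fwd_primer), sequence)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so search for
    # its reverse complement downstream of the forward primer.
    downstream = sequence[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(rev_primer)), downstream)
    if rev is None:
        return None
    return sequence[fwd.start():fwd.end() + rev.end()]


# Toy example with made-up primers (not real V3-V4 primers).
reference = "TTTACGTACGGGGCCCCGATTCAAAA"
amplicon = extract_amplicon(reference, "ACGTRC", "TGAATC")
```

References for which `extract_amplicon` returns `None` have no detectable primer sites; whether to drop them or keep them at full length is a design choice this sketch leaves open.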
All the best
from kraken.
Thanks a lot for this precious help. I was not aware of this Targeted Loci database from RefSeq and it's exactly what I needed. I built my Kraken database from it within 10 minutes (8 threads, 64 GB of RAM) and it classifies my reads very well (about 99.5% at the phylum level and 80% at the species level for the few samples I've tried so far) at very high speed (a few seconds for 150k reads of 500 bp each). I think I will stick with this solution because of the issues you (@rfm-targa) mentioned about Greengenes, SILVA and RDP (I need as much information as possible at genus/species levels).
Again, thanks a lot !
from kraken.
Hello everyone,
It's possible to build Greengenes and SILVA databases for Kraken, but in order to maintain their original taxonomy I had to create custom names.dmp and nodes.dmp files for each of those 16S databases. The sequence headers also have to be formatted as explained in the Kraken manual so that the DB builds without problems. I have a repository with the process to build a Greengenes 13.5 database (full file). While the steps I describe in the repository should work just fine, I would like to update the repository in the near future to include a faster process that also works for Greengenes 13.8 and SILVA. I currently have both databases for Kraken. Check the repo if you want; it might help if you really want to adapt those databases.
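To illustrate the idea (this is not the code from my repository), here is a minimal sketch that derives nodes.dmp and names.dmp rows from per-sequence lineages and builds a header carrying the `kraken:taxid` tag described in the Kraken manual. The taxids are arbitrary custom numbers, the function names and sample lineages are hypothetical, and writing only the leading columns of each NCBI-style row is an assumption to check against your Kraken version.

```python
def build_taxonomy(lineages):
    """lineages maps a sequence ID to a list of (rank, name) pairs,
    ordered from the highest rank down to the most specific one.
    Returns (nodes_rows, names_rows, seq_to_taxid)."""
    taxids = {(): 1}  # lineage path -> custom taxid; () is the root
    # NCBI dump style: fields separated by "\t|\t", rows ending in "\t|".
    nodes = ["1\t|\t1\t|\tno rank\t|\t-\t|"]
    names = ["1\t|\troot\t|\t-\t|\tscientific name\t|"]
    seq_to_taxid = {}
    for seq_id, lineage in lineages.items():
        path = ()
        for rank, name in lineage:
            parent = taxids[path]
            path = path + (name,)
            if path not in taxids:
                # Key on the full path so equal names under
                # different parents get different taxids.
                taxids[path] = len(taxids) + 1
                taxid = taxids[path]
                nodes.append(f"{taxid}\t|\t{parent}\t|\t{rank}\t|\t-\t|")
                names.append(f"{taxid}\t|\t{name}\t|\t-\t|\tscientific name\t|")
        seq_to_taxid[seq_id] = taxids[path]
    return nodes, names, seq_to_taxid


def kraken_header(seq_id, taxid):
    """FASTA header that assigns the sequence directly to a taxid."""
    return f">{seq_id}|kraken:taxid|{taxid}"


# Hypothetical input: two sequences sharing a phylum.
lineages = {
    "seqA": [("phylum", "Firmicutes"), ("genus", "Bacillus")],
    "seqB": [("phylum", "Firmicutes"), ("genus", "Clostridium")],
}
nodes, names, seq_to_taxid = build_taxonomy(lineages)
headers = [kraken_header(s, t) for s, t in seq_to_taxid.items()]
```

Writing `nodes` and `names` one row per line into the taxonomy directory of the database, and the re-headed sequences into a FASTA file, gives the inputs for the usual library-adding and build steps.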
Anyway, just as @your-highness said, the NCBI Targeted Loci project is also a good option. I've also used those sequences with Kraken and they give good results. It's a small file with around 20K sequences, but they are all annotated to species rank and cover a lot of species. Greengenes and SILVA aren't good if you want to classify at species level: SILVA sequences are annotated at most to genus level (anything at species level isn't really 'correct'), and Greengenes hasn't been updated in a long time and has only around 637 species represented.
Best regards
from kraken.
@your-highness
Personally, I don't limit my database to the targeted region. Due to the way Kraken works, I don't think limiting the database will improve performance. The database with full 16S sequences should contain the k-mers for the region of interest and classify just as well or better. If we limit the database to the targeted region we might create some problems like:
- If the database contains only the targeted region, its sequences were extracted with particular primers or with particular software that cuts the region out of full 16S sequences. The reads you want to classify will rarely span exactly the same region as the database, and that mismatch can produce wrong results. It is hard to restrict both the database and the reads to precisely the same region, so in my opinion covering more than the targeted region is a plus: with the full sequence, every part of a read can be matched, and you avoid wrong hits caused by one region being slightly shorter or longer.
- In my opinion, a database with full sequences is better, and using more than one region or a longer target is better still. For 16S rRNA, including variable regions plus parts of the 'constant' regions might help even more, since different species can share the exact same variable-region sequence and the extra information from the 'constant' regions might separate them ('A systematic search for discriminating sites in the 16S ribosomal RNA gene' by Hilde et al. might be interesting).
- A database built from one targeted region only works for that region; a database that can be used more broadly may be more practical.
- A database based on a target like full 16S does not take much disk space and runs well on a laptop with 16 GB of RAM (it might work with less; I didn't test). Reducing the target to a single variable region has only a small impact on computing requirements. I even run Kraken with a 16S database made from a filtered SILVA 132 (around 2 GB) on a 16 GB machine and it's fine. In this case, RAM requirements are mainly due to Kraken's indexing structure, which takes a fixed amount of space (and grows as you add more and more sequences).
I've classified with full 16S databases and the results weren't bad, since there are a lot of reference sequences. One problem that can't really be solved is that, at species level, different species might have the exact same 16S sequence or 16S region, and that a single bacterium may carry multiple copies of 16S that are not identical.
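The point about k-mers can be checked with a toy example. This is only a sketch of the set relationship, not of Kraken's actual lookup: the sequence is made up, and k = 5 is used for readability, far shorter than the k-mer length Kraken uses.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


k = 5
full = "ACGTTGCAAGGCTTACCGGATTACGATCGGA"  # stand-in for a full 16S sequence
region = full[8:24]                        # stand-in for the targeted region
read = full[5:27]                          # a read slightly longer than the region

# Every k-mer of the region is also in the full-length reference.
region_is_covered = kmers(region, k) <= kmers(full, k)

# K-mers of the read that each database could not match.
missing_in_region_db = kmers(read, k) - kmers(region, k)
missing_in_full_db = kmers(read, k) - kmers(full, k)
```

Here `missing_in_full_db` is empty, while `missing_in_region_db` holds the k-mers from the flanks of the read, which only the full-length database can match.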
Well, this is just my opinion about some things, hope it helps.
from kraken.
Hi Riccardo,
I am indeed planning to add an option to kraken-build that would set up at least a Greengenes database, and probably Silva as well. 16S DB support is on the TODO list on my office whiteboard. :)
I'll see what I can do with this today - I think the original test DBs were lost due to some HD failures, but I don't think it's too difficult to redo that work, especially with the latest Kraken's support for direct assignment of taxa to sequences.
from kraken.
Hi Derrick,
I'm also looking for an option to use the SILVA database with Kraken. Did you make it available anywhere? That would be great!
Regards
from kraken.
Hi,
I too would like to have access to a Greengenes database to use with Kraken. Will there be one soon?
Pat
from kraken.
Hi Jennifer, Hi Derrick ,
I am doing taxonomic classification on 16S data, so it seems stupid to use a RefSeq-based Kraken DB. Is the '16s-dev' branch that you started in 2014/15 to build a Kraken DB from Greengenes, Silva or RDP ready to be used? I guess the answer is no, otherwise I don't get why you didn't merge it to master and why these options are not available in the recent releases. Do you plan to work on that in the future? I understand you guys are more interested in dealing with shotgun data, but your algo is also very exciting for padawans working on 16S data!
Best regards, Benoit
from kraken.
I have a question out of curiosity to @BenoitGoutorbe and @rfm-targa 👍
When building a Kraken database for amplicon sequencing strategies, do you restrict your reference sequences (i.e. NCBI Targeted Loci project) to your amplified regions exclusively? What is your opinion on reducing the references to e.g. V3 if your primers target only V3?
from kraken.