Comments (9)

your-highness commented on July 29, 2024 3

Dear @BenoitGoutorbe,

We also use k-mer matching for 16S V3-V4-based classification in our lab. We take the NCBI-curated RefSeq Targeted Loci project FASTA files from https://www.ncbi.nlm.nih.gov/refseq/targetedloci/ and build a Kraken DB out of them. Ideally, you restrict the FASTA files to your amplicon sequences.
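
For anyone who wants to try restricting the references to the amplicon, here is a minimal Python sketch of an in-silico trim. It is not part of Kraken or of any workflow described in this thread: the function names (`reverse_complement`, `extract_amplicon`) and the toy primers are hypothetical, and it only does exact primer matching, so degenerate primers (IUPAC codes such as N or W) would need a regex or a dedicated tool.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def extract_amplicon(seq, fwd_primer, rev_primer, keep_primers=False):
    """Return the region flanked by the two primers, or None if either is missing.

    Assumption: primers match the reference exactly (no degenerate bases,
    no mismatches). The reverse primer is given 5'->3' as ordered, so it is
    searched as its reverse complement on the reference strand.
    """
    seq = seq.upper()
    fwd = fwd_primer.upper()
    rev_rc = reverse_complement(rev_primer.upper())

    start = seq.find(fwd)
    if start == -1:
        return None
    end = seq.find(rev_rc, start + len(fwd))
    if end == -1:
        return None

    if keep_primers:
        return seq[start:end + len(rev_rc)]
    return seq[start + len(fwd):end]


# Toy example (made-up sequence and primers, not real 16S primers).
reference = "TTTT" + "ACGTAC" + "GGGGCCCC" + "TTGACA" + "AAAA"
print(extract_amplicon(reference, "ACGTAC", "TGTCAA"))  # GGGGCCCC
```

References where one of the primers is not found return `None`, so you can decide whether to drop them or keep the full-length sequence instead.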

All the best

from kraken.

bengouts commented on July 29, 2024 2

Thanks a lot for this valuable help. I was not aware of the RefSeq Targeted Loci database, and it's exactly what I needed. I built my Kraken database from it within 10 minutes (8 threads, 64 GB of RAM), and it classifies my reads very well (about 99.5% at the phylum level and 80% at the species level for the few samples I've tried so far) at very high speed (a few seconds for 150k reads of 500 bp each). I think I will stick with this solution because of the issues you (@rfm-targa) mentioned about Greengenes, SILVA and RDP (I need as much information as possible at the genus/species level).
Again, thanks a lot!

rfm-targa commented on July 29, 2024 1

Hello everyone,

It's possible to build Greengenes and SILVA databases for Kraken, but to preserve their original taxonomy I had to create custom names.dmp and nodes.dmp files for each of those 16S databases. The sequence headers also have to be formatted as explained in the Kraken manual so that the DB builds without problems. I have a repository with the process to build a Greengenes 13.5 database (full file). While the steps I describe in the repository should work just fine, I would like to update it in the near future to include a faster process that also works for Greengenes 13.8 and SILVA. I currently have both databases for Kraken. Check the repo if you want; it might help if you really want to adapt those databases.
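
To give a rough idea of what custom names.dmp/nodes.dmp files and formatted headers can look like, here is a small Python sketch. It is not the process from the repository mentioned above. It assumes Greengenes-style lineage strings (`k__...; p__...`), assigns made-up sequential taxon IDs, writes only the first three columns of nodes.dmp (tax ID, parent ID, rank) in the NCBI `\t|\t`-separated layout, and uses the `kraken:taxid|XXX` header convention as I understand it from the Kraken manual. All function names are hypothetical.

```python
RANKS = {"k": "superkingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_lineage(text):
    """Turn 'k__Bacteria; p__Firmicutes; g__' into [(rank, name), ...].

    Parsing stops at the first level with an empty name.
    """
    lineage = []
    for field in text.split(";"):
        prefix, _, name = field.strip().partition("__")
        if not name:
            break
        lineage.append((RANKS[prefix], name))
    return lineage


def build_taxonomy(lineages):
    """Build dmp-style lines and a sequence -> taxid map.

    lineages: dict of sequence ID -> parsed lineage.
    Taxon IDs are arbitrary sequential integers (an assumption of this
    sketch); they are keyed by the full path so that identical names under
    different parents get different IDs.
    """
    taxid_of = {(): 1}  # the empty path is the root
    nodes = ["1\t|\t1\t|\tno rank\t|"]
    names = ["1\t|\troot\t|\t\t|\tscientific name\t|"]
    seq_taxid = {}
    for seq_id, lineage in lineages.items():
        path = ()
        for rank, name in lineage:
            parent = taxid_of[path]
            path = path + ((rank, name),)
            if path not in taxid_of:
                taxid = len(taxid_of) + 1
                taxid_of[path] = taxid
                nodes.append(f"{taxid}\t|\t{parent}\t|\t{rank}\t|")
                names.append(f"{taxid}\t|\t{name}\t|\t\t|\tscientific name\t|")
        seq_taxid[seq_id] = taxid_of[path]
    return nodes, names, seq_taxid


def kraken_header(seq_id, taxid):
    """FASTA header carrying an explicit taxon ID."""
    return f">{seq_id}|kraken:taxid|{taxid}"


# Toy input (sequence IDs are made up).
lineages = {
    "seqA": parse_lineage("k__Bacteria; p__Firmicutes; g__"),
    "seqB": parse_lineage("k__Bacteria; p__Proteobacteria"),
}
nodes, names, seq_taxid = build_taxonomy(lineages)
print("\n".join(nodes))
print("\n".join(names))
for seq_id, taxid in seq_taxid.items():
    print(kraken_header(seq_id, taxid))
```

Writing `nodes` and `names` one entry per line gives files in the dmp layout; whether a given Kraken version needs the remaining NCBI columns in nodes.dmp is something to check against the manual.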
Anyway, just as @your-highness said, the NCBI Targeted Loci project is also a good option. I've also used those sequences with Kraken and they give good results. It's a small file with around 20K sequences, but they are all annotated to species rank and cover a lot of species. Greengenes and SILVA aren't good if you want to classify at species level: SILVA sequences are annotated at most to genus level (anything at species level isn't really 'correct'), and Greengenes hasn't been updated in a long time and has only around 637 species represented.

Best regards

rfm-targa commented on July 29, 2024 1

@your-highness
Personally, I don't limit my database to the targeted region. Due to the way Kraken works, I don't think limiting the database will improve performance. The database with full 16S sequences should contain the k-mers for the region of interest and classify just as well or better. If we limit the database to the targeted region we might create some problems like:

  • We have a database with only the targeted region, but the sequences used to construct it were obtained with certain primers or with software that extracts regions from full 16S sequences. It will be difficult to get reads that span exactly the same region as the database, and because of that we will get wrong results. It's hard to have just the targeted region in the database and to amplify only that region with primers, so in my opinion using more than the targeted region is a plus: with the full sequence you can match the whole region and classify based on all the information, without wrong hits because one region was slightly shorter or longer.

  • In my opinion, a database with full sequences is better, and using more than one region or a longer target is obviously better. For 16S rRNA, including variable regions and parts of the 'constant' regions might help even more, since different species can have variable regions with exactly the same sequence and the extra information from the 'constant' regions might separate them ('A systematic search for discriminating sites in the 16S ribosomal RNA gene' by Hilde et al. might be interesting).

  • Creating a database from one targeted region will only work for that region and it might be more practical to have a database that can be used more broadly.

  • A database based on a target like full 16S will not take much disk space and will run well on a laptop with 16 GB of RAM (it might work with less; I didn't test). Reducing the target to a single variable region will have only a small impact on computing requirements. I even run Kraken with a 16S database made from a filtered SILVA 132 (around 2 GB) on a 16 GB machine and it's fine. In this case, RAM requirements are mainly due to Kraken's indexing structure, which takes a fixed amount of space (and grows after that as you add more and more sequences).
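
The k-mer argument in the points above can be illustrated with a toy Python sketch. It is only an illustration under simplifying assumptions: a random stand-in for a 16S gene, plain (non-canonical) k-mers, k = 31 (Kraken's default, as far as I know), and hypothetical coordinates for the trimmed region and the read. It does not reproduce Kraken's actual indexing.

```python
import random


def kmers(seq, k=31):
    """Set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


random.seed(0)
# Random 1500 bp sequence standing in for a full-length 16S gene.
full_16s = "".join(random.choice("ACGT") for _ in range(1500))

# Reference trimmed to a hypothetical target region.
trimmed_ref = full_16s[340:800]

# A read whose primers amplify 20 bp more on each side than the trimmed reference.
read = full_16s[320:820]

# Every k-mer of the trimmed region is also present in the full-length reference.
print(kmers(trimmed_ref) <= kmers(full_16s))

# K-mers of the read that each database would fail to match.
missing_in_full = kmers(read) - kmers(full_16s)
missing_in_trimmed = kmers(read) - kmers(trimmed_ref)
print(len(missing_in_full), len(missing_in_trimmed))
```

With the full-length reference every k-mer of the read is found, while the trimmed reference misses the k-mers that overlap the extra flanking bases, which is the "slightly shorter or longer" problem described in the first point.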

I've classified with full 16S databases and the results weren't bad, since there are a lot of reference sequences. One problem that can't really be solved is that, at species level, different species might have exactly the same 16S sequence or 16S region, and that a single bacterium can carry multiple, non-identical copies of 16S.

Well, this is just my opinion about some things, hope it helps.

DerrickWood commented on July 29, 2024

Hi Riccardo,

I am indeed planning to add an option to kraken-build that would set up at least a Greengenes database, and probably Silva as well. 16S DB support is on the TODO list on my office whiteboard. :)

I'll see what I can do with this today - I think the original test DBs were lost due to some HD failures, but I don't think it's too difficult to redo that work, especially with the latest Kraken's support for direct assignment of taxa to sequences.

igra666 commented on July 29, 2024

Hi Derrick,
I'm also looking for an option to use the SILVA database with Kraken. Did you make it available anywhere? That would be great!
Regards

pbuendia commented on July 29, 2024

Hi,
I too would like to have access to a Greengenes database to use with Kraken. Will there be one soon?
Pat

bengouts commented on July 29, 2024

Hi Jennifer, hi Derrick,
I am doing taxonomic classification on 16S data, so it seems stupid to use a RefSeq-based Kraken DB. Is the '16s-dev' branch that you started in 2014/15 to build Kraken DBs from Greengenes, SILVA or RDP ready to be used? I guess the answer is no, otherwise I don't get why you didn't merge it into master and why these options are not available in the recent releases. Do you plan to work on that in the future? I understand you guys are more interested in dealing with shotgun data, but your algorithm is also very exciting for padawans working on 16S data!
Best regards, Benoit

your-highness commented on July 29, 2024

I have a question, out of curiosity, for @BenoitGoutorbe and @rfm-targa 👍

When building a Kraken database for amplicon sequencing strategies, do you restrict your reference sequences (i.e. NCBI Targeted Loci project) to your amplified regions exclusively? What is your opinion on reducing the references to e.g. V3 if your primers target only V3?
