
Comments (8)

fbreitwieser commented on July 17, 2024

Can you show the relevant parts of your input files? Do the taxa exist in the taxonomy tree, too?

from centrifuge.

sridhar0605 commented on July 17, 2024

I downloaded the latest taxonomy dump and extracted the nodes and names files with
tar -zxvf taxdump.tar.gz nodes.dmp names.dmp
Before going further, let me briefly describe my database.
I have all the genomic.fna.gz files for bacteria and viruses from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/
I tried to build a database with kraken, but it failed badly due to memory issues, even on an Amazon EC2 instance with 2 TB of RAM, so I want to use centrifuge, which suits my analysis perfectly.
So far I have concatenated all the sequences into a single fna file and used the seqid2taxa.map file from kraken as the initial input to centrifuge, since centrifuge does not download all the bacterial files. I modified the FASTA headers as centrifuge requires, keeping just the sequence ID and description. However, I end up getting this error:

Warning: taxomony id doesn't exists for NZ_AJTB01000101.1!

and then this too:
Warning: Taxonomy ID 1527292 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
I then tried the nodes.dmp and names.dmp files from the kraken output, still with no success.


fbreitwieser commented on July 17, 2024

This record has been removed from the NCBI nucleotide database (http://www.ncbi.nlm.nih.gov/nuccore/NZ_AJTB01000092.1). Usually we detect these cases by missing entries in the taxonomy dump - which I think is the case here. Note that the assembly_summary and taxonomy are not always in sync.


sridhar0605 commented on July 17, 2024

That is the issue: I am not using assembly_summary as my backbone. I am trying to build the index from all available sequences (plasmids, contigs, scaffolds), in all around 42080 species for bacteria and 5654 for viruses.


sridhar0605 commented on July 17, 2024

Any solution for this?


mourisl commented on July 17, 2024

Can you show us the line for NZ_AJTB01000101.1 in the seqid2taxa.map file and lines around it? Is the corresponding tax id (1527292) in the nodes.dmp and names.dmp?
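One way to answer that question for the whole map at once is a small script like the one below. It is only a sketch, not part of Centrifuge: it assumes the map file has two whitespace-separated columns (sequence ID, tax ID) and that nodes.dmp uses the usual NCBI layout with `|`-delimited fields and the tax ID in the first field. The function names and the sample lines are made up for illustration.

```python
def load_node_ids(nodes_lines):
    """Collect the tax IDs (first '|'-delimited field) from nodes.dmp lines."""
    ids = set()
    for line in nodes_lines:
        line = line.strip()
        if line:
            ids.add(line.split("|", 1)[0].strip())
    return ids


def find_unmapped(map_lines, node_ids):
    """Return (seq_id, tax_id) pairs whose tax ID is absent from node_ids."""
    missing = []
    for line in map_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        seq_id, tax_id = parts[0], parts[1]
        if tax_id not in node_ids:
            missing.append((seq_id, tax_id))
    return missing


# Abbreviated, illustrative input; real nodes.dmp lines carry more fields.
sample_nodes = ["1\t|\t1\t|\tno rank\t|\n", "562\t|\t561\t|\tspecies\t|\n"]
sample_map = ["seq_example_1\t562\n", "NZ_AJTB01000101.1\t1527292\n"]

print(find_unmapped(sample_map, load_node_ids(sample_nodes)))
```

For real files, pass open file handles instead of the sample lists, for example `load_node_ids(open("taxonomy/nodes.dmp"))`.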


sridhar0605 commented on July 17, 2024

Since I did not follow your online manual, I wrote my own script to build seqid2taxa.map (I used all the accession IDs from the FASTA headers and got the tax IDs from NCBI). And yes, @fbreitwieser was right: the record has been removed from the database and hence does not appear in nodes.dmp. So the next questions are: why is it still in the FASTA files on the RefSeq site, and how do I handle this issue when building the centrifuge index?


fbreitwieser commented on July 17, 2024

The thing is that RefSeq and the taxonomy database are not always in the same state. In Centrifuge, sequences with no mapping get added to the database with taxonomy ID 0, though maybe we should just skip them. But the database should build without problems even if some mappings are missing.
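If you would rather skip such sequences yourself before building, a pre-filter along these lines could work. This is a hypothetical sketch, not something Centrifuge does or requires: `filter_fasta` is a made-up helper that keeps only FASTA records whose ID (the first token after `>`) is in a set of IDs you consider valid, for example those whose tax ID is present in nodes.dmp.

```python
def filter_fasta(fasta_lines, keep_ids):
    """Yield only the lines of FASTA records whose sequence ID is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The sequence ID is the first whitespace-delimited token of the header.
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            yield line


# Illustrative records; the IDs and sequences are invented.
sample_fasta = [
    ">seq_a some description\n", "ACGT\n",
    ">seq_b\n", "GGCC\n",
    ">seq_c\n", "TTAA\n",
]

for out_line in filter_fasta(sample_fasta, {"seq_a", "seq_c"}):
    print(out_line, end="")
```

Because it works line by line, the same function can stream a large concatenated fna file without loading it into memory.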

