hsinnan75 / strainpro
License: MIT License
As far as I can tell, the path to the tax dump files downloaded by download_taxonomy.sh is hardcoded into StrainPro-build. It would be helpful if the user could specify the location of the tax dump files, so that custom taxonomies can be used (e.g., GTDB tax dump files instead of NCBI) or so that NCBI tax dump files already downloaded elsewhere (e.g., in a "databases" directory) can be reused.
For StrainPro v0.9.2, it appears that StrainPro-build requires StrainPro-rep to be located at bin/StrainPro-rep relative to the user's current working directory. Thus the user must always call StrainPro-build from the base StrainPro directory.
Why not just provide instructions for adding /path/to/clone/of/strainpro/StrainPro/bin/ to the user's PATH and remove the hard-coded bin/StrainPro-rep path requirement from bin/StrainPro-build? That is what the PATH environment variable is for, and it would make bin/StrainPro-build a lot more flexible.
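Until the hard-coded path is removed, one possible workaround is a small wrapper that resolves the input and output paths to absolute paths and then runs the build from the base directory. This is only a sketch: `STRAINPRO_HOME` and `strainpro_build` are hypothetical names, not part of StrainPro, and the only flags used are the `-r` and `-o` options that appear in the reports on this page.

```shell
# Assumption: STRAINPRO_HOME is the base directory of your StrainPro clone
# (hypothetical variable; adjust the placeholder path).
STRAINPRO_HOME="${STRAINPRO_HOME:-/path/to/clone/of/strainpro/StrainPro}"

# Hypothetical wrapper: strainpro_build <reference> <output_dir>
# The parent directories of both arguments must already exist.
strainpro_build() {
    # Resolve the parent directories to absolute paths before changing directory.
    ref_dir=$(cd "$(dirname "$1")" && pwd) || return 1
    out_dir=$(cd "$(dirname "$2")" && pwd) || return 1
    (
        # Run from the base directory so the relative bin/StrainPro-rep lookup works.
        cd "$STRAINPRO_HOME" || exit 1
        ./bin/StrainPro-build -r "$ref_dir/$(basename "$1")" -o "$out_dir/$(basename "$2")"
    )
}
```

With the function defined, `strainpro_build ecoli.fa ref_idx` could be called from any working directory. The subshell keeps the caller's current directory unchanged.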
I just cloned the most recent version of StrainPro yesterday. I'm running StrainPro-build on the example "ecoli.fa" file with freshly downloaded taxdump files. The build job has been running for ~15 hours now with no status updates on what it is doing, and it seems to be using only 1 thread instead of the default 16.
Why does StrainPro-build take so long even with only a few genomes?
It would help to have status updates to understand what StrainPro-build is doing during the long run times.
Regarding thread usage: I run jobs on a cluster, so it's inefficient to request 16 threads for a job if only 1 thread is used most of the time. If possible, it would be helpful to split StrainPro-build into two separate commands: the first for the single-threaded work, and the second for the multi-threaded work.
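To check how many threads a running build actually has, one option on Linux is to read the `Threads:` field from `/proc/<pid>/status`. The helper below is a minimal sketch that assumes a Linux `/proc` filesystem; `thread_count` is a hypothetical name.

```shell
# Hypothetical helper: print the number of threads of a process (Linux only).
# Usage: thread_count <pid>
thread_count() {
    # /proc/<pid>/status contains a line such as "Threads:   16".
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}
```

For example, `thread_count "$(pgrep -n StrainPro-build)"` would report the thread count of the most recently started matching process, assuming `pgrep` is available. Note that this counts existing threads, not how many are busy.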
Hi all,
What is reference-fna in this case?
$ bin/StrainPro-build -r reference-fna -o ref_idx [ref_idx is the output folder for BWT indexes]
Thanks for your help :)
I'm using StrainPro-build v0.9.0 on Ubuntu 18.04.3. Here's a reproducible example:
git clone https://github.com/hsinnan75/StrainPro.git
cd StrainPro/
make
./download_taxonomy.sh
./download_genomic_library.sh archaea
./bin/StrainPro-build -r database/archaea/library.fna -o database/archaea/ref_idx
The output from StrainPro-build is:
Load taxonomy information.
Get all sequences...
Cluster 542 sequences...
*** buffer overflow detected ***: ./bin/StrainPro-build terminated
Aborted (core dumped)
The server that I'm using has plenty of resources, so the issue isn't a lack of memory or something like that.
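When reporting a crash like this, it can help to confirm how many sequences the input library actually contains, for comparison with the "Cluster 542 sequences..." line. The sketch below simply counts FASTA header lines; `count_fasta_records` is a hypothetical helper name, and it assumes an uncompressed FASTA file.

```shell
# Hypothetical helper: count FASTA records by counting header lines (those starting with ">").
# Usage: count_fasta_records <fasta_file>
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For the example above this would be run as `count_fasta_records database/archaea/library.fna`.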
I just mapped 0.5 million 150 bp HiSeq reads from a human gut metagenome (previously profiled with kraken2; the profile looked very normal for the sample type) to RefSeq-bacteria, and I'm getting the following output:
#TaxID #Read_count #Depth #Relative_abundance #Confidence_score
@TaxRank:subspecies
@TaxRank:species
@TaxRank:genus
@TaxRank:family
@TaxRank:order
@TaxRank:calss
@TaxRank:phylum
@TaxRank:kingdom
Steps to reproduce
git clone https://github.com/hsinnan75/StrainPro.git
cd StrainPro
./download_taxonomy.sh
./download_genomic_library.sh library bacteria
# WARNING: the following cmd takes many hours to complete (even with the default 16 threads)!
./bin/StrainPro-build -r database/bacteria/ -o database/bacteria/ref_idx
./bin/StrainPro-map -i database/bacteria/ref_idx -f /path/to/metagenome/read1.fq.gz -o output
If I map the reads with kraken2 against RefSeq, I get a taxonomic distribution that looks normal. I tried mapping another sample with 10 million reads instead of 0.5 million, and then I did get some output:
#TaxID #Read_count #Depth #Relative_abundance #Confidence_score
@TaxRank:subspecies
99822 2395 41 7.43 0.410877
411470 5785 28 5.07 0.282264
435591 218 25 4.53 0.255269
479831 1269 21 3.80 0.211359
499175 52 25 4.53 0.252427
536231 965 28 5.07 0.287117
553973 50 46 8.33 0.462963
679935 2386 30 5.43 0.297136
...
@TaxRank:phylum
976 1162258 35 89.80 1.000000
1224 5506 92 0.43 1.000000
1239 122790 38 9.49 1.000000
201174 3682 30 0.28 1.000000
@TaxRank:kingdom
2 1294236 36 100.00 1.000000
For the 0.5 million-read sample, I'm wondering why I didn't get at least some hits at the kingdom and phylum levels. Even for this 10 million-read file, I'm only getting 4 phyla. Is StrainPro just not very sensitive?
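To compare runs quickly, the number of taxa reported at each rank can be tallied from the profile output. The function below is a sketch that assumes the layout shown above: a `#` header line, `@TaxRank:<rank>` section markers, and whitespace-separated data rows with five columns. `count_taxa_per_rank` is a hypothetical name, not a StrainPro tool.

```shell
# Hypothetical helper: print "<rank><TAB><number of taxa>" for each @TaxRank section.
# Usage: count_taxa_per_rank <profile_file>
count_taxa_per_rank() {
    awk '
        /^#/ { next }                      # skip the column header line
        /^@TaxRank:/ {                     # start of a new rank section
            rank = substr($0, 10)          # text after "@TaxRank:"
            order[++n] = rank
            count[rank] = 0
            next
        }
        NF >= 5 && rank != "" { count[rank]++ }   # count only complete data rows
        END {
            for (i = 1; i <= n; i++) printf "%s\t%d\n", order[i], count[order[i]]
        }
    ' "$1"
}
```

Run on the empty 0.5 million-read profile, every rank would show a count of 0; the rank names are printed exactly as they appear in the file.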
FYI: @TaxRank:calss is spelled wrong (it should be @TaxRank:class).