ecogenomics / gtdblite
This project forked from askars/gtdblite
It would likely be best if the default privacy setting for creating marker lists or adding genomes were public. In general, we are looking to share our data across ACE and build synergy between different projects. I would expect a user to need or want to make their data private only in rare cases.
A number of species are represented by an excessive number of genomes (e.g. C. difficile, E. coli). It would be beneficial to dereplicate these species for the purposes of inferring a genome tree. This would reduce the time required to infer the tree and make the tree easier to visualize. Some care needs to be taken, as any removed taxa may need to be updated when changes are made to the taxonomy. Also, we need to make sure to remove only taxa that will not be of interest to users. As a start, I suggest dereplicating only well-established species, and only genomes from IMG.
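A rough sketch of that starting point, under some loud assumptions: each genome is described by a hypothetical (genome_id, species, source) tuple, the list of well-established species is supplied by hand, and the genomes kept for a species are simply the first ones encountered. A real implementation would presumably pick representatives by genome quality, and none of these names come from gtdblite itself.

```python
from collections import defaultdict


def dereplicate(genomes, species_to_dereplicate, max_per_species=1):
    """Return the genome ids to keep for tree inference.

    genomes: iterable of (genome_id, species, source) tuples (hypothetical layout).
    species_to_dereplicate: set of species names considered well established.
    Only genomes whose source is "IMG" are ever dropped; everything else is kept.
    """
    kept = []
    img_kept = defaultdict(int)  # species -> number of IMG genomes kept so far
    for genome_id, species, source in genomes:
        if source == "IMG" and species in species_to_dereplicate:
            if img_kept[species] >= max_per_species:
                continue  # species is already represented; skip this genome
            img_kept[species] += 1
        kept.append(genome_id)
    return kept
```

Because only IMG genomes of listed species are dropped, user-submitted genomes always survive, which matches the concern about not removing taxa of interest to users.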
When prodigal isn't in your path you get a very vague error message which, due to the parallel nature of the code, points to an uninformative part of the source (see below). A check for the required external dependencies before getting into this would help a lot.
src/gtdblite.py trees create --output ~/test_genome_tree --all_genomes --marker_set_ids 3 --no_tree
24377 genomes contain 975080 uncalculated markers.
These markers need to be calculated in order to build the tree. More markers means more waiting. Continue using 1 threads? (y/N): y
Breaking calculation into 49 chunks of up to 500 genomes.
Calculating chunk 1 of 49....
Prodigal complete for 102 of 500 genomes (chunk 1 of 49),
Prodigal complete for 189 of 500 genomes (chunk 1 of 49),
Prodigal complete for 276 of 500 genomes (chunk 1 of 49),
Prodigal complete for 363 of 500 genomes (chunk 1 of 49),
Prodigal complete for 449 of 500 genomes (chunk 1 of 49),
Prodigal complete for 500 of 500 genomes (chunk 1 of 49),
Exception caught. Dumping info.
Traceback (most recent call last):
  File "src/gtdblite.py", line 664, in <module>
    result = args.func(db, args)
  File "src/gtdblite.py", line 128, in CreateTreeData
    return db.MakeTreeData(marker_id_list, genome_id_list, args.out_dir, "gtdblite", args.profile, profile_config_dict, not(args.no_tree))
  File "/export/data1/sw/GTDBLite/src/gtdblite/GenomeDatabase.py", line 1550, in MakeTreeData
    prodigal_dir = async_result.get()
  File "/opt/qiime/1.8.0/python-2.7.3-release/lib/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
OSError: [Errno 2] No such file or directory
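A minimal sketch of the kind of up-front check I have in mind. The function names are made up for illustration and are not part of gtdblite. It uses shutil.which, which needs Python 3.3 or later; the traceback above is from Python 2.7, where distutils.spawn.find_executable could play the same role.

```python
import shutil


def find_missing_dependencies(programs):
    """Return the names in `programs` that cannot be found on the PATH."""
    return [name for name in programs if shutil.which(name) is None]


def check_dependencies(programs):
    """Fail early with a clear message if any required program is missing."""
    missing = find_missing_dependencies(programs)
    if missing:
        raise EnvironmentError(
            "Required program(s) not found on PATH: " + ", ".join(missing)
        )
```

Calling check_dependencies(["prodigal"]) before any work is handed to the worker pool would replace the bare OSError with a message that names the missing program.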
It would be extremely helpful to have 16S BLAST hits for all genomes in the GTDB. These should include the alignment length, % identity, e-value, taxonomy string, database identifier, and NCBI accession number of the top hit. It would also be nice to have this information for the second-best hit. Perhaps a homology search against the latest SILVA database, with Phil's Greengenes taxonomy mapped to this dataset, would be best.
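As a sketch of how the best and second-best hits could be pulled out, assuming the search is run with BLAST's standard 12-column tabular output (-outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that hits are ranked by bit score. The Hit tuple and top_hits function are hypothetical names, and mapping the subject identifier to a taxonomy string or NCBI accession is not covered here.

```python
from collections import defaultdict, namedtuple

Hit = namedtuple(
    "Hit", "subject_id percent_identity alignment_length evalue bitscore"
)


def top_hits(blast_lines, n=2):
    """Return the n best hits per query from BLAST tabular (-outfmt 6) lines.

    Hits are ranked by bit score, highest first (an assumption about
    what "best" should mean here).
    """
    hits = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits[fields[0]].append(
            Hit(
                subject_id=fields[1],
                percent_identity=float(fields[2]),
                alignment_length=int(fields[3]),
                evalue=float(fields[10]),
                bitscore=float(fields[11]),
            )
        )
    return {
        query: sorted(found, key=lambda h: h.bitscore, reverse=True)[:n]
        for query, found in hits.items()
    }
```

The returned tuples already carry the alignment length, % identity, and e-value; the remaining fields would come from whatever taxonomy mapping is chosen for the reference database.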
Additional documentation is required to indicate how the default completeness and contamination filters can be changed.
When the CheckM completeness and contamination estimates are viewed in ARB, they have a precision of 6 decimal places. Is it possible to limit this to a single decimal place by modifying the ARB import filter and/or how the value is stored in the gtdblite_concatenated.arbtxt file?
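If the fix is made on the writing side, something like the following would do it. This is only a sketch: format_estimate is a hypothetical helper, and I am assuming the values are written out as plain text fields, without knowing how gtdblite actually builds the .arbtxt file.

```python
def format_estimate(value, decimals=1):
    """Format a completeness or contamination estimate with fixed precision."""
    return "{:.{}f}".format(float(value), decimals)
```

For example, format_estimate("98.276543") gives "98.3".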
It would be good to incorporate all metadata available at IMG into the GTDB. This can be downloaded from the IMG website as a CSV file. It is a bit of a pain to download it, but it is possible.
When a genome has multiple copies of a marker gene to be used for phylogenetic inference, one is currently selected (best e-value?). In practice, this may not be the correct marker gene for the genome and could be contamination. It would be better to simply ignore any marker genes that are identified more than once. This is the approach currently taken by CheckM.
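A small sketch of the proposed rule, assuming the marker hits for one genome are available as (marker_id, gene_id) pairs. The function name and data layout are hypothetical, and this is not taken from the CheckM or gtdblite source.

```python
from collections import Counter


def single_copy_hits(hits):
    """Keep only markers that were identified exactly once in a genome.

    hits: iterable of (marker_id, gene_id) pairs for a single genome.
    Returns a dict mapping each single-copy marker_id to its gene_id;
    markers found more than once are dropped entirely.
    """
    hits = list(hits)
    counts = Counter(marker_id for marker_id, _ in hits)
    return {
        marker_id: gene_id
        for marker_id, gene_id in hits
        if counts[marker_id] == 1
    }
```

Markers dropped this way would be treated as missing for that genome in the concatenated alignment, rather than risking a contaminating copy.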
It would be useful to allow users to export a list of genomes as FASTA files. Alternatively, we can just give users read access to the directory of population genomes. Either way, people need some way to access these things! :)