ecogenomics / gtdblite Goto Github PK

This project forked from askars/gtdblite

Perl 10.76% Python 89.24%

gtdblite's People

Contributors

Watchers

Forkers

ctskennerton shulp2211

gtdblite's Issues

default privacy setting should be public

It would likely be best if the default privacy setting for creating marker lists or adding genomes was public. In general, we are looking to share our data across ACE and build synergy between different projects. Only in some rare cases would I expect a user to need/want to make their data private.

Dereplicate common species

A number of species are represented by an excessive number of genomes (e.g. C. difficile, E. coli). It would be beneficial to dereplicate these species for the purposes of inferring a genome tree. This would reduce the time required to infer the tree and make the tree easier to visualize. Some care needs to be taken, as any removed taxa may need to be updated when changes are made to the taxonomy. We also need to make sure we only remove taxa that will not be of interest to users. As a start, I suggest dereplicating only well-established species, and only genomes from IMG.
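A minimal sketch of the conservative starting point suggested above, assuming each genome is available as a (genome_id, species, source) tuple. The function name, the tuple layout, and the `max_per_species` cap are hypothetical and not part of gtdblite. Genomes are kept in input order here; a real implementation would more likely choose which genomes to keep by quality.

```python
def dereplicate(genomes, species_to_dereplicate, max_per_species=1):
    """Return the genome ids to keep for tree inference.

    genomes: iterable of (genome_id, species, source) tuples (assumed layout).
    species_to_dereplicate: set of well-established species names to thin out.
    Only genomes whose source is "IMG" are ever dropped.
    """
    kept = []
    seen = {}  # species -> number of IMG genomes already kept
    for genome_id, species, source in genomes:
        if species in species_to_dereplicate and source == "IMG":
            seen[species] = seen.get(species, 0) + 1
            if seen[species] > max_per_species:
                continue  # over the cap for this species, so skip it
        kept.append(genome_id)
    return kept
```

Because non-IMG genomes and species outside the given set are always kept, user-submitted genomes are never removed by this step.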

Better error reporting when making trees

When prodigal isn't in your path you get a very vague error message which, due to the parallel nature of the code, points to an uninformative part of the source (see below). Checking for the required external dependencies before starting the calculation would help a lot.

src/gtdblite.py trees create --output ~/test_genome_tree --all_genomes --marker_set_ids 3 --no_tree
24377 genomes contain 975080 uncalculated markers.
These markers need to be calculated in order to build the tree. More markers means more waiting. Continue using 1 threads? (y/N): y
Breaking calculation into 49 chunks of up to 500 genomes.
Calculating chunk 1 of 49....
Prodigal complete for 102 of 500 genomes (chunk 1 of 49),
Prodigal complete for 189 of 500 genomes (chunk 1 of 49),
Prodigal complete for 276 of 500 genomes (chunk 1 of 49),
Prodigal complete for 363 of 500 genomes (chunk 1 of 49),
Prodigal complete for 449 of 500 genomes (chunk 1 of 49),
Prodigal complete for 500 of 500 genomes (chunk 1 of 49),
Exception caught. Dumping info.
Traceback (most recent call last):
  File "src/gtdblite.py", line 664, in <module>
    result = args.func(db, args)
  File "src/gtdblite.py", line 128, in CreateTreeData
    return db.MakeTreeData(marker_id_list, genome_id_list, args.out_dir, "gtdblite", args.profile, profile_config_dict, not(args.no_tree))
  File "/export/data1/sw/GTDBLite/src/gtdblite/GenomeDatabase.py", line 1550, in MakeTreeData
    prodigal_dir = async_result.get()
  File "/opt/qiime/1.8.0/python-2.7.3-release/lib/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
OSError: [Errno 2] No such file or directory
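
One way the up-front check requested above could look, as a sketch rather than the project's actual code. The function names are hypothetical. `shutil.which` needs Python 3.3 or later; on the Python 2.7 interpreter shown in the traceback, `distutils.spawn.find_executable` would be the closest equivalent.

```python
import shutil


def find_missing_programs(programs):
    """Return the names in `programs` that cannot be found on PATH."""
    return [name for name in programs if shutil.which(name) is None]


def check_dependencies(programs=("prodigal",)):
    """Raise a clear error before any worker processes are started."""
    missing = find_missing_programs(programs)
    if missing:
        raise RuntimeError(
            "Required program(s) not found on PATH: " + ", ".join(missing)
        )
```

Calling `check_dependencies()` before the genomes are split into chunks would replace the bare `OSError: [Errno 2]` with a message that names the missing program.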

16S top hit

It would be extremely helpful to have 16S blast hits for all genomes in the GTDB. This should include the alignment length, % identity, e-value, taxonomy string, database identifier, and NCBI accession number of the top hit. It would also be nice to have this information for the 2nd best hit. Perhaps doing homology search against the latest SILVA database with Phil's Greengenes taxonomy mapped to this dataset would be best.
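
As an illustration of how the best and second-best hits could be collected, the sketch below parses BLAST tabular output (`-outfmt 6` with its default twelve columns) and keeps the highest-scoring hits per query. The function name and the returned dictionary keys are hypothetical. The taxonomy string and NCBI accession are not part of the default columns and would have to be looked up separately from the subject identifier.

```python
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def top_hits(lines, n=2):
    """Return the n best hits (by bit score) for each query sequence."""
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        hits[row["qseqid"]].append({
            "subject": row["sseqid"],
            "percent_identity": float(row["pident"]),
            "alignment_length": int(row["length"]),
            "evalue": float(row["evalue"]),
            "bitscore": float(row["bitscore"]),
        })
    return {
        query: sorted(found, key=lambda h: h["bitscore"], reverse=True)[:n]
        for query, found in hits.items()
    }
```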

how are the completeness and contamination filters specified

Additional documentation is required to indicate how the default completeness and contamination filters can be changed.

Excessive precision for completeness and contamination fields in ARB filter

When the CheckM completeness and contamination estimates are viewed in ARB, they have a precision of 6 decimal places. Is it possible to limit this to a single decimal place by modifying the ARB import filter and/or how the value is stored in the gtdblite_concatenated.arbtxt file?
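
If the fix is made on the writing side, rounding the value when it is formatted would be enough. A minimal sketch, assuming the estimates are available as numbers or numeric strings at the point where the arbtxt file is written; the helper name is hypothetical.

```python
def format_percentage(value, decimals=1):
    """Format a completeness or contamination estimate with fixed precision."""
    return "{:.{}f}".format(float(value), decimals)
```

For example, `format_percentage(98.123456)` gives `"98.1"`.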

IMG metadata

It would be good to incorporate all metadata available at IMG into the GTDB. This can be downloaded from the IMG website as a CSV file. It is a bit of a pain to download, but it is possible.

Improved marker genes selection for phylogenetic inference

When a genome has multiple copies of a marker gene to be used for phylogenetic inference, one copy is currently selected (best e-value?). In practice, the selected copy may not be the correct marker gene for the genome and could be contamination. It would be better to simply ignore any marker gene that is identified more than once in a genome. This is the approach currently taken by CheckM.
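
The proposed rule can be sketched as follows, assuming the hits for one genome are available as (marker_id, gene_id) pairs. The function name and the data layout are hypothetical and do not reflect how gtdblite or CheckM store hits.

```python
from collections import Counter


def single_copy_hits(hits):
    """Keep only markers that were identified exactly once in a genome.

    hits: iterable of (marker_id, gene_id) pairs for a single genome.
    Returns a dict mapping marker_id to gene_id for single-copy markers.
    """
    hits = list(hits)
    counts = Counter(marker_id for marker_id, _ in hits)
    return {
        marker_id: gene_id
        for marker_id, gene_id in hits
        if counts[marker_id] == 1
    }
```

A marker found twice is dropped entirely instead of being resolved by e-value, so the genome is simply treated as missing that marker in the alignment.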

Export genomes as fasta files

It would be useful to allow users to export a list of genomes as fasta files. Alternatively, we can just give users read access to the directory of population genomes. Either way, people need some way to access these things! :)
