ecogenomics / gtdbncbi

The GTDB provides the software infrastructure for working with a large collection of genomic resources. The major goal of this initiative is to provide a phylogenetically consistent taxonomy for archaea and bacteria.

Home Page: https://gtdb.ecogenomic.org/

License: GNU General Public License v3.0

Python 94.27% PLpgSQL 4.79% Perl 0.84% Shell 0.10%

gtdbncbi's Issues

Flat file indicating path to all GTDB genomes

It would be nice to have an export function (perhaps in the 'metadata' menu) that dumps a flat file listing the absolute path to every genome directory currently in the GTDB. This is non-trivial at the moment because genomes are spread across user, RefSeq, and GenBank directories. Such a file would be useful for downstream applications that need access to the genomic data of all genomes in the GTDB at a specific point in time.

Update User genome taxonomy

Currently, User genomes are only assigned a GTDB taxonomy at each release cycle. This could be improved by periodically (say, every 2 weeks) inferring a tree covering all User genomes of sufficient quality and using tax2tree to automatically assign a valid taxonomy string to User genomes that lack one.

Representative genomes should never be filtered

The selection criteria for GTDB representatives are becoming increasingly complex. As a result, some representatives have relatively poor CheckM quality estimates. Such genomes may be known reduced genomes, or genomes present as a single ungapped chromosome that are almost certainly complete (though they may not pass the strict quality thresholds).

Selected representatives should NEVER be filtered out of the GTDB trees. These are considered trusted genomes. If the user requests representatives, they should always get the complete set.

Help menu should separate out required and optional parameters

It would be helpful if the help menus indicated required and optional parameters in the same manner done for mingle.

Automatically annotate new User genomes as Bacteria or Archaea

It is often convenient to infer trees for just bacterial or archaeal genomes. This can be done with the taxa filter option of tree create. However, to ensure all genomes in a domain are considered, genomes need to be automatically annotated as Bacteria or Archaea. This is possible by counting the genes identified in the domain-specific canonical alignments and assigning each genome to the domain with the highest percentage of identified genes. Informal testing indicates that high-quality genomes have <20% of genes in the incorrect domain marker set.
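
A minimal sketch of the assignment rule described above, assuming the per-genome counts of identified marker genes are already available. The function name, its arguments, and the tie-break towards Bacteria are hypothetical and not taken from the GTDB code base.

```python
def assign_domain(bac_hits, bac_total, ar_hits, ar_total):
    """Assign a genome to the domain whose marker set has the higher
    percentage of identified genes.

    bac_hits / ar_hits:   markers identified in the genome (assumed inputs)
    bac_total / ar_total: total markers in each domain-specific set
    """
    if bac_total <= 0 or ar_total <= 0:
        raise ValueError("marker set sizes must be positive")

    bac_perc = 100.0 * bac_hits / bac_total
    ar_perc = 100.0 * ar_hits / ar_total

    # Ties go to Bacteria here; this is an arbitrary choice for the sketch.
    domain = "Bacteria" if bac_perc >= ar_perc else "Archaea"
    return domain, bac_perc, ar_perc
```

Returning both percentages alongside the domain makes it easy to flag genomes where the losing marker set is unexpectedly well represented.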

Tools.runMultiProdigal() in AddManyFastaGenomes() is fixed at 2 processes

Prodigal can be run in parallel for each fasta file. The framework is set up for this, but the number of processors is currently fixed at 2. It is unclear if this is intentional or was set for testing purposes.
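
One way the process count could be exposed as a parameter is sketched below. This is not the existing Tools.runMultiProdigal() implementation: the function names, the output file naming, and the `runner` hook are assumptions made for illustration. Threads are sufficient here because each worker only waits on an external Prodigal process.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def prodigal_command(fasta_file, out_dir):
    """Build a Prodigal command line for one genome (output names are assumed)."""
    prefix = os.path.join(out_dir, os.path.splitext(os.path.basename(fasta_file))[0])
    return ["prodigal", "-q",
            "-i", fasta_file,
            "-a", prefix + "_protein.faa",   # translated genes
            "-o", prefix + "_protein.gff",   # gene coordinates
            "-f", "gff"]


def run_multi_prodigal(fasta_files, out_dir, processes=2, runner=None):
    """Run Prodigal on each fasta file using a configurable number of workers."""
    if processes < 1:
        raise ValueError("processes must be >= 1")
    if runner is None:
        # Default: execute the command and raise if Prodigal fails.
        runner = lambda cmd: subprocess.run(cmd, check=True)

    commands = [prodigal_command(f, out_dir) for f in fasta_files]
    with ThreadPoolExecutor(max_workers=processes) as pool:
        # list() forces evaluation so worker exceptions propagate to the caller.
        list(pool.map(runner, commands))
    return commands
```

The `processes` argument could then be wired to the global thread flag, e.g. `run_multi_prodigal(files, out_dir, processes=40)`.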

Proper handling of removed/deleted genomes

A scheme for properly handling removed/deleted genomes needs to be determined. Fully deleting genomes may be problematic as lab members may be using these. Users do need to delete genomes though as they will often deem some genomes to be of poor quality and submit improved versions.

Calculating and storing metadata is time consuming

It can take several hours to calculate and store metadata for large numbers of genomes. This may be due to the size of the database transaction. Can this be improved? Is it simple enough to do this in parallel?

[2016-02-06 17:41:39] INFO: GTDB v0.0.2 (NCBI database 2015-11-27)
[2016-02-06 17:41:39] INFO: gtdb -t 40 genomes add --create_list abisko_assembly73_bins --checkm_results ../checkm/CHECKM_FILE --batchfile batchfile --study_file study_file
[2016-02-06 17:41:39] INFO: Adding genomes to database.
[2016-02-06 17:41:39] INFO: Parsing Study file.
[2016-02-06 17:41:39] INFO: Reading CheckM file.
[2016-02-06 17:44:26] INFO: Running Prodigal to identify genes.
==> Finished processing 1529 of 1529 (100.00%) genomes.
[2016-02-06 18:07:15] INFO: Calculating and storing metadata for each genome.
[2016-02-07 00:33:34] INFO: Identifying TIGRfam protein families.
==> Finished processing 1529 of 1529 (100.00%) genomes.
[2016-02-07 04:12:12] INFO: Identifying Pfam protein families.
==> Finished processing 915 of 1529 (59.84%) genomes.

Note the time taken for "Calculating and storing metadata for each genome." though.

Additional metadata about marker sets would be helpful

It would be nice to dynamically add the following to the ARB and GTDB metadata files for each genome:

marker_gene_count: number of single copy genes identified for the marker set
marker_genes_in_set: total number of genes in the marker set
msa_aa_count: number of amino acids in the MSA (after all filtering)
msa_length: total length of the MSA (after all filtering)
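
The four fields listed above could be derived roughly as follows. This is a sketch under assumptions: `marker_hits` (marker id to number of hits in the genome) and `msa` (the genome's filtered alignment row, with '-' for gaps) are hypothetical inputs, and "single copy" is interpreted as a marker hit exactly once.

```python
def marker_set_metadata(marker_hits, msa):
    """Compute per-genome marker set statistics.

    marker_hits: dict of marker id -> number of hits in this genome
                 (every marker in the set is expected as a key)
    msa:         aligned sequence for this genome after all filtering
    """
    return {
        # markers found exactly once in the genome
        "marker_gene_count": sum(1 for n in marker_hits.values() if n == 1),
        "marker_genes_in_set": len(marker_hits),
        # aligned positions that hold a residue rather than a gap
        "msa_aa_count": sum(1 for c in msa if c != "-"),
        "msa_length": len(msa),
    }
```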

Flag is needed to indicate missing markers in aligned_markers table

Currently, the aligned_markers table indicates if a marker is present multiple times (multiple_hits). It would be useful to have an additional flag (say, missing_marker) to indicate that a marker was not identified. This can be inferred from the missing evalue and bitscore entries, but this makes for some awkward code.

Generate metadata for new genomes

Metadata needs to be generated for newly submitted user genomes. This includes all "nucleotide" and "gene" metadata, along with 16S taxonomic classification.

GenBank genomes without annotations may be annotated in RefSeq

There are GenBank assemblies (e.g., GCA_000215505.2_ASM21550v2) that do not have called genes. This occurs when users submit genomes without annotations. In at least some cases, these genomes have been put into RefSeq by NCBI and have been annotated (e.g., GCF_000215505.1_ASM21550v2). It would be good to identify these cases and take the RefSeq genomes instead of the GenBank genomes.

Response from NCBI:

Submitters of genomic sequences (WGS) may or may not provide their own annotation and they can even request NCBI to annotate the sequences for them. Please see the first paragraph on the WGS page: https://www.ncbi.nlm.nih.gov/genbank/wgs

The Anaplasma marginale assembly that you are looking at is at the "contig" assembly level (nothing is assembled beyond the level of sequence contigs):
http://www.ncbi.nlm.nih.gov/assembly/GCF_000215505.1/

The corresponding GenBank contigs have not been annotated. NCBI (RefSeq) took the GenBank sequence (61 contigs in this case) and annotated these through the NCBI Prokaryotic Genome Annotation Pipeline.

Check uniqueness of NCBI assembly identifiers

There are currently two NCBI assembly identifiers which are not unique. This appears to be a processing problem at NCBI as these assemblies are identical. It would be good to have a script to identify this problem when we sync with NCBI and remove or flag these erroneous assemblies.

Example:
GCA_000569115.1_ASM56911v1
GCA_000569115.1_None
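
A check of this kind could be sketched as below, assuming assembly identifiers follow the form seen in the example (accession and version, then an assembly name after the second underscore). The function name is hypothetical.

```python
from collections import defaultdict


def find_duplicate_accessions(assembly_ids):
    """Group assembly identifiers by accession and report accessions seen more than once.

    'GCA_000569115.1_ASM56911v1' and 'GCA_000569115.1_None' both reduce to
    the accession 'GCA_000569115.1'.
    """
    by_accession = defaultdict(list)
    for assembly_id in assembly_ids:
        # Keep only the prefix and the versioned number, e.g. GCA_000569115.1
        accession = "_".join(assembly_id.split("_", 2)[:2])
        by_accession[accession].append(assembly_id)

    return {acc: ids for acc, ids in by_accession.items() if len(ids) > 1}
```

Run against the directory names pulled during a sync, the returned dictionary lists the assemblies to remove or flag.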

Clean-up partial genomes if adding a new genome fails

A lot of files are produced when adding a new genome into the database. These files and the genome directory itself need to be removed if any step during the addition of the genome fails. It would be very bad to have genome directories around that contain only part of an added genome.

I think a clean solution to this would be to write all files to a temporary directory in /tmp first. If processing fails at any point, an exception needs to be caught and this directory removed. Otherwise, the directory as a whole can be moved into the GTDB directory structure.
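
The temporary-directory approach proposed above might look like the following sketch. `add_genome_atomically` and the `build_files` callback are hypothetical names; the callback stands in for whatever steps write the genome's files.

```python
import os
import shutil
import tempfile


def add_genome_atomically(final_dir, build_files):
    """Write all genome files to a temporary directory, then move it into place.

    build_files: callable taking the temporary directory path and writing
                 every file for the genome into it (assumed interface).
    """
    if os.path.exists(final_dir):
        raise FileExistsError("genome directory already exists: " + final_dir)

    tmp_dir = tempfile.mkdtemp(prefix="gtdb_")
    try:
        build_files(tmp_dir)
        # Only reached if every processing step succeeded.
        shutil.move(tmp_dir, final_dir)
    except Exception:
        # Any failure removes the partial genome before re-raising.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return final_dir
```

If the temporary directory and the GTDB directory structure sit on different file systems, the final move is a copy followed by a delete rather than a rename, so a failure during the move itself would still need separate handling.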

Remove marker set

There is currently no way to remove a marker set through the user interface. It is also unclear if a DB admin can just remove a marker set from the marker_set table without this causing downstream issues.

Database and file structure integrity test

It is important that the database and genome directory structure be in sync. In particular, any genome listed in the database should be present in the genome directory structure and have all expected files present in the directory. Similarly, all user genomes should appear in the database. Some NCBI genomes will not show up in the database, but these will be genomes missing called proteins. A script or hidden command in the GTDB code base to sanity check that all this is in order would be good.
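
A sanity check along these lines could start from something like the sketch below. It assumes the caller can supply a mapping of genome id to directory from the database and a list of genome directories found on disk; both inputs, the function name, and the default list of expected file suffixes are assumptions.

```python
import os


def check_integrity(db_genomes, disk_dirs,
                    expected_suffixes=("_genomic.fna", "_protein.faa")):
    """Compare database records against the genome directory structure.

    db_genomes: dict of genome id -> genome directory recorded in the database
    disk_dirs:  iterable of genome directories that exist on disk
    Returns a list of (identifier, problem) tuples.
    """
    problems = []

    # Every genome in the database needs a directory with all expected files.
    for genome_id, genome_dir in sorted(db_genomes.items()):
        if not os.path.isdir(genome_dir):
            problems.append((genome_id, "missing directory"))
            continue
        files = os.listdir(genome_dir)
        for suffix in expected_suffixes:
            if not any(f.endswith(suffix) for f in files):
                problems.append((genome_id, "missing *" + suffix))

    # Every directory on disk should correspond to a database record.
    known = {os.path.normpath(d) for d in db_genomes.values()}
    for genome_dir in sorted(disk_dirs):
        if os.path.normpath(genome_dir) not in known:
            problems.append((genome_dir, "not in database"))

    return problems
```

For NCBI genomes, "not in database" entries would need a second look, since genomes missing called proteins are expected to be absent.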

Need to reannotate Pfam and TIGRfam for all NCBI genomes

The NCBI gene calling is too conservative for our needs. As such, Prodigal has been used to call genes on all GenBank and RefSeq genomes. CheckM estimates have already been updated to reflect these new genes. Pfam and TIGRfam annotations should also be recalculated and the GTDB updated to reflect these new annotations (i.e., all existing alignments removed and recalculated).

Universal marker set

The GTDB needs a universal marker set defined over TIGRFAM and Pfam HMMs.

Check revised annotation for all type strains

The annotation of type strains should only be modified in order to maintain monophyly or when multiple type strains exist due to erroneous prior annotations. A check should be made to identify any type strains that have been assigned to a new genus, and these should be verified for correctness (either automatically or manually).

Translation table

Prodigal should report which translation table was used to call genes. This data should be recorded in the GTDB.

Trimming columns with insufficient taxa is slow

When inferring a tree, columns with insufficient representation across the taxa are trimmed. This step is currently very slow and can likely be improved. On the release 75 dereplicated bacterial tree it takes 25 minutes:

[2016-04-06 01:48:22] INFO: Trimming columns with insufficient taxa.
[2016-04-06 02:13:09] INFO: Trimmed alignment from 41155 to 41155 AA.
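
For reference, the trimming step itself can be expressed as a single pass over the alignment columns. The sketch below is not the current GTDB implementation and makes no claim about its actual bottleneck; the function name and the `min_perc_taxa` threshold are assumptions. A NumPy character array would be the more usual choice for an alignment of this size.

```python
def trim_columns(seqs, min_perc_taxa=50.0):
    """Remove alignment columns present in too few taxa.

    seqs:          dict of genome id -> aligned sequence ('-' marks a gap)
    min_perc_taxa: minimum percentage of taxa that must have a residue
                   in a column for it to be kept (assumed parameter)
    """
    if not seqs:
        return {}
    names = list(seqs)
    if len({len(seqs[n]) for n in names}) != 1:
        raise ValueError("all sequences must have the same aligned length")

    min_taxa = len(names) * min_perc_taxa / 100.0

    # zip(*rows) yields one tuple per alignment column.
    columns = zip(*(seqs[n] for n in names))
    keep = [i for i, col in enumerate(columns)
            if len(col) - col.count("-") >= min_taxa]

    return {n: "".join(seqs[n][i] for i in keep) for n in names}
```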

Additional CheckM statistics

It would be helpful to add in the following CheckM statistics: marker lineage (checkm_marker_lineage), # genomes (checkm_genome_count), # markers (checkm_marker_count), # marker sets (checkm_marker_set_count), strain heterogeneity (checkm_strain_heterogeneity). This information could be stored in the genomes table, though it might be better to create a checkm table or put this in the metadata_genes table.

PfamSearch and TigrfamSearch to GenomeTk project

At some point, it would probably make sense to move the Pfam and TIGRFAM searches out of the GenomeTreeDB code base and into the GenomeTk code base. It is generally useful to be able to annotate genomes against Pfam and TIGRFAM, so it would be nice to expose this code via the GenomeTk.

Possibility to download SSU sequences

With the SSU sequences now stored in the database (metadata_ssu table), it would be great to be able to download all of them into a Fasta file.

The format of this Fasta file will be as follows:

genome_id|contig_id gtdb_taxonomy
sequence
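
A writer for this format might look like the sketch below. The record tuple layout and function name are assumptions, and the leading '>' on the header line is added because Fasta headers require it, even though it does not appear in the format shown above.

```python
def write_ssu_fasta(records, out_handle):
    """Write SSU sequences as Fasta.

    records:    iterable of (genome_id, contig_id, gtdb_taxonomy, sequence)
                tuples, e.g. rows pulled from the metadata_ssu table
    out_handle: open, writable text file handle
    Returns the number of sequences written.
    """
    count = 0
    for genome_id, contig_id, gtdb_taxonomy, sequence in records:
        out_handle.write(">%s|%s %s\n" % (genome_id, contig_id, gtdb_taxonomy))
        out_handle.write(sequence + "\n")
        count += 1
    return count
```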

Unknown temporary gene file created by prodigal

The Prodigal code currently creates a temporary genes file called genes_id_modified.faa. I am unclear why this is necessary.

Ownership of genomes and other files in GTDB

Ideally, all genomes and other generated files that are added to the GTDB would be owned by gtdb_dev. Files should be set to read only for all other users and groups. This is basically a proactive step to ensure these genomes are not accidentally deleted, modified, or moved by individuals, which would break the GTDB.

Logging framework

A logging framework should be established to keep track of what users are doing with the GTDB. In addition, the output of the GTDB should be refined to make it clear to users what was done. Ideally, they would be able to save the output of the GTDB as a record of exactly when their command was run, what the state of the database was, and what external programs (and versions) were executed. Basically, everything that is required to write a paper.

Automatic assignment of GTDB taxonomy for clustered genomes

A genome assigned to a representative should have the same GTDB taxonomy as the representative. Currently, this is enforced externally to the GTDB by updating all genomes with appropriate taxonomy strings whenever the taxonomy is updated. This works fine, except that newly added User genomes assigned to a representative will not automatically obtain a valid GTDB taxonomy until the taxonomy is updated. It would be better if such genomes were automatically given a GTDB taxonomy whenever assigned to a representative.

Need gtdb_cluster_size and gtdb_clustered_genomes fields in metadata_view table

There are two additional columns that would be good to have in the metadata_view table. These fields relate to representative genomes and I am currently calculating them "on-the-fly" for the ARB output file. The two fields are:

gtdb_cluster_size: the number of genomes in a cluster. This should be 0 if a genome is not a representative. It should equal the size of the cluster if the genome is a representative. Please note that I consider a representative genome to be in its own cluster. That is, if a representative genome only clustered with one other genome, the gtdb_cluster_size field is 2. Similarly, if a representative did not cluster with any genomes, it still has a size of 1.
gtdb_clustered_genomes: this is simply a comma separated list of genomes in a cluster. If a genome is not a representative the field should be None. For representative genomes it is just a list of all genomes in the cluster (including the representative itself!)

These are currently being calculated by the TreeManager. See the SQL query around line 207 and the final assignment to these fields on lines 240 and 241. It would be far better if these fields were in the metadata_view table so they didn't need to be calculated "on-the-fly", but more importantly so they would appear in the metadata table produced by ">gtdb metadata export".
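
The semantics of the two fields can be summarised in a few lines of code. This sketch is not the TreeManager code referred to above; the function name and input structures are hypothetical.

```python
def cluster_fields(all_genomes, representatives, rep_of):
    """Compute (gtdb_cluster_size, gtdb_clustered_genomes) for each genome.

    all_genomes:     iterable of all genome ids
    representatives: set of representative genome ids
    rep_of:          dict of clustered genome id -> its representative id
    """
    # A representative always belongs to its own cluster.
    clusters = {rep: [rep] for rep in sorted(representatives)}
    for genome_id, rep in rep_of.items():
        if genome_id != rep:
            clusters[rep].append(genome_id)

    fields = {}
    for genome_id in all_genomes:
        members = clusters.get(genome_id)
        if members is None:
            # Not a representative: size 0 and no genome list.
            fields[genome_id] = (0, None)
        else:
            fields[genome_id] = (len(members), ",".join(members))
    return fields
```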

Representative user genomes can be deleted

When a representative user genome is deleted, all genomes represented by this genome lose their association with it and are left without a representative.

Field indicating isolate or environmental genome.

It would be helpful to have a field indicating if a genome is an isolate or environmental genome (MAG or SAG). Ideally this field would use a controlled vocabulary of: ISOLATE, MAG, SAG.

I have a script for calculating this, but we would need to formally integrate this into the GTDB:
srv/whitlam/projects1/gtdb/release86/r86_taxonomy/derived_from_metagenomes.py

Change & remove published login details

Hello!

Did you intentionally commit:

GTDBNCBI/scripts_dev/dsmz_scrape/dsmz_api_scraper.py

Lines 52 to 53 in 8db9e1f

	USERNAME = '[email protected]'
	PASSWORD = 'dsmz2017'

? If not, please consider changing your BacDive password and factoring-out those two variables into a file that is .gitignored ;-)

Propagate taxonomic annotations to revised NCBI genomes

NCBI occasionally revises genomes. This results in an incremental update to their assembly accession number (e.g., GCA_1234567.1 to GCA_1234567.2). In such cases, it would be very useful if taxonomic annotations were automatically assigned to the revised genome (i.e., from GCA_1234567.1 to GCA_1234567.2). This should reduce the amount of work required to update the taxonomy with each new pull from NCBI.
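
The propagation step could be sketched as follows, assuming accessions are of the form shown above (a base accession followed by a dot and an integer version). The function name and the dictionary-based inputs are hypothetical.

```python
def propagate_taxonomy(taxonomy, new_accessions):
    """Carry taxonomy strings over to revised assemblies.

    taxonomy:       dict of annotated accession (e.g. GCA_1234567.1) -> taxonomy string
    new_accessions: accessions present after the latest pull from NCBI
    Returns a dict of newly annotated accession -> inherited taxonomy string.
    """
    # Most recent annotated version for each base accession.
    latest = {}
    for accession, tax in taxonomy.items():
        base, _, version = accession.rpartition(".")
        version = int(version)
        if base not in latest or version > latest[base][0]:
            latest[base] = (version, tax)

    propagated = {}
    for accession in new_accessions:
        if accession in taxonomy:
            continue  # already annotated
        base, _, version = accession.rpartition(".")
        previous = latest.get(base)
        # Only inherit from an older version of the same assembly.
        if previous is not None and int(version) > previous[0]:
            propagated[accession] = previous[1]
    return propagated
```

Inherited annotations may still warrant review, since a revision can change the assembly content.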

Inherit flags on CLI interface

It would be nice if flags such as the number of threads (-t) were 'inherited' so that they appeared in all help menus (e.g., gtdb create tree) and could be invoked as a typical flag (e.g., gtdb create tree -t 32 instead of gtdb -t 32 create tree).
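
If the command line is built on argparse, one way to get this behaviour is a shared parent parser passed to each subcommand via `parents`. The sketch below assumes argparse is in use and invents the `--threads` long option and the parser layout; only `-t` and the `gtdb create tree` command form come from the text above.

```python
import argparse


def build_parser():
    """Build a CLI in which every subcommand inherits the shared -t flag."""
    # Shared flags live in a parent parser; add_help=False avoids a duplicate -h.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-t", "--threads", type=int, default=1,
                        help="number of threads")

    parser = argparse.ArgumentParser(prog="gtdb")
    commands = parser.add_subparsers(dest="command")

    create = commands.add_parser("create")
    create_sub = create.add_subparsers(dest="subcommand")
    # parents=[common] makes -t show up in this subcommand's help menu.
    create_sub.add_parser("tree", parents=[common], help="create a tree")

    return parser
```

With this layout, `build_parser().parse_args(["create", "tree", "-t", "32"])` yields `threads=32`, and `gtdb create tree -h` lists the flag.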

Delete aligned markers with no marker sets

Currently, aligned markers are stored in the database even if the marker set listing them has been removed.
To store only meaningful aligned markers in the database, it would be better to delete all aligned markers that are not associated with any marker set.

Pull genome sequence data from GTDB

We should provide an interface for pulling genomic (i.e., the nucleotide sequences) and gene (i.e., the called genes in amino acid space) data from the GTDB. This would be a straight dump of the *_genomic.fna and/or *_protein.faa files.

Delete genomes using genome list IDs

Currently, genomes can only be deleted by specifying genome IDs. It would be convenient if they could also be removed by specifying one or more genome list IDs, e.g.:

>gtdb genomes delete --list_ids 217,218

Most people have their genomes organized in genome lists and will want to remove the entire list.

Possible contaminating 16S rRNA in genome assemblies

Hi,

I am not sure if this is the best place to report this issue, but I have been seeing some 16S rRNA sequences from the GTDB that were misassigned. One example here:

https://twitter.com/DrGanHM/status/1397083226988978180

Is there some QC work on the SSU sequences from GTDB representative genomes to ensure that they are taxonomically correct prior to release?

Regards,
Gan

GenBank genomes without called genes

Not all genomes are submitted with called genes. For such genomes, we should use Prodigal to do the gene calling. Note that this is only necessary if the genome isn't in RefSeq. All genomes in RefSeq will have called genes.

NCBI versioning in directory structure

Currently, we track NCBI versions using a date (e.g., ncbi_2016_03_16). At this point it would probably be easier to track the RefSeq releases (e.g., release75). We should rename existing directories to reflect this.

Viewing user marker sets is not currently implemented

Creating marker sets is a fairly rare event and in general people will not want/need to use other users' marker sets as they will be very specific to a given project. Nonetheless, it would be good to support this.

>gtdb marker_sets view --owner uqdparks
[2016-02-03 10:55:15] INFO: GTDB v0.0.1 (NCBI database 2015-11-27)
[2016-02-03 10:55:15] INFO: gtdb marker_sets view --owner uqdparks
Database action failed. The following error(s) were reported:
        Viewing other peoples' marker sets not yet implemented.

Genome contains duplicate sequence: delete!

Delete the following genome once the functionality exists:
/uqcskenn/U_947

Parallel submission of Genomes generates an error

If multiple batchfiles are submitted in parallel, a conflict arises when incrementing last_auto_id in the genome_sources table.
Multiple genomes can receive the same last_auto_id if the value is pulled at the same time, which results in a crash of the genome submissions.
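
The usual remedy for this kind of race is to increment and read the counter inside a single write transaction, so that concurrent submissions are serialised. The sketch below illustrates the pattern with sqlite3 purely to stay self-contained; the GTDB database is a different system, and the `name` column and function name are assumptions. In PostgreSQL the same effect is typically achieved with a single `UPDATE ... RETURNING` statement or a row lock.

```python
import sqlite3


def next_auto_id(conn, source_name):
    """Atomically increment and return last_auto_id for a genome source.

    conn must be opened with isolation_level=None so that transactions
    are controlled explicitly here.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two callers cannot
    # both read the same value before either of them updates it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "UPDATE genome_sources SET last_auto_id = last_auto_id + 1 "
            "WHERE name = ?", (source_name,))
        row = conn.execute(
            "SELECT last_auto_id FROM genome_sources WHERE name = ?",
            (source_name,)).fetchone()
        if row is None:
            raise KeyError("unknown genome source: " + source_name)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0]
```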

Relative paths to NCBI and user genomes

In order to set up a completely independent development environment, it would be helpful if the paths to NCBI and user genomes stored in the database were not absolute paths. The root directory of the genomes should be read from Config.py and prepended to a relative path stored in the database.

Versioning

The GTDB needs to be versioned both on the software framework and on when NCBI genomes were last sourced. The software will continue to improve between updates of NCBI, so it is important to decouple these two things.

Export representatives

It would be useful to allow representative information to be exported to a file. Specifically, for each representative genome it would be good to list all the genomes it represents. This could be a simple two column TSV file where the first column is the representative genome ID and the second column is a comma separated list of genomes clustered with the representative.

Rename older user genomes.

Old user genomes have the genome file in the format <external_genome_id>.fasta. It would be better if these were <external_genome_id>_genomic.fna. This is the format used by NCBI genomes and newly added user genomes.

Custom labels for genome and marker lists

It would be helpful if users could specify custom names/labels for the genome and marker lists they create. It is always a bit tough to remember that the bacterial marker set is id=2, the archaeal set is id=3, and the cyano set is id=7. It would be much nicer if these were 'bac', 'ar', and 'cyano'. Same for the genome lists. Ideally, the order the genome and marker sets appear would be configurable by developers. It would be fine if this required direct modification of the database.

Respect provided called genes file when users add a genome

Currently, genes are called for all genomes using Prodigal even if the user supplies a genes file. Some metadata requires a GFF file to calculate so some care will need to be taken when using a user supplied genes file.