nhoffman / ya16sdb

A curated subset of 16S rRNA sequences from NCBI

Shell 1.54% Python 97.46% Dockerfile 1.00%

ya16sdb's Introduction

Yet Another 16S rRNA database

ya16sdb is a pipeline for downloading, curating, and annotating a database of bacterial 16S rRNA sequences. This repository also implements a web application (https://ya16sdb.labmed.uw.edu/) that can be used to visualize the distance-based relationships among sequences for a given species.

The purpose of the project is to provide a high-quality source of bacterial 16S rRNA sequences that is up to date with NCBI, in a format that is useful as input for various bioinformatics pipelines such as blast searching, phylogenetic reference set creation, and sequence-based taxonomic assignment.

Project information

This project is a product of ongoing research interests of Noah Hoffman (https://faculty.washington.edu/ngh2/home/pages/software.html) at the University of Washington in the Department of Laboratory Medicine.

Christopher Rosenthal is the primary author of the pipeline.

The pipeline relies heavily on taxtastic (https://github.com/fhcrc/taxtastic) and deenurp (https://github.com/fhcrc/deenurp), both of which began as collaborations with Erick Matsen at The Fred Hutchinson Cancer Research Center in Seattle, WA.

Please cite this project as

Rosenthal C and Hoffman NG. 2019. ya16sdb: a pipeline for creating a collection of high-quality bacterial 16S rRNA sequences from NCBI. Version 0.6.1. University of Washington. https://github.com/nhoffman/ya16sdb

Overview

At a high level, this pipeline does the following:

Downloads annotation for all available sequence records from the NCBI matching search terms for 16S rRNA.
Retrieves sequence records for corresponding full-length (or near full-length) 16S rRNA genes; this involves extracting subsequences from genome sequences or contigs.
Ensures that all records are 16S rRNA genes.
Ensures that sequences are in a consistent orientation.
Identifies the taxonomic lineage of each record.
Annotates records as a "type strain" (according to NCBI's definition of type strain), "published" (annotation has an accompanying PubMed ID), "refseq" (belonging to the Genbank refseq collection), or "direct" (direct submissions).
Discards records likely to be mis-annotated using deenurp filter-outliers.
Provides various subsets of annotated sequences. Each record subset provides sequence metadata, sequences, taxonomic lineages, and a blast database. For example:
- only records with taxonomic name consistent with species-level classifications
- type strains only
- outliers removed
- downsampled to a subset of sequences for each species, prioritizing type strains and "published" records.
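
The downsampling described in the last item can be illustrated with a minimal sketch. This is not the pipeline's implementation: the field names (species, is_type, is_published) and the max_per_species parameter are assumptions made for the example.

```python
from collections import defaultdict


def downsample(records, max_per_species=2):
    """Keep at most max_per_species records per species, preferring type
    strains, then "published" records (hypothetical field names)."""
    by_species = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append(rec)

    kept = []
    for species in sorted(by_species):
        # False sorts before True, so negating the flags puts type strains
        # first and published records second; sorted() is stable.
        ranked = sorted(
            by_species[species],
            key=lambda r: (not r["is_type"], not r["is_published"]),
        )
        kept.extend(ranked[:max_per_species])
    return kept


records = [
    {"seqname": "a", "species": "562", "is_type": False, "is_published": False},
    {"seqname": "b", "species": "562", "is_type": True, "is_published": False},
    {"seqname": "c", "species": "562", "is_type": False, "is_published": True},
    {"seqname": "d", "species": "1313", "is_type": False, "is_published": False},
]
selected = [r["seqname"] for r in downsample(records)]
```

With these toy records, the unannotated record "a" is dropped in favor of the type strain "b" and the published record "c".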

Docker

A Docker image can be built with the following:

docker build --tag ya16sdb:latest .

Once a Docker image has been built, a Singularity image can be built using the Docker daemon:

singularity build ya16sdb.img docker-daemon://ya16sdb:latest

A Singularity image can also be built using a Singularity Docker container:

docker run --volume /var/run/:/var/run/ --volume $(pwd):$(pwd) --workdir $(pwd) singularity:latest build ya16sdb.img docker-daemon://ya16sdb:latest

Pipeline execution

The virtual containers have a predefined entry point to the SConstruct pipeline file.

To execute using Docker, only a settings.conf file is required. The pipeline can be run as follows:

docker run --volume $(pwd):$(pwd) --workdir $(pwd) ya16sdb:latest

And with Singularity:

singularity run --bind $(pwd) --pwd $(pwd) ya16sdb.img


ya16sdb's Issues

bin/match_hits.py look top hit within same species

New algo:

Expand vsearch to top 5-10-20 hits
Select all hits at the single highest pct_id
If a hit exists with the same species taxonomy id, choose that hit. Otherwise, take whatever vsearch returns as the top pct_id hit.
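
A minimal sketch of the selection rule proposed above, assuming the vsearch hits have already been parsed into dictionaries in the order vsearch returned them. The keys (target, pct_id, species_id) and the function name choose_hit are hypothetical and are not taken from bin/match_hits.py.

```python
def choose_hit(hits, query_species_id):
    """Pick one hit from a list ordered as vsearch returned it.

    Each hit is a dict with 'target', 'pct_id' and 'species_id'
    (hypothetical keys). Returns None when there are no hits.
    """
    if not hits:
        return None
    # Keep only the hits tied at the single highest pct_id.
    best_pct = max(h["pct_id"] for h in hits)
    tied = [h for h in hits if h["pct_id"] == best_pct]
    # Prefer a tied hit from the same species as the query.
    for hit in tied:
        if hit["species_id"] == query_species_id:
            return hit
    # Otherwise fall back to the first tied hit in vsearch order.
    return tied[0]


hits = [
    {"target": "t1", "pct_id": 99.8, "species_id": "1351"},
    {"target": "t2", "pct_id": 99.8, "species_id": "1352"},
    {"target": "t3", "pct_id": 99.1, "species_id": "1353"},
]
same_species = choose_hit(hits, "1352")
fallback = choose_hit(hits, "1353")
```

Here same_species is the t2 hit, because it ties for the best pct_id and shares the query's species. fallback is t1, because the only hit from species 1353 is below the top pct_id.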

get tax db credentials from configuration file

Handle RefSeqs derived from other RefSeqs

For example - https://www.ncbi.nlm.nih.gov/nuccore/NR_114600.3

This Issue was raised by @dhoogest

use singularity image for ncbi tools

migrate to ECS

We will provide and build a Dockerfile here and host the image on ghcr, but deploy the app to internal infrastructure using CDK stack configurations maintained elsewhere.

genome identifier

We need to define an identifier that can be used to group records by genome sequencing project: for shotgun assemblies, accessions refer to a contig, and as a result multiple accessions can refer to the same assembly.

Create a runnable Docker image

add sphinx docs

Split dash app legend into separate color and shape legends

Resources: https://plot.ly/python/legend/ and https://plot.ly/python/reference/#layout-legend

create s3 bucket for data, automate upload of feather file

Don't implement the whole dokku stack here; just install to existing infrastructure. Do create an s3 bucket and IAM role to store the feather file.

Add select and deselect all markers (glyphs) option to dash app

generate list of species represented in filter plots

dash app sequence records showing themselves as closest type strain (match_species)

Identified by @marykstewart

Add ANI tax check data to Dash app

Add column to bottom table
Add filter option in top box

Create a Dash Datatable to visualize seq_info.feather file

Related to #48

roll back https://github.com/nhoffman/ya16sdb/pull/31

Apparently the inclusion of the species_group column created some issues for other pipeline uses. We've come up with an alternate plan for building ref feather files in custom shapes, so it isn't necessary to include the NGS16S-consumed columns added in !31

update pipeline Docker file to use python 3.11 and bookworm base

Move dash folder into its own repo

regex bug for long genome accessions

For example, NZ_CAADIT010000001 from accession CAADIT010000001 is transformed by the regular expression in

ya16sdb/bin/extract_genbank.py

Line 26 in f429540

REFSEQ_SOURCE = re.compile(

to ADIT01000000. Results in duplicate records for this genome
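
The pattern at the referenced line is not reproduced in full here, so the following is only a sketch of how this kind of truncation can arise. NARROW is a hypothetical pattern, not the one in bin/extract_genbank.py: it assumes four letters followed by exactly eight digits, which happens to reproduce the reported result. TOLERANT is one possible alternative that also accepts the longer six-letter WGS accession prefixes.

```python
import re

# Hypothetical pattern, NOT the one in bin/extract_genbank.py: four letters
# followed by exactly eight digits, with no anchoring.
NARROW = re.compile(r"[A-Z]{4}\d{8}")

# Sketch of a more tolerant pattern: an optional RefSeq "NZ_" prefix, then
# either four letters and at least eight digits, or six letters and at
# least nine digits, bounded on both sides.
TOLERANT = re.compile(r"\b(?:NZ_)?([A-Z]{4}\d{8,}|[A-Z]{6}\d{9,})\b")

text = "NZ_CAADIT010000001"
narrow_match = NARROW.search(text).group(0)      # truncated accession
tolerant_match = TOLERANT.search(text).group(1)  # full source accession
```

With the narrow pattern, the search slides into the middle of the six-letter prefix and stops after eight digits, giving ADIT01000000. The tolerant pattern returns CAADIT010000001.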

Remove creation of dedicated stack and add deployment to existing dokku instance

Best match 16S type strain record in ya16sdb web interface does not match data in NGS16S output

Hi Chris, linking here to the issue I made in the molmicro github a couple of weeks ago https://gitlab.labmed.uw.edu/molmicro/mkrefpkg/-/issues/83

Will this be part of the upcoming release?

It just occurred to me...is this why you and Noah think there is a misclassified E. faecalis type strain, WRT example NZ_CABGZA010000007_337_1876 (coordinates have changed slightly)? It seems like this must just be the ya16sdb web interface bug. There are no Enterococcus type strains in the NCBI_16s_types_details when that 'Enterococcus faecalis' record is blasted...and that database is unfiltered, so if there was one lurking, it should show up. https://share.labmed.uw.edu/molmicro/sanger/report/2021/09/30/2053025_ad_hoc_20210930150544_NZ_CABGZA010000007_337_1876.html

Some accession feature tables returning duplicate feature coordinates

Create outline for publication

include build instructions for singularity images

All singularity images used in the pipeline should have corresponding build files, or instructions for where to find them (e.g., for deenurp)

gateway timeouts

Will fix in a branch from the 0.7.2 tag and release as 0.7.3

requirements.txt should pin all dependencies

need a bare-dependencies.txt for unpinned, direct dependencies
create requirements.txt from pip freeze after creating clean venv from bare-dependencies.txt
build image from requirements.txt (all deps pinned)

implement "mini" pipeline for testing

We should be able to test the pipeline all the way through with a small subset of NCBI - perhaps by restricting the initial query to a few taxids. I'd suggest doing this first, and then building out the whole pipeline using the mini query.

do not trust list read in from filesystem location

Rather than directly from project data dir.

add settings-example.conf

review dependencies in venv vs singularity images

e.g., can we run all scripts requiring sqlalchemy from the taxtastic image?

Wrong source specified for trusted types blastdb

ya16sdb/SConstruct

Line 669 in 716320f

blast_db(

Should be trusted_type_fa

non-species in dedup/1200bp/named/filtered

I noticed that there are some records in named/filtered without a species-level classification - I would have assumed that these would have been removed between 1200bp and named.

% pwd
/fh/fast/fredricks_d/refpkg/ya16sdb/output/20180402/dedup/1200bp/named/filtered
% git --no-pager log -n1
commit 1f9ad19c6654ed8a46f523a9ef0ddacb27495962
Author: Chris Rosenthal <[email protected]>
Date:   Wed Apr 4 10:53:50 2018 -0700

    adding lineages.txt mothur output per github.com Issue 5
% xsv join tax_id seq_info.csv tax_id taxonomy.csv | xsv search -s species '^$' | xsv select seqname,tax_id,description,rank,species | xsv sort -s description | xsv table -c 50
seqname          tax_id   description                                            rank               species
X87311_1_1445    2049     Actinomycetaceae 16S rRNA gene, isolate SR 139         family
X87313_1_1506    2049     Actinomycetaceae 16S rRNA gene, isolate SR 210         family
X87318_1_1458    2049     Actinomycetaceae 16S rRNA gene, isolate SR 259         family
X87310_1_1504    2049     Actinomycetaceae 16S rRNA gene, isolate SR 272         family
X87617_1_1497    2049     Actinomycete (genus unknown) 16S ribosomal RNA         family
KX773496_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773497_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773498_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773499_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773500_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773503_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773507_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773504_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773505_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773506_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773511_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773512_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773513_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773514_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773515_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773516_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773517_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773518_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773519_1_1540  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773520_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773527_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh331F...  species_group
X80733_1_1447    561      Escherichia sp. gene for 16S rRNA                      genus
X92362_1_1464    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate 33...  genus
X92361_1_1483    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate 4S...  genus
X92360_1_1415    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate A7...  genus
X92363_1_1482    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate B-...  genus
X92358_1_1479    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate G1...  genus
X92364_1_1477    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate G1...  genus
X92366_1_1461    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate G6...  genus
X87321_1_1490    2049     Micromonosporaceae 16S rRNA gene, isolate SR 53        family
X87322_1_1497    28056    Micromonosporaceae 16S rRNA gene, isolate SR 83        family
X93995_1_1450    1763     Mycobacterium sp. 16S ribosomal RNA (strain 89-446...  genus
X81948_1_1491    126      Planctomycetaceae partial 16S rRNA gene (Schlesner...  family
Z31658_1_1494    1224     Proteobacterial (SCB11) gene for 16S ribosomal RNA     phylum
DQ777729_1_1459  1149133  Pseudomonas pseudoalcaligenes 16S ribosomal RNA ge...  species_subgroup_
AB109887_1_1443  1149133  Pseudomonas pseudoalcaligenes gene for 16S rRNA, p...  species_subgroup_
X87314_1_1497    2070     Pseudonocardiaceae 16S rRNA gene, isolate SR 244a      family
KY767658_1_1484  28453    Sphingobacterium strain Q1 16S ribosomal RNA gene,...  genus
X95470_1_1506    2062     Streptomycetaceae 16S rRNA gene (isolate SR 179c)      family
X87309_1_1438    2062     Streptomycetaceae 16S rRNA gene, isolate SR 119        family
X87312_1_1469    2062     Streptomycetaceae 16S rRNA gene, isolate SR 168        family
X87316_1_1501    2062     Streptomycetaceae 16S rRNA gene, isolate SR 257        family
X87320_1_1497    2062     Streptomycetaceae 16S rRNA gene, isolate SR 70         family
X87315_1_1493    2062     Streptomycetaceae rRNA gene, isolate SR 247            family
X87319_1_1491    2004     Streptosporangiaceae 16S rRNA gene, isolate SR 58      family
KX533958_1_1436  1978400  Xanthobacteraceae bacterium L1I3 16S ribosomal RNA...  family__
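
For a check that does not depend on xsv, the same join-and-filter can be approximated with the Python standard library. This is a sketch under assumptions: it presumes seq_info.csv has seqname and tax_id columns and taxonomy.csv has tax_id, rank and species columns, as the xsv command above implies, and it uses small in-memory stand-ins rather than the real files.

```python
import csv
import io


def records_without_species(seq_info, taxonomy):
    """Join seq_info rows to taxonomy rows on tax_id and return
    (seqname, tax_id, rank) for records whose species column is empty."""
    tax_by_id = {row["tax_id"]: row for row in taxonomy}
    missing = []
    for row in seq_info:
        tax = tax_by_id.get(row["tax_id"])
        if tax is not None and tax["species"] == "":
            missing.append((row["seqname"], row["tax_id"], tax["rank"]))
    return missing


# Tiny in-memory stand-ins for seq_info.csv and taxonomy.csv.
seq_info = csv.DictReader(io.StringIO(
    "seqname,tax_id\n"
    "X80733_1_1447,561\n"
    "NR_119220_1_1500,40324\n"))
taxonomy = csv.DictReader(io.StringIO(
    "tax_id,rank,species\n"
    "561,genus,\n"
    "40324,species,40324\n"))
missing = records_without_species(seq_info, taxonomy)
```

An empty result for a named/filtered directory would indicate that every record has a species-level classification.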

move axes to top of plot along with other selectors

Add E faecalis, E faecium and E avium to the test data set feather file

Investigate updating the Dash app html table into a DataTable

https://dash.plotly.com/datatable
https://community.plotly.com/t/datatable-column-clickalbe-to-link-to-url/14680/14

allow url-encoding of all selectors

for example https://ya16sdb.labmed.uw.edu?color=outliers&shape=type_strains&x=dist_pct&y=rank_order
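
A minimal sketch of round-tripping selector state through the query string, using only the Python standard library. The helper names are hypothetical, and wiring this into the Dash app's callbacks is left out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://ya16sdb.labmed.uw.edu"


def selectors_to_url(selectors):
    """Encode a dict of selector values as a shareable URL."""
    return f"{BASE_URL}?{urlencode(selectors)}"


def url_to_selectors(url):
    """Recover the selector dict from a URL built by selectors_to_url."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


selectors = {
    "color": "outliers",
    "shape": "type_strains",
    "x": "dist_pct",
    "y": "rank_order",
}
url = selectors_to_url(selectors)
restored = url_to_selectors(url)
```

Because urlencode escapes reserved characters, values such as accessions or species names containing spaces would also survive the round trip.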

Records dropped as duplicates but then 'added back' as types are absent from named_type_hits output

Looks like there's a bit of a circular issue emerging from interplay between definition of the named .fasta set (which has duplicate records within a genome dropped https://github.com/nhoffman/ya16sdb/blob/master/SConstruct#L496), and the logic which adds all type strain records back to the 'trusted' .fasta output (and BLAST db). The outcome of this is that the trusted BLASTdb contains dropped duplicate alleles for some seqs within is_type genomes, and these records lack info about the nearest type strain, since the named fa is used as a target in https://github.com/nhoffman/ya16sdb/blob/master/SConstruct#L737

Possible solutions:

build 'named_type_hits' from the trusted fa, either instead of or in addition to the current vsearch output
add logic to prevent drop of duplicate alleles from type records
move deduplication step in partition_refs https://github.com/nhoffman/ya16sdb/blob/master/bin/partition_refs.py#L116 after the append of trusted records

The third option seems easiest implementation-wise.

Slurm compatibility

@dhoogest

See Bioscons

update bokeh plots

Add is_type column before dedup step

Use the labmed classifier to illustrate species uncertainty in Dash app

convert to python 3

Note that scons v3 should be python3 compatible

Setup dash app LetsEncrypt plugin on Dokku

https://github.com/dokku/dokku-letsencrypt

Publication status of NCBI records is not current

The publication status for records submitted in association with PMID:24509479 (https://www.ncbi.nlm.nih.gov/bioproject/PRJEB2397) and PMID:25388376 (https://www.ncbi.nlm.nih.gov/bioproject/229402) was not updated upon acceptance of the papers. These came to my attention through our whole genome ratification of Streptococcus pneumoniae records project. There is an A/C polymorphism at 16S position 203 that is said to distinguish S. pneumo (always C) from other S. mitis group species (always A). It turns out there are some S. pneumo strains with A at that position, but most appear to be refseqs (direct submissions of reasonable quality), so we did not know whether the submitters to NCBI had actually done much in the way of phenotypic characterization to ensure they were sequencing S. pneumos and not S. pseudopneumos or S. mitis until I found these publications. We use publication status frequently to evaluate record trustworthiness.

S. pneumos with an 'A' from PMID:24509479 (GCF_001113365.1 appears likely misclassified from our analysis, but the others are S. pneumo):
GCF_001344435.1,SMRU2068,NZ_CHVE01000029_6665_8216
GCF_001113365.1,SMRU2014,NZ_CKYA01000001_562_2114
GCF_001130445.1,SMRU2069,NZ_CLES01000030_2995_4546
GCF_001147945.1,SMRU2652,NZ_CLLB01000007_318_1869

An S. pneumo with an 'A' from PMID:25388376:
NZ_JFJF01000123_70_1621,NZ_JFJF01000123.1,NZ_JFJF01000123,NZ_JFJF01000123,"Streptococcus pneumoniae strain SC_0381 contig_81, whole genome shotgun sequence",1313,2019-09-30,2019-10-31,1,Streptococcus pneumoniae,WGS;RefSeq,Streptococcus pneumoniae,1552,0,SC_0381,genomic DNA,,conjunctivitis,70,1621,False,True,False,refseq,b27646b8b8d34ff1cb6d87a56ee8f2f6e022dd49

@crosenth indicated that he would contact NCBI and report back.

Checking my understanding of the 'type_classification' column in the new interface

I'm going to include ya16sdb in a sequence classification job aid, and I want to make sure I understand the meaning of this column. I was thinking that it was still the best matching type strain, but the genus rank entry for FJ917551_1_1414 and MH283835_1_1424 confuses me. Can you please explain, or direct me to the docs? I looked through the README and checked the wiki, but didn't see an answer to this question.

Distinguish 16S-only records from those that were extracted from whole genomes

Hi Noah and Chris, is the source (NCBI 16S-only vs NCBI whole genome extracted) easily mapped to the records in the database, such that it would be possible to control the records displayed using that criteria (or have them marked in some way)?

view outliers app

Tests

search for sequence name
search for accession
search for accession with multiple hits
url-encode accession

consider post-filtered 'type' sequences for target of vsearch

Discussion item:

For the pipeline phase where closest type strain is assessed via vsearch, it would perhaps be higher value for the type strains to be considered post-filtering (in other words, the similarity to closest trusted type strain is determined). I don't know how many of the 'from type' records end up dropped by either do-not-trust or deenurp filter_outliers, but it is a non-zero list.

seq_info.csv should be consistent among all subdirectories

==> 20180324/dedup/1200bp/named/seq_info.csv <==
seqname,version,accession,name,description,tax_id,modified_date,download_date,version_num,source,keywords,organism,length,ambig_count,strain,mol_type,isolate,isolation_source,seq_start,seq_stop,is_type
==> 20180324/dedup/1200bp/named/filtered/seq_info.csv <==
seqname,tax_id

The second file should have the same contents as the first. Find a different name for the latter. tax_ids.csv?
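
One way to catch this kind of drift is to compare the header row of every seq_info.csv under an output directory. The sketch below is an illustration only; the function names are hypothetical, and it treats the shallowest file found as the reference header.

```python
import csv
from pathlib import Path


def read_header(path):
    """Return the first row of a CSV file, or [] if the file is empty."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle), [])


def inconsistent_headers(root, filename="seq_info.csv"):
    """Return {path: header} for files whose header differs from that of
    the shallowest file named `filename` under `root`."""
    paths = sorted(
        Path(root).rglob(filename),
        key=lambda p: (len(p.parts), str(p)),
    )
    if not paths:
        return {}
    expected = read_header(paths[0])
    mismatched = {}
    for path in paths[1:]:
        header = read_header(path)
        if header != expected:
            mismatched[str(path)] = header
    return mismatched
```

Calling inconsistent_headers("20180324/dedup/1200bp/named") on the layout above would be expected to report the filtered/seq_info.csv file, whose header is only seqname,tax_id.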

Type strain record present in latest NCBI_16S build not visible on https://ya16sdb.labmed.uw.edu/

I'm coming at this from the perspective of this sv and a few other S. maltophilia-ish svs in the NGS16S case https://share.labmed.uw.edu/molmicro/markergene/23N0214_NGS16S/report/23R169-19/details-sv-0085:23R169-19.html#tab_plottab.

I was looking to see where the S. maltophilia type strains are. NCBI indicates there are many, including NR_119220_1_1500, which is in the latest build.

grep "NR_119220.1" /molmicro/common/ncbi/16s/output/LATEST/dedup/1200bp/named/filtered/trusted/seq_info.csv
NR_119220_1_1500,NR_119220.1,NR_119220,NR_119220,"Stenotrophomonas maltophilia strain LMG 958 16S ribosomal RNA, partial sequence",40324,12-Mar-2019,12-Jul-2019,1,Stenotrophomonas maltophilia,RefSeq,Stenotrophomonas maltophilia,1500,1,LMG 958,rRNA,,,1,1500,29,1514,,,,40324,40323,Stenotrophomonas maltophilia,Stenotrophomonas,True,True,True,True,type,,,,,,,,,,,,cab8c8fe29c0259f18a178d94abf7230ead38212,JF343225_1_1237,0.000809,False,21.0,0.0034076332023561,-0.0002546308763511,True,0.0809,438.0

I only see one (DQ067559_1_1412) listed in https://ya16sdb.labmed.uw.edu/. Is there a row limit? The minimum dist_pct I see is 0.91, so I suspect that's the case but wanted to confirm.