ya16sdb's Introduction

Yet Another 16S rRNA database

ya16sdb is a pipeline for downloading, curating, and annotating a database of bacterial 16S rRNA sequences. This repository also implements a web application (https://ya16sdb.labmed.uw.edu/) that can be used to visualize the distance-based relationships among sequences for a given species.

The purpose of the project is to provide a high-quality source of bacterial 16S rRNA sequences that is up to date with NCBI, in a format that is useful as input for various bioinformatics pipelines such as blast searching, phylogenetic reference set creation, and sequence-based taxonomic assignment.

Project information

This project is a product of ongoing research interests of Noah Hoffman (https://faculty.washington.edu/ngh2/home/pages/software.html) at the University of Washington in the Department of Laboratory Medicine.

Christopher Rosenthal is the primary author of the pipeline.

The pipeline relies heavily on taxtastic (https://github.com/fhcrc/taxtastic) and deenurp (https://github.com/fhcrc/deenurp), both of which began as collaborations with Erick Matsen at the Fred Hutchinson Cancer Research Center in Seattle, WA.

Please cite this project as:

Rosenthal C and Hoffman NG. 2019. ya16sdb: a pipeline for creating a collection of high-quality bacterial 16S rRNA sequences from NCBI. Version 0.6.1. University of Washington. https://github.com/nhoffman/ya16sdb

Overview

At a high level, this pipeline does the following:

  • Downloads annotation for all available sequence records from the NCBI matching search terms for 16S rRNA.
  • Retrieves sequence records for corresponding full length (or near full-length) 16S rRNA genes; this involves extracting subsequences from genome sequences or contigs.
  • Ensures that all records are 16S rRNA genes.
  • Ensures that sequences are in a consistent orientation.
  • Identifies the taxonomic lineage of each record.
  • Annotates records as a "type strain" (according to NCBI's definition of type strain), "published" (annotation has an accompanying PubMed ID), "refseq" (belonging to the GenBank RefSeq collection), or "direct" (direct submissions).
  • Discards records likely to be mis-annotated using deenurp filter-outliers.
  • Provides various subsets of annotated sequences. Each record subset provides sequence metadata, sequences, taxonomic lineages, and a blast database. For example:
    • only records with taxonomic name consistent with species-level classifications
    • type strains only
    • outliers removed
    • downsampled to a subset of sequences for each species, prioritizing type strains and "published" records.
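The downsampling step in the last bullet can be sketched as follows. This is only an illustration of the prioritization idea, not the pipeline's actual implementation: the record fields (`seqname`, `species`, `is_type`, `is_published`) and the `per_species` limit are assumptions made for the example.

```python
from collections import defaultdict


def downsample(records, per_species=2):
    """Keep at most `per_species` records for each species.

    Type strains sort first, then "published" records, then everything else.
    Field names are hypothetical and chosen for this sketch only.
    """
    by_species = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append(rec)

    kept = []
    for species in sorted(by_species):
        # False sorts before True, so negate the flags to put preferred records first.
        ranked = sorted(
            by_species[species],
            key=lambda r: (not r["is_type"], not r["is_published"]),
        )
        kept.extend(ranked[:per_species])
    return kept


# Illustrative records, not real pipeline data.
records = [
    {"seqname": "a", "species": "sp1", "is_type": False, "is_published": False},
    {"seqname": "b", "species": "sp1", "is_type": True, "is_published": False},
    {"seqname": "c", "species": "sp1", "is_type": False, "is_published": True},
    {"seqname": "d", "species": "sp2", "is_type": False, "is_published": False},
]
print([r["seqname"] for r in downsample(records)])
```

Because Python's sort is stable, records with equal priority keep their input order, so the result is deterministic for a given input.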

Docker

A Docker image can be built with the following command:

docker build --tag ya16sdb:latest .

Once a Docker image has been built, a Singularity image can be built from it using the Docker daemon:

singularity build ya16sdb.img docker-daemon://ya16sdb:latest

A Singularity image can also be built using a Singularity Docker container:

docker run --volume /var/run/:/var/run/ --volume $(pwd):$(pwd) --workdir $(pwd) singularity:latest build ya16sdb.img docker-daemon://ya16sdb:latest

Pipeline execution

The virtual containers have a predefined entry point to the SConstruct pipeline file.

Only a settings.conf file is required to execute the pipeline with Docker, which can be run as follows:

docker run --volume $(pwd):$(pwd) --workdir $(pwd) ya16sdb:latest

And with Singularity:

singularity run --bind $(pwd) --pwd $(pwd) ya16sdb.img

ya16sdb's People

Contributors

crosenth, nhoffman, nkrumm, dependabot[bot]

ya16sdb's Issues

bin/match_hits.py look top hit within same species

New algo:

  1. Expand vsearch to top 5-10-20 hits
  2. Select all hits at the single highest pct_id
  3. If one or more of those hits have the same species taxonomy id as the query, choose one of them. Otherwise, take whatever vsearch returns as the top pct_id hit.
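A minimal sketch of the proposed selection logic, assuming the expanded vsearch hits have already been parsed into dictionaries in the order vsearch reported them. The function name and the `target`, `pct_id`, and `species_id` keys are hypothetical and are not taken from bin/match_hits.py.

```python
def choose_hit(hits, query_species_id):
    """Pick one hit from a query's top-N vsearch hits.

    `hits` is a list of dicts in vsearch output order. Returns None when
    there are no hits.
    """
    if not hits:
        return None

    # Step 2: keep every hit tied at the single highest percent identity.
    best_pct_id = max(hit["pct_id"] for hit in hits)
    top_hits = [hit for hit in hits if hit["pct_id"] == best_pct_id]

    # Step 3: prefer a hit from the same species as the query.
    for hit in top_hits:
        if hit["species_id"] == query_species_id:
            return hit

    # Otherwise fall back to the first top hit in vsearch order.
    return top_hits[0]


# Illustrative hits, not real vsearch output.
hits = [
    {"target": "t1", "pct_id": 99.8, "species_id": "1351"},
    {"target": "t2", "pct_id": 99.8, "species_id": "1313"},
    {"target": "t3", "pct_id": 99.1, "species_id": "1313"},
]
print(choose_hit(hits, "1313")["target"])
```

Note that a same-species hit at a lower pct_id (t3 above) is never chosen over a tied top hit; only ties at the highest identity are considered.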

migrate to ECS

We will provide and build a Dockerfile here and host the image on ghcr, but deploy the app to internal infrastructure using CDK stack configurations maintained elsewhere.

genome identifier

We need to define an identifier that can be used to group records by genome sequencing project: for shotgun assemblies, accessions refer to a contig, and as a result multiple accessions can refer to the same assembly.
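One candidate for discussion, offered only as a sketch: for shotgun assemblies, the contig accessions of a single WGS project share a common prefix, so that prefix could serve as a grouping key. The code assumes the usual WGS accession layout (an optional RefSeq "NZ_" prefix, a 4- or 6-letter project code, a 2-digit version, then a zero-padded contig number). The function name is hypothetical, and accessions that do not fit this layout would need a different identifier.

```python
import re

# Assumed layout: optional "NZ_", 4- or 6-letter project code,
# 2-digit assembly version, then 6 or more digits of contig number.
WGS_ACCESSION = re.compile(r"^(NZ_)?([A-Z]{6}|[A-Z]{4})(\d{2})(\d{6,})$")


def genome_key(accession):
    """Return a WGS project prefix for grouping contigs, or None.

    `accession` is given without the ".N" version suffix.
    """
    match = WGS_ACCESSION.match(accession)
    if match is None:
        return None
    prefix, project, version, _contig = match.groups()
    return f"{prefix or ''}{project}{version}"


print(genome_key("NZ_CABGZA010000007"))
print(genome_key("NZ_CHVE01000029"))
print(genome_key("X87311"))
```

Records for which `genome_key` returns None (for example single-gene submissions) would each form their own group under this scheme.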

see also #32

roll back https://github.com/nhoffman/ya16sdb/pull/31

Apparently the inclusion of the species_group column created some issues for other pipeline uses. We've come up with an alternate plan for building ref feather files in custom shapes, so it isn't necessary to include the NGS16S-consumed columns added in !31.

Best match 16S type strain record in ya16sdb web interface does not match data in NGS16S output

Hi Chris, linking here to the issue I made in the molmicro github a couple of weeks ago https://gitlab.labmed.uw.edu/molmicro/mkrefpkg/-/issues/83

Will this be part of the upcoming release?

It just occurred to me...is this why you and Noah think there is a misclassified E. faecalis type strain, WRT example NZ_CABGZA010000007_337_1876 (coordinates have changed slightly)? It seems like this must just be the ya16sdb web interface bug. There are no Enterococcus type strains in the NCBI_16s_types_details when that 'Enterococcus faecalis' record is blasted...and that database is unfiltered, so if there was one lurking, it should show up. https://share.labmed.uw.edu/molmicro/sanger/report/2021/09/30/2053025_ad_hoc_20210930150544_NZ_CABGZA010000007_337_1876.html

requirements.txt should pin all dependencies

  • need a bare-dependencies.txt for unpinned, direct dependencies
  • create requirements.txt from pip freeze after creating clean venv from bare-dependencies.txt
  • build image from requirements.txt (all deps pinned)
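A small check that could accompany this workflow, sketched under the assumption that "pinned" means every requirement line uses an exact `==` version. The function name is hypothetical; lines that are pip options (starting with `-`) are skipped, and URL-style requirements are not handled.

```python
def unpinned(lines):
    """Return requirement lines that lack an exact '==' version pin."""
    problems = []
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        requirement = line.split("#", 1)[0].strip()
        if not requirement or requirement.startswith("-"):
            continue
        if "==" not in requirement:
            problems.append(requirement)
    return problems


# Illustrative contents; package names are placeholders.
example = [
    "pkg-a==1.0.0",
    "pkg-b>=2.0  # not pinned",
    "# a comment",
    "",
    "pkg-c",
]
print(unpinned(example))
```

Running this against requirements.txt before building the image would flag any dependency that slipped through without a pin.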

implement "mini" pipeline for testing

We should be able to test the pipeline all the way through with a small subset of NCBI - perhaps by restricting the initial query to a few taxids. I'd suggest doing this first, and then building out the whole pipeline using the mini query.
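As a rough sketch of how the initial query could be restricted, the snippet below appends a taxid clause to a base Entrez search term. The base term shown is a placeholder, not the pipeline's actual search string, and the function name is hypothetical; the `txid<N>[Organism:exp]` form is the standard Entrez syntax for a taxon and its descendants.

```python
# Placeholder only: the real pipeline search terms are not shown in this issue.
BASE_TERM = "16S[All Fields] AND rRNA[Feature Key]"


def build_query(taxids, base_term=BASE_TERM):
    """Restrict `base_term` to the given NCBI taxonomy ids and their descendants."""
    taxon_clause = " OR ".join(f"txid{taxid}[Organism:exp]" for taxid in taxids)
    return f"({base_term}) AND ({taxon_clause})"


# 1313 and 40324 are tax_ids that appear in records quoted elsewhere on this page.
print(build_query(["1313", "40324"]))
```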

non-species in dedup/1200bp/named/filtered

I noticed that there are some records in named/filtered without a species-level classification - I would have assumed that these would have been removed between 1200bp and named.

% pwd
/fh/fast/fredricks_d/refpkg/ya16sdb/output/20180402/dedup/1200bp/named/filtered
% git --no-pager log -n1
commit 1f9ad19c6654ed8a46f523a9ef0ddacb27495962
Author: Chris Rosenthal <[email protected]>
Date:   Wed Apr 4 10:53:50 2018 -0700

    adding lineages.txt mothur output per github.com Issue 5
% xsv join tax_id seq_info.csv tax_id taxonomy.csv | xsv search -s species '^$' | xsv select seqname,tax_id,description,rank,species | xsv sort -s description | xsv table -c 50
seqname          tax_id   description                                            rank               species
X87311_1_1445    2049     Actinomycetaceae 16S rRNA gene, isolate SR 139         family
X87313_1_1506    2049     Actinomycetaceae 16S rRNA gene, isolate SR 210         family
X87318_1_1458    2049     Actinomycetaceae 16S rRNA gene, isolate SR 259         family
X87310_1_1504    2049     Actinomycetaceae 16S rRNA gene, isolate SR 272         family
X87617_1_1497    2049     Actinomycete (genus unknown) 16S ribosomal RNA         family
KX773496_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773497_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773498_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773499_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773500_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh290F...  species_group
KX773503_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773507_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773504_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773505_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773506_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh303F...  species_group
KX773511_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773512_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773513_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773514_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773515_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh326F...  species_group
KX773516_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773517_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773518_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773519_1_1540  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773520_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh327F...  species_group
KX773527_1_1539  85620    Candidatus Phytoplasma asteris isolate Rus-CPh331F...  species_group
X80733_1_1447    561      Escherichia sp. gene for 16S rRNA                      genus
X92362_1_1464    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate 33...  genus
X92361_1_1483    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate 4S...  genus
X92360_1_1415    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate A7...  genus
X92363_1_1482    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate B-...  genus
X92358_1_1479    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate G1...  genus
X92364_1_1477    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate G1...  genus
X92366_1_1461    1860     Geodermatophilus sp. 16S ribosomal RNA (isolate G6...  genus
X87321_1_1490    2049     Micromonosporaceae 16S rRNA gene, isolate SR 53        family
X87322_1_1497    28056    Micromonosporaceae 16S rRNA gene, isolate SR 83        family
X93995_1_1450    1763     Mycobacterium sp. 16S ribosomal RNA (strain 89-446...  genus
X81948_1_1491    126      Planctomycetaceae partial 16S rRNA gene (Schlesner...  family
Z31658_1_1494    1224     Proteobacterial (SCB11) gene for 16S ribosomal RNA     phylum
DQ777729_1_1459  1149133  Pseudomonas pseudoalcaligenes 16S ribosomal RNA ge...  species_subgroup_
AB109887_1_1443  1149133  Pseudomonas pseudoalcaligenes gene for 16S rRNA, p...  species_subgroup_
X87314_1_1497    2070     Pseudonocardiaceae 16S rRNA gene, isolate SR 244a      family
KY767658_1_1484  28453    Sphingobacterium strain Q1 16S ribosomal RNA gene,...  genus
X95470_1_1506    2062     Streptomycetaceae 16S rRNA gene (isolate SR 179c)      family
X87309_1_1438    2062     Streptomycetaceae 16S rRNA gene, isolate SR 119        family
X87312_1_1469    2062     Streptomycetaceae 16S rRNA gene, isolate SR 168        family
X87316_1_1501    2062     Streptomycetaceae 16S rRNA gene, isolate SR 257        family
X87320_1_1497    2062     Streptomycetaceae 16S rRNA gene, isolate SR 70         family
X87315_1_1493    2062     Streptomycetaceae rRNA gene, isolate SR 247            family
X87319_1_1491    2004     Streptosporangiaceae 16S rRNA gene, isolate SR 58      family
KX533958_1_1436  1978400  Xanthobacteraceae bacterium L1I3 16S ribosomal RNA...  family__
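For anyone without xsv installed, the same check can be sketched with Python's csv module. It assumes only the column names used in the command above (`tax_id`, `seqname`, `rank`, `species`); the in-memory rows are illustrative stand-ins for seq_info.csv and taxonomy.csv.

```python
import csv
import io


def records_without_species(seq_info_file, taxonomy_file):
    """Inner-join seq_info to taxonomy on tax_id and report rows with an empty species."""
    taxonomy = {row["tax_id"]: row for row in csv.DictReader(taxonomy_file)}
    missing = []
    for row in csv.DictReader(seq_info_file):
        tax = taxonomy.get(row["tax_id"])
        # Skip tax_ids absent from taxonomy.csv, as an inner join would.
        if tax is not None and not tax["species"]:
            missing.append((row["seqname"], row["tax_id"], tax["rank"]))
    return missing


# Illustrative rows; the species value for 40324 is a stand-in.
seq_info = io.StringIO(
    "seqname,tax_id,description\n"
    "X80733_1_1447,561,Escherichia sp. gene for 16S rRNA\n"
    "NR_119220_1_1500,40324,Stenotrophomonas maltophilia strain LMG 958\n"
)
taxonomy = io.StringIO(
    "tax_id,rank,species\n"
    "561,genus,\n"
    "40324,species,40324\n"
)
result = records_without_species(seq_info, taxonomy)
print(result)
```

With real data, pass open file handles for the two CSV files in the subset directory being checked.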

Records dropped as duplicates but then 'added back' as types are absent from named_type_hits output

Looks like there's a bit of a circular issue emerging from interplay between definition of the named .fasta set (which has duplicate records within a genome dropped https://github.com/nhoffman/ya16sdb/blob/master/SConstruct#L496), and the logic which adds all type strain records back to the 'trusted' .fasta output (and BLAST db). The outcome of this is that the trusted BLASTdb contains dropped duplicate alleles for some seqs within is_type genomes, and these records lack info about the nearest type strain, since the named fa is used as a target in https://github.com/nhoffman/ya16sdb/blob/master/SConstruct#L737

Possible solutions:

The third option seems easiest implementation-wise.

Publication status of NCBI records is not current

The publication status for records submitted in association with PMID:24509479 (https://www.ncbi.nlm.nih.gov/bioproject/PRJEB2397) and PMID:25388376 (https://www.ncbi.nlm.nih.gov/bioproject/229402) was not updated upon acceptance of the papers. These came to my attention through our whole genome ratification of Streptococcus pneumoniae records project. There is an A/C polymorphism at 16S position 203 that is said to distinguish S. pneumo (always C) from other S. mitis group species (always A). It turns out there are some S. pneumo strains with A at that position, but most appear to be refseqs (direct submissions of reasonable quality), so until I found these publications we did not know whether the submitters to NCBI had actually done much in the way of phenotypic characterization to ensure they were sequencing S. pneumos and not S. pseudopneumos or S. mitis. We use publication status frequently to evaluate record trustworthiness.

S. pneumos with an 'A' from PMID:24509479 (GCF_001113365.1 appears likely misclassified from our analysis, but the others are S. pneumo):
GCF_001344435.1,SMRU2068,NZ_CHVE01000029_6665_8216
GCF_001113365.1,SMRU2014,NZ_CKYA01000001_562_2114
GCF_001130445.1,SMRU2069,NZ_CLES01000030_2995_4546
GCF_001147945.1,SMRU2652,NZ_CLLB01000007_318_1869

An S. pneumo with an 'A' from PMID:25388376:
NZ_JFJF01000123_70_1621,NZ_JFJF01000123.1,NZ_JFJF01000123,NZ_JFJF01000123,"Streptococcus pneumoniae strain SC_0381 contig_81, whole genome shotgun sequence",1313,2019-09-30,2019-10-31,1,Streptococcus pneumoniae,WGS;RefSeq,Streptococcus pneumoniae,1552,0,SC_0381,genomic DNA,,conjunctivitis,70,1621,False,True,False,refseq,b27646b8b8d34ff1cb6d87a56ee8f2f6e022dd49

@crosenth indicated that he would contact NCBI and report back.

Checking my understanding of the 'type_classification' column in the new interface

I'm going to include ya16sdb in a sequence classification job aid, and I want to make sure I understand the meaning of this column. I was thinking that it was still the best matching type strain, but the genus rank entry for FJ917551_1_1414 and MH283835_1_1424 confuses me. Can you please explain, or direct me to the docs? I looked through the README and checked the wiki, but didn't see an answer to this question.

view outliers app

Tests

  • search for sequence name
  • search for accession
  • search for accession with multiple hits
  • url-encode accession

consider post-filtered 'type' sequences for target of vsearch

Discussion item:

For the pipeline phase where closest type strain is assessed via vsearch, it would perhaps be higher value for the type strains to be considered post-filtering (in other words, the similarity to closest trusted type strain is determined). I don't know how many of the 'from type' records end up dropped by either do-not-trust or deenurp filter_outliers, but it is a non-zero list.

seq_info.csv should be consistent among all subdirectories

==> 20180324/dedup/1200bp/named/seq_info.csv <==
seqname,version,accession,name,description,tax_id,modified_date,download_date,version_num,source,keywords,organism,length,ambig_count,strain,mol_type,isolate,isolation_source,seq_start,seq_stop,is_type
==> 20180324/dedup/1200bp/named/filtered/seq_info.csv <==
seqname,tax_id

The second file should have the same contents as the first. Find a different name for the latter. tax_ids.csv?
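A sketch of a consistency check that could guard against this, comparing the header of one subset's seq_info.csv to a reference header. The two headers are copied from the output above; the function names are hypothetical.

```python
import csv
import io


def read_header(handle):
    """Return the first row of a CSV file as a list of column names."""
    return next(csv.reader(handle))


def missing_columns(reference, other):
    """Columns present in the reference header but absent from the other."""
    return [column for column in reference if column not in other]


named = io.StringIO(
    "seqname,version,accession,name,description,tax_id,modified_date,"
    "download_date,version_num,source,keywords,organism,length,ambig_count,"
    "strain,mol_type,isolate,isolation_source,seq_start,seq_stop,is_type\n"
)
filtered = io.StringIO("seqname,tax_id\n")

missing = missing_columns(read_header(named), read_header(filtered))
print(len(missing), missing[:3])
```

An empty result for every subdirectory would indicate that all seq_info.csv files carry the same columns as the reference.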

Type strain record present in latest NCBI_16S build not visible on https://ya16sdb.labmed.uw.edu/

I'm coming at this from the perspective of this sv and a few other S. maltophilia-ish svs in the NGS16S case https://share.labmed.uw.edu/molmicro/markergene/23N0214_NGS16S/report/23R169-19/details-sv-0085:23R169-19.html#tab_plottab.

I was looking to see where the S. maltophilia type strains are. NCBI indicates there are many, including NR_119220_1_1500, which is in the latest build.

grep "NR_119220.1" /molmicro/common/ncbi/16s/output/LATEST/dedup/1200bp/named/filtered/trusted/seq_info.csv
NR_119220_1_1500,NR_119220.1,NR_119220,NR_119220,"Stenotrophomonas maltophilia strain LMG 958 16S ribosomal RNA, partial sequence",40324,12-Mar-2019,12-Jul-2019,1,Stenotrophomonas maltophilia,RefSeq,Stenotrophomonas maltophilia,1500,1,LMG 958,rRNA,,,1,1500,29,1514,,,,40324,40323,Stenotrophomonas maltophilia,Stenotrophomonas,True,True,True,True,type,,,,,,,,,,,,cab8c8fe29c0259f18a178d94abf7230ead38212,JF343225_1_1237,0.000809,False,21.0,0.0034076332023561,-0.0002546308763511,True,0.0809,438.0

I only see one (DQ067559_1_1412) listed in https://ya16sdb.labmed.uw.edu/. Is there a row limit? The minimum dist_pct I see is 0.91, so I suspect that's the case but wanted to confirm.
