Git Product home page Git Product logo

sahmi's Introduction

SAHMI: Single-cell Analysis of Host-Microbiome Interactions

Bassel Ghaddar 7/31/2022

Introduction

SAHMI enables the systematic recovery and denoising of microbial signals from genomic sequencing of host tissues. The pipeline consists of R command line functions that implement the following main steps which are further described below.

  1. Taxonomic classification (recommended with Kraken2Uniq)
  2. Extract microbiome reads
  3. Single-cell k-mer analysis
  4. Barcode level signal denoising (barcode k-mer correlation test)
  5. Sample-level signal denoising (sample k-mer correlation tests)
  6. Identifying contaminants and false positives (cell line quantile test)
  7. Quantitation of microbes and creating the barcode-metagenome counts matrix
  8. Joint analysis of host and microbial data

Figure 1A. A schematic representation of the SAHMI workflow

Please see the references below for more information.

Please contact Bassel Ghaddar ([email protected]) or Subhajoyti De ([email protected]) for any questions.

Using the functions

No installation of SAHMI is required - all scripts run from the command line as described. Note that for all file paths that are input as function arguments the path must end in a backslash. Scripts can be made executable by running

chmod +x script.R
Rscript script.R -h

The following R packages must be installed prior to running some SAHMI functions: optparse, stringr, tidyverse, dplyr, ShortRead, data.table

1. Taxonomic classification

Metagenomic classification of paired-end reads from single-cell RNA sequencing fastq files can be performed using any k-mer based mapper that identifies a taxonomic ID for each k-mer and read. However, SAHMI is optimized to run with Kraken2Uniq, which finds exact matches of candidate 35-mer genomic substrings to the lowest common ancestor of genomes in a reference metagenomic database. It is essential that all realistically possible genomes are included as mapping references at this stage (e.g. host, known vectors, etc.), or that host mappable reads are excluded. The required outputs from this step are: a Kraken summary report with sample level metagenomic counts, a Kraken output file with read and k-mer level taxonomic classifications, an MPA-style report, and raw sequencing fastq files with taxonomic classification for each read. Please see https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown for more details on installation and usage of Kraken2/KrakenUniq.

The script run_kraken.r is included for convenience of running Kraken2Uniq and creating the MPA style report and other systematically named Kraken outputs.

run_kraken.r

  • --sample sample name
  • --fq1 path to fastq 1 file
  • --fq2 path to fastq 2 file
  • --out_path output directory path
  • --ncbi_blast_path path to ncbi-blast (see Kraken documentation for details)
  • --Kraken2Uniq_path path to Kraken2 main 'kraken2' function
  • --kraken_database_path path to kraken database
  • --kreport2mpa_path path to kreport2mpa.py function (included in SAHMI/functions)
  • --paired are fastq files paired end (T) or single-end/unpaired (F). Default is T.

The output includes fastq files with Kraken NCBI taxonomic assignments for each read, an output file containing k-mer level taxonomic data, and Kraken standard, uniq, and MPA style reports.

2. Extract microbiome reads

Microbiome reads next need to be extracted from the fastq files and the Kraken output file using the following scripts:

extract_microbiome_reads.r

  • --sample_name sample name
  • --fq path to classified fastq file (sample_1.fq/sample_2.fq) (if paired end reads this function must be once for each read)
  • kraken_report path to kraken2uniq report (sample.kraken.report.txt)
  • mpa_report path to mpa style report (sample.kraken.report.mpa.txt)
  • out_path output path

extract_microbiome_output.r

  • --sample_name sample name
  • --output_file path to kraken output file (sample.output.txt
  • kraken_report path to kraken2uniq report (sample.kraken.report.txt)
  • mpa_report path to mpa style report (sample.kraken.report.mpa.txt)
  • out_path output path

The outputs are fasta files for microbiome reads and a microbiome output file.

3. Single-cell k-mer analysis

The next step is tabulating k-mer statistics across individual barcodes using the script sckmer.r for paired end data and sckmer_unpaired.r for unpaired/single-end sequence data. These functions count the number of k-mers and unique k-mers assigned to a taxon across barcodes. The cell barcode and unique molecular identifier (UMI) are used to identify unique barcodes and reads. Data is reported for taxa of pre-specified ranks (default genus + species) taking into account all subsequently higher resolution ranks. Reads with any k-mers mapped to the host (e.g. human) are discarded. Reads with >50% of the k-mers map outside the taxon's lineage are also discarded. The output is a table of barcodes, taxonomic IDs, number of k-mers, and number of unique k-mers.

sckmer.r

  • --sample_name sample name
  • --fa1 path to microbiome fasta 1 file (sample_1.fa)
  • --fa2 path to microbiome fasta 2 file (sample_2.fa)
  • --microbiome_output_file path to microbiome output file (sample.microbiome.output.txt)
  • kraken_report path to kraken2uniq report (sample.kraken.report.txt)
  • mpa_report path to mpa style report (sample.kraken.report.mpa.txt)
  • out_path output path
  • cb_len nucleotide length of cell barcodes
  • umi_len nucleotide length of unique molecular identifiers (UMI)
  • ranks taxa ranks to analyze. more than one value allowed (e.g. c('G', 'S')). See Kraken report for more detail
  • host host taxonomy ID to exclude
  • min_frac minimum fraction of k-mers directly assigned to taxon ID or its lineage to use read.
  • nsample max number of barcodes to sample per taxon ID

Note that parameters for sckmer_unpared.r are the same but do not include fa2.

4. Barcode level signal denoising (barcode k-mer correlation test)

True taxa are detected on multiple barcodes and with a proprotional number of total and unique k-mer sequences across barcodes, measured as a significant Spearman correlation between the number of total and unique k-mers across barcodes. We demonstrate this using example data from Zhang et al., Cell Reports 2019 for a gastric metaplasia sample positive for Helicobacter pylori. Running SAHMI steps 1-3 generates the Kraken report and sckmer.txt.

library(dplyr)
library(tidyverse)

# kraken report
report = read.delim('./example data/SRR9713132.kraken.report.txt', header = F)
report$V8 = trimws(report$V8)
report[report$V8 %in% c('Homo sapiens', 'Bacteria', 'Fungi', 'Viruses'), ]
##         V1       V2       V3         V4       V5 V6    V7           V8
## 33   99.76 37752159 37752159 2212047756 78264060  S  9606 Homo sapiens
## 34    0.00      453        0     187379    10585  K  4751        Fungi
## 239   0.05    18340     2707    2038484   127826  D     2     Bacteria
## 1771  0.05    19068        0    3131110     3206  D 10239      Viruses
# sckmer data
kmer_data = read.table('./example data/SRR9713132.sckmer.txt', header = T)
head(kmer_data)
##            barcode taxid kmer uniq
## 1 AGGGCCCCGCCGCTTG   210   99   99
## 2 GTCAAATTCCAGAGGG   210   78   78
## 3 GAGCCTAGCCACCTCC   210   76   76
## 4 GCTGTCCCCACATTAG   210  108  108
## 5 CCACCAGAGCCCCTCC   210   78   78
## 6 ATTTAACCAAGAAAAA   210   72   72

We see that Kraken intially detected a large number of unique genera and species in the sample:

length(unique(report$V8[report$V6 %in% c('G', 'S')])) 
## [1] 1035

However, only a small subset of unique taxonomy IDs were detected on sufficient barcodes to test the Spearman corrletion:

# barcode k-mer correlation tests on taxonomy IDs detected on >3 barcodes and with >1 k-mer
# taxid = NCBI taxonomic ID, r = Spearman correlation coefficient, p = adjusted p-value

c = kmer_data %>% 
  subset(kmer > 1) %>%
  group_by(taxid) %>%
  mutate(nn = n()) %>%
  subset(nn > 3) %>% 
  group_by(taxid) %>%
  summarize(r = cor.test(kmer, uniq, method = 'spearman')$estimate,
            p = cor.test(kmer, uniq, method = 'spearman')$p.value,
            .groups='keep') %>%
  mutate(p = p.adjust(p))

c$name = report$V8[match(c$taxid, report$V7)] # add taxa names 
c
## # A tibble: 12 x 4
## # Groups:   taxid [12]
##      taxid     r         p name                      
##      <int> <dbl>     <dbl> <chr>                     
##  1     209 1     0.        Helicobacter              
##  2     210 1     0.        Helicobacter pylori       
##  3    1279 1.00  1.33e- 79 Staphylococcus            
##  4    1280 1.00  1.85e- 32 Staphylococcus aureus     
##  5    1290 1     8.33e-  2 Staphylococcus hominis    
##  6    1485 1     4.96e-  5 Clostridium               
##  7    1491 1     4.96e-  5 Clostridium botulinum     
##  8    2093 1.00  5.36e-301 Mycoplasma                
##  9    2100 1     0.        Mycoplasma hyorhinis      
## 10    5052 0.352 8.44e-  2 Aspergillus               
## 11   12230 1.00  0.        Turnip mosaic virus       
## 12 1603606 1.00  1.85e- 32 Desulfuromonas soudanensis

This test significantly filters list of possible taxa and enriches for real taxa in an individual sample.

5. Sample level signal denoising (sample k-mer correlation tests)

In the low-microbiome biomass setting, real microbes also exhibit a proportional number of total k-mers, number of unique k-mers, as well as number of total assigned sequencing reads across samples; i.e. the following three Spearman correlations are significant when tested using sample-level data provided in Kraken reports: cor(#reads, #k-mers), cor(#reads, #unique k-mers), and cor(#k-mers, #unique k-mers). We demonstrate this using data from 32 samples from Zhang et al. 2019 (8 H.pylori+, 24 H. pylori-).

The R function read_kraken_reports.r is provided for conveinence of reading and formatting multiple Kraken reports. Note that it is not a command line function and needs to be executed in R prior to using it.

read_kraken_reports.r Arguements:

  • files, vector of standard Kraken report file paths (sample.kraken.txt)
  • sample_names, optional sample name
  • study_name, optional study name
  • min_reads, filter taxa with > (default 2) assigned reads
  • min_uniq, filter taxa with > (default 2) unique k-mers

Kraken reports for this study are included as an RDS object.

kr = readRDS('./example data/zhang.reports.RDS')
kr
## # A tibble: 117,539 x 10
##    study sample   rank   taxid name          reads     min   uniq    rpm    rpmm
##    <fct> <fct>    <fct>  <int> <fct>         <int>   <dbl>  <int>  <dbl>   <dbl>
##  1 zhang SRR9713… R          1 root         1.06e8  3.48e9 5.28e7 9.66e5  1.68e8
##  2 zhang SRR9713… R1    131567 cellular …   1.06e8  3.44e9 5.28e7 9.60e5  1.67e8
##  3 zhang SRR9713… D       2759 Eukaryota    1.05e8  3.39e9 5.14e7 9.57e5  1.66e8
##  4 zhang SRR9713… D1     33154 Opisthoko…   1.05e8  3.39e9 5.14e7 9.57e5  1.66e8
##  5 zhang SRR9713… K      33208 Metazoa      1.05e8  3.39e9 5.14e7 9.57e5  1.66e8
##  6 zhang SRR9713… K1      6072 Eumetazoa    1.05e8  3.39e9 5.14e7 9.57e5  1.66e8
##  7 zhang SRR9713… K2     33213 Bilateria    1.05e8  3.39e9 5.14e7 9.57e5  1.66e8
##  8 zhang SRR9713… K3     33511 Deuterost…   1.05e8  3.39e9 5.14e7 9.57e5  1.66e8
##  9 zhang SRR9713… P       7711 Chordata     1.05e8  3.39e9 5.14e7 9.57e5  1.66e8
## 10 zhang SRR9713… P1     89593 Craniata     1.05e8  3.39e9 5.14e7 9.57e5  1.66e8
## # … with 117,529 more rows

Column names key: taxid, NCBI taxonomy ID, reads, # reads assigned; min, estimated # minimizers (k-mers), see Kraken documentation for more detail; uniq, estimated # unique minimizers (k-mers); rpm, reads per million; rpmm, reads per million microbiome reads.

Running the three sample-level k-mer correlation tests on genus and species resolution taxa yields:

require(ggplot2)

# remove taxa detected in < 3 samples
kr = kr %>%
  group_by(taxid) %>%
  mutate(nn = n()) %>%
  subset(nn > 2) %>% 
  select(-nn)

# run correlations 
c2 = kr %>%
  subset(rank %in% c('G', 'S')) %>% 
  group_by(name) %>%
  summarize(r1 = cor(min,uniq,method='spearman'),
            r2 = cor(min,reads,method='spearman'),
            r3 = cor(reads,uniq,method='spearman'),
            p1 = cor.test(min,uniq,method='spearman')$p.value,
            p2 = cor.test(min,reads,method='spearman')$p.value,
            p3 = cor.test(reads,uniq,method='spearman')$p.value
            )

c2
## # A tibble: 5,581 x 7
##    name                         r1     r2     r3       p1         p2          p3
##    <fct>                     <dbl>  <dbl>  <dbl>    <dbl>      <dbl>       <dbl>
##  1 Abyssicoccus              0.965  0.176 0.223  0.          5.84e-1     4.86e-1
##  2 Abyssicoccus albus        0.965  0.176 0.223  0.          5.84e-1     4.86e-1
##  3 Acaryochloris             0.947  0.156 0.0821 8.96e- 9    5.50e-1     7.54e-1
##  4 Acaryochloris marina      0.947  0.156 0.0821 8.96e- 9    5.50e-1     7.54e-1
##  5 Acetilactobacillus       -0.4   -0.949 0.632  7.50e- 1    5.13e-2     3.68e-1
##  6 Acetilactobacillus jins… -0.4   -0.949 0.632  7.50e- 1    5.13e-2     3.68e-1
##  7 Acetivibrio               0.974  0.802 0.843  0.          1.71e-7     9.91e-9
##  8 Acetivibrio clariflavus   0.938  0.569 0.568  4.15e-11    4.63e-3     4.74e-3
##  9 Acetivibrio saccincola    0.965  0.623 0.605  5.05e- 6    2.54e-3     3.67e-3
## 10 Acetivibrio thermocellus  0.750  0.116 0.307  5.30e- 4    6.58e-1     2.30e-1
## # … with 5,571 more rows
# making a scatter plot of correlation test results. 
# Each point is a taxon. x-axis, Spearman correlation value between #k-mers vs. #unique k-mers; y-axis, correlation value between #k-mers vs. #reads; color, correlation value between #reads vs. #unique k-mers. Lines represent contour density.

ggplot(c2, aes(r1, r2, color = r3))+
  geom_point() +
  geom_density_2d(size = 0.75, colour = "black") 

These results show a wide range of values for the three correlations, with only a subset of taxa having signifcant values in all three metrics.

6. Identifying contaminants and false positives (cell line quantile test)

The previous steps enrich for true taxa that produced diverse RNA. The next step in SAHMI is to identify contaminant taxa and spurious false positive assignments. These can be identified based on the widely observed pattern that contaminants appear at higher frequencies in low concentration or negative control samples. In the absence of experimentally matched negative controls, SAHMI provides a negative control resource comprised of microbiome profiles from 2,491 sterile cell experiments from around the world. For each taxon in a test sample, SAHMI compares the fraction of microbiome reads assigned to the taxon [i.e. taxon counts/sum(all bacterial, fungal, viral counts), in reads per million] to the microbiome fraction assigned to the taxon in all cell line experiments. Using the microbiome fraction comparison normalizes for experiments having a varying number of total sequencing reads or varying underlying contamination. SAHMI tests whether the taxon microbial fraction in the test sample is > 99th percentile (by default) of the taxon’s microbiome fraction distribution in cell line data using a one-sample quantile test. Taxa whose counts fall within the cell line distribution are identified as below the cell-line noise threshold. Users may choose how stringently to select the quantile threshold for significance testing.

Table S3.xlsx contains the reads per million microbiome reads (rpmm) percentile data for taxa detected in cell lines. These can be directly to the rpmm in the test samples. Users may also wish to work with the raw cell lines genus and species resolution microbiome data contained in cell.lines.txt to get a better feel for taxa distributions or to include or exclude zero counts in quantile calculations.

To denoise and decontaminate the gastric metaplasia samples from above (SRR9713132) we combined data from the barcode and sample k-mer correlation tests and keep taxa that pass all tests. We then compare the remaining taxa counts to their distribution in the cell line data. The same sample-level data would be used for other samples in the study, but each sample would have its own sckmer results.

# combine k-mer correlation tests, filter for significant values and species resolution
c3 = left_join(c, c2, by = 'name') %>% 
  left_join(select(kr, rank, name) %>% distinct()) %>% 
  subset(p<0.05 & p1<0.05 & p2<0.05 & p3<0.05 & rank == 'S')

c3
## # A tibble: 4 x 11
## # Groups:   taxid [4]
##   taxid     r        p name      r1    r2    r3       p1       p2       p3 rank 
##   <int> <dbl>    <dbl> <chr>  <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl> <fct>
## 1   210  1    0.       Helic… 0.990 0.897 0.915 1.27e-30 1.34e-13 5.80e-15 S    
## 2  1280  1.00 1.85e-32 Staph… 0.769 0.805 0.855 2.49e- 7 1.12e- 9 7.94e-12 S    
## 3  1491  1    4.96e- 5 Clost… 0.995 0.955 0.957 0.       1.45e-20 5.78e-21 S    
## 4  2100  1    0.       Mycop… 0.967 0.753 0.823 4.02e-12 1.28e- 4 8.31e- 6 S

For the taxa identified, we combined their data (for all samples) with the cell line data and plot density plots of their reads per million microbiome reads.

The raw cell lines microbiome data can be downloaded here:

https://www.dropbox.com/s/sgv3clvos281z9d/cell.lines.txt?dl=0

library(scales)
# cell.lines = readxl::read_xlsx('Table S3.xlsx')
cell.lines = read.delim('/Users/bassel/Downloads/cell.lines.txt', header = T) %>% tibble()

df = cell.lines[,1:11] %>% mutate(study = 'cell lines'); df = df[, -2]
df = rbind(df, kr)

ggplot(subset(df, name %in% c3$name), aes(rpmm, fill = study, ..scaled..)) + 
         geom_density(alpha = 0.5, color = NA) +
         facet_grid(~name, scales='free') + 
         scale_fill_manual(values = c(`cell lines` = 'grey50', zhang = 'navyblue')) +
         theme_minimal() + 
         xlab('Microbiome reads per million') +
         ylab('Density') +
         scale_x_log10(labels = trans_format("log10", math_format(10^.x)),
                       breaks = trans_breaks("log10", function(x) 10^x, n=4),
                       oob = scales::squish, expand = c(0,0)) +
         scale_y_continuous(expand = c(0,0)) +
         theme(legend.title = element_blank(),
               panel.border = element_rect(fill = NA, color = 'black'),
               strip.background = element_blank(),
               axis.ticks.x = element_line(size=0),
               axis.ticks.y = element_blank(),
               panel.grid = element_blank(), 
               strip.text = element_text(color = 'black', size = 10),
               axis.text.x = element_text(color = 'black', size = 10),
               axis.text.y = element_text(color = 'black', size = 10),
               axis.title.y = element_text(color = 'black', size = 10),
               axis.title.x = element_text(color = 'black', size = 10),
               plot.margin = unit(c(0, 0.1, 0, 0), "cm"),
               legend.key.size = unit(0.2, "cm"),
               legend.text = element_text(color = 'black', size = 10),
               legend.position = 'bottom')

These density plots show that the only taxon detected at counts per million greater than found in the cell line data is Helicobacter pylori, and only in some samples (middle and right peaks). Quantitative results are obtaind by direclty comparing test sample rpmm to taxa quantile data from Table S3 or cell.lines.txt. Here we merge results from kraken report, k-mer correlation tests, and cell line quantile test for sample SRR9713132.

# get quantiles from cell line data (or alternatively use Table S3)
qtile = 0.99
q_df = cell.lines %>%
  group_by(name, rank) %>% 
  summarize(CLrpmm = 10^quantile(log10(rpmm), qtile, na.rm = T), 
            .groups = 'keep')

left_join(c3, q_df, by = c('name', 'rank')) %>% 
  left_join(subset(kr, sample == 'SRR9713132') %>% select(name, rpmm), by = c('name'))
## # A tibble: 4 x 14
##   taxid.x     r        p name          r1    r2    r3       p1       p2       p3
##     <int> <dbl>    <dbl> <chr>      <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>
## 1     210  1    0.       Helicobac… 0.990 0.897 0.915 1.27e-30 1.34e-13 5.80e-15
## 2    1280  1.00 1.85e-32 Staphyloc… 0.769 0.805 0.855 2.49e- 7 1.12e- 9 7.94e-12
## 3    1491  1    4.96e- 5 Clostridi… 0.995 0.955 0.957 0.       1.45e-20 5.78e-21
## 4    2100  1    0.       Mycoplasm… 0.967 0.753 0.823 4.02e-12 1.28e- 4 8.31e- 6
## # … with 4 more variables: rank <chr>, CLrpmm <dbl>, taxid.y <int>, rpmm <dbl>

7. Quantitation of microbes and creating the barcode-metagenome counts matrix

After identifying true taxa, reads assigned to those taxa are extracted and then undergo a series of filters. The cell barcode and UMI are used to demultiplex the reads and create a barcode x taxa counts matrix. The full taxonomic classification of all resulting barcodes and the number of counts assigned to each clade are tabulated. The follwing command line R function performs this task:

taxa_counts.r

  • --sample_name sample name
  • --fa1 path to microbiome fasta 1 file (sample_1.fa)
  • --fa2 path to microbiome fasta 2 file (sample_2.fa)
  • taxa tsv file containing NCBI taxonomic IDs to extract (will summarize appropriate children data)
  • kraken_report path to kraken2uniq report (sample.kraken.report.txt)
  • mpa_report path to mpa style report (sample.kraken.report.mpa.txt)
  • out_path output path
  • cb_len nucleotide length of cell barcodes
  • umi_len nucleotide length of unique molecular identifiers (UMI)

8. Joint analysis of host and microbial data

SAHMI produces a final barcode by metagenomic counts matrix which can be jointly analyzed with the somatic single-cell data. For starters, we can identify barcodes that tag both host and microbial RNA and visualize them on a UMAP plot.

References

Please see the following publications for more details:

Coming soon.

sahmi's People

Contributors

sjdlabgroup avatar yunuuuu avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.