
DBS-Pro Analysis

About

This pipeline analyses sequencing data from DBS-Pro experiments for protein and PrEST quantification. The DBS-Pro method uses barcoded antibodies for surface protein quantification in droplets, for example to study single exosomes.

DBS-Pro pipeline overview

Overview of DBS-Pro pipeline run on three samples.

The pipeline takes single-end FASTQs as input, with a construct such as those specified in Standard constructs. For each sample the DBS is extracted (extract_dbs) and clustered (dbs_cluster) to enable error correction of the DBS sequences (correct_dbs). In parallel, the ABC and UMI are extracted from the same read (extract_abc_umi) and the UMIs are demultiplexed based on their ABC (demultiplex_abc). For each ABC the UMIs are grouped by DBS and then clustered to correct errors (umi_cluster). Finally, the corrected sequences are combined into a read-specific DBS, ABC and UMI combination, and these combinations are tallied to create the final output in the form of a TSV (integrate). If there are multiple samples these are also merged to generate a combined TSV (merge_data). A final report is generated to enable basic QC of the data. Also see the demo for a step-by-step walkthrough of a typical workflow.

DBS: Droplet Barcode Sequence. Reads sharing this sequence originate from the same droplet.
ABC: Antibody Barcode Sequence. Identifies which antibody was present in the droplet.
UMI: Unique Molecular Identifier. Identifies how many antibodies with a particular ABC were present in the droplet.
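The final integrate step described above boils down to tallying unique DBS, ABC and UMI combinations. A minimal sketch, using hypothetical, already error-corrected sequences:

```python
from collections import Counter

# Hypothetical corrected (DBS, ABC, UMI) triples; in the real pipeline these
# come from the error-corrected FASTQs produced by the clustering steps.
reads = [
    ("ACGTACGTACGTACGTACGT", "ABC01", "AATTCC"),
    ("ACGTACGTACGTACGTACGT", "ABC01", "AATTCC"),
    ("ACGTACGTACGTACGTACGT", "ABC02", "GGCCTT"),
]

# Tally each unique combination, analogous to the integrate step's TSV output.
counts = Counter(reads)
for (dbs, abc, umi), n in counts.items():
    print(f"{dbs}\t{abc}\t{umi}\t{n}")
```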

Setup

First, make sure conda is installed on your system.

  1. Clone the git repository.

    git clone https://github.com/FrickTobias/DBS-Pro
    
  2. Move into the git folder and install all dependencies in a conda environment.

    cd DBS-Pro
    

    For reproducibility the *.lock files are used.

    2.1. For OSX use:

    conda create --name dbspro --file environment.osx-64.lock
    

    2.2. For LINUX use:

    conda create --name dbspro --file environment.linux-64.lock
    

    2.3. Using flexible dependencies (not recommended)

    conda env create --name dbspro --file environment.yml
    

    This option will likely introduce newer versions of the software and dependencies which have not yet been tested.

  3. Activate the conda environment.

    conda activate dbspro
    
  4. Install the dbspro package.

    pip install .
    

    For development, please use pip install -e .[dev].

Usage

Prepare a FASTA with each of the antibody barcodes used in your experiment. The entry names will be used to define the targets. Also make sure that each sequence is prepended with ^; this is used for demultiplexing. See the example FASTA below:

>ABC01
^ATGCTG
>ABC02
^GTAGAT
>ABC03
^CTAGCA
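A quick sanity check of this format can be sketched in Python; the validate_abc_fasta helper below is hypothetical, not part of the dbspro package:

```python
# Minimal check that an ABC FASTA follows the expected layout: header lines
# start with '>' and each sequence is anchored with a leading '^' (required
# for demultiplexing).
def validate_abc_fasta(lines):
    entries = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            assert line.startswith("^"), f"{name}: sequence must start with '^'"
            entries[name] = line[1:]
            name = None
    return entries

print(validate_abc_fasta([">ABC01", "^ATGCTG", ">ABC02", "^GTAGAT"]))
```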

Use dbspro init to create an analysis folder. Provide the FASTA with the antibody barcodes (here named ABCs.fasta), a directory name and one or more FASTQs for the samples.

dbspro init --abc ABCs.fasta <output-folder> <sample1.fastq>

If you have several samples you can also provide a CSV file with lines in the format </path/to/sample.fastq>,<sample_name>. This lets you name your samples as you wish. With a CSV the initialization is as follows:

dbspro init --abc ABCs.fasta --sample-csv samples.csv <output-folder>
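Such a CSV can also be generated programmatically. The write_sample_csv helper below is a hypothetical sketch, assuming the sample name is simply the FASTQ file name without its extension:

```python
import csv
import tempfile
from pathlib import Path

# Sketch: build a samples CSV in the "<path>,<sample_name>" line format from
# a directory of FASTQ files (directory layout and naming are assumptions).
def write_sample_csv(fastq_dir, out_csv):
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        for path in sorted(Path(fastq_dir).glob("*.fastq")):
            writer.writerow([str(path), path.stem])

# Demo on a temporary directory with two empty FASTQs.
tmp = Path(tempfile.mkdtemp())
for name in ("sample1.fastq", "sample2.fastq"):
    (tmp / name).touch()
write_sample_csv(tmp, tmp / "samples.csv")
print((tmp / "samples.csv").read_text())
```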

Once the directory has been successfully initialized, move into the directory

cd <output-folder>

and check the current (default) configs using

dbspro config

Any changes to the configs should primarily be made through the dbspro config command, which validates the parameters. You can check the construct layout by running dbspro config --print-construct. Some standard constructs are also defined, see Standard constructs. Once the configs are updated you are ready to run the full analysis using this command.

dbspro run

For more information on how to run use dbspro run -h.

Output files

The main output is a TSV file data.tsv.gz with the following columns:

Column name  Description
Barcode      The DBS sequence
Target       Target name (acquired from the ABC FASTA headers)
UMI          The UMI sequence
ReadCount    Number of reads with this DBS, Target and UMI combination
Sample       Sample name
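From this TSV a droplet-by-target count matrix can be built, for example by counting unique UMIs per Barcode and Target with pandas. The inline DataFrame below is a stand-in for a real data.tsv.gz:

```python
import pandas as pd

# Toy stand-in for the pipeline output; for real data use:
#   df = pd.read_csv("data.tsv.gz", sep="\t")
df = pd.DataFrame({
    "Barcode": ["D1", "D1", "D2"],
    "Target": ["ABC01", "ABC02", "ABC01"],
    "UMI": ["U1", "U2", "U3"],
    "ReadCount": [5, 3, 7],
    "Sample": ["s1"] * 3,
})

# Count unique UMIs per (Barcode, Target) and pivot into a count matrix.
matrix = (
    df.groupby(["Barcode", "Target"])["UMI"]
    .nunique()
    .unstack(fill_value=0)
)
print(matrix)
```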

For convenience, AnnData h5ad files with count matrices are also generated for each sample. These can be used for downstream analysis with Scanpy. To import the data, use the following code:

import scanpy as sc
adata = sc.read_h5ad("mysample.h5ad")
adata

The pipeline also generates a report report.html with some basic QC metrics.

Standard constructs

The most common constructs are included as presets which can be initialized using the -c/--construct parameter in dbspro config. Currently available constructs include:

dbspro_v1

Sequence: 5'-CGATGCTAATCAGATCA BDVHBDVHBDVHBDVHBDVH AAGAGTCAATAGACCATCTAACAGGATTCAGGTA XXXXX NNNNNN TTATATCACGACAAGAG-3'
Name:        |       H1      | |       DBS        | |               H2               | |ABC| |UMI | |       H3      |
Size (bp):   |       17      | |        20        | |               34               | | 5 | | 6  | |       17      |

This is the DBS-Pro construct used in the publication Stiller et al. 2019.
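The layout above can be turned into a matching pattern using IUPAC codes (B = C/G/T, D = A/G/T, H = A/C/T, V = A/C/G). This is a minimal sketch, not the pipeline's actual matching code:

```python
import re

# Map IUPAC codes to regex character classes; X (ABC base) and N (UMI base)
# are treated as any base here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
         "N": "[ACGT]", "X": "[ACGT]"}

def to_regex(construct):
    return "".join(IUPAC[base] for base in construct.replace(" ", ""))

# dbspro_v1 handle H1 followed by the 20 bp DBS (BDVH repeated five times).
h1 = "CGATGCTAATCAGATCA"
dbs = "BDVH" * 5
pattern = re.compile(to_regex(h1) + f"(?P<dbs>{to_regex(dbs)})")

# Synthetic read: H1 + a DBS + the start of H2.
read = h1 + "CGCACGCACGCACGCACGCA" + "AAGAGTCA"
m = pattern.match(read)
print(m.group("dbs"))
```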

dbspro_v2

Sequence: 5'-CAGTCTGAGCGGTTCAACAGG BDVHBDVHBDVHBDVHBDVH GCGGTCGTGCTGTATTGTCTCCCACCATGACTAACGCGCTTG XXXXX NNNNNN CACCTGACGCACTGAATACGC-3'
Name:        |         H1        | |       DBS        | |                   H2                   | |ABC| |UMI | |         H3        |
Size (bp):   |         21        | |        20        | |                   42                   | | 5 | | 6  | |         21        |

This is the DBS-Pro construct used in the publication Banijamali et al. 2022.

pba

Sequence: 5'-NNNNNNNNNNNNNNN ACCTGAGACATCATAATAGCA XXXXX NNNNNN CATTACTAGGAATCACACGCAGAT-3'
Name:        |     DBS     | |         H2        | |ABC| |UMI | |          H3          |
Size (bp):   |      15     | |         21        | | 5 | | 6  | |          24          |

This is the construct used in the article Wu et al. 2019 which introduces the Proximity Barcoding Assay (PBA).

Demo

A short demonstration of the pipeline and some downstream analysis is available in the following Jupyter Notebook. This can also be used to test that the conda environment is properly set up.

Development

For notes on development see doc/development.

Publications

Check out version v0.1 for the pipeline used in:

Stiller, C., Aghelpasand, H., Frick, T., Westerlund, K., Ahmadian, A., & Eriksson Karlström, A. (2019). Fast and efficient Fc-specific photoaffinity labelling to produce antibody-DNA-conjugates. Bioconjugate chemistry.

Version v0.3 was used in:

Banijamali, M., Höjer, P., Nagy, A., Hååg, P., Gomero, E. P., Stiller, C., Kaminskyy, V. O., Ekman, S., Lewensohn, R., Karlström, A. E., Viktorsson, K., & Ahmadian, A. (2022). Characterizing Single Extracellular Vesicles by Droplet Barcode Sequencing for Protein Analysis. Journal of Extracellular Vesicles, e12277.

dbs-pro's Issues

Implement pytest

Use pytest for the test setup and support running tests from remote directories.

Outdated dependency file

These modules are currently not included in the environment file

  • pandas
  • seaborn
  • dnaio
  • snakemake
  • umi_tools

weird construct file path input

Currently it seems the path to the construct file must be given relative to the output directory rather than the working directory.

Jupyter notebook example run

Provide an example of how to edit and run the pipeline in a linked, publicly available Google Colab Jupyter notebook.

Change pipeline order

Just an idea I had about how we might want to change the order of our pipeline.

I have found the following issue: we cluster UMIs for each ABC target but do not separate them by DBS. This could mean that we are merging UMIs that should in fact be separate. My proposal would be to separate all UMIs by both ABC and DBS before clustering. This would better represent the actual conditions in the experiment.

I am however unsure about the benefits in the end; possibly this would only be a lot of work for nothing, but I wanted to raise the idea anyway to see what you think.

Current pipeline

START. Input = Fastq file

  1. Separate for DBS
    1.1. Extract DBS
    1.2. Cluster DBS
    1.3. Correct DBS FASTQ

  2. Separate for ABCs
    2.1. Extract ABC-UMI
    2.2. Split ABC-UMI by ABC
    2.3. Cluster ABCs independently
    2.4. Correct ABC FASTQs

  3. Analysis of corrected DBS and ABC files.

END.

Proposed pipeline outline

START. Fastq file

  1. Extract DBS
  2. Extract ABC-UMI
  3. Cluster DBS
  4. Correct DBS fastq
  5. Split/Tag ABC-UMI by DBS // This represents separated droplets
  6. Split/Tag ABC-UMI by ABC // This represents splitting within droplets for different targets
  7. Cluster each DBS-ABC pair independently
  8. Correct DBS-ABC pairs
  9. Analysis

END.
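Steps 5-7 of the proposed outline amount to grouping UMIs per (DBS, ABC) pair before clustering. A sketch of that grouping, with a hypothetical helper and input tuples:

```python
from collections import defaultdict

# Collect UMIs per (DBS, ABC) pair so each pair can be clustered
# independently, as the proposal suggests. Input tuples are hypothetical.
def group_umis(reads):
    groups = defaultdict(list)
    for dbs, abc, umi in reads:
        groups[(dbs, abc)].append(umi)
    return groups

reads = [("D1", "ABC01", "U1"), ("D1", "ABC01", "U2"), ("D2", "ABC01", "U3")]
groups = group_umis(reads)
print(groups[("D1", "ABC01")])  # ['U1', 'U2']
```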

Modular ABC file system

The ABC file system is currently not that adaptable, and it would be nice to be able to have several ABC-sequence files for different setups.

I'd suggest adding functionality for adding a new construct file using dbspro set, and a separate command for changing which one is used, by adding a dbspro config command (or something like that).

Check for & filter chimeric reads

The UMI sequences can be used to identify chimeric sequences by looking for UMIs linked to several different ABC or DBS sequences.
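One possible sketch of such a filter, flagging UMIs linked to more than one DBS within the same ABC (the helper and input tuples are hypothetical):

```python
from collections import defaultdict

# For each (ABC, UMI) pair, record the DBS sequences it appears with; a UMI
# seen with multiple DBSs for the same ABC is a chimera candidate.
def chimeric_umis(reads):
    seen = defaultdict(set)
    for dbs, abc, umi in reads:
        seen[(abc, umi)].add(dbs)
    return {key for key, dbs_set in seen.items() if len(dbs_set) > 1}

reads = [("D1", "ABC01", "U1"), ("D2", "ABC01", "U1"), ("D1", "ABC02", "U2")]
print(chimeric_umis(reads))  # {('ABC01', 'U1')}
```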
