ProHap & ProVar

Proteogenomics database-generation tool for protein haplotypes and variants. Preprint describing the tool: DOI.

Databases generated using ProHap

  • Databases obtained from the common haplotypes of the 1000 Genomes Project, along with the metadata set, can be found at DOI.
  • Databases obtained from the common haplotypes of Release 1.1 of the Haplotype Reference Consortium (HRC), along with the metadata set, can be found at DOI.
  • Databases obtained from the preliminary release of the Human Pangenome Reference Consortium (HPRC), along with the metadata set, can be found at DOI.

Note: The databases contain only common haplotypes (MAF > 1%); no individual-level data is available from them. For individual-level sequences, please run ProHap on the individual-level data.

Input & Usage

Below is a brief overview. For details on the input file format and configuration, please refer to the Wiki page.

Required input:

  • For ProHap: VCF with phased genotypes, one file per chromosome (such as 1000 Genomes Project - downloaded automatically by Snakemake if URL is provided)
  • For ProVar: VCF, single file per dataset. Multiple VCF files can be processed by ProVar in the same run.
  • FASTA file of contaminant sequences. These will then be added to the final FASTA, and tagged as contaminants. The default contaminant database is created by the cRAP project, provided in this repository.
  • GTF annotation file (Ensembl - downloaded automatically by Snakemake)
  • cDNA FASTA file (Ensembl - downloaded automatically by Snakemake)
  • (optional) ncRNA FASTA file (Ensembl - downloaded automatically by Snakemake)

Required software: Snakemake & Conda. ProHap was tested with Ubuntu 22.04.3 LTS. Windows users are encouraged to use the Windows Subsystem for Linux.
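Before starting, it can help to confirm that both tools are on the PATH. The following is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of ProHap.

```shell
# Print the name of every listed command that is not found on the PATH.
# `missing_tools` is an illustrative helper, not part of ProHap.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the two programs named as required software above.
missing=$(missing_tools snakemake conda)
if [ -n "$missing" ]; then
    echo "Missing required software:" $missing
else
    echo "Snakemake and Conda are both available."
fi
```

This only checks that the commands can be found; it does not verify their versions.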

Using ProHap with the full 1000 Genomes Project data set (the default) requires about 1 TB of disk space!
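Given this requirement, a quick check of the free space in the working directory may save a failed run. The sketch below uses the POSIX `df -Pk` output format; the helper name `has_free_kib` and the exact threshold (1 TB taken as 1024^3 KiB) are assumptions made for illustration.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 KiB available.
# `has_free_kib` is an illustrative helper, not part of ProHap.
has_free_kib() {
    # With -P, the second output line holds the figures; column 4 is "Available".
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail" -ge "$2" ]
}

# "About 1 TB" is interpreted here as 1024^3 KiB (an assumption).
required_kib=$((1024 * 1024 * 1024))
if has_free_kib . "$required_kib"; then
    echo "The current directory has roughly 1 TB or more available."
else
    echo "Warning: less than ~1 TB available in the current directory."
fi
```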

Usage:

  1. Clone this repository: git clone https://github.com/ProGenNo/ProHap.git; cd ProHap/;
  2. Create a configuration file called config.yaml using https://progenno.github.io/ProHap/. Please refer to the Wiki page for details.
  3. Test Snakemake with a dry-run: snakemake --cores <# provided cores> -n -q
  4. Run the Snakemake pipeline to create your protein database: snakemake --cores <# provided cores> -p --use-conda
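
Steps 3 and 4 can be chained so that the full run starts only if the dry-run succeeds. This is a minimal sketch: the wrapper name `run_prohap` is hypothetical, and it uses only the Snakemake flags already listed in the steps above.

```shell
# Dry-run first; start the real run only if the dry-run exits successfully.
# `run_prohap` is an illustrative wrapper, not part of ProHap.
run_prohap() {
    cores=$1
    # Reject an empty or non-numeric core count before calling Snakemake.
    case "$cores" in
        ''|*[!0-9]*)
            echo "usage: run_prohap <number of cores>" >&2
            return 2
            ;;
    esac
    snakemake --cores "$cores" -n -q &&
        snakemake --cores "$cores" -p --use-conda
}

# Example, from the repository root with config.yaml in place:
# run_prohap 4
```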

Example: ProHap on 1000 Genomes

In the first usage example, we provide a small example dataset taken from the 1000 Genomes Project on GRCh38. We will use ProHap to create a database of protein haplotypes aligned with Ensembl v.111 (January 2024) using only MANE Select transcripts.

Expected runtime using 4 CPU cores: ~1 hour. Expected runtime using 23 CPU cores: ~30 minutes.

Requirements: Install Conda / Mamba and Snakemake using this guide. Minimum hardware requirements: 1 CPU core, ~5 GB disk space, 3 GB RAM.

Use the following commands to run this example:

# Clone this repository:
git clone https://github.com/ProGenNo/ProHap.git ;
cd ProHap;

# Unpack the sample dataset
cd sample_data ;
gunzip sample_1kGP_common_global.tar.gz ;
tar xf sample_1kGP_common_global.tar ;
cd .. ;

# Copy the configuration to config.yaml
cp config_example1.yaml config.yaml ;

# Activate the snakemake conda environment and run the pipeline
conda activate snakemake ;
snakemake --cores 4 -p --use-conda ;

Using the database for proteomic searches

Once you obtain a list of peptide-spectrum matches (PSMs), you can use a pipeline provided in the PeptideAnnotator repository to map the peptides back to the respective protein haplotype / variant sequences, and map the identified variants back to their genetic origin. For the usage and details, please refer to the following wiki page.

Output

The ProHap / ProVar pipeline produces three kinds of output files. Below is a brief description; please refer to the wiki page for further details.

  1. Concatenated FASTA file: The main result of the pipeline is the concatenated FASTA file, consisting of the ProHap and/or ProVar output, reference sequences from Ensembl, and provided contaminant sequences. The file can be used with any search engine.
    • Optionally, headers are extracted and provided in an attached tab-separated file, and a gene name is added to each protein entry.
  2. Metadata table: Additional information on the variant / haplotype sequences produced by the pipeline, such as genomic coordinates of the variants covered, variant consequence type, etc.
  3. cDNA translations FASTA: A FASTA file containing the original translations of variant / haplotype cDNA sequences prior to any optimization, the removal of UTR sequences, and merging with canonical proteins and contaminants.
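
As a quick sanity check on either FASTA output, the entries can be counted and their headers listed with standard tools. This sketch relies only on the general FASTA convention that each entry starts with a `>` header line; the helper names are hypothetical, and the output path depends on your config.yaml.

```shell
# Count the entries in a FASTA file (one '>' header line per entry).
# Both helpers are illustrative and not part of ProHap.
count_fasta_entries() {
    grep -c '^>' "$1"
}

# Print each header line without its leading '>'.
fasta_headers() {
    sed -n 's/^>//p' "$1"
}

# Example; replace the placeholder with the output path set in config.yaml:
# count_fasta_entries path/to/final_database.fa
# fasta_headers path/to/final_database.fa | head
```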

Bug report and contribution

We welcome bug reports, suggestions of improvements, and contributions. Please do not hesitate to open an issue or a pull request.

Code of Conduct

As part of our efforts toward delivering open and inclusive science, we follow the Contributor Covenant Code of Conduct for Open Source Projects.

Citation

When using ProHap and databases generated using ProHap, please cite the accompanying scientific publication DOI.
