AlphaFold-based Protein Analysis Pipeline

Description

This project is a protein analysis pipeline for quick and comprehensive analysis of a protein sequence or structure.

  • The pipeline accepts as input a protein sequence in FASTA format or a protein structure in PDB format. If a PDB file is not provided, the 3D structure of the protein can be predicted using AlphaFold2.

  • A sequence-based analysis can be performed, including determining physicochemical properties and aligning the protein sequence against databases such as Pfam and Swiss-Prot/UniRef90.

  • A structure-based analysis can be performed, including predicting the effect of amino acid substitutions on protein stability using SimBa2 and detecting binding pockets using P2Rank.

  • A list of ligands can be specified in the PDB/MOL2 format. The binding affinity of the protein-ligand interactions can be predicted using AutoDock Vina.

  • The outputs obtained during each process are aggregated into a MultiQC HTML report. The analysis report presents the results in an interactive manner, including visualizing the three-dimensional protein structure using the iCn3D web viewer. The pipeline is being developed using Nextflow.

For upcoming updates and ideas, see our Roadmap.

Installation

IMPORTANT: This project is under active development. Currently, the pipeline can only be used by manually installing the required packages and tools. Support for Docker/Singularity containers is a high-priority future update.

Prerequisites:

  • Python ≥ 3.8
  • Java ≥ 11
  • git, pip, conda/mamba
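
The prerequisites above can be checked before installing anything else. The following is a minimal sketch, not part of the pipeline: the helper names version_ge and check_tool are hypothetical, and the version comparison assumes a sort that supports -V (as in GNU coreutils). The Java version is not parsed here because the output format of java -version varies between vendors, so confirm Java ≥ 11 by hand.

```shell
#!/usr/bin/env bash
# Sketch: check the prerequisites listed above. Helper names are hypothetical.

# version_ge A B: succeeds when dotted version A is >= B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME: report whether NAME is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
    fi
}

# Python must be at least 3.8.
if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if version_ge "$py_version" 3.8; then
        echo "Python $py_version is recent enough"
    else
        echo "Python $py_version is older than 3.8"
    fi
else
    echo "missing: python3"
fi

# Only one of conda/mamba is needed; both are reported for convenience.
for tool in java git pip conda mamba; do
    check_tool "$tool"
done
```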

Mandatory installation:

  • The following is the minimum setup required to run the pipeline. This configuration enables the sequence properties and 3D structure viewer components.
conda config --env --add channels conda-forge
conda config --env --add channels anaconda
conda config --env --add channels bioconda

conda install 'numpy>=1.18.5' 'pandas>=1.4.1' 'biopython>=1.76' 'multiqc>=1.12' pymol-open-source=2.5.0 graphviz
conda install -c salilab dssp=3.0.0
curl -fsSL https://get.nextflow.io | bash
  • Clone this repository:
git clone https://github.com/OtimusOne/AFPAP.git

Optional components:

The following analysis steps are optional and their installation can be skipped if so desired.

3D structure prediction - AlphaFold2/ColabFold:

  • AlphaFold2 is used to predict the protein structure if a PDB file is not provided. We use a local instance of ColabFold in order to avoid the large databases used by the native AlphaFold2 implementation.
  • Follow the install instructions at: https://github.com/YoshitakaMo/localcolabfold
  • If ColabFold is not installed, set skipAlphaFold = true inside nextflow.config. Structural analysis then requires a PDB structure supplied with the --pdb argument.

Pfam search:

conda install pfam_scan perl-json
  • Inside the database directory execute:
hmmpress Pfam-A.hmm
  • Set path variable inside nextflow.config
params {
    ...
    pfam_path="/path/to/Pfam/directory"
    ...
}
  • If Pfam is not used set skipPfamSearch = true inside nextflow.config.
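
Before pointing pfam_path at a directory, it can help to confirm that hmmpress has already been run there. The sketch below assumes the standard HMMER behaviour of writing four index files (.h3f, .h3i, .h3m, .h3p) next to Pfam-A.hmm; the function name pfam_db_ready is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Sketch: verify that a Pfam directory has been prepared with hmmpress.
# pfam_db_ready is a hypothetical helper, not part of the pipeline.

# pfam_db_ready DIR: succeed only if DIR holds Pfam-A.hmm plus the four
# binary index files that hmmpress writes next to it.
pfam_db_ready() {
    pfam_dir=$1
    if [ ! -f "$pfam_dir/Pfam-A.hmm" ]; then
        echo "missing: $pfam_dir/Pfam-A.hmm"
        return 1
    fi
    for pfam_ext in h3f h3i h3m h3p; do
        if [ ! -f "$pfam_dir/Pfam-A.hmm.$pfam_ext" ]; then
            echo "missing: $pfam_dir/Pfam-A.hmm.$pfam_ext (run hmmpress Pfam-A.hmm)"
            return 1
        fi
    done
    echo "Pfam database looks ready: $pfam_dir"
}

# Example call:
#   pfam_db_ready /path/to/Pfam/directory
```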

Conservation MSA:

  • The protein sequence conservation can be computed by generating a Multiple Sequence Alignment using a sequence database.
  • Install Blast, Muscle and CD-HIT:
conda install blast muscle cd-hit
curl https://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/uniref90/uniref90.fasta.gz | gunzip | makeblastdb -out uniref90 -dbtype prot -title UniRef90 -parse_seqids
  • Set path variable inside nextflow.config
params {
    ...
    blast_path="/path/to/BLASTDB/uniref90"
    ...
}
  • Warning: Swiss-Prot might not return enough hits to calculate the sequence conservation, while downloading large databases such as TrEMBL and UniRef90 can take a long time.
  • If this component is not used set skipConservationMSA = true inside nextflow.config.
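
Note that blast_path is a database prefix rather than a directory. As a quick sanity check, the sketch below tests whether a protein BLAST database exists under a given prefix. It assumes the usual makeblastdb layout, where a single-volume protein database has a PREFIX.pin index and a database split into volumes has a PREFIX.pal alias file; the function name blast_db_present is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that a protein BLAST database exists under a path prefix.
# blast_db_present is a hypothetical helper, not part of the pipeline.

# blast_db_present PREFIX: succeed if PREFIX.pin (single volume) or
# PREFIX.pal (alias for a multi-volume database) exists.
blast_db_present() {
    [ -f "$1.pin" ] || [ -f "$1.pal" ]
}

# Example call:
#   blast_db_present /path/to/BLASTDB/uniref90 && echo "database found"
# For a more thorough check, BLAST+ can print a summary of the database:
#   blastdbcmd -db /path/to/BLASTDB/uniref90 -info
```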

Protein stability changes - SimBa2:

  • SimBa2 is used to predict the effect of amino-acid substitutions on protein stability.
python -m pip install https://github.com/kasperplaneta/SimBa2/archive/main.tar.gz
  • If SimBa2 is not installed set skipStabilityChanges = true inside nextflow.config.

Pocket prediction - P2Rank:

  • P2Rank is used to predict ligand-binding pockets from the protein structure.
  • Follow the install instructions at: https://github.com/rdk/p2rank
  • Pillow is required for generating the pocket visualizations:
conda install pillow
  • If P2Rank is not installed set skipPocketPrediction = true inside nextflow.config.

Molecular Docking - AutoDock Vina:

pip install vina
  • If the pocket prediction step has been executed beforehand, Vina will dock the ligands against each of the predicted pockets; otherwise, it will only execute a blind docking step.
  • If Vina is not installed set skipMolecularDocking = true inside nextflow.config.
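
Since each optional component has its own skip flag, a small script can suggest which flags to set based on what is installed. This is only a sketch: suggest_skip is a hypothetical helper, and the executable names (colabfold_batch, pfam_scan.pl, prank and so on) are assumptions about how each tool appears on the PATH, so adjust them to your installation. SimBa2 is left out because its entry point is not stated here; decide on skipStabilityChanges manually.

```shell
#!/usr/bin/env bash
# Sketch: suggest nextflow.config skip flags for missing optional tools.
# suggest_skip is a hypothetical helper; executable names are assumptions.

# suggest_skip FLAG CMD...: print "FLAG = true" if any CMD is not on the PATH.
suggest_skip() {
    skip_flag=$1
    shift
    for skip_cmd in "$@"; do
        if ! command -v "$skip_cmd" >/dev/null 2>&1; then
            echo "$skip_flag = true"
            return 0
        fi
    done
}

suggest_skip skipAlphaFold        colabfold_batch
suggest_skip skipPfamSearch       pfam_scan.pl hmmpress
suggest_skip skipConservationMSA  blastp muscle cd-hit
suggest_skip skipPocketPrediction prank

# Vina is installed as a Python package, so probe the import instead.
python3 -c 'import vina' >/dev/null 2>&1 || echo "skipMolecularDocking = true"
```

Each printed line can be copied into nextflow.config; no output for a component means its tools were found.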

Usage

Usage example:

nextflow run main.nf --fasta input.fasta
nextflow run path/to/main.nf --pdb input.pdb --ligands "ligand1.mol2 path/to/ligand2.pdb"

For a full list of parameters run:

nextflow run main.nf --help
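
The --ligands argument in the example above is a single quoted string of space-separated paths. When there are many ligands, the string can be built from a directory listing. The sketch below assumes a hypothetical ligands/ directory and a hypothetical join_ligands helper; paths that contain spaces are not handled.

```shell
#!/usr/bin/env bash
# Sketch: build the space-separated --ligands value from a list of files.
# join_ligands and the ligands/ directory are hypothetical.

# join_ligands FILE...: join the given paths with single spaces.
join_ligands() {
    echo "$*"
}

# Example call (requires that both patterns match at least one file):
#   ligands=$(join_ligands ligands/*.mol2 ligands/*.pdb)
#   nextflow run main.nf --pdb input.pdb --ligands "$ligands"
```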

Authors

  • Maghiar Octavian-Florin

Acknowledgements

If you find this work useful, please cite the relevant tools.

  • AlphaFold2: Jumper, J. et al. (2021) ‘Highly accurate protein structure prediction with AlphaFold’, Nature, 596(7873), pp. 583–589. Available at: https://doi.org/10.1038/s41586-021-03819-2.

  • ColabFold: Mirdita, M. et al. (2021) ColabFold - Making protein folding accessible to all. preprint. Bioinformatics. Available at: https://doi.org/10.1101/2021.08.15.456425.

  • P2Rank: Krivák, R. and Hoksza, D. (2018) ‘P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure’, Journal of Cheminformatics, 10(1), p. 39. Available at: https://doi.org/10.1186/s13321-018-0285-8.

    Jendele, L. et al. (2019) ‘PrankWeb: a web server for ligand binding site prediction and visualization’, Nucleic Acids Research, 47(W1), pp. W345–W349. Available at: https://doi.org/10.1093/nar/gkz424.

  • AutoDock Vina: Eberhardt, J. et al. (2021) ‘AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings’, Journal of Chemical Information and Modeling, 61(8), pp. 3891–3898. Available at: https://doi.org/10.1021/acs.jcim.1c00203.

    Trott, O. and Olson, A.J. (2010) ‘AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading’, Journal of Computational Chemistry, 31(2), pp. 455–461. Available at: https://doi.org/10.1002/jcc.21334.

  • SimBa2: Bæk, K.T. and Kepp, K.P. (2022) ‘Data set and fitting dependencies when estimating protein mutant stability: Toward simple, balanced, and interpretable models’, Journal of Computational Chemistry, 43(8), pp. 504–518. Available at: https://doi.org/10.1002/jcc.26810.

  • iCn3D: Wang, J. et al. (2020) ‘iCn3D, a web-based 3D viewer for sharing 1D/2D/3D representations of biomolecular structures’, Bioinformatics. Edited by A. Valencia, 36(1), pp. 131–135. Available at: https://doi.org/10.1093/bioinformatics/btz502.

  • Nextflow: Di Tommaso, P. et al. (2017) ‘Nextflow enables reproducible computational workflows’, Nature Biotechnology, 35(4), pp. 316–319. Available at: https://doi.org/10.1038/nbt.3820.

  • MultiQC: Ewels, P. et al. (2016) ‘MultiQC: summarize analysis results for multiple tools and samples in a single report’, Bioinformatics, 32(19), pp. 3047–3048. Available at: https://doi.org/10.1093/bioinformatics/btw354.

  • BLAST: Camacho, C. et al. (2009) ‘BLAST+: architecture and applications’, BMC Bioinformatics, 10(1), p. 421. Available at: https://doi.org/10.1186/1471-2105-10-421.

  • CD-HIT: Fu, L. et al. (2012) ‘CD-HIT: accelerated for clustering the next-generation sequencing data’, Bioinformatics, 28(23), pp. 3150–3152. Available at: https://doi.org/10.1093/bioinformatics/bts565.

  • MUSCLE: Edgar, R.C. (2004) ‘MUSCLE: multiple sequence alignment with high accuracy and high throughput’, Nucleic Acids Research, 32(5), pp. 1792–1797. Available at: https://doi.org/10.1093/nar/gkh340.

afpap's Issues

PFAM database

Good Afternoon,

I've started putting together a dockerfile with all the dependencies required in the repo. It looks like there are a few dependencies without arm64 support, so for now it might need to just be Linux, but I'll push that as soon as I'm done.

I'm going to start adding issues, so hopefully I can start collaborating pretty soon.

Thanks for getting this project started. It is brilliant!

-Cam

We might be able to leverage an API to pull PFAM data instead of downloading all 1.5 gigs.

Use BLAST DB api

Instead of pulling the entire BLAST database, maybe we can use an API.
