HRAS Activation Classifier is a complete pipeline for the generation of a Random Forest model, capable of discriminating a given protein structure between its active or inactive status, based on homology with similar structures available in databases. It was first used to generate a classifier on Human HRAS, hence the name. It is part of the MFP of Omics Data Analysis in the University of VIC.
- Installation
- Usage
- Usage Example
- Contributing
- License
To use this pipeline, you will need to have Python 3.10, and the latest version of NextFlow installed (https://github.com/nextflow-io/nextflow) .
Once you have Python installed, please download the repository with:
git clone https://github.com/JorgeAndOmics/HRAS-Activation-Classifier
You have to install the WCN Standalone Library from Martin Floor repository [Martin-Floor/WCN (github.com)]. Please follow the instalation instructions in the repository, but do not create a custom environment just yet.
After this, you will have to acquire a license for the Pyrosetta library [PyRosetta - Licensing PyRosetta]. Install as advised after this.
Then, with the .yaml file available in the Pipeline folder, modify it with your Pyrosetta credentials and use the following command to create a custom environment. Use of Conda is advised for environment management:
conda env create -f hras_classifier.yaml
Activate the environment with:
conda activate hras_classifier
After this, we enter into the directory :
cd HRAS-Activation-Classifier
cd Pipeline
The HRAS Activation Classifier pipeline takes the following command line arguments:
Parameter | Description |
---|---|
--uniprot_code |
The Uniprot identifier of the desired protein |
--identity_threshold |
The identity selection threshold for the retrieved structures |
--query_coverage_threshold |
The query coverage selection threshold for the retrieved structures |
--active_ligands |
A comma-separated list of ligands that unequivocally identify a protein as active |
--inactive_ligands |
A comma-separated list of ligands that unequivocally identify a protein as inactive |
--gap_open_penalty |
Gap open penalty for the pairwise alignment |
--gap_extend_penalty |
Gap extend penalty for the pairwise alignment |
--seed |
The seed for the pseudo-random processes |
To run the script, open a terminal or command prompt, navigate to the downloaded directory HRAS-Activation-Classifier
file, and enter the following command:
nextflow protein_activity_classifier.nf \
--uniprot_code <UNIPROT_CODE> \
--identity_threshold <IDENTITY_THRESHOLD> \
--query_coverage_threshold <QUERY_COVERAGE_THRESHOLD> \
--active_ligands <ACTIVE_LIGANDS> \
--inactive_ligands <INACTIVE_LIGANDS> \
--gap_open_penalty <GAP_OPEN_PENALTY> \
--gap_extend_penalty <GAP_EXTEND_PENALTY> \
--seed <SEED>
Replace each of the options with the appropriate values for your analysis.
The pipeline was created for the study of Human HRAS (Uniprot code P01112), but it is usable for any kind of enzyme. Suppose you want to generate a model for the protein Citrate synthase, Uniprot code P09948:
nextflow protein_activity_classifier.nf \
--uniprot_code = P09948 \
--identity_threshold = 300 \
--query_coverage_threshold = 50 \
--active_ligands = acetyl-CoA,oxaloacetate \
--inactive_ligands = CoA,H> \
--gap_open_penalty = -1 \
--gap_extend_penalty = -1 \
--seed = 123
The script will commence to download the required files and extract the data for them. The average time for a protein of ~200 amino acids long in a single 6 core-processor with hyperthreading was ~1h to retrieve and preprocess the data and ~2 days for training the model.
If you would like to contribute to the HRAS Activation Classifier script, please fork the repository, make your changes, and submit a pull request.
This project is licensed under the terms of the MIT license. See the LICENSE
file for more information.
- Floor, M., WCN. GitHub. Retrieved 6 September 2023, from https://github.com/Martin-Floor/WCN
- Olsson, M. H. M., Søndergaard, C. R., Rostkowski, M., & Jensen, J. H. (2011). PROPKA3: Consistent Treatment of Internal and Surface Residues in Empirical p K a Predictions. Journal of Chemical Theory and Computation, 7(2), 525–537. https://doi.org/10.1021/ct100578z .
- Chaudhury, S., Lyskov, S., & Gray, J. J. (2010). PyRosetta: A script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics, 26(5), 689–691. https://doi.org/10.1093/bioinformatics/btq007
- https://github.com/nextflow-io/nextflow
- All structures used for training the model are available at the end of the preprocessing phase in the generated metadata file.