BD_Project_DS

Input

A representative sequence of the domain family. Columns are: group, UniProt accession, organism, Pfam identifier, Pfam name, domain position in the corresponding UniProt protein, domain sequence. Group assignments are provided here.

Team 1: Q12723, Cyberlindnera mrakii (Yeast) (Williopsis mrakii), PF03060, Nitronate monooxygenase, 10-372, KTFEVRYPIIQAPMAGASTLELAATVTRLGGIGSIPMGSLSEKCDAIETQLENFDELVGDSGRIVNLNFFAHKEPRSGRADVNEEWLKKYDKIYGKAGIEFDKKELKLLYPSFRSIVDPQHPTVRLLKNLKPKIVSFHFGLPHEAVIESLQASDIKIFVTVTNLQEFQQAYESKLDGVVLQGWEAGGHRGNFKANDVEDGQLKTLDLVSTIVDYIDSASISNPPFIIAAGGIHDDESIKELLQFNIAAVQLGTVWLPSSQATISPEHLKMFQSPKSDTMMTAAISGRNLRTISTPFLRDLHQSSPLASIPDYPLPYDSFKSLANDAKQSGKGPQYSAFLAGSNYHKSWKDTRSTEEIFSILVQ

Team 2: P26010, Homo sapiens (Human), PF00362, Integrin beta subunit VWA domain, 147-393, AEGYPVDLYYLMDLSYSMKDDLERVRQLGHALLVRLQEVTHSVRIGFGSFVDKTVLPFVSTVPSKLRHPCPTRLERCQSPFSFHHVLSLTGDAQAFEREVGRQSVSGNLDSPEGGFDAILQAALCQEQIGWRNVSRLLVFTSDDTFHTAGDGKLGGIFMPSDGHCHLDSNGLYSRSTEFDYPSVGQVAQALSAANIQPIFAVTSAALPVYQELSKLIPKSAVGELSEDSSNVVQLIMDAYNSLSSTV

Domain model definition

The objective of the first part of the project is to build a PSSM and an HMM representing the assigned domain. Both models are generated starting from the assigned input sequence, and their accuracy is evaluated against the Pfam annotations provided in the SwissProt database.

Building the models:

1. Define your ground truth by finding all proteins in SwissProt annotated (and not annotated) with the assigned Pfam domain, and collect the position of the Pfam domain for all sequences. Domain positions are available here or using the InterPro API. --> DONE: solved using bd_addons\interpro_data.py; all results are saved in data_team_1\entries\entries.csv
2. Retrieve homologous proteins starting from your input sequence by performing a BLAST search against UniProt, UniRef50 or UniRef90. --> DONE: searched against UniProt, not the UniRef databases
3. Generate a multiple sequence alignment (MSA) starting from the retrieved hits using T-coffee, ClustalOmega or MUSCLE. --> DONE: solved using all the online services mentioned
4. If necessary, edit the MSA with JalView (or with your custom script) to remove noise. --> Not done: judged not necessary (?)
5. Build a PSSM model starting from the MSA. --> DONE: bash script; results still to be checked against those from the website
6. Build an HMM model starting from the MSA. --> DONE: script generate_hmms
7. Find significant hits using HMM-SEARCH and PSI-BLAST against SwissProt. --> DONE: solved with the tools used in 5
8. Evaluate the ability of matching sequences considering your ground truth. Calculate accuracy, precision, sensitivity, specificity, MCC, F-score and balanced accuracy.
9. Evaluate the ability of matching domain positions considering your ground truth, i.e. residues overlapping (and not overlapping) with Pfam domains. Calculate accuracy, precision, sensitivity, specificity, MCC, F-score, etc.
10. Consider repeating points 2-4 to improve the performance of your models.
11. Choose the best model.
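
The two evaluation steps above reduce to counting true/false positives and negatives and then applying the standard formulas. The following is a minimal sketch, not the repository's actual evaluation script: the function names (sequence_confusion, residue_confusion, metrics) are hypothetical, proteins are assumed to be identified by accession strings, and domain positions are assumed to be 1-based inclusive ranges such as 10-372.

```python
from math import sqrt


def _ratio(num, den):
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def sequence_confusion(predicted, truth, universe):
    """Count (TP, FP, TN, FN) from sets of identifiers.

    predicted: items matched by the model
    truth:     items annotated with the Pfam domain (ground truth)
    universe:  all items searched (e.g. every SwissProt accession)
    """
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = len(universe - predicted - truth)
    return tp, fp, tn, fn


def residue_confusion(predicted_range, true_range, length):
    """Count (TP, FP, TN, FN) over residues 1..length of one protein.

    Both ranges are (start, end) tuples, 1-based and inclusive (assumption).
    """
    pred = set(range(predicted_range[0], predicted_range[1] + 1))
    true = set(range(true_range[0], true_range[1] + 1))
    return sequence_confusion(pred, true, set(range(1, length + 1)))


def metrics(tp, fp, tn, fn):
    """Compute the evaluation measures listed above from the four counts."""
    sens = _ratio(tp, tp + fn)
    spec = _ratio(tn, tn + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),
        "sensitivity": sens,
        "specificity": spec,
        "balanced_accuracy": (sens + spec) / 2,
        "f_score": _ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": _ratio(
            tp * tn - fp * fn,
            sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }
```

For the residue-level evaluation, the counts from residue_confusion can be summed over all proteins before calling metrics, or the measures can be computed per protein and averaged; which of the two is preferable is a design choice the assignment leaves open.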

Domain family characterization

Once the family model is defined (previous step), you will look at functional and structural aspects/properties of the entire protein family. The objective is to provide insights about the main function of the family.

Dataset definitions:

family_structures - All PDB chains whose sequences significantly match your model with a minimum overlap of 80%. If necessary, e.g. if you get more than 50 PDB chains, reduce the size of family_structures by clustering on sequence identity.
family_sequences - All UniRef90 sequences matching your model. Limit your result to a maximum of 1,000 proteins. UniProt annotations (entries XML files) can be retrieved with the "Retrieve/ID mapping" service from the UniProt website.

Structural characterization

Perform an all-vs-all pairwise structural alignment using the TM-align software.
Build a matrix representing the pairwise RMSD and/or the TM-score provided by TM-align in the previous step for all possible pairs of structures.
Calculate a dendrogram representing a hierarchical clustering of the matrix. You can use the scipy.cluster.hierarchy.linkage and scipy.cluster.hierarchy.dendrogram Python functions.
Remove outliers.
Identify conserved positions performing a multiple structural alignment of the family_structures dataset.
Identify long range (sequence separation ≥ 12) conserved contacts. You can align the contact maps of each structure based on the multiple structural alignment and identify conserved positions.
Identify the CATH superfamily (or superfamilies) and family (or families) matching your model, if any.
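
The long-range conserved contact step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: a contact is assumed to be a pair of C-alpha atoms within 8 Å (the cutoff is an assumption, the assignment only fixes the sequence separation of 12), residue indices are 0-based, and the mapping from residue index to alignment column is assumed to come from the multiple structural alignment. All function names are hypothetical.

```python
from collections import Counter
from math import dist


def long_range_contacts(ca_coords, min_separation=12, cutoff=8.0):
    """Return residue-index pairs (i, j) with j - i >= min_separation
    whose C-alpha atoms lie within `cutoff` angstroms of each other."""
    contacts = set()
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


def map_to_columns(contacts, residue_to_column):
    """Translate residue-index pairs into alignment-column pairs.

    Residues missing from `residue_to_column` (unaligned) are skipped.
    """
    mapped = set()
    for i, j in contacts:
        if i in residue_to_column and j in residue_to_column:
            mapped.add((residue_to_column[i], residue_to_column[j]))
    return mapped


def conserved_contacts(column_contact_sets, min_fraction=0.8):
    """Keep column pairs in contact in at least `min_fraction` of structures."""
    counts = Counter(
        pair for contacts in column_contact_sets for pair in contacts
    )
    needed = min_fraction * len(column_contact_sets)
    return {pair for pair, count in counts.items() if count >= needed}
```

Working on alignment columns rather than raw residue numbers is what makes contact maps from different structures comparable; the 80% default for min_fraction is arbitrary and worth tuning once outliers are removed.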

Taxonomy

Collect the taxonomic lineage (tree branch) for each protein of the family_sequences dataset from UniProt (entity/organism/lineage in the UniProt XML).
Plot the taxonomic tree of the family with nodes size proportional to their relative abundance.
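
A minimal sketch of the lineage collection and abundance calculation is shown below. It assumes the UniProt XML has entry elements directly under the root, each with organism/lineage/taxon children, and it uses the `{*}` namespace wildcard (Python 3.8+) so the exact namespace URI does not matter. The function names are hypothetical, and plotting is left to whichever tree library is chosen.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def lineages_from_uniprot_xml(xml_text):
    """Return one list of taxon names (root to leaf) per entry."""
    root = ET.fromstring(xml_text)
    lineages = []
    for entry in root.findall("{*}entry"):
        taxa = entry.findall("{*}organism/{*}lineage/{*}taxon")
        lineages.append([taxon.text for taxon in taxa])
    return lineages


def node_abundance(lineages):
    """Map each tree node to the fraction of proteins found below it.

    A node is keyed by its full path from the root (a tuple), so that
    identically named taxa on different branches are not merged.
    """
    counts = Counter()
    for lineage in lineages:
        for depth in range(1, len(lineage) + 1):
            counts[tuple(lineage[:depth])] += 1
    total = len(lineages)
    return {node: count / total for node, count in counts.items()}
```

The resulting fractions can be used directly as node sizes when drawing the tree.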

Functional characterization

Collect GO annotations for each protein of the family_sequences dataset (entity/dbReference type="GO" in the UniProt XML).
Calculate the enrichment of each term in the dataset compared to the GO annotations available in the SwissProt database (you can download the entire SwissProt XML here). You can use Fisher's exact test and verify that both the two-tailed and right-tailed P-values (or left-tailed, depending on how you build the confusion matrix) are close to zero.
Plot enriched terms in a word cloud.
Take into consideration the hierarchical structure of the GO ontology and report the most significantly enriched branches, i.e. high-level terms.
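
The right-tailed Fisher's exact test for a single GO term can be sketched with the standard library alone. This is an illustrative sketch: the 2x2 table layout below is an assumption (rows are family versus background, columns are annotated versus not annotated), the two groups are assumed to be disjoint, and the function name is hypothetical. scipy.stats.fisher_exact is an alternative when SciPy is available.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """Right-tailed P-value for the 2x2 table

                      with term   without term
        family            a            b
        background        c            d

    i.e. the hypergeometric probability of observing at least `a`
    annotated proteins in the family by chance.
    """
    family_size = a + b
    with_term = a + c
    total = a + b + c + d
    k_max = min(family_size, with_term)
    favourable = sum(
        comb(with_term, k) * comb(total - with_term, family_size - k)
        for k in range(a, k_max + 1)
    )
    return favourable / comb(total, family_size)
```

A P-value close to zero indicates that the term is over-represented in the family. Because one test is run per GO term, some multiple-testing correction is usually advisable before ranking the terms.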

Useful Software

JalView. Multiple sequence alignment viewer.
Clustal-Omega. Multiple sequence alignment.
HMMER. Build HMM models of multiple sequence alignments. Perform HMM/sequence database searches.
NCBI-BLAST. Perform database sequence searches.
TM-align. Perform pairwise structural alignments.
HMM parser
PSSM parser

alessandromanente / bd_project_ds
