BD_Project_DS

Input

A representative sequence of the domain family. Columns are: group, UniProt accession, organism, Pfam identifier, Pfam name, domain position in the corresponding UniProt protein, domain sequence. Group assignments are provided here.

Team 1: Q12723, Cyberlindnera mrakii (Yeast) (Williopsis mrakii), PF03060, Nitronate monooxygenase, 10-372, KTFEVRYPIIQAPMAGASTLELAATVTRLGGIGSIPMGSLSEKCDAIETQLENFDELVGDSGRIVNLNFFAHKEPRSGRADVNEEWLKKYDKIYGKAGIEFDKKELKLLYPSFRSIVDPQHPTVRLLKNLKPKIVSFHFGLPHEAVIESLQASDIKIFVTVTNLQEFQQAYESKLDGVVLQGWEAGGHRGNFKANDVEDGQLKTLDLVSTIVDYIDSASISNPPFIIAAGGIHDDESIKELLQFNIAAVQLGTVWLPSSQATISPEHLKMFQSPKSDTMMTAAISGRNLRTISTPFLRDLHQSSPLASIPDYPLPYDSFKSLANDAKQSGKGPQYSAFLAGSNYHKSWKDTRSTEEIFSILVQ

Team 2: P26010, Homo sapiens (Human), PF00362, Integrin beta subunit VWA domain, 147-393, AEGYPVDLYYLMDLSYSMKDDLERVRQLGHALLVRLQEVTHSVRIGFGSFVDKTVLPFVSTVPSKLRHPCPTRLERCQSPFSFHHVLSLTGDAQAFEREVGRQSVSGNLDSPEGGFDAILQAALCQEQIGWRNVSRLLVFTSDDTFHTAGDGKLGGIFMPSDGHCHLDSNGLYSRSTEFDYPSVGQVAQALSAANIQPIFAVTSAALPVYQELSKLIPKSAVGELSEDSSNVVQLIMDAYNSLSSTV

Domain model definition

The objective of the first part of the project is to build a PSSM and an HMM model representing the assigned domain. Both models will be generated starting from the assigned input sequence. Their accuracy will be evaluated against the Pfam annotations provided in the SwissProt database.

Building the models:

  1. Define your ground truth by finding all proteins in SwissProt annotated (and not annotated) with the assigned Pfam domain and collect the position of the Pfam domain in every annotated sequence. Domain positions are available here or through the InterPro API. --> DONE: retrieved with bd_addons\interpro_data.py; all results are saved in data_team_1\entries\entries.csv

  2. Retrieve homologous proteins starting from your input sequence by performing a BLAST search against UniProt, UniRef50 or UniRef90. --> DONE: searched against UniProt, not the UniRef databases

  3. Generate a multiple sequence alignment (MSA) starting from the retrieved hits using T-coffee, ClustalOmega or MUSCLE. --> DONE: generated with each of the online services mentioned

  4. If necessary, edit the MSA with JalView (or with your custom script) to remove noise. --> NOT DONE: judged unnecessary (to be confirmed)

  5. Build a PSSM model starting from the MSA. --> DONE: Bash script; output still to be checked against the results from the website

  6. Build an HMM model starting from the MSA. --> DONE: generate_hmms script

  7. Find significant hits using HMM-SEARCH and PSI-BLAST against SwissProt. --> DONE: using the tools from step 5

  8. Evaluate the ability of matching sequences considering your ground truth. Calculate accuracy, precision, sensitivity, specificity, MCC, F-score, balanced accuracy.

  9. Evaluate the ability of matching domain position considering your ground truth, i.e. residues overlapping (and non overlapping) with Pfam domains. Calculate accuracy, precision, sensitivity, specificity, MCC, F-score, etc.

  10. Consider repeating steps 2-4 to improve the performance of your models.

  11. Choose the best model.
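
The evaluation in steps 8 and 9 can be sketched as follows. This is a minimal illustration, not the project's actual evaluation script: the function names `residue_confusion` and `metrics` are hypothetical, the sequence-level counts are made up, and only the 10-372 Pfam position comes from the Team 1 input above (the predicted position and the protein length are invented for the example).

```python
from math import sqrt


def residue_confusion(pred, truth, length):
    """Count residue-level TP/FP/FN/TN for one protein.

    pred and truth are (start, end) tuples, 1-based and inclusive,
    or None when no domain is predicted / annotated.
    """
    pred_set = set(range(pred[0], pred[1] + 1)) if pred else set()
    true_set = set(range(truth[0], truth[1] + 1)) if truth else set()
    tp = len(pred_set & true_set)   # residues in both domains
    fp = len(pred_set - true_set)   # predicted but not annotated
    fn = len(true_set - pred_set)   # annotated but not predicted
    tn = length - tp - fp - fn      # everything else in the protein
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Compute the requested scores from a 2x2 confusion matrix."""
    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f_score": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Sequence level (step 8): hypothetical counts of matched proteins.
seq_scores = metrics(tp=8, fp=2, fn=2, tn=88)

# Residue level (step 9): Pfam position 10-372 is taken from the input;
# the predicted position 20-380 and the length 400 are made up.
res_counts = residue_confusion(pred=(20, 380), truth=(10, 372), length=400)
res_scores = metrics(*res_counts)
```

For a whole dataset, the residue-level counts of all proteins would be summed before calling `metrics`, so that long proteins do not need special handling.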

Domain family characterization

Once the family model is defined (previous step), you will look at functional and structural aspects/properties of the entire protein family. The objective is to provide insights about the main function of the family.

Dataset definitions:

  • family_structures - All PDB chains whose sequences significantly match your model with a minimum overlap of 80%. If necessary, e.g. if you get more than 50 PDB chains, reduce the size of family_structures by clustering on sequence identity.
  • family_sequences - All UniRef90 sequences matching your model. Limit your result to at most 1,000 proteins. UniProt annotation (entry XML files) can be retrieved with the "Retrieve/ID mapping" service from the UniProt website.

Structural characterization

  1. Perform an all-vs-all pairwise structural alignment using the TM-align software.
  2. Build a matrix representing the pairwise RMSD and/or the TM-score provided by TM-align in the previous step for all possible pairs of structures.
  3. Calculate a dendrogram representing a hierarchical clustering of the matrix. You can use the scipy.cluster.hierarchy.linkage and scipy.cluster.hierarchy.dendrogram Python functions.
  4. Remove outliers.
  5. Identify conserved positions performing a multiple structural alignment of the family_structures dataset.
  6. Identify long-range (sequence separation ≥ 12) conserved contacts. You can align the contact maps of each structure based on the multiple structural alignment and identify conserved positions.
  7. Identify the CATH superfamily (or superfamilies) and family (or families) matching your model, if any.
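
Steps 2-4 above can be sketched with SciPy as follows. The chain labels and TM-scores are invented for illustration, and two choices are assumptions rather than part of the assignment: converting similarity to distance as 1 - TM-score, and treating chains that end up alone in a cluster at a 0.5 distance cut as outliers.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical chain labels and pairwise TM-scores (not real results).
labels = ["chainA", "chainB", "chainC", "chainD"]
tm = np.array([
    [1.00, 0.90, 0.85, 0.30],
    [0.90, 1.00, 0.88, 0.28],
    [0.85, 0.88, 1.00, 0.32],
    [0.30, 0.28, 0.32, 1.00],
])

# TM-align reports one score per normalising chain, so a real matrix may be
# asymmetric; averaging the two triangles is one simple way to symmetrise it.
tm_sym = (tm + tm.T) / 2.0

# Assumption: use 1 - TM-score as the distance between two structures.
dist = 1.0 - tm_sym
np.fill_diagonal(dist, 0.0)

# linkage expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(dist, checks=True)
Z = linkage(condensed, method="average")

# no_plot=True returns the tree layout without requiring matplotlib;
# drop it (and call matplotlib's show) to draw the dendrogram.
tree = dendrogram(Z, labels=labels, no_plot=True)

# Assumption: cut the tree at distance 0.5 and flag singleton clusters.
clusters = fcluster(Z, t=0.5, criterion="distance")
sizes = {c: list(clusters).count(c) for c in set(clusters)}
outliers = [name for name, c in zip(labels, clusters) if sizes[c] == 1]
```

With these made-up scores the first three chains form one cluster and `chainD` is flagged. The cut-off and the linkage method would need to be tuned on the real family_structures matrix.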

Taxonomy

  1. Collect the taxonomic lineage (tree branch) for each protein of the family_sequences dataset from UniProt (entity/organism/lineage in the UniProt XML).
  2. Plot the taxonomic tree of the family with node size proportional to relative abundance.
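
A possible way to collect the lineages and compute the node abundances is sketched below using only the standard library. The embedded XML is a tiny made-up stand-in for a UniProt export (the accessions are hypothetical), and tags are compared by local name so that the sketch does not depend on the exact namespace URI of the file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature of a UniProt XML export, for illustration only.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><accession>X00001</accession>
    <organism><lineage><taxon>Eukaryota</taxon><taxon>Fungi</taxon></lineage></organism>
  </entry>
  <entry><accession>X00002</accession>
    <organism><lineage><taxon>Eukaryota</taxon><taxon>Metazoa</taxon></lineage></organism>
  </entry>
  <entry><accession>X00003</accession>
    <organism><lineage><taxon>Eukaryota</taxon><taxon>Fungi</taxon></lineage></organism>
  </entry>
</uniprot>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_lineages(xml_text):
    """Return one list of taxon names per entry (entry/organism/lineage)."""
    lineages = []
    for entry in ET.fromstring(xml_text).iter():
        if local(entry.tag) != "entry":
            continue
        for organism in entry:
            if local(organism.tag) != "organism":
                continue
            for lineage in organism:
                if local(lineage.tag) == "lineage":
                    lineages.append(
                        [t.text for t in lineage if local(t.tag) == "taxon"]
                    )
    return lineages


def node_abundance(lineages):
    """Count how many proteins pass through each node of the tree.

    A node is identified by its full path from the root, so taxa with the
    same name on different branches are kept apart.
    """
    counts = Counter()
    for lineage in lineages:
        for depth in range(1, len(lineage) + 1):
            counts[tuple(lineage[:depth])] += 1
    return counts


lineages = read_lineages(SAMPLE_XML)
counts = node_abundance(lineages)
relative = {node: n / len(lineages) for node, n in counts.items()}
```

The `relative` dictionary maps each tree node to the fraction of family proteins below it, which can be passed as the node size to whichever plotting library is used for the tree.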

Functional characterization

  1. Collect GO annotations for each protein of the family_sequences dataset (entity/dbReference type="GO" in the UniProt XML).
  2. Calculate the enrichment of each term in the dataset compared to the GO annotations available in the SwissProt database (you can download the entire SwissProt XML here). You can use Fisher's exact test and verify that both the two-tailed and the right-tail P-values (or left-tail, depending on how you build the confusion matrix) are close to zero.
  3. Plot enriched terms in a word cloud.
  4. Take into consideration the hierarchical structure of the GO ontology and report most significantly enriched branches, i.e. high level terms.
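
The right-tail P-value of step 2 can be sketched with the hypergeometric distribution, which is what the one-sided Fisher's exact test evaluates. The function name `right_tail_p` and the counts are hypothetical, and the sketch assumes the simplified setting in which the family is treated as a sample drawn from the background; `scipy.stats.fisher_exact` with `alternative="greater"` is a library alternative that works directly on the 2x2 table.

```python
from math import comb


def right_tail_p(k, n, K, N):
    """Right-tail P-value for the enrichment of one GO term.

    k: family proteins annotated with the term
    n: family proteins in total
    K: background proteins annotated with the term
    N: background proteins in total

    Returns P(X >= k) for X ~ Hypergeometric(N, K, n), i.e. the chance of
    seeing at least k annotated proteins in a random sample of size n.
    """
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 for impossible draws, so no extra bounds checks.
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Made-up toy counts: all 3 family proteins carry the term, while only
# 3 of the 6 background proteins do.
p_enriched = right_tail_p(k=3, n=3, K=3, N=6)

# A term that is not enriched: 1 of 3 in the family, 3 of 6 in background.
p_not_enriched = right_tail_p(k=1, n=3, K=3, N=6)
```

In the toy example only one of the 20 possible samples contains all three annotated proteins, so `p_enriched` is 1/20. When many terms are tested, the P-values should be interpreted with a multiple-testing correction in mind.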

Useful Software

  • JalView. Multiple sequence alignment viewer.
  • Clustal-Omega. Multiple sequence alignment.
  • HMMER. Build HMM models of multiple sequence alignments. Perform HMM/sequence database searches.
  • NCBI-BLAST. Perform database sequence searches.
  • TM-align. Perform pairwise structural alignments.
  • HMM parser
  • PSSM parser
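
As an illustration of what an HMM result parser may look like, the sketch below reads the per-domain table that hmmsearch writes with its `--domtblout` option. It is not the parser linked above: the names `DomainHit` and `parse_domtblout`, the sample lines and the 0.01 i-Evalue cut-off are all assumptions made for the example.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    target: str      # target sequence name (column 1)
    i_evalue: float  # independent E-value of the domain (column 13)
    ali_from: int    # first aligned residue on the target (column 18)
    ali_to: int      # last aligned residue on the target (column 19)


def parse_domtblout(text, max_i_evalue=0.01):
    """Return the domain hits whose i-Evalue is below the cut-off."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header/footer
        # The table has 22 whitespace-separated columns followed by a
        # free-text description, hence the split limit.
        fields = line.split(None, 22)
        hit = DomainHit(
            target=fields[0],
            i_evalue=float(fields[12]),
            ali_from=int(fields[17]),
            ali_to=int(fields[18]),
        )
        if hit.i_evalue <= max_i_evalue:
            hits.append(hit)
    return hits


# Two made-up rows in domtblout layout (hypothetical targets and scores).
SAMPLE = """# target name  accession  tlen  query name  ...
sp|X00001|EXAMPLE1 - 400 my_model - 363 1e-100 330.0 0.1 1 1 1e-104 1e-100 329.8 0.1 1 363 20 380 19 381 0.98 hypothetical protein
sp|X00002|EXAMPLE2 - 250 my_model - 363 0.5 10.0 0.0 1 1 0.3 0.5 9.8 0.0 5 60 100 155 98 160 0.70 hypothetical protein
"""

hits = parse_domtblout(SAMPLE)
```

Only the first row passes the cut-off here. The `ali_from` and `ali_to` values are the positions that would be compared with the Pfam domain positions in the residue-level evaluation.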

Useful databases
