henriupton99 / proteinet-cafa5

Home Page: https://www.kaggle.com/code/henriupton/proteinet-pytorch-ems2-t5-protbert-embeddings

Python 100.00%
biology cafa deep-learning machine-learning

proteinet-cafa5's Introduction

We studied the competition context and developed a first model, named ProteiNet, for this fifth edition of the CAFA competition: https://www.kaggle.com/code/henriupton/proteinet-pytorch-ems2-t5-protbert-embeddings

Following this, we implemented various additions to build its big brother: ProteiNet v2! This new version trains not one, not two, but three models, each specialized in predicting the GO terms of one of the three aspects defined for CAFA5: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).
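As a rough illustration of the "one model per aspect" idea, the annotations can first be split into three groups so that each model only sees the GO terms of its own aspect. The sketch below is not taken from the notebooks: the function name split_by_aspect and the input layout (tuples of protein ID, GO term, aspect) are assumptions, and the aspect codes used in the actual competition files may differ from the MF/BP/CC abbreviations used here.

```python
from collections import defaultdict

# Aspect abbreviations as used in this README (assumed codes, adapt to your data).
ASPECTS = ("MF", "BP", "CC")


def split_by_aspect(rows):
    """Group annotations by aspect.

    rows: iterable of (protein_id, go_term, aspect) tuples (hypothetical layout).
    Returns a dict: aspect -> {protein_id -> set of GO terms for that aspect}.
    An aspect outside ASPECTS raises KeyError.
    """
    per_aspect = {aspect: defaultdict(set) for aspect in ASPECTS}
    for protein_id, go_term, aspect in rows:
        per_aspect[aspect][protein_id].add(go_term)
    return per_aspect
```

Each of the three resulting dictionaries can then serve as the label source for its own aspect-specific model.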

The first section of ProteiNet v2 is dedicated to training the models. To have a look at the inference section, follow this link: https://www.kaggle.com/code/henriupton/proteinet-aspects-experts-infer

The second section of ProteiNet v2 is dedicated to inference with the models trained in the first section. To have a look at the inference section, follow this link: https://www.kaggle.com/code/henriupton/proteinet-v2-inference-notebook

Feel free to share feedback and suggestions for improvement!

1. Problem Framing

1.1. What is CAFA?

CAFA stands for Critical Assessment of Functional Annotation. This Kaggle competition aims to predict the function of proteins using their amino-acid sequences and additional data. Understanding protein function is crucial for comprehending cellular processes and developing new treatments for diseases. With the abundance of genomic sequence data available, assigning accurate biological functions to proteins becomes challenging due to their multifunctionality and interactions with various partners. This competition, hosted by the Function Community of Special Interest (Function-COSI), brings together computational biologists, experimental biologists, and biocurators to improve protein function prediction through data science and machine learning approaches. The goal is to contribute to advancements in medicine, agriculture, and overall human and animal health.

1.2. What to submit?

This competition evaluates participants' predictions of Gene Ontology (GO) terms for protein sequences. The evaluation is performed on a test set of proteins that initially have no assigned functions but may accumulate experimental annotations after the submission deadline. The test set is divided according to the three GO subontologies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).

[Figure: image-intro]

2. General Baseline

  • Collect embedding vectors from pre-trained protein language models (T5, ProtBERT or ESM2). Sources for the embedding vectors: T5, ProtBERT, ESM2.

  • Generate labels from the train_terms file: take the K most common GO terms across the whole protein set, then build for each protein a binary vector of length K whose entries indicate whether each of the K GO terms is annotated to that protein (1) or not (0). Here we retain K = 600.

  • Create a PyTorch Dataset class that handles protein IDs and embeddings.

  • Create a PyTorch Model class for prediction: any multilabel classification architecture that maps embeddings of shape (E,) to probabilities of shape (K,) will do. Here we explore a multilayer perceptron (MLP).

  • Run cross-validation with respect to the F1 measure and tune the hyperparameters.
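The baseline steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebooks' actual code: the names build_labels, ProteinDataset, MLP and f1_score, the dictionary-based input layout, and the hidden size of 512 are all hypothetical. The F1 function is a simple micro-averaged F1 at a fixed threshold, which is a simplification and not necessarily the competition's official metric.

```python
from collections import Counter

import torch
from torch import nn
from torch.utils.data import Dataset

K = 600  # number of most common GO terms kept as labels (value from the text)


def build_labels(annotations, k=K):
    """annotations: dict protein_id -> set of GO terms (hypothetical layout).

    Returns the list of the k most common terms and a dict mapping each
    protein_id to a binary float tensor of length len(top_terms).
    """
    counts = Counter(term for terms in annotations.values() for term in terms)
    top_terms = [term for term, _ in counts.most_common(k)]
    index = {term: i for i, term in enumerate(top_terms)}
    labels = {}
    for protein_id, terms in annotations.items():
        vec = torch.zeros(len(top_terms))
        for term in terms:
            if term in index:  # terms outside the top k are ignored
                vec[index[term]] = 1.0
        labels[protein_id] = vec
    return top_terms, labels


class ProteinDataset(Dataset):
    """Pairs each protein's embedding (shape (E,)) with its label vector."""

    def __init__(self, protein_ids, embeddings, labels):
        self.protein_ids = list(protein_ids)
        self.embeddings = embeddings  # dict: protein_id -> tensor of shape (E,)
        self.labels = labels          # dict: protein_id -> tensor of shape (K,)

    def __len__(self):
        return len(self.protein_ids)

    def __getitem__(self, i):
        protein_id = self.protein_ids[i]
        return self.embeddings[protein_id], self.labels[protein_id]


class MLP(nn.Module):
    """Small multilabel classifier; outputs logits, one per GO term."""

    def __init__(self, embed_dim, num_labels, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, x):
        return self.net(x)


def f1_score(probs, targets, threshold=0.5):
    """Micro-averaged F1 over all (protein, term) pairs at a fixed threshold."""
    preds = (probs >= threshold).float()
    tp = (preds * targets).sum().item()
    fp = (preds * (1 - targets)).sum().item()
    fn = ((1 - preds) * targets).sum().item()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0
```

Because the model returns logits, torch.sigmoid turns them into per-term probabilities at inference time, and nn.BCEWithLogitsLoss is a common choice of training loss for this kind of multilabel setup.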

2.2. ProteiNet v2: New features

Thanks to the strong interest shown in the notebook dedicated to ProteiNet v1, many bugs and defects have been fixed in this new version. In addition, my team (M. Sato, F. Lin and myself) has tried to innovate as much as possible and to incorporate various discussion topics from the competition into ProteiNet v2. Here is a list of the most important innovations:
