jjerphan / cs5242project

Predicting Protein–Ligand Interaction using Deep Learning Models

License: GNU General Public License v3.0
Models are trained, but they should also be saved.
Histories of training phases can be obtained and could be saved for further analysis.
For now, features are defined and used in different places:

- `settings.features_names` and `settings.nb_features`, where the features in use are defined globally;
- `discretization.plot_cubes`, where indices of features are hard-coded;
- `extraction_data.extract_molecule`, where molecules get created (and thus where features get defined).

It would be nicer to have something that ensures the mapping of features is consistent across the whole code base. A `FeatureManager` could be a class where one specifies which features are to be used as well as the way they get extracted from the original data. This way, all the information about features (`settings.features_names`, `settings.nb_features`, and feature indices) and their processing, currently scattered across the code base, could be gathered in a single object that would ensure their consistency.
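A minimal sketch of such a `FeatureManager`; the method names (`register`, `extract`) and the dict-based molecule representation are illustrative, not the project's actual API:

```python
import numpy as np

class FeatureManager:
    """Gathers feature names, their count and their extraction functions in
    one place, instead of settings.features_names, settings.nb_features and
    hard-coded indices spread across modules."""

    def __init__(self):
        self._extractors = {}  # name -> function(molecule) -> value

    def register(self, name, extractor):
        self._extractors[name] = extractor

    @property
    def features_names(self):
        return list(self._extractors)

    @property
    def nb_features(self):
        return len(self._extractors)

    def index(self, name):
        # Single source of truth for feature indices.
        return self.features_names.index(name)

    def extract(self, molecule):
        # One consistent ordering for every feature vector.
        return np.array([f(molecule) for f in self._extractors.values()])

manager = FeatureManager()
manager.register("is_hydrophobic", lambda atom: float(atom.get("hydrophobic", 0)))
manager.register("is_polar", lambda atom: float(atom.get("polar", 0)))
vector = manager.extract({"polar": 1})
print(manager.nb_features, manager.index("is_polar"), vector)
```

Every module would then query this one object instead of duplicating the feature list.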
We already get good results with what we have, but I think that we can still improve them.
Accuracy is not sufficient to evaluate models; different metrics can be used. This issue explores the evaluation process with metrics. There are two scenarios:

- `model.compile` and `model.evaluate` do the job (with the `custom_metrics` argument when using `keras.models.load_model`);
- computing metrics directly on the output of `model.predict`.
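For the second scenario, a minimal sketch; `y_score` stands in for the output of `model.predict` on held-out data, and the values are made up:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Stand-in for model.predict(X_test): predicted binding probabilities.
y_score = np.array([0.9, 0.8, 0.3, 0.2, 0.6, 0.1])
y_true = np.array([1, 1, 0, 0, 1, 0])

# Threshold probabilities to get hard labels for precision/recall.
y_pred = (y_score >= 0.5).astype(int)

metrics = {
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # uses the raw scores
}
print(metrics)
```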
`scikit-learn` proposes a bunch of metrics for the second scenario.

For now, some parts of the scripts print to stdout; some use logs. Thus, at the end of a job execution, we have two sets of information (when running on the clusters, stdout gets redirected to a file, so we have two files). We should choose one or the other. If we choose logs, we need to find a way to capture the output of Keras's verbose procedures, which is redirected to stdout by default.
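One possible way to capture stdout and forward it to the logging module; this is a sketch (the printed line is a stand-in for Keras verbose output), not how the project currently does it:

```python
import contextlib
import io
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("training")

# Capture anything printed to stdout during the block and forward it to the
# logger, so that all information ends up in a single set of logs.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("Epoch 1/10 - loss: 0.69")  # stand-in for a Keras progress line

for line in buffer.getvalue().splitlines():
    logger.info(line)
```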
To do before submission:
For now, as features are changing, it is possible that the data that has been extracted and the examples that have been created before are outdated.
Hence, we should make sure to perform those two first steps if the present data is not consistent.
As we have to submit a list of ten binding ligands for each protein, we need to find a way to match them. Several strategies can be used; this issue tracks the design of such strategies.

The first approach would be to return, for each protein, the 10 ligands with the highest probability. However, we know that there is an extra constraint, namely a one-to-one correspondence between proteins and ligands. Hence, we should take decisions for ligands globally and not per protein, as otherwise we could choose the same ligand with high confidence for several different proteins.

If we are given `n_p` proteins and `n_l` ligands to test:

- score the `n_l` ligands for each protein and take the 10 best ones;
- score all `n_p * n_l` systems and then, for each ligand that is chosen several times, keep the associated protein of highest confidence.

The representation for the cube can be improved:
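For the one-to-one constraint, the global strategy above can be phrased as an assignment problem and solved with `scipy.optimize.linear_sum_assignment` (the Hungarian algorithm); the score matrix here is made up for the sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative n_p x n_l score matrix: scores[i, j] is the predicted binding
# probability of protein i with ligand j.
scores = np.array([
    [0.9, 0.2, 0.4],
    [0.8, 0.7, 0.1],
    [0.3, 0.6, 0.5],
])

# One-to-one matching that maximizes the total score.
rows, cols = linear_sum_assignment(scores, maximize=True)
matching = dict(zip(rows.tolist(), cols.tolist()))
print(matching)
```

Each ligand is then assigned to at most one protein, instead of being picked several times by different proteins.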
In order to iterate faster, we can come up with a handler to evaluate the performance of each model via job submission.
Currently, the data extraction function is coded for the training/testing datasets only. We need to modify it so that it can also be used for prediction during grading.
The job submission system can be improved and made more modular. Everything is about nicely interfacing the submission files (which can be run as `bash` scripts) with the job scripts.
The little harness that has been built to create submission files incrementally can also be improved to be more robust and more concise.
For now, we just have a really simple training procedure. We should be able to improve it using:

- `model.fit` with the `class_weight` argument (see the doc);
- `binary_crossentropy`, which is better suited than MSE for our case;
- `EarlyStopping`.

(`protein_concentration`, `ligand_concentration`, `hydrophobic_concentration`, `polar_concentration`)

As the project has been built to be run on a specific cluster, we made the documentation oriented towards it. We should make it more comprehensive so that one can run what has been done more quickly.
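For the `class_weight` idea above, scikit-learn can compute balanced weights to pass to `model.fit` (the labels here are illustrative); `EarlyStopping` would then be added via the `callbacks` argument of `model.fit`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalanced labels: many negative examples per positive one.
y_train = np.array([0] * 90 + [1] * 10)

# Weights inversely proportional to class frequencies; the resulting dict
# can be passed as model.fit(..., class_weight=class_weight).
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

The rare positive class gets the larger weight, so the loss no longer rewards always predicting "no binding".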
For now, after a job, a log is created in `logs`, but PBS on the clusters also spawns other files containing the output as well as the errors.
Also, the submission file can be reused for different jobs.
Models and histories get saved too.
We should find a way to keep track of jobs and to have their outputs (model, logs…) in the same place.
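A simple sketch of the idea: one timestamped directory per job gathering model, history and logs (paths and naming are illustrative, not the project's current layout):

```python
import os
from datetime import datetime

# One directory per job, named by timestamp (a PBS job ID would work too),
# so that the model, the training history and the logs of a run live together.
run_dir = os.path.join("runs", datetime.now().strftime("%Y%m%d-%H%M%S"))
os.makedirs(run_dir, exist_ok=True)

model_path = os.path.join(run_dir, "model.h5")
history_path = os.path.join(run_dir, "history.json")
log_path = os.path.join(run_dir, "job.log")
print(run_dir)
```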
For now, jobs get run on the machines of our cluster, but there is no way to know if everything has been done correctly.
We should add logs to make sure the jobs are performed as expected.
We should create tests:

- that example creation yields `nb_neg_per_pos` negative examples for each positive example.