jjerphan / cs5242project

Predicting Protein–Ligand Interaction using Deep Learning Models

License: GNU General Public License v3.0
Models are trained, but they should also be saved.
Histories of training phases can be obtained and could be saved for further analysis.
For now, features are defined and used in different places:

- `settings.features_names` and `settings.nb_features`, where the features in use are defined globally;
- `discretization.plot_cubes`, where indices of features are hard-coded;
- `extraction_data.extract_molecule`, where molecules get created (and thus where features get defined).

It would be nicer to have something that ensures the mapping of features is consistent across the whole code base. A `FeatureManager` could be a class where one specifies which features are to be used as well as the way they get extracted from the original data. This way, all the information about features (`settings.features_names`, `settings.nb_features`, and feature indices) and their processing, currently scattered across the code base, could be gathered in a single object that would ensure their consistency.
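A minimal sketch of such a `FeatureManager`; the method names (`register`, `extract`) and the dict-based molecule representation are illustrative, not the project's actual API:

```python
import numpy as np

class FeatureManager:
    """Gathers feature names, their count and their extraction functions in
    one place, instead of settings.features_names, settings.nb_features and
    hard-coded indices spread across modules."""

    def __init__(self):
        self._extractors = {}  # name -> function(molecule) -> value

    def register(self, name, extractor):
        self._extractors[name] = extractor

    @property
    def features_names(self):
        return list(self._extractors)

    @property
    def nb_features(self):
        return len(self._extractors)

    def index(self, name):
        # Single source of truth for feature indices.
        return self.features_names.index(name)

    def extract(self, molecule):
        # One consistent ordering for every feature vector.
        return np.array([f(molecule) for f in self._extractors.values()])

manager = FeatureManager()
manager.register("is_hydrophobic", lambda atom: float(atom.get("hydrophobic", 0)))
manager.register("is_polar", lambda atom: float(atom.get("polar", 0)))
vector = manager.extract({"polar": 1})
print(manager.nb_features, manager.index("is_polar"), vector)
```

Every module would then query this one object instead of duplicating the feature list.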
We already get good results with what we have, but I think that we can still improve them.
Accuracy is not sufficient to evaluate models; different metrics can be used. This issue explores the evaluation process with metrics. There are two scenarios:

- `model.compile` and `model.evaluate` do the job (with the `custom_metrics` argument when using `keras.models.load_model`);
- computing metrics directly on the output of `model.predict`.
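For the second scenario, a minimal sketch; `y_score` stands in for the output of `model.predict` on held-out data, and the values are made up:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Stand-in for model.predict(X_test): predicted binding probabilities.
y_score = np.array([0.9, 0.8, 0.3, 0.2, 0.6, 0.1])
y_true = np.array([1, 1, 0, 0, 1, 0])

# Threshold probabilities to get hard labels for precision/recall.
y_pred = (y_score >= 0.5).astype(int)

metrics = {
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # uses the raw scores
}
print(metrics)
```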
`scikit-learn` proposes a bunch of metrics for the second scenario.

For now, some parts of the scripts print to stdout; some use logs. Thus, at the end of a job execution, we have two sets of information (when running on the clusters, stdout gets redirected to a file, so we have two files). We should choose one or the other. If we choose logs, we need to find a way to capture the output of Keras's verbose procedures, which is redirected to stdout by default.
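One possible way to capture stdout and forward it to the logging module; this is a sketch (the printed line is a stand-in for Keras verbose output), not how the project currently does it:

```python
import contextlib
import io
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("training")

# Capture anything printed to stdout during the block and forward it to the
# logger, so that all information ends up in a single set of logs.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("Epoch 1/10 - loss: 0.69")  # stand-in for a Keras progress line

for line in buffer.getvalue().splitlines():
    logger.info(line)
```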
To do before submission:
For now, as features are changing, it is possible that the data that has been extracted and the examples that have been created before are outdated.
Hence, we should make sure to perform those two first steps if the present data is not consistent.
As we have to submit a list of ten binding ligands for each protein, we need to find a way to match them. Several strategies can be used; this issue tracks the design of such strategies.

The first approach would be to return, for each protein, the 10 ligands with the highest probability. However, we know that there is an extra constraint, namely a one-to-one correspondence between proteins and ligands. Hence, we should take decisions for ligands globally and not per protein, as otherwise we could choose the same ligand with high confidence for several different proteins.

If we are given `n_p` proteins and `n_l` ligands to test:

- score the `n_l` ligands for each protein and take the 10 best ones;
- score all `n_p * n_l` systems and then, for each ligand that is chosen several times, keep the associated protein of highest confidence.

The representation for the cube can be improved:
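For the one-to-one constraint, the global strategy above can be phrased as an assignment problem and solved with `scipy.optimize.linear_sum_assignment` (the Hungarian algorithm); the score matrix here is made up for the sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative n_p x n_l score matrix: scores[i, j] is the predicted binding
# probability of protein i with ligand j.
scores = np.array([
    [0.9, 0.2, 0.4],
    [0.8, 0.7, 0.1],
    [0.3, 0.6, 0.5],
])

# One-to-one matching that maximizes the total score.
rows, cols = linear_sum_assignment(scores, maximize=True)
matching = dict(zip(rows.tolist(), cols.tolist()))
print(matching)
```

Each ligand is then assigned to at most one protein, instead of being picked several times by different proteins.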
In order to iterate faster, we can come up with a handler to evaluate the performance of each model via job submission.
Currently, the data extraction function is coded for the training/testing datasets only. We need to modify it so that it can also be used for prediction during grading.
The job submission system can be improved and made more modular. Everything is about nicely interfacing the submission files (which can be run as `bash` scripts) with the job scripts.
The little harness that has been built to create submission files incrementally can also be improved to be more robust and more concise.
For now, we just have a really simple training procedure. We should be able to improve it using:

- `model.fit` with the `class_weight` argument (see the doc);
- `binary_crossentropy`, which is better suited than MSE for our case;
- `EarlyStopping`.

(`protein_concentration`, `ligand_concentration`, `hydrophobic_concentration`, `polar_concentration`)

As the project has been built to be run on a specific cluster, we made the documentation oriented towards it. We should make it more comprehensive so that one can run what has been done more quickly.
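For the `class_weight` idea above, scikit-learn can compute balanced weights to pass to `model.fit` (the labels here are illustrative); `EarlyStopping` would then be added via the `callbacks` argument of `model.fit`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalanced labels: many negative examples per positive one.
y_train = np.array([0] * 90 + [1] * 10)

# Weights inversely proportional to class frequencies; the resulting dict
# can be passed as model.fit(..., class_weight=class_weight).
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

The rare positive class gets the larger weight, so the loss no longer rewards always predicting "no binding".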
For now, after a job, a log is created in `logs`, but PBS on the clusters also spawns other files containing the output as well as the errors.
Also, the submission file can be reused for different jobs.
Models and histories get saved too.
We should find a way to keep track of jobs and to have their outputs (model, logs…) in the same place.
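A simple sketch of the idea: one timestamped directory per job gathering model, history and logs (paths and naming are illustrative, not the project's current layout):

```python
import os
from datetime import datetime

# One directory per job, named by timestamp (a PBS job ID would work too),
# so that the model, the training history and the logs of a run live together.
run_dir = os.path.join("runs", datetime.now().strftime("%Y%m%d-%H%M%S"))
os.makedirs(run_dir, exist_ok=True)

model_path = os.path.join(run_dir, "model.h5")
history_path = os.path.join(run_dir, "history.json")
log_path = os.path.join(run_dir, "job.log")
print(run_dir)
```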
For now, jobs get run on the machines of our cluster, but there is no way to know if everything has been done correctly.
We should add logs to make sure the jobs are performed as expected.
We should create tests:

- that example creation yields `nb_neg_per_pos` negative examples for each positive example.