

cs5242project's Issues

Saving models and histories

Models are trained, but they should also be saved.
The histories of training phases can be obtained and could be saved for further analysis.
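
One way this could look with the Keras API, assuming a compiled model and training data already in scope (the file paths are illustrative):

```python
import json

# Train, then persist both the model and its training history.
history = model.fit(x_train, y_train, epochs=50, validation_split=0.1)

# Keras serializes architecture, weights and optimizer state into one file.
model.save("models/conv3d_baseline.h5")

# history.history is a dict of per-epoch metric lists; cast values to plain
# floats so the dict is JSON-serializable.
serializable = {k: [float(v) for v in vs] for k, vs in history.history.items()}
with open("histories/conv3d_baseline.json", "w") as f:
    json.dump(serializable, f)
```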

Manage feature consistency

For now, features are defined and used in different places:

  • settings.features_names and settings.nb_features, where the features used are defined globally
  • discretization.plot_cubes, where feature indices are hard-coded
  • extraction_data.extract_molecule, where molecules get created (and thus where features get defined)

It would be nicer to have something that ensures the mapping of features is consistent across the whole code base.

A FeatureManager could be a class where one specifies which features are to be used, as well as the way they get extracted from the original data. This way, all the information about features (settings.features_names, settings.nb_features, and feature indices) and their processing, which is currently spread across the code base, could be gathered in one object that would ensure their consistency.
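
A minimal sketch of such a class (all names besides the existing settings ones are hypothetical):

```python
class FeatureManager:
    """Single source of truth for feature names, order, and extraction."""

    def __init__(self):
        # Maps a feature name to an extractor: raw atom record -> float.
        # Insertion order defines the feature indices used everywhere
        # (Python 3.7+ dicts preserve insertion order).
        self._extractors = {}

    def register(self, name, extractor):
        self._extractors[name] = extractor

    @property
    def features_names(self):
        # Replaces settings.features_names.
        return list(self._extractors.keys())

    @property
    def nb_features(self):
        # Replaces settings.nb_features.
        return len(self._extractors)

    def index_of(self, name):
        # Replaces the indices hard-coded in discretization.plot_cubes.
        return self.features_names.index(name)

    def extract(self, atom):
        # Used by extraction_data.extract_molecule to build feature vectors.
        return [extractor(atom) for extractor in self._extractors.values()]


# Example registration; the extractor bodies depend on the raw data format.
features = FeatureManager()
features.register("x", lambda atom: atom["x"])
features.register("is_hydrophobic", lambda atom: float(atom["hydrophobic"]))
```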

Last further improvements

We already get good results with what we have.
I think that we can improve these results by (see the sketch after this list):

  • comparing different new architectures (VGG-, Inception- and ResNet-inspired)
  • comparing different optimizers (SGD, Adadelta, Adam)
  • trying learning rate schedules
  • tuning learning rate values
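
A rough sketch of how such a comparison could be wired up with Keras, assuming training data in scope and a hypothetical build_model factory that returns a fresh, uncompiled model:

```python
from keras.callbacks import LearningRateScheduler
from keras.optimizers import SGD, Adadelta, Adam

def schedule(epoch):
    # One possible schedule: halve the initial rate every 10 epochs.
    return 0.01 * (0.5 ** (epoch // 10))

optimizers = {
    "sgd": SGD(lr=0.01, momentum=0.9),
    "adadelta": Adadelta(),
    "adam": Adam(lr=0.001),
}

val_losses = {}
for name, optimizer in optimizers.items():
    model = build_model()  # hypothetical factory, one fresh model per run
    model.compile(optimizer=optimizer, loss="binary_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train, epochs=30, validation_split=0.1,
                        callbacks=[LearningRateScheduler(schedule)])
    val_losses[name] = history.history["val_loss"][-1]
```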

Evaluation process and metrics

Accuracy is not sufficient to evaluate models. Different metrics can be used, mainly:

  • F1 score
  • Recall
  • Precision
  • AUC

This issue explores the evaluation process with these metrics; there are two scenarios:

  • use what Keras proposes, that is, its own handling of metrics as well as its way of defining custom metrics
    • this way, everything is delegated to model.compile and model.evaluate
    • we may need to define more metrics ourselves
    • we will need to keep track of custom metrics somewhere for the custom_objects argument when using keras.models.load_model
  • or use a custom evaluation process with "manual, a posteriori" computations of scores from the outputs of model.predict (see the sketch below)
    • this might be more complex, but we would have more liberty and flexibility at this step
    • scikit-learn proposes a bunch of metrics
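
A sketch of the second scenario with scikit-learn, assuming binary labels y_true and a trained model outputting probabilities:

```python
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             roc_auc_score)

# Probabilities from the trained model, then hard predictions at 0.5.
y_prob = model.predict(x_test).ravel()
y_pred = (y_prob > 0.5).astype(int)

scores = {
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_prob),  # AUC uses the raw probabilities
}
```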

stdout vs logs

For now, some parts of the scripts print to stdout; others use logs.

Thus, at the end of a job execution, we have two sets of information (when running on the clusters, stdout gets redirected to a file, so we end up with two files).

We should choose one or the other. If we choose logs, we need to find a way to catch the output from Keras's verbose procedures, as it is redirected to stdout by default.
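
One way to do that, sketched with a custom Keras callback that replaces the verbose progress bar:

```python
import logging
from keras.callbacks import Callback

logger = logging.getLogger(__name__)

class LoggingCallback(Callback):
    """Route per-epoch metrics to the logging module instead of stdout."""

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        formatted = ", ".join("%s=%.4f" % (k, v) for k, v in logs.items())
        logger.info("epoch %d: %s", epoch + 1, formatted)

# Silence Keras's own progress bar and log through the callback instead.
model.fit(x_train, y_train, epochs=30, verbose=0,
          callbacks=[LoggingCallback()])
```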

Cleaning and documenting repo before submission

To do before submission:

  • clean, tidy, and comment the code
  • clean the README
  • write documentation for cluster computation
  • present the structure of the project
  • prepare the structure of the project for submission (the folder for training data is different, for example)

Matching proteins and ligands together

As we have to submit a list of ten candidate binding ligands for each protein, we need to find a way to match them. Several strategies can be used; this issue tracks the design of such strategies.

The first approach would be to return, for each protein, the 10 ligands with the highest probability. However, we know that there is an extra constraint, namely a one-to-one correspondence between proteins and ligands. Hence, we should (or must) make decisions about ligands globally and not per protein, as otherwise the same ligand could be chosen with high confidence for several different proteins.

If we are given n_p proteins and n_l ligands to test:

  • The first, simpler approach would consist of evaluating, for each protein, the n_l ligands and taking the 10 best ones.
  • The second approach would consist of evaluating the n_p × n_l systems and then, for each ligand that gets chosen several times, keeping only the protein with the highest confidence (see the sketch below).
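
Since the one-to-one correspondence makes this an assignment problem, one option beyond greedy conflict resolution is the Hungarian algorithm via scipy. A sketch, where score_matrix is a hypothetical (n_p, n_l) array of predicted binding probabilities filled from model.predict:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# scores[i, j]: predicted probability that ligand j binds protein i,
# obtained by evaluating all n_p * n_l protein-ligand systems.
scores = np.asarray(score_matrix)

# linear_sum_assignment minimizes total cost, so negate the scores to
# maximize total binding probability under the one-to-one constraint.
protein_idx, ligand_idx = linear_sum_assignment(-scores)
matching = dict(zip(protein_idx, ligand_idx))  # protein index -> ligand index
```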

Improve cube representation

The representation of the cube can be improved:

  • for now, a cube gets created for each system, but cubes are not absolutely scaled. That is, systems that have the same shape but not the same size get represented by the same cube. Thus, we should make sure to keep not just the proportions but also the size. This can be done by changing the representation to use cubes with an absolute size in Å (see the sketch below).
  • maybe some other improvements here
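
A sketch of absolute scaling, assuming atom coordinates centered on the system and given in Å, with per-atom feature vectors (names and defaults are illustrative):

```python
import numpy as np

def make_absolute_cube(coords, features, cube_size=20.0, resolution=1.0):
    """Voxelize a system into a cube of fixed physical size.

    coords:   (n_atoms, 3) coordinates in Angstroms, centered on the system.
    features: (n_atoms, nb_features) per-atom feature vectors.
    Returns a (n_voxels, n_voxels, n_voxels, nb_features) grid where
    n_voxels = cube_size / resolution, so two systems of different physical
    size no longer map to the same cube.
    """
    n_voxels = int(cube_size / resolution)
    cube = np.zeros((n_voxels, n_voxels, n_voxels, features.shape[1]))

    # Shift so the cube spans [0, cube_size) around the origin.
    indices = np.floor((coords + cube_size / 2) / resolution).astype(int)

    for (i, j, k), feat in zip(indices, features):
        if 0 <= i < n_voxels and 0 <= j < n_voxels and 0 <= k < n_voxels:
            cube[i, j, k] += feat  # atoms outside the cube are dropped
    return cube
```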

Improve job submission fixture

The job submission system can be improved and made more modular. Everything is about nicely interfacing the submission files (which can be run as bash scripts) with the job scripts.

The little harness that has been built to create submission files incrementally can be improved to be more robust and more concise as well.
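
The harness could, for instance, expose one function that renders a whole submission file at once. A rough sketch, with illustrative PBS directives and paths:

```python
def write_submission_file(path, job_name, script, walltime="04:00:00"):
    """Write a PBS submission file that can also be run as a bash script."""
    lines = [
        "#!/bin/bash",
        "#PBS -N %s" % job_name,
        "#PBS -l walltime=%s" % walltime,
        "#PBS -o logs/%s.out" % job_name,
        "#PBS -e logs/%s.err" % job_name,
        "cd $PBS_O_WORKDIR",
        "python %s" % script,
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```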

Models and Training procedure improvements

For now, we have a really simple training procedure. We should be able to improve it using the following (several of these points are combined in the sketch after this list):

  • weights on classes when model.fit-ing, using the class_weight argument (see the doc)
  • a better loss function: binary_crossentropy is better suited than MSE for our case
  • maybe another optimizer, like Adam (?)
  • adding EarlyStopping
  • trying another set of features (something like protein_concentration, ligand_concentration, hydrophobic_concentration, polar_concentration)
  • trying the other cube representations
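
A sketch combining several of these points, assuming a built model, training data in scope, and the nb_neg_per_pos setting from the code base:

```python
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam

from settings import nb_neg_per_pos  # ratio of negatives per positive

model.compile(optimizer=Adam(lr=0.001),
              loss="binary_crossentropy",  # better suited than MSE here
              metrics=["accuracy"])

# Weight mistakes on the rarer positive class more heavily.
class_weight = {0: 1.0, 1: float(nb_neg_per_pos)}

model.fit(x_train, y_train,
          epochs=100,
          validation_split=0.1,
          class_weight=class_weight,
          callbacks=[EarlyStopping(monitor="val_loss", patience=5)])
```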

Comprehensive documentation

As the project has been built to be run on a specific cluster, the documentation is oriented toward that setup.

We should make it more comprehensive so that anyone can run what has been done more quickly.

Gather all the info about a job in one place

For now, after a job, a log is created in logs, but PBS on the clusters spawns other files containing the output as well as the errors.

Also, the same submission file can be used several times for different jobs.
Models and histories get saved too.

We should find a way to keep track of jobs and to gather their outputs (model, logs, …) in the same place.
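
A hypothetical helper sketching the idea:

```python
import os
import shutil
from datetime import datetime

def gather_job_artifacts(job_name, artifacts):
    """Move all files produced by a job into one timestamped directory."""
    job_dir = os.path.join("jobs", "%s_%s" % (
        job_name, datetime.now().strftime("%Y%m%d_%H%M%S")))
    os.makedirs(job_dir)
    for path in artifacts:  # e.g. model file, history, logs, PBS outputs
        if os.path.exists(path):
            shutil.move(path, job_dir)
    return job_dir
```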

Log details of execution

For now, jobs get run on the machines of our cluster, but there is no way to know whether everything has been done correctly.

We should add logs to make sure jobs are performed as intended.
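
A minimal sketch with the standard logging module (the file path, messages, and variable names are illustrative):

```python
import logging

logging.basicConfig(
    filename="logs/job.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)

# Each step of a job then records what it did and with which parameters.
logger.info("loading training data from %s", data_folder)
logger.info("training finished, saving model to %s", model_path)
```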

Create tests

We should create tests for:

  • the molecule representation
  • creating examples (training, testing, and prediction)
  • making cubes (for now, asserts scattered everywhere do the checking)
  • ensuring that there are nb_neg_per_pos negative examples per positive example (see the sketch after this list)
  • ExamplesIterator
  • ModelsIterator
  • a functional test for training
  • a functional test for evaluation
  • a functional test for global evaluation on several models
  • job submission file creation
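
For example, the ratio test could look like this, runnable with pytest (the module and iterator constructor are hypothetical, mirroring the code base):

```python
from settings import nb_neg_per_pos
from examples_iterator import ExamplesIterator

def test_negative_to_positive_ratio():
    # Count labels over the whole training set.
    labels = [label for _, label in ExamplesIterator(split="training")]
    nb_pos = sum(1 for label in labels if label == 1)
    nb_neg = sum(1 for label in labels if label == 0)

    # There should be exactly nb_neg_per_pos negatives per positive.
    assert nb_neg == nb_neg_per_pos * nb_pos
```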
