Drugability - Machine Learning Study

Here we deposit the code associated with the article Minimal-uncertainty prediction of general drug-likeness based on Bayesian neural networks by Wiktor Beker, Agnieszka Wołos and Bartosz A. Grzybowski.

Prerequisites

Before you start, please make sure the following packages are installed (the version used in this project is given in parentheses):

Code organisation

The general idea is to separate the common steps of a machine learning experiment (vectorisation, model training and testing) into logical components and to control the configuration of each experiment with YAML files, while leaving room for non-standard procedures (that is, ones not already implemented in Sklearn). As the names suggest, data_preprocess.py contains the functions responsible for data loading and preprocessing, whereas models.py collects the functions that construct a model from a given config. The make_experiment.py script governs the overall workflow according to a given YAML config file (see python make_experiment.py -h and the examples provided below).

The PU-learning scripts are placed in a separate directory (scripts/pu_fuselier.py for the Fuselier method and scripts/pu_iteration.py for the Liu approach), together with some additional tools.

Configuration file

In general, a configuration file comprises three parts: loader_config, model_params and cross_validation (in some cases an additional vectorizer_config part may be passed). The names are meant to be self-explanatory; see config_files/ for examples.
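
As a rough illustration of this layout, the sketch below checks the top-level parts of an already parsed configuration. The helper check_config_sections and the placeholder values are hypothetical and not part of this repository; the actual entries inside each part are defined by the files in config_files/.

```python
# Hypothetical helper for illustration only; not part of this repository.
REQUIRED_SECTIONS = ("loader_config", "model_params", "cross_validation")
OPTIONAL_SECTIONS = ("vectorizer_config",)


def check_config_sections(config):
    """Return (missing, unrecognised) top-level section names of a parsed config."""
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    allowed = set(REQUIRED_SECTIONS) | set(OPTIONAL_SECTIONS)
    unrecognised = sorted(name for name in config if name not in allowed)
    return missing, unrecognised


# Placeholder contents (assumption): real configs hold more entries per part.
example = {
    "loader_config": {"data_balance": "weights"},
    "model_params": {},
}
missing, unrecognised = check_config_sections(example)
print("missing:", missing)
print("unrecognised:", unrecognised)
```

A check of this kind can be run on the dictionary obtained from a YAML parser before an experiment is started.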

In loader_config, the data_balance entry controls data balancing:

  • weights computes sample weights according to the size of each class,
  • random selects a random subset of the majority class (of the same size as the minority class),
  • cut selects the first part of the majority class (of the same size as the minority class).
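
The three modes can be sketched as follows for binary labels. This is a minimal illustration, not the code from data_preprocess.py: the function name balance, its signature, and the inverse-class-size weighting formula are assumptions.

```python
import numpy as np


def balance(labels, mode, seed=0):
    """Return (kept_indices, sample_weights) for binary labels.

    Hypothetical helper; the repository's own implementation may differ.
    """
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    if mode == "weights":
        # Keep every sample and weight it inversely to its class size
        # (assumed formula: n_samples / (n_classes * class_count)).
        per_class = len(labels) / (len(classes) * counts)
        weights = per_class[np.searchsorted(classes, labels)]
        return np.arange(len(labels)), weights

    minority_label = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(labels == minority_label)
    majority_idx = np.flatnonzero(labels != minority_label)
    if mode == "random":
        # Random subset of the majority class, same size as the minority class.
        rng = np.random.default_rng(seed)
        kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    elif mode == "cut":
        # First part of the majority class, same size as the minority class.
        kept = majority_idx[: len(minority_idx)]
    else:
        raise ValueError(f"unknown data_balance mode: {mode}")
    indices = np.sort(np.concatenate([minority_idx, kept]))
    return indices, np.ones(len(indices))


labels = [1, 1, 0, 0, 0, 0, 0, 0]
for mode in ("weights", "random", "cut"):
    idx, w = balance(labels, mode)
    print(mode, idx, w)
```

Note that weights keeps the whole dataset, whereas random and cut discard part of the majority class.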

Examples

The commands below assume you are working in this directory. First, create a directory for the experiment data.

mkdir experiments

RdKit and Mol2Vec vectorization

For the RdKit and Mol2Vec representations, the data has to be vectorized first (the names used below correspond to the paths included in the config files). Note that with Mol2Vec you have to provide the path to a pickle with a Mol2Vec model. An example model, obtained according to the instruction (trained on ~20M compounds from ZINC with the following parameters: radius 1, UNK replacing all identifiers that appear fewer than 4 times, skip-gram, and a window size of 10, resulting in 300-dimensional embeddings), is provided in models/model_300dim.pkl.

python scripts/vectorize.py --output_core data/drugs_approved_rdkit --descriptor rdkit  data/drugs_approved.smiles
python scripts/vectorize.py --output_core data/drugs_approved_m2v --descriptor mol2vec --mol2vec_model_pkl models/model_300dim.pkl  data/drugs_approved.smiles
python scripts/vectorize.py --output_core data/zinc15_nondrugs_sample_rdkit --descriptor rdkit  data/zinc15_nondrugs_sample.smiles
python scripts/vectorize.py --output_core data/zinc15_nondrugs_sample_m2v --descriptor mol2vec --mol2vec_model_pkl models/model_300dim.pkl  data/zinc15_nondrugs_sample.smiles

The generated files *mu.npz, *std.npz and *idx.npz contain the data for the normalization step (mean, standard deviation and non-zero indices in the original vector, respectively). Note that in the paper those values were calculated for the combined drug+nondrug dataset. Here, for the sake of clarity, we include the zinc_nondrugs normalization parameters as an example.
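
A minimal sketch of how such parameters might be computed and applied is given below. It is not the code from scripts/vectorize.py: the function names are hypothetical, and it assumes that the stored indices select the columns with non-zero standard deviation and that the mean and standard deviation refer to those selected columns only.

```python
import numpy as np


def fit_normalization(vectors):
    """Compute (idx, mu, std) from raw descriptor vectors (hypothetical helper)."""
    vectors = np.asarray(vectors, dtype=float)
    std_all = vectors.std(axis=0)
    # Assumption: keep only columns that vary, so division by std is safe.
    idx = np.flatnonzero(std_all > 0)
    mu = vectors[:, idx].mean(axis=0)
    std = std_all[idx]
    return idx, mu, std


def apply_normalization(vectors, idx, mu, std):
    """Select the kept columns and standardize them with the stored parameters."""
    return (np.asarray(vectors, dtype=float)[:, idx] - mu) / std


# Toy data: the middle column is constant and is therefore dropped.
raw = np.array([[1.0, 0.0, 2.0],
                [3.0, 0.0, 2.0],
                [5.0, 0.0, 8.0]])
idx, mu, std = fit_normalization(raw)
normalized = apply_normalization(raw, idx, mu, std)
print(idx)
print(normalized.mean(axis=0), normalized.std(axis=0))
```

Under these assumptions, reproducing the procedure from the paper would mean calling fit_normalization on the combined drug+nondrug vectors and then applying the resulting parameters to each dataset.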

Sample perceptrons

RdKit with 5-fold cross-validation

python make_experiment.py --config config_files/rdkit_ae_zinc.yaml --validation_mode 5cv_test --output_core experiments/rdkit_ae_zinc

Bayesian NN on RdKit with 5-fold cross-validation

python make_experiment.py --config config_files/rdkit_ae_zinc_bayesian.yaml --validation_mode 5cv_test --output_core experiments/rdkit_ae_zinc_bayesian

PU-learning example

python scripts/pu_fuselier.py --output_core experiments/fuselier_zinc_rdkit --config config_files/rdkit_ae_zinc_weights.yaml
python scripts/pu_iteration.py --output_core experiments/liu_zinc_rdkit --config config_files/rdkit_ae_zinc_weights.yaml

Multi-task learning example

Vectorize clintox:

python scripts/extract_labels_from_csv.py --output data/clintox_labels.npz --output_smiles data/clintox.smiles --input data/clintox_cleaned.csv
python scripts/vectorize.py --output_core data/clintox_m2v --descriptor mol2vec --mol2vec_model_pkl MOL2VEC_MODEL_PKL  data/clintox.smiles

Run experiment

python make_experiment.py --config config_files/clintox_multitask.yaml --validation_mode 5cv_test --output_core experiments/clintox_multitask

Balanced Random Forest example

This example assumes that clintox has already been vectorized (as above).

python scripts/simple_rf.py --labels data/clintox_labels.npz --vectors data/clintox_m2v.npz --names data/clintox_cleaned.csv --brf

Closing remarks

This repository was created as part of scientific research rather than as a standalone project. Due to a tight schedule, it is quite 'sketchy'. I hope that, with upcoming projects, I will manage to transform it into a flexible framework for ML experiments, with the possibility of lower-level control over each step (e.g. user-defined splitting of the data).

ToDo

  • test whether examples work correctly (ASAP)
  • unify Keras and Sklearn APIs in make_experiment.py
