Git Product home page Git Product logo

hgraph2graph's Introduction

Hierarchical Generation of Molecular Graphs using Structural Motifs

Our paper is at https://arxiv.org/pdf/2002.03230.pdf

Installation

First install the dependencies via conda:

  • PyTorch >= 1.0.0
  • networkx
  • RDKit >= 2019.03
  • numpy
  • Python >= 3.6

And then run pip install .. Additional dependency for property-guided finetuning:

  • Chemprop >= 1.2.0

Data Format

  • For graph generation, each line of a training file is a SMILES string of a molecule
  • For graph translation, each line of a training file is a pair of molecules (molA, molB) that are similar to each other but molB has better chemical properties. Please see data/qed/train_pairs.txt. The test file is a list of molecules to be optimized. Please see data/qed/test.txt.

Molecule generation pretraining procedure

We can train a molecular language model on a large corpus of unlabeled molecules. We have uploaded a model checkpoint pre-trained on ChEMBL dataset in ckpt/chembl-pretrained/model.ckpt. If you wish to train your own language model, please follow the steps below:

  1. Extract substructure vocabulary from a given set of molecules:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt
  1. Preprocess training data:
python preprocess.py --train data/chembl/all.txt --vocab data/chembl/all.txt --ncpu 16 --mode single
mkdir train_processed
mv tensor* train_processed/
  1. Train graph generation model
mkdir ckpt/chembl-pretrained
python train_generator.py --train train_processed/ --vocab data/chembl/vocab.txt --save_dir ckpt/chembl-pretrained
  1. Sample molecules from a model checkpoint
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsamples 1000

Property-guided molcule generation procedure (a.k.a. finetuning)

The following script loads a trained Chemprop model and finetunes a pre-trained molecule language model to generate molecules with specific chemical properties.

mkdir ckpt/finetune
python finetune_generator.py --train ${ACTIVE_MOLECULES} --vocab data/chembl/vocab.txt --generative_model ckpt/chembl-pretrained/model.ckpt --chemprop_model ${YOUR_PROPERTY_PREDICTOR} --min_similarity 0.1 --max_similarity 0.5 --nsample 10000 --epoch 10 --threshold 0.5 --save_dir ckpt/finetune

Here ${ACTIVE_MOLECULES} should contain a list of experimentally verified active molecules.

${YOUR_PROPERTY_PREDICTOR} should be a directory containing saved chemprop model checkpoint.

--max_similarity 0.5 means any novel molecule should have nearest neighbor similarity lower than 0.5 to any known active molecules in ${ACTIVE_MOLECULES}` file.

--nsample 10000 means to sample 10000 molecules in each epoch.

--threshold 0.5 is the activity threshold. A molecule is considered as active if its predicted chemprop score is greater than 0.5.

In each epoch, generated active molecules are saved in ckpt/finetune/good_molecules.${epoch}. All the novel active molecules are saved in ckpt/finetune/new_molecules.${epoch}

Molecule translation training procedure

Molecule translation is often useful for lead optimization (i.e., modifying a given molecule to improve its properties)

  1. Extract substructure vocabulary from a given set of molecules:
python get_vocab.py --ncpu 16 < data/qed/mols.txt > vocab.txt

Please replace data/qed/mols.txt with your molecules.

  1. Preprocess training data:
python preprocess.py --train data/qed/train_pairs.txt --vocab data/qed/vocab.txt --ncpu 16
mkdir train_processed
mv tensor* train_processed/
  1. Train the model:
mkdir ckpt/translation
python train_translator.py --train train_processed/ --vocab data/qed/vocab.txt --save_dir ckpt/translation
  1. Make prediction on your lead compounds (you can use any model checkpoint, here we use model.5 for illustration)
python translate.py --test data/qed/valid.txt --vocab data/qed/vocab.txt --model ckpt/translation/model.5 --num_decode 20 > results.csv

Polymer generation

The polymer generation code is in the polymer/ folder. The polymer generation code is similar to train_generator.py, but the substructures are tailored for polymers. For generating regular drug like molecules, we recommend to use train_generator.py in the root directory.

hgraph2graph's People

Contributors

wengong-jin avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.