Implementation of the Seq2Seq model with attention and the Transformer used in Molecular Optimization by Capturing Chemist's Intuition Using Deep Neural Networks. Given a starting molecule and a set of desirable property changes, the goal is to generate molecules that exhibit those changes. The problem can be framed as machine translation, as in natural language processing: the property changes are added to the input together with the SMILES of the starting molecule, and the model translates this input into the SMILES of a target molecule.
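As a rough illustration of the translation framing, the sketch below builds a source sequence by prepending property-change tokens to a tokenized SMILES string. The regular expression is a commonly used SMILES tokenization pattern, not necessarily the one used in this repository, and the property-change token strings and the function names are hypothetical placeholders.

```python
import re

# A commonly used SMILES tokenization pattern (assumption: the repository's
# own tokenizer may differ). Bracket atoms and two-letter halogens are kept
# as single tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES completely: {smiles!r}")
    return tokens


def build_source_sequence(property_change_tokens, smiles):
    """Prepend property-change tokens to the SMILES tokens of the starting molecule."""
    return list(property_change_tokens) + tokenize_smiles(smiles)


if __name__ == "__main__":
    # Hypothetical property-change tokens, for illustration only.
    changes = ["LogD_(0.1,0.5]", "Solubility_low->high", "Clint_no_change"]
    print(build_source_sequence(changes, "Clc1ccccc1Br"))
```

Keeping each property change as a single token lets the model condition on it in the same way it conditions on any other input symbol.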
Create environment
conda env create -f environment.yml
source activate molopt
- Preprocess data
Encode the property changes, build the vocabulary, and split the data into training, validation and test sets. Outputs are saved in the same directory as the input data.
python preprocess.py --input-data-path data/chembl_02/mmp_prop.csv
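The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the logic of preprocess.py: the interval edges, the token format, the special tokens and the 80/10/10 split fractions are all assumptions chosen for the example.

```python
import bisect
import random


def encode_change(delta, edges, name):
    """Map a continuous property change to a right-closed interval token.

    `edges` must be sorted; both the edges and the token format are hypothetical.
    """
    i = bisect.bisect_left(edges, delta)
    low = edges[i - 1] if i > 0 else float("-inf")
    high = edges[i] if i < len(edges) else float("inf")
    closing = "]" if i < len(edges) else ")"
    return f"{name}_({low},{high}{closing}"


def build_vocabulary(token_sequences, specials=("<pad>", "<start>", "<end>")):
    """Assign an integer index to every token, with special tokens first."""
    tokens = sorted({tok for seq in token_sequences for tok in seq})
    return {tok: idx for idx, tok in enumerate(list(specials) + tokens)}


def split_rows(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle rows reproducibly and split them into train, validation and test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * fractions[0])
    n_valid = int(len(rows) * fractions[1])
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]


if __name__ == "__main__":
    edges = [-0.5, -0.1, 0.1, 0.5]  # hypothetical interval edges
    print(encode_change(0.3, edges, "LogD"))
    print(build_vocabulary([["C", "C", "O"], ["c", "1"]]))
    train, valid, test = split_rows(range(10))
    print(len(train), len(valid), len(test))
```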
- Train model
Train the model and save the results and logs to experiments/save_directory/. The model from each epoch is saved in experiments/save_directory/checkpoint/. The training loss, validation loss and validation accuracy are saved in experiments/save_directory/tensorboard/.
python train.py --data-path data/chembl_02 --save-directory train_transformer --model-choice transformer transformer
A pre-trained Transformer model can be found here.
- Generate molecules
Use the model saved at a given epoch (e.g. 60) to generate molecules for the given test file name, and save the results to experiments/save_directory/test_file_name/evaluation_epoch/generated_molecules.csv. The three test sets used in our paper can be found in data/chembl_02/, as listed below:
- Test-Original -> data/chembl_02/test.csv
- Test-Molecule -> data/chembl_02/test_not_in_train.csv
- Test-Property -> data/chembl_02/test_unseen_L-1_S01_C10_range.csv
python generate.py --model-choice transformer --data-path data/chembl_02 --test-file-name test --model-path experiments/train_transformer/checkpoint --save-directory evaluation_transformer --epoch 60
python generate.py --model-choice transformer --data-path data/chembl_02 --test-file-name test_not_in_train --model-path experiments/train_transformer/checkpoint --save-directory evaluation_transformer --epoch 60
python generate.py --model-choice transformer --data-path data/chembl_02 --test-file-name test_unseen_L-1_S01_C10_range --model-path experiments/train_transformer/checkpoint --save-directory evaluation_transformer --epoch 60
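The three commands above differ only in the test file name, so they can be generated in a loop. The sketch below rebuilds the same argument lists and the output path implied by the template experiments/save_directory/test_file_name/evaluation_epoch/generated_molecules.csv. The helper names are hypothetical, and the defaults simply mirror the example commands.

```python
from pathlib import PurePosixPath

# Test file names taken from the list of test sets above.
TEST_SETS = ["test", "test_not_in_train", "test_unseen_L-1_S01_C10_range"]


def generate_command(test_file_name, epoch=60,
                     model_path="experiments/train_transformer/checkpoint",
                     save_directory="evaluation_transformer"):
    """Build the generate.py argument list for one test set."""
    return [
        "python", "generate.py",
        "--model-choice", "transformer",
        "--data-path", "data/chembl_02",
        "--test-file-name", test_file_name,
        "--model-path", model_path,
        "--save-directory", save_directory,
        "--epoch", str(epoch),
    ]


def generated_csv_path(save_directory, test_file_name, epoch):
    """Return the location where the generated molecules are expected to be saved."""
    return (PurePosixPath("experiments") / save_directory / test_file_name
            / f"evaluation_{epoch}" / "generated_molecules.csv")


if __name__ == "__main__":
    for name in TEST_SETS:
        print(" ".join(generate_command(name)))
        print("  ->", generated_csv_path("evaluation_transformer", name, 60))
        # To actually run it: subprocess.run(generate_command(name), check=True)
```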
- Compute properties for generated molecules
Because the property prediction model was built on in-house experimental data, it cannot be made public. The computed properties are, however, provided in experiments/evaluation_transformer/test_file_name/evaluation_60/generated_molecules_prop.csv
- Evaluate the generated molecules in terms of satisfying the desirable properties and draw molecules
python evaluate.py --data-path experiments/evaluation_transformer/test/evaluation_60/generated_molecules_prop.csv
python evaluate.py --data-path experiments/evaluation_transformer/test_not_in_train/evaluation_60/generated_molecules_prop.csv
python evaluate.py --data-path experiments/evaluation_transformer/test_unseen_L-1_S01_C10_range/evaluation_60/generated_molecules_prop.csv --range-evaluation lower
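How evaluate.py scores molecules is not described here, so the following is only a sketch of one plausible criterion: a generated molecule counts as successful when its predicted property change falls inside the desired right-closed interval. The function names, the interval and the example values are hypothetical.

```python
def in_desired_range(value, low, high):
    """Return True when value lies in the right-closed interval (low, high]."""
    return low < value <= high


def success_rate(predicted_changes, low, high):
    """Fraction of predicted property changes that fall in the desired interval."""
    if not predicted_changes:
        return 0.0
    hits = sum(in_desired_range(v, low, high) for v in predicted_changes)
    return hits / len(predicted_changes)


if __name__ == "__main__":
    # Hypothetical predicted LogD changes for four generated molecules.
    changes = [0.2, 0.6, 0.3, -0.1]
    print(success_rate(changes, low=0.1, high=0.5))
```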
- Matched molecular pair analysis between starting molecules and generated molecules
- Download mmpdb for matched molecular pair generation
- Pass the downloaded mmpdb path (i.e. path/mmpdb/) to --mmpdb-path of mmp_analysis.py
Between starting molecules and all the generated molecules
python mmp_analysis.py --data-path experiments/evaluation_transformer/test/evaluation_60/generated_molecules_prop.csv --train-path data/chembl_02/train.csv --mmpdb-path path/mmpdb/
python mmp_analysis.py --data-path experiments/evaluation_transformer/test_not_in_train/evaluation_60/generated_molecules_prop.csv --train-path data/chembl_02/train.csv --mmpdb-path path/mmpdb/
python mmp_analysis.py --data-path experiments/evaluation_transformer/test_unseen_L-1_S01_C10_range/evaluation_60/generated_molecules_prop.csv --train-path data/chembl_02/train.csv --mmpdb-path path/mmpdb/
Between starting molecules and all the generated molecules with desirable properties
python mmp_analysis.py --data-path experiments/evaluation_transformer/test/evaluation_60/generated_molecules_prop_statistics.csv --train-path data/chembl_02/train.csv --mmpdb-path path/mmpdb/ --only-desirable
python mmp_analysis.py --data-path experiments/evaluation_transformer/test_not_in_train/evaluation_60/generated_molecules_prop_statistics.csv --train-path data/chembl_02/train.csv --mmpdb-path path/mmpdb/ --only-desirable
python mmp_analysis.py --data-path experiments/evaluation_transformer/test_unseen_L-1_S01_C10_range/evaluation_60/generated_molecules_prop_statistics.csv --train-path data/chembl_02/train.csv --mmpdb-path path/mmpdb/ --only-desirable
The code is copyright 2020 by Jiazhen He and distributed under the Apache-2.0 license. See LICENSE for details.