git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
# add to PYTHONPATH
echo export PYTHONPATH="\$PYTHONPATH:`pwd`" >> ~/.bashrc
source ~/.bashrc
from chembench import load_data
df, induces = load_data('ESOL')
# get the indices of each of the 3 random splits (one train/valid/test triple per split)
train_idx, valid_idx, test_idx = induces[0]
train_idx, valid_idx, test_idx = induces[1]
train_idx, valid_idx, test_idx = induces[2]
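Before training on a split, it can help to confirm that its three index sets do not overlap. The following is a minimal sketch, not part of ChemBench: `check_split` is a hypothetical helper, the toy lists stand in for one entry of `induces`, and it assumes the indices are integer row positions.

```python
def check_split(train_idx, valid_idx, test_idx, n_samples):
    """Return True if the three index sets are disjoint and lie in [0, n_samples)."""
    parts = [set(int(i) for i in idx) for idx in (train_idx, valid_idx, test_idx)]
    total = sum(len(p) for p in parts)
    union = set().union(*parts)
    disjoint = len(union) == total          # no index appears in two sets
    in_range = all(0 <= i < n_samples for i in union)
    return disjoint and in_range


# Toy stand-in for one (train, valid, test) triple; not real ChemBench data.
toy_split = ([0, 1, 2, 3, 4, 5, 6, 7], [8], [9])
ok = check_split(*toy_split, n_samples=10)
print(ok)
```

With the real data, and assuming `df` is a pandas DataFrame indexed by position, the rows of a split would be selected with something like `df.iloc[train_idx]`.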
This repository focuses on the data splits of the benchmark datasets from the earlier MoleculeNet study.
To date, many researchers have developed deep learning models for molecules. However, I found that these papers use different random seeds to split their datasets under the "Random Split" option, and that they also use different scaffold splitting methods. To provide easy-to-use and unambiguous data splits, I provide here the indices of the training, validation, and test sets for each benchmark dataset.
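The effect of the seed can be illustrated with a small sketch. `random_split` below is a hypothetical helper, not the splitter used by any of the cited papers, and the 80/10/10 ratio is an assumption chosen for illustration.

```python
import random


def random_split(n_samples, seed, frac_train=0.8, frac_valid=0.1):
    """Shuffle positions 0..n_samples-1 with the given seed and cut them into three parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test


# Two seeds give two different test sets for the same dataset size.
_, _, test_a = random_split(100, seed=0)
_, _, test_b = random_split(100, seed=1)
shared = len(set(test_a) & set(test_b))
print(f"test samples shared between seeds: {shared} of {len(test_a)}")
```

Two models evaluated on `test_a` and `test_b` are scored on different molecules, so their metrics are not directly comparable. Fixed, shared indices remove that source of variation.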
The splits follow the literature, namely Wu et al.'s work, except for the ChEMBL dataset.
Note that the ChEMBL dataset originally comes from Mayr et al.'s work, and the processed data comes from Yang et al.'s work. This dataset is split by the scaffold split method; the split indices 0, 1, and 2 are taken from their code repo, chemprop.
I sincerely hope that later research will split the datasets based on these indices, so that results can be compared with each other; otherwise, different data splits may lead to misleading comparisons. For example, Xiong et al. used a different random seed for their random split; details can be seen in their code repo, AttentiveFP.
Lastly, we have also discussed this issue here: deepchem/deepchem#1711
These benchmark datasets and their split indices have been generated in this repo; the following table summarizes the datasets.
task_id | task_name | task_type | n_samples | n_task | split_method | n_cross_split | task_metrics |
---|---|---|---|---|---|---|---|
01 | ESOL | regression | 1128 | 1 | random | 3 | RMSE |
02 | FreeSolv | regression | 642 | 1 | random | 3 | RMSE |
03 | Lipop | regression | 4200 | 1 | random | 3 | RMSE |
04 | PDBbind-full | regression | 9880 | 1 | time | 1 | RMSE |
05 | PDBbind-core | regression | 168 | 1 | time | 1 | RMSE |
06 | PDBbind-refined | regression | 3040 | 1 | time | 1 | RMSE |
07 | PCBA | classification | 437929 | 128 | random | 3 | PRC_AUC |
08 | MUV | classification | 93087 | 17 | random | 3 | PRC_AUC |
09 | HIV | classification | 41127 | 1 | scaffold | 3 | ROC_AUC |
10 | BACE | classification | 1513 | 1 | scaffold | 3 | ROC_AUC |
11 | BBBP | classification | 2050 | 1 | scaffold | 3 | ROC_AUC |
12 | Tox21 | classification | 7831 | 12 | random | 3 | ROC_AUC |
13 | ToxCast | classification | 8597 | 617 | random | 3 | ROC_AUC |
14 | SIDER | classification | 1427 | 27 | random | 3 | ROC_AUC |
15 | ClinTox | classification | 1484 | 2 | random | 3 | ROC_AUC |
16 | ChEMBL | classification | 456331 | 1310 | scaffold | 3 | ROC_AUC |
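For tasks with `n_cross_split` of 3, one common way to report a result is the mean and standard deviation of the task metric over the three splits. The sketch below shows this for RMSE; the `rmse` and `summarize` helpers and the toy predictions are illustrative assumptions, not part of ChemBench or a reporting rule set by this repo.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def summarize(scores):
    """Mean and sample standard deviation of per-split scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Hypothetical (y_true, y_pred) pairs, one per split, standing in for model output.
per_split = [
    ([1.0, 2.0], [1.0, 2.5]),
    ([0.5, 1.5], [0.0, 1.5]),
    ([2.0, 3.0], [2.5, 3.5]),
]
scores = [rmse(t, p) for t, p in per_split]
mean, std = summarize(scores)
print(f"RMSE: {mean:.3f} +/- {std:.3f}")
```

In practice, each pair would come from a model trained on `train_idx`, tuned on `valid_idx`, and evaluated on `test_idx` of the corresponding split.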