MoleculeNet Benchmark Dataset and Split Indices

Installation

git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
# add to PYTHONPATH
echo export PYTHONPATH="\$PYTHONPATH:`pwd`" >> ~/.bashrc
source ~/.bashrc

Usage

from chembench import load_data
df, induces = load_data('ESOL')

# induces holds the three random splits; each entry unpacks to (train_idx, valid_idx, test_idx)
train_idx, valid_idx, test_idx = induces[0]
train_idx, valid_idx, test_idx = induces[1]
train_idx, valid_idx, test_idx = induces[2]
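
Before training, it can be worth checking that a split is well-formed. The following is a minimal sketch, not part of chembench: summarize_split is a hypothetical helper, and it assumes each index collection holds integer row positions into df. The small lists stand in for one entry of induces.

```python
def summarize_split(n_samples, train_idx, valid_idx, test_idx):
    """Report sizes, overlaps, and uncovered rows for one split.

    Hypothetical helper (not part of chembench); assumes the indices are
    integer row positions in the range [0, n_samples).
    """
    train, valid, test = set(train_idx), set(valid_idx), set(test_idx)
    # Rows that appear in more than one of the three sets.
    overlap = (train & valid) | (train & test) | (valid & test)
    # Rows that appear in none of the three sets.
    uncovered = set(range(n_samples)) - (train | valid | test)
    return {
        "n_train": len(train),
        "n_valid": len(valid),
        "n_test": len(test),
        "n_overlap": len(overlap),
        "n_uncovered": len(uncovered),
    }


# Toy stand-in for one entry of `induces` on a 10-row dataset.
toy_split = ([0, 1, 2, 3, 4, 5, 6, 7], [8], [9])
report = summarize_split(10, *toy_split)
print(report)
```

With the real data this would presumably be called as summarize_split(len(df), *induces[0]), assuming df supports len(); a clean split reports zero overlapping rows.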

This repository focuses on the data splits of the benchmark datasets from the earlier MoleculeNet study.

Background

To date, researchers have developed many deep learning models for molecules. However, I found that these papers use different random seeds to split their datasets under the "Random Split" option, and that different scaffold-splitting methods are also in use. To provide easy-to-use, unambiguous data splits, this repo supplies the indices of the training, validation, and test sets for each benchmark dataset.
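
To illustrate why the seed matters, here is a minimal sketch of a seeded random split. It is illustrative only: random_split is a hypothetical helper, and the 80/10/10 fractions are an assumption, not the procedure used to generate the indices in this repo.

```python
import random


def random_split(n_samples, seed, frac_train=0.8, frac_valid=0.1):
    """Shuffle row positions with a given seed and cut them into three parts.

    Hypothetical helper; the default 80/10/10 fractions are an assumption.
    """
    positions = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle reproducible per seed.
    random.Random(seed).shuffle(positions)
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    train = positions[:n_train]
    valid = positions[n_train:n_train + n_valid]
    test = positions[n_train + n_valid:]
    return train, valid, test


# Two "random splits" of the same 1128-row dataset with different seeds.
_, _, test_a = random_split(1128, seed=0)
_, _, test_b = random_split(1128, seed=1)
shared = len(set(test_a) & set(test_b))
print(f"test sizes: {len(test_a)}, {len(test_b)}; shared rows: {shared}")
```

The two test sets have the same size but contain largely different molecules, so scores reported on them are not directly comparable. Sharing fixed indices removes this source of variation.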

The splits follow the literature, namely Wu et al.'s work, except for the ChEMBL dataset.

Note that the ChEMBL dataset originally comes from Mayr et al.'s work, and the processed data comes from Yang et al.'s work. This dataset is split by the scaffold split method; the indices for splits 0, 1, and 2 are taken from their code repo, chemprop.
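
For readers unfamiliar with scaffold splitting, the sketch below shows the general idea: molecules sharing a scaffold are kept in the same subset, so the test set contains scaffolds unseen during training. This is not the chemprop implementation. scaffold_split is a hypothetical helper, the scaffolds are assumed to be precomputed strings (for example, scaffold SMILES produced by a cheminformatics toolkit), and the largest-group-first assignment and 80/10/10 fractions are assumptions.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train, valid, or test.

    Hypothetical helper: `scaffolds[i]` is the scaffold string of row i.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first row index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_samples = len(scaffolds)
    train_cap = frac_train * n_samples
    valid_cap = (frac_train + frac_valid) * n_samples

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy data: four scaffolds of sizes 6, 2, 1, and 1.
toy_scaffolds = ["A"] * 6 + ["B"] * 2 + ["C"] + ["D"]
train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
print(train_idx, valid_idx, test_idx)
```

Because implementations differ in how they order and assign the groups, two "scaffold splits" of the same dataset can disagree, which is another reason to reuse published indices.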

I sincerely hope that later research will split these datasets using these indices so that results can be compared directly; otherwise, different data splits may produce misleading results. For example, Xiong et al. used a different random seed for their random split; details can be found in their code repo, AttentiveFP.

Lastly, we have also discussed this issue here: deepchem/deepchem#1711

Benchmark Datasets

The benchmark datasets and their split indices have been generated in this repo; the following table summarizes them.

task_id	task_name	task_type	n_samples	n_task	split_method	n_cross_split	task_metrics
01	ESOL	regression	1128	1	random	3	RMSE
02	FreeSolv	regression	642	1	random	3	RMSE
03	Lipop	regression	4200	1	random	3	RMSE
04	PDBbind-full	regression	9880	1	time	1	RMSE
05	PDBbind-core	regression	168	1	time	1	RMSE
06	PDBbind-refined	regression	3040	1	time	1	RMSE
07	PCBA	classification	437929	128	random	3	PRC_AUC
08	MUV	classification	93087	17	random	3	PRC_AUC
09	HIV	classification	41127	1	scaffold	3	ROC_AUC
10	BACE	classification	1513	1	scaffold	3	ROC_AUC
11	BBBP	classification	2050	1	scaffold	3	ROC_AUC
12	Tox21	classification	7831	12	random	3	ROC_AUC
13	ToxCast	classification	8597	617	random	3	ROC_AUC
14	SIDER	classification	1427	27	random	3	ROC_AUC
15	ClinTox	classification	1484	2	random	3	ROC_AUC
16	ChEMBL	classification	456331	1310	scaffold	3	ROC_AUC
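
For datasets with n_cross_split of 3, one reasonable way to report a result is the mean and spread of the metric over the three splits. The sketch below does this for RMSE. It is an assumption about reporting practice, not something this repo prescribes: rmse and summarize_scores are hypothetical helpers, and the numbers are made up.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def summarize_scores(scores):
    """Return (mean, sample standard deviation) of per-split scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Made-up (targets, predictions) pairs, one per split.
per_split = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    ([0.0, 0.0], [3.0, 4.0]),
    ([1.0, 1.0], [2.0, 2.0]),
]
scores = [rmse(y_true, y_pred) for y_true, y_pred in per_split]
mean_rmse, std_rmse = summarize_scores(scores)
print(f"RMSE over {len(scores)} splits: {mean_rmse:.3f} +/- {std_rmse:.3f}")
```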
