chembench's Introduction

MoleculeNet Benchmark Dataset and Split Indices

Installation

git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
# add to PYTHONPATH
echo export PYTHONPATH="\$PYTHONPATH:`pwd`" >> ~/.bashrc
source ~/.bashrc

Usage

from chembench import load_data
df, induces = load_data('ESOL')

# get the train/valid/test indices for each of the 3 random splits
train_idx, valid_idx, test_idx = induces[0]
train_idx, valid_idx, test_idx = induces[1]
train_idx, valid_idx, test_idx = induces[2]
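
As a minimal sketch of how these index sets might be applied, the snippet below selects the train, validation, and test rows for every split. It assumes that `df` is a pandas DataFrame and that the indices are positional row numbers; neither is confirmed above. A small stand-in DataFrame with made-up column names and values replaces the real `load_data` call so that the example runs on its own.

```python
import pandas as pd

# Stand-in for `df, induces = load_data('ESOL')`.
# The column names and values here are made up for illustration only.
df = pd.DataFrame({
    "smiles": ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O", "CO"],
    "target": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
induces = [
    ([0, 1, 2, 3], [4], [5]),
    ([5, 4, 3, 2], [1], [0]),
    ([1, 3, 5, 0], [2], [4]),
]

splits = []
for train_idx, valid_idx, test_idx in induces:
    # Sanity check: the three index sets of one split should not overlap.
    assert set(train_idx).isdisjoint(valid_idx)
    assert set(train_idx).isdisjoint(test_idx)
    assert set(valid_idx).isdisjoint(test_idx)

    # Assumption: the indices are positional, so `iloc` is used.
    train_df = df.iloc[train_idx]
    valid_df = df.iloc[valid_idx]
    test_df = df.iloc[test_idx]
    splits.append((train_df, valid_df, test_df))

for i, (train_df, valid_df, test_df) in enumerate(splits):
    print(f"split {i}: train={len(train_df)} valid={len(valid_df)} test={len(test_df)}")
```

If the real indices turn out to be index labels rather than positions, `df.loc[...]` would be the matching accessor.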

This repository focuses on the data splits for the benchmark datasets from the earlier MoleculeNet study.

Background

To date, researchers have developed many deep learning models for molecules. However, I found that these papers use different random seeds to split their datasets under the "Random Split" option, and that they also use different scaffold splitting methods. To provide split results that are easy to use and unambiguous, I provide here the indices of the training, validation, and test sets for each benchmark dataset.
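
To illustrate why the seed matters, here is a small sketch that draws two random splits of the same dataset with different seeds and counts how many test molecules they share. The `random_split` helper and the 80/10/10 ratio are assumptions made for this example, not the procedure used by any of the papers mentioned; 1128 is the ESOL sample count from the summary table.

```python
import random

def random_split(n_samples, seed, frac_train=0.8, frac_valid=0.1):
    """Hypothetical helper: shuffle 0..n_samples-1 with `seed` and cut it in three."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test

n_samples = 1128  # ESOL size, as listed in the summary table

train_a, valid_a, test_a = random_split(n_samples, seed=0)
train_b, valid_b, test_b = random_split(n_samples, seed=1)

# Molecules that land in the test set under both seeds.
shared = set(test_a) & set(test_b)
print(f"test set size: {len(test_a)}")
print(f"test molecules shared by both seeds: {len(shared)}")
```

Two models evaluated on such different test sets are not measured on the same molecules, which is why fixed, shared indices make comparisons easier.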

The splits follow the literature, namely Wu et al.'s work, except for the ChEMBL dataset.

Note that the ChEMBL dataset originally comes from Mayr et al.'s work, and the processed data comes from Yang et al.'s work. This dataset is split with the scaffold split method, and the three sets of split indices (0, 1, 2) are taken from their code repository, chemprop.

I sincerely hope that later research will split these datasets using the indices provided here, so that results can be compared with one another. Otherwise, differences in data splitting may lead to misleading results. For example, Xiong et al. used a different random seed for their random split; details can be found in their code repository, AttentiveFP.

Lastly, we have also discussed this issue here: deepchem/deepchem#1711

Benchmark Datasets

The benchmark datasets and their split indices have been generated in this repo. The following table summarizes the datasets.

task_id  task_name        task_type       n_samples  n_task  split_method  n_cross_split  task_metrics
01       ESOL             regression      1128       1       random        3              RMSE
02       FreeSolv         regression      642        1       random        3              RMSE
03       Lipop            regression      4200       1       random        3              RMSE
04       PDBbind-full     regression      9880       1       time          1              RMSE
05       PDBbind-core     regression      168        1       time          1              RMSE
06       PDBbind-refined  regression      3040       1       time          1              RMSE
07       PCBA             classification  437929     128     random        3              PRC_AUC
08       MUV              classification  93087      17      random        3              PRC_AUC
09       HIV              classification  41127      1       scaffold      3              ROC_AUC
10       BACE             classification  1513       1       scaffold      3              ROC_AUC
11       BBBP             classification  2050       1       scaffold      3              ROC_AUC
12       Tox21            classification  7831       12      random        3              ROC_AUC
13       ToxCast          classification  8597       617     random        3              ROC_AUC
14       SIDER            classification  1427       27      random        3              ROC_AUC
15       ClinTox          classification  1484       2       random        3              ROC_AUC
16       ChEMBL           classification  456331     1310    scaffold      3              ROC_AUC
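
For tasks with three splits and RMSE as the metric, one plausible way to report a result is to compute the RMSE on the test set of each split and then give the mean and standard deviation. The sketch below shows that arithmetic with the standard library only. The per-split targets and predictions are placeholder numbers, and reporting mean and standard deviation is an assumption for this example rather than a protocol stated in this README.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Placeholder (y_true, y_pred) pairs, one per split; not real model output.
per_split_results = [
    ([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]),
    ([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]),
    ([0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]),
]

scores = [rmse(y_true, y_pred) for y_true, y_pred in per_split_results]
mean_rmse = statistics.mean(scores)
std_rmse = statistics.stdev(scores)  # sample standard deviation over the splits

print(f"RMSE per split: {scores}")
print(f"RMSE: {mean_rmse:.3f} +/- {std_rmse:.3f}")
```

For the tasks with a single time split, only one score is available, so a standard deviation over splits does not apply.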

chembench's People

Contributors

shenwanxiang
