chembench's Introduction

In case you would like to cite this:

DOI

1. MolMapNet Dataset

  • The following datasets are reported in the paper "Out-of-the-Box Deep Learning Prediction of Pharmaceutical Properties by Broadly Learned Knowledge-Based Molecular Representations". Please refer to that paper for details of these datasets.
| Data Class | Dataset | No. of Molecules | No. of Tasks | Task Metric | Task Type |
|---|---|---|---|---|---|
| Physico-chemical | ESOL Water solubility | 1128 | 1 | RMSE | Regression |
| | FreeSolv Solvation free energy | 642 | 1 | RMSE | Regression |
| | Lipop Lipophilicity | 4200 | 1 | RMSE | Regression |
| Molecular binding | PDBbind-F, PDBbind-C, PDBbind-R Ligand-protein binding: full, core, refined (3 datasets) | 9880, 168, 3040 | 1 for each | RMSE | Regression |
| Bio-activity | PCBA PubChem HTS bioAssay | 437929 | 128 | PRC_AUC | Classification |
| | MUV PubChem bioAssay | 93087 | 17 | PRC_AUC | Classification |
| | ChEMBL bioassay activity dataset | 456331 | 1310 | ROC_AUC | Classification |
| | Cancer cell-line IC50 A2780, CCRF-CEM, DU-145, HCT-15, KB, LoVo, PC-3, SK-OV-3 (8 datasets) | 2255, 3047, 2512, 994, 2731, 1120, 4294, 1589 | 1 for each | R2 | Regression |
| | Malaria Anti-malarial EC50 | 9998 | 1 | RMSE | Regression |
| | BACE-1 benchmark set, ChEMBL novel set, ChEMBL common set, Clinical drugs | 1513, 395, 5324, 26 | 1 | ROC_AUC | Classification |
| | HIV replication inhibition | 41127 | 1 | ROC_AUC | Classification |
| Toxicity | Tox21 Toxicology in the 21st century | 7831 | 12 | ROC_AUC | Classification |
| | SIDER Adverse drug reactions of marketed drugs | 1427 | 27 | ROC_AUC | Classification |
| | ClinTox Clinical trial toxicity | 1478 | 2 | ROC_AUC | Classification |
| Pharmacokinetic | CYP PubChem BioAssay CYP 1A2, 2C9, 2C19, 2D6, 3A4 inhibition | 16896 | 5 | ROC_AUC | Classification |
| | LMC-H, LMC-R, LMC-M (Liver Microsomal Clearance in human, rat, mouse) | 8755 | 3 | R2 | Regression |
| | BBBP Blood-brain barrier penetration | 2039 | 1 | ROC_AUC | Classification |

2. Benchmark Datasets in MolNet and Chemprop

These benchmark datasets and their split indices have been generated in this repo. The following table summarizes them.

| task_id | task_name | task_type | n_samples | n_task | split_method | n_cross_split | task_metrics |
|---|---|---|---|---|---|---|---|
| 01 | ESOL | regression | 1128 | 1 | random | 3 | RMSE |
| 02 | FreeSolv | regression | 642 | 1 | random | 3 | RMSE |
| 03 | Lipop | regression | 4200 | 1 | random | 3 | RMSE |
| 04 | PDBbind-full | regression | 9880 | 1 | time | 1 | RMSE |
| 05 | PDBbind-core | regression | 168 | 1 | time | 1 | RMSE |
| 06 | PDBbind-refined | regression | 3040 | 1 | time | 1 | RMSE |
| 07 | PCBA | classification | 437929 | 128 | random | 3 | PRC_AUC |
| 08 | MUV | classification | 93087 | 17 | random | 3 | PRC_AUC |
| 09 | HIV | classification | 41127 | 1 | scaffold | 3 | ROC_AUC |
| 10 | BACE | classification | 1513 | 1 | scaffold | 3 | ROC_AUC |
| 11 | BBBP | classification | 2039 | 1 | scaffold | 3 | ROC_AUC |
| 12 | Tox21 | classification | 7831 | 12 | random | 3 | ROC_AUC |
| 13 | ToxCast | classification | 8576 | 617 | random | 3 | ROC_AUC |
| 14 | SIDER | classification | 1427 | 27 | random | 3 | ROC_AUC |
| 15 | ClinTox | classification | 1478 | 2 | random | 3 | ROC_AUC |
| 16 | ChEMBL | classification | 456331 | 1310 | scaffold | 3 | ROC_AUC |
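
The n_cross_split column gives the number of repeated splits per task. The README does not say how scores from those repeats should be combined, so the following is only a minimal sketch that assumes the common convention of reporting the mean and standard deviation of the task metric. The helper names `rmse` and `summarize_scores` are hypothetical, and the numbers are toy values rather than benchmark results.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def summarize_scores(scores):
    """Return (mean, sample standard deviation) over per-split scores."""
    mean = statistics.mean(scores)
    # A single split has no spread, so report 0.0 instead of raising.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Toy targets and predictions standing in for three test splits.
per_split = [
    rmse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),
    rmse([0.5, 1.5], [0.5, 1.5]),
    rmse([2.0, 4.0], [3.0, 5.0]),
]
mean, std = summarize_scores(per_split)
print(f"RMSE: {mean:.3f} +/- {std:.3f}")
```

For tasks with n_cross_split equal to 1 (the time-split PDBbind sets), the sketch reports a standard deviation of 0.0.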

Installation

Direct installation:

pip install git+https://github.com/shenwanxiang/ChemBench.git

Developer installation:

git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
pip install -e .

Usage-1: Load the Dataset and MoleculeNet's Split Indices

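from chembench import load_data

# induces holds one (train, valid, test) index triple per repeated split
df, induces = load_data('ESOL')

# ESOL has 3 random splits; unpack each one by position
train_idx, valid_idx, test_idx = induces[0]  # first split
train_idx, valid_idx, test_idx = induces[1]  # second split
train_idx, valid_idx, test_idx = induces[2]  # third split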
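
Before training on a split, it can help to confirm that its three index sets do not overlap. The sketch below assumes that each entry of `induces` is a triple of integer row positions, as the unpacking above suggests. `check_split` is a hypothetical helper that is not part of chembench, and `toy_induces` is a made-up stand-in so the example runs without the package.

```python
def check_split(n_samples, train_idx, valid_idx, test_idx):
    """Raise ValueError if the index sets overlap or fall out of range.

    Returns the number of rows covered by the split.
    """
    parts = {
        "train": set(train_idx),
        "valid": set(valid_idx),
        "test": set(test_idx),
    }
    for name, idx in parts.items():
        bad = sorted(i for i in idx if not 0 <= i < n_samples)
        if bad:
            raise ValueError(f"{name} has out-of-range positions: {bad[:5]}")
    if (parts["train"] & parts["valid"]
            or parts["train"] & parts["test"]
            or parts["valid"] & parts["test"]):
        raise ValueError("train/valid/test index sets overlap")
    return sum(len(idx) for idx in parts.values())


# Toy stand-in for the `induces` list (hypothetical values).
toy_induces = [
    ([0, 1, 2, 3, 4, 5], [6, 7], [8, 9]),
    ([2, 3, 4, 5, 6, 7], [8, 9], [0, 1]),
]
for train_idx, valid_idx, test_idx in toy_induces:
    covered = check_split(10, train_idx, valid_idx, test_idx)
    print(f"split covers {covered} of 10 rows")
```

With real data, `len(df)` would be passed as `n_samples`.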

Usage-2: Load Dataset As Data Object

from chembench import dataset
data = dataset.load_ESOL()
data.x
data.y
data.description


# regression datasets
dataset.load_Lipop()
dataset.load_ESOL()
dataset.load_FreeSolv()
dataset.load_Malaria()
dataset.load_LMC()
dataset.load_PDBF()
dataset.load_PDBC()
dataset.load_PDBR()


# classification datasets
dataset.load_BBBP()
dataset.load_BACE()
dataset.load_HIV()
dataset.load_MUV()
dataset.load_Tox21()
dataset.load_SIDER()
dataset.load_CYP450()
dataset.load_ToxCast()
dataset.load_ClinTox()
dataset.load_ChEMBL()
dataset.load_PCBA()
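
Because every loader above follows the `load_<Name>` naming pattern, several datasets can be loaded in a loop. This is a minimal sketch: `load_many` is a hypothetical helper, and `fake_dataset` is a made-up stand-in for `chembench.dataset` so the example runs without the package installed. The toy `x`, `y`, and `description` values are illustrative only and do not reflect the real data.

```python
from types import SimpleNamespace


def load_many(module, names):
    """Call module.load_<name>() for every name and collect the results."""
    loaded = {}
    for name in names:
        loader = getattr(module, f"load_{name}", None)
        if loader is None:
            raise KeyError(f"no loader named load_{name}")
        loaded[name] = loader()
    return loaded


# Stand-in for `chembench.dataset` (hypothetical toy objects).
fake_dataset = SimpleNamespace(
    load_ESOL=lambda: SimpleNamespace(x=["CCO"], y=[[-0.77]], description="toy ESOL"),
    load_Lipop=lambda: SimpleNamespace(x=["c1ccccc1"], y=[[2.1]], description="toy Lipop"),
)

for name, data in load_many(fake_dataset, ["ESOL", "Lipop"]).items():
    print(name, len(data.x), data.description)
```

With the package installed, passing `dataset` (from `from chembench import dataset`) in place of `fake_dataset` should behave the same way, assuming the loaders take no required arguments as shown above.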

Usage-3: Load Cluster Splits

The cluster split results are available in this repo. For example, to load the cluster splits and random splits for the ESOL dataset:

from chembench import get_cluster_induces
induces1 = get_cluster_induces("ESOL", induces="random_5fcv_5rpts")
induces2 = get_cluster_induces("ESOL", induces="scaffold_5fcv_1rpts")
print(len(induces1))
print(len(induces2))

For example, the chemical space of the ESOL dataset under the 5-fold cluster split: [Figure: ESOL split chemical space]

The Kolmogorov-Smirnov statistic on the distributions of the pairwise groups (clusters): [Figure: ESOL split distribution test]
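
For reference, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distribution functions of two samples. The sketch below computes it for every pair of groups using only the standard library. The README does not state which variable is compared across clusters, so the group values here are made-up toy numbers, and `ks_statistic` is a hypothetical helper rather than the code used to produce the figure.

```python
from bisect import bisect_right
from itertools import combinations


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    if len(sample_a) == 0 or len(sample_b) == 0:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The gap can only change at observed values, so check each of them.
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy target values for three groups (hypothetical numbers).
groups = {
    "fold_0": [-3.1, -2.4, -1.9, -0.5],
    "fold_1": [-2.8, -2.0, -1.1, -0.2],
    "fold_2": [-4.0, -3.5, -3.0, -2.6],
}
for (name_a, vals_a), (name_b, vals_b) in combinations(groups.items(), 2):
    print(name_a, name_b, round(ks_statistic(vals_a, vals_b), 3))
```

A value near 0 means the two groups have similar distributions, and a value of 1 means they do not overlap at all. `scipy.stats.ks_2samp` computes the same statistic together with a p-value.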

Making a Release

After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:

$ tox -e finish

This script does the following:

  1. Uses BumpVersion to switch the version number in setup.cfg and src/chembench/version.py so that it no longer has the -dev suffix
  2. Packages the code as both a tar archive and a wheel
  3. Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
  4. Pushes to GitHub. You'll need to make a release that goes with the commit where the version was bumped.
  5. Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can run tox -e bumpversion minor afterwards.
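
The version transitions in steps 1 and 5 can be illustrated with a small sketch. It does not call BumpVersion; it only mimics the string changes, assuming a `major.minor.patch` scheme in which the `-dev` suffix returns after the post-release bump (step 1 implies this, but the exact configuration lives in setup.cfg). The function names and version numbers are hypothetical.

```python
def release_version(version):
    """Drop a trailing '-dev' suffix, e.g. '0.1.2-dev' -> '0.1.2'."""
    suffix = "-dev"
    return version[: -len(suffix)] if version.endswith(suffix) else version


def next_dev_version(version, part="patch"):
    """Bump the given part and re-attach the '-dev' suffix."""
    major, minor, patch = (int(p) for p in release_version(version).split("."))
    if part == "patch":
        patch += 1
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:
        raise ValueError(f"unsupported part: {part}")
    return f"{major}.{minor}.{patch}-dev"


current = "0.1.2-dev"                             # hypothetical starting version
released = release_version(current)               # step 1
after_release = next_dev_version(released)        # step 5
after_minor = next_dev_version(after_release, "minor")  # optional minor bump
print(current, released, after_release, after_minor)
```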

chembench's People

Contributors

cthoyt, shenwanxiang

chembench's Issues

Make code pip-installable

It would be really helpful to downstream users if the code in this package could be installed with pip. Would you be willing to accept a PR to enable this?

Installation not clear

Could a setup.py file be added? I added the path to PYTHONPATH in bash, but Python still cannot resolve

from chembench import load_data

Any ideas?

[UPDATE]: This issue can be deleted. I fixed it on my end.

Deploy ChemBench to PyPI

There are a couple of things needed to deploy to PyPI so that people can easily get your software with pip install chembench:

  1. Improve how versioning is done. I often use BumpVersion
  2. Add a tox.ini configuration for tox for automated usage of BumpVersion and twine for deployment to PyPI
  3. Add documentation on how to use this in the readme

I'll make a PR for these, then we can discuss a bit on how they work before merging.

@shenwanxiang you should register for an account on https://pypi.org/ because you'll need these credentials when you run the script

[Maintenance] Memory Leakage

Some of the CSV files are not well suited for use. Some features leak data that is incorrectly represented among other molecules. Could you clean these up and re-upload them, guaranteeing a float32 or float16 datatype instead of float64?

Datasets with cluster split return the same splits for all three random seeds

As in the title, the datasets with cluster splitting return the same indices for all three random seeds, specifically BACE, BBBP, ChEMBL, and HIV. Could you have a look and make sure that the random seed has some effect on cluster splits as well, to make the benchmark more robust (and update the splits afterwards)? Thanks!
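
One way to check for this symptom is to count how many of the returned splits actually differ. The sketch below assumes the splits arrive as a list of (train, valid, test) index triples, as in the `load_data` usage above. `count_distinct_splits` is a hypothetical helper, and the toy lists stand in for real output.

```python
def count_distinct_splits(induces):
    """Count how many (train, valid, test) triples differ as index sets."""
    seen = set()
    for train_idx, valid_idx, test_idx in induces:
        # Compare as sets so that ordering inside a part does not matter.
        key = tuple(
            frozenset(int(i) for i in part)
            for part in (train_idx, valid_idx, test_idx)
        )
        seen.add(key)
    return len(seen)


# Toy example: three "seeds" that all return the same split.
same = [([0, 1, 2], [3], [4])] * 3
# Toy example: three seeds that return different splits.
varied = [([0, 1, 2], [3], [4]), ([1, 2, 3], [4], [0]), ([2, 3, 4], [0], [1])]

print(count_distinct_splits(same))
print(count_distinct_splits(varied))
```

A result of 1 for a dataset with three repeats would reproduce the behavior reported here.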
