ibm / pychemex

Python library for Cheminformatics ML model explainability

License: Apache License 2.0

pychemex's Introduction

PyChemEx

Sample data

Melting point data mined from patents is used in this project to demonstrate the functionality. The dataset was derived from the following source:


Title	The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS.
Authors	Tetko, I.V.; Williams, A.J.; Lowe, D.;
Journal reference	Journal of cheminformatics, 2016; 8 (); 2
Dataset name	patents_ochem_enamine_bradley_begstrom_training
URL	https://ochem.eu/article/99826

pychemex's Issues

Source sample data

Find an open-source sample dataset to be used for development and testing.

H7: How do specific fragments contribute to the predicted property?

Select a molecule ->
(1) Identify all MMPs (matched molecular pairs) based on the most common transformations. Compare the predicted difference and the actual difference across the MMPs.
(2) Identify MMPs based on the molecule to "neutral" fragment transformation. Predict the target property for the identified molecules. Use the difference as an estimate of how much that fragment contributes to the prediction.
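
A minimal sketch of approach (2), assuming a trained model is available as a callable and that the "neutral" analog of the molecule has already been identified. The function name, the `predict` callable, and the toy prediction values are all hypothetical and are not part of PyChemEx.

```python
from typing import Callable, Dict


def fragment_contribution(
    predict: Callable[[str], float], molecule: str, neutral_analog: str
) -> float:
    """Predicted property of the molecule minus that of its neutral analog."""
    return predict(molecule) - predict(neutral_analog)


# Toy stand-in for a trained model: a lookup table of made-up predictions
# keyed by SMILES. A real model would be called here instead.
toy_predictions: Dict[str, float] = {"c1ccccc1O": 40.0, "c1ccccc1": 10.0}

contribution = fragment_contribution(
    toy_predictions.__getitem__, "c1ccccc1O", "c1ccccc1"
)
print(contribution)
```

The result is only an estimate of the fragment's effect on the prediction; it says nothing about whether the prediction itself is accurate.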

H8: How similar are other molecules?

Building on H4, investigate the following:
1. How many molecules fall within X distance (in feature space) of the selected molecule?
Maybe X is a function of number of features? Or a constant.
Explanation: this could be used as a measure of how well a specific area of chem space is represented, at least based on the selected feature set.
(To discuss: what is chem-space? Is it a function of molecules and the features selected? E.g. even if we have all the molecules but we only count the number of carbons as a feature, we're not really capturing a wide chem space)

2. What is the variance of the target values for molecules within the radius (defined above)?
Explanation: high variance among similar molecules could indicate either an activity-cliff-like effect (if localised) or features that do not capture the underlying science well enough to differentiate between molecules.
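
Both questions above can be sketched together. This is an illustration only: it assumes Euclidean distance on unscaled feature vectors and a constant radius X, neither of which has been decided in this issue, and the function name and toy data are hypothetical.

```python
import math
from statistics import pvariance


def neighbourhood_stats(query, features, targets, radius):
    """Count molecules within `radius` of `query` and return the
    population variance of their target values."""
    nearby = [
        target
        for vector, target in zip(features, targets)
        if math.dist(query, vector) <= radius
    ]
    # Variance is undefined for fewer than two points; report 0.0 instead.
    variance = pvariance(nearby) if len(nearby) > 1 else 0.0
    return len(nearby), variance


# Made-up two-feature dataset with made-up target values.
toy_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
toy_targets = [100.0, 110.0, 90.0, 300.0]

count, variance = neighbourhood_stats(
    (0.0, 0.0), toy_features, toy_targets, radius=1.5
)
print(count, variance)
```

Note that the selected molecule counts as its own neighbour when it is part of the dataset, as it is here.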

Calculate a lot of features for the sample dataset

Which features to calculate is still to be decided. We will need to look at MP (melting point) prediction papers and at what is freely and easily available.
Try grouping the features into categories based on what they intend to capture, so that specific groups of features can be "turned off" to show the functionality of the library.

Will probably split this into multiple issues later.

H6: How well are specific transformations predicted?

Select an MMP transformation -> compare the predicted vs actual difference across the MMPs with the selected transformation.
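
One possible way to make this comparison concrete is to group the pairs by transformation and average the absolute gap between predicted and actual differences. The tuple layout, the transformation labels, and the numbers below are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def delta_error_by_transformation(pairs):
    """pairs: iterable of (transformation, actual_a, actual_b, pred_a, pred_b).

    Returns {transformation: mean absolute error between the predicted
    and the actual property difference across its pairs}.
    """
    errors = defaultdict(list)
    for transformation, actual_a, actual_b, pred_a, pred_b in pairs:
        actual_delta = actual_b - actual_a
        predicted_delta = pred_b - pred_a
        errors[transformation].append(abs(predicted_delta - actual_delta))
    return {name: mean(values) for name, values in errors.items()}


# Made-up matched pairs; labels and values are illustrative only.
toy_pairs = [
    ("H>>F", 100.0, 110.0, 98.0, 112.0),   # actual +10, predicted +14
    ("H>>F", 50.0, 55.0, 52.0, 56.0),      # actual +5, predicted +4
    ("H>>OH", 80.0, 120.0, 85.0, 115.0),   # actual +40, predicted +30
]

summary = delta_error_by_transformation(toy_pairs)
print(summary)
```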

Summarise key findings from the PoC

Gather everything into one place and evaluate the state of the project. See which components work and which don't. Have an internal discussion regarding improvements and new ideas.

Train a set of models

Set up an ML pipeline to train a set of ML models based on the sets of features calculated.
Details to be confirmed later.

H9: Fragment Based Descriptor Contribution

The Stream of Consciousness

Wouldn't it be useful to know what part of the molecule contributes to a given descriptor value?

Possible solution?

Could we run descriptors over each fragment to figure out what their contribution would be? Granted, this won't work for everything.
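
A rough sketch of the idea, restricted to the case where it clearly works: an additive descriptor, where the whole-molecule value is the sum of the fragment values. Heavy-atom count is used as a toy descriptor, and fragments are represented as plain lists of element symbols; both choices, and all names below, are assumptions for illustration.

```python
def heavy_atom_count(atoms):
    """Toy additive descriptor: number of non-hydrogen atoms."""
    return sum(1 for symbol in atoms if symbol != "H")


def fragment_shares(fragments, descriptor):
    """Return each fragment's share of the summed descriptor value.

    Only meaningful for additive descriptors, where the sum over
    fragments equals the whole-molecule value.
    """
    per_fragment = {name: descriptor(atoms) for name, atoms in fragments.items()}
    total = sum(per_fragment.values())
    return {name: value / total for name, value in per_fragment.items()}


# Phenol split into two hypothetical fragments.
toy_fragments = {
    "phenyl": ["C"] * 6 + ["H"] * 5,
    "hydroxyl": ["O", "H"],
}

shares = fragment_shares(toy_fragments, heavy_atom_count)
print(shares)
```

Non-additive descriptors (for example, ones that depend on the whole-molecule topology) would need a different treatment, which matches the caveat above.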

H3: Descriptor to fragment structure

Select feature -> (1) get example molecules from different parts of the distribution; (2) find the most common fragments in a specific part of the distribution.
Select fragment -> how widely distributed are molecules with the specific fragment? (I.e. if, for a descriptor, molecules with the given fragment are only found in one part of the distribution, it suggests that the descriptor is related to that fragment.)
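
The "select fragment" check could be sketched by splitting the molecules into equal-count bins by descriptor rank and counting the fragment-bearing molecules in each bin. The binning scheme, the function name, and the toy data are assumptions, not a decided design.

```python
def fragment_bin_counts(values, has_fragment, n_bins=4):
    """Rank molecules by descriptor value, split the ranking into
    `n_bins` equal-count bins, and count fragment-bearing molecules per bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    counts = [0] * n_bins
    for rank, index in enumerate(order):
        bin_index = rank * n_bins // len(values)
        if has_fragment[index]:
            counts[bin_index] += 1
    return counts


# Made-up descriptor values and fragment flags for eight molecules.
toy_values = [0.1, 0.4, 0.2, 0.9, 0.8, 0.3, 0.7, 0.6]
toy_has_fragment = [False, False, False, True, True, False, True, False]

counts = fragment_bin_counts(toy_values, toy_has_fragment)
print(counts)
```

Counts concentrated in one or two bins would suggest that the descriptor is related to the fragment; an even spread would suggest it is not.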

H5: How well are molecules with specific fragments predicted?

Select fragment -> compare the distribution of prediction error for molecules with the fragment vs the general population.
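
As a starting point, the two error distributions can be summarised by their mean absolute error. This sketch assumes absolute error as the error measure and uses made-up values; the function name is hypothetical.

```python
from statistics import mean


def error_summary(actual, predicted, has_fragment):
    """Mean absolute error for fragment-bearing molecules and for everyone."""
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    with_fragment = [e for e, flag in zip(abs_errors, has_fragment) if flag]
    return {
        "fragment_mae": mean(with_fragment),
        "population_mae": mean(abs_errors),
    }


# Made-up actual and predicted values for four molecules.
toy_actual = [100.0, 120.0, 80.0, 150.0]
toy_predicted = [105.0, 110.0, 82.0, 130.0]
toy_has_fragment = [False, True, False, True]

result = error_summary(toy_actual, toy_predicted, toy_has_fragment)
print(result)
```

Here the general population includes the fragment-bearing molecules; comparing against the complement instead is an equally reasonable choice.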

H2: How do molecules with this fragment compare?

Select a feature, select a fragment -> compare the distribution of the feature between molecules with the fragment vs the general population.
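
One simple number for this comparison is how far the fragment subgroup's mean sits from the population mean, in units of the population standard deviation. This measure is a suggestion rather than something specified in the issue, and the data is made up.

```python
from statistics import mean, pstdev


def standardised_shift(values, has_fragment):
    """(mean of fragment subgroup - population mean) / population std dev."""
    with_fragment = [v for v, flag in zip(values, has_fragment) if flag]
    return (mean(with_fragment) - mean(values)) / pstdev(values)


# Made-up feature values; the last three molecules carry the fragment.
toy_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_has_fragment = [False, False, False, True, True, True]

shift = standardised_shift(toy_values, toy_has_fragment)
print(round(shift, 3))
```

A single number hides the shape of the distribution, so a histogram or box plot of the two groups would still be worth showing alongside it.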

H10: What is your Chemical Space?

Problem

It's always annoying as a scientist when you are not sure exactly what kind of chemical space your data covers. Typically people will say to you:

But how do you know you're covering the correct chemical space?

Idea

A set of simple descriptors should be used to describe the possible chemical space and the space your data occupies.

Current suggestions:
Molecular Weight, Fragment Types

Not sure how best to present this information.
% complete? I.e. of the range of fragments available, your dataset covers x% as a whole, and the least/most frequent fragments have the following counts....
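
The "% complete" idea could look roughly like this, assuming a reference list of fragment types has been agreed on and that each molecule has already been reduced to the set of fragment types it contains. The fragment names and data below are hypothetical.

```python
from collections import Counter


def fragment_coverage(reference_fragments, dataset_fragments):
    """Return (percent of reference fragment types seen in the dataset,
    Counter of how many molecules contain each reference fragment type)."""
    counts = Counter(
        fragment
        for molecule_fragments in dataset_fragments
        for fragment in molecule_fragments
        if fragment in reference_fragments
    )
    covered = sum(1 for fragment in reference_fragments if counts[fragment] > 0)
    percent = 100.0 * covered / len(reference_fragments)
    return percent, counts


# Hypothetical reference space and a three-molecule dataset.
toy_reference = {"phenyl", "hydroxyl", "amine", "nitro"}
toy_dataset = [{"phenyl", "hydroxyl"}, {"phenyl"}, {"amine", "phenyl"}]

percent, counts = fragment_coverage(toy_reference, toy_dataset)
print(percent, counts.most_common(1))
```

The same pattern would apply to molecular weight by replacing fragment types with weight ranges.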

Categorise Descriptors

Splitting #5 into two tickets.

Come up with a categorisation of descriptors.
Come up with a clever way to store these categorisations.
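
One simple storage option is a plain mapping from category name to descriptor names, which can be serialised to JSON and used to switch whole categories off. The category and descriptor names below are placeholders, not a proposed categorisation.

```python
# Placeholder categories and descriptor names, for illustration only.
DESCRIPTOR_CATEGORIES = {
    "size": ["molecular_weight", "heavy_atom_count"],
    "polarity": ["tpsa", "logp"],
    "hydrogen_bonding": ["h_bond_donors", "h_bond_acceptors"],
}


def active_descriptors(categories, disabled=()):
    """Flatten the mapping into a list of descriptor names,
    skipping every category listed in `disabled`."""
    return [
        name
        for category, names in categories.items()
        if category not in disabled
        for name in names
    ]


selected = active_descriptors(DESCRIPTOR_CATEGORIES, disabled={"polarity"})
print(selected)
```

A flat mapping assumes each descriptor belongs to exactly one category; descriptors that span several categories would need a different structure.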

Create a list of hypotheses for PoC 1

A set of feature and performance explainability hypotheses to test.

Check out AIX 360 and see how easy it is to use

https://github.com/Trusted-AI/AIX360

See how easy it is to use.
Check what the required inputs are.
Judge whether it's worth integrating with it.

H4: Any similar molecules / fragments?

For a given set of descriptors and a selected molecule, what are the most similar molecules? (based on distance in selected feature space).
Something similar but fragment-based? E.g. take the average descriptor value for molecules with the fragment and compare it to the averages for molecules with other fragments.
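
The molecule-level question can be sketched as a brute-force nearest-neighbour search. Euclidean distance on unscaled features is an assumption here, and the molecule identifiers and feature vectors are made up.

```python
import math


def most_similar(query_id, features, k=3):
    """Return the ids of the `k` molecules closest to `query_id`
    in the given feature space, nearest first."""
    query = features[query_id]
    distances = [
        (math.dist(query, vector), molecule_id)
        for molecule_id, vector in features.items()
        if molecule_id != query_id
    ]
    return [molecule_id for _, molecule_id in sorted(distances)[:k]]


# Made-up two-descriptor feature space.
toy_features = {
    "mol_a": (0.0, 0.0),
    "mol_b": (1.0, 0.0),
    "mol_c": (3.0, 4.0),
    "mol_d": (0.5, 0.5),
}

neighbours = most_similar("mol_a", toy_features, k=2)
print(neighbours)
```

For the fragment-based variant, the same function could be applied to per-fragment average descriptor vectors instead of per-molecule vectors.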