
pychemex's Introduction

PyChemEx

Python library for Cheminformatics ML model explainability

Sample data

Melting point data mined from patents is used in this project to demonstrate the functionality. The dataset was derived from the following source.

Title: The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS.
Authors: Tetko, I.V.; Williams, A.J.; Lowe, D.
Journal reference: Journal of cheminformatics, 2016; 8 (); 2
Dataset name: patents_ochem_enamine_bradley_begstrom_training
URL: https://ochem.eu/article/99826

pychemex's People

Contributors

alex-amc, jjanowiak, stevemar

pychemex's Issues

Source sample data

Find an open-source sample dataset to be used for development and testing.

H7: How do specific fragments contribute to the predicted property?

Select a molecule ->
(1) Identify all matched molecular pairs (MMPs) based on the most common transformations. Compare the predicted difference and the actual difference across the MMPs.
(2) Identify MMPs based on the molecule to "neutral" fragment transformation. Predict the target property for the identified molecules. Use the difference as an estimate of how much that fragment contributes to the prediction.
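
Option (2) could be sketched roughly as follows. This is only an illustration: `fragment_contribution`, the `predict` callable and the toy molecule labels and values are all hypothetical, and finding the matched pairs themselves is assumed to be done elsewhere.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple


def fragment_contribution(
    pairs: Iterable[Tuple[str, str]],
    predict: Callable[[str], float],
) -> float:
    """Average predicted change when a fragment replaces the neutral one.

    Each pair is (molecule_with_fragment, molecule_with_neutral_fragment).
    """
    deltas = [predict(with_frag) - predict(neutral) for with_frag, neutral in pairs]
    if not deltas:
        raise ValueError("no matched pairs supplied")
    return mean(deltas)


# Toy stand-in for a trained model: made-up predictions keyed by molecule label.
toy_predictions = {"A-OH": 10.0, "A-H": 4.0, "B-OH": 7.0, "B-H": 3.0}
contribution = fragment_contribution(
    [("A-OH", "A-H"), ("B-OH", "B-H")],
    toy_predictions.__getitem__,
)
```

Averaging over several pairs is one possible choice; looking at the spread of the individual differences may be just as informative.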

H8: How similar are other molecules?

Building on H4, investigate the following:
1. How many molecules fall within X distance (in feature space) of the selected molecule?
Maybe X is a function of number of features? Or a constant.
Explanation: this could be used as a measure of how well a specific area of chem space is represented, at least based on the selected feature set.
(To discuss: what is chem-space? Is it a function of molecules and the features selected? E.g. even if we have all the molecules but we only count the number of carbons as a feature, we're not really capturing a wide chem space)

2. What is the variance of the target values for molecules within the radius (defined above)?
Explanation: high variance among similar molecules could indicate either an activity-cliff-like effect, if localised, or features that do not capture enough of the underlying science to differentiate between molecules.
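
Both questions can be answered in a single pass over the dataset. The sketch below assumes Euclidean distance on unscaled feature vectors and a constant radius X; the function name `neighbourhood_stats` is hypothetical.

```python
from math import dist
from statistics import pvariance


def neighbourhood_stats(query, features, targets, radius):
    """Return (count, variance of targets) for molecules within `radius` of `query`.

    If the query molecule is itself part of `features`, it is counted too.
    """
    nearby = [t for f, t in zip(features, targets) if dist(query, f) <= radius]
    variance = pvariance(nearby) if nearby else None
    return len(nearby), variance


# Toy feature vectors and target values, for illustration only.
features = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
targets = [1.0, 3.0, 10.0]
count, variance = neighbourhood_stats((0.0, 0.0), features, targets, radius=1.5)
```

If X should depend on the number of features, the radius could be computed from `len(query)` before the call.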

Calculate a lot of features for the sample dataset

Which features to calculate is still to be decided. This will need a look at melting point (MP) prediction papers and at what is freely and easily available.
Try grouping the features into categories based on what they are intended to capture, so that specific feature groups can be "turned off" to show the functionality of the library.

Will probably split this into multiple issues later.

Summarise key findings from the PoC

Gather everything into one place and evaluate the state of the project. See which components work and which don't. Have an internal discussion regarding improvements and new ideas.

Train a set of models

Set up an ML pipeline to train a set of ML models based on the sets of features calculated.
Details to be confirmed later.

H9: Fragment Based Descriptor Contribution

The Stream of Consciousness

  • Wouldn't it be useful to know which part of the molecule contributes to a given descriptor value?

Possible solution?

  • Could we run descriptors over each fragment to figure out what their contribution would be? Granted, this won't work for everything.

H3: Descriptor to fragment structure

Select feature -> (1) get example molecules from different parts of the distribution (2) find the most common fragments in a specific part of the distribution
Select fragment -> how widely distributed are molecules with the specific fragment? (i.e. if, for a descriptor, molecules with the given fragment are only found in one part of the distribution, it suggests that the descriptor is related to that fragment)
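
One way to look at this is to bin the descriptor values and count fragments per bin. The sketch below assumes equal-width bins and that each molecule already comes with a set of fragment labels; `fragments_by_bin` and the toy labels are hypothetical.

```python
from collections import Counter


def fragments_by_bin(values, fragment_sets, n_bins=4):
    """Count fragment occurrences in equal-width bins of a descriptor."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [Counter() for _ in range(n_bins)]
    for value, fragments in zip(values, fragment_sets):
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index].update(fragments)
    return counts


# Toy descriptor values and fragment labels, for illustration only.
values = [0.0, 1.0, 9.0, 10.0]
fragment_sets = [{"OH"}, {"OH"}, {"NO2"}, {"NO2", "OH"}]
bins = fragments_by_bin(values, fragment_sets, n_bins=2)
```

`bins[i].most_common()` then lists the most common fragments in bin `i`, and a fragment that appears in only one bin is a candidate for being related to the descriptor.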

H10: What is your Chemical Space?

Problem

It's always annoying as a scientist when you are not sure exactly what kind of chemical space your data covers. Typically people will say to you:

But how do you know you're covering the correct chemical space?

Idea

A set of simple descriptors should be used to describe the possible chemical space and the space your data occupies.

Current suggestions:
Molecular Weight, Fragment Types

Not sure how best to present this information.
% complete? I.e. of the range of fragments available, your dataset covers x% as a whole, and the least/most common counts are as follows....
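
The "% complete" idea could look something like the sketch below. It assumes a reference list of possible fragment types is available and that each molecule is described by a set of fragment labels; `fragment_coverage` and the labels are hypothetical.

```python
from collections import Counter


def fragment_coverage(dataset_fragments, reference_fragments):
    """Return (percent of reference fragments seen, fragments ranked by count)."""
    reference = set(reference_fragments)
    counts = Counter(
        fragment
        for fragments in dataset_fragments
        for fragment in fragments
        if fragment in reference  # ignore fragments outside the reference list
    )
    percent = 100.0 * len(counts) / len(reference) if reference else 0.0
    return percent, counts.most_common()


# Toy data, for illustration only.
dataset = [{"OH", "NH2"}, {"OH"}, {"Cl"}]
reference = ["OH", "NH2", "COOH", "NO2"]
percent, ranked = fragment_coverage(dataset, reference)
```

The first and last entries of `ranked` give the most and least common covered fragments; reference fragments with no occurrences are simply absent from it.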

Categorise Descriptors

Splitting #5 into two tickets.

Come up with a categorisation of descriptors.
Come up with a clever way to store these categorisations.
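
As a starting point for discussion, the categorisation could be stored as a plain mapping from category name to descriptor names, which also serialises directly to JSON. The category and descriptor names below are placeholders, not a proposed final scheme, and `active_features` is a hypothetical helper.

```python
import json

# Placeholder categories and descriptor names, for illustration only.
CATEGORIES = {
    "size": ["mol_weight", "heavy_atom_count"],
    "polarity": ["tpsa", "logp"],
    "hydrogen_bonding": ["h_donors", "h_acceptors"],
}


def active_features(categories, disabled=()):
    """List descriptor names, skipping every category named in `disabled`."""
    unknown = set(disabled) - set(categories)
    if unknown:
        raise KeyError(f"unknown categories: {sorted(unknown)}")
    return [
        name
        for category, names in categories.items()
        if category not in disabled
        for name in names
    ]


selected = active_features(CATEGORIES, disabled={"polarity"})
# The mapping survives a round trip through JSON unchanged.
restored = json.loads(json.dumps(CATEGORIES))
```

A flat mapping like this assumes each descriptor belongs to exactly one category; descriptors that fit several would need a different layout.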

H4: Any similar molecules / fragments?

For a given set of descriptors and a selected molecule, what are the most similar molecules (based on distance in the selected feature space)?
Something similar but fragment-based? E.g. take the average descriptor value for molecules with the fragment and compare it to the averages for molecules with other fragments.
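
The molecule-level question amounts to a nearest-neighbour lookup. The sketch below assumes Euclidean distance on unscaled feature vectors, which may not be the right metric for every descriptor set; `most_similar` is a hypothetical name.

```python
from math import dist


def most_similar(query_index, features, k=3):
    """Return the indices of the k molecules closest to the selected one."""
    query = features[query_index]
    distances = [
        (dist(query, feature), index)
        for index, feature in enumerate(features)
        if index != query_index  # the selected molecule is not its own neighbour
    ]
    return [index for _, index in sorted(distances)[:k]]


# Toy feature vectors, for illustration only.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (10.0, 10.0)]
neighbours = most_similar(0, features, k=2)
```

Scaling the features first would likely matter in practice, since descriptors such as molecular weight would otherwise dominate the distance.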
