theochem / diverseselector

Methods for selecting a diverse subset of a (molecular) database.

Home Page: https://selector.qcdevs.org

License: GNU General Public License v3.0

Python 10.83% CSS 1.44% Jupyter Notebook 77.07% JavaScript 2.16% HTML 8.48% TeX 0.01%
chemical-diversity compound-selection maximum-diversity-molecule maximum-dissimilarity-search variable-selection compound-acquisition chemical-library-design

diverseselector's Introduction

The selector Library

This project supports Python 3.7+ and is distributed under the GPLv3 license.

The selector library provides methods for selecting a diverse subset of a (molecular) dataset.

Citation

Please use the following citation in any publication using the selector library:

@article{
}

Installation

To install selector using the conda package management system, install miniconda or anaconda first, and then:

    # Create and activate qcdevs conda environment (optional, but recommended)
    conda create -n qcdevs python=3.10
    conda activate qcdevs

    # Install the stable release
    # (the conda release is not ready yet, so the command is commented out)
    # conda install -c theochem qc-selector

    # Install the development version
    pip install git+https://github.com/theochem/Selector.git

To install selector with pip, you may want to create a virtual environment, and then:

    # Install the stable release.
    pip install qc-selector

See https://selector.qcdevs.org for full details.

diverseselector's People

Contributors

ali-tehrani, alnaba1, awbroscius, dhrumil07, fanwangm, farnazh, khaleeh, marco-2023, maximilianvz, paulwayers, pre-commit-ci[bot], richrick1, xychem

diverseselector's Issues

Modularize the metric module

The current version of metrics.py implements several functions, possibly too many for people to memorize, so we should aim for a simpler API. I know you may not be able to join the hackathon next week. Can you update the status of this module? @Khaleeh

  1. Which are the functions for computing the distance/similarity?
  2. Which are the functions for computing diversity?
  3. We will need a conversion function to convert between distance, similarity, and diversity. (This is a reminder for myself.)

Thank you for the nice work!

Generating tests for the subset selection algorithms

For the evaluation metrics, I don't think generating the tests is challenging.

But when it comes to generating tests for the different algorithms, things become tricky. Reproducing the original results would be good, but the original dataset is not always available. I would suggest having a set of molecules with distinct structures/functional groups and seeing how each algorithm behaves. For example, we can include alkanes, alcohols, esters, ketones, etc., and then test whether the implemented method picks them apart as expected.
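One possible way to set this up, sketched under assumptions: synthetic, well-separated clusters of points stand in for the chemical classes, and a plain greedy MaxMin picker stands in for the algorithm under test. `make_clusters` and `maxmin_select` are hypothetical helpers written for this illustration, not functions from the package.

```python
import numpy as np


def make_clusters(n_clusters=4, n_per_cluster=10, n_features=2, spread=0.05, seed=0):
    """Return points in tight, well-separated clusters plus their cluster labels."""
    rng = np.random.default_rng(seed)
    # Cluster centers sit 10 units apart along the first axis.
    centers = np.zeros((n_clusters, n_features))
    centers[:, 0] = 10.0 * np.arange(n_clusters)
    points = np.vstack(
        [c + spread * rng.standard_normal((n_per_cluster, n_features)) for c in centers]
    )
    labels = np.repeat(np.arange(n_clusters), n_per_cluster)
    return points, labels


def maxmin_select(points, num_selected, start=0):
    """Greedy MaxMin: repeatedly add the point farthest from the selected set."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    selected = [start]
    while len(selected) < num_selected:
        # Distance from every point to its nearest already-selected point.
        min_dist = dist[:, selected].min(axis=1)
        selected.append(int(np.argmax(min_dist)))
    return selected


points, labels = make_clusters()
selected = maxmin_select(points, num_selected=4)
```

A test could then assert that, when the number of selected points equals the number of clusters, every cluster is represented exactly once.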

Do you have any ideas on how to make this work? @PaulWAyers @RichRick1 @JuansaCollins @Khaleeh @alnaba1
Thank you!

Supporting scaled similarity

Comments from @PaulWAyers

Similarity and scaled similarity (S(i,j)/SQRT(S(i,i)S(j,j)))
5 different methods for computing dissimilarity
One of these five (or not?) should be S(i,i) + S(j,j) - 2S(i,j), where S(i,j) is the similarity. This is the (squared) distance via a covariance matrix and is important in Gaussian processes.

One can also consider S(i,j)/SQRT(S(i,i)S(j,j)) and compute the distance-squared from that. That's a correlation matrix/distance.
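A small numerical sketch of both constructions, assuming the similarity matrix is a symmetric positive semi-definite array with a positive diagonal. The function names are illustrative only and the example matrix is built from made-up feature vectors.

```python
import numpy as np


def scaled_similarity(S):
    """Return S(i,j) / sqrt(S(i,i) S(j,j)), i.e. a correlation-like matrix."""
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)


def squared_distance(S):
    """Return S(i,i) + S(j,j) - 2 S(i,j) for every pair (i, j)."""
    diag = np.diag(S)
    return diag[:, None] + diag[None, :] - 2.0 * S


# Toy similarity matrix: inner products of three feature vectors.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
S = X @ X.T

D2 = squared_distance(S)       # covariance-style squared distance
C = scaled_similarity(S)       # scaled similarity with unit diagonal
D2_corr = squared_distance(C)  # equals 2 - 2 C(i,j)
```

For an inner-product similarity such as this one, `D2` coincides with the squared Euclidean distance between the underlying feature vectors.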

[Similarity module] Add more similarity measurements

Implement the methods listed at https://vlachosgroup.github.io/AIMSim/implemented_metrics.html in the similarity module. Please add detailed documentation showing which similarity function corresponds to which distance function in scikit-learn or scipy.

One question I have: shall we separate the similarity and distance measurements? I get confused by some measurements, e.g. the Tanimoto index of molecular fingerprints. I would see it as a distance, but it is treated as a similarity at https://vlachosgroup.github.io/AIMSim/implemented_metrics.html. If we decide to distinguish them, we may need separate similarity and distance modules instead of one module.

@PaulWAyers @FarnazH

A few more questions regarding #147

Thank you again for your efforts in building the notebook. @marco-2023

I am copying the list of questions as a reminder/placeholder here for reference as I have merged #147.

  • Would it be more precise to say "Running QC-Selector" instead of "Running Dissimilarity Algorithms", since we have multiple flavors of subset selection?
  • Do we have any explanation for why we get 8 and 6 data points instead of the queried 12 for "Adapted Optimizable K-Dissimilarity Selection (OptiSim)"?
  • Can we put all the imports at the top of the Jupyter notebook?
  • For each subplot, can we add a title, x label and y label so that people don't have to scroll up to see what the figure means?
  • Can you add a few words above the figure saying why MaxSum is sensitive to outliers?
  • Why is Directed Sphere Exclusion giving 13 molecules instead of 12?

Computing dissimilarity/distance/diversity/similarity of molecules

An essential component of our package is a submodule to compute the dissimilarity/distance/diversity/similarity, where we normally assume diversity = 1 - similarity.

A good summary of this is Table 2 in Drug Dev. Res., 72(1):74-84, 2011. Given that this is a public repo, I am not going to share the screenshot here.

The Tanimoto coefficient is a classic, gold-standard similarity metric for molecular fingerprints and we should support it. However, it was found to favor small molecules when used in molecule subset selection, and a modification was proposed. My guess is that we should make this the default for molecular fingerprint inputs.
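For reference, a minimal sketch of the plain Tanimoto coefficient on binary fingerprints, together with the 1 - similarity conversion mentioned above. The proposed modification is not implemented here, the function names are illustrative, and treating two all-zero fingerprints as identical is an assumption made for this sketch.

```python
import numpy as np


def tanimoto_similarity(a, b):
    """Tanimoto coefficient: shared on-bits divided by on-bits in either vector."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    both = np.count_nonzero(a & b)
    either = np.count_nonzero(a | b)
    # Assumption: two empty fingerprints are considered identical.
    return 1.0 if either == 0 else both / either


def tanimoto_dissimilarity(a, b):
    """Dissimilarity under the convention diversity = 1 - similarity."""
    return 1.0 - tanimoto_similarity(a, b)


fp_a = [1, 1, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1]
similarity = tanimoto_similarity(fp_a, fp_b)        # 2 shared bits / 4 bits set in either
dissimilarity = tanimoto_dissimilarity(fp_a, fp_b)
```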

I will keep adding new things. Any ideas will be appreciated.
@PaulWAyers @JuansaCollins @RichRick1 @Khaleeh @alnaba1

[Features module] Fix failed tests and improve code/coverage

Critically review the code in the features module, update docstrings, and add tests and comments to make the code clearer. Fix the tests that are failing in DiverseSelector/test/test_feature.py.

For example, the feature_reader seems redundant to me. It is just 2 lines of actual code (but we need a lot of lines to explain/document the function), and we can show how to load the data file in an example/notebook or expect the user to know about it. In a nutshell, I don't think making a wrapper function for every functionality from other packages is helpful/useful. It is best (and to our advantage) to show how our package works in conjunction with other packages (in this case pandas). Feel free to let me know what you think @FanwangM and @PaulWAyers.

[Distance Module] Bitstring functionality already covered by non-bitstring functions

@FanwangM can you explain the rationale behind developing the functions euc_bit() and bit_tanimoto()? I have added tests in test_distance.py that demonstrate that they produce the exact same results as using the euclidean distance from sklearn and tanimoto().

The only location I see them used is in nearest_average_tanimoto(). If there is somewhere else in DiverseSelector that they are essential, I can leave them. But otherwise, I think it would make sense for me to replace the bitstring functions with the normal ones and streamline the distance.py module.
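The equivalence can be illustrated with a short check. This is a sketch under assumptions: the bit-counting functions below are hypothetical stand-ins written for this comment, and the actual euc_bit() and bit_tanimoto() may be defined differently.

```python
import numpy as np


def tanimoto_general(a, b):
    """Tanimoto on real-valued vectors: a.b / (a.a + b.b - a.b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ab = a @ b
    return ab / (a @ a + b @ b - ab)


def tanimoto_bitcount(a, b):
    """Tanimoto on bit vectors via counting shared and total on-bits."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return np.count_nonzero(a & b) / np.count_nonzero(a | b)


def euclidean_bitcount(a, b):
    """Euclidean distance on bit vectors: sqrt of the number of differing bits."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return np.sqrt(np.count_nonzero(a != b))


rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(20, 64))
bits[:, 0] = 1  # guarantee no all-zero fingerprint, which would give 0/0

same_tanimoto = all(
    np.isclose(tanimoto_general(a, b), tanimoto_bitcount(a, b))
    for a in bits for b in bits
)
same_euclidean = all(
    np.isclose(np.linalg.norm(a - b), euclidean_bitcount(a, b))
    for a in bits for b in bits
)
```

On 0/1 vectors the dot products reduce to bit counts, which is why the general formulas and the bit-counting versions agree.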

Add functionality for "holes" in the dataset

As pointed out by @FarnazH , it would be good to be able to identify which molecules/samples are in low-density regions (where we have little data) and also to identify regions where there is very little (or no) data.
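One simple way to flag samples in low-density regions, offered only as a sketch: score each sample by its mean distance to its k nearest neighbors, so that larger scores indicate sparser surroundings. The helper name and the choice k=3 are assumptions, and this does not address the second part (locating empty regions of feature space).

```python
import numpy as np


def knn_mean_distance(X, k=3):
    """Mean Euclidean distance from each sample to its k nearest neighbors."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # After sorting each row, column 0 is the zero self-distance, so skip it.
    nearest = np.sort(dist, axis=1)[:, 1:k + 1]
    return nearest.mean(axis=1)


rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(30, 2))  # tightly packed samples near the origin
isolated = np.array([[5.0, 5.0]])           # one sample far from everything else
X = np.vstack([dense, isolated])

scores = knn_mean_distance(X)
most_isolated = int(np.argmax(scores))      # index of the sparsest sample
```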

List methods which require distance matrix

We need a list of the methods for which a distance matrix is mandatory. Currently, if a distance matrix is not provided, the code in base.py (lines 96-101) computes one, which is unnecessary when a feature matrix is enough. Therefore, can you list the methods that you are taking care of and indicate which of them must have a distance matrix as input and which take the distance matrix as an optional input? Thank you.
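One way this could be organized, as a rough sketch: each selector class declares whether it needs a distance matrix, and the matrix is only computed when that flag is set and none was supplied. The class names, the `needs_distance_matrix` flag, and `prepare` are hypothetical and do not reflect the current base.py.

```python
import numpy as np


def pairwise_distances(X):
    """Dense Euclidean distance matrix computed with broadcasting."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


class SelectorSketch:
    # Hypothetical class-level flag; subclasses override it as needed.
    needs_distance_matrix = False

    def prepare(self, X, distance_matrix=None):
        """Return the array the selection algorithm should work on."""
        if not self.needs_distance_matrix:
            return X  # a feature matrix is enough, skip the O(n^2) computation
        if distance_matrix is None:
            distance_matrix = pairwise_distances(X)
        return distance_matrix


class FeatureBasedSketch(SelectorSketch):
    needs_distance_matrix = False


class DistanceBasedSketch(SelectorSketch):
    needs_distance_matrix = True


X = np.arange(12, dtype=float).reshape(6, 2)
features_out = FeatureBasedSketch().prepare(X)
distances_out = DistanceBasedSketch().prepare(X)
```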

@RichRick1 @alnaba1

Not all selectors give the correct number of samples

We need sensible default values for the grid/sphere size in some of the selection algorithms.

We should also allow for the case where the number of selected molecules is too small. We can (easily?) optimize sphere-size, for example, using the secant method (evaluating at the closest integer) so that the right number of molecules is selected.

(related): As suggested by @FarnazH , we should allow for sphere size to be dictated by the number of molecules, rather than just a size.
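A rough sketch of tuning the sphere size to hit a requested count. Note the substitutions: plain bisection is used here instead of the secant method, and `sphere_exclusion` is a deliberately simple stand-in, not the package's DirectedSphereExclusion. Bisection assumes the number of selected points does not increase as the radius grows, which holds for this toy data but is not guaranteed in general.

```python
import numpy as np


def sphere_exclusion(X, r):
    """Scan points in order; keep a point if it is farther than r from all kept points."""
    selected = []
    for i, x in enumerate(X):
        if all(np.linalg.norm(x - X[j]) > r for j in selected):
            selected.append(i)
    return selected


def tune_radius(X, num_selected, r_low, r_high, max_iter=50):
    """Bisect on the radius until exactly num_selected points are returned."""
    for _ in range(max_iter):
        r = 0.5 * (r_low + r_high)
        selected = sphere_exclusion(X, r)
        if len(selected) == num_selected:
            break
        if len(selected) > num_selected:
            r_low = r   # too many points: the sphere must grow
        else:
            r_high = r  # too few points: the sphere must shrink
    # If no radius gives the exact count, the last attempt is returned.
    return r, selected


# Twenty equally spaced points on a line.
X = np.arange(20, dtype=float).reshape(-1, 1)
r, selected = tune_radius(X, num_selected=5, r_low=0.0, r_high=19.0)
```

The same loop also covers the related suggestion: the user supplies only the number of molecules and the radius is derived from it.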

[Diversity module] Adding tests for diversity measurements

We will need to add tests for the following functions. Moreover, a wrapper function or Python class would simplify the usage of the diversity functions.

  • total diversity column, Eq. 4 on page 843 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the
    first method in this paper)
  • Eq. 9 on page 844 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the second method in this
    paper)
  • Gini coefficient from Journal of Computational Chemistry 2016, 37, 2091–2097 (Eqs. 12 and 13)
  • entropy from Eq. 9 on page 4 of Leguy et al. J Cheminform (2021) 13:76 with implementation at
    https://github.com/jules-leguy/EvoMol/blob/master/evomol/evaluation_entropy.py
  • log-determinant function from Sci. Rep. 12: 1124 (2022)
  • Wasserstein distance for property-based evaluation of diversity, from Sci. Rep. 12: 1124
    (2022), with code available at https://github.com/tomotomonakanaka/SUBMO
  • Explicit Diversity Index from J. Chem. Inf. Model. 2006, 46, 5, 1898–1904

Refactoring the `feature` module

The original intent was to have one line of code perform the subset selection, but this leads to a large number of arguments. Therefore, I propose having two independent modules, so that we can compute the molecular features and do the subset selection independently.

[Distance module] Add Formulas to Docstrings and Add Tests

Finalize the functions in the distance module by:

  1. Unifying docstrings and adding formulas (that match the implementation)
  2. Adding tests for each function (to cover all lines) and for tricky cases
  3. Polishing the code to make sure it is readable (e.g., add comments) and pythonic

Tasks

Tasks - Post PR#129

Keywords list

Using the right keywords may help us identify good pieces of literature. Here is a list of keywords that @PaulWAyers and I have used:

1. chemical diversity;
2. compound selection;
3. maximum diversity molecule;

Please feel free to add new keywords if you think they are helpful. If you add any, please number them; I will keep updating the list here for future reference.

Thank you. @Khaleeh @alnaba1 @JuansaCollins @RichRick1

Stratified Sampling

"Simple" stratified sampling can be mimicked by using the propert(ies) of interest as the feature(s). (So the feature vector is exactly the propert(ies) to be sampled.)

To achieve sampling with respect to target properties and diversity with respect to a fingerprint/feature-vector, you need to append the target properties to the feature vector. For each property, specify a "repeat": the more times you repeat the property value the higher its weight.

One way to do this would be to pass a weight, then set the repeat-value for each property to be repeat = np.ceil(weight*n_features) where n_features is the number of features (length of the fingerprint). (I use ceiling to ensure that as long as weight > 0.0 at least one instance of the property is used). The final feature vector has the property value(s) appended to it repeat times.

Then we use this set of features.

Add a utility for augmenting the feature array and provide examples to showcase this.
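A minimal sketch of what such a utility might look like, following the repeat = np.ceil(weight*n_features) rule described above. The function name and signature are assumptions, and scaling the property values relative to the features is left to the caller.

```python
import numpy as np


def augment_features(features, properties, weights):
    """Append each property column to the feature array `repeat` times.

    repeat = ceil(weight * n_features), so any weight > 0 adds at least one copy.
    """
    features = np.asarray(features, dtype=float)
    properties = np.asarray(properties, dtype=float)
    if properties.ndim == 1:
        properties = properties[:, None]  # a single property becomes one column
    n_features = features.shape[1]
    blocks = [features]
    for column, weight in zip(properties.T, weights):
        repeat = int(np.ceil(weight * n_features))
        blocks.append(np.repeat(column[:, None], repeat, axis=1))
    return np.hstack(blocks)


fingerprints = np.array([[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
logp = np.array([0.2, 1.5, 3.1])  # made-up property values for illustration
augmented = augment_features(fingerprints, logp, weights=[0.5])
```

With 4 features and a weight of 0.5, the property column is appended twice, giving 6 columns in total.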

returning incorrect number of molecules

When using DirectedSphereExclusion and OptiSim, if r is not set carefully, we may end up with an incorrect number of molecules. Here are some examples.

    # Assumes OptiSim, DirectedSphereExclusion, graph_mols, and fingerprints_secfp6
    # are defined as in the attached notebook.
    import numpy as np

    # SECFP6 and OptiSim returning more molecules than requested
    selected_ids9 = OptiSim(func_distance=lambda x, y: np.linalg.norm(x - y),
                            r=0.5, k=3, tolerance=2.0, random_seed=42).select(arr=fingerprints_secfp6, num_selected=12)

    print(selected_ids9)
    graph_mols(input_sdf="demo_mols_2d.sdf", selected_ids=selected_ids9)

    # SECFP6 and OptiSim returning fewer molecules than requested
    selected_ids9 = OptiSim(func_distance=lambda x, y: np.linalg.norm(x - y),
                            r=1.5, k=3, tolerance=2.0, random_seed=42).select(arr=fingerprints_secfp6, num_selected=12)

    print(selected_ids9)
    graph_mols(input_sdf="demo_mols_2d.sdf", selected_ids=selected_ids9)

    #================================================
    # SECFP6 and directed sphere exclusion returning more molecules than requested
    selector = DirectedSphereExclusion(r=0.5, tolerance=5.0, start_id=0, random_seed=42,
                                       func_distance=lambda x, y: np.linalg.norm(x - y))
    selected_ids10 = selector.select(arr=fingerprints_secfp6, num_selected=12)

    print(selected_ids10)
    graph_mols(input_sdf="demo_mols_2d.sdf", selected_ids=selected_ids10)

    # SECFP6 and directed sphere exclusion returning fewer molecules than requested
    selector = DirectedSphereExclusion(r=1, tolerance=5.0, start_id=0, random_seed=42,
                                       func_distance=lambda x, y: np.linalg.norm(x - y))
    selected_ids10 = selector.select(arr=fingerprints_secfp6, num_selected=12)

    print(selected_ids10)
    graph_mols(input_sdf="demo_mols_2d.sdf", selected_ids=selected_ids10)

If you want to reproduce the results, please find the attached notebook: [num_mols_debugging.zip](https://github.com/theochem/DiverseSelector/files/8884694/num_mols_debugging.zip)

I know this is restricted by the parameter settings of the corresponding algorithms. Do we have a solution that helps people get the right number of molecules? Alternatively, we could print a warning when the number of returned molecules differs from the number requested, and tell people in which direction to adjust the parameters (increase or decrease some parameter values).
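The warning idea could look roughly like this. It is a sketch only: `check_num_selected` is a hypothetical helper, and the suggested direction (a larger r yields fewer molecules) is inferred from the examples above, where r=0.5 returned too many molecules and larger values returned too few.

```python
import warnings


def check_num_selected(selected_ids, num_selected, parameter="r"):
    """Warn when the number of selected items differs from the number requested."""
    n = len(selected_ids)
    if n > num_selected:
        warnings.warn(
            f"Selected {n} items but {num_selected} were requested; "
            f"try increasing {parameter}.",
            stacklevel=2,
        )
    elif n < num_selected:
        warnings.warn(
            f"Selected {n} items but {num_selected} were requested; "
            f"try decreasing {parameter}.",
            stacklevel=2,
        )
    return selected_ids


# Usage: wrap the result of a selection call.
ids = check_num_selected([3, 7, 11], num_selected=3)
```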

This is a challenging question and any suggestions would be appreciated. Thank you.
@PaulWAyers @FarnazH @alnaba1 @RichRick1 @Ali-Tehrani @ramirandaq

Cleaning up the code

For some parts of our package, the documentation is unclear. Can you double check whether the parts you are working on lack clear documentation, and fix them?

  • We will also need to add references if equations or algorithms are taken from papers. For an example of how to format references, see https://numpydoc.readthedocs.io/en/latest/format.html.
  • Add a Raises section if there is any error checking
  • Add type checking for all functions and classes, including inherited classes
  • Add more tests to increase coverage (add tests for missing lines)
  • Add explanations of your equations or algorithms to the Notes section. Equations need to be stated explicitly in the documentation.
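As an illustration of the checklist, here is a sketch of a numpydoc-style docstring with Raises, Notes, and References sections. The function itself is a generic example rather than code from the package, and the reference entry is a placeholder to be replaced by a real citation.

```python
import numpy as np


def euclidean_distance(x, y):
    r"""Compute the Euclidean distance between two feature vectors.

    Parameters
    ----------
    x : array_like of shape (n_features,)
        First feature vector.
    y : array_like of shape (n_features,)
        Second feature vector.

    Returns
    -------
    float
        The Euclidean distance between `x` and `y`.

    Raises
    ------
    ValueError
        If `x` and `y` are not one-dimensional or differ in length.

    Notes
    -----
    The distance is computed as

    .. math:: d(x, y) = \sqrt{\sum_k (x_k - y_k)^2}

    References
    ----------
    .. [1] Placeholder: cite the paper the equation or algorithm is taken from.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or y.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be one-dimensional arrays of equal length.")
    return float(np.sqrt(np.sum((x - y) ** 2)))
```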

If you can think of other things that we should do to clean up the code, please comment here. Thank you.
@alnaba1 @RichRick1 @Khaleeh @fwmeng88

[method.distance module]

To do:

  • Add tests to reach 100% coverage
  • Add comments in the code
  • Add formulas to docstring and polish docstring (add reference).

optimize_radius not using information of clusters

The function optimize_radius in the file utils.py does not use the cluster_ids parameter. The select_from_cluster method of DirectedSphereExclusion does not use cluster_ids either. Because of this, the selection algorithm fails when used with cluster labels.

[Docs] List the application scenarios of `DiverseSelector`

It would be beneficial for people to know how they can use our package for practical problems and I think we can make a list for people in the manuscript.

  • subset selection for HTS/virtual screening/machine learning
  • validation dataset hold-out (stratified version)
