theochem / diverseselector

Methods for selecting a diverse subset of a (molecular) database.

License: GNU General Public License v3.0

Python 10.83% CSS 1.44% Jupyter Notebook 77.07% JavaScript 2.16% HTML 8.48% TeX 0.01%

chemical-diversity compound-selection maximum-diversity-molecule maximum-dissimilarity-search variable-selection compound-acquisition chemical-library-design

diverseselector's Introduction

The `selector` Library

The selector library provides methods for selecting a diverse subset of a (molecular) dataset.

Citation

Please use the following citation in any publication using the selector library:

@article{
}

Installation

To install selector using the conda package management system, install miniconda or anaconda first, and then:

    # Create and activate qcdevs conda environment (optional, but recommended)
    conda create -n qcdevs python=3.10
    conda activate qcdevs

    # Install the stable release
    # current conda release is not ready yet
    # conda install -c theochem qc-selector

    # install the development version
    pip install git+https://github.com/theochem/Selector.git

To install selector with pip, you may want to create a virtual environment, and then:

    # Install the stable release.
    pip install qc-selector

See https://selector.qcdevs.org for full details.

diverseselector's People

Contributors

khaleeh richrick1 alnaba1 gabrielasd msricher ali-tehrani farnazh ramirandaq rnaimehaom jenniferlukk awbroscius maximilianvz zhaoyilin xychem fanwangm fossabot kunikachandra sid1552 dhrumil07

diverseselector's Issues

[Docs] support of different metrics for binary fingerprints or descriptors

Some metrics, such as Tanimoto, only work on binary fingerprints. Others work in both cases, and some work only for non-binary matrices, such as molecular descriptors. Can you provide a list or table summarizing this information? We will need it to refactor the metrics module. Thank you.
@Khaleeh

Add instructions of virtual env for development and various TEMPLATES

This would help people follow a consistent pattern when writing code and making PRs.

Modularize the metric module

The current version of metrics.py implements several functions, perhaps too many for people to memorize, so we should end up with a simpler API. I know you may not be able to join the hackathon next week. Can you update the status of this module? @Khaleeh

Which are the functions for computing the distance/similarity?
Which are the functions for computing diversity?
We will need a conversion function to convert between distance/similarity/diversity. (This is a reminder for myself.)

Thank you for the nice work!

[methods.partition GridPartitioning]

To do:

Add test to reach 100% coverage
Add comments in the code
Add formulas to docstring and polish docstring (add reference).

Add the option to dump selected molecules

This is going to be supported in base.py.

[Similarity module] Add support of scaled similarity

See more discussions in #73. Key points,

Add implementation
Add tests and docstrings

Generating tests for the subset selection algorithms

For the evaluation metrics, I don't think generating the tests is challenging.

But when it comes to generating tests for different algorithms, things become tricky. Reproducing the original results would be good, but the original dataset is not always available. I would suggest having a set of molecules with distinct structures/functional groups and seeing how the algorithm works. For example, we can have alkanes, alcohols, esters, ketones, etc., and then test whether the implemented method works.
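The same idea can be tried on synthetic data before any molecules are involved. The sketch below builds four well-separated clusters as a stand-in for distinct chemical families and checks that a simple greedy MaxMin picker takes one point from each. The function `maxmin_select` and the cluster layout are illustrative assumptions, not code from this repository.

```python
import numpy as np

def maxmin_select(X, n_select, start=0):
    """Greedy MaxMin: repeatedly add the point farthest from the current selection."""
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(X - X[start], axis=1)
    while len(selected) < n_select:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Four tight clusters at the corners of a 100 x 100 square (hypothetical test data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
labels = np.repeat(np.arange(4), 25)
X = centers[labels] + rng.normal(scale=1.0, size=(100, 2))

picked = maxmin_select(X, n_select=4)
picked_labels = sorted(labels[picked].tolist())
print(picked_labels)
```

A test of this kind does not depend on reproducing a published result; it only asserts a property that any reasonable diversity picker should satisfy on clearly clustered data.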

Do you have any idea on how to make it work? @PaulWAyers @RichRick1 @JuansaCollins @Khaleeh @alnaba1
Thank you!

Supporting scaled similarity

Comments from @PaulWAyers

Similarity and scaled similarity (S(i,j)/SQRT(S(i,i)S(j,j)))
5 different methods for computing dissimilarity
One of these five (or not?) should be S(i,i) + S(j,j)- 2S(i,j) where S(i,j) is the similarity. This is the distance via a covariance matrix and is important in Gaussian processes.

One can also consider S(i,j)/SQRT(S(i,i)S(j,j)) and compute the distance-squared from that. That's a correlation matrix/distance.
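A minimal NumPy sketch of the two formulas above, assuming the similarity matrix S is symmetric with a strictly positive diagonal. The function names are hypothetical and the linear-kernel matrix is only an example input.

```python
import numpy as np

def scaled_similarity(S):
    """Return S(i,j) / sqrt(S(i,i) * S(j,j)), a correlation-like matrix."""
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

def squared_distance_from_similarity(S):
    """Return S(i,i) + S(j,j) - 2 S(i,j) for every pair (i, j)."""
    diag = np.diag(S)
    return diag[:, None] + diag[None, :] - 2.0 * S

# Example input: a linear-kernel similarity built from three feature vectors.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
S = X @ X.T

D2 = squared_distance_from_similarity(S)        # covariance-type distance squared
C = scaled_similarity(S)                        # unit diagonal
D2_corr = squared_distance_from_similarity(C)   # correlation-type distance squared
print(D2)
print(D2_corr)
```

For a linear kernel the first quantity reduces to the squared Euclidean distance between the feature vectors, which gives a convenient sanity check.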

[Similarity module] Add more similarity measurements

Implement the methods listed at https://vlachosgroup.github.io/AIMSim/implemented_metrics.html in the similarity module. Please add detailed documentation showing which similarity function corresponds to which distance function in scikit-learn or scipy.

One question I have: shall we separate the similarity and distance measurements? I get confused by some measurements, e.g. the Tanimoto index of molecular fingerprints. I would see it as a distance, but they treat it as a similarity, https://vlachosgroup.github.io/AIMSim/implemented_metrics.html. If we decide to distinguish them, we may need to split them into similarity and distance modules instead of one module.

@PaulWAyers @FarnazH

A few more questions regarding #147

Thank you again for your efforts in building the notebook. @marco-2023

I am copying the list of questions as a reminder/placeholder here for reference as I have merged #147.

Would it be more precise to say "Running QC-Selector" instead of "Running Dissimilarity Algorithms" because we have multiple flavors for subset selection;
Do we have any explanations on why we get 8 and 6 data points instead of the queried 12 for "Adapted Optimizable K-Dissimilarity Selection (OptiSim)"?
Can we put all the imports at the top of the jupyter notebook?
For each subplot, can we add a title, x label and y label so that people don't have to scroll up to see what the figure means?
Can you add a few words saying why MaxSum is sensitive to outliers above the figure?
Why is Directed Sphere Exclusion giving 13 molecules instead of 12?

[methods.partition DirectedSphereExclusion]

To do:

Add test to reach 100% coverage
Add comments in the code
Add formulas to docstring and polish docstring (add reference).

Add support for `p-sums` in the brute-strength algorithm

We need to support not just the minimum (p=-infinity) and sum (p=1) versions, but also general p-sums.
This relates to #4.
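One reading of "p-sum" that is consistent with the two limits named above is (sum_i d_i^p)^(1/p): it gives the plain sum at p=1 and tends to the minimum as p goes to -infinity. The sketch below implements that reading. The function name `p_sum` and this interpretation are assumptions, not the repository's API.

```python
import numpy as np

def p_sum(distances, p):
    """Generalized p-sum (sum_i d_i**p) ** (1/p) of positive distances."""
    d = np.asarray(distances, dtype=float)
    if p == -np.inf:
        return float(d.min())   # limiting case: minimum distance
    if p == np.inf:
        return float(d.max())   # limiting case: maximum distance
    if p == 0:
        raise ValueError("p must be non-zero")
    # Negative p requires strictly positive distances to avoid division by zero.
    return float(np.sum(d ** p) ** (1.0 / p))

d = [1.0, 2.0, 3.0]
print(p_sum(d, 1))        # plain sum
print(p_sum(d, -50))      # already very close to the minimum
print(p_sum(d, -np.inf))  # exact minimum
```

Large negative p emphasizes the closest pair, so intermediate values interpolate between a MaxMin-like and a MaxSum-like objective.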

Computing dissimilarity/distance/diversity/similarity of molecules

An essential component of our package is to have a submodule to compute the dissimilarity/distance/diversity/similarity where normally we assume diversity = 1 - similarity.

A good summary of this is Table 2 in Drug Dev. Res., 72(1):74-84, 2011. Given that this is a public repo, I am not going to share the screenshot here.

The Tanimoto coefficient is a classic, gold-standard similarity metric for molecular fingerprints and we should support it. However, it was found to favor small molecules when used in molecule subset selection, and a modification was proposed. My guess is that we should make this the default for molecular fingerprint inputs.
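For reference, a minimal sketch of the standard Tanimoto coefficient on binary fingerprints, together with the diversity = 1 - similarity convention mentioned above. The function name is hypothetical, the modified variant is not shown, and returning 1.0 for two all-zero fingerprints is an arbitrary choice made here.

```python
import numpy as np

def tanimoto_similarity(a, b):
    """Tanimoto coefficient |A and B| / (|A| + |B| - |A and B|) for bit vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    common = np.count_nonzero(a & b)
    total = np.count_nonzero(a) + np.count_nonzero(b) - common
    # Two empty fingerprints are treated as identical (an assumption of this sketch).
    return 1.0 if total == 0 else common / total

fp1 = [1, 1, 0, 1, 0]
fp2 = [1, 0, 0, 1, 1]
similarity = tanimoto_similarity(fp1, fp2)
dissimilarity = 1.0 - similarity
print(similarity, dissimilarity)
```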

I will keep adding new things. Any idea will be appreciated.
@PaulWAyers @JuansaCollins @RichRick1 @Khaleeh @alnaba1

[Features module] Fix failed tests and improve code/coverage

Critically review the code in features module, update docstrings, add tests and comments to the code to make it more clear. Fix tests that are failing in DiverseSelector/test/test_feature.py.

For example, the feature_reader seems redundant to me. It is just 2 lines of actual code (but we need a lot of lines to explain/document the function), and we can show how to load the data file in an example/notebook or expect the user to know about it. In a nutshell, I don't think making a wrapper function for every functionality from other packages is helpful/useful. It is best (and to our advantage) to show how our package works in conjunction with other packages (in this case pandas). Feel free to let me know what you think @FanwangM and @PaulWAyers.

Choose an API

[Diversity module] Add Formulas to Docstrings, Add Tests, & Polish Module

See: #121

For each function (mentioned in #121), make sure that:

implementation is clear with comments.
docstring is polished and gives the formula implemented (add a reference to the paper used for implementation)
tests have 100% coverage

Identify/Choose Methods for Selecting Diverse Samples

[Utils] Utils.py has unfinished to-do and lacks testing

The utils.py has a large to-do block here. Is implementation of this still needed?

Also, mol_loader() needs tests written for it in test_util.py

[Selector module] Add the method in `MultipleComparisons`

Ramon's group has a clever way of doing diverse selection; we have an in-house implementation and should merge it into this repo.

[Distance Module] Bitstring functionality already covered by non-bitstring functions

@FanwangM can you explain the rationale behind developing the functions euc_bit() and bit_tanimoto()? I have added tests in test_distance.py that demonstrate that they produce the exact same results as using the euclidean distance from sklearn and tanimoto().

The only location I see them used is in nearest_average_tanimoto(). If there is somewhere else in DiverseSelector that they are essential, I can leave them. But otherwise, I think it would make sense for me to replace the bitstring functions with the normal ones and streamline the distance.py module.

Add functionality for "holes" in the dataset

As pointed out by @FarnazH , it would be good to be able to identify which molecules/samples are in low-density regions (where we have little data) and also to identify regions where there is very little (or no) data.
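For the first half of this request (flagging samples that sit in low-density regions), one simple option is to score each sample by the distance to its k-th nearest neighbor. This is a hedged sketch with a hypothetical function name and a brute-force distance matrix. It does not address locating empty regions that contain no samples at all.

```python
import numpy as np

def knn_distance(X, k=3):
    """Distance from each sample to its k-th nearest neighbor (larger = sparser)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero self-distance,
    # so column k holds the k-th nearest neighbor distance.
    return np.sort(dist, axis=1)[:, k]

# Hypothetical data: one dense cluster plus a single isolated sample.
rng = np.random.default_rng(1)
cluster = rng.normal(scale=0.5, size=(30, 2))
outlier = np.array([[10.0, 10.0]])
X = np.vstack([cluster, outlier])

scores = knn_distance(X, k=3)
most_isolated = int(np.argmax(scores))
print(most_isolated)
```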

Construct Evil Example

Make an evil example.

The idea would be to find an important function (from the literature) and then sample it very nonuniformly.

We could try the model in the (retracted, but still a model) paper:
https://www.nature.com/articles/s41586-021-04096-9

List methods which require distance matrix

We need a list of the methods for which a distance matrix is mandatory. Currently, if a distance matrix is not provided, the program in base.py (lines 96-101) will compute the distance matrix, which is unnecessary when a feature matrix is enough. Therefore, can you help list the methods that you are taking care of and show which methods must have a distance matrix as input and which methods take the distance matrix as an optional input? Thank you.

@RichRick1 @alnaba1

Quantifying the diversity of selected subset

We will add the functionality to support measuring the subset diversity.

Examples to show how to use our package

We need a good demo.

Not all selectors give the correct number of samples

We need sensible default values for the grid/sphere size in some of the selection algorithms.

We should also allow for the case where the number of selected molecules is too small. We can (easily?) optimize sphere-size, for example, using the secant method (evaluating at the closest integer) so that the right number of molecules is selected.

(related): As suggested by @FarnazH , we should allow for sphere size to be dictated by the number of molecules, rather than just a size.
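To make the radius-tuning idea concrete, here is a small sketch that adjusts the sphere radius until a target number of samples is selected. It uses bisection rather than the secant method suggested above, because bisection is easier to keep robust when the count is a step function of the radius. The `sphere_exclusion` function is a simplified stand-in written for this example, not the library's DirectedSphereExclusion.

```python
import numpy as np

def sphere_exclusion(X, r):
    """Simplified sphere exclusion: pick the first remaining point, drop all within r."""
    remaining = list(range(len(X)))
    selected = []
    while remaining:
        i = remaining[0]
        selected.append(i)
        remaining = [j for j in remaining if np.linalg.norm(X[j] - X[i]) > r]
    return selected

def tune_radius(X, n_target, lo, hi, max_iter=50):
    """Bisect on r; returns the (radius, selection) closest to n_target that was seen."""
    best = None
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        sel = sphere_exclusion(X, mid)
        if best is None or abs(len(sel) - n_target) < abs(len(best[1]) - n_target):
            best = (mid, sel)
        if len(sel) == n_target:
            break
        if len(sel) > n_target:
            lo = mid   # too many points selected: enlarge the spheres
        else:
            hi = mid   # too few points selected: shrink the spheres
    return best

# Ten equally spaced points on a line (hypothetical data).
X = np.arange(10, dtype=float)[:, None]
radius, selected = tune_radius(X, n_target=5, lo=0.0, hi=9.0)
print(radius, selected)
```

Because the number of selected points changes in jumps, an exact match is not guaranteed in general, so the function returns the closest selection it encountered.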

[Diversity module] Adding tests for diversity measurements

We will need to add tests for the following functions. Moreover, a wrapper function or Python class would simplify the usage of the diversity functions.

total diversity column, Eq. 4 on page 843 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the first method in this paper)
Eq. 9 on page 844 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the second method in this paper)
Gini coefficient from Journal of Computational Chemistry 2016, 37, 2091–2097, Eqs. 12 and 13
entropy from Eq. 9 on page 4 of Leguy et al. J Cheminform (2021) 13:76 with implementation at https://github.com/jules-leguy/EvoMol/blob/master/evomol/evaluation_entropy.py
log-determinant function from Sci. Rep. 12: 1124 (2022)
Wasserstein distance for property-based evaluation of diversity, from Sci. Rep. 12: 1124 (2022) with code available at https://github.com/tomotomonakanaka/SUBMO
Explicit Diversity Index from J. Chem. Inf. Model. 2006, 46, 5, 1898–1904

Refactoring the `feature` module

The original attempt was to have one line of code perform the subset selection, but this leads to a large number of arguments. Therefore, I would propose having two independent modules so that we can compute the molecular features and do the subset selection independently.

[Distance module] Add Formulas to Docstrings and Add Tests

Finalize the functions in distance module by:

Unifying docstrings and adding formulas (that match the implementation)
Add tests for each function (to cover all lines) and tricky cases
Polish code to make sure it is readable (e.g., add comments) and pythonic

Tasks

Move sklearn_supported_metrics from utils module to distance module
Move tests related to distance module into test_distance.py
Remove support for sklearn.pairwise_distances function as it only adds overhead to the code.
Add formulas
Add tests

Tasks - Post PR#129

Test coverage: missing lines 184 and 189.
Fix math formula: https://github.com/theochem/DiverseSelector/blob/main/DiverseSelector/distance.py#L149
Having both pairwise_similarity_bit and compute_distance_matrix seems redundant considering the fact that we have the #123 module.
Remove empty lines at the end of the modules, like https://github.com/theochem/DiverseSelector/blob/main/DiverseSelector/test/test_distance.py#L99
Add tests to explicitly test tanimoto and modified_tanimoto
Implementation needs to be improved and further clarified: https://github.com/theochem/DiverseSelector/blob/main/DiverseSelector/distance.py#L223
Needs more clarification in docstring and comments of https://github.com/theochem/DiverseSelector/blob/main/DiverseSelector/distance.py#L143

Implementing Gini coefficient in `metric.py`

[Selector module] Implementing D-optimal designs

Information related to determinantal point processes can be found at #4.

fast greedy algorithm
review
code
two algorithms here

[Diversity module] Problem of using `shannon_entropy` for computing diversity

Sometimes this function returns an invalid value, NaN. You can reproduce the error by running the attached notebook with the latest code in this repo. I think the main problem is that we may have zeros in the denominator.
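The exact formula used by `shannon_entropy` is not reproduced here. As a generic illustration, a common source of NaN in entropy calculations is a term of the form 0 * log(0), which appears whenever a fingerprint bit is set in none or all of the molecules. The sketch below uses a hypothetical per-bit binary entropy and masks those terms to zero. Whether this is the actual cause in the repository is an assumption.

```python
import numpy as np

def bit_entropy(fingerprints):
    """Sum over bits of the binary entropy of the bit's occupation frequency."""
    X = np.asarray(fingerprints, dtype=float)
    p = X.mean(axis=0)              # fraction of molecules with each bit set
    terms = np.zeros_like(p)
    mask = (p > 0) & (p < 1)        # bits with p = 0 or p = 1 contribute exactly 0
    pm = p[mask]
    terms[mask] = -(pm * np.log2(pm) + (1.0 - pm) * np.log2(1.0 - pm))
    return float(terms.sum())

# Bit 0 is always set, bit 1 is never set, bit 2 is set in half of the molecules.
fps = [[1, 0, 1],
       [1, 0, 0]]
entropy = bit_entropy(fps)
print(entropy)
```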

diversity_problem.zip

Can you fix this? Thank you.
@Khaleeh

Keywords list

Using the right keywords may help us identify good pieces of literature. Here is a list of keywords that @PaulWAyers and I have used,

1. chemical diversity;
2. compound selection;
3. maximum diversity molecule;

Please feel free to add new keywords if you think that's helpful. If you add any, numbering them would be helpful, and I will keep the list here updated for future reference.

Thank you. @Khaleeh @alnaba1 @JuansaCollins @RichRick1

Add support of SMILES or 2D molecules

This would help people to load different molecule formats.

Design Tests/Assessments for Methods

Stratified Sampling

"Simple" stratified sampling can be mimicked by using the propert(ies) of interest as the feature(s). (So the feature vector is exactly the propert(ies) to be sampled.)

To achieve sampling with respect to target properties and diversity with respect to a fingerprint/feature-vector, you need to append the target properties to the feature vector. For each property, specify a "repeat": the more times you repeat the property value the higher its weight.

One way to do this would be to pass a weight, then set the repeat-value for each property to be repeat = np.ceil(weight*n_features) where n_features is the number of features (length of the fingerprint). (I use ceiling to ensure that as long as weight > 0.0 at least one instance of the property is used). The final feature vector has the property value(s) appended to it repeat times.

Then we use this set of features.

Add a utility for augmenting the feature array and provide examples to showcase this.
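A possible shape for such a utility, following the repeat = np.ceil(weight*n_features) rule described above. The function name `augment_features`, its signature, and the example property values are hypothetical.

```python
import numpy as np

def augment_features(features, properties, weights):
    """Append each property column `repeat` times, repeat = ceil(weight * n_features)."""
    features = np.asarray(features, dtype=float)
    properties = np.asarray(properties, dtype=float)
    if properties.ndim == 1:
        properties = properties[:, None]    # a single property given as a 1D array
    n_features = features.shape[1]
    blocks = [features]
    for column, weight in zip(properties.T, weights):
        if weight <= 0.0:
            continue                        # zero weight: property is not used
        repeat = int(np.ceil(weight * n_features))
        blocks.append(np.tile(column[:, None], (1, repeat)))
    return np.hstack(blocks)

# Three molecules, a 4-bit fingerprint, and one target property (made-up values).
fps = np.array([[1, 0, 1, 1],
                [0, 1, 1, 0],
                [1, 1, 0, 0]], dtype=float)
prop = np.array([0.5, 2.0, -1.0])
augmented = augment_features(fps, prop, weights=[0.5])
print(augmented.shape)
```

In practice the property values would probably need to be scaled to a range comparable with the fingerprint entries before appending, otherwise the weight alone does not control their influence on the distances.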

returning incorrect number of molecules

When using DirectedSphereExclusion and OptiSim, if we set the r not carefully, we may end up having an incorrect number of molecules. Here are some examples.

# SECFP6 and OptiSim with more molecules than required
selected_ids9 = OptiSim(func_distance=lambda x, y: np.linalg.norm(x - y), 
                        r=0.5, k=3, tolerance=2.0, random_seed=42).select(arr=fingerprints_secfp6, num_selected=12)

print(selected_ids9)
graph_mols(input_sdf="demo_mols_2d.sdf",selected_ids=selected_ids9)

# SECFP6 and OptiSim with fewer molecules than required
selected_ids9 = OptiSim(func_distance=lambda x, y: np.linalg.norm(x - y), 
                        r=1.5, k=3, tolerance=2.0, random_seed=42).select(arr=fingerprints_secfp6, num_selected=12)

print(selected_ids9)
graph_mols(input_sdf="demo_mols_2d.sdf",selected_ids=selected_ids9)

#================================================
# SECFP6 and directed sphere exclusion with more molecules than required
selector = DirectedSphereExclusion(r=0.5, tolerance=5.0, start_id=0, random_seed=42,
                                   func_distance=lambda x, y: np.linalg.norm(x - y))
selected_ids10 = selector.select(arr=fingerprints_secfp6, num_selected=12)

print(selected_ids10)
graph_mols(input_sdf="demo_mols_2d.sdf",selected_ids=selected_ids10)

# SECFP6 and directed sphere exclusion with fewer molecules than required
selector = DirectedSphereExclusion(r=1, tolerance=5.0, start_id=0, random_seed=42,
                                   func_distance=lambda x, y: np.linalg.norm(x - y))
selected_ids10 = selector.select(arr=fingerprints_secfp6, num_selected=12)

print(selected_ids10)
graph_mols(input_sdf="demo_mols_2d.sdf",selected_ids=selected_ids10)
[num_mols_debugging.zip](https://github.com/theochem/DiverseSelector/files/8884694/num_mols_debugging.zip)

If you want to reproduce the results, please find the attached notebook.

I know this is restricted by the parameter settings of the corresponding algorithms. Do we have a solution to help people get the right number of molecules? Alternatively, we can print a warning when the returned number of molecules is not equal to the requested number and tell people in which direction to adjust the parameters (increase or decrease some parameter values).
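The warning option could look roughly like the sketch below. The helper name and message wording are hypothetical. The suggested direction follows the examples above, where the smaller r returned too many molecules and the larger r returned too few.

```python
import warnings

def check_selected_count(selected_ids, num_requested, radius):
    """Warn when the number of selected samples differs from the number requested."""
    n_selected = len(selected_ids)
    if n_selected > num_requested:
        warnings.warn(
            f"Selected {n_selected} samples but {num_requested} were requested; "
            f"try increasing r (currently {radius})."
        )
    elif n_selected < num_requested:
        warnings.warn(
            f"Selected {n_selected} samples but {num_requested} were requested; "
            f"try decreasing r (currently {radius})."
        )
    return n_selected == num_requested
```

A selector could call such a helper just before returning, so users get a hint even when they do not inspect the length of the result themselves.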

This is a challenging question and any suggestions would be appreciated. Thank you.
@PaulWAyers @FarnazH @alnaba1 @RichRick1 @Ali-Tehrani @ramirandaq

Cleaning up the code

For some parts of our package, the documentation is unclear. Can you help double check if the parts you are working on are missing such clear documentation and fix them?

We will also need to add references if equations or algorithms are taken from papers. For an example of how to add references, take a look at https://numpydoc.readthedocs.io/en/latest/format.html.
Add Raises if there is any error checking
Type checking for all the functions or classes including inherited classes
Add more tests to increase coverage (add tests for missing lines)
Add some explanations of your equation or algorithms to Notes section. We will need to have equations explicitly expressed in the documentation.

If you can think of other things that we should do to clean up the code, please comment here. Thank you.
@alnaba1 @RichRick1 @Khaleeh @fwmeng88

Add support of heapsweep

This is inspired by https://chemfp.readthedocs.io/en/chemfp-4x/chemfp_diversity.html?highlight=%20MaxMin#heapsweep-picker
and the paper is https://doi.org/10.1016/j.tcs.2015.02.033. Not sure whether this implementation is worth the effort.

Also, some of the picking methods on this webpage could be interesting to explore.

What do you think of this? @PaulWAyers @FarnazH @alnaba1 @RichRick1 @Ali-Tehrani

Methods for Assessing the Diversity of a Set (and how optimal it is)

For example, we could measure how much more diverse a selected sample is than a random sample.

If the clusters are of similar size, then random-ish samples are a lot better than if the clusters are very inhomogeneous in size.
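One way to quantify "more diverse than random" is the ratio between the mean pairwise distance of the selected subset and the average of the same quantity over random subsets of equal size. This is a sketch under that assumption. The function names and the choice of mean pairwise Euclidean distance as the diversity measure are illustrative only.

```python
import numpy as np

def mean_pairwise_distance(X):
    """Average Euclidean distance over all distinct pairs of rows in X."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(X), k=1)
    return float(dist[upper].mean())

def diversity_gain(X, selected, n_random=200, seed=0):
    """Ratio of the selected subset's diversity to the random-subset average."""
    rng = np.random.default_rng(seed)
    baseline = np.mean([
        mean_pairwise_distance(X[rng.choice(len(X), size=len(selected), replace=False)])
        for _ in range(n_random)
    ])
    return mean_pairwise_distance(X[selected]) / baseline

# 100 equally spaced points on a line and an evenly spread selection of four.
X = np.arange(100, dtype=float)[:, None]
selected = [0, 33, 66, 99]
gain = diversity_gain(X, selected)
print(gain)
```

A gain close to 1 means the selection is no better than random sampling, which is the situation expected when clusters are of similar size.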

[method.distance module]

To do:

Add test to reach 100% coverage
Add comments in the code
Add formulas to docstring and polish docstring (add reference).

[Converter module] Conversion between similarity and dissimilarity/distance

Shall we add a new module named converter to convert the similarity and dissimilarity/distance back and forth?

Translate the functions from https://rdrr.io/cran/smacof/man/sim2diss.html, but we are not going for the ranking method for now. More discussions are listed in #73.
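As a starting point for discussion, a converter could expose a handful of elementwise transformations behind one function. The method names and the selection of formulas below are hypothetical and only loosely modeled on the kind of options listed on the linked page. They should be checked against that reference before being adopted.

```python
import numpy as np

def sim_to_dist(S, method="one_minus"):
    """Convert a similarity matrix to a dissimilarity matrix (illustrative methods)."""
    S = np.asarray(S, dtype=float)
    if method == "one_minus":   # d = 1 - s, for similarities in [0, 1]
        return 1.0 - S
    if method == "sqrt":        # d = sqrt(1 - s), for correlation-like similarities
        return np.sqrt(1.0 - S)
    if method == "reverse":     # d = max(S) + min(S) - s
        return S.max() + S.min() - S
    if method == "neg_log":     # d = -log(s), requires s > 0
        return -np.log(S)
    raise ValueError(f"unknown method: {method}")

S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
for name in ("one_minus", "sqrt", "reverse", "neg_log"):
    print(name, sim_to_dist(S, method=name)[0, 1])
```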

Finalizing the distance and diversity modules

This part remains to be finished for the first official release.

Add a command line tool for people to perform subset selection with one line of `bash` code

This is good to have, but let's leave it to the last stage. An example can be found at https://github.com/theochem/B3clf.

optimize_radius not using information of clusters

The function optimize_radius in the file utils.py is not using the cluster_ids parameter. The select_from_cluster method of DirectedSphereExclusion is also not using cluster_ids. Because of this, this selection algorithm fails when used with cluster labels.

subset selection for HTS/virtual screening/machine learning
validation dataset hold-out (stratified version)

[Website] Host documentation with gh-pages or our own domain

Generate the documentation and host it with our own domain.
