Comments (16)
Determinantal point processes seem really cool here; they maximize the volume of the parallelepiped spanned by the selected data points, and in that sense ensure that the "interpolation regime" is as big as possible.
from diverseselector.
Notes on Wasserstein Distance to Uniform Distribution (for @Khaleeh)
Wasserstein-Dist-to-Uniform-Dist.pdf
Wasserstein-Dist-to-Uniform-Dist.md
from diverseselector.
Note that we should also support p-sum versions (not just the minimum, p = -infinity, and the sum, p = 1) of the brute-strength algorithm, so that a balanced representation can be achieved.
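A sketch of what the p-sum generalization could look like when scoring a candidate's distances to the already-selected set (the function name and signature are hypothetical, not from the codebase):

```python
import numpy as np

def p_sum_score(dists, p):
    """Generalized p-mean of a candidate's distances to the selected set.

    p = 1 recovers the sum criterion (up to a constant scaling),
    p -> -infinity recovers the minimum (maximin) criterion,
    and intermediate p balances the two. Assumes strictly positive
    distances when p < 0 (zeros would blow up the negative power).
    """
    dists = np.asarray(dists, dtype=float)
    if np.isneginf(p):
        return dists.min()
    return np.mean(dists ** p) ** (1.0 / p)
```

For example, for distances `[1, 4]` the score interpolates between 1 (p = -inf) and 2.5 (p = 1).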
from diverseselector.
Found this paper really easy to follow; it describes a couple of algorithms (clustering and Dissimilarity-Based Compound Selection) that might be relatively easy to implement at the beginning. Also, Dissimilarity-Based Compound Selection provides a decent metric, while the quality of the clustering approach can be assessed with the Gini index / entropy. (Screenshots of the selection and clustering algorithms were attached here.)
So, overall steps might be:
- Build (or let the user input) the dissimilarity matrix as a np.array / sparse array / .txt file (?).
- Use the proposed Dissimilarity-Based Compound Selection, which doesn't require a lot of computational power and/or resources. Another choice at this stage might be a clustering algorithm (hierarchical (an application can be found in this paper), k-means, DBSCAN, etc.) and then just select the required number of samples from each cluster.
- Use the Gini index / entropy to decide how diverse the dataset is.
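A minimal sketch of the first step, building the dissimilarity matrix from a feature array (pure NumPy; a user-supplied matrix would skip this entirely):

```python
import numpy as np

def dissimilarity_matrix(X):
    """Pairwise Euclidean distances for an (n_samples, n_features) array,
    computed via the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # clip tiny negative values caused by floating-point round-off
    return np.sqrt(np.maximum(d2, 0.0))
```

For example, `dissimilarity_matrix([[0, 0], [3, 4]])` gives the 2x2 matrix with off-diagonal entries 5.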
Potential problems might be:
- How to select the number of clusters for the clustering approach (can be done using the elbow method and/or the silhouette coefficient)
- Clustering algorithms in general don't work great, so maybe we need to compare the output against a random-sampling baseline with some threshold
- How to deal with imbalanced datasets
@fwmeng88 please let me know what you think about this.
from diverseselector.
I think these are two great brute-strength methods for generating samples. Selecting the right number of clusters is probably best left to the user; a default option could be to set the number of clusters equal to the desired sample size. A simple revision would be to take `n_members` elements from each of `n_clusters` clusters to generate `n_members*n_clusters` samples, where any deficit (because a cluster doesn't have enough members) is resolved in some way (random sampling or dissimilarity-based selection).
I think some of the other methods are more efficient, but for most molecular databases these methods (very robust; the only potential issue is cost) should be extremely reliable.
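The cluster-then-select idea with deficit handling can be sketched as follows, assuming cluster labels have already been computed by whatever clustering algorithm is chosen (the function name is hypothetical):

```python
import numpy as np

def select_from_clusters(labels, n_members, rng=None):
    """Pick up to n_members indices from each cluster; any deficit
    (clusters smaller than n_members) is topped up by random sampling
    from the not-yet-selected remainder, as suggested above."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    selected, deficit = [], 0
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        take = min(n_members, members.size)
        selected.extend(rng.choice(members, size=take, replace=False))
        deficit += n_members - take
    if deficit:
        pool = np.setdiff1d(np.arange(labels.size), selected)
        selected.extend(rng.choice(pool, size=min(deficit, pool.size), replace=False))
    return np.sort(np.asarray(selected))
```

Dissimilarity-based top-up (instead of random) would only change the final `rng.choice` over `pool`.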
from diverseselector.
This paper is very interesting. Thanks for proposing the idea. @RichRick1
DBCS can be among the very first methods we implement. But we may need to think about its tendency to select outliers, which relates to the class-imbalance problem you mentioned. These two problems are not identical, but closely related. An alternative is the sphere-of-exclusion algorithm (J. Chem. Inf. Comput. Sci. 1995, 35, 1, 59–67 and ACS Chem. Biol. 2011, 6, 3, 208–217).
Another possibility is that we can try clustering together with the Monte Carlo search. Based on @PaulWAyers's comments above, we can try a clustering algorithm that can provide probability information (soft clustering) and then take a few members out of each cluster to form a new sample set. Then we use the Monte Carlo search to improve the diversity.
@RichRick1
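The Monte Carlo refinement step above could look something like this sketch: start from a cluster-derived subset and propose random swaps, keeping those that improve a simple diversity score (here, the subset's minimum pairwise distance; all names are hypothetical):

```python
import numpy as np

def mc_diversity_search(dist, subset, n_steps=500, rng=None):
    """Greedy Monte Carlo search: randomly swap a selected point for an
    unselected one; accept the swap only if the subset's minimum
    pairwise distance strictly increases."""
    rng = np.random.default_rng(rng)
    subset = list(subset)
    rest = [i for i in range(len(dist)) if i not in subset]

    def min_pairwise(sel):
        sub = dist[np.ix_(sel, sel)]
        return sub[np.triu_indices(len(sel), k=1)].min()

    best = min_pairwise(subset)
    for _ in range(n_steps):
        i = int(rng.integers(len(subset)))
        j = int(rng.integers(len(rest)))
        trial = subset.copy()
        trial[i] = rest[j]
        if min_pairwise(trial) > best:
            subset[i], rest[j] = rest[j], subset[i]
            best = min_pairwise(subset)
    return subset, best
```

A real implementation would probably use simulated annealing (accepting some worsening moves) rather than this purely greedy acceptance rule.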
from diverseselector.
An article I was trying to remember but (finally) found is:
https://pubs-acs-org.libaccess.lib.mcmaster.ca/doi/10.1021/ci500749q
https://europepmc.org/article/med/25594586
This basically uses the maximin algorithm (the dissimilarity-based selection) but adds a grid to make it scale linearly.
The directed sphere exclusion method is a sphere-of-exclusion variant that seems (slightly) better than the most direct approach. It's implemented in the `chemalot` code (but not in Python):
https://github.com/chemalot/chemalot
I'm starting to feel like we probably have enough algorithms here, unless there is something really amazing out there. But having:
- brute strength dissimilarity (expensive; prone to select outliers)
- dissimilarity maximization with a grid (not perfectly optimal; less prone to select outliers)
- directed sphere exclusion (cheaper; less prone to outliers)
- clustering+selection
The first papers above suggest that when you have values of the target property, you can sample (even with stratification) within boxes/clusters to ensure diverse samples for the target property, or to select a library that is good for the target property.
This would give us three or four families of methods. With that much coverage, the bigger determinant of performance is likely to be how good our metric for selecting molecules is... which is a never-ending task, where we mostly need to support a wide variety of metrics.
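For concreteness, a minimal sketch of the first family, the brute-strength maximin selector, assuming a precomputed distance matrix (the function name is illustrative, not the package API):

```python
import numpy as np

def maximin_select(dist, n_select, start=0):
    """Brute-strength dissimilarity (maximin) selection: repeatedly add
    the point whose minimum distance to the already-selected set is
    largest. O(n) work per added point given the full distance matrix."""
    selected = [start]
    min_dist = dist[start].copy()  # distance of every point to the selected set
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected
```

This makes the outlier-seeking behavior discussed above easy to see: the first picks are always the most extreme points.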
from diverseselector.
The brute-strength dissimilarity method has been implemented in https://rdrr.io/cran/caret/man/maxDissim.html and the original paper is Willett, P. (1999), "Dissimilarity-Based Algorithms for Selecting Structurally Diverse Sets of Compounds," Journal of Computational Biology, 6, 447–457.
from diverseselector.
Using the Gini coefficient to measure chemical diversity is a great idea! @RichRick1
This paper explains more details: Journal of Computational Chemistry 2016, 37, 2091–2097. @PaulWAyers
from diverseselector.
So far, I have seen the following ways to compute dataset diversity:
- total diversity column, Eq. 4 on page 843 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the first method in this paper)
- Eq. 9 on page 844 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the second method in this paper)
- Gini coefficient from Journal of Computational Chemistry 2016, 37, 2091–2097
- entropy from Eq. 9 on page 4 of Leguy et al., J. Cheminform. (2021) 13:76, with an implementation at https://github.com/jules-leguy/EvoMol/blob/master/evomol/evaluation_entropy.py
- log-determinant function from Sci. Rep. 12: 1124 (2022)
- Wasserstein distance for property-based evaluation of diversity, from Sci. Rep. 12: 1124 (2022), with code available at https://github.com/tomotomonakanaka/SUBMO
- Explicit Diversity Index from J. Chem. Inf. Model. 2006, 46, 5, 1898–1904
I will add more as I read.
@PaulWAyers @RichRick1
Update: fixed wrong link
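As a concrete instance of one of the measures above, here is a simplified sketch of a Shannon-entropy diversity over binary fingerprints (in the spirit of the entropy measure cited; the paper's exact formula may differ):

```python
import numpy as np

def fingerprint_entropy(fps):
    """Mean per-bit Shannon entropy of binary fingerprints.

    fps: (n_mols, n_bits) 0/1 array. Each bit contributes
    H(p) = -p log2(p) - (1-p) log2(1-p), where p is the fraction of
    molecules with that bit set; constant bits contribute 0."""
    p = np.asarray(fps, dtype=float).mean(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    # 0*log(0) produces NaN; by convention it contributes 0
    return float(np.nan_to_num(h).mean())
```

A maximally diverse set (each bit set in half the molecules) scores 1.0; a set of identical fingerprints scores 0.0.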
from diverseselector.
The Gini coefficient works primarily for binary fingerprints, I think. Entropy could be made to work for more general options, but it is probably easiest for a bit-string or another case where there is a discrete number of states (or a known distribution of values) for each element of the molecular descriptor vector. I'm not sure how the Gini coefficient could be used if only a (dis)similarity matrix were available (ditto for the entropy, though). But that's a case where some sort of determinantal measure could work...
The links Fanwang had above seem wrong; try below instead.
https://pubs-acs-org.libaccess.lib.mcmaster.ca/doi/10.1021/ci9700337
Equations 12 and 13 in this paper are also sensible.
There may be some other good things here too.
from diverseselector.
moved to issue #7
from diverseselector.
Recommendation: split this into separate issues for similarity/dissimilarity and for diversity computation. This amounts to moving some key info to issue #7.
from diverseselector.
@FarnazH pointed out that it would be very useful to have a very simple API for evaluating the diversity of a subset.
from diverseselector.
We should implement determinantal point processes as a diversity measure (using the Gramian if features are given; using the distance matrix or one of its implied quantities otherwise). There are greedy algorithms for optimizing the determinantal point process, and it is a very good selection method.
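A minimal sketch of such a greedy algorithm, assuming features are given and using the Gramian as the DPP kernel (a naive O(k·n) loop over log-determinants; production code would use rank-one updates):

```python
import numpy as np

def greedy_dpp_select(X, k, jitter=1e-10):
    """Greedy MAP approximation for a determinantal point process:
    at each step, add the item that maximizes log det of the selected
    kernel submatrix, where the kernel is the Gramian L = X X^T."""
    n = len(X)
    L = np.asarray(X, dtype=float) @ np.asarray(X, dtype=float).T
    L += jitter * np.eye(n)  # regularize so submatrices stay positive definite
    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected
```

Near-duplicate rows make the submatrix nearly singular (tiny determinant), so the greedy step naturally avoids them, which is exactly the diversity behavior we want.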
from diverseselector.
> @FarnazH pointed out that it would be very useful to have a very simple API for evaluating the diversity of a subset.
This is good to have and the original plan was set to fix this soon. Thanks for the advice.
from diverseselector.