Comments (16)
Determinantal point processes seem really cool here; they maximize the volume of the parallelepiped spanned by the selected data points, and in that sense ensure that the "interpolation regime" is as big as possible.
from diverseselector.
Notes on Wasserstein Distance to Uniform Distribution (for @Khaleeh)
Wasserstein-Dist-to-Uniform-Dist.pdf
Wasserstein-Dist-to-Uniform-Dist.md
from diverseselector.
Note that we should also support p-sum versions (not just the minimum, p = -infinity, and the sum, p = 1) of the brute-strength algorithm, so that a balanced representation can be achieved.
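A sketch of what the p-sum generalization could look like when scoring a candidate's distances to the already-selected set (the function name and signature are hypothetical, not from the codebase):

```python
import numpy as np

def p_sum_score(dists, p):
    """Generalized p-mean of a candidate's distances to the selected set.

    p = 1 recovers the sum criterion (up to a constant scaling),
    p -> -infinity recovers the minimum (maximin) criterion,
    and intermediate p balances the two. Assumes strictly positive
    distances when p < 0 (zeros would blow up the negative power).
    """
    dists = np.asarray(dists, dtype=float)
    if np.isneginf(p):
        return dists.min()
    return np.mean(dists ** p) ** (1.0 / p)
```

For example, for distances `[1, 4]` the score interpolates between 1 (p = -inf) and 2.5 (p = 1).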
from diverseselector.
Found this paper really easy to follow; it describes a couple of algorithms (clustering and Dissimilarity-Based Compound Selection) that might be relatively easy to implement at the beginning. Also, Dissimilarity-Based Compound Selection provides a decent metric, while the quality of the clustering approach can be assessed with the Gini index / entropy. (Screenshots of the selection and clustering algorithms were attached here.)
So, overall steps might be:
- Build (or let the user input) the dissimilarity matrix as a np.array / sparse array / .txt file (?).
- Use the proposed Dissimilarity-Based Compound Selection, which doesn't require a lot of computational power and/or resources. Another choice at this stage might be a clustering algorithm (hierarchical (an application can be found in this paper), k-means, DBSCAN, etc.) and then just select the required number of samples from each cluster.
- Use the Gini index / entropy to decide how diverse the dataset is.
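A minimal sketch of the first step, building the dissimilarity matrix from a feature array (pure NumPy; a user-supplied matrix would skip this entirely):

```python
import numpy as np

def dissimilarity_matrix(X):
    """Pairwise Euclidean distances for an (n_samples, n_features) array,
    computed via the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # clip tiny negative values caused by floating-point round-off
    return np.sqrt(np.maximum(d2, 0.0))
```

For example, `dissimilarity_matrix([[0, 0], [3, 4]])` gives the 2x2 matrix with off-diagonal entries 5.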
Potential problems might be:
- How to select the number of clusters for the clustering approach (can be done using the elbow method and/or the silhouette coefficient)
- Clustering algorithms in general don't work great, so maybe we need to compare the output against a random-sampling baseline with some threshold
- How to deal with imbalanced datasets
@fwmeng88 please let me know what you think about this.
from diverseselector.
I think these are two great brute-strength methods for generating samples. Selecting the right number of clusters is probably best left to the user; a default option could be to set the number of clusters equal to the desired sample size. A simple revision would be to take `n_members` elements from each of `n_clusters` clusters to generate `n_members*n_clusters` samples, where any deficit (because a cluster doesn't have enough members) is resolved in some way (random sampling or dissimilarity-based selection).
I think some of the other methods are more efficient, but for most molecular databases these methods (very robust; the only potential issue is cost) should be extremely reliable.
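The cluster-then-select idea with deficit handling can be sketched as follows, assuming cluster labels have already been computed by whatever clustering algorithm is chosen (the function name is hypothetical):

```python
import numpy as np

def select_from_clusters(labels, n_members, rng=None):
    """Pick up to n_members indices from each cluster; any deficit
    (clusters smaller than n_members) is topped up by random sampling
    from the not-yet-selected remainder, as suggested above."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    selected, deficit = [], 0
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        take = min(n_members, members.size)
        selected.extend(rng.choice(members, size=take, replace=False))
        deficit += n_members - take
    if deficit:
        pool = np.setdiff1d(np.arange(labels.size), selected)
        selected.extend(rng.choice(pool, size=min(deficit, pool.size), replace=False))
    return np.sort(np.asarray(selected))
```

Dissimilarity-based top-up (instead of random) would only change the final `rng.choice` over `pool`.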
from diverseselector.
This paper is very interesting. Thanks for proposing the idea. @RichRick1
DBCS can be among the very first methods we implement. But we may need to think about its tendency to select outliers, which relates to the class-imbalance problem you mentioned. These two problems are not identical, but closely related. An alternative is the sphere-of-exclusion algorithm (J. Chem. Inf. Comput. Sci. 1995, 35, 1, 59–67 and ACS Chem. Biol. 2011, 6, 3, 208–217).
Another possibility is that we can try clustering together with the Monte Carlo search. Based on @PaulWAyers's comments above, we can try a clustering algorithm that can provide probability information (soft clustering) and then take a few members out of each cluster to form a new sample set. Then we use the Monte Carlo search to improve the diversity.
@RichRick1
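The Monte Carlo refinement step above could look something like this sketch: start from a cluster-derived subset and propose random swaps, keeping those that improve a simple diversity score (here, the subset's minimum pairwise distance; all names are hypothetical):

```python
import numpy as np

def mc_diversity_search(dist, subset, n_steps=500, rng=None):
    """Greedy Monte Carlo search: randomly swap a selected point for an
    unselected one; accept the swap only if the subset's minimum
    pairwise distance strictly increases."""
    rng = np.random.default_rng(rng)
    subset = list(subset)
    rest = [i for i in range(len(dist)) if i not in subset]

    def min_pairwise(sel):
        sub = dist[np.ix_(sel, sel)]
        return sub[np.triu_indices(len(sel), k=1)].min()

    best = min_pairwise(subset)
    for _ in range(n_steps):
        i = int(rng.integers(len(subset)))
        j = int(rng.integers(len(rest)))
        trial = subset.copy()
        trial[i] = rest[j]
        if min_pairwise(trial) > best:
            subset[i], rest[j] = rest[j], subset[i]
            best = min_pairwise(subset)
    return subset, best
```

A real implementation would probably use simulated annealing (accepting some worsening moves) rather than this purely greedy acceptance rule.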
from diverseselector.
An article I was trying to remember but (finally) found is:
https://pubs-acs-org.libaccess.lib.mcmaster.ca/doi/10.1021/ci500749q
https://europepmc.org/article/med/25594586
This basically uses the maximin algorithm (the dissimilarity-based selection) but adds a grid to make it scale linearly.
The directed sphere exclusion method is a sphere-of-exclusion variant that seems (slightly) better than the most direct approach. It's implemented in the `chemalot` code (but not in Python):
https://github.com/chemalot/chemalot
I'm starting to feel like we probably have enough algorithms here, unless there is something really amazing out there. But having:
- brute strength dissimilarity (expensive; prone to select outliers)
- dissimilarity maximization with a grid (not perfectly optimal; less prone to select outliers)
- directed sphere exclusion (cheaper; less prone to outliers)
- clustering+selection
The first papers above suggest that when you have values of the target property, you can sample (even with stratification) within boxes/clusters to ensure diverse samples for the target property, or to select a library that is good for the target property.
This would give us three or four families of methods. With that much coverage, the bigger determinant of performance is likely to be how good our metric for selecting molecules is... which is a never-ending task, where we mostly need to support a wide variety of metrics.
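For concreteness, a minimal sketch of the first family, the brute-strength maximin selector, assuming a precomputed distance matrix (the function name is illustrative, not the package API):

```python
import numpy as np

def maximin_select(dist, n_select, start=0):
    """Brute-strength dissimilarity (maximin) selection: repeatedly add
    the point whose minimum distance to the already-selected set is
    largest. O(n) work per added point given the full distance matrix."""
    selected = [start]
    min_dist = dist[start].copy()  # distance of every point to the selected set
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected
```

This makes the outlier-seeking behavior discussed above easy to see: the first picks are always the most extreme points.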
from diverseselector.
The brute-strength dissimilarity method has been implemented in https://rdrr.io/cran/caret/man/maxDissim.html and the original paper is Willett, P. (1999), "Dissimilarity-Based Algorithms for Selecting Structurally Diverse Sets of Compounds," Journal of Computational Biology, 6, 447–457.
from diverseselector.
Using the Gini coefficient to measure chemical diversity is a great idea! @RichRick1
This paper explains more details: Journal of Computational Chemistry 2016, 37, 2091–2097. @PaulWAyers
from diverseselector.
So far, I have seen the following ways to compute dataset diversity:
- total diversity column, Eq. 4 on page 843 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the first method in this paper)
- Eq. 9 on page 844 of J. Chem. Inf. Comput. Sci. 1997, 37, 841-851 (the second method in this paper)
- Gini coefficient from Journal of Computational Chemistry 2016, 37, 2091–2097
- entropy from Eq. 9 on page 4 of Leguy et al., J. Cheminform. (2021) 13:76, with an implementation at https://github.com/jules-leguy/EvoMol/blob/master/evomol/evaluation_entropy.py
- log-determinant function from Sci. Rep. 12: 1124 (2022)
- Wasserstein distance for property-based evaluation of diversity, from Sci. Rep. 12: 1124 (2022), with code available at https://github.com/tomotomonakanaka/SUBMO
- Explicit Diversity Index from J. Chem. Inf. Model. 2006, 46, 5, 1898–1904
I will add more as I read.
@PaulWAyers @RichRick1
Update: fixed wrong link
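As a concrete instance of one of the measures above, here is a simplified sketch of a Shannon-entropy diversity over binary fingerprints (in the spirit of the entropy measure cited; the paper's exact formula may differ):

```python
import numpy as np

def fingerprint_entropy(fps):
    """Mean per-bit Shannon entropy of binary fingerprints.

    fps: (n_mols, n_bits) 0/1 array. Each bit contributes
    H(p) = -p log2(p) - (1-p) log2(1-p), where p is the fraction of
    molecules with that bit set; constant bits contribute 0."""
    p = np.asarray(fps, dtype=float).mean(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    # 0*log(0) produces NaN; by convention it contributes 0
    return float(np.nan_to_num(h).mean())
```

A maximally diverse set (each bit set in half the molecules) scores 1.0; a set of identical fingerprints scores 0.0.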
from diverseselector.
The Gini coefficient works primarily for binary fingerprints, I think. Entropy could be made to work for more general options, but it is probably easiest for a bit-string or another case where there is a discrete number of states (or a known distribution of values) for each element of the molecular descriptor vector. I'm not sure how the Gini coefficient could be used if only a (dis)similarity matrix were available (ditto for the entropy, though). But that's a case where some sort of determinantal measure could work...
The links Fanwang had above seem wrong; try below instead.
https://pubs-acs-org.libaccess.lib.mcmaster.ca/doi/10.1021/ci9700337
Equations 12 and 13 in this paper are also sensible.
There may be some other good things here too.
from diverseselector.
moved to issue #7
from diverseselector.
Recommendation: split this into separate issues for similarity/dissimilarity and for diversity computation. This amounts to moving some key info to issue #7.
from diverseselector.
@FarnazH pointed out that it would be very useful to have a very simple API for evaluating the diversity of a subset.
from diverseselector.
We should implement determinantal point processes as a diversity measure (using the Gramian if features are given; using the distance matrix or one of its implied quantities otherwise). There are greedy algorithms for optimizing the determinantal point process, and it is a very good selection method.
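A minimal sketch of such a greedy algorithm, assuming features are given and using the Gramian as the DPP kernel (a naive O(k·n) loop over log-determinants; production code would use rank-one updates):

```python
import numpy as np

def greedy_dpp_select(X, k, jitter=1e-10):
    """Greedy MAP approximation for a determinantal point process:
    at each step, add the item that maximizes log det of the selected
    kernel submatrix, where the kernel is the Gramian L = X X^T."""
    n = len(X)
    L = np.asarray(X, dtype=float) @ np.asarray(X, dtype=float).T
    L += jitter * np.eye(n)  # regularize so submatrices stay positive definite
    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected
```

Near-duplicate rows make the submatrix nearly singular (tiny determinant), so the greedy step naturally avoids them, which is exactly the diversity behavior we want.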
from diverseselector.
> @FarnazH pointed out that it would be very useful to have a very simple API for evaluating the diversity of a subset.
This is good to have and the original plan was set to fix this soon. Thanks for the advice.
from diverseselector.