
Comments (28)

PaulWAyers commented on August 23, 2024

It seems that all the calculators on riskcalc.org from the Cleveland Clinic can be accessed in a free-and-open-source way.
https://github.com/orgs/ClevelandClinicQHS/repositories?type=all

There is a lot of data also on risk factors vs incidence at
https://github.com/kritikaparmar-programmer/HealthCheck

from diverseselector.

PaulWAyers commented on August 23, 2024

For the breast cancer dataset:

  1. Can you fit an SVM on a (random) subset of the data? How does the error increase as the quantity of data decreases? Assume, for now, random sampling.
  2. Can you get away with less data if you use diverse sampling?
  3. If you sub-sample the data with a bias (e.g., reject data where the tumor has a large perimeter or large area, using Boltzmann-like screening of the data), can you still fit an SVM? Does the fit work better with a "diverse" sample or a "random" sample?
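A minimal sketch of the Boltzmann-like screening idea in item 3 (all names and numbers below are illustrative stand-ins, not taken from the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the tumor "perimeter" feature of each record.
perimeter = rng.normal(loc=90.0, scale=25.0, size=5000)

# Boltzmann-like screening: records with a large perimeter are kept
# with exponentially decreasing probability.
beta = 0.02
p_keep = np.exp(-beta * (perimeter - perimeter.min()))
kept = perimeter[p_keep >= rng.random(perimeter.size)]

# The surviving subset is biased toward small perimeters; an SVM fit on it
# would then be compared for "diverse" vs. "random" sub-selection.
```

The same screening would be applied to whole feature vectors before fitting; only the acceptance probability depends on the biasing feature.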


xychem commented on August 23, 2024

Sorry for the late reply; I've been very busy these days. I uploaded a folder to the notebook. There is still something wrong, which I'm fixing.


PaulWAyers commented on August 23, 2024

The idea is to find an explicit function in the literature,

$$ f(x_1,x_2,\ldots,x_n) $$

which we will then sample. However, we won't sample this uniformly, but rather nonuniformly. (E.g., random numbers but not uniformly distributed.) We'll then see how well diverse-selector works versus random selection.
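A minimal sketch of this idea (the function below is a hypothetical stand-in for whatever explicit f is taken from the literature, and the Beta distribution is just one convenient way to sample nonuniformly):

```python
import numpy as np

# Hypothetical stand-in for an explicit function f(x1, ..., xn) from the literature.
def f(x):
    return np.sin(x).sum(axis=-1) + 0.5 * (x ** 2).sum(axis=-1)

rng = np.random.default_rng(0)

# Uniform sample of the domain [0, 1]^3, for comparison...
uniform = rng.random((1000, 3))

# ...versus a nonuniform sample: Beta(2, 5) concentrates points near 0.
nonuniform = rng.beta(2.0, 5.0, size=(1000, 3))

y_uniform = f(uniform)        # labels for the uniform design
y_nonuniform = f(nonuniform)  # labels for the biased design
```

Models fit to the biased design can then be compared when training points are chosen by diverse selection versus random selection.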


PaulWAyers commented on August 23, 2024

Another possibility is cancer risk exposure.
https://www.thelancet.com/cms/10.1016/S0140-6736(22)01438-6/attachment/e14dc624-4fe9-4ce5-8736-bb17be93d0f3/mmc1.pdf
https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(22)01438-6/fulltext#sec1


PaulWAyers commented on August 23, 2024

We could also use actuarial tables.

There are some online cancer-risk calculators.
https://knowyourchances.cancer.gov/big_picture_charts.html
https://www.calculators.org/health/cancer.php
https://www.sciencedirect.com/science/article/pii/S0090429505010071
https://www.mskcc.org/nomograms
https://riskcalc.org/


PaulWAyers commented on August 23, 2024

Some GitHub repositories have calculators.
https://github.com/raghav103/Lung_Cancer_Predictor
https://github.com/PanduDcau/Lung-Cancer-Project
https://github.com/ToshY/pca-riskcalculator
https://github.com/advikmaniar/ML-Healthcare-Web-App/tree/main
https://github.com/videntity/python-framingham10yr/tree/master
https://github.com/Jean-njoroge/Breast-cancer-risk-prediction

@xychem will look at these and decide which is best for generating data. The main criteria are that it is favorable to

  • have lots of descriptors/dimensions.
  • have an easy model that we can run to generate lots of data (beyond what is actually in the training data).
  • have more training data.
  • have more stars/forks on the repository. This suggests one of the last three is best, probably.

I'd prefer the last one if it meets all the other criteria. However, it may be better to use the next-to-last one as it is a simple function that we can use. (It's the easiest choice.) The easiest way to sample nonuniformly would be to pick a variable (say, age) and sample it very inhomogeneously. Then our predictions should be bad unless we sample with diversity.


PaulWAyers commented on August 23, 2024

Consider the 4-well potential in the attached papers.

  • Sample using random numbers in the range $-3 \le R_1 \le 3$ and $-3 \le R_2 \le 3$. Select a huge number of points (~1e6).
  • Sample using Boltzmann sampling. For a given (inverse) temperature, keep each point with probability $\exp[-\beta (E - E_0)]$, where $E_0$ is the energy of the lowest-energy structure, 1.780 (see Table 1 in the paper). Since the highest reaction barrier is about 1.5 units above the lowest energy, it's reasonable to consider $\beta = 1.5^{-k}$, where k = 0 is a random sample (so the screening is unnecessary) and k = 1, k = 2, k = 3, etc. are increasingly selective samples.

For a given number of samples, $S$, do the following three things:

  1. Select $S$ points from the randomly generated data. ($\beta = 0$)
  2. Select $S$ points from Boltzmann sampling with different $\beta$. These $S$ points are chosen at random, because the Boltzmann sample already screened based on energy.
  3. Select the $S$ points with lowest energy. ($\beta \rightarrow \infty$)

Try to fit the potential energy curve using a method like Gaussian processes. Measure the error based on a regular grid of points in the range $-3 \le R_1 \le 3$ and $-3 \le R_2 \le 3$. I'd generate 100 points in each direction (so 10,000 total points) and keep track of

  1. Mean absolute error.
  2. Root-mean-square error.
  3. Maximum absolute error.
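The evaluation grid and the three error metrics above could be sketched like this (the predictions are assumed to come from whatever model was fit, e.g. a Gaussian process):

```python
import numpy as np

# Regular 100 x 100 grid over -3 <= R1, R2 <= 3 (10,000 evaluation points).
axis = np.linspace(-3.0, 3.0, 100)
R1, R2 = np.meshgrid(axis, axis)
grid = np.column_stack([R1.ravel(), R2.ravel()])

def error_metrics(y_true, y_pred):
    """Mean absolute error, root-mean-square error, and maximum absolute error."""
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return abs_err.mean(), np.sqrt((abs_err ** 2).mean()), abs_err.max()
```

Tracking all three matters here because the hypothesis is specifically about outliers, which show up in the maximum error long before they move the MAE.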

The hypothesis is that the errors get worse when biased sampling is performed, but also that diverse selection helps "cure" some of the problems due to biased sampling. So if we randomly sample $S$ points from the Boltzmann sample, it should be worse than if we diversely selected $S$ points from the Boltzmann sample. This will be especially helpful for reducing outliers.


xychem commented on August 23, 2024

@PaulWAyers @FanwangM @FarnazH

There are some crude results on my computer. It seems that:

  1. Random sampling is better when k is smaller, whether measured by the maximum error, mean squared error, or mean absolute error.
  2. The mean squared error and mean absolute error decay slowly but behave better than the maximum error; the maximum error does not seem to converge when k is larger, even at the largest sample size of 200.
  3. The OptiSim method has a smaller maximum error than random sampling, and the maximum error of OptiSim converges in every case.

Questions:

  1. Boltzmann sampling: I randomly choose 1e+6 points and calculate their probabilities, then choose 1e+3 points according to those probabilities and iterate the procedure to get 1e+6 points. This is one of the reasons my calculation is so large. I don't know a more economical way to do the Boltzmann sampling, so I need your help.
  2. A Compute Canada account: I need a Compute Canada account. The jobs are large when the sample size is over 100 and run slowly on my computer, so I need to submit them to servers. Fanwang taught me how to submit jobs, and I have booked the SHARCNET New User / Refresher Webinar for Compute Canada tomorrow.

The maximum error with different k by random sampling:

image

The mean absolute error with different k by random sampling:

image

The mean squared error with different k by random sampling:

image


The comparison between OptiSim sampling and random sampling for k = 0, 10, 20:

image

image


PaulWAyers commented on August 23, 2024

I'd look at a smaller step in k. k=20 is very aggressive screening.

To sample with Boltzmann probability, you just compute the probability, which is $e^{-\beta(E-E_0)}$, where $E_0 = 1.780$ according to the paper. This gives every point a probability, $0 \le p \le 1$. Then for each point you generate a random number, $r$, between zero and one, and accept the result only if $p \ge r$.
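A vectorized sketch of this accept/reject step (the energies are assumed to be precomputed from the 4-well potential; `E0 = 1.780` is from the paper):

```python
import numpy as np

E0 = 1.780  # energy of the lowest-energy structure (Table 1 of the paper)

def boltzmann_screen(points, energies, beta, rng):
    """Keep each point with probability exp(-beta * (E - E0))."""
    p = np.exp(-beta * (np.asarray(energies) - E0))  # 0 <= p <= 1 when E >= E0
    r = rng.random(len(points))                      # one random number per point
    return points[p >= r]
```

With `beta = 0`, every point has `p = 1` and survives, which recovers the random sample.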

The plots you have are very jumpy, because you need to find a way to make multiple samples of the same size. That's easy for random sampling but hard for most of the methods in DiverseSelector, because (by default) they start with the medoid.

  1. The basic idea is to first make a huge sample. I recommended 1e6 points, but you may need more. 1e7 will let you consider a larger value of k, for example.
  2. Make a Boltzmann sample for a given $k \cdot \beta$. $k = 0$ corresponds to a random sample. Also consider the 1000 points with lowest energy; this corresponds to $k \rightarrow \infty$. At the end of step 2, you have a sample for each value of k. When the size of the Boltzmann sample is less than 1e3, the sample is too small and cannot be used.
  3. Construct sub-samples. For each value of $k$, choose a random sub-sample of 1e3 points. Do this repeatedly (perhaps 10 times), so that you have 10 sub-samples for each $k$. By averaging your results over these sub-samples, you'll help smooth over the "bumps" in the curves.
  4. Use random selection, OptiSim, and other algorithms to select $S$ points.
  5. Fit using Gaussian Process Regression (or similar).
  6. Compute mean-absolute error, root-mean-square error, and maximum absolute error for the regular grid of points from $-3.0 \le x,y \le 3.0$.
  7. Average these errors over all of the subsamples from step 3 and plot the result.
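The sub-sampling and averaging in steps 3 and 7 might be sketched like this (the `select` and `fit_and_score` callables are placeholders for a DiverseSelector method and the GPR-fit-plus-grid-error step, respectively):

```python
import numpy as np

def averaged_errors(sample, select, fit_and_score,
                    n_subsamples=10, subsample_size=1000, rng=None):
    """Average (MAE, RMSE, max error) over repeated random sub-samples."""
    rng = rng if rng is not None else np.random.default_rng()
    scores = []
    for _ in range(n_subsamples):
        # Step 3: draw a random sub-sample from the Boltzmann sample.
        idx = rng.choice(len(sample), size=subsample_size, replace=False)
        # Steps 4-6: select S points, fit, and score on the regular grid.
        scores.append(fit_and_score(select(sample[idx])))
    # Step 7: average the error metrics over the sub-samples.
    return np.mean(scores, axis=0)
```

Averaging over sub-samples is what smooths the error-vs.-S curves, since any single sub-sample gives a noisy estimate.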

I hope this will give smoother plots of error vs. $S$, and show more decisive preferences for various sampling methods. My hypothesis would be that for $k=0$ (random sampling), random selection works well. But as $k$ gets larger, I expect it to be more important to choose diverse selection methods in step 4.


xychem commented on August 23, 2024

I used the Boltzmann sampling method you described, but it takes a long time to get 1e7 points (more than several hours when k is large, so I killed the job). Maybe I need a CC (Compute Canada) account.

There is also a question about the sign convention for k. If $p=\exp(-1.5^{-k} (E-E_0))$, then as $k\to\infty$ we get $p \to 1$, which represents random sampling. The code I wrote previously uses $\beta = 1.5^k$, so the database becomes more biased as k increases; however, when k is zero the sample is still a Boltzmann sample. The previous results are still right because I used random sampling when k=0.


New code for the Boltzmann sampling:

import numpy as np

# Boltzmann sampling by accept/reject: keep a uniformly random point
# with probability exp(-beta * (E - E_0)), where beta = 1.5**k.
def boltzmann_sample(k, sample_number):
    '''
        k : int
            the exponent in beta = 1.5**k

        sample_number : int
            the number of points to collect
    '''
    E_0 = 1.780                     # the minimum energy (Table 1 of the paper)
    rng = np.random.default_rng()   # create the generator once, outside the loop
    sample_list = []

    while len(sample_list) < sample_number:
        point = (rng.random(2) - 0.5) * 6               # uniform random point in [-3, 3]^2
        energy = potential_energy(point[0], point[1])
        p = np.exp(-(1.5 ** k) * (energy - E_0))        # acceptance probability
        if p >= rng.random():       # accept the point with probability p
            sample_list.append(point)

    return np.array(sample_list)    # shape (sample_number, 2)


PaulWAyers commented on August 23, 2024

There are instructions on how to make a Compute Canada account in the BootCamp repository that I think I shared with you when you started. It's good to use Compute Canada. Ideally, you can generate data for a day or two; more data is a good thing.

$k=0$ is random sampling, corresponding to infinite temperature ($\beta = 0$).
As $k$ gets larger, we want the temperature to get lower ($\beta$ to get larger). So we want

$$ p = e^{-\beta (E- E_0)} $$

and $\beta = 1.5^k$ is good. Sorry for making a typo there. Do be careful that there is a negative sign in the exponential, so that the probability factor is always between zero and one.


xychem commented on August 23, 2024
  1. So I should use $p=e^{-k\beta(E-E_0)}$ rather than $p=e^{-\beta(E-E_0)}$? The former gives a random sample when k=0.
  2. It seems that I need your CCRI to register a Compute Canada account.


PaulWAyers commented on August 23, 2024

The CCRI is in the bootcamp.


PaulWAyers commented on August 23, 2024

If you fix $\beta = 1.5$ then use $p=e^{-k \beta (E-E_0)}$.


PaulWAyers commented on August 23, 2024

I realized there is a simpler way to do the Boltzmann sampling than described previously. See
#144 (comment)

  1. The basic idea is to first make a huge sample. I recommended 1e6 points, but you may need more. 1e7 will let you consider a larger value of k, for example.
  2. Make several (perhaps 10) Boltzmann samples for a given $k \cdot \beta$. $k = 0$ corresponds to a random sample. Also consider the 1000 points with lowest energy; this corresponds to $k \rightarrow \infty$. At the end of step 2, you have a set of samples for each value of k. When the size of any of the Boltzmann samples for a given $k$ is less than 1e3, the samples are too small for that value of $k$, and it cannot be used. See the note below on creating independent Boltzmann samples.
  3. Use random selection, OptiSim, and other algorithms to select $S$ points.
  4. Fit using Gaussian Process Regression (or similar).
  5. Compute mean-absolute error, root-mean-square error, and maximum absolute error for the regular grid of points from $-3.0 \le x,y \le 3.0$.
  6. Average these errors over all of the Boltzmann samples from step 2 and plot the result.

Generating Independent Samples with Boltzmann Probability for a given $\beta$: First just compute the probability, which is $e^{-\beta(E-E_0)}$, where $E_0 = 1.780$ according to the paper. This gives every point a probability, $0 \le p \le 1$. Suppose we want 10 independent Boltzmann samples. Then for each point, generate 10 random numbers, $r_0, r_1, \ldots r_9$, between zero and one. The point is in the $l$-th Boltzmann sample when $p \ge r_l$.
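A vectorized sketch of this trick, under the same assumption that the energies of the pooled points are precomputed:

```python
import numpy as np

def independent_boltzmann_samples(points, energies, beta,
                                  n_samples=10, E0=1.780, rng=None):
    """Build n_samples independent Boltzmann samples from one pool of points."""
    rng = rng if rng is not None else np.random.default_rng()
    p = np.exp(-beta * (np.asarray(energies) - E0))  # probability per point
    r = rng.random((len(points), n_samples))         # n_samples numbers per point
    membership = p[:, None] >= r                     # point is in sample l if p >= r_l
    return [points[membership[:, l]] for l in range(n_samples)]
```

The samples overlap (a low-energy point will usually belong to all of them) but are statistically independent draws from the same Boltzmann distribution.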


xychem commented on August 23, 2024

I'm sorry for misunderstanding what you wrote before. What I did before was to choose a point by random sampling, calculate its probability, and select it if its probability is larger than a random number. By iterating this procedure I built a biased database. It was so expensive because I wanted the Boltzmann-sampled database itself to contain 1e7 points.


What I need to do is:

  1. use random sampling to generate a database $D$ of 1e6–1e7 points
  2. apply Boltzmann sampling with different k to the database $D$ to get the sub-database $D_s'$
  3. randomly sample 1e3 points from the sub-database $D_s'$ 10 or more times to get the sub-sub-databases $D_{ss}''$
  4. use random selection, OptiSim, and other algorithms to select $S$ points from the sub-sub-databases $D_{ss}''$
  5. fit the PES with a Gaussian process
  6. compute the mean absolute error, root-mean-square error, and maximum absolute error on a regular grid from -3 to 3 with 1e4 points (1e2 per axis)
  7. average these errors to get smoother plots

Parameter details:

With $\beta = 1.5$, use $k \cdot \beta$ as the Boltzmann parameter (when k equals 0, the Boltzmann sampling reduces to random sampling).


xychem commented on August 23, 2024
  1. I have changed the code and obtained some results, using 30 rounds of random sampling to get smoother plots; the number of selected points ranges over range(1, 300, 2). It is hard to run larger samples and more iterations on my computer, and I submitted my Compute Canada application last week.
  2. A strange thing is that when k = np.inf, the error increases with the sample size. Something must be wrong with my code, so I'm checking it.

image
image
image


strange plot:
image
image
image


PaulWAyers commented on August 23, 2024

It will be helpful to explain exactly what these tests are doing. Write your procedure in your own words. Also the label "sample number" should be "number of data points" or something like that. ("Sample number" sounds like you are comparing different (but equivalent) samples.)

Have you tried non-random-sampling?

What's the difference between the first set of plots and the second set of plots? The first set looks (more-or-less) like I expect.


PaulWAyers commented on August 23, 2024

Oh, I think I understand. The plots are the same, but the last one has infinity.

Keep in mind that infinity doesn't actually work in your code; you need to just take the n points with lowest energy. You expect very bad results from this strategy, so it doesn't surprise me if the numbers are bad. But this is deterministic so there shouldn't be ups-and-downs I think.

However, the k=8 case probably should look similar, though perhaps you need a larger value of k to see it. Was k=9 impossible because the sample was too small (there weren't 300 points left to use)?


xychem commented on August 23, 2024
  1. Sample number is the number of selected points S in step 4.
  2. Yes, the two plots are the same, but the last one includes the result for k=np.inf. I am doing the non-random sampling now.
  3. Yes, I agree that the error for np.inf should be large or non-convergent, but it shouldn't increase with the sample size. It turned out I chose a wrong interval when optimizing the hyper-parameters of the Gaussian process, so the fit is terrible when k is large.
  4. No; actually, even when k=np.inf we have 2130 usable points (I chose 1e7 points in the first step). I chose 300 points before because of the large computational cost. Is it appropriate to choose k from 0 to 20 with an interval of 2 (which gives 10 different values of k)?

New plots with random sampling:
image
image
image


xychem commented on August 23, 2024

I tested MaxMin, OptiSim, and random sampling at k=20. I think that when k=20 the database is extremely biased, so the maximum error of the random-sampling method is similar to the others. I will try k=10, where I think the diverse selectors will behave better than random sampling.


image
image
image


PaulWAyers commented on August 23, 2024

"k=np.inf , we also have 2130 points" is very strange. Do all these points have exactly the same energy?

The plots look right! Our impression was correct: when the sampling is very biased, using diverse sampling is really helpful!

I think we are basically in good shape now. We just need to make pretty plots to explain the story, and polish the notebook.

Can we perform more samples? In step 2 of the procedure, maybe we can choose more Boltzmann samples. I think that will make the curves a bit smoother; just going to 20 or 25 samples might smooth things a lot more.

We can probably consider just k=0,2,4,8,16. I think that already the k=16 is extremely biased; it seems that it has essentially no data in the high-density regions. Just making the same plots as you did above, again, for these values of k will give a lot of insight!

Also, we are computing errors on the grid, correct, not on the training data (as described in step 5)?


PaulWAyers commented on August 23, 2024

@FarnazH and @ramirandaq do you have any thoughts?

My only hesitation (but maybe k=20 is just far too severe!) is that the maximum error is still really big...


xychem commented on August 23, 2024
  1. Actually it is not just a single point; the points are centred on the minimum-energy point (1.40, 1.78). To briefly summarize my work: first I chose 1e7 points for the Boltzmann sampling and obtained databases of different sizes for different k (when k is smaller, the database is larger). Then I randomly sampled 1000 points several times (30 times) to average the error, as you mention in step 3. Finally I tested the error of the different sampling methods (random, MaxMin, OptiSim) with different numbers of data points S. More details are in my previous comment.
  2. I did 30 sampling rounds before; I think the curves look so fluctuating because the change in the error is small (just 0.0002). I will increase the number of rounds to 40 or 50.
  3. I don't use the training data, but some training data may appear in my test data. I test the error on a uniform grid over (-3, 3) with 100 points in each dimension, so the interval between two points is 6/100 = 0.06 and I get 1e4 points in two dimensions.
  4. I'm testing the different sampling methods with different k. I think we will get more information this week.

The attachment is the points distribution with different k before random sampling.
image
image
image
image
image
image
image


PaulWAyers commented on August 23, 2024

Are these distribution plots showing all of the selected points, or the points after sub-sampling 1000?

I'm curious what k=8 looks like. I'm curious whether there are still points in all 4 wells.

Based on these, I feel like using k=0, 2.5, 5, and 10 may be perfect for this study.

Also, we may want to select more points. How many points do we get with k=14? We might use a number of points close to that for the k-fold sampling (step 2). Little jumps will always be there, and will be a little bit less apparent if we (diversely/randomly) select points at larger intervals (e.g., steps of 5 or 10 points on the x axis) in figures like
#144 (comment)


FarnazH commented on August 23, 2024

@xychem, I couldn't find your example notebook. Can you please share a link here asap? Preferably you can make a branch for this issue, or just make a PR.


FarnazH commented on August 23, 2024

(1) Minyaev, R. M.; Quapp, W.; Subramanian, G.; Schleyer, P. von R.; Mo, Y. Internal Conrotation and Disrotation in H2BCH2BH2 and Diborylmethane 1,3 H Exchange. Journal of Computational Chemistry 1997, 18 (14), 1792–1803.

