
Comments (14)

yjkimnada commented on August 23, 2024

So I believe I've made a proper pull request.

Here are the NumPy files for train_X, train_Y, test_X, test_Y. The X matrices are neural spike response matrices to image stimuli, while train_Y contains the image stimuli. When I ran my code with Celer's LassoCV, I did not apply any normalization (not sure whether that was a good idea for my data).

For reference, running sklearn's LassoCV resulted in a Pearson R correlation of around 0.97-0.98 between test_Y and the model's output for test_X.

https://fil.email/C8R3XTYw


mathurinm commented on August 23, 2024

Hello @yjkimnada, currently it is indeed not supported, but since we do not overwrite the fit method of LassoCV it should be straightforward to implement: it suffices to add an n_jobs parameter to the LassoCV.__init__() method and to pass it to super(LassoCV, self).__init__() afterwards.

Would you like to give it a try in a PR?


yjkimnada commented on August 23, 2024

So just to confirm, I made the following edits in the dropin_sklearn.py file to default to n_jobs=1 and allow specifying n_jobs=-1:

def __init__(self, eps=1e-3, n_alphas=100, alphas=None,
             fit_intercept=True, normalize=False, max_iter=100,
             tol=1e-4, cv=None, verbose=0, gap_freq=10,
             max_epochs=50000, p0=10, prune=0, precompute='auto',
             positive=False, n_jobs=1):
    # n_jobs is the new parameter: it is forwarded to sklearn's LassoCV,
    # which parallelizes the cross-validation folds over n_jobs processes.
    super(LassoCV, self).__init__(
        eps=eps, n_alphas=n_alphas, alphas=alphas, max_iter=max_iter,
        tol=tol, cv=cv, fit_intercept=fit_intercept, normalize=normalize,
        verbose=verbose, n_jobs=n_jobs)
    self.gap_freq = gap_freq
    self.max_epochs = max_epochs
    self.p0 = p0
    self.prune = prune
    self.positive = positive
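
For completeness, a hypothetical call using the patched class (the cv value is just illustrative; n_jobs=-1 uses all available cores):

import celer

# Hypothetical usage of the patched class: n_jobs=-1 spreads the
# cross-validation folds over all available cores.
clf = celer.LassoCV(n_jobs=-1, cv=5)
clf.fit(train_X, train_Y)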

I did confirm that multiple CPUs were being used simultaneously. However, for some strange reason, I get no results for my regression even after waiting nearly half an hour, whereas the regular sklearn LassoCV finishes within a few minutes under the same CV search constraints. Instead, with Celer, I get a bunch of the following errors:

/home/joon/celer/celer/homotopy.py:222: RuntimeWarning: divide by zero encountered in true_divide
  use_accel=1, tol=tol, prune=prune, positive=positive)
!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 1.46e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 5.62e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 7.93e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 9.89e-03 > 1.00e-04
[... the same messages repeat many times, interleaved across the parallel workers ...]

My input X matrix is 9800-by-104700, so it is not incredibly huge. It is sparse, with only around 9-10% of the entries non-zero, so I pass it in CSC format as recommended. Finally, I did check, via the examples you provide with the leukemia and finance datasets, that Celer outperforms sklearn on those sample codes. Should I somehow be tuning Celer to fit my problem better?


mathurinm commented on August 23, 2024

Yes, the way you coded it is what I was thinking about. Could you please push your code in a pull request?

From your example it seems that celer is having trouble converging (possibly because the prescribed tolerance on the duality gap is too low, e.g. we ask for a precision of 1e-4, which is very small if the optimal value is, say, 1e10 -- we usually normalize the target y to make the setting of tol easier).

Can you share your data so that I can check the code locally?


yjkimnada commented on August 23, 2024

Yes, I’ll make a new pull request and also send you my X and Y data. It’s currently 4 AM here, so I’ll have it sent first thing in the morning.


mathurinm commented on August 23, 2024

The files are a bit heavy, especially X_train: could you save it in npz format? I believe saving it as npy first converted it to dense.

Also, can you check the squared norm of y? It corresponds to the first primal objective, and sklearn uses it to scale the stopping criterion (see here). If it is very large, you can increase tol (e.g. tol = 1e-6 * norm(y) ** 2 / n_samples).
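
For instance, a minimal sketch of what I mean (assuming y is your train_Y array):

import numpy as np
from numpy.linalg import norm

# Check the scale of the first primal objective: ||y||^2 / n_samples.
n_samples = y.shape[0]
print(norm(y) ** 2 / n_samples)

# If this value is large, increase tol accordingly, e.g.:
tol = 1e-6 * norm(y) ** 2 / n_samples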


yjkimnada commented on August 23, 2024
  1. Here is the new link to the compressed .npz file. The names of the 4 arrays stored within are train_X, train_Y, test_X, test_Y. https://fil.email/SMXAZbnk
  2. The Euclidean 2-norm of train_Y (after dividing by n_samples) is 1.35459396 and the squared 2-norm (after dividing by n_samples) is 17982.26316. Which one should I be using, and is there a specific reason why you started with tol = 1e-6 (the default for sklearn and celer is 1e-4)?
  3. Just a side question: when the tolerance is not reached within a regression (for a given tolerance and max_epochs), does celer move on to the next sub-problem or does it keep repeating the current sub-problem (which may explain why I'm getting stuck with no outputs)?


mathurinm commented on August 23, 2024

Thanks for the very clear description of the problem and your involvement in this.
I may have given you the wrong reference: I could not use scipy.sparse.load_npz on the file you sent, and when I used np.load, trying to access the design matrix caused my computer to freeze because train_X was loaded as a dense array.
In the end I managed, but what I initially meant was that you can save your design matrices with sparse.save_npz("train_X.npz", train_X); this way they can be loaded back directly as sparse arrays: train_X = sparse.load_npz("train_X.npz").
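
Concretely, the round-trip I had in mind (assuming train_X is a scipy.sparse matrix):

from scipy import sparse

# Save the design matrix in sparse format ...
sparse.save_npz("train_X.npz", train_X)
# ... and reload it directly as a sparse array, without densifying.
train_X = sparse.load_npz("train_X.npz")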

I am currently fitting a LassoCV with default parameters; it takes a while, but it's running.
The following snippet, where I have increased the tolerance, indeed takes more than 15 minutes to run (note that you can use verbose=True to have a bit more feedback):

import celer
from scipy import sparse
import numpy as np


# I have saved them separately in an earlier stage:
X = sparse.load_npz('X_train.npz')
y = np.load('y_train.npy')

clf2 = celer.LassoCV(verbose=True, tol=1e-2)
clf2.fit(X, y)

There is no clear guideline for the tolerance choice; going for 1e-4 was a way to give results similar to sklearn's by default. But since sklearn scales tol by norm(y) ** 2, the tolerances are in fact not the same in the end.
I suspect that sklearn fits faster because it effectively uses a really high tolerance. In order to "disable" its scaling of tol, can you make y unit-normed (y /= norm(y)) and see if it still fits fast? By the way, on my computer, it has been running for more than 15 minutes now (n_jobs=1).
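
For instance, something along these lines (a minimal sketch, assuming X and y are loaded as above):

from numpy.linalg import norm
from sklearn.linear_model import LassoCV

# Normalize y to unit norm so that sklearn's scaling of tol by ||y||^2
# no longer inflates the effective tolerance.
y_unit = y / norm(y)
clf_sk = LassoCV()
clf_sk.fit(X, y_unit)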

For celer, increasing the tolerance should make it faster (instead of getting stuck endlessly as it does currently). I think we could:

  • raise a more explicit message than !!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04, e.g. suggesting to increase the tolerance.
  • manually increase the tolerance for the user when the solver takes too long to converge (with a possibility to disable this heuristic). Although it is used in other solvers, I was reluctant to do it in the first place, as it is really heuristic.

Note that LassoCV's default parameter eps is 1e-3, meaning that the values of alphas go from alpha_max down to alpha_max / 1000, which is quite low. Setting eps=1e-2 should also speed up the process, with the obvious drawback that it explores a smaller range of values (alpha_max / 100 is already good enough in my experience, but again this is not backed by theory).
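
For reference, a sketch of how the alpha grid scales with eps (my own illustration of the geometric grid used by LassoCV-style estimators, not celer internals; assuming X and y as above):

import numpy as np

# alpha_max is the smallest alpha for which the Lasso solution is all zeros.
alpha_max = np.max(np.abs(X.T @ y)) / X.shape[0]
# The grid goes geometrically from alpha_max down to eps * alpha_max.
eps, n_alphas = 1e-2, 100
alphas = np.geomspace(alpha_max, eps * alpha_max, n_alphas)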

To summarize:

  • I will raise a better error message when the inner solver does not converge, and think of a way to avoid this with heuristics
  • increasing tol and eps should make your code faster, while having limited impact on the results (without guarantee)
  • sklearn fitting in a few minutes is a surprise to me; I am investigating the reasons.


mathurinm commented on August 23, 2024

As pointed out by @josephsalmon, the columns of X have varying norms: around 6k of them are even 0.
Histogram of norms of columns:
[image]

Histogram of norms on support of solution of LassoCV:
[image]

This creates a bias as columns with small norm are more penalized.
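
In case it is useful, this is essentially how the column norms can be checked (a small sketch, assuming X is the sparse train matrix):

import numpy as np

# Column-wise Euclidean norms of the sparse design matrix.
col_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=0)).ravel())
print("number of all-zero columns:", np.sum(col_norms == 0))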

Here is the result I obtained in around 30 mins using n_jobs=5, eps=1e-3, and tol=1e-1:
[image]


mathurinm commented on August 23, 2024

Finally, I managed to get a very high Pearson coefficient in a short time (around 2 minutes) by fitting with a tolerance equal to 10:

import celer
from scipy import sparse
import numpy as np
import sklearn

from numpy.linalg import norm

X = sparse.load_npz('X_train.npz')
y = np.load('y_train.npy')

clf = celer.LassoCV(n_jobs=5, eps=1e-3, tol=1e1,
                    verbose=True, normalize=True,
                    fit_intercept=True, prune=True)
clf.fit(X, y)


y_test = np.load('y_test.npy')
X_test = sparse.load_npz('X_test.npz')

y_pred = clf.predict(X_test)
pearson = y_pred @ y_test / norm(y_pred) / norm(y_test)
print(pearson)

I get 99.7 %, and the curve looks like a noisier version of the ones where the solution is more precise:
[image]


yjkimnada commented on August 23, 2024

Thank you for looking into this matter thoroughly.

I have now confirmed that celer LassoCV runs much faster than sklearn LassoCV with equal test performance. Just two quick questions:

  1. It seems like celer LassoCV keeps printing this warning: /home/joon/celer/celer/homotopy.py:222: RuntimeWarning: divide by zero encountered in true_divide use_accel=1, tol=tol, prune=prune, positive=positive). Although this doesn't seem to affect performance on the task at hand, do I need to worry about this issue?

  2. I ran your setup precisely and got a test Pearson of 0.987 instead of your 0.997. I just wanted to confirm that this was a typo and not an actual difference in performance, because that would be another point of concern.


mathurinm commented on August 23, 2024
  • The division by zero occurs because some columns of X_train contain only zeros, and at some point we divide by norm(X[:, j]). This should be harmless; however, it is true that celer should handle it better (raise a warning first, and then ignore these columns).
    You can also exclude these columns from your dataset before launching the solver (see the sketch at the end of this comment).

  • With the following snippet:

import celer
from scipy import sparse
import numpy as np
import sklearn

from numpy.linalg import norm

X = sparse.load_npz('X_train.npz')
y = np.load('y_train.npy')

clf = celer.LassoCV(n_jobs=5, eps=1e-3, tol=1e1,
                    verbose=True, normalize=True,
                    fit_intercept=True, prune=True)
clf.fit(X, y)

y_test = np.load('y_test.npy')
X_test = sparse.load_npz('X_test.npz')

y_pred = clf.predict(X_test)
pearson = y_pred @ y_test / norm(y_pred) / norm(y_test)
print(pearson)

I get
0.9976474176865748

Is this the one you ran when you got 0.987?
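
Coming back to the first point, a minimal sketch of the column filtering I mentioned (assuming X and X_test are the sparse matrices loaded in the snippet above):

import numpy as np

# Drop the all-zero columns before fitting; apply the same selection to the
# test matrix so the feature indices stay aligned.
col_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=0)).ravel())
keep = np.flatnonzero(col_norms > 0)
X = X[:, keep]
X_test = X_test[:, keep]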


yjkimnada commented on August 23, 2024

Very interesting: I have been using np.corrcoef() to obtain the R coefficient instead of your custom formula, and they give different values. Running your code and using your formula resulted in 0.997 as expected. So this is a separate issue, I assume, with how the coefficient is calculated/rounded within np.corrcoef versus the norm-based formula. This isn't a big deal.
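
For reference, a small sketch comparing the two quantities side by side (assuming y_pred and y_test from the snippet above): np.corrcoef centers the vectors before correlating (the usual Pearson coefficient), while the dot-product formula is uncentered (a cosine similarity), which may account for the small gap.

import numpy as np
from numpy.linalg import norm

# Centered Pearson correlation, as computed by np.corrcoef.
pearson_centered = np.corrcoef(y_pred, y_test)[0, 1]
# Uncentered (cosine) similarity, as in the norm-based formula above.
cosine = y_pred @ y_test / (norm(y_pred) * norm(y_test))
print(pearson_centered, cosine)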

Thank you so much overall. Some sort of primer/explanation/guide on choosing a proper tol value would, I believe, be very helpful for potential future users.


josephsalmon commented on August 23, 2024

About the np.corrcoef() issue: I suspect a degrees-of-freedom difference in the function w.r.t. @mathurinm's implementation.
And thx @yjkimnada for the feedback!

