Comments (14)
So I believe I've made a proper pull request.
Here are the NumPy files for train_X, train_Y, test_X, test_Y. The X matrices are neural spike response matrices to image stimuli, while train_Y holds the image stimuli. When I ran my code for Celer's LassoCV, I didn't apply normalization of any sort (not sure whether that was a good idea for my data).
For reference, running sklearn's LassoCV resulted in a Pearson R correlation of around 0.97-0.98 between test_Y and the model's output for test_X.
from celer.
Hello @yjkimnada, currently it is not supported indeed, but since we do not overwrite the fit method of LassoCV, it should be straightforward to implement: it suffices to add the parameter n_jobs to the LassoCV.__init__() method and to pass it to super(LassoCV, self).__init__() afterwards.
Would you like to give it a try in a PR?
from celer.
So just to confirm, I made the following edits in the dropin_sklearn.py file to default to n_jobs=1 and have the ability to specify n_jobs=-1:
def __init__(self, eps=1e-3, n_alphas=100, alphas=None,
             fit_intercept=True, normalize=False, max_iter=100,
             tol=1e-4, cv=None, verbose=0, gap_freq=10,
             max_epochs=50000, p0=10, prune=0, precompute='auto',
             positive=False, n_jobs=1):
    super(LassoCV, self).__init__(
        eps=eps, n_alphas=n_alphas, alphas=alphas, max_iter=max_iter,
        tol=tol, cv=cv, fit_intercept=fit_intercept, normalize=normalize,
        verbose=verbose, n_jobs=n_jobs)
    self.gap_freq = gap_freq
    self.max_epochs = max_epochs
    self.p0 = p0
    self.prune = prune
    self.positive = positive
I did confirm that multiple CPUs were being used simultaneously. However, for some reason, I do not get any results for my regression even after waiting nearly half an hour, whereas the regular sklearn LassoCV finishes within a few minutes under the same CV search constraints. Instead, with Celer, I get a bunch of the following errors:
/home/joon/celer/celer/homotopy.py:222: RuntimeWarning: divide by zero encountered in true_divide
use_accel=1, tol=tol, prune=prune, positive=positive)
!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 7.93e-03 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 5.62e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 1.46e-03 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 9.89e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 9.89e-03 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 9.89e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 1.46e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 9.89e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 1.46e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 9.89e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 1.46e-03 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 9.89e-03 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 1.46e-03 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04
!!! Inner solver did not converge at epoch 49999, gap: 6.10e-04 > 1.00e-04!!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04
My input X matrix is of size 9800-by-104700, so it is not incredibly huge. It is sparse, with only around 9-10% of the entries non-zero, so I input it in CSC format as recommended. Finally, I did check with the examples you provide on the leukemia and finance datasets that Celer outperforms sklearn on those sample codes. Should I somehow be tuning Celer to fit my problem better?
from celer.
Yes, the way you coded it is what I was thinking about. Could you please push your code in a pull request?
From your example it seems that celer is having trouble converging (possibly because the prescribed tolerance on the duality gap is too low, e.g. we ask for a precision of 1e-4, which is very small if the optimal value is, say, 1e10; we usually normalize the target y to make the setting of tol easier).
Can you share your data so that I can check the code locally?
from celer.
Yes, I’ll make a new pull request and also send you my X and Y data. It’s currently 4 AM here, so I’ll have it sent first thing in the morning.
from celer.
The files are a bit heavy, especially X_train: could you save it in npz format? I believe saving it as npy first converted it to dense.
Also, can you check the squared norm of y? It corresponds to the first primal, and sklearn uses it to scale the stopping criterion (see here). If it is very large, you can increase tol (e.g. tol = 1e-6 * norm(y) ** 2 / n_samples).
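A minimal sketch of the scaling suggested above, on synthetic data. It assumes the Lasso objective is ||y - Xw||^2 / (2 * n_samples) + alpha * ||w||_1, so that the primal value at w = 0 is proportional to ||y||^2 / n_samples. The helper name scaled_tol is hypothetical and not part of celer or sklearn.

```python
import numpy as np
from numpy.linalg import norm


def scaled_tol(y, rel=1e-6):
    """Absolute tolerance proportional to the primal value at w = 0.

    Hypothetical helper implementing rel * ||y||^2 / n_samples.
    """
    return rel * norm(y) ** 2 / len(y)


rng = np.random.default_rng(0)
y = rng.standard_normal(200)  # synthetic target, for illustration only

tol_small = scaled_tol(y)
tol_large = scaled_tol(100.0 * y)  # same target expressed in other units

# Rescaling y by 100 rescales the objective, and hence the tolerance, by 100 ** 2.
print(tol_small, tol_large, tol_large / tol_small)
```

The point of the scaling is that the stopping criterion then no longer depends on the units in which y is expressed.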
from celer.
- Here is the new link to the compressed .npz file. The titles of the 4 arrays stored within are train_X, train_Y, test_X, test_Y. https://fil.email/SMXAZbnk
- The Euclidean 2-norm for train_Y (after dividing by n_samples) is 1.35459396 and the squared 2-norm (after dividing by n_samples) is 17982.26316. Which one should I be using, and is there a specific reason why you started with tol = 1e-6 (the default for sklearn and celer is 1e-4)?
- Just a side question: when the tolerance is not reached within a regression (for a given tolerance and max_epochs), does celer move on to the next sub-problem or does it keep repeating the current sub-problem (which may explain why I'm getting stuck with no outputs)?
from celer.
Thanks for the very clear description of the problem and your involvement in this.
I may have given you the wrong reference: I could not use scipy.sparse.load_npz on the file you sent, and when I used np.load, trying to access the design matrix caused my computer to freeze because train_X was loaded as a dense array.
In the end I managed, but what I initially meant was that you can save your design matrices with sparse.save_npz("train_X.npz", train_X); this way they are loadable directly as sparse arrays: train_X = sparse.load_npz("train_X.npz")
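As a small self-contained illustration of that round trip (the matrix here is random and the file is written to a temporary directory, both purely for the example):

```python
import os
import tempfile

from scipy import sparse

# Synthetic stand-in for the design matrix: CSC format, about 10% non-zeros.
X = sparse.random(50, 200, density=0.1, format="csc", random_state=0)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_X.npz")
    sparse.save_npz(path, X)          # stores the sparse structure, not a dense array
    X_loaded = sparse.load_npz(path)  # comes back sparse, in the same format

print(type(X_loaded).__name__, X_loaded.shape, X_loaded.nnz)
```

Unlike np.save on a densified array, this keeps the file size proportional to the number of non-zero entries.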
I am currently fitting a LassoCV with default parameters; it takes a while, but it's running.
The following snippet, where I have increased the tolerance, indeed takes more than 15 minutes to run (note that you can use verbose=True to have a bit more feedback):
import celer
from scipy import sparse
import numpy as np
# I have saved them separately at an earlier stage:
X = sparse.load_npz('X_train.npz')
y = np.load('y_train.npy')
clf2 = celer.LassoCV(verbose=True, tol=1e-2)
clf2.fit(X, y)
There is no clear guideline for the choice of tolerance; going for 1e-4 was a way to give similar results to sklearn by default. But since sklearn scales tol by norm(y) ** 2, the tolerances are in fact not the same in the end.
I suspect that sklearn fits faster because it uses a really high tolerance. In order to "disable" its scaling of tol, can you make y unit normed (y /= norm(y)) and see if it still fits fast? By the way, on my computer, it has been running for more than 15 minutes now (n_jobs=1).
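To put rough numbers on this, here is a back-of-the-envelope sketch using the figures reported earlier in the thread. It assumes that the 9800 rows of X are the samples and that sklearn's stopping threshold is tol * norm(y) ** 2 as described above. It is plain arithmetic, not a statement about either library's internals.

```python
import math

n_samples = 9800              # assumption: rows of the 9800-by-104700 X are samples
sq_norm_over_n = 17982.26316  # reported: norm(train_Y) ** 2 / n_samples
tol = 1e-4                    # default tol mentioned in the thread

sq_norm = sq_norm_over_n * n_samples          # norm(train_Y) ** 2
norm_over_n = math.sqrt(sq_norm) / n_samples  # consistency check against 1.35459396

# Threshold obtained if tol is multiplied by norm(y) ** 2 (assumed sklearn scaling).
scaled_threshold = tol * sq_norm

# Tolerance suggested earlier: 1e-6 * norm(y) ** 2 / n_samples.
suggested_tol = 1e-6 * sq_norm / n_samples

print(norm_over_n, scaled_threshold, suggested_tol)
```

Under these assumptions, the scaled threshold is many orders of magnitude larger than the unscaled 1e-4, which is consistent with the suspicion that the two solvers are not being asked for the same precision.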
For celer, increasing the tolerance should make it faster (instead of getting stuck endlessly as it does currently). I think we could:
- raise a more explicit message than
  !!! Inner solver did not converge at epoch 49999, gap: 2.44e-04 > 1.00e-04
  e.g. suggesting to increase the tolerance.
- manually increase the tolerance for the user when the solver takes too long to converge (with a possibility to disable this heuristic). Although it is used in other solvers, I was reluctant to do it in the first place as it is really heuristic.
Note that the LassoCV default parameter eps is 1e-3, meaning that the values of alphas go from alpha_max to alpha_max / 1000, which is quite low. Setting eps=1e-2 should also speed up the process, with the obvious drawback that it tests a smaller range of values (alpha_max / 100 is already good enough in my experience, but again this is not backed by theory).
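A sketch of what eps controls, on synthetic data. It assumes no intercept, an objective scaled by 1 / (2 * n_samples), and a geometrically spaced grid. The helper name alpha_grid is hypothetical, and the exact grid construction inside the library may differ.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = sparse.random(100, 300, density=0.1, format="csc", random_state=0)
y = rng.standard_normal(100)
n_samples = X.shape[0]

# Smallest alpha for which the Lasso solution is all zeros (no intercept).
alpha_max = np.max(np.abs(X.T @ y)) / n_samples


def alpha_grid(alpha_max, eps, n_alphas=100):
    """Geometric grid from alpha_max down to eps * alpha_max (hypothetical helper)."""
    return np.geomspace(alpha_max, eps * alpha_max, n_alphas)


grid_default = alpha_grid(alpha_max, eps=1e-3)  # ends at alpha_max / 1000
grid_short = alpha_grid(alpha_max, eps=1e-2)    # ends at alpha_max / 100

print(grid_default[-1] / alpha_max, grid_short[-1] / alpha_max)
```

The smallest alphas give the least sparse and usually slowest problems, which is why cutting the grid at alpha_max / 100 tends to save time.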
To summarize:
- I will raise a better error message when the inner solver does not converge, and think of a way to avoid this with heuristics
- increasing tol and eps should make your code faster, while having limited impact on the results (without guarantee)
- sklearn fitting in a few minutes is a surprise to me; I am investigating the reasons.
from celer.
As pointed out by @josephsalmon, the columns of X have varying norms: around 6k of them are even 0.
Histogram of norms of columns:
Histogram of norms on support of solution of LassoCV:
This creates a bias as columns with small norm are more penalized.
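A small sketch of why low-norm columns are penalized more, on synthetic data and for a single feature. It assumes the objective ||y - x * w||^2 / (2 * n) + alpha * |w|, for which the coefficient is exactly zero whenever alpha >= |x @ y| / n. The helper name entry_alpha is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)
y = 2.0 * x + 0.1 * rng.standard_normal(n)


def entry_alpha(x, y):
    """Largest alpha at which a single-feature Lasso coefficient becomes non-zero."""
    return abs(x @ y) / len(y)


alpha_full = entry_alpha(x, y)
alpha_small = entry_alpha(0.01 * x, y)  # same feature, column norm divided by 100

# The rescaled column carries the same information, but it only enters the
# model once alpha is 100 times smaller.
print(alpha_full, alpha_small, alpha_small / alpha_full)
```

This is the effect that normalizing the columns is meant to remove.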
Here is the result I obtained in around 30 mins using n_jobs=5, eps=1e-3, and tol=1e-1:
from celer.
Finally, I managed to get a very high Pearson coefficient in a short time (around 2 minutes), by fitting with a tolerance equal to 10:
import celer
from scipy import sparse
import numpy as np
import sklearn
from numpy.linalg import norm
X = sparse.load_npz('X_train.npz')
y = np.load('y_train.npy')
clf = celer.LassoCV(n_jobs=5, eps=1e-3, tol=1e1,
                    verbose=True, normalize=True,
                    fit_intercept=True, prune=True)
clf.fit(X, y)
y_test = np.load('y_test.npy')
X_test = sparse.load_npz('X_test.npz')
y_pred = clf.predict(X_test)
pearson = y_pred @ y_test / norm(y_pred) / norm(y_test)
print(pearson)
I get 99.7 %, and the curve looks like a noisier version of the ones where the solution is more precise:
from celer.
Thank you for looking into this matter thoroughly.
I have now confirmed that celer LassoCV runs much faster than sklearn LassoCV with equal test performance. Just two quick questions:
- It seems like celer LassoCV keeps printing this error:
  /home/joon/celer/celer/homotopy.py:222: RuntimeWarning: divide by zero encountered in true_divide use_accel=1, tol=tol, prune=prune, positive=positive)
  Although this doesn't seem to affect performance at the task given, do I need to worry about this issue?
- I ran your setup precisely and got a test Pearson of 0.987 instead of your 0.997. I just wanted to confirm that this was a typo and not an actual difference in performance, because that would be another point of concern.
from celer.
- The division by zero occurs because some columns of X_train have only zeroes in them, and at some point we divide by norm(X[:, j]). This should be harmless; however, it is true that celer should handle it better (raise a warning at first, and then ignore these columns).
  You can also exclude these columns from your dataset before launching the solver.
- With the following snippet:
import celer
from scipy import sparse
import numpy as np
import sklearn
from numpy.linalg import norm
X = sparse.load_npz('X_train.npz')
y = np.load('y_train.npy')
clf = celer.LassoCV(n_jobs=5, eps=1e-3, tol=1e1,
                    verbose=True, normalize=True,
                    fit_intercept=True, prune=True)
clf.fit(X, y)
y_test = np.load('y_test.npy')
X_test = sparse.load_npz('X_test.npz')
y_pred = clf.predict(X_test)
pearson = y_pred @ y_test / norm(y_pred) / norm(y_test)
print(pearson)
I get
0.9976474176865748
Is this the one you ran when you got 0.987?
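Regarding the all-zero columns from the first point, here is one possible way to exclude them before fitting and to map the coefficients back afterwards. The tiny matrix and the coefficient values are made up for illustration; in practice coef_reduced would come from the fitted model.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import norm as sparse_norm

# Toy design matrix in which columns 1 and 3 contain only zeros.
dense = np.array([[1.0, 0.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0, 0.0],
                  [4.0, 0.0, 0.0, 0.0]])
X = sparse.csc_matrix(dense)

# Column norms computed without densifying X.
col_norms = np.asarray(sparse_norm(X, axis=0)).ravel()
keep = np.flatnonzero(col_norms > 0)
X_reduced = X[:, keep]  # fit the model on this matrix instead of X

# Placeholder values standing in for coefficients fitted on X_reduced.
coef_reduced = np.array([0.5, -1.0])

# Map back to the original feature indexing; dropped columns get a zero coefficient.
coef_full = np.zeros(X.shape[1])
coef_full[keep] = coef_reduced

print(keep, X_reduced.shape, coef_full)
```

An all-zero column cannot contribute to the fit, so assigning it a zero coefficient loses nothing.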
from celer.
Very interesting: I have been using np.corrcoef() to obtain the R coefficient instead of your custom formula, and they result in different values. Running the code and using your formula resulted in 0.997 as expected. So this is a separate issue, I assume, with calculating/rounding within np.corrcoef versus np.norm. This isn't a big deal.
Thank you so much overall. Some sort of primer/explanation/guide on choosing the proper tol value would be very helpful, I believe, for potential future users.
from celer.
About the np.corrcoef() issue: I suspect a degree of freedom difference in the function w.r.t. @mathurinm's implementation.
And thanks @yjkimnada for the feedback!
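One other possible explanation for the 0.987 versus 0.997 gap, offered as an assumption rather than a confirmed diagnosis: the formula y_pred @ y_test / norm(y_pred) / norm(y_test) is an uncentered cosine similarity, which coincides with the Pearson coefficient returned by np.corrcoef only when both vectors have zero mean. A sketch on synthetic data with a non-zero mean:

```python
import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(0)
y_test = 5.0 + rng.standard_normal(1000)          # synthetic target, mean far from 0
y_pred = y_test + 0.5 * rng.standard_normal(1000)  # synthetic noisy prediction

# Formula used in the thread: cosine similarity, no centering.
cosine = y_pred @ y_test / norm(y_pred) / norm(y_test)

# Pearson correlation as computed by NumPy.
pearson = np.corrcoef(y_pred, y_test)[0, 1]

# Centering both vectors first makes the two formulas agree.
pred_c = y_pred - y_pred.mean()
test_c = y_test - y_test.mean()
centered = pred_c @ test_c / norm(pred_c) / norm(test_c)

print(cosine, pearson, centered)
```

If the targets in the real data have a non-zero mean, the uncentered formula would be expected to give a different value from np.corrcoef, as it does in this synthetic case.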
from celer.