Comments (11)
Thanks for your analysis; it all makes sense now.
In your opinion, what is the downside of choice 2?
I guess 3 is more flexible; we could add a note in the docstring suggesting good practices for setting them up. What do you say?
BTW, Bottou has a similar implementation:
http://leon.bottou.org/projects/sgd#stochastic_gradient_svm
Cool, I suggest setting the defaults following Bottou.
Sounds good. Closing this issue then, since setting the defaults like Bottou has been covered in other (open) issues.
It still feels like the learning rate should be protected with a min(1, lr(t)) cap to prevent a few giant steps at the beginning, but I just looked through Bottou's svmsgd.cpp and there is no such protection.
Instead, he runs through at most 1000 examples a few times to do a coarse line search on C0 between 1.0 and 2.0, then follows through with the full optimization on the C0 that did best on the limited budget.
This would put his full search range about 3 orders of magnitude higher than our current implementation's.
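A minimal sketch of that coarse search in NumPy (the `sgd_hinge_loss` helper is a hypothetical stand-in for a real fit routine, and the candidate grid, decay constant, and budget are illustrative assumptions, not asgd's API):

```python
import numpy as np

def sgd_hinge_loss(step_size0, X, y, n_passes=2):
    # Tiny hinge-loss SGD, a hypothetical stand-in for asgd's fit(),
    # used only to score one candidate initial step size.
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(n_passes):
        for xi, yi in zip(X, y):
            lr = step_size0 / (1.0 + step_size0 * 1e-4 * t)  # decaying step size
            if yi * np.dot(w, xi) < 1.0:  # margin violated -> take a step
                w += lr * yi * xi
            t += 1
    return np.maximum(1.0 - y * X.dot(w), 0.0).mean()  # mean hinge loss

def coarse_step_size_search(X, y, candidates=(0.25, 0.5, 1.0, 2.0), budget=1000):
    # Try each candidate on at most `budget` examples and keep the one
    # with the lowest training loss, like Bottou's search over C0.
    Xs, ys = X[:budget], y[:budget]
    losses = [sgd_hinge_loss(c, Xs, ys) for c in candidates]
    return candidates[int(np.argmin(losses))]
```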
In this case, it may make more sense to set the default to 1.0. If we want to write a similar procedure, optimizing sgd_step_size0 on a few examples should probably be done in a different object. What do you think?
How about putting this into BaseASGD, and implementing it there in terms of virtual methods like self.fit and self.loss? sklearn has names for this sort of behavior, but we can pick anything sensible for now.
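As a hedged sketch of that layout: the search lives in BaseASGD and touches only the overridable methods. The name `determine_sgd_step_size0` and its defaults are illustrative assumptions, not asgd's actual API.

```python
class BaseASGD(object):
    def fit(self, X, y):
        # Subclasses (naive, Theano, ...) implement the SGD loop.
        raise NotImplementedError

    def loss(self, X, y):
        # Subclasses implement the training loss (e.g. mean hinge loss).
        raise NotImplementedError

    def determine_sgd_step_size0(self, X, y,
                                 candidates=(0.25, 0.5, 1.0, 2.0),
                                 budget=1000):
        # Fit on a small budget of examples for each candidate step
        # size and keep the one that achieved the lowest loss.
        best, best_loss = None, float('inf')
        for c in candidates:
            self.sgd_step_size0 = c
            self.fit(X[:budget], y[:budget])
            cur = self.loss(X[:budget], y[:budget])
            if cur < best_loss:
                best, best_loss = c, cur
        self.sgd_step_size0 = best
```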
+1 on setting the default sgd_step_size0 to 1.0 though.
I just made a small change to the Theano implementation that makes it go much faster. For different numbers of features I now see the following runtimes:
N_FEAT   Naive   Theano
   100   0.192    0.038
  1000   0.239    0.083
 10000   1.085    0.520
The change was to replace dot(obs, weights) with (obs * weights).sum(). Theano lacks level-1 BLAS support and was using GEMM for the inner product. This optimization has now been submitted to the Theano trunk.
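For reference, a minimal Theano sketch of the two formulations (variable names are illustrative):

```python
import theano
import theano.tensor as T

obs = T.vector('obs')
weights = T.vector('weights')

# Original form: a vector-vector dot, which Theano lowered to GEMM.
margin_dot = T.dot(obs, weights)

# Replacement described above: elemwise multiply followed by a sum.
margin_sum = (obs * weights).sum()

f = theano.function([obs, weights], [margin_dot, margin_sum])
```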
Update: using gemv is as fast as the elemwise sum (and faster than gemm).
Also, benchmarking the Theano and naive implementations is problem-dependent. Theano's "if" is slow, so the Theano version computes the weight update regardless of whether the margin is violated, but it performs each computation faster than the naive implementation. So if there are a lot of weight updates, or the feature vectors are relatively small (< 10K elements), Theano is always faster, sometimes by several times. However, when the feature vectors are long and relatively few examples actually violate the margin, the naive implementation can be slightly faster, since it skips the update entirely.
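A NumPy sketch of that tradeoff (illustrative, not either implementation verbatim): the naive loop branches and skips work, while the Theano-style version always pays for the update and masks it.

```python
import numpy as np

def naive_update(w, x, y, lr):
    # Naive style: branch, and skip the update when the margin holds.
    if y * np.dot(w, x) < 1.0:
        w += lr * y * x
    return w

def branchless_update(w, x, y, lr):
    # Theano style: a symbolic "if" is slow, so always compute the
    # update and scale it by 0 or 1 depending on the margin check.
    violated = float(y * np.dot(w, x) < 1.0)
    w += lr * y * x * violated
    return w
```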