npinto / asgd
Averaged Stochastic Gradient Descent Classifiers
In the caa_stable branch, is simple_blas.h supposed to be present?
see the sparsity trick from Bottou
The idea is to boost performance by "disabling" the averaging until it becomes useful. Start with exp_moving_asgd_step_size=1e-2?
It would be useful to have the option of using "mini-batches" to get better estimates of the gradients (see https://github.com/npinto/asgd/blob/master/asgd/naive_asgd.py#L60). Since we'll be using BLAS, etc., this parameter could also speed up convergence.
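As a sketch of what that mini-batch option might compute (the function name is hypothetical, and the hinge-loss form is an assumption based on the binary classifier in naive_asgd.py, not the repo's actual code):

```python
import numpy as np

def minibatch_hinge_grad(weights, bias, X, y):
    """Average hinge-loss gradient over a mini-batch (sketch).

    X is (batch_size, n_features); y holds +/-1 labels.  X.dot(weights)
    is a single BLAS matrix-vector product instead of batch_size
    separate dot products, which is where the speed-up would come from.
    """
    margins = y * (X.dot(weights) + bias)
    violated = margins < 1.0            # only margin violators contribute
    if not violated.any():
        return np.zeros_like(weights), 0.0
    grad_w = -(y[violated, None] * X[violated]).mean(axis=0)
    grad_b = -y[violated].mean()
    return grad_w, grad_b
```

Averaging over the batch should reduce the gradient's variance at the cost of fewer parameter updates per example seen.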
The learning rate has the form η0 / (1 + λ η0 t)^0.75 where λ is the regularization constant.
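In code, that schedule looks like the following (parameter names are illustrative, not the library's API):

```python
def sgd_step_size(t, step_size0, l2_regularization, exponent=0.75):
    """Bottou's ASGD schedule: eta_t = eta0 / (1 + lambda * eta0 * t)**0.75,
    where lambda is the regularization constant and t counts examples seen."""
    return step_size0 / (1.0 + l2_regularization * step_size0 * t) ** exponent
```

At t = 0 this returns step_size0 exactly, and it decays monotonically from there.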
I fixed all the (apparent) issues with C and BLAS, and now all BLAS functions are called with the right matrix/vector dimensions. I stepped through them in gdb and they all appear to write to the proper memory locations. I also cleaned up the C file a little bit. Note that I created a new branch, caa_blas, for the development of the BLAS-based algorithm.
Now I should start verifying that I implemented the logic of the algorithm as a whole correctly. I think a good approach would be to first write a unit test for each of the three functions, and then some integration tests for the algorithm. That would be especially valuable if you plan to publish the software as a public library, since it would guarantee the correctness of future edits. What do you think?
For the specifics of testing, I can't exactly replicate the Python tests, since I doubt that NumPy's Mersenne Twister yields the same sequences as the LCG in glibc when seeded with the same seeds (42 and 43). So if I want to start by testing everything in C, I need some other way to do it. Any suggestions? I am hoping to make rapid progress once testing is in place.
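One possible workaround (just a suggestion, not something in the repo): generate the test fixtures from a tiny explicit generator that is trivial to re-implement identically in C, instead of relying on either NumPy's Mersenne Twister or glibc's rand():

```python
def portable_uniform(seed, n, a=1103515245, c=12345, m=2**31):
    """Deterministic LCG for test fixtures (sketch).

    The same recurrence in C is one line:
        x = (1103515245u * x + 12345u) % 2147483648u;
    so both the C and Python test suites can regenerate identical
    matrices from the same seeds (e.g. 42 and 43).
    """
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x / m)   # uniform floats in [0, 1)
    return out
```

The statistical quality of an LCG is poor, but for reproducing fixed test matrices on both sides it only needs to be deterministic.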
Finally, let me know when you have a chance to pass me the real testing matrices and a reference to the Python wrapping tools you suggest, so I can also start looking at those. I know you have been busy lately, so no worries.
Let me know...
The constant η0 is determined by performing preliminary experiments on a data subsample.
http://leon.bottou.org/projects/sgd
We could also have asgd.tune_...() methods to "tune" speed and accuracy (here step_size0 would be optimized).
Remember the previous discussion we had about whether the l2_regularization parameter should be allowed to be 0? I think that was related to a mis-parameterization of the BaseASGD object.
Background: the schedule implemented takes the form C(t) = C0 / (1 + C0 k t)^a, where C0 is the initial SGD step size, t is the number of iterations so far, a is the annealing exponent, and k is an extra multiplier on the time. This form is recommended in [1].
The current naive_asgd implementation hardcodes k = l2_regularization, as suggested in [2].
This leads to weirdness though, because C0 is typically something like 1e-2 or 1e-3, while l2_regularization should usually be something like 1e-6. In the most aggressive case we end up with a schedule like
C(t) = C0 / (1 + 1e-8 t)^a
For a normal choice of a = .5, this is basically un-annealed SGD, since it takes 100 million examples to get the learning rate from C0 down to C0/sqrt(2). I'm not that confident in saying it, but I don't think this is how the annealing was meant to be used.
I think this could be improved in two ways:
- Remove C0 from the annealing schedule (simplify things... why put it in twice? And it is not used in [3]).
- Have k default to C0 rather than l2_regularization, and make k a parameter of BaseASGD, because both [2] and [3] say it's problem-dependent.

[1] http://arxiv.org/pdf/1107.2490v2
[2] http://www.dbs.ifi.lmu.de/~yu_k/cvpr11_0694.pdf
[3] http://hal.archives-ouvertes.fr/docs/00/60/80/41/PDF/gradsto_hal.pdf
To decrease communication and speed up convergence, we should have an option (default=True) to only update weights when margin constraints have been violated:
e.g.: Line #66 should move up (to Line #63) and become if margin >= 1: continue.
https://github.com/npinto/asgd/blob/master/asgd/naive_asgd.py#L66
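A sketch of what that early exit could look like in the per-example loop (variable names are guesses in the spirit of the file above; this is an illustration, not the repo's actual code):

```python
import numpy as np

def fit_one_example(sgd_weights, sgd_bias, obs, label, step_size, l2_reg):
    """Hinge-loss SGD step that skips all work when the margin holds."""
    margin = label * (np.dot(obs, sgd_weights) + sgd_bias)
    if margin >= 1.0:                  # constraint satisfied: no update,
        return sgd_weights, sgd_bias   # no averaging, no communication
    new_w = sgd_weights * (1.0 - l2_reg * step_size) + step_size * label * obs
    new_b = sgd_bias + step_size * label
    return new_w, new_b
```

For well-separated data most examples eventually satisfy the margin, so the default-on option would skip the weight update (and any associated communication) on the majority of iterations.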
Although I haven't gotten everything out of it by far, this paper looks like a good one to add to the README:
http://hal.archives-ouvertes.fr/docs/00/60/80/41/PDF/gradsto_hal.pdf
I am having an issue with cython. The cython file compiles correctly to C, and the C file also compiles correctly to .so (see first target of the Makefile on branch caa_devel). However, when I import the .so in python, python complains that
ImportError: ./pasgd.so: undefined symbol: cblas_sdsdot
I am not using cblas_sdsdot from Python, and the function is in fact defined (either in simple_blas.c or in libcblas.so). This sounds like a silly issue, like me not passing some necessary parameter to Python. Did you ever encounter something like this? Suggestions?
I have integrated BLAS into the code, and all functions are complete. However, a BLAS call causes memory corruption: I need to track it down and fix it. It is probably a wrong size parameter (e.g. sizeof(*x) instead of 1). The lack of documentation for BLAS (both the NetLib version and OpenBLAS) is definitely an issue; I will have to be more careful.
After this, I will need to unit-test the code. A preliminary step could be either testing the code entirely in C (which makes testing harder, but lets me test before writing the Python wrapper) or wrapping the functions first (which allows more flexible testing, but potentially introduces wrapping issues).
"sphere" the data and merge in the weights
I've been working on a multiclass version of this -- https://github.com/yamins81/asgd/tree/feature/multiclass
When it's ready I'd like to contribute it back, but in the meantime, can you comment on it when you guys have a moment?
Thanks,
Dan
Multiple (sgd_step_size0, l2_regularization) pairs could be given, and the *fit() methods could use BLAS Level-3 operations when appropriate to allow for more data re-use and speed up the computation.
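For instance, under the assumption that one weight column is kept per (sgd_step_size0, l2_regularization) pair (names and layout are illustrative, not the library's API):

```python
import numpy as np

def decision_all(X, W, b):
    """Decision values for every hyperparameter setting at once.

    X is (n_samples, n_features); W is (n_features, n_settings), one weight
    column per (sgd_step_size0, l2_regularization) pair; b is (n_settings,).
    X.dot(W) is a single BLAS Level-3 GEMM call instead of n_settings
    separate Level-2 matrix-vector products, so X streams through the
    cache once and gets re-used across all settings.
    """
    return X.dot(W) + b
```

The same idea applies to the gradient computations during fitting, which is where most of the data re-use would come from.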
This is confusing... please consult me for more details ;-)
Loosely related work:
Notes on Regularized Least-Squares
by Rifkin. and. Lippert
http://cbcl.mit.edu/publications/ps/MIT-CSAIL-TR-2007-025.pdf
feedback should be restored (btw, it was used in LeCun's NIPS'11 optimization challenge)