
Comments (3)

kloudkl commented on April 28, 2024

@qipeng has solved this issue in #741.


pwohlhart commented on April 28, 2024

Hi,

I might be mistaken, but I don't think your interpretation of the Bengio et al. paper is right. They show that the parameter update (Formula 7) has the same form as the one in regular momentum (Formula 5), except with different coefficients. These coefficients, however, are not the same as those used to update the velocity (Formula 6), which would otherwise make it completely identical. That's what makes the difference (although probably a rather slight one?).
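
For concreteness, here is how I read the constant-coefficient case (my own paraphrase, so take the equation numbers with a grain of salt; \mu is the momentum and \epsilon the learning rate):

```latex
% Regular momentum (Formula 5, as I read it):
v_t = \mu v_{t-1} - \epsilon \nabla f(\theta_{t-1}), \qquad
\theta_t = \theta_{t-1} + v_t
% NAG velocity update (Formula 6): the gradient is taken at the lookahead point
v_t = \mu v_{t-1} - \epsilon \nabla f(\theta_{t-1} + \mu v_{t-1})
% Substituting \Theta_{t-1} = \theta_{t-1} + \mu v_{t-1} gives the
% momentum-like parameter update (Formula 7), whose coefficients
% \mu^2 and (1 + \mu)\epsilon differ from the \mu and \epsilon above:
\Theta_t = \Theta_{t-1} + \mu^2 v_{t-1} - (1 + \mu)\,\epsilon \nabla f(\Theta_{t-1})
```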


qipeng commented on April 28, 2024

Hi @pwohlhart, due to a limitation of the current gradient-based solver, which evaluates the gradient once and updates the parameters once per iteration, my implementation is slightly different from (and perhaps slightly faster than) the original NAG.

Each iteration of the standard NAG can be viewed as follows (see the sketch after this list):

  1. Update the current parameters to a "future point" with the current velocity
  2. Evaluate the gradient at that point
  3. "Undo" the update
  4. Update the velocity with the gradient at the future point
  5. Update the parameters with the new velocity
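
A minimal sketch of those five steps, in my own notation rather than the actual Caffe code (I'm assuming the Sutskever-style update `v = mu * v - lr * g`, `theta = theta + v`, with plain SGD and no weight decay):

```python
def standard_nag_step(theta, v, grad_fn, lr, mu):
    """One iteration of standard NAG (hypothetical helper, not Caffe's API)."""
    theta_future = theta + mu * v  # step 1: move to the "future point"
    g = grad_fn(theta_future)      # step 2: evaluate the gradient there
    # step 3: "undo" -- theta itself was never overwritten in this sketch;
    # a solver that updates parameters in place would restore them here
    v = mu * v - lr * g            # step 4: update the velocity with that gradient
    theta = theta + v              # step 5: update the parameters with the new velocity
    return theta, v
```

Note that steps 1 and 3 cost an extra parameter write in an in-place solver, which is exactly what a one-gradient, one-update-per-iteration framework cannot do.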

Due to the aforementioned limitation, my implementation is (see the sketch after this list):

  1. Evaluate the gradient at a "future point"
  2. Add a negative velocity to the parameter update
  3. Update the velocity, and add the new velocity to the parameter update (multiplied by 1+momentum to update the parameters to the "future point" of the next iteration)
  4. Update the parameters with their corresponding updates
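
In the same notation, a sketch of the consolidated version (again an illustration under the same assumptions, not the actual Caffe code; here `theta` is stored at the "future point" between iterations):

```python
def consolidated_nag_step(theta, v, grad_fn, lr, mu):
    """One iteration, with `theta` stored at the future point theta_true + mu * v."""
    g = grad_fn(theta)        # step 1: gradient at the future point
    update = -mu * v          # step 2: add a negative (old) velocity to the update
    v = mu * v - lr * g       # step 3: update the velocity...
    update += (1 + mu) * v    # ...and add (1 + momentum) times the new velocity
    return theta + update, v  # step 4: a single parameter write per iteration
```

Expanding the return value gives `theta - mu * v_old + (1 + mu) * v_new`, which is exactly the future point of the next iteration.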

This consolidates several of the parameter updates in the original algorithm into a single one.

The only slight difference between this method and the standard NAG is that the parameter state between iterations is always the "future point" of that iteration, i.e. theta + momentum * velocity. This shouldn't cause too big a problem, as the gradient and/or learning rate are usually close to zero when the optimization approaches its end.
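
(If one did want the "true" parameters at any point, under the sign convention of the sketches above they could be recovered by stepping back:)

```python
# hypothetical post-processing, not part of the solver: recover the true
# parameters from the stored future point theta_stored = theta_true + mu * v
theta_true = theta_stored - mu * v
```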

