I'd like to give some thoughts and suggest some changes to the introductory text. If you like them, I can do more. If not, feel free to disregard. :)
[Diagram and title]
- I suggest moving the diagram to the top.
If you've spent some time optimizing smooth functions, chances are you've met this old nemesis of optimization. The condition has many names, but it always borrows the language of pathology: ill-conditioning, pathological curvature, dead gradients.
Minor nitpicks:
- "this optimizer's old nemesis" -> "this old nemesis of optimization"
- fix spacing around colon.
The problem manifests when minimizing a function using gradient descent, and its symptoms are somewhat subtle. Your choice of step-size seems correct. The gradients don't blow up. There is no division by zero, no NaNs, no square rooting of minus one. In fact, things often begin quite well - with an impressive, almost immediate decrease in the loss. But as the iterations progress, you start to get a nagging feeling you're not making as much progress as you should be. You're iterating hard, but the loss isn't getting smaller. Should you keep iterating, and hope for the best?
I've cut out the gradient descent update rule and related variables here. I think you included them here to introduce the notation, but they broke the flow of this paragraph without adding much. I think they're better introduced slightly later, when you introduce momentum.
(note: you also had a variable mix up where you used x instead of w)
The problem could be pathological curvature. We've all seen a certain picture, in 2D, of what this looks like. The landscapes are often described as valleys, trenches, canals, ravines. In these steep valleys, gradient descent fumbles. All progress along certain directions grinds close to a halt. Optimization approaches the optimum only in small, tedious steps.
- No changes except avoiding w, since it hasn't been introduced.
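One more suggestion for this paragraph: a concrete function might anchor the picture. The canonical example is a quadratic with very different curvatures along different axes (the constant 100 below is my arbitrary choice; anything large works, and I've used $x, y$ to keep avoiding $w$):

$$f(x, y) = \tfrac{1}{2}\left(x^2 + 100\, y^2\right)$$

Gradient descent has to keep its step small enough for the steep $y$ direction, so progress along the flat $x$ direction slows to a crawl.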
[no diagram because it has been moved to top]
As discussed, this doesn't integrate super well here. I think it's better moved to the top.
There is a simple tweak to gradient descent, which TensorFlow calls momentum, that makes things work much better!
Break this line out of the paragraph, so that we can focus the next paragraph on a chunk of technical introduction.
(I suspect there is a better reference for momentum than TensorFlow. The ML community has been calling it that for a long time.)
Suppose we have a function $f(w)$ and want to optimize over $w$ to minimize $f$. In gradient descent, we start with an initial guess, $w^0$, and iteratively improve it using the update rule:
$$w^{k+1} = w^k-\alpha\nabla f(w^k)$$
where $\alpha$ is the step-size. Momentum changes the update rule to:
$$
\begin{aligned}
z^{k+1}&=\beta z^{k}+\nabla f(w^{k})\\
w^{k+1}&=w^{k}-\alpha z^{k+1}.
\end{aligned}
$$
where $z^k$ is an auxiliary sequence and $\beta$ is an extra parameter. When $\beta = 0$, we recover gradient descent. But for $\beta = 0.99$ (sometimes $0.999$, if things are really bad), the situation improves quite dramatically.
We're using this paragraph to formally introduce both regular gradient descent and momentum. I think this fits better.
Something I haven't done, but I think you should seriously consider, is adding a line giving intuition about what momentum is. Something like "If gradient descent is a person walking downhill in the steepest direction, momentum is a boulder rolling downhill." You might also give intuition for the auxiliary sequence by saying something like "where $z^k$ is an auxiliary sequence representing the speed we are presently moving at" or something like that.
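You could even go one step further and make the update rule concrete with a few lines of code. A minimal sketch of what I have in mind (the toy objective, step-size, and iteration count are all invented for illustration):

```python
import numpy as np

def momentum_descent(grad_f, w0, alpha=0.002, beta=0.9, n_steps=500):
    """Gradient descent with momentum; beta = 0 recovers plain gradient descent."""
    w = np.asarray(w0, dtype=float)
    z = np.zeros_like(w)
    for _ in range(n_steps):
        z = beta * z + grad_f(w)   # z^{k+1} = beta * z^k + grad f(w^k)
        w = w - alpha * z          # w^{k+1} = w^k - alpha * z^{k+1}
    return w

# Toy usage: minimize the ill-conditioned quadratic f(w) = 0.5 * (w_1^2 + 100 * w_2^2)
grad_f = lambda w: np.array([w[0], 100.0 * w[1]])
print(momentum_descent(grad_f, w0=[1.0, 1.0]))
```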
Optimizers call this minor miracle "acceleration".
This tiny modification may seem like a cheap hack. A simple trick to get around gradient descent's more aberrant behavior - a smoother for oscillations between steep canyons. But the truth, if anything, is the other way round. It is gradient descent which is the hack. First, momentum gives about a quadratic speedup on convex functions. This is no small matter - this is the speedup you get from the Fast Fourier Transform and Grover's Algorithm. When the universe speeds things up for you quadratically, you should start to pay attention.
But there's more. A lower bound, courtesy of Nesterov, states that momentum is in a certain technical sense optimal. Now this doesn't mean it is the best algorithm under any circumstances for all functions. But it does mean it satisfies some curiously beautiful mathematical properties that scratch a very human itch for perfection and closure. But more on that later. Let's say this for now - momentum is an algorithm for the book.
I think all three of these paragraphs are great! :)
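One last idea, take it or leave it: the "quadratic speedup" claim might land even harder with a small numerical comparison. Here is a sketch of what I mean (the condition number, tolerance, and starting point are mine; the step-sizes are the textbook-optimal settings for a quadratic - Polyak's for momentum, $2/(\mu + L)$ for gradient descent):

```python
import numpy as np

# f(w) = 0.5 * (w_1^2 + kappa * w_2^2): a quadratic with condition number kappa
kappa = 100.0
grad_f = lambda w: np.array([w[0], kappa * w[1]])

def steps_until(tol, alpha, beta, max_steps=100_000):
    """Run momentum (beta > 0) or plain gradient descent (beta = 0)
    and count iterations until the iterate is within tol of the optimum."""
    w = np.array([1.0, 1.0])
    z = np.zeros_like(w)
    for k in range(max_steps):
        z = beta * z + grad_f(w)
        w = w - alpha * z
        if np.linalg.norm(w) < tol:
            return k + 1
    return max_steps

# Plain gradient descent with its optimal fixed step-size, 2 / (mu + L)
gd_steps = steps_until(1e-8, alpha=2 / (1 + kappa), beta=0.0)

# Momentum with Polyak's optimal parameters for a quadratic
sk = np.sqrt(kappa)
mom_steps = steps_until(1e-8, alpha=4 / (1 + sk) ** 2,
                        beta=((sk - 1) / (sk + 1)) ** 2)

print(gd_steps, mom_steps)  # the gap grows roughly like sqrt(kappa)
```

On a quadratic like this, gradient descent needs on the order of $\kappa$ iterations while momentum needs on the order of $\sqrt{\kappa}$, which is exactly the quadratic speedup the paragraph promises.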