I'd like to give some thoughts and suggest some changes to the introductory text. If you like them, I can do more. If not, feel free to disregard. :)
[Diagram and title]
- I suggest moving the diagram to the top.
If you've spent some time optimizing smooth functions, chances are you've met this old nemesis of optimization. The condition has many names, but it always borrows the language of pathology: ill-conditioning, pathological curvature, dead gradients.
Minor nitpicks:
- "this optimizer's old nemesis" -> "this old nemesis of optimization"
- fix spacing around colon.
The problem manifests when minimizing a function using gradient descent, and its symptoms are somewhat subtle. Your choice of step-size seems correct. The gradients don't blow up. There is no division by zero, no NaNs, no square rooting of minus one. In fact, things often begin quite well - with an impressive, almost immediate decrease in the loss. But as the iterations progress, you start to get a nagging feeling you're not making as much progress as you should be. You're iterating hard, but the loss isn't getting smaller. Should you keep iterating, and hope for the best?
I've cut out the gradient descent update rule and related variables here. I think you included them here to introduce the notation, but they broke the flow of this paragraph without adding much. I think they're better introduced slightly later, when you introduce momentum.
(note: you also had a variable mix up where you used x instead of w)
The problem could be pathological curvature. We've all seen a certain picture, in 2D, of what this looks like. The landscapes are often described as valleys, trenches, canals, ravines. In these steep valleys, gradient descent fumbles. All progress along certain directions grinds close to a halt. Optimization approaches the optimum only in small, tedious steps.
- No changes except avoiding w, since it hasn't been introduced.
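One more suggestion for this paragraph: a concrete function might anchor the picture. The canonical example is a quadratic with very different curvatures along different axes (the constant 100 below is my arbitrary choice; anything large works, and I've used $x, y$ to keep avoiding $w$):

$$f(x, y) = \tfrac{1}{2}\left(x^2 + 100\, y^2\right)$$

Gradient descent has to keep its step small enough for the steep $y$ direction, so progress along the flat $x$ direction slows to a crawl.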
[no diagram because it has been moved to top]
As discussed, this doesn't integrate super well here. I think it's better moved to the top.
There is a simple tweak to gradient descent, which TensorFlow calls momentum, that makes things work much better!
Break this line out of the paragraph, so that we can focus the next paragraph on a chunk of technical introduction.
(I suspect there is a better reference for momentum than TensorFlow. The ML community has been calling it that for a long time.)
Suppose we have a function $f(w)$ and want to optimize over $w$ to minimize $f$. In gradient descent, we start with an initial guess, $w^0$, and iteratively improve it using the update rule:
$$w^{k+1} = w^k-\alpha\nabla f(w^k)$$
where $\alpha$ is the step-size. Momentum changes the update rule to:
$$
\begin{aligned}
z^{k+1}&=\beta z^{k}+\nabla f(w^{k})\\
w^{k+1}&=w^{k}-\alpha z^{k+1}.
\end{aligned}
$$
where $z^k$ is an auxiliary sequence and $\beta$ is an extra parameter. When $\beta = 0$, we recover gradient descent. But for $\beta = 0.99$ (sometimes $0.999$, if things are really bad), the situation improves quite dramatically.
We're using this paragraph to formally introduce both regular gradient descent and momentum. I think this fits better.
Something I haven't done, but I think you should seriously consider, is adding a line giving intuition about what momentum is. Something like "If gradient descent is a person walking downhill in the steepest direction, momentum is a boulder rolling downhill." You might also give intuition for the auxiliary sequence by saying something like "where $z^k$ is an auxiliary sequence representing the speed we are presently moving at" or something like that.
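You could even go one step further and make the update rule concrete with a few lines of code. A minimal sketch of what I have in mind (the toy objective, step-size, and iteration count are all invented for illustration):

```python
import numpy as np

def momentum_descent(grad_f, w0, alpha=0.002, beta=0.9, n_steps=500):
    """Gradient descent with momentum; beta = 0 recovers plain gradient descent."""
    w = np.asarray(w0, dtype=float)
    z = np.zeros_like(w)
    for _ in range(n_steps):
        z = beta * z + grad_f(w)   # z^{k+1} = beta * z^k + grad f(w^k)
        w = w - alpha * z          # w^{k+1} = w^k - alpha * z^{k+1}
    return w

# Toy usage: minimize the ill-conditioned quadratic f(w) = 0.5 * (w_1^2 + 100 * w_2^2)
grad_f = lambda w: np.array([w[0], 100.0 * w[1]])
print(momentum_descent(grad_f, w0=[1.0, 1.0]))
```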
Optimizers call this minor miracle "acceleration".
This tiny modification may seem like a cheap hack. A simple trick to get around gradient descent's more aberrant behavior - a smoother for oscillations between steep canyons. But the truth, if anything, is the other way round. It is gradient descent which is the hack. First, momentum gives about a quadratic speedup on convex functions. This is no small matter - this is the speedup you get from the Fast Fourier Transform and Grover's Algorithm. When the universe speeds things up for you quadratically, you should start to pay attention.
But there's more. A lower bound, courtesy of Nesterov, states that momentum is in a certain technical sense optimal. Now this doesn't mean it is the best algorithm under any circumstances for all functions. But it does mean it satisfies some curiously beautiful mathematical properties that scratch a very human itch for perfection and closure. But more on that later. Let's say this for now - momentum is an algorithm for the book.
I think all three of these paragraphs are great! :)
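One last idea, take it or leave it: the "quadratic speedup" claim might land even harder with a small numerical comparison. Here is a sketch of what I mean (the condition number, tolerance, and starting point are mine; the step-sizes are the textbook-optimal settings for a quadratic - Polyak's for momentum, $2/(\mu + L)$ for gradient descent):

```python
import numpy as np

# f(w) = 0.5 * (w_1^2 + kappa * w_2^2): a quadratic with condition number kappa
kappa = 100.0
grad_f = lambda w: np.array([w[0], kappa * w[1]])

def steps_until(tol, alpha, beta, max_steps=100_000):
    """Run momentum (beta > 0) or plain gradient descent (beta = 0)
    and count iterations until the iterate is within tol of the optimum."""
    w = np.array([1.0, 1.0])
    z = np.zeros_like(w)
    for k in range(max_steps):
        z = beta * z + grad_f(w)
        w = w - alpha * z
        if np.linalg.norm(w) < tol:
            return k + 1
    return max_steps

# Plain gradient descent with its optimal fixed step-size, 2 / (mu + L)
gd_steps = steps_until(1e-8, alpha=2 / (1 + kappa), beta=0.0)

# Momentum with Polyak's optimal parameters for a quadratic
sk = np.sqrt(kappa)
mom_steps = steps_until(1e-8, alpha=4 / (1 + sk) ** 2,
                        beta=((sk - 1) / (sk + 1)) ** 2)

print(gd_steps, mom_steps)  # the gap grows roughly like sqrt(kappa)
```

On a quadratic like this, gradient descent needs on the order of $\kappa$ iterations while momentum needs on the order of $\sqrt{\kappa}$, which is exactly the quadratic speedup the paragraph promises.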