Understanding Deep Learning - Simon J.D. Prince

License: Other

udlbook's Introduction

Understanding Deep Learning

Website

# Install dependencies
npm install

# Run the website in development mode
npm run dev

# Build the website
npm run build

# Preview the built website
npm run preview

# Format the code
npm run format

# Lint the code
npm run lint

# Clean the repository
npm run clean

# Prepare to deploy the website
npm run predeploy

# Deploy the website
npm run deploy

udlbook's Issues

Suggestions on paragraph 19.5

Version 06-02-2023
Eq. 19.24: $\frac{\partial Pr(\tau_i | \theta)}{\partial \theta}$ --> $\frac{\partial Pr(\tau| \theta)}{\partial \theta}$

Second line of Eq. 19.25: The same problem of equation 19.24

Check the second line of eq. 19.25: there appears to be an unnecessary ']' and the $\pi$ should be bold

Check eq. 19.31: there appears to be an unnecessary ']'

Minor notation errata, pp. 421-422 (v. 2023-03-03)

In v. 2023-03-03
Page 421, Appendix A, Sets:

"The notation $\{\boldsymbol{\mathrm{x}}_i, \boldsymbol{\mathrm{y}}_i\} _{i=1}^I$ denotes the set of $I$ pairs $x_i, y_i$ ."
Since x and y are vectors and not scalars I think it should be:
"The notation $\{\boldsymbol{\mathrm{x}}_i, \boldsymbol{\mathrm{y}}_i\}^I _{i=1}$ denotes the set of $I$ pairs $\boldsymbol{\mathrm{x}}_i, \boldsymbol{\mathrm{y}}_i$."
In $\{1, 2, 3, ...,\}$ I would remove the last comma: $\{1, 2, 3, ...\}$

Page 422, Appendix A, Sets:

In $\{1, ... K\}$ I would add a comma just before $K$: $\{1, ..., K\}$

Typo in Caption of Fig. 4.1

Version 2023_01_16, Fig 4.1 Caption:
"The first network maps inputs x ∈ [0,1] to outputs y ∈ [0,1]" --> "The first network maps inputs x ∈ [-1,1] to outputs y ∈ [-1,1]"

Citations are not well formatted at page 205 (January 23, 2023 release)

At page 205 citations are not well formatted, it seems there is some problem with the LaTeX commands, see this screenshot:

Minor corrections on chapter 18, version 31-01-23-C

Thanks for releasing the book! I've enjoyed going through it!

Here are some typos that I could spot while reading chapter 18:

page 361, line 1: missing a comma in the formula; should be $q(z_{1:t}, x)$
page 361, second paragraph: $x$ has an extra subscript, should probably be $q(z_t|x)$ instead of $q(z_t | x_t)$
equation 18.14: missing a left bracket
equations 18.16 (last line) and 18.26: $σ^2_t$ should probably be $σ^2_1$
equation 18.19: $Φ_{1...t}$ should probably be $Φ_{1...T}$
section 18.4 onwards: $z_{1:t}$ should probably be $z_{1:T}$
section 18.4 onwards: inconsistent indexing for $z$: previous material uses $z_{1:T}$, but here $z_{1...T}$ is used interchangeably
section 18.4.1, last paragraph: by "modifying itself" should we also understand "optimize the parameters" (as is mentioned in the second item), or is there a subtle difference?
section 18.4.2, first sentence: "minimize" should probably be "maximize"?
algorithm 18.1: what is $q$? I assume it has nothing to do with the variational distribution, right?
algorithm 18.1: I'm somewhat confused by the notation: $i$ and $\mathcal{B}$. I assume the former is the batched dataset? But then $i$ should range over batches?
algorithm 18.2, last line: $g_1$ should probably be $f_1$
section 18.6.1, last paragraph: shouldn't it say that $q(z_{t-1} | z_t)$ becomes closer to normally distributed with larger time steps?
section 18.6.1, last paragraph (minor): I don't know if T should be enclosed in a math environment, i.e., $T$
section 18.6.2, first sentence (minor): I don't know if you want to use the $t$ subscripts for the $α$'s (to match $z_t$)

The following questions might be out of the scope of the book, but I do wonder:

How are the fixed parameters (β, T, σ) chosen in practice?
Can we (efficiently and accurately) estimate $p(x)$ for a given sample? This could be potentially useful for out-of-distribution detection. For example, I know that normalizing flows could in principle be used to estimate the model's probability, but their performance usually suffers in practice [1]. For diffusion models, there is a discussion on this in section 2.3 of [2], but I'm not sure I follow their argument.

References:

[1] Kirichenko, P., Izmailov, P., & Wilson, A. G. (2020). Why normalizing flows fail to detect out-of-distribution data. NeurIPS, 33, 20578-20589.
[2] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015, June). Deep unsupervised learning using nonequilibrium thermodynamics. In ICML (pp. 2256-2265). PMLR.

"Equation 15.4 is a more complex loss function than we have seen before; the discriminator parameters φ are manipulated to minimize the loss function and the generative parameters φ are manipulated to maximize the loss function."
I think the second φ should be θ, if my understanding is correct.

Minor typos

Book looks great so far!

I was reading version 07_10_22_C and spotted the following minor typos.

Eqn 5.4 - last line
missing a "]"

Eqn 6.1
Missing bold \phi under argmin

Eqn 6.5 - top line
\sum_{i=1}^I l_i should be \sum_{i=1}^I {\ell}_i ?

Fig 7.5
Missing indentation for a few lines after the second for loop "for i, data in ..."

A typo

Fig 15.20: "Changing course styles" -> "Changing coarse styles" ? Same for "course" noise.

Typo in problem 16.2

On page 329 in equation (16.27), exp(x^2/2) should be exp(-z^2/2).

A few corrections and comments on chapter 17, version 06-02-23

Hello! I've made a pass through chapter 17—thanks again for the very informative write-up! Here are some small corrections and some questions (you can of course ignore these more general comments if you think they are too distracting from the main point of the book):

equation 17.2: $x$ should be bold, that is, $\mathbf{x}$
equation 17.7, first line: missing a vertical bar to denote conditioning on the parameters, that is, $Pr(\mathbf{x}, \mathbf{z} | \mathbf{Φ})$
figure 17.6, caption: "by either a) improving the ELBO" should be "by improving the ELBO either a) with respect to ... or b) with respect to ..."
figure 17.6: the caption says that "we get closer to the true log likelihood by improving the ELBO [...] with respect to the original parameters φ", but the figure on the right indicates that the gap increases. I would assume that the statement is correct (since, for example, diffusion models close the gap even if they use a fixed variational distribution), but that means that the figure is somewhat deceiving?!
figure 17.12: contains a self reference (that is, the caption mentions "figure17.12", which is incidentally also missing a space).
subsection 17.8.3 mentions "the product $Pr(\mathbf{z}|\mathbf{x})Pr(\mathbf{z})$": I was wondering is there a principled justification of multiplying the prior $P(\mathbf{z})$ with the posterior $P(\mathbf{z}|\mathbf{x})$? It seems a bit unnatural to me from a probabilistic point of view, but maybe this is just motivated from a more pragmatic perspective?!
subsection 17.8.4, equation (17.29):
- the variable $\mathbf{z}$ seems to be unbound—where does it come from?
- are there any particular properties that the functions $r_1$ and $r_2$ should have? Defining the regularization terms as arbitrary functions seemed a bit too general to me (e.g., they can negate their input argument or completely ignore it).
- any intuition on why these two regularization terms encourage disentanglement? Are they related to beta-VAE and total correlation VAE? (It was not immediately clear to me the connection between those concepts.)
- (minor) if $L_\mathrm{new}$ denotes a loss, maybe it would be a bit more precise to use a negative ELBO and positive regularizers?
page 352: Maybe the wording can be improved in the following: "[...] then autoencoder is just [...] PCA. Hence, the autoencoder is a generalization of PCA."; for example, by saying "Hence, a nonlinear autoencoder is a generalization of PCA," or something along these lines?
page 352, second paragraph on "Latent space, prior and posterior": there is an undefined reference
page 353, paragraph on "Posterior collapse", last sentence: there is an undefined reference
page 353, paragraph on "Other problems":
- From the description the "information preference" problem seems very similar to posterior collapse. What is the difference between the two?
- The paragraph cites Chen et al (2017) for InfoVAE, but I wonder whether the correct reference isn't

Zhao, Shengjia, Jiaming Song, and Stefano Ermon. "InfoVAE: Information maximizing variational autoencoders." AAAI (2017).

page 354, paragraph on "Disentangling latent representation": "e.g.," should probably be "etc."?
equation 17.35:
- is the subscript $z$ correct in the expectation? There is no $z$ defined elsewhere.
- on the left-hand side of the equation "f" is written in roman typeface, while on the right it is written in italics; is this correct? I'm not yet sure what the convention is for each of the two variants.

errata_2

On page 423, the paragraph beginning "Pruning can be considered a form of ..." may be missing the symbol "(i)".

page 100 and page 103

Book version 06_02_23_C

On page 100, below equation 7.6, "We aim to compute the derivatives: .... and $\partial y/\partial \omega_4$." Is $\partial y/\partial \omega_4$ a typo?

On page 103, caption of Fig 7.5, should be "... the derivatives $\partial l_i / \partial \beta$ and $\partial l_i/ \partial \omega$", and "... by $\partial l_i/ \partial \beta_k$ or $\partial l_i/ \partial \omega_k$ as appropriate."

P.S.
For the PDF version, is it possible for you to add chapter numbers to the navigation panel? There are places in the book where you say something will be discussed in Chapter XX, and I think it'll be much easier to locate the chapters for PDF readers.

Thank you very much for your great work.

Other suggestions on chapter 19

Version 2023-02-02
Fig. 19.7: "Blue arrow" but the arrow is not blue
Eq. 19.7 --> I think that $\pi[a_t,s_t]$ should be substituted by $\pi[a_t|s_t]$
Eq. 19.11 --> I think it should be $\sum_{a_t}$ and not $\sum_{a}$
Eq. 19.12 --> I think it should be $\arg \max_{a_t}$ and not $\arg \max_{a}$; $r[s_t, a_t]$ should substitute $r[s, a]$

Some more minor issues

Again on version 7-10-22-C5:

p.122: "ADAM" -> "Adam"
p.131: "Randomly which is inefficient": missing punctuation(??)
p.132: Missing reference to an Appendix
p.140: Missing reference to figure in figure 9.5 (probably to fig.8.1)
p.144: "this has smooths out the learned function"
p.145: "maximum-likelihoodcriterion"
p.146: Eq.9.11, change φ to φ' on the integral of the denominator?
p.154: "is not well understood although.." (missing a comma? unsure)

Notation erratum, page 49 (v. 2023-03-03)

In v. 2023-03-03:
Page 49, Sec. 4.4.1:

"the remaining matrices $\Omega_k$ are $D_k \times D_{k-1}$"

I think that following the same notation as the first part of this section it should be:
" $\Omega_k$ are $D_{k+1} \times D_{k}$"
Or
" $\Omega_{k-1}$ are $D_k \times D_{k-1}$"

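The shape convention proposed in this issue can be checked numerically. The following is a minimal sketch, assuming NumPy and hypothetical layer widths; the names `widths`, `omegas`, `betas`, and `forward` are illustrative and do not come from the book. It builds each $\Omega_k$ with shape $D_{k+1} \times D_k$ and confirms that a forward pass only type-checks under that convention.

```python
import numpy as np

# Hypothetical layer widths: input D_i, three hidden layers, output D_o.
widths = [3, 4, 5, 2, 1]

rng = np.random.default_rng(0)
# Omega_k maps layer k (width widths[k]) to layer k+1 (width widths[k+1]),
# so its shape is widths[k+1] x widths[k].
omegas = [rng.standard_normal((widths[k + 1], widths[k]))
          for k in range(len(widths) - 1)]
betas = [rng.standard_normal((widths[k + 1], 1))
         for k in range(len(widths) - 1)]


def forward(x):
    """Forward pass with ReLU on hidden layers; x has shape (D_i, 1)."""
    h = x
    for k, (omega, beta) in enumerate(zip(omegas, betas)):
        f = beta + omega @ h
        # No activation is applied after the final layer.
        h = np.maximum(f, 0.0) if k < len(omegas) - 1 else f
    return h


y = forward(np.ones((3, 1)))
shapes = [omega.shape for omega in omegas]
print(shapes, y.shape)
```

Swapping the two dimensions of any matrix in `omegas` makes the product `omega @ h` fail with a shape error, which is the inconsistency the issue points at.
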
Figure 3.8

I found the explanation for Figure 3.8 to be lacking. I understand that each image in Figure 3.8 a) to e) corresponds to one activation unit, but I do not understand what the gray lines mean or what the brown gradient represents.

An explanation along the following lines would benefit the reader: "For example, in Figure 3.8 a), a gray line is the intersection (?) of one of the hyperplanes parameterized by (\Phi_0, \Phi_1, \Phi_2) with the (x_1, x_2) plane. The color of the image represents the value of the resulting linear combination of parameters and inputs, with warmer colors corresponding to more positive values."

Some suggestions on Chapter 19

Version 06-02-2023

Eq. 19.12 --> I think it should be $r[s_t,a_t]$ and not $r[s,a]$

§19.4 --> Just a comment on the style "The principle of fitted Q-Learning is... . This is known as fitted Q-Learning" --> The repetition of "fitted Q-Learning" looks strange IMHO.

Fig. 19.2 --> "It does not slip on the ice and moves downward" I think that it should be: "It does slip on the ice and moves downward" instead of going left.

Eq. 19.15 : $\max_a [ q[s_{t+1},a_{t+1}] ]$ --> $\max_{a_{t+1}} [ q[s_{t+1},a_{t+1}] ]$. Please note that some authors, e.g. Sutton-Barto, use another formalism: $\max_a [ q[s_{t+1},a] ]$, where $a$ indicates a generic action. It is up to you to decide which formalism to choose.

Fig 19.12 --> the same "problem" of Eq. 19.15

Eq. 19.16 --> the same "problem" of Eq. 19.15

Eq. 19.17 --> the same "problem" of Eq. 19.15

Text below Eq. 19.17 --> the same "problem" of Eq. 19.15

§19.4.1 (book page 394) : values $\phi^-)$ --> values $\phi^-$

Eq. 19.18 --> the same "problem" of Eq. 19.15

Eq. 19.19 --> the same "problem" of Eq. 19.15

Eq. 19.20 --> the same "problem" of Eq. 19.15 but for $\arg \max$

Eq. 19.21 --> the same "problem" of Eq. 19.15 but for $\arg \max$

Page 396: "DQN se deep networks" --> "DQNs use deep networks"
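
Both notations discussed for Eq. 19.15 describe the same computation: the maximum is taken over the actions available in the next state. The sketch below shows a standard tabular Q-learning update under that reading. It is an illustration only; the table `q`, the step size `alpha`, the discount `gamma`, and the function name are hypothetical and not taken from the book.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; q is a list of per-state action-value lists."""
    # The maximum ranges over the actions in the *next* state, whether the
    # index is written a_{t+1} or as a generic action a.
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q


# Hypothetical table with 2 states and 2 actions.
q = [[0.0, 0.0], [1.0, 2.0]]
q = q_learning_update(q, s=0, a=1, r=1.0, s_next=1)
print(q[0][1])
```
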

Possible typo in equation 5.31 (dx instead of dy)

Hey, it's Roy the undergraduate student from your previous LinkedIn post.

I think there is a typo (dx instead of dy) in equation 5.31 but I may be wrong.

And I would like to ask a small question, how did we get from the first expression in the equation to the second expression?

Thanks!

Small erratum page 22 (v. 2023-03-03)

In v. 2023-03-03:
Page 22, Sec. 2.2.4, Line 4: "However, it also depends on ~~the~~ how expressive the model is."

Minor typo on page 437 (Appendix B)

Under section B.3.3 it says "When the mean of a multivariate normal in x is a linear function Az + b of a second variable y," -- I think this should be Ay + b.

Typo in figure 3.8

I believe in figure 3.8, there seems to be a small typo.

g-h) The clipped planes are then weighted

should be:

g-i) The clipped planes are then weighted

Missing hyphens? etc

In pages 27 and 59 (v. 7-10-22-C-5) the word "multidimensional" is written without a hyphen, right before eq. 3.11.
Also, in page 90, you write "one dimensional", where in most places this is 1D.

These are very minor of course, I'm just letting you know in case you want to normalize w.r.t. the norm throughout the text.

Section 5.6 regression vs classification

"For example, we might want to predict a molecules melting and boiling point (a multivariate classification problem, figure 1.2b) or the object class at every point in an image (a multivariate classification problem, figure 1.4a)"

The molecule melting and boiling point is a regression model I think. Should this be replaced by "(a multivariate regression problem, figure 1.2b)"?

Wrong colors in text of fig 8.5b p. 119

In fig 8.5 Sources of test error, subplot b) text says:

"b) Bias. Even with the best possible parameters, the three-region model (brown line) cannot fit the true function (cyan line) exactly. (...) "

To match the colors in the figure it should say:

"b) Bias. Even with the best possible parameters, the three-region model (cyan line) cannot fit the true function (black line) exactly."

Minor errata, pp. 56-60 & 426 (v. 2023-03-03)

In v. 2023-03-03
Page 56, Loss functions:

For binary classification, I think " $y \in [0, 1]$ " should be " $y \in \{0, 1\}$ " because it's a set of two integers (or classes) not an interval of all possible real values between 0 and 1.
For multiclass classification, I think " $y \in [1, 2, ..., K]$ " should be " $y \in \{1, 2, ..., K\}$ " for the same reason

Page 57, figure 5.1 :

" $y \in \mathcal{R}$ " should be " $y \in \mathbb{R}$ "

Page 58, Sec. 5.1.1 :
I'm probably wrong about these, but :

"on the output domain $\boldsymbol{\mathrm{y}}$ ", "the prediction domain is $y \in \mathbb{R}$ " and " which is defined on $y \in \mathbb{R}$ " seem a little odd to me, because the definition domain is $\mathbb{R}$ (or $\mathbb{R}^D$ in the first one) not $y$ (or $\boldsymbol{\mathrm{y}}$). The only case where I would put $y$ is if I wrote it as a set $\{y \in \mathbb{R}\}$.
"The machine learning model $\boldsymbol{\mathrm{f}}[\boldsymbol{\mathrm{x}}, \boldsymbol{\phi}]$ ", are you always talking about a univariate problem ? If it's the case then it should be " $\mathrm{f}[\boldsymbol{\mathrm{x}}, \boldsymbol{\phi}]$ ".
In the footnote "As a function of $\phi$ " should be "As a function of $\psi$ "

Page 60, equation 5.5 :

I believe $\phi$ needs a hat $\hat{\phi}$ because it was estimated by minimizing the negative log likelihood (like in equation 5.12).

Page 62, equation 5.12 :

I believe $\mathrm{f}$ shouldn't be bold because it returns a scalar: $\mu$ or as explained later $\hat{y}$

Page 426, B.1.5 :

"If the value of the random variable ~~variable~~ $y$ "

Page 107 - semantic error

training algorithms -> training samples

About Algorithm 18.1: Diffusion model training

Great book!
I have a question about Algorithm 18.1
I can not find the first part of Equation 18.37 in Algorithm 18.1
Thank you!

Missing citation (version 2023-01-24)

Page 138, section 9.2, Implicit regularization: the first sentence talks about "An intriguing recent finding..." which calls for a citation.

partial derivatives is inconsistent in the equations

The order of partial derivatives is inconsistent in the equations. While the partial derivatives themselves are correct, they are arranged in different orders. In equations 7.11 and 7.12, when we apply the chain rule, the first term for the loss with respect to the pre-activation starts on the left, and each subsequent term goes on the right, i.e., the terms are ordered from left to right. However, in equations 7.17-7.19, the order is from right to left.

EQ 7.12

EQ 7.18 and 7.19

This may be due to the order of matrix multiplication
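
The effect of matrix multiplication order on how chain-rule factors are arranged can be illustrated with a small example. This is a sketch under stated assumptions, not a reproduction of equations 7.17-7.19: it uses two hypothetical linear layers `A` and `B` and a linear loss vector `c`, and compares the two common layout conventions for vector derivatives. With scalars the factors commute, so any order works; with matrices the order is fixed by the layout.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # f1 = A @ x
B = rng.standard_normal((2, 4))   # f2 = B @ f1
c = rng.standard_normal((2, 1))   # loss = c.T @ f2 (a scalar)

# Denominator layout (derivative of a vector by a vector is D_in x D_out):
# factors run from the input on the left to the loss on the right.
grad_denominator = A.T @ B.T @ c          # shape (3, 1)

# Numerator layout (Jacobians are D_out x D_in):
# factors run from the loss on the left to the input on the right.
grad_numerator = c.T @ B @ A              # shape (1, 3)

# Both conventions give the same numbers, up to a transpose.
same = np.allclose(grad_denominator, grad_numerator.T)
print(same)
```
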

Missing "personal" in equation 12.15

The word "personal" is not present in any term of the factorization shown in 12.15.

Note that the next paragraph (§12.7.2) also does not consider the word "personal" when it talks about the right context.

Equation 16.19 and Figure 16.8

Ver 8-3-2023

Please verify the Equation 16.19 and Figure 16.8

Figure 16.8
Figure 16.8 (a) --> it should be $f_2[h_1', \phi_2]$
Figure 16.8 (b) --> it should be $f_2[h_1', \phi_2]$

Equation 16.19
$h_2 = h_2' - f_2[h_1',\phi_2]$
$h_1 = h_1' - f_1[h_2,\phi_1]$

Minor typo Ch9 Notes

UnderstandingDeepLearning_03_03_23_C.pdf Chapter 9 Page 157

The paragraph preceding equation 9.18 contains the text "the second term on the right-hand side must equal zero" but I think it was meant to say "the third term on the right-hand side must equal zero"

Thank you for writing this book!

Miscellaneous minor typos

Page 252. There is an extra closing bracket ] in Eq. (13.4)

Page 292 (Sec 15.3) missing a full stop at the end of the sentence beginning with “Mini-batch discrimination ensures”

Page 318 (Sec 16.3.5) there is a sentence that reads “This requires a different formuation of normalizing flows that learns to from another function rather than a set of samples” – typo in formulation and “to from” should be just “from” I think.

Typo in fig. 12.9

Version 2023_01_31

Fig 12.9: The vocabulary is indicated as $\Omega_v$ in the figure but in the text and in the caption $\Omega_e$ is used to refer to it.

Figure reference error & other minor errata, page 437 (v. 2023-03-03)

v. 2023-03-03
Page 437, Appendix C

Section C.3.1:

problem with the exponential function figure reference.
problem with the logarithm function figure reference.

Section C.3.2, about Stirling’s formula, is still empty.

Section C.3.3:

Equation C.3 : Two different font styles for the function f without explanation (I assume the second one is the complex conjugate).
Missing space in "(i.e.the time lag)." should be "(i.e. the time lag)."

Typo in Fig 19.3 and doubt on Fig 19.4

Version 31_01_23_C
The caption of Fig 19.3 mentions "blue arrows" but the arrows in the figure don't look blue.

In Fig 19.4, policy a) doesn't lead the agent to the reward as indicated in the caption ("This policy generally steers the penguin from top-left to bottom-right where the reward lies"). The actions in states 4 and 8 should be changed to "down".

Also, policy b) doesn't reach the goal from some initial states. Is this the desired behavior?

Fig 5.1 and 5.11

Just a couple of minor issues I think I've spotted:

Figure 5.1b): x axis of top right figure should run from 0 to 10 instead of 0 to 2, as in 5.1a) (or the probabilities below should read something like P(y | x = 0.4) and P(y | x = 1.4) )

Figure 5.1d): same as 5.1b)

Figure 5.11: the text in the grey rows is cut off at the bottom (makes [] look like the ceiling function ⌈⌉ )

Thanks for this great book, it's very helpful.

Note 1 on page 16

"In fact, this iterative approach is not necessary for the linear regression model."
I would remove "In fact" because the text related to the note doesn't demonstrate the presence of a closed-form solution

Missing slides for newest chapters

Hyperlinks from the website to the slides for chapters 10, 11, 12, and 13 are missing.

Suggestion: Cite figure where ReLU is depicted

Version 2023_01_16, PDF Page 46:
"Many different activation functions have been tried (figure 3.13), but the most common choice is the ReLU" -->
"Many different activation functions have been tried (figure 3.13), but the most common choice is the ReLU (figure 3.1)"

Some suggestions and doubts for Chapter 16

Version 2023-03-03
Page 306: "Consider applying a function $x = f[z, \phi]$ to a base density $Pr(z)$, where $z \in \mathbb{R}^D$ and $f[z,\phi]$ is a deep network": I suggest to rephrase as: "Consider applying a function $x = f[z, \phi]$ to a random variable $z \in \mathbb{R}^D$ with a known density $Pr(z)$ and $f[z,\phi]$ is a neural network"

Equation 16.8 --> It looks like the $-\log [Pr(z_i)]$ should be inside the square "]"
Equation 16.8 --> the term $\phi$ on argmax and argmin is missing
Figure 16.18 (a) --> it should be $f_2[h_1', \phi_2]$
Figure 16.18 (b) --> it should be $f_2[h_1', \phi_2]$
Equation 16.19 --> assuming right the equation 16.18, it should be $h_2 = f_2[h_1', \phi_2] - h_2'$
Equation 16.19 --> assuming right the equation 16.18, it should be $h_1 = f_1[h_2, \phi_1] - h_1'$

Before Equation 16.23 --> I would emphasize that the computation of the trace could be computationally expensive and for this reason we need an estimation that can be done using Hutchinson's trace estimator (https://people.cs.umass.edu/~cmusco/personal_site/pdfs/hutchplusplus50.pdf).
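
For readers unfamiliar with Hutchinson's trace estimator, the idea is that $\mathrm{tr}[\mathbf{A}] = \mathbb{E}[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}]$ for any random vector $\boldsymbol{\epsilon}$ with zero mean and identity covariance, so the trace can be estimated from matrix-vector products alone. The following is a minimal sketch assuming NumPy; the function name, the random test matrix, and the sample count are illustrative choices, not anything specified in the book.

```python
import numpy as np


def hutchinson_trace(matvec, dim, num_samples, rng):
    """Estimate tr(A) using only products A @ v."""
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: entries are +1 or -1 with equal probability.
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ matvec(v)
    return total / num_samples


rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))   # hypothetical stand-in for a Jacobian
estimate = hutchinson_trace(lambda v: A @ v, dim=50, num_samples=2000, rng=rng)
exact = np.trace(A)
print(estimate, exact)
```

The estimate is noisy and improves as `num_samples` grows. In a flow model the product `A @ v` would typically come from automatic differentiation rather than an explicit matrix, which is where the saving arises.
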

Figure 16.10: Is it possible to also create a figure for the normalizing direction? The description in the text is not entirely clear to me. In my understanding, after the first inverse mapping using $f_4^{-1}$, I can remove the last part, which is already $z^4$, and continue with the remaining part. Am I wrong?

page 318, Footnote 2 --> Could you briefly explain what you mean by "better"? Do you mean that the quality of the samples is not as good as in the other approaches?

Paragraph 16.5.3 --> Figure 16.4 and the text in §16.5.3 do not look consistent because they use $z$ and $x$ differently. Moreover, the figure doesn't use q. Is this intentional?

Page 114 - spelling

fforward pass -> forward pass

Minor issue on color of circles in Fig. 2.3

In the caption of Fig. 2.3, "The three circles represent the three lines from figure 2.2b-d": the two circles related to 2.2b and 2.2d have the same color as the corresponding lines, but the circle related to 2.2c has a different color.

Minor typos in 10.2.5 on page 167 and the side column on page 176

The second line in the second paragraph on page 167: change "thes hidden units" to "the hidden units"?
Maybe change "Problems 10.17 - 10.16" to "Problems 10.16 - 10.17" on page 176?

Your book is a fantastic exposition of deep learning, certainly the best I have ever read on the subject. Thank you so much for your work!

Page 112 code indentation

There is a small indentation issue in the latest version. The for i, data in enumerate(data_loader): statement should have the same indentation as epoch_loss = 0.0
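
The intended loop structure can be sketched as follows. This is not the book's PyTorch listing: `data_loader` is a hypothetical stand-in (a plain list of batches), and the toy one-weight model and hand-written gradient are assumptions made so the example runs without any dependencies. The point is only that the inner `for` sits at the same level as `epoch_loss = 0.0`, inside the epoch loop.

```python
# Hypothetical stand-in for a DataLoader: a list of (inputs, targets) batches.
data_loader = [([1.0, 2.0], [2.0, 4.0]), ([3.0], [6.0])]
w = 0.0      # single weight of a toy model y = w * x
lr = 0.05    # learning rate

for epoch in range(100):
    epoch_loss = 0.0
    # Same indentation as `epoch_loss = 0.0`: both belong to the epoch loop body.
    for i, data in enumerate(data_loader):
        x_batch, y_batch = data
        # Gradient of the summed squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(x_batch, y_batch))
        w -= lr * grad
        epoch_loss += sum((w * x - y) ** 2 for x, y in zip(x_batch, y_batch))

print(w, epoch_loss)
```
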

Some missing citations (?) at page 350 (January 23, 2023 release)

There are two question marks at page 350, see following screenshot:

errata_1

In the "Notes" at page 403, there may be a grammatical error in the first sentence.

typo in page 147, section 9.3.3

In the second paragraph of section 9.3.3 "Dropout", a word "kinks" occurs. I guess it might be "links".

Page 434 Figure B.3

The figure text '...is used to show that the Kullback-Leibler divergence is always greater than one'. The KL divergence should be always greater than zero.
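
A quick numerical check supports this: the divergence is non-negative, equals zero when the two distributions coincide, and can easily be smaller than one. The sketch below uses two hypothetical discrete distributions `p` and `q` chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """KL divergence D_KL[p || q] between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
kl_pq = kl_divergence(p, q)
kl_pp = kl_divergence(p, p)
print(kl_pq, kl_pp)
```
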

Expanding the explanation of backpropagation equations

Hey Prof. Simon, it's Roy Amoyal the undergraduate student again.

Something I have noticed, both in the Deep Learning class at my university and in the book, is that backpropagation is the most "mathematically challenging" part for most students. Unfortunately, when most lecturers explain the backpropagation math, they skip a reminder of the chain rule and a simple illustration using the neural network notation. While this may be obvious to some students, for some reason most of the students (including me) lost track of the lecturers at that part.

Although you give a reference to the "Matrix calculus Appendix C.2" next to equation 7.6, I think something like the next equations can be quite helpful and even dramatically change the understanding of the backpropagation process before jumping into equation 7.6.

For example:
Equation 7.5:

Reminder of the chain rule (composition of 3 functions):
(self-note: I think it's better to remind with 3 functions instead of just 2)
If m(t)=(f∘g∘k)(t)=f(g(k(t))), then m′(t)=f′(g(k(t)))⋅g′(k(t))⋅k′(t).

Because (***self-note: here I think it is really important to explicitly write the equation with the substitutions ***):
(loss function) li = l[f3, yi] = l(b3 + Ω3⋅h3,yi) = l(b3 + Ω3⋅a[f2],yi) = l(b3 + Ω3⋅a[b2 + Ω2⋅h2],yi)
(substituting the expressions from equation 7.5) and because h3 is a function of f2 (h3(x)=a[x], and in this case x=f2, so h3=a[f2]), just like in the reminder, where k(t) is the innermost function inside f and g (composition), we get:

Because we are doing the "backpropagation", we first calculate the derivative of the most "inner" function of the neural network, in our case h3 = a[f2] so we first calculate:

and then we keep calculating the expressions backward.

What do you think?
It could be really helpful to understand this topic better and faster.

Thanks, Roy.
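
The three-function chain rule in the reminder above can be verified numerically. This is a small sketch in which `f`, `g`, and `k` are hypothetical scalar functions (sine, exponential, and squaring) chosen only for illustration; the analytic derivative is compared against a central finite difference.

```python
import math

# Hypothetical scalar functions standing in for f, g, and k,
# each paired with its derivative.
f, df = math.sin, math.cos
g, dg = math.exp, math.exp
k, dk = (lambda t: t ** 2), (lambda t: 2 * t)


def m(t):
    """Composition m(t) = f(g(k(t)))."""
    return f(g(k(t)))


def dm(t):
    # Chain rule: derivative of the outermost function first, then inward.
    return df(g(k(t))) * dg(k(t)) * dk(t)


t, eps = 0.5, 1e-6
finite_difference = (m(t + eps) - m(t - eps)) / (2 * eps)
agrees = abs(dm(t) - finite_difference) < 1e-6
print(agrees)
```
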