asadoughi / stat-learning
Notes and exercise attempts for "An Introduction to Statistical Learning"
Home Page: http://asadoughi.github.io/stat-learning
As mentioned in #3, adding a navigation or index between answers would be useful.
Maybe something with Github Pages?
This does not remove the observations between the 10th and 85th quantiles; it removes rows 10 through 85. newAuto = Auto[-(10:85),] would only work if Auto had 100 entries already sorted. I think you need to sort each quantitative variable and then remove elements 0.10*length(mpg) through 0.85*length(mpg), for instance.
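A minimal sketch of that suggestion for a single variable (mpg here; the index names lo and hi are mine):

n <- length(Auto$mpg)
idx <- order(Auto$mpg)          # row indices sorted by mpg
lo <- ceiling(0.10 * n)
hi <- floor(0.85 * n)
newAuto <- Auto[-idx[lo:hi], ]  # drop observations between the 10th and 85th percentiles of mpg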
Hi, I have a question on 4.4(c): the solution is 100 * (0.1)^100. Why do we multiply by 100 here? In 4.4(b), the answer did not multiply by 2. Thanks!
I agree with the intuition of the argument, but it is not necessarily true that each iteration will produce a split on an unused variable. I.e., if variable X(1) is used for f(1) in the first iteration, then X(1) will not be used in f(2). However, it could be used in f(3). In the end, there can be multiple f(b)'s that use X(j).
The final summation statement is true if f(j) is the sum of all the f-hats that split on variable X(j), i.e., each f(j) = sum[ I(X(j) = Xb) * f-hat(Xb) ] over all the iterations of boosting, and the number of iterations B can be (much) greater than p.
Excuse my poor math notation here
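In cleaner notation, the claim would be that each per-variable function collects every stump that split on that variable (this notation is mine, not the book's):

\begin{gather}
f_{j}(x) = \sum_{b=1}^{B} \mathbb{1}\{\hat{f}^{b} \text{ splits on } X_{j}\}\, \hat{f}^{b}(x),
\end{gather}

and since $B$ can be (much) greater than $p$, several of the $B$ stumps can contribute to the same $f_{j}$.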
p = 2 not 1
Line 63 in 8ccf543
It is quite a minor issue but JFYI.
There are two consecutive "the"s ;)
"as the the sample size"
The answer is wrong.
eps2 = rnorm(100, 0, 0.5)
is the same as before, while the problem requires you to increase the variance.
As reported by @shayan-mj in #3
Exercise 13 of chapter 3 says "create a vector, eps, from a N(0, 0.25) distribution i.e. a normal distribution with mean zero and variance 0.25". So the standard deviation is equal to 0.5. The solution has "eps = rnorm(100, 0, 0.25)", but it should be "eps = rnorm(100, 0, 0.5)", because the rnorm function takes the standard deviation as its argument.
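A quick check that rnorm's third argument is indeed the standard deviation, not the variance:

set.seed(1)
eps = rnorm(100000, 0, 0.5)
var(eps)   # approximately 0.25, i.e. the required N(0, 0.25)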
https://github.com/asadoughi/stat-learning/blob/master/ch2/answers#L45
Line 45 says: "When the training error is lower than the irreducible error, overfitting has taken place." This is not true, since most of the time the training error will be lower than the irreducible error. I don't think training error alone says much about overfitting. Only when the training error keeps decreasing while the test error keeps increasing can we say that we're overfitting.
Thoughts?
On page 14 of ISLR (29 of the PDF) the Boston data set is described as "Housing values and other information about Boston suburbs." I didn't know this when I originally answered these questions, so I made a new suburbs variable based on the towns in the data set. The data set should instead have been used as-is.
I think the answer you gave is confusing and didn't explain the purpose of doing the transformation.
Actually the reason is that you should find the $k$th class that maximizes $p_{k}(x)$. From Bayes' Theorem (4.12) we know that, for any class $k$,
\begin{gather}
p_{k}(x) = \frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)}.
\end{gather}
However, the denominator is the same for every class; only the prior probability $\pi_{k}$ and the density $f_{k}(x)$ vary with $k$. So the objective is to find the largest $\pi_{k} f_{k}(x)$. With the logarithm transformation we get $\log \pi_{k} + \log f_{k}(x)$. In the end, finding the largest $p_{k}(x)$ is equivalent to finding the largest $\delta_{k}(x)$.
Q: Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
Your answer:
From the correlation matrix, I obtained the two most highly correlated pairs and used them in picking my interaction effects. From the p-values, we can see that the interaction between displacement and weight is statistically significant, while the interaction between cylinders and displacement is not.
Interaction is not relevant to the correlation. An interaction measures the influence of one predictor on the effect of another predictor on the response, which is different from correlation.
There are cases where there is no correlation but there is interaction, for example in the Advertising dataset:
a = read.csv("data/Advertising.csv")
pairs(a)
There is no strong correlation between TV and Radio, but there is interaction!
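For instance, a minimal sketch of fitting that interaction (assuming the CSV has the usual ISLR column names sales, TV, and radio):

fit = lm(sales ~ TV * radio, data = a)   # TV * radio expands to TV + radio + TV:radio
summary(fit)                             # the TV:radio row reports the interaction p-value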
In addition, a better way to evaluate the interaction effects is to use .*. in the lm function; it will include all the predictors plus all possible pairwise interactions:
sset = subset(Auto, select=-name)
fit = lm(mpg~.*.,sset)
summary(fit)
Note that the decision to include any interaction term in the final model is under the topic of model selection.
Thanks
As reported by @srodriguez0 here 192d368#commitcomment-8356718
The RSS should be mean((Wage$wage[test] - lm.pred)^2). Change 'Wage$age' to 'Wage$wage'
The answer you gave is rather misleading.
The answer should be simplified. Since we have
\begin{gather}
f_{k}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_{k}} \exp\!\left( -\frac{1}{2\sigma_{k}^{2}} (x-\mu_{k})^{2} \right),
\end{gather}
for each class $k$:
\begin{align}
\delta_{k}(x) &= \log \big( f_{k}(x)\pi_{k} \big) \\
&= -\log (2 \pi)^{1/2} - \log \sigma_{k} - \frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2} + \log \pi_{k} \\
&= - \log \sigma_{k} - \frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2} + \log \pi_{k} + c \\
&= - \log \sigma_{k} + \log \pi_{k} - \frac{1}{2}\sigma_{k}^{-2}(x-\mu_{k})^{2}
\end{align}
Since $\sigma_{k}$ differs across classes, the above formula generates a quadratic term in $x$.
Both results in (a) and (b) reflect the same line created in 11a. In other words, y = 2x + ε could also be written x = 0.5(y − ε).
I think they aren't the same regression line, because they have different slopes. Or?
I've also got the answer to this exercise close to 0.45.
But if cv.glm is used, I get an answer around 0.25, which is confusing:
glm.fit.all <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)  # data/family arguments filled in; this exercise uses the Weekly data
LOOCV_error <- cv.glm(Weekly, glm.fit.all)$delta[1]  # is around 0.25 instead of 0.45
Can anyone clarify this?
Thank you!
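A possible explanation, for what it's worth: cv.glm's default cost is the average squared error between the 0/1 response and the fitted probabilities, not the misclassification rate, which is why delta comes out near 0.25. Supplying a classification cost (this is the cost function shown in the boot documentation) should recover the ~0.45 figure:

library(boot)
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)  # 0/1 misclassification cost
LOOCV_error <- cv.glm(Weekly, glm.fit.all, cost = cost)$delta[1]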
In line 156 there is an unexpected '>'
set.seed(1)
p = 100
n = 1000
max.rep = 1000
x = matrix(ncol = p, nrow = n)
coefi = rep(NA, p)
for (i in 1:p) {
  x[, i] = rnorm(n)
  coefi[i] = rnorm(1) * 100   # true coefficients
}
y = x %*% coefi + rnorm(n)
beta = rep(0, p)              # backfitting starts from beta = 0
error = rep(0, max.rep)
for (j in 1:max.rep) {
  for (i in 1:p) {
    # partial residual: remove the fit of all predictors except the i-th
    a = y - x %*% beta + beta[i] * x[, i]
    beta[i] = lm(a ~ x[, i])$coef[2]
  }
  error[j] = sum((y - x %*% beta)^2)  # RSS after each full sweep
}
error
plot(1:max.rep, error, ylim = c(0, 1))
plot(1:10, error[1:10])
In part (b) of the question, examples 1 and 2 are 'prediction' and not 'inference' as stated in the solution.
I believe the solutions are all shifted by 1 degree of freedom.
For (a) the function should be 0 everywhere, otherwise the regularization term is infinite; for (b) the first derivative has to be 0, so the function needs to be constant; etc.
Unaware of a complete solution. Googling for "ridge regression posterior mean mode" comes up with things like http://ssli.ee.washington.edu/courses/ee511/HW/hw3_solns.pdf. Googling "lasso posterior mode" hasn't rendered anything useful.
Why is (3,4,6,8) clustered and not (3,4,5,6) or (2) and (1,3,4,5,6,7,8)? Because these are closest to each other and have the least socks and computers?
pairs(college[,1:10])
Error in pairs.default(college[, 1:10]) : non-numeric argument to 'pairs'
plot(college$Private, college$Outstate)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
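A likely cause: starting with R 4.0, read.csv defaults to stringsAsFactors = FALSE, so Private is read as a character column. A sketch of the fix:

college$Private = as.factor(college$Private)  # convert the character column to a factor
pairs(college[, 1:10])
plot(college$Private, college$Outstate)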
Isn't the answer to this exercise g(x)=0 instead of g(x)=k?
The SVM pair plots for polynomial and radial kernels are homogeneous and therefore not helpful for assessing the fits.
The slice argument for plot.svm specifies the constants at which the other dimensions are held, that is, which 2-D hyperplane the data are projected onto for visualization. The default is zero, which may give a hyperplane that does not intersect the decision boundary. Setting the slice argument to the medians of the other dimensions ensures the hyperplane the data are projected onto is near the data and intersects the decision boundary. The intersection of the boundary and the hyperplane may not perfectly partition data projected from higher dimensions, but it can indicate, for example, appropriate curvature, or provide ideas to improve the fit.
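A hedged sketch with e1071 (the fit object svm.fit, the data frame dat, and the variable names x1 through x4 are hypothetical):

# Project onto the x1-x2 plane, holding x3 and x4 at their medians
# instead of the default slice at zero.
plot(svm.fit, dat, x1 ~ x2,
     slice = list(x3 = median(dat$x3), x4 = median(dat$x4)))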
"The high p-value of newspaper suggests that the null hypothesis is true for newspaper." is inaccurate, it should be "The high p-value of newspaper suggests that we can't reject the null hypothesis for newspaper."
Could you expand on the process of getting the final result?
I think there might be something wrong with the log operation.
The solution is obviously wrong because the check at the end fails (there should be no variability in this ratio).
The reason is that "scale" in R standardizes columns, not rows (which is counter-intuitive here, but the exercise does tell us to make sure that the sd of each observation, not each variable, is 1).
This can be solved by running dsc = t(scale(t(USArrests)))
instead of dsc = scale(USArrests)
Hey, I think your answer for p2 ch7 is wrong.
(a) Since we are minimizing the area under the curve of g(x)^2, g(x) would just be 0.
(b) Since we are minimizing the area under the curve of g'(x)^2, g'(x) would just be 0, and g(x) = k, where k is some constant.
(c) Since we are minimizing the area under the curve of g''(x)^2, g''(x) would just be 0, and g(x) = ax + b, the least squares straight line through the points.
(d) g'''(x) = 0, so g(x) = ax^2 + bx + c, the least squares quadratic through the points.
(e) The penalty term is dropped, so g(x) is an interpolant that goes through all the points, making RSS = 0.
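For reference, the criterion behind (a) through (e) is

\begin{gather}
\hat{g} = \arg\min_{g} \sum_{i=1}^{n} \big( y_{i} - g(x_{i}) \big)^{2} + \lambda \int \big[ g^{(m)}(x) \big]^{2} \, dx,
\end{gather}

so as $\lambda \to \infty$ any $g$ with $g^{(m)} \neq 0$ somewhere incurs an infinite penalty, and the minimizer is the least squares fit among functions satisfying $g^{(m)} = 0$.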
With R version 4.0.2, it seems I must apply factor() to categorize the character column "Private". Thus, the solution becomes
plot(as.factor(college$Private), college$Outstate)
A parametric approach also allows us to make inferences about how the predictors affect the underlying function.
A non-parametric approach has the disadvantage that it is more difficult to draw inferences about how the predictors affect the function.
Thought it might be a good addition to that answer. Great repository! Thank you for uploading. I am using it to check my answers.
In chapter 3, Sub-Section 3.6.2 Simple Linear Regression on page 112, there is a paragraph that states, to quote,
"we will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions."
plot(lstat, medv)
This code returns an error Error in plot(lstat, medv) : object 'lstat' not found
There is a missing $ in the code plot(lstat, medv).
The correct syntax would be plot(Boston$lstat, Boston$medv), or run attach(Boston) first, as the lab itself does earlier in the section.
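For completeness, a sketch of the full plotting step from that paragraph with the fix applied:

lm.fit <- lm(medv ~ lstat, data = Boston)
plot(Boston$lstat, Boston$medv)
abline(lm.fit)   # add the least squares regression line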
Log-transforming mpg produced a better R-squared and better residual diagnostics.
Here is the R code and output:
lm.fit2<-lm(log(mpg)~cylinders+displacement+horsepower+weight+sqrt(acceleration)+year+origin,data=Auto)
summary(lm.fit2)
par(mfrow=c(2,2))
plot(lm.fit2)
plot(predict(lm.fit2),rstudent(lm.fit2))
Call:
lm(formula = log(mpg) ~ cylinders + displacement + horsepower +
    weight + sqrt(acceleration) + year + origin, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41288 -0.06546  0.00002  0.06837  0.34063

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.829e+00  1.986e-01   9.210  < 2e-16 ***
cylinders          -2.806e-02  1.156e-02  -2.428  0.01562 *
displacement        6.171e-04  2.699e-04   2.287  0.02276 *
horsepower         -1.615e-03  5.008e-04  -3.225  0.00137 **
weight             -2.497e-04  2.373e-05 -10.520  < 2e-16 ***
sqrt(acceleration) -2.344e-02  2.896e-02  -0.810  0.41865
year                2.952e-02  1.823e-03  16.193  < 2e-16 ***
origin              4.068e-02  9.948e-03   4.090 5.26e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.119 on 384 degrees of freedom
Multiple R-squared: 0.8797, Adjusted R-squared: 0.8775
F-statistic: 401 on 7 and 384 DF, p-value: < 2.2e-16
Point (g):
The selected training data (train = (year %% 2 == 0)) leads to the optimal KNN solution at k = 3. The test error in this case is 13.7%.
#######
knn.pred = knn(train.X, test.X, train.mpg01, k = 3)
mean(knn.pred != mpg01.test)
[1] 0.1373626
#######
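For context, a sketch of how the inputs to knn() might have been built (the predictor set here is an assumption, and mpg01 is the indicator created in the earlier parts of this exercise):

library(class)
train = (Auto$year %% 2 == 0)   # even model years as the training set
vars = c("cylinders", "weight", "displacement", "horsepower")
train.X = Auto[train, vars]
test.X = Auto[!train, vars]
train.mpg01 = mpg01[train]
mpg01.test = mpg01[!train]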
The R formula is not correct. It should be y = 1 + x + -2 * (x - 1)^2 * I(x > 1)
In the proposed answer to part (a), 0.2 standard deviations from the optimum is used to demarcate the range of acceptable evaluation metrics. Is there any evidence to support why 0.2? It appears rather arbitrary to me.
The solutions say g1 is expected to have smaller test RSS because of one fewer degree of freedom, but this is not necessarily true. For example, if the true data-generating process is cubic, then g2 will fit better.
Thank you for writing up these solutions! They were very helpful to me.
The answer to 10.7 is incorrect: the observations should be normalized by row rather than by column, which will yield a fixed ratio of 1/6. All that needs to change is your first line:
dsc = t(scale(t(USArrests)))
a = dist(dsc)^2
b = as.dist(1 - cor(t(dsc)))
summary(b/a)
File: https://github.com/asadoughi/stat-learning/blob/master/ch3/applied.Rmd
Answer: 15c.
The text of the answer says "Coefficient for nox is approximately -10 in univariate model and 31 in
multiple regression model.".
But it should be the other way around: "Coefficient for nox is approximately 31 in univariate model and -10 in multiple regression model."
It is not mentioned in the task statement explicitly, but I think it is implied that mpg should be excluded as a predictor.
The solution proposed here uses mpg as a predictor for mpglevel, which makes the model completely useless (if we knew mpg, there would be a much simpler way to get mpglevel).
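A minimal sketch of the fix, assuming mpglevel was built from the median split used in the solution (e1071's svm is just one possible model here):

library(e1071)
Auto$mpglevel = as.factor(ifelse(Auto$mpg > median(Auto$mpg), 1, 0))
fit = svm(mpglevel ~ . - mpg - name, data = Auto, kernel = "linear", cost = 1)  # exclude mpg itself (and name)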
The question requires us to scale each observation to have mean 0 and SD 1, not each predictor, in order to produce the desired proportionality.
Suggested answer:
data(USArrests)
scaled_data = scale(t(USArrests))            # standardizes the columns of t(USArrests), i.e. the observations
correlation = cor(scaled_data)               # correlations between observations
corr.1.minus.r = 1 - as.dist(correlation)
distance.squared = dist(t(scaled_data), method = "euclidean")^2
summary(corr.1.minus.r / distance.squared)
The question requires the use of multiple linear regression.
The predictor "origin" is stored as a num, but it is qualitative data.
Shouldn't it be changed to a factor using the following code before doing the regression?
origin_fac = factor(Auto$origin, levels = c(1, 2, 3), labels = c("American", "European", "Japanese"))
new_data = subset(Auto, select = -c(name, origin))
new_data = data.frame(new_data, origin_fac)
lm.fit = lm(mpg ~ ., data = new_data)
In the Chapter 4 and beyond solutions, I followed a format of using one Rmd file per problem with the knitr package to generate Markdown and HTML files. It would be nice to have a consistent format.
It would be especially nice to have photos of the handwritten notes typed.