Here you can find all information and files for the practicals of the elective master's course Data Analysis and Visualisation at Utrecht University (course code 201600038 in Osiris).
You are going to be working inside the practicals folder. Download the folder and unzip it to a sensible location on your computer.
| # | Name | HTML | Answers |
|---|---|---|---|
| 01 | R basics for DAV | .html | |
| 02 | Data manipulation & EDA | .html | Answers |
| 03 | Data Visualisation using ggplot2 | .html | Answers |
| 04 | Assignment EDA | .html | |
| 05 | Supervised learning: Regression 1 | .html | Answers |
| 06 | Supervised learning: Regression 2 | .html | Answers |
| 07 | Supervised learning: Regression 3 | .html | Answers |
| 08 | Supervised learning: Classification 1 | .html | Answers |
| 09 | Supervised learning: Classification 2 | .html | Answers |
| 10 | Assignment Prediction Model | .html | |
| 11 | Unsupervised learning: PCA & Correspondence Analysis | .html | Answers |
| 12 | Unsupervised learning: Clustering | .html | Answers |
- Install `R` and RStudio Desktop (open source) by following the instructions here
- If you don't yet have a TeX distribution, run the following within `RStudio`:

      install.packages("tinytex")
      library(tinytex)
      install_tinytex()

If you have no experience with `R` or another programming language, you will need to catch up before the course starts and while it runs. This is not an introductory course on programming with R, but a course on data analysis and visualisation.
Some good sources are:
- The first two chapters of introduction to R on datacamp
- Install `R`, play around, and read the workflow basics chapter in Hadley Wickham's R for Data Science
- Interactive R course: install `R` as in the previous point and in the console type the following lines one by one

      install.packages("swirl")
      library(swirl)
      swirl()

  and follow the guide to run the `R Programming: The basics of programming in R` interactive course.

The following is the minimum of what you should know about `R` before starting with the first practical:
- What is `R` (a fancy calculator) and what is an `.R` file (a recipe for calculations)
- What is an `R` package (a set of functions you can download to use in your own code)
- How to run `R` code in `RStudio`
- What is a variable `x <- 10`
- What is a function `y <- fun(x = 10)`
- Understand what the following statements do (tip: you may run them in `R` line by line)

      y <- "Let him go!"
      x <- "Bismillah!"
      z <- paste(x, "No, we will not let you go.", y)
      rep(z, 3)
      1:10
      sample(1:20, 4)
      sample(1:20, 40, replace = TRUE)
      z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
      z^2
      z == 2
      z > 2
      install.packages("dplyr")
      library(dplyr)

- Be able to read the help file of any function (e.g., type `?plot` in the console)

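To make the variable and function items above concrete, here is a minimal sketch. The function `fun` is hypothetical (it is not part of `R` or of the course material); it simply doubles its argument and adds one.

```r
# Hypothetical function `fun`: doubles its input and adds one.
fun <- function(x) {
  2 * x + 1
}

x <- 10          # a variable holding the number 10
y <- fun(x = x)  # call the function, passing the variable as the argument x
y                # printing y shows 21
```

Running the lines one by one in the console shows how a value is stored, passed to a function, and returned.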
Anything written in italic font is optional/extra material. You can look those up by yourself if you have extra time.
- R basics for DAV
  - `R` and `RStudio`
  - Project organisation
  - Help files using `?`, CRAN, and internet search
  - R Markdown
  - The `ISLR` package (datasets from James ISLR)
  - The `tidyverse` as a dialect of the `R` language (Wickham R4DS)
  - The google style guide or tidyverse style guide (ISLR does not follow these)
  - R packages on GitHub
- Data manipulation & exploratory data analysis
  - Data types: `character`, `numeric`, `factor`
  - Lists
  - Loading datasets from `.csv` or `.xlsx` (or other formats with `haven`)
  - `data.frame()` and `tibble()`
  - `View()`, `head()`, `tail()`
  - `summary()`
  - `filter()`, `select()`, and `mutate()` from `dplyr`
  - `bind_rows()`, `bind_cols()`
  - missing values (`na.omit`)
  - `group_by()` and `summarise()` from `dplyr`
  - the pipe operator `%>%`
  - `table()`
  - dplyr cheatsheet
  - wide to long format: `gather` and `spread`
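
As an illustration of how the `dplyr` verbs and the pipe listed for this practical fit together, here is a small sketch. It assumes `dplyr` is installed and uses the built-in `mtcars` data rather than a course dataset; the 100 hp threshold and the column names `model` and `kpl` are arbitrary choices for the example.

```r
library(dplyr)

# mtcars ships with R; the car names are stored as row names,
# so move them into a regular column while converting to a tibble.
cars <- as_tibble(mtcars, rownames = "model")

cars_summary <- cars %>%
  filter(hp > 100) %>%                       # keep rows: more than 100 horsepower
  select(model, mpg, cyl, wt) %>%            # keep only these columns
  mutate(kpl = mpg * 0.4251) %>%             # new column: kilometres per litre
  group_by(cyl) %>%                          # one group per number of cylinders
  summarise(n = n(), mean_kpl = mean(kpl))   # one summary row per group

cars_summary
```

Each verb takes a data frame and returns a data frame, which is what makes chaining them with `%>%` possible.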
- Data Visualisation using ggplot2
  - Preparing data for a `ggplot()` call
  - What is a `ggplot` object and how to construct it
  - Aesthetics: `x`, `y`, `size`, `colour`, `fill`
  - `geom_point()`, `geom_line()`, `geom_bar()`
  - Labels, limits
  - `geom_boxplot()`, `geom_density()`
  - themes (`ggthemes`?)
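
The following sketch shows one way the pieces named in this practical (data, aesthetics, a geom, labels, limits, and a theme) combine into a single `ggplot` object. It assumes `ggplot2` is installed and uses the built-in `mtcars` data; the choice of variables and limits is only for illustration.

```r
library(ggplot2)

# Build the plot layer by layer; nothing is drawn until the object is printed.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                        # one point per car
  labs(x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       colour = "Cylinders") +                  # axis and legend labels
  xlim(1, 6) +                                  # x-axis limits
  theme_minimal()                               # a built-in theme

p  # printing the object draws the plot
```

Because the plot is an ordinary object, it can be stored, extended with further layers (for example `p + geom_line()`), or saved with `ggsave()`.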
- HANDIN: Pass / Fail assignment
  - Find a dataset and create an Exploratory Data Analysis
  - Tip: The new Google dataset search.
  - Format: stand-alone `RStudio` project folder with:
    - the dataset (`csv`, `xlsx`, `sav`, `dat`, `json`, or any other common format)
    - one `.Rmd` notebook file
    - a compiled `.pdf` or `.html`
  - Requirements:
    - explain the dataset in 1 or 2 paragraphs
    - use `tidyverse`
    - clean, legible `R` code (preferably following the google style guide)
    - table(s) with relevant summary statistics
    - descriptive plots
    - explain what you did and why (max 3 paragraphs total)
- Supervised learning: Regression 1
  - `lm()`, the `formula` object, the `lm` object and its methods (`print()`, `summary()`, `coef()`, `plot()`)
  - Regression lines in `ggplot` with uncertainty
  - Linear regression with multiple variables, interaction effects
  - Model assessment:
    - Train/test split
    - Mean square error calculation (`predict()`)
    - AIC, BIC
    - Bias/variance tradeoff
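
A minimal sketch of the workflow these topics describe: fit a linear model with an interaction on a training set, then assess it on held-out data. It uses only base `R` and the built-in `mtcars` data; the 70/30 split, the seed, and the choice of predictors are assumptions made for the example.

```r
set.seed(45)  # arbitrary seed, only so the split is reproducible

# Train/test split: 70% of the rows for training (an arbitrary choice)
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Linear regression with two predictors and their interaction;
# wt * hp expands to wt + hp + wt:hp
fit <- lm(mpg ~ wt * hp, data = train)
summary(fit)  # coefficient table, R-squared, residual standard error
coef(fit)     # just the estimated coefficients

# Mean squared error on data the model has not seen
pred <- predict(fit, newdata = test)
mse  <- mean((test$mpg - pred)^2)
mse

# Information criteria computed on the training fit
AIC(fit)
BIC(fit)
```

In `ggplot2`, `geom_smooth(method = "lm")` draws a fitted regression line with a confidence band, which covers the "regression lines with uncertainty" item for a single predictor.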
- Supervised learning: Regression 2
  - Feature selection
  - Regularization using the `glmnet` package
  - Optimising lambda
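
One common way to combine regularization with choosing lambda is cross-validation via `cv.glmnet()`. The sketch below assumes the `glmnet` package is installed and uses the built-in `mtcars` data; the lasso penalty (`alpha = 1`) and 5 folds are choices made for the example, not necessarily those used in the practical.

```r
library(glmnet)

set.seed(45)  # the cross-validation folds are random

# glmnet needs a numeric predictor matrix and a response vector
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg

# Lasso (alpha = 1) with 5-fold cross-validation over a grid of lambda values
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5)

cv_fit$lambda.min               # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.min")  # coefficients at that lambda; some may be exactly zero
```

Coefficients that the lasso shrinks to exactly zero are effectively dropped from the model, which is why regularization also acts as a form of feature selection.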
- Supervised learning: Regression 3
  - Polynomial regression
  - Nonlinear regression using the `splines` package
  - Visualising nonlinear regression
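
As a sketch of these three topics together: a quadratic polynomial fit, a natural cubic spline fit, and a plot of both curves over the data. It uses the `splines` package that ships with `R` and the built-in `mtcars` data; the polynomial degree and the spline degrees of freedom are arbitrary choices for illustration.

```r
library(splines)

# Polynomial regression: mpg as a quadratic function of horsepower
fit_poly <- lm(mpg ~ poly(hp, 2), data = mtcars)

# Natural cubic spline with 3 degrees of freedom
fit_spline <- lm(mpg ~ ns(hp, df = 3), data = mtcars)

# Predict on an evenly spaced grid so the fitted curves can be drawn
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
grid$poly   <- predict(fit_poly, newdata = grid)
grid$spline <- predict(fit_spline, newdata = grid)

# Visualise: raw data plus both fitted curves (base graphics)
plot(mpg ~ hp, data = mtcars)
lines(grid$hp, grid$poly, lty = 1)    # solid line: polynomial fit
lines(grid$hp, grid$spline, lty = 2)  # dashed line: spline fit
```

Both models are still fitted with `lm()`; the nonlinearity comes entirely from the transformed predictor columns that `poly()` and `ns()` create.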
- Supervised learning: Classification 1
  - (titanic data? default data?)
  - KNN
  - Logistic regression (see also 4.2)
  - LDA
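
Of the methods listed, logistic regression needs nothing beyond base `R`. The sketch below fits one with `glm()` on the built-in `mtcars` data, predicting whether a car has a manual transmission (`am == 1`) from its weight; the dataset and predictor are stand-ins, since the outline leaves the practical's dataset open.

```r
# Logistic regression: probability of a manual transmission given weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)

# Predicted probabilities for three hypothetical cars (weights in 1000 lbs)
new_cars <- data.frame(wt = c(2, 3, 4))
prob_manual <- predict(fit, newdata = new_cars, type = "response")
prob_manual

# Turn probabilities into class labels with a 0.5 cut-off
pred_class <- ifelse(prob_manual > 0.5, "manual", "automatic")
pred_class
```

`type = "response"` returns probabilities; without it, `predict()` returns values on the log-odds scale.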
- Supervised learning: assessing classification methods
  - Confusion matrix, errors, AUC, ROC curve
  - Cross validation on classification problems
  - Classification trees
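
A minimal sketch of cross-validating a classifier and summarising its errors in a confusion matrix, using only base `R`. The classifier (logistic regression of `am` on `wt` in the built-in `mtcars` data), the 5 folds, and the 0.5 cut-off are all assumptions made for the example.

```r
set.seed(45)  # fold assignment is random

k <- 5
# Assign each row to one of k folds, in random order
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Collect out-of-fold predicted probabilities for every row
pred <- numeric(nrow(mtcars))
for (i in 1:k) {
  held_out <- folds == i
  fit <- glm(am ~ wt, data = mtcars[!held_out, ], family = binomial)
  pred[held_out] <- predict(fit, newdata = mtcars[held_out, ], type = "response")
}

# Classify with a 0.5 cut-off and cross-tabulate against the truth
pred_class <- as.integer(pred > 0.5)
conf_mat <- table(observed = mtcars$am, predicted = pred_class)
conf_mat

# Overall accuracy: share of rows classified correctly
accuracy <- mean(pred_class == mtcars$am)
accuracy
```

Every prediction here comes from a model that did not see that row during fitting, so the confusion matrix estimates out-of-sample performance rather than training fit.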
- HANDIN: Pass / Fail assignment
  - Find a dataset and create and assess a prediction model
  - Tip: The new Google dataset search.
  - Format: stand-alone `RStudio` project folder with:
    - the dataset (`csv`, `xlsx`, `sav`, `dat`, `json`, or any other common format)
    - one `.Rmd` notebook file
    - a compiled `.pdf` or `.html`
    - a `.Rproj` file
  - Requirements:
    - explain the dataset in 1 or 2 paragraphs
    - use `tidyverse`
    - clean, legible `R` code (preferably following the google style guide)
    - explain which method you use
    - assess your predictions
    - make conclusions about your predictions
    - use plots where useful (they are almost always useful)
- Unsupervised learning: PCA & Correspondence Analysis
  - PCA using `princomp`
  - Visualising PCA
  - SVD
  - Correspondence Analysis & Biplots
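
As a sketch of PCA with `princomp()` and a biplot, using only base `R` and the built-in `USArrests` data (an assumption; the practical may use a different dataset). Setting `cor = TRUE` standardises the variables, which matters here because they are measured on different scales.

```r
# PCA on the correlation matrix, i.e. on standardised variables
pca <- princomp(USArrests, cor = TRUE)

summary(pca)       # standard deviation and variance explained per component
loadings(pca)      # how each original variable contributes to each component
head(pca$scores)   # coordinates of the first observations on the components

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
var_explained

# Biplot: observations and variable directions on the first two components
biplot(pca)
```

The `scores` can also be put in a data frame and plotted with `ggplot2` if a more customisable figure is needed.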
- Unsupervised learning: Clustering
  - K-means clustering with `kmeans()`
  - Hierarchical clustering with `hclust()`
  - Visualising clusters in `ggplot`
  - Modularity clustering with igraph
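
A minimal sketch of the first two clustering methods side by side, using only base `R` and the built-in `USArrests` data. The number of clusters (3), the seed, and complete linkage are arbitrary choices for the example.

```r
set.seed(45)  # k-means starts from random centres

# Standardise the variables so no single one dominates the distances
x <- scale(USArrests)

# K-means with 3 clusters; nstart = 20 tries 20 random starts and keeps the best
km <- kmeans(x, centers = 3, nstart = 20)
km$cluster[1:6]  # cluster assignment for the first six states

# Hierarchical clustering on Euclidean distances, complete linkage
hc <- hclust(dist(x), method = "complete")
hc_clusters <- cutree(hc, k = 3)  # cut the tree into 3 clusters

# Compare the two solutions (cluster labels themselves are arbitrary)
table(kmeans = km$cluster, hclust = hc_clusters)
```

To visualise the result in `ggplot2`, the cluster vector can be added as a column to the data and mapped to the `colour` aesthetic.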