Data Analysis and Visualisation practicals

Here you can find all information and files for the practicals of the elective master's course Data Analysis and Visualisation at Utrecht University (course code 201600038 in Osiris).

You will be working inside the practicals folder. Download the folder and unzip it to a sensible location on your computer.

Links to practicals

| #  | Name                                                 | HTML  | PDF  | Answers |
|----|------------------------------------------------------|-------|------|---------|
| 01 | R basics for DAV                                     | .html | .pdf |         |
| 02 | Data manipulation & EDA                              | .html | .pdf | Answers |
| 03 | Data Visualisation using ggplot2                     | .html | .pdf | Answers |
| 04 | Assignment EDA                                       | .html | .pdf |         |
| 05 | Supervised learning: Regression 1                    | .html | .pdf | Answers |
| 06 | Supervised learning: Regression 2                    | .html | .pdf | Answers |
| 07 | Supervised learning: Regression 3                    | .html | .pdf | Answers |
| 08 | Supervised learning: Classification 1                | .html | .pdf | Answers |
| 09 | Supervised learning: Classification 2                | .html | .pdf | Answers |
| 10 | Assignment Prediction Model                          | .html | .pdf |         |
| 11 | Unsupervised learning: PCA & Correspondence Analysis | .html | .pdf | Answers |
| 12 | Unsupervised learning: Clustering                    | .html | .pdf | Answers |

Prerequisites

  • Install R and RStudio Desktop (open source) by following the instructions here
  • If you don't yet have a TeX distribution, run the following within RStudio:
    install.packages("tinytex")
    library(tinytex)
    install_tinytex()

If you have no experience with R or another programming language, you will need to catch up before the course starts and keep practising while it runs. This is a course on data analysis and visualisation, not an introduction to programming with R.

Some good sources are:

install.packages("swirl")
library(swirl)
swirl()

Then follow the on-screen guide to run the interactive course R Programming: The basics of programming in R.

The following is the minimum you should know about R before starting the first practical:

  • What R is (a fancy calculator) and what an .R file is (a recipe for calculations)
  • What an R package is (a set of functions you can download to use in your own code)
  • How to run R code in RStudio
  • What a variable is: x <- 10
  • What a function is: y <- fun(x = 10)
  • What the following statements do (tip: run them in R line by line)
    y <- "Let him go!"
    x <- "Bismillah!"
    z <- paste(x, "No, we will not let you go.", y)
    rep(z, 3)
    1:10
    sample(1:20, 4)
    sample(1:20, 40, replace = TRUE)
    z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
    z^2
    z == 2
    z > 2
    install.packages("dplyr")
    library(dplyr)
  • How to read the help file of any function (e.g., type ?plot in the console)
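
As a minimal sketch of the variable and function items above: `fun` in the list is only a placeholder, so the example below defines its own function. The name `square` is hypothetical and not part of the course files.

```r
# A variable stores a value under a name.
x <- 10

# A function takes inputs (arguments) and returns an output.
# `square` is a made-up example function for illustration only.
square <- function(x) {
  x^2
}

# Call the function and store its result in a new variable.
y <- square(x = 10)
y
```

Running `?paste` or `?sample` in the console opens the help file for the functions used in the statements above.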

Outline of the practicals

Anything written in italic font is optional/extra material. You can look those up by yourself if you have extra time.

Week 1

  • R basics for DAV

    • R and RStudio
    • Project organisation
    • Help files using ?, CRAN, and internet search
    • R Markdown
    • The ISLR package (datasets from James ISLR)
    • The tidyverse as a dialect of the R language (Wickham R4DS)
    • The Google style guide or tidyverse style guide (ISLR does not follow these)
    • R packages on GitHub
  • Data manipulation & exploratory data analysis

    • Data types: character, numeric, factor
    • Lists
    • Loading datasets from .csv or .xlsx (or other formats with haven)
    • data.frame() and tibble()
    • View(), head(), tail()
    • summary()
    • filter(), select(), and mutate() from dplyr
    • bind_rows(), bind_cols()
    • missing values (na.omit)
    • group_by() and summarise() from dplyr
    • the pipe operator %>%
    • table()
    • dplyr cheatsheet
    • wide to long format: gather and spread
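
The dplyr verbs listed above are usually chained with the pipe. The following is a small sketch, assuming the dplyr package is installed; the built-in `mtcars` dataset and the column choices are illustrative assumptions, not taken from the practical itself.

```r
library(dplyr)

# mtcars ships with R; convert it to a tibble for nicer printing
cars <- as_tibble(mtcars)

cars_summary <- cars %>%
  filter(hp > 100) %>%                    # keep cars with more than 100 horsepower
  select(mpg, cyl, wt) %>%                # keep three columns
  mutate(wt_kg = wt * 1000 * 0.4536) %>%  # wt is in 1000 lbs; convert to kg
  group_by(cyl) %>%                       # one group per number of cylinders
  summarise(
    n          = n(),                     # number of cars in the group
    mean_mpg   = mean(mpg),
    mean_wt_kg = mean(wt_kg)
  )

cars_summary
```

Each step takes a data frame and returns a data frame, which is what makes the chain readable from top to bottom.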

Week 2

  • Data Visualisation using ggplot2
    • Preparing data for a ggplot() call
    • What is a ggplot object and how to construct it
    • Aesthetics: x, y, size, colour, fill
    • geom_point(), geom_line(), geom_bar()
    • Labels, limits
    • geom_boxplot(), geom_density()
    • themes (ggthemes?)
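
A ggplot object is built by adding layers to a `ggplot()` call. A minimal sketch, assuming ggplot2 is installed and again using `mtcars` purely as an illustrative dataset:

```r
library(ggplot2)

# Map columns to aesthetics once, then add a geom, labels, and a theme
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(
    x      = "Weight (1000 lbs)",
    y      = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()

# Printing the object draws the plot
p
```

Because `p` is an ordinary R object, further layers such as `geom_line()` or `geom_boxplot()` can be added to it later with `+`.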

Week 3

  • HANDIN: Pass / Fail assignment

    • Find a dataset and create an Exploratory Data Analysis
    • Tip: The new Google dataset search.
    • Format: stand-alone RStudio project folder with:
      • the dataset (csv, xlsx, sav, dat, json, or any other common format)
      • one .Rmd notebook file
      • a compiled .pdf or .html
    • Requirements:
      • explain the dataset in 1 or 2 paragraphs
      • use tidyverse
      • clean, legible R code (preferably following the Google style guide)
      • table(s) with relevant summary statistics
      • descriptive plots
      • explain what you did and why (max 3 paragraphs total)
  • Supervised learning: Regression 1

    • lm(), the formula object, the lm object and its methods (print(), summary(), coef(), plot())
    • Regression lines in ggplot with uncertainty
    • Linear regression with multiple variables, interaction effects
    • Model assessment:
      • Train/test split
      • Mean square error calculation (predict())
      • AIC, BIC
    • Bias/variance tradeoff
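
The model assessment steps for Regression 1 can be sketched with base R only. The dataset (`mtcars`), the 75/25 split, the seed, and the helper name `mse` are all assumptions made for illustration.

```r
set.seed(45)  # arbitrary seed so the split is reproducible

# Split the rows into 75% training and 25% test data
n         <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.75 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# A simple model and a more flexible one with an interaction effect
fit_small <- lm(mpg ~ wt, data = train)
fit_large <- lm(mpg ~ wt * hp, data = train)  # wt, hp, and wt:hp

# Mean square error of a model's predictions on a given dataset
# (`mse` is a hypothetical helper, not a base R function)
mse <- function(model, data) {
  mean((data$mpg - predict(model, newdata = data))^2)
}

mse(fit_small, test)
mse(fit_large, test)

# Information criteria are computed on the training fit
AIC(fit_small, fit_large)
BIC(fit_small, fit_large)
```

Comparing the training and test error of the two models is one way to see the bias/variance tradeoff in practice.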

Week 4

  • Supervised learning: Regression 2
    • Feature selection
    • Regularization using the glmnet package
    • Optimising lambda
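
A minimal sketch of regularised regression, assuming the glmnet package is installed. The dataset, the seed, and the number of folds are illustrative choices rather than the settings used in the practical.

```r
library(glmnet)

set.seed(123)  # cross-validation folds are random

# glmnet expects a numeric predictor matrix and a response vector
x <- as.matrix(mtcars[, -1])  # all columns except mpg
y <- mtcars$mpg

# alpha = 1 gives the lasso, alpha = 0 gives ridge regression.
# cv.glmnet() fits the model over a grid of lambda values and
# estimates the prediction error of each by cross-validation.
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5)

cv_fit$lambda.min               # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.min")  # coefficients at that lambda; some are exactly zero
```

`plot(cv_fit)` shows the cross-validated error as a function of log(lambda).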

Week 5

  • Supervised learning: Regression 3
    • Polynomial regression
    • Nonlinear regression using the splines package
    • Visualising nonlinear regression
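
Both polynomial and spline regression can be fitted with `lm()` by transforming the predictor inside the formula. A sketch under assumed settings (dataset `mtcars`, degree 2, 3 degrees of freedom), using base graphics to keep it dependency-free:

```r
library(splines)  # ns() for natural cubic splines; ships with R

# Quadratic polynomial and natural spline in horsepower
fit_poly   <- lm(mpg ~ poly(hp, 2), data = mtcars)
fit_spline <- lm(mpg ~ ns(hp, df = 3), data = mtcars)

# Predict on an evenly spaced grid to draw smooth curves
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
grid$poly   <- predict(fit_poly, newdata = grid)
grid$spline <- predict(fit_spline, newdata = grid)

# Data points with both fitted curves
plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Miles per gallon")
lines(grid$hp, grid$poly, lty = 1)
lines(grid$hp, grid$spline, lty = 2)
legend("topright", legend = c("polynomial", "natural spline"), lty = c(1, 2))
```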

Week 6

  • Supervised learning: Classification 1
    • (titanic data? default data?)
    • KNN
    • Logistic regression (see also 4.2)
    • LDA
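
The three classifiers can be sketched side by side. The example assumes the class and MASS packages, which ship with standard R installations; the dataset (`mtcars`), the predictors, and `k = 3` are illustrative assumptions.

```r
library(class)  # knn()

# Illustrative task: predict transmission type from weight and horsepower
dat <- mtcars[, c("am", "wt", "hp")]
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Logistic regression: predicted probability of a manual transmission
fit_logit  <- glm(am ~ wt + hp, family = binomial, data = dat)
prob_logit <- predict(fit_logit, type = "response")

# LDA, called via its namespace so that MASS does not mask dplyr::select()
fit_lda  <- MASS::lda(am ~ wt + hp, data = dat)
pred_lda <- predict(fit_lda)$class

# KNN works on distances, so put the predictors on the same scale first
x_scaled <- scale(dat[, c("wt", "hp")])
pred_knn <- knn(train = x_scaled, test = x_scaled, cl = dat$am, k = 3)

# In-sample accuracy only; a real assessment needs held-out data
mean(pred_lda == dat$am)
mean(pred_knn == dat$am)
```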

Week 7

  • Supervised learning: assessing classification methods
    • Confusion matrix, errors, AUC, ROC curve
    • Cross validation on classification problems
    • Classification trees
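
A confusion matrix is a cross-tabulation of predicted against observed classes. The sketch below uses base R only; the dataset, the single predictor, and the 0.5 threshold are assumptions for illustration, and the numbers are in-sample.

```r
# Illustrative classifier: logistic regression of transmission type on weight
fit  <- glm(am ~ wt, family = binomial, data = mtcars)
prob <- predict(fit, type = "response")

# Turn probabilities into class predictions with a 0.5 threshold
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
obs  <- factor(mtcars$am, levels = c(0, 1))

# Rows are predictions, columns are observed classes
cm <- table(predicted = pred, observed = obs)
cm

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positives / observed positives
specificity <- cm["0", "0"] / sum(cm[, "0"])  # true negatives / observed negatives

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Varying the threshold and recording sensitivity against 1 - specificity at each value traces out the ROC curve.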

Week 8

  • HANDIN: Pass / Fail assignment
    • Find a dataset and create and assess a prediction model
    • Tip: The new Google dataset search.
    • Format: stand-alone RStudio project folder with:
      • the dataset (csv, xlsx, sav, dat, json, or any other common format)
      • one .Rmd notebook file
      • a compiled .pdf or .html
      • a .Rproj file
    • Requirements:
      • explain the dataset in 1 or 2 paragraphs
      • use tidyverse
      • clean, legible R code (preferably following the Google style guide)
      • explain which method you use
      • assess your predictions
      • make conclusions about your predictions
      • use plots where useful (they are almost always useful)
  • Unsupervised learning: PCA & Correspondence Analysis
    • PCA using princomp
    • Visualising PCA
    • SVD
    • Correspondence Analysis & Biplots
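
A short sketch of PCA with `princomp()` and its link to the SVD, using base R only. The built-in `USArrests` dataset is an assumption chosen for illustration.

```r
# PCA on the correlation matrix, i.e. on standardised variables
pca <- princomp(USArrests, cor = TRUE)
summary(pca)   # standard deviation and proportion of variance per component
biplot(pca)    # observations and variable loadings in one plot

# The same component standard deviations follow from the SVD of the
# standardised data matrix: singular values divided by sqrt(n - 1)
X  <- scale(USArrests)
sv <- svd(X)
sv$d / sqrt(nrow(X) - 1)
pca$sdev
```

The columns of `sv$v` match the loadings in `pca$loadings` up to sign.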

Week 9

  • Unsupervised learning: Clustering
    • K-means clustering with kmeans()
    • Hierarchical clustering with hclust()
    • Visualising clusters in ggplot
    • Modularity clustering with igraph
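
K-means and hierarchical clustering can be compared on the same data with base R. The dataset (`iris`), the seed, the number of clusters, and the linkage method below are illustrative assumptions.

```r
set.seed(1)  # k-means starts from random centres

# Cluster on the four numeric columns, standardised so each has equal weight
X <- scale(iris[, 1:4])

# K-means with 3 clusters; nstart = 20 keeps the best of 20 random starts
km <- kmeans(X, centers = 3, nstart = 20)

# Hierarchical clustering on Euclidean distances, cut into 3 clusters
hc          <- hclust(dist(X), method = "complete")
hc_clusters <- cutree(hc, k = 3)

# Cross-tabulate the two cluster assignments
table(kmeans = km$cluster, hclust = hc_clusters)
```

Cluster labels are arbitrary, so agreement shows up as large counts concentrated in few cells rather than on the diagonal.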

Contributors

vankesteren, bagheria, lientjemaas
