STA 380: Intro to Machine Learning

Welcome to part 2 of STA 380, a course on machine learning in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

Instructors:

  • Dr. James Scott. Office hours on M T W, 12:30 to 1:20 PM, via Zoom (see Canvas for link). (All times are US central time.)
  • Dr. David Puelz. Office hours TBA.

Students in both sections are welcome to attend either set of office hours!

Exercises

The exercises are available here. These are due Monday, August 16th at 5 PM, US central time. Pace yourself over the next few weeks, and start early on the first couple of problems!

Outline of topics

(1) The data scientist's toolbox

Slides: The data scientist's toolbox
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

Readings:

Your assignment after the first class day:

  • Create a GitHub account.
  • Create your first GitHub repository.
  • Inside that repository (on your local machine), create a toy RMarkdown file that does something, e.g. simulates some normal random variables and plots a histogram.
  • Knit that RMarkdown file to a Markdown (.md) output.
  • Push the changes to GitHub and view the final (knitted) .md file.

These instructions will make sense after you read the tutorials above!
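
As a rough illustration of the kind of toy analysis meant here, the sketch below simulates normal random variables and prints a text histogram. It is written in Python with only the standard library, whereas the assignment itself calls for R inside an RMarkdown file; the function names, the seed, and the bin width are all made up for this example.

```python
import math
import random
from collections import Counter


def simulate_normals(n, mean=0.0, sd=1.0, seed=1):
    """Draw n normal random variables; the fixed seed keeps the run replicable."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]


def histogram_counts(values, bin_width=0.5):
    """Map the left edge of each bin to the number of values that fall in it."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {b * bin_width: counts[b] for b in sorted(counts)}


draws = simulate_normals(1000)
for left_edge, count in histogram_counts(draws).items():
    # One '#' per 10 observations, so the bars fit on a terminal line.
    print(f"{left_edge:6.1f} | {'#' * (count // 10)}")
```

Fixing the seed is the point of the exercise: anyone who re-runs the file gets the same draws, which is what makes the analysis replicable.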

(2) Probability: a refresher

Slides: Some fun topics in probability

Two short pieces that illustrate the "fallacy of mistaken compounding":

Optional reference: Chapter 1 of these course notes. The notes as a whole go into much more technical detail, but Chapter 1 covers the basics of what every data scientist should know about probability.

(3) Data exploration and visualization

Topics: data visualization and practice with R.

Slides: Introduction to Data Exploration

R materials:

Inspiration and further reference:

(4) Resampling methods

The bootstrap; joint distributions; using the bootstrap to approximate value at risk (VaR).

Slides: Introduction to the bootstrap

Reference: ISL Section 5.2 for a basic overview of the bootstrap.

For the class exercises, you will need to refer to any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.
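
A minimal sketch of how the bootstrap can approximate VaR, in Python rather than the R used in class. It assumes the data are rows of daily returns (one entry per asset) and that the portfolio is rebalanced to fixed weights every day; both are assumptions made for this illustration, as are the function names. Whole days are resampled at once so that the joint distribution across assets is preserved.

```python
import random


def simulate_horizon(return_rows, weights, horizon, rng):
    """Compound `horizon` resampled days and return the total portfolio return."""
    wealth = 1.0
    for _ in range(horizon):
        row = rng.choice(return_rows)  # one whole day, keeping assets together
        wealth *= 1.0 + sum(w * r for w, r in zip(weights, row))
    return wealth - 1.0


def bootstrap_var(return_rows, weights, horizon=20, level=0.05,
                  n_boot=5000, seed=1):
    """Estimate VaR as the loss at the `level` quantile of simulated returns."""
    rng = random.Random(seed)
    sims = sorted(simulate_horizon(return_rows, weights, horizon, rng)
                  for _ in range(n_boot))
    index = min(int(level * n_boot), n_boot - 1)
    return -sims[index]  # reported as a positive fraction of initial wealth


# Hypothetical two-asset history, invented purely to exercise the code.
history = [(0.01, -0.02), (-0.03, 0.01), (0.02, 0.00), (0.00, 0.01)]
print(bootstrap_var(history, weights=(0.5, 0.5)))
```

Resampling each asset separately would break the dependence between assets and would typically misstate the tail risk of the portfolio.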

R scripts and data:

Supplemental resources:

(5) Clustering

Basics of clustering; K-means clustering; hierarchical clustering.

Slides: Introduction to clustering.

Scripts and data:

Readings:

  • ISL Sections 10.1 and 10.3 or Elements Chapter 14.3 (more advanced)
  • K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
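
The K-means++ initialization mentioned above can be sketched as follows. This is a plain-Python illustration, not code from the course scripts: the first center is picked uniformly at random, and each later center is picked with probability proportional to its squared distance from the nearest center chosen so far. The function names are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=1):
    """Choose k initial centers using the K-means++ seeding rule."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest existing center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        # Far-away points are more likely to be chosen; existing centers
        # have weight zero and cannot be chosen again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
print(kmeans_pp_init(points, k=2))
```

These centers would then be handed to the usual K-means iterations; the seeding step only changes where those iterations start.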

(6) Latent features and structure

Principal component analysis (PCA).

Slides: Introduction to PCA

Scripts and data for class:

A few other examples we likely won't cover in class:

Readings:

  • ISL Section 10.2 for the basics or Elements Chapter 14.5 (more advanced)
  • Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor models, beyond what we covered in class.
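
To make the idea of a principal component concrete, here is a small standard-library Python sketch that finds the first principal component by power iteration on the sample covariance matrix. It is an illustration under stated assumptions, not the approach used in the class scripts: the function names are invented, the data are centered but not scaled, and the random starting vector is assumed not to be orthogonal to the leading direction.

```python
import math
import random


def covariance(rows):
    """Sample covariance matrix of rows (observations) by columns (features)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    p = len(means)
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_principal_component(rows, n_iter=500, seed=1):
    """Return (unit direction, variance along it) for the leading component."""
    cov = covariance(rows)
    p = len(cov)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    return v, variance


# Toy data lying exactly on the line y = 2x, so one component explains everything.
direction, variance = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
print(direction, variance)
```

The sign of the returned direction is arbitrary, as it is for any principal component.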

(7) Networks and association rules

Networks and association rule mining.

Slides: Intro to networks. Note: these slides refer to "lastfm.R" but this is the same thing as "playlists.R" below.

Some supplemental slides on association rule mining. These contain the details of the apriori algorithm. If there's time we might cover some of this in class, but mainly we'll focus on the shorter intro slides above, together with the example R scripts below.

Software you'll need:

Scripts and data:

Supplemental resource: In-depth explanation of the Apriori algorithm
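
For orientation, the three quantities usually reported for an association rule (support, confidence, and lift) can be computed directly from their definitions. The Python sketch below does only that; it is not an implementation of the Apriori search itself, and the baskets and function names are made up for the example.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(lhs, rhs, baskets):
    """Support, confidence, and lift for the rule lhs -> rhs."""
    s_lhs = support(lhs, baskets)
    s_rhs = support(rhs, baskets)
    s_both = support(set(lhs) | set(rhs), baskets)
    confidence = s_both / s_lhs   # P(rhs | lhs); assumes lhs occurs at least once
    lift = confidence / s_rhs     # values above 1 suggest positive association
    return {"support": s_both, "confidence": confidence, "lift": lift}


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
]
print(rule_metrics({"bread"}, {"butter"}, baskets))
```

Apriori's contribution is to avoid computing these numbers for every possible itemset, by pruning any itemset that has an infrequent subset.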

(8) Text data

Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
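
As a quick illustration of TF-IDF, the Python sketch below weights each term by its relative frequency within a document times the log of the inverse document frequency. This is one common variant among several; the whitespace tokenizer, the function name, and the example documents are assumptions for the sketch, and every document is assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog sat", "the cat ran"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, w)
```

A term that appears in every document, such as "the" here, gets weight zero, which is why TF-IDF tends to downweight uninformative words.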

Slides on text.

Scripts and data:

If there is time in class, we'll cover the script below. If not, it's still a useful starting point for your homework:
