
CORE Skills Data Science Springboard - Day 6 - Multivariate analysis and dimensionality reduction

Binder

The aim of today's session is to generalize the concepts we learnt for one-dimensional data last week to multidimensional data. We'll also introduce approaches to reducing the dimensionality of a dataset - that is, (a) how we can identify when a dataset can be represented accurately with a smaller number of variables, and (b) how we can identify the variables that carry the most information, using techniques such as Principal Component Analysis (PCA) and clustering. We'll touch on applying regression to dimension-reduced data, and on Partial Least Squares Regression (PLSR, also called Projection to Latent Structures). These multivariate methods can be very powerful when you need to build models with many variables but have few observations, and the variables are correlated - as is often the case with datasets from multichannel instruments.
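To make the pipeline above concrete, here is a minimal sketch (synthetic data and illustrative variable names, not taken from the course notebooks) of regressing on PCA scores and of PLSR with scikit-learn:

```python
# Sketch only: few observations (20), many correlated "channels" (50),
# as from a multichannel instrument. All data here are made up.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
latent = rng.normal(size=(20, 1))                       # one underlying signal
X = latent @ rng.normal(size=(1, 50)) + 0.1 * rng.normal(size=(20, 50))
y = latent.ravel() + 0.1 * rng.normal(size=20)

# (a) Principal component regression: reduce X to a few components, then regress.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)                           # n_samples x 2 PC scores
pcr = LinearRegression().fit(scores, y)

# (b) PLSR: finds components that explain the covariance between X and y directly.
pls = PLSRegression(n_components=2).fit(X, y)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("PCR R^2:", pcr.score(scores, y))
print("PLSR R^2:", pls.score(X, y))
```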

You should aim to understand the similarities and differences between univariate and multivariate settings (you'll still need to be able to apply EDA to multivariate data, for example). You should also aim to understand the basis of dimensionality reduction, to compute measures of correlation, and to recognise when correlations might be spurious.
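As a small sketch of what multivariate EDA can look like (assumed, made-up data; not from the course), pandas gives per-variable summaries and pairwise correlation matrices directly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "temperature": rng.normal(20, 2, 100),
    "pressure": rng.normal(1.0, 0.1, 100),
})
df["signal"] = 3 * df["temperature"] + rng.normal(0, 1, 100)  # correlated column

print(df.describe())                # univariate summaries per column
print(df.corr(method="pearson"))    # pairwise Pearson correlation matrix
print(df.corr(method="spearman"))   # rank-based alternative
```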

Pre-session Reading & Resources

Topics that we'll be discussing in today's workshop include:

sklearn.linear_model.LinearRegression

Robust Regression
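A brief sketch of the two topics above, with made-up 1-D data containing a few outliers (HuberRegressor is one robust choice in scikit-learn; the workshop may use a different estimator):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(0, 0.5, 50)
y[:3] += 20                              # a few gross outliers

ols = LinearRegression().fit(x, y)
huber = HuberRegressor().fit(x, y)

print("OLS slope:  ", ols.coef_[0])      # pulled towards the outliers
print("Huber slope:", huber.coef_[0])    # closer to the true slope of 2
```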

Basis Regression: based on the idea that a linear system is one where the whole is literally the sum of its parts. When we solve the linear system, the coefficients of the regression tell us how much of each part is present, and the parts don't have to be simple variables like "x" or "y" - they can be squiggly lines representing components that we believe our observations can be decomposed into. These components are represented as vectors and can take any shape we want. We may have "templates" for these components, and an example will be shown where we use basis regression with templates (a minimal sketch follows below).
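A minimal sketch of the template idea (templates and data invented here for illustration): stack the templates as columns of a design matrix and solve the linear system by least squares to recover how much of each template is present.

```python
import numpy as np

t = np.linspace(0, 1, 200)
template_a = np.exp(-((t - 0.3) ** 2) / 0.01)   # a Gaussian bump near t = 0.3
template_b = np.sin(2 * np.pi * 3 * t)          # a slow oscillation

# Simulated observation: 2 parts of A plus 0.5 parts of B, plus noise
rng = np.random.default_rng(3)
observed = 2.0 * template_a + 0.5 * template_b + 0.05 * rng.normal(size=t.size)

# Templates as columns of the design matrix; least squares gives the "amounts"
B = np.column_stack([template_a, template_b])
coeffs, *_ = np.linalg.lstsq(B, observed, rcond=None)

print("estimated amount of each template:", coeffs)   # roughly [2.0, 0.5]
```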

Principal Component Analysis: unfortunately, most descriptions of the theory of PCA (and of regression after transformation to principal components) become very mathematical very quickly, but this link is practical and shows how PCA works in 2D.
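For a quick 2-D picture of what PCA does, here is a small sketch with synthetic data: the first principal component is simply the direction of greatest variance in the cloud of points.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
x = rng.normal(size=300)
y = 0.8 * x + 0.2 * rng.normal(size=300)     # a strongly correlated pair
data = np.column_stack([x, y])

pca = PCA(n_components=2).fit(data)
print("component directions:\n", pca.components_)
print("variance explained:", pca.explained_variance_ratio_)  # most sits in PC1
```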

Extension reading:

When dealing with multivariate data there is a risk of finding spurious correlations. That is, the bigger our datasets become, the more likely it is that we'll see relationships appearing by chance alone. In machine learning we mostly deal with this through cross-validation and other methods that judge the models we use by how well they predict on new data previously unseen by the model.
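A hedged sketch of why cross-validation helps (synthetic data, illustrative only): with many more variables than observations, an ordinary regression can fit pure noise perfectly in-sample, while cross-validation exposes that it predicts nothing on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 200))   # many variables, few observations
y = rng.normal(size=40)          # y is pure noise, unrelated to X

model = LinearRegression()
print("in-sample R^2:", model.fit(X, y).score(X, y))        # ~1.0, spurious
print("cross-validated R^2:",
      cross_val_score(model, X, y, cv=5).mean())            # typically far below 0
```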

Tyler Vigen has an amusing site which finds spurious correlations in US statistical data (covered in the Harvard Business Review). Have a play here: http://www.tylervigen.com/spurious-correlations

You'll also see a lot about 'correlation isn't causation' - however, this phrase is often overstated. We can construct statistical models which invoke causation, although it requires some new statistical tools covering interventions and counterfactuals. These allow us to answer questions along the lines of 'what would happen if?'. Judea Pearl is a researcher who has done a lot of work on the statistics of causation (i.e. how we can make machines that reason causally like humans), and he has recently released a very readable book on this topic, The Book of Why, which is well worth a look.


