stephens999 / fiveminutestats

A repo of short "vignettes" illustrating statistical concepts

Home Page: http://stephens999.github.io/fiveMinuteStats

License: Other

HTML 54.15% TeX 45.85%

fiveminutestats's Introduction

fiveMinuteStats

This repo is intended to contain short "vignettes" illustrating statistical concepts. It is very much a work in progress. Things may change quickly and often...

The name comes from the fact that, in principle, each vignette should be readable in a short amount of time. Perhaps five minutes.

The overall goal is that, by keeping vignettes short in this way, we can try to make learning more "modular". Each vignette should, ideally, focus on introducing a single concept, or a small number of related concepts, that are easily digestible provided the prerequisite concepts are mastered. One reason for this is to try to make learning easier: break complex ideas down into smaller, more easily digestible chunks. Another is that it encourages re-use of material. Just as software engineers write software in a "modular" way, with each function performing a well-defined role, the idea is that these vignettes make learning "modular". If you don't like the way one vignette introduces a concept then you can write a different one and just replace that one part. And in principle, if this takes off, we can have large numbers of authors, each contributing a small number of vignettes. Modularizing facilitates sharing the load.
In principle we can have multiple vignettes for the same concept and users can choose which one they like.

The idea of breaking learning down into small chunks is kind of obvious, but I was personally inspired by watching videos with my kids: https://artofproblemsolving.com/videos/prealgebra Maybe we can make learning statistics this easy and this much fun? However, I decided against video because a) I'm not as funny as this guy, and b) it is harder to collaborate and update videos.

If you are interested in these ideas, please get in touch, [email protected] (remove the marsupial).

For contributors

The repo is organized using John Blischak's workflowr R package. Each vignette is an R Markdown file, saved in the 'analysis' subdirectory. To add a vignette, run the following:

library("workflowr")
wflow_open("analysis/newfile.Rmd")

See the workflowr online documentation to learn more.

fiveminutestats's Issues

Typo in formula for Wright-Fisher Model

Hi,
I think that there might be a typo in your Introduction to the Wright-Fisher Model (which is very helpful, thanks for that!). If I understood it correctly, your formula:

$$X_{t} \mid X_{t-1} = x_{t-1} \sim Binomial(n = 2N, p = \frac{x_{t-1}}{2N})$$

should give the probability for $x_{t}$ and not $x_{t-1}$ and should therefore be:

$$X_{t} \mid X_{t-1} = x_{t} \sim Binomial(n = 2N, p = \frac{x_{t-1}}{2N})$$

Is this correct?
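
One way to read the formula is as a simulation step. The sketch below is illustrative only (the function names `wf_step` and `wf_simulate` and the example values are assumptions, not code from the vignette). In it, the observed count in generation t-1, written x_{t-1}, is the value being conditioned on, and it sets the success probability of the binomial draw that produces X_t.

```r
# One generation of the Wright-Fisher model: given x_prev copies of the allele
# among 2N gene copies, draw the count in the next generation.
wf_step <- function(x_prev, N) {
  rbinom(1, size = 2 * N, prob = x_prev / (2 * N))
}

# Simulate a trajectory x_0, x_1, ..., x_T (names and defaults are illustrative).
wf_simulate <- function(x0, N, n_gen) {
  x <- numeric(n_gen + 1)
  x[1] <- x0
  for (t in seq_len(n_gen)) {
    x[t + 1] <- wf_step(x[t], N)  # X_t depends on the realised value of X_{t-1}
  }
  x
}

set.seed(1)
traj <- wf_simulate(x0 = 10, N = 50, n_gen = 20)
```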

data augmentation

could do with a vignette on data augmentation more generally (ie more than just mixture models) to illustrate the idea.

Ideas: mixture models, Factor analysis, non-negative matrix factorization?

vignette on Bayes inference for continuous variables

could do inference for a binomial proportion under a beta prior, and inference for a normal mean under a normal prior...
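
As a rough idea of what the binomial part might look like, here is a minimal sketch of the conjugate update. The function name `beta_binomial_posterior`, the uniform Beta(1, 1) default prior and the example data are all assumptions made for illustration.

```r
# Conjugate update for a binomial proportion q with a Beta(a, b) prior:
# after x successes in n trials the posterior is Beta(a + x, b + n - x).
beta_binomial_posterior <- function(x, n, a = 1, b = 1) {
  c(shape1 = a + x, shape2 = b + n - x)
}

post <- beta_binomial_posterior(x = 7, n = 10)   # illustrative data
post_mean <- unname(post["shape1"] / sum(post))  # (a + x) / (a + b + n)
# Central 95% posterior interval from the beta quantile function
ci <- qbeta(c(0.025, 0.975), post["shape1"], post["shape2"])
```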

rendering problems

It seems some of the LaTeX doesn't render properly, e.g. in wilks.html. The problem
appears to start around the \frac tag. This might be relevant:
http://tex.stackexchange.com/questions/104455/convert-markdown-embedded-latex-to-pdf-and-doc

Missing data example in "Likelihood Ratios: examples and pitfalls": is the probability of missing data the same for both models?

Having read the paragraph that explains how to incorporate the probability of missing data I cannot think of a reason for why it would be different in the tusk example: Both models use the same data and the probability of a failing test is therefore the same.

However, the data are weighted differently because the first allele is twice as likely in the Forest Elephant. If we had a DNA test that failed more often when the allele is not present, could we then conclude that missing the first marker is more probable under M_s?

need a simple vignette on utility - just to illustrate the idea of maximizing expected utility
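
A toy example of the kind such a vignette could use is sketched below. The decision problem, probabilities and utilities are invented for illustration. The expected utility of each action is the probability-weighted average of its utilities, and the rule picks the action with the largest value.

```r
# Illustrative decision problem: carry an umbrella or not, with uncertain rain.
p_rain <- 0.3
utility <- matrix(
  c(  0, -1,    # umbrella:    utility if rain, utility if dry
    -10,  0),   # no umbrella: utility if rain, utility if dry
  nrow = 2, byrow = TRUE,
  dimnames = list(c("umbrella", "no_umbrella"), c("rain", "dry"))
)
probs <- c(rain = p_rain, dry = 1 - p_rain)

# Expected utility of each action: sum over outcomes of utility * probability
expected_utility <- drop(utility %*% probs)
best_action <- names(which.max(expected_utility))
```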

vignette for lagrange multipliers?

@mbonakda you asked about a wishlist - I think it would be nice to
have a vignette on how to do mle for multinomial using lagrange multipliers.

The challenge is that most readers won't know what a lagrange multiplier is...

Could start by doing binomial (which one can easily do without multipliers)
and then extending to multinomial...
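
A possible outline of the derivation, as a sketch only. The notation (counts $x_k$ summing to $n$, probabilities $p_k$, multiplier $\lambda$) is assumed here rather than taken from an existing vignette. For the binomial, with $x$ successes in $n$ trials, no multiplier is needed:

$$\ell(p) = x \log p + (n - x) \log(1 - p), \qquad \frac{d\ell}{dp} = \frac{x}{p} - \frac{n - x}{1 - p} = 0 \;\Rightarrow\; \hat{p} = \frac{x}{n}$$

For the multinomial with $K$ categories, the constraint $\sum_k p_k = 1$ is added through a multiplier:

$$\mathcal{L}(p, \lambda) = \sum_{k=1}^{K} x_k \log p_k - \lambda \left( \sum_{k=1}^{K} p_k - 1 \right)$$

$$\frac{\partial \mathcal{L}}{\partial p_k} = \frac{x_k}{p_k} - \lambda = 0 \;\Rightarrow\; p_k = \frac{x_k}{\lambda}$$

$$\sum_{k=1}^{K} p_k = 1 \;\Rightarrow\; \lambda = \sum_{k=1}^{K} x_k = n, \qquad \hat{p}_k = \frac{x_k}{n}$$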

Gibbs Sampling

Here are my thoughts on a vignette for Gibbs sampling.
We should think about what to assume people already know. For example I think we can assume they know what a conditional distribution is and what a Markov Chain is. Also for a mixture model example I would assume they know a beta prior is conjugate for binomial sampling.

Then think about how to provide a very simple example to sample from p(x, y) using gibbs sampling.
That is by sampling p(x|y) and then p(y|x). I think we can illustrate this and then
explain how it works.

Then move on to a mixture and doing inference for mixture proportions with known component distributions by introducing latent variables. That is, sampling
p(pi | x1...xn)
in the setting where pi=(1-pi1,pi1), with a beta prior on pi1, and each xj is drawn from
p(x) = (1-pi1) f0(x) + pi1 f1(x)
for known f0 and f1,
by introducing zj \sim bern(pi1)
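
The mixture step might be sketched as follows. This is a minimal illustration under assumptions, not a proposed final implementation: the function name `gibbs_mixture_prop`, the Beta(a, b) prior with a = b = 1, the starting value and the two normal components in the example are all chosen here for concreteness.

```r
# Gibbs sampler for the mixture proportion pi1 in
#   p(x) = (1 - pi1) f0(x) + pi1 f1(x),  with f0 and f1 known,
# using latent indicators z_j ~ Bernoulli(pi1) and a Beta(a, b) prior on pi1.
gibbs_mixture_prop <- function(x, f0, f1, n_iter = 2000, a = 1, b = 1) {
  n <- length(x)
  l0 <- f0(x)   # component densities do not change, so evaluate them once
  l1 <- f1(x)
  pi1 <- 0.5    # arbitrary starting value
  draws <- numeric(n_iter)
  for (i in seq_len(n_iter)) {
    # Sample z_j | pi1, x_j: probability that x_j came from component 1
    w <- pi1 * l1 / (pi1 * l1 + (1 - pi1) * l0)
    z <- rbinom(n, size = 1, prob = w)
    # Sample pi1 | z: beta prior is conjugate to the Bernoulli indicators
    pi1 <- rbeta(1, a + sum(z), b + n - sum(z))
    draws[i] <- pi1
  }
  draws
}

set.seed(1)
x <- c(rnorm(70, 0, 1), rnorm(30, 4, 1))  # simulated data, 30% from f1
draws <- gibbs_mixture_prop(
  x,
  f0 = function(x) dnorm(x, 0, 1),
  f1 = function(x) dnorm(x, 4, 1)
)
post_mean <- mean(draws[-(1:500)])  # discard the first 500 draws as burn-in
```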

mcmc dynalist

SISG 2022: Module 10, MCMC for Genetics
TS Eliot: "We shall not cease from exploration And the end of all our exploring Will be to arrive where we started And know the place for the first time."
Key information:
Instructors: Eric C. Anderson and Matthew Stephens.
TAs: Sue Parkinson and Karl Tayeb
Zoom meeting link:
https://uchicago.zoom.us/j/96210188590?pwd=VTNPME9LaE1SWGZmOTlickkxQUFCZz09
Additional details
Matthew Stephens is inviting you to a scheduled Zoom meeting.

Topic: sisg 2022
Time: Jul 18, 2022 08:00 AM Pacific Time (US and Canada)
Every day, 3 occurrence(s)
Jul 18, 2022 08:00 AM
Jul 19, 2022 08:00 AM
Jul 20, 2022 08:00 AM
Please download and import the following iCalendar (.ics) files to your calendar system.
Daily: https://uchicago.zoom.us/meeting/tJIvdumppjMvE9QNtB5fGIg8dgIlJAGUwdCG/ics?icsToken=98tyKuCurDoqG9ydtRCHRowAAIj4c-vxiFxYj_pssgvHViZ0SwSuMuVrPpheN-3H

Join Zoom Meeting
https://uchicago.zoom.us/j/96210188590?pwd=VTNPME9LaE1SWGZmOTlickkxQUFCZz09

Meeting ID: 962 1018 8590
Passcode: 089309
One tap mobile
+13126266799,,96210188590#,,,,*089309# US (Chicago)
+13462487799,,96210188590#,,,,*089309# US (Houston)

Dial by your location
+1 312 626 6799 US (Chicago)
+1 346 248 7799 US (Houston)
+1 646 558 8656 US (New York)
+1 646 931 3860 US
+1 669 444 9171 US
+1 669 900 9128 US (San Jose)
+1 253 215 8782 US (Tacoma)
+1 301 715 8592 US (Washington DC)
Meeting ID: 962 1018 8590
Passcode: 089309
Find your local number: https://uchicago.zoom.us/u/actUqRQMtG

Join by SIP
[email protected]

Join by H.323
162.255.37.11 (US West)
162.255.36.11 (US East)
115.114.131.7 (India Mumbai)
115.114.115.7 (India Hyderabad)
213.19.144.110 (Amsterdam Netherlands)
213.244.140.110 (Germany)
103.122.166.55 (Australia Sydney)
103.122.167.55 (Australia Melbourne)
149.137.40.110 (Singapore)
64.211.144.160 (Brazil)
149.137.68.253 (Mexico)
69.174.57.160 (Canada Toronto)
65.39.152.160 (Canada Vancouver)
207.226.132.110 (Japan Tokyo)
149.137.24.110 (Japan Osaka)
Meeting ID: 962 1018 8590
Passcode: 089309

Join by Skype for Business
https://uchicago.zoom.us/skype/96210188590
Slack: you should have access to the Slack channel mod10_mcmc_genetics_2022
Session Times (Seattle time, PST)
Monday 8am-2.30pm
Tuesday 8am-2:30pm
Wednesday 8-11:00am
Material will be delivered via zoom by live lectures and live practical sessions, with additional reading materials and/or slides also provided. Each session builds on previous sessions so you will get maximum benefit by attending every session live and in sequence.
#reading indicates vignette/reading/slides/materials
#exercise indicates exercises
#prep indicates material for instructors reference; you may ignore it
Zoom guidelines
The zoom link is https://uchicago.zoom.us/j/96210188590?pwd=VTNPME9LaE1SWGZmOTlickkxQUFCZz09 with further dial-in details given above under "key information"
We will record each session, and make available to participants as soon as practical. The recordings should be available for 90 days.
Please have your camera on where possible - it helps give a closer approximation to an "in person" experience. Especially try to have your camera on in break-out sessions.
Please mute yourself during lectures (unless you need to speak) but please unmute yourself during break-out sessions.
To get help during breakout sessions you may want to share your screen. You can only do that if you sign into zoom on your computer (not a phone or other mobile device).
Pre-module Preparation:
Please make sure you have working versions of R, Rstudio and the latest version of zoom installed on your computer.
https://www.r-project.org/
https://rstudio.com/products/rstudio/download/
https://zoom.us/
Please be sure to install some necessary R packages with
install.packages(c("tidyverse", "plotly", "workflowr", "expm", "viridis"))
Install the binary versions. Please do not install later versions from source code that require compilation.
Please download the materials from fiveMinuteStats
https://github.com/stephens999/fiveMinuteStats
if you know how to use git, then do it that way. Otherwise the easiest way is to click on the green "Code" button and download the zip file.
once you have downloaded the files, open up the file r_simplemix.Rmd in the analysis/ subdirectory and try to knit it using the Rstudio "knit" button.
In a similar manner to downloading the materials from fiveMinuteStats, also download the materials from sisg-mcmc-exercises-eca
https://github.com/eriqande/sisg-mcmc-exercises-eca
Day 1 (Times are approximate)
8:00 am Introductions (15 mins)
Instructors and TAs introduce themselves
Overview of course and materials
CHECK: Have you completed the preliminary preparation?
8:15am Session 0, Lecture: genetic mixture and breaking the ice! @ms
#reading
https://stephens999.github.io/fiveMinuteStats/r_simplemix.html
#exercise
1a. Find and run ("knit") the Rmd file that created https://stephens999.github.io/fiveMinuteStats/r_simplemix.html
HINT: the Rmd files are in the analysis subdirectory.
1b. Also run the file in the console (eg select "run all" from the Run menu)
2. Complete Exercise 1 in https://stephens999.github.io/fiveMinuteStats/r_simplemix.html
Compare/discuss/troubleshoot the answer to the Exercise in your break-out rooms
Since this is the first time you are using break-out rooms:
introduce yourselves! Give your name, academic background, research interests, and a hobby. Go by alphabetical order of family name. There will be approximately four students per breakout room. From now on we will call the first student A, the second B, then C and D etc.
First student (A) should take the lead in this session. In later sessions B, C and D will take it in turns to take the lead.
eg Student A can share screen as you work through the exercises together.... of course if it helps to switch to have another student share screen then go ahead...
Other students: make suggestions; ask questions... Try to help one another out!
If you would like help from a TA/instructor you should be able to ask for help from Zoom. (Alternatively use the slack channel, and tell us which breakout room would like assistance.) We will be there as soon as we can!
9:00am Session 1: Bayesian inference - the assignment problem @ms
#reading
https://stephens999.github.io/fiveMinuteStats/likelihood_ratio_simple_models.html
https://stephens999.github.io/fiveMinuteStats/LR_and_BF.html
https://stephens999.github.io/fiveMinuteStats/bayes_multiclass.html
#exercise
Use the ideas from this session to complete Exercise 2 in https://stephens999.github.io/fiveMinuteStats/r_simplemix.html
note the answer template in that file
Breakout rooms: student B in each room lead this session.
10 am Session 2: Bayesian inference - Estimating allele frequencies/binomial (50 mins)
#reading
https://stephens999.github.io/fiveMinuteStats/likelihood_function.html
https://stephens999.github.io/fiveMinuteStats/bayes_beta_binomial.html
https://stephens999.github.io/fiveMinuteStats/beta.html
https://stephens999.github.io/fiveMinuteStats/bayes_conjugate.html
#exercise
Complete Exercise 3 in https://stephens999.github.io/fiveMinuteStats/r_simplemix.html .
Breakout rooms: student C in each room lead this session.
11am (Lunch/self-study 1.5 hours)
12.30pm Session 3: Monte Carlo @ea (50 mins)
#reading
Monte Carlo lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-monte-carlo-lecture-slides.pdf
#exercise
When doing all of the exercises, always ask yourself these three questions: 1) what is the random variable being simulated? 2) what is the function g(x) that is being evaluated? and 3) what is the expectation that I am approximating?
Sampling from a beta posterior distribution
https://eriqande.github.io/sisg-mcmc-exercises-eca/monte-carlo-sampling-from-a-beta-posterior.nb.html (You can download the Rmd from the "Code" button in the upper right of this notebook, or work from the Rmd in the sisg-mcmc-exercises-eca repository)
BONUS READING/EXERCISES: Monte Carlo integration of a deterministic function. (You are not expected to get to it during class time, but it is there if you want to play with it in the evening)
https://eriqande.github.io/sisg-mcmc-exercises-eca/003-monte-carlo-to-evaluate-an-integral.nb.html (or the Rmd in the repo)
1.30pm Session 4: Markov Chains @ea (50 mins)
#reading
Markov Chains lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-markov-chains-lecture-slides.pdf
#exercise
Playing with the bouncing blob.
https://eriqande.github.io/sisg-mcmc-exercises-eca/markov-chain-bouncing-blob-exercise.nb.html (or the Rmd in the repo)
BONUS READING/EXERCISES: Biasing a random walk. You might not get to this during the class period, but it is a useful preamble to Session 5 if you can find the time.
https://eriqande.github.io/sisg-mcmc-exercises-eca/006-markov-chain-biased-random-walk.nb.html (or the Rmd in the repo)
2.30pm Formal period over. Instructors will be available to help troubleshoot issues arising during the day.
Day 2
8am Session 5: Metropolis--Hastings - Intro @ms
#reading https://stephens999.github.io/fiveMinuteStats/MH_intro.html
#prep Eric's sampling from the beta-density via M-H slides/animation.
https://github.com/eriqande/sisg-mcmc-opengl-computer-demos
overview instructions at https://www.youtube.com/watch?v=a8gjem86Uf4
run using sisg-mcmc-opengl-computer-demos stephens$ ./beta_sim
open windows using keys 1 and 2... start/stop using spacebar
9am Session 6: Practical session (MH Simple Examples) @ms
#reading https://stephens999.github.io/fiveMinuteStats/MH-examples1.html
#exercise
Find and run the code that produced the html above (analysis/MH-examples1.Rmd)
Run through the exercises under Examples 1 and 2 in that Rmd file
(Look at Example 3 if you finish 1+2)
10:30 Lunch/Self-study. 1.5 hours.
12 noon Session 7: Metropolis--Hastings in 2d @ea
#reading
MCMC in two dimensions lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-two-dimension-MCMC.pdf
#exercise

Investigate the inbreeding model in R code. The following notebook describes the 2-D and component-wise samplers. A few exercises and questions appear at the bottom.
https://eriqande.github.io/sisg-mcmc-exercises-eca/007-metropolis-hastings-inbreeding.nb.html (or the Rmd in the repo)
Note that a notebook that also includes the Gibbs sampler for this problem can be found at http://eriqande.github.io/sisg_mcmc_course/s04-01-inreeding-model-mcmc.nb.html
1.15 PM Session 8: Gibbs Sampling @ea
#reading
Gibbs sampling lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-Gibbs-sampling-inbreeding-model.pdf
Additional readings from fiveMinuteStats about gibbs sampling and the simple genetic mixture model:
https://stephens999.github.io/fiveMinuteStats/gibbs1.html
https://stephens999.github.io/fiveMinuteStats/gibbs_structure_simple.html
#exercise
We will use the ideas from this session to add to the r_simplemix.Rmd analysis and create a gibbs sampler
The exercises and answer templates are here:
https://stephens999.github.io/fiveMinuteStats/r_simplemix_gibbs_1.html
Day 3
8am Session 9: Gibbs sampling for genetic mixture @ms
In this session we discuss some possible extensions to the MCMC scheme from Session 8, as outlined here: https://stephens999.github.io/fiveMinuteStats/r_simplemix_gibbs_2.html
#exercise
The exercises and answer templates are here:
https://stephens999.github.io/fiveMinuteStats/r_simplemix_gibbs_2.html
Note: these exercises, especially working out the details of the update for m for the correlated allele frequencies model, could take some time, and implementing them all will take you beyond today I think...
9:30am Session 10: Importance sampling and Metropolis-Coupled MCMC @ea
Note that the code for the graphical simulations done in this session (and other sessions) is in: https://github.com/eriqande/sisg-mcmc-opengl-computer-demos
#reading
Importance sampling and simulated tempering lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-imp-samp-mcmcmc.pdf
#exercise
final discussions and course evaluations
11am: finish

wasser et al references?

@mdavy86 thanks for your PR. I haven't worked with refs before in knitr...

the Wasser et al refs don't
seem to show up in likelihood_ratio_simple_models.html
any ideas?

change black line to blue in mixture model

in https://stephens999.github.io/fiveMinuteStats/mixture_models_01.html
it would be more consistent to make the last line blue.

Also there is a p() missing in the words representation of mixture models.

the log(sum(exp(log))) trick

it would be nice to have a separate vignette explaining this
currently covered here:

https://stephens999.github.io/fiveMinuteStats/Importance_sampling.html#example:_computing_with_means_on_log_scale
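
For reference, a minimal sketch of the trick itself. The helper names `log_sum_exp` and `log_mean_exp` are chosen here and are not taken from the existing vignette. Subtracting the maximum before exponentiating keeps the largest term at exp(0) = 1, so the sum cannot underflow to zero.

```r
# Compute log(sum(exp(lx))) without underflow or overflow.
log_sum_exp <- function(lx) {
  m <- max(lx)
  m + log(sum(exp(lx - m)))
}

# Log of the mean of values supplied on the log scale.
log_mean_exp <- function(lx) {
  log_sum_exp(lx) - log(length(lx))
}

lw <- c(-1000, -1001, -1002)   # e.g. log importance weights
naive <- log(sum(exp(lw)))     # exp() underflows to 0, so this is -Inf
stable <- log_sum_exp(lw)      # finite, close to -999.59
```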

Asking for the content of Introduction to EM: Gaussian Mixture Models

Sorry, this is not an issue report, but I have one question. I tried to find the answer, but I could not.

I understand this formula, but I don't comprehend the product sign in the formal part. Why is the sum sign replaced by the product sign?

Thank you very much for helping me.

The likelihood ratio for continuous data - An example where the approximation breaks down

Link: https://stephens999.github.io/fiveMinuteStats/likelihood_ratio_simple_continuous_data.html

Are trueLR and approxLR swapped? It seems from the code that the function 'trueLR' calculates the approximated LR and 'approxLR' calculates the true LR.
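
To make the distinction concrete, here is a sketch that does not reproduce the vignette's code. It assumes two normal models and data recorded to within ±eps, and the names `exact_lr` and `density_lr` are hypothetical. The exact LR compares the probabilities of the recorded interval, while the approximation compares densities at the recorded value.

```r
# Exact LR: ratio of the probabilities that the measurement falls in
# [x - eps, x + eps] under each model.
exact_lr <- function(x, eps, mean0, sd0, mean1, sd1) {
  p0 <- pnorm(x + eps, mean0, sd0) - pnorm(x - eps, mean0, sd0)
  p1 <- pnorm(x + eps, mean1, sd1) - pnorm(x - eps, mean1, sd1)
  p0 / p1
}

# Approximate LR: ratio of densities at x, accurate when eps is small
# relative to the scale on which the densities change.
density_lr <- function(x, mean0, sd0, mean1, sd1) {
  dnorm(x, mean0, sd0) / dnorm(x, mean1, sd1)
}

approx_val <- density_lr(0.5, 0, 1, 2, 1)
fine <- exact_lr(0.5, eps = 1e-3, 0, 1, 2, 1)    # close to approx_val
coarse <- exact_lr(0.5, eps = 2, 0, 1, 2, 1)     # approximation breaks down
```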

need a vignette on propagating uncertainty

to illustrate this aspect of bayesian inference.

maybe in context of inference of normal mean with unknown variance.

is there a simpler example?

vignette on "Bayes Decision Rule for Prediction Problems"

Hi professor,
I think in this vignette, the '<=' in the last line of 'The Optimal Decision Rule' should be '>='. Could you check if this is the case? Thanks!

Introduction to Mixture Models talks about N(20,2) but implements N(20,1)

Many thanks for publishing your introductions under CC BY!
The text above the first piece of code for Mixture Models talks about a standard deviation of 2 and refers to N(20,2). The subsequent implementation and histogram use N(20,1), though. Maybe go for 2 everywhere?

Bug report: code in hmm.rmd

In the vignette of HMM, file hmm.Rmd, in the forward algorithm

Line 80:

alpha[t+1,k] = m[k]*emit(k,X[t])

it seems that line should be

alpha[t+1,k] = m[k]*emit(k,X[t + 1])

Please check.
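
For context, a self-contained sketch of a forward recursion with the suggested indexing. This is not the code from hmm.Rmd: the argument names `P`, `pi0` and `emit`, and the two-state example, are assumptions for illustration. The point is that the emission term at step t+1 uses the observation X[t + 1].

```r
# Forward algorithm sketch.
#   X    : observed sequence of length n
#   P    : K x K transition matrix, P[j, k] = Pr(Z_{t+1} = k | Z_t = j)
#   pi0  : initial state distribution of length K
#   emit : function(k, x) giving the emission probability of x in state k
forward <- function(X, P, pi0, emit) {
  n <- length(X)
  K <- length(pi0)
  alpha <- matrix(0, nrow = n, ncol = K)
  for (k in 1:K) alpha[1, k] <- pi0[k] * emit(k, X[1])
  for (t in seq_len(n - 1)) {
    m <- alpha[t, ] %*% P   # m[k] = sum_j alpha[t, j] * P[j, k]
    for (k in 1:K) {
      alpha[t + 1, k] <- m[k] * emit(k, X[t + 1])  # emission uses X[t + 1]
    }
  }
  alpha
}

# Illustrative two-state chain with binary observations
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
pi0 <- c(0.5, 0.5)
emit <- function(k, x) {
  p_one <- c(0.8, 0.3)[k]          # Pr(X = 1 | state k)
  if (x == 1) p_one else 1 - p_one
}
X <- c(1, 0, 1)
alpha <- forward(X, P, pi0, emit)
lik_forward <- sum(alpha[nrow(alpha), ])  # likelihood of the whole sequence
```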

The Dirichlet Distribution

Can you create an R example for the Dirichlet Distribution post?
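
One possible starting point, using only base R. The function name `rdirichlet_sketch` and the parameter values are made up for this sketch. It relies on the standard fact that independent Gamma(alpha_k, 1) draws, normalized to sum to one, follow a Dirichlet(alpha) distribution.

```r
# Draw n samples from Dirichlet(alpha) by normalizing independent gamma draws.
rdirichlet_sketch <- function(n, alpha) {
  K <- length(alpha)
  # Column k holds n draws from Gamma(shape = alpha[k], rate = 1)
  g <- matrix(rgamma(n * K, shape = rep(alpha, each = n), rate = 1),
              nrow = n, ncol = K)
  g / rowSums(g)   # each row now sums to 1
}

set.seed(1)
alpha <- c(2, 3, 5)
samples <- rdirichlet_sketch(10000, alpha)
sample_means <- colMeans(samples)    # should be close to alpha / sum(alpha)
```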

StatQuest

Hi Matthew,

I am not sure whether you are aware of this StatQuest website. It was created by Josh Starmer "as an attempt to explain statistics to genetics researchers". I feel this work is very related to your "5-min stats" idea and thus I am sharing it here, in case you have not seen it.

List of videos: https://statquest.org/video-index/

typo on index page: Inverse Tranform Sampling

normal means example

the shiny app is really nice at: https://stephens999.github.io/fiveMinuteStats/shiny_normal_example.html
but the background needs tidying.
A better statement of the model and data.
And use tau consistently for the precision (not tau^2, and not for the variance!)
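
One way the model statement could read when written entirely in terms of precisions. This is a sketch for a single observation $x$, with the symbols $\tau$, $\tau_0$ and $\mu_0$ assumed here rather than copied from the app:

$$X \mid \mu \sim N(\mu, 1/\tau), \qquad \mu \sim N(\mu_0, 1/\tau_0)$$

$$\mu \mid X = x \sim N(\mu_1, 1/\tau_1), \qquad \tau_1 = \tau_0 + \tau, \qquad \mu_1 = \frac{\tau_0 \mu_0 + \tau x}{\tau_0 + \tau}$$

Written this way, the posterior precision is the sum of the prior and data precisions, and the posterior mean is a precision-weighted average of the prior mean and the observation.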

vignette on "maximum likelihood estimation works"

the idea here would be to give a qualitative summary of the
two vignettes on Wilks theorem and normality of mle.

In short these results say "mle works".

I find students often don't appreciate this. Especially useful when devising complex
inference schemes under complex models... to check coding etc. Would be good
to give examples of this.
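
One simple example of such a check is sketched below. The exponential model, the true rate and the sample sizes are assumptions made for illustration. Simulate data with a known parameter, maximize the likelihood numerically, and confirm the estimate approaches the truth as n grows.

```r
# Simulate from a model with a known parameter, then check the mle recovers it.
set.seed(1)
true_rate <- 2

# Negative log-likelihood for i.i.d. exponential data
neg_loglik <- function(rate, x) -sum(dexp(x, rate = rate, log = TRUE))

# Numerical mle; for this model the closed form is 1 / mean(x)
mle_rate <- function(x) {
  optimize(neg_loglik, interval = c(1e-6, 100), x = x)$minimum
}

n_values <- c(10, 100, 1000, 10000)
estimates <- sapply(n_values, function(n) mle_rate(rexp(n, rate = true_rate)))

# Coding check: the numerical optimum should agree with the closed form
x_check <- rexp(500, rate = true_rate)
numeric_vs_closed <- c(mle_rate(x_check), 1 / mean(x_check))
```

If the estimate fails to approach the true value as n grows, that usually points to a bug in the likelihood code or the optimizer settings.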
