
fiveminutestats's Introduction

fiveMinuteStats

This repo is intended to contain short "vignettes" illustrating statistical concepts. It is very much a work in progress. Things may change quickly and often...

The name comes from the fact that, in principle, each vignette should be readable in a short amount of time. Perhaps five minutes.

The overall goal is that, by keeping vignettes short, we can make learning more "modular". Each vignette should, ideally, focus on introducing a single concept, or a small number of related concepts, that are easily digestible provided the prerequisite concepts are mastered. One reason for this is to make learning easier: break complex ideas down into smaller, more easily digestible chunks. Another is that it encourages re-use of material. Just as software engineers write software in a "modular" way, with each function performing a well-defined role, the idea is that these vignettes make learning "modular". If you don't like the way one vignette introduces a concept, you can write a different one and replace just that one part. And in principle, if this takes off, we can have large numbers of authors, each contributing a small number of vignettes. Modularizing facilitates sharing the load.
In principle we can have multiple vignettes for the same concept, and users can choose which one they like.

The idea of breaking learning down into small chunks is kind of obvious, but I was personally inspired by watching videos with my kids: https://artofproblemsolving.com/videos/prealgebra Maybe we can make learning statistics this easy and this much fun? However, I decided against video because a) I'm not as funny as this guy, and b) it is harder to collaborate and update videos.

If you are interested in these ideas, please get in touch, [email protected] (remove the marsupial).

For contributors

The repo is organized using John Blischak's workflowr R package. Each vignette is an R Markdown file, saved in the 'analysis' subdirectory. To add a vignette, run the following:

library("workflowr")
wflow_open("analysis/newfile.Rmd")

See the workflowr online documentation to learn more.

fiveminutestats's People

Contributors

accio, bonstats, ddejohn, hunderlinek, jasha10, jdblischak, jhmarcus, jhsiao999, jnovembre, jsta, lechten, matthiaseckhart, mbonakda, mdavy86, nalzok, nanxstats, sflippl, stephens999, we-taper, yue-jiang


fiveminutestats's Issues

Typo in formula for Wright-Fisher Model

Hi,
I think that there might be a typo in your Introduction to the Wright-Fisher Model (which is very helpful, thanks for that!). If I understood it correctly, your formula:

$$X_{t} \mid X_{t-1} = x_{t-1} \sim Binomial(n = 2N, p = \frac{x_{t-1}}{2N})$$

should give the probability for $x_{t}$ and not $x_{t-1}$ and should therefore be:

$$X_{t} \mid X_{t-1} = x_{t} \sim Binomial(n = 2N, p = \frac{x_{t-1}}{2N})$$

Is this correct?
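
For readers following the notation, a minimal R sketch of one generation of Wright-Fisher sampling may help. It assumes the usual reading of the formula, in which the value conditioned on is the previous generation's allele count x_{t-1}. The function name wf_step and the values of N, x0 and n_gen are hypothetical choices for illustration and are not taken from the vignette.

```r
# One Wright-Fisher generation: given x_prev copies of an allele among 2N,
# draw the next count from Binomial(2N, x_prev / (2N)).
# wf_step, N, x0 and n_gen are illustrative names/values, not from the vignette.
wf_step <- function(x_prev, N) {
  rbinom(1, size = 2 * N, prob = x_prev / (2 * N))
}

set.seed(1)
N <- 50        # diploid population size (assumed)
x0 <- 20       # starting allele count out of 2N (assumed)
n_gen <- 30    # number of generations to simulate (assumed)

x <- numeric(n_gen + 1)
x[1] <- x0
for (t in seq_len(n_gen)) {
  # X_t depends on the past only through the realised value x_{t-1}
  x[t + 1] <- wf_step(x[t], N)
}
print(x)
```

Each new count is drawn using only the realised count from the generation before, which is what the conditioning event in the formula expresses.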

data augmentation

Could do with a vignette on data augmentation more generally (i.e. more than just mixture models) to illustrate the idea.

Ideas: mixture models, factor analysis, non-negative matrix factorization?

Missing data example in "Likelihood Ratios: examples and pitfalls": is the probability of missing data the same for both models?

Having read the paragraph that explains how to incorporate the probability of missing data, I cannot think of a reason why it would be different in the tusk example: both models use the same data, and the probability of a failing test is therefore the same.

However, the data are weighted differently because the first allele is twice as likely in the forest elephant. If we had a DNA test that failed more often when the allele is not present, could we then conclude that missing the first marker is more probable under M_s?
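
The cancellation described in the question can be checked numerically. The sketch below uses made-up allele frequencies and failure probabilities (every number and name is hypothetical, not taken from the vignette). When the probability of a failed test is the same under both models it drops out of the likelihood ratio; when it differs, the ratio is rescaled by the ratio of the failure probabilities.

```r
# Hypothetical allele frequencies at the markers that were observed
f_s <- c(0.40, 0.12, 0.21)   # under model M_s (assumed values)
f_f <- c(0.80, 0.20, 0.11)   # under model M_f (assumed values)

# Likelihood ratio from the observed markers only
lr_observed <- prod(f_s) / prod(f_f)

# Case 1: one marker is missing, and the test fails with the same
# probability q under both models, so q cancels from the ratio.
q <- 0.05
lr_same <- (prod(f_s) * q) / (prod(f_f) * q)

# Case 2: the failure probability depends on the model (q_s vs q_f),
# so the ratio is multiplied by q_s / q_f.
q_s <- 0.10
q_f <- 0.05
lr_diff <- (prod(f_s) * q_s) / (prod(f_f) * q_f)

print(c(observed = lr_observed, same = lr_same, different = lr_diff))
```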

vignette for lagrange multipliers?

@mbonakda you asked about a wishlist - I think it would be nice to
have a vignette on how to do MLE for the multinomial using Lagrange multipliers.

The challenge is that most readers won't know what a lagrange multiplier is...

Could start by doing binomial (which one can easily do without multipliers)
and then extending to multinomial...
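
As a possible starting point, here is a small R check of the result such a vignette would derive. Maximising sum_k n_k log(p_k) subject to sum_k p_k = 1 with a multiplier lambda gives n_k / p_k = lambda for every k, hence lambda = n and p_k = n_k / n. The counts below are hypothetical, and the softmax reparameterisation is only a convenient way to let a general-purpose optimiser confirm the closed form.

```r
counts <- c(12, 30, 58)          # hypothetical multinomial counts
n <- sum(counts)

# Closed form from the Lagrange condition n_k / p_k = lambda for all k,
# combined with sum(p) = 1, which gives lambda = n and p_k = n_k / n.
p_hat <- counts / n
lambda <- counts / p_hat         # constant across k, equal to n

# Numerical check: maximise the log-likelihood without an explicit constraint
# by writing p as a softmax of free parameters (last one fixed at 0).
neg_loglik <- function(theta) {
  p <- exp(c(theta, 0))
  p <- p / sum(p)
  -sum(counts * log(p))
}
fit <- optim(c(0, 0), neg_loglik, method = "BFGS")
p_num <- exp(c(fit$par, 0)) / sum(exp(c(fit$par, 0)))

print(rbind(closed_form = p_hat, numerical = p_num))
```

With two categories the same calculation reduces to the binomial case, where the constraint can be substituted directly and no multiplier is needed.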

Gibbs Sampling

Here are my thoughts on a vignette for Gibbs sampling.
We should think about what to assume people already know. For example I think we can assume they know what a conditional distribution is and what a Markov Chain is. Also for a mixture model example I would assume they know a beta prior is conjugate for binomial sampling.

Then think about how to provide a very simple example of sampling from p(x, y) using Gibbs sampling,
that is, by alternately sampling from p(x|y) and then p(y|x). I think we can illustrate this and then
explain how it works.
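
One candidate for the very simple example is a standard bivariate normal with correlation rho, where both conditionals are known: x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x. This is only a sketch of what such a vignette might contain; the target distribution, rho, the chain length and the burn-in are all assumed for illustration.

```r
set.seed(1)
rho <- 0.8                # assumed correlation of the target distribution
n_iter <- 5000            # chain length (assumed)
x <- numeric(n_iter)
y <- numeric(n_iter)
x_cur <- 0
y_cur <- 0
cond_sd <- sqrt(1 - rho^2)

for (i in seq_len(n_iter)) {
  x_cur <- rnorm(1, mean = rho * y_cur, sd = cond_sd)  # draw from p(x | y)
  y_cur <- rnorm(1, mean = rho * x_cur, sd = cond_sd)  # draw from p(y | x)
  x[i] <- x_cur
  y[i] <- y_cur
}

keep <- 501:n_iter        # discard an (arbitrary) burn-in
print(cor(x[keep], y[keep]))
```

After burn-in the sampled pairs should show a correlation close to rho, even though no draw was ever made from the joint distribution directly.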

Then move on to a mixture and doing inference for mixture proportions with known component distributions by introducing latent variables. That is, sampling
p(pi | x1...xn)
in the setting where pi = (1-pi1, pi1), pi1 has a beta prior, and each xj has density
p(x) = (1-pi1) f0(x) + pi1 f1(x)
for known f0 and f1,
by introducing latent indicators zj \sim bern(pi1).
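
A rough R sketch of that sampler follows, to make the two update steps concrete. The component densities (f0 = N(0, 1), f1 = N(3, 1)), the Beta(1, 1) prior, the simulated data and all variable names are assumptions chosen for illustration.

```r
set.seed(1)

# Simulated data from a two-component mixture (all values assumed)
n <- 500
pi1_true <- 0.3
z_true <- rbinom(n, 1, pi1_true)
x <- rnorm(n, mean = ifelse(z_true == 1, 3, 0), sd = 1)

f0 <- function(x) dnorm(x, mean = 0, sd = 1)   # known component 0
f1 <- function(x) dnorm(x, mean = 3, sd = 1)   # known component 1
a <- 1; b <- 1                                 # Beta(a, b) prior on pi1

n_iter <- 2000
pi1 <- numeric(n_iter)
pi1_cur <- 0.5
for (i in seq_len(n_iter)) {
  # Step 1: sample each z_j given pi1 and x_j
  w1 <- pi1_cur * f1(x)
  w0 <- (1 - pi1_cur) * f0(x)
  z <- rbinom(n, 1, w1 / (w0 + w1))
  # Step 2: sample pi1 given z (beta prior is conjugate to Bernoulli draws)
  pi1_cur <- rbeta(1, a + sum(z), b + n - sum(z))
  pi1[i] <- pi1_cur
}

post_mean <- mean(pi1[-(1:500)])   # posterior mean after burn-in
print(post_mean)
```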

mcmc dynalist

SISG 2022: Module 10, MCMC for Genetics
TS Eliot: "We shall not cease from exploration And the end of all our exploring Will be to arrive where we started And know the place for the first time."
Key information:
Instructors: Eric C. Anderson and Matthew Stephens.
TAs: Sue Parkinson and Karl Tayeb
Zoom meeting link:
https://uchicago.zoom.us/j/96210188590?pwd=VTNPME9LaE1SWGZmOTlickkxQUFCZz09
Additional details
Matthew Stephens is inviting you to a scheduled Zoom meeting.

Topic: sisg 2022
Time: Jul 18, 2022 08:00 AM Pacific Time (US and Canada)
Every day, 3 occurrence(s)
Jul 18, 2022 08:00 AM
Jul 19, 2022 08:00 AM
Jul 20, 2022 08:00 AM
Please download and import the following iCalendar (.ics) files to your calendar system.
Daily: https://uchicago.zoom.us/meeting/tJIvdumppjMvE9QNtB5fGIg8dgIlJAGUwdCG/ics?icsToken=98tyKuCurDoqG9ydtRCHRowAAIj4c-vxiFxYj_pssgvHViZ0SwSuMuVrPpheN-3H

Join Zoom Meeting
https://uchicago.zoom.us/j/96210188590?pwd=VTNPME9LaE1SWGZmOTlickkxQUFCZz09

Meeting ID: 962 1018 8590
Passcode: 089309
One tap mobile
+13126266799,,96210188590#,,,,*089309# US (Chicago)
+13462487799,,96210188590#,,,,*089309# US (Houston)

Dial by your location
+1 312 626 6799 US (Chicago)
+1 346 248 7799 US (Houston)
+1 646 558 8656 US (New York)
+1 646 931 3860 US
+1 669 444 9171 US
+1 669 900 9128 US (San Jose)
+1 253 215 8782 US (Tacoma)
+1 301 715 8592 US (Washington DC)
Meeting ID: 962 1018 8590
Passcode: 089309
Find your local number: https://uchicago.zoom.us/u/actUqRQMtG

Join by SIP
[email protected]

Join by H.323
162.255.37.11 (US West)
162.255.36.11 (US East)
115.114.131.7 (India Mumbai)
115.114.115.7 (India Hyderabad)
213.19.144.110 (Amsterdam Netherlands)
213.244.140.110 (Germany)
103.122.166.55 (Australia Sydney)
103.122.167.55 (Australia Melbourne)
149.137.40.110 (Singapore)
64.211.144.160 (Brazil)
149.137.68.253 (Mexico)
69.174.57.160 (Canada Toronto)
65.39.152.160 (Canada Vancouver)
207.226.132.110 (Japan Tokyo)
149.137.24.110 (Japan Osaka)
Meeting ID: 962 1018 8590
Passcode: 089309

Join by Skype for Business
https://uchicago.zoom.us/skype/96210188590
Slack: you should have access to the Slack channel mod10_mcmc_genetics_2022
Session Times (Seattle time, PST)
Monday 8am-2:30pm
Tuesday 8am-2:30pm
Wednesday 8-11:00am
Material will be delivered via zoom by live lectures and live practical sessions, with additional reading materials and/or slides also provided. Each session builds on previous sessions so you will get maximum benefit by attending every session live and in sequence.
#reading indicates vignette/reading/slides/materials
#exercise indicates exercises
#prep indicates material for instructors reference; you may ignore it
Zoom guidelines
The zoom link is https://uchicago.zoom.us/j/96210188590?pwd=VTNPME9LaE1SWGZmOTlickkxQUFCZz09 with further dial-in details given above under "key information"
We will record each session, and make available to participants as soon as practical. The recordings should be available for 90 days.
Please have your camera on where possible - it helps give a closer approximation to an "in person" experience. Especially try to have your camera on in break-out sessions.
Please mute yourself during lectures (unless you need to speak) but please unmute yourself during break-out sessions.
To get help during breakout sessions you may want to share your screen. You can only do that if you sign into zoom on your computer (not a phone or other mobile device).
Pre-module Preparation:
Please make sure you have working versions of R, Rstudio and the latest version of zoom installed on your computer.
https://www.r-project.org/
https://rstudio.com/products/rstudio/download/
https://zoom.us/
Please be sure to install some necessary R packages with
install.packages(c("tidyverse", "plotly", "workflowr", "expm", "viridis"))
Install the binary versions. Please do not install later versions from source code that require compilation.
Please download the materials from fiveMinuteStats
https://github.com/stephens999/fiveMinuteStats
If you know how to use git, then do it that way. Otherwise the easiest way is to click on the green "Code" button and download the zip file.
Once you have downloaded the files, open the file r_simplemix.Rmd in the analysis/ subdirectory and try to knit it using the RStudio "Knit" button.
In a similar manner to downloading the materials from fiveMinuteStats, also download the materials from sisg-mcmc-exercises-eca
https://github.com/eriqande/sisg-mcmc-exercises-eca
Day 1 (Times are approximate)
8:00 am Introductions (15 mins)
Instructors and TAs introduce themselves
Overview of course and materials
CHECK: Have you completed the preliminary preparation?
8:15am Session 0, Lecture: genetic mixture and breaking the ice! @ms
#reading
https://stephens999.github.io/fiveMinuteStats/r_simplemix.html
#exercise
1a. Find and run ("knit") the Rmd file that created https://stephens999.github.io/fiveMinuteStats/r_simplemix.html
HINT: the Rmd files are in the analysis subdirectory.
1b. Also run the file in the console (eg select "run all" from the Run menu)
2. Complete Exercise 1 in https://stephens999.github.io/fiveMinuteStats/r_simplemix.html
Compare/discuss/troubleshoot the answer to the Exercise in your break-out rooms
Since this is the first time you are using break-out rooms:
introduce yourselves! Give your name, academic background, research interests, and a hobby. Go by alphabetical order of family name. There will be approximately four students per breakout room. From now on we will call the first student A, the second B, then C and D etc.
First student (A) should take the lead in this session. In later sessions B, C and D will take it in turns to take the lead.
eg Student A can share screen as you work through the exercises together.... of course if it helps to switch to have another student share screen then go ahead...
Other students: make suggestions; ask questions... Try to help one another out!
If you would like help from a TA/instructor you should be able to ask for help from Zoom. (Alternatively use the slack channel, and tell us which breakout room would like assistance.) We will be there as soon as we can!
9:00am Session 1: Bayesian inference - the assignment problem @ms
#reading
https://stephens999.github.io/fiveMinuteStats/likelihood_ratio_simple_models.html
https://stephens999.github.io/fiveMinuteStats/LR_and_BF.html
https://stephens999.github.io/fiveMinuteStats/bayes_multiclass.html
#exercise
Use the ideas from this session to complete Exercise 2 in https://stephens999.github.io/fiveMinuteStats/r_simplemix.html
note the answer template in that file
Breakout rooms: student B in each room lead this session.
10 am Session 2: Bayesian inference - Estimating allele frequencies/binomial (50 mins)
#reading
https://stephens999.github.io/fiveMinuteStats/likelihood_function.html
https://stephens999.github.io/fiveMinuteStats/bayes_beta_binomial.html
https://stephens999.github.io/fiveMinuteStats/beta.html
https://stephens999.github.io/fiveMinuteStats/bayes_conjugate.html
#exercise
Complete Exercise 3 in https://stephens999.github.io/fiveMinuteStats/r_simplemix.html .
Breakout rooms: student C in each room lead this session.
11am (Lunch/self-study 1.5 hours)
12.30pm Session 3: Monte Carlo @ea (50 mins)
#reading
Monte Carlo lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-monte-carlo-lecture-slides.pdf
#exercise
When doing all of the exercises, always ask yourself these three questions: 1) what is the random variable being simulated? 2) what is the function g(x) that is being evaluated? and 3) what is the expectation that I am approximating?
Sampling from a beta posterior distribution
https://eriqande.github.io/sisg-mcmc-exercises-eca/monte-carlo-sampling-from-a-beta-posterior.nb.html (You can download the Rmd from the "Code" button in the upper right of this notebook, or work from the Rmd in the sisg-mcmc-exercises-eca repository)
BONUS READING/EXERCISES: Monte Carlo integration of a deterministic function. (You are not expected to get to it during class time, but it is there if you want to play with it in the evening)
https://eriqande.github.io/sisg-mcmc-exercises-eca/003-monte-carlo-to-evaluate-an-integral.nb.html (or the Rmd in the repo)
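As a warm-up for the three questions above, here is a minimal Monte Carlo sketch in R. It is not taken from the course notebooks; the Beta(3, 9) distribution, the threshold 0.4 and the number of draws are assumed for illustration. The random variable is X ~ Beta(3, 9), the function is g(x) = 1 if x > 0.4 and 0 otherwise, and the expectation being approximated is E[g(X)] = P(X > 0.4).

```r
set.seed(1)
a <- 3; b <- 9                        # hypothetical Beta parameters
n_draws <- 100000

x <- rbeta(n_draws, a, b)             # 1) the random variable being simulated
g <- function(x) as.numeric(x > 0.4)  # 2) the function g(x) being evaluated
mc_est <- mean(g(x))                  # 3) approximates E[g(X)] = P(X > 0.4)

# Exact value for comparison, from the Beta distribution function
exact <- pbeta(0.4, a, b, lower.tail = FALSE)
print(c(monte_carlo = mc_est, exact = exact))
```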
1.30pm Session 4: Markov Chains @ea (50 mins)
#reading
Markov Chains lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-markov-chains-lecture-slides.pdf
#exercise
Playing with the bouncing blob.
https://eriqande.github.io/sisg-mcmc-exercises-eca/markov-chain-bouncing-blob-exercise.nb.html (or the Rmd in the repo)
BONUS READING/EXERCISES: Biasing a random walk. You might not get to this during the class period, but it is a useful preamble to Session 5 if you can find the time.
https://eriqande.github.io/sisg-mcmc-exercises-eca/006-markov-chain-biased-random-walk.nb.html (or the Rmd in the repo)
2.30pm Formal period over. Instructors will be available to help troubleshoot issues arising during the day.
Day 2
8am Session 5: Metropolis--Hastings - Intro @ms
#reading https://stephens999.github.io/fiveMinuteStats/MH_intro.html
#prep Eric's sampling from the beta-density via M-H slides/animation.
https://github.com/eriqande/sisg-mcmc-opengl-computer-demos
overview instructions at https://www.youtube.com/watch?v=a8gjem86Uf4
run using sisg-mcmc-opengl-computer-demos stephens$ ./beta_sim
open windows using keys 1 and 2... start/stop using spacebar
9am Session 6: Practical session (MH Simple Examples) @ms
#reading https://stephens999.github.io/fiveMinuteStats/MH-examples1.html
#exercise
Find and run the code that produced the html above (analysis/MH-examples1.Rmd)
Run through the exercises under Examples 1 and 2 in that Rmd file
(Look at Example 3 if you finish 1+2)
10:30 Lunch/Self-study. 1.5 hours.
12 noon Session 7: Metropolis--Hastings in 2d @ea
#reading
MCMC in two dimensions lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-two-dimension-MCMC.pdf
#exercise

1. Investigate the inbreeding model in R code. The following notebook describes the 2-D and component-wise samplers. A few exercises and questions appear at the bottom.
https://eriqande.github.io/sisg-mcmc-exercises-eca/007-metropolis-hastings-inbreeding.nb.html (or the Rmd in the repo)
Note that a notebook that also includes the Gibbs sampler for this problem can be found at http://eriqande.github.io/sisg_mcmc_course/s04-01-inreeding-model-mcmc.nb.html
1.15 PM Session 8: Gibbs Sampling @ea
#reading
Gibbs sampling lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-Gibbs-sampling-inbreeding-model.pdf
Additional readings from fiveMinuteStats about gibbs sampling and the simple genetic mixture model:
https://stephens999.github.io/fiveMinuteStats/gibbs1.html
https://stephens999.github.io/fiveMinuteStats/gibbs_structure_simple.html
#exercise
We will use the ideas from this session to add to the r_simplemix.Rmd analysis and create a gibbs sampler
The exercises and answer templates are here:
https://stephens999.github.io/fiveMinuteStats/r_simplemix_gibbs_1.html
Day 3
8am Session 9: Gibbs sampling for genetic mixture @ms
In this session we discuss some possible extensions to the MCMC scheme from Session 8, as outlined here: https://stephens999.github.io/fiveMinuteStats/r_simplemix_gibbs_2.html
#exercise
The exercises and answer templates are here:
https://stephens999.github.io/fiveMinuteStats/r_simplemix_gibbs_2.html
Note: these exercises, especially working out the details of the update for m for the correlated allele frequencies model, could take some time, and implementing them all will take you beyond today I think...
9:30am Session 10: Importance sampling and Metropolis-Coupled MCMC @ea
Note that the code for the graphical simulations done in this session (and other sessions) is in: https://github.com/eriqande/sisg-mcmc-opengl-computer-demos
#reading
Importance sampling and simulated tempering lecture slides in PDF:
https://eriqande.github.io/sisg_mcmc_course/2021-imp-samp-mcmcmc.pdf
#exercise
final discussions and course evaluations
11am: finish

wasser et al references?

@mdavy86 thanks for your PR. I haven't worked with refs before in knitr...

the Wasser et al refs don't
seem to show up in likelihood_ratio_simple_models.html
any ideas?

Asking for the content of Introduction to EM: Gaussian Mixture Models

I am sorry. I do not report any issues, but I have one question. I tried to find the answer, but I could not get it.
image
I understand this formula, but I don't comprehend the product sign in the formal part. Why is the sum sign replaced by the product sign?

image

Thank you very much for helping me.

Bug report: code in hmm.rmd

In the vignette of HMM, file hmm.Rmd, in the forward algorithm

Line 80:

alpha[t+1,k] = m[k]*emit(k,X[t])

it seems that line should be

alpha[t+1,k] = m[k]*emit(k,X[t + 1])

Please check.
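
For reference, a self-contained sketch of the forward recursion is given below. It is not the code from hmm.Rmd: the function name forward, the two-state transition matrix, the Gaussian emissions and the data are all assumed for illustration. Under the convention that alpha[t, k] is the joint probability of observations 1..t and state k at time t, the update for row t + 1 uses the emission probability of X[t + 1].

```r
# Forward algorithm for a discrete-state HMM (illustrative sketch).
# X: observations; P: K x K transition matrix; init: initial state
# probabilities; emit(k, x): emission density of x in state k.
forward <- function(X, P, init, emit) {
  n_obs <- length(X)
  K <- nrow(P)
  alpha <- matrix(0, nrow = n_obs, ncol = K)
  for (k in seq_len(K)) {
    alpha[1, k] <- init[k] * emit(k, X[1])
  }
  for (t in seq_len(n_obs - 1)) {
    m <- alpha[t, ] %*% P                         # propagate one step
    for (k in seq_len(K)) {
      alpha[t + 1, k] <- m[k] * emit(k, X[t + 1]) # weight by the new observation
    }
  }
  alpha
}

# Hypothetical two-state example with Gaussian emissions
emit <- function(k, x) dnorm(x, mean = c(-1, 1)[k], sd = 1)
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
init <- c(0.5, 0.5)
X <- c(-0.3, 1.2, 0.8)

alpha <- forward(X, P, init, emit)
lik <- sum(alpha[length(X), ])   # likelihood of the whole sequence
print(lik)
```

Summing the last row of alpha gives the likelihood of the observed sequence, which can be compared against a brute-force sum over all state paths for short sequences.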

StatQuest

Hi Matthew,

I am not sure whether you are aware of this StatQuest website. It was created by Josh Starmer "as an attempt to explain statistics to genetics researchers". I feel this work is very related to your "5-min stats" idea and thus I am sharing it here, in case you have not seen it.

List of videos: https://statquest.org/video-index/

vignette on "maximum likelihood estimation works"

the idea here would be to give a qualitative summary of the
two vignettes on Wilks theorem and normality of mle.

In short these results say "mle works".

I find students often don't appreciate this. Especially useful when devising complex
inference schemes under complex models... to check coding etc. Would be good
to give examples of this.
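
One possible example, sketched in R, covers the normality result only (not Wilks' theorem): simulate many binomial data sets, compute the MLE in each, and compare its spread with the asymptotic standard error. The true proportion, sample size and number of replicates are assumed for illustration.

```r
set.seed(1)
p_true <- 0.3      # true success probability (assumed)
n <- 200           # observations per data set (assumed)
n_rep <- 5000      # number of simulated data sets (assumed)

# The binomial MLE is the sample proportion, computed for each replicate
p_hat <- rbinom(n_rep, size = n, prob = p_true) / n

# Asymptotic theory: p_hat is approximately N(p_true, p_true (1 - p_true) / n)
se_theory <- sqrt(p_true * (1 - p_true) / n)
z <- (p_hat - p_true) / se_theory
coverage <- mean(abs(z) < 1.96)   # should be close to 0.95

print(c(mean_mle = mean(p_hat), sd_mle = sd(p_hat),
        se_theory = se_theory, coverage = coverage))
```

The same pattern (simulate from known parameters, fit, and check that the estimates are centred on the truth with the expected spread) is a handy check on code for more complex models.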
