
bayesian-stats-modelling-tutorial's People

Contributors

amcnicho, canyon289, ericmjl, hugobowne, matthewfeickert, pydatawrangler, rohitsanj, willingc


bayesian-stats-modelling-tutorial's Issues

Modularity in content

We need to modularize nb1 & 2.

Also need to restructure 3 onwards.

NB3 onwards:

  • Group comparison
  • Arbitrary curve regression

[DOC] images for models in examples

The links in the README pull up HTML pages that seem to be missing images for the models.

Perhaps I could convert them to markdown/LaTeX cells so they're guaranteed to render?

Love this repo, by the way. Writing a software library (and demos for it) based on my thesis work in inverse problems, and this helps motivate what "good tutorials" can look like.

Evidence/data for the bright future of Probabilistic Programming

Both @ericmjl and I are firm believers that Probabilistic Programming has a bright and huge future.

I know other people believe the same. @springcoil has said to me previously that "PP is the new deep learning" and I understand that @twiecki feels similarly.

What I'd like to do here is amass evidence of the bright future of PP and why we think it will garner increasing adoption.

A few things I've thought of

I appreciate this is very limited!

What other evidence/data is there for the future of PPL?

Note: @ericmjl and I are currently drafting a book proposal for O'Reilly, which motivated this question.

Tagging @fonnesbeck, @ericmjl, @betanalpha, @FrizzleFry, @springcoil, @twiecki, @justinbois, @AllenDowney as you all may have thoughts here. Do feel free to tag anybody else you think may have ideas.

thanks!

Notes for EM teaching HBA's material SciPy 2019

There's a chance that I may not make SciPy due to family health reasons and @ericmjl will teach the material I would have.

This issue contains notes for Eric to do so.

I have split NBs 1 & 2 into 2 NBs each:

  • 1a on probability distributions and their stories
  • 1b on joint/cond. probs & Bayes' Rule
  • 2a on "from Bayes' rule to Bayesian inference"
  • 2b on PP and Bayesian workflow

I recall that we were going to spend 90 minutes on this material and that we would cover NBs 1a & 2b in detail, while covering 1b/2a more cursorily.

TBD w/ @ericmjl & we'll drop more notes in here

Colab repo

hey @ericmjl do you know any easy-ish way to create a colab for this repo?

Ideally, without needing to use google drive or anything like that?

e.g. for a single NB in a repo, i can just use the colab chrome extension -- like i did under the colab badges here

but i can't seem to figure out how to do it easily for a repo.

LMK if you know, bro!

Comparing Strong Priors to the Jeffreys Prior

This came out of some discussion with @ericmjl in the hallway track at SciPy 2018.

I had asked during the tutorial about choosing a highly informative prior that assumes a false belief. The widget below visualizes what happens in such a case. We model the highly informative prior with a Gaussian with a small standard deviation. It is modeled after the similar visualization in notebook 2.

import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

def gaussian_pdf(x, mu, sd):
    """
    The (unnormalized) Gaussian probability density function. We could import
    this, but the visualization is smoother if we define the function here.
    """
    return np.exp(-np.power(x - mu, 2) / (2 * np.power(sd, 2)))

def plot_posteriors(p=0.6, N=0, mu=0.5, sd=1):
    np.random.seed(42)
    n_successes = np.random.binomial(N, p)
    x = np.linspace(0.01, 0.99, 100)
    prior1 = gaussian_pdf(x, mu, sd)
    likelihood = x**n_successes*(1-x)**(N-n_successes)
    posterior1 = likelihood * prior1
    posterior1 /= np.max(posterior1)  # so that peak always at 1
    plt.plot(x, posterior1, label='Gaussian prior')
    jp = np.sqrt(x*(1-x))**(-1)  # Jeffreys prior
    posterior2 = likelihood*jp  # w/ Jeffreys prior
    posterior2 /= np.max(posterior2)  # so that peak always at 1 (not quite correct to do; see below)
    plt.plot(x, posterior2, label='Jeffreys prior')
    plt.legend()
    plt.show()

The visualization can be created with

interact(plot_posteriors, p=(0, 1, 0.01), N=(0, 100), mu=(0, 1, 0.1), sd=(0.01, 1, 0.01));


I also wrote up a bit of explanation:

Here we are parameterizing the Gaussian pdf by (no surprise) the mean (mu) and standard deviation (sd). We can think of the standard deviation here as a measure of certainty in our prior belief that the probability of getting heads is the mean of the Gaussian.

Interesting things to try:

  1. Set the mean (mu) to 0.2 and the standard deviation (sd) to 0.01. This is specifying a prior that we have a strong belief that the probability of flipping a heads is 0.2 (which is wrong assuming that p is still set to 0.6). Now change the number of trials (N) to 100. What happens?
  2. Now slowly move the standard deviation (sd) to 1. What happens now? Why does the posterior corresponding to the Gaussian prior converge to the posterior from the Jeffreys prior?
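As a cross-check on the grid approach above: the Jeffreys prior for a binomial proportion is Beta(1/2, 1/2), so by conjugacy the exact posterior after observing k heads in N flips is Beta(1/2 + k, 1/2 + N − k). A minimal sketch (the N and k values here are made up):

```python
import numpy as np
from scipy import stats

# Made-up example data: k heads out of N flips.
N, k = 100, 57
x = np.linspace(0.01, 0.99, 100)

# Grid posterior under the Jeffreys prior, as in the widget above,
# but normalized to integrate to 1 instead of peak-normalized.
likelihood = x**k * (1 - x)**(N - k)
jeffreys = 1 / np.sqrt(x * (1 - x))
posterior_grid = likelihood * jeffreys
posterior_grid /= posterior_grid.sum() * (x[1] - x[0])

# Exact conjugate posterior: Beta(1/2 + k, 1/2 + N - k).
posterior_exact = stats.beta(0.5 + k, 0.5 + N - k).pdf(x)
```

The two curves agree closely, which is also why peak-normalizing in the widget is harmless for visual comparison even though it is "not quite correct".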

conda env create complains about pip

This was seen using conda 4.6.14.

$ conda env create -f environment.yml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata: ...working... failed

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/msys2/noarch/repodata.json.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

If your current network has https://www.anaconda.com blocked, please file
a support request with your network engineering team.

SSLError(MaxRetryError('HTTPSConnectionPool(host=\'repo.anaconda.com\', port=443): Max retries exceeded with url: /pkgs/msys2/noarch/repodata.json.bz2 (Caused by SSLError("Can\'t connect to HTTPS URL because the SSL module is not available."))'))

Steps for html version of tutorial

@ericmjl Today we discussed steps for an html version of tutorial:

  • add notebooks to subdir of /docs
  • make sure they execute programmatically: can do this manually by making sure all cells execute in order; can you remind me of how to automate this step?
  • Travis will build and create github pages

Did I get that mostly correct?
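One way to automate the "all cells execute in order" check is nbconvert's ExecutePreprocessor; a minimal sketch (the function name and timeout are my own choices):

```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

def execute_notebook(path):
    """Run every cell of a notebook top to bottom, raising on the
    first cell that errors, then write the executed notebook back."""
    nb = nbformat.read(path, as_version=4)
    ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
    ep.preprocess(nb, {"metadata": {"path": "."}})
    nbformat.write(nb, path)
```

The same check can be run from the command line with `jupyter nbconvert --to notebook --execute`, which is what a Travis step could call.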


Update from @ericmjl:

  • Place custom source code in src/bayes_tutorial/<something_appropriate>.py. Then import it back into the notebook.
  • Use the matplotlibrc file that @hugobowne provided to style all of the plots.
  • We should probably put this in a "style guide" documentation.

Windows: conda installation problem

I tried your conda instruction (conda env create -f environment.yml -v) and ran into this error:

pyc file failed to compile successfully (run_command failed)
python_exe_full_path: C:\...\envs\bayesian-modelling-tutorial\python.exe
py_full_path: C:\...\envs\bayesian-modelling-tutorial\Lib\site-packages\missingno\utils.py
pyc_full_path: C:\...\envs\bayesian-modelling-tutorial\Lib\site-packages\missingno\__pycache__\utils.cpython-37.pyc
compile rc: 1
compile stdout: "Did not find VS in registry or in VS140COMNTOOLS env var - exiting"

compile stderr: ERROR: The system was unable to find the specified registry key or value.

The above message was repeated several times for some other packages, followed by this at the end:


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\...\lib\site-packages\conda\core\link.py", line 558, in _execute
    cls._execute_post_link_actions(pkg_idx_tracked, axngroup)
  File "C:\...\lib\site-packages\conda\core\link.py", line 664, in _execute_post_link_actions
    reverse_excs,
conda.CondaMultiError: post-link script failed for package defaults::qt-5.9.7-vc14h73c81de_0
running your command again with `-v` will provide additional information
location of failed script: C:\...\envs\bayesian-modelling-tutorial\Scripts\.qt-post-link.bat

I am using conda 4.6.14 on Windows 10, with Git Bash shell.

P.S. I also tried a combo of creating a basic conda env and then pip-installing some stuff into it (several variants of that combo), but even when I managed to start the notebook, the import step crashed the notebook kernel with a DLL error.

Docker container

It's come up in other tutorials, but maybe having a docker container could be good for getting users up-and-running.

Windows: cannot run notebook

I ran into a problem with a traceback similar to conda/conda#8171. That issue is marked as resolved, and I am using the latest release of conda (4.6.14), but it still happened.

Tried some other things I found on the Internet like uninstalling and re-installing pyzmq and jupyter with either conda or pip. Didn't help.

In my case, the activate command gave some warnings about not being able to find vc/vcvarsall.bat, which might or might not be related, as I could still use the environment.

At this point, I have wasted enough time on this, so my workaround is to re-create the tutorial environment without jupyter, jupyterlab, and nodejs, but with ipython added to the YML, then run the examples inside a good ol' IPython session.

Update title/abstract; draft email to SciPy 2019 attendees

Title: Bayesian Data Science: Probabilistic Programming

Abstract: This tutorial will introduce you to the wonderful world of Bayesian data science through the lens of probabilistic programming. In the first hour of the tutorial, we will reintroduce the key concept of probability distributions via hacker statistics, hands-on simulation, and telling the stories of data-generating processes. We will also cover the basics of joint and conditional probability, Bayes' rule, and Bayesian inference. In the latter two-thirds of the tutorial, we will use a series of models to build your familiarity with PyMC3, showcasing how to perform the foundational inference tasks of group comparison and arbitrary curve regression. By the end of this tutorial, you will be equipped with a solid grounding in Bayesian inference, be able to write arbitrary models, and have experienced a basic model-checking workflow.

Notes to registered attendees:

Dear all,

We (Hugo and myself) have reviewed the tutorial schedule, and after discussing with the SciPy tutorial committee and the instructors before and after us in the schedule, we have proposed to deliver a tutorial that focuses on probabilistic programming instead of simulation. The tutorial committee has given us the green light for this change, and we thought it would be only right to inform you of the change and guide you on how to best prepare for the updated tutorial.

The motivation is as follows: Our simulation-based tutorial covers the basic concepts of Bayesian inference through the lens of simulation, which overlaps with Allen Downey's content. In addition, the tutorial after us covers Bayesian model checking using ArviZ, for which it would be very helpful to have had a longer session on probabilistic programming first. Taken together, we envision this as a "Bayes Track"-ish series at SciPy, where our tutorials can exist independently (because we do have enough recap on our own), but complement each other very well nonetheless; the sections that overlap hopefully give sufficient repetition for learning without being overly repetitive.

As such, this tutorial requires basic knowledge of Bayesian inference and probability. You can gain this knowledge for SciPy 2019 in one of two ways. You may either

Attendance in Allen's tutorial is not a mandatory prerequisite, but it is highly encouraged; that said, our recap will cover enough material to ground our tutorial on its own. Additionally, if you would like a stronger grounding in probabilistic programming before attending the ArviZ tutorial led by Ravin Kumar and Colin Carroll, our tutorial would be a helpful starting point, though they will also cover the basics of probabilistic programming in theirs. With all that said, we think attending all three would be extremely beneficial for your knowledge development in Bayesian inference, and there may well be a gift from the instructors for anyone who braves all three this year!

If you have questions regarding the tutorial, please feel free to reach out to us. Our preferred method of communication is on the GitHub issue tracker, where questions we have addressed before can be publicly viewable by others and searchable.

Cheers,
Eric & Hugo

Modified finches beak dataset

@hugobowne, if you look at this commit, you will find that in my branch, I have added one row of data to the finches 2012 dataset.

This simulates the discovery of a new species of finch, about which the only thing we know is that it is genetically related to the others. There is a teaching moment baked in: without hierarchical modelling, which lets us assume the new species is related to the known finches, a single observation would give us unreasonable posterior estimates for its beak length.

More details will be found in the notebooks that I'm finishing up right now. Working through this particular example just reinforced for me the elegance of using hierarchical modelling where relatedness is a reasonable assumption.

Anyways, that was just a longwinded way of informing you of the minor addition made to the dataset, and I just didn't want to give you unnecessary surprises there!

Pedagogical question: open-endedness?

cc: @hugobowne @justinbois

I'm wrestling with one question right now: I really, really want to include one notebook which is very, very unstructured and open-ended. (This would probably be the last nb.) This simulates the scenario that most of us will be in when faced with a novel modelling question.

Yet, given that the class will probably only have had 2 hours with PyMC3, I'm also worried that this would be a bit "too much" for them.

Thus, I have the following hypothesis:

  • Assume every notebook has an "instructor" version and a "student" version.
  • Then, an unstructured notebook can be feasible IFF I make it clear to the students that "copying code (from the instructor version) and then mulling over it is a perfectly reasonable way to learn".
  • Those who are competent won't have to refer to the instructor version.
  • Those who are feeling less confident have the instructor version to fall back on, and can still learn.

Do you think this is a reasonable hypothesis? Would it be worthwhile to try this out during the tutorial?

"Estimate"

What is your "estimate"?

  • There may have been a bit of confusion caused by showing off the data-generating process and then asking that question.
  • We would have gotten "estimate" directly if the data-generating process had been hidden.

Update README

Leave links to snapshot versions of old tutorials in the README.

SciPy 2019 Prep

Following up on the email thread.

  • 1 hr on re-capping probability distribution stories & ECDFs.
  • Then 3 hours starting from the coin flip to birds to arbitrary regression. That should be enough content. I will have to rewrite the PPL notebooks.

[TODO] Stylistic issues

Collating a running list of stylistic issues that I see as they crop up.

For each notebook that I've created:

  • Ensure data loading occurs through load_xyz().
  • "Exercise" should not be a header, but bolded font.
  • Traceplots should be despined. (Convenience function: utils.despine_traceplot(traces))
  • All other plots should be despined. (Convenience function: utils.despine(ax))
  • "Discuss" should not be a header, but bolded font.
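The despine helpers mentioned above might look something like this (a sketch; the real utils functions may differ):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt

def despine(ax):
    """Hide the top and right spines of a matplotlib Axes
    (a sketch of what a utils.despine(ax) helper might do)."""
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    return ax

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
despine(ax)
```

A traceplot variant would just loop the same call over the array of Axes that pm.traceplot returns.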

Things to harmonize between Hugo's and my notebooks:

  • Notebook numbering. I think we should stick to 01-..., 02-..., 03-... etc.
  • Section numbering. I'm hesitant to number things because of maintenance issues, but I see its utility for navigation within a notebook.

Will append to this comment as more things crop up.

Setting up environment using binder session

Hi,
I am trying to set up the environment using a Binder session for our class, and I get the error below.

"Failed to connect to event stream."

Has anyone else had this issue? Please advise.

Thank you.
PS.

Style question

Out of curiosity, what tool did you use to make the red distribution vector art? :)
As seen in problem 1, third notebook.

ECDF function in util.py not returning right values

Eric:

Very much enjoy your tutorials on Bayesian inference and PyMC.

There was something I noticed in the repo files that confused me a little. The ecdf function you defined in the notebook (01a-instructor-probability-simulation.ipynb) differs from the ECDF function in the utils.py script. The one in the notebook is below, and gives me the results I expect:

FROM NOTEBOOK

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points
    n = len(data)

    # x-data for the ECDF
    x = np.sort(data)

    # y-data for the ECDF
    y = np.arange(1, n+1) / n

    return x, y

However, the ECDF function in the utils.py script (below) does not give me the results I expect:

FROM utils.py SCRIPT

def ECDF(data):
    x = np.sort(data)
    y = np.cumsum(x) / np.sum(x)
    
    return x, y

Is there an error in the utils.py script?

Thanks,
Tom
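A tiny example makes the discrepancy concrete: the notebook version gives the evenly spaced ECDF steps, while the utils.py version returns a normalized cumulative sum of the sorted values, which is a different quantity altogether. (The sample data here is made up.)

```python
import numpy as np

data = np.array([3.0, 1.0, 2.0])
x = np.sort(data)

# Notebook ecdf: fraction of points <= x, i.e. 1/n, 2/n, ..., 1.
y_ecdf = np.arange(1, len(data) + 1) / len(data)

# utils.py ECDF: cumulative share of the total *value*, not a CDF.
y_utils = np.cumsum(x) / np.sum(x)  # [1/6, 1/2, 1] for this sample
```

The two only happen to agree at the final point, where both reach 1.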

Learning Outcomes Rework

Guidelines

  • Feel free to modify this post to update the learning outcomes. This post is the "master document".
  • Detailed discussion in the thread below; be sure to quote.

Format

This tutorial is split into 2 four-hour segments. The first segment deals with the basics of probability. The second deals with probabilistic programming and model formulation.

This is a very hands-on tutorial, including ample time for exploration and discovery.

Learning Outcomes

At the end of Part 1 of this tutorial, participants will be able to:

  • Describe probability distributions by their "story".
  • Identify cases where data can be modelled by a probability distribution.
  • Describe a generative process for data, using probability distribution stories.
  • TBC

At the end of Part 2 of this tutorial, participants will be able to:

  • Use probability distribution diagrams to draw out generative model diagrams for:
    • Parameter estimation models.
    • Group comparison models.
    • Hierarchical models.
    • Curve fitting models.
  • Use PyMC3 syntax to implement the above generative models.
  • Diagnose model appropriateness and fit using visual diagnostics.
  • TBC.

ipywidgets in Jupyter Lab

For people using Jupyter Lab, we'll need to instruct them on how to get ipywidgets working in lab notebooks. See here:

https://ipywidgets.readthedocs.io/en/latest/user_install.html#installing-the-jupyterlab-extension

I use them here: https://github.com/ericmjl/bayesian-stats-modelling-tutorial/blob/master/notebooks/02-Instructor-Parameter_estimation_hypothesis_testing.ipynb

It boils down to running the following in a terminal (you'll need Node installed):

jupyter labextension install @jupyter-widgets/jupyterlab-manager

The command is taken from here.

2nd note

Also, I thought we had a section in the README on opening a NB or something like point 4 in this README:

https://github.com/datacamp/datacamp_facebook_live_nlp

to do:

  • add note to README about ipywidgets
  • add something about opening NBs/lab

Distributions as 1st class citizens?

Could there be a restructuring of NB2's pre-PPL content, such that the "distributions as 1st-class citizens" idea is emphasized? I have a hunch that this might help with learning.

I suspect that maybe using scipy.stats could help? It helps reinforce the idea that:

  1. PDFs are nothing more than just math functions.
  2. Calculation of likelihood is all "just" functions.
  3. Math can be abstracted.
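For instance (a sketch of the scipy.stats approach; the numbers are arbitrary):

```python
import numpy as np
from scipy import stats

# A distribution is an object whose PDF/CDF/log-PDF are plain functions.
d = stats.norm(loc=0, scale=1)

density = d.pdf(0.0)  # evaluating the PDF: just a function call

# The log-likelihood of a dataset is just a sum of log-PDF evaluations.
data = np.array([-0.5, 0.1, 0.7])
log_likelihood = d.logpdf(data).sum()
```

Nothing here is special to Bayesian inference yet, which is exactly the point: the math is abstracted into reusable function objects.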

Environment creation error

Following the local install instructions in the README, I end up with these errors:

(base) C:\folder_path\bayesian-stats-modelling-tutorial>conda env create -f environment.yml
Collecting package metadata (repodata.json): done
Solving environment: done
Preparing transaction: done
Verifying transaction: done
Executing transaction: |
Exception while loading config file C:\me\.jupyter\jupyter_notebook_config.py
    Traceback (most recent call last):
      File "C:\me\Anaconda3\envs\bayesian-modelling-tutorial\lib\site-packages\traitlets\config\application.py", line 563, in _load_config_files
        config = loader.load_config()
      File "C:\me\Anaconda3\envs\bayesian-modelling-tutorial\lib\site-packages\traitlets\config\loader.py", line 457, in load_config
        self._read_file_as_dict()
      File "C:\me\Anaconda3\envs\bayesian-modelling-tutorial\lib\site-packages\traitlets\config\loader.py", line 489, in _read_file_as_dict
        py3compat.execfile(conf_filename, namespace)
      File "C:\me\Anaconda3\envs\bayesian-modelling-tutorial\lib\site-packages\ipython_genutils\py3compat.py", line 198, in execfile
        exec(compiler(f.read(), fname, 'exec'), glob, loc)
      File "C:\me\.jupyter\jupyter_notebook_config.py", line 1
        c.JupyterLabTemplates.template_dirs = ['C:\me\Anaconda3\envs\dsml\share\jupyter\notebook_templates']
                                              ^
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: ok

done
#
# To activate this environment, use
#
#     $ conda activate bayesian-modelling-tutorial
#
# To deactivate an active environment, use
#
#     $ conda deactivate

What did I miss?

JB's suggestions

Please view the following as constructive criticism!

1. Probability_a_simulated_introduction.ipynb

  • I worry about your presentation of probability here. You are talking about frequency of events, not plausibility of a logical conjecture. Bayes Theorem is still valid, but why not present probability from a Bayesian perspective from the outset?
  • I especially worry about this with the finch data set. You are sampling out of the empirical distribution function here, which could be really confusing for students when they start to think about likelihoods.
  • I think maybe a better approach is to demonstrate sampling out of a distribution as a way to study it and understand it. For example, say we did not know that the number of coin flips resulting in heads out of n trials is Binomially distributed. We could get a handle on this distribution by simulating the coin flips, as you have done by drawing out of a Uniform distribution. You can get the same results as if you just plotted the PMF or CDF of a Binomial distribution (or its moments, whatever). This leads naturally into figuring out how we can understand a posterior by sampling out of it. And how can we sample out of a posterior? MCMC
  • If you instead spent a little time on probability distributions and their stories, the students can get right to modeling and sampling (see below).
  • I think this is about all you have to do and you can flow right into your next section.
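The simulation-first idea in the bullets above can be sketched like this (my own minimal example; n, p, and the trial count are arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(42)
n, p, trials = 10, 0.5, 100_000

# Simulate "heads out of n flips" by thresholding Uniform(0, 1) draws,
# without ever calling a binomial sampler.
n_heads = (rng.uniform(size=(trials, n)) < p).sum(axis=1)

# The empirical frequencies recover the Binomial(n, p) PMF.
empirical = np.bincount(n_heads, minlength=n + 1) / trials
pmf = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])
```

The same "sample to understand the distribution" move is what MCMC does for a posterior we cannot write down in closed form.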

2. Parameter_estimation_hypothesis_testing.ipynb

  • I don't know if I'd use a Jeffreys prior. It's especially challenging to do a Jeffreys prior for the coin-flipping case. The Haldane or Uniform priors sort of make more intuitive sense, at least at first glance. Jeffreys priors come from Fisher information ideas, which can be tough to grasp. Rather, you might want to introduce ideas of informative or weakly informative priors.
  • I actually wouldn't bother with mathematical expressions for the posteriors. Rather, just start busting out sampling. For example for the finch beaks, you can just specify the likelihood and prior and start sampling the posterior with PyMC3. For example, you can do the following, with all pertinent units in millimeters.
    beak depth | µ, σ ~ Norm(µ, σ)
    µ ~ Norm(10, 5)
    σ ~ LogNorm(0, 1)
    So, with likelihood and prior specified, just start sampling.
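To make "the posterior is just likelihood times prior" concrete for that exact specification, one can evaluate the unnormalized log posterior directly (a sketch; the beak-depth data here is invented):

```python
import numpy as np

data = np.array([9.2, 10.1, 9.8, 10.5])  # invented beak depths in mm

def log_normal_pdf(x, mu, sd):
    """Log-density of Normal(mu, sd) evaluated at x."""
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

def log_posterior(mu, sigma):
    """Unnormalized log posterior: log prior + log likelihood."""
    log_prior = (
        log_normal_pdf(mu, 10, 5)                              # mu ~ Norm(10, 5)
        + log_normal_pdf(np.log(sigma), 0, 1) - np.log(sigma)  # sigma ~ LogNorm(0, 1)
    )
    log_like = log_normal_pdf(data, mu, sigma).sum()
    return log_prior + log_like
```

PyMC3 builds essentially this quantity from the model block for you and then runs MCMC on it, which is why "just specify the likelihood and prior and start sampling" works.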

Summary

I think there are three main concepts you can get to in 4 hours. What a probability distribution is, the connection between the posterior distribution (what you want) and the likelihood and prior, and that you can sample out of the posterior (as specified by the likelihood and prior) using MCMC. The logical connection between the prior and posterior as the data updates belief is an important point. I think the hacker stats style sampling and the mathematical expressions/plots might get in the way of a more streamlined message.

02-Instructor-Parameter_estimation_hypothesis_testing notebook fails to run on ONE of two computers

I have set up according to the instructions given in the GitHub repo on three computers. On two of them, the tutorials work; on the third, there is a fatal error importing pymc3. All three computers are up-to-date, fully patched Windows 10 boxes (I can try to provide more info if needed!). The problem is in the 02-Instructor-Parameter_estimation_hypothesis_testing notebook and happens when running the import block at the top.

On two of the computers, running this first block gives the warning message:

WARNING (theano.configdefaults): g++ not available, if using conda:
`conda install m2w64-toolchain`
C:\......\anaconda3\envs\bayesian-modelling-tutorial\lib\site-packages\theano\configdefaults.py:560:
UserWarning: DeprecationWarning: there is no c++ compiler.This is
deprecated and with Theano 0.11 a c++ compiler will be mandatory
  warnings.warn("DeprecationWarning: there is no c++ compiler."
WARNING (theano.configdefaults): g++ not detected ! Theano will be
unable to execute optimized C-implementations (for both CPU and GPU)
and will default to Python implementations. Performance will be
severely degraded. To remove this warning, set Theano flags cxx to an
empty string.
WARNING (theano.tensor.blas): Using NumPy C-API based implementation
for BLAS functions.

But things still work! All of the code blocks in the notebook execute, if slowly (for NUTS).

On the third computer, which was set up using the exact same steps, this is the error message that results (and the error is FATAL for using pymc3):

You can find the C code in this temporary file: C:\......\AppData\Local\Temp\theano_compilation_error__knt3a6r
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\theano\gof\lazylinker_c.py in <module>
     80                     version,
---> 81                     actual_version, force_compile, _need_reload))
     82 except ImportError:

ImportError: Version check of the existing lazylinker compiled file. Looking for version 0.211, but found None. Extra debug information: force_compile=False, _need_reload=True

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\theano\gof\lazylinker_c.py in <module>
    104                         version,
--> 105                         actual_version, force_compile, _need_reload))
    106         except ImportError:

ImportError: Version check of the existing lazylinker compiled file. Looking for version 0.211, but found None. Extra debug information: force_compile=False, _need_reload=True

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-1-3441c4f46c01> in <module>
      4 import seaborn as sns
      5 import matplotlib.pyplot as plt
----> 6 import pymc3 as pm
      7 from ipywidgets import interact
      8 import arviz as az

D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\pymc3\__init__.py in <module>
      3 
      4 from .blocking import *
----> 5 from .distributions import *
      6 from .glm import *
      7 from . import gp

D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\pymc3\distributions\__init__.py in <module>
----> 1 from . import timeseries
      2 from . import transforms
      3 
      4 from .continuous import Uniform
      5 from .continuous import Flat

D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\pymc3\distributions\timeseries.py in <module>
----> 1 import theano.tensor as tt
      2 from theano import scan
      3 
      4 from pymc3.util import get_variable_name
      5 from .continuous import get_tau_sigma, Normal, Flat

D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\theano\__init__.py in <module>
    108     object2, utils)
    109 
--> 110 from theano.compile import (
    111     SymbolicInput, In,
    112     SymbolicOutput, Out,

D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\theano\compile\__init__.py in <module>
     10 from theano.compile.function_module import *
     11 
---> 12 from theano.compile.mode import *
     13 
     14 from theano.compile.io import *

D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\theano\compile\mode.py in <module>
      9 import theano
     10 from theano import gof
---> 11 import theano.gof.vm
     12 from theano import config
     13 from six import string_types

D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\theano\gof\vm.py in <module>
    672     if not theano.config.cxx:
    673         raise theano.gof.cmodule.MissingGXX('lazylinker will not be imported if theano.config.cxx is not set.')
--> 674     from . import lazylinker_c
    675 
    676     class CVM(lazylinker_c.CLazyLinker, VM):

D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\theano\gof\lazylinker_c.py in <module>
    138             args = cmodule.GCC_compiler.compile_args()
    139             cmodule.GCC_compiler.compile_str(dirname, code, location=loc,
--> 140                                              preargs=args)
    141             # Save version into the __init__.py file.
    142             init_py = os.path.join(loc, '__init__.py')

D:\......\miniconda3\envs\bayesian-modelling-tutorial\lib\site-packages\theano\gof\cmodule.py in compile_str(module_name, src_code, location, include_dirs, lib_dirs, libs, preargs, py_module, hide_symbols)
   2409             # difficult to read.
   2410             raise Exception('Compilation failed (return status=%s): %s' %
-> 2411                             (status, compile_stderr.replace('\n', '. ')))
   2412         elif config.cmodule.compilation_warning and compile_stderr:
   2413             # Print errors just below the command line.

...Windows-10-10.0.18362-SP0-Intel64_Family_6_Model_94_Stepping_3_GenuineIntel-3.7.6-64/lazylinker_ext/mod.cpp:976: undefined reference to `__imp__Py_TrueStruct'
(more undefined references to `__imp__Py_NoneStruct' follow)
collect2.exe: error: ld returned 1 exit status

At this point, code not using pymc3 works but all pymc3 blocks crash.

Additionally, all three of the computers have trouble starting the environment for this tutorial (at Scipy) and print the following at the shell:

(base) D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>conda activate bayesian-modelling-tutorial
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>SET DISTUTILS_USE_SDK=1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>SET MSSdk=1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>SET "VS_VERSION=15.0"
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>SET "VS_MAJOR=15"
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>SET "VS_YEAR=2017"
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>set "MSYS2_ARG_CONV_EXCL=/AI;/AL;/OUT;/out"
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>set "MSYS2_ENV_CONV_EXCL=CL"
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>set "PY_VCRUNTIME_REDIST=\bin\vcruntime140.dll"
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>set "CXX=cl.exe"
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>set "CC=cl.exe"
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>set "VSINSTALLDIR="
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>for /F "usebackq tokens=*" %i in (`vswhere.exe -nologo -products * -version [15.0,16.0) -property installationPath`) do (set "VSINSTALLDIR=%i\" )
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if not exist "" (for /F "usebackq tokens=*" %i in (`vswhere.exe -nologo -products * -requires Microsoft.VisualStudio.Component.VC.v141.x86.x64 -property installationPath`) do (set "VSINSTALLDIR=%i\" ) )
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if not exist "" (set "VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\" )
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if not exist "C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\" (set "VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\" )
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if not exist "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\" (set "VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\" )
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if not exist "C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\" (set "VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\" )
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>IF NOT "" == "" (
set "INCLUDE=;"
 set "LIB=;"
 set "CMAKE_PREFIX_PATH=;"
)

D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>call :GetWin10SdkDir
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>call :GetWin10SdkDirHelper HKLM\SOFTWARE\Wow6432Node  1>nul 2>&1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if errorlevel 1 call :GetWin10SdkDirHelper HKCU\SOFTWARE\Wow6432Node  1>nul 2>&1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if errorlevel 1 call :GetWin10SdkDirHelper HKLM\SOFTWARE  1>nul 2>&1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if errorlevel 1 call :GetWin10SdkDirHelper HKCU\SOFTWARE  1>nul 2>&1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if errorlevel 1 exit /B 1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>exit /B 0
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>for /F %i in ('dir /ON /B "\include\10.*"') DO (SET WindowsSDKVer=%~i )
The system cannot find the file specified.

D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if errorlevel 1 (echo "Didn't find any windows 10 SDK. I'm not sure if things will work, but let's try..." )  else (echo Windows SDK version found as: "" )
Windows SDK version found as: ""

D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>IF "win-64" == "win-64" (
set "CMAKE_GEN=Visual Studio 15 2017 Win64"
 set "BITS=64"
)  else (
set "CMAKE_GEN=Visual Studio 15 2017"
 set "BITS=32"
)

D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>pushd C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\
The system cannot find the path specified.

D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>CALL "VC\Auxiliary\Build\vcvars64.bat" -vcvars_ver=14.16
The system cannot find the path specified.

D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>popd
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>IF "" == "" SET "CMAKE_GENERATOR=Visual Studio 15 2017 Win64"
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>call :GetWin10SdkDirHelper HKLM\SOFTWARE\Wow6432Node  1>nul 2>&1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if errorlevel 1 call :GetWin10SdkDirHelper HKCU\SOFTWARE\Wow6432Node  1>nul 2>&1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if errorlevel 1 call :GetWin10SdkDirHelper HKLM\SOFTWARE  1>nul 2>&1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if errorlevel 1 call :GetWin10SdkDirHelper HKCU\SOFTWARE  1>nul 2>&1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>if errorlevel 1 exit /B 1
D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>exit /B 0
(bayesian-modelling-tutorial) D:\......\LOCAL_GITHUB_CLONES\bayesian-stats-modelling-tutorial>

But for emphasis: this shell dump occurs on all 3 computers, and two of those computers still work, showing only the warning at the start! Based on this output, it seems there may be a mandatory dependency on MS Visual Studio being installed, and in some particular location (on one of these machines, VS is present but not found!). On the working machines pymc3 appears to fall back to the pure-Python code path; for some reason it will not do this on the third, failing computer.

I hope that these hidden dependencies can be fixed or spelled out; we were unable to continue with the tutorial, as this killed us completely. We can try to watch the replay later if we can get things working.

I am happy to provide more information if I can!

Extract posterior distribution information

How would we extract the posterior distribution information from a theano.tensor.var.TensorVariable into numpy arrays?
The current method makes it hard to use this information as part of a bigger codebase where you might want the full posterior distribution (or just its mean and HPD bounds).
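For what it's worth, one common pattern is to pull the draws out as a plain numpy array and summarize from there. A minimal sketch below uses a fake array standing in for a real trace; in PyMC3, `trace["p"]` (or `trace.get_values("p")`) returns the posterior draws for a variable named `p` as a numpy array once sampling has finished:

```python
import numpy as np

# Stand-in for posterior draws; with a real PyMC3 trace you would write
# samples = trace["p"], which already gives you a plain numpy array.
rng = np.random.default_rng(42)
samples = rng.beta(8, 4, size=4000)

posterior_mean = samples.mean()
# A simple 94% equal-tailed credible interval from the raw draws:
lower, upper = np.percentile(samples, [3, 97])
```

From here `posterior_mean`, `lower`, and `upper` are ordinary floats you can pass to the rest of your code.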

notebook two throwing stdio.h file not found error on macOS Mojave

Moving this conversation over from this Twitter thread to an issue so others can track.

I'm working through all your nbs to prep for #pycon2019 and ran into an error in the 2nd notebook coming from the pymc3 import. Wanted to let y'all know as I likely won't be the only one to hit this. I've posted a screenshot of the error below; it appears to be coming from pymc3, specifically the C compiler, from what I can tell from my google searches.

[Screenshot: "Screen Shot 2019-04-20 at 11 40 03 AM", showing the compiler error from the pymc3 import]

P.S. as requested by @ericmjl -- I did check and this error does NOT appear when using the binder notebooks. So that's a plus!

SciPy 2018 Post-Mortem

@hugobowne I'm writing here some post-SciPy 2018 thoughts, gathered from my subjective view + participants' provided feedback in the feedback form.

Overall Thoughts

General reception to the tutorial was very positive

This was not just from the feedback provided in person, but also from the feedback form. Numerical ratings are available in this report on TypeForm.

This has potential to be a 2-part tutorial

The amount of material covered in the first two hours might be better expounded on if we were to propose this as a two-part tutorial, each part with a different audience focus, with you leading the theory and hacker-statistics section, and me leading the probabilistic programming session.

I came to this tentative conclusion only after reflecting on the tutorial's progress today. Here's my observations and thoughts so far:

Probability basics: Definition, conditional, joint

  1. As a refresher for probability, this section might have been heavy on material. (A refresher would target audience group 1: already knows stats, but just needs a refresher.)
  2. As an in-depth introduction to probability, this section might have been light on material. (An in-depth introduction would target audience group 2: probability novices.)
  3. I'm having trouble imagining a group of people in the middle of these two general modes.

For those in the class that needed the in-depth introduction to probability, they really enjoyed your portion of the class. I think if we expanded your section to a full standalone tutorial, it might help learners get into the "distributional thinking" that is needed to work with a probabilistic programming language.

Now that I've had some space away from the material, I think that to introduce joint and conditional probability, starting with a two-layer, binary-tree-based probability example (e.g. the cookie jar problem) might be more useful for audience group 2 (probability novices), while Darwin's finches will be more useful for audience group 1 (probability intermediates/experienced). The cookie jar problem has a very intuitive "path-tracing" intuition for joint and conditional probability, and might help bridge "algebra to pictures" for a beginner (of whom we had quite a number in the SciPy crowd). On the other hand, it might not be complicated enough to be a meaningful refresher for audience group 1. Conditional and joint distributions with the Darwin's finches dataset might be a better fit there.
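The path-tracing intuition is easy to make concrete. A minimal sketch, using the numbers from the classic formulation of the cookie problem (30 vanilla / 10 chocolate cookies in jar 1, 20 / 20 in jar 2; these specifics are illustrative, not from the tutorial):

```python
from fractions import Fraction

# Pick a jar uniformly at random, then draw one cookie from it.
p_jar = {1: Fraction(1, 2), 2: Fraction(1, 2)}
p_vanilla_given_jar = {1: Fraction(30, 40), 2: Fraction(20, 40)}

# "Path tracing": multiply probabilities along each branch of the tree.
# Joint: P(jar, vanilla) = P(jar) * P(vanilla | jar)
joint = {jar: p_jar[jar] * p_vanilla_given_jar[jar] for jar in p_jar}
p_vanilla = sum(joint.values())              # marginal probability of vanilla
p_jar1_given_vanilla = joint[1] / p_vanilla  # Bayes' rule

print(p_jar1_given_vanilla)  # 3/5
```

The same three lines (joint, marginal, conditional) are exactly the algebra a beginner can trace with a finger on the tree diagram.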

Probabilistic Programming

(This is admittedly longer because I was more involved in it.)

I think the big lesson I learned from leading this section today was that a lecture + discussion format can still yield a meaningful tutorial experience for participants, but it isn't easily made inclusive of the whole class unless group discussion time is intentionally scheduled in. Yet group discussion takes time, and can eat into the opportunity to get in-class practice with the probabilistic programming exercises.

I think for the next iteration, I would redo the learning objectives of this section. Thinking more specifically about assumed knowledge, I would most ideally like to build on top of what you taught in Part 1, listed as follows:

  • Able to simulate coin flips trials using numpy.
  • Able to use resampling methods to estimate a confidence interval for a parameter in a model.
  • Proficient in Python syntax, including slice accessors into array-like objects, accessing values by key in a dictionary.
  • Able to use a discrete example to describe what joint and conditional probability are.
  • Able to define the mathematical relationship between conditional and joint probability, and derive Bayes' rule from their definition.
  • Has performed the t-test.
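The first two bullets can be sketched in a few lines of NumPy (the coin bias and sample sizes below are made up for illustration, not taken from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate 500 flips of a biased coin (true p = 0.6).
flips = rng.random(500) < 0.6
p_hat = flips.mean()

# Bootstrap resampling: re-draw the data with replacement many times
# and look at the spread of the recomputed estimate.
boot = np.array([
    rng.choice(flips, size=flips.size, replace=True).mean()
    for _ in range(2000)
])
lower, upper = np.percentile(boot, [2.5, 97.5])  # ~95% interval for p
```

That bootstrap interval is the "hacker statistics" counterpart of the posterior intervals that appear in Part 2.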

Then, I would focus on the core, minimal set of Bayesian workflow steps, as described in an updated set of learning objectives. By the end of a workshop, a participant should be able to:

  • Describe, justify, and use PyMC3 syntax to implement a data generative model for a data analysis problem, parameterized using probability distributions.
  • Summarize posterior distributions as insights for a particular problem on hand.
  • Critique the model using posterior predictive checks.
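As a minimal sketch of these objectives without a sampler, the conjugate Beta-Bernoulli pair lets you write down the generative story, summarize the posterior, and run a crude posterior predictive check in plain NumPy (the counts and prior below are invented for illustration; in PyMC3 the same story would use `pm.Beta` and `pm.Bernoulli` inside a `with pm.Model():` block):

```python
import numpy as np

# Data: 13 heads in 20 flips; prior: Beta(1, 1), i.e. uniform.
heads, n = 13, 20
alpha_post, beta_post = 1 + heads, 1 + (n - heads)

# Draw from the posterior and summarize it, as you would with a trace.
rng = np.random.default_rng(7)
posterior = rng.beta(alpha_post, beta_post, size=10_000)
print("posterior mean:", posterior.mean())           # ~ 14/22
print("94% interval:", np.percentile(posterior, [3, 97]))

# Posterior predictive check: simulate replicate datasets and compare
# the heads counts they produce against the observed 13.
ppc_heads = rng.binomial(n, posterior)
```

If the observed count sits in the bulk of `ppc_heads`, the model is at least not obviously wrong, which is the spirit of the third objective.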

Minutiae

Socratic Discussion

The model building process is iterative, and should involve more of the kind of group discussion that occurred.

To be more inclusive, I would probably schedule more "talk with your neighbours" followed by "share your discussion with the big group". This would allow the shier participants to engage in the class.

Visuals

I have feedback that the model diagrams (red + white distributions) were helpful for thinking through the model-building process. I also had feedback that they should be introduced somewhat earlier in the class, though I'm not sure of the utility of this.

Environment setup & checks

  • Binder is always a good idea.
  • Need more robust environment setup script.
  • Enforce more automated environment checks with checkenv.py.

May need to list package "nodejs" explicitly in conda environment.yml

The latest version of conda on macOS did not install the nodejs package as an implicit dependency of any of the packages in environment.yml. If you install it "manually" with conda install nodejs after the env is created, and then run the nbextensions install, all the widgets work.
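A sketch of the workaround, assuming the environment name used by this repo; the exact extension-install command depends on your Jupyter setup (classic notebook vs. JupyterLab), so treat the last line as an example rather than the one true incantation:

```shell
conda activate bayesian-modelling-tutorial
conda install -c conda-forge nodejs
# then re-run the widget extension install, e.g. for JupyterLab:
jupyter labextension install @jupyter-widgets/jupyterlab-manager
```

The more robust fix is simply to list nodejs explicitly in environment.yml so the environment works on first creation.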
