aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

Home Page: http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/

License: MIT License

Jupyter Notebook 99.84% Python 0.14% HTML 0.01% CSS 0.01% Shell 0.01%

bayesian-methods pymc mathematical-analysis jupyter-notebook data-science statistics

probabilistic-programming-and-bayesian-methods-for-hackers's Issues

Chapter 3, Clustering example, creating model.

I have a quick question about the initialization of the model class in the Chapter 3 clustering example. Why are observations not included in the list of parameters in mc.Model?

#and to combine it with the observations:
observations = mc.Normal( "obs", center_i, tau_i, value = data, observed = True )

#below we create a model class
model = mc.Model( [p, assignment, taus, centers ] )

DATA: dataset from chp3 smoking may be wrong

I'm looking at the data + analysis from the smoking example in chp3, and I think the data is wrong. (btw, I haven't added it yet, but the data source is http://data.princeton.edu/wws509/datasets/#smoking.) It says the pop. variable is in hundreds of thousands, but this seems unrealistic. Consider one data point, which reads that there were 1001 lung cancer deaths among 6052*100,000 = 605,200,000 ppl. This can't be correct as is.
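The arithmetic behind this objection can be checked in a few lines. This is only a sketch: the variable names are made up for illustration, and the unit (hundreds of thousands) is the one quoted in the issue, not something verified against the dataset.

```python
# Values quoted in the issue; the names are illustrative only.
POP_UNIT = 100_000   # "pop." is said to be in hundreds of thousands
pop_value = 6052     # raw value of the pop. column for one row
deaths = 1001        # lung cancer deaths for the same row

implied_population = pop_value * POP_UNIT
deaths_per_100k = deaths / implied_population * 100_000

print(f"implied population: {implied_population:,}")
print(f"deaths per 100,000 people: {deaths_per_100k:.3f}")
```

An implied population in the hundreds of millions for a single age-by-smoking cell is what makes the stated unit look implausible.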

notebook doesn't look right on windows

I installed python x,y and upgraded all required packages to the newest versions and I'm using the ipython notebook server to view the notebooks. This is how it looks for me:

http://i.imgur.com/8mQuCjx.jpg

You notice that the font looks weird, e.g. the dot in the letter i is shifted. Also the indented text box is not light gray as it should be.

Instantiating MCMC in PyMC examples

Having a peek through Chapter 3, I notice that you instantiate your MCMC objects in a two-step process:

model = Model([var1, var2, var3])
sampler = MCMC(model)

There is really no need to explicitly create a Model object at all -- MCMC will do this for you if you pass it the variables:

sampler = MCMC([var1, var2, var3])

In fact, typically I just pass a call to vars (or locals) which MCMC will sift through to look for PyMC objects in the local namespace:

sampler = MCMC(vars())

This is particularly useful if you encapsulate your model in a function that returns vars or locals.

In any case, it's just one fewer step in the PyMC modeling process to have to think about.
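The pattern described above, wrapping a model in a function that returns locals() and letting the sampler pick out the model objects, can be sketched without PyMC installed. In the sketch below, `Variable`, `build_model` and `collect_variables` are hypothetical stand-ins for PyMC's node classes and for whatever filtering MCMC performs internally. They are assumptions for illustration, not PyMC's API.

```python
class Variable:
    """Hypothetical stand-in for a PyMC node; not part of PyMC."""

    def __init__(self, name):
        self.name = name


def build_model():
    # Everything defined here ends up in the dictionary returned by locals().
    rate = Variable("rate")
    switchpoint = Variable("switchpoint")
    n_observations = 74  # a plain Python value, not a model variable
    return locals()


def collect_variables(namespace):
    """Keep only the model objects found in a namespace dictionary."""
    return {key: value for key, value in namespace.items()
            if isinstance(value, Variable)}


found = collect_variables(build_model())
print(sorted(found))
```

With PyMC 2, following the issue text, the dictionary would be handed to the sampler directly, along the lines of MCMC(build_model()).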

Chapter 2's cheating example's two alternative implementations show different posteriors

From Chapter 2, on cheating detection: "I'll demonstrate the most explicit way, and later show a simplified version. Both versions arrive at the same inference."

The first model shows posterior with no weight at 0 and very little weight above 0.6. The alternative model has a posterior with considerable weight at 0 and no weight above 0.5. Are the two models supposed to be equivalent or not?
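Whether the two formulations describe the same data-generating process can be checked by simulation. The sketch below assumes the interview scheme in which each student flips a coin, answers honestly on heads, and on tails flips again and answers "yes" only if the second flip is heads. Under that assumption the simplified model is a Binomial with P(yes) = 0.5*p + 0.25. The function names, seed and trial counts are illustrative choices.

```python
import random


def explicit_yes_count(p_cheat, n_students, rng):
    """Simulate every student and both coin flips explicitly."""
    yes = 0
    for _ in range(n_students):
        cheated = rng.random() < p_cheat
        first_heads = rng.random() < 0.5
        second_heads = rng.random() < 0.5
        # Heads: honest answer. Tails: answer dictated by the second flip.
        yes += cheated if first_heads else second_heads
    return yes


def simplified_yes_count(p_cheat, n_students, rng):
    """Draw the same count from a Binomial with P(yes) = 0.5*p + 0.25."""
    p_yes = 0.5 * p_cheat + 0.25
    return sum(rng.random() < p_yes for _ in range(n_students))


rng = random.Random(0)
p_cheat, n_students, n_trials = 0.3, 100, 1000
explicit_mean = sum(explicit_yes_count(p_cheat, n_students, rng)
                    for _ in range(n_trials)) / n_trials
simplified_mean = sum(simplified_yes_count(p_cheat, n_students, rng)
                      for _ in range(n_trials)) / n_trials
print(explicit_mean, simplified_mean)
```

Both averages should sit close to 100 * (0.5 * 0.3 + 0.25) = 40, which is the sense in which the two models are expected to agree.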

Spelling error?

In the 3rd sentence of the 4th paragraph of the "Bayesian State of Mind" section, "Simpley" should be "Simply", right?

Overly complicated text message example

From Ghassen HAMROUNI, direct communication

I have a remark concerning the first chapter:
The example "Inferring behaviour from text-message data" is certainly interesting. But I think it is over-complicated. In fact the reader has just learned about the Poisson process. And he is expecting a simple application where he can infer the value of the constant λ. But instead you introduce an inhomogeneous Poisson process where the λ(t) is a time-dependent function!

Chapter 4: Notable Failures of the Law of Large Numbers

While the Law of Large Numbers is a profoundly important result, I, like many statisticians, have come to doubt its universal applicability.

This Law holds for any distribution, minus some pathological examples that only mathematicians have fun with.

I take issue with this assertion. Two important failures off the top of my head:

1. Fat-tailed distributions - These distributions frequently fail a critical assumption underlying the law of large numbers, namely that

$$\sum_{n=1}^\infty \frac{Var(X_n)}{n^2} < \infty$$

One example that fails this is the Pareto distribution with $\alpha \in (1,2]$ (which has a divergent variance). This happens _all the time_ - here's a typical example from Terry Tao's blog (2009). Likewise, fat-tailed distributions that do satisfy the (weak) law of large numbers only converge to the predicted asymptote rather slowly, rendering the principle ineffective (see Weron et al, International Journal of Modern Physics C (2001)).
2. Flicker Noise (aka 1/f Noise) - Flicker noise is another ubiquitous phenomenon that fails the law of large numbers. Example: A perfect hour-glass has an average clock-drift of 0 seconds. However, clock drift is governed by flicker noise, which fails to satisfy the assumptions behind the (weak) law of large numbers. As a consequence, no matter how many hour-glasses you have, you can't be more accurate than an atomic clock. This example is from Bill Press (of Numerical Recipes fame), who wrote something of a classic on this subject: Flicker Noises in Astronomy and Elsewhere (1978).
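The slow convergence mentioned in the first item can be explored numerically. This is a minimal sketch, assuming i.i.d. draws from a Pareto distribution with minimum value 1 and α = 1.5, so the mean is finite (α/(α-1) = 3) while the variance is not. The seed and sample sizes are arbitrary choices, and `running_means` is a helper written for this illustration.

```python
import random


def running_means(samples):
    """Return the mean of the first n samples for every n."""
    means, total = [], 0.0
    for count, value in enumerate(samples, start=1):
        total += value
        means.append(total / count)
    return means


alpha = 1.5
true_mean = alpha / (alpha - 1)  # finite mean, but the variance diverges
rng = random.Random(42)
draws = [rng.paretovariate(alpha) for _ in range(100_000)]
means = running_means(draws)

for n in (100, 10_000, 100_000):
    print(n, round(means[n - 1], 3), "target", true_mean)
```

Rerunning with a few different seeds gives a feel for how far the running mean can sit from its target even at large n, and how single extreme draws move it.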

Of course, as you point out in your chapter, one can hold the Law of Large Numbers at arm's length with Bayesian analysis.

fix Firefox issue

The custom css styling for the book, in styles/custom.css, does not play well with Firefox. I'd like to fix this asap.

Predictive dist. should be more explicit

See http://stats.stackexchange.com/questions/57510/pymc-beginner-how-to-actually-sample-from-the-fitted-model

Chapter 3 - Bayesian Landscape with more than one sample

In the section on Bayesian Landscape, the text advises the user to observe how the mountains change with varying sample size.

Changing the sample size from 1 to 100, I can clearly see that the mountain narrows, but it does not converge to the true parameter. (It seems to converge to 3,1 instead of 1,3.)

I am pretty sure the only parameter I changed was the Sample size (N). Could this be an error with the plotting code?

Thanks!

protip: Matrix distributions

Since matrix distributions, like Wishart etc., involve on the order of N^2 variables, it is really useful to give them a good starting value. Unfortunately, setting value = M, where M is usually the empirical scatter matrix, in the initialization call for Stochastic often fails. This is (likely) due to numerical imprecision leaving M not truly symmetric. It is useful then to use M = np.round(M, 8), which Fortran will like.
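The effect of rounding on symmetry can be seen on a toy matrix. The sketch below uses plain nested lists as a stand-in for the NumPy call in the tip (np.round(M, 8)); `is_symmetric` and `round_matrix` are hypothetical helpers, and the 2x2 matrix is made up so that floating-point error breaks exact symmetry.

```python
def is_symmetric(matrix):
    """True when every off-diagonal pair matches exactly."""
    size = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(size) for j in range(i + 1, size))


def round_matrix(matrix, decimals=8):
    """List-based equivalent of rounding every entry of the matrix."""
    return [[round(entry, decimals) for entry in row] for row in matrix]


# 0.1 + 0.2 differs from 0.3 in the last bits, so the matrix is
# symmetric on paper but not in floating point.
scatter = [[2.0, 0.1 + 0.2],
           [0.3, 1.0]]

print(is_symmetric(scatter))
print(is_symmetric(round_matrix(scatter)))
```

Averaging the matrix with its transpose is another common way to enforce exact symmetry, though the tip above only mentions rounding.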

Matplotlib styles

Hi,

The plots in the notebooks rely on a custom matplotlibrc for visual style. Users who do not share the same matplotlibrc can't reproduce the graphs in terms of visual style, as that information is not part of the notebook.

To make it worse, the default style is horrible:

fyi.

move random variable generation away from scipy and onto PyMC

PyMC has built-in random variate generation. To be consistent, we should be using this instead of scipy's .rvs methods.

Edits to Poisson example in chapter 1

From Alex Nelson (direct communication)

In chapter 1 when you describe the Poisson distribution, I have some suggestions.

First, your description of lambda could be improved. You should describe it as the "intensity" of the distribution, which can be seen if one derives it from the Binomial distribution Bin(n,p) taking n to infinity, provided we fix n*p = lambda.

Second, why not leave the aforementioned derivation as an exercise? (Or -- what I do in my notes -- is ask it as a question, then insist the reader make the attempt, and solve it in an example.)
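The limit mentioned in the first suggestion can be checked numerically before attempting the derivation. A minimal sketch with lambda = 4 and k = 3 chosen arbitrarily: it compares the Binomial(n, lambda/n) probability of k with the Poisson(lambda) probability as n grows. The function names are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam, k = 4.0, 3
for n in (10, 100, 10_000):
    # Keep n * p fixed at lam while n grows.
    print(n, round(binomial_pmf(k, n, lam / n), 5), round(poisson_pmf(k, lam), 5))
```

The gap between the two columns shrinks as n increases, which is the behaviour the derivation is meant to explain.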

Bayesian Methods for Hackers

Noticed this commit a3d172b and was wondering if you have any plans to rename everything to just Bayesian Methods for Hackers.

The only benefit I see is the name fitting in line with the header.

No pressure. Just curious.

Bandit example stopped working

In Chapter 6 the interactive bandits example does not work anymore.
The buttons are displayed, but the bar charts and pdf plots are not shown.

Just looking over the code the problem could be in these lines:

<div id="paired-bar-chart" style="width: 600px; margin: auto;"> </div>
<div id="beta-graphs" style="width: 600px; margin-left:125px; "> </div>

The same is true for the solo BanditsD3.html file.

Does the expression lack its first term in Ch4, expected values and probabilities?

In chapter 4, in "Expected values and probabilities", I think the second formula lacks its first term.
Maybe it's P(x in A) = ... ?

CH2: external pdf link to paper broken

This link: http://mdwardlab.com/sites/defaultGreenhillWardSacks.pdf ends in the void … 🐐

Explain why we plot M*L in chapter 3

In the first part of chapter 3, it is really mysterious to me why we choose to plot M*L to illustrate how the prior and posterior are related. Could you explain why it is meaningful ?

PyMC 2.0 not supported

Some simple things appear to not work with PyMC 2.0. Should probably list PyMC 2.2 explicitly as the supported version. An example of something you can't do with PyMC 2.0 is:

p = pymc.Poisson('test', 5)
s = p + 1

This will complain about not being able to add a Poisson and an int:

TypeError: unsupported operand type(s) for +: 'Poisson' and 'int'

Chapter 1 explanation issue

Hello,

in chapter one it is stated: "and our confidence is proportional to the width of the curve." - actually, I think the confidence is inversely proportional to the width.

Likelihood computation in chapter 3

In chapter 3, in "The Bayesian landscape" we create observed data this way:

### create the observed data

N = 1 #sample size of data we observe
p_1_true = 1
p_2_true = 3

data = np.concatenate( [ stats.poisson.rvs( p_1_true, size = (N,1)), \
                            stats.poisson.rvs( p_2_true, size = (N,1))],  axis=1 )

As I understand it, stats.poisson.rvs draws a sample from a Poisson distribution.

Then data is used to compute the likelihood. So we compute the likelihood for two numbers that we do not know and that are not our observed data. Why is that so?

Why don't we define data as data = [p_1_true, p_2_true] ?
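The setup in the snippet above can be mimicked without SciPy. This is a minimal sketch, assuming the point of the example is that an analyst only ever sees draws from the two distributions and never the true parameters themselves. `poisson_draw` is a hypothetical helper that uses Knuth's multiplication method as a stand-in for stats.poisson.rvs, and the seed is arbitrary.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)
p_1_true, p_2_true = 1, 3

# With N = 1 the "observed data" is one draw per parameter, which will
# usually differ from the parameters that generated it.
single_observation = [poisson_draw(p_1_true, rng), poisson_draw(p_2_true, rng)]
print(single_observation)

# With many draws the sample mean moves toward the true parameter.
many_draws = [poisson_draw(p_2_true, rng) for _ in range(5000)]
sample_mean = sum(many_draws) / len(many_draws)
print(sample_mean)
```

Setting data = [p_1_true, p_2_true] would amount to pretending the observations happened to land exactly on the parameters, which is one possible sample among many.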

Repo too large

Currently the repo is 25+ MB. That's ridiculous. Most of it is the data files associated with chapters. Either

reduce the size by cutting data,
zip them then extract when needed.
other?

Histogram Densities Mislabeled As Probabilities in Chapter 1

In chapter 1, when computing histogram results for the traces of MCMC on the example there, the y-axis is labeled probabilities.

This is not how histograms really work in numpy, however. From the documentation:

numpy.histogram(a, bins=10, range=None, normed=False, weights=None, density=None)
Compute the histogram of a set of data.
...
density : bool, optional
If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. Overrides the normed keyword if given.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html

It is correct to write probability density, not probability.
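The distinction in the quoted documentation can be reproduced with a hand-rolled histogram. The sketch below is a plain-Python illustration of the normalisation rule (count divided by sample size times bin width), not NumPy's implementation. The data and bin edges are made up, and `histogram_density` is a hypothetical helper.

```python
def histogram_density(samples, edges):
    """Return density values: count / (n_samples * bin_width) per bin."""
    counts = [0] * (len(edges) - 1)
    for value in samples:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open except the last one, which includes its right edge.
            if edges[i] <= value < edges[i + 1] or (is_last and value == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [count / (total * (edges[i + 1] - edges[i]))
            for i, count in enumerate(counts)]


samples = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
edges = [0.0, 0.5, 1.0]
density = histogram_density(samples, edges)
area = sum(d * (edges[i + 1] - edges[i]) for i, d in enumerate(density))

print(density, sum(density), area)
```

With two bins of width 0.5 the bar heights add up to 2 while the area under the bars is 1, so the heights cannot be read as probabilities.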

Interpreting probability density is not as straight-forward, however.

In the ideal, limiting case for a Poisson distribution, MCMC should yield a posterior distribution approximating a Dirac delta. Higher variance distributions will be broader and look more uniform. The example in the first chapter has low variance indeed compared to Poisson statistics that tend to arise in nature.

pymc MAP.fit() uses normal approx, not scipy.fmin, by default

small change to chapter 3

Contribute your great style sheet to ipython-contrib

i really like your css style you used for the notebooks. maybe you will consider submitting it to the ipython-contrib repo

https://github.com/ipython-contrib/IPython-contrib

NBviewer experiment.

Hi,

Want to do a nbviewer experiment with css/mathjax, hope you don't mind if we try to do it with your notebooks.

The goal would be to allow notebooks to declare a preferred css in metadata, so I shamelessly extracted the css from your notebooks and hosted it on nbviewer (I still need to add you as the author on top, and would like to know which license it is under). I just changed width to max-width for mobile devices. So you should be able to remove your hack of reading from your custom css and modify the metadata, and still have your nice css.

What are your thoughts on that? What do you think of the nbviewer prefix in metadata (we'll remove the leading underscore once we think this can be released)?

Anyway, I'll let you look at the new version of nbviewer released a few hours ago and give me your thoughts.

Line 63 in Chapter 1 contains an unfinished sentence

Problem area is italicized:

"Your code either has a bug in it or not, but we do not know for certain which is true. Though we have a belief about the presence or absence of a bug."

Anyone know how this line is supposed to end?

add %pylab on top of each ipynb

Currently the --pylab flag is used ( and necessary ) when running the book. Adding a %pylab tag in the first cell alleviates this.

MCMC = Markov Chain Monte Carlo; not Monte Carlo Markov Chains

I believe it should be Markov Chain Monte Carlo not Monte Carlo Markov Chains.

Chapter 3 - Opening the black box of MCMC
"The machinery being employed is called Monte Carlo Markov Chains "

In Chapter 1
"The machinery being employed is called Monte Carlo Markov Chains"

discuss running multiple mcmc to assess convergence

Running multiple samplers helps twice:

assessing that the chains converge to the same posterior,
assessing that the individual chains have converged.

Add details about this to chapter 3.
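One common way to compare several chains is the Gelman-Rubin statistic, which the issue does not name. It is brought in here as an assumption about what such a section might cover. The sketch below computes it for equal-length chains stored as lists, with synthetic Gaussian "chains" standing in for real sampler output.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W
    between = n * variance(chain_means)                   # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(1)
# Three synthetic chains exploring the same distribution.
agreeing = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(3)]
# Two synthetic chains stuck in different regions.
disagreeing = [[rng.gauss(0, 1) for _ in range(1000)],
               [rng.gauss(5, 1) for _ in range(1000)]]

r_agree = gelman_rubin(agreeing)
r_disagree = gelman_rubin(disagreeing)
print(round(r_agree, 3), round(r_disagree, 3))
```

Values near 1 suggest the chains agree, while values well above 1 suggest at least one chain has not reached the same region. A value near 1 is a necessary sign of convergence, not a proof of it.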

Typo in Chapter 2

I was reading this fabulous book and noticed in Chapter 2 this definition where it appears that you used the ambient “temperature” array instead of the formal variable. I am not a Python expert so perhaps this is correct but it is odd looking.

Thanks for the book
--david

def p( temp = temperature, alpha = alpha, beta = beta):
    return 1.0/( 1. + np.exp( beta*temperature + alpha) )
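The consequence of using the ambient variable inside the function can be shown with scalars. In this sketch the coefficient values and the single `temperature` value are hypothetical, and math.exp replaces np.exp so the example runs without NumPy. `p_buggy` mirrors the definition quoted above, while `p_fixed` uses the formal argument.

```python
import math

temperature = 70.0        # ambient value, standing in for the data array
alpha, beta = -15.0, 0.2  # hypothetical coefficients


def p_buggy(temp=temperature, alpha=alpha, beta=beta):
    # Reads the global `temperature`, so the `temp` argument is ignored.
    return 1.0 / (1.0 + math.exp(beta * temperature + alpha))


def p_fixed(temp=temperature, alpha=alpha, beta=beta):
    # Uses the formal argument, so different inputs give different outputs.
    return 1.0 / (1.0 + math.exp(beta * temp + alpha))


print(p_buggy(50.0), p_buggy(80.0))
print(p_fixed(50.0), p_fixed(80.0))
```

When the function is called with its defaults the two versions agree, which may be why the notebook output looks right despite the slip.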

Running notebook on own, Environment issues

When running the notebook on my own I'm having some issues. I've set my environment up as follows:

clone project
cd into project
mkvirtualenv probabilistic-programming
pip install -r requirements.txt
cd Chapter1_Introduction
ipython notebook

From there my browser opens up and I'm able to read, however when I try to modify any code I see an error. I've tried modifying the graph in section 'Example: Bug, or just sweet, unintended feature?' and get this error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-3563e911edb3> in <module>()
----> 1 figsize(12.5, 4)
      2 p = np.linspace(0, 1, 50)
      3 plt.plot(p, 2*p/(1+p), color="#348ABD", lw=3)
      4 #plt.fill_between(p, 2*p/(1+p), alpha=.5, facecolor=["#A60628"])
      5 plt.scatter(0.2, 2*(0.2)/1.2, s=140, c="#348ABD")

NameError: name 'figsize' is not defined

Please let me know what I need to do to modify my environment and be able to run the notebooks interactively. Thanks!

Python AUC code

pinging @benhamner to share his AUC code for Chapter 7, to use in the Don't Overfit example.

Does that ping work?

PEP8 fixes

The code examples I've read do not adhere to the PEP8 coding standards. Are you open to cosmetic fixes for this issue?

book isn't white-on-black friendly

the rendered mathematical notations are black on transparent, and the graphs are black on white

Estimate the reading time

Before I start reading a paper book or a PDF I usually look at how many pages it has. Then I try to estimate how much time I will spend to get through it.
I do not see an easy way to perform such an assessment with any HTML/GitHub book.
So please tell me how many pages this book would have taken if it was printed on paper. What is your best guess for the average reading time?

Suppressing matplotlib.text, .legend

In the first plot in chapter 1, we see <matplotlib.text.Text at ....>
This can be suppressed by putting a semicolon at the end of
plt.title( "Are there bugs in my code?");
Similarly for legend and other snippets.

"400 : Bad Request" on nbviewer

http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Prologue/Prologue.ipynb links to:

400 : Bad Request

We couldn't render your notebook

Perhaps it is not valid JSON, or not the right URL?

If this should be a working notebook, please send us a bug report!

I have no idea whether the bug is on the side of this project, GitHub, or maybe nbviewer, so I reported it here.

Error in Price is Right's Showdown

There is a mistake in the rules: if both players go over, it is a double overbid, and no parties win. Thus the loss function will need to be modified a bit.
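The rule described above can be written down before touching the loss function. This is a sketch under stated assumptions: each contestant bids on their own showcase, a bid above the true price is an overbid, and among valid bids the smaller shortfall wins. The function name and the tie-breaking choice are hypothetical.

```python
def showdown_winner(bid_a, price_a, bid_b, price_b):
    """Return "A", "B", or None when both contestants overbid."""
    a_over = bid_a > price_a
    b_over = bid_b > price_b
    if a_over and b_over:
        return None  # double overbid: nobody wins
    if a_over:
        return "B"
    if b_over:
        return "A"
    # Neither overbid: the smaller shortfall wins (ties go to A, arbitrarily).
    return "A" if price_a - bid_a <= price_b - bid_b else "B"


print(showdown_winner(21_000, 20_000, 36_000, 35_000))
print(showdown_winner(19_000, 20_000, 36_000, 35_000))
print(showdown_winner(19_000, 20_000, 30_000, 35_000))
```

A loss function built on this would need a branch for the None outcome, since an overbid no longer guarantees that the opponent wins.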

Enhancing the notebooks with dynamic UIs

Note: the dynamic notebooks live Here

Hi cameron,

I released Exhibitionist about 2 weeks ago. It's a python library for building dynamic HTML/UI views on top of live python objects in interactive python work. That's practically synonymous with ipnb today.

Since then, I've been looking for a way to showcase the library and what's possible with it. I'd like to take your notebooks and use Exhibitionist to integrate interactive UIs (dynamic plots, etc.) so that readers can interact in realtime with the concepts.

I've already implemented the first view, allowing the user to visualize the exponential distribution while varying λ by using a slider. Here's a snapshot:

I'll be working on this in the coming weeks. How would you feel about having this live in a fork under @Exhibitionist? Would that be ok?

Broken Link

Real quick -- your link to chapter 5 is broken:

Needs to be:

http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter5_LossFunctions/LossFunctions.ipynb

requirements.txt

To start, I just want to say that this book is excellent.

It might be helpful if this project came with a requirements.txt file (for use with pip). This would make it trivial to install the small number of dependencies, and ensure the book works for people who might have older versions installed.

error message in first output cell CH3.

I think the nb is fine, it's just an error that got saved in.

Issue with clustering example in chapter 3

There is a subtlety in the clustering example in chapter 3 that should probably be pointed out to the reader, as it is quite instructive. There is a symmetry in the model. Exchanging cluster 0 and cluster 1 and at the same time exchanging p and 1-p leads to an equivalent configuration with the same probability. The prior clearly has that symmetry. The true posterior will have the same symmetry. The distribution obtained by MCMC is very far from symmetric. The Markov chain has been trapped in a metastable state. That is good for us, because that distribution is easier to interpret. The true posterior distribution with its bimodal marginals would probably confuse the reader.

We can either point out the above to the reader, or we can change the example to a model that does not have the symmetry. For example one could introduce another positive hyperparameter and set the mean of the second cluster to be different from the mean of the first cluster by this parameter. An exponential prior could be chosen for that new parameter. I implemented that and the convergence of the MCMC is better.
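The symmetry described above can be verified directly on the likelihood. The sketch below assumes a two-component Normal mixture with the cluster assignments summed out, where p is the weight of cluster 0. The data values and parameters are made up, and the function names are illustrative.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def mixture_log_likelihood(data, p, cluster_0, cluster_1):
    """p is the weight of cluster 0; each cluster is a (mean, sd) pair."""
    return sum(math.log(p * normal_pdf(x, *cluster_0)
                        + (1 - p) * normal_pdf(x, *cluster_1))
               for x in data)


data = [115.0, 122.0, 130.0, 188.0, 195.0, 210.0]  # made-up observations

original = mixture_log_likelihood(data, 0.3, (120.0, 10.0), (200.0, 15.0))
# Swap the two clusters and replace p with 1 - p at the same time.
swapped = mixture_log_likelihood(data, 0.7, (200.0, 15.0), (120.0, 10.0))

print(original, swapped)
```

The two values agree up to floating-point error, so any sampler that explores the full posterior would visit both labelings. A prior that orders the cluster means, as suggested in the second paragraph, removes the duplicate mode.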

Typos chapter 1

postierior instead of posterior in second code snippet.
abscent instead of absent.
positor instead of posterior.

Minor grammar issue

In Chapter 2 towards the end of describing Deterministic variables, I think the following sentence is not quite right:

During the learning phase, it the variables value that is repeatedly passed in, not the actual variable.

When reading, it seemed like it should be:

During the learning phase, it is the variable's value that is repeatedly passed in, not the actual variable.

ipynb -> tex converter

Hi,

I love your notebooks on nbconvert.

We are working on the brand new nbconvert to convert IPython notebook files, especially to LaTeX, here; pinging @jdfreder, as he is the most involved in the LaTeX converter. I thought this might interest both of us: you, to convert to LaTeX, and us, to have a nice example to get feedback and refine the settings.

Also, if/when you wish I can also put you on the home page of nbconvert.

Issue with halo/sky drawing plot in chapter 5.

Brought to my attention by Gabriel Kronberger:

I'm almost sure that your draw_sky routine has a bug. This is apparent in one-halo sky (7) because the nearest galaxies are not elongated tangentially but instead point into the direction of the halo. Comparing your picture with my outputs, some galaxies are shown correctly while others are elongated in the wrong direction. This indicates that there might be a problem with the calculation of the angle using arctan / atan2. I checked your code but can't point the finger at the bug, because I'm not familiar with the ellipse command. Does it handle negative angles correctly?

If it helps I used linear transformations for the graphical output in C#

        m.Translate(x, y);
        m.Rotate(45);
        m.Scale(1f + Math.Max(0, e2[i]), 1.0f - Math.Min(e2[i], 0));
        m.Rotate(-45);
        m.Scale(1f + Math.Max(0, e1[i]), 1.0f - Math.Min(e1[i], 0));
        graphics.Transform = m;
        graphics.FillEllipse(Brushes.Black, -2f, -2f, 4f, 4f);

A main issue is the strange reference system for e1 and e2.
One thing you can try is to check skies that have one very pronounced halo (e.g. 3). For comparison I attached my visualization of training sky number 3.
