
thinkbayes's Introduction

ThinkBayes

Code repository for Think Bayes: Bayesian Statistics Made Simple by Allen B. Downey

Available from Green Tea Press at http://thinkbayes.com.

Published by O'Reilly Media, October 2013.

thinkbayes's People

Contributors

alessandro-gentilini, allendowney, apaleyes, bgschiller, martynovs, recursing


thinkbayes's Issues

Why no binomial distribution in the Euro Problem

In the Euro problem, when calculating the likelihood of the entire set at once, it seems like this should use the binomial distribution. The binomial distribution calculates what the odds are of seeing K instances in N draws if the probability is P, and it seems like that's exactly what the likelihood should be, with N being tails + heads, K being heads, and P being x.

How does this likelihood function differ from a binomial?
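The two likelihoods differ only by the binomial coefficient C(n, k), which does not depend on the hypothesis x and therefore cancels when the posterior is normalized, so both give the same answer. A small self-contained check, using plain dicts rather than the book's Pmf class (the counts 140 heads / 110 tails are the ones from the book's Euro problem):

```python
from math import comb

def posterior(prior, like):
    # Multiply prior by likelihood, then normalize.
    post = {h: prior[h] * like(h) for h in prior}
    total = sum(post.values())
    return {h: p / total for h, p in post.items()}

hypos = [i / 100 for i in range(101)]      # x = P(heads)
prior = {x: 1 / 101 for x in hypos}        # uniform prior
heads, tails = 140, 110
n = heads + tails

# Per-toss likelihood: x**heads * (1 - x)**tails.
seq = posterior(prior, lambda x: x**heads * (1 - x)**tails)

# Binomial likelihood: C(n, heads) * x**heads * (1 - x)**tails.
binom = posterior(prior, lambda x: comb(n, heads) * x**heads * (1 - x)**tails)

# C(n, heads) is the same for every hypothesis, so it cancels on
# normalization: the two posteriors are identical.
assert all(abs(seq[x] - binom[x]) < 1e-12 for x in hypos)
```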

No module named 'thinkbayes'

Hey Allen,

I wrote "from thinkbayes.py import Pmf" in order to practice but it shows a message that says "No module named 'thinkbayes'".
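The fix is to drop the .py extension: Python module names never include it, so "from thinkbayes.py import Pmf" is parsed as package "thinkbayes" with a submodule "py" and fails. The correct form is "from thinkbayes import Pmf", run from the directory containing thinkbayes.py (or with that directory on sys.path). A self-contained demonstration with a stand-in module file (demo_module.py is our invention, not part of the book's code):

```python
import os
import sys

# Create a tiny stand-in module file with a Pmf class in it.
with open('demo_module.py', 'w') as f:
    f.write('class Pmf:\n    pass\n')

# Make sure the directory holding the .py file is on sys.path.
sys.path.insert(0, os.getcwd())

from demo_module import Pmf   # module name only, no ".py"
print(Pmf.__name__)           # Pmf

os.remove('demo_module.py')   # clean up the stand-in file
```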

TypeError: unhashable type: 'Pmf' (thinkbayes.py)

When trying to run code from hockey.py, I get:

Traceback (most recent call last):
  File "", line 1, in
    runfile('C:/Users/ssrra/.spyder-py3/temp.py', wdir='C:/Users/ssrra/.spyder-py3')
  File "C:\Users\ssrra\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)
  File "C:\Users\ssrra\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/ssrra/.spyder-py3/temp.py", line 541, in
    main()
  File "C:/Users/ssrra/.spyder-py3/temp.py", line 435, in main
    goal_dist1 = MakeGoalPmf(suite1)
  File "C:/Users/ssrra/.spyder-py3/temp.py", line 127, in MakeGoalPmf
    metapmf.Set(pmf, prob)
  File "C:\Users\ssrra\.spyder-py3\thinkbayes.py", line 589, in Set
    self.d[x] = y
TypeError: unhashable type: 'Pmf'
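One possible cause, offered as an assumption since we cannot see the reporter's exact files: in Python 3, a class that defines __eq__ without also defining __hash__ becomes unhashable, and Set stores the inner pmf as a dictionary key (self.d[x] = y). A minimal illustration with toy classes, not the book's Pmf:

```python
class Unhashable:
    # Defining __eq__ without __hash__ sets __hash__ to None in
    # Python 3, so instances cannot be used as dict keys.
    def __eq__(self, other):
        return self is other

class Hashable(Unhashable):
    # Restoring identity-based hashing makes instances usable as
    # dict keys again, which is what metapmf.Set needs.
    __hash__ = object.__hash__

d = {}
try:
    d[Unhashable()] = 1.0
except TypeError as e:
    print(e)              # unhashable type: 'Unhashable'

d[Hashable()] = 1.0       # works: identity hash + identity equality
print(len(d))             # 1
```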

Oops

These files are all symbolic links, rather than the actual files.

Chapter 7 Predictions, Section 7.6 Sudden Death

Dear Professor Downey,

In chapter 7, predictions, we are calculating the probability of winning in sudden-death overtime:

  1. We are creating a mixture of Exponential distributions, but we are taking the parameters of the Poisson distribution to do so. The posterior is our belief of what lambda is, which is the parameter for the Poisson distribution, which is also the expected goals per game, not the time between goals. So what is the rationale behind constructing the exponential mixture from goals per game posterior?

  2. I am assuming this is done to find the distribution of "games until goals". But figure 7.3 is named as "Distribution of time between goals". What is the relationship between "games until goal" (i.e. Poisson parameter) and "time between goals" (i.e. exponential parameter)?

  3. Would it be prudent to use the actual times between goals for this computation? For example, we could collect the times between goals up until the last 4 matches and use them as a prior, then update with the times between goals from the last 4 games to get a posterior over the exponential-distribution parameter, and then make a mixture of it.

  4. I feel this would also account for situations where both teams score the same number of goals (e.g. 2-2) and go to sudden-death overtime. In point 2, we are only considering going to overtime if neither team scores a goal, if I am not mistaken (unless that is what the rules are; please forgive my ignorance of hockey).

I would be grateful if you or anyone can throw some light on the matter.

Thanks a lot!
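On question 2, the two parameters are the same number viewed from two sides: if goals occur as a Poisson process at rate lam goals per game, then the count of goals in one game is Poisson(lam) and the time between consecutive goals, measured in games, is Exponential(lam). A quick simulation sketch (our own code, not the book's) showing that exponential inter-arrival times with rate lam produce lam goals per game on average:

```python
import random
random.seed(1)

lam = 2.5        # goals per game (the Poisson rate)
n = 50_000       # number of simulated games

def goals_in_one_game(lam):
    """Count arrivals of a Poisson process within one game by summing
    exponential inter-arrival times (measured in games)."""
    t, goals = 0.0, 0
    while True:
        t += random.expovariate(lam)   # time between goals
        if t > 1.0:
            return goals
        goals += 1

mean_goals = sum(goals_in_one_game(lam) for _ in range(n)) / n
print(mean_goals)   # close to lam: the exponential rate and the
                    # expected goals per game are the same parameter
```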

figures for ThinkBayes

Hello,
Thanks for having your book available here. Would it also be possible to upload the figures so the book compiles?

Missing file "BBB_data_from_Rob.csv"

Hi,
I am trying to run the code in species.py. The method RunSubject('B1242', conc=1, high=100) requires a CSV file, "BBB_data_from_Rob.csv". If it is not confidential, could you please share it?

Thanks for the book and code; it is nicely explained and written, and I enjoyed reading it.

Thanks,
Qiang

Some clarification for Chapter 8 Observer Bias model formulation

Chapter 8 makes an interesting point about Observer Bias on the Red Line, but it took me a while to understand why the distribution over passengers' observed wait times is greater than the true wait times. After some thought it turns out I was assuming a more complicated model than the text. I don't think either model is unreasonable; my intuition just wasn't on the same page and I didn't find an explicit reason in the text to invalidate my model. The correct model might be obvious to most but perhaps the clarification below will help someone in the future:

The text reads:

The average time between trains, as seen by a random passenger, is substantially higher than the true average.
Why? Because a passenger is more like (sic) to arrive during a large interval than a small one. Consider a simple example: suppose that the time between trains is either 5 minutes or 10 minutes with equal probability. In that case the average time between trains is 7.5 minutes.
But a passenger is more likely to arrive during a 10 minute gap than a 5 minute gap; in fact, twice as likely. If we surveyed arriving passengers, we would find that 2/3 of them arrived during a 10 minute gap, and only 1/3 during a 5 minute gap. So the average time between trains, as seen by an arriving passenger, is 8.33 minutes.

For this to be true, I believe we have to assume that a passenger arriving 0 minutes after the previous train has the same observed waiting time as a passenger arriving any arbitrary n > 0 minutes after the train. In other words, a passenger who just missed the previous train and waited the full gap is treated the same as a passenger who just barely made it onto the train.

My intuition was as follows: In reality, a passenger can arrive at the 9th minute of a 10-minute gap or the 4th minute of a 5-minute gap. Both passengers wait 1 minute. If you model it this way, the biased distribution actually shifts to the left. Why? Let's say two passengers arrive per minute (lam = 2). For a 2-minute gap, you might have the following wait times for 4 passengers: [0, 0, 1, 1]. For a 3-minute gap, you might have the following wait times for 6 passengers: [0, 0, 1, 1, 2, 2]. A passenger who waits 0 has arrived just before the train departs. For an n-minute gap, wait time n-1 indicates the passenger arrived within the first minute after the previous train departed. From the 2-minute and 3-minute gaps above, you can deduce that across all trains P(wait n) < P(wait n-1). That is, there is always a chance for a passenger to wait 0 minutes, but for, say, a 5-minute gap it is impossible to wait 6 minutes.

Here is some code to simulate the process and the resulting histogram.

from math import floor
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)

n = 50000  # Number of trains.
lam = 2    # Passengers arriving per minute.
T = np.random.normal(10, 2, n)  # True time between trains.
W1 = []    # Passengers' observed waiting time (my initial formulation).
W2 = []    # Passengers' observed waiting time (Think Bayes formulation).

for t in T:
    size = int(floor(t * lam))  # This many passengers board the next train.
    W1 += list(np.random.uniform(0, floor(t), size))
    W2 += list(np.ones(size) * t)

bins = int(T.max() - T.min())
# Note: `normed` has been removed from plt.hist; use `density=True`.
plt.hist(T, color='red', bins=bins, alpha=0.3, density=True, label=r'True wait $\mu=%.3lf$' % T.mean())
plt.hist(W1, color='blue', bins=bins, alpha=0.3, density=True, label=r'Observed wait $\mu=%.3lf$' % np.mean(W1))
plt.hist(W2, color='green', bins=bins, alpha=0.3, density=True, label=r'Observed wait simplified $\mu=%.3lf$' % np.mean(W2))
plt.legend(fontsize=8)
plt.show()

[figure_1: histograms of the true and simulated observed wait-time distributions]

bug on root2, unresolved references

def GaussianCdfInverse(p, mu=0, sigma=1):
    """Evaluates the inverse CDF of the gaussian distribution.

    See http://en.wikipedia.org/wiki/Normal_distribution#Quantile_function

    Args:
        p: float
        mu: mean parameter
        sigma: standard deviation parameter

    Returns:
        float
    """
    x = root2 * erfinv(2 * p - 1)
    return mu + x * sigma
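The unresolved names are root2 (presumably sqrt(2)) and erfinv, which in other copies of thinkbayes.py likely come from a module-level constant and scipy.special respectively (that is an assumption). A standard-library-only sketch of a fix, deriving erfinv from the normal quantile function:

```python
import math
from statistics import NormalDist

ROOT2 = math.sqrt(2)

def erfinv(y):
    # erfinv is not in the standard library; it relates to the standard
    # normal quantile by erfinv(y) = Phi^{-1}((y + 1) / 2) / sqrt(2).
    return NormalDist().inv_cdf((y + 1) / 2) / ROOT2

def gaussian_cdf_inverse(p, mu=0, sigma=1):
    """Inverse CDF (quantile function) of the Gaussian distribution."""
    x = ROOT2 * erfinv(2 * p - 1)
    return mu + x * sigma

print(gaussian_cdf_inverse(0.5))                  # median of N(0, 1): 0.0
print(round(gaussian_cdf_inverse(0.975), 2))      # ~1.96
```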

Possible issues with figures and results in Chapter 9

Hi,

I have been trying to replicate some of your results and I might have found an issue in Chapter 9.

My results suggest a much tighter posterior after observing the four data points x=[15, 16, 18, 21] than what is suggested in your Figures 9.2 and 9.5, as well as in the reported posterior credible intervals.

In fact, I get something closer to your posterior plots if I only include the last datapoint x=[21].

You can see my results in this colab notebook

trying to solve cookie3.py from first edition

Hello,

I have been trying to do exercise 2.1, the cookie example without replacement, but after failing miserably I checked the Git repository for ThinkBayes2 and found code with the solution for the second edition.

To my surprise, I was on the right track; however, when trying to rewrite that solution for the first edition of ThinkBayes, I still could not make it work.

if I use the following to set the hypos

bowl1=dict(vanilla=30,chocolate=10)
bowl2=dict(vanilla=20,chocolate=20)
pmf=Cookie([bowl1, bowl2])

I get:

TypeError: unhashable type: 'dict'

If I use the following (as in cookie3.py in ThinkBayes2)

bowl1=Hist(dict(vanilla=30,chocolate=10))
bowl2=Hist(dict(vanilla=20,chocolate=20)) 

I get:

AttributeError: 'Hist' object has no attribute 'Normalize'

I really want to see the solution to this, any hint or suggestion will be really appreciated!

Many thanks!

Leo
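One way to sidestep both errors in the first edition, sketched here with plain dictionaries rather than thinkbayes.py's API (the names are ours): use hashable string labels as the hypothesis keys and keep each hypothesis's mutable bowl contents in a separate dict, decrementing the drawn flavor after each update.

```python
# Each hypothesis is a string label; its (mutable) bowl contents live
# in a separate dict, so the Pmf-style keys stay hashable.
bowls = {
    'bowl1': dict(vanilla=30, chocolate=10),
    'bowl2': dict(vanilla=20, chocolate=20),
}
probs = {'bowl1': 0.5, 'bowl2': 0.5}   # uniform prior

def update(probs, bowls, flavor):
    """Bayesian update for one draw of `flavor`, without replacement."""
    for hypo in probs:
        bowl = bowls[hypo]
        total = sum(bowl.values())
        # Likelihood: fraction of `flavor` cookies remaining in this bowl.
        probs[hypo] *= bowl[flavor] / total if total else 0
        if bowl[flavor] > 0:
            bowl[flavor] -= 1          # remove the drawn cookie
    norm = sum(probs.values())
    for hypo in probs:
        probs[hypo] /= norm
    return probs

update(probs, bowls, 'vanilla')
print(probs['bowl1'])   # 0.6 after one vanilla draw, as in the book
```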

thinkplot.brewer throwing exception in dungeons.py

If dungeons.py is run as-is, the following exception is thrown:

Traceback (most recent call last):
  File "dungeons.py", line 117, in
    main()
  File "dungeons.py", line 63, in main
    colors = thinkplot.Brewer.Colors()
AttributeError: 'module' object has no attribute 'Brewer'

I think line 63 should be thinkplot._Brewer.Colors(). It works then; however, I am not sure what exactly you intend in terms of the single underscore (weakly hidden method).
