Statistical Power
Introduction
You've started to investigate hypothesis testing, p-values and their use for accepting or rejecting the null hypothesis. With this, the power of a statistical test measures an experiment's ability to detect a difference, when one exists. In the case of testing whether a coin is fair, the power of our statistical test would be the probability of rejecting the null hypothesis "this coin is fair" when the coin was unfair. As you might assume, the power of this statistical test would thus depend on several factors including our p-value threshold for rejecting the null hypothesis, the size of our sample and the 'level of unfairness' of the coin in question.
Objectives
You will be able to:
- Describe the concept of “Power” in relation to p-value and effect size for hypothesis testing
- Understand and critically evaluate the factors influencing the power of an experiment
- Perform Power calculation using SciPy and Python
- Demonstrate the impact of sample size on statistical power using simulations
- Demonstrate the combined effect of sample size and effect size on statistical power using simulations
The Power of a Statistical Test
The power of a statistical test is defined as the probability of rejecting the null hypothesis, given that it is indeed false. As with any probability, the power of a statistical test therefore ranges from 0 to 1, with 1 being a perfect test that guarantees rejecting the null hypothesis when it is indeed false.
Intrinsically, this is related to
Note: Ideally,
$\alpha$ and$\beta$ would both be minimized, but this is often costly, impractical or impossible depending on the scenario and required sample sizes.
Effect Size
The effect size is the magnitude of the difference you are testing between the two groups. Thus far, you've mainly been investigating the mean of a sample. For example, after flipping a coin n number of times, you've investigated using a t-test to determine whether the coin is a fair coin (p(heads)=0.5). To do this, you compare the mean of the sample to that of another sample, if comparing coins, or to a know theoretical distribution. Similarly, you might compare the mean income of a sample population to that of a census tract to determine if the populations are statistically different. In such cases, Cohen's D is typically the metric used as the effect size. Cohen's d is defined as:
$ d = \frac{m_1 - m_2}{s}$, where m_1 and m_2 are the respective sample means and s is the (overall) standard deviation
That is the difference of the means divided by the standard deviation.
Power Analysis
Since
- alpha value
- effect size
- sample size
A fantastic visual representation of these values' effect on one another can be found on Kristoffer Magnusson's website.
Let's look at how power might change in the context of varying effect size. To start, imagine the scenario of trying to detect whether or not a coin is fair. In this scenario, the null-hypothesis would be
To start, you might choose an alpha value that you are willing to accept such as
For example, if we wish to state the alternative hypothesis
$ d = \frac{m_1 - m_2}{s}$
$ d = \frac{.55 - .5}{s}$
Furthermore, since we are dealing with a binomial variable, the standard deviation of the sample should follow the formula
So some potential effect size values for various scenarios might look like this:
import numpy as np
import pandas as pd
m1 = .55
m2 = .5
p = m2
rows = []
for n in [10, 20, 50, 500]:
std = np.sqrt(n*p*(1-p))
d = (m1-m2)/std
rows.append({'Effect_Size': d, 'STD': std, 'Num_observations': n})
print('Hypothetical Effect sizes for p(heads)=.55 vs p(heads)=.5')
pd.DataFrame(rows)
Hypothetical Effect sizes for p(heads)=.55 vs p(heads)=.5
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Effect_Size | Num_observations | STD | |
---|---|---|---|
0 | 0.031623 | 10 | 1.581139 |
1 | 0.022361 | 20 | 2.236068 |
2 | 0.014142 | 50 | 3.535534 |
3 | 0.004472 | 500 | 11.180340 |
As a general rule of thumb, all of these effect sizes are quite small. here's the same idea expanded to other alternative hypotheses:
m2 = .5
rows = {}
for n in [10, 20, 50, 500]:
temp_dict = {}
for m1 in [.51, .55, .6, .65, .7, .75, .8, .85, .9]:
p = m1
std = np.sqrt(n*p*(1-p))
d = (m1-m2)/std
temp_dict[m1] = d
rows[n] = temp_dict
print('Hypothetical Effect Sizes for Various Alternative Hypotheses')
df = pd.DataFrame.from_dict(rows, orient='index')
# df.index = [10,20,50, 500]
# df.index.name = 'Sample_Size'
# df.columns.name = 'Alternative Hypothesis'
df
Hypothetical Effect Sizes for Various Alternative Hypotheses
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
0.51 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | |
---|---|---|---|---|---|---|---|---|---|
10 | 0.006326 | 0.031782 | 0.064550 | 0.099449 | 0.138013 | 0.182574 | 0.237171 | 0.309965 | 0.421637 |
20 | 0.004473 | 0.022473 | 0.045644 | 0.070321 | 0.097590 | 0.129099 | 0.167705 | 0.219179 | 0.298142 |
50 | 0.002829 | 0.014213 | 0.028868 | 0.044475 | 0.061721 | 0.081650 | 0.106066 | 0.138621 | 0.188562 |
500 | 0.000895 | 0.004495 | 0.009129 | 0.014064 | 0.019518 | 0.025820 | 0.033541 | 0.043836 | 0.059628 |
While a bit long winded, you can see that realalistic effect sizes for this scenario could be anywhere from 0.05 (or lower) up to approximately .4.
Now that you have some parameter estimates for alpha
As you've also seen, a common statistical test for comparing sample means is the T-test. Statsmodels has some convenient build in methods for calculating the power of a t-test and plotting power curves. Take a look:
from statsmodels.stats.power import TTestIndPower, TTestPower
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid') #Nice background styling on plots
power_analysis = TTestIndPower()
power_analysis.plot_power(dep_var="nobs",
nobs = np.array(range(5,1500)),
effect_size=np.array([.05, .1, .2,.3,.4,.5]),
alpha=0.05)
plt.show()
As this should demonstrate, detecting small perturbances can be quite difficult!
Similarly, just because a t-test has an incredibly small p-value doesn't neccessarily imply a strong statistical test. As is mentioned in the article Using Effect Size - or Why the P Value Is Not Enough, referenced below, using incredibly large sample sizes such as 22,000 can make even the most trivial effect size statistically significant. Realizing these reciprical relationships and considering all 4 parameters: alpha, effect size, sample size and power are all important when interpreting the results (such as the p-value) of a statistical test.
In addition to plotting a full curve, you can also calculate specific values. Simply don't specify one of the four parameters.
#Calculate power
power_analysis.solve_power(effect_size=.2, nobs1=80, alpha=.05)
0.24175778678474177
#Calculate sample size required
power_analysis.solve_power(effect_size=.2, alpha=.05, power=.8)
393.4056989990335
#Calculate minimum effect size to satisfy desired alpha and power as well as respect sample size limitations
power_analysis.solve_power(nobs1=25, alpha=.05, power=.8)
0.8087077886680412
#Calculate alpha (less traditional)
power_analysis.solve_power(nobs1=25, effect_size=.3, power=.8)
0.6613634273431555
You can also simulate your own data to verify results:
import scipy.stats as stats
def run_ttest_sim(p1, p2, std, nobs, alpha=0.05, n_sim=10**5):
"""p1 and p2 are the underlying means probabilities for 2 normal variables
Samples will be generated using these parameters."""
#Calculate Normalized Effect Size
effect_size = np.abs(p1-p2)/std
#Run a Simulation
# Initialize array to store results
p = (np.empty(n_sim))
p.fill(np.nan)
# Run a for loop for range of values in n_sim
for s in range(n_sim):
control = np.random.normal(loc= p1, scale=std, size=nobs)
experimental = np.random.normal(loc= p2, scale=std, size=nobs)
t_test = stats.ttest_ind(control, experimental)
p[s] = t_test[1]
num_null_rejects = np.sum(p < alpha)
power = num_null_rejects/n_sim
#Store results
stat_dict = {'alpha':alpha,
'nobs':nobs,
'effect_size':effect_size,
'power': power}
return stat_dict
run_ttest_sim(.5, .7, 1, 50)
{'alpha': 0.05,
'nobs': 50,
'effect_size': 0.19999999999999996,
'power': 0.16719}
And going back to the full stats model implementation for verification:
power_analysis.solve_power(nobs1=50, effect_size=0.19999999999999996, alpha=0.05)
0.1676754863454749
power_analysis.solve_power(nobs1=50, effect_size=0.19999999999999996, power=0.16719)
0.049779515826212185
power_analysis.solve_power(nobs1=50, power=0.16719, alpha=0.05)
0.19959710069445308
power_analysis.solve_power(power=0.16719, effect_size=0.19999999999999996, alpha=0.05)
49.80313313853301
Additional Resources
- Stats Models Documentation
- Using Effect Size—or Why the P Value Is Not Enough
- Understanding Statistical Power and Significance Testing - an interactive visualization
Summary
In this lesson, you learned about the idea of "statistical power" and how sample size, alpha and effect size impact the power of an experiment. Remember, the power of a statistical test is the probability of rejecting the null hypothesis when it is indeed false.