Another assumption of linear regression is the normality assumption. This specifically means that the residuals should follow a normal distribution.
You will be able to:
- Describe the normality assumption of linear regression
- Diagnose issues with normality using visualizations and statistical tests
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline
Recall that a normal distribution looks something like this:
generated_data = stats.norm.rvs(size=400, random_state=7)
fig, ax = plt.subplots()
sns.histplot(generated_data, bins=20, element="step", kde=True, color="green", ax=ax)
ax.set_xlabel("Generated Data")
fig.suptitle("Normal Distribution");
One of the assumptions of linear regression is that the model residuals should follow a normal distribution.
The two main ways of visualizing normality are histograms (like the plot above) and Q-Q plots.
Histograms show the distribution of a single variable by showing the value of that variable along the x-axis and the counts of that variable that fall into a set of bins along the y-axis.
You are looking for a histogram shape that resembles the plot above, a "bell curve" shape.
Let's plot a histogram of the residuals from a model that predicts MPG using displacement
, horsepower
, weight
, and acceleration
.
data = pd.read_csv('auto-mpg.csv', index_col=0)
y = data['mpg']
X = data[['displacement', 'horsepower', 'weight', 'acceleration']]
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
fig, ax = plt.subplots()
sns.histplot(results.resid, bins=20, element="step", kde=True, ax=ax)
ax.set_xlabel("Model Residuals")
fig.suptitle("Not So Normal Distribution");
As you can see, the residuals from the actual model are a bit skewed, not quite normal.
In statistics, a Q-Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.
We will use the Q-Q plot to compare the actual data to theoretical quantiles. In this case the theoretical quantiles would be the normal distribution (using the SciPy stats.norm
function).
When inspecting a Q-Q plot you are looking for the data points to follow the diagonal line as closely as possible.
First, here's an example using the generated normal data:
# Use qqplot function from StatsModels
fig, ax = plt.subplots()
sm.graphics.qqplot(generated_data, dist=stats.norm, line='45', fit=True, ax=ax)
# Customize plot appearance
scatter = ax.lines[0]
line = ax.lines[1]
scatter.set_markeredgecolor("green")
scatter.set_markerfacecolor("green")
line.set_color("black")
fig.suptitle("Normal Distribution");
There are a couple of places where the scatter plot diverges from the diagonal line because this data has some pseudo-random noise, but in general the points and the line are very close together.
Compare that to the plot for the auto MPG model residuals:
# Use qqplot function from StatsModels
fig, ax = plt.subplots()
sm.graphics.qqplot(results.resid, dist=stats.norm, line='45', fit=True, ax=ax)
# Customize plot appearance
line = ax.lines[1]
line.set_color("black")
fig.suptitle("Not So Normal Distribution");
We see that the middle looks ok, but the ends, especially the higher end, are diverging from a normal distribution. This aligns with the skew that we observed in the histogram.
Below are some more examples of histograms and their associated Q-Q plots:
You can also check out this article for a more extended discussion of interpreting Q-Q plots.
There are many potential statistical tests you could use to check for normality, since this kind of statistical test has applications beyond linear regression.
One test to consider is the Jarque-Bera test, which uses skewness and kurtosis to create a test statistic that can be turned into a p-value using the Chi-squared distribution.
The test statistic
Where
Below we compute jb
for the generated data using the StatsModels implementation:
n = generated_data.shape[0]
s = stats.skew(generated_data)
k = 3 + stats.kurtosis(generated_data)
jb = (n / 6.) * (s ** 2 + (1 / 4.) * (k - 3) ** 2)
jb
0.4519188967633713
Then we can calculate the p-value using the
stats.chi2.sf(jb, 2)
0.7977504510815616
As usual, you don't actually need to write out this formula yourself. There is a function you can import from StatsModels instead (documentation here):
from statsmodels.stats.stattools import jarque_bera
jarque_bera(generated_data)
(0.4519188967633713,
0.7977504510815616,
-0.0332236549490571,
2.849335177253925)
jarque_bera(results.resid)
(51.332550412287, 7.133099908986785e-12, 0.7146664114306294, 4.048721550463704)
The four values returned are:
-
$\text{JB}$ test statistic - The p-value for
$\text{JB}$ - Skew
- Kurtosis
You can also view these values directly within a model summary. For example, these are the results for our auto MPG model. If you look in the lower right area you will see Jarque-Bera (JB)
and Prob(JB)
:
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.707
Model: OLS Adj. R-squared: 0.704
Method: Least Squares F-statistic: 233.4
Date: Fri, 17 Jun 2022 Prob (F-statistic): 9.63e-102
Time: 15:28:49 Log-Likelihood: -1120.6
No. Observations: 392 AIC: 2251.
Df Residuals: 387 BIC: 2271.
Df Model: 4
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 45.2511 2.456 18.424 0.000 40.422 50.080
displacement -0.0060 0.007 -0.894 0.372 -0.019 0.007
horsepower -0.0436 0.017 -2.631 0.009 -0.076 -0.011
weight -0.0053 0.001 -6.512 0.000 -0.007 -0.004
acceleration -0.0231 0.126 -0.184 0.854 -0.270 0.224
==============================================================================
Omnibus: 38.359 Durbin-Watson: 0.861
Prob(Omnibus): 0.000 Jarque-Bera (JB): 51.333
Skew: 0.715 Prob(JB): 7.13e-12
Kurtosis: 4.049 Cond. No. 3.56e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.56e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In the case of the Jarque-Bera test the null hypothesis is that the distribution is normal. A sufficiently low p-value means we reject the null hypothesis, i.e. that the distribution is not normal.
For the two datasets presented above:
- The generated data's p-value is about 0.8. This is much higher than the standard alpha of 0.05, so we fail to reject the null hypothesis and can consider the distribution to be normal
- The model residuals' p-value is about 0.000000000007. This is much lower than the standard alpha of 0.05, so we reject the null hypothesis and do not consider the distribution to be normal
Like with some other model assumption checks, this is "opposite" of the p-values you want for model significance or coefficient significance. A low p-value in the Jarque-Bera test is a "bad" outcome, indicating a problem with the model.
As mentioned previously, there are several different options to test for normality, each of which uses a slightly different approach. Below are some additional tests you might want to consider:
The Kolmogorov-Smirnov test compares the CDF (cumulative distribution function) of the two distributions. The null hypothesis is that the distributions are the same. There is an implementation in SciPy (documentation here):
stats.kstest(generated_data, "norm")
KstestResult(statistic=0.03437796417014605, pvalue=0.7184091664903246)
stats.kstest(results.resid, "norm")
KstestResult(statistic=0.32698123251877237, pvalue=8.001138817167398e-38)
The Lilliefors test is based on the Kolmogorov-Smirnov test and uses estimated parameters. The null hypothesis is that the specified data comes from a normal distribution. There is an implementation in StatsModels (documentation here), which returns the test statistic and the p-value:
from statsmodels.stats.diagnostic import kstest_normal
kstest_normal(generated_data)
(0.02646141843017441, 0.7280229807121467)
kstest_normal(results.resid)
(0.06804544956822389, 0.0009999999999998899)
The Anderson-Darling test also compares the CDF of the two distributions. The null hypothesis is that the specified data comes from a normal distribution. There is an implementation in StatsModels (documentation here), which returns the test statistic and p-value:
from statsmodels.stats.diagnostic import normal_ad
normal_ad(generated_data)
(0.306208049572092, 0.5639537893049426)
normal_ad(results.resid)
(2.5714252349764024, 1.688944543036666e-06)
The Omnibus test for normality tests for deviations that result from skewness or kurtosis. The null hypothesis is that the specified data comes from a normal distribution. There is an implementation in StatsModels (documentation here):
from statsmodels.stats.stattools import omni_normtest
omni_normtest(generated_data)
NormaltestResult(statistic=0.3290035852108211, pvalue=0.8483162362467015)
omni_normtest(results.resid)
NormaltestResult(statistic=38.35880421385135, pvalue=4.68264785079494e-09)
This result is also printed out in the model summary, in the lower left. Look for Omnibus
and Prob(Omnibus)
:
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.707
Model: OLS Adj. R-squared: 0.704
Method: Least Squares F-statistic: 233.4
Date: Fri, 17 Jun 2022 Prob (F-statistic): 9.63e-102
Time: 15:28:49 Log-Likelihood: -1120.6
No. Observations: 392 AIC: 2251.
Df Residuals: 387 BIC: 2271.
Df Model: 4
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 45.2511 2.456 18.424 0.000 40.422 50.080
displacement -0.0060 0.007 -0.894 0.372 -0.019 0.007
horsepower -0.0436 0.017 -2.631 0.009 -0.076 -0.011
weight -0.0053 0.001 -6.512 0.000 -0.007 -0.004
acceleration -0.0231 0.126 -0.184 0.854 -0.270 0.224
==============================================================================
Omnibus: 38.359 Durbin-Watson: 0.861
Prob(Omnibus): 0.000 Jarque-Bera (JB): 51.333
Skew: 0.715 Prob(JB): 7.13e-12
Kurtosis: 4.049 Cond. No. 3.56e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.56e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
As you can see, there are a lot of options! Don't worry too much about choosing the best one. If you are especially interested in the topic, feel free to read more about how the tests work and what kinds of false positive and false negative results they might be susceptible to. But in general this is a fairly esoteric statistical topic and as a data professional you should be able to use any of these tests for your needs.
Log transformations tend to be the most helpful when it comes to normality issues. Non-normality in the features or target is often associated with non-normality in the residuals. Below we plot the distributions of all data in the auto MPG model:
data[['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']]\
.hist(figsize=(8, 10));
Then those same distributions after being log transformed:
np.log(data[['mpg', 'displacement', 'horsepower', 'weight', 'acceleration']])\
.hist(figsize=(8, 10), color="lightgreen");
As you can see, for most variables, they still aren't fully normal, but they're closer. However acceleration
looks more skewed now than before. So let's log transform everything except acceleration
.
Then if we build a model using those log-transformed features and target, we see some improvement in the distribution of the residuals:
# Build log transformed model
y_log = np.log(data["mpg"])
X_log = pd.concat([np.log(data[['displacement', 'horsepower', 'weight']]), data[['acceleration']]], axis=1)
log_model = sm.OLS(y_log, sm.add_constant(X_log))
log_results = log_model.fit()
# Set up plot and properties of two models
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,10))
resids = [results.resid, log_results.resid]
labels = ["Original Model", "Log Transformed Model"]
colors = ["blue", "lightgreen"]
# Plot histograms
for index, ax in enumerate(axes[0]):
sns.histplot(resids[index], bins=20, element="step", kde=True, color=colors[index], ax=ax)
ax.set_xlabel("Model Residuals")
ax.set_title(labels[index])
# Plot Q-Q plots
for index, ax in enumerate(axes[1]):
sm.graphics.qqplot(resids[index], dist=stats.norm, line='45', fit=True, ax=ax)
scatter = ax.lines[0]
line = ax.lines[1]
scatter.set_markeredgecolor(colors[index])
scatter.set_markerfacecolor(colors[index])
line.set_color("black")
ax.set_title(labels[index])
fig.tight_layout()
Looking at the p-values from the Jarque-Bera test, you can see we're getting a lot closer:
print(f"Original model: {jarque_bera(results.resid)[1]}")
print(f"Log transformed model: {jarque_bera(log_results.resid)[1]}")
Original model: 7.133099908986785e-12
Log transformed model: 0.0036669351258854627
And using some other measures, we actually can't reject the null hypothesis!
print(f"Original model: {kstest_normal(results.resid)[1]}")
print(f"Log transformed model: {kstest_normal(log_results.resid)[1]}")
Original model: 0.0009999999999998899
Log transformed model: 0.1610763758916161
print(f"Original model: {normal_ad(results.resid)[1]}")
print(f"Log transformed model: {normal_ad(log_results.resid)[1]}")
Original model: 1.688944543036666e-06
Log transformed model: 0.08748354038810166
Note that because the different tests assess normality in different ways, you might make different decisions depending on the statistic you are using.
Also keep in mind that log transformations are fundamentally changing the nature of the linear regression model to be multiplicative with respect to the original units rather than additive. So there may be a trade-off between model interpretability and meeting the normality assumption.
The normality assumption for linear regression is related to the distribution of the model residuals. In this lesson you saw how to evaluate a model with respect to the normality assumption using histograms and Q-Q plots, as well as various different statistical tests. You also saw how log transformation in particular can help meet the normality assumption, or at least to get closer to normal residuals.