You have already seen some examples of how to interpret coefficients for multiple linear regression. In this lesson we will go over some more examples, particularly focusing on models with one-hot encoded categorical predictors.
You will be able to:
- Describe and apply the concept of a reference category
- Interpret coefficients for one-hot encoded variables
- Explain the implications of binning and dropping multiple categories with one-hot encoded variables
Let's look at the Auto MPG dataset, with an engineered `make` feature. Then we'll start with a multiple regression model that uses `weight`, `model year`, and `origin` to predict `mpg`.
```python
import pandas as pd

# Load data into pandas and engineer "make" feature
data = pd.read_csv("auto-mpg.csv")
data["make"] = data["car name"].str.split().apply(lambda x: x[0])

# Prepare y and X for modeling
y = data["mpg"]
X = data[["weight", "model year", "origin"]]
X = pd.get_dummies(X, columns=["origin"], drop_first=True) # origin is categorical

X
```
| | weight | model year | origin_2 | origin_3 |
|---|---|---|---|---|
0 | 3504 | 70 | 0 | 0 |
1 | 3693 | 70 | 0 | 0 |
2 | 3436 | 70 | 0 | 0 |
3 | 3433 | 70 | 0 | 0 |
4 | 3449 | 70 | 0 | 0 |
... | ... | ... | ... | ... |
387 | 2790 | 82 | 0 | 0 |
388 | 2130 | 82 | 1 | 0 |
389 | 2295 | 82 | 0 | 0 |
390 | 2625 | 82 | 0 | 0 |
391 | 2720 | 82 | 0 | 0 |
392 rows × 4 columns
```python
import statsmodels.api as sm

model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
print(results.summary())
```

```
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.819
Model: OLS Adj. R-squared: 0.817
Method: Least Squares F-statistic: 437.9
Date: Thu, 12 May 2022 Prob (F-statistic): 3.53e-142
Time: 15:26:52 Log-Likelihood: -1026.1
No. Observations: 392 AIC: 2062.
Df Residuals: 387 BIC: 2082.
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -18.3069 4.017 -4.557 0.000 -26.205 -10.409
weight -0.0059 0.000 -22.647 0.000 -0.006 -0.005
model year 0.7698 0.049 15.818 0.000 0.674 0.866
origin_2 1.9763 0.518 3.815 0.000 0.958 2.995
origin_3 2.2145 0.519 4.268 0.000 1.194 3.235
==============================================================================
Omnibus: 32.293 Durbin-Watson: 1.251
Prob(Omnibus): 0.000 Jarque-Bera (JB): 58.234
Skew: 0.507 Prob(JB): 2.26e-13
Kurtosis: 4.593 Cond. No. 7.39e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.39e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
```
In the model displayed above, we dropped one of the dummy variable columns for `origin` in order to avoid the dummy variable trap. The dropped column is now our reference category. So, which column is that?

To investigate this, we need to look at the values of `origin` prior to one-hot encoding:
```python
data["origin"].value_counts()
```

```
1 245
3 79
2 68
Name: origin, dtype: int64
```
(`1` means a US car maker, `2` means a European car maker, `3` means an Asian car maker.)

Because `1` does not appear in our one-hot encoded variables (we only have `origin_2` and `origin_3`), we know that `1` is the reference category. Thus we can interpret the coefficients of `origin_2` and `origin_3` like this:

- `origin_2` represents the difference associated with a car being from a European car maker vs. a US car maker. In other words, compared to US car makers, we see an associated increase of about 2 MPG for European car makers.
- `origin_3` represents the difference associated with a car being from an Asian car maker vs. a US car maker. We see an associated increase of about 2.2 MPG for Asian car makers compared to US car makers.
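To make these interpretations concrete, we can plug the fitted coefficients into the regression equation by hand. Here's a quick sketch using the coefficients from the summary above; the 3,000 lb, model year 76 car is a made-up example:

```python
# Coefficients copied from the model summary above
const, b_weight, b_year = -18.3069, -0.0059, 0.7698
b_origin_2, b_origin_3 = 1.9763, 2.2145

# Hypothetical car: 3,000 lbs, model year 76
us_mpg = const + b_weight * 3000 + b_year * 76  # origin_2 = origin_3 = 0 (US reference)
europe_mpg = us_mpg + b_origin_2                # origin_2 = 1
asia_mpg = us_mpg + b_origin_3                  # origin_3 = 1

# The predicted gaps vs. the US car are exactly the dummy coefficients
print(round(europe_mpg - us_mpg, 4))  # 1.9763
print(round(asia_mpg - us_mpg, 4))    # 2.2145
```

Because the dummies enter the equation additively, the same gap of about 2 MPG (European) or 2.2 MPG (Asian) applies at any weight and model year.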
Running `pd.get_dummies` with `drop_first=True` is an easy way to make sure that you avoid the dummy variable trap and that some category is selected as the reference category. However, there is no data understanding involved; pandas just drops the first category it finds in sorted order.

If we skip that argument and manually drop a column instead, we can set any category as the reference category. For example, if we set `2` (European car maker) as the reference category, our model looks like this:
```python
X = data[["weight", "model year", "origin"]]
X = pd.get_dummies(X, columns=["origin"])
X = X.drop("origin_2", axis=1)

X
```
| | weight | model year | origin_1 | origin_3 |
|---|---|---|---|---|
0 | 3504 | 70 | 1 | 0 |
1 | 3693 | 70 | 1 | 0 |
2 | 3436 | 70 | 1 | 0 |
3 | 3433 | 70 | 1 | 0 |
4 | 3449 | 70 | 1 | 0 |
... | ... | ... | ... | ... |
387 | 2790 | 82 | 1 | 0 |
388 | 2130 | 82 | 0 | 0 |
389 | 2295 | 82 | 1 | 0 |
390 | 2625 | 82 | 1 | 0 |
391 | 2720 | 82 | 1 | 0 |
392 rows × 4 columns
```python
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
results.params
```

```
const -16.330638
weight -0.005887
model year 0.769849
origin_1 -1.976306
origin_3 0.238228
dtype: float64
```
Or if we set `3` (Asian car maker) as the reference category, our model looks like this:
```python
X = data[["weight", "model year", "origin"]]
X = pd.get_dummies(X, columns=["origin"])
X = X.drop("origin_3", axis=1)

X
```
| | weight | model year | origin_1 | origin_2 |
|---|---|---|---|---|
0 | 3504 | 70 | 1 | 0 |
1 | 3693 | 70 | 1 | 0 |
2 | 3436 | 70 | 1 | 0 |
3 | 3433 | 70 | 1 | 0 |
4 | 3449 | 70 | 1 | 0 |
... | ... | ... | ... | ... |
387 | 2790 | 82 | 1 | 0 |
388 | 2130 | 82 | 0 | 1 |
389 | 2295 | 82 | 1 | 0 |
390 | 2625 | 82 | 1 | 0 |
391 | 2720 | 82 | 1 | 0 |
392 rows × 4 columns
```python
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
results.params
```

```
const -16.092410
weight -0.005887
model year 0.769849
origin_1 -2.214534
origin_2 -0.238228
dtype: float64
```
Looking at the models above, what changes and what stays the same, in terms of the coefficients? Why?
Answer (click to reveal)
Each time the reference category changes, the `const` and the other `origin` coefficients change.

`const` changes because it represents the predicted value when all predictors are 0, which means `const` represents the case when the reference category is true.

The other `origin` coefficients change because they are always measured with respect to the reference category. So if category `1` has been dropped, then `origin_2` is with respect to category `1`. But if category `3` has been dropped instead, then `origin_2` is with respect to category `3`.

`weight` and `model year` don't change, because `origin` overall is still being "held constant" by the model, regardless of how `origin` is actually encoded.
Choosing an appropriate reference category comes down to your understanding of the data and the stakeholders. Is there a category that makes the most sense as a "baseline", "default", or "neutral" category?
For our `origin` variable, if you were performing this analysis in the US, then setting `1` (US) as the baseline might make the most sense. If your audience were in Europe or Asia, then it might make more sense to drop `origin_2` or `origin_3` instead. For a context without a clear match like this, you might choose the category that is most familiar or most well-known to your audience.
You also might consider looking at the data itself to see what makes the most sense as a "baseline". If one category is much more common than the others, that might be a good baseline. Or if one category is in the "middle" (maybe closest to the mean or median) in terms of the target, that also might be a good baseline.
There is no single right answer for choosing a reference category. The model contains the same information regardless of which category is chosen, so you are really just deciding how best to interpret your coefficients.
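We can check that equivalence directly. Here's a small sketch on synthetic data (pure NumPy least squares rather than statsmodels, and made-up numbers) showing that no matter which dummy column we drop, the fitted values — and therefore R-squared and residuals — come out identical:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
category = rng.integers(1, 4, size=n)      # three categories, like origin
x = rng.normal(3000, 500, size=n)          # a numeric predictor, like weight
y = 30 - 0.005 * x + 2 * (category == 2) + 3 * (category == 3) + rng.normal(0, 1, n)

def fitted_values(reference):
    """Fit OLS with `reference` as the dropped dummy; return the fitted values."""
    dummies = [(category == c).astype(float) for c in (1, 2, 3) if c != reference]
    X = np.column_stack([np.ones(n), x] + dummies)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# All three encodings span the same column space, so the fits are identical
assert np.allclose(fitted_values(1), fitted_values(2))
assert np.allclose(fitted_values(2), fitted_values(3))
```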
The `origin` variable only has three options for a reference category, since there are only three distinct values present in that column. What about `make`, which contains many more categories?
```python
pd.get_dummies(data["make"])
```
| | amc | audi | bmw | buick | cadillac | capri | chevroelt | chevrolet | chevy | chrysler | ... | renault | saab | subaru | toyota | toyouta | triumph | vokswagen | volkswagen | volvo | vw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
387 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
388 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
389 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
390 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
391 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
392 rows × 37 columns
We could just use `drop_first=True`, add this predictor to our other three, and see what the model outputs:

```python
X = data[["weight", "model year", "origin", "make"]]
X = pd.get_dummies(X, columns=["origin", "make"], drop_first=True)

X
```
| | weight | model year | origin_2 | origin_3 | make_audi | make_bmw | make_buick | make_cadillac | make_capri | make_chevroelt | ... | make_renault | make_saab | make_subaru | make_toyota | make_toyouta | make_triumph | make_vokswagen | make_volkswagen | make_volvo | make_vw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3504 | 70 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 3693 | 70 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 3436 | 70 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 3433 | 70 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 3449 | 70 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
387 | 2790 | 82 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
388 | 2130 | 82 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
389 | 2295 | 82 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
390 | 2625 | 82 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
391 | 2720 | 82 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
392 rows × 40 columns
```python
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
print(results.summary())
```

```
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.849
Model: OLS Adj. R-squared: 0.833
Method: Least Squares F-statistic: 52.34
Date: Thu, 12 May 2022 Prob (F-statistic): 5.34e-122
Time: 15:26:53 Log-Likelihood: -990.31
No. Observations: 392 AIC: 2059.
Df Residuals: 353 BIC: 2214.
Df Model: 38
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
const -15.1432 4.118 -3.677 0.000 -23.243 -7.044
weight -0.0059 0.000 -21.758 0.000 -0.006 -0.005
model year 0.7112 0.050 14.236 0.000 0.613 0.809
origin_2 3.1152 0.786 3.966 0.000 1.570 4.660
origin_3 2.7369 0.861 3.180 0.002 1.044 4.430
make_audi -0.0195 1.231 -0.016 0.987 -2.441 2.402
make_bmw -2.2301 2.165 -1.030 0.304 -6.488 2.028
make_buick 1.1616 0.998 1.164 0.245 -0.801 3.124
make_cadillac 4.2080 2.360 1.783 0.075 -0.434 8.850
make_capri 1.2719 3.253 0.391 0.696 -5.126 7.670
make_chevroelt 0.8019 3.253 0.247 0.805 -5.596 7.200
make_chevrolet 1.2458 0.787 1.584 0.114 -0.301 2.793
make_chevy 1.0286 1.949 0.528 0.598 -2.804 4.861
make_chrysler 0.4977 1.465 0.340 0.734 -2.383 3.379
make_datsun 2.0835 0.853 2.443 0.015 0.406 3.761
make_dodge 1.6943 0.869 1.950 0.052 -0.014 3.403
make_fiat 0.7829 1.179 0.664 0.507 -1.536 3.101
make_ford 0.6462 0.772 0.838 0.403 -0.871 2.164
make_hi 2.2852 3.270 0.699 0.485 -4.145 8.716
make_honda 1.9803 1.001 1.979 0.049 0.012 3.949
make_maxda -3.4983 2.086 -1.677 0.094 -7.600 0.603
make_mazda 0.2790 1.089 0.256 0.798 -1.862 2.420
make_mercedes 2.0763 3.030 0.685 0.494 -3.882 8.035
make_mercedes-benz 0.6670 2.180 0.306 0.760 -3.620 4.954
make_mercury 0.3799 1.147 0.331 0.741 -1.876 2.636
make_nissan 2.8367 2.889 0.982 0.327 -2.846 8.519
make_oldsmobile 2.4425 1.197 2.040 0.042 0.088 4.797
make_opel -1.5092 1.581 -0.955 0.340 -4.618 1.599
make_peugeot -0.2576 1.169 -0.220 0.826 -2.558 2.042
make_plymouth 2.4352 0.841 2.895 0.004 0.781 4.089
make_pontiac 3.0326 1.016 2.984 0.003 1.034 5.031
make_renault 0.5839 1.795 0.325 0.745 -2.947 4.115
make_saab -1.2068 1.572 -0.768 0.443 -4.298 1.885
make_subaru 0.1267 1.537 0.082 0.934 -2.897 3.150
make_toyota -0.0611 0.837 -0.073 0.942 -1.707 1.584
make_toyouta -1.0098 2.890 -0.349 0.727 -6.694 4.674
make_triumph 4.8871 3.017 1.620 0.106 -1.047 10.821
make_vokswagen -4.1781 3.020 -1.383 0.167 -10.118 1.762
make_volkswagen -0.4466 0.933 -0.478 0.633 -2.282 1.389
make_volvo -2.8897 1.319 -2.192 0.029 -5.483 -0.297
make_vw 6.8556 1.325 5.173 0.000 4.249 9.462
==============================================================================
Omnibus: 26.860 Durbin-Watson: 1.302
Prob(Omnibus): 0.000 Jarque-Bera (JB): 56.989
Skew: 0.368 Prob(JB): 4.22e-13
Kurtosis: 4.716 Cond. No. 7.45e+19
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 6.78e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
```
What a long model summary! How can we make sense of this, particularly given that so many of these coefficients are not statistically significant?
It's difficult to extract much meaning from the `make` variable here.

Before doing anything else with categorical data, some data cleaning is often useful. Something you might notice here is that several of these `make` values are "really" the same. For example, `"chevroelt"` is almost certainly a typo that was intended to be `"chevrolet"`.

What happens if we combine values that really should be the same?
```python
print(data["make"].nunique(), "unique values")
data["make"].value_counts()
```

```
37 unique values
ford 48
chevrolet 43
plymouth 31
dodge 28
amc 27
toyota 25
datsun 23
buick 17
pontiac 16
volkswagen 15
honda 13
mercury 11
mazda 10
oldsmobile 10
peugeot 8
fiat 8
audi 7
vw 6
volvo 6
chrysler 6
opel 4
subaru 4
saab 4
renault 3
chevy 3
cadillac 2
mercedes-benz 2
bmw 2
maxda 2
hi 1
capri 1
chevroelt 1
nissan 1
toyouta 1
triumph 1
vokswagen 1
mercedes 1
Name: make, dtype: int64
```
```python
replacement_dict = {
    "chevroelt": "chevrolet",
    "mercedes": "mercedes-benz",
    "toyouta": "toyota",
    "vokswagen": "volkswagen",
    "maxda": "mazda",
    "chevy": "chevrolet",
    "vw": "volkswagen"
}
data.replace(replacement_dict, inplace=True)

print(data["make"].nunique(), "unique values")
data["make"].value_counts()
```

```
30 unique values
ford 48
chevrolet 47
plymouth 31
dodge 28
amc 27
toyota 26
datsun 23
volkswagen 22
buick 17
pontiac 16
honda 13
mazda 12
mercury 11
oldsmobile 10
peugeot 8
fiat 8
audi 7
chrysler 6
volvo 6
opel 4
subaru 4
saab 4
mercedes-benz 3
renault 3
bmw 2
cadillac 2
hi 1
capri 1
nissan 1
triumph 1
Name: make, dtype: int64
```
```python
X = data[["weight", "model year", "origin", "make"]]
X = pd.get_dummies(X, columns=["origin", "make"], drop_first=True)

model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
print(results.summary())
```

```
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.837
Model: OLS Adj. R-squared: 0.823
Method: Least Squares F-statistic: 59.81
Date: Thu, 12 May 2022 Prob (F-statistic): 2.11e-122
Time: 15:26:53 Log-Likelihood: -1005.1
No. Observations: 392 AIC: 2074.
Df Residuals: 360 BIC: 2201.
Df Model: 31
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
const -17.8911 4.171 -4.290 0.000 -26.093 -9.689
weight -0.0058 0.000 -20.987 0.000 -0.006 -0.005
model year 0.7447 0.051 14.723 0.000 0.645 0.844
origin_2 2.9137 0.782 3.728 0.000 1.377 4.451
origin_3 3.2331 0.834 3.879 0.000 1.594 4.872
make_audi 0.1836 1.236 0.149 0.882 -2.247 2.615
make_bmw -1.9415 2.179 -0.891 0.373 -6.226 2.343
make_buick 1.0693 1.026 1.042 0.298 -0.949 3.087
make_cadillac 4.0221 2.427 1.657 0.098 -0.751 8.795
make_capri 1.2625 3.346 0.377 0.706 -5.317 7.842
make_chevrolet 1.1688 0.796 1.468 0.143 -0.397 2.734
make_chrysler 0.3358 1.506 0.223 0.824 -2.626 3.297
make_datsun 1.5754 0.821 1.919 0.056 -0.039 3.190
make_dodge 1.6160 0.893 1.809 0.071 -0.140 3.372
make_fiat 1.0683 1.181 0.905 0.366 -1.254 3.390
make_ford 0.5908 0.793 0.745 0.457 -0.969 2.151
make_hi 2.3033 3.362 0.685 0.494 -4.309 8.916
make_honda 1.4209 0.967 1.470 0.143 -0.480 3.322
make_mazda -0.9212 0.989 -0.931 0.352 -2.866 1.024
make_mercedes-benz 1.1733 1.829 0.642 0.521 -2.423 4.769
make_mercury 0.2986 1.180 0.253 0.800 -2.021 2.618
make_nissan 2.1631 2.839 0.762 0.447 -3.420 7.746
make_oldsmobile 2.3121 1.231 1.879 0.061 -0.108 4.732
make_opel -1.2031 1.587 -0.758 0.449 -4.325 1.918
make_peugeot -0.0725 1.173 -0.062 0.951 -2.380 2.235
make_plymouth 2.4041 0.865 2.779 0.006 0.703 4.105
make_pontiac 2.9529 1.045 2.826 0.005 0.898 5.008
make_renault 0.8500 1.806 0.471 0.638 -2.701 4.401
make_saab -0.9518 1.579 -0.603 0.547 -4.057 2.154
make_subaru -0.4116 1.503 -0.274 0.784 -3.368 2.544
make_toyota -0.5935 0.798 -0.743 0.458 -2.164 0.977
make_triumph 4.9509 3.043 1.627 0.105 -1.033 10.935
make_volkswagen 1.5855 0.823 1.927 0.055 -0.033 3.204
make_volvo -2.7290 1.325 -2.059 0.040 -5.335 -0.123
==============================================================================
Omnibus: 36.684 Durbin-Watson: 1.196
Prob(Omnibus): 0.000 Jarque-Bera (JB): 68.374
Skew: 0.557 Prob(JB): 1.42e-15
Kurtosis: 4.717 Cond. No. 1.04e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.47e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
```
Well, that didn't do much. We still have lots of coefficients that can't be interpreted because they aren't statistically significant.

Another approach you might consider is creating a catch-all "other" category for values that don't have many examples in the dataset. We'll choose a somewhat arbitrary cutoff of 10 observations, echoing the conventional rule of thumb of about 10 samples per independent variable.
```python
data["make"].value_counts()[data["make"].value_counts() < 10]
```

```
peugeot 8
fiat 8
audi 7
chrysler 6
volvo 6
opel 4
subaru 4
saab 4
mercedes-benz 3
renault 3
bmw 2
cadillac 2
hi 1
capri 1
nissan 1
triumph 1
Name: make, dtype: int64
```
```python
to_replace = data["make"].value_counts()[data["make"].value_counts() < 10].index.values
to_replace
```

```
array(['peugeot', 'fiat', 'audi', 'chrysler', 'volvo', 'opel', 'subaru',
       'saab', 'mercedes-benz', 'renault', 'bmw', 'cadillac', 'hi',
       'capri', 'nissan', 'triumph'], dtype=object)
```
```python
# Exact-match replace across the whole DataFrame; only "make" contains these values
data.replace(to_replace, value="other", inplace=True)

print(data["make"].nunique(), "unique values")
data["make"].value_counts()
```

```
15 unique values
other 61
ford 48
chevrolet 47
plymouth 31
dodge 28
amc 27
toyota 26
datsun 23
volkswagen 22
buick 17
pontiac 16
honda 13
mazda 12
mercury 11
oldsmobile 10
Name: make, dtype: int64
```
```python
X = data[["weight", "model year", "origin", "make"]]
X = pd.get_dummies(X, columns=["origin", "make"], drop_first=True)

model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
print(results.summary())
```

```
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.832
Model: OLS Adj. R-squared: 0.824
Method: Least Squares F-statistic: 102.7
Date: Thu, 12 May 2022 Prob (F-statistic): 3.02e-132
Time: 15:26:53 Log-Likelihood: -1011.5
No. Observations: 392 AIC: 2061.
Df Residuals: 373 BIC: 2136.
Df Model: 18
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const -18.3333 4.031 -4.548 0.000 -26.260 -10.407
weight -0.0058 0.000 -22.015 0.000 -0.006 -0.005
model year 0.7516 0.049 15.266 0.000 0.655 0.848
origin_2 1.3483 1.199 1.124 0.262 -1.010 3.707
origin_3 1.9241 1.848 1.041 0.299 -1.710 5.559
make_buick 1.0599 1.024 1.036 0.301 -0.953 3.073
make_chevrolet 1.1589 0.795 1.459 0.146 -0.403 2.721
make_datsun 2.8444 2.030 1.402 0.162 -1.146 6.835
make_dodge 1.5970 0.891 1.792 0.074 -0.156 3.350
make_ford 0.5814 0.792 0.734 0.463 -0.975 2.138
make_honda 2.6716 2.117 1.262 0.208 -1.491 6.834
make_mazda 0.3348 2.138 0.157 0.876 -3.868 4.538
make_mercury 0.2870 1.177 0.244 0.808 -2.028 2.602
make_oldsmobile 2.2958 1.227 1.870 0.062 -0.118 4.709
make_other 1.3573 1.234 1.100 0.272 -1.069 3.784
make_plymouth 2.3952 0.864 2.774 0.006 0.697 4.093
make_pontiac 2.9497 1.042 2.830 0.005 0.900 4.999
make_toyota 0.6804 2.017 0.337 0.736 -3.286 4.647
make_volkswagen 3.1081 1.494 2.081 0.038 0.171 6.045
==============================================================================
Omnibus: 31.295 Durbin-Watson: 1.179
Prob(Omnibus): 0.000 Jarque-Bera (JB): 54.861
Skew: 0.502 Prob(JB): 1.22e-12
Kurtosis: 4.533 Cond. No. 8.49e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.49e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
```
That is a much more manageable number of coefficients. Let's go through and interpret these:
- The reference category for `origin` is `1` (US) and for `make` is `amc` (American Motors Corporation)
- `const`, `weight`, and `model year` are all still statistically significant
  - When all other predictors are 0, the MPG would be about -18.3
  - For each increase of 1 lb in weight, we see an associated decrease of about 0.006 in MPG
  - For each year newer the vehicle is, we see an associated increase of about 0.75 in MPG
- `origin_2` and `origin_3` are no longer statistically significant
  - While this might seem surprising, our data understanding can explain it. The `origin` feature and the `make` feature are really providing the same information, except that `make` is more granular. Every `make` category (except for `other`) corresponds to exactly one `origin` category. Therefore it probably does not make sense to include both `origin` and `make` in the same model
- At a standard alpha of 0.05, only `make_plymouth`, `make_pontiac`, and `make_volkswagen` are statistically significant
  - When a car's make is `plymouth` compared to `amc`, we see an associated increase of about 2.4 in MPG
  - When a car's make is `pontiac` compared to `amc`, we see an associated increase of about 2.9 in MPG
  - When a car's make is `volkswagen` compared to `amc`, we see an associated increase of about 3.1 in MPG
All of the significant coefficients happen to be positive. Why is that? It turns out that `amc` is the first `make` value alphabetically (so `drop_first=True` made it the reference category) and it has the lowest mean MPG:
```python
# Select "mpg" before aggregating so non-numeric columns (like "car name") don't get in the way
data.groupby("make")[["mpg"]].mean().sort_values(by="mpg")
```
| make | mpg |
|---|---|
amc | 18.070370 |
mercury | 19.118182 |
buick | 19.182353 |
ford | 19.475000 |
pontiac | 20.012500 |
chevrolet | 20.219149 |
oldsmobile | 21.100000 |
plymouth | 21.703226 |
dodge | 22.060714 |
other | 24.781967 |
toyota | 28.165385 |
mazda | 30.058333 |
datsun | 31.113043 |
volkswagen | 31.840909 |
honda | 33.761538 |
It might surprise you that out of these categories, `pontiac`, `plymouth`, and `volkswagen` were the only ones with statistically significant differences. You wouldn't guess that based on the means alone. But keep in mind that the t-test being used here takes into account the number of samples and the variance in each category, not just the mean.
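To build intuition for why, here's a rough sketch with made-up MPG numbers. It uses a plain Welch two-sample t-statistic rather than the regression coefficient t-test that statsmodels computes, but the principle is the same: a large mean difference backed by few, noisy observations can yield a weaker test statistic than a small difference backed by many consistent observations.

```python
import numpy as np

def t_stat(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

baseline = [17, 18, 19, 18, 17, 19, 18, 18, 17, 19]   # mean 18, low spread
tiny = [20, 32]                                       # mean 26: big gap, but n=2 and noisy
many = [20, 19, 21, 20, 20, 19, 21, 20, 20, 20]       # mean 20: small gap, n=10, consistent

print(round(t_stat(tiny, baseline), 2))   # 1.33 -- 8 MPG gap, weak evidence
print(round(t_stat(many, baseline), 2))   # 6.0  -- 2 MPG gap, strong evidence
```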
We could stop there and have this be our final model. Coefficients that aren't statistically significant aren't a problem for a model; they just mean that we can't interpret those coefficients, because we don't have enough evidence that they aren't really zero.
A lot of people who are just learning about linear regression modeling will try dropping all of the predictors whose coefficients aren't statistically significant:
```python
X = data[["weight", "model year", "origin", "make"]]
X = pd.get_dummies(X, columns=["origin", "make"], drop_first=True)

not_significant = [
    "origin_2",
    "origin_3",
    "make_buick",
    "make_chevrolet",
    "make_datsun",
    "make_dodge",
    "make_ford",
    "make_honda",
    "make_mazda",
    "make_mercury",
    "make_oldsmobile",
    "make_other",
    "make_toyota"
]
X.drop(not_significant, axis=1, inplace=True)

model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
print(results.summary())
```

```
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.813
Model: OLS Adj. R-squared: 0.811
Method: Least Squares F-statistic: 336.1
Date: Thu, 12 May 2022 Prob (F-statistic): 3.56e-138
Time: 15:26:53 Log-Likelihood: -1032.4
No. Observations: 392 AIC: 2077.
Df Residuals: 386 BIC: 2101.
Df Model: 5
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const -15.1455 3.993 -3.793 0.000 -22.996 -7.295
weight -0.0066 0.000 -29.358 0.000 -0.007 -0.006
model year 0.7622 0.049 15.519 0.000 0.666 0.859
make_plymouth 0.7494 0.641 1.169 0.243 -0.511 2.010
make_pontiac 1.6166 0.881 1.834 0.067 -0.116 3.350
make_volkswagen 1.9408 0.775 2.504 0.013 0.417 3.464
==============================================================================
Omnibus: 31.402 Durbin-Watson: 1.235
Prob(Omnibus): 0.000 Jarque-Bera (JB): 44.519
Skew: 0.582 Prob(JB): 2.15e-10
Kurtosis: 4.171 Cond. No. 7.21e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
```
Don't do this!
First, what do these coefficients mean in this context? There are three remaining `make` dummy variables, so...
- `make_plymouth` is the difference between having a `make` of `plymouth` and a `make` of...anything other than `plymouth`, `pontiac`, or `volkswagen`
- `make_pontiac` is the difference between having a `make` of `pontiac` and a `make` of anything other than `plymouth`, `pontiac`, or `volkswagen`
- `make_volkswagen` is the difference between having a `make` of `volkswagen` and a `make` of anything other than `plymouth`, `pontiac`, or `volkswagen`
That "anything other than" category is no longer a useful reference category!
Second, you did not actually solve the statistical significance problem by dropping all of those columns. Now that those other columns are gone, `make_plymouth` and `make_pontiac` are no longer statistically significant. This is because even though there was a statistically significant difference between `plymouth` and `amc`, and between `pontiac` and `amc`, the difference between these categories and the "anything other than" category is not statistically significant.
A better strategy would go back to our data understanding and remove columns that we have a good reason to believe are no longer useful. In this case that would be the `origin` columns, since `make` is already providing that same information.
```python
X = data[["weight", "model year", "make"]]
X = pd.get_dummies(X, columns=["make"], drop_first=True)

model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
print(results.summary())
```

```
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.831
Model: OLS Adj. R-squared: 0.824
Method: Least Squares F-statistic: 115.6
Date: Thu, 12 May 2022 Prob (F-statistic): 5.53e-134
Time: 15:26:53 Log-Likelihood: -1012.2
No. Observations: 392 AIC: 2058.
Df Residuals: 375 BIC: 2126.
Df Model: 16
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
const -17.5873 3.952 -4.451 0.000 -25.357 -9.817
weight -0.0059 0.000 -23.581 0.000 -0.006 -0.005
model year 0.7460 0.049 15.349 0.000 0.650 0.842
make_buick 1.0983 1.022 1.074 0.283 -0.912 3.109
make_chevrolet 1.1703 0.794 1.474 0.141 -0.391 2.731
make_datsun 4.6792 0.965 4.848 0.000 2.781 6.577
make_dodge 1.6028 0.891 1.800 0.073 -0.148 3.354
make_ford 0.5955 0.791 0.753 0.452 -0.960 2.151
make_honda 4.4959 1.157 3.886 0.000 2.221 6.771
make_mazda 2.1794 1.175 1.855 0.064 -0.130 4.489
make_mercury 0.3127 1.176 0.266 0.791 -2.000 2.625
make_oldsmobile 2.3425 1.226 1.911 0.057 -0.068 4.753
make_other 2.4911 0.768 3.246 0.001 0.982 4.000
make_plymouth 2.3943 0.863 2.775 0.006 0.697 4.091
make_pontiac 2.9949 1.041 2.878 0.004 0.948 5.041
make_toyota 2.5190 0.931 2.705 0.007 0.688 4.350
make_volkswagen 4.3465 0.988 4.398 0.000 2.403 6.290
==============================================================================
Omnibus: 31.497 Durbin-Watson: 1.175
Prob(Omnibus): 0.000 Jarque-Bera (JB): 55.114
Skew: 0.506 Prob(JB): 1.08e-12
Kurtosis: 4.534 Cond. No. 7.42e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.42e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
```
Because we made our modeling decision based on data understanding and not on p-values alone, we actually see that more of our coefficients are significant than before.
Let's go ahead and say that this is our final model.
```python
from sklearn.metrics import mean_squared_error

mean_squared_error(y, results.predict(sm.add_constant(X)), squared=False)
```

```
3.2005653137923917
```
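As a quick aside on what that number means: RMSE is just the square root of the average squared residual. A toy check with made-up actual and predicted values:

```python
import numpy as np

actual = np.array([20.0, 25.0, 30.0])
predicted = np.array([22.0, 24.0, 27.0])

# Root mean squared error: sqrt of the mean squared residual
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 2))  # sqrt((4 + 1 + 9) / 3) = 2.16
```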
Here we'll walk through a narrative version of our final model, starting from the beginning. It is fairly typical that your actual modeling process will not match the narrative you describe at the end, since the final version will be cleaned up and won't include extraneous details about things you tried that didn't work. For example, unless you had a specific reason to tell stakeholders why `origin` wasn't part of the final model, your final narrative can just skip over `origin` entirely, as this one does.
Our final model included these features:
weight
model year
make
We performed data cleaning on make
to account for typos and alternative names, and also combined vehicles with fewer than 10 samples for their make into a single category called other
. Then we one-hot encoded make
, resulting in 14 dummy predictors.
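The binning and encoding steps described above can be sketched as follows (a minimal example using a hypothetical toy `make` series, not the real dataset):

```python
import pandas as pd

# Hypothetical toy "make" column, not the real dataset
s = pd.Series(["ford"] * 12 + ["honda"] * 11 + ["saab"] * 3 + ["fiat"] * 2)

# Bin makes with fewer than 10 samples into a single "other" category
counts = s.value_counts()
s_binned = s.where(s.map(counts) >= 10, other="other")

# One-hot encode; drop_first=True makes the first (alphabetical) category the reference
dummies = pd.get_dummies(s_binned, prefix="make", drop_first=True)
print(sorted(dummies.columns))
```

Here `ford` becomes the reference category, so each remaining dummy coefficient would be interpreted relative to Ford vehicles.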
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5), sharey=True)
ax1.set_ylabel("mpg")
data.plot.scatter(x="weight", y="mpg", ax=ax1)
data.plot.scatter(x="model year", y="mpg", ax=ax2);
fig, ax = plt.subplots(figsize=(12,5))
# Select "mpg" explicitly so non-numeric columns don't break the groupby mean
data.groupby("make")[["mpg"]].mean().sort_values(by="mpg").plot.bar(y="mpg", ax=ax)
ax.axhline(y=data["mpg"].mean(), label="mean", color="black", linestyle="--")
ax.legend();
These features were fed into an ordinary least-squares multiple regression model. This model:
- is statistically significant overall (F-statistic p-value 5.53e-134)
- explains about 82% of the variance in MPG (adjusted R-Squared 0.824)
- is off by about 3.2 MPG in an average prediction (RMSE ≈ 3.20)
Below are partial regression plots for all model features:
fig = plt.figure(figsize=(15,15))
sm.graphics.plot_partregress_grid(
results,
exog_idx=list(X.columns),
grid=(4,4),
fig=fig)
plt.tight_layout()
plt.show()
Below are all model coefficients and p-values:
results_df = pd.concat([results.params, results.pvalues], axis=1)
results_df.columns = ["coefficient", "p-value"]
results_df
coefficient | p-value | |
---|---|---|
const | -17.587307 | 1.130022e-05 |
weight | -0.005941 | 4.720990e-76 |
model year | 0.745954 | 1.321660e-41 |
make_buick | 1.098322 | 2.833603e-01 |
make_chevrolet | 1.170251 | 1.413107e-01 |
make_datsun | 4.679217 | 1.828514e-06 |
make_dodge | 1.602834 | 7.272445e-02 |
make_ford | 0.595459 | 4.521172e-01 |
make_honda | 4.495894 | 1.203801e-04 |
make_mazda | 2.179408 | 6.434006e-02 |
make_mercury | 0.312668 | 7.905208e-01 |
make_oldsmobile | 2.342526 | 5.678076e-02 |
make_other | 2.491078 | 1.277613e-03 |
make_plymouth | 2.394280 | 5.804084e-03 |
make_pontiac | 2.994904 | 4.237245e-03 |
make_toyota | 2.518955 | 7.148628e-03 |
make_volkswagen | 4.346503 | 1.426705e-05 |
And here are only the coefficients with p-values below 0.05:
results_df = results_df[results_df["p-value"] < 0.05].sort_values(by="coefficient")
results_df
coefficient | p-value | |
---|---|---|
const | -17.587307 | 1.130022e-05 |
weight | -0.005941 | 4.720990e-76 |
model year | 0.745954 | 1.321660e-41 |
make_plymouth | 2.394280 | 5.804084e-03 |
make_other | 2.491078 | 1.277613e-03 |
make_toyota | 2.518955 | 7.148628e-03 |
make_pontiac | 2.994904 | 4.237245e-03 |
make_volkswagen | 4.346503 | 1.426705e-05 |
make_honda | 4.495894 | 1.203801e-04 |
make_datsun | 4.679217 | 1.828514e-06 |
Writing this out:
- For a car with `weight` and `model year` of 0 (i.e. weighing 0 lbs and built in 1900), as well as a `make` of `amc`, we would expect an MPG of about -17.6
  - It's ok that this is not a valid MPG, since a car weighing 0 lbs is not valid either
- For each increase of 1 lb in weight, we see an associated decrease in MPG of about .006
- For each year newer of a model, we see an associated increase in MPG of about 0.75
- Breaking down the `make`:
  - For a Plymouth vehicle compared to an AMC vehicle, we see an associated increase in MPG of about 2.4
  - For a Toyota vehicle compared to an AMC vehicle, we see an associated increase in MPG of about 2.5
  - For a Pontiac vehicle compared to an AMC vehicle, we see an associated increase in MPG of about 3.0
  - For a Volkswagen vehicle compared to an AMC vehicle, we see an associated increase in MPG of about 4.3
  - For a Honda vehicle compared to an AMC vehicle, we see an associated increase in MPG of about 4.5
  - For a Datsun vehicle compared to an AMC vehicle, we see an associated increase in MPG of about 4.7
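To make these interpretations concrete, the reported coefficients can be combined by hand into an approximate prediction (a sketch using the rounded coefficient values above; a real prediction would use `results.predict`):

```python
# Approximate MPG prediction for a hypothetical 3000 lb Honda from model year 76,
# using the rounded coefficients reported above
intercept = -17.6        # reference: make of amc, weight 0, model year 0
coef_weight = -0.006     # change in MPG per lb
coef_model_year = 0.75   # change in MPG per model year
coef_honda = 4.5         # make_honda, relative to the reference category amc

predicted_mpg = intercept + coef_weight * 3000 + coef_model_year * 76 + coef_honda
print(round(predicted_mpg, 1))
```

Note that the `make_honda` coefficient is simply added on top of the baseline prediction; for an AMC vehicle (the reference category) no dummy term would be added at all.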
results_df.drop(["const", "weight", "model year", "make_other"]).plot.barh(y="coefficient");
Note that while we included `make` categories with higher p-values in the model, we did not report their coefficients. That is because we can't be confident that those coefficients aren't actually zero. Also, we did not report a coefficient for `make_other` because this is not a particularly meaningful thing to explain to a stakeholder.
We also aren't plotting the `make` values on the same graph as the `weight` and `model year` values because they are all measured in different units (lbs vs. years vs. vehicle make), so it doesn't really make sense to compare them. However, it might be interesting to a stakeholder to see just the `make` values compared against each other.
In this lesson you looked at several different multiple regression models, all built using one-hot encoded categorical features in addition to numeric features. You saw how the choice of a reference category impacts the interpretation of the coefficients, and also walked through several different examples of what to do and what not to do when you encounter a feature with many categories.