py-econometrics / pyfixest Goto Github PK
View Code? Open in Web Editor NEWFast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
Home Page: https://py-econometrics.github.io/pyfixest/pyfixest.html
License: MIT License
Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
Home Page: https://py-econometrics.github.io/pyfixest/pyfixest.html
License: MIT License
... as strange things happen when dependent variables have dtype object.
adj = False
does nothingCurrently there is no indication of what dependencies are required for the package. At the moment I'm repeatedly trying to load Fixest
and installing the missing packages it notices one by one.
E.g.
feols('y1 + y2 ~ x1 | species ', data = base, vcov = "hetero")
leads to
IndexError: index 1 is out of bounds for axis 1 with size 1
For options, see here:
I.e. for fml = "Y ~ i(X2, X1)". No problems for formulas with fixed effects.
tidy()
summary()
etable()
i()
)Example:
import pyfixest as pf
import numpy as np
from pyfixest.utils import get_data
fixest = pf.Fixest(data = data)
fixest.feols("Y~X1 | csw0(X2, X3)", vcov = {'CRV1':'id'})
fixest.summary()
# ###
#
# ---
# ###
#
# Dep. var.: Y
# Inference: {'CRV1': 'id'}
# Observations: 998
#
# Estimate Std. Error t value Pr(>|t|)
# Intercept 6.648203 0.220649 30.130262 0.00000
# X1 -0.141200 0.211081 -0.668937 0.50369
# ---
# ###
#
# Fixed effects: X2
# Dep. var.: Y
# Inference: {'CRV1': 'id'}
# Observations: 998
#
# Estimate Std. Error t value Pr(>|t|)
# X1 -0.142274 0.210556 -0.675707 0.499383
# ---
# ###
#
# Fixed effects: X2+X3
# Dep. var.: Y
# Inference: {'CRV1': 'id'}
# Observations: 998
#
# Estimate Std. Error t value Pr(>|t|)
# X1 -0.096317 0.204801 -0.470296 0.638247
fixest.vcov({'CRV3':'group_id'}).summary()
>>> fixest.vcov({'CRV3':'group_id'}).summary()
# ###
#
# ---
# ###
#
# Dep. var.: Y
# Inference: {'CRV1': 'id'}
# Observations: 998
#
# Estimate Std. Error t value Pr(>|t|)
# Intercept 6.648203 0.229614 28.953831 0.000000
# X1 -0.141200 0.207516 -0.680428 0.502745
# ---
# ###
#
# Fixed effects: X2
# Dep. var.: Y
# Inference: {'CRV1': 'id'}
# Observations: 998
#
# Estimate Std. Error t value Pr(>|t|)
# X1 -0.142274 0.20774 -0.684867 0.49999
# ---
# ###
#
# Fixed effects: X2+X3
# Dep. var.: Y
# Inference: {'CRV1': 'id'}
# Observations: 998
#
# Estimate Std. Error t value Pr(>|t|)
# X1 -0.096317 0.206282 -0.46692 0.644768
#
Was inspired to try using a datetime in a regression, given that statsmodels handles these incorrectly. Currently these do not appear to be supported in pyfixest
from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf
res = restaurant_inspections.load_pandas().data
res['DT'] = [pd.to_datetime(str(y)+'-01-01 00:00:00') for y in res['Year']]
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ DT', vcov = 'iid')
fixest.summary()
# UFuncTypeError: ufunc 'multiply' cannot use operands with types dtype('int32') and dtype('<M8[ns]')
Note that this also occurs without the 00:00:00, or if using a datetime.date()
instead of a Pandas datetime
. Compare to R:
library(fixest)
library(causaldata)
data(restaurant_inspections)
restaurant_inspections$Time = lubridate::ymd_hms(paste0(restaurant_inspections$Year, '-01-01 00:00:00'))
feols(Year~Time, data = restaurant_inspections)
# OLS estimation, Dep. Var.: Year
# Observations: 27,178
# Standard-errors: IID
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.970001e+03 3.323279e-05 59278828 < 2.2e-16 ***
# Time 3.170000e-08 2.580000e-14 1226887 < 2.2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 7.994e-4 Adj. R2: 1
That's probably enough tinkering for me today! Sorry to load up the Issue page.
fe1^fe2
E.g.
fit = feols(fml = 'Y ~ X1 + X2 | X3', vcov = {'CRV1':'group_id'}, data = data)
# invalid value encountered in sqrt
# np.sqrt(np.diagonal(self.vcov[x]))
Multicollinearity?
Some more toying around, and I'm getting different p-value calculations from the original fixest, even in cases where estimates and t-stats match:
With clustering, the t-stats match but p-values are different:
from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf
res = restaurant_inspections.load_pandas().data
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'Year'}))
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: inspection_score
#
# Estimate Std. Error t value Pr(>|t|)
# Intercept 93.627262 0.321288 291.412047 0.000000
# Weekend 2.096548 0.862544 2.430656 0.015072
library(fixest)
library(causaldata)
data(restaurant_inspections)
feols(inspection_score ~ Weekend , data = restaurant_inspections, vcov = ~Year)
# OLS estimation, Dep. Var.: inspection_score
# Observations: 27,178
# Standard-errors: Clustered (Year)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 93.62726 0.321288 291.41205 < 2.2e-16 ***
# WeekendTRUE 2.09655 0.862544 2.43066 0.029106 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 6.25408 Adj. R2: 8.241e-4
With IID and HC1, the t-stats and p-values differ by very small amounts
fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'iid')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year
#
# Estimate Std. Error t value Pr(>|t|)
# Intercept 2010.343815 0.036224 55497.059216 0.000000
# Weekend -0.862863 0.412097 -2.093833 0.036275
feols(Year~Weekend, data = restaurant_inspections)
OLS estimation, Dep. Var.: Year
Observations: 27,178
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2010.343815 0.036226 55495.01719 < 2.2e-16 ***
WeekendTRUE -0.862863 0.412112 -2.09376 0.036291 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 5.94874 Adj. R2: 1.245e-4
fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'HC1')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year
#
# Estimate Std. Error t value Pr(>|t|)
# Intercept 2010.343815 0.036181 55562.889684 0.000000
# Weekend -0.862863 0.471032 -1.831856 0.066973
feols(Year~Weekend, data = restaurant_inspections, vcov = 'hc1')
# OLS estimation, Dep. Var.: Year
# Observations: 27,178
# Standard-errors: Heteroskedasticity-robust
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2010.343815 0.036182 55561.86747 < 2.2e-16 ***
# WeekendTRUE -0.862863 0.471041 -1.83182 0.066989 .
Improve:
ropensci statistical standards as a reference.
Fairly straightforward:
CRV3-jackknife
, e.g. as function argument {'CRV3-jackknife':'group_id'}
Link to original issue: link
For some reason, signs of integers are swapped, i.e. for event studies with time-to-treatment (ttt) -27, ...., 0, ..., 21, the estimated coefs are labeled as -21, ..., 0, ..., 27.
Status quo: minor deviations due to small sample adjustments.
Required: implementation of ssc()
function with all fixed effect options.
After fixed effects are projected out, check if the resulting design matrix Xtilde is singular.
Required Steps:
summary()
and tidy()
methods. For tidy()
, simply add index "stage1" and "stage2". Split summary() into two parts. Check out what fixest does.Nice to have's (for now):
In Practice:
For common errors, raise user friendly error messages.
demean()
function that allows for NA values in either X or Ydemean()
(i.e. the handling of missings + multiple fixed effects)feols
front end against fixest::feols
Simply add one more for loop ;)
Instead of normal.
.feols()
See here.
... handle missing values - they need to be dropped
Motivation:
Textbooks with code:
C(..., levels = ...)
option.i()
option to interact with a categorical variable? The key advantage of the i()
method seems custom tooling around it, e.g. iplot
. But that could be handled with plotting method with decent string regex for coef names (i.e. for variables included in i()
). Maybe not 100% error prone?KeyError: "['X1:X2'] not in index"
. Happens because variable "X1:X2" is not contained in the input data. Likely requires even more reshuffling of the formula parsing - demeaning - model matrix pipeline.
E.g. currently it is not possible to run a model like Y ~ X1 + csw(log(X2), X3)
(but it is possible to run e.g. log(Y) ~ X1 + csw(X2, X3)
.
... when checking for NA values in cluster variable.
Should be fairly straightforward: for no clustering, run MNW's summclust algo, else do what sandwich
does. E.g. do along these lines:
if self.has_fixef == False:
# inverse hessian precomputed?
tXX = np.transpose(self.X) @ self.X
tXy = np.transpose(self.X) @ self.Y
# compute leave-one-out regression coefficients (aka clusterjacks')
for ixg, g in enumerate(clusters):
Xg = self.X[np.equal(ixg, group)]
Yg = self.Y[np.equal(ixg, group)]
tXgXg = np.transpose(Xg) @ Xg
# jackknife regression coefficient
beta_jack[ixg,:] = (
np.linalg.pinv(tXX - tXgXg) @ (tXy - np.transpose(Xg) @ Yg)
).flatten()
else:
for ixg, g in enumerate(clusters):
data = self.data[np.equal(ixg, group)]
model = Fixest(data)
model.feols(self.formula, vcov = "iid")
beta_jack[ixg,:] = model.beta_hat
vcov
is a required positional argument in feols
. Note this also means you can't specify the vcov using the .vcov
method without also specifying it in the feols
call as a positional argument. Suggest matching the defaults from the R package of vcov = 'iid'
with no fixed effects specified, or clustered at the level of the first fixed effect if fixed effects are specified., or overwriting that with whatever was set in .vcov
.help(fixest.feols)
suggests using dict("CRV1":"clustervar")
to specify a cluster variable but this is improper syntax.pip install causaldata
; note there are no missing values in this data):from causaldata import restaurant_inspections
import pandas as pd
import pyfixest as pf
import numpy as np
res = restaurant_inspections.load_pandas().data
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = 'iid')
# runs fine
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
# maybe it prefers numbers?
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'Year'}))
# IndexError: index 27134 is out of bounds for axis 0 with size 27076
# Or integers?
res['random_category'] = np.random.randint(0, 10, res.shape[0])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'random_category'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076
# or categoricals?
res['categorical'] = pd.Categorical(res['random_category'])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'categorical'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076
# Works fine without the FEs
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'categorical'}))
# although still not for business_name
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
# Maybe it can't handle single-row FEs alongside clusters?
res['counts'] = (res
.groupby('business_name')['business_name']
.transform('size'))
fixest = pf.Fixest(data = res.query('counts > 1'))
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# ValueError: matrix should have the same number of rows as fixed effect IDs.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.