py-econometrics / pyfixest Goto Github PK

View Code? Open in Web Editor NEW

126.0 5.0 30.0 15.02 MB

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax

Home Page: https://py-econometrics.github.io/pyfixest/pyfixest.html

License: MIT License

Python 77.46% R 0.83% Jupyter Notebook 21.47% Just 0.24%

pyfixest's Issues

Make sure / check that dependent variables are of numeric types

... as strange things happen when dependent variables have dtype object.

ssc() function does not work properly

i.e. adj = False does nothing

Add requirements.txt

Currently there is no indication of what dependencies are required for the package. At the moment I'm repeatedly trying to load Fixest and installing the missing packages it notices one by one.

Bug in multiple estimations with fixed effects

E.g.

feols('y1 + y2 ~ x1 | species ', data = base, vcov = "hetero")

leads to

IndexError: index 1 is out of bounds for axis 1 with size 1

CRV3 inference

currently blocked by default -> unblock
update with code from statsmodels PR

Different Design Matrix Combinations

For options, see here:

sw()
csw()
fsplit

Bug for i() formula without fixed effects

I.e. for fml = "Y ~ i(X2, X1)". No problems for formulas with fixed effects.

Pytest freezes for tests of interacted fixed effects via i()

Tests here
potential reason?

Additional methods

tidy()
summary()
etable()
iplot() (after implementing i())

Add support for wild (cluster) bootstrap inference via wildboottest

wildboottest repo
For now, only for regressions without fixed effects.

Pass NAs to vcov() to use for clustered standard errors

pass NAs to cluster variable
raise exception when there is an additional missing variable in the clustering variable (i.e. when the clustering variable is not part of the model)

vcov() method does not update summary info

Example:

import pyfixest as pf
import numpy as np
from pyfixest.utils import get_data

fixest = pf.Fixest(data = data)
fixest.feols("Y~X1 | csw0(X2, X3)", vcov = {'CRV1':'id'})
fixest.summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.220649 30.130262   0.00000
#        X1 -0.141200    0.211081 -0.668937   0.50369
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274    0.210556 -0.675707  0.499383
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.096317    0.204801 -0.470296  0.638247
fixest.vcov({'CRV3':'group_id'}).summary()
>>> fixest.vcov({'CRV3':'group_id'}).summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.229614 28.953831  0.000000
#        X1 -0.141200    0.207516 -0.680428  0.502745
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274     0.20774 -0.684867   0.49999
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error  t value  Pr(>|t|)
# X1 -0.096317    0.206282 -0.46692  0.644768
#

Support for datetime variables

Was inspired to try using a datetime in a regression, given that statsmodels handles these incorrectly. Currently these do not appear to be supported in pyfixest

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

res['DT'] = [pd.to_datetime(str(y)+'-01-01 00:00:00') for y in res['Year']]

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ DT', vcov = 'iid')
fixest.summary()
# UFuncTypeError: ufunc 'multiply' cannot use operands with types dtype('int32') and dtype('<M8[ns]')

Note that this also occurs without the 00:00:00, or if using a datetime.date() instead of a Pandas datetime. Compare to R:

library(fixest)
library(causaldata)

data(restaurant_inspections)

restaurant_inspections$Time = lubridate::ymd_hms(paste0(restaurant_inspections$Year, '-01-01 00:00:00'))

feols(Year~Time, data = restaurant_inspections)
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: IID 
# Estimate   Std. Error  t value  Pr(>|t|)    
# (Intercept) 1.970001e+03 3.323279e-05 59278828 < 2.2e-16 ***
#   Time        3.170000e-08 2.580000e-14  1226887 < 2.2e-16 ***
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 7.994e-4   Adj. R2: 1

That's probably enough tinkering for me today! Sorry to load up the Issue page.

Multiple Estimations and Formula Syntax

Bug: formula with more than one regressor and fixed effects

E.g.

fit = feols(fml = 'Y ~ X1 + X2 | X3', vcov = {'CRV1':'group_id'}, data = data)
# invalid value encountered in sqrt
#  np.sqrt(np.diagonal(self.vcov[x]))

Multicollinearity?

p-value and some t-stat calculations differ from R fixest

Some more toying around, and I'm getting different p-value calculations from the original fixest, even in cases where estimates and t-stats match:

With clustering, the t-stats match but p-values are different:

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'Year'}))
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: inspection_score 
# 
#            Estimate  Std. Error    t value  Pr(>|t|)
# Intercept 93.627262    0.321288 291.412047  0.000000
#   Weekend  2.096548    0.862544   2.430656  0.015072

library(fixest)
library(causaldata)

data(restaurant_inspections)

feols(inspection_score ~ Weekend , data = restaurant_inspections, vcov = ~Year)

# OLS estimation, Dep. Var.: inspection_score
# Observations: 27,178 
# Standard-errors: Clustered (Year) 
# Estimate Std. Error   t value  Pr(>|t|)    
# (Intercept) 93.62726   0.321288 291.41205 < 2.2e-16 ***
#   WeekendTRUE  2.09655   0.862544   2.43066  0.029106 *  
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 6.25408   Adj. R2: 8.241e-4

With IID and HC1, the t-stats and p-values differ by very small amounts

fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'iid')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036224 55497.059216  0.000000
# Weekend   -0.862863    0.412097    -2.093833  0.036275

feols(Year~Weekend, data = restaurant_inspections)
OLS estimation, Dep. Var.: Year
Observations: 27,178 
Standard-errors: IID 
               Estimate Std. Error     t value  Pr(>|t|)    
(Intercept) 2010.343815   0.036226 55495.01719 < 2.2e-16 ***
WeekendTRUE   -0.862863   0.412112    -2.09376  0.036291 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 5.94874   Adj. R2: 1.245e-4

fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'HC1')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036181 55562.889684  0.000000
# Weekend   -0.862863    0.471032    -1.831856  0.066973

feols(Year~Weekend, data = restaurant_inspections, vcov = 'hc1')
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: Heteroskedasticity-robust 
# Estimate Std. Error     t value  Pr(>|t|)    
# (Intercept) 2010.343815   0.036182 55561.86747 < 2.2e-16 ***
#   WeekendTRUE   -0.862863   0.471041    -1.83182  0.066989 .

General quality improvements

Improve:

the documentation
error messages
performance
test untested functionality

ropensci statistical standards as a reference.

Enable cluster jackknife inference

Fairly straightforward:

simply add clustering variant CRV3-jackknife, e.g. as function argument {'CRV3-jackknife':'group_id'}
uncomment code here
add tests, done!

Make sure to drop singleton fixed effects from the clustering matrix

Link to original issue: link

replace patsy dependency byformulaic

Tests

compare against fixest
internal benchmarks

Bug in i() method

For some reason, signs of integers are swapped, i.e. for event studies with time-to-treatment (ttt) -27, ...., 0, ..., 21, the estimated coefs are labeled as -21, ..., 0, ..., 27.

Match fixest standard errors exactly

Status quo: minor deviations due to small sample adjustments.

Required: implementation of ssc() function with all fixed effect options.

Check for mulicollinearity

After fixed effects are projected out, check if the resulting design matrix Xtilde is singular.

Improve performance

See here for faster solving of the least squares problem.
faster demeaning algo via numba / c++
clean up code, drop redundancies, etc
for a given column in the design matrix, only project out fe's when not already done (for multiple estimation)

IV support

Required Steps:

allow three-part formulas
implement IV estimation & inference. Start with just-identified models.
update summary() and tidy() methods. For tidy(), simply add index "stage1" and "stage2". Split summary() into two parts. Check out what fixest does.

Nice to have's (for now):

common IV diagnostics
AR confidence intervals

In Practice:

implement IV estimator (Z'X)^(-1) Z'Y.
always create X and Z. For OLS, set X = Z
pass everything through, done (at least for models with only one endog variable)
after this is implemented, allow for over identified models. This requires implementation of the 2SLS estimator + updates to inference procedures

Singleton fixed effects

Add message when data is dropped due to singleton fixed effects
add tests with singleton fixed effects

Link to original issue

Add support for two-way clustering

Simply apply vcov() cluster three times: vcov_a + vcov_b - vcov_ab.
For ssc, see here.

User friendly error handling

For common errors, raise user friendly error messages.

Custom fixed effects demeaning function

Write a custom demean() function that allows for NA values in either X or Y
further, allow for weights (for WLS)
implement dropping of singletons
note that there are a few errors in the current version of demean() (i.e. the handling of missings + multiple fixed effects)

Unit Tests

test feols front end against fixest::feols
test parsing of formula syntax
add continuous integration
add even more tests

Multiple estimations: add 'split' option

Simply add one more for loop ;)

P-values for clustered errors based on t(G-1) distribution

Instead of normal.

Link to original issue

Mimic fixest vcov defaults

drop requirement to set vcov argument in .feols()
for no specified vcov without fixed effects: iid inference
for no specified vcov with fixed effects: CRV1 clustered at the first fixed effects level

See here.

Handle NA values gracefully

... handle missing values - they need to be dropped

Reproduce applied econometrics textbook examples in notebooks

Motivation:

uncovers potentially needed functionality
makes sure that pyfixest actually runs for "real world" problems and not simply my synthetic test datasets

Textbooks with code:

the effect (data in the causaldata package)
the mixtape (data also included in the causaldata package)
Data Analysis Business & Econ (github repo)

Handling of Categorical Variables

Build around formulaic's C(..., levels = ...) option.
Need to pass levels information to formula.
Rename levels to "ref" so that users can only provide a reference str instead of a full list.
Is there a need for a dedicated i() option to interact with a categorical variable? The key advantage of the i() method seems custom tooling around it, e.g. iplot. But that could be handled with plotting method with decent string regex for coef names (i.e. for variables included in i()). Maybe not 100% error prone?

HC3 and CRV3 do not seem to match

investigate and fix
tests are already added

KeyError: "['X1:X2'] not in index". Happens because variable "X1:X2" is not contained in the input data. Likely requires even more reshuffling of the formula parsing - demeaning - model matrix pipeline.

Documentation

Fixest
feols
...
Prepare mkdocs.

start with copying fwildclusterboot::boot_ssc()
move on to all features of fixest::ssc()

Link to original issue

Implement CRV3 for arbitrary fixed effects

Should be fairly straightforward: for no clustering, run MNW's summclust algo, else do what sandwich does. E.g. do along these lines:

                if self.has_fixef == False:
                    # inverse hessian precomputed?
                    tXX = np.transpose(self.X) @ self.X
                    tXy = np.transpose(self.X) @ self.Y

                    # compute leave-one-out regression coefficients (aka clusterjacks')
                    for ixg, g in enumerate(clusters):

                        Xg = self.X[np.equal(ixg, group)]
                        Yg = self.Y[np.equal(ixg, group)]
                        tXgXg = np.transpose(Xg) @ Xg
                        # jackknife regression coefficient
                        beta_jack[ixg,:] = (
                            np.linalg.pinv(tXX - tXgXg) @ (tXy - np.transpose(Xg) @ Yg)
                        ).flatten()

                else:

                    for ixg, g in enumerate(clusters):
                        data = self.data[np.equal(ixg, group)]
                        model = Fixest(data)
                        model.feols(self.formula, vcov = "iid")
                        beta_jack[ixg,:] = model.beta_hat

Issues with specifying vcov

Currently, vcov is a required positional argument in feols. Note this also means you can't specify the vcov using the .vcov method without also specifying it in the feols call as a positional argument. Suggest matching the defaults from the R package of vcov = 'iid' with no fixed effects specified, or clustered at the level of the first fixed effect if fixed effects are specified., or overwriting that with whatever was set in .vcov.
help(fixest.feols) suggests using dict("CRV1":"clustervar") to specify a cluster variable but this is improper syntax.
In my example data I'm having trouble setting any sort of clustering variable alongside a set of fixed effects (pip install causaldata; note there are no missing values in this data):

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest as pf
import numpy as np

res = restaurant_inspections.load_pandas().data
fixest = pf.Fixest(data = res)

fixest.feols('inspection_score ~ Weekend | business_name', vcov = 'iid')
# runs fine

fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

# maybe it prefers numbers?
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'Year'}))
# IndexError: index 27134 is out of bounds for axis 0 with size 27076

# Or integers?
res['random_category'] = np.random.randint(0, 10, res.shape[0])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'random_category'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# or categoricals?
res['categorical'] = pd.Categorical(res['random_category'])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'categorical'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# Works fine without the FEs
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'categorical'}))

# although still not for business_name
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''


# Maybe it can't handle single-row FEs alongside clusters?
res['counts'] = (res
		   .groupby('business_name')['business_name']
		   .transform('size'))
fixest = pf.Fixest(data = res.query('counts > 1'))
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# ValueError: matrix should have the same number of rows as fixed effect IDs.

py-econometrics / pyfixest Goto Github PK

pyfixest's Issues

Recommend Projects

Recommend Topics

Recommend Org