Hi Matthew - thanks for making this super useful package available!

drop both columns in dependent variable and design matrix when missings occur about formulaic HOT 5 CLOSED

s3alfisc commented on June 22, 2024

drop both columns in dependent variable and design matrix when missings occur

from formulaic.

Comments (5)

s3alfisc commented on June 22, 2024

Another nice feature of the R model.frame class is that it returns a set of informative attributes. For example, it returns an index of the rows dropped due to missing values (the na.action attribute in the output below). Is there a similar feature for formulaic? I quickly glanced over all attributes of formulaic.model_matrix.ModelMatrix but did not find an equivalent attribute.

attributes(mf)
$names
[1] "Y" "X"

$terms
Y ~ X
attr(,"variables")
list(Y, X)
attr(,"factors")
  X
Y 0
X 1
attr(,"term.labels")
[1] "X"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(Y, X)
attr(,"dataClasses")
        Y         X 
"numeric" "numeric" 

$row.names
[1]  2  3  4  5  6  7  8  9 10

$class
[1] "data.frame"

$na.action
1 
1 
attr(,"class")
[1] "omit"

matthewwardrop commented on June 22, 2024

Hi @s3alfisc! Thanks for reaching out!

re: NaN dropping being inconsistent between the left- and right-hand sides, that's interesting. I cannot reproduce that behaviour. Which version of formulaic are you using?

re: metadata for which rows are being dropped, you're correct; formulaic doesn't propagate that information through to the ModelSpec (since it was not "specification", but rather data-specific state). When using pandas, however, the index is maintained, allowing you to determine which rows were omitted. Do you have use-cases where this would be useful?

s3alfisc commented on June 22, 2024

Hi Matthew,

I am running version 0.5.2. Sorry for not reporting the package version, a rookie mistake. Is there any other information I could provide to help debug this?

A (potentially highly specialized) use case where metadata on dropped rows might be helpful:

I want to run cluster-robust inference as a post-estimation command after a regression model specified via Y ~ X, where some rows have been dropped due to missing values. The regression model only stores the 'cleaned' X and Y as used when fitting the model. The clustering variable is not part of the regression model, and is therefore not included in the design matrix X. To make the post-estimation inference work, I need to drop the same rows from the clustering variable that were dropped from X and Y, and for that an index of dropped rows would be handy. The alternative to metadata on dropped rows would be to coerce the categorical clustering variable to an integer type, add it to the model formula, run model_matrix('Y ~ X + cluster'), and fetch the NaN-free cluster variable from the result.

A similar problem arises for regression models with high-dimensional fixed effects, where the fixed effects are projected out prior to running OLS on the residualized X and Y (e.g. as in the fixest R package). In this case, I want to keep each categorical fixed effect variable in a single column, and hence create the one-hot encoded X only for variables which are not "projected out". In a second step, I then delete rows with missing values from both X and the fixed effect variable(s), residualize, and run OLS.

There is an obvious workaround for both problems (even if the input is not a pandas.DataFrame): just set na_action = 'ignore' and build an index of all missing values in both X and Y myself. Here's an (admittedly rather convoluted) code example I hacked out over the weekend. =)

Thanks for your response, and please let me know if I can help debug this further! Best, Alex

matthewwardrop commented on June 22, 2024

Goodness, sorry @s3alfisc! Life has kept me busy and this slipped under the radar.

I'm cautious about adding too much extra information to the model spec (such as the indices of missing rows) because it is often the case that you want to serialize it for later use; and it is preferable if it doesn't scale with the size of the data. We could revisit that as necessary, of course.

The immediate solution that comes to mind is to use multi-part formulae, like Y ~ X | cluster, which would result in a structured output of three model matrices: lhs=Y, rhs=(X, cluster). This keeps the distinction between your cluster variables, but also guarantees that the same rows are dropped across all model matrices. There's actually more you could do too, like:

from formulaic import Formula

Formula(lhs='Y', rhs='X', clusters='cluster')

This would result in three top-level model matrices, which you can extract by name or index. Also, if you are using pandas dataframes, the index is maintained from the input data, so you could use that to slice your data too.

Do any of these solve your use-case?

matthewwardrop commented on June 22, 2024

Hi again @s3alfisc! I'm going to assume that the above does solve your use-cases, and close this one out. Feel free to reopen if you'd like to resume the conversation!
