
Comments (5)

s3alfisc commented on June 22, 2024

Another nice feature of R's model.frame class is that it returns a set of informative attributes - for example, the na.action attribute records an index of the dropped rows. Is there a similar feature in formulaic? I quickly glanced over all attributes of formulaic.model_matrix.ModelMatrix but did not find an equivalent attribute.

attributes(mf)
$names
[1] "Y" "X"

$terms
Y ~ X
attr(,"variables")
list(Y, X)
attr(,"factors")
  X
Y 0
X 1
attr(,"term.labels")
[1] "X"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(Y, X)
attr(,"dataClasses")
        Y         X 
"numeric" "numeric" 

$row.names
[1]  2  3  4  5  6  7  8  9 10

$class
[1] "data.frame"

$na.action
1 
1 
attr(,"class")
[1] "omit"

from formulaic.

matthewwardrop commented on June 22, 2024

Hi @s3alfisc! Thanks for reaching out!

re: nan dropping being inconsistent between the left- and right-hand sides, that's interesting. I cannot reproduce that behaviour. Which version of formulaic are you using?

re: metadata for which rows are being dropped, you're correct; formulaic doesn't propagate that information through to the ModelSpec (since it was not "specification", but rather data-specific state). When using pandas, however, the index is maintained, allowing you to determine which rows were omitted. Do you have use-cases where this would be useful?
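The point about the pandas index can be illustrated with a minimal sketch (plain pandas only; dropna stands in for model_matrix's default NaN-dropping behaviour, and the data is made up):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in X (row label 0).
df = pd.DataFrame({"Y": [1.0, 2.0, 3.0], "X": [np.nan, 4.0, 5.0]})

# Stand-in for materializing model_matrix("Y ~ X", df) with the default
# NaN handling: rows containing any NaN are removed, but the pandas
# index of the surviving rows is preserved.
clean = df.dropna()

# The omitted rows can therefore be recovered by comparing indices,
# without formulaic storing any extra metadata.
dropped = df.index.difference(clean.index)
print(list(dropped))  # -> [0]
```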


s3alfisc commented on June 22, 2024

Hi Matthew,

I am running version 0.5.2. Sorry for not reporting the package version, a rookie mistake. Is there any other information I could provide to help debug this?

A (potentially highly specialized) use case where metadata on dropped rows might be helpful:

I want to run cluster-robust inference as a post-estimation step after fitting a regression model, specified via Y ~ X, where some rows have been dropped due to missing values. The regression model only stores the 'cleaned' X and Y as used when fitting the model. The clustering variable is not included in the regression model, and therefore not included in the design matrix X. In order to make the post-estimation inference work, I need to drop the same rows from the clustering variable that were dropped from X and Y, and for that, an index of dropped rows might be handy. The alternative to metadata for dropped rows would be to force the categorical clustering variable to an integer type, add it to the model formula, run model_matrix('Y ~ X + cluster'), and fetch the NaN-free cluster variable from the result.

A similar problem arises for regression models with high-dimensional fixed effects, where the fixed effects are projected out prior to running OLS on the residualized X and Y (e.g. as in the fixest R package). In this case, I want to keep the categorical fixed-effect variable in a single column, and hence create the one-hot encoded X only for variables that are not "projected out". In a second step, I then delete rows with missing values from both X and the fixed-effect variable(s), residualize, and run OLS.

There is an obvious workaround for both problems (even if the input is not a pandas.DataFrame): just set na_action = 'ignore' and compute an index of all missing values in both X and Y myself. Here's an (admittedly rather convoluted) code example I hacked out over the weekend. =)
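The na_action = 'ignore' workaround can be sketched roughly like this (plain pandas stands in for the materialized matrices; the data and column names are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data: outcome Y, regressor X, and a clustering variable
# that is *not* part of the regression formula.
df = pd.DataFrame({
    "Y": [1.0, np.nan, 3.0, 4.0],
    "X": [2.0, 5.0, np.nan, 7.0],
    "cluster": ["a", "a", "b", "b"],
})

# With na_action="ignore", missing values survive into the matrices,
# so one joint mask over Y and X can be applied to Y, X, and the
# cluster variable alike, keeping all three aligned.
keep = df[["Y", "X"]].notna().all(axis=1)
Y, X, cluster = df.loc[keep, "Y"], df.loc[keep, "X"], df.loc[keep, "cluster"]
print(list(cluster))  # -> ['a', 'b']
```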

Thanks for your response, and please let me know if I can help debug this further! Best, Alex


matthewwardrop commented on June 22, 2024

Goodness, sorry @s3alfisc ! Life has kept me busy and this went under the radar.

I'm cautious about adding too much extra information to the model spec (such as the indices of missing rows), because it is often the case that you want to serialize it for later use, and it is preferable if it doesn't scale with the size of the data. We could revisit that as necessary, of course.

The immediate solution that comes to mind is to use multi-part formulae, like:

Y ~ X | cluster, which would result in a structured output of three model matrices: lhs=Y, rhs=(X, cluster). This keeps the distinction between your cluster variables and your regressors, but also guarantees that the same rows are dropped across all model matrices. There's actually more you could do too, like:

from formulaic import Formula

Formula(lhs='Y', rhs='X', clusters='cluster')

This would result in three top-level model matrices, which you can extract by name or index. Also, if you are using pandas dataframes, the index is maintained from the input data, so you could use that to slice your data too.
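The guarantee that the same rows are dropped across all parts can be illustrated with plain pandas (a stand-in for the structured materialization; the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Y": [1.0, 2.0, np.nan, 4.0],
    "X": [5.0, np.nan, 7.0, 8.0],
    "cluster": ["a", "a", "b", "b"],
})

# Materializing all parts from one formula means a single joint NaN mask
# is applied, so every resulting matrix shares the same row index.
keep = df[["Y", "X", "cluster"]].notna().all(axis=1)
lhs = df.loc[keep, ["Y"]]
rhs = df.loc[keep, ["X"]]
clusters = df.loc[keep, ["cluster"]]

assert lhs.index.equals(rhs.index) and rhs.index.equals(clusters.index)
print(list(lhs.index))  # rows 0 and 3 survive -> [0, 3]
```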

Do any of these solve your use-case?


matthewwardrop commented on June 22, 2024

Hi again @s3alfisc ! I'm going to assume that the above does solve your use-cases, and close this one out. Feel free to reopen if you'd like to resume the conversation!

