Comments (5)
Another nice feature of the R model.frame class is that it returns a set of informative attributes; for example, it returns an index of dropped rows (see $na.action below). Is there a similar feature in formulaic? I quickly glanced over all attributes of formulaic.model_matrix.ModelMatrix but did not find an equivalent attribute.
attributes(mf)
$names
[1] "Y" "X"
$terms
Y ~ X
attr(,"variables")
list(Y, X)
attr(,"factors")
X
Y 0
X 1
attr(,"term.labels")
[1] "X"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(Y, X)
attr(,"dataClasses")
Y X
"numeric" "numeric"
$row.names
[1] 2 3 4 5 6 7 8 9 10
$class
[1] "data.frame"
$na.action
1
1
attr(,"class")
[1] "omit"
from formulaic.
Hi @s3alfisc! Thanks for reaching out!
re: NaN dropping being inconsistent between the left- and right-hand sides, that's interesting. I cannot reproduce that behaviour. Which version of formulaic are you using?
re: metadata for which rows are being dropped, you're correct; formulaic doesn't propagate that information through to the ModelSpec (since it is not "specification", but rather data-specific state). When using pandas, however, the index is maintained, allowing you to determine which rows were omitted. Do you have use-cases where this would be useful?
Hi Matthew,
I am running version 0.5.2. Sorry for not reporting the package version, a rookie mistake. Is there any other information I could provide to help debug this?
A (potentially highly specialized) use case where metadata on dropped rows might be helpful:
I want to run cluster-robust inference as a post-estimation command after a regression model specified via Y ~ X, where some rows have been dropped due to missing values. The regression model only stores the 'cleaned' X and Y as used when fitting the model. The clustering variable is not included in the regression model, and therefore not in the design matrix X. To make the post-estimation inference work, I need to drop the rows that were dropped from X and Y from the clustering variable as well, and for that, an index of dropped rows would be handy. The alternative to metadata for dropped rows would be to force the categorical clustering variable to an integer type, add it to the model formula, run model_matrix('Y ~ X + cluster'), and fetch the NaN-less cluster variable from it.
A similar problem arises for regression models with high-dimensional fixed effects, where the fixed effects are projected out prior to running OLS on the residualized X and Y (e.g. as in the fixest R package). In this case, I want to keep the categorical fixed effect variable in a single column, and hence create the one-hot encoded X only for variables which are not "projected out". In a second step, I then delete rows with missing values from both X and the fixed effect variable(s), residualize, and run OLS.
There is an obvious workaround for both problems (even if the input is not a pandas.DataFrame): just set na_action = 'ignore' and build an index of all missing values in both X and Y myself. Here's an (admittedly rather convoluted) code example I hacked out over the weekend. =)
Thanks for your response, and please let me know if I can help debug this further! Best, Alex
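The manual workaround described above can be sketched as follows. This is not the code referred to in the comment; it is a simplified illustration that uses pandas only, computes the missing-value mask on the raw columns rather than on formulaic output, and uses made-up data and column names:

```python
import pandas as pd

nan = float("nan")

# Hypothetical data: row 1 is missing Y, row 2 is missing X.
df = pd.DataFrame(
    {
        "Y": [1.0, nan, 3.0, 4.0, 5.0],
        "X": [0.5, 1.5, nan, 3.5, 4.5],
        "cluster": ["a", "a", "b", "b", "c"],
    }
)

# Flag every row with a missing value in any variable used by the model.
model_vars = ["Y", "X"]
is_missing = df[model_vars].isna().any(axis=1)

# Index of dropped rows, comparable to R's na.action attribute.
dropped_index = df.index[is_missing]

# Apply the same mask everywhere so Y, X and the cluster variable stay aligned.
Y = df.loc[~is_missing, "Y"]
X = df.loc[~is_missing, ["X"]]
cluster = df.loc[~is_missing, "cluster"]
```

Because a single mask is applied to all three objects, the clustering variable never has to enter the formula or be coerced to an integer type.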
Goodness, sorry @s3alfisc ! Life has kept me busy and this went under the radar.
I'm cautious about adding too much extra information to the model spec (such as the indices of missing rows) because it is often the case that you want to serialize it for later use; and it is preferable if it doesn't scale with the size of the data. We could revisit that as necessary, of course.
The immediate solution that comes to mind is to use multi-part formulae, like: Y ~ X | cluster, which would result in a structured output of three model matrices: lhs=Y, rhs=(X, cluster). This keeps the distinction between your cluster variables, but also guarantees that the same rows are dropped across all model matrices. There's actually more you could do too, like:
from formulaic import Formula
Formula(lhs='Y', rhs='X', clusters='cluster')
This would result in three top-level model matrices, which you can extract by name or index. Also, if you are using pandas dataframes, the index is maintained from the input data, so you could use that to slice your data too.
Do any of these solve your use-case?
Hi again @s3alfisc ! I'm going to assume that the above does solve your use-cases, and close this one out. Feel free to reopen if you'd like to resume the conversation!