Comments (2)
Hi @eric-czech !
Thanks for reaching out! Unfortunately formulaic doesn't (yet?) contain this functionality.
As you may already be aware, model matrices output by formulaic are wrapped in a `ModelMatrix` proxy class, which has an attribute `.model_spec`. While there isn't a direct mapping of input dataframe to columns, there is a formula term to model-matrix-columns map:
X.model_spec.structure
[(1, [1], ['Intercept']),
(a, [a-], ['a[T.B]', 'a[T.C]']),
(b, [b], ['b']),
(a:b, [a-:b], ['a[T.B]:b', 'a[T.C]:b'])]
For simple cases like this you could reverse engineer the support of any particular model matrix column, but it really wouldn't be much better than parsing it out of the names; and when you start applying transforms on the columns, all bets are off.
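For the simple case, the reverse lookup can be sketched in a few lines. This is only an illustration: it assumes the structure has been copied into plain strings as `(term, factors, columns)` tuples, whereas the real entries are formulaic objects, and `columns_to_terms` is a hypothetical helper rather than part of the library.

```python
# Hypothetical plain-string copy of the structure printed above:
# one (term, factors, columns) tuple per formula term.
structure = [
    ("1", ["1"], ["Intercept"]),
    ("a", ["a-"], ["a[T.B]", "a[T.C]"]),
    ("b", ["b"], ["b"]),
    ("a:b", ["a-:b"], ["a[T.B]:b", "a[T.C]:b"]),
]


def columns_to_terms(structure):
    """Invert the term -> columns map into a column -> term map."""
    return {
        column: term
        for term, _factors, columns in structure
        for column in columns
    }


column_term = columns_to_terms(structure)
print(column_term["a[T.C]:b"])  # a:b
```

This only recovers the formula term behind each column, not the dataframe columns behind each term, which is the part that gets unreliable once transforms are involved.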
I'm willing to look into adding this feature if there is need, but I'm curious as to why the reverse logic is valuable in your use case! [It's worth noting that this is not trivial, since formulaic doesn't know whether features are being populated from the namespace or from the dataframe; the best we could probably do is outputting a list of symbols, and then allowing you to do a set difference from the columns of your incoming dataframe].
from formulaic.
> While there isn't a direct mapping of input dataframe to columns, there is a formula term to model-matrix-columns map
I see, thanks. I noticed I could get similar information out of Patsy but it still looked like some parsing was going to be inevitable.
> I'm willing to look into adding this feature if there is need, but I'm curious as to why the reverse logic is valuable in your use case
Awesome! A simple use case I have in mind would be to summarize the effects of multiple related features (e.g. one-hot encodings or polynomial basis functions) by grouping them together. In other words, if my original feature `a` above resulted in 3 binary features, then I might want to sum the coefficients from a linear model or Shapley values across all of them as a combined effect. Another would be to simply associate features with logical groupings like the data source they were originally from, which would again be useful in downstream model interpretation. That's certainly all possible based on the string names of generated features with more fragile, analysis-specific code (my typical go-to).
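As a rough sketch of the grouping idea, assuming a column-to-term map is already available: the map below is written out by hand to match the structure shown earlier in the thread, the coefficient values are made up purely for illustration, and `sum_by_term` is a hypothetical helper.

```python
from collections import defaultdict

# Column -> originating term, written by hand to match the structure
# shown earlier in the thread (plain strings, not formulaic objects).
column_term = {
    "Intercept": "1",
    "a[T.B]": "a",
    "a[T.C]": "a",
    "b": "b",
    "a[T.B]:b": "a:b",
    "a[T.C]:b": "a:b",
}

# Made-up coefficients, for illustration only (not from a fitted model).
coefficients = {
    "Intercept": 1.0,
    "a[T.B]": 0.5,
    "a[T.C]": -0.25,
    "b": 2.0,
    "a[T.B]:b": 0.125,
    "a[T.C]:b": 0.375,
}


def sum_by_term(values, column_term):
    """Sum per-column values over all columns that share a term."""
    totals = defaultdict(float)
    for column, value in values.items():
        totals[column_term[column]] += value
    return dict(totals)


combined = sum_by_term(coefficients, column_term)
print(combined["a"])    # 0.25
print(combined["a:b"])  # 0.5
```

The same helper would work for per-column Shapley values, since it only needs a mapping from column name to number.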
There are a few other use cases like this that I would characterize more broadly as propagation of field metadata (e.g. perhaps via `pandas.Series.attrs` or `xarray.DataArray.attrs`). You could imagine that if I simply attached some metadata to my feature `a`, having that metadata replicated in all features produced from it would be very useful. I'm surprised there still doesn't seem to be a good solution for this in the scikit-learn ecosystem, but it seems like a natural extension of more extensive model matrix building tools like Patsy and formulaic.
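To make the metadata idea concrete, here is a minimal sketch using plain dictionaries in place of `attrs`. It assumes each term is a `:`-joined list of untransformed feature names, which breaks down as soon as transforms are applied; the metadata values and the `propagate_attrs` helper are hypothetical.

```python
# Hypothetical metadata attached to the input features.
feature_attrs = {
    "a": {"source": "survey"},
    "b": {"source": "sensor"},
}

# Column -> originating term, written by hand to match the structure
# shown earlier in the thread.
column_term = {
    "Intercept": "1",
    "a[T.B]": "a",
    "a[T.C]": "a",
    "b": "b",
    "a[T.B]:b": "a:b",
    "a[T.C]:b": "a:b",
}


def propagate_attrs(column_term, feature_attrs):
    """Copy each input feature's metadata onto the columns derived from it."""
    propagated = {}
    for column, term in column_term.items():
        # Assumption: a term is a ":"-joined list of plain feature names.
        names = term.split(":")
        propagated[column] = {
            name: feature_attrs[name] for name in names if name in feature_attrs
        }
    return propagated


column_attrs = propagate_attrs(column_term, feature_attrs)
print(column_attrs["a[T.B]:b"])
```

Interaction columns end up with metadata from every feature they involve, while the intercept gets an empty entry.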
> the best we could probably do is outputting a list of symbols, and then allowing you to do a set difference from the columns of your incoming dataframe
Hmm, would that be a list of symbols associated with each individual output column? Sounds like that could work.
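If it were per output column, the set arithmetic might look something like the sketch below. Everything here is assumed for illustration: formulaic does not expose per-column symbol lists in this thread, the column names are invented, and `log` simply stands in for any name that would be resolved from the calling namespace rather than the dataframe.

```python
# Hypothetical per-column symbol sets (not an existing formulaic output).
column_symbols = {
    "a[T.B]": {"a"},
    "log(b)": {"log", "b"},
    "a[T.B]:log(b)": {"a", "log", "b"},
}

# Column names of the incoming dataframe (also hypothetical).
dataframe_columns = {"a", "b", "c"}


def split_symbols(symbols, dataframe_columns):
    """Separate symbols found in the dataframe from all other symbols."""
    from_data = symbols & dataframe_columns
    from_namespace = symbols - dataframe_columns
    return from_data, from_namespace


for column, symbols in column_symbols.items():
    from_data, from_namespace = split_symbols(symbols, dataframe_columns)
    print(column, sorted(from_data), sorted(from_namespace))
```

The intersection gives the input features behind each column, and the set difference gives whatever must have come from somewhere else.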
On that note, would doing `model_matrix(..., df, context=None)` ensure that the results only come from the dataframe?