Determine which input columns were used to generate output columns (formulaic, closed)

matthewwardrop commented on July 19, 2024

Determine which input columns were used to generate output columns

Comments (2)

matthewwardrop commented on July 19, 2024

Hi @eric-czech!

Thanks for reaching out! Unfortunately formulaic doesn't (yet?) contain this functionality.

As you may already be aware, model matrices output by formulaic are wrapped in a ModelMatrix proxy class, which has an attribute .model_spec. While there isn't a direct mapping of input dataframe to columns, there is a formula term to model-matrix-columns map:

X.model_spec.structure

[(1, [1], ['Intercept']),
(a, [a-], ['a[T.B]', 'a[T.C]']),
(b, [b], ['b']),
(a:b, [a-:b], ['a[T.B]:b', 'a[T.C]:b'])]

For simple cases like this you could reverse engineer the support of any particular model matrix column, but it really wouldn't be much better than parsing it out of the names; and when you start applying transforms on the columns, all bets are off.
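A minimal sketch of that reverse engineering, assuming the structure above has been copied into plain strings (the real entries are formulaic objects, so `structure` here is a hand-written stand-in and `column_support` is a hypothetical helper). It only handles untransformed terms, where splitting the term on `:` recovers the variable names:

```python
# Hand-written stand-in for X.model_spec.structure: (term, model-matrix columns).
# In formulaic the entries are objects rather than strings; this is an assumption
# made so the example runs without the library.
structure = [
    ("1", ["Intercept"]),
    ("a", ["a[T.B]", "a[T.C]"]),
    ("b", ["b"]),
    ("a:b", ["a[T.B]:b", "a[T.C]:b"]),
]


def column_support(structure):
    """Map each output column to the variables named in its term.

    Only valid for plain terms: a transform such as a function call would
    not split cleanly on ':' and is not handled here.
    """
    support = {}
    for term, columns in structure:
        # The intercept term "1" depends on no input variables.
        variables = [] if term == "1" else term.split(":")
        for column in columns:
            support[column] = variables
    return support


support = column_support(structure)
print(support["a[T.B]:b"])
```

As noted, this is barely better than parsing the column names directly, and it breaks as soon as a term contains a transform.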

I'm willing to look into adding this feature if there is need, but I'm curious as to why the reverse logic is valuable in your use case! [It's worth noting that this is not trivial, since formulaic doesn't know whether features are being populated from the namespace or from the dataframe; the best we could probably do is outputting a list of symbols, and then allowing you to do a set difference from the columns of your incoming dataframe].

eric-czech commented on July 19, 2024

> While there isn't a direct mapping of input dataframe to columns, there is a formula term to model-matrix-columns map

I see, thanks. I noticed I could get similar information out of Patsy but it still looked like some parsing was going to be inevitable.

> I'm willing to look into adding this feature if there is need, but I'm curious as to why the reverse logic is valuable in your use case

Awesome! A simple use case I have in mind would be to summarize the effects of multiple related features (e.g. one-hot encodings or polynomial basis functions) by grouping them together. In other words, if my original feature a above resulted in 3 binary features then I might want to sum the coefficients from a linear model or Shapley values across all of them as a combined effect. Another would be to simply associate features with logical groupings like the data source they were originally from, which would again be useful in downstream model interpretation. That's certainly all possible based on the string names of generated features with more fragile, analysis-specific code (my typical go-to).
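To make the first use case concrete, here is a rough sketch of summing per-column values into one combined effect per term. The term-to-columns map and the coefficient values are made up for illustration, and `combined_effects` is a hypothetical helper, not anything formulaic provides:

```python
# Hypothetical term -> output columns map, written by hand for this example.
term_columns = {
    "a": ["a[T.B]", "a[T.C]"],
    "b": ["b"],
    "a:b": ["a[T.B]:b", "a[T.C]:b"],
}

# Made-up coefficients (these could equally be mean absolute Shapley values).
coefficients = {
    "a[T.B]": 1.0,
    "a[T.C]": -0.25,
    "b": 2.0,
    "a[T.B]:b": 0.5,
    "a[T.C]:b": 0.25,
}


def combined_effects(term_columns, values):
    """Sum the per-column values belonging to each term."""
    return {
        term: sum(values[column] for column in columns)
        for term, columns in term_columns.items()
    }


effects = combined_effects(term_columns, coefficients)
print(effects)
```

With a reliable map from the library, the grouping step would no longer depend on parsing names like `a[T.B]`.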

There are a few other use cases like this that I would characterize more broadly as propagation of field metadata (e.g. perhaps via pandas.Series.attrs or xarray.DataArray.attrs). You could imagine that if I simply attached some metadata to my feature a, having that metadata replicated in all features produced from it would be very useful. I'm surprised there still doesn't seem to be a good solution for this in the scikit-learn ecosystem, but it seems like a natural extension of more extensive model matrix building tools like Patsy and formulaic.
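As a sketch of what I mean by metadata propagation, assuming a column-to-inputs map were available (both dictionaries below are invented for illustration, and `propagate_metadata` is a hypothetical helper):

```python
# Invented metadata attached to each input field.
field_metadata = {
    "a": {"source": "survey"},
    "b": {"source": "sensor"},
}

# Invented map from each generated column to the input fields behind it.
column_inputs = {
    "a[T.B]": ["a"],
    "a[T.C]": ["a"],
    "b": ["b"],
    "a[T.B]:b": ["a", "b"],
}


def propagate_metadata(column_inputs, field_metadata):
    """Copy each input field's metadata onto every column derived from it."""
    return {
        column: {
            name: field_metadata[name]
            for name in inputs
            if name in field_metadata  # skip inputs that carry no metadata
        }
        for column, inputs in column_inputs.items()
    }


column_metadata = propagate_metadata(column_inputs, field_metadata)
print(column_metadata["a[T.B]:b"])
```

An interaction column ends up carrying the metadata of every field it was built from, which is the behavior I would want for downstream interpretation.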

> the best we could probably do is outputting a list of symbols, and then allowing you to do a set difference from the columns of your incoming dataframe

Hmm, would that be a list of symbols associated with each individual output column? That sounds like it could work.
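If I understand the suggestion, the set logic on my side would look roughly like this. The symbol list is invented (formulaic does not currently expose one), and `split_symbols` is a hypothetical helper:

```python
def split_symbols(symbols, dataframe_columns):
    """Split symbols into those found in the dataframe and those that are not.

    Symbols missing from the dataframe would have to come from somewhere
    else, such as the calling namespace.
    """
    symbols = set(symbols)
    columns = set(dataframe_columns)
    from_dataframe = sorted(symbols & columns)
    from_elsewhere = sorted(symbols - columns)
    return from_dataframe, from_elsewhere


# Invented example: an output column whose term references a, b and a
# namespace variable called `scale`.
from_dataframe, from_elsewhere = split_symbols(
    symbols=["a", "b", "scale"],
    dataframe_columns=["a", "b", "c"],
)
print(from_dataframe, from_elsewhere)
```

One caveat with this approach: a name that exists both as a dataframe column and as a namespace variable would be attributed to the dataframe here, which may or may not match how the value was actually resolved.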

On that note, would doing model_matrix(..., df, context=None) ensure that the results only come from the dataframe?
