Comments (2)

matthewwardrop commented on July 19, 2024

Hi @eric-czech !

Thanks for reaching out! Unfortunately, formulaic doesn't (yet?) contain this functionality.

As you may already be aware, model matrices output by formulaic are wrapped in a ModelMatrix proxy class, which has an attribute .model_spec. While there isn't a direct mapping from input dataframe columns to model matrix columns, there is a map from formula terms to model matrix columns:

X.model_spec.structure

[(1, [1], ['Intercept']),
(a, [a-], ['a[T.B]', 'a[T.C]']),
(b, [b], ['b']),
(a:b, [a-:b], ['a[T.B]:b', 'a[T.C]:b'])]

For simple cases like this you could reverse engineer the support of any particular model matrix column, but it really wouldn't be much better than parsing it out of the names; and when you start applying transforms on the columns, all bets are off.
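For the simple case above, the reverse lookup can be sketched in plain Python. This is only an illustration under assumptions: the `structure` list is hand-copied from the output shown earlier with every entry written as a string (the real entries are formulaic objects, not strings), and `columns_to_terms` is a hypothetical helper name, not part of formulaic.

```python
# Hand-copied stand-in for the X.model_spec.structure output shown above.
# Assumption: terms and factors are written as plain strings for illustration.
structure = [
    ("1", ["1"], ["Intercept"]),
    ("a", ["a-"], ["a[T.B]", "a[T.C]"]),
    ("b", ["b"], ["b"]),
    ("a:b", ["a-:b"], ["a[T.B]:b", "a[T.C]:b"]),
]


def columns_to_terms(structure):
    """Invert the term -> columns map into a column -> term map."""
    return {
        column: term
        for term, _factors, columns in structure
        for column in columns
    }


column_term = columns_to_terms(structure)
print(column_term["a[T.C]:b"])  # a:b
```

This only recovers the formula term behind each column. Getting from a term such as `a:b` back to the dataframe columns `a` and `b` still requires splitting the term, which is where transforms make things unreliable.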

I'm willing to look into adding this feature if there is a need, but I'm curious as to why the reverse logic is valuable in your use case! [It's worth noting that this is not trivial, since formulaic doesn't know whether features are being populated from the namespace or from the dataframe; the best we could probably do is output a list of symbols, and then let you take a set difference against the columns of your incoming dataframe.]
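The set-difference idea can be sketched as follows. Everything here is hypothetical: formulaic did not expose such a symbol list at the time of this comment, so `formula_symbols` is written by hand for an imagined formula like `a + b + scale(c)`, and the column names are made up.

```python
# Hypothetical symbols referenced by a formula such as "a + b + scale(c)".
formula_symbols = {"a", "b", "c", "scale"}

# Column names of the (made-up) incoming dataframe.
dataframe_columns = {"a", "b", "c", "d"}

# Symbols that match a dataframe column.
from_dataframe = sorted(formula_symbols & dataframe_columns)

# Symbols that must have come from somewhere else, e.g. the namespace.
not_in_dataframe = sorted(formula_symbols - dataframe_columns)

print(from_dataframe)    # ['a', 'b', 'c']
print(not_in_dataframe)  # ['scale']
```

A name that exists both in the namespace and in the dataframe would be counted as a dataframe column by this sketch, so it is a heuristic rather than an exact answer.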

eric-czech commented on July 19, 2024

> While there isn't a direct mapping of input dataframe to columns, there is a formula term to model-matrix-columns map

I see, thanks. I noticed I could get similar information out of Patsy but it still looked like some parsing was going to be inevitable.

> I'm willing to look into adding this feature if there is need, but I'm curious as to why the reverse logic is valuable in your use case

Awesome! A simple use case I have in mind would be to summarize the effects of multiple related features (e.g. one-hot encodings or polynomial basis functions) by grouping them together. In other words, if my original feature a above resulted in 3 binary features then I might want to sum the coefficients from a linear model or Shapley values across all of them as a combined effect. Another would be to simply associate features with logical groupings like the data source they were originally from, which would again be useful in downstream model interpretation. That's certainly all possible based on the string names of generated features with more fragile, analysis-specific code (my typical go-to).
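A minimal sketch of that grouping, assuming a term-to-columns structure like the one quoted earlier in the thread. The `structure` list is hand-copied with strings standing in for formulaic's objects, the coefficient values are invented purely for illustration, and `sum_by_term` is a hypothetical helper name.

```python
# Hand-copied stand-in for X.model_spec.structure (strings are an assumption).
structure = [
    ("1", ["1"], ["Intercept"]),
    ("a", ["a-"], ["a[T.B]", "a[T.C]"]),
    ("b", ["b"], ["b"]),
    ("a:b", ["a-:b"], ["a[T.B]:b", "a[T.C]:b"]),
]

# Invented coefficients keyed by model matrix column name (illustrative only).
coefficients = {
    "Intercept": 1.0,
    "a[T.B]": 0.5,
    "a[T.C]": 1.5,
    "b": 2.0,
    "a[T.B]:b": -0.25,
    "a[T.C]:b": 0.75,
}


def sum_by_term(structure, values):
    """Sum per-column values over all columns generated by each term."""
    return {
        term: sum(values[column] for column in columns)
        for term, _factors, columns in structure
    }


combined = sum_by_term(structure, coefficients)
print(combined)  # {'1': 1.0, 'a': 2.0, 'b': 2.0, 'a:b': 0.5}
```

The same helper would work for per-column Shapley values, since it only relies on the column names.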

There are a few other use cases like this that I would characterize more broadly as propagation of field metadata (e.g. perhaps via pandas.Series.attrs or xarray.DataArray.attrs). You could imagine that if I simply attached some metadata to my feature a, having that metadata replicated in all features produced from it would be very useful. I'm surprised there still doesn't seem to be a good solution for this in the scikit-learn ecosystem, but it seems like a natural extension of more extensive model matrix building tools like Patsy and formulaic.

> the best we could probably do is outputting a list of symbols, and then allowing you to do a set difference from the columns of your incoming dataframe

Hmm, would that be a list of symbols associated with each individual output column? Sounds like that could work.

On that note, would doing model_matrix(..., df, context=None) ensure that the results only come from the dataframe?
