
Comments (11)

matthewwardrop commented on July 19, 2024

@TELSER1 This has been fixed and pushed out in v0.5.0.

from formulaic.

frederik-plum-hauschultz commented on July 19, 2024

Thank you for this! I was btw drawn to this package not so much due to performance (which seems to be 10x faster than patsy on my setup) but the fact that it can be pickled.


matthewwardrop commented on July 19, 2024

Hi @petrhrobar ,

Yes... this is easily done with the current state of formulaic.

You can just do:

import pandas
from formulaic import Formula, model_matrix

df = pandas.DataFrame({
    'y': [0,1,2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

trans = Formula('y ~ x + z')

mm1 = trans.get_model_matrix(df)

df2 = pandas.DataFrame({
    'y': [3, 3, 3],
    'x': ['A', 'B', 'B'],
    'z': [0.3, 0.1, 0.222222222],
})

mm2 = mm1.model_spec.get_model_matrix(df2)

Or, using the sugar method model_matrix in 0.3.4+:

mm2 = model_matrix(mm1, df2)

Hope that helps. I'll leave this open for documentation purposes until the docsite is updated.


petrhrobar commented on July 19, 2024

Thanks, Matthew!

I was actually able to figure it out myself yesterday from a previous issue about this topic. So I guess it kind of boils down to the documentation :/.

If I may make a recommendation, here is how I want to use this, as an sklearn component:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from formulaic import Formula, model_matrix

class FormulaicTransformer(TransformerMixin, BaseEstimator):

    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y=None):
        """Fit on X and keep the resulting model spec for reuse."""
        self._trans = model_matrix(self.formula, X).model_spec.rhs
        return self

    def transform(self, X, y=None):
        """Build the model matrix for X using the spec stored by fit()."""
        X_ = self._trans.get_model_matrix(X)
        return X_


pipe = Pipeline([
    ("formula", FormulaicTransformer("bs(yday, df=12) + wday + num_date")),
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])

Since this persists the design info and can be pickled, it may be used as a proper sklearn component!
This is a badass feature!


matthewwardrop commented on July 19, 2024

Nice! And yes... documentation will come... eventually!

This is a really cool use of Formulaic :). Maybe something like this makes sense to bring into formulaic itself at some point; or perhaps even better, upstream into sklearn.

When used in libraries, though, I do recommend using Formula(...).get_model_matrix(...) since that way the compute context is explicitly established. When using model_matrix the default behaviour is to make the entire locals() and globals() context available to use in formulae. For local use, that's fine... for libraries, not so much.


frederik-plum-hauschultz commented on July 19, 2024

I have been attempting to use Formulaic and ran into the same issue. The above example does not seem to work anymore. Is there a different way of doing this now? Thanks in advance!


matthewwardrop commented on July 19, 2024

Hi @frederik-plum-hauschultz !

Since a while back (not sure if it was the case when I wrote this or not), the output of get_model_matrix() is a Structured instance that reflects the structure of the formula.

For example:

>>> import pandas
>>> from formulaic import Formula

>>> df = pandas.DataFrame({
...     'y': [0,1,2],
...     'x': ['A', 'B', 'C'],
...     'z': [0.3, 0.1, 0.2],
... })

>>> trans = Formula('y ~ x + z')

>>> mm1 = trans.get_model_matrix(df)
>>> mm1
.lhs:
       y
    0  0
    1  1
    2  2
.rhs:
       Intercept  x[T.B]  x[T.C]    z
    0        1.0       0       0  0.3
    1        1.0       1       0  0.1
    2        1.0       0       1  0.2

>>> mm1.model_spec
.lhs:
    <formulaic.model_spec.ModelSpec object at 0x7f0f43f6df10>
.rhs:
    <formulaic.model_spec.ModelSpec object at 0x7f0f43f6df40>

It isn't possible in 0.3.x to call methods of the nested structure directly. I might add that in 0.4.x, but am not yet 100% convinced it is a good idea (maybe about 95% at the moment, leaning toward doing it, in which case it will appear in 0.4.0 shortly; feel free to nudge me if you like the idea).

In the meantime you can do:

df2 = pandas.DataFrame({
    'y': [3, 3, 3],
    'x': ['A', 'B', 'B'],
    'z': [0.3, 0.1, 0.222222222],
})

mm2 = mm1.model_spec.rhs.get_model_matrix(df2)

or, if you want both the lhs and rhs bits done in one step:

>>> mm1.model_spec._map(lambda spec: spec.get_model_matrix(df2))
.lhs:
       y
    0  3
    1  3
    2  3
.rhs:
       Intercept  x[T.B]  x[T.C]         z
    0        1.0       0       0  0.300000
    1        1.0       1       0  0.100000
    2        1.0       1       0  0.222222

Hope that helps.


TELSER1 commented on July 19, 2024

How would this work if I wanted a sparse output?

If I set
mm1 = trans.get_model_matrix(df, output='sparse')

I get back the expected sparse matrices, and the model specs are

.lhs:
    ModelSpec(formula=y, materializer='pandas', ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='sparse', structure=[EncodedTermStructure(term=y, scoped_terms=[y], columns=['y'])], transform_state={}, encoder_state={'y': (<Kind.NUMERICAL: 'numerical'>, {})})
.rhs:
    ModelSpec(formula=1 + x + z, materializer='pandas', ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='sparse', structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=x, scoped_terms=[x-], columns=['x[T.B]', 'x[T.C]']), EncodedTermStructure(term=z, scoped_terms=[z], columns=['z'])], transform_state={}, encoder_state={'x': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'z': (<Kind.NUMERICAL: 'numerical'>, {})})

However, when reusing the model spec:

mm1.model_spec.rhs.get_model_matrix(df2)

A dataframe is returned, and I can't pass output='sparse'.


matthewwardrop commented on July 19, 2024

Hi @TELSER1 !

Thanks for reaching out! There was a regression introduced in 0.4.0 that I will be fixing shortly (hopefully tonight, added #102 to track it) whereby ModelSpec.output is not respected. You can work around this by rolling back to 0.3.x, or using:

from formulaic import model_matrix

model_matrix(model_spec, <new_data>, output='sparse')

or

from formulaic.materializers.pandas import PandasMaterializer

PandasMaterializer(<new_data>).get_model_matrix(model_spec, output='sparse')

Hope that helps!


TELSER1 commented on July 19, 2024

Thanks for the help! In the spirit of Frederik's comment, I'm particularly interested in the serializability and sparse output functionality; I am trying to estimate some large, sparse regression models.


matthewwardrop commented on July 19, 2024

@TELSER1 Thanks for the context! That's largely why I wrote formulaic too :).
