Comments (11)
@TELSER1 This has been fixed and pushed out in v0.5.0.
from formulaic.
Thank you for this! I was btw drawn to this package not so much due to performance (which seems to be 10x faster than patsy on my setup) but the fact that it can be pickled.
from formulaic.
Hi @petrhrobar ,
Yes... this is easily done with the current state of formulaic.
You can just do:
import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0, 1, 2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})
trans = Formula('y ~ x + z')
mm1 = trans.get_model_matrix(df)

# New data, encoded with the choices made for df
df2 = pandas.DataFrame({
    'y': [3, 3, 3],
    'x': ['A', 'B', 'B'],
    'z': [0.3, 0.1, 0.222222222],
})
mm2 = mm1.model_spec.get_model_matrix(df2)
Or, using the sugar method model_matrix in 0.3.4+:
mm2 = model_matrix(mm1, df2)
Hope that helps. I'll leave this open for documentation purposes until the docsite is updated.
from formulaic.
Thanks, Matthew!
I was actually able to figure it out myself yesterday from a previous issue about this topic. So I guess it kind of boils down to the documentation :/.
If I may make a suggestion, here is how I want to use this, as an sklearn component:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from formulaic import Formula, model_matrix

class FormulaicTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y=None):
        """Fit on X and remember the resulting model spec."""
        self._trans = model_matrix(self.formula, X).model_spec.rhs
        return self

    def transform(self, X, y=None):
        """Build the model matrix for X from the spec remembered in fit()."""
        X_ = self._trans.get_model_matrix(X)
        return X_

pipe = Pipeline([
    ("formula", FormulaicTransformer("bs(yday, df=12) + wday + num_date")),
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])
Because this persists the design info and can be pickled, it can be used as a proper sklearn component!
This is a badass feature!
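To make the "persists the design info and can be pickled" point concrete, here is a toy, standard-library-only sketch of the idea. It is not formulaic's implementation; `fit_levels` and `encode` are hypothetical helpers made up for illustration. The levels learned from the first dataset are plain data, so they survive a pickle round trip and can be reused on new data that is missing a level.

```python
import pickle

def fit_levels(values):
    """Record the category levels seen in the training data (hypothetical helper)."""
    return {"levels": sorted(set(values))}

def encode(state, values):
    """Treatment-code values using the recorded levels; the first level is the reference."""
    return [[int(v == level) for level in state["levels"][1:]] for v in values]

# "Fit" on the first dataset, then persist and restore the learned state.
state = fit_levels(["A", "B", "C"])
restored = pickle.loads(pickle.dumps(state))

# New data never contains "C", but reusing the state still yields both columns.
new_rows = encode(restored, ["A", "B", "B"])

# Re-fitting on the new data instead would silently drop the "C" column.
refit_rows = encode(fit_levels(["A", "B", "B"]), ["A", "B", "B"])

print(new_rows)    # [[0, 0], [1, 0], [1, 0]]
print(refit_rows)  # [[0], [1], [1]]
```

The reused encoding has the same shape as the x[T.B] and x[T.C] columns of the original model matrix, which is what a downstream model fitted on the first dataset expects.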
from formulaic.
Nice! And yes... documentation will come... eventually!
This is a really cool use of Formulaic :). Maybe something like this makes sense to bring into formulaic itself at some point; or perhaps even better, upstream into sklearn.
When used in libraries, though, I do recommend using Formula(...).get_model_matrix(...) since that way the compute context is explicitly established. When using model_matrix, the default behaviour is to make the entire locals() and globals() context available to use in formulae. For local use, that's fine... for libraries, not so much.
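A toy illustration of why an implicit context is risky inside a library, in plain Python rather than formulaic itself. `eval_implicit` and `eval_explicit` are hypothetical helpers, not part of any library: the first looks names up in the caller's scope, the second only sees what is passed in.

```python
import sys

def eval_implicit(expr, data):
    """Hypothetical convenience helper: also searches the caller's globals and locals."""
    caller = sys._getframe(1)
    scope = dict(caller.f_globals)
    scope.update(caller.f_locals)
    scope.update(data)
    return eval(expr, {"__builtins__": {}}, scope)

def eval_explicit(expr, data, context=None):
    """Hypothetical library-friendly helper: only sees data plus an explicit context."""
    scope = dict(context or {})
    scope.update(data)
    return eval(expr, {"__builtins__": {}}, scope)

def library_function(data):
    scale = 10  # an internal local that the formula author never meant to use
    implicit = eval_implicit("x * scale", data)  # silently picks up the local
    try:
        eval_explicit("x * scale", data)
        leaked = True
    except NameError:
        leaked = False  # the internal local is not visible
    explicit = eval_explicit("x * scale", data, context={"scale": 3})
    return implicit, leaked, explicit

result = library_function({"x": 2})
print(result)  # (20, False, 6)
```

With the implicit helper, the result depends on whatever happens to be in scope where the call is made; with the explicit one, the caller decides exactly which names an expression may use.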
from formulaic.
I have been attempting to use Formulaic and ran into the same issue. The above example does not seem to work anymore. Is there a different way of doing this now? Thanks in advance!
from formulaic.
Hi @frederik-plum-hauschultz !
Since a while back (not sure if it was the case when I wrote this or not), the output of get_model_matrix() is a Structured instance that reflects the structure of the formula.
For example:
>>> import pandas
>>> from formulaic import Formula
>>> df = pandas.DataFrame({
...     'y': [0, 1, 2],
...     'x': ['A', 'B', 'C'],
...     'z': [0.3, 0.1, 0.2],
... })
>>> trans = Formula('y ~ x + z')
>>> mm1 = trans.get_model_matrix(df)
>>> mm1
.lhs:
       y
    0  0
    1  1
    2  2
.rhs:
       Intercept  x[T.B]  x[T.C]    z
    0        1.0       0       0  0.3
    1        1.0       1       0  0.1
    2        1.0       0       1  0.2
>>> mm1.model_spec
.lhs:
    <formulaic.model_spec.ModelSpec object at 0x7f0f43f6df10>
.rhs:
    <formulaic.model_spec.ModelSpec object at 0x7f0f43f6df40>
It isn't possible in 0.3.x to call methods of the nested structure directly. I might add that in 0.4.x, but am not yet 100% convinced it is a good idea (maybe about 95% at the moment, leaning toward doing it, in which case it will appear in 0.4.0 shortly; feel free to nudge me if you like the idea).
In the meantime you can do:
df2 = pandas.DataFrame({
    'y': [3, 3, 3],
    'x': ['A', 'B', 'B'],
    'z': [0.3, 0.1, 0.222222222],
})
mm2 = mm1.model_spec.rhs.get_model_matrix(df2)
or, if you want both the lhs and rhs bits done in one step:
>>> mm1.model_spec._map(lambda spec: spec.get_model_matrix(df2))
.lhs:
       y
    0  3
    1  3
    2  3
.rhs:
       Intercept  x[T.B]  x[T.C]         z
    0        1.0       0       0  0.300000
    1        1.0       1       0  0.100000
    2        1.0       1       0  0.222222
Hope that helps.
from formulaic.
How would this work if I wanted a sparse output?
If I set
mm1 = trans.get_model_matrix(df, output='sparse')
I get back the expected sparse matrices, and the model specs are
.lhs:
ModelSpec(formula=y, materializer='pandas', ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='sparse', structure=[EncodedTermStructure(term=y, scoped_terms=[y], columns=['y'])], transform_state={}, encoder_state={'y': (<Kind.NUMERICAL: 'numerical'>, {})})
.rhs:
ModelSpec(formula=1 + x + z, materializer='pandas', ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='sparse', structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=x, scoped_terms=[x-], columns=['x[T.B]', 'x[T.C]']), EncodedTermStructure(term=z, scoped_terms=[z], columns=['z'])], transform_state={}, encoder_state={'x': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'z': (<Kind.NUMERICAL: 'numerical'>, {})})
However, when reusing the model spec:
mm1.model_spec.rhs.get_model_matrix(df2)
A dataframe is returned, and I can't pass output='sparse'.
from formulaic.
Hi @TELSER1 !
Thanks for reaching out! There was a regression introduced in 0.4.0 that I will be fixing shortly (hopefully tonight, added #102 to track it) whereby ModelSpec.output is not respected. You can work around this by rolling back to 0.3.x, or using:
from formulaic import model_matrix
model_matrix(model_spec, <new_data>, output='sparse')
or
from formulaic.materializers.pandas import PandasMaterializer
PandasMaterializer(<new data>).get_model_matrix(model_spec, output='sparse')
Hope that helps!
from formulaic.
Thanks for the help! In the spirit of Frederik's comment, I'm particularly interested in the serializability and sparse output functionality; I am trying to estimate some large, sparse regression models.
from formulaic.
@TELSER1 Thanks for the context! That's largely why I wrote formulaic too :).
from formulaic.