matthewwardrop / formulaic Goto Github PK

View Code? Open in Web Editor NEW

316.0 10.0 21.0 2.2 MB

A high-performance implementation of Wilkinson formulas for Python.

License: MIT License

Python 59.00% Jupyter Notebook 40.98% CSS 0.01%

formulaic's Introduction

Formulaic is a high-performance implementation of Wilkinson formulas for Python.

Documentation: https://matthewwardrop.github.io/formulaic
Source Code: https://github.com/matthewwardrop/formulaic
Issue tracker: https://github.com/matthewwardrop/formulaic/issues

It provides:

high-performance dataframe to model-matrix conversions.
support for reusing the encoding choices made during conversion of one data-set on other datasets.
extensible formula parsing.
extensible data input/output plugins, with implementations for:
- input:
  - pandas.DataFrame
  - pyarrow.Table
- output:
  - pandas.DataFrame
  - numpy.ndarray
  - scipy.sparse.CSCMatrix
support for symbolic differentiation of formulas (and hence model matrices).
and much more.

Example code

import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0, 1, 2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

y, X = Formula('y ~ x + z').get_model_matrix(df)

y =

	y
0	0
1	1
2	2

X =

	Intercept	x[T.B]	x[T.C]	z
0	1.0	0	0	0.3
1	1.0	1	0	0.1
2	1.0	0	1	0.2

Note that the above can be short-handed to:

from formulaic import model_matrix
model_matrix('y ~ x + z', df)

Benchmarks

Formulaic typically outperforms R for both dense and sparse model matrices, and vastly outperforms patsy (the existing implementation for Python) for dense matrices (patsy does not support sparse model matrix output).

For more details, see here.

Related projects and prior art

Patsy: a prior implementation of Wilkinson formulas for Python, which is widely used (e.g. in statsmodels). It has fantastic documentation (which helped bootstrap this project), and a rich array of features.
StatsModels.jl @formula: The implementation of Wilkinson formulas for Julia.
R Formulas: The implementation of Wilkinson formulas for R, which is thoroughly introduced here. [R itself is an implementation of S, in which formulas were first made popular].
The work that started it all: Wilkinson, G. N., and C. E. Rogers. Symbolic description of factorial models for analysis of variance. J. Royal Statistics Society 22, pp. 392–399, 1973.

Used by

Below are some of the projects that use Formulaic:

Glum: High performance Python GLM's with all the features.
Lifelines: Survival analysis in Python.
Linearmodels: Additional linear models including instrumental variable and panel data models that are missing from statsmodels.
Pyfixest: Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax.
Tabmat: Efficient matrix representations for working with tabular data.
Add your project here!

formulaic's People

Contributors

Stargazers

Watchers

formulaic's Issues

Proposal: Add `commutative` to `Operator`

#34 Adds support for random effects via the pipe operator. While going through the example in the PR I noticed that operations that are written as x2|m were then printed as m:x2 because formulaic sorts the factors within a term. Apart from the : that should be a | (this is not a major issue), order here does matter, so printing x2|m is not the same as printing m|x2.

This issue suggests that the Operator class could have a commutative property that indicates whether you can change the order of the operators or not.

But maybe there's a better approach? Let's have this space for discussion.

API question: Bind model spec info to data container or not

Preamble: I have a hard time understanding some of the core code parts.

1. What is the reason to add metadata to the data container?

import pandas as pd
from formulaic import model_matrix

df = pd.DataFrame({
    'a': ['A', 'B', 'C'],
})
X = model_matrix("a", df)

X is then of type formulaic.model_matrix.ModelMatrix which is wraps a pandas.DataFrame.

isinstance(X, pd.DataFrame)

Naively, I would expect model_matrix to return a 2-tuple consisting of a dataframe and a ModelSpec.

2. Inspectability

Furthermore, X has a model_spec property. This is, however, not inspectable, i.e. it is not listed in dir(X) (and no autocompletion).

Rip a new release

👋 Hi Matt, do you have an approximate timeline for a new release?

Formulaic / Poetry install top-level tests package

Unclear to me what the issue is. I'm not familiar with poetry, but I have a top-level namespace package from formulaic that's called tests. Here is what I believe to be the relevant poetry issue python-poetry/poetry#2353 with links to other packages that are running into this..

Is there some way to specify that certain interactions never occur?

>>> import numpy as np
>>> import numpy.linalg as la
>>> import pandas as pd
>>> import formulaic
>>> index_vals = tuple("abc")
>>> level_names = list("AB")
>>> n_samples = 2
>>> ds_simple = pd.DataFrame(
...     index=pd.MultiIndex.from_product([index_vals] * len(level_names) + [range(n_samples)], names=level_names + ["sample"]), 
...     columns=["y"], data=np.random.randn(len(index_vals) ** len(level_names) * n_samples)
... ).reset_index()
>>> simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple)[1]
>>> # Approximate the condition number of simple_X
>>> np.divide(*la.svd(simple_X)[1][[0, -1]])
13.9282...
>>> simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple.query("A != 'a' or B == 'a'"))[1]
>>> np.divide(*la.svd(simple_X)[1][[0, -1]])
5.06320...e+16

I would expect the condition numbers to be somewhat closer to each other.
Is this just a bad expectation?

In the case this is derived from, A == "a" sets something to zero, and then multiplies that by the stuff B controls.
Is the standard practice for this situation to duplicate data expected to be identical so the tests don't crash?

BUG: SyntaxError converted to KeyError

I'm not sure if this should be considered syntax error, but if I try the formula

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.standard_normal((100,3)),columns=["y","y2","x"])
from formulaic import model_matrix
dep = model_matrix("y y2 ~ 1 + x", df, ensure_full_rank=False)

I get KeyError: 'yy2' which suggests the white space ie being trimmed. FWIW patsy's tokenizer would raise SyntaxError on this.

Add the missing transforms to bring parity with patsy / R.

Formulaic's support for stateful transforms is underutilised. Over the coming couple of weeks I'll be adding support in for the missing transforms, aiming to be compatible with patsy and R implementations.

C: Add support for custom contrast matrices, etc. (PR #70)
scale: Center and rescale for mean = 0 and stddev = 1
poly: Polynomial basis (PR #44)
bs : B-Spline basis (PR #21)
cr : cubic spline basis
cc : cyclic cubic spline basis
te : tensor product smooth

Null rows that are dropped are not consistent across left and right hand sides

e.g.

from formulaic import Formula
import pandas

Formula("y ~ x").get_model_matrix(pandas.DataFrame({'y': [1,2,3,4,None], 'x': [None, None, 1,2,3]}))

outputs:

(     y
 0  1.0
 1  2.0
 2  3.0
 3  4.0,
    Intercept    x
 2        1.0  1.0
 3        1.0  2.0
 4        1.0  3.0)

Fix minimum dependencies and add CI

The current requirements from setup

install_requires=[
        "astor",
        "interface_meta>=1.2",
        "numpy",
        "pandas",
        "scipy",
        "wrapt",
    ]

do not specify minimum versions (except for interface_meta). It would be nice to know the min versions and to have a CI run for them.

DOC: Reusing generated model specifications

IS it possible to use a "fitted" transformer and evaluate a new (however, similar dataset)?

Let's have the following example:

import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0,1,2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

trans = Formula('y ~ x + z')

trans.get_model_matrix(df)

df2 = pandas.DataFrame({
    'y': [3, 3, 3],
    'x': ['A', 'B', 'B'],
    'z': [0.3, 0.1, 0.222222222],
})

trans.get_model_matrix(df2)

suppose that my dfis my training data and df2 are testing data.
If I create X matrix for the model training it outputs:

trans.get_model_matrix(df)
.rhs
       Intercept  x[T.B]  x[T.C]    z
    0        1.0       0       0  0.3
    1        1.0       1       0  0.1
    2        1.0       0       1  0.2

A category is a referenced one.

Now I want to do the same for testing data:

trans.get_model_matrix(df2)
.rhs
       Intercept  x[T.B]         z
    0        1.0       0  0.300000
    1        1.0       1  0.100000
    2        1.0       1  0.222222

As you can see this does not persist original design info and matrixes df and df2 are not compatible. The model would fail ass the number of features is not the same.

Is this already implemented somehow?

Build new ModelMatrix using state of existing ModelSpec

Hi Matthew. This looks like a very promising package. I'm hoping to use this instead of patsy as soon as the API has matured. I'm therefore doing some experimentation with it in its current state. The reason I want to move away from patsy is that it does not support pickling DesignInfo. One feature of patsy that is critical to me though, is the ability to build new design matrices using an already existing DesignInfo object. This is supported in patsy through build_design_matrices. Can this be achieved using formulaic as well? It seems to me like calling get_model_matrix automatically recomputes the state (like the mean used in center()) using the supplied dataset. I would like to be able to use the state of an already existing ModelMatrix/ModelSpec when creating a new ModelMatrix. Is this currently supported?

Add `poly` transform

cf. pydata/patsy#92

BUG: Cannot parse formula with a function

I can't seel to get functions in formulas to work. This is a pretty basic one I think.

import numpy as np
from pandas import DataFrame, Categorical

formula = "y ~ 1 +np.exp(x1)"
y = np.random.randn(1000)
x1 = np.random.randn(1000)
d = np.random.randint(0, 4, 1000)
d = Categorical(d)
data = DataFrame({"y": y, "x1": x1, "d": d})
data["Intercept"] = 1.0
model_matrix(formula, data,context=0)

produces

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-930162752fbf> in <module>
      9 data = DataFrame({"y": y, "x1": x1, "d": d})
     10 data["Intercept"] = 1.0
---> 11 model_matrix(formula, data,context=0)

c:\git\formulaic\formulaic\sugar.py in model_matrix(formula, data, context, **kwargs)
     12         else:
     13             context = None  # pragma: no cover
---> 14     return Formula(formula).get_model_matrix(data, context=context, **kwargs)

c:\git\formulaic\formulaic\formula.py in get_model_matrix(self, data, context, materializer, ensure_full_rank, **kwargs)
     61         if not inspect.isclass(materializer) or not issubclass(materializer, FormulaMaterializer):
     62             raise FormulaMaterializerInvalidError("Materializers must be subclasses of `formulaic.materializers.FormulaMaterializer`.")
---> 63         return materializer(data, context=context or {}).get_model_matrix(self, ensure_full_rank=ensure_full_rank, **kwargs)
     64
     65     def differentiate(self, *vars, use_sympy=False):

c:\git\formulaic\formulaic\materializers\base.py in get_model_matrix(self, spec, ensure_full_rank, na_action, output)
    108         # Step 0: Check whether formula separators are in play, and if so, recurse.
    109         if isinstance(spec.formula.terms, tuple):
--> 110             return tuple(
    111                 self.get_model_matrix(Formula(terms), ensure_full_rank=ensure_full_rank, na_action=na_action, output=output)
    112                 for terms in spec.formula.terms

c:\git\formulaic\formulaic\materializers\base.py in <genexpr>(.0)
    109         if isinstance(spec.formula.terms, tuple):
    110             return tuple(
--> 111                 self.get_model_matrix(Formula(terms), ensure_full_rank=ensure_full_rank, na_action=na_action, output=output)
    112                 for terms in spec.formula.terms
    113             )

c:\git\formulaic\formulaic\materializers\base.py in get_model_matrix(self, spec, ensure_full_rank, na_action, output)
    118         for term in spec.formula.terms:
    119             for factor in term.factors:
--> 120                 self._evaluate_factor(factor, spec, drop_rows)
    121
    122         drop_rows = sorted(drop_rows)

c:\git\formulaic\formulaic\materializers\base.py in _evaluate_factor(self, factor, spec, drop_rows)
    269                 value = self._lookup(factor.expr)
    270             elif factor.eval_method.value == 'python':
--> 271                 value = self._evaluate(factor.expr, factor.metadata, spec)
    272             elif factor.eval_method.value == 'literal':
    273                 value = EvaluatedFactor(factor, self._evaluate(factor.expr, factor.metadata, spec), kind='constant')

c:\git\formulaic\formulaic\materializers\base.py in _evaluate(self, expr, metadata, spec)
    306
    307     def _evaluate(self, expr, metadata, spec):
--> 308         return stateful_eval(expr, self.layered_context, {expr: metadata}, spec.transform_state, spec)
    309
    310     def _is_categorical(self, values):

c:\git\formulaic\formulaic\utils\stateful_transforms.py in stateful_eval(expr, env, metadata, state, spec)
     66     stateful_nodes = {}
     67     for node in ast.walk(code):
---> 68         if isinstance(node, ast.Call) and getattr(env.get(node.func.id), '__is_stateful_transform__', False):
     69             stateful_nodes[astor.to_source(node).strip()] = node
     70

AttributeError: 'Attribute' object has no attribute 'id'

Documentation: rendering issue for "X"s in "Formula Grammar" section

Hello! I'm considering using formulaic, and was taking a quick read through the documentation.

The tables on the Formula Grammar page have check marks and what I assumed should be X-like symbols to indicate the presence or absence of a feature. The X's are not rendering properly for me either on Firefox or Chrome (image below from Firefox, Chrome similar but the boxes are empty instead of containing "01F" and "509"):

Index is not preserved in DataFrames after transformation

df = pd.DataFrame([
    {'a': 1, 'b': 4},
    {'a': 2, 'b': 5},
    {'a': 3, 'b': 6},
])

df = df.set_index("b")
print(df.index)

from formulaic import Formula

df_ = formulaic.Formula('a').get_model_matrix(df)
print(df_.index)

This may be intentional (but I would argue it makes things more surprising and adds additional necessary code). Thoughts @matthewwardrop?

Test that formulaic objects are pickable

One could add a (unit) test to assert that formulaic objects are pickable, in contrast to patsy.

ENH: Detect which variables are necessary to exist in data based on formula

Current it is difficult to get parts of the formula and what are the data.frame variables (i.e. the column names). It would be useful to add some facilities to enable further processing.

For example if you have formula log(y) ~ x, we would need to invert log to predict y. To be able to do so, we need to know lhs and that the variable is y.

Add helper methods for capturing contexts when using formulaic in libraries

Hi @matthewwardrop, looking for some advice here. I'm trying to debug this issue occurring in lifelines. Internally, I'm using the Formula api, ex: something like Formula(formula).get_model_matrix(data). I've narrowed the problem down to us not using the context kwarg. I don't know what context to provide here. In the library, we having something like (I'm simplifying):

### users shell / script
def custom_func(x):
    return x + 1

CoxPHFitter().fit(...,df=df, formula="np.log10(x) + custom_func(x)")


### lifelines.estimation.coxph_fitter.py
...
def fit(self, df=None, formula=None ):
      return self.g(df, formula)

def g(self, df, formula)
    return self.regressors.transform_df(df, formula)
...


### lifelines.utils.__init__.py
...
    def transform_df(self, df, formula):
        return Formula(formula).get_model_matrix(df)
...

And so get_model_matrix never sees the context that contains np nor custom_func.

In sugar.py, I've seen you retrieve the frame explicitly with context=0, but doesn't that only work if it is called from the top of the stack? What if we don't know where we are in the stack?

A possibility is to do something like:

### lifelines.utils.__init__.py
...
    def transform_df(self, df, formula):
        import sys
        call_frame = sys._getframe(3)
        context = LayeredMapping(call_frame.f_locals, call_frame.f_globals)
        return Formula(formula).get_model_matrix(df, context=context))
...

But that relies on me knowing apriori that I need 3 there - which is too fragile and doesn't allow for any reuse.

Have you seen other libraries get around this? What do you advise?

na handling

Add means to handle missing values: omit, fail, pass

Stateful transforms attempt to evaluate constants in Python calls and fails.

For example:

import pandas
from formulaic.transforms import basis_spline

df = pandas.DataFrame({'a': [1,2,3,4]})
Formula("bs(a, df=10)").get_model_matrix(df)

results in FactorEvaluationError: Unable to evaluate factor `bs(a, df=10)`. [KeyError: <class 'ast.Constant'>].

Originally surfaced in #74.

Add imputation for missing values

Has missing value handling been implemented? If so, is there any way to configure what to do? I would like to examine moving linearmodels to formulaic since I would really like an extensible formula parser, which I couldn't do with patsy.

Something funky with indexes and DataFrames

Hi @matthewwardrop!
Something came up in lifelines related to Dataframe indexes and transform. Here's a repo example:

from formulaic import Formula

design_info = Formula("1")

df = pd.DataFrame(np.arange(5), index=[0, 2, 4, 6, 8])

print(design_info.get_model_matrix(df)) # should have nulls in half of it

ENH: Support for Linear Constraints

It would be great if it was possible to support linear constraint transformation from str to constraint arrays.

https://patsy.readthedocs.io/en/latest/API-reference.html#linear-constraints

I don't see any support for this and it is the final piece from me dropping patsy from linearmodels.

DOC: One-hot vs. Dummy Encoding

The Quickstart page says

You will notice that the categorical values for a have been one-hot encoded

Would it be more accurate to say that they categorical values have been dummy encoded? It seems to me (knowing nothing about encoding until an half an hour ago), that one-hot encoding adds k columns to the X matrix and dummy encoding adds k-1 columns. It looks like you add k-1 because one gets dropped?

Unexpected behaviour: Repeated + raises KeyError

Running

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.standard_normal((100,3)),columns=["y","y2","x"])
from formulaic import model_matrix
dep = model_matrix("y ~ 1 + + x", df, ensure_full_rank=False)

produces KeyError: '++'. This also seems to be an issue in the tokenizer. FWIW patsy would parse this as "y ~ 1 + x"

I found these running part of the test suite for linearmodels using formulaic.

Add `include_intercept` to the `Formula` constructor?

I think there is a discrepancy in the output of rhs:

formula "y~x" -> rhs = "x+1"
formula "y~x-1" -> rhs = "x"
formula "y~x+0" -> rhs = "x"

I would have expected to get rhs = "x-1" or rhs = "x+0" for the second and third case.

Spline Support like Patsy

Would it be possible to also implement (apart from already implemented interactions and category variable) splines feature:

Patsy library does this really nicely.

Example

Allow `Term` instances to control the naming of output columns in model matrices

This allows Term subclasses to label model matrix columns differently based on context; e.g. if a column is a random effect column, then it might be labelled as: a|g rather than a:g.

What are the limitations from treatying formulaic as a drop-in replacement for patsy?

Aside from not supporting recarrays (which is sensible), are there any known limitations?

BUG: numpy.typing is only available on recent version of NumPy

numpy.typing is only available on very recent versions of NumPy. This import is optional and should only be imported when TYPE_CHECKING so that recent NumPy is required for type checking but not otherwise.

Move to poetry

I think it would beneficial to move to poetry for dependency management and build process. It would also make it easier to setup for development.

Add migration guide from patsy.

ENH: Make sympy an optional dependency

SymPy depends on mpmath which does not appear to distribute binary modules. This makes pip installation difficult on Windows, and possibly OSX. SymPy doesn't seem fundamental to formulaic and only seems to be used for differentiation.

Could this be converted to an optional feature that would raise if not available, as in

differentiate(...,sympy=True)

Traceback:
...
ImportError: SymPy must be installed to use this feature

Determine which input columns were used to generate output columns

Great library @matthewwardrop!

Is there a way to determine how input columns correspond to output columns in an example like this?

import pandas
from formulaic import model_matrix

df = pandas.DataFrame({
    'y': [0, 1, 2],
    'a': ['A', 'B', 'C'],
    'b': [0.3, 0.1, 0.2],
})

y, X = model_matrix("y ~ a + b + a:b", df)
X

Intercept	a[T.B]	a[T.C]	b	a[T.B]:b	a[T.C]:b
1.0	0	0	0.3	0.0	0.0
1.0	1	0	0.1	0.1	0.0
1.0	0	1	0.2	0.0	0.2

For example, I would like to know that the input columns a and b were used to generated the resulting interaction field a[T.B]:b. I would like to establish this through some structured, intermediate data structure though rather than parsing it back out of the names. Does such a structure exist and is it easy to access?

Thanks!

Improve user experience around programmatic `Formula` and `ModelSpec` creation

Currently the Formula objects are easiest to directly create from strings, and the ModelSpec is automatically created. This isn't always ideal when formulas and model specs are programmatically created and/or mutated. The API for these objects should be improved and extended to allow for direct creation/mutation, especially in conjunction with Structured formulae.

cc: @xjing76

Column dtypes are not consistent

A model spec does not keep the column type.
Example:

import pandas as pd
from formulaic import model_matrix


df1 = pd.DataFrame({
    'a': ['A', 'B', 'C'],
})

df2 = pd.DataFrame({
    'a': ['A', 'A', 'B'],
})

X1 = model_matrix("a", df1)
X1.info()

gives

 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Intercept  3 non-null      float64
 1   a[T.B]     3 non-null      uint8  
 2   a[T.C]     3 non-null      uint8

But then

X2 = X1.model_spec.get_model_matrix(df2)
X2.info()

gives

 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Intercept  3 non-null      float64
 1   a[T.B]     3 non-null      uint8  
 2   a[T.C]     3 non-null      float64  # <= This is not the same dtype as before!

model specs appear to retain values from data matrix

I was looking at the modelspec object after a formulaic.model_matrix(formula, df) call with some synthetic data, and noted that modelspec was rather large (self-contained code eg below). Poking at the object (using pympler), I found that the EvaluatedFactor terms nested within that .structure attribute contain copies of the data columns from the data matrix.

When I iterated through .structure and set the EvaluatedFactor.values terms to None, .get_model_matrix() continued to function correctly on a dataframe different from the original one I'd used to create the modelspec.

A few questions, if I might?

Are there circumstances where it's necessary to retain those data copies in the modelspec object, in order to support subsequent .get_model_matrix calls?
If not, might it be reasonable not to store those potentially-large data under the modelspec, or perhaps store them optionally, eg via some kwarg to the model_matrix method?
Would there be anywhere else within the modelspec that data are stored? (From this experiment at least, it doesn't appear to be so.)

Here's the code I used, for reference:

import numpy as np
import pandas as pd
import formulaic as fm
import pympler
import scipy.special

np.random.seed(123)

# -- a formula to exercise formulaic
formula = "y ~ x1 * x2 + bs(x1, 3) + m * x1"

n = 100000  # largish to make detection easy

def gen_frame(n):
    return pd.DataFrame(
        dict(
            x1=np.random.normal(size=n),
            x2=np.random.normal(size=n),
            m=np.random.choice(list("abc"), size=n),
            e=np.random.normal(size=n),
        )).assign(y=lambda f: (scipy.special.expit(1 + 2 * f.x1 - f.x2 + f.e) >
                               0.5).astype(int))

df = gen_frame(n)  # ie, training set
dfp = gen_frame(5)  # ie, prediction set

y, X = fm.model_matrix(formula, df)

f"Original ModelSpec size: {pympler.asizeof.asizeof(X.model_spec)}"

# -- build design matrix from original modelspec
Xp1 = X.model_spec.get_model_matrix(dfp)

# -- remove values from evaluated factors
for _, factors, _ in X.model_spec.structure:
    for f in factors:
        for ff in f.factors:
            ff.factor.values = None

f"Decanted ModelSpec size: {pympler.asizeof.asizeof(X.model_spec)}"

# -- build design matrix from decanted modelspec
Xp2 = X.model_spec.get_model_matrix(dfp)

# -- prove that the .values are unnecessary, at least in this case
pd.testing.assert_frame_equal(Xp1, Xp2)

which yields:

'Original ModelSpec size: 28436952'
'Decanted ModelSpec size: 17984'

and does not raise an assertion error, implying that the two design matrices had identical values.

Thank you for creating and sharing this package!

Add a "safe" mode that avoids using `eval`?

In some cases, particularly in production code, it is nice to have a safe mode that avoids using eval, at the expense of some functionality. This issue is here so this line of thought doesn't get lost, but also to see how important this is to people.

To be clear, removing eval in the current state would basically cripple Formulaic by preventing any Python transformations (including built-in transformations). There may be a middle ground whereby we introspect the generated Python code AST, and verify that the functions to be executed are sanctioned.

BUG: Repeated constant-like terms are incorrectly parsed

Formulas like 0 + 0 + 1 + x are incorrectly parsed and result in a DataFrame that looks like:

from formulaic import model_matrix
from pandas import DataFrame 
df = DataFrame([[0],[1],[2],[3],[4]],columns=["x"])
dep = model_matrix("0 + 0 + 1 + x", df, ensure_full_rank=False)
print(DataFrame(dep))

   Intercept  Intercept  x
0        0.0        1.0  0
1        0.0        1.0  1
2        0.0        1.0  2
3        0.0        1.0  3
4        0.0        1.0  4

0 + 0 + 1 should evaluate to just 1 +

Variable (covariate) names with dashes are not parsed

Summary

I am analyzing Pandas DataFrame in which some variable names have dashes, e.g., foo-bar.

This results in an error when calling formulaic.model_matrix.

Error message

FactorEvaluationError: Unable to evaluate factor `b`. [KeyError: 'b']

Minimum reproducible example

The error is reproduced by running the following code:

import pandas
from formulaic import model_matrix

df = pandas.DataFrame({
    'y': [0, 1, 2],
    'a': ['A', 'B', 'C'],
    'b-foo': [0.3, 0.1, 0.2],
})

y, X = model_matrix("y ~ a + b-foo", df)

Expected behavior

The following DataFrame X is generated:

| Intercept | a[T.B] | a[T.C] | b-foo |
|-----------|--------|--------|-------|
| 1.0       | 0      | 0      | 0.3   |
| 1.0       | 1      | 0      | 0.1   |
| 1.0       | 0      | 1      | 0.2   |

Potential solutions (high level syntax)

To prevent collision with the - operator for negation, the formula could be checked for spaces. y ~ a + b-foo would handle b-foo as a variable (covariate) name, whereas y ~ a + b - foo would negate a variable named foo.

Alternatively, the formula could parse variable names inside of quotes, e.g., y ~ a + "b-foo".

Consider adding numpy transforms by default

Formulaic does not currently "pollute" the evaluation context with additional transforms beyond those explicitly implemented in formulaic (like "C", "bs", "poly", etc). It is probably worth exposing numpy as np and sum of its functions directly as transforms in formulas without requiring users to have it in their namespace (if using model_matrix) or passing them in via context={}. Candidates include: exp, log, sum, etc.

This makes it easier to use formulaic in places where you don't want users randomly inserting things into the namespace, but still want a more complete set of transforms.

Add support for structured formulae and model matrices.

Currently, Formula and ModelMatrix objects are presented as-is to users, and do not contain any nested structure. For example, materializating the formula y ~ 1 + x will result in a tuple of two ModelMatrix instances being returned; one for the left-hand side and one of the right-hand side. While a good starting point, it is limiting for extended use-cases where the right hand side might include terms (such as random-effect terms) which need to be demarcated from the rest of the terms. We could solve this using tuples all the way down, but a richer representation that allows lookup of components by name is valuable.

DOC: Docs Inconsistency

I know the docs are early stages and I'm not even sure if this is worth an issue (please let me know if it's not) but the link "Quick Start" link in the body of this page points to a page titled "Quickstart".

[Bug] Args not effective in direct ModelSpec construction

import pandas as pd
from formulaic.model_spec import ModelSpec
import scipy

data = [[1,10],[2,12],[1,13]]
df = pd.DataFrame(data,columns=['a','b'])
formula = 'b~ 1+C(a)'
y_, X_ = formulaic.Formula(formula).get_model_matrix(df.head(1))

kwargs = {'output':'sparse'}
spec = ModelSpec(formula=formula, structure=X_.model_spec.structure,  **kwargs)
y, X = spec.get_model_matrix(df)

	Intercept	C(a)[T.2]
0	1.0	0
1	1.0	1
2	1.0	0

Here is a minimal example when we construct ModelSpec directly with additional args like output ='sparse' the return objects are still in dense format.

Bug handling categorical features?

I am training a model with a pd.categorical feature with categories not following alphabetic order. During prediction I pass this categorical feature as string. When transforming getting the model matrix using X.model_spec.get_model_matrix(x_string), the string feature is transformed to a pd.categorical with features alphabetically ordered resulting in FactorEncodingError: Term categorical feature has generated columns that are inconsistent with specification. The number and name of columns is the same but the other is different.

I use this code to replicate this error:

from formulaic import model_matrix
from random import shuffle
import numpy as np

for cat_order in [['A','G','B','D','E','F'], ['A','B','D','E','F']]:
   # Define the model matrix
   examples = cat_order*2
   shuffle(examples)
   df = pd.DataFrame({'y': [i for i, _ in enumerate(examples)], 'a':pd.Categorical(examples, cat_order)})
   y, X = model_matrix("y ~ C(a)", df)
   # Check if we can transform a categorical feature with specified order, a noncateegorical feature 
   # into a model matrix
   from formulaic.errors import FactorEncodingError
   df_test = pd.DataFrame({
       'y': [17,18],
       'a': pd.Categorical(['A','B'], cat_order),

   })
   df_testb = pd.DataFrame({
       'y': [17,18],
       'a': ['A','B'],

   })

   print('categorical order',cat_order)
   for df, ftype in [(df_test,'cat'),(df_testb,'non-cat')]:
       try:
           X.model_spec.get_model_matrix(df)
           print(f'{ftype} passes')
       except FactorEncodingError:
           print(f'{ftype} does not pass')

outputs:

categorical order ['A', 'G', 'B', 'D', 'E', 'F']
cat passes
non-cat does not pass
categorical order ['A', 'B', 'D', 'E', 'F']
cat passes
non-cat passes

"Finish" documentation

The current documentation is incomplete and full of typos. Finish it!

Formulas should not silently swallow literals during materialization.

Currently in formulaic during materialization, bare literals will be silently dropped. In some sense, this makes sense, but given that these are extraneous, an error should be raised.

For example:

import pandas
import formulaic
formulaic.model_matrix('a + b + 50', pandas.DataFrame({'a': [1,2,3], 'b': ['a', 'b', 'c']}))

Consider migrating project to some community umbrella

What about a possible migration of formulaic under some community space like pydata (like patsy) or statsmodels (like statsmodels)?

This is just a suggestion with the purpose of a broader community support and with the lessons learnt from patsy in mind. I'm not a member of any of those github organizations.

From the patsy Readme:

patsy is no longer under active development. As of August 2021, Matthew Wardrop (@matthewwardrop) and Tomás Capretto (@tomicapretto) have taken on responsibility from Nathaniel Smith (@njsmith) for keeping the lights on, but no new feature development is planned. The spiritual successor of this project is Formulaic, and we recommend those interested in new feature development contribute there.

Then there are, e.g., statsmodels/statsmodels#6858 and bambinos/formulae#51 (comment).

It would be interesting to hear different peoples' perspectives including the current maintainer @matthewwardrop, but also—if I may ping you—@tomicapretto, @bashtage, @josef-pkt, @rgommers.

Add support for multi-stage formulas.

In some of my work I am interested in exploring two-stage least-square regression on sparse data, and thus in making Formulaic able to handle it nicely.

My plan is to allow formulas of form:
y ~ a + [b + c ~ z1 + z2] | a + [e + f ~ z1 + z2] | d + [b + c ~ z1 + z2] | d + [e + f ~ z1 + z2]
In my proposed grammar, this would also be equivalent to:
y ~ (a|d) + [b + c | e + f ~ z1 + z2]
Using multipart syntax in the rhs of nested formulas would be forbidden.

The API for accessing the various pieces of this Formula is as yet not fully fleshed out, and naming has not been properly considered, but would be something like:

f = Formula('y ~ (a|d) + [b + c | e + f ~ z1 + z2]')
f.formula_for(rhs_part=0, stage=0)  # b + c ~ z1 + z2
f.formula_for(rhs_part=0, stage=1)  # y ~ a + b + c
f.formula_for(rhs_part=1, stage=0)  # e + f ~ z1 + z2
f.formula_for(rhs_part=1, stage=1)  # y ~ a + e + f
f.formula_for(rhs_part=0) # y ~ a + [b + c ~ z1 + z2]

f = Formula('y ~ x + z')
f.formula_for() # y ~ x + z

On a multipart formula like this one, calls to get_model_matrix will need to specify the part and stage for which the model matrix should be generated. If there is only one part or stage, this will not be necessary. Formulaic explicitly will not attempt to do any modeling with this, and will expect users of the library to do any memoisation that is required for two-stage least-squares to work when pumping new data sets through a pre-trained model.

I'm especially keen to know what @bashtage thinks about this, given that this is something he has explored a lot more in linearmodels.

Need help?

Hello Matt! The README says that Formulaic is a work in progress, is there anything that you'd like to see changed/added to the code that I might be able to help with? Writing tests? Docs? I don't have a strong programming/statistics/math background (yet) so I may not be much use with some things.