Comments (10)
I found a better description of what's going on:
```python
import numpy as np
import numpy.linalg as la
import pandas as pd
import formulaic

index_vals = tuple("abc")
level_names = list("AB")
n_samples = 2


def describe_design_matrix(design_matrix):
    print("Shape:", design_matrix.shape)
    print("Rank: ", la.matrix_rank(design_matrix))
    print("Approximate condition number:",
          np.divide(*la.svd(design_matrix)[1][[0, -1]]))


ds_simple = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [index_vals] * len(level_names) + [range(n_samples)],
        names=level_names + ["sample"],
    ),
    columns=["y"],
    data=np.random.randn(len(index_vals) ** len(level_names) * n_samples),
).reset_index()

print("All sampled")
simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple)[1]
describe_design_matrix(simple_X)

print("Only some sampled")
simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(
    ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)

print("Reduced X")
reduced_X = simple_X.loc[:, [col for col in simple_X.columns
                             if not col.startswith("A[T.b]:")]]
describe_design_matrix(reduced_X)
```
produces

```
All sampled
Shape: (18, 9)
Rank:  9
Approximate condition number: 13.928203230275503
Only some sampled
Shape: (14, 9)
Rank:  7
Approximate condition number: 5.063203904729927e+16
Reduced X
Shape: (14, 7)
Rank:  7
Approximate condition number: 9.878656395922679
```
I think I can turn this into something I can use for the original problem.
Is there some way I can tell the rank-reducer that some combinations of factors will not exist?
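In the meantime, one generic way to find which columns of a rank-deficient design matrix are linearly dependent on the others, without knowing in advance which factor combinations are missing, is a pivoted QR decomposition: the pivots beyond the numerical rank correspond to columns that add (almost) nothing to the column space. A sketch, not formulaic functionality:

```python
import numpy as np
import scipy.linalg as sla


def dependent_columns(design_matrix, tol=1e-8):
    """Indices of columns that are (numerically) linear combinations
    of the pivot columns selected before them."""
    X = np.asarray(design_matrix, dtype=float)
    # Pivoted QR orders columns so |diag(R)| is non-increasing;
    # tiny trailing diagonal entries flag dependent columns.
    _, r, pivots = sla.qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(r))
    rank = int(np.sum(diag > tol * diag[0]))
    return sorted(int(i) for i in pivots[rank:])


# Example: the third column is twice the first, so one of them is redundant.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 2.0],
              [1.0, 0.0, 2.0]])
print(dependent_columns(X))  # → [0]
```

Note that pivoted QR identifies *a* maximal independent set of columns, not necessarily the ones a statistician would prefer to keep, so this locates the deficiency rather than choosing the right parameterization.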
Just realized I should probably mention:
This is formulaic 0.1.2 on CPython 3.6.10.
A slightly more complicated case:
```python
import numpy as np
import numpy.linalg as la
import pandas as pd
import formulaic

index_vals = tuple("abc")
level_names = list("ABC")
n_samples = 2


def describe_design_matrix(design_matrix):
    print("Shape:", design_matrix.shape)
    print("Rank: ", la.matrix_rank(design_matrix))
    print("Approximate condition number:",
          np.divide(*la.svd(design_matrix)[1][[0, -1]]))


ds_simple = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [index_vals] * len(level_names) + [range(n_samples)],
        names=level_names + ["sample"],
    ),
    columns=["y"],
    data=np.random.randn(len(index_vals) ** len(level_names) * n_samples),
).reset_index()

print("Only some sampled")
simple_X = formulaic.Formula("y ~ (A + B + C) ** 3").get_model_matrix(
    ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)

print("Reduced X")
reduced_X = simple_X.loc[:, [col for col in simple_X.columns
                             if not col.startswith("A[T.b]:B")]]
describe_design_matrix(reduced_X)
```
Since formulaic appears to sort the column names before forming the interactions, a general solution would need the test to be similar to:
```python
reduced_X = simple_X.loc[
    :,
    [
        col for col in simple_X.columns
        if (
            # Check for interaction terms involving A == 'b' ...
            (not col.startswith("A[T.b]:") and ":A[T.b]" not in col)
            # ... and any value of B
            or "B" not in col
        )
        # Equivalently,
        # if not ((col.startswith("A[T.b]:") or ":A[T.b]" in col) and "B" in col)
        # I may need to expand the test for B to avoid problems if, for example, A.endswith(B):
        # (":B[" in col or col.startswith("B[T."))
        # should work
    ],
]
```
I'm not entirely sure how to tell formulaic that `A[T.a]:B[T.x]` exists only for x = a, so it needs to move its A definition from `A[T.a]:B[T.{x != a}]` (which doesn't exist) to `A[T.b]:B[T.{x != a}]`.
This looks like most of an implementation, and a start at a test suite for this functionality, once an API is decided on.
(There will be problems if someone decides to name their categorical variables `some_name` and `":some_name[T.a_val_in_some_name]:"`, but I'm not seeing a way around that with this approach. I have no problem with avoiding weird names like this.)
Hi @DWesl ! Sorry for the delay on this. I usually only work on Formulaic on the weekends, in and around family stuff.
This is a very interesting problem. The `ensure_full_rank` option currently only guarantees structural full-rankness; that is, it assumes that after categorical encoding, the data itself is full rank.
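To illustrate that distinction with a toy example of my own (not formulaic code): a design matrix can be structurally full rank, with no column redundant by construction, yet numerically rank-deficient because the observed data confound two factors:

```python
import numpy as np
import pandas as pd

# Two binary factors that are perfectly confounded in the observed data:
# every row with A == 'b' also has B == 'y', so the two dummy columns coincide.
df = pd.DataFrame({"A": ["a", "b", "a", "b"], "B": ["x", "y", "x", "y"]})
X = np.column_stack([
    np.ones(len(df)),                # intercept
    (df["A"] == "b").astype(float),  # A[T.b]
    (df["B"] == "y").astype(float),  # B[T.y]
])
# Structurally, intercept + one dummy per factor is fine (3 columns),
# but these data make A[T.b] and B[T.y] identical, so the rank is only 2.
print(X.shape[1], np.linalg.matrix_rank(X))  # → 3 2
```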
I'm a little reluctant to add functionality along these lines to the formula grammar itself, since I think that will complicate things a lot. However, I do see two alternative paths to supporting something like this downstream (or even integrating it directly into Formulaic):
(1) As a post-model-matrix-generation transform that looks at the structure and patches it to be full rank.
(2) Implementing it as a transform.
For (2), it should be "trivial" to add this as a transform, but making it a stateful transform (something that can be reused on a different dataset) would be trickier. I'm imagining a new transform that looks something like:

```
model_matrix('~ rank_reduced_product(C(A), C(B), ..., options=...)')
```
Obviously as the number of interactions grow, this looks more and more ugly.
What are your thoughts?
Oh... And you could explicitly write out the interactions:

```
model_matrix("{C(A)['a']}:{C(B)['b']}", ...)
```

This syntax should work in the latest master, but with some tweaks to the parser perhaps we could enable doing:

```
model_matrix("A.a:B.b + ...")
```
If we did that, would that be sufficient for your use-case? You'd have to expand your products manually.
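Expanding the products by hand is mechanical enough to script. A sketch, using the hypothetical `A.a`-style term syntax from above (not something formulaic currently parses), where a user-supplied predicate says which level combinations actually occur:

```python
from itertools import product


def interaction_terms(factor_levels, exists=lambda combo: True):
    """Explicit interaction terms for the level combinations that exist.

    factor_levels: dict of factor name -> non-reference levels.
    exists: predicate on {factor: level} saying whether the combination
    occurs in the sampling design.
    """
    names = sorted(factor_levels)
    terms = []
    for levels in product(*(factor_levels[n] for n in names)):
        combo = dict(zip(names, levels))
        if exists(combo):
            terms.append(":".join(f"{n}.{combo[n]}" for n in names))
    return " + ".join(terms)


# Mirror the earlier example: 'a' is the reference level for both factors,
# and combinations with A == 'b' never cross non-reference levels of B.
formula_rhs = interaction_terms(
    {"A": ["b", "c"], "B": ["b", "c"]},
    exists=lambda c: c["A"] != "b",
)
print(formula_rhs)  # → A.c:B.b + A.c:B.c
```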
There's certainly the option of writing out the interactions explicitly, but I've worked around this for now by using:
```python
reduced_X = simple_X.loc[
    :,
    [
        col for col in simple_X.columns
        # Drop interaction terms involving both A == 'b' ...
        if not (
            (col.startswith("A[T.b]:") or ":A[T.b]" in col)
            # ... and any level of B; the test for B is expanded to avoid
            # problems if, for example, A.endswith(B)
            and (":B[" in col or col.startswith("B[T."))
        )
    ],
]
```
(The problem that sparked this issue has the second column named essentially `BA`, which requires a bit more care than just `B`.)
I would not add it to the grammar; I was thinking of a keyword option to the `Formula.get_model_matrix` method, since this kind of thing very much depends on the exact way the data was collected.
Possibly a better (more consistent) way to do this would be to use the grammar to set `B[T.b]` as the baseline and remove the interaction terms which are always zero. This would be really easy to do in CSC sparse format (if the next column starts at the same place as the current column, there are no nonzero elements), and `np.count_nonzero(..., axis=0) > 0` should give an index that could be used to drop columns in everything else.
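A sketch of that drop-all-zero-columns step outside formulaic, covering both the dense `count_nonzero` test and the CSC check via `indptr` (where equal consecutive entries mean an empty column):

```python
import numpy as np
from scipy import sparse


def nonzero_column_mask(X):
    """Boolean mask of columns containing at least one nonzero entry."""
    if sparse.issparse(X):
        X = X.tocsc()
        # In CSC format, column j's data live in data[indptr[j]:indptr[j+1]];
        # equal consecutive pointers mean the column is entirely zero.
        return np.diff(X.indptr) > 0
    return np.count_nonzero(X, axis=0) > 0


dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0]])
mask = nonzero_column_mask(dense)
print(mask.tolist())          # → [True, False, True]
print(dense[:, mask].shape)   # → (2, 2)
# The sparse path gives the same answer:
print(nonzero_column_mask(sparse.csc_matrix(dense)).tolist())
```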
... that's not currently implemented in formulaic, but the idea should be sound.
I have no idea which of these better conforms to the expectations of people looking at linear regression coefficients or ANOVA tables.
Sorry for the delay here again @DWesl ! I've had a lot on my plate the last few weeks.
I think I am persuaded that adding a keyword argument along the lines of "drop_columns_with_no_support" or "drop_all_zero_columns" could be useful, and that having it be part of the model matrix generation process is valuable, because that way new data will be forced to respect the limited interactions even if the new data has support for these interactions.
Would that meet your use-case?
That gets most of the way there; I'd need to reorder the levels in the `Categorical` to get a column with all zeros, but that's not that much effort.
I asked about this in pydata/patsy#161, and puzzled out what formula would be needed to ensure the all-zero columns happen: `(C(A, Treatment('b')) + B + C) ** 3`. I don't think formulaic supports changing the reference level yet, but if it respects the order of the levels in a `pandas.Categorical` dtype, I should be able to achieve the same effect by changing the order there.
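For reference, reordering the levels of a `pandas.Categorical` is a one-liner; a formula package that respects category order would then treat `'b'` as the first (reference) level — whether formulaic does so is the open question here:

```python
import pandas as pd

a = pd.Series(["a", "b", "c", "a"], dtype="category")
print(list(a.cat.categories))  # → ['a', 'b', 'c']

# Move 'b' to the front so it becomes the first (reference) level.
a_reordered = a.cat.reorder_categories(["b", "a", "c"])
print(list(a_reordered.cat.categories))  # → ['b', 'a', 'c']

# The underlying values are unchanged; only the level order differs.
print(list(a_reordered))  # → ['a', 'b', 'c', 'a']
```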
Hi @DWesl !
As I prepare formulaic for the v1.0 release, I wanted to go back through all the old issues and make sure things were dealt with properly. I know it has been a really long time (sorry). Formulaic has come a long way during this time, but I don't think this specific issue has been "solved".
Formulaic does now support customizing the reference level in contrasts/etc, as well as explicitly ordering the levels and respecting the order of factors in the formula, so you could use what you wrote above... but, there's still room for improvement.
Has your thinking in this space evolved?