Git Product home page Git Product logo

Comments (10)

DWesl avatar DWesl commented on July 19, 2024

I found a better description of what's going on:

import numpy as np
import numpy.linalg as la
import pandas as pd
import formulaic
index_vals = tuple("abc")
level_names = list("AB")
n_samples = 2


def describe_design_matrix(design_matrix):
    print("Shape:", design_matrix.shape)
    print("Rank: ", la.matrix_rank(design_matrix))
    print("Approximate condition number:",
          np.divide(*la.svd(design_matrix)[1][[0, -1]]))


ds_simple = pd.DataFrame(
    index=pd.MultiIndex.from_product([index_vals] * len(level_names) + [range(n_samples)], names=level_names + ["sample"]), 
    columns=["y"], data=np.random.randn(len(index_vals) ** len(level_names) * n_samples)
).reset_index()

print("All sampled")
simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple)[1]
describe_design_matrix(simple_X)

print("Only some sampled")
simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple.query("A != 'a' or B == 'a'"))[1]
describe_design_matrix(simple_X)

print("Reduced X")
reduced_X = simple_X.loc[:, [col for col in simple_X.columns if not col.startswith("A[T.b]:")]]
describe_design_matrix(reduced_X)

produces

All sampled
Shape: (18, 9)
Rank:  9
Approximate condition number: 13.928203230275503
Only some sampled
Shape: (14, 9)
Rank:  7
Approximate condition number: 5.063203904729927e+16
Reduced X
Shape: (14, 7)
Rank:  7
Approximate condition number: 9.878656395922679

I think I can turn this into something I can use for the original problem.
Is there some way I can tell the rank-reducer that some combinations of factors will not exist?

from formulaic.

DWesl avatar DWesl commented on July 19, 2024

Just realized I should probably mention:
This is formulaic 0.1.2 on CPython 3.6.10.

from formulaic.

DWesl avatar DWesl commented on July 19, 2024

A slightly more complicated case:

import numpy as np
import numpy.linalg as la
import pandas as pd
import formulaic
index_vals = tuple("abc")
level_names = list("ABC")
n_samples = 2


def describe_design_matrix(design_matrix):
    print("Shape:", design_matrix.shape)
    print("Rank: ", la.matrix_rank(design_matrix))
    print("Approximate condition number:",
          np.divide(*la.svd(design_matrix)[1][[0, -1]]))


ds_simple = pd.DataFrame(
    index=pd.MultiIndex.from_product([index_vals] * len(level_names) + [range(n_samples)], names=level_names + ["sample"]), 
    columns=["y"], data=np.random.randn(len(index_vals) ** len(level_names) * n_samples)
).reset_index()

print("Only some sampled")
simple_X = formulaic.Formula("y ~ (A + B + C) ** 3").get_model_matrix(ds_simple.query("A != 'a' or B == 'a'"))[1]
describe_design_matrix(simple_X)

print("Reduced X")
reduced_X = simple_X.loc[:, [col for col in simple_X.columns if not col.startswith("A[T.b]:B")]]
describe_design_matrix(reduced_X)

Since formulaic appears to sort the column names before forming the interactions, a general solution would need the test to be similar to:

reduced_X = simple_X.loc[
    :, 
    [
        col for col in simple_X.columns 
        if (
            # Check for interaction terms involving A == 'b' ...
            (not col.startswith("A[T.b]:") and ":A[T.b]" not in col)
            # ... and any value of B
            or "B" not in col
        )
        # Equivalently,
        # if not ((col.startswith("A[T.b]:") or ":A[T.b]" in col) and "B" in col
        # I may need to expand the test for B to avoid problems if, for example, A.endswith(B)
        # (":B[" in col or col.startswith("B[T."))
        # should work
    ]
]

I'm not entirely sure how to tell formulaic that A[T.a]:B[T.x] exists only for x = a, so it needs to move its A- definition from A[T.a]:B[T.{x != a}] (which doesn't exist) to A[T.b]:B[T.{x !=a}].

This looks like most of an implementation and a start at a test suite for this functionality if an API gets decided.
(There will be problems if someone decides to name their categorical variables "some_name" and ":some_name[T.a_val_in_some_name]:", but I'm not seeing a way around that with this approach.
I have no problem with avoiding weird names like this.)

from formulaic.

matthewwardrop avatar matthewwardrop commented on July 19, 2024

Hi @DWesl ! Sorry for the delay on this. I usually only work on Formulaic on the weekends, in and around family stuff.

This is a very interesting problem. The ensure_full_rank option currently only guarantees structural full-rankness. That is, it assumes that after categorical encoding, the data itself is full rank.

I'm a little reluctant to add functionality along these lines to the formula grammar itself, since I think that will complicate things a lot. However I do see two alternative paths to supporting something like this downstream (or even integrating it directly into Formulaic).

(1) As a post-model-matrix-generation transform that looks at the structure and patches it to be full rank.
(2) Implementing it as a transform.

For (2), it should be "trivial" to add this as a transform, but making it a stateful transform (something that can be reused on a different dataset) would be more tricky. I'm imagining a new transform that look something like:

model_matrix('~ rank_reduced_product( C(A), C(B), ..., options=...)') 

Obviously as the number of interactions grow, this looks more and more ugly.

What are your thoughts?

from formulaic.

matthewwardrop avatar matthewwardrop commented on July 19, 2024

Oh... And you could explicitly write out the interactions:

model_matrix("{C(A)['a']}:{C(B)['b']}",...)

This syntax should work in the latest master, but with some tweaks to the parser perhaps we could enable doing:

model_matrix("A.a:B.b + ...")

If we did that, would that be sufficient for your use-case? You'd have to expand your products manually.

from formulaic.

DWesl avatar DWesl commented on July 19, 2024

There's certainly the option of writing out the interactions explicitly, but I've worked around this for now by using:

reduced_X = simple_X.loc[
    :, 
    [
        col for col in simple_X.columns 
        if (
            # Check for interaction terms involving A == 'b' ...
            if not ((col.startswith("A[T.b]:") or ":A[T.b]" in col) and
            # I need to expand the test for B to avoid problems if, for example, A.endswith(B)
            (":B[" in col or col.startswith("B[T."))
        )
    ]
]

(The problem that sparked this issue has the second column named essentially BA, which requires a bit more care than just B)

I would not add it to the grammar, I was thinking a keyword option to the Formula.get_model_matrix method, since this kind of thing very much depends on the exact way the data was collected.

Possibly a better (more consistent) way to do this would be to use the grammar to set B[T.b] as the baseline and removing the interaction terms which are always zero. This could be really easy to do in CSC sparse format (if the next column starts at the same place as the current column, there's no nonzero elements), and np.count_nonzero(..., axis=0) > 0 should give an index that could be used to drop columns in everything else.

... that's not currently implemented in formulaic, but the idea should be sound.

I have no idea which of these better conforms to the expectations of people looking at linear regression coefficients or ANOVA tables.

from formulaic.

matthewwardrop avatar matthewwardrop commented on July 19, 2024

Sorry for the delay here again @DWesl ! I've had a lot on my plate the last few weeks.

I think I am persuaded that adding a keyword argument along the lines of "drop_columns_with_no_support" or "drop_all_zero_columns" could be useful, and that having be part of the model matrix generation process is valuable because that way new data will be forced to respect the limited interactions even if the new data has support for these interactions.

Would that meet your use-case?

from formulaic.

DWesl avatar DWesl commented on July 19, 2024

That gets most of the way there; I'd need to reorder the variables in the Categorical to get a column with all zeros, but that's not that much effort.

from formulaic.

DWesl avatar DWesl commented on July 19, 2024

I asked about this in pydata/patsy#161, and puzzled out what formula would be needed to ensure the all-zero columns happen: (C(A, Treatment('b')) + B + C) ** 3. I don't think formulaic supports changing the reference level yet, but if it respects the order of the levels in a pandas.Categorical dtype, I should be able to achieve the same effect by changing the order there.

from formulaic.

matthewwardrop avatar matthewwardrop commented on July 19, 2024

Hi @DWesl !

As I prepare formulaic for the v1.0 release, I wanted to go back through all the old issues and make sure things were dealt with properly. I know it has been a really long time (sorry). Formulaic has come a long way during this time, but I don't think this specific issue has been "solved".

Formulaic does now support customizing the reference level in contrasts/etc, as well as explicitly ordering the levels and respecting the order of factors in the formula, so you could use what you wrote above... but, there's still room for improvement.

Has your thinking in this space evolved?

from formulaic.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.