
Comments (2)

matthewwardrop commented on June 27, 2024

Hi @bnjhng. Thanks for reaching out.

In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?

In terms of the right solution to this, your approach isn't crazy! It likely misses some edge-cases, but it probably works in the majority of cases. I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?

from formulaic.

bnjhng commented on June 27, 2024

> In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?

Thanks for linking it! Somehow I missed that section as I was going through the docs.

> I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?

We have found that, very often, when performing operations that do not involve row indexing (i.e., the vast majority of data transformations), working with a dictionary of numpy arrays has a speed advantage over working with a pandas DataFrame.

> It likely misses some edge-cases, but it probably works in the majority of cases.

This is indeed our conclusion! The main limitation we have encountered so far is with categorical encoding, and we have isolated the main issue to the following:

>>> from formulaic.materializers.types import FactorValues
>>> import pandas as pd
>>> import numpy as np
>>> 
>>> # while this works as expected:
>>> print(pd.Series(FactorValues(pd.Series(["a", "b", "c"]))))
0    a
1    b
2    c
dtype: object

>>> # this doesn't give the expected results:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"]))))
0    abc
1     bc
2      c
dtype: object

>>> # and this straight up errors out:
>>> print(pd.Series(FactorValues(np.array([1, 2, 3]))))
TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got FactorValues)

The docstring of FactorValues says that it is:

> A convenience wrapper that surfaces a FactorValuesMetadata instance at
> <object>.__formulaic_metadata__. This wrapper can otherwise wrap any
> object and behaves just like that object.

But clearly, in the case of numpy.ndarray, the wrapper doesn't behave just like numpy.ndarray. One way to get the code above to work properly is to explicitly access the __wrapped__ attribute:

# both of the following work as expected:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"])).__wrapped__))
0    a
1    b
2    c
dtype: object

>>> print(pd.Series(FactorValues(np.array([1, 2, 3])).__wrapped__))
0    1
1    2
2    3
dtype: int64

However, there are places in the formulaic code that simply do pandas.Series(data) instead of pandas.Series(data.__wrapped__), for example here and here.

Would it be possible to fix this limitation with formulaic?

