Comments (2)
Hi @bnjhng. Thanks for reaching out.
In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?
In terms of the right solution to this, your approach isn't crazy! It likely misses some edge-cases, but it probably works in the majority of cases. I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?
from formulaic.
> In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?
Thanks for linking it! Somehow I missed that section as I was going through the docs.
> I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?
We have found that, very often, when an operation does not involve row indexing (which covers the vast majority of data transformations), working with a dictionary of numpy arrays is faster than working with a pandas DataFrame.
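For reference, the cast being discussed is short. The following is only a minimal sketch, assuming every array in the dict is one-dimensional and of equal length; the names `data` and `df` are illustrative and not taken from formulaic:

```python
import numpy as np
import pandas as pd

# Hypothetical input: a plain dict of equal-length 1-D numpy arrays.
data = {
    "x": np.array([1.0, 2.0, 3.0]),
    "group": np.array(["a", "b", "c"]),
}

# The cast suggested above: build a DataFrame from the dict.
df = pd.DataFrame(data)

# A column-wise transformation that needs no row index can be written
# against either representation and yields the same values.
scaled_from_dict = data["x"] * 2
scaled_from_df = df["x"] * 2

assert np.array_equal(scaled_from_dict, scaled_from_df.to_numpy())
```

Whether the dict form is actually faster depends on the workload, so it is worth timing both on representative data.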
> It likely misses some edge-cases, but it probably works in the majority of cases.
This is indeed our conclusion! The main limitation we have encountered so far is with categorical encoding. And we have isolated the main issue to the following:
>>> from formulaic.materializers.types import FactorValues
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # while this works as expected:
>>> print(pd.Series(FactorValues(pd.Series(["a", "b", "c"]))))
0    a
1    b
2    c
dtype: object
>>> # this doesn't give the expected results:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"]))))
0    abc
1     bc
2      c
dtype: object
>>> # and this straight up errors out:
>>> print(pd.Series(FactorValues(np.array([1, 2, 3]))))
TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got FactorValues)
The docstring of `FactorValues` says that it is:
> A convenience wrapper that surfaces a `FactorValuesMetadata` instance at `<object>.__formulaic_metadata__`. This wrapper can otherwise wrap any object and behaves just like that object.
But clearly, in the case of `numpy.ndarray`, the wrapper doesn't behave just like `numpy.ndarray`. To get the code above to work properly, one way is to explicitly access the `__wrapped__` attribute:
>>> # both of the following work as expected:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"])).__wrapped__))
0    a
1    b
2    c
dtype: object
>>> print(pd.Series(FactorValues(np.array([1, 2, 3])).__wrapped__))
0    1
1    2
2    3
dtype: int64
However, there are places in the formulaic code that simply do `pandas.Series(data)` instead of `pandas.Series(data.__wrapped__)`, for example here and here.
Would it be possible to fix this limitation with formulaic?
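One possible shape for such a fix is a small helper that peels off the wrapper before the data reaches pandas. This is only a sketch: `unwrap` and `StandInProxy` are hypothetical names used for illustration, not part of formulaic, and the stand-in class only mimics the `__wrapped__` attribute shown above.

```python
from typing import Any


def unwrap(values: Any) -> Any:
    """Return the wrapped object if `values` exposes `__wrapped__`, else `values` itself."""
    return getattr(values, "__wrapped__", values)


class StandInProxy:
    """Hypothetical stand-in for a wrapper such as FactorValues (illustration only)."""

    def __init__(self, wrapped: Any) -> None:
        self.__wrapped__ = wrapped


raw = [1, 2, 3]
assert unwrap(StandInProxy(raw)) is raw  # the wrapper is peeled off
assert unwrap(raw) is raw                # plain objects pass through unchanged
```

With a helper like this, a call site would read `pandas.Series(unwrap(data))`, which behaves the same for wrapped and unwrapped inputs.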
from formulaic.