
sklearndf's People

Contributors

j-ittner, jason-bentley, joerg-schneider, konst-int-i, mgelsm, mtsokol, sithankanna

sklearndf's Issues

Add Maximum Relevance Minimum Redundancy as feature selection algorithm

Is your feature request related to a problem? Please describe.
This feature is not directly related to a problem; it is an enhancement of existing functionality. As suggested by Julian King on the facet Slack channel, we could add Maximum Relevance Minimum Redundancy (MRMR) as a feature selection algorithm.

The algorithm is explained in the following papers:
https://arxiv.org/pdf/1908.05376.pdf
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1423-9

Describe the solution you'd like
Implement MrmrDF in a similar fashion to BorutaDF such that it can be passed into the sklearndf pipeline.
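
To make the request concrete, here is a minimal sketch of a greedy MRMR selection. It is a simplified variant, not the exact scoring from the linked papers and not an existing sklearndf API: relevance is assumed to be the absolute Pearson correlation with the target, redundancy the mean absolute correlation with the features already selected, and the function name mrmr_select and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd


def mrmr_select(X: pd.DataFrame, y: pd.Series, n_features: int) -> list:
    """Greedy MRMR sketch: pick features with high relevance and low redundancy."""
    # relevance: absolute Pearson correlation of each feature with the target
    relevance = X.corrwith(y).abs()
    # redundancy source: pairwise absolute correlations between features
    feature_corr = X.corr().abs()

    # start with the single most relevant feature
    selected = [relevance.idxmax()]
    candidates = [c for c in X.columns if c not in selected]

    while candidates and len(selected) < n_features:
        # mean correlation of each candidate with the features chosen so far
        redundancy = feature_corr.loc[candidates, selected].mean(axis=1)
        # quotient score; the small floor avoids division by zero
        scores = relevance[candidates] / redundancy.clip(lower=1e-6)
        best = scores.idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected


# toy data: "a_copy" is nearly identical to "a", "b" carries independent signal
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({"a": a, "a_copy": a + 0.01 * rng.normal(size=200), "b": b})
y = pd.Series(a + 0.5 * b)

selected = mrmr_select(X, y, n_features=2)
print(selected)
```

In this toy setup the second pick should be the independent feature rather than the near-duplicate, which is the behaviour an MrmrDF wrapper would expose inside a pipeline.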

Describe alternatives you've considered
The paper also suggests using a redundancy matrix to shed some light on the feature selection, as shown below. While this is open for discussion, I would not use this output, to avoid confusion with the SHAP value redundancy calculated as part of the feature selection.

image

Cannot use ColumnTransformerDF inside of StackingRegressorDF

Summary:

Using StackingRegressorDF on pipelines containing a ColumnTransformerDF raises an error on .fit.

Using a StackingRegressorDF as the last part of a PipelineDF works as expected. But creating multiple PipelineDF objects with ColumnTransformerDF and then stacking these fails with the following error:

TypeError: StackingRegressorDF.fit: ColumnTransformerDF.fit_transform: arg y must be None, or a pandas Series or DataFrame

Root cause

Most likely the reason is this line in StackingRegressor.fit:

y = column_or_1d(y, warn=True)
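
A small check of what that call does to a pandas Series, using only native scikit-learn. This is a sketch supporting the suspected root cause, not a confirmed diagnosis: if the native fit passes the converted y on to the wrapped pipelines, the DF wrappers would receive an array instead of a Series, which would match the error message above.

```python
import numpy as np
import pandas as pd
from sklearn.utils import column_or_1d

# a target passed in as a pandas Series, as in the example below
y = pd.Series([0.1, 0.2, 0.3], name="y")

# the conversion applied in the native stacking estimator
y_converted = column_or_1d(y, warn=True)

# the Series comes back as a plain numpy array, so name and index are lost
print(type(y).__name__)
print(type(y_converted).__name__)
```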

Reproducible example:

from sklearndf.pipeline import PipelineDF
from sklearndf.regression import LinearRegressionDF, ElasticNetDF
from sklearndf.transformation import ColumnTransformerDF, StandardScalerDF
from sklearndf.regression import StackingRegressorDF

import pandas as pd
import numpy as np

# toy data set
np.random.seed(1)
data = pd.DataFrame({
    'x1': np.random.uniform(size=(10,)),
    'x2': np.random.uniform(size=(10,)),
    'y': np.random.uniform(size=(10,)),
})

# basic building blocks
model1 = LinearRegressionDF()
model2 = ElasticNetDF()
preprocessing = ColumnTransformerDF([
    ('x1', StandardScalerDF(), ['x1']),
    ('x2', 'passthrough', ['x2']),
])

# Pipeline with stack works
pipeline = PipelineDF([
    ('preprocessing', preprocessing),
    ('stack', StackingRegressorDF([
        ('model1', model1),
        ('model2', model2),
    ]))
])
pipeline.fit(data, data['y'])
print(pipeline.predict(data))

# Stack of Pipelines doesn't
stack_of_pipelines = StackingRegressorDF([
    ('pipeline1', PipelineDF([
        ('preprocessing', preprocessing),
        ('model1', model1)
    ])),
    ('pipeline2', PipelineDF([
        ('preprocessing', preprocessing),
        ('model2', model2)
    ]))
])
stack_of_pipelines.fit(data, data['y'])

Native XGBoost support

Is your feature request related to a problem? Please describe.
XGBoost is currently not natively supported by sklearndf.

Describe the solution you'd like
Currently, users can work around this individually by using the make_df_regressor wrapper:

from xgboost import XGBRegressor
from sklearndf.wrapper import make_df_regressor
XGBRegressorDF = make_df_regressor(XGBRegressor)

It would be desirable to move this directly into sklearndf.regression and sklearndf.classification for the XGBoost regressor and classifier, respectively.

To avoid additional dependencies, we should make an assertion that xgboost must be installed if the XGBRegressorDF/XGBClassifierDF is used.
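
One way such a check could look, as a sketch only. The helper name require_optional_dependency and its error message are hypothetical and not part of sklearndf; the idea is to test for the package without importing it and fail with a clear message only when the XGBoost wrappers are actually requested.

```python
import importlib.util


def require_optional_dependency(package: str, feature: str) -> None:
    """Raise a descriptive ImportError if an optional package is not installed."""
    # find_spec returns None when a top-level package cannot be located
    if importlib.util.find_spec(package) is None:
        raise ImportError(
            f"{feature} requires the optional package '{package}'; "
            f"install it with: pip install {package}"
        )


# hypothetical usage at the point where XGBRegressorDF would be defined
try:
    require_optional_dependency("xgboost", "XGBRegressorDF")
    xgboost_available = True
except ImportError as error:
    xgboost_available = False
    print(error)
```

With a guard like this, xgboost stays out of the hard requirements and users without it only see an error when they ask for the XGBoost wrappers.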

Describe alternatives you've considered
n/a

Additional context
n/a

OneHotEncoderDF: OneHot encoder wrapper fails for columns reduction options

Describe the bug
The wrapper for the one-hot encoder fails when a column reduction option is used (drop="if_binary" or drop="first").

The wrapper automatically computes the expected number of columns of the transformed dataset without taking the drop option into account.
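
For illustration, the expected column count under each drop option can be sketched as below. This follows the documented behaviour of scikit-learn's OneHotEncoder for string values of drop; the function name and the example categories are hypothetical, and array-valued drop is not covered.

```python
from typing import Dict, List, Optional


def expected_one_hot_columns(
    categories: Dict[str, List[str]], drop: Optional[str] = None
) -> int:
    """Count one-hot output columns for the given categories and drop option."""
    total = 0
    for values in categories.values():
        n = len(values)
        if drop == "first":
            # one category is dropped from every feature
            n -= 1
        elif drop == "if_binary" and n == 2:
            # only features with exactly two categories lose a column
            n -= 1
        total += n
    return total


# hypothetical features: one binary, one with three categories
categories = {"gender": ["F", "M"], "plan": ["basic", "plus", "pro"]}

print(expected_one_hot_columns(categories))               # 5
print(expected_one_hot_columns(categories, "first"))      # 3
print(expected_one_hot_columns(categories, "if_binary"))  # 4
```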

To Reproduce
Steps to reproduce the behavior:

  1. open a notebook
  2. Run the following code
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearndf.pipeline import PipelineDF
from sklearndf.transformation import (
    ColumnTransformerDF,
    OneHotEncoderDF,
    SimpleImputerDF,
)
X_churn : pd.DataFrame = ...
y_churn : pd.Series = ...

# For categorical features we will use the mode as the imputation value and also one-hot encode
preprocessing_categorical = PipelineDF(
    steps=[
        ("imputer", SimpleImputerDF(strategy="most_frequent", fill_value="<na>")),
        ("one-hot", OneHotEncoderDF(sparse=False, drop="if_binary")),
    ]
)

# For numeric features we will impute using the median
preprocessing_numerical = SimpleImputerDF(strategy="median")

# Put the pipeline together
preprocessing_features = ColumnTransformerDF(
    transformers=[
        (
            "categorical",
            preprocessing_categorical,
            make_column_selector(dtype_include=object),
        ),
        (
            "numerical",
            preprocessing_numerical,
            make_column_selector(dtype_include=np.number),
        ),
    ]
)

# Run the preprocessing
transformed_features = preprocessing_features.fit_transform(X=X_churn, y=y_churn)
transformed_features.head()
  3. See error

Screenshot 2021-02-16 at 16 16 22

Expected behavior
Expected to see the transformed dataset with only one column for categorical columns that have only 2 unique values

  • Version: sklearndf==1.0.1

numpy RandomState is now legacy

Is your feature request related to a problem? Please describe.
The RandomState used from numpy is now legacy (https://numpy.org/doc/1.19/reference/random/legacy.html) and has been indirectly replaced by Generator (https://numpy.org/doc/stable/reference/random/generator.html).
This may also impact scikit-learn in the future, and a PR has been opened regarding this (scikit-learn/scikit-learn#16988).

Describe the solution you'd like
Given the dependency on scikit-learn and the fact that RandomState is now legacy, we should be proactive in making the necessary updates should RandomState be retired and/or scikit-learn replace RandomState.
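
For reference, a short sketch of the two NumPy APIs side by side. It only shows how a seeded Generator is created and used; it makes no claim about how sklearndf or scikit-learn would adopt it. Note that the two APIs use different default bit generators, so the same seed is not expected to give the same numbers.

```python
import numpy as np

# legacy API: a seeded RandomState instance
legacy = np.random.RandomState(42)
legacy_sample = legacy.uniform(size=3)

# current API: a seeded Generator created via default_rng
rng = np.random.default_rng(42)
sample = rng.uniform(size=3)

# the same seed gives the same stream within the Generator API
repeat = np.random.default_rng(42).uniform(size=3)
print(np.array_equal(sample, repeat))
```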

[ColumnTransformerDF] Allow "passthrough" option for remainder

Is your feature request related to a problem? Please describe.
Hi,
I would like to use the "passthrough" option for the remainder argument of ColumnTransformerDF as in the original version of ColumnTransformer of scikit-learn. For now, only "drop" is accepted.

Describe the solution you'd like
Right now, the only possible value for the remainder argument is "drop". This means that if we apply a ColumnTransformerDF to only some columns of a DataFrame df, the unused columns are dropped. The solution I am looking for is the "passthrough" option: the columns not affected by the transformations in ColumnTransformerDF are preserved and left unmodified in the final DataFrame df.
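
The difference between the two options can be seen with the native scikit-learn ColumnTransformer, which is the behaviour this request asks the DF wrapper to mirror. The toy DataFrame and its column names are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10, 20, 30], "c": [7, 8, 9]})

# remainder="drop": only the transformed column "a" is kept
dropped = ColumnTransformer(
    [("scale", StandardScaler(), ["a"])], remainder="drop"
).fit_transform(df)

# remainder="passthrough": "b" and "c" are appended unchanged
passed = ColumnTransformer(
    [("scale", StandardScaler(), ["a"])], remainder="passthrough"
).fit_transform(df)

print(dropped.shape)  # (3, 1)
print(passed.shape)   # (3, 3)
```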

Describe alternatives you've considered
I tried to implement a custom version of ColumnTransformerWrapperDF with the "passthrough" option. It is working so far but I did not test it extensively and I do not know all your requirements and constraints.

Original Code

class ColumnTransformerWrapperDF(
    TransformerWrapperDF[ColumnTransformer], metaclass=ABCMeta
):
    """
    DF wrapper for :class:`sklearn.compose.ColumnTransformer`.
    Requires all transformers passed as the ``transformers`` parameter to implement
    :class:`.TransformerDF`.
    """

    __DROP = "drop"
    __PASSTHROUGH = "passthrough"

    __SPECIAL_TRANSFORMERS = (__DROP, __PASSTHROUGH)

    def _validate_delegate_estimator(self) -> None:
        column_transformer: ColumnTransformer = self.native_estimator

        if column_transformer.remainder != ColumnTransformerWrapperDF.__DROP:
            raise ValueError(
                f"unsupported value for arg remainder: ({column_transformer.remainder})"
            )

        non_compliant_transformers: List[str] = [
            type(transformer).__name__
            for _, transformer, _ in column_transformer.transformers
            if not (
                isinstance(transformer, TransformerDF)
                or transformer in ColumnTransformerWrapperDF.__SPECIAL_TRANSFORMERS
            )
        ]
        if non_compliant_transformers:
            from .. import ColumnTransformerDF

            raise ValueError(
                f"{ColumnTransformerDF.__name__} only accepts instances of "
                f"{TransformerDF.__name__} or special values "
                f'"{" and ".join(ColumnTransformerWrapperDF.__SPECIAL_TRANSFORMERS)}" '
                "as valid transformers, but "
                f'also got: {", ".join(non_compliant_transformers)}'
            )

    def _get_features_original(self) -> pd.Series:
        """
        Return the series mapping output column names to original columns names.
        :return: the series with index the column names of the output dataframe and
        values the corresponding input column names.
        """

        return reduce(
            lambda x, y: x.append(y),
            (
                (
                    pd.Series(index=columns, data=columns)
                    if df_transformer == ColumnTransformerWrapperDF.__PASSTHROUGH
                    else df_transformer.feature_names_original_
                )
                for _, df_transformer, columns in self.native_estimator.transformers_
                if (
                    len(columns) > 0
                    and df_transformer != ColumnTransformerWrapperDF.__DROP
                )
            ),
        )

My Custom Version

class ColumnTransformerCustomWrapperDF(
    TransformerWrapperDF[ColumnTransformer], metaclass=ABCMeta
):
    """
    DF wrapper for :class:`sklearn.compose.ColumnTransformer`.
    Requires all transformers passed as the ``transformers`` parameter to implement
    :class:`.TransformerDF`.
    """

    __DROP = "drop"
    __PASSTHROUGH = "passthrough"

    __SPECIAL_TRANSFORMERS = (__DROP, __PASSTHROUGH)

    def _validate_delegate_estimator(self) -> None:
        column_transformer: ColumnTransformer = self.native_estimator

        if (
            column_transformer.remainder
            not in ColumnTransformerCustomWrapperDF.__SPECIAL_TRANSFORMERS
        ):
            raise ValueError(
                f"unsupported value for arg remainder: ({column_transformer.remainder})"
            )

        non_compliant_transformers: List[str] = [
            type(transformer).__name__
            for _, transformer, _ in column_transformer.transformers
            if not (
                isinstance(transformer, TransformerDF)
                or transformer
                in ColumnTransformerCustomWrapperDF.__SPECIAL_TRANSFORMERS
            )
        ]
        if non_compliant_transformers:
            authorised_transformers = " and ".join(
                ColumnTransformerCustomWrapperDF.__SPECIAL_TRANSFORMERS
            )
            raise ValueError(
                f"{ColumnTransformerDF.__name__} only accepts instances of "
                f"{TransformerDF.__name__} or special values "
                f'"{authorised_transformers}" as valid transformers, but '
                f'also got: {", ".join(non_compliant_transformers)}'
            )

    def _get_features_original(self) -> pd.Series:
        """
        Return the series mapping output column names to original columns names.
        :return: the series with index the column names of the output dataframe and
        values the corresponding input column names.
        """
        result = reduce(
            lambda x, y: x.append(y),
            (
                (
                    pd.Series(
                        index=self.feature_names_in_[columns],
                        data=self.feature_names_in_[columns],
                    )
                    if df_transformer == ColumnTransformerCustomWrapperDF.__PASSTHROUGH
                    else df_transformer.feature_names_original_
                )
                for _, df_transformer, columns in self.native_estimator.transformers_
                if (
                    len(columns) > 0
                    and df_transformer != ColumnTransformerCustomWrapperDF.__DROP
                )
            ),
        )
        return result

# Instantiation of ColumnCustomTransformerDF
ColumnCustomTransformerDF = make_df_transformer(
    ColumnTransformer,
    name="ColumnTransformerCustomDF",
    base_wrapper=ColumnTransformerCustomWrapperDF,
)

Additional context

  • OS: ubuntu 20.04 LTS
  • Python: 3.8.10
  • sklearndf==1.2.1

Thank you in advance for your help,

[BUG] - Inverse Transform does not work for StandardScaler

Describe the bug
Hi,
First of all, well done for the package. I have been using sklearndf for a couple of months and it is very handy!
However, I have noticed an issue with the transformers. In particular, I can't apply inverse_transform.

To Reproduce
For example, for StandardScalerDF, I can fit it, transform my DataFrame without issues. However, if I try to inverse the transformation, it fails.

import pandas as pd
import numpy as np
from sklearndf.transformation import StandardScalerDF

# Initialise a random DataFrame
df = pd.DataFrame(np.random.randint(1, 100, size=(10, 2)), columns=["A", "B"])
print(df)

# Instantiation and Fitting of a Standard Scaler
scaler = StandardScalerDF()
scaler.fit(df)
df = scaler.transform(df)
print(df)

# Inverse the Scaling
df = scaler.inverse_transform(df) ## ERROR --> NotFittedError: StandardScalerDF is not fitted

Did I miss anything?

Expected behavior
Expect the inverse transform to be performed.
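
As a reference for the expected behaviour, the same round trip with the native scikit-learn StandardScaler recovers the original values. This sketch uses the native estimator only and re-wraps the result in a DataFrame by hand; the DF wrapper would be expected to return a DataFrame directly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"A": [1.0, 5.0, 9.0], "B": [20.0, 40.0, 90.0]})

# fit and scale with the native estimator
scaler = StandardScaler().fit(df)
scaled = scaler.transform(df)

# invert the scaling and restore the original index and column names
restored = pd.DataFrame(
    scaler.inverse_transform(scaled), index=df.index, columns=df.columns
)

# the round trip should reproduce the input up to floating point error
round_trip_ok = np.allclose(restored, df)
print(round_trip_ok)
```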

Screenshots
image

First idea of where to look
I notice the usage of reset_fit() in the method inverse_transform of TransformerWrapperDF. Is it really needed? It is this command which generates the bug as it re-initializes the attribute self._features_in to None.
By commenting it, it works.

Desktop :

  • OS: ubuntu 20.04 LTS
  • Python: 3.8.10
  • sklearndf==1.2.1

Thank you in advance for your help,

Support for scikit-learn>=0.24

The current requirement is scikit-learn 0.23.x. scikit-learn 0.24.* has been out since December 2020 and is becoming the default version in most environments. This incompatibility prevents us from using sklearndf.
