
Comments (2)

ThomasBury commented on May 27, 2024

Hi @Tialo,

NumPy arrays are homogeneous: a regular ndarray has a single dtype, so it is not well suited to mixed-type or non-numeric data. NumPy does provide structured arrays, which can hold fields of different dtypes, but they are more limited than pandas DataFrames. This is why CatBoost relies on lists, pd.DataFrame or Pool.

Using:

import numpy as np

X = np.array([[1, 2, 'a'], [3, 4, 'b']])
X

returns:

array([['1', '2', 'a'],
       ['3', '4', 'b']], dtype='<U11')

which means that every entry, including the numbers, is coerced to a Unicode string (dtype '<U11').

If you want to use NumPy arrays with non-numerical columns rather than pd.DataFrame, use structured arrays:

x = np.array([(1, 2, 'a'), (3, 4, 'b')],
             dtype=[('num1', 'i4'), ('num2', 'i4'), ('cat1', 'U10')])

returns

array([(1, 2, 'a'), (3, 4, 'b')],
      dtype=[('num1', '<i4'), ('num2', '<i4'), ('cat1', '<U10')])

and the conversion to a pd.DataFrame works as expected

import pandas as pd
x = pd.DataFrame(x)
x.dtypes

which returns

num1     int32
num2     int32
cat1    object
dtype: object

When working with heterogeneous data, prefer pandas over NumPy.

I hope it helps.

NB: I'm not sure I understood the original method you're referring to. Isn't it the same as the current definition of _create_shadow?

from arfs.

Tialo commented on May 27, 2024

I tried to fit Leshy with a dataframe that has both numerical and categorical features. In your fit method, this dataframe is transformed into one where every feature is encoded as categorical. Here is why this happens.

In this method you pass the dataframe to np.nan_to_num, which returns a NumPy ndarray. As you said, because the original dataframe contains at least one categorical column, NumPy has to store every value under one common dtype, so X.dtype will be object.

def _fit(self, X_raw, y, sample_weight=None):

The call in question:

X = np.nan_to_num(X)
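
A minimal sketch of this step, assuming a toy dataframe with one numeric and one string column (the data and column names are made up for illustration and are not taken from arfs):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical (string) column.
df = pd.DataFrame({
    "num": [10.5, 2.0, 300.0],
    "cat": pd.Series(["a", "b", "a"], dtype=object),
})

# np.nan_to_num converts its input to an ndarray first. A mixed-type
# dataframe has no common numeric dtype, so the result falls back to object.
X = np.nan_to_num(df)

print(type(X).__name__)  # ndarray
print(X.dtype)           # object
```

The per-column dtypes of the dataframe are lost at this point, which is what the rest of the chain builds on.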

Also, since X is always a NumPy ndarray at this point (np.nan_to_num returns an ndarray), this condition is always False.

if not isinstance(X, np.ndarray):

Then X is passed to this method.

cur_imp = self._add_shadows_get_imps(X, y, sample_weight, dec_reg)

There you take some columns and concatenate them with the shadow features, and the result is then passed to these functions.

imp = _get_shap_imp(

model, X_tt, y_tt, w_tt = _split_fit_estimator(

Finally, you build a pandas DataFrame from the NumPy ndarray whose dtype is object, so every column in X.dtypes is object.

After that, your function encodes all columns, because all of them are object.

def get_pandas_cat_codes(X):

After encoding, every column of the dataframe is int64, i.e. an ordinal-encoded feature.
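
The effect on a numeric column can be reproduced with a short sketch. It assumes the encoding step is roughly equivalent to casting each object column to category and taking its codes; the helper encode_object_columns below is hypothetical and is not the actual get_pandas_cat_codes from arfs.

```python
import numpy as np
import pandas as pd

# Object ndarray, as produced when a mixed-type dataframe is turned into an array.
X = np.array([[10.5, "a"], [2.0, "b"], [300.0, "a"]], dtype=object)

# Rebuilding a dataframe from it does not restore the numeric dtype.
df = pd.DataFrame(X, columns=["num", "cat"])

def encode_object_columns(frame):
    """Hypothetical stand-in: ordinal-encode every object column."""
    out = frame.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].astype("category").cat.codes
    return out

encoded = encode_object_columns(df)
print(df["num"].dtype)          # object
print(encoded["num"].tolist())  # [1, 0, 2]: ranks, not the original values
```

The values 10.5, 2.0 and 300.0 are replaced by their rank among the unique values, so the model no longer sees the original magnitudes.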

model = estimator.fit(

This leads to poor CatBoost performance: although CatBoost can encode categorical features itself, it is a bad idea to encode the numerical features as well. I hope this explains the problem; if not, feel free to ask anything!

Upd. Sorry for closing and reopening the issue, I misclicked.

