Comments (2)
Hi @Tialo,
A plain NumPy ndarray has a single dtype for all of its elements, so it is not well suited to data with mixed types or non-numeric columns. NumPy does provide structured arrays, which can hold fields of different data types, but they are limited compared to pandas DataFrames. This is why CatBoost relies on lists, pd.DataFrame or Pool.
Using:
import numpy as np
X = np.array([[1, 2, 'a'], [3, 4, 'b']])
X
returns:
array([['1', '2', 'a'],
       ['3', '4', 'b']], dtype='<U11')
which means that all the entries, including the integers, are coerced to Unicode strings (dtype '<U11'); the numeric type information is lost.
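A quick way to check what happened to the values is to inspect the dtype kind and the type of a single element. This is only a minimal sketch; the variable name X_obj is introduced here for illustration.

```python
import numpy as np

# Mixed Python values in a plain ndarray are coerced to one common dtype.
X = np.array([[1, 2, 'a'], [3, 4, 'b']])

print(X.dtype.kind)       # 'U': fixed-width Unicode string ('O' would mean object)
print(type(X[0, 0]))      # numpy.str_: the integer 1 is now the string '1'

# Asking for dtype=object keeps the original Python objects instead,
# but the array still carries no per-column type information.
X_obj = np.array([[1, 2, 'a'], [3, 4, 'b']], dtype=object)
print(type(X_obj[0, 0]))  # int
```

Either way, the array itself cannot tell you which columns were numeric and which were categorical.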
If you want to use NumPy arrays with non-numerical columns rather than pd.DataFrame, use structured arrays:
x = np.array([(1, 2, 'a'), (3, 4, 'b')],
             dtype=[('num1', 'i4'), ('num2', 'i4'), ('cat1', 'U10')])
x
returns
array([(1, 2, 'a'), (3, 4, 'b')],
      dtype=[('num1', '<i4'), ('num2', '<i4'), ('cat1', '<U10')])
and the conversion to a pd.DataFrame works as expected
x = pd.DataFrame(x)
x.dtypes
which returns
num1     int32
num2     int32
cat1    object
dtype: object
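Once the dtypes survive the conversion, the numeric and categorical columns can be told apart again. The following is a small sketch of that idea; the names num_cols and cat_cols are only illustrative and are not part of ARFS.

```python
import numpy as np
import pandas as pd

# Structured array: each field keeps its own dtype.
x = np.array([(1, 2, 'a'), (3, 4, 'b')],
             dtype=[('num1', 'i4'), ('num2', 'i4'), ('cat1', 'U10')])
df = pd.DataFrame(x)

# Split columns by dtype; anything non-numeric is treated as categorical here.
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in df.columns if c not in num_cols]

print(num_cols)  # ['num1', 'num2']
print(cat_cols)  # ['cat1']
```

A list like cat_cols is the kind of information a model such as CatBoost needs in order to treat only the categorical columns as categorical.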
When working with heterogeneous data, prefer pandas over NumPy.
I hope it helps.
NB: I'm not sure I understood the original method you're referring to. Isn't it the same as the current definition of _create_shadow?
from arfs.
I tried to fit Leshy with a dataframe that has both numerical and categorical features. In your fit method, this dataframe is transformed into a dataframe where every feature is encoded as categorical. I will show you why this happens.
In this method you pass the dataframe into the np.nan_to_num function, which returns a NumPy ndarray. As you said, because the original dataframe contains at least one categorical column, NumPy has to store every value under a single common dtype, thus X.dtype will be object.
Mentioned function.
Also, since X will always be a NumPy ndarray (which is true because np.nan_to_num returns an ndarray), this if is always False.
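To make this step concrete, here is a minimal reproduction outside of ARFS. The toy dataframe and its column names are made up for illustration, and the isinstance check only stands in for the kind of condition I mean, it is not copied from the library.

```python
import numpy as np
import pandas as pd

# Toy dataframe with one numerical and one categorical column (hypothetical data).
df = pd.DataFrame({"num1": [1.5, 2.5], "cat1": ["a", "b"]})

# np.nan_to_num converts its input to an ndarray before doing anything else.
X = np.nan_to_num(df)

print(type(X).__name__)             # ndarray
print(X.dtype)                      # object
print(isinstance(X, pd.DataFrame))  # False
```

So any branch that expects X to still be a DataFrame after this call cannot be reached.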
Then X is passed to this method.
There you take some columns and concatenate them with the shadow features, and the result is then passed to this function.
arfs/src/arfs/feature_selection/allrelevant.py
Line 1018 in ca71fb2
Finally, you build a pandas DataFrame from a NumPy ndarray whose dtype is object, so every entry of X.dtypes will be object.
After that, your function will encode all columns, because all of them are object.
Line 117 in ca71fb2
After encoding, each column of the dataframe will be int64, i.e. every feature ends up ordinal encoded.
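The loss of dtypes, and one possible way to get them back, can be sketched like this. It is only an illustration under my own assumptions (toy data, made-up variable names), not the ARFS code and not necessarily the fix you would want.

```python
import numpy as np
import pandas as pd

# Toy dataframe with one numerical and one categorical column (hypothetical data).
df = pd.DataFrame({"num1": [1.5, 2.5, 3.5], "cat1": ["a", "b", "a"]})

# Round trip through NumPy, as happens after np.nan_to_num.
X = np.nan_to_num(df)
rebuilt = pd.DataFrame(X, columns=df.columns)
print(rebuilt["num1"].dtype)   # object: the numeric column lost its dtype

# Possible workaround: remember the original dtypes and cast back.
restored = rebuilt.astype(df.dtypes.to_dict())
print(restored["num1"].dtype)  # float64
```

With the dtypes restored, an encoder that selects columns by dtype would leave num1 alone and only touch cat1.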
This leads to poor CatBoost performance: although CatBoost can encode categorical features itself, it is a bad idea to encode the numerical features as well. I hope I explained the problem clearly; if not, feel free to ask anything!
Upd. Sorry for closing and reopening the issue, I misclicked.