abhayspawar / featexp


747.0 747.0 163.0 5.07 MB

Feature exploration for supervised learning

License: MIT License

Python 1.13% Jupyter Notebook 98.87%

data-exploration data-science feature-engineering machine-learning visualization

featexp's Issues

Bug in xgboost DMatrix

dtrain = xgb.DMatrix(X_test, label=y_test, missing=np.nan)
dtest = xgb.DMatrix(X_train, label=y_train, missing=np.nan)

Here X_test is used to build dtrain and X_train is used to build dtest. Shouldn't it be the other way around? Is this a mistake?

Bug in get_trend_stats()

I am getting this error when calling get_trend_stats():

ValueError Traceback (most recent call last)
in ()
----> 1 stats = get_trend_stats(data=train, target_col='CANTIDAD_DIR_REP_BIG_RT', data_test=test)
2 stats

~/anaconda3/lib/python3.6/site-packages/featexp/base.py in get_trend_stats(data, target_col, features_list, bins, data_test)
247 ignored.append(feature)
248 else:
--> 249 cuts, grouped = get_grouped_data(input_data=data, feature=feature, target_col=target_col, bins=bins)
250 trend_changes = get_trend_changes(grouped_data=grouped, feature=feature, target_col=target_col)
251 if has_test:

~/anaconda3/lib/python3.6/site-packages/featexp/base.py in get_grouped_data(input_data, feature, target_col, bins, cuts)
36 # if reduced_cuts>0:
37 # print('Reduced the number of bins due to less variation in feature')
---> 38 print(cuts)
39 cut_series = pd.cut(input_data[feature], cuts)
40 else:

~/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates)
226 bins = _convert_bin_to_numeric_type(bins, dtype)
227 if (np.diff(bins) < 0).any():
--> 228 raise ValueError('bins must increase monotonically.')
229
230 fac, bins = _bins_to_cuts(x, bins, right=right, labels=labels,

ValueError: bins must increase monotonically.

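I checked the value of cuts, and this is its content:
[475836.1, 897023.3999999999, 1256710.22, 1334681.24, 1838614.84, 1838614.8399999999, 3230684.84]
So there seems to be a bug: two consecutive cuts are practically identical (1838614.84 and 1838614.8399999999), and the second is marginally smaller than the first, so the bins no longer increase monotonically.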
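
One possible workaround, sketched here under assumptions: sort the cut points and drop any that are numerically indistinguishable from their predecessor before passing them to pd.cut. The helper name clean_cuts and the tolerance are hypothetical and are not part of featexp.

```python
import math


def clean_cuts(cuts, rel_tol=1e-9):
    """Sort cut points and drop those that are numerically the same as the previous one.

    Hypothetical helper for illustration; not part of featexp.
    """
    cleaned = []
    for c in sorted(cuts):
        # Keep a cut only if it differs meaningfully from the last one kept.
        if not cleaned or not math.isclose(c, cleaned[-1], rel_tol=rel_tol):
            cleaned.append(c)
    return cleaned


# The cut points reported in this issue.
cuts = [475836.1, 897023.3999999999, 1256710.22, 1334681.24,
        1838614.84, 1838614.8399999999, 3230684.84]
cleaned = clean_cuts(cuts)
print(cleaned)
```

The cleaned list is strictly increasing, so it satisfies the monotonicity check that raised the ValueError. The price is one fewer bin for this feature.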

Is it possible to use the package for binary classification?

Hello,

It is not clear to me whether this package can be applied to binary classification problems. Can you clarify?

If so, what does the average of the target mean in a binary classification problem?
Thank you
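
On the second question: when the target is coded as 0/1, its mean within a group is simply the fraction of positive examples in that group. A minimal sketch in plain Python, with made-up bin labels and targets; positive_rate_by_bin is a hypothetical helper and not a featexp function.

```python
from collections import defaultdict


def positive_rate_by_bin(bin_labels, targets):
    """Mean of a 0/1 target per bin, i.e. the share of positives in each bin."""
    totals = defaultdict(lambda: [0, 0])  # bin -> [sum of targets, row count]
    for b, t in zip(bin_labels, targets):
        totals[b][0] += t
        totals[b][1] += 1
    return {b: s / n for b, (s, n) in totals.items()}


# Made-up data: three rows fall in the "low" bin, two in the "high" bin.
bins = ["low", "low", "low", "high", "high"]
target = [0, 0, 1, 1, 1]
rates = positive_rate_by_bin(bins, target)
print(rates)
```

Here one of the three "low" rows is positive and both "high" rows are positive, so the per-bin means are one third and one.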

Is this project still maintained?

Thanks

Colab version compatibility

Great package. I have some issues when running !pip install featexp in Google Colab.
I get the following messages:

ERROR: google-colab 1.0.0 has requirement pandas~=0.24.0, but you'll have pandas 0.23.4 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.
Successfully installed featexp-0.0.5 matplotlib-3.0.2 numpy-1.15.4 pandas-0.23.4
WARNING: The following packages were previously imported in this runtime:
[matplotlib,mpl_toolkits,numpy,pandas]
You must restart the runtime in order to use newly installed versions.

As a consequence (I believe) commands like dataframe.head() no longer work.
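
The warning in the log says the runtime must be restarted before the newly installed versions take effect. A rough way to check for that situation, using only the standard library, is to compare the version a module reports in the running process with the version installed on disk. The helper names below are hypothetical, and the string comparison is an assumption that may not hold for every package.

```python
import sys
from importlib import metadata


def loaded_vs_installed(package):
    """Return (version loaded in this process, version installed on disk)."""
    module = sys.modules.get(package)
    loaded = getattr(module, "__version__", None) if module else None
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None
    return loaded, installed


def needs_restart(package):
    """True when a package is already imported but a different version is installed."""
    loaded, installed = loaded_vs_installed(package)
    return loaded is not None and installed is not None and loaded != installed


for name in ["pandas", "numpy", "matplotlib"]:
    print(name, loaded_vs_installed(name), needs_restart(name))
```

If needs_restart returns True for pandas, restarting the runtime as the log suggests is the likely fix.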

featexp_demo file not exist error

FileNotFoundErrorTraceback (most recent call last)
in ()
----> 1 X_train, X_test, y_train, y_test, train_users, test_users = import_and_create_train_test_data()
.....(blablabla)
FileNotFoundError: File b'demo/data/application_train.csv' does not exist

Why does this happen? Does anyone know how to solve it?
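
The path in the error is relative, so it is resolved against the current working directory of the notebook. A small diagnostic sketch (describe_data_path is a hypothetical helper) shows where the file is being looked for and whether it is actually there:

```python
from pathlib import Path


def describe_data_path(relative_path, base_dir=None):
    """Resolve a relative path against base_dir (default: cwd) and report if it exists."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    full = (base / relative_path).resolve()
    return {"resolved": str(full), "exists": full.is_file()}


# The path taken from the error message above.
info = describe_data_path("demo/data/application_train.csv")
print(info)
```

If exists is False, either the notebook is running from a different directory than expected or the data file has not been placed under demo/data.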

Pandas SettingWithCopyWarning

When I run get_trend_stats I get the following warning multiple times:

featexp/base.py:23: SettingWithCopyWarning
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The origin of the problem is the fact that get_grouped_data changes its input dataframe. It might be better to copy the input data before doing anything with it.

Setup:

Python 3.8.13
featexp 0.0.7
Pandas 1.4.3
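
A minimal sketch of the suggested fix, assuming the function adds a bin column to its input: copy the dataframe first and modify only the copy. The function add_bin_column is hypothetical and only mimics the pattern; it is not the actual featexp code.

```python
import pandas as pd


def add_bin_column(input_data, feature, cuts):
    """Return a binned copy of input_data, leaving the caller's dataframe untouched."""
    data = input_data.copy()  # work on a copy so a slice passed in is never mutated
    data[feature + "_bin"] = pd.cut(data[feature], cuts, include_lowest=True)
    return data


full = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "target": [0, 1, 0, 1]})
subset = full[full["x"] > 1.0]  # a slice, the usual trigger for the warning
binned = add_bin_column(subset, "x", cuts=[1.0, 2.5, 4.0])
print(binned)
```

Because the new column is written to the copy, the warning is not triggered and subset keeps its original columns.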

Matplotlib outputs

Hi again Abhay,

I ended up modifying the library locally, since it did not allow manipulating the output figures (for example, changing the resolution or saving PDF output instead of PNG). I think it is a fairly quick fix: the draw_plots function just needs a bit of restructuring.

Issue with get_trend_stats()

Hey,
What am I doing wrong here?
https://cl.ly/466adfd96d75 I am using the dataset in the article on credit risk.
Thanks,

bug in get_trend_stats

I ran into this error: ValueError: missing values must be missing in the same location both left and right sides. Can you help me solve it? Thank you very much.

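Problem with test-related file names in the code

In get_dataloader.py, test_name='testb' is defined, but in the original test data folder (downloaded from Baidu Cloud) all data files are prefixed with test_a and none start with testb, so running the code raises a file-not-found error.
Also, load_dataset in main_LR.py references '/test/testb_base.csv' in the same way, but there is no testb data.

Does testb actually refer to the files that start with test_a? Thanks!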

@wj19971997

Implementing sample/data weight

Hi Abhay,
I appreciate your great effort in publishing your code. I wonder if there is a way to apply a weight to each data point, since weighted data is common in my domain. Thanks.
Cheers,
Rui
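
For reference, a sketch of what weighting could look like at the level of a single statistic: the weighted mean of the target within each bin. This is plain Python with made-up data; weighted_mean_by_bin is a hypothetical helper, and nothing here reflects how featexp computes its statistics.

```python
from collections import defaultdict


def weighted_mean_by_bin(bin_labels, targets, weights):
    """Weighted mean of the target per bin: sum(w * t) / sum(w)."""
    sums = defaultdict(float)
    weight_totals = defaultdict(float)
    for b, t, w in zip(bin_labels, targets, weights):
        sums[b] += w * t
        weight_totals[b] += w
    # Skip bins whose total weight is zero to avoid dividing by zero.
    return {b: sums[b] / weight_totals[b] for b in sums if weight_totals[b] > 0}


# Made-up data: two bins, two rows each.
bins = ["a", "a", "b", "b"]
target = [0, 1, 1, 1]
weights = [3.0, 1.0, 2.0, 2.0]
result = weighted_mean_by_bin(bins, target, weights)
print(result)
```

With equal weights this reduces to the ordinary per-bin mean.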

KeyError: "Column 'target' does not exist!"

Hello, I am trying to use this wonderful tool, but I get the error 'KeyError: "Column 'target' does not exist!"'. I am pretty sure that 'target' is a column in the train data.
Could you help me with this? Many thanks.
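
The report does not show the cause. One thing worth ruling out (an assumption, not a confirmed diagnosis) is a column name that only looks like 'target', for example with stray whitespace or different capitalization. A small sketch with a hypothetical helper and made-up column names:

```python
def find_similar_columns(columns, wanted):
    """List column names that match `wanted` only after stripping spaces and lowercasing."""
    key = wanted.strip().lower()
    return [c for c in columns if str(c).strip().lower() == key and c != wanted]


# Made-up column names; in practice pass list(train.columns).
columns = ["id", " target", "Target ", "income"]
near_matches = find_similar_columns(columns, "target")
print(near_matches)
```

An empty result means the name is either present exactly or absent altogether, so the cause lies elsewhere.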

dependent library version old?

It seems that the required versions of numpy and matplotlib may be a little old?

cannot import name 'get_trend_stats_feature'

First, thanks for developing this package. I tried the "get_univariate_plots" function and it worked, but I cannot import the "get_trend_stats_feature" function. Do you have any ideas?

AssertionError: `result` has not been initialized.

Getting this error again and again while executing the command:
stats = get_trend_stats(data=data_train, target_col='target', data_test=data_split[0])

Update

Hello,

I forked the repo today to be able to use it with newer versions of the required libraries.
btw, it worked with:

numpy==1.17.4
pandas==0.25.3
matplotlib==3.0.2

Also, I made some cosmetic changes, which I tried to submit as a PR, but I guess you are not allowing them? I am getting the following error:

remote: Permission to abhayspawar/featexp.git denied to pauroger.
fatal: unable to access 'https://github.com/abhayspawar/featexp.git/': The requested URL returned error: 403