abhishekkrthakur / autoxgb

XGBoost + Optuna

License: Apache License 2.0

Makefile 0.46% Python 99.54%

autoxgb's Introduction

Hi there 👋

I'm a data scientist / machine learning engineer.

autoxgb's Issues

How to use pandas dataset instead of saving it and using the csv
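
A traceback elsewhere on this page shows AutoXGB reading its input with `pd.read_csv(self.train_filename)`, so `train_filename` appears to expect a file path rather than a DataFrame. A minimal workaround sketch, assuming that holds for your installed version: write the DataFrame to a CSV first and pass the path. The helper name `dataframe_to_csv_path` is hypothetical and not part of autoxgb.

```python
import tempfile
from pathlib import Path


def dataframe_to_csv_path(df, directory=None, name="train.csv"):
    """Write `df` to a CSV file and return the file path as a string.

    `df` can be any object with a pandas-style `to_csv(path, index=False)` method.
    If `directory` is None, a fresh temporary directory is created.
    """
    target_dir = Path(directory) if directory is not None else Path(tempfile.mkdtemp())
    target_dir.mkdir(parents=True, exist_ok=True)
    csv_path = target_dir / name
    df.to_csv(csv_path, index=False)
    return str(csv_path)


# Usage (hypothetical variable names):
#   train_filename = dataframe_to_csv_path(train_df)
#   axgb = AutoXGB(train_filename=train_filename, output="output", targets=["label"])
```

The temporary directory is not cleaned up automatically, so pass an explicit `directory` if you want to keep the file alongside the model output.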

AutoXGB for credit card fraud detection

Hey Abhishek, great work in setting up this really useful library, certainly makes the implementation of XGBoost much simpler. I ran a mini project using AutoXGB by objectively evaluating its use against the standard XGBoost. Happy to hear your thoughts. The writeup can be found here: https://towardsdatascience.com/autoxgb-for-financial-fraud-detection-f88f30d4734a?sk=13bbbe9761698db8d4c0ffef661db916

NaN issue when multi-target regression

The AutoXGB study does not launch when one of the target values is missing (NaN).

Is there any workaround?
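
One possible workaround, not confirmed by the maintainer, is to drop the rows where any target is empty before handing the CSV to AutoXGB. This discards data, so it only makes sense when few rows are affected. The sketch below uses only the standard library; the function name and the set of "missing" markers are assumptions.

```python
import csv


def drop_rows_with_missing_targets(src_path, dst_path, targets,
                                   missing=("", "nan", "NaN", "NA")):
    """Copy a CSV, skipping rows where any target column is missing.

    Returns the number of data rows kept.
    """
    kept = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            # Short rows yield None for absent fields, hence the `or ""`.
            if any((row[t] or "").strip() in missing for t in targets):
                continue
            writer.writerow(row)
            kept += 1
    return kept
```

The cleaned file can then be passed as `train_filename`.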

How to save the trained model?

AssertionError - assert version is not None

executed code:

>>> axgb = AutoXGB(
...     train_filename="X_train.csv",
...     output="output",
...     test_filename="X_valid.csv",
...     task="classification",
...     idx=None,
...     targets=["label"],
...     features=['feat1', 'feat2', 'feat3', 'feat4', 'feat5'],
...     categorical_features=None,
...     use_gpu=False,
...     num_folds=5,
...     seed=42,
...     num_trials=100,
...     time_limit=360,
...     fast=False,
... )

error log:

2023-04-21 04:37:30.385 | INFO     | autoxgb.autoxgb:_process_data:149 - Reading training data
2023-04-21 04:37:30.727 | INFO     | autoxgb.utils:reduce_memory_usage:48 - Mem. usage decreased to 5.07 Mb (80.9% reduction)
2023-04-21 04:37:30.732 | INFO     | autoxgb.autoxgb:_determine_problem_type:140 - Problem type: multi_class_classification
2023-04-21 04:37:30.851 | INFO     | autoxgb.utils:reduce_memory_usage:48 - Mem. usage decreased to 1.87 Mb (82.4% reduction)
2023-04-21 04:37:30.851 | INFO     | autoxgb.autoxgb:_create_folds:58 - Creating folds
2023-04-21 04:37:30.868 | INFO     | autoxgb.autoxgb:_process_data:170 - Encoding target(s)
2023-04-21 04:37:30.875 | INFO     | autoxgb.autoxgb:_process_data:195 - Found 0 categorical features.
2023-04-21 04:37:31.084 | INFO     | autoxgb.autoxgb:_process_data:236 - Model config: train_filename='X_train.csv' test_filename='X_valid.csv' idx='id' targets=['label'] problem_type=<ProblemType.multi_class_classification: 2> output='output' features=['feat1', 'feat2', 'feat3', 'feat4', 'feat5'] num_folds=5 use_gpu=False seed=42 categorical_features=[] num_trials=100 time_limit=360 fast=False
2023-04-21 04:37:31.084 | INFO     | autoxgb.autoxgb:_process_data:237 - Saving model config
2023-04-21 04:37:31.085 | INFO     | autoxgb.autoxgb:_process_data:241 - Saving encoders

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/models/venv-autoxgb/lib/python3.8/site-packages/autoxgb/autoxgb.py", line 247, in train
    best_params = train_model(self.model_config)
  File "/workspace/models/venv-autoxgb/lib/python3.8/site-packages/autoxgb/utils.py", line 207, in train_model
    study = optuna.create_study(
  File "/workspace/models/venv-autoxgb/lib/python3.8/site-packages/optuna/study/study.py", line 1136, in create_study
    storage = storages.get_storage(storage)
  File "/workspace/models/venv-autoxgb/lib/python3.8/site-packages/optuna/storages/__init__.py", line 31, in get_storage
    return _CachedStorage(RDBStorage(storage))
  File "/workspace/models/venv-autoxgb/lib/python3.8/site-packages/optuna/storages/_rdb/storage.py", line 187, in __init__
    self._version_manager.check_table_schema_compatibility()
  File "/workspace/models/venv-autoxgb/lib/python3.8/site-packages/optuna/storages/_rdb/storage.py", line 1310, in check_table_schema_compatibility
    current_version = self.get_current_version()
  File "/workspace/models/venv-autoxgb/lib/python3.8/site-packages/optuna/storages/_rdb/storage.py", line 1337, in get_current_version
    assert version is not None
AssertionError

pip freeze:

alembic==1.10.3
anyio==3.6.2
asgiref==3.6.0
attrs==23.1.0
autopage==0.5.1
autoxgb==0.2.2
click==8.1.3
cliff==4.2.0
cmaes==0.9.1
cmd2==2.4.3
colorlog==6.7.0
fastapi==0.70.0
greenlet==2.0.2
h11==0.14.0
idna==3.4
importlib-metadata==6.5.0
importlib-resources==5.12.0
joblib==1.1.0
loguru==0.5.3
Mako==1.2.4
MarkupSafe==2.1.2
numpy==1.21.3
optuna==2.10.0
packaging==23.1
pandas==1.3.4
pbr==5.11.1
prettytable==3.7.0
pyarrow==6.0.0
pydantic==1.8.2
pyperclip==1.8.2
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0
scikit-learn==1.0.1
scipy==1.10.1
six==1.16.0
sniffio==1.3.0
SQLAlchemy==2.0.9
starlette==0.16.0
stevedore==5.0.0
threadpoolctl==3.1.0
tqdm==4.65.0
typing_extensions==4.5.0
uvicorn==0.15.0
wcwidth==0.2.6
xgboost==1.5.0
zipp==3.15.0
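
The traceback fails inside Optuna's RDB storage, and the `pip freeze` output pairs `optuna==2.10.0` with `SQLAlchemy==2.0.9`. One plausible cause, not confirmed in this thread, is that this older Optuna release predates SQLAlchemy 2.x; if so, pinning `SQLAlchemy<2.0` in the environment may help. The sketch below only reports the installed major version; the helper names are hypothetical.

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a version string such as '2.0.9'."""
    return int(version.split(".")[0])


def installed_major(package):
    """Return the installed major version of `package`, or None if it is absent."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    major = installed_major("SQLAlchemy")
    if major is not None and major >= 2:
        print("SQLAlchemy 2.x detected; consider: pip install 'SQLAlchemy<2.0'")
    else:
        print(f"SQLAlchemy major version: {major}")
```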

AttributeError: dlsym(0x7fd108ca6760, XGDMatrixCreateFromDense): symbol not found

As per the subject, I am getting this error when running locally:


2021-11-01 15:45:04.651 | INFO     | autoxgb.autoxgb:__post_init__:42 - Output directory: output3
2021-11-01 15:45:04.652 | WARNING  | autoxgb.autoxgb:__post_init__:49 - No id column specified. Will default to `id`.
2021-11-01 15:45:04.653 | INFO     | autoxgb.autoxgb:_process_data:149 - Reading training data
2021-11-01 15:45:04.885 | INFO     | autoxgb.utils:reduce_memory_usage:48 - Mem. usage decreased to 2.19 Mb (76.0% reduction)
2021-11-01 15:45:04.891 | INFO     | autoxgb.autoxgb:_determine_problem_type:140 - Problem type: multi_class_classification
2021-11-01 15:45:04.892 | INFO     | autoxgb.autoxgb:_create_folds:58 - Creating folds
2021-11-01 15:45:04.922 | INFO     | autoxgb.autoxgb:_process_data:170 - Encoding target(s)
2021-11-01 15:45:04.931 | INFO     | autoxgb.autoxgb:_process_data:195 - Found 0 categorical features.
2021-11-01 15:45:05.054 | INFO     | autoxgb.autoxgb:_process_data:236 - Model config: train_filename='train.csv' test_filename=None idx='id' targets=['label'] problem_type=<ProblemType.multi_class_classification: 2> output='output3' features=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'y1', 'z1', 'z2', 'z3', 'z4'] num_folds=5 use_gpu=False seed=42 categorical_features=[] num_trials=100 time_limit=360 fast=False
2021-11-01 15:45:05.054 | INFO     | autoxgb.autoxgb:_process_data:237 - Saving model config
2021-11-01 15:45:05.055 | INFO     | autoxgb.autoxgb:_process_data:241 - Saving encoders
[I 2021-11-01 15:45:05,230] A new study created in RDB with name: autoxgb
[W 2021-11-01 15:45:05,339] Trial 0 failed because of the following error: AttributeError('dlsym(0x7fd108ca6760, XGDMatrixCreateFromDense): symbol not found')
Traceback (most recent call last):
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/autoxgb/utils.py", line 172, in optimize
    model.fit(
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py", line 506, in inner_f
    return f(**kwargs)
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py", line 1231, in fit
    train_dmatrix, evals = _wrap_evaluation_matrices(
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py", line 286, in _wrap_evaluation_matrices
    train_dmatrix = create_dmatrix(
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py", line 1245, in <lambda>
    create_dmatrix=lambda **kwargs: DMatrix(nthread=self.n_jobs, **kwargs),
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py", line 506, in inner_f
    return f(**kwargs)
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py", line 616, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py", line 707, in dispatch_data_backend
    return _from_pandas_df(data, enable_categorical, missing, threads,
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py", line 299, in _from_pandas_df
    return _from_numpy_array(data, missing, nthread, feature_names,
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py", line 179, in _from_numpy_array
    _LIB.XGDMatrixCreateFromDense(
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/Users/A124661/opt/anaconda3/envs/deep_py38/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: dlsym(0x7fd108ca6760, XGDMatrixCreateFromDense): symbol not found
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/pp/ym01m3sx0hg3my_gzpsdl8680000gp/T/ipykernel_728/1462055845.py in <module>
     16     fast=fast,
     17 )
---> 18 axgb.train()

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/autoxgb/autoxgb.py in train(self)
    245     def train(self):
    246         self._process_data()
--> 247         best_params = train_model(self.model_config)
    248         logger.info("Training complete")
    249         self.predict(best_params)

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/autoxgb/utils.py in train_model(model_config)
    211         load_if_exists=True,
    212     )
--> 213     study.optimize(optimize_func, n_trials=model_config.num_trials, timeout=model_config.time_limit)
    214     return study.best_params
    215 

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
    398             )
    399 
--> 400         _optimize(
    401             study=self,
    402             func=func,

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
     64     try:
     65         if n_jobs == 1:
---> 66             _optimize_sequential(
     67                 study,
     68                 func,

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar)
    161 
    162         try:
--> 163             trial = _run_trial(study, func, catch)
    164         except Exception:
    165             raise

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py in _run_trial(study, func, catch)
    262 
    263     if state == TrialState.FAIL and func_err is not None and not isinstance(func_err, catch):
--> 264         raise func_err
    265     return trial
    266 

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/optuna/study/_optimize.py in _run_trial(study, func, catch)
    211 
    212     try:
--> 213         value_or_values = func(trial)
    214     except exceptions.TrialPruned as e:
    215         # TODO(mamu): Handle multi-objective cases.

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/autoxgb/utils.py in optimize(trial, xgb_model, use_predict_proba, eval_metric, model_config)
    170 
    171         else:
--> 172             model.fit(
    173                 xtrain,
    174                 ytrain,

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
    504         for k, arg in zip(sig.parameters, args):
    505             kwargs[k] = arg
--> 506         return f(**kwargs)
    507 
    508     return inner_f

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
   1229 
   1230         model, feval, params = self._configure_fit(xgb_model, eval_metric, params)
-> 1231         train_dmatrix, evals = _wrap_evaluation_matrices(
   1232             missing=self.missing,
   1233             X=X,

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py in _wrap_evaluation_matrices(missing, X, y, group, qid, sample_weight, base_margin, feature_weights, eval_set, sample_weight_eval_set, base_margin_eval_set, eval_group, eval_qid, create_dmatrix, enable_categorical, label_transform)
    284 
    285     """
--> 286     train_dmatrix = create_dmatrix(
    287         data=X,
    288         label=label_transform(y),

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/sklearn.py in <lambda>(**kwargs)
   1243             eval_group=None,
   1244             eval_qid=None,
-> 1245             create_dmatrix=lambda **kwargs: DMatrix(nthread=self.n_jobs, **kwargs),
   1246             enable_categorical=self.enable_categorical,
   1247             label_transform=label_transform,

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py in inner_f(*args, **kwargs)
    504         for k, arg in zip(sig.parameters, args):
    505             kwargs[k] = arg
--> 506         return f(**kwargs)
    507 
    508     return inner_f

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/core.py in __init__(self, data, label, weight, base_margin, missing, silent, feature_names, feature_types, nthread, group, qid, label_lower_bound, label_upper_bound, feature_weights, enable_categorical)
    614             return
    615 
--> 616         handle, feature_names, feature_types = dispatch_data_backend(
    617             data,
    618             missing=self.missing,

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical)
    705         return _from_tuple(data, missing, threads, feature_names, feature_types)
    706     if _is_pandas_df(data):
--> 707         return _from_pandas_df(data, enable_categorical, missing, threads,
    708                                feature_names, feature_types)
    709     if _is_pandas_series(data):

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py in _from_pandas_df(data, enable_categorical, missing, nthread, feature_names, feature_types)
    297     data, feature_names, feature_types = _transform_pandas_df(
    298         data, enable_categorical, feature_names, feature_types)
--> 299     return _from_numpy_array(data, missing, nthread, feature_names,
    300                              feature_types)
    301 

~/opt/anaconda3/envs/deep_py38/lib/python3.8/site-packages/xgboost/data.py in _from_numpy_array(data, missing, nthread, feature_names, feature_types)
    177     config = bytes(json.dumps(args), "utf-8")
    178     _check_call(
--> 179         _LIB.XGDMatrixCreateFromDense(
    180             _array_interface(data),
    181             config,

~/opt/anaconda3/envs/deep_py38/lib/python3.8/ctypes/__init__.py in __getattr__(self, name)
    384         if name.startswith('__') and name.endswith('__'):
    385             raise AttributeError(name)
--> 386         func = self.__getitem__(name)
    387         setattr(self, name, func)
    388         return func

~/opt/anaconda3/envs/deep_py38/lib/python3.8/ctypes/__init__.py in __getitem__(self, name_or_ordinal)
    389 
    390     def __getitem__(self, name_or_ordinal):
--> 391         func = self._FuncPtr((name_or_ordinal, self))
    392         if not isinstance(name_or_ordinal, int):
    393             func.__name__ = name_or_ordinal

AttributeError: dlsym(0x7fd108ca6760, XGDMatrixCreateFromDense): symbol not found
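
The final frames show `ctypes` raising `AttributeError` while looking up `XGDMatrixCreateFromDense` in the loaded XGBoost shared library. That often suggests the Python package and the native library on disk come from different XGBoost versions (for example, a stale conda `libxgboost` next to a pip-installed `xgboost`), though the report does not confirm this. A small diagnostic sketch, assuming `xgboost.libpath.find_lib_path()` is available in the installed version; `has_symbol` and `check_xgboost_library` are hypothetical helper names.

```python
import ctypes


def has_symbol(lib, name):
    """Return True if the loaded shared library exposes `name`.

    ctypes raises AttributeError for a missing symbol, which hasattr() turns into False.
    """
    return hasattr(lib, name)


def check_xgboost_library(symbol="XGDMatrixCreateFromDense"):
    """Print each native XGBoost library found and whether it exports `symbol`."""
    from xgboost import libpath  # imported lazily; assumes xgboost is installed

    for path in libpath.find_lib_path():
        lib = ctypes.CDLL(path)
        print(path, "->", "ok" if has_symbol(lib, symbol) else f"missing {symbol}")
```

If the symbol is reported missing, reinstalling `xgboost` in a clean environment is a reasonable next step.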

Does anyone know how to get the best params?

Hi, I want to see the parameters of the trained model after training is complete. Can anyone help me out with this?
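
The tracebacks on this page show `train_model` returning `study.best_params` from an Optuna study named `autoxgb` that is stored in an RDB. A sketch for reading it back after training, assuming the study lives in a SQLite file inside the output directory; the file name `params.db` is a guess and should be checked against what is actually in your output folder.

```python
from pathlib import Path


def sqlite_url(db_path):
    """Build a SQLAlchemy-style SQLite URL for a database file path."""
    return "sqlite:///" + Path(db_path).as_posix()


def load_best_params(db_path, study_name="autoxgb"):
    """Load the stored Optuna study and return its best parameters."""
    import optuna  # imported lazily; assumes optuna is installed

    study = optuna.load_study(study_name=study_name, storage=sqlite_url(db_path))
    return study.best_params


# Usage (the path is an assumption):
#   print(load_best_params("output/params.db"))
```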

Multioutput regression

Hello, XGBoost now supports multi-output regression! The feature request has finally been merged.
Will AutoXGB support this out of the box?

Autoxgb CLI predict giving: AttributeError: 'ModelConfig' object has no attribute 'target_cols'

I am trying to run:

autoxgb predict --model_path output/ --test_filename test_file.csv --out_filename tmp.csv

where test_file.csv is:

id,L0_n,L0_r,L0_w,L0_s,L0_freq,L0_L,L0_Q
700,2.25,67,15,2.1,2.25,1.406883,17.5144
701,5.75,69,22,2.1,2.25,14.00953,14.61921

I get the following error:

File "/home/neo/wellth-wrk/env/lib/python3.8/site-packages/autoxgb/predict.py", line 85, in _predict_df
final_preds = pd.DataFrame(final_preds, columns=self.model_config.target_cols)
AttributeError: 'ModelConfig' object has no attribute 'target_cols'
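
The model config printed in the training logs elsewhere on this page has a `targets` field, while `predict.py` reads `target_cols`, so the attribute name looks out of sync between the saved config and the prediction code (possibly a model trained with one version and used with another; the thread does not confirm this). A sketch of a defensive lookup, using a stand-in `ModelConfig` rather than the real autoxgb class:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelConfig:
    """Stand-in for illustration only; not the real autoxgb ModelConfig."""
    targets: List[str] = field(default_factory=list)


def target_columns(config):
    """Return the target column names, accepting either attribute spelling."""
    cols = getattr(config, "target_cols", None)
    if cols is None:
        cols = getattr(config, "targets", None)
    if cols is None:
        raise AttributeError("config has neither 'target_cols' nor 'targets'")
    return list(cols)
```

Retraining and predicting with the same autoxgb version is likely the simpler route if it is available to you.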

A post about your great repository on Medium

@abhishekkrthakur
Thanks for your great repo. I have written the following brief post to introduce it:
AutoXGB: XGBoost + Optuna
Best

How do I add a port to autoxgb serve?

autoxgb serve --model_path outputs/mll --host 0.0.0.0 --debug

I don't see an option for the port.

TypeError: argument of type 'method' is not iterable

Hi,
I tried autoxgb, but it very quickly returns this error:

2022-03-08 09:22:21.269 | INFO | autoxgb.autoxgb:__post_init__:42 - Output directory: output2
2022-03-08 09:22:21.276 | WARNING | autoxgb.autoxgb:__post_init__:49 - No id column specified. Will default to `id`.
2022-03-08 09:22:21.283 | INFO | autoxgb.autoxgb:_process_data:149 - Reading training data

TypeError Traceback (most recent call last)
in ()
37 fast=fast,
38 )
---> 39 axgb.train()
10 frames
/usr/local/lib/python3.7/dist-packages/autoxgb/autoxgb.py in train(self)
244
245 def train(self):
--> 246 self._process_data()
247 best_params = train_model(self.model_config)
248 logger.info("Training complete")

/usr/local/lib/python3.7/dist-packages/autoxgb/autoxgb.py in _process_data(self)
148 def _process_data(self):
149 logger.info("Reading training data")
--> 150 train_df = pd.read_csv(self.train_filename)
151 train_df = reduce_memory_usage(train_df)
152 problem_type = self._determine_problem_type(train_df)

/usr/local/lib/python3.7/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
584 kwds.update(kwds_defaults)
585
--> 586 return _read(filepath_or_buffer, kwds)
587
588

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
480
481 # Create the parser.
--> 482 parser = TextFileReader(filepath_or_buffer, **kwds)
483
484 if chunksize or iterator:

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
809 self.options["has_index_names"] = kwds["has_index_names"]
810
--> 811 self._engine = self._make_engine(self.engine)
812
813 def close(self):

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
1038 )
1039 # error: Too many arguments for "ParserBase"
-> 1040 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
1041
1042 def _failover_to_python(self):

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
49
50 # open handles
---> 51 self._open_handles(src, kwds)
52 assert self.handles is not None
53

/usr/local/lib/python3.7/dist-packages/pandas/io/parsers/base_parser.py in _open_handles(self, src, kwds)
227 memory_map=kwds.get("memory_map", False),
228 storage_options=kwds.get("storage_options", None),
--> 229 errors=kwds.get("encoding_errors", "strict"),
230 )
231

/usr/local/lib/python3.7/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
583
584 # read_csv does not know whether the buffer is opened in binary/text mode
--> 585 if _is_binary_mode(path_or_buf, mode) and "b" not in mode:
586 mode += "b"
587

/usr/local/lib/python3.7/dist-packages/pandas/io/common.py in _is_binary_mode(handle, mode)
960 # classes that expect bytes
961 binary_classes = (BufferedIOBase, RawIOBase)
--> 962 return isinstance(handle, binary_classes) or "b" in getattr(handle, "mode", mode)

TypeError: argument of type 'method' is not iterable

I don't know why. Thank you all for your help.

My code:
from autoxgb import AutoXGB

# required parameters:

train_filename = df.iloc[:round(df.shape[0]*.8)]
output = "output2"

# optional parameters

test_filename = None
task = None
idx = df.index
targets = ["Goal"]
features = None
categorical_features = None
use_gpu = False
num_folds = 5
seed = 42
num_trials = 100
time_limit = 360
fast = False

# Now it's time to train the model!

axgb = AutoXGB(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
axgb.train()
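
In the code above, `train_filename` is a DataFrame slice, but the traceback shows AutoXGB passing it straight to `pd.read_csv`. pandas then evaluates `"b" in getattr(handle, "mode", mode)`, and on a DataFrame `mode` is a method, which is what produces `argument of type 'method' is not iterable`. The snippet below reproduces just that mechanism with a stand-in class (no pandas needed). The likely fix is to save the slice to a CSV and pass the file path, and to pass a column name (or `None`) for `idx` rather than `df.index`.

```python
class FrameLike:
    """Stand-in for a DataFrame: it has a *method* called `mode`, not a string."""

    def mode(self):
        return "not a file mode"


def looks_binary(handle, mode="r"):
    """Simplified version of the membership test shown in the traceback."""
    return "b" in getattr(handle, "mode", mode)


try:
    looks_binary(FrameLike())
except TypeError as exc:
    # A bound method is not a container, so the `in` test fails.
    print("TypeError:", exc)

# A plain path string has no `mode` attribute, so the default "r" is used.
print(looks_binary("train.csv"))
```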

local variable 'test_pred_temp' referenced before assignment

I am getting:

local variable 'test_pred_temp' referenced before assignment

AttributeError: module 'pyarrow.lib' has no attribute 'MonthDayNanoIntervalArray'

https://www.kaggle.com/somesh88/playground-feb

The data and the code notebook are linked above. Can anyone please help me with this?

best params and model

Hi,

Thanks for building a very useful package. I have two simple questions:

How come only the following params are tuned:
{'colsample_bytree': 0.18270180565544739,
'early_stopping_rounds': 401,
'learning_rate': 0.013529250923369278,
'max_depth': 6,
'n_estimators': 20000,
'reg_alpha': 0.0019387086612090178,
'reg_lambda': 5.879563892375361e-08,
'subsample': 0.8925701729066172}
What about gamma and other XGBoost parameters? Are they assumed to be default values?
How do I access the best model from the output directory? I plugged the above best params into my own XGBoost model, but didn't get the same result that AutoXGB reported. Is there a way to access these models and/or the best model in the output directory, so I can run the model on any data to see the results?

Thank you so much; any help will be greatly appreciated.

P.S. Are there any docs on how to use the output files? There is a lot of useful info there, but I don't know how to access it smartly.
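
On the second question: the `fit` signature in the XGBoost 1.5.0 traceback elsewhere on this page lists `early_stopping_rounds` as an argument of `fit`, not of the estimator constructor, so passing the whole dictionary to the constructor may not reproduce AutoXGB's training. AutoXGB also appears to train across folds (`num_folds`), so a single refit can score differently. A sketch of splitting the dictionary, using rounded values from the question; the function name is hypothetical.

```python
FIT_ONLY_KEYS = {"early_stopping_rounds"}


def split_params(best_params):
    """Split tuned parameters into constructor kwargs and fit() kwargs."""
    init_kwargs = {k: v for k, v in best_params.items() if k not in FIT_ONLY_KEYS}
    fit_kwargs = {k: v for k, v in best_params.items() if k in FIT_ONLY_KEYS}
    return init_kwargs, fit_kwargs


best_params = {
    "colsample_bytree": 0.1827,
    "early_stopping_rounds": 401,
    "learning_rate": 0.0135,
    "max_depth": 6,
    "n_estimators": 20000,
    "reg_alpha": 0.0019,
    "reg_lambda": 5.88e-08,
    "subsample": 0.8926,
}
init_kwargs, fit_kwargs = split_params(best_params)

# With xgboost 1.5.x one could then write, given your own train/validation split
# (XGBRegressor or XGBClassifier, depending on the task):
#   model = xgboost.XGBRegressor(**init_kwargs)
#   model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], **fit_kwargs)
```

Early stopping needs a validation set in `eval_set`, and the split you use will not match AutoXGB's folds exactly, so some difference in scores is expected.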

TypeError: __init__() got an unexpected keyword argument 'handle_unknown'

Hey, I got this error and have no idea where it comes from. My manual XGBoost model works without problems with exactly the same dataset.

I have no idea how to use GitHub, so sorry if this does not match your standards.

Even though it is installed with pip, I still get ModuleNotFoundError: No module named 'autoxgb'

On some of my Linux machines, it was installed with pip:
pip3 install autoxgb

but simply doing:
import autoxgb

raises:
ModuleNotFoundError: No module named 'autoxgb'
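
A common cause of this symptom, though the thread does not confirm it here, is that `pip3` installs into a different interpreter than the one running `import autoxgb`. Installing with `python3 -m pip install autoxgb` ties the install to a specific interpreter. A quick check of which interpreter is running and whether it can see the package:

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the current interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


print("interpreter:", sys.executable)
print("autoxgb importable:", is_importable("autoxgb"))
```

Run it with the same `python` command that fails on the import, then compare the printed interpreter path with the one reported by `pip3 --version`.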

AssertionError exception: no description

executed code:

# Now its time to train the model!
axgb = AutoXGB(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
axgb.train()

log:

2023-08-04 14:31:55.744 | INFO     | autoxgb.autoxgb:__post_init__:42 - Output directory: output3
2023-08-04 14:31:55.744 | INFO     | autoxgb.autoxgb:_process_data:149 - Reading training data
2023-08-04 14:31:55.765 | INFO     | autoxgb.utils:reduce_memory_usage:48 - Mem. usage decreased to 0.79 Mb (37.5% reduction)
2023-08-04 14:31:55.767 | INFO     | autoxgb.autoxgb:_determine_problem_type:140 - Problem type: multi_label_classification
2023-08-04 14:31:55.767 | INFO     | autoxgb.autoxgb:_create_folds:58 - Creating folds
2023-08-04 14:31:55.772 | INFO     | autoxgb.autoxgb:_process_data:195 - Found 18 categorical features.
2023-08-04 14:31:55.772 | INFO     | autoxgb.autoxgb:_process_data:198 - Encoding categorical features
2023-08-04 14:31:55.924 | INFO     | autoxgb.autoxgb:_process_data:236 - Model config: train_filename='/data_16t/hongziwen/autoxgb-main/data_samples/multi_label_classification.csv' test_filename=None idx='id' targets=['service_a', 'service_b'] problem_type=<ProblemType.multi_label_classification: 3> output='output3' features=['release', 'n_0047', 'n_0050', 'n_0052', 'n_0061', 'n_0067', 'n_0075', 'n_0078', 'n_0091', 'n_0108', 'n_0109', 'o_0176', 'o_0264', 'c_0466', 'c_0500', 'c_0638', 'c_0699', 'c_0738', 'c_0761', 'c_0770', 'c_0838', 'c_0870', 'c_0980', 'c_1145', 'c_1158', 'c_1189', 'c_1223', 'c_1227', 'c_1244', 'c_1259'] num_folds=5 use_gpu=True seed=42 categorical_features=['release', 'c_0466', 'c_0500', 'c_0638', 'c_0699', 'c_0738', 'c_0761', 'c_0770', 'c_0838', 'c_0870', 'c_0980', 'c_1145', 'c_1158', 'c_1189', 'c_1223', 'c_1227', 'c_1244', 'c_1259'] num_trials=100 time_limit=360 fast=False
2023-08-04 14:31:55.924 | INFO     | autoxgb.autoxgb:_process_data:237 - Saving model config
2023-08-04 14:31:55.925 | INFO     | autoxgb.autoxgb:_process_data:241 - Saving encoders

error:

Exception has occurred: AssertionError
exception: no description
  File "/data_16t//autoxgb-main/examples/multi_label_classification.py", line 39, in <module>
    axgb.train()
AssertionError:

I reproduced it by following the README.md file.

module 'pyarrow.lib' has no attribute 'MonthDayNanoIntervalArray'

I am getting this error while using the TPS November data in the Kaggle conda environment (my GPU is on):

https://www.kaggle.com/yogeshkalauni/tps-nov-21-auto-xgboost-error

The error appears after running pip install in the Kaggle kernel:

Collecting autoxgb
  Downloading autoxgb-0.2.1-py3-none-any.whl (20 kB)
Collecting scikit-learn==1.0.1
  Downloading scikit_learn-1.0.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.2 MB)
     |████████████████████████████████| 23.2 MB 1.3 MB/s eta 0:00:01
Requirement already satisfied: optuna==2.10.0 in /opt/conda/lib/python3.7/site-packages (from autoxgb) (2.10.0)
Collecting pyarrow==6.0.0
  Downloading pyarrow-6.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.5 MB)
     |████████████████████████████████| 25.5 MB 43.9 MB/s eta 0:00:01
Requirement already satisfied: pydantic==1.8.2 in /opt/conda/lib/python3.7/site-packages (from autoxgb) (1.8.2)
Collecting loguru==0.5.3
  Downloading loguru-0.5.3-py3-none-any.whl (57 kB)
     |████████████████████████████████| 57 kB 4.9 MB/s  eta 0:00:01
Collecting xgboost==1.5.0
  Downloading xgboost-1.5.0-py3-none-manylinux2014_x86_64.whl (173.5 MB)
     |████████████████████████████████| 173.5 MB 66 kB/s s eta 0:00:01    |██████████████████              | 97.9 MB 59.6 MB/s eta 0:00:02
Collecting pandas==1.3.4
  Downloading pandas-1.3.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
     |████████████████████████████████| 11.3 MB 46.0 MB/s eta 0:00:01
Requirement already satisfied: fastapi==0.70.0 in /opt/conda/lib/python3.7/site-packages (from autoxgb) (0.70.0)
Requirement already satisfied: uvicorn==0.15.0 in /opt/conda/lib/python3.7/site-packages (from autoxgb) (0.15.0)
Collecting numpy==1.21.3
  Downloading numpy-1.21.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
     |████████████████████████████████| 15.7 MB 39.9 MB/s eta 0:00:01
Collecting joblib==1.1.0
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
     |████████████████████████████████| 306 kB 39.9 MB/s eta 0:00:01
Requirement already satisfied: starlette==0.16.0 in /opt/conda/lib/python3.7/site-packages (from fastapi==0.70.0->autoxgb) (0.16.0)
Requirement already satisfied: scipy!=1.4.0 in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (1.7.1)
Requirement already satisfied: cliff in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (3.9.0)
Requirement already satisfied: colorlog in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (6.5.0)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (21.0)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (4.62.3)
Requirement already satisfied: alembic in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (1.7.4)
Requirement already satisfied: cmaes>=0.8.2 in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (0.8.2)
Requirement already satisfied: sqlalchemy>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (1.4.25)
Requirement already satisfied: PyYAML in /opt/conda/lib/python3.7/site-packages (from optuna==2.10.0->autoxgb) (5.4.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas==1.3.4->autoxgb) (2.8.0)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas==1.3.4->autoxgb) (2021.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.7/site-packages (from pydantic==1.8.2->autoxgb) (3.10.0.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn==1.0.1->autoxgb) (2.2.0)
Requirement already satisfied: anyio<4,>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from starlette==0.16.0->fastapi==0.70.0->autoxgb) (3.3.0)
Requirement already satisfied: click>=7.0 in /opt/conda/lib/python3.7/site-packages (from uvicorn==0.15.0->autoxgb) (8.0.1)
Requirement already satisfied: asgiref>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from uvicorn==0.15.0->autoxgb) (3.4.1)
Requirement already satisfied: h11>=0.8 in /opt/conda/lib/python3.7/site-packages (from uvicorn==0.15.0->autoxgb) (0.12.0)
Requirement already satisfied: sniffio>=1.1 in /opt/conda/lib/python3.7/site-packages (from anyio<4,>=3.0.0->starlette==0.16.0->fastapi==0.70.0->autoxgb) (1.2.0)
Requirement already satisfied: idna>=2.8 in /opt/conda/lib/python3.7/site-packages (from anyio<4,>=3.0.0->starlette==0.16.0->fastapi==0.70.0->autoxgb) (2.10)
Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from click>=7.0->uvicorn==0.15.0->autoxgb) (4.8.1)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging>=20.0->optuna==2.10.0->autoxgb) (2.4.7)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas==1.3.4->autoxgb) (1.16.0)
Requirement already satisfied: greenlet!=0.4.17 in /opt/conda/lib/python3.7/site-packages (from sqlalchemy>=1.1.0->optuna==2.10.0->autoxgb) (1.1.1)
Requirement already satisfied: Mako in /opt/conda/lib/python3.7/site-packages (from alembic->optuna==2.10.0->autoxgb) (1.1.5)
Requirement already satisfied: importlib-resources in /opt/conda/lib/python3.7/site-packages (from alembic->optuna==2.10.0->autoxgb) (5.2.2)
Requirement already satisfied: PrettyTable>=0.7.2 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (2.2.0)
Requirement already satisfied: cmd2>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (2.2.0)
Requirement already satisfied: autopage>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (0.4.0)
Requirement already satisfied: stevedore>=2.0.1 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (3.4.0)
Requirement already satisfied: pbr!=2.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from cliff->optuna==2.10.0->autoxgb) (5.6.0)
Requirement already satisfied: colorama>=0.3.7 in /opt/conda/lib/python3.7/site-packages (from cmd2>=1.0.0->cliff->optuna==2.10.0->autoxgb) (0.4.4)
Requirement already satisfied: attrs>=16.3.0 in /opt/conda/lib/python3.7/site-packages (from cmd2>=1.0.0->cliff->optuna==2.10.0->autoxgb) (21.2.0)
Requirement already satisfied: pyperclip>=1.6 in /opt/conda/lib/python3.7/site-packages (from cmd2>=1.0.0->cliff->optuna==2.10.0->autoxgb) (1.8.2)
Requirement already satisfied: wcwidth>=0.1.7 in /opt/conda/lib/python3.7/site-packages (from cmd2>=1.0.0->cliff->optuna==2.10.0->autoxgb) (0.2.5)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->click>=7.0->uvicorn==0.15.0->autoxgb) (3.5.0)
Requirement already satisfied: MarkupSafe>=0.9.2 in /opt/conda/lib/python3.7/site-packages (from Mako->alembic->optuna==2.10.0->autoxgb) (2.0.1)
Installing collected packages: numpy, joblib, xgboost, scikit-learn, pyarrow, pandas, loguru, autoxgb
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
  Attempting uninstall: joblib
    Found existing installation: joblib 1.0.1
    Uninstalling joblib-1.0.1:
      Successfully uninstalled joblib-1.0.1
  Attempting uninstall: xgboost
    Found existing installation: xgboost 1.4.2
    Uninstalling xgboost-1.4.2:
      Successfully uninstalled xgboost-1.4.2
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.23.2
    Uninstalling scikit-learn-0.23.2:
      Successfully uninstalled scikit-learn-0.23.2
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 5.0.0
    Uninstalling pyarrow-5.0.0:
      Successfully uninstalled pyarrow-5.0.0
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.3
    Uninstalling pandas-1.3.3:
      Successfully uninstalled pandas-1.3.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-io 0.18.0 requires tensorflow-io-gcs-filesystem==0.18.0, which is not installed.
explainable-ai-sdk 1.3.2 requires xai-image-widget, which is not installed.
dask-cudf 21.8.3 requires cupy-cuda114, which is not installed.
cudf 21.8.3 requires cupy-cuda110, which is not installed.
beatrix-jupyterlab 3.1.1 requires google-cloud-bigquery-storage, which is not installed.
yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.21.3 which is incompatible.
tfx-bsl 1.3.0 requires absl-py<0.13,>=0.9, but you have absl-py 0.14.0 which is incompatible.
tfx-bsl 1.3.0 requires numpy<1.20,>=1.16, but you have numpy 1.21.3 which is incompatible.
tfx-bsl 1.3.0 requires pyarrow<3,>=1, but you have pyarrow 6.0.0 which is incompatible.
tensorflow 2.6.0 requires numpy~=1.19.2, but you have numpy 1.21.3 which is incompatible.
tensorflow 2.6.0 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tensorflow 2.6.0 requires typing-extensions~=3.7.4, but you have typing-extensions 3.10.0.2 which is incompatible.
tensorflow-transform 1.3.0 requires absl-py<0.13,>=0.9, but you have absl-py 0.14.0 which is incompatible.
tensorflow-transform 1.3.0 requires numpy<1.20,>=1.16, but you have numpy 1.21.3 which is incompatible.
tensorflow-transform 1.3.0 requires pyarrow<3,>=1, but you have pyarrow 6.0.0 which is incompatible.
tensorflow-io 0.18.0 requires tensorflow<2.6.0,>=2.5.0, but you have tensorflow 2.6.0 which is incompatible.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.3 which is incompatible.
numba 0.54.0 requires numpy<1.21,>=1.17, but you have numpy 1.21.3 which is incompatible.
matrixprofile 1.1.10 requires protobuf==3.11.2, but you have protobuf 3.18.1 which is incompatible.
hypertools 0.7.0 requires scikit-learn!=0.22,<0.24,>=0.19.1, but you have scikit-learn 1.0.1 which is incompatible.
dask-cudf 21.8.3 requires dask<=2021.07.1,>=2021.6.0, but you have dask 2021.9.1 which is incompatible.
dask-cudf 21.8.3 requires pandas<1.3.0dev0,>=1.0, but you have pandas 1.3.4 which is incompatible.
cudf 21.8.3 requires pandas<1.3.0dev0,>=1.0, but you have pandas 1.3.4 which is incompatible.
apache-beam 2.32.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.4 which is incompatible.
apache-beam 2.32.0 requires numpy<1.21.0,>=1.14.3, but you have numpy 1.21.3 which is incompatible.
apache-beam 2.32.0 requires pyarrow<5.0.0,>=0.15.1, but you have pyarrow 6.0.0 which is incompatible.
apache-beam 2.32.0 requires typing-extensions<3.8.0,>=3.7.0, but you have typing-extensions 3.10.0.2 which is incompatible.
Successfully installed autoxgb-0.2.1 joblib-1.1.0 loguru-0.5.3 numpy-1.21.3 pandas-1.3.4 pyarrow-6.0.0 scikit-learn-1.0.1 xgboost-1.5.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

from autoxgb import AutoXGB


# required parameters:
train_filename = "../input/tabular-playground-series-nov-2021/train.csv"
output = "outputt"

# optional parameters
test_filename = '../input/tabular-playground-series-nov-2021/test.csv'
task = 'classification'
idx = None
targets = ["target"]
features = None
categorical_features = None
use_gpu = True
num_folds = 5
seed = 42
num_trials = 100
time_limit = 7*60*60
fast = False

# Now it's time to train the model!
axgb = AutoXGB(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
axgb.train()

2021-11-01 07:03:06.106 | INFO     | autoxgb.autoxgb:__post_init__:42 - Output directory: outputt
2021-11-01 07:03:06.108 | WARNING  | autoxgb.autoxgb:__post_init__:49 - No id column specified. Will default to `id`.
2021-11-01 07:03:06.110 | INFO     | autoxgb.autoxgb:_process_data:149 - Reading training data
2021-11-01 07:03:22.502 | INFO     | autoxgb.utils:reduce_memory_usage:50 - Mem. usage decreased to 117.30 Mb (74.9% reduction)
2021-11-01 07:03:22.583 | INFO     | autoxgb.autoxgb:_determine_problem_type:140 - Problem type: binary_classification
2021-11-01 07:03:38.131 | INFO     | autoxgb.utils:reduce_memory_usage:50 - Mem. usage decreased to 105.06 Mb (74.8% reduction)
2021-11-01 07:03:38.132 | INFO     | autoxgb.autoxgb:_create_folds:58 - Creating folds
2021-11-01 07:03:38.248 | INFO     | autoxgb.autoxgb:_process_data:170 - Encoding target(s)
2021-11-01 07:03:38.282 | INFO     | autoxgb.autoxgb:_process_data:195 - Found 0 categorical features.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_38/3565386527.py in <module>
     37     fast=fast,
     38 )
---> 39 axgb.train()

/opt/conda/lib/python3.7/site-packages/autoxgb/autoxgb.py in train(self)
    244 
    245     def train(self):
--> 246         self._process_data()
    247         best_params = train_model(self.model_config)
    248         logger.info("Training complete")

/opt/conda/lib/python3.7/site-packages/autoxgb/autoxgb.py in _process_data(self)
    210                     test_fold[categorical_features] = ord_encoder.transform(test_fold[categorical_features].values)
    211                 categorical_encoders[fold] = ord_encoder
--> 212             fold_train.to_feather(os.path.join(self.output, f"train_fold_{fold}.feather"))
    213             fold_valid.to_feather(os.path.join(self.output, f"valid_fold_{fold}.feather"))
    214             if self.test_filename is not None:

/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    205                 else:
    206                     kwargs[new_arg_name] = new_arg_value
--> 207             return func(*args, **kwargs)
    208 
    209         return cast(F, wrapper)

/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in to_feather(self, path, **kwargs)
   2517         from pandas.io.feather_format import to_feather
   2518 
-> 2519         to_feather(self, path, **kwargs)
   2520 
   2521     @doc(

/opt/conda/lib/python3.7/site-packages/pandas/io/feather_format.py in to_feather(df, path, storage_options, **kwargs)
     44     """
     45     import_optional_dependency("pyarrow")
---> 46     from pyarrow import feather
     47 
     48     if not isinstance(df, DataFrame):

/opt/conda/lib/python3.7/site-packages/pyarrow/feather.py in <module>
     23                          concat_tables, schema)
     24 import pyarrow.lib as ext
---> 25 from pyarrow import _feather
     26 from pyarrow._feather import FeatherError  # noqa: F401
     27 from pyarrow.vendored.version import Version

/opt/conda/lib/python3.7/site-packages/pyarrow/_feather.pyx in init pyarrow._feather()

AttributeError: module 'pyarrow.lib' has no attribute 'MonthDayNanoIntervalArray'
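
The pip log above shows pyarrow 5.0.0 being replaced by 6.0.0 inside the running session. One plausible cause of this kind of error (an assumption, not confirmed anywhere in this thread) is that the kernel still holds parts of the old pyarrow in memory while the files on disk belong to the new one; restarting the kernel after the install usually clears that up. The sketch below compares the version loaded in the current process with the version installed on disk. `loaded_vs_installed` is a hypothetical helper written for illustration and is not part of autoxgb or pyarrow.

```python
import sys

try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7, as in the log above
    import importlib_metadata as metadata  # backport listed in the pip log


def loaded_vs_installed(package):
    """Return (version loaded in this process, version installed on disk).

    Either element is None when the information is unavailable.
    Hypothetical helper for illustration only.
    """
    module = sys.modules.get(package)
    loaded = getattr(module, "__version__", None) if module else None
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None
    return loaded, installed


loaded, installed = loaded_vs_installed("pyarrow")
if loaded and installed and loaded != installed:
    print(f"pyarrow {loaded} is loaded but {installed} is installed: restart the kernel")
else:
    print(f"pyarrow loaded={loaded}, installed={installed}")
```

If the two versions differ, the process imported pyarrow before the upgrade; a `loaded` value of None only means pyarrow has not been imported yet in this process.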

Performance Issue: Slow read_csv() Function with pandas Version 1.3.4 for CSV Files with Large Number of Columns

Issue Description:
Hello.
I have found a performance regression in the read_csv function of pandas 1.3.4 when loading CSV files with a large number of columns. Loading time grows from a few seconds in the previous version 1.2.5 to several minutes, a slowdown of almost 60x. Related discussions on GitHub include #44106 and #44192.
Both autoxgb/src/autoxgb/predict.py and autoxgb/src/autoxgb/autoxgb.py use the affected API.

Steps to Reproduce:

I have created a small reproducible example to better illustrate this issue.

# v1.3.4
import os
import pandas
import numpy
import timeit

def generate_sample():
    if not os.path.exists("test_small.csv.gz"):
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)

generate_sample()
print(timeit.timeit(load_csv_file, number=1))  # elapsed seconds

# results
loaded dataframe shape: (5, 100000)
120.37690759263933

# v1.3.5
import os
import pandas
import numpy
import timeit

def generate_sample():
    if not os.path.exists("test_small.csv.gz"):
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)


generate_sample()
print(timeit.timeit(load_csv_file, number=1))  # elapsed seconds

# results
loaded dataframe shape: (5, 100000)
2.8567268839105964

Suggestion

I would recommend upgrading to pandas >= 1.3.5, or exploring other ways to speed up loading of wide CSV files.
Any other workarounds or solutions would be greatly appreciated.
Thank you!
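
As a rough sketch of how the report above could be turned into a runtime check: the helper below classifies a pandas version using only what this thread states (1.3.4 slow; 1.2.5 and 1.3.5 fast) and computes the ratio of the two timings shown. `release_tuple` and `reported_status` are hypothetical names, and every other version is deliberately reported as "unknown" rather than guessed.

```python
import re

# Timings reported in this thread (seconds) for the same 5 x 100000 CSV.
REPORTED_SECONDS = {(1, 3, 4): 120.37690759263933, (1, 3, 5): 2.8567268839105964}


def release_tuple(version):
    """Parse the leading 'X.Y.Z' of a version string; None if it is absent."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    return tuple(int(g) for g in match.groups()) if match else None


def reported_status(version):
    """Classify a pandas version using only what this thread reports."""
    release = release_tuple(version)
    if release == (1, 3, 4):
        return "slow"
    if release in {(1, 2, 5), (1, 3, 5)}:
        return "fast"
    return "unknown"


speedup = REPORTED_SECONDS[(1, 3, 4)] / REPORTED_SECONDS[(1, 3, 5)]
print(f"1.3.4 vs 1.3.5 on the example above: about {speedup:.1f}x slower")

try:
    import pandas
    print(pandas.__version__, "->", reported_status(pandas.__version__))
except ImportError:
    print("pandas is not installed in this environment")
```

The ratio here covers only the 1.3.4 and 1.3.5 runs shown above; the "almost 60x" figure in the description refers to a comparison with 1.2.5, for which no timing is listed.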

[Que]: How to get a feature importance DataFrame with a plot?

Is there a way to access the `xgb` model object to use `plot_importance`?

xgboost.plot_importance() has been quite handy for plotting important features. Is there a way to do that?

Thanks!
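
The thread does not say how autoxgb exposes its fitted models, so the following is only a sketch under the assumption that you already have a fitted xgboost model object in hand. For such a model, `model.get_booster().get_score(importance_type="gain")` returns a plain `{feature: score}` dictionary; the hypothetical helper `importance_rows` sorts a dictionary of that shape into rows. The scores used here are made up so that the snippet runs without xgboost installed.

```python
def importance_rows(scores, top=None):
    """Sort a {feature: score} mapping into (feature, score) rows, highest first."""
    rows = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return rows if top is None else rows[:top]


# Made-up scores standing in for the dictionary returned by
# model.get_booster().get_score(importance_type="gain") on a fitted model.
example_scores = {"f0": 12.5, "f1": 3.0, "f2": 40.25}

for feature, score in importance_rows(example_scores):
    print(f"{feature}\t{score}")
```

With pandas available, `pandas.DataFrame(importance_rows(scores), columns=["feature", "importance"])` turns the rows into a DataFrame, and `xgboost.plot_importance(model)` draws the plot directly from a fitted model or Booster.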

Installation Issue with autoxgb in Kaggle Environment

Dear autoxgb Developers,

I am reaching out to report an installation issue encountered with the autoxgb package within the Kaggle notebook environment. During the installation process via pip, the operation fails due to a Cython compilation error related to the sklearn.ensemble._hist_gradient_boosting.splitting.pyx module.

Cython.Compiler.Errors.CompileError: sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
...
error: metadata-generation-failed

time_limit doesn't stop the training

I have been working on the Tabular Playground Series (TPS) on Kaggle and tried AutoXGB. I set the time limit to 3600*4 seconds (4 hours).
But the training didn't stop at 4 hours. It is now at 6.5 hours and still going. Am I doing anything wrong?

P.S. The first trial took 4 hours to complete.
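
A possible explanation, offered as an assumption rather than a confirmed diagnosis: autoxgb tunes with Optuna, and Optuna's `Study.optimize(timeout=...)` checks the timeout only between trials and does not interrupt a trial that is already running. If `time_limit` is forwarded to that timeout (not verified in this thread), a run can overshoot the limit by up to one full trial. The simulation below is a simplified model of that behaviour with made-up trial durations; `simulate_study` is a hypothetical function, not Optuna or autoxgb code.

```python
def simulate_study(trial_hours, time_limit_hours):
    """Run trials in order, checking the limit only before a trial starts.

    Returns (number of trials run, total hours elapsed).
    """
    elapsed = 0.0
    completed = 0
    for duration in trial_hours:
        if elapsed >= time_limit_hours:
            break  # the limit is only noticed here, between trials
        elapsed += duration  # a running trial is never interrupted
        completed += 1
    return completed, elapsed


# Hypothetical durations: each trial takes a little under the 4-hour limit.
completed, elapsed = simulate_study([3.9, 3.9, 3.9], time_limit_hours=4.0)
print(f"{completed} trials run, {elapsed:.1f} hours elapsed")
```

Under this model, anything that shortens a single trial (for example a lower `num_folds` or less training data) makes the limit take effect closer to the requested time.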