Data Analysis Baseline Library

Home Page: https://dabl.github.io/

License: BSD 3-Clause "New" or "Revised" License

Shell 0.85% Python 13.11% Jupyter Notebook 86.04%

dabl's Introduction

dabl

The data analysis baseline library.

"Mr Sanchez, are you a data scientist?"
"I dabl, Mr president."

Find more information on the website.

Try it out

pip install dabl

Current scope and upcoming features

This library is very much still under development. Current code focuses mostly on exploratory visualization and preprocessing. There are also drop-in replacements for GridSearchCV and RandomizedSearchCV using successive halving. There are preliminary portfolios in the style of POSH auto-sklearn to find strong models quickly. In essence, that boils down to a quick search over different gradient boosting models and other tree ensembles, and potentially kernel methods.
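To make the successive halving idea concrete, here is a minimal pure-Python sketch. It is not dabl's implementation: the function name, its parameters, and the `toy_score` objective are all made up for illustration. Each round evaluates the surviving candidates on a small budget, keeps roughly the best `1 / ratio` of them, and multiplies the budget by `ratio`.

```python
import math
import random


def successive_halving(candidates, evaluate, min_resources, max_resources, ratio=3):
    """Keep the best 1/ratio of candidates each round while growing the budget.

    `evaluate(candidate, n_resources)` must return a score (higher is better).
    Returns the best surviving candidate and the (n_candidates, resources)
    schedule that was actually run.
    """
    survivors = list(candidates)
    resources = min_resources
    schedule = []
    while True:
        scores = [(evaluate(c, resources), c) for c in survivors]
        schedule.append((len(survivors), resources))
        if len(survivors) == 1 or resources >= max_resources:
            break
        # Rank by score only and keep the top fraction for the next round.
        scores.sort(key=lambda pair: pair[0], reverse=True)
        n_keep = max(1, math.ceil(len(survivors) / ratio))
        survivors = [c for _, c in scores[:n_keep]]
        resources = min(resources * ratio, max_resources)
    best = max(scores, key=lambda pair: pair[0])[1]
    return best, schedule


def toy_score(candidate, n_resources):
    # Hypothetical objective: the optimum is candidate == 7, and the noise
    # shrinks as more resources (e.g. training samples) are used.
    rng = random.Random(candidate * 1000 + n_resources)
    noise = rng.uniform(-1, 1) / n_resources
    return -abs(candidate - 7) + noise


best, schedule = successive_halving(range(27), toy_score,
                                    min_resources=20, max_resources=540)
print(best, schedule)
```

In a real search, `evaluate` would be a cross-validated fit on a subsample of `n_resources` rows, which is where the savings over a full grid search are expected to come from.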

Check out the website and example gallery to get an idea of the visualizations that are available.

Stay Tuned!

Related packages

Lux

Lux is an awesome project for easy interactive visualization of pandas dataframes within notebooks.

Pandas Profiling

Pandas Profiling can provide a thorough summary of the data in only a single line of code. Using ProfileReport(), you can generate an HTML report of your data that can help you find correlations and identify missing data.

dabl focuses less on statistical measures of individual columns, and more on providing a quick overview via visualizations, as well as convenient preprocessing and model search for machine learning.

dabl's Issues

AnyClassifier fails on baseball

from dabl.models import AnyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_openml

bunch = fetch_openml(data_id=185)
X = bunch.data
y = bunch.target
cross_val_score(AnyClassifier(), X, y, scoring='f1_macro', cv=10, error_score='raise')

errors with a NaN error and I don't know why :-/
@thomasjpfan @NicolasHug feel free to check it out if you want to play with dabl.

class histograms is slow

Looks like a lot of time is spent in "bar" right now. We should be able to speed this up.

running successive halving on digits with SVC takes longer than GridSearchCV

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_digits
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data / 16, digits.target, stratify=digits.target, random_state=42)

param_grid = {'C': np.logspace(-3, 2, 6), 'gamma': np.logspace(-3, 2, 6)}
gs = GridSearchCV(SVC(), param_grid, cv=5)
gs.fit(X_train, y_train)

takes ~40s on my laptop, vs about 80s for successive halving. That's unfortunate. A lot of the time is spent in predicting, which we call more often on same-sized datasets (I think), but there's even more time spent in fitting, which doesn't make sense to me.
cc @NicolasHug

first don't exhaust budget in SH, then do?

also allow ctrl+c break?

Example of display_estimator for dabl

Regarding scikit-learn/scikit-learn#14180 (comment). Look at those nested metaestimators!

EasyPreprocessor has no target_col in fit?

the models have, though?

Clean Up Examples

they are using template classifier right now

bug? pytest reports empty warnings in parametrized tests

The pytest output on travis has many lines like this:

plot/tests/test_supervised.py::test_plots_smoke[0-100-classification]

and I don't know why :-/

EasyPreprocessor fails on categorical data as integers with missing values

Filling missing int or float values with dabl_missing breaks.
We might want to call clean on the data first (which converts the categories to string....), or have a separate OneHotEncoder for non-string categories. That would make the feature names harder to interpret because missingness will be encoded as an integer or float then.

Alternatively we can just error on dirty data?
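The first option (converting categories to strings before imputing) can be sketched as follows. This is an illustration only, not dabl's code: `is_missing` and `to_string_categories` are hypothetical helpers, and only the `dabl_missing` label comes from the issue. Once every category is a string, the missing label no longer clashes with an integer or float dtype.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_string_categories(values, missing_label="dabl_missing"):
    """Convert every category to a string so the missing label fits in."""
    result = []
    for v in values:
        if is_missing(v):
            result.append(missing_label)
        elif isinstance(v, float) and v.is_integer():
            # Integer categories often arrive as floats once NaN is present;
            # render 2.0 as "2" so feature names stay readable.
            result.append(str(int(v)))
        else:
            result.append(str(v))
    return result


print(to_string_categories([1, 2.0, None, float("nan"), 3]))
```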

object integer types are not handled correctly in detect_types

If an integer column is stored as object type, it's becoming continuous because it's handled as "dirty float". That requires some fixes.
Maybe that will also allow us to fix handling low-cardinality floats as categorical.
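One possible ordering of the checks, sketched in plain Python: try integers before floats, so that an object column of integer-like values is not swallowed by the "dirty float" path. The function names, the returned labels, and the `max_categories` threshold are assumptions for illustration and do not mirror `detect_types`.

```python
def parses_as(value, caster):
    """Return True if the string form of value can be parsed by caster."""
    try:
        caster(str(value))
    except (TypeError, ValueError):
        return False
    return True


def infer_kind(values, max_categories=10):
    """Guess a column kind from raw (possibly object-typed) values."""
    present = [v for v in values if v is not None]
    n_unique = len(set(map(str, present)))
    # Check integers first; every integer string also parses as a float.
    if all(parses_as(v, int) for v in present):
        return "categorical" if n_unique <= max_categories else "integer"
    if all(parses_as(v, float) for v in present):
        return "continuous"
    return "categorical" if n_unique <= max_categories else "free_string"


print(infer_kind(["1", "2", "1", "3"]))
print(infer_kind(["1.5", "2.25", "3"]))
```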

steal add_legend from seaborn

We need to either move all the grids to Grid or we should steal "add_legend" because it seems a bit tricky.
That probably requires introducing our own grid class.
Maybe the base "Grid" class is enough? I don't like facetgrid because it requires melting.

AnyClassifier doesn't support multi-metric scoring b/c successive halving doesn't

use loggers not warnings

We really should be using logging mechanisms instead of warnings and/or verbosity...
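A minimal sketch of what that could look like with the standard `logging` module. The logger name `dabl_sketch` and the `drop_constant_columns` helper are hypothetical; the point is only that the library emits records and the caller decides whether and where they are shown.

```python
import logging

logger = logging.getLogger("dabl_sketch")  # hypothetical logger name


def drop_constant_columns(columns):
    """Drop columns with a single distinct value and log what was dropped."""
    kept = {}
    for name, values in columns.items():
        if len(set(values)) <= 1:
            # A log record instead of warnings.warn or a verbose flag.
            logger.info("Dropping constant column %r", name)
        else:
            kept[name] = values
    return kept


if __name__ == "__main__":
    # The application, not the library, configures the output.
    logging.basicConfig(level=logging.INFO)
    drop_constant_columns({"a": [1, 1], "b": [1, 2]})
```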

weird interactions in gradient boosting and successive halving

from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import fetch_openml
from dabl.search import RandomSuccessiveHalving
from scipy import stats

bunch = fetch_openml(data_id=40701)

X = bunch.data
y = bunch.target
param_dist = {'max_leaf_nodes': stats.randint(2, 100)}
rsh = RandomSuccessiveHalving(HistGradientBoostingClassifier(),
                              param_distributions=param_dist,
                              verbose=10, r_min=40, scoring="roc_auc")
rsh.fit(X, y)

doesn't work as expected. In the first round all the AUCs are .5 and I think some random params get selected.
I'm not sure what's happening. It could be that because it's an imbalanced dataset something weird happens? Also, without specifying r_min it doesn't work at all b/c of validation set issues, I think.

@NicolasHug could you have a look?
Also: I realized HistGradientBoosting doesn't have all the subsampling options and no support for class_weight ;)

put binary columns through separate one-hot-encoder that drops one column

this will make sure binary columns will not be expanded into two columns
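The intended behaviour, sketched without any dependencies. `one_hot` is a hypothetical helper, not the encoder dabl uses: a column with exactly two categories becomes a single 0/1 column, while other columns are expanded as usual.

```python
def one_hot(values, drop_if_binary=True):
    """One-hot encode a list of categories.

    Returns (column_names, rows). A two-category column is encoded as a
    single 0/1 column when drop_if_binary is True.
    """
    categories = sorted(set(values))
    if drop_if_binary and len(categories) == 2:
        # Keep only the second category; the first is implied by a 0.
        categories = categories[1:]
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


print(one_hot(["no", "yes", "no"]))
print(one_hot(["a", "b", "c"]))
```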

allow setting number of columns for plots

can be nice for having more slide-friendly plots...

maybe store adult in zipped format?

to make the repo smaller.

AnyRegressor

ensure shown pairplots are not redundant

Often many similar pairplots are shown. It might be interesting to force diversity in some way.
Maybe removing highly correlated features might help?
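A rough sketch of the correlation idea: greedily keep a feature only if its absolute Pearson correlation with every already-kept feature is below a threshold. The helper names and the 0.9 threshold are assumptions, and this is only one of several ways diversity could be enforced.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Return names of columns kept by a greedy correlation filter."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


features = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [4, 1, 3, 2]}
print(drop_correlated(features))
```

The result depends on column order, since earlier columns win ties; ranking features by informativeness first would make the choice less arbitrary.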

ModuleNotFoundError: No module named 'sklearn.experimental'

Seems like some incompatibility with Windows Anaconda3. I installed using pip on two Windows boxes, and when importing in Jupyter or on the command line I receive a ModuleNotFoundError. I installed it on my Linux box and it imports without error. Please advise.
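One plausible cause (an assumption, not confirmed in this report) is an older scikit-learn in the Anaconda environment that predates the sklearn.experimental module. A small standard-library check like the one below, run in the same interpreter that Jupyter uses, can tell whether the module is visible at all; `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be found by this interpreter."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises instead of returning None.
        return False


print(module_available("sklearn"))
print(module_available("sklearn.experimental"))
```

If the first line prints True and the second False, upgrading scikit-learn in that environment would be the next thing to try.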

BUG if useless feature is tagged as non-useless in clean we get an error

It will be useless whatever it's tagged as. We should overwrite the tag with useless (and add a test).

fix CI

doesn't have sklearn right now?

embed plots in user guide and quick start guide

Right now the user guide doesn't have any plots except for the examples.
We should be able to render the plots generated by the doctests somehow.
Maybe
https://matplotlib.org/sampledoc/extensions.html
or
https://matthew-brett.github.io/nb2plots/nbplots.html

how to show density of very peaked distributions (not non-negative)

There's a nice example of some very peaky distributions in the bank marketing dataset:

from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=1461)

On some of these Yeo-Johnson seems to work, on others it doesn't seem to provide anything useful. Also, the KDE plots might be a bit misleading.

without YJ:

with YJ:

Looks like V13 has more information on a first look, but probably just an artifact of the KDE and we really shouldn't be using KDEs.

Someone thought much more about this:
https://arxiv.org/pdf/1305.0215.pdf
logarithmically spaced bins seems to be the main method for plotting?
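For reference, logarithmically spaced bins can be computed as in the sketch below. The helper names are made up, and this only works for strictly positive values; since the distributions in this issue are not non-negative, some additional handling of zeros and negative values would still be needed.

```python
import bisect
import math


def log_spaced_edges(low, high, n_bins):
    """Return n_bins + 1 edges spaced evenly in log10 between low and high."""
    if low <= 0 or high <= low:
        raise ValueError("need 0 < low < high")
    start = math.log10(low)
    step = (math.log10(high) - start) / n_bins
    return [10 ** (start + i * step) for i in range(n_bins + 1)]


def histogram(values, edges):
    """Count values per bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            i = min(bisect.bisect_right(edges, v) - 1, len(counts) - 1)
            counts[i] += 1
    return counts


edges = log_spaced_edges(1, 1000, 3)
print(edges)
print(histogram([2, 5, 7, 20, 500, 900], edges))
```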

Type detection

Hi @amueller, I want to try my hands on the Type Detection tasks in the todo and was wondering if you could shed a little more light on it.

Am I correct to assume that we're looking for a Detection object that has methods that detects the various things listed in the list?

Add example of different categorical plots?

Might be interesting to show off mosaic plots vs count and proportion?
"dna" might be a good dataset?

AnyClassifier not in top level namespace

plot_supervised crashes with cleaned data

I was trying to follow the examples in the quick start guide, but I see a crash when I call plot_supervised with a cleaned dataframe. Passing the original dataframe works. Here's an example notebook with the titanic dataset.

PS: there are a few minor mistakes in the quick start guide, e.g. detect_types_dataframe should be detect_types, inconsistent namespace (sometimes you use dabl.*, sometimes the global namespace), etc.

todo: regression test for integer categorical data

we should make sure that we don't fill a missing value with a string if the dtype is integer.

Error 'TypeError: 'function' object is not iterable' in function 'plot'

Hello,

for almost all functions of the package 'dabl' (such as 'clean' and 'plot' for example) I keep getting the following error:

TypeError: 'function' object is not iterable

This does not seem to be a problem of my specific data, because when I run the following example code (from the website):

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from dabl import plot
from dabl.utils import data_df_from_bunch

wine_bunch = load_wine()
wine_df = data_df_from_bunch(wine_bunch)

plot(wine_df, 'target')

I then get the following output on my machine:


  File "<ipython-input-61-92b3b7736897>", line 1, in <module>
    plot(wine_df, 'target')

  File "C:\Daten\Anaconda3\lib\site-packages\dabl\plot\supervised.py", line 485, in plot
    X, types = clean(X, type_hints=type_hints, return_types=True)

  File "C:\Daten\Anaconda3\lib\site-packages\dabl\preprocessing.py", line 380, in clean
    lambda x: str(x))

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\accessor.py", line 115, in f
    return self._delegate_method(name, *args, **kwargs)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\categorical.py", line 2204, in _delegate_method
    res = method(*args, **kwargs)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\categorical.py", line 940, in rename_categories
    cat.categories = new_categories

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\categorical.py", line 408, in categories
    new_dtype = CategoricalDtype(categories, ordered=self.ordered)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\dtypes\dtypes.py", line 154, in __init__
    self._finalize(categories, ordered, fastpath=False)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\dtypes\dtypes.py", line 181, in _finalize
    fastpath=fastpath)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\dtypes\dtypes.py", line 319, in _validate_categories
    categories = Index(categories, tupleize_cols=False)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 374, in __new__
    subarr = _asarray_tuplesafe(data, dtype=object)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\common.py", line 379, in _asarray_tuplesafe
    values = list(values)

TypeError: 'function' object is not iterable

However, when I installed dabl it did not complain (it said: 'Successfully built dabl').
Is this a bug? Is there maybe an obvious workaround so that I can still use the package?

Thank you for your time

Hannes

type_hints vs types inconsistent

"public" or first-class methods should have a type_hints parameter to allow users to give the function hints.
"internal" methods should have a types parameter that allows short-cutting recomputing the types.
Currently we're inconsistent on which we use where.
Maybe types is not actually necessary and we can recompute every time?
We should also decide on when to clean. Cleaning makes computing types faster (I think).

don't use seaborn for pairplot for few features

I don't like the diagonals of the pairplot. Either we should reimplement the pairplot or we just get rid of it and not special-case it?

build and deploy website automatically

We should build and upload the website automatically like in sklearn.
The template project should have the code for that:

https://github.com/scikit-learn-contrib/project-template

Using Gridspec with custom ratios

May help for some (future) plots? Close if this is useless :)

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import numpy as np

rs = np.random.RandomState(42)
x1, x2 = rs.multivariate_normal([0, 0], [(1, 0.5), (0.5, 1)], 500).T

fig = plt.figure(figsize=(6, 6))
# 2x2 grid: large main panel with thin marginal panels above and to the right
gs = GridSpec(2, 2, width_ratios=[4, 1], height_ratios=[1, 4], hspace=0, wspace=0)
top_ax = fig.add_subplot(gs[0])    # top-left cell: marginal of x1
top_ax.set_axis_off()
right_ax = fig.add_subplot(gs[3])  # bottom-right cell: marginal of x2
right_ax.set_axis_off()
main_ax = fig.add_subplot(gs[2])   # bottom-left cell: joint scatter

top_ax.hist(x1, bins=30)
right_ax.hist(x2, bins=30, orientation='horizontal')
main_ax.scatter(x1, x2, alpha=0.6)

Missing Test Files

Most of the files defined in "testing" ipython notebook are not under the dataset folder. Maybe they can be removed?

avocado = pd.read_csv("/home/andy/datasets/avocado.csv", parse_dates=['Date'])
telco_churn = pd.read_csv("/home/andy/datasets/WA_Fn-UseC_-Telco-Customer-Churn.csv")
#restaurant = pd.read_csv("/home/andy/datasets/restaurant-and-market-health-violations.csv")
titanic = pd.read_csv("dabl/tests/titanic.csv")
ames = pd.read_excel("/home/andy/datasets/AmesHousing.xls")
cars = pd.read_excel("/home/andy/datasets/2018 FE Guide for DOE-release dates before 1-24-2018-no-sales-1-23-2018public.xlsx")
target = 'Comb Unadj FE - Conventional Fuel'
#accidents = pd.read_csv("/home/andy/datasets/Acc.csv")
#violations = pd.read_csv("/home/andy/datasets/Traffic_Violations.csv")
adult = pd.read_csv("/home/andy/datasets/adult.csv")
wine_quality = fetch_openml(data_id=287)

No outlier detection for pairplots, PCA, LDA

We should remove outliers the same way as for the univariate plots, but need to make sure that we don't warn twice for the same feature.

multiple types for single column

Originally each column could have multiple types, then I changed types to be exclusive.

If we have a low-cardinality int (or even a low-cardinality float) we probably want to encode it as both categorical and continuous for learning, while we only want to show it as categorical in plotting.

The information whether something was int or float before being made categorical is "lost" because we have to transform categoricals to strings so we can do missing value imputation in SimpleImputer (which is kinda bad in itself).

Possible solutions:

go back to having a separate "low cardinality int" type and decide on the spot (in plotting and in learning) what to do with it.
have a separate dtype annotation
give both types to the column, have plotting ignore the continuous one (one issue: unit testing the function is a bit harder since we have fewer guarantees?)
have a more complex types object that's not just a boolean dataframe (doesn't really solve the core issue but might make it easier; could have a "plot_how" or something).
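The last option could look roughly like the sketch below: a small per-column record that keeps the original storage type, lists the encodings to use for learning, and names a single kind for plotting. Everything here (the `ColumnType` class, its fields other than `plot_how`, and the `max_categories` threshold) is hypothetical and not part of dabl.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnType:
    """Hypothetical per-column annotation; not part of dabl's API."""
    name: str
    storage: str     # original dtype family: "int" or "float"
    learn_as: tuple  # encodings to use for learning
    plot_how: str    # the single kind used for plotting


def annotate(name, values, max_categories=10):
    """Build a ColumnType for a numeric column without missing values."""
    storage = "int" if all(isinstance(v, int) for v in values) else "float"
    if len(set(values)) <= max_categories:
        # Low cardinality: learn from both encodings, plot as categorical.
        return ColumnType(name, storage, ("categorical", "continuous"), "categorical")
    return ColumnType(name, storage, ("continuous",), "continuous")


print(annotate("pclass", [1, 2, 3, 1, 3]))
print(annotate("fare", [i / 2 for i in range(50)]))
```

Because `storage` survives the conversion to strings, the int-versus-float information mentioned above would no longer be lost.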

test: ensure string targets in classification work correctly

regression test for label encoder thing

need to fix alpha of legend in scatterplots

if transparency is high, the legend looks all white. Easy to fix.

add demo dataset for showing / trying all the possible different visualizations

There should be synthetic demo datasets that show off all the possible ways to visualize features, probably one for regression, one for binary classification, one for multi-class classification and possibly one for multi-class with many classes.

I'm thinking mix of low and high cardinality categorical variables, overplotting issues in 2d and diverse univariate distributions.
I have no idea how to visualize the informativeness of powerlaw features off the top of my head.

Also: for categorical variables having very skewed distributions vs even distributions in the categories would be good as different types of visualizations work on these.

TODOs

General features

SimpleModels

ensure linear models are actually fast enough
add knn if the dataset is small enough (maybe?)

Anymodels

validate grid
add classifier
add regressor

Model Distillation

add model distillation for classifiers
add model distillation for regression

Type detection

Plot supervised

plotting for explain

Preprocessing & Feature extraction

add treatment for high-cardinality categoricals
string feature extraction

expand quick start guide

Right now there's info leakage on titanic.
Once we show the plots (#51) we can identify that and drop the column and then run the simple model.
Also, it should get explain and AnyModel added.

should type_hints for constant features be ignored?

Right now constant features with type hints are kept. Is that good?

Calculating AUC with AnyClassifier... AttributeError: 'AnyClassifier' object has no attribute 'decision_function'

It seems like it is not possible to calculate AUC scores when using AnyClassifier with sklearn.model_selection.cross_val_score. Here is a minimal example:

import dabl
from sklearn import datasets
from sklearn.model_selection import cross_val_score
X, y = datasets.load_breast_cancer(return_X_y=True)
sr = dabl.models.AnyClassifier(force_exhaust_budget=False)
cross_val_score(sr, X = X, y = y, scoring = 'roc_auc', cv = 3)

best classifier:  HistGradientBoostingClassifier(l2_regularization=0.0001, learning_rate=0.1,
                               loss='auto', max_bins=64, max_depth=20,
                               max_iter=400, max_leaf_nodes=128,
                               min_samples_leaf=6, n_iter_no_change=None,
                               random_state=47806, scoring=None, tol=1e-07,
                               validation_fraction=0.2, verbose=0)
best score: 0.958
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/miniconda3/lib/python3.7/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
    180             try:
--> 181                 y_pred = clf.decision_function(X)
    182 

AttributeError: 'AnyClassifier' object has no attribute 'decision_function'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-18-16e17e76c51b> in <module>
      3 sr = dabl.models.AnyClassifier(force_exhaust_budget=False)
      4 
----> 5 cross_val_score(sr, X = X, y = y, scoring = 'roc_auc', cv = 3)

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    387                                 fit_params=fit_params,
    388                                 pre_dispatch=pre_dispatch,
--> 389                                 error_score=error_score)
    390     return cv_results['test_score']
    391 

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    229             return_times=True, return_estimator=return_estimator,
    230             error_score=error_score)
--> 231         for train, test in cv.split(X, y, groups))
    232 
    233     zipped_scores = list(zip(*scores))

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    919             # remaining jobs.
    920             self._iterating = False
--> 921             if self.dispatch_one_batch(iterator):
    922                 self._iterating = self._original_iterator is not None
    923 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    552         fit_time = time.time() - start_time
    553         # _score will return dict if is_multimetric is True
--> 554         test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
    555         score_time = time.time() - start_time - fit_time
    556         if return_train_score:

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
    595     """
    596     if is_multimetric:
--> 597         return _multimetric_score(estimator, X_test, y_test, scorer)
    598     else:
    599         if y_test is None:

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
    625             score = scorer(estimator, X_test)
    626         else:
--> 627             score = scorer(estimator, X_test, y_test)
    628 
    629         if hasattr(score, 'item'):

/miniconda3/lib/python3.7/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
    186 
    187             except (NotImplementedError, AttributeError):
--> 188                 y_pred = clf.predict_proba(X)
    189 
    190                 if y_type == "binary":

AttributeError: 'AnyClassifier' object has no attribute 'predict_proba'
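The traceback shows the scorer probing for decision_function and then predict_proba on the wrapper itself. One way a meta-estimator can satisfy that probe is to forward those methods to its fitted inner model, as in this dependency-free sketch. Both classes are stand-ins invented for illustration; only the `est_` attribute name is taken from the traceback of dabl/models.py elsewhere on this page, and this is not a claim about how dabl fixed the issue.

```python
class ConstantProbaModel:
    """Stand-in for a fitted inner estimator; purely illustrative."""

    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class SearchWrapper:
    """Hypothetical meta-estimator that forwards to its best inner model."""

    def fit(self, X, y):
        # A real search would select and fit a model here.
        self.est_ = ConstantProbaModel()
        return self

    def __getattr__(self, name):
        # Only called when normal lookup fails, so regular attributes
        # are unaffected.
        if name in ("predict_proba", "decision_function"):
            est = self.__dict__.get("est_")
            if est is None:
                raise AttributeError(f"{name} is only available after fit")
            # Raises AttributeError if the inner model lacks the method,
            # so hasattr() on the wrapper stays truthful.
            return getattr(est, name)
        raise AttributeError(name)


wrapper = SearchWrapper().fit([[0], [1]], [0, 1])
print(wrapper.predict_proba([[0], [1]]))
print(hasattr(wrapper, "decision_function"))
```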

%matplotlib inline
from sklearn.datasets import fetch_openml
from dabl.utils import data_df_from_bunch
from dabl import plot_supervised
# wine_quality
data = fetch_openml(data_id=287)
data = data_df_from_bunch(data)
plot_supervised(data, 'target')

Note: this is not the scikit-learn "wine" dataset

reasons might be: many classes, bad KDE bandwidth? bad scatter size? Uninformative data? overplotting?

Look at goodtables

https://github.com/frictionlessdata/goodtables-py

maybe?

Cannot import dabl

I installed dabl like this:

$ cd /path/to/dabl-repo
$ pip3 install --user -e .

But when I import from ipython, it fails with:

In [ ]: import dabl
ImportError                               Traceback (most recent call last)
<ipython-input-4-6b072d28c65f> in <module>
----> 1 import dabl

~/build/data-an/dabl/dabl/__init__.py in <module>
      2 from .models import SimpleClassifier, SimpleRegressor
      3 from .plot.supervised import plot_supervised
----> 4 from .explain import explain
      5
      6 __all__ = ['EasyPreprocessor', 'SimpleClassifier', 'SimpleRegressor',

~/build/data-an/dabl/dabl/explain.py in <module>
      1 import numpy as np
      2
----> 3 from sklearn.tree import DecisionTreeClassifier, plot_tree
      4 from sklearn.ensemble import RandomForestClassifier
      5 from sklearn.pipeline import Pipeline

ImportError: cannot import name 'plot_tree'

I couldn't find a plot_tree function in sklearn.tree, are you using a development version?

Incompatibilities between DABL AnyClassifier and OpenML

Hi, I am running into issues when I try to run AnyClassifier on OpenML tasks.

So far I have encountered the following examples:

Getting a value error only when I try to run on the openml task, not when I run on the same dataset locally.

import dabl
import openml
from sklearn.pipeline import make_pipeline
task = openml.tasks.get_task(15)
clf = make_pipeline(dabl.models.AnyClassifier(force_exhaust_budget=False))
run = openml.runs.run_model_on_task(clf, task)

best classifier:  SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.03162277660168379,
    kernel='rbf', max_iter=-1, probability=False, random_state=1,
    shrinking=True, tol=0.001, verbose=False)
best score: 0.964
best classifier:  HistGradientBoostingClassifier(l2_regularization=0.0001, learning_rate=0.1,
                               loss='auto', max_bins=16, max_depth=7,
                               max_iter=200, max_leaf_nodes=4,
                               min_samples_leaf=4, n_iter_no_change=None,
                               random_state=7320, scoring=None, tol=1e-07,
                               validation_fraction=0.1, verbose=0)
best score: 0.952
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-d9533783e081> in <module>
      9 # run clf on the task
     10 print('Run clf on the task')
---> 11 run = openml.runs.run_model_on_task(clf, task)
     12 
     13 # print feedbackack

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_model_on_task(model, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow, return_flow)
    104         seed=seed,
    105         add_local_measures=add_local_measures,
--> 106         upload_flow=upload_flow,
    107     )
    108     if return_flow:

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_flow_on_task(flow, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow)
    220         task=task,
    221         extension=flow.extension,
--> 222         add_local_measures=add_local_measures,
    223     )
    224 

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in _run_task_get_arffcontent(flow, model, task, extension, add_local_measures)
    444             rep_no=rep_no,
    445             fold_no=fold_no,
--> 446             X_test=test_x,
    447         )
    448         if trace is not None:

/miniconda3/lib/python3.7/site-packages/openml/extensions/sklearn/extension.py in _run_model_on_fold(self, model, task, X_train, rep_no, fold_no, y_train, X_test)
   1356 
   1357             if isinstance(task, OpenMLSupervisedTask):
-> 1358                 model_copy.fit(X_train, y_train)
   1359             elif isinstance(task, OpenMLClusteringTask):
   1360                 model_copy.fit(X_train)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    354                                  self._log_message(len(self.steps) - 1)):
    355             if self._final_estimator != 'passthrough':
--> 356                 self._final_estimator.fit(Xt, y, **fit_params)
    357         return self
    358 

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/models.py in fit(self, X, y, target_col)
    351             scoring='recall_macro')
    352         self.search_ = gs
--> 353         gs.fit(X, y)
    354         self.est_ = gs.best_estimator_
    355 

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/search.py in fit(self, X, y, groups, **fit_params)
    132             groups=groups,
    133         )
--> 134         super().fit(X, y=y, groups=groups, **fit_params)
    135         # Set best_score_: BaseSearchCV does not set it, as refit is a callable
    136         self.best_score_ = (

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/_search.py in fit(self, X, y, groups, **fit_params)
    342                 return results
    343 
--> 344             self._run_search(evaluate_candidates, X, y, groups)
    345 
    346         # For multi-metric evaluation, store the best_index_, best_params_ and

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/search.py in _run_search(self, evaluate_candidates, X, y, groups)
    232                             'r_i': [r_i] * n_candidates}
    233             results = evaluate_candidates(candidate_params, X_iter, y_iter,
--> 234                                           groups, more_results=more_results)
    235 
    236             n_candidates_to_keep = ceil(n_candidates / self.ratio)

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/_search.py in evaluate_candidates(candidate_params, X, y, groups, more_results)
    316                                for parameters, (train, test)
    317                                in product(candidate_params,
--> 318                                           cv.split(X, y, groups)))
    319 
    320                 if len(out) < 1:

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    922                 self._iterating = self._original_iterator is not None
    923 
--> 924             while self.dispatch_one_batch(iterator):
    925                 pass
    926 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    552         fit_time = time.time() - start_time
    553         # _score will return dict if is_multimetric is True
--> 554         test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
    555         score_time = time.time() - start_time - fit_time
    556         if return_train_score:

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
    595     """
    596     if is_multimetric:
--> 597         return _multimetric_score(estimator, X_test, y_test, scorer)
    598     else:
    599         if y_test is None:

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
    625             score = scorer(estimator, X_test)
    626         else:
--> 627             score = scorer(estimator, X_test, y_test)
    628 
    629         if hasattr(score, 'item'):

/miniconda3/lib/python3.7/site-packages/sklearn/metrics/scorer.py in __call__(self, estimator, X, y_true, sample_weight)
     88         """
     89 
---> 90         y_pred = estimator.predict(X)
     91         if sample_weight is not None:
     92             return self._sign * self._score_func(y_true, y_pred,

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    420         for _, name, transform in self._iter(with_final=False):
    421             Xt = transform.transform(Xt)
--> 422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 
    424     @if_delegate_has_method(delegate='_final_estimator')

/miniconda3/lib/python3.7/site-packages/sklearn/svm/base.py in predict(self, X)
    572             Class labels for samples in X.
    573         """
--> 574         y = super().predict(X)
    575         return self.classes_.take(np.asarray(y, dtype=np.intp))
    576 

/miniconda3/lib/python3.7/site-packages/sklearn/svm/base.py in predict(self, X)
    320         y_pred : array, shape (n_samples,)
    321         """
--> 322         X = self._validate_for_predict(X)
    323         predict = self._sparse_predict if self._sparse else self._dense_predict
    324         return predict(X)

/miniconda3/lib/python3.7/site-packages/sklearn/svm/base.py in _validate_for_predict(self, X)
    452 
    453         X = check_array(X, accept_sparse='csr', dtype=np.float64, order="C",
--> 454                         accept_large_sparse=False)
    455         if self._sparse and not sp.isspmatrix(X):
    456             X = sp.csr_matrix(X)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    540         if force_all_finite:
    541             _assert_all_finite(array,
--> 542                                allow_nan=force_all_finite == 'allow-nan')
    543 
    544     if ensure_min_samples > 0:

/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
     54                 not allow_nan and not np.isfinite(X).all()):
     55             type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56             raise ValueError(msg_err.format(type_err, X.dtype))
     57     # for object dtype data, we only check for NaNs (GH-13254)
     58     elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
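
For anyone reading the traceback above: the final frames are scikit-learn's finiteness check, which rejects the array handed to the SVC at predict time. Below is a minimal NumPy-only sketch of that kind of check, plus one possible workaround (filling NaNs in the test fold with training-set column means). The helper name `has_non_finite` and the toy arrays are made up for illustration, and mean imputation is only an assumption here, not necessarily how dabl handles this.

```python
import numpy as np


def has_non_finite(X):
    """Return True if X contains NaN or +/-infinity (hypothetical helper)."""
    return not np.isfinite(np.asarray(X, dtype=np.float64)).all()


# Toy data: the training fold is clean, the test fold contains a NaN.
X_train = np.array([[1.0, 2.0],
                    [3.0, 6.0]])
X_test = np.array([[np.nan, 5.0],
                   [7.0, 8.0]])

print(has_non_finite(X_train))  # False
print(has_non_finite(X_test))   # True

# Illustrative workaround: fill NaNs in the test fold with the column means
# learned on the training fold (assumed strategy, not taken from dabl).
train_means = np.nanmean(X_train, axis=0)
X_filled = np.where(np.isnan(X_test), train_means, X_test)

print(has_non_finite(X_filled))  # False
```

The point of computing the means on `X_train` only is that the test fold should not influence any statistic used for preprocessing.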

I am running into issues when features are filtered out by the near_constant_threshold:

import dabl
import openml
from sklearn.pipeline import make_pipeline
task = openml.tasks.get_task(3)
clf = make_pipeline(dabl.models.AnyClassifier(force_exhaust_budget=False))
run = openml.runs.run_model_on_task(clf, task)

/Users/hp2500/Google Drive/STUDY/Columbia/Research/dabl/dabl/preprocessing.py:255: UserWarning: Discarding near-constant features: [2, 13, 15, 16, 18, 24, 27, 28, 29]
  near_constant.index[near_constant].tolist()))
best classifier:  HistGradientBoostingClassifier(l2_regularization=1e-06, learning_rate=0.1,
                               loss='auto', max_bins=128, max_depth=12,
                               max_iter=300, max_leaf_nodes=4,
                               min_samples_leaf=3, n_iter_no_change=None,
                               random_state=28019, scoring=None, tol=1e-07,
                               validation_fraction=0.2, verbose=0)
best score: 0.959
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-2-75c2ea531ac0> in <module>
      9 # run clf on the task
     10 print('Run clf on the task')
---> 11 run = openml.runs.run_model_on_task(clf, task)
     12 
     13 # print feedbackack

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_model_on_task(model, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow, return_flow)
    104         seed=seed,
    105         add_local_measures=add_local_measures,
--> 106         upload_flow=upload_flow,
    107     )
    108     if return_flow:

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_flow_on_task(flow, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow)
    220         task=task,
    221         extension=flow.extension,
--> 222         add_local_measures=add_local_measures,
    223     )
    224 

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in _run_task_get_arffcontent(flow, model, task, extension, add_local_measures)
    444             rep_no=rep_no,
    445             fold_no=fold_no,
--> 446             X_test=test_x,
    447         )
    448         if trace is not None:

/miniconda3/lib/python3.7/site-packages/openml/extensions/sklearn/extension.py in _run_model_on_fold(self, model, task, X_train, rep_no, fold_no, y_train, X_test)
   1393         # it returns the clusters
   1394         if isinstance(task, OpenMLSupervisedTask):
-> 1395             pred_y = model_copy.predict(X_test)
   1396         elif isinstance(task, OpenMLClusteringTask):
   1397             pred_y = model_copy.predict(X_train)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    420         for _, name, transform in self._iter(with_final=False):
    421             Xt = transform.transform(Xt)
--> 422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 
    424     @if_delegate_has_method(delegate='_final_estimator')

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/models.py in predict(self, X)
    300         check_is_fitted(self, 'est_')
    301         if getattr(self, 'classes_', None) is not None:
--> 302             return self.classes_[self.est_.predict(X)]
    303 
    304         return self.est_.predict(X)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    419         Xt = X
    420         for _, name, transform in self._iter(with_final=False):
--> 421             Xt = transform.transform(Xt)
    422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/preprocessing.py in transform(self, X)
    550         # Check is fit had been called
    551         check_is_fitted(self, ['ct_'])
--> 552         return self.ct_.transform(X)

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
    510 
    511         X = _check_X(X)
--> 512         Xs = self._fit_transform(X, None, _transform_one, fitted=True)
    513         self._validate_output(Xs)
    514 

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
    410                     message=self._log_message(name, idx, len(transformers)))
    411                 for idx, (name, trans, column, weight) in enumerate(
--> 412                         self._iter(fitted=fitted, replace_strings=True), 1))
    413         except ValueError as e:
    414             if "Expected 2D array, got 1D array instead" in str(e):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    919             # remaining jobs.
    920             self._iterating = False
--> 921             if self.dispatch_one_batch(iterator):
    922                 self._iterating = self._original_iterator is not None
    923 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    752             tasks = BatchedCalls(itertools.islice(iterator, batch_size),
    753                                  self._backend.get_nested_backend(),
--> 754                                  self._pickle_cache)
    755             if len(tasks) == 0:
    756                 # No more tasks available in the iterator: tell caller to stop.

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __init__(self, iterator_slice, backend_and_jobs, pickle_cache)
    208 
    209     def __init__(self, iterator_slice, backend_and_jobs, pickle_cache=None):
--> 210         self.items = list(iterator_slice)
    211         self._size = len(self.items)
    212         if isinstance(backend_and_jobs, tuple):

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in <genexpr>(.0)
    409                     message_clsname='ColumnTransformer',
    410                     message=self._log_message(name, idx, len(transformers)))
--> 411                 for idx, (name, trans, column, weight) in enumerate(
    412                         self._iter(fitted=fitted, replace_strings=True), 1))
    413         except ValueError as e:

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _get_column(X, key)
    636         else:
    637             # numpy arrays, sparse arrays
--> 638             return X[:, key]
    639 
    640 

IndexError: boolean index did not match indexed array along dimension 1; dimension is 36 but corresponding boolean dimension is 27
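
The numbers in this message line up with the warning earlier in the output: 9 features were discarded as near-constant, and 36 - 9 = 27. One reading (an assumption, not confirmed in this thread) is that a boolean column mask was built for the 27 remaining columns but then applied to the full 36-column array at predict time. The sketch below reproduces that mismatch in plain NumPy; the array contents and variable names are invented for illustration.

```python
import numpy as np

n_features = 36
# Column indices listed in the "Discarding near-constant features" warning.
dropped = [2, 13, 15, 16, 18, 24, 27, 28, 29]

X = np.zeros((5, n_features))

# Mask over the original 36 columns: True for columns that are kept.
keep = np.ones(n_features, dtype=bool)
keep[dropped] = False
X_reduced = X[:, keep]
print(X_reduced.shape)  # (5, 27)

# A mask computed on the reduced data has length 27 ...
mask_on_reduced = np.ones(X_reduced.shape[1], dtype=bool)

# ... so applying it to the full-width array fails with an IndexError.
try:
    X[:, mask_on_reduced]
except IndexError as exc:
    print("IndexError:", exc)

# Applying the same mask to the reduced array works as intended.
print(X_reduced[:, mask_on_reduced].shape)  # (5, 27)
```

If this is what happens, the column selection would need to be applied consistently in both fit and transform, either always on the reduced data or with the mask mapped back to the original column positions.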

After removing the variance threshold, I get a different error:

import dabl
import openml
from sklearn.pipeline import make_pipeline
task = openml.tasks.get_task(3)
clf = make_pipeline(dabl.models.AnyClassifier(force_exhaust_budget=False))
run = openml.runs.run_model_on_task(clf, task)

/miniconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py:565: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask &= (ar1 != a)
/miniconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py:569: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-75c2ea531ac0> in <module>
      9 # run clf on the task
     10 print('Run clf on the task')
---> 11 run = openml.runs.run_model_on_task(clf, task)
     12 
     13 # print feedbackack

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_model_on_task(model, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow, return_flow)
    104         seed=seed,
    105         add_local_measures=add_local_measures,
--> 106         upload_flow=upload_flow,
    107     )
    108     if return_flow:

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_flow_on_task(flow, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow)
    220         task=task,
    221         extension=flow.extension,
--> 222         add_local_measures=add_local_measures,
    223     )
    224 

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in _run_task_get_arffcontent(flow, model, task, extension, add_local_measures)
    444             rep_no=rep_no,
    445             fold_no=fold_no,
--> 446             X_test=test_x,
    447         )
    448         if trace is not None:

/miniconda3/lib/python3.7/site-packages/openml/extensions/sklearn/extension.py in _run_model_on_fold(self, model, task, X_train, rep_no, fold_no, y_train, X_test)
   1393         # it returns the clusters
   1394         if isinstance(task, OpenMLSupervisedTask):
-> 1395             pred_y = model_copy.predict(X_test)
   1396         elif isinstance(task, OpenMLClusteringTask):
   1397             pred_y = model_copy.predict(X_train)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    420         for _, name, transform in self._iter(with_final=False):
    421             Xt = transform.transform(Xt)
--> 422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 
    424     @if_delegate_has_method(delegate='_final_estimator')

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/models.py in predict(self, X)
    300         check_is_fitted(self, 'est_')
    301         if getattr(self, 'classes_', None) is not None:
--> 302             return self.classes_[self.est_.predict(X)]
    303 
    304         return self.est_.predict(X)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    419         Xt = X
    420         for _, name, transform in self._iter(with_final=False):
--> 421             Xt = transform.transform(Xt)
    422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/preprocessing.py in transform(self, X)
    550         # Check is fit had been called
    551         check_is_fitted(self, ['ct_'])
--> 552         return self.ct_.transform(X)

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
    510 
    511         X = _check_X(X)
--> 512         Xs = self._fit_transform(X, None, _transform_one, fitted=True)
    513         self._validate_output(Xs)
    514 

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
    410                     message=self._log_message(name, idx, len(transformers)))
    411                 for idx, (name, trans, column, weight) in enumerate(
--> 412                         self._iter(fitted=fitted, replace_strings=True), 1))
    413         except ValueError as e:
    414             if "Expected 2D array, got 1D array instead" in str(e):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    919             # remaining jobs.
    920             self._iterating = False
--> 921             if self.dispatch_one_batch(iterator):
    922                 self._iterating = self._original_iterator is not None
    923 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _transform_one(transformer, X, y, weight, **fit_params)
    693 
    694 def _transform_one(transformer, X, y, weight, **fit_params):
--> 695     res = transformer.transform(X)
    696     # if we have a weight for this transformer, multiply output
    697     if weight is None:

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _transform(self, X)
    538         Xt = X
    539         for _, _, transform in self._iter():
--> 540             Xt = transform.transform(Xt)
    541         return Xt
    542 

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    730                                        copy=True)
    731         else:
--> 732             return self._transform_new(X)
    733 
    734     def inverse_transform(self, X):

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
    678         """New implementation assuming categorical input"""
    679         # validation of X happens in _check_X called by _transform
--> 680         X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
    681 
    682         n_samples, n_features = X_int.shape

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    135 
    136                     Xi[~valid_mask] = self.categories_[i][0]
--> 137             _, encoded = _encode(Xi, self.categories_[i], encode=True)
    138             X_int[:, i] = encoded
    139 

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py in _encode(values, uniques, encode)
    108         return res
    109     else:
--> 110         return _encode_numpy(values, uniques, encode)
    111 
    112 

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py in _encode_numpy(values, uniques, encode)
     47         if diff:
     48             raise ValueError("y contains previously unseen labels: %s"
---> 49                              % str(diff))
     50         encoded = np.searchsorted(uniques, values)
     51         return uniques, encoded

ValueError: y contains previously unseen labels: [0.0]
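
The last frames show the encoder mapping values to integer codes via a lookup against the categories learned at fit time, and raising because the value 0.0 is not among them. Here is a rough NumPy-only sketch of that kind of check. It is not scikit-learn's actual implementation; the function name `encode_with_check` and the toy categories are assumptions made for illustration.

```python
import numpy as np


def encode_with_check(values, uniques):
    """Map each value to its index in the sorted array `uniques`.

    Raises ValueError if any value was not seen at fit time
    (hypothetical helper mimicking the behaviour in the traceback).
    """
    values = np.asarray(values)
    uniques = np.asarray(uniques)
    unseen = np.setdiff1d(values, uniques)
    if unseen.size:
        raise ValueError(f"previously unseen labels: {unseen.tolist()}")
    return np.searchsorted(uniques, values)


# Categories "learned" on the training fold (toy values, sorted).
train_categories = np.array([1.0, 2.0, 3.0])

# Values that were all seen during fit encode without problems.
print(encode_with_check([2.0, 3.0, 1.0], train_categories))  # [1 2 0]

# A value that only appears in the test fold triggers the error.
try:
    encode_with_check([0.0, 2.0], train_categories)
except ValueError as exc:
    print("ValueError:", exc)
```

So the question for this issue is where the 0.0 comes from: either the test fold really contains a category absent from the training fold, or the values reaching the encoder at predict time differ in type or preprocessing from those seen during fit.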

dabl's Issues

General features

SimpleModels

Anymodels

Model Distillation

Type detection

Plot supervised

plotting for explain

Preprocessing & Feature extraction
