
dabl's Introduction

dabl

The data analysis baseline library.

  • "Mr Sanchez, are you a data scientist?"
  • "I dabl, Mr president."

Find more information on the website.

Try it out

pip install dabl

or Binder

Current scope and upcoming features

This library is still very much under development. The current code focuses mostly on exploratory visualization and preprocessing. There are also drop-in replacements for GridSearchCV and RandomizedSearchCV that use successive halving. There are preliminary portfolios in the style of POSH auto-sklearn to find strong models quickly. In essence, that boils down to a quick search over different gradient boosting models, other tree ensembles, and potentially kernel methods.
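For readers who have not met successive halving before, here is a minimal pure-Python sketch of the general idea. It is not dabl's implementation: the function name, the evaluate callback and the ratio default are assumptions made for illustration. All candidates are scored on a small budget, only the best fraction survives, and the budget grows for the next round.

```python
from math import ceil


def successive_halving(candidates, evaluate, min_budget, max_budget, ratio=3):
    """Return the single surviving candidate.

    evaluate(candidate, budget) must return a score where higher is better.
    All names and defaults here are illustrative, not dabl's API.
    """
    survivors = list(candidates)
    if not survivors:
        raise ValueError("need at least one candidate")
    if ratio <= 1:
        raise ValueError("ratio must be greater than 1")
    budget = min_budget
    while len(survivors) > 1:
        # Score every survivor with the current (cheap) budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        # Keep the best 1/ratio of them, at least one.
        survivors = ranked[:ceil(len(ranked) / ratio)]
        # Give the remaining candidates more resources next round.
        budget = min(budget * ratio, max_budget)
    return survivors[0]


# Toy usage: the "score" ignores the budget and prefers values close to 0.5.
best = successive_halving(
    [0.1, 0.5, 0.9, 0.3], lambda c, budget: -abs(c - 0.5),
    min_budget=10, max_budget=1000, ratio=2,
)
print(best)
```

In a real search the budget would typically be the number of training samples, so early rounds are cheap and only promising candidates ever see the full dataset.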

Check out the website and example gallery to get an idea of the visualizations that are available.

Stay Tuned!

Related packages

Lux

Lux is an awesome project for easy interactive visualization of pandas dataframes within notebooks.

Pandas Profiling

Pandas Profiling can provide a thorough summary of the data in only a single line of code. Using the ProfileReport() method, you are able to access a HTML report of your data that can help you find correlations and identify missing data.

dabl focuses less on statistical measures of individual columns, and more on providing a quick overview via visualizations, as well as convenient preprocessing and model search for machine learning.

dabl's People

Contributors

amueller, baogianghoangvu, betatim, bpkroth, dhirschfeld, ecomodeller, encryptedcommerce, esvhd, glemaitre, h4pz, hp2500, ipacheco-uy, j1c, jcfr, kathrynle20, mertnuhuz, nicolashug, parthpankajtiwary, praths007, seljukgulcan, stefmolin, suvayu, svoons, thomasjpfan

dabl's Issues

AnyClassifier fails on baseball

from dabl.models import AnyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_openml

bunch = fetch_openml(data_id=185)
X = bunch.data
y = bunch.target
cross_val_score(AnyClassifier(), X, y, scoring='f1_macro', cv=10, error_score='raise')

errors with a NaN error and I don't know why :-/
@thomasjpfan @NicolasHug feel free to check it out if you want to play with dabl.

running successive halving on digits with SVC takes longer than GridSearchCV

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_digits
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data / 16, digits.target, stratify=digits.target, random_state=42)

param_grid = {'C': np.logspace(-3, 2, 6), 'gamma': np.logspace(-3, 2, 6)}
gs = GridSearchCV(SVC(), param_grid, cv=5)
gs.fit(X_train, y_train)

takes ~40s on my laptop, vs about 80s for successive halving. That's unfortunate. A lot of the time is spent in predicting, which we call more often on same-sized datasets (I think), but there's even more time spent in fitting, which doesn't make sense to me.
cc @NicolasHug
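A rough count of fit calls may help frame the timing question. The sketch below assumes a halving ratio of 3 (an assumption, the issue does not state it), the 6 x 6 = 36 candidates and cv=5 from the snippet above, and that each round keeps ceil(n_candidates / ratio) candidates until one is left. The function names are hypothetical.

```python
from math import ceil


def halving_schedule(n_candidates, ratio=3):
    """Number of candidates evaluated in each round (hypothetical helper)."""
    schedule = []
    while n_candidates > 1:
        schedule.append(n_candidates)
        n_candidates = ceil(n_candidates / ratio)
    return schedule


def count_fits(n_candidates, cv=5, ratio=3):
    """Compare the number of fit calls, ignoring the final refit."""
    grid_fits = n_candidates * cv
    halving_fits = sum(halving_schedule(n_candidates, ratio)) * cv
    return grid_fits, halving_fits


print(halving_schedule(36))  # candidates per round
print(count_fits(36))        # (grid search fits, successive halving fits)
```

Under these assumptions successive halving makes more fit and predict calls than the plain grid search, just on smaller subsets in the early rounds. Whether that explains the measured 80s is not established here; per-call overhead and how SVC scales with the subset size would both matter.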

EasyPreprocessor fails on categorical data as integers with missing values

Filling missing int or float values with dabl_missing breaks.
We might want to call clean on the data first (which converts the categories to string....), or have a separate OneHotEncoder for non-string categories. That would make the feature names harder to interpret because missingness will be encoded as an integer or float then.

Alternatively we can just error on dirty data?
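To make the first option concrete, here is a small pure-Python sketch of "convert the categories to strings, then fill". It is not dabl's clean function. The sentinel name is taken from the issue text; the helper names and the choice to render 2.0 as "2" are assumptions.

```python
import math

MISSING = "dabl_missing"  # sentinel name as mentioned in the issue


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def categories_as_strings(values):
    """Turn int/float category codes into strings so a string sentinel fits."""
    out = []
    for v in values:
        if is_missing(v):
            out.append(MISSING)
        elif isinstance(v, float) and v.is_integer():
            # Integer codes often become floats once NaN is present: 2.0 -> "2".
            out.append(str(int(v)))
        else:
            out.append(str(v))
    return out


print(categories_as_strings([1, 2.0, float("nan"), None, 3]))
```

After this step every value has the same type, so filling with a string no longer mixes strings into a numeric column. The cost is the one noted above: the original int or float dtype is no longer visible.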

steal add_legend from seaborn

We need to either move all the grids to Grid or we should steal "add_legend" because it seems a bit tricky.
That probably requires introducing our own grid class.
Maybe the base "Grid" class is enough? I don't like facetgrid because it requires melting.

weird interactions in gradient boosting and successive halving

from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import fetch_openml
from dabl.search import RandomSuccessiveHalving
from scipy import stats

bunch = fetch_openml(data_id=40701)

X = bunch.data
y = bunch.target
param_dist = {'max_leaf_nodes': stats.randint(2, 100)}
rsh = RandomSuccessiveHalving(HistGradientBoostingClassifier(),
                              param_distributions=param_dist,
                              verbose=10, r_min=40, scoring="roc_auc")
rsh.fit(X, y)

doesn't work as expected. In the first round all the AUCs are .5 and I think some random params get selected.
I'm not sure what's happening. It could be that because it's a imbalanced dataset something weird happens? Also without specifying r_min it doesn't work at all b/c of validation set issues, I think.

@NicolasHug could you have a look?
Also: I realized HistGradientBoosting doesn't have all the subsampling options and no support for class_weight ;)

ModuleNotFoundError: No module named 'sklearn.experimental'

Seems like some incompatibility with Windows Anaconda3. I installed using pip on two Windows boxes, and when importing in Jupyter or on the command line I receive a ModuleNotFoundError. I also installed on my Linux box, where it imports without error. Please advise.
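The report does not say which scikit-learn version was installed. The sklearn.experimental module first shipped in scikit-learn 0.21, so an older scikit-learn in the Anaconda environment is one possible cause (an assumption, not confirmed in this thread). A quick diagnostic, using only the standard library and a hypothetical helper name:

```python
import importlib
import importlib.util


def describe_sklearn():
    """Report whether scikit-learn is importable and has sklearn.experimental."""
    try:
        sklearn = importlib.import_module("sklearn")
    except ImportError:
        return {"installed": False, "version": None, "has_experimental": False}
    # find_spec returns None when the submodule does not exist.
    spec = importlib.util.find_spec("sklearn.experimental")
    return {
        "installed": True,
        "version": getattr(sklearn, "__version__", None),
        "has_experimental": spec is not None,
    }


print(describe_sklearn())
```

Running this in the same Jupyter kernel or shell that raises the error shows which scikit-learn that interpreter actually picks up, which can differ from the one pip installed into.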

fix CI

doesn't have sklearn right now?

how to show density of very peaked distributions (not non-negative)

There's a nice example of some very peaky distributions in the bank marketing dataset:

data = fetch_openml(data_id=1461)

On some of these Yeo-Johnson seems to work, on others it doesn't seem to provide anything useful. Also, the KDE plots might be a bit misleading.

without YJ:
image

with YJ:
image

Looks like V13 has more information on a first look, but probably just an artifact of the KDE and we really shouldn't be using KDEs.

Someone thought much more about this:
https://arxiv.org/pdf/1305.0215.pdf
logarithmically spaced bins seem to be the main method for plotting?
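As a reference point, this is what logarithmically spaced bins look like in plain Python. It is only a sketch: the function names are made up, and it handles strictly positive values only, so it does not yet answer the question in the title about features that also take negative values.

```python
import math


def log_spaced_edges(lo, hi, n_bins):
    """Return n_bins + 1 edges evenly spaced in log10 between lo and hi."""
    if lo <= 0 or hi <= lo or n_bins < 1:
        raise ValueError("need 0 < lo < hi and n_bins >= 1")
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    step = (log_hi - log_lo) / n_bins
    return [10 ** (log_lo + i * step) for i in range(n_bins + 1)]


def histogram(values, edges):
    """Count values per bin; bins are half-open except the last one."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= v < edges[i + 1]
            if in_bin or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


edges = log_spaced_edges(1.0, 1000.0, 3)  # roughly 1, 10, 100, 1000
print(histogram([2, 5, 20, 50, 500], edges))
```

With linear bins of equal width over the same range, all but the largest value would land in the first bin, which is the "peaky" picture described above.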

Type detection

Hi @amueller, I want to try my hands on the Type Detection tasks in the todo and was wondering if you could shed a little more light on it.

Am I correct to assume that we're looking for a Detection object that has methods that detects the various things listed in the list?

plot_supervised crashes with cleaned data

I was trying to follow the examples in the quick start guide, but I see a crash when I call plot_supervised with a cleaned dataframe. Passing the original dataframe works. Here's an example notebook with the titanic dataset.

PS: there are a few minor mistakes in the quick start guide, e.g. detect_types_dataframe should be detect_types, inconsistent namespace (sometimes you use dabl.*, sometimes the global namespace), etc.

Error 'TypeError: 'function' object is not iterable' in function 'plot'

Hello,

for almost all functions of the package 'dabl' (such as 'clean' and 'plot' for example) I keep getting the following error:

TypeError: 'function' object is not iterable

This does not seem to be a problem of my specific data, because when I run the following example code (from the website):

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from dabl import plot
from dabl.utils import data_df_from_bunch

wine_bunch = load_wine()
wine_df = data_df_from_bunch(wine_bunch)

plot(wine_df, 'target')

I then get the following output on my machine:


  File "<ipython-input-61-92b3b7736897>", line 1, in <module>
    plot(wine_df, 'target')

  File "C:\Daten\Anaconda3\lib\site-packages\dabl\plot\supervised.py", line 485, in plot
    X, types = clean(X, type_hints=type_hints, return_types=True)

  File "C:\Daten\Anaconda3\lib\site-packages\dabl\preprocessing.py", line 380, in clean
    lambda x: str(x))

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\accessor.py", line 115, in f
    return self._delegate_method(name, *args, **kwargs)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\categorical.py", line 2204, in _delegate_method
    res = method(*args, **kwargs)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\categorical.py", line 940, in rename_categories
    cat.categories = new_categories

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\categorical.py", line 408, in categories
    new_dtype = CategoricalDtype(categories, ordered=self.ordered)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\dtypes\dtypes.py", line 154, in __init__
    self._finalize(categories, ordered, fastpath=False)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\dtypes\dtypes.py", line 181, in _finalize
    fastpath=fastpath)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\dtypes\dtypes.py", line 319, in _validate_categories
    categories = Index(categories, tupleize_cols=False)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 374, in __new__
    subarr = _asarray_tuplesafe(data, dtype=object)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\common.py", line 379, in _asarray_tuplesafe
    values = list(values)

TypeError: 'function' object is not iterable

However, when I installed dabl it did not complain (it said: 'Successfully built dabl').
Is this a bug? Is there maybe an obvious workaround so that I can still use the package?

Thank you for your time

Hannes

type_hints vs types inconsistent

"public" or first-class methods should have a type_hints parameter to allow users to give the function hints.
"internal" methods should have a types parameter that allows short-cutting recomputing the types.
Currently we're inconsistent on which we use where.
Maybe types is not actually necessary and we can recompute every time?
We should also decide on when to clean. Cleaning makes computing types faster (I think).

Using Gridspec with custom ratios

May help for some (future) plots? Close if this is useless :)

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import numpy as np

rs = np.random.RandomState(42)
x1, x2, = rs.multivariate_normal([0, 0], [(1, 0.5), (0.5, 1)], 500).T

fig = plt.figure(figsize=(6, 6))
gs = GridSpec(2, 2, width_ratios=[4, 1], height_ratios=[1, 4], hspace=0, wspace=0)
top_ax = fig.add_subplot(gs[0])
top_ax.set_axis_off()
left_ax = fig.add_subplot(gs[3])
left_ax.set_axis_off()
main_ax = fig.add_subplot(gs[2])

top_ax.hist(x1, bins=30)
left_ax.hist(x2, bins=30, orientation='horizontal')
main_ax.scatter(x1, x2, alpha=0.6)

testing

Missing Test Files

Most of the files defined in "testing" ipython notebook are not under the dataset folder. Maybe they can be removed?

avocado = pd.read_csv("/home/andy/datasets/avocado.csv", parse_dates=['Date'])
telco_churn = pd.read_csv("/home/andy/datasets/WA_Fn-UseC_-Telco-Customer-Churn.csv")
#restaurant = pd.read_csv("/home/andy/datasets/restaurant-and-market-health-violations.csv")
titanic = pd.read_csv("dabl/tests/titanic.csv")
ames = pd.read_excel("/home/andy/datasets/AmesHousing.xls")
cars = pd.read_excel("/home/andy/datasets/2018 FE Guide for DOE-release dates before 1-24-2018-no-sales-1-23-2018public.xlsx")
target = 'Comb Unadj FE - Conventional Fuel'
#accidents = pd.read_csv("/home/andy/datasets/Acc.csv")
#violations = pd.read_csv("/home/andy/datasets/Traffic_Violations.csv")
adult = pd.read_csv("/home/andy/datasets/adult.csv")
wine_quality = fetch_openml(data_id=287)

multiple types for single column

Originally each column could have single types, then I changed types to be exclusive.

If we have a low-cardinality int (or even a low-cardinality float) we probably want to encode it as both categorical and continuous for learning, while we only want to show it as categorical in plotting.

The information whether something was int or float before being made categorical is "lost" because we have to transform categoricals to strings so we can do missing value imputation in SimpleImputer (which is kinda bad in itself).

Possible solutions:

  • go back to having a separate "low cardinality int" type and decide on the spot (in plotting and in learning) what to do with it.
  • have a separate dtype annotation
  • give both types to the column, have plotting ignore the continuous one (one issue: unit testing the function is a bit harder since we have less guarantees?)
  • have a more complex types object that's not just a boolean dataframe (doesn't really solve the core issue but might make it easier; could have a "plot_how" or something).
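The last option could look roughly like the sketch below. The class and field names (ColumnType, learn_as, plot_as) are invented for illustration and are not part of dabl; plot_as plays the role of the "plot_how" idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnType:
    """Per-column annotation that separates learning from plotting."""
    name: str
    learn_as: tuple   # encodings to use when fitting a model
    plot_as: str      # single representation to use in plots


def describe_low_cardinality_int(name):
    """Low-cardinality ints: learn both ways, but plot as categorical only."""
    return ColumnType(
        name=name,
        learn_as=("categorical", "continuous"),
        plot_as="categorical",
    )


pclass = describe_low_cardinality_int("pclass")
print(pclass)
```

Plotting code would read only plot_as and preprocessing only learn_as, so each consumer still sees one unambiguous answer. As noted in the list, this does not solve the lost-dtype problem by itself.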

add demo dataset for showing / trying all the possible different visualizations

There should be synthetic demo datasets that shows off all the possible ways to visualize features, probably one for regression, one for binary classification, one for multi-class classification and possibly one for multi-class with many classes.

I'm thinking mix of low and high cardinality categorical variables, overplotting issues in 2d and diverse univariate distributions.
I have no idea how to visualize the informativeness of powerlaw features off the top of my head.

Also: for categorical variables having very skewed distributions vs even distributions in the categories would be good as different types of visualizations work on these.
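A possible starting point for the binary classification case, using only the standard library. The column names, distributions and parameters are all made up for illustration; nothing here exists in dabl.

```python
import random


def make_demo_rows(n_rows=200, seed=0):
    """Generate rows mixing the feature kinds listed above (binary target)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        label = rng.randint(0, 1)
        rows.append({
            # three evenly distributed categories
            "low_card_cat": rng.choice(["a", "b", "c"]),
            # up to 100 distinct values
            "high_card_cat": f"id_{rng.randrange(100)}",
            # very skewed category frequencies
            "skewed_cat": rng.choices(["common", "rare"], weights=[95, 5])[0],
            # heavy-tailed, power-law-like feature (values >= 1)
            "power_law": rng.paretovariate(1.5),
            # continuous feature whose mean depends on the class
            "informative": rng.gauss(2.0 * label, 1.0),
            "target": label,
        })
    return rows


rows = make_demo_rows(50, seed=1)
print(rows[0])
```

Regression and multi-class variants would follow the same pattern with a different target column. Overplotting cases would need many more rows than this.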

TODOs

General features

  • add grids
  • add successive halving
  • add timeouts
  • add simple models
  • add cleanup function for dirty floats
  • add cleanup for low cardinality int
  • add interpretable models
  • add feature engineering
  • add distilled classifier

SimpleModels

  • ensure linear models are actually fast enough
  • add knn if the dataset is small enough (maybe?)

Anymodels

  • validate grid
  • add classifier
  • add regressor

Model Distillation

  • add model distillation for classifiers
  • add model distillation for regression

Type detection

  • detect types of low cardinality integers
  • detect indices
  • detect missing value encoding
  • detect near constant values
  • detect duplicate rows
  • detect duplicate columns
  • detect ordered data
  • DETECT non-unique index!!!
  • integers as objects not detected!
  • low cardinality floats should be treated as categorical
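A few of the checks in this list can be sketched in plain Python over a dict of column name to list of values. The function names and thresholds are assumptions for illustration, not dabl's detection code.

```python
from collections import Counter


def is_near_constant(values, threshold=0.95):
    """True if one value makes up at least `threshold` of the column."""
    if not values:
        return False
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values) >= threshold


def is_low_cardinality_int(values, max_unique=10):
    """True if all non-missing values are ints with few distinct values."""
    present = [v for v in values if v is not None]
    all_ints = all(isinstance(v, int) and not isinstance(v, bool) for v in present)
    return bool(present) and all_ints and len(set(present)) <= max_unique


def duplicate_columns(table):
    """Return (first_name, duplicate_name) pairs for identical columns."""
    seen = {}
    duplicates = []
    for name, values in table.items():
        key = tuple(values)
        if key in seen:
            duplicates.append((seen[key], name))
        else:
            seen[key] = name
    return duplicates


table = {"a": [1, 2, 3, 1], "b": [1, 2, 3, 1], "c": [7, 7, 7, 7]}
print(duplicate_columns(table))
print(is_near_constant(table["c"]), is_low_cardinality_int(table["a"]))
```

Each check is a small independent function, which would also make the unit testing mentioned elsewhere in these notes straightforward.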

Plot supervised

  • Detect overplotting (super hacky for now, better than nothing)
  • find interesting scatterplots
  • plot high cardinality categoricals
  • plot free string categoricals
  • plot text
  • switch to logarithmic axes when appropriate
  • time series
  • show correlation plots
  • if multiple scatter plots share axes, make sure they line up (on y)
  • don't show density plots if they are garbage?
  • make sure colormaps fit between different plots
  • add class legends for classification
  • make sure legends / colors match between different plots
  • don't show PCA if worse than original features
  • don't show scatterplots of highly correlated features (i.e. two of the three scatter plots are basically the same)
  • better wrap plot grids / abstract away
  • add missing value indicator, make sure it's not duplicated
  • use mosaic plots (from statsmodels?)
  • even when plotting all categorical features, show mutual information
  • subsample for scatter plots?
  • subsample for kde plots or switch to histograms? Or always histograms?
  • add scree plot when it is interesting? Or report 95% variance number of components? We're computing PCA anyway (for classification at least).
  • classification pairplot is ugly with single continuous feature
  • detect extreme outliers
  • remove outliers on pca and lda plots (probably as part of refactoring lol)
  • redo / remove pairwise plot?
  • interactions of categorical features? or do we leave that to trees (tic tac toe etc?)
  • high number of classes. what to do? [isolet?]
  • class hist plots should align within a grid row (or overall?)
  • maybe separate plot type for ordinal?
  • add diamond dataset example?
  • for very dense scatterplots: either use spatial binning and then do a tree map or use a decision tree for partitioning to overcome overplotting? also see https://www.r-bloggers.com/ggplot2-for-big-data/amp/
  • ordinal plots have possibly bad ticklabels, see bioresponse
  • should integer features that are recognized as categorical still go into LDA and PCA? look at credit-g maybe
  • regression: do pca on continuous features

plotting for explain

  • residual plots for regression
  • confusion matrix for classification
  • precision-recall curve
  • partial dependence plots
  • permutation importance
  • calibration plot for explain?

Preprocessing & Feature extraction

  • add treatment for high-cardinality categoricals
  • string feature extraction

expand quick start guide

Right now there's info leakage on titanic.
Once we show the plots (#51) we can identify that and drop the column and then run the simple model.
Also, it should get explain and AnyModel added.

Calculating AUC with AnyClassifier... AttributeError: 'AnyClassifier' object has no attribute 'decision_function'

It seems like it is not possible to calculate AUC scores when using AnyClassifier with sklearn.model_selection.cross_val_score. Here is a minimal example:

X, y = datasets.load_breast_cancer(return_X_y=True)
sr = dabl.models.AnyClassifier(force_exhaust_budget=False)
cross_val_score(sr, X = X, y = y, scoring = 'roc_auc', cv = 3)
best classifier:  HistGradientBoostingClassifier(l2_regularization=0.0001, learning_rate=0.1,
                               loss='auto', max_bins=64, max_depth=20,
                               max_iter=400, max_leaf_nodes=128,
                               min_samples_leaf=6, n_iter_no_change=None,
                               random_state=47806, scoring=None, tol=1e-07,
                               validation_fraction=0.2, verbose=0)
best score: 0.958
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/miniconda3/lib/python3.7/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
    180             try:
--> 181                 y_pred = clf.decision_function(X)
    182 

AttributeError: 'AnyClassifier' object has no attribute 'decision_function'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-18-16e17e76c51b> in <module>
      3 sr = dabl.models.AnyClassifier(force_exhaust_budget=False)
      4 
----> 5 cross_val_score(sr, X = X, y = y, scoring = 'roc_auc', cv = 3)

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    387                                 fit_params=fit_params,
    388                                 pre_dispatch=pre_dispatch,
--> 389                                 error_score=error_score)
    390     return cv_results['test_score']
    391 

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    229             return_times=True, return_estimator=return_estimator,
    230             error_score=error_score)
--> 231         for train, test in cv.split(X, y, groups))
    232 
    233     zipped_scores = list(zip(*scores))

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    919             # remaining jobs.
    920             self._iterating = False
--> 921             if self.dispatch_one_batch(iterator):
    922                 self._iterating = self._original_iterator is not None
    923 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    552         fit_time = time.time() - start_time
    553         # _score will return dict if is_multimetric is True
--> 554         test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
    555         score_time = time.time() - start_time - fit_time
    556         if return_train_score:

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
    595     """
    596     if is_multimetric:
--> 597         return _multimetric_score(estimator, X_test, y_test, scorer)
    598     else:
    599         if y_test is None:

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
    625             score = scorer(estimator, X_test)
    626         else:
--> 627             score = scorer(estimator, X_test, y_test)
    628 
    629         if hasattr(score, 'item'):

/miniconda3/lib/python3.7/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
    186 
    187             except (NotImplementedError, AttributeError):
--> 188                 y_pred = clf.predict_proba(X)
    189 
    190                 if y_type == "binary":

AttributeError: 'AnyClassifier' object has no attribute 'predict_proba'
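The traceback shows the scorer trying decision_function first and then predict_proba, and the wrapper exposing neither. One common way to handle this in a meta-estimator is to forward those methods to the fitted inner model when it has them. The sketch below shows that pattern with toy classes in plain Python; it is not AnyClassifier's code, and every name in it is hypothetical.

```python
class ConstantProbaModel:
    """Toy stand-in for a fitted estimator that supports predict_proba."""

    def fit(self, X, y):
        self.positive_rate_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [int(self.positive_rate_ >= 0.5)] * len(X)

    def predict_proba(self, X):
        p = self.positive_rate_
        return [[1 - p, p] for _ in X]


class DelegatingClassifier:
    """Wrapper that forwards scoring methods to the fitted inner estimator."""

    _DELEGATED = ("predict_proba", "decision_function")

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.est_ = self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.est_.predict(X)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in self._DELEGATED:
            est = self.__dict__.get("est_")
            if est is not None and hasattr(est, name):
                return getattr(est, name)
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}"
        )


clf = DelegatingClassifier(ConstantProbaModel()).fit([[0], [1], [2], [3]], [0, 1, 1, 1])
print(clf.predict_proba([[5]]))
print(hasattr(clf, "decision_function"))
```

Because the missing method still raises AttributeError, a scorer that falls back from decision_function to predict_proba keeps working, and the wrapper only advertises what the selected model can actually do.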

Dataset specific issues

%matplotlib inline
from sklearn.datasets import fetch_openml
from dabl.utils import data_df_from_bunch
from dabl import plot_supervised
# wine_quality
data = fetch_openml(data_id=287)
data = data_df_from_bunch(data)
plot_supervised(data, 'target')

Note: this is not the scikit-learn "wine" dataset

reasons might be: many classes, bad kde bandwidth? bad scatter size? Uninformative data? overplotting?

Cannot import dabl

I installed dabl like this:

$ cd /path/to/dabl-repo
$ pip3 install --user -e .

But when I import from ipython, it fails with:

In [ ]: import dabl
ImportError                               Traceback (most recent call last)
<ipython-input-4-6b072d28c65f> in <module>
----> 1 import dabl

~/build/data-an/dabl/dabl/__init__.py in <module>
      2 from .models import SimpleClassifier, SimpleRegressor
      3 from .plot.supervised import plot_supervised
----> 4 from .explain import explain
      5
      6 __all__ = ['EasyPreprocessor', 'SimpleClassifier', 'SimpleRegressor',

~/build/data-an/dabl/dabl/explain.py in <module>
      1 import numpy as np
      2
----> 3 from sklearn.tree import DecisionTreeClassifier, plot_tree
      4 from sklearn.ensemble import RandomForestClassifier
      5 from sklearn.pipeline import Pipeline

ImportError: cannot import name 'plot_tree'

I couldn't find a plot_tree function in sklearn.tree, are you using a development version?

Incompatibilities between DABL AnyClassifier and OpenML

Hi, I am running into issues when I try to run AnyClassifier on OpenML tasks.

So far I have encountered the following examples:

  1. Getting a value error only when I try to run on the openml task, not when I run on the same dataset locally.
task = openml.tasks.get_task(15)
clf = make_pipeline(dabl.models.AnyClassifier(force_exhaust_budget=False))
run = openml.runs.run_model_on_task(clf, task)
best classifier:  SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.03162277660168379,
    kernel='rbf', max_iter=-1, probability=False, random_state=1,
    shrinking=True, tol=0.001, verbose=False)
best score: 0.964
best classifier:  HistGradientBoostingClassifier(l2_regularization=0.0001, learning_rate=0.1,
                               loss='auto', max_bins=16, max_depth=7,
                               max_iter=200, max_leaf_nodes=4,
                               min_samples_leaf=4, n_iter_no_change=None,
                               random_state=7320, scoring=None, tol=1e-07,
                               validation_fraction=0.1, verbose=0)
best score: 0.952
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-d9533783e081> in <module>
      9 # run clf on the task
     10 print('Run clf on the task')
---> 11 run = openml.runs.run_model_on_task(clf, task)
     12 
     13 # print feedbackack

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_model_on_task(model, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow, return_flow)
    104         seed=seed,
    105         add_local_measures=add_local_measures,
--> 106         upload_flow=upload_flow,
    107     )
    108     if return_flow:

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_flow_on_task(flow, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow)
    220         task=task,
    221         extension=flow.extension,
--> 222         add_local_measures=add_local_measures,
    223     )
    224 

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in _run_task_get_arffcontent(flow, model, task, extension, add_local_measures)
    444             rep_no=rep_no,
    445             fold_no=fold_no,
--> 446             X_test=test_x,
    447         )
    448         if trace is not None:

/miniconda3/lib/python3.7/site-packages/openml/extensions/sklearn/extension.py in _run_model_on_fold(self, model, task, X_train, rep_no, fold_no, y_train, X_test)
   1356 
   1357             if isinstance(task, OpenMLSupervisedTask):
-> 1358                 model_copy.fit(X_train, y_train)
   1359             elif isinstance(task, OpenMLClusteringTask):
   1360                 model_copy.fit(X_train)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    354                                  self._log_message(len(self.steps) - 1)):
    355             if self._final_estimator != 'passthrough':
--> 356                 self._final_estimator.fit(Xt, y, **fit_params)
    357         return self
    358 

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/models.py in fit(self, X, y, target_col)
    351             scoring='recall_macro')
    352         self.search_ = gs
--> 353         gs.fit(X, y)
    354         self.est_ = gs.best_estimator_
    355 

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/search.py in fit(self, X, y, groups, **fit_params)
    132             groups=groups,
    133         )
--> 134         super().fit(X, y=y, groups=groups, **fit_params)
    135         # Set best_score_: BaseSearchCV does not set it, as refit is a callable
    136         self.best_score_ = (

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/_search.py in fit(self, X, y, groups, **fit_params)
    342                 return results
    343 
--> 344             self._run_search(evaluate_candidates, X, y, groups)
    345 
    346         # For multi-metric evaluation, store the best_index_, best_params_ and

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/search.py in _run_search(self, evaluate_candidates, X, y, groups)
    232                             'r_i': [r_i] * n_candidates}
    233             results = evaluate_candidates(candidate_params, X_iter, y_iter,
--> 234                                           groups, more_results=more_results)
    235 
    236             n_candidates_to_keep = ceil(n_candidates / self.ratio)

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/_search.py in evaluate_candidates(candidate_params, X, y, groups, more_results)
    316                                for parameters, (train, test)
    317                                in product(candidate_params,
--> 318                                           cv.split(X, y, groups)))
    319 
    320                 if len(out) < 1:

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    922                 self._iterating = self._original_iterator is not None
    923 
--> 924             while self.dispatch_one_batch(iterator):
    925                 pass
    926 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    552         fit_time = time.time() - start_time
    553         # _score will return dict if is_multimetric is True
--> 554         test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
    555         score_time = time.time() - start_time - fit_time
    556         if return_train_score:

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
    595     """
    596     if is_multimetric:
--> 597         return _multimetric_score(estimator, X_test, y_test, scorer)
    598     else:
    599         if y_test is None:

/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
    625             score = scorer(estimator, X_test)
    626         else:
--> 627             score = scorer(estimator, X_test, y_test)
    628 
    629         if hasattr(score, 'item'):

/miniconda3/lib/python3.7/site-packages/sklearn/metrics/scorer.py in __call__(self, estimator, X, y_true, sample_weight)
     88         """
     89 
---> 90         y_pred = estimator.predict(X)
     91         if sample_weight is not None:
     92             return self._sign * self._score_func(y_true, y_pred,

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    420         for _, name, transform in self._iter(with_final=False):
    421             Xt = transform.transform(Xt)
--> 422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 
    424     @if_delegate_has_method(delegate='_final_estimator')

/miniconda3/lib/python3.7/site-packages/sklearn/svm/base.py in predict(self, X)
    572             Class labels for samples in X.
    573         """
--> 574         y = super().predict(X)
    575         return self.classes_.take(np.asarray(y, dtype=np.intp))
    576 

/miniconda3/lib/python3.7/site-packages/sklearn/svm/base.py in predict(self, X)
    320         y_pred : array, shape (n_samples,)
    321         """
--> 322         X = self._validate_for_predict(X)
    323         predict = self._sparse_predict if self._sparse else self._dense_predict
    324         return predict(X)

/miniconda3/lib/python3.7/site-packages/sklearn/svm/base.py in _validate_for_predict(self, X)
    452 
    453         X = check_array(X, accept_sparse='csr', dtype=np.float64, order="C",
--> 454                         accept_large_sparse=False)
    455         if self._sparse and not sp.isspmatrix(X):
    456             X = sp.csr_matrix(X)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    540         if force_all_finite:
    541             _assert_all_finite(array,
--> 542                                allow_nan=force_all_finite == 'allow-nan')
    543 
    544     if ensure_min_samples > 0:

/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
     54                 not allow_nan and not np.isfinite(X).all()):
     55             type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56             raise ValueError(msg_err.format(type_err, X.dtype))
     57     # for object dtype data, we only check for NaNs (GH-13254)
     58     elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
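The check that raises here is the `np.isfinite(X).all()` test visible in the last frame. A minimal sketch of how one might find which columns of the transformed data carry the offending values, using made-up data rather than the actual OpenML task (the array `X` and the name `bad_columns` are assumptions for illustration):

```python
import numpy as np

# Made-up data for illustration; not taken from the OpenML task in the report.
X = np.array([
    [1.0, np.nan, 5.0],
    [2.0, 3.0, 6.0],
    [np.inf, 4.0, 7.0],
])

# Same kind of test that appears in the traceback: is every entry finite?
all_finite = bool(np.isfinite(X).all())

# Indices of columns holding at least one NaN or infinite entry.
bad_columns = np.flatnonzero(~np.isfinite(X).all(axis=0))

print(all_finite)
print(bad_columns.tolist())
```

Running the same two lines on the output of the preprocessing step, just before the SVM's `predict`, would show whether the NaNs come from the preprocessing or from the raw test fold.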
  1. Running into issues when features are filtered out with the near_constant_threshold.
import dabl
import openml
from sklearn.pipeline import make_pipeline

task = openml.tasks.get_task(3)
clf = make_pipeline(dabl.models.AnyClassifier(force_exhaust_budget=False))
run = openml.runs.run_model_on_task(clf, task)
/Users/hp2500/Google Drive/STUDY/Columbia/Research/dabl/dabl/preprocessing.py:255: UserWarning: Discarding near-constant features: [2, 13, 15, 16, 18, 24, 27, 28, 29]
  near_constant.index[near_constant].tolist()))
best classifier:  HistGradientBoostingClassifier(l2_regularization=1e-06, learning_rate=0.1,
                               loss='auto', max_bins=128, max_depth=12,
                               max_iter=300, max_leaf_nodes=4,
                               min_samples_leaf=3, n_iter_no_change=None,
                               random_state=28019, scoring=None, tol=1e-07,
                               validation_fraction=0.2, verbose=0)
best score: 0.959
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-2-75c2ea531ac0> in <module>
      9 # run clf on the task
     10 print('Run clf on the task')
---> 11 run = openml.runs.run_model_on_task(clf, task)
     12 
     13 # print feedbackack

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_model_on_task(model, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow, return_flow)
    104         seed=seed,
    105         add_local_measures=add_local_measures,
--> 106         upload_flow=upload_flow,
    107     )
    108     if return_flow:

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_flow_on_task(flow, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow)
    220         task=task,
    221         extension=flow.extension,
--> 222         add_local_measures=add_local_measures,
    223     )
    224 

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in _run_task_get_arffcontent(flow, model, task, extension, add_local_measures)
    444             rep_no=rep_no,
    445             fold_no=fold_no,
--> 446             X_test=test_x,
    447         )
    448         if trace is not None:

/miniconda3/lib/python3.7/site-packages/openml/extensions/sklearn/extension.py in _run_model_on_fold(self, model, task, X_train, rep_no, fold_no, y_train, X_test)
   1393         # it returns the clusters
   1394         if isinstance(task, OpenMLSupervisedTask):
-> 1395             pred_y = model_copy.predict(X_test)
   1396         elif isinstance(task, OpenMLClusteringTask):
   1397             pred_y = model_copy.predict(X_train)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    420         for _, name, transform in self._iter(with_final=False):
    421             Xt = transform.transform(Xt)
--> 422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 
    424     @if_delegate_has_method(delegate='_final_estimator')

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/models.py in predict(self, X)
    300         check_is_fitted(self, 'est_')
    301         if getattr(self, 'classes_', None) is not None:
--> 302             return self.classes_[self.est_.predict(X)]
    303 
    304         return self.est_.predict(X)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    419         Xt = X
    420         for _, name, transform in self._iter(with_final=False):
--> 421             Xt = transform.transform(Xt)
    422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/preprocessing.py in transform(self, X)
    550         # Check is fit had been called
    551         check_is_fitted(self, ['ct_'])
--> 552         return self.ct_.transform(X)

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
    510 
    511         X = _check_X(X)
--> 512         Xs = self._fit_transform(X, None, _transform_one, fitted=True)
    513         self._validate_output(Xs)
    514 

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
    410                     message=self._log_message(name, idx, len(transformers)))
    411                 for idx, (name, trans, column, weight) in enumerate(
--> 412                         self._iter(fitted=fitted, replace_strings=True), 1))
    413         except ValueError as e:
    414             if "Expected 2D array, got 1D array instead" in str(e):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    919             # remaining jobs.
    920             self._iterating = False
--> 921             if self.dispatch_one_batch(iterator):
    922                 self._iterating = self._original_iterator is not None
    923 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    752             tasks = BatchedCalls(itertools.islice(iterator, batch_size),
    753                                  self._backend.get_nested_backend(),
--> 754                                  self._pickle_cache)
    755             if len(tasks) == 0:
    756                 # No more tasks available in the iterator: tell caller to stop.

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __init__(self, iterator_slice, backend_and_jobs, pickle_cache)
    208 
    209     def __init__(self, iterator_slice, backend_and_jobs, pickle_cache=None):
--> 210         self.items = list(iterator_slice)
    211         self._size = len(self.items)
    212         if isinstance(backend_and_jobs, tuple):

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in <genexpr>(.0)
    409                     message_clsname='ColumnTransformer',
    410                     message=self._log_message(name, idx, len(transformers)))
--> 411                 for idx, (name, trans, column, weight) in enumerate(
    412                         self._iter(fitted=fitted, replace_strings=True), 1))
    413         except ValueError as e:

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _get_column(X, key)
    636         else:
    637             # numpy arrays, sparse arrays
--> 638             return X[:, key]
    639 
    640 

IndexError: boolean index did not match indexed array along dimension 1; dimension is 36 but corresponding boolean dimension is 27
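The numbers in this message line up with the warning above: 9 near-constant features were discarded, and 36 - 9 = 27. That is consistent with a boolean column mask built for the reduced 27-column data being applied to the raw 36-column test array, although the report does not confirm this. A minimal sketch that reproduces the same kind of failure with plain NumPy (the array shapes are taken from the message, everything else is an assumption):

```python
import numpy as np

# Raw test data with all 36 columns (contents are irrelevant here).
X_test = np.zeros((5, 36))

# Hypothetical boolean column mask built after 9 columns were dropped,
# so it only has 27 entries.
mask = np.zeros(27, dtype=bool)
mask[:3] = True

try:
    X_test[:, mask]
    message = None
except IndexError as exc:
    # NumPy refuses a boolean mask whose length differs from the axis length.
    message = str(exc)

print(message)
```

The exact wording of the message differs between NumPy versions, but the cause is the same: the mask and the indexed axis must have equal length.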
  1. After getting rid of the variance threshold I am getting...
import dabl
import openml
from sklearn.pipeline import make_pipeline

task = openml.tasks.get_task(3)
clf = make_pipeline(dabl.models.AnyClassifier(force_exhaust_budget=False))
run = openml.runs.run_model_on_task(clf, task)
/miniconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py:565: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask &= (ar1 != a)
/miniconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py:569: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-75c2ea531ac0> in <module>
      9 # run clf on the task
     10 print('Run clf on the task')
---> 11 run = openml.runs.run_model_on_task(clf, task)
     12 
     13 # print feedbackack

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_model_on_task(model, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow, return_flow)
    104         seed=seed,
    105         add_local_measures=add_local_measures,
--> 106         upload_flow=upload_flow,
    107     )
    108     if return_flow:

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_flow_on_task(flow, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow)
    220         task=task,
    221         extension=flow.extension,
--> 222         add_local_measures=add_local_measures,
    223     )
    224 

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in _run_task_get_arffcontent(flow, model, task, extension, add_local_measures)
    444             rep_no=rep_no,
    445             fold_no=fold_no,
--> 446             X_test=test_x,
    447         )
    448         if trace is not None:

/miniconda3/lib/python3.7/site-packages/openml/extensions/sklearn/extension.py in _run_model_on_fold(self, model, task, X_train, rep_no, fold_no, y_train, X_test)
   1393         # it returns the clusters
   1394         if isinstance(task, OpenMLSupervisedTask):
-> 1395             pred_y = model_copy.predict(X_test)
   1396         elif isinstance(task, OpenMLClusteringTask):
   1397             pred_y = model_copy.predict(X_train)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    420         for _, name, transform in self._iter(with_final=False):
    421             Xt = transform.transform(Xt)
--> 422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 
    424     @if_delegate_has_method(delegate='_final_estimator')

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/models.py in predict(self, X)
    300         check_is_fitted(self, 'est_')
    301         if getattr(self, 'classes_', None) is not None:
--> 302             return self.classes_[self.est_.predict(X)]
    303 
    304         return self.est_.predict(X)

/miniconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    419         Xt = X
    420         for _, name, transform in self._iter(with_final=False):
--> 421             Xt = transform.transform(Xt)
    422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 

~/Google Drive/STUDY/Columbia/Research/dabl/dabl/preprocessing.py in transform(self, X)
    550         # Check is fit had been called
    551         check_is_fitted(self, ['ct_'])
--> 552         return self.ct_.transform(X)

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
    510 
    511         X = _check_X(X)
--> 512         Xs = self._fit_transform(X, None, _transform_one, fitted=True)
    513         self._validate_output(Xs)
    514 

/miniconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
    410                     message=self._log_message(name, idx, len(transformers)))
    411                 for idx, (name, trans, column, weight) in enumerate(
--> 412                         self._iter(fitted=fitted, replace_strings=True), 1))
    413         except ValueError as e:
    414             if "Expected 2D array, got 1D array instead" in str(e):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    919             # remaining jobs.
    920             self._iterating = False
--> 921             if self.dispatch_one_batch(iterator):
    922                 self._iterating = self._original_iterator is not None
    923 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

/miniconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _transform_one(transformer, X, y, weight, **fit_params)
    693 
    694 def _transform_one(transformer, X, y, weight, **fit_params):
--> 695     res = transformer.transform(X)
    696     # if we have a weight for this transformer, multiply output
    697     if weight is None:

/miniconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _transform(self, X)
    538         Xt = X
    539         for _, _, transform in self._iter():
--> 540             Xt = transform.transform(Xt)
    541         return Xt
    542 

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    730                                        copy=True)
    731         else:
--> 732             return self._transform_new(X)
    733 
    734     def inverse_transform(self, X):

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
    678         """New implementation assuming categorical input"""
    679         # validation of X happens in _check_X called by _transform
--> 680         X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
    681 
    682         n_samples, n_features = X_int.shape

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    135 
    136                     Xi[~valid_mask] = self.categories_[i][0]
--> 137             _, encoded = _encode(Xi, self.categories_[i], encode=True)
    138             X_int[:, i] = encoded
    139 

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py in _encode(values, uniques, encode)
    108         return res
    109     else:
--> 110         return _encode_numpy(values, uniques, encode)
    111 
    112 

/miniconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py in _encode_numpy(values, uniques, encode)
     47         if diff:
     48             raise ValueError("y contains previously unseen labels: %s"
---> 49                              % str(diff))
     50         encoded = np.searchsorted(uniques, values)
     51         return uniques, encoded

ValueError: y contains previously unseen labels: [0.0]
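The last frame shows why the encoder checks for unseen values before calling `np.searchsorted`: without the check, a value that was never seen during fit would silently be mapped to the index of some known category. A simplified NumPy sketch of that logic, not scikit-learn's actual implementation (the category values are made up, with `0.0` playing the role of the unseen label from the message):

```python
import numpy as np

# Categories learned during fit (made up for illustration).
uniques = np.array([1.0, 2.0])

# Values seen at predict time; 0.0 was never seen during fit.
values = np.array([1.0, 0.0])

# Values present at predict time but absent from the fitted categories.
diff = np.setdiff1d(values, uniques)

# Without the check, searchsorted maps 0.0 to index 0, the slot of 1.0.
encoded = np.searchsorted(uniques, values)

print(diff.tolist())
print(encoded.tolist())
```

Here both `1.0` and the unseen `0.0` end up at index 0, which is the silent mis-encoding the `ValueError` guards against.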
