scikit-multiflow / scikit-multiflow

A machine learning package for streaming data in Python. The other ancestor of River.

Home Page: https://scikit-multiflow.github.io/

License: BSD 3-Clause "New" or "Revised" License

Python 98.69% C++ 0.63% Makefile 0.02% Dockerfile 0.17% Jupyter Notebook 0.33% Shell 0.15%
machine-learning meka moa scikit scikit-learn stream streaming-data

scikit-multiflow's Introduction


scikit-multiflow is a machine learning package for streaming data in Python.


creme and scikit-multiflow are merging into a new project called River.

We feel that both projects share the same vision, and we believe that pooling our resources instead of duplicating work will benefit both communities. With more people working on the new project, we can distribute work more efficiently, take on more features, and improve the overall quality of the project.

Both projects will stop active development. Their code will remain publicly available, although work will be limited to minor maintenance during a transition period. The architecture of the new package is very similar to that of creme and will focus on single-instance incremental models.

We encourage users to adopt River instead of creme and scikit-multiflow. We understand that this transition will require an extra effort from current users in the short term. However, we believe that the result will be better for everyone in the long run.

You will still be able to install and use creme as well as scikit-multiflow. Both projects will remain on PyPI, conda-forge and GitHub.



Features

Incremental Learning

Stream learning models are created incrementally and are updated continuously. They are suitable for big data applications where real-time response is vital.

Adaptive learning

Changes in the data distribution harm learning. Adaptive methods are specifically designed to be robust to concept drift in dynamic environments.

Resource-wise efficient

Streaming techniques efficiently handle resources such as memory and processing time given the unbounded nature of data streams.

Easy to use

scikit-multiflow is designed for users with any experience level. Experiments are easy to design, set up, and run. Existing methods are easy to modify and extend.

Stream learning tools

In its current state, scikit-multiflow contains data generators, multi-output/multi-target stream learning methods, change detection methods, evaluation methods, and more.

Open source

Distributed under the BSD 3-Clause License, scikit-multiflow is developed and maintained by an active, diverse, and growing community.

Use cases

The following tasks are supported in scikit-multiflow:

Supervised learning

When working with labeled data. Depending on the target type, the task can be either classification (discrete values) or regression (continuous values).

Single/multi output

Single-output methods predict a single target-label (binary or multi-class) for classification or a single target-value for regression. Multi-output methods simultaneously predict multiple variables given an input.

Concept drift detection

Changes in data distribution can harm learning. Drift detection methods are designed to raise an alarm in the presence of drift and are used alongside learning methods to improve their robustness against this phenomenon in evolving data streams.

Unsupervised learning

When working with unlabeled data. For example, anomaly detection where the goal is the identification of rare events or samples which differ significantly from the majority of the data.


Jupyter Notebooks

In order to display plots from scikit-multiflow within a Jupyter Notebook, we need to define the proper matplotlib backend to use. This is done by including the following magic command at the beginning of the Notebook:

%matplotlib notebook
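
With that backend active, a complete example cell might look like the following sketch (import paths vary slightly between releases, and older releases also require calling stream.prepare_for_use()):

from skmultiflow.data import WaveformGenerator
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.trees import HoeffdingTree

stream = WaveformGenerator()
# stream.prepare_for_use()  # required in older releases only

ht = HoeffdingTree()
evaluator = EvaluatePrequential(show_plot=True, pretrain_size=200, max_samples=20000)

# With the notebook backend active, the evaluation plot renders inline
# and updates as samples are processed.
evaluator.evaluate(stream=stream, model=ht)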

JupyterLab is the next-generation user interface for Jupyter. It is currently in beta and can display interactive plots, with some caveats. If you use JupyterLab, the current solution is to use the jupyter-matplotlib extension:

%matplotlib widget

Citing scikit-multiflow

If scikit-multiflow has been useful for your research and you would like to cite it in an academic publication, please use the following BibTeX entry:

@article{skmultiflow,
  author  = {Jacob Montiel and Jesse Read and Albert Bifet and Talel Abdessalem},
  title   = {Scikit-Multiflow: A Multi-output Streaming Framework},
  journal = {Journal of Machine Learning Research},
  year    = {2018},
  volume  = {19},
  number  = {72},
  pages   = {1-5},
  url     = {http://jmlr.org/papers/v19/18-251.html}
}

scikit-multiflow's People

Contributors

abifet, albandecrevoisier, alex891, ameliachui, andrefcruz, anhquancao, bgulowaty, darkmyter, fabriciojoc, foxriver76, fwille, garawalid, gilbertoolimpio, guimatsumoto, imrnsalm, ingako, jacobmontiel, jiahy0825, jmread, jmrozanec, krifimedamine, lckr, lfleck, luccaportes, mertozer94, minhhuong, payam-ebadi, pgijsbers, smastelini, yupbank


scikit-multiflow's Issues

ConceptDriftStream Logging

I'm playing around with the Concept Drift Stream Generator with the following code:

from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential
from skmultiflow.trees.hoeffding_adaptive_tree import HAT
from skmultiflow.data.concept_drift_stream import ConceptDriftStream
from skmultiflow.data import AGRAWALGenerator

"""1. Create stream"""
stream = ConceptDriftStream(stream_option=AGRAWALGenerator(random_state=112), 
                            drift_stream_option=AGRAWALGenerator(random_state=112, classification_function=2),
                            random_state=None,
                            alpha_option=0.0, # angle of change grade 0 - 90
                            position_option=250000,
                            width_option=10000)

stream.prepare_for_use()

"""2. Create classifier"""
clf = HAT(split_criterion='info_gain')

"""3. Setup evaluator"""
evaluator = EvaluatePrequential(show_plot=True,
                                pretrain_size=1,
                                max_samples=1000000,
                                metrics=['performance', 'kappa_t', 'kappa_m', 'kappa'],
                                output_file=None)

"""4. Run evaluator"""
evaluator.evaluate(stream=stream, model=clf, model_names=['HAT']) 

When running the code, it is very slow due to the enormous amount of logging output, for example:

root - INFO - -50000.0%
root - INFO - -50050.0%
root - INFO - -50100.0%
root - INFO - -50150.0%
[... many more lines in the same pattern, the value changing by 50.0 each time ...]
root - INFO - -51700.0%
root - INFO - -51750.0%

I also don't know what these negative percentages mean.
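
As a workaround sketch (not an official option), the progress messages come from Python's root logger at INFO level, so raising its threshold before running the evaluator silences them:

import logging

# Silence the evaluator's INFO progress messages; warnings and errors still show.
logging.getLogger().setLevel(logging.WARNING)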

Remove __author__ tags

Using the author tag at the beginning of files is a legacy practice and is no longer necessary given that git tracks this information.

Additionally, we can create a contributors file for explicit reference.

Stream of samples with delayed responses

This is just a question. Suppose my samples are produced as a stream, but I cannot know the response until after a certain delay (during which new samples are produced and need to be predicted). In the simple examples in the docs, it looks like the model is evaluated right after it makes its prediction. Is there a way to somehow delay the evaluation, allowing other samples to be predicted in the meantime?
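
There is no built-in option for this that I know of, but a delayed-label evaluation can be simulated by hand with a buffer: predict each sample as it arrives, and only call partial_fit once its label becomes available. A minimal sketch, where the import paths and the DELAY value are assumptions, and older releases also need stream.prepare_for_use():

from collections import deque

from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTree

DELAY = 100  # labels become available 100 samples after the prediction
stream = SEAGenerator()
model = HoeffdingTree()
buffer = deque()  # samples waiting for their (delayed) labels

# Pre-train on a single sample so the first predictions are well defined
X, y = stream.next_sample()
model.partial_fit(X, y, classes=[0, 1])

n_correct = n_total = 0
for _ in range(10000):
    X, y = stream.next_sample()
    y_pred = model.predict(X)            # predict immediately
    n_correct += int(y_pred[0] == y[0])  # scored offline; in production the label is unknown here
    n_total += 1
    buffer.append((X, y))
    if len(buffer) > DELAY:              # the oldest label has now "arrived"
        X_old, y_old = buffer.popleft()
        model.partial_fit(X_old, y_old)

print('Delayed-label accuracy:', n_correct / n_total)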

Hoeffding Tree stops too early with "Can not normalize, normalization factor is zero"

At first: Thanks for fixing the Naive Bayes issue that fast. ;)

When I use the Hoeffding Tree, it stops long before max_samples is reached, with the following info in the log:

Can not normalize, normalization factor is zero

The used Code looks like this:

from skmultiflow.data.generators.waveform_generator import WaveformGenerator
from skmultiflow.data.generators.sea_generator import SEAGenerator
from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential
from skmultiflow.classification.trees.hoeffding_tree import HoeffdingTree

stream = WaveformGenerator()
#stream = SEAGenerator()

stream.prepare_for_use() 

clf = HoeffdingTree()

evaluator = EvaluatePrequential(show_plot=True, 
                                pretrain_size=250,
                                max_samples=50000,
                                metrics=['performance', 'kappa']) 

evaluator.evaluate(stream=stream, model=clf)

Result is:

root - INFO - Prequential Evaluation
root - INFO - Evaluating 1 target(s).
root - INFO - Pre-training on 250 samples.
root - INFO - Evaluating...
root - INFO - 5.0%
root - INFO - 10.0%
root - INFO - 15.0%
root - INFO - Evaluation time: 6.437 s
root - INFO - Total samples: 8201
root - INFO - Global performance:
root - INFO - Model 0 - Accuracy : 0.797
root - INFO - Model 0 - Kappa : 0.696
Can not normalize, normalization factor is zero

Also when I use the other stream generator (SEAGenerator), it stops at about 3.4k samples.

Leveraging Bagging with HoeffdingTree

LeveragingBagging does not work with HoeffdingTree; see for example _test_prequential_bagging.py. No error message is given:

root - INFO - Prequential Evaluation
root - INFO - Generating 2 targets.
root - INFO - Pre-training on 2000 samples.
root - INFO - Evaluating...

root - INFO - Evaluation time: 0.000 s
root - INFO - Total instances: 2000
root - INFO - Global performance:
root - INFO - Learner 0 - Accuracy     : 0.000
root - INFO - Learner 0 - Kappa        : 0.000
root - INFO - Learner 0 - Kappa T      : 0.000

(It works with KNN without problems, however.)

Memory accounting might not take into account the objects pointed to by variables

Greetings,

I think I found an issue in the memory costs calculation of the Hoeffding Tree-based algorithms.

Using sys.getsizeof alone on a class instance, list, dictionary, and so on does not account for the actual sizes of the objects contained within the structure, but only the pointers to them. For example:

import sys

sys.getsizeof([]) # returns 64
sys.getsizeof(1) # returns 28

sys.getsizeof([1]) # returns 72
sys.getsizeof([1, 1]) # returns 80

This simple example shows that Python calculates the byte size only considering the pointers to the actual referenced objects.

This kind of issue is discussed in https://code.tutsplus.com/tutorials/understand-how-much-memory-your-python-objects-use--cms-25609.

An elegant solution is presented in https://goshippo.com/blog/measure-real-size-any-python-object/.

Applications which need to measure the memory costs of a solution would benefit from such a strategy.
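
A sketch of the recursive approach described in those articles (not the exact code from either link): sum sys.getsizeof over the object and everything it references, while guarding against counting shared objects twice.

import sys

def deep_getsizeof(obj, seen=None):
    # Recursively accumulate the size of an object and the objects it references.
    if seen is None:
        seen = set()
    if id(obj) in seen:          # do not double-count shared or cyclic references
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    elif hasattr(obj, '__dict__'):   # include the attributes of class instances
        size += deep_getsizeof(vars(obj), seen)
    return size

deep_getsizeof([1, 1])  # also counts the referenced int object, not just the pointers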

Concept Drift Stream Alpha Value > 45.0

The documentation says the alpha value can be between 0 and 90, but values greater than 45 throw the following error:

  File "/home/moritz/anaconda3/lib/python3.6/site-packages/skmultiflow/evaluation/evaluate_prequential.py", line 206, in evaluate
    self.model = self._train_and_test()

  File "/home/moritz/anaconda3/lib/python3.6/site-packages/skmultiflow/evaluation/evaluate_prequential.py", line 241, in _train_and_test
    X, y = self.stream.next_sample(self.pretrain_size)

  File "/home/moritz/anaconda3/lib/python3.6/site-packages/skmultiflow/data/concept_drift_stream.py", line 127, in next_sample
    x = -4.0 * float(self.sample_idx - self.position_option) / float(self.width_option)
ZeroDivisionError: float division by zero

Used code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 13 08:52:32 2018

@author: moritz
"""

from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential
from skmultiflow.trees.hoeffding_adaptive_tree import HAT
from skmultiflow.data.concept_drift_stream import ConceptDriftStream
from skmultiflow.data import AGRAWALGenerator

"""1. Create stream"""
stream = ConceptDriftStream(stream_option=AGRAWALGenerator(random_state=112), 
                            drift_stream_option=AGRAWALGenerator(random_state=112, classification_function=2),
                            random_state=None,
                            alpha_option=45.1, # or any value greater 45.0
                            position_option=250000,
                            width_option=1)

stream.prepare_for_use()

"""2. Create classifier"""
clf = HAT(split_criterion='info_gain')

"""3. Setup evaluator"""
evaluator = EvaluatePrequential(show_plot=True,
                                pretrain_size=1,
                                max_samples=1000000,
                                metrics=['performance', 'kappa_t', 'kappa_m', 'kappa'],
                                output_file=None)

"""4. Run evaluator"""
evaluator.evaluate(stream=stream, model=clf, model_names=['HAT'])

Error with show_plot=True

TypeError: legend() missing 1 required positional argument: 'labels'
For example, _test_comparison_prequential.py or _test_file_stream.py gives this error.

Predictions shall be returned as numpy.array instead of lists

Expected behaviour

As discussed in #39, predictions from predict() and predict_proba() shall be returned as numpy.array instead of lists.

This is for consistency with scikit-learn; numpy arrays are also the data type many users expect.

Actual behaviour

Currently, we are returning predictions as lists.

Note: A fix is to cast predictions before returning them:

# current
return predictions
# suggested
return numpy.array(predictions)

This change is to be propagated to all models for consistency.

HoeffdingTree returns prediction as list

Expected behaviour

All predictions are returned as Numpy arrays

Actual behaviour

HoeffdingTree returns lists

Returns
-------
list
Predicted labels for all instances in X.

Returns
-------
list
Predicted the probabilities of all the labels for all instances in X.

Just a comment:
The HoeffdingTree classifier runs extremely slowly on my machine (about 8 s for a batch of my data vs. < 1 s for all sklearn classifiers I tested, including an MLP with 100 hidden nodes). Would it be possible to improve the runtime performance somehow?

Consistent naming for Attribute Observers for tree methods

The number of attribute observers is increasing and there is no clear naming convention.

A proposal is to rename the observers as follows:
HoeffdingNominalAttributeClassObserver --> NominalAttributeRegressionObserver
HoeffdingNumericAttributeClassObserver --> NumericAttributeRegressionObserver
HoeffdingMultiOutputTNumericAttributeObserver --> NumericAttributeRegressionObserverMultiTarget
GaussianNumericAttributeClassObserver --> NumericAttributeClassObserverGaussian
BinaryTreeNumericAttributeClassObserver --> NumericAttributeClassObserverBinaryTree

This change will affect the name of the corresponding files.

Error on Naive Bayes: Can't instantiate abstract class NaiveBayes with abstract methods reset

Hi,

when I try to use the NaiveBayes, I receive the following error:

Can't instantiate abstract class NaiveBayes with abstract methods reset

The code looks like this:

from skmultiflow.data.generators.waveform_generator import WaveformGenerator
from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential
from skmultiflow.classification.naive_bayes import NaiveBayes

stream = WaveformGenerator()

stream.prepare_for_use() 

clf = NaiveBayes()

evaluator = EvaluatePrequential(show_plot=True, 
                                pretrain_size=500,
                                max_samples=50000,
                                metrics=['performance', 'kappa']) 

evaluator.evaluate(stream=stream, model=clf) 

It would be great if you could explain why this isn't working.

Thank you.

Should be model neutral

The code should be neutral and refer to a 'model' rather than a 'classifier' in evaluation code, comments, and throughout, except in very specific classifier-only code. (Models can be classifiers, regressors, unsupervised, reinforcement, etc.)

Use a buffer to share performance metrics between evaluator and visualizer

Expected behaviour

A common buffer shall be used to store performance metrics. The evaluator should write/update the buffer, while the visualizer should read the data from it. Additionally, such a buffer could be used to write the log file if requested.

Evaluator -----> Buffer ----> Visualizer / File

Actual behaviour

Currently, two buffers are used, one in the evaluator and one in the visualizer.

More user friendly handling of evaluator measurements

Wouldn't it be better to print the results at the end with 4 decimals, like in the plot window? If I set show_plot=False, I only get 3 decimals printed at the end of the evaluation, because in the code it is written:

if 'performance' in self.metrics:
    logging.info('{} - Accuracy     : {:.3f}'.format(
        self.model_names[i], self.global_classification_metrics[i].get_performance()))

I also think it would be user friendly to have a method like evaluator.get_measurements which returns Kappa, Accuracy, and so on for all models. Currently I have to use

evaluator.global_classification_metrics[0].get_performance()

for getting accuracy of the first model etc.

SEA generator with concept drift

Expected behaviour

We want to have a stream generator with concept drift

Actual behaviour

The current SEA generator has a generate_drift function to account for this case, but it has to be called manually between next_sample calls.

Steps to reproduce the behaviour

sea.next_sample()
sea.generate_drift()
sea.next_sample()
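
Until the generator supports this natively, a thin wrapper can trigger the drift automatically at a chosen sample index. A minimal sketch, where drift_position is an arbitrary example value and the import path follows the version used elsewhere on this page:

from skmultiflow.data.generators.sea_generator import SEAGenerator

class SEAWithDrift:
    # Wrap SEAGenerator and call generate_drift() once a given sample index is reached.
    def __init__(self, drift_position=5000):
        self.stream = SEAGenerator()
        self.drift_position = drift_position
        self.sample_idx = 0
        self.drifted = False

    def next_sample(self, batch_size=1):
        if not self.drifted and self.sample_idx >= self.drift_position:
            self.stream.generate_drift()
            self.drifted = True
        self.sample_idx += batch_size
        return self.stream.next_sample(batch_size)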

Add predict_proba to Hoeffding Tree

Expected behaviour

predict_proba shall return an array with the probability for each class.

Example: For a binary classification problem, the corresponding leaf node contains {0:5, 1:10}

Step 1. Normalize the dictionary
{0:5, 1:10} --> {0:1/3, 1:2/3}

Step 2: Convert the dictionary to an array where the index corresponds to the class:
{0:5, 1:10} --> [1/3, 2/3]

Warning: We need to account for cases where there are missing classes (array indices) in the dictionary:
{0:5, 2:10} --> [1/3, 0, 2/3]
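
A small sketch of that conversion, including the missing-class case:

import numpy as np

def class_counts_to_proba(counts, n_classes):
    # Map a leaf's {class: count} dictionary to a probability array indexed by class.
    votes = np.zeros(n_classes)
    for class_label, count in counts.items():
        votes[class_label] = count
    total = votes.sum()
    return votes / total if total > 0 else votes

class_counts_to_proba({0: 5, 2: 10}, n_classes=3)  # -> array([0.333..., 0., 0.666...])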

Additionally, we need to consider the generic approach to be used across scikit-multiflow. Suggestion:

import operator

def predict(self, X):
    ...
    y_proba = self.predict_proba(X)
    # Get the index (class) with the highest probability
    index, _ = max(enumerate(y_proba), key=operator.itemgetter(1))
    return index

Actual behaviour

Not implemented yet

ConceptDriftStream generator

Request to add a ConceptDriftStream generator, as in MOA, to scikit-multiflow.

Example:

WriteStreamToARFFFile -s (ConceptDriftStream -s generators.AgrawalGenerator -d (ConceptDriftStream -s (generators.AgrawalGenerator -f 2) -d (ConceptDriftStream -s generators.AgrawalGenerator -d (generators.AgrawalGenerator -f 4) -p 25000 -w 10000) -p 25000 -w 1) -p 25000 -w 10000) -f example4.arff -m 100000

Multiple concept drifts can be simulated with nesting, and each drift can have its position and width manually adjusted. In this case, three shifts at 25000, 50000, 75000 with widths 10000, 1, 10000, using AGRAWAL generators with different classification functions.

Plotting rolling mean yields a graph like:

[screenshot: rolling mean accuracy plot showing the three concept drifts]
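
For reference, the nesting above could look roughly as follows once the generator exists in scikit-multiflow. This is only a sketch; the parameter names (stream, drift_stream, position, width) follow later releases, while the snapshot discussed elsewhere on this page used stream_option and similar names:

from skmultiflow.data import AGRAWALGenerator
from skmultiflow.data.concept_drift_stream import ConceptDriftStream

# Innermost drift: AGRAWAL (default function) -> AGRAWAL (function 4), width 10000
inner = ConceptDriftStream(stream=AGRAWALGenerator(),
                           drift_stream=AGRAWALGenerator(classification_function=4),
                           position=25000, width=10000)

# Middle drift: AGRAWAL (function 2) -> inner, abrupt change (width 1)
middle = ConceptDriftStream(stream=AGRAWALGenerator(classification_function=2),
                            drift_stream=inner,
                            position=25000, width=1)

# Outer drift: AGRAWAL (default function) -> middle, width 10000
stream = ConceptDriftStream(stream=AGRAWALGenerator(),
                            drift_stream=middle,
                            position=25000, width=10000)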

[Question] Is there a difference to calling `fit` from `partial_fit` for first batch

I am sorry for this simple question, but I could not find the answer in the documentation.

Imagine a setting where you get data in batches and you want to do batch updates to your model.
For the very first batch, you train a model from scratch.
Does it matter if I call fit or partial_fit, in both cases just providing the first batch of data?
Does the resulting model differ and/or take more/less time to compute?

Clean-up Hoeffding Adaptive Tree

A working version of HAT is ready. Minor changes are required to standardize HAT for consistency with the framework, including:

  • Clean-up
  • Documentation
  • Demo/Test file
  • ....

A new StreamEvaluator that supports multiple Stream-StreamModels

I created a very similar project internally at our organization before I became aware of your work, and we thought it'd be a good idea to work with the community instead of duplicating effort.

The primary architectural difference is that there is a type of StreamEvaluator that is more like a pipeline (a FlowingPipeline, if you will), in that it supports a list of Stream-StreamModel pairs. Each Stream after the first one receives the transformed output of the previous StreamModel.

Would you be open to the addition of this feature? If so, I can make a feature branch and create the relevant changes and associated documentation for a more detailed review.
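
To make the proposal concrete, here is a purely illustrative sketch of what such an evaluator could look like; all names and signatures are hypothetical, not existing scikit-multiflow API:

class FlowingPipeline:
    # Hypothetical evaluator holding a list of (stream, model) pairs; each stage
    # after the first receives the previous model's output alongside its stream.
    def __init__(self, pairs):
        self.pairs = pairs

    def run(self, n_samples=1000):
        for _ in range(n_samples):
            carried = None
            for stream, model in self.pairs:
                X, y = stream.next_sample()
                if carried is not None:
                    X = carried              # feed the transformed output downstream
                model.partial_fit(X, y)
                carried = model.predict(X)   # output passed to the next stage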

Prequential and Holdout Evaluators refactoring

At the moment, task_type in evaluate_prequential.py can be one of 'classification', 'regression' or 'multi_output'. This specification is probably not necessary: the task type can be inferred from the evaluation metrics specified (currently: plot_options). Furthermore, multi-output tasks can be of either a regression or a classification nature.
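
A sketch of the kind of inference being proposed; the metric names are illustrative and may not match the evaluator's exact strings:

CLASSIFICATION_METRICS = {'performance', 'kappa', 'kappa_t', 'kappa_m'}
REGRESSION_METRICS = {'mean_square_error', 'mean_absolute_error'}

def infer_task_type(metrics):
    # Derive the task type from the requested metrics instead of a task_type argument.
    if any(m in REGRESSION_METRICS for m in metrics):
        return 'regression'
    if any(m in CLASSIFICATION_METRICS for m in metrics):
        return 'classification'
    raise ValueError('Cannot infer task type from metrics: {}'.format(metrics))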

Issue with reading dataset as pd.DataFrame

Hi, I am trying to download datasets as pd.DataFrame and convert them to a DataStream afterwards, but the streams seem to be empty and the prepare_for_use function is not working, giving the error 'numpy.ndarray' object has no attribute 'columns'. Here is my code, in which I can see the data right up until the DataStream call.

for i in range(6):
    dataset = oml.datasets.get_dataset(idy[i][0])
    X, y, attribute_names = dataset.get_data(
        target=dataset.default_target_attribute,
        return_attribute_names=True,)
    data = pd.DataFrame(X, columns=attribute_names)
    data['class'] = y
    stream = DataStream(data=data)
    list_of_streams.append(stream)

Bilge
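
A possible workaround sketch, assuming DataStream also accepts separate X and y arrays in the installed version: skip the DataFrame and pass the numpy arrays directly.

import numpy as np
from skmultiflow.data.data_stream import DataStream

X = np.random.rand(100, 5)           # placeholder features
y = np.random.randint(2, size=100)   # placeholder labels; may need .reshape(-1, 1) in some versions

stream = DataStream(data=X, y=y)
stream.prepare_for_use()             # required in older releases only
X_batch, y_batch = stream.next_sample(10)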

Concept Drift Stream Optimisation

Current Implementation

    def next_sample(self, batch_size=1):
        self.current_sample_x = None
        self.current_sample_y = None

        for j in range(batch_size):
            self.sample_idx += 1
            x = -4.0 * float(self.sample_idx - self.position) / float(self.width)
            probability_drift = 1.0 / (1.0 + np.exp(x))
            if self.random_state.rand() > probability_drift:
                if self.current_sample_x is None:
                    self.current_sample_x, self.current_sample_y = self._input_stream.next_sample()
                else:
                    X, y = self._input_stream.next_sample()
                    self.current_sample_x = np.append(self.current_sample_x, X, axis=0)
                    self.current_sample_y = np.append(self.current_sample_y, y, axis=0)
            else:
                if self.current_sample_x is None:
                    self.current_sample_x, self.current_sample_y = self._drift_stream.next_sample()
                else:
                    X, y = self._drift_stream.next_sample()
                    self.current_sample_x = np.append(self.current_sample_x, X, axis=0)
                    self.current_sample_y = np.append(self.current_sample_y, y, axis=0)

        return self.current_sample_x, self.current_sample_y

Suggested Implementation

    def next_sample(self, batch_size=1):
        self.current_sample_x = np.zeros((batch_size, self.n_features))
        self.current_sample_y = np.zeros((batch_size, self.n_targets))

        for j in range(batch_size):
            self.sample_idx += 1
            x = -4.0 * float(self.sample_idx - self.position) / float(self.width)
            probability_drift = 1.0 / (1.0 + np.exp(x))
            if self.random_state.rand() > probability_drift:
                X, y = self._input_stream.next_sample()
            else:
                X, y = self._drift_stream.next_sample()
            self.current_sample_x[j, :] = X
            self.current_sample_y[j, :] = y

        return self.current_sample_x, self.current_sample_y.flatten()

When nesting multiple ConceptDriftStream instances and computing 100000 instances, next_sample would bottleneck the code due to repeated calls to np.append.

By initialising current_sample_x and current_sample_y and assigning values instead, we were able to achieve a speed up of over 20x.

Should `predict_proba` be implemented for AdaptiveRandomForest?

I would like to use an AdaptiveRandomForest model and predict class probabilities with it.
To this end, I made a predict_proba implementation for it. Is this something that is desired to be added to the main repository?
If so, I will add documentation/tests and create a PR.

HoeffdingTree predict_proba returns different dimensions per sample

Expected behaviour

output_dim = num_samples x num_classes

Actual behaviour

The output is an array of lists of different lengths.

Steps to reproduce the behaviour

test files are attached
data.zip

import numpy as np
from skmultiflow.trees.hoeffding_tree import HoeffdingTree

data = np.load("test_data.npy")
labels = np.load("test_labels.npy")

model = HoeffdingTree()
model.partial_fit(data, labels)
pred_proba = model.predict_proba(data)

>> [list([1.0]), ... , list([0, 1.0]), ... , list([1.0]) , ...]
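
Until this is fixed, a user-side workaround sketch: pad the ragged rows into a fixed (n_samples, n_classes) array, assuming the missing trailing entries correspond to classes the tree has not seen (probability 0).

import numpy as np

def pad_proba(proba_rows, n_classes):
    # Turn a list of variable-length probability lists into a dense 2-D array.
    out = np.zeros((len(proba_rows), n_classes))
    for i, row in enumerate(proba_rows):
        out[i, :len(row)] = row
    return out

pad_proba(pred_proba, n_classes=2).shape  # -> (n_samples, 2)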

Models are not fully scikit-learn compatible

In order to follow scikit-learn's philosophy, wouldn't it make sense to derive your base model

class StreamModel(BaseObject, metaclass=ABCMeta):

from the sklearn base class for estimators http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html ?

Since get_params and set_params are not implemented in scikit-multiflow, useful sklearn utilities like http://scikit-learn.org/stable/modules/generated/sklearn.base.clone.html cannot be applied to the scikit-multiflow models.
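
A sketch of the suggested change, assuming the abstract interface otherwise stays as it is: deriving from sklearn's BaseEstimator provides get_params/set_params for free, as long as every constructor argument is stored under the same attribute name, which in turn makes sklearn.base.clone work.

from abc import ABCMeta, abstractmethod
from sklearn.base import BaseEstimator

class StreamModel(BaseEstimator, metaclass=ABCMeta):
    # get_params/set_params are inherited from BaseEstimator; subclasses must
    # keep their __init__ arguments as attributes with identical names.

    @abstractmethod
    def partial_fit(self, X, y, classes=None):
        raise NotImplementedError

    @abstractmethod
    def predict(self, X):
        raise NotImplementedError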

LED Generator (Drift) doesn't work

I just experienced the following behaviour:

LED Generator + parent class doesn't seem to work.

Used Code:

from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential
from skmultiflow.data.led_generator_drift import LEDGeneratorDrift
from skmultiflow.trees.hoeffding_adaptive_tree import HAT

"""1. Create stream"""
stream = LEDGeneratorDrift(has_noise=False, noise_percentage=0.0, n_drift_features=0)

stream.prepare_for_use()

"""2. Create classifier"""
clf = HAT(split_criterion='info_gain')

"""3. Setup evaluator"""
evaluator = EvaluatePrequential(show_plot=False,
                                pretrain_size=1,
                                max_samples=100000,
                                metrics=['performance', 'kappa_t', 'kappa_m', 'kappa'],
                                output_file=None)

"""4. Run evaluator"""
evaluator.evaluate(stream=stream, model=clf, model_names=['HAT']) 

Shown error message:

  File "/home/moritz/anaconda3/lib/python3.6/site-packages/skmultiflow/evaluation/evaluate_prequential.py", line 199, in evaluate
    if self._check_configuration():

  File "/home/moritz/anaconda3/lib/python3.6/site-packages/skmultiflow/evaluation/base_evaluator.py", line 239, in _check_configuration
    raise ValueError('Unexpected number of outputs in stream: {}.'.format(self.stream.n_targets))

ValueError: Unexpected number of outputs in stream: 0.

best regards

Hoeffding Tree classifier with more than one million samples

Training a classification Hoeffding Tree with large data streams results, after 999999 samples, in this error: AttributeError: 'int' object has no attribute 'calc_byte_size_including_subtree'

The error is seen with HyperplaneGenerator, SineGenerator, and RandomTreeGenerator.

Adaptive Hoeffding Tree 'gini' -- 'InactiveLearningNode' object has no attribute 'filter_instance_to_leaves'

Used code:

from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential
from skmultiflow.data.hyper_plane_generator import HyperplaneGenerator
from skmultiflow.trees.hoeffding_adaptive_tree import HAT

"""1. Create stream"""
stream = HyperplaneGenerator(mag_change=0.001, noise_percentage=0.1)

stream.prepare_for_use()

"""2. Create classifier"""
clf = HAT(split_criterion='gini')

"""3. Setup evaluator"""
evaluator = EvaluatePrequential(show_plot=False,
                                pretrain_size=1,
                                max_samples=1000000,
                                metrics=['performance', 'kappa_t', 'kappa_m', 'kappa'],
                                output_file=None)

"""4. Run evaluator"""
evaluator.evaluate(stream=stream, model=clf, model_names=['HAT'])

It stops the classification after a number of instances with the following error:

root - INFO - Prequential Evaluation
root - INFO - Evaluating 1 target(s).
root - INFO - Pre-training on 1 samples.
root - INFO - Evaluating...
root - INFO - Evaluation time: 10.584 s
root - INFO - Total samples: 22444
root - INFO - Global performance:
root - INFO - HAT - Accuracy     : 0.844
root - INFO - HAT - Kappa        : 0.688
root - INFO - HAT - Kappa T      : 0.688
root - INFO - HAT - Kappa M      : 0.692
'InactiveLearningNode' object has no attribute 'filter_instance_to_leaves'

best regards

Include running time as evaluation metric

The running time should be included as an evaluation metric. It is already being logged for the overall process, but it would be nice to get a plot over time. It could also be interesting to split it into training time / testing time.

'NoneType' object is not iterable

This uninformative error comes from LeveragingBagging, under, for example, covtype.csv. See _test_file_stream_multiple_cfier.py; the error is quite subtle, in fact it is not even explicitly raised as an error:

root - INFO - Prequential Evaluation
root - INFO - Generating 7 targets.
root - INFO - Pre-training on 100 samples.
root - INFO - Evaluating...
'NoneType' object is not iterable
root - INFO - Evaluation time: 0.000 s
root - INFO - Total instances: 100
root - INFO - Global performance:
root - INFO - Learner 0 - Accuracy     : 0.000

Meta models work with generated streams but give error for file streams

Hi, I use the same meta-model code with a generated stream and it gives predictions, but for file streams I receive an error. Prequential evaluation gives a 'NoneType' object is not iterable error. When I code the evaluation myself, I see that the predictions for the CSV file data streams are None, so 'corrects' is never increased and performance is 0.0.
The same file streams work with other classification methods. Attached is one of the CSV files I receive this error for.

CovPokElec.zip

Add more stream generators

We want to increase the number of available stream generators:

  • Agrawal
  • HyperPlane
  • LED
  • LED + drift
  • Mixed
  • Sine
  • STAGGER
  • Dataset to stream

Nice to have:
API support for external services (e.g. Twitter, AWS, TensorFlow)

EvaluatePrequential Plot does not show up if batch_size != 1

Expected behaviour

The EvaluatePrequential plot can be generated independently of the batch size.

Actual behaviour

For batch_size != 1, the plot only shows up after the last evaluation and no metrics are plotted

Steps to reproduce the behaviour

%matplotlib notebook
from skmultiflow.data.generators.waveform_generator import WaveformGenerator
from skmultiflow.classification.trees.hoeffding_tree import HoeffdingTree
from skmultiflow.data.file_stream import FileStream
from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential

stream = WaveformGenerator()
stream.prepare_for_use()
ht = HoeffdingTree()
evaluator = EvaluatePrequential(show_plot=True,
                                pretrain_size=200,
                                max_samples=20000, batch_size=2)
evaluator.evaluate(stream=stream, model=ht)

(Jupyter notebook server 5.5.0, matplotlib 2.2.2)

Resetting the stream

At the moment, calling evaluate from EvaluatePrequential multiple times on the same stream/classifier will give unintuitive results (for example, empty output in the CSV). The stream should either be reset automatically, raise an error, or be documented carefully to avoid confusion.
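
A workaround sketch in the meantime, assuming the stream exposes restart() (as the generators and FileStream do): reset the stream and use a fresh model before evaluating again.

stream.restart()        # rewind the stream to its initial state
clf = HoeffdingTree()   # re-instantiate to drop the previously learned state
evaluator.evaluate(stream=stream, model=clf, model_names=['HT'])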

Nominal attribute class observer: Iterating over class instead of attribute values

for val_idx in self._att_val_dist_per_class.keys():

In the nominal attribute class observer (L49), isn't it supposed to iterate over the attribute's possible values and not the class values, since val_idx will be compared to the attribute values in L74?

If that's the case, a list could be used to keep the possible values seen so far, and it would be iterated over instead of self._att_val_dist_per_class.keys().

# Solution:

# keep the attribute values seen so far (initialized in __init__)
self._att_values = []

# when a new value is observed
if att_val not in self._att_values:
    self._att_values.append(att_val)

# iterate over the observed attribute values instead of the class values
for val_idx in self._att_values:

OzaBaggingAdwin eval classifier

Hi everyone,

I have a question: is it possible to use a specific learner (e.g. KNN() or HoeffdingTree()) inside OzaBaggingAdwin or OzaBagging to evaluate the classification performance?

I tried this short code, but it doesn't work.

[screenshot: code using OzaBaggingAdwin with a KNN learner]

The error that occurs is this:

[screenshot: error traceback]

While using HoeffdingTree() as a classifier:

[screenshot: results with HoeffdingTree() as the base classifier]
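
Passing a base learner should work along these lines. This is only a sketch: the constructor argument names (base_estimator, n_estimators) follow more recent releases, and older versions may use different names.

from skmultiflow.data import SEAGenerator
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.lazy import KNN
from skmultiflow.meta import OzaBaggingAdwin

stream = SEAGenerator()
stream.prepare_for_use()  # older releases only

# Ensemble of 10 KNN models, each monitored by ADWIN
clf = OzaBaggingAdwin(base_estimator=KNN(), n_estimators=10)

evaluator = EvaluatePrequential(pretrain_size=200, max_samples=20000,
                                metrics=['accuracy', 'kappa'])  # 'performance' in older versions
evaluator.evaluate(stream=stream, model=clf)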

Better output of CSV files

When using multiple classifiers, the results in the output CSV file are difficult to interpret (for example:
_test_file_stream_multiple_cfier.py). Either the description in the comments should be easier to parse (in relation to the numbers: global_performance_1, global_performance_2, etc.), or a key for each classifier should be contained in the column labels, or some other approach should make it easy to interpret the CSV results without having to check the code that generated them.
