ikki407 / stacking
Stacked Generalization (Ensemble Learning)
License: MIT License
Currently, load_data() handles the train and test feature lists in the same way (related: #2), and the target is included in the train dataset.
FEATURE_LIST_stage1 = {
    'train': (INPUT_PATH + 'train.csv',
              FEATURES_PATH + 'train_log.csv',
              ),  # target is in 'train'
    'test': (INPUT_PATH + 'test.csv',
             FEATURES_PATH + 'test_log.csv',
             ),
}
However, in stage-2 stacking, the following feature lists cause an error when the target is passed only in the train dataset. The error occurs because the train and test lists have different lengths.
FEATURE_LIST_stage2 = {
    'train': (INPUT_PATH + 'target.csv',
              TEMP_PATH + 'v1_stage1_all_fold.csv',
              TEMP_PATH + 'v2_stage1_all_fold.csv',
              TEMP_PATH + 'v3_stage1_all_fold.csv',
              TEMP_PATH + 'v4_stage1_all_fold.csv',
              TEMP_PATH + 'v5_stage1_all_fold.csv',
              TEMP_PATH + 'v6_stage1_all_fold.csv',
              ),  # target is in 'train'
    'test': (
             TEMP_PATH + 'v1_stage1_test.csv',
             TEMP_PATH + 'v2_stage1_test.csv',
             TEMP_PATH + 'v3_stage1_test.csv',
             TEMP_PATH + 'v4_stage1_test.csv',
             TEMP_PATH + 'v5_stage1_test.csv',
             TEMP_PATH + 'v6_stage1_test.csv',
             ),
}
This bug can be fixed by using the length of each list when reading. But for a more general library, a new key in the feature dictionary, i.e., 'target', should be introduced, like:
FEATURE_LIST_stage2 = {
    'train': (
        TEMP_PATH + 'v1_stage1_all_fold.csv',
        TEMP_PATH + 'v2_stage1_all_fold.csv',
        TEMP_PATH + 'v3_stage1_all_fold.csv',
        ),  # target is not in 'train'
    'target': (
        INPUT_PATH + 'target.csv',
        ),  # target is in 'target'
    'test': (
        TEMP_PATH + 'v1_stage1_test.csv',
        TEMP_PATH + 'v2_stage1_test.csv',
        TEMP_PATH + 'v3_stage1_test.csv',
        ),
}
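A minimal sketch of how a loader could consume the proposed 'target' key (the load_data signature here is hypothetical, not the library's current one):

```python
import os
import tempfile
import pandas as pd

def load_data(feature_list):
    # Concatenate the feature files column-wise per split; the target comes
    # from its own key, so 'train' and 'test' may hold different numbers of
    # files without any special-casing.
    train = pd.concat([pd.read_csv(p) for p in feature_list['train']], axis=1)
    test = pd.concat([pd.read_csv(p) for p in feature_list['test']], axis=1)
    target = pd.read_csv(feature_list['target'][0]).iloc[:, 0]
    return train, test, target

# Tiny demo with temporary files
tmp = tempfile.mkdtemp()
pd.DataFrame({'f1': [1, 2, 3]}).to_csv(os.path.join(tmp, 'v1_train.csv'), index=False)
pd.DataFrame({'f2': [4, 5, 6]}).to_csv(os.path.join(tmp, 'v2_train.csv'), index=False)
pd.DataFrame({'f1': [7, 8]}).to_csv(os.path.join(tmp, 'v1_test.csv'), index=False)
pd.DataFrame({'target': [0, 1, 0]}).to_csv(os.path.join(tmp, 'target.csv'), index=False)

FEATURE_LIST = {
    'train': (os.path.join(tmp, 'v1_train.csv'), os.path.join(tmp, 'v2_train.csv')),
    'target': (os.path.join(tmp, 'target.csv'),),
    'test': (os.path.join(tmp, 'v1_test.csv'),),
}
train, test, target = load_data(FEATURE_LIST)
```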
This is very reasonable.
The outputs of this library are hard to interpret when many models are trained together, so information about each trained model should be logged. For example:
Level 1:
Model1 train-loss 0.5123±0.10 validation-loss 0.6345±0.15
Model2 train-loss 0.4324±0.20 validation-loss 0.5456±0.08
Model3 train-loss 0.5963±0.14 validation-loss 0.6501±0.05
...
Level 2:
...
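A sketch of how such a summary could be assembled from per-fold losses (summarize and its input layout are hypothetical names, not the library's API):

```python
import statistics

def summarize(level, fold_losses):
    # fold_losses maps model name -> {'train': [loss per fold], 'valid': [...]}
    lines = ['Level %d:' % level]
    for name, losses in fold_losses.items():
        t_m, t_s = statistics.mean(losses['train']), statistics.pstdev(losses['train'])
        v_m, v_s = statistics.mean(losses['valid']), statistics.pstdev(losses['valid'])
        lines.append('%s train-loss %.4f±%.2f validation-loss %.4f±%.2f'
                     % (name, t_m, t_s, v_m, v_s))
    return '\n'.join(lines)

report = summarize(1, {
    'Model1': {'train': [0.41, 0.62, 0.51], 'valid': [0.48, 0.79, 0.63]},
    'Model2': {'train': [0.40, 0.46, 0.44], 'valid': [0.52, 0.58, 0.54]},
})
print(report)
```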
Refactor hard-coded parts for stability, in preparation for version upgrades of libraries such as sklearn, Keras, and XGBoost. Specifically, the initialization of the BaseModel class.
This library is still not very simple to use. Documentation should be created.
In the Keras-implemented neural net, to avoid recompiling, the initial weights after compilation are saved and reused at the start of each training run in cross-validation. However, this means the initial weights are identical across all folds, so they should be changed for each training run. A possible solution is to pass the compilation arguments (e.g., optimizer, loss, and metrics) and compile a fresh model per fold.
In binary_class.py:
class ModelV2(BaseModel):
    def build_model(self):
        model = Sequential()
        model.add(Dense(64, input_shape=nn_input_dim_NN, init='he_normal'))
        model.add(LeakyReLU(alpha=0.00001))
        model.add(Dropout(0.5))
        model.add(Dense(output_dim, init='he_normal'))
        model.add(Activation('softmax'))
        sgd = SGD(lr=0.1, decay=1e-5, momentum=0.9, nesterov=True)
        compile_options = {
            'optimizer': sgd,
            'loss': 'categorical_crossentropy',
            'metrics': ['accuracy'],
        }
        return KerasClassifier(nn=model, compile_options=compile_options, **self.params)
In base.py:
import copy

class KerasClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, nn, compile_options):
        self.nn = nn
        self.compile_options = compile_options

    def fit(self, X, y, X_test=None, y_test=None):
        # Compile a fresh copy per fold so the initial weights differ each time
        self.compiled_nn = copy.copy(self.nn)
        self.compiled_nn.compile(**self.compile_options)
        return self.compiled_nn.fit(X, y)
But this approach increases memory consumption...
A multi-class classification task is needed. Only binary classification and regression tasks are available now.
"About this library" and "Get started"
Need to add tests for stacking, bagging, and more. Stacking tests are done, but bagging tests are not. Bagging tests should be easy, because they only require sub-sampling the training data and training models on it. Several options are under consideration (e.g., a subsample argument at model definition (subsample=0.8), models with different parameters, models with different random states, ...). Once this is done, this package will be a more general ensemble-learning library.
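A rough sketch of such a bagging test, subsampling the training data with a different random state per model (the function name and the choice of DecisionTreeClassifier are illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X, y, X_test, n_models=5, subsample=0.8, seed=0):
    # Each model sees a different random subsample of the training data;
    # the final prediction is the average of the per-model probabilities.
    rng = np.random.RandomState(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=int(subsample * len(X)), replace=False)
        model = DecisionTreeClassifier(random_state=rng.randint(10 ** 6))
        model.fit(X[idx], y[idx])
        preds.append(model.predict_proba(X_test)[:, 1])
    return np.mean(preds, axis=0)

# Tiny separable demo: low feature values -> class 0, high -> class 1
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
p = bagging_predict(X, y, np.array([[0.5], [8.5]]))
```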
Should task-dependent functions be changed to virtual functions? Users would then need to define such functions themselves (e.g., CV-fold index creation, save_pred_as_submit_format, ...).
Hi,
I found a few things you may not have noticed:
1. First of all, `pip install stacking` does not work on macOS because the xgboost package cannot be installed automatically there, so xgboost needs to be installed manually first.
2. In "examples/multi_class/scripts/multiclass.py", line 25 is wrong; it should be "from keras.regularizers import l1, l2". A small mistake, but it makes `bash run.sh` fail.
3. 'No module named paratext' is another problem. The solution is to clone it from https://github.com/wiseio/paratext, and don't forget to install swig with `brew install swig` if Homebrew is installed.
That's it; now I can run the EXAMPLE. I suggest documenting the required environment in README.md, just in case.
BTW, the project is wonderful!
In the BaseModel class, scikit-learn models can be used, but that approach should be reconsidered.
How should the target column be defined in a feature set? Currently, the target column must be included in the train dataset under the column name 'target'.
train:
[2,3,4,5,6,7,8,9]
[0,1,4,5,6,7,8,9]
[0,1,2,3,6,7,8,9]
[0,1,2,3,4,5,8,9]
[0,1,2,3,4,5,6,7]
test:
[0,1]
[2,3]
[4,5]
[6,7]
[8,9]
fold assignment:
[0,0,1,1,2,2,3,3,4,4]
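The index lists above correspond to a plain 5-fold split of 10 samples, which can be reproduced with scikit-learn's KFold (modern sklearn.model_selection API shown; the library itself targets an older sklearn):

```python
import numpy as np
from sklearn.model_selection import KFold

# Unshuffled 5-fold split of 10 samples: test folds are [0,1], [2,3], ...
folds = list(KFold(n_splits=5).split(np.arange(10)))
for train_idx, test_idx in folds:
    print(train_idx.tolist(), test_idx.tolist())

# Per-sample fold assignment, i.e. which test fold each sample falls into
assignment = np.empty(10, dtype=int)
for i, (_, test_idx) in enumerate(folds):
    assignment[test_idx] = i
```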
But the current BaseModel uses the previous version's architecture; if the current format is used, the data is converted back to the previous format. So this should be changed to use the original format directly. Also, .ix should be changed to .iloc for stable behavior. In addition, the global CV-fold file name needs to be updated whenever a new CV-fold file is created.
Training and test data are currently loaded at the beginning of model building. However, this is not time-efficient when the data is very large. So, if the data does not change throughout the stacking script, it should be cached somewhere in base.py.
STORED_X = None

def load_data(stored=True):
    global STORED_X
    if STORED_X is not None:
        return STORED_X
    ...
    if stored:
        STORED_X = X.copy()
        return STORED_X
    return X
All models are currently described in base.py. For more simplicity, each model wrapper (XGBoost, Keras, VW, ...) should live in its own module, i.e., base_XGB.py, base_keras.py, etc. Then, in base.py, we use them like:
import base_XGB

class XGBClassifier(base_XGB.XGBClassifier):
    def __init__(self, params={}, num_round=50):
        base_XGB.XGBClassifier.__init__(self, params=params, num_round=num_round)
Currently, the only supported data format is .csv. More formats are desirable, for example .npy, .zip, .gz, and more.
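A possible dispatch-by-extension reader (read_any is a hypothetical helper, not part of the library; pandas already infers .gz/.zip compression from the file name):

```python
import os
import tempfile
import numpy as np
import pandas as pd

def read_any(path):
    # .npy goes through NumPy; everything else (.csv, .csv.gz, .csv.zip)
    # goes through pandas, which infers compression from the extension.
    if os.path.splitext(path)[1] == '.npy':
        return pd.DataFrame(np.load(path))
    return pd.read_csv(path)

# Demo with temporary files in both formats
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, 'feat.npy'), np.zeros((3, 2)))
pd.DataFrame({'a': [1, 2]}).to_csv(os.path.join(tmp, 'feat.csv.gz'), index=False)

df_npy = read_any(os.path.join(tmp, 'feat.npy'))
df_gz = read_any(os.path.join(tmp, 'feat.csv.gz'))
```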
Need to design how to save predictions in submission format (def save_pred_as_submit_format()).
Currently, paths are built with a slash, like:
PATH = "foo/"
PATH + 'hoge'
but it seems better to use os.path.join, like:
import os
PATH = "foo"
os.path.join(PATH, 'hoge')
Currently, this library supports Python 2 only. Python 3 support is needed to enhance usability.
Creating templates for stacking & bagging seems useful. They should be created under examples/.
BaseModel should hold the CV evaluation scores after training:
m.run()
m.score
Fold1: 0.234
Fold2: 0.323
Fold3: 0.233
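One way to sketch this (class and attribute names are hypothetical, not the library's actual API):

```python
class ScoredModel:
    def run(self, fold_scores):
        # Keep per-fold CV scores on the instance so they can be read after run()
        self.score = {'Fold%d' % (i + 1): s for i, s in enumerate(fold_scores)}
        self.score_mean = sum(fold_scores) / len(fold_scores)
        return self

m = ScoredModel().run([0.234, 0.323, 0.233])
print(m.score)
```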
Many meaningless comments currently included in base.py should be removed.
Rename base_fixed_fold.py to base.py, and then fix the test scripts.
In stacking, the out-of-fold data is not passed to models as validation data (XGBoost and NN); that is, validation scores are only calculated after model training is done. It would be convenient to check the validation score every epoch, so the out-of-fold data should also be passed during model training.
RuntimeError: maximum recursion depth exceeded
I am getting this error while running the first model in stage 1. How can I overcome this?
A comparison between stacking and other techniques should be carried out.
A README for each example is needed for easier understanding.
Need to carry out the following tests. First, create datasets (train & test) for the above problems. Next, define several models. Finally, do stacking. The current tests (binary, multi-class, and regression) are under test/; specific tests should be placed under examples/ or docs/examples/.
Currently, the data directories (i.e., data/input, data/output, ...) are created on importing stacking if they do not exist. But with this naive implementation, users may create directories unintentionally. It would be useful to let users choose whether to create them on the first import, like:
Can new directories for input and output data be created? [Y/n]
Y(N) <---- input
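A sketch of such an opt-in setup (the function name and prompt wording are illustrative; a real implementation would run once at first import):

```python
import os
import tempfile

def maybe_make_dirs(dirs, interactive=True):
    # Ask before creating the data directories; answering 'n' skips creation.
    if interactive:
        answer = input('Can new directories for input and output data be created? [Y/n] ')
        if answer.strip().lower() == 'n':
            return False
    for d in dirs:
        if not os.path.isdir(d):
            os.makedirs(d)
    return True

# Non-interactive demo in a temporary directory
tmp = tempfile.mkdtemp()
created = maybe_make_dirs([os.path.join(tmp, 'data', 'input')], interactive=False)
```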