stacking's Issues

load_data() bug

The current load_data() assumes the train and test feature lists have the same length (related #2), and the target is included in the train dataset.

    FEATURE_LIST_stage1 = {
                'train':(INPUT_PATH + 'train.csv',
                         FEATURES_PATH + 'train_log.csv',
                        ),#target is in 'train'
                'test':(INPUT_PATH + 'test.csv',
                        FEATURES_PATH + 'test_log.csv',
                        ),
                }

However, in level-2 stacking, the following configuration causes an error when the target file is passed only in the train list, because the train and test lists then differ in length.

    FEATURE_LIST_stage2 = {
                'train':(INPUT_PATH + 'target.csv',
                         TEMP_PATH + 'v1_stage1_all_fold.csv',
                         TEMP_PATH + 'v2_stage1_all_fold.csv',
                         TEMP_PATH + 'v3_stage1_all_fold.csv',
                         TEMP_PATH + 'v4_stage1_all_fold.csv',
                         TEMP_PATH + 'v5_stage1_all_fold.csv',
                         TEMP_PATH + 'v6_stage1_all_fold.csv',
                        ),#target is in 'train'
                'test':(
                         TEMP_PATH + 'v1_stage1_test.csv',
                         TEMP_PATH + 'v2_stage1_test.csv',
                         TEMP_PATH + 'v3_stage1_test.csv',
                         TEMP_PATH + 'v4_stage1_test.csv',
                         TEMP_PATH + 'v5_stage1_test.csv',
                         TEMP_PATH + 'v6_stage1_test.csv',
                        ),
                }

This bug can be fixed by reading each list with its own length. But for a more general library, a new key of the feature dictionary, i.e., target, should be introduced:

    FEATURE_LIST_stage2 = {
                'train':(
                         TEMP_PATH + 'v1_stage1_all_fold.csv',
                         TEMP_PATH + 'v2_stage1_all_fold.csv',
                         TEMP_PATH + 'v3_stage1_all_fold.csv',
                        ),#target is not in 'train'

                'target':(
                         INPUT_PATH + 'target.csv',
                        ),#target is in 'target'

                'test':(
                         TEMP_PATH + 'v1_stage1_test.csv',
                         TEMP_PATH + 'v2_stage1_test.csv',
                         TEMP_PATH + 'v3_stage1_test.csv',
                        ),
                }

This seems a reasonable design.
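
A minimal sketch of how load_data() could then read each key independently (hypothetical; it assumes every entry is a CSV read with pandas, which may differ from the library's actual reader):

    import pandas as pd

    def load_data(feature_list):
        data = {}
        for key, paths in feature_list.items():
            # Concatenate this key's feature files column-wise;
            # each key may hold a different number of files.
            data[key] = pd.concat([pd.read_csv(p) for p in paths], axis=1)
        return data['train'], data['target'], data['test']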

logging issue

The outputs of this library are hard to interpret when many models are trained together, so the information of each trained model should be logged for easy inspection.

Like:

Level 1:

Model1 train-loss 0.5123±0.10 validation-loss 0.6345±0.15
Model2 train-loss 0.4324±0.20 validation-loss 0.5456±0.08
Model3 train-loss 0.5963±0.14 validation-loss 0.6501±0.05
...

Level 2:
...
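
A minimal sketch of such a summary (hypothetical helper, not the library's current API), computing mean and standard deviation over the per-fold losses:

    import numpy as np

    def log_model_scores(name, train_losses, valid_losses):
        # One line per model: mean +/- std of the per-fold losses.
        print('%s train-loss %.4f+-%.2f validation-loss %.4f+-%.2f' % (
            name,
            np.mean(train_losses), np.std(train_losses),
            np.mean(valid_losses), np.std(valid_losses)))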

Version update stability

Refactor the hard-coded parts for stability in preparation for version upgrades of libraries such as scikit-learn, Keras, and XGBoost.
Specifically, the initialization of the BaseModel class.

Reinitialize weights of neural net

In the Keras-implemented neural net, to avoid recompiling, the initial weights are saved right after compilation and restored at the beginning of each training run in cross-validation. However, this makes the initial weights identical across all folds, so the initial weights should be re-randomized for each fold's training.

A possible solution is to pass the compilation arguments (e.g., optimizer, loss, and metrics) into the wrapper.
In binary_class.py,

class ModelV2(BaseModel):
    def build_model(self):
        model = Sequential()
        model.add(Dense(64, input_shape=nn_input_dim_NN, init='he_normal'))
        model.add(LeakyReLU(alpha=.00001))
        model.add(Dropout(0.5))

        model.add(Dense(output_dim, init='he_normal'))
        model.add(Activation('softmax'))
        sgd = SGD(lr=0.1, decay=1e-5, momentum=0.9, nesterov=True)

        compile_options = {
            'optimizer': sgd,
            'loss': 'categorical_crossentropy',
            'metrics': ['accuracy'],
        }

        return KerasClassifier(nn=model, compile_options=compile_options, **self.params)

In base.py,

import copy

from sklearn.base import BaseEstimator, ClassifierMixin

class KerasClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, nn, compile_options={}, **params):
        self.nn = nn
        self.compile_options = compile_options
        self.params = params  # training options (e.g., nb_epoch, batch_size)

    def fit(self, X, y, X_test=None, y_test=None):
        # Copy the uncompiled model and compile the copy for this fold.
        self.compiled_nn = copy.copy(self.nn)
        self.compiled_nn.compile(**self.compile_options)

        return self.compiled_nn.fit(X, y, **self.params)

But this approach increases memory consumption...
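
A hedged alternative: newer Keras versions (2.0.7+, possibly newer than this repo targets) provide keras.models.clone_model, which rebuilds the architecture with freshly initialized weights, so each fold can start from a different random initialization without keeping a compiled copy around:

    # Assumes Keras >= 2.0.7; clone_model returns a new model with
    # newly initialized (random) weights for each CV fold.
    from keras.models import clone_model

    def fresh_compiled_model(template, compile_options):
        model = clone_model(template)      # fresh random weights
        model.compile(**compile_options)   # compile the fresh copy
        return model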

Multi-classification task

A multi-class classification task is needed.

Only binary classification and regression tasks are available now.

For a more general package (stacking, bagging, and more)

Some tests of stacking, bagging, and more need to be shown.

Stacking tests are done, but bagging tests are not.

Bagging tests are easy because they only require sub-sampling the training data and training models on those samples.

Several options can be considered (i.e., a subsample argument at model definition (subsample=0.8), models with different parameters, models with different random states, ...).

Once this is done, this package will be a more general ensemble-learning library.
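
A minimal sketch of what a bagging helper could look like (hypothetical; build_model and averaging over numpy arrays are assumptions, not the library's API):

    import numpy as np

    def bagging_predict(build_model, X, y, X_test,
                        n_models=5, subsample=0.8, seed=407):
        # Train each model on a row-subsample of the training data
        # and average the predictions over all bags.
        rng = np.random.RandomState(seed)
        n = len(X)
        preds = []
        for _ in range(n_models):
            idx = rng.choice(n, size=int(n * subsample), replace=False)
            model = build_model()      # fresh model per bag
            model.fit(X[idx], y[idx])  # train on the subsample
            preds.append(model.predict(X_test))
        return np.mean(preds, axis=0)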

Task-dependent functions

Should task-dependent functions be changed to virtual functions?

Users would then need to define such functions themselves (e.g., CV-fold index creation, save_pred_as_submit_format, ...).
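
A minimal sketch of the virtual-function approach (hypothetical; Python 2 style abc, with method names chosen for illustration):

    from abc import ABCMeta, abstractmethod

    class BaseModel(object):
        # Python 2 style, matching the library's current Python support.
        __metaclass__ = ABCMeta

        @abstractmethod
        def make_cv_index(self, n_samples):
            """Return the CV-fold assignment; task-dependent."""

        @abstractmethod
        def save_pred_as_submit_format(self, pred, path):
            """Write predictions in the task's submission format."""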

About EXAMPLE

Hi,
I found a few things you may not have noticed:
1. First of all, 'pip install stacking' does not work on macOS because the xgboost package cannot be installed automatically there, so xgboost needs to be installed manually first.
2. In "examples/multi_class/scripts/multiclass.py", line 25 is wrong; it should be "from keras.regularizers import l1, l2". A small mistake, but it breaks 'bash run.sh'.
3. 'No module named paratext' is another problem. The solution is to clone https://github.com/wiseio/paratext, and don't forget to install swig with 'brew install swig' if Homebrew is installed.

That's it; now I can run the EXAMPLE. I suggest describing the environment in README.md, just in case.
BTW, the project is wonderful!

scikit-learn models

In the BaseModel class, scikit-learn models can be used, but that approach should be reconsidered.

How to create CV-fold index

  • Previous version of the CV-fold file: using explicit index lists.

train:

[2,3,4,5,6,7,8,9]
[0,1,4,5,6,7,8,9]
[0,1,2,3,6,7,8,9]
[0,1,2,3,4,5,8,9]
[0,1,2,3,4,5,6,7]

test:

[0,1]
[2,3]
[4,5]
[6,7]
[8,9]

  • Current version of the CV-fold file (better than the previous one): using fold IDs.

[0,0,1,1,2,2,3,3,4,4]

But the current BaseModel still uses the previous-version architecture: when the current format is supplied, it is converted back to the previous format. So BaseModel needs to be changed to consume the new format directly.
Also, .ix should be changed to .iloc for stable behavior.

The global CV-fold file name also needs to be replaced with the new CV-fold file name whenever a new CV-fold file is created.
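
A minimal sketch of consuming the fold-ID format directly (hypothetical; uses .iloc as suggested above):

    import numpy as np

    # Fold IDs in the current format, one ID per training row.
    fold_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

    for k in np.unique(fold_ids):
        test_idx = np.where(fold_ids == k)[0]
        train_idx = np.where(fold_ids != k)[0]
        # Select rows with .iloc (not .ix) for stable behavior, e.g.:
        # X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]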

Large data loading problem

Training and test data are currently loaded at the beginning of every model build.

However, this is not time-efficient when the data is very large.

So, if the data does not change throughout the stacking script, it should be cached somewhere in base.py:

STORED_X = None  # module-level cache

def load_data(stored=True):
    global STORED_X
    if STORED_X is not None:
        return STORED_X

    X = ...  # read and merge the feature files as before

    if stored:
        STORED_X = X.copy()
        return STORED_X
    else:
        return X

For a simpler model wrapper class

All models are currently described in base.py.

For simplicity, each model wrapper (XGBoost, Keras, VW, ...) should live in its own module, i.e., base_XGB.py, base_keras.py.

Then, base.py would use them like:

import base_XGB

class XGBClassifier(base_XGB.XGBClassifier):
    def __init__(self, params={}, num_round=50):
        base_XGB.XGBClassifier.__init__(self, params=params, num_round=num_round)

Saving format

Need to design how to save predictions in submission format (def save_pred_as_submit_format()).
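
A minimal sketch of one possible design (hypothetical; the id and target column names are assumptions):

    import pandas as pd

    def save_pred_as_submit_format(pred, ids, path,
                                   id_col='id', target_col='target'):
        # Kaggle-style submission: one id column, one prediction column.
        df = pd.DataFrame({id_col: ids, target_col: pred})
        df.to_csv(path, index=False, columns=[id_col, target_col])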

Efficient PATH setting problem

Currently, paths are built by concatenating with a slash / like

PATH = "foo/"
PATH + 'hoge'

, but it seems better to use os.path.join like

import os

PATH = "foo"
os.path.join(PATH, 'hoge')

Python3 support

Currently, this library supports Python 2 only.
Python 3 support is needed to enhance usability.

Model evaluation

BaseModel should hold the CV evaluation scores after training:

m.run()
m.score
Fold1: 0.234
Fold2: 0.323
Fold3: 0.233

validation in training of stacking

In stacking, the out-of-fold test data is not passed to the models (XGBoost and NN) as validation data; that is, validation scores are only calculated after model training is done. It would be convenient to check the validation score every epoch, so the out-of-fold data should also be passed during model training.
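
Both backends support this natively; a minimal sketch (assuming the scikit-learn XGBoost wrapper and the Keras fit signature):

    def fit_with_validation(model, X_train, y_train, X_valid, y_valid, is_xgb):
        if is_xgb:
            # XGBoost sklearn wrapper: eval_set reports the metric per round.
            model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
        else:
            # Keras: validation_data reports val loss/metrics every epoch.
            model.fit(X_train, y_train, validation_data=(X_valid, y_valid))
        return model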

maximum recursion depth exceeded

RuntimeError: maximum recursion depth exceeded

I am getting this error while running the first model in stage 1. How can I overcome this?

Test

The following tests need to be carried out:

  • binary classification
  • multi-class classification
  • regression

First, we need to create datasets (train & test) for the above problems.

Next, define several models.

Finally, do the stacking.
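
A minimal sketch for generating the three datasets (hypothetical; uses scikit-learn's synthetic data helpers rather than real competition data):

    from sklearn.datasets import make_classification, make_regression

    # Binary classification, multi-class classification, and regression.
    X_bin, y_bin = make_classification(n_samples=1000, n_classes=2,
                                       random_state=0)
    X_multi, y_multi = make_classification(n_samples=1000, n_classes=3,
                                           n_informative=4, random_state=0)
    X_reg, y_reg = make_regression(n_samples=1000, random_state=0)
    # Split each into train & test before defining the models.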

Examples or Test?

The current tests (binary, multi-class, and regression) are under test/.
Specific tests should be under examples/ or docs/examples/.

Checking function when making directory

Currently, the data directories (i.e., data/input(output)/...) are created when stacking is imported, if they do not exist.
But if this is implemented naively, users can end up creating directories unintentionally.
It would be useful to let the user choose whether to create them on the first import, like:

Can new directories for input and output data be created? [Y/n]
Y(N) <---- input
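
A minimal sketch of that check (hypothetical; Python 2 raw_input, matching the library's current Python support):

    import os

    def make_data_dirs(dirs=('data/input', 'data/output')):
        # Only prompt when something actually needs to be created.
        missing = [d for d in dirs if not os.path.exists(d)]
        if not missing:
            return
        ans = raw_input('Can new directories for input and output data '
                        'be created? [Y/n] ')
        if ans.strip().lower() in ('', 'y', 'yes'):
            for d in missing:
                os.makedirs(d)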
