ikki407 / stacking
Stacked Generalization (Ensemble Learning)
License: MIT License
Currently, load_data() handles the train and test feature lists in the same way (related: #2), and the target is included in the train dataset.
FEATURE_LIST_stage1 = {
    'train': (INPUT_PATH + 'train.csv',
              FEATURES_PATH + 'train_log.csv',
              ),  # target is in 'train'
    'test': (INPUT_PATH + 'test.csv',
             FEATURES_PATH + 'test_log.csv',
             ),
}
However, in stage-2 stacking, the following feature lists cause an error when the target is passed only in the train dataset. The error occurs because the train and test lists have different lengths.
FEATURE_LIST_stage2 = {
    'train': (INPUT_PATH + 'target.csv',
              TEMP_PATH + 'v1_stage1_all_fold.csv',
              TEMP_PATH + 'v2_stage1_all_fold.csv',
              TEMP_PATH + 'v3_stage1_all_fold.csv',
              TEMP_PATH + 'v4_stage1_all_fold.csv',
              TEMP_PATH + 'v5_stage1_all_fold.csv',
              TEMP_PATH + 'v6_stage1_all_fold.csv',
              ),  # target is in 'train'
    'test': (
             TEMP_PATH + 'v1_stage1_test.csv',
             TEMP_PATH + 'v2_stage1_test.csv',
             TEMP_PATH + 'v3_stage1_test.csv',
             TEMP_PATH + 'v4_stage1_test.csv',
             TEMP_PATH + 'v5_stage1_test.csv',
             TEMP_PATH + 'v6_stage1_test.csv',
             ),
}
This bug can be fixed by using the length of each list when reading. But for a more general library, a new key in the feature dictionary, i.e., 'target', should be introduced, like:
FEATURE_LIST_stage2 = {
    'train': (
        TEMP_PATH + 'v1_stage1_all_fold.csv',
        TEMP_PATH + 'v2_stage1_all_fold.csv',
        TEMP_PATH + 'v3_stage1_all_fold.csv',
        ),  # target is not in 'train'
    'target': (
        INPUT_PATH + 'target.csv',
        ),  # target is in 'target'
    'test': (
        TEMP_PATH + 'v1_stage1_test.csv',
        TEMP_PATH + 'v2_stage1_test.csv',
        TEMP_PATH + 'v3_stage1_test.csv',
        ),
}
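A minimal sketch of how a loader could consume the proposed 'target' key (the load_data signature here is hypothetical, not the library's current one):

```python
import os
import tempfile
import pandas as pd

def load_data(feature_list):
    # Concatenate the feature files column-wise per split; the target comes
    # from its own key, so 'train' and 'test' may hold different numbers of
    # files without any special-casing.
    train = pd.concat([pd.read_csv(p) for p in feature_list['train']], axis=1)
    test = pd.concat([pd.read_csv(p) for p in feature_list['test']], axis=1)
    target = pd.read_csv(feature_list['target'][0]).iloc[:, 0]
    return train, test, target

# Tiny demo with temporary files
tmp = tempfile.mkdtemp()
pd.DataFrame({'f1': [1, 2, 3]}).to_csv(os.path.join(tmp, 'v1_train.csv'), index=False)
pd.DataFrame({'f2': [4, 5, 6]}).to_csv(os.path.join(tmp, 'v2_train.csv'), index=False)
pd.DataFrame({'f1': [7, 8]}).to_csv(os.path.join(tmp, 'v1_test.csv'), index=False)
pd.DataFrame({'target': [0, 1, 0]}).to_csv(os.path.join(tmp, 'target.csv'), index=False)

FEATURE_LIST = {
    'train': (os.path.join(tmp, 'v1_train.csv'), os.path.join(tmp, 'v2_train.csv')),
    'target': (os.path.join(tmp, 'target.csv'),),
    'test': (os.path.join(tmp, 'v1_test.csv'),),
}
train, test, target = load_data(FEATURE_LIST)
```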
This is very reasonable.
The outputs of this library are hard to interpret when many models are trained together, so information about each trained model should be logged. For example:
Level 1:
Model1 train-loss 0.5123±0.10 validation-loss 0.6345±0.15
Model2 train-loss 0.4324±0.20 validation-loss 0.5456±0.08
Model3 train-loss 0.5963±0.14 validation-loss 0.6501±0.05
...
Level 2:
...
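A sketch of how such a summary could be assembled from per-fold losses (summarize and its input layout are hypothetical names, not the library's API):

```python
import statistics

def summarize(level, fold_losses):
    # fold_losses maps model name -> {'train': [loss per fold], 'valid': [...]}
    lines = ['Level %d:' % level]
    for name, losses in fold_losses.items():
        t_m, t_s = statistics.mean(losses['train']), statistics.pstdev(losses['train'])
        v_m, v_s = statistics.mean(losses['valid']), statistics.pstdev(losses['valid'])
        lines.append('%s train-loss %.4f±%.2f validation-loss %.4f±%.2f'
                     % (name, t_m, t_s, v_m, v_s))
    return '\n'.join(lines)

report = summarize(1, {
    'Model1': {'train': [0.41, 0.62, 0.51], 'valid': [0.48, 0.79, 0.63]},
    'Model2': {'train': [0.40, 0.46, 0.44], 'valid': [0.52, 0.58, 0.54]},
})
print(report)
```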
Refactor hard-coded parts for stability, in preparation for version upgrades of libraries such as sklearn, Keras, and XGBoost. Specifically, the initialization of the BaseModel class.
This library is still not very simple to use. Documentation should be created.
In the Keras-implemented neural net, to avoid recompiling, the initial weights after compilation are saved and reused at the start of each training run in cross-validation. However, this means the initial weights are identical across all folds, so they should be changed for each training run. A possible solution is to pass the compilation arguments (e.g., optimizer, loss, and metrics) and compile a fresh model per fold.
In binary_class.py:
class ModelV2(BaseModel):
    def build_model(self):
        model = Sequential()
        model.add(Dense(64, input_shape=nn_input_dim_NN, init='he_normal'))
        model.add(LeakyReLU(alpha=0.00001))
        model.add(Dropout(0.5))
        model.add(Dense(output_dim, init='he_normal'))
        model.add(Activation('softmax'))
        sgd = SGD(lr=0.1, decay=1e-5, momentum=0.9, nesterov=True)
        compile_options = {
            'optimizer': sgd,
            'loss': 'categorical_crossentropy',
            'metrics': ['accuracy'],
        }
        return KerasClassifier(nn=model, compile_options=compile_options, **self.params)
In base.py:
import copy

class KerasClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, nn, compile_options):
        self.nn = nn
        self.compile_options = compile_options

    def fit(self, X, y, X_test=None, y_test=None):
        # Compile a fresh copy per fold so the initial weights differ each time
        self.compiled_nn = copy.copy(self.nn)
        self.compiled_nn.compile(**self.compile_options)
        return self.compiled_nn.fit(X, y)
But this approach increases memory consumption...
A multi-class classification task is needed. Only binary classification and regression tasks are available now.
"About this library" and "Get started"
Need to add tests for stacking, bagging, and more. Stacking tests are done, but bagging tests are not. Bagging tests should be easy, because they only require sub-sampling the training data and training models on it. Several options are under consideration (e.g., a subsample argument at model definition (subsample=0.8), models with different parameters, models with different random states, ...). Once this is done, this package will be a more general ensemble-learning library.
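A rough sketch of such a bagging test, subsampling the training data with a different random state per model (the function name and the choice of DecisionTreeClassifier are illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X, y, X_test, n_models=5, subsample=0.8, seed=0):
    # Each model sees a different random subsample of the training data;
    # the final prediction is the average of the per-model probabilities.
    rng = np.random.RandomState(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=int(subsample * len(X)), replace=False)
        model = DecisionTreeClassifier(random_state=rng.randint(10 ** 6))
        model.fit(X[idx], y[idx])
        preds.append(model.predict_proba(X_test)[:, 1])
    return np.mean(preds, axis=0)

# Tiny separable demo: low feature values -> class 0, high -> class 1
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
p = bagging_predict(X, y, np.array([[0.5], [8.5]]))
```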
Should task-dependent functions be changed to virtual functions? Users would then need to define such functions themselves (e.g., CV-fold index creation, save_pred_as_submit_format, ...).
Hi,
I found a few things you may not have noticed:
1. First of all, `pip install stacking` does not work on macOS because the xgboost package cannot be installed automatically there, so xgboost needs to be installed manually first.
2. In "examples/multi_class/scripts/multiclass.py", line 25 is wrong; it should be "from keras.regularizers import l1, l2". A small mistake, but it makes `bash run.sh` fail.
3. 'No module named paratext' is another problem. The solution is to clone it from https://github.com/wiseio/paratext, and don't forget to install swig with `brew install swig` if Homebrew is installed.
That's it; now I can run the EXAMPLE. I suggest documenting the required environment in README.md, just in case.
BTW, the project is wonderful!
In the BaseModel class, scikit-learn models can be used, but that approach should be reconsidered.
How should the target column be defined in a feature set? Currently, the target column must be included in the train dataset under the column name 'target'.
train:
[2,3,4,5,6,7,8,9]
[0,1,4,5,6,7,8,9]
[0,1,2,3,6,7,8,9]
[0,1,2,3,4,5,8,9]
[0,1,2,3,4,5,6,7]
test:
[0,1]
[2,3]
[4,5]
[6,7]
[8,9]
fold assignment:
[0,0,1,1,2,2,3,3,4,4]
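The index lists above correspond to a plain 5-fold split of 10 samples, which can be reproduced with scikit-learn's KFold (modern sklearn.model_selection API shown; the library itself targets an older sklearn):

```python
import numpy as np
from sklearn.model_selection import KFold

# Unshuffled 5-fold split of 10 samples: test folds are [0,1], [2,3], ...
folds = list(KFold(n_splits=5).split(np.arange(10)))
for train_idx, test_idx in folds:
    print(train_idx.tolist(), test_idx.tolist())

# Per-sample fold assignment, i.e. which test fold each sample falls into
assignment = np.empty(10, dtype=int)
for i, (_, test_idx) in enumerate(folds):
    assignment[test_idx] = i
```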
But the current BaseModel uses the previous version's architecture; if the current format is used, the data is converted back to the previous format. So this should be changed to use the original format directly. Also, .ix should be changed to .iloc for stable behavior. In addition, the global CV-fold file name needs to be updated whenever a new CV-fold file is created.
Training and test data are currently loaded at the beginning of model building. However, this is not time-efficient when the data is very large. So, if the data does not change throughout the stacking script, it should be cached somewhere in base.py.
STORED_X = None

def load_data(stored=True):
    global STORED_X
    if STORED_X is not None:
        return STORED_X
    ...
    if stored:
        STORED_X = X.copy()
        return STORED_X
    return X
All models are currently described in base.py. For more simplicity, each model wrapper (XGBoost, Keras, VW, ...) should live in its own module, i.e., base_XGB.py, base_keras.py, etc. Then, in base.py, we use them like:
import base_XGB

class XGBClassifier(base_XGB.XGBClassifier):
    def __init__(self, params={}, num_round=50):
        base_XGB.XGBClassifier.__init__(self, params=params, num_round=num_round)
Currently, the only supported data format is .csv. More formats are desirable, for example .npy, .zip, .gz, and more.
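A possible dispatch-by-extension reader (read_any is a hypothetical helper, not part of the library; pandas already infers .gz/.zip compression from the file name):

```python
import os
import tempfile
import numpy as np
import pandas as pd

def read_any(path):
    # .npy goes through NumPy; everything else (.csv, .csv.gz, .csv.zip)
    # goes through pandas, which infers compression from the extension.
    if os.path.splitext(path)[1] == '.npy':
        return pd.DataFrame(np.load(path))
    return pd.read_csv(path)

# Demo with temporary files in both formats
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, 'feat.npy'), np.zeros((3, 2)))
pd.DataFrame({'a': [1, 2]}).to_csv(os.path.join(tmp, 'feat.csv.gz'), index=False)

df_npy = read_any(os.path.join(tmp, 'feat.npy'))
df_gz = read_any(os.path.join(tmp, 'feat.csv.gz'))
```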
Need to design how to save predictions in submission format (def save_pred_as_submit_format()).
Currently, paths are built with a slash, like:
PATH = "foo/"
PATH + 'hoge'
but it seems better to use os.path.join, like:
import os
PATH = "foo"
os.path.join(PATH, 'hoge')
Currently, this library supports Python 2 only. Python 3 support is needed to enhance usability.
Creating templates for stacking & bagging seems useful. They should be created under examples/.
BaseModel should hold the CV evaluation scores after training:
m.run()
m.score
Fold1: 0.234
Fold2: 0.323
Fold3: 0.233
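One way to sketch this (class and attribute names are hypothetical, not the library's actual API):

```python
class ScoredModel:
    def run(self, fold_scores):
        # Keep per-fold CV scores on the instance so they can be read after run()
        self.score = {'Fold%d' % (i + 1): s for i, s in enumerate(fold_scores)}
        self.score_mean = sum(fold_scores) / len(fold_scores)
        return self

m = ScoredModel().run([0.234, 0.323, 0.233])
print(m.score)
```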
Many meaningless comments currently included in base.py should be removed.
Rename base_fixed_fold.py to base.py, and then fix the test scripts.
In stacking, the out-of-fold data is not passed to models as validation data (XGBoost and NN); that is, validation scores are only calculated after model training is done. It would be convenient to check the validation score every epoch, so the out-of-fold data should also be passed during model training.
RuntimeError: maximum recursion depth exceeded
I am getting this error while running the first model in stage 1. How can I overcome this?
A comparison between stacking and other techniques should be carried out.
A README for each example is needed for easier understanding.
Need to carry out the following tests. First, create datasets (train & test) for the above problems. Next, define several models. Finally, do stacking. The current tests (binary, multi-class, and regression) are under test/; specific tests should be placed under examples/ or docs/examples/.
Currently, the data directories (i.e., data/input, data/output, ...) are created on importing stacking if they do not exist. But with this naive implementation, users may create directories unintentionally. It would be useful to let users choose whether to create them on the first import, like:
Can new directories for input and output data be created? [Y/n]
Y(N) <---- input
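A sketch of such an opt-in setup (the function name and prompt wording are illustrative; a real implementation would run once at first import):

```python
import os
import tempfile

def maybe_make_dirs(dirs, interactive=True):
    # Ask before creating the data directories; answering 'n' skips creation.
    if interactive:
        answer = input('Can new directories for input and output data be created? [Y/n] ')
        if answer.strip().lower() == 'n':
            return False
    for d in dirs:
        if not os.path.isdir(d):
            os.makedirs(d)
    return True

# Non-interactive demo in a temporary directory
tmp = tempfile.mkdtemp()
created = maybe_make_dirs([os.path.join(tmp, 'data', 'input')], interactive=False)
```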