azure / fast_retraining
Show how to perform fast retraining with LightGBM in different business cases
License: MIT License
Some advice from David:
Probably the most important advice I can give is this: know your audience. Think of the person who’s going to have the most interest in what you have to say, and write for that one person. As you mention, providing the code they can use to replicate what you do is important. I also feel like it’s better to focus on specifics rather than generalities: benchmarks/results from a specific analysis of a specific dataset on specific hardware are much more interesting IMO than “what might be”.
My other advice is keep it short. This is harder than it sounds – it’s tempting to go into every detail. But your post will have more impact the more people read it, and not many people will take the time to read more than a page or so. My usual target is 4-6 paragraphs, plus images/videos (and do include at least one image). Keep the details for the Github repo, documentation or whitepapers, and link to them as needed.
Lastly, follow journalistic principles and summarize your entire post in the first paragraph. Make sure your main point is included there – ideally in the first sentence – along with any “hero” links you’d like the reader to follow. That way a reader will know immediately if they’re interested in your content, and even if not, they still come away with the main point.
Planning kick-off meeting. This is the first MVP; after it we iterate depending on the time we have.
Methodology (based on TDSP):
TODO MVP1:
NameError Traceback (most recent call last)
~/h2o4gpu/testsxgboost/04_PlanetKaggle_GPU.py in ()
51
52
---> 53 X_train, y_train, X_test, y_test = load_planet_kaggle()
54
55
~/h2o4gpu/testsxgboost/libs/loaders.py in load_planet_kaggle()
212 if not os.listdir(val_path):
213 logger.info('Validation folder is empty, moving files...')
--> 214 generate_validation_files(train_path, val_path)
215
216 logger.info('Reading in labels')
NameError: name 'generate_validation_files' is not defined
Solution: In loaders.py:
from libs.planet_kaggle import to_multi_label_dict, get_file_count, enrich_with_feature_encoding, featurise_images, generate_validation_files
Update the requirements with a recent version of bokeh.
Contents of MVP2:
Hi, I am the author of the XGBoost GPU algorithms.
Your benchmarks of my GPU hist algorithm are simply running on the CPU. The reason for this is the 'tree_method':'hist' parameter is overriding the selection of the GPU updater. This was fixed some time ago but it seems you are using an older commit. The correct usage would now be to set 'tree_method':'gpu_hist'. I would appreciate if you can update your benchmarks, I think you might find my algorithm far more competitive.
I also noticed that the XGBoost CPU hist algorithm has not had the number of bins set correctly, so you would be comparing 256 bins for XGBoost against 63 bins for LightGBM. This was due to a mistake in our documentation regarding the naming of the parameter that I have noted in dmlc/xgboost#2567.
Thanks
Rory
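Rory's two corrections can be illustrated with hypothetical parameter dictionaries (a sketch, not the repo's actual benchmark code): a fair comparison selects the GPU updater via `tree_method: gpu_hist` and matches `max_bin` across both libraries.

```python
# Hedged sketch of the corrected benchmark parameters.
# 'gpu_hist' selects XGBoost's GPU histogram updater (plain 'hist' runs
# on the CPU); max_bin is matched so both libraries bucket features
# into the same 63 bins.

xgb_params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",  # was 'hist', which silently ran on CPU
    "max_bin": 63,              # match LightGBM's bin count for a fair test
}

lgb_params = {
    "objective": "binary",
    "device": "gpu",
    "max_bin": 63,
}

# Training would then be, e.g.:
#   bst = xgb.train(xgb_params, dtrain, num_boost_round=num_rounds)
#   gbm = lgb.train(lgb_params, lgb_train, num_boost_round=num_rounds)
```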
Generating match features...
Generating match labels...
Generating bookkeeper data...
/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/indexing.py:297: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
(19673, 48)
CPU times: user 10min 50s, sys: 7.58 s, total: 10min 58s
Wall time: 10min 58s
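The SettingWithCopyWarning in the log above comes from assigning into a slice of a DataFrame. A minimal sketch of the chained-assignment pattern and the `.loc` fix the warning suggests (column names and values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "B"], "goals": [1, 3]})

# Problematic: taking a slice and then assigning may write to a copy,
# which is exactly what triggers SettingWithCopyWarning:
#   subset = df[df["goals"] > 2]
#   subset["goals"] = 0

# Preferred: a single .loc[row_indexer, col_indexer] assignment on df itself
df.loc[df["goals"] > 2, "goals"] = 0
print(df["goals"].tolist())  # → [1, 0]
```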
In [9]:
feables = convert_cols_categorical_to_numeric(feables)
feables.head()
[autoreload of six failed: Traceback (most recent call last):
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 246, in check
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 385, in superreload
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 324, in update_generic
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 276, in update_class
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/six.py", line 93, in get
setattr(obj, self.name, result) # Invokes set.
AttributeError: 'NoneType' object has no attribute 'cStringIO'
]
Speeding up Machine Learning Applications with the LightGBM Library in Real-Time Domains
Short description
The speed of a machine learning algorithm can be crucial in problems that require retraining in real time. Useful domains include IoT, sport result prediction, predictive maintenance and healthcare. Microsoft recently open sourced LightGBM, a library for decision trees that outperforms other libraries in both speed and accuracy. In this talk we will demo several applications using LightGBM.
Abstract
In some applications, training and retraining times need to be kept below 5 seconds to be useful. Such applications are often referred to as real-time and include, but are not limited to, IoT, sport result prediction, predictive maintenance and healthcare. Algorithms that allow for fast retraining are fundamental to enabling such applications and can open up new business opportunities. One reason for retraining is feature degradation: previously useful features lose their predictive power, as is often observed when sensors age or information becomes out of date.
LightGBM is a new open source library created by Microsoft that is set to become the new standard in decision tree algorithms. Depending on the application, it can be anything from 4 to 10 times faster than XGBoost while offering higher accuracy. It has already proven useful in several Kaggle competitions.
In this talk we will explore this promising library, compare it with the current state of the art and demo a business case of a real-time application.
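The feature-degradation argument above can be made concrete with a toy retraining trigger (all names, window sizes and thresholds here are invented for illustration): monitor a rolling accuracy window over the live prediction stream and signal a retrain when it drops below a threshold.

```python
from collections import deque

def make_monitor(window=20, threshold=0.8):
    """Return a callable that records prediction outcomes and reports
    whether rolling accuracy has degraded enough to trigger retraining."""
    hits = deque(maxlen=window)

    def observe(correct: bool) -> bool:
        hits.append(1 if correct else 0)
        # Only judge once the window is full; otherwise keep collecting.
        if len(hits) < window:
            return False
        return sum(hits) / window < threshold

    return observe

# Simulated stream: the model is right for a while, then a sensor
# degrades and predictions start missing; the monitor eventually fires.
observe = make_monitor(window=20, threshold=0.8)
stream = [True] * 19 + [False] * 10
signals = [observe(c) for c in stream]
```

When a signal fires, a fast library like LightGBM makes it feasible to retrain within the real-time budget rather than serving a stale model.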
Slides
LightGBMError Traceback (most recent call last)
~/h2o4gpu/testsxgboost/06_HIGGS_GPU.py in ()
242
243 with Timer() as train_t:
--> 244 lgbm_clf_pipeline = lgb.train(params, lgb_train, num_boost_round=num_rounds)
245
246 with Timer() as test_t:
~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
178 """construct booster"""
179 try:
--> 180 booster = Booster(params=params, train_set=train_set)
181 if is_valid_contain_train:
182 booster.set_train_data_name(train_data_name)
~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/basic.py in init(self, params, train_set, model_file, silent)
1251 train_set.construct().handle,
1252 c_str(params_str),
-> 1253 ctypes.byref(self.handle)))
1254 """save reference to data"""
1255 self.train_set = train_set
~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/basic.py in _safe_call(ret)
46 """
47 if ret != 0:
---> 48 raise LightGBMError(_LIB.LGBM_GetLastError())
49
50
LightGBMError: b'GPU Tree Learner was not enabled in this build. Recompile with CMake option -DUSE_GPU=1'
Unknown solution. This only happens with the airlines and HIGGS data sets -- the larger ones -- even though the system has a 1080 Ti with its memory fully free. Smaller data sets like credit and football show no such issue. It seems the error message is misleading and LightGBM actually ran out of GPU memory.
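If the build genuinely lacks GPU support (the literal reading of the error), the fix the message points to is rebuilding LightGBM with the OpenCL GPU tree learner enabled. A sketch of the standard CMake build, assuming a Linux box with OpenCL drivers installed (job count and paths will vary by machine):

```shell
# Rebuild LightGBM with the GPU tree learner compiled in.
git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build && cd build
cmake -DUSE_GPU=1 ..   # the CMake option named in the error message
make -j4
# afterwards, reinstall the Python package from python-package/
# so the new binary is picked up
```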
branch new parameters: https://github.com/Azure/fast_retraining/blob/new_parameters/experiments/02_BCI.ipynb
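The `Timer` context manager used in the benchmark tracebacks above is not shown on this page; a minimal sketch of how such a timer is typically written (the `interval` attribute name is an assumption, not taken from the repo):

```python
import time

class Timer:
    """Context manager that records the wall-clock time of its block."""
    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.interval = time.perf_counter() - self.start
        return False  # do not suppress exceptions

# Usage mirrors the benchmark code:
with Timer() as train_t:
    total = sum(range(1_000_000))  # stand-in for lgb.train(...)
```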