azure / fast_retraining
Show how to perform fast retraining with LightGBM in different business cases
License: MIT License
Some advice from David:
Probably the most important advice I can give is this: know your audience. Think of the person who’s going to have the most interest in what you have to say, and write for that one person. As you mention, providing the code they can use to replicate what you do is important. I also feel like it’s better to focus on specifics rather than generalities: benchmarks/results from a specific analysis of a specific dataset on specific hardware are much more interesting IMO than “what might be”.
My other advice is keep it short. This is harder than it sounds – it’s tempting to go into every detail. But your post will have more impact the more people read it, and not many people will take the time to read more than a page or so. My usual target is 4-6 paragraphs, plus images/videos (and do include at least one image). Keep the details for the Github repo, documentation or whitepapers, and link to them as needed.
Lastly, follow journalistic principles and summarize your entire post in the first paragraph. Make sure your main point is included there – ideally in the first sentence – along with any “hero” links you’d like the reader to follow. That way a reader will know immediately if they’re interested in your content, and even if not, they still come away with the main point.
Planning kick-off meeting. This is the first MVP; after it we iterate depending on the time we have.
Methodology (based on TDSP):
TODO MVP1:
NameError Traceback (most recent call last)
~/h2o4gpu/testsxgboost/04_PlanetKaggle_GPU.py in ()
51
52
---> 53 X_train, y_train, X_test, y_test = load_planet_kaggle()
54
55
~/h2o4gpu/testsxgboost/libs/loaders.py in load_planet_kaggle()
212 if not os.listdir(val_path):
213 logger.info('Validation folder is empty, moving files...')
--> 214 generate_validation_files(train_path, val_path)
215
216 logger.info('Reading in labels')
NameError: name 'generate_validation_files' is not defined
Solution: In loaders.py:
from libs.planet_kaggle import to_multi_label_dict, get_file_count, enrich_with_feature_encoding, featurise_images, generate_validation_files
Update the requirements with a recent version of bokeh.
Contents of MVP2:
Hi, I am the author of the XGBoost GPU algorithms.
Your benchmarks of my GPU hist algorithm are simply running on the CPU. The reason for this is the 'tree_method':'hist' parameter is overriding the selection of the GPU updater. This was fixed some time ago but it seems you are using an older commit. The correct usage would now be to set 'tree_method':'gpu_hist'. I would appreciate if you can update your benchmarks, I think you might find my algorithm far more competitive.
I also noticed that the XGBoost CPU hist algorithm has not had the number of bins set correctly, so you would be comparing 256 bins for XGBoost against 63 bins for LightGBM. This was due to a mistake in our documentation regarding the naming of the parameter that I have noted in dmlc/xgboost#2567.
Thanks
Rory
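Rory's two corrections can be illustrated with hypothetical parameter dictionaries (a sketch, not the repo's actual benchmark code): a fair comparison selects the GPU updater via `tree_method: gpu_hist` and matches `max_bin` across both libraries.

```python
# Hedged sketch of the corrected benchmark parameters.
# 'gpu_hist' selects XGBoost's GPU histogram updater (plain 'hist' runs
# on the CPU); max_bin is matched so both libraries bucket features
# into the same 63 bins.

xgb_params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",  # was 'hist', which silently ran on CPU
    "max_bin": 63,              # match LightGBM's bin count for a fair test
}

lgb_params = {
    "objective": "binary",
    "device": "gpu",
    "max_bin": 63,
}

# Training would then be, e.g.:
#   bst = xgb.train(xgb_params, dtrain, num_boost_round=num_rounds)
#   gbm = lgb.train(lgb_params, lgb_train, num_boost_round=num_rounds)
```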
Generating match features...
Generating match labels...
Generating bookkeeper data...
/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/indexing.py:297: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
(19673, 48)
CPU times: user 10min 50s, sys: 7.58 s, total: 10min 58s
Wall time: 10min 58s
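The SettingWithCopyWarning in the log above comes from assigning into a slice of a DataFrame. A minimal sketch of the chained-assignment pattern and the `.loc` fix the warning suggests (column names and values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "B"], "goals": [1, 3]})

# Problematic: taking a slice and then assigning may write to a copy,
# which is exactly what triggers SettingWithCopyWarning:
#   subset = df[df["goals"] > 2]
#   subset["goals"] = 0

# Preferred: a single .loc[row_indexer, col_indexer] assignment on df itself
df.loc[df["goals"] > 2, "goals"] = 0
print(df["goals"].tolist())  # → [1, 0]
```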
In [9]:
feables = convert_cols_categorical_to_numeric(feables)
feables.head()
[autoreload of six failed: Traceback (most recent call last):
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 246, in check
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 385, in superreload
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 324, in update_generic
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 276, in update_class
File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/six.py", line 93, in get
setattr(obj, self.name, result) # Invokes set.
AttributeError: 'NoneType' object has no attribute 'cStringIO'
]
Speeding up Machine Learning Applications with the LightGBM Library in Real-Time Domains
Short description
The speed of a machine learning algorithm can be crucial in problems that require retraining in real time. Useful domains include IoT, sport result prediction, predictive maintenance and healthcare. Microsoft recently open sourced LightGBM, a library for decision trees that outperforms other libraries in both speed and accuracy. In this talk we will demo several applications using LightGBM.
Abstract
In some applications, training and retraining times need to be kept below 5 seconds to be useful. Such applications are often referred to as real-time and include, but are not limited to, IoT, sport result prediction, predictive maintenance and healthcare. Algorithms that allow for fast retraining are fundamental to enabling such applications and can open up new business opportunities. One reason for retraining is feature degradation: previously useful features lose their predictive power, as is often observed when sensors age or information becomes out of date.
LightGBM is a new open source library created by Microsoft that is set to become the new standard in decision tree algorithms. Depending on the application, it can be anything from 4 to 10 times faster than XGBoost while offering higher accuracy. It has already proven useful in several Kaggle competitions.
In this talk we will explore this promising library, compare it with the current state of the art and demo a business case of a real-time application.
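The feature-degradation argument above can be made concrete with a toy retraining trigger (all names, window sizes and thresholds here are invented for illustration): monitor a rolling accuracy window over the live prediction stream and signal a retrain when it drops below a threshold.

```python
from collections import deque

def make_monitor(window=20, threshold=0.8):
    """Return a callable that records prediction outcomes and reports
    whether rolling accuracy has degraded enough to trigger retraining."""
    hits = deque(maxlen=window)

    def observe(correct: bool) -> bool:
        hits.append(1 if correct else 0)
        # Only judge once the window is full; otherwise keep collecting.
        if len(hits) < window:
            return False
        return sum(hits) / window < threshold

    return observe

# Simulated stream: the model is right for a while, then a sensor
# degrades and predictions start missing; the monitor eventually fires.
observe = make_monitor(window=20, threshold=0.8)
stream = [True] * 19 + [False] * 10
signals = [observe(c) for c in stream]
```

When a signal fires, a fast library like LightGBM makes it feasible to retrain within the real-time budget rather than serving a stale model.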
Slides
LightGBMError Traceback (most recent call last)
~/h2o4gpu/testsxgboost/06_HIGGS_GPU.py in ()
242
243 with Timer() as train_t:
--> 244 lgbm_clf_pipeline = lgb.train(params, lgb_train, num_boost_round=num_rounds)
245
246 with Timer() as test_t:
~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
178 """construct booster"""
179 try:
--> 180 booster = Booster(params=params, train_set=train_set)
181 if is_valid_contain_train:
182 booster.set_train_data_name(train_data_name)
~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/basic.py in init(self, params, train_set, model_file, silent)
1251 train_set.construct().handle,
1252 c_str(params_str),
-> 1253 ctypes.byref(self.handle)))
1254 """save reference to data"""
1255 self.train_set = train_set
~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/basic.py in _safe_call(ret)
46 """
47 if ret != 0:
---> 48 raise LightGBMError(_LIB.LGBM_GetLastError())
49
50
LightGBMError: b'GPU Tree Learner was not enabled in this build. Recompile with CMake option -DUSE_GPU=1'
Unknown solution. This only happens with the airlines and HIGGS data sets -- the larger ones -- even though the system has a 1080 Ti with its memory fully free. Smaller data sets like credit and football show no such issue. It seems the error message is misleading and LightGBM actually ran out of GPU memory.
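If the build genuinely lacks GPU support (the literal reading of the error), the fix the message points to is rebuilding LightGBM with the OpenCL GPU tree learner enabled. A sketch of the standard CMake build, assuming a Linux box with OpenCL drivers installed (job count and paths will vary by machine):

```shell
# Rebuild LightGBM with the GPU tree learner compiled in.
git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build && cd build
cmake -DUSE_GPU=1 ..   # the CMake option named in the error message
make -j4
# afterwards, reinstall the Python package from python-package/
# so the new binary is picked up
```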
branch new parameters: https://github.com/Azure/fast_retraining/blob/new_parameters/experiments/02_BCI.ipynb
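The `Timer` context manager used in the benchmark tracebacks above is not shown on this page; a minimal sketch of how such a timer is typically written (the `interval` attribute name is an assumption, not taken from the repo):

```python
import time

class Timer:
    """Context manager that records the wall-clock time of its block."""
    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.interval = time.perf_counter() - self.start
        return False  # do not suppress exceptions

# Usage mirrors the benchmark code:
with Timer() as train_t:
    total = sum(range(1_000_000))  # stand-in for lgb.train(...)
```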