microsoft / lightgbm

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Home Page: https://lightgbm.readthedocs.io/en/latest/

License: MIT License

CMake 0.87% C++ 51.41% Python 20.39% R 12.09% C 3.16% Shell 1.51% PowerShell 0.46% SWIG 0.61% M4 0.12% Cuda 9.39%
gbdt gbm machine-learning data-mining distributed lightgbm gbrt microsoft decision-trees gradient-boosting

lightgbm's Introduction

Light Gradient Boosting Machine

[Badges: GitHub Actions build status (Python-package, R-package, CUDA version, static analysis), Azure Pipelines build status, Appveyor build status, documentation status, link checks, license, Python versions, PyPI version, CRAN version]

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

  • Faster training speed and higher efficiency.
  • Lower memory usage.
  • Better accuracy.
  • Support of parallel, distributed, and GPU learning.
  • Capable of handling large-scale data.

For further details, please refer to Features.

Benefiting from these advantages, LightGBM is widely used in many winning solutions of machine learning competitions.

Comparison experiments on public datasets show that LightGBM can outperform existing boosting frameworks on both efficiency and accuracy, with significantly lower memory consumption. What's more, distributed learning experiments show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.

Get Started and Documentation

Our primary documentation is at https://lightgbm.readthedocs.io/ and is generated from this repository. If you are new to LightGBM, follow the installation instructions on that site.

Next you may want to read:

Documentation for contributors:

News

Please refer to the changelogs on the GitHub releases page.

Some older update logs are available on the Key Events page.

External (Unofficial) Repositories

Projects listed here offer alternative ways to use LightGBM. They are not maintained or officially endorsed by the LightGBM development team.

LightGBMLSS (An extension of LightGBM to probabilistic modelling from which prediction intervals and quantiles can be derived): https://github.com/StatMixedML/LightGBMLSS

FLAML (AutoML library for hyperparameter optimization): https://github.com/microsoft/FLAML

Optuna (hyperparameter optimization framework): https://github.com/optuna/optuna

Julia-package: https://github.com/IQVIA-ML/LightGBM.jl

JPMML (Java PMML converter): https://github.com/jpmml/jpmml-lightgbm

Nyoka (Python PMML converter): https://github.com/SoftwareAG/nyoka

Treelite (model compiler for efficient deployment): https://github.com/dmlc/treelite

lleaves (LLVM-based model compiler for efficient inference): https://github.com/siboehm/lleaves

Hummingbird (model compiler into tensor computations): https://github.com/microsoft/hummingbird

cuML Forest Inference Library (GPU-accelerated inference): https://github.com/rapidsai/cuml

daal4py (Intel CPU-accelerated inference): https://github.com/intel/scikit-learn-intelex/tree/master/daal4py

m2cgen (model appliers for various languages): https://github.com/BayesWitnesses/m2cgen

leaves (Go model applier): https://github.com/dmitryikh/leaves

ONNXMLTools (ONNX converter): https://github.com/onnx/onnxmltools

SHAP (model output explainer): https://github.com/slundberg/shap

Shapash (model visualization and interpretation): https://github.com/MAIF/shapash

dtreeviz (decision tree visualization and model interpretation): https://github.com/parrt/dtreeviz

SynapseML (LightGBM on Spark): https://github.com/microsoft/SynapseML

Kubeflow Fairing (LightGBM on Kubernetes): https://github.com/kubeflow/fairing

Kubeflow Operator (LightGBM on Kubernetes): https://github.com/kubeflow/xgboost-operator

lightgbm_ray (LightGBM on Ray): https://github.com/ray-project/lightgbm_ray

Mars (LightGBM on Mars): https://github.com/mars-project/mars

ML.NET (.NET/C#-package): https://github.com/dotnet/machinelearning

LightGBM.NET (.NET/C#-package): https://github.com/rca22/LightGBM.Net

Ruby gem: https://github.com/ankane/lightgbm-ruby

LightGBM4j (Java high-level binding): https://github.com/metarank/lightgbm4j

lightgbm-rs (Rust binding): https://github.com/vaaaaanquish/lightgbm-rs

MLflow (experiment tracking, model monitoring framework): https://github.com/mlflow/mlflow

{bonsai} (R {parsnip}-compliant interface): https://github.com/tidymodels/bonsai

{mlr3extralearners} (R {mlr3}-compliant interface): https://github.com/mlr-org/mlr3extralearners

lightgbm-transform (feature transformation binding): https://github.com/microsoft/lightgbm-transform

postgresml (LightGBM training and prediction in SQL, via a Postgres extension): https://github.com/postgresml/postgresml

vaex-ml (Python DataFrame library with its own interface to LightGBM): https://github.com/vaexio/vaex

Support

How to Contribute

Check the CONTRIBUTING page.

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Reference Papers

Yu Shi, Guolin Ke, Zhuoming Chen, Shuxin Zheng, Tie-Yan Liu. "Quantized Training of Gradient Boosting Decision Trees" (link). Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pp. 18822-18833.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149-3157.

Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tie-Yan Liu. "A Communication-Efficient Parallel Algorithm for Decision Tree". Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 1279-1287.

Huan Zhang, Si Si and Cho-Jui Hsieh. "GPU Acceleration for Large-scale Tree Boosting". SysML Conference, 2018.

Note: If you use LightGBM in your GitHub projects, please add lightgbm to your requirements.txt.

License

This project is licensed under the terms of the MIT license. See LICENSE for additional details.

lightgbm's People

Contributors

albertoeaf, allardvm, borchero, btrotta, cbecker, chivee, climbsrocks, cyfdecyf, david-cortes, ffineis, guolinke, henry0312, huanzhang12, imatiach-msft, jameslamb, jmoralez, kant, laurae2, mjmckp, novusedge, sayantan1410, shiyu1994, strikerrus, svotaw, thomasjpfan, wxchan, xuehui1991, yanyachen, zhangyafeikimi, zyxue

lightgbm's Issues

Segmentation fault when initial score size doesn't equal to data

If the init_score file does not exist or its size does not match the training data, I think it should produce the same result as running without an init_score file.

But the log looks like this (on Ubuntu 16.04):

[LightGBM] [Info] Loading parameters .. finished
[LightGBM] [Info] Start loading initial scores
[LightGBM] [Error] Initial score size doesn't equal to data, score file will be ignored
[LightGBM] [Info] Finish loading data, use 0.042958 seconds
[LightGBM] [Info] Number of data:7000, Number of features:28
[LightGBM] [Info] Finish training initilization.
[LightGBM] [Info] Start train ...
[LightGBM] [Info] re-bagging, using 5600 data to train
[LightGBM] [Info] cannot find more split with gain = -inf , current #leaves=3
[LightGBM] [Info] Iteration:1, training's l2 loss : -nan
[LightGBM] [Info] Iteration:1, regression.test's l2 loss : -nan
[LightGBM] [Info] 0.000481 seconds elapsed, finished 1 iteration
[LightGBM] [Info] cannot find more split with gain = -inf , current #leaves=3
[LightGBM] [Info] Iteration:2, training's l2 loss : -nan
[LightGBM] [Info] Iteration:2, regression.test's l2 loss : -nan
[LightGBM] [Info] 0.000923 seconds elapsed, finished 2 iteration
[LightGBM] [Info] cannot find more split with gain = -inf , current #leaves=3
[LightGBM] [Info] Iteration:3, training's l2 loss : -nan
[LightGBM] [Info] Iteration:3, regression.test's l2 loss : -nan
[LightGBM] [Info] 0.001280 seconds elapsed, finished 3 iteration
[LightGBM] [Info] cannot find more split with gain = -inf , current #leaves=1
[LightGBM] [Info] Can't training anymore, there isn't any leaf meets split requirements.
[LightGBM] [Info] 0.001448 seconds elapsed, finished 4 iteration
[LightGBM] [Info] Finished train
Segmentation fault (core dumped)

Another bug report on Mac:

lightgbm(32636,0x7fffb2c553c0) malloc: *** error for object 0x7fa7d2c03b10: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

By the way, since multiclass classification scores have a different format, they are not supported yet.

[Feature] Random Forest mode

Just wondering, are there any plans for modifications to allow using LightGBM in Random Forest mode? It would be really interesting to see how LightGBM fares in a (potential) Random Forest mode, both in terms of speed and performance, versus xgboost, H2O, Python scikit-learn, and R (LightGBM could be much faster than any of them?).
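
For reference, a minimal sketch of a random-forest-style setup, assuming the boosting="rf" mode that the current Python package exposes (parameter values are illustrative, not tuned):

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# "rf" boosting averages bagged trees instead of adding them with a learning rate;
# it requires bagging to be enabled via bagging_freq / bagging_fraction.
params = {
    "objective": "binary",
    "boosting": "rf",
    "bagging_freq": 1,
    "bagging_fraction": 0.8,
    "feature_fraction": 0.8,
    "num_leaves": 31,
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)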

How to get feature importances

Hey,

LightGBM is great. I've tried it on several datasets, but I cannot get feature importances after training.
How can I enable feature importance output after training?

Thanks
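
A minimal sketch of retrieving feature importances with the Python package, assuming the current lightgbm API (the data here is synthetic):

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 8))
y = (X[:, 0] > 0.5).astype(int)

booster = lgb.train({"objective": "binary", "verbose": -1},
                    lgb.Dataset(X, label=y), num_boost_round=50)

# Per-feature importance: number of splits using the feature, or total split gain
print(booster.feature_importance(importance_type="split"))
print(booster.feature_importance(importance_type="gain"))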

lambdarank prediction error

Running the lambdarank example, training finished.

When I run "LightGBM config=predict.conf", it gives the error:

"[LightGBM Error] input format error, should be LibSVM"

[Feature] Interface for data layers

I would like to open this issue for tracking progress on data/model in-memory transfer interface development.

The interface should offer:

  • serialization of data from [memory, disk] to [memory, disk]
  • access to data in the training/validation process
  • porting libraries for other tools

[Feature] Integration with Azure ML Studio

This is a great package. I'd suggest (politely) that if there are no plans to bring XGBoost into the Azure ML Studio GUI, this would be a good module to add as an option; it would provide a much-needed tool within the interface.

Can't find executable lightgbm

I followed the Installation Guide and the compilation succeeded, but there is no lightgbm executable. Might the CMakeLists.txt have a problem?

Not precise configuration of xgboost

LightGBM uses the leaf-wise algorithm instead and controls model complexity by num_leaves. So we cannot compare them in exactly the same model setting. As a tradeoff, we use xgboost with max_depth=8, which gives a maximum of 255 leaves.

You should use min_child_weight instead; it controls the sum of second derivatives (hessians). In the case of a regression task, it is equivalent to the number of samples in a node (divided by 2). For your objective functions it is not so trivial, but it is still better than the default min_child_weight=1.

If you only set max_depth=8, that only guarantees the maximum number of leaves; the tree could still be really skewed and overfit the training data set (in your case, with the default min_data_in_leaf, it is less skewed).
Besides that, increasing that parameter should make your computation a little faster (you don't need to calculate the gain if your hessian sum is less than min_child_weight).

PS
To make your benchmark more representative, take a winning solution from Kaggle (for instance) and try to get at least the same level of accuracy; otherwise you may be comparing apples to oranges.
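
For illustration only, a hedged sketch of the kind of xgboost configuration this comment is suggesting; the min_child_weight value is hypothetical and would need tuning per dataset:

import xgboost as xgb

params = {
    "max_depth": 8,            # caps tree depth, as in the quoted benchmark setting
    "min_child_weight": 100,   # hypothetical value; raises the minimum hessian sum per leaf (default is 1)
    "eta": 0.1,
    "objective": "reg:squarederror",
}
# dtrain would be an xgb.DMatrix built from the benchmark data
# booster = xgb.train(params, dtrain, num_boost_round=500)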

is_unbalance=true produces error.

Hi, I am performing binary classification with an unbalanced target. I would like to try the is_unbalance setting; however, when I set it to 'true' in my config file, I get the message "cannot find more split with gain = -inf , current #leaves=1". When I set it to false, it trains without an issue.
Also, I cannot tell from the configuration page how this parameter is used in the model. It would be great if you could advise. Config file here

With is_unbalance=true
[LightGBM] [Error] Feature Column_575 only contains one value, will be ignored
[LightGBM] [Error] Feature Column_577 only contains one value, will be ignored
[LightGBM] [Error] Feature Column_578 only contains one value, will be ignored
[LightGBM] [Info] Finish loading data, use 21.484110 seconds
[LightGBM] [Info] Number of postive:5503, number of negative:941495
[LightGBM] [Info] Number of data:946998, Number of features:1542
[LightGBM] [Info] Finish training initilization.
[LightGBM] [Info] Start train
[LightGBM] [Info] cannot find more split with gain = -inf , current #leaves=1
[LightGBM] [Info] Can't training anymore, there isn't any leaf meets split requirements.
[LightGBM] [Info] Finish train

With is_unbalance=false
[LightGBM] [Error] Feature Column_575 only contains one value, will be ignored
[LightGBM] [Error] Feature Column_577 only contains one value, will be ignored
[LightGBM] [Error] Feature Column_578 only contains one value, will be ignored
[LightGBM] [Info] Finish loading data, use 21.266789 seconds
[LightGBM] [Info] Number of postive:5503, number of negative:941495
[LightGBM] [Info] Number of data:946998, Number of features:1542
[LightGBM] [Info] Finish training initilization.
[LightGBM] [Info] Start train
[LightGBM] [Info] 0.547807 seconds elapsed, finished 1 iteration
[LightGBM] [Info] 0.918354 seconds elapsed, finished 2 iteration
[LightGBM] [Info] 1.394204 seconds elapsed, finished 3 iteration
[LightGBM] [Info] 1.808137 seconds elapsed, finished 4 iteration
[LightGBM] [Info] 2.188765 seconds elapsed, finished 5 iteration
[LightGBM] [Info] 2.593393 seconds elapsed, finished 6 iteration

Interpretation of LambdaRank results file

How should the results in the LambdaRank results file be interpreted?

I think the numbers are scores within the query group, and higher is better?

I assume we use the .query file to work out how many records are written for each query group?
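
A minimal sketch of that interpretation, assuming one score per test row in the results file and one group size per line in the .query file (file names here are placeholders):

# Read per-row scores and per-query group sizes
with open("LightGBM_predict_result.txt") as f:
    scores = [float(line) for line in f]
with open("rank.test.query") as f:
    group_sizes = [int(line) for line in f]

start = 0
for qid, size in enumerate(group_sizes):
    group = scores[start:start + size]
    # Higher score means the row is ranked higher within its own query group
    ranking = sorted(range(size), key=lambda i: group[i], reverse=True)
    print(f"query {qid}: row order {ranking}")
    start += size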

[Feature] Julia Package / Interface compatibility

It would be great if you would consider Julia's requirements when designing LightGBM's interface for R and Python. There's some quite extensive documentation on how Julia can call C and C++ code here: http://docs.julialang.org/en/release-0.5/manual/calling-c-and-fortran-code/#man-calling-c-and-fortran-code, but we can use this issue for discussion as well.

To get things started until LightGBM's interface is finished, I've built a Julia package that uses the current command-line interface to LightGBM, for anyone interested: https://github.com/Allardvm/LightGBM.jl.

Segfault in SparseBin() due to invalid data_indices[0]

Using the latest master, I occasionally (but not systematically) get a segfault while training a sizable database that I, unfortunately, am unable to share because it contains sensitive data.

Using a recompiled LightGBM with debug flags, I used the core dump to obtain the stack trace at the time of the segfault:

#0  LightGBM::SparseBin<unsigned char>::Split (this=0x2edd4c0, threshold=0,
data_indices=0x5984878,num_data=0, lte_indices=0x59ba000, gt_indices=0x5eab450)
    at /home/a_van_mossel/LightGBM/src/io/sparse_bin.hpp:62
#1  0x0000000000476276 in LightGBM::DataPartition::Split () at /home/a_van_mossel/LightGBM/src/treelearner/data_partition.hpp:112
#2  0x00007ff1e4c1d43e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#3  0x00007ff1e47e270a in start_thread (arg=0x7ff1dd201700) at pthread_create.c:333
#4  0x00007ff1e451882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

This suggests that the following line in sparse_bin.hpp#L62 is at fault:

const auto fast_pair = fast_index_[(data_indices[0]) >> fast_index_shift_];

Indeed, the core dump shows that, at this point in the trace, data_indices[0] = -461546632 and fast_index_shift_ = 14. The result of the right shift then becomes -28171, which is not a valid index into fast_index_ (the remainder of data_indices makes more sense, e.g. data_indices[1] = 32753 and data_indices[2] = 0).

My hunch is that this has something to do with the fact that num_data=0 at this point, but I have not (yet) been able to determine the root cause.

Do you have any ideas? I'd be happy to look up more symbols in the core dump if this helps.

[Feature] LightGBM_model.txt should also record internal node output values or SummedWeights in each leaf

Say one leaf is split and turns into internal node indexOfNewInternal; we could record something like

_previousLeafValue[indexOfNewInternal] = _leafValue[leaf];

and finally output previousLeafValue (with length NumInternalNodes) to LightGBM_model.txt.

Then we could use this info to write wrapper code that calculates each feature's importance for one specific instance prediction/inference, by summing up the value differences before and after each split along the inference path through the tree.

Another workaround might be to record SummedWeights in each leaf, so we can reconstruct the internal node output/value.

svmlight format leads to different models, because it ignores 0-value terms?

It seems that data in svmlight format leads to a different model than the same data in csv/tsv format, because the svmlight parser ignores the value=0 terms.

Take the data in LightGBM/examples/regression as an example. I wrote a Python notebook to compare different data formats under the same configuration.

The outputs for svmlight format data (steps 2 & 3) are the same, and they differ from the output for tsv format data (step 1); if I keep the 0-value terms in the svmlight format data (step 4), it shows the same result as the tsv format data (step 1).

Also, I tried adding if (fabs(val) > 1e-10) before out_features->emplace_back(idx, val); in the CSVParser/TSVParser classes (LightGBM/src/io/parser.hpp, lines 23 & 52). That shows similar performance to the svmlight format data.

libSVM format does not work for prediction anymore

I use dump_svmlight_file to dump an array for prediction.
Since two days ago, I have been getting the following errors when I try to predict from a dumped array.

[LightGBM] [Info] Data file: /tmp/tmpum15hb5w/X_to_pred.svm doesn't contain label column
[LightGBM] [Fatal] input format error, should be LibSVM
[LightGBM] [Fatal] input format error, should be LibSVM

It should not look for a label column, since we are in the prediction phase.
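
A minimal sketch of a possible workaround, assuming scikit-learn's dump_svmlight_file: write placeholder zero labels so the file stays in standard LibSVM format (label followed by index:value pairs):

import numpy as np
from sklearn.datasets import dump_svmlight_file

X = np.random.rand(100, 20)  # hypothetical prediction matrix

# dump_svmlight_file requires a label vector; zero placeholders keep the
# "label index:value ..." layout that the CLI parser expects
dump_svmlight_file(X, np.zeros(X.shape[0]), "/tmp/X_to_pred.svm")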

Gradient calculation doesn't match probability

Gradients
https://github.com/Microsoft/LightGBM/blob/master/src/objective/binary_objective.hpp#L63-L78
63 const score_t response = -2.0f * label * sigmoid_ / (1.0f + std::exp(2.0f * label * sigmoid_ * score[i]));
66 hessians[i] = abs_response * (2.0f * sigmoid_ - abs_response) * label_weight;
and line 75, 78
This is derived from the log loss L = log(1 + exp(-2 * label * score)) and prob = 1 / (1 + exp(-2 * score)).
But the following probability calculations don't contain the 2 * factor within exp().

Probability
https://github.com/Microsoft/LightGBM/blob/master/src/boosting/gbdt.cpp#L368
368 ret = 1.0 / (1.0 + std::exp(-sigmoid_ * ret));

https://github.com/Microsoft/LightGBM/blob/master/src/metric/binary_metric.hpp#L60-L68
60, 68 score_t prob = 1.0f / (1.0f + std::exp(-sigmoid_ * score[i]));

So you have two options: you can remove all the 2.0f * factors in the gradient calculation, or you can add 2.0f * within exp() in the probability calculation.
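
A small numerical check of this claim (a sketch, assuming sigmoid_ = 1.0): the gradient formula above is the derivative of the loss implied by p = 1 / (1 + exp(-2 * sigmoid * score)), not of the one implied by p = 1 / (1 + exp(-sigmoid * score)):

import math

sigmoid = 1.0            # assumed value of sigmoid_
score, label = 0.3, 1.0  # label in {-1, +1}

# Gradient as written in binary_objective.hpp
grad_code = -2.0 * label * sigmoid / (1.0 + math.exp(2.0 * label * sigmoid * score))

def neg_log_likelihood(prob_fn, s):
    p = prob_fn(s)
    return -math.log(p) if label > 0 else -math.log(1.0 - p)

prob_with_2 = lambda s: 1.0 / (1.0 + math.exp(-2.0 * sigmoid * s))
prob_without_2 = lambda s: 1.0 / (1.0 + math.exp(-sigmoid * s))

eps = 1e-6
for name, prob_fn in [("with 2*", prob_with_2), ("without 2*", prob_without_2)]:
    num_grad = (neg_log_likelihood(prob_fn, score + eps)
                - neg_log_likelihood(prob_fn, score - eps)) / (2 * eps)
    # Only the "with 2*" probability yields a numerical gradient matching grad_code
    print(f"{name}: numerical grad = {num_grad:.6f}, code grad = {grad_code:.6f}")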

What format should be used when data has missing values?

I tried to repeat the example using my own data.

There are missing values in the data, represented by the default value 'NA', and LightGBM gave an error message.

(No luck after replacing them with 'nan' either.)

"[LightGBM] finished load parameters
[LightGBM Error] meet error while parsing string to float, expect a nan here"
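
For the Python package, a minimal sketch of how missing values are usually handled (an assumption about current behaviour: np.nan is treated as missing, and use_missing is on by default):

import lightgbm as lgb
import numpy as np

# np.nan marks missing values in the feature matrix
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]] * 50)
y = np.arange(X.shape[0]) % 2

booster = lgb.train({"objective": "binary", "verbose": -1},
                    lgb.Dataset(X, label=y), num_boost_round=10)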

feature selection

Hi,
I want to use the code for feature selection. Can you give me some ideas, or give me an example?
Thanks
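
One possible approach (a sketch, assuming the scikit-learn wrapper and SelectFromModel; data is synthetic): train a LightGBM model and keep only the features whose importance exceeds a threshold.

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.random((500, 30))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Keep features whose importance is above the median importance
selector = SelectFromModel(LGBMClassifier(n_estimators=100), threshold="median")
selector.fit(X, y)
X_selected = selector.transform(X)
print("kept", X_selected.shape[1], "of", X.shape[1], "features")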

Early stopping not working anymore

A couple of days ago, early stopping was working.

In the binary_classification example, if you add early_stopping_round = 10 and increase num_trees, you won't get early stopping. Instead, you will see the following output:

[LightGBM] [Info] Iteration:998, training's auc: 1.000000
[LightGBM] [Info] Iteration:998, training's log loss: 0.004594
[LightGBM] [Info] 2.432232 seconds elapsed, finished 998 iteration
[LightGBM] [Info] cannot find more split with gain = -inf , current #leaves=16
[LightGBM] [Info] Iteration:999, training's auc: 1.000000
[LightGBM] [Info] Iteration:999, training's log loss: 0.004589
[LightGBM] [Info] 2.434642 seconds elapsed, finished 999 iteration
[LightGBM] [Info] cannot find more split with gain = -inf , current #leaves=16
[LightGBM] [Info] Iteration:1000, training's auc: 1.000000
[LightGBM] [Info] Iteration:1000, training's log loss: 0.004583
[LightGBM] [Info] 2.436162 seconds elapsed, finished 1000 iteration
[LightGBM] [Info] Finish train

LightGBM_model.txt is filled with all the trees up to num_trees (here num_trees=1000):

Tree=999
num_leaves=16
split_feature=4 22 19 17 17 17 10 2 2 14 6 1 4 14 17
split_gain=0.014389 0.0334668 0.0279201 0.0597537 0.0215986 0.0541655 0.0464277 0.0482441 0.0578334 0.0282108 0.0201576 0.051258 0.0497428 0.0360274 0.0349594
threshold=-0.987 1.0975 0.6725 0.9145 0.8825 0.7905 0.6615 -0.863 0.1885 -0.1445 0.5075 0.5715 -0.2875 0.0205 1.1995
left_child=1 2 3 -1 5 6 7 -2 -9 -10 12 -12 -6 14 -14
right_child=4 -3 -4 -5 10 -7 -8 8 9 -11 11 -13 13 -15 -16
leaf_parent=3 7 1 2 3 12 5 6 8 9 9 11 11 14 13 14
leaf_value=0.0134557 -0.0104802 -0.00468365 -0.00194244 -0.000967473 -0.00549971 -0.0104303 0.00554494 0.00615408 0.00031414 -0.009697 -0.00756818 0.00540046 0.00318853 0.000175487 0.014602
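
For reference, a hedged sketch of early stopping with the current Python package; note that early stopping needs a validation set to monitor (the data here is synthetic):

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2000, 10))
y = (X[:, 0] + rng.normal(0, 0.1, 2000) > 0.5).astype(int)
X_train, X_val, y_train, y_val = X[:1500], X[1500:], y[:1500], y[1500:]

train = lgb.Dataset(X_train, label=y_train)
valid = lgb.Dataset(X_val, label=y_val, reference=train)

# Training stops once the validation metric has not improved for 10 rounds
booster = lgb.train(
    {"objective": "binary", "metric": "binary_logloss", "verbose": -1},
    train,
    num_boost_round=1000,
    valid_sets=[valid],
    callbacks=[lgb.early_stopping(stopping_rounds=10)],
)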

log_file param not working

I am trying to use the log_file param (as seen at line 97 of config.h), but it does not seem to work: no log is written to the file (the file is not even created).

However, the model is trained successfully; I end up with the model file only.

Here is my train.conf:

task=train
application=regression
data="lgbm_train.csv"
valid="lgbm_val.csv"
num_iterations=10
early_stopping_rounds=10
learning_rate=0.1
num_leaves=127
tree_learner=serial
num_threads=2
histogram_pool_size=-1
min_data_in_leaf=100
min_sum_hessian_in_leaf=10
feature_fraction=1
feature_fraction_seed=2
bagging_fraction=1
bagging_freq=0
bagging_seed=3
max_bin=255
data_random_seed=1
data_has_label=true
output_model="lgbm_model.txt"
log_file="lgbm_log.txt"
is_sigmoid=true
is_pre_partition=false
is_sparse=true
two_round=false
save_binary=false
sigmoid=1
is_unbalance=false
max_position=20
label_gain=0,1,3,7,15,31,63
metric=l2
metric_freq=1
is_training_metric=true
ndcg_at=1,2,3,4,5
num_machines=1
local_listen_port=12400
time_out=120
  • Using commit: 1aefcd8 (it also does not work on commit d365762)
  • Operating System: Windows 8.1

I have looked for extra calls that use log_file, but did not find any in the source code (except in config.cpp). Is this because it is not yet implemented?

How to encode categorical features

I have a libsvm file containing a mix of continuous and categorical features, and I'm attempting to use LambdaRank on it. Here is a sample:

0.0 1:0.142095914742 7:1.0 302:1.0
0.0 1:0.20207253886 397:1.0 979:1.0
0.0 1:0.243902439024 68:1.0
1.0 1:0.142095914742 7:1.0 302:1.0
0.0 1:0.0689655172414 318:1.0 397:1.0
0.0 1:0.047619047619 45:1.0 323:1.0 531:1.0

When I train, I get the following:

[LightGBM] [Warning] Ignoring feature Column_33307, only has one value
[LightGBM] [Warning] Ignoring feature Column_33308, only has one value
[LightGBM] [Warning] Ignoring feature Column_33309, only has one value
[LightGBM] [Warning] Ignoring feature Column_33310, only has one value
[LightGBM] [Warning] Ignoring feature Column_33311, only has one value

I think this means it is ignoring my categorical features. How am I supposed to encode them?
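
With the Python package, a minimal sketch of the usual alternative to one-hot/libsvm encoding (an assumption about the current API: integer-encoded columns can be declared via categorical_feature; data here is synthetic):

import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "relevance": rng.random(n),           # continuous feature
    "site_id": rng.integers(0, 50, n),    # categorical feature, integer-encoded
})
y = ((df["relevance"] + (df["site_id"] % 2) * 0.3) > 0.7).astype(int)

# Declare the integer-encoded column as categorical instead of one-hot encoding it
train = lgb.Dataset(df, label=y, categorical_feature=["site_id"])
booster = lgb.train({"objective": "binary", "verbose": -1}, train, num_boost_round=20)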

Segmentation fault (core dumped)

I have big training data in libsvm format (11 GB), like:

-1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:1 15:1 16:1 17:1 18:1 19:1 20:1 21:1
-1 22:1 23:1 24:1 25:1 26:1 27:1 28:1 29:1 30:1 31:1 32:1 33:1 34:1 35:1 36:1 37:1 38:1 39:1 18:1 40:1 41:1
-1 42:1 43:1 44:1 3:1 45:1 46:1 47:1 48:1 49:1 50:1 10:1 51:1 52:1 53:1 54:1 55:1 56:1 38:1 18:1 57:1 58:1
+1 59:1 60:1 61:1 62:1 63:1 64:1 65:1 66:1 67:1 68:1 69:1 70:1 71:1 72:1 73:1 74:1 75:1 38:1 18:1 76:1 77:1
-1 78:1 79:1 80:1 3:1 81:1 82:1 83:1 84:1 85:1 86:1 10:1 87:1 88:1 89:1 90:1 91:1 92:1 93:1 94:1 95:1 96:1

and an instance with 26 GB of RAM. Even with use_two_round_loading = true, it runs out of RAM:

[LightGBM] [Info] Finished loading parameters
Killed

So I tried it with head -n10000 of the train and validation files.

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Saving data to binary file train_sample.libsvm
Segmentation fault (core dumped)

It looks like the training data was loaded, but then the program crashed. How can I track down the reason?

Problem building on osx

I have a problem building on OS X:

clang: error: unsupported option '-fopenmp'

which is related to http://stackoverflow.com/questions/36211018/clang-error-errorunsupported-option-fopenmp-on-mac-osx-el-capitan-buildin but cannot be fixed by setting

export CC=gcc-6
export CXX=g++
export MPICXX=mpicxx

What am I missing?

-- The C compiler identification is GNU 6.2.0
-- The CXX compiler identification is AppleClang 8.0.0.8000038
-- Checking whether C compiler has -isysroot
m-- Checking whether C compiler has -isysroot - yes
-- Checking whether C compiler supports OSX deployment target flag
-- Checking whether C compiler supports OSX deployment target flag - yes
-- Check for working C compiler: /usr/local/bin/gcc-6
a-- Check for working C compiler: /usr/local/bin/gcc-6 -- works
-- Detecting C compiler ABI info
ke-- Detecting C compiler ABI info - done
-- Detecting C compile features
 --- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/usr/bin/g++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/usr/bin/g++ -- works
-- Detecting CXX compiler ABI info
j-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found MPI_C: /usr/local/Cellar/open-mpi/2.0.1/lib/libmpi.dylib
-- Found MPI_CXX: /usr/local/Cellar/open-mpi/2.0.1/lib/libmpi.dylib
/usr/local/Cellar/open-mpi/2.0.1/lib/libmpi.dylib
/usr/local/Cellar/open-mpi/2.0.1/lib/libmpi.dylib
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/geoHeil/Dropbox/masterThesis/LightGBM/build
geoHeil:build geoHeil$ make -j
Scanning dependencies of target lightgbm
[  9%] Building CXX object src/CMakeFiles/lightgbm.dir/main.cpp.o
[  9%] Building CXX object src/CMakeFiles/lightgbm.dir/application/application.cpp.o
[ 13%] Building CXX object src/CMakeFiles/lightgbm.dir/io/dataset.cpp.o
[ 18%] Building CXX object src/CMakeFiles/lightgbm.dir/io/metadata.cpp.o
[ 22%] Building CXX object src/CMakeFiles/lightgbm.dir/boosting/gbdt.cpp.o
[ 27%] Building CXX object src/CMakeFiles/lightgbm.dir/io/config.cpp.o
[ 36%] Building CXX object src/CMakeFiles/lightgbm.dir/boosting/boosting.cpp.o
[ 36%] Building CXX object src/CMakeFiles/lightgbm.dir/io/parser.cpp.o
[ 40%] Building CXX object src/CMakeFiles/lightgbm.dir/io/bin.cpp.o
[ 45%] Building CXX object src/CMakeFiles/lightgbm.dir/io/tree.cpp.o
[ 50%] Building CXX object src/CMakeFiles/lightgbm.dir/metric/dcg_calculator.cpp.o
[ 54%] Building CXX object src/CMakeFiles/lightgbm.dir/metric/metric.cpp.o
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
make[2]: *** [src/CMakeFiles/lightgbm.dir/io/parser.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [src/CMakeFiles/lightgbm.dir/io/metadata.cpp.o] Error 1
clang: error: unsupported option '-fopenmp'
make[2]: *** [src/CMakeFiles/lightgbm.dir/boosting/gbdt.cpp.o] Error 1
make[2]: *** [src/CMakeFiles/lightgbm.dir/io/config.cpp.o] Error 1
make[2]: *** [src/CMakeFiles/lightgbm.dir/io/dataset.cpp.o] Error 1
make[2]: *** [src/CMakeFiles/lightgbm.dir/io/bin.cpp.o] Error 1
clang: error: unsupported option '-fopenmp'
make[2]: *** [src/CMakeFiles/lightgbm.dir/io/tree.cpp.o] Error 1
clang: clang: error: unsupported option '-fopenmp'error:
unsupported option '-fopenmp'
make[2]: *** [src/CMakeFiles/lightgbm.dir/boosting/boosting.cpp.o] Error 1
make[2]: *** [src/CMakeFiles/lightgbm.dir/metric/dcg_calculator.cpp.o] Error 1
clang: error: unsupported option '-fopenmp'
make[2]: *** [src/CMakeFiles/lightgbm.dir/metric/metric.cpp.o] Error 1
clang: error: unsupported option '-fopenmp'
[ 59%] Building CXX object src/CMakeFiles/lightgbm.dir/objective/objective_function.cpp.o
make[2]: *** [src/CMakeFiles/lightgbm.dir/application/application.cpp.o] Error 1
clang: error: unsupported option '-fopenmp'
make[2]: *** [src/CMakeFiles/lightgbm.dir/main.cpp.o] Error 1
clang: error: unsupported option '-fopenmp'
make[2]: *** [src/CMakeFiles/lightgbm.dir/objective/objective_function.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/lightgbm.dir/all] Error 2

libSVM data problem

LightGBM prints the log "[LightGBM] #data:18269614 #feature:43".
The data has 44 dimensions starting from 0, so why does the log say #feature:43? Is something wrong?

[Feature] R package

I was actually surprised to see that you don't have one already, given that MS owns Revolution Analytics.

Granted, some of the multi-node features will be hard to integrate into R, but OpenMPI-backed single-node training looks promising too, and I'd like to compare it to xgboost in our production workflow.

Let me know if you have internal plans for building an R package, or whether you're waiting for the community to pitch in.

Load lib_lightgbm.so in Python or R

When I use OpenMPI to build LightGBM, I cannot load lib_lightgbm.so in R or Python:

import ctypes
lib = ctypes.cdll.LoadLibrary('lib_lightgbm.so')

OSError: lib_lightgbm.so: undefined symbol: ompi_mpi_comm_world

library(Rcpp)
dyn.load("~/git/LightGBM/lib_lightgbm.so")

...lib_lightgbm.so: undefined symbol: ompi_mpi_comm_world

However, when I build LightGBM without OpenMPI, these methods can load lib_lightgbm.so as normal.

Python 2.7.6
R 3.2.5
Ubuntu 14.04 x64
mpirun (Open MPI) 2.0.1

categorical data

Hi,

So far, dummy coding categorical data has worked fine. However, with the latest update, I see
Feature 3050 only contains one value, will be ignored
in the logs for each encoded column.

This only seems to be a problem when is_unbalance=true is passed.
How should I handle categorical data for LightGBM so that it works fine?

Add max_depth option to deal with over-fitting

LightGBM grows trees leaf-wise. This gives good accuracy when #data is large.
However, it may cause over-fitting when #data is small.
Adding a max_depth option to limit the depth of the tree model can deal with this.
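
A minimal sketch of how the two controls would combine in a parameter set (values are illustrative, not tuned):

params = {
    "objective": "regression",
    "num_leaves": 31,        # leaf-wise growth: the primary complexity control
    "max_depth": 6,          # additionally cap tree depth to limit over-fitting on small data
    "min_data_in_leaf": 20,
}
# booster = lightgbm.train(params, train_dataset)  # train_dataset: a lightgbm.Dataset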

face verification

Hi,
After extracting high-dimensional LBP features from face regions, can I use LightGBM to further select the features used for face verification? Can you give some advice or examples?
Thanks

c_api & python: clarification

Hi,

I am working on using c_api.cpp to improve the current Python binding for LightGBM.
The implementation of c_api.cpp is really close to what already exists in xgboost, so it helps a lot in understanding your version; thanks for that.

I am planning on creating two classes that will be equivalent to Booster() and DMatrix(). However, I'm having a hard time understanding how to use some of the c_api.cpp functions.

I have created a gist to explain my problems (the script works if you want to try it):

  • L55: What does the reference parameter stand for?
  • L56: Is this the proper way to create validation data?
  • L73: How can I use LGBM_BoosterEval to evaluate both train and test data?

Thank you!

More flexible format of input data

Currently, LightGBM only supports a fixed input data format.
I want to add some new features (see the sketch after this list):

  1. support for data with a header
  2. the ability to specify the index/name of the label, query/group, and weight columns
  3. the ability to specify columns in the data that won't be used in training
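
A hedged sketch of what this could look like from the Python package, assuming the header / label_column / ignore_column parameters that LightGBM's data loader now accepts (the file and column names are hypothetical):

import lightgbm as lgb

# train.csv is assumed to have a header row, a "label" column, and an "id" column to skip
train = lgb.Dataset(
    "train.csv",
    params={
        "header": True,
        "label_column": "name:label",
        "ignore_column": "name:id",
    },
)
booster = lgb.train({"objective": "regression"}, train, num_boost_round=100)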

Using LightGBM with csv files

Hi,

I want to use CSV files, but I don't know how to convert them for LightGBM. Is there an easy way to do it?

Thanks in advance!
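
A minimal sketch of one way to do it with the Python package, assuming pandas is available and the CSV has a header with a "label" column (both assumptions; adjust to your file). The CLI can also read CSV/TSV files directly.

import lightgbm as lgb
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical file name
X, y = df.drop(columns=["label"]), df["label"]

booster = lgb.train({"objective": "regression"},
                    lgb.Dataset(X, label=y), num_boost_round=100)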

prediction results for classification are not probability?

With the latest version, prediction results for classification can have negative values (value < 0), but I want probabilities (0 to 1) as results.
(Working on an AWS EC2 AMI.)

git clone --recursive https://github.com/Microsoft/LightGBM ; cd LightGBM
git checkout 0702b2ffbff9148c30768f23eaffc84dcb45cae4 # Nov 11, 2016 change GetTrainnigScore to non-const. # https://github.com/Microsoft/LightGBM/commit/0702b2ffbff9148c30768f23eaffc84dcb45cae4
mkdir build ; cd build
cmake .. 
make -j 

cd ..
cp ./lightgbm examples/binary_classification/
cd examples/binary_classification/
./lightgbm config=train.conf
./lightgbm config=predict.conf
head LightGBM_predict_result.txt

0.571395
-0.057555
-1.24912
0.0390908
-0.949834
-0.764455
-0.656284
0.0431959
0.928821
-0.263096

With an older version (Nov 8, 2016), the prediction results for classification are probabilities.

git clone --recursive https://github.com/Microsoft/LightGBM ; cd LightGBM
git checkout 785398a2443be461bd0f58c680b285852113232e # Commits on Nov 8, 2016 https://github.com/Microsoft/LightGBM/commit/785398a2443be461bd0f58c680b285852113232e
mkdir build ; cd build
cmake .. 
make -j 

cd ..
cp ./lightgbm examples/binary_classification/
cd examples/binary_classification/
./lightgbm config=train.conf
./lightgbm config=predict.conf
head LightGBM_predict_result.txt

0.758192
0.471254
0.075981
0.519535
0.130146
0.178153
0.212058
0.521585
0.865022
0.371406
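
For reference, the raw scores above can be mapped back to the older probabilities with a sigmoid transform; a sketch, where the factor of 2 and sigmoid = 1 are inferred from the numbers in this report rather than confirmed from the source:

import numpy as np

# First few raw scores from the newer version's LightGBM_predict_result.txt
raw = np.array([0.571395, -0.057555, -1.24912, 0.0390908])

sigmoid = 1.0  # assumed value of the 'sigmoid' parameter in train.conf
for k in (1.0, 2.0):
    print(k, 1.0 / (1.0 + np.exp(-k * sigmoid * raw)))
# k = 2 reproduces the older outputs above (0.571395 -> ~0.758, -0.057555 -> ~0.471)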
