

GBM Performance

Performance of the most widely used open source gradient boosting machine (GBM) / gradient boosted decision tree (GBDT) implementations (h2o, xgboost, lightgbm, catboost) on the airline dataset (100K, 1M and 10M records), with 100 trees, depth 10 and learning rate 0.1.
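
For reference, these common settings correspond to e.g. the following xgboost call in R (a minimal sketch adapted from the benchmark code quoted in the issues further below; the CSV file names refer to the benchmark's airline data files):

suppressMessages({ library(data.table); library(xgboost); library(Matrix) })

d_train <- fread("train-1m.csv")     ## airline data, e.g. the 1M-row file
d_test  <- fread("test.csv")

X_all   <- sparse.model.matrix(dep_delayed_15min ~ . - 1, data = rbind(d_train, d_test))
X_train <- X_all[1:nrow(d_train), ]
y_train <- ifelse(d_train$dep_delayed_15min == "Y", 1, 0)

## 100 trees, depth 10, learning rate 0.1 (the settings used throughout the benchmark)
md <- xgb.train(params = list(objective = "binary:logistic", max_depth = 10, eta = 0.1,
                              tree_method = "hist"),
                data = xgb.DMatrix(X_train, label = y_train), nrounds = 100)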

Popularity of GBM implementations

Poll conducted via twitter (April, 2019):

More recent twitter poll (September, 2020):

June 2024 update:

How to run/reproduce the benchmark

Installing the latest software versions and running/timing the benchmark is easy and fully automated with docker:

CPU

(requires docker)

git clone https://github.com/szilard/GBM-perf.git
cd GBM-perf/cpu
sudo docker build --build-arg CACHE_DATE=$(date +%Y-%m-%d) -t gbmperf_cpu .
sudo docker run --rm gbmperf_cpu

GPU

(requires docker, NVIDIA drivers and the nvidia-docker utility)

git clone https://github.com/szilard/GBM-perf.git
cd GBM-perf/gpu
sudo docker build --build-arg CACHE_DATE=$(date +%Y-%m-%d) -t gbmperf_gpu .
sudo nvidia-docker run --rm gbmperf_gpu

Results

CPU

r4.8xlarge (32 cores, but run on physical cores only/no hyperthreading) with software as of 2024-06-04:

Tool Time[s] 100K Time[s] 1M Time[s] 10M AUC 1M AUC 10M
h2o 11 12 60 0.762 0.776
xgboost 0.4 2.7 40 0.749 0.757
lightgbm 2.3 4.0 20 0.765 0.792
catboost 1.9 7.0 70 0.734?! 0.735?!

Results on newer hardware (m7i/c7i/r7i) here (TLDR: ~2x speedup on the newer hardware).

GPU

p3.2xlarge (1 GPU, NVIDIA V100) with software as of 2024-06-06 (and CUDA 12.5):

Tool Time[s] 100K Time[s] 1M Time[s] 10M AUC 1M AUC 10M
h2o xgboost 6.4 14 42 0.749 0.756
xgboost 0.7 1.3 5 0.748 0.756
lightgbm 7 9 40 0.766 0.791
catboost 1.6 3.4 23 0.735 ?! 0.737 ?!

Results on newer hardware (A10/A100/H100) here (TLDR: more modest speedups compared to neural nets, ~1.3x for XGBoost on the largest data).

Additional results

Some additional studies were obtained "manually" (not fully automated with docker like the main benchmark above). Thanks @Laurae2 for lots of help with some of these.

Faster CPUs

AWS now has better CPUs than the r4.8xlarge (Xeon E5-2686 v4 2.30GHz, 32 cores), for example with a higher CPU frequency on c5.9xlarge (Xeon Platinum 8124M 3.00GHz, 36 cores) or with more cores on m5.12xlarge (Xeon Platinum 8175M 2.50GHz, 48 cores).

c5.9xlarge and m5.12xlarge are typically 20-50% faster than r4.8xlarge: for larger data more cores (m5.12xlarge) is best, while for smaller data a higher-frequency CPU (c5.9xlarge) is best. Nevertheless, the ranking of the libraries by training time stays the same for a given data size when changing CPU. More details here.

Even more recently, a CPU with both higher frequency and more cores became available on AWS: c5.12xlarge (Xeon Platinum 8275CL 3.00GHz, 48 cores), along with instances with 2 of these CPUs (but see the results for multi-socket systems below): c5.24xlarge and c5.metal. Results for c5.metal are here.

2024 update: latest results for the newest CPUs (c7i.metal-48xl and c7a.metal-48xl) are here.

Multi-core scaling (CPU)

While the trees of a GBM must be grown sequentially (building each tree depends on the results of the previous ones), GBM training can be parallelized, e.g. by parallelizing the computation in each split (more exactly the histogram calculations). Modern CPUs have many cores, but the scaling of these GBM implementations is far from proportional to the number of cores. Furthermore, it has been known for a long time (since 2016) that xgboost (and later lightgbm) slow down (!) on systems with 2 or more CPU sockets or when hyperthreaded cores are used. These problems have been mitigated very recently (2020), but it is still usually best to restrict the training process to the physical cores (avoid hyperthreading) and to only 1 CPU socket (if the server has 2 or more sockets), as in the sketch below.
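
In practice this means capping the thread count at the number of physical cores of one socket (or pinning the whole run, as the benchmark's Dockerfiles do with taskset -c 0-15). A minimal, self-contained R sketch of the thread-count knobs (the synthetic data and the core count of 16 are just placeholders):

suppressMessages({ library(xgboost); library(lightgbm) })

n_physical <- 16                          ## e.g. physical cores of one socket; adjust to your server

## tiny synthetic data just to make the sketch self-contained
X <- matrix(rnorm(1e4 * 10), ncol = 10)
y <- rbinom(1e4, 1, 0.2)

## xgboost: nthread parameter
md_xgb <- xgb.train(params = list(objective = "binary:logistic", max_depth = 10, eta = 0.1,
                                  tree_method = "hist", nthread = n_physical),
                    data = xgb.DMatrix(X, label = y), nrounds = 10)

## lightgbm: num_threads parameter (nthread is accepted as an alias)
md_lgb <- lgb.train(params = list(objective = "binary", num_leaves = 512, learning_rate = 0.1,
                                  num_threads = n_physical),
                    data = lgb.Dataset(X, label = y), nrounds = 10, verbose = 0)

## h2o: the thread count is set when starting the cluster, e.g. h2o.init(nthreads = n_physical)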

Even if only physical CPU cores (no hyperthreading) on 1 socket are used, the speedup from, for example, 1 core to 16 cores is not 16x, but (on r4.8xlarge):

data size h2o xgboost lightgbm catboost
0.1M 3x 6.5x 1.5x 3.5x
1M 8x 6.5x 4x 6x
10M 24x 5x 7.5x 8x

More details here. In fact, the scaling was worse until very recently: for example, xgboost was at 2.5x for 1M rows (vs 6.5x now) before several optimizations were implemented in 2020.

2024 update: latest results for the newest CPUs (c7i.metal-48xl and c7a.metal-48xl) with multicore scaling up to 48 physical cores (no hyperthreading) and beyond (with hyperthreading) are here.

Multi-socket CPUs

Nowadays most high-end servers have more than 1 CPU on the motherboard. For example, c5.18xlarge has 2 CPUs (2x the c5.9xlarge CPU mentioned above), and the same goes for r4.16xlarge or m5.24xlarge. There are even EC2 instances with 4 or more CPUs, e.g. x1.32xlarge (128 cores).

One would think that more CPU cores mean higher training speed, but because of RAM topology and NUMA, most of the above tools used to run slower on 2 CPUs than on 1 CPU (!) until very recently (2020). The slowdown was sometimes pretty dramatic, e.g. 2x for lightgbm or 3-5x for xgboost, even for the largest data in this benchmark. Very recently these effects have been mitigated by several optimizations in lightgbm and even more notably in xgboost. More details on the NUMA issue here, here and here.

Currently, the difference in training speed e.g. on r4.16xlarge (2 sockets, 16 cores + 16 HT each, so total of 64 cores) between 16 physical cores and 64 total cores is:

data size h2o xgboost lightgbm catboost
0.1M -40% -50% -70% 15%
1M -15% -2% -60% -20%
10M 25% 35% -20% 10%

where negative numbers mean that on 64 cores it is slower than on 16 cores by that much (e.g. -50% means a 50% decrease in speed, that is, a doubling of training time). These numbers were much worse until very recently (2020); for example, the training time (sec) for xgboost on 1M rows:

cores May 2019 Sept 2020
1 30 34
16 (1 socket) 12 5.1
64 (2 sockets + HT) 120 5.2

That is, xgboost was 10x slower on 64 cores vs 16 cores, and it was even slower on 64 cores than on 1 core (!). One can see that the recent optimizations have improved both the multi-core scaling and the NUMA (multi-socket) issue.

2024 update: latest results for the newest CPUs (c7i.metal-48xl and c7a.metal-48xl) with 192 cores (with 2 CPU sockets) are here.

100M records and RAM usage

Results on the fastest CPU (most cores on 1 socket, see above for why this is the fastest) and the fastest GPU on EC2. The data is obtained by replicating the 10M dataset 10x (see the sketch below), so the AUC is not indicative of a learning curve; it is just used to check that it is approximately equal to the 10M AUC (as it should be).

For the CPU runs, "RAM train" is measured as the increase in memory usage during training (on top of the RAM used by the data). For the GPU runs, "GPU memory" is the total GPU memory used (training cannot be separated from the copies of the data), while "extra RAM" is the additional RAM used by some of the tools on the CPU side, if any.
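
A minimal sketch of that 10x replication in R (the name of the 100M output file is just an example):

suppressMessages(library(data.table))

## stack 10 copies of the 10M-row airline file to get the 100M-row training set
d10  <- fread("train-10m.csv", showProgress = FALSE)
d100 <- rbindlist(rep(list(d10), 10))
fwrite(d100, "train-100m.csv")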

CPU (m5.12xlarge):

Tool   time [s] AUC RAM train [GB]
h2o 520 0.775 8
xgboost 510 0.751 15
lightgbm ohe 310 0.774 5
catboost 930 0.736 50

GPU (Tesla V100):

Tool   time [s] AUC GPU mem [GB] extra RAM [GB]
h2o xgboost 270 0.755 4 30
xgboost 80 0.756 6 0
lightgbm ohe 400 0.774 3 6
catboost crash (OOM)     >16 14

catboost GPU crashes out-of-memory on the 16GB GPU.

h2o xgboost on GPU is slower than native xgboost on GPU and also adds a lot of overhead in RAM usage ("extra RAM"); this must be due to some pre- and post-processing of the data in h2o, as one can see by looking at the GPU utilization patterns discussed next.

More details here.

2024 update: latest CPU results for the newest c7i.metal-48xl (using 48 physical cores on 1 socket, no hyperthreading, no NUMA/2 sockets):

Tool   time [s]
xgboost 190
lightgbm 55

The 2.7x speedup in CPU XGBoost (vs the results above) is due to the significant improvement in multi-core scaling (2020), further speed improvements, and also the increase in the number of cores of the top CPUs. The 5.6x speedup in CPU LightGBM is due to the implementation of direct handling of categorical variables (see the sketch below) and the increase in the number of cores of the top CPUs (LightGBM benefits more than XGBoost from many CPU cores on large datasets).
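
"Direct handling of categorical variables" means integer-encoding the categorical columns and declaring them as categorical to lightgbm instead of one-hot encoding them. A minimal sketch adapted from the code in the "lightgbm with categorical_feature encoding" issue further below (it assumes the label is the last column, as in the airline data):

suppressMessages({ library(data.table); library(lightgbm) })

d_train <- fread("train-10m.csv", showProgress = FALSE)
d_train$dep_delayed_15min <- ifelse(d_train$dep_delayed_15min == "Y", 1, 0)

## integer-encode the categorical columns and tell lightgbm which ones they are
d_enc     <- lgb.convert_with_rules(d_train)
cols_cats <- names(d_enc$rules)
p         <- ncol(d_enc$data) - 1

dlgb_train <- lgb.Dataset(data = as.matrix(d_enc$data[, 1:p]),
                          label = d_enc$data$dep_delayed_15min,
                          categorical_feature = cols_cats)

md <- lgb.train(params = list(objective = "binary", num_leaves = 512, learning_rate = 0.1),
                data = dlgb_train, nrounds = 100, verbose = 0)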

GPU utilization patterns

For the GPU runs, it is interesting to observe the GPU utilization patterns and also the CPU utilization in the meantime (usually 1 CPU thread).

xgboost uses GPU at ~80% and 1 CPU core at 100%.

h2o xgboost shows 3 phases: first only using CPU at ~30% (all cores) and no GPU, then GPU at ~70% and CPU at 100%, then no GPU and CPU at 100%. This means 3-4x longer training time vs native xgboost.

lightgbm uses GPU at 5-10% and meanwhile CPU at 100% (all cores). It can be made to use 1 CPU core only (nthread = 1), but then it may be slower.

catboost uses GPU at ~80% and 1 CPU core at 100%. Unlike the other tools, catboost takes all the available GPU memory when it starts training, regardless of the data size (so we cannot tell how much memory it actually needs using the standard monitoring tools).

More details here.

Spark MLlib

In my previous broader benchmark of ML libraries, Spark MLlib GBT (and random forest as well) performed very poorly (10-100x the running time of the top libraries, 10-100x the memory usage, and an accuracy issue for larger data) and therefore it was not included in the current GBM/GBT benchmark. However, people might still be interested in whether there have been any improvements since 2016 and Spark 2.0.

With Spark 2.4.2 (as of 2019-05-05) the accuracy issue for larger data has been fixed, but the speed and the memory footprint have not improved:

size time lgbm [s] time spark [s] ratio AUC lgbm AUC spark
100K 2.4 1020 425 0.730 0.721
1M 5.2 1380 265 0.764 0.748
10M 42 8390 200 0.774 0.755

(compared to lightgbm on CPU) (Spark code here)

So Spark MLlib GBT is still 100x slower than the top tools. In case you are wondering whether more nodes or bigger data would help, the answer is nope (see below).

Spark MLlib on 100M records and RAM usage

Besides being slow, Spark also uses 100x the RAM of the top tools. In fact, on 100M records (20GB after being loaded from disk and cached in RAM) it crashes out-of-memory even on servers with almost 1TB of RAM.

trees depth | 100M: time [s] AUC RAM [GB] | 10M: time [s] AUC RAM [GB]
1 1 | 1150 0.634 620 | 70 0.635 110
1 10 | 1350 0.712 620 | 90 0.712 112
10 10 | 7850 0.731 780 | 830 0.731 125
100 10 | crash (OOM) n/a >960 (OOM) | 8390 0.755 230

(100M ran on x1e.8xlarge [32 cores, 960GB RAM], 10M ran on r4.8xlarge [32 cores, 240GB RAM])

(compare this with lightgbm's 5GB RAM usage on 100M records with 100 trees of depth 10)

More details here.

Note that the situation is much better for linear models in Spark MLlib: only 3-4x slower and 10x the memory footprint vs h2o, for example; see the results here (and training linear models is much faster than training trees, so training times are reasonable even for large data).

Spark on a cluster

Results on an EMR cluster with a master + 10 slave nodes, and a comparison with local mode on 1 server (and a "cluster" with 1 master + 1 slave). To run in reasonable time, only 10 trees (depth 10) were used.

size nodes hw cores partitions time [s] RAM [GB] avail RAM [GB]
10M local r4.8xl 32 32 830 125 240
10M Cluster_1 r4.8xl 32 64 1180 73 240
10M Cluster_10 r4.8xl 320 320 (m) 330   2400
100M local x1e.8xl 32   7850 780 960
100M Cluster_10 r4.8xl 320 585 1825 10*72 2400

The 100M-record data is "big" enough for Spark to be in its "at scale" modus operandi. However, the computation speed and memory footprint inefficiencies of the algorithm/implementation are so huge that no cluster of any size can really help. Furthermore, larger data (billions of records) would mean even more prohibitively slow training (many hours/days) for any reasonable cluster size (remember, the timings above are for 10 trees; any decent GBM would need at least 100 trees).

Also, the fact that Spark has such a huge memory footprint means that one could run e.g. lightgbm instead on much less RAM, so even larger datasets would fit in the RAM of a single server. Results for lightgbm for comparison with the above Spark cluster results (10 trees):

size hw cores time [s] AUC RAM [GB] avail RAM [GB]
10M r4.8xl 16 (m) 7 0.743 4 240
100M r4.8xl 16 (m) 60 0.743 13(d)+5 240

More details here.

Recommendations

If you don't have a GPU, lightgbm and xgboost (CPU) train the fastest.

If you have a GPU, xgboost (GPU) is very fast (and depending on the data, your hardware etc. often faster than the above mentioned lightgbm/xgboost on CPU).

If you consider deployment, h2o has the best ways to deploy as a real-time (fast scoring) application.
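
For example, a trained h2o model can be exported as a MOJO and scored from a JVM service with low latency. A minimal, hedged sketch (the tiny iris model is just a stand-in for a real GBM):

suppressMessages(library(h2o))
h2o.init()

## tiny stand-in model just to make the sketch self-contained
hx <- as.h2o(iris)
md <- h2o.gbm(x = 1:4, y = 5, training_frame = hx, ntrees = 10)

## export the model as a MOJO (plus the genmodel scoring jar) for real-time scoring outside R
h2o.download_mojo(md, path = tempdir(), get_genmodel_jar = TRUE)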

Note, however, that there are many other criteria to consider when choosing which tool to use.

You can find more info in my talks at several conferences and meetups, many of which have video recordings available: for example my talk at Berlin Buzzwords in 2019 (video recording here, slides here), or a more recent talk from November 2020 at the LA Data Science Meetup (video recording here, slides here).

2024 updates in a university seminar talk, slides here.

gbm-perf's People

Contributors: daroczig, jameslamb, mluds, szilard


gbm-perf's Issues

Spark/xgboost integration

m5.2xlarge (8 cores, 30GB RAM)

for comparison:

On 8 cores (including HT):

0.1m:
h2o 11.805 0.7022567
xgboost 3.295 0.7324224
lightgbm 2.287 0.7298355
1m:
h2o 29.214 0.7623596
xgboost 12.52 0.7494959
lightgbm 6.962 0.7636987
10m:
h2o 291.868 0.7763336
xgboost 109.124 0.7551197
lightgbm 57.033 0.7742033

On 4 cores:

0.1m:
h2o 11.499 0.7022444
xgboost 3.023 0.7324224
lightgbm 2.004 0.7298355
1m:
h2o 33.062 0.7623495
xgboost 15.785 0.7494959
lightgbm 6.662 0.7636987
10m:
h2o 376.488 0.7763268
xgboost 126.148 0.7551197
lightgbm 61.678 0.7742033

Spark cluster

Previous results on a single server:

trees depth | 100M: time [s] AUC RAM [GB] | 10M: time [s] AUC RAM [GB]
1 1 | 1150 0.634 620 | 70 0.635 110
1 10 | 1350 0.712 620 | 90 0.712 112
10 10 | 7850 0.731 780 | 830 0.731 125
100 10 | crash (OOM) n/a >960 (OOM) | 8070 0.755 230

10M ran on:
r4.8xlarge (32 cores, 1 NUMA, 240GB RAM)

100M ran on:
x1e.8xlarge (32 cores, 1 NUMA, 960GB RAM)

issue to run the 'build' command

Hi
I tried to run 'sudo docker build --build-arg CACHE_DATE=$(date + '1') -t gbmperf_cpu .'. There is an error at step 8/15 when installing 'h2o'. The error is "
Content type 'application/x-tar' length 123701774 bytes (118.0 MB)

downloaded 118.0 MB

ERROR: dependency ‘RCurl’ is not available for package ‘h2o’

  • removing ‘/usr/local/lib/R/site-library/h2o’

The downloaded source packages are in
‘/tmp/downloaded_packages’
Warning message:
In install.packages(pkgs, ...) :
installation of package ‘h2o’ had non-zero exit status
"
After that, the process seems to continue, but there is no test result following step 15/15.

I am not sure what to do next after step 15/15 to get the results, or whether the test can't run because of the error.
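
For reference, the "RCurl is not available" error usually means the libcurl development headers are missing in the image, so the RCurl R package (an h2o dependency) cannot be built. A hedged sketch of the fix (assuming a Debian/Ubuntu base image), to be applied before installing h2o; the benchmark results themselves are then printed by the subsequent docker run step shown in the README above.

## system prerequisite on Debian/Ubuntu (add to the Dockerfile before the R install steps):
##   apt-get install -y libcurl4-openssl-dev
install.packages("RCurl", repos = "https://cloud.r-project.org")
install.packages("h2o",   repos = "https://cloud.r-project.org")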

Spark MLlib GBT 100M dataset

du -sm *.csv
467     train-10m.csv
47      train-1m.csv
5       train-0.1m.csv

du -sm *.parquet
2385    spark_ohe-train-100m.parquet
239     spark_ohe-train-10m.parquet
25      spark_ohe-train-1m.parquet
3       spark_ohe-train-0.1m.parquet


free -m
              total        used        free      shared  buff/cache   available
Mem:         245854         568      244920           8         365      244043


lscpu
CPU(s):                32


${SPARK_ROOT}/bin/spark-shell --master local[*] --driver-memory 220G --executor-memory 220G


import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val d_train = spark.read.parquet("spark_ohe-train.parquet").cache()
val d_test = spark.read.parquet("spark_ohe-test.parquet").cache()
(d_train.count(), d_test.count())


free -m
              total        used        free      shared  buff/cache   available
Mem:         245854       64579      178405           8        2868      180025



val rf = new GBTClassifier().setLabelCol("label").setFeaturesCol("features").
  setMaxIter(100).setMaxDepth(10).setStepSize(0.1).
  setMaxBins(100).setMaxMemoryInMB(10240)     // max possible setMaxMemoryInMB (otherwise errors out)
val pipeline = new Pipeline().setStages(Array(rf))

val now = System.nanoTime
val model = pipeline.fit(d_train)
val elapsed = ( System.nanoTime - now )/1e9
elapsed

c5.metal & multicore scaling

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:                        7
CPU MHz:                         1200.669
CPU max MHz:                     3900.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        6000.00
Virtualization:                  VT-x
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        48 MiB
L3 cache:                        71.5 MiB
NUMA node0 CPU(s):               0-23,48-71
NUMA node1 CPU(s):               24-47,72-95

Results on instances with a smaller number of cores: m5.4xlarge (16 cores = 8 + 8 HT)

On 4 cores:

0.1m:
xgboost 1 0.7338433
lightgbm 1.651 0.7174663
1m:
xgboost 7.46 0.7478858
lightgbm 5.267 0.7650181
10m:
xgboost 77.76 0.7539445
lightgbm 40.336 0.792273

On 8 cores:

0.1m:
xgboost 0.648 0.7338433
lightgbm 1.619 0.7174663
1m:
xgboost 4.465 0.7478858
lightgbm 3.669 0.7650181
10m:
xgboost 48.294 0.7539445
lightgbm 23.228 0.792273

On 16 cores (8+8HT):

0.1m:
xgboost 0.662 0.7338433
lightgbm 2.095 0.7174663
1m:
xgboost 3.955 0.7478858
lightgbm 4.019 0.7650181
10m:
xgboost 41.501 0.7539445
lightgbm 20.639 0.792273

CPU performance on different CPUs and multi-socket (NUMA) servers

c5.9xlarge:

This is a newer/faster CPU vs the one used in the main benchmark

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                36
On-line CPU(s) list:   0-35
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:              4
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-35
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat

Note: the Dockerfile needs to be changed to use cores 0-17 instead of 0-15.

0.1m:
h2o 8.731 0.7022154
xgboost 3.242 0.7324224
lightgbm 1.958 0.7298355
catboost 4.567 0.7225903
1m:
h2o 14.123 0.762363
xgboost 10.831 0.7494959
lightgbm 4.325 0.7636987
catboost 33.879 0.7402029
10m:
h2o 84.734 0.7763153
xgboost 59.556 0.7551197
lightgbm 32.721 0.7742033
catboost 341.073 0.7439026

2-socket instance c5.18xlarge:

Note: the Dockerfile needs to be changed to use cores 0-35 instead of 0-15.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:              3
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat

0.1m:
h2o 11.705 0.7022184
xgboost 58.622 0.7324224
lightgbm 5.495 0.7298355
catboost 7.589 0.7225903
1m:
h2o 17.387 0.7623644
xgboost 59.489 0.7494959
lightgbm 10.048 0.7636987
catboost 37.498 0.7402029
10m:
h2o 50.278 0.776314
xgboost 140.663 0.7551197
lightgbm 63.908 0.7742033
catboost 351.112 0.7439026

For comparison, r4.8xlarge (main benchmark):

0.1m:
h2o 16.206 0.7022242
xgboost 3.859 0.7324224
lightgbm 2.378 0.7298355
catboost 5.502 0.7225903
1m:
h2o 20.308 0.7623565
xgboost 11.433 0.7494959
lightgbm 5.217 0.7636987
catboost 49.431 0.7402029
10m:
h2o 100.612 0.7763295
xgboost 78.177 0.7551197
lightgbm 41.929 0.7742033
catboost 489.229 0.7439026

LightGBM weird multi-core scaling

On 10M rows c5.metal lightgbm is 2.5x faster on 2 cores vs 1 core. Any idea @guolinke why?

10:lightgbm:1:0::210.265:0.792273
10:lightgbm:1:0::210.205:0.792273
10:lightgbm:1:0::210.163:0.792273
...
10:lightgbm:2:0-1::84.973:0.792273
10:lightgbm:2:0-1::85.043:0.792273
10:lightgbm:2:0-1::84.996:0.792273

so 1 core is ~210 sec, 2 cores ~85 sec!

suppressMessages({
library(data.table)
library(ROCR)
library(lightgbm)
library(Matrix)
})

set.seed(123)

d_train <- fread("train.csv", showProgress=FALSE)
d_test <- fread("test.csv", showProgress=FALSE)

d_all <- rbind(d_train, d_test)
d_all$dep_delayed_15min <- ifelse(d_all$dep_delayed_15min=="Y",1,0)

d_all_wrules <- lgb.convert_with_rules(d_all)       
d_all <- d_all_wrules$data
cols_cats <- names(d_all_wrules$rules) 

d_train <- d_all[1:nrow(d_train)]
d_test <- d_all[(nrow(d_train)+1):(nrow(d_train)+nrow(d_test))]

p <- ncol(d_all)-1
dlgb_train <- lgb.Dataset(data = as.matrix(d_train[,1:p]), label = d_train$dep_delayed_15min)

cat(system.time({
  md <- lgb.train(data = dlgb_train, 
            objective = "binary", 
            nrounds = 100, num_leaves = 512, learning_rate = 0.1, 
            categorical_feature = cols_cats,
            verbose = 0)
})[[3]],":",sep="")


phat <- predict(md, data = as.matrix(d_test[,1:p]))
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
RUN wget https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv && \
    wget https://s3.amazonaws.com/benchm-ml--main/train-1m.csv && \
    wget https://s3.amazonaws.com/benchm-ml--main/train-10m.csv && \
    wget https://s3.amazonaws.com/benchm-ml--main/test.csv

GPU utilization patterns

xgboost:

1M:

[0] Tesla V100-SXM2-16GB | 29'C,   0 % |     0 / 16160 MB |
[0] Tesla V100-SXM2-16GB | 29'C,  62 % |  2278 / 16160 MB | root(2268M)
[0] Tesla V100-SXM2-16GB | 30'C,  80 % |  1096 / 16160 MB | root(1086M)
[0] Tesla V100-SXM2-16GB | 30'C,  80 % |  1096 / 16160 MB | root(1086M)
[0] Tesla V100-SXM2-16GB | 30'C,  79 % |  1096 / 16160 MB | root(1086M)
[0] Tesla V100-SXM2-16GB | 30'C,   0 % |  1098 / 16160 MB | root(1088M)
[0] Tesla V100-SXM2-16GB | 29'C,   0 % |     0 / 16160 MB |


03:10:48 PM  all    1.38    0.00    0.25    0.00    0.00    0.00    0.00    0.00    0.00   98.37
03:10:49 PM  all   10.50    0.00    3.38    0.00    0.00    0.00    0.00    0.00    0.00   86.12
03:10:50 PM  all   10.26    0.00    3.88    0.00    0.00    0.00    0.00    0.00    0.00   85.86
03:10:51 PM  all   10.78    0.00    2.13    0.00    0.00    0.00    0.00    0.00    0.00   87.09
03:10:52 PM  all   11.36    0.00    2.50    0.00    0.00    0.00    0.00    0.00    0.00   86.14
03:10:53 PM  all   11.26    0.00    3.00    0.00    0.00    0.00    0.00    0.00    0.00   85.73
03:10:54 PM  all    7.18    0.00    2.52    0.13    0.00    0.00    0.13    0.00    0.00   90.05


10M:

[0] Tesla V100-SXM2-16GB | 26'C,   0 % |     0 / 16160 MB |
[0] Tesla V100-SXM2-16GB | 27'C,  42 % |  2624 / 16160 MB | root(2614M)
[0] Tesla V100-SXM2-16GB | 27'C,  77 % |  2624 / 16160 MB | root(2614M)
[0] Tesla V100-SXM2-16GB | 28'C,  76 % |  2624 / 16160 MB | root(2614M)
[0] Tesla V100-SXM2-16GB | 28'C,  73 % |  2624 / 16160 MB | root(2614M)
[0] Tesla V100-SXM2-16GB | 28'C,  80 % |  2624 / 16160 MB | root(2614M)
[0] Tesla V100-SXM2-16GB | 29'C,  86 % |  1612 / 16160 MB | root(1602M)
[0] Tesla V100-SXM2-16GB | 30'C,  86 % |  1612 / 16160 MB | root(1602M)
[0] Tesla V100-SXM2-16GB | 30'C,  87 % |  1612 / 16160 MB | root(1602M)
[0] Tesla V100-SXM2-16GB | 30'C,  86 % |  1612 / 16160 MB | root(1602M)
[0] Tesla V100-SXM2-16GB | 30'C,  86 % |  1612 / 16160 MB | root(1602M)
[0] Tesla V100-SXM2-16GB | 30'C,  85 % |  1612 / 16160 MB | root(1602M)
[0] Tesla V100-SXM2-16GB | 31'C,  86 % |  1612 / 16160 MB | root(1602M)
[0] Tesla V100-SXM2-16GB | 29'C,   0 % |     0 / 16160 MB |
08:41:22 PM  all    1.38    0.00    0.25    0.00    0.00    0.00    0.00    0.00    0.00   98.37
08:41:23 PM  all    6.75    0.00    0.38    0.00    0.00    0.00    0.12    0.00    0.00   92.75
08:41:24 PM  all   11.50    0.00    2.00    0.00    0.00    0.00    0.00    0.00    0.00   86.50
08:41:25 PM  all   10.62    0.00    2.88    0.00    0.00    0.00    0.00    0.00    0.00   86.50
08:41:26 PM  all   11.53    0.00    2.63    0.00    0.00    0.00    0.00    0.00    0.00   85.84
08:41:27 PM  all   11.62    0.00    2.38    0.00    0.00    0.00    0.00    0.00    0.00   86.00
08:41:28 PM  all   11.79    0.00    2.26    0.00    0.00    0.00    0.00    0.00    0.00   85.95
08:41:29 PM  all   11.62    0.00    2.62    0.00    0.00    0.00    0.00    0.00    0.00   85.75
08:41:30 PM  all   11.38    0.00    2.75    0.00    0.00    0.00    0.00    0.00    0.00   85.88
08:41:31 PM  all   10.28    0.00    3.76    0.00    0.00    0.00    0.00    0.00    0.00   85.96

08:41:31 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:41:32 PM  all    9.75    0.00    4.50    0.00    0.00    0.00    0.00    0.00    0.00   85.75
08:41:33 PM  all    9.64    0.00    3.50    0.00    0.00    0.00    0.00    0.00    0.00   86.86
08:41:34 PM  all    9.91    0.00    3.39    0.00    0.00    0.00    0.00    0.00    0.00   86.70
08:41:35 PM  all   10.74    0.00    3.50    0.00    0.00    0.00    0.00    0.00    0.00   85.77
08:41:36 PM  all   10.14    0.00    4.01    0.00    0.00    0.00    0.00    0.00    0.00   85.86
08:41:37 PM  all   10.34    0.00    3.78    0.00    0.00    0.00    0.00    0.00    0.00   85.88
08:41:38 PM  all    1.39    0.00    1.01    0.13    0.00    0.00    0.00    0.00    0.00   97.48

Intel vs AMD

xgboost, 1M records dataset, AWS EC2 m5 vs m5a:

m5.8xlarge

Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
CPU MHz:               2499.998
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-31

taskset -c ... R ...

Time [s]:

0   26.729
0-1  22.222
0-3  14.65
0-7  11.837  
0-15  11.768
all 18.705

m5a.8xlarge

Model name:            AMD EPYC 7571
CPU MHz:               2199.986
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

Time [s]:

0  30.911
0-1  20.355
0-3  14.88
0-7  14.607
0-7,16-23   20.225
0-15  23.155
all 31.051

Dask lightgbm

m5.4xlarge 16c (8+8HT)

1M rows

integer encoding for simplicity

CPU Single threaded performance

This might be relevant for training lots of models (100s, 1000s...) on smaller data: running them in parallel with 1 model per CPU core would probably be the most efficient approach if the data is small and all the datasets (or, if training on the same data, multiple copies of it) fit in RAM (see the sketch below).
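
A minimal sketch of that pattern: many small datasets (faked here with synthetic data), one single-threaded xgboost model per physical core, trained in parallel with mclapply (which forks, so this is Linux-only):

suppressMessages({ library(xgboost); library(parallel) })

n_cores <- detectCores(logical = FALSE)       ## physical cores

## fake "many small datasets" just to make the sketch self-contained
datasets <- lapply(1:100, function(i) {
  list(X = matrix(rnorm(1e4 * 10), ncol = 10), y = rbinom(1e4, 1, 0.2))
})

## 1 single-threaded model per core, many models trained in parallel
models <- mclapply(datasets, function(d) {
  xgb.train(params = list(objective = "binary:logistic", max_depth = 10, eta = 0.1,
                          tree_method = "hist", nthread = 1),   ## important: 1 thread per model
            data = xgb.DMatrix(d$X, label = d$y), nrounds = 100)
}, mc.cores = n_cores)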

aGTBoost

New implementation aGTBoost https://github.com/Blunde1/agtboost

library(data.table)
library(ROCR)
library(xgboost)
library(agtboost)
library(Matrix)

set.seed(123)

d_train <- fread("https://github.com/szilard/benchm-ml--data/raw/master/train-0.1m.csv")
d_test <- fread("https://github.com/szilard/benchm-ml--data/raw/master/test.csv")


## xgboost sparse

X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]
y_train <- ifelse(d_train$dep_delayed_15min=='Y',1,0)

dxgb_train <- xgb.DMatrix(data = X_train, label = y_train)

system.time({
  md <- xgb.train(data = dxgb_train, 
                  objective = "binary:logistic", 
                  nround = 100, max_depth = 10, eta = 0.1, 
                  tree_method = "hist")
})
# user  system elapsed 
# 95.708   0.733  16.304 

system.time({
phat <- predict(md, newdata = X_test)
})
# user  system elapsed 
# 1.002   0.000   0.134 
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
# 0.7324224


## xgboost dense

X_train_test <- model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]
y_train <- ifelse(d_train$dep_delayed_15min=='Y',1,0)

dxgb_train <- xgb.DMatrix(data = X_train, label = y_train)

system.time({
  md <- xgb.train(data = dxgb_train, 
                  objective = "binary:logistic", 
                  nround = 100, max_depth = 10, eta = 0.1, 
                  tree_method = "hist")
})
# user  system elapsed 
# 109.702   1.095  17.444 

system.time({
  phat <- predict(md, newdata = X_test)
})
# user  system elapsed 
# 2.287   0.377   1.262
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
# 0.7324224 


## agtboost 

system.time({
md <- gbt.train(y = y_train, x = X_train, loss_function = "logloss", 
                learning_rate = 0.1, nrounds = 10000, verbose = 1)
})
# it: 147  |  n-leaves: 3  |  tr loss: 0.426  |  gen loss: 0.4333
# it: 148  |  n-leaves: 2  |  tr loss: 0.426  |  gen loss: 0.4333
# user   system  elapsed 
# 1643.150    0.535 1643.348 

system.time({
phat <- predict(md, X_test)
})
# user  system elapsed 
# 21.172   0.205  21.374
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
# 0.7247399 

Problem with GPU version of XGBoost

Hi @szilard, I'm trying to create a similar benchmark. I'm having problems compiling the latest version of xgboost. I used yesterday's commit:

commit 65d2513714b4d3c4f5e8d0ebad607b1f81b46379
Author: wxchan <[email protected]>
Date:   Mon Jun 12 21:33:42 2017 +0800
    [python-package] fix sklearn n_jobs/nthreads and seed/random_state bug  (#2378)
    * add a testcase causing RuntimeError
    * move seed/random_state/nthread/n_jobs check to get_xgb_params()
    * fix failed test

Could you share the commit you used?

100 million data results

Using hardware from here: #12

Using dmlc/xgboost@84d992b and microsoft/LightGBM@5ece53b

The 100M dataset was obtained by replicating the 10M data 10x.

CPU:

Tool Size Time [s] AUC
xgb 0.1m 4.181 0.7324224
xgb 1m 15.978 0.7494959
xgb 10m 104.598 0.7551197
xgb 100m 673.861 irrelevant
lgb 0.1m 1.763 0.7298355
lgb 1m 4.253 0.7636987
lgb 10m 38.197 0.7742033
lgb 100m 599.396 irrelevant

1x Quadro P1000:

Tool Size Time [s] AUC
xgb 0.1m 17.529 0.7328954
xgb 1m 38.528 0.7499591
xgb 10m 103.154 0.7564821
xgb 100m CRASH irrelevant
lgb 0.1m 18.345 0.7298129
lgb 1m 22.179 0.7640155
lgb 10m 62.929 0.774168
lgb 100m 396.233 irrelevant

4x Quadro P1000:

Tool Size Time [s] AUC
xgb 0.1m 18.838 0.7324756
xgb 1m 36.877 0.749169
xgb 10m 64.994 0.7564492
xgb 100m 232.947 irrelevant

RAM usage:

LightGBM: 2739 MB on GPU
xgboost 1 GPU: CRASH
xgboost 4 GPUs: 2077 MB on each GPU

Run on p4d.24xlarge with A100-SXM4-40GB GPU

CUDA 11.2.0

h2o:

 [1] "water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: XGBoost_model_R_1611054589962_1.  Details: ERRR on field: _backend: GPU backend (gpu_id: 0) is not functional. Check CUDA_PATH and/or GPU installation.\n"

XGBoost:

xgboost [11:11:59] WARNING: /xgboost/src/learner.cc:222: No visible GPU is found, setting `gpu_id` to -1
Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
  [11:11:59] /xgboost/src/gbm/gbtree.cc:511: Check failed: common::AllVisibleGPUs() >= 1 (0 vs. 1) : No visible GPU is found for XGBoost.

Lightgbm:

lightgbm Error in lgb.last_error() : api error: No OpenCL device found
Error in initialize(...) : lgb.Booster: cannot create Booster handle
Calls: cat ... system.time -> lgb.train -> <Anonymous> -> initialize

catboost:

catboost Error in catboost.train(learn_pool = dx_train, test_pool = NULL, params = params) :
  catboost/cuda/cuda_lib/cuda_base.h:281: CUDA error 802: system not yet initialized
Calls: cat -> system.time -> catboost.train

nvidia-smi -i 0
Tue Jan 19 11:09:30 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:10:1C.0 Off |                    0 |
| N/A   28C    P0    41W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Dask xgboost

m5.4xlarge 16c (8+8HT)

1M rows

integer encoding for simplicity

Spark MLlib GBT 4d (for demos)

100 trees, depth 10, 100K data, 32 cores (1 node) runs for 17 mins. Ugh!

How can we make Spark MLlib GBT work fast enough for a demo?

Smaller data? Fewer trees? Less depth? More cores? More nodes? Let's help the Spark fans...

Add Microsoft EBM (Explainable Boosting Machine)

Microsoft pre-released the Explainable Boosting Machine 2 weeks ago:

https://github.com/microsoft/interpret

It has a very promising performance profile:

Dataset/AUROC Domain Logistic Regression Random Forest XGBoost Explainable Boosting Machine
Adult Income Finance .907±.003 .903±.002 .922±.002 .928±.002
Heart Disease Medical .895±.030 .890±.008 .870±.014 .916±.010
Breast Cancer Medical .995±.005 .992±.009 .995±.006 .995±.006
Telecom Churn Business .804±.015 .824±.002 .850±.006 .851±.005
Credit Fraud Security .979±.002 .950±.007 .981±.003 .975±.005

lightgbm: better matching hyperparams

The h2o and xgboost runs seem to be depth-wise with max_depth=10, while lightgbm is run leaf-wise with max_leaves=1024.

As a result, the speed of lightgbm GPU is not directly comparable with that of xgboost and h2o (see the depth-capped lightgbm sketch below).
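
One way to make the runs more comparable is to also cap lightgbm's tree depth, as done in the "Random Forests" issue further below with max_depth = 10 and a high num_leaves ceiling. A minimal, self-contained sketch with synthetic data:

suppressMessages(library(lightgbm))

## tiny synthetic data just to make the sketch self-contained
X <- matrix(rnorm(1e4 * 10), ncol = 10)
y <- rbinom(1e4, 1, 0.2)

## cap the tree depth at 10 (with a high num_leaves ceiling) to mimic depth-wise growth
md <- lgb.train(params = list(objective = "binary", learning_rate = 0.1,
                              max_depth = 10, num_leaves = 2^17),
                data = lgb.Dataset(X, label = y), nrounds = 100, verbose = 0)

On the 1M airline data this depth-capped setting trains in roughly the same time as the default leaf-wise one (see the timings in the "Random Forests" issue).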

catboost ordered vs plain


Start R interactively in docker with all stuff installed:

docker run -it gbmperf_cpu taskset -c 0-15 R

suppressMessages({
library(data.table)
library(ROCR)
library(catboost)
})

set.seed(123)

d_train <- fread("train-0.1m.csv", showProgress=FALSE, stringsAsFactors=TRUE)
d_test <- fread("test.csv", showProgress=FALSE, stringsAsFactors=FALSE)   ## to match factors in train and test with bind

d_train_test <- rbind(d_train, d_test)
p <- ncol(d_train_test)-1

d_train_test$dep_delayed_15min <- ifelse(d_train_test$dep_delayed_15min=="Y",1,0)   ## need numeric y

d_train <- d_train_test[(1:nrow(d_train)),]
d_test <-  d_train_test[(nrow(d_train)+1):(nrow(d_train)+nrow(d_test)),]


dx_train <- catboost.load_pool(d_train[,1:p], label = d_train$dep_delayed_15min)
dx_test  <- catboost.load_pool(d_test[,1:p])


params <- list(iterations = 100, depth = 10, learning_rate = 0.1,
   verbose = 0)
cat(system.time({
  md <- catboost.train(learn_pool = dx_train, test_pool = NULL, params = params)
})[[3]]," ",sep="")


phat <- catboost.predict(md, dx_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")

Run with defaults:

> params <- list(iterations = 100, depth = 10, learning_rate = 0.1,
+    verbose = 0)
> cat(system.time({
+   md <- catboost.train(learn_pool = dx_train, test_pool = NULL, params = params)
+ })[[3]]," ",sep="")
Dataset is provided, but PredictionValuesChange feature importance don't use it, since non-empty LeafWeights in model.
5.367 >
>
> phat <- catboost.predict(md, dx_test)
> rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
> cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
0.7225903

Run "plain":

>
> params <- list(iterations = 100, depth = 10, learning_rate = 0.1,
+    boosting_type = "Plain",
+    verbose = 0)
> cat(system.time({
+   md <- catboost.train(learn_pool = dx_train, test_pool = NULL, params = params)
+ })[[3]]," ",sep="")
Dataset is provided, but PredictionValuesChange feature importance don't use it, since non-empty LeafWeights in model.
5.231 >
>
> phat <- catboost.predict(md, dx_test)
> rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
> cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
0.7225903

Run "ordered":

> params <- list(iterations = 100, depth = 10, learning_rate = 0.1,
+    boosting_type = "Ordered",
+    verbose = 0)
> cat(system.time({
+   md <- catboost.train(learn_pool = dx_train, test_pool = NULL, params = params)
+ })[[3]]," ",sep="")
Dataset is provided, but PredictionValuesChange feature importance don't use it, since non-empty LeafWeights in model.
5.106 >
>
> phat <- catboost.predict(md, dx_test)
> rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
> cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
0.7193254

For 1M dataset:

> params <- list(iterations = 100, depth = 10, learning_rate = 0.1,
+    verbose = 0)
> cat(system.time({
+   md <- catboost.train(learn_pool = dx_train, test_pool = NULL, params = params)
+ })[[3]]," ",sep="")
Dataset is provided, but PredictionValuesChange feature importance don't use it, since non-empty LeafWeights in model.
50.586 >
>
> phat <- catboost.predict(md, dx_test)
> rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
> cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
0.7402029
>
>
>
> params <- list(iterations = 100, depth = 10, learning_rate = 0.1,
+    boosting_type = "Plain",
+    verbose = 0)
> cat(system.time({
+   md <- catboost.train(learn_pool = dx_train, test_pool = NULL, params = params)
+ })[[3]]," ",sep="")
Dataset is provided, but PredictionValuesChange feature importance don't use it, since non-empty LeafWeights in model.
50.202 >
>
> phat <- catboost.predict(md, dx_test)
> rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
> cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
0.7402029
>
>
>
>
> params <- list(iterations = 100, depth = 10, learning_rate = 0.1,
+    boosting_type = "Ordered",
+    verbose = 0)
> cat(system.time({
+   md <- catboost.train(learn_pool = dx_train, test_pool = NULL, params = params)
+ })[[3]]," ",sep="")
Dataset is provided, but PredictionValuesChange feature importance don't use it, since non-empty LeafWeights in model.
40.7 >
>
> phat <- catboost.predict(md, dx_test)
> rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
> cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
0.7335985

lightgbm with categorical_feature encoding (instead of OHE)

Instead of OHE (with sparse.model.matrix) we can use lightgbm's special encoding, in which the data is stored as integers but is treated as categorical:

OHE:

X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]

dlgb_train <- lgb.Dataset(data = X_train, label = ifelse(d_train$dep_delayed_15min=='Y',1,0))


cat(system.time({
  md <- lgb.train(data = dlgb_train, 
            objective = "binary", 
            nrounds = 100, num_leaves = 512, learning_rate = 0.1, 
            verbose = 0)
})[[3]]," ",sep="")

cat.enc:

d_all <- rbind(d_train, d_test)
d_all$dep_delayed_15min <- ifelse(d_all$dep_delayed_15min=="Y",1,0)

d_all_wrules <- lgb.convert_with_rules(d_all)       
d_all <- d_all_wrules$data
cols_cats <- names(d_all_wrules$rules) 

d_train <- d_all[1:nrow(d_train)]
d_test <- d_all[(nrow(d_train)+1):(nrow(d_train)+nrow(d_test))]

p <- ncol(d_all)-1
dlgb_train <- lgb.Dataset(data = as.matrix(d_train[,1:p]), label = d_train$dep_delayed_15min)

cat(system.time({
  md <- lgb.train(data = dlgb_train, 
            objective = "binary", 
            nrounds = 100, num_leaves = 512, learning_rate = 0.1, 
            categorical_feature = cols_cats,
            verbose = 0)
})[[3]]," ",sep="")

The main diff:

OHE:

X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))

  md <- lgb.train(data = dlgb_train, 
            nrounds = 100, num_leaves = 512, learning_rate = 0.1, 

cat.enc:

d_all_wrules <- lgb.convert_with_rules(d_all)       
d_all <- d_all_wrules$data
cols_cats <- names(d_all_wrules$rules) 

cat(system.time({
  md <- lgb.train(data = dlgb_train, 
            nrounds = 100, num_leaves = 512, learning_rate = 0.1, 
            categorical_feature = cols_cats,

Full code here: https://github.com/szilard/GBM-perf/tree/master/wip-testing/lightgbm-catenc/cpu/run

Timings [sec] and AUC:

CPU r4.8xlarge:

0.1m:
lightgbm OHE 2.107 0.7301411
lightgbm catenc 2.137 0.7174663
1m:
lightgbm OHE 3.998 0.7655526
lightgbm catenc 4.058 0.7650181
10m:
lightgbm OHE 20.749 0.7745457
lightgbm catenc 20.845 0.792273

Spark logistic regression (for comparison)

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val d_train = spark.read.parquet("spark_ohe-train.parquet").cache()
val d_test = spark.read.parquet("spark_ohe-test.parquet").cache()
(d_train.count(), d_test.count())

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))

val now = System.nanoTime
val model = pipeline.fit(d_train)
val elapsed = ( System.nanoTime - now )/1e9
elapsed

val predictions = model.transform(d_test)

val evaluator = new BinaryClassificationEvaluator().setLabelCol("label").setRawPredictionCol("probability").setMetricName("areaUnderROC")
evaluator.evaluate(predictions)

Can MLDB do GBM?

@nicolaskruchten @jeremybarnes

Can MLDB do gradient boosting machines (GBM) similar to xgboost (with shrinkage/learning rate)?

I was trying this:

        "configuration": {
            "type": "boosting",
            "validation_split": 0,
            "min_iter": 100,
            "max_iter": 100,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 10,
                "random_feature_propn": 1
            }
        },

(complete code here)

I cannot find whether MLDB boosting has a learning_rate parameter.

Also, the above code runs very slowly: about 100 sec vs 20 sec for the random forest previously benchmarked here. htop or mpstat shows about 15% CPU utilization.

Random Forests

c5.9xlarge (18 cores, HT off):

1M:

Lightgbm:

suppressMessages({
library(data.table)
library(ROCR)
library(lightgbm)
library(Matrix)
})

set.seed(123)

d_train <- fread("train-1m.csv", showProgress=FALSE)
d_test <- fread("test.csv", showProgress=FALSE)

d_all <- rbind(d_train, d_test)
d_all$dep_delayed_15min <- ifelse(d_all$dep_delayed_15min=="Y",1,0)

d_all_wrules <- lgb.convert_with_rules(d_all)       
d_all <- d_all_wrules$data
cols_cats <- names(d_all_wrules$rules) 

d_train <- d_all[1:nrow(d_train)]
d_test <- d_all[(nrow(d_train)+1):(nrow(d_train)+nrow(d_test))]

p <- ncol(d_all)-1
dlgb_train <- lgb.Dataset(data = as.matrix(d_train[,1:p]), label = d_train$dep_delayed_15min)

auc <- function() {
  phat <- predict(md, data = as.matrix(d_test[,1:p]))
  rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
  cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
}


system.time({
  md <- lgb.train(data = dlgb_train, 
            objective = "binary", 
            nrounds = 100, num_leaves = 512, learning_rate = 0.1, 
            categorical_feature = cols_cats,
            verbose = 2)
})
auc()


system.time({
  md <- lgb.train(data = dlgb_train, 
            objective = "binary", 
            nrounds = 100, max_depth = 10, num_leaves = 2**17, 
            boosting_type = "rf", bagging_freq = 1, bagging_fraction = 0.632, feature_fraction = 1/sqrt(p),
            categorical_feature = cols_cats,
            verbose = 2)
})
auc()

Results:

GBM:

> system.time({
+   md <- lgb.train(data = dlgb_train,
+             objective = "binary",
+             nrounds = 100, num_leaves = 512, learning_rate = 0.1,
+             categorical_feature = cols_cats,
+             verbose = 2)
+ })
[LightGBM] [Info] Number of positive: 192982, number of negative: 807018
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000818
[LightGBM] [Debug] init for col-wise cost 0.000006 seconds, init for row-wise cost 0.004295 seconds
[LightGBM] [Debug] col-wise cost 0.006771 seconds, row-wise cost 0.000792 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007568 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Dense Multi-Val Bin
[LightGBM] [Info] Total Bins 1095
[LightGBM] [Info] Number of data points in the train set: 1000000, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.192982 -> initscore=-1.430749
[LightGBM] [Info] Start training from score -1.430749
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 16
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 16
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 17
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 16
...
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 23
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 25
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 22
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 21
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 21
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 24
[LightGBM] [Debug] Trained a tree with leaves = 512 and max_depth = 21
   user  system elapsed
 57.506   0.191   3.258
> auc()
0.7650181
> system.time({
+   md <- lgb.train(data = dlgb_train,
+             objective = "binary",
+             nrounds = 100, max_depth = 10, num_leaves = 2**17, learning_rate = 0.1,
+             categorical_feature = cols_cats,
+             verbose = 2)
+ })
[LightGBM] [Info] Number of positive: 192982, number of negative: 807018
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000818
[LightGBM] [Debug] init for col-wise cost 0.000008 seconds, init for row-wise cost 0.004492 seconds
[LightGBM] [Debug] col-wise cost 0.007450 seconds, row-wise cost 0.000537 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007995 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Dense Multi-Val Bin
[LightGBM] [Info] Total Bins 1095
[LightGBM] [Info] Number of data points in the train set: 1000000, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.192982 -> initscore=-1.430749
[LightGBM] [Info] Start training from score -1.430749
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 876 and max_depth = 10
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 895 and max_depth = 10
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 910 and max_depth = 10
...
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 650 and max_depth = 10
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 745 and max_depth = 10
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 672 and max_depth = 10
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 759 and max_depth = 10
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 649 and max_depth = 10
   user  system elapsed
 53.896   0.227   3.058
> auc()
0.7614953

RF:

> system.time({
+   md <- lgb.train(data = dlgb_train,
+             objective = "binary",
+             nrounds = 100, max_depth = 10, num_leaves = 2**17,
+             boosting_type = "rf", bagging_freq = 1, bagging_fraction = 0.632, feature_fraction = 1/sqrt(p),
+             categorical_feature = cols_cats,
+             verbose = 2)
+ })
[LightGBM] [Info] Number of positive: 192982, number of negative: 807018
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000818
[LightGBM] [Debug] init for col-wise cost 0.000008 seconds, init for row-wise cost 0.004629 seconds
[LightGBM] [Debug] col-wise cost 0.001792 seconds, row-wise cost 0.000410 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002210 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Dense Multi-Val Bin
[LightGBM] [Info] Total Bins 1095
[LightGBM] [Info] Number of data points in the train set: 1000000, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.192982 -> initscore=-1.430749
[LightGBM] [Info] Start training from score -1.430749
[LightGBM] [Debug] Re-bagging, using 632548 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 405 and max_depth = 10
[LightGBM] [Debug] Re-bagging, using 631955 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 331 and max_depth = 10
...
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 677 and max_depth = 10
[LightGBM] [Debug] Re-bagging, using 632031 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 736 and max_depth = 10
[LightGBM] [Debug] Re-bagging, using 631444 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 377 and max_depth = 10
   user  system elapsed
 42.253   0.191   2.364
> auc()
0.7314994
> system.time({
+   md <- lgb.train(data = dlgb_train,
+             objective = "binary",
+             nrounds = 100, max_depth = 15, num_leaves = 2**17,
+             boosting_type = "rf", bagging_freq = 1, bagging_fraction = 0.632, feature_fraction = 1/sqrt(p),
+             categorical_feature = cols_cats,
+             verbose = 2)
+ })
[LightGBM] [Info] Number of positive: 192982, number of negative: 807018
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000818
[LightGBM] [Debug] init for col-wise cost 0.000005 seconds, init for row-wise cost 0.004689 seconds
[LightGBM] [Debug] col-wise cost 0.001777 seconds, row-wise cost 0.000313 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002095 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Dense Multi-Val Bin
[LightGBM] [Info] Total Bins 1095
[LightGBM] [Info] Number of data points in the train set: 1000000, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.192982 -> initscore=-1.430749
[LightGBM] [Info] Start training from score -1.430749
[LightGBM] [Debug] Re-bagging, using 632548 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 933 and max_depth = 15
[LightGBM] [Debug] Re-bagging, using 631955 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 635 and max_depth = 15
[LightGBM] [Debug] Re-bagging, using 632394 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4024 and max_depth = 15
[LightGBM] [Debug] Re-bagging, using 631446 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 767 and max_depth = 15
[LightGBM] [Debug] Re-bagging, using 631800 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3809 and max_depth = 15
...
[LightGBM] [Debug] Re-bagging, using 632325 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 3902 and max_depth = 15
[LightGBM] [Debug] Re-bagging, using 632031 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 4475 and max_depth = 15
[LightGBM] [Debug] Re-bagging, using 631444 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 754 and max_depth = 15
   user  system elapsed
217.521   2.950  12.288
> auc()
0.7392125
> system.time({
+   md <- lgb.train(data = dlgb_train,
+             objective = "binary",
+             nrounds = 100, max_depth = 20, num_leaves = 2**17,
+             boosting_type = "rf", bagging_freq = 1, bagging_fraction = 0.632, feature_fraction = 1/sqrt(p),
+             categorical_feature = cols_cats,
+             verbose = 2)
+ })
[LightGBM] [Info] Number of positive: 192982, number of negative: 807018
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000818
[LightGBM] [Debug] init for col-wise cost 0.000006 seconds, init for row-wise cost 0.004546 seconds
[LightGBM] [Debug] col-wise cost 0.001789 seconds, row-wise cost 0.000315 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002110 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Dense Multi-Val Bin
[LightGBM] [Info] Total Bins 1095
[LightGBM] [Info] Number of data points in the train set: 1000000, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.192982 -> initscore=-1.430749
[LightGBM] [Info] Start training from score -1.430749
[LightGBM] [Debug] Re-bagging, using 632548 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 960 and max_depth = 17
[LightGBM] [Debug] Re-bagging, using 631955 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 675 and max_depth = 18
[LightGBM] [Debug] Re-bagging, using 632394 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 5528 and max_depth = 20
[LightGBM] [Debug] Re-bagging, using 631446 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 842 and max_depth = 19
...
[LightGBM] [Debug] Re-bagging, using 632325 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 7083 and max_depth = 20
[LightGBM] [Debug] Re-bagging, using 632031 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 7377 and max_depth = 20
[LightGBM] [Debug] Re-bagging, using 631444 data to train
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 763 and max_depth = 17
   user  system elapsed
484.497   9.613  27.724
> auc()
0.7415699

GBM (regular boosting) with deep trees (max_depth = 20), for comparison:

> system.time({
+   md <- lgb.train(data = dlgb_train,
+             objective = "binary",
+             nrounds = 100, max_depth = 20, num_leaves = 2**17, learning_rate = 0.1,
+             categorical_feature = cols_cats,
+             verbose = 0)
+ })
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007312 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
...
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
    user   system  elapsed
1846.680   16.217  103.627
> auc()
0.7704145

GPU performance for Quadro P1000 (and 4 GPUs)

CPU: Dual Xeon Gold 6154 (36 cores / 72 threads, 3.7 GHz)
OS: Pop!_OS 18.10
GPU versions: dmlc/xgboost@4fac987 and microsoft/LightGBM@5ece53b
Compilers / Drivers: CUDA 10.0.154 + NCCL 2.3.7 + OpenCL 1.2 + gcc 8.1 + Intel MKL 2019

CPU only, with 18 physical threads (pinned to the 1st socket via numactl and locked in with OpenMP environment variables; see the example command after the table):

| Lib | Size | Time [s] | AUC |
|-----|------|----------|-----------|
| xgb | 0.1M | 4.181 | 0.7324224 |
| xgb | 1M | 15.978 | 0.7494959 |
| xgb | 10M | 104.598 | 0.7551197 |
| lgb | 0.1M | 1.763 | 0.7298355 |
| lgb | 1M | 4.253 | 0.7636987 |
| lgb | 10M | 38.197 | 0.7742033 |
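
For reference, the pinning mentioned above can be done with a command of roughly this form (the script name is illustrative): numactl binds the process and its memory allocations to the first socket, while OMP_NUM_THREADS caps the number of OpenMP threads:

OMP_NUM_THREADS=18 numactl --cpunodebind=0 --membind=0 R --slave < 2-xgboost.R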

1x Quadro P1000:

| Lib | Size | Time [s] | AUC |
|-----|------|----------|-----------|
| xgb | 0.1M | 17.529 | 0.7328954 |
| xgb | 1M | 38.528 | 0.7499591 |
| xgb | 10M | 103.154 | 0.7564821 |
| lgb | 0.1M | 18.345 | 0.7298129 |
| lgb | 1M | 22.179 | 0.7640155 |
| lgb | 10M | 62.929 | 0.774168 |

4x Quadro P1000:

| Lib | Size | Time [s] | AUC |
|-----|------|----------|-----------|
| xgb | 0.1M | 18.838 | 0.7324756 |
| xgb | 1M | 36.877 | 0.749169 |
| xgb | 10M | 64.994 | 0.7564492 |

xgboost CPU slowdown (on some servers/high number of cores)

Weird slowdown in xgboost in the latest versions with a large number of threads on single-socket, high-core-count CPUs (this is separate from the slowdown on multi-socket/multi-NUMA-node CPUs).

Investigated with @Laurae2 (to be reported to xgboost devs).

The issue popped up when rerunning this benchmark with new versions of xgboost (on the 32-core, 1-socket/1-NUMA-node r4.8xlarge):

VER=v0.72

| threads | runtime [s] | AUC |
|---------|-------------|-----------|
| 1 | 37.871 | 0.7494959 |
| 16 | 14.262 | 0.7494959 |
| 32 | 16.304 | 0.7494959 |

VER=v0.82

| threads | runtime [s] | AUC |
|---------|-------------|-----------|
| 1 | 29.532 | 0.7494959 |
| 16 | 15.805 | 0.7494959 |
| 32 | 176.856 | 0.7494959 |

Code here: https://github.com/szilard/GBM-perf/tree/master/wip-testing/xgboost-slowdown
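
For reference, a minimal sketch of how the thread count can be varied in the xgboost R API for this kind of measurement; the xgb.DMatrix construction and the exact parameters of the benchmark script are not shown here (see the link above), so treat the values as illustrative:

library(xgboost)

## dxgb_train: an xgb.DMatrix built from the 1M-row airline training data (construction not shown)
for (nth in c(1, 16, 32)) {
  t <- system.time({
    md <- xgb.train(data = dxgb_train, nrounds = 100,
                    params = list(objective = "binary:logistic", tree_method = "hist",
                                  max_depth = 10, eta = 0.1, nthread = nth))
  })[["elapsed"]]
  cat("threads:", nth, " elapsed [s]:", t, "\n")
}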

XGBoost CPU speed by version

m5.4xlarge (16 vCPUs = 8 physical cores + 8 hyperthreads)

1M rows dataset

sudo docker run --rm -ti gbmperf_cpu /bin/bash

## inside the container:
ln -s train-1m.csv train.csv
wget https://cran.r-project.org/src/contrib/Archive/xgboost/xgboost_0.71.1.tar.gz
## wget the other versions you want to run
for i in xgboost_*.tar.gz; do
  R CMD INSTALL $i >/dev/null 2>/dev/null            ## install this version (quietly)
  egrep "Version:|Date:" /usr/local/lib/R/site-library/xgboost/DESCRIPTION   ## print version and release date
  R --slave < GBM-perf/cpu/run/2-xgboost.R           ## time the training run
done

| Version | Date | Time [s] | AUC |
|----------|------------|------|-----------|
| 0.71.1 | 2018-05-11 | 13.5 | 0.7494959 |
| 0.81.0.1 | 2019-01-30 | 54.8 | 0.7494959 |
| 0.90.0.1 | 2019-07-25 | 26.4 | 0.7494959 |
| 1.0.0.1 | 2020-03-23 | 6.6 | 0.7494531 |
| 1.1.1.1 | 2020-06-12 | 4.9 | 0.7478858 |
| 1.2.0.1 | 2020-08-28 | 5.9 | 0.7478858 |
| 1.3.1.1 | 2020-12-22 | 6.1 | 0.7478858 |

(Versions/dates are from R/CRAN; 0.71 is the first version that has the hist tree method.)

With the GitHub development version as of today: 3.9 s, AUC 0.7478858.

xgboost Exact performance

Hardware/Software from: #12

exact tree method, 70 threads:

| Size | Time [s] | AUC |
|------|----------|-----------|
| 0.1M | 1.572 | 0.7330478 |
| 1M | 17.208 | 0.7501779 |
| 10M | 446.807 | 0.7558084 |
| 100M | 5644.698 (irrelevant) | 0.7562036 |
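
For reference, a minimal sketch of selecting the exact (non-histogram) algorithm in the xgboost R API; the DMatrix construction and the other parameters of the runs above are assumptions:

library(xgboost)

## dxgb_train: an xgb.DMatrix built from the airline training data (construction not shown)
md <- xgb.train(data = dxgb_train, nrounds = 100,
                params = list(objective = "binary:logistic", tree_method = "exact",
                              max_depth = 10, eta = 0.1, nthread = 70))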

lightgbm categorical values from R

@Laurae2

Is this the best way to deal with categorical variables in the lightgbm R package?

## one-hot encode train and test together so the columns match (sparse.model.matrix is from the Matrix package)
X_train_test <- sparse.model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
X_train <- X_train_test[1:nrow(d_train), ]   ## split back (assuming the first nrow(d_train) rows are the training data)

dlgb_train <- lgb.Dataset(data = X_train, label = ifelse(d_train$dep_delayed_15min=='Y',1,0))

md <- lgb.train(data = dlgb_train, objective = "binary",  ...

see full code here.

It already gets awesome runtime and AUC, see here.
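
For comparison, a minimal sketch of the integer-coding approach used in the lgb.train calls above (categorical_feature = cols_cats); the column names in cols_cats and the parameter values are illustrative assumptions, not the exact benchmark code:

library(lightgbm)

cols_cats <- c("Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Origin", "Dest")   ## assumed column names
d_all <- as.data.frame(rbind(d_train, d_test))
for (k in cols_cats) d_all[[k]] <- as.integer(as.factor(d_all[[k]]))   ## consistent integer codes across train+test
X_train <- as.matrix(d_all[1:nrow(d_train), setdiff(names(d_all), "dep_delayed_15min")])

dlgb_train <- lgb.Dataset(data = X_train, label = ifelse(d_train$dep_delayed_15min=='Y',1,0))
md <- lgb.train(data = dlgb_train, objective = "binary",
                nrounds = 100, max_depth = 10, learning_rate = 0.1,
                categorical_feature = cols_cats, verbose = 0)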

xgboost multicore scaling and NUMA improvement by xgboost version

xgboost has improved significantly in multicore scaling and NUMA handling:

Runtimes by version on r4.16xlarge (2 sockets, 16 cores each + HT) with 1, 16 and 64 cores on 1M rows:

| version | date | 1 core [s] | 16 cores [s] | 64 cores [s] |
|---------|------------|------|------|------|
| 0.70 | 2017-12-31 | 30 | 15 | 37 |
| 0.72 | 2018-06-01 | 36 | 14 | 37 |
| 0.80 | 2018-08-13 | 27 | 14 | 34 |
| 0.82 | 2019-03-04 | 27 | 19 | 53 |
| 0.90 | 2019-05-20 | 27 | 18.7 | 53 |
| 1.0.0 | 2020-02-20 | 25 | 5.7 | 7.5 |
| 1.1.0 | 2020-05-17 | 30 | 3.8 | 3.6 |
| 1.2.0 | 2020-08-22 | 33 | 4.6 | 5 |

git clone --recursive https://github.com/dmlc/xgboost 

VER=v1.2.0
cd xgboost && git checkout tags/$VER && git submodule init && git submodule update && cd R-package && R CMD INSTALL . && cd /

taskset -c 0 R < GBM-perf/cpu/run/2-xgboost.R       ## 1 core
taskset -c 0-15 R < GBM-perf/cpu/run/2-xgboost.R    ## 16 cores
R < GBM-perf/cpu/run/2-xgboost.R                    ## all 64 vCPUs (2 sockets + HT)

Dask (xgboost and lightgbm with Dask) -- likely WRONG results

UPDATE: I suspect data leakage in Dask when lumping train and test together to do a consistent label encoding and then splitting back into train and test. The leakage might occur because of the partitioning (??), and it might be responsible for the higher AUC. So instead of this, look at the new github issue #50, where the analysis is redone using integer encoding outside of Dask.


m5.4xlarge 16c (8+8HT)

1M rows

integer encoding for simplicity

h2o weird multi-core scaling

For 10M rows, h2o is 24x faster on 16 cores compared to 1 core. Any ideas why? @arnocandel

Timings (boxplots of 3 runs) on 0.1, 1 and 10M rows with 1, 2, 4, 8 and 16 cores on r4.16xlarge, restricted to physical cores (no HT) on 1 socket only:

(figure: boxplots of training time by number of cores and data size)

Speedups from n/2 to n cores:

(figure: speedups from n/2 to n cores)

These should be < 2, i.e. below the red line.

How can the speedup from 1 core to 2 cores be >2 (2.4) on 10M rows?

Code:

library(h2o)

h2o.init()

dx_train <- h2o.importFile("train.csv")
dx_test <- h2o.importFile("test.csv")

Xnames <- names(dx_train)[which(names(dx_train)!="dep_delayed_15min")]

cat(system.time({
  md <- h2o.gbm(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, 
          distribution = "bernoulli", 
          ntrees = 100, max_depth = 10, learn_rate = 0.1, 
          nbins = 100)
})[[3]]," ",sep="")

run as:

taskset -c $LCORES R --slave < $TOOL.R $NCORES 

where $LCORES is 0, 0-1, 0-3, 0-7 or 0-15 (i.e. 1, 2, 4, 8 or 16 cores).
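
As a side note (an assumption, not necessarily what the script above does): h2o.init() also accepts an nthreads argument, so the thread count could alternatively be capped from within R rather than only via taskset, e.g.:

library(h2o)
h2o.init(nthreads = 16)   ## cap h2o's thread pool at 16 threads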

c5a.24xlarge (AMD) & multicore scaling

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              48
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7R32
Stepping:                        0
CPU MHz:                         1800.153
BogoMIPS:                        5599.80
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        24 MiB
L3 cache:                        192 MiB
NUMA node0 CPU(s):               0-95
