
szilard / benchm-ml

1.9K stars, 148 watchers, 335 forks, 1.11 MB

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

License: MIT License

R 83.98% Python 16.02%
machine-learning data-science r python gradient-boosting-machine random-forest deep-learning xgboost h2o spark

benchm-ml's Issues

New benchm-ml 4-h2o.R for H2O cluster version 3.8.3.3

Hi,
this is the corrected R code for H2O cluster version 3.8.3.3 and R 3.3.1.
The old code would not run under these versions. The final AUC with sample_rate = 1.0 for 1 million records is 0.77, which tops the old results.

For 10M records the AUC is 0.7922 on a quad-core CPU @ 4 GHz in 1676.02 seconds (more accurate and roughly 2x faster than the currently reported result on a 32-thread machine, while using only two gigabytes of RAM).

This needs code validation.

# works with H2O 3.8.3.3 and R 3.3.1 (July 2016)
# H2O Flow is available at http://localhost:54321/flow/index.html
library(h2o)
# H2O uses only two cores by default, so assign all available cores/threads
h2o.init(nthreads = -1)

# load data from current directory
dx_train <- h2o.importFile(path = "train-1m.csv")
dx_test <- h2o.importFile(path = "test.csv")

# assign variables
Xnames <- names(dx_train)[which(names(dx_train)!="dep_delayed_15min")]

# start training H2O random forest 
system.time({
    md <- h2o.randomForest(x = Xnames, y = "dep_delayed_15min", training_frame= dx_train, sample_rate = 0.632, ntrees = 100, max_depth = 20)
    })

# prediction
phat <- h2o.predict(md, dx_test)

# compare predictions against the test labels
phat$Accuracy <- phat$predict == dx_test$dep_delayed_15min
# display accuracy (0.70)
mean(phat$Accuracy)

# display AUC (0.73)
system.time({
  print(h2o.performance(md, dx_test)@metrics$AUC)
})

Spark random forest low AUC etc

Splitting #5 in two: random forest here, logistic regression in a different issue.

Summary: Random forest in Spark has low AUC (and is slower, with a larger memory footprint).

For n = 100K Spark gets AUC = 0.65 vs e.g. 0.72/0.73 in H2O/xgboost.

Code here https://github.com/szilard/benchm-ml/blob/master/2-rf/5b-spark.txt
Train data here https://s3.amazonaws.com/benchm-ml--spark/spark-train-0.1m.csv test data here https://s3.amazonaws.com/benchm-ml--spark/spark-test-0.1m.csv

Originally run on Spark 1.3.0, but the results are the same on 1.4.0 (a bit faster, but the same AUC).

Can you take a look at the code and improve it, especially to get a better AUC?

Spark logistic regression issues

Splitting #5 in two: logistic regression here and random forest in a different issue.

Summary: Logistic regression has a lower AUC in Spark.

For n=1M Spark gets AUC = 0.703 while R/Python etc. AUC = 0.711.

Code here https://github.com/szilard/benchm-ml/blob/master/1a-spark-logistic/spark.txt
Train data here https://s3.amazonaws.com/benchm-ml--spark/spark-train-1m.csv test data here https://s3.amazonaws.com/benchm-ml--spark/spark-test-1m.csv

Spark version used: 1.3.0

benchmarking with autosklearn (zeroconf)

Great initiative, thanks for making this public!
You might be interested in extending your benchmarking to auto-sklearn: https://github.com/automl/auto-sklearn
I have created a script that takes a sparse dataset in the pandas HDF5 dataframe (.h5) format and runs binary classification on it on a multiprocessing cluster with auto-sklearn: https://github.com/Motorrat/autosklearn-zeroconf I will try to replicate your benchmark myself, but in case you get to it first you might want to try it out yourself.

Spark Random forest accuracy --spam?

Hi guys,

I was running random forest using Spark in R.

Can anyone tell me how I can get the accuracy?

I would normally compute R-squared, but certain rows get dropped when the random forest runs,

so to get R-squared I need an equal number of rows in the original data and the predicted data.

DL with mxnet

Trying to see if DL can match RF/GBM in accuracy on the airline dataset (where the training set is sampled from years 2005-2006, while the validation and test sets are sampled disjointly from 2007). Also, some variables are kept categorical artificially and are intentionally not encoded as ordinal variables (to better match the structure of business datasets).

Recap: with 10M training records (the largest in the benchmark), RF reaches AUC 0.80 and GBM 0.81 (on the test set).

So far I get 0.72 with DL using mxnet on the 1M training set:
https://github.com/szilard/benchm-ml/blob/master/4-DL/2-mxnet.R

For comparison, on the 1M training set xgboost has achieved 0.77, and with some tuning I think it can get to 0.79.

I tried a few architectures (number of hidden layers etc.), but nothing beats the settings I took from an mxnet example. It takes about 1 minute to train on an EC2 g2.8xlarge box using 1 GPU (using all 4 GPUs was slower). nvidia-smi shows GPU utilization of ~20% and memory usage of ~2GB (out of 4GB). On CPU (32 cores) training takes about 5 minutes.

The "problem" is DL learns very fast, the best AUC (on a validation set) is reached after 2 epochs. On the other hand xgboost runs ~1hr to get good accuracy. That is the DL model seems underfitted to me.

Sure, DL might not beat GBM on this kind of data (a proxy for general business data such as credit risk or fraud detection), but it should do better than 0.72.

Datasets:
https://s3.amazonaws.com/benchm-ml--main/train-1m.csv
https://s3.amazonaws.com/benchm-ml--main/train-10m.csv
https://s3.amazonaws.com/benchm-ml--main/valid.csv
https://s3.amazonaws.com/benchm-ml--main/test.csv

DL with h2o

Trying to see if DL can match RF/GBM in accuracy on the airline dataset (where the training set is sampled from years 2005-2006, while the validation and test sets are sampled disjointly from 2007). Also, some variables are kept categorical artificially and are intentionally not encoded as ordinal variables (to better match the structure of business datasets).

Recap: with 10M training records (the largest in the benchmark), RF reaches AUC 0.80 and GBM 0.81 (on the test set).

So far I get 0.73 with DL using h2o on both the 1M and 10M training sets:
https://github.com/szilard/benchm-ml/blob/master/4-DL/1-h2o.R

I tried a few architectures/activations/regularizations, but nothing beats the defaults. It runs in about 2-3 minutes with early stopping (using the validation set) on a 32-core EC2 box.
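
For reference, a minimal sketch of this kind of run with h2o's R API (the parameter values are illustrative assumptions, not the exact settings used; 4-DL/1-h2o.R in the repo is the authoritative version, and dx_train/dx_valid/dx_test/Xnames are set up as in the other scripts):

library(h2o)
h2o.init(nthreads = -1)
# feed-forward net with early stopping on the validation AUC (illustrative settings)
md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min",
                       training_frame = dx_train, validation_frame = dx_valid,
                       activation = "Rectifier", hidden = c(200, 200), epochs = 100,
                       stopping_metric = "AUC", stopping_rounds = 3)
h2o.auc(h2o.performance(md, dx_test))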

The "problem" is DL learns very fast, the best AUC reached after 1.3 epochs on 1M rows train and 0.15 epochs on 10M (and early stopping kicks in around 9 and 0.9, rsp). On the other hand RF/GBM runs ~1hr to get good accuracy. That is the DL model seems underfitted to me.

Sure, DL might not beat GBM on this kind of data (a proxy for general business data such as credit risk or fraud detection), but it should do better than 0.73.

Datasets:
https://s3.amazonaws.com/benchm-ml--main/train-1m.csv
https://s3.amazonaws.com/benchm-ml--main/train-10m.csv
https://s3.amazonaws.com/benchm-ml--main/valid.csv
https://s3.amazonaws.com/benchm-ml--main/test.csv

RandomForest Example

Hello,
I'm trying to train a RandomForest model, but I am getting the same result for every test instance (about 300 entries).
Here's the Java code:

RandomForest model = new RandomForest(X, Y, 500);

For the same data, Python sklearn works as expected:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500)
clf = clf.fit(X, Y)

What am I missing?

Thanks

Rborist

Thanks @suiji for the Rborist code. If I run it with 100 trees as in https://github.com/szilard/benchm-ml/tree/master/z-other-tools (on a 32-core box) I get:
Time: 87 sec
AUC: 66.43
Something is wrong; the AUC is very low.

I checked out the latest GitHub version, then in the ArboristBridgeR/Package dir I ran ./dev.sh, which created Rborist.tar.gz, and then I installed it with R CMD INSTALL.

mxnet sparse data format

Motivation: I can't run mxnet on the 10M-record airline set #29 because model.matrix runs out of RAM (on a g2.8xlarge with 60GB of RAM, the largest available for GPU instances).

Using Matrix::sparse.model.matrix to encode the categorical data would be great (it uses <2GB of RAM), but I get:

Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Strangely, on the 1M dataset I get a different error:

Error: io.cc:50: Seems X, y was passed in a Row major way, MXNetR adopts a column major convention.

GBM variable 1: Month is not of type numeric, ordered, or factor.

For gbm_2.1.1 and R 3.3.1 I get the following error with benchm-ml/3-boosting/1-gbm.R:

> system.time({
+   md <- gbm(dep_delayed_15min ~ ., data = d_train, distribution = "bernoulli", 
+             n.trees = 1000, 
+             interaction.depth = 16, shrinkage = 0.01, n.minobsinnode = 1,
+             bag.fraction = 0.5)
+ })

Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w,  : 
  variable 1: Month is not of type numeric, ordered, or factor.
Timing stopped at: 0.02 0 0.01 

Tobias
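
A likely cause (an assumption, not confirmed in this issue) is that the Month column was read in as character rather than factor; gbm only accepts numeric, ordered, or factor predictors. A minimal sketch of a workaround before calling gbm:

# convert any character columns (e.g. Month) to factors before fitting
d_train[] <- lapply(d_train, function(col) if (is.character(col)) as.factor(col) else col)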

Citation

Hey,
thanks for this repository. It's tremendously useful. Would it be possible to add info on how to cite this repository? Maybe something like:

@misc{,
	author = {Pafka, Szilard},
	title = {benchm-ml},
	publisher = {GitHub},
	year = {2019},
	journal = {GitHub repository},
	howpublished = {\url{https://github.com/szilard/benchm-ml}},
	commit = {13325ce3edd7c902390197f43bcc7938c306bbe3}
}

Best,
Simon

SMILE

Thanks for the great work! We have an open source machine learning library called SMILE (https://github.com/haifengl/smile). We have incorporated your benchmark (https://github.com/haifengl/smile/blob/master/benchmark/src/main/scala/smile/benchmark/Airline.scala). We found that our system is much faster on this data set. For 100K training rows on a 4-core machine, we can train a random forest with 500 trees in 100 seconds, and gradient boosted trees with 300 trees in 180 seconds. Projected to 32 cores, I think we would be much faster than all the tools you tested. You can try it out by cloning our project and then running:

sbt benchmark/run

This also includes a benchmark on USPS data, which you may ignore. Thanks!

Update Latest version of XGBoost

Thanks to this benchmark, we now have a good understanding of what is going on in #14 and #2.
Specifically, cache-line related issues with the exact greedy algorithm. See our detailed analysis in this paper: http://arxiv.org/abs/1603.02754

In short, the exact greedy algorithm suffers from cache-line issues on datasets larger than 1M rows; we can counterbalance this with pre-fetching, but it is still not perfect.

We are adding a new option called tree_method to xgboost, which will allow the user to choose the algorithm. By default it will choose the faster one, and it will print a message when the approximate algorithm is chosen. I think it might be interesting to rerun the benchmark on this latest version.

See https://github.com/dmlc/xgboost/tree/master/R-package for instructions. Installing from the drat repository or from source should work. To confirm you are using the latest version, check whether the following message occurs when running on the 10M data:

Tree method is automatically selected to be 'approx'...
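
For reference, a minimal sketch of setting the tree method explicitly from R (an illustrative assumption of usage; dxgb_train is an xgb.DMatrix built from the 1-hot encoded training data):

library(xgboost)
md <- xgb.train(params = list(objective = "binary:logistic",
                              max_depth = 10, eta = 0.1,
                              tree_method = "exact"),  # or "approx"; the default "auto" lets xgboost choose
                data = dxgb_train, nrounds = 100)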

mllib test code - RAM / AUC improvements needed

@szilard For MLlib, you should repartition the data to match the number of cores. For example, try train.repartition(32).cache(). Otherwise, you may not use all the cores. Also, if the data is sparse, you should create sparse vectors instead of dense vectors.

other dataset of such type for benchmarking?

@tqchen I moved your last question to a new issue:

Thanks for the clarification! BTW, do you have any idea whether there is another dataset of this type for benchmarking? For example, a dataset with more columns and rows.

One thing I noticed about this dataset is that the output seems very dependent on one variable (when features are randomly dropped at a rate of 50%, an individual tree can be very bad). This might make the result a singular case where the trees simply cut repeatedly on a single feature.

running your benchmarks from beginning to end

Hey Szilard,

I'd like to replicate your code from beginning to end, perhaps on Google Compute Engine (GCE), mainly to test out GCE with Vagrant. Do you have a sense of how long the entire process would take, assuming a similar server size to what you used on EC2?

Is there a convenient way to run all your scripts from folders 0 to 4? That is, is there a master script that executes them all?

I notice that the results are written out to the console. Do you have a script that scrapes all the AUCs for your comparison analysis?

Thanks!
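
A hypothetical sketch of one way to run the R scripts in sequence (this is not an existing script in the repo; the folder names are taken from the repo layout):

# source every R script under the numbered folders, in order
scripts <- sort(list.files(c("0-init", "1-linear", "2-rf", "3-boosting", "4-DL"),
                           pattern = "\\.R$", recursive = TRUE, full.names = TRUE))
for (s in scripts) {
  cat("==== running", s, "====\n")
  source(s, echo = TRUE)
}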

Spark random forest issues

This is to collaborate on some issues with Spark RF also addressed by @jkbradley in comments to this post http://datascience.la/benchmarking-random-forest-implementations/ (see comments by Joseph Bradley). cc: @mengxr

Please see “Absolute Minimal Benchmark” for random forests https://github.com/szilard/benchm-ml/tree/master/z-other-tools and let's use the 1M row training set and the test set linked in from there.

@jkbradley says: One-hot encoder: Spark 1.4 includes this, plus a lot more feature transformers. Preprocessing should become ever-easier, especially using DataFrames (Spark 1.3+).

Yes, indeed. Can you please provide code that reads in the original dataset (before 1-hot encoding) and does the 1-hot encoding in Spark? Also, if the random forest 1.4 API can use data frames, I guess we should use that for training. Can you please provide code for that too?

@jkbradley says: AUC/accuracy: The AUC issue appears to be caused by MLlib tree ensembles aggregating votes, rather than class probabilities, as you suggested. I re-ran your test using class probabilities (which can be aggregated by hand), and then got the same AUC as other libraries. We’re planning on including this fix in Spark 1.5 (and thanks for providing some evidence of its importance!).

Fantastic. Can you please share code that does that already? I would be happy to check it out.

xgboost RF bump for n=10M

Moved "something weird happens for the largest data size (n=10M) - the trend for Run time and AUC "breaks", see figures main README" issue from #2 here.

Possible data leakage

Hi, szilard!
Thanks for your benchmarks; I think you found an interesting dataset for comparison.

HOWEVER

The time of departure present in the data is the exact time when the aircraft takes off.
Thus, by analyzing the flights from airport X to airport Y by carrier Z, one can establish at which time an aircraft should take off to be on time (and that is, I believe, what the deep trees do).

At least, I could easily see such patterns in the data.

It doesn't seem very useful to predict whether an aircraft departs on time given that you already know this information.

So, my suggestion is either to replace DepTime with PlannedDepTime (if you know how to get this information) or to set DepTime = DepTime // 200, which reduces the possibility of exploiting this information while the altered feature still gives approximate information about the flight schedule.
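
In R, the suggested coarsening would be something like this (a sketch, applied to both the training and test data):

# keep only approximate schedule information (the suggested DepTime // 200)
d_train$DepTime <- d_train$DepTime %/% 200
d_test$DepTime  <- d_test$DepTime  %/% 200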

best boosting AUC?

@tqchen @hetong007 I'm trying to get a good AUC with boosting for the largest dataset (n = 10M). Would be nice to beat random forests :)

So far I did some basic grid search https://github.com/szilard/benchm-ml/blob/master/3-boosting/0-xgboost-init-grid.R for n = 1M (not the largest dataset), and it seems that deeper trees, min_child_weight = 1 and subsample = 0.5 work well.

I'm now running https://github.com/szilard/benchm-ml/blob/master/3-boosting/6a-xgboost-grid.R with n = 10M, just looping over max_depth = c(2,5,10,20,50), but it's been running for a while.
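
A minimal sketch of such a loop (illustrative settings; dxgb_train/dxgb_test are assumed to be xgb.DMatrix objects and y_test the 0/1 test labels):

library(xgboost)
library(ROCR)
for (depth in c(2, 5, 10, 20, 50)) {
  md <- xgb.train(params = list(objective = "binary:logistic", eval_metric = "auc",
                                eta = 0.1, max_depth = depth,
                                min_child_weight = 1, subsample = 0.5),
                  data = dxgb_train, nrounds = 1000)
  phat <- predict(md, newdata = dxgb_test)
  auc  <- performance(prediction(phat, y_test), "auc")@y.values[[1]]
  cat("max_depth =", depth, " AUC =", auc, "\n")
}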

Any suggestions?

The smallest learning rate I'm using is eta = 0.01; any experience with smaller values?

PS: See results so far here: https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines

License for datasets

Hi,
Could you please help me understand whether the MIT license covers the datasets mentioned here:
"Training datasets of sizes 10K, 100K, 1M, 10M are generated from the well-known airline dataset (using years 2005 and 2006). A test set of size 100K is generated from the same (using year 2007)."

benchm-ml/z-other-tools/4-h2o.R change in import format

Hi,
for H2O cluster version 3.8.3.3, the import call in benchm-ml/z-other-tools/4-h2o.R should be corrected to

dx_train <- h2o.importFile(path = "train-1m.csv")

otherwise the following error occurs:

> dx_train <- h2o.importFile(h2oServer, path = "train-1m.csv")
Error: is.character(key) && length(key) == 1L && !is.na(key) is not TRUE

Cheers
Tobias

Datacratic MLDB results

This code gives an AUC of 0.7417 in 12.1s for the 1M training set on an r3.8xlarge EC2 instance with the latest release of Datacratic's Machine Learning Database (MLDB), available at http://mldb.ai/

from pymldb import Connection
mldb = Connection("http://localhost/")

mldb.v1.datasets("bench-train-1m").put({
    "type": "text.csv.tabular",
    "params": { "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/train-1m.csv" }
})

mldb.v1.datasets("bench-test").put({
    "type": "text.csv.tabular",
    "params": { "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/test.csv" }
})

mldb.v1.procedures("benchmark").put({
    "type": "classifier.experiment",
    "params": {
        "experimentName": "benchm_ml",
        "training_dataset": {"id": "bench-train-1m"},
        "testing_dataset": {"id": "bench-test"},
        "configuration": {
            "type": "bagging",
            "num_bags": 100,
            "validation_split": 0.50,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 20,
                "random_feature_propn": 0.5
            }
        },
        "modelFileUrlPattern": "file://tmp/models/benchml_$runid.cls",
        "label": "dep_delayed_15min = 'Y'",
        "select": "* EXCLUDING(dep_delayed_15min)",
        "mode": "boolean"
    }
})

import time

start_time = time.time()

result = mldb.v1.procedures("benchmark").runs.post({})

run_time = time.time() - start_time
auc = result.json()["status"]["folds"][0]["results"]["auc"]

print "\n\nAUC = %0.4f, time = %0.4f\n\n" % (auc, run_time)

comment:re sklearn -- integer encoding vs 1-hot (py)

(Your post popped up in my Twitter feed.)
I'm not sure why you said you needed to one-hot encode categorical variables for scikit's random forest; I'm fairly certain you do not need to (and probably shouldn't). It's been a while since I looked at the source, but I'm pretty sure it handles categorical variables encoded as a single vector of numbers just fine, based on empirical tests; performance is almost always worse when the features are one-hot encoded.

Add Rborist

Could you add Rborist in serial and parallel mode to add another (fast?) random forest implementation?

Great project. Very useful to have comparisons.

LightGBM results

New GBM implementation released by Microsoft: https://github.com/Microsoft/LightGBM

On the 10M dataset, r3.8xlarge.

Trying to match the xgboost and LightGBM params:

xgboost:    nround = 100, max_depth = 10, eta = 0.1
LightGBM 1: num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=100
LightGBM 2: num_iterations=100  learning_rate=0.1  num_leaves=512   min_data_in_leaf=100
LightGBM 3: num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=0

Tool        Time (s)  AUC
xgboost     350       0.7511
LightGBM 1  500       0.7848
LightGBM 2  350       0.7729
LightGBM 3  450       0.7897

Code to get the results here

Question on the metric of AUC

It is a little confusing that the evaluation on the classification tasks uses the probability output directly in calculating the AUC.
For example, in 6-xgboost.R#L39, would it be better to do that with (phat > 0.5)?
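
For context, a small sketch of the difference (phat and y_test here are hypothetical score and 0/1 label vectors): AUC is computed from the full ranking of the scores, so thresholding first collapses the ROC curve to a single operating point and typically lowers the measured AUC.

library(ROCR)
auc_scores <- performance(prediction(phat, y_test), "auc")@y.values[[1]]                    # from raw probabilities
auc_binary <- performance(prediction(as.numeric(phat > 0.5), y_test), "auc")@y.values[[1]]  # after thresholding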

5-spark.txt: spark-train-10m.csv

file: 1-linear/5-spark.txt contains lines:
val d_train = load("spark-train-10m.csv").repartition(32).cache()
val d_test = load("spark-test-10m.csv").repartition(32).cache()

However, those files are not created anywhere else (0-init). I'm wondering, shouldn't it be train-10m.csv and test.csv instead? (Those files are in 0-init.)

How to time the algorithms?

How do you measure the training time of the different algorithms? Are there tools available that can be used across all of the ML software?
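
For reference, the R scripts in this benchmark wrap the training call in system.time(), as in the snippets above, e.g.:

# time only the model training (as in the H2O random forest snippet near the top of this page)
system.time({
  md <- h2o.randomForest(x = Xnames, y = "dep_delayed_15min",
                         training_frame = dx_train, ntrees = 100, max_depth = 20)
})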
