
szilard / benchm-ml

1.9K stars, 148 watchers, 335 forks, 1.11 MB

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

License: MIT License

R 83.98% Python 16.02%
machine-learning data-science r python gradient-boosting-machine random-forest deep-learning xgboost h2o spark

benchm-ml's Issues

New benchm-ml 4-h2o.R for H2O cluster version 3.8.3.3

Hi,
this is the corrected R code for H2O cluster version 3.8.3.3 and R 3.3.1.
The old code would not run under these versions. The final AUC with sample_rate = 1.0 for 1 million records is 0.77, which tops the old results.

For 10M records the AUC is 0.7922 on a quad-core CPU @ 4 GHz in 1676.02 seconds (more accurate and roughly 2x faster than the currently reported result on a 32-thread machine, while using only two gigabytes of RAM).

This needs code validation.

# works with H2O 3.8.3.3 and R 3.3.1 (July 2016)
# H2O Flow is available at http://localhost:54321/flow/index.html
library(h2o)
# H2O uses only two cores by default, so assign all available cores/threads
h2o.init(nthreads = -1)

# load data from current directory
dx_train <- h2o.importFile(path = "train-1m.csv")
dx_test <- h2o.importFile(path = "test.csv")

# assign variables
Xnames <- names(dx_train)[which(names(dx_train)!="dep_delayed_15min")]

# start training H2O random forest 
system.time({
    md <- h2o.randomForest(x = Xnames, y = "dep_delayed_15min", training_frame= dx_train, sample_rate = 0.632, ntrees = 100, max_depth = 20)
    })

# prediction
phat <- h2o.predict(md, dx_test)

# compare predictions against the test labels
phat$Accuracy <- phat$predict == dx_test$dep_delayed_15min
# display accuracy (0.70)
mean(phat$Accuracy)

# display AUC (0.73)
system.time({
  print(h2o.performance(md, dx_test)@metrics$AUC)
})

Spark random forest low AUC etc

Splitting #5 in two: random forest here, logistic regression in a different issue.

Summary: Random forest in Spark has low AUC (and is slower, with a larger memory footprint).

For n = 100K Spark gets AUC = 0.65 vs e.g. 0.72/0.73 in H2O/xgboost.

Code here https://github.com/szilard/benchm-ml/blob/master/2-rf/5b-spark.txt
Train data here https://s3.amazonaws.com/benchm-ml--spark/spark-train-0.1m.csv test data here https://s3.amazonaws.com/benchm-ml--spark/spark-test-0.1m.csv

Originally run on Spark 1.3.0, but the results are the same on 1.4.0 (a bit faster, but the same AUC).

Can you take a look at the code and improve it, especially to get a better AUC?

Spark logistic regression issues

Splitting #5 in two: logistic regression here and random forest in a different issue.

Summary: Logistic regression has a lower AUC in Spark.

For n=1M Spark gets AUC = 0.703 while R/Python etc. AUC = 0.711.

Code here https://github.com/szilard/benchm-ml/blob/master/1a-spark-logistic/spark.txt
Train data here https://s3.amazonaws.com/benchm-ml--spark/spark-train-1m.csv test data here https://s3.amazonaws.com/benchm-ml--spark/spark-test-1m.csv

Spark version used: 1.3.0

benchmarking with autosklearn (zeroconf)

Great initiative, thanks for making this public!
You might be interested in extending your benchmarking to auto-sklearn: https://github.com/automl/auto-sklearn
I have created a script that takes a sparse dataset in the pandas HDF5 dataframe (.h5) format and runs binary classification on it on a multiprocessing cluster with auto-sklearn: https://github.com/Motorrat/autosklearn-zeroconf I will try to replicate your benchmark myself, but in case you get to it first you might want to try it out yourself.

Spark Random forest accuracy --spam?

Hi guys,

I was running random forest using Spark in R.

Can anyone tell me how I can get the accuracy?

I would normally compute R-squared, but certain rows get dropped when the random forest runs,

so to get R-squared I need an equal number of rows in the original data and the predicted data.

DL with mxnet

Trying to see if DL can match RF/GBM in accuracy on the airline dataset (where the training set is sampled from years 2005-2006, while the validation and test sets are sampled disjointly from 2007). Also, some variables are kept categorical artificially and are intentionally not encoded as ordinal variables (to better match the structure of business datasets).

Recap: with 10M training records (the largest in the benchmark), RF reaches AUC 0.80 and GBM 0.81 (on the test set).

So far I get 0.72 with DL using mxnet on the 1M training set:
https://github.com/szilard/benchm-ml/blob/master/4-DL/2-mxnet.R

For comparison, on the 1M training set xgboost has achieved 0.77, and with some tuning I think it can get to 0.79.

I tried a few architectures (number of hidden layers etc.), but nothing beats the settings I took from an mxnet example. It takes about 1 minute to train on an EC2 g2.8xlarge box using 1 GPU (using all 4 GPUs was slower). nvidia-smi shows GPU utilization of ~20% and memory usage of ~2GB (out of 4GB). On CPU (32 cores) training takes about 5 minutes.

The "problem" is DL learns very fast, the best AUC (on a validation set) is reached after 2 epochs. On the other hand xgboost runs ~1hr to get good accuracy. That is the DL model seems underfitted to me.

Sure, DL might not beat GBM on this kind of data (a proxy for general business data such as credit risk or fraud detection), but it should do better than 0.72.

Datasets:
https://s3.amazonaws.com/benchm-ml--main/train-1m.csv
https://s3.amazonaws.com/benchm-ml--main/train-10m.csv
https://s3.amazonaws.com/benchm-ml--main/valid.csv
https://s3.amazonaws.com/benchm-ml--main/test.csv

DL with h2o

Trying to see if DL can match RF/GBM in accuracy on the airline dataset (where the training set is sampled from years 2005-2006, while the validation and test sets are sampled disjointly from 2007). Also, some variables are kept categorical artificially and are intentionally not encoded as ordinal variables (to better match the structure of business datasets).

Recap: with 10M training records (the largest in the benchmark), RF reaches AUC 0.80 and GBM 0.81 (on the test set).

So far I get 0.73 with DL using h2o on both the 1M and 10M training sets:
https://github.com/szilard/benchm-ml/blob/master/4-DL/1-h2o.R

I tried a few architectures/activations/regularizations, but nothing beats the defaults. It runs in about 2-3 minutes with early stopping (using the validation set) on a 32-core EC2 box.
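
For reference, a minimal sketch of this kind of run with h2o's R API (the parameter values are illustrative assumptions, not the exact settings used; 4-DL/1-h2o.R in the repo is the authoritative version, and dx_train/dx_valid/dx_test/Xnames are set up as in the other scripts):

library(h2o)
h2o.init(nthreads = -1)
# feed-forward net with early stopping on the validation AUC (illustrative settings)
md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min",
                       training_frame = dx_train, validation_frame = dx_valid,
                       activation = "Rectifier", hidden = c(200, 200), epochs = 100,
                       stopping_metric = "AUC", stopping_rounds = 3)
h2o.auc(h2o.performance(md, dx_test))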

The "problem" is DL learns very fast, the best AUC reached after 1.3 epochs on 1M rows train and 0.15 epochs on 10M (and early stopping kicks in around 9 and 0.9, rsp). On the other hand RF/GBM runs ~1hr to get good accuracy. That is the DL model seems underfitted to me.

Sure, DL might not beat GBM on this kind of data (a proxy for general business data such as credit risk or fraud detection), but it should do better than 0.73.

Datasets:
https://s3.amazonaws.com/benchm-ml--main/train-1m.csv
https://s3.amazonaws.com/benchm-ml--main/train-10m.csv
https://s3.amazonaws.com/benchm-ml--main/valid.csv
https://s3.amazonaws.com/benchm-ml--main/test.csv

RandomForest Example

Hello,
I'm trying to train a RandomForest model, but I am getting the same result for every test instance (about 300 entries).
Here's the Java code:

RandomForest model = new RandomForest(X, Y, 500);

For the same data, Python sklearn works as expected:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500)
clf = clf.fit(X, Y)

What am I missing?

Thanks

Rborist

Thanks @suiji for the Rborist code. If I run it with 100 trees as in https://github.com/szilard/benchm-ml/tree/master/z-other-tools (on a 32-core box) I get:
Time: 87 sec
AUC: 66.43
Something is wrong; the AUC is very low.

I checked out the latest GitHub version, then in the ArboristBridgeR/Package dir I ran ./dev.sh, which created Rborist.tar.gz, and then I installed it with R CMD INSTALL.

mxnet sparse data format

Motivation: I can't run mxnet on the 10M-record airline set #29 because model.matrix runs out of RAM (on a g2.8xlarge with 60GB of RAM, the largest available for GPU instances).

Using Matrix::sparse.model.matrix to encode the categorical data would be great (it uses <2GB of RAM), but I get:

Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Strangely, on the 1M dataset I get a different error:

Error: io.cc:50: Seems X, y was passed in a Row major way, MXNetR adopts a column major convention.

GBM variable 1: Month is not of type numeric, ordered, or factor.

For gbm_2.1.1 and R 3.3.1 I get the following error with benchm-ml/3-boosting/1-gbm.R:

> system.time({
+   md <- gbm(dep_delayed_15min ~ ., data = d_train, distribution = "bernoulli", 
+             n.trees = 1000, 
+             interaction.depth = 16, shrinkage = 0.01, n.minobsinnode = 1,
+             bag.fraction = 0.5)
+ })

Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w,  : 
  variable 1: Month is not of type numeric, ordered, or factor.
Timing stopped at: 0.02 0 0.01 

Tobias
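
A likely cause (an assumption, not confirmed in this issue) is that the Month column was read in as character rather than factor; gbm only accepts numeric, ordered, or factor predictors. A minimal sketch of a workaround before calling gbm:

# convert any character columns (e.g. Month) to factors before fitting
d_train[] <- lapply(d_train, function(col) if (is.character(col)) as.factor(col) else col)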

Citation

Hey,
thanks for this repository. It's tremendously useful. Would it be possible to add info on how to cite this repository? Maybe something like:

@misc{,
	author = {Pafka, Szilard},
	title = {benchm-ml},
	publisher = {GitHub},
	year = {2019},
	journal = {GitHub repository},
	howpublished = {\url{https://github.com/szilard/benchm-ml}},
	commit = {13325ce3edd7c902390197f43bcc7938c306bbe3}
}

Best,
Simon

SMILE

Thanks for the great work! We have an open source machine learning library called SMILE (https://github.com/haifengl/smile). We have incorporated your benchmark (https://github.com/haifengl/smile/blob/master/benchmark/src/main/scala/smile/benchmark/Airline.scala). We found that our system is much faster on this data set. For 100K training rows on a 4-core machine, we can train a random forest with 500 trees in 100 seconds, and gradient boosted trees with 300 trees in 180 seconds. Projected to 32 cores, I think we would be much faster than all the tools you tested. You can try it out by cloning our project and then running:

sbt benchmark/run

This also includes a benchmark on USPS data, which you may ignore. Thanks!

Update Latest version of XGBoost

Thanks to this benchmark, we now have a good understanding of what is going on in #14 and #2.
Specifically, cache-line related issues with the exact greedy algorithm. See our detailed analysis in this paper: http://arxiv.org/abs/1603.02754

In short, the exact greedy algorithm suffers from cache-line issues on datasets larger than 1M rows; we can counterbalance this with pre-fetching, but it is still not perfect.

We are adding a new option called tree_method to xgboost, which will allow the user to choose the algorithm. By default it will choose the faster one, and it will print a message when the approximate algorithm is chosen. I think it might be interesting to rerun the benchmark on this latest version.

See https://github.com/dmlc/xgboost/tree/master/R-package for instructions. Installing from the drat repository or from source should work. To confirm you are using the latest version, check whether the following message occurs when running on the 10M data:

Tree method is automatically selected to be 'approx'...
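
For reference, a minimal sketch of setting the tree method explicitly from R (an illustrative assumption of usage; dxgb_train is an xgb.DMatrix built from the 1-hot encoded training data):

library(xgboost)
md <- xgb.train(params = list(objective = "binary:logistic",
                              max_depth = 10, eta = 0.1,
                              tree_method = "exact"),  # or "approx"; the default "auto" lets xgboost choose
                data = dxgb_train, nrounds = 100)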

mllib test code - RAM / AUC improvements needed

@szilard For MLlib, you should repartition the data to match the number of cores. For example, try train.repartition(32).cache(). Otherwise, you may not use all the cores. Also, if the data is sparse, you should create sparse vectors instead of dense vectors.

other dataset of such type for benchmarking?

@tqchen I moved your last question to a new issue:

Thanks for the clarification! BTW, do you have any idea whether there is another dataset of this type for benchmarking? For example, a dataset with more columns and rows.

One thing I noticed about this dataset is that the output seems very dependent on one variable (when features are randomly dropped at a rate of 50%, an individual tree can be very bad). This might make the result a singular case where the trees simply cut repeatedly on a single feature.

running your benchmarks from beginning to end

Hey Szilard,

I'd like to replicate your code from beginning to end, perhaps on Google Compute Engine (GCE), mainly to test out GCE with Vagrant. Do you have a sense of how long the entire process would take, assuming a similar server size to what you used on EC2?

Is there a convenient way to run all your scripts from folders 0 to 4? That is, is there a master script that executes them all?

I notice that the results are written out to the console. Do you have a script that scrapes all the AUCs for your comparison analysis?

Thanks!
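
A hypothetical sketch of one way to run the R scripts in sequence (this is not an existing script in the repo; the folder names are taken from the repo layout):

# source every R script under the numbered folders, in order
scripts <- sort(list.files(c("0-init", "1-linear", "2-rf", "3-boosting", "4-DL"),
                           pattern = "\\.R$", recursive = TRUE, full.names = TRUE))
for (s in scripts) {
  cat("==== running", s, "====\n")
  source(s, echo = TRUE)
}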

Spark random forest issues

This is to collaborate on some issues with Spark RF also addressed by @jkbradley in comments to this post http://datascience.la/benchmarking-random-forest-implementations/ (see comments by Joseph Bradley). cc: @mengxr

Please see “Absolute Minimal Benchmark” for random forests https://github.com/szilard/benchm-ml/tree/master/z-other-tools and let's use the 1M row training set and the test set linked in from there.

@jkbradley says: One-hot encoder: Spark 1.4 includes this, plus a lot more feature transformers. Preprocessing should become ever-easier, especially using DataFrames (Spark 1.3+).

Yes, indeed. Can you please provide code that reads in the original dataset (before 1-hot encoding) and does the 1-hot encoding in Spark? Also, if the random forest 1.4 API can use data frames, I guess we should use that for training. Can you please provide code for that too?

@jkbradley says: AUC/accuracy: The AUC issue appears to be caused by MLlib tree ensembles aggregating votes, rather than class probabilities, as you suggested. I re-ran your test using class probabilities (which can be aggregated by hand), and then got the same AUC as other libraries. We’re planning on including this fix in Spark 1.5 (and thanks for providing some evidence of its importance!).

Fantastic. Can you please share code that does that already? I would be happy to check it out.

xgboost RF bump for n=10M

Moved "something weird happens for the largest data size (n=10M) - the trend for Run time and AUC "breaks", see figures main README" issue from #2 here.

Possible data leakage

Hi, szilard!
Thanks for your benchmarks; I think you found an interesting dataset for comparison.

HOWEVER

The time of departure present in the data is the exact time when the aircraft takes off.
Thus, by analyzing the flights from airport X to airport Y by carrier Z, one can establish at which time an aircraft should take off to be on time (and that is, I believe, what the deep trees do).

At least, I could easily see such patterns in the data.

It doesn't seem very useful to predict whether an aircraft departs on time given that you already know this information.

So, my suggestion is either to replace DepTime with PlannedDepTime (if you know how to get this information) or to set DepTime = DepTime // 200, which reduces the possibility of exploiting this information while the altered feature still gives approximate information about the flight schedule.
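
In R, the suggested coarsening would be something like this (a sketch, applied to both the training and test data):

# keep only approximate schedule information (the suggested DepTime // 200)
d_train$DepTime <- d_train$DepTime %/% 200
d_test$DepTime  <- d_test$DepTime  %/% 200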

best boosting AUC?

@tqchen @hetong007 I'm trying to get a good AUC with boosting for the largest dataset (n = 10M). Would be nice to beat random forests :)

So far I did some basic grid search https://github.com/szilard/benchm-ml/blob/master/3-boosting/0-xgboost-init-grid.R for n = 1M (not the largest dataset), and it seems that deeper trees, min_child_weight = 1 and subsample = 0.5 work well.

I'm now running https://github.com/szilard/benchm-ml/blob/master/3-boosting/6a-xgboost-grid.R with n = 10M, just looping over max_depth = c(2,5,10,20,50), but it's been running for a while.
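
A minimal sketch of such a loop (illustrative settings; dxgb_train/dxgb_test are assumed to be xgb.DMatrix objects and y_test the 0/1 test labels):

library(xgboost)
library(ROCR)
for (depth in c(2, 5, 10, 20, 50)) {
  md <- xgb.train(params = list(objective = "binary:logistic", eval_metric = "auc",
                                eta = 0.1, max_depth = depth,
                                min_child_weight = 1, subsample = 0.5),
                  data = dxgb_train, nrounds = 1000)
  phat <- predict(md, newdata = dxgb_test)
  auc  <- performance(prediction(phat, y_test), "auc")@y.values[[1]]
  cat("max_depth =", depth, " AUC =", auc, "\n")
}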

Any suggestions?

The smallest learning rate I'm using is eta = 0.01; any experience with smaller values?

PS: See results so far here: https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines

License for datasets

Hi,
Could you please help me understand whether the MIT license covers the datasets mentioned here:
"Training datasets of sizes 10K, 100K, 1M, 10M are generated from the well-known airline dataset (using years 2005 and 2006). A test set of size 100K is generated from the same (using year 2007)."

benchm-ml/z-other-tools/4-h2o.R change in import format

Hi,
for H2O cluster version 3.8.3.3, the import call in benchm-ml/z-other-tools/4-h2o.R should be corrected to

dx_train <- h2o.importFile(path = "train-1m.csv")

otherwise the following error occurs:

> dx_train <- h2o.importFile(h2oServer, path = "train-1m.csv")
Error: is.character(key) && length(key) == 1L && !is.na(key) is not TRUE

Cheers
Tobias

Datacratic MLDB results

This code gives an AUC of 0.7417 in 12.1s for the 1M training set on an r3.8xlarge EC2 instance with the latest release of Datacratic's Machine Learning Database (MLDB), available at http://mldb.ai/

from pymldb import Connection
mldb = Connection("http://localhost/")

mldb.v1.datasets("bench-train-1m").put({
    "type": "text.csv.tabular",
    "params": { "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/train-1m.csv" }
})

mldb.v1.datasets("bench-test").put({
    "type": "text.csv.tabular",
    "params": { "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/test.csv" }
})

mldb.v1.procedures("benchmark").put({
    "type": "classifier.experiment",
    "params": {
        "experimentName": "benchm_ml",
        "training_dataset": {"id": "bench-train-1m"},
        "testing_dataset": {"id": "bench-test"},
        "configuration": {
            "type": "bagging",
            "num_bags": 100,
            "validation_split": 0.50,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 20,
                "random_feature_propn": 0.5
            }
        },
        "modelFileUrlPattern": "file://tmp/models/benchml_$runid.cls",
        "label": "dep_delayed_15min = 'Y'",
        "select": "* EXCLUDING(dep_delayed_15min)",
        "mode": "boolean"
    }
})

import time

start_time = time.time()

result = mldb.v1.procedures("benchmark").runs.post({})

run_time = time.time() - start_time
auc = result.json()["status"]["folds"][0]["results"]["auc"]

print "\n\nAUC = %0.4f, time = %0.4f\n\n" % (auc, run_time)

comment:re sklearn -- integer encoding vs 1-hot (py)

(Your post popped up in my Twitter feed.)
I'm not sure why you said you needed to one-hot encode categorical variables for scikit's random forest; I'm fairly certain you do not need to (and probably shouldn't). It's been a while since I looked at the source, but I'm pretty sure it handles categorical variables encoded as a single vector of numbers just fine, based on empirical tests; performance is almost always worse when the features are one-hot encoded.

Add Rborist

Could you add Rborist in serial and parallel mode to add another (fast?) random forest implementation?

Great project. Very useful to have comparisons.

LightGBM results

New GBM implementation released by Microsoft: https://github.com/Microsoft/LightGBM

On the 10M dataset, r3.8xlarge.

Trying to match the xgboost and LightGBM params:

xgboost:    nround = 100, max_depth = 10, eta = 0.1
LightGBM 1: num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=100
LightGBM 2: num_iterations=100  learning_rate=0.1  num_leaves=512   min_data_in_leaf=100
LightGBM 3: num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=0

Tool        Time (s)  AUC
xgboost     350       0.7511
LightGBM 1  500       0.7848
LightGBM 2  350       0.7729
LightGBM 3  450       0.7897

Code to get the results here

Question on the metric of AUC

It is a little confusing that the evaluation on the classification tasks uses the probability output directly in calculating the AUC.
For example, in 6-xgboost.R#L39, would it be better to do that with (phat > 0.5)?
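
For context, a small sketch of the difference (phat and y_test here are hypothetical score and 0/1 label vectors): AUC is computed from the full ranking of the scores, so thresholding first collapses the ROC curve to a single operating point and typically lowers the measured AUC.

library(ROCR)
auc_scores <- performance(prediction(phat, y_test), "auc")@y.values[[1]]                    # from raw probabilities
auc_binary <- performance(prediction(as.numeric(phat > 0.5), y_test), "auc")@y.values[[1]]  # after thresholding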

5-spark.txt: spark-train-10m.csv

file: 1-linear/5-spark.txt contains lines:
val d_train = load("spark-train-10m.csv").repartition(32).cache()
val d_test = load("spark-test-10m.csv").repartition(32).cache()

However, those files are not created anywhere else (0-init). I'm wondering, shouldn't it be train-10m.csv and test.csv instead? (Those files are in 0-init.)

How to time the algorithms?

How do you measure the training time of the different algorithms? Are there tools available that can be used across all of the ML software?
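
For reference, the R scripts in this benchmark wrap the training call in system.time(), as in the snippets above, e.g.:

# time only the model training (as in the H2O random forest snippet near the top of this page)
system.time({
  md <- h2o.randomForest(x = Xnames, y = "dep_delayed_15min",
                         training_frame = dx_train, ntrees = 100, max_depth = 20)
})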
