
benchmark-stata-r's Introduction

Benchmarks

Results

This page compares the speed of R and Stata for typical data analysis tasks. Commands are run on a randomly generated dataset with 10 million observations. I try to use the fastest command available in each language: in particular, gtools in Stata, and data.table, fst, and fixest in R.

Code

All the code below can be downloaded from the code folder of the repository. The dataset is generated in R with the file 1-generate-datasets.r, the R benchmarks are in the file 2-benchmark-r.r, and the Stata benchmarks are in the file 3-benchmark-stata.do.
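
As a rough illustration of the benchmark pattern (not the exact contents of 2-benchmark-r.r), a timed step on the R side looks roughly like the sketch below; the grouping column id3 and the file path are borrowed from the examples further down and may differ from the actual script.

library(data.table)
library(fst)

## timing helper: sum of user and system CPU time, as used in the benchmark
time <- function(x) {sum(system.time(x)[1:2])}

## load the generated 10-million-row dataset and time one grouped aggregation
DT <- setDT(read.fst("~/statabenchmark/1e7.fst"))
time(DT[, mean(v3), by = id3])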

Session Info

The machine used for this benchmark has a 3.5 GHz Intel Core i5 (4 cores) and an SSD.

The Stata version is Stata 16 MP with 2 cores. The R session info is

R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] scales_1.0.0      ggplot2_3.2.1     stringr_1.4.0     fst_0.9.0        
[5] statar_0.7.1      lfe_2.8-3         Matrix_1.2-17     tidyr_1.0.0      
[9] data.table_1.12.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2         pillar_1.4.2       compiler_3.6.0     tools_3.6.0       
 [5] zeallot_0.1.0      lifecycle_0.1.0    tibble_2.1.3       gtable_0.3.0      
 [9] lattice_0.20-38    pkgconfig_2.0.3    rlang_0.4.0        parallel_3.6.0    
[13] withr_2.1.2        dplyr_0.8.3        vctrs_0.2.0        grid_3.6.0        
[17] tidyselect_0.2.5   glue_1.3.1         R6_2.4.0           Formula_1.2-3     
[21] purrr_0.3.2        magrittr_1.5       ellipsis_0.3.0     backports_1.1.4   
[25] matrixStats_0.55.0 assertthat_0.2.1   xtable_1.8-4       colorspace_1.4-1  
[29] sandwich_2.5-1     stringi_1.4.3      lazyeval_0.2.2     munsell_0.5.0     
[33] crayon_1.3.4       zoo_1.8-6    

benchmark-stata-r's People

Contributors

boomskats, floswald, grantmcdermott, matthieugomez, prehensilecode

benchmark-stata-r's Issues

Remarks on 3-benchmark-stata.do

Hi,
Thanks for your code.
Just two remarks:
I think it would help if you add
ssc install autorename
ssc install ftools
into the comment section as well.

To make it directly usable on Stata SE
/* benchmark */
if `c(processors)' >= 2 set processors 2
would help.

Best wishes,
Rainer

Time merging from file on Disk

Stata only holds a single dataset in memory at any given time, so its merge necessarily reads from disk; comparing that I/O-bound operation to joins of objects already stored in RAM is not an apples-to-apples comparison.
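
One way to make the R measurement closer to what Stata's merge does is to time the disk read together with the join. A minimal sketch, assuming a second file (here called 1e7_merge.fst) and a shared key column id3, neither of which is taken from the actual benchmark:

library(data.table)
library(fst)

time <- function(x) {sum(system.time(x)[1:2])}

## DT is assumed to already be loaded, as in the other sketches;
## time the read from disk and the join as a single step
time({
  DT2 <- setDT(read.fst("~/statabenchmark/1e7_merge.fst"))  # hypothetical join table on disk
  DT  <- merge(DT, DT2, by = "id3")                         # in-memory join on the shared key
})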

two notes.

Hi.

  1. The fact that you're running the Stata code with 2 CPU cores and the R code with 4 cores (unless I'm misunderstanding something) should be emphasized and bolded in red in 72-point font...
  2. It would also help to report the actual execution times in addition to the relative ones. 60 minutes vs. 6 minutes matters in a common workflow, but 60 milliseconds vs. 6 milliseconds typically does not.

Misleading time function for R multicore operations?

Hi Matthieu,

In the benchmark-r.r file you define the timing function as

time <- function(x){sum(system.time(x)[1:2])}

So, you're essentially summing the "user" (first) and "system" (second) columns of system.time(). However, because "user" CPU time accumulates time spent across cores, aren't you essentially punishing any multicore operation?

I mean, maybe I'm missing something obvious (or it has to do with equating against StataMP's timing functions), but why not simply take the "elapsed" (third) column of system.time()?
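
A minimal sketch of that alternative (the helper name time_elapsed is mine): take only the wall-clock column, so multithreaded steps are not charged for CPU time accumulated across cores.

## wall-clock seconds only
time_elapsed <- function(x) {system.time(x)[["elapsed"]]}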

A probably redundant reprex:

library(fst)
library(fixest)
library(microbenchmark)

setFixest_nthreads(4)

DT <- read.fst("~/statabenchmark/1e7.fst") 
DT1 <- DT[1:(nrow(DT)/2),]

reg_2hfe <- function() invisible(feols(v3 ~  v2 + id4 + id5  + as.factor(v1) | id6 + id3, DT1))

## time function
time <- function(x){sum(system.time(x)[1:2])}
time(reg_2hfe())
#> [1] 31.89

## full system.time output (for comparison)
system.time(reg_2hfe())
#>    user  system elapsed 
#>  30.445   0.911   9.410

## microbenchmark (another comparison)
microbenchmark(reg_2hfe(), times = 5)
#> Unit: seconds
#>        expr      min       lq     mean   median       uq      max neval
#>  reg_2hfe() 9.540085 9.590495 9.669584 9.638338 9.692307 9.886695     5

Created on 2020-06-10 by the reprex package (v0.3.0)

collapse

Hi Matthieu,

Is it possible to add -collapse- to the benchmarks?
Probably two versions, one when collapsing to a small unit (e.g. individual -> country) and one for larger unit (e.g. individual -> household).

I think it's one of the most common Stata commands and would be interesting to see its performance (I have the gut feeling that it will be quite slow, even with the -fast- option).
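
For reference, the closest R analogue of -collapse- in this benchmark's toolkit is a grouped aggregation in data.table. A minimal sketch of the two requested variants; the grouping columns id3 and id6, and which of them plays the coarse vs. fine role, are assumptions rather than the benchmark's actual choice:

library(data.table)

time <- function(x) {sum(system.time(x)[1:2])}

## collapse to a coarse unit (few groups) and to a fine unit (many groups)
time(DT[, .(mean_v3 = mean(v3), sum_v2 = sum(v2)), by = id3])
time(DT[, .(mean_v3 = mean(v3), sum_v2 = sum(v2)), by = id6])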

sreshape

Hi

Have you tried using sreshape instead of Stata's default reshape? In my particular application it was 10x faster than the default.

PS: I wrote a single-line timing function called timeit a while back, it essentially wraps timer on/timer off. This may or may not be more convenient than the tic/toc you use.

Benchmark result using Windows 11 using Intel i9-13900K

Dear Matthieu,

I have replicated your test on a brand-new personal computer running Windows 11 with a 13th Gen Intel i9-13900K (3000 MHz, 24 cores) and 64 GB of RAM.
My Stata 18 license is MP (12 cores), but because the Intel CPU has only 8 'real' (performance) cores I set the core parameter in your test to 8 for both Stata and R (R version 4.2.3 (2023-03-15 ucrt)).
This is my result:

[Benchmark results chart for the 1e7 dataset]

It is interesting to observe that this is not that different from your earlier benchmark comparison, except for 'plot 1000 points' and 'append', where Stata performs relatively better.
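
For reproducibility, the R-side thread caps can be set explicitly; a minimal sketch matching the 8-core setup described above (setDTthreads and setFixest_nthreads are the standard thread-setting functions in data.table and fixest):

library(data.table)
library(fixest)

setDTthreads(8)        # cap data.table operations at 8 threads
setFixest_nthreads(8)  # cap fixest estimations at 8 threads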

Include fst package and missing values

Thanks for this repo: this is an illuminating benchmark.

I was wondering whether you could consider updating the binary IO in R with the relatively recent fst package: https://github.com/fstpackage/fst -- though you may want to wait until data.table 1.10.6, which has very fast CSV reading.

Also, I think you should consider adding in missing values (NA, NaN) in one of the data sets.
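
A minimal sketch of both suggestions on the R side; the column v1, the 5% missing share, and the file name 1e7_na.fst are assumptions, not part of the original benchmark:

library(data.table)
library(fst)

## inject missing values into one column of the generated dataset (DT assumed loaded)
DT[sample(.N, round(.N * 0.05)), v1 := NA]

## binary write and read back with fst
write.fst(DT, "~/statabenchmark/1e7_na.fst")
DT3 <- setDT(read.fst("~/statabenchmark/1e7_na.fst"))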

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.