
benchmark-stata-r's Introduction

Benchmarks

Results

This page compares the speed of R and Stata for typical data analysis tasks. Commands are run on a randomly generated dataset with 10 million observations. I try to use the fastest command available in each language: in particular, gtools in Stata, and data.table, fst, and fixest in R.

Code

All the code below can be downloaded from the code folder of the repository. The dataset is generated in R with the file 1-generate-datasets.r, the R benchmarks are in the file 2-benchmark-r.r, and the Stata benchmarks are in the file 3-benchmark-stata.do.
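
As a rough illustration of the benchmark pattern (not the exact contents of 2-benchmark-r.r), a timed step on the R side looks roughly like the sketch below; the grouping column id3 and the file path are borrowed from the examples further down and may differ from the actual script.

library(data.table)
library(fst)

## timing helper: sum of user and system CPU time, as used in the benchmark
time <- function(x) {sum(system.time(x)[1:2])}

## load the generated 10-million-row dataset and time one grouped aggregation
DT <- setDT(read.fst("~/statabenchmark/1e7.fst"))
time(DT[, mean(v3), by = id3])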

Session Info

The machine used for this benchmark has a 3.5 GHz Intel Core i5 (4 cores) and an SSD.

The Stata version is Stata 16 MP with 2 cores. The R session info is

R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] scales_1.0.0      ggplot2_3.2.1     stringr_1.4.0     fst_0.9.0        
[5] statar_0.7.1      lfe_2.8-3         Matrix_1.2-17     tidyr_1.0.0      
[9] data.table_1.12.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2         pillar_1.4.2       compiler_3.6.0     tools_3.6.0       
 [5] zeallot_0.1.0      lifecycle_0.1.0    tibble_2.1.3       gtable_0.3.0      
 [9] lattice_0.20-38    pkgconfig_2.0.3    rlang_0.4.0        parallel_3.6.0    
[13] withr_2.1.2        dplyr_0.8.3        vctrs_0.2.0        grid_3.6.0        
[17] tidyselect_0.2.5   glue_1.3.1         R6_2.4.0           Formula_1.2-3     
[21] purrr_0.3.2        magrittr_1.5       ellipsis_0.3.0     backports_1.1.4   
[25] matrixStats_0.55.0 assertthat_0.2.1   xtable_1.8-4       colorspace_1.4-1  
[29] sandwich_2.5-1     stringi_1.4.3      lazyeval_0.2.2     munsell_0.5.0     
[33] crayon_1.3.4       zoo_1.8-6    

benchmark-stata-r's People

Contributors

boomskats, floswald, grantmcdermott, matthieugomez, prehensilecode

benchmark-stata-r's Issues

Remarks on 3-benchmark-stata.do

Hi,
Thanks for your code.
Just two remarks:
I think it would help if you add
ssc install autorename
ssc install ftools
into the comment section as well.

To make it directly usable on Stata SE
/* benchmark */
if `c(processors)' >= 2 set processors 2
would help.

Best wishes,
Rainer

Time merging from file on Disk

Stata only holds a single dataset in memory at any given time, so its merge necessarily reads from disk; comparing that I/O-bound operation to joins of objects already stored in RAM is not an apples-to-apples comparison.
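
One way to make the R measurement closer to what Stata's merge does is to time the disk read together with the join. A minimal sketch, assuming a second file (here called 1e7_merge.fst) and a shared key column id3, neither of which is taken from the actual benchmark:

library(data.table)
library(fst)

time <- function(x) {sum(system.time(x)[1:2])}

## DT is assumed to already be loaded, as in the other sketches;
## time the read from disk and the join as a single step
time({
  DT2 <- setDT(read.fst("~/statabenchmark/1e7_merge.fst"))  # hypothetical join table on disk
  DT  <- merge(DT, DT2, by = "id3")                         # in-memory join on the shared key
})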

two notes.

Hi.

  1. The fact that you're running the Stata code with 2 CPU cores and the R code with 4 cores (unless I'm misunderstanding something) should be emphasized and bolded in red in 72-point font...
  2. It would also help to report the actual execution times in addition to the relative ones. 60 minutes vs. 6 minutes matters in a common workflow, but 60 milliseconds vs. 6 milliseconds typically does not.

Misleading time function for R multicore operations?

Hi Matthieu,

In the benchmark-r.r file you define the timing function as

time <- function(x){sum(system.time(x)[1:2])}

So, you're essentially summing the "user" (first) and "system" (second) columns of system.time(). However, because "user" CPU time accumulates time spent across cores, aren't you essentially punishing any multicore operation?

I mean, maybe I'm missing something obvious (or it has to do with equating against StataMP's timing functions), but why not simply take the "elapsed" (third) column of system.time()?
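
A minimal sketch of that alternative (the helper name time_elapsed is mine): take only the wall-clock column, so multithreaded steps are not charged for CPU time accumulated across cores.

## wall-clock seconds only
time_elapsed <- function(x) {system.time(x)[["elapsed"]]}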

A probably redundant reprex:

library(fst)
library(fixest)
library(microbenchmark)

setFixest_nthreads(4)

DT <- read.fst("~/statabenchmark/1e7.fst") 
DT1 <- DT[1:(nrow(DT)/2),]

reg_2hfe <- function() invisible(feols(v3 ~  v2 + id4 + id5  + as.factor(v1) | id6 + id3, DT1))

## time function
time <- function(x){sum(system.time(x)[1:2])}
time(reg_2hfe())
#> [1] 31.89

## full system.time output (for comparison)
system.time(reg_2hfe())
#>    user  system elapsed 
#>  30.445   0.911   9.410

## microbenchmark (another comparison)
microbenchmark(reg_2hfe(), times = 5)
#> Unit: seconds
#>        expr      min       lq     mean   median       uq      max neval
#>  reg_2hfe() 9.540085 9.590495 9.669584 9.638338 9.692307 9.886695     5

Created on 2020-06-10 by the reprex package (v0.3.0)

collapse

Hi Matthieu,

Is it possible to add -collapse- to the benchmarks?
Probably two versions, one when collapsing to a small unit (e.g. individual -> country) and one for larger unit (e.g. individual -> household).

I think it's one of the most common Stata commands and would be interesting to see its performance (I have the gut feeling that it will be quite slow, even with the -fast- option).
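
For reference, the closest R analogue of -collapse- in this benchmark's toolkit is a grouped aggregation in data.table. A minimal sketch of the two requested variants; the grouping columns id3 and id6, and which of them plays the coarse vs. fine role, are assumptions rather than the benchmark's actual choice:

library(data.table)

time <- function(x) {sum(system.time(x)[1:2])}

## collapse to a coarse unit (few groups) and to a fine unit (many groups)
time(DT[, .(mean_v3 = mean(v3), sum_v2 = sum(v2)), by = id3])
time(DT[, .(mean_v3 = mean(v3), sum_v2 = sum(v2)), by = id6])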

sreshape

Hi

Have you tried using sreshape instead of Stata's default reshape? In my particular application it was 10x faster than the default.

PS: I wrote a single-line timing function called timeit a while back, it essentially wraps timer on/timer off. This may or may not be more convenient than the tic/toc you use.

Benchmark result using Windows 11 using Intel i9-13900K

Dear Matthieu,

I have replicated your test on a brand-new personal computer running Windows 11 with a 13th Gen Intel i9-13900K (3000 MHz, 24 cores) and 64 GB of RAM.
My Stata 18 license is MP (12 cores), but because the Intel CPU has only 8 'real' (performance) cores I set the core parameter in your test to 8 for both Stata and R (R version 4.2.3 (2023-03-15 ucrt)).
This is my result:

[Benchmark results chart for the 1e7 dataset]

It is interesting to observe that this is not that different from your earlier benchmark comparison, except for 'plot 1000 points' and 'append', where Stata performs relatively better.
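
For reproducibility, the R-side thread caps can be set explicitly; a minimal sketch matching the 8-core setup described above (setDTthreads and setFixest_nthreads are the standard thread-setting functions in data.table and fixest):

library(data.table)
library(fixest)

setDTthreads(8)        # cap data.table operations at 8 threads
setFixest_nthreads(8)  # cap fixest estimations at 8 threads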

Include fst package and missing values

Thanks for this repo: this is an illuminating benchmark.

I was wondering whether you could consider updating the binary IO in R with the relatively recent fst package: https://github.com/fstpackage/fst -- though you may want to wait until data.table 1.10.6, which has very fast CSV reading.

Also, I think you should consider adding in missing values (NA, NaN) in one of the data sets.
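
A minimal sketch of both suggestions on the R side; the column v1, the 5% missing share, and the file name 1e7_na.fst are assumptions, not part of the original benchmark:

library(data.table)
library(fst)

## inject missing values into one column of the generated dataset (DT assumed loaded)
DT[sample(.N, round(.N * 0.05)), v1 := NA]

## binary write and read back with fst
write.fst(DT, "~/statabenchmark/1e7_na.fst")
DT3 <- setDT(read.fst("~/statabenchmark/1e7_na.fst"))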

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.