dbloadss

This package and its accompanying scripts test load times into Microsoft SQL Server 2017 (SS) using three methods:

  1. Push from R with RODBC
  2. Push from R with odbc
  3. Pull from SS with a stored procedure

This package exists mainly for number 3; the rest can be done from a script, although the package is a useful place to keep some shared functions.

Test Database

The code for this post runs on my Windows 10 laptop, where I have a local SQL Server 2017 instance running with a database called ml.
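
To confirm the instance is reachable from R before benchmarking, a quick connectivity check helps. This is a minimal sketch assuming my instance name (localhost\SQL17ML) and database (ml); adjust both for your own setup.

library(DBI)

# Connect to the local instance; the server and database names below are
# assumptions matching my setup -- change them to suit yours.
con <- dbConnect(odbc::odbc(),
                 driver = "SQL Server",
                 server = "localhost\\SQL17ML",
                 database = "ml")
dbGetQuery(con, "SELECT @@VERSION AS version")
dbDisconnect(con)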

Load Times

We're going to push a big data frame into SQL Server using three methods. The data set is a simple randomly generated data frame.

library(dbloadss)

random_data_set(n_rows = 5, n_cols = 5)
#>      COL_1    COL_2     COL_3    COL_4    COL_5
#> 1 7.337917 3.873387 0.4994443 2.026814 9.443235
#> 2 2.281623 4.896035 6.9717638 8.305794 7.919791
#> 3 3.512538 1.542272 5.2494306 3.978269 3.453247
#> 4 5.815818 2.686854 7.6096585 2.029168 8.976130
#> 5 3.637890 6.507928 2.3285816 8.489208 5.399292
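
For reference, random_data_set is a thin wrapper around R's random number generation. Below is a minimal sketch of what it might do, assuming uniform draws on [0, 10); the real implementation is the one exported by the package.

# Illustrative sketch of random_data_set, assuming uniform draws on
# [0, 10); the actual function lives in dbloadss.
random_data_set <- function(n_rows, n_cols) {
  cols <- replicate(n_cols, runif(n_rows, min = 0, max = 10),
                    simplify = FALSE)
  names(cols) <- paste0("COL_", seq_len(n_cols))
  as.data.frame(cols)
}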

R is pretty quick at this sort of thing, so we don't really need to worry about how long it takes to make a big data frame.

n_rows <- 3000000
system.time({
  random_data <- random_data_set(n_rows = n_rows, n_cols = 5)
})
#>    user  system elapsed 
#>    0.75    0.01    0.81

But what we're interested in is how fast we can push this to SQL Server.

RODBC

RODBC was, for a long time, the standard way to connect to SQL Server from R. It's a great package that makes it easy to send queries and collect results, and it handles type conversions pretty well. However, it is slow for pushing data in. The fastest I could manage was the sqlSave function with the safeties off (fast = TRUE, safer = FALSE); I'd be very interested to hear if there's a better method. For 3 million rows it's a no-go, so scaling back to 30,000 rows we get:

library(RODBC)

db <- odbcDriverConnect('driver={SQL Server};server=localhost\\SQL17ML;database=ml;trusted_connection=true')

n_rodbc <- 30000

odbcQuery(db, "drop table randData;")
#> [1] 1
time30k <- system.time({
  RODBC::sqlSave(
    db,
    dat = random_data[1:n_rodbc,],
    tablename = "randData",
    rownames = FALSE,
    fast = TRUE,
    safer = FALSE
  )
})
odbcClose(db)

time30k
#>    user  system elapsed 
#>    1.95    0.62  142.98

Over 2 minutes! The write time has been roughly linear for me, so the total for 3 million rows extrapolates to about four hours (100 × 143 s ≈ 14,300 s).

ODBC

odbc is a relatively new package from RStudio which provides a DBI-compliant ODBC interface. It is considerably faster for writes, so here we'll push the full 3 million rows.

library(DBI)

dbi <- dbConnect(odbc::odbc(),
                 driver = "SQL Server",
                 server="localhost\\SQL17ML",
                 database = "ml")

time3modbc <- system.time({
  dbWriteTable(dbi,
               name = "randData",
               value = random_data,
               overwrite = TRUE)
})
dbDisconnect(dbi)

time3modbc
#>    user  system elapsed 
#>   10.28    0.62   63.80
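
Putting the two methods side by side makes the gap clearer. The back-of-the-envelope comparison below just uses the elapsed times measured above.

# Rough throughput from the elapsed times measured above
rodbc_rate <- 30000 / 142.98    # ~210 rows/sec (RODBC::sqlSave, 30k rows)
odbc_rate  <- 3000000 / 63.80   # ~47,000 rows/sec (DBI::dbWriteTable, 3m rows)
odbc_rate / rodbc_rate          # odbc comes out roughly 220x faster here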

SQL Server External Script

An alternative approach is to use the new features in SQL Server 2017 (and 2016) for calling out to R scripts from SQL. This is done via the sp_execute_external_script command, which we will wrap in a stored procedure. Here is what that looks like for me:

use [ml];

DROP PROC IF EXISTS generate_random_data;
GO
CREATE PROC generate_random_data(@nrow int)
AS
BEGIN
  EXEC sp_execute_external_script
      @language = N'R'
    , @script = N'
        library(dbloadss)
        random_data <- random_data_set(n_rows = nrow_r, n_cols = 5)
      '
    , @output_data_1_name = N'random_data'
    , @params = N'@nrow_r int'
    , @nrow_r = @nrow
    WITH RESULT SETS ((
        "COL_1" float not null,
        "COL_2" float not null,
        "COL_3" float not null,
        "COL_4" float not null,
        "COL_5" float not null));
END;
GO

We then call the stored procedure with another query (skipping a step that clears the table between tests).

INSERT INTO randData
EXEC [dbo].[generate_random_data] @nrow = 3000000

[Figure: SQL Server timer showing the query execution time]

This runs in 34 seconds. My best guess for the performance increase is that the data is serialised more efficiently. More to investigate.
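
For completeness, the same load can be triggered and timed from R through DBI. The sketch below is my own addition rather than part of the original benchmark, and it assumes the randData table already exists with the five-column schema declared above.

library(DBI)

# Time the stored-procedure load from R. Assumes randData already exists
# with the five float columns from the RESULT SETS clause.
con <- dbConnect(odbc::odbc(),
                 driver = "SQL Server",
                 server = "localhost\\SQL17ML",
                 database = "ml")
time3msp <- system.time({
  dbExecute(con,
            "INSERT INTO randData EXEC [dbo].[generate_random_data] @nrow = 3000000;")
})
dbDisconnect(con)
time3msp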

License

MIT © Mango Solutions
