Git Product home page Git Product logo

etl's Introduction

etl

Travis-CI Build Status CRAN_Status_Badge CRAN RStudio mirror downloads

etl is an R package to facilitate Extract - Transform - Load (ETL) operations for medium data. The end result is generally a populated SQL database, but the user interaction takes place solely within R.

etl is on CRAN, so you can install it in the usual way, then load it.

install.packages("etl")
library(etl)

Instantiate an etl object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial mtcars database is built into etl.

cars <- etl("mtcars")
## No database was specified so I created one for you at:

## /tmp/Rtmpw1T0Oo/file44856c2958c2.sqlite3
class(cars)
## [1] "etl_mtcars" "etl"        "src_dbi"    "src_sql"    "src"

Connect to a local or remote database

etl works with a local or remote database to store your data. Every etl object extends a dplyr::src_dbi object. If, as in the example above, you do not specify a SQL source, a local RSQLite database will be created for you. However, you can also specify any source that inherits from dplyr::src_dbi.

Note: If you want to use a database other than a local RSQLite, you must create the mtcars database and have permission to write to it first!

# For PostgreSQL
library(RPostgreSQL)
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")

# Alternatively, for MySQL
library(RMySQL)
db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
cars <- etl("mtcars", db)

At the heart of etl are three functions: etl_extract(), etl_transform(), and etl_load().

Extract

The first step is to acquire data from an online source.

cars %>%
  etl_extract()
## Extracting raw data...

This creates a local store of raw data.

Transform

These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).

cars %>%
  etl_transform()
## Transforming raw data...

Load

Populate the SQL database with the transformed data.

cars %>%
  etl_load()
## Loading 1 file(s) into the database...

Do it all at once

To populate the whole database from scratch, use etl_create.

cars %>%
  etl_create()
## Initializing DB using SQL script init.sqlite

## Extracting raw data...

## Transforming raw data...

## Loading 1 file(s) into the database...

You can also update an existing database without re-initializing, but watch out for primary key collisions.

cars %>%
  etl_update()

Do Your Analysis

Now that your database is populated, you can work with it as a src data table just like any other dplyr source.

cars %>%
  tbl("mtcars") %>%
  group_by(cyl) %>%
  summarise(N = n(), mean_mpg = mean(mpg))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning

## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.22.0 [/tmp/Rtmpw1T0Oo/file44856c2958c2.sqlite3]
##     cyl     N mean_mpg
##   <int> <int>    <dbl>
## 1     4    11     26.7
## 2     6     7     19.7
## 3     8    14     15.1

Create your own ETL packages

Suppose you want to create your own ETL package called pkgname. All you have to do is write a package that requires etl, and then you have to write two S3 methods:

etl_extract.etl_pkgname()
etl_load.etl_pkgname()

Please see the "Extending etl" vignette for more information.

Use other ETL packages

  • macleish Travis-CI Build Status CRAN_Status_Badge : Weather and spatial data from the MacLeish Field Station in Whately, MA.
  • airlines Travis-CI Build Status : On-time flight arrival data from the Bureau of Transportation Statistics
  • citibike Travis-CI Build Status : Municipal bike-sharing system in New York City
  • nyc311 Travis-CI Build Status : Phone calls to New York City's feedback hotline
  • fec Travis-CI Build Status : Campaign contribution data from the Federal Election Commission
  • imdb Travis-CI Build Status : Mirror of the Internet Movie Database

Cite

Please see the full manuscript for additional details.

citation("etl")
## 
## To cite etl in publications use:
## 
##   Benjamin S. Baumer (2017). A Grammar for Reproducible and
##   Painless Extract-Transform-Load Operations on Medium Data.
##   arXiv, 8(23), 1-24. URL https://arxiv.org/abs/1708.07073.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data},
##     author = {Benjamin S. Baumer},
##     journal = {arXiv},
##     year = {2017},
##     volume = {8},
##     number = {23},
##     pages = {1--24},
##     url = {https://arxiv.org/abs/1708.07073},
##   }

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.