Git Product home page Git Product logo

starschemar's Introduction

starschemar

Travis build status

The multidimensional data model was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure.

Transformations can be carried out using professional ETL (extract, transform and load) tools. Recently, tools aimed at end users have emerged, which are also aimed at performing transformation operations. All these tools are very useful to carry out the transformation process, they provide a development environment to define the transformation operations in a general way.

Frequently, the operations to be performed aim to transform a flat table (with data that comes from operational systems) into a star schema (which implements a multidimensional system). With the tools mentioned above, this transformation can be carried out, but it requires a lot of work.

The goal of starschemar is to define transformations that allow you to easily obtain star schemas from flat tables. In addition, it includes basic data cleaning, dimension enrichment, incremental data refresh and query operations, adapted to this context.

Installation

You can install the released version of starschemar from CRAN with:

install.packages("starschemar")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("josesamos/starschemar")

Example

To illustrate how the package works we will use a small part of the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System data set in the form of a flat table, the first rows of which are shown below.

Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths Other Deaths
1962 1 1962-01-06 1 CT Bridgeport 3 46 43
1962 2 1962-01-13 1 CT Bridgeport 2 43 41
1962 3 1962-01-20 1 CT Bridgepor 2 40 38
1962 1 1962-01-06 9 WA Tacoma 4 50 46
NA NA 1962-01-13 9 WA Tacoma 2 45 43
1962 3 1962-01-20 9 WA Tacoma 0 39 39

The original data has been modified to have some erroneous or missing data.

The transformation to obtain a star schema from the flat table using starschemar package is as follows:

library(starschemar)
library(tidyr)

# columns to consider in the definition
dput(colnames(ft))
#> c("Year", "WEEK", "Week Ending Date", "REGION", "State", "City", 
#> "Pneumonia and Influenza Deaths", "All Deaths", "Other Deaths"
#> )

dm <- dimensional_model() %>%
  define_fact(
    name = "mrs_cause",
    measures = c(
      "Pneumonia and Influenza Deaths",
      "Other Deaths"
    ),
  ) %>%
  define_dimension(
    name = "when",
    attributes = c(
      "Week Ending Date",
      "WEEK",
      "Year"
    )
  ) %>%
  define_dimension(
    name = "where",
    attributes = c(
      "REGION",
      "State",
      "City"
    )
  )

st <- star_schema(ft, dm) %>%
  snake_case() %>%
  character_dimensions(
    NA_replacement_value = "Unknown",
    length_integers = list(week = 2)
  )

The tables of dimensions and facts of the obtained star schema are shown below.

when_key week_ending_date week year
1 1962-01-06 01 1962
2 1962-01-13 02 1962
3 1962-01-13 Unknown Unknown
4 1962-01-20 03 1962
where_key region state city
1 1 CT Bridgepor
2 1 CT Bridgeport
3 9 WA Tacoma
when_key where_key pneumonia_and_influenza_deaths other_deaths nrow_agg
1 2 3 43 1
1 3 4 46 1
2 2 2 41 1
3 3 2 43 1
4 1 2 38 1
4 3 0 39 1

The tables show the erroneous and missing data. We are going to perform some data cleaning operations to correct them.

where <- st %>%
  get_dimension("where")

when <- st %>%
  get_dimension("when")

updates <- record_update_set() %>%
  update_selection(
    dimension = where,
    columns = c("city"),
    old_values = c("Bridgepor"),
    new_values = c("Bridgeport")
  ) %>%
  match_records(dimension = when,
                old = 3,
                new = 2)

st <- st %>%
  modify_dimension_records(updates)

The new dimension and fact tables are shown below.

when_key week_ending_date week year
1 1962-01-06 01 1962
2 1962-01-13 02 1962
3 1962-01-20 03 1962
where_key region state city
1 1 CT Bridgeport
2 9 WA Tacoma
where_key when_key pneumonia_and_influenza_deaths other_deaths nrow_agg
1 1 3 43 1
1 2 2 41 1
1 3 2 38 1
2 1 4 46 1
2 2 2 43 1
2 3 0 39 1

In addition to the operations in the examples shown, starschemar offers support for defining role playing and role dimensions in a star schema, fact constellations with conformed dimensions, dimension enrichment, incremental data refresh operations, the ability to export results in various tibble-based formats, and query operations, adapted to this context.

starschemar's People

Contributors

josesamos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.