skimr

The goal of skimr is to provide a frictionless approach to dealing with summary statistics iteratively and interactively as part of a pipeline, and that conforms to the principle of least surprise.

skimr provides summary statistics that you can skim quickly to understand your data and see what may be missing. It handles different data types (numerics, factors, etc.), and returns a skimr object that can be piped or displayed nicely for the human reader.

See our blog post here.

Installation

# install.packages("devtools")
devtools::install_github("hadley/pillar")
devtools::install_github("ropenscilabs/skimr")

Skim statistics in the console

Compared to other summary()-type functions, skimr:

  • adds missing, complete, n, and sd
  • reports numeric/int/double separately from factor/chr
  • handles dates, logicals and other classes
  • uses Hadley Wickham's pillar package, specifically pillar::spark-bar
  • can be extended to new data classes within data frames and new object types

Nicely separates numeric and factor variables:

skim(chickwts)
## Skim summary statistics
##  n obs: 71 
##  n variables: 2 
## 
## Variable type: factor 
##    var missing complete  n n_unique                         top_counts ordered
## 1 feed       0       71 71        6 soy: 14, cas: 12, lin: 12, sun: 12   FALSE
## 
## Variable type: numeric 
##      var missing complete  n   mean    sd min   p25 median   p75 max     hist
## 1 weight       0       71 71 261.31 78.07 108 204.5    258 323.5 423 ▃▅▅▇▃▇▂▂

Many numeric variables:

skim(mtcars)
## Skim summary statistics
##  n obs: 32 
##  n variables: 11 
## 
## Variable type: numeric 
##     var missing complete  n   mean     sd   min    p25 median    p75    max     hist
## 1    am       0       32 32   0.41   0.5   0      0      0      1      1    ▇▁▁▁▁▁▁▆
## 2  carb       0       32 32   2.81   1.62  1      2      2      4      8    ▆▇▂▇▁▁▁▁
## 3   cyl       0       32 32   6.19   1.79  4      4      6      8      8    ▆▁▁▃▁▁▁▇
## 4  disp       0       32 32 230.72 123.94 71.1  120.83 196.3  326    472    ▇▆▁▂▅▃▁▂
## 5  drat       0       32 32   3.6    0.53  2.76   3.08   3.7    3.92   4.93 ▃▇▁▅▇▂▁▁
## 6  gear       0       32 32   3.69   0.74  3      3      4      4      5    ▇▁▁▆▁▁▁▂
## 7    hp       0       32 32 146.69  68.56 52     96.5  123    180    335    ▃▇▃▅▂▃▁▁
## 8   mpg       0       32 32  20.09   6.03 10.4   15.43  19.2   22.8   33.9  ▃▇▇▇▃▂▂▂
## 9  qsec       0       32 32  17.85   1.79 14.5   16.89  17.71  18.9   22.9  ▃▂▇▆▃▃▁▁
## 10   vs       0       32 32   0.44   0.5   0      0      0      1      1    ▇▁▁▁▁▁▁▆
## 11   wt       0       32 32   3.22   0.98  1.51   2.58   3.33   3.61   5.42 ▃▃▃▇▆▁▁▂

Another example:

skim(iris)
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
## 
## Variable type: factor 
##       var missing complete   n n_unique                       top_counts ordered
## 1 Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0   FALSE
## 
## Variable type: numeric 
##            var missing complete   n mean   sd min p25 median p75 max     hist
## 1 Petal.Length       0      150 150 3.76 1.77 1   1.6   4.35 5.1 6.9 ▇▁▁▂▅▅▃▁
## 2  Petal.Width       0      150 150 1.2  0.76 0.1 0.3   1.3  1.8 2.5 ▇▁▁▅▃▃▂▂
## 3 Sepal.Length       0      150 150 5.84 0.83 4.3 5.1   5.8  6.4 7.9 ▂▇▅▇▆▅▂▂
## 4  Sepal.Width       0      150 150 3.06 0.44 2   2.8   3    3.3 4.4 ▁▂▅▇▃▂▁▁

Handles grouped data

skim() can handle data that has been grouped using dplyr::group_by().

iris %>% dplyr::group_by(Species) %>% skim()
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
##  group variables: Species 
## 
## Variable type: numeric 
##       Species          var missing complete  n mean   sd min  p25 median  p75 max     hist
## 1      setosa Petal.Length       0       50 50 1.46 0.17 1   1.4    1.5  1.58 1.9 ▁▁▅▇▇▅▂▁
## 2      setosa  Petal.Width       0       50 50 0.25 0.11 0.1 0.2    0.2  0.3  0.6 ▂▇▁▂▂▁▁▁
## 3      setosa Sepal.Length       0       50 50 5.01 0.35 4.3 4.8    5    5.2  5.8 ▂▃▅▇▇▃▁▂
## 4      setosa  Sepal.Width       0       50 50 3.43 0.38 2.3 3.2    3.4  3.68 4.4 ▁▁▃▅▇▃▂▁
## 5  versicolor Petal.Length       0       50 50 4.26 0.47 3   4      4.35 4.6  5.1 ▁▃▂▆▆▇▇▃
## 6  versicolor  Petal.Width       0       50 50 1.33 0.2  1   1.2    1.3  1.5  1.8 ▆▃▇▅▆▂▁▁
## 7  versicolor Sepal.Length       0       50 50 5.94 0.52 4.9 5.6    5.9  6.3  7   ▃▂▇▇▇▃▅▂
## 8  versicolor  Sepal.Width       0       50 50 2.77 0.31 2   2.52   2.8  3    3.4 ▁▂▃▅▃▇▃▁
## 9   virginica Petal.Length       0       50 50 5.55 0.55 4.5 5.1    5.55 5.88 6.9 ▂▇▃▇▅▂▁▂
## 10  virginica  Petal.Width       0       50 50 2.03 0.27 1.4 1.8    2    2.3  2.5 ▂▁▇▃▃▆▅▃
## 11  virginica Sepal.Length       0       50 50 6.59 0.64 4.9 6.23   6.5  6.9  7.9 ▁▁▃▇▅▃▂▃
## 12  virginica  Sepal.Width       0       50 50 2.97 0.32 2.2 2.8    3    3.18 3.8 ▁▃▇▇▅▃▁▂

skim_df object (long format)

By default skim prints beautifully in the console, but it also produces a long, tidy-format skim_df object that can be computed on.

a <-  skim(chickwts)
dim(a)
## [1] 23  6
print.data.frame(skim(chickwts))
##       var    type       stat     level    value formatted
## 1  weight numeric    missing      .all   0.0000         0
## 2  weight numeric   complete      .all  71.0000        71
## 3  weight numeric          n      .all  71.0000        71
## 4  weight numeric       mean      .all 261.3099    261.31
## 5  weight numeric         sd      .all  78.0737     78.07
## 6  weight numeric        min      .all 108.0000       108
## 7  weight numeric        p25      .all 204.5000     204.5
## 8  weight numeric     median      .all 258.0000       258
## 9  weight numeric        p75      .all 323.5000     323.5
## 10 weight numeric        max      .all 423.0000       423
## 11 weight numeric       hist      .all       NA  ▃▅▅▇▃▇▂▂
## 12   feed  factor    missing      .all   0.0000         0
## 13   feed  factor   complete      .all  71.0000        71
## 14   feed  factor          n      .all  71.0000        71
## 15   feed  factor   n_unique      .all   6.0000         6
## 16   feed  factor top_counts   soybean  14.0000   soy: 14
## 17   feed  factor top_counts    casein  12.0000   cas: 12
## 18   feed  factor top_counts   linseed  12.0000   lin: 12
## 19   feed  factor top_counts sunflower  12.0000   sun: 12
## 20   feed  factor top_counts  meatmeal  11.0000   mea: 11
## 21   feed  factor top_counts horsebean  10.0000   hor: 10
## 22   feed  factor top_counts      <NA>   0.0000     NA: 0
## 23   feed  factor    ordered      .all   0.0000     FALSE

Compute on the full skim_df object

skim(mtcars) %>% dplyr::filter(stat=="hist")
## # A tibble: 11 x 6
##      var    type  stat level value formatted
##    <chr>   <chr> <chr> <chr> <dbl>     <chr>
##  1   mpg numeric  hist  .all    NA  ▃▇▇▇▃▂▂▂
##  2   cyl numeric  hist  .all    NA  ▆▁▁▃▁▁▁▇
##  3  disp numeric  hist  .all    NA  ▇▆▁▂▅▃▁▂
##  4    hp numeric  hist  .all    NA  ▃▇▃▅▂▃▁▁
##  5  drat numeric  hist  .all    NA  ▃▇▁▅▇▂▁▁
##  6    wt numeric  hist  .all    NA  ▃▃▃▇▆▁▁▂
##  7  qsec numeric  hist  .all    NA  ▃▂▇▆▃▃▁▁
##  8    vs numeric  hist  .all    NA  ▇▁▁▁▁▁▁▆
##  9    am numeric  hist  .all    NA  ▇▁▁▁▁▁▁▆
## 10  gear numeric  hist  .all    NA  ▇▁▁▆▁▁▁▂
## 11  carb numeric  hist  .all    NA  ▆▇▂▇▁▁▁▁
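
Because the skim_df object is an ordinary long-format table, single statistics can also be pulled out and combined with base R. The sketch below does not call skimr at all: it rebuilds a few rows of the skim(chickwts) output printed earlier by hand, and the names skim_long, cv and range_width are illustrative assumptions rather than anything skimr provides.

```r
# Hand-built stand-in for part of skim(chickwts); the values are copied from
# the printed long-format output above. With skimr loaded, skim(chickwts)
# itself could be used in place of this data frame.
skim_long <- data.frame(
  var   = "weight",
  type  = "numeric",
  stat  = c("mean", "sd", "min", "median", "max"),
  level = ".all",
  value = c(261.3099, 78.0737, 108, 258, 423),
  stringsAsFactors = FALSE
)

# Keep the numeric rows and index the statistics by name.
numeric_rows <- skim_long[skim_long$type == "numeric", ]
stats <- setNames(numeric_rows$value, numeric_rows$stat)

# Derived quantities computed from the skimmed statistics.
cv <- stats[["sd"]] / stats[["mean"]]            # coefficient of variation
range_width <- stats[["max"]] - stats[["min"]]   # max minus min

print(round(cv, 3))
print(range_width)
```

The same pattern should carry over to any statistic stored in the value column; the hist rows hold NA there, so they are better read from the formatted column.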

Works with strings, lists and other column classes!

skim(dplyr::starwars)
## Skim summary statistics
##  n obs: 87 
##  n variables: 13 
## 
## Variable type: character 
##          var missing complete  n min max empty n_unique
## 1  eye_color       0       87 87   3  13     0       15
## 2     gender       3       84 87   4  13     0        4
## 3 hair_color       5       82 87   4  13     0       12
## 4  homeworld      10       77 87   4  14     0       48
## 5       name       0       87 87   3  21     0       87
## 6 skin_color       0       87 87   3  19     0       31
## 7    species       5       82 87   3  14     0       37
## 
## Variable type: integer 
##      var missing complete  n   mean    sd min p25 median p75 max     hist
## 1 height       6       81 87 174.36 34.77  66 167    180 191 264 ▁▁▁▂▇▃▁▁
## 
## Variable type: list 
##         var missing complete  n n_unique min_length median_length max_length
## 1     films       0       87 87       24          1             1          7
## 2 starships       0       87 87       17          0             0          5
## 3  vehicles       0       87 87       11          0             0          2
## 
## Variable type: numeric 
##          var missing complete  n  mean     sd min  p25 median  p75  max     hist
## 1 birth_year      44       43 87 87.57 154.69   8 35       52 72    896 ▇▁▁▁▁▁▁▁
## 2       mass      28       59 87 97.31 169.46  15 55.6     79 84.5 1358 ▇▁▁▁▁▁▁▁

Specify your own statistics

# Custom statistics for numeric columns: the IQR and the 99th percentile
funs <- list(iqr = IQR,
             quantile = purrr::partial(quantile, probs = .99))
# append = FALSE replaces the default numeric statistics instead of adding to them
skim_with(numeric = funs, append = FALSE)
skim_v(iris$Sepal.Length)
## # A tibble: 2 x 5
##      type     stat level value formatted
##     <chr>    <chr> <chr> <dbl>     <chr>
## 1 numeric      iqr  .all   1.3       1.3
## 2 numeric quantile   99%   7.7       7.7
# Restore defaults
skim_with_defaults()
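
To make the idea of a named list of statistics more concrete, here is a minimal base R sketch that applies such a list to one vector and collects the results in long form. It only illustrates the concept and is not how skim_with() or skim_v() are implemented; the names values and result are assumptions, and an anonymous function stands in for purrr::partial() so that no extra package is needed.

```r
# A named list of summary functions, mirroring the example above.
funs <- list(
  iqr      = IQR,
  quantile = function(x) quantile(x, probs = 0.99)
)

# Apply every function to the same vector; unname() drops the "99%" label
# that quantile() attaches so each result is a plain number.
values <- vapply(funs, function(f) unname(f(iris$Sepal.Length)), numeric(1))

# Collect the results as one row per statistic.
result <- data.frame(
  stat  = names(values),
  value = unname(values),
  stringsAsFactors = FALSE
)
print(result)
```

The two values agree with the iqr and quantile rows in the skim_v() output above (1.3 and 7.7).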

Limitations of current version

Currently the print methods are still in early stages of development. Printing is limited to numeric, character, and factor data types. Therefore, additional types that are supported by skim() and skim_v() will not display with the default printing. To view these, you can inspect and manipulate the skim object directly.

At the moment, in addition to the three types with print support, the complex, logical, Date, POSIXct, and ts classes are supported with skim_v methods, and their results are in the skim object.

We are also aware that both print.skim and print.data.frame (used for the skim object) do not handle significant digits correctly.

The develop branch currently addresses some of these issues. To install the develop branch:

devtools::install_github("ropenscilabs/skimr", ref = "develop")

Windows support for spark histograms

Windows cannot print the spark-histogram characters when printing a data frame. For example, "▂▅▇" is printed as "<U+2582><U+2585><U+2587>". This longstanding problem originates in the low-level code for printing data frames. One workaround for showing these characters in Windows is to set the CTYPE part of your locale to Chinese/Japanese/Korean with Sys.setlocale("LC_CTYPE", "Chinese"). These values do show up by default when a data frame created by skim() is printed as a list (as.list()) or as a matrix (as.matrix()).
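
The workaround can be tried along the following lines. This is a hedged sketch rather than a tested recipe: the one-column data frame spark is a hypothetical stand-in for skim() output, the locale is only changed on Windows, and the previous setting is restored afterwards.

```r
# Hypothetical stand-in for a skim() result: one spark histogram, written
# with Unicode escapes so the script itself stays ASCII.
spark <- data.frame(hist = "\u2582\u2585\u2587", stringsAsFactors = FALSE)

if (.Platform$OS.type == "windows") {
  # Remember the current setting, switch CTYPE, print, then switch back.
  old_ctype <- Sys.getlocale("LC_CTYPE")
  Sys.setlocale("LC_CTYPE", "Chinese")
  print(spark)
  Sys.setlocale("LC_CTYPE", old_ctype)
} else {
  print(spark)
}

# The alternative mentioned above: print the same data as a matrix.
spark_matrix <- as.matrix(spark)
print(spark_matrix)
```

Whether the glyphs render still depends on the console font, so the matrix route may be the simpler one to try first.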

Printing spark histograms and line graphs in knitted documents

Spark-bar and spark-line work in the console but may not work when you knit them to a specific document format. The same session that produces a correctly rendered HTML document may produce an incorrectly rendered PDF, for example. This issue can generally be addressed by changing to a font with good support for block elements (for histograms) and Braille characters (for line graphs). For example, the open font "DejaVu Sans" from the extrafont package supports these. You may also want to try wrapping your results in knitr::kable().

Displays in documents of different types will vary. For example, one user found that the font "Yu Gothic UI Semilight" produced consistent results for Microsoft Word and LibreOffice Writer.
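
As a rough illustration of the knitr::kable() suggestion, the sketch below wraps a small table containing a spark histogram. The data frame spark is a hypothetical stand-in for skim() output (the histogram is the one printed for chickwts weight above), and the code falls back to plain printing if knitr is not installed.

```r
# Hypothetical stand-in for skim() output: one variable with its spark
# histogram, written with Unicode escapes so the script stays ASCII.
spark <- data.frame(
  var  = "weight",
  hist = "\u2583\u2585\u2585\u2587\u2583\u2587\u2582\u2582",
  stringsAsFactors = FALSE
)

if (requireNamespace("knitr", quietly = TRUE)) {
  # kable() emits a plain markdown table that the document's font then renders.
  print(knitr::kable(spark))
} else {
  print(spark)
}
```

Whether the bars display correctly still depends on the font used by the output format, as described above.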

Contributing

We welcome issue reports and pull requests, including those that add support for different variable classes.
