dcomtois / summarytools

R Package to Quickly and Neatly Summarize Data

summarytools's Introduction

Summarytools 1.0 is out!

summarytools is an R package for data cleaning, exploring, and simple reporting. The package was developed with the following objectives in mind:

Provide a coherent set of easy-to-use descriptive functions that are akin to those included in commercial statistical software suites such as SAS, SPSS, and Stata
Offer flexibility in terms of output format & content
Integrate well with commonly used software & tools for reporting (the RStudio IDE, Rmarkdown, and knitr) while also allowing for standalone, simple report generation from any R interface

On a more personal level, I simply wish to share with the R community and the scientific community at large the functions I first developed for myself, that I ultimately realized would benefit a lot of people who are looking for the same thing I was seeking in the first place.

Support summarytools’ Development

If summarytools helps you get things done, please consider making a donation. By doing so now, you’ll help me feel useful, but more importantly contribute to the package’s development and help other people like you who benefit from its current and future features. I regularly receive feature requests, and when I receive donations, I set aside some time to work on them, making summarytools even more relevant for data scientists, students and researchers around the world. No matter how small the amount is, I always appreciate the gesture. A list of sponsors can be found further below.

Package Documentation

The bulk of the technical documentation can now be found in the following vignettes:

Introduction to summarytools | CRAN version
Summarytools in R Markdown | CRAN Version
PDF Manual (automatically generated by CRAN)

Installing summarytools

Required Software

Additional software is used by summarytools for fine-tuning graphics as well as offering interactive functionality. If you are installing summarytools for the first time, click on the relevant link to get OS-specific instructions. On Windows, no additional software is required.

Mac OS X
Ubuntu / Debian / Mint
Older Ubuntu (14 and 16)
Fedora / Red Hat / CentOS
Solaris

Installing From GitHub

This method has the advantage of benefiting from minor fixes and improvements that are added between CRAN releases. Its main drawback is that you won’t be notified when a new version is available. You can either check this page from time to time or, better, use a package that checks for package updates on various repositories, such as dtupdate and drat.

install.packages("remotes")        # Using devtools is also possible
library(remotes)
install_github("rapporter/pander") # Strongly recommended
install_github("dcomtois/summarytools", build_vignettes = TRUE)

Installing From CRAN

CRAN versions are stable but are not updated as often as the GitHub versions. On the plus side, they can be easier to install on some systems.

install.packages("summarytools")

Latest Changes

- In dfSummary():
  - It is now possible to control which statistics to show in the Freqs / Values column (see help("st_options", "summarytools") for examples)
  - In HTML outputs, tables are better aligned horizontally (categories >> counts >> charts); if misalignment occurs, adjusting graph.magnif should resolve it
  - List-type columns and Inf values are handled properly
- In descr() and ctable(), several display glitches were corrected
- Selected heading elements can be omitted entirely on an individual basis
- Improved functionality for customized terms / translations

For more details, see vignette("introduction", "summarytools") as well as news(package = "summarytools").

Additional Software Installations

Required Software on Mac OS

Magick++

Open a terminal window and enter the following:

brew install imagemagick@6

If you do not have brew installed, simply enter this command in the terminal:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

XQuartz

If you’re using Mac OS X version 10.8 (Mountain Lion) or more recent versions, you’ll need to download the .dmg image from xquartz.org and add it to your Applications folder.

Required Software for Debian / Ubuntu / Linux Mint

Magick++
sudo apt install libmagick++-dev

Required Software for Older Ubuntu Versions

This applies only if you are using Ubuntu Trusty (14.04) or Xenial (16.04).

Magick++

sudo add-apt-repository -y ppa:opencpu/imagemagick
sudo apt-get update
sudo apt-get install -y libmagick++-dev

Required Software for Fedora / Red Hat / CentOS

Magick++
sudo yum install ImageMagick-c++-devel

Required Software for Solaris

Magick++

pkgadd -d http://get.opencsw.org/now
/opt/csw/bin/pkgutil -U
/opt/csw/bin/pkgutil -y -i imagemagick 
/usr/sbin/pkgchk -L CSWimagemagick

Install packages

source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
install.packages('devtools')
library(devtools)
install_github('dcomtois/summarytools')

Load packages

library("summarytools")
library(Biobase)
library("ALL")

Shiny Server

server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)
    #-- Subset
    eset_object <- ALL[1:3, ] # choose only 3 variables
    #-- The group of interest
    eset_groups <- "BT"
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))),
                          INDICES = (pData(eset_object)[, eset_groups]),
                          FUN = descr, stats = "all",
                          transpose = TRUE)

    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}

Shiny UI

ui <- fluidPage(theme = "dfSummary.css",
                fluidRow(
                  uiOutput("summaryTable")
                )
)

As a side note, if the data is assigned as a global variable (eset_object <<- ALL[1:3, ]), the variable names are displayed. This is not a real solution to the problem, though, since global variables are best avoided!

Suggestion: identify when an ID column contains a checksum

When a data set contains an ID that includes a checksum, this is very useful to know, for example when bar codes are used (EAN, https://en.wikipedia.org/wiki/International_Article_Number), and especially when column names are not obvious.
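A check of this kind could be sketched in base R as follows. The helper name `ean13_is_valid` is hypothetical (it is not part of summarytools), and the sketch assumes the IDs are stored as 13-digit character strings that follow the standard EAN-13 check-digit rule (weights 1 and 3 alternating over the first 12 digits).

```r
# Hypothetical helper: TRUE when `id` is a 13-digit string whose last digit
# matches the EAN-13 check digit computed from the first 12 digits.
ean13_is_valid <- function(id) {
  if (is.na(id) || !grepl("^[0-9]{13}$", id)) return(FALSE)
  digits  <- as.integer(strsplit(id, "")[[1]])
  weights <- rep(c(1, 3), 6)  # weights 1, 3, 1, 3, ... for digits 1 to 12
  check   <- (10 - sum(digits[1:12] * weights) %% 10) %% 10
  check == digits[13]
}

# Made-up ID column: one valid code, one with a wrong check digit, one non-numeric
ids   <- c("4006381333931", "4006381333932", "ABC")
valid <- vapply(ids, ean13_is_valid, logical(1), USE.NAMES = FALSE)

# Share of values that pass the checksum; a share close to 1 would hint
# that the column holds check-digit-protected IDs
mean(valid)
```

A summary could then flag a column when nearly all of its values pass such a test.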

Name of Group variable is not updated when using descr() with by() in Shiny app

I found that the name of the Group variable (and the group level) is not retrieved or refreshed in the UI when a new Group variable is selected in the app. Note that the corresponding table (the calculations) does update when a new group variable is selected.
Moreover, the group variable is displayed in a static manner. The Shiny app below illustrates the issue to some extent. For example, the Group variable is displayed in the UI as shown below:

Group: (pData(eset_object)[, eset_groups]) = B


# Install packages
source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
install.packages('devtools')
devtools::install_github('dcomtois/summarytools')

# Load packages
library(summarytools)
library(Biobase)
library(ALL) 

# Shiny Server
server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)  
    #-- Subset
    eset_object <- ALL [1:3,] # choose only 3 variables 
    #-- The group of interest 
    eset_groups <-"BT"
    # print(rownames (eset_object)) # print variable names
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))), 
                          INDICES = (pData(eset_object)[,eset_groups]), 
                          FUN = descr, stats ="all", 
                          transpose = TRUE)

    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}

# Shiny UI
ui <- fluidPage(theme = "dfSummary.css",
                fluidRow(
                  uiOutput("summaryTable")
                )
)

# Launch
shinyApp(ui, server)

Of note, if you replace eSet_object <- relevant_est() with eSet_object <<- relevant_est() (that is, assign it in the global environment), the Data Frame heading is retrieved and displayed in the UI, as shown below:

Data Frame: as.data.frame(t(exprs(eSet_object)))
Group: (pData(eSet_object)[, eSet_groups]) = B

Winword output suggestion

Thanks again for your great package. Would it be possible to add some suggestions on how to render the output in Word or HTML using R Markdown in RStudio?
Best

Suggestion: how often do ID values occur

When analyzing a data set with, e.g., client IDs, it is very useful to know how often unique IDs appear in the data set, e.g. 90% appear once, 5% appear twice, etc. (data frame summary)
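The distribution described here can be obtained in base R by tabulating the per-ID counts a second time. This is only a sketch on made-up data; `client_id` is a hypothetical column.

```r
# Made-up ID column for illustration
client_id <- c("a", "b", "b", "c", "d", "d", "d", "e")

# Step 1: number of occurrences of each ID
per_id <- table(client_id)

# Step 2: how many IDs occur once, twice, three times, ...
occurrence_dist <- table(as.vector(per_id))

# Same information as percentages of the distinct IDs
occurrence_pct <- round(100 * prop.table(occurrence_dist), 1)
occurrence_pct
```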

Error in ctable function: $ operator is invalid for atomic vectors

When I use the ctable function with the pipe operator %$% from the package magrittr an error occurs: Error: $ operator is invalid for atomic vectors

library(summarytools)
library(magrittr)

tobacco %$% ctable(smoker, diseased)

Traceback
14. na.omit(c(parse_info_y$var_names, deparse(dnn[[2]]))) at ctable.R#194
13. ctable(smoker, diseased)
12. eval(substitute(expr), data, enclos = parent.frame())
11. eval(substitute(expr), data, enclos = parent.frame())
10. with.default(., ctable(smoker, diseased))
9. with(., ctable(smoker, diseased))
8. function_list[k]
7. withVisible(function_list[k])
6. freduce(value, _function_list)
5. _fseq(_lhs)
4. eval(quote(_fseq(_lhs)), env, env)
3. eval(quote(_fseq(_lhs)), env, env)
2. withVisible(eval(quote(_fseq(_lhs)), env, env))

tobacco %$% ctable(smoker, diseased)

Error appears in na.omit function in line 194, ctable.R file.

y_name  <- na.omit(c(parse_info_y$var_names, deparse(dnn[[2]])))[1]

Many thanks.

Controlling which stats to use in the case of numerical variables when using dfSummary()

I'm wondering whether it is possible to control which Stats to be shown in the case of numerical variables when using dfSummary().
Being able to control which stats are used for numerical variables is almost a necessity, particularly in the case of CV, because CV values should not be calculated for data on a logarithmic scale!

Error in sect_title[[2]] : subscript out of bounds

I keep getting this error using dfSummary -- and it has happened for all of my data. All of the code worked before...

x was converted to a data frame
Error in sect_title[[2]] : subscript out of bounds

Suggestion: Distinct count of factor/character column

It would be useful to have a distinct count of unique values of either factor or character column.
For example, if I have a column labeled email, I would like to know how many unique emails I have in that column.
Here is an example of my output.
When the field type is integer, you get a distinct count of values, but when it's a character/factor, it reports frequencies but not the count of unique values.

Thank you
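For reference, the requested number is a one-liner in base R. The sketch below uses a made-up `email` vector and excludes missing values from the count, which is an assumption about the desired behaviour.

```r
# Made-up character column with one repeated value and one NA
email <- c("a@x.org", "b@x.org", "a@x.org", NA, "c@x.org")

# Distinct non-missing values; works the same way for factors
n_distinct_values <- length(unique(email[!is.na(email)]))
n_distinct_values
```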

Suggestion to add number of unique rows

The current version of the Data Frame Summary shows the number of rows. In many cases it is very useful to know how many unique rows there are. For example, the iris dataset contains 150 rows, but there is one duplicate row (nrow(unique(iris)) gives 149). It would be very helpful to add this to the top of the report.
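The figures mentioned above can be reproduced with base R alone. The sketch uses `datasets::iris` explicitly so that a modified copy of `iris` in the workspace does not affect the result.

```r
df <- datasets::iris

n_rows   <- nrow(df)             # total number of rows
n_unique <- nrow(unique(df))     # number of distinct rows
n_dupes  <- sum(duplicated(df))  # rows that repeat an earlier row

c(rows = n_rows, unique_rows = n_unique, duplicated_rows = n_dupes)
```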

Add mosaic plots as a feature.

Maybe add plots like in vcd?

dfSummary: Freqs(% Valid) numerical vectors and integer proportions

Hello,

I have a .csv data file that I am reading into a data frame. When I run the dfSummary() function in the console or render on RMarkdown, although some integers are only two distinct values with 100% valid entries, the frequencies(%) are not printed on the output. Interestingly, some integers with <10 values will have printed out frequencies, but there really isn't any notable pattern to why these will print whereas the majority will not. When using an older version of summarytools (0.6.5), this frequency issue is not a problem. Is there something I can do besides go through all of my variables and convert them to factors to resolve this issue? Thanks and please let me know if I need to clarify anything. I'm relatively new to programming and R. :)

Found issue with coefficient of variation (CV)

Dear Dominic,
I found the package "summarytools" very useful!

However, I also found that CV values are calculated inappropriately in the package. When viewing the relevant code contained in "descr.R", I found that CV values are calculated using
ifelse("cv" %in% stats, variable.mean / variable.sd, NA)
As you know, the correct formula for the coefficient of variation is CV = standard deviation (σ) / mean (μ), so this chunk needs to be replaced by
ifelse("cv" %in% stats, variable.sd / variable.mean, NA)

Best regards,
Payam
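A quick numeric check of the two formulas, using base R on a made-up vector (the variable names are illustrative only):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

cv_correct  <- sd(x) / mean(x)  # standard deviation divided by mean
cv_inverted <- mean(x) / sd(x)  # the inverted ratio reported above

c(correct = cv_correct, inverted = cv_inverted)
```

The two values are reciprocals of each other, so they only coincide when the standard deviation equals the mean.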

Suggestion: identify when there is an ID - Description relation between two columns

When using the data frame summary it is very handy when there are two columns that contain an ID and a description of something. For example when column 3 has distinct ID's 1 and 2 and column 123 contains the distinct values "MALE" and "FEMALE" it is very practical to mention that column 3 and column 123 are related.
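One way such a relation could be detected is to test whether two columns map one-to-one onto each other. The helper `is_one_to_one` below is hypothetical and only a sketch; it treats NA as an ordinary value.

```r
# Hypothetical helper: TRUE when every value of `a` is paired with exactly
# one value of `b`, and every value of `b` with exactly one value of `a`.
is_one_to_one <- function(a, b) {
  pairs <- unique(data.frame(a = a, b = b, stringsAsFactors = FALSE))
  !anyDuplicated(pairs$a) && !anyDuplicated(pairs$b)
}

# Made-up columns: an ID, its description, and an unrelated variable
sex_id    <- c(1, 2, 2, 1, 2)
sex_label <- c("MALE", "FEMALE", "FEMALE", "MALE", "FEMALE")
age       <- c(30, 30, 41, 30, 52)

related   <- is_one_to_one(sex_id, sex_label)
unrelated <- is_one_to_one(sex_id, age)
```

Testing every pair of columns grows quadratically with the number of columns, so a real implementation would probably restrict the check to low-cardinality columns.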

feature: select columns in freq

New to the package. Very interesting contribution! I may have missed this: is there a way to select the columns that freq returns? I can remove NAs with report.nas = FALSE. I know I can drop the Totals row with totals = FALSE. Is there an option of the freq function to keep/drop the percentage column and/or the cumulative percentage column?

Something like report.cum = FALSE and report.pct = FALSE ...

bug

fyi, I get the following error:

Error in isTRUE(extra_space) : object 'extra_space' not found

Will try to post a reproducible example

fyi on other functionality from another package

including cumsum etc.
https://github.com/TysonStanley/furniture

dfSummary: options valid.col & na.col

Dear Dominic,

first of all, I want to say that your package is great! Thank you!!!

Second I have noticed that the two options of dfSummary do not seem to work when set to false.
Am I doing something wrong?
here is an example with iris
view(dfSummary(iris, varnumbers = FALSE, valid.col = FALSE, na.col = FALSE , omit.headings=TRUE))

Suggestion: give information about date-time columns

In the data frame summary when a column contains date/time information I would suggest to give a distribution of the days (e.g. monday 6%, tuesday 12%, etc..), months, hours, etc. This reveals easily season patterns, workday behavior, etc.
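As an illustration of the weekday breakdown, the sketch below uses base R only. It relies on `as.POSIXlt()$wday`, which returns a locale-independent day number, and on made-up dates.

```r
# Two full weeks of made-up dates, starting on Monday 2018-01-01
dates <- as.Date("2018-01-01") + 0:13

# Day of week as a number: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Share of observations per weekday, with fixed English labels
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day_share  <- prop.table(table(factor(wday, levels = 0:6, labels = day_labels)))
round(100 * day_share, 1)
```

The same pattern applies to months (`$mon`) or hours (`$hour`) of a POSIXlt value.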

Suggestion: treat binary integers differently

In the data frame summary, if an integer column contains only 0s and 1s, I believe it is not very useful to describe "mean (sd) : 0.23 (0.42) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (1.82)". I suggest it is more useful to mention how many 0 and 1 values occur.
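A minimal base R sketch of this idea: detect that a column is binary, then report counts instead of moments. The vector `x` is made up.

```r
x <- c(0L, 1L, 0L, 0L, 1L, NA, 0L)

# A column is treated as binary when it holds nothing but 0, 1 and NA
is_binary <- all(x %in% c(0L, 1L, NA))

# For binary columns, a frequency table is more informative than mean/sd
counts <- table(x, useNA = "ifany")
if (is_binary) print(counts)
```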

Suggestion: mention also the sum of the values

In the data frame summary, when a column contains for example an amount in euro's, I would suggest to also add the sum of the values in the data frame summary.

Error in prettyNum if missing value

Using summarytools 0.8.6, I get an error on some variables where every value is either 0 or 1 and there is also a missing value. I have other character and factor vectors with missing values, and those are handled correctly.

Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
invalid 'nsmall' argument

Reproducible example

dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, 0), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)

#so far so good, but then look what happens when an NA is inserted

dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, NA), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)

special handling for dates

hola!

excited, so a couple more suggestions:

I think it would be useful to allow for special handling of date and time vars.
For categorical / character vars with more than 10 unique values, it may be useful to present a breakdown of the 9 most common (as you do, I think), and then the 10th can be 'other or all else', which totals up every other character value.

hth
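The "top 9 plus other" breakdown could look roughly like this in base R. `top_n_with_other` is a hypothetical helper, not a summarytools function, and the "(Other)" label is an arbitrary choice.

```r
# Hypothetical helper: keep the `n_top` most frequent values and sum all
# remaining values into a single "(Other)" entry.
top_n_with_other <- function(x, n_top = 9) {
  tab    <- table(x)
  counts <- sort(setNames(as.vector(tab), names(tab)), decreasing = TRUE)
  if (length(counts) <= n_top) return(counts)
  c(counts[seq_len(n_top)], "(Other)" = sum(counts[-seq_len(n_top)]))
}

# Made-up character variable with 12 distinct values ("a" x 12, ..., "l" x 1)
x   <- rep(letters[1:12], times = 12:1)
res <- top_n_with_other(x)
res
```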

Option to omit Totals row in freq()

A user has requested that feature, will be working on it soon.

useNA in ctable

I am getting an error using ctable every time I try and set useNA to "no". It works just fine for "always" or "ifany". Error is below:

Error in ctable(stjean$pnc5_new, stjean$preterm, prop = "t", useNA = "no") :
'useNA' must be one of 'ifany', 'always', or 'no'

dfSummary fails when a whole factor column is NA

Hi,

When I tried to generate a dfSummary of a new dataset, it failed with an error. I could replicate the bug by running this function on the iris dataset. The error occurs when a whole factor column consists of NAs.

This works:

data(iris)
dfSummary(iris)

Now, when I set a factor column to NA it doesn't.

iris$Species <- as.factor(rep(NA, nrow(iris)))
dfSummary(iris)

This is the error, identical to my dataset.
Error in png(img_png <- tempfile(fileext = ".png"), width = 150, height = 26 * :
invalid 'height' argument
In addition: Warning messages:
1: In max(counts) : no non-missing arguments to max; returning -Inf
2: In max(props * 100) : no non-missing arguments to max; returning -Inf

Regards,
Victor

Suggestion: mention the number of columns in the top of the data frame summary

Especially when using the data frame summary with a lot of columns it is handy to mention the number of columns on the top of the page, e.g. next to the number of rows.

Wrong link for the recommendation vignette

There's an error with the link for the recommendation vignette. I'll create a PR that solves this.

The error is here:

The following vignettes complements this page: [Recommendations for
Using summarytools With
Rmarkdown](https://cdn.rawgit.com/dcomtois/summarytools/dev-current/inst/doc/Recommendations-rmarkdown.html)

dfSummary graphs slow to generate when number of breaks is high

Under some circumstances, the HTML graph can take a (very) long time to generate. If you do not need the graphs, just set graph.col = FALSE until the issue is resolved. Thanks to Adam Medcalf for pointing this out.

handling NA in freq

I see that freq always prints NA and Cumulative Valid. I would suggest adding a boolean option, ignore.na = FALSE, that, when TRUE, ignores NA (and also does not print the "Valid" frequency columns).

parameter to select which statistics to print

Thanks for your great package. As a suggestion, I would like to propose adding a character vector parameter (with default values) to the descr function that makes explicit which statistics are tabulated.

with() returns Var1 instead of the named variable

data(exams)
with(exams, by(english, gender, descr))

returns descriptive statistics for "english" for each gender. However, the statistics table shows Var1 as the column name instead of showing the actual named variable (which would be english, in this case).

                Var1

         Mean  76.66
      Std.Dev   9.35
          Min   55.9
          Max   93.2
       Median   77.1
          mad   7.56
          IQR    8.2
           CV    8.2
     Skewness  -0.25
  SE.Skewness   0.58
     Kurtosis  -0.25

Was it intentional? If not, it would probably be a good idea to display the actual variable name

strange result for character variable in dfSummary

I assume the percentage should always sum to 100% but in the screenshot below the "other" level gets 103.1% Not sure what is going on there. A link to the dataset used is provided below.

https://github.com/radiant-rstats/radiant.data/raw/master/data/titanic.rda

Suggestion : add some <br> in the view(dfSummary(data),method = "render")

When I run
view(dfSummary(data)) in the console I get something like this

but when I put
view(dfSummary(data), method = "render") in my Rmd (html_output) , I get this :

I think adding some <br> at the end of each line in Stats / Values and Freqs, to get the same result as in the RStudio Viewer, could be very good :)

Thanks for your package !

Escape Characters Causing Ugly Display in Jupyter

I love the summaries this tool generates in RStudio. Thanks!

My problem is that using this with Jupyter doesn't seem to work. Reproduction below:

Inspecting the data frame:

ddd = summarytools::dfSummary(mtcars)
ddd$Variable

Produces this:

[1] "mpg\\\n[numeric]"  "cyl\\\n[numeric]"  "disp\\\n[numeric]" "hp\\\n[numeric]"   "drat\\\n[numeric]"
 [6] "wt\\\n[numeric]"   "qsec\\\n[numeric]" "vs\\\n[numeric]"   "am\\\n[numeric]"   "gear\\\n[numeric]"
[11] "carb\\\n[numeric]"

Which works great in RStudio or the command line, poorly in Jupyter.

I am struggling to figure out whether there is simply a parameter I am missing. Or maybe there is a method I can pipe this output through to unescape those characters?

If I figure it out I'll post a solution.
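As a possible post-processing workaround (not a summarytools option), the backslash-newline pairs shown above can be replaced with base R's `gsub()`. The sketch operates on a small character vector that mimics the `Variable` column.

```r
# Mimics the output above: each label contains a backslash followed by a newline
variable <- c("mpg\\\n[numeric]", "cyl\\\n[numeric]")

# fixed = TRUE matches the two characters literally instead of as a regex
cleaned <- gsub("\\\n", " ", variable, fixed = TRUE)
cleaned
```

Whether this is enough for a readable Jupyter display is untested here; it only removes the escape sequence from the strings.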

Q1 and Q3

Hi, is it also possible to request the 25th and 75th percentiles (as Q1 and Q3)? They are frequently used in descriptive reporting. Best
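For reference, base R computes these with `quantile()`. The sketch below uses the default interpolation method (type 7) on a made-up vector.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# 25th and 75th percentiles (Q1 and Q3), default type 7
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
q
```

Other `type` values give slightly different results on small samples, so a reporting function would need to document which one it uses.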

simplify code with dplyr

Hey,

Great package!

I think code for descr etc. can be radically simplified using dplyr.

For instance:

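library(readr)  # read_csv()
library(dplyr)  # %>%, summarize_if(), funs()
library(tidyr)  # gather(), separate(), spread()
iris <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", col_names = F)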
iris_num <- iris %>%
   summarize_if(is.numeric, funs(mean = mean, median = median, min = min, max = max, missing = sum(is.na(.)))) 
iris_num_long <- iris_num %>%
  gather(key = "key", value = "words") %>%
  separate(key, into = c("var", "statistic")) %>%
  spread(key = "var", value = "words")

produces

 iris_num_long
# A tibble: 5 x 5
  statistic    X1    X2    X3    X4
* <chr>     <dbl> <dbl> <dbl> <dbl>
1 max        7.90  4.40  6.90 2.50 
2 mean       5.84  3.05  3.76 1.20 
3 median     5.80  3.00  4.35 1.30 
4 min        4.30  2.00  1.00 0.100
5 missing    0     0     0    0

and this allows you to pass arbitrary functions to summarize easily

Suggestion: mention it when an integer is a sequence

When in the data frame summary an integer column contains for example 110 distinct values (0 < 43 < 109) it is useful to note that it is the sequence 0:109.
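A sequence check along these lines is cheap to compute. `is_integer_sequence` is a hypothetical helper and this is only a sketch: it accepts the values in any order and ignores NAs.

```r
# Hypothetical helper: TRUE when the non-missing values are whole numbers,
# all distinct, and cover every integer between their minimum and maximum.
is_integer_sequence <- function(x) {
  x <- x[!is.na(x)]
  length(x) > 1 &&
    all(x == round(x)) &&
    !anyDuplicated(x) &&
    (max(x) - min(x) + 1 == length(x))
}

is_integer_sequence(0:109)       # the example above: 110 distinct values
is_integer_sequence(c(0, 2, 3))  # has a gap, so not a sequence
```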

suggestion: identify primary key of dataframe

In the Data Frame Summary it would be very useful to identify which column contains the 'primary key' (as it is called in databases). A column could be the primary key when the number of rows in the data frame equals the number of distinct values. Of course not every table has a primary key, but that is also useful to mention.
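The rule described above (as many distinct values as rows) can be sketched in base R as follows. `candidate_keys` is a hypothetical helper; it additionally requires the column to have no missing values, which is an assumption borrowed from how databases define primary keys.

```r
# Hypothetical helper: names of columns whose values are all present and
# all distinct, i.e. columns that could serve as a primary key.
candidate_keys <- function(df) {
  is_key <- vapply(df, function(col) !anyNA(col) && !anyDuplicated(col), logical(1))
  names(df)[is_key]
}

# Made-up table: only order_id identifies each row uniquely
orders <- data.frame(
  order_id = 1:5,
  customer = c("a", "b", "a", "c", "b"),
  amount   = c(10, 20, 10, 5, 7.5),
  stringsAsFactors = FALSE
)

keys <- candidate_keys(orders)
keys
```

An empty result means that no single column qualifies, which, as noted, is also worth reporting.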

Suggestion: mention most frequent value

In the data frame summary, if a column contains 115 distinct values (such as countries) and 99% of the values are one specific country, it is very useful to mention what the most frequent country is. In general, I believe it is useful to display the most frequent values.
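In base R, the most frequent value and its share can be read off a sorted frequency table. The sketch below uses a made-up `country` vector and excludes NAs, which is the default behaviour of `table()`.

```r
country <- c("CA", "CA", "US", "CA", "FR", "CA", NA)

# Frequencies of non-missing values, most frequent first
counts <- sort(table(country), decreasing = TRUE)

top_value <- names(counts)[1]            # most frequent value
top_share <- counts[[1]] / sum(counts)   # its share among non-missing values
```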

Specify column widths

Especially in the context of rendering html for markdown; right now the size of graphs responds to windows size and the graph.magnif parameter doesn't enforce actual wanted size.

Error loading summarytools

I just installed summarytools 0.8.3 from CRAN with no error messages.

packageVersion("summarytools")
[1] ‘0.8.3’

> library(summarytools)
Error in get(method, envir = home) : 
  lazy-load database 'xxx/summarytools/R/summarytools.rdb' is corrupt
In addition: Warning messages:
1: In .registerS3method(fin[i, 1], fin[i, 2], fin[i, 3], fin[i, 4],  :
  restarting interrupted promise evaluation
2: In get(method, envir = home) :
  restarting interrupted promise evaluation
3: In get(method, envir = home) : internal error -3 in R_decompress1
Error: package or namespace load failed for ‘summarytools’

Installing from github gives the same results.

Session info ------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.447)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2018-04-27

Getting error "Error in ctable(... : Could not find function "ctable""

Getting error "Error in ctable... : Could not find function "ctable""

Error from view(dfSummary(df))

I read a clean dataset in from SQL, and tried the below:
library(summarytools)

view(dfSummary(df))
Error in plot.window(xlim, ylim, "", ...) : need finite 'ylim' values
In addition: Warning messages:
1: In n * h : NAs produced by integer overflow
2: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow
3: In n * h : NAs produced by integer overflow
4: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow

even when graph.col = FALSE, gives graph error

view(dfSummary(hehe, graph.col = FALSE), file = "data_summary.html", append = TRUE, footnote = NA)
Error in plot.window(xlim, ylim, "", ...) : need finite 'ylim' values

Suggestion: mention rows with all NA's

When using the data frame summary I encountered a dataset which had rows with only empty columns (NA's). It would be handy to mention this when this occurs on the top of the page at the data frame summary.
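Rows that are entirely missing can be counted in base R by comparing the per-row number of NAs with the number of columns. The data frame below is made up for illustration.

```r
df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE for rows in which every column is NA
all_na_rows <- rowSums(is.na(df)) == ncol(df)

sum(all_na_rows)    # number of all-NA rows
which(all_na_rows)  # their positions
```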

limiting the statistics in descr()

This is a feature request. It would be great to add an argument to limit the statistics (mean, sd, etc.). For example, if someone only wants to return mean, median and sd , then the argument could be something like

stats = c('mean', 'median', 'sd')

The final descriptive table would only return the above listed statistics instead of all of them. The default could be stats = "all".
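To make the proposal concrete, here is a small base R sketch of how such an argument could behave. `describe` is a hypothetical stand-in, not the summarytools implementation, and it covers only five statistics.

```r
# Hypothetical stand-in for a descriptive function with a `stats` argument.
describe <- function(x, stats = "all") {
  available <- list(mean = mean, median = median, sd = sd, min = min, max = max)
  if (identical(stats, "all")) stats <- names(available)
  # Reject unknown statistic names early, with base R's standard error
  stats <- match.arg(stats, names(available), several.ok = TRUE)
  vapply(available[stats], function(f) f(x, na.rm = TRUE), numeric(1))
}

res <- describe(c(1, 2, 3, 4, NA), stats = c("mean", "median", "sd"))
res
```

Keeping stats = "all" as the default preserves the current output for users who do not pass the argument.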

Rd formatting

The output of dfSummary would look nice in the data documentation created by roxygen2 (r-lib/roxygen2#307). Converting a data frame to .Rd is straightforward, but the data frames created by dfSummary contain embedded newlines -- this makes it a bit more difficult.
