
summarytools's Introduction

Summarytools 1.0 is out!

summarytools is an R package for data cleaning, exploring, and simple reporting. The package was developed with the following objectives in mind:

  • Provide a coherent set of easy-to-use descriptive functions that are akin to those included in commercial statistical software suites such as SAS, SPSS, and Stata
  • Offer flexibility in terms of output format & content
  • Integrate well with commonly used software & tools for reporting (the RStudio IDE, Rmarkdown, and knitr) while also allowing for standalone, simple report generation from any R interface

On a more personal level, I simply wish to share with the R community and the scientific community at large the functions I first developed for myself, that I ultimately realized would benefit a lot of people who are looking for the same thing I was seeking in the first place.

Support summarytools’ Development

If summarytools helps you get things done, please consider making a donation. By doing so now, you’ll help me feel useful, but more importantly contribute to the package’s development and help other people like you who benefit from its current and future features. I regularly receive feature requests, and when I receive donations, I set aside some time to work on them, making summarytools even more relevant for data scientists, students and researchers around the world. No matter how small the amount is, I always appreciate the gesture. A list of sponsors can be found further below.

Package Documentation

The bulk of the technical documentation can now be found in the following vignettes:

Introduction to summarytools | CRAN version
Summarytools in R Markdown | CRAN Version
PDF Manual (automatically generated by CRAN)

Installing summarytools

Required Software

Additional software is used by summarytools for fine-tuning graphics as well as offering interactive functionality. If you are installing summarytools for the first time, click on the relevant link to get OS-specific instructions. On Windows, no additional software is required.

Mac OS X
Ubuntu / Debian / Mint
Older Ubuntu (14 and 16)
Fedora / Red Hat / CentOS
Solaris

Installing From GitHub

This method has the advantage of benefiting from minor fixes and improvements that are added between CRAN releases. Its main drawback is that you won’t be notified when a new version is available. You can either check this page from time to time or, better, use a package that checks for package updates on various repositories, such as dtupdate and drat.

install.packages("remotes")        # Using devtools is also possible
library(remotes)
install_github("rapporter/pander") # Strongly recommended
install_github("dcomtois/summarytools", build_vignettes = TRUE)

Installing From CRAN

CRAN versions are stable but are not updated as often as the GitHub versions. On the plus side, they can be easier to install on some systems.

install.packages("summarytools")

Latest Changes

  • In dfSummary():

    • It is now possible to control which statistics to show in the Freqs / Values column (see help("st_options", "summarytools") for examples)
    • In html outputs, tables are better aligned horizontally (categories >> counts >> charts); if misalignment occurs, adjusting graph.magnif should resolve it
    • List-type columns and Inf values are handled properly
  • In descr() and ctable() several display glitches were corrected

  • Selected heading elements can be totally omitted on an individual basis

  • Improved functionality for customized terms / translations

For more details, see vignette("introduction", "summarytools") as well as news(package = "summarytools").

Additional Software Installations

Required Software on Mac OS

Magick++

Open a terminal window and enter the following:

brew install imagemagick@6

If you do not have brew installed, simply enter this command in the terminal:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

XQuartz

If you’re using Mac OS X version 10.8 (Mountain Lion) or more recent versions, you’ll need to download the .dmg image from xquartz.org and add it to your Applications folder.

Required Software for Debian / Ubuntu / Linux Mint

Magick++
sudo apt install libmagick++-dev

Required Software for Older Ubuntu Versions

This applies only if you are using Ubuntu Trusty (14.04) or Xenial (16.04).

Magick++

sudo add-apt-repository -y ppa:opencpu/imagemagick
sudo apt-get update
sudo apt-get install -y libmagick++-dev

Required Software for Fedora / Red Hat / CentOS

Magick++
sudo yum install ImageMagick-c++-devel

Required Software for Solaris

Magick++

pkgadd -d http://get.opencsw.org/now
/opt/csw/bin/pkgutil -U
/opt/csw/bin/pkgutil -y -i imagemagick 
/usr/sbin/pkgchk -L CSWimagemagick

Sponsors

A big thanks to the following people who made donations:

  • Ashirwad Barnwal
  • David Thomas
  • Peter Nilsson
  • Ross Dunne
  • Igor Rubets
  • Joerg Sahlmann
  • Roger Hilfiker

summarytools is the result of many hours of work. If you find the package brings value to your work, please take a moment to make a small donation.

The package comes with no guarantees. It is a work in progress and feedback is always welcome. Please open an issue on GitHub if you find a bug or wish to submit a feature request.

summarytools's People

Contributors

brunaw, cmrnp, dcomtois, emraher, faviovazquez, iago-pssjd, jonmcalder, mcanouil, rprrr

summarytools's Issues

Name of variable (features) disappears when using descr() with by() in Shiny app

The descr() function from R-package summarytools generates common central tendency statistics and measures of dispersion for numerical data in R.

When I use descr() with by() in a Shiny app, the names of the variables (features) in the data disappear and are not displayed. Instead, they are replaced by Var1, Var2, Var3, etc.

I do not understand why the names disappear when I run this code in the Shiny app (see below).

Install packages

source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
install.packages('devtools')
library(devtools)
install_github('dcomtois/summarytools')

Load packages

library(shiny)   # needed for renderUI(), fluidPage(), etc.
library(summarytools)
library(Biobase)
library(ALL)

Shiny Server

server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)
    #-- Subset
    eset_object <- ALL[1:3, ] # choose only 3 variables
    #-- The group of interest
    eset_groups <- "BT"
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))),
                          INDICES = (pData(eset_object)[, eset_groups]),
                          FUN = descr, stats = "all",
                          transpose = TRUE)

    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}

Shiny UI

ui <- fluidPage(theme = "dfSummary.css",
                fluidRow(
                  uiOutput("summaryTable")
                )
)

As a side note, if the data is assigned as a global variable (eset_object <<- ALL [1:3,]), the variable names are displayed. But this is not a solution to the problem, as it is wise to avoid global variables!

Name of Group variable is not updated when using descr() with by() in Shiny app

I found that the name of the group variable (and the group level) is not retrieved or updated in the UI when a new group variable is selected in the app. Note that the corresponding table (the calculations) does update when a new group variable is selected.
Moreover, the group variable is displayed in a static manner. The Shiny app below illustrates the issue to some extent. For example, the group variable is displayed in the UI as follows:

Group: (pData(eset_object)[, eset_groups]) = B


# Install packages
source("https://bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
install.packages('devtools')
devtools::install_github('dcomtois/summarytools')

# Load packages
library(shiny)   # needed for renderUI(), fluidPage(), shinyApp()
library(summarytools)
library(Biobase)
library(ALL)

# Shiny Server
server <- function(input, output, session) {
  output$summaryTable <- renderUI({
    #-- Load the ALL data
    data(ALL)  
    #-- Subset
    eset_object <- ALL [1:3,] # choose only 3 variables 
    #-- The group of interest 
    eset_groups <-"BT"
    # print(rownames (eset_object)) # print variable names
    ALL_stats_by_BT <- by(data = as.data.frame(t(exprs(eset_object))), 
                          INDICES = (pData(eset_object)[,eset_groups]), 
                          FUN = descr, stats ="all", 
                          transpose = TRUE)

    view(ALL_stats_by_BT,
         method = 'render',
         omit.headings = FALSE,
         bootstrap.css = FALSE)
  })
}

# Shiny UI
ui <- fluidPage(theme = "dfSummary.css",
                fluidRow(
                  uiOutput("summaryTable")
                )
)

# Launch
shinyApp(ui, server)

Of note, if you replace eSet_object <- relevant_est() with eSet_object <<- relevant_est() (that is, assign in the global environment), the Data Frame heading is retrieved and displayed in the UI, as shown below:

Data Frame: as.data.frame(t(exprs(eSet_object)))
Group: (pData(eSet_object)[, eSet_groups]) = B

Winword output suggestion

Thanks again for your great package. Would it be possible to add some suggestions on how to render the output in Word or HTML using rmarkdown in RStudio?
Best

Suggestion: how often do ID values occur

When analyzing a data set with, e.g., client IDs, it is very useful to know how often unique IDs appear in the data set (e.g. 90% appear once, 5% appear twice, etc.). This would be for the data frame summary.
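
A rough base R sketch of the summary requested above, assuming a made-up client_id vector (hypothetical data; this is not a summarytools feature): count the rows per ID, then tabulate those counts.

```r
# Hypothetical client IDs: "a" occurs three times, "b" twice, the rest once
client_id <- c("a", "a", "a", "b", "b", "c", "d", "e")

# Number of rows per ID
rows_per_id <- table(client_id)

# How many IDs occur once, twice, three times, ...
id_frequency <- table(rows_per_id)

# The same information as a percentage of distinct IDs
id_frequency_pct <- round(100 * prop.table(id_frequency), 1)
print(id_frequency_pct)
```

Here 60% of the distinct IDs appear once, 20% twice and 20% three times.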

Error in ctable function: $ operator is invalid for atomic vectors

When I use the ctable function with the pipe operator %$% from the package magrittr an error occurs: Error: $ operator is invalid for atomic vectors

library(summarytools)
library(magrittr)

tobacco %$% ctable(smoker, diseased)

Traceback
14. na.omit(c(parse_info_y$var_names, deparse(dnn[[2]]))) at ctable.R#194
13. ctable(smoker, diseased)
12. eval(substitute(expr), data, enclos = parent.frame())
11. eval(substitute(expr), data, enclos = parent.frame())
10. with.default(., ctable(smoker, diseased))
9. with(., ctable(smoker, diseased))
8. function_list[k]
7. withVisible(function_list[k])
6. freduce(value, _function_list)
5. _fseq(_lhs)
4. eval(quote(_fseq(_lhs)), env, env)
3. eval(quote(_fseq(_lhs)), env, env)
2. withVisible(eval(quote(_fseq(_lhs)), env, env))
1. tobacco %$% ctable(smoker, diseased)

Error appears in na.omit function in line 194, ctable.R file.

y_name  <- na.omit(c(parse_info_y$var_names, deparse(dnn[[2]])))[1]

Many thanks.

Error in sect_title[[2]] : subscript out of bounds

I keep getting this error using dfSummary -- and it has happened for all of my data. All of the code worked before...

x was converted to a data frame
Error in sect_title[[2]] : subscript out of bounds

Suggestion: Distinct count of factor/character column

It would be useful to have a distinct count of unique values of either factor or character column.
For example, if I have a column labeled email, I would like to know how many unique emails I have in that column.
Here is an example of my output.
When the field type is integer, you get a distinct count of values, but when it's a character/factor, it reports frequencies but not the count of unique values.

Thank you
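
For reference, the distinct count being asked for can be sketched in base R as follows. The email vector is invented for illustration, and excluding NA from the count is an assumption about the desired behaviour.

```r
# Hypothetical character column with one repeated value and one NA
email <- c("a@example.com", "b@example.com", "a@example.com", NA, "c@example.com")

# Distinct non-missing values in a character column
n_distinct_chr <- length(unique(email[!is.na(email)]))

# Same idea for a factor: count the levels that actually occur
email_factor <- factor(email)
n_distinct_fct <- nlevels(droplevels(email_factor))

print(c(character = n_distinct_chr, factor = n_distinct_fct))
```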

Suggestion to add number of unique rows

The current version of the Data Frame Summary shows the number of rows. In many cases it is very useful to know how many unique rows there are. For example, the iris dataset contains 150 rows, but there is one duplicate row (e.g. nrow(unique(iris)) gives 149). It would be very helpful to add this to the top of the report.
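
The counts mentioned above can be reproduced in base R. This is only a sketch of the numbers the report could show, not of how summarytools would compute them.

```r
# Total rows, unique rows and duplicated rows in the built-in iris data
n_rows        <- nrow(iris)
n_unique_rows <- nrow(unique(iris))
n_duplicates  <- sum(duplicated(iris))

print(c(rows = n_rows, unique = n_unique_rows, duplicates = n_duplicates))
```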

dfSummary: Freqs(% Valid) numerical vectors and integer proportions

Hello,

I have a .csv data file that I am reading into a data frame. When I run the dfSummary() function in the console or render on RMarkdown, although some integers are only two distinct values with 100% valid entries, the frequencies(%) are not printed on the output. Interestingly, some integers with <10 values will have printed out frequencies, but there really isn't any notable pattern to why these will print whereas the majority will not. When using an older version of summarytools (0.6.5), this frequency issue is not a problem. Is there something I can do besides go through all of my variables and convert them to factors to resolve this issue? Thanks and please let me know if I need to clarify anything. I'm relatively new to programming and R. :)

Found issue with coefficient of variation (CV)

Dear Dominic,
I found the package "summarytools" very useful!

However, I also found that CV values are calculated inappropriately in the package. When viewing the relevant code contained in "descr.R", I found that CV values are calculated using
ifelse("cv" %in% stats, variable.mean / variable.sd, NA)
As you know, the correct formula for the coefficient of variation is CV = Standard Deviation (σ) / Mean (μ), so this chunk needs to be replaced by
ifelse("cv" %in% stats, variable.sd / variable.mean, NA)

Best regards,
Payam
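
As a quick check of the formula in this report, here is a minimal base R sketch on made-up numbers. The variable names are illustrative and are not taken from descr.R.

```r
# Illustrative data only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Coefficient of variation: standard deviation divided by the mean
cv <- sd(x) / mean(x)

# The inverted ratio (mean / sd) described in the report
inverted <- mean(x) / sd(x)

print(c(cv = cv, inverted = inverted))
```

The two values are reciprocals of each other, so they only agree when the standard deviation equals the mean.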

feature: select columns in freq

New to the package. Very interesting contribution! I may have missed this: is there a way to select the columns that freq returns? I can remove NAs with report.nas = FALSE. I know I can drop the Totals row with totals = FALSE. Is there an option of the freq function to keep/drop the percentage column and/or the cumulative percentage column?

Something like report.cum = FALSE and report.pct = FALSE ...

bug

fyi, I get the following error:

Error in isTRUE(extra_space) : object 'extra_space' not found

Will try to post a reproducible example

dfSummary: options valid.col & na.col

Dear Dominic,

first of all, I want to say that your package is great! Thank you!!!

Second, I have noticed that these two options of dfSummary do not seem to work when set to FALSE.
Am I doing something wrong?
Here is an example with iris:
view(dfSummary(iris, varnumbers = FALSE, valid.col = FALSE, na.col = FALSE , omit.headings=TRUE))

Suggestion: give information about date-time columns

In the data frame summary, when a column contains date/time information, I would suggest giving a distribution of the days (e.g. Monday 6%, Tuesday 12%, etc.), months, hours, and so on. This easily reveals seasonal patterns, workday behavior, etc.
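
One way such a breakdown could be computed in base R is sketched below. The timestamps are invented, and the weekday labels are hard-coded so the result does not depend on the session locale.

```r
# Hypothetical timestamps (UTC): two on a Monday, one Tuesday, one Saturday
timestamps <- as.POSIXct(
  c("2018-12-10 09:15:00", "2018-12-10 17:40:00",
    "2018-12-11 09:05:00", "2018-12-15 23:30:00"),
  tz = "UTC"
)
lt <- as.POSIXlt(timestamps, tz = "UTC")

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday)
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
weekday <- factor(day_names[lt$wday + 1], levels = day_names)

# Percentage of observations per weekday and per hour of the day
weekday_pct <- round(100 * prop.table(table(weekday)), 1)
hour_pct    <- round(100 * prop.table(table(lt$hour)), 1)

print(weekday_pct)
print(hour_pct)
```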

Suggestion: treat binary integers differently

In the data frame summary, if an integer column contains only 0s and 1s, I believe it is not very useful to describe it as "mean (sd) : 0.23 (0.42) min < med < max : 0 < 0 < 1 IQR (CV) : 0 (1.82)". I suggest it would be more useful to report how many 0 and 1 values occur.
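
A minimal sketch of the check being suggested, in base R with a made-up vector: detect that a column holds only 0, 1 and NA, and report counts instead of moments. Treating NA as allowed is an assumption.

```r
# Hypothetical indicator column with one missing value
x <- c(0L, 1L, 0L, 0L, NA, 1L, 0L)

# TRUE when every value is 0, 1 or NA
is_binary <- all(x %in% c(0L, 1L, NA))

# For binary columns, counts are more informative than mean / sd
counts <- if (is_binary) table(x, useNA = "ifany") else NULL
print(counts)
```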

Error in prettyNum if missing value

Using summarytools 0.8.6, I get an error on some variables where every value is either 0 or 1 and there is also a missing value. I have other character and factor vectors with missing values, and those are handled correctly.

Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
invalid 'nsmall' argument

Reproducible example

dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, 0), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)

#so far so good, but then look what happens when an NA is inserted

dt <- data.frame(finger.involved = c(0, 0, 1, 1, 0, 0, 0, NA), toe.involved = c(0, 0, 1, 1, 1, 1, 0, 0))
dfSummary(dt)

special handling for dates

hola!

excited, so a couple more suggestions:

  1. i think it would be useful to allow for special handling of date and time vars.
  2. for categorical vars. / char. vars. with more than 10 unique values, it may be useful to present a breakdown of the 9 most common (as you do, I think), and then the 10th can be 'other or all else' (which totals up every other char. val.)

hth
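
The second suggestion (top 9 values plus an "other" bucket) could look roughly like this in base R. The data and the top_n cutoff are illustrative assumptions.

```r
# Hypothetical character variable with 12 distinct values;
# "a" occurs 12 times, "b" 11 times, ..., "l" once
x <- rep(letters[1:12], times = 12:1)

# Frequencies as a plain named vector, most common first
counts <- sort(table(x), decreasing = TRUE)
counts <- setNames(as.vector(counts), names(counts))

# Keep the 9 most common values and lump the remainder together
top_n  <- 9
lumped <- c(counts[seq_len(top_n)],
            Other = sum(counts[-seq_len(top_n)]))
print(lumped)
```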

useNA in ctable

I am getting an error using ctable every time I try and set useNA to "no". It works just fine for "always" or "ifany". Error is below:

Error in ctable(stjean$pnc5_new, stjean$preterm, prop = "t", useNA = "no") :
'useNA' must be one of 'ifany', 'always', or 'no'

dfSummary fails when a whole factor column is NA

Hi,

When I tried to generate a dfSummary of a new dataset, it failed with an error. I could replicate the bug by running this function on the iris dataset. The error occurs when a whole factor column consists of NAs.

This works:

data(iris)
dfSummary(iris)

Now, when I set a factor column to NA it doesn't.

iris$Species <- as.factor(rep(NA, nrow(iris)))
dfSummary(iris)

This is the error, identical to my dataset.
Error in png(img_png <- tempfile(fileext = ".png"), width = 150, height = 26 * :
invalid 'height' argument
In addition: Warning messages:
1: In max(counts) : no non-missing arguments to max; returning -Inf
2: In max(props * 100) : no non-missing arguments to max; returning -Inf

Regards,
Victor

Wrong link for the recommendation vignette

There's an error with the link for the recommendation vignette. I'll create a PR that solves this.

The error is here:

The following vignettes complements this page: [Recommendations for
Using summarytools With
Rmarkdown](https://cdn.rawgit.com/dcomtois/summarytools/dev-current/inst/doc/Recommendations-rmarkdown.html)

handling NA in freq

I see that freq always prints NA and Cumulative Valid. I would suggest adding a boolean option, ignore.na = FALSE, that, when TRUE, ignores NA (and also does not print the "Valid" frequency columns).
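
For comparison, base R's table() already exposes this kind of switch through useNA. The sketch below uses an invented vector and only illustrates the requested behaviour; another issue on this page mentions freq's report.nas = FALSE for the same purpose.

```r
# Hypothetical answers with one missing value
x <- c("yes", "no", NA, "yes", "yes")

# Counts including the NA category
with_na <- table(x, useNA = "ifany")

# Counts ignoring NA entirely
without_na <- table(x, useNA = "no")

# Percentages over valid (non-missing) values only
pct_valid <- round(100 * prop.table(without_na), 1)

print(with_na)
print(pct_valid)
```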

parameter to select which statistics to print

Thanks for your great package. As a suggestion, I would like to propose adding a character vector parameter, with default values, to the descr function to make explicit which statistics are tabulated.

with() returns Var1 instead of the named variable

data(exams)
with(exams, by(english, gender, descr))

returns descriptive statistics for "english" for each gender. However, the statistics table shows Var1 as the column name instead of showing the actual named variable (which would be english, in this case).

                Var1

         Mean  76.66
      Std.Dev   9.35
          Min   55.9
          Max   93.2
       Median   77.1
          mad   7.56
          IQR    8.2
           CV    8.2
     Skewness  -0.25
  SE.Skewness   0.58
     Kurtosis  -0.25

Was it intentional? If not, it would probably be a good idea to display the actual variable name

Suggestion : add some <br> in the view(dfSummary(data),method = "render")

When I run
view(dfSummary(data)) in the console, I get something like this:
[screenshot: data frame summary]

but when I put
view(dfSummary(data), method = "render") in my Rmd (html_output), I get this:

[screenshot: rapport sur dig reporting de decembre]

I think adding a <br> at the end of each line in Stats / Values and Freqs, to get the same result as in the RStudio Viewer, would be very good :)

Thanks for your package !

Escape Characters Causing Ugly Display in Jupyter

I love the summaries this tool generates in RStudio. Thanks!

My problem is that using this with Jupyter doesn't seem to work. Reproduction below:

[screenshot: jupyter]

Inspecting the data frame:

ddd = summarytools::dfSummary(mtcars)
ddd$Variable

Produces this:

[1] "mpg\\\n[numeric]"  "cyl\\\n[numeric]"  "disp\\\n[numeric]" "hp\\\n[numeric]"   "drat\\\n[numeric]"
 [6] "wt\\\n[numeric]"   "qsec\\\n[numeric]" "vs\\\n[numeric]"   "am\\\n[numeric]"   "gear\\\n[numeric]"
[11] "carb\\\n[numeric]"

Which works great in RStudio or the command line, poorly in Jupyter.

I am struggling to figure out whether there is simply a parameter I am missing. Or maybe there is a function I can pipe this output through to unescape those characters?

If I figure it out I'll post a solution.
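
One possible workaround, sketched in base R under the assumption that the cells contain a literal backslash followed by a newline (as the printed output above suggests): replace that two-character sequence before display. The input vector is copied by hand from the output above rather than taken from a live dfSummary() call.

```r
# Two of the values shown above: name, backslash, newline, type
variable <- c("mpg\\\n[numeric]", "cyl\\\n[numeric]")

# Replace the literal backslash + newline pair with a single space;
# fixed = TRUE makes gsub() treat the pattern as plain text, not a regex
cleaned <- gsub("\\\n", " ", variable, fixed = TRUE)
print(cleaned)
```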

Q1 and Q3

Hi, is it also possible to request the 25th and 75th percentiles (as Q1 and Q3)? They are frequently used in descriptive reporting. Best
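
In the meantime, the two quartiles can be obtained with base R's quantile(). This sketch uses toy data and R's default quantile type.

```r
# Illustrative data only
x <- 1:9

# 25th and 75th percentiles, labelled Q1 and Q3
quartiles <- quantile(x, probs = c(0.25, 0.75))
names(quartiles) <- c("Q1", "Q3")
print(quartiles)
```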

simplify code with dplyr

Hey,

Great package!

I think code for descr etc. can be radically simplified using dplyr.

For instance:

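library(readr)  # read_csv()
library(dplyr)  # summarize_if(), %>%
library(tidyr)  # gather(), separate(), spread()

iris <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", col_names = FALSE)
# One summary column per numeric variable and statistic (e.g. X1_mean)
iris_num <- iris %>%
  summarize_if(is.numeric, list(mean = mean, median = median, min = min, max = max, missing = ~ sum(is.na(.))))
# Reshape so that each row is a statistic and each column a variable
iris_num_long <- iris_num %>%
  gather(key = "key", value = "words") %>%
  separate(key, into = c("var", "statistic")) %>%
  spread(key = "var", value = "words")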

produces

 iris_num_long
# A tibble: 5 x 5
  statistic    X1    X2    X3    X4
* <chr>     <dbl> <dbl> <dbl> <dbl>
1 max        7.90  4.40  6.90 2.50 
2 mean       5.84  3.05  3.76 1.20 
3 median     5.80  3.00  4.35 1.30 
4 min        4.30  2.00  1.00 0.100
5 missing    0     0     0    0    

and this allows you to pass arbitrary functions to summarize easily

suggestion: identify primary key of dataframe

In the Data Frame Summary it would be very useful to identify which column contains the 'primary key' (as it is called in databases). A column could be the primary key when the number of rows in the data frame equals the number of distinct values. Of course not every table has a primary key, but that is also useful to mention.
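
The criterion described above (as many distinct values as rows) can be sketched in base R. The data frame is made up, and excluding columns that contain NA is an added assumption.

```r
# Hypothetical table: "id" and "code" are unique per row, "group" is not
df <- data.frame(
  id    = 1:5,
  group = c("a", "a", "b", "b", "c"),
  code  = c("x1", "x2", "x3", "x4", "x5"),
  stringsAsFactors = FALSE
)

# A column is a key candidate when it has no NA and no duplicated value
is_key <- vapply(
  df,
  function(col) !anyNA(col) && anyDuplicated(col) == 0,
  logical(1)
)
key_candidates <- names(df)[is_key]
print(key_candidates)
```

When key_candidates is empty, the report could state that no single column identifies the rows.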

Suggestion: mention most frequent value

In the data frame summary, if a column contains 115 distinct values (such as countries) and 99% of the values are one specific country, it is very useful to mention what the most frequent country is. In general, I believe it is useful to display the most frequent values.
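
A base R sketch of what "most frequent value" could mean in practice, on an invented country column:

```r
# Hypothetical column dominated by one value
country <- c(rep("Canada", 8), "France", "Brazil")

# Most frequent value and its share of all observations
counts        <- table(country)
most_frequent <- names(counts)[which.max(counts)]
share         <- max(counts) / length(country)

print(most_frequent)
print(share)
```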

Specify column widths

Especially in the context of rendering html for markdown; right now the size of graphs responds to windows size and the graph.magnif parameter doesn't enforce actual wanted size.

Error loading summarytools

I just installed summarytools 0.8.3 from CRAN with no error messages.

packageVersion("summarytools")
[1] ‘0.8.3’
> library(summarytools)
Error in get(method, envir = home) : 
  lazy-load database 'xxx/summarytools/R/summarytools.rdb' is corrupt
In addition: Warning messages:
1: In .registerS3method(fin[i, 1], fin[i, 2], fin[i, 3], fin[i, 4],  :
  restarting interrupted promise evaluation
2: In get(method, envir = home) :
  restarting interrupted promise evaluation
3: In get(method, envir = home) : internal error -3 in R_decompress1
Error: package or namespace load failed for ‘summarytools’

Installing from github gives the same results.

Session info ------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.447)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2018-04-27                  

Error from view(dfSummary(df))

I read a clean dataset in from SQL, and tried the below:
library(summarytools)

view(dfSummary(df))
Error in plot.window(xlim, ylim, "", ...) : need finite 'ylim' values
In addition: Warning messages:
1: In n * h : NAs produced by integer overflow
2: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow
3: In n * h : NAs produced by integer overflow
4: In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow

Suggestion: mention rows with all NA's

When using the data frame summary, I encountered a dataset that had rows in which every column was empty (NA). It would be handy to mention this at the top of the data frame summary when it occurs.
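
Such rows can be located in base R as sketched below. The data frame is a toy example, not output from dfSummary().

```r
# Hypothetical data: the second row is missing in every column
df <- data.frame(a = c(1, NA, 3), b = c("x", NA, NA))

# A row is "all NA" when its number of NAs equals the number of columns
all_na_rows <- rowSums(is.na(df)) == ncol(df)

print(which(all_na_rows))  # row indices
print(sum(all_na_rows))    # how many such rows
```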

limiting the statistics in descr()

This is a feature request. It would be great to add an argument to limit the statistics (mean, sd, etc.). For example, if someone only wants to return mean, median and sd , then the argument could be something like

stats = c('mean', 'median', 'sd')

The final descriptive table would only return the above listed statistics instead of all of them. The default could be stats = "all".
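
To illustrate the requested behaviour, here is a small base R sketch built around a hypothetical helper named describe(). It is not the descr() implementation; the helper name and the stat_funs list are assumptions made for this example.

```r
# Available statistics, keyed by the names a user would pass in `stats`
stat_funs <- list(mean = mean, median = median, sd = sd)

# Hypothetical helper: compute only the requested statistics
# for every numeric column of a data frame
describe <- function(data, stats = names(stat_funs)) {
  numeric_cols <- data[vapply(data, is.numeric, logical(1))]
  sapply(numeric_cols, function(col) {
    vapply(stat_funs[stats], function(f) f(col, na.rm = TRUE), numeric(1))
  })
}

result <- describe(iris, stats = c("mean", "sd"))
print(round(result, 2))
```

The result has one row per requested statistic and one column per numeric variable.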

Rd formatting

The output of dfSummary would look nice in the data documentation created by roxygen2 (r-lib/roxygen2#307). Converting a data frame to .Rd is straightforward, but the data frames created by dfSummary contain embedded newlines -- this makes it a bit more difficult.
