boxuancui / dataexplorer

Automate Data Exploration and Treatment

Home Page: http://boxuancui.github.io/DataExplorer/

License: Other

cran data-analysis visualization r data-exploration data-science eda r-package rstats

dataexplorer's Introduction

DataExplorer

Background

Exploratory Data Analysis (EDA) is the initial and an important phase of data analysis and predictive modeling. During this process, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most data handling and visualization, so that users can focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

However, the latest stable version (if any) can be found on GitHub and installed using the devtools package.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer")

If you would like to install the latest development version, you may install the develop branch.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref = "develop")

Examples

The package is easy to use: almost everything can be done in one line of code. Please refer to the package manuals for more information. The package vignettes are also available.

Report

To get a report for the airquality dataset:

library(DataExplorer)
create_report(airquality)

To get a report for the diamonds dataset with response variable price:

library(ggplot2)
create_report(diamonds, y = "price")

Visualization

Instead of running create_report, you may also run each function individually for your analysis, e.g.,

## View basic description for airquality data
introduce(airquality)


rows	153
columns	6
discrete_columns	0
continuous_columns	6
all_missing_columns	0
total_missing_values	44
complete_rows	111
total_observations	918
memory_usage	6,376

## Plot basic description for airquality data
plot_intro(airquality)

## View missing value distribution for airquality data
plot_missing(airquality)

## Left: frequency distribution of all discrete variables
plot_bar(diamonds)
## Right: `price` distribution of all discrete variables
plot_bar(diamonds, with = "price")

## View frequency distribution by a discrete variable
plot_bar(diamonds, by = "cut")

## View histogram of all continuous variables
plot_histogram(diamonds)

## View estimated density distribution of all continuous variables
plot_density(diamonds)

## View quantile-quantile plot of all continuous variables
plot_qq(diamonds)

## View quantile-quantile plot of all continuous variables by feature `cut`
plot_qq(diamonds, by = "cut")

## View overall correlation heatmap
plot_correlation(diamonds)

## View bivariate continuous distribution based on `cut`
plot_boxplot(diamonds, by = "cut")

## Scatterplot `price` with all other continuous features
plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L)

## Visualize principal component analysis
plot_prcomp(diamonds, maxcat = 5L)

#> 2 features with more than 5 categories ignored!
#> color: 7 categories
#> clarity: 8 categories

Feature Engineering

To make quick updates to your data:

## Group bottom 20% `clarity` by frequency
group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)

## Group bottom 20% `clarity` by `price`
group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)

## Dummify diamonds dataset
dummify(diamonds)
dummify(diamonds, select = "cut")

## Set values for missing observations
df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))
df[sample.int(260, 50), ] <- NA
set_missing(df, list(0L, "unknown"))

## Update columns
update_columns(airquality, c("Month", "Day"), as.factor)
update_columns(airquality, 1L, function(x) x^2)

## Drop columns
drop_columns(diamonds, 8:10)
drop_columns(diamonds, "clarity")

Articles

See article wiki page.

dataexplorer's Issues

Add PlotStr() to GenerateReport()

Reference: http://stackoverflow.com/questions/25819539/how-to-add-an-interactive-visualization-to-r-markdown

CollapseCategory does not update when input data is not data.table

To reproduce:

data <- data.frame("a" = as.factor(round(rnorm(500, 10, 5))), "b" = rexp(500, 1:500))
table(data$a)
CollapseCategory(data, "a", 0.2, update = TRUE) ## data is not updated
table(data$a)
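One plausible explanation, assuming the function converts non-data.table input into a data.table internally: `:=` updates by reference, but the conversion creates a copy, so the caller's data.frame never sees the change. The sketch below illustrates that mechanism with a hypothetical function `times_ten_b` (not part of the package); it is not the package's actual implementation.

```r
library(data.table)

# Hypothetical function: updates column `b` by reference on a data.table.
times_ten_b <- function(data) {
  # Converting a data.frame creates a new object; the caller's copy is untouched.
  if (!is.data.table(data)) data <- as.data.table(data)
  data[, b := b * 10L]
  invisible(data)
}

df <- data.frame(a = c("x", "y", "z"), b = 1:3)
times_ten_b(df)          # df itself is not updated
res <- times_ten_b(df)   # the updated data must be captured from the return value

dt <- data.table(a = c("x", "y", "z"), b = 1:3)
times_ten_b(dt)          # dt is updated in place
```

Under this assumption, a data.frame caller would need to assign the return value, while a data.table caller gets the in-place update.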

Error: Aesthetics must be either length 1 or the same as the data (100): x, y, fill

Worked beautifully on the sample data sets. I tried on a personal data set and received the following error:

  |......................................                           |  59%
label: density_continuous
  |..........................................                       |  65%
  ordinary text without R code

  |..............................................                   |  71%
label: correlation_continuous
Quitting from lines 51-52 (report.rmd) 
Error: Aesthetics must be either length 1 or the same as the data (100): x, y, fill

CorrelationDiscrete should not use contrast of data object

Instead, use reshape2::dcast for all levels of a discrete feature.

Add flexibility to name the new category in CollapseCategory

Exclude all-missing discrete features from BarDiscrete

Fix pandoc usage in unit test

CRAN check results: https://cran.r-project.org/web/checks/check_results_DataExplorer.html

Error in GenerateReport when data contains no discrete features

Reported by Uros Godnov:

I've tried your package and when running GenerateReport(mydata) I get the following error:

Quitting from lines 58-59 (report.rmd)
Error in BarDiscrete(data) : No Discrete Features

Add capability to visualize variable importance

Change threshold to collapsing percentage in CollapseCategory()

Set percentage to 1 - current threshold

Add function to reset missing values

Add SetNaTo().

Add DensityContinuous()

install rprojroot as part of install process?

Hey sir! Wondering if it's possible to set it up so the library installs rprojroot as part of the install process. I didn't have it installed and as such it popped an error for me.

Better handling of missing values in GenerateReport

The example looks very nice. I am trying to run it on my own data, but I am getting the following error:

label: correlation_continuous
Quitting from lines 51-52 (report.rmd) 
Error in seq.default(from = best$lmin, to = best$lmax, by = best$lstep) : 
  'from' must be of length 1

S3 implementation of plots

Automatically collapse categories in report generation

Add unit test

All discrete, no continuous.
All continuous, no discrete.
1 discrete, multiple continuous.
1 continuous, multiple discrete.
1 continuous with mostly NA.

Add support for principal component analysis

Inspiration:

Extend SetNaTo() to discrete features

Change all output format to message

Do not use cat().

Ability to exclude categories in group_category

Ignore and keep indicated categories when collapsing.

Error in file(con, "w") : cannot open the connection

Hi,

I've just tried DataExplorer and have followed the minimal example:

library(DataExplorer)
GenerateReport(iris)

It seems that the knitr part runs OK, but then it stops at the end with error:

Error in file(con, "w") : cannot open the connection
In addition: Warning message:
In file(con, "w") :  cannot open file 'report.knit.md': Permission denied

I have also tried the diamonds example with the same result.

Any hint on what may happen?

Thank you,

> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Gentoo/Linux

locale:
 [1] LC_CTYPE=ca_AD.UTF8        LC_NUMERIC=C               LC_TIME=ca_AD.UTF8         LC_COLLATE=C               LC_MONETARY=ca_AD.UTF8     LC_MESSAGES=ca_AD.UTF8     LC_PAPER=ca_AD.UTF8
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ca_AD.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_2.1.0      DataExplorer_0.2.4 vimcom_1.2-6       setwidth_1.0-4     colorout_1.0-3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3      knitr_1.12.3     magrittr_1.5     munsell_0.4.3    colorspace_1.2-6 stringr_1.0.0    plyr_1.8.3       tools_3.2.4      grid_3.2.4       data.table_1.9.6 gtable_0.2.0     htmltools_0.3.5
[13] yaml_2.1.13      digest_0.6.9     gridExtra_2.2.1  reshape2_1.4.1   formatR_1.3      evaluate_0.8.3   rmarkdown_0.9.5  labeling_0.3     stringi_1.0-1    scales_0.4.0     chron_2.3-47
>

Add scatterplot matrix fixing on one feature

The fixed feature will usually be the response variable, so both discrete and continuous scales should be supported.

More controls in CollapseCategory()

Add ability to collapse based on desired columns.
Add ability to collapse based on multiple columns.

Order bar charts for easier comparison

Add option for title on plots

Thanks for the library!

Would it be possible to add an option for a title to the plots?

I think all it would require is
function(data, title=NULL) {
[ggplot code] +
ggtitle(title)

Ryan

Badges

Add Travis CI for branch "develop"
Add badge: http://www.r-pkg.org/badges/version/rpackage

[Breaking Change] Apply tidyverse style

http://style.tidyverse.org/
To deprecate old functions, see https://stackoverflow.com/a/10145627/2158269.

Update := and with=FALSE usage

Warning message from data.table:

Warning message:
In `[.data.table`(data, , `:=`(ind, NULL), with = FALSE) :
with=FALSE together with := was deprecated in v1.9.4 released Oct 2014. Please wrap the LHS of := with parentheses; e.g., DT[,(myVar):=sum(b),by=a] to assign to column name(s) held in variable myVar. See ?':=' for other examples. As warned in 2014, this is now a warning.
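As the warning suggests, the fix is to wrap the left-hand side of `:=` in parentheses. A minimal sketch follows, assuming `ind` holds the names of the columns to drop; the table `dt` and its columns are made up for illustration.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 4:6, c = 7:9)
ind <- c("b", "c")  # column names held in a variable (illustrative)

# Deprecated form that triggers the warning:
#   dt[, ind := NULL, with = FALSE]
# Recommended form: parentheses make data.table evaluate `ind`
# instead of treating it as a literal column name.
dt[, (ind) := NULL]

names(dt)
```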

Feature request: report the structure of more complex Objects

First of all, I like your package very much; it works perfectly for data.frames.

However, I am looking for functionality to get quick information on more complex data, such as structured lists or multidimensional arrays: what the names are, where the tables are, etc.
Or maybe to report the content of a *.RData file (data, functions, ...)?

Is it possible to include this in your package?

Better readability for legend in CorrelationContinuous

Something like ggplot2::guides(fill = guide_legend(nrow = 1))

Add function to customize report

Pick what to appear in report.
Able to customize each chart.

Remove density estimate from report

Add support to collapse categories other than frequency

Add wrapper to drop multiple variables

Add boxplot for all continuous features by one feature

If the selected feature is discrete, draw a box plot for each category. Otherwise, split the continuous feature into quartiles and treat them as discrete.
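The quartile split could look roughly like the following base R sketch, which uses the built-in airquality data and groups `Wind` by quartiles of `Temp`. The choice of columns is only for illustration; this is not the package's implementation.

```r
# Bin a continuous feature into quartiles and use the bins as a discrete grouping.
x <- airquality$Temp
breaks <- unique(quantile(x, probs = seq(0, 1, 0.25), na.rm = TRUE))

# include.lowest = TRUE keeps the minimum value inside the first bin.
temp_quartile <- cut(x, breaks = breaks, include.lowest = TRUE)

# One box per quartile of Temp.
boxplot(airquality$Wind ~ temp_quartile, xlab = "Temp quartile", ylab = "Wind")
```

`unique()` guards against duplicated quantiles, which `cut()` rejects; with heavily tied data this yields fewer than four bins.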

rename repo because CRAN has another package with this name

Detect input data format and output the same format

All data are transformed into data.table for speed. Might want to detect the input format and output the same format.

CollapseCategory
SplitColType
PlotMissing maybe?
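One way to approach this is to record whether the input is a data.table, work on a data.table internally, and convert back before returning. The sketch below uses a hypothetical wrapper `keep_input_class` and a hypothetical transformation `drop_ozone`; neither is part of the package.

```r
library(data.table)

# Hypothetical wrapper: apply `f` to a data.table and return the caller's format.
keep_input_class <- function(data, f) {
  was_dt <- is.data.table(data)
  # Work on a copy so the caller's object is never modified by reference.
  dt <- if (was_dt) copy(data) else as.data.table(data)
  out <- f(dt)
  if (was_dt) out else as.data.frame(out)
}

# Hypothetical transformation that relies on data.table syntax.
drop_ozone <- function(dt) dt[, Ozone := NULL]

res_df <- keep_input_class(airquality, drop_ozone)   # data.frame in, data.frame out
aq_dt <- as.data.table(airquality)
res_dt <- keep_input_class(aq_dt, drop_ozone)        # data.table in, data.table out
```

Copying trades some of the speed benefit for predictable behavior; a real implementation might skip the copy when in-place updates are requested.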

Add capability for cross validation

GenerateReport: Error in !args[["quiet"]] : invalid argument type

When the quiet argument is not supplied, the function does not print the report location; instead, it generates an error.
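The error arises because `args[["quiet"]]` is `NULL` when the argument is absent, so negating it directly is not safe. A minimal sketch of a more defensive check, using a hypothetical stand-in function `report_location` rather than the actual GenerateReport code:

```r
# Hypothetical stand-in for the argument handling inside the report function.
report_location <- function(output_file = "report.html", ...) {
  args <- list(...)
  # isTRUE() returns FALSE for NULL, so a missing `quiet` means "not quiet".
  quiet <- isTRUE(args[["quiet"]])
  if (!quiet) message("Report is generated at \"", output_file, "\".")
  invisible(output_file)
}

report_location()              # prints the location
report_location(quiet = TRUE)  # stays silent
```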

eda?

Hi
Where is package "eda" that GenerateReport requires?

GenerateReport<-
function (input_data, output_file = "report.html", output_dir = getwd(),
...)
{
report_dir <- system.file("rmd_template/report.rmd", package = "eda")

Combine correlation plots

Change HistogramContinuous() to plot frequency instead of density

In addition,

exclude all-missing continuous features.
fix wrong input to argument binwidth causing error.

Add vignette

Fix Rd cross-references warnings

checking Rd cross-references ... WARNING
Missing link or links in documentation object 'BarDiscrete.Rd':
'data.table'
Missing link or links in documentation object 'CollapseCategory.Rd':
'data.table'
Missing link or links in documentation object 'CorrelationContinuous.Rd':
'data.table'
Missing link or links in documentation object 'CorrelationDiscrete.Rd':
'data.table'
Missing link or links in documentation object 'DensityContinuous.Rd':
'data.table'
Missing link or links in documentation object 'DropVar.Rd':
'data.table'
Missing link or links in documentation object 'GenerateReport.Rd':
'data.table'
Missing link or links in documentation object 'HistogramContinuous.Rd':
'data.table'
Missing link or links in documentation object 'PlotMissing.Rd':
'data.table'
Missing link or links in documentation object 'SetNaTo.Rd':
'data.table'
Missing link or links in documentation object 'SplitColType.Rd':
'data.table'
See section 'Cross-references' in the 'Writing R Extensions' manual.

https://www.r-project.org/nosvn/R.check/r-devel-windows-ix86+x86_64/DataExplorer-00check.html

There are **`r format(nrow(data), big.mark = ",")`** observations (rows)

In addition,

include more information in Basic Statistics section.
overall organization of the report.

Update documentation for SplitColType.r

num_all_missing is missing.

Add support for non-ASCII characters

Reported by @djhurio in #16

I have installed development version. I am not getting errors any more. Report is generated with a warning:
Warning message:
In writeLines(if (encoding == "") res else native_encode(res, to = encoding),  :
  invalid char string in output conversion
And report is unreadable.