dataexplorer's Introduction

DataExplorer

Background

Exploratory Data Analysis (EDA) is the initial and an important phase of data analysis and predictive modeling. During this process, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

However, the latest stable version (if any) can be found on GitHub and installed using the devtools package.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer")

If you would like to install the latest development version, you may install the develop branch.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref = "develop")

Examples

The package is extremely easy to use. Almost everything can be done in one line of code. Please refer to the package manuals and vignettes for more information.

Report

To get a report for the airquality dataset:

library(DataExplorer)
create_report(airquality)

To get a report for the diamonds dataset with response variable price:

library(ggplot2)
create_report(diamonds, y = "price")

Visualization

Instead of running create_report, you may also run each function individually for your analysis, e.g.,

## View basic description for airquality data
introduce(airquality)
#> rows 153
#> columns 6
#> discrete_columns 0
#> continuous_columns 6
#> all_missing_columns 0
#> total_missing_values 44
#> complete_rows 111
#> total_observations 918
#> memory_usage 6,376

## Plot basic description for airquality data
plot_intro(airquality)

## View missing value distribution for airquality data
plot_missing(airquality)

## View frequency distribution of all discrete variables
plot_bar(diamonds)
## View `price` distribution of all discrete variables
plot_bar(diamonds, with = "price")

## View frequency distribution by a discrete variable
plot_bar(diamonds, by = "cut")

## View histogram of all continuous variables
plot_histogram(diamonds)

## View estimated density distribution of all continuous variables
plot_density(diamonds)

## View quantile-quantile plot of all continuous variables
plot_qq(diamonds)

## View quantile-quantile plot of all continuous variables by feature `cut`
plot_qq(diamonds, by = "cut")

## View overall correlation heatmap
plot_correlation(diamonds)

## View bivariate continuous distribution based on `cut`
plot_boxplot(diamonds, by = "cut")

## Scatterplot `price` with all other continuous features
plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L)

## Visualize principal component analysis
plot_prcomp(diamonds, maxcat = 5L)
#> 2 features with more than 5 categories ignored!
#> color: 7 categories
#> clarity: 8 categories

Feature Engineering

To make quick updates to your data:

## Group bottom 20% `clarity` by frequency
group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)

## Group bottom 20% `clarity` by `price`
group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)

## Dummify diamonds dataset
dummify(diamonds)
dummify(diamonds, select = "cut")

## Set values for missing observations
df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))
df[sample.int(260, 50), ] <- NA
set_missing(df, list(0L, "unknown"))

## Update columns
update_columns(airquality, c("Month", "Day"), as.factor)
update_columns(airquality, 1L, function(x) x^2)

## Drop columns
drop_columns(diamonds, 8:10)
drop_columns(diamonds, "clarity")

Articles

See the article wiki page.

dataexplorer's People

Contributors

boxuancui, davidr9708, etiennebacher, xfim

dataexplorer's Issues

Error: Aesthetics must be either length 1 or the same as the data (100): x, y, fill

Worked beautifully on the sample data sets. I tried on a personal data set and received the following error:

  |......................................                           |  59%
label: density_continuous
  |..........................................                       |  65%
  ordinary text without R code

  |..............................................                   |  71%
label: correlation_continuous
Quitting from lines 51-52 (report.rmd) 
Error: Aesthetics must be either length 1 or the same as the data (100): x, y, fill

install rprojroot as part of install process?

Hey sir! Wondering if it's possible to set it up so the library installs rprojroot as part of the install process. I didn't have it installed and as such it popped an error for me.

Better handling of missing values in GenerateReport

The example looks very nice. I am trying to run it on my own data, but I am getting the following error:

label: correlation_continuous
Quitting from lines 51-52 (report.rmd) 
Error in seq.default(from = best$lmin, to = best$lmax, by = best$lstep) : 
  'from' must be of length 1

Add unit test

  1. All discrete, no continuous.
  2. All continuous, no discrete.
  3. 1 discrete, multiple continuous.
  4. 1 continuous, multiple discrete.
  5. 1 continuous with mostly NA.
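
The five scenarios above could be set up as small fixture data frames before any package function is called on them. This is only a sketch in base R: the list name `fixtures`, the element names, and the column names are made up for illustration and do not come from the package's test suite.

```r
set.seed(42)  # reproducible fixtures
n <- 20L

# Hypothetical fixtures, one per scenario in the list above
fixtures <- list(
  # 1. All discrete, no continuous
  all_discrete = data.frame(
    d1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
    d2 = factor(sample(c("x", "y"), n, replace = TRUE))
  ),
  # 2. All continuous, no discrete
  all_continuous = data.frame(c1 = rnorm(n), c2 = runif(n)),
  # 3. 1 discrete, multiple continuous
  one_discrete = data.frame(
    d1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
    c1 = rnorm(n),
    c2 = runif(n)
  ),
  # 4. 1 continuous, multiple discrete
  one_continuous = data.frame(
    c1 = rnorm(n),
    d1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
    d2 = factor(sample(c("x", "y"), n, replace = TRUE))
  ),
  # 5. 1 continuous with mostly NA (2 values, 18 missing)
  mostly_na = data.frame(c1 = c(rnorm(2), rep(NA_real_, n - 2L)))
)

# Column classes per fixture, as a quick sanity check
lapply(fixtures, function(df) vapply(df, function(col) class(col)[1], character(1)))
```

Each element can then be passed to the function under test, with one expectation per scenario.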

Error in file(con, "w") : cannot open the connection

Hi,

I've just tried DataExplorer and have followed the minimal example:

library(DataExplorer)
GenerateReport(iris)

It seems that the knitr part runs OK, but then it stops at the end with error:

Error in file(con, "w") : cannot open the connection
In addition: Warning message:
In file(con, "w") :  cannot open file 'report.knit.md': Permission denied

I have also tried the diamonds example with the same result.

Any hint on what may happen?

Thank you,

> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Gentoo/Linux

locale:
 [1] LC_CTYPE=ca_AD.UTF8        LC_NUMERIC=C               LC_TIME=ca_AD.UTF8         LC_COLLATE=C               LC_MONETARY=ca_AD.UTF8     LC_MESSAGES=ca_AD.UTF8     LC_PAPER=ca_AD.UTF8
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ca_AD.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_2.1.0      DataExplorer_0.2.4 vimcom_1.2-6       setwidth_1.0-4     colorout_1.0-3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3      knitr_1.12.3     magrittr_1.5     munsell_0.4.3    colorspace_1.2-6 stringr_1.0.0    plyr_1.8.3       tools_3.2.4      grid_3.2.4       data.table_1.9.6 gtable_0.2.0     htmltools_0.3.5
[13] yaml_2.1.13      digest_0.6.9     gridExtra_2.2.1  reshape2_1.4.1   formatR_1.3      evaluate_0.8.3   rmarkdown_0.9.5  labeling_0.3     stringi_1.0-1    scales_0.4.0     chron_2.3-47
>

Add option for title on plots

Thanks for the library!

Would it be possible to add an option for a title to the plots?

I think all it would require is
function(data, title = NULL) {
  [ggplot code] +
    ggtitle(title)
}

Ryan
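
A runnable version of that suggestion might look like the sketch below. The helper `plot_with_title` and its `column` argument are hypothetical and not part of DataExplorer; the only point is that `ggtitle(NULL)` leaves the plot untitled, so a `NULL` default keeps existing behavior.

```r
library(ggplot2)

# Hypothetical helper (not a DataExplorer function): histogram of one
# column with an optional title
plot_with_title <- function(data, column, title = NULL) {
  ggplot(data, aes(x = .data[[column]])) +
    geom_histogram(bins = 30, na.rm = TRUE) +
    ggtitle(title)  # a NULL title adds nothing to the plot
}

p_titled   <- plot_with_title(airquality, "Ozone", title = "Ozone distribution")
p_untitled <- plot_with_title(airquality, "Ozone")
```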

Update := and with=FALSE usage

Warning message from data.table:

Warning message:
In `[.data.table`(data, , `:=`(ind, NULL), with = FALSE) :
with=FALSE together with := was deprecated in v1.9.4 released Oct 2014. Please wrap the LHS of := with parentheses; e.g., DT[,(myVar):=sum(b),by=a] to assign to column name(s) held in variable myVar. See ?':=' for other examples. As warned in 2014, this is now a warning.
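
The replacement the warning asks for can be illustrated on a toy table. This is a sketch, assuming `ind` holds the names of the columns to drop, as the call in the warning suggests; the table and its column names are made up.

```r
library(data.table)

# Toy table; `ind` holds the names of the columns to drop
dt <- data.table(a = 1:3, b = letters[1:3], c = c(TRUE, FALSE, TRUE))
ind <- c("b", "c")

# Deprecated form that triggers the warning:
#   dt[, ind := NULL, with = FALSE]
# Current form: wrapping the left-hand side in parentheses makes
# data.table evaluate `ind` as a character vector of column names
dt[, (ind) := NULL]

names(dt)
```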

Feature request: report the structure of more complex Objects

First of all: I like your package very much; it works perfectly for data.frames.

But I am looking for functionality to get quick information on more complex data, such as structured lists or multidimensional arrays: what the names are, where the tables are, etc.
Or maybe to report the content of a *.RData file (data, functions, ...)?

Is it possible to include this in your package?
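
Independent of the package, base R can already give a rough overview of both cases. The sketch below uses `str()` for a nested object and loads an .RData file into a separate environment to list its contents; the objects `obj` and `square` are made-up examples.

```r
# A nested object standing in for "more complex data" (made-up example)
obj <- list(
  meta  = list(name = "demo", version = 1L),
  table = head(airquality, 3),
  arr   = array(1:24, dim = c(2, 3, 4))
)
square <- function(x) x^2

# Names and types at the top two levels of nesting
str(obj, max.level = 2)

# Inspect an .RData file without touching the global workspace:
# load it into a fresh environment and list what it contains
path <- tempfile(fileext = ".RData")
save(obj, square, file = path)
env <- new.env()
loaded <- load(path, envir = env)  # names of the restored objects
print(loaded)
ls.str(env)
```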

eda?

Hi
Where is package "eda" that GenerateReport requires?

GenerateReport<-
function (input_data, output_file = "report.html", output_dir = getwd(),
...)
{
report_dir <- system.file("rmd_template/report.rmd", package = "eda")

Fix Rd cross-references warnings

checking Rd cross-references ... WARNING
Missing link or links in documentation object 'BarDiscrete.Rd':
'data.table'
Missing link or links in documentation object 'CollapseCategory.Rd':
'data.table'
Missing link or links in documentation object 'CorrelationContinuous.Rd':
'data.table'
Missing link or links in documentation object 'CorrelationDiscrete.Rd':
'data.table'
Missing link or links in documentation object 'DensityContinuous.Rd':
'data.table'
Missing link or links in documentation object 'DropVar.Rd':
'data.table'
Missing link or links in documentation object 'GenerateReport.Rd':
'data.table'
Missing link or links in documentation object 'HistogramContinuous.Rd':
'data.table'
Missing link or links in documentation object 'PlotMissing.Rd':
'data.table'
Missing link or links in documentation object 'SetNaTo.Rd':
'data.table'
Missing link or links in documentation object 'SplitColType.Rd':
'data.table'
See section 'Cross-references' in the 'Writing R Extensions' manual.

https://www.r-project.org/nosvn/R.check/r-devel-windows-ix86+x86_64/DataExplorer-00check.html

Fix color scale for correlation plots

Reported by @Raelili :

The color scale for correlation heat map looks misleading sometimes. For example, a -0.8 correlation looks less correlated than a 0.1 correlation. The color scale should be fixed to either [-1, 1] or the maximum of both absolute values.
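
One way to pin the scale, sketched here with plain ggplot2 rather than the package's own plotting code, is to pass fixed limits and a zero midpoint to a diverging fill scale. The colors are arbitrary choices for illustration.

```r
library(ggplot2)

# Correlation matrix of airquality (all columns are numeric)
cor_mat <- cor(airquality, use = "pairwise.complete.obs")

# Long format: one row per pair of variables
cor_df <- as.data.frame(as.table(cor_mat))
names(cor_df) <- c("x", "y", "value")

p <- ggplot(cor_df, aes(x = x, y = y, fill = value)) +
  geom_tile() +
  # Limits fixed to [-1, 1] with midpoint 0, so that -0.8 and 0.8
  # are drawn with equally strong colors
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "red",
    midpoint = 0, limits = c(-1, 1)
  )
p
```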

Update rmarkdown template

In the following line, observations are not rows.

There are **`r format(nrow(data), big.mark = ",")`** observations (rows)

In addition,

  1. Include more information in the Basic Statistics section.
  2. Revisit the overall organization of the report.
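
The rows versus observations distinction can be checked in base R. This sketch assumes "observations" means individual cells (rows times columns), which matches the total_observations value of 918 that introduce(airquality) reports for 153 rows and 6 columns.

```r
# airquality ships with base R (datasets package)
n_rows <- nrow(airquality)   # number of rows
n_cols <- ncol(airquality)   # number of columns
n_obs  <- n_rows * n_cols    # number of individual cells

# Formatted the same way as in the template line
format(n_rows, big.mark = ",")
format(n_obs, big.mark = ",")
```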

Add support for non-ASCII characters

Reported by @djhurio in #16

I have installed the development version. I am not getting errors any more. The report is generated with a warning:

Warning message:
In writeLines(if (encoding == "") res else native_encode(res, to = encoding),  :
  invalid char string in output conversion

And the report is unreadable.
