
incidence2's Introduction

incidence2

incidence2 is an R package that implements functions and classes to compute, handle and visualise incidence from linelist data. It refocusses the scope of the original incidence package. Unlike the original package, incidence2 concentrates only on the initial calculation, manipulation and plotting of the resultant incidence objects.

Installing the package

You can install the released version of incidence2 from CRAN with:

install.packages("incidence2")

Vignettes

An overview of incidence2 is provided in the vignette distributed with the package:

  • vignette("incidence2", package = "incidence2")

incidence2's People

Contributors

timtaylor, thibautjombart

incidence2's Issues

On the behaviour of `facet_plot`

Some thoughts on how facetting may work, especially with regard to using groups for facetting and/or color-filling. Nothing hard-set, more for discussion purposes. It would be useful to have a facet argument controlling which grouping variables are used for facetting. Together with fill, this should give the user more flexibility for designing plots with different grouping variables displayed.

Here are some proposed behaviours:

  • facet_plot(x): plot the incidence object using all grouping variables for facetting
  • facet_plot(x, facet = "foo"): same, using only variable foo for facetting
  • facet_plot(x, facet = c(foo, bar)): same, using variables foo and bar
  • facet_plot(x, facet = "foo", fill = bar): use foo for facetting and bar for filling
  • facet_plot(x, facet = c("foo", "bar"), fill = bar): use foo and bar for facetting, and bar for filling; redundant, but that's okay, the user asked for it
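To make the proposed behaviours concrete, here is a minimal sketch in base R of how the two arguments could be resolved. `resolve_facet_fill()` is a hypothetical name, not an incidence2 function, and it assumes the grouping variables are passed as a character vector rather than taken from a real incidence object.

```r
# Hypothetical helper (not part of incidence2): decide which grouping
# variables are used for facetting and which one is used for filling.
resolve_facet_fill <- function(groups, facet = NULL, fill = NULL) {
  # default: facet on every grouping variable
  if (is.null(facet)) facet <- groups

  # every requested variable must be a grouping variable of the object
  unknown <- setdiff(c(facet, fill), groups)
  if (length(unknown) > 0) {
    stop("Not a grouping variable: ", paste(unknown, collapse = ", "))
  }

  # at most one variable can drive the fill colour
  if (length(fill) > 1) stop("`fill` must be a single variable")

  list(facet = facet, fill = fill)
}

groups <- c("foo", "bar")
resolve_facet_fill(groups)                               # facet on foo and bar
resolve_facet_fill(groups, facet = "foo", fill = "bar")  # facet on foo, fill by bar
resolve_facet_fill(groups, facet = c("foo", "bar"), fill = "bar")  # redundant, allowed
```

The redundant case is deliberately not an error, in line with "the user asked for it".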

What do you think?

Invisible plot when using colors

See reprex below:

library(magrittr)
library(incidence2)

outbreaks::ebola_sim_clean$linelist %>%
  incidence(date_of_onset, groups = outcome) %>%
  plot()
#> The number of colors (1) did not match the number of groups (3).
#> Using `col_pal` instead.

It may come from the borders, i.e. the bars are so thin that we see only the border? The plot is fine with wider time intervals.

Created on 2020-07-06 by the reprex package (v0.3.0)

User request: allow date adjustment using % strptime abbreviations syntax

@nsbatra - the following requires the dev (GitHub main/master branches) of incidence2 and grates but hopefully works as you were hoping. Let me know what you think:

library(outbreaks)
library(incidence2)

dat <- ebola_sim_clean$linelist
x <- incidence(dat, date_of_onset, interval = "month")
x
#> An incidence2 object: 13 x 2
#> 5829 cases from 2014-Apr to 2015-Apr
#> interval: 1 month
#> cumulative: FALSE
#> 
#>    date_index count
#>    <month>    <int>
#>  1 2014-Apr       7
#>  2 2014-May      67
#>  3 2014-Jun     102
#>  4 2014-Jul     228
#>  5 2014-Aug     540
#>  6 2014-Sep    1144
#>  7 2014-Oct    1199
#>  8 2014-Nov     779
#>  9 2014-Dec     567
#> 10 2015-Jan     427
#> 11 2015-Feb     307
#> 12 2015-Mar     277
#> 13 2015-Apr     185
 
# centred dates (default for yearweek, single months, quarters and years)
plot(x, color = "white")

# histogram-esque dates on the breaks (defaults to "%Y-%m-%d")
plot(x, color = "white", centre_dates = FALSE)

# can specify a different format
plot(x, color = "white", centre_dates = FALSE, date_format = "%d-%m-%Y")

Created on 2021-05-19 by the reprex package (v2.0.0)
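For anyone unfamiliar with the strptime abbreviations accepted by date_format, this base R sketch shows what a few of them produce. It uses only `format()` on a `Date` and does not depend on incidence2.

```r
d <- as.Date("2014-04-01")

iso <- format(d, "%Y-%m-%d")  # year-month-day: "2014-04-01"
dmy <- format(d, "%d-%m-%Y")  # day-month-year: "01-04-2014"

# %b gives the abbreviated month name, which depends on the locale
mon <- format(d, "%d %b %Y")

# note that %m is the month number, whereas %M means minutes
c(iso = iso, dmy = dmy, mon = mon)
```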

Default color palette

The 3 requirements of the new color palette would be:

  • look nice to as many humans as possible
  • be colorblind friendly
  • correspond to categorical variables

Quite a bit of thinking on these has been done by the viridis package:
https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html

As well, these very good resources:

I suggest we use this issue to propose palettes. Ideally put them to a vote at some point.

grouping within make_incidence is slow and hacky

The lines below are slow and hacky:

  x <- grouped_df(x, c(date_index, groups))
  x <- summarise(x, count = count_dates(.data[[date_index]], breaks), .groups = "keep")
  x <- mutate(x, {{date_index}} := breaks)
  x <- summarise(x, count = sum(.data$count), .groups = "keep")
  colnames(x) <- c("bin_date", colnames(x)[-1])
  x <- ungroup(x)

Replacing summarise with some sort of split / tapply / rbind combo will probably be quicker.
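As a rough idea of what a base R alternative could look like, the sketch below counts cases per bin and group with `table()`. The toy data and column names are made up for illustration. It has not been benchmarked against the dplyr version and is not the package's implementation.

```r
# Toy linelist: one row per case, already assigned to a date bin.
dat <- data.frame(
  bin_date = as.Date(c("2014-04-07", "2014-04-07", "2014-04-14",
                       "2014-04-14", "2014-04-14")),
  gender   = c("f", "m", "f", "f", "m")
)

# Count cases per bin and group in a single pass with table(), then
# reshape the contingency table into long format.
counts <- as.data.frame(
  table(bin_date = dat$bin_date, gender = dat$gender),
  responseName = "count",
  stringsAsFactors = FALSE
)

# table() stores the bins as character labels, so restore the Date class.
counts$bin_date <- as.Date(counts$bin_date)
counts <- counts[order(counts$bin_date, counts$gender), ]
counts
```

One side effect worth noting: `table()` also returns rows with a zero count for bin and group combinations that have no cases.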

Possible bug with month interval combined with groups

I find the following potential issue:

library(outbreaks)
library(incidence2)

## this works
ebola_sierraleone_2014 %>% 
  incidence(date_of_onset, interval = "1 week", groups = district) %>% 
  plot()

## this throws an error
ebola_sierraleone_2014 %>% 
  incidence(date_of_onset, interval = "1 month", groups = district) %>% 
  plot()

## note that month interval without group works
ebola_sierraleone_2014 %>% 
  incidence(date_of_onset, interval = "1 month") %>% 
  plot()

Possible new features: subsetting time windows

Subsetting objects by given time windows may be one of the only things made slightly easier in the original incidence package. For instance, x[1:5] would get you the first 5 time steps (days / weeks / months) of the object, which is a little trickier to do now. It would be useful to have some functions helping with this - see some proposed example uses below.

Filter first / last days / weeks / months etc.

Filter the data to retain the first or last data points, predicated on a duration. There is a question here, as to how duration can be specified:

  1. simplest form: duration is provided as integer days
  2. other simple form: duration is provided as integer time intervals (as specified by the bins of the object)
  3. interpreted like the 'interval' argument of incidence2::incidence, so we could do things like "3 months" to have the first 3 months of data (possibly months 1 and 3 not being complete)

Examples would be (depending on the option above we retain):

  • filter_first(x, 30): retain the first 30 days of data, or all of it if there are less than 30 days
  • filter_first(x, "1 month"): retain the first month of data; may not be a full month, only data from the first reported month
  • filter_last(x, "4 weeks"): retain the last 4 weeks of data; the last week may not be complete e.g. if the last date is a Thursday, so this may not be 28 days of data
  • filter_last(x, 28): retain the last 28 days of data (irrespective of week definition)
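A minimal sketch of the simplest option (duration given as integer days) is below. It assumes a plain data frame with a `Date` column called `date_index` rather than a real incidence object, and the function bodies are illustrative only.

```r
# Hypothetical helpers working on a plain data frame with a Date column
# named `date_index`; real versions would operate on incidence objects.
filter_first <- function(x, days) {
  cutoff <- min(x$date_index) + days   # exclusive upper bound
  x[x$date_index < cutoff, , drop = FALSE]
}

filter_last <- function(x, days) {
  cutoff <- max(x$date_index) - days   # exclusive lower bound
  x[x$date_index > cutoff, , drop = FALSE]
}

# 60 consecutive days of toy data
x <- data.frame(
  date_index = seq(as.Date("2021-01-01"), by = "day", length.out = 60),
  count = 1L
)

nrow(filter_first(x, 30))   # first 30 days
nrow(filter_last(x, 28))    # last 28 days
nrow(filter_first(x, 100))  # fewer than 100 days available: everything is kept
```

Handling strings such as "1 month" or "4 weeks" would need the same parsing as the interval argument and is not covered here.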

Subset

We could re-implement the features of incidence::subset(), but possibly renaming the function. It would merely be a wrapper for filter on dates.

Where to put labels on the x-axis?

There have been debates in the past on where dates should appear on the x-axis. I will try to sum up views / things to take into account below, and maybe others will add their thoughts.

Original incidence package

  • epicurves were treated as histograms; bars represent case counts between 2 time points, so that e.g. for monthly incidence, a date on the x-axis marks the left hand-side of the bin (label to the left)
  • for fitting, a single date needs to be associated to a case count; thus we were using the middle of the time interval (label in the middle)
  • we did not have options for plotting epicurves as points / lines
  • several users complained, saying that a label in the middle would be more intuitive

Current considerations

  • I suspect most epis do not read epicurves as histograms, so label in the middle would make sense
  • if we add geom_point and geom_line as options for plot and facet_plot, it is preferable to have a consistent label positioning, which works the same for all geoms; label in the middle seems better for this: it still makes sense with geom_bar
  • model predictions will probably work better with label in the middle
  • devel-wise, it is safer to go with the least amount of fiddling with ggplot2's handling of the x-axis
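To illustrate the difference between the two conventions, this base R sketch computes both the left edge and the midpoint of a few monthly bins. The dates are arbitrary and the code is independent of how incidence2 stores its bins.

```r
# Left edges of four monthly bins, plus the left edge of each following bin.
starts <- seq(as.Date("2014-04-01"), by = "month", length.out = 4)
ends   <- seq(as.Date("2014-05-01"), by = "month", length.out = 4)

# Bin width in days, and the midpoint that a centred label would use.
width <- as.numeric(ends - starts)
mid   <- starts + width / 2

data.frame(start = starts, width = width, mid = mid)
```

Because months differ in length, the midpoint falls at a different offset from the left edge in each bin. That is one reason a single positioning rule applied to all geoms is attractive.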

Fix y-axis labelling for periods

see labelling issue on y-axis below

library(outbreaks)
library(incidence2)
dat <- ebola_sim_clean$linelist
x2w <- incidence(dat, date_of_onset, interval = "2 weeks")
x2w
#> An incidence2 object: 28 x 2
#> 5829 cases from 2014-04-07 to 2015-05-03
#> interval: 14 days
#> cumulative: FALSE
#> 
#>    date_index               count
#>    <period>                 <int>
#>  1 2014-04-07 to 2014-04-20     2
#>  2 2014-04-21 to 2014-05-04     9
#>  3 2014-05-05 to 2014-05-18    29
#>  4 2014-05-19 to 2014-06-01    34
#>  5 2014-06-02 to 2014-06-15    44
#>  6 2014-06-16 to 2014-06-29    52
#>  7 2014-06-30 to 2014-07-13    72
#>  8 2014-07-14 to 2014-07-27   120
#>  9 2014-07-28 to 2014-08-10   166
#> 10 2014-08-11 to 2014-08-24   255
#> # … with 18 more rows
plot(x2w, color = "white")

Created on 2021-05-20 by the reprex package (v2.0.0)

Possible bug with integer dates input

Please place an "x" in all the boxes that apply

  • I have the most recent version of incidence2 and R
  • I have found a bug
  • I have a reproducible example
  • I want to request a new feature

I think the following should probably work; I am guessing it has to do with dates being simple numbers (not Date or grates stuff) but not 100% sure. Here's a reprex:

library(incidence2)

df <- data.frame(
  dates = 1:100,
  counts = round(rnorm(100, 1000, 200))
)

x <- incidence(df, date_index = dates, counts = counts)
x
#> An incidence2 object: 100 x 2
#> 100028 counts from 1 to 100
#> interval: 1 day
#> cumulative: FALSE
#> 
#>    date_index counts
#>         <int>  <dbl>
#>  1          1    775
#>  2          2   1389
#>  3          3   1341
#>  4          4    999
#>  5          5    694
#>  6          6    645
#>  7          7   1000
#>  8          8   1365
#>  9          9   1119
#> 10         10    959
#> # … with 90 more rows
plot(x)
#> Error in `substring<-`(`*tmp*`, 1, 1, value = toupper(first_letter)): replacing substrings in a non-character object

Created on 2021-03-29 by the reprex package (v1.0.0)

Allowing adding non-integer numbers to grate objects

A maybe not very frequent use case: define limits between two time intervals defined by grate, e.g. to visually delineate epochs in a graph using a vertical line.

Currently the following will error on purpose:

> as_yrwk("2021-W03") + 1
[1] "2021-W04"

> as_yrwk("2021-W03") + 1.5
Error: Can only add whole numbers to <yrwk> objects

But unsure if we want to change this or not.
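Whatever is decided, one possible workaround is to compute the boundary in plain `Date` space, where fractional offsets are allowed. The sketch below assumes the week is represented by the date of its first day. It does not use grates at all.

```r
# Monday starting ISO week 2021-W03 (assumed to be available as a Date).
week_start <- as.Date("2021-01-18")

# A Date is stored as a number of days, so fractional offsets are accepted.
boundary <- week_start + 1.5 * 7

as.numeric(boundary - week_start)  # offset in days from the start of W03
```

The resulting value could then be passed to something like a vertical line layer, at the cost of leaving the week class behind.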

A better implementation of the underlying data?

Currently week labels are added as additional columns to the underlying data. Should the underlying structure just be the binned data with attributes that trigger different behaviour? For example, I think the following is a cleaner implementation:

library(incidence2)
library(dplyr, warn.conflicts = FALSE)

# get some data
data(ebola_sim_clean, package = "outbreaks")
dat <- 
  ebola_sim_clean$linelist %>% 
  filter(date_of_onset <= "2014-07-07")


# generate object with our current implementation
inci <- incidence(dat, date_index = date_of_onset, interval = "week")
inci
#> <incidence object>
#> [207 cases from days 2014-04-07 to 2014-07-07]
#> [interval: 1 week]
#> [cumulative: FALSE]
#> 
#>    date_group weeks    isoweeks count
#>    <date>     <aweek>  <chr>    <int>
#>  1 2014-04-07 2014-W15 2014-W15     1
#>  2 2014-04-14 2014-W16 2014-W16     1
#>  3 2014-04-21 2014-W17 2014-W17     5
#>  4 2014-04-28 2014-W18 2014-W18     4
#>  5 2014-05-05 2014-W19 2014-W19    12
#>  6 2014-05-12 2014-W20 2014-W20    17
#>  7 2014-05-19 2014-W21 2014-W21    15
#>  8 2014-05-26 2014-W22 2014-W22    19
#>  9 2014-06-02 2014-W23 2014-W23    23
#> 10 2014-06-09 2014-W24 2014-W24    21
#> 11 2014-06-16 2014-W25 2014-W25    30
#> 12 2014-06-23 2014-W26 2014-W26    22
#> 13 2014-06-30 2014-W27 2014-W27    34
#> 14 2014-07-07 2014-W28 2014-W28     3
str(inci)
#> tibble [14 × 4] (S3: incidence/tbl_df/tbl/data.frame)
#>  $ date_group: Date[1:14], format: "2014-04-07" "2014-04-14" ...
#>  $ weeks     : 'aweek' chr [1:14] "2014-W15" "2014-W16" "2014-W17" "2014-W18" ...
#>   ..- attr(*, "week_start")= int 1
#>  $ isoweeks  : chr [1:14] "2014-W15" "2014-W16" "2014-W17" "2014-W18" ...
#>  $ count     : int [1:14] 1 1 5 4 12 17 15 19 23 21 ...
#>  - attr(*, "date")= chr [1:3] "date_group" "weeks" "isoweeks"
#>  - attr(*, "count")= chr "count"
#>  - attr(*, "interval")= chr "week"
#>  - attr(*, "cumulative")= logi FALSE

# drop the extra columns
x <- inci %>% 
  select(date_group, count)


# create demo subclass of tibble without the additional columns
# give it a "week" attribute (but could be "quarter", "year", "month", etc...)
tbl <- tibble::new_tibble(x,
                          nrow = nrow(x),
                          type = "week",
                          class = "demo")

# users would see the groupings they expect
print.demo <- function(x, ...) {
  
  # title
  cat("<demo object>\n")
  out <- x
  if (attr(x, "type") == "week") {
    out$date_group <- aweek::date2week(out$date_group)  
  }
  out <- format(tibble::as_tibble(out))
  
  cat(out[-1], sep = "\n")
  cat("\n")
  invisible(x)
}

tbl
#> <demo object>
#>    date_group count
#>    <aweek>    <int>
#>  1 2014-W15-1     1
#>  2 2014-W16-1     1
#>  3 2014-W17-1     5
#>  4 2014-W18-1     4
#>  5 2014-W19-1    12
#>  6 2014-W20-1    17
#>  7 2014-W21-1    15
#>  8 2014-W22-1    19
#>  9 2014-W23-1    23
#> 10 2014-W24-1    21
#> 11 2014-W25-1    30
#> 12 2014-W26-1    22
#> 13 2014-W27-1    34
#> 14 2014-W28-1     3

# but underlying we have the original binning data
str(tbl)
#> tibble [14 × 2] (S3: demo/tbl_df/tbl/data.frame)
#>  $ date_group: Date[1:14], format: "2014-04-07" "2014-04-14" ...
#>  $ count     : int [1:14] 1 1 5 4 12 17 15 19 23 21 ...
#>  - attr(*, "type")= chr "week"

Created on 2020-07-08 by the reprex package (v0.3.0)

Improve tests

Currently testing has become a bit of a mixed bag due to the refactoring and is a little unmaintainable. Work needs to be done to implement a more methodical approach.

Bug in keep_first and keep_last overshooting

Please place an "x" in all the boxes that apply

  • I have the most recent version of reportfactory and R
  • I have found a bug
  • I have a reproducible example
  • I want to request a new feature

Quick description

These functions return NAs when the second argument exceeds the size of the object. They should probably return the whole thing; not even sure if we want a warning in there.
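A sketch of the suggested behaviour on a plain data frame is below: the requested number of rows is capped at `nrow(x)` before indexing. The function names are hypothetical stand-ins and this is not the actual keep_first / keep_last code.

```r
# Hypothetical stand-ins for keep_first() / keep_last() on a data frame.
keep_first_rows <- function(x, n) {
  n <- min(n, nrow(x))                 # never ask for more rows than exist
  x[seq_len(n), , drop = FALSE]
}

keep_last_rows <- function(x, n) {
  n <- min(n, nrow(x))
  x[seq.int(to = nrow(x), length.out = n), , drop = FALSE]
}

x <- data.frame(date_index = 1:10, count = 1L)

# Indexing past the end of a data frame is what produces NA rows:
anyNA(x[1:11, ])

# With the cap in place the whole object comes back instead:
nrow(keep_first_rows(x, 11))
keep_last_rows(x, 3)
```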

Reprex

library(tidyverse)
library(incidence2)

tibble(date = 1:10) %>% 
  incidence(date) %>% 
  keep_last(10)
#> An incidence2 object: 10 x 2
#> 10 cases from 1 to 10
#> interval: 1 day
#> cumulative: FALSE
#> 
#>    date_index count
#>         <int> <int>
#>  1          1     1
#>  2          2     1
#>  3          3     1
#>  4          4     1
#>  5          5     1
#>  6          6     1
#>  7          7     1
#>  8          8     1
#>  9          9     1
#> 10         10     1

tibble(date = 1:10) %>% 
  incidence(date) %>% 
  keep_last(11)
#> # A tibble: 10 x 2
#>    date_index count
#>         <int> <int>
#>  1         NA    NA
#>  2         NA    NA
#>  3         NA    NA
#>  4         NA    NA
#>  5         NA    NA
#>  6         NA    NA
#>  7         NA    NA
#>  8         NA    NA
#>  9         NA    NA
#> 10         NA    NA

tibble(date = 1:10) %>% 
  incidence(date) %>% 
  keep_first(10)
#> An incidence2 object: 10 x 2
#> 10 cases from 1 to 10
#> interval: 1 day
#> cumulative: FALSE
#> 
#>    date_index count
#>         <int> <int>
#>  1          1     1
#>  2          2     1
#>  3          3     1
#>  4          4     1
#>  5          5     1
#>  6          6     1
#>  7          7     1
#>  8          8     1
#>  9          9     1
#> 10         10     1

tibble(date = 1:10) %>% 
  incidence(date) %>% 
  keep_first(11)
#> # A tibble: 10 x 2
#>    date_index count
#>         <int> <int>
#>  1         NA    NA
#>  2         NA    NA
#>  3         NA    NA
#>  4         NA    NA
#>  5         NA    NA
#>  6         NA    NA
#>  7         NA    NA
#>  8         NA    NA
#>  9         NA    NA
#> 10         NA    NA

Created on 2021-03-08 by the reprex package (v1.0.0)

Potential bug with multiple values in date_index

Please place an "x" in all the boxes that apply

  • I have the most recent version of incidence2 and R
  • I have found a bug
  • I have a reproducible example
  • I want to request a new feature

In the following example, 2 valid dates are provided to date_index but generate an error. Sorry if I am missing something obvious. See example below:

library(incidence2)
library(outbreaks)

dat <- ebola_sim_clean$linelist
i <- incidence(dat,
               date_index = c(onset = date_of_onset, outcome = date_of_outcome),
               interval = 7,
               group = "hospital")
#> Error: Names must be unique.
#> x These names are duplicated:
#>   * "outcome" at locations 6 and 7.
names(dat)
#>  [1] "case_id"                 "generation"             
#>  [3] "date_of_infection"       "date_of_onset"          
#>  [5] "date_of_hospitalisation" "date_of_outcome"        
#>  [7] "outcome"                 "gender"                 
#>  [9] "hospital"                "lon"                    
#> [11] "lat"

Created on 2021-03-22 by the reprex package (v1.0.0)

Error with show_cases = TRUE with monthly incidence

See reprex below:

library(outbreaks)
library(incidence2)

x <- ebola_sim_clean$linelist %>%
  incidence(date_index = date_of_onset, interval = "month")
plot(x, show_cases = TRUE)
#> Error in `$<-.data.frame`(`*tmp*`, "width", value = c(30L, 31L, 30L, 31L, : replacement has 13 rows, data has 5829

But the same with weekly intervals works.

Code Review 1

Borrowing rOpenSci guidance:

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the ropensci reviewer guide.

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Review Comments

consider using {tsibble}

hi - like the idea of moving this to cleaner syntax!

Just thought I would suggest using tsibble to do some of the underlying legwork. There are yearweek and year* functions and all works pretty clean.

The advantage over aweek is that it is recognised as a date automatically. The disadvantage compared with aweek is that it cannot (as of yet) set a different start day for a week.

Actually just posted issues this morning on {aweek} and {tsibble} about this.

As a sidenote - while in the process of redoing everything might be worth considering renaming just to make certain epis happy and avoid semantic discussions around incidence vs incidence rate vs prevalence (see)

Release incidence2 0.1.0

Prepare for release:

  • Check that description is informative
  • Check licensing of included files
  • devtools::build_readme()
  • usethis::use_cran_comments()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • Update cran-comments.md
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_news_md()
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Update install instructions in README
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Accept data.table as incidence input / or error

Please place an "x" in all the boxes that apply

  • I have the most recent version of incidence2 and R
  • I have found a bug
  • I have a reproducible example
  • I want to request a new feature

Need to produce an error or convert if data.table object is provided as input
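The "convert" option could look roughly like the sketch below. `as_plain_data_frame()` is a hypothetical helper name. So that the example runs without data.table installed, it uses an object that merely carries the `data.table` class as a stand-in for a real one.

```r
# Hypothetical input check: reject non-data-frames, and strip the
# data.table class so that standard data frame indexing applies.
as_plain_data_frame <- function(x) {
  if (!is.data.frame(x)) stop("`x` must be a data frame")
  if (inherits(x, "data.table")) {
    x <- as.data.frame(x)
  }
  x
}

# Stand-in for a data.table (assumption: only the class vector matters here).
fake_dt <- structure(
  data.frame(date = 1:3, cases_new = c(3, 0, 2)),
  class = c("data.table", "data.frame")
)

class(as_plain_data_frame(fake_dt))
```

The alternative would be to replace the conversion with a `stop()` call carrying an informative message.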

library(data.table)
library(incidence2)
dat <- incidence2::covidregionaldataUK
incidence(dat, date, count = cases_new)
#> An incidence object: 490 x 2
#> date range: [2020-01-30] to [2021-06-02]
#> cases_new: 8379330
#> interval: 1 day
#> 
#>    date_index cases_new
#>    <date>         <dbl>
#>  1 2020-01-30         3
#>  2 2020-01-31         0
#>  3 2020-02-01         0
#>  4 2020-02-02         0
#>  5 2020-02-03         0
#>  6 2020-02-04         0
#>  7 2020-02-05         2
#>  8 2020-02-06         0
#>  9 2020-02-07         0
#> 10 2020-02-08         8
#> # … with 480 more rows
setDT(dat)
incidence(dat, date, count = cases_new)
#> Error in `[.data.table`(x, date_index): When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

Created on 2021-08-23 by the reprex package (v2.0.1)


Slowness with record types (possibly my misuse of dplyr across)

I think I'm using across poorly as the example below is crazily slow. Here I'm using clock but I'm pretty sure that is not the bottleneck:

library(incidence2)
library(clock)
library(microbenchmark)

dat <- covidregionaldataUK

# default uses data.table
default <- function() {
  incidence(dat, date_index = date, groups = region, counts = ends_with("new"))
}

# here clock is just used as an example and is not the bottle-neck
record <- function() {
  build_incidence(
    dat,
    date_index = date,
    groups = region,
    counts = ends_with("new"),
    FUN = function(x) calendar_narrow(as_year_month_day(x), precision = "day")
  )
}

microbenchmark(default(), record(), times = 10)
#> Unit: milliseconds
#>       expr         min          lq        mean      median          uq
#>  default()    5.053506    5.303756    7.184131    5.825458    6.297723
#>   record() 4614.132282 4683.771452 4741.717130 4751.499357 4782.525466
#>         max neval
#>    18.83551    10
#>  4835.74966    10

Created on 2021-06-28 by the reprex package (v2.0.0)

Center x-axis annotations by default

The current default for x-axis annotations is to be left-justified, e.g.:

library(outbreaks)
library(incidence2)

x <- ebola_sim_clean$linelist %>%
  incidence(date_index = date_of_onset, interval = "month")
plot(x) + scale_x_incidence(x, format = "%d %m %Y")
#> Scale for 'x' is already present. Adding another scale for 'x', which will
#> replace the existing scale.

I suspect most users will want to center the labels by default. It could be done by adding a center_labels = TRUE by default to scale_x_incidence, but this argument already exists with a different meaning (position of the tick marks) in the plot method.

One solution would be to:

  1. rename center_labels to center_tick in the plot method
  2. add a center_labels argument to scale_x_incidence

But I am sure there are other options - just can't think of a good one atm.

Not importing incidence 1?

I'm mostly just curious, but what is the reasoning behind the design decision to copy over the code from {incidence} initially instead of importing? From my perspective, if there's a bug in the future, then there are two places where it needs to be fixed.

Strategy for renaming functions

For instance, pool may be better named regroup, and I guess there could be more cases like this. Generally speaking, renaming things from the original incidence poses some trade-offs. There are several strategies we may consider:

Stick to the old

We keep old names as much as possible, and only use new names for new features.

Scrap the old

As this is a reboot, we can do away with old names, and rely on documentation for people to find out correspondence. A softer version would be to have incidence2::pool merely return NULL (or an error) and throw a message saying that this feature is now called regroup in incidence2.

Aliases

We could have incidence2::regroup <- incidence2::pool. If so, do we want to:

  • keep aliases going forward (I think not)
  • mark old names as deprecated and eventually remove them? It might make sense in terms of transition, but it is weird to develop a new package with already deprecated functions, with a schedule that explicitly plans to break backward compatibility further down the line.
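For reference, the "deprecated alias" option could be sketched in base R as below. The body of `regroup()` is a placeholder. Only the forwarding pattern with `.Deprecated()` is the point.

```r
# New name: placeholder body, the real function would re-aggregate counts.
regroup <- function(x, groups = NULL) {
  x
}

# Old name kept as a thin wrapper that warns and forwards to the new one.
pool <- function(x, groups = NULL) {
  .Deprecated("regroup")
  regroup(x, groups = groups)
}

# Calling the old name still works but signals a deprecation warning.
res <- suppressWarnings(pool(1:3))
res
```

The "scrap the old" variant would swap `.Deprecated()` for `.Defunct()`, which raises an error pointing to the new name instead.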

First trial comments

These are comments or discussion items which in themselves seem too small to warrant a dedicated issue. This issue is a work in progress.

Doc

  • in ?incidence, specify that x can be a data.frame or a tibble, for the incidence functions, and that the indicated methods dispatch along the second argument (which is unusual, though very cool)

  • I love that the following works:

library(magrittr)
library(incidence2)
outbreaks::ebola_sim_clean$linelist %>%
  incidence(date_of_onset, groups = c(gender, hospital, outcome)) %>%
  pool(c(gender, outcome))
#> <incidence object>
#> [5829 cases from days 2014-04-07 to 2015-04-30]
#> [interval: 1 day]
#> [cumulative: FALSE]
#> 
#>    date_group gender outcome count
#>    <date>     <fct>  <fct>   <int>
#>  1 2014-04-07 f      Death       0
#>  2 2014-04-07 f      Recover     0
#>  3 2014-04-07 f      <NA>        1
#>  4 2014-04-07 m      Death       0
#>  5 2014-04-07 m      Recover     0
#>  6 2014-04-07 m      <NA>        0
#>  7 2014-04-08 f      Death       0
#>  8 2014-04-08 f      Recover     0
#>  9 2014-04-08 f      <NA>        0
#> 10 2014-04-08 m      Death       0
#> # … with 2,324 more rows

Going forward it would be nice to have a vignette dedicated to handling incidence objects and illustrating this feature, amongst others. Also, would it make sense to create an alias for pool called regroup (or merely rename it), as that might make more sense to people? Which makes me realise we have not decided on a policy for keeping names / creating new names.

Plotting

  • the message:
Error in plot_single(x, group, stack, color, col_pal, alpha, border, xlab,  : 
  A single plot can only stack/dodge one variable.
 Please `pool` the object first or use `plot_facet`

Should read facet_plot

  • the y-axis legend for interval = 14 reads semi-weekly (google has it as "twice a week"); I think bi-weekly may be better (google says it can be either "once every 2 weeks" or "twice a week")

Adding moving average as geom_line

Hey - not a must have but might be nice to have option to add a moving average line.
This is pretty commonly used on messy epi data.
Should be quite easy to implement now that {slider} has been released.
See r4epi discussion

Maybe this should just be left for users to do separately (i.e. add to the plot themselves afterwards)?
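For comparison, a centred moving average needs very little code even in base R. The sketch below uses `stats::filter()` rather than {slider}, on an arbitrary vector of counts. The window size `k` is an assumption chosen for illustration.

```r
# Arbitrary incidence counts for ten consecutive time steps.
counts <- c(2, 9, 29, 34, 44, 52, 72, 120, 166, 255)

# Centred moving average over k time steps; the ends are NA because the
# window does not fit there. stats:: is spelled out since dplyr masks filter().
k  <- 3
ma <- as.numeric(stats::filter(counts, rep(1 / k, k), sides = 2))

data.frame(count = counts, moving_average = ma)
```

The resulting column could then be drawn by the user as an extra line layer on top of the plot.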

Plots

plot vs facet_plot

The current design, with plot only working with 0-1 group, makes sense, and pool is nice enough to use. However, I wonder if users would like this to be wrapped automatically through plot. Maybe this is something we should ask in community feedback? I think my personal preference is the current implementation.

More plotting options

I would like to add some more plotting options beyond the geom_col, which is effectively a histogram. It would be nice to add a geom_point and a geom_line, e.g.

## only points
dat %>% 
  incidence(date_of_onset) %>%
  plot(type = "point")

## points and lines
dat %>% 
  incidence(date_of_onset) %>%
  plot(type = c("point", "line"))

In terms of x-axis positioning, these would be set at the middle of the corresponding time interval.

Little helpers

It would be nice to offer some small helpers which will be frequently used. Again, community feedback will be useful there. I can think of, for instance:

  • a small helper for rotating the labels on the x-axis, as this is often needed to avoid overlapping labels
  • a small helper for changing the date formats on the x-axis

Default styling

This is a big selling point, and I like the idea that the new version will look, well, new.

  • background: not a fan of the ggplot2 default grey; maybe try defaulting to theme_bw()?
  • default color palette: the one I had made for incidence is quite sad looking; I had used paletton to pick colors that were quite distinct but even then I am not sure the result works well; it would be nice to have a better, more vivid, colorblind-friendly default
  • default color: could become the first color of the default color palette?
  • other palettes: do we want to provide out-of-the-box support for other color palettes, e.g. those defined in ggplot2 and RColorBrewer (probably sticking to categorical variables)?

Width argument to specify no gap between bars

Hello!
Great improvements on the original package - thank you very much! I really like the ability to facet and to use count data.

I would like to ask if the plotting functions can allow a width argument, or otherwise an option for there to be no gap between bars. At the US CDC, and it seems in Europe as well (see ref below), there is a traditional guideline that epidemic curves (when large enough that cases are not shown as boxes) should be histograms and not bar charts, or at least that there be no spaces between the bars. If this option can be offered, I think it would also offer a solution to the varying width and frequency of "white lines"/gaps between bars, which appear for example in the GitHub readme (below).

From the vignette - "white lines"/bar gaps appearing at different frequencies across the plot
image

From the vignette - "white lines"/bar gaps of varying thickness across the plot
image

I tried to include a width argument in plot() but it was not accepted. When I tried to add a geom_col() to plot() and specify width that also did not work. While experimenting, I tried to use geom_col alone directly on a weekly incidence2 object. When I specified width = 7 I was able to achieve non-overlapping bars without any gaps. This makes sense given that it was a weekly incidence object and according to this ggplot2 issue discussion which says that geom_col width is interpreted in absolute units (days in this case).

Here is that example - the outbreaks ebola_sim_clean linelist

pacman::p_load(incidence2, tidyverse, outbreaks)
b <- incidence2::incidence(outbreaks::ebola_sim_clean$linelist, date_index = date_of_onset, groups = gender, interval = "week")
plot(b, fill = gender) # weird varying white "gaps" between bars
ggplot(data = b) + geom_col(aes(x = bin_date, y = count, fill = gender), width = 7) # no gaps

I just wanted to chime in and see if this was something that is possible. Perhaps at the least the width argument could be allowed to pass to the underlying geom_col? Then the user could tinker and find the correct width?

Thanks very much for considering!

ECDC guidelines for presentation of surveillance data

Plot error with interval = 1

This is apparently caused by a week_var missing:

## bug for plotting with interval = 1

library(outbreaks)
library(incidence2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dat <- ebola_sim_clean$linelist
glimpse(dat)
#> Rows: 5,829
#> Columns: 11
#> $ case_id                 <chr> "d1fafd", "53371b", "f5c3d8", "6c286a", "0f58…
#> $ generation              <int> 0, 1, 1, 2, 2, 0, 3, 3, 2, 3, 4, 3, 4, 2, 4, …
#> $ date_of_infection       <date> NA, 2014-04-09, 2014-04-18, NA, 2014-04-22, …
#> $ date_of_onset           <date> 2014-04-07, 2014-04-15, 2014-04-21, 2014-04-…
#> $ date_of_hospitalisation <date> 2014-04-17, 2014-04-20, 2014-04-25, 2014-04-…
#> $ date_of_outcome         <date> 2014-04-19, NA, 2014-04-30, 2014-05-07, 2014…
#> $ outcome                 <fct> NA, NA, Recover, Death, Recover, NA, Recover,…
#> $ gender                  <fct> f, m, f, f, f, f, f, f, m, m, f, f, f, f, f, …
#> $ hospital                <fct> Military Hospital, Connaught Hospital, other,…
#> $ lon                     <dbl> -13.21799, -13.21491, -13.22804, -13.23112, -…
#> $ lat                     <dbl> 8.473514, 8.464927, 8.483356, 8.464776, 8.452…

i <- incidence(dat, date_index = date_of_onset)
plot(i)
#> Error: Must extract column with a single valid subscript.
#> ✖ Subscript `week_var` can't be `NA`.

Created on 2020-07-03 by the reprex package (v0.3.0)
