reconverse / incidence2

Compute and visualise incidence (reworking of the original incidence package)

Home Page: https://www.reconverse.org/incidence2

License: Other

R 99.57% Makefile 0.43%

incidence2's Introduction

incidence2

incidence2 is an R package that implements functions and classes to compute, handle and visualise incidence from linelist data. It refocusses the scope of the original incidence package. Unlike the original package, incidence2 concentrates only on the initial calculation, manipulation and plotting of the resultant incidence objects.

Installing the package

You can install the released version of incidence2 from CRAN with:

install.packages("incidence2")

Vignettes

An overview of incidence2 is provided in the vignette distributed with the package:

vignette("incidence2", package = "incidence2")


incidence2's Issues

Volunteers for code review

We're now seeking volunteers to review the package code. If interested please leave a message here.

need an as_incidence function to convert already aggregated data.frames

make `nrow` an explicit argument of `facet_plot`

Currently I pass nrow through dots to ggplot2::facet. Let's make that explicit.

On the behaviour of `facet_plot`

Some thoughts on how facetting may work, especially with regard to using groups for facetting and/or color-filling. Nothing is set in stone; this is more for discussion. It would be useful to have a facet argument controlling which grouping variables are used for facetting. Together with fill, this should give the user more flexibility in designing plots with different grouping variables displayed.

Here are some proposed behaviours:

facet_plot(x): plot the incidence object using all grouping variables for facetting
facet_plot(x, facet = "foo"): same, using only variable foo for facetting
facet_plot(x, facet = c("foo", "bar")): same, using variables foo and bar
facet_plot(x, facet = "foo", fill = bar): use foo for facetting and bar for filling
facet_plot(x, facet = c("foo", "bar"), fill = bar): use foo and bar for facetting, and bar for filling; redundant, but that's okay, the user asked for it

What do you think?

`na_as_group = FALSE` needs to additionally trim the first and last dates

Currently, if na_as_group is set to FALSE, you can end up with some dates with 0 counts at the beginning and end. We need to re-trim the dates once NAs are removed.
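A minimal sketch of the re-trimming step, using base R only. The helper name `trim_zero_edges` and the column names are hypothetical and not part of incidence2; for grouped objects the trimming would presumably need to look at totals across groups, which this sketch does not attempt.

```r
# Hypothetical helper (not part of incidence2): drop leading and trailing
# rows whose count is zero, keeping zeros that sit between non-zero rows.
trim_zero_edges <- function(x, count = "count") {
  nonzero <- which(x[[count]] > 0)
  if (length(nonzero) == 0L) {
    return(x[0, , drop = FALSE])
  }
  x[seq(min(nonzero), max(nonzero)), , drop = FALSE]
}

dat <- data.frame(
  date_index = as.Date("2021-01-01") + 0:5,
  count = c(0, 0, 3, 0, 2, 0)
)
trimmed <- trim_zero_edges(dat)
```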

wrapper to convert incidence objects

It would be useful to provide a wrapper to convert incidence objects from the original package to incidence2 objects.

Invisible plot when using colors

See reprex below:

library(magrittr)
library(incidence2)

outbreaks::ebola_sim_clean$linelist %>%
  incidence(date_of_onset, groups = outcome) %>%
  plot()
#> The number of colors (1) did not match the number of groups (3).
#> Using `col_pal` instead.

It may come from the borders, i.e. the bars are so thin that we only see the border? The plot is fine with wider time intervals.

Created on 2020-07-06 by the reprex package (v0.3.0)

Ensure error messages refer to variables by parameter value (i.e. not `x`)
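One common way to do this in base R is `deparse(substitute(x))`, which recovers the expression the caller supplied. The sketch below assumes a hypothetical validator `check_is_date`; it is not an incidence2 function.

```r
# Hypothetical validator: report the expression the caller supplied,
# rather than the internal argument name `x`.
check_is_date <- function(x) {
  arg <- deparse(substitute(x))
  if (!inherits(x, "Date")) {
    stop(sprintf("`%s` must be a Date, not <%s>.", arg, class(x)[1]), call. = FALSE)
  }
  invisible(x)
}

onset <- "2021-01-01"
# Capture the error message instead of stopping the script.
msg <- tryCatch(check_is_date(onset), error = conditionMessage)
```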

User request: allow date adjustment using % strptime abbreviations syntax

@nsbatra - the following requires the dev (GitHub main/master branches) of incidence2 and grates but hopefully works as you were hoping. Let me know what you think:

library(outbreaks)
library(incidence2)

dat <- ebola_sim_clean$linelist
x <- incidence(dat, date_of_onset, interval = "month")
x
#> An incidence2 object: 13 x 2
#> 5829 cases from 2014-Apr to 2015-Apr
#> interval: 1 month
#> cumulative: FALSE
#> 
#>    date_index count
#>    <month>    <int>
#>  1 2014-Apr       7
#>  2 2014-May      67
#>  3 2014-Jun     102
#>  4 2014-Jul     228
#>  5 2014-Aug     540
#>  6 2014-Sep    1144
#>  7 2014-Oct    1199
#>  8 2014-Nov     779
#>  9 2014-Dec     567
#> 10 2015-Jan     427
#> 11 2015-Feb     307
#> 12 2015-Mar     277
#> 13 2015-Apr     185
 
# centred dates (default for yearweek, single months, quarters and years)
plot(x, color = "white")

# histogram-esque dates on the breaks (defaults to "%Y-%m-%d")
plot(x, color = "white", centre_dates = FALSE)

# can specify a different format
plot(x, color = "white", centre_dates = FALSE, date_format = "%d-%m-%Y")

Created on 2021-05-19 by the reprex package (v2.0.0)

Default color palette

The 3 requirements of the new color palette would be:

look nice to as many humans as possible
be colorblind friendly
correspond to categorical variables

Quite a bit of thinking on these has been done by the viridis package:
https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html

As well, these very good resources:

I suggest we use this issue to propose palettes. Ideally put them to a vote at some point.

make a proper summary function

Currently we just have a hack that cats output rather than return a meaningful object.
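A rough sketch of the usual S3 pattern: `summary()` returns a classed list, and a separate print method handles display. The class names `demo_incidence` and `summary_demo_incidence` and the fields chosen are assumptions made for illustration.

```r
# Hypothetical sketch: summary() returns a classed list so the result can be
# reused programmatically; printing lives in its own method.
summary.demo_incidence <- function(object, ...) {
  out <- list(
    n_rows = nrow(object),
    total = sum(object$count),
    first_date = min(object$date_index),
    last_date = max(object$date_index)
  )
  class(out) <- "summary_demo_incidence"
  out
}

print.summary_demo_incidence <- function(x, ...) {
  cat(sprintf("%d cases from %s to %s\n",
              x$total, format(x$first_date), format(x$last_date)))
  invisible(x)
}

obj <- data.frame(date_index = as.Date("2021-01-01") + 0:2, count = c(1L, 4L, 2L))
class(obj) <- c("demo_incidence", class(obj))
s <- summary(obj)
```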

grouping within make_incidence is slow and hacky

The lines below are slow and hacky:

  x <- grouped_df(x, c(date_index, groups))
  x <- summarise(x, count = count_dates(.data[[date_index]], breaks), .groups = "keep")
  x <- mutate(x, {{date_index}} := breaks)
  x <- summarise(x, count = sum(.data$count), .groups = "keep")
  colnames(x) <- c("bin_date", colnames(x)[-1])
  x <- ungroup(x)

Replacing summarise with some sort of split / tapply / rbind combo will probably be quicker.
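As a sketch of what a base R version might look like, the example below uses `aggregate()`, a close relative of the split / tapply approach. The data, the column names and the 7-day binning rule are all made up for illustration, and whether this is actually faster than the dplyr code would need benchmarking.

```r
# Sketch only: count linelist rows per (date bin, group) with base R.
linelist <- data.frame(
  onset = as.Date(c("2021-01-04", "2021-01-05", "2021-01-12", "2021-01-12")),
  sex = c("f", "m", "f", "f")
)

# Bin dates into 7-day periods anchored at the earliest date (an assumption
# for illustration; incidence2's own binning rules are not reproduced here).
origin <- min(linelist$onset)
linelist$bin_date <- origin + 7 * (as.integer(linelist$onset - origin) %/% 7)

# aggregate() splits by bin and group, then sums a column of ones per cell.
counts <- aggregate(
  list(count = rep(1L, nrow(linelist))),
  by = list(bin_date = linelist$bin_date, sex = linelist$sex),
  FUN = sum
)
```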

Possible bug with month interval combined with groups

I find the following potential issue:

library(outbreaks)
library(incidence2)

## this works
ebola_sierraleone_2014 %>% 
  incidence(date_of_onset, interval = "1 week", groups = district) %>% 
  plot()

## this throws an error
ebola_sierraleone_2014 %>% 
  incidence(date_of_onset, interval = "1 month", groups = district) %>% 
  plot()

## note that month interval without group works
ebola_sierraleone_2014 %>% 
  incidence(date_of_onset, interval = "1 month") %>% 
  plot()

Possible new features: subsetting time windows

Subsetting objects by given time windows may be one of the only things made slightly easier in the original incidence package. For instance, x[1:5] would get you the first 5 time steps (days / weeks / months) of the object, which is a little trickier to do now. It would be useful to have some functions helping with this - see some proposed example uses below.

Filter first / last days / weeks / months etc.

Filter the data to retain the first or last data points, predicated on a duration. There is a question here, as to how duration can be specified:

simplest form: duration is provided as integer days
other simple form: duration is provided as integer time intervals (as specified by the bins of the object)
interpreted like the 'interval' argument of incidence2::incidence, so we could do things like "3 months" to have the first 3 months of data (possibly months 1 and 3 not being complete)

Examples would be (depending on the option above we retain):

filter_first(x, 30): retain the first 30 days of data, or all of it if there are fewer than 30 days
filter_first(x, "1 month"): retain the first month of data; may not be a full month, only data from the first reported month
filter_last(x, "4 weeks"): retain the last 4 weeks of data; the last week may not be complete e.g. if the last date is a Thursday, so this may not be 28 days of data
filter_last(x, 28): retain the last 28 days of data (irrespective of week definition)
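The simplest option above (duration given as integer days) could be sketched as follows. `filter_first` and `filter_last` are the proposed names from this issue, not existing functions, and the `date` argument and column names are assumptions.

```r
# Hypothetical helpers: `n` is a number of days, counted from the first
# (or back from the last) date present in the object.
filter_first <- function(x, n, date = "date_index") {
  start <- min(x[[date]])
  x[x[[date]] < start + n, , drop = FALSE]
}

filter_last <- function(x, n, date = "date_index") {
  end <- max(x[[date]])
  x[x[[date]] > end - n, , drop = FALSE]
}

x <- data.frame(date_index = as.Date("2021-01-01") + 0:59, count = 1L)
first30 <- filter_first(x, 30)
last28 <- filter_last(x, 28)
everything <- filter_first(x, 1000)  # more days than available: keep all rows
```

Because the filter is a comparison on dates rather than an index lookup, asking for more days than exist simply returns the whole object.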

Subset

We could re-implement the features of incidence::subset(), but possibly renaming the function. It would merely be a wrapper for filter on dates.

Allow count to work on multiple columns at once

Count should be made to work on multiple columns at once.
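For reference, base R can already sum several pre-aggregated count columns in one call via the formula interface of `aggregate()`. The data below are invented and the column names merely echo those used elsewhere in these issues.

```r
# Sketch only: sum several count columns per date in a single call.
dat <- data.frame(
  date = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02")),
  cases_new = c(3, 2, 5),
  deaths_new = c(0, 1, 1)
)

# cbind() on the left-hand side aggregates both columns at once.
totals <- aggregate(cbind(cases_new, deaths_new) ~ date, data = dat, FUN = sum)
```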

Add dimensions to print output of incidence object

Where to put labels on the x-axis?

There have been debates in the past on where dates should appear on the x-axis. I will try to sum up the views and things to take into account below, and maybe others will add their thoughts.

Original incidence package

epicurves were treated as histograms; bars represent case counts between 2 time points, so that e.g. for monthly incidence, a date on the x-axis marks the left-hand side of the bin (label to the left)
for fitting, a single date needs to be associated with a case count; thus we were using the middle of the time interval (label in the middle)
we did not have options for plotting epicurves as points / lines
several users said that a label in the middle was more intuitive

Current considerations

I suspect most epis do not read epicurves as histograms, so label in the middle would make sense
if we add geom_point and geom_line as options for plot and facet_plot, it is preferable to have consistent label positioning, which works the same for all geoms; label in the middle seems better for this: it still makes sense with geom_bar
model predictions will probably work better with label in the middle
devel-wise, it is safer to go with the least amount of fiddling with ggplot2's handling of the x-axis
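To make "label in the middle" concrete, here is a small sketch of computing bin midpoints from bin boundaries in base R. The variable names are illustrative and the upper bounds are assumed to be exclusive.

```r
# Sketch only: place each label at the midpoint of its bin.
bin_start <- as.Date(c("2021-01-01", "2021-02-01", "2021-03-01"))
bin_end <- as.Date(c("2021-02-01", "2021-03-01", "2021-04-01"))  # exclusive upper bounds

# Subtracting Dates gives a difftime in days; half of it is the offset to the
# centre. floor() keeps the result on a whole day.
bin_mid <- bin_start + floor(as.numeric(bin_end - bin_start) / 2)
```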

Add progress bar for when bootstrap / estimate peak are running

Fix y-axis labelling for periods

see labelling issue on y-axis below

library(outbreaks)
library(incidence2)
dat <- ebola_sim_clean$linelist
x2w <- incidence(dat, date_of_onset, interval = "2 weeks")
x2w
#> An incidence2 object: 28 x 2
#> 5829 cases from 2014-04-07 to 2015-05-03
#> interval: 14 days
#> cumulative: FALSE
#> 
#>    date_index               count
#>    <period>                 <int>
#>  1 2014-04-07 to 2014-04-20     2
#>  2 2014-04-21 to 2014-05-04     9
#>  3 2014-05-05 to 2014-05-18    29
#>  4 2014-05-19 to 2014-06-01    34
#>  5 2014-06-02 to 2014-06-15    44
#>  6 2014-06-16 to 2014-06-29    52
#>  7 2014-06-30 to 2014-07-13    72
#>  8 2014-07-14 to 2014-07-27   120
#>  9 2014-07-28 to 2014-08-10   166
#> 10 2014-08-11 to 2014-08-24   255
#> # … with 18 more rows
plot(x2w, color = "white")

Created on 2021-05-20 by the reprex package (v2.0.0)

Add sort method

Add a sort method for the incidence class. It could help with things like reconhub/trendbreaker#45

`group_dates` could be a user supplied function

Currently date binning is performed by an internal function, group_dates. This function could actually be any monotonically increasing, user-defined function. May be worth implementing.
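A sketch of what passing in the binning function might look like. `bin_counts` and `floor_month` are hypothetical names, and the monotonicity check only covers the dates actually supplied (it checks non-decreasing order, since several dates map to one bin).

```r
# Hypothetical sketch: the binning function is supplied by the caller.
# `FUN` must be monotonically non-decreasing so that bins preserve date order.
bin_counts <- function(dates, FUN) {
  bins <- FUN(dates)
  # Check monotonicity on this input: bins ordered by date must be sorted.
  stopifnot(!is.unsorted(as.numeric(bins[order(dates)])))
  tab <- table(format(bins))
  data.frame(bin = names(tab), count = as.integer(tab))
}

# Example user-supplied function: floor each date to the first day of its month.
floor_month <- function(d) as.Date(format(d, "%Y-%m-01"))

dates <- as.Date(c("2021-01-03", "2021-01-20", "2021-02-11"))
res <- bin_counts(dates, floor_month)
```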

Update intro vignette function list with new functions (e.g. `build_incidence`)

Possible bug with integer dates input

Please place an "x" in all the boxes that apply

I have the most recent version of incidence2 and R
I have found a bug
I have a reproducible example
I want to request a new feature

I think the following should probably work; I am guessing it has to do with dates being simple numbers (not Date or grates stuff) but not 100% sure. Here's a reprex:

library(incidence2)

df <- data.frame(
  dates = 1:100,
  counts = round(rnorm(100, 1000, 200))
)

x <- incidence(df, date_index = dates, counts = counts)
x
#> An incidence2 object: 100 x 2
#> 100028 counts from 1 to 100
#> interval: 1 day
#> cumulative: FALSE
#> 
#>    date_index counts
#>         <int>  <dbl>
#>  1          1    775
#>  2          2   1389
#>  3          3   1341
#>  4          4    999
#>  5          5    694
#>  6          6    645
#>  7          7   1000
#>  8          8   1365
#>  9          9   1119
#> 10         10    959
#> # … with 90 more rows
plot(x)
#> Error in `substring<-`(`*tmp*`, 1, 1, value = toupper(first_letter)): replacing substrings in a non-character object

Created on 2021-03-29 by the reprex package (v1.0.0)

Allow adding non-integer numbers to grates objects

A possibly infrequent use case: defining a limit between two time intervals defined by grates, e.g. to visually delineate epochs in a graph using a vertical line.

Currently the following will error on purpose:

> as_yrwk("2021-W03") + 1
[1] "2021-W04"

> as_yrwk("2021-W03") + 1.5
Error: Can only add whole numbers to <yrwk> objects

But unsure if we want to change this or not.
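One possible workaround, if the restriction stays, is to do the arithmetic on the underlying Date scale, where fractional offsets are allowed. The sketch below uses plain base R Dates and makes no assumption about the grates API.

```r
# Sketch only: work on the underlying Date scale, where fractions are allowed.
week_start <- as.Date("2021-01-18")  # Monday of ISO week 2021-W03
boundary <- week_start + 1.5 * 7     # one and a half weeks later, in days
```

The resulting value could then be used as the x-position of a vertical line on a date axis.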

A better implementation of the underlying data?

Currently week labels are added as additional columns to the underlying data. Should the underlying structure just be the binned data with attributes that trigger different behaviour? For example, I think the following is a cleaner implementation:

library(incidence2)
library(dplyr, warn.conflicts = FALSE)

# get some data
data(ebola_sim_clean, package = "outbreaks")
dat <- 
  ebola_sim_clean$linelist %>% 
  filter(date_of_onset <= "2014-07-07")


# generate object with our current implementation
inci <- incidence(dat, date_index = date_of_onset, interval = "week")
inci
#> <incidence object>
#> [207 cases from days 2014-04-07 to 2014-07-07]
#> [interval: 1 week]
#> [cumulative: FALSE]
#> 
#>    date_group weeks    isoweeks count
#>    <date>     <aweek>  <chr>    <int>
#>  1 2014-04-07 2014-W15 2014-W15     1
#>  2 2014-04-14 2014-W16 2014-W16     1
#>  3 2014-04-21 2014-W17 2014-W17     5
#>  4 2014-04-28 2014-W18 2014-W18     4
#>  5 2014-05-05 2014-W19 2014-W19    12
#>  6 2014-05-12 2014-W20 2014-W20    17
#>  7 2014-05-19 2014-W21 2014-W21    15
#>  8 2014-05-26 2014-W22 2014-W22    19
#>  9 2014-06-02 2014-W23 2014-W23    23
#> 10 2014-06-09 2014-W24 2014-W24    21
#> 11 2014-06-16 2014-W25 2014-W25    30
#> 12 2014-06-23 2014-W26 2014-W26    22
#> 13 2014-06-30 2014-W27 2014-W27    34
#> 14 2014-07-07 2014-W28 2014-W28     3
str(inci)
#> tibble [14 × 4] (S3: incidence/tbl_df/tbl/data.frame)
#>  $ date_group: Date[1:14], format: "2014-04-07" "2014-04-14" ...
#>  $ weeks     : 'aweek' chr [1:14] "2014-W15" "2014-W16" "2014-W17" "2014-W18" ...
#>   ..- attr(*, "week_start")= int 1
#>  $ isoweeks  : chr [1:14] "2014-W15" "2014-W16" "2014-W17" "2014-W18" ...
#>  $ count     : int [1:14] 1 1 5 4 12 17 15 19 23 21 ...
#>  - attr(*, "date")= chr [1:3] "date_group" "weeks" "isoweeks"
#>  - attr(*, "count")= chr "count"
#>  - attr(*, "interval")= chr "week"
#>  - attr(*, "cumulative")= logi FALSE

# drop the extra columns
x <- inci %>% 
  select(date_group, count)


# create demo subclass of tibble without the additional columns
# give it a "week" attribute (but could be "quarter", "year", "month", etc...)
tbl <- tibble::new_tibble(x,
                          nrow = nrow(x),
                          type = "week",
                          class = "demo")

# users would see the groupings they expect
print.demo <- function(x, ...) {
  
  # title
  cat("<demo object>\n")
  out <- x
  if (attr(x, "type") == "week") {
    out$date_group <- aweek::date2week(out$date_group)  
  }
  out <- format(tibble::as_tibble(out))
  
  cat(out[-1], sep = "\n")
  cat("\n")
  invisible(x)
}

tbl
#> <demo object>
#>    date_group count
#>    <aweek>    <int>
#>  1 2014-W15-1     1
#>  2 2014-W16-1     1
#>  3 2014-W17-1     5
#>  4 2014-W18-1     4
#>  5 2014-W19-1    12
#>  6 2014-W20-1    17
#>  7 2014-W21-1    15
#>  8 2014-W22-1    19
#>  9 2014-W23-1    23
#> 10 2014-W24-1    21
#> 11 2014-W25-1    30
#> 12 2014-W26-1    22
#> 13 2014-W27-1    34
#> 14 2014-W28-1     3

# but underlying we have the original binning data
str(tbl)
#> tibble [14 × 2] (S3: demo/tbl_df/tbl/data.frame)
#>  $ date_group: Date[1:14], format: "2014-04-07" "2014-04-14" ...
#>  $ count     : int [1:14] 1 1 5 4 12 17 15 19 23 21 ...
#>  - attr(*, "type")= chr "week"

Created on 2020-07-08 by the reprex package (v0.3.0)

Rename argument n.breaks -> n_breaks

Minor thing but for consistency, it would make sense to rename the n.breaks argument to n_breaks in plot.incidence2 and facet_plot.

Add vignette about the grouped date classes

Improve tests

Currently testing has become a bit of a mixed bag due to the refactoring and is a little unmaintainable. Work needs to be done to implement a more methodical approach.

Bug in keep_first and keep_last overshooting

Please place an "x" in all the boxes that apply

I have the most recent version of reportfactory and R
I have found a bug
I have a reproducible example
I want to request a new feature

Quick description

These functions return NAs when the second argument exceeds the size of the object. They should probably return the whole thing; I am not even sure we want a warning in there.

Reprex

library(tidyverse)
library(incidence2)

tibble(date = 1:10) %>% 
  incidence(date) %>% 
  keep_last(10)
#> An incidence2 object: 10 x 2
#> 10 cases from 1 to 10
#> interval: 1 day
#> cumulative: FALSE
#> 
#>    date_index count
#>         <int> <int>
#>  1          1     1
#>  2          2     1
#>  3          3     1
#>  4          4     1
#>  5          5     1
#>  6          6     1
#>  7          7     1
#>  8          8     1
#>  9          9     1
#> 10         10     1

tibble(date = 1:10) %>% 
  incidence(date) %>% 
  keep_last(11)
#> # A tibble: 10 x 2
#>    date_index count
#>         <int> <int>
#>  1         NA    NA
#>  2         NA    NA
#>  3         NA    NA
#>  4         NA    NA
#>  5         NA    NA
#>  6         NA    NA
#>  7         NA    NA
#>  8         NA    NA
#>  9         NA    NA
#> 10         NA    NA

tibble(date = 1:10) %>% 
  incidence(date) %>% 
  keep_first(10)
#> An incidence2 object: 10 x 2
#> 10 cases from 1 to 10
#> interval: 1 day
#> cumulative: FALSE
#> 
#>    date_index count
#>         <int> <int>
#>  1          1     1
#>  2          2     1
#>  3          3     1
#>  4          4     1
#>  5          5     1
#>  6          6     1
#>  7          7     1
#>  8          8     1
#>  9          9     1
#> 10         10     1

tibble(date = 1:10) %>% 
  incidence(date) %>% 
  keep_first(11)
#> # A tibble: 10 x 2
#>    date_index count
#>         <int> <int>
#>  1         NA    NA
#>  2         NA    NA
#>  3         NA    NA
#>  4         NA    NA
#>  5         NA    NA
#>  6         NA    NA
#>  7         NA    NA
#>  8         NA    NA
#>  9         NA    NA
#> 10         NA    NA

Created on 2021-03-08 by the reprex package (v1.0.0)
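For comparison, base R's `head()` and `tail()` already clamp `n` to the number of rows, which is the behaviour requested here. The wrappers below are hypothetical and operate on rows of a plain data.frame; the real keep_first / keep_last may need to work on date bins rather than rows.

```r
# Hypothetical sketch: head() and tail() clamp `n` to the number of rows,
# so overshooting returns the whole object rather than NA rows.
keep_first_rows <- function(x, n) head(x, n)
keep_last_rows <- function(x, n) tail(x, n)

x <- data.frame(date_index = 1:10, count = 1L)
a <- keep_first_rows(x, 11)
b <- keep_last_rows(x, 11)
```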

Potential bug with multiple values in date_index

Please place an "x" in all the boxes that apply

I have the most recent version of incidence2 and R
I have found a bug
I have a reproducible example
I want to request a new feature

In the following example, 2 valid dates are provided to date_index but generate an error. Sorry if I am missing something obvious. See example below:

library(incidence2)
library(outbreaks)

dat <- ebola_sim_clean$linelist
i <- incidence(dat,
               date_index = c(onset = date_of_onset, outcome = date_of_outcome),
               interval = 7,
               group = "hospital")
#> Error: Names must be unique.
#> x These names are duplicated:
#>   * "outcome" at locations 6 and 7.
names(dat)
#>  [1] "case_id"                 "generation"             
#>  [3] "date_of_infection"       "date_of_onset"          
#>  [5] "date_of_hospitalisation" "date_of_outcome"        
#>  [7] "outcome"                 "gender"                 
#>  [9] "hospital"                "lon"                    
#> [11] "lat"

Created on 2021-03-22 by the reprex package (v1.0.0)

Error with show_cases = TRUE with monthly incidence

See reprex below:

library(outbreaks)
library(incidence2)

x <- ebola_sim_clean$linelist %>%
  incidence(date_index = date_of_onset, interval = "month")
plot(x, show_cases = TRUE)
#> Error in `$<-.data.frame`(`*tmp*`, "width", value = c(30L, 31L, 30L, 31L, : replacement has 13 rows, data has 5829

But the same with weekly intervals works.

NSE should be used for plot arguments to match with other functions

Code Review 1

Borrowing rOpenSci guidance:

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the ropensci reviewer guide.

Documentation

The package includes all the following forms of documentation:

A statement of need clearly stating problems the software is designed to solve and its target audience in README
Installation instructions: for the development version of package and any non-standard dependencies in README
Vignette(s) demonstrating major functionality that runs successfully locally
Function Documentation: for all exported functions in R help
Examples for all exported functions in R Help that run successfully locally

Functionality

Installation: Installation succeeds as documented.
Functionality: Any functional claims of the software have been confirmed.
Performance: Any performance claims of the software have been confirmed.
Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Review Comments

fix/improve dplyr reconstruction

Currently the dplyr reconstructable checking is superfluous in places and missing some bits in others. https://github.com/reconhub/trending/blob/master/R/dplyr.R has a slightly better approach, including a method for $<-.

date group labels when daily interval

The date label when interval = 1L would look better if it said date rather than bin_date

consider using {tsibble}

hi - like the idea of moving this to cleaner syntax!

Just thought I would suggest using tsibble to do some of the underlying legwork. There are yearweek and year* functions and it all works pretty cleanly.

The advantage over aweek is that it is recognised as a date automatically. The disadvantage compared with aweek is that you cannot (as of yet) set a different start day for a week.

Actually just posted issues this morning on {aweek} and {tsibble} about this.

As a sidenote - while in the process of redoing everything might be worth considering renaming just to make certain epis happy and avoid semantic discussions around incidence vs incidence rate vs prevalence (see)

Release incidence2 0.1.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Accept data.table as incidence input / or error

Please place an "x" in all the boxes that apply

I have the most recent version of incidence2 and R
I have found a bug
I have a reproducible example
I want to request a new feature

Need to produce an error or convert if data.table object is provided as input

library(data.table)
library(incidence2)
dat <- incidence2::covidregionaldataUK
incidence(dat, date, count = cases_new)
#> An incidence object: 490 x 2
#> date range: [2020-01-30] to [2021-06-02]
#> cases_new: 8379330
#> interval: 1 day
#> 
#>    date_index cases_new
#>    <date>         <dbl>
#>  1 2020-01-30         3
#>  2 2020-01-31         0
#>  3 2020-02-01         0
#>  4 2020-02-02         0
#>  5 2020-02-03         0
#>  6 2020-02-04         0
#>  7 2020-02-05         2
#>  8 2020-02-06         0
#>  9 2020-02-07         0
#> 10 2020-02-08         8
#> # … with 480 more rows
setDT(dat)
incidence(dat, date, count = cases_new)
#> Error in `[.data.table`(x, date_index): When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

Created on 2021-08-23 by the reprex package (v2.0.1)
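A minimal sketch of the "convert or error" option. `as_plain_df` is a hypothetical helper, and the example uses a made-up data.frame subclass so that it runs without data.table installed; with data.table loaded, `as.data.frame()` would dispatch to that package's own method.

```r
# Hypothetical input check: coerce data.frame subclasses (tibble, data.table)
# to a plain data.frame and reject anything else with a clear error.
as_plain_df <- function(x) {
  if (!is.data.frame(x)) {
    stop("`x` must be a data.frame or a subclass of it.", call. = FALSE)
  }
  as.data.frame(x)
}

# Made-up subclass standing in for a data.table or tibble.
sub <- structure(data.frame(date = 1:3), class = c("demo_subclass", "data.frame"))
plain <- as_plain_df(sub)

# Non-data.frame input produces an informative error.
err <- tryCatch(as_plain_df(1:3), error = conditionMessage)
```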

Slowness with record types (possibly my misuse of dplyr across)

I think I'm using across poorly as the example below is crazily slow. Here I'm using clock but I'm pretty sure that is not the bottleneck:

library(incidence2)
library(clock)
library(microbenchmark)

dat <- covidregionaldataUK

# default uses data.table
default <- function() {
  incidence(dat, date_index = date, groups = region, counts = ends_with("new"))
}

# here clock is just used as an example and is not the bottle-neck
record <- function() {
  build_incidence(
    dat,
    date_index = date,
    groups = region,
    counts = ends_with("new"),
    FUN = function(x) calendar_narrow(as_year_month_day(x), precision = "day")
  )
}

microbenchmark(default(), record(), times = 10)
#> Unit: milliseconds
#>       expr         min          lq        mean      median          uq
#>  default()    5.053506    5.303756    7.184131    5.825458    6.297723
#>   record() 4614.132282 4683.771452 4741.717130 4751.499357 4782.525466
#>         max neval
#>    18.83551    10
#>  4835.74966    10

Created on 2021-06-28 by the reprex package (v2.0.0)

Center x-axis annotations by default

The current default for x-axis annotations is to be left-justified, e.g.:

library(outbreaks)
library(incidence2)

x <- ebola_sim_clean$linelist %>%
  incidence(date_index = date_of_onset, interval = "month")
plot(x) + scale_x_incidence(x, format = "%d %m %Y")
#> Scale for 'x' is already present. Adding another scale for 'x', which will
#> replace the existing scale.

I suspect most users will want to center the labels by default. It could be done by adding a center_labels = TRUE by default to scale_x_incidence, but this argument already exists with a different meaning (position of the tick marks) in the plot method.

One solution would be to:

rename center_labels to center_tick in the plot method
add a center_labels argument to scale_x_incidence

But I am sure there are other options - just can't think of a good one atm.

Not importing incidence 1?

I'm mostly just curious, but what is the reasoning behind the design decision to copy over the code from {incidence} initially instead of importing? From my perspective, if there's a bug in the future, then there are two places where it needs to be fixed.

Strategy for renaming functions

For instance, pool may be better named regroup, and I guess there could be more cases like this. Generally speaking, renaming things from the original incidence poses some trade-offs. There are several strategies we may consider:

Stick to the old

We keep old names as much as possible, and only use new names for new features.

Scrap the old

As this is a reboot, we can do away with old names, and rely on documentation for people to find out correspondence. A softer version would be to have incidence2::pool merely return NULL (or an error) and throw a message saying that this feature is now called regroup in incidence2.

Aliases

We could have incidence2::regroup <- incidence2::pool. If so, do we want to:

keep aliases going forward (I think not)
mark old names as deprecated and eventually remove them? It might make sense in terms of transition, but it is weird to develop a new package with already deprecated functions, with a schedule that explicitly plans to break backward compatibility further down the line.
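The deprecation option could be sketched with base R's `.Deprecated()`. The body of `regroup` below is a stand-in for illustration only and does not reflect what incidence2's pool or regroup actually compute.

```r
# Hypothetical sketch of a soft transition: the old name still works but
# signals a deprecation warning pointing at the new name.
regroup <- function(x, groups = NULL) {
  # Stand-in body for illustration only: sum counts over all groups.
  sum(x$count)
}

pool <- function(...) {
  .Deprecated("regroup")
  regroup(...)
}

x <- data.frame(count = c(1, 2, 3))
res <- suppressWarnings(pool(x))                    # still returns a result
w <- tryCatch(pool(x), warning = function(w) w)     # the warning that is raised
```

The "softer version" of scrapping the old name would replace `.Deprecated()` with `.Defunct()`, which raises an error instead of a warning.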

First trial comments

These are comments or discussion items which in themselves seem too small to warrant a dedicated issue. This issue is a work in progress.

Doc

in ?incidence, specify that x can be a data.frame or a tibble, for the incidence functions, and that the indicated methods dispatch along the second argument (which is unusual, though very cool)
I love that the following works:

library(magrittr)
library(incidence2)
outbreaks::ebola_sim_clean$linelist %>%
  incidence(date_of_onset, groups = c(gender, hospital, outcome)) %>%
  pool(c(gender, outcome))
#> <incidence object>
#> [5829 cases from days 2014-04-07 to 2015-04-30]
#> [interval: 1 day]
#> [cumulative: FALSE]
#> 
#>    date_group gender outcome count
#>    <date>     <fct>  <fct>   <int>
#>  1 2014-04-07 f      Death       0
#>  2 2014-04-07 f      Recover     0
#>  3 2014-04-07 f      <NA>        1
#>  4 2014-04-07 m      Death       0
#>  5 2014-04-07 m      Recover     0
#>  6 2014-04-07 m      <NA>        0
#>  7 2014-04-08 f      Death       0
#>  8 2014-04-08 f      Recover     0
#>  9 2014-04-08 f      <NA>        0
#> 10 2014-04-08 m      Death       0
#> # … with 2,324 more rows

Going forward it would be nice to have a vignette dedicated to handling incidence objects that illustrates this feature, amongst others. Also, would it make sense to create an alias for pool called regroup (or merely rename it), as it might make more sense to people? Which makes me realise we have not decided on a policy for keeping names / creating new names.

Plotting

the message:

Error in plot_single(x, group, stack, color, col_pal, alpha, border, xlab,  : 
  A single plot can only stack/dodge one variable.
 Please `pool` the object first or use `plot_facet`

Should read facet_plot

the y-axis legend for interval = 14 reads semi-weekly (google has it as "twice a week"); I think bi-weekly may be better (google says it can be either "once every 2 weeks" or "twice a week")

Add vignette docs for complete_dates function

Adding moving average as geom_line

Hey - not a must-have, but it might be nice to have an option to add a moving average line.
This is pretty commonly used on messy epi data.
Should be quite easy to implement now that {slider} has been released.
See r4epi discussion

Maybe it should just be left for users to do separately (i.e. add to the plot themselves afterwards)?
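For users doing this themselves, a trailing moving average can be computed without extra dependencies. The sketch below uses base R's `stats::filter()` rather than {slider}, on made-up counts.

```r
# Sketch only: a trailing 7-day moving average with base R's stats::filter().
count <- c(2, 4, 6, 8, 10, 12, 14, 16)
k <- 7

# sides = 1 uses only the current and previous values; the first k - 1
# positions have no complete window and are NA.
ma <- as.numeric(stats::filter(count, rep(1 / k, k), sides = 1))
```

The resulting vector could be added as a column and drawn with geom_line on top of the existing plot.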

Plots

`plot` vs `facet_plot`

The current design, with plot only working with 0-1 groups, makes sense, and pool is nice enough to use. However, I wonder if users would like this to be wrapped automatically through plot. Maybe this is something we should ask in community feedback? I think my personal preference is the current implementation.

More plotting options

I would like to add some more plotting options beyond the geom_col, which is effectively a histogram. It would be nice to add a geom_point and a geom_line, e.g.

## only points
dat %>% 
  incidence(date_of_onset) %>%
  plot(type = "point")

## points and lines
dat %>% 
  incidence(date_of_onset) %>%
  plot(type = c("point", "line"))

In terms of x-axis positioning, these would be set at the middle of the corresponding time interval.

Little helpers

It would be nice to offer some small helpers which will be frequently used. Again, community feedback will be useful there. I can think of, for instance:

a small helper for rotating the labels on the x-axis, as this is often needed to avoid overlapping labels
a small helper for changing the date formats on the x-axis

Default styling

This is a big selling point, and I like the idea that the new version will look, well, new.

background: not a fan of the ggplot2 default grey; maybe try defaulting to theme_bw()?
default color palette: the one I had made for incidence is quite sad looking; I had used paletton to pick colors that were quite distinct but even then I am not sure the result works well; it would be nice to have a better, more vivid, colorblind-friendly default
default color: could become the first color of the default color palette?
other palettes: do we want to provide out-of-the-box support for other color palettes, e.g. those defined in ggplot2 and RColorBrewer (probably sticking to categorical variables)?

Width argument to specify no gap between bars

Hello!
Great improvements on the original package - thank you very much! I really like the ability to facet and to use count data.

I would like to ask if the plotting functions can allow a width argument or otherwise an option for there to be no gap between bars. At the US CDC, and it seems in Europe as well (see ref below), there is a traditional guideline that epidemic curves (when large enough that cases are not shown as boxes) should be histograms and not bar charts - or at least that there be no spaces between the bars. If this option can be offered, I think it would also offer a solution to the varying width and frequency of "white lines"/gaps between bars, which appear for example in the github readme (below).

From the vignette - "white lines"/bar gaps appearing at different frequencies across the plot

From the vignette - "white lines"/bar gaps of varying thickness across the plot

I tried to include a width argument in plot() but it was not accepted. When I tried to add a geom_col() to plot() and specify width that also did not work. While experimenting, I tried to use geom_col alone directly on a weekly incidence2 object. When I specified width = 7 I was able to achieve non-overlapping bars without any gaps. This makes sense given that it was a weekly incidence object and according to this ggplot2 issue discussion which says that geom_col width is interpreted in absolute units (days in this case).

Here is that example - the outbreaks ebola_sim_clean linelist

pacman::p_load(incidence2, tidyverse, outbreaks)
b <- incidence2::incidence(outbreaks::ebola_sim_clean$linelist, date_index = date_of_onset, groups = gender, interval = "week")
plot(b, fill = gender) # weird varying white "gaps" between bars
ggplot(data = b)+geom_col(aes(x = bin_date, y = count, fill = gender), width = 7) # no gaps

I just wanted to chime in and see if this was something that is possible. Perhaps at the least the width argument could be allowed to pass to the underlying geom_col? Then the user could tinker and find the correct width?

Thanks very much for considering!

ECDC guidelines for presentation of surveillance data

Plot error with interval = 1

This is apparently caused by a week_var missing:

## bug for plotting with interval = 1

library(outbreaks)
library(incidence2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dat <- ebola_sim_clean$linelist
glimpse(dat)
#> Rows: 5,829
#> Columns: 11
#> $ case_id                 <chr> "d1fafd", "53371b", "f5c3d8", "6c286a", "0f58…
#> $ generation              <int> 0, 1, 1, 2, 2, 0, 3, 3, 2, 3, 4, 3, 4, 2, 4, …
#> $ date_of_infection       <date> NA, 2014-04-09, 2014-04-18, NA, 2014-04-22, …
#> $ date_of_onset           <date> 2014-04-07, 2014-04-15, 2014-04-21, 2014-04-…
#> $ date_of_hospitalisation <date> 2014-04-17, 2014-04-20, 2014-04-25, 2014-04-…
#> $ date_of_outcome         <date> 2014-04-19, NA, 2014-04-30, 2014-05-07, 2014…
#> $ outcome                 <fct> NA, NA, Recover, Death, Recover, NA, Recover,…
#> $ gender                  <fct> f, m, f, f, f, f, f, f, m, m, f, f, f, f, f, …
#> $ hospital                <fct> Military Hospital, Connaught Hospital, other,…
#> $ lon                     <dbl> -13.21799, -13.21491, -13.22804, -13.23112, -…
#> $ lat                     <dbl> 8.473514, 8.464927, 8.483356, 8.464776, 8.452…

i <- incidence(dat, date_index = date_of_onset)
plot(i)
#> Error: Must extract column with a single valid subscript.
#> ✖ Subscript `week_var` can't be `NA`.

Created on 2020-07-03 by the reprex package (v0.3.0)

reinstate macos devel action when fixed

I've commented out the mac devel action due to r-lib/actions#140. This is a reminder for me to reinstate when fixed.

make result of estimate_peak compatible with broom

i.e. tidy and augment
