cmu-delphi / epidatr Goto Github PK

View Code? Open in Web Editor NEW

1.0 1.0 5.0 10.72 MB

Delphi Epidata API R Client

Home Page: https://cmu-delphi.github.io/epidatr/

License: Other

R 99.56% Makefile 0.44%

epidatr's People

Contributors

Stargazers

Watchers

Forkers

elray1 dajmcdon e3bo willtownes dsweber2

epidatr's Issues

pvt_paho_dengue points to the wrong endpoint

paho_dengue points to the quidel endpoint.

add raise for status for http errors

fetch methods may need httr::stop_for_status or other http error handling; e.g., I just (2021-07-14 11:53AM ET) ran into a 404 trying covidcast("fb-survey", "smoothed_cli", "day", "county", epirange(20210405, 20210410), "*") %>% fetch_classic() . This stopped with the message below, succeeded the following attempt, and gave the 404's HTML content in a 6-row data frame when I tried a third time with fetch_df

Fix `covidcast_meta` `fetch_tbl` parsing issues

Running

ccm = delphi.epidata::covidcast_meta() %>%
   delphi.epidata::fetch_tbl()

yields

Warning message:
One or more parsing issues, see `problems()` for details

due to non-date min_time, max_time, and max_issue for time type week / data source nchs-mortality.

Some investigation code:

library(tidyverse)
library(data.table)
library(delphi.epidata)
library(epiprocess)

ccm = delphi.epidata::covidcast_meta() %>%
  delphi.epidata::fetch_tbl()

probs = problems()

probs %>%
  print(n=50L)

ccm %>%
  # row numbers in probs appear to include the colnames row, so adjust for that when slicing
  slice(unique(probs[["row"]]-1L)) %>%
  select(data_source, signal, time_type, min_time, max_time, max_issue) %>%
  print(n=40L)

ccm_old = covidcast::covidcast_meta()

ccm_old %>%
  map(class)

ccm %>%
  slice(probs[["row"]]-1L) %>%
  left_join(ccm_old, by=c("data_source","signal","geo_type","time_type")) %>%
  select(data_source, signal, time_type, max_time.x, max_time.y, max_issue.x, max_issue.y, min_time.x, min_time.y) %>%
  print(n=50L)

create covidcastepidata version

Add `print` S3 methods for any classes missing them (e.g., `EpiRange`)

The highest-priority things to get print methods already have methods written, but may not be used; see #43. However, classes like EpiRange may still benefit from a neater-looking print method.

First-pass documentation + additional documentation issues

Addressing the bare-bones documentation is a high priority. We should look over and get the progress in #23 merged. Any remaining items in its checklist + other documentation issues should be filed into separate issues.

`install_github()` error

I can't currently install the package. I get the following error:

remotes::install_github("cmu-delphi/delphi-epidata-r")

The downloaded source packages are in
	‘/private/var/folders/fr/49035_g925n4f5g72mjntmth0000gn/T/RtmpvjwfKK/downloaded_packages’
✓  checking for file ‘/private/var/folders/fr/49035_g925n4f5g72mjntmth0000gn/T/RtmpvjwfKK/remotes7e0d69325f3d/cmu-delphi-delphi-epidata-r-5160ad9/DESCRIPTION’ ...
─  preparing ‘delphi.epidata’:
✓  checking DESCRIPTION meta-information ...
   Warning in grepl(e, files, perl = TRUE, ignore.case = TRUE) :
     PCRE pattern compilation error
   	'unrecognized character follows \'
   	at 'Makefile$'
   Error in grepl(e, files, perl = TRUE, ignore.case = TRUE) : 
     invalid regular expression '^\Makefile$'
   Execution halted
Error: Failed to install 'delphi.epidata' from GitHub:
  System command 'R' failed, exit status: 1, stdout & stderr were printed

Fix `issue` parsing in `fluview` endpoint

It looks like issue is set to be parsed as a date when it's actually an epiweek; the result is NA issues:

  library(delphi.epidata)
  library(magrittr)
  fluview("hhs1", epirange(201501,201503)) %>% fetch_tbl()
#> Warning: One or more parsing issues, see `problems()` for details
#> # A tibble: 3 × 16
#>   release_date region issue  epiweek      lag num_ili num_patients
#>   <chr>        <chr>  <date> <date>     <int>   <int>        <int>
#> 1 2017-10-24   hhs1   NA     2015-01-04   143     915        48019
#> 2 2017-10-24   hhs1   NA     2015-01-11   142    1306        51021
#> 3 2017-10-24   hhs1   NA     2015-01-18   141    1923        50890
#> # … with 9 more variables: num_providers <dbl>, num_age_0 <int>,
#> #   num_age_1 <int>, num_age_2 <int>, num_age_3 <int>, num_age_4 <int>,
#> #   num_age_5 <int>, wili <dbl>, ili <dbl>
  print(readr::problems())
#> # A tibble: 3 × 5
#>     row   col expected         actual file                            
#>   <int> <int> <chr>            <chr>  <chr>                           
#> 1     2     3 date like %Y%m%d 201740 /tmp/Rtmp2ckL6r/file40877e0d5bbf
#> 2     3     3 date like %Y%m%d 201740 /tmp/Rtmp2ckL6r/file40877e0d5bbf
#> 3     4     3 date like %Y%m%d 201740 /tmp/Rtmp2ckL6r/file40877e0d5bbf

^{Created on 2022-07-13 by the reprex package (v2.0.1) --- and fixed up due to readr::problems() somehow being an empty 0x4 tibble without the file column}

Add missing `@export` tags for S3 methods

See r-lib/devtools#2293.

One important instance (not necessarily the only one) is print.epidata_call.

([Also r]equires re-documenting and including the [updates --- NAMESPACE should be updated to include some S3method entries as a result if the roxygen comments have been appropriately updated].)

add print method for epidatacall

EpiDataCall should have a print method (with instructions for users expecting this to just give some output). Users might also try to pull columns from the EpiDataCall object... maybe [[ and $ methods that stop and give a useful message (hinting that maybe the user forgot to call a fetch function) when attempting to access a field that doesn't exist would be helpful.

enforce data type within data frame

geo_value is assigned inconsistent types by fetch_df: e.g., try

covidcast("fb-survey", "smoothed_cli", "day", "county", epirange(20210405, 20210410), "42003") %>% fetch_df() %>% lapply(class) %>% `[[`("geo_value")

vs.

covidcast("fb-survey", "smoothed_cli", "day", "county", epirange(20210405, 20210410), "04013") %>% fetch_df() %>% lapply(class) %>% `[[`("geo_value")

Consider allowing `Date`s in place of epiranges, `Date`s & hyphenated datestrings in `epirange`s

Currently, Date objects cannot be fed in as, e.g., time_values to epidatr::covidcast, as Date objects are not considered epirange-like, nor can they be used successfully as the endpoints of an epirange. This is an inconvenience, as often we do Date-based arithmetic or use Sys.Date() to arrive at time_values of interest, and, furthermore, simply using as.character or toString doesn't always put them in the accepted format. (The hyphenated-string handling might require a little care due to hyphens being allowed to represent ranges between non-hyphenated-datestrings; hyphenated strings probably will need converted to non-hyphenated strings somewhere, and probably already are in the situations where they already work.)

It'd be nice for all, not just some, of the queries below to work. [Providing more comprehensible and helpful error messages, if easier, could be a temporary patch (--- e.g., suggesting to use format(dates, "%Y%m%d")).]

library(epidatr)
library(magrittr)
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          "20220101", "*") %>% fetch_tbl()
#> # A tibble: 1 × 15
#>   geo_value signal    source geo_t…¹ time_…² time_value direc…³ issue        lag
#>   <chr>     <chr>     <chr>  <ord>   <ord>   <date>       <dbl> <date>     <int>
#> 1 us        confirme… hhs    nation  day     2022-01-01      NA 2022-03-23    81
#> # … with 6 more variables: missing_value <int>, missing_stderr <int>,
#> #   missing_sample_size <int>, value <dbl>, stderr <dbl>, sample_size <dbl>,
#> #   and abbreviated variable names ¹geo_type, ²time_type, ³direction
#> # ℹ Use `colnames()` to see all variable names
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          c("20220101","20220102"), "*") %>% fetch_tbl()
#> # A tibble: 2 × 15
#>   geo_value signal    source geo_t…¹ time_…² time_value direc…³ issue        lag
#>   <chr>     <chr>     <chr>  <ord>   <ord>   <date>       <dbl> <date>     <int>
#> 1 us        confirme… hhs    nation  day     2022-01-01      NA 2022-03-23    81
#> 2 us        confirme… hhs    nation  day     2022-01-02      NA 2022-07-10   189
#> # … with 6 more variables: missing_value <int>, missing_stderr <int>,
#> #   missing_sample_size <int>, value <dbl>, stderr <dbl>, sample_size <dbl>,
#> #   and abbreviated variable names ¹geo_type, ²time_type, ³direction
#> # ℹ Use `colnames()` to see all variable names
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          "20220101-20220105", "*") %>% fetch_tbl()
#> # A tibble: 5 × 15
#>   geo_value signal    source geo_t…¹ time_…² time_value direc…³ issue        lag
#>   <chr>     <chr>     <chr>  <ord>   <ord>   <date>       <dbl> <date>     <int>
#> 1 us        confirme… hhs    nation  day     2022-01-01      NA 2022-03-23    81
#> 2 us        confirme… hhs    nation  day     2022-01-02      NA 2022-07-10   189
#> 3 us        confirme… hhs    nation  day     2022-01-03      NA 2022-03-23    79
#> 4 us        confirme… hhs    nation  day     2022-01-04      NA 2022-08-04   212
#> 5 us        confirme… hhs    nation  day     2022-01-05      NA 2022-03-23    77
#> # … with 6 more variables: missing_value <int>, missing_stderr <int>,
#> #   missing_sample_size <int>, value <dbl>, stderr <dbl>, sample_size <dbl>,
#> #   and abbreviated variable names ¹geo_type, ²time_type, ³direction
#> # ℹ Use `colnames()` to see all variable names
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          epirange("20220101","20220105"), "*") %>% fetch_tbl()
#> # A tibble: 5 × 15
#>   geo_value signal    source geo_t…¹ time_…² time_value direc…³ issue        lag
#>   <chr>     <chr>     <chr>  <ord>   <ord>   <date>       <dbl> <date>     <int>
#> 1 us        confirme… hhs    nation  day     2022-01-01      NA 2022-03-23    81
#> 2 us        confirme… hhs    nation  day     2022-01-02      NA 2022-07-10   189
#> 3 us        confirme… hhs    nation  day     2022-01-03      NA 2022-03-23    79
#> 4 us        confirme… hhs    nation  day     2022-01-04      NA 2022-08-04   212
#> 5 us        confirme… hhs    nation  day     2022-01-05      NA 2022-03-23    77
#> # … with 6 more variables: missing_value <int>, missing_stderr <int>,
#> #   missing_sample_size <int>, value <dbl>, stderr <dbl>, sample_size <dbl>,
#> #   and abbreviated variable names ¹geo_type, ²time_type, ³direction
#> # ℹ Use `colnames()` to see all variable names
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          "2022-01-01", "*") %>% fetch_tbl()
#> # A tibble: 1 × 15
#>   geo_value signal    source geo_t…¹ time_…² time_value direc…³ issue        lag
#>   <chr>     <chr>     <chr>  <ord>   <ord>   <date>       <dbl> <date>     <int>
#> 1 us        confirme… hhs    nation  day     2022-01-01      NA 2022-03-23    81
#> # … with 6 more variables: missing_value <int>, missing_stderr <int>,
#> #   missing_sample_size <int>, value <dbl>, stderr <dbl>, sample_size <dbl>,
#> #   and abbreviated variable names ¹geo_type, ²time_type, ³direction
#> # ℹ Use `colnames()` to see all variable names
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          c("2022-01-01", "2022-01-02"), "*") %>% fetch_tbl()
#> # A tibble: 2 × 15
#>   geo_value signal    source geo_t…¹ time_…² time_value direc…³ issue        lag
#>   <chr>     <chr>     <chr>  <ord>   <ord>   <date>       <dbl> <date>     <int>
#> 1 us        confirme… hhs    nation  day     2022-01-01      NA 2022-03-23    81
#> 2 us        confirme… hhs    nation  day     2022-01-02      NA 2022-07-10   189
#> # … with 6 more variables: missing_value <int>, missing_stderr <int>,
#> #   missing_sample_size <int>, value <dbl>, stderr <dbl>, sample_size <dbl>,
#> #   and abbreviated variable names ¹geo_type, ²time_type, ³direction
#> # ℹ Use `colnames()` to see all variable names
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          epirange("2022-01-01", "2022-01-05"), "*") %>% fetch_tbl()
#> Error: '' does not exist in current working directory ('/tmp/RtmpFtrhOX/reprex-20de19cb4198-prior-boa').
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          epirange("2022-01-01", "2022-01-05"), "*") %>% fetch_tbl()
#> Error: '' does not exist in current working directory ('/tmp/RtmpFtrhOX/reprex-20de19cb4198-prior-boa').
date = as.Date("2022-01-05")
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          date, "*") %>% fetch_tbl()
#> Error in !is.list: invalid argument type
# (the weird error message here is from #46) 
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          c(date, date+1L), "*") %>% fetch_tbl()
#> Error in !is.list: invalid argument type
# (the weird error message here is from #46) 
n_days = 5L
covidcast("hhs", "confirmed_admissions_covid_1d_prop", "day", "nation",
          epirange(date, date + n_days - 1L), "*") %>% fetch_tbl()
#> Error in epirange(date, date + n_days - 1L): (is.numeric(from) || is.character(from)) && length(from) == 1 is not TRUE

^{Created on 2022-09-27 by the reprex package (v2.0.1)}

pvt_twitter fails when using fetch_classic

Getting the bug

r$> pvt_twitter(auth, locations = "CA", epiweeks = epirange(201501, 202001)) %>% fetch_classic
Error in if (info$name %in% columns) { : argument is of length zero
r$> traceback()
3: parse_data_frame(epidata_call, m$epidata, disable_date_parsing = disable_date_parsing)
2: fetch_classic(.)
1: pvt_twitter(...)

Consider adding download caching to the client

I would think the following are higher priority than trying to implement streaming:

Tools to cache and perform smart updates on the cache.
...

Originally posted by @brookslogan in #13 (comment)

We have work on adding data caching to evalcast here. There is the option of porting that logic here. It would be natural to keep data-fetching/caching logic contained in this package. Then evalcast and the packages that supersede it can simply be updated to use this client and get caching functionality for free.

`covidcast`'s `as_of` should not "require" an epirange, and does not properly handle epiranges

See line here. as_of in the covidcast package allows only a single date; ~~we should support that in epidatr~~ [it looks like we already support this in epidatr, but we should make sure to test we can use a Date class here after #47 is fixed, and/or clarify the error message that an "epirange" is required]. Further, epiranges aren't actually handled properly here; providing a non-zero-width epirange only produces results for one as_of (plus no column for which as_of it came from --- note that as_of and issue are not interchangeable). Maybe actually handling epiranges could be made into a separate "enhancement" request issue; for now, maybe we should just reject epiranges and allow only single dates for as_of.

Currently working around for the single-as-of case like this:
epidatr::covidcast("hhs", "confirmed_admissions_influenza_1d", "day", geo_type, epidatr::epirange(format(target_date_range[[1L]], "%Y%m%d"), format(target_date_range[[2L]], "%Y%m%d")), geo_values, as_of=epidatr::epirange(format(evaluation_as_of, "%Y%m%d"), format(evaluation_as_of, "%Y%m%d")) ) %>% epidatr::fetch_tbl()

library(epidatr)
library(magrittr)
generates_misleading_error =
  epidatr::covidcast("hhs", "confirmed_admissions_influenza_1d",
                     "day", "nation",
                     epidatr::epirange("20220801", "20220809"),
                     as_of = as.Date("2022-09-09"),
                     "us") %>%
  fetch_tbl()
#> Error in `check_single_epirange_param()`:
#> ! argument as_of is not a epirange
works_but_could_be_shorter =
  epidatr::covidcast("hhs", "confirmed_admissions_influenza_1d",
                     "day", "nation",
                     epidatr::epirange("20220801", "20220809"),
                     as_of = format(as.Date("2022-09-09"), "%Y%m%d"),
                     "us") %>%
  fetch_tbl()
wrong =
  epidatr::covidcast("hhs", "confirmed_admissions_influenza_1d",
                     "day", "nation",
                     epidatr::epirange("20220801", "20220809"),
                     as_of = epidatr::epirange(
                       format(as.Date("2022-09-09"), "%Y%m%d"),
                       format(as.Date("2022-09-20"), "%Y%m%d")
                     ),
                     "us") %>%
  fetch_tbl()
all.equal(works_but_could_be_shorter, wrong) # TRUE is bad
#> [1] TRUE

^{Created on 2022-09-27 by the reprex package (v2.0.1)}

Ambiguous error when using fetch_tbl on a null response

A call that produces a null result, such as

y <- covidcast( 
      data_source = "bad-source", 
      signals = "bad-signal", 
      geo_type = "state",  
      time_type = "day", 
      time_value = epirange(20200601, 20221201),
      geo_values = "ca,fl")

produces a cryptic error when requesting fetch_tbl

r$> y %>% fetch_tbl
Error: '' does not exist in current working directory ('/home/dskel/Documents/Code/Delphi/epiprocess').

This is because fetch_tbl calls fetch_csv and attempts to readr::read_csv on the output without validation and, in this case, fetch_csv returns an empty string.

covid_hosp_facility `issues` arg need not be required

Add examples

I am having trouble getting more than 0 rows out of fluview; being able to type ?fluview and see/copy an example would be quite helpful.

Typo: `fip_code`; also, make sure it requires text rather than an integer

fip_code should be fips_code here.

Additionally, this should require a string rather than an integer / strings rather than integers. (FIPS codes should include appropriate zero padding, so they should be strings.)

Update styling commands to work similarly across systems

styler::style_pkg() appears to be inconsistent about formatting vignettes. I've tried on my system with CRAN and GitHub versions of styler, and they don't seem to format the vignettes, and CI complains about that. [I just needed to start a new session after upgrading styler.]

consider optimizing package organization for autocomplete/user assist

There are a ton of endpoints, and most users will only use a small number of them.

Current organization is flat, with endpoints, fetchers, and utility functions all mixed together:

* package
  * afhsb [endpoint]
  * cdc [endpoint]
  * covidcast_meta [endpoint]
  * delphi [endpoint]
  * dengue_nowcast [endpoint]
  * dengue_sensors [endpoint]
  * ecdc_ili [endpoint]
  * epirange [utility]
  * fetch_classic [fetcher]
  * fetch_csv [fetcher]
  * fetch_df [fetcher]
  * fetch_json [fetcher]
  * flusurv [endpoint]
  * fluview_meta [endpoint]
  * gft [endpoint]
  * ght [endpoint]
  * kcdc_ili [endpoint]
  * meta [endpoint]
  * meta_afhsb [endpoint]
  * meta_norostat [endpoint]
  * nidss_dengue [endpoint]
  * norostat [endpoint]
  * nowcast [endpoint]
  * paho_dengue [endpoint]
  * quidel [endpoint]
  * sensors [endpoint]
  * with_base_url [utility]

This organization is easiest to implement and automatically generates matching man pages, but it does flood autocomplete, making it difficult to refer to utility functions.

There are several alternative organization schemes we could consider, all with pros and cons:

Named lists
Prefixes
R6 objects

Named lists

* package
  * endpoint$
    * afhsb 
    * cdc 
    * covidcast_meta 
    * delphi 
    * dengue_nowcast 
    * dengue_sensors 
    * ecdc_ili 
    * flusurv 
    * fluview_meta 
    * gft 
    * ght 
    * kcdc_ili 
    * meta 
    * meta_afhsb 
    * meta_norostat 
    * nidss_dengue 
    * norostat 
    * nowcast 
    * paho_dengue 
    * quidel 
    * sensors 
  * epirange
  * fetch$
    * classic
    * csv
    * df
    * json
  * with_base_url

To make this work, we'd have to specify our own man files for the lists and their members.

Prefixes

We're already considering adding pvt_ for the restricted endpoints; we could easily switch to eg endpoint_covidcast. Autocomplete would still be flooded but it would be easier for users to tell what was what.

R6 objects

Not familiar wtih this; maybe @brookslogan can advise?

Consider removing `direction` column output by `covidcast` endpoint

We did this in the covidcast package at some point. We should consider it here too (direction is no longer being generated and may have been NA-d out entirely). However, it's not clear whether this should happen within this package or in the server code.

Output type specification for `covid_hosp_facility` may be wrong

This line seems problematic. Might need to be a "date" even though it describes a week.

Test query:

covid_hosp_facility('390119', epirange(20201201, 20201207)) %>% fetch_tbl()

Fix error in epirange list check

Fix this line to call is.list on the appropriate argument and change the non-working nested sapplys to something that makes sense (probably a single purrr::map_lgl). I encountered this when trying to feed a single Date object in.
Fix any downstream behavior, e.g., in preparing the query, which may not actually accommodate, e.g., a list of epiranges.
Add tests for this case.
Update docs.

Improve function argument defaults and improve metadata lookup functionality

From the tooling discussion on 11/2:

many of the signal function arguments don't have defaults; it would be nice to have some reasonable defaults here to make it user-friendly for new users
the regular metadata lookup returns too much information; it would be nice to have some filtering of this information
can we leverage the metadata lookup to get reasonable defaults?

There are related functions in the old covidcast R package, such as specific_meta that would be helpful to port.

cc @lcbrooks @keanmingtan @jacobbien

consider unifying error handling

It'd be nice to unify the approach to error handling of HTTP and Epidata errors. Maybe stop in the case of any non-success result&message, and storing away this error info in some mutable package member so that the user can debug and/or report (especially considering these things are stochastic). (Internally this might mean taking the other approach instead: encoding everything as list elements like last.result&last.message&last.httr.status, then having a single standardized stop_for_epidata_or_httr_issues function called by each of the fetch functions.)

Complete first testing&documentation pass for remaining endpoints

We are merging #23 as-is. There are a few infrequently-used endpoints that have some remaining TODOs from this first pass. Quoting from the list there:

dengue_nowcast (need valid locations, but API is lacking) EDIT: done in #102
dengue_sensors (need valid locations and names, but API is lacking) EDIT: done in #102
ecdc_ili EDIT: done in #102
kcdc_ili (need valid region, but API is lacking) EDIT: done in #102
paho_dengue (bug prevents testing) EDIT: done in #102

Re-enable testing of `covid_hosp_facility_lookup` example

This endpoint now responds very quickly, so related examples are good to include in tests.

Can't depend on `delphi.epidata` in other packages

For another R package to depend on delphi.epidata, it needs two things:

delphi.epidata in the DESCRIPTION (Imports/Suggests/etc)
A Remotes: field in the DESCRIPTION pointing to cmu-delphi/delphi-epidata-r@dev (the repo name and the branch).

However, because the repo name differs from the package name, the installation routine will fail. This is an issue in the {remotes} package that performs the installation: r-lib/remotes#676.

Locally, one can first install delphi.epidata manually to avoid this, but any CI on Github will fail.

prefix password protected endpoints with `pvt_`

Prefix with pvt_ for endpoints that always require a specific role / api key

`covid_hosp_facility` example yields empty string

I am having trouble constructing a call that doesn't give an empty string as a result. We should fix this example / the implementation, and also move examples to use fetch_tbl instead of fetch_csv so this situation will raise an error (plus, as mentioned independently, to be more relevant to users).

consider adding fetch_tbl and fetch_DT

t'd be nice to have a fetch_tbl function (for tibbles), and maybe fetch_DT (data.table). (Maybe not having these discourages users from spamming unnecessarily large fetches though since they have to sit through some of the content printing rather than just seeing a preview.)

https://tibble.tidyverse.org/articles/tibble.html

https://www.rdocumentation.org/packages/data.table/versions/1.14.0

Allow streaming downloading, if possible

Regarding downloading mechanism:

I haven't tried any streaming libraries. The most promising from a very short search is sparklyr, which also has functions that seem relevant (or duplicative?) to epiprocess and epipredict. But I'm not sure we would want to just use pre-packaged streaming, because we have I don't think we have guarantees about ordering of API response rows across all API endpoints + because of versioning.
It's still worth taking a look at sparklyr for epiprocess and epipredict. (And maybe arrow as well? Although I recall arrow causing some installation problems/waits.)
I think delphi.epidata currently uses csv as the transfer mechanism for most/all output formats.

Originally posted by @brookslogan in #13 (comment)

See other comments in that thread as well.

Explain in documentation how to select all geo values

To select all geo_values you can use "*". This is not clear in the documentation, eg for function "covidcast".

API call returns data frame instead of string

Currently calling the fetch_csv function returns a string representation of the CSV, it would make it easier for users if it was automatically parsed into a data frame or some other appropriate data structure.

Track roxygen-generated NAMESPACE and documentation in git

Installing the package via

devtools::install_github("cmu-delphi/delphi-epidata-r", ref="fix-rbuildignore-makefile-typo")

produces multiple instances of these warnings:

Warning messages:
1: replacing previous import ‘jsonlite::unbox’ by ‘rlang::unbox’ when loading ‘delphi.epidata’ 
2: replacing previous import ‘jsonlite::flatten’ by ‘rlang::flatten’ when loading ‘delphi.epidata’

and attempts to read the documentation, e.g.,

help(epirange)

come back with

No documentation for ‘epirange’ in specified packages and libraries:

while cloning and running devtools::document() then devtools::unload() then devtools::install() appears to resolve the above problems.

While it's not the git way to track auto-generated files, it seems to be the R way. Tracking the NAMESPACE and .Rd files should simplify and improve the installation experience.

Don't duplicate source name in `names(covidcast_epidata()$signals)`

library(epidatr)
head(names(covidcast_epidata()$signals))
#> [1] "chng.chng:smoothed_outpatient_cli"      
#> [2] "chng.chng:smoothed_adj_outpatient_cli"  
#> [3] "chng.chng:smoothed_outpatient_covid"    
#> [4] "chng.chng:smoothed_adj_outpatient_covid"
#> [5] "chng.chng:smoothed_outpatient_flu"      
#> [6] "chng.chng:smoothed_adj_outpatient_flu"

^{Created on 2023-01-20 by the reprex package (v2.0.1)}

This probably just requires adding an unname before c-ing the per-source signal lists together:

    all_signals <- do.call(c, lapply(sources, function(x) {
        l <- c(x$signals)
        names(l) <- paste(x$source, names(l), sep = ":")
        l
    }))

    all_signals <- do.call(c, unname(lapply(sources, function(x) {
        l <- c(x$signals)
        names(l) <- paste(x$source, names(l), sep = ":")
        l
    })))

updating docs, tests, etc.

Marking this P0, as fixing it is a breaking change for the entire covidcast_epidata()$signals interface.

Package evaluation & first pass documentation

The goal of this task is to better understand whether this package does what we need it to, and permit other Delphi members to understand the same without having to read the source.

Delphi blog posts that include plots also include the code used to produce those plots. Reproduce as many of the plots as you can from the following blog posts, using this package instead of covidcastR/covidcast.

Use your newly acquired knowledge to repair the Roxygen docs so that they run successfully, and flesh out the package documentation with the basics.

Set up a `gh-pages` branch, host documentation site built using `pkgdown`

E.g., to match epiprocess, epipredict, covidcast.

There are libraries (maybe usethis) to do some/all of this in maybe just a few commands. @dajmcdon might remember how the two epi* packages above were set up.

geo_values accepts a character vector

in the covidcast function, the geo_values are expected to be a string. It would be more convenient if it could accept a character vector. This could be done by using a wrapper function

gv<-c("ca","fl","ma")
x<-paste0(gv,collapse=",")
covidcast(...,geo_values=x)

Make endpoints arguments consistent

Some endpoints accept region first and epiweeks second, others reverse this. Should make this consistent.

Use consistent naming for `covidcast` `data_source`s/`source`s

Use the same base name for data-source-associated:

parameters to covidcast and related functions
elements of covidcast_epidata()
columns of covidcast results
etc.

Either source and sources or data_source and data_sources.

implement 404 retry

since we have those weird errors these days

wiki fails when calling fetch_classic

Same error as #19

wiki(articles = "avian_influenza", epiweeks = epirange(201501, 202001)) %>% fetch_classic

consider more performance CSV parsing

tibble::as_tibble(data.table::setDF(data.table::fread(csv.result))) is much faster than readr::read_csv(csv.result) . For one query I tried the savings were 10% of the fetch time. (Note that fread has different column type inference rules; e.g., giving some type of numeric geo_value results for both of the above, but we probably want character or factor output.)

add custom print method for fetch_csv result

to avoid long string printing

#' Prevent printing of an entire string if it is long
#'
#' Makes a \code{maybe_long_string} S3 object from a single string (length-1
#' character vector) that operates like a normal character vector except, when
#' printed, it will display the length of the string and only up to some number
#' of characters of the string, rather than the entire string.  This can prevent some
#' annoyances during interactive use, as well as some editors crashing or
#' slowing down due to long line lengths.
#'
#' @details
#' By default, up to 2000 characters will be displayed; this is configurable
#' with \code{options(maybe_long_string__char_limit = new.limit)}.
#'
#' Other S3 classes on \code{string} are retained; calling this function on an
#' object with class \code{"character"} will result in an object with class
#' \code{c("maybe_long_string","character")}.
#'
#' @param string a length-1 character vector
#' @return a `maybe_long_string` S3 object
#'
#' @examples
#'
#' options(maybe_long_string__char_limit = 5)
#'
#' s1 = maybe_long_string("abcde")
#' s1 # prints extra info plus the entire string
#'
#' s2 = maybe_long_string("abcdefgh")
#' s2 # prints extra info and part of the string
#'
#' @export
maybe_long_string = function(string) {
  if (!inherits(string, "character")) {
    stop ('`string` must be a character vector')
  }
  if (length(string) != 1L) {
    stop ('`string` must be a length-1 vector (any number of characters)')
  }
  class(string) <- c("maybe_long_string", class(string))
  return (string)
}

#' Print a \code{\link{maybe_long_string}} object
#'
#' @param x a \code{maybe_long_string} object
#' @param ... ignored
#'
#' @export
print.maybe_long_string = function(x, ...) {
  char.limit = getOption("maybe_long_string__char_limit", default = 2000L)
  cat('# A maybe_long_string object with', nchar(x), 'characters; showing up to', char.limit, 'characters below.  To print the entire string, use `print(as.character(x))`:\n')
  cat(substr(x, 1L, char.limit))
  if (nchar(x) > char.limit) {
    cat("[...]")
  }
  cat("\n")
  invisible(x)
}

[Metaissue] Priority tag existing issues

Something like:

P0 tag critical issues that need to be taken care before we officially release this package
P1 tag less critical issues
P2 tag feature enhancements

Make `fetch_tbl` and `fetch_df` compatible with `vroom` upgrade

Using either of these generates:

Error: The `file` argument of `vroom()` must use `I()` for literal data as of vroom 1.5.0.

#37 looks like it fixes this for fetch_tbl ~~but also needs applied to fetch_df~~ [fetch_df dispatches to fetch_tbl].

Don't retry download if aborted via interrupt, or improve messaging

A common workflow is to repeatedly add code to a file and source it in. When reading or downloading data at the top of such a file takes a long time, one might instead select the part of the file after the download and evaluate that part of the file. It's easy to accidentally hit a shortcut or command to evaluate the whole file, though, so one may often accidentally start a download and want to cancel out, e.g., with C-c. However, the (normally very helpful!) download-retrying mechanism will actually catch this interrupt as an error and retry the download, stopping the user from resuming until they've interrupted it enough times or twice in quick succession (once during a download + once inside the retry-delay window).

It will likely be friendlier overall to detect such interrupts in the retrying mechanism and not retry in those cases. (There is a chance that a user doesn't realize it's a download taking some amount of time, and will decide against continuing on with the interrupt, but these seem outweighed by a user who doesn't C-c fast enough and feels "trapped" and restarts the process. Perhaps the latter could be addressed with a modification to the retry message rather than changing whether to retry and might be even better.)

Implement a `fetch_raw` option, `only_supports_raw` instead of `only_supports_classic`

The endpoints marked with only_supports_classic = TRUE, such as meta, were designed in the original Epidata interface that received JSON and parsed it into a nested list structure. Changing only_supports_classic to only_supports_json or only_supports_classic_and_json prevents some strange print output on the nested data frames (and may allow an easier transition for those using the original Epidata, although there are probably not many/any active human users at this point).