aariq / bumbl Goto Github PK

View Code? Open in Web Editor NEW

3.0 2.0 4.0 1.65 MB

R package for modeling bumblebee colony growth.

License: Other

R 95.40% TeX 4.60%

bumblebee demography switchpoint glm

bumbl's Introduction

Hi there 👋, I'm Eric Scott (he)

I'm a Scientific Programmer & Educator with CCT Data Science at University of Arizona.

🔭 I’m currently working on an R package for calculating estimated volatility of chemical compounds, volcalc
🌱 I’m currently learning geospatial analysis in R with terra, sf, stars, etc.
👯 I’m looking for collaborators at University of Arizona CALES in need of data analysis and visualization
🍵 I'm also a huge tea geek!

Latest Blog Posts

Tea Geek Stuff

bumbl's People

Contributors

Stargazers

Watchers

Forkers

rkillick karawoo hadley scientiafelis

bumbl's Issues

Add error message for something gone wrong in glm()

Currently if a glm() fails, the only error the user sometimes sees is Error: unknown column 'tau' and the traceback is also not helpful. #30

Column name conflicts with covariates when agument = TRUE

When covariates are included in bumbl() with augment = TRUE there are two columns with the covariate name that mean different things. covariate.x is the original data and covariate.y is the coefficient for the covariate for the GLM for each colony.

prepend coefficient column names with "beta_"?

library(bumbl)
out <- bumbl(bombus, colonyID = colony, t = week, formula = d.mass ~ week + cum_floral, augment = TRUE)
colnames(out)
#>  [1] "site"             "colony"           "wild"             "habitat"         
#>  [5] "date"             "week"             "mass"             "d.mass"          
#>  [9] "floral_resources" "cum_floral.x"     "converged"        "tau"             
#> [13] "logN0"            "logLam"           "decay"            "logNmax"         
#> [17] "cum_floral.y"     ".fitted"          ".se.fit"          ".resid"

^{Created on 2020-12-02 by the reprex package (v0.3.0)}

Add Zenodo DOI to package citation

After getting a DOI, edit citation for package to include citing this DOI in addition to Crone and Williams.

Minimize dependencies

Try to minimize dependencies.

Fit switchpoint model with all colonies and covariates (rework bumbl() internals)

In the case of a dataset of bumblebee worker count over time with possible covariates, we want to estimate different values of tau (the switchpoint) for each colony, but a single value for each covariate (or variance parameter like in a negative binomial model).

The current behavior is to fit as many models as there are colonies to find tau. You could do this with a simple formula, count ~ week supplied to the current version of bumbl(), and then take the resulting tau's and plug them into a model with covariates using all the data. However, the value of tau might depend on covariates or variance parameters.

Elizabeth thinks the solution is to optimize values of tau for all colonies simultaneously (e.g. with optim()) to maximize likelihood of a model including covariates fit to all the data.

I think this involves fitting a model like count ~ time*colony + .post*colony + covariates where .post is 0 before tau and time-tau after tau. The goal would be to optimize the values of tau for each colony to maximize the likelihood of that GLM.

add change in mass column to bombus

Citation should include Eric as a co-author

In addition to Crone and Williams, users should cite the package itself with Eric and Elizabeth as co-authors. Archive with Zenodo to get DOI to associate with package.

use optim() for finding optimal value of switchpoint

optim() requires a function with the parameter to be optimized as the first argument that returns some kind of score like log-likelihood. Consider using optim() inside of brkpt() instead of iterating over meshpoints of tau to find value of tau that maximizes likelihood.

A new, lowest level function could be created that only takes a formula like size ~ time and a value for the switchpoint, tau, and fits the modified lm (with .post in the formula). Then, if brkpt() is called with no value of tau supplied, it uses optim() to optimize tau, then re-fits model with that new function. If brkpt() is called with a value of tau, then it skips the optimization step.

Add contributions.md

Anyone with bumblebee-related modeling functions (or data??) should be encouraged to contribute.

Document functions that go into IPM

Include equations for functions plugged into IPM. E.g. mass_func(), surv_func().

add cumulative floral resources to bombus data frame

Add a column for cumulative floral resources, i.e. the sum for each week from the start of the study to the measurement date.

Add column indicating convergence

In the output for bumbl(), add a column indicating whether a model for that particular colony converged.

Make plotting totally silent

Right now bumbl_plot outputs things to the console. E.g.


#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL

augment = TRUE results don't print correctly

I think my tests are missing this, but when augment = TRUE the print method is messed up for reasons I don't understand and can't reproduce. This is the error message I get:

bumbl(noerrs, colonyID = colony, t = week, formula = mass ~ week, augment = TRUE)
Error: `x` must be a vector, not a `grouped_df/tbl_df/tbl/data.frame/bumbldf` object.

Colony names are incorrect

Coercing to factor somewhere caused a problem. The colony IDs output by bumbl() are wrong and don't match those in the data.

Warning when number of time steps >> number of tau values

If a dataset has a longer time series, the default number of possible values for tau to test may not be enough. I don't know what the threshold ratio of timesteps to taus should be. This should probably be figured out analytically. E.g. reduce # of meshpoints relative to timesteps until estimated tau is significantly different from estimate with very high number of meshpoints?

Add plot output (argument or function)

There should be a way to plot the observed and fitted results from the model for all colonies.

Two ways to do this:

Make the output of bumbl() an S3 object and write a plot.bumbl() method.
Make the plot an optional side-effect of calling bumbl() with something like plot = FALSE

1 is probably best practice for a number of reasons that I can't fully articulate—one would be that you could actually get the plot object and modify it (especially if it's a ggplot2 plot). 2 is going to be way easier to implement in the short term because I currently don't know how to make/register an S3 object or a plot() method.

Regardless of how I implement it, Elizabeth suggests three options for plots:

PDF or other file output with a faceted type plot. I guess you'd need to specify an output path.
Print all the plots individually to the console/viewer pane all at once.
Don't produce a plot.

Inconsistency between vignette and code: should time column be edited to be tau when time > tau?

In the vignette, based on Crone and Williams, the model for after the switchpoint is $ln(W_t) = ln(W_0) + \tau ln(\lambda) + (t-\tau) ln(\delta)$ . In the code (including Elizabeth's original analysis code), only the t-tau column, called .post is created. The time column is left as-is. Shouldn't it be set to tau when time > tau?

In otherwords, in the example below, shouldn't the changepoint glm use mass ~ .time + .post instead of mass ~ week + .post?

library(bumbl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tau = 9
bombus %>% 
  filter(colony == 9) %>% 
  select(week, mass) %>% 
  mutate(.post = ifelse(week <= tau, 0, week - tau),
         .time = ifelse(week > tau, tau, week))
#> # A tibble: 16 x 4
#>     week  mass .post .time
#>    <int> <dbl> <dbl> <dbl>
#>  1     0 1910.     0     0
#>  2     1 1940      0     1
#>  3     2 1938      0     2
#>  4     3 1976.     0     3
#>  5     4 2010.     0     4
#>  6     5 2143      0     5
#>  7     6 2278      0     6
#>  8     7 2364      0     7
#>  9     8 2387      0     8
#> 10     9 2317      0     9
#> 11    10 2297      1     9
#> 12    11 2303      2     9
#> 13    12 2246      3     9
#> 14    13 2212      4     9
#> 15    14 2164      5     9
#> 16    15 2141      6     9

^{Created on 2021-02-16 by the reprex package (v0.3.0)}

bumbl() errors when multiple equivalent taus found

This was a purposeful error, but should be changed to a warning and bumbl should return NAs for just that colony.

should brkpt be exported?

I think if there were some default behavior for bumbl() when not given a colonyID, then I could safely make brkpt not exported.

Not always getting same results as Natalie from bipm()

This is documented in the tests on the ipm branch. I just want to have a space to keep notes here.

Add a timeline to README

Add some goals/to-dos/timeline (what is the word I'm trying to think of??) to README.

Finish documentation and vignette
Submit to CRAN
Add IPM functions (link to branch)
Submit to CRAN
Development done
Possible further development: extend to GLMMs, better optimization function for finding switchpoint, more bumblebee related functions (link to contributions.md)

when wkr_size_sd is small, NaN introduced in model matrix and lambda() fails

bipm(wkr_size_mean = 4.17, wkr_size_sd = 0.0002)

if you go into debug on this, you can see there is a NaN in row 1, col 51

autoplot method

add an autoplot() method for bumbldf to produce a faceted ggplot of colony growth. See also #32

Document how to replace default functions with anonymous function

Use @notes to show how to supply anonymous function like

bipm(wkr_mass_f = function(wkr_size) -0.1 + 0.4*wkr_size)

And test that this actually works.

Not flexible enough

From an email from Elizabeth:

I have found that, in using the bumbl package, it is too streamlined, in the sense that it is impossible to use in cases where I want any other outputs (for example, diagnostics such as log-likelihoods or R2 values for the fitted curves).

This might be solved by (optionally) keeping the model object as a column in the data frame output by bumbl() plus some examples in a vignette on how to extract additional info from that list-column.

bumbl() assumes there is a switchpoint, doesn't test for one

bumbl() might fail to find a switchpoint, but it also can find one when there clearly isn't one. For now I'll note this in the documentation, but in the future it would be good to test if the switchpoint model was a better fit than a line, perhaps.

library(bumbl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
noswitch <- 
  bombus %>% 
  filter(colony == 9) %>% 
  mutate(d.mass = 0.1)

bumbl(noswitch, colonyID = colony, t = week, formula = d.mass ~ week)
#> # A tibble: 1 x 7
#>   colony converged   tau logN0    logLam    decay logNmax
#>   <chr>  <lgl>     <dbl> <dbl>     <dbl>    <dbl>   <dbl>
#> 1 9      TRUE       4.29 -2.30 -1.63e-16 1.77e-16   -2.30

^{Created on 2021-04-08 by the reprex package (v0.3.0)}

Remove `.nb` versions of bumbl() and brkpt()

This is a lot of duplicated code. Instead have family = "nb" as a special case which calls glm.nb() instead of glm(). Any other value for family gets passed to glm()

reason for augment = TRUE to be on link scale?

Ask Elizabeth if the fitted results should be returned on the response scale or should require back-transformation.

If you change this, you'll need to change the call to broom::augment in colony-growth.R and the back transformation in plotting.R

bumbl() gives warning about coercing colony ID to character

"Column colony joining factor and character vector, coercing into character vector "

This happens in a join somewhere and can be fixed just by pre-coercing to character.

Implicit assumption of colony ID column name in bumbl()

library(bumbl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

#this works:
bumbl(bombus, t = week, mass ~ week)
#> # A tibble: 1 x 6
#>   colony   tau logN0 logLam   decay logNmax
#>   <chr>  <dbl> <dbl>  <dbl>   <dbl>   <dbl>
#> 1 NA      6.20  7.52 0.0170 -0.0215    7.63

#but this doesn't:
bumbl(rename(bombus, colony_test = colony), t = week, mass ~ week)
#> Error: Join columns must be present in data.
#> x Problem with `colony`.

^{Created on 2020-11-01 by the reprex package (v0.3.0)}

Package documentation

Check that package documentation is complete. I think I still need:

~~[ ] Double check docs for bipm()~~
Double check docs for bombus()
Double check docs for bumbl()
Package description
Elizabeth's ORCID (if she has one and wants it to be associated with the package)
Citation (should it be Crone and Williams? Maybe not if bumbl also includes Natalie's work)

Constrain line after tau to be a decline

For example, colony 38 in bombus has a switchpoint at week 4, but the slope actually increases after that point, and that's not what we're modeling.

Allow for count data

Allow measure of colony growth to be counts

Options would be normal, poisson, or overdispersed poisson using either lmer( count ~ week + (1|week), family = poisson)or glm.nb(count ~ week....

add requirement for tidyr ≥ 1.0.0

there are binaries for all platforms now on CRAN so it should be no problem for people to install it. Then, get rid of that unnest <- unnest_legacy hack

Handle dates correctly

Previously Dates could be used as the time variable in bumbl(), but they were silently coerced to numeric by ifelse(). For now, I only allow numeric variables for time. I should change this to rlang::if_else() and handle Dates properly.

add tests for covariates

from elizabeth: "I tried wild as a covariate and the program ignores the covariate altogether. I tried floral_resources as a covariate and it works fine (weird)"

Confidence intervals for tau

Use liklihood profile generated in brkpt to figure out what range of taus aren't significantly different from the chosen tau.

lower CI = maximum value of tau that has tau < MLE(tau) & abs(∆LL) > 1.9

upper CI = min of tau that has tau > MLE(tau) & abs(∆LL) > 1.9

where ∆LL = diff between LL for MLE(tau) [i.e. the best tau] and for any given tau

Should plots be marginal effects plots?

currently plots use data from broom::augment(glm) with no newdata. I could instead set all covariates to their mean/reference value and produce marginal effects plots.

Error: unknown column "tau"

Howdy! I'm excited to finally be able to start playing with this package. I'm trying to fit models in accordance with your readme, but having some issues. The bumbl function keeps throwing an error saying, "unknown column tau". Like you, I have my t set to equal the week of the experiment. The error is preceded by a statement for each of my colonies stating that it cannot find starting values. Below is the function call and the head() of my data

bumbl(colmass.df, colonyID = colony_id, t = week, formula = mass.dif ~ week, augment = TRUE)

Vignette

Add equation for GLM version
Mention count data
Why might you want to add a co-variate at the colony level?
~~Natalie's IPM function~~

Add column for convergence

In the output for bumbl(), add a column indicating whether a model for that particular colony converged.

allow user to supply vector of tau, one value per colony

I need to look more into the current behavior and think about desired behavior, but here's the general idea...

You may want to use a different set of covariates when estimating the switchpoint, tau, than when you estimate colony growth, given a switchpoint. The only way to really do that currently is to run bumbl() once, then extract the values of tau, then sort of manually fit the switchpoint model by calculating the .post covariate. If you could override the grid-search portion of bumbl by providing known taus, you could solve this issue by running bumbl() once to estimate tau for each colony, then run it again with bumbl(...tau = tau) and a different set of co-variates.

I think this would do what Elizabeth and Sylvie were hoping it would do.

bombus2 <- 
  bombus %>%
  group_by(colony) %>%
  mutate(d.mass = mass - min(mass) + 0.1) %>%
  select(site, colony, wild, habitat, date, week, mass, d.mass, everything()) %>%
  ungroup()
bombus2

out <- bumbl(bombus2, colonyID = colony, t = week, formula = log(d.mass) ~ week, augment = TRUE)
out2 <- out %>% filter(colony %in% c(104))
# attributes(out2)
bumbl_plot(out2)