Git Product home page Git Product logo

bumbl's Introduction

Hi there 👋, I'm Eric Scott (he)

Website Mastodon Follow

I'm a Scientific Programmer & Educator with CCT Data Science at University of Arizona.

  • 🔭 I’m currently working on an R package for calculating estimated volatility of chemical compounds, volcalc
  • 🌱 I’m currently learning geospatial analysis in R with terra, sf, stars, etc.
  • 👯 I’m looking for collaborators at University of Arizona CALES in need of data analysis and visualization
  • 🍵 I'm also a huge tea geek!

Latest Blog Posts


Tea Geek Stuff


Eric's GitHub stats

bumbl's People

Contributors

aariq avatar hadley avatar rkillick avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar

bumbl's Issues

Column name conflicts with covariates when agument = TRUE

When covariates are included in bumbl() with augment = TRUE there are two columns with the covariate name that mean different things. covariate.x is the original data and covariate.y is the coefficient for the covariate for the GLM for each colony.

prepend coefficient column names with "beta_"?

library(bumbl)
out <- bumbl(bombus, colonyID = colony, t = week, formula = d.mass ~ week + cum_floral, augment = TRUE)
colnames(out)
#>  [1] "site"             "colony"           "wild"             "habitat"         
#>  [5] "date"             "week"             "mass"             "d.mass"          
#>  [9] "floral_resources" "cum_floral.x"     "converged"        "tau"             
#> [13] "logN0"            "logLam"           "decay"            "logNmax"         
#> [17] "cum_floral.y"     ".fitted"          ".se.fit"          ".resid"

Created on 2020-12-02 by the reprex package (v0.3.0)

Fit switchpoint model with all colonies and covariates (rework bumbl() internals)

In the case of a dataset of bumblebee worker count over time with possible covariates, we want to estimate different values of tau (the switchpoint) for each colony, but a single value for each covariate (or variance parameter like in a negative binomial model).

The current behavior is to fit as many models as there are colonies to find tau. You could do this with a simple formula, count ~ week supplied to the current version of bumbl(), and then take the resulting tau's and plug them into a model with covariates using all the data. However, the value of tau might depend on covariates or variance parameters.

Elizabeth thinks the solution is to optimize values of tau for all colonies simultaneously (e.g. with optim()) to maximize likelihood of a model including covariates fit to all the data.

I think this involves fitting a model like count ~ time*colony + .post*colony + covariates where .post is 0 before tau and time-tau after tau. The goal would be to optimize the values of tau for each colony to maximize the likelihood of that GLM.

use optim() for finding optimal value of switchpoint

optim() requires a function with the parameter to be optimized as the first argument that returns some kind of score like log-likelihood. Consider using optim() inside of brkpt() instead of iterating over meshpoints of tau to find value of tau that maximizes likelihood.

A new, lowest level function could be created that only takes a formula like size ~ time and a value for the switchpoint, tau, and fits the modified lm (with .post in the formula). Then, if brkpt() is called with no value of tau supplied, it uses optim() to optimize tau, then re-fits model with that new function. If brkpt() is called with a value of tau, then it skips the optimization step.

related:

Add contributions.md

Anyone with bumblebee-related modeling functions (or data??) should be encouraged to contribute.

Make plotting totally silent

Right now bumbl_plot outputs things to the console. E.g.


#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL

augment = TRUE results don't print correctly

I think my tests are missing this, but when augment = TRUE the print method is messed up for reasons I don't understand and can't reproduce. This is the error message I get:

bumbl(noerrs, colonyID = colony, t = week, formula = mass ~ week, augment = TRUE)
Error: `x` must be a vector, not a `grouped_df/tbl_df/tbl/data.frame/bumbldf` object.

Colony names are incorrect

Coercing to factor somewhere caused a problem. The colony IDs output by bumbl() are wrong and don't match those in the data.

Warning when number of time steps >> number of tau values

Suggested here: ropenscilabs/statistical-software-review#2 (comment)

If a dataset has a longer time series, the default number of possible values for tau to test may not be enough. I don't know what the threshold ratio of timesteps to taus should be. This should probably be figured out analytically. E.g. reduce # of meshpoints relative to timesteps until estimated tau is significantly different from estimate with very high number of meshpoints?

Add plot output (argument or function)

There should be a way to plot the observed and fitted results from the model for all colonies.

Two ways to do this:

  1. Make the output of bumbl() an S3 object and write a plot.bumbl() method.
  2. Make the plot an optional side-effect of calling bumbl() with something like plot = FALSE

1 is probably best practice for a number of reasons that I can't fully articulate—one would be that you could actually get the plot object and modify it (especially if it's a ggplot2 plot). 2 is going to be way easier to implement in the short term because I currently don't know how to make/register an S3 object or a plot() method.

Regardless of how I implement it, Elizabeth suggests three options for plots:

  1. PDF or other file output with a faceted type plot. I guess you'd need to specify an output path.
  2. Print all the plots individually to the console/viewer pane all at once.
  3. Don't produce a plot.

Inconsistency between vignette and code: should time column be edited to be tau when time > tau?

In the vignette, based on Crone and Williams, the model for after the switchpoint is . In the code (including Elizabeth's original analysis code), only the t-tau column, called .post is created. The time column is left as-is. Shouldn't it be set to tau when time > tau?

In otherwords, in the example below, shouldn't the changepoint glm use mass ~ .time + .post instead of mass ~ week + .post?

library(bumbl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tau = 9
bombus %>% 
  filter(colony == 9) %>% 
  select(week, mass) %>% 
  mutate(.post = ifelse(week <= tau, 0, week - tau),
         .time = ifelse(week > tau, tau, week))
#> # A tibble: 16 x 4
#>     week  mass .post .time
#>    <int> <dbl> <dbl> <dbl>
#>  1     0 1910.     0     0
#>  2     1 1940      0     1
#>  3     2 1938      0     2
#>  4     3 1976.     0     3
#>  5     4 2010.     0     4
#>  6     5 2143      0     5
#>  7     6 2278      0     6
#>  8     7 2364      0     7
#>  9     8 2387      0     8
#> 10     9 2317      0     9
#> 11    10 2297      1     9
#> 12    11 2303      2     9
#> 13    12 2246      3     9
#> 14    13 2212      4     9
#> 15    14 2164      5     9
#> 16    15 2141      6     9

Created on 2021-02-16 by the reprex package (v0.3.0)

should brkpt be exported?

I think if there were some default behavior for bumbl() when not given a colonyID, then I could safely make brkpt not exported.

Add a timeline to README

Add some goals/to-dos/timeline (what is the word I'm trying to think of??) to README.

  • Finish documentation and vignette
  • Submit to CRAN
  • Add IPM functions (link to branch)
  • Submit to CRAN
  • Development done
  • Possible further development: extend to GLMMs, better optimization function for finding switchpoint, more bumblebee related functions (link to contributions.md)

autoplot method

add an autoplot() method for bumbldf to produce a faceted ggplot of colony growth. See also #32

Not flexible enough

From an email from Elizabeth:

I have found that, in using the bumbl package, it is too streamlined, in the sense that it is impossible to use in cases where I want any other outputs (for example, diagnostics such as log-likelihoods or R2 values for the fitted curves).

This might be solved by (optionally) keeping the model object as a column in the data frame output by bumbl() plus some examples in a vignette on how to extract additional info from that list-column.

bumbl() assumes there is a switchpoint, doesn't test for one

bumbl() might fail to find a switchpoint, but it also can find one when there clearly isn't one. For now I'll note this in the documentation, but in the future it would be good to test if the switchpoint model was a better fit than a line, perhaps.

library(bumbl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
noswitch <- 
  bombus %>% 
  filter(colony == 9) %>% 
  mutate(d.mass = 0.1)

bumbl(noswitch, colonyID = colony, t = week, formula = d.mass ~ week)
#> # A tibble: 1 x 7
#>   colony converged   tau logN0    logLam    decay logNmax
#>   <chr>  <lgl>     <dbl> <dbl>     <dbl>    <dbl>   <dbl>
#> 1 9      TRUE       4.29 -2.30 -1.63e-16 1.77e-16   -2.30

Created on 2021-04-08 by the reprex package (v0.3.0)

reason for augment = TRUE to be on link scale?

Ask Elizabeth if the fitted results should be returned on the response scale or should require back-transformation.

If you change this, you'll need to change the call to broom::augment in colony-growth.R and the back transformation in plotting.R

Implicit assumption of colony ID column name in bumbl()

library(bumbl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

#this works:
bumbl(bombus, t = week, mass ~ week)
#> # A tibble: 1 x 6
#>   colony   tau logN0 logLam   decay logNmax
#>   <chr>  <dbl> <dbl>  <dbl>   <dbl>   <dbl>
#> 1 NA      6.20  7.52 0.0170 -0.0215    7.63

#but this doesn't:
bumbl(rename(bombus, colony_test = colony), t = week, mass ~ week)
#> Error: Join columns must be present in data.
#> x Problem with `colony`.

Created on 2020-11-01 by the reprex package (v0.3.0)

Package documentation

Check that package documentation is complete. I think I still need:

  • [ ] Double check docs for bipm()
  • Double check docs for bombus()
  • Double check docs for bumbl()
  • Package description
  • Elizabeth's ORCID (if she has one and wants it to be associated with the package)
  • Citation (should it be Crone and Williams? Maybe not if bumbl also includes Natalie's work)

Allow for count data

Allow measure of colony growth to be counts

Options would be normal, poisson, or overdispersed poisson using either lmer( count ~ week + (1|week), family = poisson)or glm.nb(count ~ week....

add requirement for tidyr ≥ 1.0.0

there are binaries for all platforms now on CRAN so it should be no problem for people to install it. Then, get rid of that unnest <- unnest_legacy hack

Handle dates correctly

Previously Dates could be used as the time variable in bumbl(), but they were silently coerced to numeric by ifelse(). For now, I only allow numeric variables for time. I should change this to rlang::if_else() and handle Dates properly.

add tests for covariates

from elizabeth: "I tried wild as a covariate and the program ignores the covariate altogether. I tried floral_resources as a covariate and it works fine (weird)"

Confidence intervals for tau

Use liklihood profile generated in brkpt to figure out what range of taus aren't significantly different from the chosen tau.

lower CI = maximum value of tau that has tau < MLE(tau) & abs(∆LL) > 1.9

upper CI = min of tau that has tau > MLE(tau) & abs(∆LL) > 1.9

where ∆LL = diff between LL for MLE(tau) [i.e. the best tau] and for any given tau

Should plots be marginal effects plots?

currently plots use data from broom::augment(glm) with no newdata. I could instead set all covariates to their mean/reference value and produce marginal effects plots.

Error: unknown column "tau"

Howdy! I'm excited to finally be able to start playing with this package. I'm trying to fit models in accordance with your readme, but having some issues. The bumbl function keeps throwing an error saying, "unknown column tau". Like you, I have my t set to equal the week of the experiment. The error is preceded by a statement for each of my colonies stating that it cannot find starting values. Below is the function call and the head() of my data

bumbl(colmass.df, colonyID = colony_id, t = week, formula = mass.dif ~ week, augment = TRUE)

Screen Shot 2019-11-16 at 5 27 09 PM

Vignette

  • Add equation for GLM version
  • Mention count data
  • Why might you want to add a co-variate at the colony level?
  • Natalie's IPM function

Add column for convergence

In the output for bumbl(), add a column indicating whether a model for that particular colony converged.

allow user to supply vector of tau, one value per colony

I need to look more into the current behavior and think about desired behavior, but here's the general idea...

You may want to use a different set of covariates when estimating the switchpoint, tau, than when you estimate colony growth, given a switchpoint. The only way to really do that currently is to run bumbl() once, then extract the values of tau, then sort of manually fit the switchpoint model by calculating the .post covariate. If you could override the grid-search portion of bumbl by providing known taus, you could solve this issue by running bumbl() once to estimate tau for each colony, then run it again with bumbl(...tau = tau) and a different set of co-variates.

I think this would do what Elizabeth and Sylvie were hoping it would do.

plot method

Rather than using the bumbl_plot() function, write a proper plot() method. Should be able to do this for multiple plots (e.g. plot.lm)

Test bipm() further

I'd like to try to replicate more of Natalie's results from her manuscript. These could be written as formal tests, or just done informally.

bumbl_plot plotting incorrect y-value? bumbl(augment = TRUE) producing bad fitted values?

Something is wrong with the plotting or calculation of fitted values. Something to do with back-transforming fitted values I think. Perhaps instead of log(mass) ~ week, it should use glm(mass ~ week, family = gaussian(link = "log")) by default.

I should test out the difference in fitted values using mass and change in mass using both lm and glm and report to elizabeth.

bombus2 <- 
  bombus %>%
  group_by(colony) %>%
  mutate(d.mass = mass - min(mass) + 0.1) %>%
  select(site, colony, wild, habitat, date, week, mass, d.mass, everything()) %>%
  ungroup()
bombus2

out <- bumbl(bombus2, colonyID = colony, t = week, formula = log(d.mass) ~ week, augment = TRUE)
out2 <- out %>% filter(colony %in% c(104))
# attributes(out2)
bumbl_plot(out2)

rlang ≥ 0.40 required?

It shouldn't be, but when Jenny tried to install, it told her she had the wrong version of rlang

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.