pammtools's People

Contributors

adibender, fabian-s, gavinsimpson, hadley, jemus42, pkopper, staffanbetner

pammtools's Issues

remove include_last

As currently implemented, the include_last argument doesn't make sense: it is only used when cut is unspecified, which leads to the last interval cut point being set to the last censored event time. The last interval will then by default be empty (without events) and could potentially disrupt estimation.

no visible binding notes

  • in dplyr code, use the .data pronoun instead of utils::globalVariables
  • in ggplot code, use aes_string where possible
  • in map and similar, use ~mean(., na.rm = TRUE) instead of funs(mean(., na.rm = TRUE))
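A minimal sketch of the .data pronoun fix, using hypothetical column names (id, time) in place of the real package columns:

```r
library(dplyr)

# Hypothetical data; `id` and `time` stand in for real package columns
df <- data.frame(id = c(1, 1, 2), time = c(1, 2, 1))

# With the .data pronoun the column references are explicit, so
# R CMD check no longer emits "no visible binding" NOTEs and the
# utils::globalVariables() declarations become unnecessary.
res <- df %>%
  group_by(.data$id) %>%
  summarize(max_time = max(.data$time))
```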

Visualize LagLead window/simulated effects

Would be nice to have a suite of functions that facilitate visualization of

  • lag-lead windows (for simulated data and ped/nested_fdf)
    - [ ] simulated effects (for simulated data)
  • estimated effects, see also #29 (for ped/nested_fdf + model)

For the simulated data, functions could directly access the stored true effects, if these are stored separately
in the data frame, or access the simulation formula stored in the attributes of the simulated data and apply it to newdata (possibly created by make_newdata).

Add geoms for plots of hazards and cumulative hazards

Models are usually estimated with baseline hazards evaluated at interval end points,
thus predictions and fits also return the hazard at interval end points.
For plotting purposes, however, hazard and cumulative hazard should start at (x=0, y=0).
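Until such geoms exist, one workaround is to prepend an origin row before plotting; a minimal sketch with made-up interval data (the column names tend and cumu_hazard are assumptions):

```r
library(ggplot2)

# Made-up predictions: cumulative hazard evaluated at interval end points
pred <- data.frame(tend = c(1, 2, 3), cumu_hazard = c(0.2, 0.5, 0.9))

# Prepend (x = 0, y = 0) so the plotted curve starts at the origin
pred0 <- rbind(data.frame(tend = 0, cumu_hazard = 0), pred)

p <- ggplot(pred0, aes(x = tend, y = cumu_hazard)) +
  geom_line()
```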

Model specification

@fabian-s I think it's time to think more generally about how we call the functions for data transformation.

  • Right now we use Surv(time, status)~ ... + ... | cumulative(...),
    although the Surv(time, status) part is only a mirage, as we don't really do anything Surv-specific with it except extracting the event time and status variables, while the usual functionality of Surv() is not supported, e.g.

  • Surv(time, status == 2) ~ ... (see also #31) or

  • Surv(time1, time2, status) ~ ... e.g. for left truncated data,

  • etc.

This will also be relevant when/if we extend the functionality to competing risks/multistate models, in which case we need to support calls like

as_ped(Surv(time, event1) | Surv(time, event2) ~ lin_pred_event1 | lin_pred_event2)

This could be done nicely using the Formula functionality, but we already use | on the RHS to differentiate between cumulative effects and "normal" effects ~ ... + ... | cumulative().
The latter may not be necessary, as we can simply extract cumulative via the specials function?
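A sketch of how the Formula package handles such multi-part formulas (this is not the current pammtools API, just an illustration of the parsing that would be available):

```r
library(Formula)

# Two parts on each side, as in the competing-risks call above
f <- Formula(Surv(time, event1) | Surv(time, event2) ~
               lin_pred_event1 | lin_pred_event2)

length(f)  # number of parts on LHS and RHS: c(2, 2)

# Extract the sub-formula for the first event type
f1 <- formula(f, lhs = 1, rhs = 1)
```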

get_terms function

  • currently the length.out argument cannot be set
  • switch from lapply to purrr::map
  • allow terms to be specified as bare names?
  • switch to tidyeval
  • add a ci logical argument that controls whether CIs should be returned

tutorial paper behind paywall

@adibender @hoarzpassey :
Michel (checkmate author) pointed out to me that the tutorial paper we link in the README is behind a paywall -- what do you think about also uploading the preprint to arXiv and linking to it there instead...?

Fix data preparation for cumulative effects

  • as_ped(.... cumulative(... works for simulated data, because sim_pexp returns the complete exposure history regardless of the simulated event time
  • for real data sets, the exposure history is only known partially
  • data transformation produces NA column names + entries
  • see also #60

consolidate transformation functions

Currently, vignettes, examples etc. use different functions to create PED data

list(pbc, pbcseq) %>% as_ped(Surv(time, status)~.|concurrent(bili, protime,  te_var = "day"), ...)
  • Adjust all examples

  • Adjust vignettes

  • in as_ped, we need better handling of cut if unspecified. Currently, the cut selected for TDC data is not the same as the one selected by split_data

- [ ] maybe have as_ped return nested dfs by default? -> better printing, etc.

  • rename func to cumulative

Consider reducing dependencies

e.g.,

  • prodlim is only needed by pec, which is currently not implemented in pammtools (also pec)
    - survival is only needed for survSplit (which could be done manually), to load data in examples (which could be replaced by internal data) and in vignettes, which won't be submitted to CRAN anyway (only the homepage via pkgdown); also needed for Surv(), thus will be kept
    - msm (only needed to simulate data from PEXP); however, I'd prefer to make a separate package for survival data simulation (maybe later)
  • modelr is not developed anymore and only needed for seq_range (seq_range copied to the package)
  • RColorBrewer, scam, coxme, knitr, rmarkdown are only needed for vignettes (that will not be submitted to CRAN) (vignettes will be submitted)

add geoms for hazard, cumulative hazard and survival

usually, PED data is evaluated at tend or intmid, thus when plotting using ggplot2 geom_line, for example, the line will start at tend of the first interval; however,

  • S(t) should always start at t=0, S(0) = 1 (geom_surv)
  • for non-piece-wise constant hazards, h(t) should start at t=0, h(0) = 0 (geom_hazard)
  • H(t) should always start at t=0, H(0) = 0 (just reuse geom_hazard?)
  • piece-wise constant h(t) should usually start at t=0, h(0) = h(t_1) (is a special function needed for that? Or is geom_step(..., direction="hv") sufficient? Or should it start with a vertical line from (t=0, h=0) to (t=0, h=h(t_1))?)
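For the piece-wise constant case, geom_step with an origin row carried back from the first interval may already come close; a sketch with made-up data (column names are assumptions):

```r
library(ggplot2)

# Made-up piece-wise constant hazard at interval end points
haz <- data.frame(tend = c(1, 2, 3), hazard = c(0.3, 0.2, 0.4))

# h(0) = h(t_1): carry the first hazard value back to t = 0
haz0 <- rbind(data.frame(tend = 0, hazard = haz$hazard[1]), haz)

p <- ggplot(haz0, aes(x = tend, y = hazard)) +
  geom_step(direction = "hv")
```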

warning in add_hazard too explicit

The warning in add_hazard, issued when the provided time variable doesn't match the times used for fitting, is too explicit. It clutters the output and cuts off the informative text at the end of the warning.

Put the values at the end of the warning and only print the first 10 values.

Selective suffix to time matrices

Currently

get_func(data, ~ func(t-te, x) + func(t-te, z))

creates matrices Latency.x, LL.x, etc., which potentially uses up a lot of memory.
If ll_fun is equal across all func terms, only one representation each of T, TE, LL and Latency should be created.

add_survprob

  • the ci argument cannot be changed b/c it is hard-coded
  • switch to tidyeval
  • rename cum* and surv* names by adding a separating _

Data transformation

There are mainly 3 different types of data transformation required to fit PA(M)Ms:

  • 1. "normal" time-to-event data (TTED) -> Piece-wise exponential Data (PED)
  • 2. TTED with time-dependent covariate (TDC) -> PED for concurrent effect of the TDC
  • 3. TTED with TDC -> PED with cumulative effect

Ideally, after data transformation the user should be able to fit the model directly.

The first two are basically solved and described in the Time-dependent covariates vignette.

For the third, we need a general function that should be able to create data for different model types/types of cumulative effects, e.g.

  • WCE f(t-te)z(te),
  • DLNM f(t-te, z(te)),
  • TV DLNM f(t, t-te, z(te)),
  • general ELRA f(t, te, z(te)),
  • but also f_male(t-te)z(te) + f_female(t-te)z(te) (in mgcv terms: ... + s(t-te, by = z*sex*LL) + ...)

For the latter, the sex covariate would also need to be transformed into a matrix.

Thus, a general trafo function would have to have a formula interface similar to the mgcv formula, e.g.

data %>% as_functional("x1", x_eff(t-te, x1, by = sex, ll_fun = function(t, te) {te <= t}))

which would create the necessary columns.

Document data trafo arguments directly in as_ped

Currently, as_ped is the main function for data transformation, but has only two arguments documented

  • data
  • formula

All other arguments are passed to split_data via ellipsis, and split_data is an external function.

Lags and leads for concurrent effects

Currently, the ll_fun argument to concurrent is ignored. It should be possible to specify lag and lead times for concurrent effects, similarly to cumulative.

Adjust functions for data sets with matrix/list columns

Currently

  • sample_info doesn't work on PED data with matrix columns
  • make_newdata also doesn't work and must be redefined to work usefully for data with list/matrix columns
  • etc.

See also #29, which would benefit from useful make_newdata implementation for such data.

Rename func components

in func, use the actual variable names for t and te.
Benefits:

  • allows checking whether the covariate is present in the data
  • avoids naming conflicts with base::t and mgcv::te
  • makes it possible to specify different covariate terms observed on different exposure scales/domains

Disadvantage:

  • What if no separate time variable is available/the data is already in long format? (Could be caught during preprocessing/nesting, i.e., by creating a new "pseudo" variable for exposure time.)

Update patient/daily data

  • add Heyland et al. "How you splice the cake" reference (maybe more?)
  • drop some unneeded columns (maybe only one survival time variable?)
  • rename columns(?) to make them shorter/more concise
  • Make integer columns integer (e.g. PatientDied)

currently it is not possible to omit suffix

Currently, one cannot specify func(..., suffix = ""), as this is the default and will be ignored, with
the te_var argument used as the suffix instead.

  • set suffix=NULL as default

Also, maybe check the number of func components in the formula, such that a suffix is only appended if it is specified explicitly or when there is more than one func component in the formula.

checkmate bug fix breaks unit test

I've recently discovered and fixed a bug w.r.t. missing checks in lists in mllg/checkmate#146. The bugfix now tests lists for elements being identical to NULL which comes closest to a "missing" in lists, but this unfortunately breaks a unit test in your package. Before the fix, the test was defunct and did not trigger any check at all.

Could you please check your assertions accordingly with the devel release of checkmate? I'll prepare a new release of checkmate and want to upload soon.

This is the relevant part of the check log:

  ── 1. Failure: Formula special 'func' works as expected (@test-specials.R#6)  ──
  Check on cumu1 isn't true.
  Contains missing values (element 4)
  
  ── 2. Failure: Formula special 'concurrent' works as expected (@test-specials.R#
  Check on ccr1 isn't true.
  Contains missing values (element 3)
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  OK: 279 SKIPPED: 0 FAILED: 2
  1. Failure: Formula special 'func' works as expected (@test-specials.R#6) 
  2. Failure: Formula special 'concurrent' works as expected (@test-specials.R#18) 

Sorry for the inconvenience.

Prepare CRAN release

  • Add vignette on cumulative effects (see #62)
  • Fix #64
  • Add more tests for cumulative effects related functions (see #61)
  • Fix tests and checks
  • devtools::release()

most .ped S3-functions should return a PED like object with intervals

For example, make_newdata.ped returns the specified newdata with only the first interval.
One would expect (and it's probably more useful) a data set with all originally specified intervals
as well as the newly specified data, such that predict functions (or the add_* family of functions) can be directly applied.

sample_info.ped could be an exception, as it should return info about the data sample, not the PED data, which has a different distribution when applying mean, for example.

dplyr messages not apparent to user

@fabian-s In many functions we use dplyr functions, which however are not visible or apparent to the user. Nevertheless, messages from left_join are returned from a call to make_newdata etc.

Do you know a way to suppress messages from other function calls within a function?
On the other hand, in some cases such messages could be helpful. How do we decide, and is it possible to suppress only individual calls?

A new, more general version of `make_newdata`, e.g.,

ped %>% make_newdata(unique(sex), seq_range(age, 20), karno = c(20, 50, 60))

would create a new data set with all values of the sex variable, an age grid of length 20, and karno values of 20, 50, and 60.

This can be implemented relatively easily for the simple case using purrr::cross etc., but is more complicated for data with cumulative effects.
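For the simple case, the cross product of the requested value sets could be built with base expand.grid; a sketch on hypothetical data (the real implementation would additionally need tidyeval to evaluate expressions like unique(sex) inside the data):

```r
# Hypothetical sample data
ped <- data.frame(
  sex   = rep(c("m", "f"), 5),
  age   = seq(40, 80, length.out = 10),
  karno = rep(c(50, 70), 5)
)

# Cross product of the requested value sets
newdata <- expand.grid(
  sex   = unique(ped$sex),
  age   = seq(min(ped$age), max(ped$age), length.out = 20),
  karno = c(20, 50, 60)
)
nrow(newdata)  # 2 * 20 * 3 = 120
```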
