pammtools's People

Contributors

adibender, fabian-s, gavinsimpson, hadley, jemus42, pkopper, staffanbetner

pammtools's Issues

remove include_last

As currently implemented, the include_last argument doesn't make sense: it is only used when cut is unspecified, which leads to the last interval cut point being set to the last censored event time. The last interval will then by default be empty (without events) and could potentially disrupt estimation.

no visible binding notes

  • in dplyr code, use the .data pronoun instead of utils::globalVariables
  • in ggplot code, use aes_string where possible
  • in map and similar, use ~mean(., na.rm = TRUE) instead of funs(mean(., na.rm = TRUE))
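A minimal sketch of the .data pronoun fix, using hypothetical column names (id, time) in place of the real package columns:

```r
library(dplyr)

# Hypothetical data; `id` and `time` stand in for real package columns
df <- data.frame(id = c(1, 1, 2), time = c(1, 2, 1))

# With the .data pronoun the column references are explicit, so
# R CMD check no longer emits "no visible binding" NOTEs and the
# utils::globalVariables() declarations become unnecessary.
res <- df %>%
  group_by(.data$id) %>%
  summarize(max_time = max(.data$time))
```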

Visualize LagLead window/simulated effects

Would be nice to have a suite of functions that facilitate visualization of

  • lag-lead windows (for simulated data and ped/nested_fdf)
    - [ ] simulated effects (for simulated data)
  • estimated effects, see also #29 (for ped/nested_fdf + model)

For the simulated data, functions could directly access the stored true effects, if these are stored separately
in the data frame, or access the simulation formula stored in the attributes of the simulated data and apply it to newdata (possibly created by make_newdata).

Add geoms for plots of hazards and cumulative hazards

Models are usually estimated with baseline hazards evaluated at interval end points,
thus predictions and fits also return the hazard at interval end points.
For plotting purposes, however, hazard and cumulative hazard should start at (x=0, y=0).
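Until such geoms exist, one workaround is to prepend an origin row before plotting; a minimal sketch with made-up interval data (the column names tend and cumu_hazard are assumptions):

```r
library(ggplot2)

# Made-up predictions: cumulative hazard evaluated at interval end points
pred <- data.frame(tend = c(1, 2, 3), cumu_hazard = c(0.2, 0.5, 0.9))

# Prepend (x = 0, y = 0) so the plotted curve starts at the origin
pred0 <- rbind(data.frame(tend = 0, cumu_hazard = 0), pred)

p <- ggplot(pred0, aes(x = tend, y = cumu_hazard)) +
  geom_line()
```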

Model specification

@fabian-s I think it's time to think more generally about how we call the functions for data transformation.

  • Right now we use Surv(time, status)~ ... + ... | cumulative(...),
    although the Surv(time, status) part is only a mirage, as we don't really do anything Surv-specific with it except extracting the event time and status variables, while the usual functionality of Surv() is not supported, e.g.

  • Surv(time, status == 2) ~ ... (see also #31) or

  • Surv(time1, time2, status) ~ ... e.g. for left truncated data,

  • etc.

This will also be relevant when/if we extend the functionality to competing risks/multistate models, in which case we need to support calls like

as_ped(Surv(time, event1) | Surv(time, event2) ~ lin_pred_event1 | lin_pred_event2)

This could be done nicely using the Formula functionality, but we already use | on the RHS to differentiate between cumulative effects and "normal" effects ~ ... + ... | cumulative().
The latter may not be necessary, as we can simply extract cumulative via the specials function?
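A sketch of how the Formula package handles such multi-part formulas (this is not the current pammtools API, just an illustration of the parsing that would be available):

```r
library(Formula)

# Two parts on each side, as in the competing-risks call above
f <- Formula(Surv(time, event1) | Surv(time, event2) ~
               lin_pred_event1 | lin_pred_event2)

length(f)  # number of parts on LHS and RHS: c(2, 2)

# Extract the sub-formula for the first event type
f1 <- formula(f, lhs = 1, rhs = 1)
```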

get_terms function

  • currently the length.out argument cannot be set
  • switch from lapply to purrr::map
  • allow terms to be specified as bare names?
  • switch to tidyeval
  • add a ci logical argument that controls whether CIs should be returned

tutorial paper behind paywall

@adibender @hoarzpassey :
Michel (checkmate author) pointed out to me that the tutorial paper we link in the README is behind a paywall -- what do you think about also uploading the preprint to arXiv and linking to it there instead...?

Fix data preparation for cumulative effects

  • as_ped(.... cumulative(... works for simulated data, because sim_pexp returns the complete exposure history regardless of the simulated event time
  • for real data sets, the exposure history is only known partially
  • data transformation produces NA column names + entries
  • see also #60

consolidate transformation functions

Currently, vignettes, examples etc. use different functions to create PED data

list(pbc, pbcseq) %>% as_ped(Surv(time, status)~.|concurrent(bili, protime,  te_var = "day"), ...)
  • Adjust all examples

  • Adjust vignettes

  • in as_ped, we need better handling of cut if unspecified. Currently, the cut selected for TDC data is not the same as the one selected by split_data

- [ ] maybe have as_ped return nested dfs by default? -> better printing, etc.

  • rename func to cumulative

Consider reducing dependencies

e.g.,

  • prodlim is only needed by pec, which is currently not implemented in pammtools (also pec)
    - survival is only needed for survSplit (which could be done manually), to load data in examples (which could be replaced by internal data) and in vignettes, which won't be submitted to CRAN anyway (only the homepage via pkgdown); also needed for Surv(), thus will be kept
    - msm (only needed to simulate data from PEXP); however, I'd prefer to make a separate package for survival data simulation (maybe later)
  • modelr is not developed anymore and only needed for seq_range (seq_range copied to the package)
  • RColorBrewer, scam, coxme, knitr, rmarkdown are only needed for vignettes (that will not be submitted to CRAN) (vignettes will be submitted)

add geoms for hazard, cumulative hazard and survival

usually, PED data is evaluated at tend or intmid, thus when plotting using ggplot2 geom_line, for example, the line will start at tend of the first interval; however,

  • S(t) should always start at t=0, S(0) = 1 (geom_surv)
  • for non-piece-wise constant hazards, h(t) should start at t=0, h(0) = 0 (geom_hazard)
  • H(t) should always start at t=0, H(0) = 0 (just reuse geom_hazard?)
  • piece-wise constant h(t) should usually start at t=0, h(0) = h(t_1) (is a special function needed for that? Or is geom_step(..., direction="hv") sufficient? Or should it start with a vertical line from (t=0, h=0) to (t=0, h=h(t_1))?)
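For the piece-wise constant case, geom_step with an origin row carried back from the first interval may already come close; a sketch with made-up data (column names are assumptions):

```r
library(ggplot2)

# Made-up piece-wise constant hazard at interval end points
haz <- data.frame(tend = c(1, 2, 3), hazard = c(0.3, 0.2, 0.4))

# h(0) = h(t_1): carry the first hazard value back to t = 0
haz0 <- rbind(data.frame(tend = 0, hazard = haz$hazard[1]), haz)

p <- ggplot(haz0, aes(x = tend, y = hazard)) +
  geom_step(direction = "hv")
```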

warning in add_hazard too explicit

The warning in add_hazard, issued when the provided time variable doesn't match the times used for fitting, is too explicit. It clutters the output and cuts off the informative text at the end of the warning.

Put the values at the end of the warning and only print the first 10 values.

Selective suffix to time matrices

Currently

get_func(data, ~ func(t-te, x) + func(t-te, z))

creates matrices Latency.x, LL.x, etc., which potentially uses up a lot of memory.
If ll_fun is equal across all func terms, only one representation each of T, TE, LL and Latency should be created.

add_survprob

  • the ci argument cannot be changed b/c it is hard-coded
  • switch to tidyeval
  • rename cum* and surv* names by adding a separating _

Data transformation

There are mainly 3 different types of data transformation required to fit PA(M)Ms:

  • 1. "normal" time-to-event data (TTED) -> Piece-wise exponential Data (PED)
  • 2. TTED with time-dependent covariate (TDC) -> PED for concurrent effect of the TDC
  • 3. TTED with TDC -> PED with cumulative effect

Ideally, after data transformation the user should be able to fit the model directly.

The first two are basically solved and described in the Time-dependent covariates vignette.

For the third, we need a general function that should be able to create data for different model types/types of cumulative effects, e.g.

  • WCE f(t-te)z(te),
  • DLNM f(t-te, z(te)),
  • TV DLNM f(t, t-te, z(te)),
  • general ELRA f(t, te, z(te)),
  • but also f_male(t-te)z(te) + f_female(t-te)z(te) (in mgcv terms: ... + s(t-te, by = z*sex*LL) + ...)

For the latter, the sex covariate would also need to be transformed into a matrix.

Thus, a general trafo function would have to have a formula interface similar to the mgcv formula, e.g.

data %>% as_functional("x1", x_eff(t-te, x1, by = sex, ll_fun = function(t, te) {te <= t}))

which would create the necessary columns.

Document data trafo arguments directly in as_ped

Currently, as_ped is the main function for data transformation, but has only two arguments documented

  • data
  • formula

All other arguments are passed to split_data via ellipsis, and split_data is an external function.

Lags and leads for concurrent effects

Currently, the ll_fun argument to concurrent is ignored. It should be possible to specify lag and lead times for concurrent effects, similarly to cumulative.

Adjust functions for data sets with matrix/list columns

Currently

  • sample_info doesn't work on PED data with matrix columns
  • make_newdata also doesn't work and must be redefined to work usefully for data with list/matrix columns
  • etc.

See also #29, which would benefit from useful make_newdata implementation for such data.

Rename func components

in func, use the actual variable names for t and te.
Benefits:

  • allows checking whether the covariate is present in the data
  • avoids naming conflicts with base::t and mgcv::te
  • makes it possible to specify different covariate terms observed on different exposure scales/domains

Disadvantage:

  • What if no separate time variable is available/the data is already in long format? (Could be caught during preprocessing/nesting, i.e., by creating a new "pseudo" variable for exposure time.)

Update patient/daily data

  • add Heyland et al. "How you splice the cake" reference (maybe more?)
  • drop some unneeded columns (maybe only one survival time variable?)
  • rename columns(?) to make them shorter/more concise
  • Make integer columns integer (e.g. PatientDied)

currently it is not possible to omit suffix

Currently, one cannot specify func(..., suffix = ""), as this is the default and will be ignored, with
the te_var argument used as the suffix instead.

  • set suffix=NULL as default

Also, maybe check the number of func components in the formula, such that a suffix is only appended if it is specified explicitly or when there is more than one func component in the formula.

checkmate bug fix breaks unit test

I've recently discovered and fixed a bug w.r.t. missing checks in lists in mllg/checkmate#146. The bugfix now tests lists for elements being identical to NULL which comes closest to a "missing" in lists, but this unfortunately breaks a unit test in your package. Before the fix, the test was defunct and did not trigger any check at all.

Could you please check your assertions accordingly with the devel release of checkmate? I'll prepare a new release of checkmate and want to upload soon.

This is the relevant part of the check log:

  ── 1. Failure: Formula special 'func' works as expected (@test-specials.R#6)  ──
  Check on cumu1 isn't true.
  Contains missing values (element 4)
  
  ── 2. Failure: Formula special 'concurrent' works as expected (@test-specials.R#
  Check on ccr1 isn't true.
  Contains missing values (element 3)
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  OK: 279 SKIPPED: 0 FAILED: 2
  1. Failure: Formula special 'func' works as expected (@test-specials.R#6) 
  2. Failure: Formula special 'concurrent' works as expected (@test-specials.R#18) 

Sorry for the inconvenience.

Prepare CRAN release

  • Add vignette on cumulative effects (see #62)
  • Fix #64
  • Add more tests for cumulative effects related functions (see #61)
  • Fix tests and checks
  • devtools::release()

most .ped S3-functions should return a PED like object with intervals

For example, make_newdata.ped returns the specified newdata with only the first interval.
One would expect (and it's probably more useful) a data set with all originally specified intervals
as well as the newly specified data, such that predict functions (or the add_* family of functions) can be directly applied.

sample_info.ped could be an exception, as it should return info about the data sample, not the PED data, which has a different distribution when applying mean, for example.

dplyr messages not apparent to user

@fabian-s In many functions we use dplyr functions, which however are not visible or apparent to the user. Nevertheless, messages from left_join are returned from a call to make_newdata etc.

Do you know a way to suppress messages from other function calls within a function?
On the other hand, in some cases such messages could be helpful. How do we decide, and is it possible to suppress only individual calls?

A new, more general version of `make_newdata`, e.g.,

ped %>% make_newdata(unique(sex), seq_range(age, 20), karno = c(20, 50, 60))

would create a new data set with all values of the sex variable, an age grid of length 20, and karno values of 20, 50, and 60.

This can be implemented relatively easily for the simple case using purrr::cross etc., but is more complicated for data with cumulative effects.
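For the simple case, the cross product of the requested value sets could be built with base expand.grid; a sketch on hypothetical data (the real implementation would additionally need tidyeval to evaluate expressions like unique(sex) inside the data):

```r
# Hypothetical sample data
ped <- data.frame(
  sex   = rep(c("m", "f"), 5),
  age   = seq(40, 80, length.out = 10),
  karno = rep(c(50, 70), 5)
)

# Cross product of the requested value sets
newdata <- expand.grid(
  sex   = unique(ped$sex),
  age   = seq(min(ped$age), max(ped$age), length.out = 20),
  karno = c(20, 50, 60)
)
nrow(newdata)  # 2 * 20 * 3 = 120
```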
