You can download the latest stable GitHub
version using:
library(remotes)
install_github("bozenne/lmbreak")
The functionnalities of the package:
library(lmbreak)
lmbreak version 0.1.0
will be exemplified on the following dataset:
data(SDIpsilo, package = "lmbreak")
SDIpsilo <- SDIpsilo[SDIpsilo$type %in% c("noise","trailing") == FALSE,]
str(SDIpsilo)
'data.frame': 326 obs. of 4 variables: $ id : Factor w/ 15 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ... $ time : num 0 20 40 60 80 100 120 160 180 200 ... $ type : chr "signal" "signal" "signal" "signal" ... $ score: num 0 1 3 8 10 10 10 10 10 7 ...
where the experience of 15 individuals after drug intake is monitored over time. To start with consider the data of individual 13:
SDIpsilo13 <- SDIpsilo[SDIpsilo$id==13,]
The lmbreak
function can be used to model his experience by
breakpoint model:
- with 2 breakpoints and three slopes (“111” pattern)
- with 2 breakpoints: one slope, one plateau, one slope (“101” pattern)
- with 1 breakpoint: two slopes (“11” pattern)
e.XP111 <- lmbreak(score ~ 0 + bp(time, "111"), data = SDIpsilo13)
e.XP101 <- lmbreak(score ~ 0 + bp(time, "101"), data = SDIpsilo13)
e.XP11 <- lmbreak(score ~ 0 + bp(time, "11"), data = SDIpsilo13)
The call to lmbreak
is similar to the lm
function except that the
breakpoint variable (i.e. variable whose relationship with the
response variable is to be modeled using broken lines) should be
wrapper by bp()
and indicate the number of breakpoints and possible
constrains on the slopes (pattern). The function will then estimate
the the position of the breakpoint and slopes. The method plot
can
then be applied to the output of lmbreak
to visualize the model
fit:
plot(e.XP111, ylim = c(0,12)) ## left panel
plot(e.XP101, ylim = c(0,12)) ## middle panel
plot(e.XP11, ylim = c(0,12)) ## right panel
For the last call to plot
, the argument extrapolate
was used to
display the model fit beyond the observed timepoints.
The method model.tables
can be used to obtain a concise output of
the estimates in a data.frame
format:
model.tables(e.XP101)
time duration intercept slope 1 0.00000 87.87879 0.000000 0.110 2 87.87879 142.78788 9.666667 0.000 3 230.66667 69.33333 9.666667 -0.125 4 300.00000 NA 1.000000 NA
Other summary statistics of the breakpoint fit can be extracted using
the coef
method with the argument type
(see the documentation
help(coef.lmbreak)
). For instance the area under the fitted curve
(AUC) between time 0 and 300 can be computed running:
coef(e.XP101, type = "auc", interval = c(0,300))
[1] 2174.808
The predict
method can also be used to extract the fitted values (up
to a certain time resolution, here 1 time unit):
fit.XP101 <- predict(e.XP101, newdata = data.frame(time = seq(0,440,by=1)))
cbind(head(fit.XP101), "",tail(fit.XP101))
time estimate "" time estimate 1 0 0.00 435 NA 2 1 0.11 436 NA 3 2 0.22 437 NA 4 3 0.33 438 NA 5 4 0.44 439 NA 6 5 0.55 440 NA
Fitted values beyond the last observed non-NA outcome will
automatically be set to missing (i.e. NA
), unless the argument
extrapolate
is set to TRUE.
fitE.XP101 <- predict(e.XP101, newdata = data.frame(time = seq(0,440,by=1)), extrapolate = TRUE)
cbind(head(fitE.XP101), "",tail(fitE.XP101))
time estimate "" time estimate 1 0 0.00 435 -15.875 2 1 0.11 436 -16.000 3 2 0.22 437 -16.125 4 3 0.33 438 -16.250 5 4 0.44 439 -16.375 6 5 0.55 440 -16.500
When specifying a pattern that does not fit the data, the estimation procedure may fail to find reliable estimates and will output a warning message:
e.XP01 <- lmbreak(score ~ 0 + bp(time, "01"), data = SDIpsilo13)
Warning message: In lmbreak(score ~ 0 + bp(time, "01"), data = SDIpsilo13) : The solution found by the optimizer has invalid breakpoint positions.
It is possible to specify alternative patterns that will only be investigated if the previous one(s) had convergence issues:
e.XPrescue <- lmbreak(score ~ 0 + bp(time, c("01","11")), data = SDIpsilo13)
coef(e.XPrescue,c("pattern","breakpoint"))
pattern breakpoint 1 11 113.1476
The mlmbreak
function provides a convenient way to fit a (separate)
breakpoint model to each individuals. To do so one should specify the
cluster
argument to flag the variable in the dataset identifying the
individuals:
e.XPall <- mlmbreak(score ~ 0 + bp(time, "101"), cluster = "id", data = SDIpsilo,
trace = FALSE)
summary(e.XPall)
Call: mlmbreak(formula = score ~ 0 + bp(time, "101"), data = SDIpsilo, cluster = "id", trace = FALSE) Breakpoints: id pattern cv continuity R2 breakpoint maxVs 1 101 TRUE TRUE 0.9833193 84.50704, 162.05128 < 1e-07 2 101 TRUE TRUE 0.9921334 55.55556, 87.52688 < 1e-07 3 101 TRUE TRUE 0.9915031 65.14286, 166.48148 < 1e-07 4 101 TRUE TRUE 0.9811031 105.7692, 169.8089 < 1e-07 5 101 TRUE TRUE 0.9838541 49.12281, 173.91304 < 1e-07 6 101 TRUE TRUE 0.9933673 70, 150 < 1e-07 7 101 TRUE TRUE 0.9839889 47.61905, 87.91209 < 1e-07 8 101 TRUE TRUE 0.9855812 86.95652, 129.53271 < 1e-07 9 101 TRUE TRUE 0.9753291 49.12281, 115.93750 2.0289e-07 10 101 TRUE TRUE 0.9961527 65.11628, 195.23809 5.1759e-07 11 101 TRUE FALSE 0.9828458 32.51327, 100.00000 0.25541 12 101 TRUE TRUE 0.9654704 43.47826, 150.99237 < 1e-07 13 101 TRUE TRUE 0.9944311 87.87879, 230.66667 < 1e-07 14 101 TRUE TRUE 0.9777323 157.8947, 248.0208 4.7554e-07 15 101 TRUE TRUE 0.9911019 157.3034, 234.7368 < 1e-07
In this example an upslope, plateau, normalization (101 pattern) could
be fitted for all individuals but we could also have specified
alternative patterns with the syntax bp(time, c("101","11")
. The
pattern 11
would then have been used for any individual where the
optimizer convergence criteria were not met with pattern 101
. Once
more key summary statistics can be extracted using the model.tables
method:
model.tables(e.XPall, format = "array", cluster = 1:2)
, , 1 time duration intercept slope 1 0.00000 84.50704 0 0.11833333 2 84.50704 77.54424 10 0.00000000 3 162.05128 157.94872 10 -0.06964286 4 320.00000 NA -1 NA , , 2 time duration intercept slope 1 0.00000 55.55556 0.000000 0.13500000 2 55.55556 31.97133 7.500000 0.00000000 3 87.52688 172.47312 7.500000 -0.02583333 4 260.00000 NA 3.044444 NA
and a graphical display of the model fit can be obtained using the plot
method:
plot(e.XPall, ylim = c(0,10))
By default a different facet is used for each individual. A single
facet can be used by setting the argument scales
to "none"
:
plot(e.XPall, ylim = c(0,10), scales = "none")
The fitted values for each individual can be extract once again with the predict
method:
fit.XPall <- predict(e.XPall, newdata = data.frame(time = seq(0,440,by=1)), extrapolate = TRUE)
cbind(head(fit.XPall), "", tail(fit.XPall))
id time estimate "" id time estimate 1 1 0 0.0000000 15 435 -6.307143 2 1 1 0.1183333 15 436 -6.388571 3 1 2 0.2366667 15 437 -6.470000 4 1 3 0.3550000 15 438 -6.551429 5 1 4 0.4733333 15 439 -6.632857 6 1 5 0.5916667 15 440 -6.714286
Due to extrapolation some of the fitted values are estimate to be negative, which is not realistic in the application since the scale is non-negative. An add-hoc solution can be to set the negative values to 0:
fit.XPall$estimate <- pmax(fit.XPall$estimate,0)
Currently the package is limited to a single continous response variable and a single breakpoint variable without interaction with other covariates. No tools for uncertainty quantification or statistical inference is implemented. The segmented package is a more mature implementation of breakpoint models with possibilities for statistical inference.
Another limitation of the current approach is the lack of a model of the ‘average’ response. While is possible to compute the average and standard deviation of the fit over all individuals, e.g.:
library(LMMstar)
fit.XPmean <- summarize(estimate ~ time, data = fit.XPall)[,c("observed","time","mean","sd")]
cbind(head(fit.XPmean),"",tail(fit.XPmean))
observed time mean sd "" observed time mean sd 1 15 0 0.0000000 0.0000000 15 435 0 0 2 15 1 0.1438590 0.0578802 15 436 0 0 3 15 2 0.2877180 0.1157604 15 437 0 0 4 15 3 0.4315770 0.1736406 15 438 0 0 5 15 4 0.5754361 0.2315208 15 439 0 0 6 15 5 0.7192951 0.2894010 15 440 0 0
its graphical display:
## aggregate the observed scores
SDIpsilo.aggr <- summarize(score ~ score + time, data = SDIpsilo)[,c("observed","time","score")]
library(ggplot2)
gg.mean <- ggplot(mapping = aes(x = time))
gg.mean <- gg.mean + geom_point(data = SDIpsilo.aggr, aes(y = score, size = observed, color = "Observed"))
gg.mean <- gg.mean + geom_line(data = fit.XPall, aes(y = estimate, group = id, color = "Individual fit"))
gg.mean <- gg.mean + geom_line(data = fit.XPmean, aes(y = mean, color = "Average of the individual fit"), linewidth = 2)
gg.mean <- gg.mean + labs(size = "Number of individuals", colour = "")
gg.mean
Advarselsbesked: pakke 'ggplot2' blev bygget under R version 4.2.3
is not consistent with the individual models. Consider for instance the case where all individuals would have a plateau at 10. Because they may plateau at different timepoints, the average may always be below 10.