lmbreak: Linear Regression with Unknown Breakpoints

Installation

You can download the latest stable GitHub version using:

library(remotes)
install_github("bozenne/lmbreak")

Functionalities: single pattern, single dataset

The functionnalities of the package:

library(lmbreak)

lmbreak version 0.1.0

will be exemplified on the following dataset:

data(SDIpsilo, package = "lmbreak")
SDIpsilo <- SDIpsilo[SDIpsilo$type %in% c("noise","trailing") == FALSE,]
str(SDIpsilo)

'data.frame':	326 obs. of  4 variables:
 $ id   : Factor w/ 15 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ time : num  0 20 40 60 80 100 120 160 180 200 ...
 $ type : chr  "signal" "signal" "signal" "signal" ...
 $ score: num  0 1 3 8 10 10 10 10 10 7 ...

where the experience of 15 individuals after drug intake is monitored over time. To start with consider the data of individual 13:

SDIpsilo13 <- SDIpsilo[SDIpsilo$id==13,]

The lmbreak function can be used to model his experience by breakpoint model:

with 2 breakpoints and three slopes (“111” pattern)
with 2 breakpoints: one slope, one plateau, one slope (“101” pattern)
with 1 breakpoint: two slopes (“11” pattern)

e.XP111 <- lmbreak(score ~ 0 + bp(time, "111"), data = SDIpsilo13)
e.XP101 <- lmbreak(score ~ 0 + bp(time, "101"), data = SDIpsilo13)
e.XP11 <- lmbreak(score ~ 0 + bp(time, "11"), data = SDIpsilo13)

The call to lmbreak is similar to the lm function except that the breakpoint variable (i.e. variable whose relationship with the response variable is to be modeled using broken lines) should be wrapper by bp() and indicate the number of breakpoints and possible constrains on the slopes (pattern). The function will then estimate the the position of the breakpoint and slopes. The method plot can then be applied to the output of lmbreak to visualize the model fit:

plot(e.XP111, ylim = c(0,12)) ## left panel
plot(e.XP101, ylim = c(0,12)) ## middle panel
plot(e.XP11, ylim = c(0,12)) ## right panel

For the last call to plot, the argument extrapolate was used to display the model fit beyond the observed timepoints.

The method model.tables can be used to obtain a concise output of the estimates in a data.frame format:

model.tables(e.XP101)

       time  duration intercept  slope
1   0.00000  87.87879  0.000000  0.110
2  87.87879 142.78788  9.666667  0.000
3 230.66667  69.33333  9.666667 -0.125
4 300.00000        NA  1.000000     NA

Other summary statistics of the breakpoint fit can be extracted using the coef method with the argument type (see the documentation help(coef.lmbreak)). For instance the area under the fitted curve (AUC) between time 0 and 300 can be computed running:

coef(e.XP101, type = "auc", interval = c(0,300))

[1] 2174.808

The predict method can also be used to extract the fitted values (up to a certain time resolution, here 1 time unit):

fit.XP101 <- predict(e.XP101, newdata = data.frame(time = seq(0,440,by=1)))
cbind(head(fit.XP101), "",tail(fit.XP101))

  time estimate "" time estimate
1    0     0.00     435       NA
2    1     0.11     436       NA
3    2     0.22     437       NA
4    3     0.33     438       NA
5    4     0.44     439       NA
6    5     0.55     440       NA

Fitted values beyond the last observed non-NA outcome will automatically be set to missing (i.e. NA), unless the argument extrapolate is set to TRUE.

fitE.XP101 <- predict(e.XP101, newdata = data.frame(time = seq(0,440,by=1)), extrapolate = TRUE)
cbind(head(fitE.XP101), "",tail(fitE.XP101))

  time estimate "" time estimate
1    0     0.00     435  -15.875
2    1     0.11     436  -16.000
3    2     0.22     437  -16.125
4    3     0.33     438  -16.250
5    4     0.44     439  -16.375
6    5     0.55     440  -16.500

Functionalities: multiple patterns

When specifying a pattern that does not fit the data, the estimation procedure may fail to find reliable estimates and will output a warning message:

e.XP01 <- lmbreak(score ~ 0 + bp(time, "01"), data = SDIpsilo13)

Warning message:
In lmbreak(score ~ 0 + bp(time, "01"), data = SDIpsilo13) :
  The solution found by the optimizer has invalid breakpoint positions.

It is possible to specify alternative patterns that will only be investigated if the previous one(s) had convergence issues:

e.XPrescue <- lmbreak(score ~ 0 + bp(time, c("01","11")), data = SDIpsilo13)
coef(e.XPrescue,c("pattern","breakpoint"))

  pattern breakpoint
1      11   113.1476

Functionalities: mutiple datasets

The mlmbreak function provides a convenient way to fit a (separate) breakpoint model to each individuals. To do so one should specify the cluster argument to flag the variable in the dataset identifying the individuals:

e.XPall <- mlmbreak(score ~ 0 + bp(time, "101"), cluster = "id", data = SDIpsilo,
                    trace = FALSE)
summary(e.XPall)

Call:
mlmbreak(formula = score ~ 0 + bp(time, "101"), data = SDIpsilo, 
    cluster = "id", trace = FALSE)

Breakpoints:
 id pattern   cv continuity        R2          breakpoint      maxVs
  1     101 TRUE       TRUE 0.9833193 84.50704, 162.05128    < 1e-07
  2     101 TRUE       TRUE 0.9921334  55.55556, 87.52688    < 1e-07
  3     101 TRUE       TRUE 0.9915031 65.14286, 166.48148    < 1e-07
  4     101 TRUE       TRUE 0.9811031  105.7692, 169.8089    < 1e-07
  5     101 TRUE       TRUE 0.9838541 49.12281, 173.91304    < 1e-07
  6     101 TRUE       TRUE 0.9933673             70, 150    < 1e-07
  7     101 TRUE       TRUE 0.9839889  47.61905, 87.91209    < 1e-07
  8     101 TRUE       TRUE 0.9855812 86.95652, 129.53271    < 1e-07
  9     101 TRUE       TRUE 0.9753291 49.12281, 115.93750 2.0289e-07
 10     101 TRUE       TRUE 0.9961527 65.11628, 195.23809 5.1759e-07
 11     101 TRUE      FALSE 0.9828458 32.51327, 100.00000    0.25541
 12     101 TRUE       TRUE 0.9654704 43.47826, 150.99237    < 1e-07
 13     101 TRUE       TRUE 0.9944311 87.87879, 230.66667    < 1e-07
 14     101 TRUE       TRUE 0.9777323  157.8947, 248.0208 4.7554e-07
 15     101 TRUE       TRUE 0.9911019  157.3034, 234.7368    < 1e-07

In this example an upslope, plateau, normalization (101 pattern) could be fitted for all individuals but we could also have specified alternative patterns with the syntax bp(time, c("101","11"). The pattern 11 would then have been used for any individual where the optimizer convergence criteria were not met with pattern 101. Once more key summary statistics can be extracted using the model.tables method:

model.tables(e.XPall, format = "array", cluster = 1:2)

, , 1

       time  duration intercept       slope
1   0.00000  84.50704         0  0.11833333
2  84.50704  77.54424        10  0.00000000
3 162.05128 157.94872        10 -0.06964286
4 320.00000        NA        -1          NA

, , 2

       time  duration intercept       slope
1   0.00000  55.55556  0.000000  0.13500000
2  55.55556  31.97133  7.500000  0.00000000
3  87.52688 172.47312  7.500000 -0.02583333
4 260.00000        NA  3.044444          NA

and a graphical display of the model fit can be obtained using the plot method:

plot(e.XPall, ylim = c(0,10))

By default a different facet is used for each individual. A single facet can be used by setting the argument scales to "none":

plot(e.XPall, ylim = c(0,10), scales = "none")

The fitted values for each individual can be extract once again with the predict method:

fit.XPall <- predict(e.XPall, newdata = data.frame(time = seq(0,440,by=1)), extrapolate = TRUE)
cbind(head(fit.XPall), "", tail(fit.XPall))

  id time  estimate "" id time  estimate
1  1    0 0.0000000    15  435 -6.307143
2  1    1 0.1183333    15  436 -6.388571
3  1    2 0.2366667    15  437 -6.470000
4  1    3 0.3550000    15  438 -6.551429
5  1    4 0.4733333    15  439 -6.632857
6  1    5 0.5916667    15  440 -6.714286

Due to extrapolation some of the fitted values are estimate to be negative, which is not realistic in the application since the scale is non-negative. An add-hoc solution can be to set the negative values to 0:

fit.XPall$estimate <- pmax(fit.XPall$estimate,0)

Limitations & alternative

Currently the package is limited to a single continous response variable and a single breakpoint variable without interaction with other covariates. No tools for uncertainty quantification or statistical inference is implemented. The segmented package is a more mature implementation of breakpoint models with possibilities for statistical inference.

Another limitation of the current approach is the lack of a model of the ‘average’ response. While is possible to compute the average and standard deviation of the fit over all individuals, e.g.:

library(LMMstar)
fit.XPmean <- summarize(estimate ~ time, data = fit.XPall)[,c("observed","time","mean","sd")]
cbind(head(fit.XPmean),"",tail(fit.XPmean))

  observed time      mean        sd "" observed time mean sd
1       15    0 0.0000000 0.0000000          15  435    0  0
2       15    1 0.1438590 0.0578802          15  436    0  0
3       15    2 0.2877180 0.1157604          15  437    0  0
4       15    3 0.4315770 0.1736406          15  438    0  0
5       15    4 0.5754361 0.2315208          15  439    0  0
6       15    5 0.7192951 0.2894010          15  440    0  0

its graphical display:

## aggregate the observed scores
SDIpsilo.aggr <- summarize(score ~ score + time, data = SDIpsilo)[,c("observed","time","score")]

library(ggplot2)
gg.mean <- ggplot(mapping = aes(x = time))
gg.mean <- gg.mean + geom_point(data = SDIpsilo.aggr, aes(y = score, size = observed, color = "Observed"))
gg.mean <- gg.mean + geom_line(data = fit.XPall, aes(y = estimate, group = id, color = "Individual fit"))
gg.mean <- gg.mean + geom_line(data = fit.XPmean, aes(y = mean, color = "Average of the individual fit"), linewidth = 2)
gg.mean <- gg.mean + labs(size = "Number of individuals", colour = "")
gg.mean

Advarselsbesked:
pakke 'ggplot2' blev bygget under R version 4.2.3

is not consistent with the individual models. Consider for instance the case where all individuals would have a plateau at 10. Because they may plateau at different timepoints, the average may always be below 10.

bozenne / lmbreak Goto Github PK

lmbreak's Introduction

lmbreak: Linear Regression with Unknown Breakpoints

Installation

Functionalities: single pattern, single dataset

Functionalities: multiple patterns

Functionalities: mutiple datasets

Limitations & alternative

lmbreak's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent