
cutpointr


cutpointr is an R package for tidy calculation of “optimal” cutpoints. It supports a variety of methods for calculating cutpoints and includes a number of metrics that can be maximized or minimized by selecting a cutpoint. Some of these methods are designed to be more robust than the simple empirical optimization of a metric. Additionally, cutpointr can automatically bootstrap the variability of the optimal cutpoints and return out-of-bag estimates of various performance metrics.

Installation

You can install cutpointr from CRAN using the menu in RStudio or simply:

install.packages("cutpointr")
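
The development version can be installed from GitHub, assuming the remotes package is available and the repository is thie1e/cutpointr:

# install.packages("remotes")
remotes::install_github("thie1e/cutpointr")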

Example

For example, the optimal cutpoint for the included data set is 2 when maximizing the sum of sensitivity and specificity.

library(cutpointr)
data(suicide)
head(suicide)
#>   age gender dsi suicide
#> 1  29 female   1      no
#> 2  26   male   0      no
#> 3  26 female   0      no
#> 4  27 female   0      no
#> 5  28 female   0      no
#> 6  53   male   2      no
cp <- cutpointr(suicide, dsi, suicide, 
                method = maximize_metric, metric = sum_sens_spec)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
summary(cp)
#> Method: maximize_metric 
#> Predictor: dsi 
#> Outcome: suicide 
#> Direction: >= 
#> 
#>     AUC   n n_pos n_neg
#>  0.9238 532    36   496
#> 
#>  optimal_cutpoint sum_sens_spec    acc sensitivity specificity tp fn fp  tn
#>                 2        1.7518 0.8647      0.8889      0.8629 32  4 68 428
#> 
#> Predictor summary: 
#>     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
#>  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
#>       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
#>      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
plot(cp)

When considering the optimality of a cutpoint, we can only make a judgement based on the sample at hand. Thus, the estimated cutpoint may not be optimal within the population or on unseen data, which is why we sometimes put the “optimal” in quotation marks.

If direction and / or pos_class and neg_class are not specified, cutpointr makes assumptions about the direction of the dependency between class and x and about which class is positive. The same result as above can be achieved by manually defining direction and the positive / negative classes, which is slightly faster, since the classes and direction do not have to be determined:

opt_cut <- cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
                     neg_class = "no", method = maximize_metric, metric = youden)

opt_cut is a data frame that contains the input data and the ROC curve (and optionally the bootstrap results) as nested tibbles. Methods for summarizing and plotting the data and results are included (e.g. summary, plot, plot_roc, plot_metric).

To inspect the optimization, the function of metric values per cutpoint can be plotted using plot_metric, if an optimization function was used that returns a metric column in the roc_curve column. For example, the maximize_metric and minimize_metric functions do so:

plot_metric(opt_cut)

Predictions for new data can be made using predict:

predict(opt_cut, newdata = data.frame(dsi = 0:5))
#> [1] "no"  "no"  "yes" "yes" "yes" "yes"

Features

  • Calculation of optimal cutpoints in binary classification tasks
  • Tidy output, integrates well with functions from the tidyverse
  • Functions for plotting ROC curves, metric distributions and more
  • Bootstrapping for simulating the cutpoint variability and for obtaining out-of-bag estimates of various metrics (as a form of internal validation) with optional parallelisation
  • Multiple methods for calculating cutpoints
  • Multiple metrics can be chosen for maximization / minimization
  • Tidyeval

Calculating cutpoints

Method functions for cutpoint estimation

The included methods for calculating cutpoints are:

  • maximize_metric: Maximize the metric function
  • minimize_metric: Minimize the metric function
  • maximize_loess_metric: Maximize the metric function after LOESS smoothing
  • minimize_loess_metric: Minimize the metric function after LOESS smoothing
  • maximize_spline_metric: Maximize the metric function after spline smoothing
  • minimize_spline_metric: Minimize the metric function after spline smoothing
  • maximize_gam_metric: Maximize the metric function after smoothing via Generalized Additive Models
  • minimize_gam_metric: Minimize the metric function after smoothing via Generalized Additive Models
  • maximize_boot_metric: Bootstrap the optimal cutpoint when maximizing a metric
  • minimize_boot_metric: Bootstrap the optimal cutpoint when minimizing a metric
  • oc_manual: Specify the cutoff value manually
  • oc_mean: Use the sample mean as the “optimal” cutpoint
  • oc_median: Use the sample median as the “optimal” cutpoint
  • oc_youden_kernel: Maximize the Youden-Index after kernel smoothing the distributions of the two classes
  • oc_youden_normal: Maximize the Youden-Index parametrically assuming normally distributed data in both classes

Metric functions

The included metrics to be used with the minimization and maximization methods are:

  • accuracy: Fraction correctly classified
  • abs_d_sens_spec: The absolute difference of sensitivity and specificity
  • abs_d_ppv_npv: The absolute difference between positive predictive value (PPV) and negative predictive value (NPV)
  • roc01: Distance to the point (0,1) on ROC space
  • cohens_kappa: Cohen’s Kappa
  • sum_sens_spec: sensitivity + specificity
  • sum_ppv_npv: The sum of positive predictive value (PPV) and negative predictive value (NPV)
  • prod_sens_spec: sensitivity * specificity
  • prod_ppv_npv: The product of positive predictive value (PPV) and negative predictive value (NPV)
  • youden: Youden- or J-Index = sensitivity + specificity - 1
  • odds_ratio: (Diagnostic) odds ratio
  • risk_ratio: risk ratio (relative risk)
  • p_chisquared: The p-value of a chi-squared test on the confusion matrix
  • misclassification_cost: The sum of the misclassification cost of false positives and false negatives. Additional arguments: cost_fp, cost_fn
  • total_utility: The total utility of true / false positives / negatives. Additional arguments: utility_tp, utility_tn, cost_fp, cost_fn
  • F1_score: The F1-score (2 * TP) / (2 * TP + FP + FN)
  • metric_constrain: Maximize a selected metric given a minimal value of another selected metric
  • sens_constrain: Maximize sensitivity given a minimal value of specificity
  • spec_constrain: Maximize specificity given a minimal value of sensitivity
  • acc_constrain: Maximize accuracy given a minimal value of sensitivity

Furthermore, the following functions are included which can be used as metric functions but are more useful for plotting purposes, for example in plot_cutpointr, or for defining new metric functions: tp, fp, tn, fn, tpr, fpr, tnr, fnr, false_omission_rate, false_discovery_rate, ppv, npv, precision, recall, sensitivity, and specificity.

The inputs to the arguments method and metric are functions so that user-defined functions can easily be supplied instead of the built-in ones.

Separate subgroups and bootstrapping

Cutpoints can be estimated separately on subgroups that are defined by a third variable, gender in this case. Additionally, if boot_runs is larger than zero, cutpointr will carry out the usual cutpoint calculation on the full sample, just as before, and additionally on boot_runs bootstrap samples. This offers a way of gauging the out-of-sample performance of the cutpoint estimation method. If a subgroup is given, the bootstrapping is carried out separately for every subgroup, which is also reflected in the plots and output.

set.seed(12)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 1000)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...
opt_cut
#> # A tibble: 1 x 16
#>   direction optimal_cutpoint method          sum_sens_spec      acc sensitivity
#>   <chr>                <dbl> <chr>                   <dbl>    <dbl>       <dbl>
#> 1 >=                       2 maximize_metric       1.75179 0.864662    0.888889
#>   specificity      AUC pos_class neg_class prevalence outcome predictor
#>         <dbl>    <dbl> <fct>     <fct>          <dbl> <chr>   <chr>    
#> 1    0.862903 0.923779 yes       no         0.0676692 suicide dsi      
#>   data               roc_curve                 boot                 
#>   <list>             <list>                    <list>               
#> 1 <tibble [532 x 2]> <roc_cutpointr [13 x 10]> <tibble [1,000 x 23]>

The returned object has the additional column boot which is a nested tibble that includes the cutpoints per bootstrap sample along with the metric calculated using the function in metric and various default metrics. The metrics are suffixed by _b to indicate in-bag results or _oob to indicate out-of-bag results:

opt_cut$boot
#> [[1]]
#> # A tibble: 1,000 x 23
#>    optimal_cutpoint    AUC_b  AUC_oob sum_sens_spec_b sum_sens_spec_oob    acc_b
#>               <dbl>    <dbl>    <dbl>           <dbl>             <dbl>    <dbl>
#>  1                2 0.957271 0.883929         1.80385           1.71429 0.874060
#>  2                1 0.917932 0.934547         1.70356           1.69732 0.751880
#>  3                2 0.920303 0.946023         1.79462           1.73030 0.874060
#>  4                2 0.940457 0.962121         1.82116           1.75713 0.892857
#>  5                2 0.849167 0.96             1.66132           1.75556 0.847744
#>  6                4 0.925729 0.926536         1.80080           1.51486 0.924812
#>  7                2 0.927077 0.918705         1.74447           1.77536 0.885338
#>  8                2 0.958148 0.882258         1.82238           1.66559 0.862782
#>  9                4 0.910606 0.922593         1.80121           1.52730 0.913534
#> 10                1 0.871421 0.975241         1.62103           1.80226 0.736842
#> # ... with 990 more rows, and 17 more variables: acc_oob <dbl>,
#> #   sensitivity_b <dbl>, sensitivity_oob <dbl>, specificity_b <dbl>,
#> #   specificity_oob <dbl>, cohens_kappa_b <dbl>, cohens_kappa_oob <dbl>,
#> #   TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>, TP_oob <dbl>, FP_oob <dbl>,
#> #   TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>

The summary and plots include additional elements that summarize or display the bootstrap results:

summary(opt_cut)
#> Method: maximize_metric 
#> Predictor: dsi 
#> Outcome: suicide 
#> Direction: >= 
#> Nr. of bootstraps: 1000 
#> 
#>     AUC   n n_pos n_neg
#>  0.9238 532    36   496
#> 
#>  optimal_cutpoint sum_sens_spec    acc sensitivity specificity tp fn fp  tn
#>                 2        1.7518 0.8647      0.8889      0.8629 32  4 68 428
#> 
#> Predictor summary: 
#>     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
#>  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
#>       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
#>      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
#> 
#> Bootstrap summary: 
#>           Variable Min.   5% 1st Qu. Median Mean 3rd Qu.  95% Max.   SD NAs
#>   optimal_cutpoint 1.00 1.00    2.00   2.00 2.12    2.00 4.00 4.00 0.72   0
#>              AUC_b 0.83 0.88    0.91   0.93 0.92    0.94 0.96 0.98 0.02   0
#>            AUC_oob 0.82 0.86    0.90   0.92 0.92    0.95 0.97 1.00 0.03   0
#>    sum_sens_spec_b 1.57 1.67    1.72   1.76 1.76    1.80 1.84 1.89 0.05   0
#>  sum_sens_spec_oob 1.37 1.56    1.66   1.72 1.71    1.78 1.86 1.90 0.09   0
#>              acc_b 0.73 0.77    0.85   0.87 0.86    0.88 0.91 0.94 0.04   0
#>            acc_oob 0.72 0.77    0.85   0.86 0.86    0.88 0.90 0.93 0.04   0
#>      sensitivity_b 0.72 0.81    0.86   0.90 0.90    0.94 0.98 1.00 0.05   0
#>    sensitivity_oob 0.44 0.67    0.80   0.87 0.86    0.93 1.00 1.00 0.10   0
#>      specificity_b 0.72 0.76    0.85   0.86 0.86    0.88 0.91 0.94 0.04   0
#>    specificity_oob 0.69 0.76    0.84   0.86 0.86    0.88 0.91 0.94 0.04   0
#>     cohens_kappa_b 0.16 0.27    0.37   0.42 0.41    0.46 0.52 0.66 0.07   0
#>   cohens_kappa_oob 0.15 0.25    0.34   0.39 0.39    0.44 0.51 0.62 0.08   0
plot(opt_cut)

Parallelized bootstrapping

Using foreach and doRNG the bootstrapping can be parallelized easily. The doRNG package is used to make the bootstrap sampling reproducible.

if (suppressPackageStartupMessages(require(doParallel) & require(doRNG))) {
  cl <- makeCluster(2) # 2 cores
  registerDoParallel(cl)
  registerDoRNG(12) # Reproducible parallel loops using doRNG
  opt_cut <- cutpointr(suicide, dsi, suicide, gender, pos_class = "yes",
                 direction = ">=", boot_runs = 1000, allowParallel = TRUE)
  stopCluster(cl)
  opt_cut
}
#> Running bootstrap...
#> # A tibble: 2 x 18
#>   subgroup direction optimal_cutpoint method          sum_sens_spec      acc
#>   <chr>    <chr>                <dbl> <chr>                   <dbl>    <dbl>
#> 1 female   >=                       2 maximize_metric       1.80812 0.885204
#> 2 male     >=                       3 maximize_metric       1.62511 0.842857
#>   sensitivity specificity      AUC pos_class neg_class prevalence outcome
#>         <dbl>       <dbl>    <dbl> <chr>     <fct>          <dbl> <chr>  
#> 1    0.925926    0.882192 0.944647 yes       no         0.0688776 suicide
#> 2    0.777778    0.847328 0.861747 yes       no         0.0642857 suicide
#>   predictor grouping data               roc_curve                
#>   <chr>     <chr>    <list>             <list>                   
#> 1 dsi       gender   <tibble [392 x 2]> <roc_cutpointr [11 x 10]>
#> 2 dsi       gender   <tibble [140 x 2]> <roc_cutpointr [11 x 10]>
#>   boot                 
#>   <list>               
#> 1 <tibble [1,000 x 23]>
#> 2 <tibble [1,000 x 23]>

More robust cutpoint estimation methods

Bootstrapped cutpoints

It has been shown that bagging can substantially improve the performance of a wide range of model types in regression as well as in classification tasks. This method is available for cutpoint estimation via the maximize_boot_metric and minimize_boot_metric functions. If one of these functions is used as method, boot_cut bootstrap samples are drawn, the cutpoint optimization is carried out in each one, and a summary (e.g. the mean) of the resulting optimal cutpoints on the bootstrap samples is returned as the optimal cutpoint in cutpointr. Note that if bootstrap validation is run, i.e. if boot_runs is larger than zero, an outer bootstrap will be executed. In the bootstrap validation routine boot_runs bootstrap samples are generated and each one is again bootstrapped boot_cut times. This may lead to long run times, so activating the built-in parallelization may be advisable.

The advantages of bootstrapping the optimal cutpoint are that the procedure has no parameters that have to be tuned, unlike the LOESS smoothing; that it does not rely on distributional assumptions, unlike the Normal method; and that it is applicable to any metric that can be used with minimize_metric or maximize_metric, unlike the Kernel method. Furthermore, just as Random Forests cannot be overfit by increasing the number of trees, the bootstrapped cutpoints cannot be overfit by running an excessive number of boot_cut repetitions.

set.seed(100)
cutpointr(suicide, dsi, suicide, gender, 
          method = maximize_boot_metric,
          boot_cut = 200, summary_func = mean,
          metric = accuracy, silent = TRUE)
#> # A tibble: 2 x 18
#>   subgroup direction optimal_cutpoint method               accuracy      acc
#>   <chr>    <chr>                <dbl> <chr>                   <dbl>    <dbl>
#> 1 female   >=                 5.73246 maximize_boot_metric 0.956633 0.956633
#> 2 male     >=                 8.41026 maximize_boot_metric 0.95     0.95    
#>   sensitivity specificity      AUC pos_class neg_class prevalence outcome
#>         <dbl>       <dbl>    <dbl> <fct>     <fct>          <dbl> <chr>  
#> 1    0.444444    0.994521 0.944647 yes       no         0.0688776 suicide
#> 2    0.222222    1        0.861747 yes       no         0.0642857 suicide
#>   predictor grouping data               roc_curve                boot 
#>   <chr>     <chr>    <list>             <list>                   <lgl>
#> 1 dsi       gender   <tibble [392 x 2]> <roc_cutpointr [11 x 9]> NA   
#> 2 dsi       gender   <tibble [140 x 2]> <roc_cutpointr [11 x 9]> NA

LOESS smoothing for selecting a cutpoint

When using maximize_metric and minimize_metric, the optimal cutpoint is selected by searching the maximum or minimum of the metric function. For example, we may want to minimize the misclassification cost. Since false negatives (a suicide attempt that was not anticipated) can be regarded as much more severe than false positives, we can set the cost of a false negative, cost_fn, to for example ten times the cost of a false positive.

opt_cut <- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,
                     metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
plot_metric(opt_cut)

As this “optimal” cutpoint may depend on minor differences between the possible cutoffs, smoothing of the function of metric values by cutpoint value might be desirable, especially in small samples. The minimize_loess_metric and maximize_loess_metric functions can be used to smooth the function so that the optimal cutpoint is selected based on the smoothed metric values. Options to modify the smoothing, which is implemented using loess.as from the fANCOVA package, include:

  • criterion: the criterion for automatic smoothing parameter selection: “aicc” denotes bias-corrected AIC criterion, “gcv” denotes generalized cross-validation.
  • degree: the degree of the local polynomials to be used. It can be 0, 1 or 2.
  • family: if “gaussian” fitting is by least-squares, and if “symmetric” a re-descending M estimator is used with Tukey’s biweight function.
  • user.span: the user-defined parameter which controls the degree of smoothing.

Using parameters for the LOESS smoothing of criterion = "aicc", degree = 2, family = "symmetric", and user.span = 0.7 we get the following smoothed versions of the above metrics:

opt_cut <- cutpointr(suicide, dsi, suicide, gender, 
                     method = minimize_loess_metric,
                     criterion = "aicc", family = "symmetric", 
                     degree = 2, user.span = 0.7,
                     metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
plot_metric(opt_cut)

The optimal cutpoint for the female subgroup changes to 3. Note, though, that there are no reliable rules for selecting the “best” smoothing parameters. Notably, the LOESS smoothing is sensitive to the number of unique cutpoints. A large number of unique cutpoints generally leads to a more volatile curve of metric values by cutpoint value, even after smoothing. Thus, the curve tends to be undersmoothed in that scenario. The unsmoothed metric values are returned in opt_cut$roc_curve in the column m_unsmoothed.
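
For example, the unsmoothed metric values can be inspected alongside the cutpoints like this (a quick sketch, assuming dplyr and tidyr are loaded):

library(dplyr)
library(tidyr)
# Unnest the ROC curve per subgroup and look at the unsmoothed metric values
opt_cut %>% 
    select(subgroup, roc_curve) %>% 
    unnest(cols = roc_curve) %>% 
    select(subgroup, x.sorted, m_unsmoothed) %>% 
    head()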

Smoothing via Generalized Additive Models for selecting a cutpoint

In a similar fashion, the function of metric values per cutpoint can be smoothed using Generalized Additive Models with smooth terms. Internally, mgcv::gam carries out the smoothing which can be customized via the arguments formula and optimizer, see help("gam", package = "mgcv"). Most importantly, the GAM can be specified by altering the default formula, for example the smoothing function could be configured to apply cubic regression splines ("cr") as the smooth term. As the suicide data has only very few unique cutpoints, it is not very suitable for showcasing the GAM smoothing, so we will use two classes of the iris data here. In this case, the purely empirical method and the GAM smoothing lead to identical cutpoints, but in practice the GAM smoothing tends to be more robust, especially with larger data. An attractive feature of the GAM smoothing is that the default values tend to work quite well and usually require no tuning, eliminating researcher degrees of freedom.

library(ggplot2)
exdat <- iris
exdat <- exdat[exdat$Species != "setosa", ]
opt_cut <- cutpointr(exdat, Petal.Length, Species,
                     method = minimize_gam_metric,
                     formula = m ~ s(x.sorted, bs = "cr"),
                     metric = abs_d_sens_spec)
#> Assuming the positive class is virginica
#> Assuming the positive class has higher x values
plot_metric(opt_cut)

Spline smoothing for selecting a cutpoint

Again in the same fashion the function of metric values per cutpoint can be smoothed using smoothing splines. By default, the number of knots is automatically chosen using the cutpoint_knots function. That function uses stats::.nknots.smspl, which is the default in stats::smooth.spline to pick the number of knots.

Alternatively, the number of knots can be set manually and also the other smoothing parameters of stats::smooth.spline can be set as desired. For details see ?maximize_spline_metric.
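
The number of knots that would be chosen for a given predictor can also be calculated directly, e.g.:

cutpoint_knots(suicide, "dsi")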

opt_cut <- cutpointr(suicide, dsi, suicide, gender, 
                     method = minimize_spline_metric, spar = 0.4,
                     metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> nknots: 10
#> nknots: 10
plot_metric(opt_cut)

Parametric method assuming normality

The Normal method in oc_youden_normal is a parametric method for maximizing the Youden-Index or equivalently the sum of $Se$ and $Sp$. It relies on the assumption that the predictor for both the negative and positive observations is normally distributed. In that case it can be shown that

$$c^* = \frac{(\mu_P \sigma_N^2 - \mu_N \sigma_P^2) - \sigma_N \sigma_P \sqrt{(\mu_N - \mu_P)^2 + (\sigma_N^2 - \sigma_P^2) \log(\sigma_N^2 / \sigma_P^2)}}{\sigma_N^2 - \sigma_P^2}$$

where the negative class is distributed as $N(\mu_N, \sigma_N^2)$ and the positive class independently as $N(\mu_P, \sigma_P^2)$ provides the optimal cutpoint $c^*$ that maximizes the Youden-Index. If $\sigma_N$ and $\sigma_P$ are equal, the expression can be simplified to $c^* = \frac{\mu_N + \mu_P}{2}$. However, the oc_youden_normal method in cutpointr always assumes unequal standard deviations. Since this method does not select a cutpoint from the observed predictor values, it is questionable which values for $Se$ and $Sp$ should be reported. Here, the Youden-Index can be calculated as

$$J = \Phi\left(\frac{c^* - \mu_N}{\sigma_N}\right) - \Phi\left(\frac{c^* - \mu_P}{\sigma_P}\right)$$

if the assumption of normality holds. However, since there exist several methods that do not select cutpoints from the available observations and to unify the reporting of metrics for these methods, cutpointr reports all metrics, e.g. $Se$ and $Sp$, based on the empirical observations.

cutpointr(suicide, dsi, suicide, gender, method = oc_youden_normal)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> # A tibble: 2 x 18
#>   subgroup direction optimal_cutpoint method           sum_sens_spec      acc
#>   <chr>    <chr>                <dbl> <chr>                    <dbl>    <dbl>
#> 1 female   >=                 2.47775 oc_youden_normal       1.71618 0.895408
#> 2 male     >=                 3.17226 oc_youden_normal       1.54453 0.864286
#>   sensitivity specificity      AUC pos_class neg_class prevalence outcome
#>         <dbl>       <dbl>    <dbl> <fct>     <fct>          <dbl> <chr>  
#> 1    0.814815    0.901370 0.944647 yes       no         0.0688776 suicide
#> 2    0.666667    0.877863 0.861747 yes       no         0.0642857 suicide
#>   predictor grouping data               roc_curve                boot 
#>   <chr>     <chr>    <list>             <list>                   <lgl>
#> 1 dsi       gender   <tibble [392 x 2]> <roc_cutpointr [11 x 9]> NA   
#> 2 dsi       gender   <tibble [140 x 2]> <roc_cutpointr [11 x 9]> NA

Nonparametric kernel method

A nonparametric alternative is the Kernel method [@fluss_estimation_2005]. Here, the empirical distribution functions are smoothed using Gaussian kernels, $\hat{F}_N(t) = \frac{1}{n} \sum_{i=1}^n \Phi\left(\frac{t - y_i}{h_y}\right)$ and $\hat{G}_P(t) = \frac{1}{m} \sum_{i=1}^m \Phi\left(\frac{t - x_i}{h_x}\right)$ for the negative and positive classes, respectively. Following Silverman’s plug-in “rule of thumb” the bandwidths are selected as $h_y = 0.9 \min\{s_y, iqr_y/1.34\} \, n^{-0.2}$ and $h_x = 0.9 \min\{s_x, iqr_x/1.34\} \, m^{-0.2}$, where $s$ is the sample standard deviation and $iqr$ is the interquartile range. It has been demonstrated that AUC estimation is rather insensitive to the choice of the bandwidth procedure [@faraggi_estimation_2002] and thus the plug-in bandwidth estimator has also been recommended for cutpoint estimation. The oc_youden_kernel function in cutpointr uses a Gaussian kernel and the direct plug-in method for selecting the bandwidths. The kernel smoothing is done via the bkde function from the KernSmooth package [@wand_kernsmooth:_2013].
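
As an illustration of the rule-of-thumb formula above (a sketch only; oc_youden_kernel itself relies on KernSmooth::bkde with a direct plug-in bandwidth):

# Rule-of-thumb bandwidth for one class, following the formula above
bw_rot <- function(x) {
    0.9 * min(sd(x), IQR(x) / 1.34) * length(x)^(-0.2)
}
bw_rot(suicide$dsi[suicide$suicide == "no"])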

Again, there is a way to calculate the Youden-Index from the results of this method [@fluss_estimation_2005] which is

$$\hat{J} = \max_c \{\hat{F}_N(c) - \hat{G}_P(c)\}$$

but as before we prefer to report all metrics based on applying the cutpoint that was estimated using the Kernel method to the empirical observations.

cutpointr(suicide, dsi, suicide, gender, method = oc_youden_kernel)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> # A tibble: 2 x 18
#>   subgroup direction optimal_cutpoint method           sum_sens_spec      acc
#>   <chr>    <chr>                <dbl> <chr>                    <dbl>    <dbl>
#> 1 female   >=                 1.18128 oc_youden_kernel       1.80812 0.885204
#> 2 male     >=                 1.31636 oc_youden_kernel       1.58694 0.807143
#>   sensitivity specificity      AUC pos_class neg_class prevalence outcome
#>         <dbl>       <dbl>    <dbl> <fct>     <fct>          <dbl> <chr>  
#> 1    0.925926    0.882192 0.944647 yes       no         0.0688776 suicide
#> 2    0.777778    0.809160 0.861747 yes       no         0.0642857 suicide
#>   predictor grouping data               roc_curve                boot 
#>   <chr>     <chr>    <list>             <list>                   <lgl>
#> 1 dsi       gender   <tibble [392 x 2]> <roc_cutpointr [11 x 9]> NA   
#> 2 dsi       gender   <tibble [140 x 2]> <roc_cutpointr [11 x 9]> NA

Additional features

Calculating only the ROC curve

When running cutpointr, a ROC curve is by default returned in the column roc_curve. This ROC curve can be plotted using plot_roc. Alternatively, if only the ROC curve is desired and no cutpoint needs to be calculated, the ROC curve can be created using roc() and plotted using plot_cutpointr. The roc function, unlike cutpointr, does not determine direction, pos_class or neg_class automatically.

roc_curve <- roc(data = suicide, x = dsi, class = suicide,
    pos_class = "yes", neg_class = "no", direction = ">=")
auc(roc_curve)
#> [1] 0.9237791
head(roc_curve)
#> # A tibble: 6 x 9
#>   x.sorted    tp    fp    tn    fn       tpr      tnr        fpr      fnr
#>      <dbl> <dbl> <dbl> <int> <int>     <dbl>    <dbl>      <dbl>    <dbl>
#> 1      Inf     0     0   496    36 0         1        0          1       
#> 2       11     1     0   496    35 0.0277778 1        0          0.972222
#> 3       10     2     1   495    34 0.0555556 0.997984 0.00201613 0.944444
#> 4        9     3     1   495    33 0.0833333 0.997984 0.00201613 0.916667
#> 5        8     4     1   495    32 0.111111  0.997984 0.00201613 0.888889
#> 6        7     7     1   495    29 0.194444  0.997984 0.00201613 0.805556
plot_roc(roc_curve)

Midpoints

So far we have considered all unique values of the predictor as possible cutpoints, which is the default in cutpointr. An alternative could be to use a sequence of equidistant values instead, for example in the case of the suicide data all integers in [0, 10]. However, with very sparse data and small intervals between the candidate cutpoints (i.e. a ‘dense’ sequence like seq(0, 10, by = 0.01)) this leads to the uninformative evaluation of large ranges of cutpoints that all result in the same metric value. A more elegant alternative, not only for the case of sparse data, that is supported by cutpointr is the use of the mean of the optimal cutpoint and the next lowest (if direction = ">=") or the next highest (if direction = "<=") predictor value in the data. The result is an optimal cutpoint that is equal to the cutpoint that would be obtained using an infinitely dense sequence of candidate cutpoints, while being far more efficient computationally. This behavior can be activated by setting use_midpoints = TRUE (the default is FALSE). If we use this setting, we obtain an optimal cutpoint of 1.5 for the complete sample on the suicide data instead of 2 when maximizing the sum of sensitivity and specificity.

Assume the following small data set:

dat <- data.frame(outcome = c("neg", "neg", "neg", "pos", "pos", "pos", "pos"),
                  pred    = c(1, 2, 3, 8, 11, 11, 12))

Since the distance of the optimal cutpoint (8) to the next lowest observation (3) is rather large we arrive at a range of possible cutpoints that all maximize the metric. In the case of this kind of sparseness it might for example be desirable to classify a new observation with a predictor value of 4 as belonging to the negative class. If use_midpoints is set to TRUE, the mean of the optimal cutpoint and the next lowest observation is returned as the optimal cutpoint, if direction is >=. The mean of the optimal cutpoint and the next highest observation is returned as the optimal cutpoint, if direction = "<=".

opt_cut <- cutpointr(dat, x = pred, class = outcome, use_midpoints = TRUE)
#> Assuming the positive class is pos
#> Assuming the positive class has higher x values
plot_x(opt_cut)
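
Given the midpoint rule described above, the returned cutpoint is the mean of the optimal cutpoint and the next lowest observation:

# (8 + 3) / 2 = 5.5, following the midpoint rule described above
opt_cut$optimal_cutpoint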

A simulation demonstrates more clearly that setting use_midpoints = TRUE avoids biasing the cutpoints. To simulate the bias of the metric functions, the predictor values of both classes were drawn from normal distributions with constant standard deviations of 10, a constant mean of the negative class of 100 and higher mean values of the positive class that are selected in such a way that optimal Youden-Index values of 0.2, 0.4, 0.6, and 0.8 result in the population. Samples of 9 different sizes were drawn and the cutpoints that maximize the Youden-Index were estimated. The simulation was repeated 10000 times. As can be seen by the mean error, use_midpoints = TRUE eliminates the bias that is introduced by otherwise selecting the value of an observation as the optimal cutpoint. If direction = ">=", as in this case, the observation that represents the optimal cutpoint is the highest possible cutpoint that leads to the optimal metric value and thus the biases are positive. The methods oc_youden_normal and oc_youden_kernel are always unbiased, as they don’t select a cutpoint based on the ROC-curve or the function of metric values per cutpoint.
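
A compressed sketch of such a simulation for a single condition follows (the full study used several effect sizes, nine sample sizes, and 10000 repetitions; with equal standard deviations the population-optimal cutpoint is (100 + 110) / 2 = 105, per the formula in the Normal-method section above):

set.seed(1)
sim_one <- function(n) {
    dat <- data.frame(x = c(rnorm(n, mean = 100, sd = 10),
                            rnorm(n, mean = 110, sd = 10)),
                      class = rep(c("neg", "pos"), each = n))
    cp <- cutpointr(dat, x, class, pos_class = "pos", direction = ">=",
                    metric = youden, use_midpoints = TRUE, silent = TRUE)
    cp$optimal_cutpoint
}
# Mean error relative to the population-optimal cutpoint of 105
mean(replicate(100, sim_one(100))) - 105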

Finding all cutpoints with acceptable performance

By default, most packages only return the “best” cutpoint and disregard other cutpoints with quite similar performance, even if the performance differences are minuscule. cutpointr makes this process more explicit via the tol_metric argument. For example, if all cutpoints are of interest that achieve an accuracy within 0.05 of the optimally achievable accuracy, tol_metric can be set to 0.05 and those cutpoints will be returned as well.

In the case of the suicide data and when maximizing the sum of sensitivity and specificity, empirically the cutpoints 2 and 3 lead to quite similar performances. If tol_metric is set to 0.05, both will be returned.

opt_cut <- cutpointr(suicide, dsi, suicide, metric = sum_sens_spec, 
                     tol_metric = 0.05, break_ties = c)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Multiple optimal cutpoints found, applying break_ties.
library(tidyr)
opt_cut %>% 
    select(optimal_cutpoint, sum_sens_spec) %>% 
    unnest(cols = c(optimal_cutpoint, sum_sens_spec))
#> # A tibble: 2 x 2
#>   optimal_cutpoint sum_sens_spec
#>              <dbl>         <dbl>
#> 1                2       1.75179
#> 2                1       1.70251

Manual and mean / median cutpoints

Using the oc_manual function the optimal cutpoint will not be determined based on, for example, a metric but is instead set manually using the cutpoint argument. This is useful for supplying and evaluating cutpoints that were found in the literature or in other external sources.

The oc_manual function could also be used to set the cutpoint to the sample mean using cutpoint = mean(data$x). However, this may introduce a bias into the bootstrap validation procedure, since the actual mean of the population is not known and thus the mean to be used as the cutpoint should be automatically determined in every resample. To do so, the oc_mean and oc_median functions can be used.

set.seed(100)
opt_cut_manual <- cutpointr(suicide, dsi, suicide, method = oc_manual, 
                       cutpoint = mean(suicide$dsi), boot_runs = 30)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...
set.seed(100)
opt_cut_mean <- cutpointr(suicide, dsi, suicide, method = oc_mean, boot_runs = 30)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...

Nonstandard evaluation via tidyeval

The arguments to cutpointr do not need to be enclosed in quotes. This is possible thanks to nonstandard evaluation of the arguments, which are evaluated within data.

Functions that use nonstandard evaluation are often not suitable for programming with. The use of nonstandard evaluation may lead to scoping problems and subsequent obvious as well as possibly subtle errors. cutpointr uses tidyeval internally and accordingly the same rules as for programming with dplyr apply. Arguments can be unquoted with !!:

myvar <- "dsi"
cutpointr(suicide, !!myvar, suicide)

ROC curve and optimal cutpoint for multiple variables

Alternatively, the cutpoint estimation can be applied to several predictor columns at once. If direction and / or pos_class and neg_class are unspecified, these parameters will automatically be determined by cutpointr so that the AUC values for all variables will be > 0.5.

We could do this manually, e.g. using purrr::map, but to make this task more convenient multi_cutpointr can be used to achieve the same result. It maps multiple predictor columns to cutpointr, by default all numeric columns except for the class column.

mcp <- multi_cutpointr(suicide, class = suicide, pos_class = "yes", 
                use_midpoints = TRUE, silent = TRUE) 
summary(mcp)
#> Method: maximize_metric 
#> Predictor: age, dsi 
#> Outcome: suicide 
#> 
#> Predictor: age 
#> -------------------------------------------------------------------------------- 
#>  direction    AUC   n n_pos n_neg
#>         <= 0.5257 532    36   496
#> 
#>  optimal_cutpoint sum_sens_spec    acc sensitivity specificity tp fn  fp tn
#>              55.5        1.1154 0.1992      0.9722      0.1431 35  1 425 71
#> 
#> Predictor summary: 
#>     Data Min. 5% 1st Qu. Median    Mean 3rd Qu.   95% Max.      SD NAs
#>  Overall   18 19      24   28.0 34.1259   41.25 65.00   83 15.0542   0
#>       no   18 19      24   28.0 34.2218   41.25 65.50   83 15.1857   0
#>      yes   18 18      22   27.5 32.8056   41.25 54.25   69 13.2273   0
#> 
#> Predictor: dsi 
#> -------------------------------------------------------------------------------- 
#>  direction    AUC   n n_pos n_neg
#>         >= 0.9238 532    36   496
#> 
#>  optimal_cutpoint sum_sens_spec    acc sensitivity specificity tp fn fp  tn
#>               1.5        1.7518 0.8647      0.8889      0.8629 32  4 68 428
#> 
#> Predictor summary: 
#>     Data Min.   5% 1st Qu. Median   Mean 3rd Qu.  95% Max.     SD NAs
#>  Overall    0 0.00       0      0 0.9211       1 5.00   11 1.8527   0
#>       no    0 0.00       0      0 0.6331       0 4.00   10 1.4122   0
#>      yes    0 0.75       4      5 4.8889       6 9.25   11 2.5498   0

Accessing data, roc_curve, and boot

The object returned by cutpointr is of the classes cutpointr, tbl_df, tbl, and data.frame. Thus, it can be handled like a usual data frame. The columns data, roc_curve, and boot consist of nested data frames, which means that these are list columns whose elements are data frames. They can either be accessed using [ or by using functions from the tidyverse. If subgroups were given, the output contains one row per subgroup and the function that accesses the data should be mapped to every row or the data should be grouped by subgroup.

# Extracting the bootstrap results
set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, gender, boot_runs = 1000)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...

# Using base R to summarise the result of the bootstrap
summary(opt_cut$boot[[1]]$optimal_cutpoint)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   2.000   2.000   2.172   2.000   5.000
summary(opt_cut$boot[[2]]$optimal_cutpoint)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   1.000   3.000   2.921   4.000  11.000

# Using dplyr and tidyr
library(tidyr)
opt_cut %>% 
  group_by(subgroup) %>% 
  select(boot) %>% 
  unnest(boot) %>% 
  summarise(sd_oc_boot = sd(optimal_cutpoint),
            m_oc_boot  = mean(optimal_cutpoint),
            m_acc_oob  = mean(acc_oob))
#> Adding missing grouping variables: `subgroup`
#> # A tibble: 2 x 4
#>   subgroup sd_oc_boot m_oc_boot m_acc_oob
#>   <chr>         <dbl>     <dbl>     <dbl>
#> 1 female     0.765835    2.172   0.879809
#> 2 male       1.51434     2.9205  0.806057

Adding metrics to the result of cutpointr() or roc()

By default, the output of cutpointr includes the optimized metric and several other metrics. The add_metric function adds further metrics. Here, we’re adding the negative predictive value (NPV) and the positive predictive value (PPV) at the optimal cutpoint per subgroup:

cutpointr(suicide, dsi, suicide, gender, metric = youden, silent = TRUE) %>% 
    add_metric(list(ppv, npv)) %>% 
    select(subgroup, optimal_cutpoint, youden, ppv, npv)
#> # A tibble: 2 x 5
#>   subgroup optimal_cutpoint   youden      ppv      npv
#>   <chr>               <dbl>    <dbl>    <dbl>    <dbl>
#> 1 female                  2 0.808118 0.367647 0.993827
#> 2 male                    3 0.625106 0.259259 0.982301

In the same fashion, additional metric columns can be added to a roc_cutpointr object:

roc(data = suicide, x = dsi, class = suicide, pos_class = "yes",
    neg_class = "no", direction = ">=") %>% 
  add_metric(list(cohens_kappa, F1_score)) %>% 
  select(x.sorted, tp, fp, tn, fn, cohens_kappa, F1_score) %>% 
  head()
#> # A tibble: 6 x 7
#>   x.sorted    tp    fp    tn    fn cohens_kappa  F1_score
#>      <dbl> <dbl> <dbl> <int> <int>        <dbl>     <dbl>
#> 1      Inf     0     0   496    36    0         0        
#> 2       11     1     0   496    35    0.0505813 0.0540541
#> 3       10     2     1   495    34    0.0931229 0.102564 
#> 4        9     3     1   495    33    0.138338  0.15     
#> 5        8     4     1   495    32    0.181615  0.195122 
#> 6        7     7     1   495    29    0.300981  0.318182

User-defined functions

method

User-defined functions can be supplied to method, which is the function that is responsible for returning the optimal cutpoint. To define a new method function, create a function that may take as input(s):

  • data: A data.frame or tbl_df
  • x: (character) The name of the predictor variable
  • class: (character) The name of the class variable
  • metric_func: A function for calculating a metric, e.g. accuracy. Note that the method function does not necessarily have to accept this argument
  • pos_class: The positive class
  • neg_class: The negative class
  • direction: ">=" if the positive class has higher x values, "<=" otherwise
  • tol_metric: (numeric) In the built-in methods, all cutpoints will be returned that lead to a metric value in the interval [m_max - tol_metric, m_max + tol_metric] where m_max is the maximum achievable metric value. This can be used to return multiple decent cutpoints and to avoid floating-point problems.
  • use_midpoints: (logical) In the built-in methods, if TRUE (default FALSE) the returned optimal cutpoint will be the mean of the optimal cutpoint and the next lowest observation (for direction = ">=") or the next highest observation (for direction = "<="), which avoids biasing the optimal cutpoint.
  • ...: Further arguments that are passed to metric or that can be captured inside of method

The function should return a data frame or tibble with one row, the column optimal_cutpoint, and an optional column with an arbitrary name with the metric value at the optimal cutpoint.

For example, a function for choosing the cutpoint as the mean of the independent variable could look like this:

mean_cut <- function(data, x, ...) {
    oc <- mean(data[[x]])
    return(data.frame(optimal_cutpoint = oc))
}
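
It can then be supplied to cutpointr like any of the built-in method functions:

cutpointr(suicide, dsi, suicide, method = mean_cut, silent = TRUE)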

If a method function does not return a metric column, the default sum_sens_spec, the sum of sensitivity and specificity, is returned as the extra metric column in addition to accuracy, sensitivity and specificity.

Some method functions that make use of the additional arguments (that are captured by ...) are already included in cutpointr, see the list at the top. Since these functions are arguments to cutpointr their code can be accessed by simply typing their name, see for example oc_youden_normal.

metric

User-defined metric functions can be used as well. They are mainly useful in conjunction with method = maximize_metric, method = minimize_metric, or one of the other minimization and maximization functions. If a different method function is used, metric will only be used as the main out-of-bag metric when plotting the result. The metric function should accept the following inputs as vectors:

  • tp: Vector of true positives
  • fp: Vector of false positives
  • tn: Vector of true negatives
  • fn: Vector of false negatives
  • ...: Further arguments

The function should return a numeric vector, a matrix, or a data.frame with one column. If the column is named, the name will be included in the output and plots. Avoid using names that are identical to the column names that are returned by cutpointr by default, as such names will be prefixed by metric_ in the output. The code of the included metric functions can be accessed by simply typing their name.

For example, this is the misclassification_cost metric function:

misclassification_cost
#> function(tp, fp, tn, fn, cost_fp = 1, cost_fn = 1, ...) {
#>     misclassification_cost <- cost_fp * fp + cost_fn * fn
#>     misclassification_cost <- matrix(misclassification_cost, ncol = 1)
#>     colnames(misclassification_cost) <- "misclassification_cost"
#>     return(misclassification_cost)
#> }
#> <bytecode: 0x00000000139a4b38>
#> <environment: namespace:cutpointr>
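
Following the same pattern, a user-defined metric, here balanced accuracy as a hypothetical example that is not part of the package, can be written and supplied to metric:

# Hypothetical user-defined metric: balanced accuracy = (sensitivity + specificity) / 2
balanced_accuracy <- function(tp, fp, tn, fn, ...) {
    bacc <- (tp / (tp + fn) + tn / (tn + fp)) / 2
    bacc <- matrix(bacc, ncol = 1)
    colnames(bacc) <- "balanced_accuracy"
    return(bacc)
}
cutpointr(suicide, dsi, suicide, metric = balanced_accuracy, silent = TRUE)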

Plotting

cutpointr includes several convenience functions for plotting data from a cutpointr object. These include:

  • plot_cutpointr: General purpose plotting function for cutpointr or roc_cutpointr objects
  • plot_cut_boot: Plot the bootstrapped distribution of optimal cutpoints
  • plot_metric: If maximize_metric or minimize_metric was used this function plots all possible cutoffs on the x-axis vs. the respective metric values on the y-axis. If bootstrapping was run, a confidence interval based on the bootstrapped distribution of metric values at each cutpoint can be displayed. To display no confidence interval set conf_lvl = 0.
  • plot_metric_boot: Plot the distribution of out-of-bag metric values
  • plot_precision_recall: Plot the precision recall curve
  • plot_sensitivity_specificity: Plot all cutpoints vs. sensitivity and specificity
  • plot_roc: Plot the ROC curve
  • plot_x: Plot the distribution of the predictor variable

set.seed(102)
opt_cut <- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,
                     metric = abs_d_sens_spec, boot_runs = 200, silent = TRUE)
opt_cut
#> # A tibble: 2 x 18
#>   subgroup direction optimal_cutpoint method          abs_d_sens_spec      acc
#>   <chr>    <chr>                <dbl> <chr>                     <dbl>    <dbl>
#> 1 female   >=                       2 minimize_metric       0.0437341 0.885204
#> 2 male     >=                       2 minimize_metric       0.0313825 0.807143
#>   sensitivity specificity      AUC pos_class neg_class prevalence outcome
#>         <dbl>       <dbl>    <dbl> <fct>     <fct>          <dbl> <chr>  
#> 1    0.925926    0.882192 0.944647 yes       no         0.0688776 suicide
#> 2    0.777778    0.809160 0.861747 yes       no         0.0642857 suicide
#>   predictor grouping data               roc_curve                
#>   <chr>     <chr>    <list>             <list>                   
#> 1 dsi       gender   <tibble [392 x 2]> <roc_cutpointr [11 x 10]>
#> 2 dsi       gender   <tibble [140 x 2]> <roc_cutpointr [11 x 10]>
#>   boot               
#>   <list>             
#> 1 <tibble [200 x 23]>
#> 2 <tibble [200 x 23]>
plot_cut_boot(opt_cut)

plot_metric(opt_cut, conf_lvl = 0.9)

plot_metric_boot(opt_cut)
#> Warning: Removed 2 rows containing non-finite values (stat_density).

plot_precision_recall(opt_cut)

plot_sensitivity_specificity(opt_cut)

plot_roc(opt_cut)

All plot functions, except for the standard plot method that returns a composed plot, return ggplot objects that can be further modified. For example, changing labels, title, and the theme can be achieved this way:

p <- plot_x(opt_cut)
p + ggtitle("Distribution of dsi") + theme_minimal() + xlab("Depression score")

Flexible plotting function

Using plot_cutpointr any metric can be chosen to be plotted on the x- or y-axis and results of cutpointr() as well as roc() can be plotted. If a cutpointr object is to be plotted, it is thus irrelevant which metric function was chosen for cutpoint estimation. Any metric that can be calculated based on the ROC curve can be subsequently plotted as only the true / false positives / negatives over all cutpoints are needed. That way, not only the above plots can be produced, but also any combination of two metrics (or metric functions) and / or cutpoints. The built-in metric functions as well as user-defined functions or anonymous functions can be supplied to xvar and yvar. If bootstrapping was run, confidence intervals can be plotted around the y-variable. This is especially useful if the cutpoints, available in the cutpoints function, are placed on the x-axis. Note that confidence intervals can only be correctly plotted if the values of xvar are constant across bootstrap samples. For example, confidence intervals for TPR by FPR (a ROC curve) cannot be plotted easily, as the values of the false positive rate vary per bootstrap sample.

set.seed(500)
oc <- cutpointr(suicide, dsi, suicide, boot_runs = 1000, 
                metric = sum_ppv_npv) # metric irrelevant for plot_cutpointr
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...
plot_cutpointr(oc, xvar = cutpoints, yvar = sum_sens_spec, conf_lvl = 0.9)

plot_cutpointr(oc, xvar = fpr, yvar = tpr, aspect_ratio = 1, conf_lvl = 0)

plot_cutpointr(oc, xvar = cutpoint, yvar = tp, conf_lvl = 0.9) + geom_point()

Manual plotting

Since cutpointr returns a data.frame with the original data, bootstrap results, and the ROC curve in nested tibbles, these data can be conveniently extracted and plotted manually. The relevant nested tibbles are in the columns data, roc_curve and boot. The following is an example of accessing and plotting the grouped data.

set.seed(123) 
opt_cut <- cutpointr(suicide, dsi, suicide, gender, boot_runs = 1000)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...

opt_cut %>% 
    select(data, subgroup) %>% 
    unnest %>% 
    ggplot(aes(x = suicide, y = dsi)) + 
    geom_boxplot(alpha = 0.3) + facet_grid(~subgroup)
#> Warning: `cols` is now required when using unnest().
#> Please use `cols = c(data)`

Benchmarks

To offer a comparison to established solutions, cutpointr 1.0.0 will be benchmarked against optimal.cutpoints from OptimalCutpoints 1.1-4, ThresholdROC 2.7 and custom functions based on ROCR 1.0-7 and pROC 1.15.0. By generating data of different sizes, the benchmarks will offer a comparison of the scalability of the different solutions.

Using prediction and performance from the ROCR package and roc from the pROC package, we can write functions for computing the cutpoint that maximizes the sum of sensitivity and specificity. pROC has a built-in function to optimize a few metrics:

# Return the cutpoint that maximizes the sum of sensitivity and specificity
# ROCR package
rocr_sensspec <- function(x, class) {
    pred <- ROCR::prediction(x, class)
    perf <- ROCR::performance(pred, "sens", "spec")
    sens <- slot(perf, "y.values")[[1]]
    spec <- slot(perf, "x.values")[[1]]
    cut <- slot(perf, "alpha.values")[[1]]
    cut[which.max(sens + spec)]
}

# pROC package
proc_sensspec <- function(x, class) {
    r <- pROC::roc(class, x, algorithm = 2, levels = c(0, 1), direction = "<")
    pROC::coords(r, "best", ret="threshold", transpose = FALSE)[1]
}

The benchmarking will be carried out using the microbenchmark package and randomly generated data. The values of the x predictor variable are drawn from a normal distribution which leads to a lot more unique values than were encountered before in the suicide data. Accordingly, the search for an optimal cutpoint is much more demanding, if all possible cutpoints are evaluated.
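
A sketch of such a benchmark run for a single sample size (the parameters are illustrative, not the original benchmark script):

library(microbenchmark)
set.seed(1)
n <- 1000
x <- rnorm(n)
class <- rbinom(n, size = 1, prob = plogis(x))
dat <- data.frame(x = x, class = class)
microbenchmark(
    cutpointr = cutpointr(dat, x, class, pos_class = 1, direction = ">=",
                          silent = TRUE),
    ROCR = rocr_sensspec(x, class),
    pROC = proc_sensspec(x, class),
    times = 10
)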

Benchmarks are run for sample sizes of 100, 1000, 1e4, 1e5, 1e6, and 1e7. For low sample sizes cutpointr is slower than the other solutions. While this should be of low practical importance, cutpointr scales more favorably with increasing sample size. The speed disadvantage in small samples is mainly due to the nesting of the original data and the results, which makes the compact output of cutpointr possible. This observation is emphasized by the fact that cutpointr::roc is quite fast even in small samples. For sample sizes > 1e5 cutpointr is a little faster than the functions based on ROCR and pROC. Both of these solutions are generally faster than OptimalCutpoints and ThresholdROC, with the exception of small samples. OptimalCutpoints and ThresholdROC had to be excluded from benchmarks with more than 1e4 observations due to high memory requirements and/or excessive run times, rendering the use of these packages in larger samples impractical.

# ROCR package
rocr_roc <- function(x, class) {
    pred <- ROCR::prediction(x, class)
    perf <- ROCR::performance(pred, "sens", "spec")
    return(NULL)
}

# pROC package
proc_roc <- function(x, class) {
    r <- pROC::roc(class, x, algorithm = 2, levels = c(0, 1), direction = "<")
    return(NULL)
}
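
For completeness, a corresponding ROC-only function using cutpointr::roc might look like this (a sketch, assuming a numeric 0/1 class):

# cutpointr package: ROC curve only
cutpointr_roc <- function(x, class) {
    dat <- data.frame(x = x, class = class)
    r <- cutpointr::roc(data = dat, x = x, class = class,
                        pos_class = 1, neg_class = 0, direction = ">=")
    return(NULL)
}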

Run times in milliseconds:

| n     | Task                  | cutpointr    | OptimalCutpoints | pROC        | ROCR        | ThresholdROC |
|-------|-----------------------|--------------|------------------|-------------|-------------|--------------|
| 1e+02 | Cutpoint estimation   | 4.5018015    | 2.288702         | 0.662101    | 1.812802    | 1.194301     |
| 1e+03 | Cutpoint estimation   | 4.8394010    | 45.056801        | 0.981001    | 2.176401    | 36.239852    |
| 1e+04 | Cutpoint estimation   | 8.5662515    | 2538.612001      | 4.031701    | 5.667101    | 2503.801251  |
| 1e+05 | Cutpoint estimation   | 45.3845010   | NA               | 37.150151   | 43.118751   | NA           |
| 1e+06 | Cutpoint estimation   | 465.0032010  | NA               | 583.095000  | 607.023851  | NA           |
| 1e+07 | Cutpoint estimation   | 5467.3328010 | NA               | 7339.356101 | 7850.258700 | NA           |
| 1e+02 | ROC curve calculation | 0.7973505    | NA               | 0.447701    | 1.732651    | NA           |
| 1e+03 | ROC curve calculation | 0.8593010    | NA               | 0.694802    | 2.035852    | NA           |
| 1e+04 | ROC curve calculation | 1.8781510    | NA               | 3.658050    | 5.662151    | NA           |
| 1e+05 | ROC curve calculation | 11.0992510   | NA               | 35.329301   | 42.820852   | NA           |
| 1e+06 | ROC curve calculation | 159.8100505  | NA               | 610.433700  | 612.471901  | NA           |
| 1e+07 | ROC curve calculation | 2032.6935510 | NA               | 7081.897251 | 7806.385452 | NA           |


cutpointr's Issues

[BUG]: cutpointr() not recognizing character vectors for "x" argument input via lapply()

I'm having many issues when trying to create cutpointr models through lapply().

Below is an example that I think should work but does not.

In this case, when I try to pass a character vector to the argument x of cutpointr() using lapply(), it does not recognize the existence of such an object.

[screenshot of the failing call omitted]

My way around it is to input predictions and class as vectors instead of in a data.frame, but I would prefer to do it the way I suppose it is intended. Is there something that could be done? Maybe something that has to do with tidy evaluation (eval_tidy())?

Best regards

Columns nested in the `boot` column become columns of lists with allowParallel = TRUE and certain seeds

Consider the following code:

library(cutpointr)
library(doSNOW)
library(dplyr)
library(tidyr)
cl <- makeCluster(2) # 2 cores
registerDoSNOW(cl)

set.seed(101)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
					   silent = TRUE, allowParallel = TRUE)
opt_cut_b %>% select(boot) %>% unnest %>% head

This outputs a tibble with columns as expected:

# A tibble: 6 x 23
  optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b sensitivity_oob
             <dbl> <dbl>   <dbl>           <dbl>            <dbl> <dbl>   <dbl>         <dbl>           <dbl>
1                2 0.927   0.920            1.76             1.73 0.883   0.833         0.872           0.9  
2                4 0.960   0.894            1.90             1.51 0.940   0.857         0.966           0.632
3                2 0.905   0.956            1.73             1.75 0.850   0.862         0.886           0.889
4                2 0.958   0.881            1.84             1.66 0.874   0.854         0.969           0.8  
5                3 0.927   0.947            1.77             1.68 0.898   0.876         0.867           0.8  
6                2 0.931   0.954            1.76             1.80 0.853   0.873         0.912           0.929
# … with 14 more variables: specificity_b <dbl>, specificity_oob <dbl>, kappa_b <dbl>, kappa_oob <dbl>, TP_b <dbl>,
#   FP_b <dbl>, TN_b <int>, FN_b <int>, TP_oob <dbl>, FP_oob <dbl>, TN_oob <int>, FN_oob <int>, roc_curve_b <list>,
#   roc_curve_oob <list>

Now at times with certain seeds:

set.seed(102)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
					   silent = TRUE, allowParallel = TRUE)
opt_cut_b %>% select(boot) %>% unnest

Notice how the columns now hold <list>s of <dbl [1]>:

# A tibble: 100 x 23
   optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b sensitivity_oob
   <list>           <dbl>   <dbl> <list>          <list>           <lis> <list>  <list>        <list>         
 1 <dbl [1]>        0.945   0.875 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 2 <dbl [1]>        0.959   0.877 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 3 <dbl [1]>        0.937   0.905 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 4 <dbl [1]>        0.927   0.955 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 5 <dbl [1]>        0.965   0.849 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 6 <dbl [1]>        0.909   0.976 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 7 <dbl [1]>        0.960   0.830 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 8 <dbl [1]>        0.908   0.959 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 9 <dbl [1]>        0.935   0.958 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
10 <dbl [1]>        0.891   0.951 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
# … with 90 more rows, and 14 more variables: specificity_b <list>, specificity_oob <list>, kappa_b <list>,
#   kappa_oob <list>, TP_b <list>, FP_b <list>, TN_b <list>, FN_b <list>, TP_oob <list>, FP_oob <list>, TN_oob <list>,
#   FN_oob <list>, roc_curve_b <list>, roc_curve_oob <list>

This does not seem to occur with allowParallel = FALSE.
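Until the cause is found, a possible workaround (a sketch, assuming dplyr >= 1.0 for across() and that only the scalar columns are affected) is to unlist everything except the nested ROC curves after unnesting:

library(dplyr)
library(tidyr)

boot_res <- opt_cut_b %>%
  select(boot) %>%
  unnest(boot) %>%
  # flatten the accidental scalar list columns, keep the ROC curves nested
  mutate(across(where(is.list) & !c(roc_curve_b, roc_curve_oob), unlist))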

Update of benchmark data

Dear Christian,

Thanks for the benchmarks that are performed in the vignette. I've been looking into why pROC is significantly slower and was able to identify and fix some of the bottlenecks. I'm planning to release the fix around the end of the month and will propose a pull request once it is on CRAN, if that's OK with you.

I was wondering if speed was the only reason to exclude pROC with more than 1e5 observations, or if memory was also a factor. In the vignette you write:

OptimalCutpoints and ThresholdROC had to be excluded from benchmarks with
more than 1e4 observations and pROC from benchmarks with more than 1e5
observations due to high memory requirements and/or excessive run times

Do you remember whether memory was a criterion for pROC and, if so, what your criterion was exactly, or whether it was only a run-time reason? I have been able to run the benchmark with 1e7 data points in pROC without any noticeable memory issue. Here is what it looks like:

[benchmark plots: bench_coords and bench_roc]

PS: in the second plot, you run only the ROCR::prediction function and not ROCR::performance, which is necessary to get the sensitivities and specificities. Is there a reason for that?
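For reference, a minimal sketch of a complete ROCR run on simulated data (the data-generating lines are assumptions for illustration; performance() is the step that actually extracts sensitivities and specificities):

library(ROCR)

set.seed(1)
n <- 1e5
preds  <- runif(n)             # simulated predictor values
labels <- rbinom(n, 1, preds)  # simulated binary labels

pred <- prediction(preds, labels)           # build ROCR's prediction object
perf <- performance(pred, "sens", "spec")   # sensitivity vs. specificity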

Plot the ROC curve with manual settings

Sorry, I'm quite a newbie here and I don't know if this is really the right place to ask.

I want to plot just the ROC curve with the cutpoint and a vertical and horizontal line at the cutpoint. In addition, I want to change the appearance of the cutpoint and the lines (ROC, vline, hline) with regard to type, size/width, and color. I would also like to print the AUC value and the cutpoint value into the graph.

However, I failed, and I don't know what to do with the tibbles. Can anyone help me, please?

Many thanks in advance!

Jo1511
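The building blocks are in the nested tibbles. A minimal sketch, assuming the fpr/tpr columns of the nested roc_curve as in recent cutpointr versions:

library(cutpointr)
library(ggplot2)
data(suicide)

cp  <- cutpointr(suicide, dsi, suicide)
roc <- cp$roc_curve[[1]]

ggplot(roc, aes(x = fpr, y = tpr)) +
  geom_line(colour = "steelblue", size = 1) +                  # ROC curve
  geom_vline(xintercept = 1 - cp$specificity, linetype = "dashed") +
  geom_hline(yintercept = cp$sensitivity, linetype = "dashed") +
  annotate("point", x = 1 - cp$specificity, y = cp$sensitivity,
           colour = "red", size = 3) +                         # the cutpoint
  annotate("text", x = 0.7, y = 0.2,
           label = paste0("AUC = ", round(cp$AUC, 3),
                          "\nCutpoint = ", cp$optimal_cutpoint))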

Call functions from namespace

Hi,
Thank you so much for cutpointr.

It's more a question than a bug, perhaps. I may have missed something obvious, but when I call cutpointr functions from its namespace (as I would for inclusion in my own package), I get an error. Here's my minimal reproducible example:

library(cutpointr)
#> Warning: package 'cutpointr' was built under R version 3.5.2
opt_cut_namespace <- cutpointr::cutpointr_(
  suicide,
  "dsi",
  "suicide",
  method = cutpointr::minimize_boot_metric,
  metric = cutpointr::false_omission_rate
)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Error in `$<-.data.frame`(`*tmp*`, "method", value = c("::", "cutpointr", : replacement has 3 rows, data has 1
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
#> [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=English_Canada.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] cutpointr_0.7.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.18     knitr_1.20       bindr_0.1.1      magrittr_1.5    
#>  [5] tidyselect_0.2.4 R6_2.2.2         rlang_0.2.1      foreach_1.4.4   
#>  [9] stringr_1.3.1    dplyr_0.7.6      tools_3.5.1      iterators_1.0.10
#> [13] htmltools_0.3.6  yaml_2.2.0       rprojroot_1.3-2  digest_0.6.15   
#> [17] assertthat_0.2.0 tibble_1.4.2     crayon_1.3.4     bindrcpp_0.2.2  
#> [21] tidyr_0.8.1      purrr_0.2.5      codetools_0.2-15 glue_1.3.0      
#> [25] evaluate_0.11    rmarkdown_1.10   stringi_1.1.7    compiler_3.5.1  
#> [29] pillar_1.3.0     backports_1.1.2  pkgconfig_2.0.1

What's the recommended way, if any, to do this? Thanks again.

Best,
V.
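One possible workaround until this is fixed: bind the method and metric functions to bare names first, so that the captured expressions are plain symbols rather than :: calls (a sketch, not an official recommendation):

# assign local aliases so the method / metric arguments are bare names
minimize_boot_metric <- cutpointr::minimize_boot_metric
false_omission_rate  <- cutpointr::false_omission_rate

opt_cut_namespace <- cutpointr::cutpointr_(
  suicide, "dsi", "suicide",
  method = minimize_boot_metric,
  metric = false_omission_rate
)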

Error in !predictor : invalid argument type

Hi @Thie1e,

I'm running the example and I'm getting this error; I cannot understand what I'm doing wrong... I just followed the example! Thanks!

> cp <- cutpointr(suicide, dsi, suicide, method = maximize_metric, metric = sum_sens_spec)
Assuming the positive class is yes
Assuming the positive class has higher x values
Error in !predictor : invalid argument type

An ambiguous region bounded by two cutpoints

Hi Christian:

When developing a diagnostic test, it is common to have three regions: e.g. high-confidence negative, ambiguous, and high-confidence positive. Two cutpoints cp1 and cp2 would accomplish this. cutpointr currently only accepts one cutpoint as: method = oc_manual, cutpoint = cp. I ended up preparing two data sets: data1 includes only the ambiguous middle and data2 excludes those falling in the middle region.

It would be great if cutpointr could support something like:

opt_cut <- cutpointr(data = suicide, x = age, class = suicide,
                     direction = ">=", pos_class = "yes", neg_class = "no",
                     method = oc_manual, cutpoint = c(30, 40))
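Until something like that is supported, a minimal sketch of the two-run workaround described above, using a hypothetical ambiguous band between ages 30 and 40:

library(cutpointr)
data(suicide)

# drop the ambiguous middle region, then evaluate each boundary separately
no_middle <- subset(suicide, age < 30 | age >= 40)
cut_low  <- cutpointr(no_middle, age, suicide, pos_class = "yes",
                      direction = ">=", method = oc_manual, cutpoint = 30)
cut_high <- cutpointr(no_middle, age, suicide, pos_class = "yes",
                      direction = ">=", method = oc_manual, cutpoint = 40)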

Prepare for tidyr v1.0.0

tidyr is preparing for the release of what will become v1.0.0.

Here's what we see for cutpointr in our initial revdep check:

https://github.com/tidyverse/tidyr/blob/master/revdep/problems.md#cutpointr

I took a quick look at your dev version and it seems the tidyr::nest_() calls are gone, which is good. I'm seeing other errors ("Tibble columns must have consistent lengths"), though, which keep me from verifying that all is well with respect to tidyr.

It looks like you are perhaps building towards a release anyway. So I suggest that, while you are at it, you also make sure you're testing against the devel version of tidyr. Let me know if you need any pointers.

Here is advice on how to test your package on travis against the devel version of tidyr during this period:

tidyverse/tidyr#692 (comment)

Missing metrics if maximize/minimize_boot_metric

Hi @Thie1e,

I'm here again 😓... I'm facing an error I never saw before:

> cutpointr(data = u, x = x, class = class, na.rm = T,
+           metric = sum_sens_spec, method = minimize_boot_metric, 
+           boot_cut = 100, boot_stratify = T, boot_runs = 100)
Assuming the positive class is 1
Assuming the positive class has higher x values
Error in optimize_metric(data = data[b_ind, ], x = x, class = class, metric_func = metric_func,  : 
  All metric values are missing.
In addition: Warning message:
In roc(data = data, x = !!x, class = !!class, pos_class = pos_class,  :
  ROC curve contains no positives

And this is my data structure:

structure(list(x = c(0.225, 1.936, 0.0315, 0.0078, 0.4698, 19.35, 
0.0531, 1.7466176, 0, 0.02350828, 0.0714725, 0.5275296, 7.68378, 
0.05376, 0.020688, 0.08143, 1.127828, 0, 0.0313956, NA, 0.04976592, 
30.072, 6.492, 2.99, 2.52, 0.17, 0.03321, 0.3306, 0, 0.884, NA, 
29.7, 1.4, 0, 0, 0.12320784, 1.108, 0.023104, 66.448512, 4.180792, 
0.5792, 0.0444, 0, 0.0392, 0, 0, 0.2105334, 0.225, 3.355, 23.4
), class = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 50L
), class = "data.frame")

I see that you updated the package version, maybe I'm missing something? Thanks for your help (again)!

Specifying a custom cutpoint using oc_manual = value is ignored?

Hi Christian,

I wonder if oc_manual is used to specify a custom cutpoint. I ran the following code to provide my own cutpoint instead of letting the function find the optimal one. I'm not sure if oc_manual is used correctly, but the output ignores my input. Please advise!

data(suicide)
opt_cut <- cutpointr(data = suicide, x = dsi, class = suicide,
                     pos_class = 'yes', neg_class = 'no',
                     direction = '>=', boot_stratify = TRUE,
                     oc_manual = 2.5)

Thanks much @Thie1e
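For reference, oc_manual is a method function rather than a standalone argument (the stray oc_manual = 2.5 above is most likely silently ignored), so the intended call presumably looks like this:

opt_cut <- cutpointr(data = suicide, x = dsi, class = suicide,
                     pos_class = 'yes', neg_class = 'no',
                     direction = '>=', boot_stratify = TRUE,
                     method = oc_manual, cutpoint = 2.5)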

minor mistake in readme

In the section "Metric functions", "cost_misclassification" should be changed to "misclassification_cost".

`use_midpoints = TRUE` per default?

The current default is use_midpoints = FALSE. However, as the Readme and the vignette illustrate, this leads to a certain bias when methods other than the normal or kernel method are applied. On the other hand, many users might prefer the current setting, because the returned cutpoints can be found in the data. Additionally, changing the default might surprise current users. Should it be changed or not?

Review / Change bootstrapping routine

Especially with imbalanced data sets that contain a low absolute number of observations of one of the two classes, some bootstrap samples will not contain observations of both classes and the cutpoint optimization cannot be run. There are several ways to deal with that. Currently, cutpointr uses option 1:

  1. If a bootstrap sample contains only one class, redraw until a sample is drawn that contains both classes.
  2. If a bootstrap sample contains only one class, return NA for all results of that bootstrap repetition. We did that in an older version, but it leads to the question of how to deal with the results later, e.g. for plotting. Since many results may be missing, the plots of the distributions may be misleading (based on a very low number of repetitions). We issued warnings in that case, but the constant warnings were confusing.
  3. Sample with replacement separately from the positive and negative observations (stratified bootstrapping). This has the advantage that every resample contains both classes; however, the prevalence (= fraction of positive observations) is held constant across resamples.

My impression is that option 3 leads to worse confidence intervals than option 1. The cutpointr:::simple_boot function supports both schemes. To switch to option 3 the argument in simple_boot needs to be set and the code before it is called needs to be edited (some necessary lines for option 3 are currently commented out). A simulation study (different distributions of predictor values, different metrics) to check the coverage probabilities of confidence intervals from options 1 and 3 would be helpful here.
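To make the two schemes concrete, a minimal base-R sketch of the resampling step only (not the internal cutpointr:::simple_boot code):

# option 1: plain bootstrap, redraw until both classes are present
boot_redraw <- function(class) {
  repeat {
    ind <- sample(seq_along(class), replace = TRUE)
    if (length(unique(class[ind])) == 2) return(ind)
  }
}

# option 3: stratified bootstrap, sample within each class separately,
# which keeps the prevalence fixed across resamples
boot_stratified <- function(class) {
  lvls <- levels(factor(class))
  pos <- which(class == lvls[2])
  neg <- which(class == lvls[1])
  c(sample(pos, replace = TRUE), sample(neg, replace = TRUE))
}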

Installation problem

The package looks very exciting, however, when I run the following code, I get the error:

> install.packages('cutpointr')
Warning in install.packages :
  package ‘cutpointr’ is not available (for R version 3.4.3)

My machine:

sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Any ideas?

How to include more than one predictor?

Hi Christian,

Does cutpointr support more than one predictor? In this dataset, for example, could I use dsi + age + gender as predictors together (not individually)? Then the cutpoint wouldn't be an optimal dsi value but a probability for assigning the positive class.

opt_cut <- cutpointr(data = suicide, x = dsi, class = suicide,
                     pos_class = 'yes', neg_class = 'no',
                     direction = '>=', boot_runs = 100)

@Thie1e
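cutpointr itself takes a single numeric predictor, but one common approach (a sketch, not a built-in feature) is to combine several variables via logistic regression and pass the fitted probabilities to cutpointr:

library(cutpointr)
data(suicide)

# combine dsi, age and gender into one score via logistic regression
mod <- glm(suicide ~ dsi + age + gender, data = suicide, family = binomial)
suicide$prob <- predict(mod, type = "response")

# the "optimal cutpoint" is now a probability, not a dsi value
opt_cut <- cutpointr(suicide, prob, suicide, pos_class = "yes", direction = ">=")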

cutpointr() subgroup option: how to determine which subgroup each element of opt_cut$boot belongs to?

Hi Christian,

Thanks so much for this nice and fast package that makes cutpoint-related calculations enjoyable!

I have a quick question. When using the subgroup option in cutpointr(), for example subgroup = 'gender', opt_cut$boot is a list of two items. How can I tell which item is for gender = female and which is for gender = male? I guess it is based on alphabetical order, so female is in opt_cut$boot[[1]] and male in opt_cut$boot[[2]]; is this correct?

Thanks a lot.

@Thie1e
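Rather than relying on the assumed ordering, the mapping can be read off directly: the boot column is aligned row-wise with the subgroup column of the same tibble. A minimal sketch:

library(dplyr)

# each element of opt_cut$boot belongs to the subgroup in the same row
opt_cut %>% select(subgroup, boot)
opt_cut$subgroup   # e.g. "female" "male"
boot_female <- opt_cut$boot[[which(opt_cut$subgroup == "female")]]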

Set manual color for only one line

Dear Christian @Thie1e

I ran cutpointr with subgroup and bootstrap. Then I drew lines corresponding to each subgroup. Because one of the lines is my reference, I'd like to set a color for only this line. To illustrate what I am trying to achieve, I use the sample dataset as follows. As shown, I am able to manually set two colors for gender. However, what I really want is to keep using the default colors for all lines except my reference line. The reason is that I will have many lines (subgroups) and I'd like to see my reference line clearly among the rest.

Any advice on how I can achieve this?

Thanks much.

library(cutpointr)
library(dplyr)
library(ggplot2)

data(suicide)

opt_cut <- cutpointr(data = suicide, x = dsi, class = suicide, direction = ">=",
                     pos_class = "yes", neg_class = "no", subgroup = gender,
                     method = maximize_metric, metric = youden,
                     boot_runs = 100) %>%
  add_metric(list(cohens_kappa))

plot_cutpointr(opt_cut, xvar = cutpoint,
               yvar = cohens_kappa,
               conf_lvl = 0.95,
               aspect_ratio = NULL) +
  scale_x_continuous(n.breaks = 20, minor_breaks = waiver()) +
  scale_y_continuous(n.breaks = 5, minor_breaks = waiver()) +
  scale_color_manual(values = c("#353436", "#02e302")) +
  scale_fill_manual(values = c("#353436", "#02e302"))

Inconsistent calculations using manual method

Hi

I am using the manual method to set specific cutpoints. While the sensitivity and specificity change with each manual cutpoint, the AUC does not. Why isn't the AUC changing with each cutpoint? (Unless I have done something wrong?!)

thank you


> cp <- cutpointr(data, test, status, method = maximize_metric, metric = sum_sens_spec)
Assuming the positive class is 1
Assuming the positive class has higher x values
> cp
# A tibble: 1 x 16
  direction optimal_cutpoint method          sum_sens_spec      acc sensitivity specificity      AUC pos_class neg_class prevalence
  <chr>                <dbl> <chr>                   <dbl>    <dbl>       <dbl>       <dbl>    <dbl>     <dbl>     <dbl>      <dbl>
1 >=                    21.3 maximize_metric       1.33777 0.670951    0.664894    0.672881 0.701632         1         0   0.241645
  outcome predictor           data roc_curve           boot 
  <chr>   <chr>     <list<df[,2]>> <list>              <lgl>
1 status  test          [778 × 2] <tibble [157 × 10]> NA   
> opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 21, boot_runs = 30)
Assuming the positive class is 1
Assuming the positive class has higher x values
Running bootstrap...
> cp
# A tibble: 1 x 16
  direction optimal_cutpoint method          sum_sens_spec      acc sensitivity specificity      AUC pos_class neg_class prevalence
  <chr>                <dbl> <chr>                   <dbl>    <dbl>       <dbl>       <dbl>    <dbl>     <dbl>     <dbl>      <dbl>
1 >=                    21.3 maximize_metric       1.33777 0.670951    0.664894    0.672881 0.701632         1         0   0.241645
  outcome predictor           data roc_curve           boot 
  <chr>   <chr>     <list<df[,2]>> <list>              <lgl>
1 status  test          [778 × 2] <tibble [157 × 10]> NA   
> opt_cut_manual
# A tibble: 1 x 16
  direction optimal_cutpoint method    sum_sens_spec      acc sensitivity specificity      AUC pos_class neg_class prevalence outcome
  <chr>                <dbl> <chr>             <dbl>    <dbl>       <dbl>       <dbl>    <dbl>     <dbl>     <dbl>      <dbl> <chr>  
1 >=                      21 oc_manual       1.33655 0.664524    0.675532    0.661017 0.701632         1         0   0.241645 status 
  predictor           data      roc_curve boot              
  <chr>     <list<df[,2]>> <list<df[,9]>> <list>            
1 test          [778 × 2]      [157 × 9] <tibble [30 × 23]>
> opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 20, boot_runs = 30)
Assuming the positive class is 1
Assuming the positive class has higher x values
Running bootstrap...
> opt_cut_manual
# A tibble: 1 x 16
  direction optimal_cutpoint method    sum_sens_spec      acc sensitivity specificity      AUC pos_class neg_class prevalence outcome
  <chr>                <dbl> <chr>             <dbl>    <dbl>       <dbl>       <dbl>    <dbl>     <dbl>     <dbl>      <dbl> <chr>  
1 >=                      20 oc_manual       1.32322 0.651671    0.680851    0.642373 0.701632         1         0   0.241645 status 
  predictor           data      roc_curve boot              
  <chr>     <list<df[,2]>> <list<df[,9]>> <list>            
1 test          [778 × 2]      [157 × 9] <tibble [30 × 23]>
> opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 30, boot_runs = 30)
Assuming the positive class is 1
Assuming the positive class has higher x values
Running bootstrap...
> opt_cut_manual
# A tibble: 1 x 16
  direction optimal_cutpoint method    sum_sens_spec      acc sensitivity specificity      AUC pos_class neg_class prevalence outcome
  <chr>                <dbl> <chr>             <dbl>    <dbl>       <dbl>       <dbl>    <dbl>     <dbl>     <dbl>      <dbl> <chr>  
1 >=                      30 oc_manual       1.32584 0.722365    0.547872    0.777966 0.701632         1         0   0.241645 status 
  predictor           data      roc_curve boot              
  <chr>     <list<df[,2]>> <list<df[,9]>> <list>            
1 test          [778 × 2]      [157 × 9] <tibble [30 × 23]>
> opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 40, boot_runs = 30)
Assuming the positive class is 1
Assuming the positive class has higher x values
Running bootstrap...
> opt_cut_manual
# A tibble: 1 x 16
  direction optimal_cutpoint method    sum_sens_spec      acc sensitivity specificity      AUC pos_class neg_class prevalence outcome
  <chr>                <dbl> <chr>             <dbl>    <dbl>       <dbl>       <dbl>    <dbl>     <dbl>     <dbl>      <dbl> <chr>  
1 >=                      40 oc_manual       1.29017 0.755784    0.430851    0.859322 0.701632         1         0   0.241645 status 
  predictor           data      roc_curve boot              
  <chr>     <list<df[,2]>> <list<df[,9]>> <list>            
1 test          [778 × 2]      [157 × 9] <tibble [30 × 23]>
> opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 40)
Assuming the positive class is 1
Assuming the positive class has higher x values
> opt_cut_manual
# A tibble: 1 x 16
  direction optimal_cutpoint method    sum_sens_spec      acc sensitivity specificity      AUC pos_class neg_class prevalence outcome
  <chr>                <dbl> <chr>             <dbl>    <dbl>       <dbl>       <dbl>    <dbl>     <dbl>     <dbl>      <dbl> <chr>  
1 >=                      40 oc_manual       1.29017 0.755784    0.430851    0.859322 0.701632         1         0   0.241645 status 
  predictor           data      roc_curve boot 
  <chr>     <list<df[,2]>> <list<df[,9]>> <lgl>
1 test          [778 × 2]      [157 × 9] NA   

Summary and plot don't work well with multi_cutpointr

Running summary on data from multi_cutpointr throws an error:

multi_cut <- multi_cutpointr(suicide, c("age", "dsi"), "suicide", subgroup="gender")
summary(multi_cut)
[...]
---------------------------------------------------------------------------------------------- 
Error in Math.data.frame(list(optimal_cutpoint = 56, method = "maximize_metric",  : 
  non-numeric variable(s) in data frame: method

The plot function is also behaving strangely and seems to be mixing both x variables in the same plot:

plot(multi_cut)


Similar issues happen with bootstrap enabled (boot_runs > 0).

Creating a composite biomarker score using regression coefficients

Hello,

I want to create a composite biomarker score of 4 markers that I can then enter into further predictive analyses. I was planning on doing this by extracting the β coefficients from a multivariable logistic regression with all (standardized) biomarkers and then multiplying those with the (standardized) biomarker levels to create a composite. I just wondered whether this method is OK and how I can go about it using R?

Thanks,
Laura
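This is more of a general modelling question than a cutpointr one, but here is a minimal sketch of the described approach, assuming hypothetical markers m1 to m4 and a binary outcome y in a data frame d:

# standardize the markers, fit a multivariable logistic model,
# then use the linear predictor as the composite score
d_std <- d
d_std[paste0("m", 1:4)] <- scale(d[paste0("m", 1:4)])

mod <- glm(y ~ m1 + m2 + m3 + m4, data = d_std, family = binomial)

# the composite is the coefficient-weighted sum of the standardized
# markers, i.e. the linear predictor of the fitted model
d_std$composite <- predict(mod, type = "link")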

direction parameter in the cutpointr()

Dear Dr. Thiele @Thie1e

We have a quick question about allowable inputs for parameters of cutpointr(). direction currently only accepts ">=" or "<=" but not ">" or "<". We are trying to match the cutpointr output with output from another tool, and we traced the difference to the ">=" in cutpointr versus ">" in the other tool. Is there any way we can set ">" in cutpointr()? Any advice is greatly appreciated.

Thanks so much for your help.

Auto-select numeric columns in multi_cutpointr

If I run multi_cutpointr on the suicide dataset without specifying x, I get the following error on a non-numeric column:

multi_cutpointr(suicide, class = "suicide")
age:
Assuming the positive class is no
Assuming the positive class has higher x values
gender:
Error in median.default(x[class == uc[1]]) : need numeric data

It is common to have non-numeric columns with different groupings etc. that may not be in use right now, just like the gender column. It would be a nice addition to auto-detect and use numeric columns only.
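Until such auto-detection exists, a one-line base-R workaround (a sketch) is to select the numeric columns explicitly:

# pass only the numeric predictors; drop the outcome column if needed
num_vars <- names(suicide)[sapply(suicide, is.numeric)]
num_vars <- setdiff(num_vars, "suicide")
multi_cutpointr(suicide, x = num_vars, class = "suicide", subgroup = "gender")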

95% confidence intervals instead of getting limits at 5% and 95% in summary of cutpointr

I received the question of how to get 95% confidence intervals instead of only the limits at 5% and 95% from the summary output. I'm posting a solution for calculating bootstrap quantiles here for future reference.

library(cutpointr)
library(tidyverse)

cp1 <- cutpointr(suicide$dsi, suicide$suicide, suicide$gender, 
                 boot_runs = 1000, boot_stratify = TRUE,
                 na.rm=TRUE)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...

boot_ci(cp1, acc, in_bag = F, alpha = 0.05) %>% 
  mutate(variable = "acc")
#> # A tibble: 4 x 4
#>   subgroup quantile values variable
#>   <chr>       <dbl>  <dbl> <chr>   
#> 1 female      0.025  0.779 acc     
#> 2 female      0.975  0.931 acc     
#> 3 male        0.025  0.630 acc     
#> 4 male        0.975  0.926 acc

cp1 %>% 
  select(subgroup, boot) %>%
  unnest(boot) %>% 
  dplyr::select(-(TP_b:roc_curve_oob)) %>% 
  pivot_longer(-subgroup) %>% 
  group_by(subgroup, name) %>% 
  summarise(lower_ci = quantile(value, 0.025, na.rm = TRUE),
            upper_ci = quantile(value, 0.975, na.rm = TRUE))
#> `summarise()` regrouping output by 'subgroup' (override with `.groups` argument)
#> # A tibble: 26 x 4
#> # Groups:   subgroup [2]
#>    subgroup name             lower_ci upper_ci
#>    <chr>    <chr>               <dbl>    <dbl>
#>  1 female   acc_b               0.798    0.936
#>  2 female   acc_oob             0.779    0.931
#>  3 female   AUC_b               0.893    0.979
#>  4 female   AUC_oob             0.882    0.988
#>  5 female   cohens_kappa_b      0.324    0.625
#>  6 female   cohens_kappa_oob    0.287    0.605
#>  7 female   optimal_cutpoint    1        4    
#>  8 female   sensitivity_b       0.815    1    
#>  9 female   sensitivity_oob     0.636    1    
#> 10 female   specificity_b       0.786    0.937
#> # ... with 16 more rows

Created on 2020-12-22 by the reprex package (v0.3.0)

AUC confidence interval

Hi, is there a way of calculating the confidence interval of AUC with cutpointr?
Thank you
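cutpointr does not return a closed-form AUC interval, but with boot_runs set, a bootstrap percentile interval can be taken from the boot column, analogous to the boot_ci() example in the issue above (a sketch, assuming boot_ci() handles AUC like the other bootstrapped columns):

library(cutpointr)
data(suicide)

cp <- cutpointr(suicide, dsi, suicide, boot_runs = 1000)

# 95% percentile interval for the out-of-bag AUC across repetitions
boot_ci(cp, AUC, in_bag = FALSE, alpha = 0.05)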

Cutpointr confidence interval predictive positive value

Hi,

I'm using cutpointr with boot_runs to get cutpoints, AUC, specificity, and sensitivity with their respective confidence intervals. I also "manually" extracted the PPV and NPV, but I don't know how to get their respective confidence intervals.
Does anyone have an idea?

Thank you !
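One possibility (a sketch, assuming the TP_b/FP_b/TN_b/FN_b columns that cutpointr stores per bootstrap repetition): recompute PPV and NPV for each repetition from the stored confusion-matrix counts and take percentile intervals. The _b counts are in-bag; _oob counts are stored as well.

library(cutpointr)
library(dplyr)
library(tidyr)
data(suicide)

cp <- cutpointr(suicide, dsi, suicide, boot_runs = 1000)

cp %>%
  select(boot) %>%
  unnest(boot) %>%
  # PPV and NPV per bootstrap repetition from the confusion matrix counts
  mutate(ppv_b = TP_b / (TP_b + FP_b),
         npv_b = TN_b / (TN_b + FN_b)) %>%
  summarise(ppv_lower = quantile(ppv_b, 0.025, na.rm = TRUE),
            ppv_upper = quantile(ppv_b, 0.975, na.rm = TRUE),
            npv_lower = quantile(npv_b, 0.025, na.rm = TRUE),
            npv_upper = quantile(npv_b, 0.975, na.rm = TRUE))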

multiple cutpoints

Hi, I would like to know if you have any plans for a function that returns multiple cutpoints instead of a single one.
What would your approach be for selecting multiple cutpoints?

NSE in multi_cutpointr?

Not a big deal, but somewhat surprisingly, I can do:

cutpointr(suicide, dsi, class=suicide)

but not:

multi_cutpointr(suicide, dsi, class=suicide)
Error in multi_cutpointr(suicide, dsi, class = suicide) : 
  class should be the name of the outcome variable (character)

It would be nice to have NSE for multi_cutpointr too.
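Until then, the workaround is simply to quote the names, as the error message hints:

multi_cutpointr(suicide, x = "dsi", class = "suicide")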

Multiple cutoffs in a loop

Hi @Thie1e,

I hope you're doing well. I'm trying to run cutpointr inside a loop, and because of rlang problems I'm totally stuck...

Try 1: it seems this fails because i is treated as a string:

for (i in c("var1", "var2", "var3")) {
  cutpointr(data = mydata, x = i, class = myclass, pos_class = 1, metric = sum_sens_spec)
}

Error in if (stats::median(x[class != pos_class]) < stats::median(x[class ==  : 
  missing value where TRUE/FALSE needed

Try 2: I force i to refer to the original variable...

for (i in c("var1", "var2", "var3")) {
  cutpointr(data = mydata, x = eval(parse(text = i)), class = myclass, pos_class = 1, metric = sum_sens_spec)
}

Error: Can't convert a call to a string
Run `rlang::last_error()` to see where the error occurred.
<error/rlang_error>
Can't convert a call to a string
Backtrace:
 1. cutpointr::cutpointr(...)
 2. cutpointr:::cutpointr.default(...)
 3. rlang::as_name(rlang::enquo(x))
 4. rlang::as_string(x)
 5. rlang:::abort_coercion(x, "a string")
Run `rlang::last_trace()` to see the full context.

How can I solve, or at least bypass, this problem and run it inside a loop, please? Thanks in advance.
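One way out, assuming the standard-evaluation variant cutpointr_() shown elsewhere in these issues, which accepts column names as plain strings (mydata, myclass, and var1 to var3 are the hypothetical names from the question):

# loop over predictor names given as strings via the SE variant
for (i in c("var1", "var2", "var3")) {
  cp <- cutpointr_(data = mydata, x = i, class = "myclass",
                   pos_class = 1, metric = sum_sens_spec)
  print(cp)
}

Alternatively, with a cutpointr version built on rlang, x = !!rlang::sym(i) inside the regular cutpointr() call should work as well.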

Make printing of summary_cutpointr nicer in Rmd documents

When running summary() on a cutpointr (or multi_cutpointr) object in an Rmd, the output is split up into multiple output windows below the Rmd chunk, because the summary function prints the elements of the output one by one. The output should look the same as in the console. Maybe also round bootstrap summary to 2 instead of 3 decimal places to make the output narrower. Also, the ROC plot should say "by subgroup" instead of "by class" when subgroups are given.

How to access ppv values given a custom cutpoint

Hi Christian,

Using the following code, I get opt_cut for a custom cutpoint. How do I access the PPV and NPV values from the 100 bootstrap samples?
I'd like to plot the PPV or NPV distribution. I figure it has something to do with opt_cut$boot, but I am not able to figure it out. Thanks much.

library(cutpointr)
library(dplyr)

# Optimal cutpoint for dsi
data(suicide)
opt_cut <- cutpointr(data = suicide, x = dsi, class = suicide,
                     pos_class = 'yes', neg_class = 'no',
                     direction = '>=', boot_stratify = TRUE,
                     method = oc_manual, cutpoint = 3, boot_runs = 100) %>%
  cutpointr::add_metric(list(ppv, npv))
@Thie1e
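A sketch, assuming the bootstrapped confusion-matrix counts (TP_b, FP_b) nested in opt_cut$boot: recompute the PPV per repetition and plot its distribution.

library(dplyr)
library(tidyr)
library(ggplot2)

opt_cut %>%
  select(boot) %>%
  unnest(boot) %>%
  # in-bag PPV per bootstrap repetition from the stored counts
  mutate(ppv_b = TP_b / (TP_b + FP_b)) %>%
  ggplot(aes(x = ppv_b)) +
  geom_histogram(bins = 20)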

Use patchwork instead of gridExtra

As soon as the patchwork package is on CRAN, plot.cutpointr should use that package instead of gridExtra. That way, all plots that are generated by cutpointr will be ggplot objects and additionally the resulting plot can be further modified.

Explain oc_youden_kernel

Dear @Thie1e,

When using cutpointr, I obtained better results with method = oc_youden_kernel because the two classes are not normally distributed. When using the default method = maximize_metric, is the empirical CDF method used? Thanks much.

Doubt about metric sum_sens_spec

Hi (again 😓) @Thie1e,

I have a (possibly silly) question, if you don't mind. I ran cutpointr(..., metric = sum_sens_spec) and I don't understand how it calculates this cutoff because, when I do it manually, my results are a bit different.

To ct$roc_curve I have added a new column (ss) with the sens + spec calculation (just to test what I'm saying)...

# A tibble: 379 x 10
   x.sorted    tp    fp    tn    fn     tpr   tnr   fpr   fnr    ss
      <dbl> <dbl> <dbl> <int> <int>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1   Inf        0     0    29   349 0           1     0 1      1   
 2     6.32     1     0    29   348 0.00287     1     0 0.997  1.00
 3     5.17     2     0    29   347 0.00573     1     0 0.994  1.01
 4     4.88     3     0    29   346 0.00860     1     0 0.991  1.01
 5     4.78     4     0    29   345 0.0115      1     0 0.989  1.01
 6     4.47     5     0    29   344 0.0143      1     0 0.986  1.01
 7     4.27     6     0    29   343 0.0172      1     0 0.983  1.02
 8     4.12     7     0    29   342 0.0201      1     0 0.980  1.02
 9     4.00     8     0    29   341 0.0229      1     0 0.977  1.02
10     3.85     9     0    29   340 0.0258      1     0 0.974  1.03
# ... with 369 more rows

... and the x.sorted value with the highest ss is different from the one calculated by cutpointr. What am I missing, please?

Failure with dev tidyr

When I check cutpointr with the dev version of tidyr, I see:

  • checking examples ... ERROR

    ...
    
    Attaching package: ‘dplyr’
    
    The following objects are masked from ‘package:stats’:
    
        filter, lag
    
    The following objects are masked from ‘package:base’:
    
        intersect, setdiff, setequal, union
    
    > library(cutpointr)
    > cutpointr(suicide, dsi, suicide, gender) %>%
    +   add_metric(list(ppv, npv)) %>%
    +   select(optimal_cutpoint, subgroup, AUC, sum_sens_spec, ppv, npv)
    Assuming the positive class is yes
    Assuming the positive class has higher x values
    Error in check_roc_curve(optcut) : 
      roc_curve as returned by the method function is not an object of the class roc_cutpointr
    Calls: %>% ... cutpointr_internal -> <Anonymous> -> .f -> check_roc_curve
    Execution halted
    
  • checking tests ...

     ERROR
    Running the tests in ‘tests/testthat.R’ failed.
    Last 13 lines of output:
      ══ testthat results  ═════════════════════════════════════════════════════════════════
      OK: 87 SKIPPED: 0 FAILED: 40
      1. Error: Cutpointr returns a cutpointr without NAs and a certain Nr of rows (@test-cutpointr.R#3) 
      2. Error: Cutpointr works with different data types (@test-cutpointr.R#19) 
      3. Error: Bootstrap does not return duplicate colnames (@test-cutpointr.R#78) 
      4. Error: Plotting with bootstrapping is silent (@test-cutpointr.R#94) 
      5. Error: AUC calculation is correct and works with Inf and -Inf (@test-cutpointr.R#134) 
      6. Error: Correct midpoints are found (@test-cutpointr.R#149) 
      7. Error: find_metric_name finds metric (@test-cutpointr.R#160) 
      8. Error: no duplicate column names are returned (@test-cutpointr.R#182) 
      9. Error: Correct cutpoints with example data (@test-cutpointr.R#212) 
      1. ...
      
      Error: testthat unit tests failed
      Execution halted
    
  • checking re-building of vignette outputs ... WARNING

    Error in re-building vignettes:
      ...
    Quitting from lines 46-52 (cutpointr.Rmd) 
    Error: processing vignette 'cutpointr.Rmd' failed with diagnostics:
    roc_curve as returned by the method function is not an object of the class roc_cutpointr
    Execution halted
    

Would you mind looking into this for me? It's possible that I've accidentally changed the tidyr API in some way, but the changes are small and cutpointr is the only CRAN package that shows problems.

Getting different f1-score result

Hi,

I am using your package to check my quick implementation of the dice/f1-score.

When I use the cutpointr() function and put the displayed optimal threshold (0.007) into my dice function, I get a different result (0.6586022) than the one cutpointr() displays (0.672).

Do you work with approximations, or is this a bug in my code or yours?

# libraries
library(magrittr)
library(tibble)
#> Warning: package 'tibble' was built under R version 3.5.1
library(cutpointr)
#> Warning: package 'cutpointr' was built under R version 3.5.1
# testdata
set.seed(123)
df_test <- tibble(pred = runif(1000,0,1),
                  resp = sample(c(0,1), size = 1000, replace = TRUE))
# functions
binary_outcome <- function(x, thr){
  
  x[x >= thr] <- 1L
  x[x  < thr] <- 0L
  x
}
true_positive <- function(bo, label){
  bo <- as.integer(bo); label <- as.integer(label)
  sum(bo == 1L & label == 1L)
}
false_positive <- function(bo, label){
  bo <- as.integer(bo); label <- as.integer(label)
  sum(bo == 1L & label == 0L)
}
true_negative <- function(bo, label){
  bo <- as.integer(bo); label <- as.integer(label)
  sum(bo == 0L & label == 0L)
}
false_negative <- function(bo, label){
  bo <- as.integer(bo); label <- as.integer(label)
  sum(bo == 0L & label == 1L)
}
dice <- function(pred, label, thr){
  label <- as.integer(label)
  bo <- binary_outcome(pred, thr)
  
  tp <- true_positive(bo, label)
  fp <- false_positive(bo, label)
  tn <- true_negative(bo, label)
  fn <- false_negative(bo, label)
  
  2 * tp /(2 * tp + fn + fp)
}

# Test
cp <- cutpointr(data = df_test, x = pred, class = resp, 
                method = maximize_metric, metric = F1_score)
#> Assuming the positive class is 0
#> Assuming the positive class has higher x values
cp
#> # A tibble: 1 x 16
#>   direction optimal_cutpoint method          F1_score   acc sensitivity
#>   <chr>                <dbl> <chr>              <dbl> <dbl>       <dbl>
#> 1 >=                 0.00700 maximize_metric    0.672 0.509       0.998
#>   specificity   AUC pos_class neg_class prevalence outcome predictor
#>         <dbl> <dbl>     <dbl>     <dbl>      <dbl> <chr>   <chr>    
#> 1      0.0141 0.521         0         1      0.503 resp    pred     
#>   data                 roc_curve                 boot 
#>   <list>               <list>                    <lgl>
#> 1 <tibble [1,000 x 2]> <data.frame [1,001 x 10]> NA
dice(pred = df_test$pred, label = df_test$resp, thr = 0.00700)
#> [1] 0.6586022

# including the precise cutpoint
dice(pred = df_test$pred, label = df_test$resp, thr = cp$optimal_cutpoint)
#> [1] 0.6581598

Created on 2018-07-20 by the reprex package (v0.2.0).

plot_sensitivity_specificity not plotting

Something is wrong with the plot_sensitivity_specificity command. When I run this, I get an empty ggplot structure with only a vertical line at the number 2.

Is this correct?

library(cutpointr)
data(suicide)
opt_cut <- cutpointr(suicide, dsi, suicide)
plot_sensitivity_specificity(opt_cut)

Thanks

Documentation and cutpointr output suggestion

Thank you for developing this really great package! I have three suggestions/enhancements for this package.

First, the output of the cutpointr() function contains a column that is named after the metric that the user chooses. So if metric = "youden", then the column will be called youden. I suggest changing this column name to metric_value and adding a new column called metric that gives the character string of the metric used (in this case youden). This will enable users to run the analysis for multiple metrics, bind the results together by row, and use tidyverse functions to wrangle the output.

Second, when adding subgroups, the cutpointr() output has a new column subgroup and therefore no longer matches the function output from the non-subgroup case. I suggest adding a subgroup column with the character string none or something similar in the non-subgroup case so that these two outputs can be bound if the user wants to do this.

Last, this is sort of a question and a suggestion. Could you please change the boot_stratify explanation to say something like "...keeping a proportionate number of positives and negatives, as in the full dataset, when resampling"? As it's written now, it sounds like the number of positives is exactly equal to the number of negatives.

Thanks again!
