anujkhare / iregnet Goto Github PK

View Code? Open in Web Editor NEW

7.0 7.0 12.0 1.49 MB

R 64.01% C++ 35.02% C 0.98%

iregnet's People

Contributors

Stargazers

Watchers

Forkers

tdhock yugam1 engineerreversed parismita shets rovervan devinderkaur mu4bu2 theadityasam wlemieux03 georgheinze

iregnet's Issues

The optimization branch

The link for benchmarks: https://theadityasam.github.io/iregbenchmark/benchmarks.html

I encountered few issues which I'll list over here:

For prostate, a weird thing happens, for some particular dataset sizes the runtimes reduce drastically(by x1000 times!), hence the plot looks very skewed. Strangely this happens only when iregnet optimised is added to the benchmarking tests; not when benchmark is performed on master iregnet and glmnet
For 100000 observations, we see a pattern we expect, glmnet stays consistent whereas iregnet and iregnet optimised are linear with the optimised branch being faster
penalty.learning produces NAN errors in the optimised branch but not in the master branch
While trying with neuroblastoma, iregnet works fine but optimised branch produces the following error:

error: Row::subvec(): indices out of bounds or incorrectly used
Error in fit_cpp(x, y, family, alpha, lambda_path = lambda, num_lambda = num_lambda,  : 
  Row::subvec(): indices out of bounds or incorrectly used

Add some "developer documentation" to the README

I'm revisiting this code after a long time and being out of touch with R development, don't remember basic things like:

How do you set up a development environment?
How do I run the tests?
How to compile the API documentation?

These are simple commands, might be easy enough to compile them. Maybe, we can add a makefile for this?

Monday, 5th June 2017

Updates:

Armadillo + OpenBlas linking issues
Steps 1-4 in proposal locally done, need to clean and commit

(Some) Next steps:

For any Armadillo issues, contact the maintainers (Dirk is super responsive)
Put config for the src in iregnet/src/makevars
Set up Continuous Integration performance testing using Rperform
Move the code into a branch on anujkhare/iregnet
(Anuj will fix Travis tests on anujkhare/iregnet/master)
Read about git-flow workflow
- We don't completely follow it, but it is a good read!
Power on!!

@tdhock @RoverVan

Deviance

Deviance is just the residual taken to the log-likelihood scale.

"Extreme_value" distribution not supported in cv.iregnet()

"Extreme_value" not supported distribution in cv.iregnet() function

> cv.iregnet(X, Y ,  fmaily = "extreme_value")
Error in stopifnot_error(paste("family must be one of", paste(names(pfun.list),  : 
  family must be one of gaussian, logistic, exponential

Iregnet package version:

> packageDescription("iregnet")
Package: iregnet
Type: Package
Title: Regularized interval regression
Version: 0.1.0.9000
Author: Anuj Khare <[email protected]>, Toby D Hocking <[email protected]>,
          Jelle Goeman, Aditya Samantaray <[email protected]>
Maintainer: Anuj Khare <[email protected]>, Aditya Samantaray
          <[email protected]>
Description: Interval regression with four types of censoring and elastic net
          regularization.
License: GPL-3
LazyData: TRUE
Suggests: ElemStatLearn, glmnet, testthat, knitr, rmarkdown
LinkingTo: Rcpp
Depends: R (>= 2.10)
Imports: ggplot2, utils, methods, stats, survival, future.apply, data.table, Matrix,
          namedCapture, penaltyLearning, SpatialExtremes
RoxygenNote: 6.1.1
Remotes: tdhock/penaltyLearning
VignetteBuilder: knitr
RemoteType: github
RemoteHost: api.github.com
RemoteRepo: iregnet
RemoteUsername: anujkhare
RemoteRef: master
RemoteSha: 89cc904894495511b801f71ff99cbfed6043dd97
GithubRepo: iregnet
GithubUsername: anujkhare
GithubRef: master
GithubSHA1: 89cc904894495511b801f71ff99cbfed6043dd97
NeedsCompilation: yes
Packaged: 2019-09-11 18:52:54 UTC; andrewhurst
Built: R 3.6.1; x86_64-apple-darwin15.6.0; 2019-09-11 18:52:55 UTC; unix

-- File: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/iregnet/Meta/package.rds

variable names in returned beta matrix

hey @anujkhare I ran your code on the ovarian data set

data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1, ecog.ps=ecog.ps-1))
y_l <- ovarian$futime
y_r <- ovarian$futime
y_r[ovarian$fustat == 0] <- NA
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")

currently we have

> fit$beta[, 1:7]
         [,1]     [,2]     [,3]     [,4]     [,5]    [,6]     [,7]
[1,] 6.772081 6.772118 6.772103 6.772112 6.772108 6.77211 6.772109
[2,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
[3,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
[4,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
[5,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
>

Can you please add row names to the beta matrix? (Intercept) and column names of input X matrix, same as glmnet.

Log-likelihood values inconsistent wrt survreg

For the unregularized solution returned by iregnet, the log-likelihood value is different from that of survreg for the same coefficients and scale parameter.

Inf limits?

Hey @anujkhare I ran your code on the ovarian data set and I noticed that Inf limits make the solver silently fail:

data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1, ecog.ps=ecog.ps-1))
y_l <- ovarian$futime
y_r <- ovarian$futime
y_r[ovarian$fustat == 0] <- Inf
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")

current result is

> fit$beta[, 1:7]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  NaN  Inf  Inf  Inf  Inf  Inf  Inf
[2,]  NaN    0    0    0    0    0    0
[3,]  NaN    0    0    0    0    0    0
[4,]  NaN    0    0    0    0    0    0
[5,]  NaN    0    0    0    0    0    0

two possible solutions:

stop with an error if there are any Inf values for the output.
treat Inf same as NA

run on benchmark data sets

hi @theadityasam @anujkhare I wrote this R script https://github.com/tdhock/neuroblastoma-data/blob/master/iregnet.R to benchmark the performance of cv.iregnet.

The benchmark includes 33 different data sets, each with several train/test splits (designated by fold ID numbers in folds.csv file).

To run the code please clone that repo, git clone https://github.com/tdhock/neuroblastoma-data.git and then run the "iregnet.R" script.

The script runs cv.iregnet with scale fixed at 1 (iregnet.scale1), and then with estimate_scale=TRUE. (iregnet.estscale)

It also runs baselines penaltyLearning::IntervalRegressionCV (penaltyLearning.scale1) and constant (always predict 0).

Initial results indicate room for improvement, especially in terms of speed. For example I got these results on one data set for test fold ID 1:

> (pred.err <- do.call(rbind, pred.err.list))

         data.name test.fold              algorithm seconds threshold labels
1: ATAC_JV_adipose         1               constant    0.00 predicted    130
2: ATAC_JV_adipose         1         iregnet.scale1   51.80 predicted    130
3: ATAC_JV_adipose         1       iregnet.estscale   52.05 predicted    130
4: ATAC_JV_adipose         1 penaltyLearning.scale1    4.99 predicted    130
   possible.fp possible.fn  min.thresh  max.thresh  fp fn errors       FPR  tp
1:         130         128        -Inf 6.783764882 130  0    130 1.0000000 128
2:         130         128 -0.02067610 0.065002854  20 23     43 0.1538462 105
3:         130         128 -0.04882619 0.032067314  24 21     45 0.1846154 107
4:         130         128 -0.01867027 0.009351324  24 19     43 0.1846154 109
         TPR error.percent       auc
1: 1.0000000     100.00000 0.8881611
2: 0.8203125      33.07692 0.8785457
3: 0.8359375      34.61538 0.8791466
4: 0.8515625      33.07692 0.8780048
>

The results indicate that iregnet (with fixed or estimated scale) is about 10 times slower than penaltyLearning. (50 seconds vs 5 seconds)

I got a warning message that the algo failed to converge so please investigate.

Tests: uncensored data with alpha < 1

Need to write tests for general elastic net regularization.

Tests: more predictor variables than observations

The case when number of variables is more than the number of observations.

16 May 2019
linear regression y_i ~ N( w^T x_i, \sigma^2)
log t_i = y_i
y_i ~ N( f(x_i), \sigma^2)
f(x_i) = w^T x_i
f(x_i) = b_0 + b^T x_i
uncensored y_i => likelihood is density
censored y_i => likelihood is cumulative distribution function

Aditya said that NAN error in iregnet only happens when train labels have no lower limits or no upper limits. in this case the max likelihood model is undefined, so you should stop with an error that tells the user that at least one upper limit and one lower limit are required.

cv.iregnet: you should check to make sure that each train set/split has at least one upper and one lower limit, otherwise stop with an informative error before running the optimization/iregnet. e.g. here is the error message I use in penaltyLearning::IntervalRegressionCV, which tells the user which
https://github.com/tdhock/penaltyLearning/blob/master/R/IntervalRegression.R#L189

NA's in x

Check for NA's in x?

Duplicate columns in covariate matrix

Glmnet explicitly removes duplicate columns in the covariate matrix before the fit is done. We should do something similar.

comparison with FISTA R code

Hey @anujkhare here is my R code for computing the gradient of AFT loss functions https://github.com/tdhock/glm-optimality/blob/master/figure-interval-loss-derivative.R but I did not fully consider the scale parameter when I wrote that code, so we need to double-check to make sure that those gradients agree with the Survival code

Tests: wrt survreg for distributions other than Guassian

We need to implement tests for non regularized data and non-Gaussian loss.

Integration of Surv object

We should accept a Surv object from the Survival package for the output values, rather than the 2-column y matrix as of now.

large lambda values?

Hey @anujkhare there seems to be some problem with the first lambda value (too large, 1e35) and the first few lambda values (too many lambda values which yield a weight vector of all zeros). Can you double-check the lambda grid computation? It would be best if the first lambda value gives an intercept-only model, and then the second lambda value gives at least one non-zero element in the weight vector.

Right now I see

data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1))
y_l <- ovarian$futime
y_r <- ovarian$futime
y_r[ovarian$fustat == 0] <- NA
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")
rownames(fit$beta) <- c("(Intercept)", colnames(X))
t(with(fit, rbind(lambda=lambda, scale=scale, beta))[,1:14])

> t(with(fit, rbind(lambda=lambda, scale=scale, beta))[,1:14])
            lambda    scale (Intercept)          age residual.disease treatment
 [1,] 1.000000e+35 1.265787    6.772081  0.000000000                0         0
 [2,] 1.175100e+01 1.265758    6.772118  0.000000000                0         0
 [3,] 1.070728e+01 1.265775    6.772103  0.000000000                0         0
 [4,] 9.756176e+00 1.265768    6.772112  0.000000000                0         0
 [5,] 8.889411e+00 1.265772    6.772108  0.000000000                0         0
 [6,] 8.099720e+00 1.265770    6.772110  0.000000000                0         0
 [7,] 7.380153e+00 1.265771    6.772109  0.000000000                0         0
 [8,] 6.724524e+00 1.265771    6.772110  0.000000000                0         0
 [9,] 6.127134e+00 1.265771    6.772110  0.000000000                0         0
[10,] 5.582817e+00 1.265771    6.772110  0.000000000                0         0
[11,] 5.086855e+00 1.265771    6.772110  0.000000000                0         0
[12,] 4.634953e+00 1.265771    6.772110  0.000000000                0         0
[13,] 4.682283e+00 1.202117    7.036940 -0.005226961                0         0
[14,] 4.762457e+00 1.137780    7.319351 -0.010762496                0         0
>

weights parameter

Hey @anujkhare how hard would it be to modify the C++ code to use weights w_i > 0 specific to each observation/row i? The weights would affect the loss/likelihood function. Right now the objective function is

min_f sum_i loss(y_i, f(x_i)) + Penalty(f)

but with weights it would become

min_f sum_i w_i loss(y_i, f(x_i)) + Penalty(f)

I don't know of any data sets where this is absolutely essential for iregnet, but it is typically useful in glmnet for logistic regression models (binary classification problems) where there are a lot more of one class than another. For now I don't think we should worry about implementing this, but I was just wondering if you had thought about it.

Fix code formatting

Print function

We would like to have a basic print function which displays only a few important things.

Release to CRAN

Add a version to the package

@tdhock Besides versioning the package and making sure that R CMD checks pass, is there anything else that we need to do?

In general, we can do a lot more testing. @theadityasam: there are a bunch of datasets that Toby mentioned on this page. It'd be a good exercise for us to fit models on these using cv.iregnet.

Make print.iregnet more informative

@theadityasam to add some more details

compilation warnings

Hi @anujkhare @theadityasam I am getting these compiler warnings on my system. Although the code compiles/installs, it would be good to fix those warnings when you get a chance.

c:/Rtools/mingw_64/bin/g++  -I"c:/PROGRA~1/R/R-36~1.1/include" -DNDEBUG  -I"C:/Users/th798/R/win-library/3.6/Rcpp/include"        -O2 -Wall  -mtune=generic -c distributions.cpp -o distributions.o
distributions.cpp: In function 'double compute_grad_response(double*, double*, double*, const double*, const double*, const double*, double, const IREG_CENSORING*, long long unsigned int, IREG_DIST, double*, bool)':
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_LOG_GAUSSIAN' not handled in switch [-Wswitch]
   switch(dist) {
         ^
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_LOG_LOGISTIC' not handled in switch [-Wswitch]
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_EXPONENTIAL' not handled in switch [-Wswitch]
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_WEIBULL' not handled in switch [-Wswitch]
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_UNKNOWN' not handled in switch [-Wswitch]
distributions.cpp: In function 'Rcpp::List iregnet_compute_gradients(Rcpp::NumericMatrix, Rcpp::NumericMatrix, Rcpp::NumericVector, double, Rcpp::String)':
distributions.cpp:426:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull j = 0; j < n_vars; ++j) {
                     ^
distributions.cpp:428:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (ull i = 0; i < n_obs; ++i) {
                       ^
distributions.cpp: In function 'double compute_grad_response(double*, double*, double*, const double*, const double*, const double*, double, const IREG_CENSORING*, long long unsigned int, IREG_DIST, double*, bool)':
distributions.cpp:229:25: warning: 'ddsig' may be used uninitialized in this function [-Wmaybe-uninitialized]
       ddsig_sum += ddsig;
                         ^
distributions.cpp:228:23: warning: 'dsig' may be used uninitialized in this function [-Wmaybe-uninitialized]
       dsig_sum += dsig;
                       ^
distributions.cpp:219:26: warning: 'ddg' may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (dsig == 0 || ddg == 0)
                          ^
distributions.cpp:222:30: warning: 'dg' may be used uninitialized in this function [-Wmaybe-uninitialized]
       response = eta[i] - dg / ddg;
                              ^
c:/Rtools/mingw_64/bin/g++  -I"c:/PROGRA~1/R/R-36~1.1/include" -DNDEBUG  -I"C:/Users/th798/R/win-library/3.6/Rcpp/include"        -O2 -Wall  -mtune=generic -c iregnet_fit.cpp -o iregnet_fit.o
iregnet_fit.cpp: In function 'Rcpp::List fit_cpp(Rcpp::NumericMatrix, Rcpp::NumericMatrix, Rcpp::String, Rcpp::NumericVector, int, Rcpp::IntegerVector, bool, double, double, bool, bool, bool, double, double, int, double)':
iregnet_fit.cpp:126:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(ull i = 0; i < num_lambda; ++i) {
                      ^
iregnet_fit.cpp:339:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull m = 0; m <= end_ind; ++m) {
                     ^
iregnet_fit.cpp:212:10: warning: variable 'lambda_max_unscaled' set but not used [-Wunused-but-set-variable]
   double lambda_max_unscaled;
          ^
iregnet_fit.cpp: In function 'double get_y_means(Rcpp::NumericMatrix&, IREG_CENSORING*, double*)':
iregnet_fit.cpp:382:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = 0; i < y.nrow(); ++i) {
                     ^
iregnet_fit.cpp: In function 'void get_censoring_types(Rcpp::NumericMatrix&, IREG_CENSORING*)':
iregnet_fit.cpp:407:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = 0; i < y.nrow(); ++i) {
                     ^
iregnet_fit.cpp: In function 'void standardize_x(Rcpp::NumericMatrix&, double*, double*, bool)':
iregnet_fit.cpp:467:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = int(intercept); i < X.ncol(); ++i) {  // don't standardize intercept col.
                                  ^
iregnet_fit.cpp:469:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (ull j = 0; j < X.nrow(); ++j) {
                       ^
iregnet_fit.cpp:474:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (ull j = 0; j < X.nrow(); ++j) {
                       ^
iregnet_fit.cpp:482:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (ull j = 0; j < X.nrow(); ++j) {
                       ^
iregnet_fit.cpp: In function 'void standardize_y(Rcpp::NumericMatrix&, double*, double&)':
iregnet_fit.cpp:491:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = 0; i < y.nrow() * y.ncol(); ++i) {
                     ^
iregnet_fit.cpp:494:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = 0; i < y.nrow(); ++i)
                     ^
iregnet_fit.cpp: In function 'double get_init_var(double*, IREG_CENSORING*, long long unsigned int, IREG_DIST)':
iregnet_fit.cpp:503:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (int i = 0; i < n; ++i) {
                     ^
iregnet_fit.cpp:508:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (int i = 0; i < n; ++i) {
                     ^
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_GAUSSIAN' not handled in switch [-Wswitch]
   switch (dist) {
          ^
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_LOG_GAUSSIAN' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_LOG_LOGISTIC' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_EXPONENTIAL' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_WEIBULL' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_UNKNOWN' not handled in switch [-Wswitch]
c:/Rtools/mingw_64/bin/g++ -shared -s -static-libgcc -o iregnet.dll tmp.def RcppExports.o distributions.o iregnet_fit.o -Lc:/PROGRA~1/R/R-36~1.1/bin/x64 -lR
installing to C:/Users/th798/R/win-library/3.6/00LOCK-iregnet/00new/iregnet/libs/x64

support weibull distribution

look at formulas in http://members.cbio.mines-paristech.fr/~thocking/survival.pdf

NAN Error

So currently I believe the NAN issue occurs when the data is either completely right censored or left truncated. I'll be introducing an error message in the get_censoring_types function of iregnet_fit.cpp whenever this condition is met and then will be testing it in various cases to see if the NAN error still occurs. If it does then I need to discuss if we have missed any other cause for the error.

error for invalid output intervals?

Hey @anujkhare would it be possible to stop with an error when the upper limit is in less than the lower limit? For example

data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1, ecog.ps=ecog.ps-1))
y_l <- ovarian$futime
y_r <- ovarian$futime-1
y_r[ovarian$fustat == 0] <- NA
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")

I see we have fit$error_status==1 but it would also be nice to stop with an error that tells the user that there is a problem with the outputs.

future instead of foreach please

Hi in recent years it has become apparent that the future framework should be preferred over foreach, so I would suggest using future.apply::future_lapply instead of foreach in cv.iregnet. For a usage example see https://github.com/tdhock/penaltyLearning/blob/master/R/IntervalRegression.R#L192

is that ok with you @anujkhare ?

is that something you could implement in a new PR this summer? @theadityasam

Check exponential distribution

License

@tdhock We need to set a license for the project to make it available on CRAN. Which license would you suggest? [List of licenses for R packages]
glmnet uses GPL-2, survival uses LGPL(>=2).

ovarian and lung data sets?

hey @anujkhare are the ovarian and lung data sets really just copies from the survival package? If so then there is no need to actually include them, since the survival package (and those data sets) is included with every copy of R. (survival is a recommended pakage)