iregnet's People

Contributors

anujkhare, rovervan, tdhock, theadityasam

iregnet's Issues

The optimization branch

The link for benchmarks: https://theadityasam.github.io/iregbenchmark/benchmarks.html

I encountered a few issues, which I'll list here:

  • For prostate, something weird happens: for some particular dataset sizes the runtimes drop drastically (by about 1000x!), so the plot looks very skewed. Strangely, this happens only when the optimised iregnet is added to the benchmarking tests, not when the benchmark is run on master iregnet and glmnet alone.
  • For 100000 observations, we see the pattern we expect: glmnet stays consistent, whereas iregnet and optimised iregnet scale linearly, with the optimised branch being faster.
  • penalty.learning produces NAN errors in the optimised branch but not in the master branch.
  • While trying the neuroblastoma data, iregnet works fine but the optimised branch produces the following error:
error: Row::subvec(): indices out of bounds or incorrectly used
Error in fit_cpp(x, y, family, alpha, lambda_path = lambda, num_lambda = num_lambda,  : 
  Row::subvec(): indices out of bounds or incorrectly used

Add some "developer documentation" to the README

I'm revisiting this code after a long time and, being out of touch with R development, I don't remember basic things like:

  1. How do I set up a development environment?
  2. How do I run the tests?
  3. How do I build the API documentation?

These are simple commands, so it should be easy enough to collect them in one place. Maybe we can add a Makefile for this?
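
In the meantime, here is a hedged sketch of the standard devtools workflow (generic R-package commands, not iregnet-specific tooling), run from a checkout of the repo:

install.packages(c("devtools", "roxygen2", "testthat"))
devtools::install_deps(dependencies = TRUE)  ## 1. set up the development environment
devtools::test()                             ## 2. run the tests
devtools::document()                         ## 3. rebuild the API documentation (man/ pages)
devtools::check()                            ## full R CMD check before pushing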

Summary of weekly calls

Monday, 5th June 2017

Updates:

  • Armadillo + OpenBlas linking issues
  • Steps 1-4 in proposal locally done, need to clean and commit

(Some) Next steps:

  • For any Armadillo issues, contact the maintainers (Dirk is super responsive)
  • Put config for the src in iregnet/src/makevars
  • Set up Continuous Integration performance testing using Rperform
  • Move the code into a branch on anujkhare/iregnet
  • (Anuj will fix Travis tests on anujkhare/iregnet/master)
  • Read about git-flow workflow
    • We don't completely follow it, but it is a good read!
  • Power on!!

@tdhock @RoverVan

Deviance

Deviance is just the residual taken to the log-likelihood scale.
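
For reference, the standard definition of the deviance residual used by survreg-type models (stated here as an assumption, not a quote of the iregnet code) is

d_i = \operatorname{sign}(y_i - \hat y_i)\,\sqrt{2\left(\ell_i^{\mathrm{sat}} - \ell_i(\hat\theta)\right)}

where \ell_i(\hat\theta) is the fitted log-likelihood contribution of observation i and \ell_i^{\mathrm{sat}} is its contribution under the saturated model.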

"Extreme_value" distribution not supported in cv.iregnet()

"Extreme_value" not supported distribution in cv.iregnet() function

> cv.iregnet(X, Y, family = "extreme_value")
Error in stopifnot_error(paste("family must be one of", paste(names(pfun.list),  : 
  family must be one of gaussian, logistic, exponential

Iregnet package version:

> packageDescription("iregnet")
Package: iregnet
Type: Package
Title: Regularized interval regression
Version: 0.1.0.9000
Author: Anuj Khare <[email protected]>, Toby D Hocking <[email protected]>,
          Jelle Goeman, Aditya Samantaray <[email protected]>
Maintainer: Anuj Khare <[email protected]>, Aditya Samantaray
          <[email protected]>
Description: Interval regression with four types of censoring and elastic net
          regularization.
License: GPL-3
LazyData: TRUE
Suggests: ElemStatLearn, glmnet, testthat, knitr, rmarkdown
LinkingTo: Rcpp
Depends: R (>= 2.10)
Imports: ggplot2, utils, methods, stats, survival, future.apply, data.table, Matrix,
          namedCapture, penaltyLearning, SpatialExtremes
RoxygenNote: 6.1.1
Remotes: tdhock/penaltyLearning
VignetteBuilder: knitr
RemoteType: github
RemoteHost: api.github.com
RemoteRepo: iregnet
RemoteUsername: anujkhare
RemoteRef: master
RemoteSha: 89cc904894495511b801f71ff99cbfed6043dd97
GithubRepo: iregnet
GithubUsername: anujkhare
GithubRef: master
GithubSHA1: 89cc904894495511b801f71ff99cbfed6043dd97
NeedsCompilation: yes
Packaged: 2019-09-11 18:52:54 UTC; andrewhurst
Built: R 3.6.1; x86_64-apple-darwin15.6.0; 2019-09-11 18:52:55 UTC; unix

-- File: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/iregnet/Meta/package.rds 

variable names in returned beta matrix

hey @anujkhare I ran your code on the ovarian data set

data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1, ecog.ps=ecog.ps-1))
y_l <- ovarian$futime
y_r <- ovarian$futime
y_r[ovarian$fustat == 0] <- NA
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")

currently we have

> fit$beta[, 1:7]
         [,1]     [,2]     [,3]     [,4]     [,5]    [,6]     [,7]
[1,] 6.772081 6.772118 6.772103 6.772112 6.772108 6.77211 6.772109
[2,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
[3,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
[4,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
[5,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
> 

Can you please add row names to the beta matrix, i.e. "(Intercept)" followed by the column names of the input X matrix, the same as glmnet?
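
Until that is implemented, the user-level workaround used in the lambda-grid issue further below works:

## interim workaround: set the row names on the returned beta matrix
rownames(fit$beta) <- c("(Intercept)", colnames(X))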

Inf limits?

Hey @anujkhare I ran your code on the ovarian data set and I noticed that Inf limits make the solver silently fail:

data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1, ecog.ps=ecog.ps-1))
y_l <- ovarian$futime
y_r <- ovarian$futime
y_r[ovarian$fustat == 0] <- Inf
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")

current result is

> fit$beta[, 1:7]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  NaN  Inf  Inf  Inf  Inf  Inf  Inf
[2,]  NaN    0    0    0    0    0    0
[3,]  NaN    0    0    0    0    0    0
[4,]  NaN    0    0    0    0    0    0
[5,]  NaN    0    0    0    0    0    0

two possible solutions:

  • stop with an error if there are any Inf values in the output limits, or
  • treat Inf the same as NA (a minimal sketch of this option is given below).
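
A minimal sketch of the second option as a user-level preprocessing step, reusing X and y_surv from the example above (this is not current iregnet behaviour):

## map infinite limits to NA before fitting, so Inf is treated like the
## missing limits in the NA example from the previous issue
y_clean <- y_surv
y_clean[is.infinite(y_clean)] <- NA
fit <- iregnet(X, y_clean, family = "gaussian")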

run on benchmark data sets

hi @theadityasam @anujkhare I wrote this R script https://github.com/tdhock/neuroblastoma-data/blob/master/iregnet.R to benchmark the performance of cv.iregnet.

The benchmark includes 33 different data sets, each with several train/test splits (designated by fold ID numbers in the folds.csv file).

To run the code, please clone that repo (git clone https://github.com/tdhock/neuroblastoma-data.git) and then run the "iregnet.R" script.

The script runs cv.iregnet with the scale fixed at 1 (iregnet.scale1), and then with estimate_scale=TRUE (iregnet.estscale).

It also runs the baselines penaltyLearning::IntervalRegressionCV (penaltyLearning.scale1) and a constant model that always predicts 0.

Initial results indicate room for improvement, especially in terms of speed. For example, I got these results on one data set for test fold ID 1:

> (pred.err <- do.call(rbind, pred.err.list))

         data.name test.fold              algorithm seconds threshold labels
1: ATAC_JV_adipose         1               constant    0.00 predicted    130
2: ATAC_JV_adipose         1         iregnet.scale1   51.80 predicted    130
3: ATAC_JV_adipose         1       iregnet.estscale   52.05 predicted    130
4: ATAC_JV_adipose         1 penaltyLearning.scale1    4.99 predicted    130
   possible.fp possible.fn  min.thresh  max.thresh  fp fn errors       FPR  tp
1:         130         128        -Inf 6.783764882 130  0    130 1.0000000 128
2:         130         128 -0.02067610 0.065002854  20 23     43 0.1538462 105
3:         130         128 -0.04882619 0.032067314  24 21     45 0.1846154 107
4:         130         128 -0.01867027 0.009351324  24 19     43 0.1846154 109
         TPR error.percent       auc
1: 1.0000000     100.00000 0.8881611
2: 0.8203125      33.07692 0.8785457
3: 0.8359375      34.61538 0.8791466
4: 0.8515625      33.07692 0.8780048
> 

The results indicate that iregnet (with fixed or estimated scale) is about 10 times slower than penaltyLearning (50 seconds vs 5 seconds).

I also got a warning message that the algorithm failed to converge, so please investigate.

summary of calls with aditya

16 May 2019

  • Linear regression: y_i ~ N(w^T x_i, \sigma^2), with y_i = log t_i.
  • Equivalently y_i ~ N(f(x_i), \sigma^2), where f(x_i) = w^T x_i, or with an explicit intercept f(x_i) = b_0 + b^T x_i.
  • Uncensored y_i => the likelihood contribution is the density.
  • Censored y_i => the likelihood contribution is the cumulative distribution function.
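
Spelling out the last two bullets (this is the standard interval-regression/AFT likelihood, included here for reference rather than quoted from the iregnet code): with mean \mu_i = f(x_i), scale \sigma, density p and CDF F of the chosen distribution, and observed limits l_i and u_i, the likelihood contribution of observation i is

L_i =
\begin{cases}
\tfrac{1}{\sigma}\, p\!\left(\tfrac{y_i - \mu_i}{\sigma}\right) & \text{uncensored} \\
1 - F\!\left(\tfrac{l_i - \mu_i}{\sigma}\right) & \text{right censored} \\
F\!\left(\tfrac{u_i - \mu_i}{\sigma}\right) & \text{left censored} \\
F\!\left(\tfrac{u_i - \mu_i}{\sigma}\right) - F\!\left(\tfrac{l_i - \mu_i}{\sigma}\right) & \text{interval censored}
\end{cases}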

Aditya said that the NAN error in iregnet only happens when the train labels have no lower limits or no upper limits. In this case the maximum likelihood model is undefined, so you should stop with an error that tells the user that at least one upper limit and one lower limit are required.

cv.iregnet: you should check that each train set/split has at least one upper and one lower limit, and otherwise stop with an informative error before running the optimization/iregnet. For example, here is the error message I use in penaltyLearning::IntervalRegressionCV, which tells the user which limits are missing:
https://github.com/tdhock/penaltyLearning/blob/master/R/IntervalRegression.R#L189
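
A sketch of the suggested check, assuming the 2-column y matrix with NA (or Inf) marking missing limits; this is not the penaltyLearning implementation linked above:

## stop early if a train set has no finite lower limits or no finite
## upper limits, since the maximum likelihood model is then undefined
check_limits <- function(y) {
  if (!any(is.finite(y[, 1])) || !any(is.finite(y[, 2]))) {
    stop("train labels must contain at least one finite lower limit ",
         "and one finite upper limit")
  }
}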

Integration of Surv object

We should accept a Surv object from the survival package for the output values, rather than the 2-column y matrix we use now.
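
In the meantime, a minimal sketch of a conversion helper (hypothetical, not a proposed API) that turns a right-censored Surv object into the 2-column matrix iregnet currently expects:

library(survival)
## right-censored Surv -> 2-column interval matrix, with NA upper limits
## for censored rows as in the examples elsewhere in these issues
surv_to_interval <- function(s) {
  stopifnot(inherits(s, "Surv"), attr(s, "type") == "right")
  time <- s[, "time"]
  status <- s[, "status"]
  cbind(lower = time, upper = ifelse(status == 1, time, NA))
}
## usage: y_surv <- log(surv_to_interval(Surv(ovarian$futime, ovarian$fustat)))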

large lambda values?

Hey @anujkhare, there seems to be some problem with the first lambda value (too large, 1e35) and with the first few lambda values (too many of them yield a weight vector of all zeros). Can you double-check the lambda grid computation? It would be best if the first lambda value gave an intercept-only model and the second lambda value gave at least one non-zero element in the weight vector. A sketch of the usual glmnet-style computation is given after the output below.

Right now I see

data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1))
y_l <- ovarian$futime
y_r <- ovarian$futime
y_r[ovarian$fustat == 0] <- NA
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")
rownames(fit$beta) <- c("(Intercept)", colnames(X))
t(with(fit, rbind(lambda=lambda, scale=scale, beta))[,1:14])
> t(with(fit, rbind(lambda=lambda, scale=scale, beta))[,1:14])
            lambda    scale (Intercept)          age residual.disease treatment
 [1,] 1.000000e+35 1.265787    6.772081  0.000000000                0         0
 [2,] 1.175100e+01 1.265758    6.772118  0.000000000                0         0
 [3,] 1.070728e+01 1.265775    6.772103  0.000000000                0         0
 [4,] 9.756176e+00 1.265768    6.772112  0.000000000                0         0
 [5,] 8.889411e+00 1.265772    6.772108  0.000000000                0         0
 [6,] 8.099720e+00 1.265770    6.772110  0.000000000                0         0
 [7,] 7.380153e+00 1.265771    6.772109  0.000000000                0         0
 [8,] 6.724524e+00 1.265771    6.772110  0.000000000                0         0
 [9,] 6.127134e+00 1.265771    6.772110  0.000000000                0         0
[10,] 5.582817e+00 1.265771    6.772110  0.000000000                0         0
[11,] 5.086855e+00 1.265771    6.772110  0.000000000                0         0
[12,] 4.634953e+00 1.265771    6.772110  0.000000000                0         0
[13,] 4.682283e+00 1.202117    7.036940 -0.005226961                0         0
[14,] 4.762457e+00 1.137780    7.319351 -0.010762496                0         0
> 
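
For reference, a rough sketch of how glmnet-style packages usually choose lambda_max (this is an assumption about the intended computation, not the iregnet code): lambda_max is the smallest lambda at which every non-intercept coefficient is zero, i.e. the largest absolute gradient of the unpenalized loss at the intercept-only model, divided by alpha.

## r: working residuals at the intercept-only fit (hypothetical input)
lambda_grid <- function(X, r, alpha = 1, n_lambda = 100, ratio = 1e-3) {
  n <- nrow(X)
  lambda_max <- max(abs(crossprod(scale(X), r))) / (n * alpha)
  ## log-spaced path from lambda_max down to ratio * lambda_max
  exp(seq(log(lambda_max), log(lambda_max * ratio), length.out = n_lambda))
}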

weights parameter

Hey @anujkhare how hard would it be to modify the C++ code to use weights w_i > 0 specific to each observation/row i? The weights would affect the loss/likelihood function. Right now the objective function is

min_f sum_i loss(y_i, f(x_i)) + Penalty(f) 

but with weights it would become

min_f sum_i w_i loss(y_i, f(x_i)) + Penalty(f)

I don't know of any data sets where this is absolutely essential for iregnet, but it is typically useful in glmnet for logistic regression models (binary classification problems) where there are a lot more of one class than another. For now I don't think we should worry about implementing this, but I was just wondering if you had thought about it.
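
To make the idea concrete, here is a minimal sketch of where w_i would enter, written for uncensored Gaussian observations only (illustrative, not the C++ code):

## weighted negative log-likelihood; eta = X %*% beta, scale = sigma
weighted_gaussian_nll <- function(w, y, eta, scale) {
  -sum(w * dnorm(y, mean = eta, sd = scale, log = TRUE))
}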

Print function

We would like to have a basic print function which displays only a few important things.
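
A minimal sketch of what such a print method could show, assuming only the fit fields ($lambda, $beta) that appear elsewhere in these issues:

print.iregnet <- function(x, ...) {
  cat("iregnet fit\n")
  cat("  lambda path length:", length(x$lambda), "\n")
  cat("  coefficients per model:", nrow(x$beta), "\n")
  invisible(x)
}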

Release to CRAN

  1. Add a version to the package

@tdhock Besides versioning the package and making sure that R CMD check passes, is there anything else that we need to do?

In general, we can do a lot more testing. @theadityasam: there are a bunch of datasets that Toby mentioned on this page. It'd be a good exercise for us to fit models on these using cv.iregnet.
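
For the release itself, the generic devtools steps (not iregnet-specific tooling) would be roughly:

devtools::document()  ## regenerate man pages and NAMESPACE
devtools::check()     ## R CMD check; should be free of errors, warnings and notes
devtools::release()   ## interactive checklist for submitting to CRAN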

compilation warnings

Hi @anujkhare @theadityasam I am getting these compiler warnings on my system. Although the code compiles/installs, it would be good to fix those warnings when you get a chance.

c:/Rtools/mingw_64/bin/g++  -I"c:/PROGRA~1/R/R-36~1.1/include" -DNDEBUG  -I"C:/Users/th798/R/win-library/3.6/Rcpp/include"        -O2 -Wall  -mtune=generic -c distributions.cpp -o distributions.o
distributions.cpp: In function 'double compute_grad_response(double*, double*, double*, const double*, const double*, const double*, double, const IREG_CENSORING*, long long unsigned int, IREG_DIST, double*, bool)':
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_LOG_GAUSSIAN' not handled in switch [-Wswitch]
   switch(dist) {
         ^
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_LOG_LOGISTIC' not handled in switch [-Wswitch]
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_EXPONENTIAL' not handled in switch [-Wswitch]
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_WEIBULL' not handled in switch [-Wswitch]
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_UNKNOWN' not handled in switch [-Wswitch]
distributions.cpp: In function 'Rcpp::List iregnet_compute_gradients(Rcpp::NumericMatrix, Rcpp::NumericMatrix, Rcpp::NumericVector, double, Rcpp::String)':
distributions.cpp:426:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull j = 0; j < n_vars; ++j) {
                     ^
distributions.cpp:428:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (ull i = 0; i < n_obs; ++i) {
                       ^
distributions.cpp: In function 'double compute_grad_response(double*, double*, double*, const double*, const double*, const double*, double, const IREG_CENSORING*, long long unsigned int, IREG_DIST, double*, bool)':
distributions.cpp:229:25: warning: 'ddsig' may be used uninitialized in this function [-Wmaybe-uninitialized]
       ddsig_sum += ddsig;
                         ^
distributions.cpp:228:23: warning: 'dsig' may be used uninitialized in this function [-Wmaybe-uninitialized]
       dsig_sum += dsig;
                       ^
distributions.cpp:219:26: warning: 'ddg' may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (dsig == 0 || ddg == 0)
                          ^
distributions.cpp:222:30: warning: 'dg' may be used uninitialized in this function [-Wmaybe-uninitialized]
       response = eta[i] - dg / ddg;
                              ^
c:/Rtools/mingw_64/bin/g++  -I"c:/PROGRA~1/R/R-36~1.1/include" -DNDEBUG  -I"C:/Users/th798/R/win-library/3.6/Rcpp/include"        -O2 -Wall  -mtune=generic -c iregnet_fit.cpp -o iregnet_fit.o
iregnet_fit.cpp: In function 'Rcpp::List fit_cpp(Rcpp::NumericMatrix, Rcpp::NumericMatrix, Rcpp::String, Rcpp::NumericVector, int, Rcpp::IntegerVector, bool, double, double, bool, bool, bool, double, double, int, double)':
iregnet_fit.cpp:126:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(ull i = 0; i < num_lambda; ++i) {
                      ^
iregnet_fit.cpp:339:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull m = 0; m <= end_ind; ++m) {
                     ^
iregnet_fit.cpp:212:10: warning: variable 'lambda_max_unscaled' set but not used [-Wunused-but-set-variable]
   double lambda_max_unscaled;
          ^
iregnet_fit.cpp: In function 'double get_y_means(Rcpp::NumericMatrix&, IREG_CENSORING*, double*)':
iregnet_fit.cpp:382:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = 0; i < y.nrow(); ++i) {
                     ^
iregnet_fit.cpp: In function 'void get_censoring_types(Rcpp::NumericMatrix&, IREG_CENSORING*)':
iregnet_fit.cpp:407:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = 0; i < y.nrow(); ++i) {
                     ^
iregnet_fit.cpp: In function 'void standardize_x(Rcpp::NumericMatrix&, double*, double*, bool)':
iregnet_fit.cpp:467:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = int(intercept); i < X.ncol(); ++i) {  // don't standardize intercept col.
                                  ^
iregnet_fit.cpp:469:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (ull j = 0; j < X.nrow(); ++j) {
                       ^
iregnet_fit.cpp:474:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (ull j = 0; j < X.nrow(); ++j) {
                       ^
iregnet_fit.cpp:482:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (ull j = 0; j < X.nrow(); ++j) {
                       ^
iregnet_fit.cpp: In function 'void standardize_y(Rcpp::NumericMatrix&, double*, double&)':
iregnet_fit.cpp:491:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = 0; i < y.nrow() * y.ncol(); ++i) {
                     ^
iregnet_fit.cpp:494:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (ull i = 0; i < y.nrow(); ++i)
                     ^
iregnet_fit.cpp: In function 'double get_init_var(double*, IREG_CENSORING*, long long unsigned int, IREG_DIST)':
iregnet_fit.cpp:503:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (int i = 0; i < n; ++i) {
                     ^
iregnet_fit.cpp:508:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (int i = 0; i < n; ++i) {
                     ^
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_GAUSSIAN' not handled in switch [-Wswitch]
   switch (dist) {
          ^
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_LOG_GAUSSIAN' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_LOG_LOGISTIC' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_EXPONENTIAL' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_WEIBULL' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_UNKNOWN' not handled in switch [-Wswitch]
c:/Rtools/mingw_64/bin/g++ -shared -s -static-libgcc -o iregnet.dll tmp.def RcppExports.o distributions.o iregnet_fit.o -Lc:/PROGRA~1/R/R-36~1.1/bin/x64 -lR
installing to C:/Users/th798/R/win-library/3.6/00LOCK-iregnet/00new/iregnet/libs/x64

NAN Error

So currently I believe the NAN issue occurs when the data is either completely right censored or left truncated. I'll introduce an error message in the get_censoring_types function of iregnet_fit.cpp whenever this condition is met, and then test it in various cases to see if the NAN error still occurs. If it does, we need to discuss whether we have missed any other cause for the error.

error for invalid output intervals?

Hey @anujkhare, would it be possible to stop with an error when the upper limit is less than the lower limit? For example:

data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1, ecog.ps=ecog.ps-1))
y_l <- ovarian$futime
y_r <- ovarian$futime-1
y_r[ovarian$fustat == 0] <- NA
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")

I see that we get fit$error_status == 1, but it would also be nice to stop with an error that tells the user there is a problem with the outputs.
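
A hypothetical pre-fit check for this case (not current iregnet behaviour), reusing y_surv from the example above:

## report the offending rows instead of failing silently
bad <- which(y_surv[, 2] < y_surv[, 1])
if (length(bad)) {
  stop("upper limit is less than the lower limit for observations: ",
       paste(bad, collapse = ", "))
}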

ovarian and lung data sets?

Hey @anujkhare, are the ovarian and lung data sets really just copies from the survival package? If so, there is no need to actually include them, since the survival package (and those data sets) ships with every copy of R (survival is a recommended package).

Add license

@tdhock It'd be good to add a license file to this repo. I think that a permissive MIT license is okay. What do you think?
