iregnet's People
Forkers
tdhock yugam1 engineerreversed parismita shets rovervan devinderkaur mu4bu2 theadityasam wlemieux03 georgheinzeiregnet's Issues
The optimization branch
The link for benchmarks: https://theadityasam.github.io/iregbenchmark/benchmarks.html
I encountered few issues which I'll list over here:
- For prostate, a weird thing happens, for some particular dataset sizes the runtimes reduce drastically(by x1000 times!), hence the plot looks very skewed. Strangely this happens only when iregnet optimised is added to the benchmarking tests; not when benchmark is performed on master iregnet and glmnet
- For 100000 observations, we see a pattern we expect, glmnet stays consistent whereas iregnet and iregnet optimised are linear with the optimised branch being faster
- penalty.learning produces NAN errors in the optimised branch but not in the master branch
- While trying with neuroblastoma, iregnet works fine but optimised branch produces the following error:
error: Row::subvec(): indices out of bounds or incorrectly used
Error in fit_cpp(x, y, family, alpha, lambda_path = lambda, num_lambda = num_lambda, :
Row::subvec(): indices out of bounds or incorrectly used
Add some "developer documentation" to the README
I'm revisiting this code after a long time and being out of touch with R development, don't remember basic things like:
- How do you set up a development environment?
- How do I run the tests?
- How to compile the API documentation?
These are simple commands, might be easy enough to compile them. Maybe, we can add a makefile for this?
Summary of weekly calls
Monday, 5th June 2017
Updates:
- Armadillo + OpenBlas linking issues
- Steps 1-4 in proposal locally done, need to clean and commit
(Some) Next steps:
- For any Armadillo issues, contact the maintainers (Dirk is super responsive)
- Put config for the src in
iregnet/src/makevars
- Set up Continuous Integration performance testing using Rperform
- Move the code into a branch on anujkhare/iregnet
- (Anuj will fix Travis tests on anujkhare/iregnet/master)
- Read about git-flow workflow
- We don't completely follow it, but it is a good read!
- Power on!!
Deviance
Deviance is just the residual taken to the log-likelihood scale.
"Extreme_value" distribution not supported in cv.iregnet()
"Extreme_value" not supported distribution in cv.iregnet() function
> cv.iregnet(X, Y , fmaily = "extreme_value")
Error in stopifnot_error(paste("family must be one of", paste(names(pfun.list), :
family must be one of gaussian, logistic, exponential
Iregnet package version:
> packageDescription("iregnet")
Package: iregnet
Type: Package
Title: Regularized interval regression
Version: 0.1.0.9000
Author: Anuj Khare <[email protected]>, Toby D Hocking <[email protected]>,
Jelle Goeman, Aditya Samantaray <[email protected]>
Maintainer: Anuj Khare <[email protected]>, Aditya Samantaray
<[email protected]>
Description: Interval regression with four types of censoring and elastic net
regularization.
License: GPL-3
LazyData: TRUE
Suggests: ElemStatLearn, glmnet, testthat, knitr, rmarkdown
LinkingTo: Rcpp
Depends: R (>= 2.10)
Imports: ggplot2, utils, methods, stats, survival, future.apply, data.table, Matrix,
namedCapture, penaltyLearning, SpatialExtremes
RoxygenNote: 6.1.1
Remotes: tdhock/penaltyLearning
VignetteBuilder: knitr
RemoteType: github
RemoteHost: api.github.com
RemoteRepo: iregnet
RemoteUsername: anujkhare
RemoteRef: master
RemoteSha: 89cc904894495511b801f71ff99cbfed6043dd97
GithubRepo: iregnet
GithubUsername: anujkhare
GithubRef: master
GithubSHA1: 89cc904894495511b801f71ff99cbfed6043dd97
NeedsCompilation: yes
Packaged: 2019-09-11 18:52:54 UTC; andrewhurst
Built: R 3.6.1; x86_64-apple-darwin15.6.0; 2019-09-11 18:52:55 UTC; unix
-- File: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/iregnet/Meta/package.rds
variable names in returned beta matrix
hey @anujkhare I ran your code on the ovarian data set
data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1, ecog.ps=ecog.ps-1))
y_l <- ovarian$futime
y_r <- ovarian$futime
y_r[ovarian$fustat == 0] <- NA
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")
currently we have
> fit$beta[, 1:7]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 6.772081 6.772118 6.772103 6.772112 6.772108 6.77211 6.772109
[2,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
[3,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
[4,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
[5,] 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
>
Can you please add row names to the beta matrix? (Intercept)
and column names of input X matrix, same as glmnet.
Log-likelihood values inconsistent wrt survreg
For the unregularized solution returned by iregnet, the log-likelihood value is different from that of survreg for the same coefficients and scale parameter.
Predict function
Inf limits?
Hey @anujkhare I ran your code on the ovarian data set and I noticed that Inf
limits make the solver silently fail:
data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1, ecog.ps=ecog.ps-1))
y_l <- ovarian$futime
y_r <- ovarian$futime
y_r[ovarian$fustat == 0] <- Inf
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")
current result is
> fit$beta[, 1:7]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] NaN Inf Inf Inf Inf Inf Inf
[2,] NaN 0 0 0 0 0 0
[3,] NaN 0 0 0 0 0 0
[4,] NaN 0 0 0 0 0 0
[5,] NaN 0 0 0 0 0 0
two possible solutions:
- stop with an error if there are any
Inf
values for the output. - treat
Inf
same asNA
run on benchmark data sets
hi @theadityasam @anujkhare I wrote this R script https://github.com/tdhock/neuroblastoma-data/blob/master/iregnet.R to benchmark the performance of cv.iregnet.
The benchmark includes 33 different data sets, each with several train/test splits (designated by fold ID numbers in folds.csv file).
To run the code please clone that repo, git clone https://github.com/tdhock/neuroblastoma-data.git
and then run the "iregnet.R" script.
The script runs cv.iregnet with scale fixed at 1 (iregnet.scale1), and then with estimate_scale=TRUE. (iregnet.estscale)
It also runs baselines penaltyLearning::IntervalRegressionCV (penaltyLearning.scale1) and constant (always predict 0).
Initial results indicate room for improvement, especially in terms of speed. For example I got these results on one data set for test fold ID 1:
> (pred.err <- do.call(rbind, pred.err.list))
data.name test.fold algorithm seconds threshold labels
1: ATAC_JV_adipose 1 constant 0.00 predicted 130
2: ATAC_JV_adipose 1 iregnet.scale1 51.80 predicted 130
3: ATAC_JV_adipose 1 iregnet.estscale 52.05 predicted 130
4: ATAC_JV_adipose 1 penaltyLearning.scale1 4.99 predicted 130
possible.fp possible.fn min.thresh max.thresh fp fn errors FPR tp
1: 130 128 -Inf 6.783764882 130 0 130 1.0000000 128
2: 130 128 -0.02067610 0.065002854 20 23 43 0.1538462 105
3: 130 128 -0.04882619 0.032067314 24 21 45 0.1846154 107
4: 130 128 -0.01867027 0.009351324 24 19 43 0.1846154 109
TPR error.percent auc
1: 1.0000000 100.00000 0.8881611
2: 0.8203125 33.07692 0.8785457
3: 0.8359375 34.61538 0.8791466
4: 0.8515625 33.07692 0.8780048
>
The results indicate that iregnet (with fixed or estimated scale) is about 10 times slower than penaltyLearning. (50 seconds vs 5 seconds)
I got a warning message that the algo failed to converge so please investigate.
Tests: uncensored data with alpha < 1
Need to write tests for general elastic net regularization.
Tests: more predictor variables than observations
The case when number of variables is more than the number of observations.
summary of calls with aditya
16 May 2019
linear regression y_i ~ N( w^T x_i, \sigma^2)
log t_i = y_i
y_i ~ N( f(x_i), \sigma^2)
f(x_i) = w^T x_i
f(x_i) = b_0 + b^T x_i
uncensored y_i => likelihood is density
censored y_i => likelihood is cumulative distribution function
Aditya said that NAN error in iregnet only happens when train labels have no lower limits or no upper limits. in this case the max likelihood model is undefined, so you should stop with an error that tells the user that at least one upper limit and one lower limit are required.
cv.iregnet: you should check to make sure that each train set/split has at least one upper and one lower limit, otherwise stop with an informative error before running the optimization/iregnet. e.g. here is the error message I use in penaltyLearning::IntervalRegressionCV, which tells the user which
https://github.com/tdhock/penaltyLearning/blob/master/R/IntervalRegression.R#L189
NA's in x
Check for NA's in x?
Duplicate columns in covariate matrix
Glmnet explicitly removes duplicate columns in the covariate matrix before the fit is done. We should do something similar.
comparison with FISTA R code
Hey @anujkhare here is my R code for computing the gradient of AFT loss functions https://github.com/tdhock/glm-optimality/blob/master/figure-interval-loss-derivative.R but I did not fully consider the scale parameter when I wrote that code, so we need to double-check to make sure that those gradients agree with the Survival code
Tests: wrt survreg for distributions other than Guassian
We need to implement tests for non regularized data and non-Gaussian loss.
Integration of Surv object
We should accept a Surv object from the Survival package for the output values, rather than the 2-column y matrix as of now.
large lambda values?
Hey @anujkhare there seems to be some problem with the first lambda value (too large, 1e35) and the first few lambda values (too many lambda values which yield a weight vector of all zeros). Can you double-check the lambda grid computation? It would be best if the first lambda value gives an intercept-only model, and then the second lambda value gives at least one non-zero element in the weight vector.
Right now I see
data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1))
y_l <- ovarian$futime
y_r <- ovarian$futime
y_r[ovarian$fustat == 0] <- NA
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")
rownames(fit$beta) <- c("(Intercept)", colnames(X))
t(with(fit, rbind(lambda=lambda, scale=scale, beta))[,1:14])
> t(with(fit, rbind(lambda=lambda, scale=scale, beta))[,1:14])
lambda scale (Intercept) age residual.disease treatment
[1,] 1.000000e+35 1.265787 6.772081 0.000000000 0 0
[2,] 1.175100e+01 1.265758 6.772118 0.000000000 0 0
[3,] 1.070728e+01 1.265775 6.772103 0.000000000 0 0
[4,] 9.756176e+00 1.265768 6.772112 0.000000000 0 0
[5,] 8.889411e+00 1.265772 6.772108 0.000000000 0 0
[6,] 8.099720e+00 1.265770 6.772110 0.000000000 0 0
[7,] 7.380153e+00 1.265771 6.772109 0.000000000 0 0
[8,] 6.724524e+00 1.265771 6.772110 0.000000000 0 0
[9,] 6.127134e+00 1.265771 6.772110 0.000000000 0 0
[10,] 5.582817e+00 1.265771 6.772110 0.000000000 0 0
[11,] 5.086855e+00 1.265771 6.772110 0.000000000 0 0
[12,] 4.634953e+00 1.265771 6.772110 0.000000000 0 0
[13,] 4.682283e+00 1.202117 7.036940 -0.005226961 0 0
[14,] 4.762457e+00 1.137780 7.319351 -0.010762496 0 0
>
weights parameter
Hey @anujkhare how hard would it be to modify the C++ code to use weights w_i > 0 specific to each observation/row i? The weights would affect the loss/likelihood function. Right now the objective function is
min_f sum_i loss(y_i, f(x_i)) + Penalty(f)
but with weights it would become
min_f sum_i w_i loss(y_i, f(x_i)) + Penalty(f)
I don't know of any data sets where this is absolutely essential for iregnet, but it is typically useful in glmnet for logistic regression models (binary classification problems) where there are a lot more of one class than another. For now I don't think we should worry about implementing this, but I was just wondering if you had thought about it.
Fix code formatting
Print function
We would like to have a basic print function which displays only a few important things.
Release to CRAN
- Add a version to the package
@tdhock Besides versioning the package and making sure that R CMD checks pass, is there anything else that we need to do?
In general, we can do a lot more testing. @theadityasam: there are a bunch of datasets that Toby mentioned on this page. It'd be a good exercise for us to fit models on these using cv.iregnet.
Make print.iregnet more informative
@theadityasam to add some more details
compilation warnings
Hi @anujkhare @theadityasam I am getting these compiler warnings on my system. Although the code compiles/installs, it would be good to fix those warnings when you get a chance.
c:/Rtools/mingw_64/bin/g++ -I"c:/PROGRA~1/R/R-36~1.1/include" -DNDEBUG -I"C:/Users/th798/R/win-library/3.6/Rcpp/include" -O2 -Wall -mtune=generic -c distributions.cpp -o distributions.o
distributions.cpp: In function 'double compute_grad_response(double*, double*, double*, const double*, const double*, const double*, double, const IREG_CENSORING*, long long unsigned int, IREG_DIST, double*, bool)':
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_LOG_GAUSSIAN' not handled in switch [-Wswitch]
switch(dist) {
^
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_LOG_LOGISTIC' not handled in switch [-Wswitch]
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_EXPONENTIAL' not handled in switch [-Wswitch]
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_WEIBULL' not handled in switch [-Wswitch]
distributions.cpp:69:9: warning: enumeration value 'IREG_DIST_UNKNOWN' not handled in switch [-Wswitch]
distributions.cpp: In function 'Rcpp::List iregnet_compute_gradients(Rcpp::NumericMatrix, Rcpp::NumericMatrix, Rcpp::NumericVector, double, Rcpp::String)':
distributions.cpp:426:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull j = 0; j < n_vars; ++j) {
^
distributions.cpp:428:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull i = 0; i < n_obs; ++i) {
^
distributions.cpp: In function 'double compute_grad_response(double*, double*, double*, const double*, const double*, const double*, double, const IREG_CENSORING*, long long unsigned int, IREG_DIST, double*, bool)':
distributions.cpp:229:25: warning: 'ddsig' may be used uninitialized in this function [-Wmaybe-uninitialized]
ddsig_sum += ddsig;
^
distributions.cpp:228:23: warning: 'dsig' may be used uninitialized in this function [-Wmaybe-uninitialized]
dsig_sum += dsig;
^
distributions.cpp:219:26: warning: 'ddg' may be used uninitialized in this function [-Wmaybe-uninitialized]
if (dsig == 0 || ddg == 0)
^
distributions.cpp:222:30: warning: 'dg' may be used uninitialized in this function [-Wmaybe-uninitialized]
response = eta[i] - dg / ddg;
^
c:/Rtools/mingw_64/bin/g++ -I"c:/PROGRA~1/R/R-36~1.1/include" -DNDEBUG -I"C:/Users/th798/R/win-library/3.6/Rcpp/include" -O2 -Wall -mtune=generic -c iregnet_fit.cpp -o iregnet_fit.o
iregnet_fit.cpp: In function 'Rcpp::List fit_cpp(Rcpp::NumericMatrix, Rcpp::NumericMatrix, Rcpp::String, Rcpp::NumericVector, int, Rcpp::IntegerVector, bool, double, double, bool, bool, bool, double, double, int, double)':
iregnet_fit.cpp:126:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(ull i = 0; i < num_lambda; ++i) {
^
iregnet_fit.cpp:339:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull m = 0; m <= end_ind; ++m) {
^
iregnet_fit.cpp:212:10: warning: variable 'lambda_max_unscaled' set but not used [-Wunused-but-set-variable]
double lambda_max_unscaled;
^
iregnet_fit.cpp: In function 'double get_y_means(Rcpp::NumericMatrix&, IREG_CENSORING*, double*)':
iregnet_fit.cpp:382:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull i = 0; i < y.nrow(); ++i) {
^
iregnet_fit.cpp: In function 'void get_censoring_types(Rcpp::NumericMatrix&, IREG_CENSORING*)':
iregnet_fit.cpp:407:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull i = 0; i < y.nrow(); ++i) {
^
iregnet_fit.cpp: In function 'void standardize_x(Rcpp::NumericMatrix&, double*, double*, bool)':
iregnet_fit.cpp:467:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull i = int(intercept); i < X.ncol(); ++i) { // don't standardize intercept col.
^
iregnet_fit.cpp:469:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull j = 0; j < X.nrow(); ++j) {
^
iregnet_fit.cpp:474:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull j = 0; j < X.nrow(); ++j) {
^
iregnet_fit.cpp:482:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull j = 0; j < X.nrow(); ++j) {
^
iregnet_fit.cpp: In function 'void standardize_y(Rcpp::NumericMatrix&, double*, double&)':
iregnet_fit.cpp:491:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull i = 0; i < y.nrow() * y.ncol(); ++i) {
^
iregnet_fit.cpp:494:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (ull i = 0; i < y.nrow(); ++i)
^
iregnet_fit.cpp: In function 'double get_init_var(double*, IREG_CENSORING*, long long unsigned int, IREG_DIST)':
iregnet_fit.cpp:503:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (int i = 0; i < n; ++i) {
^
iregnet_fit.cpp:508:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (int i = 0; i < n; ++i) {
^
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_GAUSSIAN' not handled in switch [-Wswitch]
switch (dist) {
^
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_LOG_GAUSSIAN' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_LOG_LOGISTIC' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_EXPONENTIAL' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_WEIBULL' not handled in switch [-Wswitch]
iregnet_fit.cpp:513:10: warning: enumeration value 'IREG_DIST_UNKNOWN' not handled in switch [-Wswitch]
c:/Rtools/mingw_64/bin/g++ -shared -s -static-libgcc -o iregnet.dll tmp.def RcppExports.o distributions.o iregnet_fit.o -Lc:/PROGRA~1/R/R-36~1.1/bin/x64 -lR
installing to C:/Users/th798/R/win-library/3.6/00LOCK-iregnet/00new/iregnet/libs/x64
support weibull distribution
look at formulas in http://members.cbio.mines-paristech.fr/~thocking/survival.pdf
NAN Error
So currently I believe the NAN issue occurs when the data is either completely right censored or left truncated. I'll be introducing an error message in the get_censoring_types function of iregnet_fit.cpp whenever this condition is met and then will be testing it in various cases to see if the NAN error still occurs. If it does then I need to discuss if we have missed any other cause for the error.
error for invalid output intervals?
Hey @anujkhare would it be possible to stop with an error when the upper limit is in less than the lower limit? For example
data(ovarian)
X <- with(ovarian, cbind(age=age, residual.disease=resid.ds-1, treatment=rx-1, ecog.ps=ecog.ps-1))
y_l <- ovarian$futime
y_r <- ovarian$futime-1
y_r[ovarian$fustat == 0] <- NA
y_surv <- log(cbind(y_l, y_r))
fit <- iregnet(X, y_surv, family="gaussian")
I see we have fit$error_status==1
but it would also be nice to stop with an error that tells the user that there is a problem with the outputs.
future instead of foreach please
Hi in recent years it has become apparent that the future framework should be preferred over foreach, so I would suggest using future.apply::future_lapply
instead of foreach in cv.iregnet. For a usage example see https://github.com/tdhock/penaltyLearning/blob/master/R/IntervalRegression.R#L192
is that ok with you @anujkhare ?
is that something you could implement in a new PR this summer? @theadityasam
Check exponential distribution
License
@tdhock We need to set a license for the project to make it available on CRAN. Which license would you suggest? [List of licenses for R packages]
glmnet
uses GPL-2, survival
uses LGPL(>=2).
ovarian and lung data sets?
hey @anujkhare are the ovarian and lung data sets really just copies from the survival package? If so then there is no need to actually include them, since the survival package (and those data sets) is included with every copy of R. (survival is a recommended pakage)
Initial fit for scale when there is no intercept
Currently, the initial fit is done iff both scale and intercept need to be estimated. We need to estimate initial value of scale when there is no intercept as well.
Test input validations
Add license
@tdhock It'd be good to add a license file to this repo. I think that a permissible MIT license is okay. What do you think?
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. ๐๐๐
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google โค๏ธ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.