
sgdnet


sgdnet is an R package that fits elastic net-regularized generalized linear models to big data using the incremental gradient average algorithm SAGA (Defazio et al. 2014).

Installation

sgdnet is not currently available on CRAN but can be installed using the devtools package:

# install.packages("devtools")
devtools::install_github("jolars/sgdnet")

Usage

It is simple to fit a model using sgdnet. The interface deliberately mimics that of glmnet to facilitate transitioning between the two.

First we load the package, and then we fit a multinomial model to the iris data set. We set the elastic net mixing parameter (alpha) to 0.8 to achieve a compromise between the ridge and lasso penalties.

sgdnet fits the model across an automatically computed regularization path. Alternatively, the user may supply their own path using the lambda argument.

library(sgdnet)
fit <- sgdnet(iris[, 1:4], iris[, 5], family = "multinomial", alpha = 0.8)
plot(fit)

The coefficients from a multinomial model along the regularization path fit to the iris data set.
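
Since the interface mimics glmnet, the usual accessor methods should carry over. Here is a hedged sketch: coef() with an s argument appears in the issues below, while predict() with type = "class" is an assumption by analogy with glmnet.

# Assumes glmnet-style coef() and predict() methods carry over.
coef(fit, s = 0.1)                        # coefficients at lambda = 0.1
predict(fit, newx = as.matrix(iris[, 1:4]),
        s = 0.1, type = "class")          # predicted classes (assumed)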

License

sgdnet is open source software, licensed under GPL-3.

Versioning

sgdnet uses semantic versioning.

Acknowledgements

The initial work on sgdnet was supported by Google through the Google Summer of Code program with Michael Weylandt and Toby Dylan Hocking as mentors.


sgdnet's Issues

Fitting intercept differs between sparse and dense implementations

The source in scikit-learn uses intercept_decay to shrink the learning rate when features are sparse in order to prevent intercept oscillation. There is additional information at https://stackoverflow.com/questions/46835181/sklearn-perceptron-behaving-differently-for-sparse-and-dense.

The problem is that the output differs depending on whether features are sparse or not, which is not what one expects from the function.
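
For reference, here is a minimal sketch of the scikit-learn workaround (illustrative only, not sgdnet code; the 0.01 factor mirrors scikit-learn's SPARSE_INTERCEPT_DECAY constant):

# Illustrative sketch, not sgdnet's implementation: shrink the intercept
# step when the features are sparse to dampen oscillation.
update_intercept <- function(intercept, step_size, gradient, sparse) {
  intercept_decay <- if (sparse) 0.01 else 1
  intercept - step_size * intercept_decay * gradient
}

update_intercept(0.5, 0.1, gradient = 2, sparse = TRUE)   # shrunken step: 0.498
update_intercept(0.5, 0.1, gradient = 2, sparse = FALSE)  # full step: 0.3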

Possible solutions:

  • Center elements in-place as glmnet does.
  • Keep it as it is, perhaps throwing a warning when the intercept is fit and features are sparse.
  • Make the implementations equivalent and accept the possibility of convergence issues.
  • Don't allow the intercept to be fit when features are sparse, i.e. throw an error.

Separating sparse and dense implementations

I have begun to think that it might be wise to separate the sparse and dense implementations of the algorithm.

Here are the pros and cons that I can think of if we were to do this:

Pros

  • Avoid the overhead of iterating across a vector in search of non-zero elements in the dense implementation, where it is pointless. This would also mean more compact code in LaggedUpdate() for the dense implementation.
  • Allow use of .colptr(), .unsafe_col() and .memptr() for faster access to elements in the dense implementation.
  • Avoid branching on sparse condition, for instance for in-place scaling and centering.

Cons

  • Code duplication, particularly at this stage where nothing is final.

Thoughts?

lambda_max for the multivariate Gaussian model

I am trying to figure out how glmnet picks lambda_max, the starting value for its regularization path, for the multivariate Gaussian model, but haven't been able to get it right. I understand that it should correspond to lambda_max = max_l |<x_l, y>| / (n * alpha).

This is just an example to illustrate and not the actual implementation as it stands:

get_lambda <- function(x, y, alpha = 1) {
  y <- as.matrix(y)
  
  sd2 <- function(y) sqrt(sum((y - mean(y))^2)/length(y))
  
  sy <- apply(y, 2, sd2)
  sx <- apply(x, 2, sd2)
  
  xx <- as.matrix(scale(x, scale = sx))
  yy <- as.matrix(scale(y, scale = sy))
  
  lambda <- max(abs(crossprod(yy, xx)) * sy)/NROW(x)
  lambda/max(0.001, alpha)
}

# the univariate case
x <- as.matrix(iris[, 2:4])
y <- iris[, 1]

# looks fine!
max(glmnet(x, y)$lambda)
#> [1] 0.7194595
get_lambda(x, y)
#> [1] 0.7194595

# the multivariate case
x <- as.matrix(iris[, 3:4])
y <- iris[, 1:2]

# something is wrong
max(glmnet(x, y, "mgaussian")$lambda)
#> [1] 0.7431435
get_lambda(x, y)
#> [1] 0.7194595

coef(glmnet(x, y, "mgaussian", lambda = get_lambda(x, y)))
#> $Sepal.Length
#> 3 x 1 sparse Matrix of class "dgCMatrix"
#>                      s0
#>              5.79435769
#> Petal.Length 0.01303237
#> Petal.Width  .         
#> 
#> $Sepal.Width
#> 3 x 1 sparse Matrix of class "dgCMatrix"
#>                        s0
#>               3.070002986
#> Petal.Length -0.003371382
#> Petal.Width   .  

I feel like I am missing something silly. Do you know what it is, @tdhock, @michaelweylandt?
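
One hypothesis, unverified against the glmnet source: the mgaussian penalty on each feature's coefficient vector is a group-lasso norm, so the KKT conditions at zero would suggest taking the Euclidean norm of the inner products across responses instead of the per-response maximum:

# Hedged sketch: use the grouped (Euclidean) norm across responses, as
# the group-lasso KKT conditions at beta = 0 would suggest. Unverified.
get_lambda_mgaussian <- function(x, y, alpha = 1) {
  y <- as.matrix(y)
  sd2 <- function(v) sqrt(sum((v - mean(v))^2)/length(v))
  xx <- as.matrix(scale(x, scale = apply(x, 2, sd2)))
  yy <- scale(y, scale = FALSE)  # center, but do not rescale, the responses
  max(sqrt(rowSums(crossprod(xx, yy)^2)))/(NROW(x)*max(0.001, alpha))
}

# compare against max(glmnet(x, y, "mgaussian")$lambda) from above
get_lambda_mgaussian(as.matrix(iris[, 3:4]), iris[, 1:2])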

Save a data copy in fitted objects

Consider storing a copy of x and y (feature matrix and response) in the fitted object from sgdnet().

Pros

  • Follows conventions from most modelling objects in R.
  • Makes it more convenient to call predict methods.

Cons

  • Large data objects may be prohibitively expensive to copy.
  • Encourages predictions using "past" data, which could be considered bad form.

Possible solutions

  • Use a thin argument in sgdnet() to let the user decide whether to store the data object.
  • Attempt to retrieve the data object from memory (throwing a warning while doing so).
  • Use an option, such as sgdnet.store_data, so that users can decide whether to store the data (see the sketch below).
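
A minimal sketch of how the last two options might combine; the thin argument and the sgdnet.store_data option are the proposals above, not an existing API:

# Hypothetical helper: store the data in the fit unless the user opts
# out, via either the `thin` argument or the sgdnet.store_data option.
attach_data <- function(fit, x, y,
                        thin = !getOption("sgdnet.store_data", TRUE)) {
  if (!thin)
    fit$data <- list(x = x, y = y)
  fit
}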

First fit on automatic lambda path for ridge penalty

So, I just realized that glmnet does not fit the first lambda on the lambda path as advertised when alpha = 0 (ridge penalty). Instead, it just fits the intercept-only model. However, this is only true when we are using the automatic path:

x <- cbind(mtcars$hp, mtcars$drat)
y <- mtcars$hp
f1 <- glmnet::glmnet(x, y, alpha = 0, nlambda = 5)
f2 <- glmnet::glmnet(x, y, alpha = 0, lambda = f1$lambda)

coef(f1)
# 3 x 5 sparse Matrix of class "dgCMatrix"
#                        s0            s1           s2          s3         s4
# (Intercept)  1.466875e+02 147.266925785 150.70764862 131.7163971 36.5188046
# V1           1.010101e-36   0.009881444   0.08939354   0.4734925  0.8909357
# V2          -5.812650e-35  -0.564124780  -4.76373281 -15.1489993 -5.7055708

coef(f2)
# 3 x 5 sparse Matrix of class "dgCMatrix"
#                        s0            s1           s2          s3         s4
# (Intercept)  1.467475e+02 147.266925785 150.70764862 131.7163971 36.5188046
# V1           9.988002e-04   0.009881444   0.08939354   0.4734925  0.8909357
# V2          -5.743033e-02  -0.564124780  -4.76373281 -15.1489993 -5.7055708

Have a look at the first column.

As of c49cdab, we do the same, that is, use the intercept-only model as the first fit on the automatic lambda path. This is obviously a little devious, because the user probably does not expect the two runs to produce different output. It is, however, true to the idea of starting the ridge path at a penalty strength of lambda_max, and it is of course efficient.

Alternatives

  • We could diverge from glmnet and actually fit the model at the first lambda on the path. The downsides are that results would no longer be equivalent for the ridge penalty and that we would lose efficiency.
  • We could return Inf as the first penalty on the path, which would probably be the most theoretically correct option. This also diverges from glmnet, but only in regard to the lambda path returned to the user, not the actual fit. It also means that the first and second values are no longer log-spaced (see the sketch below), but I am not sure that matters much.
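
For reference, a minimal sketch of a glmnet-style automatic path, to make the log-spacing remark above concrete (the default ratio is an assumption):

# nlambda values, log-spaced from lambda_max down to
# lambda_max * lambda_min_ratio.
lambda_path <- function(lambda_max, nlambda = 100, lambda_min_ratio = 1e-4) {
  exp(seq(log(lambda_max), log(lambda_max*lambda_min_ratio),
          length.out = nlambda))
}

lambda_path(0.74, nlambda = 5)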

What do you think @michaelweylandt, @tdhock?

Approximation of the Lipschitz constant for use in the step size

Currently, we approximate the Lipschitz constant as scikit-learn does, using the maximum squared row norm, whilst the theoretically best choice, if I am not mistaken, is the largest eigenvalue of the Hessian. The reason for this, as I understand it, is that the eigendecomposition is computationally demanding, but it does seem suboptimal, particularly since we are using a constant step size.

Here is a short demonstration (in R) of the different methods for least squares:

set.seed(1)
x <- scale(matrix(rnorm(1000000), nrow = 10000, ncol = 100))

rowNormsMax <- function(x) max(abs(rowSums(x^2)))
hessianEigen <- function(x) max(eigen(crossprod(x), FALSE, TRUE)$values/nrow(x))

rowNormsMax(x)
#> [1] 164.6416
hessianEigen(x)
#> [1] 1.199674

microbenchmark::microbenchmark(rowNormsMax(x), hessianEigen(x))
#> Unit: milliseconds
#>             expr       min       lq      mean    median        uq       max neval
#>   rowNormsMax(x)  4.591024  7.10048  9.094845  7.317713  7.635671 127.13514   100
#>  hessianEigen(x) 11.155758 11.26153 11.678655 11.558702 11.881869  13.99339   100

I have planned to run proper tests on this to figure out the best solution, but I thought I would put it up here so I remember to revisit it later.

Also, I wanted to check that I am not missing something. Can the difference really be this large?
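
One possible middle ground (my suggestion, not something the package currently does) would be to estimate only the largest eigenvalue with a few power iterations, which avoids the full decomposition and never forms the Hessian explicitly:

# Power iteration sketch: apply X'X via two matrix-vector products per
# iteration, then read off the top eigenvalue from the Rayleigh quotient.
power_iter_eigen <- function(x, iters = 30) {
  v <- rnorm(ncol(x))
  for (i in seq_len(iters)) {
    v <- crossprod(x, x %*% v)  # X'X v without forming X'X
    v <- v/sqrt(sum(v^2))
  }
  drop(crossprod(v, crossprod(x, x %*% v)))/nrow(x)
}

power_iter_eigen(x)  # should approach hessianEigen(x) from above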

Erroneous results with non-standardized features

The package is producing strange results when features aren't standardized.

library(glmnet)
library(sgdnet)

set.seed(1)

x <- with(trees, cbind(Girth, Height))
y <- trees$Volume

sfit <- sgdnet(x, y, standardize = T)
gfit <- glmnet(x, y, standardize = T)

coef(sfit, s = 0)
# 3 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -57.9724027
# Girth         4.7078901
# Height        0.3390993
coef(gfit, s = 0)
# 3 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -56.9741742
# Girth         4.6882466
# Height        0.3293873

sfit <- sgdnet(x, y, standardize = F)
gfit <- glmnet(x, y, standardize = F)

coef(sfit, s = 0)
# 3 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) 27.7431997
# Girth        4.7665048
# Height      -0.7912432
coef(gfit, s = 0)
# 3 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -57.5678250
# Girth         4.6724709
# Height        0.3399486

# for reference
coef(lm(Volume ~ ., data = trees))
# (Intercept)       Girth      Height 
# -57.9876589   4.7081605   0.3392512 

As you can see, the coefficients are way off. The issue seems to be related to the absence of centering. I have gone over the rescaling multiple times (Rescale() in sgdnet/src/utils.h, lines 442 to 472 in 5f22c99):

void Rescale(std::vector<double> weights,
             std::vector<std::vector<double>>& weights_archive,
             std::vector<double> intercept,
             std::vector<std::vector<double>>& intercept_archive,
             const std::vector<double>& x_center,
             const std::vector<double>& x_scale,
             const std::vector<double>& y_center,
             const std::vector<double>& y_scale,
             const unsigned n_features,
             const unsigned n_classes,
             const bool fit_intercept) {
  double x_scale_prod = 0.0;

  // return the weights to the scale of the original data and
  // accumulate the correction term for the centering of x
  for (unsigned f_ind = 0; f_ind < n_features; ++f_ind) {
    if (x_scale[f_ind] != 0.0) {
      for (unsigned c_ind = 0; c_ind < n_classes; ++c_ind) {
        weights[f_ind*n_classes + c_ind] *= y_scale[c_ind]/x_scale[f_ind];
        x_scale_prod += x_center[f_ind]*weights[f_ind*n_classes + c_ind];
      }
    }
  }

  // adjust the intercept for the scaling and centering of y and x
  if (fit_intercept) {
    for (unsigned c_ind = 0; c_ind < n_classes; ++c_ind)
      intercept[c_ind] =
        intercept[c_ind]*y_scale[c_ind] + y_center[c_ind] - x_scale_prod;
  }

  weights_archive.push_back(std::move(weights));
  intercept_archive.push_back(std::move(intercept));
}

but I haven't found any issues there, so I am guessing that something is wrong inside the algorithm itself.

Do you have any clues about what's going on, @michaelweylandt, @tdhock?

Interestingly, the scikit-learn implementation unconditionally centers its features (except for the sparse implementation).
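
As an aside, explicit centering destroys sparsity, which is presumably part of why the sparse path avoids it. A quick illustration:

library(Matrix)

set.seed(1)
xs <- rsparsematrix(1000, 10, density = 0.1)

sum(xs != 0)                                   # sparse: about 1000 non-zeros
xc <- sweep(as.matrix(xs), 2, colMeans(xs))    # subtract the column means
sum(xc != 0)                                   # centered: nearly all 10000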

Data in package and for benchmarking

I have currently opted to include rather large data sets in the package. The intent was originally to use these data sets both for examples and for benchmarking. However, their size means that quite a few of the examples run slowly. Perhaps this will be alleviated as the algorithm is optimized, but it will likely always mean that we, for instance, cannot run automated valgrind builds on travis-ci. (They time out.)

More pertinent, perhaps, is that the data sets are supplemented by other data sets in the benchmarks, because I thought it would be inappropriate to benchmark only one data set for each family (and including several data sets for each response type in the package seems like overkill). Therefore, the benchmarks have to be run "offline" rather than at the time the vignettes are built, which means we store the data (currently only in benchmarks_binomial) and then produce the vignettes from these.

There is also the issue of which type of benchmarks we would like to run. Currently, we only fit the models for single lambdas, varying the tolerance, and return measures of loss and time. This is problematic because glmnet, which we want to compare against, is currently more efficient at fitting the full path (in the lasso case). However, since the algorithms use different stopping criteria, just running the full path without appropriately controlling for the objective loss risks biasing the comparison towards the algorithm that accepts looser fits (I am not sure which one that is at the moment). I am not sure I exactly understand what glmnet

Anyway, the options as I see them are:

Alternative 1: Large datasets in package, simple benchmarks

Keep the large data sets in the package and produce simpler benchmarks that can be run when the vignette is built.

Pros:

  • Don't have to remember to rerun the benchmarks every time the package is updated.
  • Benchmarks possibly better reflect "real" use of the packages.
  • Package is self-contained and vignettes are reproducible without having to visit the GitHub repo for source code. We also don't have to store internal data used only for the vignettes.

Alternative 2: Small datasets in package, extensive benchmarks (with large datasets)

Change to smaller data sets in the package for running examples and tests. In this case, I will probably create a separate R package for the large data sets, stored only on GitHub.

Pros

  • Loss can be controlled and compared against.
  • Pseudo-measure for visualizing the convergence path.
  • Can run valgrind builds on travis-ci.
  • Smaller package footprint.
