
sgdnet


sgdnet is an R package that fits elastic net-regularized generalized linear models to big data using the incremental gradient average algorithm SAGA (Defazio et al. 2014).

Installation

sgdnet is not currently available on CRAN but can be installed using the devtools package:

# install.packages("devtools")
devtools::install_github("jolars/sgdnet")

Usage

It is simple to fit a model using sgdnet. The interface deliberately mimics that of glmnet to facilitate transitioning between the two.

First we load the package, and then we fit a multinomial model to the iris data set. We set the elastic net mixing parameter (alpha) to 0.8 to achieve a compromise between the ridge and lasso penalties.

sgdnet fits the model across an automatically computed regularization path. Alternatively, the user may supply their own path using the lambda argument.

library(sgdnet)
fit <- sgdnet(iris[, 1:4], iris[, 5], family = "multinomial", alpha = 0.8)
plot(fit)

The coefficients from a multinomial model along the regularization path fit to the iris data set.
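
Since the interface mimics glmnet, the usual accessor methods should carry over. Here is a hedged sketch: coef() with an s argument appears in the issues below, while predict() with type = "class" is an assumption by analogy with glmnet.

# Assumes glmnet-style coef() and predict() methods carry over.
coef(fit, s = 0.1)                        # coefficients at lambda = 0.1
predict(fit, newx = as.matrix(iris[, 1:4]),
        s = 0.1, type = "class")          # predicted classes (assumed)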

License

sgdnet is open source software, licensed under GPL-3.

Versioning

sgdnet uses semantic versioning.

Acknowledgements

The initial work on sgdnet was supported by Google through the Google Summer of Code program with Michael Weylandt and Toby Dylan Hocking as mentors.


sgdnet's Issues

Fitting intercept differs between sparse and dense implementations

The source in scikit-learn uses intercept_decay to shrink the learning rate when features are sparse in order to prevent intercept oscillation. There is additional information at https://stackoverflow.com/questions/46835181/sklearn-perceptron-behaving-differently-for-sparse-and-dense.

The problem is that the output differs depending on whether features are sparse or not, which is not what one expects from the function.
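
For reference, here is a minimal sketch of the scikit-learn workaround (illustrative only, not sgdnet code; the 0.01 factor mirrors scikit-learn's SPARSE_INTERCEPT_DECAY constant):

# Illustrative sketch, not sgdnet's implementation: shrink the intercept
# step when the features are sparse to dampen oscillation.
update_intercept <- function(intercept, step_size, gradient, sparse) {
  intercept_decay <- if (sparse) 0.01 else 1
  intercept - step_size * intercept_decay * gradient
}

update_intercept(0.5, 0.1, gradient = 2, sparse = TRUE)   # shrunken step: 0.498
update_intercept(0.5, 0.1, gradient = 2, sparse = FALSE)  # full step: 0.3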

Possible solutions:

  • Center elements in-place as glmnet does.
  • Keep it as it is, perhaps throwing a warning when the intercept is fit and features are sparse.
  • Make the implementations equivalent and accept the possibility of convergence issues.
  • Don't allow the intercept to be fit when features are sparse, i.e. throw an error.

Separating sparse and dense implementations

I have begun to think that it might be wise to separate the sparse and dense implementations of the algorithm.

Here are the pros and cons that I can think of if we were to do this:

Pros

  • Avoid the overhead of iterating across a vector in search of non-zero elements in the dense implementation, where it is pointless. This would also mean more compact code in LaggedUpdate() for the dense implementation.
  • Allow use of .colptr(), .unsafe_col() and .memptr() for faster access to elements in the dense implementation.
  • Avoid branching on sparse condition, for instance for in-place scaling and centering.

Cons

  • Code duplication, particularly at this stage where nothing is final.

Thoughts?

lambda_max for the multivariate Gaussian model

I am trying to figure out how glmnet picks lambda_max, the starting value for its regularization path, for the multivariate Gaussian model, but haven't been able to get it right. I understand that it should correspond to lambda_max = max_l |<x_l, y>| / (n * alpha).

This is just an example to illustrate and not the actual implementation as it stands:

get_lambda <- function(x, y, alpha = 1) {
  y <- as.matrix(y)
  
  sd2 <- function(y) sqrt(sum((y - mean(y))^2)/length(y))
  
  sy <- apply(y, 2, sd2)
  sx <- apply(x, 2, sd2)
  
  xx <- as.matrix(scale(x, scale = sx))
  yy <- as.matrix(scale(y, scale = sy))
  
  lambda <- max(abs(crossprod(yy, xx)) * sy)/NROW(x)
  lambda/max(0.001, alpha)
}

# the univariate case
x <- as.matrix(iris[, 2:4])
y <- iris[, 1]

# looks fine!
max(glmnet(x, y)$lambda)
#> [1] 0.7194595
get_lambda(x, y)
#> [1] 0.7194595

# the multivariate case
x <- as.matrix(iris[, 3:4])
y <- iris[, 1:2]

# something is wrong
max(glmnet(x, y, "mgaussian")$lambda)
#> [1] 0.7431435
get_lambda(x, y)
#> [1] 0.7194595

coef(glmnet(x, y, "mgaussian", lambda = get_lambda(x, y)))
#> $Sepal.Length
#> 3 x 1 sparse Matrix of class "dgCMatrix"
#>                      s0
#>              5.79435769
#> Petal.Length 0.01303237
#> Petal.Width  .         
#> 
#> $Sepal.Width
#> 3 x 1 sparse Matrix of class "dgCMatrix"
#>                        s0
#>               3.070002986
#> Petal.Length -0.003371382
#> Petal.Width   .  

I feel like I am missing something silly. Do you know what it is, @tdhock, @michaelweylandt?
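
One hypothesis, unverified against the glmnet source: the mgaussian penalty on each feature's coefficient vector is a group-lasso norm, so the KKT conditions at zero would suggest taking the Euclidean norm of the inner products across responses instead of the per-response maximum:

# Hedged sketch: use the grouped (Euclidean) norm across responses, as
# the group-lasso KKT conditions at beta = 0 would suggest. Unverified.
get_lambda_mgaussian <- function(x, y, alpha = 1) {
  y <- as.matrix(y)
  sd2 <- function(v) sqrt(sum((v - mean(v))^2)/length(v))
  xx <- as.matrix(scale(x, scale = apply(x, 2, sd2)))
  yy <- scale(y, scale = FALSE)  # center, but do not rescale, the responses
  max(sqrt(rowSums(crossprod(xx, yy)^2)))/(NROW(x)*max(0.001, alpha))
}

# compare against max(glmnet(x, y, "mgaussian")$lambda) from above
get_lambda_mgaussian(as.matrix(iris[, 3:4]), iris[, 1:2])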

Save a data copy in fitted objects

Consider storing a copy of x and y (feature matrix and response) in the fitted object from sgdnet().

Pros

  • Follows conventions from most modelling objects in R.
  • Makes it more convenient to call predict methods.

Cons

  • Large data objects may be prohibitively expensive to copy.
  • Encourages predictions using "past" data, which could be considered bad form.

Possible solutions

  • Use a thin argument in sgdnet() to let the user decide whether to store the data object.
  • Attempt to retrieve the data object from memory (throwing a warning while doing so).
  • Use an option, such as sgdnet.store_data, so that users can decide whether to store the data (see the sketch below).
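
A minimal sketch of how the last two options might combine; the thin argument and the sgdnet.store_data option are the proposals above, not an existing API:

# Hypothetical helper: store the data in the fit unless the user opts
# out, via either the `thin` argument or the sgdnet.store_data option.
attach_data <- function(fit, x, y,
                        thin = !getOption("sgdnet.store_data", TRUE)) {
  if (!thin)
    fit$data <- list(x = x, y = y)
  fit
}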

First fit on automatic lambda path for ridge penalty

So, I just realized that glmnet does not fit the first lambda on the lambda path as advertised when alpha = 0 (ridge penalty). Instead, it just fits the intercept-only model. However, this is only true when we are using the automatic path:

x <- cbind(mtcars$hp, mtcars$drat)
y <- mtcars$hp
f1 <- glmnet::glmnet(x, y, alpha = 0, nlambda = 5)
f2 <- glmnet::glmnet(x, y, alpha = 0, lambda = f1$lambda)

coef(f1)
# 3 x 5 sparse Matrix of class "dgCMatrix"
#                        s0            s1           s2          s3         s4
# (Intercept)  1.466875e+02 147.266925785 150.70764862 131.7163971 36.5188046
# V1           1.010101e-36   0.009881444   0.08939354   0.4734925  0.8909357
# V2          -5.812650e-35  -0.564124780  -4.76373281 -15.1489993 -5.7055708

coef(f2)
# 3 x 5 sparse Matrix of class "dgCMatrix"
#                        s0            s1           s2          s3         s4
# (Intercept)  1.467475e+02 147.266925785 150.70764862 131.7163971 36.5188046
# V1           9.988002e-04   0.009881444   0.08939354   0.4734925  0.8909357
# V2          -5.743033e-02  -0.564124780  -4.76373281 -15.1489993 -5.7055708

Have a look at the first column.

As of c49cdab, we do the same, that is, use the intercept-only model as the first fit on the automatic lambda path. This is obviously a little devious, because the user probably does not expect the two runs to produce different output. It is, however, true to the idea of starting the ridge path at a penalty strength of lambda_max, and it is of course efficient.

Alternatives

  • We could diverge from glmnet and actually fit the model at the first lambda on the path. The downsides are that results would no longer be equivalent for the ridge penalty and that we would lose efficiency.
  • We could return Inf as the first penalty on the path, which would probably be the most theoretically correct option. This also diverges from glmnet, but only in regard to the lambda path returned to the user, not the actual fit. It also means that the first and second values are no longer log-spaced (see the sketch below), but I am not sure that matters much.
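
For reference, a minimal sketch of a glmnet-style automatic path, to make the log-spacing remark above concrete (the default ratio is an assumption):

# nlambda values, log-spaced from lambda_max down to
# lambda_max * lambda_min_ratio.
lambda_path <- function(lambda_max, nlambda = 100, lambda_min_ratio = 1e-4) {
  exp(seq(log(lambda_max), log(lambda_max*lambda_min_ratio),
          length.out = nlambda))
}

lambda_path(0.74, nlambda = 5)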

What do you think @michaelweylandt, @tdhock?

Approximation of the Lipschitz constant for use in the step size

Currently, we approximate the Lipschitz constant as scikit-learn does, using the maximum squared row norm, whilst the theoretically best choice, if I am not mistaken, is the largest eigenvalue of the Hessian. The reason for this, as I understand it, is that the eigendecomposition is computationally demanding, but it does seem suboptimal, particularly since we are using a constant step size.

Here is a short demonstration (in R) of the different methods for least squares:

set.seed(1)
x <- scale(matrix(rnorm(1000000), nrow = 10000, ncol = 100))

rowNormsMax <- function(x) max(abs(rowSums(x^2)))
hessianEigen <- function(x) max(eigen(crossprod(x), FALSE, TRUE)$values/nrow(x))

rowNormsMax(x)
#> [1] 164.6416
hessianEigen(x)
#> [1] 1.199674

microbenchmark::microbenchmark(rowNormsMax(x), hessianEigen(x))
#> Unit: milliseconds
#>             expr       min       lq      mean    median        uq       max neval
#>   rowNormsMax(x)  4.591024  7.10048  9.094845  7.317713  7.635671 127.13514   100
#>  hessianEigen(x) 11.155758 11.26153 11.678655 11.558702 11.881869  13.99339   100

I have planned to run proper tests on this to figure out the best solution, but I thought I would put it up here so I remember to revisit it later.

Also, I wanted to check that I am not missing something. Can the difference really be this large?
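
One possible middle ground (my suggestion, not something the package currently does) would be to estimate only the largest eigenvalue with a few power iterations, which avoids the full decomposition and never forms the Hessian explicitly:

# Power iteration sketch: apply X'X via two matrix-vector products per
# iteration, then read off the top eigenvalue from the Rayleigh quotient.
power_iter_eigen <- function(x, iters = 30) {
  v <- rnorm(ncol(x))
  for (i in seq_len(iters)) {
    v <- crossprod(x, x %*% v)  # X'X v without forming X'X
    v <- v/sqrt(sum(v^2))
  }
  drop(crossprod(v, crossprod(x, x %*% v)))/nrow(x)
}

power_iter_eigen(x)  # should approach hessianEigen(x) from above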

Erroneous results with non-standardized features

The package is producing strange results when features aren't standardized.

library(glmnet)
library(sgdnet)

set.seed(1)

x <- with(trees, cbind(Girth, Height))
y <- trees$Volume

sfit <- sgdnet(x, y, standardize = T)
gfit <- glmnet(x, y, standardize = T)

coef(sfit, s = 0)
# 3 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -57.9724027
# Girth         4.7078901
# Height        0.3390993
coef(gfit, s = 0)
# 3 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -56.9741742
# Girth         4.6882466
# Height        0.3293873

sfit <- sgdnet(x, y, standardize = F)
gfit <- glmnet(x, y, standardize = F)

coef(sfit, s = 0)
# 3 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) 27.7431997
# Girth        4.7665048
# Height      -0.7912432
coef(gfit, s = 0)
# 3 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -57.5678250
# Girth         4.6724709
# Height        0.3399486

# for reference
coef(lm(Volume ~ ., data = trees))
# (Intercept)       Girth      Height 
# -57.9876589   4.7081605   0.3392512 

As you can see, the coefficients are way off. The issue seems to be related to the absence of centering. I have gone over the rescaling multiple times (Rescale() in sgdnet/src/utils.h, lines 442 to 472 in 5f22c99):

void Rescale(std::vector<double> weights,
             std::vector<std::vector<double>>& weights_archive,
             std::vector<double> intercept,
             std::vector<std::vector<double>>& intercept_archive,
             const std::vector<double>& x_center,
             const std::vector<double>& x_scale,
             const std::vector<double>& y_center,
             const std::vector<double>& y_scale,
             const unsigned n_features,
             const unsigned n_classes,
             const bool fit_intercept) {
  double x_scale_prod = 0.0;

  // return the weights to the scale of the original data and
  // accumulate the correction term for the centering of x
  for (unsigned f_ind = 0; f_ind < n_features; ++f_ind) {
    if (x_scale[f_ind] != 0.0) {
      for (unsigned c_ind = 0; c_ind < n_classes; ++c_ind) {
        weights[f_ind*n_classes + c_ind] *= y_scale[c_ind]/x_scale[f_ind];
        x_scale_prod += x_center[f_ind]*weights[f_ind*n_classes + c_ind];
      }
    }
  }

  // adjust the intercept for the scaling and centering of y and x
  if (fit_intercept) {
    for (unsigned c_ind = 0; c_ind < n_classes; ++c_ind)
      intercept[c_ind] =
        intercept[c_ind]*y_scale[c_ind] + y_center[c_ind] - x_scale_prod;
  }

  weights_archive.push_back(std::move(weights));
  intercept_archive.push_back(std::move(intercept));
}

but I haven't found any issues there, so I am guessing that something is wrong inside the algorithm itself.

Do you have any clues about what's going on, @michaelweylandt, @tdhock?

Interestingly, the scikit-learn implementation unconditionally centers its features (except for the sparse implementation).
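
As an aside, explicit centering destroys sparsity, which is presumably part of why the sparse path avoids it. A quick illustration:

library(Matrix)

set.seed(1)
xs <- rsparsematrix(1000, 10, density = 0.1)

sum(xs != 0)                                   # sparse: about 1000 non-zeros
xc <- sweep(as.matrix(xs), 2, colMeans(xs))    # subtract the column means
sum(xc != 0)                                   # centered: nearly all 10000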

Data in package and for benchmarking

I have currently opted to include rather large data sets in the package. The intent was originally to use these data sets both for examples and for benchmarking. However, their size means that quite a few of the examples run slowly. Perhaps this will be alleviated as the algorithm is optimized, but it will likely always mean that we, for instance, cannot run automated valgrind builds on travis-ci. (They time out.)

More pertinent, perhaps, is that the data sets are supplemented by other data sets in the benchmarks, because I thought it would be inappropriate to benchmark only one data set for each family (and including several data sets for each response type in the package seems like overkill). Therefore, the benchmarks have to be run "offline" rather than at the time the vignettes are built, which means we store the data (currently only in benchmarks_binomial) and then produce the vignettes from these.

There is also the issue of which type of benchmarks we would like to run. Currently, we only fit the models for single lambdas, varying the tolerance, and return measures of loss and time. This is problematic because glmnet, which we want to compare against, is currently more efficient at fitting the full path (in the lasso case). However, since the algorithms use different stopping criteria, just running the full path without appropriately controlling for the objective loss risks biasing the comparison towards the algorithm that accepts looser fits (I am not sure which one that is at the moment). I am not sure I exactly understand what glmnet

Anyway, the options as I see them are:

Alternative 1: Large datasets in package, simple benchmarks

Keep the large data sets in the package and produce simpler benchmarks that can be run when the vignette is built.

Pros:

  • Don't have to remember to rerun the benchmarks every time the package is updated.
  • Benchmarks possibly better reflect "real" use of the packages.
  • Package is self-contained and vignettes are reproducible without having to visit the GitHub repo for source code. We also don't have to store internal data used only for the vignettes.

Alternative 2: Small datasets in package, extensive benchmarks (with large datasets)

Change to smaller data sets in the package for running examples and tests. In this case, I will probably create a separate R package for the large data sets, stored only on GitHub.

Pros

  • Loss can be controlled and compared against.
  • Pseudo-measure for visualizing the convergence path.
  • Can run valgrind builds on travis-ci.
  • Smaller package footprint.
