joshday / sparseregression.jl Goto Github PK

Statistical Models with Regularization in Pure Julia

License: Other

Julia 100.00%

statistical-models regularization julia statistical-learning linear-models sparse-regression

sparseregression.jl's Introduction

SparseRegression

Documentation	Master Build	Test Coverage

This package relies on primitives defined in the JuliaML ecosystem to implement high-performance algorithms for linear models which often produce sparsity in the coefficients.

Quickstart

using Pkg
Pkg.add("SparseRegression")
using SparseRegression

x = randn(10_000, 50)
y = x * range(-1, stop=1, length=50) + randn(10_000)

s = SModel(x, y, L2DistLoss(), L2Penalty())
@time learn!(s)
s

sparseregression.jl's People

Contributors

Stargazers

Watchers

Forkers

evizero spri902 standardgalactic

sparseregression.jl's Issues

SparseRegression doesn't seem to work with large n and p

When I tried to fit linear regression with L1 penalty using SparseRegression.jl, it works when n, p = 120, 20. Here I am comparing the SparseRegression results with those from Convex.jl and GLMNet.jl.

using SparseRegression, Convex, Mosek, GLMNet

srand(123)

n = 120
p = 20
X = randn(n, p)
β = ones(p)
y = X * β + randn(n)
obswt = ones(n)
penwt = ones(p)
λ = 2.0

## SparseRegression 
obs = Obs(X, y, obswt)
s = SModel(obs, L1Penalty(), LinearRegression(), (λ / n) .* penwt)
o = learn!(s, ProxGrad(obs), MaxIter(500), Converged(coef; tol = 1e-8))

## MosekSolver 
β̂ = Convex.Variable(p)
problem = Convex.minimize(0.5sumsquares(y - X * β̂) + λ * sumabs(β̂))
solve!(problem, MosekSolver(LOG=1))

## GLMNet
path = glmnet(X, y;
        weights = obswt, lambda = [λ/n],
        penalty_factor = penwt, standardize = false, intercept = false)
βglm = path.betas

@show [o.β β̂.value βglm]

# SparseReg, Mosek, GLMNet
 0.996103  0.996102  0.996144
 0.958     0.957999  0.958087
 1.14586   1.14586   1.14589 
 1.02801   1.02801   1.02796 
 0.968675  0.968675  0.968678
 1.10844   1.10844   1.10851 
 1.01347   1.01347   1.01352 
 1.09424   1.09424   1.09418 
 0.882447  0.882446  0.882433
 0.932587  0.932587  0.932571
 0.787311  0.78731   0.787303
 1.00141   1.00141   1.0014  
 0.965957  0.965956  0.966005
 0.89519   0.895189  0.895159
 0.838721  0.83872   0.838718
 1.00217   1.00217   1.00219 
 0.954384  0.954384  0.954397
 0.949507  0.949505  0.949499
 1.1132    1.1132    1.11321 
 0.959992  0.959991  0.959991

However, when I have n, p = 150, 50, I get the following output. As you can see, SparseRegression estimates are way off.

# SparseReg, Mosek, GLMNet
  2.13629e81  1.02639   1.02657 
 -2.61546e81  1.1055    1.105   
 -2.53927e81  0.951271  0.951674
 -5.95999e80  1.01511   1.01516 
  3.94948e80  1.03645   1.03593 
  1.43515e81  0.935677  0.936053
 -1.17062e81  1.08666   1.08655 
 -4.65342e80  0.939689  0.939939
 -8.41215e80  0.911773  0.910926
 -1.75503e78  0.791574  0.79191 
  1.72289e81  1.01617   1.01708 
  2.19634e81  0.987906  0.988613
  1.61378e81  1.00895   1.00872 
 -3.82315e80  1.0233    1.02418 
  2.74319e81  0.937699  0.937991
  8.61547e80  1.02474   1.02477 
  6.9639e80   0.768784  0.768965
  5.48598e79  1.07326   1.0713  
  4.97916e80  0.783091  0.783205
 -5.80884e80  0.802373  0.801334
 -6.5806e80   1.01256   1.0124  
 -6.14199e79  1.00347   1.00274 
  7.46284e80  0.727274  0.727363
 -2.25676e80  0.949216  0.949416
  ⋮                             
  7.73578e80  0.984818  0.984203
  6.3947e80   0.833552  0.834118
  2.90481e81  1.09726   1.0976  
  8.944e80    1.0063    1.00643 
 -4.32028e80  1.05851   1.05811 
  1.40527e81  0.954338  0.954746
 -2.87338e81  0.993805  0.993915
 -3.71517e81  0.949183  0.94929 
 -1.77348e81  0.829302  0.829224
 -2.25373e81  0.862183  0.861883
 -2.042e81    1.11283   1.11321 
  3.39315e81  0.864357  0.863508
  2.25659e81  0.949905  0.949652
 -7.52984e80  1.00489   1.00471 
 -4.04302e81  0.80317   0.803785
 -8.12163e80  0.99047   0.990365
  6.32441e80  1.08585   1.08649 
 -2.91391e81  1.05195   1.05212 
 -1.14707e81  0.958913  0.959471
 -5.09791e80  0.771741  0.771618
 -1.22227e81  0.895636  0.895012
 -1.99326e80  0.990013  0.989888
 -2.568e81    1.03589   1.03554 
  1.05625e80  1.07675   1.07691

Let me know if there's anything I am missing. Thanks!

run femtocleaner

doc images

Rewrite

If anyone is using this package, I'm about to break everything. Check out the dev branch for changes.

SnpArray as input for SModel

It would be great if we could use a SnpArray object as input for an SModel, instead of having to convert to a matrix which can be very heavy: @klkeys @Hua-Zhou

using DataFrames, SparseRegression, SnpArrays, Distributions
chr = SnpArray("SNP_data29a_missing"; people = 212, snps = 253141)
typeof(chr) #SnpArrays.SnpArray{2}
SnpArray <: AbstractArray #true
y = convert(Vector{Float64},randn(212).>0)
y[y.==0] = -1 # SparseRegression needs outcome -1,1
isa(chr,AbstractArray) #true
s = SModel(chr,y, LogitDistLoss(), L1Penalty())
@time learn!(s)
## ERROR: MethodError: no method matching *(::Tuple{Bool,Bool}, ::Float64)

So, the multiplication operations does not work properly. It would be great if they did, as we would avoid converting a big dataset to float matrix. This requires a lot of work, so it is only a suggestion of something that would be cool to have in SparseRegression, thanks!

Line search

Is it possible to extract the gradient from train!() in some way and how do I call the loss function? I want to implement a line search for the learning rate:

lr=0.7, a = 0.5, g = 0.7
for i=1:it
   learn!(s, Fista(lr), MaxIter(1), Converged(coef))
   while  loss(coef(s) - lr * grad) > (loss(coef(s)) - a * lr * dot(grad,-grad))
      lr = lr * g
   end
end

Rename package

StatisticalLearning is way too general. SparseRegression?

ANN: Rewrite

Here's what the future of SparseRegression looks like:

https://github.com/joshday/SparseRegNext.jl

It's mostly the same, except now the algorithm types contain the obs instead of the model. Learning a model is slightly more verbose, but helps clean up a lot of the code and makes the model type more accessible to stochastic/online algorithms.

x, y, b = linregdata(10_000, 10)

learn!(ERM(10), ProxGrad(Obs(x, y)), MaxIter(50), Converged(coef))

Better default step size for ProxGrad

In a previous iteration of SparseRegression, the algorithm types didn't hold the data. Now that they do, the data can be used to determine the step size to ensure ProxGrad is an MM algorithm, or at least provide a better default than s=1.

Typestable use of scaledloss

Just a little note on : https://github.com/joshday/SparseRegression.jl/blob/master/src/SparseRegression.jl#L52

There is a big difference between

scaledloss(L2DistLoss(),0.5) # <-- only if factor is decided in runtime
scaledloss(L2DistLoss(),Val{0.5}) # <-- better for a fixed constant

using a number as scale factor

julia> foo(t,y) = value(scaledloss(L2DistLoss(),0.5),t,y)
foo (generic function with 1 method)

julia> @code_llvm foo(-1.,3.)

define %jl_value_t* @julia_foo_70908(double, double) #0 {
top:
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-2672, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #2
  %ptls = bitcast i8* %ptls_i8 to %jl_value_t***
  %2 = alloca [9 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 0
  %3 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 2
  %4 = bitcast %jl_value_t** %3 to i8*
  call void @llvm.memset.p0i8.i32(i8* %4, i8 0, i32 56, i32 8, i1 false)
  %5 = bitcast [9 x %jl_value_t*]* %2 to i64*
  store i64 14, i64* %5, align 8
  %6 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 1
  %7 = bitcast i8* %ptls_i8 to i64*
  %8 = load i64, i64* %7, align 8
  %9 = bitcast %jl_value_t** %6 to i64*
  store i64 %8, i64* %9, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %ptls, align 8
  %10 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 6
  %11 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 5
  %12 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 4
  %13 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 3
  %14 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 8
  %15 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 7
  store %jl_value_t* inttoptr (i64 140341062011144 to %jl_value_t*), %jl_value_t** %10, align 8
  store %jl_value_t* inttoptr (i64 140341062009088 to %jl_value_t*), %jl_value_t** %15, align 8
  store %jl_value_t* inttoptr (i64 140341148509776 to %jl_value_t*), %jl_value_t** %14, align 8
  %16 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %10, i32 3)
  store %jl_value_t* %16, %jl_value_t** %13, align 8
  store %jl_value_t* inttoptr (i64 140341065654200 to %jl_value_t*), %jl_value_t** %3, align 8
  %17 = call %jl_value_t* @jl_gc_pool_alloc(i8* %ptls_i8, i32 1432, i32 16)
  %18 = getelementptr inbounds %jl_value_t, %jl_value_t* %17, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 140341062352080 to %jl_value_t*), %jl_value_t** %18, align 8
  %19 = bitcast %jl_value_t* %17 to double*
  store double %0, double* %19, align 8
  store %jl_value_t* %17, %jl_value_t** %12, align 8
  %20 = call %jl_value_t* @jl_gc_pool_alloc(i8* %ptls_i8, i32 1432, i32 16)
  %21 = getelementptr inbounds %jl_value_t, %jl_value_t* %20, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 140341062352080 to %jl_value_t*), %jl_value_t** %21, align 8
  %22 = bitcast %jl_value_t* %20 to double*
  store double %1, double* %22, align 8
  store %jl_value_t* %20, %jl_value_t** %11, align 8
  %23 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %3, i32 4)
  %24 = load i64, i64* %9, align 8
  store i64 %24, i64* %7, align 8
  ret %jl_value_t* %23
}

note the jl_apply_generic and jl_gc_pool_alloc doing nasty things

using a Val as scalefactor

julia> foo(t,y) = value(scaledloss(L2DistLoss(),Val{0.5}),t,y)
foo (generic function with 1 method)

julia> @code_llvm foo(-1.,3.)

define double @julia_foo_70921(double, double) #0 {
top:
  %2 = fsub double %1, %0
  %3 = fmul double %2, %2
  %4 = fmul double %3, 5.000000e-01
  ret double %4
}

(isn't that code a thing of beauty. could not handcraft it any better)

Problem with example

The Quick example no longer works. It used to do so.

Add CoordinateDescent <: Algorithm

error with all betas positive

I was trying to code cross validation, when I realized that no matter which subset of the data I selected, I would always get all betas positive.
This could be a bug in the optimization?
I don't get the same results when I run with GLMNet.

using DataFrames, SparseRegression
dat = readtable("test.txt", header=true)
y = convert(Array{Float64,1},dat[:,1]) ##1st column: response
X = convert(Array{Float64,2},dat[:,2:end]) ##10,000 SNPs
obs = Obs(X, y)
s = SModel(obs, LogisticRegression(), L1Penalty())
@time learn!(s, ProxGrad(obs,0.001), MaxIter(10000), Converged(coef))
#23.880103 seconds (106.87 k allocations: 765.675 MiB, 0.69% gc time)
#SparseRegression.SModel{LossFunctions.LogitMarginLoss,PenaltyFunctions.L1Penalty}
 #> β        : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 #> λ factor : [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1  …  0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
 #> Loss     : LogitMarginLoss
 #> Penalty  : L1Penalty
find(x -> x != 0,s.β) ##103 SNPs for λ=0.1
mean(s.β[find(x -> x != 0,s.β)])
find(x -> x < 0,s.β) ## 0

Using GLMNet:

using GLMNet, Distributions
y1 = Bool.(y)
y2 = .!y1
yy = [y1 y2]
path = @time glmnet(X, convert(Matrix{Float64}, yy), Binomial())

In GLMNet, the biggest lambda used is 0.089, and with this lambda, all betas are equal to 0 (no covariates in the model). SparseRegression is using lambda=0.1 (even bigger lambda), so I would expect all betas to be close to zero (but the mean is equal to 0.013, not that close to zero).
Could there by any rounding error somewhere, or something, making these estimated coefficients further from zero?

When trying with a different lambda (the best lambda by CV in GLMNet which yields 19 covariates in the model), we still get all positive betas for 170 covariates:

cv = @time glmnetcv(X, yy, Binomial())
#11.709583 seconds (299.73 k allocations: 623.351 MiB, 1.03% gc time)
#Logistic GLMNet Cross Validation
#100 models for 10000 predictors in 10 folds
#Best λ 0.062 (mean loss 1.380, std 0.008)
obs = Obs(X, y)
s = SModel(obs, LogisticRegression(), L1Penalty(), fill(0.062,size(X,2)))
@time learn!(s, ProxGrad(obs,0.001), MaxIter(10000), Converged(coef))
find(x -> x != 0,s.β) ##170 SNPs for λ=0.1
mean(s.β[find(x -> x != 0,s.β)])
find(x -> x < 0,s.β) ## 0

I attach a test data file to reproduce the error.
test.txt

using SparseRegression

x = randn(1000, 1)
y = 2 .* x
obs = Obs(x, y)
s = SModel(Obs(x, y), LinearRegression())
learn!(s, ProxGrad(obs), MaxIter(50), Converged(coef))

returns

INFO: Converged after 1 iterations: [0.0]
SparseRegression.SModel{LossFunctions.ScaledDistanceLoss{LossFunctions.LPDistLoss{2},0.5},PenaltyFunctions.L2Penalty}
  > β        : [0.0]
  > λ factor : [0.1]
  > Loss     : 0.5 * (L2DistLoss)
  > Penalty  : L2Penalty

which does not seem right to me.