Git Product home page Git Product logo

sparseregression.jl's Introduction

SparseRegression

Documentation Master Build Test Coverage
Build Status Build status codecov

This package relies on primitives defined in the JuliaML ecosystem to implement high-performance algorithms for linear models which often produce sparsity in the coefficients.

Quickstart

using Pkg
Pkg.add("SparseRegression")
using SparseRegression

x = randn(10_000, 50)
y = x * range(-1, stop=1, length=50) + randn(10_000)

s = SModel(x, y, L2DistLoss(), L2Penalty())
@time learn!(s)
s

sparseregression.jl's People

Contributors

aiorla avatar dkarrasch avatar evizero avatar femtocleaner[bot] avatar github-actions[bot] avatar joshday avatar juliatagbot avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

sparseregression.jl's Issues

SparseRegression doesn't seem to work with large n and p

When I tried to fit linear regression with L1 penalty using SparseRegression.jl, it works when n, p = 120, 20. Here I am comparing the SparseRegression results with those from Convex.jl and GLMNet.jl.

using SparseRegression, Convex, Mosek, GLMNet

srand(123)

n = 120
p = 20
X = randn(n, p)
β = ones(p)
y = X * β + randn(n)
obswt = ones(n)
penwt = ones(p)
λ = 2.0

## SparseRegression 
obs = Obs(X, y, obswt)
s = SModel(obs, L1Penalty(), LinearRegression(), (λ / n) .* penwt)
o = learn!(s, ProxGrad(obs), MaxIter(500), Converged(coef; tol = 1e-8))

## MosekSolver 
β̂ = Convex.Variable(p)
problem = Convex.minimize(0.5sumsquares(y - X * β̂) + λ * sumabs(β̂))
solve!(problem, MosekSolver(LOG=1))

## GLMNet
path = glmnet(X, y;
        weights = obswt, lambda =/n],
        penalty_factor = penwt, standardize = false, intercept = false)
βglm = path.betas

@show [o.β β̂.value βglm]
# SparseReg, Mosek, GLMNet
 0.996103  0.996102  0.996144
 0.958     0.957999  0.958087
 1.14586   1.14586   1.14589 
 1.02801   1.02801   1.02796 
 0.968675  0.968675  0.968678
 1.10844   1.10844   1.10851 
 1.01347   1.01347   1.01352 
 1.09424   1.09424   1.09418 
 0.882447  0.882446  0.882433
 0.932587  0.932587  0.932571
 0.787311  0.78731   0.787303
 1.00141   1.00141   1.0014  
 0.965957  0.965956  0.966005
 0.89519   0.895189  0.895159
 0.838721  0.83872   0.838718
 1.00217   1.00217   1.00219 
 0.954384  0.954384  0.954397
 0.949507  0.949505  0.949499
 1.1132    1.1132    1.11321 
 0.959992  0.959991  0.959991

However, when I have n, p = 150, 50, I get the following output. As you can see, SparseRegression estimates are way off.

# SparseReg, Mosek, GLMNet
  2.13629e81  1.02639   1.02657 
 -2.61546e81  1.1055    1.105   
 -2.53927e81  0.951271  0.951674
 -5.95999e80  1.01511   1.01516 
  3.94948e80  1.03645   1.03593 
  1.43515e81  0.935677  0.936053
 -1.17062e81  1.08666   1.08655 
 -4.65342e80  0.939689  0.939939
 -8.41215e80  0.911773  0.910926
 -1.75503e78  0.791574  0.79191 
  1.72289e81  1.01617   1.01708 
  2.19634e81  0.987906  0.988613
  1.61378e81  1.00895   1.00872 
 -3.82315e80  1.0233    1.02418 
  2.74319e81  0.937699  0.937991
  8.61547e80  1.02474   1.02477 
  6.9639e80   0.768784  0.768965
  5.48598e79  1.07326   1.0713  
  4.97916e80  0.783091  0.783205
 -5.80884e80  0.802373  0.801334
 -6.5806e80   1.01256   1.0124  
 -6.14199e79  1.00347   1.00274 
  7.46284e80  0.727274  0.727363
 -2.25676e80  0.949216  0.949416
                               
  7.73578e80  0.984818  0.984203
  6.3947e80   0.833552  0.834118
  2.90481e81  1.09726   1.0976  
  8.944e80    1.0063    1.00643 
 -4.32028e80  1.05851   1.05811 
  1.40527e81  0.954338  0.954746
 -2.87338e81  0.993805  0.993915
 -3.71517e81  0.949183  0.94929 
 -1.77348e81  0.829302  0.829224
 -2.25373e81  0.862183  0.861883
 -2.042e81    1.11283   1.11321 
  3.39315e81  0.864357  0.863508
  2.25659e81  0.949905  0.949652
 -7.52984e80  1.00489   1.00471 
 -4.04302e81  0.80317   0.803785
 -8.12163e80  0.99047   0.990365
  6.32441e80  1.08585   1.08649 
 -2.91391e81  1.05195   1.05212 
 -1.14707e81  0.958913  0.959471
 -5.09791e80  0.771741  0.771618
 -1.22227e81  0.895636  0.895012
 -1.99326e80  0.990013  0.989888
 -2.568e81    1.03589   1.03554 
  1.05625e80  1.07675   1.07691 

Let me know if there's anything I am missing. Thanks!

Rewrite

If anyone is using this package, I'm about to break everything. Check out the dev branch for changes.

SnpArray as input for SModel

It would be great if we could use a SnpArray object as input for an SModel, instead of having to convert to a matrix which can be very heavy: @klkeys @Hua-Zhou

using DataFrames, SparseRegression, SnpArrays, Distributions
chr = SnpArray("SNP_data29a_missing"; people = 212, snps = 253141)
typeof(chr) #SnpArrays.SnpArray{2}
SnpArray <: AbstractArray #true
y = convert(Vector{Float64},randn(212).>0)
y[y.==0] = -1 # SparseRegression needs outcome -1,1
isa(chr,AbstractArray) #true
s = SModel(chr,y, LogitDistLoss(), L1Penalty())
@time learn!(s)
## ERROR: MethodError: no method matching *(::Tuple{Bool,Bool}, ::Float64)

So, the multiplication operations does not work properly. It would be great if they did, as we would avoid converting a big dataset to float matrix. This requires a lot of work, so it is only a suggestion of something that would be cool to have in SparseRegression, thanks!

Line search

Is it possible to extract the gradient from train!() in some way and how do I call the loss function? I want to implement a line search for the learning rate:

lr=0.7, a = 0.5, g = 0.7
for i=1:it
   learn!(s, Fista(lr), MaxIter(1), Converged(coef))
   while  loss(coef(s) - lr * grad) > (loss(coef(s)) - a * lr * dot(grad,-grad))
      lr = lr * g
   end
end

Rename package

StatisticalLearning is way too general. SparseRegression?

ANN: Rewrite

Here's what the future of SparseRegression looks like:

https://github.com/joshday/SparseRegNext.jl

It's mostly the same, except now the algorithm types contain the obs instead of the model. Learning a model is slightly more verbose, but helps clean up a lot of the code and makes the model type more accessible to stochastic/online algorithms.

x, y, b = linregdata(10_000, 10)

learn!(ERM(10), ProxGrad(Obs(x, y)), MaxIter(50), Converged(coef))

Better default step size for ProxGrad

In a previous iteration of SparseRegression, the algorithm types didn't hold the data. Now that they do, the data can be used to determine the step size to ensure ProxGrad is an MM algorithm, or at least provide a better default than s=1.

Typestable use of scaledloss

Just a little note on : https://github.com/joshday/SparseRegression.jl/blob/master/src/SparseRegression.jl#L52

There is a big difference between

scaledloss(L2DistLoss(),0.5) # <-- only if factor is decided in runtime
scaledloss(L2DistLoss(),Val{0.5}) # <-- better for a fixed constant

using a number as scale factor

julia> foo(t,y) = value(scaledloss(L2DistLoss(),0.5),t,y)
foo (generic function with 1 method)

julia> @code_llvm foo(-1.,3.)

define %jl_value_t* @julia_foo_70908(double, double) #0 {
top:
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-2672, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #2
  %ptls = bitcast i8* %ptls_i8 to %jl_value_t***
  %2 = alloca [9 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 0
  %3 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 2
  %4 = bitcast %jl_value_t** %3 to i8*
  call void @llvm.memset.p0i8.i32(i8* %4, i8 0, i32 56, i32 8, i1 false)
  %5 = bitcast [9 x %jl_value_t*]* %2 to i64*
  store i64 14, i64* %5, align 8
  %6 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 1
  %7 = bitcast i8* %ptls_i8 to i64*
  %8 = load i64, i64* %7, align 8
  %9 = bitcast %jl_value_t** %6 to i64*
  store i64 %8, i64* %9, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %ptls, align 8
  %10 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 6
  %11 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 5
  %12 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 4
  %13 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 3
  %14 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 8
  %15 = getelementptr [9 x %jl_value_t*], [9 x %jl_value_t*]* %2, i64 0, i64 7
  store %jl_value_t* inttoptr (i64 140341062011144 to %jl_value_t*), %jl_value_t** %10, align 8
  store %jl_value_t* inttoptr (i64 140341062009088 to %jl_value_t*), %jl_value_t** %15, align 8
  store %jl_value_t* inttoptr (i64 140341148509776 to %jl_value_t*), %jl_value_t** %14, align 8
  %16 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %10, i32 3)
  store %jl_value_t* %16, %jl_value_t** %13, align 8
  store %jl_value_t* inttoptr (i64 140341065654200 to %jl_value_t*), %jl_value_t** %3, align 8
  %17 = call %jl_value_t* @jl_gc_pool_alloc(i8* %ptls_i8, i32 1432, i32 16)
  %18 = getelementptr inbounds %jl_value_t, %jl_value_t* %17, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 140341062352080 to %jl_value_t*), %jl_value_t** %18, align 8
  %19 = bitcast %jl_value_t* %17 to double*
  store double %0, double* %19, align 8
  store %jl_value_t* %17, %jl_value_t** %12, align 8
  %20 = call %jl_value_t* @jl_gc_pool_alloc(i8* %ptls_i8, i32 1432, i32 16)
  %21 = getelementptr inbounds %jl_value_t, %jl_value_t* %20, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 140341062352080 to %jl_value_t*), %jl_value_t** %21, align 8
  %22 = bitcast %jl_value_t* %20 to double*
  store double %1, double* %22, align 8
  store %jl_value_t* %20, %jl_value_t** %11, align 8
  %23 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %3, i32 4)
  %24 = load i64, i64* %9, align 8
  store i64 %24, i64* %7, align 8
  ret %jl_value_t* %23
}

note the jl_apply_generic and jl_gc_pool_alloc doing nasty things

using a Val as scalefactor

julia> foo(t,y) = value(scaledloss(L2DistLoss(),Val{0.5}),t,y)
foo (generic function with 1 method)

julia> @code_llvm foo(-1.,3.)

define double @julia_foo_70921(double, double) #0 {
top:
  %2 = fsub double %1, %0
  %3 = fmul double %2, %2
  %4 = fmul double %3, 5.000000e-01
  ret double %4
}

(isn't that code a thing of beauty. could not handcraft it any better)

error with all betas positive

I was trying to code cross validation, when I realized that no matter which subset of the data I selected, I would always get all betas positive.
This could be a bug in the optimization?
I don't get the same results when I run with GLMNet.

using DataFrames, SparseRegression
dat = readtable("test.txt", header=true)
y = convert(Array{Float64,1},dat[:,1]) ##1st column: response
X = convert(Array{Float64,2},dat[:,2:end]) ##10,000 SNPs
obs = Obs(X, y)
s = SModel(obs, LogisticRegression(), L1Penalty())
@time learn!(s, ProxGrad(obs,0.001), MaxIter(10000), Converged(coef))
#23.880103 seconds (106.87 k allocations: 765.675 MiB, 0.69% gc time)
#SparseRegression.SModel{LossFunctions.LogitMarginLoss,PenaltyFunctions.L1Penalty}
 #> β        : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 #> λ factor : [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1  …  0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
 #> Loss     : LogitMarginLoss
 #> Penalty  : L1Penalty
find(x -> x != 0,s.β) ##103 SNPs for λ=0.1
mean(s.β[find(x -> x != 0,s.β)])
find(x -> x < 0,s.β) ## 0

Using GLMNet:

using GLMNet, Distributions
y1 = Bool.(y)
y2 = .!y1
yy = [y1 y2]
path = @time glmnet(X, convert(Matrix{Float64}, yy), Binomial())

In GLMNet, the biggest lambda used is 0.089, and with this lambda, all betas are equal to 0 (no covariates in the model). SparseRegression is using lambda=0.1 (even bigger lambda), so I would expect all betas to be close to zero (but the mean is equal to 0.013, not that close to zero).
Could there by any rounding error somewhere, or something, making these estimated coefficients further from zero?

When trying with a different lambda (the best lambda by CV in GLMNet which yields 19 covariates in the model), we still get all positive betas for 170 covariates:

cv = @time glmnetcv(X, yy, Binomial())
#11.709583 seconds (299.73 k allocations: 623.351 MiB, 1.03% gc time)
#Logistic GLMNet Cross Validation
#100 models for 10000 predictors in 10 folds
#Best λ 0.062 (mean loss 1.380, std 0.008)
obs = Obs(X, y)
s = SModel(obs, LogisticRegression(), L1Penalty(), fill(0.062,size(X,2)))
@time learn!(s, ProxGrad(obs,0.001), MaxIter(10000), Converged(coef))
find(x -> x != 0,s.β) ##170 SNPs for λ=0.1
mean(s.β[find(x -> x != 0,s.β)])
find(x -> x < 0,s.β) ## 0

I attach a test data file to reproduce the error.
test.txt

Does this package work?

...or am I doing something wrong? The following

using SparseRegression

x = randn(1000, 1)
y = 2 .* x
obs = Obs(x, y)
s = SModel(Obs(x, y), LinearRegression())
learn!(s, ProxGrad(obs), MaxIter(50), Converged(coef))

returns

INFO: Converged after 1 iterations: [0.0]
SparseRegression.SModel{LossFunctions.ScaledDistanceLoss{LossFunctions.LPDistLoss{2},0.5},PenaltyFunctions.L2Penalty}
  > β        : [0.0]
  > λ factor : [0.1]
  > Loss     : 0.5 * (L2DistLoss)
  > Penalty  : L2Penalty

which does not seem right to me.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.