biona001 / knockoffs.jl Goto Github PK
View Code? Open in Web Editor NEWVariable Selection with Knockoffs
License: MIT License
Variable Selection with Knockoffs
License: MIT License
Knockoffs.jl will soon become a dependency for GhostKnockoffGWAS, which will be compiled into a relocatable app.
Knockoffs's dependency LowRankApprox.jl
depends on FFTW.jl
which in turn depends on MKL_jll.jl
which is a staggering 596.668 MiB. Can we remove this dependency?
@JuliaRegistrator register()
Hi Benjamin,
I hope you are doing well!
I wanted to check about the following. The following seems not to match (for any of the fixed-X knockoff constructions):
me = fixed_knockoffs(X, :maxent)
maximum(abs.(me.Xko'me.Xko .- me.Sigma)) #large
Do you know why that is? (Happy to provide a MWE if it helps, though it happens for all designs I have tried it on.)
Thank you,
Nikos
As reported by Jiaqi, there seems to be some significant performance regression for group knockoff solver.
MWE: (Knockoffs.jl v1.1.7 and on Julia 1.10)
using Knockoffs
using LinearAlgebra
using Random
using CSV
using DataFrames
# gnomad test data
datadir = "/Users/biona001/Benjamin_Folder/research/4th_project_groupKO/group_knockoff_test_data"
mapfile = CSV.read(joinpath(datadir, "Groups_2_127374341_128034347.txt"), DataFrame)
groups = mapfile[!, :group]
covfile = CSV.read(joinpath(datadir, "CorG_2_127374341_128034347.txt"), DataFrame)
Σ = covfile |> Matrix{Float64}
Σ = 0.99Σ + 0.01I
@time solve_s_group(
Symmetric(Σ), groups, :maxent,
m = 1, # number of knockoffs per variable to generate
tol = 0.001, # convergence tolerance
outer_iter = 10, # max number of coordinate descent iterations
robust = false, # whether to use robust cholesky updates
verbose = true # whether to print informative intermediate results
)
┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs ~/.julia/dev/Knockoffs/src/group.jl:360
Maxent initial obj = -44822.11340914695
Iter 1 (PCA): obj = -26794.756871827496, δ = 2.6915336716594664, t1 = 24.96, t2 = 32.32
Iter 2 (CCD): obj = -23681.635922280064, δ = 0.5920258362537, t1 = 534.9, t2 = 881.71, t3 = 0.22
Iter 3 (PCA): obj = -23476.599883936684, δ = 1.0851838794654751, t1 = 559.7, t2 = 913.86
Iter 4 (CCD): obj = -23441.194586522815, δ = 0.01917853738286209, t1 = 1059.27, t2 = 1761.83, t3 = 0.44
Iter 5 (PCA): obj = -23428.13516213957, δ = 0.20912865614936338, t1 = 1085.4, t2 = 1794.69
... # terminated
On Knockoffs.jl v0.3.0 (Julia 1.6.7), see Fast Cholesky update notebook
solve_group_max_entropy_ccd: Optimizing 58997 variables
┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs /Users/biona001/.julia/dev/Knockoffs/src/group.jl:230
initial obj = -13811.93738418022
Iter 1: obj = -8863.340455999352, δ = 0.9033274319764998, t1 = 8.85, t2 = 18.23, t3 = 0.04
Iter 2: obj = -7509.4699925543855, δ = 0.7627302489344706, t1 = 19.04, t2 = 36.57, t3 = 0.08
Iter 3: obj = -7495.512979153219, δ = 0.153471691368552, t1 = 29.25, t2 = 54.77, t3 = 0.12
Iter 4: obj = -7491.027981075002, δ = 0.0467433749392695, t1 = 39.2, t2 = 72.95, t3 = 0.16
Iter 5: obj = -7489.0243504490645, δ = 0.007200479138871564, t1 = 49.08, t2 = 90.98, t3 = 0.2
Iter 6: obj = -7487.899557778442, δ = 0.004873425110742543, t1 = 58.98, t2 = 109.34, t3 = 0.24
Iter 7: obj = -7487.198762711129, δ = 0.0031886466238885947, t1 = 68.75, t2 = 127.98, t3 = 0.28
Iter 8: obj = -7486.7318922165505, δ = 0.002106998635746351, t1 = 78.17, t2 = 146.58, t3 = 0.32
Iter 9: obj = -7486.405698134362, δ = 0.0011212737859179697, t1 = 87.39, t2 = 165.21, t3 = 0.37
Iter 10: obj = -7486.1699950030525, δ = 0.0006708441935208511, t1 = 96.52, t2 = 183.44, t3 = 0.41
283.395020 seconds (3.23 M allocations: 329.596 MiB, 0.05% gc time, 0.61% compilation time)
Initial objective value being different is due to different initialization of S. However the timing seems much slower, even if ran on different computers.
The following creates MWE data (test.bed/bim/fam
)
using SnpArrays
using MendelIHT
using Random
Random.seed!(2022) # set seed for reproducibility
n = 1000 # sample size
p = 10000 # number of covariates
x = simulate_correlated_snparray("test.bed", n, p, block_length=20, prob=0.75)
make_bim_fam_files(x, rand(1000), "test")
and a call to fastPHASE HMM knockoffs will throw NaN
error that is caused by normalizing constants overflowing
@time X̃ = hmm_knockoff(data_path, plink_outfile="test.fastphase.knockoffs")
For group knockoffs, when we use SDP/ME/MVR solvers, we initiailzing the optimzation with equicorrelated S matrix. However, that S
matrix is the largest S
such that (m+1)/m*Sigma-S > 0
. This means that when we compute cholesky((m+1)/m*Sigma-S)
, we could potentially introduce numerical stabilities.
For this reason, the next minor patch release might initialize S
with, for example, equi.S ./ 2
to naively circumvent this issue.
On cluster environment, it seems the progress for solve_s
and solve_s_group
is printed only when the job completes, which is not very meaningful for monitoring convergence.
MWE
Only centering and scaling each column to mean 0 variance 1 (but not to unit norm)
using Knockoffs, Random, StatsBase, Statistics
Random.seed!(2021)
n = 3000
p = 1000
X = randn(n, p)
zscore!(X, mean(X, dims=1), std(X, dims=1))
knockoff = knockoff_equi(X);
X̃ = knockoff.X̃
Σ = knockoff.Σ
julia> [vec(X̃' * X̃) vec(Σ)]
1000000×2 Matrix{Float64}:
2999.0 2999.0
68.3776 68.3776
-22.3342 -22.3342
-43.2302 -43.2302
-93.228 -93.2281
32.9963 32.9963
-43.5092 -43.5092
71.2002 71.2002
-91.1404 -91.1404
-28.554 -28.554
-31.7598 -31.7598
-13.1698 -13.1698
-34.6885 -34.6885
22.8903 22.8903
15.2742 15.2742
36.5119 36.5119
-23.4127 -23.4127
⋮
-95.6452 -95.6452
-79.5508 -79.5508
-150.037 -150.037
-24.5637 -24.5637
-86.4215 -86.4215
-26.914 -26.914
2.37256 2.37256
-68.1161 -68.1161
-38.4121 -38.4121
-17.9981 -17.9981
16.1139 16.1139
31.8385 31.8385
-19.9923 -19.9923
31.6304 31.6305
151.434 151.434
-32.3494 -32.3494
2999.0 2999.0
Normaling each column so that ||x_j||_2^2 = 1:
using Knockoffs, Random
Random.seed!(2021)
n = 3000
p = 1000
X = randn(n, p)
normalize_col!(X)
knockoff = knockoff_equi(X);
X̃ = knockoff.X̃
Σ = knockoff.Σ
julia> [vec(X̃' * X̃) vec(Σ)]
1000000×2 Matrix{Float64}:
1.0006 0.999942
0.0113254 0.0227959
-0.0117438 -0.007447
-0.00865152 -0.0144145
-0.0298051 -0.0310659
0.00509079 0.0110014
-0.0118063 -0.0145065
0.0198903 0.0237353
-0.0204552 -0.0303879
-0.00816176 -0.00951761
-0.00766452 -0.0105895
-0.00383353 -0.00439077
-0.00735197 -0.0115651
0.00505043 0.00762776
-0.000776242 0.00509295
0.0112497 0.0121713
-0.00566405 -0.00780654
⋮
-0.0245307 -0.0318917
-0.0281486 -0.0265195
-0.0363757 -0.0500211
-0.00516446 -0.00819044
-0.0156618 -0.02881
-0.0116239 -0.00897332
-0.00495919 0.00079108
-0.0227494 -0.0227123
-0.00883139 -0.0128069
-0.0114954 -0.00600044
0.00750958 0.00537207
0.0062047 0.010616
-0.00481099 -0.00666484
0.00555084 0.010545
0.0406162 0.050493
-0.0076684 -0.0107781
0.994321 0.999956
Currently the SDP solver in Knockoffs.jl
is ~2 times slower than knockoff
package in R and ~4 times slower than the knockoff package in Matlab. Speed difference is entirely due to underlying solver (Hypatia SDP solver+ Convex.jl
in Julia vs R's Rdsdp vs Matlab's SDPT3 solver)
How do we could get more performance?
MWE
using Revise
using Knockoffs
using Test
using LinearAlgebra
using Random
using StatsBase
using Statistics
using Distributions
using ToeplitzMatrices
using RCall
using PositiveFactorizations
using UnicodePlots
using MATLAB
using SCS
using JuMP
using Convex
# simulate data
Random.seed!(2022)
n = 100
p = 400
ρ = 0.4
Sigma = Matrix(SymmetricToeplitz(ρ.^(0:(p-1))))
L = cholesky(Sigma).L
X = randn(n, p) * L # var(X) = L var(N(0, 1)) L' = var(Σ)
true_mu = zeros(p)
Knockoffs.jl
@time knockoff = modelX_gaussian_knockoffs(X, :sdp, true_mu, Sigma) # precompile
s = knockoff.s
2.686531 seconds (507.05 k allocations: 399.686 MiB, 2.05% gc time)
Compare with Matteo's knockoffs in R
@rput Sigma X true_mu
@time begin
R"""
library(knockoff)
r_s = create.solve_sdp(Sigma)
X_ko = create.gaussian(X, true_mu, Sigma, method = "sdp", diag_s=r_s)
"""
end
@rget r_s X_ko
1.208416 seconds (54 allocations: 1.328 KiB)
Compare with Matteo's knockoffs in Matlab
# compare with Matteo's knockoffs in MATLAB
@mput Sigma X true_mu
@time begin
mat"""
addpath("/Users/biona001/Documents/MATLAB/knockoffs_matlab")
matlab_s = knockoffs.create.solveSDP(Sigma)
"""
end
@mget matlab_s
0.715965 seconds (7 allocations: 192 bytes)
At least the answer agrees:
[s r_s matlab_s]
200×3 Matrix{Float64}:
1.0 1.0 1.0
0.971429 0.971427 0.971428
0.811429 0.811431 0.811429
0.875429 0.875427 0.875428
0.849829 0.849829 0.849829
0.860069 0.860068 0.860068
0.855973 0.855973 0.855973
0.857611 0.857611 0.857611
0.856956 0.856956 0.856956
0.857218 0.857218 0.857218
0.857113 0.857113 0.857113
0.857155 0.857155 0.857155
0.857138 0.857138 0.857138
⋮
0.857155 0.857155 0.857155
0.857113 0.857113 0.857113
0.857218 0.857218 0.857218
0.856956 0.856956 0.856956
0.857611 0.857611 0.857611
0.855973 0.855973 0.855973
0.860069 0.860068 0.860068
0.849829 0.849829 0.849829
0.875429 0.875427 0.875428
0.811429 0.811431 0.811429
0.971429 0.971427 0.971428
1.0 1.0 1.0
There is a trick to sample m-knockoffs without forming a p(m+1) * p(m+1) covariance in memory. The trick was used in sample_mvn_efficient
but not yet in condition
. Need to do so asap.
solve_s_graphical_group
and related methods:SDP
option in group knockoffs should be :sdp
Also update readme:
#53 implemented fit_marginal
which internally uses T[i] = |Zi| - |Zko[i]|
as importance score for the i
th feature. In Zihuai's ghostknockoff paper, he instead used T[i] = (Z[i])^2 - (Zko[i])^2
(see eq 15).
I should try this too to see if power is improved.
Related to #71
MWE:
using Knockoffs, LinearAlgebra
Sigma = Symmetric([1.0 0.14138129031570185 -0.07530156753912527 0.1425092255960416 -0.03975762881225776 -0.03975762881225776; 0.14138129031570185 1.0 -0.06269073019504869 0.18796864772097638 -0.042632500750932216 -0.042632500750932216; -0.07530156753912527 -0.06269073019504869 1.0 0.10393405423314547 0.8269051049537056 0.8269051049537056; 0.1425092255960416 0.18796864772097638 0.10393405423314547 1.0 0.1598138853324871 0.1598138853324871; -0.03975762881225776 -0.042632500750932216 0.8269051049537056 0.1598138853324871 1.0 0.9999999999999998; -0.03975762881225776 -0.042632500750932216 0.8269051049537056 0.1598138853324871 0.9999999999999998 1.0])
groups = [1, 2, 3, 4, 3, 3]
julia> group_reps = choose_group_reps(Sigma, groups, threshold=0.9)
6-element Vector{Int64}:
1
2
3
4
5
72037
julia> group_reps = choose_group_reps(Sigma, groups, threshold=0.9)
6-element Vector{Int64}:
1
2
3
4
5
139762467531552
julia> group_reps = choose_group_reps(Sigma, groups, threshold=0.9)
6-element Vector{Int64}:
1
2
3
4
5
66968
MWE
using Knockoffs, LinearAlgebra
Sigma = Symmetric([1.0 0.14138129031570185 -0.07530156753912527 0.1425092255960416 -0.03975762881225776 -0.03975762881225776; 0.14138129031570185 1.0 -0.06269073019504869 0.18796864772097638 -0.042632500750932216 -0.042632500750932216; -0.07530156753912527 -0.06269073019504869 1.0 0.10393405423314547 0.8269051049537056 0.8269051049537056; 0.1425092255960416 0.18796864772097638 0.10393405423314547 1.0 0.1598138853324871 0.1598138853324871; -0.03975762881225776 -0.042632500750932216 0.8269051049537056 0.1598138853324871 1.0 0.9999999999999998; -0.03975762881225776 -0.042632500750932216 0.8269051049537056 0.1598138853324871 0.9999999999999998 1.0])
groups = [1, 2, 3, 4, 3, 3]
choose_group_reps(Sigma, groups, threshold=0.5)
ERROR: MethodError: reducing over an empty collection is not allowed; consider supplying `init` to the reducer
Stacktrace:
[1] reduce_empty(op::Base.MappingRF{Base.var"#316#317"{…}, Base.BottomRF{…}}, ::Type{Pair{…}})
@ Base ./reduce.jl:361
[2] reduce_empty_iter
@ ./reduce.jl:384 [inlined]
[3] reduce_empty_iter
@ ./reduce.jl:383 [inlined]
[4] foldl_impl
@ ./reduce.jl:49 [inlined]
[5] mapfoldl_impl
@ ./reduce.jl:44 [inlined]
[6] mapfoldl
@ ./reduce.jl:175 [inlined]
[7] _findmax
@ ./reduce.jl:908 [inlined]
[8] findmax
@ ./reducedim.jl:1226 [inlined]
[9] _findmax
@ ./reduce.jl:934 [inlined]
[10] findmax
@ ./reducedim.jl:1203 [inlined]
[11] select_one(C::Matrix{Float64}, vlist::Vector{Int64}, RSS0::Int64, tol::Float64)
@ Knockoffs /home/groups/sabatti/.julia/packages/Knockoffs/fRlxF/src/group.jl:2083
[12] select_one(C::AbstractMatrix, vlist::Any, RSS0::Any, tol::Any)
@ Knockoffs /home/groups/sabatti/.julia/packages/Knockoffs/fRlxF/src/group.jl:2081 [inlined]
[13] select_best_rss_subset(C::SubArray{Float64, 2, Symmetric{…}, Tuple{…}, false}, k::Int64)
@ Knockoffs /home/groups/sabatti/.julia/packages/Knockoffs/fRlxF/src/group.jl:2099
[14] choose_group_reps(Σ::Symmetric{…}, groups::Vector{…}; threshold::Float64, prioritize_idx::Nothing, Σinv::Symmetric{…})
@ Knockoffs /home/groups/sabatti/.julia/packages/Knockoffs/fRlxF/src/group.jl:1998
[15] top-level scope
@ REPL[4]:1
Some type information was truncated. Use `show(err)` to see complete types
This is on Julia v1.10 and Knockoffs.jl version v1.1.9
See https://github.com/JuliaRegistries/General/actions/runs/7543605423/job/20535002853?pr=98968
I think we need to wait until Hypatia makes a new release.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix
on this issue.
I'll open a PR within a few hours, please be patient!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.