biona001 / knockoffs.jl Goto Github PK

Variable Selection with Knockoffs

License: MIT License

Julia 100.00%

bioinformatics conditional-independence conditional-independence-test false-discovery-rate fdr feature-selection genomics group-knockoffs hypothesis-testing knockoff-filter knockoffs lasso model-selection multiple-knockoffs variable-selection

knockoffs.jl's Issues

Possibly reduce artifacts size

Knockoffs.jl will soon become a dependency for GhostKnockoffGWAS, which will be compiled into a relocatable app.

Knockoffs's dependency LowRankApprox.jl depends on FFTW.jl which in turn depends on MKL_jll.jl which is a staggering 596.668 MiB. Can we remove this dependency?

Fixed-X knockoffs X.ko'X.ko

Hi Benjamin,

I hope you are doing well!

I wanted to check about the following. The following seems not to match (for any of the fixed-X knockoff constructions):

me = fixed_knockoffs(X, :maxent)
maximum(abs.(me.Xko'me.Xko .- me.Sigma)) #large

Do you know why that is? (Happy to provide a MWE if it helps, though it happens for all designs I have tried it on.)

Thank you,
Nikos

Performance regression

As reported by Jiaqi, there seems to be some significant performance regression for group knockoff solver.

MWE: (Knockoffs.jl v1.1.7 and on Julia 1.10)

using Knockoffs
using LinearAlgebra
using Random
using CSV
using DataFrames

# gnomad test data
datadir = "/Users/biona001/Benjamin_Folder/research/4th_project_groupKO/group_knockoff_test_data"
mapfile = CSV.read(joinpath(datadir, "Groups_2_127374341_128034347.txt"), DataFrame)
groups = mapfile[!, :group]
covfile = CSV.read(joinpath(datadir, "CorG_2_127374341_128034347.txt"), DataFrame)
Σ = covfile |> Matrix{Float64}
Σ = 0.99Σ + 0.01I

@time solve_s_group(
    Symmetric(Σ), groups, :maxent, 
    m = 1,          # number of knockoffs per variable to generate
    tol = 0.001,    # convergence tolerance
    outer_iter = 10,    # max number of coordinate descent iterations
    robust = false, # whether to use robust cholesky updates
    verbose = true    # whether to print informative intermediate results
)

┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs ~/.julia/dev/Knockoffs/src/group.jl:360
Maxent initial obj = -44822.11340914695
Iter 1 (PCA): obj = -26794.756871827496, δ = 2.6915336716594664, t1 = 24.96, t2 = 32.32
Iter 2 (CCD): obj = -23681.635922280064, δ = 0.5920258362537, t1 = 534.9, t2 = 881.71, t3 = 0.22
Iter 3 (PCA): obj = -23476.599883936684, δ = 1.0851838794654751, t1 = 559.7, t2 = 913.86
Iter 4 (CCD): obj = -23441.194586522815, δ = 0.01917853738286209, t1 = 1059.27, t2 = 1761.83, t3 = 0.44
Iter 5 (PCA): obj = -23428.13516213957, δ = 0.20912865614936338, t1 = 1085.4, t2 = 1794.69
... # terminated

On Knockoffs.jl v0.3.0 (Julia 1.6.7), see Fast Cholesky update notebook

solve_group_max_entropy_ccd: Optimizing 58997 variables
┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs /Users/biona001/.julia/dev/Knockoffs/src/group.jl:230
initial obj = -13811.93738418022
Iter 1: obj = -8863.340455999352, δ = 0.9033274319764998, t1 = 8.85, t2 = 18.23, t3 = 0.04
Iter 2: obj = -7509.4699925543855, δ = 0.7627302489344706, t1 = 19.04, t2 = 36.57, t3 = 0.08
Iter 3: obj = -7495.512979153219, δ = 0.153471691368552, t1 = 29.25, t2 = 54.77, t3 = 0.12
Iter 4: obj = -7491.027981075002, δ = 0.0467433749392695, t1 = 39.2, t2 = 72.95, t3 = 0.16
Iter 5: obj = -7489.0243504490645, δ = 0.007200479138871564, t1 = 49.08, t2 = 90.98, t3 = 0.2
Iter 6: obj = -7487.899557778442, δ = 0.004873425110742543, t1 = 58.98, t2 = 109.34, t3 = 0.24
Iter 7: obj = -7487.198762711129, δ = 0.0031886466238885947, t1 = 68.75, t2 = 127.98, t3 = 0.28
Iter 8: obj = -7486.7318922165505, δ = 0.002106998635746351, t1 = 78.17, t2 = 146.58, t3 = 0.32
Iter 9: obj = -7486.405698134362, δ = 0.0011212737859179697, t1 = 87.39, t2 = 165.21, t3 = 0.37
Iter 10: obj = -7486.1699950030525, δ = 0.0006708441935208511, t1 = 96.52, t2 = 183.44, t3 = 0.41
283.395020 seconds (3.23 M allocations: 329.596 MiB, 0.05% gc time, 0.61% compilation time)

Initial objective value being different is due to different initialization of S. However the timing seems much slower, even if ran on different computers.

HMM normalizing constant overflows

The following creates MWE data (test.bed/bim/fam)

using SnpArrays
using MendelIHT
using Random
Random.seed!(2022) # set seed for reproducibility
n = 1000 # sample size
p = 10000 # number of covariates
x = simulate_correlated_snparray("test.bed", n, p, block_length=20, prob=0.75)
make_bim_fam_files(x, rand(1000), "test")

and a call to fastPHASE HMM knockoffs will throw NaN error that is caused by normalizing constants overflowing

@time X̃ = hmm_knockoff(data_path, plink_outfile="test.fastphase.knockoffs")

Possible numerical error when initializing with equi correlation solution

For group knockoffs, when we use SDP/ME/MVR solvers, we initiailzing the optimzation with equicorrelated S matrix. However, that S matrix is the largest S such that (m+1)/m*Sigma-S > 0. This means that when we compute cholesky((m+1)/m*Sigma-S), we could potentially introduce numerical stabilities.

For this reason, the next minor patch release might initialize S with, for example, equi.S ./ 2 to naively circumvent this issue.

No progress printed on screen even when `verbose=true`

On cluster environment, it seems the progress for solve_s and solve_s_group is printed only when the job completes, which is not very meaningful for monitoring convergence.

Normalizing to unit vector introduces numerical instabilities

MWE

Only centering and scaling each column to mean 0 variance 1 (but not to unit norm)

using Knockoffs, Random, StatsBase, Statistics

Random.seed!(2021)
n = 3000
p = 1000
X = randn(n, p)
zscore!(X, mean(X, dims=1), std(X, dims=1))
knockoff = knockoff_equi(X);

X̃ = knockoff.X̃
Σ = knockoff.Σ

julia> [vec(X̃' * X̃) vec(Σ)]
1000000×2 Matrix{Float64}:
 2999.0      2999.0
   68.3776     68.3776
  -22.3342    -22.3342
  -43.2302    -43.2302
  -93.228     -93.2281
   32.9963     32.9963
  -43.5092    -43.5092
   71.2002     71.2002
  -91.1404    -91.1404
  -28.554     -28.554
  -31.7598    -31.7598
  -13.1698    -13.1698
  -34.6885    -34.6885
   22.8903     22.8903
   15.2742     15.2742
   36.5119     36.5119
  -23.4127    -23.4127
    ⋮
  -95.6452    -95.6452
  -79.5508    -79.5508
 -150.037    -150.037
  -24.5637    -24.5637
  -86.4215    -86.4215
  -26.914     -26.914
    2.37256     2.37256
  -68.1161    -68.1161
  -38.4121    -38.4121
  -17.9981    -17.9981
   16.1139     16.1139
   31.8385     31.8385
  -19.9923    -19.9923
   31.6304     31.6305
  151.434     151.434
  -32.3494    -32.3494
 2999.0      2999.0

Normaling each column so that ||x_j||_2^2 = 1:

using Knockoffs, Random

Random.seed!(2021)
n = 3000
p = 1000
X = randn(n, p)
normalize_col!(X)
knockoff = knockoff_equi(X);

X̃ = knockoff.X̃
Σ = knockoff.Σ

julia> [vec(X̃' * X̃) vec(Σ)]
1000000×2 Matrix{Float64}:
  1.0006        0.999942
  0.0113254     0.0227959
 -0.0117438    -0.007447
 -0.00865152   -0.0144145
 -0.0298051    -0.0310659
  0.00509079    0.0110014
 -0.0118063    -0.0145065
  0.0198903     0.0237353
 -0.0204552    -0.0303879
 -0.00816176   -0.00951761
 -0.00766452   -0.0105895
 -0.00383353   -0.00439077
 -0.00735197   -0.0115651
  0.00505043    0.00762776
 -0.000776242   0.00509295
  0.0112497     0.0121713
 -0.00566405   -0.00780654
  ⋮
 -0.0245307    -0.0318917
 -0.0281486    -0.0265195
 -0.0363757    -0.0500211
 -0.00516446   -0.00819044
 -0.0156618    -0.02881
 -0.0116239    -0.00897332
 -0.00495919    0.00079108
 -0.0227494    -0.0227123
 -0.00883139   -0.0128069
 -0.0114954    -0.00600044
  0.00750958    0.00537207
  0.0062047     0.010616
 -0.00481099   -0.00666484
  0.00555084    0.010545
  0.0406162     0.050493
 -0.0076684    -0.0107781
  0.994321      0.999956

SDP solver a lot slower than Matlab/R equivalent

Currently the SDP solver in Knockoffs.jl is ~2 times slower than knockoff package in R and ~4 times slower than the knockoff package in Matlab. Speed difference is entirely due to underlying solver (Hypatia SDP solver+ Convex.jl in Julia vs R's Rdsdp vs Matlab's SDPT3 solver)

How do we could get more performance?

MWE

using Revise
using Knockoffs
using Test
using LinearAlgebra
using Random
using StatsBase
using Statistics
using Distributions
using ToeplitzMatrices
using RCall
using PositiveFactorizations
using UnicodePlots
using MATLAB
using SCS
using JuMP
using Convex

# simulate data
Random.seed!(2022)
n = 100
p = 400
ρ = 0.4
Sigma = Matrix(SymmetricToeplitz(ρ.^(0:(p-1))))
L = cholesky(Sigma).L
X = randn(n, p) * L # var(X) = L var(N(0, 1)) L' = var(Σ)
true_mu = zeros(p)

Knockoffs.jl

@time knockoff = modelX_gaussian_knockoffs(X, :sdp, true_mu, Sigma) # precompile
s = knockoff.s
  2.686531 seconds (507.05 k allocations: 399.686 MiB, 2.05% gc time)

Compare with Matteo's knockoffs in R

@rput Sigma X true_mu
@time begin
    R"""
    library(knockoff)
    r_s = create.solve_sdp(Sigma)
    X_ko = create.gaussian(X, true_mu, Sigma, method = "sdp", diag_s=r_s)
    """
end
@rget r_s X_ko
  1.208416 seconds (54 allocations: 1.328 KiB)

Compare with Matteo's knockoffs in Matlab

# compare with Matteo's knockoffs in MATLAB
@mput Sigma X true_mu
@time begin
    mat"""
    addpath("/Users/biona001/Documents/MATLAB/knockoffs_matlab")
    matlab_s = knockoffs.create.solveSDP(Sigma)
    """
end
@mget matlab_s
  0.715965 seconds (7 allocations: 192 bytes)

At least the answer agrees:

[s r_s matlab_s]
200×3 Matrix{Float64}:
 1.0       1.0       1.0
 0.971429  0.971427  0.971428
 0.811429  0.811431  0.811429
 0.875429  0.875427  0.875428
 0.849829  0.849829  0.849829
 0.860069  0.860068  0.860068
 0.855973  0.855973  0.855973
 0.857611  0.857611  0.857611
 0.856956  0.856956  0.856956
 0.857218  0.857218  0.857218
 0.857113  0.857113  0.857113
 0.857155  0.857155  0.857155
 0.857138  0.857138  0.857138
 ⋮                   
 0.857155  0.857155  0.857155
 0.857113  0.857113  0.857113
 0.857218  0.857218  0.857218
 0.856956  0.856956  0.856956
 0.857611  0.857611  0.857611
 0.855973  0.855973  0.855973
 0.860069  0.860068  0.860068
 0.849829  0.849829  0.849829
 0.875429  0.875427  0.875428
 0.811429  0.811431  0.811429
 0.971429  0.971427  0.971428
 1.0       1.0       1.0

Efficient sampling of multiple knockoffs

There is a trick to sample m-knockoffs without forming a p(m+1) * p(m+1) covariance in memory. The trick was used in sample_mvn_efficient but not yet in condition. Need to do so asap.

Improve documentation

Group knockoffs: Missing documentation for representative group knockoffs
Calling from R: objective function wrong
Need docstrings for solve_s_graphical_group and related methods
GhostKnockoffs: update reference
Example in model-X knockoffs: increase sample size or effect size so power is better
The :SDP option in group knockoffs should be :sdp

Also update readme:

Maybe add message for pr/bugs
update installation instructions once package is registered

documentation not building

Use squared Z-score for marginal feature importance statistic

#53 implemented fit_marginal which internally uses T[i] = |Zi| - |Zko[i]| as importance score for the ith feature. In Zihuai's ghostknockoff paper, he instead used T[i] = (Z[i])^2 - (Zko[i])^2 (see eq 15).

I should try this too to see if power is improved.

Scale group SDP objective by group size

`choose_group_reps` returns stochastic/incorrect results

Related to #71

MWE:

using Knockoffs, LinearAlgebra
Sigma = Symmetric([1.0 0.14138129031570185 -0.07530156753912527 0.1425092255960416 -0.03975762881225776 -0.03975762881225776; 0.14138129031570185 1.0 -0.06269073019504869 0.18796864772097638 -0.042632500750932216 -0.042632500750932216; -0.07530156753912527 -0.06269073019504869 1.0 0.10393405423314547 0.8269051049537056 0.8269051049537056; 0.1425092255960416 0.18796864772097638 0.10393405423314547 1.0 0.1598138853324871 0.1598138853324871; -0.03975762881225776 -0.042632500750932216 0.8269051049537056 0.1598138853324871 1.0 0.9999999999999998; -0.03975762881225776 -0.042632500750932216 0.8269051049537056 0.1598138853324871 0.9999999999999998 1.0])
groups = [1, 2, 3, 4, 3, 3]

julia> group_reps = choose_group_reps(Sigma, groups, threshold=0.9)
6-element Vector{Int64}:
     1
     2
     3
     4
     5
 72037

julia> group_reps = choose_group_reps(Sigma, groups, threshold=0.9)
6-element Vector{Int64}:
               1
               2
               3
               4
               5
 139762467531552

julia> group_reps = choose_group_reps(Sigma, groups, threshold=0.9)
6-element Vector{Int64}:
     1
     2
     3
     4
     5
 66968

`choose_group_reps`: reducing over an empty collection is not allowed

MWE

using Knockoffs, LinearAlgebra
Sigma = Symmetric([1.0 0.14138129031570185 -0.07530156753912527 0.1425092255960416 -0.03975762881225776 -0.03975762881225776; 0.14138129031570185 1.0 -0.06269073019504869 0.18796864772097638 -0.042632500750932216 -0.042632500750932216; -0.07530156753912527 -0.06269073019504869 1.0 0.10393405423314547 0.8269051049537056 0.8269051049537056; 0.1425092255960416 0.18796864772097638 0.10393405423314547 1.0 0.1598138853324871 0.1598138853324871; -0.03975762881225776 -0.042632500750932216 0.8269051049537056 0.1598138853324871 1.0 0.9999999999999998; -0.03975762881225776 -0.042632500750932216 0.8269051049537056 0.1598138853324871 0.9999999999999998 1.0])
groups = [1, 2, 3, 4, 3, 3]
choose_group_reps(Sigma, groups, threshold=0.5)

ERROR: MethodError: reducing over an empty collection is not allowed; consider supplying `init` to the reducer
Stacktrace:
  [1] reduce_empty(op::Base.MappingRF{Base.var"#316#317"{…}, Base.BottomRF{…}}, ::Type{Pair{…}})
    @ Base ./reduce.jl:361
  [2] reduce_empty_iter
    @ ./reduce.jl:384 [inlined]
  [3] reduce_empty_iter
    @ ./reduce.jl:383 [inlined]
  [4] foldl_impl
    @ ./reduce.jl:49 [inlined]
  [5] mapfoldl_impl
    @ ./reduce.jl:44 [inlined]
  [6] mapfoldl
    @ ./reduce.jl:175 [inlined]
  [7] _findmax
    @ ./reduce.jl:908 [inlined]
  [8] findmax
    @ ./reducedim.jl:1226 [inlined]
  [9] _findmax
    @ ./reduce.jl:934 [inlined]
 [10] findmax
    @ ./reducedim.jl:1203 [inlined]
 [11] select_one(C::Matrix{Float64}, vlist::Vector{Int64}, RSS0::Int64, tol::Float64)
    @ Knockoffs /home/groups/sabatti/.julia/packages/Knockoffs/fRlxF/src/group.jl:2083
 [12] select_one(C::AbstractMatrix, vlist::Any, RSS0::Any, tol::Any)
    @ Knockoffs /home/groups/sabatti/.julia/packages/Knockoffs/fRlxF/src/group.jl:2081 [inlined]
 [13] select_best_rss_subset(C::SubArray{Float64, 2, Symmetric{…}, Tuple{…}, false}, k::Int64)
    @ Knockoffs /home/groups/sabatti/.julia/packages/Knockoffs/fRlxF/src/group.jl:2099
 [14] choose_group_reps(Σ::Symmetric{…}, groups::Vector{…}; threshold::Float64, prioritize_idx::Nothing, Σinv::Symmetric{…})
    @ Knockoffs /home/groups/sabatti/.julia/packages/Knockoffs/fRlxF/src/group.jl:1998
 [15] top-level scope
    @ REPL[4]:1
Some type information was truncated. Use `show(err)` to see complete types

This is on Julia v1.10 and Knockoffs.jl version v1.1.9

CI failing on Julia v1.10

See https://github.com/JuliaRegistries/General/actions/runs/7543605423/job/20535002853?pr=98968

I think we need to wait until Hypatia makes a new release.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

biona001 / knockoffs.jl Goto Github PK

knockoffs.jl's Issues

Recommend Projects

Recommend Topics

Recommend Org