Deep Knockoffs

This repository provides a Python package for sampling approximate model-X knockoffs using deep generative models.

Accompanying paper: https://arxiv.org/abs/1811.06687, published in the Journal of the American Statistical Association (https://doi.org/10.1080/01621459.2019.1660174).

To learn more about the algorithm implemented in this package, visit https://web.stanford.edu/group/candes/deep-knockoffs/ and read the accompanying paper.

To learn more about the broader framework of knockoffs, visit https://web.stanford.edu/group/candes/knockoffs/.

Software dependencies

The code in this repository was tested with the following Python configuration:

  • python=3.6.5
  • numpy=1.14.0
  • scipy=1.0.0
  • pytorch=0.4.1
  • cvxpy=1.0.10
  • cvxopt=1.2.0
  • pandas=0.23.4
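
To compare a local environment against the versions listed above, something like the following sketch may help. It assumes Python 3.8 or newer (for `importlib.metadata`, which is newer than the tested Python 3.6.5) and assumes that PyTorch is installed under the distribution name `torch`. The helper name `installed_versions` is made up for this illustration and is not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above.
# Assumption: "torch" is the installed distribution name for PyTorch.
TESTED = {
    "numpy": "1.14.0",
    "scipy": "1.0.0",
    "torch": "0.4.1",
    "cvxpy": "1.0.10",
    "cvxopt": "1.2.0",
    "pandas": "0.23.4",
}


def installed_versions(tested=TESTED):
    """Map each package name to (tested version, installed version or None)."""
    report = {}
    for name, tested_version in tested.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (tested_version, found)
    return report


if __name__ == "__main__":
    for name, (tested_version, found) in installed_versions().items():
        print(f"{name}: tested {tested_version}, installed {found or 'not found'}")
```

A mismatch does not necessarily mean the code will fail; the list only records the configuration that was tested.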

Installation guide

cd DeepKnockoffs
python setup.py install --user

Examples

License

This software is distributed under the GPLv3 license and it comes with ABSOLUTELY NO WARRANTY.

Contributors

msesia

deepknockoffs's Issues

How to get binary DeepKnockoffs?

When I tried to generate deep knockoffs for binary input data using the parameter family = 'binary', the output was continuous, although I expected the generator to return binary values. In the neural network architecture in machine.py, there is a normalization step on line 135, after nn.Sigmoid(), which I think is a mistake, since the probability of belonging to class 1 shouldn't be normalized. Could you explain why the normalization is there?
As a workaround to get a binary knockoff matrix, I commented out the normalization on line 135 and then applied thresholding, i.e. values > threshold are returned as 1 and values <= threshold as 0. However, it would be useful for practitioners to have binary values returned directly or, alternatively, to have an explanation of how to interpret the continuous values returned for family = 'binary'.

You can find a reproducible example that yields continuous values for family = binary below.

Thanks for your help!


from DeepKnockoffs import KnockoffMachine
from DeepKnockoffs import GaussianKnockoffs
from keras.datasets import mnist
import numpy as np


(train_X, train_y), (test_X, test_y) = mnist.load_data()

#####  reshape and take, as an example, only images of digit 0 
train_X = np.concatenate(train_X[train_y == 0])

##### make mnist binary
train_X[train_X>0] = 1

##### perform similar steps as in tutorial experiments-1.ipynb:

# Compute the empirical covariance matrix of the training data
SigmaHat = np.cov(train_X, rowvar=False)

# Initialize generator of second-order knockoffs
second_order = GaussianKnockoffs(SigmaHat, mu=np.mean(train_X,0), method="sdp")

# Measure pairwise second-order knockoff correlations 
corr_g = (np.diag(SigmaHat) - np.diag(second_order.Ds)) / np.diag(SigmaHat)

print('Average absolute pairwise correlation: %.3f.' %(np.mean(np.abs(corr_g))))

#### take a subsample of 30 images to speed up computation 
train_X = train_X[:28*30]

#### Training parameters (set inline here instead of being loaded from parameters.py)
training_params = {'LAMBDA': 1.0, 'DELTA': 1.0, 'GAMMA': 1.0}

# Set the parameters for training deep knockoffs
pars = dict()
# Number of epochs
pars['epochs'] = 10
# Number of iterations over the full data per epoch
pars['epoch_length'] = 100
# Data type, either "continuous" or "binary" -> "binary"
pars['family'] = "binary" 
# Dimensions of the data
pars['p'] = train_X.shape[1]
# Size of the test set
pars['test_size']  = int(0.1*train_X.shape[0])
# Batch size
pars['batch_size'] = int(0.5*train_X.shape[0])
# Learning rate
pars['lr'] = 0.01
# When to decrease learning rate (unused when equal to number of epochs)
pars['lr_milestones'] = [pars['epochs']]
# Width of the network (number of layers is fixed to 6)
pars['dim_h'] = int(10*train_X.shape[1])
# Penalty for the MMD distance
pars['GAMMA'] = training_params['GAMMA']
# Penalty encouraging second-order knockoffs
pars['LAMBDA'] = training_params['LAMBDA']
# Decorrelation penalty hyperparameter
pars['DELTA'] = training_params['DELTA']
# Target pairwise correlations between variables and knockoffs
pars['target_corr'] = corr_g
# Kernel widths for the MMD measure (uniform weights)
pars['alphas'] = [1.,2.,4.,8.,16.,32.,64.,128.]


# Initialize the machine
machine = KnockoffMachine(pars)

# Train the machine
print("Fitting the knockoff machine...")
machine.train(train_X)

#### Check that the machine family is indeed 'binary'
print(machine.family)
#### When generating knockoffs, continuous data is returned
print(machine.generate(train_X))
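
The thresholding workaround described in this issue can be sketched as follows. This is only an illustration: the helper name `binarize` and the default threshold of 0.5 are assumptions, and the small array stands in for the continuous output of machine.generate. If the normalization after nn.Sigmoid() is left in place, the values may not be probabilities, so a different threshold may be needed.

```python
import numpy as np


def binarize(values, threshold=0.5):
    """Return 1 where values > threshold and 0 where values <= threshold."""
    values = np.asarray(values, dtype=float)
    return (values > threshold).astype(int)


# Stand-in for the continuous array returned by machine.generate(train_X)
continuous = np.array([[0.10, 0.90],
                       [0.50, 0.73]])
binary = binarize(continuous)
print(binary)
```

Values exactly equal to the threshold are mapped to 0, matching the rule stated in the issue.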

Do we need `target_corr`?

Hi there,

I am trying to use your package for variable selection, but I am confused by your example code, which computes a target correlation from the second-order knockoffs and then passes it as a parameter. The pseudo-code of the algorithm in your paper does not seem to include such a term:
(image: algorithm pseudo-code from the paper)
Does the objective function need the target correlation? Am I missing something?
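
For context on what the parameter contains: the example code computes it as (diag(SigmaHat) - diag(Ds)) / diag(SigmaHat), i.e. one correlation per variable between that variable and its second-order knockoff. The sketch below reproduces that arithmetic on a toy covariance matrix. The vector `s` is a hypothetical stand-in for the diagonal of second_order.Ds, and the function name is made up for this illustration. It does not settle whether the training objective requires the term.

```python
import numpy as np


def target_correlations(sigma, s):
    """Per-variable correlation between X_j and its second-order knockoff.

    With Cov(X_j, knockoff_j) = Sigma_jj - s_j and both variances equal to
    Sigma_jj, the correlation is (Sigma_jj - s_j) / Sigma_jj.
    """
    sigma_diag = np.diag(sigma)
    return (sigma_diag - np.asarray(s, dtype=float)) / sigma_diag


# Toy covariance matrix and hypothetical s values (not produced by the SDP solver)
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
s = np.array([0.5, 1.0])
corr = target_correlations(sigma, s)
print(corr)
```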

Potential training resume bug

I noticed that when the resume option is enabled in the KnockoffMachine's train function, loading from previously saved checkpoints fails because the '_checkpoint.pth.tar' extension is appended to checkpoint_name twice.

In DeepKnockoffs/DeepKnockoffs/machine.py, the '_checkpoint.pth.tar' extension is first added to checkpoint_name during object construction (__init__(self, ...)).

Then, when self.load(self.checkpoint_name) is called to resume training, the load function adds the '_checkpoint.pth.tar' extension to checkpoint_name again, which leads to a "no checkpoint found" error.

Changing line 549 in DeepKnockoffs/DeepKnockoffs/machine.py to filename = checkpoint_name should solve the issue.
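
The double-suffix behaviour reported here can be illustrated in isolation. The sketch below does not reproduce the code in machine.py. The helper `checkpoint_path` and the run name "run1" are hypothetical, and appending the suffix only when it is missing is one possible alternative to the one-line change proposed above.

```python
SUFFIX = "_checkpoint.pth.tar"


def checkpoint_path(checkpoint_name, suffix=SUFFIX):
    """Append the suffix only if the name does not already end with it."""
    if checkpoint_name.endswith(suffix):
        return checkpoint_name
    return checkpoint_name + suffix


# Name as stored after construction, according to the report (suffix already added)
stored_name = "run1" + SUFFIX

# Appending the suffix again on load yields a file name that does not exist
buggy = stored_name + SUFFIX

# The guarded helper leaves an already-suffixed name unchanged
fixed = checkpoint_path(stored_name)

print(buggy)
print(fixed)
```

The guarded version accepts both a bare name and an already-suffixed one, at the cost of an extra check.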
