When I tried to generate deep knockoffs for binary input data using the parameter family = 'binary', the output was continuous, even though I would have expected the generator to return binary values. In the architecture of the neural net in machine.py, there is a normalization step at line 135, right after nn.Sigmoid(), which I think is a mistake: the probability of belonging to class 1 shouldn't be normalized. Could you explain why the normalization is there?
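I don't know the exact normalization applied at that line, so the row-wise division below is only a hypothetical stand-in (not the actual code in machine.py), but it illustrates my concern: once sigmoid outputs are rescaled, they can no longer be read as per-feature probabilities of class 1.

import torch
import torch.nn as nn

# Hypothetical illustration: sigmoid outputs followed by a normalization step
logits = torch.tensor([[2.0, -1.0, 0.5]])
probs = nn.Sigmoid()(logits)                           # values in (0, 1), interpretable as P(class = 1)
normalized = probs / probs.norm(dim=1, keepdim=True)   # stand-in for the normalization in question
print(probs)       # tensor([[0.8808, 0.2689, 0.6225]])
print(normalized)  # rescaled values, no longer interpretable as probabilities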
As a workaround to get a binary knockoff matrix, I commented out the normalization at line 135 and then thresholded the output, i.e. returned values > threshold as 1s and values <= threshold as 0s (see the sketch below). However, it would be useful for practitioners either to have binary values returned directly or to be given an explanation of how to interpret the continuous values returned for family = 'binary'.
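For reference, this is roughly what my workaround looks like. The 0.5 cutoff is my own arbitrary choice, and the dummy Xk stands for the continuous output of machine.generate() after disabling the normalization; the Bernoulli sampling at the end is an alternative interpretation I am only guessing at, not something confirmed by the documentation.

import numpy as np

# Xk stands for the continuous output of machine.generate(train_X)
# (filled with dummy values here so the snippet runs standalone)
Xk = np.random.default_rng(0).uniform(size=(840, 28))
threshold = 0.5                           # my own (arbitrary) choice of cutoff
Xk_binary = (Xk > threshold).astype(int)  # values > threshold -> 1, values <= threshold -> 0

# Alternative interpretation (assumption): treat the outputs as P(class = 1)
# and sample Bernoulli knockoffs from them
Xk_sampled = np.random.default_rng(1).binomial(1, np.clip(Xk, 0, 1))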
from DeepKnockoffs import KnockoffMachine
from DeepKnockoffs import GaussianKnockoffs
from keras.datasets import mnist
import numpy as np
(train_X, train_y), (test_X, test_y) = mnist.load_data()
##### take, as an example, only images of digit 0 and stack their rows,
##### so that each 28-pixel image row is treated as one 28-dimensional sample
train_X = np.concatenate(train_X[train_y == 0])
##### binarize MNIST: any nonzero pixel becomes 1
train_X[train_X > 0] = 1
##### perform steps similar to those in the tutorial experiments-1.ipynb:
# Compute the empirical covariance matrix of the training data
SigmaHat = np.cov(train_X, rowvar=False)
# Initialize generator of second-order knockoffs
second_order = GaussianKnockoffs(SigmaHat, mu=np.mean(train_X,0), method="sdp")
# Measure pairwise second-order knockoff correlations
corr_g = (np.diag(SigmaHat) - np.diag(second_order.Ds)) / np.diag(SigmaHat)
print('Average absolute pairwise correlation: %.3f.' %(np.mean(np.abs(corr_g))))
#### take a subsample of 30 images (28*30 = 840 rows) to speed up computation
train_X = train_X[:28*30]
#### Set some training parameters (default values from parameters.py)
training_params = {'LAMBDA': 1.0, 'DELTA': 1.0, 'GAMMA': 1.0}
# Set the parameters for training deep knockoffs
pars = dict()
# Number of epochs
pars['epochs'] = 10
# Number of iterations over the full data per epoch
pars['epoch_length'] = 100
# Data type, either "continuous" or "binary" -> "binary"
pars['family'] = "binary"
# Dimensions of the data
pars['p'] = train_X.shape[1]
# Size of the test set
pars['test_size'] = int(0.1*train_X.shape[0])
# Batch size
pars['batch_size'] = int(0.5*train_X.shape[0])
# Learning rate
pars['lr'] = 0.01
# When to decrease learning rate (unused when equal to number of epochs)
pars['lr_milestones'] = [pars['epochs']]
# Width of the network (number of layers is fixed to 6)
pars['dim_h'] = int(10*train_X.shape[1])
# Penalty for the MMD distance
pars['GAMMA'] = training_params['GAMMA']
# Penalty encouraging second-order knockoffs
pars['LAMBDA'] = training_params['LAMBDA']
# Decorrelation penalty hyperparameter
pars['DELTA'] = training_params['DELTA']
# Target pairwise correlations between variables and knockoffs
pars['target_corr'] = corr_g
# Kernel widths for the MMD measure (uniform weights)
pars['alphas'] = [1.,2.,4.,8.,16.,32.,64.,128.]
# Initialize the machine
machine = KnockoffMachine(pars)
# Train the machine
print("Fitting the knockoff machine...")
machine.train(train_X)
#### check that the machine family is indeed 'binary'
print(machine.family)
#### when generating knockoffs, continuous values are returned
Xk = machine.generate(train_X)
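Printing a quick summary of the generated matrix shows the issue; this check is only illustrative and assumes Xk is the matrix returned by machine.generate(train_X) above.

# Illustrative check: the generated knockoffs are continuous rather than binary
print("min: %.3f, max: %.3f, number of unique values: %d"
      % (Xk.min(), Xk.max(), len(np.unique(Xk))))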