When I tried to generate deep knockoffs for binary input data using the parameter family = 'binary', the output was continuous, even though I would have expected the generator to return binary values. In the architecture of the neural net in machine.py, there is a normalization step at line 135, right after nn.Sigmoid(), which I think is a mistake: the probability of belonging to class 1 shouldn't be normalized. Could you explain why the normalization is there?
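I don't know the exact normalization applied at that line, so the row-wise division below is only a hypothetical stand-in (not the actual code in machine.py), but it illustrates my concern: once sigmoid outputs are rescaled, they can no longer be read as per-feature probabilities of class 1.

import torch
import torch.nn as nn

# Hypothetical illustration: sigmoid outputs followed by a normalization step
logits = torch.tensor([[2.0, -1.0, 0.5]])
probs = nn.Sigmoid()(logits)                           # values in (0, 1), interpretable as P(class = 1)
normalized = probs / probs.norm(dim=1, keepdim=True)   # stand-in for the normalization in question
print(probs)       # tensor([[0.8808, 0.2689, 0.6225]])
print(normalized)  # rescaled values, no longer interpretable as probabilities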
As a workaround to get a binary knockoff matrix, I commented out the normalization at line 135 and then thresholded the output, i.e. returned values > threshold as 1s and values <= threshold as 0s (see the sketch below). However, it would be useful for practitioners either to have binary values returned directly or to be given an explanation of how to interpret the continuous values returned for family = 'binary'.
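For reference, this is roughly what my workaround looks like. The 0.5 cutoff is my own arbitrary choice, and the dummy Xk stands for the continuous output of machine.generate() after disabling the normalization; the Bernoulli sampling at the end is an alternative interpretation I am only guessing at, not something confirmed by the documentation.

import numpy as np

# Xk stands for the continuous output of machine.generate(train_X)
# (filled with dummy values here so the snippet runs standalone)
Xk = np.random.default_rng(0).uniform(size=(840, 28))
threshold = 0.5                           # my own (arbitrary) choice of cutoff
Xk_binary = (Xk > threshold).astype(int)  # values > threshold -> 1, values <= threshold -> 0

# Alternative interpretation (assumption): treat the outputs as P(class = 1)
# and sample Bernoulli knockoffs from them
Xk_sampled = np.random.default_rng(1).binomial(1, np.clip(Xk, 0, 1))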
from DeepKnockoffs import KnockoffMachine
from DeepKnockoffs import GaussianKnockoffs
from keras.datasets import mnist
import numpy as np
(train_X, train_y), (test_X, test_y) = mnist.load_data()
##### take, as an example, only images of digit 0 and stack their rows,
##### so that each 28-pixel image row is treated as one 28-dimensional sample
train_X = np.concatenate(train_X[train_y == 0])
##### binarize MNIST: any nonzero pixel becomes 1
train_X[train_X > 0] = 1
##### perform steps similar to those in the tutorial experiments-1.ipynb:
# Compute the empirical covariance matrix of the training data
SigmaHat = np.cov(train_X, rowvar=False)
# Initialize generator of second-order knockoffs
second_order = GaussianKnockoffs(SigmaHat, mu=np.mean(train_X,0), method="sdp")
# Measure pairwise second-order knockoff correlations
corr_g = (np.diag(SigmaHat) - np.diag(second_order.Ds)) / np.diag(SigmaHat)
print('Average absolute pairwise correlation: %.3f.' %(np.mean(np.abs(corr_g))))
#### take a subsample of 30 images (28*30 = 840 rows) to speed up computation
train_X = train_X[:28*30]
#### Set some training parameters (default values from parameters.py)
training_params = {'LAMBDA': 1.0, 'DELTA': 1.0, 'GAMMA': 1.0}
# Set the parameters for training deep knockoffs
pars = dict()
# Number of epochs
pars['epochs'] = 10
# Number of iterations over the full data per epoch
pars['epoch_length'] = 100
# Data type, either "continuous" or "binary" -> "binary"
pars['family'] = "binary"
# Dimensions of the data
pars['p'] = train_X.shape[1]
# Size of the test set
pars['test_size'] = int(0.1*train_X.shape[0])
# Batch size
pars['batch_size'] = int(0.5*train_X.shape[0])
# Learning rate
pars['lr'] = 0.01
# When to decrease learning rate (unused when equal to number of epochs)
pars['lr_milestones'] = [pars['epochs']]
# Width of the network (number of layers is fixed to 6)
pars['dim_h'] = int(10*train_X.shape[1])
# Penalty for the MMD distance
pars['GAMMA'] = training_params['GAMMA']
# Penalty encouraging second-order knockoffs
pars['LAMBDA'] = training_params['LAMBDA']
# Decorrelation penalty hyperparameter
pars['DELTA'] = training_params['DELTA']
# Target pairwise correlations between variables and knockoffs
pars['target_corr'] = corr_g
# Kernel widths for the MMD measure (uniform weights)
pars['alphas'] = [1.,2.,4.,8.,16.,32.,64.,128.]
# Initialize the machine
machine = KnockoffMachine(pars)
# Train the machine
print("Fitting the knockoff machine...")
machine.train(train_X)
#### check that the machine family is indeed 'binary'
print(machine.family)
#### when generating knockoffs, continuous values are returned
Xk = machine.generate(train_X)
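Printing a quick summary of the generated matrix shows the issue; this check is only illustrative and assumes Xk is the matrix returned by machine.generate(train_X) above.

# Illustrative check: the generated knockoffs are continuous rather than binary
print("min: %.3f, max: %.3f, number of unique values: %d"
      % (Xk.min(), Xk.max(), len(np.unique(Xk))))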