A first introduction to deep knockoffs in Python

Notebook written by Matteo Sesia and Yaniv Romano

Stanford University, Department of Statistics

Last updated on: November 18, 2018

The purpose of this notebook is to provide a simple usage example of the DeepKnockoff package for generating approximate knockoffs. While this code can be easily run on a laptop within a couple of minutes, more accurate knockoff machines and higher-dimensional distributions may require additional computational time or the availaibility of a graphical processing unit.

Load the required libraries

Additional dependencies for this notebook:

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from DeepKnockoffs import KnockoffMachine
from DeepKnockoffs import GaussianKnockoffs
import data
import diagnostics
import parameters

Data generating model

We model the distribution of $X \in \mathbb{R}^p $ as multivariate Gaussian, with $p=20$ and the covariance matrix of an auto-regressive process of order one. The default correlation parameter is $\rho =0.5$.

In [2]:
# Number of features
p = 20

# Load the built-in Gaussian model and its default parameters
# The currently available built-in models are:
# - gaussian : Multivariate Gaussian distribution
# - gmm      : Gaussian mixture model
# - mstudent : Multivariate Student's-t distribution
# - sparse   : Multivariate sparse Gaussian distribution 
model = "gaussian"
distribution_params = parameters.GetDistributionParams(model, p)

# Initialize the data generator
DataSampler = data.DataSampler(distribution_params)

Let's sample $n=2000$ observations of $X$. This dataset will be used later to train a deep knockoff machine.

In [3]:
# Number of training examples
n = 2000

# Sample training data
X_train = DataSampler.sample(n)
print("Generated a training dataset of size: %d x %d." %(X_train.shape))
Generated a training dataset of size: 2000 x 20.

Second-order knockoffs

After computing the empirical covariance matrix of $X$ in the training dataset, we can initialize a generator of second-order knockoffs. The solution of the SDP determines the pairwise correlations between the original variables and the knockoffs produced by this algorithm.

In [4]:
# Compute the empirical covariance matrix of the training data
SigmaHat = np.cov(X_train, rowvar=False)

# Initialize generator of second-order knockoffs
second_order = GaussianKnockoffs(SigmaHat, mu=np.mean(X_train,0), method="sdp")

# Measure pairwise second-order knockoff correlations 
corr_g = (np.diag(SigmaHat) - np.diag(second_order.Ds)) / np.diag(SigmaHat)

print('Average absolute pairwise correlation: %.3f.' %(np.mean(np.abs(corr_g))))
Average absolute pairwise correlation: 0.319.

Deep knockoff machine

We initialize a deep knockoff machine with the following parameters. The default number of epochs is small to reduce the computation time, and it should be increased to improve the accuracy of the machine. The size of the test set indicates the number of observations that should be set aside in order to monitor in real time the ability of the machine to generate knockoff copies for previously unobserved data.

In [5]:
# Load the default hyperparameters for this model
training_params = parameters.GetTrainingHyperParams(model)

# Set the parameters for training deep knockoffs
pars = dict()
# Number of epochs
pars['epochs'] = 10
# Number of iterations over the full data per epoch
pars['epoch_length'] = 50
# Data type, either "continuous" or "binary"
pars['family'] = "continuous"
# Dimensions of the data
pars['p'] = p
# Size of the test set
pars['test_size']  = int(0.1*n)
# Batch size
pars['batch_size'] = int(0.45*n)
# Learning rate
pars['lr'] = 0.01
# When to decrease learning rate (unused when equal to number of epochs)
pars['lr_milestones'] = [pars['epochs']]
# Width of the network (number of layers is fixed to 6)
pars['dim_h'] = int(10*p)
# Penalty for the MMD distance
pars['GAMMA'] = training_params['GAMMA']
# Penalty encouraging second-order knockoffs
pars['LAMBDA'] = training_params['LAMBDA']
# Decorrelation penalty hyperparameter
pars['DELTA'] = training_params['DELTA']
# Target pairwise correlations between variables and knockoffs
pars['target_corr'] = corr_g
# Kernel widths for the MMD measure (uniform weights)
pars['alphas'] = [1.,2.,4.,8.,16.,32.,64.,128.]

# Initialize the machine
machine = KnockoffMachine(pars)

Let's fit the machine to the training data. The value of the loss function on the training and test data will be printed after each epoch, along with other diagnostics based on the MMD, the second moments and the pairwise correlations between variables and knockoffs.

In [6]:
# Train the machine
print("Fitting the knockoff machine...")
machine.train(X_train)
Fitting the knockoff machine...
[   1/  10], Loss: (0.6287, 0.8204), MMD: (0.2954,0.5362), Cov: (0.424,0.686), Decorr: (0.392,0.397)
[   2/  10], Loss: (0.3170, 0.8240), MMD: (0.2542,0.5282), Cov: (0.216,0.642), Decorr: (0.351,0.325)
[   3/  10], Loss: (0.2901, 0.7034), MMD: (0.2494,0.5081), Cov: (0.171,0.581), Decorr: (0.334,0.338)
[   4/  10], Loss: (0.2823, 0.8376), MMD: (0.2469,0.5206), Cov: (0.142,0.623), Decorr: (0.331,0.330)
[   5/  10], Loss: (0.2799, 0.7857), MMD: (0.2469,0.5116), Cov: (0.134,0.582), Decorr: (0.326,0.315)
[   6/  10], Loss: (0.2767, 0.7345), MMD: (0.2452,0.5097), Cov: (0.141,0.587), Decorr: (0.328,0.335)
[   7/  10], Loss: (0.2774, 0.8336), MMD: (0.2465,0.5212), Cov: (0.133,0.590), Decorr: (0.318,0.320)
[   8/  10], Loss: (0.2781, 0.7661), MMD: (0.2476,0.5075), Cov: (0.133,0.611), Decorr: (0.321,0.336)
[   9/  10], Loss: (0.2758, 0.7342), MMD: (0.2459,0.5179), Cov: (0.127,0.651), Decorr: (0.318,0.317)
[  10/  10], Loss: (0.2757, 0.7169), MMD: (0.2455,0.5195), Cov: (0.133,0.621), Decorr: (0.315,0.311)

Knockoff diagnostics on the training data

The second-order method and the machine trained above can be used to generate knockoff copies of the original data.

In [7]:
# Generate deep knockoffs
Xk_train_m = machine.generate(X_train)
print("Size of the deep knockoff dataset: %d x %d." %(Xk_train_m.shape))

# Generate second-order knockoffs
Xk_train_g = second_order.generate(X_train)
print("Size of the second-order knockoff dataset: %d x %d." %(Xk_train_g.shape))
Size of the deep knockoff dataset: 2000 x 20.
Size of the second-order knockoff dataset: 2000 x 20.

The goodness of fit of the knockoffs on the training data can be assessed by checking whether the empirical covariance matrix $\hat{G}$ of $(X,\tilde{X})$ satisfies the required symmetries. Given that $$ \hat{G} = \begin{bmatrix} \hat{G}_{XX} & \hat{G}_{X\tilde{X}} \\ \hat{G}_{X\tilde{X}} & \hat{G}_{\tilde{X}\tilde{X}} \end{bmatrix}, $$ we expect that $\hat{G}_{\tilde{X}\tilde{X}} \approx \hat{G}_{XX}$ and $\hat{G}_{X\tilde{X}} \approx \hat{G}_{XX}$. The latter equality should only hold for the off-diagonal elements. The diagonal elements of $\hat{G}_{X\tilde{X}}$ should be small as possible for the knockoffs to be powerful.

Simple diagnostics can be obtained by displaying the entries of the above blocks of the covariance matrix in scatter plots.

In [8]:
# Plot diagnostics for deep knockoffs
diagnostics.ScatterCovariance(X_train, Xk_train_m)
Out[8]:

Knockoff diagnostics on test data

The knockoff generator can be used on new observations of $X$.

In [9]:
# Sample test data
X_test = DataSampler.sample(n, test=True)
print("Generated a test dataset of size: %d x %d." %(X_test.shape))
Generated a test dataset of size: 2000 x 20.

Knockoffs for the test data are generated using the deep machine, the second-order method and an oracle that knows the exact distribution of $X$.

In [10]:
# Generate deep knockoffs
Xk_test_m = machine.generate(X_test)
print("Size of the deep knockoff test dataset: %d x %d." %(Xk_test_m.shape))

# Generate second-order knockoffs
Xk_test_g = second_order.generate(X_test)
print("Size of the second-order knockoff test dataset: %d x %d." %(Xk_test_g.shape))

# Generate oracle knockoffs
oracle = GaussianKnockoffs(DataSampler.Sigma, method="sdp", mu=DataSampler.mu)
Xk_test_o = oracle.generate(X_test)
print("Size of the oracle knockoff test dataset: %d x %d." %(Xk_test_o.shape))
Size of the deep knockoff test dataset: 2000 x 20.
Size of the second-order knockoff test dataset: 2000 x 20.
Size of the oracle knockoff test dataset: 2000 x 20.

A simple measure of the goodness of fit on the test data for the knockoffs generated by the deep machine is shown in the scatter plots below.

In [11]:
# Plot diagnostics for deep knockoffs
diagnostics.ScatterCovariance(X_test, Xk_test_m)
Out[11]:

The corresponding scatter plots for the second-order knockoffs are below.

In [12]:
# Plot diagnostics for second-order knockoffs
diagnostics.ScatterCovariance(X_test, Xk_test_g)
Out[12]:

Finally, the scatter plots for the oracle knockoffs.

In [13]:
# Plot diagnostics for oracle knockoffs
diagnostics.ScatterCovariance(X_test, Xk_test_o)
Out[13]:

More diagnostics can be computed by drawing new observations from the data sampler and applying the goodness-of-fit tests discussed in the paper.

In [14]:
# Compute goodness of fit diagnostics on 50 test sets containing 100 observations each
n_exams = 50
n_samples = 100
exam = diagnostics.KnockoffExam(DataSampler,
                                {'Oracle':oracle, 'Machine':machine, 'Second-order':second_order})
diagnostics = exam.diagnose(n_samples, n_exams)
Computing knockoff diagnostics...
[=========================] 100%
In [15]:
# Summarize diagnostics
diagnostics.groupby(['Method', 'Metric', 'Swap']).describe()
Out[15]:
Value
count mean std min 25% 50% 75% max
Method Metric Swap
Machine Covariance full 50.0 1.291893 3.492836 -6.243256 -1.132957 1.361824 3.831310 7.621140
partial 50.0 1.201393 3.512167 -7.099976 -0.807556 1.344955 3.703308 7.811020
self 50.0 0.319834 0.020318 0.265953 0.307952 0.319296 0.332854 0.361379
Energy full 50.0 0.178917 0.028412 0.127072 0.160863 0.172369 0.201243 0.281264
partial 50.0 0.177388 0.026872 0.139052 0.158226 0.172530 0.193290 0.277696
KNN full 50.0 0.516900 0.044697 0.445000 0.480000 0.505000 0.553750 0.610000
partial 50.0 0.516100 0.044565 0.430000 0.481250 0.515000 0.533750 0.625000
MMD full 50.0 0.000638 0.004654 -0.008441 -0.002597 -0.000508 0.004116 0.016774
partial 50.0 0.000600 0.004108 -0.005813 -0.002666 0.000556 0.003095 0.016446
Oracle Covariance full 50.0 0.844697 3.372040 -7.439972 -1.423676 0.952896 3.050602 7.580246
partial 50.0 0.888457 3.534530 -7.467270 -2.099644 1.265839 3.598145 8.151550
self 50.0 0.305346 0.015524 0.266235 0.297000 0.303309 0.311392 0.342307
Energy full 50.0 0.181859 0.033260 0.130520 0.159350 0.172052 0.199673 0.306199
partial 50.0 0.181983 0.031292 0.134997 0.159830 0.173137 0.203746 0.293915
KNN full 50.0 0.499000 0.034226 0.435000 0.476250 0.500000 0.520000 0.575000
partial 50.0 0.497500 0.035689 0.425000 0.480000 0.505000 0.518750 0.605000
MMD full 50.0 0.000829 0.005462 -0.007689 -0.003113 0.000204 0.003769 0.020159
partial 50.0 0.000627 0.005011 -0.005562 -0.002878 -0.001016 0.003096 0.019734
Second-order Covariance full 50.0 0.044135 3.958364 -7.354050 -2.445362 -0.357407 1.946968 10.844696
partial 50.0 0.172106 3.938555 -6.427490 -2.399406 0.060036 2.490494 12.221039
self 50.0 0.316073 0.015253 0.280959 0.306835 0.317167 0.325608 0.348290
Energy full 50.0 0.180721 0.034463 0.124246 0.152715 0.173233 0.200477 0.274749
partial 50.0 0.180581 0.033092 0.131867 0.158839 0.169689 0.195604 0.275722
KNN full 50.0 0.489200 0.036188 0.410000 0.460000 0.490000 0.515000 0.600000
partial 50.0 0.490700 0.035985 0.410000 0.465000 0.490000 0.520000 0.570000
MMD full 50.0 0.000477 0.005564 -0.009314 -0.004011 -0.000429 0.003495 0.013740
partial 50.0 0.000313 0.005559 -0.007791 -0.003860 -0.000923 0.002761 0.015787

The diagnostics corresponding to the covariance test are shown below.

In [16]:
# Plot covariance goodness-of-fit statistics
data = diagnostics[(diagnostics.Metric=="Covariance") & (diagnostics.Swap != "self")]
fig,ax = plt.subplots(figsize=(12,6))
sns.boxplot(x="Swap", y="Value", hue="Method", data=data)
fig
Out[16]:

Similarly, we can plot the diagnostics corresponding to the $k$-nearest neighbors test.

In [17]:
# Plot k-nearest neighbors goodness-of-fit statistics
data = diagnostics[(diagnostics.Metric=="KNN") & (diagnostics.Swap != "self")]
fig,ax = plt.subplots(figsize=(12,6))
sns.boxplot(x="Swap", y="Value", hue="Method", data=data)
fig
Out[17]: