A first introduction to deep knockoffs in Python¶

Notebook written by Matteo Sesia and Yaniv Romano¶

Stanford University, Department of Statistics¶

Last updated on: November 18, 2018¶

The purpose of this notebook is to provide a simple usage example of the DeepKnockoff package for generating approximate knockoffs. While this code can be easily run on a laptop within a couple of minutes, more accurate knockoff machines and higher-dimensional distributions may require additional computational time or the availaibility of a graphical processing unit.

Load the required libraries¶

Additional dependencies for this notebook:

torch_two_sample https://github.com/josipd/torch-two-sample

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from DeepKnockoffs import KnockoffMachine
from DeepKnockoffs import GaussianKnockoffs
import data
import diagnostics
import parameters

Data generating model¶

We model the distribution of $X \in \mathbb{R}^p $ as multivariate Gaussian, with $p=20$ and the covariance matrix of an auto-regressive process of order one. The default correlation parameter is $\rho =0.5$.

# Number of features
p = 20

# Load the built-in Gaussian model and its default parameters
# The currently available built-in models are:
# - gaussian : Multivariate Gaussian distribution
# - gmm      : Gaussian mixture model
# - mstudent : Multivariate Student's-t distribution
# - sparse   : Multivariate sparse Gaussian distribution 
model = "gaussian"
distribution_params = parameters.GetDistributionParams(model, p)

# Initialize the data generator
DataSampler = data.DataSampler(distribution_params)

Let's sample $n=2000$ observations of $X$. This dataset will be used later to train a deep knockoff machine.

# Number of training examples
n = 2000

# Sample training data
X_train = DataSampler.sample(n)
print("Generated a training dataset of size: %d x %d." %(X_train.shape))

Generated a training dataset of size: 2000 x 20.

Second-order knockoffs¶

After computing the empirical covariance matrix of $X$ in the training dataset, we can initialize a generator of second-order knockoffs. The solution of the SDP determines the pairwise correlations between the original variables and the knockoffs produced by this algorithm.

# Compute the empirical covariance matrix of the training data
SigmaHat = np.cov(X_train, rowvar=False)

# Initialize generator of second-order knockoffs
second_order = GaussianKnockoffs(SigmaHat, mu=np.mean(X_train,0), method="sdp")

# Measure pairwise second-order knockoff correlations 
corr_g = (np.diag(SigmaHat) - np.diag(second_order.Ds)) / np.diag(SigmaHat)

print('Average absolute pairwise correlation: %.3f.' %(np.mean(np.abs(corr_g))))

Average absolute pairwise correlation: 0.319.

Deep knockoff machine¶

We initialize a deep knockoff machine with the following parameters. The default number of epochs is small to reduce the computation time, and it should be increased to improve the accuracy of the machine. The size of the test set indicates the number of observations that should be set aside in order to monitor in real time the ability of the machine to generate knockoff copies for previously unobserved data.

# Load the default hyperparameters for this model
training_params = parameters.GetTrainingHyperParams(model)

# Set the parameters for training deep knockoffs
pars = dict()
# Number of epochs
pars['epochs'] = 10
# Number of iterations over the full data per epoch
pars['epoch_length'] = 50
# Data type, either "continuous" or "binary"
pars['family'] = "continuous"
# Dimensions of the data
pars['p'] = p
# Size of the test set
pars['test_size']  = int(0.1*n)
# Batch size
pars['batch_size'] = int(0.45*n)
# Learning rate
pars['lr'] = 0.01
# When to decrease learning rate (unused when equal to number of epochs)
pars['lr_milestones'] = [pars['epochs']]
# Width of the network (number of layers is fixed to 6)
pars['dim_h'] = int(10*p)
# Penalty for the MMD distance
pars['GAMMA'] = training_params['GAMMA']
# Penalty encouraging second-order knockoffs
pars['LAMBDA'] = training_params['LAMBDA']
# Decorrelation penalty hyperparameter
pars['DELTA'] = training_params['DELTA']
# Target pairwise correlations between variables and knockoffs
pars['target_corr'] = corr_g
# Kernel widths for the MMD measure (uniform weights)
pars['alphas'] = [1.,2.,4.,8.,16.,32.,64.,128.]

# Initialize the machine
machine = KnockoffMachine(pars)

Let's fit the machine to the training data. The value of the loss function on the training and test data will be printed after each epoch, along with other diagnostics based on the MMD, the second moments and the pairwise correlations between variables and knockoffs.

# Train the machine
print("Fitting the knockoff machine...")
machine.train(X_train)

Fitting the knockoff machine...
[   1/  10], Loss: (0.6287, 0.8204), MMD: (0.2954,0.5362), Cov: (0.424,0.686), Decorr: (0.392,0.397)
[   2/  10], Loss: (0.3170, 0.8240), MMD: (0.2542,0.5282), Cov: (0.216,0.642), Decorr: (0.351,0.325)
[   3/  10], Loss: (0.2901, 0.7034), MMD: (0.2494,0.5081), Cov: (0.171,0.581), Decorr: (0.334,0.338)
[   4/  10], Loss: (0.2823, 0.8376), MMD: (0.2469,0.5206), Cov: (0.142,0.623), Decorr: (0.331,0.330)
[   5/  10], Loss: (0.2799, 0.7857), MMD: (0.2469,0.5116), Cov: (0.134,0.582), Decorr: (0.326,0.315)
[   6/  10], Loss: (0.2767, 0.7345), MMD: (0.2452,0.5097), Cov: (0.141,0.587), Decorr: (0.328,0.335)
[   7/  10], Loss: (0.2774, 0.8336), MMD: (0.2465,0.5212), Cov: (0.133,0.590), Decorr: (0.318,0.320)
[   8/  10], Loss: (0.2781, 0.7661), MMD: (0.2476,0.5075), Cov: (0.133,0.611), Decorr: (0.321,0.336)
[   9/  10], Loss: (0.2758, 0.7342), MMD: (0.2459,0.5179), Cov: (0.127,0.651), Decorr: (0.318,0.317)
[  10/  10], Loss: (0.2757, 0.7169), MMD: (0.2455,0.5195), Cov: (0.133,0.621), Decorr: (0.315,0.311)

Knockoff diagnostics on the training data¶

The second-order method and the machine trained above can be used to generate knockoff copies of the original data.

# Generate deep knockoffs
Xk_train_m = machine.generate(X_train)
print("Size of the deep knockoff dataset: %d x %d." %(Xk_train_m.shape))

# Generate second-order knockoffs
Xk_train_g = second_order.generate(X_train)
print("Size of the second-order knockoff dataset: %d x %d." %(Xk_train_g.shape))

Size of the deep knockoff dataset: 2000 x 20.
Size of the second-order knockoff dataset: 2000 x 20.

The goodness of fit of the knockoffs on the training data can be assessed by checking whether the empirical covariance matrix $\hat{G}$ of $(X,\tilde{X})$ satisfies the required symmetries. Given that $$ \hat{G} = \begin{bmatrix} \hat{G}_{XX} & \hat{G}_{X\tilde{X}} \\ \hat{G}_{X\tilde{X}} & \hat{G}_{\tilde{X}\tilde{X}} \end{bmatrix}, $$ we expect that $\hat{G}_{\tilde{X}\tilde{X}} \approx \hat{G}_{XX}$ and $\hat{G}_{X\tilde{X}} \approx \hat{G}_{XX}$. The latter equality should only hold for the off-diagonal elements. The diagonal elements of $\hat{G}_{X\tilde{X}}$ should be small as possible for the knockoffs to be powerful.

Simple diagnostics can be obtained by displaying the entries of the above blocks of the covariance matrix in scatter plots.

# Plot diagnostics for deep knockoffs
diagnostics.ScatterCovariance(X_train, Xk_train_m)

Knockoff diagnostics on test data¶

The knockoff generator can be used on new observations of $X$.

# Sample test data
X_test = DataSampler.sample(n, test=True)
print("Generated a test dataset of size: %d x %d." %(X_test.shape))

Generated a test dataset of size: 2000 x 20.

Knockoffs for the test data are generated using the deep machine, the second-order method and an oracle that knows the exact distribution of $X$.

# Generate deep knockoffs
Xk_test_m = machine.generate(X_test)
print("Size of the deep knockoff test dataset: %d x %d." %(Xk_test_m.shape))

# Generate second-order knockoffs
Xk_test_g = second_order.generate(X_test)
print("Size of the second-order knockoff test dataset: %d x %d." %(Xk_test_g.shape))

# Generate oracle knockoffs
oracle = GaussianKnockoffs(DataSampler.Sigma, method="sdp", mu=DataSampler.mu)
Xk_test_o = oracle.generate(X_test)
print("Size of the oracle knockoff test dataset: %d x %d." %(Xk_test_o.shape))

Size of the deep knockoff test dataset: 2000 x 20.
Size of the second-order knockoff test dataset: 2000 x 20.
Size of the oracle knockoff test dataset: 2000 x 20.

A simple measure of the goodness of fit on the test data for the knockoffs generated by the deep machine is shown in the scatter plots below.

# Plot diagnostics for deep knockoffs
diagnostics.ScatterCovariance(X_test, Xk_test_m)

The corresponding scatter plots for the second-order knockoffs are below.

# Plot diagnostics for second-order knockoffs
diagnostics.ScatterCovariance(X_test, Xk_test_g)

Finally, the scatter plots for the oracle knockoffs.

# Plot diagnostics for oracle knockoffs
diagnostics.ScatterCovariance(X_test, Xk_test_o)

More diagnostics can be computed by drawing new observations from the data sampler and applying the goodness-of-fit tests discussed in the paper.

# Compute goodness of fit diagnostics on 50 test sets containing 100 observations each
n_exams = 50
n_samples = 100
exam = diagnostics.KnockoffExam(DataSampler,
                                {'Oracle':oracle, 'Machine':machine, 'Second-order':second_order})
diagnostics = exam.diagnose(n_samples, n_exams)

Computing knockoff diagnostics...
[=========================] 100%

# Summarize diagnostics
diagnostics.groupby(['Method', 'Metric', 'Swap']).describe()

The diagnostics corresponding to the covariance test are shown below.

# Plot covariance goodness-of-fit statistics
data = diagnostics[(diagnostics.Metric=="Covariance") & (diagnostics.Swap != "self")]
fig,ax = plt.subplots(figsize=(12,6))
sns.boxplot(x="Swap", y="Value", hue="Method", data=data)
fig

Similarly, we can plot the diagnostics corresponding to the $k$-nearest neighbors test.

# Plot k-nearest neighbors goodness-of-fit statistics
data = diagnostics[(diagnostics.Metric=="KNN") & (diagnostics.Swap != "self")]
fig,ax = plt.subplots(figsize=(12,6))
sns.boxplot(x="Swap", y="Value", hue="Method", data=data)
fig

			Value
			count	mean	std	min	25%	50%	75%	max
Method	Metric	Swap
Machine	Covariance	full	50.0	1.291893	3.492836	-6.243256	-1.132957	1.361824	3.831310	7.621140
		partial	50.0	1.201393	3.512167	-7.099976	-0.807556	1.344955	3.703308	7.811020
		self	50.0	0.319834	0.020318	0.265953	0.307952	0.319296	0.332854	0.361379
	Energy	full	50.0	0.178917	0.028412	0.127072	0.160863	0.172369	0.201243	0.281264
	Energy	partial	50.0	0.177388	0.026872	0.139052	0.158226	0.172530	0.193290	0.277696
	KNN	full	50.0	0.516900	0.044697	0.445000	0.480000	0.505000	0.553750	0.610000
	KNN	partial	50.0	0.516100	0.044565	0.430000	0.481250	0.515000	0.533750	0.625000
	MMD	full	50.0	0.000638	0.004654	-0.008441	-0.002597	-0.000508	0.004116	0.016774
	MMD	partial	50.0	0.000600	0.004108	-0.005813	-0.002666	0.000556	0.003095	0.016446
Oracle	Covariance	full	50.0	0.844697	3.372040	-7.439972	-1.423676	0.952896	3.050602	7.580246
		partial	50.0	0.888457	3.534530	-7.467270	-2.099644	1.265839	3.598145	8.151550
		self	50.0	0.305346	0.015524	0.266235	0.297000	0.303309	0.311392	0.342307
	Energy	full	50.0	0.181859	0.033260	0.130520	0.159350	0.172052	0.199673	0.306199
	Energy	partial	50.0	0.181983	0.031292	0.134997	0.159830	0.173137	0.203746	0.293915
	KNN	full	50.0	0.499000	0.034226	0.435000	0.476250	0.500000	0.520000	0.575000
	KNN	partial	50.0	0.497500	0.035689	0.425000	0.480000	0.505000	0.518750	0.605000
	MMD	full	50.0	0.000829	0.005462	-0.007689	-0.003113	0.000204	0.003769	0.020159
	MMD	partial	50.0	0.000627	0.005011	-0.005562	-0.002878	-0.001016	0.003096	0.019734
Second-order	Covariance	full	50.0	0.044135	3.958364	-7.354050	-2.445362	-0.357407	1.946968	10.844696
		partial	50.0	0.172106	3.938555	-6.427490	-2.399406	0.060036	2.490494	12.221039
		self	50.0	0.316073	0.015253	0.280959	0.306835	0.317167	0.325608	0.348290
	Energy	full	50.0	0.180721	0.034463	0.124246	0.152715	0.173233	0.200477	0.274749
	Energy	partial	50.0	0.180581	0.033092	0.131867	0.158839	0.169689	0.195604	0.275722
	KNN	full	50.0	0.489200	0.036188	0.410000	0.460000	0.490000	0.515000	0.600000
	KNN	partial	50.0	0.490700	0.035985	0.410000	0.465000	0.490000	0.520000	0.570000
	MMD	full	50.0	0.000477	0.005564	-0.009314	-0.004011	-0.000429	0.003495	0.013740
	MMD	partial	50.0	0.000313	0.005559	-0.007791	-0.003860	-0.000923	0.002761	0.015787