The purpose of this notebook is to allow the numerical experiments described in the paper to be reproduced easily. Running this code may take a few minutes on a graphics processing unit (GPU).
Additional dependencies for this notebook:
torch_two_sample (https://github.com/josipd/torch-two-sample)
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from DeepKnockoffs import KnockoffMachine
from DeepKnockoffs import GaussianKnockoffs
import data
import diagnostics
import parameters
We model $X \in \mathbb{R}^p$ as a multivariate Student's-t distribution, with $p=100$ and the covariance matrix of an auto-regressive process of order one. The default correlation parameter for this distribution is $\rho = 0.5$ and the default number of degrees of freedom is $\nu = 3$.
# Number of features
p = 100
# Load the built-in multivariate Student's-t model and its default parameters
# The currently available built-in models are:
# - gaussian : Multivariate Gaussian distribution
# - gmm : Gaussian mixture model
# - mstudent : Multivariate Student's-t distribution
# - sparse : Multivariate sparse Gaussian distribution
model = "mstudent"
distribution_params = parameters.GetDistributionParams(model, p)
# Initialize the data generator
DataSampler = data.DataSampler(distribution_params)
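For intuition, a multivariate Student's-t sample with an AR(1) covariance can also be drawn directly with NumPy, as a Gaussian draw rescaled by an independent chi-squared variable. This is only an illustrative sketch of the distribution described above (the notebook itself uses the package's `data.DataSampler`); the function name and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, nu, rho = 100, 3, 0.5

# AR(1) covariance: Sigma[i, j] = rho ** |i - j|
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

def sample_mstudent(n, Sigma, nu, rng):
    """Multivariate Student's-t: a Gaussian draw divided by sqrt(chi2 / nu)."""
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, Sigma.shape[0])) @ L.T
    W = rng.chisquare(nu, size=(n, 1)) / nu
    return Z / np.sqrt(W)

X_demo = sample_mstudent(5, Sigma, nu, rng)
print(X_demo.shape)  # (5, 100)
```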
After computing the empirical covariance matrix of $X$ in the training dataset, we can initialize a generator of second-order knockoffs. The solution of the SDP determines the pairwise correlations between the original variables and the knockoffs produced by this algorithm.
# Number of training examples
n = 10000
# Sample training data
X_train = DataSampler.sample(n)
print("Generated a training dataset of size: %d x %d." %(X_train.shape))
# Compute the empirical covariance matrix of the training data
SigmaHat = np.cov(X_train, rowvar=False)
# Initialize generator of second-order knockoffs
second_order = GaussianKnockoffs(SigmaHat, mu=np.mean(X_train,0), method="sdp")
# Measure pairwise second-order knockoff correlations
corr_g = (np.diag(SigmaHat) - np.diag(second_order.Ds)) / np.diag(SigmaHat)
print('Average absolute pairwise correlation: %.3f.' %(np.mean(np.abs(corr_g))))
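To make the role of the diagonal matrix $D = \mathrm{diag}(s)$ concrete: for a Gaussian vector with mean $\mu$ and covariance $\Sigma$, second-order knockoffs can be sampled from the conditional law $\tilde{X} \mid X \sim N\big(X - D\Sigma^{-1}(X-\mu),\; 2D - D\Sigma^{-1}D\big)$, and the resulting pairwise correlations are $1 - s_j$ for unit-variance features. The sketch below uses a small toy covariance and a simple equicorrelated choice of $s$ (shrunk to keep the conditional covariance positive definite) rather than the SDP solution computed by `GaussianKnockoffs`.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
# Toy AR(1) covariance standing in for the empirical SigmaHat
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

# Equicorrelated s, shrunk by 0.9 so 2D - D Sigma^{-1} D stays strictly
# positive definite; GaussianKnockoffs solves an SDP for s instead
s = 0.9 * min(1.0, 2 * np.linalg.eigvalsh(Sigma).min())
D = s * np.eye(p)

Sigma_inv = np.linalg.inv(Sigma)
A = np.eye(p) - Sigma_inv @ D      # X @ A is the conditional mean of the knockoffs
V = 2 * D - D @ Sigma_inv @ D      # conditional covariance of the knockoffs
L = np.linalg.cholesky(V)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=2000)
X_knock = X @ A + rng.standard_normal((2000, p)) @ L.T

# Each Corr(X_j, knockoff_j) should be close to 1 - s
corr = np.diag(np.corrcoef(X.T, X_knock.T)[:p, p:])
print(np.allclose(corr, 1 - s, atol=0.1))  # True
```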
The parameters of the machine are set below to the default values that are most appropriate for the specific built-in model considered.
# Load the default hyperparameters for this model
training_params = parameters.GetTrainingHyperParams(model)
# Set the parameters for training deep knockoffs
pars = dict()
# Number of epochs
pars['epochs'] = 100
# Number of iterations over the full data per epoch
pars['epoch_length'] = 100
# Data type, either "continuous" or "binary"
pars['family'] = "continuous"
# Dimensions of the data
pars['p'] = p
# Size of the test set
pars['test_size'] = 0
# Batch size
pars['batch_size'] = int(0.5*n)
# Learning rate
pars['lr'] = 0.01
# When to decrease learning rate (unused when equal to number of epochs)
pars['lr_milestones'] = [pars['epochs']]
# Width of the network (number of layers is fixed to 6)
pars['dim_h'] = int(10*p)
# Penalty for the MMD distance
pars['GAMMA'] = training_params['GAMMA']
# Penalty encouraging second-order knockoffs
pars['LAMBDA'] = training_params['LAMBDA']
# Decorrelation penalty hyperparameter
pars['DELTA'] = training_params['DELTA']
# Target pairwise correlations between variables and knockoffs
pars['target_corr'] = corr_g
# Kernel widths for the MMD measure (uniform weights)
pars['alphas'] = [1.,2.,4.,8.,16.,32.,64.,128.]
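The `alphas` above are the bandwidths of a mixture of Gaussian kernels used when measuring the MMD between samples. A minimal sketch of a (biased) squared-MMD estimate with such a mixture kernel is shown below; the uniform weighting follows the comment above, while the toy data and function names are illustrative assumptions, not the package's internal implementation.

```python
import numpy as np

def mixture_kernel(X, Y, alphas):
    """Sum of Gaussian kernels exp(-||x - y||^2 / (2 a^2)), uniform weights."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return sum(np.exp(-d2 / (2 * a ** 2)) for a in alphas)

def mmd2(X, Y, alphas):
    """Biased estimate of the squared MMD between the samples X and Y."""
    return (mixture_kernel(X, X, alphas).mean()
            + mixture_kernel(Y, Y, alphas).mean()
            - 2 * mixture_kernel(X, Y, alphas).mean())

rng = np.random.default_rng(0)
alphas = [1., 2., 4., 8., 16., 32., 64., 128.]
X = rng.standard_normal((50, 10))
Y = rng.standard_normal((50, 10))        # same distribution as X
Z = rng.standard_normal((50, 10)) + 3.0  # mean-shifted distribution
print(mmd2(X, Y, alphas) < mmd2(X, Z, alphas))  # True
```

The mixture of bandwidths makes the statistic sensitive to discrepancies at several scales at once, which is why a whole list of `alphas` is passed rather than a single value.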
Let's load the trained machine stored in the tmp/ subdirectory.
# Where the machine is stored
checkpoint_name = "tmp/" + model
# Initialize the machine
machine = KnockoffMachine(pars)
# Load the machine
machine.load(checkpoint_name)
The knockoff generator can be used on new observations of $X$.
# Compute goodness-of-fit diagnostics on 100 test sets containing 1000 observations each
n_exams = 100
n_samples = 1000
exam = diagnostics.KnockoffExam(DataSampler,
{'Machine':machine, 'Second-order':second_order})
diagnostics = exam.diagnose(n_samples, n_exams)
# Summarize diagnostics
diagnostics.groupby(['Method', 'Metric', 'Swap']).describe()
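The swap diagnostics rest on the exchangeability property of knockoffs: if a subset of coordinates is exchanged between $X$ and its knockoff copy, the joint distribution should be unchanged, so any statistic of the joint sample should be (up to sampling noise) invariant under such swaps. The sketch below illustrates this with a simple covariance statistic; the toy knockoffs and the statistic are illustrative assumptions, not the exact tests implemented in `diagnostics.py`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 4
X = rng.standard_normal((n, p))                                   # independent N(0, 1) features
X_knock = 0.5 * X + np.sqrt(0.75) * rng.standard_normal((n, p))   # a valid toy knockoff copy

def swap(A, B, idx):
    """Exchange the columns listed in `idx` between A and B."""
    As, Bs = A.copy(), B.copy()
    As[:, idx], Bs[:, idx] = B[:, idx], A[:, idx]
    return As, Bs

# Covariance statistic: the joint covariance of (X, knockoffs) should be
# nearly invariant under swapping any subset of coordinates
J = np.cov(np.hstack([X, X_knock]), rowvar=False)
Xs, Xks = swap(X, X_knock, idx=[0, 2])
Js = np.cov(np.hstack([Xs, Xks]), rowvar=False)
print(J.shape)  # (8, 8)
```

For these toy knockoffs the Frobenius distance between the two joint covariance matrices is small, driven only by sampling noise; for an invalid generator it would be systematically large.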
The distributions of the diagnostics corresponding to the covariance test are shown below.
# Plot covariance goodness-of-fit statistics
data = diagnostics[(diagnostics.Metric=="Covariance") & (diagnostics.Swap != "self")]
fig,ax = plt.subplots(figsize=(12,6))
sns.boxplot(x="Swap", y="Value", hue="Method", data=data)
fig
Similarly, we can plot the diagnostics corresponding to the different tests.
# Plot k-nearest neighbors goodness-of-fit statistics
data = diagnostics[(diagnostics.Metric=="KNN") & (diagnostics.Swap != "self")]
fig,ax = plt.subplots(figsize=(12,6))
sns.boxplot(x="Swap", y="Value", hue="Method", data=data)
fig
# Plot MMD goodness-of-fit statistics
data = diagnostics[(diagnostics.Metric=="MMD") & (diagnostics.Swap != "self")]
fig,ax = plt.subplots(figsize=(12,6))
sns.boxplot(x="Swap", y="Value", hue="Method", data=data)
fig
# Plot energy goodness-of-fit statistics
data = diagnostics[(diagnostics.Metric=="Energy") & (diagnostics.Swap != "self")]
fig,ax = plt.subplots(figsize=(12,6))
sns.boxplot(x="Swap", y="Value", hue="Method", data=data)
fig
# Plot average absolute pairwise correlation between variables and knockoffs
data = diagnostics[(diagnostics.Metric=="Covariance") & (diagnostics.Swap == "self")]
fig,ax = plt.subplots(figsize=(12,6))
sns.boxplot(x="Swap", y="Value", hue="Method", data=data)
fig