Numerical experiments III (variable selection)

Notebook written by Matteo Sesia and Yaniv Romano

Stanford University, Department of Statistics

Last updated on: November 19, 2018

This notebook allows the numerical experiments described in the paper to be reproduced easily. Running this code may take several minutes.

Load the required libraries

Additional dependencies for this notebook: matplotlib and seaborn, along with the helper modules data.py, parameters.py, and selection.py included in this repository.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from DeepKnockoffs import KnockoffMachine
from DeepKnockoffs import GaussianKnockoffs
import data
import parameters
import selection
import sys

Data generating model

We model $X \in \mathbb{R}^p$ as following a multivariate Student's-t distribution, with $p=100$ and the covariance matrix of an autoregressive process of order one. By default, the correlation parameter of this distribution is $\rho = 0.5$ and the number of degrees of freedom is $\nu = 3$.
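
For intuition, such a vector can be drawn by dividing a multivariate Gaussian sample by an independent chi-square scale. The standalone sketch below (a hypothetical helper, independent of the data module, whose internals may differ) illustrates this standard construction:

import numpy as np

def sample_mstudent(n, Sigma, nu=3, seed=0):
    # Draw Z ~ N(0, Sigma) and an independent chi-square scale;
    # Z / sqrt(W/nu) then follows a multivariate Student's-t distribution
    rng = np.random.RandomState(seed)
    Z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n)
    W = rng.chisquare(nu, size=(n, 1))
    return Z / np.sqrt(W / nu)

# AR(1) covariance with correlation rho = 0.5
rho, p_demo = 0.5, 100
Sigma_ar1 = rho ** np.abs(np.subtract.outer(np.arange(p_demo), np.arange(p_demo)))
X_demo = sample_mstudent(1000, Sigma_ar1, nu=3)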

In [2]:
# Number of features
p = 100

# Load the built-in multivariate Student's-t model and its default parameters
# The currently available built-in models are:
# - gaussian : Multivariate Gaussian distribution
# - gmm      : Gaussian mixture model
# - mstudent : Multivariate Student's-t distribution
# - sparse   : Multivariate sparse Gaussian distribution 
model = "mstudent"
distribution_params = parameters.GetDistributionParams(model, p)

# Initialize the data generator
DataSampler = data.DataSampler(distribution_params)

Second-order knockoffs

After computing the empirical covariance matrix of $X$ on the training dataset, we can initialize a generator of second-order knockoffs. The solution of the semidefinite program (SDP) determines the pairwise correlations between the original variables and the knockoffs produced by this algorithm; lower correlations generally make the resulting knockoffs more powerful.
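
Concretely, if $X \sim \mathcal{N}(\mu, \Sigma)$, second-order knockoffs are sampled (as in the model-X knockoffs paper of Candès et al., 2018) from the conditional distribution

$$\tilde{X} \mid X \sim \mathcal{N}\left(\mu + (X - \mu)\left(I - \Sigma^{-1}\mathrm{diag}(s)\right),\; 2\,\mathrm{diag}(s) - \mathrm{diag}(s)\,\Sigma^{-1}\,\mathrm{diag}(s)\right),$$

where $s \geq 0$ is chosen by the SDP to make the knockoffs as uncorrelated with the original variables as possible, subject to $2\Sigma - \mathrm{diag}(s) \succeq 0$. The induced pairwise correlations are $\mathrm{Corr}(X_j, \tilde{X}_j) = (\Sigma_{jj} - s_j)/\Sigma_{jj}$, which is exactly the quantity corr_g computed below.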

In [3]:
# Number of training examples
n = 10000

# Sample training data
X_train = DataSampler.sample(n)
print("Generated a training dataset of size: %d x %d." %(X_train.shape))

# Compute the empirical covariance matrix of the training data
SigmaHat = np.cov(X_train, rowvar=False)

# Initialize generator of second-order knockoffs
second_order = GaussianKnockoffs(SigmaHat, mu=np.mean(X_train,0), method="sdp")

# Measure pairwise second-order knockoff correlations 
corr_g = (np.diag(SigmaHat) - np.diag(second_order.Ds)) / np.diag(SigmaHat)

print('Average absolute pairwise correlation: %.3f.' %(np.mean(np.abs(corr_g))))
Generated a training dataset of size: 10000 x 100.
Average absolute pairwise correlation: 0.462.

Deep knockoff machine

The hyperparameters of the machine are set below to the default values most appropriate for the chosen built-in model.

In [4]:
# Load the default hyperparameters for this model
training_params = parameters.GetTrainingHyperParams(model)

# Set the parameters for training deep knockoffs
pars = dict()
# Number of epochs
pars['epochs'] = 100
# Number of iterations over the full data per epoch
pars['epoch_length'] = 100
# Data type, either "continuous" or "binary"
pars['family'] = "continuous"
# Dimensions of the data
pars['p'] = p
# Size of the test set
pars['test_size']  = 0
# Batch size
pars['batch_size'] = int(0.5*n)
# Learning rate
pars['lr'] = 0.01
# When to decrease learning rate (unused when equal to number of epochs)
pars['lr_milestones'] = [pars['epochs']]
# Width of the network (number of layers is fixed to 6)
pars['dim_h'] = int(10*p)
# Penalty for the MMD distance
pars['GAMMA'] = training_params['GAMMA']
# Penalty encouraging second-order knockoffs
pars['LAMBDA'] = training_params['LAMBDA']
# Decorrelation penalty hyperparameter
pars['DELTA'] = training_params['DELTA']
# Target pairwise correlations between variables and knockoffs
pars['target_corr'] = corr_g
# Kernel widths for the MMD measure (uniform weights)
pars['alphas'] = [1.,2.,4.,8.,16.,32.,64.,128.]
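
For reference, the MMD loss weighted by GAMMA compares the distributions of real and knockoff data through a mixture of Gaussian kernels with the widths listed in pars['alphas']. Below is a minimal NumPy sketch of such a multi-kernel (biased) MMD estimate; it is only illustrative, and the exact kernel parameterization used by the machine's PyTorch implementation may differ:

def mixture_kernel(A, B, alphas):
    # Sum of Gaussian kernels exp(-||a-b||^2 / (2*alpha)) over the widths
    D2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sum(np.exp(-D2 / (2 * a)) for a in alphas)

def mmd2(X, Y, alphas):
    # Biased estimate of the squared maximum mean discrepancy
    return (mixture_kernel(X, X, alphas).mean()
            + mixture_kernel(Y, Y, alphas).mean()
            - 2 * mixture_kernel(X, Y, alphas).mean())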

Let's load the trained machine stored in the tmp/ subdirectory.

In [5]:
# Where the machine is stored
checkpoint_name = "tmp/" + model

# Initialize the machine
machine = KnockoffMachine(pars)

# Load the machine
machine.load(checkpoint_name)
=> loading checkpoint 'tmp/mstudent_checkpoint.pth.tar'
=> loaded checkpoint 'tmp/mstudent_checkpoint.pth.tar' (epoch 100)
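
If no checkpoint were available, a machine could instead be trained from scratch on X_train, following the training interface shown in the DeepKnockoffs examples (note that this can take a long time):

# machine = KnockoffMachine(pars)
# machine.train(X_train)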

Controlled variable selection

The knockoff generator can be used to perform variable selection on new observations of $X$, paired with artificial labels $Y$ sampled from a known conditional model $P_{Y \mid X}$. To reduce the runtime, the results are averaged over 100 replications of the experiment instead of the 1000 used in the paper.
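
As a concrete, hypothetical example of such a conditional model, consider a sparse linear model with signal_n nonzero coefficients of amplitude signal_a; the actual selection.sample_Y in this repository may place or scale the coefficients differently:

def sample_Y_sketch(X, signal_n=30, signal_a=10.0, seed=0):
    # Sparse linear model: signal_n coefficients of size signal_a/sqrt(n)
    # with random signs, plus standard Gaussian noise (illustrative only)
    rng = np.random.RandomState(seed)
    n, p = X.shape
    theta = np.zeros(p)
    support = rng.choice(p, signal_n, replace=False)
    theta[support] = signal_a * rng.choice([-1.0, 1.0], signal_n) / np.sqrt(n)
    y = X @ theta + rng.standard_normal(n)
    return y, theta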

In [6]:
# Number of non-zero coefficients in P(Y|X)
signal_n = 30

# Amplitude of the non-zero coefficients 
signal_amplitude_vec = [3, 5, 7, 10, 15, 20, 25]

# Compute the FDR as the average proportion of false discoveries over n_experiments
n_experiments = 100

# Target FDR level
nominal_fdr = 0.1
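
For reference, the knockoff+ filter (which selection.select presumably applies) selects the variables $\{j : W_j \geq \tau\}$, where the data-dependent threshold is

$$\tau = \min\left\{ t > 0 : \frac{1 + \#\{j : W_j \leq -t\}}{\max(1, \#\{j : W_j \geq t\})} \leq q \right\}.$$

A minimal sketch of this rule at level q:

def knockoff_plus_select(W, q=0.1):
    # Scan candidate thresholds in increasing order and return the
    # selections at the smallest t whose estimated FDP is below q
    ts = np.sort(np.abs(W[W != 0]))
    for t in ts:
        fdp_hat = (1.0 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)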

Load the default parameters for the experiments corresponding to the chosen model.

In [7]:
test_params = parameters.GetFDRTestParams(model)
In [8]:
# Initialize table of results
results = pd.DataFrame(columns=['Model','Experiment', 'Method', 'FDP', 'Power', \
                                'Amplitude', 'Signals', 'Alpha', 'FDR.nominal'])

# Run experiments
for amp_id in range(len(signal_amplitude_vec)):
    # Set the signal amplitude
    signal_amplitude = signal_amplitude_vec[amp_id]
    print("Running %d experiments with signal amplitude: %.2f" %(n_experiments,signal_amplitude))

    for exp_id in range(n_experiments):
        # Show progress
        sys.stdout.write('\r')
        sys.stdout.write("[%-25s] %d%%" % ('='*int((exp_id+1)/n_experiments*25), ((exp_id+1)/n_experiments)*100))
        sys.stdout.flush()

        # Sample X
        X = DataSampler.sample(test_params["n"], test=True)
        
        # Sample Y|X
        y,theta = selection.sample_Y(X, signal_n=signal_n, signal_a=signal_amplitude)
    
        # Generate deep knockoffs
        Xk_m = machine.generate(X)
        # Compute importance statistics
        W_m  = selection.lasso_stats(X,Xk_m,y,alpha=test_params["elasticnet_alpha"],scale=False)
        # Select important variables with the knockoff filter
        selected_m, FDP_m, POW_m = selection.select(W_m, theta, nominal_fdr=nominal_fdr)
        # Store results
        results = results.append({'Model':model,'Experiment':exp_id, 'Method':'deep', 
                                  'Power':POW_m, 'FDP':FDP_m, \
                                  'Amplitude':signal_amplitude, 'Signals':signal_n, \
                                  'Alpha':0.1, 'FDR.nominal':nominal_fdr}, ignore_index=True)
        
        # Generate second-order knockoffs
        Xk_g = second_order.generate(X)
        # Compute importance statistics
        W_g  = selection.lasso_stats(X,Xk_g,y,alpha=test_params["elasticnet_alpha"],scale=False)
        # Select important variables with the knockoff filter
        selected_g, FDP_g, POW_g = selection.select(W_g, theta, nominal_fdr=nominal_fdr)
        # Store results
        results = results.append({'Model':model,'Experiment':exp_id, 'Method':'second-order',
                                  'Power':POW_g, 'FDP':FDP_g, \
                                  'Amplitude':signal_amplitude, 'Signals':signal_n, \
                                  'Alpha':0.1, 'FDR.nominal':nominal_fdr}, ignore_index=True)
        
    sys.stdout.write('\n')
Running 100 experiments with signal amplitude: 3.00
[=========================] 100%
Running 100 experiments with signal amplitude: 5.00
[=========================] 100%
Running 100 experiments with signal amplitude: 7.00
[=========================] 100%
Running 100 experiments with signal amplitude: 10.00
[=========================] 100%
Running 100 experiments with signal amplitude: 15.00
[=========================] 100%
Running 100 experiments with signal amplitude: 20.00
[=========================] 100%
Running 100 experiments with signal amplitude: 25.00
[=========================] 100%
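
The importance statistics W computed above by selection.lasso_stats are presumably signed coefficient differences from a regularized linear fit on the augmented design $[X, \tilde{X}]$. A hypothetical scikit-learn sketch of this classical statistic (the repository's implementation may differ in its choice of solver and scaling):

from sklearn.linear_model import ElasticNet

def lasso_stats_sketch(X, Xk, y, alpha=0.1):
    # Fit an elastic net on [X, Xk] and take W_j = |b_j| - |b_{j+p}|;
    # large positive W_j suggests X_j is more important than its knockoff
    p = X.shape[1]
    fit = ElasticNet(alpha=alpha).fit(np.hstack([X, Xk]), y)
    return np.abs(fit.coef_[:p]) - np.abs(fit.coef_[p:])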

Compute average power and false discovery rate for each value of the signal amplitude.

In [9]:
# Summarize results
results.groupby(['Model', 'Method', 'Amplitude', 'Alpha', 'FDR.nominal']).describe(percentiles=[])
Out[9]:
                                                         FDP                                                            Power
                                                         count  mean      std       min  50%       max          count  mean      std       min       50%       max
Model     Method        Amplitude  Alpha  FDR.nominal
mstudent  deep          3          0.1    0.1            100.0  0.001000  0.010000  0.0  0.000000  0.100000     100.0  0.003000  0.030000  0.000000  0.000000  0.300000
                        5          0.1    0.1            100.0  0.019886  0.063278  0.0  0.000000  0.346154     100.0  0.065000  0.155943  0.000000  0.000000  0.600000
                        7          0.1    0.1            100.0  0.043976  0.070643  0.0  0.000000  0.250000     100.0  0.268000  0.257891  0.000000  0.333333  0.866667
                        10         0.1    0.1            100.0  0.035979  0.047498  0.0  0.000000  0.214286     100.0  0.472667  0.263678  0.000000  0.533333  0.833333
                        15         0.1    0.1            100.0  0.063980  0.064313  0.0  0.050000  0.242424     100.0  0.759667  0.176910  0.000000  0.800000  1.000000
                        20         0.1    0.1            100.0  0.075911  0.063298  0.0  0.065591  0.268293     100.0  0.876333  0.136437  0.000000  0.916667  1.000000
                        25         0.1    0.1            100.0  0.063252  0.055795  0.0  0.062500  0.285714     100.0  0.926000  0.105365  0.366667  0.966667  1.000000
          second-order  3          0.1    0.1            100.0  0.055974  0.129829  0.0  0.000000  0.636364     100.0  0.059667  0.129689  0.000000  0.000000  0.566667
                        5          0.1    0.1            100.0  0.092130  0.124531  0.0  0.000000  0.400000     100.0  0.235333  0.242500  0.000000  0.250000  0.733333
                        7          0.1    0.1            100.0  0.121968  0.106872  0.0  0.125000  0.380952     100.0  0.493667  0.257663  0.000000  0.566667  0.933333
                        10         0.1    0.1            100.0  0.152130  0.101825  0.0  0.160645  0.420000     100.0  0.749667  0.210312  0.000000  0.800000  0.966667
                        15         0.1    0.1            100.0  0.166158  0.109259  0.0  0.158770  0.462963     100.0  0.884333  0.146677  0.000000  0.933333  1.000000
                        20         0.1    0.1            100.0  0.161099  0.092079  0.0  0.158770  0.454545     100.0  0.920000  0.120418  0.000000  0.933333  1.000000
                        25         0.1    0.1            100.0  0.167098  0.093473  0.0  0.151515  0.454545     100.0  0.953000  0.086209  0.366667  0.966667  1.000000

Plot the average power for each method as a function of the signal amplitude.

In [10]:
fig,ax = plt.subplots(figsize=(12,6))
sns.pointplot(x="Amplitude", y="Power", hue="Method", data=results)
fig
Out[10]:
[Figure: average power of the deep and second-order knockoffs as a function of the signal amplitude]

Plot the false discovery rate as a function of the signal amplitude.

In [11]:
fig,ax = plt.subplots(figsize=(12,6))
plt.plot([-1, np.max(signal_amplitude_vec)+1], 
        [nominal_fdr, nominal_fdr], linewidth=1, linestyle="--", color="red")
sns.pointplot(x="Amplitude", y="FDP", hue="Method", data=results)
plt.ylim([0,0.25])
fig
Out[11]:
[Figure: average false discovery proportion of each method as a function of the signal amplitude, with the nominal level 0.1 marked by a dashed red line]