This notebook illustrates the usage of the knockoff construction scripts to create knockoffs for variables that are distributed according to the Ising model on a grid: \[ P(X) = \frac{1}{Z(\beta_0)} e^{\beta_0 \sum_{(i,i^\prime) \in E} X_i X_{i^\prime}} \] where \(i\) and \(i^\prime\) are each coordinates in the \(d \times d\) grid, \(E\) is the set of edges of the grid, and each \(X_i \in \{-1, 1\}\). Section 5.2.3 of the accompanying paper presents simulation results in this setting.

Knockoffs for the Ising model

library(tidyverse)
library(reshape2)
library(knockoff)
library(bayess)
library(latex2exp)
source("../Ising/ising.r")  #load the ising knockoff functions
# configure parallelism
library(foreach)
library(doMC)
n_cores = 3
registerDoMC(cores = n_cores)
# Simulation #parameters
n = 500  #number of samples
d = 5  #5x5 grid
beta_0 = 0.25  #interaction strength

We create 500 samples from a 5x5 Ising model using the bayess package.

# sample from the Ising model
set.seed(200)
start = Sys.time()
X = t(replicate(n, as.vector(isingibbs(niter = 100, n = d, m = d, beta = 2 * 
    beta_0))))
X = 2 * X - 1  #convert to +- 1 instead of 0,1
end = Sys.time()

print(end - start)
Time difference of 17.87556 secs
dim(X)
[1] 500  25

Next, we generate knockoffs for each sample using the ising_knockoff function (Section 4.6 of the paper).

start = Sys.time()
set.seed(200)
Xk = c()  # store the sampled knockoffs
# create knockoffs for each observation
Xk = foreach(i = 1:n, .combine = "rbind") %dopar% {
    x = matrix(X[i, ], nrow = d, ncol = d)
    xk = ising_knockoff(x, d, d, beta_0)  #sample one knockoff
    as.vector(xk)
}
end = Sys.time()
print(paste0("Average time per knockoff (seconds): ", (end - start) * n_cores/n))
[1] "Average time per knockoff (seconds): 0.0867124571800232"
dim(Xk)
[1] 500  25

The mean absolute correlation between each variable and its corresponding knockoff is

cors = c()
for (j in 1:(d * d)) {
    cors = c(cors, cor(X[, j], Xk[, j]))
}
mean(cors)
[1] 0.2385055

Diagnostics: checking means and covariances

As a basic diagnostic, we check that each coordinate is mean zero and the the empirical covariance matrices of \(X\) and \(\tilde{X}\) are close. As the number of samples increases, the empircal covariance of \(X\) and \(\tilde{X}\) will converge to the same population covariance.

max(abs(colMeans(X)))  #find the coordinate with mean farthest away from 0
[1] 0.108
max(abs(colMeans(Xk)))  #find the coordinate with mean farthest away from 0
[1] 0.112
S = cor(X)  #correlation matrix of X
Sk = cor(Xk)  #correlation matrix of Xk
max(abs(S - Sk))  #largest difference in the correlation matrix.
[1] 0.1701452

We can visualize the difference in the empirical covariance matrices of \(X\) and \(\tilde{X}\) with a heatmap.

# visualize the difference in the empirical covariance
library(ggcorrplot)
ggcorrplot(cor(X) - cor(Xk), ggtheme = ggplot2::theme_bw, legend.title = "Corr diff", 
    title = "Difference in correlation structure") + scale_y_reverse()

We see that the difference in the empirical correlation matrix of \(X\) and the empirical correlation matrix of \(\tilde{X}\) are small, as expected, since \(X\) and \(\tilde{X}\) come from the same distribution.

Quality across coordinates.

Next, we investigate the quality of the knockoffs at each location in the grid. We compute the correlation between \(X_{i,j}\) and \(\tilde{X}_{i,j}\) for all each index \((i,j)\) in the grid and plot the resulting data as a heatmap.

# compute correlation between original variable and knockoff at each
# coordinate.
cors = data.frame(stringsAsFactors = FALSE)
for (i in 1:d) {
    for (j in 1:d) {
        cors = rbind(cors, c("knockoff", i, j, cor(X[, i + (j - 1) * d], Xk[, 
            i + (j - 1) * d])), stringsAsFactors = FALSE)
    }
}
colnames(cors) = c("type", "row", "col", "ko_orig_corr")
cors$ko_orig_corr = as.numeric(cors$ko_orig_corr)
cors$row = as.numeric(cors$row)
cors$col = as.numeric(cors$col)
cors = cors %>% mutate(order = d * (row - 1) + col)
# plot the knockoff quality at each coordinate
quality_grid = ggplot(cors, aes(x = col, y = row, fill = ko_orig_corr)) + geom_tile(color = "white") + 
    scale_fill_continuous(high = "#000000", low = "#56B1F7", limits = c(0.1, 
        0.5)) + scale_y_reverse() + coord_equal() + labs(fill = "cor") + theme_bw() + 
    theme(text = element_text(size = 10)) + xlab("k") + ylab("j") + labs(title = TeX("$5 \\times 5$ grid, $\\beta_0 = 0.25$")) + 
    coord_fixed() + theme(plot.title = element_text(size = 10)) + theme(plot.title = element_text(hjust = 0.5)) + 
    theme(legend.position = "bottom") + theme(legend.text = element_text(size = 8))

quality_grid

We see that the variables on the edges of the grid are less correlated with their knockoffs, as expected.

Knockoffs for large grids

For large grids, we can still generate knockoffs efficiently using the divid-and-conquer technique (Section 4.4 of the paper), which is implemented in the ising_knockoff_slice function. We demonstrate this technique here.

First, we sample from the Ising model on a grid.

# Simulation #parameters
n = 200  #number of samples
d = 20  #20x20 grid
beta_0 = 0.25  #interaction strength
width = 3  #divide-and-conquer width
# sample from the ising model
set.seed(100)
X = t(replicate(n, as.vector(isinghm(niter = 50, n = d, m = d, beta = 2 * beta_0))))
dim(X)
[1] 200 400

Next, we sample the knockoffs using the divide-and-conquer technique.

# sample each knockoff
set.seed(100)
start = Sys.time()
Xk = foreach(i = 1:n, .combine = "rbind") %dopar% {
    x = matrix(X[i, ], nrow = d, ncol = d)
    xk = ising_knockoff_slice(x, d, d, beta_0, max_width = width)
    as.vector(xk)
}
end = Sys.time()
print(paste0("Average time per knockoff (seconds): ", round((end - start) * 
    n_cores/n, digits = 3)))
[1] "Average time per knockoff (seconds): 0.396"
dim(Xk)
[1] 200 400

We again compute the mean correlation between \(X_{j,k}\) and \(\tilde{X}_{j,k}\):

cors = c()
for (j in 1:(d * d)) {
    cors = c(cors, cor(X[, j], Xk[, j]))
}
mean(cors)
[1] 0.31051

Quality across coordinates.

Next, we investigate the quality of the knockoffs at each location in the grid. We compute the correlation between \(X_{i,j}\) and \(\tilde{X}_{i,j}\) for all each index \((i,j)\) in the grid and plot the resulting data as a heatmap.

# compute correlation between original variable and knockoff at each
# coordinate.
cors = data.frame(stringsAsFactors = FALSE)
for (i in 1:d) {
    for (j in 1:d) {
        cors = rbind(cors, c("knockoff", i, j, cor(X[, i + (j - 1) * d], Xk[, 
            i + (j - 1) * d])), stringsAsFactors = FALSE)
    }
}
colnames(cors) = c("type", "row", "col", "ko_orig_corr")
cors$ko_orig_corr = as.numeric(cors$ko_orig_corr)
cors$row = as.numeric(cors$row)
cors$col = as.numeric(cors$col)
cors = cors %>% mutate(order = d * (i - 1) + j)
# plot the knockoff quality at each coordinate
quality_grid = ggplot(cors, aes(x = col, y = row, fill = ko_orig_corr)) + geom_tile(color = "white") + 
    scale_fill_continuous(high = "#000000", low = "#56B1F7", limits = c(-0.1, 
        0.8)) + scale_y_reverse() + coord_equal() + labs(fill = "cor") + theme_bw() + 
    theme(text = element_text(size = 10)) + xlab("k") + ylab("j") + labs(title = TeX("$20 \\times 20$ grid, $\\beta_0 = 0.25$")) + 
    coord_fixed() + theme(plot.title = element_text(size = 10)) + theme(plot.title = element_text(hjust = 0.5)) + 
    theme(legend.position = "bottom") + theme(legend.text = element_text(size = 8))

quality_grid

In this plot, we can see that some columns have higher correlation that others due to the slicing. We can adjust the slicing pattern to remove this effect if needed.