The Methodology of Knockoffs

In the statement of the variable selection problem, we assume that the response variables are obtained independently from some distribution \(F_{Y|X}\), conditional on the explanatory variables:

\[ \begin{align*} & Y_i | (X_{i,1},\ldots,X_{i,p}) \overset{\text{ind.}}{\sim} F_{Y|X} , & i=1,\ldots,n. \end{align*} \]

This setting is extremely general, since at this point we have specified neither the form of the conditional distribution \(F_{Y|X}\) nor the origin of the explanatory variables \(X\). This broad family of models includes two paradigms in which the knockoff methodology can be applied to perform variable selection with false discovery rate (FDR) control: the Model-X paradigm and the Fixed-X paradigm.

Under each paradigm, the knockoff methodology has its own strengths and limitations, as summarized below. Which paradigm is appropriate for a given applied problem depends heavily on the data and prior knowledge available to the statistician.

The Model-X paradigm

In the Model-X paradigm, we assume that the explanatory variables are drawn i.i.d. from some distribution \(F_X\). Then, one can write

\[ \begin{align*} & (X_{i,1},\ldots,X_{i,p},Y_i) \overset{\text{i.i.d.}}{\sim} F_{XY} , & i=1,\ldots,n, \end{align*} \]

for some joint distribution \(F_{XY} = F_{Y|X} \circ F_X\) over all variables. For the knockoff methodology to be applicable in this setting, a knockoff copy \(\tilde{X}=(\tilde{X}_1,\ldots,\tilde{X}_p)\) of a vector of random variables \(X=(X_1,\ldots,X_p)\) must be constructed such that, for any subset \(S \subseteq \{1,\ldots,p\}\),

\[ \begin{align*} \left(X, \tilde{X}\right)_{\text{swap}(S)} \overset{d}{=} \left(X, \tilde{X}\right). \end{align*} \]

Above, the vector \(\left(X, \tilde{X}\right)_{\text{swap}(S)}\) is obtained by swapping the entries \(X_j\) and \(\tilde{X}_j\) for each \(j \in S\). Moreover, \(Y\) should be conditionally independent of \(\tilde{X}\) given \(X\).
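
The swap condition can be checked numerically in one special case. The sketch below assumes that \(X\) is zero-mean Gaussian with a known correlation matrix (here an AR(1) matrix chosen only for illustration) and uses an equicorrelated construction in which every variable shares the same constant \(s\). These choices, along with all function names, are assumptions made for this example and do not come from the text above. For zero-mean Gaussians, equality in distribution reduces to equality of covariance matrices, so it suffices to verify that swapping leaves the joint covariance of \((X, \tilde{X})\) unchanged.

```python
import numpy as np

def gaussian_knockoff_joint_cov(sigma):
    """Joint covariance of (X, X_tilde) for an equicorrelated Gaussian construction.

    Assumes sigma is a positive definite correlation matrix (hypothetical setup).
    """
    p = sigma.shape[0]
    lam_min = np.linalg.eigvalsh(sigma)[0]
    # Shrink s slightly so the conditional covariance below stays positive definite.
    s = 0.99 * min(1.0, 2.0 * lam_min)
    d = s * np.eye(p)
    joint = np.block([[sigma, sigma - d], [sigma - d, sigma]])
    return joint, d

def swap_covariance(joint, subset):
    """Covariance of (X, X_tilde) after swapping X_j and X_tilde_j for j in subset."""
    p = joint.shape[0] // 2
    perm = np.arange(2 * p)
    for j in subset:
        perm[j], perm[j + p] = j + p, j
    return joint[np.ix_(perm, perm)]

def sample_knockoffs(x, sigma, d, rng):
    """Draw X_tilde given X from the Gaussian conditional implied by the joint covariance."""
    sigma_inv_d = np.linalg.solve(sigma, d)
    mean = x - x @ sigma_inv_d              # conditional mean, one row per observation
    cond_cov = 2.0 * d - d @ sigma_inv_d    # conditional covariance, shared by all rows
    chol = np.linalg.cholesky(cond_cov)
    return mean + rng.standard_normal(x.shape) @ chol.T

rng = np.random.default_rng(0)
p = 4
idx = np.arange(p)
sigma = 0.5 ** np.abs(np.subtract.outer(idx, idx))   # AR(1) correlation matrix
joint, d = gaussian_knockoff_joint_cov(sigma)

# Swapping any subset of coordinates leaves the joint covariance unchanged.
swap_ok = np.allclose(swap_covariance(joint, [0, 2]), joint)

x = rng.multivariate_normal(np.zeros(p), sigma, size=1000)
x_tilde = sample_knockoffs(x, sigma, d, rng)
```

Because \(\tilde{X}\) is sampled using \(X\) only, without looking at \(Y\), the conditional independence requirement holds by construction in this sketch.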

Reference

“Panning for Gold: ‘Model-X’ Knockoffs for High-dimensional Controlled Variable Selection”,
Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. J. R. Stat. Soc. B (2018).

The Fixed-X paradigm

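In the Fixed-X paradigm, no assumptions are made on the origin of the explanatory variables (which may even be chosen adversarially), but some rather restrictive constraints are imposed on \(F_{Y|X}\). In this case, for the knockoff methodology to be applicable, the statistics \(W_j\) must, in particular, satisfy a sufficiency requirement: each may depend on the data only through the Gram matrix of the augmented design \([\mathbf{X} \; \tilde{\mathbf{X}}]\) and its inner products with the response \(\mathbf{Y}\):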

\[ \begin{align*} W_j = W_j\left( [\mathbf{X} \; \tilde{\mathbf{X}}]' [\mathbf{X} \; \tilde{\mathbf{X}}], [\mathbf{X} \; \tilde{\mathbf{X}}]' \mathbf{Y} \right). \end{align*} \]

For example, it can be shown that \(Z_j = \hat{\beta}_{j}(\lambda)\), the lasso coefficient with a fixed \(\lambda\), yields a \(W_j\) that satisfies the sufficiency requirement. However, choosing \(\lambda\) by cross-validation would violate it, since the selected value then depends on the data beyond the two quantities above.
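
To see why a fixed-\(\lambda\) lasso fits this requirement, note that coordinate descent for the lasso can be run using only the Gram matrix and the inner products with the response. The sketch below is a minimal illustration under several assumptions that are not part of the text above: the objective is scaled as \(\tfrac{1}{2}\|\mathbf{Y} - \mathbf{A}b\|^2 + \lambda \|b\|_1\), the statistic is taken to be the coefficient difference \(W_j = |Z_j| - |\tilde{Z}_j|\) (one common choice), and the augmented matrix is a random stand-in rather than a valid set of knockoffs. All function names are hypothetical.

```python
import numpy as np

def lasso_from_gram(gram, xty, lam, n_sweeps=500):
    """Coordinate descent for 0.5*||y - A b||^2 + lam*||b||_1 using only A'A and A'y."""
    b = np.zeros(gram.shape[0])
    for _ in range(n_sweeps):
        for j in range(b.size):
            # Inner product of column j with the partial residual, from gram and xty alone.
            rho = xty[j] - gram[j] @ b + gram[j, j] * b[j]
            # Soft-thresholding update for coordinate j.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / gram[j, j]
    return b

def lasso_coef_difference(x_aug, y, lam):
    """W_j = |Z_j| - |Z_tilde_j|, computed from the two sufficient quantities only."""
    p = x_aug.shape[1] // 2
    gram = x_aug.T @ x_aug      # [X X_tilde]' [X X_tilde]
    xty = x_aug.T @ y           # [X X_tilde]' Y
    z = lasso_from_gram(gram, xty, lam)
    return np.abs(z[:p]) - np.abs(z[p:])

rng = np.random.default_rng(1)
n, p = 40, 3
# Stand-in for [X X_tilde]; NOT valid knockoffs, used only to illustrate sufficiency.
x_aug = rng.standard_normal((n, 2 * p))
y = 2.0 * x_aug[:, 0] + rng.standard_normal(n)

# An orthogonal rotation of the rows changes the data but not the Gram matrix
# or the inner products with y, so the statistics should be unchanged.
q, _ = np.linalg.qr(rng.standard_normal((n, n)))
w_original = lasso_coef_difference(x_aug, y, lam=5.0)
w_rotated = lasso_coef_difference(q @ x_aug, q @ y, lam=5.0)
```

A cross-validated choice of \(\lambda\) would not survive this rotation test in general, because the folds are formed from the rows of the data rather than from the two summaries.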

In this setting, we denote by \(\mathbf{X}_j\) the vector with the \(n\) observations of the \(j\)-th variable (i.e. the \(j\)-th column of the \(n \times p\) design matrix) and by \(\tilde{\mathbf{X}}_j\) the corresponding column of knockoffs. Then, we require that:

\[ \begin{align*} & \tilde{\mathbf{X}}_j' \tilde{\mathbf{X}}_k = \mathbf{X}_j' \mathbf{X}_k & \forall j,k, \\ & \tilde{\mathbf{X}}_j' \mathbf{X}_k = \mathbf{X}_j' \mathbf{X}_k & \forall j \neq k. \end{align*} \]
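
One way to build a matrix satisfying these two conditions is sketched below. It follows the equicorrelated construction described in the Barber and Candès paper, under assumptions that are not stated in the text above: \(n \geq 2p\), a full-rank design, and columns rescaled to unit norm. The function name and the random design are hypothetical and serve only as an illustration.

```python
import numpy as np

def fixed_x_knockoffs(x):
    """Equicorrelated Fixed-X knockoffs for a full-rank design with n >= 2p.

    Returns the column-normalized design, its knockoffs, and the constant s.
    """
    n, p = x.shape
    if n < 2 * p:
        raise ValueError("this sketch assumes n >= 2p")
    x = x / np.linalg.norm(x, axis=0)        # rescale columns to unit norm
    sigma = x.T @ x                          # Gram matrix X'X
    lam_min = np.linalg.eigvalsh(sigma)[0]
    s = min(1.0, 2.0 * lam_min)
    d = s * np.eye(p)
    sigma_inv_d = np.linalg.solve(sigma, d)

    # Symmetric square root C of 2D - D Sigma^{-1} D, clipping tiny negative eigenvalues.
    ctc = 2.0 * d - d @ sigma_inv_d
    ctc = (ctc + ctc.T) / 2.0
    evals, evecs = np.linalg.eigh(ctc)
    c = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

    # p orthonormal columns orthogonal to the column space of X.
    q, _ = np.linalg.qr(x, mode="complete")
    u = q[:, p:2 * p]

    x_tilde = x @ (np.eye(p) - sigma_inv_d) + u @ c
    return x, x_tilde, s

rng = np.random.default_rng(2)
x_norm, x_knock, s = fixed_x_knockoffs(rng.standard_normal((50, 5)))

gram = x_norm.T @ x_norm
cross = x_knock.T @ x_norm
off_diag = ~np.eye(5, dtype=bool)

# First condition: knockoffs reproduce the full Gram matrix of the design.
same_gram = np.allclose(x_knock.T @ x_knock, gram)
# Second condition: cross inner products match the Gram matrix off the diagonal.
same_cross = np.allclose(cross[off_diag], gram[off_diag])
```

Note that the second condition says nothing about \(j = k\): in this construction the diagonal entries \(\tilde{\mathbf{X}}_j' \mathbf{X}_j\) equal \(1 - s\), which is what allows a knockoff to differ from its original variable.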

Reference

“Controlling the False Discovery Rate via Knockoffs”, Rina Foygel Barber and Emmanuel Candès. Ann. Statist. 43 (2015).
