## The Methodology of Knockoffs

## The variable selection problem

The very general problem addressed by the knockoff methodology is the following. Suppose that we can observe a response \(Y\) and \(p\) potential explanatory variables \(X=(X_1, \ldots, X_p)\). Given \(n\) samples \(\left(X_{i,1}, \ldots, X_{i,p}, Y_{i}\right)_{i=1}^{n}\), we would like to know which predictors are important for the response. We assume that, conditionally on the predictors, the responses are independent and the conditional distribution of \(Y_{i}\) depends only on its corresponding vector of predictors \((X_{i,1},\ldots,X_{i,p})\). Formally, we write this as:

\[ Y_i \mid (X_{i,1},\ldots,X_{i,p}) \overset{\text{ind.}}{\sim} F_{Y|X}, \qquad i=1,\ldots,n, \]

for some conditional distribution \(F_{Y|X}\).

The variable selection problem is motivated by the belief that, in many practical applications, \(F_{Y|X}\) actually depends only on a (small) subset \(\mathcal{S} \subset \{1, \ldots, p\}\) of the predictors, such that conditionally on \(\{X_j\}_{j \in \mathcal{S}}\), \(Y\) is independent of all other variables. This definition is intuitive: informally, the other variables are not important because they do not provide any additional information about \(Y\).
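
To make this setup concrete, here is a minimal simulation sketch in Python. It is only an illustration under assumptions that are not part of the text above: the predictors are drawn as independent standard Gaussians, \(F_{Y|X}\) is a linear model with Gaussian noise, and the function name `simulate` and its parameters are hypothetical. Indices are 0-based in the code.

```python
import random

def simulate(n, p, relevant, seed=0):
    """Draw n samples of (X, Y) in which Y depends only on the predictors in `relevant`.

    Assumptions for illustration only: independent N(0, 1) predictors and
    a linear F_{Y|X} with unit coefficients and N(0, 1) noise.
    """
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Only the coordinates in `relevant` enter the response; all other
        # predictors carry no additional information about Y.
        y = sum(x[j] for j in sorted(relevant)) + rng.gauss(0.0, 1.0)
        X.append(x)
        Y.append(y)
    return X, Y

# Ten predictors, of which only two (indices 0 and 3) influence the response.
X, Y = simulate(n=200, p=10, relevant={0, 3})
```

In this toy model the set \(\mathcal{S}\) corresponds to `relevant`; a variable selection procedure would only see `X` and `Y` and would have to recover that set from the data.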

A minimal set \(\mathcal{S}\) with this conditional-independence property is usually known as a Markov blanket. In order to avoid any ambiguity in those pathological cases in which the Markov blanket is not unique, we will say that the \(j\)-th predictor is null if and only if \(Y\) is independent of \(X_j\) conditionally on all other predictors \(X_{-j} = \{X_1, \ldots, X_p\} \setminus \{X_j\}\). We denote the set of null predictors by \(\mathcal{H}_0 \subset \{1, \ldots, p\}\) and call the \(j\)-th predictor relevant if \(j \notin \mathcal{H}_0\).

Our goal is to discover as many relevant variables as possible while keeping the false discovery rate (FDR) under control. For a selection rule that selects a subset \(\hat{S}\) of the predictors, the FDR is defined as

\[ \text{FDR} = \mathbb{E} \left[ \frac{|\hat{S} \cap \mathcal{H}_0|}{\max(1,|\hat{S}|)} \right]. \]

## An important application

Controlled variable selection is particularly relevant in the context of statistical genetics. For instance, a genome-wide association study aims at finding genetic variants that are associated with or influence a trait, choosing from a pool of hundreds of thousands to millions of single-nucleotide polymorphisms. The trait could be, for example, the level of cholesterol or the presence of a major disease.
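
The FDR defined earlier is the expectation, over repeated experiments, of the false discovery proportion (FDP) of a single selection. As a minimal sketch, the Python function below computes the FDP of one selected set. It assumes that the null set \(\mathcal{H}_0\) is known, which is only the case in simulations and never in a real study such as the one described above; the function name and the example index sets are hypothetical.

```python
def false_discovery_proportion(selected, nulls):
    """Return |selected ∩ nulls| / max(1, |selected|) for one selection."""
    selected = set(selected)
    false_discoveries = len(selected & set(nulls))
    # max(1, ...) makes the proportion 0 when nothing is selected.
    return false_discoveries / max(1, len(selected))

# Toy example: 10 predictors, indices 0 and 3 are relevant, the rest are null.
nulls = {1, 2, 4, 5, 6, 7, 8, 9}
fdp = false_discovery_proportion({0, 3, 5, 7}, nulls)  # 2 false discoveries out of 4
empty = false_discovery_proportion(set(), nulls)       # no selections at all
```

Averaging this quantity over many independent simulated datasets gives a Monte Carlo estimate of the FDR of a selection rule.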