Variable Selection with Knockoffs

A powerful and versatile framework for controlled variable selection.

[Figure: Can you tell which of these two is the original?]

The knockoff filter is a general framework for controlling the false discovery rate when performing variable selection.

This website offers an informal introduction to knockoffs and links to the related papers, as well as software and tutorials.

Outline

In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and we would like to discover which predictors are important for the response. At the same time, we need to ensure that most of the discoveries are indeed true and replicable.

As the name suggests, the knockoff filter operates by manufacturing knockoff variables. These are cheap, in the sense that their construction does not require collecting any new data, and they are designed to mimic the correlation structure found within the original variables. The knockoffs serve as negative controls: they allow one to identify the truly important predictors while controlling the false discovery rate (FDR), that is, the expected fraction of false discoveries among all discoveries.
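For readers who prefer symbols, the FDR can be written as follows. The notation is ours and is only an assumption for illustration: \hat{S} denotes the set of selected variables and \mathcal{H}_0 the set of variables that are truly unimportant (the nulls).

```latex
\mathrm{FDR}
  \;=\;
  \mathbb{E}\!\left[
    \frac{\lvert \hat{S} \cap \mathcal{H}_0 \rvert}{\max\{\lvert \hat{S} \rvert,\, 1\}}
  \right]
```

The maximum in the denominator simply sets the fraction to zero when nothing is selected.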

The procedure selects the variables that are clearly better than their knockoff copies, according to a measure of feature importance that can be computed with a variety of popular methods. In this sense, the knockoff filter can be seen as a versatile wrapper that combines the power of modern machine learning tools with rigorous finite-sample statistical guarantees.
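The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference software: it assumes that a statistic W[j] has already been computed for each variable (for example, the importance of variable j minus the importance of its knockoff, so that large positive values favor the original), and it applies the data-dependent "knockoff+" threshold from the original knockoff filter paper. The function names `knockoff_threshold` and `knockoff_select` are hypothetical.

```python
import numpy as np


def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest threshold t whose estimated false discovery proportion is <= q.

    W      : array of knockoff statistics (assumed precomputed, one per variable)
    q      : target FDR level
    offset : 1 gives the conservative "knockoff+" rule, 0 the more liberal variant
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W, in increasing order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Variables with W <= -t are those whose knockoff looked better; their
        # count serves as an estimate of the number of false positives among W >= t.
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return float(t)
    # No threshold achieves the target level: select nothing.
    return float("inf")


def knockoff_select(W, q=0.1, offset=1):
    """Indices of the variables whose statistic meets the knockoff threshold."""
    W = np.asarray(W, dtype=float)
    t = knockoff_threshold(W, q=q, offset=offset)
    return np.flatnonzero(W >= t)
```

How W is computed is left open on purpose: any importance measure can be plugged in, provided that swapping a variable with its knockoff flips the sign of the corresponding statistic. Note that with `offset=1` at least 1/q variables must clear the threshold before anything is selected, which is why very small problems often return an empty selection.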