Introduction

This web tutorial demonstrates the use of the knockoffs methodology to perform controlled variable selection in a genome-wide association study. The tutorial accompanies the paper (Sesia, Sabatti, and Candès 2017), where the procedure is justified and more details are presented.

As in (Sesia, Sabatti, and Candès 2017) and (Candes et al. 2016), the purpose of this statistical analysis is to identify the important variables in the conditional distribution of \(Y | X_1, \ldots, X_p\), where \(Y\) is a phenotype of interest (e.g. the disease status) and \(X_1, \ldots, X_p\) are the SNPs. This is done by testing the \(p\) multivariate hypotheses that \(Y\) is independent of the \(j\)-th SNP \(X_j\) conditionally on all the other variables, for \(j=1,\ldots,p\). Formally, we write this as:

\[Y \perp\!\!\!\!\perp X_j | X_{- j}.\] This approach is very natural to anyone working with parametric generalized linear models, where \(Y\) has a probability distribution taken from an exponential family, which depends upon the covariates only through the linear combination \(\beta_1 X_1 + \beta_2 X_2 + \ldots +\beta_p X_p\). In such models and under broad conditions, the above test for \(X_j\) is equivalent to testing \(\beta_j=0\).
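As a quick illustration of this equivalence (a sketch on hypothetical simulated data, not the study analyzed below), one can fit a logistic regression and read off the Wald tests of \(\beta_j=0\):

```r
# Sketch on toy data: in a logistic regression model, testing whether Y is
# independent of X_j given the other covariates reduces to testing beta_j = 0.
set.seed(1)
n = 500; p = 3
X = matrix(rbinom(n * p, size = 2, prob = 0.3), n, p)  # toy 0/1/2 "genotypes"
y = rbinom(n, 1, plogis(0.8 * X[, 1]))                 # only the first column matters
fit = glm(y ~ X, family = "binomial")
# One row per coefficient; the last column holds the p-values for beta_j = 0
summary(fit)$coefficients
```

Here only the first column of X should show a small p-value. The knockoffs methodology pursues the same conditional hypotheses, but without assuming such a parametric model.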

However, in this tutorial we do not make any parametric assumptions on the conditional distribution of the phenotype given the genotypes. In this more general setting (and under weak conditions), the above testing problem can be shown to be equivalent to finding the Markov blanket of \(Y\), i.e. the “smallest” subset \(S\) such that, conditionally on \(\{X_j\}_{j \in S}\), \(Y\) is independent of all the other variables.
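In our notation, the Markov blanket can be written (as a sketch, under the identifiability conditions alluded to above) as the smallest subset of SNPs that renders \(Y\) independent of all the remaining ones:

\[ S^* = \operatorname*{arg\,min}_{S \subseteq \{1,\ldots,p\}} |S| \quad \text{subject to} \quad Y \perp\!\!\!\!\perp \{X_j\}_{j \notin S} \,\Big|\, \{X_j\}_{j \in S}. \]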

For more information about variable selection with knockoffs, see (Candes et al. 2016) or visit the knockoffs homepage.

Dependencies

Three R packages are used in this tutorial.

  • snpStats from the Bioconductor repository is used to load and pre-process the SNP data.
  • knockoff from CRAN is the main package for the knockoffs methodology.
  • SNPknock from CRAN is used to create knockoffs for genetic data (version 0.7.1).

An external genotype imputation software is also required.

  • fastPHASE is a genotype imputation software distributed in the form of a binary executable for Linux or Mac OS. We use it to construct knockoffs by fitting a hidden Markov model to the genotype data.

If you do not already have fastPHASE, you can download the fastPHASE tarball from the website linked above. Then, extract it into a convenient directory (e.g. “~/.local/bin/”).
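For example, on Linux the installation might look as follows (a sketch: the archive name fastPHASE.tar.gz and the install directory are assumptions, so adjust them to match your download):

```shell
# Hypothetical archive name; use the file actually downloaded from the website
mkdir -p ~/.local/bin
tar -xzf fastPHASE.tar.gz -C ~/.local/bin
chmod +x ~/.local/bin/fastPHASE
```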

Data

Since free datasets containing individual-level genotypes and phenotypes are not readily available, this tutorial uses a small synthetic dataset created from publicly available data.

You can download the simulated study used in this tutorial from here.
Then, unpack the archive and place it in your R work directory.

This dataset was obtained with the HAPGEN2 simulation software (Su, Marchini, and Donnelly 2011). We chose 10,000 SNPs from a portion of chromosome 22 (positions 0-26192121) among those genotyped by the International HapMap Project (release #22 - NCBI Build 36) and freely available from https://mathgen.stats.ox.ac.uk. A study of 350 cases and 200 controls was simulated with 10 “causal” loci, whose multiplicative effects for each copy of the risk allele are uniformly distributed between 1.25 and 1.5.

Data pre-processing

Loading the data

If you have not done this before, you can install the snpStats package from the Bioconductor repository as follows.

source("https://bioconductor.org/biocLite.R")
biocLite("snpStats")

Once snpStats is successfully installed, you can load it as any other R package.

library(snpStats)

The read.impute function provides a simple way of loading the genotypes for the cases and controls in the simulated study.

dat.cases = read.impute("data_sim/sim.out.cases.gen")
## Reading IMPUTE data from file data_sim/sim.out.cases.gen
## Reading SnpMatrix with 350 rows and 10000 columns
dat.contr = read.impute("data_sim/sim.out.controls.gen")
## Reading IMPUTE data from file data_sim/sim.out.controls.gen
## Reading SnpMatrix with 200 rows and 10000 columns

Instead of storing cases and controls separately, we merge them into a single dataset.

row.names(dat.cases) = paste("Case", 1:nrow(dat.cases), sep="")
row.names(dat.contr) = paste("Control", 1:nrow(dat.contr), sep="")
genotypes = rbind(dat.contr, dat.cases)
phenotypes = c(rep(0,dim(dat.contr)[1]), rep(1,dim(dat.cases)[1]))

SNP level filtering

Once the data is loaded, we are ready to remove SNPs that fail to meet minimum criteria due to missing data, low variability, or genotyping errors. snpStats provides two functions, col.summary and row.summary, which return statistics on SNPs and samples, respectively.

A more complete tutorial on data pre-processing for association studies can be found on the ORSG project website.

To create SNP summary statistics (MAF, call rate, etc.), one can type:

snpsum.col = col.summary(genotypes)
print(head(snpsum.col))
##            Calls Call.rate Certain.calls        RAF        MAF       P.AA       P.AB      P.BB      z.HWE
## rs11089130   550         1             1 0.61090909 0.38909091 0.14000000 0.49818182 0.3618182  1.1239411
## rs738829     550         1             1 0.79909091 0.20090909 0.04545455 0.31090909 0.6436364 -0.7435504
## rs915674     550         1             1 0.86363636 0.13636364 0.02363636 0.22545455 0.7509091 -1.0039135
## rs915675     550         1             1 0.84636364 0.15363636 0.02727273 0.25272727 0.7200000 -0.6616527
## rs915677     550         1             1 0.93545455 0.06454545 0.01272727 0.10363636 0.8836364 -3.3252588
## rs9604721    550         1             1 0.02454545 0.02454545 0.95090909 0.04909091 0.0000000  0.5901269

Using these summary statistics, we keep the subset of SNPs that meet our criteria for minimum call rate and minor allele frequency.

# Setting thresholds
call = 0.95
minor = 0.01

# Filter on MAF and call rate
use = with(snpsum.col, (!is.na(MAF) & MAF > minor) & Call.rate >= call)
use[is.na(use)] = FALSE                # Remove NA's as well

cat(ncol(genotypes)-sum(use),"SNPs will be removed due to low MAF or call rate.\n")
## 397 SNPs will be removed due to low MAF or call rate.
# Subset genotypes and SNP summary data for SNPs that pass call rate and MAF criteria
genotypes = genotypes[,use]
snpsum.col = snpsum.col[use,]

We also remove SNPs that are not in Hardy-Weinberg equilibrium.

hardy = 10^-6      # HWE cut-off

snpsum.colCont = col.summary(genotypes)
HWEuse = with(snpsum.colCont, !is.na(z.HWE) & ( abs(z.HWE) < abs( qnorm(hardy/2) ) ) )
rm(snpsum.colCont)
HWEuse[is.na(HWEuse)] = FALSE          # Remove NA's as well
cat(ncol(genotypes)-sum(HWEuse),"SNPs will be removed due to departure from HWE.\n")
## 0 SNPs will be removed due to departure from HWE.
# Subset genotypes and SNP summary data for SNPs that pass HWE criteria
genotypes = genotypes[,HWEuse]
snpsum.col = snpsum.col[HWEuse,]
print(genotypes)
## A SnpMatrix with  550 rows and  9603 columns
## Row names:  Control1 ... Case350 
## Col names:  rs11089130 ... rs9625239

By default, read.impute returns an object of the class SnpMatrix. From this point onwards, it will be more convenient to work with a numeric matrix.

X = as(genotypes, "numeric")

SNP pruning

Since many SNPs are extremely highly correlated and the sample size is small, this data does not allow a sufficiently high resolution to distinguish the truly important SNPs from their most similar “neighbors”. As argued in (Sesia, Sabatti, and Candès 2017) and by others before, the most compelling scientific question lies in the identification of relevant clusters of tightly linked sites, rather than of individual SNPs.

Cluster the SNPs

Create a hierarchical clustering dendrogram, using the absolute sample correlations as a similarity measure, then cut it so that no two SNPs in different clusters have an absolute correlation above the threshold value of 0.75.

Sigma = cov(X)
Sigma.distance = as.dist(1 - abs(cov2cor(Sigma)))
fit = hclust(Sigma.distance, method="single")
corr_max = 0.75
clusters = cutree(fit, h=1-corr_max)

Below, one can see a quick summary of the clusters that we obtained.

# Number of clusters
max(clusters)
## [1] 2323
# Distribution of cluster sizes
summary(as.numeric(table(clusters)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   4.134   4.000 123.000

Choose cluster representatives

At this point, pruning is carried out by choosing a representative SNP from each cluster. The most promising SNP in each cluster is picked according to the marginal association p-values (from the Cochran-Armitage trend test) computed on 20% of the data.

set.seed(123)
ind.screen = rep(FALSE, length(phenotypes))
ind.screen[sample.int(length(phenotypes), size=length(phenotypes)*0.2)] = TRUE
pvals.screen = p.value(single.snp.tests(phenotypes[ind.screen], snp.data = genotypes[ind.screen,]), df=1)

These marginal p-values are used to choose the cluster representatives as follows.

ind.repr = sapply(1:max(clusters), function(c) {
  cluster_elements = clusters==c
  top_within = which.min(pvals.screen[cluster_elements])
  if( length(top_within)==0 ) top_within = 1
  which(cluster_elements)[top_within]
})
X = X[,ind.repr]
pvals.screen = pvals.screen[ind.repr]

Creating knockoffs

Below, we generate a knockoff copy of the genotypes using the SNPknock package.

library(SNPknock)

Fitting the HMM

First, we fit a hidden Markov model to these genotypes, with the help of the software fastPHASE. Since fastPHASE takes as input genotype sequences in a specific format, we must first convert the genotype matrix by calling SNPknock.fp.writeX. By default, this function will write onto a temporary file in the R temporary directory.

# Convert X into the suitable fastPHASE input format, write it into a temporary file
# and return the path to that file.
Xinp_file = SNPknock.fp.writeX(X)

Assuming that we have already downloaded fastPHASE, we call it to fit the hidden Markov model.

# Path to the fastPHASE executable
fp_path = "~/.local/bin/fastPHASE"
# Call fastPHASE and return the path to the parameter estimate files
fp_outPath = SNPknock.fp.runFastPhase(fp_path, Xinp_file, K=8, numit=10)

Once fastPHASE is finished, we can retrieve its output and construct the hidden Markov model as follows.

r_file = paste(fp_outPath, "_rhat.txt", sep="")
theta_file = paste(fp_outPath, "_thetahat.txt", sep="")
alpha_file = paste(fp_outPath, "_alphahat.txt", sep="")
char_file = paste(fp_outPath, "_origchars", sep="")
hmm = SNPknock.fp.loadFit(r_file, theta_file, alpha_file, char_file)

Sampling the knockoffs

Having fitted an HMM to the genotype data, creating the knockoffs is very easy.

Xk = SNPknock.knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta)

Knockoff diagnostics

Below, we compare a summary of the original variables and their knockoff copies.

table(X)
## X
##      0      1      2 
## 461811 328504 487335
table(Xk)
## Xk
##      0      1      2 
## 462072 327092 488486

We also compute some diagnostics to verify that these knockoffs are indeed valid (e.g. that they are exchangeable with the original variables).

# Compare SNP column means
plot(colMeans(X),colMeans(Xk),col=rgb(0,0,0,alpha=0.1), pch=16,cex=1); abline(a=0, b=1, col='red', lty=2)

# Compare correlations between consecutive SNPs
corrX = sapply(2:dim(X)[2], function(j) cor(X[,j-1],X[,j]))
corrXk = sapply(2:dim(X)[2], function(j) cor(Xk[,j-1],Xk[,j]))
plot(corrX,corrXk,col=rgb(0,0,0,alpha=0.1), pch=16,cex=1); abline(a=0, b=1, col='red', lty=2)

# Compare correlations between each original SNP and the knockoff copy of the next SNP
corrXXk = sapply(2:dim(X)[2], function(j) cor(X[,j-1],Xk[,j]))
plot(corrX,corrXXk,col=rgb(0,0,0,alpha=0.1), pch=16,cex=1); abline(a=0, b=1, col='red', lty=2)

Variable selection

The knockoff genotypes that we have created above can be used to perform controlled variable selection with the use of the knockoff package.

library(knockoff)

Importance statistics

Before we proceed to compute importance statistics with the knockoff package, we must be careful about the observations that were previously used to pick the cluster representatives during pruning. These can be partially re-used, as long as their corresponding knockoff copies are made identical to the original variables.

Xk[ind.screen,] = X[ind.screen,]

Then, we compute importance statistics based on regularized logistic regression with an \(\ell_1\)-norm penalty tuned by cross-validation. In particular, for each SNP \(j\), we compute \[ W_j = |\hat{\beta}_j| - |\hat{\tilde{\beta}}_j|, \] where \(\hat{\beta}_j\) is the coefficient estimate for the \(j\)-th original variable and \(\hat{\tilde{\beta}}_j\) is the corresponding quantity for its knockoff. Since the \(\ell_1\)-norm penalty induces a sparse model, \(W\) is also sparse as shown below.

set.seed(123)
W = stat.glmnet_coefdiff(X, Xk, phenotypes, family="binomial")
plot(W, pch=16, cex=1)

Adaptive thresholding

Finally, we perform variable selection by choosing an adaptive threshold for \(W\) with the knockoff filter. The threshold is chosen such that the target false discovery rate is 0.1.

t = knockoff.threshold(W, fdr=0.1, offset=0)
discoveries = which(W >= t)
names(discoveries) = colnames(genotypes)[discoveries]
print(discoveries)
## rs12157341  rs5746945  rs5747338   rs451840  rs5746517  rs5748091  rs1210696 
##         19        421       1181       1265       1571       2008       2059
colors = rep("gray",length(W))
colors[discoveries] = "blue"
plot(W, col=colors, pch=16, cex=1); abline(h=t, lty=2)

Validation

In this tutorial we used a synthetic dataset for which we have access to the ground truth. Therefore, we can easily validate our results.

The list of the true important SNPs is also provided with this dataset. We can load it and inspect it as follows.

risk.dat = read.table("data_sim/disease.risk.txt")
print(risk.dat)
##      allele  idx position risk_hete risk_homo        rs
## 5388      1 5388 22152390  1.489739  2.979477 rs9620278
## 1510      0 1510 16907889  1.448372  2.896744  rs455758
## 1501      1 1501 16890873  1.490597  2.981194  rs424962
## 8858      0 8858 25640922  1.272689  2.545377 rs9613298
## 6309      1 6309 23531242  1.282036  2.564071 rs9608340
## 8258      1 8258 25186766  1.391429  2.782858  rs713998
## 5007      1 5007 21731173  1.251896  2.503791 rs9612225
## 40        1   40 15002875  1.255662  2.511325 rs6010298
## 9125      1 9125 25738063  1.421840  2.843680  rs134913
## 1229      0 1229 16599872  1.290679  2.581359  rs181399

In order to be able to compare these with our findings, we must identify the clusters containing the important SNPs.

signals = sapply(1:length(risk.dat$rs), function(i) clusters[which(names(clusters)==risk.dat$rs[i])])
print(signals)
## rs9620278  rs455758  rs424962 rs9613298 rs9608340  rs713998 rs9612225 rs6010298  rs134913  rs181399 
##      1265       421       415      2008      1454      1875      1181        19      2059       357
colors = rep("gray",length(W))
colors[discoveries] = "red"
colors[signals] = "green"
plot(W, col=colors, pch=16, cex=1); abline(h=t, lty=2)

From the plot above we can compute the false discovery proportion (FDP) and the power of our procedure on this dataset.

true.discoveries = intersect(discoveries, signals)
false.discoveries = setdiff(discoveries, signals)
fdp = length(false.discoveries) / length(discoveries)  
print(fdp) # False discovery proportion
## [1] 0.1428571
power = length(true.discoveries) / length(signals)     
print(power) # Power
## [1] 0.6

Comparison

Marginal testing

For the sake of completeness, we compare our results to what one would obtain with marginal testing. We compute marginal association p-values with all the data, using the Cochran-Armitage test.

pvals = p.value(single.snp.tests(phenotypes, snp.data = genotypes), df=1)
plot(-log10(pvals), col=rgb(0,0,0,alpha=0.25), pch=16,cex=1)

Here, we see that at the standard GWAS significance level of \(5 \cdot 10^{-8}\), no discoveries can be made on this dataset with marginal testing.

Benjamini-Hochberg

If we try to control the false discovery rate (at the same level 0.1) by applying the Benjamini-Hochberg procedure to the marginal association p-values, we can make a few discoveries.

pvals.BH = p.adjust(pvals, method = "BH")
plot(-log10(pvals.BH), col=rgb(0,0,0,alpha=0.25), pch=16,cex=1); abline(h=-log10(0.1), lty=2)

disc.marg = which(pvals.BH < 0.1)

These individual SNP discoveries correspond to the following clusters.

disc.marg.clust = sapply(1:length(disc.marg), function(i) clusters[which(names(clusters)==names(disc.marg)[i])])
disc.marg.clust = unique(disc.marg.clust)
print(disc.marg.clust)
## [1]   19 1181 1265 2059 2057

Again, we can compare these findings with the ground truth.

true.discoveries.marg = intersect(disc.marg.clust, signals)
false.discoveries.marg = setdiff(disc.marg.clust, signals)
fdp.marg = length(false.discoveries.marg) / length(disc.marg.clust)  
print(fdp.marg) # False discovery proportion
## [1] 0.2
power.marg = length(true.discoveries.marg) / length(signals)
print(power.marg) # Power
## [1] 0.4

References

Candes, E., Y. Fan, L. Janson, and J. Lv. 2016. “Panning for gold: model-free knockoffs for high-dimensional controlled variable selection.” ArXiv E-Prints, October. https://statweb.stanford.edu/~candes/papers/MF_knockoffs.pdf.

Sesia, M., C. Sabatti, and E. J. Candès. 2017. “Gene Hunting with Knockoffs for Hidden Markov Models.” ArXiv E-Prints, June. https://statweb.stanford.edu/~candes/papers/HMM_Knockoffs.pdf.

Su, Zhan, Jonathan Marchini, and Peter Donnelly. 2011. “HAPGEN2: Simulation of Multiple Disease Snps.” Bioinformatics 27 (16). Oxford University Press: 2304–5. doi:10.1093/bioinformatics/btr341.