This web tutorial demonstrates the use of the knockoffs methodology to perform controlled variable selection in a genome-wide association study. The tutorial accompanies the paper (Sesia, Sabatti, and Candès 2017), where the procedure is justified and more details are presented.
As in the work of (Sesia, Sabatti, and Candès 2017) and (Candes et al. 2016), the purpose of this statistical analysis is to identify the important variables in the conditional distribution of \(Y \mid X_1, \ldots, X_p\), where \(Y\) is a phenotype of interest (e.g. the disease status) and \(X_1, \ldots, X_p\) are the SNPs. This is done by testing the \(p\) multivariate hypotheses that \(Y\) is independent of the \(j\)-th SNP \(X_j\), conditionally on all the other variables, for \(j=1,\ldots,p\). Formally, we write this as:
\[Y \perp\!\!\!\!\perp X_j | X_{- j}.\] This approach is very natural to anyone working with parametric generalized linear models, where \(Y\) has a probability distribution taken from an exponential family, which depends upon the covariates only through the linear combination \(\beta_1 X_1 + \beta_2 X_2 + \ldots +\beta_p X_p\). In such models and under broad conditions, the above test for \(X_j\) is equivalent to testing \(\beta_j=0\).
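As a toy illustration (the data below are simulated on the spot and are not part of this study), one can check that in a logistic regression the test of \(\beta_j = 0\) is readily available from the model summary.
set.seed(1)
# Simulate 3 SNP-like covariates; the binary phenotype depends only on the first one
X.toy = matrix(rbinom(300*3, 2, 0.3), nrow=300)
y.toy = rbinom(300, 1, plogis(0.5*X.toy[,1]))
# Fit a logistic regression; the Wald p-values test beta_j = 0 for each covariate
fit.toy = glm(y.toy ~ X.toy, family="binomial")
summary(fit.toy)$coefficients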
However, in this tutorial we do not make any parametric assumptions on the conditional distribution of the phenotype given the genotypes. In this more general setting (and under weak conditions), the problem can be shown to be equivalent to finding the Markov blanket of \(Y\), i.e. the “smallest” subset \(S\) such that, conditionally on \(\{X_j\}_{j \in S}\), \(Y\) is independent of all other variables.
For more information about variable selection with knockoffs, see (Candes et al. 2016) or visit the knockoffs homepage.
Three R packages are used in this tutorial.
- snpStats, from the Bioconductor repository, is used to load and pre-process the SNP data.
- knockoff, from CRAN, is the main package for the knockoffs methodology.
- SNPknock, from CRAN, is used to create knockoffs for genetic data (version 0.7.1).

The external genotype imputation software fastPHASE is also required.
If you do not already have fastPHASE, you can download the fastPHASE tarball from the website linked above. Then, extract it into a convenient directory (e.g. “~/.local/bin/”).
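For example, the tarball can be extracted from within R (the file name below is hypothetical; replace it with the name of the archive you actually downloaded).
# Extract the fastPHASE tarball (hypothetical file name) into ~/.local/bin/
untar("~/Downloads/fastPHASE.tar.gz", exdir="~/.local/bin/")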
Since no free datasets containing individual-level genotypes and phenotypes are available, this tutorial uses a small synthetic dataset created from publicly available data.
You can download the simulated study used in this tutorial from here.
Then, unpack the archive into your R working directory.
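For example, assuming the archive was saved in the working directory under the (hypothetical) name data_sim.tar.gz, it can be unpacked from within R; extraction should produce the data_sim/ folder used throughout this tutorial.
# Unpack the simulated study (hypothetical archive name) into the working directory
untar("data_sim.tar.gz")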
This dataset was obtained with the HAPGEN2 simulation software (Su, Marchini, and Donnelly 2011). We chose 10,000 SNPs from a portion of chromosome 22 (positions 0-26192121) among those genotyped by the International HapMap Project (release #22 - NCBI Build 36) and freely available from https://mathgen.stats.ox.ac.uk. A study of 350 cases and 200 controls was simulated with 10 “causal” loci, whose multiplicative effects for each copy of the risk allele are uniformly distributed between 1.25 and 1.5.
If you have not done this before, you can install the snpStats package from the Bioconductor repository as follows.
source("https://bioconductor.org/biocLite.R")
biocLite("snpStats")
Once snpStats is successfully installed, you can load it as any other R package.
library(snpStats)
The read.impute function provides a simple way of loading the genotypes for the cases and controls in the simulated study.
dat.cases = read.impute("data_sim/sim.out.cases.gen")
## Reading IMPUTE data from file data_sim/sim.out.cases.gen
## Reading SnpMatrix with 350 rows and 10000 columns
dat.contr = read.impute("data_sim/sim.out.controls.gen")
## Reading IMPUTE data from file data_sim/sim.out.controls.gen
## Reading SnpMatrix with 200 rows and 10000 columns
Instead of storing cases and controls separately, we merge them into a single dataset.
row.names(dat.cases) = sapply(1:dim(dat.cases)[1], function(i) paste("Case",i,sep=""))
row.names(dat.contr) = sapply(1:dim(dat.contr)[1], function(i) paste("Control",i,sep=""))
genotypes = rbind(dat.contr, dat.cases)
phenotypes = c(rep(0,dim(dat.contr)[1]), rep(1,dim(dat.cases)[1]))
Once the data is loaded, we are ready to remove SNPs that fail to meet minimum criteria due to missing data, low variability or genotyping errors. snpStats provides the functions col.summary and row.summary, which return statistics on SNPs and samples, respectively.
A more complete tutorial on data pre-processing for association studies can be found on the ORSG project website.
To compute SNP summary statistics (MAF, call rate, etc.), one can type:
snpsum.col = col.summary(genotypes)
print(head(snpsum.col))
## Calls Call.rate Certain.calls RAF MAF P.AA P.AB P.BB z.HWE
## rs11089130 550 1 1 0.61090909 0.38909091 0.14000000 0.49818182 0.3618182 1.1239411
## rs738829 550 1 1 0.79909091 0.20090909 0.04545455 0.31090909 0.6436364 -0.7435504
## rs915674 550 1 1 0.86363636 0.13636364 0.02363636 0.22545455 0.7509091 -1.0039135
## rs915675 550 1 1 0.84636364 0.15363636 0.02727273 0.25272727 0.7200000 -0.6616527
## rs915677 550 1 1 0.93545455 0.06454545 0.01272727 0.10363636 0.8836364 -3.3252588
## rs9604721 550 1 1 0.02454545 0.02454545 0.95090909 0.04909091 0.0000000 0.5901269
Using these summary statistics, we keep the subset of SNPs that meet our criteria for minimum call rate and minor allele frequency.
# Setting thresholds
call = 0.95
minor = 0.01
# Filter on MAF and call rate
use = with(snpsum.col, (!is.na(MAF) & MAF > minor) & Call.rate >= call)
use[is.na(use)] = FALSE # Remove NA's as well
cat(ncol(genotypes)-sum(use),"SNPs will be removed due to low MAF or call rate.\n")
## 397 SNPs will be removed due to low MAF or call rate.
# Subset genotypes and SNP summary data for SNPs that pass call rate and MAF criteria
genotypes = genotypes[,use]
snpsum.col = snpsum.col[use,]
We also remove SNPs that are not in Hardy-Weinberg equilibrium.
hardy = 10^-6 # HWE cut-off
snpsum.colCont = col.summary(genotypes)
HWEuse = with(snpsum.colCont, !is.na(z.HWE) & ( abs(z.HWE) < abs( qnorm(hardy/2) ) ) )
rm(snpsum.colCont)
HWEuse[is.na(HWEuse)] = FALSE # Remove NA's as well
cat(ncol(genotypes)-sum(HWEuse),"SNPs will be removed due to deviation from HWE.\n")
## 0 SNPs will be removed due to deviation from HWE.
# Subset genotypes and SNP summary data for SNPs that pass HWE criteria
genotypes = genotypes[,HWEuse]
snpsum.col = snpsum.col[HWEuse,]
print(genotypes)
## A SnpMatrix with 550 rows and 9603 columns
## Row names: Control1 ... Case350
## Col names: rs11089130 ... rs9625239
By default, read.impute returns an object of the class SnpMatrix. From this point onwards, it will be more convenient to work with a numeric matrix.
X = as(genotypes, "numeric")
Since many SNPs are extremely highly correlated and the sample size is small, these data do not allow a sufficiently high resolution to distinguish the truly important SNPs from their most similar “neighbors”. As argued in (Sesia, Sabatti, and Candès 2017) and by others before, the most compelling scientific question lies in the identification of relevant clusters of tightly linked sites, rather than individual SNPs.
We create a hierarchical clustering dendrogram, using the absolute sample correlations as a similarity measure, and then cut it so that no two SNPs in different clusters have absolute correlation above a threshold of 0.75.
Sigma = cov(X)
Sigma.distance = as.dist(1 - abs(cov2cor(Sigma)))
fit = hclust(Sigma.distance, method="single")
corr_max = 0.75
clusters = cutree(fit, h=1-corr_max)
Below, one can see a quick summary of the clusters that we obtained.
# Number of clusters
max(clusters)
## [1] 2323
# Distribution of cluster sizes
summary(as.numeric(table(clusters)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 4.134 4.000 123.000
At this point, pruning is carried out by choosing a representative SNP from each cluster. The most promising SNP in each cluster is picked according to marginal association p-values (from the Cochran-Armitage test) computed with 20% of the data.
set.seed(123)
ind.screen = rep(F, length(phenotypes))
ind.screen[sample.int(length(phenotypes), size=length(phenotypes)*0.2)] = T
pvals.screen = p.value(single.snp.tests(phenotypes[ind.screen], snp.data = genotypes[ind.screen,]), df=1)
These marginal p-values are used to choose the cluster representatives as follows.
ind.repr = sapply(1:max(clusters), function(c) {
cluster_elements = clusters==c
top_within = which.min(pvals.screen[cluster_elements])
if( length(top_within)==0 ) top_within = 1
which(cluster_elements)[top_within]
})
X = X[,ind.repr]
pvals.screen = pvals.screen[ind.repr]
Below, we generate a knockoff copy of the genotypes using the SNPknock package.
library(SNPknock)
First, we fit a hidden Markov model to these genotypes, with the help of the software fastPHASE. Since fastPHASE takes as input genotype sequences in a specific format, we must first convert the genotype matrix by calling SNPknock.fp.writeX. By default, this function writes to a temporary file in the R temporary directory.
# Convert X into the suitable fastPHASE input format, write it into a temporary file
# and return the path to that file.
Xinp_file = SNPknock.fp.writeX(X)
Assuming that we have already downloaded fastPHASE, we call it to fit the hidden Markov model.
# Path to the fastPHASE executable
fp_path = "~/.local/bin/fastPHASE"
# Call fastPhase and return the path to the parameter estimate files
fp_outPath = SNPknock.fp.runFastPhase(fp_path, Xinp_file, K=8, numit=10)
Once fastPHASE is finished, we can retrieve its output and construct the hidden Markov model as follows.
r_file = paste(fp_outPath, "_rhat.txt", sep="")
theta_file = paste(fp_outPath, "_thetahat.txt", sep="")
alpha_file = paste(fp_outPath, "_alphahat.txt", sep="")
char_file = paste(fp_outPath, "_origchars", sep="")
hmm = SNPknock.fp.loadFit(r_file, theta_file, alpha_file, char_file)
Having fitted an HMM to the genotype data, we can create the knockoffs very easily.
Xk = SNPknock.knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta)
Below, we compare a summary of the original variables and their knockoff copies.
table(X)
## X
## 0 1 2
## 461811 328504 487335
table(Xk)
## Xk
## 0 1 2
## 462072 327092 488486
We also compute some diagnostics to verify that these knockoffs are indeed valid (e.g. that they are exchangeable with the original variables).
# Compare SNP column means
plot(colMeans(X),colMeans(Xk),col=rgb(0,0,0,alpha=0.1), pch=16,cex=1); abline(a=0, b=1, col='red', lty=2)
# Compare correlations between consecutive SNPs
corrX = sapply(2:dim(X)[2], function(j) cor(X[,j-1],X[,j]))
corrXk = sapply(2:dim(X)[2], function(j) cor(Xk[,j-1],Xk[,j]))
plot(corrX,corrXk,col=rgb(0,0,0,alpha=0.1), pch=16,cex=1); abline(a=0, b=1, col='red', lty=2)
# Compare correlations between each original SNP and the knockoff copy of the next SNP
corrXXk = sapply(2:dim(X)[2], function(j) cor(X[,j-1],Xk[,j]))
plot(corrX,corrXXk,col=rgb(0,0,0,alpha=0.1), pch=16,cex=1); abline(a=0, b=1, col='red', lty=2)
The knockoff genotypes that we have created above can be used to perform controlled variable selection using the knockoff package.
library(knockoff)
Before we proceed to compute importance statistics with the knockoff package, we must be careful about the observations that were previously used to pick the cluster representatives during pruning. These can be partially re-used, as long as their corresponding knockoff copies are made identical to the original variables.
Xk[ind.screen,] = X[ind.screen,]
Then, we compute importance statistics based on regularized logistic regression with an \(\ell_1\)-norm penalty whose strength is tuned by cross-validation. In particular, for each SNP \(j\), we compute \[ W_j = |\hat{\beta}_j| - |\hat{\tilde{\beta}}_j|, \] where \(\hat{\beta}_j\) is the coefficient estimate for the \(j\)-th original variable and \(\hat{\tilde{\beta}}_j\) is the corresponding quantity for its knockoff. Since the \(\ell_1\)-norm penalty induces a sparse model, \(W\) is also sparse as shown below.
set.seed(123)
W = stat.glmnet_coefdiff(X, Xk, phenotypes, family="binomial")
plot(W, pch=16, cex=1)
Finally, we perform variable selection by choosing an adaptive threshold for \(W\) with the knockoff filter. The threshold is chosen such that the target false discovery rate is 0.1.
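For reference, when offset=0 the threshold computed by knockoff.threshold is (up to implementation details) the knockoff filter of Barber and Candès, \[ \tau = \min\left\{ t \in \mathcal{W} : \frac{\#\{j : W_j \le -t\}}{\max(1,\, \#\{j : W_j \ge t\})} \le q \right\}, \] where \(\mathcal{W} = \{|W_j| : j = 1, \ldots, p\} \setminus \{0\}\) and \(q = 0.1\) is the target false discovery rate.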
t = knockoff.threshold(W, fdr=0.1, offset=0)
discoveries = which(W >= t)
names(discoveries) = colnames(genotypes)[discoveries]
print(discoveries)
## rs12157341 rs5746945 rs5747338 rs451840 rs5746517 rs5748091 rs1210696
## 19 421 1181 1265 1571 2008 2059
colors = rep("gray",length(W))
colors[discoveries] = "blue"
plot(W, col=colors, pch=16, cex=1); abline(h=t, lty=2)
In this tutorial we used a synthetic dataset for which we have access to the ground truth. Therefore, we can easily validate our results.
The list of the true important SNPs is also provided with this dataset. We can load it and inspect it as follows.
risk.dat = read.table("data_sim/disease.risk.txt")
print(risk.dat)
## allele idx position risk_hete risk_homo rs
## 5388 1 5388 22152390 1.489739 2.979477 rs9620278
## 1510 0 1510 16907889 1.448372 2.896744 rs455758
## 1501 1 1501 16890873 1.490597 2.981194 rs424962
## 8858 0 8858 25640922 1.272689 2.545377 rs9613298
## 6309 1 6309 23531242 1.282036 2.564071 rs9608340
## 8258 1 8258 25186766 1.391429 2.782858 rs713998
## 5007 1 5007 21731173 1.251896 2.503791 rs9612225
## 40 1 40 15002875 1.255662 2.511325 rs6010298
## 9125 1 9125 25738063 1.421840 2.843680 rs134913
## 1229 0 1229 16599872 1.290679 2.581359 rs181399
In order to be able to compare these with our findings, we must identify the clusters containing the important SNPs.
signals = sapply(1:length(risk.dat$rs), function(i) clusters[which(names(clusters)==risk.dat$rs[i])])
print(signals)
## rs9620278 rs455758 rs424962 rs9613298 rs9608340 rs713998 rs9612225 rs6010298 rs134913 rs181399
## 1265 421 415 2008 1454 1875 1181 19 2059 357
colors = rep("gray",length(W))
colors[discoveries] = "red"
colors[signals] = "green"
plot(W, col=colors, pch=16, cex=1); abline(h=t, lty=2)
Comparing our discoveries with the true signals, we can compute the false discovery proportion (FDP) and the power of our procedure on this dataset.
true.discoveries = intersect(discoveries, signals)
false.discoveries = setdiff(discoveries, signals)
fdp = length(false.discoveries) / length(discoveries)
print(fdp) # False discovery proportion
## [1] 0.1428571
power = length(true.discoveries) / length(signals)
print(power) # Power
## [1] 0.6
For the sake of completeness, we compare our results to what one would obtain with marginal testing. We compute marginal association p-values with all the data, using the Cochran-Armitage test.
pvals = p.value(single.snp.tests(phenotypes, snp.data = genotypes), df=1)
plot(-log10(pvals), col=rgb(0,0,0,alpha=0.25), pch=16,cex=1)
Here, we see that at the standard GWAS significance level of \(5 \cdot 10^{-8}\), no discoveries can be made on this dataset with marginal testing.
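To double-check this numerically, one can count the SNPs that reach that significance level (a quick sketch; its output is not shown here).
# Count marginally significant SNPs at the genome-wide level
sum(pvals < 5e-8, na.rm=TRUE)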
If we try to control the false discovery rate (at the same level 0.1) by applying the Benjamini-Hochberg procedure to the marginal association p-values, we can make a few discoveries.
pvals.BH = p.adjust(pvals, method = "BH")
plot(-log10(pvals.BH), col=rgb(0,0,0,alpha=0.25), pch=16,cex=1); abline(h=-log10(0.1), lty=2)
disc.marg = which(pvals.BH < 0.1)
These individual SNP discoveries correspond to the following clusters.
disc.marg.clust = sapply(1:length(disc.marg), function(i) clusters[which(names(clusters)==names(disc.marg)[i])])
disc.marg.clust = unique(disc.marg.clust)
print(disc.marg.clust)
## [1] 19 1181 1265 2059 2057
Again, we can compare these findings with the ground truth.
true.discoveries.marg = intersect(disc.marg.clust, signals)
false.discoveries.marg = setdiff(disc.marg.clust, signals)
fdp.marg = length(false.discoveries.marg) / length(disc.marg.clust)
print(fdp.marg) # False discovery proportion
## [1] 0.2
power.marg = length(true.discoveries.marg) / length(signals)
print(power.marg) # Power
## [1] 0.4
Candes, E., Y. Fan, L. Janson, and J. Lv. 2016. “Panning for gold: model-free knockoffs for high-dimensional controlled variable selection.” ArXiv E-Prints, October. https://statweb.stanford.edu/~candes/papers/MF_knockoffs.pdf.
Sesia, M., C. Sabatti, and E. J. Candès. 2017. “Gene Hunting with Knockoffs for Hidden Markov Models.” ArXiv E-Prints, June. https://statweb.stanford.edu/~candes/papers/HMM_Knockoffs.pdf.
Su, Zhan, Jonathan Marchini, and Peter Donnelly. 2011. “HAPGEN2: Simulation of Multiple Disease SNPs.” Bioinformatics 27 (16): 2304–5. doi:10.1093/bioinformatics/btr341.