Multiple comparisons

Setup

For a model ${\cal M}$, suppose we have $m$ null hypotheses \[ H_{0,i} \subset {\cal M}, \qquad 1 \leq i \leq m \]
We also assume we have $p$-values $p_i, 1 \leq i \leq m$ (i.e. [0,1] valued random variables) such that \[ P_F(p_i \leq t) \leq t \qquad \forall t \in [0,1], F \in H_{0,i} \]
A multiple testing procedure ${\cal T}$ is a map ${\cal T}:[0,1]^m \to \{0,1\}^m$ with 0 denoting a declared null and a 1 a declared positive.
Common procedure: thresholding at threshold $t$ \[ {\cal T}_t(p_1, \dots, p_m)[i] = 1_{[0,t]}(p_i) \]

# %load_ext rpy2.ipython

library(locfdr)
data(hivdata)
length(hivdata) # m

## [1] 7680

Multiple comparisons

Setup

Results are often summarized for $F \in {\cal M}$ as

	True null	True positive	Decisions
Declared positive	$V({\cal T}, F)$	$S({\cal T}, F)$	$R({\cal T})$
Declared null	$U({\cal T}, F)$	$T({\cal T}, F)$	$m-R({\cal T})$
	$m_0(F)$	$m_1(F)$	$m$

Most common error rates can be summarized by the above table. (With some abuse of notation we can think of e.g. $V({\cal T}, F)$ as a subset of $\{1, \dots, m\}$ or its cardinality.)

Family Wise Error Rate

For $F \in {\cal M}$

\[ \newcommand{\fwer}{\text{FWER}} \newcommand{\fdr}{\text{FDR}} \newcommand{\fdp}{\text{FDP}} \]

\[ \fwer(F, {\cal T}) = P_F(V({\cal T}) \geq 1) = P_F(V({\cal T}, F) \geq 1) \]

Multiple comparisons

DFQL?

Does this address / permit / facilitate inference after selection?
After running ${\cal T}$ can we talk about confidence intervals? point estimators? posterior distributions?
Broadly speaking, seems the answer is no.

Multiple comparisons

Special cases

Suppose \[ H_{0,i} = \left\{F: \theta_i(F) = 0\right\}, \qquad 1 \leq i \leq m \] for $m$ parameters $\theta_i:{\cal M} \rightarrow \mathbb{R}$.
In this case, some procedures ${\cal T}$ can be used to contruct confidence intervals for $\theta_i(F)$ with a simultaneous property.
Ignoring asymptotic considerations for now, we might restrict attention to multivariate normal \[ {\cal M} = {\cal M}(\Sigma) = \left\{N(\theta, \Sigma): \theta \in \mathbb{R}^m\right\}$ \] and take \[ p_i = 2 \cdot \left(1 - \Phi\left(|Z_i| \right) \right) \] assuming $\text{diag}(\Sigma)=1$ so we can think of \[ Z_i = \frac{\widehat{\theta}_i}{\text{SE}(\widehat{\theta}_i)} \]

Multiple comparisons

Special cases

plot(density(hivdata))

pval = 2 * (1 - pnorm(abs(hivdata)))

Multiple comparisons

Controlling FWER

Simplest procedure to control \[\fwer(F, {\cal T}) \leq \alpha, \qquad \forall F \in {\cal M}\] is Bonferroni: ${\cal T}_{\alpha / m}$.
Proof: By Markov inequality \[ \begin{aligned} P_F(V({\cal T}_{\alpha / m}, F) \geq 1)& \leq E_F(V({\cal T}_{\alpha / m}, F)) \\ & = E_F\left( \# \{1 \leq i \leq m: H_{0,i}(F)=0, p_i \leq \alpha / m \} \right) \\ &= \sum_{i:H_{0,i}(F)=0} P_F(p_i \leq \alpha / m) \\ &= \alpha \frac{m_0(F)}{m} \\ & \leq \alpha \end{aligned} \]
Can spend $\alpha$ over $m$ hypotheses differently by choosing $\sum_{i=1}^m \alpha_i = \alpha$.
Works for any multiple comparisons problem, only requires valid $p$-values $(p_i)_{1 \leq i \leq m}$.

sum(p.adjust(pval, method='bonferroni') < 0.05)

## [1] 10

Multiple comparisons

Simultaneous confidence intervals

In the multivariate normal setting $F \equiv \theta$ and \[ P_F(p_i \leq \alpha / m) = P_F(|Z_i| \geq z_{1 - \alpha / (2m)}) \] with $z_q={\tt qnorm}(q)$.
Hypothesis $i$ is rejected if $100 (1 - \alpha / m)$ % confidence interval does not cover 0.
In particular under $\theta=0$ (global null) \[ V({\cal T}, 0) = \left\{i: {\cal T}[i] = 1\right\} \] and \[ P_0\left( \exists i: |Z_i| \geq z_{1 - \alpha / (2m)}\right) = \fwer({\cal T}, 0) \leq \alpha \]

Multiple comparisons

Simultaneous confidence intervals

Define \[ \tilde{Z}(\theta_0)_i = \frac{\widehat{\theta}_i - \theta_{0,i}}{\text{SE}(\widehat{\theta}_i)}. \]
Under $F=\theta_0$ we have $\tilde{Z}(\theta_0)$ has the same distribution as $Z$ under $F=0$.
$\implies$ for every $\theta \in \mathbb{R}^m$ \[ \begin{aligned} \alpha & \geq P_{0} \left(\exists i: |Z|_i \geq z_{1 - \alpha / (2m)} \right)\\ &= P_{\theta} \left(\exists i: |\tilde{Z}(\theta)|_i \geq z_{1 - \alpha / (2m)} \right)\\ & = P_{\theta} \left( \exists i: \theta_i \not \in \widehat{\theta}_i \pm \text{SE}(\widehat{\theta}_i) \cdot z_{1 -\alpha/(2m)} \right) \\ \end{aligned} \]
$\iff$ \[ P_{\theta} \left(\cap_{1 \leq i \leq m}\{ \theta_i \in \widehat{\theta}_i \pm \text{SE}(\widehat{\theta}_i) \cdot z_{1 -\alpha/(2m)}\} \right) \geq 1 - \alpha \]

Multiple comparisons

Simultaneous confidence intervals

We call the region \[ \prod_{i=1}^m \widehat{\theta}_i \pm \text{SE}(\widehat{\theta}_i) z_{1 - \alpha / (2m)} \] a $100(1-\alpha)$ % simultaneous confidence region.
Exercise: in the multivariate normal setting, suppose ${\cal T}$ is such that \[ \fwer({\cal T}, 0) \leq \alpha \] then \[ \left\{\theta: {\cal T}(\tilde{Z}(\theta)) \equiv 0 \right\} \] is a $100(1-\alpha)$ % simultaneous confidence region. That is, ${\cal T}$ need only control FWER in the weak sense to construct a simultaneous confidence region.
Simultaneous intervals can be found by projecting a simultaneous region onto coordinate axes (not necessarily an easy task). Bonferroni intervals had advantage that these projections are simple.

Multiple comparisons

How does Bonferroni behave as $m \to \infty$

Tail of the normal: for $z > 0$ \[ \frac{e^{-z^2/2}}{\sqrt{2\pi}} \cdot \left(\frac{1}{z} - \frac{1}{z^3}\right) \leq 1 - \Phi(z) \leq \frac{e^{-z^2/2}}{\sqrt{2\pi}} \cdot \frac{1}{z} \]
The quantile $z_{1 - \alpha / (2m)}$ solves \[ \frac{\alpha}{2m} = 1 - \Phi(z_{1 - \alpha / (2m)}) \implies z_{1 - \alpha / 2m} \approx \sqrt{2 \log m} \]

-˜Exercise: Show that for any $\epsilon > 0$ \[ 1 - \Phi((1 + \epsilon) \sqrt{2 \log m}) \overset{m \to \infty}{\to} 0, m \left( 1 - \Phi((1 - \epsilon) \sqrt{2 \log m}) \right) \overset{m \to \infty}{\to} \infty. \] Argue that \[\frac{z_{1 - \alpha/(2m)}}{\sqrt{2 \log m}} \to 1.\] Would it change things if we had used one-sided $p$-values?

M = 1e6
c(qnorm(1 - 0.05 / M), sqrt(2 * log(M)))

## [1] 5.326724 5.256522

Multiple comparisons

Improving Bonferroni: Holm procedure

Order the $p$-values to produce $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$.
Set \[ \hat{i}(p) = \min\left\{i \geq 1: p_{(i)} > \frac {\alpha}{m + 1 - i} \right\} \] with $\min(\emptyset)=0$ or set $p_{(0)}=0$ and start at $i=0$.
Reject all $H_{0,i}$ with $1 \leq i < \hat{i}(p)$.

In words

First run Bonferroni on all $m$ hypotheses.
If smallest $p$-value is rejected, move on to next best $m-1$ hypotheses and run Bonferroni on them.
Continue until the smallest $p$-value of such a Bonferroni is not rejected.

sum(p.adjust(pval, method='holm') < 0.05)

## [1] 10

Multiple comparisons

Holm procedure

Call the procedure ${\cal T}_{Holm,\alpha}$: \[ P_F(V({\cal T}_{Holm,\alpha}) \geq 1) \leq \alpha \]

Proof

Suppose a true null is rejected, then there is some smallest index $i$ when this happens: $i_0(p, F)$.
The first $i_0(p,F)$ were false nulls so $m-m_0(F) \geq i_0(p, F) - 1$ and \[ \frac{\alpha}{m - i_0(p,F)+1} \leq \frac{\alpha}{m_0(F)} \] and hence there was a true null $p$-value less than $\alpha / m_0(F)$.
By usual Bonferroni this happens with probability less than $\alpha$.

Multiple comparisons

Holm procedure

An example of a closed testing procedure, which is a way to construct an FWER controlling procedure from tests of intersection nulls \[ H_{0,I} = \cap_{i \in I} H_{0,i} \] each at level $\alpha$.
Holm procedure is closure of the Bonferroni testing procedure on each intersection null.
Benefits from a fast way to test each intersection null.

Multiple comparisons

Step down and step up

Given a sequence of thresholds $q_i$, a step-down procedure rejects all $i$ less than \[ \hat{i}^{\downarrow}_q(p) = \min\left\{i: p_{(i)} > q_i \right\}. \]
A step-up procedure rejects all $i$ less than \[ \hat{i}^{\uparrow}_q(p) = \max\left\{i: p_{(i)} \leq q_i \right\}. \]
Note that $\hat{i}^{\downarrow}_q(p) \leq \hat{i}^{\uparrow}_q(p)$ so for the same thresholds $q$ $\hat{i}^{\uparrow}_q$, is more powerful than $\hat{i}^{\downarrow}_q$.

Hochberg’s method

Hochberg is the step-up version of Holm.
Under independence Hochberg’s procedure controls FWER.

Multiple comparisons

$k$-FWER Lehmann and Romano (2005)

Definition \[ k-\fwer({\cal T}, F) = P_F(V({\cal T}, F) \geq k) \]
Threshold ${\cal T}_{\alpha \cdot k/m}$ controls $k-\fwer$. (Markov inequality again.)
Step down procedure with thresholds \[ q_i = \frac{k \cdot \alpha}{m + k - i} \] also controls $k-\fwer$. (Almost identical proof to Holm)
Thresholds are identical to Holm when $k=1$ (usual FWER).

Which error rate to choose?

In your application / field, what are the costs to false positives?
Can you tolerate any?

Multiple comparisons

False Discovery Rate (FDR)

Definition (Benjamini and Hochberg, 1995) \[ \fdr({\cal T}, F) = E_F\left(\frac{V({\cal T}, F)}{\max(R({\cal T}, F), 1)}\right) \]
Expectation of False Discovery Proportion (FDP)

\[ \fdp({\cal T}, F) = \frac{V({\cal T}, F)}{\max(R({\cal T}, F), 1)} \]

Multiple comparisons

False Discovery Rate (FDR)

Note \[ \fdr({\cal T}, F) \leq \fwer({\cal T}, F) \]
$\implies$ control of FDR at $\alpha$ yields more discoveries.
Suppose \[ \fdr({\cal T}_{t'_{\alpha}}, F) = \alpha, \qquad \fwer({\cal T}_{t_{\alpha}}, F) = \alpha. \] with $t'_{\alpha} > t_{\alpha}$.
Then true discoveries $T({\cal T}_{t'_{\alpha}}) \geq T({\cal T}_{t_{\alpha}}).$

Multiple comparisons

Large scale inference (Efron, Storey, Tibshirani)

Suppose now $m \to \infty$ with $m_0/m \to \pi_0$.
For fixed threshold $t$ \[ \frac{1}{m} V({\cal T}_t,F) \to \pi_0 \cdot t \]
Under reasonable assumptions we might also expect \[ \frac{1}{m} R({\cal T}_t, F) \to \mathbb{G}(t) \] for some distribution $\mathbb{G}$ on $[0,1]$.
As \[ \frac{1}{m} R({\cal T}_t, F) = \frac{1}{m} V({\cal T}_t, F) + \frac{1}{m} S({\cal T}_t, F) \] we can write \[ \mathbb{G}(t) = \pi_0 \cdot t + (1 - \pi_0) \cdot \mathbb{G}_1(t) \]
In this setting \[ \begin{aligned} \fdp({\cal T}_t, F) &\to \frac{\pi_0 \cdot t}{\mathbb{G}(t)} \\ \fdr({\cal T}_t, F) &\to \frac{\pi_0 \cdot t}{\mathbb{G}(t)} \\ \end{aligned} \]

Multiple comparisons

Control of FDR? (Efron, Storey, Tibshirani, Benjamini-Hochberg)

If we knew $\mathbb{G}$ (and $\pi_0$), we might pick a threshold $t$ such that \[ \frac{\pi_0 \cdot t}{\mathbb{G}(t)} \leq \alpha \]
As higher thresholds lead to more true rejections we would like to take \[ t_{\alpha}(\mathbb{G}, \pi_0) = \max\left\{t: \frac{\pi_0 \cdot t}{\mathbb{G}(t)} \leq \alpha \right\} \]
We can conservatively estimate limit (recall $\pi_0 < 1$) with \[ \widehat{\fdr}(t) = \frac{t}{\widehat{G}(t)} \] with $\widehat{G}(t)$ the empirical CDF of the $p$-values.
Seems reasonable to take \[ \widehat{t}_{\alpha} = \sup\left\{t: \widehat{\fdr}(t) \leq \alpha \right\}. \]
Equivalent to Benjamini-Hochberg (BH) step-up procedure \[ \hat{i}^{\uparrow}_q(p) \] with \[ q_i = \alpha \frac{i}{m}. \]

sum(p.adjust(pval, method='BH') < 0.05)

## [1] 18

Multiple comparisons

Control of FDR Storey, Taylor, Siegmund

At $\widehat{t}_{\alpha}$ \[ \widehat{G}(t) = \frac{t}{\alpha} \]
$\implies$ \[ \fdp({\cal T}_{\widehat{t}_{\alpha}}, F) = \alpha \frac{V({\cal T}_{\widehat{t}_{\alpha}}, F)}{m \cdot \widehat{t}_{\alpha}} \]
If we make the assumption that the true null $p$-values are independent (amongst themselves and of the true positive $p$-values) for all $F \in {\cal M}$ then \[ \frac{V({\cal T}_t, F)}{t} \] is a (backward) martingale in a filtration in which $\widehat{t}_{\alpha}$ is a stopping time.
Optional stopping implies \[ \fdr({\cal T}_{\widehat{t}_{\alpha}}, F) = E_F\left[ \alpha \frac{V({\cal T}_{\widehat{t}_{\alpha}}, F)}{m \cdot \widehat{t}_{\alpha}}\right] = E_F\left[ \alpha \frac{V({\cal T}_{1}, F)}{m \cdot 1}\right] = \frac{\alpha \cdot m_0(F)}{m} \leq \alpha. \]
In asymptotic setting, essentially enough to have uniform consistency of $\widehat{G}$ to something, i.e. $\mathbb{G}$.

Multiple comparisons

Two groups model (Storey, Efron)

Idealized asymptotic picture of FDR led to the model for the $Z$ scores as IID draws \[ \begin{aligned} \theta_i &\sim \pi_0 \cdot \delta_0 + (1 - \pi_0) G \\ Z_i | \theta_i &\sim N(\theta, 1) \end{aligned} \]
Models parameters of $F$ as a mixture of 0’s and $G$.
The full parameters of $F$ are lost and with $m$ sufficiently large \[ \fdr({\cal T}_t, F) \approx \frac{1}{\pi_0} P(\theta = 0 | Z > z_{1 - t/2}) \overset{\text{def}}{=} \frac{1}{\pi_0} p\fdr(t) = \text{Fdr}(t) \]
Can also make sense of (switching to $z$ scale) \[ \text{fdr}(z) = P(\theta = 0 | Z = z) = \frac{\pi_0 \cdot \phi(z)}{N(0,1) \star ( \pi_0 \cdot \delta_0 + (1 - \pi_0) \cdot G )} \]
We can estimate $\text{fdr}(z)$ using a density estimate $\widehat{f}$ of the $Z$ scores and an estimate of $\pi_0$ \[ \widehat{\text{fdr}}(z) = \frac{\widehat{\pi}_0 \cdot \phi(z)}{\widehat{f}(z)}. \]

Multiple comparisons

Two groups model (Efron)

results_0 = locfdr(hivdata, nulltype=0)

print(names(results_0))

## [1] "fdr"  "fp0"  "Efdr" "cdf1" "mat"  "z.2"  "call"

print(sum(results_0$fdr < 0.05))

## [1] 13

print(sum(results_0$fdr < 0.2))

## [1] 17

print(results_0$z.2)

## [1]       NA 3.808781

Multiple comparisons

Large scale inference and multiple comparisons

Two groups model is an example of empirical Bayes / compound decision problem.
Strict control of FDR is not guaranteed.
Introduces problem of estimation (e.g. $\widehat{\pi}_0$ and $\widehat{f}$). Efron (2011)

Multiple comparisons

Large scale inference and multiple comparisons

Does provide improved estimator via Tweedie’s formula Efron (2011) \[ \widehat{\theta}(z) = E[\theta | Z=z] = z + l'(z) \] with \[ l'(z) = \frac{d}{dz} \log f(z) \] using plugin $\widehat{l} = \log \widehat{f}(z)$ for some density estimate $\widehat{f}$.
Yields estimates \[ \widehat{\theta}_i(Z) = \widehat{\theta}_i(Z_i) = Z_i + \hat{l}'(Z_i). \]
If willing / able to estimate density of $G$ it is possible to contruct posterior interval. BUT, estimate of $G$ is hard – a deconvolution problem.
Not clear such intervals would have any coverage property.
Exercise: Suppose $G = N(\mu_1, \sigma^2_1)$. Derive posterior intervals for $\theta_i | Z_i$. What is their coverage rate? How does this relate to the Bernstein von Mises phenomenon?

Multiple comparisons

Two groups model

Efron also expanded model $N(0, \sigma^2_0) \star (\pi_0 \delta_{\mu_0} + (1 - \pi_0) \cdot G)$.

results = locfdr(hivdata)

print(names(results))

## [1] "fdr"  "fp0"  "Efdr" "cdf1" "mat"  "z.2"  "call"

print(results$fp0)

##             delta      sigma          p0
## thest  0.00000000 1.00000000 1.211336060
## theSD  0.00000000 0.00000000 0.011414290
## mlest -0.11597111 0.75373650 0.934174674
## mleSD  0.01228385 0.01483100 0.008933865
## cmest -0.08303908 0.71176630 0.894938035
## cmeSD  0.01385561 0.01252345 0.007288211

print(sum(results$fdr < 0.05))

## [1] 73

print(sum(results$fdr < 0.2))

## [1] 160

print(results$z.2)

## [1] -2.471104  2.227792

Multiple comparisons

False coverage rate

Consider a confidence interval procedure ${\cal I}:\mathbb{R}^m \rightarrow \prod_{i=1}^m {\cal B}(\mathbb{R})$ based on a point estimator $\widehat{\theta} \in \mathbb{R}^m$
A parameter is not reported if its interval is $(-\infty, \infty)$

Benjamini and Yekutieli (2005) define

	Does not cover	Covers	Decisions
Reported	$V({\cal I}, F)$	$S({\cal I}, F)$	$R({\cal I})$
Not reported	$U({\cal I}, F)$	$T({\cal I}, F)$	$m-R({\cal I})$
	$m_0(F)$	$m_1(F)$	$m$

A multiple comparisons procedure ${\cal T}$ combined with a confidence level $\alpha'$ can be combined to form a confidence interval procedure: return $\widehat{\theta}_i \pm \text{SE}(\widehat{\theta}_i) z_{1 - \alpha'/2}$ for each $i$ rejected by ${\cal T}$.
For example, ${\cal I}$ might run BH and decide to report intervals for those selected by BH.

Multiple comparisons

False coverage rate

The False Coverage Rate (FCR) is \[ E_F\left[\frac{V({\cal I}, F)}{\max(R({\cal I}, F), 1)}\right] \]
For ${\cal I}$ constructed from ${\cal T}$ any error-rate control of ${\cal T}$ is not necessarily related to FCR of corresponding ${\cal I}$.

Multiple comparisons

False coverage rate

A procedure ${\cal T}$ is simple if, for every $1 \leq i \leq m$ on the event $i$ is reported \[ R({\cal T}) = R({\cal T}, \widehat{\theta}) = 1 + R_{-i,\cal T}(\widehat{\theta}_{-i}) \]
Typically, $R_{-i}$ will not depend on $i$ for symmetric procedures. BH and Bonferroni are simple procedures.
For simple procedures and independent $\widehat{\theta}$, Benjamini and Yekutieli (2005) show that $100(1-)% $ Bonferroni intervals with $m / R({\cal T})$ comparisons control FCR at $\alpha$.
Analogous result for non-simple just harder to state.
Exercise: Run the Benjamini-Hochberg procedure at level 0.05 on hivdata and report Benjamini and Yekutieli’s FCR controlling intervals. Compare to Bonferroni simultaneous intervals.

	True null	True positive	Decisions
Declared positive	\(V({\cal T}, F)\)	\(S({\cal T}, F)\)	\(R({\cal T})\)
Declared null	\(U({\cal T}, F)\)	\(T({\cal T}, F)\)	\(m-R({\cal T})\)
	\(m_0(F)\)	\(m_1(F)\)	\(m\)

	Does not cover	Covers	Decisions
Reported	\(V({\cal I}, F)\)	\(S({\cal I}, F)\)	\(R({\cal I})\)
Not reported	\(U({\cal I}, F)\)	\(T({\cal I}, F)\)	\(m-R({\cal I})\)
	\(m_0(F)\)	\(m_1(F)\)	\(m\)

Multiple comparisons I

web.stanford.edu/class/stats364/

Multiple comparisons

Outline

Multiple comparisons

Setup

Multiple comparisons

Setup

Family Wise Error Rate

Multiple comparisons

DFQL?

Multiple comparisons

Special cases

Multiple comparisons

Special cases

Multiple comparisons

Controlling FWER

Multiple comparisons

Simultaneous confidence intervals

Multiple comparisons

Simultaneous confidence intervals

Multiple comparisons

Simultaneous confidence intervals

Multiple comparisons

How does Bonferroni behave as \(m \to \infty\)

Multiple comparisons

Improving Bonferroni: Holm procedure

In words

Multiple comparisons

Holm procedure

Proof

Multiple comparisons

Holm procedure

Multiple comparisons

Step down and step up

Hochberg’s method

Multiple comparisons

\(k\)-FWER Lehmann and Romano (2005)

Which error rate to choose?

Multiple comparisons

False Discovery Rate (FDR)

Multiple comparisons

False Discovery Rate (FDR)

Multiple comparisons

Large scale inference (Efron, Storey, Tibshirani)

Multiple comparisons

Control of FDR? (Efron, Storey, Tibshirani, Benjamini-Hochberg)

Multiple comparisons

Control of FDR Storey, Taylor, Siegmund

Multiple comparisons

Two groups model (Storey, Efron)

Multiple comparisons

Two groups model (Efron)

Multiple comparisons

Large scale inference and multiple comparisons

Multiple comparisons

Large scale inference and multiple comparisons

Multiple comparisons

Two groups model

Multiple comparisons

False coverage rate

Multiple comparisons

False coverage rate

Multiple comparisons

False coverage rate

Summary