Reference: (Sampson and Sill, 2005)
We’ve only looked at single parameter problems so far…
The drop-the-losers study design involves a first stage (without loss of generality, I’ve set up the problem so that the final estimator has variance 1) \[ (Z_1, \dots, Z_K) \sim N(\mu, 2 \cdot I_K), \qquad \mu \in \mathbb{R}^K \]
Our winner is \[ w^*(Z) = \text{argmax}_{1 \leq i \leq K} Z_i. \]
Set \[ W | w^*=i \sim N(\mu_i, 2) \] and \[ \hat{\theta} = \frac{1}{2}(Z_{w^*}+W). \]
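The two-stage design above is easy to simulate; here is a minimal sketch (the helper `run_trial` is mine, not from the text). With \(K=1\) there is no selection, so \(\hat\theta \sim N(0,1)\) exactly.

```python
import numpy as np

def run_trial(mu, rng):
    """One drop-the-losers trial in the normalization above."""
    Z = rng.normal(mu, np.sqrt(2))          # first stage: Z ~ N(mu, 2 I_K)
    w_star = int(np.argmax(Z))              # the apparent winner
    W = rng.normal(mu[w_star], np.sqrt(2))  # second-stage draw on the winner
    theta_hat = 0.5 * (Z[w_star] + W)       # pooled estimator, variance 1
    return w_star, theta_hat

rng = np.random.default_rng(0)
# with K = 1 there is no selection, so theta_hat ~ N(0, 1) exactly
draws = np.array([run_trial(np.zeros(1), rng)[1] for _ in range(20000)])
```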
For concreteness, let’s suppose that each trial tests the efficacy of several drugs against COVID-19 and \(i\) is a promising candidate treatment D (hydroxychloroquine or remdesivir).
Carrying out this drop-the-losers analysis ensures that (after several trials have run their course) 95% of the confidence intervals in papers where D was the apparently best treatment will cover the true effect of D against COVID-19.
For each \(i\), we can write a pre-selection model for \((Z_1, \dots, Z_K)\) and winner \(i\) as \[ {\cal M}_i = \left\{N\left( \begin{pmatrix} 0 \\ \mu_i \\ \mu_{-i} \end{pmatrix}, \text{diag}(1, 1, 2, \dots, 2) \right): \mu \in \mathbb{R}^K \right\} \] with the first coordinate \[ \Delta = \frac{1}{2}(Z_{w^*} - W), \] the second coordinate being denoted \(T_1\) and remaining coordinates denoted by \(T_2, \dots, T_K\).
Note: this preselection model depends on \(i\) only through a parameterization of convenience. Each \({\cal M}_i\) has the same likelihood.
Allowing for different sample sizes / variances in the two stages, the model might also be \[ {\cal M}_i = \left\{N\left( \begin{pmatrix} 0 \\ \mu_i \\ \mu_{-i} \end{pmatrix}, \text{diag}(\tau^2, 1, 1+\tau^2, \dots, 1+\tau^2) \right): \mu \in \mathbb{R}^K \right\} \]
Or possibly different variances for each arm separately \[ {\cal M}_i = \left\{N\left( \begin{pmatrix} 0 \\ \mu_i \\ \mu_{-i} \end{pmatrix}, \text{diag}(\tau^2, \sigma^2_i, \sigma^2_{-i}) \right): \mu \in \mathbb{R}^K \right\} \]
For a given \(i\), the conditional model is \[ {\cal M}^*_i = \left\{F^*: \frac{dF^*}{dF}(\delta, t) \propto 1_{A}(\delta,t), F \in {\cal M}_i \right\} \] where \[ A = \{(\delta, t): (t_1+\delta) \geq \max_{2 \leq k \leq K} t_k\} \]
This is an exponential family with natural parameters \((\mu_i, \mu_{-i})\) and corresponding sufficient statistic \(T(\delta,t)=t\).
The reference measure has density \[(\delta, t) \mapsto \phi_{0,\Sigma}(\delta, t) \cdot 1_A(\delta, t).\]
Following our single parameter problem, it is natural to consider using the law of \(T_1=Z_i\) under \(H_{0,i}: \mu_i= \gamma\) in the model \({\cal M}^*_i\) to construct a CI.
Clearly, the law of \(T_1\) depends on \(\mu_{-i}\) in the model \({\cal M}^*_i\).
For example, suppose \(\min_{k \neq i}\mu_{k} \gg 1\) then if \(\gamma=0\) we will have \(T_1 \gg 1\) while if \(\max_{k \neq i}\mu_k \ll -1\) then if \(\gamma=0\) we will have \(T_1 \approx N(0, 1)\).
The \(\mu_{-i}\) are truly nuisance parameters!
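This dependence is easy to see by rejection sampling. Below is a sketch (arbitrary choices of mine: \(K=5\), target arm \(i=0\) with \(\mu_0=0\), competitors at \(+2\) or \(-6\)) in the variance-2 normalization above.

```python
import numpy as np

def conditional_mean_T1(mu, i=0, n_keep=2000, seed=0):
    """Mean of T_1 = (Z_i + W)/2 given arm i wins, by rejection sampling."""
    rng = np.random.default_rng(seed)
    kept = np.empty(0)
    while kept.size < n_keep:
        Z = rng.normal(mu, np.sqrt(2), size=(50000, len(mu)))
        sel = Z.argmax(axis=1) == i              # keep draws where arm i wins
        Zi = Z[sel, i]
        W = rng.normal(mu[i], np.sqrt(2), size=Zi.size)
        kept = np.concatenate([kept, 0.5 * (Zi + W)])
    return kept[:n_keep].mean()

mu_strong = np.array([0., 2., 2., 2., 2.])    # strong competitors
mu_weak = np.array([0., -6., -6., -6., -6.])  # weak competitors
m_strong = conditional_mean_T1(mu_strong)
m_weak = conditional_mean_T1(mu_weak)
```

With strong competitors the conditional mean of \(T_1\) is far above \(\mu_0=0\); with weak competitors, selection is nearly certain and the conditional mean is essentially 0.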
At a high level, we note that for each \(F_{\mu} \in {\cal M}_i\) the push-forward law of \(Z_i\) depends only on \(\mu_i\).
In the conditional setting, for each \(F_{\mu}^* \in {\cal M}_i^*\) the push-forward law of \(Z_i\) depends on all of \(\mu\).
This is not an issue of difficult asymptotics: the problem persists in the limit!
The problem exists even when the test statistics used for selection are independent: selection couples all the \(Z_i\)’s together under \({\cal M}^*_i\)!
Suppose we take multivariate laws \(F\) on \(\mathbb{R}^K\) such that \[ \|n^{1/2} \mu(F)\|_{\infty} \leq 3, \qquad \text{Cov}_F(Z) = I \] and we set \(Z_n\) to be the (normalized) vector of first-stage sample means of \(n/2\) draws from such an \(F\), and \(W_n\) to be the (normalized) second-stage sample mean of \(n/2\) draws from arm \(i\). This determines a model \({\cal M}_{i,n}\)
Subject to the usual CLT conditions, supposing \((F_n \in {\cal M}_{i,n})_{n \geq 1}\) is such that \(\mu(F_n) \to \mu_0\), \[ (Z_n, W_n) \overset{D}{\to} (Z, W) \sim F_{\mu_0} \in {\cal M}_i \]
So, we expect any phenomenon in the limiting problem to be present in the asymptotics when we relax the exact Gaussianity assumption…
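This transfer is easy to probe in silico. A sketch (arbitrary choices of mine: \(K=5\) arms, \(\mu=0\), \(n=400\) observations per arm, centered unit-variance exponential data) comparing the mean of the selected (largest) normalized arm mean with its value under exact Gaussianity:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, reps = 5, 400, 5000

# normalized sample means of centered exponential data (mean 0, variance 1),
# generated in batches to keep memory modest
max_exp_draws = []
for _ in range(10):
    X = rng.exponential(1.0, size=(reps // 10, K, n)) - 1.0
    Z_exp = np.sqrt(n) * X.mean(axis=2)
    max_exp_draws.append(Z_exp.max(axis=1))
max_exp = np.concatenate(max_exp_draws).mean()

# exact Gaussian analogue of the limiting model
max_gauss = rng.standard_normal((reps, K)).max(axis=1).mean()
```

The two means agree closely, consistent with the selection phenomenon carrying over from the Gaussian limit.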
Use \(\hat{\mu}_{-i} = 0\) and draw from \[ N\left( \begin{pmatrix} 0 \\ \gamma \\ 0 \end{pmatrix}, \text{diag}(\tau^2, 1, 1+\tau^2, \dots, 1+\tau^2) \right) \] restricted to \(A\), returning only \(T_1\).
Reject \(H_0:\mu_i=\gamma\) if the observed \(T_1\) is outside the central 95% of this distribution. (Equal-tailed test.)
Use \(\hat{\mu}_{-i} = Z_{-i}\) and draw from \[ N\left( \begin{pmatrix} 0 \\ \gamma \\ Z_{-i} \end{pmatrix}, \text{diag}(\tau^2, 1, 1+\tau^2, \dots, 1+\tau^2) \right) \] restricted to \(A\), returning only \(T_1\).
Reject \(H_0:\mu_i=\gamma\) if the observed \(T_1\) is outside the central 95% of this distribution. (Equal-tailed test.)
Use \(\hat{\mu}_{-i} = Z_{-i}\) as well as \(\hat{\mu}_i=Z_i\) and draw from \[ N\left( \begin{pmatrix} 0 \\ Z_i \\ Z_{-i} \end{pmatrix}, \text{diag}(\tau^2, 1, 1+\tau^2, \dots, 1+\tau^2) \right) \] restricted to \(A\), returning only \(T_1\).
Construct a 95% CI by quantile method and check whether \(\gamma\) is covered.
Without selection, bootstrap tests can be carried out by recentering at \(\gamma\), i.e. applying the shift \(\gamma - Z_i\).
Reject \(H_0:\mu_i=\gamma\) if the observed \(T_1\) is outside the central 95% of this distribution. (Equal-tailed test.)
import numpy as np

winning_idx, K, noise_sd = 0, 10, 1
truth = np.zeros(K)

def draw_one(truth,
             noise_sd,
             winning_idx):
    # Rejection sampling: redraw the first-stage vector until the chosen
    # arm is the apparent winner, i.e. until we land in the event A.
    while True:
        Z_star = (np.random.standard_normal(truth.shape) * np.sqrt(1 + noise_sd**2)
                  + truth)
        if Z_star[winning_idx] == np.max(Z_star):
            break
    # Second-stage draw on the winning arm; its standard deviation is
    # sqrt(1 + noise_sd**2) / noise_sd so that the pooled estimator T_1
    # below has variance 1.
    W = (np.sqrt(1 + noise_sd**2) / noise_sd * np.random.standard_normal()
         + truth[winning_idx])
    # Replace the winner's coordinate by the precision-weighted estimator T_1.
    Z_star[winning_idx] = (Z_star[winning_idx] + noise_sd**2 * W) / (1 + noise_sd**2)
    return Z_star

def reference_full(gamma,
                   observed,
                   winning_idx=0,
                   nuisance=None,
                   noise_sd=1,
                   ndraw=2000):
    # `nuisance` maps (observed, gamma, winning_idx) to the mean vector used
    # for the reference distribution, i.e. a plug-in for (gamma, mu_{-i}).
    reference_mean = nuisance(observed, gamma, winning_idx)
    reference = []
    for _ in range(ndraw):
        reference.append(draw_one(reference_mean,
                                  noise_sd,
                                  winning_idx))
    reference = np.array(reference)
    pvalue = np.mean(reference[:, winning_idx] < observed[winning_idx])
    return observed[winning_idx], reference[:, winning_idx], pvalue

In all examples, \(\gamma=\mu_i\) (i.e. \(H_{0,i}: \mu_i=\gamma\) is always true).
\(\implies\) Our \(p\)-value should look like a \(p\)-value…
## proportion covering truth: 0.07
K, noise_sd = 20, 0.5
truth = np.ones(K) * 0
oracle = lambda Z, gamma, winning_idx: truth  # an oracle for "nuisance"

Conditioning on \(Z_{-i}\) (i.e. \((T_2, \dots, T_K)\) in \({\cal M}^*_i\)) removes the dependence on \(\mu_{-i}\).
Note \[ 1_A(\Delta, T) \,\big|\, T_{-1} = 1_{\{T_1 + \Delta > C\}} \] where \(C=\max_{2 \leq k \leq K} T_k\) is measurable w.r.t. \(Z_{-i}\).
Further \[ E[1_A(\Delta, T) | T_{-1}, T_1] = 1 - \Phi((C - T_1) / \tau) \]
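This identity is easy to check by Monte Carlo; a quick sketch with assumed illustrative values \(T_1 = 0.3\), \(C = 1.0\), \(\tau = 0.8\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T1, C, tau = 0.3, 1.0, 0.8                 # assumed illustrative values
delta = rng.normal(0.0, tau, size=200000)  # Delta ~ N(0, tau^2), independent of T
mc = np.mean(T1 + delta > C)               # Monte Carlo selection probability
exact = norm.sf((C - T1) / tau)            # 1 - Phi((C - T_1)/tau)
```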
The final model considers the laws \[{\cal L}\left(T_1 \,\middle|\, (T_1+\Delta) \geq \max_{2 \leq k \leq K} T_k, (T_k)_{2 \leq k \leq K}\right)\] for each \(F_{\mu} \in {\cal M}_i\) (or, equivalently, each \(F_{\mu}^* \in {\cal M}^*_i\)). This is a randomized one-sample problem with threshold \(C(T_2, \dots, T_K)\)!
# `normal_dbn` is the standard normal from scipy.stats; `discrete_family` is
# the discrete exponential family helper (e.g. from the selectinf package).
from scipy.stats import norm as normal_dbn

def reference_conditional(gamma,
                          observed,
                          winning_idx=0,
                          noise_sd=1,
                          ndraw=20000):
    # the random threshold C: max over the losing arms, measurable w.r.t. Z_{-i}
    random_threshold = max([observed[j] for j in range(observed.shape[0])
                            if j != winning_idx])
    # sample T_1 ~ N(gamma, 1) and weight each draw by its selection probability
    Zsample = np.random.standard_normal(ndraw) + gamma
    W = normal_dbn.sf((random_threshold - Zsample) / noise_sd)
    F = discrete_family(Zsample, W)
    pvalue = F.cdf(0, observed[winning_idx])
    return pvalue

Suppose \(F_n\) is a sequence of laws on \(\mathbb{R}^K\) such that, in distribution, \[ (Z_n, W_n) \overset{F_n}{\to} (Z, W) \sim F_{\mu_0} \in {\cal M}_i, \qquad \|\mu_0\|_{\infty} \leq 3 \]
Then, by inspection above, we see that the pivot at \(\mu_0\) defined by reference_conditional is expressible (up to sampling error) as a ratio of expectations, under \(F_{\mu_0}\), of continuous functions of \((Z_n, W_n)\).
By standard weak convergence arguments, the law of the pivot converges weakly to \(\text{Unif}(0,1)\).
This argument can be made quantitative and uniform (cf. Markovic and Taylor, 2016).
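In the exact Gaussian model the uniformity can be checked directly. Below is a self-contained sketch (helper names are mine; assumed values \(K=5\), \(\tau=1\), an arbitrary \(\mu\) with \(\|\mu\|_\infty \leq 1\); the second-stage variance is taken to be \((1+\tau^2)/\tau^2\) so that \(\text{Var}(T_1)=1\)) that computes the conditional pivot by numerical integration instead of sampling:

```python
import numpy as np
from scipy.stats import norm

def conditional_pivot(t_obs, C, gamma, tau):
    # CDF at t_obs of the law of T_1 ~ N(gamma, 1) tilted by the selection
    # probability w(t) = P(T_1 + Delta > C | T_1 = t) = norm.sf((C - t)/tau),
    # computed by numerical integration on a grid.
    grid = np.linspace(gamma - 10, gamma + 10, 4001)
    dens = norm.pdf(grid - gamma) * norm.sf((C - grid) / tau)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]
    return np.interp(t_obs, grid, cdf)

def simulate_pivot(mu, tau, rng):
    # one trial: first stage, winner, second stage, pooled T_1, threshold C
    Z = rng.normal(mu, np.sqrt(1 + tau**2))
    i = int(np.argmax(Z))
    W = rng.normal(mu[i], np.sqrt(1 + tau**2) / tau)  # so Var(T_1) = 1
    T1 = (Z[i] + tau**2 * W) / (1 + tau**2)
    C = np.max(np.delete(Z, i))
    return conditional_pivot(T1, C, mu[i], tau)       # pivot at the true mu_i

rng = np.random.default_rng(0)
mu, tau = np.array([0., 0.5, -1., 1., 0.]), 1.0
pivots = np.array([simulate_pivot(mu, tau, rng) for _ in range(2000)])
```

Across simulated trials the pivot is close to \(\text{Unif}(0,1)\), whichever arm happens to win.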
What if we relax the assumption that \(\text{Cov}_F(Z)=I\) to unknown (diagonal) covariance?
We demonstrated in silico that the plugin principle fails here (i.e. the score test statistic is not pivotal).
In practice, with data splitting we will not know \(\tau\) (or \(\sigma\)) but with large enough samples we can estimate these rather well.
Going back to our discussions on unequal variances, etc. we see that the drop the losers pivot will depend on \[ \Sigma = \text{diag}(\sigma^2_W, \sigma^2_T) \] and if we use sample means instead of normalized sample means, proper dependence on \(n\) should have \[ \Sigma_n = n^{-1} \text{diag}(\sigma^2_{W,n}, \sigma^2_{T,n}) \]
We could write the pivotal quantity for \(H_0:n^{1/2}\mu_i=\gamma\) \[ {\cal P}_n(\gamma) = {\cal P}_n(n^{-1/2}\gamma, \hat{\mu}_{i,n}, \hat{\mu}_{-i,n}, \Sigma_n / n) = {\cal P}_{\infty}(\gamma, T_{1,n}, T_{-1,n}, \Sigma_n) \] where \({\cal P}_{\infty}\) is the (rescaled) pivot above and \[ \Sigma_n = \text{Var}(T_n) = O(1) \]
Failure of the plug-in principle for the \(\mu\) parameters can be expressed at a high level as: \(O(n^{-1/2})\) fluctuations in \(\hat{\mu}_n\) lead to \(O(1)\) changes in \({\cal P}\).
However, \(O(n^{-1/2})\) fluctuations in \(\Sigma_n\) (assuming \(n \cdot \Sigma_n > \kappa \cdot I\) uniformly in \(n\)) lead to \(O(n^{-1/2})\) changes in \({\cal P}\).
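A numeric sketch of this robustness (the values of \(T_1\), the threshold \(C\), \(\gamma\) are arbitrary illustrative choices of mine; the conditional pivot is computed by numerical integration): multiplying \(\tau\) by \(1+\varepsilon\) moves the pivot by roughly a constant times \(\varepsilon\).

```python
import numpy as np
from scipy.stats import norm

def pivot(t_obs, C, gamma, tau):
    # conditional pivot: CDF at t_obs of T_1 ~ N(gamma, 1) weighted by the
    # selection probability norm.sf((C - t)/tau), by numerical integration
    grid = np.linspace(gamma - 10, gamma + 10, 4001)
    dens = norm.pdf(grid - gamma) * norm.sf((C - grid) / tau)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]
    return np.interp(t_obs, grid, cdf)

# illustrative values; perturb tau multiplicatively by eps
t_obs, C, gamma, tau = 1.2, 0.8, 0.0, 1.0
base = pivot(t_obs, C, gamma, tau)
shifts = [abs(pivot(t_obs, C, gamma, tau * (1 + eps)) - base)
          for eps in (0.1, 0.01, 0.001)]
```

The change in the pivot shrinks in proportion to the perturbation of \(\tau\), in contrast with the \(O(1)\) sensitivity to the \(\mu\) plug-in.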
In the data splitting example (with only one parameter) we noted that there existed a normally distributed unbiased estimator of \(\mu\) (at least when \(\tau > 0\)) in model \({\cal M}^*\). What is the analogous estimator of \(\mu_i\) in model \({\cal M}_i^*\)?
Write a function that computes the 95% CI for \(\mu_i\) in model \({\cal M}_i^*\). Compare its length to that of the 95% CI based on the unbiased estimator above, for various values of \(\mu\), with \(\tau=1\) and \(K=20\).
The CI based on the unbiased estimator has the advantage that it’s easy to compute. Suppose now that you make your living by constructing 95% CIs \([L, U]\) in drop-the-losers trials where your payout for a given trial is \[ 1_{[L, U]}(\mu_{w^*}) - \lambda \cdot (U - L). \] That is, you are rewarded if the constructed interval covers its target, but you must pay for each unit of length. What CI would you use? For a few different values of \(\mu\) and \(\lambda\), compare your payout using the full conditional drop-the-losers interval and the data splitting interval.
We argued above that plug-in estimates of variance can be OK under some conditions. Are there any linear functions of \(\mu\) that are safe to use with plug-in estimates in drop-the-losers? Are there additional assumptions you might make under which more (or perhaps many) linear functions of \(\mu\) are safe to use with a method like reference_full above?
Suppose we have run \(M\) different drop-the-losers studies in which hydroxychloroquine is one of the candidates for treatment of COVID-19, with different other treatments across the \(M\) studies. Of these, \(m < M\) show hydroxychloroquine to be the most effective treatment. Confident in the efficacy of hydroxychloroquine, you want to pool these \(m\) studies to improve your interval or point estimate for the effect of hydroxychloroquine.
For the \(m\) different studies in which hydroxychloroquine is the best treatment, describe a model of all corresponding treatment effects under which you can carry out exact inference as in the drop-the-losers design. Do you trust the resulting confidence interval based on this model? (Added recently: does it matter that hydroxychloroquine did not win the other \(M-m\) trials if our goal is valid conditional inference over all COVID-19 trials run?)
A colleague suggests you might also use the \(M-m\) studies in which hydroxychloroquine was not the winner. What do you think of this suggestion? Can it help you in your task of valid inference for the hydroxychloroquine effect?
Going back to the original drop-the-losers problem, suppose that remdesivir as well as hydroxychloroquine is among the candidate treatments. Suppose you ran a trial in which you committed to drawing more samples from the hydroxychloroquine arm as well as from whatever the winning arm was. How would you construct valid confidence intervals for the effect of hydroxychloroquine conditional on your trial producing remdesivir as its winner?