Reference: (Sampson and Sill, 2005)
We’ve only looked at single parameter problems so far…
The drop-the-losers study design involves a first stage (without loss of generality, I’ve set up the problem so that the final estimator has variance 1) \[ (Z_1, \dots, Z_K) \sim N(\mu, 2 \cdot I_K), \qquad \mu \in \mathbb{R}^K \]
Our winner is \[ w^*(Z) = \text{argmax}_{1 \leq i \leq K} Z_i. \]
Set \[ W | w^*=i \sim N(\mu_i, 2) \] and \[ \hat{\theta} = \frac{1}{2}(Z_{w^*}+W). \]
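The two-stage design above is easy to simulate; here is a minimal sketch (the helper `run_trial` is mine, not from the text). With \(K=1\) there is no selection, so \(\hat\theta \sim N(0,1)\) exactly.

```python
import numpy as np

def run_trial(mu, rng):
    """One drop-the-losers trial in the normalization above."""
    Z = rng.normal(mu, np.sqrt(2))          # first stage: Z ~ N(mu, 2 I_K)
    w_star = int(np.argmax(Z))              # the apparent winner
    W = rng.normal(mu[w_star], np.sqrt(2))  # second-stage draw on the winner
    theta_hat = 0.5 * (Z[w_star] + W)       # pooled estimator, variance 1
    return w_star, theta_hat

rng = np.random.default_rng(0)
# with K = 1 there is no selection, so theta_hat ~ N(0, 1) exactly
draws = np.array([run_trial(np.zeros(1), rng)[1] for _ in range(20000)])
```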
For concreteness, let’s suppose that each trial tests the efficacy of several drugs against COVID-19 and \(i\) is a promising candidate treatment D (hydroxychloroquine or remdesivir).
Carrying out this drop-the-losers analysis ensures that (after several trials have run their course) 95% of the confidence intervals in papers where D was the apparently best treatment will cover the true effect of D against COVID-19.
For each \(i\), we can write a pre-selection model for \((Z_1, \dots, Z_K)\) and winner \(i\) as \[ {\cal M}_i = \left\{N\left( \begin{pmatrix} 0 \\ \mu_i \\ \mu_{-i} \end{pmatrix}, \text{diag}(1, 1, 2, \dots, 2) \right): \mu \in \mathbb{R}^K \right\} \] with the first coordinate \[ \Delta = \frac{1}{2}(Z_{w^*} - W), \] the second coordinate being denoted \(T_1\) and remaining coordinates denoted by \(T_2, \dots, T_K\).
Note: this preselection model depends on \(i\) only through a parameterization of convenience. Each \({\cal M}_i\) has the same likelihood.
Allowing for different sample sizes / variances in the two stages, the model might also be \[ {\cal M}_i = \left\{N\left( \begin{pmatrix} 0 \\ \mu_i \\ \mu_{-i} \end{pmatrix}, \text{diag}(\tau^2, 1, 1+\tau^2, \dots, 1+\tau^2) \right): \mu \in \mathbb{R}^K \right\} \]
Or possibly different variances for each arm separately \[ {\cal M}_i = \left\{N\left( \begin{pmatrix} 0 \\ \mu_i \\ \mu_{-i} \end{pmatrix}, \text{diag}(\tau^2, \sigma^2_i, \sigma^2_{-i}) \right): \mu \in \mathbb{R}^K \right\} \]
For a given \(i\), the conditional model is \[ {\cal M}^*_i = \left\{F^*: \frac{dF^*}{dF}(\delta, t) \propto 1_{A}(\delta,t), F \in {\cal M}_i \right\} \] where \[ A = \{(\delta, t): (t_1+\delta) \geq \max_{2 \leq k \leq K} t_k\} \]
This is an exponential family with natural parameters \((\mu_i, \mu_{-i})\) and corresponding sufficient statistic \(T(\delta,t)=t\).
The reference measure has density \[(\delta, t) \mapsto \phi_{0,\Sigma}(\delta, t) \cdot 1_A(\delta, t).\]
Following our single parameter problem, it is natural to consider using the law of \(T_1=Z_i\) under \(H_{0,i}: \mu_i= \gamma\) in the model \({\cal M}^*_i\) to construct a CI.
Clearly, the law of \(T_1\) depends on \(\mu_{-i}\) in the model \({\cal M}^*_i\).
For example, suppose \(\min_{k \neq i}\mu_{k} \gg 1\) then if \(\gamma=0\) we will have \(T_1 \gg 1\) while if \(\max_{k \neq i}\mu_k \ll -1\) then if \(\gamma=0\) we will have \(T_1 \approx N(0, 1)\).
The \(\mu_{-i}\) are truly nuisance parameters!
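This dependence is easy to see by rejection sampling. Below is a sketch (arbitrary choices of mine: \(K=5\), target arm \(i=0\) with \(\mu_0=0\), competitors at \(+2\) or \(-6\)) in the variance-2 normalization above.

```python
import numpy as np

def conditional_mean_T1(mu, i=0, n_keep=2000, seed=0):
    """Mean of T_1 = (Z_i + W)/2 given arm i wins, by rejection sampling."""
    rng = np.random.default_rng(seed)
    kept = np.empty(0)
    while kept.size < n_keep:
        Z = rng.normal(mu, np.sqrt(2), size=(50000, len(mu)))
        sel = Z.argmax(axis=1) == i              # keep draws where arm i wins
        Zi = Z[sel, i]
        W = rng.normal(mu[i], np.sqrt(2), size=Zi.size)
        kept = np.concatenate([kept, 0.5 * (Zi + W)])
    return kept[:n_keep].mean()

mu_strong = np.array([0., 2., 2., 2., 2.])    # strong competitors
mu_weak = np.array([0., -6., -6., -6., -6.])  # weak competitors
m_strong = conditional_mean_T1(mu_strong)
m_weak = conditional_mean_T1(mu_weak)
```

With strong competitors the conditional mean of \(T_1\) is far above \(\mu_0=0\); with weak competitors, selection is nearly certain and the conditional mean is essentially 0.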
At a high level, we note that for each \(F_{\mu} \in {\cal M}_i\) the push-forward law of \(Z_i\) depends only on \(\mu_i\).
In the conditional setting, for each \(F_{\mu}^* \in {\cal M}_i^*\) the push-forward law of \(Z_i\) depends on all of \(\mu\).
This is not an issue of difficult asymptotics: the problem persists in the limit!
The problem exists even when the test statistics used for selection are independent: selection couples all the \(Z_i\)’s together under \({\cal M}^*_i\)!
Suppose we take multivariate laws \(F\) on \(\mathbb{R}^K\) such that \[ \|n^{1/2} \mu(F)\|_{\infty} \leq 3, \qquad \text{Cov}_F(Z) = I \] and we set \(Z_n\) to be the (normalized) vector of first-stage sample means of \(n/2\) draws from such an \(F\), and \(W_n\) to be the (normalized) second-stage sample mean of \(n/2\) draws from arm \(i\). This determines a model \({\cal M}_{i,n}\)
Subject to the usual CLT conditions, supposing \((F_n \in {\cal M}_{i,n})_{n \geq 1}\) is such that \(\mu(F_n) \to \mu_0\), \[ (Z_n, W_n) \overset{D}{\to} (Z, W) \sim F_{\mu_0} \in {\cal M}_i \]
So, we expect any phenomenon in the limiting problem to be present in the asymptotics when we relax the exact Gaussianity assumption…
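This transfer is easy to probe in silico. A sketch (arbitrary choices of mine: \(K=5\) arms, \(\mu=0\), \(n=400\) observations per arm, centered unit-variance exponential data) comparing the mean of the selected (largest) normalized arm mean with its value under exact Gaussianity:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, reps = 5, 400, 5000

# normalized sample means of centered exponential data (mean 0, variance 1),
# generated in batches to keep memory modest
max_exp_draws = []
for _ in range(10):
    X = rng.exponential(1.0, size=(reps // 10, K, n)) - 1.0
    Z_exp = np.sqrt(n) * X.mean(axis=2)
    max_exp_draws.append(Z_exp.max(axis=1))
max_exp = np.concatenate(max_exp_draws).mean()

# exact Gaussian analogue of the limiting model
max_gauss = rng.standard_normal((reps, K)).max(axis=1).mean()
```

The two means agree closely, consistent with the selection phenomenon carrying over from the Gaussian limit.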
Use \(\hat{\mu}_{-i} = 0\) and draw from \[ N\left( \begin{pmatrix} 0 \\ \gamma \\ 0 \end{pmatrix}, \text{diag}(\tau^2, 1, 1+\tau^2, \dots, 1+\tau^2) \right) \] restricted to \(A\), returning only \(T_1\).
Reject \(H_0:\mu_i=\gamma\) if the observed \(T_1\) is outside the central 95% of this distribution. (Equal-tailed test.)
Use \(\hat{\mu}_{-i} = Z_{-i}\) and draw from \[ N\left( \begin{pmatrix} 0 \\ \gamma \\ Z_{-i} \end{pmatrix}, \text{diag}(\tau^2, 1, 1+\tau^2, \dots, 1+\tau^2) \right) \] restricted to \(A\), returning only \(T_1\).
Reject \(H_0:\mu_i=\gamma\) if the observed \(T_1\) is outside the central 95% of this distribution. (Equal-tailed test.)
Use \(\hat{\mu}_{-i} = Z_{-i}\) as well as \(\hat{\mu}_i=Z_i\) and draw from \[ N\left( \begin{pmatrix} 0 \\ Z_i \\ Z_{-i} \end{pmatrix}, \text{diag}(\tau^2, 1, 1+\tau^2, \dots, 1+\tau^2) \right) \] restricted to \(A\), returning only \(T_1\).
Construct a 95% CI by quantile method and check whether \(\gamma\) is covered.
Without selection, bootstrap tests can be carried out by recentering at \(\gamma\), i.e. applying the shift \(\gamma - Z_i\).
Reject \(H_0:\mu_i=\gamma\) if the observed \(T_1\) is outside the central 95% of this distribution. (Equal-tailed test.)
import numpy as np

winning_idx, K, noise_sd = 0, 10, 1
truth = np.zeros(K)

def draw_one(truth,
             noise_sd,
             winning_idx):
    # Rejection sampling: redraw the first-stage vector until the chosen
    # arm is the apparent winner, i.e. until we land in the event A.
    while True:
        Z_star = (np.random.standard_normal(truth.shape) * np.sqrt(1 + noise_sd**2)
                  + truth)
        if Z_star[winning_idx] == np.max(Z_star):
            break
    # Second-stage draw on the winning arm; its standard deviation is
    # sqrt(1 + noise_sd**2) / noise_sd so that the pooled estimator T_1
    # below has variance 1.
    W = (np.sqrt(1 + noise_sd**2) / noise_sd * np.random.standard_normal()
         + truth[winning_idx])
    # Replace the winner's coordinate by the precision-weighted estimator T_1.
    Z_star[winning_idx] = (Z_star[winning_idx] + noise_sd**2 * W) / (1 + noise_sd**2)
    return Z_star

def reference_full(gamma,
                   observed,
                   winning_idx=0,
                   nuisance=None,
                   noise_sd=1,
                   ndraw=2000):
    # `nuisance` maps (observed, gamma, winning_idx) to the mean vector used
    # for the reference distribution, i.e. a plug-in for (gamma, mu_{-i}).
    reference_mean = nuisance(observed, gamma, winning_idx)
    reference = []
    for _ in range(ndraw):
        reference.append(draw_one(reference_mean,
                                  noise_sd,
                                  winning_idx))
    reference = np.array(reference)
    pvalue = np.mean(reference[:, winning_idx] < observed[winning_idx])
    return observed[winning_idx], reference[:, winning_idx], pvalue

In all examples, \(\gamma=\mu_i\) (i.e. \(H_{0,i}: \mu_i=\gamma\) is always true).
\(\implies\) Our \(p\)-value should look like a \(p\)-value…
## proportion covering truth: 0.07
K, noise_sd = 20, 0.5
truth = np.ones(K) * 0
oracle = lambda Z, gamma, winning_idx: truth  # an oracle for "nuisance"

Conditioning on \(Z_{-i}\) (i.e. \((T_2, \dots, T_K)\) in \({\cal M}^*_i\)) removes the dependence on \(\mu_{-i}\).
Note \[ 1_A(\Delta, T) \,\big|\, T_{-1} = 1_{\{T_1 + \Delta > C\}} \] where \(C=\max_{2 \leq k \leq K} T_k\) is measurable w.r.t. \(Z_{-i}\).
Further \[ E[1_A(\Delta, T) | T_{-1}, T_1] = 1 - \Phi((C - T_1) / \tau) \]
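This identity is easy to check by Monte Carlo; a quick sketch with assumed illustrative values \(T_1 = 0.3\), \(C = 1.0\), \(\tau = 0.8\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T1, C, tau = 0.3, 1.0, 0.8                 # assumed illustrative values
delta = rng.normal(0.0, tau, size=200000)  # Delta ~ N(0, tau^2), independent of T
mc = np.mean(T1 + delta > C)               # Monte Carlo selection probability
exact = norm.sf((C - T1) / tau)            # 1 - Phi((C - T_1)/tau)
```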
The final model considers the laws \[{\cal L}\left(T_1 \,\middle|\, (T_1+\Delta) \geq \max_{2 \leq k \leq K} T_k, (T_k)_{2 \leq k \leq K}\right)\] for each \(F_{\mu} \in {\cal M}_i\) (or, equivalently, each \(F_{\mu}^* \in {\cal M}^*_i\)). This is a randomized one-sample problem with threshold \(C(T_2, \dots, T_K)\)!
# `normal_dbn` is the standard normal from scipy.stats; `discrete_family` is
# the discrete exponential family helper (e.g. from the selectinf package).
from scipy.stats import norm as normal_dbn

def reference_conditional(gamma,
                          observed,
                          winning_idx=0,
                          noise_sd=1,
                          ndraw=20000):
    # the random threshold C: max over the losing arms, measurable w.r.t. Z_{-i}
    random_threshold = max([observed[j] for j in range(observed.shape[0])
                            if j != winning_idx])
    # sample T_1 ~ N(gamma, 1) and weight each draw by its selection probability
    Zsample = np.random.standard_normal(ndraw) + gamma
    W = normal_dbn.sf((random_threshold - Zsample) / noise_sd)
    F = discrete_family(Zsample, W)
    pvalue = F.cdf(0, observed[winning_idx])
    return pvalue

Suppose \(F_n\) is a sequence of laws on \(\mathbb{R}^K\) such that, in distribution, \[ (Z_n, W_n) \overset{F_n}{\to} (Z, W) \sim F_{\mu_0} \in {\cal M}_i, \qquad \|\mu_0\|_{\infty} \leq 3 \]
Then, by inspection above, we see that the pivot at \(\mu_0\) defined by reference_conditional is expressible (up to sampling error) as a ratio of expectations, under \(F_{\mu_0}\), of continuous functions of \((Z_n, W_n)\).
By standard weak convergence arguments, the law of the pivot converges weakly to \(\text{Unif}(0,1)\).
This argument can be made quantitative and uniform (cf. Markovic and Taylor, 2016).
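In the exact Gaussian model the uniformity can be checked directly. Below is a self-contained sketch (helper names are mine; assumed values \(K=5\), \(\tau=1\), an arbitrary \(\mu\) with \(\|\mu\|_\infty \leq 1\); the second-stage variance is taken to be \((1+\tau^2)/\tau^2\) so that \(\text{Var}(T_1)=1\)) that computes the conditional pivot by numerical integration instead of sampling:

```python
import numpy as np
from scipy.stats import norm

def conditional_pivot(t_obs, C, gamma, tau):
    # CDF at t_obs of the law of T_1 ~ N(gamma, 1) tilted by the selection
    # probability w(t) = P(T_1 + Delta > C | T_1 = t) = norm.sf((C - t)/tau),
    # computed by numerical integration on a grid.
    grid = np.linspace(gamma - 10, gamma + 10, 4001)
    dens = norm.pdf(grid - gamma) * norm.sf((C - grid) / tau)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]
    return np.interp(t_obs, grid, cdf)

def simulate_pivot(mu, tau, rng):
    # one trial: first stage, winner, second stage, pooled T_1, threshold C
    Z = rng.normal(mu, np.sqrt(1 + tau**2))
    i = int(np.argmax(Z))
    W = rng.normal(mu[i], np.sqrt(1 + tau**2) / tau)  # so Var(T_1) = 1
    T1 = (Z[i] + tau**2 * W) / (1 + tau**2)
    C = np.max(np.delete(Z, i))
    return conditional_pivot(T1, C, mu[i], tau)       # pivot at the true mu_i

rng = np.random.default_rng(0)
mu, tau = np.array([0., 0.5, -1., 1., 0.]), 1.0
pivots = np.array([simulate_pivot(mu, tau, rng) for _ in range(2000)])
```

Across simulated trials the pivot is close to \(\text{Unif}(0,1)\), whichever arm happens to win.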
What if we relax the assumption that \(\text{Cov}_F(Z)=I\) to unknown (diagonal) covariance?
We demonstrated in silico that the plugin principle fails here (i.e. the score test statistic is not pivotal).
In practice, with data splitting we will not know \(\tau\) (or \(\sigma\)) but with large enough samples we can estimate these rather well.
Going back to our discussions on unequal variances, etc. we see that the drop the losers pivot will depend on \[ \Sigma = \text{diag}(\sigma^2_W, \sigma^2_T) \] and if we use sample means instead of normalized sample means, proper dependence on \(n\) should have \[ \Sigma_n = n^{-1} \text{diag}(\sigma^2_{W,n}, \sigma^2_{T,n}) \]
We could write the pivotal quantity for \(H_0:n^{1/2}\mu_i=\gamma\) \[ {\cal P}_n(\gamma) = {\cal P}_n(n^{-1/2}\gamma, \hat{\mu}_{i,n}, \hat{\mu}_{-i,n}, \Sigma_n / n) = {\cal P}_{\infty}(\gamma, T_{1,n}, T_{-1,n}, \Sigma_n) \] where \({\cal P}_{\infty}\) is the (rescaled) pivot above and \[ \Sigma_n = \text{Var}(T_n) = O(1) \]
Failure of the plug-in principle for the \(\mu\) parameters can be expressed at a high level as: \(O(n^{-1/2})\) fluctuations in \(\hat{\mu}_n\) lead to \(O(1)\) changes in \({\cal P}\).
However, \(O(n^{-1/2})\) fluctuations in \(\Sigma_n\) (assuming \(n \cdot \Sigma_n > \kappa \cdot I\) uniformly in \(n\)) lead to \(O(n^{-1/2})\) changes in \({\cal P}\).
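A numeric sketch of this robustness (the values of \(T_1\), the threshold \(C\), \(\gamma\) are arbitrary illustrative choices of mine; the conditional pivot is computed by numerical integration): multiplying \(\tau\) by \(1+\varepsilon\) moves the pivot by roughly a constant times \(\varepsilon\).

```python
import numpy as np
from scipy.stats import norm

def pivot(t_obs, C, gamma, tau):
    # conditional pivot: CDF at t_obs of T_1 ~ N(gamma, 1) weighted by the
    # selection probability norm.sf((C - t)/tau), by numerical integration
    grid = np.linspace(gamma - 10, gamma + 10, 4001)
    dens = norm.pdf(grid - gamma) * norm.sf((C - grid) / tau)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]
    return np.interp(t_obs, grid, cdf)

# illustrative values; perturb tau multiplicatively by eps
t_obs, C, gamma, tau = 1.2, 0.8, 0.0, 1.0
base = pivot(t_obs, C, gamma, tau)
shifts = [abs(pivot(t_obs, C, gamma, tau * (1 + eps)) - base)
          for eps in (0.1, 0.01, 0.001)]
```

The change in the pivot shrinks in proportion to the perturbation of \(\tau\), in contrast with the \(O(1)\) sensitivity to the \(\mu\) plug-in.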
In the data splitting example (with only one parameter) we noted that there existed a normally distributed unbiased estimator of \(\mu\) (at least when \(\tau > 0\)) in model \({\cal M}^*\). What is the analogous estimator of \(\mu_i\) in model \({\cal M}_i^*\)?
Write a function that computes the 95% CI for \(\mu_i\) in model \({\cal M}_i^*\). Compare its length to that of the 95% CI based on the unbiased estimator above, for various values of \(\mu\), with \(\tau=1\) and \(K=20\).
The CI based on the unbiased estimator has the advantage that it’s easy to compute. Suppose now that you make your living by constructing 95% CIs \([L, U]\) in drop-the-losers trials where your payout for a given trial is \[ 1_{[L, U]}(\mu_{w^*}) - \lambda \cdot (U - L). \] That is, you are rewarded if the constructed interval covers its target, but you must pay for each unit of length. What CI would you use? For a few different values of \(\mu\) and \(\lambda\), compare your payout using the full conditional drop-the-losers interval and the data splitting interval.
We argued above that plug-in estimates of variance can be OK under some conditions. Are there any linear functions of \(\mu\) that are safe to use with plug-in estimates in drop-the-losers? Are there additional assumptions you might make under which more (or perhaps many) linear functions of \(\mu\) are safe to use with a method like reference_full above?
Suppose we have run \(M\) different drop-the-losers studies in which hydroxychloroquine is one of the candidates for treatment of COVID-19, with different other treatments across the \(M\) studies. Of these, \(m < M\) show hydroxychloroquine to be the most effective treatment. Confident in the efficacy of hydroxychloroquine, you want to pool these \(m\) studies to improve your interval or point estimate for the effect of hydroxychloroquine.
For the \(m\) different studies in which hydroxychloroquine is the best treatment, describe a model of all corresponding treatment effects under which you can carry out exact inference as in the drop-the-losers design. Do you trust the resulting confidence interval based on this model? (Added recently: does it matter that hydroxychloroquine did not win the other \(M-m\) trials if our goal is valid conditional inference over all COVID-19 trials run?)
A colleague suggests you might also use the \(M-m\) studies in which hydroxychloroquine was not the winner. What do you think of this suggestion? Can it help you in your task of valid inference for the hydroxychloroquine effect?
Going back to the original drop-the-losers problem, suppose that remdesivir as well as hydroxychloroquine is among the candidate treatments. Suppose you ran a trial in which you committed to drawing more samples from the hydroxychloroquine arm as well as from whatever the winning arm was. How would you construct valid confidence intervals for the effect of hydroxychloroquine conditional on your trial producing remdesivir as its winner?