Pre-selection, our model was (often asymptotically) \[ {\cal M} = {\cal M}_q = \left\{D \sim N(\mu, \Sigma) \right\}. \]
Selective model \[ {\cal M}^*_q = \left\{F^*: \frac{dF^*}{dF}(d) \propto \bar{\pi}_q(d), F \in {\cal M} \right\} \] with \[ \bar{\pi}_q(d) = P_F\left({\cal Q}(D) = q \vert D=d\right). \]
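A minimal sketch of this tilting in the simplest case (a toy of our own, not from the slides): a univariate \(N(\mu, 1)\) model with the hard selection rule \({\cal Q}(D) = 1\{D > c\}\), so \(\bar{\pi}_q(d) = 1\{d > c\}\) and the selective model is a truncated Gaussian.

```python
# Selective model via tilting, assuming D ~ N(mu, 1) and selection D > c.
# Then dF*/dF(d) is proportional to 1{d > c}: a truncated Gaussian.
import numpy as np
from scipy.stats import norm

mu, c, d = 0.0, 1.5, 2.0              # illustrative values
rng = np.random.default_rng(0)

# Monte Carlo draws from F*, i.e. F conditioned on selection
draws = rng.normal(mu, 1.0, size=1_000_000)
selected = draws[draws > c]

# Tilted (selective) density at d: phi(d - mu) / P_F(D > c)
f_star = norm.pdf(d - mu) / norm.sf(c - mu)
print(f_star)                                        # analytic
print(np.mean(np.abs(selected - d) < 0.05) / 0.10)   # crude MC density check
```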
For a target parameter \(\theta(F)\) with estimator \(\hat{\theta}\), we can decompose \[ D = (\hat{\theta}, {\cal N}, \Delta). \]
The term \(\Delta\) arose when \(\mu\) was subject to some linear constraints (i.e., a model) and for all \(F \in {\cal M}\) \[ \begin{aligned} \text{Cov}_F(\Delta, \hat{\theta}) &= 0 \\ \text{Cov}_F(\Delta, {\cal N}) &= 0. \\ \end{aligned} \]
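For concreteness, a standard example (our illustration; the specific choices are not from the slides): in the Gaussian linear model \[ D \sim N(X\beta, \sigma^2 I), \qquad \hat{\theta} = \hat{\beta}_j, \qquad {\cal N} = (\hat{\beta}_k)_{k \neq j}, \qquad \Delta = (I - P_X) D, \] where \(P_X\) projects onto the column space of \(X\). Least-squares coefficients and residuals are uncorrelated, so \(\text{Cov}_F(\Delta, \hat{\theta}) = \text{Cov}_F(\Delta, {\cal N}) = 0\) for every \(F\) satisfying the constraints.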
Of real interest for inference is not \(\bar{\pi}\) but \[ \pi(t,n) = P_F\left({\cal Q}(D)=q \vert \hat{\theta}=t, {\cal N}=n \right) = E_F\left[\bar{\pi}(D) \vert \hat{\theta}=t, {\cal N}=n \right]. \]
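A hedged sketch of this conditional expectation by Monte Carlo, in a toy model of our own (not from the slides): \(D \sim N((\mu_1, \mu_2, 0), I_3)\), \(\hat{\theta} = D_1\), \({\cal N} = D_2\), \(\Delta = D_3\), with selection \({\cal Q}(D) = 1\{D_1 + D_3 > c\}\), so \(\bar{\pi}(d) = 1\{d_1 + d_3 > c\}\).

```python
# Computing pi(t, n) = E_F[pi_bar(D) | theta_hat = t, N = n] in the toy model
# above. Given theta_hat = t and N = n, only Delta = D_3 ~ N(0, 1) remains
# random, so pi(t, n) = P(t + D_3 > c) = Phi(t - c); check by simulation.
import numpy as np
from scipy.stats import norm

c, t = 1.0, 0.8                       # illustrative values
rng = np.random.default_rng(1)

delta = rng.normal(0.0, 1.0, size=1_000_000)
print(np.mean(t + delta > c), norm.cdf(t - c))   # Monte Carlo vs closed form
```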
Not a location family… (cf. Leeb and Pötscher)
Genuinely permits DFQL: Data First, Questions Later.
In principle, this is a full-fledged framework for inference (as long as we can work with \(\pi(t,n)\)…)
Hypothesis tests and confidence intervals: \(\checkmark\). Based on pivot \[ {\cal P}(\hat{\theta}, \theta(F), {\cal N}, \Gamma) = \frac{\int_{\hat{\theta}}^{\infty} \phi\left((t-\theta(F)) / \sigma(F)\right) \pi(t, {\cal N}) \; dt} {\int_{-\infty}^{\infty} \phi\left((t-\theta(F)) / \sigma(F)\right) \pi(t, {\cal N}) \; dt} \]
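A hedged sketch of the pivot by numerical integration, reusing the toy \(\pi(t, n) = \Phi(t - c)\) from above (so \(\sigma(F) = 1\) here); under the selective model the pivot is Uniform(0, 1) at the true \(\theta(F)\).

```python
# Numerical evaluation of the pivot for the smooth toy selection probability
# pi(t) = Phi(t - c); numerator and denominator are one-dimensional integrals.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

c = 1.0

def pivot(theta_hat, theta, sigma=1.0):
    integrand = lambda t: norm.pdf((t - theta) / sigma) * norm.cdf(t - c)
    num, _ = quad(integrand, theta_hat, np.inf)
    den, _ = quad(integrand, -np.inf, np.inf)
    return num / den

# Equal-tailed intervals invert the pivot in theta, e.g. by bisection on
# pivot(theta_hat, theta) = alpha/2 and 1 - alpha/2.
print(pivot(theta_hat=2.0, theta=0.0))
```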
Point estimators \(\checkmark\) (cf. the UMVU estimator of Cohen and Sackrowitz, or the MLE)
Bayesian analysis \(\checkmark\) : can put a prior on \(\mu\) as long as one is able to work with \[ E_{\mu} [\bar{\pi}(d)]. \]
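A hedged sketch of a selective Bayesian analysis on a grid, again in our toy of hard selection \(D > c\) in a \(N(\mu, 1)\) model: there \(E_{\mu}[\bar{\pi}(D)] = \Phi(\mu - c)\), and the selective posterior is proportional to \(\text{prior}(\mu)\, \phi(d - \mu) / \Phi(\mu - c)\).

```python
# Grid approximation to the selective posterior under hard selection D > c.
# The prior N(0, 9) is an assumption for illustration only.
import numpy as np
from scipy.stats import norm

c, d_obs = 1.5, 2.0                         # illustrative values
grid = np.linspace(-5, 8, 2001)
prior = norm.pdf(grid, loc=0.0, scale=3.0)  # assumed N(0, 9) prior

# prior(mu) * phi(d - mu) / E_mu[pi_bar(D)], normalized over the grid
unnorm = prior * norm.pdf(d_obs - grid) / norm.cdf(grid - c)
post = unnorm / unnorm.sum()
print((grid * post).sum())                  # selective posterior mean
```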
Goodness-of-fit test \(\checkmark\): set the target to be the full (but constrained) \(\mu\), so that \({\cal N}=0\), and consider \[ {\cal L}^*(\Delta | \hat{\theta}, {\cal N}). \]
Asymptotics \(\checkmark\): one can meaningfully talk of transferring the CLT and consistency from the sequence \({\cal M}_n\) to \({\cal M}^*_n\).
This is not to say that every detail has been exhaustively studied…
Some procedures (particularly early ones) have unattractive properties: long intervals (Lee et al. for the LASSO), weak tests (covTest).
Difficulty in working with \(\pi(t,n)\): it is often complex to evaluate.
It’s model-based…
Asymptotics not as well-established as one would like.
Why condition when one can (sometimes) be simultaneous?
With regard to intervals, using the conditional approach one can obtain shorter intervals than with, e.g., data splitting. (One conditional method can dominate another.)
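A hedged simulation sketch of this comparison in a toy problem of our own (not from the slides): \(X_1, \ldots, X_n\) i.i.d. \(N(\mu, 1)\), reported only when a sample mean exceeds a threshold. Data splitting selects on one half and uses a standard interval on the other; the conditional approach uses all \(n\) observations with a truncated-Gaussian interval.

```python
# Compare average interval lengths: conditional (truncated Gaussian on all n)
# vs data splitting (standard interval on the held-out half).
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

n, mu, thresh, alpha = 50, 1.0, 0.3, 0.05    # illustrative values
s = np.sqrt(1.0 / n)
z = norm.ppf(1 - alpha / 2)
rng = np.random.default_rng(2)

def trunc_sf(xbar, m):
    # P(Xbar > xbar | Xbar > thresh) for Xbar ~ N(m, 1/n); increasing in m
    return norm.sf((xbar - m) / s) / norm.sf((thresh - m) / s)

def conditional_length(xbar):
    lo = brentq(lambda m: trunc_sf(xbar, m) - alpha / 2,
                xbar - 10 * s, xbar + 10 * s)
    hi = brentq(lambda m: trunc_sf(xbar, m) - (1 - alpha / 2),
                xbar - 10 * s, xbar + 10 * s)
    return hi - lo

len_cond, len_split = [], []
for _ in range(2000):
    x = rng.normal(mu, 1.0, size=n)
    if x.mean() > thresh:                     # conditional: select on all n
        len_cond.append(conditional_length(x.mean()))
    if x[: n // 2].mean() > thresh:           # splitting: select on one half
        len_split.append(2 * z * np.sqrt(2.0 / n))
print(np.mean(len_cond), np.mean(len_split))  # conditional often shorter here
```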
In sequential model selection, conditional tests were more powerful than deflated Bonferroni.
In short, conditional procedures are not created equal. Some certainly face valid criticisms…
Yes, this can be difficult to evaluate.
On the other hand, when effort is put in, there can be a payoff: “drop the losers” is an early and notable example.
There is an explicit integral representation of \(\pi\) for many commonly encountered settings (conditioning on solutions to a convex problem; selection by step-up procedures). Caveat: this assumes some amount of randomization or follow-up data.
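A hedged illustration of how randomization yields an explicit integral (our example, not from the slides): if selection applies to a randomized statistic, say \({\cal Q}(D, \omega) = 1\{\hat{\theta} + \omega > c\}\) with \(\omega \sim G\) independent of \(D\), then \[ \bar{\pi}(d) = \int 1\{{\cal Q}(d, \omega) = q\} \, dG(\omega) = 1 - G\left(c - \hat{\theta}(d)\right), \] a smooth, explicit integral in place of a hard indicator.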
In some cases one can approximate or learn \(\pi\), though how this affects asymptotics is less well-understood.
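A hedged sketch of learning \(\pi\) (the randomized toy rule is ours): simulate \((\hat{\theta}, \text{selected})\) pairs from the pre-selection model and fit a flexible classifier to estimate \(\pi(t) = P(\text{selected} \mid \hat{\theta} = t)\).

```python
# Learn pi(t) from simulated selection events; compare against the closed
# form Phi((t - c) / tau) available for this particular toy rule.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

c, tau = 1.0, 0.5
rng = np.random.default_rng(3)

theta_hat = rng.normal(0.0, 1.0, size=100_000)      # draws of the statistic
omega = rng.normal(0.0, tau, size=theta_hat.shape)  # randomization
selected = (theta_hat + omega > c).astype(int)      # Q(D, omega) = q

clf = LogisticRegression().fit(theta_hat[:, None], selected)
print(clf.predict_proba([[1.0]])[0, 1])             # learned pi_hat(1.0)
print(norm.cdf((1.0 - c) / tau))                    # truth: Phi((t - c)/tau)
```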
Not really model-based: one can talk of a limiting Gaussian model…
Non-standard distributions at play make calculations more complicated…
It seems reasonable to say that neater and more careful analyses than existing work will lead to clearer understanding…
Can one say anything in high dimensions? Not clear.
How about semi-parametric setting? Also an interesting question.
We had hoped to spend more time on exciting new aspects of simultaneous inference: notably knockoffs and enhanced FDR control.
Early on, we saw that there is a distinction between the simultaneous and the conditional approach.
The clearest example of this was a comment we heard from one of you: the fact that data splitting is (or can be) a valid way to address problems of selection bias has nothing to do with multiplicity (i.e., simultaneous inference)…
A related point: the classical scientific method is an instance of this “model”.