## Loaded glmnet 2.0-18
The objective function of glmnet (without intercept or standardization) is \[
\beta \mapsto \frac{1}{2n} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1
\]
Including intercept and other unpenalized effects can be accomplished with feature weights \[ \beta \mapsto \frac{1}{2n} \|Y-X\beta\|^2_2 + \lambda \sum_i w_i |\beta_i| \]
For simplicity, we’ll start with no feature weights.
For ease of notation (dropping the \(1/n\) factor), we’ll also use SSE instead of MSE \[ \beta \mapsto \frac{1}{2} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1 \]
The choice s=1.5/sqrt(n) corresponds to \(\lambda=1.5 \sqrt{n} = 1.5\sqrt{100} = 15\) in the loss above.
## $selected
## [1] 2 20
## 2.5 % 97.5 %
## X[, selected]1 -0.4059135 0.03010663
## X[, selected]2 -0.4949535 -0.07162624
simulate = function(n=100,
                    p=20,
                    s=1.5) {
  X = matrix(rnorm(n * p), n, p)
  Y = rnorm(n)
  L = glmnet(X, Y, intercept=FALSE, standardize=FALSE)
  # ignore intercept
  beta_hat = coef(L, s=s/sqrt(n), exact=TRUE, x=X, y=Y)[-1]
  selected = beta_hat != 0
  if (sum(selected) > 0) {
    intervals = confint(lm(Y ~ X[,selected] - 1))
    covered = (intervals[,1] < 0) * (intervals[,2] > 0)
    return(covered)
  } else {
    return(numeric(0))
  }
}
simulate()
## X[, selected]1 X[, selected]2
##              0              0
## [1] 0.592
Perhaps we can fix this with conditional inference…
What does it mean for LASSO to have selected features 2 and 20?
Write \(\hat{u} = X^T(Y - X\hat{\beta})\) for the subgradient term in the KKT conditions. If \(\hat{\beta}_j \neq 0\) we must have \(\hat{u}_j = \lambda \text{sign}(\hat{\beta}_j)\).
Else, \(\hat{\beta}_j = 0\) and we must have \(|\hat{u}_j| \leq \lambda\).
Set \[ \begin{aligned} \hat{E} &= \left\{j : \hat{\beta}_j \neq 0 \right\} \\ \hat{s}_E &= \text{sign}(\hat{\beta}_E) \end{aligned} \]
On the event \((\hat{E}, \hat{s}_E) = (E, s_E)\) the KKT conditions can be written \[ \begin{aligned} X_E^T(Y - X_E\hat{\beta}_E) &= \lambda s_E \\ \|X_{-E}^T(Y - X_E\hat{\beta}_E)\|_{\infty} & \leq \lambda \end{aligned} \]
Denote the set of \((X,Y)\) satisfying the above conditions as \(\bar{A}_{E,s_E}\). Since \(X\) is fixed we can (for now) consider this a function of \(Y\) only, i.e. \(1_{\bar{A}_{E,s_E}}(y) = 1_{\bar{A}_{E,s_E}}(X,y)\).
For reference later, setting \[ \ell(\beta) = \frac{1}{2} \|Y-X\beta\|^2_2 \] these read \[ \begin{aligned} -\nabla \ell(\hat{\beta})[E] &= \lambda s_E \\ \|\nabla \ell(\hat{\beta})[-E]\|_{\infty} & \leq \lambda \end{aligned} \]
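As a sanity check, the KKT conditions above can be verified numerically. The sketch below (assuming numpy; `lasso_ista` is a hypothetical helper solving the SSE-scaled problem by proximal gradient, not part of glmnet or selectinf) solves a small LASSO and checks both blocks of the conditions.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0)

def lasso_ista(X, Y, lam, n_iter=20000):
    # minimize 0.5 * ||Y - X b||_2^2 + lam * ||b||_1 by proximal gradient
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = soft_threshold(b - X.T @ (X @ b - Y) / L, lam / L)
    return b

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
lam = 1.5 * np.sqrt(n)

beta_hat = lasso_ista(X, Y, lam)
u_hat = X.T @ (Y - X @ beta_hat)         # = -nabla ell(beta_hat)
active = beta_hat != 0
# active block: u_hat_j = lam * sign(beta_hat_j); inactive block: |u_hat_j| <= lam
assert np.allclose(u_hat[active], lam * np.sign(beta_hat[active]), atol=1e-3)
assert np.all(np.abs(u_hat[~active]) <= lam + 1e-3)
```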
The event that the selected set is (2,20) can be written with \(E=\{2,20\}\) as \[
\bigcup_{s_E \in \{-1,1\}^E}
\bar{A}_{E,s_E}.
\]
## $selected
## [1] 2 20
##
## $signs
## [1] -1 -1
Signs being \((-1,-1)\): negative effects for features 2 and 20! The event is just \[
\bar{A}_{\{2,20\},(-1,1)}.
\]
Fixed \(X\) and known \(\sigma^2\) (wlog we set \(\sigma^2=1\)) \[ Y | X \sim N(\mu(X), I) \]
Or, \[ {\cal M} = \left\{N(\mu, I): \mu \in \mathbb{R}^n\right\} \]
In PoSI style notation, the effect for feature 2 that our earlier confidence intervals tried to cover is \[
e_2^T(X_{\{2,20\}})^{\dagger}\mu = \eta_{2|\{2,20\}}^T\mu = \eta^T\mu
\] where \(\eta\) is a function of \(E\) (and \(U\)’s train of thought…)
With selected set \(\{2,20\}\) and signs \((-1,-1)\) we set \[
{\cal M}^* = {\cal M}^*_{\{2,20\},(-1,-1)} = \left\{F^*: \frac{dF^*}{dF}(y) \propto 1_{\bar{A}_{\{2,20\},(-1,-1)}}(X,y), F \in {\cal M} \right\}.
\]
Our model has \(n\) parameters!
In order to construct a reference distribution of \(Y_0\)’s under \(H_0\) we need to specify \((n-1)\) free parameters in \(\mu\).
We already saw that such plugins are doomed to fail – Leeb and Pötscher (essentially) proved as much…
There are \((n-1)\) free parameters in \(\mu\) once we’ve set \(\eta^T\mu=\delta\).
We note that, once again, \({\cal M}^*\) is an exponential family with reference measure \(N(0,I)\) truncated to \(\bar{A}_{\{2,20\},(-1,-1)}\), sufficient statistic \(Y\) and natural parameter \(\mu\).
That is, \(F_{\mu}^*\) has density proportional to \[ y \mapsto e^{y^T\mu} \phi_{0, I}(y) \cdot 1_{\bar{A}_{\{2,20\},(-1,-1)}}(y) \]
For any fixed \(\eta\), we can write \[ y^T\mu = y^TP_{\eta}\mu + y^TP_{\eta}^{\perp}\mu \] with \[ P_{\eta} = \frac{1}{\|\eta\|^2_2} \eta \eta^T, \qquad P_{\eta}^{\perp} = I - P_{\eta} \]
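The projection identities behind this decomposition are easy to check numerically (a minimal numpy sketch with an arbitrary \(\eta\), not tied to the running example):

```python
import numpy as np

rng = np.random.default_rng(2)
eta = rng.standard_normal(5)
P = np.outer(eta, eta) / (eta @ eta)       # P_eta
Pperp = np.eye(5) - P                      # P_eta^perp

assert np.allclose(P @ P, P)               # P_eta is idempotent
assert np.allclose(P @ Pperp, 0)           # the two pieces are orthogonal

y = rng.standard_normal(5)
# eta^T y is recovered entirely from P_eta y; P_eta^perp y carries none of it
assert np.allclose(eta @ (Pperp @ y), 0)
assert np.allclose(eta @ (P @ y), eta @ y)
```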
We see that conditioning on \(P_{\eta}^{\perp}y\) results in a model that depends only on \(\eta^T\mu\). Let \(y^{\perp}_{\eta}\) denote the observed value of \(P_{\eta}^{\perp}Y\). Our final model for inference is \[ \bar{\cal M}^* = \bar{\cal M}^*_{\{2,20\},(-1,-1), y^{\perp}_{\eta}}. \]
In this model, only variation is along \(\eta\), in particular it is supported on the affine line \[ L(y^{\perp}_{\eta}) = \left\{z \in \mathbb{R}^n: P_{\eta}^{\perp}z = y^{\perp}_{\eta}\right\} \]
The set \(\bar{A}_{\{2,20\},(-1,-1)}\) can be expressed by a set of affine inequalities \[ \left\{y:A_{\{2,20\},(-1,-1)}y \leq b_{\{2,20\},(-1,-1)} \right\}. \] Hence, it is a convex set.
By convexity \(L(P_{\eta}^{\perp}y) \cap \bar{A}_{\{2,20\},(-1,-1)}\) is convex: (equivalent to) either a bounded interval, a half-open interval or all of \(\mathbb{R}\).
The first active block of the KKT conditions yields \[ \hat{\beta}_E = \hat{\beta}(X, Y)[E] = (X_E^TX_E)^{-1}(X_E^TY - \lambda s_E) \] with \[ \bar{\beta}_E(X, Y) = (X_E^TX_E)^{-1}(X_E^TY) \]
The constraint \(\text{sign}(\hat{\beta}[E])=s_E\) yields \[ \text{diag}(s_E) \left(\bar{\beta}_E - (X_E^TX_E)^{-1}\lambda s_E \right) \geq 0 \]
Plugging in first display, the inactive block of the KKT conditions yields \[
\begin{aligned}
\lambda &\geq \|X_{-E}^T(Y - X_E\bar{\beta}_E) + \lambda X_{-E}^TX_E(X_E^TX_E)^{-1}s_E\|_{\infty} \\
&= \|X_{-E}^T(I - P_E)Y + \lambda X_{-E}^TX_E(X_E^TX_E)^{-1}s_E\|_{\infty}
\end{aligned}
\] with \(P_E\) the projection onto \(\text{col}(X_E)\). Note \((I-P_E)Y\) is the residual after fitting the model lm(Y ~ X[,E] - 1).
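Collecting the sign constraints and the inactive block gives the affine representation \(\{y : Ay \le b\}\) of the selection event. A sketch of the bookkeeping (assuming numpy; `selection_constraints` is a hypothetical helper, with `E` a list of active indices and `s_E` the sign vector):

```python
import numpy as np

def selection_constraints(X, E, s_E, lam):
    # affine representation {y : A y <= b} of (E_hat, s_hat) = (E, s_E)
    # for the SSE-scaled LASSO with fixed X
    XE = X[:, E]
    Xm = np.delete(X, E, axis=1)
    XEinv = np.linalg.inv(XE.T @ XE)
    P_E = XE @ XEinv @ XE.T
    irrep = Xm.T @ XE @ XEinv @ s_E        # X_{-E}^T X_E (X_E^T X_E)^{-1} s_E
    # sign block: diag(s_E)(beta_bar_E - lam (X_E^T X_E)^{-1} s_E) >= 0
    A_active = -np.diag(s_E) @ XEinv @ XE.T
    b_active = -lam * np.diag(s_E) @ XEinv @ s_E
    # inactive block: |X_{-E}^T (I - P_E) y + lam * irrep| <= lam, split one-sided
    R = Xm.T @ (np.eye(X.shape[0]) - P_E)
    A = np.vstack([A_active, R, -R])
    b = np.concatenate([b_active, lam * (1 - irrep), lam * (1 + irrep)])
    return A, b
```

With \(|E|\) active variables this yields \(|E| + 2(p - |E|)\) inequalities.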
Fix any \(A \in \mathbb{R}^{k \times n}, b \in \mathbb{R}^k\) and \(\eta \in \mathbb{R}^n\).
The set \[ \left\{z: Az \leq b \right\} \] can be written as \[ \left\{z: {\cal V}^L(P_{\eta}^\perp z) \leq \eta^Tz \leq {\cal V}^U(P_{\eta}^{\perp}z), {\cal V}^0(P_{\eta}^{\perp}z) \geq 0 \right\} \]
That is, the event can be decomposed relative to \(\eta\) to be written in terms of \(\eta^Tz\) and \(P_{\eta}z\).
Suppose \(y \sim N(\mu, I)\). With \(A, b, \eta\) as above \[ {\cal L}\left(\eta^TY | AY \leq b, P_{\eta}^{\perp}Y = P_{\eta}^{\perp}y \right) = \text{TN}_{\eta^T\mu, \|\eta\|^2_2}([{\cal V}^L(P_{\eta}^\perp y), {\cal V}^U(P_{\eta}^{\perp}y)]) \]
That is, under \(H_0:\eta^T\mu=\delta\) this distribution has density \[ t \propto \phi_{\delta, \|\eta\|^2_2}(t) 1_{[l,u]}(t) \] with \(l={\cal V}^L(P_{\eta}^{\perp}y), u={\cal V}^U(P_{\eta}^{\perp}y)\).
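The truncated Gaussian distribution function is simple to evaluate (a scipy sketch; for extreme truncations one would work on the log scale, which we ignore here):

```python
import numpy as np
from scipy.stats import norm

def truncnorm_cdf(t, mean, sd, l, u):
    # CDF at t of N(mean, sd^2) truncated to [l, u]
    Fl = norm.cdf((l - mean) / sd)
    Fu = norm.cdf((u - mean) / sd)
    return (norm.cdf((t - mean) / sd) - Fl) / (Fu - Fl)

# under H0: eta^T mu = delta, the pivot
# truncnorm_cdf(eta^T y, delta, ||eta||_2, l, u) is Uniform(0, 1)
```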
Each sign vector \(s_{E}\) yields a (possibly empty) interval \([l(s_E),u(s_E)]\).
With \(\eta=\eta_{2|\{2,20\}}\) we have \[ {\cal L}\left(\eta^TY | \hat{E}=\{2,20\}, P_{\eta}^{\perp}Y = P_{\eta}^{\perp}y \right) = \text{TN}_{\eta^T\mu, \|\eta\|^2_2}\left(\cup_{s_E \in \{-1,1\}^E} [l(s_E), u(s_E)]\right) \]
Suppose \(F = F_{\mu} \in {\cal M}\)
Set \(\hat{\theta}_{\cal T}=\eta^TY\) \[ \Gamma = \text{Cov}_F(Y, \eta^TY) \text{Cov}_F(\eta^TY)^{-1} \]
Note \[ \text{Cov}_F(Y - \Gamma \hat{\theta}, \hat{\theta}) = 0. \]
Further \[ \begin{aligned} AY \leq b &\iff A(Y - \Gamma \hat{\theta} + \Gamma \hat{\theta}) \leq b \\ & \iff A \Gamma \hat{\theta} \leq b - A(Y - \Gamma \hat{\theta}) \\ \end{aligned} \]
Finally, \[ \begin{aligned} {\cal V}^L &= \max_{j: (A\Gamma)_j < 0} \frac{(b - A(Y - \Gamma \hat{\theta}))_j}{(A \Gamma)_j} \\ {\cal V}^U &= \min_{j: (A\Gamma)_j > 0} \frac{(b - A(Y - \Gamma \hat{\theta}))_j}{(A \Gamma)_j} \\ {\cal V}^0 &= \min_{j: (A\Gamma)_j = 0} (b - A(Y - \Gamma \hat{\theta}))_j \\ \end{aligned} \]
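In code, with \(\Gamma = \eta/\|\eta\|_2^2\) (since \(\text{Cov}(Y)=I\) here), the limits read as follows (a minimal numpy sketch; `polyhedral_interval` is a hypothetical helper, and the \({\cal V}^0\) check is omitted):

```python
import numpy as np

def polyhedral_interval(A, b, eta, y):
    # truncation limits for eta^T y given {A y <= b}, conditioning on P_eta^perp y
    Gamma = eta / (eta @ eta)     # Cov(Y, eta^T Y) Var(eta^T Y)^{-1} when Cov(Y) = I
    theta = eta @ y
    N = y - Gamma * theta         # the component of y orthogonal to eta
    Agam = A @ Gamma
    slack = b - A @ N
    neg, pos = Agam < 0, Agam > 0
    VL = np.max(slack[neg] / Agam[neg]) if neg.any() else -np.inf
    VU = np.min(slack[pos] / Agam[pos]) if pos.any() else np.inf
    return VL, VU
```

If \(Ay \le b\) holds at the observed \(y\), then \(\eta^Ty\) always lies in \([{\cal V}^L, {\cal V}^U]\).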
from selectinf.algorithms.lasso import lasso
n, p = X.shape
L = lasso.gaussian(X, Y, 1.5 * np.sqrt(X.shape[0]), sigma=1.)
L.fit()
## array([ 0.        , -0.01205666, -0.        , -0.        , -0.        ,
##         0.        , -0.        , -0.        ,  0.        , -0.        ,
##         0.        ,  0.        ,  0.        , -0.        ,  0.        ,
##         0.        ,  0.        ,  0.        , -0.        , -0.11651073])
## variable pval lasso onestep ... upper_confidence lower_trunc upper_trunc sd
## 1 1 0.77475 -0.012057 -0.187903 ... 2.432066 -inf -0.175847 0.102646
## 19 19 0.06657 -0.116511 -0.283290 ... 0.030983 -inf -0.182397 0.099658
##
## [2 rows x 9 columns]
from scipy.stats import norm as normal_dbn
Z = S['onestep'] / S['sd']
UZ = S['upper_trunc'] / S['sd']
LZ = S['lower_trunc'] / S['sd']
# here both observed signs are -1
pval = ((normal_dbn.cdf(Z) - normal_dbn.cdf(LZ)) /
(normal_dbn.cdf(UZ) - normal_dbn.cdf(LZ)))
pval
## array([0.77475002, 0.06656959])
twosided = np.asarray(L.summary(alternative='twosided')['pval'])
twosided, 2 * np.minimum(pval, 1 - pval)
## (array([0.45049997, 0.13313918]), array([0.45049997, 0.13313918]))
selectiveInference::fixedLassoInf(X,
Y,
beta_hat,
1.5 * sqrt(n),
intercept=FALSE,
alpha=0.1,
                                 sigma=1)
##
## Call:
## selectiveInference::fixedLassoInf(x = X, y = Y, beta = beta_hat,
## lambda = 1.5 * sqrt(n), intercept = FALSE, sigma = 1, alpha = 0.1)
##
## Standard deviation of noise (specified or estimated) sigma = 1.000
##
## Testing results at lambda = 15.000, with alpha = 0.100
##
## Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
## 2 -0.188 -1.831 0.775 -0.250 2.433 0.049 0.05
## 20 -0.283 -2.843 0.067 -0.444 0.032 0.049 0.05
##
## Note: coefficients shown are partial regression coefficients
Note: the lambda value in R presumes the loss uses \(SSE\) rather than \(MSE\).
The value 1.5 / sqrt(n) passed to glmnet is multiplied by \(n\), yielding 1.5 * sqrt(n).
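This rescaling can be checked directly (a trivial numpy sketch):

```python
import numpy as np

n = 100
s = 1.5 / np.sqrt(n)    # value passed to glmnet (MSE-scaled loss)
lam = n * s             # SSE-scaled lambda used by selectiveInference
assert np.isclose(lam, 1.5 * np.sqrt(n))  # = 15 when n = 100
```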
We have seen in the file drawer problem that truncation to half-open interval can yield undesirable properties: notably very long intervals…
Is this avoidable here?
Intervals of Zhong and Prentice (2008) with two-sided truncation do not suffer from this problem.
Marginalizing over signs may give us some mass in intervals outside the one for the observed sign vector.
But, marginalizing over signs does not always discover mass that can “save” us.
Perhaps the problem is the saturated model we used.
Suppose \(n \gg p\) and we are willing to assume \[ Y | X \sim N(X\beta, \sigma^2 I) \]
Although \(\sigma^2\) is unknown, we have \(n - p \gg p\) degrees of freedom to estimate it, so we will assume wlog that \(\sigma^2=1\) (and cross our fingers for now).
Maybe this conditional approach will lead to “better” intervals in this full model?
Model: \[ {\cal M}_f = \left\{N(X\beta, I): \beta \in \mathbb{R}^p \right\} \]
Our targeted parameter is \[ \theta_{\cal T}(\beta) = e_2^T X_{\{2,20\}}^{\dagger}X \beta. \]
The selection event remains the same (we have just changed the model for \(Y|X\)) \[ {\cal M}^*_f = {\cal M}^*_{f,\{2,20\},(-1,-1)} = \left\{F^*: \frac{dF^*}{dF}(y) \propto 1_{\bar{A}_{\{2,20\},(-1,-1)}}(y), F \in {\cal M}_f\right\}. \]
Our model \({\cal M}^*_f\) is again an exponential family with sufficient statistic \(X^TY\) and natural parameter \(\beta\).
We can write \[ \begin{aligned} \beta^TX^TY &= \beta^TX^TP_{\eta}Y + \beta^TX^T(P_E-P_{\eta})Y + \beta^TX^T(P_F-P_E)Y \\ \end{aligned} \] with \(P_F\) the projection onto \(\text{col}(X)\), so that \(X^T(I-P_F)Y=0\).
To eliminate nuisance parameters we see we will have to condition on \((P_F-P_E)Y\) as well as \((P_E-P_{\eta})Y\), i.e. \((P_F-P_{\eta})Y\).
The inactive block event can be written \[ \begin{aligned} \|X_{-E}^T((I-P_F)Y + (P_F-P_E)Y) + \lambda X_{-E}^TX_E(X_E^TX_E)^{-1}s_E\|_{\infty} \leq \lambda \end{aligned} \]
For \(G_{\beta}^* \in {\cal M}_f^*\), the law of \(P_EY\) given \((P_F-P_E)Y\) is identical to the law of \(P_EY\) for \(F_{X\beta}^* \in {\cal M}^*\) given \((P_F-P_E)Y\). Further, the inactive event is conditionally independent of \(P_EY\) given \((P_F-P_E)Y\).
We conclude that conditional inference starting with \({\cal M}_f\) is identical to that starting with \({\cal M}\).
Even though we are conditioning on less (i.e. \((P_F-P_{\eta})Y\) for \({\cal M}_f\) instead of \((I-P_{\eta})Y\) for \({\cal M}\)) we will still have long intervals.
Verify the equality in law above, as well as the conclusion.
Hmm… if we think our LASSO is doing a good job in finding all the true variables, why should we worry about \((P_F-P_E)Y\)?
Maybe we should use \[ {\cal M}_E = \left\{N(X_E\beta_E, I): \beta_E \in \mathbb{R}^E \right\}? \]
This is similar to our discussion in drop the losers of considering models for trials when remdesivir (denoted by \(R\)) wins as \[ {\cal M}_{R} = \left\{N(\mu, I): \mu_{R} \in \mathbb{R}, \mu_{-R} = \delta \cdot 1_{-R}, \delta > 0 \right\} \] and corresponding conditional models \({\cal M}_R^*\)…
This is the idea of a selected model presented in Fithian, Sun, Taylor (2014) which we’ll discuss soon.
Even under \({\cal M}_E\) conditional inference for the LASSO will be identical to \({\cal M}\) or \({\cal M}_f\)!
Verify that using \({\cal M}_E\) and corresponding conditional model \({\cal M}_E^*\) yields the same reference distribution as starting with \({\cal M}\) or \({\cal M}_f\).
In our LASSO we’ve so far conditioned either on \(\hat{E}=E\) or \((\hat{E},\hat{s}_E) = (E,s_E)\) and based our targets \(\theta_{\cal T}\) on PoSI targets for the features \(E\) (albeit under different models above)
In practice \(\theta_{\cal T}\) can be any parameter \(U\) dreams up knowing that the LASSO has selected \(\hat{E}\) (and possibly the signs \(\hat{s}_E\)).
This mimics the classical scientific method in the sense that after running an experiment, \(U\) is free to set the target parameter of the following experiment based on what \(U\) has observed.
In other words, we are able to provide full DFQL inference in this conditional setting…
Suppose we start with \[ {\cal M}_{NP,n} = {\cal M}_{NP} = \left\{\otimes^n F: (X, Y) \sim F\right\} \]
The event \(\bar{A}_{E, s_E}\) can be expressed in terms of \[D_n= \frac{1}{n}(X^TY, X^TX). \]
I.e. for all \((X,Y) \in \mathbb{R}^{n \times (p + 1)}\) \[ \bar{A}_{E, s_E}(X, Y) = \tilde{A}_{E, s_E}(n^{-1}X^TY,n^{-1}X^TX) = \tilde{A}_{(E,s_E)}(D_n) \]
Under appropriate moment assumptions \[ n^{1/2} \begin{pmatrix} \hat{\theta}_n - \theta_{\cal T}(F)\\ D_n - E_F[D_n] \end{pmatrix} \overset{n \to \infty}{\to} N(0, \Sigma(F)) \]
Corresponding Gaussian model \[ \tilde{\cal M}_G = \left\{N \left(\begin{pmatrix} \theta_{\cal T} \\ \Theta_{\cal N} \end{pmatrix}, \begin{pmatrix} \Sigma_{{\cal T}, {\cal T}}(F) & \Sigma_{{\cal T}, D}(F) \\ \Sigma_{D,{\cal T}}(F) & \Sigma_{D, D}(F) \end{pmatrix}\right) \right\} \]
Conditional model \(\tilde{\cal M}_G^*\) restricts \(\tilde{\cal M}_G\) to \(\tilde{A}_{\{2,20\},(-1,-1)}\).
Still need to remove dependence on nuisance parameter.
Under \(\tilde{\cal M}_G\) assuming \(\text{Var}(\hat{\theta}) > 0\), the law of \((\hat{\theta}_{\cal T}, D)\) under \(\tilde{\cal M}_G\) is equivalent to the law \[ (\hat{\theta}_{\cal T}, D - \Gamma \hat{\theta}_{\cal T}) = (\hat{\theta}_{\cal T}, {\cal N}) \] where \[ \Gamma = \Gamma(\Sigma) = \Sigma_{D,{\cal T}} \Sigma_{{\cal T}, {\cal T}}^{-1} \] via the reconstruction \[ (\hat{\theta}_{\cal T}, {\cal N} + \Gamma \hat{\theta}_{\cal T}) \]
The pair \((\hat{\theta}_{\cal T}, {\cal N})\) are independent under \(\tilde{\cal M}_G\).
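This independence is easy to check by simulation (a numpy sketch with an arbitrary \(\Sigma\), where the first coordinate plays the role of \(\hat{\theta}_{\cal T}\) and the rest play \(D\)):

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
Z = rng.multivariate_normal(np.zeros(3), Sigma, size=200000)
theta, D = Z[:, 0], Z[:, 1:]

Gamma = Sigma[1:, 0] / Sigma[0, 0]     # Cov(D, theta) Var(theta)^{-1}
N = D - np.outer(theta, Gamma)         # the "nuisance" part of D

# empirical Cov(N, theta) should vanish up to Monte Carlo error
emp_cov = (N * theta[:, None]).mean(0) - N.mean(0) * theta.mean()
assert np.all(np.abs(emp_cov) < 0.05)
```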
\(\implies\) Under \(\tilde{\cal M}_G\) (and \(\tilde{\cal M}_G^*\)) conditional on \({\cal N}\) the law of \(( \hat{\theta}_{\cal T}, D)\) depends only on \(\theta_{\cal T}\).
Conditioning on \({\cal N}\) removes dependence on \(\Theta_{\cal N}\).
Final reference density under \(H_0: \theta_{\cal T}=\delta\) is proportional to \[ t \mapsto \phi_{(\delta, \text{Var}(\hat{\theta}_{\cal T}))}(t) \cdot 1_{\tilde{A}_{(E, s_E)}}\left({\cal N}_{obs} + \Gamma \cdot t\right) \] where \({\cal N}_{obs}\) is the observed value of \(D - \Gamma \hat{\theta}_{\cal T}\) and \(\Gamma\) must be estimated (e.g. pairs bootstrap).
Note: only \(\text{Var}(\hat{\theta}_{\cal T})\) need be inverted to form \(\Gamma\).
The polyhedral lemma asserts that if \(\tilde{A}_{(E,s_E)}\) is polyhedral in \(D\) \[ \left\{t: 1_{\tilde{A}_{(E, s_E)}}\left({\cal N}_{obs} + \Gamma \cdot t\right)=1 \right\} \] is an interval \([{\cal V}^L({\cal N}_{obs}, \Gamma), {\cal V}^U({\cal N}_{obs}, \Gamma)]\).
Nothing here depends on the specifics of the LASSO!
For event \(\{D:AD \leq b\}\).
Set \[ \Gamma = \Gamma(F) = \text{Cov}_F(D, \hat{\theta}_{\cal T}) \text{Cov}_F(\hat{\theta}_{\cal T})^{-1} \]
Compute \({\cal N} = D - \Gamma(F) \hat{\theta}_{\cal T}\) \[ \begin{aligned} {\cal V}^L({\cal N}, \Gamma) &= \max_{j: (A\Gamma)_j < 0} \frac{(b - A{\cal N})_j}{(A \Gamma)_j} \\ {\cal V}^U({\cal N}, \Gamma) &= \min_{j: (A\Gamma)_j > 0} \frac{(b - A{\cal N})_j}{(A \Gamma)_j} \\ \end{aligned} \]
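The same computation in code, now in terms of the statistic \(D\) and an estimated \(\Gamma\) (a numpy sketch; in practice \(\Gamma\) would come from, e.g., a pairs bootstrap estimate of the covariance):

```python
import numpy as np

def truncation_limits(A, b, D, theta, Gamma):
    # V^L, V^U for theta given the event {A D <= b},
    # conditioning on N = D - Gamma * theta
    N = D - Gamma * theta
    Agam = A @ Gamma
    slack = b - A @ N
    neg, pos = Agam < 0, Agam > 0
    VL = np.max(slack[neg] / Agam[neg]) if neg.any() else -np.inf
    VU = np.min(slack[pos] / Agam[pos]) if pos.any() else np.inf
    return VL, VU
```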
To apply the polyhedral lemma, \(\tilde{A}_{(E,s_E)}\) should be polyhedral… is it?
While selection is expressible in terms of \(D_n=n^{-1}(X^TY, X^TX)\) a more explicit form is important to actually have a workable algorithm.
Define the following functions of \(D_n\) (liberally using 0-padding where necessary, and writing \(\lambda = C n^{1/2}\)) \[ \begin{aligned} \bar{\beta}_E &= (n^{-1}X_E^TX_E )^{-1}(n^{-1}X_E^TY) \\ G^I_E &= -n^{-1/2} X_{-E}^T(Y - X_E\bar{\beta}_E) \\ &= n^{-1/2}\nabla \ell(\bar{\beta}_E)[-E]\\ G^I_{(E,s_E)} &= G^I_E - C \cdot E_F[X_{-E} X_E^T] E_F[X_EX_E^T]^{-1}s_E \\ G^A_{(E,s_E)} &= n^{1/2}\bar{\beta}_E - C \cdot E_F[X_EX_E^T]^{-1}s_E \\ \end{aligned} \]
On the event \((\hat{E}, \hat{s}_E) = (E, s_E)\) we see \[ \begin{aligned} \hat{\beta}_E &= \bar{\beta}_E - C \cdot n^{-1/2} (E_F[X_EX_E^T])^{-1} s_E + {\cal R}^{A}_n \\ &= n^{-1/2} G^A_{(E,s_E)} + {\cal R}^{A'}_n \\ \nabla \ell(\hat{\beta})[-E] & = n^{1/2} G^I_{(E,s_E)} + C \cdot {\cal R}^I_{n} \end{aligned} \] where \[ {\cal R}^I_n = O(1), \qquad {\cal R}^{A / A'}_n = O_F(n^{-1}) \] while \(n^{1/2} G^I_{(E,s_E)} = n^{1/2} E_F[G^I_{(E,s_E)}] + O(n^{1/2})\) and \(\bar{\beta}_E = \beta_E^*(F) + O_F(n^{-1/2})\).
We say a sequence of models \({\cal M}_n\) satisfies no rare selection for our observed \((\hat{E}, \hat{s}_E)\) if \[ \liminf_{n \to \infty} \inf_{F \in {\cal M}_n}P_{F}((\hat{E}(X,Y), \hat{s}_E(X,Y)) = (E, s_E)) > 0. \]
One of the conditions in the event is \(\|G^I_{(E,s_E)}\|_{\infty} \leq C\) so for no rare selection it will be necessary that \[ \|E_F(G^I_{(E,s_E)})\|_{\infty} = O(1) \]
Similarly, as \[ \text{diag}(s_E) \left(\hat{\beta}_E \right) \geq 0 \] for no rare selection it will be necessary that \[ \liminf_{n \to \infty} \inf_{F \in {\cal M}_n} \min(\text{diag}(s_E)E_F[G^A_{(E,s_E)}]) > -\infty. \]
Note: under the no rare selection assumption, then \(O_{F}(n^{-1/2}) \iff O_{F^*}(n^{-1/2})\) – i.e. order of random variables is comparable under \(F\) or \(F^*\). We’ve made liberal use of that above.
Finally, our asymptotic model can be reduced to the laws \[ \begin{pmatrix} G^A_{(E,s_E)} \\ G^I_{(E, s_E)} \end{pmatrix} \] with selection event \[ \{\text{diag}(s_E) G^A_{(E,s_E)} \geq 0, \|G^I_{(E,s_E)}\|_{\infty} \leq C\}. \]
Stacking \(\hat{\theta}_{\cal T}\) on top yields a multivariate Gaussian family \[ \begin{pmatrix} \hat{\theta}_{\cal T} \\ G^A_{(E,s_E)} \\ G^I_{(E,s_E)} \end{pmatrix} \sim N\left(\begin{pmatrix} \theta_{\cal T}(F) \\ ? \\ ? \end{pmatrix}, \Sigma \right) \] where \(\Sigma\) can be estimated consistently (under no asymptotically rare selection).
This is a multivariate normal satisfying affine inequalities: we can use polyhedral lemma.
While \(\hat{\theta}_{\cal T}\), \(n^{1/2}\bar{\beta}_E\) and \(n^{-1/2} \nabla \ell(\hat{\beta})[-E]\) are observable, \(G^A_{(E,s_E)}\) and \(G^I_{(E,s_E)}\) are not.
However, on selection event, recall \[ \begin{aligned} G^A_{(E,s_E)} = n^{1/2} \hat{\beta}_E + O_{F^*}(n^{-1/2}) \\ G^I_{(E,s_E)} = n^{-1/2} \nabla \ell(\hat{\beta})[-E] + O_{F^*}(n^{-1/2}) \end{aligned} \] so the pair is essentially observable.
Even if they were observable, it is possible (due to the remainder \({\cal R}^I_n\)) that for finite \(n\) \[ \min(\text{diag}(s_E)(G^A_{(E,s_E)})) < 0 \] or \[ \|G^I_{(E,s_E)}\|_{\infty} > C. \]
Running the polyhedral lemma would then lead to nonsensical inference – \(\hat{\theta}_{\cal T}\) would be outside its truncation interval whatever estimate of \(\Sigma\) we have!
However, if we run the polyhedral lemma with observed value \((\hat{\theta}_{\cal T}, n^{1/2}\hat{\beta}_E, n^{-1/2} \nabla \ell(\hat{\beta})[-E])\) then \(\hat{\theta}_{\cal T}\) will be in its truncation interval whatever estimate of \(\Sigma\) we have.
By run the polyhedral lemma we mean compute \({\cal V}^{L/U}(D - \Gamma(\hat{\Sigma}) \hat{\theta}_{\cal T}, \Gamma(\hat{\Sigma}))\). Prove the claim above, i.e. with the observed value of \(D\) as above, show that \(\hat{\theta}_{\cal T}\) is always in the interval \([{\cal V}^{L}, {\cal V}^{U}]\). Supposing that after rescaling \(\text{Var}_F(\hat{\theta}_{\cal T}) = O(1)\), how much do you expect the endpoints \({\cal V}^{L/U}\) to change if you used the unobservable \(D=(\hat{\theta}_{\cal T}, G^A_{(E,s_E)}, G^I_{(E,s_E)})\) instead of \((\hat{\theta}_{\cal T}, n^{1/2}\hat{\beta}_E, n^{-1/2} \nabla \ell(\hat{\beta})[-E])\) under a no rare selection assumption? How large are these changes compared to the typical sizes of \({\cal V}^{L/U}\)? (Remember under no rare selection the “sizes” of random variables are the same under \(F\) as \(F^*\).)
Suppose now that \(Y \in \{0,1\}\) is binary and we want to select variables using the logistic LASSO with objective \[ \begin{aligned} \beta &\mapsto \sum_{i=1}^n \log(1 + \exp(X_i^T\beta)) - \beta^TX^TY + \lambda \|\beta\|_1 \\ &= \ell^L(\beta; X, Y) + \lambda \|\beta\|_1 \end{aligned} \]
The KKT conditions are not as simple as for the OLS LASSO.
In a selected model setting (as mentioned above in Fithian, Sun, Taylor (2014)), when we take the selected model to be parametrically correct, valid conditional inference is considered in Taylor and Tibshirani (2017).
We’ll first work through the pairs model and come back to this selected model.
The pairs model notes can be found in Markovic and Taylor (2016) for randomized selection.
The restricted problem for set \(E\) sets \(\beta_{-E}=0\). Call this loss \(\ell^L_E(\beta_E) = \ell^L(\beta_E, 0)\).
For a set of variables \(E\) and initial estimate \(\beta_{E,0}\), the one-step estimator for the restricted problem is \[ \bar{\beta}_E(\beta_{E,0}) = \beta_{E,0} - \nabla^2\ell^L_E(\beta_{E,0})^{-1}\nabla \ell^L_E(\beta_{E,0}) \]
On \((\hat{E}, \hat{s}_E)=(E,s_E)\), with initial estimator \(\hat{\beta}_E\), the KKT conditions yield \(\nabla \ell^L_E(\hat{\beta}_E)= - \lambda s_E\) and \[ \bar{\beta}_E = \hat{\beta}_E + \lambda \nabla^2 \ell^L_E(\hat{\beta}_E)^{-1}s_E \]
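A sketch of the one-step map for the restricted logistic loss (assuming numpy; `one_step` is a hypothetical helper, not the selectinf API):

```python
import numpy as np

def one_step(X_E, Y, beta0):
    # one Newton step for the restricted logistic loss ell^L_E, from beta0
    pi = 1 / (1 + np.exp(-X_E @ beta0))
    grad = X_E.T @ (pi - Y)                        # gradient of ell^L_E at beta0
    H = X_E.T @ ((pi * (1 - pi))[:, None] * X_E)   # Hessian of ell^L_E at beta0
    return beta0 - np.linalg.solve(H, grad)
```

Starting from the LASSO solution, `one_step(X_E, Y, beta_hat_E)` plus the \(\lambda \nabla^2\ell^L_E(\hat{\beta}_E)^{-1}s_E\) correction from the KKT conditions gives \(\bar{\beta}_E\).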
Keeping \(p\) fixed, under standard assumptions the one-step estimator will be asymptotically equivalent to the MLE (assuming \(\lambda = \lambda_n = n^{1/2} C\)).
We see again that our selection event can be expressed in terms of \[ \begin{pmatrix} G^A_{(E,s_E)} \\ G^I_{(E,s_E)} \end{pmatrix} = \begin{pmatrix} n^{1/2}\hat{\beta}_E \\ n^{-1/2} \nabla \ell^L(\hat{\beta})[-E] \end{pmatrix} + O_{F^*}(n^{-1/2}). \]
To “run” the polyhedral lemma code we need only estimate \[ \Sigma = \text{Cov}_F \begin{pmatrix} \hat{\theta}_{\cal T} \\ G^A_{(E,s_E)} \\ G^I_{(E,s_E)} \end{pmatrix} \] and evaluate \({\cal V}^{L/U}\) for observed value \(D = (\hat{\theta}_{\cal T}, n^{1/2}\hat{\beta}_E, n^{-1/2}\nabla \ell(\hat{\beta})[-E])\).
Suppose \(U\) has decided, after seeing the logistic LASSO has selected \((E, s_E)\) that the proper target of inference is the \(k\)-th coordinate of the probit parameter of the true data generating distribution \(F\) in the pairs model with features \(E'\). Describe how to construct an (asymptotically) valid confidence interval for this parameter that “runs”, i.e. does not give nonsensical results such as \(\hat{\theta}_{\cal T}\) is outside its truncation interval.
Suppose upon observing \((\hat{E}, \hat{s}_E) = (E, s_E)\) we choose to work in the parametric model based on \(n\) IID draws from \[ {\cal M}_{P,E} = \left\{F: {\cal L}_F(Y|X) = \text{Bernoulli}(\text{logit}^{-1}(X_E^T\beta_E)), \beta_E \in \mathbb{R}^E \right\} \]
If \(\hat{\theta}_{\cal T}\) is a linear functional of \(\bar{\beta}_E\) then \((\hat{\theta}_{\cal T}, G^A_{(E,s_E)})\) is asymptotically independent of \(G^I_{(E,s_E)}\) even under \({\cal M}_{P,E}^*\).
Hence, our selection event can be expressed only in terms of \((\hat{\theta}_{\cal T}, G^A_{(E,s_E)})\). Further, in \({\cal M}_{P,E}\) (and \({\cal M}_{P,E}^*\) under no rare selection assumption) we can consistently estimate \(\text{Cov}(G^A_{(E,s_E)})\) by the usual GLM formula \[ n \cdot (X_E^T W(\hat{\beta}_E)X_E)^{-1} \] with \[ \begin{aligned} W(\beta_E) &= \text{diag}(\pi(\beta_E) (1 - \pi(\beta_E))) \\ \pi(\beta_E) &= \frac{e^{X_E^T\beta_E}}{1 + e^{X_E^T\beta_E}}. \end{aligned} \]
def simulate(n=500, p=50, C=1):
    X = np.random.standard_normal((n, p))
    Y = np.random.binomial(1, 0.5, size=(n,))
    W = np.ones(p) * C * np.sqrt(n)
    L = lasso.logistic(X, Y, W)
    beta_hat = L.fit()
    if (beta_hat != 0).sum() > 0:
        S = L.summary(compute_intervals=True,
                      alternative='twosided',
                      level=0.90)
        return S
simulate()
## variable pval lasso onestep ... upper_confidence lower_trunc upper_trunc sd
## 17 17 0.016169 0.143245 0.327934 ... 0.496885 0.184689 0.533234 0.091908
## 18 18 0.348696 0.007160 0.176163 ... 0.197229 0.169004 3.912629 0.087659
## 26 26 0.759404 -0.032987 -0.209377 ... 0.487633 -inf -0.176390 0.087252
##
## [3 rows x 9 columns]
P, C = [], []
for _ in range(100):
    S = simulate()
    if S is not None:
        P.extend(list(S['pval']))
        C.extend(list((S['lower_confidence'] < 0) & (S['upper_confidence'] > 0)))
## coverage: (target 90%): 0.9234234234234234
Verify the above claimed asymptotic independence of \((\hat{\theta}_{\cal T}, G^A_{(E,s_E)})\) and \(G^I_{(E,s_E)}\) under \({\cal M}_{P,E}\) and \({\cal M}_{P,E}^*\) under an appropriate no rare selection assumption for \((E,s_E)\).
For either the OLS or Logistic LASSO, suppose we want to marginalize over signs \(s_E\). Describe an algorithm using the polyhedral lemma for each \(s_E\) that will “run” (i.e. not give nonsensical answers due to \(\hat{\theta}_{\cal T}\) being outside its truncation set) and give (asymptotically) valid inference conditional only on the event \(\hat{E}=E\).
In practice, our LASSO penalty (e.g. with intercept) is really of the form \[ \beta \mapsto \lambda \sum_{j=1}^p w_j |\beta_j| \]
We ignored the intercept above, but this just corresponds to some \(w_j=0\).
Active set \(\hat{E} = U \cup \hat{A}\) with \(U\) (unpenalized) and \(\hat{A}\) (active).
The vector \(\lambda s_E\) is replaced throughout with the vector \(\lambda w_E s_E\).
Only penalized variables in \(\hat{A}\) require a constraint (i.e. might as well only condition on signs of penalized variables).
In the inactive block, a term \(w_{-E}\) appears on the RHS of the \(\ell_{\infty}\) constraint.
Mostly bookkeeping…
Xi = np.hstack([np.ones((n, 1)),
                X])
weights = 1.5 * np.sqrt(n) * np.ones(p+1)
weights[0] = 0
Li = lasso.gaussian(Xi, Y, weights, sigma=1.)
Li.fit()
## array([ 0.08047905,  0.        , -0.01307299, -0.        , -0.        ,
##        -0.        ,  0.        , -0.        , -0.        ,  0.00087677,
##        -0.        ,  0.        ,  0.        ,  0.        , -0.        ,
##         0.        ,  0.        ,  0.        ,  0.        , -0.        ,
##        -0.1046863 ])
## variable pval lasso onestep ... upper_confidence lower_trunc upper_trunc sd
## 0 0 0.488701 0.080479 0.061638 ... 0.218739 -0.013930 0.773783 0.101140
## 2 2 0.624802 -0.013073 -0.184972 ... 2.233520 -0.217428 -0.171899 0.102695
## 9 9 0.984540 0.000877 0.127013 ... -0.427391 0.126137 0.574571 0.098759
## 20 20 0.018953 -0.104686 -0.260735 ... -0.152579 -0.269230 -0.156049 0.101312
##
## [4 rows x 9 columns]
def simulate_intercept(n=500, p=50, C=1):
    X = np.random.standard_normal((n, p+1))
    X[:,0] = 1
    Y = np.random.binomial(1, 0.6, size=(n,))
    W = np.ones(p+1) * C * np.sqrt(n)
    W[0] = 0
    L = lasso.logistic(X, Y, W)
    beta_hat = L.fit()
    if (beta_hat != 0).sum() > 1:
        S = L.summary(compute_intervals=True,
                      alternative='twosided',
                      level=0.90)
        return S.iloc[1:]  # remove intercept row
simulate_intercept()
## variable pval lasso onestep ... upper_confidence lower_trunc upper_trunc sd
## 18 18 0.895433 -0.029592 -0.225419 ... 0.688152 -0.579943 -0.195827 0.094736
## 21 21 0.197391 -0.078194 -0.246453 ... 0.076704 -0.368645 -0.168259 0.089775
## 36 36 0.356272 0.007813 0.179142 ... 0.205369 0.171329 1.432253 0.091516
##
## [3 rows x 9 columns]
## coverage: (target 90%): 0.9282051282051282
Up to now, we’ve allowed target to be based on \((\hat{E},\hat{s}_E)\) for both OLS and logistic LASSO but this is not strictly necessary if:
We are willing to assume the full parametric model is correct e.g. for OLS and \(n \gg p\) \[ {\cal L}(Y|X) = N(X\beta^*, \sigma^2 I) \in {\cal M}_f \]
We restrict attention to targets \(\beta_j = \beta_{j|\{1,\dots,p\}}\).
Inference is based on conditioning on the outcome of queries \(Q_j(X,Y) = 1_{\hat{E}(X,Y)}(j)\).
Each feature \(j\) carries out inference in a different conditional model – there is no single selection event under which our intervals are conditionally correct.
This would similarly hold for Zhong and Prentice (2008) for simple thresholding of (absolute) marginal effects.
A little less in the spirit of DFQL: there are only ever \(p\) questions and \(U\) must decide a priori which effects to report if they are selected.
Parametric model must be correctly specified!
For a fixed \(1 \leq j \leq p\), we must describe \(\{j \in \hat{E}(X,Y)\}\) in model \({\cal M}_f\).
Our conditional model is \[ {\cal M}^*_{f,j} = \left\{F^*: \frac{dF^*}{dF}(y) \propto 1_{j \in \hat{E}(X,y)}, F \in {\cal M}_f \right\} \]
By inspection of KKT conditions similar to strong rules, variable \(j\) is not selected at parameter \(\lambda\) iff \[ |X_j^T(Y - X_{-j} \hat{\beta}_{-j})| < \lambda \] where \[ \hat{\beta}_{-j} = \text{argmin}_{\beta:\beta_j=0} \frac{1}{2} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1 \] solves the restricted problem fixing \(\beta_j=0\).
Note that \[ \hat{\beta}_{-j} = \text{argmin}_{\beta:\beta_j=0} \frac{1}{2} \beta_{-j}^T(X_{-j}^TX_{-j})\beta_{-j} - \beta_{-j}^TX_{-j}^TY + \lambda \|\beta\|_1 \] hence (considering \(X\) fixed) it is a function of \(X_{-j}^TY\), i.e. we can write \[ \hat{\beta}_{-j} = \hat{\beta}_{-j}((X^TY)[-j], X^TX). \]
In model \({\cal M}_f\) (and \({\cal M}_{f,j}^*\)) conditioning on \(X_{-j}^TY\) removes dependence of the distribution on \(\beta_{-j}\). In other words, for target \(\beta_j\) we have \({\cal N} = X_{-j}^TY\) in this model.
Setting \(D = (X_j^TY, X_{-j}^TY)\) and \(\hat{\theta}_{\cal T} = \bar{\beta}_j\) we see \[ \text{Cov}_F(\hat{\theta}_{\cal T}, D) = \begin{pmatrix} \sigma^2 \\ 0 \end{pmatrix} \qquad \forall F \in {\cal M}_f. \]
This implies that \[ {\cal N} = X^TY - \frac{\bar{\beta}_j}{(X^TX)^{-1}_{jj}} e_j \]
Suppose \(X\) is fixed. Verify that if \(D=X^TY\) and \(\theta_{\cal T}=\beta_j\) that \({\cal N} \equiv X_{-j}^TY\) by showing that our usual definition of \({\cal N}\) is in 1:1 correspondence with \(X_{-j}^TY\). More formally, in terms of conditioning, argue that the sigma algebra generated by \({\cal N}\) is the same as that generated by \(X_{-j}^TY\).
Finally, note that for all \(t\) \[ \hat{\beta}_{-j}(({\cal N} + \frac{t}{(X^TX)^{-1}_{jj}} e_j)[-j], X^TX) = \hat{\beta}_{-j}(X_{-j}^TY, X^TX) \]
The selection event intersected with \(\{z: X_{-j}^Tz = (X_{-j}^TY)_{obs}\}\) is the complement of \[ \left\{t: \left|\left(X_j^TY - \frac{\bar{\beta}_j}{(X^TX)^{-1}_{jj}}\right)_{obs} + \frac{t}{(X^TX)^{-1}_{jj}} - X_j^TX_{-j} \hat{\beta}_{-j}((X_{-j}^TY)_{obs})\right| < \lambda \right\}. \]
The reference density to test \(H_0:\beta_j=\delta\) is \(N(\delta, \sigma^2 (X^TX)^{-1}_{jj})\) restricted to the above truncation region.
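Unlike the sign-conditional case, the truncation region here is the complement of an interval. Its distribution function is still straightforward (a scipy sketch; `cdf_outside` is a hypothetical helper):

```python
import numpy as np
from scipy.stats import norm

def cdf_outside(t, mean, sd, l, u):
    # CDF at t of N(mean, sd^2) truncated to (-inf, l] U [u, inf)
    Fl = norm.cdf((l - mean) / sd)
    Fu = norm.cdf((u - mean) / sd)
    Z = Fl + (1 - Fu)              # total mass outside (l, u)
    if t <= l:
        return norm.cdf((t - mean) / sd) / Z
    if t >= u:
        return (Fl + norm.cdf((t - mean) / sd) - Fu) / Z
    return Fl / Z                  # no mass strictly inside (l, u)
```

Because both tails carry mass, the resulting intervals avoid the half-open-truncation pathology discussed earlier.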
from selectinf.algorithms.lasso import ROSI
def simulate_full(n=500, p=50, C=1.5):
    X = np.random.standard_normal((n, p))
    Y = np.random.standard_normal(n)
    W = np.ones(p) * C * np.sqrt(n)
    L = ROSI.gaussian(X, Y, W)
    beta_hat = L.fit()
    if (beta_hat != 0).sum() > 0:
        S = L.summary(compute_intervals=True,
                      level=0.90)
        return S
simulate_full()
## variable pval lasso ... upper_confidence lower_truncation upper_truncation
## 6 6 0.187981 -0.040119 ... 0.012707 -0.057826 0.085826
## 17 17 0.261607 -0.027967 ... 0.015356 -0.069161 0.080316
## 25 25 0.478552 0.010119 ... 0.131542 -0.064073 0.071951
## 29 29 0.998388 0.016198 ... 0.124809 -0.109881 0.044545
## 35 35 0.756812 -0.001374 ... 0.039444 -0.072520 0.087578
## 39 39 0.781752 -0.003502 ... 0.040540 -0.069454 0.089908
## 45 45 0.810212 -0.016688 ... 0.038102 -0.049755 0.086653
##
## [7 rows x 9 columns]
from scipy.stats import norm as normal_dbn

P, C, L = [], [], []
for _ in range(100):
    S = simulate_full()
    if S is not None:
        L.extend((S['upper_confidence'] - S['lower_confidence']) /
                 (2 * normal_dbn.ppf(0.95) * S['sd']))
        P.extend(list(S['pval']))
        C.extend(list((S['lower_confidence'] < 0) & (S['upper_confidence'] > 0)))
print('coverage: (target 90%):', np.mean(C))
## coverage: (target 90%): 0.898021308980213
We can simplify notation somewhat in this problem: set \[ \begin{aligned} Q &= X^TX \\ &= \nabla^2 \ell (\hat{\beta}) \\ D &= Q\hat{\beta} - \nabla \ell(\hat{\beta}) \\ &= Q\bar{\beta} - \nabla \ell(\bar{\beta}) \\ &= Q\beta^* - \nabla \ell(\beta^*) \\ &= X^TY \end{aligned} \]
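For the OLS loss \(\ell(\beta) = \frac{1}{2}\|Y-X\beta\|^2_2\), the identity \(Q\beta - \nabla\ell(\beta) = X^TY\) indeed holds at any \(\beta\); a quick numerical check (ours, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
Q = X.T @ X                       # Hessian of the OLS loss

def grad(beta):
    # gradient of 0.5 * ||Y - X beta||^2
    return X.T @ (X @ beta - Y)

# D = Q beta - grad(beta) is the same (= X^T Y) at any beta
for beta in [np.zeros(p), rng.standard_normal(p)]:
    assert np.allclose(Q @ beta - grad(beta), X.T @ Y)
```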
In this notation under \({\cal M}_f\) for target \(\bar{\beta}_j\) \[ {\cal N} = D - \frac{\bar{\beta}_j}{Q^{-1}_{jj}} e_j \] so that \[ {\cal N}[-j] = D[-j] = (Q\bar{\beta} - \nabla \ell(\bar{\beta}))[-j] = (Q\hat{\beta} - \nabla \ell(\hat{\beta}))[-j] \]
The full LASSO solution solves \[ \text{minimize}_{\beta} -\beta^T(Q\hat{\beta} - \nabla \ell(\hat{\beta})) + \frac{1}{2} \beta^TQ\beta + \lambda \|\beta\|_1 \]
The restricted LASSO solution solves \[ \text{minimize}_{\beta:\beta_j=0} -\beta_{-j}^T(Q\hat{\beta} - \nabla \ell(\hat{\beta}))[-j] + \frac{1}{2} \beta^TQ\beta + \lambda \|\beta\|_1 \]
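Both programs depend on the data only through \((D, Q)\). A sketch of solving the restricted problem by coordinate descent (our toy solver, not the selectinf implementation):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0)

def restricted_lasso(Q, D, lam, j, n_sweeps=500):
    """Coordinate descent sketch for
       minimize_{beta: beta_j = 0} -beta^T D + 0.5 beta^T Q beta + lam * ||beta||_1.
    Assumes Q has a positive diagonal (true when X is in general position)."""
    p = Q.shape[0]
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for k in range(p):
            if k == j:
                continue                                    # beta_j is pinned at 0
            r = D[k] - Q[k] @ beta + Q[k, k] * beta[k]      # partial residual
            beta[k] = soft_threshold(r, lam) / Q[k, k]
    return beta
```

At the solution, the KKT conditions hold: \(|D_k - (Q\hat{\beta})_k| \le \lambda\) for inactive \(k \neq j\), with equality (at \(\lambda \,\text{sign}(\hat{\beta}_k)\)) on the active set.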
The truncation region is the complement of \[ \left\{t: \left| {\cal N}_j + \frac{t}{Q^{-1}_{jj}} - Q[j,-j] \, \hat{\beta}_{-j}({\cal N}[-j]) \right| \leq \lambda \right\} \]
In the logistic LASSO, under the parametric model \[ {\cal L}(Y|X) = \text{Bernoulli}\left(\frac{e^{X^T\beta^*}}{1 + e^{X^T\beta^*}}\right) \] we set \[ \begin{aligned} Q &= \nabla^2 \ell (\hat{\beta}) \\ &= \nabla^2 \ell (\beta^*) + O(n^{1/2}) \\ D &= Q\hat{\beta} - \nabla \ell(\hat{\beta}) \\ &= Q\beta^* - \nabla \ell(\beta^*) + O(n^{1/2}) \\ \bar{\beta} &= Q^{-1}D \\ {\cal N} &= D - \frac{\bar{\beta}_j}{Q^{-1}_{jj}} e_j \end{aligned} \] and \[ \hat{\beta}_{-j}({\cal N}[-j], Q) = \text{argmin}_{\beta:\beta_j=0} -\beta[-j]^T{\cal N}[-j] + \frac{1}{2} \beta^TQ\beta + \lambda \|\beta\|_1 \]
To test \(H_0:\beta_j^*=\delta\) use reference \(N(\delta, Q_{jj}^{-1})\) truncated to the truncation region of previous slide.
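A sketch (function name ours) of computing \(Q\) and \(D\) for the logistic loss, i.e. the Hessian of the negative log-likelihood and the corresponding "effective response":

```python
import numpy as np

def logistic_QD(X, Y, beta_hat):
    """Q = Hessian of the negative log-likelihood at beta_hat,
    D = Q beta_hat - gradient at beta_hat (the notation above)."""
    pi = 1 / (1 + np.exp(-(X @ beta_hat)))   # fitted probabilities
    grad = X.T @ (pi - Y)                    # gradient of neg log-likelihood
    W = pi * (1 - pi)                        # variance weights
    Q = (X * W[:, None]).T @ X               # Hessian X^T W X
    D = Q @ beta_hat - grad
    return Q, D
```

At \(\hat{\beta}=0\) this reduces to \(Q = \frac{1}{4} X^TX\) and \(D = X^T(Y - \frac{1}{2})\).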
from selectinf.algorithms.lasso import ROSI

def simulate_full_logistic(n=500, p=50, C=1):
    X = np.random.standard_normal((n, p+1)); X[:,0] = 1
    W = C * np.ones(p+1) * np.sqrt(n)
    W[0] = 0  # do not penalize the intercept
    Y = np.random.binomial(1, 0.6, size=(n,))
    L = ROSI.logistic(X, Y, W)
    beta_hat = L.fit()
    if (beta_hat != 0).sum() > 1:
        S = L.summary(compute_intervals=True,
                      level=0.90)
        return S.iloc[1:]  # remove intercept row
simulate_full_logistic()

P, C = [], []
for _ in range(100):
    S = simulate_full_logistic()
    if S is not None:
        P.extend(list(S['pval']))
        C.extend(list((S['lower_confidence'] < 0) & (S['upper_confidence'] > 0)))
print('coverage: (target 90%):', np.mean(C))
## coverage: (target 90%): 0.892018779342723
Explain how you can be sure this algorithm will “run” (i.e. not produce nonsensical results).
What will the confidence intervals look “more” like – Zhong and Prentice’s ones based on two-sided truncation or the ones in the file drawer example based on one-sided truncation?
Do you expect this algorithm to produce valid conditional inference with \(p\) fixed and \(\lambda = n^{1/2} C\)?
Returning to high dimensional \(p > n\) or \(p \gg n\) setting, let’s revisit our algorithms for the LASSO conditioning on \((\hat{E}, \hat{s}_E)\).
Supposing one can obtain some reasonable estimate of \(\sigma^2\), what is the rough computational cost of computing a \(p\)-value for testing \(H_0:\beta_{j|E}=\delta\) in \({\cal M}\)? In \({\cal M}_f\)?
What is the rough computational cost in model \({\cal M}_{NP}\)?
Supposing we make sparsity assumptions so that, with high probability, \(|\hat{E}| \ll n < p\), do these algorithms require estimating the inverse of poorly conditioned matrices (under reasonable non-degeneracy assumptions)?
In summary, do these algorithms “run” in the \(p > n\) setting (under such sparsity and non-degeneracy assumptions)? How about \(p \gg n\)?
(Open-ended) Do you expect them to “work”, i.e. provide valid conditional inference (under such sparsity and non-degeneracy assumptions)?
The drop-the-losers problem involves independent estimators (pre-selection) and is seemingly the simplest problem beyond the file drawer example that still has nuisance parameters. Nevertheless, the general picture is remarkably similar.
Suppose our selection event can be expressed asymptotically in terms of \[ \begin{pmatrix} \hat{\theta}_{\cal T} \\ D \\ \Delta \end{pmatrix} \sim N \left(\begin{pmatrix} \theta_{\cal T} \\ ? \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma_{{\cal T},{\cal T}} & \Sigma_{{\cal T}, D} & 0 \\ \Sigma_{D, {\cal T}} & \Sigma_{D, D} & 0 \\ 0 & 0 & \Sigma_{\Delta,\Delta} \end{pmatrix}\right) \]
That is, we assume we have some query \({\cal Q}(\hat{\theta}_{\cal T}, D, \Delta)\) and have observed its value \(q\). (Note: in this set of slides we have had \(\Sigma_{\Delta, \Delta}=0\).)
Set \({\cal N} = D - \Gamma \hat{\theta}_{\cal T}\), where \(\Gamma = \Sigma_{D, {\cal T}} \Sigma_{{\cal T}, {\cal T}}^{-1}\) (so that \({\cal N}\) is uncorrelated with, hence independent of, \(\hat{\theta}_{\cal T}\)).
A valid reference density under \(H_0: \theta_{\cal T}=\delta\) is seen to be \[ t \mapsto \phi_{\delta, \Sigma_{{\cal T}, {\cal T}}}(t) \cdot \pi(t, {\cal N}, \Gamma) \] (up to normalization) where \[ \pi(t,n,\Gamma) = P({\cal Q}(\hat{\theta}_{\cal T}, {\cal N} + \Gamma \hat{\theta}_{\cal T}, \Delta) = q | \hat{\theta}_{\cal T}=t, {\cal N}=n). \]
For the OLS LASSO, although we considered different models (and several selection events) under which to carry out this conditional program, all reference distributions had the above form.
Similarly, for the logistic LASSO, all reference distributions had the above form.