Conditional inference II (LASSO)

web.stanford.edu/class/stats364/

Jonathan Taylor

Spring 2020

Conditional inference II

LASSO

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
set.seed(2)
n = 100
p = 20
X = matrix(rnorm(n * p), n, p)
Y = rnorm(n)
L = glmnet(X, Y, intercept=FALSE, standardize=FALSE)
# ignore intercept
beta_hat = coef(L, s=1.5/sqrt(n), exact=TRUE, x=X, y=Y)[-1] 


LASSO with glmnet


selected = which(beta_hat != 0)
list(selected=selected)
## $selected
## [1]  2 20
confint(lm(Y ~ X[,selected] - 1))
##                     2.5 %      97.5 %
## X[, selected]1 -0.4059135  0.03010663
## X[, selected]2 -0.4949535 -0.07162624


simulate = function(n=100,
                    p=20,
                    s=1.5) {
    X = matrix(rnorm(n * p), n, p)
    Y = rnorm(n)
    L = glmnet(X, Y, intercept=FALSE, standardize=FALSE)
    # ignore intercept
    beta_hat = coef(L, s=s/sqrt(n), exact=TRUE, x=X, y=Y)[-1] 
    selected = beta_hat != 0
    if (sum(selected) > 0) { 
        intervals = confint(lm(Y ~ X[,selected] - 1))
        covered = (intervals[,1] < 0) * (intervals[,2] > 0)
        return(covered)
    } else {
        return(numeric(0))
    }
}
simulate()
## X[, selected]1 X[, selected]2 
##              0              0


A little simulation

covered = c()
for (i in 1:100) {
    covered = c(covered, simulate())
}
mean(covered)
## [1] 0.592


KKT conditions


Subgradient equations
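
In standard form: for
\[
\hat{\beta} = \mathop{\mathrm{argmin}}_{\beta} \; \tfrac{1}{2}\|Y - X\beta\|_2^2 + \lambda \|\beta\|_1,
\]
stationarity requires
\[
X^T(Y - X\hat{\beta}) = \lambda u, \qquad u \in \partial \|\hat{\beta}\|_1,
\]
i.e. \(u_j = \text{sign}(\hat{\beta}_j)\) if \(\hat{\beta}_j \neq 0\) and \(u_j \in [-1, 1]\) otherwise. (Recall glmnet divides the squared-error term by \(n\), which is why the code above passes \(s = 1.5/\sqrt{n}\) while this \(\lambda\) is \(1.5\sqrt{n}\).)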


Subgradient equations in block form
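
Writing \(E\) for the active set and \(s_E = \text{sign}(\hat{\beta}_E)\), the same equations split into an active block and an inactive block:
\[
\begin{aligned}
X_E^T\big(Y - X_E \hat{\beta}_E\big) &= \lambda s_E, \\
\big\|X_{-E}^T\big(Y - X_E \hat{\beta}_E\big)\big\|_\infty &\leq \lambda.
\end{aligned}
\]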


Selection event

Conditioning on signs

signs = sign(beta_hat[selected])
list(selected=selected,
     signs=signs)
## $selected
## [1]  2 20
## 
## $signs
## [1] -1 -1


Inference requires a model!

Saturated model (PoSI, Lee et al. (2016))

Conditional model


Conditional inference for \(H_{0,\delta}:\eta^T\mu=\delta\)?


Workhorse


Affine representation for \(Y\)


Polyhedral lemma
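
A standard statement (following Lee et al. (2016), in the notation of the general form later in these slides): suppose \(Y \sim N(\mu, \Sigma)\), the selection event is \(\{AY \leq b\}\), and we target \(\eta^T\mu\). Set \(c = \Sigma\eta / \eta^T\Sigma\eta\) and \(N = (I - c\eta^T)Y\). Then
\[
\{AY \leq b\} = \{{\cal V}^{-}(N) \leq \eta^T Y \leq {\cal V}^{+}(N)\}
\]
with
\[
\begin{aligned}
{\cal V}^{-}(N) &= \max_{j: (Ac)_j < 0} \frac{(b - AN)_j}{(Ac)_j}, \\
{\cal V}^{+}(N) &= \min_{j: (Ac)_j > 0} \frac{(b - AN)_j}{(Ac)_j},
\end{aligned}
\]
(rows with \((Ac)_j = 0\) constrain only \(N\)), so that, conditional on \(N\) and the event, \(\eta^T Y\) follows a \(N(\eta^T\mu, \eta^T\Sigma\eta)\) distribution truncated to \([{\cal V}^{-}, {\cal V}^{+}]\).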


Polyhedral lemma (corollary)


Marginalizing over signs
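
Conditioning only on \(\hat{E} = E\) makes the selection event a union of the sign-specific polyhedra, so the reference distribution becomes a Gaussian truncated to a union of intervals, one per realizable sign vector \(s\):
\[
\bigcup_{s} \big[{\cal V}^{-}(N; s),\; {\cal V}^{+}(N; s)\big].
\]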


Not all sign vectors are realizable


How to compute the intersection (polyhedral lemma)


Results!

import numpy as np
X = np.load('X.npy')
Y = np.load('Y.npy')
from selectinf.algorithms.lasso import lasso
n, p = X.shape
L = lasso.gaussian(X, Y, 1.5 * np.sqrt(X.shape[0]), sigma=1.)
L.fit()
## array([ 0.        , -0.01205666, -0.        , -0.        , -0.        ,
##         0.        , -0.        , -0.        ,  0.        , -0.        ,
##         0.        ,  0.        ,  0.        , -0.        ,  0.        ,
##         0.        ,  0.        ,  0.        , -0.        , -0.11651073])
S = L.summary(compute_intervals=True,
              alternative='onesided',
              level=0.90)
S
##     variable     pval     lasso   onestep  ...  upper_confidence  lower_trunc  upper_trunc        sd
## 1          1  0.77475 -0.012057 -0.187903  ...          2.432066         -inf    -0.175847  0.102646
## 19        19  0.06657 -0.116511 -0.283290  ...          0.030983         -inf    -0.182397  0.099658
## 
## [2 rows x 9 columns]


One-sided \(p\)-value computation

from scipy.stats import norm as normal_dbn
Z = S['onestep'] / S['sd']
UZ = S['upper_trunc'] / S['sd']
LZ = S['lower_trunc'] / S['sd']
# here both observed signs are -1
pval = ((normal_dbn.cdf(Z) - normal_dbn.cdf(LZ)) / 
        (normal_dbn.cdf(UZ) - normal_dbn.cdf(LZ)))
pval
## array([0.77475002, 0.06656959])
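
The ratio above is the CDF, evaluated at \(Z\), of a standard normal truncated to \([LZ, UZ]\). As a reusable helper (an illustrative sketch; `truncnorm_cdf` is our name, not part of `selectinf`):

```python
import numpy as np
from scipy.stats import norm

def truncnorm_cdf(z, lower, upper):
    # CDF at z of N(0, 1) truncated to [lower, upper]; under the
    # null this pivot is Uniform(0, 1) on the selection event.
    num = norm.cdf(z) - norm.cdf(lower)
    den = norm.cdf(upper) - norm.cdf(lower)
    return num / den

# sanity check: with no truncation it reduces to the usual normal CDF
print(truncnorm_cdf(0.0, -np.inf, np.inf))  # 0.5
```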

Two-sided \(p\)-value

twosided = np.asarray(L.summary(alternative='twosided')['pval'])
twosided, 2 * np.minimum(pval, 1 - pval)
## (array([0.45049997, 0.13313918]), array([0.45049997, 0.13313918]))


Using R

library(selectiveInference)
## Loading required package: intervals
## 
## Attaching package: 'intervals'
## The following object is masked from 'package:Matrix':
## 
##     expand
## Loading required package: survival
## Loading required package: adaptMCMC
## Loading required package: parallel
## Loading required package: coda
## Loading required package: MASS
selectiveInference::fixedLassoInf(X, 
                                  Y,
                                  beta_hat,
                                  1.5 * sqrt(n),
                                  intercept=FALSE,
                                  alpha=0.1,
                                  sigma=1)
## 
## Call:
## selectiveInference::fixedLassoInf(x = X, y = Y, beta = beta_hat, 
##     lambda = 1.5 * sqrt(n), intercept = FALSE, sigma = 1, alpha = 0.1)
## 
## Standard deviation of noise (specified or estimated) sigma = 1.000
## 
## Testing results at lambda = 15.000, with alpha = 0.100
## 
##  Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
##    2 -0.188  -1.831   0.775    -0.250    2.433       0.049       0.05
##   20 -0.283  -2.843   0.067    -0.444    0.032       0.049       0.05
## 
## Note: coefficients shown are partial regression coefficients


Conditional inference for LASSO


Choice of model


Nuisance parameters

Exercise

Verify the equality in law above, as well as the conclusion.


YACM: yet another choice of model

Exercise

Verify that using \({\cal M}_E\) and corresponding conditional model \({\cal M}_E^*\) yields the same reference distribution as starting with \({\cal M}\) or \({\cal M}_f\).


Choice of target parameter \(\theta_{\cal T}\)


Pairs model


Linear decomposition


Linear decomposition


General form of polyhedral lemma

For event \(\{D:AD \leq b\}\).

  1. Set \[ \Gamma = \Gamma(F) = \text{Cov}_F(D, \hat{\theta}_{\cal T}) \text{Cov}_F(\hat{\theta}_{\cal T})^{-1} \]

  2. Compute \({\cal N} = D - \Gamma(F) \hat{\theta}_{\cal T}\) \[ \begin{aligned} {\cal V}^L({\cal N}, \Gamma) &= \max_{j: (A\Gamma)_j < 0} \frac{(b - A{\cal N})_j}{(A \Gamma)_j} \\ {\cal V}^U({\cal N}, \Gamma) &= \min_{j: (A\Gamma)_j > 0} \frac{(b - A{\cal N})_j}{(A \Gamma)_j} \\ \end{aligned} \]
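
These two steps can be sketched numerically for a scalar target (an illustrative implementation with our own variable names, not the library's):

```python
import numpy as np

def truncation_limits(A, b, N, Gamma):
    # Event {D : A D <= b}, with D = N + Gamma * theta_hat for a
    # scalar target theta_hat. Each constraint row j bounds theta_hat
    # from one side, depending on the sign of (A Gamma)_j.
    AG = A @ Gamma
    resid = b - A @ N
    lower, upper = -np.inf, np.inf
    neg, pos = AG < 0, AG > 0
    if neg.any():
        lower = np.max(resid[neg] / AG[neg])   # V^L
    if pos.any():
        upper = np.min(resid[pos] / AG[pos])   # V^U
    return lower, upper

# toy event: D = (0.5 + theta, 0) with D[0] <= 2, -D[0] <= 1, D[1] <= 3
A = np.array([[1., 0.], [-1., 0.], [0., 1.]])
b = np.array([2., 1., 3.])
N = np.array([0.5, 0.])
Gamma = np.array([1., 0.])
lo, hi = truncation_limits(A, b, N, Gamma)  # theta must lie in [-1.5, 1.5]
```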


A more careful analysis of pairs model (fixed \(p\))




A subtle point


Exercise

By “run the polyhedral lemma” we mean compute \({\cal V}^{L/U}(D - \Gamma(\hat{\Sigma}) \hat{\theta}_{\cal T}, \Gamma(\hat{\Sigma}))\). Prove the claim above, i.e. with the observed value of \(D\) as above, show that \(\hat{\theta}_{\cal T}\) always lies in the interval \([{\cal V}^{L}, {\cal V}^{U}]\). Supposing that after rescaling \(\text{Var}_F(\hat{\theta}_{\cal T}) = O(1)\), how much do you expect the endpoints \({\cal V}^{L/U}\) to change if you used the unobservable \(D=(\hat{\theta}_{\cal T}, G^A_{(E,s_E)}, G^I_{(E,s_E)})\) instead of \((\hat{\theta}_{\cal T}, n^{1/2}\hat{\beta}_E, n^{-1/2} \nabla \ell(\hat{\beta})[-E])\) under a no rare selection assumption? How large are these changes compared to the typical sizes of \({\cal V}^{L/U}\)? (Remember that under no rare selection the “sizes” of random variables are the same under \(F\) as under \(F^*\).)


Logistic LASSO


Restricted problem


Exercise

Suppose \(U\) has decided, after seeing that the logistic LASSO has selected \((E, s_E)\), that the proper target of inference is the \(k\)-th coordinate of the probit parameter of the true data-generating distribution \(F\) in the pairs model with features \(E'\). Describe how to construct an (asymptotically) valid confidence interval for this parameter that “runs”, i.e. does not give nonsensical results such as \(\hat{\theta}_{\cal T}\) falling outside its truncation interval.


Selected model framework Taylor and Tibshirani (2018)


Logistic inference under null (so \(\beta=0\))

def simulate(n=500, p=50, C=1):
    
    X = np.random.standard_normal((n, p))
    Y = np.random.binomial(1, 0.5, size=(n,))
    W = np.ones(p) * C * np.sqrt(n)
    L = lasso.logistic(X, Y, W)
    beta_hat = L.fit()
    if (beta_hat != 0).sum() > 0:
        S = L.summary(compute_intervals=True,
                      alternative='twosided',
                      level=0.90)

        return S

simulate()
##     variable      pval     lasso   onestep  ...  upper_confidence  lower_trunc  upper_trunc        sd
## 17        17  0.016169  0.143245  0.327934  ...          0.496885     0.184689     0.533234  0.091908
## 18        18  0.348696  0.007160  0.176163  ...          0.197229     0.169004     3.912629  0.087659
## 26        26  0.759404 -0.032987 -0.209377  ...          0.487633         -inf    -0.176390  0.087252
## 
## [3 rows x 9 columns]
P, C = [], []
for _ in range(100):
    S = simulate()
    if S is not None:
        P.extend(list(S['pval']))
        C.extend(list((S['lower_confidence'] < 0) & (S['upper_confidence'] > 0)))
print('coverage: (target 90%): ', np.mean(C))
## coverage: (target 90%):  0.9234234234234234


Pivot (p-value) plot


Exercise

Verify the above claimed asymptotic independence of \((\hat{\theta}_{\cal T}, G^A_{(E,s_E)})\) and \(G^I_{(E,s_E)}\) under \({\cal M}_{P,E}\) and \({\cal M}_{P,E}^*\) under an appropriate no rare selection assumption for \((E,s_E)\).


Exercise

For either the OLS or Logistic LASSO, suppose we want to marginalize over signs \(s_E\). Describe an algorithm using the polyhedral lemma for each \(s_E\) that will “run” (i.e. not give nonsensical answers due to \(\hat{\theta}_{\cal T}\) being outside its truncation set) and give (asymptotically) valid inference conditional only on the event \(\hat{E}=E\).


Unpenalized variables, feature weights


Xi = np.hstack([np.ones((n, 1)),
                X])
weights = 1.5 * np.sqrt(n) * np.ones(p+1)
weights[0] = 0
Li = lasso.gaussian(Xi, Y, weights, sigma=1.)
Li.fit()
## array([ 0.08047905,  0.        , -0.01307299, -0.        , -0.        ,
##        -0.        ,  0.        , -0.        , -0.        ,  0.00087677,
##        -0.        ,  0.        ,  0.        ,  0.        , -0.        ,
##         0.        ,  0.        ,  0.        ,  0.        , -0.        ,
##        -0.1046863 ])
Li.summary(compute_intervals=True,
           alternative='onesided',
           level=0.90)
##     variable      pval     lasso   onestep  ...  upper_confidence  lower_trunc  upper_trunc        sd
## 0          0  0.488701  0.080479  0.061638  ...          0.218739    -0.013930     0.773783  0.101140
## 2          2  0.624802 -0.013073 -0.184972  ...          2.233520    -0.217428    -0.171899  0.102695
## 9          9  0.984540  0.000877  0.127013  ...         -0.427391     0.126137     0.574571  0.098759
## 20        20  0.018953 -0.104686 -0.260735  ...         -0.152579    -0.269230    -0.156049  0.101312
## 
## [4 rows x 9 columns]


def simulate_intercept(n=500, p=50, C=1):
    
    X = np.random.standard_normal((n, p+1))
    X[:,0] = 1
    Y = np.random.binomial(1, 0.6, size=(n,))
    W = np.ones(p+1) * C * np.sqrt(n)
    W[0] = 0
    L = lasso.logistic(X, Y, W)
    beta_hat = L.fit()
    if (beta_hat != 0).sum() > 1:
        S = L.summary(compute_intervals=True,
                      alternative='twosided',
                      level=0.90)

        return S.iloc[1:] # remove intercept row

simulate_intercept()
##     variable      pval     lasso   onestep  ...  upper_confidence  lower_trunc  upper_trunc        sd
## 18        18  0.895433 -0.029592 -0.225419  ...          0.688152    -0.579943    -0.195827  0.094736
## 21        21  0.197391 -0.078194 -0.246453  ...          0.076704    -0.368645    -0.168259  0.089775
## 36        36  0.356272  0.007813  0.179142  ...          0.205369     0.171329     1.432253  0.091516
## 
## [3 rows x 9 columns]
P, C = [], []
for _ in range(100):
    S = simulate_intercept()
    if S is not None:
        P.extend(list(S['pval']))
        C.extend(list((S['lower_confidence'] < 0) & (S['upper_confidence'] > 0)))


print('coverage: (target 90%): ', np.mean(C))
## coverage: (target 90%):  0.9282051282051282


Pivot (p-value) plot


Full model inference for LASSO Liu, Markovic and Tibshirani (2018)

Up to now, we’ve allowed the target to be based on \((\hat{E},\hat{s}_E)\) for both the OLS and logistic LASSO, but this is not strictly necessary if:

  1. We are willing to assume the full parametric model is correct e.g. for OLS and \(n \gg p\) \[ {\cal L}(Y|X) = N(X\beta^*, \sigma^2 I) \in {\cal M}_f \]

  2. We restrict attention to targets \(\beta_j = \beta_{j|\{1,\dots,p\}}\).

Inference is based on conditioning on the outcome of queries \(Q_j(X,Y) = 1_{\hat{E}(X,Y)}(j)\).

Comments


Selection event for full model inference


Conditional inference II

Exercise

Suppose \(X\) is fixed. Verify that if \(D=X^TY\) and \(\theta_{\cal T}=\beta_j\) that \({\cal N} \equiv X_{-j}^TY\) by showing that our usual definition of \({\cal N}\) is in 1:1 correspondence with \(X_{-j}^TY\). More formally, in terms of conditioning, argue that the sigma algebra generated by \({\cal N}\) is the same as that generated by \(X_{-j}^TY\).


Selection event for full model inference

from selectinf.algorithms.lasso import ROSI

def simulate_full(n=500, p=50, C=1.5):
    
    X = np.random.standard_normal((n, p))
    Y = np.random.standard_normal(n)
    W = np.ones(p) * C * np.sqrt(n)
    L = ROSI.gaussian(X, Y, W)
    beta_hat = L.fit()
    if (beta_hat != 0).sum() > 0:
        S = L.summary(compute_intervals=True,
                      level=0.90)

        return S

simulate_full()
##     variable      pval     lasso  ...  upper_confidence  lower_truncation  upper_truncation
## 6          6  0.187981 -0.040119  ...          0.012707         -0.057826          0.085826
## 17        17  0.261607 -0.027967  ...          0.015356         -0.069161          0.080316
## 25        25  0.478552  0.010119  ...          0.131542         -0.064073          0.071951
## 29        29  0.998388  0.016198  ...          0.124809         -0.109881          0.044545
## 35        35  0.756812 -0.001374  ...          0.039444         -0.072520          0.087578
## 39        39  0.781752 -0.003502  ...          0.040540         -0.069454          0.089908
## 45        45  0.810212 -0.016688  ...          0.038102         -0.049755          0.086653
## 
## [7 rows x 9 columns]
P, C, L = [], [], []
for _ in range(100):
    S = simulate_full()
    if S is not None:
        L.extend((S['upper_confidence'] - S['lower_confidence']) /
                 (2 * normal_dbn.ppf(0.95) * S['sd']))
        P.extend(list(S['pval']))
        C.extend(list((S['lower_confidence'] < 0) & (S['upper_confidence'] > 0)))
print('coverage: (target 90%): ', np.mean(C))
## coverage: (target 90%):  0.898021308980213


Pivot (p-value) plot


Length of intervals compared to naive


Why are they sometimes shorter?


Selection event for full model inference Markovic (2019)



Logistic full model inference Markovic (2019)

from selectinf.algorithms.lasso import ROSI

def simulate_full_logistic(n=500, p=50, C=1):
    
    X = np.random.standard_normal((n, p+1)); X[:,0] = 1
    W = C * np.ones(p+1) * np.sqrt(n)
    W[0] = 0
    Y = np.random.binomial(1, 0.6, size=(n,))
    L = ROSI.logistic(X, Y, W)
    beta_hat = L.fit()
    if (beta_hat != 0).sum() > 1:
        S = L.summary(compute_intervals=True,
                      level=0.90)

        return S.iloc[1:] # remove intercept

simulate_full_logistic()
P, C = [], []
for _ in range(100):
    S = simulate_full_logistic()
    if S is not None:
        P.extend(list(S['pval']))
        C.extend(list((S['lower_confidence'] < 0) & (S['upper_confidence'] > 0)))
print('coverage: (target 90%): ', np.mean(C))
## coverage: (target 90%):  0.892018779342723


Pivot (p-value) plot


Exercise

  1. Explain how you can be sure this algorithm will “run” (i.e. not produce nonsensical results).

  2. What will the confidence intervals look “more” like – Zhong and Prentice’s ones based on two-sided truncation or the ones in the file drawer example based on one-sided truncation?

  3. Do you expect this algorithm to produce valid conditional inference with \(p\) fixed and \(\lambda = n^{1/2} C\)?


Exercise

Returning to high dimensional \(p > n\) or \(p \gg n\) setting, let’s revisit our algorithms for the LASSO conditioning on \((\hat{E}, \hat{s}_E)\).

  1. Supposing one can obtain some reasonable estimate of \(\sigma^2\), what is the rough computational cost of computing a \(p\)-value for testing \(H_0:\beta_{j|E}=\delta\) in \({\cal M}\)? In \({\cal M}_f\)?

  2. What is the rough computational cost in model \({\cal M}_{NP}\)?

  3. Supposing we make sparsity assumptions so that, with high probability, \(|\hat{E}| \ll n < p\), do these algorithms require estimating the inverse of poorly conditioned matrices (under reasonable non-degeneracy assumptions)?

  4. In summary, do these algorithms “run” in the \(p > n\) setting (under such sparsity and non-degeneracy assumptions)? How about \(p \gg n\)?

  5. (Open-ended) Do you expect them to “work”, i.e. provide valid conditional inference (under such sparsity and non-degeneracy assumptions)?


Takeaways