Jonathan Taylor
Spring 2020
What might a Bayesian do under selection?
Do Bayesians even have to worry about selection?
Suppose \(D \sim N(\mu, 1)\) with a prior \[ \pi(\mu) = N(0, \lambda^{-1}) \] (so \(\lambda\) is the prior precision)
Posterior \[ \mu | D \sim N\left(\frac{D}{1+\lambda}, \frac{1}{1+\lambda}\right) \]
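The conjugate-posterior formula can be spot-checked numerically on a grid (a sketch; the values of \(D\) and \(\lambda\) are arbitrary, and \(\lambda\) is treated as the prior precision as above):

```python
import numpy as np

# numerical check of the conjugate posterior N(D/(1+lam), 1/(1+lam)),
# treating lam as the prior precision
lam, D = 0.2, 1.5
grid = np.linspace(-10, 10, 200001)
log_post = -0.5 * (D - grid)**2 - 0.5 * lam * grid**2
post = np.exp(log_post - log_post.max())
post /= post.sum()
mean = (grid * post).sum()
var = ((grid - mean)**2 * post).sum()
print(mean, D / (1 + lam))   # both approximately 1.25
print(var, 1 / (1 + lam))    # both approximately 0.833
```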
What if we only report posterior intervals when \(D > 2\)?
import numpy as np
from scipy.stats import norm as normal_dbn
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
# %matplotlib inline
ECDF = sm.distributions.ECDF
def pivot_plot(pivot):
    # compare the empirical CDF of the pivot to the Uniform(0,1) CDF
    G = np.linspace(0, 1, 101)
    fig = plt.figure()
    ax = fig.gca()
    ax.plot(G, ECDF(pivot)(G), 'r', linewidth=4)
    ax.plot([0, 1], [0, 1], 'k--')
    ax.set_xlabel('pivot', fontsize=20)
    ax.set_ylabel('ECDF(pivot)', fontsize=20)
    return ax.figure

def simulate(truth,
             precision,
             threshold=2,
             level=0.9):
    # (truth=\mu, precision=\lambda)
    C = normal_dbn.ppf((1 + level) / 2)
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        if D > threshold:
            # unadjusted posterior interval, reported only after selection
            L = D / (1 + prec) - C / np.sqrt(1 + prec)
            U = D / (1 + prec) + C / np.sqrt(1 + prec)
            break
    return L, U, (L < truth) * (U > truth)
simulate(0, 0.2)
## (0.2217172670177403, 3.22479538460277, False)
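The effect of selection on coverage can be seen by averaging the coverage indicator over repeated selected draws (a self-contained sketch mirroring `simulate` above; the truth value \(-1\) is illustrative):

```python
import numpy as np
from scipy.stats import norm as normal_dbn

def simulate(truth, precision, threshold=2, level=0.9):
    # unadjusted posterior interval, reported only when D > threshold
    C = normal_dbn.ppf((1 + level) / 2)
    while True:
        D = np.random.standard_normal() + truth
        if D > threshold:
            L = D / (1 + precision) - C / np.sqrt(1 + precision)
            U = D / (1 + precision) + C / np.sqrt(1 + precision)
            return L, U, (L < truth) * (U > truth)

cover = [simulate(-1, 0.2)[2] for _ in range(2000)]
print(np.mean(cover))  # far below the nominal 0.9
```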
We were constructing intervals above using prior \(\pi\) and likelihood \[ \frac{\exp(-\frac{1}{2}(D-\mu)^2)}{\sqrt{2 \pi}} = \phi(D-\mu) \]
Likelihood does not reflect selection.
As we have done for frequentist analysis, we can use selective likelihood \[ \frac{\phi(D - \mu)}{1 - \Phi(2 - \mu)} \]
This approach was suggested by Yekutieli (2011) for one-dimensional problems like the file drawer.
In this case we need what we called \(\ell^*\) in asymptotics – the \(\Phi\)-adjusted \(\pi\) \[ \ell^*_{\Phi_{\mu}}(t,n) = \frac{\pi(t,n)}{E_{\Phi_{\mu}}[\pi(t,n)]}. \]
For the file drawer problem this has closed form \(1-\Phi(2-\mu)\); what about the LASSO?
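In one dimension the \(\Phi\)-adjusted posterior can be formed directly on a grid, without MCMC (a sketch; `adjusted_posterior` is a hypothetical helper and the grid limits are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def adjusted_posterior(D, precision, threshold=2):
    # posterior proportional to [phi(D - mu) / (1 - Phi(threshold - mu))] * prior
    mu = np.linspace(-15, 15, 60001)
    log_lik = norm.logpdf(D - mu) - norm.logsf(threshold - mu)
    log_prior = -0.5 * precision * mu**2
    w = np.exp(log_lik + log_prior - (log_lik + log_prior).max())
    return mu, w / w.sum()

mu, post = adjusted_posterior(2.5, 0.2)
# the adjustment upweights small mu, pulling the posterior mean below
# the unadjusted value D / (1 + precision)
print((mu * post).sum(), 2.5 / 1.2)
```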
# nonrandomized selection here
def simulate_corrected(truth,
                       precision,
                       threshold=2,
                       level=0.9,
                       ndraw=5000,
                       burnin=1000,
                       proposal_sd=0.5):
    # (truth=\mu, precision=\lambda)
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        if D > threshold:
            def log_post(mu):
                # log of selection-adjusted likelihood plus log prior
                if threshold - mu < 30:
                    value = np.log(normal_dbn.sf(threshold - mu))
                else:
                    # Mills-ratio approximation in the far tail
                    value = - 0.5 * (threshold - mu)**2 - np.log(threshold - mu)
                return -(0.5 * (D - mu)**2 +
                         value +
                         0.5 * prec * mu**2)
            # now draw ndraw + burnin samples from Metropolis-Hastings
            chain = [D]
            den = log_post(chain[-1])
            for _ in range(ndraw + burnin):
                proposal = (chain[-1] +
                            np.random.standard_normal() * proposal_sd)
                try:
                    num = log_post(proposal)
                except Exception:
                    num = -np.inf
                # acceptance probability
                alpha = min(np.exp(num - den), 1)
                if np.random.binomial(1, alpha):
                    chain.append(1 * proposal)
                    den = num
                else:
                    chain.append(1 * chain[-1])
            chain = np.array(chain)[burnin:]
            U = np.percentile(chain, 100 * (1 - (1 - level) / 2))
            L = np.percentile(chain, 100 * (1 - level) / 2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_corrected(-2, 0.2, ndraw=5000, proposal_sd=1)
## (-4.293642277299732, 1.8099147333531849, True)
coverage = []
for _ in range(50):
    coverage.append(simulate_corrected(-1, 0.2))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 1.0
coverage = []
for _ in range(50):
    coverage.append(simulate_corrected(-1, 0.01))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.94
In our discussion of asymptotics we used Chernoff’s bound / approximation \[ -\log P_{\mu}[D + \omega \in A] \approx \inf_{(d,\omega) \in A} \frac{1}{2}(d-\mu)^T\Sigma^{-1}(d-\mu) + \frac{1}{2} \|\omega\|^2_2 \] where \[ \begin{pmatrix} D \\ \omega \end{pmatrix} \sim N \left( \begin{pmatrix} \mu \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma & 0 \\ 0 & I \end{pmatrix} \right) \]
Specifically we looked at \[ (d^*(\mu), \omega^*(\mu)) = \text{argmin}_{(d,\omega) \in A} \frac{1}{2}(d-\mu)^T\Sigma^{-1}(d-\mu) + \frac{1}{2} \|\omega\|^2_2 \] as a proxy for where \(\Phi_{\mu}^*\) concentrates…
This bound can be used as a proxy for \(\ell^*_{\Phi_{\mu}}\)
The approximate negative log-likelihood takes the form \[ \frac{1}{2}(D-\mu)^T\Sigma^{-1}(D-\mu) - \inf_{(d,\omega) \in A} \left[\frac{1}{2}(d-\mu)^T\Sigma^{-1}(d-\mu) + \frac{1}{2} \|\omega\|^2_2\right] \]
With no randomization (\(\omega \equiv 0\)) in the file drawer, this yields negative log-likelihood \[ \frac{1}{2}(D-\mu)^2 - \inf_{d \geq 2} \frac{1}{2}(d-\mu)^2 = \begin{cases} \frac{1}{2}(D-\mu)^2 & \mu > 2 \\ \frac{1}{2}D^2 - (D-2) \mu + C & \mu \leq 2 \end{cases} \]
As we report only when \(D>2\), the adjustment vanishes near \(\hat{\mu}=D>2\), so the posterior analysis is essentially unchanged.
“MLE” is just \(D\) here
Fisher information is just usual information
\(\implies\) this is not a good enough proxy…
Suppose we had split data 50/50 before peeking at file drawer
Associated problem \[ \text{argmin}_{(d,\omega): d+\omega>2} \frac{1}{2}(d-\mu)^2 + \frac{1}{2}\omega^2 \]
Not hard to check that \[ \inf_{(d,\omega): d+\omega>2} \frac{1}{2}(d-\mu)^2 + \frac{1}{2}\omega^2 = \frac{1}{2} \max\left(-(\mu-2)/\sqrt{2}, 0\right)^2 \]
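The closed form can be checked numerically (a sketch using `scipy.optimize.minimize`; the value \(\mu=-1\) is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

mu = -1.0
# minimize (d - mu)^2/2 + omega^2/2 subject to d + omega > 2
res = minimize(lambda x: 0.5 * (x[0] - mu)**2 + 0.5 * x[1]**2,
               x0=[3.0, 0.0],
               constraints=[{'type': 'ineq', 'fun': lambda x: x[0] + x[1] - 2}])
print(res.fun, max(2 - mu, 0)**2 / 4)  # both approximately 2.25
```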
Negative log-likelihood \[ \frac{1}{2}(D-\mu)^2 - \begin{cases} 0 & \mu > 2 \\ \frac{1}{4}(\mu-2)^2 & \mu \leq 2 \end{cases} \]
Solving for MLE yields \[ \hat{\mu} = \begin{cases} D & D > 2 \\ 2 D - 2 & D \leq 2 \end{cases} \]
What about observed information? \[ \nabla^2 \Lambda(\hat{\mu}) = \begin{cases} 1 & D > 2 \\ 2 & D \leq 2 \end{cases} \]
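The piecewise MLE can be checked by minimizing the data-splitting negative log-likelihood on a grid (a sketch; \(D=1.2\) is illustrative):

```python
import numpy as np

D = 1.2
mu = np.linspace(-10, 10, 400001)
# data-splitting negative log-likelihood from above
nll = 0.5 * (D - mu)**2 - 0.25 * (mu - 2)**2 * (mu <= 2)
print(mu[np.argmin(nll)], 2 * D - 2)  # both approximately 0.4
```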
Likelihood inference here returns either unadjusted intervals or something like data-splitting intervals…
# randomized selection here
def simulate_chernoff_freq(truth,
                           threshold=2,
                           level=0.9):
    # truth=\mu
    C = normal_dbn.ppf((1 + level) / 2)
    while True:
        D = np.random.standard_normal() + truth
        omega = np.random.standard_normal()
        if D + omega > threshold:
            if D > threshold:
                # unadjusted MLE and interval
                L = D - C
                U = D + C
            else:
                # data-splitting MLE 2D - threshold with variance 2
                L = 2 * D - threshold - C * np.sqrt(2)
                U = 2 * D - threshold + C * np.sqrt(2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_chernoff_freq(0)
## (-2.104403012045317, 2.547945602661378, True)
coverage = []
for _ in range(1000):
    coverage.append(simulate_chernoff_freq(-1))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.825
fig = plt.figure(figsize=(8, 8))
ax = plt.gca()
Zval = np.linspace(-10, 10, 501)
mle_val = np.linspace(-30, 30, 501)
chernoff = Zval * (Zval > 2) + (2 * Zval - 2) * (Zval <= 2)
# exact selective MLE: D as a function of mu via a Mills-ratio adjustment
Z_mle = mle_val + ((normal_dbn.pdf((2 - mle_val) / np.sqrt(2)) / np.sqrt(2)) /
                   normal_dbn.sf((2 - mle_val) / np.sqrt(2)))
ax.plot(Zval, chernoff, 'b', linewidth=3, label='Chernoff')
ax.plot(Z_mle, mle_val, 'r', linewidth=3, label='MLE')
ax.set_xlabel('D', fontsize=20)
ax.set_ylabel(r'$\hat{\mu}(D)$', fontsize=20)
ax.set_xlim([-1, 5])
ax.plot(Zval, Zval, 'k--')
ax.set_ylim([-5, 6])
ax.legend()
# randomized selection here
def simulate_chernoff_bayesian(truth,
                               precision,
                               threshold=2,
                               level=0.9,
                               ndraw=5000,
                               burnin=1000,
                               proposal_sd=0.5):
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        if D + np.random.standard_normal() > threshold:
            # now draw ndraw + burnin samples from Metropolis-Hastings
            chain = [D]
            def log_post(mu):
                # Chernoff-adjusted likelihood: correction active for mu <= threshold
                return -(0.5 * (D - mu)**2 -
                         0.25 * (mu - threshold)**2 * (mu <= threshold) +
                         0.5 * prec * mu**2)
            den = log_post(chain[-1])
            for _ in range(ndraw + burnin):
                proposal = (chain[-1] +
                            np.random.standard_normal() * proposal_sd)
                num = log_post(proposal)
                # acceptance probability
                alpha = min(np.exp(num - den), 1)
                if np.random.binomial(1, alpha):
                    chain.append(1 * proposal)
                    den = num
                else:
                    chain.append(1 * chain[-1])
            chain = np.array(chain)[burnin:]
            U = np.percentile(chain, 100 * (1 - (1 - level) / 2))
            L = np.percentile(chain, 100 * (1 - level) / 2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_chernoff_bayesian(-2, 0.2, ndraw=5000, proposal_sd=1)
## (-3.45454562713953, 0.4754887298734898, True)
coverage = []
for _ in range(1000):
    coverage.append(simulate_chernoff_bayesian(1, 0.0001))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.801
Our optimization problem can be written as \[ \text{minimize}_{d,\omega} \frac{1}{2}(d-\mu)^T\Sigma^{-1}(d-\mu) + \frac{1}{2}\| \omega\|^2 + I_A(d, \omega) \] where \(I_A = - \log(1_A)\) is the log-indicator.
When \[A=\left\{(d,\omega): L_A \begin{pmatrix}d \\ \omega \end{pmatrix} \leq b_A\right\}\] is affine with \(L_A \in \mathbb{R}^{m \times (p + p)}\), we might approximate \(I_A\) by \[ z \mapsto \sum_{i=1}^m -\log(b_A[i]-L_A[i]^Tz) \] or some other smooth barrier.
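A minimal sketch of such a barrier for an affine region (here encoding the file drawer constraint \(d > 2\) as \(-d \le -2\); the names `log_barrier`, `L_A`, `b_A` are illustrative):

```python
import numpy as np

def log_barrier(z, L_A, b_A):
    # smooth surrogate for I_A on {z : L_A z <= b_A}
    slack = b_A - L_A.dot(z)
    if np.any(slack <= 0):
        return np.inf          # outside the region
    return -np.sum(np.log(slack))

L_A = np.array([[-1.0]])       # -d <= -2, i.e. d >= 2
b_A = np.array([-2.0])
print(log_barrier(np.array([3.0]), L_A, b_A))   # -log(1) = 0.0
print(log_barrier(np.array([1.0]), L_A, b_A))   # inf: constraint violated
```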
For the unrandomized file drawer the approximation is \[ \inf_{d} \frac{1}{2}(d-\mu)^2 - \log(d - T) \]
For MLE, we must differentiate this with respect to \(\mu\).
A standard envelope argument says that the derivative of the optimal value (in the \(-d\mu\) parametrization) is the negative of the minimizer.
The minimizer satisfies \[ \begin{aligned} d^*(\mu) - \frac{1}{d^*(\mu)-T} &= \mu \\ d^*(\mu) &= \frac{T + \mu + \sqrt{(T+\mu)^2 - 4 (T \mu - 1)}}{2} \end{aligned} \] (taking the root with \(d^*(\mu) > T\)).
The overall MLE therefore solves \[ d^*(\hat{\mu}) = D, \qquad \text{equivalently} \qquad \hat{\mu} = D - \frac{1}{D-T} \]
def value(mu, threshold):
    # optimal value of the barrier-smoothed problem:
    # inf_d (d - mu)^2 / 2 - log(d - T)
    T = threshold
    d_star = ((T + mu) + np.sqrt((T + mu)**2 - 4 * (T * mu - 1))) / 2
    val = 0.5 * (d_star - mu)**2 - np.log(d_star - T)
    return val
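A quick check that the closed-form minimizer used in `value` satisfies the stationarity condition \(d^* - 1/(d^*-T) = \mu\) (the values of \(\mu\) and \(T\) are arbitrary):

```python
import numpy as np

mu, T = 0.5, 2.0
d_star = ((T + mu) + np.sqrt((T + mu)**2 - 4 * (T * mu - 1))) / 2
print(d_star)                          # the root with d_star > T
print(d_star - 1 / (d_star - T) - mu)  # approximately 0
```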
# nonrandomized selection here
def simulate_smooth_bayesian(truth,
                             precision,
                             threshold=2,
                             level=0.9,
                             ndraw=5000,
                             burnin=1000,
                             proposal_sd=0.5):
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        if D > threshold:
            # now draw ndraw + burnin samples from Metropolis-Hastings
            chain = [D]
            def log_post(mu):
                return -(0.5 * (D - mu)**2 -
                         value(mu, threshold) +
                         0.5 * prec * mu**2)
            den = log_post(chain[-1])
            for _ in range(ndraw + burnin):
                proposal = (chain[-1] +
                            np.random.standard_normal() * proposal_sd)
                num = log_post(proposal)
                # acceptance probability
                alpha = min(np.exp(num - den), 1)
                if np.random.binomial(1, alpha):
                    chain.append(1 * proposal)
                    den = num
                else:
                    chain.append(1 * chain[-1])
            chain = np.array(chain)[burnin:]
            U = np.percentile(chain, 100 * (1 - (1 - level) / 2))
            L = np.percentile(chain, 100 * (1 - level) / 2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_smooth_bayesian(-2, 0.2, ndraw=5000, proposal_sd=1)
## (-3.9963500352340864, 1.6283124834639633, True)
coverage = []
for _ in range(100):
    coverage.append(simulate_smooth_bayesian(0, 0.0001))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.67
Corresponding problem \[ \text{argmin}_{d, \omega} \frac{1}{2}(d-\mu)^2 + \frac{1}{2} \omega^2 - \log(d+\omega - 2) \]
Can be solved explicitly by change of variables \((u, v) = ((d-\omega)/\sqrt{2}, (d+\omega)/\sqrt{2})\)… messy but not impossible
def objective(soln, mu, threshold=2):
    d, omega = soln
    return 0.5 * ((d - mu)**2 + omega**2) - np.log(d + omega - threshold)

def gradient(soln, mu, threshold=2):
    d, omega = soln
    return np.array([d - mu, omega]) - 1. / (d + omega - threshold)

def value_rand(mu, threshold, initial=None, tol=1.e-5):
    # backtracking gradient descent on the randomized barrier problem
    soln = np.array([threshold + 1, 0])
    cur_val = objective(soln, mu, threshold=threshold)
    alpha = 1
    count = 0
    stop = False
    while True:
        G = gradient(soln, mu, threshold=threshold)
        proposed = soln - alpha * G
        proposed_val = objective(proposed, mu, threshold=threshold)
        if ((proposed.sum() - threshold < 0) or
            (proposed_val > cur_val) or
            (np.isfinite(soln).sum() != 2) or
            (np.isfinite(G).sum() != 2)):
            alpha *= 0.5     # step too large: backtrack
        else:
            soln = proposed
            old_val, cur_val = cur_val, proposed_val
            if np.fabs(old_val - cur_val) <= tol * max(np.fabs(cur_val), 1):
                stop = True
        if stop:
            break
        count += 1
    return cur_val, soln
value_rand(0, 2)
## (2.177950604421338, array([1.37047989, 1.36157092]))
# randomized selection here
def simulate_smooth_bayesian_rand(truth,
                                  precision,
                                  threshold=2,
                                  level=0.9,
                                  ndraw=5000,
                                  burnin=1000,
                                  proposal_sd=0.5):
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        omega = np.random.standard_normal()
        if D + omega > threshold:
            # now draw ndraw + burnin samples from Metropolis-Hastings
            chain = [D]
            def log_post(mu):
                return -(0.5 * (D - mu)**2 -
                         value_rand(mu, threshold)[0] +
                         0.5 * prec * mu**2)
            den = log_post(chain[-1])
            for _ in range(ndraw + burnin):
                proposal = (chain[-1] +
                            np.random.standard_normal() * proposal_sd)
                num = log_post(proposal)
                # acceptance probability
                alpha = min(np.exp(num - den), 1)
                if np.random.binomial(1, alpha):
                    chain.append(1 * proposal)
                    den = num
                else:
                    chain.append(1 * chain[-1])
            chain = np.array(chain)[burnin:]
            U = np.percentile(chain, 100 * (1 - (1 - level) / 2))
            L = np.percentile(chain, 100 * (1 - level) / 2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_smooth_bayesian_rand(-2, 0.2, ndraw=5000, proposal_sd=1)
## (-3.6329185943845195, 0.29411072460259496, True)
coverage = []
for _ in range(50):
    coverage.append(simulate_smooth_bayesian_rand(-1, 0.01))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.88
Formally, given our barrier \(B\) (which smooths out the constraint of the selection event), our (approximate MLE) problem is \[ \text{minimize}_{\mu} - D \mu + \frac{1}{2}\mu^2 - \inf_{d,\omega} \left[\frac{1}{2}(d-\mu)^2 + \frac{1}{2} \omega^2 + B(\omega+d)\right] \]
Or \[ \text{minimize}_{\mu} - D \mu + \sup_{d,\omega} \left[d \mu - \frac{1}{2}d^2 - \frac{1}{2} \omega^2 - B(\omega+d)\right] \]
Or \[ \text{minimize}_{\mu} - D \mu + \sup_{d}\left[ d \mu - \frac{1}{2}d^2 + \sup_{\omega} \left[ - \frac{1}{2} \omega^2 - B(\omega+d) \right]\right] \]
Or \[ \text{minimize}_{\mu} - D \mu + \sup_{d}\left[ d \mu - \frac{1}{2}d^2 + \sup_{\omega} \left[ - \frac{1}{2} (\omega-d)^2 - B(\omega) \right]\right] \]
Finally \[ \text{minimize}_{\mu} - D \mu + \sup_{d}\left[ d\mu - d^2 + B^*_1(d)\right] \] with \(B^*_1\) the Moreau envelope at parameter 1 of \(B^*\), the convex conjugate of our barrier: \[ f_{\kappa}(\mu) = \inf_{\omega} \frac{\kappa}{2}(\omega-\mu)^2 + f^*(\omega) \]
Note that we can define \[ h^*(\mu) = \sup_d \left[d\mu - d^2 + B^*_1(d)\right] \] where \[ h(d) = d^2 - B^*_1(d). \]
MLE equation is therefore \[ \nabla h^*(\hat{\mu}(D)) = D \]
Or, \[ \hat{\mu}(D) = \nabla h(D) = 2D - \nabla B^*_1(D). \]
General results on the Moreau envelope tell us \[ \nabla B^*_1(d) = d - \text{prox}_{B^*}(d) = \text{prox}_B(d) = \text{argmin}_{\omega} \; \frac{1}{2}(\omega - d)^2 + B(\omega) \]
\(\implies\) \[ \hat{\mu}(D) = 2 D - \text{prox}_B(D) \]
\(\implies\) we need only solve the prox problem.
For observed information, we must differentiate \(\hat{\mu}(D)\) with respect to \(D\).
KKT for proximal problem can be written as \[ D = \omega^*(D) + \nabla B(\omega^*(D)) \]
Differentiating both sides with respect to \(D\) yields \[ \nabla \omega^*(D) = (I + \nabla^2 B(\omega^*(D)))^{-1} \]
Finally, \[ \hat{I}(D) = 2 - (1 + \nabla^2 B(\omega^*(D)))^{-1} \]
Can be computed with the solution to the proximal problem…
def prox_barrier(d, value_barrier, grad_barrier, threshold=2, niter=100):
    # solve argmin_w (d - w)^2 / 2 + B(w) by backtracking gradient descent
    soln = threshold + 1
    cur_val = (d - soln)**2 / 2 + value_barrier(soln, threshold=threshold)
    alpha = 1.
    for _ in range(niter):
        G = soln - d + grad_barrier(soln, threshold=threshold)
        while True:
            proposed = soln - alpha * G
            proposed_val = (d - proposed)**2 / 2 + value_barrier(proposed, threshold=threshold)
            if ((proposed_val > cur_val) or
                (proposed < threshold)):
                alpha *= 0.5
            else:
                cur_val = proposed_val
                soln = proposed
                alpha *= 4
                break
    return soln

def value_barrier(arg, threshold=2):
    return -np.log(arg - threshold)

def grad_barrier(arg, threshold=2):
    return -1 / (arg - threshold)

def hessian_barrier(arg, threshold=2):
    return 1 / (arg - threshold)**2

def approx_mle(D,
               barrier,
               threshold=2):
    value, grad, hessian = barrier
    prox_soln = prox_barrier(D, value, grad)
    MLE = 2 * D - prox_soln
    info_MLE = 2 - 1 / (1 + hessian(prox_soln))
    return MLE, info_MLE

barrier = (value_barrier, grad_barrier, hessian_barrier)
approx_mle(2.4, barrier)
## (1.5801960968332591, 1.4019419322542626)
Zval = np.linspace(5, -1, 601)
ax.plot(Zval, [approx_mle(z, barrier)[0] for z in Zval], 'gray', linewidth=3, label=r'$B(\omega)=\log(\omega)$')
ax.legend()
def simulate_freq_pivot(truth, barrier, threshold=2):
    while True:
        D = np.random.standard_normal() + truth
        omega = np.random.standard_normal()
        if (D + omega) > threshold:
            MLE, info = approx_mle(D, barrier)
            Z = (MLE - truth) / np.sqrt(info)
            return Z
simulate_freq_pivot(0, barrier)
## 0.1297325145333067
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(-2, barrier))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(0, barrier))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(4, barrier))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
def value_barrier_alt(arg, threshold=2):
    return -np.log(arg - threshold) + np.log(arg - threshold + 1)

def grad_barrier_alt(arg, threshold=2):
    return -1 / (arg - threshold) + 1 / (arg - threshold + 1)

def hessian_barrier_alt(arg, threshold=2):
    return 1 / (arg - threshold)**2 - 1 / (arg - threshold + 1)**2

barrier_alt = (value_barrier_alt, grad_barrier_alt, hessian_barrier_alt)

Zval = np.linspace(5, -1, 601)
ax.plot(Zval, [approx_mle(z, barrier_alt)[0] for z in Zval], 'k', linewidth=3, label=r'$B(\omega)=\log(\omega/(\omega+1))$')
ax.legend()
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(-2, barrier_alt))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(0, barrier_alt))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(2, barrier_alt))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
(Figure panels, repeated for each barrier: distribution of the pivot versus the \(N(0,1)\) density at \(\mu^*=-2\), \(\mu^*=0\), \(\mu^*=2\).)
How do we compute density?
Does this work beyond file drawer?
MLE problem solves \[ \hat{\mu}(D) = \text{argmin}_{\mu} -\mu^TD + \Lambda(\mu) \]
KKT conditions \[ D = \nabla \Lambda(\hat{\mu}(D)) \]
Plugging this into any density for \(D\) (in the model or not) yields density of MLE.
This goes back to Fisher…
Randomized file drawer solves \[ \hat{\beta}(D,\omega) = \text{argmin}_{\beta \geq 2} \frac{1}{2}(D-\beta)^2 - \omega \cdot \beta \]
Selection is whether constraint is tight or not…
KKT conditions \[ \hat{\beta} - D - \omega + \hat{u} = 0 \]
When \(\hat{\beta} > 2\) (when we have selected this study…) we have \(\hat{u}=0\).
When \(\hat{\beta}=2\) we have \(\hat{u}=D-2+\omega \leq 0\)
Rewriting KKT conditions \[ \omega = \hat{\beta} - D + \hat{u} \]
Change of variables yields \[ \pi(t,n) = \pi(t, 0) \propto \int_2^{\infty} \phi(\beta - t) \; d\beta \]
If we were interested in the complementary selection event we could write \[ \pi^c(t,n) = \pi^c(t, 0) \propto \int_0^{\infty} \phi(2 - t + u) \; du. \]
Moral: integrals over \(\omega\) can be expressed as integrals over \((\beta, u)\)
Data splitting (or direct Gaussian randomization) of the LASSO solves \[ \begin{aligned} \hat{\beta}(D, \omega) &= \text{argmin}_{\beta} \frac{1}{2} \|Y-X\beta\|^2_2 - \omega^T\beta + \lambda \|\beta\|_1 \\ &= \text{argmin}_{\beta} \ell(\beta;D) - \omega^T\beta + \lambda \|\beta\|_1 \\ \end{aligned} \]
KKT conditions \[ \omega = \nabla \ell(\hat{\beta}; D) + \hat{u} \]
Conditioning on signs and active set \[ \pi(t,n) = \int_{\text{sign}(\beta_E)=s_E} \int_{\|u_{-E}\|_{\infty} \leq \lambda} g_{\omega}(\nabla \ell(\beta_E; n + t \Gamma) + u) \; du_{-E} \; d\beta_E \] with \(u_E = \lambda s_E\).
Representation is agnostic to our model – we’ve just used a decomposition \[ D = {\cal N} + \Gamma \hat{\theta}. \]
Our target parameter \(\theta^*\) need not be directly related to \((E,s_E)\) – researcher degrees of freedom.
We wrote change of variables only on event \((\hat{E},\hat{s}_E)=(E,s_E)\) but it is global.
In principle, any function of \((D,\omega)\) can be expressed as function of \((D, \hat{\beta}, \hat{u})\).
Typically, researchers will want to look at functions of \((\hat{\beta},\hat{u})\).
Can even show them the entire subgradient \(\hat{u}\) to see which variables were close to being selected.
Correction takes form \[ \pi(t,n|u) = \int_{\text{sign}(\beta_E)=s_E} g_{\omega}(\nabla \ell(\beta_E; n + t \Gamma) + u) \; d\beta_E \]
An \(E\)-dimensional integral.
If \(\omega \sim N(0, \tau^2 I)\) then \[ \begin{aligned} g_{\omega}(\nabla \ell(\beta_E; {\cal N} + \hat{\theta} \Gamma) + u) & \propto \exp \left(-\frac{1}{2 \tau^2} \|X^TX_E\beta_E - X^TY + (\lambda s_E, u_{-E})\|^2_2 \right) \end{aligned} \]
This is a Gaussian density when viewed as a function of \((\beta_E, u_{-E})\) with \(X^TY=D\) fixed. Ignoring the restriction there is an implied Gaussian \[ {\cal L}\left(\beta_E, u_{-E} \vert X^TY \right) \sim N(\mu(X^TY), Q^{-1}) \] with density (up to normalization) \[ (\beta_E, u_{-E}) \mapsto \exp \left(-\frac{1}{2 \tau^2} \|X^TX_E\beta_E - X^TY + (\lambda s_E, u_{-E})\|^2_2 \right) = N(\gamma_1(X^TY), \Sigma_1) \]
Conditional on \(u_{-E}\) there is another implied Gaussian \[ {\cal L}\left(\beta_E \vert u_{-E}, X^TY \right) \sim N(\mu(X^TY, u_{-E}), Q^{-1}) \] with density (up to normalization) \[ \beta_E \mapsto \exp \left(-\frac{1}{2 \tau^2} \|X^TX_E\beta_E - X^TY + (\lambda s_E, u_{-E})\|^2_2 \right) = N(\gamma_2(X^TY,u_{-E}), \Sigma_2) \]
The (Chernoff-type) approximation of the normalizer involves \[ \inf_{t, \beta_E: \text{sign}(\beta_E)=s_E} \frac{t^2}{2\sigma^2} + \frac{1}{2\tau^2} \left\|X^TX_E\beta_E - ({\cal N} + t \Gamma) + (\lambda s_E, u_{-E})\right\|^2_2 \]
Smoothing the sign constraints with barrier terms and minimizing over \(\beta_E\) first gives \[ \inf_{t} \; G(t) \] with \[ H(\beta_E) = \frac{1}{2\tau^2}\|X^TX_E\beta_E\|^2_2 - \sum_{i \in E} \log\left(\frac{s_i\beta_{E,i}}{s_i\beta_{E,i}+1}\right) \] and \[ G(t) = \frac{t^2}{2\sigma^2} + \frac{1}{2\tau^2}\|{\cal N} + t\Gamma - (\lambda s_E, u_{-E})\|^2_2 - H^*\left(\frac{1}{\tau^2} X_E^TX\left({\cal N} + t\Gamma - (\lambda s_E, u_{-E})\right)\right) \]
Selective MLE \(\hat{\theta}^*\) solves \[ \frac{\hat{\theta}}{\sigma^2} = \frac{1}{\sigma^2} \nabla G^*(\hat{\theta}^* / \sigma^2) \]
Or, \[ \hat{\theta}^* = \sigma^2 \cdot \nabla G(\hat{\theta}) = \hat{\theta} + \frac{\sigma^2}{\tau^2}\Gamma^T \left (D - (\lambda s_E, u_{-E}) \right)- \frac{\sigma^2}{\tau^2}\Gamma^T X^TX_E\nabla H^*\left(\frac{1}{\tau^2} X_E^TX(D - (\lambda s_E, u_{-E}))\right) \]
\(\implies\) whatever our target, we need only solve the problem corresponding to \(H^*\)… adjustment for target \(\theta(F)\) depends only on \(\Gamma\)!
Natural parameter is \(\theta / \sigma^2\).
Observed (approximate) information is \[ \begin{aligned} \text{Var}^*(\hat{\theta}^*) &= \sigma^4 \left(\nabla^2 G^*(\hat{\theta}^* / \sigma^2)\right)^{-1} \\ &= \sigma^4 \nabla^2 G(\hat{\theta}) \\ &= \sigma^2 + \frac{\sigma^4 \|\Gamma\|^2_2}{\tau^2} - \frac{\sigma^4}{\tau^2} \Gamma^T X^TX_E \nabla^2 H^*\left(\frac{1}{\tau^2} X_E^TX(D - (\lambda s_E, u_{-E})) \right) X_E^TX\Gamma \end{aligned} \]
Remains to compute \(\nabla^2 H^*\)…
Recall \[ \hat{\beta}_E(\zeta_E) = \nabla H^*(\zeta_E) \]
Standard results yield \[ \begin{aligned} \nabla^2 H^*(\zeta_E) &= \nabla^2 H(\hat{\beta}_E(\zeta_E))^{-1} \\ \end{aligned} \] with \[ \nabla^2 H(\beta_E) = \frac{1}{\tau^2} X_E^TXX^TX_E + \text{diag}\left(\frac{1}{(s_i \beta_i)^2} - \frac{1}{(s_i \beta_i + 1)^2} \right). \]
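A sketch of \(\nabla^2 H\) as code (assumptions: the design `X`, active-set mask `E`, signs `s_E`, and randomization scale `tau` are given; the helper name `hessian_H` is illustrative):

```python
import numpy as np

def hessian_H(beta_E, X, E, s_E, tau):
    # (1/tau^2) X_E^T X X^T X_E + diag(1/(s_i b_i)^2 - 1/(s_i b_i + 1)^2)
    A = X.T.dot(X[:, E])            # X^T X_E
    quad = A.T.dot(A) / tau**2
    z = s_E * beta_E                # positive on the selection event
    return quad + np.diag(1 / z**2 - 1 / (z + 1)**2)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
E = np.zeros(10, bool)
E[:3] = True
s_E = np.array([1., -1., 1.])
H = hessian_H(np.array([0.5, -1.0, 2.0]), X, E, s_E, tau=1.0)
print(H.shape)  # (3, 3)
```

The barrier contribution is positive on the selection event (\(s_i\beta_i > 0\)), so the Hessian is positive definite there.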
Suppose \(\omega \sim N(0, \Theta^{-1})\); which pieces above change?
Suppose we work in the model \({\cal M} = \left\{Y|X \sim N(X\beta, \sigma^2 I): \beta \in \mathbb{R}^p, \sigma^2 >0 \right\}\) and we set \(\Theta^{-1} = \tau^2 (X^TX)\). Does anything simplify? Can you “make sense” of these quantities even when \(p>n\)?
When will these estimates (and the information) revert to the (preselection) MLE? Does this match intuition?
Suppose we allow ourselves to specify a prior having observed \(\hat{u}\).
We might choose features \(M\) (typically related to \(E\) but we allow researcher discretion) \[ {\cal M} = \left\{N(X_M\beta_M, \sigma^2_MI): \beta_M \in \mathbb{R}^M, \sigma^2_M > 0 \right\} \]
With a Gaussian prior on \(\beta_M\) (approximate) normalizing constant can be expressed as an optimization over \(\mathbb{R}^{|M|+|E|}\) \[ P(\text{sign}(\hat{\beta}_E(X^TY,\omega))=s_E | \hat{u}) \] where \(\hat{u}_E = s_E, \|\hat{u}_{-E}\|_{\infty} \leq \lambda\).
Smoothed Chernoff approximation replaces \(|E|\) affine constraints with \(|E|\) barrier terms.
Data \(Y \in \mathbb{R}^{97}, X \in \mathbb{R}^{97 \times 320}\)…
Randomization analogous to using 80% of data in first stage…
from selectinf.randomized.lasso import lasso, selected_targets
from selectinf.randomized.posterior_inference import posterior_inference_lasso
n, p, s = 500, 100, 5
X = np.random.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = np.sqrt(2 * np.log(p) / n)
Y = X.dot(beta) + np.random.standard_normal(n)

scale_ = np.std(Y)
# uses noise of variance n * scale_ / 4 by default
L = lasso.gaussian(X, Y, 3 * scale_ * np.sqrt(2 * np.log(p) * np.sqrt(n)))
signs = L.fit()
E = (signs != 0)

M = E.copy()
M[-3:] = 1
dispersion = np.linalg.norm(Y - X[:,M].dot(np.linalg.pinv(X[:,M]).dot(Y))) ** 2 / (n - M.sum())
(observed_target,
 cov_target,
 cov_target_score,
 alternatives) = selected_targets(L.loglike,
                                  L._W,
                                  M,
                                  dispersion=dispersion)
observed_target.shape, E.sum()
## ((12,), 9)
# interface a little clunky at the moment
# really just specifies quadratic of implied gaussian given subgradient
A_scaling = L.sampler.affine_con.linear_part
b_scaling = L.sampler.affine_con.offset
logdens_linear = L.sampler.logdens_transform[0]
posterior_inf = posterior_inference_lasso(observed_target,
                                          cov_target,
                                          cov_target_score,
                                          L.observed_opt_state,
                                          L.cond_mean,
                                          L.cond_cov,
                                          logdens_linear,
                                          A_scaling,
                                          b_scaling,
                                          observed_target)
samples = posterior_inf.posterior_sampler(nsample=400, nburnin=200, step=1.)
## /Users/jonathantaylor/git-repos/selectinf/selectinf/randomized/posterior_inference.py:90: RuntimeWarning: invalid value encountered in double_scalars
##   log_normalizer = -val - mean_marginal.T.dot(prec_marginal).dot(mean_marginal)/2
# 90% credible intervals from the posterior samples
lower_cred = np.percentile(samples, 5, axis=0)
upper_cred = np.percentile(samples, 95, axis=0)
beta_target = np.linalg.pinv(X[:,M]).dot(X.dot(beta))
coverage_cred = (lower_cred < beta_target) * (upper_cred > beta_target)
coverage_cred
## array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
##         True,  True,  True])
try:
    S = L.selective_MLE(observed_target,
                        cov_target,
                        cov_target_score)[0]
    S
except ValueError as err:
    print("{0}".format(err))
## not finding a feasible point
def simulate_random_lasso():
    n, p = 500, 100
    X = np.random.standard_normal((n, p))
    beta = np.zeros(p)
    Y = np.random.standard_normal(n)
    scale_ = np.std(Y)
    # uses noise of variance n * scale_ / 4 by default
    L = lasso.gaussian(X, Y, 3 * scale_ * np.sqrt(2 * np.log(p) * np.sqrt(n)))
    signs = L.fit()
    E = (signs != 0)
    M = E.copy()
    M[-3:] = 1
    dispersion = np.linalg.norm(Y - X[:,M].dot(np.linalg.pinv(X[:,M]).dot(Y))) ** 2 / (n - M.sum())
    (observed_target,
     cov_target,
     cov_target_score,
     alternatives) = selected_targets(L.loglike,
                                      L._W,
                                      M,
                                      dispersion=dispersion)
    S = L.selective_MLE(observed_target,
                        cov_target,
                        cov_target_score,
                        level=0.9)[0]
    cover = (S['lower'] < 0) * (S['upper'] > 0)
    return S['pvalue'], cover
p, c = simulate_random_lasso()
P, C = [], []
for _ in range(200):
    p, c = simulate_random_lasso()
    P.extend(p)
    C.extend(c)
('coverage', np.mean(C))
## ('coverage', 0.844179651695692)
Besides the LASSO, such a change of variables applies to step-up rules such as Benjamini-Hochberg when conditioning on the top \(\hat{k}\) (as well as the corresponding \({\cal N}\)).
Let’s try this out: \((n, p, s, \mu) = (2400, 1000, 30, 3.5)\) with \(\Sigma=I\) (chosen to be close to Candes & Barber (2014))
We first collect a sample of size \(0.8 * 2400\) and find BH selected variables.
Collect a followup sample of size \(0.2 * 2400\) for selected variables.
Want confidence intervals for parameters of selected set.
A two-sided step-up procedure can be expressed in terms of scores \((t_1,\dots,t_k)\) as \[ \hat{k} = \max \left\{i: |t_{(i)}| > c_i \right\} \] for cutoffs \(c_i\), with \(|t_{(1)}| \geq \dots \geq |t_{(k)}|\).
Benjamini-Hochberg: \(c_i = \Phi^{-1}(1 - \frac{q \cdot i}{2k})\)
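A sketch of the step-up rule with these cutoffs (the helper `step_up` and the scores are illustrative):

```python
import numpy as np
from scipy.stats import norm

def step_up(t, q=0.1):
    # select the top khat scores, khat = max{ i : |t|_(i) > c_i } with
    # c_i = Phi^{-1}(1 - q*i/(2k)) (Benjamini-Hochberg cutoffs)
    k = len(t)
    order = np.argsort(-np.abs(t))      # indices sorted by decreasing |t|
    cutoffs = norm.ppf(1 - q * np.arange(1, k + 1) / (2 * k))
    passed = np.abs(t[order]) > cutoffs
    if not passed.any():
        return np.array([], dtype=int)
    khat = np.max(np.nonzero(passed)[0]) + 1
    return order[:khat]

t = np.array([3.5, 0.2, -2.8, 1.0, 4.1])
print(step_up(t, q=0.1))  # indices of the selected scores
```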
Suppose \(D \sim N(\mu, \Sigma)\) and we data split or otherwise randomize.
Event becomes \[ |D_i + \omega_i| > c_{|E|} \forall i \in E, \qquad |D_j + \omega_j| \leq c_{|E|-1} \forall j \not \in E \]
Conditional on signs \(s_E\) (to make convex truncation) and \(U=(D_j + \omega_j)_{j \not \in E}\) we see \[ \pi(t,n\vert u) = P(s_E(n + \Gamma t)[E] + \omega_E > c_{|E|}). \]
If we make the assumption that the \(D_i\)’s are independent, then we can legitimately condition on \(U=(D_j)_{j \not \in E}\) and simply collect new data with \(\pi\) unchanged.
With dependence, conditioning on \((D_j)_{j \not \in E}\) does not remove dependence on nuisance parameters.
Using the usual decomposition \(D = {\cal N} + \Gamma \hat{\theta}\) would lead to hard constraints on \(\hat{\theta}\) when inspecting \(\{|D_j + \omega_j| \leq c_{|E|-1} \forall j \not \in E\}\).