From discrete to continuous: Lindsey’s method

  • For ordinal discrete \(X^D\) we considered at least two models:

\[ N_i \sim \text{Poisson}(\lambda_i) = \# \left\{s: X^D_s=i\right\} \]
  • Saturated: \(\lambda_i\) unconstrained

  • Ordinal with scores: \(\lambda_i = \beta_0 + \beta_1 \cdot i\)


Continuous

  • Suppose \(X\) is a continuous random variable supported on \([0,1]\) and we discretize

\[\begin{split} X^{\Delta} = i \iff X \in ((i-1)\Delta, i\Delta] \\ \end{split}\]
  • Our ordinal model for such \(X^{\Delta}\) looks like

\[ \exp(\lambda_i) \approx \Delta \cdot \exp \left( \gamma_0 + \gamma_1 (i \Delta) \right), \qquad \gamma_1 = \beta_1 / \Delta, \gamma_0 = \beta_0 - \log\Delta \]
  • Corresponds to a density of the form

\[ f_{\gamma}(x) \propto \exp(\gamma_0 + \gamma_1 \cdot x) \]
  • When fitting a loglinear GLM to \(X^{\Delta}\) with \(\Delta\) small we should have (assuming the density is well-specified)

\[ \exp(-\gamma_0) = \int_{[0,1]} e^{\gamma_1 x} \; dx \]

Basis expansion

  • No reason to restrict scores to just be \(i\)

  • Can extend from \([0,1]\) to all of \(\mathbb{R}\) by adding “bins” of the form \([C,\infty)\) and \((-\infty,C]\) (with unrestricted parameters)

  • Thinking of index as \(i \cdot \Delta\) consider a model

\[ \exp(\lambda_i) = \Delta \cdot \exp \left(\gamma_0 + \sum_j \gamma_j h_j(i \cdot \Delta) \right) \]
  • This is a discretization of a density

\[ f(x) \propto \exp\left(\sum_j \gamma_j h_j(x) \right), \qquad \exp(-\gamma_0) = \int_{[-C,C]} \exp\left(\sum_j \gamma_j h_j(x) \right) \; dx \]
  • Taking \(h_1(x)=x, h_2(x)=x^2\) we arrive at (discretization) of a Gaussian.


Custom families: Lindsey’s method

  • Bins need not be of same size – correspond to an offset in a GLM.

  • “Representative point” of a bin need not be the right hand endpoint, could be midpoint, left end point….

\[ \exp(\lambda_i) = \Delta_i \cdot \exp \left(\gamma_0 + \sum_j \gamma_j h_j(t_i) \right) \]
X = rnorm(20000)
bins = unique(c(seq(-4,-2,by=0.2), 
                seq(-2.1,2,by=0.1),
                seq(2.2, 4, by=0.2)))
Y = table(cut(X, breaks=bins))
midpts = (bins[1:(length(bins)-1)] + bins[2:length(bins)])/2
widths = bins[2:length(bins)] - bins[1:(length(bins)-1)]
barplot(Y, widths)
../_images/Lindseys_method_1_0.png
df = cbind(Y, midpts, midpts^2)
colnames(df) = c('N', 'linear', 'quadratic')
df = data.frame(df)
M = glm(N ~ linear + quadratic, 
        family=poisson, 
        offset=log(widths),
        data=df)
Warning message in log(widths):
“NaNs produced”

summary(M)
Call:
glm(formula = N ~ linear + quadratic, family = poisson, data = df, 
    offset = log(widths))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-7.0377  -0.5934   0.1681   0.9296   2.3041  

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept)  8.981528   0.008706 1031.682  < 2e-16 ***
linear       0.027773   0.007092    3.916 8.99e-05 ***
quadratic   -0.507006   0.005051 -100.382  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 26759.61  on 59  degrees of freedom
Residual deviance:   154.95  on 57  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 560.24

Number of Fisher Scoring iterations: 4

M_no_offset = glm(N ~ linear + quadratic, 
        family=poisson, 
        data=df)
summary(M_no_offset)
Call:
glm(formula = N ~ linear + quadratic, family = poisson, data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9386  -1.2079  -0.1816   0.9336   4.5798  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  6.635284   0.008815 752.716  < 2e-16 ***
linear       0.018348   0.007010   2.618  0.00886 ** 
quadratic   -0.431763   0.005172 -83.482  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 16321.28  on 60  degrees of freedom
Residual deviance:   184.92  on 58  degrees of freedom
AIC: 596.75

Number of Fisher Scoring iterations: 4

Joint distributions

  • Suppose \((X,Y)\) are two discrete ordinal variables and \(N_{ij} \sim \text{Poisson}(\lambda_{ij})\) is an \(I\times J\) contingency table representing a sample from \((X,Y)\).

  • Ordinal (independence): \(\lambda_{ij} = \beta_0 + \beta_X i + \beta_Y j\).

  • Ordinal with dependence?: \(\lambda_{ij} = \beta_0 + \beta_X i + \beta_Y j + \beta^{XY}_{ij}\).


Back to discretization

  • Suppose \((X^{\Delta}, Y^{\Delta})\) are such discretizations

  • Basis expansion (independence): \(\lambda_{ij} = \gamma_0 + \sum_j \gamma^X_j h^X_j(i\Delta) + \sum_l \gamma^Y_l h_l^Y(j\Delta)\)


What about dependence?

  • Add a term of the form

\[ \sum_{j'l'} \gamma^{XY}_{j'l'} h^{XY}_{j'l'}(i\Delta,j\Delta). \]
  • Dependence modeled by matrix \(\gamma^{XY}\).

What kind of \(h^{XY}\)?

  • Convenient to consider tensor products \(h^{XY}_{j'l'} = g^X_{j'l'}(x) g^Y_{j'l'}(y)\).

  • Example: if we set \(h^X_1(s)=h^Y_1(s)=s, h^X_2(s)=h^Y_2(s)=s^2\) and \(h^{XY}_{11}(x,y)=xy\) \(\implies\) Gaussian!

  • Use of tensor products can make conditionals nice (i.e. “simpler” pseudo-likelihood).


Mix of discrete and Gaussian

  • Suppose \(X\) is continuous and \(Y\) is discrete with \(K\) levels

  • For “main effects” of \(X\) we take \(h^X_1(x)=x, h^X_2(x)=x^2\).

  • For “main effects” of \(Y\) we take one-hot encoding \(h^Y_k(y) = \delta_j(y), 1 \leq k \leq K\).

  • For “interactions” we take \(h^{XY}_{1k}(x,y) = x \cdot \delta_k(y)\) – tensor product of (some) terms in the “main effect”. (Could also have the variance change as well.)

  • \(\implies\) conditionals \(X|Y\) and \(Y|X\) will be in the same family as implied by “main effect”.

  • \(X|Y\) is Gaussian (equivalent to lm(X ~ factor(Y)))

  • \(Y|X\) is multinomial (equivalent to a baseline multinomial logit nnet::multinom(Y ~ X)).

Parameters

  • The “interactions” are encoded by \(\gamma_{1,k} \in \mathbb{R}^{1 \times K}\).


Two discrete variables

  • Suppose \(Z\) is discrete with \(L\) levels and \(Y\) is discrete with \(K\) levels

  • For “main effects” of \(Z\) we take one-hot encoding \(h^Z_l(l)=\delta_l(z) , 1 \leq l \leq L\).

  • For “main effects” of \(Y\) we take one-hot encoding \(h^Y_k(y) = \delta_j(y), 1 \leq k \leq K\).

  • For “interactions” we take \(h^{ZY}_{lk}(z,y) = \delta_l(z) \delta_k(y)\) – tensor product of terms in the “main effect”.

Parameters

  • The “interactions” are encoded by \(\gamma_{l,k} \in \mathbb{R}^{L \times K}\).


Several continuous and discrete variables

  • For all continuous variables use the Gaussian encoding with \(x \mapsto x, x \mapsto x^2\) as main effects.

  • For all discrete variables use the one-hot encoding \(x \mapsto \delta_l(x)\)

Interactions

  • Use the tensor product formulation, but for any continuous interactions use only the \(x \mapsto x\) main effect.

  • Continuous / continuous: a single parameter \(\mathbb{R}\)

  • Continuous / discrete with \(L\) levels: a vector \(\mathbb{R}^L\)

  • Discrete with \(L\) levels / discrete with \(K\) levels: a matrix in \(\mathbb{R}^{L \times K}\).

Pseudo-likelihoods

  • For every continuous variable: Gaussian linear model

  • For every discrete variable: baseline multinomial logit


Conditional independence

  • Absence of any interaction \(\implies\) conditional independence.

Selecting edges: group LASSO

  • Want a penalty that can zero-out an entire vector (continuous / discrete) or matrix (discrete / discrete).

  • Consider

\[\begin{split} \begin{aligned} \hat{\beta} &= \text{argmin}_{\beta} \frac{1}{2} \|Z-\beta\|^2_2 + \lambda \|\beta\|_2 \\ &= \begin{cases} 0 & \|Z\|_2 \leq \lambda \\ \frac{Z}{\|Z\|_2} \cdot (\|Z\|_2-\lambda, 0) & \text{otherwise.} \end{cases} \\ &= \max(\|Z\|_2-\lambda, 0) \cdot \frac{Z}{\|Z\|_2} \end{aligned} \end{split}\]
  • For matrices use Frobenius (Euclidean norm of matrix entries)

\[ \|\beta\|_F = \text{Tr}(\beta'\beta)^{1/2} \]
  • Combine with pseudo-likelihood and proximal gradient and away we go…