From discrete to continuous: Lindsey’s method¶

For ordinal discrete \(X^D\) we considered at least two models:

\[ N_i \sim \text{Poisson}(\lambda_i) = \# \left\{s: X^D_s=i\right\} \]

Saturated: \(\lambda_i\) unconstrained
Ordinal with scores: \(\lambda_i = \beta_0 + \beta_1 \cdot i\)

Continuous¶

Suppose \(X\) is a continuous random variable supported on \([0,1]\) and we discretize

\[\begin{split} X^{\Delta} = i \iff X \in ((i-1)\Delta, i\Delta] \\ \end{split}\]

Our ordinal model for such \(X^{\Delta}\) looks like

\[ \exp(\lambda_i) \approx \Delta \cdot \exp \left( \gamma_0 + \gamma_1 (i \Delta) \right), \qquad \gamma_1 = \beta_1 / \Delta, \gamma_0 = \beta_0 - \log\Delta \]

Corresponds to a density of the form

\[ f_{\gamma}(x) \propto \exp(\gamma_0 + \gamma_1 \cdot x) \]

When fitting a loglinear GLM to \(X^{\Delta}\) with \(\Delta\) small we should have (assuming the density is well-specified)

\[ \exp(-\gamma_0) = \int_{[0,1]} e^{\gamma_1 x} \; dx \]

Basis expansion¶

No reason to restrict scores to just be \(i\)…
Can extend from \([0,1]\) to all of \(\mathbb{R}\) by adding “bins” of the form \([C,\infty)\) and \((-\infty,C]\) (with unrestricted parameters)
Thinking of index as \(i \cdot \Delta\) consider a model

\[ \exp(\lambda_i) = \Delta \cdot \exp \left(\gamma_0 + \sum_j \gamma_j h_j(i \cdot \Delta) \right) \]

This is a discretization of a density

\[ f(x) \propto \exp\left(\sum_j \gamma_j h_j(x) \right), \qquad \exp(-\gamma_0) = \int_{[-C,C]} \exp\left(\sum_j \gamma_j h_j(x) \right) \; dx \]

Taking \(h_1(x)=x, h_2(x)=x^2\) we arrive at (discretization) of a Gaussian.

Custom families: Lindsey’s method ¶

Bins need not be of same size – correspond to an offset in a GLM.
“Representative point” of a bin need not be the right hand endpoint, could be midpoint, left end point….

\[ \exp(\lambda_i) = \Delta_i \cdot \exp \left(\gamma_0 + \sum_j \gamma_j h_j(t_i) \right) \]

X = rnorm(20000)
bins = unique(c(seq(-4,-2,by=0.2), 
                seq(-2.1,2,by=0.1),
                seq(2.2, 4, by=0.2)))
Y = table(cut(X, breaks=bins))
midpts = (bins[1:(length(bins)-1)] + bins[2:length(bins)])/2
widths = bins[2:length(bins)] - bins[1:(length(bins)-1)]
barplot(Y, widths)

df = cbind(Y, midpts, midpts^2)
colnames(df) = c('N', 'linear', 'quadratic')
df = data.frame(df)
M = glm(N ~ linear + quadratic, 
        family=poisson, 
        offset=log(widths),
        data=df)

Warning message in log(widths):
“NaNs produced”

summary(M)

Call:
glm(formula = N ~ linear + quadratic, family = poisson, data = df, 
    offset = log(widths))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-7.0377  -0.5934   0.1681   0.9296   2.3041  

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept)  8.981528   0.008706 1031.682  < 2e-16 ***
linear       0.027773   0.007092    3.916 8.99e-05 ***
quadratic   -0.507006   0.005051 -100.382  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 26759.61  on 59  degrees of freedom
Residual deviance:   154.95  on 57  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 560.24

Number of Fisher Scoring iterations: 4

M_no_offset = glm(N ~ linear + quadratic, 
        family=poisson, 
        data=df)
summary(M_no_offset)

Call:
glm(formula = N ~ linear + quadratic, family = poisson, data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9386  -1.2079  -0.1816   0.9336   4.5798  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  6.635284   0.008815 752.716  < 2e-16 ***
linear       0.018348   0.007010   2.618  0.00886 ** 
quadratic   -0.431763   0.005172 -83.482  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 16321.28  on 60  degrees of freedom
Residual deviance:   184.92  on 58  degrees of freedom
AIC: 596.75

Number of Fisher Scoring iterations: 4

Joint distributions¶

Suppose \((X,Y)\) are two discrete ordinal variables and \(N_{ij} \sim \text{Poisson}(\lambda_{ij})\) is an \(I\times J\) contingency table representing a sample from \((X,Y)\).
Ordinal (independence): \(\lambda_{ij} = \beta_0 + \beta_X i + \beta_Y j\).
Ordinal with dependence?: \(\lambda_{ij} = \beta_0 + \beta_X i + \beta_Y j + \beta^{XY}_{ij}\).

Back to discretization¶

Suppose \((X^{\Delta}, Y^{\Delta})\) are such discretizations
Basis expansion (independence): \(\lambda_{ij} = \gamma_0 + \sum_j \gamma^X_j h^X_j(i\Delta) + \sum_l \gamma^Y_l h_l^Y(j\Delta)\)

What about dependence?¶

Add a term of the form

\[ \sum_{j'l'} \gamma^{XY}_{j'l'} h^{XY}_{j'l'}(i\Delta,j\Delta). \]

Dependence modeled by matrix \(\gamma^{XY}\).

What kind of \(h^{XY}\)?¶

Convenient to consider tensor products \(h^{XY}_{j'l'} = g^X_{j'l'}(x) g^Y_{j'l'}(y)\).
Example: if we set \(h^X_1(s)=h^Y_1(s)=s, h^X_2(s)=h^Y_2(s)=s^2\) and \(h^{XY}_{11}(x,y)=xy\) \(\implies\) Gaussian!
Use of tensor products can make conditionals nice (i.e. “simpler” pseudo-likelihood).

Mix of discrete and Gaussian ¶

Suppose \(X\) is continuous and \(Y\) is discrete with \(K\) levels
For “main effects” of \(X\) we take \(h^X_1(x)=x, h^X_2(x)=x^2\).
For “main effects” of \(Y\) we take one-hot encoding \(h^Y_k(y) = \delta_j(y), 1 \leq k \leq K\).
For “interactions” we take \(h^{XY}_{1k}(x,y) = x \cdot \delta_k(y)\) – tensor product of (some) terms in the “main effect”. (Could also have the variance change as well.)
\(\implies\) conditionals \(X|Y\) and \(Y|X\) will be in the same family as implied by “main effect”.
\(X|Y\) is Gaussian (equivalent to lm(X ~ factor(Y)))
\(Y|X\) is multinomial (equivalent to a baseline multinomial logit nnet::multinom(Y ~ X)).

Parameters¶

The “interactions” are encoded by \(\gamma_{1,k} \in \mathbb{R}^{1 \times K}\).

Two discrete variables¶

Suppose \(Z\) is discrete with \(L\) levels and \(Y\) is discrete with \(K\) levels
For “main effects” of \(Z\) we take one-hot encoding \(h^Z_l(l)=\delta_l(z) , 1 \leq l \leq L\).
For “main effects” of \(Y\) we take one-hot encoding \(h^Y_k(y) = \delta_j(y), 1 \leq k \leq K\).
For “interactions” we take \(h^{ZY}_{lk}(z,y) = \delta_l(z) \delta_k(y)\) – tensor product of terms in the “main effect”.

Parameters¶

The “interactions” are encoded by \(\gamma_{l,k} \in \mathbb{R}^{L \times K}\).

Several continuous and discrete variables¶

For all continuous variables use the Gaussian encoding with \(x \mapsto x, x \mapsto x^2\) as main effects.
For all discrete variables use the one-hot encoding \(x \mapsto \delta_l(x)\)

Interactions¶

Use the tensor product formulation, but for any continuous interactions use only the \(x \mapsto x\) main effect.
Continuous / continuous: a single parameter \(\mathbb{R}\)
Continuous / discrete with \(L\) levels: a vector \(\mathbb{R}^L\)
Discrete with \(L\) levels / discrete with \(K\) levels: a matrix in \(\mathbb{R}^{L \times K}\).

Pseudo-likelihoods¶

For every continuous variable: Gaussian linear model
For every discrete variable: baseline multinomial logit

Conditional independence¶

Absence of any interaction \(\implies\) conditional independence.

Selecting edges: group LASSO¶

Want a penalty that can zero-out an entire vector (continuous / discrete) or matrix (discrete / discrete).
Consider

\[\begin{split} \begin{aligned} \hat{\beta} &= \text{argmin}_{\beta} \frac{1}{2} \|Z-\beta\|^2_2 + \lambda \|\beta\|_2 \\ &= \begin{cases} 0 & \|Z\|_2 \leq \lambda \\ \frac{Z}{\|Z\|_2} \cdot (\|Z\|_2-\lambda, 0) & \text{otherwise.} \end{cases} \\ &= \max(\|Z\|_2-\lambda, 0) \cdot \frac{Z}{\|Z\|_2} \end{aligned} \end{split}\]

For matrices use Frobenius (Euclidean norm of matrix entries)

\[ \|\beta\|_F = \text{Tr}(\beta'\beta)^{1/2} \]

Combine with pseudo-likelihood and proximal gradient and away we go…

STATS 305B

From discrete to continuous: Lindsey’s method¶

Continuous¶

Basis expansion¶

Custom families: Lindsey’s method¶

Joint distributions¶

Back to discretization¶

What about dependence?¶

What kind of \(h^{XY}\)?¶

Mix of discrete and Gaussian¶

Parameters¶

Two discrete variables¶

Parameters¶

Several continuous and discrete variables¶

Interactions¶

Pseudo-likelihoods¶

Conditional independence¶

Selecting edges: group LASSO¶

Custom families: Lindsey’s method ¶

Mix of discrete and Gaussian ¶