Assignment 3#

You may discuss homework problems with other students, but you have to prepare the written assignments yourself.

Please combine all your answers, the computer code and the figures into one PDF file, and submit a copy to your folder on Gradescope.

Grading scheme: 10 points per question, total of 50.

Due date: 11:59 PM Friday May 30, 2025

Download RMarkdown#

Part A#

The sleepstudy dataset in lme4 provides a simple example of a mixed effects model. In this problem we’ll explore some aspects of the model fit below:

library(lme4)
data(sleepstudy)
M = lmer(Reaction ~ Days + (1|Subject) + (0+Days|Subject), sleepstudy, subset=Days>=2)
M
Loading required package: Matrix
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (1 | Subject) + (0 + Days | Subject)
   Data: sleepstudy
 Subset: Days >= 2
REML criterion at convergence: 1404.626
Random effects:
 Groups    Name        Std.Dev.
 Subject   (Intercept) 28.843  
 Subject.1 Days         6.285  
 Residual              25.747  
Number of obs: 144, groups:  Subject, 18
Fixed Effects:
(Intercept)         Days  
     245.10        11.44  

You can load the data into Python with

import statsmodels.api as sm
df = sm.datasets.get_rdataset(dataname='sleepstudy', package='lme4').data
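
The lmer fit above uses subset=Days >= 2, so to compare against M you would restrict the pandas data frame in the same way; a minimal sketch:

df = df[df['Days'] >= 2]   # keep days 2 through 9, as in the lmer call
df.shape                   # (144, 3), matching the 144 observations reported above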

Each of the 18 subjects has 10 reaction times recorded; call these \(Y_{ij}, 1 \leq i \leq 18, 1 \leq j \leq 10\), and let \(D_{ij}\) denote the corresponding day. (With the subset Days >= 2 used above, only 8 of the 10 observations per subject enter the fit, which accounts for the 144 observations reported by lmer.) The model specified by the above formula has a random intercept for Subject as well as a random slope for Days, the number of days of sleep deprivation. The formula is specified so that these random effects are assumed independent. That is, the model fit is

\[\begin{split} \begin{aligned} Y_{ij} | D_{ij} &\sim (\beta_0 + \alpha_{0,i}) + (\beta_D + \alpha_{D,i}) D_{ij} + \epsilon_{ij} \\ \epsilon_{ij} & \overset{IID}{\sim} N(0, \sigma^2_{\epsilon}) \\ \alpha_{0,i} & \overset{IID}{\sim} N(0,\sigma^2_0) \\ \alpha_{D,i} & \overset{IID}{\sim} N(0, \sigma^2_D) \end{aligned} \end{split}\]

with all groups of errors independent above.
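
(Optional, not needed for the questions below.) If you want a rough Python point of comparison for M, one possibility is statsmodels' MixedLM, using a variance component for the slope so that the random intercept and slope are independent, as in the lmer formula. The call below is only a hedged sketch, and the estimates should only approximately match the lmer output:

import statsmodels.api as sm
import statsmodels.formula.api as smf

df = sm.datasets.get_rdataset(dataname='sleepstudy', package='lme4').data
df = df[df['Days'] >= 2]                           # same subset as the lmer call
md = smf.mixedlm("Reaction ~ Days", df, groups=df["Subject"],
                 re_formula="1",                   # random intercept per Subject
                 vc_formula={"Days": "0 + Days"})  # independent random slope in Days
fit = md.fit(reml=True)
print(fit.summary())                               # compare with the lmer output above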

  1. Assume Y has the Reaction times of each subject stacked together. Give a design matrix X describing the fixed effects of this model and a separate design matrix Z describing the random effects so that we can make sense of

\[ \begin{aligned} Y = X\beta + Z\Gamma + \epsilon \end{aligned} \]

Above, \(\Gamma\) should be a concatenation of the \(\alpha_0\)’s and \(\alpha_D\)’s. What dimension is \(X\)? What dimension is \(Z\)? (A rough computational sketch of these objects appears after this list of questions.)

  2. Considering \(X\) and \(Z\) as fixed, non-random matrices, express \(\text{Cov}(Y)\) as a non-negative linear combination of three fixed matrices.

  3. Write out the log-likelihood \(\ell\) of \(Y|X\) in terms of \(\theta=(\beta_0, \beta_D, \sigma^2_{\epsilon}, \sigma^2_0, \sigma^2_D)\).

  4. Treating the \(\alpha_0\)’s and \(\alpha_D\)’s as latent variables, write out the complete log-likelihood \(\ell_c\) as well as its conditional expectation (given \(Y\)) at a specified value of \(\theta\).

  5. Verify Fisher’s (also known as Louis’) identity (don’t worry too much about what is needed for interchange of expectation and differentiation):

\[ \nabla \ell(\theta; Y) = \mathbb{E}_{\theta}[\nabla \ell_c(\theta; Y, \alpha_0, \alpha_D)|Y].\]
  6. Try to verify that \(\hat{\theta}\), the vector of estimates in M, indeed satisfies \(\nabla \ell(\hat{\theta};Y)=0\).
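
As mentioned in question 1, here is a rough computational sketch, in numpy, of one way to build the stacked design matrices and the marginal log-likelihood (relevant to questions 1–3 and 6). The names Y, X, S, Z0, ZD, Z and loglik are ours, not part of the assignment, and note that M was fit by REML, so the plain ML score need not vanish exactly at the printed estimates.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import multivariate_normal

df = sm.datasets.get_rdataset(dataname='sleepstudy', package='lme4').data
df = df[df['Days'] >= 2]

Y = df['Reaction'].to_numpy()                          # stacked responses
X = np.column_stack([np.ones(len(df)), df['Days']])    # fixed-effects design (intercept, Days)
S = pd.get_dummies(df['Subject']).to_numpy(float)      # subject indicator matrix
Z0 = S                                                 # columns multiplying the alpha_0's
ZD = S * df['Days'].to_numpy()[:, None]                # columns multiplying the alpha_D's
Z = np.hstack([Z0, ZD])                                # random-effects design
print(X.shape, Z.shape)

def loglik(beta0, betaD, s2_eps, s2_0, s2_D):
    """Marginal Gaussian log-likelihood of Y given X, with Cov(Y) written as
    a non-negative combination of three fixed matrices."""
    mean = X @ np.array([beta0, betaD])
    cov = s2_eps * np.eye(len(Y)) + s2_0 * Z0 @ Z0.T + s2_D * ZD @ ZD.T
    return multivariate_normal(mean=mean, cov=cov).logpdf(Y)

# evaluate at the estimates printed by lmer (standard deviations squared):
print(loglik(245.10, 11.44, 25.747**2, 28.843**2, 6.285**2))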

Part B#

This problem will consider some aspects of Kernel PCA. Given a data matrix \(X \in \mathbb{R}^{n \times p}\), let \(K\) be a covariance kernel on \(\mathbb{R}^p\).

  1. Describe how (with centering but no scaling) the PCA scores of \(X\) can be found from an eigen-decomposition of

\[ \left(I_n - n^{-1} 1_n 1_n'\right) XX'\left(I_n - n^{-1} 1_n 1_n'\right). \]
  2. Suppose you have an orthonormal basis \((\phi_i)_{1 \leq i \leq \text{dim}({\cal H})}\) of \({\cal H}\), the (assumed separable) reproducing kernel Hilbert space associated to \(K\). Verify that for every \(x,y \in \mathbb{R}^p\) the following holds

\[ K(x,y) = \sum_{i=1}^{\text{dim}({\cal H})} \phi_i(x) \cdot \phi_i(y). \]
  3. Let’s take our orthonormal basis and make a new (possibly very large) data matrix

\[ W_{n \times \text{dim}({\cal H})} = \begin{pmatrix} \phi_1(X) & \dots & \phi_k(X) & \dots \end{pmatrix} \]

with \(W_{ij} = \phi_j(X_i)\). Compute \(WW'_{n \times n}\). How does this matrix depend on your choice of basis? What is \((WW')_{ij}\)?

  4. Using \(WW'\) and its centering

\[ \left(I_n - n^{-1} 1_n 1_n'\right) WW'\left(I_n - n^{-1} 1_n 1_n'\right), \]

describe how to find the first \(m\) PCA scores \(Z_i \in \mathbb{R}^n, 1 \leq i \leq m\) (with centering but no scaling) for the data matrix \(W\).

  5. Instead of a loading vector (which would depend on our basis), we might consider finding

\[ \hat{h}_1 = \text{argmax}_{\{h \in {\cal H}: \|h\|_{\cal H} \leq 1\}} \widehat{\text{Var}}(h(X)) \]

with

\[ \widehat{\text{Var}}(h(X)) = \frac{1}{n}\sum_{i=1}^n h(X_i)^2 - \left(\frac{1}{n} \sum_{i=1}^n h(X_i) \right)^2 \]

being the sample variance of \(h(X)\). Show that \(\hat{h}_1\) takes on the expected form, i.e. there exists \(\hat{\alpha}^1 \in \mathbb{R}^n\) such that for every \(x \in \mathbb{R}^p\)

\[ \hat{h}_1(x) = \sum_{j=1}^n \hat{\alpha}^1_j K(X_j,x). \]

Having found \(\hat{h}_1\), we know that we should have \(Z_1 = \left(I_n - n^{-1} 1_n 1_n'\right)\left(\hat{h}_1(X_1), \dots, \hat{h}_1(X_n)\right)'\). Check this. Can you find \(\hat{\alpha}^1\)? Denoting the functions \(\hat{h}_j\) for \(1 \leq j \leq n-1\), can you find the others? Verify that \(\langle \hat{h}_i, \hat{h}_j \rangle_{\cal H}=\delta_{ij}\) (for \(i,j\) less than the rank of the matrix in 4.). (Hint: verify that both the \(WW'\hat{\alpha}^j\)’s and \(Z_j\)’s are in the column space of the matrix in 4.) (Hint: verify that the \(\hat{\alpha}^j\)’s are centered, then check that ESL’s suggested solutions indeed solve the problem in the sense that the centered fit yields \(Z_{1,j}\) and the \({\cal H}\) norm is 1. To verify they are centered, write a candidate \(\alpha\) as \(R\gamma + b\) where \(R=I_n-n^{-1}1_n1_n'\) and the decomposition is “orthogonal” in \({\cal H}\).)

  6. When using the linear kernel \(K(x,y)=x'y\), verify that \(\hat{h}_1(x)=V_1'x\) where \(V_1\) is the usual first loading vector in PCA.

  7. Apply kernel PCA to the digits data from Assignment #2. Use a few different kernels and look at the first two principal components. Is kernel PCA any better than regular PCA (i.e. using a linear kernel) at separating the classes? (A scikit-learn starting point is sketched below.)
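
As a starting point for question 7, here is a hedged sketch using scikit-learn. It uses sklearn’s built-in digits data as a stand-in for the Assignment #2 data (swap in that dataset), and the kernels and parameters shown (e.g. gamma, degree) are arbitrary choices you would want to vary:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA

X, y = load_digits(return_X_y=True)

fits = {
    'linear (ordinary PCA)': PCA(n_components=2),
    'RBF kernel': KernelPCA(n_components=2, kernel='rbf', gamma=1e-3),
    'polynomial kernel (degree 3)': KernelPCA(n_components=2, kernel='poly', degree=3),
}

fig, axes = plt.subplots(1, len(fits), figsize=(12, 4))
for ax, (name, fit) in zip(axes, fits.items()):
    Z = fit.fit_transform(X)                 # first two (kernel) PCA scores
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=5, cmap='tab10')
    ax.set_title(name)
plt.tight_layout()
plt.show()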

Part C#

Let \((X, Y)\) be a pair of random vectors on \(\mathbb{R}^p \times \{1, \dots, K\}\). Their joint law is parameterized by a mixed undirected graphical model. That is, its density is proportional to

\[ \begin{aligned} (x,y) \mapsto \exp \left[ -\frac{1}{2} \sum_{i,j=1}^p \Theta^{XX}_{ij} x_i x_j + \sum_{i=1}^p \Theta^X_i x_i + \sum_{i=1}^p \sum_{k=1}^K \Theta^{XY}_{ik} \cdot y_k \cdot x_i + \sum_{k=1}^K \Theta^Y_k \cdot y_k \right]. \end{aligned} \]
  1. Show this density model is essentially the LDA (Linear Discriminant Analysis) model:

\[\begin{split} \begin{aligned} Y & \sim \text{Multinomial}(\pi_1, \dots, \pi_K) \\ X|Y=k & \sim N(\mu_k, \Sigma) \end{aligned} \end{split}\]

Express the parameters \(\mu_k, \pi, \Sigma\) in terms of \(\Theta\) above (the map may be many-to-one).

  2. Modify the original joint density to get a model that yields QDA (Quadratic Discriminant Analysis). Keeping with the notation \(\Theta\) with a superscript indicating the interaction, which parameters have you added? (By interaction we mean \(\Theta^{XX}\) is paired with terms of the form \(xx'\) and \(\Theta^{XY}\) is paired with \(xy'\)).

  3. The work of Hastie and Lee (2015) includes penalties on \(\Theta\) to induce structure in the parameters. Using penalties or linear constraints, modify your solution to 2. to arrive at the Gaussian Naive Bayes estimator.

  4. Impose constraints on your \(\Theta\) to arrive at Diagonal LDA (c.f. Section 18.2 of ESL). Using the group LASSO-like penalties suggested in Hastie and Lee (2015) would introduce certain sparsity patterns in \(\Theta\). Sketch how you could fit this model, using either a pseudo-likelihood or a more direct parameterization such as the LDA model in 1. above. Are these the same as the sparsity patterns in Nearest Shrunken Centroid (also in 18.2 of ESL)?
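
These questions are about the parameterization rather than about software, but for orientation, the models discussed above correspond to standard estimators (LDA, QDA, Gaussian Naive Bayes, and, for the comparison in question 4, nearest shrunken centroids). The sketch below is our choice, not part of the assignment; it fits them with scikit-learn on sklearn’s digits data as a stand-in, with arbitrary regularization and shrinkage values:

from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestCentroid

X, y = load_digits(return_X_y=True)

models = {
    'LDA (question 1)': LinearDiscriminantAnalysis(),
    'QDA (question 2)': QuadraticDiscriminantAnalysis(reg_param=0.1),
    'Gaussian Naive Bayes (question 3)': GaussianNB(),
    'Nearest Shrunken Centroid (compare with question 4)': NearestCentroid(shrink_threshold=0.5),
}
for name, model in models.items():
    print(name, model.fit(X, y).score(X, y))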

Part D#

Lindsey’s method is a technique for flexible exponential family modeling, representing a density as

\[ f(x) \propto \exp\left(\sum_j \beta_j t_j(x)\right). \]

It is used, among other places, in Efron’s local FDR modeling, as described in the Efron-Tibshirani paper on this method (the illustrative figure from that work is not reproduced here).
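
To make the recipe concrete before the questions, here is a hedged one-dimensional sketch of Lindsey’s trick: bin the observations, then fit the bin counts with a Poisson GLM whose covariates are the sufficient statistics evaluated at the bin midpoints. The toy data, number of bins and polynomial sufficient statistics below are illustrative choices, not prescribed by anything above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)                        # toy data

counts, edges = np.histogram(x, bins=50)
mids = 0.5 * (edges[:-1] + edges[1:])                # prototype point for each bin

T = np.column_stack([mids**j for j in range(1, 6)])  # sufficient statistics t_j(x) = x^j
design = sm.add_constant(T)
fit = sm.GLM(counts, design, family=sm.families.Poisson()).fit()

# exp of the fitted linear predictor is proportional to the estimated density on the grid
density = np.exp(design @ fit.params)
density /= density.sum() * (edges[1] - edges[0])     # normalize to integrate to 1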

  1. Modify the mixed graphical model of Part C so that within each component class \(X\) is drawn from a user specified exponential family (i.e. a specification of a set of sufficient statistics \(t_i:\mathbb{R}^p \to \mathbb{R}\)). Write down a likelihood that fits separate models in each class.

  2. The fit for Lindsey’s method is based on a discretization of feature space and reduction to a Poisson GLM (see Section 2 of the Efron-Tibshirani paper cited above). Suppose we have 3 sufficient statistics T_1, T_2 and T_3, and that the carrier measure has been estimated at the prototype point for each element of the discretization and stored in mu_0. Write a stylized call to glm that would fit this model. Assume the labels are stored as Y.

  3. Mixture Discriminant Analysis (c.f. Section 12.7 of ESL) is another method that fits a non-Gaussian density within each class. Unlike the proposal in 1., it often shares structure in the form of the covariance matrix. Modify your stylized call to glm so that T_1 is shared across classes.