Multinomial models (Ch. 8)

  • We’ve spent a lot of time on binary / Binomial regression models.

  • Sometimes the response has more than two categories: we need a model for multinomial responses.

  • Solution? Why not view the multinomial as an exponential family?


Multinomial

  • Mass function of \(Y \sim \text{Multinomial}(N,\pi)\) for \(\pi \in S_K\), the simplex in \(\mathbb{R}^K\):

\[ f(y_1, \dots, y_K|\pi) = \binom{N}{y_1,\dots,y_K} \prod_{j=1}^K \pi_j^{y_j}, \qquad \sum_{j=1}^K y_j=N. \]

  • Density with respect to counting measure on

\[ \left\{(z_1, \dots, z_K) \in \mathbb{Z}^K: \sum_{j=1}^K z_j = N, z_j \geq 0 \right\} \]
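The mass function above is easy to evaluate directly. A minimal Python sketch (the helper name `multinomial_pmf` is mine, not from the notes):

```python
from math import comb, prod

def multinomial_pmf(y, pi):
    """f(y | pi) = multinomial coefficient times prod_j pi_j^{y_j}."""
    N = sum(y)
    # N! / (y_1! ... y_K!) built as a product of binomial coefficients
    coef, rem = 1, N
    for yj in y:
        coef *= comb(rem, yj)
        rem -= yj
    return coef * prod(p ** yj for p, yj in zip(pi, y))

# N = 4, K = 3: coefficient is 12, so pmf = 12 * 0.5^2 * 0.25 * 0.25
print(multinomial_pmf([2, 1, 1], [0.5, 0.25, 0.25]))  # 0.1875
```

Summing the pmf over all counts with \(\sum_j y_j = N\) recovers 1, confirming it is a density with respect to counting measure on that set.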


Multinomial

  • We fix category \(K\) as the baseline.

  • Setting \(\pi_K = 1 - \sum_{j=1}^{K-1} \pi_j\), \(y_K = N - \sum_{j=1}^{K-1} y_j\) yields equivalent representation

\[ f(y_1, \dots, y_{K-1}|\pi) = \binom{N}{y_1,\dots,y_K} \left(\prod_{j=1}^{K-1} \pi_j^{y_j} \right) \left(1 - \sum_{j=1}^{K-1} \pi_j\right)^{N - \sum_{j=1}^{K-1}y_j} \]

with the constraint \(\sum_{j=1}^{K-1}y_j\leq N.\)


  • Log-likelihood (ignoring multinomial coefficient)

\[\begin{split} \begin{aligned} \log L(\pi|Y) &= \sum_{j=1}^{K-1} y_j \cdot \log \left(\frac{\pi_j}{1 - \sum_{l=1}^{K-1}\pi_l}\right) + \\ & \qquad N \cdot \log\left(1 - \sum_{l=1}^{K-1}\pi_l \right) \end{aligned} \end{split}\]
  • This is an exponential family with sufficient statistics \((y_1, \dots, y_{K-1})\) and natural parameters

\[\begin{split} \begin{aligned} \eta_j &= \log \left(\frac{\pi_j}{1 - \sum_{l=1}^{K-1}\pi_l}\right), \qquad 1 \leq j \leq K-1 \\ \pi_j &= \frac{e^{\eta_j}}{1 + \sum_{l=1}^{K-1} e^{\eta_l}}, \qquad 1 \leq j \leq K-1 \end{aligned} \end{split}\]

  • Log-likelihood in natural parameters

\[ \log L(\eta|Y) = \sum_{j=1}^{K-1} y_j \cdot \eta_j - N \cdot \log\left(1 + \sum_{l=1}^{K-1} e^{\eta_l} \right) \]

  • Note: this parametrization depends on the baseline category, which we chose to be level \(K\).
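The map between mean and natural parameters can be checked numerically. A NumPy sketch (the function names `pi_to_eta` / `eta_to_pi` are mine, for illustration only):

```python
import numpy as np

def pi_to_eta(pi):
    """eta_j = log(pi_j / (1 - sum_{l<K} pi_l)) = log(pi_j / pi_K), j = 1..K-1."""
    pi = np.asarray(pi)
    return np.log(pi[:-1] / pi[-1])

def eta_to_pi(eta):
    """Inverse map: pi_j = e^{eta_j} / (1 + sum_l e^{eta_l}); baseline K gets the remainder."""
    e = np.exp(eta)
    denom = 1 + e.sum()
    return np.append(e / denom, 1 / denom)

pi = np.array([0.2, 0.3, 0.5])
print(eta_to_pi(pi_to_eta(pi)))  # recovers [0.2, 0.3, 0.5]
```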


Baseline category logit models (8.1)

  • Suppose we now observe \((Y_i, X_i)\) where each \(Y_i|X_i \sim \text{Multinomial}(N_i, \pi_i)\).

  • Natural parametrization above leads to the baseline category logit model

\[ \eta_{ij} = X[i,]^T\beta[,j] \]

or

\[ \eta = X\beta \]

with \(\beta \in \mathbb{R}^{p \times (K-1)}\) and \(\eta \in \mathbb{R}^{n \times (K-1)}\).

  • (For the baseline model, the notation assumes the (Intercept) column is baked into \(X\).)

Softmax

  • Here, for \(1 \leq j \leq K-1\)

\[ \pi_{i,j} = \pi_{i,j}(\beta) = \frac{\exp(X[i,]^T\beta[,j])}{1 + \sum_{l=1}^{K-1} \exp(X[i,]^T\beta[,l])}. \]

Fitting baseline model

  • Data is represented as \(Y \in \mathbb{Z}_{\geq 0}^{n \times (K-1)}\) (dropping the baseline column \(K\))

  • Likelihood for \(\beta\) is

\[\begin{split} \begin{aligned} \log L(\beta|Y) &= \text{Tr}((X\beta)^TY) \\ & \qquad - \sum_{i=1}^n N_i \log\left(1 + \sum_{l=1}^{K-1} \exp(X[i,]^T\beta[,l]) \right) \end{aligned} \end{split}\]

Fitting baseline model

  • Let’s compute the gradient of \(\log L(\beta|Y)\). By now, this should not seem surprising:

\[ \nabla \log L(\beta|Y) = X^T(Y - E_{\beta}(Y)) \in \mathbb{R}^{p \times (K-1)} \]
  • Above

\[ E_{\beta}(Y)[i,j] = N_i \cdot \frac{\exp(X[i,]^T\beta[,j])}{1 + \sum_{l=1}^{K-1} \exp(X[i,]^T\beta[,l])} \]
  • Gradient descent will be pretty easy…
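Indeed, gradient ascent is only a few lines. A NumPy sketch on simulated data (the names, simulated values, and step size are my choices, not the course's R code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 200, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept baked into X
beta_true = np.array([[0.5, -0.5],
                      [1.0,  0.0]])                    # p x (K-1)

def probs(X, beta):
    """pi_{ij}(beta) for the K-1 non-baseline categories."""
    e = np.exp(X @ beta)
    return e / (1 + e.sum(axis=1, keepdims=True))

# simulate N_i = 1 multinomial draws; keep only the K-1 non-baseline columns
P_full = np.column_stack([probs(X, beta_true),
                          1 - probs(X, beta_true).sum(axis=1)])
Y = np.array([rng.multinomial(1, pi) for pi in P_full])[:, :-1]

# gradient ascent: grad log L = X^T (Y - E_beta(Y))
beta = np.zeros((p, K - 1))
for _ in range(20000):
    beta += 0.1 * X.T @ (Y - probs(X, beta)) / n
print(np.abs(X.T @ (Y - probs(X, beta))).max())  # near zero at the MLE
```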


Fitting baseline model

  • The Hessian of \(\log L(\beta|Y)\) is a little harder to write – it is a tensor of shape \(\mathbb{R}^{p \times (K-1)} \otimes \mathbb{R}^{p \times (K-1)}\):

\[ \frac{\partial^2}{\partial \beta_{ij} \partial \beta_{kl}} \log L(\beta|Y) = -\sum_{c=1}^n X_{ci}X_{ck} \text{Cov}_{\beta}(Y_c)_{jl} \]

where

\[ \text{Cov}_{\beta}(Y_c)_{jl} = N_c \left(\text{diag}(\pi_c(\beta)) - \pi_c(\beta)\pi_c(\beta)^T\right)_{jl} \]
  • Asymptotic covariance of \(\hat{\beta}\) is well-estimated (under assumptions) by the inverse of the observed information

\[ \left(-\nabla^2 \log L(\hat{\beta}|Y)\right)^{-1}. \]
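This can be sanity-checked by simulation in the intercept-only model, where the information per trial is \(\text{diag}(\pi) - \pi\pi^T\) and \(\hat\eta_j = \log(\hat{p}_j/\hat{p}_K)\). A NumPy sketch (the parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
pi = np.array([0.5, 0.3, 0.2])   # hypothetical true probabilities, K = 3
N, reps = 2000, 4000

# MLE of the natural parameters in the intercept-only model
counts = rng.multinomial(N, pi, size=reps)
phat = counts / N
eta_hat = np.log(phat[:, :2] / phat[:, 2:3])

# Fisher information per trial: diag(pi) - pi pi^T (first K-1 coordinates)
C = np.diag(pi[:2]) - np.outer(pi[:2], pi[:2])
asym_cov = np.linalg.inv(C) / N

print(np.cov(eta_hat.T))  # empirical covariance across replications
print(asym_cov)           # asymptotic covariance: close to the above
```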

Example

  • Goal: to predict a student’s academic program type prog in "general", "academic", "vocation" from scores on a writing test write and socioeconomic status ses.

library(foreign)
ml = read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
str(ml)
'data.frame':	200 obs. of  13 variables:
 $ id     : num  45 108 15 67 153 51 164 133 2 53 ...
 $ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
 $ ses    : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
 $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
 $ prog   : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
 $ read   : num  34 34 39 37 39 42 31 50 39 34 ...
 $ write  : num  35 33 39 37 31 36 36 31 41 37 ...
 $ math   : num  41 41 44 42 40 42 46 40 33 46 ...
 $ science: num  29 36 26 33 39 31 39 34 42 39 ...
 $ socst  : num  26 36 42 32 51 39 46 31 41 31 ...
 $ honors : Factor w/ 2 levels "not enrolled",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ awards : num  0 0 0 0 0 0 0 0 0 0 ...
 $ cid    : int  1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "datalabel")= chr "highschool and beyond (200 cases)"
 - attr(*, "time.stamp")= chr "30 Oct 2009 14:13"
 - attr(*, "formats")= chr [1:13] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
 - attr(*, "types")= int [1:13] 254 254 254 254 254 254 254 254 254 254 ...
 - attr(*, "val.labels")= chr [1:13] "" "fl" "sl" "scl" ...
 - attr(*, "var.labels")= chr [1:13] "" "" "" "type of school" ...
 - attr(*, "version")= int 8
 - attr(*, "label.table")=List of 5
  ..$ sl    : Named int [1:3] 1 2 3
  .. ..- attr(*, "names")= chr [1:3] "low" "middle" "high"
  ..$ scl   : Named int [1:2] 1 2
  .. ..- attr(*, "names")= chr [1:2] "public" "private"
  ..$ sel   : Named int [1:3] 1 2 3
  .. ..- attr(*, "names")= chr [1:3] "general" "academic" "vocation"
  ..$ fl    : Named int [1:2] 0 1
  .. ..- attr(*, "names")= chr [1:2] "male" "female"
  ..$ honlab: Named int [1:2] 0 1
  .. ..- attr(*, "names")= chr [1:2] "not enrolled" "enrolled"

library(nnet)
model_full = nnet::multinom(prog ~ ses + write, data = ml)
summary(model_full)
dim(coef(model_full))
# weights:  15 (8 variable)
initial  value 219.722458 
iter  10 value 179.985215
final  value 179.981726 
converged
Call:
nnet::multinom(formula = prog ~ ses + write, data = ml)

Coefficients:
         (Intercept) sesmiddle   seshigh       write
academic   -2.851973 0.5332914 1.1628257  0.05792480
vocation    2.366097 0.8246384 0.1802176 -0.05567514

Std. Errors:
         (Intercept) sesmiddle   seshigh      write
academic    1.166437 0.4437319 0.5142215 0.02141092
vocation    1.174251 0.4901237 0.6484508 0.02333135

Residual Deviance: 359.9635 
AIC: 375.9635 
[1] 2 4

vcov(model_full)
A matrix: 8 × 8 of type dbl
                     academic:(Intercept) academic:sesmiddle academic:seshigh academic:write
academic:(Intercept)          1.360576234      -0.0939555923    -0.0598081985  -0.0238322833
academic:sesmiddle           -0.093955592       0.1968980069     0.1225345572  -0.0005216217
academic:seshigh             -0.059808198       0.1225345572     0.2644237520  -0.0011823607
academic:write               -0.023832283      -0.0005216217    -0.0011823607   0.0004584275
vocation:(Intercept)          0.692815375      -0.0494419508    -0.0217374274  -0.0125850549
vocation:sesmiddle           -0.027178350       0.1050952297     0.0589096999  -0.0005960548
vocation:seshigh              0.007910847       0.0584068716     0.1651121953  -0.0012667250
vocation:write               -0.012833466      -0.0001497438    -0.0007211086   0.0002545341
                     vocation:(Intercept) vocation:sesmiddle vocation:seshigh vocation:write
academic:(Intercept)           0.69281537      -0.0271783502      0.007910847  -0.0128334660
academic:sesmiddle            -0.04944195       0.1050952297      0.058406872  -0.0001497438
academic:seshigh              -0.02173743       0.0589096999      0.165112195  -0.0007211086
academic:write                -0.01258505      -0.0005960548     -0.001266725   0.0002545341
vocation:(Intercept)           1.37886605      -0.0905237580     -0.082650366  -0.0258391029
vocation:sesmiddle            -0.09052376       0.2402212037      0.154979401  -0.0012868585
vocation:seshigh              -0.08265037       0.1549794008      0.420488501  -0.0014339382
vocation:write                -0.02583910      -0.0012868585     -0.001433938   0.0005443519

Wald tests

Z = summary(model_full)$coefficients / summary(model_full)$standard.errors
Z
A matrix: 2 × 4 of type dbl
         (Intercept) sesmiddle   seshigh     write
academic   -2.445028  1.201832 2.2613324  2.705386
vocation    2.014984  1.682511 0.2779203 -2.386280
P = 2 * pnorm(abs(Z), lower.tail=FALSE)
P
A matrix: 2 × 4 of type dbl
         (Intercept)  sesmiddle    seshigh      write
academic  0.01448407 0.22942845 0.02373868 0.00682251
vocation  0.04390634 0.09246981 0.78107356 0.01701977

Likelihood ratio tests

model_write = multinom(prog ~ write, data = ml)
anova(model_write, model_full)
# weights:  9 (4 variable)
initial  value 219.722458 
iter  10 value 185.510839
iter  10 value 185.510838
final  value 185.510838 
converged
A Anova: 2 × 7
        Model Resid. df Resid. Dev   Test Df LR stat.    Pr(Chi)
1       write       396   371.0217            NA         NA
2 ses + write       392   359.9635 1 vs 2  4 11.05822 0.02591741
model_ses = multinom(prog ~ ses, data = ml)
anova(model_ses, model_full)
# weights:  12 (6 variable)
initial  value 219.722458 
iter  10 value 195.705517
final  value 195.705188 
converged
A Anova: 2 × 7
        Model Resid. df Resid. Dev   Test Df LR stat.      Pr(Chi)
1         ses       394   391.4104            NA           NA
2 ses + write       392   359.9635 1 vs 2  2 31.44692 1.483841e-07

Latent variable model (8.1.6)

  • Suppose

\[ U_{ij} = X[i,]^T{\beta}[,j] + \epsilon_{ij} \]

for some noise distribution \(\epsilon_{ij}\)’s independent across \(i\).

  • We might take our multinomial observation as

\[ Y_i = \text{argmax}_{1 \leq j \leq K} U_{ij} \]

  • If the \(\epsilon_{ij}\) are independent standard Gumbel (extreme value), then the resulting likelihood is equivalent to a baseline multinomial logit, essentially with parameters \(\beta[,j] - \beta[,K]\).

  • A more “natural” model is to assume \(\epsilon_{ij}\)’s are independent \(N(0,1)\) (or perhaps correlated within a subject?).

  • Leads to a multinomial probit model, though the likelihood is rather complicated:

\[\begin{split} \begin{aligned} \pi_{ij}(\beta) &= P_{\beta}(Y_i=j|X) \\ &= P_{\beta}\left(X[i,]^T\beta[,j] + \epsilon_{ij} \geq \max_{l \neq j} \left(X[i,]^T\beta[,l] + \epsilon_{il}\right) \right) \end{aligned} \end{split}\]
  • Data augmented sampling is (relatively) straightforward.
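The Gumbel/argmax claim above is easy to verify by simulation: the frequency of each argmax category should approach the softmax of the systematic utilities. A NumPy sketch (the utility values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
eta = np.array([1.0, 0.0, -0.5])   # systematic parts of the utilities
n = 200_000

# Gumbel(0,1) noise via inverse CDF: -log(-log(U)), U ~ Uniform(0,1)
eps = -np.log(-np.log(rng.uniform(size=(n, 3))))
choices = np.argmax(eta + eps, axis=1)
freq = np.bincount(choices, minlength=3) / n

softmax = np.exp(eta) / np.exp(eta).sum()
print(freq)     # empirical argmax frequencies
print(softmax)  # close to the multinomial logit probabilities
```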


Cumulative logit models (8.2)

  • When levels of \(Y\) are ordinal, we might model things differently.

  • Cumulative model considers probabilities

\[ P(Y\leq j | X) = \sum_{l=1}^j \pi_l(X) \]

and models these as

\[ \text{logit} \left( P(Y \leq j | X) \right) = \alpha_j + \beta^TX, \qquad 1 \leq j \leq K-1. \]
  • Note: the intercepts must be ordered: \(\alpha_{j} \geq \alpha_{j-1}\).

  • Fewer parameters than the baseline model, but link is not canonical – optimization is more difficult.


  • Log-likelihood looks like

\[ \sum_{i=1}^n \sum_{j=1}^{K} y_{ij} \log\left( \frac{\exp(\alpha_j + X[i,]^T\beta)}{1 + \exp(\alpha_j + X[i,]^T\beta)} - \frac{\exp(\alpha_{j-1} + X[i,]^T\beta)}{1 + \exp(\alpha_{j-1} + X[i,]^T\beta)} \right) \]

with \(\alpha_{0}=-\infty, \alpha_K=\infty\).

  • No need to identify a baseline category.


Viewed as a generalization of threshold model…

  • Let \(Y^*=-X^T\beta + \epsilon\) with \(\epsilon\) having a logistic distribution, and set

\[ Y = \sum_{j=1}^K j \cdot 1_{(\alpha_{j-1}, \alpha_j]}(Y^*) \]

so that \(P(Y \leq j|X) = P(Y^* \leq \alpha_j) = G(\alpha_j + X^T\beta)\), matching the cumulative logit model.
  • Obvious analogs when errors are not logistic, but have CDF \(G\).

  • Leads to log-likelihood

\[ \sum_{i=1}^n \sum_{j=1}^{K} y_{ij} \log\left( G(\alpha_j + X[i,]^T\beta) - G(\alpha_{j-1} + X[i,]^T\beta) \right) \]
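The category probabilities are successive differences of the CDF, so (given ordered \(\alpha\)) they are automatically nonnegative and sum to one. A NumPy sketch with the logistic \(G\) (the parameter values are made up):

```python
import numpy as np

def G(z):
    """Logistic CDF; any CDF G works in the generalized model."""
    return 1 / (1 + np.exp(-z))

alpha = np.array([-1.0, 0.5])   # ordered cutpoints, K - 1 = 2
beta = np.array([0.7])
x = np.array([0.3])

# cumulative probabilities P(Y <= j | x), padded with G(-inf)=0 and G(inf)=1
cum = np.concatenate([[0.0], G(alpha + x @ beta), [1.0]])
pi = np.diff(cum)               # pi_j = G(alpha_j + x'beta) - G(alpha_{j-1} + x'beta)
print(pi, pi.sum())             # nonnegative, sums to 1
```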

Example

  • Questionnaire has students rate the “likelihood” of applying to graduate school: apply

  • We also collect gpa as well as pared (indicating whether or not at least one parent has a graduate degree) and public (is undergraduate institution public or private?)

  • More involved analysis here.


graduate_data = read.dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")
str(graduate_data)
'data.frame':	400 obs. of  4 variables:
 $ apply : Factor w/ 3 levels "unlikely","somewhat likely",..: 3 2 1 2 2 1 2 2 1 2 ...
 $ pared : int  0 1 1 0 0 0 0 0 0 1 ...
 $ public: int  0 0 1 0 0 1 0 0 0 0 ...
 $ gpa   : num  3.26 3.21 3.94 2.81 2.53 ...
 - attr(*, "datalabel")= chr ""
 - attr(*, "time.stamp")= chr " 3 Aug 2010 10:45"
 - attr(*, "formats")= chr [1:4] "%15.0g" "%9.0g" "%9.0g" "%9.0g"
 - attr(*, "types")= int [1:4] 251 251 251 254
 - attr(*, "val.labels")= chr [1:4] "apply" "" "" ""
 - attr(*, "var.labels")= chr [1:4] "" "" "" ""
 - attr(*, "version")= int 12
 - attr(*, "label.table")=List of 1
  ..$ apply: Named int [1:3] 0 1 2
  .. ..- attr(*, "names")= chr [1:3] "unlikely" "somewhat likely" "very likely"

library(MASS)
graduate_model = MASS::polr(apply ~ public + gpa + pared, 
                            data=graduate_data, 
                            Hess=TRUE)
summary(graduate_model)
Call:
MASS::polr(formula = apply ~ public + gpa + pared, data = graduate_data, 
    Hess = TRUE)

Coefficients:
          Value Std. Error t value
public -0.05879     0.2979 -0.1974
gpa     0.61594     0.2606  2.3632
pared   1.04769     0.2658  3.9418

Intercepts:
                            Value   Std. Error t value
unlikely|somewhat likely     2.2039  0.7795     2.8272
somewhat likely|very likely  4.2994  0.8043     5.3453

Residual Deviance: 717.0249 
AIC: 727.0249 

Compare to a less parsimonious model

graduate_multinom = multinom(apply ~ public + gpa + pared, data=graduate_data)
c(AIC(graduate_multinom), AIC(graduate_model))
# weights:  15 (8 variable)
initial  value 439.444915 
iter  10 value 357.012275
final  value 356.996982 
converged
[1] 729.9940 727.0249

Covariance matrix for Wald tests / intervals

vcov(graduate_model)
A matrix: 5 × 5 of type dbl
                                  public         gpa        pared
public                       0.088721414 -0.01737218 -0.003060743
gpa                         -0.017372183  0.06793006 -0.007018330
pared                       -0.003060743 -0.00701833  0.070644021
unlikely|somewhat likely    -0.040918689  0.20088564 -0.010390883
somewhat likely|very likely -0.041211341  0.20389232 -0.002951814
                            unlikely|somewhat likely somewhat likely|very likely
public                                   -0.04091869                -0.041211341
gpa                                       0.20088564                 0.203892325
pared                                    -0.01039088                -0.002951814
unlikely|somewhat likely                  0.60769114                 0.613233012
somewhat likely|very likely               0.61323301                 0.646941420

Confidence intervals

confint(graduate_model)
Waiting for profiling to be done...
A matrix: 3 × 2 of type dbl
             2.5 %    97.5 %
public  -0.6522060 0.5191384
gpa      0.1076202 1.1309148
pared    0.5281768 1.5721750

Likelihood ratio tests

anova(MASS::polr(apply ~ gpa, data=graduate_data), graduate_model)
A Anova: 2 × 7
                 Model Resid. df Resid. Dev   Test Df LR stat.      Pr(Chi)
1                  gpa       397   732.6000            NA           NA
2 public + gpa + pared       395   717.0249 1 vs 2  2 15.57509 0.0004148703

Adjacent category models (8.3.4)

  • Considers the parameters

\[ \log \frac{\pi_j(X)}{\pi_{j+1}(X)}, \qquad 1 \leq j \leq K-1 \]
  • Model

\[ \log \frac{\pi_j(X)}{\pi_{j+1}(X)} = \alpha_j + X^T\beta. \]

Adjacent category models (8.3.4)

  • Equivalent to a linear reparametrization of the baseline logits

\[ \log \frac{\pi_j(X)}{\pi_{K}(X)} = \alpha_j^* + (K-j) \cdot X^T\beta, \qquad 1 \leq j \leq K-1 \]

where

\[ \alpha_j^* = \sum_{l=j}^{K-1} \alpha_l. \]

  • Negative log-likelihood is convex in \((\alpha, \beta)\).
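The equivalence can be verified numerically: cumulating the adjacent-category logits down from category \(K\) reproduces the baseline logits with intercepts \(\alpha_j^*\) and slope multiplier \(K-j\). A NumPy sketch (random illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
alpha = rng.normal(size=K - 1)
beta = rng.normal(size=3)
x = rng.normal(size=3)

# adjacent-category model: log(pi_j / pi_{j+1}) = alpha_j + x'beta
# build log(pi_j / pi_K) by summing the ratios from category j down to K
logits = np.zeros(K)
for j in range(K - 2, -1, -1):
    logits[j] = logits[j + 1] + alpha[j] + x @ beta
pi_adjacent = np.exp(logits) / np.exp(logits).sum()

# baseline reparametrization: log(pi_j / pi_K) = alpha*_j + (K - j) x'beta
alpha_star = np.array([alpha[j:].sum() for j in range(K - 1)])
eta = np.concatenate([alpha_star + (K - 1 - np.arange(K - 1)) * (x @ beta), [0.0]])
pi_baseline = np.exp(eta) / np.exp(eta).sum()

print(np.allclose(pi_adjacent, pi_baseline))  # True
```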


Bayesian methods for multinomial (8.6)

  • For cumulative logit models, conceptually simple to sample \((Y^*,\alpha,\beta)|X\).

  • Constraints on \((Y^*,\alpha,\beta)\) when \(Y_i=j\):

\[ \alpha_{j-1} + X[i,]^T\beta \leq Y_i^* \leq \alpha_j + X[i,]^T\beta \]
  • Each draw is from \(G\) truncated to some interval determined by \((\alpha, \beta)\).

  • For baseline logit models, if we model using \(Y=\text{argmax}(U)\) with latent \(U\), then a similar augmented data model is possible with variables \((\epsilon,\beta)\)

\[ X[i,]^T\beta[,j] + \epsilon_{ij} \geq \max_{l \neq j }X[i,]^T\beta[,l] + \epsilon_{il} \]
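For the cumulative model, each draw of \(Y_i^*\) is from \(G\) truncated to an interval, which is convenient via inverse-CDF sampling. A NumPy sketch with logistic \(G\) and hypothetical truncation bounds:

```python
import numpy as np

rng = np.random.default_rng(3)

def G(z):      # logistic CDF
    return 1 / (1 + np.exp(-z))

def Ginv(u):   # logistic quantile function
    return np.log(u / (1 - u))

def sample_truncated(lo, hi, size):
    """Draw from the logistic distribution truncated to [lo, hi] by inverse CDF."""
    u = rng.uniform(G(lo), G(hi), size=size)
    return Ginv(u)

# e.g. truncation interval (alpha_{j-1} + x'beta, alpha_j + x'beta] = (-1, 2]
draws = sample_truncated(-1.0, 2.0, 10_000)
print(draws.min(), draws.max())  # all draws lie in [-1, 2] up to floating point
```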

STAN for multinomial

  • See here

  • Close to the baseline model, except it uses the overparameterized model

\[ P(Y=j|X) = \frac{\exp(X^T\beta[,j])}{\sum_l \exp(X^T\beta[,l])} \]

while baseline model sets \(\beta[,K]=0\).
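The overparameterization is a shift invariance: adding the same vector to every column of \(\beta\) leaves the probabilities unchanged, which is why the baseline model pins \(\beta[,K]=0\). A NumPy sketch (random illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
P, K = 3, 4
x = rng.normal(size=P)
beta = rng.normal(size=(P, K))

def softmax_probs(x, beta):
    """P(Y = j | x) = exp(x'beta[,j]) / sum_l exp(x'beta[,l])."""
    e = np.exp(x @ beta)
    return e / e.sum()

# shifting every column by the same vector c leaves the probabilities unchanged
c = rng.normal(size=P)
p1 = softmax_probs(x, beta)
p2 = softmax_probs(x, beta + c[:, None])
print(np.allclose(p1, p2))  # True
```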

STAN code

data {
  int<lower=1> N;
  int<lower=1> K;
  int<lower=1> P;
  matrix[N, P] X;
  int<lower=1,upper=K> y[N];
}
parameters {
  matrix[P,K] beta;
}
model {
  to_vector(beta) ~ normal(0, 100);
  for (n in 1:N) {
    y[n] ~ categorical_logit((X[n] * beta)');
  }  
}

# pretty slow!
X_graduate = model.matrix(graduate_model)
graduate_stan = list(X=X_graduate,
                     N=nrow(X_graduate),
                     P=ncol(X_graduate),
                     K=3,
                     y=as.integer(graduate_data$apply))
system.time(grad_samples <- rstan::stan('multinom.stan', 
                                        data=graduate_stan, 
                                        iter=100, 
                                        chains=1))
SAMPLING FOR MODEL 'multinom' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.002744 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 27.44 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: WARNING: There aren't enough warmup iterations to fit the
Chain 1:          three stages of adaptation as currently configured.
Chain 1:          Reducing each adaptation stage to 15%/75%/10% of
Chain 1:          the given number of warmup iterations:
Chain 1:            init_buffer = 7
Chain 1:            adapt_window = 38
Chain 1:            term_buffer = 5
Chain 1: 
Chain 1: Iteration:  1 / 100 [  1%]  (Warmup)
Chain 1: Iteration: 10 / 100 [ 10%]  (Warmup)
Chain 1: Iteration: 20 / 100 [ 20%]  (Warmup)
Chain 1: Iteration: 30 / 100 [ 30%]  (Warmup)
Chain 1: Iteration: 40 / 100 [ 40%]  (Warmup)
Chain 1: Iteration: 50 / 100 [ 50%]  (Warmup)
Chain 1: Iteration: 51 / 100 [ 51%]  (Sampling)
Chain 1: Iteration: 60 / 100 [ 60%]  (Sampling)
Chain 1: Iteration: 70 / 100 [ 70%]  (Sampling)
Chain 1: Iteration: 80 / 100 [ 80%]  (Sampling)
Chain 1: Iteration: 90 / 100 [ 90%]  (Sampling)
Chain 1: Iteration: 100 / 100 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.588644 seconds (Warm-up)
Chain 1:                51.6949 seconds (Sampling)
Chain 1:                52.2836 seconds (Total)
Chain 1: 
Warning message:
“There were 26 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“The largest R-hat is 2.11, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
   user  system elapsed 
 74.968   1.701  79.488 

grad_samples
Inference for Stan model: multinom.
1 chains, each with iter=100; warmup=50; thin=1; 
post-warmup draws per chain=50, total post-warmup draws=50.

                mean se_mean    sd       2.5%        25%        50%        75%
beta[1,1]       0.82    0.64  1.04      -0.83      -0.11       0.87       1.69
beta[1,2]       0.46    0.71  1.13      -1.31      -0.57       0.46       1.69
beta[1,3]      -0.74    0.76  1.20      -2.52      -1.81      -0.83       0.40
beta[2,1]      -1.59    1.07  1.77      -4.71      -2.91      -1.73       0.42
beta[2,2]      -1.58    1.08  1.78      -4.73      -2.92      -1.73       0.41
beta[2,3]      -1.59    1.08  1.78      -4.73      -2.93      -1.74       0.41
beta[3,1]      -0.45    0.52  0.99      -1.32      -1.15      -0.93       0.27
beta[3,2]      -0.48    0.54  1.02      -1.31      -1.25      -0.95       0.27
beta[3,3]      -0.50    0.55  1.03      -1.41      -1.26      -0.94       0.25
beta[4,1]       1.97    0.71  1.17       0.19       0.36       2.45       2.99
beta[4,2]       1.97    0.71  1.17       0.17       0.35       2.45       3.01
beta[4,3]       2.00    0.67  1.11       0.30       0.51       2.45       2.97
lp__      -148281.16   32.18 81.58 -148518.58 -148256.96 -148247.34 -148244.84
               97.5% n_eff Rhat
beta[1,1]       2.25     3 2.95
beta[1,2]       1.82     3 3.13
beta[1,3]       0.84     3 2.96
beta[2,1]       0.57     3 2.49
beta[2,2]       0.64     3 2.48
beta[2,3]       0.60     3 2.48
beta[3,1]       1.89     4 1.53
beta[3,2]       1.90     4 1.57
beta[3,3]       1.90     4 1.59
beta[4,1]       3.21     3 2.20
beta[4,2]       3.20     3 2.21
beta[4,3]       3.21     3 2.25
lp__      -148242.37     6 1.19

Samples were drawn using NUTS(diag_e) at Mon Jan 31 10:24:12 2022.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

STAN for ordered logit

STAN code for ordered logit

data {
  int<lower=1> N;
  int<lower=1> K;
  int<lower=1> P;
  matrix[N,P] X;
  int<lower=1,upper=K> y[N];
}
parameters {
  vector[P] beta;
  ordered[K-1] cutpts;
}
model {
  beta ~ normal(0, 100);
  for (n in 1:N) {
    y[n] ~ ordered_logistic(X[n] * beta, cutpts);
  }  
}

# also a bit slow!
system.time(ordered_samples <- rstan::stan('ordered_logit.stan', 
                                           data=graduate_stan, 
                                           iter=100,
                                           chains=1))
ordered_samples
SAMPLING FOR MODEL 'ordered_logit' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000544 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 5.44 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: WARNING: There aren't enough warmup iterations to fit the
Chain 1:          three stages of adaptation as currently configured.
Chain 1:          Reducing each adaptation stage to 15%/75%/10% of
Chain 1:          the given number of warmup iterations:
Chain 1:            init_buffer = 7
Chain 1:            adapt_window = 38
Chain 1:            term_buffer = 5
Chain 1: 
Chain 1: Iteration:  1 / 100 [  1%]  (Warmup)
Chain 1: Iteration: 10 / 100 [ 10%]  (Warmup)
Chain 1: Iteration: 20 / 100 [ 20%]  (Warmup)
Chain 1: Iteration: 30 / 100 [ 30%]  (Warmup)
Chain 1: Iteration: 40 / 100 [ 40%]  (Warmup)
Chain 1: Iteration: 50 / 100 [ 50%]  (Warmup)
Chain 1: Iteration: 51 / 100 [ 51%]  (Sampling)
Chain 1: Iteration: 60 / 100 [ 60%]  (Sampling)
Chain 1: Iteration: 70 / 100 [ 70%]  (Sampling)
Chain 1: Iteration: 80 / 100 [ 80%]  (Sampling)
Chain 1: Iteration: 90 / 100 [ 90%]  (Sampling)
Chain 1: Iteration: 100 / 100 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 3.32696 seconds (Warm-up)
Chain 1:                7.85343 seconds (Sampling)
Chain 1:                11.1804 seconds (Total)
Chain 1: 
Warning message:
“There were 14 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“The largest R-hat is 1.92, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
   user  system elapsed 
 31.836   1.262  34.205 
Inference for Stan model: ordered_logit.
1 chains, each with iter=100; warmup=50; thin=1; 
post-warmup draws per chain=50, total post-warmup draws=50.

             mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff
beta[1]    -41.14   30.36 52.19 -128.46 -104.60  -14.21   -6.19   23.82     3
beta[2]     -0.05    0.04  0.32   -0.61   -0.28   -0.03    0.15    0.67    63
beta[3]      0.57    0.04  0.21    0.19    0.48    0.55    0.68    0.98    30
beta[4]      1.04    0.04  0.31    0.53    0.84    0.95    1.26    1.68    52
cutpts[1]  -39.08   30.33 52.23 -127.17 -102.33  -11.75   -4.10   25.76     3
cutpts[2]  -36.99   30.34 52.23 -125.00 -100.13   -9.66   -2.10   28.09     3
lp__      -360.53    0.25  1.50 -364.15 -361.39 -360.09 -359.39 -358.49    35
          Rhat
beta[1]   1.99
beta[2]   0.98
beta[3]   1.00
beta[4]   0.98
cutpts[1] 1.98
cutpts[2] 1.98
lp__      0.98

Samples were drawn using NUTS(diag_e) at Mon Jan 31 10:24:46 2022.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).