We’ve spent a lot of time on binary / binomial regression models.
Sometimes the response has more than two categories: we need a model for multinomial responses.
Solution? Why not view the multinomial as an exponential family?
Mass function of \(Y \sim \text{Multinomial}(N,\pi)\) for \(\pi \in S_K\), the simplex in \(\mathbb{R}^K\): \[ f(y_1, \dots, y_K|\pi) = \binom{N}{y_1,\dots,y_K} \prod_{j=1}^K \pi_j^{y_j}, \qquad \sum_{j=1}^Ky_j=N. \]
This is a density with respect to counting measure on \[ \left\{(z_1, \dots, z_K) \in \mathbb{Z}^K: \sum_{j=1}^K z_j = N,\ z_j \geq 0 \right\}. \]
We fix \(K\) as a baseline category.
Setting \(\pi_K = 1 - \sum_{j=1}^{K-1} \pi_j\), \(y_K = N - \sum_{j=1}^{K-1} y_j\) yields the equivalent representation
\[ f(y_1, \dots, y_{K-1}|\pi) = \binom{N}{y_1,\dots,y_K} \left(\prod_{j=1}^{K-1} \pi_j^{y_j} \right) \left(1 - \sum_{j=1}^{K-1} \pi_j\right)^{N - \sum_{j=1}^{K-1}y_j} \]
with the constraint \(\sum_{j=1}^{K-1}y_j\leq N.\)
\[ \begin{aligned} \log L(\pi|Y) &= \sum_{j=1}^{K-1} y_j \cdot \log \left(\frac{\pi_j}{1 - \sum_{l=1}^{K-1}\pi_l}\right) + \\ & \qquad N \cdot \log\left(1 - \sum_{l=1}^{K-1}\pi_l \right) \end{aligned} \]
\[ \begin{aligned} \eta_j &= \log \left(\frac{\pi_j}{1 - \sum_{l=1}^{K-1}\pi_l}\right), \qquad 1 \leq j \leq K-1 \\ \pi_j &= \frac{e^{\eta_j}}{1 + \sum_{l=1}^{K-1} e^{\eta_l}}, \qquad 1 \leq j \leq K-1 \end{aligned} \]
Log-likelihood in natural parameters \[ \begin{aligned} \log L(\eta|Y) &= \sum_{j=1}^{K-1} y_j \cdot \eta_j - N \cdot \log\left(1 + \sum_{l=1}^{K-1} e^{\eta_l} \right) \end{aligned} \]
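The two maps above (probabilities to natural parameters and back) are inverse to each other; a minimal R sketch with illustrative values:

```r
# Map natural parameters eta (length K-1, baseline category K) to
# probabilities, and back; a sketch with illustrative numbers.
eta_to_pi <- function(eta) {
  e <- exp(c(eta, 0))          # baseline category has eta_K = 0
  e / sum(e)                   # pi_j = e^{eta_j} / (1 + sum_l e^{eta_l})
}
pi_to_eta <- function(p) {
  K <- length(p)
  log(p[-K] / p[K])            # eta_j = log(pi_j / pi_K)
}
eta <- c(0.5, -1, 2)
p <- eta_to_pi(eta)
stopifnot(abs(sum(p) - 1) < 1e-12,
          max(abs(pi_to_eta(p) - eta)) < 1e-12)
```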
Note: parametrization here depends on baseline category which we chose to be level \(K\).
Suppose we now observe \((Y_i, X_i)\) where each \(Y_i|X_i \sim \text{Multinomial}(N_i, \pi_i)\).
Natural parametrization above leads to the baseline category logit model
\[ \eta_{ij} = X[i,]^T\beta[,j] \]
or
\[ \eta = X\beta \]
with \(\beta \in \mathbb{R}^{p \times (K-1)}\) and \(\eta \in \mathbb{R}^{n \times (K-1)}\).
(Intercept is baked into \(X\).)\[ \pi_{i,j} = \pi_{i,j}(\beta) = \frac{\exp(X[i,]^T\beta[,j])}{1 + \sum_{l=1}^{K-1} \exp(X[i,]^T\beta[,l])}. \]
Data are represented as \(Y \in \mathbb{Z}_{\geq 0}^{n \times (K-1)}\) (dropping the baseline column \(K\)).
Log-likelihood for \(\beta\) is
\[ \begin{aligned} \log L(\beta|Y) &= \text{Tr}((X\beta)^TY) \\ & \qquad - \sum_{i=1}^n N_i \log\left(1 + \sum_{l=1}^{K-1} \exp(X[i,]^T\beta[,l]) \right) \end{aligned} \]
\[ \nabla \log L(\beta|Y) = X^T(Y - E_{\beta}(Y)) \in \mathbb{R}^{p \times (K-1)} \]
\[ E_{\beta}(Y)[i,j] = N_i \cdot \frac{\exp(X[i,]^T\beta[,j])}{1 + \sum_{l=1}^{K-1} \exp(X[i,]^T\beta[,l])} \]
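A small simulated sketch of the score formula \(X^T(Y - E_\beta(Y))\); all data and parameter values are illustrative:

```r
# Sketch: score of the baseline-category logit log-likelihood,
# grad = X^T (Y - E_beta(Y)); all values illustrative.
set.seed(1)
n <- 5; p <- 2; K <- 3
X <- cbind(1, rnorm(n))                      # intercept baked into X
beta <- matrix(rnorm(p * (K - 1)), p)        # p x (K-1)
N <- rep(10, n)
eta <- X %*% beta                            # n x (K-1)
probs <- exp(eta) / (1 + rowSums(exp(eta)))  # pi_{ij}; baseline prob implicit
EY <- N * probs                              # E_beta(Y)[i, j] = N_i * pi_{ij}
Y <- t(sapply(1:n, function(i)
  rmultinom(1, N[i], c(probs[i, ], 1 - sum(probs[i, ])))[1:(K - 1)]))
score <- t(X) %*% (Y - EY)                   # p x (K-1) gradient matrix
stopifnot(all(dim(score) == c(p, K - 1)))
```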
\[ \frac{\partial^2}{\partial \beta_{ij} \partial \beta_{kl}} \log L(\beta|Y) = -\sum_{c=1}^n X_{ci}X_{ck} \text{Cov}_{\beta}(Y_c)_{jl} \]
where
\[ \text{Cov}_{\beta}(Y_c)_{jl} = N_c \left(\text{diag}(\pi_c(\beta)) - \pi_c(\beta)\pi_c(\beta)^T\right)_{jl} \]
Standard errors are derived from the inverse observed information \[ -\nabla^2 \log L(\hat{\beta}|Y)^{-1}. \]
Example: model prog (in "academic", "general", "vocational") from write (scores on a writing test) and ses.

## 'data.frame': 200 obs. of 13 variables:
## $ id : num 45 108 15 67 153 51 164 133 2 53 ...
## $ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
## $ ses : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
## $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
## $ prog : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
## $ read : num 34 34 39 37 39 42 31 50 39 34 ...
## $ write : num 35 33 39 37 31 36 36 31 41 37 ...
## $ math : num 41 41 44 42 40 42 46 40 33 46 ...
## $ science: num 29 36 26 33 39 31 39 34 42 39 ...
## $ socst : num 26 36 42 32 51 39 46 31 41 31 ...
## $ honors : Factor w/ 2 levels "not enrolled",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ awards : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cid : int 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "datalabel")= chr "highschool and beyond (200 cases)"
## - attr(*, "time.stamp")= chr "30 Oct 2009 14:13"
## - attr(*, "formats")= chr [1:13] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int [1:13] 254 254 254 254 254 254 254 254 254 254 ...
## - attr(*, "val.labels")= chr [1:13] "" "fl" "sl" "scl" ...
## - attr(*, "var.labels")= chr [1:13] "" "" "" "type of school" ...
## - attr(*, "version")= int 8
## - attr(*, "label.table")=List of 5
## ..$ sl : Named int [1:3] 1 2 3
## .. ..- attr(*, "names")= chr [1:3] "low" "middle" "high"
## ..$ scl : Named int [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "public" "private"
## ..$ sel : Named int [1:3] 1 2 3
## .. ..- attr(*, "names")= chr [1:3] "general" "academic" "vocation"
## ..$ fl : Named int [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "male" "female"
## ..$ honlab: Named int [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "not enrolled" "enrolled"
## # weights: 15 (8 variable)
## initial value 219.722458
## iter 10 value 179.985215
## final value 179.981726
## converged
## Call:
## nnet::multinom(formula = prog ~ ses + write, data = ml)
##
## Coefficients:
## (Intercept) sesmiddle seshigh write
## academic -2.851973 0.5332914 1.1628257 0.05792480
## vocation 2.366097 0.8246384 0.1802176 -0.05567514
##
## Std. Errors:
## (Intercept) sesmiddle seshigh write
## academic 1.166437 0.4437319 0.5142215 0.02141092
## vocation 1.174251 0.4901237 0.6484508 0.02333135
##
## Residual Deviance: 359.9635
## AIC: 375.9635
## [1] 2 4
## academic:(Intercept) academic:sesmiddle academic:seshigh
## academic:(Intercept) 1.360576234 -0.0939555923 -0.0598081985
## academic:sesmiddle -0.093955592 0.1968980069 0.1225345572
## academic:seshigh -0.059808198 0.1225345572 0.2644237520
## academic:write -0.023832283 -0.0005216217 -0.0011823607
## vocation:(Intercept) 0.692815375 -0.0494419508 -0.0217374274
## vocation:sesmiddle -0.027178350 0.1050952297 0.0589096999
## vocation:seshigh 0.007910847 0.0584068716 0.1651121953
## vocation:write -0.012833466 -0.0001497438 -0.0007211086
## academic:write vocation:(Intercept) vocation:sesmiddle
## academic:(Intercept) -0.0238322833 0.69281537 -0.0271783502
## academic:sesmiddle -0.0005216217 -0.04944195 0.1050952297
## academic:seshigh -0.0011823607 -0.02173743 0.0589096999
## academic:write 0.0004584275 -0.01258505 -0.0005960548
## vocation:(Intercept) -0.0125850549 1.37886605 -0.0905237580
## vocation:sesmiddle -0.0005960548 -0.09052376 0.2402212037
## vocation:seshigh -0.0012667250 -0.08265037 0.1549794008
## vocation:write 0.0002545341 -0.02583910 -0.0012868585
## vocation:seshigh vocation:write
## academic:(Intercept) 0.007910847 -0.0128334660
## academic:sesmiddle 0.058406872 -0.0001497438
## academic:seshigh 0.165112195 -0.0007211086
## academic:write -0.001266725 0.0002545341
## vocation:(Intercept) -0.082650366 -0.0258391029
## vocation:sesmiddle 0.154979401 -0.0012868585
## vocation:seshigh 0.420488501 -0.0014339382
## vocation:write -0.001433938 0.0005443519
## (Intercept) sesmiddle seshigh write
## academic -2.445028 1.201832 2.2613324 2.705386
## vocation 2.014984 1.682511 0.2779203 -2.386280
## (Intercept) sesmiddle seshigh write
## academic 0.01448407 0.22942845 0.02373868 0.00682251
## vocation 0.04390634 0.09246981 0.78107356 0.01701977
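The two tables above appear to be Wald z statistics (coefficient divided by standard error) and two-sided normal p-values; a sketch reproducing them from the printed coefficients and standard errors:

```r
# Reproduce the Wald z statistics and p-values above from the printed
# summary output (values copied from the fit shown earlier).
coefs <- rbind(academic = c(-2.851973, 0.5332914, 1.1628257,  0.05792480),
               vocation = c( 2.366097, 0.8246384, 0.1802176, -0.05567514))
ses   <- rbind(academic = c(1.166437, 0.4437319, 0.5142215, 0.02141092),
               vocation = c(1.174251, 0.4901237, 0.6484508, 0.02333135))
z <- coefs / ses                  # Wald statistics
pvals <- 2 * pnorm(-abs(z))       # two-sided normal p-values
stopifnot(abs(z["academic", 1] - (-2.445028)) < 1e-5,
          abs(pvals["academic", 1] - 0.01448407) < 1e-5)
```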
## # weights: 9 (4 variable)
## initial value 219.722458
## iter 10 value 185.510839
## iter 10 value 185.510838
## final value 185.510838
## converged
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 write 396 371.0217 NA NA NA
## 2 ses + write 392 359.9635 1 vs 2 4 11.05822 0.02591741
## # weights: 12 (6 variable)
## initial value 219.722458
## iter 10 value 195.705517
## final value 195.705188
## converged
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 ses 394 391.4104 NA NA NA
## 2 ses + write 392 359.9635 1 vs 2 2 31.44692 1.483841e-07
\[ U_{ij} = X[i,]^T{\beta}[,j] + \epsilon_{ij} \]
for some noise \(\epsilon_{ij}\), independent across \(i\).
\[ Y_i = \text{argmax}_{1 \leq j \leq K} U_{ij} \]
If the \(\epsilon_{ij}\) are independent extreme value (Gumbel, the distribution underlying the cloglog link), then the resulting likelihood is equivalent to a baseline multinomial logit, essentially with parameters \(\beta[,j] - \beta[,K]\).
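A Monte Carlo sketch of this equivalence (utilities and sample size illustrative): with iid standard Gumbel noise, argmax choice frequencies match softmax probabilities.

```r
# Check: argmax of utilities with iid standard Gumbel noise has
# multinomial-logit (softmax) choice probabilities; mu is illustrative.
set.seed(42)
mu <- c(1, 0, -0.5)                          # systematic utilities X^T beta[, j]
rgumbel <- function(n) -log(-log(runif(n)))  # inverse-CDF Gumbel sampler
B <- 2e5
U <- matrix(mu, B, 3, byrow = TRUE) + matrix(rgumbel(3 * B), B, 3)
freq <- tabulate(max.col(U), 3) / B          # empirical choice frequencies
softmax <- exp(mu) / sum(exp(mu))
stopifnot(max(abs(freq - softmax)) < 0.01)
```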
A more “natural” model is to assume \(\epsilon_{ij}\)’s are independent \(N(0,1)\) (or perhaps correlated within a subject?).
Leads to a multinomial probit model, though likelihood is rather complicated:
\[ \begin{aligned} \pi_j(X_i) &= P_{\beta}(Y_i=j|X_i) \\ &= P_{\beta}\left(X[i,]^T\beta[,j] + \epsilon_{ij} \geq \max_{l \neq j} X[i,]^T\beta[,l] + \epsilon_{il} \right) \end{aligned} \]
When levels of \(Y\) are ordinal, we might model things differently.
Cumulative model considers probabilities
\[ P(Y\leq j | X) = \sum_{l=1}^j \pi_l(X) \]
and models these as
\[ \text{logit} \left( P(Y \leq j | X) \right) = \alpha_j + \beta^TX, \qquad 1 \leq j \leq K-1. \]
\[ \sum_{i=1}^n \sum_{j=1}^{K} y_{ij} \log\left( \frac{\exp(\alpha_j + X[i,]^T\beta)}{1 + \exp(\alpha_j + X[i,]^T\beta)} - \frac{\exp(\alpha_{j-1} + X[i,]^T\beta)}{1 + \exp(\alpha_{j-1} + X[i,]^T\beta)} \right) \]
with \(\alpha_{0}=-\infty,\ \alpha_K=\infty\).
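Each term in this sum is a category probability, a difference of logistic CDF values at adjacent cutpoints; a minimal sketch (cutpoints and linear predictor illustrative):

```r
# pi_j(X) = G(alpha_j + X^T beta) - G(alpha_{j-1} + X^T beta) with
# logistic G; alpha and lp (the linear predictor) are illustrative.
cumlogit_probs <- function(alpha, lp) {
  cdf <- c(plogis(alpha + lp), 1)  # P(Y <= j), j = 1..K (alpha_K = Inf)
  diff(c(0, cdf))                  # leading 0 plays the role of P(Y <= 0)
}
probs <- cumlogit_probs(alpha = c(-1, 1), lp = 0.3)
stopifnot(abs(sum(probs) - 1) < 1e-12, all(probs > 0))
```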
Latent variable representation: with \(Y^* = \epsilon - X^T\beta\) for logistic \(\epsilon\), \[ Y = \sum_{j=1}^K j \cdot 1_{[\alpha_{j-1}, \alpha_j]}(Y^*) \]
Obvious analogs when errors are not logistic, but have CDF \(G\).
Leads to log-likelihood
\[ \sum_{i=1}^n \sum_{j=1}^{K} y_{ij} \log\left( G(\alpha_j + X[i,]^T\beta) - G(\alpha_{j-1} + X[i,]^T\beta) \right) \]
Example: a questionnaire asks students to rate the “likelihood” of applying to graduate school: apply.
We also collect gpa, as well as pared (whether or not at least one parent has a graduate degree) and public (is the undergraduate institution public or private?).
More involved analysis here.
## 'data.frame': 400 obs. of 4 variables:
## $ apply : Factor w/ 3 levels "unlikely","somewhat likely",..: 3 2 1 2 2 1 2 2 1 2 ...
## $ pared : int 0 1 1 0 0 0 0 0 0 1 ...
## $ public: int 0 0 1 0 0 1 0 0 0 0 ...
## $ gpa : num 3.26 3.21 3.94 2.81 2.53 ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr " 3 Aug 2010 10:45"
## - attr(*, "formats")= chr [1:4] "%15.0g" "%9.0g" "%9.0g" "%9.0g"
## - attr(*, "types")= int [1:4] 251 251 251 254
## - attr(*, "val.labels")= chr [1:4] "apply" "" "" ""
## - attr(*, "var.labels")= chr [1:4] "" "" "" ""
## - attr(*, "version")= int 12
## - attr(*, "label.table")=List of 1
## ..$ apply: Named int [1:3] 0 1 2
## .. ..- attr(*, "names")= chr [1:3] "unlikely" "somewhat likely" "very likely"
library(MASS)
graduate_model = MASS::polr(apply ~ public + gpa + pared,
data=graduate_data,
Hess=TRUE)
summary(graduate_model)
## Call:
## MASS::polr(formula = apply ~ public + gpa + pared, data = graduate_data,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## public -0.05879 0.2979 -0.1974
## gpa 0.61594 0.2606 2.3632
## pared 1.04769 0.2658 3.9418
##
## Intercepts:
## Value Std. Error t value
## unlikely|somewhat likely 2.2039 0.7795 2.8272
## somewhat likely|very likely 4.2994 0.8043 5.3453
##
## Residual Deviance: 717.0249
## AIC: 727.0249
## # weights: 15 (8 variable)
## initial value 439.444915
## iter 10 value 357.012275
## final value 356.996982
## converged
## [1] 729.9940 727.0249
## public gpa pared
## public 0.088721414 -0.01737218 -0.003060743
## gpa -0.017372183 0.06793006 -0.007018330
## pared -0.003060743 -0.00701833 0.070644021
## unlikely|somewhat likely -0.040918689 0.20088564 -0.010390883
## somewhat likely|very likely -0.041211341 0.20389232 -0.002951814
## unlikely|somewhat likely
## public -0.04091869
## gpa 0.20088564
## pared -0.01039088
## unlikely|somewhat likely 0.60769114
## somewhat likely|very likely 0.61323301
## somewhat likely|very likely
## public -0.041211341
## gpa 0.203892325
## pared -0.002951814
## unlikely|somewhat likely 0.613233012
## somewhat likely|very likely 0.646941420
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## public -0.6522060 0.5191384
## gpa 0.1076202 1.1309148
## pared 0.5281768 1.5721750
## Likelihood ratio tests of ordinal regression models
##
## Response: apply
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 gpa 397 732.6000
## 2 public + gpa + pared 395 717.0249 1 vs 2 2 15.57509 0.0004148703
The adjacent categories model considers the logits \[ \log \frac{\pi_j(X)}{\pi_{j+1}(X)}, \qquad 1 \leq j \leq K-1 \]
and models these as \[ \log \frac{\pi_j(X)}{\pi_{j+1}(X)} = \alpha_j + X^T\beta. \]
Equivalent to a linear reparametrization of the baseline logits \[ \log \frac{\pi_j(X)}{\pi_{K}(X)} = \alpha_j^* + (K-j) \cdot X^T\beta, \qquad 1 \leq j \leq K-1 \] where \[ \alpha_j^* = \sum_{l=j}^{K-1} \alpha_l. \]
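A numeric check of this reparametrization (the value of \(K\), the cutpoints, and the scalar linear predictor are illustrative):

```r
# Verify: baseline logits log(pi_j/pi_K) equal alpha_j^* + (K-j)*lp
# when adjacent-category logits are alpha_j + lp; values illustrative.
K <- 4
alpha <- c(0.5, -0.2, 0.8)
lp <- 0.3                                    # plays the role of X^T beta
# log pi_j up to a constant, taking pi_K as reference:
logp <- rev(cumsum(rev(c(alpha + lp, 0))))   # sum_{l=j}^{K-1} (alpha_l + lp)
probs <- exp(logp) / sum(exp(logp))
alpha_star <- rev(cumsum(rev(alpha)))        # alpha_j^* = sum_{l=j}^{K-1} alpha_l
baseline <- log(probs[1:(K - 1)] / probs[K])
stopifnot(max(abs(baseline - (alpha_star + (K - 1:(K - 1)) * lp))) < 1e-12)
```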
Negative log-likelihood is convex in \((\alpha, \beta)\)…
For cumulative logit models, it is conceptually simple to sample \((Y^*,\alpha,\beta)\) given \((Y, X)\).
Constraints on \((Y^*,\alpha,\beta)\) when \(Y_i=j\):
\[ \alpha_{j-1} + X[i,]^T\beta \leq Y_i^* \leq \alpha_j + X[i,]^T\beta \]
Each draw is from \(G\) truncated to some interval determined by \((\alpha, \beta)\).
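A sketch of the truncated draw by inverse CDF when \(G\) is logistic (the bounds are illustrative):

```r
# Draw from a logistic distribution truncated to (lo, hi) via
# inverse CDF: map uniforms on (G(lo), G(hi)) through G^{-1}.
rtrunc_logis <- function(n, lo, hi) {
  qlogis(runif(n, plogis(lo), plogis(hi)))
}
set.seed(7)
x <- rtrunc_logis(1000, lo = -0.5, hi = 1.2)
stopifnot(all(x > -0.5), all(x < 1.2))
```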
For baseline logit models, if we model using \(Y=\text{argmax}(U)\) with latent \(U\), then a similar augmented data model is possible with variables \((\epsilon,\beta)\)
\[ X[i,]^T\beta[,j] + \epsilon_{ij} \geq \max_{l \neq j }X[i,]^T\beta[,l] + \epsilon_{il} \]
See here
This is close to the baseline model, except it uses the overparameterized form
\[ P(Y=j|X) = \frac{\exp(X^T\beta[,j])}{\sum_l \exp(X^T\beta[,l])} \]
while baseline model sets \(\beta[,K]=0\).
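The overparameterization: shifting every column of \(\beta\) by the same vector leaves the probabilities unchanged, which is why the baseline model can pin \(\beta[,K]=0\). A quick check (values illustrative):

```r
# Softmax probabilities are invariant to adding a common shift to
# every column of beta, so one column can be fixed at zero.
set.seed(3)
softmax <- function(z) exp(z) / sum(exp(z))
x <- c(1, 0.5)
beta <- matrix(rnorm(6), 2, 3)
shift <- c(2, -1)                             # same shift for every column
p1 <- softmax(drop(t(beta) %*% x))
p2 <- softmax(drop(t(beta + shift) %*% x))    # shift recycles down columns
stopifnot(max(abs(p1 - p2)) < 1e-12)
```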
# pretty slow!
X_graduate = model.matrix(graduate_model)
graduate_stan = list(X=X_graduate,
N=nrow(X_graduate),
P=ncol(X_graduate),
K=3,
y=as.integer(graduate_data$apply))
system.time(grad_samples <- rstan::stan('multinom.stan',
data=graduate_stan,
iter=100,
chains=1))
##
## SAMPLING FOR MODEL 'multinom' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.002443 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 24.43 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: WARNING: There aren't enough warmup iterations to fit the
## Chain 1: three stages of adaptation as currently configured.
## Chain 1: Reducing each adaptation stage to 15%/75%/10% of
## Chain 1: the given number of warmup iterations:
## Chain 1: init_buffer = 7
## Chain 1: adapt_window = 38
## Chain 1: term_buffer = 5
## Chain 1:
## Chain 1: Iteration: 1 / 100 [ 1%] (Warmup)
## Chain 1: Iteration: 10 / 100 [ 10%] (Warmup)
## Chain 1: Iteration: 20 / 100 [ 20%] (Warmup)
## Chain 1: Iteration: 30 / 100 [ 30%] (Warmup)
## Chain 1: Iteration: 40 / 100 [ 40%] (Warmup)
## Chain 1: Iteration: 50 / 100 [ 50%] (Warmup)
## Chain 1: Iteration: 51 / 100 [ 51%] (Sampling)
## Chain 1: Iteration: 60 / 100 [ 60%] (Sampling)
## Chain 1: Iteration: 70 / 100 [ 70%] (Sampling)
## Chain 1: Iteration: 80 / 100 [ 80%] (Sampling)
## Chain 1: Iteration: 90 / 100 [ 90%] (Sampling)
## Chain 1: Iteration: 100 / 100 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 25.8162 seconds (Warm-up)
## Chain 1: 79.5667 seconds (Sampling)
## Chain 1: 105.383 seconds (Total)
## Chain 1:
## Warning: There were 41 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.3, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
## user system elapsed
## 131.405 2.664 680.143
## Inference for Stan model: multinom.
## 1 chains, each with iter=100; warmup=50; thin=1;
## post-warmup draws per chain=50, total post-warmup draws=50.
##
## mean se_mean sd 2.5% 25% 50% 75%
## beta[1,1] -1.95 0.44 1.08 -3.43 -2.86 -2.09 -0.76
## beta[1,2] -2.40 0.45 1.08 -3.89 -3.35 -2.55 -1.20
## beta[1,3] -3.66 0.44 1.08 -5.09 -4.57 -3.82 -2.44
## beta[2,1] 3.92 0.45 1.05 2.33 3.40 3.83 4.29
## beta[2,2] 3.92 0.45 1.05 2.34 3.39 3.85 4.29
## beta[2,3] 3.92 0.45 1.05 2.31 3.37 3.84 4.31
## beta[3,1] -3.27 0.30 0.84 -5.53 -3.73 -3.14 -2.64
## beta[3,2] -3.27 0.30 0.84 -5.52 -3.73 -3.16 -2.64
## beta[3,3] -3.26 0.30 0.84 -5.51 -3.74 -3.14 -2.65
## beta[4,1] -2.06 0.35 1.11 -4.99 -2.32 -1.96 -1.31
## beta[4,2] -2.06 0.35 1.11 -4.98 -2.34 -1.97 -1.30
## beta[4,3] -2.06 0.35 1.11 -4.98 -2.34 -1.96 -1.32
## lp__ -148245.20 0.42 2.03 -148249.62 -148246.20 -148244.83 -148243.54
## 97.5% n_eff Rhat
## beta[1,1] -0.14 6 0.98
## beta[1,2] -0.61 6 0.98
## beta[1,3] -1.90 6 0.98
## beta[2,1] 6.32 5 1.34
## beta[2,2] 6.34 6 1.33
## beta[2,3] 6.29 6 1.33
## beta[3,1] -2.29 8 1.27
## beta[3,2] -2.30 8 1.27
## beta[3,3] -2.26 8 1.27
## beta[4,1] -0.40 10 0.98
## beta[4,2] -0.40 10 0.98
## beta[4,3] -0.40 10 0.98
## lp__ -148242.60 24 1.13
##
## Samples were drawn using NUTS(diag_e) at Mon Jan 31 11:21:25 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
# also a bit slow!
system.time(ordered_samples <- rstan::stan('ordered_logit.stan',
data=graduate_stan,
iter=100,
chains=1))
##
## SAMPLING FOR MODEL 'ordered_logit' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000552 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 5.52 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: WARNING: There aren't enough warmup iterations to fit the
## Chain 1: three stages of adaptation as currently configured.
## Chain 1: Reducing each adaptation stage to 15%/75%/10% of
## Chain 1: the given number of warmup iterations:
## Chain 1: init_buffer = 7
## Chain 1: adapt_window = 38
## Chain 1: term_buffer = 5
## Chain 1:
## Chain 1: Iteration: 1 / 100 [ 1%] (Warmup)
## Chain 1: Iteration: 10 / 100 [ 10%] (Warmup)
## Chain 1: Iteration: 20 / 100 [ 20%] (Warmup)
## Chain 1: Iteration: 30 / 100 [ 30%] (Warmup)
## Chain 1: Iteration: 40 / 100 [ 40%] (Warmup)
## Chain 1: Iteration: 50 / 100 [ 50%] (Warmup)
## Chain 1: Iteration: 51 / 100 [ 51%] (Sampling)
## Chain 1: Iteration: 60 / 100 [ 60%] (Sampling)
## Chain 1: Iteration: 70 / 100 [ 70%] (Sampling)
## Chain 1: Iteration: 80 / 100 [ 80%] (Sampling)
## Chain 1: Iteration: 90 / 100 [ 90%] (Sampling)
## Chain 1: Iteration: 100 / 100 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 6.00502 seconds (Warm-up)
## Chain 1: 7.5355 seconds (Sampling)
## Chain 1: 13.5405 seconds (Total)
## Chain 1:
## Warning: There were 12 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.63, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
## user system elapsed
## 36.309 1.592 40.185
## Inference for Stan model: ordered_logit.
## 1 chains, each with iter=100; warmup=50; thin=1;
## post-warmup draws per chain=50, total post-warmup draws=50.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
## beta[1] 46.23 15.10 31.00 -22.24 29.08 50.68 65.30 89.18 4
## beta[2] -0.08 0.04 0.33 -0.75 -0.24 -0.07 0.12 0.54 54
## beta[3] 0.67 0.05 0.29 0.12 0.46 0.70 0.88 1.12 28
## beta[4] 1.07 0.03 0.29 0.46 0.88 1.15 1.24 1.61 85
## cutpts[1] 48.60 15.14 30.97 -19.49 31.86 52.31 67.50 91.54 4
## cutpts[2] 50.73 15.13 30.91 -17.22 33.92 54.45 69.69 93.47 4
## lp__ -360.57 0.25 1.41 -363.08 -361.61 -360.31 -359.57 -358.57 31
## Rhat
## beta[1] 1.49
## beta[2] 1.01
## beta[3] 0.99
## beta[4] 1.01
## cutpts[1] 1.50
## cutpts[2] 1.50
## lp__ 1.01
##
## Samples were drawn using NUTS(diag_e) at Mon Jan 31 11:22:06 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).