Multinomial models (Ch. 8)

  • We’ve spent a lot of time on binary / Binomial regression models.

  • Sometimes the response has more than two categories: we need a model for multinomial responses.

  • Solution? Why not view the multinomial as an exponential family?


Multinomial

  • Mass function of \(Y \sim \text{Multinomial}(N,\pi)\) for \(\pi \in S_K\), the simplex in \(\mathbb{R}^K\):

\[ f(y_1, \dots, y_K|\pi) = \binom{N}{y_1,\dots,y_K} \prod_{j=1}^K \pi_j^{y_j}, \qquad \sum_{j=1}^K y_j=N. \]

  • Density with respect to counting measure on

\[ \left\{(z_1, \dots, z_K) \in \mathbb{Z}^K: \sum_{j=1}^K z_j = N, z_j \geq 0 \right\} \]
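The mass function above is easy to evaluate directly. A minimal Python sketch (the helper name `multinomial_pmf` is mine, not from the notes):

```python
from math import comb, prod

def multinomial_pmf(y, pi):
    """f(y | pi) = multinomial coefficient times prod_j pi_j^{y_j}."""
    N = sum(y)
    # N! / (y_1! ... y_K!) built as a product of binomial coefficients
    coef, rem = 1, N
    for yj in y:
        coef *= comb(rem, yj)
        rem -= yj
    return coef * prod(p ** yj for p, yj in zip(pi, y))

# N = 4, K = 3: coefficient is 12, so pmf = 12 * 0.5^2 * 0.25 * 0.25
print(multinomial_pmf([2, 1, 1], [0.5, 0.25, 0.25]))  # 0.1875
```

Summing the pmf over all counts with \(\sum_j y_j = N\) recovers 1, confirming it is a density with respect to counting measure on that set.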


Multinomial

  • We fix category \(K\) as the baseline.

  • Setting \(\pi_K = 1 - \sum_{j=1}^{K-1} \pi_j\), \(y_K = N - \sum_{j=1}^{K-1} y_j\) yields equivalent representation

\[ f(y_1, \dots, y_{K-1}|\pi) = \binom{N}{y_1,\dots,y_K} \left(\prod_{j=1}^{K-1} \pi_j^{y_j} \right) \left(1 - \sum_{j=1}^{K-1} \pi_j\right)^{N - \sum_{j=1}^{K-1}y_j} \]

with the constraint \(\sum_{j=1}^{K-1}y_j\leq N.\)


  • Log-likelihood (ignoring multinomial coefficient)

\[\begin{split} \begin{aligned} \log L(\pi|Y) &= \sum_{j=1}^{K-1} y_j \cdot \log \left(\frac{\pi_j}{1 - \sum_{l=1}^{K-1}\pi_l}\right) + \\ & \qquad N \cdot \log\left(1 - \sum_{l=1}^{K-1}\pi_l \right) \end{aligned} \end{split}\]
  • This is an exponential family with sufficient statistics \((y_1, \dots, y_{K-1})\) and natural parameters

\[\begin{split} \begin{aligned} \eta_j &= \log \left(\frac{\pi_j}{1 - \sum_{l=1}^{K-1}\pi_l}\right), \qquad 1 \leq j \leq K-1 \\ \pi_j &= \frac{e^{\eta_j}}{1 + \sum_{l=1}^{K-1} e^{\eta_l}}, \qquad 1 \leq j \leq K-1 \end{aligned} \end{split}\]

  • Log-likelihood in natural parameters

\[ \log L(\eta|Y) = \sum_{j=1}^{K-1} y_j \cdot \eta_j - N \cdot \log\left(1 + \sum_{l=1}^{K-1} e^{\eta_l} \right) \]

  • Note: this parametrization depends on the baseline category, which we chose to be level \(K\).
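The map between mean and natural parameters can be checked numerically. A NumPy sketch (the function names `pi_to_eta` / `eta_to_pi` are mine, for illustration only):

```python
import numpy as np

def pi_to_eta(pi):
    """eta_j = log(pi_j / (1 - sum_{l<K} pi_l)) = log(pi_j / pi_K), j = 1..K-1."""
    pi = np.asarray(pi)
    return np.log(pi[:-1] / pi[-1])

def eta_to_pi(eta):
    """Inverse map: pi_j = e^{eta_j} / (1 + sum_l e^{eta_l}); baseline K gets the remainder."""
    e = np.exp(eta)
    denom = 1 + e.sum()
    return np.append(e / denom, 1 / denom)

pi = np.array([0.2, 0.3, 0.5])
print(eta_to_pi(pi_to_eta(pi)))  # recovers [0.2, 0.3, 0.5]
```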


Baseline category logit models (8.1)

  • Suppose we now observe \((Y_i, X_i)\) where each \(Y_i|X_i \sim \text{Multinomial}(N_i, \pi_i)\).

  • Natural parametrization above leads to the baseline category logit model

\[ \eta_{ij} = X[i,]^T\beta[,j] \]

or

\[ \eta = X\beta \]

with \(\beta \in \mathbb{R}^{p \times (K-1)}\) and \(\eta \in \mathbb{R}^{n \times (K-1)}\).

  • (For the baseline model, the notation assumes the (Intercept) column is baked into \(X\).)

Softmax

  • Here, for \(1 \leq j \leq K-1\)

\[ \pi_{i,j} = \pi_{i,j}(\beta) = \frac{\exp(X[i,]^T\beta[,j])}{1 + \sum_{l=1}^{K-1} \exp(X[i,]^T\beta[,l])}. \]

Fitting baseline model

  • Data is represented as \(Y \in \mathbb{Z}_{\geq 0}^{n \times (K-1)}\) (dropping the baseline column \(K\))

  • Likelihood for \(\beta\) is

\[\begin{split} \begin{aligned} \log L(\beta|Y) &= \text{Tr}((X\beta)^TY) \\ & \qquad - \sum_{i=1}^n N_i \log\left(1 + \sum_{l=1}^{K-1} \exp(X[i,]^T\beta[,l]) \right) \end{aligned} \end{split}\]

Fitting baseline model

  • Let’s compute the gradient of \(\log L(\beta|Y)\). By now, this should not seem surprising:

\[ \nabla \log L(\beta|Y) = X^T(Y - E_{\beta}(Y)) \in \mathbb{R}^{p \times (K-1)} \]
  • Above

\[ E_{\beta}(Y)[i,j] = N_i \cdot \frac{\exp(X[i,]^T\beta[,j])}{1 + \sum_{l=1}^{K-1} \exp(X[i,]^T\beta[,l])} \]
  • Gradient descent will be pretty easy…
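Indeed, gradient ascent is only a few lines. A NumPy sketch on simulated data (the names, simulated values, and step size are my choices, not the course's R code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 200, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept baked into X
beta_true = np.array([[0.5, -0.5],
                      [1.0,  0.0]])                    # p x (K-1)

def probs(X, beta):
    """pi_{ij}(beta) for the K-1 non-baseline categories."""
    e = np.exp(X @ beta)
    return e / (1 + e.sum(axis=1, keepdims=True))

# simulate N_i = 1 multinomial draws; keep only the K-1 non-baseline columns
P_full = np.column_stack([probs(X, beta_true),
                          1 - probs(X, beta_true).sum(axis=1)])
Y = np.array([rng.multinomial(1, pi) for pi in P_full])[:, :-1]

# gradient ascent: grad log L = X^T (Y - E_beta(Y))
beta = np.zeros((p, K - 1))
for _ in range(20000):
    beta += 0.1 * X.T @ (Y - probs(X, beta)) / n
print(np.abs(X.T @ (Y - probs(X, beta))).max())  # near zero at the MLE
```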


Fitting baseline model

  • The Hessian of \(\log L(\beta|Y)\) is a little harder to write – it is a tensor of shape \(\mathbb{R}^{p \times (K-1)} \otimes \mathbb{R}^{p \times (K-1)}\):

\[ \frac{\partial^2}{\partial \beta_{ij} \partial \beta_{kl}} \log L(\beta|Y) = -\sum_{c=1}^n X_{ci}X_{ck} \text{Cov}_{\beta}(Y_c)_{jl} \]

where

\[ \text{Cov}_{\beta}(Y_c)_{jl} = N_c \left(\text{diag}(\pi_c(\beta)) - \pi_c(\beta)\pi_c(\beta)^T\right)_{jl} \]
  • Asymptotic covariance of \(\hat{\beta}\) is well-estimated (under assumptions) by the inverse of the observed information

\[ \left(-\nabla^2 \log L(\hat{\beta}|Y)\right)^{-1}. \]
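This can be sanity-checked by simulation in the intercept-only model, where the information per trial is \(\text{diag}(\pi) - \pi\pi^T\) and \(\hat\eta_j = \log(\hat{p}_j/\hat{p}_K)\). A NumPy sketch (the parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
pi = np.array([0.5, 0.3, 0.2])   # hypothetical true probabilities, K = 3
N, reps = 2000, 4000

# MLE of the natural parameters in the intercept-only model
counts = rng.multinomial(N, pi, size=reps)
phat = counts / N
eta_hat = np.log(phat[:, :2] / phat[:, 2:3])

# Fisher information per trial: diag(pi) - pi pi^T (first K-1 coordinates)
C = np.diag(pi[:2]) - np.outer(pi[:2], pi[:2])
asym_cov = np.linalg.inv(C) / N

print(np.cov(eta_hat.T))  # empirical covariance across replications
print(asym_cov)           # asymptotic covariance: close to the above
```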

Example

  • Goal: to predict a student’s academic program type prog in "general", "academic", "vocation" from scores on a writing test write and socioeconomic status ses.

library(foreign)
ml = read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
str(ml)
'data.frame':	200 obs. of  13 variables:
 $ id     : num  45 108 15 67 153 51 164 133 2 53 ...
 $ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
 $ ses    : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
 $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
 $ prog   : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
 $ read   : num  34 34 39 37 39 42 31 50 39 34 ...
 $ write  : num  35 33 39 37 31 36 36 31 41 37 ...
 $ math   : num  41 41 44 42 40 42 46 40 33 46 ...
 $ science: num  29 36 26 33 39 31 39 34 42 39 ...
 $ socst  : num  26 36 42 32 51 39 46 31 41 31 ...
 $ honors : Factor w/ 2 levels "not enrolled",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ awards : num  0 0 0 0 0 0 0 0 0 0 ...
 $ cid    : int  1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "datalabel")= chr "highschool and beyond (200 cases)"
 - attr(*, "time.stamp")= chr "30 Oct 2009 14:13"
 - attr(*, "formats")= chr [1:13] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
 - attr(*, "types")= int [1:13] 254 254 254 254 254 254 254 254 254 254 ...
 - attr(*, "val.labels")= chr [1:13] "" "fl" "sl" "scl" ...
 - attr(*, "var.labels")= chr [1:13] "" "" "" "type of school" ...
 - attr(*, "version")= int 8
 - attr(*, "label.table")=List of 5
  ..$ sl    : Named int [1:3] 1 2 3
  .. ..- attr(*, "names")= chr [1:3] "low" "middle" "high"
  ..$ scl   : Named int [1:2] 1 2
  .. ..- attr(*, "names")= chr [1:2] "public" "private"
  ..$ sel   : Named int [1:3] 1 2 3
  .. ..- attr(*, "names")= chr [1:3] "general" "academic" "vocation"
  ..$ fl    : Named int [1:2] 0 1
  .. ..- attr(*, "names")= chr [1:2] "male" "female"
  ..$ honlab: Named int [1:2] 0 1
  .. ..- attr(*, "names")= chr [1:2] "not enrolled" "enrolled"

library(nnet)
model_full = nnet::multinom(prog ~ ses + write, data = ml)
summary(model_full)
dim(coef(model_full))
# weights:  15 (8 variable)
initial  value 219.722458 
iter  10 value 179.985215
final  value 179.981726 
converged
Call:
nnet::multinom(formula = prog ~ ses + write, data = ml)

Coefficients:
         (Intercept) sesmiddle   seshigh       write
academic   -2.851973 0.5332914 1.1628257  0.05792480
vocation    2.366097 0.8246384 0.1802176 -0.05567514

Std. Errors:
         (Intercept) sesmiddle   seshigh      write
academic    1.166437 0.4437319 0.5142215 0.02141092
vocation    1.174251 0.4901237 0.6484508 0.02333135

Residual Deviance: 359.9635 
AIC: 375.9635 
[1] 2 4

vcov(model_full)
A matrix: 8 × 8 of type dbl
                     academic:(Intercept) academic:sesmiddle academic:seshigh academic:write
academic:(Intercept)          1.360576234      -0.0939555923    -0.0598081985  -0.0238322833
academic:sesmiddle           -0.093955592       0.1968980069     0.1225345572  -0.0005216217
academic:seshigh             -0.059808198       0.1225345572     0.2644237520  -0.0011823607
academic:write               -0.023832283      -0.0005216217    -0.0011823607   0.0004584275
vocation:(Intercept)          0.692815375      -0.0494419508    -0.0217374274  -0.0125850549
vocation:sesmiddle           -0.027178350       0.1050952297     0.0589096999  -0.0005960548
vocation:seshigh              0.007910847       0.0584068716     0.1651121953  -0.0012667250
vocation:write               -0.012833466      -0.0001497438    -0.0007211086   0.0002545341
                     vocation:(Intercept) vocation:sesmiddle vocation:seshigh vocation:write
academic:(Intercept)           0.69281537      -0.0271783502      0.007910847  -0.0128334660
academic:sesmiddle            -0.04944195       0.1050952297      0.058406872  -0.0001497438
academic:seshigh              -0.02173743       0.0589096999      0.165112195  -0.0007211086
academic:write                -0.01258505      -0.0005960548     -0.001266725   0.0002545341
vocation:(Intercept)           1.37886605      -0.0905237580     -0.082650366  -0.0258391029
vocation:sesmiddle            -0.09052376       0.2402212037      0.154979401  -0.0012868585
vocation:seshigh              -0.08265037       0.1549794008      0.420488501  -0.0014339382
vocation:write                -0.02583910      -0.0012868585     -0.001433938   0.0005443519

Wald tests

Z = summary(model_full)$coefficients / summary(model_full)$standard.errors
Z
A matrix: 2 × 4 of type dbl
         (Intercept) sesmiddle   seshigh     write
academic   -2.445028  1.201832 2.2613324  2.705386
vocation    2.014984  1.682511 0.2779203 -2.386280
P = 2 * pnorm(abs(Z), lower.tail=FALSE)
P
A matrix: 2 × 4 of type dbl
         (Intercept)  sesmiddle    seshigh      write
academic  0.01448407 0.22942845 0.02373868 0.00682251
vocation  0.04390634 0.09246981 0.78107356 0.01701977

Likelihood ratio tests

model_write = multinom(prog ~ write, data = ml)
anova(model_write, model_full)
# weights:  9 (4 variable)
initial  value 219.722458 
iter  10 value 185.510839
iter  10 value 185.510838
final  value 185.510838 
converged
A Anova: 2 × 7
        Model Resid. df Resid. Dev   Test Df LR stat.    Pr(Chi)
1       write       396   371.0217            NA         NA
2 ses + write       392   359.9635 1 vs 2  4 11.05822 0.02591741
model_ses = multinom(prog ~ ses, data = ml)
anova(model_ses, model_full)
# weights:  12 (6 variable)
initial  value 219.722458 
iter  10 value 195.705517
final  value 195.705188 
converged
A Anova: 2 × 7
        Model Resid. df Resid. Dev   Test Df LR stat.      Pr(Chi)
1         ses       394   391.4104            NA           NA
2 ses + write       392   359.9635 1 vs 2  2 31.44692 1.483841e-07

Latent variable model (8.1.6)

  • Suppose

\[ U_{ij} = X[i,]^T{\beta}[,j] + \epsilon_{ij} \]

for some noise distribution \(\epsilon_{ij}\)’s independent across \(i\).

  • We might take our multinomial observation as

\[ Y_i = \text{argmax}_{1 \leq j \leq K} U_{ij} \]

  • If the \(\epsilon_{ij}\) are independent standard Gumbel (extreme value), then the resulting likelihood is equivalent to a baseline multinomial logit, essentially with parameters \(\beta[,j] - \beta[,K]\).

  • A more “natural” model is to assume \(\epsilon_{ij}\)’s are independent \(N(0,1)\) (or perhaps correlated within a subject?).

  • Leads to a multinomial probit model, though the likelihood is rather complicated:

\[\begin{split} \begin{aligned} \pi_{ij}(\beta) &= P_{\beta}(Y_i=j|X) \\ &= P_{\beta}\left(X[i,]^T\beta[,j] + \epsilon_{ij} \geq \max_{l \neq j} \left(X[i,]^T\beta[,l] + \epsilon_{il}\right) \right) \end{aligned} \end{split}\]
  • Data augmented sampling is (relatively) straightforward.
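The Gumbel/argmax claim above is easy to verify by simulation: the frequency of each argmax category should approach the softmax of the systematic utilities. A NumPy sketch (the utility values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
eta = np.array([1.0, 0.0, -0.5])   # systematic parts of the utilities
n = 200_000

# Gumbel(0,1) noise via inverse CDF: -log(-log(U)), U ~ Uniform(0,1)
eps = -np.log(-np.log(rng.uniform(size=(n, 3))))
choices = np.argmax(eta + eps, axis=1)
freq = np.bincount(choices, minlength=3) / n

softmax = np.exp(eta) / np.exp(eta).sum()
print(freq)     # empirical argmax frequencies
print(softmax)  # close to the multinomial logit probabilities
```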


Cumulative logit models (8.2)

  • When levels of \(Y\) are ordinal, we might model things differently.

  • Cumulative model considers probabilities

\[ P(Y\leq j | X) = \sum_{l=1}^j \pi_l(X) \]

and models these as

\[ \text{logit} \left( P(Y \leq j | X) \right) = \alpha_j + \beta^TX, \qquad 1 \leq j \leq K-1. \]
  • Note: the intercepts must be ordered: \(\alpha_{j} \geq \alpha_{j-1}\).

  • Fewer parameters than the baseline model, but link is not canonical – optimization is more difficult.


  • Log-likelihood looks like

\[ \sum_{i=1}^n \sum_{j=1}^{K} y_{ij} \log\left( \frac{\exp(\alpha_j + X[i,]^T\beta)}{1 + \exp(\alpha_j + X[i,]^T\beta)} - \frac{\exp(\alpha_{j-1} + X[i,]^T\beta)}{1 + \exp(\alpha_{j-1} + X[i,]^T\beta)} \right) \]

with \(\alpha_{0}=-\infty, \alpha_K=\infty\).

  • No need to identify a baseline category.


Viewed as a generalization of threshold model…

  • Let \(Y^*=-X^T\beta + \epsilon\) with \(\epsilon\) having a logistic distribution, and set

\[ Y = \sum_{j=1}^K j \cdot 1_{(\alpha_{j-1}, \alpha_j]}(Y^*) \]

so that \(P(Y \leq j|X) = P(Y^* \leq \alpha_j) = G(\alpha_j + X^T\beta)\), matching the cumulative logit model.
  • Obvious analogs when errors are not logistic, but have CDF \(G\).

  • Leads to log-likelihood

\[ \sum_{i=1}^n \sum_{j=1}^{K} y_{ij} \log\left( G(\alpha_j + X[i,]^T\beta) - G(\alpha_{j-1} + X[i,]^T\beta) \right) \]
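The category probabilities are successive differences of the CDF, so (given ordered \(\alpha\)) they are automatically nonnegative and sum to one. A NumPy sketch with the logistic \(G\) (the parameter values are made up):

```python
import numpy as np

def G(z):
    """Logistic CDF; any CDF G works in the generalized model."""
    return 1 / (1 + np.exp(-z))

alpha = np.array([-1.0, 0.5])   # ordered cutpoints, K - 1 = 2
beta = np.array([0.7])
x = np.array([0.3])

# cumulative probabilities P(Y <= j | x), padded with G(-inf)=0 and G(inf)=1
cum = np.concatenate([[0.0], G(alpha + x @ beta), [1.0]])
pi = np.diff(cum)               # pi_j = G(alpha_j + x'beta) - G(alpha_{j-1} + x'beta)
print(pi, pi.sum())             # nonnegative, sums to 1
```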

Example

  • Questionnaire has students rate the “likelihood” of applying to graduate school: apply

  • We also collect gpa as well as pared (indicating whether or not at least one parent has a graduate degree) and public (is undergraduate institution public or private?)

  • More involved analysis here.


graduate_data = read.dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")
str(graduate_data)
'data.frame':	400 obs. of  4 variables:
 $ apply : Factor w/ 3 levels "unlikely","somewhat likely",..: 3 2 1 2 2 1 2 2 1 2 ...
 $ pared : int  0 1 1 0 0 0 0 0 0 1 ...
 $ public: int  0 0 1 0 0 1 0 0 0 0 ...
 $ gpa   : num  3.26 3.21 3.94 2.81 2.53 ...
 - attr(*, "datalabel")= chr ""
 - attr(*, "time.stamp")= chr " 3 Aug 2010 10:45"
 - attr(*, "formats")= chr [1:4] "%15.0g" "%9.0g" "%9.0g" "%9.0g"
 - attr(*, "types")= int [1:4] 251 251 251 254
 - attr(*, "val.labels")= chr [1:4] "apply" "" "" ""
 - attr(*, "var.labels")= chr [1:4] "" "" "" ""
 - attr(*, "version")= int 12
 - attr(*, "label.table")=List of 1
  ..$ apply: Named int [1:3] 0 1 2
  .. ..- attr(*, "names")= chr [1:3] "unlikely" "somewhat likely" "very likely"

library(MASS)
graduate_model = MASS::polr(apply ~ public + gpa + pared, 
                            data=graduate_data, 
                            Hess=TRUE)
summary(graduate_model)
Call:
MASS::polr(formula = apply ~ public + gpa + pared, data = graduate_data, 
    Hess = TRUE)

Coefficients:
          Value Std. Error t value
public -0.05879     0.2979 -0.1974
gpa     0.61594     0.2606  2.3632
pared   1.04769     0.2658  3.9418

Intercepts:
                            Value   Std. Error t value
unlikely|somewhat likely     2.2039  0.7795     2.8272
somewhat likely|very likely  4.2994  0.8043     5.3453

Residual Deviance: 717.0249 
AIC: 727.0249 

Compare to a less parsimonious model

graduate_multinom = multinom(apply ~ public + gpa + pared, data=graduate_data)
c(AIC(graduate_multinom), AIC(graduate_model))
# weights:  15 (8 variable)
initial  value 439.444915 
iter  10 value 357.012275
final  value 356.996982 
converged
[1] 729.9940 727.0249

Covariance matrix for Wald tests / intervals

vcov(graduate_model)
A matrix: 5 × 5 of type dbl
                                  public         gpa        pared
public                       0.088721414 -0.01737218 -0.003060743
gpa                         -0.017372183  0.06793006 -0.007018330
pared                       -0.003060743 -0.00701833  0.070644021
unlikely|somewhat likely    -0.040918689  0.20088564 -0.010390883
somewhat likely|very likely -0.041211341  0.20389232 -0.002951814
                            unlikely|somewhat likely somewhat likely|very likely
public                                   -0.04091869                -0.041211341
gpa                                       0.20088564                 0.203892325
pared                                    -0.01039088                -0.002951814
unlikely|somewhat likely                  0.60769114                 0.613233012
somewhat likely|very likely               0.61323301                 0.646941420

Confidence intervals

confint(graduate_model)
Waiting for profiling to be done...
A matrix: 3 × 2 of type dbl
             2.5 %    97.5 %
public  -0.6522060 0.5191384
gpa      0.1076202 1.1309148
pared    0.5281768 1.5721750

Likelihood ratio tests

anova(MASS::polr(apply ~ gpa, data=graduate_data), graduate_model)
A Anova: 2 × 7
                 Model Resid. df Resid. Dev   Test Df LR stat.      Pr(Chi)
1                  gpa       397   732.6000            NA           NA
2 public + gpa + pared       395   717.0249 1 vs 2  2 15.57509 0.0004148703

Adjacent category models (8.3.4)

  • Considers the parameters

\[ \log \frac{\pi_j(X)}{\pi_{j+1}(X)}, \qquad 1 \leq j \leq K-1 \]
  • Model

\[ \log \frac{\pi_j(X)}{\pi_{j+1}(X)} = \alpha_j + X^T\beta. \]

Adjacent category models (8.3.4)

  • Equivalent to a linear reparametrization of the baseline logits

\[ \log \frac{\pi_j(X)}{\pi_{K}(X)} = \alpha_j^* + (K-j) \cdot X^T\beta, \qquad 1 \leq j \leq K-1 \]

where

\[ \alpha_j^* = \sum_{l=j}^{K-1} \alpha_l. \]

  • Negative log-likelihood is convex in \((\alpha, \beta)\).
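The equivalence can be verified numerically: cumulating the adjacent-category logits down from category \(K\) reproduces the baseline logits with intercepts \(\alpha_j^*\) and slope multiplier \(K-j\). A NumPy sketch (random illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
alpha = rng.normal(size=K - 1)
beta = rng.normal(size=3)
x = rng.normal(size=3)

# adjacent-category model: log(pi_j / pi_{j+1}) = alpha_j + x'beta
# build log(pi_j / pi_K) by summing the ratios from category j down to K
logits = np.zeros(K)
for j in range(K - 2, -1, -1):
    logits[j] = logits[j + 1] + alpha[j] + x @ beta
pi_adjacent = np.exp(logits) / np.exp(logits).sum()

# baseline reparametrization: log(pi_j / pi_K) = alpha*_j + (K - j) x'beta
alpha_star = np.array([alpha[j:].sum() for j in range(K - 1)])
eta = np.concatenate([alpha_star + (K - 1 - np.arange(K - 1)) * (x @ beta), [0.0]])
pi_baseline = np.exp(eta) / np.exp(eta).sum()

print(np.allclose(pi_adjacent, pi_baseline))  # True
```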


Bayesian methods for multinomial (8.6)

  • For cumulative logit models, conceptually simple to sample \((Y^*,\alpha,\beta)|X\).

  • Constraints on \((Y^*,\alpha,\beta)\) when \(Y_i=j\):

\[ \alpha_{j-1} + X[i,]^T\beta \leq Y_i^* \leq \alpha_j + X[i,]^T\beta \]
  • Each draw is from \(G\) truncated to some interval determined by \((\alpha, \beta)\).

  • For baseline logit models, if we model using \(Y=\text{argmax}(U)\) with latent \(U\), then a similar augmented data model is possible with variables \((\epsilon,\beta)\)

\[ X[i,]^T\beta[,j] + \epsilon_{ij} \geq \max_{l \neq j }X[i,]^T\beta[,l] + \epsilon_{il} \]
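For the cumulative model, each draw of \(Y_i^*\) is from \(G\) truncated to an interval, which is convenient via inverse-CDF sampling. A NumPy sketch with logistic \(G\) and hypothetical truncation bounds:

```python
import numpy as np

rng = np.random.default_rng(3)

def G(z):      # logistic CDF
    return 1 / (1 + np.exp(-z))

def Ginv(u):   # logistic quantile function
    return np.log(u / (1 - u))

def sample_truncated(lo, hi, size):
    """Draw from the logistic distribution truncated to [lo, hi] by inverse CDF."""
    u = rng.uniform(G(lo), G(hi), size=size)
    return Ginv(u)

# e.g. truncation interval (alpha_{j-1} + x'beta, alpha_j + x'beta] = (-1, 2]
draws = sample_truncated(-1.0, 2.0, 10_000)
print(draws.min(), draws.max())  # all draws lie in [-1, 2] up to floating point
```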

STAN for multinomial

  • See here

  • Close to the baseline model, except it uses the overparameterized model

\[ P(Y=j|X) = \frac{\exp(X^T\beta[,j])}{\sum_l \exp(X^T\beta[,l])} \]

while baseline model sets \(\beta[,K]=0\).
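The overparameterization is a shift invariance: adding the same vector to every column of \(\beta\) leaves the probabilities unchanged, which is why the baseline model pins \(\beta[,K]=0\). A NumPy sketch (random illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
P, K = 3, 4
x = rng.normal(size=P)
beta = rng.normal(size=(P, K))

def softmax_probs(x, beta):
    """P(Y = j | x) = exp(x'beta[,j]) / sum_l exp(x'beta[,l])."""
    e = np.exp(x @ beta)
    return e / e.sum()

# shifting every column by the same vector c leaves the probabilities unchanged
c = rng.normal(size=P)
p1 = softmax_probs(x, beta)
p2 = softmax_probs(x, beta + c[:, None])
print(np.allclose(p1, p2))  # True
```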

STAN code

data {
  int<lower=1> N;
  int<lower=1> K;
  int<lower=1> P;
  matrix[N, P] X;
  int<lower=1,upper=K> y[N];
}
parameters {
  matrix[P,K] beta;
}
model {
  to_vector(beta) ~ normal(0, 100);
  for (n in 1:N) {
    y[n] ~ categorical_logit((X[n] * beta)');
  }  
}

# pretty slow!
X_graduate = model.matrix(graduate_model)
graduate_stan = list(X=X_graduate,
                     N=nrow(X_graduate),
                     P=ncol(X_graduate),
                     K=3,
                     y=as.integer(graduate_data$apply))
system.time(grad_samples <- rstan::stan('multinom.stan', 
                                        data=graduate_stan, 
                                        iter=100, 
                                        chains=1))
SAMPLING FOR MODEL 'multinom' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.002744 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 27.44 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: WARNING: There aren't enough warmup iterations to fit the
Chain 1:          three stages of adaptation as currently configured.
Chain 1:          Reducing each adaptation stage to 15%/75%/10% of
Chain 1:          the given number of warmup iterations:
Chain 1:            init_buffer = 7
Chain 1:            adapt_window = 38
Chain 1:            term_buffer = 5
Chain 1: 
Chain 1: Iteration:  1 / 100 [  1%]  (Warmup)
Chain 1: Iteration: 10 / 100 [ 10%]  (Warmup)
Chain 1: Iteration: 20 / 100 [ 20%]  (Warmup)
Chain 1: Iteration: 30 / 100 [ 30%]  (Warmup)
Chain 1: Iteration: 40 / 100 [ 40%]  (Warmup)
Chain 1: Iteration: 50 / 100 [ 50%]  (Warmup)
Chain 1: Iteration: 51 / 100 [ 51%]  (Sampling)
Chain 1: Iteration: 60 / 100 [ 60%]  (Sampling)
Chain 1: Iteration: 70 / 100 [ 70%]  (Sampling)
Chain 1: Iteration: 80 / 100 [ 80%]  (Sampling)
Chain 1: Iteration: 90 / 100 [ 90%]  (Sampling)
Chain 1: Iteration: 100 / 100 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.588644 seconds (Warm-up)
Chain 1:                51.6949 seconds (Sampling)
Chain 1:                52.2836 seconds (Total)
Chain 1: 
Warning message:
“There were 26 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“The largest R-hat is 2.11, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
   user  system elapsed 
 74.968   1.701  79.488 

grad_samples
Inference for Stan model: multinom.
1 chains, each with iter=100; warmup=50; thin=1; 
post-warmup draws per chain=50, total post-warmup draws=50.

                mean se_mean    sd       2.5%        25%        50%        75%
beta[1,1]       0.82    0.64  1.04      -0.83      -0.11       0.87       1.69
beta[1,2]       0.46    0.71  1.13      -1.31      -0.57       0.46       1.69
beta[1,3]      -0.74    0.76  1.20      -2.52      -1.81      -0.83       0.40
beta[2,1]      -1.59    1.07  1.77      -4.71      -2.91      -1.73       0.42
beta[2,2]      -1.58    1.08  1.78      -4.73      -2.92      -1.73       0.41
beta[2,3]      -1.59    1.08  1.78      -4.73      -2.93      -1.74       0.41
beta[3,1]      -0.45    0.52  0.99      -1.32      -1.15      -0.93       0.27
beta[3,2]      -0.48    0.54  1.02      -1.31      -1.25      -0.95       0.27
beta[3,3]      -0.50    0.55  1.03      -1.41      -1.26      -0.94       0.25
beta[4,1]       1.97    0.71  1.17       0.19       0.36       2.45       2.99
beta[4,2]       1.97    0.71  1.17       0.17       0.35       2.45       3.01
beta[4,3]       2.00    0.67  1.11       0.30       0.51       2.45       2.97
lp__      -148281.16   32.18 81.58 -148518.58 -148256.96 -148247.34 -148244.84
               97.5% n_eff Rhat
beta[1,1]       2.25     3 2.95
beta[1,2]       1.82     3 3.13
beta[1,3]       0.84     3 2.96
beta[2,1]       0.57     3 2.49
beta[2,2]       0.64     3 2.48
beta[2,3]       0.60     3 2.48
beta[3,1]       1.89     4 1.53
beta[3,2]       1.90     4 1.57
beta[3,3]       1.90     4 1.59
beta[4,1]       3.21     3 2.20
beta[4,2]       3.20     3 2.21
beta[4,3]       3.21     3 2.25
lp__      -148242.37     6 1.19

Samples were drawn using NUTS(diag_e) at Mon Jan 31 10:24:12 2022.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

STAN for ordered logit

STAN code for ordered logit

data {
  int<lower=1> N;
  int<lower=1> K;
  int<lower=1> P;
  matrix[N,P] X;
  int<lower=1,upper=K> y[N];
}
parameters {
  vector[P] beta;
  ordered[K-1] cutpts;
}
model {
  beta ~ normal(0, 100);
  for (n in 1:N) {
    y[n] ~ ordered_logistic(X[n] * beta, cutpts);
  }  
}

# also a bit slow!
system.time(ordered_samples <- rstan::stan('ordered_logit.stan', 
                                           data=graduate_stan, 
                                           iter=100,
                                           chains=1))
ordered_samples
SAMPLING FOR MODEL 'ordered_logit' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000544 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 5.44 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: WARNING: There aren't enough warmup iterations to fit the
Chain 1:          three stages of adaptation as currently configured.
Chain 1:          Reducing each adaptation stage to 15%/75%/10% of
Chain 1:          the given number of warmup iterations:
Chain 1:            init_buffer = 7
Chain 1:            adapt_window = 38
Chain 1:            term_buffer = 5
Chain 1: 
Chain 1: Iteration:  1 / 100 [  1%]  (Warmup)
Chain 1: Iteration: 10 / 100 [ 10%]  (Warmup)
Chain 1: Iteration: 20 / 100 [ 20%]  (Warmup)
Chain 1: Iteration: 30 / 100 [ 30%]  (Warmup)
Chain 1: Iteration: 40 / 100 [ 40%]  (Warmup)
Chain 1: Iteration: 50 / 100 [ 50%]  (Warmup)
Chain 1: Iteration: 51 / 100 [ 51%]  (Sampling)
Chain 1: Iteration: 60 / 100 [ 60%]  (Sampling)
Chain 1: Iteration: 70 / 100 [ 70%]  (Sampling)
Chain 1: Iteration: 80 / 100 [ 80%]  (Sampling)
Chain 1: Iteration: 90 / 100 [ 90%]  (Sampling)
Chain 1: Iteration: 100 / 100 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 3.32696 seconds (Warm-up)
Chain 1:                7.85343 seconds (Sampling)
Chain 1:                11.1804 seconds (Total)
Chain 1: 
Warning message:
“There were 14 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“The largest R-hat is 1.92, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
   user  system elapsed 
 31.836   1.262  34.205 
Inference for Stan model: ordered_logit.
1 chains, each with iter=100; warmup=50; thin=1; 
post-warmup draws per chain=50, total post-warmup draws=50.

             mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff
beta[1]    -41.14   30.36 52.19 -128.46 -104.60  -14.21   -6.19   23.82     3
beta[2]     -0.05    0.04  0.32   -0.61   -0.28   -0.03    0.15    0.67    63
beta[3]      0.57    0.04  0.21    0.19    0.48    0.55    0.68    0.98    30
beta[4]      1.04    0.04  0.31    0.53    0.84    0.95    1.26    1.68    52
cutpts[1]  -39.08   30.33 52.23 -127.17 -102.33  -11.75   -4.10   25.76     3
cutpts[2]  -36.99   30.34 52.23 -125.00 -100.13   -9.66   -2.10   28.09     3
lp__      -360.53    0.25  1.50 -364.15 -361.39 -360.09 -359.39 -358.49    35
          Rhat
beta[1]   1.99
beta[2]   0.98
beta[3]   1.00
beta[4]   0.98
cutpts[1] 1.98
cutpts[2] 1.98
lp__      0.98

Samples were drawn using NUTS(diag_e) at Mon Jan 31 10:24:46 2022.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).