Multinomial models (Ch. 8)¶
We’ve spent a lot of time on binary / Binomial regression models.
Sometimes response has more than two categories: need a model for multinomial responses.
Solution? Why not look at multinomial as exponential family?
Multinomial¶
Mass function of \(Y \sim \text{Multinomial}(N,\pi)\) for \(\pi \in S_K\), the simplex in \(\mathbb{R}^K\): $\( f(y_1, \dots, y_K|\pi) = \binom{N}{y_1,\dots,y_K} \prod_{j=1}^K \pi_j^{y_j}, \qquad \sum_{j=1}^K y_j=N. \)$
Density with respect to counting measure on $\( \left\{(z_1, \dots, z_K) \in \mathbb{Z}^K: \sum_{j=1}^K z_j = N,\ z_j \geq 0 \right\} \)$
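A quick numerical check of this (a Python sketch; the rest of these notes use R, but the check is language-agnostic): the mass function sums to one over the support above, here for small \(N\) and \(K=3\).

```python
from math import comb, prod

def multinomial_pmf(y, pi):
    # f(y | N, pi): multinomial coefficient times prod_j pi_j^{y_j}, with N = sum(y)
    coef, rest = 1, sum(y)
    for yj in y:
        coef *= comb(rest, yj)
        rest -= yj
    return coef * prod(p ** yj for p, yj in zip(pi, y))

# the mass function sums to one over {y : y_j >= 0, sum_j y_j = N}
N, pi = 4, [0.2, 0.5, 0.3]
total = sum(multinomial_pmf((a, b, N - a - b), pi)
            for a in range(N + 1) for b in range(N + 1 - a))
# total ≈ 1.0
```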
Multinomial¶
We fix \(K\) as a baseline category.
Setting \(\pi_K = 1 - \sum_{j=1}^{K-1} \pi_j\), \(y_K = N - \sum_{j=1}^{K-1} y_j\) yields the equivalent representation $\( f(y_1, \dots, y_{K-1}|\pi) = \binom{N}{y_1,\dots,y_{K-1},\,y_K} \left(\prod_{j=1}^{K-1} \pi_j^{y_j}\right) \left(1 - \sum_{l=1}^{K-1}\pi_l\right)^{y_K} \)$
with the constraint \(\sum_{j=1}^{K-1}y_j\leq N.\)
Log-likelihood (ignoring the multinomial coefficient) $\( \log L(\pi|Y) = \sum_{j=1}^{K-1} y_j \log \pi_j + \left(N - \sum_{j=1}^{K-1} y_j\right) \log\left(1 - \sum_{j=1}^{K-1}\pi_j\right) \)$
This is an exponential family with sufficient statistics \((y_1, \dots, y_{K-1})\) and natural parameters $\( \eta_j = \log \frac{\pi_j}{\pi_K}, \qquad 1 \leq j \leq K-1. \)$
Log-likelihood in natural parameters $\( \begin{aligned} \log L(\eta|Y) &= \sum_{j=1}^{K-1} y_j \cdot \eta_j - N \cdot \log\left(1 + \sum_{l=1}^{K-1} e^{\eta_l} \right) \end{aligned} \)$
Note: parametrization here depends on baseline category which we chose to be level \(K\).
Baseline category logit models (8.1)¶
Suppose we now observe \((Y_i, X_i)\) where each \(Y_i|X_i \sim \text{Multinomial}(N_i, \pi_i)\).
Natural parametrization above leads to the baseline category logit model $\( \log \frac{\pi_j(X_i)}{\pi_K(X_i)} = X_i^T\beta[,j], \qquad 1 \leq j \leq K-1, \)$
or, in matrix form,
$\( \eta = X\beta \)$
with \(\beta \in \mathbb{R}^{p \times (K-1)}\) and \(\eta \in \mathbb{R}^{n \times (K-1)}\).
(For the baseline model, notation assumes the `(Intercept)` column is baked into \(X\).)
Softmax¶
Here, for \(1 \leq j \leq K-1\), $\( \pi_j(X) = \frac{e^{X^T\beta[,j]}}{1 + \sum_{l=1}^{K-1} e^{X^T\beta[,l]}}, \qquad \pi_K(X) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{X^T\beta[,l]}}. \)$
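A minimal Python sketch of this softmax map (hypothetical \(\eta\) values, not the lecture's data): the baseline category gets linear predictor 0, the resulting probabilities lie on the simplex, and the log-odds against the baseline recover \(\eta\).

```python
from math import exp, log

def baseline_softmax(eta):
    # eta: linear predictors for categories 1..K-1; baseline K has eta_K = 0
    denom = 1.0 + sum(exp(e) for e in eta)
    pi = [exp(e) / denom for e in eta]
    pi.append(1.0 / denom)            # baseline probability pi_K
    return pi

eta = [0.3, -1.2]                     # hypothetical X^T beta[, j] values, K = 3
pi = baseline_softmax(eta)
# sum(pi) ≈ 1, and log(pi_j / pi_K) recovers eta: [0.3, -1.2]
```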
Fitting baseline model¶
Data is represented as \(Y \in (\mathbb{Z}^+)^{n \times (K-1)}\) (dropping baseline \(K\) column)
Log-likelihood for \(\beta\) is $\( \log L(\beta|Y) = \sum_{i=1}^n \left[\sum_{j=1}^{K-1} Y_{ij} \cdot X_i^T\beta[,j] - N_i \cdot \log\left(1 + \sum_{l=1}^{K-1} e^{X_i^T\beta[,l]}\right)\right] \)$
Fitting baseline model¶
Let’s compute the gradient of \(\log L(\beta|Y)\). By now, this should not seem surprising: $\( \nabla_{\beta[,j]} \log L(\beta|Y) = \sum_{i=1}^n \left(Y_{ij} - N_i \cdot \pi_j(X_i)\right) X_i = \left(X^T(Y - \hat{Y})\right)[,j]. \)$
Above, \(\hat{Y}_{ij} = N_i \cdot \pi_j(X_i)\) are the fitted counts.
Gradient descent will be pretty easy…
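The closed form \(X^T(Y-\hat{Y})\), with \(\hat{Y}_{ij} = N_i \pi_j(X_i)\), is easy to check against finite differences. A NumPy sketch with simulated data (not the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 40, 3, 4
X = rng.normal(size=(n, p))
beta = rng.normal(size=(p, K - 1))
N = np.full(n, 5.0)                    # N_i = 5 trials per observation
Y = np.array([rng.multinomial(5, np.ones(K) / K) for _ in range(n)])[:, :K - 1]

def loglik(b):
    eta = X @ b                                  # n x (K-1) linear predictors
    lse = np.log1p(np.exp(eta).sum(axis=1))      # log(1 + sum_l exp(eta_l))
    return float((Y * eta).sum() - (N * lse).sum())

def grad(b):
    eta = X @ b
    pi = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=1, keepdims=True))
    return X.T @ (Y - N[:, None] * pi)           # X^T (Y - Yhat)

# central finite difference in one coordinate agrees with the closed form
E = np.zeros((p, K - 1))
E[1, 2] = 1.0
eps = 1e-6
fd = (loglik(beta + eps * E) - loglik(beta - eps * E)) / (2 * eps)
```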
Fitting baseline model¶
Hessian of \(\log L(\beta|Y)\) is a little harder to write – it is a tensor of shape \(\mathbb{R}^{p \times (K-1)} \otimes \mathbb{R}^{p \times (K-1)}\): $\( \nabla^2_{\beta[,j],\beta[,k]} \log L(\beta|Y) = -\sum_{i=1}^n N_i \cdot W_{jk}(X_i) \cdot X_iX_i^T \)$
where $\( W_{jk}(X_i) = \pi_j(X_i)\left(\delta_{jk} - \pi_k(X_i)\right), \)$ with \(\delta_{jk}=1\) if \(j=k\) and 0 otherwise.
Asymptotic covariance of \(\hat{\beta}\) is well-estimated (under regularity assumptions) by the inverse of the observed information \(-\nabla^2 \log L(\hat{\beta}|Y)\).
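A NumPy sketch of the resulting standard-error computation (simulated design and a fixed illustrative \(\beta\) standing in for \(\hat\beta\)): assemble the information blocks \(\sum_i N_i\,\pi_j(X_i)(\delta_{jk} - \pi_k(X_i))\,X_iX_i^T\), invert the flattened matrix, and read standard errors off the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 200, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
beta = np.array([[0.5, -0.5], [1.0, 0.0]])              # p x (K-1), baseline = K
N = np.ones(n)                                          # one trial per row

eta = X @ beta
pi = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=1, keepdims=True))  # n x (K-1)

# information matrix: block (j,k) is sum_i N_i * pi_ij (delta_jk - pi_ik) * x_i x_i^T
q = p * (K - 1)
info = np.zeros((q, q))
for j in range(K - 1):
    for k in range(K - 1):
        w = N * pi[:, j] * ((j == k) - pi[:, k])
        info[j*p:(j+1)*p, k*p:(k+1)*p] = X.T @ (w[:, None] * X)

# standard errors: square roots of the diagonal of the inverse information
se = np.sqrt(np.diag(np.linalg.inv(info))).reshape(K - 1, p)
```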
Example¶
Goal: predict a student’s academic program type `prog` in "academic", "general", "vocational" from scores on a writing test `write` and from `ses`.
library(foreign)
ml = read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
str(ml)
'data.frame': 200 obs. of 13 variables:
$ id : num 45 108 15 67 153 51 164 133 2 53 ...
$ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
$ ses : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
$ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
$ prog : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
$ read : num 34 34 39 37 39 42 31 50 39 34 ...
$ write : num 35 33 39 37 31 36 36 31 41 37 ...
$ math : num 41 41 44 42 40 42 46 40 33 46 ...
$ science: num 29 36 26 33 39 31 39 34 42 39 ...
$ socst : num 26 36 42 32 51 39 46 31 41 31 ...
$ honors : Factor w/ 2 levels "not enrolled",..: 1 1 1 1 1 1 1 1 1 1 ...
$ awards : num 0 0 0 0 0 0 0 0 0 0 ...
$ cid : int 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "datalabel")= chr "highschool and beyond (200 cases)"
- attr(*, "time.stamp")= chr "30 Oct 2009 14:13"
- attr(*, "formats")= chr [1:13] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
- attr(*, "types")= int [1:13] 254 254 254 254 254 254 254 254 254 254 ...
- attr(*, "val.labels")= chr [1:13] "" "fl" "sl" "scl" ...
- attr(*, "var.labels")= chr [1:13] "" "" "" "type of school" ...
- attr(*, "version")= int 8
- attr(*, "label.table")=List of 5
..$ sl : Named int [1:3] 1 2 3
.. ..- attr(*, "names")= chr [1:3] "low" "middle" "high"
..$ scl : Named int [1:2] 1 2
.. ..- attr(*, "names")= chr [1:2] "public" "private"
..$ sel : Named int [1:3] 1 2 3
.. ..- attr(*, "names")= chr [1:3] "general" "academic" "vocation"
..$ fl : Named int [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "male" "female"
..$ honlab: Named int [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "not enrolled" "enrolled"
library(nnet)
model_full = nnet::multinom(prog ~ ses + write, data = ml)
summary(model_full)
dim(coef(model_full))
# weights: 15 (8 variable)
initial value 219.722458
iter 10 value 179.985215
final value 179.981726
converged
Call:
nnet::multinom(formula = prog ~ ses + write, data = ml)
Coefficients:
(Intercept) sesmiddle seshigh write
academic -2.851973 0.5332914 1.1628257 0.05792480
vocation 2.366097 0.8246384 0.1802176 -0.05567514
Std. Errors:
(Intercept) sesmiddle seshigh write
academic 1.166437 0.4437319 0.5142215 0.02141092
vocation 1.174251 0.4901237 0.6484508 0.02333135
Residual Deviance: 359.9635
AIC: 375.9635
- 2
- 4
vcov(model_full)
| academic:(Intercept) | academic:sesmiddle | academic:seshigh | academic:write | vocation:(Intercept) | vocation:sesmiddle | vocation:seshigh | vocation:write | |
|---|---|---|---|---|---|---|---|---|
| academic:(Intercept) | 1.360576234 | -0.0939555923 | -0.0598081985 | -0.0238322833 | 0.69281537 | -0.0271783502 | 0.007910847 | -0.0128334660 |
| academic:sesmiddle | -0.093955592 | 0.1968980069 | 0.1225345572 | -0.0005216217 | -0.04944195 | 0.1050952297 | 0.058406872 | -0.0001497438 |
| academic:seshigh | -0.059808198 | 0.1225345572 | 0.2644237520 | -0.0011823607 | -0.02173743 | 0.0589096999 | 0.165112195 | -0.0007211086 |
| academic:write | -0.023832283 | -0.0005216217 | -0.0011823607 | 0.0004584275 | -0.01258505 | -0.0005960548 | -0.001266725 | 0.0002545341 |
| vocation:(Intercept) | 0.692815375 | -0.0494419508 | -0.0217374274 | -0.0125850549 | 1.37886605 | -0.0905237580 | -0.082650366 | -0.0258391029 |
| vocation:sesmiddle | -0.027178350 | 0.1050952297 | 0.0589096999 | -0.0005960548 | -0.09052376 | 0.2402212037 | 0.154979401 | -0.0012868585 |
| vocation:seshigh | 0.007910847 | 0.0584068716 | 0.1651121953 | -0.0012667250 | -0.08265037 | 0.1549794008 | 0.420488501 | -0.0014339382 |
| vocation:write | -0.012833466 | -0.0001497438 | -0.0007211086 | 0.0002545341 | -0.02583910 | -0.0012868585 | -0.001433938 | 0.0005443519 |
Wald tests¶
Z = summary(model_full)$coefficients / summary(model_full)$standard.errors
Z
| (Intercept) | sesmiddle | seshigh | write | |
|---|---|---|---|---|
| academic | -2.445028 | 1.201832 | 2.2613324 | 2.705386 |
| vocation | 2.014984 | 1.682511 | 0.2779203 | -2.386280 |
P = 2 * pnorm(abs(Z), lower.tail=FALSE)
P
| (Intercept) | sesmiddle | seshigh | write | |
|---|---|---|---|---|
| academic | 0.01448407 | 0.22942845 | 0.02373868 | 0.00682251 |
| vocation | 0.04390634 | 0.09246981 | 0.78107356 | 0.01701977 |
Likelihood ratio tests¶
model_write = multinom(prog ~ write, data = ml)
anova(model_write, model_full)
# weights: 9 (4 variable)
initial value 219.722458
iter 10 value 185.510839
iter 10 value 185.510838
final value 185.510838
converged
| Model | Resid. df | Resid. Dev | Test | Df | LR stat. | Pr(Chi) |
|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> |
| write | 396 | 371.0217 | NA | NA | NA | |
| ses + write | 392 | 359.9635 | 1 vs 2 | 4 | 11.05822 | 0.02591741 |
model_ses = multinom(prog ~ ses, data = ml)
anova(model_ses, model_full)
# weights: 12 (6 variable)
initial value 219.722458
iter 10 value 195.705517
final value 195.705188
converged
| Model | Resid. df | Resid. Dev | Test | Df | LR stat. | Pr(Chi) |
|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> |
| ses | 394 | 391.4104 | NA | NA | NA | |
| ses + write | 392 | 359.9635 | 1 vs 2 | 2 | 31.44692 | 1.483841e-07 |
Latent variable model (8.1.6)¶
Suppose $\( U_{ij} = X_i^T\beta[,j] + \epsilon_{ij}, \qquad 1 \leq j \leq K, \)$
for some noise distribution, with the \(\epsilon_{ij}\)’s independent across \(i\).
We might take our multinomial observation as $\( Y_i = \text{argmax}_{1 \leq j \leq K} U_{ij}. \)$
If the \(\epsilon_{ij}\) are independent extreme value (Gumbel – the distribution behind the cloglog link), then the resulting likelihood is equivalent to a baseline multinomial logit, essentially with parameters \(\beta[,j] - \beta[,K]\).
A more “natural” model is to assume the \(\epsilon_{ij}\)’s are independent \(N(0,1)\) (or perhaps correlated within a subject?).
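This extreme-value fact (often called the Gumbel-max trick) is easy to check by simulation. A Python sketch with hypothetical utilities for a single subject: adding independent standard Gumbel noise to each utility and taking the argmax reproduces the softmax probabilities.

```python
import random
from math import exp, log

random.seed(0)
mu = [1.0, 0.0, -0.5]                 # utilities X^T beta[, j], one subject, K = 3
denom = sum(exp(m) for m in mu)
softmax = [exp(m) / denom for m in mu]

def gumbel():
    # standard Gumbel via inverse CDF: -log(-log(U))
    return -log(-log(random.random()))

draws = 200_000
counts = [0] * len(mu)
for _ in range(draws):
    u = [m + gumbel() for m in mu]    # noisy utilities
    counts[u.index(max(u))] += 1      # observed category = argmax

freq = [c / draws for c in counts]    # ≈ softmax, e.g. softmax[0] ≈ 0.6285
```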
Leads to a multinomial probit model, though the likelihood is rather complicated: $\( P(Y_i = j) = \int \phi(z) \prod_{k \neq j} \Phi\left(z + X_i^T(\beta[,j] - \beta[,k])\right)\, dz. \)$
Data augmented sampling is (relatively) straightforward.
Cumulative logit models (8.2)¶
When levels of \(Y\) are ordinal, we might model things differently.
Cumulative model considers the probabilities $\( \gamma_j(X) = P(Y \leq j|X) = \sum_{l=1}^j \pi_l(X), \qquad 1 \leq j \leq K-1, \)$
and models these as $\( \text{logit}(\gamma_j(X)) = \alpha_j - X^T\beta. \)$
Note: \(\alpha_{j} \geq \alpha_{j-1}\).
Fewer parameters than the baseline model, but link is not canonical – optimization is more difficult.
Log-likelihood looks like $\( \log L(\alpha,\beta|Y) = \sum_{i=1}^n \log\left(G(\alpha_{Y_i} - X_i^T\beta) - G(\alpha_{Y_i-1} - X_i^T\beta)\right) \)$
with \(\alpha_0=-\infty,\ \alpha_K=\infty\) and \(G\) the logistic CDF.
No need to identify a baseline category.
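A small Python sketch of the implied cell probabilities (logistic \(G\), hypothetical cutpoints): differencing the cumulative probabilities gives nonnegative cell probabilities (since the \(\alpha_j\) are increasing) that sum to one.

```python
from math import exp, inf

def G(t):
    # logistic CDF, with the conventions G(-inf) = 0, G(inf) = 1
    if t == -inf:
        return 0.0
    if t == inf:
        return 1.0
    return 1.0 / (1.0 + exp(-t))

def cell_probs(alpha, xb):
    # alpha: increasing cutpoints (alpha_1, ..., alpha_{K-1}); xb = X^T beta
    a = [-inf] + list(alpha) + [inf]
    return [G(a[j + 1] - xb) - G(a[j] - xb) for j in range(len(a) - 1)]

pi = cell_probs([-1.0, 0.5, 2.0], xb=0.3)    # K = 4 categories
# pi is nonnegative (alpha increasing) and sums to one (telescoping)
```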
Viewed as a generalization of threshold model…¶
Let \(Y^*=X^T\beta + \epsilon\) with \(\epsilon\) having a logistic distribution, and set \(Y=j\) when \(\alpha_{j-1} < Y^* \leq \alpha_j\).
Obvious analogs when errors are not logistic, but have CDF \(G\).
Leads to log-likelihood $\( \log L(\alpha,\beta|Y) = \sum_{i=1}^n \log\left(G(\alpha_{Y_i} - X_i^T\beta) - G(\alpha_{Y_i-1} - X_i^T\beta)\right). \)$
Example¶
Questionnaire has students rank “likelihood” of applying to graduate school: `apply`.
We also collect `gpa`, as well as `pared` (indicating whether or not at least one parent has a graduate degree) and `public` (is the undergraduate institution public or private?). More involved analysis here.
graduate_data = read.dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")
str(graduate_data)
'data.frame': 400 obs. of 4 variables:
$ apply : Factor w/ 3 levels "unlikely","somewhat likely",..: 3 2 1 2 2 1 2 2 1 2 ...
$ pared : int 0 1 1 0 0 0 0 0 0 1 ...
$ public: int 0 0 1 0 0 1 0 0 0 0 ...
$ gpa : num 3.26 3.21 3.94 2.81 2.53 ...
- attr(*, "datalabel")= chr ""
- attr(*, "time.stamp")= chr " 3 Aug 2010 10:45"
- attr(*, "formats")= chr [1:4] "%15.0g" "%9.0g" "%9.0g" "%9.0g"
- attr(*, "types")= int [1:4] 251 251 251 254
- attr(*, "val.labels")= chr [1:4] "apply" "" "" ""
- attr(*, "var.labels")= chr [1:4] "" "" "" ""
- attr(*, "version")= int 12
- attr(*, "label.table")=List of 1
..$ apply: Named int [1:3] 0 1 2
.. ..- attr(*, "names")= chr [1:3] "unlikely" "somewhat likely" "very likely"
library(MASS)
graduate_model = MASS::polr(apply ~ public + gpa + pared,
data=graduate_data,
Hess=TRUE)
summary(graduate_model)
Call:
MASS::polr(formula = apply ~ public + gpa + pared, data = graduate_data,
Hess = TRUE)
Coefficients:
Value Std. Error t value
public -0.05879 0.2979 -0.1974
gpa 0.61594 0.2606 2.3632
pared 1.04769 0.2658 3.9418
Intercepts:
Value Std. Error t value
unlikely|somewhat likely 2.2039 0.7795 2.8272
somewhat likely|very likely 4.2994 0.8043 5.3453
Residual Deviance: 717.0249
AIC: 727.0249
Compare to a less parsimonious model¶
graduate_multinom = multinom(apply ~ public + gpa + pared, data=graduate_data)
c(AIC(graduate_multinom), AIC(graduate_model))
# weights: 15 (8 variable)
initial value 439.444915
iter 10 value 357.012275
final value 356.996982
converged
- 729.993963188076
- 727.024871965377
Covariance matrix for Wald tests / intervals¶
vcov(graduate_model)
| public | gpa | pared | unlikely|somewhat likely | somewhat likely|very likely | |
|---|---|---|---|---|---|
| public | 0.088721414 | -0.01737218 | -0.003060743 | -0.04091869 | -0.041211341 |
| gpa | -0.017372183 | 0.06793006 | -0.007018330 | 0.20088564 | 0.203892325 |
| pared | -0.003060743 | -0.00701833 | 0.070644021 | -0.01039088 | -0.002951814 |
| unlikely|somewhat likely | -0.040918689 | 0.20088564 | -0.010390883 | 0.60769114 | 0.613233012 |
| somewhat likely|very likely | -0.041211341 | 0.20389232 | -0.002951814 | 0.61323301 | 0.646941420 |
Confidence intervals¶
confint(graduate_model)
Waiting for profiling to be done...
| 2.5 % | 97.5 % | |
|---|---|---|
| public | -0.6522060 | 0.5191384 |
| gpa | 0.1076202 | 1.1309148 |
| pared | 0.5281768 | 1.5721750 |
Likelihood ratio tests¶
anova(MASS::polr(apply ~ gpa, data=graduate_data), graduate_model)
| Model | Resid. df | Resid. Dev | Test | Df | LR stat. | Pr(Chi) |
|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> |
| gpa | 397 | 732.6000 | NA | NA | NA | |
| public + gpa + pared | 395 | 717.0249 | 1 vs 2 | 2 | 15.57509 | 0.0004148703 |
Adjacent category models (8.3.4)¶
Considers the parameters $\( \log \frac{\pi_j(X)}{\pi_{j+1}(X)}, \qquad 1 \leq j \leq K-1. \)$
Model: $\( \log \frac{\pi_j(X)}{\pi_{j+1}(X)} = \alpha_j + X^T\beta. \)$
Adjacent category models (8.3.4)¶
Equivalent to a linear reparametrization of the baseline logits $\( \log \frac{\pi_j(X)}{\pi_{K}(X)} = \alpha_j^* + (K-j) \cdot X^T\beta, \qquad 1 \leq j \leq K-1 \)\( where \)\( \alpha_j^* = \sum_{l=j}^{K-1} \alpha_l. \)$
Negative log-likelihood is convex in \((\alpha, \beta)\)…
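The reparametrization above is easy to verify numerically. A Python sketch with hypothetical \(\alpha\), a scalar \(X^T\beta\), and \(K=4\): build \(\pi\) from the adjacent-category logits, then check that the baseline logits take the stated form.

```python
from math import exp, log

# adjacent-category parameters (hypothetical values for illustration)
alpha, xb, K = [0.4, -0.2, 0.7], 0.5, 4      # alpha_1..alpha_{K-1}, scalar X^T beta

# build pi from the adjacent-category logits log(pi_j / pi_{j+1}) = alpha_j + xb
ratios = [exp(a + xb) for a in alpha]        # pi_j / pi_{j+1}
unnorm = [1.0] * K
for j in range(K - 2, -1, -1):
    unnorm[j] = ratios[j] * unnorm[j + 1]
pi = [u / sum(unnorm) for u in unnorm]

# baseline logits log(pi_j / pi_K) should equal alpha_j^* + (K - j) * xb
star = [sum(alpha[j:]) + (K - 1 - j) * xb for j in range(K - 1)]   # 0-based j
baseline = [log(pi[j] / pi[K - 1]) for j in range(K - 1)]
# baseline == star up to rounding
```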
Bayesian methods for multinomial (8.6)¶
For cumulative logit models, conceptually simple to sample \((Y^*,\alpha,\beta)|X\).
Constraints on \((Y^*,\alpha,\beta)\) when \(Y_i=j\): $\( \alpha_{j-1} < Y_i^* \leq \alpha_j. \)$
Each draw is from \(G\) truncated to some interval determined by \((\alpha, \beta)\).
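For logistic \(G\), such a truncated draw is a one-liner via the inverse CDF. A Python sketch (the function name and the interval endpoints are illustrative): when \(Y_i = j\), one would take `lo` \(= \alpha_{j-1}\) and `hi` \(= \alpha_j\).

```python
import random
from math import exp, log, inf

def logistic_cdf(t):
    if t == -inf:
        return 0.0
    if t == inf:
        return 1.0
    return 1.0 / (1.0 + exp(-t))

def truncated_logistic(xb, lo, hi):
    # inverse-CDF draw of Y* = xb + eps, eps logistic, conditioned on lo < Y* <= hi
    u_lo, u_hi = logistic_cdf(lo - xb), logistic_cdf(hi - xb)
    u = random.uniform(u_lo, u_hi)
    return xb + log(u / (1.0 - u))

random.seed(2)
draws = [truncated_logistic(0.5, lo=-1.0, hi=1.5) for _ in range(10_000)]
# every draw lands in the truncation interval (-1, 1.5]
```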
For baseline logit models, if we model using \(Y=\text{argmax}(U)\) with latent \(U\), then a similar augmented data model is possible with variables \((\epsilon,\beta)\).
STAN for multinomial¶
See here
Close to the baseline model, except it uses the overparameterized model $\( \pi(X_i) = \text{softmax}(X_i^T\beta), \qquad \beta \in \mathbb{R}^{p \times K}, \)$
while baseline model sets \(\beta[,K]=0\).
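The overparameterization is harmless for the implied probabilities: subtracting the baseline column \(\beta[,K]\) from every column leaves the softmax unchanged. A quick Python check with hypothetical linear predictors:

```python
from math import exp

def softmax(v):
    m = max(v)                        # subtract max for numerical stability
    e = [exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

eta = [1.2, -0.3, 0.8]                # one row of X @ beta, all K columns
shifted = [e - eta[-1] for e in eta]  # subtract the beta[,K] column: baseline -> 0
p_over = softmax(eta)                 # overparameterized (STAN) version
p_base = softmax(shifted)             # baseline-constrained version
# p_over == p_base up to rounding: probabilities are shift-invariant
```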
STAN code¶
data {
int<lower=1> N;
int<lower=1> K;
int<lower=1> P;
matrix[N, P] X;
int<lower=1,upper=K> y[N];
}
parameters {
matrix[P,K] beta;
}
model {
to_vector(beta) ~ normal(0, 100);
for (n in 1:N) {
y[n] ~ categorical_logit((X[n] * beta)');
}
}
# pretty slow!
X_graduate = model.matrix(graduate_model)
graduate_stan = list(X=X_graduate,
N=nrow(X_graduate),
P=ncol(X_graduate),
K=3,
y=as.integer(graduate_data$apply))
system.time(grad_samples <- rstan::stan('multinom.stan',
data=graduate_stan,
iter=100,
chains=1))
SAMPLING FOR MODEL 'multinom' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.002744 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 27.44 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: WARNING: There aren't enough warmup iterations to fit the
Chain 1: three stages of adaptation as currently configured.
Chain 1: Reducing each adaptation stage to 15%/75%/10% of
Chain 1: the given number of warmup iterations:
Chain 1: init_buffer = 7
Chain 1: adapt_window = 38
Chain 1: term_buffer = 5
Chain 1:
Chain 1: Iteration: 1 / 100 [ 1%] (Warmup)
Chain 1: Iteration: 10 / 100 [ 10%] (Warmup)
Chain 1: Iteration: 20 / 100 [ 20%] (Warmup)
Chain 1: Iteration: 30 / 100 [ 30%] (Warmup)
Chain 1: Iteration: 40 / 100 [ 40%] (Warmup)
Chain 1: Iteration: 50 / 100 [ 50%] (Warmup)
Chain 1: Iteration: 51 / 100 [ 51%] (Sampling)
Chain 1: Iteration: 60 / 100 [ 60%] (Sampling)
Chain 1: Iteration: 70 / 100 [ 70%] (Sampling)
Chain 1: Iteration: 80 / 100 [ 80%] (Sampling)
Chain 1: Iteration: 90 / 100 [ 90%] (Sampling)
Chain 1: Iteration: 100 / 100 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.588644 seconds (Warm-up)
Chain 1: 51.6949 seconds (Sampling)
Chain 1: 52.2836 seconds (Total)
Chain 1:
Warning message:
“There were 26 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“The largest R-hat is 2.11, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
user system elapsed
74.968 1.701 79.488
grad_samples
Inference for Stan model: multinom.
1 chains, each with iter=100; warmup=50; thin=1;
post-warmup draws per chain=50, total post-warmup draws=50.
mean se_mean sd 2.5% 25% 50% 75%
beta[1,1] 0.82 0.64 1.04 -0.83 -0.11 0.87 1.69
beta[1,2] 0.46 0.71 1.13 -1.31 -0.57 0.46 1.69
beta[1,3] -0.74 0.76 1.20 -2.52 -1.81 -0.83 0.40
beta[2,1] -1.59 1.07 1.77 -4.71 -2.91 -1.73 0.42
beta[2,2] -1.58 1.08 1.78 -4.73 -2.92 -1.73 0.41
beta[2,3] -1.59 1.08 1.78 -4.73 -2.93 -1.74 0.41
beta[3,1] -0.45 0.52 0.99 -1.32 -1.15 -0.93 0.27
beta[3,2] -0.48 0.54 1.02 -1.31 -1.25 -0.95 0.27
beta[3,3] -0.50 0.55 1.03 -1.41 -1.26 -0.94 0.25
beta[4,1] 1.97 0.71 1.17 0.19 0.36 2.45 2.99
beta[4,2] 1.97 0.71 1.17 0.17 0.35 2.45 3.01
beta[4,3] 2.00 0.67 1.11 0.30 0.51 2.45 2.97
lp__ -148281.16 32.18 81.58 -148518.58 -148256.96 -148247.34 -148244.84
97.5% n_eff Rhat
beta[1,1] 2.25 3 2.95
beta[1,2] 1.82 3 3.13
beta[1,3] 0.84 3 2.96
beta[2,1] 0.57 3 2.49
beta[2,2] 0.64 3 2.48
beta[2,3] 0.60 3 2.48
beta[3,1] 1.89 4 1.53
beta[3,2] 1.90 4 1.57
beta[3,3] 1.90 4 1.59
beta[4,1] 3.21 3 2.20
beta[4,2] 3.20 3 2.21
beta[4,3] 3.21 3 2.25
lp__ -148242.37 6 1.19
Samples were drawn using NUTS(diag_e) at Mon Jan 31 10:24:12 2022.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
STAN for ordered logit¶
See here
STAN code for ordered logit¶
data {
int<lower=1> N;
int<lower=1> K;
int<lower=1> P;
matrix[N,P] X;
int<lower=1,upper=K> y[N];
}
parameters {
vector[P] beta;
ordered[K-1] cutpts;
}
model {
beta ~ normal(0, 100);
for (n in 1:N) {
y[n] ~ ordered_logistic(X[n] * beta, cutpts);
}
}
# also a bit slow!
system.time(ordered_samples <- rstan::stan('ordered_logit.stan',
data=graduate_stan,
iter=100,
chains=1))
ordered_samples
SAMPLING FOR MODEL 'ordered_logit' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.000544 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 5.44 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: WARNING: There aren't enough warmup iterations to fit the
Chain 1: three stages of adaptation as currently configured.
Chain 1: Reducing each adaptation stage to 15%/75%/10% of
Chain 1: the given number of warmup iterations:
Chain 1: init_buffer = 7
Chain 1: adapt_window = 38
Chain 1: term_buffer = 5
Chain 1:
Chain 1: Iteration: 1 / 100 [ 1%] (Warmup)
Chain 1: Iteration: 10 / 100 [ 10%] (Warmup)
Chain 1: Iteration: 20 / 100 [ 20%] (Warmup)
Chain 1: Iteration: 30 / 100 [ 30%] (Warmup)
Chain 1: Iteration: 40 / 100 [ 40%] (Warmup)
Chain 1: Iteration: 50 / 100 [ 50%] (Warmup)
Chain 1: Iteration: 51 / 100 [ 51%] (Sampling)
Chain 1: Iteration: 60 / 100 [ 60%] (Sampling)
Chain 1: Iteration: 70 / 100 [ 70%] (Sampling)
Chain 1: Iteration: 80 / 100 [ 80%] (Sampling)
Chain 1: Iteration: 90 / 100 [ 90%] (Sampling)
Chain 1: Iteration: 100 / 100 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 3.32696 seconds (Warm-up)
Chain 1: 7.85343 seconds (Sampling)
Chain 1: 11.1804 seconds (Total)
Chain 1:
Warning message:
“There were 14 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”
Warning message:
“The largest R-hat is 1.92, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat”
Warning message:
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess”
Warning message:
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess”
user system elapsed
31.836 1.262 34.205
Inference for Stan model: ordered_logit.
1 chains, each with iter=100; warmup=50; thin=1;
post-warmup draws per chain=50, total post-warmup draws=50.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
beta[1] -41.14 30.36 52.19 -128.46 -104.60 -14.21 -6.19 23.82 3
beta[2] -0.05 0.04 0.32 -0.61 -0.28 -0.03 0.15 0.67 63
beta[3] 0.57 0.04 0.21 0.19 0.48 0.55 0.68 0.98 30
beta[4] 1.04 0.04 0.31 0.53 0.84 0.95 1.26 1.68 52
cutpts[1] -39.08 30.33 52.23 -127.17 -102.33 -11.75 -4.10 25.76 3
cutpts[2] -36.99 30.34 52.23 -125.00 -100.13 -9.66 -2.10 28.09 3
lp__ -360.53 0.25 1.50 -364.15 -361.39 -360.09 -359.39 -358.49 35
Rhat
beta[1] 1.99
beta[2] 0.98
beta[3] 1.00
beta[4] 0.98
cutpts[1] 1.98
cutpts[2] 1.98
lp__ 0.98
Samples were drawn using NUTS(diag_e) at Mon Jan 31 10:24:46 2022.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).