Modelling Mixtures: the Dirichlet distribution

A measure on discrete measures.

You are given a set \(\mathcal{X}\) (here taken as finite) and a probability density \(p(x), (p(x)\geq 0, \sum p(x)=1)\). Also given is a set \(A\) in \(\mathcal{X}\). The problem is to compute or approximate \(p(A)\). We consider the Birthday Problem from a Bayesian standpoint.

In order to go further we need to extend what we did before for the binomial and its Conjugate Prior to the multinomial and the the Dirichlet Prior. This is a probability distribution on the \(n\) simplex

\[\Delta_n=\{{\tilde{p} }=(p_1,\cdots,p_n),\; p_1+\cdots +p_n=1,\; p_i\geq 0\;\}\]

It is a \(n\)-dimensional version of the beta density. Wilks (1962) is a standard reference for Dirichlet c omputations.

The Dirichlet has a parameter vector: \(\tilde{a} =(a_1,\ldots,a_n)\). Throughout we write \(A=a_1+\cdots+a_n\).

With respect to Lebesgue measure on \(\Delta_n\) normalised to have total mass 1 the Dirichlet has density:

\[D_{\tilde{a}}(\tilde{x})=\frac{\Gamma(A)}{\prod \Gamma (a_i)} x_1^{a_1-1} x_2^{a_2-1} \cdots x_n^{a_n-1}\]

The uniform distribution on \(\Delta_n\) results from choosing all \(a_i=1\). The multinomial distribution corresponding to \(k\) balls dropped into \(n\) boxes with fixed probability \((p_1,\cdots,p_n)\) (with the ith box containing \(k_i\) balls) is

\[{k \choose {k_1 \ldots k_n}} p_1^{k_1} \cdots p_n^{k_n}\]

If this is averaged with respect to \(D_{\tilde{a}}\) one gets the marginal (or Dirichlet/ Multinomial):

\[P (k_1,\ldots,k_n) = P(\tilde{k})= \frac{(a_1)_{(k_1)} (a_2)_{(k_2)} \ldots (a_n)_{(k_n)}} {A_{(k)}} \mbox{ where }\;\; m_{(j)} \stackrel{\mbox{def}}{=} m(m+1)\cdots (m+(j-1))\]

From a more practical point of view there are two simple procedures worth recalling here:

  • To pick \(\tilde{p}\) from a Dirichlet prior; just pick \(X_1, X_2, \ldots ,X_n\) independant from gamma densities

    \[\frac{e^{-x}x^{a_i-1}}{\Gamma(a_i)} \mbox{ and set } p_i=\frac{X_i}{X_1+\cdots X_n}, 1\leq i\leq N\]
  • To generate sequential samples from the marginal distribution use Polya’s Urn: Consider an urn containing \(a_i\) balls of color \(i\) (actually fractions are allowed).

    Each time, choose a color \(i\) with probability proportional to the number of balls of that color in the urn. If \(i\) is drawn, replace it along with another ball of the same color.

The Dirichlet is a convenient prior because the posterior for \(\tilde{p}\) having observed \((k_1,\cdots,k_n)\) is Dirichlet with probability \((a_1+k_1,\cdots,a_n+k_n)\). Zabell (1982) gives a nice account of W.E. Johnson’s characterization of the Dirichlet: it is the only prior that predicts outcomes linearly in the past. One special case is the symmetric Dirichlet when all \(a_i=c >0\). We denote this prior as \(D_c\).

\(The\; Birthday\; Problem\)

In its simplest version the birthday problem involves \(k\) balls dropped uniformly at random into \(n\) boxes. We declare a \(match\) if two or more balls drop into the same box. Elementary considerations Feller (1968, page 33) show that:

\[P(match)= 1-P(no\; matches) =1-\prod^{k-1}_{i=1}(1-\frac{i}{n}) \doteq 1-e^{-\frac{{{k}\choose{2}}}{n}}\]

To be more precise:

Proposition: If \(n\) and \(k\) are large in such a way that : \(\frac{{{k}\choose{2}}}{n} \longrightarrow \lambda\) then in the classical birthday problem :

\[P(Match) \cong 1-e^{-\lambda}\]

Proof :

\[\mbox{ Write } \qquad \prod^{k-1}_{i=1}(1-\frac{i}{n})=exp({\sum_{i=1}^{k-1}}log (1-\frac{i}{n}))\]

Expand the \(\log\) using \(log (1-x)=-x+O(x^2)\) to see that the exponent is

\[-\frac{{{k}\choose{2}}}{n} +O(\frac{k^3}{n^2})\]

Setting the right side equal to \(\frac{1}{2}\) and solving for \(k\) shows that if \(k\doteq 1.2 \sqrt{n}\) , \(P(match) \doteq \frac{1}{2}\). When \(n=365\), \(k=1.2\sqrt{n}\doteq 22.9\). This gives the usual statement : it is about even odds that there are two people with the same birthday in a group of 23.

In contrast with the clean formulation above, consider
an introductory probability class with 25

students. When considering ‘doing’ the birthday problem, several things come to mind :

  • It is slightly better than a 50-50 chance of success with 25 students.
  • If it fails it’s quite a disaster, 10 minutes of class time with a big build-up, devoted to a failure.
  • If it succeeds, the students are usually captivated and they are interested in learning how computations can dissipate the ‘intuitive fog’ surrounding probability.
  • The students are practically all from the same year, it is quite well known that birthdays within a year are not uniformly distributed; far more births occur on weekdays than on weekends (Doctors don’t like to work on weekends, and the induced births and c-sections are never scheduled for weekends). There are also seasonal trends (more births in the summer) and lunar trends.

Taking these things into consideration, you realize that you have no idea what the ‘true underlying probabilities’ are! The considerations above make a match more likely. It seems sensible to carry through a Bayesian version of the birthday problem.

The problem becomes : drop \(k\) balls into \(n\) boxes where the chance of a ball falling in box \(i\) is \(p_i\). Here \(\tilde{p}=(p_1,p_2,\ldots,p_n)\) has some prior distribution on the \(n\)-simplex \(\Delta_n\). We will work with a Dirichlet prior \(D_{\tilde{a}}\), with \(\tilde{a}=(a_1,a_2,\ldots,a_n)\). This has density proportional to \(x_1^{a_1-1} x_2^{a_2-1} \cdots x_n^{a_n-1}\).

For \(a_i=1\) we get the classical uniform prior on the simplex.

For \(a_i\equiv c\) we get the symmetric Dirichlet prior.

The Dirichlet prior is the n-dimensional version of the 2-dimensional beta prior we have already studied.

This interpolates between the uniform prior(c=1) and the classical case \(p_i=\frac{1}{365}\) (\(c\longrightarrow \infty\)). For more general choices of \(a_i\) we get a flexible family of priors.

Let us we carry out necessary computations in the following cases; in each we give \(k\) required for a \(50-50\) chance of a match when n=365:

  • Uniform Prior, c=1 \(k\doteq .83 \sqrt{n}\), for \(n=365\), \(k\doteq 16\)

  • Symmetric Prior, \(a_i=c\)

    \[\begin{split}\begin{array}{l|ccccccc} c & .5 & 1 & 2 & 5 & 20 & \infty\\ \hline k_c & 13.2 & 16.2 & 18.7 & 20.9 & 21.9 & 22.9\\ \end{array}\end{split}\]

    In coin tossing the standard prior is the uniform on \([0,1]\). The standard prior on the \(n-\)dimensional simplex \(\Delta_n\) is the uniform distribution \(U\) where all vectors \(\tilde{p}\) have same probability.

    The probability of a match, averaged over \((p_1,p_2,\ldots,p_n)\), represents the chance of success to a Bayesian statistician who has chosen the uniform prior. As is well known, (Bayes (17???), Good (1979), Diaconis and Efron (1987)) such a uniform mixture of multinomials results in Bose-Einstein allocation of balls in boxes, each configuration, or composition \((k_1,k_2,\ldots,k_n)\) being equally likely with chance \(\displaystyle{ \frac{1}{{{k+n-1}\choose {k}}}}\). For this simple prior, it is again possible to do an exact calculation:

    Under a uniform prior on \(\Delta_n\)

\[\label{eq2.1} P_u(match)=1-\prod_{i=1}^{k-1}(\frac{1-\frac{i}{n}}{1+\frac{i}{n}})\]
If \(n\) and \(k\) are large in such a way that

\(\frac{k^2}{n}\longrightarrow \lambda\), then

\[\label{eq2.2} P(\mbox{match})\cong 1-e^{-\lambda}\]


Represent the uniform mixture of multinomials using Polya’s urn as described in the introduction. The chance that the first \(k\) balls fall in different boxes is

\[\frac{n-1}{n+1} \times \frac{n-2}{n+2} \cdots \times \frac{n-(k-1)}{n+(k+1)}\]
This gives ([eq2.1]) and ([eq2.2]) follows by the same calculus

approximations detailed in the proof of Proposition [prop1].

Thus in order to obtain a 50-50 chance of a match under a uniform prior \(k\) must be \(.83\sqrt{n}\). When \(n=365\), this becomes \(k=16\), and for \(k=23\), \(P_u(match)\doteq .75\).

The uniform prior allows some mass far from \((\frac{1}{n},\frac{1}{n}, \ldots, \frac{1}{n})\) and such “lumpy” configurations make a match quite likely.

The uniform prior studied above is a special case of a symmetric Dirichlet prior \(D_c\) on \(\Delta_n\), with \(c=1\). We next extend the calculations above to a general \(c\). For \(c\) increasing to infinity, the prior converges to point mass \(\delta_{(\frac{1}{n},\frac{1}{n},\ldots \frac{1}{n})}\) and thus gives the classical answer. When \(c\) converges to \(0\), \(D_c\) becomes an improper prior giving infinite mass to the corners of the simplex, thus for small \(c\), the following proposition shows that matches are judged likely when \(k=2\).

Under a symmetric Dirichlet prior \(D_c\) on \(\Delta_n\),

\[P_c(\mbox{match})=1-\prod^{k-1}_{i=1} \frac{(n-i)c}{nc+i} .\]

For a proof see the paper here

In order for the probability of a match to be about \(\frac{1}{2}\) ; \(k_c\doteq 1.2\sqrt{\frac{nc}{c+1}}\) is needed. The following table shows how \(k_c\) depends on \(c\) when \(n=365\):

\[\begin{split}\begin{array}{l|ccccccc} c & .5 & 1 & 2 & 5 & 20 & \infty\\ \hline k_c & 13.2 & 16.2 & 18.7 & 20.9 & 21.9 & 22.9\\ \end{array}\end{split}\]
  • Honest Priors Construct a \(2\) “hyper”parameter family of Dirichlet priors writing \(a_i= A \pi_i\), with \(\pi_1+\pi_2\cdots+ \pi_n=1\). Assign weekdays parameter \(\pi_i=a\), weekends \(\pi_i=\gamma a\), with \(260 a +104 \gamma a=1\). Here \(\gamma\) is the parameter ‘ratio of weekends to weekdays’, (roughly we said \(\gamma \doteq .7\)) and \(A\) measures the strength of prior conviction. The table below shows how \(k\) varies as a function of \(A\) and \(\gamma\). We have assumed the year has \(7\times 52=364\) days.

    \[\begin{split}\begin{array}{l|ccc} A\;\;\; {\gamma} & .5 & .7 & 1 \\ \hline 1 & 2.2 & 2.2 & 2.2\\ 364 & 16.1& 16.3 & 16.4\\ 728 & 18.4& 18.6& 18.8\\ \infty & 22.2& 22.4& 22.6 \end{array}\end{split}\]


  1. Under the uniform,symmetric Dirichlet priors for small \(c\),and ‘honest priors’ for moderate \(A\) the prior concentrates on fairly ‘lumpy’ vectors \(\tilde{p}\) which will have matches with small \(k\).
  2. The calculations show that some uncertainty about \(p\) leads to a much improved chance of a match. Returning to the \(25\) students in the probability class, the Dirichlet prior with \(A=365\) and \(\gamma=.7\) is a first approximation to the prior of the present authors. It leads to a chance of a match with \(25\) students of 0.81. This may be the reason the birthday problem works as often as it does!