### Learning Goals

To understand how to treat a distribution's parameters as unknowns, and then determine which parameter value has the maximum probability mass or density.


### Concept Check

Q: Many congratulations on the anniversary Jerry! :))

A1:  wahoo. wonderful :)

Q: Hi Chris, I had some confusion from the previous lecture on uncertainty and p-values, as the p-value was defined as the probability of the null hypothesis. In another stats course, I learnt it is the probability of extreme values, assuming the null hypothesis is true.

A1:  that's right! it is the probability of "seeing a result at least as extreme as what you observed, given that the null hypothesis is true"

Q: thanks for clarifying! This helps

A1:  My pleasure!

Q: what does it mean to maximize the posterior? to actually maximize the probability after we gain new information?

A1:  the posterior is the "probability of the parameter value given the data", and the prior is the a-priori belief about the likelihood of the parameter value before you saw data. Maximizing the posterior means finding the parameter value with the highest posterior density after incorporating the data

Q: so it seems that MAP and MLE are known for the basic distributions, but if we had a distribution that did not perfectly fit a binomial, Gaussian, etc., we would actually be calculating theta?

A1:  They are the basis for machine learning in general, which goes well beyond basic distributions. For more complex examples you need to (1) identify what your parameters are, (2) figure out a prior on them (for MAP), and (3) figure out the derivative, which would allow you to maximize it
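As a hedged sketch of those three steps (the coin data, the Beta(2, 2) prior, and the grid size are all made-up choices for illustration), here is MLE and MAP for a coin's heads probability, with step (3) done by a numeric grid search instead of a hand derivative:

```python
import math

# Hypothetical data: 7 heads out of 10 flips (made-up numbers).
heads, n = 7, 10

def log_likelihood(theta):
    # Step (1): the parameter is theta; this is the Bernoulli log-likelihood.
    return heads * math.log(theta) + (n - heads) * math.log(1 - theta)

def log_prior(theta):
    # Step (2): a Beta(2, 2) prior on theta (a modeling choice), up to a constant.
    return math.log(theta) + math.log(1 - theta)

# Step (3) done numerically: grid-search the argmax instead of
# setting the derivative to zero by hand.
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=log_likelihood)
map_est = max(grid, key=lambda t: log_likelihood(t) + log_prior(t))

print(mle)      # ~ heads / n = 0.7
print(map_est)  # ~ (heads + 1) / (n + 2) = 0.667
```

Working in log space keeps the products numerically stable, and the grid search stands in for the derivative step whenever the maximization has no closed form.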

Q: is the prior our belief of the probability of heads before we do anything?

A1:  yes!

Q: why is it N(0.5, 1^2)? I didn't follow the 1^2

A1:  the 1^2 is the variance, i.e. a standard deviation of 1. i suppose that is the author's choice for how uncertain they are about the parameter?

Q: oh ok so it’s arbitrary?

A1:  yes! I think the beta prior is also the "modeler's choice", but more meaningful

Q: what’s the difference between g(theta) and theta? I think theta is a distribution (as opposed to a number) so I’m not sure what a function of a distribution would be

A1:  great question. it's meant to be a density function! As in the PDF. In this case, instead of using f for the PDF like we normally do, we use g to make clear that the prior has a different density. that is a notational choice!

Q: will we prove why the mode comes up? that would help me understand better

A1:  the mode *is* the argmax of a density function :) so finding the MAP estimate and finding the mode of the posterior are the same thing
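A quick way to see this (with hypothetical Beta parameters) is to set the derivative of the Beta log-density to zero, which gives the mode formula (a - 1)/(a + b - 2), and then check that nudging theta in either direction only lowers the density:

```python
import math

a, b = 8, 4  # example Beta parameters (hypothetical)
mode = (a - 1) / (a + b - 2)  # from setting d/dtheta of the log-density to zero

def log_density(theta):
    # Beta log-density up to the normalizing constant
    # (constants don't move the argmax).
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

# The mode is the argmax: moving away from it lowers the density.
assert log_density(mode) > log_density(mode - 0.01)
assert log_density(mode) > log_density(mode + 0.01)
print(mode)  # 0.7
```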

Q: can you explain the normalizing constant?

A1:  in the context of the beta? It is simply the number that allows the density to integrate to one. We know it's a PDF, so we know such a number exists. Sometimes it cancels out of the calculation; other times you could imagine computing the normalizing constant explicitly
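For the beta specifically, the normalizing constant does have a known closed form, 1/B(a, b) = Gamma(a + b) / (Gamma(a) Gamma(b)); a small sketch (example shape parameters of my choosing) computes it and checks numerically that the normalized density integrates to one:

```python
import math

a, b = 3.0, 2.0  # example Beta shape parameters (hypothetical)

# The Beta normalizing constant: 1/B(a, b) = Gamma(a+b) / (Gamma(a) * Gamma(b)).
const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))

# Sanity check: the normalized density should integrate to 1
# (crude midpoint Riemann sum over (0, 1)).
steps = 100_000
total = sum(
    const * ((i + 0.5) / steps) ** (a - 1) * (1 - (i + 0.5) / steps) ** (b - 1)
    for i in range(steps)
) / steps
print(const)  # 12.0 for Beta(3, 2)
print(total)  # ~ 1.0
```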

Q: is there a way to determine how fast it converges based on the prior?

A1:  certainly. you could set up a problem where you choose a true p and a particular prior, then calculate a function that relates the number of iterations to some measure of how different the two PDFs are (e.g. earth mover's distance or KL divergence)
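One way (of several) to sketch such an experiment, with a made-up true p and the KL divergence approximated by a Riemann sum: compare the posteriors you would get under two different Beta priors as the number of flips grows. The shrinking KL between them is one measure of how fast the prior washes out:

```python
import math

def log_beta_pdf(theta, a, b):
    # Beta(a, b) log-density; lgamma keeps large parameters from overflowing.
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return logc + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def kl_beta(a1, b1, a2, b2, steps=20_000):
    # Numerical KL(Beta(a1,b1) || Beta(a2,b2)) via a midpoint Riemann sum.
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps
        lp = log_beta_pdf(t, a1, b1)
        lq = log_beta_pdf(t, a2, b2)
        total += math.exp(lp) * (lp - lq)
    return total / steps

true_p = 0.3  # hypothetical true heads probability
kls = []
for n in (10, 100, 1000):
    heads = round(true_p * n)  # idealized data: exactly the expected counts
    # Posteriors under two different priors: Beta(1, 1) vs. Beta(20, 20).
    kls.append(kl_beta(heads + 1, n - heads + 1, heads + 20, n - heads + 20))
    print(n, kls[-1])  # the KL shrinks as n grows: the prior washes out
```

Swapping in earth mover's distance, or comparing each posterior to the true p instead, would be variations on the same experiment.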

Q: here p_i is the mode?

A1:  p_i is set to be the mode