Lecture 20: Beta

March 1st, 2021

Lecture Materials

Learning Goals

Know how to represent uncertainty about a probability. Know how to update that belief given new evidence. See how randomized algorithms can use the ability to express uncertainty about a probability to make better choices. Appreciate that all parameters can be seen as random variables.

Reading

Concept Check

https://www.gradescope.com/courses/226051/assignments/1063910

Questions & Answers

Q: will you be releasing statistics for the quiz?

A1: At the moment, the median grade is 84 out of 90.

A2: Yep, very shortly. There are 11 exams we still need to finish grading, but we’ll update the course website to include statistics once they’re final.

Q: If you do a lot of trials, does the effect of the prior start to fade? (maybe if you have a low entropy prior) Or does a bad prior lead to a problem even after many trials?

A1: Absolutely, yes, it fades. The prior is just an initial guess, and if that guess turns out to be horrible, then with enough trials the posterior moves away from it quite a bit, toward the true probability.
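
To make the fading concrete, here is a minimal sketch. It assumes the "add successes to a, add failures to b" update for a Beta prior. The two priors, the value of `true_p`, and the idealized counts (exactly the expected number of successes) are illustrative assumptions, not numbers from lecture.

```python
def posterior_mean(a, b, successes, failures):
    """Mean of the posterior Beta(a + successes, b + failures)."""
    return (a + successes) / (a + b + successes + failures)

# Two hypothetical priors that disagree strongly.
optimistic = (9, 1)   # Beta(9, 1), prior mean 0.9
pessimistic = (1, 9)  # Beta(1, 9), prior mean 0.1

true_p = 0.3  # assumed true probability, for illustration only

for trials in (10, 100, 10_000):
    # Idealized data: exactly the expected number of successes.
    successes = round(true_p * trials)
    failures = trials - successes
    m_opt = posterior_mean(*optimistic, successes, failures)
    m_pes = posterior_mean(*pessimistic, successes, failures)
    print(trials, round(m_opt, 4), round(m_pes, 4), round(m_opt - m_pes, 4))
```

With these two priors the gap between the posterior means is 8 / (10 + trials), so it shrinks toward zero as the number of trials grows.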

Q: Can we use the posterior distribution as a prior for another experiment? For example could we flip the coin k more times after these m + n times now using the posterior as the prior and would the resulting distribution be the same as if we had a single experiment with m + n + k flips?

A1: Yes. Flipping m coins, then n, then k is the same as flipping m + n, then k, and that’s the same as flipping m + n + k.
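
A small sketch of why the order of updating does not matter, assuming the usual Beta update where heads are added to a and tails to b. The specific counts and the `update` helper are hypothetical, chosen only for illustration.

```python
def update(prior, heads, tails):
    """Return the Beta parameters after seeing `heads` and `tails`."""
    a, b = prior
    return (a + heads, b + tails)

prior = (1, 1)  # Beta(1, 1), i.e. the uniform prior

# First experiment: n heads and m tails (hypothetical counts).
n, m = 7, 3
# Second experiment: k more flips, split into heads and tails.
k_heads, k_tails = 2, 3

# Sequential: use the first posterior as the prior for the second experiment.
sequential = update(update(prior, n, m), k_heads, k_tails)

# Batch: treat everything as one experiment with m + n + k flips.
batch = update(prior, n + k_heads, m + k_tails)

print(sequential, batch, sequential == batch)
```

Both routes only ever add counts to a and b, and addition does not care about grouping, so the two posteriors have identical parameters.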

Q: how does the uniform distribution factor in here?

A1: Marcus, I made a mistake because I was thinking of something else. I should have said that all probabilities of success are equally likely, so that’s why the (continuous) Uni(0, 1) works. That’s consistent with a prior belief that you’ve flipped 0 heads and 0 tails.

A2: I assume you’re speaking of Uni(0, 1)? That’s a discrete Uniform, not a continuous one. Essentially, it’s a fancy way of saying P(X = 0) = P(X = 1) = 0.5.

Q: Why did the (n+m choose n) cancel out?

A1: It actually didn’t just cancel. We represented (n + m choose n) divided by P(N = n) as 1/c, and then figured out that 1/c, whatever it is, must take on a value that makes f(x|n) a valid PDF.
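
Written out, the step might look like the following. This is a sketch that assumes X denotes the unknown probability with a uniform prior f(x) = 1 on [0, 1], and that we observed n successes and m failures. The symbol c is the one from the answer above.

```latex
\begin{align*}
f(x \mid N = n)
  &= \frac{P(N = n \mid X = x)\, f(x)}{P(N = n)} \\
  &= \frac{\binom{n+m}{n}\, x^{n} (1 - x)^{m} \cdot 1}{P(N = n)} \\
  &= \frac{1}{c}\, x^{n} (1 - x)^{m},
  \qquad \text{where } \frac{1}{c} = \frac{\binom{n+m}{n}}{P(N = n)}.
\end{align*}
% Neither factor of 1/c depends on x, and a PDF must integrate to 1, so
\[
c = \int_{0}^{1} x^{n} (1 - x)^{m} \, dx .
\]
```

So the binomial coefficient is absorbed into the constant rather than cancelled. Its exact value never has to be computed separately.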

Q: (in response to Marcus’s question) - I thought the Uni(0,1) here was continuous. Can you explain how it is discrete?

A1: Ah, I was mistaken… I should have said that all probabilities are equally likely, and that’s why it’s Uni(0, 1).

Q: I have heard that sometimes in ML, they use a Gaussian prior so I was wondering what are the conjugate distributions of a Gaussian distribution?

A1: Believe it or not, it depends. If you know the variance of the Normal, then its conjugate is a Normal, and that’s likely what you’ve seen before. If you know the mean but you don’t know the variance, then the conjugate is something called an inverse Gamma.
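
For the known-variance case, a minimal sketch of the Normal-Normal update is below. It uses the standard result that precisions (inverse variances) add. The function name `normal_posterior` and the example numbers are assumptions made for illustration, not something covered in lecture.

```python
def normal_posterior(prior_mean, prior_var, data, known_var):
    """Posterior over the unknown mean of a Normal with known variance.

    Prior on the mean: Normal(prior_mean, prior_var).
    Each observation:  Normal(mean, known_var).
    Returns (posterior_mean, posterior_var); the posterior is again Normal.
    """
    n = len(data)
    # Precisions add: one term from the prior, n terms from the data.
    precision = 1 / prior_var + n / known_var
    post_var = 1 / precision
    post_mean = post_var * (prior_mean / prior_var + sum(data) / known_var)
    return post_mean, post_var

# Hypothetical example: prior Normal(0, 1), one observation at 2.0, known variance 1.
print(normal_posterior(0.0, 1.0, [2.0], 1.0))
```

As with the Beta, the posterior stays in the same family as the prior, which is what makes the prior conjugate.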

Q: Is the beta distribution approximated by a normal distribution by the Central Limit Theorem?

A1: Super interesting question, and one I think a lot of people might be wondering. That would only be true if a beta were defined to be the sum of IID random variables (the binomial for example is). The beta isn’t a sum. You can see that by looking at the values it can take on. Even after a trillion experiments, its values are between 0 and 1. The beta derivation has a binomial in it, which can be approximated.

Q: do pharma people use this to scale their beliefs between say animal testing and human testing?

A1: I don’t know of that specific application, but I’m sure they could do precisely that. (Also, notice that I added a followup to my answer to your last question).

Q: Is this how n-shot learning works?

A1: Few-shot (or n-shot) learning takes this to the next level. And this is the starting point!

Q: If we are running these experiments, how is bootstrapping similar to updating these beliefs. For example if we wanted a distribution after some number of experiments could we just use the results of the experiments as a categorical distribution and bootstrap from there? Is that like a frequentist point of view?

A1: It is really wild to think about how frequentists can get to distributions over probabilities. They do, because it’s powerful, but without a prior you have to do really crazy things. One idea is kinda along the lines you suggest: run lots of different little experiments, have each one calculate its own p, and the distribution that results is a distribution on p. I think I follow your logic but we should probably talk about it in office hours.
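
One possible reading of the bootstrapping idea in the question, as a rough sketch: resample the observed outcomes with replacement many times and recompute p each time. The data, the number of resamples, and the `bootstrap_p` helper are all hypothetical, and this is not presented as the standard frequentist recipe.

```python
import random

def bootstrap_p(outcomes, resamples=2000, seed=0):
    """Resample outcomes with replacement; return one estimate of p per resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    estimates = []
    for _ in range(resamples):
        sample = rng.choices(outcomes, k=n)  # draw n outcomes with replacement
        estimates.append(sum(sample) / n)
    return estimates

# Hypothetical experiment: 8 successes (1) and 2 failures (0).
outcomes = [1] * 8 + [0] * 2
estimates = bootstrap_p(outcomes)

mean_p = sum(estimates) / len(estimates)
print(round(mean_p, 2))
```

The list `estimates` is an empirical distribution over p built without any prior, in contrast to the Beta posterior, which starts from one.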

Q: I am not sure if I missed this part: why do we choose a and b as successes + 1 and failures + 1? Why don’t we just call a the number of successes and b the number of failures, for convenience?

A1: it was defined that way… we can never go back… if we could the world would be full of rainbows and unicorns

Q: How did we decide to have the prior be Beta(5,2) rather than Beta(81,21) or some other values for a,b that also give us 80% success?

A1: You can use either; the choice is a heuristic. Going with Beta(5, 2) would be a statement that you’re not quite as confident in your prior beliefs as you would be if you went with Beta(81, 21).
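
To see the difference in confidence numerically, here is a small sketch using the standard formulas for the mode and variance of a Beta(a, b). The helper names are illustrative.

```python
import math

def beta_mode(a, b):
    """Mode of Beta(a, b), valid for a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

def beta_std(a, b):
    """Standard deviation of Beta(a, b)."""
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return math.sqrt(variance)

for a, b in [(5, 2), (81, 21)]:
    print(f"Beta({a}, {b}): mode={beta_mode(a, b):.3f}, std={beta_std(a, b):.3f}")
```

Both priors peak at 0.8 (4 of 5 and 80 of 100 imagined successes), but Beta(81, 21) has a much smaller standard deviation, so it takes far more real data to move it.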

Q: So a Beta distribution cannot be approximated by a Normal?

A1: A beta can only be approximated by a normal when a and b are very close to each other… so often people don’t.