March 8th, 2021
Be able to implement and use Naive Bayes. Understand why it works and what it is doing.
Q: Can you explain why the MLE formula doesn’t have a prior?
A1: MLE is defined in terms of a single sample of data with no prior experiments. It asks which value of theta maximizes the chance of seeing the data you just saw, without paying attention to anything that happened in prior trials. That’s choosing the value of theta that maximizes f(x | theta).
MAP asks for the most probable value of theta given the data, so we’re trying to choose the value of theta that maximizes f(theta | x). The prior enters because we use Bayes’ theorem to define f(theta | x) in terms of f(x | theta): f(theta | x) is proportional to f(x | theta) times the prior f(theta).
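To make the difference concrete, here is a minimal sketch for a coin with unknown heads probability theta. It assumes a Bernoulli likelihood and a Beta(a, b) prior (defaulting to the Beta(2,2) that comes up later in this discussion); the function names and the example counts are made up for illustration.

```python
def bernoulli_mle(successes, failures):
    """Theta that maximizes f(x | theta): the raw fraction of successes."""
    return successes / (successes + failures)


def bernoulli_map(successes, failures, a=2, b=2):
    """Theta that maximizes f(theta | x) under an assumed Beta(a, b) prior.

    The posterior is Beta(a + successes, b + failures); this returns its mode.
    """
    return (successes + a - 1) / (successes + failures + a + b - 2)


# Hypothetical data: three successes and no failures.
mle_estimate = bernoulli_mle(3, 0)   # ignores any prior
map_estimate = bernoulli_map(3, 0)   # pulled toward 0.5 by the Beta(2,2) prior
print(mle_estimate, map_estimate)
```

With no failures observed, the MLE says failure is impossible, while the MAP estimate keeps some probability on it because the prior contributes one imagined success and one imagined failure.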
Q: just to clarify, we use Beta(2,2) because then we don’t have to worry about one type of event not happening at all throwing off our math? And that’s called Laplace smoothing?
A1: That’s precisely correct.
Q: so Laplace smoothing is just always beta one success and one failure?
A1: yes! though more generally, a count of one for each possible outcome.
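As a rough sketch of the "count of one for each possible outcome" idea, the helper below adds one imagined observation per outcome before normalizing. The function name and the die-roll data are hypothetical, and it assumes every observation is one of the listed outcomes.

```python
from collections import Counter


def laplace_estimates(observations, outcomes):
    """Estimate P(outcome) after adding one imagined count per outcome."""
    outcomes = list(outcomes)
    counts = Counter(observations)
    total = len(observations) + len(outcomes)
    return {o: (counts[o] + 1) / total for o in outcomes}


# Hypothetical rolls of a six-sided die in which 4, 5 and 6 never appear.
rolls = [1, 1, 2, 3, 3, 3]
probs = laplace_estimates(rolls, outcomes=range(1, 7))
print(probs)
```

The unseen faces end up with a small nonzero probability instead of zero, which is the property the question above was asking about.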
Q: Is x^(i) a vector of 3 values in this example?
A1: Technically, it’s a vector of length m.
A2: There’s an ellipsis between data point 2 and data point m. The discussion is framed in terms of m features for each input vector, but the illustration shows only three of them (hence the ellipsis).
Q: why is the last X_i in the linear regression formula 1? is this the offset?
A1: We’re just using a 1 or a 0 to encode whether the team is playing at home (1, yes, true, etc.) or playing away (0, no, false, etc.)
Q: I think the last question was referring to the parameter x_5 = 1 with weight +95. Does this refer to 95 being a good average estimate for the number of points the Warriors score regardless of other parameters?
A1: We can’t assign too much semantics as to why theta_5 is specifically 95.4. It’s just the number that’s inferred by the black box algorithm that most effectively maps the input features for all training data examples to their known values of y.
But because theta_5 ends up being positive, that means being at home is an advantage. And because it’s much larger than the other thetas, it implies that playing at home is significant enough that we need to scale all x_5’s that are 1 to contribute much more to the dot product than they might otherwise.
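A small sketch of how x_5 enters the dot product, following this answer's reading of x_5 as a home/away indicator. Only theta_5 = 95.4 comes from the discussion; the other four weights and all feature values are invented placeholders, and `predict_points` is a hypothetical name.

```python
def predict_points(theta, x):
    """Linear prediction: the dot product of weights and features."""
    return sum(t * xi for t, xi in zip(theta, x))


# First four weights are hypothetical; only 95.4 is taken from the lecture.
theta = [0.5, -1.2, 2.0, 0.3, 95.4]

# Same (made-up) feature values, differing only in the x_5 indicator.
home = [10.0, 4.0, 7.0, 20.0, 1]
away = [10.0, 4.0, 7.0, 20.0, 0]

gap = predict_points(theta, home) - predict_points(theta, away)
print(gap)
```

Because the two inputs differ only in x_5, the gap between the predictions is just theta_5, whatever the other weights happen to be.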
Q: are we going to leverage bayesian networks to solve the scaling here?
A1: something very similar, yes. But you’re correct to call out the exponential growth of the input feature tables, and we need to fix that. We’ll fix it by assuming conditional independence instead of using Bayesian nets, though. Stay tuned. :)
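To get a feel for the growth being described, this sketch counts the free parameters needed for P(X_1, ..., X_m | Y) with and without the conditional independence assumption. It assumes binary features and a binary label; the function names are hypothetical.

```python
def full_table_params(m):
    """Free parameters for the full joint table over m binary features.

    Each of the 2 classes needs 2**m entries, minus one because they sum to 1.
    """
    return 2 * (2 ** m - 1)


def naive_bayes_params(m):
    """Free parameters under conditional independence.

    Each of the 2 classes needs a single P(X_j = 1 | Y) per feature.
    """
    return 2 * m


for m in (3, 10, 30):
    print(m, full_table_params(m), naive_bayes_params(m))
```

The first count doubles with every added feature, while the second grows linearly.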
A2: I wasn’t expecting Chris to go into this, but Naive Bayes is technically equivalent to a Bayes Net with 1 parent (that’s the Y) and m children that are conditioned on Y but conditionally independent of each other.
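The structure described here (one parent Y, m children that are conditionally independent given Y) can be sketched as a tiny classifier. This is an illustrative version rather than the course's reference implementation: it assumes binary features and labels, applies add-one smoothing to every estimate including the class prior, and uses a made-up toy dataset.

```python
import math


def train_naive_bayes(X, y):
    """Estimate P(Y) and P(X_j = 1 | Y) from binary data with add-one smoothing."""
    m = len(X[0])
    model = {"prior": {}, "cond": {}}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Smoothed estimate of P(Y = label).
        model["prior"][label] = (len(rows) + 1) / (len(X) + 2)
        # Smoothed estimate of P(X_j = 1 | Y = label) for each feature j.
        model["cond"][label] = [
            (sum(r[j] for r in rows) + 1) / (len(rows) + 2) for j in range(m)
        ]
    return model


def predict(model, x):
    """Return the label with the larger log P(Y) + sum_j log P(x_j | Y)."""
    scores = {}
    for label in (0, 1):
        log_p = math.log(model["prior"][label])
        for p_one, x_j in zip(model["cond"][label], x):
            log_p += math.log(p_one if x_j == 1 else 1 - p_one)
        scores[label] = log_p
    return max(scores, key=scores.get)


# Hypothetical toy data: 6 examples, 3 binary features each.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
model = train_naive_bayes(X, y)
print(predict(model, [1, 1, 0]), predict(model, [0, 0, 1]))
```

Summing logs instead of multiplying probabilities is a design choice that avoids numerical underflow when m is large; it does not change which label wins.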
Q: so this algorithm will improve as the “independence-ness” of the variables increases? And this would translate to a lower covariance?
A1: yes, to the extent that they are truly conditionally independent, this learning algorithm will do a much, much better job.
Q: so the training data set is to find the probability table and the testing data set is to measure our precision and recall?
A1: That’s exactly right. Very often, you’ll hold back 10-15% of the training data and truly train with the remaining 85-90%, and then use the 10-15% you held back to test instead. Those test cases are used to measure precision/recall.
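A minimal sketch of that workflow, assuming a 15% holdout and binary labels where 1 is the positive class. The helper names, the fixed seed, and the example labels are all hypothetical.

```python
import random


def train_test_split(examples, test_fraction=0.15, seed=0):
    """Shuffle a copy of the data and hold back a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 100 placeholder examples: 85 end up in train, 15 in test.
train, test = train_test_split(list(range(100)))

# Made-up true labels and predictions for five held-out examples.
precision, recall = precision_recall([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
print(len(train), len(test), precision, recall)
```

The model is fit only on `train`; the predictions passed to `precision_recall` would come from running that model on `test`.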