Lecture Materials


Learning Goals

Understand how to construct a Bayesian network and run simulations to compute probabilities and statistics.

Reading

None!

Concept Check

https://www.gradescope.com/courses/226051/assignments/1031106

Questions & Answers


Q: Is this how Palantir works?

A1:  Palantir is a big company that does many things, but I imagine they do participate in some reasoning under uncertainty.


Q: when does pset5 come out?

A1:  We will now release it on Wednesday (when pset4 comes in). Good question!


Q: So 2^256 is something you can use for an output space (e.g. SHA-256), but your computer couldn't handle actually storing all 2^256 numbers at once?

A1:  Exactly. You could think about something similar with continuous-valued numbers: you can write one down, but you couldn't store all of them :)


Q: Could there be bidirectional arrows? For example could being tired also cause someone to be more likely to get flu?

A1:  You could, but much of a Bayesian network's power to represent a *ton* of random variables comes from committing to one direction of causality, which keeps the graph acyclic.


Q: So how does this scale in terms of nodes and edges?

A1:  The table for each node scales exponentially with respect to its number of parents (so a node with 8 binary parents needs a table with 2^8 = 256 entries). If you can keep the number of parents small (say 2 per node), the total storage scales linearly in the number of nodes! Woot
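
As a rough illustration in Python (the node counts and parent counts here are made up, just for the arithmetic): one node's table grows exponentially in its number of parents, but the total across the network grows only linearly when every node has few parents.

```python
# Hypothetical sketch: per-node table size when all variables are binary.
def cpt_rows(num_parents, values_per_parent=2):
    """A node with k parents needs one table row per assignment to its parents."""
    return values_per_parent ** num_parents

print(cpt_rows(8))                              # 256 rows for a node with 8 binary parents

# 1000 nodes with 2 parents each: 1000 * 4 = 4000 rows total (linear in the node count).
print(sum(cpt_rows(2) for _ in range(1000)))
```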


Q: I'm a little confused on what we're supposed to set up in the denominator! How do we determine what probabilities to include?

A1:  Good question. You want to calculate the probability of the thing you are conditioning on (per the definition of conditional probability). Great, but how? You can use "marginalization": the Bayes net defines a joint probability over all assignments to all variables, so sum out the variables you don't want, and you are left with the probability of the thing you are conditioning on.
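
Here is a minimal Python sketch of that idea on a toy flu network (the structure Flu -> Fever, Flu -> Tired and all the numbers are invented for illustration): the joint is just the product of the CPTs, and the denominator comes from marginalizing out everything except the evidence.

```python
from itertools import product

# Made-up CPTs for the toy network.
P_flu = {1: 0.1, 0: 0.9}                  # P(Flu)
P_fever_given_flu = {1: 0.8, 0: 0.05}     # P(Fever=1 | Flu)
P_tired_given_flu = {1: 0.9, 0: 0.3}      # P(Tired=1 | Flu)

def joint(flu, fever, tired):
    """P(Flu=flu, Fever=fever, Tired=tired) as a product of the CPTs."""
    p = P_flu[flu]
    p *= P_fever_given_flu[flu] if fever == 1 else 1 - P_fever_given_flu[flu]
    p *= P_tired_given_flu[flu] if tired == 1 else 1 - P_tired_given_flu[flu]
    return p

# Denominator: marginalize out Flu and Tired, keeping the evidence Fever=1.
p_fever = sum(joint(flu, 1, tired) for flu, tired in product([0, 1], repeat=2))
# Numerator: marginalize out only Tired.
p_flu_and_fever = sum(joint(1, 1, tired) for tired in [0, 1])

print(p_flu_and_fever / p_fever)          # P(Flu=1 | Fever=1) = 0.64 for these numbers
```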


Q: So we don’t use the methods we just used in practice?

A1:  We do! But you are about to learn another good way. At different times I use the one Jerry just taught you; other times I use rejection sampling. And people are still inventing better methods… :)
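
A hedged sketch of rejection sampling on the same toy flu network as above (numbers invented): sample every variable top-down from its CPT, throw away the samples that contradict the evidence, and average what's left.

```python
import random

def sample_network():
    """Draw one full sample (Flu, Fever, Tired) by walking the toy network top-down."""
    flu = 1 if random.random() < 0.10 else 0
    fever = 1 if random.random() < (0.80 if flu else 0.05) else 0
    tired = 1 if random.random() < (0.90 if flu else 0.30) else 0
    return flu, fever, tired

accepted = []
for _ in range(100_000):
    flu, fever, tired = sample_network()
    if fever == 1:                        # reject samples inconsistent with the evidence
        accepted.append(flu)

# Estimate of P(Flu=1 | Fever=1); should land close to the exact 0.64 above.
print(sum(accepted) / len(accepted))
```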


Q: Why do we sample when we can explicitly come up with a closed-form expression for the probability we are looking for using the numbers in the Bayesian network?

A1:  Good question. One answer is that it's much easier to write sampling code for a Bayes net whose structure could change. But sampling and closed form are two great tools to have handy.


Q: Don't we need all of the probabilities to generate these samples? How can we do this efficiently?

A1:  We need the "joint" of all the RVs, and that is the beauty of the Bayesian network: it is a crazy efficient way of storing the joint distribution. It has everything we need.
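
A quick back-of-the-envelope comparison (assuming all-binary variables; the specific n and k are made up): an explicit joint table over n variables would need 2^n - 1 numbers, while a Bayes net with at most k parents per node stores at most n * 2^k numbers, one small table per node.

```python
# Hypothetical counts of how many probabilities each representation stores.
n, k = 30, 2
full_joint_params = 2 ** n - 1     # 1,073,741,823 numbers for the explicit joint table
bayes_net_params = n * 2 ** k      # at most 120 numbers across all the CPTs
print(full_joint_params, bayes_net_params)
```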


Q: The two fever conditional probabilities don’t have to add up to equal 1?

A1:  No, they don't! P(Fever = 1 | Flu = 1) and P(Fever = 1 | Flu = 0) come from two different conditional distributions, so there is no reason they should sum to 1. For example, you could imagine a symptom that has probability 0 in both cases.


Q: Could you reuse the sample for multiple queries?

A1:  yes!


Q: Is there a reason we reject the observations first?

A1:  If you don't reject the samples that contradict the observations, the samples you keep won't lie in the conditional event you care about.


Q: Given some data (e.g. 10,000 patients and whether or not they exhibit each of, say, 100 symptoms), I'm still a little confused on how all these probabilities would be computed in practice.

A1:  1) Get an expert to choose a "causal structure", then 2) estimate the probability of each "child" given its parents. For that you can just use the original definition of probability (see the Netflix example in the course reader).
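
A tiny sketch of step 2 in Python, using an invented dataset (the Netflix example in the reader is the real reference): with the structure Flu -> Fever fixed, P(Fever | Flu) is just a ratio of counts.

```python
# Made-up (flu, fever) observations, one tuple per patient.
patients = [
    (1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0), (1, 1),
]

for flu_value in (0, 1):
    # Count fevers among the patients matching this parent value.
    matching = [fever for flu, fever in patients if flu == flu_value]
    p = sum(matching) / len(matching)
    print(f"P(Fever=1 | Flu={flu_value}) is about {p:.2f}")
```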


Q: Is it computationally expensive for a computer to generate a closed form expression for a certain probability? I was just wondering if that is another benefit of sampling.

A1:  Yes, it can be! And there are folks still working on algorithms to make it easier (check out a thing called "message passing").


Q: How does this work for non-binary random variables?

A1:  The same way for discrete variables with more values; it gets much more complex for continuous random variables. For those we use another algorithm called MCMC (Markov chain Monte Carlo). It's like the big sister of rejection sampling.
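
For flavor, here is a very small Metropolis-Hastings sketch on a toy continuous target (everything here is made up, and MCMC on real Bayes nets is fancier): instead of rejecting whole samples, you propose small moves and accept or reject them so the chain's samples end up following the target distribution.

```python
import math, random

def unnormalized_target(x):
    """Any nonnegative function works as a target; here, a bell curve centered at 3."""
    return math.exp(-0.5 * (x - 3.0) ** 2)

x = 0.0
samples = []
for _ in range(50_000):
    proposal = x + random.gauss(0, 1.0)                 # propose a nearby value
    accept_prob = min(1.0, unnormalized_target(proposal) / unnormalized_target(x))
    if random.random() < accept_prob:                   # accept or stay put
        x = proposal
    samples.append(x)

print(sum(samples) / len(samples))   # roughly 3, the mean of the target
```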


Q: Can a Bayesian network be constructed from just a lot of pure data if there is not an understood causal structure?

A1:  Yes, it's called "structure learning", but you need a ton more data.


Q: What makes this faster than doing the full computation over all the possibilities? What information are we skipping over?

A1:  Samples are reusable! Also, the coding is much easier.


Q: I have heard of Laplacian smoothing to solve this problem. Is that adding one of each sample?

A1:  That's right!
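
A small sketch of add-one (Laplace) smoothing with invented counts: pretend you saw one extra sample of each outcome, so nothing gets probability exactly 0 just because it never showed up in the data.

```python
counts = {"fever": 40, "no_fever": 0}             # raw counts among flu patients (made up)

def laplace_estimate(counts):
    """Add one imagined observation per outcome before normalizing."""
    k = len(counts)                               # number of possible outcomes
    total = sum(counts.values()) + k
    return {outcome: (c + 1) / total for outcome, c in counts.items()}

print(laplace_estimate(counts))
# {'fever': 41/42 ~ 0.976, 'no_fever': 1/42 ~ 0.024} instead of 1.0 and 0.0
```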


Q: For what N (assuming binary random variables and a typical graph) are problems solvable in practice today?

A1:  oh interesting… we have a $30k computer in our lab which could do quite a lot… thousands?


Q: If we add one of each sample, is there a reasonable way to correct for it? (I am thinking maybe round things with higher probability up slightly and those with lower probability down slightly.)

A1:  You can prove that with enough data the effect will be washed out! So it will still converge to the right answer.


Q: Also, if I remember correctly, Judea Pearl won a Turing Award for this. I guess this idea allows us to solve a ton of problems in practice that we couldn't before?

A1:  Yes! And he did belief propagation! It was a huge revolution.