More Probability

The Axioms of Probability

The axioms of probability say what mathematical rules must be followed in assigning probabilities to events. Let P(A) denote the probability of the event A. The axioms are rules the function P must follow:

1. The probability of every event is at least zero.

2. The probability of the entire outcome space is 100%.

3. If two events are disjoint, the probability that either happens is the sum of the probabilities that each happens.

Everything else that is mathematically true of probability is a consequence of these axioms, and of further definitions.

For example, we have the complement rule:

The Complement Rule

The probability that an event A does not happen is 100% minus the probability that A happens:

P(A^c) = 100% - P(A).

The complement rule can be derived from the axioms: the union of A and its complement is S (either A happens or it does not, and there is no other possibility), so P(AUA^c) = P(S) = 100%, by axiom 2. The event A and its complement are disjoint (if "A does not happen" happens, A does not happen; if A happens, "A does not happen" does not happen), so P(AUA^c) = P(A) + P(A^c) by axiom 3. Putting these together, we get P(A) + P(A^c) = 100%. If we subtract P(A) from both sides of this equation, we get what we sought: P(A^c) = 100%-P(A).

A special case of the complement rule is that

P({}) = 0%,

because P(S) = 100%, and S^c = {}.

An event A that has probability one is said to be certain or sure. S is certain.

The union of two events, AUB, can be broken up into three disjoint sets:

elements of A that are not in B (AB^c)
elements of B that are not in A (A^cB)
elements of both A and B (AB)

Together, these three sets contain every element of AUB. Therefore, the chance that either A or B occurs is

P(AUB) = P(AB^c U A^cB U AB).

The three sets on the right are disjoint, so the third axiom implies that

P(AUB) = P(AB^c) + P(A^cB) + P(AB).

On the other hand,

P(A) = P(AB^c U AB) = P(AB^c) + P(AB),

because AB^c and AB are disjoint. Similarly,

P(B) = P(A^cB U AB) = P(A^cB) + P(AB),

because A^cB and AB are disjoint. Adding, we find

P(A) + P(B) = P(AB^c) + P(A^cB) + 2×P(AB).

This would be P(AUB), but for the fact that P(AB) is counted twice, not once. It follows that, in general,

P(AUB) = P(A) + P(B) - P(AB).

Again, while this is a true statement, it is not one of the axioms of probability. In the special case that AB = {}, this reduces to the third axiom, because, as we saw above, P({}) = 0%. In general, the rule implies that

P(AUB) <= P(A) + P(B),

because, by axiom 1, P(AB) >= 0.

Moreover, because taking a union can only include additional outcomes,

P(AUB) >= P(A), and

P(AUB) >= P(B).
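
The rule for the probability of a union, and the bounds that follow from it, can be checked by brute force on a small outcome space. The following is a minimal sketch, assuming equally likely outcomes (one roll of a fair die); the events A and B and the helper prob are illustrative choices, not part of the development above.

```python
from fractions import Fraction

# Outcome space: one roll of a fair die (assumed: outcomes equally likely).
S = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of S) when outcomes are equally likely."""
    return Fraction(len(event & S), len(S))

# Illustrative events, chosen so that they overlap.
A = frozenset({1, 2, 3})   # the roll is at most 3
B = frozenset({2, 4, 6})   # the roll is even

union_direct = prob(A | B)                        # count the outcomes in AUB
union_by_rule = prob(A) + prob(B) - prob(A & B)   # P(A) + P(B) - P(AB)

print(union_direct, union_by_rule)             # 5/6 5/6
print(union_direct <= prob(A) + prob(B))       # True
print(union_direct >= max(prob(A), prob(B)))   # True
```

Exact fractions are used instead of floating point so that the two sides of the rule can be compared with ==.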

Probability is analogous to area or volume or mass. Consider the unit square, which has length unity on each side. Its total area is 1 (= 100%). Let's call the square S, just like outcome space. Now consider regions inside the square S (subsets of S). The area of any such region is at least zero, the area of S is 100%, and the area of the union of two regions is the sum of their areas, if they do not overlap (i.e., if their intersection is empty). These facts are direct analogues of the axioms of probability, and we shall often use this model to get intuition about probability.

A further analogy that I find useful is to consider the square S to be a dartboard. A trial or experiment consists of throwing a dart at the board once. The event A occurs if the dart sticks in the set A. The event AB occurs if the dart sticks in both A and B on that one toss. Clearly, AB cannot occur unless A and B overlap--the dart cannot stick in two places at once. AUB occurs if the dart sticks in either A or B (or both) on that one throw. A and B need not overlap for AUB to occur.

This analogy is also useful for thinking about logical implication. If A is a subset of B, the occurrence of A implies the occurrence of B; we shall sometimes say that A implies B. In the dartboard model, the dart cannot stick in A without sticking in B as well, so if A occurs, B must occur also. If A implies B, AB=A, so P(AB)=P(A). If AB = {}, A implies B^c and B implies A^c: if the dart sticks in A it did not stick in B, and vice versa. If A implies B, then if B does not occur A cannot occur either: B^c implies A^c, so B^c is a subset of A^c.

Conditional Probability

Conditioning means updating probabilities to incorporate new information. The conditional probability of A given B is the probability of the event A, updated on the basis of the knowledge that the event B occurred. Suppose that AB = {} (A and B are disjoint). Then if we learn that B occurred, we know A did not occur, so we should revise the probability of A to be zero (the conditional probability of A given B is zero). On the other hand, suppose that AB = B (B is a subset of A, so B implies A). Then if we learn that B occurred, we know A must have occurred as well, so we should revise the probability of A to be 100% (the conditional probability of A given B is 100%).

For in-between cases, the conditional probability of A given B is defined to be

P(A|B) = P(AB)/P(B),

provided P(B) is not zero (division by zero is undefined). "P(A|B)" is pronounced "the (conditional) probability of A given B." Why does this formula make sense? First of all, note that it does give back the intuitive answers we arrived at above: if AB = {}, then P(AB) = 0, so P(A|B) = 0/P(B) = 0; and if AB = B, P(A|B) = P(B)/P(B) = 100%. Similarly, if we learned that S occurred, this is not really new information (by definition, S always occurs, because it contains all possible outcomes), so we would like P(A|S) = P(A). This is how it works out: AS = A, so P(A|S) = P(A)/P(S) = P(A)/100% = P(A).

Now suppose that A and B are not disjoint. Then if we learn that B occurred, we can restrict attention to just those outcomes that are in B, and disregard the rest of S, so we have a new outcome space that is just B. We need P(B) = 100% to consider B an outcome space; we can make this happen by dividing all probabilities by P(B). For A to have occurred in addition to B requires that AB occurred, so the conditional probability of A given B is P(AB)/P(B), just as we defined it above.

Example. We deal two cards from a well shuffled deck. What is the conditional probability that the second card is an Ace (event A), given that the first card is an Ace (event B)? This is P(AB)/P(B) by definition. The (unconditional) chance that the first card is an Ace is 100%/13 = 7.7%, because there are 13 possible ranks for the first card, and all are equally likely. The chance that both cards are Aces is found by counting: there are four Aces, one in each suit, and we need to pick two of them; there are ₄C₂ = 6 ways that can happen. The total number of ways of picking two cards from the deck is ₅₂C₂ = 52×51/2 = 1326, so the chance that the two cards are both Aces is (6/1326)×100%, about 0.45%. The conditional probability that the second card is an Ace given that the first card is an Ace is thus (6/1326)/(1/13) = 78/1326 = 5.9%. As we might expect, it is somewhat lower than the chance that the first card is an Ace, because we know one of the Aces is gone. We could approach this more intuitively as well: given that the first card is an Ace, the second card is an Ace too if it is one of the three remaining Aces among the 51 remaining cards. These possibilities are equally likely if the deck was shuffled well, so the chance is 3/51 × 100% = 5.9%.
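
The same answer can be obtained by listing every ordered deal of two cards and counting. This is a sketch, assuming all 52×51 ordered deals are equally likely; the encoding of cards as (rank, suit) pairs with rank 1 standing for the Ace is an arbitrary choice made for the illustration.

```python
from fractions import Fraction
from itertools import permutations

# A deck as (rank, suit) pairs; rank 1 stands for the Ace (an assumed labeling).
deck = [(rank, suit) for rank in range(1, 14) for suit in "CDHS"]

# All ordered deals of two distinct cards: 52 * 51 equally likely outcomes.
deals = list(permutations(deck, 2))

def is_ace(card):
    return card[0] == 1

first_ace = [d for d in deals if is_ace(d[0])]       # event B
both_aces = [d for d in first_ace if is_ace(d[1])]   # event AB

p_B = Fraction(len(first_ace), len(deals))
p_AB = Fraction(len(both_aces), len(deals))
p_A_given_B = p_AB / p_B                             # definition of P(A|B)

print(p_B, p_AB, p_A_given_B)   # 1/13 1/221 1/17
```

Here 1/221 equals 6/1326 and 1/17 equals 3/51, so the count agrees with both calculations in the example.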

Independence

Two events are independent if learning that one occurred does not affect the chance that the other occurred. That is, A and B are independent if P(A|B) = P(A) and P(B|A) = P(B). A slightly more general way to write this is that A and B are independent if P(AB) = P(A) × P(B). (This covers the case that either P(A), P(B), or both, are equal to zero, while the definition in terms of conditional probability requires the probability in the denominator to be positive.) To reiterate: two events are independent if and only if the probability that both events happen simultaneously is the product of their unconditional probabilities. If two events are not independent, they are dependent.

Independence and Mutual Exclusivity are Different!

In fact, the only way two events can be both mutually exclusive and independent is if at least one of them has probability zero. If two events are mutually exclusive, learning that one of them happened tells us that the other did not happen. This is clearly informative: the conditional probability of the second event given the first is zero! This changes the (conditional) probability of the second event unless its (unconditional) probability was already zero.

Independent events bear a special relationship to each other. Independence is a very precise point between being disjoint (so that one event implies that the other did not occur), and one event being a subset of the other (so that one event implies the other).

Recap:

If two events are mutually exclusive, they cannot both occur in the same trial: the probability of their intersection is zero. The probability of their union is the sum of their probabilities.
If two events are independent, they can both occur in the same trial (except possibly if at least one of them has probability zero). The probability of their intersection is the product of their probabilities. The probability of their union is less than the sum of their probabilities, unless at least one of the events has probability zero.

The following figure represents two events, A and B, as subsets of a rectangle. The probabilities of the events are proportional to their areas. Try dragging the events in the figure around to make them independent (that is, so that the area of their intersection is the product of their areas). Notice that it is not easy to do: to get the probability of the intersection equal to the product of the probabilities requires just the right amount of overlap.

[Figure: two events, A and B, shown as subsets of a rectangle.]

If A and B are independent, so are

A and B^c
A^c and B^c
A^c and B.
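
The product criterion is easy to test mechanically. The sketch below assumes two fair dice rolled independently, with all 36 ordered pairs equally likely; the events A, B, and C and the helper independent are hypothetical examples chosen for the illustration. It checks that independence carries over to complements, and that disjoint events with positive probability are dependent.

```python
from fractions import Fraction
from itertools import product

# Outcome space: ordered pairs (red, blue) from two fair dice, equally likely.
S = frozenset(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(S))

def independent(E, F):
    """True if P(EF) = P(E) x P(F)."""
    return prob(E & F) == prob(E) * prob(F)

A = frozenset(o for o in S if o[0] % 2 == 0)   # red die shows an even number
B = frozenset(o for o in S if o[1] <= 2)       # blue die shows 1 or 2
Ac, Bc = S - A, S - B                          # complements of A and B

print(independent(A, B), independent(A, Bc),
      independent(Ac, Bc), independent(Ac, B))
# True True True True

# C is disjoint from A and both have positive probability, so they are dependent.
C = frozenset(o for o in S if o[0] == 1)       # red die shows 1
print(independent(A, C))   # False
```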

What kinds of events are independent? The outcomes of successive tosses of a fair coin, the outcomes of random draws from a box with replacement, etc. Draws without replacement are dependent, because what can happen on a given draw depends on what happens on previous draws.

Example: Suppose I have a box with four tickets in it, labeled 1, 2, 3, and 4. I stir the tickets and then pick one, stir them again without replacing the ticket I got, and pick another. Consider the event A = {I get the ticket labeled 1 on the first draw} and the event B = {I get the ticket labeled 2 on the second draw}. Are these events dependent or independent?

Solution: The chance that I get the 1 on the first draw is 25%. The chance that I get the 2 on the second draw is 25%. The chance that I get the 2 on the second draw given that I get the 1 on the first draw is 33%, which is much larger than the unconditional chance that I draw the 2 the second time. Thus A and B are dependent.

Now suppose that I replace the ticket I got on the first draw and stir the tickets again before drawing the second time. Then the chance that I get the 1 on the first draw is 25%, the chance that I get the 2 on the second draw is 25%, and the conditional chance that I get the 2 on the second draw given that I drew the 1 the first time is also 25%. A and B are thus independent if I draw with replacement.
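
Both versions of the ticket example can be checked by listing the possible pairs of draws. This is a sketch, assuming every ordered pair of draws is equally likely under each scheme; the function name check is made up for the illustration.

```python
from fractions import Fraction
from itertools import permutations, product

tickets = [1, 2, 3, 4]

def check(outcomes):
    """Return P(A), P(B), P(B|A) for A = {1 on first draw}, B = {2 on second draw}."""
    n = len(outcomes)
    A = [o for o in outcomes if o[0] == 1]
    B = [o for o in outcomes if o[1] == 2]
    AB = [o for o in A if o[1] == 2]
    p_A, p_B = Fraction(len(A), n), Fraction(len(B), n)
    p_B_given_A = Fraction(len(AB), len(A))   # restrict attention to outcomes in A
    return p_A, p_B, p_B_given_A

# Without replacement: 12 ordered pairs with no repeated ticket.
without = check(list(permutations(tickets, 2)))
# With replacement: all 16 ordered pairs.
with_repl = check(list(product(tickets, repeat=2)))

print(without)     # (Fraction(1, 4), Fraction(1, 4), Fraction(1, 3))
print(with_repl)   # (Fraction(1, 4), Fraction(1, 4), Fraction(1, 4))
```

Without replacement, P(B|A) = 1/3 differs from P(B) = 1/4, so A and B are dependent; with replacement the two agree, so A and B are independent.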

Example: Two fair dice are rolled independently; one is blue, the other is red. What is the chance that the number of spots that show on the red die is less than the number of spots that show on the blue die?

Solution: The event that the number of spots that show on the red die is less than the number that show on the blue die can be broken up into mutually exclusive events, according to the number of spots that show on the blue die. The chance that the number of spots that show on the red die is less than the number that show on the blue die is the sum of the chances of those simpler events. If only one spot shows on the blue die, the number that show on the red die cannot be smaller, so the probability is zero. If two spots show on the blue die, the number that show on the red die is smaller if the red die shows exactly one spot. Because the number of spots that show on the blue and red dice are independent, the chance that the blue die shows two spots and the red die shows one spot is (1/6)(1/6) = 1/36. If three spots show on the blue die, the number that show on the red die is smaller if the red die shows one or two spots. The chance that the blue die shows three spots and the red die shows one or two spots is (1/6)(2/6) = 2/36. If four spots show on the blue die, the number that show on the red die is smaller if the red die shows one, two, or three spots; the chance that the blue die shows four spots and the red die shows one, two, or three spots is (1/6)(3/6) = 3/36. Proceeding similarly for the cases that the blue die shows five or six spots gives the ultimate result:

P(red die shows fewer spots than the blue die) = 1/36 + 2/36 + 3/36 + 4/36 + 5/36 = 15/36.

Alternatively, one could just count the ways: there are 36 possibilities, which can be written in a square table:

                    Blue Die
Red Die    1,1  1,2  1,3  1,4  1,5  1,6
           2,1  2,2  2,3  2,4  2,5  2,6
           3,1  3,2  3,3  3,4  3,5  3,6
           4,1  4,2  4,3  4,4  4,5  4,6
           5,1  5,2  5,3  5,4  5,5  5,6
           6,1  6,2  6,3  6,4  6,5  6,6

The outcomes above the diagonal comprise the event whose probability we seek. There are 36 outcomes in all, of which 6 are on the diagonal. Half of the remaining 36-6=30 are above the diagonal; half of 30 is 15. The 36 outcomes are equally likely, so the chance is 15/36. The outcomes 1,4, 2,4, and 3,4 are one of the mutually exclusive pieces used in the computation just above: the three ways the red die can show a smaller number of spots than the blue die, when the blue die shows exactly 4 spots.
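
The count can be confirmed by enumerating the table. A minimal sketch, assuming the 36 ordered pairs (red, blue) are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Every (red, blue) pair, as in the table: rows are red, columns are blue.
outcomes = list(product(range(1, 7), repeat=2))

# The event of interest: red shows fewer spots than blue.
favorable = [(r, b) for r, b in outcomes if r < b]
print(len(favorable), Fraction(len(favorable), len(outcomes)))   # 15 5/12

# The mutually exclusive pieces, grouped by the number on the blue die.
pieces = {b: sum(1 for r, bb in favorable if bb == b) for b in range(1, 7)}
print(pieces)   # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5}
```

Fraction reduces 15/36 to lowest terms, which is why it prints as 5/12.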

Hint: to solve this problem, you need to evaluate an expression of the form

1 - (1-x)ⁿ,

where x is nearly zero and n is very large. You can find the answer approximately using the following result:

(1-x)ⁿ = 1 + n×(-x) + (n×(n-1)/2)×(-x)² + . . . + ₙCₖ×(-x)ᵏ + . . . + (-x)ⁿ.

The function (1-x)ⁿ is called a binomial; the fact that the coefficient of (-x)ᵏ in the expansion of (1-x)ⁿ is ₙCₖ is the reason that ₙCₖ is sometimes called a binomial coefficient. When x is very small, x², x³, . . . are much smaller still (and, provided n×x is also small, they get smaller faster than ₙCₖ grows), so the terms involving higher powers of x than x¹ are effectively negligible. That is, when x is nearly zero,

(1-x)ⁿ is approximately 1-n×x, so
1 - (1-x)ⁿ is approximately n×x.

Using that approximation is equivalent to ignoring the possibility that the sentence is typed more than once. The probability that the sentence is typed more than once is tiny compared to the chance that the sentence is typed exactly once, which is already quite small.
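
The quality of the approximation is easy to check numerically. The sketch below uses hypothetical values of x and n chosen only for illustration (they are not the values from the problem), and it computes 1 - (1-x)ⁿ through logarithms because evaluating (1-x)ⁿ directly loses precision when x is tiny.

```python
import math

def exact(x, n):
    """1 - (1 - x)**n, computed as -expm1(n * log1p(-x)) for accuracy at tiny x."""
    return -math.expm1(n * math.log1p(-x))

def approx(x, n):
    """First-order approximation n * x."""
    return n * x

x, n = 1e-12, 10**6   # hypothetical values: x nearly zero, n large, n*x small
e, a = exact(x, n), approx(x, n)
rel_err = abs(e - a) / e

print(e, a, rel_err)
```

For these values the relative error is roughly n×x/2, which is tiny. The approximation degrades as n×x grows toward 1, so it relies on n×x being small and not only on x being small.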

The Multiplication Rule

We can rearrange the definition of conditional probability to solve for the probability that both A and B occur (that AB occurs) in terms of the probability that B occurs and the conditional probability of A given B:

P(AB) = P(A|B)×P(B).

This is called the multiplication rule.

Example: A deck of cards is shuffled well, then two cards are drawn. What is the chance that both cards are aces?

P(card 1 is an Ace and card 2 is an Ace) = P(card 2 is an Ace | card 1 is an Ace)×P(card 1 is an Ace)
= 3/51 × 4/52 = 0.5%.

You can see that the multiplication rule can save you a lot of time!

Example: Suppose there is a 50% chance that you catch the 8:00am bus. If you catch the bus, you will be on time. If you miss the bus, there is a 70% chance that you will be late. What is the chance that you will be late?

P(late) = P(miss the bus and late)

= P(late|miss the bus) × P(miss the bus)

= 70% × 50% = 35%.

Example: Suppose that 10% of a given population has benign chronic flatulence. Suppose that there is a standard screening test for benign chronic flatulence that has a 90% chance of correctly detecting that one has the disease, and a 10% chance of a "false positive" (erroneously reporting that one has the disease when one does not). We pick a person at random from the population (so that everyone has the same chance of being picked) and test him/her. The test is positive. What is the chance that the person has the disease?

Solution: We shall combine several things we have learned. Let D be the event that the person has the disease, and T be the event that the person tests positive for the disease. The problem statement told us that:

P(D) = 10%.
P(T|D) = 90%.
P(T|D^c) = 10%.

The problem asks us to find P(D|T) = P(DT)/P(T). We shall find P(T) by breaking T into two mutually exclusive pieces, DT and D^cT, corresponding to testing positive and having the disease (DT) and testing positive falsely (D^cT). Then P(T) is the sum of P(DT) and P(D^cT). We will find those two probabilities using the multiplication rule. We need P(DT) for the numerator, and it will be one of the terms in the denominator as well. The probability of DT is, by the multiplication rule,

P(DT) = P(T|D) × P(D) = 90% × 10% = 9%.

The probability of D^cT is, by the multiplication rule and the complement rule,

P(D^cT) = P(T|D^c) × P(D^c) = P(T|D^c) × (100% - P(D)) = 10% × 90% = 9%.

By one of the axioms,

P(T) = P(DT) + P(D^cT) = 9% + 9% = 18%,

because DT and D^cT are mutually exclusive. Finally, plugging in the definition of P(D|T) gives

P(D|T) = P(DT)/P(T) = 9%/18% = 50%.

Because only a small fraction of the population actually have benign chronic flatulence, the chance that a positive test result for someone selected at random from the population is a false positive is 50%, even though the test is 90% accurate.

This problem illustrates Bayes' Rule:

P(A|B) = P(B|A) × P(A) / ( P(B|A)×P(A) + P(B|A^c) × P(A^c) ).

The numerator on the right is just P(AB), computed using the multiplication rule. The denominator is just P(B), computed by partitioning B into the mutually exclusive sets AB and A^cB, and finding the probability of each of those pieces using the multiplication rule.
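
Bayes' Rule translates directly into a short function. This is a sketch, not a library routine: the function name bayes and its parameter names are made up for the illustration, and the second call uses a hypothetical prevalence of 1% that does not come from the example.

```python
from fractions import Fraction

def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A), and P(B|A^c)."""
    numerator = p_b_given_a * p_a                            # P(AB), multiplication rule
    denominator = numerator + p_b_given_not_a * (1 - p_a)    # P(B) = P(AB) + P(A^cB)
    if denominator == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return numerator / denominator

# The screening example: P(D) = 10%, P(T|D) = 90%, P(T|D^c) = 10%.
print(bayes(Fraction(10, 100), Fraction(90, 100), Fraction(10, 100)))   # 1/2

# Hypothetical variation: the same test applied when only 1% have the condition.
print(bayes(Fraction(1, 100), Fraction(90, 100), Fraction(10, 100)))    # 1/12
```

The second call shows how strongly the answer depends on P(A): with the same test, a rarer condition makes a positive result much less conclusive.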

Bayes' Rule is useful to find the conditional probability of A given B in terms of the conditional probability of B given A, which is the more natural thing to measure in some problems. For example, in the disease-screening problem just above, the natural way to calibrate a test is to see how well it does at detecting a certain thing (e.g., a disease) when the thing is present, and to see how poorly it does at raising false alarms when the thing is not really present. These are, respectively, the conditional probability of detecting the thing given that the condition is present, and the conditional probability of incorrectly raising an alarm given that the thing is not present. However, the interesting quantity for an individual is the conditional chance that he or she has the disease, for example, given that the test raised an alarm.