Lecture 18: Multiple hypotheses#
Recap#
Hypotheses and p-values#
A p-value is the probability of finding a result at least as extreme/surprising, if outcomes happened by random chance alone.
The null hypothesis corresponds to “just chance” or “no effect.”
The alternative hypothesis corresponds to “better than chance” or “an effect.”
Small p-value (less than 0.05) → evidence against the null hypothesis.
One-sided and two-sided hypotheses#
A one-sided hypothesis is an alternative hypothesis with a direction (\(>\) or \(<\)).
A two-sided hypothesis is an alternative hypothesis without a direction (\(\neq\)).
The type of hypothesis determines what is “more extreme” in the p-value calculation.
Two-sided is a good default.
Practice quiz #3#
A candy company promises that at least 30% of their chocolate eggs contain a figurine of the fictional character Elsa from Frozen; the rest contain other toys.
Suppose you buy 40 chocolate eggs, and only 9 of them contain an Elsa figurine.
You will investigate whether the low number of Elsa figurines is statistically significant.
Question 1#
What are the null and alternative hypotheses? Describe them both in English and in mathematical symbols.
Answer: The null hypothesis is that the company is telling the truth and the chocolate eggs have a 30% chance of containing an Elsa figurine. The alternative hypothesis is that the chocolate eggs have a smaller than 30% chance of containing an Elsa figurine.
In symbols, let \(\pi\) be the long-run proportion of chocolate eggs that contain an Elsa figurine. The null hypothesis is \(H_0 : \pi = 0.3\) and the alternative hypothesis is \(H_A : \pi < 0.3\).
You could also use a two-sided alternative hypothesis \(H_A : \pi \neq 0.3\). In words, the probability of a chocolate egg containing an Elsa figurine differs from (is either more or less than) the 30% advertised by the company.
Question 2#
Describe how you would do a simulation to compute a p-value. If the null was true, what would be the “probability of success”? What would be the “number of trials”? What value would you compare the simulated data to?
Answer: A “success” would correspond to the chocolate egg containing an Elsa figurine. If the company is telling the truth, then 30% of the chocolate eggs would contain an Elsa figurine. The “probability of success” is therefore 0.3.
The number of trials is 40 (the number of chocolate eggs bought).
The value to compare to is 9 (the number of chocolate eggs that contain an Elsa figurine).
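The simulation described in this answer can be sketched in Python as follows. This is a minimal sketch rather than the course's own code: the function name `simulate_p_value`, the number of repetitions, and the fixed seed are assumptions chosen for illustration. It uses the one-sided alternative \(H_A : \pi < 0.3\), so "at least as extreme" means 9 or fewer Elsa figurines.

```python
import random

def simulate_p_value(observed=9, trials=40, prob_success=0.3,
                     repetitions=10_000, seed=0):
    """Estimate the one-sided p-value P(count <= observed) under the null."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    at_least_as_extreme = 0
    for _ in range(repetitions):
        # Simulate buying `trials` eggs; each contains Elsa with prob_success.
        elsa_count = sum(rng.random() < prob_success for _ in range(trials))
        # For H_A: pi < 0.3, "more extreme" means as few or fewer Elsa figurines.
        if elsa_count <= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / repetitions

print(simulate_p_value())
```

The estimate varies with the seed and the number of repetitions. The next question supplies its own p-value as a given, so the number printed here is not meant to replace it.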
Question 3#
The p-value for the observed results (an Elsa figurine in 9 of the 40 chocolate eggs) is 0.04. What do you conclude about the null hypothesis?
Answer: Since the p-value is less than 0.05, we have evidence against the null hypothesis that 30% of chocolate eggs contain an Elsa figurine.
Type 1 and type 2 errors#
Type 1 and type 2 errors#
| | | Truth | |
|---|---|---|---|
| | | \(H_0\) | \(H_A\) |
| Decision | reject \(H_0\) | Type I error | |
| | don’t reject \(H_0\) | | Type II error |
From last lecture:
Type 1 error: a “false alarm” or false positive.
Type 2 error: a “missed opportunity” or a false negative.
Type 1 error rate#
The Type 1 error rate is:
\[\frac{\text{Number of times a type 1 error is made}}{\text{Number of times the null hypothesis is true}}\]
Rejecting the null hypothesis when the p-value is less than 0.05 means that the type 1 error rate is at most 0.05.
In general, if we reject the null hypothesis when the p-value is less than a threshold \(\alpha\), then the type 1 error rate is at most \(\alpha\).
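One way to see this claim in action is to simulate many experiments in which the null hypothesis is true and count how often it gets rejected. The sketch below is illustrative only: the setting (20 flips of a fair coin, a one-sided test for "too many heads") and the names `upper_tail_p_value` and `type1_error_rate` are assumptions, not part of the lecture.

```python
import random
from math import comb

def upper_tail_p_value(successes, trials, prob):
    """Exact P(X >= successes) for X ~ Binomial(trials, prob)."""
    return sum(
        comb(trials, k) * prob**k * (1 - prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def type1_error_rate(alpha=0.05, trials=20, prob=0.5,
                     experiments=20_000, seed=0):
    """Fraction of null-is-true experiments in which the null is rejected."""
    rng = random.Random(seed)
    # One p-value for each possible number of heads, computed once up front.
    p_values = [upper_tail_p_value(x, trials, prob) for x in range(trials + 1)]
    rejections = 0
    for _ in range(experiments):
        # Generate data under the null: every flip is heads with probability prob.
        heads = sum(rng.random() < prob for _ in range(trials))
        if p_values[heads] < alpha:
            rejections += 1  # a false positive, because the null is true
    return rejections / experiments

print(type1_error_rate())
```

Because the number of heads is a whole number, only a limited set of p-values is possible, so the rejection rate in this setting lands below \(\alpha\) rather than exactly on it.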
False positives and multiple experiments#
This comic comes from xkcd.


Multiple hypotheses#
Multiple testing#
In the comic, the scientists investigated multiple hypotheses.
There were twenty null/alternative hypothesis pairs (one for each jelly bean color).
If there are \(m\) hypothesis tests, then the probability of having at least one false positive grows as \(m\) increases.
This can lead to accidental “p-hacking” where reported p-values are artificially small and do not accurately measure the evidence against a null hypothesis.
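To get a feel for how quickly the chance of a false positive grows, the sketch below computes \(1 - (1 - \alpha)^m\). It rests on two simplifying assumptions that the lecture does not state: every null hypothesis is true, and the \(m\) tests are independent with each one rejecting with probability exactly \(\alpha\). The function name is hypothetical.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance of one or more false positives across independent tests
    when every null hypothesis is true (both are assumptions)."""
    # Probability that no test rejects is (1 - alpha) ** num_tests.
    return 1 - (1 - alpha) ** num_tests

for m in (1, 5, 20, 100):
    print(m, round(prob_at_least_one_false_positive(m), 3))
```

Under these assumptions, twenty tests at \(\alpha = 0.05\) give a chance of about 0.64 of at least one false positive, which is why a single "significant" jelly bean color is unsurprising.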
Family-wise error rate#
A collection of multiple hypotheses is called a family.
The family-wise error rate (FWER) is the probability that there is at least one false positive (type 1 error) in the family.
In symbols:
\[\text{FWER} = \mathrm{Pr}[\text{at least one false positive}] \]
Example: AI faces#

Did anyone identify the real face 7 out of 7 times?
AI faces: FWER#
The probability of a specific person guessing and getting 7 out of 7 faces correct is \(2^{-7} \approx 0.0078\).
There are about 90 students who go to section each week.
If everyone was guessing, then the probability that somebody got 7 out of 7 is
\[ 1-(1-0.0078)^{90} \approx 0.51\]
The family-wise error rate can be much higher than the type 1 error rate for a single hypothesis.
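The same number can be checked by simulation. The sketch below, with a hypothetical function name and an arbitrary number of simulated classes, treats every student as guessing each of the 7 faces with a fair coin flip and counts how often at least one of the 90 students gets a perfect score.

```python
import random

def chance_someone_is_perfect(students=90, faces=7, classes=5_000, seed=0):
    """Estimate the chance that at least one guessing student is 7 for 7."""
    rng = random.Random(seed)
    classes_with_perfect_score = 0
    for _ in range(classes):
        # A student is perfect if every one of their coin-flip guesses is right.
        someone_perfect = any(
            all(rng.random() < 0.5 for _ in range(faces))
            for _ in range(students)
        )
        if someone_perfect:
            classes_with_perfect_score += 1
    return classes_with_perfect_score / classes

exact = 1 - (1 - 2 ** -7) ** 90  # the formula from the lecture
print(round(exact, 2), chance_someone_is_perfect())
```

The simulated fraction should land close to the exact value, with some run-to-run noise that shrinks as `classes` grows.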
Bonferroni correction#

Suppose there are \(m\) null/alternative hypothesis pairs in the family.
This means that we would compute \(m\) p-values.
Instead of rejecting each null hypothesis when its p-value is less than \(\alpha\), we will only reject when its p-value is less than \(\alpha /m\).
Changing \(\alpha\) to \(\alpha/m\) is called a Bonferroni correction.
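As a small illustration of the rule above, the sketch below applies the corrected threshold \(\alpha/m\) to a list of p-values. The function name `bonferroni_reject` and the example p-values are invented for illustration and do not come from any real study.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one True/False per p-value; True means reject that null."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # the Bonferroni-corrected cutoff
    return [p < threshold for p in p_values]

# Hypothetical p-values, made up purely for illustration.
example_p_values = [0.001, 0.012, 0.03, 0.2, 0.6]
print(bonferroni_reject(example_p_values))
```

With five tests the cutoff becomes 0.01, so only the first p-value leads to a rejection, whereas the uncorrected cutoff of 0.05 would have rejected the first three.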
Bonferroni correction#
The Bonferroni correction makes sure that the family-wise error rate is at most \(\alpha\).
This is because:
\[\begin{aligned} \text{FWER} &= \mathrm{Pr}[\text{at least one false positive}] \\ &\leq m \times \mathrm{Pr}[\text{false positive for one hypothesis}] \\ &\leq m \times \frac{\alpha}{m} \\ &= \alpha \end{aligned}\]
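The bound can also be checked empirically. The sketch below estimates the family-wise error rate with and without the correction. It assumes, for illustration only, that all \(m = 20\) null hypotheses are true and that each p-value behaves like an independent Uniform(0, 1) draw; the function name and defaults are hypothetical.

```python
import random

def estimate_fwer(num_tests=20, alpha=0.05, corrected=True,
                  families=20_000, seed=0):
    """Fraction of simulated families with at least one false positive."""
    rng = random.Random(seed)
    threshold = alpha / num_tests if corrected else alpha
    families_with_false_positive = 0
    for _ in range(families):
        # Assumption: under a true null, a p-value is a Uniform(0, 1) draw.
        if any(rng.random() < threshold for _ in range(num_tests)):
            families_with_false_positive += 1
    return families_with_false_positive / families

print("uncorrected:", estimate_fwer(corrected=False))
print("Bonferroni: ", estimate_fwer(corrected=True))
```

In this setting the uncorrected estimate should sit well above \(\alpha\), while the corrected estimate should stay at or just below it, up to simulation noise.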
Bonferroni example#
In the xkcd comic, what would be the Bonferroni correction?
Since there are 20 hypotheses, the new threshold should be \(\frac{0.05}{20}=0.0025\).
Interpretation: Since the scientists did 20 tests, a p-value less than 0.05 is not strong evidence that green jelly beans are linked to acne.
The p-value needs to be less than 0.0025 for there to be evidence that green jelly beans are linked to acne.
Dream again#
Dream recap#
In a speedrun attempt, Dream received 42 Ender pearls in 262 trades.
Last lecture, we saw that it is very unlikely he would get that many Ender pearls without cheating.
But, do we need to do a multiple hypothesis correction?

Dream and multiple hypotheses#
We should account for the fact that many people play Minecraft, and therefore there have been many speedrun attempts.
Maybe it is reasonable that someone would get as lucky as Dream, and we are unfairly focusing on him.
Bonferroni correction for Dream#
How could we do a Bonferroni correction for Dream? What is \(m\), the number of hypotheses?
\(m\) should be the number of Minecraft speedrun attempts. We do not know \(m\) exactly, but we could do a Fermi estimate.
Bonferroni correction for Dream#
Giving Dream the benefit of the doubt, we can use \(m=10^8\) (one hundred million).
A calculation gives that the p-value for Dream’s trades is roughly \(\frac{6}{10^{12}}\) (less than 1 in a hundred billion).
The Bonferroni correction is
\[\frac{0.05}{10^8} = \frac{5}{10^{10}} = \text{1 in 2 billion}\]
Dream’s p-value is still much smaller than the corrected threshold, so we still have evidence for cheating.
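The arithmetic here is easy to get wrong by a factor of ten, so the sketch below redoes it with exact fractions. The numbers are the ones quoted in the lecture (\(\alpha = 0.05\), \(m = 10^8\), p-value roughly \(6/10^{12}\)); the variable names are just illustrative.

```python
from fractions import Fraction

alpha = Fraction(5, 100)
num_attempts = 10 ** 8              # m, the Fermi estimate from the lecture
p_value = Fraction(6, 10 ** 12)     # approximate p-value quoted in the lecture

threshold = alpha / num_attempts    # Bonferroni-corrected cutoff
print(threshold)                    # 1/2000000000
print(p_value < threshold)          # True
print(float(threshold / p_value))   # how many times larger the cutoff is
```

Using `Fraction` avoids floating-point rounding, so the corrected threshold prints as an exact ratio that can be read directly as "1 in N".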
The dark side of Bonferroni#
The Bonferroni correction makes it much harder to reject the null hypothesis.
This keeps the false positive (type 1 error) rate under control.
But if we don’t reject the null hypothesis, we risk having a lot of false negatives (type 2 errors).
The Bonferroni correction increases the false negative (type 2 error) rate.
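To make this trade-off concrete, the sketch below compares the type 2 error rate of a single test at \(\alpha = 0.05\) with the same test at the corrected threshold \(\alpha/20\). The scenario is an assumption made up for illustration: 100 trials, a null success probability of 0.5, a true success probability of 0.6, and a one-sided test. The function names are hypothetical.

```python
from math import comb

def upper_tail(count, trials, prob):
    """P(X >= count) for X ~ Binomial(trials, prob)."""
    return sum(comb(trials, k) * prob**k * (1 - prob) ** (trials - k)
               for k in range(count, trials + 1))

def type2_error_rate(alpha, trials=100, null_prob=0.5, true_prob=0.6):
    """Chance of failing to reject H0 when the true rate is true_prob."""
    # Smallest count whose one-sided p-value falls below alpha.
    critical = next(c for c in range(trials + 2)
                    if upper_tail(c, trials, null_prob) < alpha)
    # We miss the effect whenever the observed count stays below `critical`.
    return 1 - upper_tail(critical, trials, true_prob)

print("alpha = 0.05:     ", round(type2_error_rate(0.05), 3))
print("alpha = 0.05 / 20:", round(type2_error_rate(0.05 / 20), 3))
```

The stricter threshold demands a larger observed count before rejecting, so a real but modest effect is missed more often.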

Genomic studies#
Genome wide association study#
The human body has roughly 20,000 genes.
The expression levels of each person’s genes can vary widely.
A genome wide association (GWA) study looks at whether there are any genes that are associated with a particular disease.

Genome wide association study#
Scientists conducting GWA studies have to be very careful about false positives due to testing multiple hypotheses (one for each gene).
This has led to many new statistical methods (many developed at Stanford) designed specifically for GWA studies, with lower false negative rates than the Bonferroni correction.
GWA studies have successfully identified genes that are associated with a variety of diseases including heart disease, diabetes, and Crohn’s disease.
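A quick back-of-the-envelope calculation shows why false positives are such a concern at this scale. The sketch below assumes one test per gene, uses the lecture's rough count of 20,000 genes, and takes the worst case in which no gene is truly associated with the disease, so every rejection is a false positive.

```python
num_genes = 20_000   # rough gene count from the lecture
alpha = 0.05

# Expected number of false positives if every null hypothesis is true
# and each test is run at level alpha with no correction (an assumption).
expected_false_positives = num_genes * alpha

# Per-gene cutoff that a Bonferroni correction would require.
bonferroni_threshold = alpha / num_genes

print(expected_false_positives)
print(bonferroni_threshold)
```

An uncorrected analysis would be expected to flag about a thousand genes by chance alone, while the Bonferroni cutoff is so small that genuinely associated genes with moderate p-values would be missed, which motivates the gentler methods mentioned above.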
Recap#
Using p-values controls the false positive rate:
If we reject a null hypothesis when the p-value is less than \(\alpha\), then the false positive rate will be at most \(\alpha\).
If we make \(\alpha\) smaller, there will be more false negatives.
Multiple testing:
The family-wise error rate is the chance of at least one false positive.
p-hacking.
Bonferroni correction for multiple testing.