Lecture 18: Multiple hypotheses#

Recap#

Hypotheses and p-values#

  • A p-value is the probability of finding a result at least as extreme/surprising, if outcomes happened by random chance alone.

  • The null hypothesis corresponds to “just chance” or “no effect.”

  • The alternative hypothesis corresponds to “better than chance” or “an effect.”

  • Small p-value (less than 0.05) → evidence against the null hypothesis.

One-sided and two-sided hypotheses#

  • A one-sided hypothesis is an alternative hypothesis with a direction (\(>\) or \(<\)).

  • A two-sided hypothesis is an alternative hypothesis without a direction (\(\neq\)).

  • The type of hypothesis determines what is “more extreme” in the p-value calculation.

  • Two-sided is a good default.

Practice quiz #3#

  • A candy company promises that at least 30% of their chocolate eggs contain a figurine of the fictional character Elsa from Frozen; the rest contain other toys.

  • Suppose you buy 40 chocolate eggs, and only 9 of them contain an Elsa figurine.

  • You will investigate whether the low number of Elsa figurines is statistically significant.

Question 1#

  • What are the null and alternative hypotheses? Describe them both in English and in mathematical symbols.

  • Answer: The null hypothesis is that the company is telling the truth and the chocolate eggs have a 30% chance of containing an Elsa figurine. The alternative hypothesis is that the chocolate eggs have a smaller than 30% chance of containing an Elsa figurine.

  • In symbols, let \(\pi\) be the long-run proportion of chocolate eggs that contain an Elsa figurine. The null hypothesis is \(H_0 : \pi = 0.3\) and the alternative hypothesis is \(H_A : \pi < 0.3\).

  • You could also use a two-sided alternative hypothesis \(H_A : \pi \neq 0.3\). In words, the probability of a chocolate egg containing an Elsa figurine differs from (is either more or less than) the 30% advertised by the company.

Question 2#

  • Describe how you would do a simulation to compute a p-value. If the null was true, what would be the “probability of success”? What would be the “number of trials”? What value would you compare the simulated data to?

  • Answer: A “success” would correspond to the chocolate egg containing an Elsa figurine. If the company is telling the truth, then 30% of the chocolate eggs would contain an Elsa figurine. The “probability of success” is therefore 0.3.

  • The number of trials is 40 (the number of chocolate eggs bought).

  • The value to compare to is 9 (the number of chocolate eggs that contain an Elsa figurine).
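The simulation described in this answer can be sketched in a few lines of Python. This is a minimal sketch, assuming the one-sided alternative \(H_A : \pi < 0.3\), so "at least as extreme" means 9 or fewer Elsa figurines. The function name `simulate_p_value`, the number of simulations, and the seed are illustrative choices, not part of the quiz. The estimate depends on this setup and need not equal the value quoted in Question 3, which the quiz supplies as given.

```python
import random

def simulate_p_value(n_trials=40, p_success=0.3, observed=9,
                     n_sims=20_000, seed=0):
    """Estimate the one-sided p-value Pr[count <= observed] under the null."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_sims):
        # One simulated purchase of n_trials eggs: each egg independently
        # contains an Elsa figurine with probability p_success.
        count = sum(rng.random() < p_success for _ in range(n_trials))
        if count <= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sims

p_value = simulate_p_value()
print(f"estimated one-sided p-value: {p_value:.3f}")
```

For a two-sided alternative, the comparison would instead count simulated results that are at least as far from the expected count (12) as the observed 9.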

Question 3#

  • The p-value for the observed results (an Elsa figurine in 9 of the 40 chocolate eggs) is 0.04. What do you conclude about the null hypothesis?

  • Answer: Since the p-value is less than 0.05, we have evidence against the null hypothesis that 30% of chocolate eggs contain an Elsa figurine.

Type 1 and type 2 errors#

Type 1 and type 2 errors#

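|                                | Truth: \(H_0\)   | Truth: \(H_A\)   |
|--------------------------------|------------------|------------------|
| Decision: reject \(H_0\)       | Type I error     | correct decision |
| Decision: don’t reject \(H_0\) | correct decision | Type II error    |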

  • From last lecture:

    • Type 1 error: a “false alarm” or false positive.

    • Type 2 error: a “missed opportunity” or a false negative.

Type 1 error rate#

  • The Type 1 error rate is:

    \[\frac{\text{Number of times a type 1 error is made}}{\text{Number of times the null hypothesis is true}}\]
  • Rejecting the null hypothesis when the p-value is less than 0.05 means that the type 1 error rate is at most 0.05.

  • In general, if we reject the null hypothesis when the p-value is less than a threshold \(\alpha\), then the type 1 error rate is at most \(\alpha\).
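To see the threshold rule in action, the type 1 error rate can be computed exactly for a small example. The sketch below reuses the chocolate egg setup (40 trials, null probability 0.3, one-sided alternative) purely as an illustration. The helper names `binom_cdf` and `type1_error_rate` are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """Pr[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def type1_error_rate(n, p_null, alpha):
    """Probability, when the null is true, that the one-sided p-value is below alpha."""
    rate = 0.0
    for k in range(n + 1):
        p_value = binom_cdf(k, n, p_null)  # p-value if we observe k successes
        if p_value < alpha:
            # Observing k leads to rejection; add its probability under the null.
            rate += comb(n, k) * p_null**k * (1 - p_null)**(n - k)
    return rate

rate = type1_error_rate(n=40, p_null=0.3, alpha=0.05)
print(f"type 1 error rate at alpha = 0.05: {rate:.4f}")
```

Because the number of successes is a whole number, the rate can fall strictly below \(\alpha\) rather than matching it exactly.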

False positives and multiple experiments#

This comic comes from xkcd.

Multiple hypotheses#

Multiple testing#

  • In the comic, the scientists investigated multiple hypotheses.

  • There were twenty null/alternative hypothesis pairs (one for each jelly bean color).

  • If there are \(m\) hypothesis tests, then the probability of having at least one false positive grows as \(m\) increases.

  • This can lead to accidental “p-hacking” where reported p-values are artificially small and do not accurately measure the evidence against a null hypothesis.
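A quick calculation shows how fast the chance of a false positive grows with the number of tests. This sketch assumes the tests are independent and that every null hypothesis is true, which the slides do not state explicitly. The function name is illustrative.

```python
def prob_at_least_one_false_positive(alpha, m):
    """Chance of one or more false positives across m independent tests of true nulls."""
    # Each test avoids a false positive with probability (1 - alpha).
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(m, round(prob_at_least_one_false_positive(0.05, m), 3))
```

With 20 tests at a 0.05 threshold, as in the comic, the chance of at least one false positive is about 0.64 under these assumptions.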

Family wise error rate#

  • A collection of multiple hypotheses is called a family.

  • The family wise error rate (FWER) is the probability that there is at least one false positive (type 1 error) in the family.

  • In symbols:

    \[\text{FWER} = \mathrm{Pr}[\text{at least one false positive}] \]

Example: AI faces#

Image 1: Which face is real?
The left face is real.

Did anyone identify the real face 7 out of 7 times?

AI faces: FWER#

  • The probability of a specific person guessing and getting 7 out of 7 faces correct is \(2^{-7} \approx 0.0078\).

  • There are about 90 students who go to section each week.

  • If everyone were guessing independently, then the probability that somebody got 7 out of 7 is

    \[ 1-(1-0.0078)^{90} \approx 0.51\]
  • The family wise error rate can be much higher than the type 1 error rate for a single hypothesis.
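The calculation above can be checked with a short simulation. This is a sketch under the same assumptions as the slide: 90 students, each guessing independently with a 50% chance per face. The function name, seed, and number of simulations are illustrative.

```python
import random

def someone_gets_all_correct(n_students=90, n_faces=7, rng=random):
    """True if at least one purely guessing student gets every face right."""
    for _ in range(n_students):
        # A student is "perfect" if all n_faces coin-flip guesses are correct.
        if all(rng.random() < 0.5 for _ in range(n_faces)):
            return True
    return False

rng = random.Random(1)
n_sims = 10_000
hits = sum(someone_gets_all_correct(rng=rng) for _ in range(n_sims))
simulated = hits / n_sims
exact = 1 - (1 - 2 ** -7) ** 90
print(f"simulated: {simulated:.3f}, formula: {exact:.3f}")
```

The simulated fraction should land close to the value given by the formula, with small differences due to random variation.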

Bonferroni correction#

  • Suppose there are \(m\) null/alternative hypothesis pairs in the family.

  • This means that we would compute \(m\) p-values.

  • Instead of rejecting each null hypothesis when its p-value is less than \(\alpha\), we will only reject when its p-value is less than \(\alpha /m\).

  • Changing \(\alpha\) to \(\alpha/m\) is called a Bonferroni correction.

Bonferroni correction#

  • The Bonferroni correction makes sure that the family wise error rate is at most \(\alpha\).

  • This is because:

    \[\begin{aligned} \text{FWER} &= \mathrm{Pr}[\text{at least one false positive}] \\ &\leq m \times \mathrm{Pr}[\text{false positive for one hypothesis}] \\ &\leq m \times \frac{\alpha}{m} \\ &= \alpha \end{aligned}\]
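The bound can also be illustrated by simulation. The sketch below assumes that every null hypothesis is true and that each p-value is then uniformly distributed between 0 and 1 (which holds for tests with continuous outcomes). The function name `estimate_fwer` and the values \(m = 20\) and \(\alpha = 0.05\) are illustrative choices.

```python
import random

def estimate_fwer(m, threshold, n_sims=20_000, seed=0):
    """Fraction of simulated families with at least one p-value below threshold."""
    rng = random.Random(seed)
    families_with_error = 0
    for _ in range(n_sims):
        # Assumption: all m nulls are true, so each p-value is uniform on [0, 1].
        if any(rng.random() < threshold for _ in range(m)):
            families_with_error += 1
    return families_with_error / n_sims

alpha, m = 0.05, 20
uncorrected = estimate_fwer(m, alpha)
bonferroni = estimate_fwer(m, alpha / m)
print(f"FWER without correction: {uncorrected:.3f}")
print(f"FWER with Bonferroni:    {bonferroni:.3f}")
```

Under these assumptions the uncorrected estimate is far above \(\alpha\), while the corrected one stays near or below it.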

Bonferroni example#

  • In the xkcd comic, what would be the Bonferroni correction?

  • Since there are 20 hypotheses, the new threshold should be \(\frac{0.05}{20}=0.0025\).

  • Interpretation: Since the scientists did 20 tests, a p-value less than 0.05 is not strong evidence that green jelly beans are linked to acne.

  • The p-value needs to be less than 0.0025 for there to be evidence that green jelly beans are linked to acne.
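As a sketch of how the corrected threshold would be applied, the code below runs a Bonferroni check over 20 p-values. The p-values are hypothetical and made up for illustration (the comic only reports that green fell below 0.05); the function name `bonferroni_reject` and the color labels are likewise illustrative.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return the corrected threshold and the labels whose p-value falls below it."""
    threshold = alpha / len(p_values)
    rejected = [label for label, p in p_values.items() if p < threshold]
    return threshold, rejected

# Hypothetical p-values, for illustration only: green is the single color
# below 0.05, and the other 19 colors are set to 0.5.
p_values = {f"color_{i}": 0.5 for i in range(1, 20)}
p_values["green"] = 0.04

threshold, rejected = bonferroni_reject(p_values)
print(f"corrected threshold: {threshold:.4f}")
print("rejected after correction:", rejected)
```

With these made-up numbers, green passes the usual 0.05 cutoff but not the corrected one, so no null hypothesis is rejected.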

Dream again#

Dream recap#

  • In a speedrun attempt, Dream received 42 Ender pearls in 262 trades.

  • Last lecture, we saw that it is very unlikely he would get that many Ender pearls without cheating.

  • But, do we need to do a multiple hypothesis correction?

Dream and multiple hypotheses#

  • We should account for the fact that there are many people who play Minecraft and therefore there have been many speed running attempts.

  • Maybe it is reasonable that someone would get as lucky as Dream, and we are unfairly focusing on him.

Bonferroni correction for Dream#

  • How could we do a Bonferroni correction for Dream? What is \(m\), the number of hypotheses?

  • \(m\) should be the number of Minecraft speed run attempts. We do not know \(m\) exactly but we could do a Fermi estimate.

\[\begin{aligned} \#\,\text{speed run attempts} &= \#\,\text{speed runners} \\ &\quad \times \#\,\text{attempts per runner per year} \\ &\quad \times \#\,\text{years of speed running} \\ &\approx 10^{5} \times 10 \times 10 \\ &= 10^{7} \end{aligned}\]

Bonferroni correction for Dream#

  • Giving Dream the benefit of the doubt, we can use \(m=10^8\) (one hundred million), ten times the Fermi estimate.

  • A calculation gives that the p-value for Dream’s trades is roughly \(\frac{6}{10^{12}}\) (less than 1 in a hundred billion).

  • The Bonferroni-corrected threshold is

    \[\frac{0.05}{10^8} = \frac{5}{10^{10}} = \text{1 in 2 billion}\]
  • Dream’s p-value is still much smaller than the corrected threshold, so we still have evidence for cheating.
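The arithmetic behind this comparison is short enough to check directly. The sketch below takes \(m = 10^8\) and the approximate p-value \(6/10^{12}\) from the slides as inputs; the variable names are illustrative.

```python
alpha = 0.05
m = 10 ** 8        # generous count of speed run attempts, from the slides
p_dream = 6e-12    # approximate p-value quoted in the slides

threshold = alpha / m
print(f"corrected threshold: {threshold:.1e} (1 in {1 / threshold:,.0f})")
print(f"p-value as a fraction of the threshold: {p_dream / threshold:.3f}")
print("reject the null after correction:", p_dream < threshold)
```

Even with this generous value of \(m\), the p-value is only a small fraction of the corrected threshold.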

The dark side of Bonferroni#

  • The Bonferroni correction makes it much harder to reject the null hypothesis.

  • This keeps the false positive (type 1 error) rate under control.

  • But when we reject the null hypothesis less often, we risk having a lot of false negatives (type 2 errors): real effects that we fail to detect.

  • The Bonferroni correction increases the false negative (type 2 error) rate.
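The trade-off can be made concrete with an exact calculation. This sketch again borrows the chocolate egg setup (40 trials, null probability 0.3, one-sided test) and assumes a hypothetical true probability of 0.15 and a family of 20 tests. None of these numbers come from the slides, and the helper names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr[X = k] for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Pr[X <= k] for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def false_negative_rate(n, p_null, p_true, threshold):
    """Pr[fail to reject the null] when the true success probability is p_true."""
    power = sum(
        binom_pmf(k, n, p_true)
        for k in range(n + 1)
        if binom_cdf(k, n, p_null) < threshold  # outcomes where we reject
    )
    return 1 - power

alpha, m = 0.05, 20
plain = false_negative_rate(40, 0.3, 0.15, alpha)
corrected = false_negative_rate(40, 0.3, 0.15, alpha / m)
print(f"false negative rate at alpha:     {plain:.3f}")
print(f"false negative rate at alpha / m: {corrected:.3f}")
```

The stricter threshold shrinks the set of outcomes that lead to rejection, so the false negative rate rises.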

Genomic studies#

Genome wide association study#

  • The human genome has roughly 20,000 genes.

  • The expression levels of each person’s genes can vary widely.

  • A genome wide association (GWA) study looks at whether there are any genes that are associated with a particular disease.

An illustration of a GWA study.

Genome wide association study#

  • Scientists conducting GWA studies have to be very careful about false positives due to testing multiple hypotheses (one for each gene).

  • This has led to many new statistical methods (many developed at Stanford) designed specifically for GWA studies, with lower false negative rates than the Bonferroni correction.

  • GWA studies have successfully identified genes that are associated with a variety of diseases including heart disease, diabetes, and Crohn’s disease.
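A rough calculation shows why multiple testing matters so much at this scale. The sketch assumes one hypothesis per gene (\(m = 20{,}000\)) and a 0.05 threshold, and considers the case where no gene is truly associated with the disease. The variable names are illustrative.

```python
m = 20_000     # roughly one hypothesis per gene, as in the slides
alpha = 0.05

# If every null hypothesis were true, about m * alpha tests would still
# fall below the threshold by chance.
expected_false_positives = m * alpha

# Bonferroni threshold that keeps the family wise error rate at most alpha.
bonferroni_threshold = alpha / m

print(f"expected false positives without correction: {expected_false_positives:.0f}")
print(f"Bonferroni threshold: {bonferroni_threshold:.1e}")
```

Without any correction, on the order of a thousand genes would look significant by chance alone, while the Bonferroni threshold is strict enough that real associations can easily be missed.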

Recap#

  • Using p-values controls the false positive rate:

    • If we reject a null hypothesis when the p-value is less than \(\alpha\), then the false positive rate will be at most \(\alpha\).

    • If we make the \(\alpha\) smaller, there will be more false negatives.

  • Multiple testing:

    • The family-wise error rate is the chance of at least one false positive.

    • p-hacking.

    • Bonferroni correction for multiple testing.