Lecture 24: Sampling bias#

Recap & Practice Quiz 2#

Samples and populations#

  • There is a variable \(x\) which we want to measure on observation units in a population.

  • Our goal is to estimate the population mean \(\mu\) which is a parameter.

  • We take \(n\) independent samples from the population and record the variable on each sampled unit.

  • This gives measurements \(x_1,\ldots,x_n\).

  • The sample mean \(\hat{\mu}_n = \frac{x_1+\cdots+x_n}{n}\) is an estimate of \(\mu\).

Confidence intervals#

  • A confidence interval for \(\mu\) is a collection of plausible values of \(\mu\).

  • The estimate \(\hat{\mu}_n\) can be used to make confidence intervals of the form

    \[ \hat{\mu}_n \pm 2 \frac{\hat{\sigma}_x}{\sqrt{n}}\]

    where \(n\) is the sample size and \(\hat{\sigma}_x\) is the standard deviation of \(x_1,\ldots,x_n\).
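
As a rough illustration, the interval above can be computed in a few lines of Python. This is only a sketch: the measurements are made up, the function name `mean_confidence_interval` is hypothetical, and it assumes \(\hat{\sigma}_x\) is the sample standard deviation with the \(n-1\) denominator (what `statistics.stdev` computes).

```python
import math
import statistics

def mean_confidence_interval(xs, multiplier=2.0):
    """Return (lower, upper) for mean ± multiplier * sd / sqrt(n)."""
    n = len(xs)
    mu_hat = statistics.mean(xs)       # sample mean
    sigma_hat = statistics.stdev(xs)   # assumption: n - 1 denominator
    half_width = multiplier * sigma_hat / math.sqrt(n)
    return mu_hat - half_width, mu_hat + half_width

# Made-up measurements, for illustration only.
xs = [2, 4, 4, 4, 5, 5, 7, 9]
lower, upper = mean_confidence_interval(xs)
print(f"approximate 95% interval: [{lower:.2f}, {upper:.2f}]")
```

The `multiplier` argument is left adjustable so the same function can produce wider intervals for higher confidence levels.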

Question 1#

Decide whether the following statement is True or False, and justify your answer:

“When estimating a mean, a larger sample size will make the confidence interval smaller.”

  • Answer: True! A larger sample size decreases the standard deviation of \(\hat{\mu}_n\), which in turn narrows the confidence interval.

  • Specifically, the standard deviation of \(\hat{\mu}_n\) is

    \[\frac{\hat{\sigma}_x}{\sqrt{n}}\]

    which decreases with \(n\).
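
To see how quickly the standard deviation of the estimate shrinks, here is a small sketch that evaluates the formula for a few sample sizes. The value of `sigma_hat` is hypothetical and is held fixed so that only \(n\) changes.

```python
import math

def standard_error(sigma_hat, n):
    """Standard deviation of the sample mean: sigma_hat / sqrt(n)."""
    return sigma_hat / math.sqrt(n)

sigma_hat = 10.0  # hypothetical sample standard deviation, held fixed

# Each step quadruples n, which halves the standard error.
for n in [25, 100, 400]:
    print(n, standard_error(sigma_hat, n))
```

Because of the square root, halving the width of the interval requires four times as many samples.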

Question 2#

In the following scenario, explain

a. What is the population

b. What is the variable \(x\) being measured

c. What is the sample \(x_1,\ldots,x_n\)

A penguin ecologist is trying to determine the average number of offspring a female Antarctic penguin will hatch over her lifetime. The ecologist tags \(n\) Antarctic female penguins at random, then records the number of eggs that each penguin hatches in her lifetime.

Question 2 – Answer#

a. The population is female Antarctic penguins.

b. The variable \(x\) is the number of eggs that a female penguin hatches in her lifetime.

c. The sample is, for each tagged female penguin, the number of eggs \(x_i\) that the female hatched.

Question 2 - Extension#

In the following scenario, explain

a. What is the population

b. What is the variable \(x\) being measured

c. What is the sample \(x_1,\ldots,x_n\)

A penguin ecologist is trying to determine the proportion of female Antarctic penguins that will hatch at least one egg in their lifetime.

The ecologist tags \(n\) Antarctic female penguins at random, and records whether each penguin hatches at least one egg in her lifetime.

Question 2 - Extension answer#

a. The population is female Antarctic penguins.

b. The variable \(x\) is whether the female hatched at least one egg in her lifetime.

c. The sample is, for each tagged female penguin, a record of whether the female hatched an egg in her lifetime.

Question 3#

Suppose you conduct a poll on an issue on which the population is roughly divided. You survey \(n=50\) people and 20 say yes.

a. Compute \(\hat{\pi}_n\) the sample proportion of people who said yes.

b. Suppose that the standard deviation of \(\hat{\pi}_n\) is \(0.07\). Construct a 95% confidence interval for \(\pi\) the proportion of people in the population who would say yes.

Question 3 – Answer#

a. The sample proportion is \(\hat{\pi}_n = \frac{20}{50}=0.4\)

b. The standard deviation of \(\hat{\pi}_n\) is \(0.07\). A 95% confidence interval for \(\pi\) would be

    \[0.4 \pm 2 \times 0.07 = [0.26, 0.54]\]

Question 3 - follow up#

  • How would your answer change if the question asked for a 99% confidence interval?

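  • Answer: Use a larger multiplier. Using 3 standard deviations instead of 2 gives an interval with at least 99% coverage (about 99.7% under the normal approximation).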

  • How would your answer change if you were told \(\hat{\sigma}_x\) (the sample standard deviation of \(x_1,\ldots,x_n\)) instead of the standard deviation of the estimate?

  • Answer: Use the formula \(\frac{\hat{\sigma}_x}{\sqrt{n}}\) for the standard deviation of the estimate.
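
The arithmetic in Question 3 and its follow-up can be checked with a short sketch. One assumption is added here: for yes/no data coded as 1/0, the standard deviation of the responses is taken to be \(\sqrt{\hat{\pi}_n(1-\hat{\pi}_n)}\), which ignores the small \(n-1\) correction.

```python
import math

n = 50
yes = 20
pi_hat = yes / n  # sample proportion

# Part (b): interval using the standard deviation given in the question.
sd_estimate = 0.07
lower = pi_hat - 2 * sd_estimate
upper = pi_hat + 2 * sd_estimate
print(f"95% interval: [{lower:.2f}, {upper:.2f}]")

# Follow-up: start from the standard deviation of the 0/1 responses
# (assumption: sqrt(p * (1 - p)), no n - 1 correction) and divide by sqrt(n).
sigma_hat_x = math.sqrt(pi_hat * (1 - pi_hat))
sd_from_sample = sigma_hat_x / math.sqrt(n)
print(f"standard deviation of the estimate: {sd_from_sample:.3f}")
```

Under this assumption the second calculation gives a value close to the \(0.07\) stated in the question.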

Gettysburg address#

Sampling words#

  • Your worksheet has the Gettysburg address.

  • Randomly sample 10 words from the Gettysburg address and write them down on your worksheet.

Analyzing the data#

  • Enter the length of each word in this Google form.

  • Scroll across for all options.

  • Your responses are being saved here.

  • Let’s analyze the results here.

Comparison to unbiased sampling#

This is the distribution of word lengths in the Gettysburg address.

This is the distribution of the sample mean if words were sampled uniformly at random.

What went wrong?#

  • What could be causing the bias when sampling words?

    • Longer words take up more space on the page.

    • Longer words are more interesting.

  • Sampling bias occurs when there are factors that affect both:

    • The chance that an observational unit is sampled.

    • The value of the measurement \(x\).

  • This is similar to observational studies, where bias is caused by factors that affect both the chance of treatment and the outcome.
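
The effect can be imitated in a small simulation. This is a sketch under stated assumptions, not a model of how people actually pick words: it uses only the first sentence of the Gettysburg address (punctuation removed) as a stand-in for the full text, it samples with replacement, and it assumes that in the biased case a word is chosen with probability proportional to its length (one way to encode "longer words take up more space on the page"). The helper name `mean_of_sample_means` is hypothetical.

```python
import random
import statistics

# First sentence of the Gettysburg address, punctuation removed.
text = ("Four score and seven years ago our fathers brought forth on this "
        "continent a new nation conceived in Liberty and dedicated to the "
        "proposition that all men are created equal")
lengths = [len(word) for word in text.split()]
true_mean = statistics.mean(lengths)

rng = random.Random(0)  # fixed seed so the run is repeatable

def mean_of_sample_means(weights, n_words=10, n_reps=5000):
    """Average the sample mean over many simulated samples of n_words words."""
    means = []
    for _ in range(n_reps):
        # random.choices samples with replacement; weights=None means uniform.
        sample = rng.choices(lengths, weights=weights, k=n_words)
        means.append(statistics.mean(sample))
    return statistics.mean(means)

uniform = mean_of_sample_means(weights=None)
size_biased = mean_of_sample_means(weights=lengths)  # assumption: P(word) ∝ length

print(f"true mean length:        {true_mean:.2f}")
print(f"uniform sampling:        {uniform:.2f}")
print(f"length-biased sampling:  {size_biased:.2f}")
```

Under uniform sampling the average of the sample means sits near the true mean, while the length-weighted version is noticeably larger, because word length affects both the chance of being sampled and the measurement itself.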

Causes of bias#

Sampling bias#

  • Without randomly sampling, estimates can be inaccurate, and confidence intervals can be invalid.

  • When the sample is not collected uniformly at random, sampling bias can occur.
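
To illustrate what "invalid" means for a confidence interval, the sketch below estimates how often the interval \(\hat{\mu}_n \pm 2\hat{\sigma}_x/\sqrt{n}\) contains the true mean under two sampling schemes. Everything here is hypothetical: the population is simply the numbers 1 to 100, sampling is with replacement, and the biased scheme picks a unit with probability proportional to its value.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: the values 1, 2, ..., 100.
population = list(range(1, 101))
mu = statistics.mean(population)

def coverage(weights, n=50, n_reps=2000):
    """Fraction of simulated intervals (mean ± 2 standard errors) containing mu."""
    hits = 0
    for _ in range(n_reps):
        xs = rng.choices(population, weights=weights, k=n)
        mu_hat = statistics.mean(xs)
        se = statistics.stdev(xs) / math.sqrt(n)
        if mu_hat - 2 * se <= mu <= mu_hat + 2 * se:
            hits += 1
    return hits / n_reps

uniform_coverage = coverage(weights=None)       # uniform random sampling
biased_coverage = coverage(weights=population)  # larger values more likely

print(f"coverage with uniform sampling: {uniform_coverage:.3f}")
print(f"coverage with biased sampling:  {biased_coverage:.3f}")
```

With uniform sampling the intervals contain \(\mu\) most of the time, as intended. With the biased scheme the intervals are centered in the wrong place, so they rarely contain \(\mu\) even though they are just as narrow.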

Convenience sampling#

  • Convenience sampling refers to sampling from a convenient-to-reach population.

  • The convenient-to-reach population might not be representative of the whole.

  • Example: Experiments in the social sciences (psychology, behavioral economics) are disproportionately done on college students, because the experiments are conducted by college professors.

    • How could using college students bias the results of studies?

Example: Endowment effect#

  • In 1990, Kahneman, Knetsch, and Thaler did the following experiment to test for “the endowment effect”:

    • The researchers recruited Cornell undergraduates to participate in the study.

    • They randomly gave half the participants a coffee mug.

    • The participants were then allowed to trade with each other.

    • Economic theory predicts that about half the mugs would be traded, but the observed number of trades was much lower (only a quarter).

Endowment effect#

  • Later, List (2003) tried to replicate the study with a sample of people at a sports card trading show.

  • Instead of mugs, the study participants were offered sports merchandise.

  • In this experiment, there was no endowment effect.

  • Market experience might explain whether there is an endowment effect.

  • The initial dramatic finding could be due to the use of a convenience sample of college students with little market experience.

Method of contact#

  • The way that participants are contacted can lead to convenience sampling.

  • Example: In 2012, Gallup predicted that Mitt Romney would win the presidential election, but Barack Obama won.

  • Reviewing their polling, Gallup found that they systematically over-predicted the success of Republicans.

  • They concluded that a major source of bias was their use of phone-based polling.

  • People who have a landline phone tend to be older and more conservative.

Volunteer bias#

  • When participation in a study/survey is voluntary, volunteer bias may occur.

  • Example: People who have a very bad or very good experience are more likely to write a review.

  • An experiment at Airbnb found that when more reviews were collected (by offering a coupon), the average rating decreased.

Compensation#

  • Offering compensation may also introduce bias.

  • Example: In the 2000s and 2010s, the Bureau of Labor Statistics was having trouble recruiting participants for the “Consumer Quarterly Expenditures Survey,” which aims to measure household expenses.

    • The Bureau conducted an experiment to check if offering incentives of a prepaid debit card would be an effective way of increasing participation.

    • Income level and the rate of homeownership were lower in the group that got the prepaid debit card.

  • Question: Can you think of other examples where offering compensation could introduce bias?

Survivorship bias#

  • Survivorship bias occurs when screening rules introduce bias.

  • Example: A clinical trial in the 1980s and 1990s investigated the benefits of chemotherapy and bone marrow transplants for treating breast cancer.

  • Only women who did not have a bad response to conventional chemotherapy were eligible for the early phases of the trial.

Study results#

  • The early phases of the trial showed very favorable results, but later phases of the trial showed that the therapy is not effective.

  • By only selecting patients who responded well to conventional chemotherapy, the study selected patients who were more likely to survive regardless of the impact of the new therapy.

  • Question: Can you think of other examples of survivorship bias?

Sampling bias – summary#

  • Examples of sampling bias:

    • Convenience sampling (only sampling college students or people with landlines).

    • Volunteer bias (participants opt in to the study).

    • Survivorship bias (screening determines who is allowed in the study).

  • Sampling bias occurs when there is a factor that affects both the chance of being sampled and the variable being measured.