{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lecture 24: Sampling bias\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 2;\" >\n",
    "\n",
    "**Announcement:** <a href = \"https://edstem.org/us/courses/96502/discussion/8074947\">Practice finals on Ed. </a>\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "# Recap & Practice Quiz 2\n",
    "\n",
    "## Samples and populations\n",
    "\n",
    "\n",
    "- There is a variable $x$\n",
    "which we want to measure on observation units in a **population**.\n",
    "- Our goal is to estimate the population mean $\\mu$ which is a **parameter**.\n",
    "- We take independent $n$ **samples** from the population and record the variable on the sample.\n",
    "- This gives measurements $x_1,\\ldots,x_n$.\n",
    "- The sample mean $\\hat{\\mu}_n = \\frac{x_1+\\cdots+x_n}{n}$ is an **estimate** of $\\mu$.\n",
    "\n",
    "\n",
    "\n",
    "## Confidence intervals\n",
    "\n",
    "\n",
    "- A **confidence interval** for $\\mu$ is a collection of plausible values of $\\mu$.\n",
    "- The estimate $\\hat{\\mu}_n$ can be used to make confidence intervals of the form\n",
    "\n",
    "  $$ \\hat{\\mu}_n \\pm 2 \\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$$\n",
    "\n",
    "  where $n$ is the sample size and $\\hat{\\sigma}_x$ is the standard deviation of $x_1,\\ldots,x_n$.\n",
    "\n",
    "\n",
    "\n",
    "## Question 1\n",
    "\n",
    "Decide whether the following statement is True or False, and justify your answer: \n",
    "\n",
    "\"When estimating a mean, a larger sample size will make the confidence interval smaller.\"\n",
    "\n",
    "- **Answer:** True! Large sample sizes decrease the standard deviation of $\\hat{\\mu}_n$ which in turn decreases the size of the confidence interval.\n",
    "- Specifically, the standard deviation of $\\hat{\\mu}_n$ is\n",
    "\n",
    "  $$\\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$$\n",
    "\n",
    "  which decreases with $n$.\n",
    "\n",
    "\n",
    "\n",
    "## Question 2\n",
    "\n",
    "In the following scenario, explain \n",
    "\n",
    "\n",
    "  a. What is the population\n",
    "\n",
    "  b. What is the variable $x$ being measured\n",
    "\n",
    "  c. What is the sample $x_1,\\ldots,x_n$\n",
    "\n",
    "\n",
    "\n",
    "A penguin ecologist is trying to determine the average number of offspring a female Antarctic penguin will hatch over her lifetime.\n",
    "The ecologist tags $n$ Antarctic female penguins at random, then records the number of eggs that each penguin hatches in her lifetime.\n",
    "\n",
    "\n",
    "## Question 2 -- Answer\n",
    "\n",
    "a. The population is female Antarctic penguins.\n",
    "\n",
    "b. The variable $x$ is the number of eggs that a female penguin hatches in her lifetime.\n",
    "\n",
    "c. The sample is, for each tagged female penguin, the number of eggs $x_i$ that the female hatched.\n",
    "\n",
    "## Question 2 - Extension\n",
    "\n",
    "In the following scenario, explain \n",
    "\n",
    "\n",
    "  a. What is the population\n",
    "\n",
    "  b. What is the variable $x$ being measured\n",
    "\n",
    "  c. What is the sample $x_1,\\ldots,x_n$\n",
    "\n",
    "\n",
    "\n",
    "A penguin ecologist is trying to determine the *proportion* of female Antarctic penguins that will hatch at least one egg in her lifetime.\n",
    "\n",
    "The ecologist tags $n$ Antarctic female penguins at random, and records whether each penguin hatches at least one in her lifetime.\n",
    "\n",
    "\n",
    "## Question 2 - Extension answer\n",
    "\n",
    "a. The population is female Antarctic penguins.\n",
    "\n",
    "b. The variable $x$ is whether the female hatched at least one egg in her lifetime.\n",
    "\n",
    "c. The sample is, for each tagged female penguin, a record of whether the female hatched an egg in her lifetime.\n",
    "\n",
    "\n",
    "## Question 3\n",
    "\n",
    "Suppose you conduct a poll on an issue on which the population is roughly divided.\n",
    "You survey $n=50$ people and 20 said yes. \n",
    "\n",
    "  a. Compute $\\hat{\\pi}_n$ the sample proportion of people who said yes. \n",
    "  \n",
    "  b. Suppose that the standard deviation of $\\hat{\\pi}_n$ is $0.07$. Construct a 95% confidence interval for $\\pi$ the proportion of people in the population who would say yes.\n",
    "\n",
    "## Question 3 -- Answer\n",
    "\n",
    "a. The sample proportion is $\\hat{\\pi}_n = \\frac{20}{50}=0.4$\n",
    "\n",
    "b. The standard deviation of $\\hat{\\pi}_n$ is $0.07$. A 95% confidence interval for $\\pi$ would be\n",
    "  $$0.4 \\pm 2 \\times 0.07 = [0.26, 0.54]$$\n",
    "\n",
    "## Question 3 - follow up\n",
    "\n",
    "- How would your answer change if the question asked for a 99% confidence interval?\n",
    "\n",
    "- **Answer:** Use 3 standard deviations instead of 2.\n",
    "\n",
    "- How would your answer change if your were told $\\hat{\\sigma}_x$ (the sample standard deviation of $x_1,\\ldots,x_n$) instead of the standard deviation of the estimate?\n",
    "\n",
    "- **Answer:** Use the formula $\\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$ for the standard deviation of the estimate.\n",
    "\n",
    "# Gettysburg address\n",
    "\n",
    "## Sampling words\n",
    "\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "- Your worksheet has the Gettysburg address.\n",
    "- Randomly sample 10 words from the Gettysburg address and write them down on your worksheet.\n",
    "\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/Gettysburg_Address_(poster).jpg\" alt=\"\" style=\"width:50%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "## Analyzing the data\n",
    "\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "\n",
    "- Enter the length of each word in this <a href=\"https://forms.gle/DtYBQs2e3H3tFEAH9\">Google form</a>.\n",
    "- Scroll across for all options.\n",
    "- Your responses are being saved <a href =\"https://docs.google.com/spreadsheets/d/1XnsrurvaSjwmHtk6KnnkxBrQJqUPtafJGN48uowiVEk/edit?usp=sharing\">here</a>.\n",
    "- Let's analyze the results <a href = \"https://colab.research.google.com/drive/1LlGczhE8bg0uojCJ9ybbjfQdYiEM6CLh?usp=sharing\">here</a>.\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/sampling_words.jpg\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Comparison to unbiased sampling\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "This is the distribution of words length in the Gettysburg address.\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/gettysburg_hist.png\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "\n",
    "\n",
    "This is the distribution of the sample mean *if* words were uniformly sampled\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/gettysburg_mean_hist.png\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "## What went wrong?\n",
    "\n",
    "- What could be causing the bias when sampling words?\n",
    "  - Longer words take up more space on the page.\n",
    "  - Longer words are more interesting.\n",
    "\n",
    "- **Sampling bias** occurs when there are factors that effect both:\n",
    "  - The chance that an observational unit is sampled.\n",
    "  - The value of the measurement $x$.\n",
    "\n",
    "- Similar to observational studies: bias is caused by factors that effect the chance of treatment and the outcome.\n",
    "\n",
    "# Causes of bias\n",
    "\n",
    "## Sampling bias\n",
    "\n",
    "- Without randomly sampling, estimates can be *inaccurate*, and confidence intervals can be *invalid*.\n",
    "- When the sample is not collected uniformly sampling bias can occur.\n",
    "\n",
    "## Convenience sampling\n",
    "\n",
    "\n",
    "- *Convenience sampling* refers to samples on a convenient-to-reach population.\n",
    "- The convenient-to-reach population might not be representative of the whole.\n",
    "- **Example**: Experiments in the social sciences (psychology, behavioral economics) are disproportionately done on college students (because they are conducted by college professors).\n",
    "  - How could using college students bias the results of studies?\n",
    "\n",
    "## Example: Endowment effect\n",
    "\n",
    "- In 1990 [Khaneman, Knetsch, and Thaler](https://www.journals.uchicago.edu/doi/10.1086/261737), did the following experiment to test for \"the endowment effect\":\n",
    "\n",
    "  - The researchers recruited Cornell undergraduates to participate in the study.\n",
    "\n",
    "  - They randomly gave half the participants a coffee mug.\n",
    "\n",
    "  - The participants where then allowed to trade with each other. \n",
    "  \n",
    "  - Economic theory predicts that about half the mugs would be traded but the observed number of trades was much lower (only a quarter).\n",
    "\n",
    "## Endowment effect\n",
    "\n",
    "- Later, [List (2003)](https://academic.oup.com/qje/article-abstract/118/1/41/1917048) tried to replicate the study with a sample of people at sports card trading show.\n",
    "- Instead of mugs, the study participants were offered sport merchandise.\n",
    "- In this experiment, there was no endowment effect.\n",
    "- Market experience might explain whether there is an endowment effect. \n",
    "- The initial dramatic finding could be due to the use of a convenience sample of college students with little market experience.\n",
    "\n",
    "## Method of contact\n",
    "\n",
    "- The way that participants are contacted can lead to convenience sampling.\n",
    "- **Example** In 2012, Gallop predicted that Mitt Romney would win the 2012 Presidential election but Barack Obama won.\n",
    "- Reviewing their polling, Gallop found that they systematically over-predicted the success of Republicans.\n",
    "- They concluded that a major source of bias was their use of phone-based polling.\n",
    "- People who have a landline phone tend to be older and more conservative.\n",
    "\n",
    "## Volunteer bias\n",
    "\n",
    "- When participation in a study/survey is voluntary, **volunteer bias** may occur.\n",
    "- **Example** people who have a very bad or very good experience are more likely to write a review.\n",
    "\n",
    "  <figure style=\"text-align:center;\"><img src=\"../figures/airbnb.png\" alt=\"\" style=\"width:80%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "- An [experiment at AirBnB](https://arxiv.org/pdf/2112.09783) found when more reviews were collected (by offering a coupon), then the average rating decreased.\n",
    "\n",
    "## Compensation\n",
    "\n",
    "- Offering compensation may also introduce bias.\n",
    "- **Example**: In the 2000\u2019s and 2010\u2019s, the Bureau of Labor Statistics was having trouble recruiting participants for the \u201cConsumer Quarterly Expenditures Survey,\u201d which aims to measure household expenses.\n",
    "  - The Bureau conducted an experiment to check if offering incentives of a prepaid debit card would be an effective way of increasing participation.\n",
    "  - Income level and the rate of homeownership was lower in the group that got the prepaid debit card.\n",
    "\n",
    "- **Question**: Can you think of other examples of survivorship bias?\n",
    "\n",
    "## Survivorship bias\n",
    "\n",
    "- **Survivorship bias** occurs when screening rules introduce bias.\n",
    "\n",
    "- A clinical trial in the 80's and 90's investigate the benefits of chemotherapy and bone marrow transplants for treating for breast cancer.\n",
    "- Only women who did not have a bad response to conventional chemotherapy were eligible for the early phases of the trial.\n",
    "\n",
    "## Study results\n",
    "\n",
    "- The early phases of the trial showed very favorable results, but later phases of the trial showed that the therapy is not effective.\n",
    "- By only selecting patients that responded well to conventional chemotherapy, the study select patients that were more likely to survive regardless of the impact of the new therapy.\n",
    "\n",
    "- **Question**: Can you think of other examples of survivorship bias?\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Sampling bias -- summary\n",
    "\n",
    "\n",
    "- Examples of sampling bias:\n",
    "  - **Convenience sampling** (only sampling college students or people with landlines).\n",
    "  - **Volunteer bias** (participants opt in to the study).\n",
    "  - **Survivorship bias** (screening determines allowed in the study).\n",
    "- Sampling bias occurs when there is a factor that affects both the chance of being sampled and the variable being measured."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Phython (JB)",
   "language": "python",
   "name": "jb-python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
