{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lecture 23: Estimation for quantitative variables\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 2;\" >\n",
    "\n",
    "**Announcements:**\n",
    "\n",
    "- Practice quizzes are online.\n",
    "- Please bring a laptop to section tomorrow.\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "# Recap\n",
    "\n",
    "\n",
    "## Sampling\n",
    "\n",
    "\n",
    "- **Samples** can be used to estimate parameters.\n",
    "  - Example: sample $n$ Stanford students and ask if they support the proctoring pilot.\n",
    "  - This defines an unknown **parameter** $\\pi$ (the proportion of *all* Stanford students who support the proctoring pilot).\n",
    "  - And an **estimate** $\\hat{\\pi}_n$ (the proportion of students in the sample who support the proctoring pilot).\n",
    "\n",
    "\n",
    "## Distribution of $\\hat{\\pi}_n$\n",
    "\n",
    "\n",
    "- The distribution of $\\hat{\\pi}_n$ is centered at $\\pi$ and has standard deviation:\n",
    "\n",
    "  $$\n",
    "  \\sqrt{\\frac{\\pi(1-\\pi)}{n}} \\approx \\sqrt{\\frac{\\hat{\\pi}_n(1-\\hat{\\pi}_n)}{n}}\n",
    "  $$ \n",
    "\n",
    "- Larger sample size \u2194 smaller standard deviation \u2194 $\\hat{\\pi}_n$ is closer to $\\pi$\n",
    "\n",
    "\n",
    "\n",
    "## Confidence intervals \n",
    "\n",
    "\n",
    "- A **confidence interval** is a collection of plausible values for the parameter.\n",
    "- A confidence interval has a **confidence level** (for example 95%).\n",
    "- We are 95% confident that a 95% confidence interval contains the parameter.\n",
    "- Confidence intervals can be calculated using the **68-95-99 rule** and the **normal approximation**.\n",
    "\n",
    "\n",
    "## Example: proctoring pilot\n",
    "\n",
    "- Suppose you surveyed $n=100$ Stanford students and $55$ of them say they support the proctoring pilot.\n",
    "  - What is the estimate $\\hat{\\pi}_n$?\n",
    "  - **Answer:** $\\hat{\\pi}_n = \\frac{55}{100}=0.55$\n",
    "  - What is the standard deviation of $\\hat{\\pi}_n$?\n",
    "  - **Answer:** \n",
    "    $$\\sqrt{\\hat{\\pi}_n(1-\\hat{\\pi}_n)/n} = \\sqrt{0.55 \\times 0.45 / 100} = 0.05$$\n",
    "\n",
    "  - What is a 68% confidence interval for $\\pi$?\n",
    "  - **Answer:** by the 68-95-99 rule:\n",
    "\n",
    "    $$\\hat{\\pi}_n \\pm \\sqrt{\\frac{\\hat{\\pi}_n(1-\\hat{\\pi}_n)}{n}} = 0.55 \\pm 0.05 = [0.5, 0.6]$$\n",
    "\n",
    "## The importance of random sampling\n",
    "\n",
    "- The previous results are only valid if the sample is drawn randomly.\n",
    "- Each student needs to have the same chance of being selected and students should be sampled independently.\n",
    "- If some students are more likely to be chosen, then $\\hat{\\pi}_n$ might be *biased*.\n",
    "- If the students are not sampled independently, then the formula $\\sqrt{\\hat{\\pi}_n(1-\\hat{\\pi}_n)/n}$ is not correct.\n",
    "- More on this on Friday.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "# Estimation for quantitative variables\n",
    "\n",
    "## Populations and parameters\n",
    "\n",
    "- Sometimes, we want to know about something other than a yes or no question.\n",
    "- Instead, we might want to measure a *quantitative variable*.\n",
    "- We could in theory measure the variable for every observational unit in the population of interest. \n",
    "-  This would let us calculate the **population mean**.\n",
    "- The population mean is a parameter and written as $\\mu$ (a Greek *m*, pronounced \"mu\").\n",
    "\n",
    "## Samples and estimation\n",
    "\n",
    "- As with polling, it is more efficient to take a sample instead of measuring every observational unit in the population.\n",
    "- Suppose that we randomly sample $n$ observational units, and measure the quantitative variable for all of them.\n",
    "- This gives $n$ *measurements*: $x_1,x_2,\\ldots,x_n$.\n",
    "- The **sample mean** of $x_1,\\ldots,x_n$ is\n",
    "\n",
    "  $$\\hat{\\mu}_n = \\frac{x_1+x_2+\\cdots + x_n}{n} $$\n",
    "\n",
    "## Microplastics \n",
    "\n",
    "- We want to determine the average concentration of microplastics in Palo Alto tap water.\n",
    "- The concentration is a parameter. It is a fixed unknown quantity $\\mu$.\n",
    "- Estimating $\\mu$ with a sample:\n",
    "  - Take $n$ water samples and measure the microplastics in each. This produces measurements $x_1,x_2,\\ldots,x_n$.\n",
    "  - $x_i$ is the concentration of microplastics in the $i$ th sample.\n",
    "  \n",
    "\n",
    "## Properties of $\\hat{\\mu}_n$\n",
    "\n",
    "- The estimate is the sample mean: $\\hat{\\mu}_n = \\frac{x_1+x_2+\\cdots + x_n}{n}$\n",
    "\n",
    "- Like $\\hat{\\pi}_n$, the estimate $\\hat{\\mu}_n$ is random and most of the time $\\hat{\\mu}_n \\neq \\mu$. But $\\hat{\\mu}_n$ should be close to $\\mu$.\n",
    "- **Question**: how does the sample size $n$ effect the distribution of $\\hat{\\mu}_n$?\n",
    "- **Answer**: if the sample size $n$ increases, then the variability of $\\hat{\\mu}_n$ will decrease. The histogram should become skinnier.\n",
    "\n",
    "## Simulation for $\\hat{\\mu}_n$\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "- Let's suppose that we know the distribution of the concentration of microplastics in Palo Alto tap water. \n",
    "- We can do a simulation to see what the distribution of $\\hat{\\mu}_n$ looks like for different values of $n$.\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_dist.png\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "## Simulation for $n=1$\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_sample_1.png\" alt=\"\" style=\"width:75%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "## Simulation for $n=5$\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_sample_5.png\" alt=\"\" style=\"width:75%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "## Simulation for $n=10$\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_sample_10.png\" alt=\"\" style=\"width:75%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "## Simulation for $n=20$\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_sample_20.png\" alt=\"\" style=\"width:75%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "## Simulation for $n=40$\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_sample_40.png\" alt=\"\" style=\"width:75%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "## Simulation for $n=100$\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_sample_100.png\" alt=\"\" style=\"width:75%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "## Simulation summary\n",
    "\n",
    "\n",
    "- What do you notice about the distribution of $\\hat{\\mu}_n$?\n",
    "- The distribution of the estimate $\\hat{\\mu}_n$ is centered at the parameter $\\mu = 300$.\n",
    "  - The *expected value* of $\\hat{\\mu}_n$ is $\\mu$.\n",
    "- The distribution of $\\hat{\\mu}_n$ is less spread out as $n$ gets bigger.\n",
    "  - The standard deviation of $\\hat{\\mu}_n$ decreases as $n$ gets larger.\n",
    "- When $n$ is large, the distribution of $\\hat{\\mu}_n$ looks \"bell shaped.\"\n",
    "  - The normal approximation also applies to $\\hat{\\mu}_n$\n",
    "\n",
    "## Comparison to $\\hat{\\pi}_n$\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "Distribution of $\\hat{\\pi}_n$ \n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/sampling_hist_5.png\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "Distribution of $\\hat{\\mu}_n$\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_sample_5.png\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "## Comparison to $\\hat{\\pi}_n$\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "Distribution of $\\hat{\\pi}_n$ \n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/sampling_hist_40.png\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "Distribution of $\\hat{\\mu}_n$\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_sample_40.png\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "</div>\n",
    "\n",
    "# Confidence intervals for $\\mu$\n",
    "\n",
    "## Standard deviation of $\\hat{\\mu}_n$\n",
    "\n",
    "- Let $\\sigma_x$ be the standard deviation of a single sample $x$.\n",
    "- The standard deviation of the estimate $\\hat{\\mu}_n$ is given by:\n",
    "\n",
    "  $$\\text{standard deviation of } \\hat{\\mu}_n = \\frac{\\sigma_x}{\\sqrt{n}}$$\n",
    "\n",
    "- As with proportions, the standard deviation of $\\hat{\\mu}_n$ is smaller by a factor of $\\frac{1}{\\sqrt{n}}$.\n",
    "\n",
    "## Computing the standard deviation\n",
    "\n",
    "- The standard deviation of a single sample $\\sigma_x$ is not known.\n",
    "- So instead we will estimate it with the sample standard deviation $\\hat{\\sigma}_x$.\n",
    "\n",
    "  $$\\text{standard deviation of } \\hat{\\mu}_n \\approx \\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$$\n",
    "\n",
    "- The sample standard deviation $\\hat{\\sigma}_x$ can be computed from the sample $x_1,\\ldots,x_n$.\n",
    "\n",
    "## Microplastics\n",
    "\n",
    "- Suppose that collected $n=100$ water samples and measured the concentration of microplastics in all of them.\n",
    "- Suppose that the estimate $\\hat{\\mu}_n$ is 310 nano grams per litre and $\\hat{\\sigma}_x$ is 200 nano grams per litre.\n",
    "- What is the standard deviation of $\\hat{\\mu}_n$?\n",
    "\n",
    "- **Answer:**\n",
    "\n",
    "  $$\\frac{\\hat{\\sigma}_x}{\\sqrt{n}} = \\frac{200}{\\sqrt{100}} = \\frac{200}{10} = 20$$\n",
    "\n",
    "## Normal approximation\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "<div style=\"flex: 1.5;\" >\n",
    "\n",
    "- When $n$ is large, the distribution of $\\hat{\\mu}_n$ is close to the normal distribution.\n",
    "- The distribution of the measurements $x_1,x_2,\\ldots,x_n$ might not be close to the normal distribution.\n",
    "- But, the distribution of $\\hat{\\mu}_n$ will be close to the normal distribution if $n$ is big.\n",
    "\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_dist.png\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/microplastic_sample_40.png\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "## 68-95-99 rule\n",
    "\n",
    "- This means we can use the 68-95-99 rule:\n",
    "   1. With **68%** probability: $\\hat{\\mu}_n$ is within **one** standard deviation of $\\mu$.\n",
    "  2. With **95%** probability: $\\hat{\\mu}_n$ is within **two** standard deviations of $\\mu$.\n",
    "  3. With **99%** probability: $\\hat{\\mu}_n$ is within **three** standard deviations of $\\mu$.\n",
    "\n",
    "## Confidence intervals\n",
    "\n",
    "- We can make confidence interval using the 68-95-99 rule.\n",
    "- In the microplastics example, suppose that $\\hat{\\mu}_n=310$ and the standard deviation of $\\hat{\\mu}_n$ is $20$. What is a 95% confidence interval for $\\mu$?\n",
    "\n",
    "\n",
    "$$[310 - 2 \\times 20, 310 + 2 \\times 20] = [270, 350]$$\n",
    "\n",
    "\n",
    "- We are 95% confident that the concentration of microplastics in Palo Alto tap water is between 270 and 350 nanograms per litre.\n",
    "\n",
    "# Mini crosswords\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## New York Times Mini crosswords\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/minis.png\" alt=\"\" style=\"width:80%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "## Clikey and Andel\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/clikey.jpg\" alt=\"\" style=\"width:65%;\" ><figcaption>Clikey = Claire + Mikey</figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/andel.jpg\" alt=\"\" style=\"width:65%;\" ><figcaption>Andel = Andrew + Eleanor</figcaption></figure>\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "## Data\n",
    "\n",
    "Here is a dataset of the time it took us to do the mini crosswords:\n",
    "\n",
    "|date|Clikey time|Andel time|Difference|Winner|\n",
    "|:--:|:---------:|:--------:|:--------:|:----:|\n",
    "|9-Jul|42|44|-2|Clikey|\n",
    "|11-Jul|78|55|23|Andel|\n",
    "|12-Jul|92|107|-15|Clikey|\n",
    "|13-Jul|67|90|-23|Clikey|\n",
    "\n",
    "And so on for a total of $n=33$ rows.\n",
    "\n",
    "\n",
    "\n",
    "## Estimation\n",
    "\n",
    "- Let $x_1,x_2,\\ldots,x_n$ be the difference between Clikey and Andel's crossword times.\n",
    "- We will pretend that $x_1,x_2,\\ldots,x_n$ are a representative sample of the two team's crossword performance.\n",
    "- Let $\\mu$ be the long-run average difference in crossword times.\n",
    "- The sample mean is $\\hat{\\mu}_n$ is $-7.3$ seconds. Is this evidence that Clikey are better than Andel?\n",
    "\n",
    "## Confidence interval\n",
    "\n",
    "- The sample mean is $\\hat{\\mu}_n=-7.3$ seconds, the sample size is $n=33$ and the standard deviation of $x_1,\\ldots,x_n$ is $\\hat{\\sigma}_x= 75$ seconds. What is a 95% confidence interval for $\\mu$?\n",
    "- **Answer:** first calculate the standard deviation of $\\hat{\\mu}_n$ which is $\\frac{\\hat{\\sigma}_x}{\\sqrt{n}} = \\frac{75}{\\sqrt{33}}=13$\n",
    "- Then use the 68-95-99 rule:\n",
    "\n",
    "  $$\\hat{\\mu}_n \\pm 2\\times\\frac{\\hat{\\sigma}_x}{\\sqrt{n}} = -7.3 \\pm 2 \\times 13 = [-33.3, 19.3]$$\n",
    "\n",
    "- Based on the confidence interval both $\\mu <0$ (Clikey are better) and $\\mu >0$ (Andel are better) are plausible.\n",
    "\n",
    "\n",
    "\n",
    "# Standard deviation of $\\hat{\\mu}_n$\n",
    "\n",
    "## Standard deviation and standard error\n",
    "\n",
    "- Recall that if $\\hat{\\sigma}_x$ is the standard deviation of $x_1,x_2,\\ldots,x_n$, then the standard deviation of $\\hat{\\mu}_n$ is $\\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$\n",
    "\n",
    "- It is easy to confuse $\\hat{\\sigma}_x$ and $\\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$. \n",
    "  - $\\hat{\\sigma}_x$ is the standard deviation of just a single measurement ($n=1$).\n",
    "  - $\\hat{\\sigma}_x/\\sqrt{n}$ is the standard deviation of the sample mean with a sample size of size $n$.\n",
    "- $\\hat{\\sigma}_x/\\sqrt{n}$ is sometimes called the *standard error*.\n",
    "\n",
    "## Standard deviation and sample size\n",
    "\n",
    "- The standard deviation of $\\hat{\\mu}_n$ is $\\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$.\n",
    "- The standard deviation of $\\hat{\\mu}_n$ is **proportional to** $\\frac{1}{\\sqrt{n}}$.\n",
    "- If you double the sample size, then the standard deviation of $\\hat{\\mu}_n$ will decrease by a factor of $\\sqrt{2}=1.41$ not by a factor of $2$.\n",
    "- If you want the standard deviation of $\\hat{\\mu}_n$ to decrease by a factor of $2$, then you need to increase the sample size by a factor of $2^2=4$.\n",
    "\n",
    "## Computing a required sample size\n",
    "\n",
    "- The formula $\\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$ can be used to compute the required sample size for a desired level of precision.\n",
    "- In the minis, suppose that Clikey are actually 5 seconds faster on average ($\\mu = -5$).\n",
    "- How large does $n$ need to be so that a 95% confidence interval centered at $-5$ will only include negative numbers? (Assume that $\\hat{\\sigma}_x = 75$)\n",
    "\n",
    "## Computing a required sample size\n",
    "\n",
    "- We need to solve for $n$ in the equation\n",
    "\n",
    "  $$2 \\frac{\\hat{\\sigma}_x}{\\sqrt{n}} = 5 $$\n",
    "\n",
    "- Rearranging and using $\\hat{\\sigma}_x=75$:\n",
    "\n",
    "  $$\\sqrt{n} = \\frac{2\\hat{\\sigma}_x}{5} = 30$$\n",
    "\n",
    "- And so $n = 30^2 = 900$ which is about 2 and a half years of mini crosswords.\n",
    "\n",
    "\n",
    "# Connection to proportions\n",
    "\n",
    "## Similarities between $\\hat{\\pi}_n$ and $\\hat{\\mu}_n$\n",
    "\n",
    "- The estimates $\\hat{\\pi}_n$ and $\\hat{\\mu}_n$ have a lot in common:\n",
    "  - Both are centered at the population parameters $\\pi$ and $\\mu$.\n",
    "  - The standard deviations of $\\hat{\\pi}_n$ and $\\hat{\\mu}_n$ both decrease with $n$.\n",
    "  - When $n$ is large, the distribution of $\\hat{\\pi}_n$ and $\\hat{\\mu}_n$ is close to the normal distribution.\n",
    "- These similarities are not just a coincidence.\n",
    "\n",
    "## Proportions are means\n",
    "\n",
    "- The sample proportion $\\hat{\\pi}_n$ is a special case of the sample mean $\\hat{\\mu}_n$.\n",
    "- Let $x_1,\\ldots,x_n$ be measurements where $x_i=1$ if the $i$ th person in the sample answered yes and $x_i=0$ if they answered no.\n",
    "- Then\n",
    "\n",
    "  $$\\hat{\\mu}_n = \\frac{x_1+x_2+\\cdots+x_n}{n}=\\frac{m}{n} = \\hat{\\pi}_n $$\n",
    "\n",
    "  where $m$ is the number of people in the sample who answered yes.\n",
    "\n",
    "## Proportions and means\n",
    "\n",
    "- Most of the results from Monday about $\\hat{\\pi}_n$ are a special case of the results for $\\hat{\\mu}_n$.\n",
    "- There are two things that are special about $\\hat{\\pi}_n$:\n",
    "  1. The formula $\\sqrt{\\frac{\\hat{\\pi}_n(1-\\hat{\\pi}_n)}{n}}$ for the standard deviation of $\\hat{\\pi}_n$ (for $\\hat{\\mu}_n$ the formula is $\\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$)\n",
    "  2. The rule of thumb that the normal approximation is reasonably accurate when $\\hat{\\pi}_n n \\ge 10$ and $(1-\\hat{\\pi}_n)n \\ge 10$.\n",
    "\n",
    "## How large should $n$ be?\n",
    "\n",
    "- There isn't a similar rule for the normal approximation for quantitative variables.\n",
    "- The accuracy of the normal approximation depends on both the size of $n$ and how asymmetric the distribution of $x_1,\\ldots,x_n$ is.\n",
    "- If the distribution is very asymmetric, then $n$ needs to be larger.\n",
    "- If the distribution is not too \"wild\", then $n=30$ should give good results.\n",
    "\n",
    "\n",
    "# Error bars\n",
    "\n",
    "## Confidence intervals and error bars\n",
    "\n",
    "Confidence intervals are often expressed visually with **error bars**, like in this figure from [Do defaults save lives?](https://www.science.org/doi/10.1126/science.1091721) \n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/organ-donation-results.gif\" alt=\"\" style=\"width:50%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "## Error bar: warning\n",
    "\n",
    "- There are different conventions about what is displayed in the error bars.\n",
    "- Sometimes it is a 95% confidence interval, other times it is the standard deviation of $\\hat{\\mu}_n$ and other times it is the standard deviation of $x_1,\\ldots,x_n$!\n",
    "- Any figure with error bars should say how they are calculated.\n",
    "\n",
    "## Uncertainty\n",
    "\n",
    "- Error bars and confidence intervals are meant to represent *uncertainty* in an estimate.\n",
    "- For example: 42% of people in Opt-in group consented to be donors, but the population proportion could be between 32% and 52%\n",
    "- In many cases, people focus on the estimate and ignore the error bars and uncertainty.\n",
    "- This has led people to develop alternatives.\n",
    "- How would design a visualization that emphasizes uncertainty?\n",
    "\n",
    "## Hypothetical outcomes plot\n",
    "\n",
    "One alternative is a moving image that shows different plausible estimates. More information [here](https://medium.com/hci-design-at-uw/hypothetical-outcomes-plots-experiencing-the-uncertain-b9ea60d7c740).\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/hypothetical-outcomes.gif\" alt=\"\" style=\"width:100%;\" ><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "# Confidence intervals conclusions\n",
    "\n",
    "\n",
    "## Population and sample\n",
    "\n",
    "- There is a variable $x$\n",
    "which we want to measure on observation units in a **population**.\n",
    "- Our goal is to estimate the population mean $\\mu$ which is a **parameter**.\n",
    "- We take independent $n$ **samples** from the population and record the variable on the sample.\n",
    "- This gives measurements $x_1,\\ldots,x_n$.\n",
    "- The sample mean $\\hat{\\mu}_n = \\frac{x_1+\\cdots+x_n}{n}$ is an **estimate** of $\\mu$.\n",
    "\n",
    "\n",
    "## Confidence intervals\n",
    "\n",
    "- A **confidence interval** for $\\mu$ is a collection of plausible values of $\\mu$.\n",
    "- The estimate $\\hat{\\mu}_n$ can be used to make confidence intervals of the form\n",
    "\n",
    "  $$ \\hat{\\mu}_n \\pm 2 \\frac{\\hat{\\sigma}_x}{\\sqrt{n}}$$\n",
    "\n",
    "  where $n$ is the sample size and $\\hat{\\sigma}_x$ is the standard deviation of $x_1,\\ldots,x_n$.\n",
    "\n",
    "- This produces a 95% confidence interval (2 standard deviations).\n",
    "- For 68% use 1 standard deviation and for 99% use 3 standard deviations.\n",
    "\n",
    "\n",
    "## Confidence intervals theory\n",
    "\n",
    "- The normal approximation means that \n",
    "\n",
    "  $$\\mathrm{Pr}\\left[\\hat{\\mu}_n - \\frac{2\\hat{\\sigma}_x}{\\sqrt{n}} \\le \\mu \\le \\hat{\\mu}_n + \\frac{2\\hat{\\sigma}_x}{\\sqrt{n}}\\right] \\approx 0.95$$\n",
    "\n",
    "- In general, you can make $1-\\alpha$ confidence interval for $\\mu$ like so\n",
    "\n",
    "  $$\\mathrm{Pr}\\left[\\hat{\\mu}_n - \\frac{z_\\alpha \\hat{\\sigma}_x}{\\sqrt{n}} \\le \\mu \\le \\hat{\\mu}_n + \\frac{z_\\alpha\\hat{\\sigma}_x}{\\sqrt{n}}\\right] \\approx 1-\\alpha$$\n",
    "\n",
    "- The constant $z_\\alpha$ is something you could look up online or with a calculator.\n",
    "\n",
    "## Confidence interval interpretation\n",
    "\n",
    "- The probability in this equation has a specific meaning:\n",
    "\n",
    "  $$\\mathrm{Pr}\\left[\\hat{\\mu}_n - \\frac{2\\hat{\\sigma}_x}{\\sqrt{n}} \\le \\mu \\le \\hat{\\mu}_n + \\frac{2\\hat{\\sigma}_x}{\\sqrt{n}}\\right] \\approx 0.95$$\n",
    "\n",
    "- It means that, across different studies around 95% of confidence intervals that use two standard deviations will contain the population parameter.\n",
    "- It does not mean that there is a 95% chance that the confidence interval contains $\\mu$ in a *particular study*.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "<!-- \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def simulate_gamma_xbar_hists(alpha=2.0, beta=150.0, ns=(1, 5, 10, 20, 40, 100),\n",
    "                             num_reps=100_000, out_dir=\"../figures\", seed=1234):\n",
    "    \"\"\"\n",
    "    Gamma parameterization: shape=alpha, scale=beta so E[X]=alpha*beta.\n",
    "    Plots histograms of X_bar for each n, using a common x-range across plots.\n",
    "    \"\"\"\n",
    "    rng = np.random.default_rng(seed)\n",
    "    os.makedirs(out_dir, exist_ok=True)\n",
    "\n",
    "    # Precompute xbars for all n so we can enforce a common x-range\n",
    "    xbars = {}\n",
    "    for n in ns:\n",
    "        x = rng.gamma(shape=alpha, scale=beta, size=(num_reps, n))\n",
    "        xbars[n] = x.mean(axis=1)\n",
    "\n",
    "    # Common x-range (use global min/max across all n)\n",
    "    xmin = min(v.min() for v in xbars.values())\n",
    "    xmax = 1000\n",
    "\n",
    "    # # Fixed bin edges for all plots\n",
    "    # bins = np.linspace(xmin, xmax, 60)\n",
    "\n",
    "    for n in ns:\n",
    "        if n == 1:\n",
    "          plt.figure(figsize=(6, 4))\n",
    "          plt.hist(xbars[n], bins=20, edgecolor=\"black\", alpha=0.7)\n",
    "          plt.title(f\"Concentration of microplastics\")\n",
    "          plt.xlabel(\"Nanograms per litre\")\n",
    "          plt.tight_layout()\n",
    "          plt.axvline(x=alpha*beta, linestyle=\"--\", color=\"black\")\n",
    "\n",
    "          out_path = os.path.join(out_dir, f\"microplastic_dist.png\")\n",
    "          plt.savefig(out_path, dpi=200)\n",
    "          plt.show()\n",
    "          plt.close()\n",
    "        plt.figure(figsize=(6, 4))\n",
    "        plt.hist(xbars[n], bins=20, edgecolor=\"black\", alpha=0.7)\n",
    "        plt.title(f\"Sample size {n}\")\n",
    "        plt.xlim(xmin, xmax)\n",
    "        plt.tight_layout()\n",
    "        plt.axvline(x=alpha*beta, linestyle=\"--\", color=\"black\")\n",
    "\n",
    "        out_path = os.path.join(out_dir, f\"microplastic_sample_{n}.png\")\n",
    "        plt.savefig(out_path, dpi=200)\n",
    "        plt.show()\n",
    "        plt.close()\n",
    "\n",
    "\n",
    "simulate_gamma_xbar_hists()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " -->"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Phython (JB)",
   "language": "python",
   "name": "jb-python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}