{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lecture 5: Summaries of Center\n",
    "\n",
    "STATS 60 / STATS 160 / PSYCH 10\n",
    "\n",
    "**Announcements**\n",
    "\n",
    "- Section 05: 4:30 - 5:20pm Hewlett 101 (new location).\n",
    "- <a href = \"https://web.stanford.edu/class/stats60/discussion/02-discussion.html\">The slides for discussion 2</a> are online.\n",
    "\n",
    "\n",
    "\n",
    "## Recap\n",
    "\n",
    "**Lecture 4**\n",
    "\n",
    "- Data definitions (observational unit, variables, distribution).\n",
    "- Different types of visualizations (pie charts, bar charts, and line charts).\n",
    "\n",
    "\n",
    "**Today**\n",
    "\n",
    "- Data visualizations for quantitative data (histograms and scatter plots).\n",
    "- Data visualizations that use maps (dot maps and choropleths).\n",
    "- Summaries of center (mean, median and mode).\n",
    "\n",
    "    \n",
    "\n",
    "# Quantitative Variables\n",
    "\n",
    "## Quantitative Data\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; align-items: center; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/old_faithful.jpg\" alt=\"\" style=\"width:70%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "<div style=\"flex: 1;\">\n",
    "\n",
    "Let's take another look at the Old Faithful dataset.\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/old_faithful_df.png\" alt=\"\" style=\"width:50%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "\n",
    "## Dot plots for quantitative data\n",
    "\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/waiting_dotplot.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "How could we improve this visualization?\n",
    "\n",
    "\n",
    "## Histograms\n",
    "\n",
    "A **histogram** is a more appropriate visualization for a quantitative variable.\n",
    "First, values are sorted into _bins_, and the number of values in each bin is\n",
    "plotted as a bar.\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/waiting_histo.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "\n",
    "## A Histogram is Not a Bar Chart!\n",
    "\n",
    "How is a histogram different from a bar chart?\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; align-items: center; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 1; text-align: center;\">\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/organ-donation-results.gif\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/waiting_histo2.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "</div>\n",
    "\n",
    "## Histograms vs bar charts\n",
    "\n",
    "- Histograms are for quantitative variables and bar charts are for categorical variables.\n",
    "- The y-axis of a bar chart can be something other than a count. \n",
    "- For a histogram, the y-axis is always a count or a frequency.\n",
    "\n",
    "<!-- ## A Histogram is Not a Bar Chart!\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/not_a_histogram.png\" alt=\"Not a histogram\" style=\"width:100%;\"><figcaption>A \"histogram\" from  Metr (https://arxiv.org/abs/2503.14499v3).</figcaption></figure> -->\n",
    "\n",
    "## Relationships between Variables\n",
    "\n",
    "We can also make a histogram of the `eruption` time of each eruption.\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; align-items: center; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 1; text-align: center;\">\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/waiting_histo2.png\" alt=\"\" style=\"width:70%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/times_histo.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "But how do we understand the relationship between two quantitative variables?\n",
    "\n",
    "\n",
    "## Scatter plots\n",
    "\n",
    "In a **scatter plot**, each observation is represented by a point $(x, y)$. \n",
    "The $x$-coordinate represents the value of one variable, while the \n",
    "$y$-coordinate represents the value of the other.\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/old_faithful_scatter.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "# Maps\n",
    "\n",
    "## Dot maps\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/coffee-dotmap.png\" alt=\"Coffee Place Geography in 2014 from [Flowing Data](https://flowingdata.com/2014/03/18/coffee-place-geography/)\" style=\"width:70%;\"><figcaption>Coffee Place Geography in 2014 from <a href = \"https://flowingdata.com/2014/03/18/coffee-place-geography/\">Flowing Data</a>.</figcaption></figure>\n",
    "\n",
    "## John Snow's Cholera map\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/cholera-dotmap.jpg\" alt=\"John Snow's cholera map from Edward Tufte's *The Visual Display of Quantitative Information*\" style=\"width:50%;\"><figcaption>John Snow's map of an 1854 cholera outbreak.</figcaption></figure>\n",
    "\n",
    "## A problem with dot maps\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/libraries-dotmap.png\" alt=\"\" style=\"width:100%;\"><figcaption>Dot map of public libraries</figcaption></figure>\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/gun-violence-homicides.png\" alt=\"\" style=\"width:100%;\"><figcaption>Dot map of gun homicides in 2015 <a href = \"https://www.theguardian.com/us-news/ng-interactive/2017/jan/09/special-report-fixing-gun-violence-in-america\">source</a></figcaption></figure>\n",
    "\n",
    "\n",
    "\n",
    "- Be careful of dot maps that are really just population maps!\n",
    "\n",
    "## Choropleths\n",
    "\n",
    "\n",
    "- A *choropleth* is a map where regions are colored according to the values of a variable. \n",
    "- \"choro\" + \"pleth\" means \"region\" + \"many\".\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/sf-chloropleth.png\" alt=\"Choropleths are often used for election results.\" style=\"width:40%;\"><figcaption>Choropleths are often used for election results (<a href = \"https://www.geoapify.com/what-is-a-choropleth-map-definition-examples-how-to-create/\">source</a>).</figcaption></figure>\n",
    "\n",
    "## Choropleths\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/population-density.png\" alt=\"A choropleth showing population density from Our World in Data.\" style=\"width:70%;\"><figcaption>A choropleth showing population density from <a href = \"https://ourworldindata.org/grapher/population-density?time=2026\">Our World in Data</a>.</figcaption></figure>\n",
    "\n",
    "## Problems with choropleths\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; align-items: center; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/trump-chloropleth.png\" alt=\"\" style=\"width:70%;\"><figcaption></figcaption></figure>\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "\n",
    "\u201cOh I love those beautiful red areas, that middle of the map. There\u2019s just a little blue here, and a little blue, everything else is bright red.\u201d\n",
    "\u2014Donald Trump\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "\n",
    "What is misleading about this choropleth showing the 2016 election results?\n",
    "\n",
    "## Alternatives\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; align-items: center; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/2016-election-dotmap.png\" alt=\"A dot map to show population density.\" style=\"width:100%;\"><figcaption>A dot map to show population density (<a href = \"https://www.andybeger.com/blog/2018-05-11-us-2016-dot-density/\">source</a>).</figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/2016-election-cartogram.png\" alt=\"A cartogram to show number of electoral votes.\" style=\"width:100%;\"><figcaption>A cartogram to show number of electoral votes from <a href = \"https://en.wikipedia.org/wiki/2016_United_States_presidential_election#Maps\">Wikipedia</a>.</figcaption></figure>\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "## Visualization summary\n",
    "\n",
    "- When making a visualization, think about the **number of variables** and the **type of variable** (quantitative or categorical).\n",
    "- For a **single** variable:\n",
    "    - **Categorical**: bar chart or pie chart.\n",
    "    - **Quantitative**: histogram.\n",
    "- For **multiple** variables:\n",
    "    - **Two categorical**: stacked bar chart.\n",
    "    - **Two quantitative**: scatter plot.\n",
    "    - **One quantitative, one categorical**: side-by-side histograms.\n",
    "- For a variable that changes **over time**: line chart.\n",
    "- For a variable that changes **over locations**: dot map or choropleth (maps).\n",
    "\n",
    "\n",
    "# Summaries of center\n",
    "\n",
    "## USA Women's Eight Rowing\n",
    "\n",
    "Shown below are stats for the members of the USA Women's Eight rowing team \n",
    "that competed at the 2024 Paris Olympics.\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; align-items: center; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/rowing_df.png\" alt=\"\" style=\"width:100%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/usa_rowing.png\" alt=\"\" style=\"width:100%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "## USA Women's Eight Rowing\n",
    "\n",
    "\n",
    "- In the dataset of rower weights:\n",
    "\n",
    "    a. What are the observational units?\n",
    "    b. Is weight a quantitative or categorical variable? What visualization could we use to represent the variable weight?\n",
    "\n",
    "\n",
    "- The rowers are the observational units.\n",
    "- Weight is a quantitative variable. We could use a histogram.\n",
    "\n",
    "## Histogram of weights\n",
    "\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/rowing_hist.png\" alt=\"\" style=\"width:55%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "\n",
    "But what if we wanted to summarize the data by a single number?\n",
    "\n",
    "\n",
    "## Mean\n",
    "\n",
    "One common summary of a quantitative variable is the **mean** \n",
    "(or **average**, although this is less precise).\n",
    "\n",
    "\n",
    "To calculate the mean, add up the numbers and divide \n",
    "by how many there are:\n",
    "$$\n",
    "  \\bar x = \\text{mean} = \\frac{x_1 + x_2 + \\dots + x_n}{n}.\n",
    "$$\n",
    "\n",
    "- $n$ is the number of observational units.\n",
    "- $x_1,x_2,\\ldots,x_n$ are the values of the variable, one for each observational unit (values may repeat).\n",
    "- $\\bar x$ is a common shorthand for the mean of the values $x_1,\\ldots,x_n$.\n",
    "\n",
    "## Mean of the rowers\n",
    "\n",
    "Calculate the mean weight of the rowers.\n",
    "\n",
    "\n",
    "$$\n",
    "  \\text{mean} = \\frac{170 + 180 + 115 + 170 + 175 + 170 + 180 + 180 + 160}{9} \\approx 166.7.\n",
    "$$\n",
    "\n",
    "\n",
    "Surprisingly, the mean was not used to summarize data until about 1720! See [this article](https://web.stanford.edu/class/datasci112/readings/mean.pdf) for more history.\n",
    "\n",
    "\n",
    "## Interpreting the Mean\n",
    "\n",
    "The mean $\\bar x \\approx 166.7$ measures the \"center\" of the distribution.\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/rowing_hist_mean.png\" alt=\"\" style=\"width:55%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "\n",
    "It is where the histogram would \"balance\" if we put it on a scale.\n",
    "\n",
    "\n",
    "## Median\n",
    "\n",
    "The mean is not the only way to summarize the center of a distribution.\n",
    "Another summary is the **median**, the middle value when the data is sorted \n",
    "in order.\n",
    "\n",
    "\n",
    "Calculate the median weight of the rowers.\n",
    "\n",
    "$$\n",
    "  115, 160, 170, 170, 170, 175, 180, 180, 180\n",
    "$$\n",
    "\n",
    "\n",
    "$$\n",
    "  115, 160, 170, 170, \\underbrace{170}_{\\text{median}}, 175, 180, 180, 180\n",
    "$$\n",
    "\n",
    "## The median \n",
    "\n",
    "When $n$ is even, there are two middle numbers. The median is the mean of the\n",
    "two middle numbers.\n",
    "\n",
    "\n",
    "\n",
    "Calculate the median weight of the $n=8$ rowers, excluding the coxswain.\n",
    "\n",
    "$$\n",
    "  160, 170, 170, \\underbrace{170, 175}_{\\text{median} = 172.5}, 180, 180, 180\n",
    "$$\n",
    "\n",
    "\n",
    "## Interpreting the Median\n",
    "\n",
    "The median $170$ is another summary of the \"center\" of the distribution.\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/rowing_hist_median.png\" alt=\"\" style=\"width:55%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "\n",
    "It is the value where half the data is below and half the data is above.\n",
    "\n",
    "## Mode\n",
    "\n",
    "The **mode** is another way of measuring the center of a distribution. The mode is the value that appears most often.\n",
    "\n",
    "The weights of the rowers have two modes:\n",
    "\n",
    "\n",
    "$$\n",
    "  115, 160, \\underbrace{170, 170, 170}_{\\text{mode 1}}, 175, \\underbrace{180, 180, 180}_{\\text{mode 2}}\n",
    "$$\n",
    "\n",
    "\n",
    "This is an example of a _bimodal_ distribution.\n",
    "\n",
    "## The mode\n",
    "\n",
    "Modes are peaks in the histogram.\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/rowing_hist_modes.png\" alt=\"\" style=\"width:55%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "\n",
    "## Mean vs. Median vs. Mode\n",
    "\n",
    "We have now seen three different summaries of center:\n",
    "\n",
    "- $\\displaystyle \\text{mean} = \\frac{170 + 180 + 115 + 170 + 175 + 170 + 180 + 180 + 160}{9}\\approx 166.7$\n",
    "- $\\displaystyle 115, 160, 170, 170, \\underbrace{170}_{\\text{median}}, 175, 180, 180, 180$\n",
    "- $\\displaystyle 115, 160, \\underbrace{170, 170, 170}_{\\text{mode 1}}, 175, \\underbrace{180, 180, 180}_{\\text{mode 2}}$\n",
    "\n",
    "\n",
    "What would happen to the mean, median and mode if the coxswain weighed only 90 pounds? \n",
    "What if the coxswain weighed 140 pounds?\n",
    "\n",
    "\n",
    "_Answer:_ The mean would change, but the median and mode would not.\n",
    "\n",
    "\n",
    "**Moral:** The mean is sensitive to outliers (in either direction), but the \n",
    "median and mode are not. Statisticians say that the median is more \"robust\" than the\n",
    "mean.\n",
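    "\n",
    "As a rough check of this answer (a worked sketch using only the nine weights listed above, which total 1500 pounds), replacing the coxswain's 115 pounds by 90 or by 140 gives:\n",
    "\n",
    "$$\n",
    "  \\frac{1500 - 115 + 90}{9} = \\frac{1475}{9} \\approx 163.9, \\qquad \\frac{1500 - 115 + 140}{9} = \\frac{1525}{9} \\approx 169.4.\n",
    "$$\n",
    "\n",
    "In both cases the middle value of the sorted list is still $170$, and $170$ and $180$ still each appear three times, so the median and the modes stay the same.\n",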
    "\n",
    "## Sensitivity of the mean\n",
    "\n",
    "- Recall the general formula for the mean:\n",
    "\n",
    "$$ \n",
    "\\bar x = \\text{mean} = \\frac{x_1 + x_2 + \\dots + x_n}{n}.\n",
    "$$\n",
    "\n",
    "- Changing a single data point will change the mean by a proportional amount. \n",
    "\n",
    "\n",
    "- On the other hand, there is a limit to how much any single data point can change the median or the mode.\n",
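    "\n",
    "To make the first point concrete, here is a short derivation (the symbol $\\delta$ is notation assumed here for illustration). Suppose one value $x_i$ is replaced by $x_i + \\delta$. The new mean is\n",
    "\n",
    "$$\n",
    "\\frac{x_1 + \\dots + (x_i + \\delta) + \\dots + x_n}{n} = \\bar x + \\frac{\\delta}{n}.\n",
    "$$\n",
    "\n",
    "So the mean shifts by $\\delta / n$, and this shift can be made as large as we like by making $\\delta$ large. With $n = 9$ rowers, changing one weight by $\\delta$ pounds moves the mean by $\\delta / 9$ pounds.\n",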
    "\n",
    "\n",
    "\n",
    "## Exercise\n",
    "\n",
    "Shown below is a histogram of the arrival delays from the flights data.\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/arr_delay_hist.png\" alt=\"\" style=\"width:50%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "\n",
    "How do you think the mean and median of the arrival delays compare?\n",
    "\n",
    "\n",
    "- The mean will be bigger than the median.\n",
    "- The mean is around 7 minutes and the median is -5 minutes.\n",
    "\n",
    "\n",
    "\n",
    "## The Center Doesn't Tell the Whole Story\n",
    "\n",
    "Many people think that the mean/median represent the \"typical\" value, \n",
    "but this is not always the case.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; align-items: center; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "\n",
    "\n",
    "Consider the Old Faithful eruption times.\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/eruptions_hist.png\" alt=\"\" style=\"width:100%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\">\n",
    "\n",
    "\n",
    "- The mean eruption time is about $3.5$ minutes. \n",
    "\n",
    "- If we only reported this number, we would miss the fact that most \n",
    "eruptions are either much shorter or much longer! \n",
    "\n",
    "- This is another _bimodal_ distribution.\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "\n",
    "## Variability\n",
    "\n",
    "Shown below are histograms of daily average temperatures in two cities.\n",
    "\n",
    "Chicago\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/chicago_hist.png\" alt=\"\" style=\"width:100%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "\n",
    "Seattle\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/seattle_hist.png\" alt=\"\" style=\"width:100%; display:block; margin:0 auto;\"><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "\n",
    "The means of the two cities are about the same ($53.25^\\circ\\text{F}$ for Chicago \n",
    "vs. $53.07^\\circ\\text{F}$ for Seattle), but the distributions are very \n",
    "different.\n",
    "\n",
    "\n",
    "## Recap\n",
    "\n",
    "- The mean, the median and the mode are three summaries of center.\n",
    "- The mean is sensitive to outliers.\n",
    "- However, summaries of center don't paint the full picture.\n",
    "\n",
    "\n",
    "## Looking ahead\n",
    "\n",
    "**Tomorrow**\n",
    "\n",
    "- Review and solution to practice quiz 1.\n",
    "- Vibe coding to make visualizations.\n",
    "\n",
    "**Friday**\n",
    "\n",
    "- Misleading data visualizations.\n",
    "- Quiz on data visualizations and summaries of center.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (JB)",
   "language": "python",
   "name": "jb-python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
