{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lecture 7: Variability \n",
    "\n",
    "STATS 60 / STATS 160 / PSYCH 10\n",
    "\n",
    "**Concepts and Learning Goals:**\n",
    "\n",
    "- Variability of distributions\n",
    "- Intuition for variability from histograms\n",
    "- Common measures of variability:\n",
    "    - Variance and Standard Deviation\n",
    "    - Quantiles\n",
    "<div style=\"display: flex; justify-content: \"right\"; flex-direction: column; align-items: \"right\";\">\n",
    "  <div>\n",
    "    <p style=\"font-size: smaller; text-align: \"right\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "**Announcements:**\n",
    "\n",
    "- You will have 1 week for quiz regrade requests.\n",
    "- The 4:30-5:20PM discussion section only has 6 students.\n",
    "\n",
    "\n",
    "## Variability\n",
    "\n",
    "Last week, you learned about them **mean** and **median.**  \n",
    "\n",
    "Both measure where the **center** of a distribution (the data) is, for different notions of centering.\n",
    "\n",
    "\n",
    "Many times, we don't just care where the *center* of the distribution is; we also want to know about the **variability** of the data.\n",
    "\n",
    "- Are most of the samples close to the center (mean/median), or not?\n",
    "- What is the \"typical range\" the data falls into?\n",
    "\n",
    "## Why care about variability?\n",
    "\n",
    "**Question:** think of examples of scenarios where you care not only where the data is centered, but also what the variability is.\n",
    "\n",
    "- Medicine: you know the average life expectancy, given a diagnosis. But what are the best/worst case scenarios?\n",
    "\n",
    "- Exams: you know the class average, and you know your score. But how do you really compare to the rest of the class?\n",
    "\n",
    "- Investments: you are trying to decide if you should invest in a stock. You know the historical average annual rate of return. But is it possible that there will be a big loss?\n",
    "\n",
    "## Example 1: daily temperatures in different cities\n",
    "\n",
    "Recounting the example from lecture 5, below are the overlayed histograms of daily average temperatures in two cities in 2024-2025.\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/seattle-chicago-2.png\" style=\"width:\"500\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "The means of the two cities are very close, but the distributions are very different.\n",
    "\n",
    "Qualitatively, the temperature in Chicago exhibits greater *variability.*\n",
    "\n",
    "\n",
    "\n",
    "## Example 2: stock prices\n",
    "\n",
    "These histograms show the daily closing prices of Visa (VISA) and Tesla (TSLA) stock for the last 5 years.\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/tsla-v-visa.png\" style=\"width:\"500\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "The means are very close, but qualitatively, the TSLA price exhibits greater *variability.*\n",
    "\n",
    "\n",
    "## How should we measure variability?\n",
    "\n",
    "We saw two examples of distributions with similar means, but different levels of variability.\n",
    "\n",
    "**Question:** how could we measure variability? Suggest a quantitative measure.\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/seattle-chicago-2.png\" style=\"width:\"450\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "<img></img>\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/tsla-v-visa.png\" style=\"width:\"450\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "<img></img>\n",
    "\n",
    "\n",
    "## Variance \n",
    "\n",
    "A common quantitative summary of variability is the **variance.**\n",
    "\n",
    "\n",
    "If our datapoints are $x_1,\\ldots,x_n$, and their mean is $\\bar{x} = \\frac{x_1+x_2 + \\cdots + x_n}{n}$, \n",
    "\n",
    "the **variance** is the average squared distance to the mean:\n",
    "\n",
    "$$\n",
    "  \\overline{\\sigma}^2 = \\text{variance} = \\frac{(x_1-\\bar x)^2 + (x_2 - \\bar x)^2 + \\dots + (x_n - \\bar x)^2}{n}.\n",
    "$$\n",
    "\n",
    "## Practice with the variance\n",
    "\n",
    "The **variance** is the average squared distance to the mean:\n",
    "\n",
    "$$\n",
    "  \\overline{\\sigma}^2 = \\text{variance} = \\frac{(x_1-\\bar x)^2 + (x_2 - \\bar x)^2 + \\dots + (x_n - \\bar x)^2}{n}\n",
    "$$\n",
    "\n",
    "**Question:** Calculate the variance of the rowers' heights. What are the units?\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/rowing_df\" style=\"width:\"300\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/usa_rowing\" style=\"width:\"300\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "$\\bar{x}= 70.55\\mathrm{in}$ ; $\\bar{\\sigma}^2 = 14.47 \\mathrm{in}^2$.\n",
    "\n",
    "## Standard Deviation\n",
    "\n",
    "The **standard deviation** is the square root of the variance:\n",
    "\n",
    "$$\n",
    "\\bar \\sigma = \\text{standard deviation} = \\sqrt{\\bar \\sigma^2}.\n",
    "$$\n",
    "\n",
    "\n",
    "If the data has the units $u$, then the variance has the units $u^2$. \n",
    "\n",
    "<font color=\"teal\">The units of the variance are *incompatible* with the units of the data.</font>\n",
    "\n",
    "For this reason, if you want a measure of variability that you can compare to the mean, you should **use the standard deviation** rather than the variance.\n",
    "\n",
    "\n",
    "**Question:** Calculate the standard deviation of the rowers' heights.\n",
    "\n",
    "\n",
    "$\\sigma = 3.80\\mathrm{in}$.\n",
    "\n",
    "## Variability and risk\n",
    "\n",
    "Suppose someone offers you a choice between:\n",
    "\n",
    "1. A gift of \\$100\n",
    "\n",
    "2. The chance to flip a fair coin for \\$300.\n",
    "\n",
    "What would you choose, and why?\n",
    "\n",
    "\n",
    "We can think of the outcomes in each scenario as datapoints in two different distribution:\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/scenarios.png\" style=\"width:\"400\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "- Scenario 1 is a distribution containing exactly one datapoint: \\$100.\n",
    "- Scenario 2 is a distribution with two datapoints: \\$0 (tails), \\$300 (heads)\n",
    "\n",
    "\n",
    "\n",
    "**Question:** calculate the mean and standard deviation of your earnings in each scenario.\n",
    "\n",
    "\n",
    "| Scenario | Mean | Standard Deviation |\n",
    "|:---:|:---:|:---:|\n",
    "| 1 | \\$100 | \\$ 0 |\n",
    "| 2 | \\$150 | \\$ 150 |\n",
    "\n",
    "\n",
    "## Example 1: daily temperature\n",
    "\n",
    "Mean and Standard Deviation in temperature in 2024-2025:\n",
    "\n",
    "| City | Mean Temperature | Standard Deviation | \n",
    "| :---: | :---: | :---:|\n",
    "| Seattle | $51.7^{\\circ} F$ | $10.3^{\\circ} F$ |\n",
    "| Chicago |  $54.3^{\\circ} F$ |$19.0^{\\circ} F$ |\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/seattle-chicag-std.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "<img></img>\n",
    "\n",
    "The standard deviation of temperature in Chicago is about twice as much as that of Seattle.\n",
    "\n",
    "\n",
    "## Example 2: stock prices\n",
    "\n",
    "Mean and Standard Deviation in closing value for the last 5 years:\n",
    "\n",
    "| Stock | Mean Value | Standard Deviation | \n",
    "| :---: | :---: | :---:|\n",
    "| TSLA | \\$274.60 | \\$83.24 |\n",
    "| V | \\$258.92 | \\$53.54 |\n",
    "\n",
    "The standard deviation of Tesla stock is about 30\\% of its mean value.\n",
    "\n",
    "The standard deviation of Visa stock is about 20\\% of its mean value.\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/tsla-v-visa-std.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "<img></img>\n",
    "\n",
    "\n",
    "<font color=\"gray\">Aside: the ratio of the standard deviation to the mean only makes sense as a measurement of variability for non-negative data.</font>\n",
    "\n",
    "## Standard deviation \\& outliers\n",
    "\n",
    "Sometimes, the standard deviation can be large because of one outlier.\n",
    "\n",
    "\n",
    "**Example:** The following dataset gives section attendance for each of the 5 sections of STATS60 this week:\n",
    "\n",
    "| TA | Attendance |\n",
    "|:---:|:---:|\n",
    "| Cole | 20 |\n",
    "| Junyi | 6 |\n",
    "| Leda | 27 |\n",
    "| Skyler | 25 |\n",
    "| Valerie | 21 |\n",
    "\n",
    "The mean is 19.8, the standard deviation is 7.3.\n",
    "\n",
    "\n",
    "If we remove the outlier of Junyi's section:\n",
    "\n",
    "the mean is 23.3, the standard deviation is only 2.9.\n",
    "\n",
    "\n",
    "## Discussion\n",
    "\n",
    "**Question:** Do you think the standard deviation is a satisfying measure of variability? What is it conveying? What is it not conveying?\n",
    "\n",
    "\n",
    "- The standard deviation can be large because of the influence of outliers. It can be a \"pessimistic\" notion of variability.\n",
    "\n",
    "## A guarantee for the standard deviation\n",
    "\n",
    "Most samples are within a few standard deviations of the mean!\n",
    "\n",
    "\n",
    "The following fact is called **Chebyshev's inequality:** \n",
    "\n",
    "For any $t > 0$, at most a $1/t^2$ fraction of datapoints are more than $t$ standard deviations away from the mean.\n",
    "\n",
    "\n",
    "For example, this implies that 75\\% of the datapoints are no more than $2$ standard deviations away from the mean (Chebyshev's inequality with the choice $t = 2$).\n",
    "\n",
    "\n",
    "You'd see how to prove and use this fact in an intro probability course, like STATS 117/118.\n",
    "\n",
    "## Quantiles\n",
    "\n",
    "**Quantiles** tell us the fraction of the data that falls in each range. \n",
    "They give us a more complete picture of variability.\n",
    "\n",
    "The **$k$-quantiles** of a distribution are the $k-1$ numbers which partition the histogram into $k$ equal-sized parts: \n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/seattle-chicago-deciles.png\" style=\"width:\"800\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "Depicted here are the $10$-quantiles, also known as **deciles.**\n",
    "\n",
    "\n",
    "\n",
    "Other commonly used quantiles are the **quartiles** ($4$-quantiles), and **percentiles** ($100$-quantiles).\n",
    "\n",
    "\n",
    "## Using quantiles to measure variability\n",
    "\n",
    "**Question**: How can we use quantiles to measure variability?\n",
    "\n",
    "- We can measure distance between quantiles (the \"width\" of quantiles), or between quantiles and the mean.\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/seattle-chicago-deciles.png\" style=\"width:\"400\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "For example, the distance from the $10$th percentile to $90$th percentile:\n",
    "\n",
    "| City | Mean Temp | Std. Dev |  10th Percentile | 90th percentile | 10-90 percentile window |\n",
    "| :---: | :---: | :---:| :---:| :---:|:---:|\n",
    "| Seattle | $51.7^{\\circ F}$ | $10.3^{\\circ F}$ | $39^{\\circ F}$ | $66.5^{\\circ F}$ | $27.5^{\\circ F}$ |\n",
    "| Chicago |  $54.3^{\\circ F}$ |$19.0^{\\circ F}$ | $28.0^{\\circ F}$ | $77.0^{\\circ F}$ | $49^{\\circ F}$|\n",
    "\n",
    "## Another way to think about quantiles\n",
    "\n",
    "**Question:** How does the information we get from the standard deviation differ from the information we get from the quantiles?\n",
    "\n",
    "\n",
    "- The quantiles give us a better sense of the shape of the distribution.\n",
    "\n",
    "- They also exactly tell us what percent of datapoints fall in a range.\n",
    "\n",
    "\n",
    "**For example:** 80\\% of data points in the histogram fall between the 10th and 90th percentile.\n",
    "\n",
    "**Question:** Why?\n",
    "\n",
    "\n",
    "| City |  10th | 90th | Window Size |\n",
    "| :---: | :---: | :---: | :---: |\n",
    "| Seattle | $39^{\\circ F}$ | $66.5^{\\circ F}$ | $27.5^{\\circ F}$ |\n",
    "| Chicago | $28.0^{\\circ F}$ | $77.0^{\\circ F}$ | $49^{\\circ F}$ |\n",
    "\n",
    "In each city, you can reasonably expect that 80\\% of the time, the temperature will be in the 10-90th percentile window.\n",
    "\n",
    "This also gives us a sense of the variability.\n",
    "\n",
    "## Recap\n",
    "\n",
    "- Concept of variability\n",
    "- Common measures of variability:\n",
    "    - Variance and Standard Deviation\n",
    "    - Quantiles"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Phython (JB)",
   "language": "python",
   "name": "jb-python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
