{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lecture 26: Nearest Neighbors\n",
    "\n",
    "STATS 60 / STATS 160 / PSYCH 10\n",
    "\n",
    "\n",
    "**Concepts and Learning Goals:**\n",
    "\n",
    "- The $k$-nearest neighbors model\n",
    "- Classification\n",
    "- Effects of selection bias in training data\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"right\"; flex-direction: column; align-items: \"right\";\">\n",
    "  <div>\n",
    "    <p style=\"font-size: smaller; text-align: \"right\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "## Return to admissions\n",
    "\n",
    "Imagine you work in the admissions office.\n",
    "\n",
    "You want to use applicants' SAT percentiles $x_{SAT}$ to predict $y$, their freshman year GPA if they were to be admitted.\n",
    "\n",
    "\n",
    "Last time, we trained a <font color=\"teal\">*linear regression model*</font> $f$ to generate a prediction $\\hat{y}$,\n",
    "\n",
    "$$\n",
    "\\hat{y} = f(x_{SAT}) = m \\cdot x_{SAT} + b.\n",
    "$$\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/sat-gpa-lin.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "Today, we'll see a different model: the <font color=\"teal\">*$k$-Nearest-Neighbors model*</font>.\n",
    "\n",
    "## Compatible?\n",
    "\n",
    "Your high school friend has just moved to the bay area. Your roommate wants you to set them up on a date.\n",
    "\n",
    "**Question:** How do you decide if they will be compatible?\n",
    "\n",
    "- One common strategy is comparison: \n",
    "    - How similar is your roommate to your friends' past relationships? \n",
    "    - Were those relationships good or bad?\n",
    "\n",
    "\n",
    "## $k$-Nearest-Neighbors\n",
    "\n",
    "Assume we have access to a set of examples of features $x$ and labels $y$, \n",
    "\n",
    "$$(x_1,y_1),\\ldots,(x_n,y_n).$$\n",
    "\n",
    "\n",
    "We are given a new datapoint $x$, and we want to generate a prediction $\\hat y$ for $y$.\n",
    "\n",
    "\n",
    "**The $k$-Nearest Neighbors Model:**\n",
    "\n",
    "1.  Find the $k$ examples $x_{i_1},x_{i_2},\\ldots,x_{i_k}$ most similar to $x$.\n",
    "\n",
    "    - In case of a tie in similarity, increase $k$ to include all the tied points.\n",
    "\n",
    "2. Choose our prediction $\\hat{y}$ to be the <font color=\"teal\">*average*</font> of $y_{i_1},\\ldots,y_{i_k}$.\n",
    "\n",
    "## Nearest Neighbor Prediction\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/practice-nn.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "**Question:** what would your $1$-nearest-neighbor prediction for the point $x_{new}$ be?\n",
    "\n",
    "**Question:** what would your $2$-nearest-neighbor prediction for the point $x_{new}$ be?\n",
    "\n",
    "## $k$-Nearest-Neighbors for first-year GPA\n",
    "\n",
    "Below are the prediction curves for the first-year GPA produced by $k$-Nearest-Neighbors for $k = 1,5,10,20,30$.\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/1nn-admissions.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/5nn-admissions.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/10nn-admissions.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/20nn-admissions.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "## What happens as we vary $k$?\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/nn-k-comparison.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "**Question:** what do you notice about the $k$-nearest neighbor model prediction as $k$ increases?\n",
    "\n",
    "\n",
    "**Question:** do you see a disadvantage to taking $k$ too small?\n",
    "\n",
    "\n",
    "**Question:** do you see a disadvantage to taking $k$ too large?\n",
    "\n",
    "\n",
    "\n",
    "## Evaluating the model\n",
    "\n",
    "To evaluate the $k$-nearest-neighbors model for regression, we'll use root mean squared error (just like for linear regression).\n",
    "\n",
    "\n",
    "**Question:** Assume that we only have one example data point with features $x_i$.\n",
    "\n",
    "If $k = 1$, what will $\\hat y = f(x_i)$ be? \n",
    "\n",
    "- Within the set of examples, $x_i$ will be its own nearest neighbor, and the $\\hat y = y_i$. \n",
    "- So when $x_i$ are all distinct, the model has zero error on our example data.\n",
    "- When $x_i$ are not distinct, $\\hat y$ is still possibly very influenced by $y_i$.\n",
    "\n",
    "\n",
    "Because of this, the root mean squared error on our training examples is not a good measurement of accuracy.\n",
    "\n",
    "## Training data vs. Testing data\n",
    "\n",
    "To avoid cheating on our error measurement, in advance, we randomly split our example data into two parts: <font color=\"teal\">*training*</font> data and <font color=\"maroon\">*testing*</font> data.\n",
    "\n",
    "\n",
    "1. We train/define the model using the <font color=\"teal\">training</font> data.\n",
    "2. We evaluate model performance on the <font color=\"maroon\">testing</font> data.\n",
    "\n",
    "\n",
    "It is actually a good idea to do this training/testing split even when we are evaluating a linear model.\n",
    "\n",
    "But it is not as essential in the case of linear regression when the number of examples is big and when the influence of outliers is small.\n",
    "\n",
    "## Error for $k$-nearest neighbors\n",
    "\n",
    "For our $k$-nearest neighbor model for first-year GPA, the root mean squared error on the test data was:\n",
    "\n",
    "| $k$ | RMSE |\n",
    "|:---:|:---:|\n",
    "|1|0.94|\n",
    "|5|0.74|\n",
    "|10|0.70|\n",
    "|20|0.68|\n",
    "|200|0.69|\n",
    "|400|0.71|\n",
    "\n",
    "\n",
    "The value for $k=20$ compares favorably with RMSE $0.66$ for the best-fit linear model.\n",
    "\n",
    "## $k$-nearest neighbors and nonlinear data\n",
    "\n",
    "$k$-Nearest Neighbors can capture nonlinear relationships between $x,y$.\n",
    "\n",
    "\n",
    "Consider the synthetic quadratically associated data from Lecture 25:\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/nn-parabola.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "## Linear regression vs. $k$-nearest neigbhors\n",
    "\n",
    "**Question:** brainstorm as many advantages and disadvantages as you can for using $k$-nearest neighbors vs. linear regression.\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/nn-parabola.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "1. Nonlinear associations: nearest neighbors is better\n",
    "2. Interpretability: nearest neighbors arguably easier to understand\n",
    "3. Effect of selection bias and outliers: it depends\n",
    "4. Model selection: linear regression has a formula. Requires us to make fewer choices (no need to choose $k$)\n",
    "\n",
    "## Classification\n",
    "\n",
    "In the examples we saw so far, the quantity $y$ was a number.\n",
    "When you want to predict a number, that's called a <font color=\"maroon\">*regression*</font> problem.\n",
    "\n",
    "\n",
    "In <font color=\"teal\">*classification*</font> problems, $y$ takes a yes/no value, or a categorical value.\n",
    "\n",
    "**Examples**:\n",
    "\n",
    "- Admissions: you want to predict whether or not a student passes freshman year.\n",
    "- Medicine: $x$ is test results for a patient, and $y$ is whether or not they have a disease.\n",
    "- Weather: $x$ is temperature/pressure/humidity data, $y$ is whether or not it is going to rain.\n",
    "\n",
    "## $k$-Nearest Neighbors for classification\n",
    "\n",
    "Assume we have access to a set of examples of features $x$ and labels $y$ (taking yes/no values), \n",
    "\n",
    "$$(x_1,y_1),\\ldots,(x_n,y_n).$$\n",
    "\n",
    "\n",
    "We are given a new datapoint $x$, and we want to generate a yes/no prediction $\\hat y$ for $y$.\n",
    "\n",
    "\n",
    "**$k$-Nearest Neighbors:**\n",
    "\n",
    "1.  Find the $k$ examples $x_{i_1},x_{i_2},\\ldots,x_{i_k}$ most similar to $x$\n",
    "\n",
    "2. Choose our prediction $\\hat{y}$ to be the <font color=\"teal\">*majority*</font> of $y_{i_1},\\ldots,y_{i_k}$.\n",
    "\n",
    "## Example 1: Cirrhosis\n",
    "\n",
    "Cirrhosis is a chronic liver disease.\n",
    "This data comes from a study at the [Mayo Clinic](https://www.kaggle.com/datasets/joebeachcapital/cirrhosis-patient-survival-prediction) which ran from 1974 to 1984.\n",
    "\n",
    "Below we have plotted patient measurements of two liver proteins, Albumin and Prothrombin, color coded according to whether they survived until the end of the study.\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/cirrhosis-features.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "## Decision Boundary\n",
    "\n",
    "We train a $k$-NN model to predict whether a patient with features $(x_{A},x_P)$ will survive.\n",
    "\n",
    "\n",
    "\n",
    "**k=1**\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/1-nn-cirrhosis.png\" style=\"width:\"350\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "Evaluation on test set:\n",
    "\n",
    "- $56\\%$ of deaths classified as deaths\n",
    "\n",
    "- $57\\%$ of survivals classified as survivals\n",
    "\n",
    "\n",
    "We can plot the <font color=\"teal\">*decision boundary*</font>: our model predicts that a patient with measurements $(x_A,x_P)$ in the white region will survive.\n",
    "\n",
    "\n",
    "**k=10**\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/10nn-cirrhosis.png\" style=\"width:\"350\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "Evaluation on test set:\n",
    "\n",
    "- $63\\%$ of deaths classified as deaths\n",
    "\n",
    "- $89\\%$ of survivals classified as survivals\n",
    "\n",
    "\n",
    "\n",
    "**k=20**\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/20nn-cirrhosis.png\" style=\"width:\"350\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "Evaluation on test set:\n",
    "\n",
    "- $50\\%$ of deaths classified as deaths\n",
    "\n",
    "- $85\\%$ of survivals classified as survivals\n",
    "\n",
    "\n",
    "\n",
    "## Example 2: Breast Cancer\n",
    "\n",
    "The [following data](https://www.kaggle.com/datasets/erdemtaha/cancer-data) describes features of tumors, along with a diagnosis of *malignant* or *benign*.\n",
    "\n",
    "Below we have plotted measurements of two features of the tumor: concave points and texture.\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/features-cancer.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "##\n",
    "\n",
    "We train a $k$-NN classifier, and plot the decision boundary:\n",
    "\n",
    "\n",
    "**k=1**\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/1-nn-cancer.png\" style=\"width:\"350\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "Evaluation on test set:\n",
    "\n",
    "- $92\\%$ of benign classified as benign\n",
    "\n",
    "- $81\\%$ of malignant classified as malignant\n",
    "\n",
    "\n",
    "\n",
    "**k=5**\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/5-nn-cancer.png\" style=\"width:\"350\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "Evaluation on test set:\n",
    "\n",
    "- $94\\%$ of benign classified as benign\n",
    "- $93\\%$ of malignant classified as malignant\n",
    "\n",
    "\n",
    "\n",
    "**k = 20**\n",
    "\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/20-nn-cancer.png\" style=\"width:\"350\";\"/>\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "Evaluation on test set:\n",
    "\n",
    "- $94\\%$ of benign classified as benign\n",
    "- $94\\%$ of malignant classified as malignant\n",
    "\n",
    "\n",
    "## Regions with low coverage\n",
    "\n",
    "**Question:** Can we trust a diagnosis from the model on features $x$ if $x$ happens to fall in the bottom right of this plot? Why or why not?\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/20-nn-cancer.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "## Selection bias\n",
    "\n",
    "The following $20$-NN model was trained on a training set that only included tumors on the larger side.\n",
    "\n",
    "Only $86\\%$ of benign are classified as benign, compared to $94\\%$ with an unbiased training set.\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/20-nn-cancer-filtered.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "Here we have overlayed decision regions for the biased and non-biased sample.\n",
    "The biased training set causes more tumors to be classified as malignant.\n",
    "\n",
    "<div style=\"display: flex; justify-content: \"center\"; flex-direction: column; align-items: \"center\";\">\n",
    "  <div>\n",
    "    <img src=\"../figures/20-nn-cancer-both.png\" />\n",
    "    <p style=\"font-size: smaller; text-align: \"center\"; margin-top: 4px;\"></p>\n",
    "  </div>\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "## Recap\n",
    "\n",
    "- $k$-nearest neighbors model\n",
    "- Classification\n",
    "    - decision boundary\n",
    "- Effect of selection bias in training set"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Phython (JB)",
   "language": "python",
   "name": "jb-python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
