{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lecture 27: Generative models for text\n",
    "\n",
    "STATS 60 / STATS 160 / PSYCH 10\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "<div style=\"flex: 2;\" >\n",
    "\n",
    "**Course evaluations:** <a href = http://course-evaluations.stanford.edu>http://course-evaluations.stanford.edu</a>\n",
    "\n",
    "\n",
    "**Concepts and Learning Goals:**\n",
    "\n",
    "\n",
    "\n",
    "- Next word prediction\n",
    "- Markov text generators\n",
    "- Similarities/differences to LLMs\n",
    "\n",
    "\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "# Background\n",
    "\n",
    "## Text generation\n",
    "\n",
    "- Previously, we have focused on using machine learning for **prediction** (regression and classification).\n",
    "- Today, we will discuss **text generation**. How do chatbots and large language models (LLMs) like ChatGPT produce written text?\n",
    "\n",
    "\n",
    "\n",
    "## Alan Turing\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Alan_Turing_%281951%29.jpg/500px-Alan_Turing_%281951%29.jpg\" alt=\"\" style=\"width:60%;\"><figcaption>Alan Turing. Image from Wikipedia.</figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1.5;\" >\n",
    "\n",
    "Alan Turing, in 1950, was trying to define AI:\n",
    "\n",
    "- Defining what *thinking/intelligence* means or what *machine* means is dicey.\n",
    "\n",
    "- Proposed a test now called <a href = \"https://courses.cs.umbc.edu/471/papers/turing.pdf\"> *the Turing test*</a>.\n",
    "\n",
    "- If a machine can converse with a person and pass for human, it is effectively intelligent.\n",
    "\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "## The Turing test\n",
    "\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "- Alan Turing called his test *the imitation game.*\n",
    "- A human (C) asks questions to another human (B) and a computer (A).\n",
    "- Can (C) correctly identify which is which?\n",
    "\n",
    "\n",
    "</div>\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/turing_test.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "<!-- . . .\n",
    "\n",
    "![From Turing's original paper.](https://tselilschramm.org/introstats/figures/imitation-game.png)\n",
    "\n",
    "Then he says, substitute \"machine\" for \"woman\". I guess at the time, this was rhetorically necessary, because people found talking to machines so strange. -->\n",
    "\n",
    "## Initial attempts\n",
    "\n",
    "- Most early chatbots were based on hand-built models of communication, and were not very good.\n",
    "\n",
    "- But some people found them surprisingly compelling.\n",
    "\n",
    "- Let's test drive the 1960's [ELIZA](https://www.masswerk.at/elizabot/) therapy chatbot.\n",
    "\n",
    "\n",
    "# Text generation\n",
    "\n",
    "## Generation from prediction\n",
    "\n",
    "- Modern chatbots and LLMs use **next word prediction**.\n",
    "- **Next word prediction:** given some text $x$, predict $y$, the word most likely to come next.\n",
    "- Our goal is now to create a model $f$ for next word prediction:\n",
    "\n",
    "    $$\n",
    "    x \\to f(x) \\to y \\quad \n",
    "    $$\n",
    "\n",
    "- $x$ is a string of words and $f(x)$ and $y$ are single words.\n",
    "\n",
    "## Generating more than one word\n",
    "\n",
    "Strategy for generating text:\n",
    "\n",
    "1. Start with a string of words, $x = x_1, x_2,\\ldots,x_m$\n",
    "2. Predict the next word, $f(x) = \\hat{y}$\n",
    "3. Repeat: set $x_{m+1} = \\hat{y}$ to create a new string of words, $x' = x_1,x_2,\\ldots,x_m,\\hat{y}$\n",
    "4. We predict the next word, $f(x') = \\hat{y}'$\n",
    "5. Repeat: create a new string of text, $x'' = x_1,x_2,\\ldots,x_{m},\\hat{y}, \\hat{y}'$\n",
    "6. Etc...\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Models for next word prediction\n",
    "\n",
    "\n",
    "1. **Question:** do you think linear regression would be a good model for next word prediction? Why or why not?\n",
    "\n",
    "2. **Question:** do you think nearest neighbors would be a good model for next word prediction? Why or why not?\n",
    "\n",
    "3. **Question:** any other ideas?\n",
    "\n",
    "# Markov text generator\n",
    "\n",
    "## Dependencies in text\n",
    "\n",
    "**Idea:** treat language as a random sampling process. \n",
    "\n",
    "- Speech is a random sequence of words $x_1,x_2,x_3,\\ldots$\n",
    "\n",
    "- The words are **not** independent: $x_i$ depends on the words that came before, $x_{i-1},x_{i-2},\\ldots$\n",
    "\n",
    "\n",
    "A **Markov text generator** is a simple next word prediction model that incorporates dependencies between words.\n",
    "\n",
    "## Markov text generator - model\n",
    "\n",
    "**The model:** for each pair of words $a,b$ in the dictionary, learn the probability that $b$ follows $a$:\n",
    "   \n",
    "$$\n",
    "    \\Pr[ x_{i+1} = b \\mid x_i = a]\n",
    "$$\n",
    "\n",
    "- If $a = \\text{hula}$, then the next word is very likely to be \"hoop\":\n",
    "    \n",
    "    $$\n",
    "        \\Pr[x_{i+1} = \\text{ hoop}\\mid x_{i} = \\text{ hula}] \\approx 1\n",
    "    $$\n",
    "\n",
    "- Assume that the location of the words within the text doesn't matter. All that matters is the chance that the word $b$ follows the word $a$.\n",
    "\n",
    "## Markov text generator - training\n",
    "\n",
    "- The training data is a corpus of text $T$.\n",
    "- For each pair of words $a,b$ in $T$:\n",
    "    - Count how many times $a$ appears in $T$; let this be $n_a$\n",
    "    - Count how many times $b$ appears directly after $a$ in $T$; let this be $n_{a,b}$\n",
    "    - Estimate the probability that word $b$ comes after word $a$ with $\\hat{P}[x_{i+1}=b \\mid x_i = a] = \\frac{n_{a,b}}{n_{a}}$\n",
    "\n",
    "## Markov text generator - generation\n",
    "\n",
    "- Given an input string $x=x_1,x_2,\\ldots,x_m$ sample $\\hat{y}$ from the distribution $\\hat{P}$ learned during training.\n",
    "- Eventually, we will reach a word $a$ that was only seen at the end of the text.\n",
    "- When this happens either:\n",
    "    - You can stop generating.\n",
    "    - Or pick a new random word.\n",
    "\n",
    "## Training summary:\n",
    "\n",
    "\n",
    "\n",
    "*Reminder:* for each pair of words $a,b$:\n",
    "\n",
    "\n",
    "- Count how many times $a$ appears; let this be $n_a$\n",
    "- Count how many times $b$ appears directly after $a$; let this be $n_{a,b}$\n",
    "- Estimate the probability that word $b$ comes after word $a$ with \n",
    "$\\frac{n_{a,b}}{n_{a}}$\n",
    "- These probabilities can be put in a table where the rows correspond to $a$ (the current word) and the columns correspond to $b$ (the following word).\n",
    "\n",
    "\n",
    "## Markov text generator - example:\n",
    "\n",
    "**Practice:** train a Markov Text Generator on the text:\n",
    "\n",
    "> \"I came, I saw, I conquered\"\n",
    "\n",
    "The row corresponds to $a$, the column corresponds to $b$.\n",
    "\n",
    "| | I | came | saw | conquered |\n",
    "|:---:|:---:|:---:|:---:|:---:|\n",
    "| **I** | | | | |\n",
    "| **came** | | | | |\n",
    "| **saw** | | | | |\n",
    "| **conquered** | | | | |\n",
    "\n",
    "\n",
    "## Markov text generator - example:\n",
    "\n",
    "\n",
    "\n",
    "**Practice:** train a Markov Text Generator on the text:\n",
    "\n",
    "> \"I came, I saw, I conquered\"\n",
    "\n",
    "The row corresponds to $a$, the column corresponds to $b$.\n",
    "\n",
    "\n",
    "| | I | came | saw | conquered |\n",
    "|:---:|:---:|:---:|:---:|:---:|\n",
    "| **I** | 0 | $\\frac{1}{3}$ | $\\frac{1}{3}$ | $\\frac{1}{3}$ |\n",
    "| **came** | 1 | 0 | 0 | 0 |\n",
    "| **saw** | 1 | 0 | 0 | 0 |\n",
    "| **conquered** | 0 | 0 | 0 | 0 |\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Visualizing the Markov text generator\n",
    "\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/chain.png\" alt=\"\" style=\"width:70%;\"><figcaption></figcaption>A Markov text generator is like a \"random walk.\"</figure>\n",
    "\n",
    "\n",
    "## Generating text\n",
    "\n",
    "I generated 4 random strings of text using our Markov text generator, on input word \"I\":\n",
    "\n",
    "1. I saw I saw I conquered\n",
    "\n",
    "2. I saw I saw I saw I saw I saw I saw I conquered\n",
    "\n",
    "3. I conquered\n",
    "\n",
    "4. I saw I came I conquered\n",
    "\n",
    "\n",
    "# Other training sets\n",
    "\n",
    "## The Gettysburg address: training data\n",
    "\n",
    "Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.\n",
    "\n",
    "Now we are engaged in a great civil war, testing ...\n",
    "\n",
    "\n",
    "\n",
    "## Gettysburg address: output\n",
    "\n",
    "\n",
    "> Whether that all men are met on this nation, conceived and proper that we should do this. But, in a great civil war, testing whether that we can not consecrate -- that field, as a portion of devotion to the living, rather, to the people, by the living, rather, to dedicate -- and that we say here, have thus far above our fathers brought forth on a new nation, conceived in Liberty, and proper that this continent, a new birth of freedom -- and dead, who struggled here, have a great task remaining before us -- that war. \n",
    "\n",
    "\n",
    "\n",
    "## Wikipedia -- Statistics: training data\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/stats-wiki.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "## Wikipedia -- Statistics: output\n",
    "\n",
    "\n",
    "> Consider now available. Examples of a positive feedback runaway effect of mean and instrumental variables, and covers descriptive statistic was introduced the least some degree of questions to specifying an important that is the use of type of data, or population: central or not belong to being true (statistical study typically uses a \"false negative\"). Multiple problems are yet-to-be-observed random vector of the set is also heavily criticized today for future of modern use the discipline of these errors in society.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "## Reddit -- ELI5: training data\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/eli5.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "\n",
    "\n",
    "## Reddit -- ELI5: output\n",
    "\n",
    "\n",
    "> My body got better at targeting the throat completely independent of mucus off it is constantly passing through it? bacteria infects the soft tissues from inhaling virus or bacteria causes pain allergies can be very painful sore outside of post-nasal drip; this can people get them making them susceptible to deal with a bit? Almost like my colds always started with the immune system has cracked the combination meds for a very irritating and water is constantly passing through it? bacteria causes inflammation I noticed that sore throat the throat, this causes inflammation I don't understand.\n",
    "\n",
    "\n",
    "## Using the last $k$ words\n",
    "\n",
    "\n",
    "To make the model more realistic, we can pick the next word based on the previous $k$ words:\n",
    "\n",
    "- For each sequence of $k+1$ words $a_1,\\ldots,a_k,a_{k+1}$, we want to learn the probability that $a_{k+1}$ is the next word after $a_1$, $a_2$,...,$a_k$.\n",
    "- In symbols, we want\n",
    " \n",
    "    $$\n",
    "        \\Pr[X_{k+1} = a_{k+1} \\mid X_1 = a_1,\\ldots,X_k = a_k]\n",
    "    $$\n",
    "\n",
    "\n",
    "- The number $k$ is called the **context size**.\n",
    "\n",
    "\n",
    "\n",
    "## Effect of context size{.smaller}\n",
    "\n",
    "As the context size grows, the generated text makes more sense.\n",
    "\n",
    "\n",
    "1. $k = 1$: It is for those who fought here highly resolve that all men are engaged in a portion of that government of that field, as a great civil war, testing whether that nation, conceived in Liberty, and that government of the proposition that nation, or detract.\n",
    "\n",
    "\n",
    "\n",
    "2. $k = 2$: It can never forget what they did here. It is for us to be dedicated here to the great task remaining before us -- that this nation, under God, shall have a new nation, conceived in Liberty, and dedicated to the unfinished work which they gave the last full measure of devotion.\n",
    "\n",
    "3. $k = 3$: It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. \n",
    "\n",
    "\n",
    "## Context size\n",
    "\n",
    "**Question:** Do you see a downside to making the context size bigger? \n",
    "\n",
    "1. Training and generation both take more time as $k$ increases.\n",
    "2. It is less random, more like copying sentences from the corpus.\n",
    "\n",
    "## Memorizing?\n",
    "\n",
    "Sometimes people say of a language model \"it just memorized the training data.\"\n",
    "\n",
    "\n",
    "What they mean is that there is a copy of this output somewhere in $T$.\n",
    "\n",
    "\n",
    "When $k=3$, our Markov chain effectively memorized from Gettysburg.  This sentence was generated and is so in the address:\n",
    "\n",
    "> \"Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.\"\n",
    "\n",
    "## Influence of training data\n",
    "\n",
    "Notice that output is extremely sensitive to the training data. With context length 3:\n",
    "\n",
    "\n",
    "> Wikipedia -- statistics: Populations can be diverse groups of people or objects such as \"all people living in a country\" or \"every atom composing a crystal\".\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "> Reddit -- ELI5: As far as why these affect the back of the throat, it is the primary entry point for what we call upper respiratory infections, usually from inhaling virus or bacteria particles.\n",
    "\n",
    "\n",
    "\n",
    "## Modern large language models (LLMs)\n",
    "\n",
    "- Modern language models share similarities with Markov text generators.\n",
    "\n",
    "- They try to estimate \n",
    "    \n",
    "    $$\n",
    "        \\Pr[X_{k+1} = a_{k+1} \\mid X_1 = a_1,\\ldots,X_k = a_k]\n",
    "    $$\n",
    "\n",
    "    - They also have a context size $k$\n",
    "    - Instead of words, they use \"tokens\" or fragments of words\n",
    "\n",
    "## Modern LLMs: training\n",
    "\n",
    "The probability \n",
    "\n",
    "$$\n",
    "    \\Pr[X_{k+1} = a_{k+1} \\mid X_1 = a_1,\\ldots,X_k = a_k]\n",
    "$$\n",
    "\n",
    "is estimated by training on a corpus of example text $T$\n",
    "\n",
    "- The probability is *not* estimated by counting\n",
    "- This probability is estimated with a **deep neural network**\n",
    "\n",
    "## Deep neural network\n",
    "\n",
    "- A deep neural network is a much more complicated version of linear regression.\n",
    "- There is no exact mathematical formula for finding the best-fit neural network.\n",
    "- Instead, the model is trained using algorithms that incrementally improve the fit.\n",
    "\n",
    "## Modern LLMs: data\n",
    "\n",
    "Like the Markov text generator, the choice of training data dramatically influences how the model performs.\n",
    "\n",
    "\n",
    "<div class=\"layout\" style=\"display: flex; justify-content: space-around;\">\n",
    "\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/eli5.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "\n",
    "\n",
    "\n",
    "<div style=\"flex: 1;\" >\n",
    "\n",
    "<figure style=\"text-align:center;\"><img src=\"../figures/stats-wiki.png\" alt=\"\" style=\"width:100%;\"><figcaption></figcaption></figure>\n",
    "\n",
    "</div>\n",
    "</div>\n",
    "\n",
    "## Recap\n",
    "\n",
    "- Similarities between Markov text generators and LLMs:\n",
    "    - Generate data by next word prediction.\n",
    "    - Use a context size.\n",
    "    - Training data dramatically influences output.\n",
    "- Differences between Markov text generators and LLMs:\n",
    "    - LLMs use deep neural networks.\n",
    "    - No mathematical formulas for training the best deep neural network."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Phython (JB)",
   "language": "python",
   "name": "jb-python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}