{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bigram Character Model\n",
    "## Attribution\n",
    "Note that this bigram notebook is based *very heavily* on \n",
    "Andrej Karpathy's \"makemore\" aka NN-Zero-to-Hero code and videos. All\n",
    "credit goes to him.\n",
    "\n",
    "You can find his repo for the birgram notebook here:\n",
    "https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part1_bigrams.ipynb\n",
    "as well as the makemore repo here:\n",
    "https://github.com/karpathy/makemore\n",
    "in particular we use his names file found here:\n",
    "https://raw.githubusercontent.com/karpathy/makemore/master/names.txt\n",
    "\n",
    "His video is extremely excellent and can be found here:\n",
    "https://www.youtube.com/watch?v=PaCmpygFfXo\n",
    "\n",
    "Refer to his LICENSE file in this folder.\n",
    "\n",
    "## Starting with a purely statistical model\n",
    "- Set some display properties\n",
    "- Load in the names and look at some properties of them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import display, HTML\n",
    "display(HTML(\"<style>.container { width:100% !important; }</style>\"))\n",
    "display(HTML(\"<style>.output_result { max-width:100% !important; }</style>\"))\n",
    "display(HTML(\"<style>div.output_area pre {white-space: pre;}</style>\"))\n",
    "\n",
    "import torch\n",
    "torch.set_printoptions(linewidth=230)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "words = open('names.txt', 'r').read().splitlines()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "words[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "min(len(w) for w in words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max(len(w) for w in words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create bigrams\n",
    "Now let's find all bigrams for each name.\n",
    "- Find all pairs of characters (bigrams) for each name\n",
    "- We add in the special character '.' which we use to indicate the start and stop of a word\n",
    "- Look at some examples and stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "b = {}\n",
    "for w in words:\n",
    "  chs = ['.'] + list(w) + ['.']\n",
    "  for ch1, ch2 in zip(chs, chs[1:]):\n",
    "    bigram = (ch1, ch2)\n",
    "    b[bigram] = b.get(bigram, 0) + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(b.items())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted(b.items(), key = lambda kv: -kv[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introducing: Tensors\n",
    "Tensors are multidimensional arrays. PyTorch implements tensors and we can use them for the purpose of speed, efficiency, and convenience.\n",
    "- How can we create and work with them?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "play = torch.zeros((3, 8), dtype=torch.int32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "play"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "play[1,6] = 99\n",
    "play"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = torch.zeros((27, 27), dtype=torch.int32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tensors store numbers, not letters. We need to create a mapping between the letters of the alphabet and indices (numbers)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chars = sorted(list(set(''.join(words))))\n",
    "stoi = {s:i+1 for i,s in enumerate(chars)}\n",
    "stoi['.'] = 0\n",
    "itos = {i:s for s,i in stoi.items()}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stoi['r']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "itos[19]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Statistics\n",
    "If we had the exact probability of a bigram occuring, we could use this to create the best model possible. Actually we do have that! Just count each occurence and divide by the total number. Let's start with counts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for w in words:\n",
    "  chs = ['.'] + list(w) + ['.']\n",
    "  for ch1, ch2 in zip(chs, chs[1:]):\n",
    "    ix1 = stoi[ch1]\n",
    "    ix2 = stoi[ch2]\n",
    "    N[ix1, ix2] += 1  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "plt.figure(figsize=(16,16))\n",
    "plt.imshow(N, cmap='Blues')\n",
    "for i in range(27):\n",
    "    for j in range(27):\n",
    "        chstr = itos[i] + itos[j]\n",
    "        plt.text(j, i, chstr, ha=\"center\", va=\"bottom\", color='gray')\n",
    "        plt.text(j, i, N[i, j].item(), ha=\"center\", va=\"top\", color='gray')\n",
    "plt.axis('off');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to be able to sample a probability distribution, not counts! Let's look at the first row of our counts table and create probabilities from it. Then we'll see how we can sample this distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = N[0].float()\n",
    "p = p / p.sum()\n",
    "p"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This should be 1 i.e. all probabilites added should be 100%\n",
    "p.sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's make a new synthetic distribution with fewer outcomes and see how we can use it\n",
    "g = torch.Generator().manual_seed(2147483647)\n",
    "p2 = torch.rand(3, generator=g)\n",
    "p2 = p2 / p.sum()\n",
    "p2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.multinomial(p2, num_samples=100, replacement=True, generator=g)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sampling the next letter \n",
    "Let's apply our computed probability to sample the first letter of a name. How would we sample the next letter?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We are working with just the first row of our table: the start of a name\n",
    "p = N[0].float()\n",
    "p = p / p.sum()\n",
    "\n",
    "# Now use this probability\n",
    "g = torch.Generator().manual_seed(2147483647)\n",
    "ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()\n",
    "itos[ix]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Putting it all together\n",
    "Believe it or not we are almost done. We just need to calculate probabilities for all rows, not just the first. Then we sample over and over until we hit an end character."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P = (N+1).float()\n",
    "P /= P.sum(1, keepdims=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P[0].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Activity\n",
    "\n",
    "Generate 5 names using the computed bigram probability distribution.\n",
    "- Hint: What do all names start with?\n",
    "- Hint: Given a character, how do you predict the next character?\n",
    "- Hint: What does every name end with? What should happen after a name ends?\n",
    "\n",
    "Don't be afraid to copy and paste code that does what you need from our work above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g = torch.Generator().manual_seed(2147483647)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Answer\n",
    "To see the answer, run the cell below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "P2 = torch.ones((27, 27), dtype=torch.float32)\n",
    "P2 /= P2.sum(1, keepdims=True)\n",
    "P2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load answer1.txt\n",
    "g = torch.Generator().manual_seed(2147483647)\n",
    "\n",
    "P2 = torch.ones((27, 27), dtype=torch.float32)\n",
    "P2 /= P2.sum(1, keepdims=True)\n",
    "\n",
    "for i in range(5):\n",
    "  \n",
    "  out = []\n",
    "  ix = 0\n",
    "  while True:\n",
    "    p = P[ix]\n",
    "    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()\n",
    "    out.append(itos[ix])\n",
    "    if ix == 0:\n",
    "      break\n",
    "  print(''.join(out))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "* How well did we do?\n",
    "* How can we quantitatively measure the quality of our model?\n",
    "* Is this the best a bigram model can be?\n",
    "* How could we improve if we moved away from a bigram model?\n",
    "* How might our current approach face difficulties if we applied it to other models?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GOAL: maximize likelihood of the data w.r.t. model parameters (statistical modeling)\n",
    "* equivalent to maximizing the log likelihood (because log is monotonic)\n",
    "* equivalent to minimizing the negative log likelihood\n",
    "* equivalent to minimizing the average negative log likelihood\n",
    "\n",
    "log(a*b*c) = log(a) + log(b) + log(c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "log_likelihood = 0.0\n",
    "n = 0\n",
    "\n",
    "for w in words:\n",
    "#for w in [\"josh\"]:\n",
    "  chs = ['.'] + list(w) + ['.']\n",
    "  for ch1, ch2 in zip(chs, chs[1:]):\n",
    "    ix1 = stoi[ch1]\n",
    "    ix2 = stoi[ch2]\n",
    "    prob = P[ix1, ix2]\n",
    "    logprob = torch.log(prob)\n",
    "    log_likelihood += logprob\n",
    "    n += 1\n",
    "    #print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')\n",
    "\n",
    "print(f'{log_likelihood=}')\n",
    "nll = -log_likelihood\n",
    "print(f'{nll=}')\n",
    "print(f'{nll/n}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have derived a \"loss\" function. How large can this loss be? How small can this loss be?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "torch.log(torch.tensor(1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create the training set of bigrams (x,y)\n",
    "xs, ys = [], []\n",
    "\n",
    "for w in words[:1]:\n",
    "  chs = ['.'] + list(w) + ['.']\n",
    "  for ch1, ch2 in zip(chs, chs[1:]):\n",
    "    ix1 = stoi[ch1]\n",
    "    ix2 = stoi[ch2]\n",
    "    print(ch1, ch2)\n",
    "    xs.append(ix1)\n",
    "    ys.append(ix2)\n",
    "    \n",
    "xs = torch.tensor(xs)\n",
    "ys = torch.tensor(ys)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch.nn.functional as F\n",
    "xenc = F.one_hot(xs, num_classes=27).float()\n",
    "xenc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xenc.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.imshow(xenc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xenc.dtype"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "W = torch.randn((27, 1))\n",
    "xenc @ W"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "logits = xenc @ W # log-counts\n",
    "counts = logits.exp() # equivalent N\n",
    "probs = counts / counts.sum(1, keepdims=True)\n",
    "probs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# randomly initialize 27 neurons' weights. each neuron receives 27 inputs\n",
    "g = torch.Generator().manual_seed(2147483647)\n",
    "W = torch.randn((27, 27), generator=g)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding\n",
    "logits = xenc @ W # predict log-counts\n",
    "counts = logits.exp() # counts, equivalent to N\n",
    "probs = counts / counts.sum(1, keepdims=True) # probabilities for next character\n",
    "# btw: the last 2 lines here are together called a 'softmax'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "probs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "probs.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nlls = torch.zeros(5)\n",
    "for i in range(5):\n",
    "  # i-th bigram:\n",
    "  x = xs[i].item() # input character index\n",
    "  y = ys[i].item() # label character index\n",
    "  print('--------')\n",
    "  print(f'bigram example {i+1}: {itos[x]}{itos[y]} (indexes {x},{y})')\n",
    "  print('input to the neural net:', x)\n",
    "  print('output probabilities from the neural net:', probs[i])\n",
    "  print('label (actual next character):', y)\n",
    "  p = probs[i, y]\n",
    "  print('probability assigned by the net to the the correct character:', p.item())\n",
    "  logp = torch.log(p)\n",
    "  print('log likelihood:', logp.item())\n",
    "  nll = -logp\n",
    "  print('negative log likelihood:', nll.item())\n",
    "  nlls[i] = nll\n",
    "\n",
    "print('=========')\n",
    "print('average negative log likelihood, i.e. loss =', nlls.mean().item())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# randomly initialize 27 neurons' weights. each neuron receives 27 inputs\n",
    "g = torch.Generator().manual_seed(2147483647)\n",
    "W = torch.randn((27, 27), generator=g, requires_grad=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# forward pass\n",
    "xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding\n",
    "logits = xenc @ W # predict log-counts\n",
    "counts = logits.exp() # counts, equivalent to N\n",
    "probs = counts / counts.sum(1, keepdims=True) # probabilities for next character\n",
    "loss = -probs[torch.arange(5), ys].log().mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(loss.item())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# backward pass\n",
    "W.grad = None # set to zero the gradient\n",
    "loss.backward()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "W.data += -0.1 * W.grad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create the dataset\n",
    "xs, ys = [], []\n",
    "for w in words:\n",
    "  chs = ['.'] + list(w) + ['.']\n",
    "  for ch1, ch2 in zip(chs, chs[1:]):\n",
    "    ix1 = stoi[ch1]\n",
    "    ix2 = stoi[ch2]\n",
    "    xs.append(ix1)\n",
    "    ys.append(ix2)\n",
    "xs = torch.tensor(xs)\n",
    "ys = torch.tensor(ys)\n",
    "num = xs.nelement()\n",
    "print('number of examples: ', num)\n",
    "\n",
    "# initialize the 'network'\n",
    "g = torch.Generator().manual_seed(2147483647)\n",
    "W = torch.randn((27, 27), generator=g, requires_grad=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# gradient descent\n",
    "for k in range(100): # YOU PROBABLY WANT TO RUN THIS FOR MORE THAN 1 EPOCH\n",
    "  \n",
    "  # forward pass\n",
    "  xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding\n",
    "  logits = xenc @ W # predict log-counts\n",
    "  counts = logits.exp() # counts, equivalent to N\n",
    "  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character\n",
    "  loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()\n",
    "  print(loss.item())\n",
    "  \n",
    "  # backward pass\n",
    "  W.grad = None # set to zero the gradient\n",
    "  loss.backward()\n",
    "  \n",
    "  # update\n",
    "  W.data += -50 * W.grad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# finally, sample from the 'neural net' model\n",
    "g = torch.Generator().manual_seed(2147483647)\n",
    "\n",
    "for i in range(5):\n",
    "  \n",
    "  out = []\n",
    "  ix = 0\n",
    "  while True:\n",
    "    \n",
    "    # ----------\n",
    "    # BEFORE:\n",
    "    #p = P[ix]\n",
    "    # ----------\n",
    "    # NOW:\n",
    "    xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()\n",
    "    logits = xenc @ W # predict log-counts\n",
    "    counts = logits.exp() # counts, equivalent to N\n",
    "    p = counts / counts.sum(1, keepdims=True) # probabilities for next character\n",
    "    # ----------\n",
    "    \n",
    "    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()\n",
    "    out.append(itos[ix])\n",
    "    if ix == 0:\n",
    "      break\n",
    "  print(''.join(out))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}