{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem Set 3\n",
"=======\n",
"\n",
"\n",
"### Instructions / Notes:\n",
"\n",
"**_Read these carefully_**\n",
"\n",
"* **Please read all the points of the \"Notes\" sections- they're important for this PS!!!**\n",
"* You **are not required to do any plotting in this PS- only in certain problems to provide the tuples that would generate a plot.** You can then optionally plot (in the notebook with matplotlib, in Excel, wherever works)\n",
"* You **may** create new IPython notebook cells to use for e.g. testing, debugging, exploring, etc.- this is encouraged in fact!- **just make sure that your final answer for each question is _in its own cell_ and _clearly indicated_**\n",
"* **See Piazza for submission instructions**\n",
"* _Have fun!_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem 1: Double Trouble\n",
"------------------------\n",
"**_[25 points total]_**\n",
"\n",
"In this problem we'll explore an optimization often referred to as **_double buffering_**, which we'll use to speed up the **external merge sort algorithm** we saw in _Lecture 12_.\n",
"\n",
"Although we haven't explicitly modeled it in many of our calculations so far, recall that _sequential IO_ (i.e. involving reading from / writing to consecutive pages) is generally much faster that _random access IO_ (any reading / writing that is not sequential). Additionally, on newer memory technologies like SSD reading data can be faster than writing data (if you want to read more about SSD access patterns look [here](http://codecapsule.com/2014/02/12/coding-for-ssds-part-5-access-patterns-and-system-optimizations/). \n",
"\n",
"In other words, for example, if we read 4 consecutive pages from file $A$, this should be much faster than reading 1 page from $A$, then 1 page from file $B$, then the next page from $A$.\n",
"\n",
"**In this problem, we will begin to model this, by assuming that 3/4 sequential _READS_ are \"free\", i.e. the total cost of $4$ sequential reads is $1$ IO. We will also assume that the writes are always twice as expensive as a read. Sequential writes are never free, therefore the cost of $N$ writes is always $2N$.**\n",
"\n",
"### Other important notes:\n",
"* **NO REPACKING:** Consider the external merge sort algorithm using the basic optimizations we present in section 1 of Lecture 12, but do not use the repacking optimization covered in Lecture 12.\n",
"* **ONE BUFFER PAGE RESERVED FOR OUTPUT:** Assume we use one page for output in a merge, e.g. a $B$-way merge would require $B+1$ buffer pages\n",
"* **REMEMBER TO ROUND:** Take ceilings (i.e. rounding up to nearest integer values) into account in this problem for full credit! Note that we have sometimes omitted these (for simplicity) in lecture.\n",
"* **Consider worst case cost:** In other words, if 2 reads _could happen_ to be sequential, but in general might not be, consider these random IO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a)\n",
"\n",
"**_[15 points]_**\n",
"\n",
"Consider a modification of the external merge sort algorithm where **reads are always read in 4-page chunks (i.e. 4 pages sequentially at a time)** so as to take advantage of sequential reads. Calculate the cost of performing the external merge sort for a setup having $B+1=20$ buffer pages and an unsorted input file with $160$ pages.\n",
"\n",
"Show the steps of your work and make sure to explain your reasoning by writing them as python comments above the final answers.\n",
"\n",
"#### Part (a.i)\n",
"\n",
"What is the **exact** IO cost of spliting and sorting the files? As is standard we want runs of size $B+1$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 160 pages / 20 buffer pages = 8 runs\n",
"# 8 runs * (20 / 4) = 40 IO reads\n",
"# (Alternatively, 160 pages / 4 page per read = 40 IO reads)\n",
"# 160 pages * 2 = 320 IO writes\n",
"# 320 + 40 = 360 total IOs\n",
"\n",
"io_split_sort = 360"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (a.ii)\n",
"\n",
"After the file is split and sorted, we can merge $n$ runs into 1 using the merge process. What is largest $n$ we could have, given reads are always read in 4-page chunks? Note: this is known as the arity of the merge."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# we must read in seq. of 4 + leave one space for output, so we can read at most 4, using 4 * 4 + 1 = 17 out of\n",
"# 20 pages(tricky)\n",
"\n",
"merge_arity = 4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (a.iii)\n",
"\n",
"How many passes of merging are required?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# first pass: 8 runs => 2 runs\n",
"# second pass: 2 runs => 1 run\n",
"\n",
"merge_passes = 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (a.iv)\n",
"\n",
"What is the IO cost of the first pass of merging? Note: the highest arity merge should always be used."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 160 pages / 4 page per IO = 40 read IOs\n",
"# 160 * 2 = 320 IO writes \n",
"# 320 + 40 = 360 total IOs\n",
"\n",
"merge_pass_1 = 360"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (a.v)\n",
"\n",
"What is the total IO cost of running this external merge sort algorithm? **Do not forget to add in the remaining passes (if any) of merging.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 360 + 360 + 360 = 1080\n",
"\n",
"total_io = 1080"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b)\n",
"\n",
"**_[5 points]_**\n",
"\n",
"Now, we'll generalize the reasoning above by writing a python function that computes the _approximate_* cost of performing this version of external merge sort for a setup having $B+1$ buffer pages, a file with $N$ pages, and where we now read in $P$-page chunks (replacing our fixed 4 page chunks in Part (a)).\n",
"\n",
"**Note: our approximation will be a small one- for simplicity, we'll assume that each pass of the merge phase has the same IO cost, when actually it can vary slightly... Everything else will be exact given our model!* \n",
"\n",
"We'll call this function `external_merge_sort_cost(B,N,P)`, and we'll compute it as the product of the cost of reading in and writing out all the data (which we do each pass), and the number of passes we'll have to do.\n",
"\n",
"Even though this is an approximation, **make sure to take care of floor / ceiling operations- i.e. rounding down / up to integer values properly!**\n",
"\n",
"**Importantly, to simplify your calculations: Your function will only be evaluated on cases where the following hold:**\n",
"* **(B + 1) % P == 0** (i.e. the buffer size is divisible by the chunk size)\n",
"* **N % (B + 1) == 0** (i.e. the file size is divisible by the buffer size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (b.i)\n",
"\n",
"First, let's write a python function that computes the **exact** total IO cost to create the initial runs:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import math\n",
"def cost_initial_runs(B, N, P):\n",
" # 2 * N: writing to the files\n",
" # N / P: reading the files\n",
" return N*2 + N/P"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (b.ii)\n",
"\n",
"Next, let's write a python function that computes the _approximate_* total IO cost to read in and then write out all the data during one pass of the merge:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def cost_per_pass(B, N, P):\n",
" \n",
" # N/P: read in all the blocks\n",
" # 2 * N: write out all the blocks\n",
" \n",
" return N*2 + N/P"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note that this is an approximation: when we read in chunks during the merge phase, the cost per pass actually varies slightly due to 'rounding issues' when the file is split up into runs... but this is a small difference*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (b.iii)\n",
"\n",
"Next, let's write a python function that computes the **exact** total number of passes we'll need to do"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def num_passes(B, N, P):\n",
" # at each step, how many blocks are we joining\n",
" # B / P gets us the number of blocks we can merge(since we need 1 to output),\n",
" # we need to floor it because B might not be divisible by P \n",
" # final: floor(B/P)\n",
" \n",
" # we have num_of_passes = log_base(floor(B/P))(N/(B + 1)))\n",
" # need to ceiling it since might not be perfect merge\n",
" # final: ceiling(log_base(floor(B/P))(N/B+1))\n",
" # whew!\n",
" return math.ceil(math.log(N/(B + 1), math.floor(B/float(P))))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, our total cost function is:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def external_merge_sort_cost(B, N, P):\n",
" return cost_initial_runs(B,N,P) + cost_per_pass(B,N,P)*num_passes(B,N,P)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (c)\n",
"\n",
"**_[10 points]_**\n",
"\n",
"For $B + 1 =100$ and $N=900$, find the optimal $P$ according to your IO cost equation above. Return both the optimal $P$ value (`P_opt`) and the list of tuples **_for feasible values of $P$_** that would generate a plot of P vs. IO cost, at resolution $=1$(every value of P), stored as `points`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"B = 99\n",
"N = 900\n",
"\n",
"feasible_p_range = range(1, B/2)\n",
"\n",
"#if divisibility assumptions were carried over from part b:\n",
"\n",
"# feasible_p_range = []\n",
"# for i in range(1, B/2):\n",
"# if 100 % i == 0:\n",
"# feasible_p_rangea.append(i)\n",
"\n",
"p1_points = [(p, external_merge_sort_cost(B, N, p)) for p in feasible_p_range]\n",
"\n",
"# Save the optimal value here\n",
"P = 11\n",
"\n",
"# P can also be P = 10 OR P = 12 depending on if divisibility assumptions from #(b) were carried over. We will accept those.\n",
"\n",
"# Save a list of tuples of (P, io_cost) here, for all feasible P's\n",
"points = p1_points"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Below we provide starter code for using `matplotlib` in the notebook, if you want to generate the graph of P vs. IO cost; however any other software that allows you to visualize the plot (Excel, Google spreadsheets, MATLAB, etc) is fine!"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAEACAYAAABYq7oeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHL1JREFUeJzt3X+QVWed5/H3B5ofARFJImAgQTIIkkzUoLazEy2ucYYY\nrSLsVi2LuxqywUyVwTLrWlOB1GyF1LgzxtmtRGsrqZk1ayATJThWDFMVgUS8Y8VRISYRN0SgTMF0\ns6HJz2byk1/f/eM8nT707V/3B3277/m8qk71uQ/Pc+5zTzr308/znHOvIgIzM7O8cc3ugJmZjT4O\nBzMzq+BwMDOzCg4HMzOr4HAwM7MKDgczM6swZDhIukdSl6Q9ubIZknZI2idpu6TpuX9bL+mApGck\nLcuVL5G0R9J+SXfmyidK2pza/ELSRY18gWZmVr3hjBy+C1zVp2wd8GhELAJ2AusBJF0CrAQWA1cD\nd0lSanM3sCYiFgILJfUccw3wUkS8D7gT+GYdr8fMzBpgyHCIiMeAl/sUXwNsTPsbgRVpfzmwOSJO\nRsRB4ADQLmk2MC0idqd6m3Jt8sf6B+BTNbwOMzNroFrXHGZGRBdARBwBZqbyOUBHrt7hVDYH6MyV\nd6ayM9pExCngFUnn1tgvMzNrgEYtSDfyMzg0dBUzMzub2mps1yVpVkR0pSmjo6n8MHBhrt7cVDZQ\neb7N/5M0HnhnRLzU35NK8gdBmZnVICKq+sN7uCMHceZf9FuB69L+auChXPmqdAXSfGABsCtNPXVL\nak8L1Nf2abM67f97sgXuAUWEtwhuvfXWpvdhtGw+Fz4XPheDb7UYcuQg6XtACThP0r8AtwLfAH4g\n6XrgENkVSkTEXklbgL3ACeDG6O3ZWuBeYDLwcERsS+X3APdJOgC8CKyq6ZWYmVnDDBkOEfEfB/in\nPxmg/l8Df91P+a+By/opf4sULmZmNjr4DukxqlQqNbsLo4bPRS+fi14+F/VRrfNRzSApxlJ/zcxG\nA0nEWVqQNjOzAnE4mJlZBYeDmZlVcDiYmVkFh4OZmVVwOJiZWQWHg5mZVXA4mJlZBYeDmZlVcDiY\nmVkFh4OZmVVwOJiZWQWHg5mZVXA4mJlZBYeDmZlVcDiYmVkFh4OZmVVwOJiZWQWHg5mZVXA4mJlZ\nhbZmd8DMxp4XX4SXXoKJE3u3SZOynxMmgKr6KnsbjRQRtTeWbgK+mB7+74j4tqQZwAPAPOAgsDIi\nulP99cD1wEngpojYkcqXAPcCk4GHI+K/DPB8UU9/zawxli6F3/8exo+Ht96C48d7txMnBm4nnRko\nfbeegMlvEybAuAHmOIY6Xn/Hr/V448fXFnoDvb7BjtfWNng/qyWJiKiq9zWHg6RLge8DHyV7s/8x\n8CXgz4AXI+Kbkm4GZkTEOkmXAPen+nOBR4H3RURI+hXw5YjYLelh4FsRsb2f53Q4mI0Cl14KDzwA\nf/iH1bU7dSoLj3yYHD9eGTB9t4H+tz99euDj5cv7Hr+W4506Vd1rhex5eo7Xtw8nTw7c5tSprE5b\n25nhcs45cPBg9f2oJRzqmVZaDPwqIt5KT/4z4N8By4FSqrMRKAPrUvnmiDgJHJR0AGiXdAiYFhG7\nU5tNwAqgIhzMbHTo7oZ3vrP6duPHZ9vkyY3vU6uJyAKkb8CNlHrC4f8CX0/TSG8BnwEeB2ZFRBdA\nRByRNDPVnwP8Itf+cCo7CXTmyjtTuZmNUt3dMH16s3vR2qRsamnCBJg6deSfv+ZwiIjfSbodeAR4\nFXgS6G/g1dB5oA0bNry9XyqVKJVKjTy8mQ3h1Cl4/XWYNq3ZPbGBlMtlyuVyXceoa0H6jANJ/x3o\nAG4CShHRJWk28NOIWCxpHRARcXuqvw24FTjUUyeVrwKWRsSX+nkOrzmYNdkrr8C8ednowcaGWtYc\n6loLl/Tu9PMi4N8C3wO2AtelKquBh9L+VmCVpImS5gMLgF0RcQToltQuScC1uTZmNsrUut5gY0u9\n9zn8UNK5wAngxog4lqaatki6nmxUsBIgIvZK2gLszdXvGQas5cxLWbfV2S8zO0u83lAMDZtWGgme\nVjJrvsceg5tvhp//vNk9seEa8WklMysejxyKweFgZlXxmkMxOBzMrCoeORSDw8HMqnLsmMOhCBwO\nZlYVTysVg8PBzKriaaVicDiYWVUcDsXgcDCzqnjNoRgcDmZWFa85FIPDwcyq4mmlYnA4mFlVHA7F\n4HAws6p4zaEYHA5mNmwRXnMoCoeDmQ3b669nX1s5cWKze2Jnm8PBzIbNU0rF4XAws2HzYnRxOBzM\nbNi83lAcDgczGzaPHIrD4WBmw+Y1h+JwOJjZsHlaqTgcDmY2bJ5WKg6Hg5kNm8OhOBwOZjZsXnMo\njrrCQdJ6SU9L2iPpfkkTJc2QtEPSPknbJU3vU/+ApGckLcuVL0nH2C/pznr6ZGZnj9cciqPmcJA0\nD7gBuDwiPgC0AZ8D1gGPRsQiYCewPtW/BFgJLAauBu6SpHS4u4E1EbEQWCjpqlr7ZWZnj6eViqOe\nkcMx4DgwVVIbcA5wGLgG2JjqbARWpP3lwOaIOBkRB4EDQLuk2cC0iNid6m3KtTGzUcThUBw1h0NE\nvAz8T+BfyEKhOyIeBWZFRFeqcwSYmZrMATpyhzicyuYAnbnyzlRmZqOM1xyKo63WhpIuBr4KzAO6\ngR9I+k9A9Kna93FdNmzY8PZ+qVSiVCo18vBmNgivOYwN5XKZcrlc1zEUUdt7t6SVwJ9GxA3p8ReA\nPwKuBEoR0ZWmjH4aEYslrQMiIm5P9bcBtwKHeuqk8lXA0oj4Uj/PGbX218zqN3s2PPEEXHBBs3ti\n1ZBERGjomr3qWXPYB/yRpMlpYflTwF5gK3BdqrMaeCjtbwVWpSua5gMLgF1p6qlbUns6zrW5NmY2\ninhaqThqnlaKiN9I2gT8GjgFPAn8HTAN2CLperJRwcpUf6+kLWQBcgK4MTcMWAvcC0wGHo6IbbX2\ny8zOjhMn4PhxmDKl2T2xkVDztFIzeFrJrHleeAEWLoSXXmp2T6xaIz2tZGYF4stYi8XhYGbD4vWG\nYnE4mNmw+DLWYnE4mNmweFqpWBwOZjYsDodicTiY2bB4zaFYHA5mNixecygWh4OZDYunlYrF4WBm\nw+JppWJxOJjZsHjkUCwOBzMbFq85FIvDwcyGxSOHYnE4mNmweM2hWBwOZjYsHjkUi8PBzIbFaw7F\n4u9zMLMhnT4NEyZkX/Yzfnyze2PV8vc5mNlZ8eqr2TfAORiKw+FgZkPylFLxOBzMbEhejC4eh4OZ\nDcnhUDwOBzMbku9xKB6Hg5kNyWsOxeNwMLMheVqpeGoOB0kLJT0p6Yn0s1vSVyTNkLRD0j5J2yVN\nz7VZL+mApGckLcuVL5G0R9J+SXfW+6LMrLE8rVQ8NYdDROyPiMsjYgnwYeA14EFgHfBoRCwCdgLr\nASRdAqwEFgNXA3dJ6rkp425gTUQsBBZKuqrWfplZ43nkUDyNmlb6E+D3EdEBXANsTOUbgRVpfzmw\nOSJORsRB4ADQLmk2MC0idqd6m3JtzGwU8JpD8TQqHP4D8L20PysiugAi4ggwM5XPATpybQ6nsjlA\nZ668M5WZ2SjhkUPxtNV7AEkTyEYFN6eivh9+1NAPQ9qwYcPb+6VSiVKp1MjDm1k/vOYwtpTLZcrl\ncl3HqDscyNYPfh0RL6THXZJmRURXmjI6msoPAxfm2s1NZQOV9ysfDmY2MjytNLb0/cP5tttuq/oY\njZhW+hzw/dzjrcB1aX818FCufJWkiZLmAwuAXWnqqVtSe1qgvjbXxsxGAU8rFU9dIwdJU8gWo/8s\nV3w7sEXS9cAhsiuUiIi9krYAe4ETwI25z99eC9wLTAYejoht9fTLzBrL4VA8/j4HMxvSu98NTz8N\nM2cOXddGH3+fg5k1XITXHIrI4WBmg3rzTZBg8uRm98RGksPBzAbl9YZicjiY2aB8j0MxORzMbFBe\nbygmh4OZDcrTSsXkcDCzQXlaqZgcDmY2KI8cisnhYGaD8ppDMTkczGxQHjkUk8PBzAblNYdicjiY\n2aA8rVRMDgczG5SnlYrJ4WBmg3I4FJPDwcwG5TWHYnI4mNmgvOZQTA4HMxuUp5WKyeFgZoPytFIx\n+WtCzWxAJ0/CpEnZT1X1JZM2mvhrQs2soY4dg2nTHAxF5HAwswF5vaG4HA5mNiCvNxSXw8HMBuSR\nQ3HVFQ6Spkv6gaRnJD0t6WOSZkjaIWmfpO2Spufqr5d0INVflitfImmPpP2S7qynT2bWOL7Hobjq\nHTl8C3g4IhYDHwR+B6wDHo2IRcBOYD2ApEuAlcBi4GrgLuntZa67gTURsRBYKOmqOvtlZg3gkUNx\n1RwOkt4JfCIivgsQEScjohu4BtiYqm0EVqT95cDmVO8gcABolzQbmBYRu1O9Tbk2ZtZEXnMornpG\nDvOBFyR9V9ITkv5O0hRgVkR0AUTEEWBmqj8H6Mi1P5zK5gCdufLOVGZmTeZppeJqq7PtEmBtRDwu\n6Q6yKaW+d6k19K61DRs2vL1fKpUolUqNPLyZ5XhaaWwql8uUy+W6jlFPOHQCHRHxeHr8Q7Jw6JI0\nKyK60pTR0fTvh4ELc+3nprKByvuVDwczO7u6u+HCC4euZ6NL3z+cb7vttqqPUfO0Upo66pC0MBV9\nCnga2Apcl8pWAw+l/a3AKkkTJc0HFgC70tRTt6T2tEB9ba6NmTWR1xyKq56RA8BXgPslTQCeBf4z\nMB7YIul64BDZFUpExF5JW4C9wAngxtwHJa0F7gUmk139tK3OfplZA3jNobj8wXtmNqBPfAK+/nVY\nurTZPbF6+IP3zKyhPK1UXA4HMxuQr1YqLoeDmQ3Iaw7F5TUHM+tXBLS1wZtvwoQJze6N1cNrDmbW\nMK+9BpMnOxiKyuFgZv3yekOx1Xufg40REfDAA9lfgjNn9m7+CkgbiNcbis3hUBCHDsENN8CVV8LR\no73biRNZSJx/Ppx3Hpx7bu/W8/hd78reJKZPz7ae/UmTmv2q7GzyyKHYHA4F0dEBl10GD/X5YJLX\nX4fnn8+2l17q3V58MWvz1FPZm0TPduxY7/748b1B0V94DLQ/ZUo2gpk0Kdt69idPzo5po4PvcSg2\nh0NBdHT0/wFqU6bAvHnZVo2I7CqWvoHRd//IEdi//8yAeeONrO1bb2Vbfr+tDaZOzfo1deqZW0+Y\n5ANl0iQ455zs39/xjso2+eDp23b8+Oz5xo+HceM8vdaXp5WKzeFQEAOFQ62k7E35nHNg9uzGHDMi\nC4jXXjtze/317GdPgOQD5c03s7B55RU4fLiybb5e3/anTmXbyZO9l22OH59dnTNQoEyY0Fuvre3M\nbaDwmjixt04+kPqW9fd4qPoDte2vTrXh52mlYnM4FERHByxY0OxeDE7K3lQnT87WO0bS6dO9QXHi\nRGWg9OyfPHnmlm9z/Hj/bY4dOzOIen7m9/NlA/1bf+369mGw5xk3buiQyf985RX4/OdH9r+DjR4O\nh4Lo7IRPfrLZvRi9xo3LtgkTstFQq4nIArC/0OgbKvk68+c3u+fWLC0RDnv3wpw5HgIPptHTSja2\nSNlowAv+NlwtcRPc+vWwfXuzezG6ORzMrBotEQ4f/zg89lizezF69VxVNGtWs3tiZmOFw6EAOjvh\ngguyOXUzs+FoibeLJUuya+n/9V+b3ZPRyVNKZlatlgiHSZOygPjlL5vdk9Gps9PhYGbVaYlwAE8t\nDaajA+bObXYvzGwsaZlwuOIKh8NAPK1kZtVqmXD44z+GXbuyG3rsTA4HM6tWy4TDjBnw3vfCb37T\n7J6MPg4HM6tWXeEg6aCk30h6UtKuVDZD0g5J+yRtlzQ9V3+9pAOSnpG0LFe+RNIeSfsl3Vlrf7zu\n0D+Hg5lVq96Rw2mgFBGXR0R7KlsHPBoRi4CdwHoASZcAK4HFwNXAXdLbnxN5N7AmIhYCCyVdVUtn\nrrgCfv7z2l9MK3rttexTS88/v9k9MbOxpN5wUD/HuAbYmPY3AivS/nJgc0ScjIiDwAGgXdJsYFpE\n7E71NuXaVKVn5BBRS+vW1NmZXank7yows2rUGw4BPCJpt6QvprJZEdEFEBFHgJmpfA7QkWt7OJXN\nATpz5Z2prGrz5mUfLPbss7W0bk2eUjKzWtT7qaxXRMRzkt4N7JC0jyww8hr6d/yGDRve3i+VSpRK\npbcfS72XtP7BHzTyWccu3+NgVjzlcplyuVzXMeoKh4h4Lv18XtKPgHagS9KsiOhKU0ZHU/XDQP5v\n2LmpbKDyfuXDoT8f/3i27rB6dZUvpkV55GBWPH3/cL7tttuqPkbN00qSpkh6R9qfCiwDfgtsBa5L\n1VYDPV9pvxVYJWmipPnAAmBXmnrqltSeFqivzbWpmq9YOpPDwcxqUc/IYRbwoKRIx7k/InZIehzY\nIul64BDZFUpExF5JW4C9wAngxoi3l47XAvcCk4GHI2JbrZ267LLsu4RfeMFX6EAWDitqWt43syJT\njKFLeyTFcPq7bBl8+cuwfPkIdGqUu/RS+P734QMfaHZPzKxZJBERVV2z2DJ3SOf1rDuYP5HVzGrT\nsuHgdQc4diz7ovh3vavZPTGzsaYlw+FjH4OnnsruDC6ynsVo3wBnZtVqyXCYOhUuuQQef7zZPWku\n3+NgZrVqyXAArzuAL2M1s9q1dDgUfd3B4WBmtWrZcOj5hNbTp5vdk+ZxOJhZrVo2HGbPhvPOg717\nm92T5vFlrGZWq5YNB/C6g0cOZlarlg+Hoq47RDgczKx2LR0OV1wBP/tZMdcdXn4Z2tpg2rRm98TM\nxqKWDof3vx8uugjuuKPZPRl5HjWYWT1aOhwk2LQJvvEN2LOn2b0ZWQ4HM6tHS4cDwPz58Dd/A5//\nPLz5ZrN7M3J8pZKZ1aPlwwGyb4V73/vgL/6i2T0ZOR45mFk9ChEOEvzt32bfa7BzZ7N7MzIcDmZW\nj0KEA2TfCnfPPXDdddmVPK3O4WBm9ShMOAB8+tPZt8OtXdvsnpx9Dgczq0ehwgHgm9+EJ5/Mppha\nVUS2IO2P6zazWhUuHKZMgfvvh5tuyv66bkXPP599p8WUKc3uiZmNVYULB4AlS+CrX4XPfKY1P5jP\nl7GaWb0KGQ4A69bBV74CS5fC3XdnUzGtwusNZlavwoaDBDfckH0w33e+AytWZNMxrcDhYGb1qjsc\nJI2T9ISkrenxDEk7JO2TtF3S9Fzd9ZIOSHpG0rJc+RJJeyTtl3RnvX2qxqJF8ItfZJ/D9KEPwSOP\njOSznx0OBzOrVyNGDjcB+Zn7dcCjEbEI2AmsB5B0CbASWAxcDdwlSanN3cCaiFgILJR0VQP6NWwT\nJ8Ltt8N998H118PXvgZvvDGSPWgsh4OZ1auucJA0F/gM8J1c8TXAxrS/EViR9pcDmyPiZEQcBA4A\n7ZJmA9MiYneqtynXZkRdeSU89RQcPgzz5sEtt4zNK5ocDmZWr3pHDncAfw7kl3NnRUQXQEQcAWam\n8jlA/q32cCqbA3TmyjtTWVOcdx5s3pytRbz6Knzwg7ByZfZ4rCxad3T4Hgczq09brQ0lfRboioin\nJJUGqdrQt9QNGza8vV8qlSiVBnvq2i1cCN/+Nnz967BxI6xZk9038OUvZ4vX5513Vp62bqdPw3PP\nORzMiqxcLlMul+s6hqLGP4cl/RXweeAkcA4wDXgQ+AhQioiuNGX004hYLGkdEBFxe2q/DbgVONRT\nJ5WvApZGxJf6ec6otb/1On0aduzIPsDvJz+BSy+Fq6/Otg9/GMaNkuu+nnsuW1jv6mp2T8xstJBE\nRGjomr1qfkuLiFsi4qKIuBhYBeyMiC8A/whcl6qtBh5K+1uBVZImSpoPLAB2pamnbkntaYH62lyb\nUWPcuOyzmR58MLvk9S//Erq74dprYfZs+MIXsgXt/fub+7WkXm8ws0aoeeRwxkGkpcDXImK5pHOB\nLcCFZKOClRHxSqq3HlgDnABuiogdqfzDwL3AZODhiLhpgOdp2shhMAcPwo9/nH0c+O7dWWh85CPQ\n3g4f/Wj284ILRqYvP/wh/P3fZyFmZga1jRwaEg4jZbSGQ19Hj2YhsWtX70/I7qlYtCi7p6Jn/+KL\ns0tpG+XOO+HZZ7P1EjMzqC0cal6QtoHNnAmf/Wy2QXaVU1cX7NuXbb/7HfzTP2X7HR0wa1Y2FXTR\nRdnPnm3uXHjPe7J/bxvmfylPK5lZIzgcRoCUrUvMnp19llPe8ePZfRUdHb3bM89ki9+dnXDkCLzw\nAsyYkQVFz3Fmz85CqO928GA2lWVmVg9PK40Bp05li+BHjmTbc89lI5GjR8/cnn8eXnoJ/vmf4fLL\nm91rMxstvOZgZmYVRvRSVjMza10OBzMzq+BwMDOzCg4HMzOr4HAwM7MKDgczM6vgcDAzswoOBzMz\nq+BwMDOzCg4HMzOr4HAwM7MKDgczM6vgcDAzswoOBzMzq+BwMDOzCg4HMzOr4HAwM7MKDgczM6tQ\nczhImiTpV5KelPS0pL9K5TMk7ZC0T9J2SdNzbdZLOiDpGUnLcuVLJO2RtF/SnfW9JDMzq1fN4RAR\nbwGfjIjLgQ8AV0q6AlgHPBoRi4CdwHoASZcAK4HFwNXAXZJ6vtP0bmBNRCwEFkq6qtZ+FUW5XG52\nF0YNn4tePhe9fC7qU9e0UkS8nnYnpWO9DFwDbEzlG4EVaX85sDkiTkbEQeAA0C5pNjAtInanepty\nbWwA/sXv5XPRy+eil89FfeoKB0njJD0JHAHKEbEXmBURXQARcQSYmarPATpyzQ+nsjlAZ668M5WZ\nmVmTtNXTOCJOA5dLeiewXVIJiL7V6nkOMzMbeYpozHu3pP8GvAGsAUoR0ZWmjH4aEYslrQMiIm5P\n9bcBtwKHeuqk8lXA0oj4Uj/P4aAxM6tBRGjoWr1qHjlIOh84ERHdks4B/hS4DdgKXAfcDqwGHkpN\ntgL3S7qDbNpoAbArIkJSt6R2YDdwLfDt/p6z2hdnZma1qWda6T3AxnTF0Tjgvoj4SVqD2CLperJR\nwUqAiNgraQuwFzgB3Bi9w5a1wL3AZODhiNhWR7/MzKxODZtWMjOz1jFm7pCW9GlJv0s3yt3c7P6M\nJEn3SOqStCdXNuDNhq1K0lxJO9NNl7+V9JVUXsRzUfVNqK0uXT35hKSt6XEhz4Wkg5J+k343dqWy\nqs/FmAgHSeOA/wVcBVwKfE7S+5vbqxH1XbLXntfvzYYt7iTwXyPiUuDfAGvT70HhzkW1N6EWxE1k\n09Y9inouTpNdFHR5RLSnsqrPxZgIB6AdOBARhyLiBLCZ7Ga7QoiIx8huMMwb6GbDlhURRyLiqbT/\nKvAMMJcCnguo+ibUliZpLvAZ4Du54kKeC6BnHTiv6nMxVsKh7w10vlEOZg5ws2EhSHov8CHglwx8\n42VLq/Im1FZ3B/DnnHlfVVHPRQCPSNot6YuprOpzUddNcDaqFObKAknvAP4BuCkiXu3n/pdCnAvf\nhJqR9FmgKyKeSudgIC1/LpIrIuI5Se8GdkjaRw2/F2Nl5HAYuCj3eG4qK7IuSbMA0s2GR5vcnxEh\nqY0sGO6LiJ57aAp5LnpExDHgYeAjFPNcXAEsl/Qs8H2y9Zf7gCMFPBdExHPp5/PAj8im5av+vRgr\n4bAbWCBpnqSJwCqym+qKRGnr0XOzIZx5s2Gr+z/A3oj4Vq6scOdC0vk9V5zkbkJ9kgKei4i4JSIu\nioiLyd4bdkbEF4B/pGDnQtKUNLJG0lRgGfBbavi9GDP3OUj6NPAtskC7JyK+0eQujRhJ3wNKwHlA\nF9nHjvwI+AFwIelmw4h4pVl9HAnpapyfkf2yR9puAXYBWyjWubiMbGExfxPq/5B0LgU7F3mSlgJf\ni4jlRTwXkuYDD5L9v9EG3B8R36jlXIyZcDAzs5EzVqaVzMxsBDkczMysgsPBzMwqOBzMzKyCw8HM\nzCo4HMzMrILDwczMKjgczMyswv8HwEW9xnJIqh8AAAAASUVORK5CYII=\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Shell code for plotting in matplotlib\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Plot\n",
"plt.plot(*zip(*points))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem 2: IO Cost Models\n",
"--------------------------------------\n",
"\n",
"**_[15 points total]_**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this problem we consider different join algorithms when joining relations $R(A,B)$,$S(B,C)$, and $T(C,D)$. We want to investigate the cost of various pairwise join plans and try to determine the best join strategy given some conditions.\n",
"\n",
"Specifically, for each part of this question, we are intereseted determining some (or all) of the following variables:\n",
"\n",
"* `P_R`: Number of pages of $R$\n",
"* `P_S`: Number of pages of $S$\n",
"* `P_RS`: Number of pages of output (and input) $RS$\n",
"* `P_T`: Number of pages of $T$\n",
"* `P_RST`: Number of pages of output (and input) $RST$\n",
"* `B`: Number of pages in buffer\n",
"* `IO_cost_join1`: Total IO cost of first join\n",
"* `IO_cost_join2`: Total IO cost of second join\n",
"\n",
"#### Note:\n",
"* ** The output of join1 is always feed as one of the inputs to join 2 ** \n",
"* **Use the \"vanilla\" versions of the algorithms as presented in lecture, _i.e. without any of the optimizations we mentioned_**\n",
"* **Again assume we use one page for output, as in lecture!**\n",
"* ** The abbreviates for the joins used are Sort-Merge Join (SMJ), Hash Join (HJ), and Block Nested Loop Join (BNLJ). **"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a)\n",
"\n",
"**_[8 points]_**\n",
"\n",
"Given:\n",
"* `P_R`: 10\n",
"* `P_S`: 100\n",
"* `P_T`: 1000\n",
"* `P_RS`: 50\n",
"* `P_ST`: 500\n",
"* `P_RST`: 250\n",
"* `B`: 32\n",
"\n",
"Compute the IO cost for the following query plans:\n",
"\n",
"* IO_Cost_HJ_1 where only hash join is used, $join1 = R(a,b),S(b,c)$ and $join2 = join1(a,b,c),T(c,d)$\n",
"* IO_Cost_HJ_2 where only hash join is used, $join1 = T(c,d),S(b,c)$ and $join2 = join1(b,c,d),R(a,b)$\n",
"* IO_Cost_SMJ_1 where only sort merge join is used, $join1 = R(a,b),S(b,c)$ and $join2 = join1(a,b,c),T(c,d)$\n",
"* IO_Cost_SMJ_2 where only sort merge join is used, $join1 = T(c,d),S(b,c)$ and $join2 = join1(b,c,d),R(a,b)$\n",
"* IO_Cost_BNLJ_1 where only block nested loop join is used, $join1 = R(a,b),S(b,c)$ and $join2 = join1(a,b,c),T(c,d)$\n",
"* IO_Cost_BNLJ_2 where only block nested loop merge join is used, $join1 = T(c,d),S(b,c)$ and $join2 = join1(b,c,d),R(a,b)$\n",
"\n",
"**Note: again, be careful of rounding for this problem. Use ceiling/floors whenever it is necessary.**\n",
"\n",
"Include 1-2 sentences (as a python comment) above each answer explaining the performance for each algorithm/query plan."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Since (B - 1)(B - 2) = 930 >> min(P_R, P_S):\n",
"# P_R fits completely in memory, no partition phase needed\n",
"# IO(join1) = 10 + 100 + 50 OUT = 160\n",
"# Similarly:\n",
"# IO(join2) = 3(50 + 1000) + 250 OUT = 3400\n",
"# Total: 3400 + 360 = 3560\n",
"\n",
"IO_Cost_HJ_1 = 3560\n",
"\n",
"# (B - 1)(B - 2) = 930 >> min(P_S, P_T)\n",
"# IO(join1) = 3(100 + 1000) + 500 OUT = 3800\n",
"# IO(join2) = 500 + 10 + 250 OUT = 760\n",
"# Total: 4560\n",
"\n",
"IO_Cost_HJ_2 = 4560\n",
"\n",
"# join1:\n",
"# 1 pass(R/W) to sort R (2 * 10 = 20)\n",
"# 2 pass(R/W) to sort S (4 * 100 = 400)\n",
"# 1 pass(R) to merge (10 + 100 = 110)\n",
"# Total = 20 + 400 + 110 + 50 = 580\n",
"\n",
"# join2:\n",
"# 2 pass(R/W) to sort RS(4 * 50 = 200)\n",
"# 3 pass(R/W) (B * (B - 1) = 992 < 1000) to sort T (6 * 1000 = 6000)\n",
"# 1 pass(R) to merge (50 + 1000 = 1050)\n",
"# OUT = 250\n",
"# Total = 200 + 6000 + 1050 + 250 = 7500\n",
"\n",
"# Total = 8080\n",
"\n",
"IO_Cost_SMJ_1 = 8080\n",
"\n",
"\n",
"# join1:\n",
"# 3 pass(R/W) (B * (B - 1) = 992 < 1000) to sort T (6 * 1000 = 6000)\n",
"# 2 pass(R/W) to sort S (4 * 100 = 400)\n",
"# 1 pass(R) to merge (100 + 1000 = 1100)\n",
"# OUT = 500\n",
"# Total = 6000 + 400 + 1100 + 500 = 8000\n",
"\n",
"# join2:\n",
"# 2 pass(R/W) to sort ST (4 * 500) = 2000\n",
"# 1 pass(R/W) to sort R (2 * 10 = 20)\n",
"# 1 pass(R) to merge (10 + 500 = 510)\n",
"# OUT = 250\n",
"# Total = 2000 + 20 + 510 + 250 = 2780\n",
"\n",
"# Total: 10780\n",
"\n",
"IO_Cost_SMJ_2 = 10780\n",
"\n",
"# From lecture: P(R) + P(R)P(S)/B + OUT\n",
"# Where one should use smaller of two relations as R\n",
"\n",
"#join1: 10 + ceiling(10/30) * 100 + 50 = 160\n",
"\n",
"#join2: 50 + ceiling(50/30) * 1000 + 250 = 2300\n",
"\n",
"#Total: 160 + 2300 = 2460\n",
"\n",
"IO_Cost_BNLJ_1 = 2460\n",
"\n",
"#join1: 100 + ceiling(100/30) * 1000 + 500 = 4600\n",
"#join2: 10 + ceiling(10/30) * 500 + 250 = 760 \n",
"#Total: 4600 + 760 = 5360\n",
"\n",
"IO_Cost_BNLJ_2 = 5360\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b)\n",
"\n",
"For the query plan where $join1 = R(a,b),S(b,c)$ and $join2 = join1(a,b,c),T(c,d)$ find a configuration where using HJ for $join1$ and SMJ for $join2$ is cheaper than SMJ for $join1$ and HJ for $join2$. The output sizes you choose for P_RS and P_RS must be non-zero and feasible (e.g. the maximum output size of $join1$ is P_R*P_S). \n",
"\n",
"**_[8 points]_**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Possible Idea: \n",
"# Have P_R be small while P_S be large. This will result in HJ for join1 being much cheaper using HJ than SMJ for join1\n",
"\n",
"P_R = 10\n",
"P_S = 10000\n",
"P_T = 100\n",
"P_RS = 50\n",
"P_RST = 25\n",
"B = 20\n",
"\n",
"# join1: fits in memory, no partition needed\n",
"# 10 + 10000 + 50 = 10060\n",
"#join2:\n",
"# 5 * (100 + 50) + 25 = 775\n",
"\n",
"HJ_IO_Cost_join1 = 10060\n",
"SMJ_IO_Cost_join2 = 775\n",
"\n",
"# join1: \n",
"# sort R: 1 pass (2 * 10 = 20)\n",
"# sort S: ceiling(log_base(B)(P_S/B) (8 * 10000 = 80000)\n",
"# merge: 10 + 10000 = 10010\n",
"# 20 + 80000 + 10010 + 50 = 90080\n",
"#join2:\n",
"# 3 * (100 + 50) + 25 = 475\n",
"\n",
"SMJ_IO_Cost_join1 = 90080\n",
"HJ_IO_Cost_join2 = 475"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10060.0 775.0\n",
"90080.0 475.0\n",
"10835.0 90555.0\n"
]
}
],
"source": [
"import math\n",
"\n",
"def HJ_cost_calc(input1, input2, buf, out):\n",
" #From lecture notes, note B is B + 1 in notes\n",
" B = buf - 1\n",
" smaller = min(input1, input2)\n",
" return 2 * math.ceil(math.log(math.ceil(float(smaller)/(B - 1)), B)) * (input1 + input2) + (input1 + input2) + out\n",
"\n",
"def SMJ_cost_calc(input1, input2, buf, out):\n",
" #From lecture notes, note buf is B + 1 in notes\n",
" B = buf - 1\n",
" return 2 * input1 * (1 + math.ceil(math.log(math.ceil(float(input1)/(B + 1)), B))) + \\\n",
" 2 * input2 * (1 + math.ceil(math.log(math.ceil(float(input2)/(B + 1)), B))) + \\\n",
" input1 + input2 + out\n",
" \n",
"plan1 = HJ_cost_calc(P_R, P_S, B, P_RS) \\\n",
" + SMJ_cost_calc(P_RS, P_T, B, P_RST)\n",
"plan2 = SMJ_cost_calc(P_R, P_S, B, P_RS) \\\n",
" + HJ_cost_calc(P_RS, P_T, B, P_RST)\n",
"\n",
"print HJ_cost_calc(P_R, P_S, B, P_RS), SMJ_cost_calc(P_RS, P_T, B, P_RST)\n",
"print SMJ_cost_calc(P_R, P_S, B, P_RS), HJ_cost_calc(P_RS, P_T, B, P_RST)\n",
"print plan1, plan2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem 3: Sequential Flooding\n",
"-----------------------------\n",
"\n",
"**_[10 points total]_**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: Before doing this question, it is highly recommended that you go through [Activity 15](http://web.stanford.edu/class/cs145/cs145-notebooks-2016/lecture-14-15/Activity-15.ipynb), which covers eviction policies for buffer managers such as LRU, and why _sequential flooding_ can sometimes occurs with LRU.**\n",
"\n",
"In the activity accompanying Lecture 15, we saw something called _sequential flooding_ that can occur when a default eviction policy (for example LRU) is used by the buffer manager. We saw that we can achieve much lower IO cost by using a different eviction policy, MRU (\"most recently used\").\n",
"\n",
"**Note that \"Most recently used\" means most recently accessed, either from buffer or disk, consistent with what we showed in Activity-15.**\n",
"\n",
"For this problem, we will take a closer look at the IO cost of different eviction policies when reading the pages of a file sequentially multiple times. \n",
"\n",
"## Part (a)\n",
"### Part (a.i)\n",
"**_[1 point]_**\n",
"\n",
"Write a python function `lru_cost(N,M,B)` that computes the IO cost of the LRU eviction policy when reading in all the papges of an $N$-page file sequentially, $M$ times, using a bugger with $B+1$ pages. Assume that after reading the files, you don't need to write them out (you can just release them, so there is no write IO cost)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def lru_cost(N, M, B):\n",
" if (N <= B + 1):\n",
" return N\n",
" return N * M"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a.ii)\n",
"**_[2 points]_**\n",
"\n",
"Write a python function `mru_cost(N,M,B)` that computes the IO cost of the MRU eviction policy when reading in all the papges of an $N$-page file sequentially, $M$ times, using a bugger with $B+1$ pages. Assume that after reading the files, you don't need to write them out (you can just release them, so there is no write IO cost)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def mru_cost(N, M, B):\n",
" if (N <= B + 1):\n",
" return N\n",
" \n",
" #initial reads\n",
" buf = range(B+1)\n",
" io = B+1\n",
" pos = B\n",
" mru = B\n",
" passes = 0\n",
" \n",
" while True:\n",
" pos+=1\n",
" if (pos >= N):\n",
" pos = 0\n",
" passes+=1\n",
" if (passes >= M):\n",
" break\n",
" if pos in buf:\n",
" mru = buf.index(pos)\n",
" else:\n",
" buf[mru] = pos\n",
" io+=1\n",
" \n",
" return io"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a.iii)\n",
"**_[2 points]_**\n",
"\n",
"Now that you have written these functions, provide the tuples which generate the plot of **M vs. the absolute value of the difference between LRU and MRU in terms of IO cost** for $B=6$, $N=10$, and $M$ between 1 and 20 inclusive (saved as the variable `p3_lru_points`)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"B = 6\n",
"N = 10\n",
"M = 20\n",
"p3_lru_points = [(m, abs(lru_cost(N, m, B) - mru_cost(N, m, B))) for m in range(1, M+1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, you can optionally plot your answer to check that it seems reasonable- starter code for doing this in the notebook below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Shell code for plotting in matplotlib\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Plot\n",
"plt.plot(*zip(*p3_lru_points))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part (b)\n",
"\n",
"Recall that the LRU eviction policy removes the least recently used page when the buffer is full and a new page is referenced which is not there in buffer. The basic idea behind LRU is that you timestamp your buffer elements, and use the timestamps to decide when to evict elements. Doing so efficiently, requires some serious book-keeping, this is why in practice many buffer managers try to approximate LRU with other eviction policies that are easier to implement. \n",
"\n",
"Here we will focus on the _CLOCK_ or _Second Chance_ policy. In the CLOCK eviction policy, the candidate pages for removal are considered left-to-right in a circular manner(with wraparound), and a page that has been accessed between consecutive considerations will not be replaced. The page replaced is the one that - considered in a circular manner - has not been accessed since its last consideration.\n",
"\n",
"In more details the CLOCK policy proceeds maintains a circular list of pages in the buffer and uses an additional _clock (or second chance) bit_ for each page to track how often a page is accessed. The bit is set to 1 whenever a page is referenced. When clock needs to read in a new page in the buffer, it sweeps over existing pages in the buffer looking for one with second chance bit set to 0. It basically replaces pages that have not been referenced for one complete revolution of the clock. \n",
"\n",
"A high-level implementation of clock:\n",
"1. Associate a \"second chance\" bit with each page in the buffer. Initialize all bits to ZERO (0).\n",
"2. Each time a page is referenced in the buffer, set the \"second chance\" bit to ONE (1). this will give the page a second chance...\n",
"3. A new page read into a buffer page has the second chance bit set to ZERO (0).\n",
"4. When you need to find a page for removal, look in left-to-right in a circular manner(with wraparound) in the buffer pages:\n",
" - If the second chance bit is ONE, reset its second chance bit (to ZERO) and continue.\n",
" - If the second chance bit is ZERO, replace the page in the buffer.\n",
" \n",
"You can find more details on CLOCK [here](http://cseweb.ucsd.edu/classes/wi08/cse221-a/papers/carr81.pdf).\n",
"\n",
"\n",
"### Part (b.i)\n",
"**_[4 points]_**\n",
"\n",
"Write a python function `clock_cost(N,M,B)` that computes the IO cost of the CLOCK eviction policy when reading in all the papges of an $N$-page file sequentially, $M$ times, using a bugger with $B+1$ pages. Assume that after reading the files, you don't need to write them out (you can just release them, so there is no write IO cost)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def clock_cost(N, M, B):\n",
" b = [None]*(B+1)\n",
" secondChance = [0]*(B+1)\n",
" clock = 0\n",
" reads = 0\n",
" for i in range(M):\n",
" for x in range(N):\n",
" if x not in b:\n",
" if b[clock] == None:\n",
" b[clock] = x\n",
" else:\n",
" while secondChance[clock] == 1:\n",
" secondChance[clock] = 0\n",
" clock = (clock + 1) % (B+1)\n",
" \n",
" b[clock] = x \n",
" secondChance[clock] = 0\n",
" clock = (clock + 1) % (B+1)\n",
" reads += 1\n",
" else:\n",
" secondChance[b.index(x)] = 1\n",
" return reads"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b.ii)\n",
"**_[1 point]_**\n",
"\n",
"Now that you have written the CLOCK cost function, provide the tuples which generate the plot of **M vs. the absolute value of the difference between LRU and CLOCK in terms of IO cost** for $B=6$, $N=10$, and $M$ between 1 and 20 inclusive (saved as the variable `p3_clock_points`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"B = 6\n",
"N = 10\n",
"M = 20\n",
"p3_clock_points = [(m, abs(lru_cost(N, m, B) - clock_cost(N, m, B))) for m in range(1, M+1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Does the CLOCK eviction policy prevent sequential flooding? How does it perform against LRU? Write a short explanation in the field below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# CLOCK eviction is a form of LRU, which does not prevent sequential flooding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem 4: Hash Join Madden\n",
"-----------------------------\n",
"\n",
"**_[10 points total]_**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The NFL season has started strong and Jack Del Rio ([The Oakland Raider's](https://www.youtube.com/watch?v=YXj7I1RzLSE) coach) wants to find out if Joe Flacco is an elite quarterback. He wants to do this by being more of a sabermetrics guy than a numbers guy. As a first step in doing this he wants to find out which are the colleges each NFL teams prefers drafting players from. We have access to two tables: (i) a table named \"teams\" which contains (team, player) pairs, and (ii) a table named \"colleges\" which contains (player, college) pairs. Being all excited about databases you decide that there is no other way but to join the two tables and get the desired results. However, you have no access to a database. Not even a challenge for you who decide to implement your favorite join algorithm on your own. And of course HASH JOIN is the way to go!!!\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load and explore the data\n",
"\n",
"The two tables are stored in files which can be loaded into memory as two lists of **named tuples** using the code below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Load data\n",
"import nfl\n",
"from nfl import *\n",
"teams, colleges = loadData()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Named tuples are basically lightweight object types and instances of named tuple instances can be referenced using object like variable deferencing or the standard tuple syntax. The following code prints the first 10 tuples from teams and colleges. *Notice how fields of named tuples are accessed inside the loops.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Print List Entries\n",
"print 'Table teams contains %d entries in total' % len(teams)\n",
"print 'Table colleges contains %d entries in total' % len(colleges)\n",
"print \n",
"print 'First 10 entries in teams table'\n",
"for i in range(10):\n",
" team = teams[i]\n",
" print 'Entry %d' %(i+1),':',team.teamname, '|', team.playername\n",
"print \n",
"print 'First 10 entries in college table'\n",
"for i in range(10):\n",
" college = colleges[i]\n",
" print 'Entry %d' %(i+1),':',college.collegename, '|', college.playername"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Down to business\n",
"\n",
"During the lectures we saw that hash joins consist of two phases: The **Partition Phase** where using a hash function $h$ we split the two tables we want to join into $B$ buckets, and the **Matching Phase** where we iterate over each bucket and join the tuples from the two tables that match. Here you will need to implement a hash join in memory.\n",
"\n",
"You are determined to implement the most efficient hash join possible! This is why you decide to implement your own hash function that will uniformly partition the entries of a table across $B$ buckets so that all buckets have roughly the same number of entries. You decide to use the following hash function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Define hash function\n",
"def h(x,buckets):\n",
" rawKey = ord(x[1])\n",
" return rawKey % buckets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You use this hash function to partition the tables. To do so you can use the helper method `partitionTable(table,hashfunction,buckets)` for convenience as shown next:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Fix the number of buckets to 500\n",
"buckets = 500\n",
"# Partition the teams table using hash function h\n",
"teamsPartition = partitionTable(teams,h,buckets)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output of `partitionTable()` is a dictionary with its keys corresponding to bucket numbers in $[0,B-1]$ and its entries to lists of named tuples.\n",
"\n",
"## Part (a)\n",
"### Part (a.i)\n",
"**_[4 points]_**\n",
"\n",
"It's now time to implement your own hash join! You only need to implement the merge phase of the hash join. The output of the method should correspond to the result of a join between teams and colleges over the ***playername*** attribute. The partition phase is implemented. You need to fill in the merge phase.\n",
"\n",
"***Note: You should only use the two dictionaries t1Partition and t1Partition provide. No other data structures are allowed.***"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def hashJoin(table1, table2, hashfunction,buckets):\n",
" # Parition phase \n",
" t1Partition = partitionTable(table1,hashfunction,buckets)\n",
" t2Partition = partitionTable(table2,hashfunction,buckets)\n",
" # Merge phase\n",
" result = []\n",
" \n",
" # To populate your output you should use the following code(t1Entry and t2Entry are possible var names for tuples)\n",
" # result.append((t1Entry.teamname, t1Entry.playername, t2Entry.collegename))\n",
" for b in range(buckets):\n",
" for t1Entry in t1Partition[b]:\n",
" for t2Entry in t2Partition[b]:\n",
" if t1Entry.playername == t2Entry.playername:\n",
" result.append((t1Entry.teamname, t1Entry.playername, t2Entry.collegename))\n",
" return result\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a.ii)\n",
"**_[1 point]_**\n",
"\n",
"It time to evaluate your algorithm! The code provided below executes the join between teams and colleges and measures the total execution time. \n",
"What is the total number of entries output by your algorithm?\n",
"\n",
"Does the runtime of your algorithm seem reasonable? Provide a brief explanation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import time\n",
"start_time = time.time()\n",
"res1 = hashJoin(teams, colleges, h, buckets)\n",
"end_time = time.time()\n",
"duration = (end_time - start_time)*1000 #in ms\n",
"print 'The join took %0.2f ms and returned %d tuples in total' % (duration,len(res1))\n",
"\n",
"# No, the time of the join seems a bit longer than expected. part b and c explains why(skewed buckets)!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part (b)\n",
"\n",
"You decide to investigate the performance of `hashJoin( )` further. Since you implemented the merge phase of `hashJoin( )` yourself you focus on the partitioning obtained by using the provided hash function `h( )`. \n",
"In the lectures we saw that a good hash function should partition entries uniformly across buckets. We will now check if `h( )` is indeed a good function.\n",
"\n",
"The following code generates a histogram of the bucket sizes for table teams (using the above hash function `h` and 500 buckets) to help figure out what is going wrong. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Examine if this is a good partition function\n",
"def histogramPoints(partition):\n",
" ids = range(buckets)\n",
" items = []\n",
" for i in range(buckets):\n",
" if i in partition:\n",
" items.append(len(partition[i]))\n",
" else:\n",
" items.append(0)\n",
" return ids, items\n",
"\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Plot bucket histogram\n",
"buckets = 500\n",
"teamsPartition = partitionTable(teams,h,buckets)\n",
"ids, counts = histogramPoints(teamsPartition)\n",
"plt.plot(ids, counts)\n",
"plt.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (b.i)\n",
"**_[3 points]_**\n",
"\n",
"\n",
"Now find the skew associated with the above histogram. Skew is defined as the standard deviation of the number of entries in the buckets. A uniform hash function produces buckets of equal size, leading to 0 skew, but our candidate hash function h is imperfect so you should observe a positive skew."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# ANSWER\n",
"# partition- a table partition as returned by method partitionTable\n",
"# return value - a float representing the skew of hash function (i.e. stdev of chefs assigned to each restaurant)\n",
"def calculateSkew(partition):\n",
" # ANSWER STARTS HERE\n",
" skew = np.std([len(partition[p]) for p in partition])\n",
" # ANSWER ENDS HERE\n",
" return skew\n",
"\n",
"print calculateSkew(teamsPartition)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (b.ii)\n",
"\n",
"**_[1 point]_**\n",
"\n",
"Use python's hash function to see if you can produce a better (aka smaller) runtime for hash join. As at the beginning of part b, make a histogram of the bucket sizes (this time using the new hash function and 500 buckets). You can plot your histogram using the same code provided above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Design a better hash function and print the skew difference for \n",
"def hBetter(x,buckets):\n",
" rawKey = hash(x)\n",
" return rawKey % buckets\n",
"\n",
"# Plot bucket histogram\n",
"buckets = 500\n",
"teamsPartition = partitionTable(teams,hBetter,buckets)\n",
"ids, counts = histogramPoints(teamsPartition)\n",
"plt.plot(ids, counts)\n",
"plt.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part (b.iii)\n",
"\n",
"**_[1 point]_**\n",
"\n",
"Rerun your hash join algorithm with the new hash function you designed and 500 buckets.\n",
"Does the algorithm run faster? If so what is the speed-up you are observing?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"start_time = time.time()\n",
"\n",
"res1 = hashJoin(teams, colleges, hBetter, buckets)\n",
"\n",
"end_time = time.time()\n",
"duration = (end_time - start_time)*1000 #in ms\n",
"print 'The join took %0.2f ms and returned %d tuples in total' % (duration,len(res1))\n",
"\n",
"# around 100x!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part (c)\n",
"**_[0 points]_**\n",
"\n",
"For our internal sabermetrics purposes. Is Joe Flacco an elite quarterback? (True for elite, False for not elite)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"flacco_elite = False"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}