Syllabus

CS166 Course Syllabus

This course is a deep dive into the design, analysis, implementation, and theory of data structures. Over the course of the quarter, we’ll explore fundamental techniques in data structure design (isometries, amortization, randomization, etc.). In doing so, we’ll see a number of classic data structures, as well as some more modern ones. By the time we’ve finished, we’ll have seen some truly beautiful strategies for solving problems efficiently.

There are two main reasons this is a course on advanced data structures. First, the data structures we'll cover make use of highly specialized techniques (the Method of Four Russians, isometries, amortized analysis, etc.) that aren't covered in classic data structures courses. These techniques are beautiful in their own right and give a sense for just how much creativity is possible when solving problems with computers. Second, the data structures we'll cover often focus on more specialized applications (maintaining connectivity information in a changing graph, locating points in 2D space, approximating the number of items in a set, etc.) than what would be covered in a traditional introduction to data structures.

Because this is a more advanced course, our goal is not just to share specific data structures with you, but to give you an appreciation for how those data structures were designed and to equip you to develop your own in the future. Our hope is that, after each topic, you feel as though you yourself could have invented the data structures we explored.

Although this course leans heavily into Theoryland, it will involve a decent amount of coding. Implementing data structures is one of the best ways to really get how they work and see what makes them tick. It's also an objective test to make sure you fully understand the nuances. We'll use C++ this quarter, as it combines the ability to do low-level memory management and bit manipulation with the ability to encode higher-level abstractions.

Teaching Team

If you have questions on the course content, the easiest way to reach us through our course EdStem forum (details below). If you have any logistical questions or concerns, the best way to reach us is through the staff mailing list, cs166-spr2425-staff@lists.stanford.edu.

Websites and Technology

The main CS166 website is where you are right now, cs166.stanford.edu. We have links to a bunch of other tools here. Here's the quick rundown:

Our main course website cs166.stanford.edu is the main hub for course information. It contains links to everything you’ll need.
You’ll submit written work using GradeScope and code through the myth machines.
We use Ed as our Q&A forum.
You may find it helpful to use Overleaf to typeset your problem sets. (Your @stanford.edu email address gives access to “Pro” features.)

Prerequisites

CS166 is a course in advanced data structures and is intended for advanced undergraduates and beginning graduate students. If you are looking for a traditional CS course that functions as an introduction to data structures, we recommend checking out CS106B.

The prerequisites for CS166 are CS161 and CS107.

From CS161, you should be able to design and analyze nontrivial algorithms and write proofs of correctness. You should be comfortable using asymptotic notation (o, O, Θ, Ω, and ω), solving recurrence relations, manipulating inequalities, and simplifying summations. We'll also expect that you're comfortable with divide-and-conquer algorithms, greedy algorithms, and dynamic programming; that you're familiar with randomized algorithms (and related concepts like universal families of hash functions); and that you're comfortable writing correctness proofs for algorithms of each of these types. You should also feel comfortable with standard algorithms like Dijkstra's algorithm, Prim's algorithm, quicksort, mergesort, etc. and feel comfortable with their runtimes and proofs of correctness.

The CS161 prerequisite, by transitivity, also means we assume you have the equivalent of CS103 (discrete mathematics, automata, and proofwriting) and CS109 (probability and basic combinatorics) as well. If you have never written a formal mathematical proof, or if the phrase “linearity of expectation” doesn’t ring a bell, you may want to come talk to us before jumping into CS166.

From CS107, we expect that you're comfortable writing and testing nontrivial programs and working from the command line. You should also feel comfortable with binary representations of numbers. We'll expect that you've at least heard of the memory hierarchy and are comfortable with the idea that not all memory accesses take the same amount of time. Additionally, we expect that you'll be comfortable writing code in C++.

If you're unsure whether CS166 is the right place for you, please feel free to get in touch with the course staff.

Lectures

Our lectures are Tuesdays and Thursdays from 1:30PM - 2:50PM in 370-370. Lectures are not recorded this quarter.

5% of your grade is reserved for lecture participation. Here's the details. Starting on Tuesday, April 8th, we'll ask a small number of practice questions through the PollEV polling system. You will earn participation credit if your answer those questions, regardless of whether your answers are right. If you would prefer not to attend lecture in person, in Week 4 we will send out a form that will give you the option to instead count your final exam score in place of your lecture participation grade.

To allow for some flexibility, you're allowed to miss the credit opportunity (i.e. not completing PollEV) for two lectures over the course of the quarter and still earn the full 5%.

Units

CS166 is offered for either three or four units. Undergraduates are required to enroll for four units, while graduate students can enroll for either three or four units. The course content and requirements are the same in the three-unit and four-unit versions of the course and the unit flexibility is purely to help graduate students stay under unit limits.

Problem Sets

There will be six problem sets over the course of the quarter. The first, Problem Set 0, is meant as a refresher on key concepts and must be completed individually. The remainder focus on the topics from the course and can be completed either individually or in pairs.

You are welcome to choose problem set partners however you’d like. Our only policy is that if you start working with someone on one of the assignments, you need to finish that assignment jointly with them. (You can switch partners between assignments however you’d like.)

Submitting Work

This quarter, we will be using GradeScope to handle written assignment submissions. By virtue of being enrolled in CS166, you should have access to the course GradeScope page.

GradeScope only accepts electronic submissions. You are required to type your assignment solutions and submit them as a PDF; scans of handwritten solutions will not be accepted. $\LaTeX$ is a great way to type up solutions.

When submitting on GradeScope, if you’re working with a partner, please list both of your names on GradeScope in addition to on the PDF itself. To do so, have one person submit, then, after the submission completes, have them add the other student’s name to the submission. Since we rely on GradeScope for our final grading spreadsheet, if you forget to include your partner on the submis - sion – or if your partner forgets to list you on the submission – then only one person will get credit for the assignment. We strongly recommend that you always check to make sure that your assign- ment was submitted correctly, especially if you weren't the one submitting it, just in case your part - ner forgot to list you.

Programming questions are submitted separately from written answers. You should submit your code electronically by sshing into myth, cding into the directory containing your solution files, then running

/usr/class/cs166/bin/submit

to submit your work. You'll be prompted for your name, whether you worked with a partner, and the problem set number. We'll test your code on the myth machines, so please make sure that your code works correctly there before submitting.

Late Policy

All problem sets in this class are due at 1PM Pacific time on their due dates. Each student has three “late days” that they may use throughout the quarter. Each late day automagically extends an assignment deadline by 24 hours. You may use at most two late days on any one assignment, and they’re charged automatically; you don’t need to get our advance approval before using them. Just submit late.

If an emergency arises and you will need more time to complete an assignment, contact the course staff over email (cs166-spr2425-staff@lists.stanford.edu) ~~no more than~~ at least 24 hours before the normal assignment deadline to request an extension.

No work may be submitted more than 48 hours after the stated deadline without prior approval by the course staff. If you submit an assignment fewer than 48 hours late but have run out of late days, your score on that assignment will be capped at 70%. That means it’s still better to submit late than not at all. Late days are tracked per student rather than per pair. So, for example, suppose you and your partner submit an assignment 12 hours late. You have a late day left and your partner does not. You’ll then be charged that late day and will receive the full score for your assignment, but your partner will be capped at 70% on the assignment score.

Late days, once used, can’t be shifted to other assignments.

EdStem Policy

We have an EdStem forum where you can ask questions on the problem sets and search for partners. You're welcome to ask questions online, and the course staff and other students can then provide answers. Please exercise discretion when asking questions that might give away the answers to assignment ques- tions. If you'd like to ask a question that you think would give away too much information about the solu - tion to a problem, post your question privately.

Collaboration Policy

You are allowed to work on the problem sets individually or in pairs. We grade assignments uniformly re- gardless of whether they’re submitted individually or jointly. You are not required to work with the same people on each problem set – you're welcome to work in a pair on one problem set, individually on the next, in a pair with a different partner the next time, etc. If you do work in a pair, please note that both members of the pair are responsible for ensuring that each assignment is completed and submitted on time.

If you submit in a pair, submit a single set of solutions. Both members of the pair will earn the same grade on the problem set. That way, two or more TAs don't accidentally end up grading the same submis - sion multiple times.

For more details about collaborating with other students, please read over our Honor Code policy.

Coding Expectations

You’re expected to write beautiful, well-commented, well-decomposed code. This is both to make it easier for you to debug and so that it’s easier for the TAs to review your code. If you’re the sort of person that likes step-by-step checklists, here’s a bare minimum set of requirements for any code you submit:

Comment your code. It is difficult to read code someone else wrote if that person didn’t leave comments describing what it is that they were trying to do. At a bare minimum, you should include comments describing what all your helper functions and helper types do, how edge cases are handled, etc. Ideally, you should also include comments on any dense sections of code explaining what that code does. If there are any spots where your code is handling some particular task in an unusual or non-obvious way, please leave some notes so we know what you’re doing.
Decompose problems. There’s a bad tendency in algorithms and data structures contexts to see hundred-line monster functions. Please do not do this. Break larger pieces of code apart into smaller, bite-sized chunks that all perform a single task. If you find yourself working with some concept that’s logically separate from other concepts, make it into its own type and use that type where appropriate.
Use clear variable names. There’s a tension between the mathematical side of things, where single-letter variable names are the norm, and the programming side of things, where single-letter variable names make things nigh impossible to read. With the exception of loop indices in cases where the index can easily be inferred from context, please do not use single-letter names. Describe what your variables represent, and do so in a way that makes the code lucid.
Remove dead code. It’s fine to sprinkle some cout or printf statements throughout your code when you’re testing things. It’s also reasonable to comment out an old implementation if you found a better way to achieve the same result. But please don’t submit code that has debug printouts in it or which has commented-out blocks. It makes things harder to read and can mess up our autograding infrastructure.
Don’t be cute, unless you need to. You’ll be writing code in C++, where it’s possible to write both elegant code and obfuscated labyrinths of twisted logic. Aim for clarity above efficiency un- less there’s a compelling reason not to. For example, please divide by two by writing value / 2, not value >> 1, unless there’s a specific reason that bit-shifts more clearly express your intent. You should never need to #define anything in this class, and you should especially not use #define to make shorthands for iterating over things. Language features like inline functions, const, enhanced for loops, etc. have mostly eliminated the need to do things like this.

If you do find yourself needing to pull all the stops out in order to improve performance, that’s fine. But be sure to extensively comment any aggressively-optimized sections of your code so that the TA reading over it can appreciate the beautiful ideas you used to squeeze out a little bit more efficiency. In the past, we’ve had people turn in optimized code that neither we nor the code’s original authors could fully understand. That’s not good for anyone.

Here’s an example of a piece of code that we would consider to be totally beautiful. It’s an implementa - tion of binary search over an array:
```
/* Returns whether the element key exists in the sorted vector elems in the
 * index range [low, high). Note that low is inclusive, while high is exclusive.
 */
bool binarySearchRec(const std::vector<int>& elems, int key,
                     size_t low, size_t high) {
  /* Base case: If the range is empty (low >= high), the key does not exist in
   * the range [low, high).
   */
  if (low >= high) return false;
    
  /* Probe the middle element. The use of low + (high - low) / 2 is to avoid
   * integer overflow that could result from computing (low + high) / 2.
   */
  size_t mid = low + (high - low) / 2;
 
  /* If the element is at the midpoint, we've found it. */
  if (key == elems[mid]) return true;

  /* Otherwise, discard half the elements and search
   * the appropriate section.
   */
  if (key < elems[mid]) {
    return binarySearchRec(elems, key, low, mid);
  } else {
    return binarySearchRec(elems, key, mid + 1, high);
  }
}
  
bool binarySearch(const std::vector<int>& elems, int key) {
  return binarySearchRec(elems, key, 0, elems.size());
}
```

Proofwriting Expectations

You’re expected to write proofs clearly and lucidly. Your proofs should not just convey the mathematical argument; they should guide the reader through your reasoning. We’re expecting that you’ll write in complete sentences and use mathematical notation where appropriate but not as a substitute for plain English. If you have a complex, multi-part proof, you should break it apart into smaller pieces and, ideally, include a brief introduction outlining the high-level approach you’re going to be taking in your argument. If drawing pictures would make things easier to understand, go for it! If doing a small worked example before writing the formal proof would clarify things, do that! Remember that the TAs have to be able to read and understand what you’re writing, and they really do want to hear what you have to say, so please try to make it easy for them. 😃

As an example, consider the following problem:

Consider a binary heap $B$ with $n$ elements, where the elements of $B$ are drawn from a totally-ordered set. Give the best lower bound you can on the runtime of any comparison-based algorithm for constructing a binary search tree from the elements of $B\text.$

Here is one possible solution:

Proof Idea: The lower bound is $\Omega(n \log n)\text,$ and this is a tight bound. We'll prove this by first showing that there's an $O(n \log n)$-time, comparison-based algorithm for constructing a BST from the elements of an $n$-element heap. Then, we'll show that any $o(n \log n)$-time, comparison-based algorithm for doing the conversion would make it possible to sort $n$ elements in time $o(n \log n)$ using only comparisons, which we know is impossible.

Proof: First, we'll show that there is an $O(n \log n)$-time, comparison-based algorithm for constructing a BST out of the elements of $B\text.$ Specifically, just iterate across the $n$ elements of $B$ and insert each into a balanced binary search tree. This does $O(n)$ insertions into a balanced binary search tree, each of which takes time $O(\log n)\text,$ for a net runtime of $O(n \log n)\text.$ This algorithm is also comparison-based because binary search tree insertion is comparison-based.

Next, we'll show that no $o(n \log n)$-time, comparison-based algorithm exists for constructing a BST from a binary heap. Assume for the sake of contradiction that such an algorithm exists. Then consider the following algorithm on an array of length $n\text:$

Construct a binary heap $B$ from the array elements in time $O(n)$ using the standard heapify algorithm.
Create a binary search tree $T$ from $B$ in time $o(n \log n)\text.$
Do an inorder traversal of $T$ and output the elements in the order visited in time $O(n)\text.$

Note that the runtime of this algorithm is $o(n \log n)\text,$ and each step is comparison-based. However, this algorithm will sort the elements of the array, because doing an inorder traversal over a BST will list off the elements of that BST in sorted order. This is impossible, since there is no $o(n \log n)$-time, comparison-based sorting algorithm. Therefore, no $o(n \log n)$-time, comparison-based algorithm exists for converting a binary heap into a binary search tree. $\qed$

Algorithm and Data Structure Expectations

If you’re designing and analyzing an algorithm or data structure, you should present the algorithm in the clearest way that you can. Here are our recommendations for how to do this:

Start at a high level, then go deeper. One of the most effective ways to convey a complex algorithm or data structure is in stages. Begin by detailing the idea at a very high level, then make a second pass over the approach and give more details, and finally drop down to low-level specifics. This approach lets the reader get context for the overarching idea behind your solution, then get more and more clarity on the approach as you go.
Use pseudocode only as a last resort. It is hard to understand an algorithm or data structure purely from pseudocode – ask anyone who’s TAed CS161 if you doubt us! Instead, give a high-level overview of the algorithm or data structure, pointing out any bits that you think the reader might find unexpected or unusual. Then, go a little lower-level, describing the steps in the algorithm. You should only use pseudocode if it is absolutely necessary to convey an idea.
Use the runtime analysis to fill in details. Some data structures or algorithms are based on a reasonable, intuitive idea that makes sense at a high level but gets trickier as you get into the details. One way to make these structures easier to explain is to describe the high-level operation of the data structure (“concatenate these lists,” “find the third-largest element in the binary heap”, etc.), and then to only talk specifics in the runtime analysis. For example, you might describe at a high level how you’ll maintain a collection of lists, then in the runtime analysis reveal that the lists are circularly, doubly-linked lists that store pointers to their maximum elements. Only divulging that detail in the runtime analysis keeps the discussion easier and leaves the tricky implementation details to the spots where they matter.

For example, consider the following problem:

Design a data structure that supports the following operations: insert(x), which inserts real number $x$ into the data structure and runs in time $O(\log n)\text,$ where $n$ is the number of elements in the data structure; and find-median(), which returns the median of the data set if it is nonempty and runs in time $O(1)\text.$

If you haven’t encountered this before, it’s a great problem! Take a minute to think through it, then check the solution below for an example of a writeup that we think is at the appropriate level of detail.

At a high level, the data structure consists of two binary heaps that each store roughly half of the elements. The first heap is a max heap that stores the smallest half of the elements, and the second heap is a min heap that stores the larger half of the elements. The median of the values can then be found by looking at the tops of the two heaps.

More specifically, we’ll maintain two heaps called left and right. The left heap is a max-heap that will store the smallest $\floor{\frac{n}{2}}$ elements in the data structure, where $n$ is the total number of elements inserted. The right heap is a min-heap that stores the largest $\ceil{\frac{n}{2}}$ largest elements in the data structure.

To implement the operation insert(x), we compare $x$ against the tops of the two heaps. If $x$ is less than or equal to the top of the left heap, then we insert $x$ into left. Otherwise, we insert $x$ into right. This ensures that all the elements of left are less than or equal to all the elements of right, but may result in there being too many elements in one of the two heaps. If this happens, then we either remove the maximum element from left and add it to right, or remove the minimum element of right and add it to left. Doing so doesn’t change the fact that all elements of left are less than or equal to the elements of right (since we’ve either moved the largest element of left to right or the smallest element of right to left), and ensures that the two heaps have the right number of elements.

(An edge case we need to handle: if either heap is empty, we insert into left. If the data structure was previously empty, this results in 1 element in left and 0 elements in right, matching the expected counts. If the data structure was not empty, then we must have had 1 element in left and 0 in right; otherwise, there would be an element in right. After adding to left we have 2 elements in left and 0 in right, and we can move an element from left to right as above.)

To implement the operation find-median(), we consider two cases. First, it might be the case that the number of elements $n$ in the heap is even. In that case, the median value is the average of the two elements closest to the center were the elements to be written in sorted order. Since in this case both left and right contain exactly $\frac{n}{2}$ elements, any element of left that’s greater than or equal to all the other elements of left could appear just before position $\frac{n}{2}$ in the sorted sequence, since that element is greater than or equal to half the elements (the elements of left) and less than or equal to half the elements (the elements of right). A similar argument establishes that any element less than or equal to all the elements of right could appear just after position $\frac{n}{2}$ in the sorted sequence. Therefore, we can compute the average over the maximum element of left and the minimum element of right to get the median.

On the other hand, if $n$ is odd, then the maximum element of left is the median. To see this, note that if $n = 2k+1\text,$ then there are $k+1$ elements in left and $k$ in right. The maximum element of left is then greater than or equal to $k$ elements (the smaller elements of left) and less than or equal to $k$ elements (the elements of of right), so it’s the median.

Looking at the runtime: each insert call does a heap enqueue, followed possibly by a heap dequeue and second enqueue. These heaps each have size $O(n)\text,$ so this takes time $O(\log n)\text.$ Each call to find-median looks at the tops of at most two heaps, which takes time $O(1)\text.$

Regrade Policies

We do our best in this course to grade as accurately and as thoroughly as possible. We understand how important it is for your grades to be fair and correct, especially since the graders' comments will be our main vehicle for communicating feedback on your progress. That said, we sometimes make mistakes while grading – we might misread what you've written and conclude that your reasoning is invalid, or we might forget that you proved a key result earlier in your answer. In cases like these – where we've misread or misinterpreted your proof – you're encouraged to contact the course staff and ask for a regrade. Regrade requests must be received within one week of the graded work being returned to you.

Exams

We will hold a final exam on Saturday, June 7 from 3:30PM – 6:30PM, location TBA. We will offer alternate exam times only for students with OAE accommodations.

Readings

The recommended reading for this course is Introduction to Algorithms, Fourth Edition by Cormen, Leiserson, Rivest, and Stein (CLRS). We understand that not everyone has a copy of this book or can get a copy, and that’s okay. There’s nothing in that textbook that you’ll absolutely need for the course, and it’s mostly there as a reference in case you’d like to look at certain topics in more depth.

Additionally, there will be a variety of readings posted online (papers, course notes, slides, articles, etc.) Check the website for details on the readings for each lecture. I will try to present the salient features of each data structure in lecture, so depending on your learning style, you may find it useful to do the readings right before or right after lecture.

A helpful note from the university:

“All students should retain receipts for books and other course-related expenses, as these may be qualified educational expenses for tax purposes. If you are an undergraduate receiving financial aid, you may be eligible for additional financial aid for required books and course materials if these expenses exceed the aid amount in your award letter. For more information, review your award letter or visit the Student Budget website.”

Withdraw / Incomplete Policy

If a serious emergency arises and you cannot complete the work in this course, you may contact Keith – not the TAs – to request an incomplete. We reserve incompletes for emergencies, so we do not grant incomplete grades for poor performance on the assignments or exams, nor do we offer incompletes for busy work schedules. Withdrawing is the appropriate option in those circumstances.

In order to be eligible for an Incomplete, University policy says you must have completed a “substantial” part of the course work in “satisfactory” fashion. This means that incompletes are appropriate for serious medical or family emergencies that occur late in the quarter, which prevent you from completing the course despite having done well up to that point.

Grading

Your raw score in CS166 is determined as follows:

\[\text{Raw Score } = \quad 0.45 \cdot \text{PSet Score} + 0.50 \cdot \text{Exam Score} + 0.05 \cdot \text{Participation Score}\text.\]

Your participation score is computed as

\[\text{Participation Score} \quad = \quad \min \set{1, \frac{\text{number of lectures where you answered PollEV}}{\text{number of lectures with PollEV questions} - 2}}\]

As an exception, if you explicitly opt to use your final exam score in place of your participation score, then your participation score will be equal to your raw final exam score.

We assign letter grades as follows. We first determine a grading curve over raw scores to assign initial letter grades. Historically, the median raw score has ended up somewhere near the B/B+ cutoff. We never assign letter grades that are lower than the decile of your raw score; for example, a 90% will never map to anything lower than an A-.

There are two exceptions to this rule. First, A+ grades are given only sparingly and at the discretion of the instructors, not with regard to raw course grades. Second, to earn a passing grade in CS166, your PSet Score and your Exam Score, as computed above, must each be passing work. (The actual numbers used to denote “passing work” are set at the discretion of the instructor. We will likely use 60% as a cutoff for passing work for problem sets and 50% as a cutoff for passing work for exams, though this is subject to change.) If your score in at least one area is below this cutoff, you will receive a non-passing grade. This rule is to ensure that you have demonstrated competency throughout the quarter. Historically, very few students are impacted by this rule, since usually having a non-passing score in either area results in having a low overall composite score.

Your final grade will be determined solely as mentioned above. We do not offer any make-up work.

Force Majeure

A note from the Center for Teaching and Learning:

Stanford as an institution is committed to the highest quality education, and as your teaching team, our first priority is to uphold your educational experience. To that end we are committed to following the syllabus as written here, including through short or long-term disruptions, such as public health emergencies, natural disasters, or protests and demonstrations. However, there may be extenuating circumstances that necessitate some changes. Should adjustments be necessary we will communicate clearly and promptly to ensure you understand the expectations and are positioned for successful learning.