General Info

Visit Canvas for the Zoom lecture link.

Announcements

March 30, 2021

Welcome to CS166, a course in the design, analysis, and implementation of data structures. We've got an exciting quarter ahead of us - the data structures we'll investigate are some of the most beautiful constructs I've ever come across - and I hope you're able to join us.

CS166 has two prerequisites - CS107 and CS161. From CS107, we'll assume that you're comfortable working from the command line; designing, testing, and debugging nontrivial programs; manipulating pointers and arrays; using bitwise operators; and reasoning about the memory hierarchy. From CS161, we'll assume you're comfortable designing and analyzing nontrivial algorithms; using O, o, Θ, Ω, and ω notation; solving recurrences; working through standard graph and sequence algorithms; and structuring proofs of correctness.

We'll update this site with more information as we get closer to the start of the quarter. In the meantime, feel free to email me at htiek@cs.stanford.edu if you have any questions about the class!

Schedule and Readings

This syllabus is still under construction and is subject to change as we fine-tune the course. Stay tuned for more information and updates!

Tuesday	Thursday
Building Suffix Arrays June 1 Suffix trees and suffix arrays are amazing structures, but they'd be much less useful if it weren't possible to construct them quickly. Fortunately, there are some great techniques for building suffix arrays and suffix trees. By using the fact that suffixes overlap and simulating what a multiway merge algorithm would do in certain circumstances, we can rapidly build these beautiful structures. Slides: Lecture Slides Condensed Slides Readings: Ko, Pang and Aluru, Srinivas. Linear Time Construction of Suffix Arrays Nong, Ge, Zhang, Sen, and Chan, Wai Hong. Linear Suffix Array Construction by Almost Pure Induced Sorting
Tries and Suffix Trees May 25 To kick off our discussion of string data structures, we'll be exploring tries, Patricia tries, and, most importantly, suffix trees. These data structures provide fast solutions to a number of algorithmic problems and are much more versatile than they might initially seem. What makes them so useful? What properties of strings do they capture? And what intuitions can we build from them? Slides: Lecture Slides	Suffix and LCP Arrays May 27 What makes suffix trees so useful as a data structure? Surprisingly, much of their utility and flexibility can be attributed purely to two facts: they keep the suffixes sorted, and they expose the branching words in the string. By representing this information in a different way, we can get much of the benefit of suffix trees without the huge space cost. Slides: Lecture Slides Condensed Slides Readings: Manber, Udi and Myers, Gene. Suffix Arrays: A New Method for On-Line String Searches Kasai, Toru et al. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
Better than Balanced BSTs May 18 We've been operating under the assumption that a balanced BST that has worst-case O(log n) lookups is, in some sense, an "optimal" binary search tree. In one sense (worst-case efficiency) these trees are optimal. However, there are other perspectives we can take on what "optimal" means, and they counsel toward other choices of tree structures - weight-balanced trees, finger search trees, and Iacono's working set structure. Slides: Lecture Slides Condensed Slides Readings: Kurt Mehlhorn. Nearly Optimal Binary Search Trees. John Iacono. Alternatives to Splay Trees with O(log n) Worst-Case Access Times.	Splay Trees May 20 We've seen that it's possible to design BST variants whose performance exceeds the Ω(log n) barrier per operation on non-uniform access distributions. Astonishingly, there's a single type of BST, the splay tree, that provably meets all the guarantees we saw last time - and it just might possibly be the best possible BST, up to constant factors. Slides: Lecture Slides Condensed Slides Readings: Sleator, Daniel and Tarjan, Robert. Self-Adjusting Binary Search Trees.
Approximate Membership Queries, Part I May 11 Approximate membership query structures are ways of representing approximations of sets. They're used extensively in practice and are one of the most commonly used randomized data structures. This lecture explores the Bloom filter, the first (and still most popular) AMQ structure, and uses lower bounding techniques to find additional room for improvement. Slides: Lecture Slides Condensed Slides Readings: Bloom, Burton. Space/Time Tradeoffs in Hash Coding with Allowable Errors.	Approximate Membership Queries, Part II May 13 Bloom filters are fast and have great space usage, but can they be improved upon? The answer, in both a practical and theoretical sense, is "yes," and some of the data structures that do so were invented in the past decade. This lecture shows how to adapt cuckoo hash tables into an approximate membership query structure called the cuckoo filter, as well as how to use the same insights driving Bloom filters to build the XOR filter. Slides: Lecture Slides Condensed Slides Readings: Fan et al. Cuckoo Filter: Practically Better than Bloom. Graf, Thomas and Lemire, Daniel. XOR Filters: Faster and Smaller than Bloom and Cuckoo Filters. Stefan Walzer. Peeling Close to the Orientability Threshold - Spatial Coupling in Hashing-Based Data Structures.
Orthogonal Range Searching May 4 Imagine you've got a huge collection of points stored in 2D space. Maybe they're points on a map, or maybe they're points in an abstract feature space. You want to find all points in an axis-aligned rectangle, such as the viewport on a map window. How quickly can you do so? By using some clever techniques, we can make this type of search just as fast as in the 1D case. Slides: Lecture Slides Condensed Slides Handouts: Handout 11P: Problem Set 4 | (LaTeX Template) Handout 11I: Individual Assessment 4 | (LaTeX Template)	Planar Point Location May 6 The planar point location problem is the following: given a collection of borders on a map and a point p, which region of the map is p contained in? This question can be answered quickly and efficiently using persistent data structures, a family of data structures where each operation keeps the old version around while producing a new version. Slides: Lecture Slides Condensed Slides Readings: Sarnak, Neil and Tarjan, Robert E. Planar Point Location Using Persistent Search Trees.
Hashing and Sketching, Part I April 27 How can Google keep track of frequent search queries without storing all the queries it gets in memory? How can you estimate frequently occurring tweets without storing every tweet in RAM? As long as you're willing to trade off accuracy for space, you can get excellent approximations. Slides: Lecture Slides Condensed Slides Readings: Cormode, Graham and Muthukrishnan, S. An Improved Data Stream Summary: The Count-Min Sketch and its Applications.	Hashing and Sketching, Part II April 29 We've now seen how to build an estimator: make a simple data structure that gives a good chance of success, then run it in parallel. This idea can be extended to build frequency estimators with other properties, as well as to build estimators for how many distinct items we've seen. Slides: Lecture Slides Condensed Slides Readings: Charikar et al. Finding Frequent Items in Data Streams. Flajolet et al. HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm.
Fibonacci Heaps April 20 Fibonacci heaps are a type of priority queue that efficiently supports decrease-key, an operation used as a subroutine in many graph algorithms (Dijkstra's algorithm, Prim's algorithm, the Stoer-Wagner min cut algorithm, etc.). They're formed by a clever transformation on a lazy binomial heap. Although Fibonacci heaps have a reputation for being ferociously complicated, they're a lot less scary than they might seem! Slides: Lecture Slides Condensed Slides Readings: CLRS: Chapter 19 Fredman, Michael and Tarjan, Robert. Fibonacci Heaps and Their Uses in Improved Network Optimization Algorithms	Cuckoo Hashing April 22 Most hash tables give expected O(1) lookups. Can we make hash tables with no collisions at all, and if so, can we do it efficiently? Amazingly, the answer is yes. There are many schemes for achieving this, one of which, cuckoo hashing, is surprisingly simple to implement. The analysis, on the other hand, goes deep into properties of random graph theory. Slides: Lecture Slides Condensed Slides Readings: Pagh, Rasmus and Rodler, Flemming. Cuckoo Hashing Handouts: Handout 10P: Problem Set 3 | (LaTeX Template) Handout 10I: Individual Assessment 3 | (LaTeX Template)
Amortized Analysis April 13 In many cases we only care about the total time required to process a set of data. In those cases, we can design data structures that make some operations more expensive in order to lower the aggregate cost of all operations. How do you analyze these structures? Slides: Lecture Slides Condensed Slides Readings: CLRS: Chapter 17 Handouts: Handout 07P: Problem Set 2 | (LaTeX Template) Handout 07I: Individual Assessment 2 | (LaTeX Template) Pugh, William. Skip Lists: A Probabilistic Alternative to Balanced Trees Handout 08: Research Project Handout 09: Suggested Project Topics	Binomial Heaps April 15 Binomial heaps are a simple and flexible priority queue structure that supports efficient melding of priority queues. The intuition behind binomial heaps is particularly elegant, and they'll serve as a building block toward the more complex Fibonacci heap data structure that we'll talk about next Tuesday. Slides: Lecture Slides Condensed Slides Readings: Vuillemin, Jean. A Data Structure for Manipulating Priority Queues
Balanced Trees, Part I April 6 Balanced search trees are among the most versatile and flexible data structures. They're used extensively in theory and in practice. What sorts of balanced trees exist? How would you design them? And what can you do with them? Slides: Lecture Slides Condensed Slides Handouts: Problem Set 1 | LaTeX Template Individual Assessment 1 | LaTeX Template Readings: Bayer, Rudolf and McCreight, Edward. Organization and Maintenance of Large Ordered Indices Guibas, Leo and Sedgewick, Robert. A Dichromatic Framework for Balanced Trees	Balanced Trees, Part II April 8 Our last lecture concluded with a view of red/black trees as isometries of 2-3-4 trees. How far does this connection go? How can we use it to derive the rules for red/black trees? And now that we've got red/black trees, what else can we do with them? Slides: Lecture Slides Condensed Slides Readings: CLRS: Chapter 14.
Range Minimum Queries, Part One March 30 The range minimum query problem is the following: given an array, preprocess it so that you can efficiently determine the smallest value in a variety of subranges. RMQ has tons of applications throughout computer science and is an excellent proving ground for a number of advanced algorithmic techniques. Slides: Lecture Slides Condensed Slides Readings: Handout 00: Course Information Handout 01: CS166 Calendar Handout 02: Math Terms and Identities Handout 03: Assignment Policies Handout 04: CS166 and the Honor Code Handout 05: Individual Assessment 0	Range Minimum Queries, Part Two April 1 Our last lecture took us very, very close to a ⟨O(n), O(1)⟩-time solution to RMQ. Using a new data structure called a Cartesian tree in conjunction with a technique called the Method of Four Russians, we can adapt our approach to end up with a linear-preprocessing-time, constant-query-time solution to RMQ. In doing so, we'll see a number of clever techniques that will appear time and time again in data structure design. Slides: Lecture Slides Condensed Slides Readings: Fischer, Johannes and Heun, Volker. Theoretical and Practical Improvements on the RMQ-Problem, with Applications to LCA and LCE
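
To make the range minimum query problem from the March 30 entry concrete, here is a minimal Python sketch of one well-known approach, the sparse table, which uses O(n log n) preprocessing time and O(1) query time. It is an illustration only, not necessarily the construction presented in lecture, and it is not the ⟨O(n), O(1)⟩ solution mentioned in the April 1 entry. The function names build_sparse_table and range_min and the sample array are assumptions chosen for this example.

```python
def build_sparse_table(values):
    """Precompute the minimum of every range whose length is a power of two.

    table[k][i] holds min(values[i : i + 2**k]).
    """
    n = len(values)
    table = [list(values)]  # level 0: ranges of length 1
    k = 1
    while (1 << k) <= n:
        prev = table[k - 1]
        half = 1 << (k - 1)
        # A range of length 2**k is the union of two ranges of length 2**(k-1).
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - (1 << k) + 1)])
        k += 1
    return table


def range_min(table, lo, hi):
    """Return the minimum of values[lo..hi] (inclusive), for 0 <= lo <= hi < n."""
    # Largest power of two that fits in the range.
    k = (hi - lo + 1).bit_length() - 1
    # Two (possibly overlapping) power-of-two ranges cover [lo, hi] exactly.
    return min(table[k][lo], table[k][hi - (1 << k) + 1])


values = [31, 41, 59, 26, 53, 58, 97, 93]  # hypothetical sample data
table = build_sparse_table(values)
print(range_min(table, 2, 5))  # prints 26
```

The overlap between the two ranges in range_min is harmless because taking a minimum is idempotent: counting an element twice does not change the answer. That property is what lets the query run in constant time.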