Announcements

Midterm Exam Graded
June 8, 2018

The TAs and I have finished grading the midterm exam. The exams themselves are available for pickup in the Gates 1B Wing in a set of filing cabinets near Keith's office. (Check the drawer labeled "CS166 Midterms" under the pictures of all the CS106 course staff.)

We've put together a handout with solutions and statistics, along with more information about why we asked each question and some of the more common mistakes we saw when grading.

We were thoroughly impressed with this quarter's batch of final projects. Thanks for putting in such an amazing effort. We're hoping to get back to you with feedback early next week.

Project Topics Assigned
May 11, 2018

We've finished our matchmaking and are happy to report that every team managed to get either their first-choice or second-choice topic for the final project. Here's a link to the list of assigned topics. Best of luck - we're excited to see what you come up with!

Office Hours
April 6, 2018

We'll be holding three sets of office hours each week:

  • Mondays, 5PM - 7PM, in the Huang Basement (Sam and Ben)
  • Wednesdays, 2PM - 4PM, in Gates 178 (Keith)
  • Fridays, 1PM - 3PM, in the Huang Basement (Rafa and Mitchell)

Come with questions, leave with answers!

Welcome to CS166!
March 29, 2018

Welcome to CS166, a course in the design, analysis, and implementation of data structures. We've got an exciting quarter ahead of us - the data structures we'll investigate are some of the most beautiful constructs I've ever come across - and I hope you're able to join us.

CS166 has two prerequisites - CS107 and CS161. From CS107, we'll assume that you're comfortable working from the command-line; designing, testing, and debugging nontrivial programs; manipulating pointers and arrays; using bitwise operators; and reasoning about the memory hierarchy. From CS161, we'll assume you're comfortable designing and analyzing nontrivial algorithms; using O, o, Θ, Ω, and ω notation; solving recurrences; working through standard graph and sequence algorithms; and structuring proofs of correctness.

We'll update this site with more information as we get closer to the start of the quarter. In the meantime, feel free to email me at htiek@cs.stanford.edu if you have any questions about the class!

 

Schedule and Readings

This syllabus is still under construction and is subject to change as we fine-tune the course. Stay tuned for more information and updates!

(Lectures meet on Tuesdays and Thursdays.)
Where to Go from Here
June 5

Whew! We've covered a lot of ground this quarter. Did you ever think you'd learn as much as you did in these ten action-packed weeks?

This final lecture serves as a recap of the different techniques that we've seen, both from a practical and philosophical perspective, and is a chance to ask any remaining questions you might have.

Slides:

Fusion Trees, Part II
May 29

The sardine tree we developed in our last lecture gives a fast ordered dictionary data structure for small keys. By harnessing its key insight - B-tree lookups can be sped up by improving rank calculations at each node - and combining it with some insights about integers and Patricia tries, we can build the fusion tree, which works for any integers that fit into a machine word.

Slides:

Dynamic Graphs
May 31

In real-world networks, links come online and go offline all the time. Euler-tour trees can maintain connectivity in such networks as long as those networks are forests, and with some extra machinery layered on top of them, we can maintain connectivity in arbitrary network topologies.

Readings:

Slides:

x-Fast and y-Fast Tries
May 22

Balanced BSTs are excellent general-purpose data structures for storing sorted collections of elements. But what happens if we know those elements are integers? By taking advantage of their structure and some nice properties of machine words, we can exponentially improve their performance.

Readings:

Slides:

Handouts:

Fusion Trees, Part I
May 24

Memory inside a computer is split apart into individual machine words, with primitive operations like addition, subtraction, and multiplication taking constant time. This means that, if we can pack multiple pieces of data into a single machine word, we can essentially get a limited form of parallel computation. Through a variety of very clever tricks, we can use this to speed up searching and sorting.

This first lecture explores word-level parallelism through a data structure for very small integers and a technique for computing most-significant bits in time O(1). It will lay the groundwork we'll use next time when exploring fusion trees.
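As a small taste of word-level parallelism, here's a sketch in Python of the packed-comparison trick (the function name and slot layout are our own illustrative choices, and Python's big integers stand in for a machine word): by reserving one spare "test" bit per field, a single subtraction compares every packed value against a query at once.

```python
def rank_packed(values, q, b=7):
    """Count how many of `values` (each < 2**b) are less than q, using
    one word-level subtraction instead of a loop of comparisons."""
    slot = b + 1                      # b data bits + 1 spare test bit
    k = len(values)
    A = 0
    for i, v in enumerate(values):    # pack all values into one integer
        A |= v << (i * slot)
    H = sum(1 << (i * slot + b) for i in range(k))   # the test bits
    Q = sum(q << (i * slot) for i in range(k))       # q broadcast to each slot
    # After (A | H) - Q, a slot's test bit survives exactly when its
    # value is >= q; no borrows cross slots since every slot stays positive.
    diff = (A | H) - Q
    return k - bin(diff & H).count("1")
```

On real hardware this is a constant number of word operations, which is the "limited parallel computation" described above.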

Readings:

Slides:

Linear Probing
May 15

Linear probing is one of the oldest and simplest strategies for building a hash table. It's easy to implement and, for a number of reasons, is extremely fast in practice. The classical analysis of linear probing is due to Knuth and assumes truly random hash functions. But what happens if we relax this assumption and say that the hash functions aren't truly random but are, instead, just k-independent for some fixed k? Then the analysis gets a lot more challenging, but also a lot more interesting. In exploring how the math works out, we'll get a richer understanding of how to analyze randomized data structures.
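To make the setup concrete, here's a minimal linear-probing table sketched in Python. The class and method names are our own, Python's built-in hash stands in for the k-independent families analyzed in lecture, and resizing and deletion are omitted to keep the sketch short.

```python
class LinearProbingTable:
    """Minimal linear-probing hash table (insert/lookup only)."""

    def __init__(self, capacity=16):
        self.slots = [None] * capacity

    def _probe(self, key):
        # Start at the hashed slot and walk forward until we find the
        # key or an empty slot. (No resizing: assumes the table isn't full.)
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def insert(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def lookup(self, key):
        entry = self.slots[self._probe(key)]
        return entry[1] if entry else None
```

The clustering behavior of those forward scans is exactly what the k-independence analysis quantifies.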

Readings:

Slides:

Cuckoo Hashing
May 17

Most hash tables give expected O(1) lookups. Can we build hash tables with no collisions at all, and if so, can we do it efficiently? Amazingly, the answer is yes. There are many schemes for achieving this, one of which, cuckoo hashing, is surprisingly simple to implement.
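Here's a sketch of the idea in Python (our own illustrative code, not the lecture's reference implementation): two tables, two hash functions, lookups that probe at most two slots, and inserts that "kick out" whatever is in the way. Salted calls to Python's built-in hash stand in for a proper universal hash family.

```python
import random

class CuckooTable:
    """Sketch of cuckoo hashing: every key lives in one of exactly two
    candidate slots, so lookups are worst-case O(1)."""

    def __init__(self, capacity=64):
        self.tables = [[None] * capacity, [None] * capacity]
        self.salts = [random.random(), random.random()]

    def _slot(self, which, key):
        return hash((self.salts[which], key)) % len(self.tables[which])

    def lookup(self, key):
        for which in (0, 1):          # at most two probes, ever
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value, max_kicks=32):
        entry, which = (key, value), 0
        for _ in range(max_kicks):
            i = self._slot(which, entry[0])
            # Place the entry; whoever was there gets evicted and must
            # move to its other candidate slot.
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which
        raise RuntimeError("probable cycle; rehash with new functions")
```

A real implementation would rehash with fresh hash functions when the eviction chain cycles; the sketch just raises.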

Readings:

Slides:

Handouts:

Splay Trees
May 8

Balanced binary search trees give worst-case O(log n) times on each tree operation. If we're trying to guarantee worst-case efficiency, this is as good as it gets. But what if we're trying to guarantee average-case efficiency? In that case, it's possible to improve upon the bounds given by balanced BSTs. In this lecture, we'll explore statically optimal BSTs and an amazing data structure called the splay tree that very well may be the best possible binary search tree.

Readings:

Slides:

Handouts:

Count-Min and Count Sketches
May 10

How can Google keep track of frequent search queries without storing all the queries it gets in memory? How can you estimate frequently-occurring tweets without storing every tweet in RAM? As long as you're willing to trade off accuracy for space, you can get excellent approximations.
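As a sketch of one of these structures, here's a count-min sketch in Python (our own illustrative code; salted uses of the built-in hash stand in for pairwise-independent hash families): d rows of w counters, where estimates never undercount and overshoot only slightly with high probability.

```python
import random

class CountMinSketch:
    """Approximate frequency counts in O(width * depth) space,
    regardless of how many distinct items stream past."""

    def __init__(self, width=272, depth=5):
        self.counts = [[0] * width for _ in range(depth)]
        self.salts = [random.random() for _ in range(depth)]

    def add(self, item, amount=1):
        # Bump one counter per row; collisions only inflate counts.
        for salt, row in zip(self.salts, self.counts):
            row[hash((salt, item)) % len(row)] += amount

    def estimate(self, item):
        # Every row overestimates, so the minimum is the tightest bound.
        return min(row[hash((salt, item)) % len(row)]
                   for salt, row in zip(self.salts, self.counts))
```

The one-sided error (estimates are always at least the true count) is what makes heavy-hitter detection with this structure so clean.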

Readings:

Slides:

Binomial Heaps
May 1

Binomial heaps are a simple and flexible priority queue structure that supports efficient melding of priority queues. The intuition behind binomial heaps is particularly elegant, and they'll serve as a building block toward the more complex Fibonacci heap data structure that we'll talk about on Thursday.

Readings:

Slides:

Handouts:

Fibonacci Heaps
May 3

Fibonacci heaps are a type of priority queue that efficiently supports decrease-key, an operation used as a subroutine in many graph algorithms (Dijkstra's algorithm, Prim's algorithm, the Stoer-Wagner min cut algorithm, etc.) They're formed by a clever transformation on a lazy binomial heap. Although Fibonacci heaps have a reputation for being ferociously complicated, they're a lot less scary than they might seem!

Readings:

Slides:

Balanced Trees, Part II
April 24

We've spent a lot of time trying to figure out how to build nice balanced trees. Now that we've got them, what else can we do with them? In this lecture, we'll see two more advanced techniques - tree augmentation and the split/join operations - that will make it possible to build significantly more complex data structures in the future.
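As a taste of tree augmentation, here's a sketch of an order-statistics tree in Python: each node stores its subtree size, which makes "find the k-th smallest key" fast. (Rebalancing is omitted to keep the sketch short; in lecture the same augmentation rides on top of a balanced tree.)

```python
class Node:
    """BST node augmented with the size of its subtree."""
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def insert(root, key):
    if root is None:
        return Node(key)
    root.size += 1                  # maintain the augmentation on the way down
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def select(root, k):
    """Return the k-th smallest key (0-indexed) using subtree sizes."""
    left_size = root.left.size if root.left else 0
    if k < left_size:
        return select(root.left, k)
    if k == left_size:
        return root.key
    return select(root.right, k - left_size - 1)
```

The general pattern: pick a value computable from a node's children, maintain it during updates, and new queries fall out for free.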

Readings:

  • CLRS, Chapter 14.

Slides:

Handouts:

Amortized Analysis
April 26

In many cases, we only care about the total time required to process a sequence of operations, not the cost of any one operation. In those cases, we can design data structures that make some individual operations more expensive in order to lower the total cost of the sequence in aggregate. How do you analyze these structures?
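The classic warm-up example is a binary counter: a single increment may flip many bits, but the aggregate method shows that n increments flip fewer than 2n bits in total, so each increment costs O(1) amortized. A sketch:

```python
def increment(bits):
    """Increment a binary counter stored as a list of bits (LSB first).
    Returns the number of bits flipped by this one increment."""
    i = 0
    while i < len(bits) and bits[i] == 1:
        bits[i] = 0                 # the carry ripples through trailing 1s
        i += 1
    if i == len(bits):
        bits.append(1)
    else:
        bits[i] = 1
    return i + 1

bits, total_flips = [], 0
for _ in range(1000):
    total_flips += increment(bits)
# total_flips stays below 2 * 1000, matching the aggregate bound
```

Any single call can flip Θ(log n) bits, yet the average over the whole sequence is constant, which is exactly the distinction amortized analysis captures.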

Readings:

  • CLRS: Chapter 17

Slides:

Constructing Suffix Arrays
April 17

Suffix trees and suffix arrays are amazing structures, but they'd be much less useful if it weren't possible to construct them quickly. Fortunately, there are some great techniques for building suffix arrays and suffix trees. What are they? How do they work? Prepare to see some of the craziest and most beautiful algorithms in all of stringology!
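As a baseline, the naive construction just sorts the suffixes directly. This is far slower than the algorithms covered in lecture (it can cost O(n² log n) because of long shared prefixes), but it makes the object itself concrete:

```python
def suffix_array(text):
    """Naive suffix-array construction: sort the starting indices of all
    suffixes by comparing the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])

# suffix_array("banana") -> [5, 3, 1, 0, 4, 2]
# (suffixes in sorted order: a, ana, anana, banana, na, nana)
```

The lecture's algorithms get the same output in linear time by cleverly avoiding those repeated suffix comparisons.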

Readings:

Slides:

Handouts:

Balanced Trees, Part I
April 19

Balanced search trees are among the most versatile and flexible data structures. They're used extensively in theory and in practice. What sorts of balanced trees exist? How would you design them? And what can you do with them?

Readings:

Aho-Corasick Automata
April 10

Suppose you have a fixed set of strings and you'd like to search for all copies of those strings in a larger text corpus. How could you do so efficiently? By using a combination of tries and techniques from finite automata, it's possible to solve this problem in linear time.
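Here's a compact sketch of that construction in Python (function names and the dictionary-based state representation are our own choices): build a trie of the patterns, then add failure links with a breadth-first pass, giving a single-scan matcher.

```python
from collections import deque

def build_automaton(patterns):
    # Phase 1: build a trie of the patterns.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in trie[state]["next"]:
                trie[state]["next"][ch] = len(trie)
                trie.append({"next": {}, "fail": 0, "out": []})
            state = trie[state]["next"][ch]
        trie[state]["out"].append(p)

    # Phase 2: compute failure links breadth-first, shallower states first.
    queue = deque(trie[0]["next"].values())
    while queue:
        s = queue.popleft()
        for ch, t in trie[s]["next"].items():
            f = trie[s]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[t]["fail"] = trie[f]["next"].get(ch, 0)
            trie[t]["out"] += trie[trie[t]["fail"]]["out"]
            queue.append(t)
    return trie

def find_matches(trie, text):
    """Report (start_index, pattern) for every match, in one pass."""
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in trie[state]["next"]:
            state = trie[state]["fail"]
        state = trie[state]["next"].get(ch, 0)
        for p in trie[state]["out"]:
            hits.append((i - len(p) + 1, p))
    return hits
```

Because failure links only ever move to shallower states, the total work over the scan is linear in the text length plus the number of matches.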

Readings:

Slides:

Handouts:

Suffix Trees and Suffix Arrays
April 12

The suffix tree and suffix array are probably the most powerful and versatile data structures for string processing that have ever been invented. What is a suffix tree? What is a suffix array? Why are they so useful? And what techniques can be used on them?

Readings:

Range Minimum Queries, Part One
April 3

The range minimum query problem is the following: given an array, preprocess it so that you can efficiently determine the smallest value in a variety of subranges. RMQ has tons of applications throughout computer science and is an excellent proving ground for a number of advanced algorithmic techniques.
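One of the stepping stones toward the final solution, sketched here in Python, is the sparse table: precompute the minimum of every power-of-two-length range, then answer any query by overlapping two such blocks, an ⟨O(n log n), O(1)⟩ approach.

```python
def build_sparse_table(arr):
    """Precompute table[j][i] = min(arr[i : i + 2**j]) for each power of two."""
    table = [list(arr)]
    j = 1
    while (1 << j) <= len(arr):
        prev, half = table[-1], 1 << (j - 1)
        table.append([min(prev[i], prev[i + half])
                      for i in range(len(arr) - (1 << j) + 1)])
        j += 1
    return table

def rmq(table, lo, hi):
    """Minimum of arr[lo:hi]: cover the range with two (possibly
    overlapping) power-of-two blocks -- overlap is harmless for min."""
    j = (hi - lo).bit_length() - 1
    return min(table[j][lo], table[j][hi - (1 << j)])
```

Since min is idempotent, the two blocks may overlap freely, which is what makes the constant-time query possible.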

Slides:

Readings:

Range Minimum Queries, Part Two
April 5

Our last lecture took us very, very close to a ⟨O(n), O(1)⟩-time solution to RMQ. Using a new data structure called a Cartesian tree in conjunction with a technique called the Method of Four Russians, we can adapt our approach to end up with a linear-preprocessing-time, constant-query-time solution to RMQ. In doing so, we'll see a number of clever techniques that will appear time and time again in data structure design.
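Here's a sketch of the linear-time, stack-based Cartesian tree construction in Python (returning an array of parent links is our own representational choice). Two arrays have identical RMQ answers exactly when their Cartesian trees match, which is the fact the Four Russians block decomposition exploits.

```python
def cartesian_tree(arr):
    """Build the Cartesian tree of arr in O(n); returns parent indices,
    with parent[root] == -1 at the minimum element."""
    parent = [-1] * len(arr)
    stack = []                         # indices whose values are increasing
    for i in range(len(arr)):
        last = -1
        while stack and arr[stack[-1]] > arr[i]:
            last = stack.pop()         # popped run becomes i's left subtree
        if last != -1:
            parent[last] = i
        if stack:
            parent[i] = stack[-1]
        stack.append(i)
    return parent
```

Each index is pushed and popped at most once, giving the O(n) bound.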

Readings: