Due Friday, February 26 at 11:30 am Pacific
- Submissions received by due date receive a small on-time bonus.
- All students are granted a pre-approved extension or "grace period" of 48 hours after the due date. Late submissions are accepted during the grace period with no penalty.
- The grace period expires Sun, Feb 28 at 11:30 am Pacific, after which we cannot accept further late submissions.
- In this course, we express all date/times in Pacific time GMT -8. Our Paperless submission system also displays/records due dates and submission times in Pacific time.
Parts of this assignment adapted from one by Julie Zelenski and Jerry Cain.
Thanks to Michael Chang for his input on streaming top-k.
We’ve spent a lot of time talking about searching, sorting, and runtime complexity. This assignment is designed to give you a sense of how to combine those ideas together in the service of something larger: diving deep into data sets. Over the course of this assignment, you’ll build out a set of algorithms and data structures for processing large data sets. Once you’ve gotten them working, you’ll get to see them in action as they power four data analyses.
You are welcome to work on this assignment in pairs.
This assignment has three parts:
- Array Exploration: This debugging exercise is designed to get you comfortable exploring arrays in memory and seeing what lies beyond them.
HeapPQueue
: This workhorse of a data structure is useful for finding the best objects of various types. It’s a powerful tool in its own right and can be used as a building block in other algorithms.- Streaming Top-k: A clever algorithm that finds the k best objects of a given type, and does so remarkably quickly.
As usual, we suggest making slow and steady progress. Here’s our recommended timetable:
- Aim to complete the debugging exercise within one day of this assignment going out.
- Aim to complete HeapPQueue within five days of this assignment going out.
- Aim to start Streaming Top-k within six days of this assignment going out.
Problem One: Beyond the Bounds of Arrays
The first part of this assignment deals with pointers and dynamic allocation, and that introduces some new types of errors you’ll need to learn to smoke out using the debugger. To get more familiar with what’s going on, we’re going to ask you to work the debugger to explore arrays and what you can find when you walk off the end of them.
Open the source file ExploreArrays.cpp
. That file contains a function named exploreArrays
. You’ll trace this function in the debugger. The reason we’d like you to explore this function is that we’d like you to see what arrays look like in memory in a controlled environment before you encounter memory issues “in the wild” (that is, when writing up your own code). It’s similar to how we wanted you to explore stack overflows in a simpler setting before turning you loose on writing your own recursive code.
Deliverables
Here’s what you need to do.
- Set a breakpoint at the top of
exploreArrays
inExploreArrays.cpp
. - Run the program with the debugger turned on, choose the “Explore Arrays” option from the top menu, then click “Go!” to trigger the breakpoint. Follow the instructions in
ExploreArrays.cpp
and write your answers inShortAnswers.txt
.
Problem Two: Priority Queues and Binary Heaps
Priority Queues
A priority queue is a modified queue in which elements are not dequeued in the order in which they were inserted. Instead, elements are removed from the queue in order of priority. For example, you could use a priority queue to model a hospital emergency room: patients enter in any order, but more critical patients are seen before less critical patients, regardless of how long the less-critical patients have been waiting. Similarly, if you were building a self-driving car that needed to process events, you might use a priority queue to respond to important messages (say, a pedestrian just walked in front of the car) before less important messages (say, a car switched on its turn signal).
For starters, here’s a refresher on the DataPoint
type, which you first saw in Assignment 5:
struct DataPoint {
string name; // Name of this data point; varies by application
int weight; // "Weight" of this data point. Points are sorted by weight.
};
Here’s the interface for the HeapPQueue
type (we’ll explain the name in a minute):
class HeapPQueue {
public:
HeapPQueue();
~HeapPQueue();
void enqueue(const DataPoint& data);
DataPoint dequeue();
DataPoint peek() const;
bool isEmpty() const;
int size() const;
void printDebugInfo();
private:
/* Up to you! */
};
Looking purely at the interface of this type, it sure looks like you’re dealing with a queue. You can enqueue elements, dequeue them, and peek at them. The difference between a priority queue and a regular queue is the order in which the elements are dequeued. In a regular Queue
, elements are lined up in sequential order, and calling dequeue()
or peek()
gives back the element added the longest time ago. In the HeapPQueue
, the element that’s returned by dequeue()
or peek()
is the element that has the lowest weight. For example, let’s imagine we set up a HeapPQueue
like this:
HeapPQueue hpq;
hpq.enqueue({ "Amy", 103 });
hpq.enqueue({ "Ashley", 101 });
hpq.enqueue({ "Anna", 110 });
hpq.enqueue({ "Luna", 161 });
If we write
DataPoint data = hpq.dequeue();
cout << data.name << endl;
then the string printed out will be Ashley
, since of the four elements enqueued, the weight associated with Ashley (101) was the lowest. Calling hpq.dequeue()
again will return {"Amy", 103}
, since her associated weight (103) was the lowest of what remains. Calling hpq.dequeue()
a third time would return {"Anna", 110}
.
Let’s insert more values. If we call hpq.enqueue({"Chioma", 103})
and then hpq.dequeue()
, the return value would be the newly-added {"Chioma", 103}
because her associated weight is lower than all others. And note that Chioma was the most-recently-added TA here; just goes to show you that this is quite different from a regular Queue
!
Binary Heaps
You’ll implement HeapPQueue
using a data structure called a binary heap, hence the name HeapPQueue
. Binary heaps are best explained by example. Below is a binary heap containing a collection of current and former TAs, each of whom is associated with a number corresponding to the class that they TAed for.
Let's look at the structure of this heap. Each value in the heap is stored in a node, and each node has zero, one, or two child nodes descending from it. For example, Ashley has two children (Kate and Amy) while Kali has just one child (Luna) and Julia has no children at all.
In a binary heap, we enforce the rule that every row of the heap, except for the last, must be full. That is, the first row should have one node, the second row two nodes, the third row four nodes, the fourth row eight nodes, etc., up until the last row. Additionally, that last row must be filled from the left to the right. You can see this in the above example – the first three rows are all filled in, and only the last row is partially filled. Here are two other examples of binary heaps, each of which obey this rule:
We also enforce one more property – no child node’s weight is less than its parent’s weight. All three of the heaps you've seen so far obey this rule. However, there are no guarantees about how nodes can be ordered within a row; as you can see from the examples, within a row the ordering is pretty much arbitrary.
To recap, here are the three rules for binary heaps:
- Every node in a binary tree has either zero, one, or two children.
- No child’s weight is less than the weight of its parent.
- Every row of the heap, except the last, is completely full, and the last row’s elements are as far to the left as possible.
Before moving on, edit the file ShortAnswers.txt
with answers to the following question.
Q3: Draw three different binary heaps containing the DataPoints given in the example on the right-hand side above (the one containing Jackie, Jermaine, etc.) Yes, we know that we’re asking you to draw pictures in a text file. Here’s an example of what that could look like:
Person A
Weight
/ \
Person B Person C
Weight Weight
/ \
Person D Person E
Weight Weight
Maya Ziv, section leader extraordinaire, suggests using this website to draw the binary heaps.
Binary Heap Enqueue ("Bubble-up")
It is easy to read off which element has the smallest weight in a binary heap – it's the one at the top. It is also efficient to insert an element into a binary heap. Suppose, for example, that we have this binary heap containing some famous satellites:
Let's add the San Marco, the first Italian satellite, to this heap with weight 1964. Since a binary heap has all rows except the last filled, the only place we can initially place San Marco is in the first available spot in the last row. This is as the left child of Japan’s first space probe 大隅, so we place the new node for San Marco there:
At this point, the binary heap is invalid because San Marco’s weight (1964) is less than that of 大隅 (1970). To fix this, we run a bubble-up step and continuously swap San Marco with its parent node until its weight is at least that of its parent’s. This means that we exchange San Marco and 大隅, shown here:
Since San Marco’s weight (1964) is greater than its parent’s weight (1957), it’s now in the right place, and we’re done. We now have a binary heap containing all of the original values, plus San Marco.
Let's suppose that we now want to insert Explorer, the first US space probe, into the heap with weight 1958. We begin by placing it at the next free location in the last row, as the right child of San Marco:
We then bubble Explorer up one level to fix the heap:
And, again, have a new heap containing these elements. As a final example, suppose that we want to insert 东方红, the first Chinese space probe, into this heap with weight 1970. We begin by putting it into the first free spot in the last row, which in this case is as the left child of the Israeli satellite אופק:
We now do a bubble-up step. We first swap 东方红 and אופק to get:
Notice that 东方红’s weight is still less than its new parent’s weight, so it’s not yet in the right place. We therefore do another swap, this time with the Indian satellite आर्यभट, to get
This step runs very quickly. With a bit of math we can show that if there are n nodes in a binary heap, then the height of the heap is at most O(log n), and so we need at most O(log n) swaps to put the new element into its proper place. Thus the enqueue step runs in time O(log n). That’s pretty fast! Remember that the base-2 logarithm of a million is about twenty, so even with a million elements we’d only need about twenty swaps to place things!
Before moving on, answer the following question in ShortAnswers.txt
:
Q4: Draw the binary heap formed by inserting these nine DataPoints into an empty binary heap using the algorithm described above. Specifically, insert those data points in the order that’s shown below. You only need to show your final answer, not your intermediate steps.
{ "R", 4 }
{ "A", 5 }
{ "B", 3 }
{ "K", 7 }
{ "G", 2 }
{ "V", 9 }
{ "T", 1 }
{ "O", 8 }
{ "S", 6 }
There are many binary heaps you can form that contain these elements, but only one of them will be what you get by tracing the algorithm.
Binary Heap Dequeue ("Bubble-down")
We now know how to insert an element into a binary heap. How do we implement dequeue? We know that the minimum-weight element of the binary heap is atop the heap, but we can't just remove it – that would break the heap into two smaller heaps. Instead, we use a more clever algorithm. First, we swap the top of the heap, the original Soviet satellite Спутник, with the rightmost node in the bottom row of the heap (אופק) as shown here:
Now, we can remove Спутник from the heap, which is the element we’ll return. We now have this:
Unfortunately, what we are left with isn’t a binary heap because the top element (אופק) is one of the highest-weight values in the heap. To fix this, we will use a bubble-down step and repeatedly swap אופק with its lower-weight child until it comes to rest. First, we swap אופק with Explorer to get this heap:
Since אופק is not at rest yet, we swap it with the smaller of its two children (San Marco) to get this:
And we're done. That was fast! As with enqueue, this step runs in time O(log n), because we make at most O(log n) swaps. This means that enqueuing n elements into a binary heap and then dequeuing them takes time at most O(n log n). This method of sorting values is called heapsort.
To confirm that this makes sense, answer this question in ShortAnswers.txt
:
Q5: Draw the binary heap that results from following this dequeue
procedure on the heap you drew in Q4 of this problem.
Binary Heaps As Arrays
How do we represent a binary heap in code? Amazingly, we can implement the binary heap using nothing more than a dynamic array. “An array‽,” you might exclaim. “How is it possible to store that complicated heap structure inside an array‽” The key idea is to number the nodes in the heap from top-to-bottom, left-to-right. For example, we might number the nodes of the previous heap like this:
This numbering system has some amazing properties:
- Given a node numbered n, its children (if any) are numbered 2n and 2n + 1.
- Given a node numbered n, its parent is numbered n / 2, rounded down.
You can check this yourself in the above tree. That's pretty cool, isn't it? The reason that this works is that the heap has a rigid shape – every row must be filled in completely before we start adding any new rows. Without this restriction, our numbering system wouldn't work.
Because our algorithms on binary heaps only require us to navigate from parent to child or child to parent, it's possible to represent binary heap using just an array. Each element will be stored at the index given by the above numbering system. Given an element, we can then do simple arithmetic to determine the indices of its parent or its children. For example, we’d encode the above heap as the following array:
The enqueue and dequeue algorithms we have developed for binary heaps translate beautifully into algorithms on the array representation. For example, suppose we want to insert Astérix, the first French satellite, into this binary heap with weight 1965. We begin by adding it into the heap, as shown here:
Notice that Astérix is at index 8, which is the last position in the array. This is not a coincidence; whenever you add a node to a binary heap, it always goes at the end of the array. (Do you see why?)
We then bubble Astérix up into its final position by repeatedly comparing it to its parent. Since Astérix is at position 8, its parent (आर्यभट) is at position 4. Since Astérix precedes आर्यभट, we swap them:
Astérix’s parent is now at position 2 (东方红), so we swap Astérix and 东方红 to get the final heap:
To confirm that this all makes sense, answer the following question in ShortAnswers.txt:
Q6: Draw the array representation of the binary heap that you drew in Q5.
Deliverables
Now, let’s talk about the coding assignment.
Your task is to implement this data structure to power the HeapPQueue
type. Although in practice you would layer this class on top of the Vector, for the purposes of this assignment you must do all of your own memory management. This means that you must dynamically allocate and deallocate the underlying array in which your heap is represented.
Here is our recommendation for how to complete this part of the assignment.
- Start off by adding any private member variables you’ll need to
HeapPQueue.h
. You know you’ll at a bare minimum need to define something to hold the elements in the heap. We recommend you start there, since without adding member variables you won’t be able to store anything. - Implement the constructor in a way that allocates an initial amount of space for the elements of the heap. We recommend picking a medium-sized array to begin with (say, around size 100), though you’ll probably drop that to something lower later on. Then, implement
size
andisEmpty
, which should each probably be a single line long. - To test whether your code works correctly, choose the “Interactive PQueue” option from the top menu bar. This will let you create and issue commands to a
HeapPQueue
using a graphical interface. Confirm that you can create aHeapPQueue
and thatsize
andisEmpty
behave as expected (size
should return0
,isEmpty
should returntrue
). - Implement
enqueue
using the bubble-up algorithm. Then, implementprintDebugInfo
in a way that will let you confirm that your code works correctly. For now, don’t worry about what happens if you run out of space in your dynamic array; you’ll patch this up later. Head back to the Interactive PQueue option and test out yourenqueue
code by enqueuing a bunch of values, working out with a pencil and paper what the array contents should look like, and then usingprintDebugInfo
to confirm things work. - Implement
dequeue
using the bubble-down algorithm. Run some initial tests using the Interactive PQueue feature and a pencil and paper, as before. Once that seems to be working, run the full automated test suite and see what happens. If you’re passing all the tests, except for the tests that require yourHeapPQueue
to hold thousands of elements and the ones that check for memory leaks, move on to the next step. Otherwise, iterate and debug until things are working. - Change your implementation of enqueue to grow the dynamic array when more space is needed, and implement the
HeapPQueue
destructor. Change the original size of your array down to something much smaller (say, four or five elements) and use a mix of the automated test suites and the Interactive PQueue to confirm that your code works as intended. Move on to the next step once all the tests pass. - Use the Time Tests feature to confirm that the cost of doing n
enqueue
s followed by ndequeue
s is indeed O(n log n). If not, look back over your code and see if you can spot the source of the inefficiency. - Add in at least one custom test of your own, just to make sure that you’ve covered all the cases you need to cover.
Problem Notes
Some notes on this part of the assignment:
- If your program mysteriously stops running without warning, it probably crashed. To see where it crashes, run the program in debug mode. You don’t need to set any breakpoints – simply running in debug mode will let the debugger catch where your program crashed. It’ll stop the program at that point as if you’d set a breakpoint there, and you can then walk up and down the call stack, look at local variables, etc. to see what’s going on. Common causes of program crashes this way include the following:
- Writing to an array variable through an uninitialized pointer. If you don’t initialize a pointer, it’s not given a sensible default value. It might be a null pointer, and writing to a null pointer will cause a crash. It might be a pointer to memory you don’t own, which will crash your program. Or it might point to memory you do own, in which case you might see a crash or might not, depending on whether you were (un)lucky.
- Writing to an array variable pointing at deallocated memory. Once you’ve used delete[] to deallocate an array, you should not try reading or writing that memory. (That would be like trying to let yourself into a building you just sold to someone else.)
- Writing past the end of an array. You saw this in the warmup assignment.
- Writing before the start of an array. This can happen if your indices are wrong.
- The indices in our array-based heap start at one. C++ arrays are zero-indexed. You’ll need to be clever about how you store things. There are many ways to do this – perhaps you will have a dummy element at the start of your array, or perhaps you'll adjust the math to use zero-indexing – but be sure that you keep this in mind when designing your implementation.
- You are, of course, welcome to define private helper functions in
HeapPQueue.h
. However, please do not change the signatures of any of the functions that we have provided to you. - If multiple data points are tied for the same weight, you can break those ties however you’d like.
- Read over the comments in
HeapPQueue.h
, which detail what all of the member functions need to do. They’re there for a reason. 😃 - Our testing harness automatically checks for memory leaks and other types of memory errors. If you allocate memory without later deallocating it, the test driver will tell you how many objects you leaked. You will then need to explore your code to find the source of the memory error.
- The C++ standard libraries contain functions
std::push_heap
andstd::pop_heap
. For the purposes of this assignment, please refrain from using those functions.
Problem Three: Streaming Top-k
(This part of the assignment requires you to have HeapPQueue
working, so we recommend doing it last.)
In the previous part of this assignment, you implemented the HeapPQueue
type. Your interactions with the HeapPQueue
were focused on getting the thing working, not writing code that takes advantage of what the HeapPQueue
has to offer.
In this last part of the assignment, you’ll switch hats and become a client of your HeapPQueue
type. Specifically, you’re going to implement an algorithm that uses the HeapPQueue
as a building block. This will be similar to what you did in Assignment 2, Assignment 3, and Assignment 4 when you interacted with types like Vector
and Map
– you used those types to solve problems without thinking too much about how those types were actually implemented.
The problem we’d like you to solve here is called the streaming top-k problem, and it’s defined like this:
Given a stream of data points, find the k elements in that data stream with the highest weight.
Your task, specifically, is to implement a function
Vector<DataPoint> topK(istream& stream, int k);
that takes as input a data stream and a number k, then returns the k data points from that stream with the highest weight. If there are fewer than k elements in the data stream, you should return all the elements in that stream. The items should be returned in descending order of weight, so the zeroth element of the returned Vector should be the highest-weight data point, the first should be the data point with the highest weight less than that, etc.
You might have noticed that the input to this function is not provided by a Vector<DataPoint>
, but rather by an istream
, which is usually something you’d hook up to a file or to the console. Why is this?
Imagine you’re working at Google and have a file with how many times each search query was made in a given day. You want to find the 1,000 most popular searches made that day. Google gets billions of search queries in a day, and most computers simply don’t have the RAM to keep all those queries in memory at once. That would mean that you couldn’t create a Vector<DataPoint>
to hold all the data points from the file; it would take up more memory than your computer has available!1
When you read data from a file stream in C++, on the other hand, the full contents of that stream don’t all have to be held in memory at the same time. Instead, whenever you try reading data from the stream, the stream calls out to your computer’s hard drive to get a bit more information.2 This means that you can process the elements from a data file one at a time without filling up your computer’s memory; the space to hold each element from the file gets recycled each time you pull in a new element.
In the case of this problem, your goal is to implement your function such that you can find the top k elements using only O(k) space; that is, space proportional to the number of elements you’ll need to return. In other words, if you wanted the top 1,000 search queries from Google, it wouldn’t matter that there are billions or tens of billions of search queries per day. You’d only need to use enough space in RAM to hold about a thousand or so of them.
As a reminder, you can read one element at a time from an istream using this general pattern:
for (DataPoint pt; in >> pt; ) {
/* … do something with pt … */
}
Your solution should not only use little space; it should also run quickly. Specifically, the runtime of your algorithm should be O(n log k), where n is the number of elements in the data stream.
Deliverables
Here’s what you need to do for this part of the assignment:
- Implement the the
topK
function inTopK.cpp
. You will need to use theHeapPQueue
type that you built in the previous part of this assignment in the course of doing this. However, you should not modify the filesHeapPQueue.h
orHeapPQueue.cpp
to solve this problem (unless, of course, you’re fixing bugs in theHeapPQueue
type). - Add at least one custom test case – and, ideally, many more – and test your code thoroughly using the test suite.
- Use the “Time Test” feature to see how fast your code runs, and confirm that the runtimes are consistent with what you’d expect to see in a piece of code whose runtime is O(n log k), where n is the number of elements and k is the input parameter to
topK
.
Problem Notes
Here’s some hints to help you get started:
- Just to make sure you didn’t miss it, we want you to return the highest-weight elements from the data stream sorted in descending order, which is the reverse of what we’ve asked you to do elsewhere in this assignment.
- Think about how you might solve this problem if you just wanted to find the highest-weight data point in the stream. How would you go about solving that problem using a
HeapPQueue
? Now imagine you want to find the top two. How would you do that? - The runtime bound of O(n log k) might actually give you a hint about how to solve this problem. There are n total elements to process, which means that you can only do O(log k) work per element from the data stream.
- You can assume the data stream only has
DataPoints
in it and don’t need to handle the case where the stream contains malformed data. - Inserting elements into a Vector using
Vector::insert
or removing elements from a Vector usingVector::remove
will take time O(n) if you insert or remove elements from the beginning of the Vector. Be careful when using these functions, since they can degrade performance. - Be careful – you can only consume the elements of an
istream
once. This means that, for each test you run, you can only calltopK
on a stream once. If you calltopK
a second time on that stream, the stream will appear empty andtopK
will return a different value than the one you got the first time you calledtopK
.
As with the other parts of this assignment, you must write at least one custom test case for this part of the assignment, and, ideally, should include more than just one. Also, be sure to run the time tests. You should have a sense of the general shapes of the curves you should expect to see, since you already saw some curves from an O(n log k)-time algorithm earlier in this assignment.
(Optional) Problem Four: Explore Some Data Sets!
The starter files we’ve provided you with in this assignment include four demos that use your code – the HeapPQueue
, and your implementation of topK
– to explore large data sets. Once you’ve finished the previous parts of this assignment, feel free to play around with these demos to see your code in action!
- Child Mortality: The United Nations Millennium Development Goals were a set of ambitious targets for improving health and welfare across the globe. Over twenty-five years, the UN kept records of child mortality data worldwide. How did those numbers change since when they started keeping track in 1990 to when the most recent public numbers were released in 2013?
- Earthquakes: The US Geological Survey operates a global network of seismometers and publishes lists of earthquakes updated every hour. Where are these earthquakes? How big are they? A note: for some reason, Qt Creator’s networking system doesn’t work on all computers. You may get error messages here about not being able to connect to the servers. If that happens, don’t worry; it’s not your fault.
- Women’s 800m Freestyle: The women’s 800m freestyle swim race was introduced as a competitive event in the 1960s. How have the fastest times in that event improved since then? A certain Stanford-affiliated athlete might make an appearance here.
- National Parks: The US National Parks Service runs America’s national parks, national monuments, national recreation areas, national seashores, etc. How many people visit those parks? Which ones are most popular? What trends can you detect by looking at those numbers?
There are no deliverables for this part of the assignment. Just sit back, enjoy, and celebrate having gotten everything working!
(Optional) Problem Five: Extensions
There are many, many ways you could do extensions on this assignment. Here are a few:
- Priority Queue: There are so many different ways to implement priority queues, of which the binary heap is only one. Other approaches include binomial heaps, pairing heaps, leftist heaps, Fibonacci heaps, hollow heaps, thick heaps, and randomized meldable heaps. These other styles of heaps are designed to efficiently support additional operations on heaps, such as meld, which efficiently merges together two heaps, or decrease-key, which lowers the priority of an existing element in the heap, essentially bumping it up in line. Pick one of these other approaches, research it, and put together your own implementation.
- Streaming Top-k: The streaming top-k algorithm is an example of a streaming algorithm, an algorithm that works on a data stream and tries to keep its memory footprint low. Streaming algorithms are an active area of research in algorithms and data structures, and there are some really beautiful ones out there. The count-min sketch, for example, lets you approximate how many times you’ve seen various elements in the data stream without actually storing everything you’ve encountered. Research a streaming algorithm and code up your own implementation.
- Data Sagas: This assignment builds up to exploring data sets with these sorts of algorithms to see what you find. So go out there and find a fun data set to explore! Get some data and tell us a good story about it. Which data set did you grab? Where did you find it? What did you find when you looked in that data set?
Submission Instructions
Once you’ve finished writing up your solution, submit these files:
ShortAnswers.txt
HeapPQueue.cpp
andHeapPQueue.h
. (Don’t forget the header file!)TopK.cpp
.
If you edited any of the other files in the course of adding extensions to the base assignment, please be sure to include them as well.
Good luck, and have fun!
1 Technically, due to the wonder that is virtual memory, you actually could create this Vector. It would just degrade the performance on your machine so much that everything would grind to a seeming halt.
2 Technically, the stream usually uses a technique called buffering where, whenever you ask for data, it reads more data than you need from disk. Reading from disk is much, much slower than reading from memory, and this design helps reduce the amount of time spent asking the hard drive for data.