Assign5: Priority Queue client
Streaming top-k problem from Keith Schwarz
We bring you this little interlude between the PQSortedArray and PQHeap to appreciate the utility of the Priority Queue class abstraction in solving interesting and challenging problems.
While implementing a class, you as the programmer have your head down and are focused entirely on getting the class to work appropriately, navigating all the challenges and tricky small details along the way, as you probably just experienced when completing the implementation of PQSortedArray! However, once the class is completed and working as expected, what a delight it can be to switch hats to become a client of your own data structure and take advantage of what the class has to offer! Remember how in previous assignments when you made use of Vector and Map to solve problems without worrying about how those types were actually implemented? Let's go there now and see what we can do with a working implementation of a handy data structure like the PQSortedArray now in our repertoire.
Sorting via Priority Queue
Given our discussion in lecture on Friday about different sorting algorithms, it seems almost too fortuitous that we can actually use a priority queue as a tool for sorting! Since we know that internally a priority queue always maintains its elements in sorted order, we can devise a devilishly simple sorting algorithm as follows: * Given a list of values, originally unsorted * Enqueue the values into a priority queue * Then, repeatedly call dequeue to pull them out, overwriting the original list with the elements now in order Voila! You have the sorted the input list into increasing order!
The four-line of pqSort in pqclient.cpp does exactly this:
void pqSort(Vector<int>& v)
{
PQSortedArray pq;
for (int i = 0; i < v.size(); i++) {
pq.enqueue(v[i]);
}
for (int i = 0; i < v.size(); i++) {
v[i] = pq.dequeue();
}
}
Answer the following questions in short_answer.txt:
- Q7. Given the runtime of PQSortedArray enqueue/dequeue, what will be the runtime of pqSort? Run timing trials to confirm your prediction.
- Q8. The action of pqSort operates very similarly to one of the classic sorting algorithms shown in lecture. Which algorithm is it?
Streaming top K via Priority Queue
The second priority queue client application we will consider is a neat algorithm that solves an interesting "challenge" problem often presented in coding interviews. The streaming top-k problem is the task of returning the k elements with the largest values from a stream of values. The stream is expected to contain a enormous count of values and k is small; the goal is to complete the task efficiently in both use of space and time.
Using a priority queue as a building block is a great approach to solving this problem, as this data structure lends itself naturally to the task of keeping track of some subset of elements as designated by their integer priority (their value in this case).
You are to write the function
Vector<int> topK(istream& stream, int k);
takes as input a data stream and a number k, then returns a Vector of k values from that stream with the largest values. If there are fewer than k elements in the data stream, it return all the elements in that stream. The elements in the returned vector are arranged in descending order, the maximum value is at index 0, the next largest at index 1, and so on.
You might have noticed that the input to this function is not provided by a Vector<int>, but rather by an istream, which is usually something you’d hook up to a file or to the console. Why is this?
Imagine you’re working at Google and have a file with how many times each search query was made in a given day. You want to find the 1,000 most popular searches made that day. Google gets billions of search queries in a day, and most computers simply don’t have the memory to hold all those queries at once. That would mean that you couldn’t create a Vector<int> to hold all the values from the file; it would take up more memory than your computer has available!
When you read data from a file stream in C++, on the other hand, the full contents of that stream don’t all have to be held in memory at the same time. Instead, whenever you try reading data from the stream, the stream calls out to your computer’s hard drive to get a bit more information. This means that you can process the elements from a data file one at a time without filling up your computer’s memory; the space to hold each element from the file gets recycled each time you pull in a new element.
In the case of this problem, your goal is to implement your function such that you can find the top k elements using only O(k) space; that is, space proportional to the number of elements you’ll need to return. In other words, if you wanted the top 1,000 search queries from Google, it wouldn’t matter that there are billions or tens of billions of search queries per day. You’d only need to use enough memory to hold about a thousand or so of them. Wow!
Write your implementation of topK and use our provided tests to confirm that the function works correctly.
Answer the following questions in short_answer.txt:
- Q9. Given the runtime of PQSortedArray enqueue/dequeue, what will be the runtime of topK? Run some timing trials to confirm your prediction.
Notes
- There are many varied ways to read from a stream. We recommend
a simple approach using operator
>>. The code below demonstrates an idiomatic loop to read numbers from a stream until it is exhausted:int num; while (stream >> num) { /* do something with num */ }You may assume the input stream only has integers in it. You do not need to handle the case where the stream contains malformed data.
- You only consume the elements of an
istreamonce. This means that, for each test you run, you can only calltopKon a stream once. If you calltopKa second time on that stream, the stream will appear empty. - Just to make sure you didn’t miss it,
topKreturns the k largest elements from the input stream sorted in descending order, which appears reversed from what you've been doing in the rest of the assignment. - It may help to think about solving the problems in terms of what you can discard from the stream. If looking for the top 10 values, by the time you've seen your 11th entry, you already can identify one of those values that definitely won't be in the top 10. Consider how you can use a priority queue to identify and discard those elements that are out of the running for the top slots.
- To be sure your performance is optimal, take care with how you access
the Vector for the results. Adding to the end of Vector is O(1) but using
insertto insert at a position near the beginning takes time O(N) and will degrade performance. Another approach you can use is to construct the Vector of a fixed size and then access by index, as this also runs in O(1) time:Vector<int> counts(5); // construct array of size 5 counts[0] = 106; counts[1] = 930; - Note that
pqSortandtopKare two client tasks that each make use of a priority queue, but the two algorithms are not otherwise related. In particular, trying to usepqSortas part of a solution totopKwill not end well. Keep in mind that you just need to pick off the top K largest values; fully sorting the entire contents of the stream is doing way more work than needed.