Assign5: Priority Queue client


Streaming top-k problem from Keith Schwarz

We bring you this little interlude between the PQSortedArray and PQHeap to appreciate the utility of the Priority Queue class abstraction in solving interesting and challenging problems.

While implementing a class, you as the programmer have your head down and are focused entirely on getting the class to work appropriately, navigating all the challenges and tricky small details along the way, as you probably just experienced when completing the implementation of PQSortedArray! However, once the class is completed and working as expected, what a delight it can be to switch hats to become a client of your own data structure and take advantage of what the class has to offer! Remember how in previous assignments when you made use of Vector and Map to solve problems without worrying about how those types were actually implemented? Let's go there now and see what we can do with a working implementation of a handy data structure like the PQSortedArray now in our repertoire.

Sorting via Priority Queue

Given our discussion in lecture on Friday about different sorting algorithms, it seems almost too fortuitous that we can actually use a priority queue as a tool for sorting! Since we know that internally a priority queue always maintains its elements in sorted order, we can devise a devilishly simple sorting algorithm as follows: * Given a list of values, originally unsorted * Enqueue the values into a priority queue * Then, repeatedly call dequeue to pull them out, overwriting the original list with the elements now in order Voila! You have the sorted the input list into increasing order!

The four-line of pqSort in pqclient.cpp does exactly this:

void pqSort(Vector<int>& v)
{
    PQSortedArray pq;

    for (int i = 0; i < v.size(); i++) {
       pq.enqueue(v[i]);
    }
    for (int i = 0; i < v.size(); i++) {
        v[i] = pq.dequeue();
    }
}

Answer the following questions in short_answer.txt:

Streaming top K via Priority Queue

The second priority queue client application we will consider is a neat algorithm that solves an interesting "challenge" problem often presented in coding interviews. The streaming top-k problem is the task of returning the k elements with the largest values from a stream of values. The stream is expected to contain a enormous count of values and k is small; the goal is to complete the task efficiently in both use of space and time.

Using a priority queue as a building block is a great approach to solving this problem, as this data structure lends itself naturally to the task of keeping track of some subset of elements as designated by their integer priority (their value in this case).

You are to write the function

Vector<int> topK(istream& stream, int k);

takes as input a data stream and a number k, then returns a Vector of k values from that stream with the largest values. If there are fewer than k elements in the data stream, it return all the elements in that stream. The elements in the returned vector are arranged in descending order, the maximum value is at index 0, the next largest at index 1, and so on.

You might have noticed that the input to this function is not provided by a Vector<int>, but rather by an istream, which is usually something you’d hook up to a file or to the console. Why is this?

Imagine you’re working at Google and have a file with how many times each search query was made in a given day. You want to find the 1,000 most popular searches made that day. Google gets billions of search queries in a day, and most computers simply don’t have the memory to hold all those queries at once. That would mean that you couldn’t create a Vector<int> to hold all the values from the file; it would take up more memory than your computer has available!

When you read data from a file stream in C++, on the other hand, the full contents of that stream don’t all have to be held in memory at the same time. Instead, whenever you try reading data from the stream, the stream calls out to your computer’s hard drive to get a bit more information. This means that you can process the elements from a data file one at a time without filling up your computer’s memory; the space to hold each element from the file gets recycled each time you pull in a new element.

In the case of this problem, your goal is to implement your function such that you can find the top k elements using only O(k) space; that is, space proportional to the number of elements you’ll need to return. In other words, if you wanted the top 1,000 search queries from Google, it wouldn’t matter that there are billions or tens of billions of search queries per day. You’d only need to use enough memory to hold about a thousand or so of them. Wow!

Write your implementation of topK and use our provided tests to confirm that the function works correctly.

Answer the following questions in short_answer.txt:

Notes