PQ client


Streaming top-k problem from Keith Schwarz

Put your client hat on!

Having finished PQSortedArray and before starting on PQHeap, we take a brief diversion to appreciate the utility of the Priority Queue class as an abstraction for solving interesting and challenging problems.

While acting in the role of class implementer, the programmer is head down, focusing on the nitty-gritty of the functionality, navigating the challenges and tricky details to ensure the class works correctly in all situations. You just experienced this for yourself for PQSortedArray. However, once the class is tested and debugged, it is a delight to switch hats to become a client of your own data structure and take advantage of what the class has to offer! Remember in previous assignments when you made use of Vector and Map to solve problems without worrying about how those data structures were actually implemented? Let's revisit this powerful land of problem-solving and see what you can do with a working implementation of a handy data structure like the PQSortedArray now in your repertoire.

Sorting via priority queue

While we have yet to formally discuss sorting algorithms in lecture (coming soon…), one great use for a priority queue is as a tool for sorting. Capitalizing on the priority queue's management of elements in sorted order, you can devise a devilishly simple approach to sorting. The algorithm takes as input a sequence of unordered elements, enqueues those elements into a priority queue, and then repeatedly dequeues to get the elements back out, now in order of priority. Voila!

The provided implementation of pqSort in pqclient.cpp does exactly this:

void pqSort(Vector<DataPoint>& v) {
    PQSortedArray pq;

    /* Add all the elements to the priority queue. */
    for (int i = 0; i < v.size(); i++) {
        pq.enqueue(v[i]);
    }

    /* Extract all the elements from the priority queue. Due
     * to the priority queue property, we know that we will get
     * these elements in sorted order, in order of increasing priority
     * value. Store elements back into vector, now in sorted order.
     */
    for (int i = 0; i < v.size(); i++) {
        v[i] = pq.dequeue();
    }
}

Answer the following questions in short_answer.txt:

Q10. Based on the Big O of PQSortedArray enqueue/dequeue, what do you expect for the Big O runtime of pqSort? Run some timing trials to confirm your prediction, and include that data in your answer.

There is nothing more you need to do for pqSort; just admire how easy it is to be a client of your priority queue and how handy its functionality can be. You are starting to see the payoff for all your hard work!

Taming massive data streams (Top-K)

The second client application of your priority queue is a neat algorithm that solves an interesting "challenge" problem often asked in coding interviews. The streaming top-k problem is the task of returning the k elements with the largest values from a stream of values. The stream is expected to contain an enormous number of values, and k is small; the goal is to complete the task efficiently in both space and time. A stream of data is simply an abstraction for accessing values one at a time without storing all of the data in memory at once (which makes it a good fit for this problem).

Using a priority queue as a building block is a great approach to solving this problem, as this data structure is well-suited to managing elements in ranked order.

You are to write the function

Vector<DataPoint> topK(istream& stream, int k) 

which takes as input a data stream of DataPoint objects and a number k and returns a Vector of the top k elements drawn from the stream. More precisely, the returned vector must meet these requirements:

  • If the stream contains at least k elements, the vector contains the k elements with the largest values.
  • If the stream contains fewer than k elements, the vector contains all the elements from the stream.
  • You may assume that k > 0.
  • In all cases, the elements in the vector are arranged in descending order by priority value. The element with the largest priority value is at index 0, the next largest at index 1, and so on.

You might have noticed that the input to this function is not given as Vector<DataPoint>, but rather by an istream, which is usually something you’d use when working with files or the console. What is an istream and why is it useful here?

Imagine you’re working at Google and you have access to a file that contains information about how many times each individual search query was made in a given day. You want to find the 1,000 most popular searches made that day. Google receives billions of search queries in a day, and most computers simply don’t have the memory to hold all these data points at once. That means you couldn’t create a Vector<DataPoint> to hold all the values from the file; it would take up more memory than your computer has available!

When you read data from a stream in C++, on the other hand, the entire contents of that stream don’t have to be held in memory at the same time. As you read from a stream, it fetches a bit more of the underlying information as needed (from the corresponding file or other source). This means that you can process the elements from a data file one at a time without filling up your computer’s memory; the space to hold each element from the file gets recycled each time you pull in a new element.

In the case of this problem, your goal is to implement your function such that you can find the top k elements using only O(k) space; that is, space proportional to the number of elements you’ll need to return. This means that any data structure you use should never hold more than about k elements at any given moment. In other words, if you wanted the top 1,000 Google queries, it wouldn’t matter that there are billions or tens of billions of queries per day. You’d only need to use enough memory to hold about a thousand or so of them. Wow!

Notes

  • To read from a stream, we suggest using operator>>. You can think of operator>> as the inverse of operator<<. The expression out << var takes the value of var and writes it to the output stream. The inverse expression in >> var reads a value from the input stream and stores it into var. The boolean result of operator>> is true if the read was successful and false otherwise (because of a read error or reaching the end of the input).

    The loop below would read DataPoints from a stream until the stream is exhausted:

     DataPoint cur;
     while (stream >> cur) {
         /* do something with cur */  
     }
    

    You may assume the input stream contains only valid DataPoints. You do not need to handle the case where the stream contains malformed data.

  • After reading the elements from an istream, the data is consumed and the stream is empty. You cannot re-read an already-spent stream, so your test cases will have to create/replenish the stream contents before another call to topK (see the usage sketch after these notes).
  • To be sure you didn’t miss it: topK returns the k elements with the largest priority values from the input stream, sorted in descending order of priority value. This is reversed from what you've been doing in the rest of the assignment, so think carefully about what this means when it comes to using your data structures and building the final result output.
  • It may help to think about the algorithm in terms of what you can discard from the stream. If looking for the elements with the top 5 priority values, by the time you've seen your 6th entry, you can already identify one element whose priority definitely won't be in the top 5. Consider how you can use a priority queue to identify and discard those elements that are out of the running for the top slots (a rough partial sketch of this idea appears just after these notes).
  • To be sure your performance is optimal, take care with how you build up the Vector of results. Adding to the end of a Vector is O(1), but inserting at index 0 requires O(N). One option is to construct the Vector with a fixed size and access locations by index, which runs in O(1) time:
        Vector<DataPoint> data(5); // construct vector of size 5
        data[0] = { "Julie", 106 };
        data[1] = { "Keith", 137 };
    
  • Although pqSort and topK are two client tasks that each make use of a priority queue, the two algorithms are not otherwise related. In particular, the solution to topK does not use pqSort. You only need to pick out the elements with the k largest priority values; fully sorting the entire contents of the stream is doing way more work than needed.
  • Be sure to firmly keep on your client hat. When previously in the role of class implementer, you were deep inside the class and hyper-aware of how and where the data was stored and what was happening with the internal state. After completing the implementation and thoroughly testing it, you button up the class and move to the role of client. In this role, you make calls to enqueue/dequeue, relying on these functions to behave as specified in the interface. You should be blissfully unaware of what is happening under the hood.
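
To make the discard idea above a bit more concrete, here is a rough, partial sketch of one possible shape for the reading loop (it assumes the size() member from the class interface). It is not the only reasonable structure, and it deliberately stops short of building the final result, which is for you to work out:

PQSortedArray pq;
DataPoint cur;
while (stream >> cur) {
    pq.enqueue(cur);
    if (pq.size() > k) {
        pq.dequeue();   // the smallest of the stored elements cannot be in the top k
    }
}
/* At this point pq holds at most k elements: those with the largest priority
 * values seen so far. Assembling the final Vector in descending order is
 * left for you to design. */
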
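
To illustrate the note about spent streams, here is a hedged sketch of how a test body might construct stream contents and call topK more than once. It assumes DataPoint's stream insertion (<<) and extraction (>>) operators round-trip through a string stream (and that <sstream> is included); if the provided tests in your starter code build their stream text differently, mimic them instead. The names and values below are placeholders:

ostringstream out;
out << DataPoint{ "Julie", 106 } << " " << DataPoint{ "Keith", 137 } << " " << DataPoint{ "Eric", 118 } << " ";

istringstream first(out.str());
Vector<DataPoint> top2 = topK(first, 2);   // expect { "Keith", 137 } then { "Eric", 118 }

istringstream again(out.str());            // 'first' is now spent; build a fresh stream
Vector<DataPoint> top1 = topK(again, 1);   // expect just { "Keith", 137 }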

Testing and timing

Most of the heavy lifting for pqSort and topK is handled inside enqueue and dequeue, and those operations will be tested as part of the class. The pqclient.cpp file contains some provided tests that you can use to confirm the correctness of pqSort and topK as specific applications of the priority queue. These tests are fairly comprehensive and should meet your testing needs as-is. You are always free to add your own tests if you like, but it will be most fruitful to direct your testing energy into the test cases needed for PQHeap, coming up next.

The final step for pqclient.cpp is for you to predict the Big-O runtime of the topK operation when using a PQSortedArray and then run timing trials to confirm your prediction. The runtime should be expressed in terms of two distinct variables: n, the total number of elements in the data stream, and k, the number of "top" elements that you are searching for. You can approach this task by first picking a fixed value of k and then trying values of n across at least four distinct magnitudes to see the relationship between the growth of n and the growth of the runtime. We have provided a timing test in the starter code that gives a framework for how to do this, so make sure to use it as a starting point. Then, switch it up and run some time trials that vary k while keeping n constant to figure out the relationship with k. Do the timing results match up with your prediction of the runtime?
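
If it helps to see the overall shape of such a trial, below is a rough, illustrative sketch using std::chrono; it is a supplement to, not a replacement for, the provided timing test. The function name is hypothetical, randomInteger comes from the Stanford "random.h" header, and it again assumes DataPoint's << and >> operators round-trip through a string stream:

#include <chrono>
#include <iostream>
#include <sstream>
#include "random.h"

/* Illustrative only: time topK for a fixed k while n grows by factors of 10.
 * The stream text is built outside the timed region so only topK is measured. */
void roughTopKTiming(int k) {
    for (int n = 1000; n <= 1000000; n *= 10) {
        ostringstream out;
        for (int i = 0; i < n; i++) {
            out << DataPoint{ "elem" + to_string(i), randomInteger(1, n) } << " ";
        }
        istringstream in(out.str());

        auto start = chrono::steady_clock::now();
        topK(in, k);
        auto stop = chrono::steady_clock::now();
        double secs = chrono::duration<double>(stop - start).count();
        cout << "n = " << n << ", k = " << k << ": " << secs << " sec" << endl;
    }
}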

Answer the following questions in short_answer.txt:

Q11. Based on the Big O of PQSortedArray enqueue/dequeue, what do you expect for the Big O runtime of topK in terms of k and n? Run some timing trials to confirm your prediction, and include that data in your answer.