Assign4: Priority Queue client


Streaming top-k problem from Keith Schwarz

Moving back across the abstraction boundary

We bring you this brief interlude between implementing PQSortedArray and PQHeap to appreciate the utility of the Priority Queue class abstraction in solving interesting and challenging problems.

While implementing a class, you as the programmer have your head down and are focused entirely on getting the class to work appropriately, navigating all the challenges and tricky small details along the way. You probably just experienced this for yourself when completing the implementation of PQSortedArray! However, once the class is completed and working as expected, it can often be a delight to switch hats to become a client of your own data structure and to take advantage of what the class has to offer! Remember in previous assignments when you made use of Vector and Map to solve problems without worrying about how those data structures were actually implemented? Let's revisit this powerful land of problem-solving and see what we can do with a working implementation of a handy data structure like the PQSortedArray now in our repertoire.

Implementing a sorting algorithm (PQSort)

While we have yet to formally discuss sorting algorithms in lecture (it's coming soon, we promise!), this exercise will show you that it's actually possible to use a priority queue as a tool for sorting! Since we know that internally a priority queue always maintains its elements in sorted order, we can devise a devilishly simple sorting algorithm as follows:

  • Given a list of DataPoint structs, originally unsorted (elements in completely random order of priorities)
  • Enqueue the elements into a priority queue
  • Then, repeatedly call dequeue to pull them out, overwriting the original list with the elements now in order of priority

Voila! You have sorted the input list into increasing order!

The provided implementation of pqSort in pqclient.cpp does exactly this:

void pqSort(Vector<DataPoint>& v) {
    PQSortedArray pq;

    /* Add all the elements to the priority queue. */
    for (int i = 0; i < v.size(); i++) {
        pq.enqueue(v[i]);
    }

    /* Extract all the elements from the priority queue. Due
     * to the priority queue property, we know that we will get
     * these elements in sorted order, in order of increasing priority
     * value.
     */
    for (int i = 0; i < v.size(); i++) {
        v[i] = pq.dequeue();
    }
}

Answer the following questions in short_answer.txt:

  • Q7. Given the runtime of PQSortedArray enqueue/dequeue, what will be the runtime of pqSort? Then, run timing trials and produce data to support your prediction.

Taming massive data streams (Top-K Problem)

The second priority queue client application we will consider is a neat algorithm that solves an interesting "challenge" problem often presented in coding interviews. The streaming top-k problem is the task of returning the k elements with the largest values from a stream of values. The stream is expected to contain a enormous count of values, and k is small; the goal is to complete the task efficiently in both use of space and time. In this case a "stream" of data is simply an abstraction that allows easy access to one element of data at a time while avoiding having to store all data in memory at once (which makes it a good fit for this problem).

Using a priority queue as a building block is a great approach to solving this problem, as this data structure lends itself naturally to the task of keeping track of some subset of elements as designated by their integer priority (which works nicely with our previously used DataPoint struct).

You are to write the function

Vector<DataPoint> topK(istream& stream, int k) 

which takes as input a data stream of DataPoint objects and a number k and then returns a Vector of k elements from that stream with the largest values. The Vector returned from the function must abide to the following constraints/requirements:

  • If k is 0, then the returned Vector should be empty.
  • If there are fewer than k elements in the data stream, the returned Vector should contain all the elements in that stream.
  • The elements in the returned vector are arranged in descending order by priority value, which means that the element with the largest priority value is at index 0, the next largest at index 1, and so on.

You might have noticed that the input to this function is not provided by a Vector<DataPoint>, but rather by an istream, which is usually something you’d use when working with files or the console. What is an istream and why is it useful here?

Imagine you’re working at Google and you have access to a file that contains information about how many times each individual search query was made in a given day. You want to find the 1,000 most popular searches made that day. Google gets billions of search queries in a day, and with each (query, count) pair represented as a single instance of the DataPoint struct, most computers simply don’t have the memory to hold all these data points at once. That would mean that you couldn’t create a Vector<DataPoint> to hold all the values from the file; it would take up more memory than your computer has available!

When you read data from a stream in C++, on the other hand, the full contents of that stream don’t all have to be held in memory at the same time. Instead, whenever you try reading data from the stream, the stream calls out to your computer to get a bit more of the underlying information (from a corresponding file or other location). This means that you can process the elements from a data file one at a time without filling up your computer’s memory; the space to hold each element from the file gets recycled each time you pull in a new element.

In the case of this problem, your goal is to implement your function such that you can find the top k elements using only O(k) space; that is, space proportional to the number of elements you’ll need to return. This means that any data structure you declare should never hold more than about k elements in it at any given point in time. In other words, if you wanted the top 1,000 search queries from Google, it wouldn’t matter that there are billions or tens of billions of search queries per day. You’d only need to use enough memory to hold about a thousand or so of them. Wow! This space constraint will also play a factor in your overall runtime performance as well – once you have completed the implementation, you will analyze and categorize how good the runtimes of this algorithm are when using different priority queue implementations to store the data.

Helpful Notes and Suggestions

  • There are many varied ways to read from a stream. We recommend a simple approach using operator >>. The code below demonstrates an idiomatic loop to read DataPoints from a stream until the stream is exhausted:
    DataPoint point;
    while (stream >> point)  {  
        /* do something with point */  
    }
    

    You may assume the input stream only has DataPoint struct instances in it. You do not need to handle the case where the stream contains malformed data.

  • You can only consume the elements of an istream once. This means that, for each test you run, you can only call topK on a stream once. If you call topK a second time on that stream, the stream will appear empty.
  • Just to make sure you didn’t miss it, topK returns the k elements with the largest priority values from the input stream, sorted in descending order of priority value. This is reversed from what you've been doing in the rest of the assignment, so think carefully about what this means when it comes to using your data structures and building the final result output.
  • It may help to think about solving the problems in terms of what you can discard from the stream. If looking for the elements with the top 10 priority values, by the time you've seen your 11th entry, you already can identify one of those elements that has a priority that definitely won't be in the top 10. Consider how you can use a priority queue to identify and discard those elements that are out of the running for the top slots.
  • To be sure your performance is optimal, take care with how you access the Vector for the results. Adding to the end of Vector is O(1) but using insert to insert at a position near the beginning takes time O(N) and will degrade performance. One approach you can use to work around this is to construct the Vector of a fixed size and then access locations by index, as this also runs in O(1) time:
        Vector<DataPoint> data(5); // construct vector of size 5
        data[0] = { "Julie", 106 };
        data[1] = { "Keith", 137 };
    
  • Note that pqSort and topK are two client tasks that each make use of a priority queue, but the two algorithms are not otherwise related. In particular, trying to use pqSort as part of a solution to topK will not end well. Keep in mind that you just need to pick out the elements with the k largest priority values; fully sorting the entire contents of the stream is doing way more work than needed.

Testing and Timing Analysis

Once you have written your implementation of topK, you should use our provided tests to confirm that the function works correctly. Additionally, you should add at least 1-2 more tests of your own to confirm that your function behaves as expected in all circumstances. When writing these tests, you should make use of the two asStream helper functions that we've provided with the starter code, which allow you to construct data streams from Vector<DataPoint> and ranges of int priorities. The stringstream return type of these functions is a type of istream, which means that you can pass the returned streams directly into the topK function. Note that for most of the provided tests, we choose to use the empty string as the name field of the DataPoint, since we really only care about the priority value – feel free to do the same in your student tests.

Finally, you should predict what you think the Big-O of the topK operation is when using a PQSortedArray as your auxiliary data structure. Then, run some time trials to determine the runtime of the topK operation. This runtime should be expressed in terms of two distinct variables: n, which represents the total number of elements in the data stream, and k, which represents the number of "top" elements that you are searching for. As you did in the last assignment, you should approach this task by first picking a fixed value of k and then varying different values (at least 4 distinct magnitudes) of n to see the relationship between the growth of n and the growth of the runtime. We have provided a timing test in the starter code that provides a framework on how to do this, so make sure to use that as a starting point. Then, switch it up and run some time trials that vary k while keeping n constant to figure out the relationship with k. Do the timing results match up with your prediction of the runtime?

Answer the following questions in short_answer.txt:

  • Q8. Given the runtime of PQSortedArray enqueue/dequeue, what will be the runtime of topK? Run some timing trials to confirm your prediction, and include the data you collected in your final answer.