Bag'O Big-O

Due Friday, February 19 at 11:30 am Pacific

  • Submissions received by due date receive a small on-time bonus.
  • All students are granted a pre-approved extension or "grace period" of 48 hours after the due date. Late submissions are accepted during the grace period with no penalty.
  • The grace period expires Sun, Feb 21 at 11:30 am Pacific, after which we cannot accept further late submissions.
  • In this course, we express all date/times in Pacific time GMT -8. Our Paperless submission system also displays/records due dates and submission times in Pacific time.

This assignment is all about big-O notation and efficiency. We’re aware that you’ll be working on the midterm over the weekend, and accordingly we’ve calibrated this assignment to be considerably shorter than normal.

Due Friday, February 19th at 11:30AM Pacific. You are welcome to work on this assignment in pairs.

This assignment has two parts:

  • Big-O Analysis: We’ve given you a number of functions, access to their source code, and a run-time plotter. Your task is to determine their big-O time complexities.
  • Combine: You have a large collection of sorted sequences you want to assemble into a single sorted sequence. We could just toss everything together into a list and sort it from scratch, but there’s a faster way to combine multiple sorted sequences.

As usual, we suggest making slow and steady progress. Here’s our recommended timetable:

  • Work on the midterm over the weekend. Realistically, we don’t expect you to start this until Monday. 😃
  • Aim to complete the big-O analysis questions by Tuesday.
  • Aim to complete Combine by Thursday.

Problem One: Big-O Analysis

In this problem, we’ve provided you a list of thirteen C++ functions. Your task is to determine the big-O runtimes of each of those functions with respect to n, where n is either the numeric input int n or the size of the input Vector or Queue.

Those functions are broken down into five smaller groups:

  • Printing Chip: Five functions that draw graphical representations of letters or words to the con- sole. Increasing n in these functions increases the size of the letters or words.
  • Counting Triples: A function that finds all triples of values in a Vector that add up to a particular total.
  • Printing Cycles: Three functions that take in a sequence, then repeatedly move the first element of the sequence to the end, like the Looper example from lecture.
  • Recursive Mysteries: Two simple recursive functions that compute values based on their inputs.
  • Maximum Single-Sell Profit: Two functions that solve a problem pertaining to historical stock market values.

To help you determine the big-O runtimes of these functions, we recommend that you do a mix of the following:

  1. Hand-analyze the code. Following the principles we outlined in lecture, read over the code and determine how much work you think each step does.

  2. Consult the documentation. Operations on containers like the Vector and Queue do not all take time O(1), and the specific amount of work required depends on which operations are performed. We’ve documented the amount of work done by each recursive call online at the link marked “Stanford C++ Library Documentation.”

  3. Run time trials. Our provided starter files contain code that will run the functions on inputs of different sizes, providing both a qualitative plot and quantitative runtime information.

More concretely, here’s what you need to do:

Big-O Analysis Requirements

Edit the file BigOAnswers.txt with the big-O time complexities of the functions defined in BigOFunctions.cpp. You do not need to justify your answers.

Some notes on this problem:

  • You’re not expected to write or modify any code for this problem. Instead, take the provided code as it’s given to you and analyze its efficiency.

  • The very last function you can run in the time tests, “Combine,” is for Problem Two. Don’t analyze that function, since initially it’s not implemented!

  • Remember that raw timing data can be “noisy,” in the sense that other processes running on your computer can skew the results. If you see any unusual spikes in your plots, it might simply indicate that your machine was busy doing something else when you ran the test.

  • On some computers, the very first timing data point can appear much higher than the ones that come immediately after it. If that happens, just disregard that data point. (This is due to how the computer caches parts of your code that run frequently – come talk to us if you’re curious why!)

  • The effects predicted by different big-O runtimes become more pronounced and more accurate for larger input sizes. For smaller inputs sizes, you may see growth rates that don’t align with the actual big-O runtime.

  • Make sure that, for each function, you can explain its big-O time complexity both analytically (by reading the code and seeing what it does) and empirically (based on the runtime plots). If you’re unsure about something, come ask us!

Problem Two: Combine

Suppose you have several lists of numbers, each of which is already in sorted order. You want to combine those lists together into one giant list, also in sorted order. How should you do this?

One approach would be to ignore the fact that we know these lists are already sorted and to do the follow- ing: create a giant Vector holding all the numbers, then sort it with mergesort in time O(n log n). This works, but by harnessing the fact that the input sequences are already sorted we can improve on this.

If you’ll recall, mergesort works by recursively breaking the input array down into a bunch of tiny sequences, then using the merge algorithm to combine all those sequences back together into one giant, sorted sequence. Here, we already have the input broken down into smaller sorted sequences, and so we just need to do that second step of mergesort, merging things back together, to finish things off.

Let’s imagine that we have k sequences that collectively have n total elements in them. We can follow the lead of mergesort to sort those sequences together using an algorithm called combine:

  • Split those k sequences apart into two groups of roughly k / 2 sequences each. (It doesn’t matter how many elements are in each of the sequences, just that the number of sequences in each group is roughly the same).

  • Recursively combine each of those groups of sequences. You now have two sorted sequences, one made by combining the sequences in the first group, and one made by combining the sequences from the second group.

  • Using the merge algorithm from class, merge those two large sequences together into one final overall sequence.

Here’s an example illustrating how to combine six sorted sequences (k = 6) with twenty total elements across them (n = 20) into one giant sorted sequence. For convenience, these are sequences of integers.

How merge sort works: Split the sequence into two groups. Recursively combine the groups. Merge the final two sequences.

This diagram should hopefully look familiar; it’s really similar to the one we saw for mergesort in class.

With a little bit of creativity you can prove that the runtime for this approach is O(n log k). In the case where you have a small number of sequences (low k) with a large total number of elements (large n), this can be dramatically faster than resorting things from scratch! For example, if n is roughly one million and k is, say, ten, then this combine algorithm will be roughly ten times faster than a regular mergesort.

Milestone 1: Implement Combine

Your task is to implement a function

Vector<DataPoint> combine(const Vector<Vector<DataPoint>>& dataPoints)

that takes as input a list containing zero or more lists of data points, then uses the above algorithm to combine them into one giant sorted sequence. The DataPoint type is a defined in Demos/DataPoint.h as follows:

struct DataPoint {
  string name; // Name of this data point; varies by application
  int weight; // "Weight" of this data point. Points are sorted by weight.

You can assume that the sequences of data points provided to you are sorted by their weight fields from lowest to highest, and your resulting sequence should also be sorted by weight from lowest to highest. Here’s the first milestone you need to reach:

Milestone 1 Requirements

  1. Add at least one custom test to Combine.cpp using STUDENT_TEST. We recommend doing this first, since it’s a great way to confirm you understand what’s being asked of you.

  2. Implement combine from Combine.cpp using the following recursive strategy:
    • Split the list of sequences into two groups with roughly the same number of sequences.
    • Recursively combine each of those groups together, forming two sorted sequences.
    • Use the merge algorithm to merge those resulting sequences into one overall sequence. (The merge algorithm was described in our lectures on searching and sorting; you’ll need to code this up yourself.)
  3. Test your code thoroughly to make sure that it works correctly.

Some notes on this problem:

  • A key step in solving this problem will be implementing the merge algorithm. The version of merge that we outlined in class worked by repeatedly removing the first elements of the sequences to merge. Be careful – as you saw in Problem One, removing from the front of a Vector does not take time O(1), and if you remove too many elements from the front of a Vector, you may end up exceeding the O(n log k) runtime.

  • There may be multiple DataPoints that have the same weight. If that’s the case, you should keep each of them in the resulting sequence, and you can break ties in weights arbitrarily.

  • The sequences to combine aren’t required to have the same size. Some of them may be gigantic. Some of them might be empty.

  • It’s legal to combine a list of zero sequences. What do you think you should return in this case?

  • The C++ standard libraries contain a function std::merge that implements the merge algorithm from class. For the purposes of this assignment, please refrain from using that function. We’re specifically interested in seeing you code this one up yourself.

Milestone 2: Ensure Combine Runs Quickly

The testing framework we’ve provided you this quarter is great for checking whether the code you’ve written works correctly. Now that we’ve started talking about efficiency, you’ll also need to make sure that your code has the proper big-O runtime. As a refresher, the code you write here should run in time O(n log k), where n is the total number of elements across all the lists and k is the number of lists.

What, exactly, does O(n log k) mean? It might help to imagine that n is allowed to vary, while k stays constant. If you have a small value of k, then log k is also small, and as n varies you’ll get a plot of a straight line with a small slope. As k increases, you’ll see the slope of that line starting to increase, but not by much because log k grows extremely slowly. In particular, if you look at values of k that grow exponentially quickly (say, k = 1, 4, 16, 64, 256, 1024), the slope of the lines you see should appear to increase by some fixed rate.

To help you confirm that you have indeed met this runtime bound, we’ve bundled a runtime plotter along with the starter files. You can select it using the “Time Tests” button at the top of the demo app and then choosing the “Combine” option. You’ll then see a graph of the runtime of your combine function over a range of different values of n and k. The coordinate axes are on a standard linear scale, and the values of k that are shown go up by a factor of four on each run. So take a look a the runtime plots you’re getting back. Are they consistent with your function running in time O(n log k)?

If you have questions about this, you’re welcome to stop by the CLaIR (the Conceptual LaIR, which runs in parallel with the regular LaIR queue) to talk through these questions with one of the section leaders. Once you have a sense of what you think you should see, confirm that your runtime plots match what’s expected. If so, great! If not, take a look back at your code. Think about where the inefficiencies might be coming from.

Milestone 2 Requirements

Ensure your code runs in time O(n log k), where n is the total number of elements and k is the number of different sequences.

  1. Choose the “Time Tests” option from the top menu, then choose “Combine.”

  2. Run the time tests and look at the plots you get back. Is what you’re seeing consistent with a runtime of O(n log k)? If so, great! If not, use those plots to form a hypothesis of what the runtime is, then go back to your code and see if you can spot the source of the inefficiency. Don’t forget to run the regular tests whenever you make a change to the code; you need to both have efficient code and correct code.

Submission Instructions

Once you’ve finished writing up your solution, submit these files:

  • BigOAnswers.txt

  • Combine.cpp

If you edited any of the other files in the course of adding extensions to the base assignment, please be sure to include them as well.

Good luck, and have fun!