Lab Solution 4: assign3 Redux and Threads


The lab checkoff sheet for all students can be found right here.

Mid-Quarter Evaluation

Before starting lab this week, we ask that you take 10-15 minutes to individually fill out the CS110 mid-quarter evaluation. This evaluation is important in letting the Teaching Team know what they're doing well and what they can improve. The evaluation is anonymous, and we appreciate any feedback that you have.

To fill out the mid-quarter evaluation, please visit this link.

Get Started

Go ahead and clone the lab4 folder and run make; this folder contains a working implementation of problem 4. While you may not get to problem 4 during section, it's useful to play with if you have time.

git clone /usr/class/cs110/repos/lab4/shared lab4

Problem 1: assign3 Redux

est. 30min.
Here is a collection of short answer questions about subprocess, trace, and farm. These questions force you to think big picture and understand the systems concepts we feel are important.

struct worker {
  worker() {}
  worker(char *argv[]) : sp(subprocess(argv, true, false)), available(false) {} 
  subprocess_t sp;
  bool available;
};

static const size_t kNumCPUs = sysconf(_SC_NPROCESSORS_ONLN); 
static vector<worker> workers(kNumCPUs);
static size_t numWorkersAvailable;

static const char *kWorkerArguments[] = { 
  "./factor.py", "--self-halting", NULL
};

static void spawnAllWorkers() {
  cout << "There are this many CPUs: " << kNumCPUs << ", numbered 0 through " 
    << kNumCPUs - 1 << "." << endl;
  for (size_t i = 0; i < workers.size(); i++) {
    workers[i] = worker(const_cast<char **>(kWorkerArguments));
    assignToCPU(workers[i].sp.pid, i); // implementation omitted, irrelevant 
  }
}

int main(int argc, char *argv[]) {
  signal(SIGCHLD, markWorkersAsAvailable); // markWorkersAsAvailable is correct
  spawnAllWorkers();
  // other functions, assume all correct
  return 0;
}

static sigset_t waitForAvailableWorker() {
  sigset_t existing, additions;
  sigemptyset(&additions);
  sigaddset(&additions, SIGCHLD);
  sigprocmask(SIG_BLOCK, &additions, &existing);
  while (numWorkersAvailable == 0) sigsuspend(&existing); 
  return existing;
}

static size_t getAvailableWorker() {
  sigset_t existing = waitForAvailableWorker();
  size_t i;
  for (i = 0; !workers[i].available; i++);
  assert(i < workers.size());
  numWorkersAvailable--;

  workers[i].available = false;
  sigprocmask(SIG_SETMASK, &existing, NULL); // restore original block set 
  return i;
}

static void waitForAvailableWorker() { // flawed alternative to the version above
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGCHLD);
  sigprocmask(SIG_BLOCK, &mask, NULL);
  while (numWorkersAvailable == 0) {
    sigprocmask(SIG_UNBLOCK, &mask, NULL);
    // a SIGCHLD delivered right here, between the unblock and the pause,
    // is handled without waking pause, which may then block forever
    pause();
    sigprocmask(SIG_BLOCK, &mask, NULL);
  }
}
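
For context, here's a sketch of how getAvailableWorker might be exercised by farm's dispatch loop. This sketch isn't part of the handout, and it assumes (as in assign3) that subprocess_t exposes a supplyfd and that the self-halting workers resume on SIGCONT; your own solution may differ in the details.

static void broadcastNumbersToWorkers() {
  while (true) {
    string line;
    getline(cin, line);
    if (cin.fail()) break;                       // EOF: no more numbers to publish
    size_t i = getAvailableWorker();
    dprintf(workers[i].sp.supplyfd, "%s\n", line.c_str());
    kill(workers[i].sp.pid, SIGCONT);            // wake the self-halted worker
  }
}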

Solution

Problem 2: Threads vs. Processes

est. 20min.
Threads and processes have much in common, but they also differ in several key ways that make each better suited to certain use cases. The questions below explore multiprocessing and multithreading in the big picture.

Each thread within a larger process is given a thread id, also called a tid. In fact, the thread id concept is just an extension of the process id. For singly threaded processes, the pid and the main thread's tid are precisely the same. If a process with pid 12345 creates three additional threads beyond the main thread (and no other processes or threads within other processes are created), then the tid of the main thread would be 12345, and the thread ids of the three other threads would be 12346, 12347, and 12348.
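
To see this on one of the myths, you can print the process id next to each thread's id. Here's a minimal sketch, assuming Linux and the <unistd.h> and <sys/syscall.h> headers (mytid is a made-up helper; older glibc versions expose no gettid wrapper, so we issue the raw system call):

static pid_t mytid() { return syscall(SYS_gettid); }

int main(int argc, char *argv[]) {
  cout << "pid: " << getpid() << ", main thread tid: " << mytid() << endl;
  thread t([] { cout << "spawned thread tid: " << mytid() << endl; });
  t.join();
  return 0;
}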

In some situations, the decision to rely on multithreading instead of multiprocessing is dictated solely by whether the code to be run apart from the main thread is available in executable or in library form. But in other situations, we have a choice.
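
To make the distinction concrete, consider how farm runs factor.py: the factoring logic exists only in executable form, so farm has no choice but to fork and exec. Were the same logic linkable as a C++ function, a thread would suffice. Here's a hedged sketch of the contrast (factor is a hypothetical stand-in, and the usual <thread>, <unistd.h>, and <sys/wait.h> headers are assumed):

static void factor(long n) { /* hypothetical CPU-bound work, elided */ }

// library form: the code is linkable, so a thread within this process suffices
static void factorViaThread(long n) {
  thread t(factor, n);
  t.join();
}

// executable form: the code exists only as a program, so we fork and exec it
static void factorViaProcess(const char *n) {
  pid_t pid = fork();
  if (pid == 0) {
    char *argv[] = { const_cast<char *>("./factor.py"), const_cast<char *>(n), NULL };
    execvp(argv[0], argv);
    _exit(1); // only reached if execvp fails
  }
  waitpid(pid, NULL, 0);
}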

Solution

Problem 3: Threads and Mutexes

est. 15min.
Here are some more short answer questions about the specifics of threads and the directives that help threads communicate.

Solution

A decrement like i-- typically compiles to three instructions: a load, a decrement, and a store.

mov    (%rbp), %rax
sub    $1, %rax
mov    %rax, (%rbp)

If the thread gets swapped off of the processor after the first or second of those three instructions and another thread is scheduled and changes i, the original thread will eventually resume and continue on with stale data. Some architectures are capable of doing an in-memory decrement (or, more broadly, a memory-to-memory instruction) in one instruction, so i-- could compile to a single instruction on some architectures if i is a global variable, and on a single core, where only one thing can happen at a time, that one instruction would execute atomically. But in general, you write code to be portable and architecture-agnostic, so you should treat i--, even where i is a global, as thread-unsafe.
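
The standard remedy is to bracket the load-decrement-store sequence with a mutex so it executes as one indivisible unit. Here's a small sketch (counter, counterLock, and drain are made up for illustration; the <mutex>, <thread>, and <vector> headers are assumed):

static int counter = 100000;
static mutex counterLock;

static void drain(size_t iterations) {
  for (size_t i = 0; i < iterations; i++) {
    counterLock.lock();
    counter--;            // the three-instruction sequence now runs without interleaving
    counterLock.unlock();
  }
}

int main(int argc, char *argv[]) {
  vector<thread> threads;
  for (size_t i = 0; i < 10; i++) threads.push_back(thread(drain, 10000));
  for (thread& t: threads) t.join();
  cout << "counter: " << counter << endl; // reliably 0 with the lock, rarely 0 without it
  return 0;
}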

Problem 4: Multithreaded quicksort

if time
quicksort is an efficient, divide-and-conquer sorting algorithm whose traditional implementation looks like this:

static void quicksort(vector<int>& numbers, ssize_t start, ssize_t finish) { 
  if (start >= finish) return;
  ssize_t mid = partition(numbers, start, finish);
  quicksort(numbers, start, mid - 1);
  quicksort(numbers, mid + 1, finish);
}

static void quicksort(vector<int>& numbers) { 
  quicksort(numbers, 0, numbers.size() - 1);
}

The details of how partition works aren't important. All you need to know is that a call to partition(numbers, start, finish) reorders the elements between numbers[start] and numbers[finish], inclusive, so that the numbers residing within indices start through mid - 1, inclusive, are less than or equal to the number at index mid, and that all numbers residing in indices mid + 1 through finish, inclusive, are strictly greater than the number at index mid. As a result of this reorganization, we know that, once partition returns, the number residing at index mid actually belongs there in the final sort.
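
If you're curious anyway, here's one scheme (Lomuto partitioning) that honors the contract just described. It's a sketch, not necessarily the implementation the lab ships with:

static ssize_t partition(vector<int>& numbers, ssize_t start, ssize_t finish) {
  int pivot = numbers[finish];          // arbitrarily choose the last element as the pivot
  ssize_t mid = start;
  for (ssize_t i = start; i < finish; i++) {
    if (numbers[i] <= pivot) swap(numbers[i], numbers[mid++]);
  }
  swap(numbers[mid], numbers[finish]);  // drop the pivot into its final resting place
  return mid;
}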

What’s super neat is that the two recursive calls to quicksort can execute in parallel, since the sequences they operate on don’t overlap. In fact, to make sure you get some practice with C++ threads right away, you're going to cannibalize the above implementation so that each call to quicksort spawns off threads to recursively sort the front and back portions at the same time.

Solution

static void aggressiveQuicksort(vector<int>& numbers, ssize_t start, ssize_t finish) {
  if (start >= finish) return;
  ssize_t mid = partition(numbers, start, finish);
  thread front(aggressiveQuicksort, ref(numbers), start, mid - 1); 
  thread back(aggressiveQuicksort, ref(numbers), mid + 1, finish); 
  front.join();
  back.join();
}

static void aggressiveQuicksort(vector<int>& numbers) { 
  aggressiveQuicksort(numbers, 0, numbers.size() - 1);
}

Note that the ref is needed to be clear that a reference to numbers (not a copy) is passed as the first argument to the aggressiveQuicksort thread routine. You'd think that the compiler could examine the prototype of the thread routine being installed, but it doesn't. It only looks at the prototype of the thread constructor, which relies on the C++ equivalent of ... to accept zero or more arguments beyond the thread routine function name.
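
To see the difference in isolation, consider this toy sketch (addOne is made up). Passing v directly doesn't compile, because the thread constructor copies its arguments, and a copy can't bind to a vector<int>&:

static void addOne(vector<int>& v) {
  for (int& x: v) x++;
}

int main(int argc, char *argv[]) {
  vector<int> v = {1, 2, 3};
  // thread t(addOne, v);    // compile-time error: the copied argument can't bind to vector<int>&
  thread t(addOne, ref(v));  // ref(v) forwards a true reference, so the thread mutates the original
  t.join();                  // v is now {2, 3, 4}
  return 0;
}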

// A hybrid of sequential and aggressive
static void conservativeQuicksort(vector<int>& numbers, ssize_t start, ssize_t finish) { 
  if (start >= finish) return;
  ssize_t mid = partition(numbers, start, finish);
  thread front(conservativeQuicksort, ref(numbers), start, mid - 1); 
  conservativeQuicksort(numbers, mid + 1, finish);
  front.join(); 
}

static void conservativeQuicksort(vector<int>& numbers) { 
  conservativeQuicksort(numbers, 0, numbers.size() - 1);
}

When timing, the results may surprise you and make you question whether we should ever use threading:

myth61:~/lab4$ echo $SHELL
/bin/bash

myth61:~/lab4$ time ./quicksort --sequential
Trial #1000: SUCCEEDED!
real 0m0.025s
user 0m0.020s
sys 0m0.004s
myth61:~/lab4$ time ./quicksort --aggressive
Trial #1000: SUCCEEDED!
real 0m14.698s
user 0m0.968s
sys 0m37.452s
myth61:~/lab4$ time ./quicksort --conservative
Trial #1000: SUCCEEDED!
real 0m7.271s
user 0m0.400s
sys 0m18.424s

The issue here is that quicksort is what's considered CPU-bound, systems speak meaning that the vast majority of the algorithm's work is computational and requires CPU time to add, compare, move, and so forth. By introducing threading, we do enlist all eight of the myth's cores, but we also introduce an enormous amount of overhead (thread selection, context switches between threads, and so forth), so much that the more parallelism we introduce, the longer the algorithm takes.

In general, you rarely want the number of threads doing CPU-bound work in parallel to exceed the number of cores (or maybe twice the number of cores). The work that needs to be done on any one of the CPUs is the same regardless of whether one thread does it or 100 threads do it.

Rest assured that we'll soon rely on threading to manage I/O- or network-bound algorithms, where the majority of the time is spent sleeping (off of the CPU) waiting for I/O events. Even in those scenarios, you rarely want the number of threads to be more than a small multiple of your CPU and core count. So forgive this extreme example where we create as many threads as we please; in general, we don't and won't do that. We allow it only during the initial window where we're just learning about threads.
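
One common compromise, sketched below (it's not part of the lab), is to cap the recursion depth at which new threads are spawned so the number of concurrent threads roughly matches the core count, falling back on plain recursion below that depth:

static void boundedQuicksort(vector<int>& numbers, ssize_t start, ssize_t finish, size_t depth) {
  if (start >= finish) return;
  ssize_t mid = partition(numbers, start, finish);
  if (depth > 0) {
    thread front(boundedQuicksort, ref(numbers), start, mid - 1, depth - 1);
    boundedQuicksort(numbers, mid + 1, finish, depth - 1);
    front.join();
  } else {
    boundedQuicksort(numbers, start, mid - 1, 0);   // no more spawning: sequential from here down
    boundedQuicksort(numbers, mid + 1, finish, 0);
  }
}

static void boundedQuicksort(vector<int>& numbers) {
  boundedQuicksort(numbers, 0, numbers.size() - 1, 3); // depth 3 allows up to 8 concurrent strands, matching the myths' 8 cores
}

The cap keeps the thread-creation and context-switching overhead fixed no matter how large the input grows.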

Checkoff Questions Solutions

