Lab Solution 4: assign3 Redux and Threads


The lab checkoff sheet for all students can be found right here.

Mid-Quarter Evaluation

Before starting lab this week, we ask that you take 10-15 minutes to individually fill out the CS110 mid-quarter evaluation. This evaluation is important in letting the Teaching Team know what they're doing well and what they can improve. The evaluation is anonymous, and we appreciate any feedback that you have.

To fill out the mid-quarter evaluation, please visit this link.

Get Started

Go ahead and clone the lab4 folder and run make; this folder contains a working implementation of problem 4. While you may not get to problem 4 during section, it's useful to play with if you have time.

git clone /usr/class/cs110/repos/lab4/shared lab4

Problem 1: assign3 Redux

est. 30min.
Here is a collection of short answer questions about subprocess, trace, and farm. These questions force you to think big picture and understand the systems concepts we feel are important.

struct worker {
  worker() {}
  worker(char *argv[]) : sp(subprocess(argv, true, false)), available(false) {} 
  subprocess_t sp;
  bool available;
};

static const size_t kNumCPUs = sysconf(_SC_NPROCESSORS_ONLN); 
static vector<worker> workers(kNumCPUs);
static size_t numWorkersAvailable;

static const char *kWorkerArguments[] = { 
  "./factor.py", "--self-halting", NULL
};

static void spawnAllWorkers() {
  cout << "There are this many CPUs: " << kNumCPUs << ", numbered 0 through " 
    << kNumCPUs - 1 << "." << endl;
  for (size_t i = 0; i < workers.size(); i++) {
    workers[i] = worker(const_cast<char **>(kWorkerArguments));
    assignToCPU(workers[i].sp.pid, i); // implementation omitted, irrelevant 
  }
}

int main(int argc, char *argv[]) {
  signal(SIGCHLD, markWorkersAsAvailable); // markWorkersAsAvailable is correct
  spawnAllWorkers();
  // other functions, assume all correct
  return 0;
}

static sigset_t waitForAvailableWorker() {
  sigset_t existing, additions;
  sigemptyset(&additions);
  sigaddset(&additions, SIGCHLD);
  sigprocmask(SIG_BLOCK, &additions, &existing);
  while (numWorkersAvailable == 0) sigsuspend(&existing); 
  return existing;
}

static size_t getAvailableWorker() {
  sigset_t existing = waitForAvailableWorker();
  size_t i;
  for (i = 0; !workers[i].available; i++);
  assert(i < workers.size());
  numWorkersAvailable--;

  workers[i].available = false;
  sigprocmask(SIG_SETMASK, &existing, NULL); // restore original block set 
  return i;
}

static void waitForAvailableWorker() { // flawed alternative to the version above
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGCHLD);
  sigprocmask(SIG_BLOCK, &mask, NULL);
  while (numWorkersAvailable == 0) {
    sigprocmask(SIG_UNBLOCK, &mask, NULL);
    // a SIGCHLD delivered right here, between the unblock and the pause,
    // is handled without waking pause, which may then block forever
    pause();
    sigprocmask(SIG_BLOCK, &mask, NULL);
  }
}
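
For context, here's a sketch of how getAvailableWorker might be exercised by farm's dispatch loop. This sketch isn't part of the handout, and it assumes (as in assign3) that subprocess_t exposes a supplyfd and that the self-halting workers resume on SIGCONT; your own solution may differ in the details.

static void broadcastNumbersToWorkers() {
  while (true) {
    string line;
    getline(cin, line);
    if (cin.fail()) break;                       // EOF: no more numbers to publish
    size_t i = getAvailableWorker();
    dprintf(workers[i].sp.supplyfd, "%s\n", line.c_str());
    kill(workers[i].sp.pid, SIGCONT);            // wake the self-halted worker
  }
}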

Solution

Problem 2: Threads vs. Processes

est. 20min.
Threads and processes have much in common, but they also differ in several key ways that make each better suited to certain use cases. The questions below explore multiprocessing and multithreading in the big picture.

Each thread within a larger process is given a thread id, also called a tid. In fact, the thread id concept is just an extension of the process id. For singly threaded processes, the pid and the main thread's tid are precisely the same. If a process with pid 12345 creates three additional threads beyond the main thread (and no other processes or threads within other processes are created), then the tid of the main thread would be 12345, and the thread ids of the three other threads would be 12346, 12347, and 12348.
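
To see this on one of the myths, you can print the process id next to each thread's id. Here's a minimal sketch, assuming Linux and the <unistd.h> and <sys/syscall.h> headers (mytid is a made-up helper; older glibc versions expose no gettid wrapper, so we issue the raw system call):

static pid_t mytid() { return syscall(SYS_gettid); }

int main(int argc, char *argv[]) {
  cout << "pid: " << getpid() << ", main thread tid: " << mytid() << endl;
  thread t([] { cout << "spawned thread tid: " << mytid() << endl; });
  t.join();
  return 0;
}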

In some situations, the decision to rely on multithreading instead of multiprocessing is dictated solely by whether the code to be run apart from the main thread is available in executable or in library form. But in other situations, we have a choice.
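
To make the distinction concrete, consider how farm runs factor.py: the factoring logic exists only in executable form, so farm has no choice but to fork and exec. Were the same logic linkable as a C++ function, a thread would suffice. Here's a hedged sketch of the contrast (factor is a hypothetical stand-in, and the usual <thread>, <unistd.h>, and <sys/wait.h> headers are assumed):

static void factor(long n) { /* hypothetical CPU-bound work, elided */ }

// library form: the code is linkable, so a thread within this process suffices
static void factorViaThread(long n) {
  thread t(factor, n);
  t.join();
}

// executable form: the code exists only as a program, so we fork and exec it
static void factorViaProcess(const char *n) {
  pid_t pid = fork();
  if (pid == 0) {
    char *argv[] = { const_cast<char *>("./factor.py"), const_cast<char *>(n), NULL };
    execvp(argv[0], argv);
    _exit(1); // only reached if execvp fails
  }
  waitpid(pid, NULL, 0);
}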

Solution

Problem 3: Threads and Mutexes

est. 15min.
Here are some more short answer questions about the specifics of threads and the directives that help threads communicate.

Solution

A decrement like i-- typically compiles to three instructions: a load, a decrement, and a store.

mov    (%rbp), %rax
sub    $1, %rax
mov    %rax, (%rbp)

If the thread gets swapped off of the processor after the first or second of those three instructions and another thread is scheduled and changes i, the original thread will eventually resume and continue on with stale data. Some architectures are capable of doing an in-memory decrement (or, more broadly, a memory-to-memory instruction) in one instruction, so i-- could compile to a single instruction on some architectures if i is a global variable, and on a single core, where only one thing can happen at a time, that one instruction would execute atomically. But in general, you write code to be portable and architecture-agnostic, so you should treat i--, even where i is a global, as thread-unsafe.
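
The standard remedy is to bracket the load-decrement-store sequence with a mutex so it executes as one indivisible unit. Here's a small sketch (counter, counterLock, and drain are made up for illustration; the <mutex>, <thread>, and <vector> headers are assumed):

static int counter = 100000;
static mutex counterLock;

static void drain(size_t iterations) {
  for (size_t i = 0; i < iterations; i++) {
    counterLock.lock();
    counter--;            // the three-instruction sequence now runs without interleaving
    counterLock.unlock();
  }
}

int main(int argc, char *argv[]) {
  vector<thread> threads;
  for (size_t i = 0; i < 10; i++) threads.push_back(thread(drain, 10000));
  for (thread& t: threads) t.join();
  cout << "counter: " << counter << endl; // reliably 0 with the lock, rarely 0 without it
  return 0;
}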

Problem 4: Multithreaded quicksort

if time
quicksort is an efficient, divide-and-conquer sorting algorithm whose traditional implementation looks like this:

static void quicksort(vector<int>& numbers, ssize_t start, ssize_t finish) { 
  if (start >= finish) return;
  ssize_t mid = partition(numbers, start, finish);
  quicksort(numbers, start, mid - 1);
  quicksort(numbers, mid + 1, finish);
}

static void quicksort(vector<int>& numbers) { 
  quicksort(numbers, 0, numbers.size() - 1);
}

The details of how partition works aren't important. All you need to know is that a call to partition(numbers, start, finish) reorders the elements between numbers[start] and numbers[finish], inclusive, so that the numbers residing within indices start through mid - 1, inclusive, are less than or equal to the number at index mid, and that all numbers residing in indices mid + 1 through finish, inclusive, are strictly greater than the number at index mid. As a result of this reorganization, we know that, once partition returns, the number residing at index mid actually belongs there in the final sort.
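
If you're curious anyway, here's one scheme (Lomuto partitioning) that honors the contract just described. It's a sketch, not necessarily the implementation the lab ships with:

static ssize_t partition(vector<int>& numbers, ssize_t start, ssize_t finish) {
  int pivot = numbers[finish];          // arbitrarily choose the last element as the pivot
  ssize_t mid = start;
  for (ssize_t i = start; i < finish; i++) {
    if (numbers[i] <= pivot) swap(numbers[i], numbers[mid++]);
  }
  swap(numbers[mid], numbers[finish]);  // drop the pivot into its final resting place
  return mid;
}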

What’s super neat is that the two recursive calls to quicksort can execute in parallel, since the sequences they operate on don’t overlap. In fact, to make sure you get some practice with C++ threads right away, you're going to cannibalize the above implementation so that each call to quicksort spawns off threads to recursively sort the front and back portions at the same time.

Solution

static void aggressiveQuicksort(vector<int>& numbers, ssize_t start, ssize_t finish) {
  if (start >= finish) return;
  ssize_t mid = partition(numbers, start, finish);
  thread front(aggressiveQuicksort, ref(numbers), start, mid - 1); 
  thread back(aggressiveQuicksort, ref(numbers), mid + 1, finish); 
  front.join();
  back.join();
}

static void aggressiveQuicksort(vector<int>& numbers) { 
  aggressiveQuicksort(numbers, 0, numbers.size() - 1);
}

Note that the ref is needed to be clear that a reference to numbers (not a copy) is passed as the first argument to the aggressiveQuicksort thread routine. You'd think that the compiler could examine the prototype of the thread routine being installed, but it doesn't. It only looks at the prototype of the thread constructor, which relies on the C++ equivalent of ... to accept zero or more arguments beyond the thread routine function name.
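
To see the difference in isolation, consider this toy sketch (addOne is made up). Passing v directly doesn't compile, because the thread constructor copies its arguments, and a copy can't bind to a vector<int>&:

static void addOne(vector<int>& v) {
  for (int& x: v) x++;
}

int main(int argc, char *argv[]) {
  vector<int> v = {1, 2, 3};
  // thread t(addOne, v);    // compile-time error: the copied argument can't bind to vector<int>&
  thread t(addOne, ref(v));  // ref(v) forwards a true reference, so the thread mutates the original
  t.join();                  // v is now {2, 3, 4}
  return 0;
}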

// A hybrid of sequential and aggressive
static void conservativeQuicksort(vector<int>& numbers, ssize_t start, ssize_t finish) { 
  if (start >= finish) return;
  ssize_t mid = partition(numbers, start, finish);
  thread front(conservativeQuicksort, ref(numbers), start, mid - 1); 
  conservativeQuicksort(numbers, mid + 1, finish);
  front.join(); 
}

static void conservativeQuicksort(vector<int>& numbers) { 
  conservativeQuicksort(numbers, 0, numbers.size() - 1);
}

When timing, the results may surprise you and make you question whether we should ever use threading:

myth61:~/lab4$ echo $SHELL
/bin/bash

myth61:~/lab4$ time ./quicksort --sequential
Trial #1000: SUCCEEDED!
real 0m0.025s
user 0m0.020s
sys 0m0.004s
myth61:~/lab4$ time ./quicksort --aggressive
Trial #1000: SUCCEEDED!
real 0m14.698s
user 0m0.968s
sys 0m37.452s
myth61:~/lab4$ time ./quicksort --conservative
Trial #1000: SUCCEEDED!
real 0m7.271s
user 0m0.400s
sys 0m18.424s

The issue here is that quicksort is what's considered CPU-bound, systems speak meaning that the vast majority of the algorithm's work is computational and requires CPU time to add, compare, move, and so forth. By introducing threading, we do enlist all eight of the myth's cores, but we also introduce an enormous amount of overhead (thread selection, context switches between threads, and so forth), so much that the more parallelism we introduce, the longer the algorithm takes.

In general, you rarely want the number of threads doing CPU-bound work in parallel to exceed the number of cores (or maybe twice the number of cores). The work that needs to be done on any one of the CPUs is the same regardless of whether one thread does it or 100 threads do it.

Rest assured that we'll soon rely on threading to manage I/O- or network-bound algorithms, where the majority of the time is spent sleeping (off of the CPU) waiting for I/O events. Even in those scenarios, you rarely want the number of threads to be more than a small multiple of your CPU and core count. So forgive this extreme example where we create as many threads as we please; in general, we don't and won't do that. We allow it only during the initial window where we're just learning about threads.
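
One common compromise, sketched below (it's not part of the lab), is to cap the recursion depth at which new threads are spawned so the number of concurrent threads roughly matches the core count, falling back on plain recursion below that depth:

static void boundedQuicksort(vector<int>& numbers, ssize_t start, ssize_t finish, size_t depth) {
  if (start >= finish) return;
  ssize_t mid = partition(numbers, start, finish);
  if (depth > 0) {
    thread front(boundedQuicksort, ref(numbers), start, mid - 1, depth - 1);
    boundedQuicksort(numbers, mid + 1, finish, depth - 1);
    front.join();
  } else {
    boundedQuicksort(numbers, start, mid - 1, 0);   // no more spawning: sequential from here down
    boundedQuicksort(numbers, mid + 1, finish, 0);
  }
}

static void boundedQuicksort(vector<int>& numbers) {
  boundedQuicksort(numbers, 0, numbers.size() - 1, 3); // depth 3 allows up to 8 concurrent strands, matching the myths' 8 cores
}

The cap keeps the thread-creation and context-switching overhead fixed no matter how large the input grows.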

Checkoff Questions Solutions

