Lab Solution 3: Parallel Processing


The lab checkoff sheet for all students can be found right here.

Problem 1: Analyzing parallel mergesort

Before starting, go ahead and clone the lab3 folder and run make; this folder contains a working implementation of mergesort.

git clone /usr/class/cs110/repos/lab3/shared lab3

Consider the architecturally interesting portion of the mergesort executable, which launches 128 peer processes to cooperatively sort an array of 128 randomly generated numbers.  The implementations of createSharedRandomArray and freeSharedArray are omitted for the time being. You can also find this code, with comments (omitted below for brevity), in mergesort.cc.

static bool shouldKeepMerging(size_t startIndex, size_t reach, size_t length) {
  return startIndex % reach == 0 && reach <= length;
}

static void repeatedlyMerge(int numbers[], size_t length, size_t startIndex) {
  int *base = numbers + startIndex;
  for (size_t reach = 2; shouldKeepMerging(startIndex, reach, length); reach *= 2) {
    raise(SIGSTOP);
    inplace_merge(base, base + reach / 2, base + reach);
  }
  exit(0);
}

static void createMergers(int numbers[], pid_t workers[], size_t length) {
  for (size_t workerIndex = 0; workerIndex < length; workerIndex++) {
    workers[workerIndex] = fork();
    if (workers[workerIndex] == 0) {
      repeatedlyMerge(numbers, length, workerIndex);
    }
  }
}

static void orchestrateMergers(int numbers[], pid_t workers[], size_t length) {
  size_t step = 1;
  while (step <= length) {

    // Wait for all still-remaining workers to stop or terminate
    for (size_t start = 0; start < length; start += step) {
      waitpid(workers[start], NULL, WUNTRACED);
    }

    // Continue half of the workers
    step *= 2;
    for (size_t start = 0; start < length; start += step) {
      kill(workers[start], SIGCONT);
    }
  }
}

static void mergesort(int numbers[], size_t length) {
  pid_t workers[length];
  createMergers(numbers, workers, length);
  orchestrateMergers(numbers, workers, length);
}

int main(int argc, char *argv[]) {
  for (size_t trial = 1; trial <= kNumTrials; trial++) {
    int *numbers = createSharedRandomArray(kNumElements);    
    mergesort(numbers, kNumElements);

    // Check if the resulting array is in fact sorted
    bool sorted = is_sorted(numbers, numbers + kNumElements);
    cout << "\rTrial #" << setw(5) << setfill('0') << trial << ": ";
    cout << (sorted ? "\033[1;34mSUCCEEDED!\033[0m" : "\033[1;31mFAILED!   \033[0m") << flush;
    freeSharedArray(numbers, kNumElements);
    if (!sorted) { 
      cout << endl << "mergesort is \033[1;31mBROKEN.... please fix!\033[0m" << flush; 
      break; 
    }
  }
  cout << endl;
  return 0;
}

The program presented above is a nod to concurrent programming and to the question of whether parallelism can reduce the asymptotic running time of an algorithm (in this case, mergesort).  We'll lead you through a series of questions to reinforce your multiprocessing and signal skills and to understand why the asymptotic running time of an algorithm can sometimes be improved in a parallel programming world.  For reasons discussed below, this program works because the address in the numbers variable is cloned across the 128 fork calls, and that particular address maps to the same set of physical addresses in all 128 processes (which is different from what usually happens).  By relying on 128 cooperating processes, the program successfully sorts any array of length 128.
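The property the 128 mergers depend on can be demonstrated in isolation. Below is a minimal standalone sketch (our own, not part of mergesort.cc): a child process writes into a MAP_SHARED | MAP_ANONYMOUS mapping, and the parent observes the write after the child exits, which would not happen with heap memory.

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if a write made by a child process into a
// MAP_SHARED | MAP_ANONYMOUS mapping is visible to the parent.
bool sharedWriteIsVisible() {
  int *shared = static_cast<int *>(mmap(NULL, sizeof(int),
      PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0));
  *shared = 0;
  pid_t pid = fork();
  if (pid == 0) {
    *shared = 110;   // the child's store lands in the same physical page
    _exit(0);
  }
  waitpid(pid, NULL, 0);
  bool visible = (*shared == 110);  // parent reads the child's write
  munmap(shared, sizeof(int));
  return visible;
}
```

Had the array been allocated with malloc instead, the child's write would land in a copy-on-write page private to the child, and the parent would still see 0.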

Truth be told, the mergesort algorithm we've implemented is of more theoretical than practical interest.  But it's still a novel example of parallel programming, and one that feels much more relevant and real-world than parent-and-pentuplets-go-to-disney.

Use the following short answer questions to guide the discussion.

static void orchestrateMergers(int numbers[], pid_t workers[], size_t length) {
  for (size_t step = 1; step <= length; step *= 2) {
    for (size_t start = 0; start < length; start += step) {
      int status;
      waitpid(workers[start], &status, WUNTRACED);
      if (WIFSTOPPED(status)) {
        kill(workers[start], SIGCONT);
      }
    }
  }
}

static void orchestrateMergers(int numbers[], pid_t workers[], size_t length) {
  for (size_t step = 1; step <= length; step *= 2) {
    for (ssize_t start = length - step; start >= 0; start -= step) {
      int status;
      waitpid(workers[start], &status, WUNTRACED);
      if (WIFSTOPPED(status)) {
        kill(workers[start], SIGCONT);
      }
    }
  }
}  

The createSharedRandomArray function (defined in memory.h and memory.cc in your lab3 repo) sets aside space for an array of length integers and seeds it with random numbers.  It does so using the mmap function you've seen in Assignments 1 and 2, and which you also saw a bunch of times while playing with strace during last week's discussion section.

int *createSharedRandomArray(size_t length) {
  // Allocate space for the array
  int *numbers = static_cast<int *>(mmap(NULL, length * sizeof(int), 
    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0));

  // Fill the array with random numbers
  RandomGenerator rgen;
  for (size_t i = 0; i < length; i++) {
    numbers[i] = rgen.getNextInt(kMinValue, kMaxValue);
  }

  return numbers;
}

The mmap function takes the place of malloc here, because it sets up space not in the heap, but in a separate, anonymous segment that other processes can see and touch.  MAP_ANONYMOUS means the mapping isn't backed by any file, and MAP_SHARED means the mapping's physical pages are shared with, rather than copied into, any child processes created via fork.
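The matching freeSharedArray function (also defined in memory.cc) presumably just hands the mapping back to the OS with munmap; here is a plausible sketch, not the official implementation.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Sketch of the cleanup call paired with createSharedRandomArray: munmap
// releases the mapping that mmap created.  The size passed to munmap must
// match the size passed to the original mmap call.
void freeSharedArray(int *numbers, size_t length) {
  munmap(numbers, length * sizeof(int));
}
```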

Solution

Problem 2: Virtual Memory and Memory Mapping

Assume the OS allocates virtual memory to physical memory in 4096-byte pages (a page is the unit of memory the operating system manages). Recall that one key benefit of this virtual memory mapping is that the OS can map pages lazily, as needed (i.e. it doesn't need to map every virtual address to a physical address from the outset). Describe how virtual memory for a process undergoing an execvp transformation might be updated as:

- the assembly code instructions are loaded
- the initialized global variables, initialized global constants, and uninitialized global variables are loaded
- the heap is initialized
- the portion of the stack frame set up to call main is created

If the virtual address 0x7fffa2efc345 maps to the physical page in main memory whose base address is 0x12345aab8000, what range of virtual addresses around it would map to the same physical page? What's the largest size a character array can be before it absolutely must map to three different physical pages? What's the smallest size a character array can be and still map to three physical pages?
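The address arithmetic behind these questions comes down to bit masking. Here is a small sketch, assuming the 4096-byte pages stated above (the helper names are ours, not part of the lab code): clearing the low 12 bits of an address yields the base of its page, and two addresses fall on the same page exactly when their page bases match.

```cpp
#include <cstdint>

// Assumes 4096-byte pages, so the low log2(4096) = 12 bits of an address
// are the offset within the page.
constexpr uintptr_t kPageSize = 4096;

// Returns the base address of the page containing addr.
uintptr_t pageBase(uintptr_t addr) {
  return addr & ~(kPageSize - 1);  // clear the low 12 bits
}

// True when a and b live on the same virtual page.
bool samePage(uintptr_t a, uintptr_t b) {
  return pageBase(a) == pageBase(b);
}
```

For instance, pageBase(0x7fffa2efc345) is 0x7fffa2efc000, so the surrounding page covers addresses 0x7fffa2efc000 through 0x7fffa2efcfff.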

For fun, optional reading, read these two documents (though you needn't do this reading if you don't want to, since it goes beyond the scope of our discussion of virtual memory):

As an optional fun activity, try the following experiment: ssh into any myth machine, and type ps u at the command prompt to learn the process id of your shell, as with:

myth64:~$ ps u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
<USER> 22606  0.6  0.0  15716  5004 pts/18   Ss   18:14   0:00 -bash
<USER> 22827  0.0  0.0  30404  1516 pts/18   R+   18:14   0:00 ps u
myth64:~$ cd /proc/22606
myth64:/proc/22606$ ls -lta maps
-r--r--r-- 1 <USER> operator 0 Jan 28 18:15 maps
myth64:/proc/22606$ more maps 
00400000-004f4000 r-xp 00000000 08:01 2752634               /bin/bash
006f3000-006f4000 r--p 000f3000 08:01 2752634               /bin/bash
006f4000-006fd000 rw-p 000f4000 08:01 2752634               /bin/bash
006fd000-00703000 rw-p 00000000 00:00 0 
01230000-01405000 rw-p 00000000 00:00 0                     [heap]
7f7de7c51000-7f7de8069000 r--p 00000000 08:01 8921285       /usr/lib/locale/locale-archive
7f7de8069000-7f7de8229000 r-xp 00000000 08:01 6033975       /lib/x86_64-linux-gnu/libc-2.23.so
7f7de8229000-7f7de8429000 ---p 001c0000 08:01 6033975       /lib/x86_64-linux-gnu/libc-2.23.so
7f7de8429000-7f7de842d000 r--p 001c0000 08:01 6033975       /lib/x86_64-linux-gnu/libc-2.23.so
7f7de842d000-7f7de842f000 rw-p 001c4000 08:01 6033975       /lib/x86_64-linux-gnu/libc-2.23.so
7f7de842f000-7f7de8433000 rw-p 00000000 00:00 0 
7f7de8433000-7f7de8436000 r-xp 00000000 08:01 6033979       /lib/x86_64-linux-gnu/libdl-2.23.so
7f7de8436000-7f7de8635000 ---p 00003000 08:01 6033979       /lib/x86_64-linux-gnu/libdl-2.23.so
7f7de8635000-7f7de8636000 r--p 00002000 08:01 6033979       /lib/x86_64-linux-gnu/libdl-2.23.so
7f7de8636000-7f7de8637000 rw-p 00003000 08:01 6033979       /lib/x86_64-linux-gnu/libdl-2.23.so
7f7de8637000-7f7de865c000 r-xp 00000000 08:01 6029424       /lib/x86_64-linux-gnu/libtinfo.so.5.9
7f7de865c000-7f7de885b000 ---p 00025000 08:01 6029424       /lib/x86_64-linux-gnu/libtinfo.so.5.9
7f7de885b000-7f7de885f000 r--p 00024000 08:01 6029424       /lib/x86_64-linux-gnu/libtinfo.so.5.9
7f7de885f000-7f7de8860000 rw-p 00028000 08:01 6029424       /lib/x86_64-linux-gnu/libtinfo.so.5.9
7f7de8860000-7f7de8886000 r-xp 00000000 08:01 6033971       /lib/x86_64-linux-gnu/ld-2.23.so
7f7de8a28000-7f7de8a5d000 r--s 00000000 00:13 713           /run/nscd/dbS6IIxD (deleted)
7f7de8a5d000-7f7de8a61000 rw-p 00000000 00:00 0 
7f7de8a7e000-7f7de8a85000 r--s 00000000 08:01 8924731       /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
7f7de8a85000-7f7de8a86000 r--p 00025000 08:01 6033971       /lib/x86_64-linux-gnu/ld-2.23.so
7f7de8a86000-7f7de8a87000 rw-p 00026000 08:01 6033971       /lib/x86_64-linux-gnu/ld-2.23.so
7f7de8a87000-7f7de8a88000 rw-p 00000000 00:00 0 
7fff30343000-7fff30364000 rw-p 00000000 00:00 0             [stack]
7fff30382000-7fff30384000 r--p 00000000 00:00 0             [vvar]
7fff30384000-7fff30386000 r-xp 00000000 00:00 0             [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0     [vsyscall]

From the man page for proc: "The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures.  It is commonly mounted at /proc.  Most of it is read-only, but some files allow kernel variables to be changed."

Within proc is a subdirectory for every single process running on the machine, and within each of those there are sub-subdirectories that present information about various resources tapped by that process.  In this case, the process subdirectory is named 22606, and the sub-subdirectory of interest is maps, which provides information about all of the contiguous regions of virtual memory the process relies on for execution.

To find out what each row and column in the output means, consult this stackoverflow question and read through the accepted answer.
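A process can also read its own map table programmatically, since /proc/self/maps always names the calling process. The following sketch (our own helper, assuming a Linux machine like the myths) counts the mapped regions in the same listing you saw above with more.

```cpp
#include <fstream>
#include <string>
#include <cstddef>

// Counts the contiguous mapped regions of the calling process by reading
// /proc/self/maps, one region per line (same format as the listing above).
size_t countMappedRegions() {
  std::ifstream maps("/proc/self/maps");
  std::string line;
  size_t count = 0;
  while (std::getline(maps, line)) count++;
  return count;
}
```

Even a tiny program should report a couple dozen regions, since the loader maps in libc and friends before main ever runs.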

Solution

Problem 3: The Process Scheduler

The Linux kernel is responsible for scheduling processes onto processor cores. The "process scheduler" is the component of the operating system that decides whether a running process should continue running and, if not, which process should run next.  The scheduler maintains three different data structures to help manage this selection process:

The Running Queue

The running queue keeps track of all of the processes that are currently assigned to a CPU.  The nodes in that queue needn't store very much, if anything at all, since the CPUs themselves house everything needed for execution. The running queue could have length 0 (meaning no process is currently running, e.g. because all processes are blocked), and its length can be as high as the number of CPUs.

The Ready Queue

The ready queue keeps track of all of the processes that aren't currently running but are qualified to run.  The nodes in the queue store the state of the CPU at the moment the process was pulled off the processor.  That information is used to restore the CPU when a process is promoted from ready to running again, so that the process can continue as if it had never been interrupted.

The Blocked Set

This set holds processes which cannot at the moment carry on without some external event happening: they may be waiting for user input, waiting for memory reads, waiting on a network, or waiting for another process to change state. The blocked set looks much like the ready queue, except that it contains processes that were forced off the processor even before their time slices ended.  A process is demoted to the blocked set because it blocked on something (e.g. waitpid).
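The three structures and the transitions between them can be captured in a toy model. This is our own sketch for intuition only, nothing like the kernel's actual data structures: processes move ready to running, and from running either back to ready (preempted) or into the blocked set (blocked on an event), rejoining the ready queue when the event fires.

```cpp
#include <deque>
#include <set>

// Toy model of the scheduler's bookkeeping, using ints as stand-in pids.
struct ToyScheduler {
  std::deque<int> running;   // pids currently assigned to a CPU
  std::deque<int> ready;     // pids qualified to run, waiting for a CPU
  std::set<int> blocked;     // pids waiting on some external event

  void schedule() {          // promote the next ready pid onto a CPU
    running.push_back(ready.front());
    ready.pop_front();
  }
  void preempt() {           // time slice expired: running -> ready
    ready.push_back(running.front());
    running.pop_front();
  }
  void block() {             // running pid blocked (e.g. on waitpid)
    blocked.insert(running.front());
    running.pop_front();
  }
  void unblock(int pid) {    // awaited event fired: blocked -> ready
    blocked.erase(pid);
    ready.push_back(pid);
  }
};
```

Note that unblock moves a process to the ready queue, not straight back onto a CPU: a freshly unblocked process still competes with everything else that's ready to run.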

Solution

Checkoff Questions Solutions

