Assignment 4: Synchronization and Trust

Due: Tue Feb 28 11:59 pm
Late submissions accepted until Thu Mar 2 11:59 pm

Assignment by David Mazières and John Ousterhout, with modifications by Nick Troccoli, Benjamin Xie and Xiyu Zhang. Ethics case study includes material from Kathleen Creel.

Assignment 4 Overview Session: Tues 2/21 4PM on Zoom (link on Canvas). Note that the session is recorded and will be posted to the course website shortly after it’s over. Slides here, video here.

In this project you will reflect on the impact of complexities in writing multithreaded code and evaluating trust, and implement code using the "monitor" pattern to complete two synchronization problems. For the programs you will write, one is for passengers boarding Caltrain, and another is for grouping together guests at a party, and your task is to implement a class for each with the necessary instance variables and code to make the threads coordinate properly.

Learning Goals

The goal of this project is for you to get practice writing multithreaded code with mutexes and condition variables, and understand the challenges and impacts of multithreaded code and race conditions. This assignment will help you:

build your skills identifying race conditions and deadlock
more fully understand where to use mutexes and condition variables as appropriate to coordinate between threads and avoid deadlock/race conditions
get practice using the "monitor" pattern for designing multithreaded programs
explore understandings of trust and how we trust the systems we use

Starter Code

To get started on the assignment, clone the starter project using the command

    git clone /afs/ir/class/cs111/repos/assign4/$USER assign4

This line creates a new folder called assign4 in your current working directory containing a copy of the starter code.

Here are the relevant files for your work on the assignment:

bridge.hh/cc and car-simulation.cc: the implementation of the Bridge class as introduced in lecture, and the car simulation program that uses it. You do not need to modify these files.
bridge-two-locks.hh/cc and car-simulation-two-locks.cc: an alternate implementation of the Bridge class that uses two locks instead of one, but that has a race condition you will identify (but are not required to fix).
questions.txt: a text file where you will add answers to some written questions in the first part of the assignment
caltrain.hh/cc: the implementation of the Station class that represents a single train station. You will edit the header file to add necessary instance variables, and edit the cc file to implement the provided methods.
caltrain_test.cc: a test file containing various tests that spawn threads and call the Station methods in different ways.
party.hh/cc: the implementation of the Party class that represents a single gathering. You will edit the header file to add necessary instance variables, and edit the cc file to implement the provided methods.
party_test.cc: a test file containing various tests that spawn threads and call the Party methods in different ways.
samples: a symbolic link to the shared directory for this assignment. It contains:
- caltrain_test_soln: a provided sample caltrain_test solution executable that uses a fully-implemented Station class.
- party_test_soln: a provided sample party_test solution executable that uses a fully-implemented Party class.

1: Race Conditions, Complexity and Trust

Synchronization code is hard to get right - it is difficult to reason about, and can be difficult to thoroughly test, particularly in complex multithreaded systems. In this part of the assignment, we will explore the provided Bridge example from lecture as a case study in complexity tradeoffs, and we will also explore the ramifications of race conditions in real-world software and the trust complexities that arise as a result.

Complexity and Race Conditions

In lecture, we discussed how there are tradeoffs in deciding the number of locks we use in a multithreaded program. If we have many locks, perhaps there's more concurrency, but there's also more complexity, and locking/unlocking itself introduces overhead in our program. If we have just one lock, it doesn't allow for much concurrency, but perhaps is easier to reason about.

The monitor pattern suggests using one lock with some collection of related variables. This reduces complexity while also permitting some level of concurrency. The Bridge example from lecture is an example of the monitor pattern.

Take some time to review bridge.hh and bridge.cc now, along with the associated car-simulation.cc program that uses it, and run make to compile and run the car-simulation program; you will be writing your own monitor classes similar to Bridge later on the assignment (we will provide fully-implemented programs that use your classes, like car-simulation.cc), so we encourage you to ask questions about anything you don't understand!

Your colleague has implemented an alternative approach to the Bridge class in bridge-two-locks.hh/cc and compiled into car-simulation-two-locks. Instead of using a single lock within Bridge to adhere to the monitor pattern, they have decided to use two locks to potentially increase concurrency, though at the expense of complexity. Here are some details about their updated approach:

instead of 1 bridge lock, they have two locks; one for eastbound state, and 1 for westbound.
they updated the print function to lock internally, whereas previously it relied on the caller locking before calling it. (you do not need to review this code)

Q1: Read through bridge-two-locks.hh/cc to review their two-lock implementation. In questions.txt, in at most 2 sentences, describe parallelism supported because of this 2-lock approach that is not supported by the original bridge.hh/cc implementation. (e.g. "thread 1 could be doing X while thread 2 could be doing Y, but this isn't possible in the original implementation").

Your colleague believes this new approach has introduced race conditions...they have observed cases occasionally where cars get stuck as they are permitted to go in both directions simultaneously, and traffic grinds to a halt. They have narrowed their issues to arrive_eastbound and arrive_westbound, since those are the methods where cars wait to cross.

Q2: Identify one race condition and add a single call to sleep_for(500) to surface the race condition consistently. Then add an up-to-3-sentence explanation in questions.txt explaining the race condition and a specific ordering of events that causes it to occur.

(Note: you do not need to fix the race condition, but rather just identify it - it turns out fixing it is tough!)

Trust

Any software bugs or vulnerabilities can have large ramifications in real-world software. We'll use a real-world case study to examine this and to reflect on how we trust systems. Specifically, we will consider one of the worst software bugs in history: a lethal race condition in the software of a medical radiation device, the Therac-25, between 1985-1987.

Read this description of the case in the article below, and answer the two short-answer questions that follow about trust as discussed in lecture.

(Note: this case study relates to potentially-sensitive topics surrounding radiation and death caused by radiation. If for any reason you are uncomfortable reading a case study related to these topics, please reach out to the instructor and we would be happy to provide a substitute.)

Open Therac-25 Article

(If you would optionally like more context a fuller description is here).

Q3: Identify one way engineers developing Therac-25 exhibited agential gullibility in the development of its software.

Q4: Identify one way that engineers developing Therac-25 could have substituted some of the need for them or others to trust that the software was defect-free.

Writing Synchronization Code

The last two parts of the assignment have you write two monitor-style classes. Synchronization code is hard to get right, so it's important for it to be clean, simple, and obvious. The best way to get started on synchronization problems like these is to think about what information about the state of the world needs to be kept in a Station or Party object. Consider what information is needed to implement the methods. For example:

What information is needed in order to decide whether a passenger can safely board a train?
What information is needed to decide when a train can leave the station?

We call these variables state variables because they keep track of the state of the world, such as how many passengers are currently waiting. Once you have an initial guess about the state to maintain, you can then try writing the methods. As you write and test the code, you may discover that your state variables don't have enough information to make important decisions; when that happens, you'll need to add more state variables.

Also make sure to refer to our lecture checklists/steps for using mutexes and condition variables, which can help you reason through when and how they might be appropriate.

2: Caltrain Automation

Caltrain has decided to improve its efficiency by automating not just its trains but also its passengers. From now on, passengers will be robots. Each robot and each train is controlled by a thread. You have been hired to implement synchronization that will guarantee orderly loading of trains. You must implement a Station class that provides three methods:

load_train(int available): when a train arrives in the station and has opened its doors, it invokes this method, where available indicates how many seats are available on the train. The method must not return until the train is satisfactorily loaded (all passengers are in their seats, and either the train is full or all waiting passengers have boarded).
wait_for_train(): When a passenger robot arrives in a station, it first invokes this method. This method must not return until a train is in the station (i.e., a call to load_train is in progress) and there are enough free seats on the train for this passenger to sit down. Once this method returns, the passenger robot will begin moving the passenger on board the train and into a seat (you do not need to worry about how this mechanism works).
boarded(): Once a passenger is boarded, it will call this method to let the station know that it's on board. This method is needed because it takes time for passengers to board the train once wait_for_train has given them the go-ahead, and the train mustn't leave the station until all of the admitted passengers are safely in their seats.

There is also a constructor where you should initialize any instance variables you add (e.g. counters).

Another way to think about the simulation is a passenger can be in one of three states: waiting, boarding, or boarded. A waiting passenger is one who is waiting to board a train - either a train isn't in the station, or it is but there isn't space on the train. A boarding passenger is one who has reserved a seat on the train but is still getting on the train. A boarded passenger is one who has successfully gotten on the train.

Here is some additional information for this problem:

You do not have to write code to create threads; we provide a test harness that creates Station objects, plus threads to represent passengers and trains, and then invokes the methods that you write.
At any given time, there may exist any number of Station objects, passenger threads, and train threads (subject to the restrictions below).
Your code should not invoke any of the above methods (they will be invoked by the test harness).
You may assume that there is never more than one train in a given Station at once, and that all trains (and all passengers) are going to the same destination. Any passenger can board any train.
Your code must allow multiple passengers to board simultaneously (it must be possible for several passengers to have called wait_for_train, and for that method to have returned for each of the passengers, before any of the passengers calls boarded).
It must be possible to have multiple Station objects operating independently (activities in one Station should not affect any other Station). This means that you shouldn't have any static class variables in Station (if you don't know what those are, no problem!).
Our solution for each method is < 10 lines long each (excluding comments); this isn't a hard requirement for you to adhere to, but if your implementation is significantly longer, then you're likely taking a more complicated approach than you need.

Recommended Steps

We recommend approaching this problem using the following steps:

Start by ignoring the train capacity (assume all trains are infinite size) and just make it so that a passenger blocks in wait_for_train until a train has arrived (and once a train has arrived, pretend it never leaves - since the train doesn't yet wait for passengers, it may leave too quickly for passengers to see). Ignore boarded, ignore the boarding process, and don't worry about load_train waiting for passengers to board. You should pass sanity check tests #2 and #3.
Now make it so that load_train blocks correctly - meaning as long as people are boarding or waiting to board, the train should stay there, and then leave when it's done. Continue to ignore train capacity, and just make it so if a train is in the station, a passenger will begin boarding. But now you should make sure to keep the train there until nobody is waiting anymore, and every passenger boarding has called boarded. Hint: the while loop condition for a condition variable can contain more than one check (e.g. x && y). You should pass sanity check tests #4 - #7.
Now factor in train capacity; even if passengers are waiting, a train should leave if it has no more seats. Once a train leaves the station, even if it had open seats, no passengers should board until another train with space arrives. And even if a train is in the station, a passenger shouldn't board if there are no more seats. You should pass sanity check tests #8-10; woohoo!

Lastly, try running the randomized test many times with various parameters. You can run it like this:

./caltrain_test randomized 1000 50

The 1000 number is how many passengers you want in total, and the 50 number is the max capacity per train (it will be random, up to that value). Then inspect the output to ensure things look correct.

Testing

Sanity check provides a wide range of tests for this problem. You can look at the code for each test (which is spawning the threads for you and calling your methods) in caltrain_test.cc; the functions are named the same as the tests. You can run them on your own by doing ./caltrain_test TESTNAME. If you are failing a test, add debug print statements to print out information; e.g. each time a lock is acquired, before and after each condition variable wait, and after any other major decisions are made. Print out lots of information in each of these statements, including where in the code you are, which thread is running (if you can tell), and key state variables. Then pick a test that occasionally fails and run it repeatedly until it does fail (save the output to a file). If you walk through the log file, you will probably find a sequence of events that you hadn't expected to occur, and which your code can't handle properly. Here is information about what sequence of events each test is evaluating:

Expand/Collapse Test Details

`no_waiting_passengers`

No passengers are spawned.

A train with 0 seats arrives (should immediately depart)
A train with 10 seats arrives (should immediately depart)

`passenger_wait_for_train`

A passenger arrives, and begins waiting for a train
A train with 3 seats arrives
The passenger should then begin boarding

This test does not check whether the train stays in the station until the passenger is boarded (but later tests do).

`train_wait_for_passenger`

A passenger arrives, and begins waiting for a train
A train with 3 seats arrives
The passenger should then begin boarding
The train should only depart once the passenger is boarded

`multiple_waiting`

A passenger arrives, and begins waiting for a train
A train with 3 seats arrives
The passenger should then begin boarding
While passenger 1 is boarding, another passenger arrives, and should begin boarding
The first passenger finishes boarding, but the train should not depart yet
The second passenger finishes boarding, and the train should then depart

`board_in_parallel`

4 passengers arrive, and begin waiting for a train
A train with 4 seats arrives
All 4 passengers should begin boarding in parallel
3 of the passengers finish boarding, but the train should not depart yet
The last passenger finishes boarding, and the train should then depart

`leftover`

A train with 3 seats arrives (should immediately depart)
A passenger arrives, and begins waiting for a train
A train with 3 seats arrives
The passenger should then begin boarding
The train should only depart once the passenger is boarded

`full_train_departs`

A passenger arrives, and begins waiting for a train
A train with 0 seats arrives (should immediately depart, and passenger should continue waiting)
A train with 3 seats arrives
The passenger should then begin boarding
The train should only depart once the passenger is boarded

`board_in_parallel_wait`

4 passengers arrive, and begin waiting for a train
A train with 3 seats arrives
3 of the 4 passengers should begin boarding in parallel
The first two passengers finish boarding, but the train should not depart yet
The 3rd passenger finishes boarding, and the train should then depart with the fourth passenger still waiting
The last passenger finishes boarding, and the train should then depart

`randomized`

A larger number of passengers arrive
A series of trains arrive one at a time with varying numbers of available seats. They should leave only when they are either full or no more passengers are waiting/boarding.

3: Party Introductions

You have been hired as a party organizer for parties where people want to be introduced to other guests with certain Zodiac signs. In particular, you must create a C++ class Party with a single method that can be used to introduce incoming guests to each other for conversation. When a guest arrives at a party, they will invoke the following method on a Party object:

std::string meet(std::string &my_name, int my_sign, int other_sign);

The my_name argument (assumed to be nonempty) gives the name for this guest and my_sign indicates the guest's Zodiac sign (a number from 0-11). The other_sign argument indicates that this guest would like to talk to someone with that Zodiac sign. The meet method must not return until a guest has arrived with matching my_sign and other_sign, and meet must return the name of the matching guest.

In addition:

Guests must be matched: if A receive's B's name, then B must receive A's name.
If a suitable match is not immediately available, then meet must wait until a match becomes available. In the meantime, it must be possible for non-interfering matches to occur.
It is possible for multiple guests to have the same name.
It must be possible to have multiple Party objects all operating independently (activities in one Party should not affect any other Party). This means that you shouldn't have any static class variables in Party (if you don't know what those are, no problem!).
Guests must be matched in order of arrival, where "arrival" means starting execution of meet and locking the synchronization mutex. Note that C++ locks and condition variables are not guaranteed to be first-in-first-out (FIFO) (i.e. the thread waiting the longest for a mutex may not be the next to get it, and the thread waiting the longest for a CV may not be the first to wake up and acquire the lock). You should not assume in your code that these are FIFO. However, you can assume that the behavior is close enough to FIFO for fairness considerations (i.e. that this is good enough for the "order of arrival" requirement for Party).
Our solution for the meet method is < 20 lines long (excluding comments); this isn't a hard requirement for you to adhere to, but if your implementation is significantly longer, then you're likely taking a more complicated approach than you need.

We have provided some scaffolding to help you track the state for this problem. Specifically, you do not need to add any additional instance variables - we have already implemented a data structure and methods for keeping track of waiting guests who have not yet found a match. The way it is modeled is there is a struct WaitingGuest for each guest waiting for a match. You can call the following methods, which we have implemented for you, to keep track of these structs - but you must make sure you have already locked mutex_ before calling them:

/* Provided internal helper function that checks the list
 * of waiting guests to see if there is anyone with
 * other_sign looking for someone with my_sign.  Returns true
 * if yes, false if no.  Assumes caller has mutex_.
 */
bool has_matching_waiting_guest(int my_sign, int other_sign);

/* Provided internal helper function that returns a WaitingGuest
 * struct reference containing information about a matching
 * guest with other_sign looking for someone with my_sign.
 * It will respect the ordering in which guests were added
 * via add_candidate below; in other words, if there are
 * multiple potential pairings, this function returns the
 * guest who was added first.  This also removes this guest
 * from the list of waiting guests. Assumes caller has mutex_.
 */
WaitingGuest& get_matching_waiting_guest(int my_sign, int other_sign);

/* Provided internal helper function that adds a WaitingGuest
 * to the list of waiting guests looking for a match.
 * my_sign is the sign of this guest, and other_sign is the
 * sign of the person they are interested in talking to.
 * Assumes caller has mutex_.
 */
void add_waiting_guest(int my_sign, int other_sign, WaitingGuest& guest);

You must decide what fields a WaitingGuest struct should have, and add them in your header file. There will be a struct for each guest waiting for someone to arrive who is a match (but not for guests who are not waiting for a match), so consider what state needs to be stored for each so that they can coordinate (though you should not need to include a mutex in the struct).

NOTE: get_matching_waiting_guest returns a reference, which means it is not copied when it is returned. This is just like pass-by-reference with parameters, but we can declare regular variables as references, too. This is because e.g. condition_variable_any cannot be copied. So when you call that method, store its return value in a reference, like this (note the &), and then you can use it just like a regular variable:

WaitingGuest& match = get_matching_waiting_guest(...);

This is essential for this problem, as any thread can access and modify the same copy of these structs due to passing references around!

Testing

Sanity check provides a wide range of tests for this problem. You can look at the code for each test (which is spawning the threads for you and calling your methods) in party_test.cc; the functions are named the same as the tests. You can run them on your own by doing ./party_test TESTNAME. If you are failing a test, add debug print statements to print out information; e.g. each time a lock is acquired, before and after each condition variable wait, and after any other major decisions are made. Print out lots of information in each of these statements, including where in the code you are, which thread is running (if you can tell), and key state variables. Then pick a test that occasionally fails and run it repeatedly until it does fail (save the output to a file). If you walk through the log file, you will probably find a sequence of events that you hadn't expected to occur, and which your code can't handle properly. Here is information about what sequence of events each test is evaluating:

Expand/Collapse Test Details

`two_guests_perfect_match`

guest_a with sign 0 arrives, looking for guest with sign 5 (must wait)
guest_b with sign 5 arrives, looking for guest with sign 0 (should match with guest_a)

`return_in_order`

guest_a with sign 1 arrives, looking for guest with sign 3 (must wait)
guest_b with sign 1 arrives, looking for guest with sign 3 (must wait)
guest_c with sign 1 arrives, looking for guest with sign 3 (must wait)
guest_d with sign 3 arrives, looking for guest with sign 1 (should match with guest_a)
guest_e with sign 3 arrives, looking for guest with sign 1 (should match with guest_b)
guest_f with sign 3 arrives, looking for guest with sign 1 (should match with guest_c)

`sign_matching`

guest_a with sign 1 arrives, looking for guest with sign 3 (must wait)
guest_b with sign 2 arrives, looking for guest with sign 1 (must wait)
guest_c with sign 3 arrives, looking for guest with sign 2 (must wait)
guest_d with sign 3 arrives, looking for guest with sign 1 (should match with guest_a)
guest_e with sign 2 arrives, looking for guest with sign 3 (should match with guest_c)
guest_f with sign 1 arrives, looking for guest with sign 2 (should match with guest_b)

`single_sign`

guest_a with sign 2 arrives, looking for guest with sign 2 (must wait)
guest_b with sign 2 arrives, looking for guest with sign 2 (should match with guest_a)

`single_sign_many`

10 guests arrive in total, all with the same sign, and pairs of guests should match (e.g. guest 1 should match with guest 2, guest 3 with guest 4, etc.).

`same_name`

Guest (that's their name) with sign 4 arrives, looking for guest with sign 5 (must wait)
Guest (that's their name) with sign 4 arrives, looking for guest with sign 5 (must wait)
Guest (that's their name) with sign 5 arrives, looking for guest with sign 4 (should match with first Guest)
Guest (that's their name) with sign 5 arrives, looking for guest with sign 4 (should match with second Guest)

`randomized`

A larger number of guests arrive, such that every guest can eventually find a match.

There is also the randomized test that you can run many times with various parameters. You can run it like this:

./party_test randomized 100 4

The 100 number is how many guests you want in total, and the 4 number is the max number of distinct zodiac signs possible (they are assigned randomly). Then inspect the output to ensure things look correct.

Other Coding Requirements

Your solutions for both problems should be written in C++.
There must be no busy-waiting in your solutions.
You must use the monitor design pattern for synchronization, as discussed in lecture. In particular, you may not declare any further mutexes other than the ones declared in the starter code.
You may use more than one condition variable for each part. Code using one condition variable to wake up many different threads can sometimes be improved by using multiple condition variables. This way only threads you really want to wake get woken up.

Submitting and Grading

Once you are finished working and have saved all your changes, submit by running the submit tool in the tools folder: ./tools/submit. We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work. You can confirm the timestamp of your latest submission in your course gradebook.

Your assignment will be graded using the tests we expose via sanitycheck, plus the following:

reviewing your questions.txt responses for the short answer questions
ensuring your program has no busy waiting or forced sleeping (i.e. calls to usleep)
ensuring your program has no race conditions or other concurrency issues
no static class variables or other designs that mean instances of Station or Party are interdependent
ensuring compliance with the monitor pattern, meaning Station has exactly 1 mutex, and Party has exactly 1 mutex
Style code review - Synchronization code is hard to get right, so it's important for it to be clean, simple, and obvious. Your goal should be simplicity, not just line count; simple programs are usually shorter than complex ones, but the shortest program isn't always the simplest. Check out our style guide for other style tips.