Section 4: Multithreading

NOTE: this website is out of date. This is the course web site from a past quarter, Fall 2023. If you are a current student taking the course, you should visit the current class web site instead. If the current website is not yet visible by going to cs111.stanford.edu, it may be accessible by visiting this link until the new page is mounted at this address. Please be advised that courses' policies change with each new quarter and instructor, and any information on this out-of-date page may not apply to you.

Sections Wed Oct 25 to Fri Oct 27

This section handout is based on problems by Jerry Cain, with modifications by Nick Troccoli.

Learning Goals

During this section, you will:

  1. get practice writing multithreaded code
  2. use mutexes to prevent race conditions

Get Started

Clone the section starter code by using the command below. This command creates a section4 directory containing the project files.

git clone /afs/ir/class/cs111/repos/lab4/shared section4

Next, pull up the online section checkoff and have it open in a browser so you can jot things down as you go.

1. News Aggregation

Multithreading is a powerful tool for parallelizing tasks, particularly tasks that share data. In this section problem, we'll get practice with making a non-multithreaded program multithreaded to speed up its performance, and eliminate any race conditions while doing so.

We've provided a program news-aggregator.cc that is fully implemented; it downloads news articles (slowly, one after the other) from across the web using RSS Feeds, and then lets the user search for keywords in them. Try out the program now! To run the program, you must specify a file with the list of RSS feeds to download articles from. (A single RSS feed is like a feed of recent articles, and news sources maintain these feeds publicly). Run make to compile it, and then run ./news-aggregator samples/small-feed.xml to run it on a smaller list of RSS Feeds. It may take a few seconds to download the articles one after the other...and then it should prompt you to enter search keywords and it will show you matching articles, like this:

Enter a search term [or just hit <enter> to quit]: music
That term appears in 17 articles.  Here are the top 15 of them:
   1.) "In L.A., Gustavo Dudamel’s Influence Extends Beyond the Concert Hall" [appears 14 times].
       "https://www.nytimes.com/2023/02/08/arts/music/gustavo-dudamel-los......html"
   2.) "The sounds of science" [appears 6 times].
       "https://www.latimes.com/science/story/2023-02-03/the-sounds-of-science"
   3.) "Gustavo Dudamel tapped as New York Philharmonic’s music director" [appears 6 times].
       "https://www.nydailynews.com/news/national/ny-new-york-philharmoni.....onal/"
   4.) "Grammys 2023: Hip-Hop Wins, Beyoncé Wins (Sort of)" [appears 4 times].
       "https://www.nytimes.com/2023/02/07/arts/music/popcast-grammys.html"
       ...

Cool! This program is powerful, but as you see it takes a long time to download articles, becuase it downloads them one at a time. Our task is to take the news-aggregator.cc code and add multithreading to download all the articles in parallel, each in its own thread.

Your first task is to read through the news-aggregator.cc code and understand how it works in its sequential form, before modifying it to add threads. The code relies on a NewsManager class that can give you a list of all the article titles/URLs in the feeds, download a single article's contents, and handle the user keyword prompting/article searching. Here is the full interface:

/* Initialize a NewsManager with the path to the RSS Feed list file.
 * Also initializes XML library for article parsing.
 */
NewsManager(const std::string& feedListFilename);

// Destructor that uninitializes the XML library for article parsing.
~NewsManager();

/* Returns a vector of Articles referred to by the feed list.  An
 * Article is a struct combination of title and URL.  This function
 * fetches contents from the web (the contents of the RSS Feeds), and
 * blocks while waiting for the download.
 */
std::vector<Article> getArticleTitlesAndURLs() const;

/* Returns a vector of tokens (words) in order for this article.  This
 * function fetches the contents from the web, and blocks while waiting
 * for the download.  Thread-safe.
 */
std::vector<std::string> fetchArticleContents(const Article& article) const;

/* Continually prompts the user for search queries and goes through
 * the database (assumed to be a map from article to article tokens)
 * to find any matches for their query, printing them out each time.
 */
void handleUserQueries(std::map<Article, std::vector<std::string>> database) const;

Main task: Once you've read through the news-aggregator.cc code, modify it to spawn a new thread to fetch each article. Specifically, instead of calling fetchArticle sequentially, in main you should instead spawn a thread for each article to call fetchArticle, so all the calls to it are executed in parallel and all add their entries to the database map. Make sure to wait on all threads before handleUserQueries is called so that the database is complete before letting the user search it! (For now do not add any mutexes - just use threads. We'll deal with any race conditions soon!)

Now that we've added multithreading, we must be on the lookout for race conditions. Spoiler alert; there is a race condition! Try to review your implementation to see if you can come up with a specific ordering of events that would cause a problem.

Q1: What is the race condition? Add a mutex to the implementation to eliminate the race condition. Use the strategies outlined in lecture to think about what is necessary to coordinate the threads and avoid race conditions!

Now that your code is race-condition free, run it again - and notice the dramatically improved performance! Multithreading can really help speed up operations, particularly on computers with multiple cores (like the myth machines) and particularly with operations that are waiting for other tasks (like here, waiting for web downloads). We'll see more in the future about how to know what operations benefit the most from multithreading depending on the number of cores available, and how to know how many threads to spawn.

2. (If time) Short Answer

Work through the following short answer questions that examine more design/big-picture questions about multithreading and synchronization, and compare multithreading and multiprocessing.

Q1: What does it mean when we say that a process has a private address space? What are the advantages and disadvantages of having a private address space?

Q2: How do we exchange information across process boundaries? When do processes have or not have a say in information being transferred?

Q3: Threads are often called lightweight processes. In what sense are they processes? And why the lightweight distinction?

Q4: Threads running within the same process all share the same virtual address space. What are the advantages and disadvantages of allowing threads to access pretty much all of virtual memory?

Q5: What are some scenarios where we might prefer multithreading in our programs, and what are some scenarios where we might prefer multiprocessing?

Q6: Is i-- thread safe? Why or why not?