Assignment 5: RSS News Feed Aggregator


Due: Fri Mar 5 11:59 pm
Late submissions accepted until Sun Mar 7 11:59 pm

Learning Goals

This assignment gives you a chance to implement an index of information similar to that held by news.google.com. It acts as a transition from multiprocessing to multithreading. This assignment will help you:

  • further develop your understanding of threads and thread-related concurrency issues
  • hone your skills identifying and eliminating race conditions
  • become more familiar with how to use mutexes, condition_variable_anys and semaphores to manage coordination between threads.
  • understand how to be both a client and implementer of the ThreadPool data structure pattern

Starter Code

All coding should be done on a myth cluster machine, as that's where we'll be testing all submissions. Clone the starter project using the command

    git clone /afs/ir/class/cs110/repos/assign5/$USER assign5

Compile often, test incrementally and almost as often as you compile, run ./tools/sanitycheck a bunch to test your work, and run ./tools/submit when you’re done. And as you've seen in past assignments, you can play with a sample application by invoking ./samples/aggregate_soln or ./samples/tptest_soln.

The starter project contains the following, categorized by whether or not you will be modifying them for this assignment. Note, however, that you need to become familiar with these files and their usage even if you do not modify them.

Files you must modify:

  • The news-aggregator.h/cc files define and implement the class used to build and query an index. You'll update these to implement the version of the assignment that uses two ThreadPools to quickly download all RSS feeds and articles.
  • The thread-pool.h/cc files define and implement the ThreadPool class you will be implementing for milestone 3, using the same interface that thread-pool-release.h presents. You’ll want to extend the private section of the class definition with additional data members and the helper methods needed to build it, and you’ll need to flesh out the implementations of the public methods in thread-pool.cc.

Files you should consider modifying:

  • tpcustomtest.cc: contains a collection of functions you can optionally use to exercise your ThreadPool. Because you are building the ThreadPool class and need to debug it, you should make judicious use of the tpcustomtest.cc tests. You can add to these tests yourself to test the functionality of your ThreadPool class. There are four tests in the program already, and to add another test, you update the buildMap function with a flag describing the test and pointing to a function that runs the test. To run, make the assignment and type (for example):
$ ./tpcustomtest --single-thread-no-wait
This is a test.

Files you can, but are not required to, modify:

You shouldn’t need to change any of these files, although you're welcome to if it helps you arrive at a working program more quickly.

  • rss-index.h/cc: define and implement the RSSIndex class, which models the news article index we’ve been talking about. An RSSIndex maintains information about all of the words in all of the various news articles that have been indexed. The code you add to your NewsAggregator class will end up interacting with an RSSIndex, so you’ll want to be familiar with it—in particular, you should inspect the rss-index.h file so you’re familiar with the add method. Note that the RSSIndex implementation is not thread-safe.
  • test-union-and-intersection.cc: sample code showing how to use the C++ algorithms most useful for managing token set intersections. As written, this standalone program doesn't directly contribute to your aggregate executable, but it's there as a reference so you have an idea how you should be using vectors, sort, and set_intersection. (You're free to tinker with this file if you'd like to.)
  • utils.h/cc: define and implement a small number of URL and string utility functions that contribute to the implementation of several functions in your NewsAggregator class. The provided implementation of queryIndex calls the shouldTruncate and truncate functions, and you’ll certainly want to call the getURLServer function as part of the server/title comparison effort.
  • article.h: defines a simple record—the Article—that pairs a news article URL with its title. operator< has been overloaded so that it can compare two Article records, which means that Articles can be used as the keys in STL maps.
  • *-exception.h: these three header files each define and inline-implement a custom exception class that gets instantiated and thrown when some network drama prevents one of the three parse methods from doing its job. It’s possible the URL was malformed, or it’s possible your WiFi connection hiccuped and your network connection was dropped mid-parse. Recall that functions and methods capable of throwing exceptions can be optionally placed under the jurisdiction of a try/catch block; see the tips section of the spec for an example.

Files you should not modify:

  • aggregate.cc: a super small file that defines the main function and nothing else.
  • thread-pool-release.h: presents the ThreadPool interface as discussed below, and it's the one that links against our own solution.
  • tptest.cc: contains a short test program also included further down in the spec for testing your thread pool implementation. We provide a compiled version of this program using our own ThreadPool implementation, so you can use this for testing. We call this from sanitycheck to see if your ThreadPool is working. If you want to add your own tests, check out tpcustomtest.cc instead.
  • log.h/cc: define and implement a small NewsAggregatorLog class that helps manage all of the progress and error messages that come up during program execution. You can read through the interface and implementation files to figure out what method is likely called where within working implementations of buildIndex and processAllFeeds. In fact, the logging might help you work through the workflow of program execution as you implement, test, and debug. You, however, are not required to use the logging facilities provided by the NewsAggregatorLog class if you don't want to.
  • html-document.h: defines the HTMLDocument class, which models a traditional HTML page. It provides functionality to pull an HTML document from its server and surface its payload—specifically, the plain text content under its <body> tag—as a sequence of tokens. The parse method manages the networking and processes the content, and the getTokens method provides access to the sequence of tokens making up the page. All of the news articles referenced by the RSS feeds are standard web pages, so you’ll rely on this HTMLDocument class to pull each of them and parse them into their constituent tokens. We have provided you the header file so you can see the interface. You do not have access to the functions themselves.
  • rss-feed.h: defines the RSSFeed class, which models a particular type of standardized XML file known as an RSS feed. The structure of an RSS feed document isn’t important, since an RSSFeed object, when constructed around a URL of such a feed, knows how to pull and parse the XML document just as an HTMLDocument knows how to pull and parse an HTML document. The primary difference is that an RSSFeed produces a sequence of Articles, not a sequence of tokens. So, RSSFeeds produce Articles housing URLs which can be fed to HTMLDocuments to produce words. We have provided you the header file so you can see the interface. You do not have access to the functions themselves.
  • rss-feed-list.h: defines the RSSFeedList class, which models another XML file type whose format was invented for the purpose of this assignment. RSSFeedList’s story is similar to that for RSSFeed, except that it surfaces a feed-title-to-feed-URL map. We have provided you the header file so you can see the interface. You do not have access to the functions themselves.
  • stream-tokenizer.h/cc: the C++ equivalent of the Java StreamTokenizer class. The implementation is not pretty, but that’s because it needs to handle UTF8 encodings of strings that aren’t necessarily ASCII. Fortunately, you should be able to ignore this class, since it’s only used by the already implemented HTMLDocument class to decompose documents into tokens. Feel free to peruse the implementation if you want, or ignore it; just understand that it’s there and contributing to the overall solution. You should not need to read or modify this file.

Overview

Virtually all major newspapers and TV news stations have bought into this whole Internet thing. What you may not know is that many major media corporations serve up RSS feeds summarizing the news stories that have aired and gone to press in the preceding 24 hours. RSS news feeds are XML documents housing information about online news articles. If we can get the feeds, we can get the articles, and if we can get the articles, we can build an index of information similar to that held by news.google.com. That’s precisely what you’ll be doing in this assignment.

While working on this assignment, see the end section for helpful C++ tips.

Indexing a news article means breaking the story down into its individual words and noting how often each word appears. If a word appears a good number of times, then said word is probably a good indicator of what the story is all about. Once all of the world’s online RSS news feeds have been indexed, you can ask the index to list news stories about a specific person, place, or thing.
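
In code terms, that per-article bookkeeping amounts to little more than a frequency count. Here is a toy sketch (not part of the starter code; tokens is assumed to hold one article's words):

map<string, size_t> counts;
for (const string& token: tokens) counts[token]++; // note how often each word appears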

If, for instance, you want to learn more about music in the news lately, you might type in this:

myth56$ ./samples/aggregate_soln --url medium-feed.xml
Enter a search term [or just hit <enter> to quit]: music
That term appears in 119 articles.  Here are the top 15 of them:
   1.) "Griffith Park hikers, listen up: 'Ellen Reid Soundwalk' is a GPS-.....l map" [appears 17 times].
       "https://www.latimes.com/entertainment-arts/story/2021-02-18/ellen.....c-map"
   2.) "Black creatives helped turn Clubhouse into the next tech unicorn......gain?" [appears 10 times].
       "https://www.latimes.com/entertainment-arts/music/story/2021-02-01.....ation"
   3.) "Harder, Better, Faster, Stronger: Daft Punk's 10 greatest moments" [appears 10 times].
       "https://www.latimes.com/entertainment-arts/music/story/2021-02-22.....songs"
   4.) "Movies on TV this week: "Pulp Fiction";  "My Fair Lady" and more" [appears 10 times].
       "https://www.latimes.com/entertainment-arts/tv/story/2021-02-19/mo.....-2021"
   5.) "The essential guide to Johnny Pacheco and Fania Records, the Lati.....isten" [appears 7 times].
       "https://www.latimes.com/entertainment-arts/music/story/2021-02-18.....ening"
   6.) "Long Beach Opera names pandemic go-to director James Darrah as it.....eader" [appears 7 times].
       "https://www.latimes.com/entertainment-arts/story/2021-02-22/james.....eader"
   7.) "A list of fun things to do on New Year's Eve in L.A." [appears 6 times].
       "https://www.latimes.com/lifestyle/story/2020-12-29/a-list-of-fun-.....n-l-a"
   8.) "They wanted to bring a sneaker shop to South L.A. Then their drea.....igger" [appears 5 times].
       "https://www.latimes.com/lifestyle/story/2021-01-22/leimert-park-s.....rpose"
   9.) "Dolly Parton tells lawmakers she doesn't want a statue of herself.....later" [appears 4 times].
       "https://www.latimes.com/entertainment-arts/music/story/2021-02-18.....essee"
  10.) "What's playing at the drive-in: 'Nomadland,' 'Minari' and more" [appears 4 times].
       "https://www.latimes.com/entertainment-arts/story/2021-02-18/thing.....inari"
  11.) "9 ways to experience holiday music this season" [appears 4 times].
       "https://www.latimes.com/lifestyle/story/2020-12-16/2020-christmas.....tions"
  12.) "Where to find church services in L.A. this Christmas Eve" [appears 4 times].
       "https://www.latimes.com/lifestyle/story/2020-12-19/where-to-find-.....vices"
  13.) "L.A. Affairs: She turned me down because she wanted me to ask her out again" [appears 4 times].
       "https://www.latimes.com/lifestyle/story/2021-01-02/la-affairs-why.....tters"
  14.) "Why older people are having an even harder time with the pandemic..... help" [appears 4 times].
       "https://www.latimes.com/lifestyle/story/2021-01-25/ideas-to-help-.....demic"
  15.) "Barack Obama and Bruce Springsteen team up for timely new Spotify podcast" [appears 3 times].
       "https://www.latimes.com/entertainment-arts/business/story/2021-02.....dcast"

Sadly, sometimes something perfectly awesome doesn’t get mentioned at all:

Enter a search term [or just hit <enter> to quit]: CS110
Ah, we didn't find the term "CS110". Try again.

Milestone 1: Implementing Sequential aggregate

Your NewsAggregator class—that's the primary class in your assign5 repo to help manage the construction and interrogation of your news article index—ultimately needs to provide a multithreaded version of a method called processAllFeeds, which downloads a collection of RSS feeds and the HTML news articles they reference to build an in-memory version of your index. You’re free to implement the multithreaded version right from the start and forgo an intermediate, sequential version. However, we strongly recommend getting a sequential version working first to ensure everything works predictably, that you’ve made proper use of all of the helper classes we’ve provided, and that your sequential application’s output matches that produced by the sample. This is the approach we took when implementing this assignment, and is the one that will allow us to best help you in office hours. Incremental development is the key to arriving at a robust, bug-free executable!

Note: you might see small differences between the sample application and your own, because a small fraction of an article’s content—the advertising, most often—is dynamically generated for each download and influences how the index is built each time. The sanitycheck tool, however, has been seeded with static RSS feeds and HTML news articles; the subset of tests it exposes pulls the same web content every single time, so the sanitycheck test results are reproducible.

Here are details that apply to both the sequential and the later multithreaded version:

  • You may not change the implementation of the NewsAggregator::queryIndex method. It outputs the results in the exact format we’ll expect to be used by the autograding process, so changing it would likely cause your solution to not receive any credit.

  • Ensure that you never ever download the same exact URL twice. The feed lists and RSS feeds may surface the same URL more than once (e.g. the New York Daily News' world-news and religion feeds may each include the same URL to an article about the Vatican), so you need to ensure that you detect the duplicates. You should download an article the first time you see its URL, but ignore all of the others. If you try to pull a document and the networking fails for whatever reason (i.e. RSSFeed::parse or HTMLDocument::parse throws an exception), you shouldn't try to download it a second time, and you should still ignore any other repeated requests to download the same URL during program execution. Long story short: every URL gets one chance to contribute.

  • When two or more HTML articles share the same title and are hosted by the same server (i.e. their URLs have the same domain and subdomain—perhaps both include arts.nytimes.com), it's safe to assume they're all really the same article, but don't skip the extra copies. Download all of them, compute the running intersection of their token sets, and file the token intersection under the lexicographically smallest URL. So if, for instance, http://www.jerrynews.com/a.html, http://www.jerrynews.com/b.html, and http://www.jerrynews.com/c.html all lead to articles titled "Rock Me Like a Jerry Cain", and independent downloads produce token vectors of ["a", "a", "b", "c", "a", "a"], ["a", "a", "b", "b", "a"], and ["a", "a", "b", "b", "c", "a", "d", "a"], respectively, ensure that the RSSIndex gets "Rock Me Like a Jerry Cain" under http://www.jerrynews.com/a.html with ["a", "a", "a", "b"] as tokens. This technique overcomes the fact that many news articles have dynamically generated advertising content that can be intersected away via multiple downloads. Implementing this will require you to investigate the sort and set_intersection functions, iterators, and/or back_inserters (at least these are the C++'isms we feel are best used to implement this part); see the short sketch just after this list. Your repo includes a standalone program in test-union-and-intersection.cc that serves no other purpose than to illustrate how these STL built-ins work.

  • In contrast to Assignment 4, your error messages needn't match ours.

  • Because all state and functionality is managed by a C++ NewsAggregator class, you shouldn't need any global variables. Any state needed for aggregation can be maintained in the private section of the class declaration and manipulated by the class's public and private methods. You’ll want to place additional data members there as well.
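
As referenced in the bullet point about same-title articles above, here is a minimal sketch of one intersection step, written independently of the assignment's classes (the function name is made up; test-union-and-intersection.cc remains the authoritative reference):

#include <algorithm>  // sort, set_intersection
#include <iterator>   // back_inserter
#include <string>
#include <vector>
using namespace std;

// Intersect the running token multiset with one freshly downloaded token vector.
static vector<string> intersect(vector<string> running, vector<string> fresh) {
  sort(running.begin(), running.end()); // set_intersection requires sorted ranges
  sort(fresh.begin(), fresh.end());
  vector<string> result;
  set_intersection(running.begin(), running.end(),
                   fresh.begin(), fresh.end(), back_inserter(result));
  return result; // each token survives min(count in running, count in fresh) times
}

Intersecting ["a", "a", "b", "c", "a", "a"] with ["a", "a", "b", "b", "a"] this way yields ["a", "a", "a", "b"], and intersecting that with the third vector from the example leaves it unchanged.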

Testing

For milestone 1, you should test with the small-feed.xml file until you’re convinced you’ve reached the sequential milestone. This file is small enough that it's feasible to test with on the sequential version, which takes much longer than your final version will. Once you are confident you're done with the sequential version, try testing one last time with medium-feed.xml, though this will take a while to run!

Don't be alarmed if you see spurious parsing errors while testing (e.g. messages like “Operation in progress” come up quite a bit). If you're testing using small-feed.xml, then you're testing with live data, so you need to be super tolerant of the fact that some of the XML and HTML documents aren't perfectly structured. (The static feeds used by the sanitycheck tool have been curated to rely on articles that parse without drama.)

Other feeds to test with:

  • static-alphabet-feed.xml, which is a contrived feed that'll help confirm you properly coded up the requirement that the same URL is downloaded at most once, and that all of your token intersection logic is spot on.
  • single-server-source-feed.xml, where all HTML articles are drawn from the same exact server. This feed intentionally includes the same RSS feed URLs multiple times so you can ensure you’re only ever downloading the same URL once.

Milestone 2: Implementing Multithreaded aggregate

Practically speaking, the sequential, singly-threaded implementation of aggregate is unacceptably slow. Users don’t want to wait 45 minutes for an application to launch, so you should introduce multithreading in order to speed things up. The sequential version of aggregate is slow because each and every request for an article from some remote server takes on the order of a second. If a single thread of execution sequentially processes each and every article, then said articles are requested, pulled, parsed, and indexed one after another. There’s no overlay of network stall times, so the lag associated with network connectivity scales with the number of articles. Not good!

Your NewsAggregator should be updated so that each RSS news feed is downloaded in its own thread, and each article identified within each feed thread can be downloaded in its own thread as well. Each article, of course, still needs to be parsed and indexed. But much of the dead time spent just waiting for content to come over the wire can be scheduled to overlap. The news servers we poll for content are industrial strength and can handle hundreds if not thousands of requests per minute (or at least we're hoping so, lest Stanford IP addresses be banned between now and the assignment deadline).

Naturally, multithreading and concurrency have their shortcomings. Each thread needs exclusive access to the index while performing modifications, so you’ll need to introduce some mutexes to prevent race conditions from compromising your data structures.

Your instinct might be to spawn a child thread for each feed, each of which spawns its own family of grandchild threads, one per feed article. However, this is not the approach you should use. We can’t just create an arbitrarily large, multigenerational family of threads when downloading the entire Web, so we need to limit the number of threads to be less than whatever limit the system imposes. One possibility would be to use mutexes and semaphores to limit the number of threads that can exist at any one moment. The practical problem with this approach is that it doesn’t limit the number of thread objects created over the lifetime of aggregate. Even if we guarantee that no more than 8 feed threads ever exist at any moment, there’s no sense creating a thread to do some work only to let it die if we’re going to need another thread to do pretty much the same thing later on. Building up and tearing down a thread is a relatively expensive process, so we should prefer having a smaller number of threads that live very long lives to a larger number of threads that live very short ones.

A better solution is to rely on a multithreading abstraction that’s so common that many other programming languages just provide it. That abstraction is the ThreadPool class, which for us exports the following public interface:

class ThreadPool {
public:
    ThreadPool(size_t numThreads);
    void schedule(const std::function<void(void)>& thunk);
    void wait();
    ~ThreadPool();
};

A ThreadPool is a collection of reusable threads that can perform specified tasks. You can ask the ThreadPool to do work for you by calling schedule and passing in a thunk, or function that takes in no parameters. The ThreadPool will have one of its threads execute that function. These worker threads are recycled to avoid constantly creating new threads for each task. In this way, it's similar to the farm program you wrote for a previous assignment, where a fixed number of workers were reused to perform assigned tasks, but it uses threads instead of processes. The wait method blocks until the thread pool is completely idle.

Aside: The function<void(void)> type is a general type that is compatible with anything invokable—a function pointer, or an anonymous function—that doesn’t require any arguments. Even though thunks take no arguments, you can still access variables in the surrounding scope using lambdas and capture clauses.

Here is a simple program, included in your starter code as tptest.cc, that uses a ThreadPool to execute a collection of 10 function calls using 4 threads.

static const size_t kNumThreads = 4;
static const size_t kNumFunctions = 10;
int main(int argc, char *argv[]) {
  ThreadPool pool(kNumThreads);
  for (size_t id = 0; id < kNumFunctions; id++) {
    pool.schedule([id] {
      cout << oslock << "Thread (ID: " << id << ") has started." << endl << osunlock;
      size_t sleepTime = (id % 3) * 10;
      sleep_for(sleepTime);
      cout << oslock << "Thread (ID: " << id << ") has finished." << endl << osunlock;
    });
  }

  pool.wait();
  cout << "All done!" << endl;
  return 0;
}

The output of the above program might look like this:

$ ./tptest
Thread (ID: 3) has started.
Thread (ID: 2) has started.
Thread (ID: 1) has started.
Thread (ID: 0) has started.
Thread (ID: 3) has finished.
Thread (ID: 4) has started.
Thread (ID: 0) has finished.
Thread (ID: 5) has started.
Thread (ID: 1) has finished.
Thread (ID: 6) has started.
Thread (ID: 4) has finished.
Thread (ID: 7) has started.
Thread (ID: 6) has finished.
Thread (ID: 8) has started.
Thread (ID: 2) has finished.
Thread (ID: 9) has started.
Thread (ID: 9) has finished.
Thread (ID: 5) has finished.
Thread (ID: 7) has finished.
Thread (ID: 8) has finished.
All done!

For this milestone, you will be a client of a working ThreadPool implementation we provide. In milestone 3, you will implement your own ThreadPool. You will use two private instance variable thread pools (already declared in the starter code; you do not need to declare any more ThreadPools of your own): one of size 8 to manage a collection of worker threads that download RSS XML documents, and a second of size 64 to manage a second group of workers dedicated to news article downloads. The combined number of threads managed by the two thread pools is 72, which is large compared to the number of cores on the myths. But remember that most of a thread’s time is spent sleeping while the OS builds a network connection. The amount of time needed to pull, parse, and index content is so small that it’s okay if we average 8 or 9 threads per core.

Here are some additional notes:

  • Ensure that you never ever download—or even attempt to download—the same exact URL twice. This was stated above during the specification of the sequential version, but it's particularly important here, since two threads might initiate a download of the same URL at the same time unless you put measures in place to forbid it. As it turns out, the XML and HTML parsing libraries don't deal well when asked to pull the same URL in two simultaneously executing threads, so make sure you don't.
  • Manage articles with the same title and server as you did for the sequential part, being sensitive to the possibility that two such articles may be downloaded simultaneously in parallel threads.
  • Ensure that access to all shared data structures is thread-safe, meaning that there’s zero chance of deadlock and there are no race conditions. Use mutexes to provide the binary locks needed to protect against race conditions within critical regions, and make sure that all locks are released no matter the outcome of a download (e.g. inside catch clauses as well). A short sketch of this locking discipline appears after the code example below.
  • Ensure that the full index is built before advancing to the query loop.
  • You can add logging code (and we strongly encourage you to do so!) that emulates the sample's logging using the provided NewsAggregatorLog class, but you aren't required to. We’ll be testing your solutions using the --quiet flag, but logging will let you see how much faster everything runs once you introduce threading.
  • Reminder: you may not change the implementation of the NewsAggregator::queryIndex method. It outputs the results in the exact format we’ll expect to be used by the autograding process, so changing it would likely cause your solution to not receive any credit.
  • In contrast to Assignment 4, your error messages needn't match ours.
  • Reminder: because all state and functionality is managed by a C++ NewsAggregator class, you shouldn't need any global variables. Any state needed for aggregation can be maintained in the private section of the class declaration and manipulated by the class's public and private methods. You’ll want to place additional data members (maps, sets, mutexes, semaphores, and so forth) there as well, particularly when multiple threads need access to them.
  • Because your program is fully object oriented, you can't just pass a news aggregator method as a thunk to a thread pool. Instead, you can schedule a thunk that is a lambda function that calls your method. As an example, let's say we have a method addArticleToRawIndex that takes two parameters. We can wrap a lambda around the call to that method and capture the address of the relevant object and the information needed for the two arguments, like this:
const vector<Article>& articles = feed.getArticles();
semaphore completed(1 - (int)articles.size()); // cast avoids unsigned underflow; wait() clears once all signal
for (const Article& article: articles) {
  articlePool.schedule([this, &article, &completed]() {
     // pretend addArticleToRawIndex is some two-argument method in NewsAggregator
     addArticleToRawIndex(/* title = */ article.title, /* url = */ article.url);
     completed.signal();
  });
}
completed.wait(); // the feed might wait for just its articles to be processed
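
To illustrate the bullet points above about duplicate URLs and thread-safe index access, here are two minimal sketches. Every name below (seenURLs, seenURLsLock, indexLock, markURLAsSeen, addToIndex) is made up for illustration, and we're assuming RSSIndex::add accepts an article and its tokens (check rss-index.h for the actual signature):

// hypothetical private members in NewsAggregator:
//   set<string> seenURLs;   mutex seenURLsLock;   // duplicate-URL detection
//   RSSIndex index;         mutex indexLock;      // index modification
bool NewsAggregator::markURLAsSeen(const string& url) {
  lock_guard<mutex> lg(seenURLsLock);
  return seenURLs.insert(url).second; // true only the first time this URL shows up
}

void NewsAggregator::addToIndex(const Article& article, const vector<string>& tokens) {
  lock_guard<mutex> lg(indexLock); // released automatically, even if an exception propagates
  index.add(article, tokens);      // RSSIndex isn't thread-safe, so access is serialized
}

Note that each critical region is kept as small as possible: no downloading or parsing happens while either lock is held.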

Once you complete this milestone, you will have a dramatically sped up version of your program. Woohoo! You’ll also get genuine experience with networking and the various concurrency patterns that come up in networked applications.

Milestone 3: Implementing the ThreadPool class

Now that you have experience using a thread pool as a client, it's your turn to implement your own ThreadPool!

  • Your constructor should do the following:

    • Launch a single dispatcher thread, which pulls work off the queue, wakes up a particular worker, and hands the function to be executed to that worker. Assume dt is a private ThreadPool member of type thread: dt = thread([this]() { dispatcher(); });
    • Launch a specific number of worker threads (assume wts is a private ThreadPool member of type vector<thread>): wts[workerID] = thread([this, workerID]() { worker(workerID); });
  • Your schedule function should accept a thunk (a function that takes no parameters and returns no value, expressed as type function<void(void)>) and append it to the end of a queue of such functions. Each time a function is added, the dispatcher thread should be notified. Once the dispatcher is notified, schedule should return right away so that more functions can be scheduled. schedule should be thread-safe (i.e. if your program has other threads running outside of the ThreadPool, it should be possible to call schedule from multiple different threads without any chance of encountering race conditions); one possible shape for schedule is sketched just after this list.

  • The dispatcher thread should loop almost interminably. In each iteration, it should sleep until schedule tells it that something has been added to the queue. It should then wait for a worker to become available, select the worker, mark it as unavailable, dequeue a function, put a copy of that function in a place where the worker (and only that worker) can access it, and then signal the worker to execute it. Reviewing the customer-cashier interactions from the ice cream parlor lecture example may be helpful.

  • The worker threads should also loop repeatedly, blocking within each iteration until the dispatcher thread signals it to execute an assigned function. Once signaled, the worker should invoke the function, wait for it to execute, then mark itself as available so that it can be discovered and selected again by the dispatcher.

  • The wait function should wait until the ThreadPool is completely idle, meaning all previously-scheduled-but-yet-to-be-executed functions have been executed. In other words, there should be no functions in the queue, nor any workers executing functions. It should be possible to call wait multiple times, and wait should be thread-safe. It's possible that while wait is executing, more functions are added to the thread pool; it should wait on these as well, and only return once there is a moment where the pool is completely idle. This is intended to be used in cases where we wish to clear out the thread pool before using it further.

  • The ThreadPool destructor should wait until all scheduled functions have executed to completion, somehow inform the dispatcher and worker threads to exit, wait for them all to exit, and then dispose of any ThreadPool resources.
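
As promised above, here is a minimal sketch of one possible schedule implementation. It assumes hypothetical private members that don't appear in the starter code: a queue<function<void(void)>> named thunks, a mutex named queueLock guarding it, and a condition_variable_any named queueCV that the dispatcher sleeps on:

void ThreadPool::schedule(const function<void(void)>& thunk) {
  {
    lock_guard<mutex> lg(queueLock); // schedule must be thread-safe
    thunks.push(thunk);              // append the thunk to the end of the queue
  }
  queueCV.notify_all(); // wake the sleeping dispatcher, then return right away
}

The dispatcher's half of the handoff would then wait on queueCV until the queue is nonempty, dequeue a thunk, and pass it to an available worker, much like the cashier coordination in the ice cream parlor example.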

Our solution doesn't use any dynamic memory allocation, but if you use it, then be sure to free resources.

All of this should feel like a generalization of Assignment 3’s farm. That’s because it is! Yes, we’re using threads instead of processes, but the communication patterns between the various players are all the same. We’re just relying on threads, mutexes, condition_variable_anys and semaphores instead of processes, signals, signal blocks, and sigsuspend.

Once you finish this milestone, feel free to try out your own ThreadPool with your news aggregator program instead of the provided solution. To do this, change line 25 of news-aggregator.h from namespace tp = release; to namespace tp = develop;, and recompile. You should only do this when you are done thoroughly testing your implementation, but it is an excellent form of ThreadPool exercise that you can use to verify that your ThreadPool is bulletproof and runs everything with as much parallelism as our own release version does. If you do this, make sure you swap the release version back in before your final submission for grading!

Can I implement ThreadPool without a dispatcher thread?

Yes, it's quite possible to implement a scheme where workers are notified of incoming work, and then they pull work off the queue without the dispatcher specifically handing the work to them. However, we want you to implement ThreadPool with a dispatcher thread. It's better practice with thread communication/synchronization, and the dispatcher thread is essential to implementing more capable ThreadPools (such as a ThreadPool with lazy initialization, where the worker threads aren't created unless they're actually needed).

Grading

Your assignments will be exercised using the tests we've exposed, plus quite a few others. We reserve the right to add tests and change point values if during grading we find some features aren't being exercised, but we're fairly certain the breakdown presented below will be a very good approximation regardless.

  • Clean Build: 2 points
  • NewsAggregator tests: 80 points
    • Ensure that your aggregator handles a small feed list: 30 points
    • Ensure that your aggregator handles a medium feed list: 20 points
    • Ensure that your aggregator handles a large feed list: 20 points
    • Ensure that you properly handle duplicate URLs and articles with the same title hosted by the same server: 10 points
  • ThreadPool tests: 78 points
    • Basic functionality via tests exposed in tptest.cc and tpcustomtest.cc: 40 points
    • More advanced tests that should be properly handled by a working ThreadPool: 15 points
    • Equally advanced tests that hit on an array of edge-case (but legitimate) use cases: 15 points
    • Tests ensuring that memory is properly managed by your ThreadPool—no leaks, no memory errors: 8 points

Your NewsAggregator class will be tested using the staff solution to ThreadPool, as long as using release::ThreadPool; appears near the top of your news-aggregator.h file. Your own ThreadPool, of course, will be exercised by the ThreadPool tests. Note that your ThreadPool, because it can be tested in isolation of the XML parsing library, can and will be examined for memory access errors and leaks.

Helpful Hints

Here's a compilation of hints that CS110 students have relied on in past quarters to complete this assignment:

Try/Catch

Recall that functions and methods capable of throwing exceptions can be optionally placed under the jurisdiction of a try/catch block, as with this:

HTMLDocument document(article.url);
try {
  document.parse();
} catch (const HTMLDocumentException& hde) {
  log.noteSingleArticleDownloadFailure(article);
  return;
}
// got this far? then the article was downloaded just fine

Be Careful With References

Be careful when you share references to data with your thread routines. Sometimes you absolutely need to, but you need to be careful that the shared reference doesn't lead to data that unexpectedly changes or is destroyed before the selected ThreadPool worker makes use of it. You need to be sensitive to the very type of race condition we saw in lecture a little over a week ago that justified why mutexes were important in the first place. Consider what might happen if the code example from the end of milestone 2 didn’t involve the completed semaphore at all and instead looked like this:

const vector<Article>& articles = feed.getArticles();
for (const Article& article: articles) {
  articlePool.schedule([this, &article]() {
     addArticleToRawIndex(/* title = */ article.title, /* url = */ article.url);
  });   
}

Each of the scheduled thunks depends on access to an article that looks to be officially held by the feed object. If execution allows that feed object to be destroyed, then the thunk’s article reference would be left referring to an Article record that no longer exists.

There are two solutions to the above problem. One is to ensure the original Article (or, more generally, whatever variables are being captured by reference) sticks around for the lifetime of the reference. Our original solution did that by waiting until all relevant thunks for any given feed executed to completion. Alternatively, if it’s easier, just capture a full copy of the object so you needn’t worry how long the original persists. Our preference is to capture references whenever possible, but the assignment isn’t necessarily about memory management so much as it is about safely sharing data with threads, so the following would be an acceptable solution as well:

const vector<Article>& articles = feed.getArticles();
for (const Article& article: articles) {
  articlePool.schedule([this, article]() { // capture a copy
     addArticleToRawIndex(/* title = */ article.title, /* url = */ article.url);
  });   
}

Pass mutexes, threads, condition_variable_anys and semaphores by reference

Remember that you can’t pass mutexes, threads, semaphores, or condition_variable_anys by value, only by reference. The decision to suppress pass-by-value is driven by the fact that it doesn’t make much sense in most cases to allow, say, two or more copies of the same mutex to mark the boundaries of a critical region. Now, if you have, say, a ThreadPool with a private vector<semaphore> called semaphores, this you-cannot-pass-by-value rule restricts how you can build up the vector. For instance, you might very well want to do something like this so the vector contains a semaphore for every worker:

ThreadPool::ThreadPool(size_t numThreads) {
  semaphore s;
  for (size_t i = 0; i < numThreads; i++) {
    semaphores.push_back(s); // uh oh: semaphores can't be copied, so this won't compile
  }
  // other code
}

Unfortunately, that’s a flagrant violation of this rule. However, it is possible to perfectly size the vector<semaphore> at construction time by leveraging C++ initialization lists, as with:

ThreadPool::ThreadPool(size_t numThreads): semaphores(numThreads) {
  // other code
}

That may not look all that different, but it is! In this second version, semaphores is constructed to be the correct length, and all of the semaphores are constructed in place. All in one initialization list entry!

More likely, though, you’ll want to bundle many pieces of information into a privately defined struct, and maintain a vector of those structs, one per worker, for the lifetime of the ThreadPool.
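
For instance, one plausible shape for that struct and the surrounding private section is sketched below. This is just one design among many, and every name other than dt and wts is invented for illustration:

class ThreadPool {
  // ... public interface exactly as before ...
private:
  struct Worker {                // hypothetical per-worker bookkeeping
    semaphore ready;             // dispatcher signals this once a thunk has been assigned
    function<void(void)> thunk;  // the function this worker should run next
    bool available = true;       // may the dispatcher select this worker?
  };

  thread dt;                     // the dispatcher thread (see Milestone 3)
  vector<thread> wts;            // the worker threads (see Milestone 3)
  vector<Worker> workers;        // per-worker state, sized in the initialization list
  // ... plus the queue of scheduled thunks, the mutexes guarding shared state,
  // and whatever condition_variable_anys your wait implementation relies on ...
};

Because Worker contains a semaphore, workers can't be built up with push_back; size it in the constructor's initialization list (e.g. workers(numThreads)), just like the vector<semaphore> above.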