Assignment 5: RSS News Feed Aggregator


Due Date: Saturday, February 22nd at 11:59PM

Learning Goals

This assignment gives you a chance to implement an index of information similar to that held by news.google.com. It acts as a transition from multiprocessing to multithreading. This assignment will help you:

Starter Code

All coding should be done on a myth cluster machine, as that's where we'll be testing all submissions. Clone the starter project using the command

    git clone /afs/ir/class/cs110/repos/assign5/$USER assign5

Compile often, test incrementally and almost as often as you compile, run ./tools/sanitycheck a bunch to test your work, and run ./tools/submit when you’re done. And as you've seen in past assignments, you can play with a sample application by invoking ./samples/aggregate_soln.

The starter project contains the following, categorized by whether or not you will be modifying them for this assignment. Note, however, that you need to become familiar with these files and their usage even if you do not modify them.

Files you must modify:

Files you should consider modifying:

Our solution changed these two files to codify the rules on how to handle articles with the same title but different URLs on the same server. You should consider doing the same thing.

Files you can, but are not required to, modify:

You shouldn’t need to change any of these files, although you're welcome to if it helps you arrive at a working program more quickly.

Files you should not modify:

Overview

Virtually all major newspapers and TV news stations have bought into this whole Internet thing. What you may not know is that many major media corporations serve up RSS feeds summarizing the news stories that’ve aired or gone to press in the preceding 24 hours. RSS news feeds are XML documents housing information about online news articles. If we can get the feeds, we can get the articles, and if we can get the articles, we can build an index of information similar to that held by news.google.com. That’s precisely what you’ll be doing for the next seven days.

Assignment 5 is shorter than past assignments, and it acts as a transition from multiprocessing to multithreading while letting you gain experience with C++ threads, mutexes, and semaphores. Assignment 6, which goes out a week from today, will have you revisit some of Assignment 5's design decisions and force you to be much more conservative about the number of actual threads you create over the program's lifetime.

While working on this assignment, see the end section for helpful C++ tips.

Indexing a news article amounts to little more than breaking the story down into the individual words and noting how often each word appears. If a particular word appears a good number of times, then said word is probably a good indicator as to what the web page is all about. Once all of the world’s online RSS news feeds have been indexed, you can ask it for a list of stories about a specific person, place, or thing.

If, for instance, you want to learn more about the movies in the news lately, you might type in this:

myth56$ ./samples/aggregate_soln --url medium-feed.xml
Enter a search term [or just hit <enter> to quit]: movie
That term appears in 99 articles.  Here are the top 15 of them:
   1.) "Review: The engrossing dramas 'Beanpole' and 'And Then We Danced'.....place" [appears 13 times].
       "https://www.latimes.com/entertainment-arts/movies/story/2020-02-1.....views"
   2.) "Why Blumhouse flipped the '70s TV classic 'Fantasy Island' into a.....iller" [appears 13 times].
       "https://www.latimes.com/entertainment-arts/movies/story/2020-02-1.....house"
   3.) "How Julia Louis-Dreyfus and Will Ferrell made 'Downhill' at a wor.....esort" [appears 10 times].
       "https://www.latimes.com/entertainment-arts/movies/story/2020-02-1.....eyfus"
   4.) "Review: After his makeover, 'Sonic the Hedgehog' is legit funny, .....ining" [appears 9 times].
       "https://www.latimes.com/entertainment-arts/movies/story/2020-02-1.....arrey"
   5.) "Review: Just in time for Valentine's Day, 'The Photograph' casts .....spell" [appears 9 times].
       "https://www.latimes.com/entertainment-arts/movies/story/2020-02-1.....a-rae"
   6.) "Visit the places where the 2020 Oscar-nominated movies were filmed" [appears 9 times].
       "https://www.latimes.com/travel/story/2020-02-07/where-to-visit-os.....tings"
   7.) "How 'Sonic the Hedgehog' director saved the movie from massive in.....klash" [appears 8 times].
       "https://www.latimes.com/entertainment-arts/business/story/2020-02.....-work"
   8.) "'To All the Boys 2': How Lana Condor and Jenny Han made a teen he.....elves" [appears 7 times].
       "https://www.latimes.com/entertainment-arts/movies/story/2020-02-1.....y-han"
   9.) "Apple and Amazon are transforming Culver City. Should they pay mo.....axes?" [appears 6 times].
       "https://www.latimes.com/entertainment-arts/business/story/2020-02.....taxes"
  10.) "Review: 'Fantasy Island' movie squanders premise on thrill-free horror" [appears 6 times].
       "https://www.latimes.com/entertainment-arts/movies/story/2020-02-1.....orror"
  11.) "Feedback: From 'Parasite's' win to In Memoriam snubs, readers sou.....scars" [appears 6 times].
       "https://www.latimes.com/entertainment-arts/story/2020-02-13/oscar.....dback"
  12.) "Movies on TV this week: Sunday, Feb. 16, 2020: 'Mr. Smith Goes to..... more" [appears 6 times].
       "https://www.latimes.com/entertainment-arts/story/2020-02-14/movie.....-more"
  13.) "Kirk Douglas gave millions for Hollywood's Alzheimer's patients" [appears 5 times].
       "https://www.latimes.com/entertainment-arts/business/story/2020-02.....-fund"
  14.) "Oscars 2020: Netflix wins just two awards as 'Parasite' boosts Neon" [appears 5 times].
       "https://www.latimes.com/entertainment-arts/business/story/2020-02.....oscar"
  15.) "Review: Clever plotting brings atmospheric ghost story 'Camp Cold..... life" [appears 5 times].
       "https://www.latimes.com/entertainment-arts/movies/story/2020-02-1.....urray"

Interested in a different medium? Let's try music:

Enter a search term [or just hit <enter> to quit]: music
That term appears in 157 articles.  Here are the top 15 of them:
   1.) "Discover a living legend: Singer, songwriter, visual artist, Texa.....Allen" [appears 16 times].
       "https://www.latimes.com/entertainment-arts/music/story/2020-02-14.....-dick"
   2.) "What Grammys? Try these 12 places for the best live music in America" [appears 15 times].
       "https://www.latimes.com/travel/story/2020-01-23/never-mind-grammy.....enues"
   3.) "Movies on TV this week: Sunday, Feb. 16, 2020: 'Mr. Smith Goes to..... more" [appears 13 times].
       "https://www.latimes.com/entertainment-arts/story/2020-02-14/movie.....-more"
   4.) "Huey Lewis on his struggle with hearing loss: 'I thought I might .....self'" [appears 10 times].
       "https://www.latimes.com/entertainment-arts/music/story/2020-02-12.....sease"
   5.) "Confront your fear of bumblebees: Nose into human-size nests at D.....rdens" [appears 8 times].
       "https://www.latimes.com/lifestyle/story/2020-01-16/art-embraces-t.....hibit"
   6.) "Best concerts in L.A. this week: Courtney Barnett, Mac DeMarco, Ginuwine" [appears 7 times].
       "https://www.latimes.com/entertainment-arts/music/story/2020-02-12.....rnett"
   7.) "How a fire in the Inland Empire could spell doom for the worldwid..... boom" [appears 7 times].
       "https://www.latimes.com/entertainment-arts/music/story/2020-02-14.....cquer"
   8.) "Vegas tips from the pros: Lounge singers share their favorite off.....gouts" [appears 7 times].
       "https://www.latimes.com/entertainment-arts/story/2020-01-30/las-v.....picks"
   9.) "Eight things to do in L.A. and O.C., including Valentine's Day shows" [appears 7 times].
       "https://www.latimes.com/entertainment-arts/story/2020-02-13/thing.....avoom"
  10.) "California Sounds: L.A.'s Kamasi Washington goes 'Live at the Apollo'" [appears 7 times].
       "https://www.latimes.com/entertainment-arts/story/2020-02-14/calif.....ntary"
  11.) "Morrissey, Bauhaus, Blondie headline new  '80s alt-music festiva.....World" [appears 6 times].
       "https://www.latimes.com/entertainment-arts/music/story/2020-02-11.....s-80s"
  12.) "What doesn't kill Ozzy Osbourne makes him ... even Ozzier" [appears 6 times].
       "https://www.latimes.com/entertainment-arts/music/story/2020-02-12.....alone"
  13.) "Camila Cabello (sort of) launches acting career with old Hollywoo.....h My'" [appears 4 times].
       "https://www.latimes.com/entertainment-arts/music/story/2020-02-12.....-baby"
  14.) "Review: On 'Changes,' Justin Bieber finds salvation through hot m.....t R&B" [appears 4 times].
       "https://www.latimes.com/entertainment-arts/music/story/2020-02-13.....eview"
  15.) "What's on TV This Week: 'Washington,' 'Rocketman' and more" [appears 4 times].
       "https://www.latimes.com/entertainment-arts/story/2020-02-14/whats.....etman"

Sadly, sometimes something perfectly awesome doesn’t get mentioned at all:

Enter a search term [or just hit <enter> to quit]: CS110
Ah, we didn't find the term "CS110". Try again.

Milestone 1: Implementing Sequential aggregate

Your NewsAggregator class—that's the primary class in your assign5 repo to help manage the construction and interrogation of your news article index—ultimately needs to provide a multithreaded version of a method called processAllFeeds, which downloads a collection of RSS feeds and the HTML news articles they reference to build an in-memory version of your index. You’re free to implement the multithreaded version right from the start and forgo an intermediate, sequential version. However, we strongly recommend getting a sequential version working first to ensure everything works predictably, that you’ve made proper use of all of the helper classes we’ve provided, and that your sequential application’s output matches that produced by the sample (see Note below). This is the approach we took when implementing this assignment, and is the one that will allow us to best help you in office hours. Incremental development is the key to arriving at a robust, bug-free executable!

Note: you might see small differences between the sample application and your own, because a small fraction of an article’s content—the advertising, most often—is dynamically generated for each download and influences how the index is built each time. The sanitycheck tool, however, has been seeded with static RSS feeds and HTML news articles, so you should rely on it for a reliable test framework.

Here are details that apply to both the sequential and the later multithreaded version:

And here's some advice and words of caution.

Milestone 2: Implementing Multithreaded aggregate

Practically speaking, the sequential, singly-threaded implementation of aggregate is unacceptably slow. Users don’t want to wait 45 minutes for an application to launch, so you should introduce multithreading in order to speed things up. The sequential version of aggregate is slow because each and every request for an article from some remote server takes on the order of a second. If a single thread of execution sequentially processes each and every article, then said articles are requested, pulled, parsed, and indexed one after another. There’s no overlay of network stall times, so the lag associated with network connectivity scales with the number of articles. It’s like watching paint dry in 100% humidity.

Your NewsAggregator should be updated so that each RSS news feed is downloaded in its own child thread, and each article identified within each feed thread can be downloaded in its own grandchild thread. Each article, of course, still needs to be parsed and indexed. But much of the dead time spent just waiting for content to come over the wire can be scheduled to overlap. The news servers we poll for content are industrial strength and can handle hundreds if not thousands of requests per minute (or at least we're hoping so, lest Stanford IP addresses be blacklisted between now and the assignment deadline).

Naturally, multithreading and concurrency have their shortcomings. Each thread needs exclusive access to the index while doing surgery, so you’ll need to introduce some mutexes to prevent race conditions from compromising your data structures.

Rather than outline a large and very specific set of constraints, it's your responsibility to do whatever it takes to make the application run more quickly without introducing any synchronization issues. The assignment is high on intellectual content—after all, this is the first time where an algorithmically sound application can break because of race conditions and deadlock—but honestly, there isn’t much additional coding once you get the sequential version up and running. You’ll spend a good amount of time figuring out where the sequential version should spawn off child threads, when those threads should spawn off grandchild threads, and how you’re going to use concurrency directives to coordinate thread communication.

Here’s a list of must-haves:

If you take the care to introduce threads as outlined above, then you’ll dramatically speed up the configuration of your aggregate executable. Woo! You’ll also get genuine experience with networking and the various concurrency patterns that come up in networked applications.

Grading

Your assignments will be exercised using the tests we've exposed, plus quite a few of others. We reserve the right to add tests and change point values if during grading we find some features aren't being exercised, but we're fairly certain the breakdown presented below will be a very good approximation regardless.

News Aggregator Tests (82 points)

Helpful Hints

Here's a compilation of hints that CS110 students have relied on in past quarters to complete Assignment 5:

HTMLDocument document(article.url);
try {
  document.parse();
} catch (const HTMLDocumentException& hde) {
  log.noteSingleArticleDownloadFailure(article);
  return;
}
// got this far? then the article was downloaded just fine
thread t([this, n, &sem] { foobar(n, sem); }); // first way

thread t([this](int number, semaphore& s) {
   foobar(number, s);
}, n, ref(sem)); // second way

thread t(&NewsAggregator::foobar, this, n, ref(sem)); // third way
// flawed attempt to limit the number of threads to 6 at all times
vector<thread> threads;
semaphore permits(6);
for (size_t i = 0; i < 250; i++) {
   permits.wait();
   threads.push_back(thread([this](semaphore& s) {
      // thread safe code of your choosing
      s.signal();
   }, ref(permits));
}
for (thread& t: threads) t.join();

// correct way to limit the number of threads to 6 at all times
vector<thread> threads;
semaphore permits(6);
for (size_t i = 0; i < 250; i++) {
   permits.wait();
   threads.push_back(thread([this](semaphore& s) {
      // thread safe code of your choosing
      s.signal(on_thread_exit); // schedule permits to be signaled as surrounding thread is destroyed
   }, ref(permits));
}
for (thread& t: threads) t.join();

In fact, because this second version of signal schedules the internal ++ to happen when the thread exits, you can hoist the call to the top of the thread routine so that you know it's invoked no matter how the routine exits, as presented below.

// correct way to limit the number of threads to 6 at all times
vector<thread> threads;
semaphore permits(6);
for (size_t i = 0; i < 250; i++) {
   permits.wait();
   threads.push_back(thread([this](semaphore& s) {
      s.signal(on_thread_exit); // schedule permits to be signaled as surrounding thread is destroyed
      // thread safe code of your choosing
   }, ref(permits));
}
for (thread& t: threads) t.join();

Website design based on a design by Chris Piech
Icons by Piotr Kwiatkowski