Section 4 Solutions

Sections Wed Feb 08 to Sat Feb 11

Solutions

1. News Aggregator

void fetchArticle(const Article& article, const NewsManager& nm, map<Article, vector<string>>& database, mutex& databaseLock) {
    cout << oslock << "Fetching " << article.url << "..." << endl << osunlock;
    vector<string> articleContents = nm.fetchArticleContents(article);

    // Lock around modifying the shared map!
    databaseLock.lock();
    database[article] = articleContents;
    databaseLock.unlock();
}

int main(int argc, char *argv[]) {
    // The user must specify one argument, the name of the RSSFeedList file
    if (argc != 2) {
        cerr << "ERROR: please specify an RSSFeedList" << endl;
        return 1;
    }

    // Construct our collection of article -> article words ("tokens")
    map<Article, vector<string>> database;

    // mutex that we will lock before modifying the map
    mutex databaseLock;

    // A NewsManager can get a list of articles, download a single article, and search articles
    NewsManager nm(argv[1]);

    // An article is a struct that has a title and url field
    vector<Article> articles = nm.getArticleTitlesAndURLs();

    // Make a vector, since we don't know how many articles there are until the program runs
    vector<thread> threads;
    for (int i = 0; i < articles.size(); i++) {
        // Spawn a thread to fetch each article
        threads.push_back(thread(fetchArticle, ref(articles[i]), ref(nm), ref(database), ref(databaseLock)));
    }

    // Join all the threads
    for (thread& t : threads) {
        t.join();
    }

    nm.handleUserQueries(database);
    return 0;
}

2. Short Answer

Q1: What does it mean when we say that a process has a private address space? What are the advantages and disadvantages of having a private address space?

A1: It means that its range of addresses are, by default, private to the owning process, and that it's impossible for an another arbitrary, unprivileged process to access any of it. This is an advantage because other code/programs can't accidentally or intentionally examine another process's data. This is a disadvantage because it is exceptionally difficult to exchange data with other processes, and works against any efforts to breezily distribute work over multiple CPUs using multiprocessing.

Q2: How do we exchange information across process boundaries? When do processes have or not have a say in information being transferred?

A2: When we spawn a process, all parent info is cloned in the child, and if the child runs another program we can pass arguments via execvp. We can also use pipes to allow processes to communicate. We can use them separately from out STDIN/STDOUT/STDERR to send data back and forth, or we can redirect process input/output to/from pipes. A program can ignore command-line arguments if it wants to, but if its I/O is redirected before it's launched with execvp there isn't much it can do about it.

Q3: Threads are often called lightweight processes. In what sense are they processes? And why the lightweight distinction?

A3: Each thread of execution runs as if it's a miniature program. Threads have their own stack, and they have access to globals, dynamic memory allocation, system calls, and so forth. They're relatively lightweight compared to true processes, because you don't need to create a new process when creating a thread. We just cordon off a portion of the full stack segment for the new thread's stack, but otherwise the thread piggybacks off existing text, heap, and data segments of the process.

Q4: Threads running within the same process all share the same virtual address space. What are the advantages and disadvantages of allowing threads to access pretty much all of virtual memory?

A4: Because all threads have access to the same virtual address space, it's relatively easy to share information (e.g. via pass-by-reference). However, because two or more threads might want to perform operations on a shared piece of data, directives must be implanted into thread routines to ensure data and resources are shared without inspiring data corruption via race conditions, or deadlock via resource contention. That's a very difficult thing to consistently get right.

Q5: What are some scenarios where we might prefer multithreading in our programs, and what are some scenarios where we might prefer multiprocessing?

A5: We might prefer multithreading in scenarios where ease of data exchange is important, and multiprocessing where protection, privacy, and insulation from other processes' errors is important.

Q6: Is i-- thread safe? Why or why not?

A6: It's not normally thread safe - it may take multiple operations (loading the existing value, modifying it, then writing the new value) and those operations could be interrupted. For instance, if a thread completes just the first two steps of loading the existing value and modifying it, but then gets swapped off the processor and another thread changes i, the original thread will eventually resume and continue on with stale data. Note that technically it's possible in some cases for i-- to compile down to one assembly instruction and be atomic, but it's not guaranteed, so we assume it's not.

Checkoff Questions

  1. [Q1] What is the race condition in the multithreaded but no-locks version of news-aggregator?

    • Multiple threads could modify the database map at the same time, and maps aren't thread-safe, so the map could be put in an inconsistent state.
  2. [Q2] Threads running within the same process all share the same virtual address space. What are the advantages and disadvantages of allowing threads to access pretty much all of virtual memory?

    • Because all threads have access to the same virtual address space, it's relatively easy to share information (e.g. via pass-by-reference). However, because two or more threads might want to perform operations on a shared piece of data, directives must be implanted into thread routines to ensure data and resources are shared without inspiring data corruption via race conditions, or deadlock via resource contention. That's a very difficult thing to consistently get right.