Assignment 6: ThreadPool
Due Date: Saturday, February 29th at 11:59PM
The ThreadPool
part of this assignment was written by Jerry Cain. The Internet Archive part of the assignment was written by Ryan Eberhardt.
The Internet Archive maintains one of the largest archives of websites, books, movies, and more. Their Wayback Machine allows you to enter a website's address and see how it looked at various points in the past. (See how Stanford's website looked in 1996.)
In this assignment, you'll build the ThreadPool
class, which will be used by a simplified version of the Internet Archive. A thread pool is a set of threads that are created when a ThreadPool
class is instantiated in a program, and these threads are then used by the program to perform work in parallel. The benefit of using a thread pool over creating threads on-the-fly (as you did for the RSS reader from assignment 5) is to save thread creation time while your program runs. In other words, the threads are created once and can be re-used without the need to .join
or recreate them. For the RSS reader, thread pools could have been used, and (for example) you would only need a thread pool of eight threads for the number of feeds. Instead of creating a thread for every feed, those eight threads would be shared, with the thread pool managing the waiting necessary when the pool ran out of available threads.
Because of the time constraints for this assignment, we have written the "crawl the web pages from the network" part of the assignment for you, and you are only responsible for writing the ThreadPool
class itself. We have compiled the Internet Archive part of the assignment into a static library that already has the binary that expects a working version of the ThreadPool
class. If you have time, you are welcome to also implement the web-crawling function as well, but it is not necessary, nor will it be graded (see archive.cc
for details of how to use your version instead of ours).
Here is how the program works: given a website to archive, the program crawls through the network of pages it links to and resources it uses, archiving them to disk. Each page is scheduled to download via a thread from your thread pool.
The Finished Product
The program builds an archive
executable. Given a seed website, this will begin downloading that website, any websites it links to, any websites those link to, and so on. Because downloading this network of interconnected websites could end up downloading a fair portion of the internet (depending on your seed website), you are restricting your downloads to a whitelist, so we only download from particular websites of interest.
To download Stanford's website, we can run the following. (The seed website is https://www.stanford.edu
, and we are restricting to sites on the domain www.stanford.edu
. You could use a wildcard and whitelist *.stanford.edu
, but this ends up indexing a huge number of pages, and it takes a very long time to download. You can whitelist multiple different domains by using multiple w
flags.)
./archive -w www.stanford.edu -d https://www.stanford.edu
This produces the following output:
[24-07-2018 08:07:00] Beginning download of https://www.stanford.edu
[24-07-2018 08:07:01] End download of https://www.stanford.edu (1.104319 seconds)
[24-07-2018 08:07:01] Skipping download of https://fonts.googleapis.com (not whitelisted, or blocked by robots.txt)
[24-07-2018 08:07:01] Skipping download of https://s.w.org (not whitelisted, or blocked by robots.txt)
[24-07-2018 08:07:01] Beginning download of https://www.stanford.edu/wp-json/
[24-07-2018 08:07:01] Beginning download of https://www.stanford.edu/wp-includes/wlwmanifest.xml
[24-07-2018 08:07:01] Beginning download of https://www.stanford.edu/xmlrpc.php?rsd
[24-07-2018 08:07:01] Beginning download of https://www.stanford.edu/wp-content/plugins/awesome-weather-pro/awesome-weather.css?ver=4.9.7
[24-07-2018 08:07:01] Skipping download of https://fonts.googleapis.com/css?family=Open+Sans%3A400%2C300&ver=4.9.7 (not whitelisted, or blocked by robots.txt)
[24-07-2018 08:07:13] Skipping download of https://www.stanford.edu/wp-content/plugins/awesome-weather-pro/js/js-cookie.js?ver=1.1 (already downloaded)
<many lines omitted...>
[24-07-2018 08:07:13] End download of https://www.stanford.edu/list/admin/#admin-finance (0.323541 seconds)
[24-07-2018 08:07:13] End download of https://www.stanford.edu/list/admin/#admin-research (0.318452 seconds)
[24-07-2018 08:07:13] End download of https://www.stanford.edu/list/admin/#admin-staff (0.314381 seconds)
[24-07-2018 08:07:13] End download of https://www.stanford.edu/list/admin/#admin-students (0.317646 seconds)
myth66.stanford.edu listening on port 9979...
After crawling a ton of www.stanford.edu
pages, this launches a server; the last line tells you where to connect. (It will be different for you.) If you connect to http://myth66.stanford.edu:9979
in a web browser while leaving archive
running, you will see the following:
If you enter https://www.stanford.edu
and click "Go!", you can see Stanford's homepage, in all its glory:
We can click around on the links, and everything works, as long as we only click links pointing to www.stanford.edu
sites (the domain we whitelisted). Even if Stanford's website goes offline, this archive can still continue serving it as if nothing had ever happened.
Program Usage
You may want to play around with ./samples/archive_soln
before you begin, just to get a sense of how to test the program.
-
The
-d
specifies the seed page that you'd like to begin crawling from. You should specify a full URL, including http/https. For example:./archive -d https://web.stanford.edu/class/cs110/summer-2018/
. You can only specify one-d
flag, but if you want to index several pages, you can runarchive
several times. (The files it downloads are persisted in theindexed-documents/
directory.) -
Add
-w
flags to whitelist domains for our crawl. You can specify multiple-w
flags, and you can use wildcards if you'd like (though that may include many more pages than you want). For example:./archive -w "*.stanford.edu" -w fonts.googleapis.com -d https://www.stanford.edu
-
If you want to run
archive
without the server, you can add the-n
or--no-serve
flag to disable it. (Your program will exit after downloading the pages.) -
archive
starts the server with a port number generated from your sunet ID. It's unlikely anyone else will conflict with your port. However, if you want to run the server on a different port, add the-p
or--port
flag:./archive -p 12345
Important note: archive
saves downloaded files in the indexed-documents/
directory. Running archive
several times will add to this database without clearing it. This allows you to crawl several different websites and have them all be accessible from your archive
web server, but it might not be what you want in testing.
If you want to start fresh, run make filefree
to clear the indexed-documents
directory. You can also run archive
with the -m
flag to make it memory-only (it won't read from disk or persist downloads on disk).
Running The Program Off-Campus
As shown above, this assignment features a web server that shows the pages your program has archived. We're running this server on ports 2000-65535, but unfortunately, for security reasons, the myth
machines only allow access to these ports from on-campus computers.
You have several options if you want to run the program off-campus:
-
You can run
archive
with the web server disabled. When you runarchive
, add the-nim
flag.archive
won't run the web server, but will spit out a list of downloaded content that you can check for correctness. This is the easiest option of this list, but you miss out on the cool factor of being able to use your archive from your browser. -
Use an SSH proxy. SSH has a feature that allows us to send traffic to an SSH server, and it will forward that traffic to a web server. If we SSH into a Stanford computer, we can then use that computer to forward web requests to your
archive
server.- Launch
archive
. Let's say our server is listening onmyth55
port9979
. - Open an extra terminal window and
ssh -L 9979:myth55.stanford.edu:9979 rice.stanford.edu
. Leave this running on the side. (Make sure this is running in a separate terminal window from the one you're using to connect to myth, and make sure it continues running while you try to usearchive
with your browser.) You should replacemyth55
with whatever serverarchive
is running on. - In your browser, go to
http://localhost:9979
.
This method might be a little annoying if you have a bad network or frequently sleep your computer (logging into
rice
requires 2 factor authentication), but it is probably the easiest way to be able to use your browser witharchive
.By the way, you may be wondering, why not use SSH into
myth
and have myth forward the traffic toarchive
-- why userice
? That would definitely be more familiar and straightforward, but unfortunately,myth
has SSH tunneling disabled. - Launch
-
Connect to the campus network using a VPN; instructions are here. If you feel like taking the time to install the VPN client and get everything set up, then this will probably be the easiest option in the long run.
-
Use the terminal-based
lynx
browser. Open up another terminal window, SSH intomyth
, and then run something like this:lynx http://myth66.stanford.edu:9979/https://www.stanford.edu
If you don't care much for the interactivity of
lynx
, you can ask your server for a single page by usingcurl
:curl http://myth66.stanford.edu:9979/https://www.stanford.edu
This simply downloads the page from
myth
and prints it out on your terminal window. If you get anything back, yourarchive
is probably working correctly. (Errors in the page output probably come from the crawling framework that we've written for you.)
A Note On robots.txt
If you look at the sample output of some of your program runs, you'll notice that some downloads may be skipped. For example, if a site is not in the whiltelist, it will be skipped. However, another reason a site may not be downloaded, even if it is in the specified whitelist, is because of robots.txt. This file is used by website administrators to tell crawlers (like us) not to download particular parts of the website. You can see it by going to /robots.txt
on any website. For example, we tried using archive
to download the Linux man pages, but ran into issues. When we checked https://linux.die.net/robots.txt
, we saw that everything was disallowed for our crawler.
Getting Started
Clone the repository that we've set up for you by typing:
git clone /usr/class/cs110/repos/assign6/$USER assign6
Compile often, test incrementally and almost as often as you compile, run ./tools/sanitycheck
a bunch to test your work, and run ./tools/submit
when you're done. As you've seen in past assignments, you can play with a sample application by invoking ./samples/archive_soln
.
Because you are building the ThreadPool
class and need to debug it, you should make judicious use of the tpcustomtest.cc
tests. You can add to these tests yourself to test the functionality of your ThreadPool
class. There are four tests in the program already, and to add another test, you update the buildMap
function with a flag describing the test and pointing to a function that runs the test. To run, make the assignment and type (for example):
$ ./tpcustomtest --single-thread-no-wait
This is a test.
Files
You can ignore the first five files in this list -- they are used to build the downloading feature.
-
archive.h/cc: This would be the main function for the program, but it is not used when doing a normal
make
. Instead, a library is used in its place. A skeleton is written for you if you'd like to complete it (compile withmake myarchive
), and you would need to implementArchive::download
to crawl the internet starting from seed URL. Again, you do not need to work on this file. -
document.h/cc: This is another file you can ignore, but feel free to look at it. It contains the
Document
class, used to represent web pages that you encounter. The program creates aDocument
for every URL it considers downloading. It callsDocument::download
to download a web page and parse it for any links to other web pages, and callDocument::getLinks
to get those outgoing links. -
index.h/cc: Again, ignore this file unless you are working on
archive.cc
or just want to see the details. It contains theIndex
class, used to store all theDocument
s that get downloaded, and it is used to implement the archive web server that the program uses to access downloaded web pages. -
whitelist.h/cc: This class implements access control to the URLs you might try to access while crawling the web. Every whitelisted URL (specified by
-w
command-line arguments) should be added to the whitelist by callingaddHost
. Before you download a URL, you should callWhitelist::canAccess
to check whether you're allowed to access the URL. (This checks both the whitelist from the-w
flags, and it does some much more complicated checking ofrobots.txt
files to see if we're allowed to access this specific URL -- see "note on robots.txt" above.) -
log.h/cc: This contains a
Log
class that can be used to produce the logging messages you see in the sample output. If you were writing the archive functionality, you could use this class if you want to use your own logging -- it's easy to use and may save you time, so you should read the interface in thelog.h
file. It is not necessary to use or understand this file for the assignment in general.
Start paying attention here
-
thread-pool.h/cc: This is where you'll implement the
ThreadPool
class. -
tptest.cc: This is a short program that verifies basic ThreadPool functionality. We call this from sanitycheck to see if your ThreadPool is working. Don't change this file.
-
tpcustomtest.cc: This contains a collection of functions you can use to test your ThreadPool yourself. This is for you to play with and add as many additional tests as you can think of to make sure your ThreadPool is working brilliantly, as described above.
Assignment Roadmap
You have one thing to do for this assignment: create a rock-solid ThreadPool
class. Think back to farm
from Assignment 3. You created several worker processes, then distributed work to each worker as it finished its previous work. There were only eight child processes ever created, but each child factored several numbers when there was a lot of work to be done.
An easier approach would have been as follows: every time you had a number to factor, you could fork
to launch a worker to factor that number. You could have implemented it such that a maximum of 8 workers were running at a time (on an 8-core machine), and some might argue that this would be just as good as your implementation, since this avoids contention for hardware resources just like your implementation did. However, this approach is definitely worse, because even if it has the same number of processes running at a time, it creates many more processes over the entire execution of the program. Creating processes is relatively expensive, so if we can reuse processes to do work, we should do so.
Because the archive
program downloads many files (thousands, for big websites), the code has the potential to use many, many threads. We don't want to create thousands of individual threads, especially because we are going to limit the number of threads hammering the website we are trying to download. So, instead of creating those threads, we are going to have a fixed number of threads and then just re-use them with the thread pool.
The archive
program limits itself to sixteen threads by setting up a sixteen-thread thread pool:
ThreadPool tp(16);
It does this at the beginning of the program. Then, work is added to the pool (using the tp.schedule()
function) one of the threads wakes up to do the work. Concretely, we can add functions to the thread pool's queue, and one of the threads will wake up, execute the function we added, and then go back to sleep. (Note: specifically, thunks are added to the queue. Thunks are just functions that take no parameters and return no values.)
The thread pool concept is practically inescapable in software engineering; you'll see it in database libraries, web servers, some graphics rendering engines, and more. The code we're implementing is quite general-purpose, and the concepts involved will serve you well.
Using The ThreadPool
Class
The ThreadPool
class has the following interface:
class ThreadPool {
public:
ThreadPool(size_t numThreads);
void schedule(const std::function<void(void)>& thunk);
void wait();
~ThreadPool();
};
A simple program can use this pool to execute 10 function calls across 4 threads:
static const size_t kNumThreads = 4;
static const size_t kNumFunctions = 10;
int main(int argc, char *argv[]) {
ThreadPool pool(kNumThreads);
for (size_t id = 0; id < kNumFunctions; id++) {
pool.schedule([id] {
cout << oslock << "Thread (ID: " << id << ") has started." << endl << osunlock;
size_t sleepTime = (id % 3) * 10;
sleep_for(sleepTime);
cout << oslock << "Thread (ID: " << id << ") has finished." << endl << osunlock;
});
}
pool.wait();
cout << "All done!" << endl;
return 0;
}
Implementation
-
Your constructor should do the following:
- Launch a single dispatcher thread, which pulls work off the queue, wakes up a particular worker, and hands the function to be executed to that worker. Assume
dt
is a privateThreadPool
member of typethread
:dt = thread([this]() { dispatcher(); });
- Launch a specific number of worker threads (assume
wts
is a privateThreadPool
member of typevector<thread>
):wts[workerID] = thread([this, workerID]() { worker(workerID); });
- Launch a single dispatcher thread, which pulls work off the queue, wakes up a particular worker, and hands the function to be executed to that worker. Assume
-
Your
schedule
function should accept a thunk (a function that takes no parameters and returns no value -- expressed as typefunction<void(void)>
) and append it to the end of a queue of such functions. Each time a function is added, the dispatcher thread should be notified. Once the dispatcher is notified,schedule
should return right away so that more functions can be scheduled.schedule
should be thread-safe (i.e. if your program has more threads that are running outside of the ThreadPool, it should be possible to callschedule
from multiple different threads and not have any chance of encountering race conditions). -
The
dispatcher
thread should loop almost interminably. In each iteration, it should sleep untilschedule
tells it that something has been added to the queue. It should then wait for a worker to become available, select the worker, mark it as unavailable, dequeue a function, put a copy of that function in a place where the worker can access it, and then signal the worker to execute it. Reviewing the customer-cashier interactions from the ice cream parlor lecture example may be helpful. -
The
worker
threads should also loop repeatedly, blocking within each iteration until the dispatcher thread signals it to execute an assigned function. Once signaled, the worker should invoke the function, wait for it to execute, then mark itself as available so that it can be discovered and selected again by the dispatcher. -
The
wait
function should wait until theThreadPool
is completely idle. It should be possible to callwait
multiple times, andwait
should be thread-safe. -
The
ThreadPool
destructor should wait until all functions have been executed (it's fine to callwait
), then dispose of anyThreadPool
resources. Our solution doesn't use any dynamic memory allocation, but if you use it, then be sure to free resources.
You can test your ThreadPool
using tptest.c
and tpcustomtest.c
(which compile to tptest
and tpcustomtest
). If you use dynamic memory allocation, make sure that you do not leak any memory. (You shouldn't need to.)
Can I implement ThreadPool without a dispatcher thread?
Yes, it's quite possible to implement a scheme where workers are notified of incoming work, and then they pull work off the queue without the dispatcher specifically handing the work to them. However, we want you to implement ThreadPool
with a dispatcher thread. It's better practice with thread communication/synchronization, and the dispatcher thread is essential to implementing more capable ThreadPool
s (such as a ThreadPool
with lazy initialization, where the worker threads aren't created unless they're actually needed).
Icons by Piotr Kwiatkowski