Assignment 7: HTTP Web Proxy and Cache


Due Date: Friday, March 6th at 11:59PM

Written by Jerry Cain, with modifications by Philip Levis and Chris Gregg.

Your penultimate CS110 assignment has you implement a multithreaded HTTP web proxy and cache. A web proxy is an intermediary that intercepts each and every request and (generally) forwards it on to the intended recipient. The servers direct their responses back to the proxy, which in turn passes them on to the client. In this way, a web proxy acts as a middle-person between a client and a server. Here's the neat part, though: when requests and responses travel through a proxy, the proxy can control what gets passed along. The proxy might, for instance, block requests to certain sites, cache responses so that repeat requests can be served more quickly, or amend requests and responses as they pass through.

Overview

Go ahead and clone the git repository we've set up for you by typing:

git clone /usr/class/cs110/repos/assign7/$USER assign7

Compile often, test incrementally and almost as often as you compile, run ./tools/sanitycheck, and run ./tools/submit when you're done.

If you descend into your assign7 directory, you'll notice a subfolder called samples, which itself contains a symlink to a fully operational version called proxy_soln (note: the solution also works for HTTPS:// sites and videos, which your version does not need to support, though you can optionally support it). You can invoke the sample executable without any arguments, as with:

myth61:$ ./samples/proxy_soln
Listening for all incoming traffic on port <port number>.

The port number issued depends on your SUNet ID, and with very high probability, you'll be the only one ever assigned it. If for some reason proxy_soln says the port number is in use, you can select any other port number between 2000 and 49150 that isn't in use (we'll choose 12345 here) by typing:

myth61:$ ./samples/proxy_soln --port 12345
Listening for all incoming traffic on port 12345.

In isolation, proxy_soln doesn't do very much. In order to see it work its magic, you should download and launch a web browser that allows you to appoint a proxy for just HTTP traffic. We're recommending you download Firefox (http://www.mozilla.org/en-US/firefox/new/), since it has been well-tested to work easily with this assignment, and we believe will be the easiest to use for testing. In particular, some other browsers don't allow you to configure browser-only proxy settings, but instead prompt you to configure computer-wide proxy settings for all HTTP traffic--for all browsers, Dropbox and/or iCloud synchronization, iTunes downloads, and so forth. You don't want that level of interference.

Once you download and launch Firefox, you can configure it as follows:

Firefox Connection Settings window to enable Manual proxy configuration

You should enter the myth machine you're working on (and you should get in the habit of ssh'ing into the same exact myth machine for the next week so you don't have to continually change these settings), and you should enter the port number that your proxy is listening to. Make sure to check the box for HTTPS!

If you'd like to start small and avoid the browser, you can use telnet from your own machine to talk HTTP with your proxy, like this (everything the user types below is in bold, and everything sent back by a proxy--one presumably running on myth60:12345--is italicized):

myth61:$ telnet myth60.stanford.edu 12345
Trying 171.64.15.20...
Connected to myth60.stanford.edu.
Escape character is '^]'.
GET http://api.ipify.org/?format=json HTTP/1.1
Host: api.ipify.org

HTTP/1.1 200 OK
connection: keep-alive
content-length: 21
content-type: application/json
date: Fri, 01 Mar 2019 05:56:30 GMT
server: Cowboy
vary: Origin
via: 1.1 vegur

{"ip":"171.64.15.22"}Connection closed by foreign host.
myth61:$

Note that after you enter Host: api.ipify.org, you need to hit enter twice. In this case, the response is a valid one that responds with your IP address, packaged in a JSON object that's been serialized into string form.

Another note: if you're working off campus and not connected to the Stanford network, you need to VPN into Stanford by following instructions outlined right here (https://uit.stanford.edu/service/vpn). The myth machines are configured so that all ports 2000 and up are open to other Stanford machines, but if you're not on campus, you need to VPN into the Stanford network, else your attempts to redirect web traffic to a myth machine will be rejected. Also note: some students in the dorms have reported that they cannot get their proxy to work when connected to the ethernet port (hardwired). If this is the case for you, try using WiFi instead.

Implementing v1: Sequential proxy

Your final product should be a multithreaded proxy and cache that blocks access to certain domains. As with all large programs, we're encouraging you to work through a series of milestones instead of implementing everything at once. You'll want to read and reread Sections 11.5 and 11.6 of your B&O textbook to ensure a basic understanding of HTTP.

For the v1 milestone, you shouldn't worry about threads or caching. You should transform the initial code base into a sequential but otherwise legitimate proxy. The code you're starting with responds to all HTTP requests with a placeholder status line consisting of an "HTTP/1.0" version string, a status code of 200, and a curt "OK" reason message. The response includes an equally curt body with the message "You're writing a proxy!". Once you've configured your browser so that all HTTP traffic is directed toward the relevant port of the myth machine you're working on, go ahead and launch proxy and start visiting any and all websites. Your proxy should at this point intercept every HTTP request and respond with this:

Image showing a successful connection in Firefox.  The visited URL is www.time.com, and the web browser just displays the text 'You're writing a proxy!'

For the v1 milestone, you should upgrade the starter application to be a true proxy--an intermediary that ingests requests from the client, establishes connections to the origin servers (which are the machines for which the requests are actually intended), passes the requests on to the origin servers, waits for these origin servers to respond, and then passes their responses back to the clients. You only need to provide support for the most common HTTP methods: GET, POST, and HEAD. Once you implement these, you will be able to access http:// (unencrypted) sites. There are other methods (PUT and DELETE come to mind), but they're rarely used by traditional web servers, so your proxy doesn't need to support them. Optionally, we have given you another function (HTTPRequestHandler::manageClientServerBridge) that allows you to connect to encrypted https:// sites if you also provide support for the CONNECT method.

Each intercepted request is passed along to the origin server pretty much as is, save for three small changes.

  1. You should modify the intercepted request URL within the first line -- the request line as it's called -- as needed so that when you forward it as part of the request, it includes only the path and not the protocol or the host. The request line of the intercepted request should look something like this:
    GET http://www.cornell.edu/research/ HTTP/1.1
    
    and the first line of the request you forward to www.cornell.edu would need to look like this (for this example):
    GET /research/ HTTP/1.1
    
    Of course, GET might be any one of the legitimate HTTP method names, the protocol might be HTTP/1.0 instead of HTTP/1.1, and the URL will be any one of a jillion. But provided your browser is configured to direct all HTTP traffic through your proxy, the URLs are guaranteed to include the protocol (e.g. the leading "http://") and the host name (e.g. www.cornell.edu). The protocol and the host name are included whenever the request is directed to a proxy, because the proxy would otherwise have no clue where the forwarded request should go. But when you *do* forward the request to the origin server, you need to strip the leading "http://" and the host name from the URL. We've implemented the HTTPRequest class to manage this detail for you automatically (inspect the implementation of operator<< in request.cc and you'll see), but you need to ensure that you don't break this as you start modifying the code base, because you'll need to change the implementation of operator<< once you support proxy chaining for the final milestone.
  2. You should add a new request header entity named "x-forwarded-proto" and set its value to be "http". If "x-forwarded-proto" is already included in the request header, then simply add it again.
  3. You should add a new request header entity called "x-forwarded-for" and set its value to be the IP address of the requesting client. If "x-forwarded-for" is already present, then you should extend its value into a comma-separated chain of IP addresses the request has passed through before arriving at your proxy. (The IP address of the machine you're directly hearing from would be appended to the end). Your reasons for adding these two new fields will become apparent later on, when you support proxy chaining.

For reference, descriptions of the "x-forwarded-for" and "x-forwarded-proto" header fields and their purposes are readily available online. RFC 2616, the HTTP/1.1 specification, describes the proxy request line and explains why requests directed to a proxy must include an absolute URI (see the third paragraph of Section 5.1.2).

Most of the code you write for your v1 milestone will be confined to request-handler.h and request-handler.cc files (although you'll want to make a few changes to request.h/cc as well). The HTTPRequestHandler class you're starting with has just one public method, with a placeholder implementation.

NOTE: You need to familiarize yourself with all of the various classes at your disposal to determine which ones should contribute to the v1 implementation. Of course, you'll want to leverage the client socket code presented in lecture to open up a connection to the origin server. Your implementation of the one public method will evolve into a substantial amount of code--substantial enough that you'll want to decompose and add a good number of private methods.
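For the outbound connection to the origin server, a client socket in the spirit of the lecture code might look like the sketch below, built on the thread-safe getaddrinfo. The function name and structure are our own, not necessarily identical to the lecture's:

```cpp
#include <cstring>
#include <netdb.h>
#include <string>
#include <sys/socket.h>
#include <unistd.h>
using namespace std;

// Sketch: resolve the origin server's name and connect to it over TCP.
// Returns a connected descriptor, or -1 if resolution or connection fails.
static int createClientSocket(const string& host, unsigned short port) {
  struct addrinfo hints, *results;
  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_INET;       // IPv4 is plenty for this assignment
  hints.ai_socktype = SOCK_STREAM; // TCP
  if (getaddrinfo(host.c_str(), to_string(port).c_str(), &hints, &results) != 0)
    return -1;
  int fd = -1;
  for (struct addrinfo* cur = results; cur != NULL; cur = cur->ai_next) {
    fd = socket(cur->ai_family, cur->ai_socktype, cur->ai_protocol);
    if (fd == -1) continue;
    if (connect(fd, cur->ai_addr, cur->ai_addrlen) == 0) break; // success
    close(fd);
    fd = -1;
  }
  freeaddrinfo(results);
  return fd;
}
```

Once connected, you can layer an iosockstream over the descriptor, exactly as in lecture, to read and write the HTTP request and response.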

Once you've reached your v1 milestone, you'll be the proud owner of a sequential (but otherwise fully functional) proxy. You should visit every web site imaginable to ensure the round-trip transactions pass through your proxy without impacting the functionality of the site (caveat: see the note below on sites that require login or are served up via HTTPS). Of course, you can expect the sites to load very slowly, since your proxy has this much parallelism: zero. For the moment, however, concern yourself with the networking and the proxy's core functionality, and worry about improving application throughput in later milestones.

Implementing v2: Adding Blacklisting and Caching

Once you've built v1, you'll have constructed a bona fide proxy. In practice, proxies are used to block access to certain websites, to cache static resources that rarely change so they can be served up more quickly, or both.

Why block access to certain websites? There are several reasons--among them, enforcing an organization's content policies, shielding users from known-malicious sites, and cutting down on unwanted traffic.

Why should the proxy maintain copies of static resources like images and JavaScript files? Serving them from the cache is faster for the client, and it spares the larger Internet a good deal of superfluous network activity.

In spite of the long-winded defense of why caching and blacklisting are reasonable features, incorporating support for each is relatively straightforward, provided you confine your changes to the request-handler.h and .cc files. In particular, you should just add two private instance variables--one of type HTTPBlacklist, and a second of type HTTPCache--to HTTPRequestHandler. Your to-do item for blacklisting? Before doing anything else with an intercepted request, consult the blacklist; if the request's domain matches one of the blocked-domain regexes, don't forward the request at all, and instead respond to the client with a status code of 403.

Your to-do item for caching? Before passing the HTTP request on to the origin server, you should check to see if a valid cache entry exists. If it does, just return a copy of it--verbatim!--without bothering to forward the HTTP request. If it does not, then you should forward the request as you would have otherwise. If the HTTP response identifies itself as cacheable, then you should cache a copy before propagating it along to the client.

What's cacheable? The code we've given you makes some decisions--technically off specification, but good enough for our purposes--and implements pretty much everything. In a nutshell, an HTTP response is cacheable if the HTTP request method was "GET", the response status code was 200, and the response header was clear that the response is cacheable and can be cached for a reasonably long period of time. You can inspect some of the HTTPCache method implementations to see the decisions made for you, or you can just ignore the implementations for the time being and use the HTTPCache off-the-shelf.

Once you've finished v2, you should once again pelt your proxy with oodles of requests to ensure it still works as before, save for some obvious differences. Web sites matching domain regexes listed in blocked-domains.txt should be blocked with a 403, and you should confirm your proxy's cache grows to store a good number of documents, sparing the larger Internet from a good amount of superfluous network activity. (Again, to test the caching part, make sure you clear your browser's cache a whole bunch.)

Implementing v3: Adding Concurrency

You've implemented your HTTPRequestHandler class to proxy, block, and cache, but you have yet to work in any multithreading magic. For precisely the same reasons threading worked out so well with your RSS News Feed Aggregator, threading will work miracles when implanted into your proxy. Virtually all of the multithreading you add will be confined to the scheduler.h and scheduler.cc files. These two files will ultimately define and implement an über-sophisticated HTTPProxyScheduler class, which is responsible for maintaining a list of socket/IP-address pairs to be handled in FIFO fashion by a limited number of threads.

The initial version of scheduler.h/.cc provides the lamest scheduler ever: it just passes the buck on to the HTTPRequestHandler, which proxies, blocks, and caches on the main thread. Calling it a scheduler is an insult to all other schedulers, because it doesn't really schedule anything at all. It just passes each socket/IP-address pair on to its HTTPRequestHandler underling and blocks until the underling's serviceRequest method sees the full HTTP transaction through to the last byte transfer.

One extreme solution might just spawn a separate thread within every single call to scheduleRequest, so that its implementation would go from this:

void HTTPProxyScheduler::scheduleRequest(int connectionfd,
                                         const string& clientIPAddress) {
  handler.serviceRequest(make_pair(connectionfd, clientIPAddress));
}                                        

to this:

void HTTPProxyScheduler::scheduleRequest(int connectionfd,
                                         const string& clientIPAddress) {
  thread t([this](const pair<int, string>& connection) {
    handler.serviceRequest(connection);
  }, make_pair(connectionfd, clientIPAddress));
  t.detach();
}

(Side note: detach above makes the thread its own entity that no longer has to be joined later by the parent).

While the above approach succeeds in getting the request off of the main thread, it doesn't limit the number of threads that can be running at any one time. If your proxy were to receive hundreds of requests in the course of a few seconds--in practice, a very real possibility--the above would create hundreds of threads in the course of those few seconds, and that would be bad. Should the proxy endure an extended burst of incoming traffic--scores of requests per second, sustained over several minutes or even hours, the above would create so many threads that the thread count would immediately exceed a thread-manager-defined maximum.

Fortunately, you built a ThreadPool class for Assignment 6, which is exactly what you want here. We've included the thread-pool.h file in the assign7 repositories, and updated the Makefile to link against our working solution of the ThreadPool class. You should leverage a single ThreadPool with 64 worker threads, and use that to elevate your sequential proxy to a multithreaded one. Given a properly working ThreadPool, going from sequential to concurrent is actually not very much work at all.

Your HTTPProxyScheduler class should encapsulate just a single HTTPRequestHandler, which itself already encapsulates exactly one HTTPBlacklist and one HTTPCache. You should stick with just one scheduler, request handler, blacklist, and cache, but because you're now using a ThreadPool and introducing parallelism, you'll need to implant more synchronization directives to avoid any and all data races. Truth be told, you shouldn't need to protect the blacklist operations, since the blacklist, once constructed, never changes. But you need to ensure concurrent changes to the cache don't actually introduce any races that might threaten the integrity of the cached HTTP responses. In particular, if your proxy gets two competing requests for the same exact resource and you don't protect against race conditions, you may see problems.

Here are some basic requirements:

You should not lock down the entire cache with a single mutex for all requests, as that introduces a huge bottleneck into the mix, allows at most one open network connection at a time, and renders your multithreaded application essentially sequential. You could take the map<string, unique_ptr<mutex>> approach that the implementation of oslock and osunlock takes (you probably took a similar approach in Assignment 5 to manage per-server connection limits as well), but that solution doesn't scale for real proxies, which run uninterrupted for months at a time and cache millions of documents.

Instead, your HTTPCache implementation should maintain an array of 997 mutexes, and before you do anything on behalf of a particular request, you should hash it and acquire the mutex at the index equal to the hash code modulo 997. You should be able to inspect the initial implementation of the HTTPCache and figure out how to surface a hash code and use that to decide which mutex guards any particular request. A specific HTTPRequest will always map to the same mutex, which guarantees safety; different HTTPRequests may very, very occasionally map to the same mutex, but we're willing to live with that, since it happens so infrequently. (Why 997? We just chose a relatively large prime number; hash theory works out better in practice when prime numbers are involved.)

We've ensured that the starting code base relies on thread safe versions of functions (gethostbyname_r instead of gethostbyname, readdir_r instead of readdir), so you don't have to worry about any of that. (Note your assign7 repo includes client-socket.[h/cc], updated to use gethostbyname_r.)

Implementing v4: Adding Proxy Chaining

Some proxies elect to forward their requests not to the origin servers, but instead to secondary proxies. Chaining proxies makes it possible to more fully conceal your web surfing activity, particularly if you pass through proxies that pledge to anonymize your IP address, cookie jar, etc. A proxied proxy might also rely on the services of an existing proxy while providing a few more--better caching, custom blacklisting, and so forth--to the client.

The proxy_soln we've supplied you allows for a secondary proxy to be specified, as with this:

myth61:$ ./samples/proxy_soln --proxy-server myth63.stanford.edu
Listening for all incoming traffic on port 39245.

Requests will be directed toward another proxy at myth63.stanford.edu:39245.

Provided a second proxy is running on myth63 and listening on port 39245, the proxy running on myth61 would forward all HTTP requests--unmodified, save for the updates to the "x-forwarded-proto" and "x-forwarded-for" header fields--on to the proxy running on myth63:39245, which for all we know forwards to another proxy!

We actually don't require that the secondary proxy be listening on the same port number, so something like this might be a legal chain:

myth61:~$ ./samples/proxy_soln --proxy-server myth63.stanford.edu --proxy-port 12345
Listening for all incoming traffic on port 39245.

Requests will be directed toward another proxy at myth63.stanford.edu:12345.

In that case, the myth61:39245 proxy would forward all requests to the proxy listening on port 12345 on myth63. If the --proxy-port option isn't specified, then the proxy assumes its own port number also applies to the secondary.

The HTTPProxy class we've given you already knows how to parse these additional --proxy-server and --proxy-port flags, but it doesn't do anything with them. You're to update the hierarchy of classes to allow for the possibility that a secondary proxy (or a chain of several) is being used, and if so, to forward all requests (as is, except for the modifications to the "x-forwarded-proto" and "x-forwarded-for" headers) on to the secondary proxy. This'll require you to extend the signatures of many methods and/or add methods to the hierarchy of classes to allow for the possibility that requests will be forwarded to another proxy instead of the origin servers. If you notice a chained set of proxy IP addresses that leads to a cycle (even if the port numbers are different), you should respond with a status code of 504. For fun, we're supplying a python script called run-proxy-farm.py, which can be used to manage a farm of proxies that forward to each other. Once you have proxy chaining implemented, open the python script and update the HOSTS variable to be a list of one or more myth machine numbers (e.g. HOSTS = [51, 53, 57, 60]) to get a daisy chain of proxy processes running on the different hosts. Note that you cannot use the python script as supplied to test for cycles in chains; you'll have to set that up manually, or modify run-proxy-farm.py to support it.

Additional Tidbits

When you complete this assignment, be proud of what you've accomplished! It's genuinely thrilling to know that all of you can implement something as sophisticated as an industrial-strength proxy, particularly in light of the fact that just a few weeks ago, we hadn't even discussed networking yet.

Optional: Implementing CONNECT for HTTPS:// access

If you would like to support HTTPS:// websites, which are the dominant sites on the web these days (for good reason), you will need to support the CONNECT request, which is similar to a GET request. However, this request is relevant only to the proxy (or proxies) between the client and the destination server, and isn't actually intended for the destination server itself. You must ultimately open a connection to the destination server (without sending anything), and then have a 200 OK response sent back to the client. Once you have handled this, you should flush the client stream and then pass both streams to the manageClientServerBridge(iosockstream& client, iosockstream& server) function: the first is the stream to the client, and the second (the one you created when opening the outbound connection) is the stream to the server or proxy you're forwarding to. Simply calling manageClientServerBridge is all you should need to do to fully complete the CONNECT request. If you are forwarding to another proxy, you must instead forward the CONNECT request and then call manageClientServerBridge.

We will award 5% extra credit for a correct HTTPS implementation.


Website design based on a design by Chris Piech
Icons by Piotr Kwiatkowski