Project 2: Balancebeam

Networked services are everywhere, and keeping them running is a critical task: communications, credit card processing, the power grid, and much more all depend on networking infrastructure. Load balancers are a crucial component for providing scalability and availability to networked services, and in this assignment, you’ll feel out the internals of load balancers and learn what makes them tick!

Conceptually, there is a lot of overlap between this assignment and the CS 110 proxy assignment (which is the last assignment this quarter), so you’ll get a good chance to compare and contrast similar C++ and Rust code.

This project is much more open-ended than previous assignments in this class. There are more design and structural decisions involved, and you will find that there are many ways to implement each task. You’ll also find that you may have to include additional libraries while you work and search the web to figure out how to do something; this is part of the process! (And, because of the open-endedness, I recommend working on this assignment in a group, and I encourage you to discuss your implementation strategies and challenges on Slack.) I hope you enjoy working on this and are able to reflect on how much you’ve learned this quarter!

Logistics

This project is due on Friday, March 11 at 11:59PM pacific time. If you might have difficulty meeting that deadline, please contact me ahead of time and I can give you an extension.

You may work in groups of 2-3 if you would like. Please follow the same instructions outlined under “Working with a Partner” from the project 1 handout. In particular:

If you would like to submit together, please let me know which repo you’ll be submitting with on Slack so that I can add your group as collaborators.
If you would like to work together but submit separately, that’s also fine. You’re also welcome to discuss the project at a high level with others in the class! (Beyond the group you’re formally working with, the usual honor code guidelines apply.)
In either case, please put the names of all group members at the top of main.rs.

Also, if you would be interested in working on a different project, let me know! This is a small class and I would love to support your individual interests.

Getting set up

You should have received an invite to join this project’s Github repository. If you didn’t get an email invite, try going to this link:

https://github.com/cs110l/proj2-YOURSUNETID

(If you’re working in a group and plan to submit together, let me know which repository you’ll be working in, and I’ll add your full group to that repository.)

You can download the code using git as usual:

git clone https://github.com/cs110l/proj2-YOURSUNETID.git proj2

Unlike the last project, you can work on this project directly on your computer without any tools like Docker or Vagrant. If you would like to work on myth or rice, you are certainly welcome to.

Throughout this project, please take notes on your experience: What did you try? What worked? What didn’t? I’ll ask you to write a reflection at the end of the assignment.

Background: What’s a load balancer?

A load balancer is core to a networked or distributed system – a system with functionality that’s distributed across many different machines, communicating with each other over some network. You very likely interact with networked systems every day: when you send an email, visit a website hosted by a large organization, check social media, etc.

Two key properties of networked systems are scalability and availability:

Scalability: how well can the system grow as demands increase over time?
Availability: how well is the system able to stay available and avoid downtime?

The single client, single server introductory model of networking

In CS110, you likely learned about a networking model in which one client connects with one server:

single server

The client looks up the server’s IP address (using DNS)
The client connects to the server’s IP on port 80 (HTTP) over the network
Once the connection is established, the client and the server each receive one end of a connected socket in the form of a file descriptor. This is like a “pipe”: when the client writes to the socket, the data gets sent over the Internet to the server. The server can read this data from the socket. (And vice verse.)
The client can then send HTTP requests, which the server can receive, process, and send HTTP responses accordingly.

Let’s think about the scalability and availability of this system.

Is it scalable? Not at all!
- In the most basic form of this model, the server is just one server (one computer).
- Individual computers aren’t scalable: it becomes exponentially more expensive as you try to upgrade performance on a single machine, and Internet traffic has grown far faster than the capabilities of hardware.
Is it reliable (available)? Not at all!
- Server could get overloaded and run out of resources (memory, compute, file descriptors, threads…)
- Single point of failure: system crashes, hardware fails, dog eats power cable, local network goes down, etc. = entire system goes down.

Enter distributed systems: single client, many servers

The idea behind a distributed system is to distribute a system’s functionality over a large number of servers. These servers talk to each other over a network on the problem they’re trying to solve.

In the case of web servers, this means that multiple machines are available to service client requests.

Here’s one basic model of what that can look like:

In this model, the client follows the process above to open a connected socket to one of the web servers, which provide connectivity and processing power. The servers manage client connections and much of the logic for processing requests. The servers also, likely, communicate with a database to get content. (The database system will likely also be distributed, but implementing a distributed database is a really complicated problem to solve!)

Now, we have multiple machines providing file descriptors, connection setup and management, and computing power. We also no longer have a single point of failure: if one of these servers goes down, there are backups that a client can still connect to.

Aside: web servers aren’t the only example of distributed systems – a quick web search will give you results ranging from large filesystems to parallel computation for AI/ML to virtual machines to electric grids. I’m focusing on web servers here, because it’s relevant to this assignment; if you’re interested, I recommend CS244B (grad seminar) and CS149 (parallel computing).

Enter load balancers: distribute traffic across compute nodes

There are a few problems with the above model. Should the servers share an IP address (and if so, how do they do that?) or have different IP addresses? How does the client know which server to connect to? (Is it up to the client to figure out when one node is overwhelmed or experiencing faillure, and switch to another?)

Here’s where load balancers come in. The role of a load balancer is to act as an intermediary for client connections and decide which node to forward requests to:

Here:

A client looks up the IP address for a website using DNS, then requests to open a connection to that IP address on port 80 (HTTP) or 443 (HTTPS). This connection is actually to the load balancer, which sits at the edge of a datacenter, ready to receive connection requests from clients.
The load balancer receives the client’s connection request, then decides which compute node to “assign” it to.
After choosing a compute node, the load balancer opens a connection to it. From here, the load balancer acts like a proxy:
- Any traffic the client sends to the load balancer is relayed to the compute node.
- Any traffic the compute node sends is forwarded back to the client.
This is all the load balancer does! Anything resource- or logic-intensive is offloaded to the compute nodes. Consequently, load balancers can handle a large number of concurrent connections.
Note: the client generally doesn’t know that all of this is happening – the inner workings of the load balancer, data center, etc. provide the abstraction of a single, powerful server.

How are we doing on scalability and availability?

Scalability: if many clients are connecting, we can add more compute nodes. If one compute node is overwhelmed, the load balancer can forward new requests elsewhere.
Availability: if a compute node fails, the load balancer will detect that it isn’t able to contact that node, and it can stop relaying traffic there.

In case you’re interested in a real-life example, here’s a diagram from Cloudflare, one of the largest website hosting providers, on what their datacenters look like (source):

When a client connects to a website hosted by Cloudflare:

Connection requests and (later) HTTP requests get forwarded to a gateway (ECMP) router. This router implements a first layer of load balancing, forwarding HTTP traffic to various servers.
The entry-point to each server has an additional layer of load balancing (“layer 4 load balancing”, or L4LB in the diagram). This load balancer may decide to forward the connection somewhere else for processing, e.g., if the local server is overwhelmed.
- (There’s also Denial-of-Service protection, which tries to detect and block abnormal traffic)
At this server, the connection is “terminated” (this is a confusing term, but it just means that here is where the server end of the connected socket will be). Requests are processed here by a suite of applications.
Then, there’s a system for getting content – a shared, distributed cache and a connection to the “origin”, the servers of the hosted website.

Load balance your load balancers

I said that load balancers can handle a large number of connection requests, but you’re probably thinking: “Okay, but what about a large company that services millions of connection requests every day? Surely, they can’t just have one load balancer. And what if that load balancer fails? Heck, what if a power line goes down and the whole datacenter loses power?” You’d be 100% right about all of these concerns!

The reality is that there are layers of load balancing, and there are many strategies that organizations use to load balance across their load balancers. One strategy is allowing DNS to choose from multiple IP addresses, corresponding to multiple load balancers, for a single hostname – allowing the client to choose, or shuffling the order, or choosing based on whatever is likely to be closest to the client. Another is relying on a system called IP Anycast, in which multiple servers announce to the Internet that they “own” the same IP address, then rely on the broader Internet routing infrastructure to figure out where to forward requests destined for that address. How modern, large-scale web providers work is a huge topic. (If you’re interested, I recommend CS144 and CS249i!)

In this assignment, consider the load balancer you’re implementing as part of a larger system: your load balancer sits at the edge of a datacenter and forwards client requests to multiple compute nodes. There’s probably another mechanism that keeps the load on your load balancer manageable and redirects traffic if your load balancer fails. The internal logic of the datacenter(s) is its own beast – with a distributed database to store and update content, compute logic, etc. You’re implementing one core piece of functionality that fits into a complex whole.

Milestone 0: Read the starter code

The starter code implements a single-threaded reverse proxy. (We considered having you implement it, but wanted to give you time to focus on the more interesting parts, and you’ll be implementing a proxy in CS 110 anyways.) If you’re running the code locally, you can take it for a spin. Start the load balancer like so:

cargo run -- --upstream 171.67.215.200:80

Then, visit http://localhost:1100/class/cs110l/ in your web browser. (You may get a bad gateway error, temporarily, while the website is loading. This will be harder to do if you are running the server on myth or rice, but fear not; there are many other ways to test the server that will be described in this handout.)

This is configuring the load balancer to forward requests to a single upstream server. Think of this as a server inside of your datacenter that can handle requests and send you back content. (Here, you’re using the server that operates web.stanford.edu, which happens to be at 171.67.215.200). When your browser opens a connection to balancebeam, balancebeam opens a connection to web.stanford.edu; then, when your browser sends requests over the connection, balancebeam passes them along, and when web.stanford.edu sends responses, balancebeam relays them to your browser.

Let’s have a look at the code that implements this, starting in main.rs. You should be sure to understand the code in main.rs thoroughly, as you will be making substantial changes over the course of the assignment.

The CmdOptions struct uses some fancy macros to do command-line argument parsing. Any command-line arguments supplied to the program end up in this struct in main():

// Parse the command line arguments passed to this program
let options = CmdOptions::parse();
if options.upstream.len() < 1 {
    log::error!("At least one upstream server must be specified using the --upstream option.");
    std::process::exit(1);
}

The ProxyState struct is useful for storing information about the state of balancebeam. Currently, it only stores information pulled directly from CmdOptions, but in later milestones, you will want to add more fields to this struct so that different threads can more easily share information.

/// Contains information about the state of balancebeam (e.g. what servers we are currently proxying
/// to, what servers have failed, rate limiting counts, etc.)
///
/// You should add fields to this struct in later milestones.
struct ProxyState {
    /// How frequently we check whether upstream servers are alive (Milestone 2)
    #[allow(dead_code)]
    active_health_check_interval: u64,
    /// Where we should send requests when doing active health checks (Milestone 2)
    #[allow(dead_code)]
    active_health_check_path: String,
    /// How big the rate limiting window should be, default is 1 minute (Milestone 3)
    #[allow(dead_code)]
    rate_limit_window_size: u64,
    /// Maximum number of requests an individual IP can make in a window (Milestone 3)
    #[allow(dead_code)]
    max_requests_per_window: u64,
    /// Addresses of servers that we are proxying to
    upstream_addresses: Vec<String>,
}

After parsing the command line arguments, main() creates a server socket so that it can begin listening for connections:

let listener = match TcpListener::bind(&options.bind) {
    Ok(listener) => listener,
    Err(err) => {
        log::error!("Could not bind to {}: {}", options.bind, err);
        std::process::exit(1);
    }
};
log::info!("Listening for requests on {}", options.bind);

Then, as connections come in, it calls the handle_connection() function, passing a TcpStream (“client socket” stream in CS 110 terms – an “Internet pipe” connected to the client).

for stream in listener.incoming() {
    let stream = stream.unwrap();
    // Handle the connection!
    handle_connection(stream, &state);
}

The handle_connection is a little long, but it is not too conceptually complex:

First, it selects a random upstream server and opens a connection to that server. (If the server is dead, it informs the client with a 502 Bad Gateway error.)
Then, it begins infinitely looping, trying to read requests from the client. The client may send any number of requests over a single connection, and balancebeam is expected to relay each of them to the upstream server. There is some error handling in case the client sends a malformed request.
Once the request has been read, balancebeam adds an X-Forwarded-For HTTP header so that the upstream server can know the client’s IP address. (This isn’t particularly important to understand; it’s simply a de-facto standard header.)
Next, balancebeam relays the request to the upstream server. If there is an error in transmitting the request, it informs the client.
Next, balancebeam tries to read the response from the upstream server. If the response is corrupt or there is some other problem in receiving it, it informs the client.
Finally, balancebeam forwards the response to the client.

Throughout this code, you may see calls like log::debug! or log::info!. This is effectively just printing to the terminal, but it is also colorizing the output to make it easier to differentiate important log lines (e.g. errors) from less important log lines (e.g. lines just helpful for debugging if something goes wrong). You may continue using print! if you like, but the logger is available for your convenience. The log levels are log::debug!, log::info!, log::warn!, and log::error!.

request.rs and response.rs

Frustratingly, we discovered Rust does not have many crates for parsing HTTP requests and responses. The crates that are available are either too high-level (implementing major functionality that we want you to implement for this assignment) or too low-level (requiring a lot of extra code to use). We ended up using two libraries. httparse is a low-level library that does HTTP parsing but requires a lot of extra code to be functional. http is a library that provides Request and Response structs that can be used to store HTTP requests/responses, but does not actually provide any code to create these objects by parsing data or to send stored requests/responses by serializing these structs to a TcpStream. (It doesn’t even have a toString function!) We ended up writing a lot of glue code to combine these two libraries, and this code is provided for you in request.rs and response.rs.

Notably, request.rs and response.rs export read_from_stream functions to read and parse http::Request and http::Response objects from the bytes sent by the client or server, along with write_to_stream functions to serialize http::Request or http::Response and send that to the client or server. These functions each first call read_headers, which reads bytes from the client and tries to see whether the bytes received so far form a valid HTTP request up until the end of the headers, as well as read_body, which reads bytes from the client until we have finished reading the full body for a request.

response.rs also provides a make_http_error function that generates an http::Response with the provided HTTP status code. If you want to send an error to the client, you can call this function to create an http::Response, then use response::write_to_stream to send it to the client.

These files are more complicated to follow than main.rs, as they rely on an understanding of the HTTP protocol, but we have tried to add comments in the right places to make them easier to read. You don’t need to understand them in much detail. (If you implement Milestone 5, you will be modifying some of the functions there, but we will walk you through the process, and you can make the changes without really understanding the code.)

Testing

We have provided a full test suite so that you do not need to run balancebeam with an upstream server and figure out how to send HTTP requests yourself. The following tests should pass out of the box:

cargo test --test 01_single_upstream_tests
cargo test --test 02_multiple_upstream_tests test_load_distribution

As you work on later milestones, you should make sure that these tests continue to pass. We have tried to add good logging in the tests so that you can figure out what a failing test is trying to do, but if you want to see the source code for the tests, you can find it in the tests/ directory.

Rust-analyzer error

If you’re using rust analyzer, you may end up with all of CmdOptions highlighted in red with an error message saying that “proc macro server crashed.” This is a bug in rust-analyzer. You can ignore it.

If it’s bothering you, you can disable proc macro checks to get rid of it. To do this, you’d go into the rust-analyzer configurations (for vscode, see https://rust-analyzer.github.io/manual.html#vs-code). In package.json, set "rust-analyzer.procMacro.enable" to false. Don’t forget to turn it back on after you’re done with this project.

Milestone 1: Failover with passive health checks

Your load balancer can already distribute traffic across several upstream servers. However, if an upstream server goes down, the load balancer currently sends an error back to the client without trying to do anything else. (See connect_to_upstream and the beginning of handle_connection.) If we have multiple upstream servers, we can do better than this!

In this milestone, we will begin to implement failover: when one of the upstream servers fails, we should redirect its traffic to any remaining upstream servers so that clients experience minimal disruptions. Remember that when a client connects to balancebeam, balancebeam tries to connect to a random upstream server. We’ll first implement a simple mechanism for detecting when an upstream server has failed: if connecting to the upstream fails, we can assume that the upstream server is dead, and we can pick a different upstream server.

Modify your balancebeam implementation to keep track of dead upstream servers and to proxy clients only to the live upstreams. If a client connects and the first upstream that balancebeam selects is dead, balancebeam should mark that upstream as dead and then pick a different upstream. Clients should only receive an error if all upstreams are dead.

You will need to make the majority of your changes in connect_to_upstream, although other functions might require slight modification.

At the end of this milestone, you should pass the test_passive_health_checks test:

cargo test passive_health_checks

You should also make sure that the load balancer still works for distributing load across healthy upstreams:

cargo test --test 01_single_upstream_tests
cargo test load_distribution

Milestone 2: Failover with active HTTP health checks

Passive health checks are convenient, but they have limitations. Sometimes, servers fail in such a way that they can still establish connections, but they fail to service HTTP requests. For example, if an application server loses contact with the database server, it may do just fine establishing initial connections with a client, but subsequently fail to process any request that relies on the database.

Application servers will commonly implement health check endpoints. A load balancer or service monitor (e.g. Github status) can make requests to these health check paths, and the server will do a quick self-test (e.g. doing some database operations) to make sure everything is functioning as expected. If the health check returns HTTP 200, the load balancer can be more confident that the upstream server is working, but if it returns something else, the load balancer can take it out of the rotation.

Performing periodic health checks also has the benefit that the load balancer can restore a failed upstream if it starts working again. An upstream server may temporarily go down if it gets overloaded or crashes or is rebooted, but the load balancer can periodically try making requests, and if the server starts responding successfully again, the upstream can start using it again.

You can assume that the servers in your datacenter are set up so that each has an active health check HTTP “path” that can be queried, and that this path will be given to your load balancer in the active_health_check_path argument. (Our tests hard-code this path as “/health-check”.) This means that when one of these servers receives an HTTP request from your load balancer’s IP address for “/health-check”, it will know that this is a health check, and it will respond with an “HTTP 200.” If your load balancer sends a query and doesn’t receive this response back, it should assume that the server is dead.

In this milestone, you are to send a request to each upstream at active_health_check_path every active_health_check_interval. If a failed upstream returns HTTP 200, put it back in the rotation of upstream servers. If an online upstream fails to return a response or returns a non-200 status code, mark that server as failed. (To check the response status, you can use response.status().as_u16().)

You will need to use one or more threads to accomplish this. In doing this, will need to ensure that the ProxyState struct can be safely shared. There are several ways to do this; feel free to discuss any approaches on Slack. You may assume that this server loops infinitely, so you do not need to worry about joining any threads.

Two requirements: First, if you use locks, only lock what you need to. (What do you need to protect? What will/won’t be modified? For instance, don’t lock an entire struct if only one element in it is writable). Second, hold locks that other threads need for as little time as possible. (Implementing health checks should not slow down your load balancer. In particular, make sure that no thread holds a lock that some other thread might need while waiting for network activity. You’re welcome to clone() small amounts of data to achieve this).

Tip: If you’re not sure how to do something, such as building a request and sending it to the server, find where in the starter code this is already being done. Can you use that as a model for your own code?

Optional: If you’re interested, watch the 2021 lecture on channels and try implementing this milestone with channels. For example, you might have an active health check thread that sends messages to the main thread whenever an upstream server is marked alive or dead. The main thread can receive these messages in between calls to handle_connection (i.e. in the for loop at the end of main).

At the end of this milestone, you should pass the active_health_checks tests:

cargo test active_health_checks

You can run all of the tests up until this point using this command:

cargo test -- --skip test_rate_limiting

(Added 3/7) Tip for constructing an HTTP request: as mentioned in Milestone 0, we’re representing HTTP requests using the Request object from the http crate (docs here). It may be helpful to check out how the request.rs file uses the http::request::Builder object to populate the fields of an HTTP request. You’ll need to hard-code the appropriate http::Method; you can see an example of this used in response.rs.)

Milestone 3: Rate limiting

Rate limiting is the practice of limiting the rate at which clients can send requests. This can help prevent a service from being accidentally or intentionally overwhelmed. For example, a Denial of Service attack involves sending large amounts of traffic to a server in order to disable it; rate limiting can help by preventing large-volume attack traffic from reaching the application servers. Rate limiting is also used to prevent abuse, such as credential stuffing attacks, when an attacker attempts to brute-force guess usernames and passwords. Sometimes, rate limiting is even made part of a business model! For example, the Google Maps API allows other applications to make requests for maps information, but it charges per request and imposes limits on request rate per billing tier.

In this milestone, you will implement basic rate limiting by IP address. The max_requests_per_window parameter specifies how many requests each IP address should be allowed to make per window of time (by default, per minute, but this is controlled by rate_limit_window_size). If it is zero, rate limiting should be disabled. If a client makes more requests within a window, the proxy should respond to those requests with HTTP error 429 (Too Many Requests) rather than forwarding the requests to the upstream servers.

There are many algorithms for implementing rate limiting. This article provides a great overview. We recommend implementing the fixed window algorithm, but if you are up for something just slightly more complex, you can give the sliding window algorithm a try.

In the fixed window algorithm, the rate limiter tracks counters for each IP within a fixed time window (e.g. 12:00PM to 12:01PM). At the end of the window, the counters reset. This has the advantage of being extremely simple to implement. However, it is not the most accurate algorithm. To see why, imagine our rate limit is 100 requests/minute, and imagine a client sends 100 requests from 12:00:30 to 12:00:59, then sends another 100 requests from 12:01:00 to 12:01:30. Such a client would get away with sending 200 requests, which is double the rate limit, even though those requests are legal under this rate limiting scheme. The sliding window algorithm preserves the fixed window algorithm’s simplicity while being more accurate. However, for this assignment, implementing fixed window rate limiting will suffice.

You can run the test_rate_limiting test with this command:

cargo test rate_limiting

At the end of this milestone, all tests should pass!

cargo test

Tip: You can create a rate limiting error response with response::make_http_error(http::StatusCode::TOO_MANY_REQUESTS). Then, you can use send_response to send it to the client.

Milestone 4: Add multithreading

You now have a fairly capable load balancer! Now it’s time to make it fast.

In this milestone, you should use threads to service requests. You may do this however you like:

You could modify main() to handle each connection (handle_connection) in a separate thread.
You could use a ThreadPool to reuse threads between requests.
If you want to get a bit more advanced, consider replacing your locks with RwLocks so that threads reading data don’t block each other (only writers will block).
If you’re using channels, you way want to look into the bus crate (allows you to broadcast messages to multiple threads) and the multiqueue crate (allows for similar functionality with multiple broadcasting threads).
Got any other ideas?

The baseline requirement of this milestone is that your program is parallelized, with meaningful speed and concurrency improvement over the sequential version.

An optional extension is to experiment with how you can measure and improve performance; see Optional Milestone 4.5 at the bottom of this handout.

I encourage you to brainstorm possible approaches on Slack.

As you go (and before you submit!), make sure that all functionality continues to work:

cargo test

Reflection

When you finish, please write a brief reflection of your experience. We have provided a reflection.txt file for you to write your reflection in, although you can write it in a different format if you like (e.g. if you want to add images). If you save the reflection as a different file, be sure to git add whatever-file before submitting, or else it will not be included in your git repository.

You can write anything you like, but here are some things we would love to hear about:

Did you finish all of the milestones? What did/didn’t you attempt? Are there any remaining bugs that you’re aware of?
What did you like about this assignment? What did you not like about it?
Were there any parts you got stuck on?
Did you try anything to improve performance?
If you were continuing to work on this project, what might be a next thing you would implement? (Additional features? A way to improve performance?)
What surprised you? What did you learn?

Optional (Extra Credit) Milestone 5: Use asynchronous I/O

In order to implement this milestone, you’ll want to watch the 2021 lectures on Event-Driven Programming.

Once you have multithreading working, it’s time to take a stab at using nonblocking tasks instead. We’d like for you to contrast between multithreaded and asynchronous code – and performance!

In this milestone, you should convert the codebase to use nonblocking I/O and lightweight Tokio tasks instead of threads. In many other languages, this would be a monumental task, but we think you will be pleasantly surprised at how easy it is when using Tokio and async/await syntax!

In request.rs and response.rs, you should convert std::net::TcpStream to tokio::net::TcpStream and std::io::{Read, Write} to tokio::io::{AsyncReadExt, AsyncWriteExt}. Then, you’ll need to update read_headers, read_body, read_from_stream, and write_to_stream, as these are the functions that contain blocking I/O operations. Convert each of those functions to an async function. When calling TcpStream::read and TcpStream::write_all, add .await at the end of each read or write call. In read_from_stream, add .await at the end of the read_headers and read_body calls, as these are now asynchronous functions.

In main.rs, you will probably see an overwhelming number of compiler errors, but fear not – most of these can be fixed without much trouble!

In your imports, swap std::net::{TcpListener, TcpStream} for tokio::net::{TcpListener, TcpStream}
Similarly, swap std::thread::sleep for tokio::time::sleep
Make main() async, and add #[tokio::main] on the line before it

Swap this:

for stream in listener.incoming() {
    if let Ok(stream) = stream {
        // Handle the connection!
        // ...
    }
}

for this:

while let Ok((stream, _)) = listener.accept().await {
    // Handle the connection!
    // ...
}

Convert the rest of the request processing code to be asynchronous – try adding async or .await wherever there is a compiler error :)

Any threading/synchronization you wrote in previous milestones will also need to be converted:

thread::spawn(move || { ... }) becomes task::spawn(async { ... })
If you used a ThreadPool, you can remove it and spawn a task for each request; the purpose of a ThreadPool is to reuse threads (which are expensive to spawn), but Tokio tasks are lightweight and do not have this expense.
tokio::sync::Mutex is almost a drop-in replacement for std::sync::Mutex
Similarly, tokio::sync::RwLock works the same as std::sync::RwLock.
async_std::channel::unbounded is an asynchronous version of crossbeam channels (add async-std = "1.9" to your Cargo.toml file)

As you are working through this milestone, beware that Googling things may give you outdated results. Rust introduced async/await syntax in 2017-2018 and the feature was not released in stable Rust until November 2019; Tokio just released version 1.0 (with breaking changes) in December 2020, so we are really working on the bleeding edge here. You should be looking at documentation for Tokio 1.6; in particular, any pre-1.0 documentation is stale and will likely mislead you.

If you Google documentation, make sure to keep an eye on the upper left corner of the documentation pages and ensure you are on the latest version. Watch out for things like this:

“This release has been yanked, go to latest version”

Tip: If you get unexpected error messages with the word “future” in them, chances are you forgot to .await on a function before using its return value. (The compiler provides a helpful hint, even though the “found opaque type” error may be confusing at first.) Example errors look like this:

error[E0308]: mismatched types
   --> tests/02_multiple_upstream_tests.rs:132:20
    |
132 |     upstreams.push(SimpleServer::new_at_address(failed_ip));
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |                    |
    |                    expected struct `common::simple_server::SimpleServer`, found opaque type
    |                    help: consider using `.await` here: `SimpleServer::new_at_address(failed_ip).await`
    |
   ::: tests/common/simple_server.rs:46:62
    |
46  |     pub async fn new_at_address(bind_addr_string: String) -> SimpleServer {
    |                                                              ------------ the `Output` of this `async fn`'s found opaque type
    |
    = note:   expected struct `common::simple_server::SimpleServer`
            found opaque type `impl std::future::Future`

Again, at the end of this milestone, ensure all functionality is still working:

cargo test

Optional (Extra Credit) Milestones 6+: Add More Functionality!

This assignment is quite open ended, and there are many more features you could implement. You’re welcome to take a look at HAProxy and Nginx to take inspiration from production load balancers. If you implement something new that is related to performance, feel free to talk to me about benchmarking, and I can suggest ways that you can benchmark and compare your implementation.

Here some features we feel would be particularly worthwhile:

Connection pooling

If you have the time, this might be one of the most interesting extra features to implement.

When a client connects to balancebeam, balancebeam must first establish a connection to an upstream before relaying the request. Establishing a connection is an expensive process that can double the time required to service a request in the worst cases.

A connection pool is a pool of open, ready-to-go upstream connections that a load balancer maintains in order to reduce request latency. (This is actually not a load balancing concept; for example, database clients almost always maintain a pool of connections to the database so that database queries can be executed with minimal latency.) When a request comes in, it is forwarded to an upstream over an idle connection in the pool. If there are no idle connections in the pool, the load balancer can fall back to opening a new connection, but this pooling reduces latency in the common case.

Better rate limiting

As mentioned, fixed window rate limiting isn’t the best strategy. Sliding window rate limiting provides significant improvements, and isn’t too hard to implement.

Also, if you are interested, this article from Cloudflare provides an overview of rate limiting at scale and explains how counters are maintained over a network of countless load balancers. (Cloudflare itself is essentially one massive, distributed load balancer.)

Other load balancing algorithms

The balancebeam starter code does random load balancing. This strategy is pretty effective and dead simple, but it degenerates under high load. Other strategies take into account how loaded each server is. There is a decent high-level summary of techniques here. Some of them depend on knowing the CPU load of each server, which is not possible with our setup, but techniques depending on the number of open connections are possible. This article also does a great job of exploring a hybrid algorithm called Power of Two Choices, and talks about benchmarking different algorithms.

Caching

A load balancer can also cache responses from application servers in order to reduce load on upstreams. Unlike the CS 110 proxy cache, this cache should be entirely in-memory in order to minimize latency. Since memory is a limited resource, you will want to implement policies for when data is evicted from the cache. You can depend on libraries if you like, such as lru.

Web Application Firewall

A WAF screens incoming requests, guessing whether a request seems to be malicious (e.g. an unusual request that seems intended to exploit a vulnerability on the application server, or an attempt at brute forcing credentials) and denying malicious traffic from reaching upstream servers. I didn’t find any WAF crates for Rust, but let us know if you would be interested in learning how to call into C++ code so that you can use the standard ModSecurity library for filtering.

Optional (Extra Credit) Milestone 4.5: Performance Testing

Added 3/5

During class on Monday (2/28), I said that I would provide some performance tuning tools and infrastructure. Later in the week, Ryan and I talked in more depth about performance testing for this assignment, what he’s done in previous offerings of this course, and the challenges he encountered then. I’m outlining the key points of this conversation below.

The TL;DR is: we don’t have an out-of-the-box performance test to offer that you can just run and use. Instead, we’re going to link to some articles that describe the process for performance-tuning web servers, and we’re going to recommend some ways to do it.

If you’re interested in testing and tuning performance, now or after the quarter ends, I hope that the below gives you a starting point for thinking through testing and metrics. (Often, setting up a way to measure and benchmark performance is just as hard as actually improving it.) Feel free to reach out to me if you run into any issues with the below. You should also feel more than welcome to reach out to Ryan (ryan@reberhardt.com) – he developed this assignment, knows a lot more about performance testing than I do, and would love to brainstorm with you and recommend strategies.

Thinking about performance testing

Performance testing and benchmarking is hard, and it’s worth thinking about this before you think about how to measure performance. (There are tons of resources out there about this. Here’s one video that talks through how to improve performance tooling. Here’s a paper that talks through some external factors that contribute to fluctuations in performance in cloud-based web applications.)

When trying to test the performance of a single piece of a larger system – as we are – the problem gets even trickier. How do you isolate the impacts of your load balancer on performance? (For intance, how do you ensure that you’re not bottlenecked by the servers or, if you’re running across multiple machines, the network?) If you’re running other components on the same machine, resource contention comes into play: if you have one process that’s generating a lot of requests, one process that’s running your load balancer, and other processes simulating upstream servers, they are all going to contend for the CPU time, memory, cache, etc., impacting each other in unpredictable ways.

What’s been done for this project in the past

Last year, Ryan was able to get some instances (virtual machines) from a cloud computing platform. He set up a relatively complicated testing infrastructure: one instance generated requests, one ran the load balancer, and a few ran upstream servers to process the requests. (By putting these on different machines, he aimed to isolate the impacts of the load balancer.) The “client” (load generator) would generate requests, then record metrics like latency, throughput, and scalability based on load.

In addition to being expensive and requiring funding, this solution was still unreliable. Since he didn’t have direct control over the datacenter infrastructure, he couldn’t control where the VMs got placed in relation to each other and what other processes they were sharing underlying hardware with. There was still bottlenecking from the network and servers, and variability if instances were spun up at different locations.

There was other variation too – from day of the week, to time of day, to one run to the next. Some of this could be abstracted away by averaging and taking direct comparisons: e.g., if you want to test the difference between commit A and commit B, run the load balancer with commit A multiple times and take average metrics, and then immediately run the load balancer with commit B multiple times and take average metrics. That said, setting this up was complicated, beyond what most people wanted to do, and still difficult to interpret. Ryan found that it led to more confusion among students about their design decisions than it did actual utility.

Suggestions for performance testing

All of that said: high throughput and low latency are critical design goals for a load balancer, and figuring out how to performance test in a development environment is a common and (possibly) interesting problem.

Background reading: performance testing web servers

Read up on the general category of tools and types of tests you might consider running. I recommend this post and this post. Some of the details might be beyond what you can or want to do, but I think the high-level strategies are helpful.

Questions to ask

Get clear about some of the things you’re trying to measure. For instance:

What is the throughput? How many requests can your load balancer handle per unit time?
- Note: one challenge here is making sure that you’re not bottlenecked by the backend server(s). (E.g., if the server can only process X requests per second, your load balancer isn’t going to be able to return responses to the client any faster than X.)
What is the latency for a single request?
More advanced: how does that change as load increases? (scalability)
More advanced: how does that change as you add more backend servers?
- Note: this is challenging if you’re running the upstreams on the same machine – more servers = more processes competing for limited resources.

Initial test: generate load and send requests

Here’s one recommendation for an initial test using open-source tools that you can run locally to get some basic metrics across commits.

Tools:

A server: you’ll want to use a performant server to make sure it doesn’t bottleneck your performance, and the testing suite just isn’t fast enough.
- Ryan recommends actix.
- Create a new package to run your server. Make sure to specify actix_web as a dependency in the Cargo.toml file.
- Copy the basic “Hello World” program from the actix homepage into main. This should work fine as your web server.
A load tester: Ryan recommends wrk2, an HTTP benchmarking tool. You can download the code from GitHub and use the instructions in the README to run it.

Running the test:

Open one shell (terminal window) and run the server. Keep track of which IP/port it’s listening on.
Start the load balancer on a different port. For command line arguments, specify:
- For --upstream, the IP_address:port of your server.
- For the health check URL, / (something that your server will respond to). See the tests directory for how to specify a health check URL.
- Make sure you know what IP address and port your load balancer is running on.
Start wrk2, specifying the IP address and port of your load balancer as the target and / for path. Check out the metrics that are outputted.

Remember that you will get variable performance, especially if you’re on myth and sharing underlying hardware with other people.

I recommend running this multiple times, and I recommend comparing across commits at the same time. For instance, if you want to compare between commit A and commit B:

Switch to a new branch, and roll back to commit A. Run the test (multiple times).
Stash this branch, and switch to the branch where the code reflects commit B. Run the test again (multiple times).

This should filter out some noise and reflect differences in latency and throughput, if there are any.

Run on `FarmShare`

Stanford offers a research computing environment, FarmShare2. As described at this link, you can log in to FarmShare via rice (which shares a file system with myth), then access the wheat servers (compute nodes) from there.

FarmShare2 uses Slurm to schedule jobs. You can use this to request a consistent allocation of computing resources – and, depending on availability, you may be able to specify that you want different processes (e.g., your load generator vs. your load balancer) to run on different servers.

FarmShare may give you more flexibility than running locally. Again, know that your metrics will fluctuate.

Profiling

A profiler will give you some information about how your load balancer is using hardware resources. Run any of these profilers on your load balancer. Again, you’ll have to think through what kinds of questions/hypotheses you want to gather evidence for by running the profiler.

You can do this during the above tests. You could also modify the existing test suite to send a large number (say, 1,000) requests in a tight for loop. For instance, you could modify the n_requests variable in the test_load_distribution function in 02_multiple_upstream_tests.rs, or write your own test.

Other ideas?

If you go this route, Ryan and I would love to hear about how it goes, what you learn, what you tried, and what did or didn’t work!

Thanks for exploring this together. :)

CS 110L