Multithreading Practice
You and your friends are bored, so you decide to play a super fun game: go to a random Wikipedia page, then find the linked Wikipedia page that is the longest (by length of the HTML body). You decide to write a Rust program to do this!
Sequential Link Explorer
Here’s a program that downloads the Wikipedia page for “Multithreading,” then sequentially downloads each page, looking for the longest one:
extern crate reqwest;
extern crate select;
#[macro_use]
extern crate error_chain;

use select::document::Document;
use select::predicate::Name;

error_chain! {
    foreign_links {
        ReqError(reqwest::Error);
        IoError(std::io::Error);
    }
}
const TARGET_PAGE: &str = "https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)";

// Nothing interesting here; feel free to ignore.
fn get_linked_pages(html_body: &str) -> Result<Vec<String>> {
    Ok(Document::from_read(html_body.as_bytes())?
        .find(Name("a"))
        .filter_map(|n| {
            if let Some(link_str) = n.attr("href") {
                if link_str.starts_with("/wiki/") {
                    Some(format!("{}/{}", "https://en.wikipedia.org",
                        &link_str[1..]))
                } else {
                    None
                }
            } else {
                None
            }
        }).collect::<Vec<String>>())
}
// Adapted from https://rust-lang-nursery.github.io/rust-cookbook/web/scraping.html
fn main() -> Result<()> {
    // Get the body of the page
    let html_body = reqwest::blocking::get(TARGET_PAGE)?.text()?;
    // Identify all linked wikipedia pages
    let links = get_linked_pages(&html_body)?;
    // Keep track of the URL and length (of the body) of the
    // longest article so far
    let mut longest_article_url = "".to_string();
    let mut longest_article_len = 0;
    // Get each link
    for link in &links {
        // Download the HTML body
        let body = reqwest::blocking::get(link)?.text()?;
        let curr_len = body.len();
        // Update longest article found (if needed)
        if curr_len > longest_article_len {
            longest_article_len = curr_len;
            longest_article_url = link.to_string();
        }
    }
    println!("{} was the longest article with length {}", longest_article_url,
        longest_article_len);
    Ok(())
}
Notes on the code
Adding dependencies
If you want to run this locally, you can start a new package using cargo new:
cargo new link-explorer-example --bin
This code uses the reqwest, select, error-chain, and (later) threadpool crates. (Crates are like external libraries in Rust.) Because it relies on libraries outside of std, we need to explicitly tell Cargo about them. We do that by listing them as dependencies in the Cargo.toml file.
If you open Cargo.toml, you should see a line that says [dependencies]. We can add the libraries we need, along with the versions and features we want, there:
[dependencies]
select = "0.4.3"
error-chain = "0.12.2"
reqwest = {version = "0.10.4", features = ["blocking"]}
threadpool = "1.8.1"
Now, when you run cargo build, cargo will download (if needed), compile, and link each of these libraries at the specified version!
Custom Error
You might be wondering what that error_chain! macro is, and you may also be wondering why the return types from functions only have one type specified in the Result.
We're using the error-chain crate, which you'll also encounter in project 2. At a high level, this:
- Implements a custom Error type.
  - This can be useful if you're using a bunch of different libraries that implement their own error types.
    - For instance, if a reqwest function fails, it'll return an Err(reqwest::Error).
  - foreign_links translates Result error types from other libraries into our custom error type.
    - Here, an Err(reqwest::Error) from the reqwest library will be converted into an Err(ErrorKind::ReqError) in our custom type.
    - An Err(std::io::Error) will be converted into an Err(ErrorKind::IoError).
  - This can also be useful if you want to define your own different types of errors.
    - In project 2, you'll see custom errors defined: an ErrorKind::BadResponse and an ErrorKind::NoUpstreamServers. These are intended to be used to indicate different types of errors.
- Each of our methods that returns a Result specifies only the success type. The Error type will always be our custom error.
  - In other words, every Result in our program is now a custom Result type, which is generic over the Ok parameter, but not generic over the Err parameter. The Err will always encapsulate our custom defined error type.
Back to multithreading: this is SLOW
Unfortunately, this is terribly slow, and it takes almost 3 minutes to run on my machine.
Why is it slow? This program is I/O bound (input/output bound): its speed of execution is limited by the network. The CPU is idle almost the entire time! We aren’t making good use of system resources.
Adding threads
Adding Arc/Mutex
We want threads to work together to find the longest article. By the end, we want the threads to collectively update longest_article_url so that we know what the longest article is.
As with last lecture, we'll want to use an Arc and Mutex to ensure that the threads can all access AND update the same longest article. (You can imagine we're putting the longest article in a bathroom stall, and whenever a thread downloads an article, it'll go into the bathroom stall to check it against the running longest article.)
However, a Mutex can hold only one value, and we want to store both the longest article's URL and its length. To fix this, we can bundle the URL and length together in a tuple or a struct (we'll opt for a struct), put this in our Mutex, and access it from our threads:
use std::sync::{Arc, Mutex};
use std::thread;

// Define a struct to put in the Arc<Mutex<T>>
struct Article {
    url: String,
    length: usize,
}

fn main() -> Result<()> {
    // ... fetch the page and collect `links`, as before ...
    // Arc containing a mutex containing an Article
    let longest_article = Arc::new(Mutex::new(Article { url: "".to_string(), length: 0 }));
    // Store thread handles in a vector for easy joining later
    let mut threads = Vec::new();
    for link in &links {
        let longest_article_handle = longest_article.clone();
        threads.push(thread::spawn(move || {
            let body = reqwest::blocking::get(link)?.text()?;
            let curr_len = body.len();
            let mut longest_article = longest_article_handle.lock().unwrap();
            if curr_len > longest_article.length {
                longest_article.length = curr_len;
                longest_article.url = link.to_string();
            }
        }));
    }
    for thread in threads {
        thread.join().unwrap();
    }
    let longest_article_ref = longest_article.lock().unwrap();
    println!("{} was the longest article with length {}", longest_article_ref.url,
        longest_article_ref.length);
    Ok(())
}
Error propagation from inside a thread
Compiling the above code gives us an error:
error[E0277]: the `?` operator can only be used in a closure that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
--> src/main.rs:58:24
|
57 | threads.push(thread::spawn(move || {
| ____________________________________-
58 | | let body = reqwest::blocking::get(link)?.text()?;
| | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot use the `?` operator in a closure that returns `()`
59 | | let curr_len = body.len();
60 | | let mut longest_article = longest_article_handle.lock().unwrap();
... |
64 | | }
65 | | }));
| |_________- this function should return `Result` or `Option` to accept `?`
|
= help: the trait `std::ops::Try` is not implemented for `()`
= note: required by `std::ops::Try::from_error`
What's this about? main does return Result! And we didn't change this line when adding threading, so why is it giving us an error now?
If you look carefully, we moved the offending line inside of a closure that runs inside a different thread. It's this closure that isn't returning Result, which is what is causing problems. Furthermore, there's a conceptual issue here: if this child thread returns an Error, how should we propagate that to the main thread?
Conveniently, Rust allows threads to return values back to the parent thread: you can add a return type to the closure, and once the child thread returns, that value will be returned by thread::join:
let t = thread::spawn(move || -> i32 {
    println!("Inside the child thread, returning 5");
    return 5;
});
let x = t.join().expect("Thread panicked!");
println!("Parent thread: {}", x); // prints 5
This means that the child thread can return a Result back to the parent, which can propagate the error after join() returns:
for link in &links {
    let longest_article_handle = longest_article.clone();
    threads.push(thread::spawn(move || -> Result<()> {
        // ^ note added "-> Result<()>" return type
        let body = reqwest::blocking::get(link)?.text()?;
        let curr_len = body.len();
        let mut longest_article = longest_article_handle.lock().unwrap();
        if curr_len > longest_article.length {
            longest_article.length = curr_len;
            longest_article.url = link.to_string();
        }
        // Once this thread is done, it needs to return Ok
        Ok(())
    }));
}
for thread in threads {
    thread.join().unwrap()?;
    // ^ note the added ?, which will stop/propagate if a thread returns Error
}
let longest_article_ref = longest_article.lock().unwrap();
println!("{} was the longest article with length {}", longest_article_ref.url,
    longest_article_ref.length);
Ensuring link lives long enough
We aren’t finished with the compiler errors:
error[E0597]: `links` does not live long enough
--> src/main.rs:55:17
|
55 | for link in &links {
| ^^^^^^
| |
| borrowed value does not live long enough
| argument requires that `links` is borrowed for `'static`
...
75 | }
| - `links` dropped here while still borrowed
The link variable is of type &String (i.e. it is a reference to a String owned by the main thread), and the Rust compiler is not 100% convinced that the main thread will outlive the child thread, so we get a lifetime error. (It would be a use-after-free if the child thread were to continue using this reference after the main thread cleaned up the memory.)
A simple fix is to move each link out of the vector and transfer ownership to each thread:
for link in links {
    // `link` is now an owned String
    threads.push(thread::spawn(move || -> Result<()> {
        // `link` is moved into the thread
        let body = reqwest::blocking::get(&link)?.text()?;
        ...
    }));
}
The above code now transfers ownership of each link, one-by-one, into each thread. This means that, once this for loop has finished executing, ownership of every element in the vector has been transferred, and, thus, the main thread no longer owns the vector of links.
Of course, this means that you won't be able to use links in the main thread after this loop. If the main thread needed to continue using the vector, you could either clone the vector (e.g. for link in links.clone()), or you could put all the links in an Arc that all the threads share, to ensure that the memory will live long enough.
Limiting network connections
Great, this code finally compiles! However, it crashes shortly after running.
The error might look slightly different depending on your OS, but it should say something related to resource consumption. I get this:
Error: Error(ReqError(reqwest::Error { kind: Request, url:
"https://en.wikipedia.org/wiki/Thread_(computer_science)", source:
hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 24, kind:
Other, message: "Too many open files" })) }), State { next_error: None,
backtrace: InternalBacktrace { backtrace: None } })
The key part of the error is Too many open files. When each thread goes to download an article, it opens a socket, which requires a file descriptor. With too many threads doing this at the same time, we run out of file descriptors!
You may also see this error:
Error: Error(ReqError(reqwest::Error { kind: Request, url:
"https://en.wikipedia.org/wiki/File:Question_book-new.svg", source:
hyper::Error(Connect, ConnectError("dns error", Custom { kind: Other, error:
"failed to lookup address information: nodename nor servname provided, or not
known" })) }), State { next_error: None, backtrace: InternalBacktrace {
backtrace: None } })
This error is much more cryptic, but it is ultimately caused by having too many threads active. The OS thread limit is flexible and can be raised to allow thousands (or even tens of thousands) of threads, but the default limit is usually lower than that, and it's usually not a good idea to spawn so many threads for a task like this anyway.
Again, these errors might look different for you, but the underlying issue is fundamentally that we’re trying to consume too many resources.
There are a few ways to fix this. One is to use a semaphore to keep the number of threads and number of open file descriptors manageable. Rust doesn't have a semaphore in the standard library, but there are crates you can use, such as sema (https://docs.rs/sema/0.1.4/sema/struct.Semaphore.html). It can be used like a traditional semaphore, as shown in CS 110, though it has a handy SemaphoreGuard that works like the MutexGuard returned by lock(), meant to help prevent you from forgetting to free resources. (If you're interested, you can see an example of sema use from 2021 here.)
We could, alternatively or in addition, implement a “batching” approach: spawn a fixed number of threads, then manually divide the links statically and equally between the threads. Or, maybe, we could spawn a fixed number of threads and share a queue of links between them – each thread could pull a link off of the queue, continuing until all links have been processed.
To keep things simple, I'm going to keep the core logic the same, but instead of spawning one thread per link, I'll use a thread pool, like the one you built and used in CS 110. A thread pool allows you to create a fixed number of threads and then reuse those threads to do many tasks. Rust doesn't have a thread pool in the standard library, but the threadpool crate provides one:
let threadpool = ThreadPool::new(20);
for link in links {
    let longest_article_handle = longest_article.clone();
    threadpool.execute(move || {
        let body = reqwest::blocking::get(&link).unwrap().text().unwrap();
        let curr_len = body.len();
        let mut longest_article = longest_article_handle.lock().unwrap();
        if curr_len > longest_article.length {
            longest_article.length = curr_len;
            longest_article.url = link.to_string();
        }
    });
}
threadpool.join();
let longest_article_ref = longest_article.lock().unwrap();
println!("{} was the longest article with length {}", longest_article_ref.url,
    longest_article_ref.length);
Note: You still may run into resource limits here, unfortunately, likely due to spawning too many threads (some of the libraries we're calling do some threading of their own under the hood). If you decrease the number of threads you give to your ThreadPool, you should eventually get a working solution with some amount of speedup over the sequential version!