Description

(This and the next assignment will follow the same structure as the previous two: You will first implement a serial version and then parallelize it.)

In the previous two assignments, you worked with structured index spaces. In this assignment, you will use unstructured index spaces to implement a well-known, and frequently implemented, graph algorithm, PageRank.

PageRank is the basis of Google’s ranking of web pages in search results. Given a directed graph where pages are nodes and the links between pages are edges, the algorithm calculates the likelihood, or rank, that a page will be visited. The rank of a page is determined recursively by the ranks of the pages that link to it. In this definition, pages that have a lot of incoming links or links from highly ranked pages are likely to be highly ranked. This idea is quite simple and yet powerful enough to produce search results that correspond well to human expectations of which pages are important. In the next section, we will see a mathematical definition of the problem and an iterative method to solve it, which you will implement in this assignment.

The Math behind PageRank

The PageRank algorithm assumes that a web surfer is visiting some node (web page), and will either follow one of the links on that page or move to a random page (chosen from all pages on the web). The following equation recursively defines the rank PR(p) for each page p:


$$ \mathit{PR}(p) = \frac{1 - d}{N} + d \sum_{p' \in M(p)} \frac{\mathit{PR}(p')}{L(p')} $$

where N is the number of pages, M(p) denotes the set of nodes with links to page p, L(p) is the number of outgoing links in page p, and d is the damping factor. The first term of this equation models the possibility that surfers will jump to a random web page (with probability 1 − d). The second term corresponds to the likelihood a surfer visits a page by following links. Note that the contribution PR(p′) from a neighboring page p is divided by L(p′) (the number of p's outgoing links), which assumes each link has an equal chance to be selected.

Because of its usefulness, many researchers have explored various ways to solve this equation. In this assignment, we will pick an iterative method that is simple to implement and matches nicely with the Regent programming model.

Iterative Method for PageRank

As the name suggests, the iterative method repeats some calculation up to convergence. Let PR(p; t) be the rank of page p at iteration t. We abuse notation for a vector p of pages:
$$ \mathit{PR}(\mathbf{p}; t) = (\mathit{PR}(p_0; t), \mathit{PR}(p_1; t), \dots)\\ \textrm{ where }\mathbf{p} = (p_0, p_1, \dots) $$

At iteration 0, all ranks are uniformly initialized to $\frac{1}{N}$ where N is the number of pages.
$$ \mathit{PR}(p; 0) = \frac{1}{N} $$
Then, the method calculates the updated ranks PR(p; t + 1) using the previous ranks PR(p; t) from this equation:
$$ \mathit{PR}(p; t + 1) = \frac{1 - d}{N} + d \sum_{p' \in M(p)} \frac{\mathit{PR}(p'; t)}{L(p')} $$
This equation is the same as the original one except that it is no longer recursive. The method converges when the L2-norm of the difference between the previous and current ranks is smaller than some error bound (ϵ):
$$ \left\lVert \mathit{PR}(\mathbf{p}; t + 1) - \mathit{PR}(\mathbf{p}; t) \right\rVert \le \epsilon\\ $$

Your sole task in this assignment is to implement the iterative method described in the previous paragraph. Unlike the previous assignments, this assignment gives you more freedom in how you structure your code. As long as your code generates the same result as the TA's solution code (which we all know is correct), you will get full credit.

How to run the code

File assignment3.tar.gz has the following files:

  • The starter code pagerank.rg and pagerank_config.rg. The code will not compile and you are supposed to see this compile error:

    pagerank.rg:34: attempt to call global 'Link' (a nil value)
  • Some test inputs examples/*.dat. The first two lines of the inputs are the numbers of pages and edges, and the rest of them enumerate links between pages:

    (# of pages)
    (# of links)
    (src page1) (dst page1)
    (src page2) (dst page2)
    ...
    Any inputs that have the same format as described here can be used with the code. You can also find two large graphs on Certainty under directory /panfs/panfs-certainty-fe/aiken/cs315b/assignment3. They will be useful for the next assignment, so you might want to make sure that your code in this assignment can handle them.
  • The reference outputs from TA's solution code for inputs example1.dat, ..., example6.dat (examples/references/*). File exampleN.result.iterK corresponds to the solution at iteration K for input exampleN.dat.
  • The graph visualizer gen_graph.sh. Running this script on input abc.dat will give you a graph abc.pdf in the same directory. Note that visualizing a large graph can practically take forever.

Here is the complete list of options you can pass (the same list will appear with option -h):

Usage: regent.py pagerank.rg [OPTIONS]
OPTIONS
  -h            : Print the usage and exit.
  -i {file}     : Use {file} as input.
  -o {file}     : Save the ranks of pages to {file}.
  -d {value}    : Set the damping factor to {value}.
  -e {value}    : Set the error bound to {value}.
  -c {value}    : Set the maximum number of iterations to {value}.

Although not mandatory, supporting the maximum number of iterations will be useful to compare your results with TA's solutions.

What to submit

Send your code pagerank.rg to cs315b-aut1617-staff@lists.stanford.edu.

back to course webpage