Description

In class we've discussed how to express both data and task parallelism in Regent. In this assignment, you will use that knowledge to parallelize the edge detection code from the previous assignment.

Before jumping into the actual assignment, let's discuss briefly the task parallelism that already exists in the previous code. All the steps in the edge detection algorithm are done sequentially because the task for each step reads the output from the previous step and writes to a separate field that will be consumed in the next step. However, the second step, gradient calculation, consists of two tasks, sobelX and sobelY, which update different fields and thus can run in parallel. Regent detects this parallelism between the two tasks by examining their signatures:
task sobelX(r_image    : region(ispace(int2d), Pixel),
            r_interior : region(ispace(int2d), Pixel))
where
  reads(r_image.smooth), writes(r_interior.gradient.x)
...
task sobelY(r_image    : region(ispace(int2d), Pixel),
            r_interior : region(ispace(int2d), Pixel))
where
  reads(r_image.smooth), writes(r_interior.gradient.y)
...

If you used Legion Spy on your code from Assignment 1, you saw this graph which shows there is no dependence between tasks sobelX and sobelY: (Each node in the graph corresponds to a task and each edge indicates data dependence between the tasks it connects)

Unfortunately, this is the only parallelism in the previous code, which means that the code cannot utilize more than two processors (and really won’t even use two processors efficiently). The goal of this assignment is to modify the code so that it creates more work to do in parallel: specifically, to create data parallelism to augment the task parallelism described above. As a starting point, the edgeFromGradient task has already been parallelized in the code given for this assignment (assignment2.tar.gz). The next sections will explain first what has been changed for you and then what you are expected to change in the code.

Parallelization using a disjoint partition

The first change you will notice in the starter code is that there is now a for loop that launches the edgeFromGradient task for each subregion p_private[color] of partition p_private:

edgeFromGradient(r_interior, config.threshold)

for color in p_private.colors do
  edgeFromGradient(p_private[color], config.threshold)
end
Assignment 1 Assignment 2

Before looking at the definition of partition p_private, let's think about what kinds of partitions will enable the snippet on the right to have more parallelism than the one on the left. First of all, the partition should cover all the elements in region r_interior; otherwise, the code on the right will update a smaller region and will not produce the same answer as the original code. Second, every pair of sub-regions p_private[i] and p_private[j] should be disjoint (for i != j), otherwise the code will not be able to run in parallel. Task edgeFromGradient not only reads the region that is passed in, but also updates that region. Therefore, any two tasks edgeFromGradient(p_private[i], ...) and edgeFromGradient(p_private[j], ...) will be serialized by the runtime whenever sub-regions p_private[i] and p_private[j] overlap.

The easiest way to create a partition that satisfies both conditions is an equal partitioning. Here is the definition of partition p_private:
var p_private = partition(equal, r_interior, p_private_colors)

As you have already learned in class, this statement will create an equal partition of region r_interior that has (almost) equally-sized sub-regions.

It's worth noting that this is a different way of thinking about parallelizing programs than you might be used to seeing in existing programming systems. In the Regent programming model, programmers should consciously choose a partitioning of data and precisely describe what tasks do on each piece of data. The discovery of parallelism is then implicitly and automatically done by the compiler and the runtime system. This releases the programmer from the burden of manual synchronization and communication between tasks.

Parallelizing stencil tasks

For this assignment you will parallelize the tasks smooth, sobelX, sobelY, and suppressNonmax. Unlike edgeFromGradient, these tasks access two regions as parameters: r_image and r_interior. These regions do not need to be the whole and interior images, respectively (although they were in the previous assignment). They can be any subsets of the image as long as they make the stencil access safe. For example, task smooth tries to read the neighbor elements e + {-2, -2}, e + {-2, -1}, ... from region r_image for every element e in region r_interior. Therefore, the points e + {-2, -2}, e + {-2, -1}, ... should be valid in region r_image. This diagram shows one example of regions r_image and r_interior that are valid for task smooth:

region r_image

region r_interior

So, the first step in parallelizing tasks smooth, sobelX, sobelY, and suppressNonmax is to create partitions whose sub-regions are valid for that task's accesses. We have already created a partition (p_private) of the interior region r_interior in a way that tasks updating its sub-regions can run in parallel. Your first job in this assignment is to create another partition p_halo of the region r_image; each sub-region of partition p_halo should contain all neighbors accessible from the pixels in each sub-region of partition p_private. The code already has the boilerplate to create a coloring and add ranges to it:
  -- Create a halo partition for ghost access
  var c_halo = coloring.create()
  for color in p_private_colors do
    var bounds = p_private[color].bounds
    var halo_bounds : rect2d = -- TODO: Calculate the correct bounds of the halo
    coloring.color_domain(c_halo, color, halo_bounds)
  end
Compute the correct bounds for each subregion using the bounds of each subregion of partition p_private and then assign it to variable halo_bounds. In the next line, you will see another line to create the partition p_halo that is commented out:
  --
  -- TODO: Create an aliased partition of region 'r_image' using coloring 'c_halo':
  -- var p_halo = partition(...)
  --

Write a partition expression that uses the coloring object you just populated.

Once you have the necessary partitions, you can then launch all those tasks on each pair of sub-regions. Your second job in this assignment is to modify the following code so that it launches tasks smooth, sobelX, sobelY, and suppressNonmax on sub-regions of two partitions.

  --
  -- TODO: Change the following task launches so they are launched for
  --       each of the private regions and its halo region.
  --
  smooth(r_image, r_interior)
  sobelX(r_image, r_interior)
  sobelY(r_image, r_interior)
  suppressNonmax(r_image, r_interior)

If you do this correctly, you should be able to obtain this graph from Legion Spy: As you can see from the graph, there are no edges between tasks of the same type, which means those tasks run in parallel.

Bonus Points

Parallelize the complete Canny algorithm

Try to parallelize the complete Canny algorithm if you solved the bonus problem in the previous assignment. You will notice that edge tracking is tricky to parallelize efficiently because it needs to maintain global knowledge of how pixels can be connected. Still, you can think of a naive implementation that consists of two following tasks running alternatively until convergence:

  • Task 1 finds in each subregions of the partitioned image all the connections among edge pixels in that region.
  • Task 2 checks if the pixels from two different sub-regions are connected on the boundary between the two.

If you cannot come up with anything else, you are welcome to implement this. Otherwise, try to find a better algorithm that runs more efficiently.

How to run the code

You download file assignment2.tar.gz. All you have to change is file edge.rg and the parts that you need to modify are marked with TODOs. Here is the complete list of options you can pass (the same list will appear with option -h):

Usage: regent edge.rg [OPTIONS]
OPTIONS
  -h            : Print the usage and exit.
  -i {file}     : Use {file} as input.
  -o {file}     : Save the final edge to {file}. Will use 'edge.png' by default.
  -t {value}    : Set {value} to the threshold.
  -p {value}    : Set {value} to the number of parallel tasks.

Some options were removed for simplicity and option -p has been added for taking the number of parallel tasks.

You will need libpng for this code as in the previous assignment. If you have not installed it yet, you can find some instructions from the previous assignment.

What to submit

Send the following files to :

  • File edge.rg (or an archive of the whole directory if you changed any other files)
  • File answers.txt that contains your answers on these questions:
    • What were the speedups that your code achieved on 1, 2, 4, and 8 CPUs? Run your code on the image earth.png with varying numbers of processors and calculate the speedup at each processor count over a single processor. Also include the command you used for each run. Does the code scale linearly? (i.e. did the code run N times faster on N processors?) If not, can you answer why? Note that poor scaling does not necessarily mean that there is something wrong with your code, and it is also possible that the scaling drops off as the number of processors increases.

back to course webpage