In class we've discussed how to express both data and task parallelism in Regent. In this assignment, you will use that knowledge to parallelize the edge detection code from the previous assignment.
Before jumping into the actual assignment, let's discuss briefly the task parallelism that already exists in the previous code. All the steps in the edge detection algorithm are done sequentially, because the task for each step reads the output from the previous step and writes to a separate field that will be consumed in the next step. However, the second step, gradient calculation, consists of two tasks, sobelX and sobelY, which update different fields and thus can run in parallel. Regent detects this parallelism between the two tasks by examining their signatures:
task sobelX(r_image : region(ispace(int2d), Pixel),
            r_interior : region(ispace(int2d), Pixel))
where reads(r_image.smooth), writes(r_interior.gradient.x)
...

task sobelY(r_image : region(ispace(int2d), Pixel),
            r_interior : region(ispace(int2d), Pixel))
where reads(r_image.smooth), writes(r_interior.gradient.y)
...
If you used Legion Spy on your code from Assignment 1, you saw this graph, which shows there is no dependence between tasks sobelX and sobelY. (Each node in the graph corresponds to a task, and each edge indicates a data dependence between the tasks it connects.)
Unfortunately, this is the only parallelism in the previous code, which means that the code cannot utilize more than two processors (and really won't even use two processors efficiently). The goal of this assignment is to modify the code so that it creates more work to do in parallel: specifically, to create data parallelism to augment the task parallelism described above. As a starting point, the edgeFromGradient task has already been parallelized in the code given for this assignment (assignment2.tar.gz). The next sections will explain first what has been changed for you and then what you are expected to change in the code.
The given code launches the edgeFromGradient task for each subregion p_private[color] of partition p_private:

[Side-by-side code listing: Assignment 1 (single launch) | Assignment 2 (one launch per subregion)]
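The per-subregion launch on the Assignment 2 side can be sketched as follows. This is a hedged reconstruction, not necessarily the exact code in the handout; the remaining task arguments are elided with ... just as the handout does elsewhere:

```regent
-- Assignment 1 launched the task once on the whole interior region, e.g.:
--   edgeFromGradient(r_interior, ...)
-- Assignment 2 launches it once per subregion of p_private:
for color in p_private_colors do
  edgeFromGradient(p_private[color], ...)
end
```

Because the subregions of p_private are disjoint, the runtime can run these launches in parallel.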
Before looking at the definition of partition p_private, let's think about what kinds of partitions will enable the snippet on the right to have more parallelism than the one on the left. First, the partition should cover all the elements in region r_interior; otherwise, the code on the right will update a smaller region and will not produce the same answer as the original code. Second, every pair of sub-regions p_private[i] and p_private[j] should be disjoint (for i != j); otherwise, the code will not be able to run in parallel. Task edgeFromGradient not only reads the region that is passed in, but also updates that region. Therefore, any two tasks edgeFromGradient(p_private[i], ...) and edgeFromGradient(p_private[j], ...) will be serialized by the runtime whenever sub-regions p_private[i] and p_private[j] overlap.
The partition p_private is created with an equal partition statement. As you have already learned in class, such a statement creates a partition of region r_interior with (almost) equally sized sub-regions.
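A minimal sketch of such an equal-partition statement follows. The color-space name p_private_colors matches the coloring boilerplate shown later in this handout, but its exact shape here is an illustrative assumption:

```regent
-- Sketch: split the interior region into roughly equal 2-D tiles.
-- The {2, 2} tile grid is an assumption for illustration; the real code
-- sizes the color space from the -p option.
var p_private_colors = ispace(int2d, {2, 2})
var p_private = partition(equal, r_interior, p_private_colors)
```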
It's worth noting that this is a different way of thinking about parallelizing programs than you may be used to from existing programming systems. In the Regent programming model, the programmer consciously chooses a partitioning of the data and precisely describes what each task does on each piece of data. The discovery of parallelism is then done implicitly and automatically by the compiler and the runtime system. This frees the programmer from the burden of manual synchronization and communication between tasks.
The remaining tasks are smooth, sobelX, sobelY, and suppressNonmax. Unlike edgeFromGradient, these tasks take two regions as parameters: r_image and r_interior. These regions do not need to be the whole image and the interior image, respectively (although they were in the previous assignment); they can be any subsets of the image as long as they make the stencil accesses safe. For example, task smooth reads the neighbor elements e + {-2, -2}, e + {-2, -1}, ... from region r_image for every element e in region r_interior. Therefore, the points e + {-2, -2}, e + {-2, -1}, ... must be valid in region r_image. This diagram shows one example of regions r_image and r_interior that are valid for task smooth:
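To make the stencil access concrete, here is a hedged sketch of what such a task can look like. The field names (original, smooth) and the plain box average are illustrative assumptions, not the handout's exact kernel:

```regent
task smooth(r_image : region(ispace(int2d), Pixel),
            r_interior : region(ispace(int2d), Pixel))
where reads(r_image.original), writes(r_interior.smooth) do
  for e in r_interior do
    var sum : double = 0.0
    -- Reads e + {-2, -2} through e + {2, 2}: all 25 neighbor points
    -- must therefore be valid indices in r_image.
    for dx = -2, 3 do
      for dy = -2, 3 do
        sum += r_image[e + int2d { dx, dy }].original
      end
    end
    r_interior[e].smooth = sum / 25.0
  end
end
```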
The key to parallelizing tasks smooth, sobelX, sobelY, and suppressNonmax is to create partitions whose sub-regions are valid for each task's accesses. We have already created a partition (p_private) of the interior region r_interior such that tasks updating its sub-regions can run in parallel. Your first job in this assignment is to create another partition p_halo of region r_image; each sub-region of partition p_halo should contain all the neighbors accessible from the pixels in the corresponding sub-region of partition p_private. The code already has the boilerplate to create a coloring and add ranges to it:
-- Create a halo partition for ghost access
var c_halo = coloring.create()
for color in p_private_colors do
  var bounds = p_private[color].bounds
  var halo_bounds : rect2d = -- TODO: Calculate the correct bounds of the halo
  coloring.color_domain(c_halo, color, halo_bounds)
end

Compute the correct bounds for each subregion using the bounds of the corresponding subregion of partition p_private, and assign them to variable halo_bounds. A few lines below, you will see the statement that creates partition p_halo, which is commented out:
--
-- TODO: Create an aliased partition of region 'r_image' using coloring 'c_halo':
-- var p_halo = partition(...)
--
Write a partition expression that uses the coloring object you just populated.
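As a hedged illustration (a sketch, not necessarily the handout's exact intended solution): for the 5x5 stencil used by smooth, one way to fill in both TODOs is to grow each private subregion's bounds by two pixels in every direction and then build an aliased partition from the coloring. This assumes the widest stencil reaches 2 pixels in each direction and that r_image already includes a 2-pixel border, so the grown rectangles stay in bounds:

```regent
for color in p_private_colors do
  var bounds = p_private[color].bounds
  -- Grow the private tile by the stencil radius (2) on every side.
  var halo_bounds : rect2d = rect2d { bounds.lo - {2, 2}, bounds.hi + {2, 2} }
  coloring.color_domain(c_halo, color, halo_bounds)
end
-- Neighboring halo sub-regions overlap, so the partition must be aliased:
var p_halo = partition(aliased, r_image, c_halo)
```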
Once you have the necessary partitions, you can launch all of those tasks on each pair of sub-regions. Your second job in this assignment is to modify the following code so that it launches tasks smooth, sobelX, sobelY, and suppressNonmax on sub-regions of the two partitions.
--
-- TODO: Change the following task launches so they are launched for
-- each of the private regions and its halo region.
--
smooth(r_image, r_interior)
sobelX(r_image, r_interior)
sobelY(r_image, r_interior)
suppressNonmax(r_image, r_interior)
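One plausible shape for the modified launches, assuming you named the halo partition p_halo and reused the color space p_private_colors (a sketch, not the only correct answer):

```regent
for color in p_private_colors do
  smooth(p_halo[color], p_private[color])
end
for color in p_private_colors do
  sobelX(p_halo[color], p_private[color])
  sobelY(p_halo[color], p_private[color])
end
for color in p_private_colors do
  suppressNonmax(p_halo[color], p_private[color])
end
```

The runtime still discovers the sobelX/sobelY task parallelism and the data parallelism across colors automatically, from the region arguments and the tasks' declared privileges.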
If you do this correctly, you should be able to obtain this graph from Legion Spy. As you can see from the graph, there are no edges between tasks of the same type, which means those tasks run in parallel.
If you solved the bonus problem in the previous assignment, try to parallelize the complete Canny algorithm. You will notice that edge tracking is tricky to parallelize efficiently because it needs to maintain global knowledge of how pixels can be connected. Still, you can think of a naive implementation that consists of the following two tasks running alternately until convergence:
If you cannot come up with anything else, you are welcome to implement this. Otherwise, try to find a better algorithm that runs more efficiently.
Download the file assignment2.tar.gz. The only file you have to change is edge.rg; the parts that you need to modify are marked with TODOs. Here is the complete list of options you can pass (the same list will appear with option -h):
Usage: regent edge.rg [OPTIONS]
OPTIONS
-h : Print the usage and exit.
-i {file} : Use {file} as input.
-o {file} : Save the final edge to {file}. Will use 'edge.png' by default.
  -t {value} : Set the threshold to {value}.
  -p {value} : Set the number of parallel tasks to {value}.
Some options were removed for simplicity, and option -p has been added to take the number of parallel tasks.
You will need libpng for this code, as in the previous assignment. If you have not installed it yet, you can find instructions in the previous assignment.
Send the following files to :
- edge.rg (or an archive of the whole directory if you changed any other files)
- answers.txt, which contains your answers to the following questions:
Run the code on earth.png with varying numbers of processors and calculate the speedup at each processor count over a single processor. Also include the command you used for each run. Does the code scale linearly (i.e., did the code run N times faster on N processors)? If not, can you explain why? Note that poor scaling does not necessarily mean that there is something wrong with your code; it is also possible that the scaling drops off as the number of processors increases.