Five weeks for a final project sounds like a lot of time, but without a realistic schedule and some early milestones it is easy to accomplish little or nothing in the end. Because the final project is the most important part of this course, we are providing some tips for planning and executing your project. Each paragraph below outlines milestones for one week. If you follow the schedule suggested in this document, your project is very likely to be successful, and at a minimum you will get early indications of problems and have time to make adjustments. However, these tips are not requirements, and you should feel free to adapt them as necessary to your situation.
There is a reason the programming assignments start with a serial code that is then parallelized: it is much easier to write a functionally correct serial code first and then optimize it than it is to write a parallel code all at once. In the first week, we recommend you focus on completing a functionally correct serial code, meaning one that runs and gets the right answer on non-trivial inputs (though perhaps quite slowly). Note that this implies not only having a sequential implementation but also finding or writing test inputs for your code. If there is a reference implementation, writing a script that compares its output with your code's will be helpful for automatically checking that your code is correct.
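Such a comparison script can be very small. Below is a minimal sketch in Python; the program names, the one-value-per-line output format, and the helper names are all assumptions for illustration, so adapt them to whatever your reference and your code actually produce:

```python
import subprocess

# Hypothetical helper: run an executable on an input file and parse its
# stdout into a list of floats (assumes one numeric value per token).
def run_and_parse(cmd, input_path):
    out = subprocess.run(cmd + [input_path], capture_output=True,
                         text=True, check=True).stdout
    return [float(tok) for tok in out.split()]

def outputs_match(ref_vals, test_vals, tol=1e-6):
    # Element-wise comparison with a relative tolerance, since
    # floating-point results rarely match a reference bit-for-bit.
    if len(ref_vals) != len(test_vals):
        return False
    return all(abs(r - t) <= tol * max(1.0, abs(r))
               for r, t in zip(ref_vals, test_vals))
```

A driver loop over all your test inputs, calling `run_and_parse` on both the reference binary and your own and checking `outputs_match`, then gives you a one-command correctness check you can run after every change.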
Some projects might require significant programming effort for the kernels. Even in those cases, it will be helpful to leave out some of the kernels and complete the control part first in order to get a working code quickly.
You need to be a little careful about choosing the task granularity, since it will determine how much parallelism you can extract later. The rule of thumb is that pieces of code that exhibit different access patterns on the same data should be put in different tasks. Remember that Regent extracts parallelism by examining task privileges. If the privileges only loosely describe what tasks do, you can lose parallelism that could otherwise have been extracted.
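The rule of thumb can be illustrated in any language. The Python sketch below (not Regent; the function bodies and names are made up for illustration) shows two operations on the same data with different access patterns: a pointwise scaling that only touches its own elements, and a stencil that also reads neighbors. Kept as separate tasks, each can be described by tight privileges; fused into one task, the combined task would have to claim the union of both access patterns, hiding parallelism:

```python
def scale_task(a, lo, hi, factor):
    # Pointwise: privileges would be reads/writes on a[lo:hi] only.
    for i in range(lo, hi):
        a[i] *= factor

def stencil_task(src, dst, lo, hi):
    # Stencil: reads src[lo-1:hi+1], writes dst[lo:hi]. A different
    # access pattern on the same data, so it belongs in its own task.
    for i in range(lo, hi):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
```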
This week you can extract the data parallelism in your code. As in Assignment 4, you will need to explore several partitioning schemes and pick the best one. It is always a good idea to make those partitioning schemes configurable through a command-line interface, because there is often no single configuration that is best for every problem.
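The command-line plumbing for this is simple and pays for itself quickly. A sketch in Python using `argparse` (the flag names and the list of schemes are made up for illustration; your project will have its own):

```python
import argparse

# Expose the partitioning scheme and piece count on the command line,
# so different configurations can be compared without editing code.
def parse_partition_args(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--scheme", choices=["block", "cyclic", "block2d"],
                   default="block",
                   help="which partitioning scheme to use")
    p.add_argument("--pieces", type=int, default=4,
                   help="number of subregions to create")
    return p.parse_args(argv)
```

With this in place, a scaling study becomes a shell loop over `--scheme` and `--pieces` values rather than a sequence of rebuilds.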
Parallelization might involve some refactoring of the existing serial code if the task granularity was not chosen well in the first place. In particular, if you happen to make a single task out of two pieces of code that have a data dependence between them, there is a good chance that the task is not parallelizable. For example, if the starter code for Assignment 2 had fused the sobelX task together with a task it depends on, the combined task could not have been parallelized (think about why!).
At this stage, you will have a code that runs in parallel and should, in principle, run on any number of nodes, though the performance might not be very good yet. The first thing to do this week is profile your code and understand how well it runs. Comparing the performance of your code against a reference code is one good approach. Another is to calculate the floating-point operations per second (FLOP/s) that your program achieves and evaluate how far it is from the peak performance the machine can provide.
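The FLOP/s comparison is a back-of-the-envelope calculation. A sketch in Python (the flop counts and machine parameters below are illustrative, not measurements; plug in the numbers for your kernel and your machine):

```python
# Count the floating-point operations your kernel performs, divide by
# wall-clock time, and compare against machine peak.

def achieved_flops(flop_count, seconds):
    return flop_count / seconds

def peak_flops(cores, ghz, flops_per_cycle):
    # flops_per_cycle depends on the vector width and FMA support,
    # e.g. 16 double-precision flops/cycle for an AVX2 core with FMA.
    return cores * ghz * 1e9 * flops_per_cycle

def efficiency(flop_count, seconds, cores, ghz, flops_per_cycle):
    return achieved_flops(flop_count, seconds) / peak_flops(
        cores, ghz, flops_per_cycle)

# Example: an n x n matrix multiply performs about 2*n**3 flops.
n = 4096
mm_flops = 2 * n**3
```

If the resulting efficiency is a few percent of peak, the kernels (or the runtime usage) have substantial room for improvement; if it is already a large fraction of peak, further tuning effort is better spent elsewhere.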
If you find your code is less than optimal, you need to narrow down the cause. It could be either inefficient use of the runtime or simply slow computation kernels. Runtime inefficiencies typically show up as gaps in the Legion Prof trace. In that case, you can optimize your runtime usage by writing a custom mapper if the default mapper is sub-optimal, or by fusing tasks that are too small to be worth launching alone. You can also try the SPMD transformation if your code is written in the form that the transformation expects; the current limitations of the SPMD transformation can be found in this document. On the other hand, if your code has slow kernels, optimizing them is highly program specific and there is no one good answer. You can use the vectorizer or CUDA code generator in Regent, but if they are not sufficient for your purpose, you can also import existing kernels written in C, C++, or Fortran. This parallel Cholesky decomposition code is one such example, which uses the existing LAPACK library from Regent.
You will need a whole week (or even more) for large-scale experiments with your code. Every time you double the number of nodes you are using, you are very likely to find a new set of problems that didn't show up at smaller scales. In particular, any overhead that grows in proportion to the number of nodes or processors will become prominent at large scale. These problems are difficult and time consuming to fix for two reasons. First, the turnaround time for large jobs is significantly longer than for small ones; depending on the load on the machine, getting those jobs through may take up to a few days. Second, the problems appear only on a large number of nodes and might not be reproducible in smaller settings. You might be able to come up with tricks that mimic a similar workload to reproduce an issue on a small number of nodes, but you can't always count on those tricks. Therefore, you should plan your large-scale runs as early as you can, and avoid the last week of the quarter, when everyone is competing to get a large job done.
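Why per-node overhead only bites at scale can be seen from a toy cost model (the constants below are made up for illustration): total time is a work term that shrinks with node count plus an overhead term that grows with it.

```python
# Toy scaling model: work is divided across nodes, but some overhead
# (e.g. per-node communication or runtime bookkeeping) grows linearly
# with the number of nodes.

def model_time(nodes, work=1000.0, overhead_per_node=0.05):
    return work / nodes + overhead_per_node * nodes
```

At small node counts the work term dominates and scaling looks clean; past the crossover, adding nodes makes the run slower, which is exactly the kind of behavior that is invisible in small test runs.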