Five weeks for a final project sounds like a lot of time, but without a realistic schedule and some early milestones it is easy to accomplish little or nothing in the end. Because the final project is the most important part of this course, we are providing some tips for planning and executing your project. Each paragraph below outlines milestones for one week. If you follow the schedule suggested in this document, your project is very likely to be successful, and at a minimum you will get early indications of problems and have time to make adjustments. However, these tips are not requirements, and you should feel free to adapt them as necessary to your situation.
There is a reason the programming assignments start with a serial code that is then parallelized: it is much easier to write a functionally correct serial code first and then optimize it than it is to write a parallel code all at once. In the first week, we recommend you focus on completing a functionally correct serial code, meaning one that runs and gets the right answer on non-trivial inputs (though perhaps quite slowly). Note that this implies not only having a sequential implementation but also finding or writing test inputs for your code. If there is a reference implementation, writing a script that compares its output with your code's will be helpful for automatically checking that your code is correct.
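Such a comparison script can be very small. Below is a minimal sketch in Python; the program names, the one-value-per-line output format, and the helper names are all assumptions for illustration, so adapt them to whatever your reference and your code actually produce:

```python
import subprocess

# Hypothetical helper: run an executable on an input file and parse its
# stdout into a list of floats (assumes one numeric value per token).
def run_and_parse(cmd, input_path):
    out = subprocess.run(cmd + [input_path], capture_output=True,
                         text=True, check=True).stdout
    return [float(tok) for tok in out.split()]

def outputs_match(ref_vals, test_vals, tol=1e-6):
    # Element-wise comparison with a relative tolerance, since
    # floating-point results rarely match a reference bit-for-bit.
    if len(ref_vals) != len(test_vals):
        return False
    return all(abs(r - t) <= tol * max(1.0, abs(r))
               for r, t in zip(ref_vals, test_vals))
```

A driver loop over all your test inputs, calling `run_and_parse` on both the reference binary and your own and checking `outputs_match`, then gives you a one-command correctness check you can run after every change.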
Some projects might require significant programming effort for the kernels. Even in those cases, it will be helpful to leave out some of the kernels and complete the control part first in order to get a working code quickly.
You need to be a little careful about choosing the task granularity, since it will determine how much parallelism you can extract later. The rule of thumb is that pieces of code that exhibit different access patterns on the same data should be put in different tasks. Remember that Regent extracts parallelism by examining task privileges. If the privileges only loosely describe what tasks do, you can lose parallelism that could otherwise have been extracted.
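The rule of thumb can be illustrated in any language. The Python sketch below (not Regent; the function bodies and names are made up for illustration) shows two operations on the same data with different access patterns: a pointwise scaling that only touches its own elements, and a stencil that also reads neighbors. Kept as separate tasks, each can be described by tight privileges; fused into one task, the combined task would have to claim the union of both access patterns, hiding parallelism:

```python
def scale_task(a, lo, hi, factor):
    # Pointwise: privileges would be reads/writes on a[lo:hi] only.
    for i in range(lo, hi):
        a[i] *= factor

def stencil_task(src, dst, lo, hi):
    # Stencil: reads src[lo-1:hi+1], writes dst[lo:hi]. A different
    # access pattern on the same data, so it belongs in its own task.
    for i in range(lo, hi):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
```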
This week you can extract the data parallelism in your code. As in Assignment 4, you will need to explore several partitioning schemes and pick the best one. It is always a good idea to make those partitioning schemes configurable through a command-line interface, because there is often no single configuration that is best for every problem.
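The command-line plumbing for this is simple and pays for itself quickly. A sketch in Python using `argparse` (the flag names and the list of schemes are made up for illustration; your project will have its own):

```python
import argparse

# Expose the partitioning scheme and piece count on the command line,
# so different configurations can be compared without editing code.
def parse_partition_args(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--scheme", choices=["block", "cyclic", "block2d"],
                   default="block",
                   help="which partitioning scheme to use")
    p.add_argument("--pieces", type=int, default=4,
                   help="number of subregions to create")
    return p.parse_args(argv)
```

With this in place, a scaling study becomes a shell loop over `--scheme` and `--pieces` values rather than a sequence of rebuilds.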
Parallelization might involve some refactoring of the existing serial code if the task granularity was not chosen well in the first place. In particular, if you happen to make a single task out of two pieces of code that have a data dependence between them, there is a good chance that the task is not parallelizable. For example, if the starter code for Assignment 2 had fused the sobelX task together with a task it depends on, the combined task could not have been parallelized (think about why!).
At this stage, you will have a code that runs in parallel and should, in principle, run on any number of nodes, though the performance might not be very good yet. The first thing to do this week is profile your code and understand how well it runs. Comparing the performance of your code against a reference code is one good approach. Another is to calculate the floating-point operations per second (FLOP/s) that your program achieves and evaluate how far it is from the peak performance the machine can provide.
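The FLOP/s comparison is a back-of-the-envelope calculation. A sketch in Python (the flop counts and machine parameters below are illustrative, not measurements; plug in the numbers for your kernel and your machine):

```python
# Count the floating-point operations your kernel performs, divide by
# wall-clock time, and compare against machine peak.

def achieved_flops(flop_count, seconds):
    return flop_count / seconds

def peak_flops(cores, ghz, flops_per_cycle):
    # flops_per_cycle depends on the vector width and FMA support,
    # e.g. 16 double-precision flops/cycle for an AVX2 core with FMA.
    return cores * ghz * 1e9 * flops_per_cycle

def efficiency(flop_count, seconds, cores, ghz, flops_per_cycle):
    return achieved_flops(flop_count, seconds) / peak_flops(
        cores, ghz, flops_per_cycle)

# Example: an n x n matrix multiply performs about 2*n**3 flops.
n = 4096
mm_flops = 2 * n**3
```

If the resulting efficiency is a few percent of peak, the kernels (or the runtime usage) have substantial room for improvement; if it is already a large fraction of peak, further tuning effort is better spent elsewhere.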
If you find your code is less than optimal, you need to narrow down the cause. It could be either inefficient use of the runtime or simply slow computation kernels. Runtime inefficiencies typically show up as gaps in the Legion Prof trace. In that case, you can optimize your runtime usage by writing a custom mapper if the default mapper is sub-optimal, or by fusing tasks that are too small to be worth launching alone. You can also try the SPMD transformation if your code is written in the form that the transformation expects; the current limitations of the SPMD transformation can be found in this document. On the other hand, if your code has slow kernels, optimizing them is highly program specific and there is no one good answer. You can use the vectorizer or CUDA code generator in Regent, but if they are not sufficient for your purpose, you can also import existing kernels written in C, C++, or Fortran. This parallel Cholesky decomposition code is one such example, which uses the existing LAPACK library from Regent.
You will need a whole week (or even more) for large-scale experiments with your code. Every time you double the number of nodes you are using, you are very likely to find a new set of problems that didn't show up at smaller scales. In particular, any overhead that grows in proportion to the number of nodes or processors will become prominent at large scale. These problems are difficult and time consuming to fix for two reasons. First, the turnaround time for large jobs is significantly longer than for small ones; depending on the load on the machine, getting those jobs through may take up to a few days. Second, the problems appear only on a large number of nodes and might not be reproducible in smaller settings. You might be able to come up with tricks that mimic a similar workload to reproduce an issue on a small number of nodes, but you can't always count on those tricks. Therefore, you should plan your large-scale runs as early as you can, and avoid the last week of the quarter, when everyone is competing to get a large job done.
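Why per-node overhead only bites at scale can be seen from a toy cost model (the constants below are made up for illustration): total time is a work term that shrinks with node count plus an overhead term that grows with it.

```python
# Toy scaling model: work is divided across nodes, but some overhead
# (e.g. per-node communication or runtime bookkeeping) grows linearly
# with the number of nodes.

def model_time(nodes, work=1000.0, overhead_per_node=0.05):
    return work / nodes + overhead_per_node * nodes
```

At small node counts the work term dominates and scaling looks clean; past the crossover, adding nodes makes the run slower, which is exactly the kind of behavior that is invisible in small test runs.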