*This site is used in conjunction with Canvas. Assignments will only be posted on Canvas.*

This offering of CS315B will be a course in advanced topics and new paradigms in programming supercomputers, with a focus on modern tasking runtimes. As supercomputers have grown much larger and more complex, tasking has emerged as one of the leading alternatives to current bulk synchronous programming models, with the promise of both higher performance and more productive software development.

The course will consist of lectures on supercomputer architectures, a survey of current standard and alternative programming models, and in-depth lectures on proposals for tasking. Students will be given the opportunity to program in cuNumeric, a drop-in replacement for NumPy that targets clusters, and Regent, a high-level tasking programming language. Both systems target the Legion distributed runtime. There will be lectures on and programming exercises in both cuNumeric and Regent, and there will also be a course project in which students will write a significant supercomputer application of their own choosing.

The course is open to both computer scientists and computational scientists who are interested in learning about new approaches to programming modern supercomputers. Because it is desirable to have such a mix of students, the course will not assume much background, though good programming skills will be needed to get the most out of the course.

- Time: Tuesdays and Thursdays, 9:00am - 10:20am
- Readings: There is no textbook, but there will be readings assigned from lecture notes and research papers.
- Assignments: Six short programming assignments and a final project.
- Exams: There will be no exams.

Date | Topic | Readings | Lecture Notes |
---|---|---|---|

9/27 (Tue) | Course Introduction | Lecture 1 | |

9/29 (Thu) | Bulk Synchronous and SPMD Programming | [1] | Lecture 2 |

10/4 (Tue) | cuNumeric | gameoflife | Lecture 3 |

10/6 (Thu) | cuNumeric II | [2] | Lecture 4 |

10/11 (Tue) | Introduction to Tasking and Regent Programming | [3], [Regent examples] | Lecture 5 |

10/13 (Thu) | Regions | [Regent examples] | Lecture 6 |

10/18 (Tue) | More on Regions | [4], [Regent examples] | Lecture 7 |

10/20 (Thu) | Circuit Example & Mapping | Lecture 8 | |

10/25 (Tue) | Metaprogramming | [5], [Regent examples] | Lecture 9 |

10/27 (Thu) | I/O and Control Replication | [6], [Regent examples] | Lecture 10 |

11/1 (Tue) | Charm++ | [7], [8] | Lecture 11 |

11/3 (Thu) | Project Proposal Meetings | ||

11/8 (Tue) | Democracy Day |
||

11/10 (Thu) | Chapel and X10 | [9], [10] | Lecture 12 |

11/15 (Tue) | StarPU | [10], [11] | Lecture 13 |

11/17 (Thu) | System Comparisons | [12] | Lecture 14 |

11/22 (Tue) | Thanksgiving break |
||

11/24 (Thu) | Thanksgiving break |
||

11/29 (Tue) | Project Meetings | ||

12/1 (Thu) | Wrap-Up | Lecture 15 | |

12/6 (Tue) | Project Presentations | ||

12/8 (Thu) | Project Presentations |

Students are given access to a cluster for programming assignments where cuNumeric and Regent are pre-installed. Students can also run cuNumeric and Regent programs locally on their own machines, but you will be responsible for doing the install and build yourself.

There are six small programming assignments during the quarter to give you an overview of cuNumeric and Regent:

Title | Assigned | Due |
---|---|---|

Assignment 1: Edge Detection in cuNumeric | 10/4 (Tue) | 10/11 |

Assignment 2: Running a Regent program | 10/11 (Tue) | 10/18 |

Assignment 3: Edge Detection | 10/13 (Thu) | 10/20 |

Assignment 4: Parallel Edge Detection | 10/20 (Thu) | 10/27 |

Assignment 5: PageRank | 10/27 (Thu) | 11/3 |

Assignment 6: Parallel PageRank | 11/3 (Thu) | 11/10 |

For their final projects, students are expected to write a significant
cuNumeric or Regent program that runs on clusters with accelerators. **Students are welcome to come up with their own ideas,
especially the ones related to their own research.** For
students who do not have a specific topic in mind, here is a list of
some project ideas:

The last two class meetings will be project presentations. Students should be prepared to give a 15 minute talk (the exact length will be determined closer to the time) and answer questions. The final project presentation and source code is due on the last day of classes, 12/9.

Create a code for the solution of the compressible Euler / Navier-Stokes equations, which are statements of mass, momentum, and energy conservation in a fluid, on uniform structured grids (2D or 3D). Use a second-order discretization in space (centered or upwind schemes for the convective term, e.g., JST or Roe, and a centered scheme for the viscous fluxes) and an explicit time integration scheme (e.g., classical fourth-order Runge-Kutta). For an example of this type of fluid solver (and some test cases), see the Soleil-X code.

Port a code for solving the incompressible Navier-Stokes equations (Boussinesq approximation) for computing the flow within a thermally driven 2D cavity. The code uses structured, collocated grids with an implicit discretization (SIMPLE fractional step). Time integration is accomplished through a first- or second-order backward difference formula (BDF). An incomplete LU decomposition is used for solving the linear systems, including a pressure Poisson solve. Examples and details can be found in Computational Methods for Fluid Dynamics, Ferziger & Peric.

Create a linear solver for symmetric, positive definite matrices using the Conjugate Gradient (CG) method. For example, the poisson equation with a known source term on a uniform 2D mesh can be solved. Other Krylov methods can be considered instead, such as GMRES or Bi-CGSTAB, which handle more general matrix types. Similarly, unstructured meshes could also be considered here, leading to sparse matrices that can be stored in Compressed Sparse Row (CSR) format.

- An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- OpenMP implementation that solves the Poisson equation with CG using a CSR matrix representation

Solve the Poisson equation on a regular 2D grid with a multigrid method. Use a second-order finite difference discretization with Dirichlet boundary conditions. The discretization results in a linear system with a banded matrix structure, which can be solved with classical iterative methods alone or a geometric multigrid method that uses the classical iterative methods as smoothers on the various mesh levels.

Solve the radiative transfer equation on a structured 2D grid using the discrete ordinates method. Implement the source iteration solver with full-domain sweeps. Angular discretization is accomplished using level symmetric quadrature sets or quadrature sets constructed on the fly. Spatial discretization will use the finite volume method with a first- or second-order accurate method (step or diamond differencing).

Implement a fast multipole algorithm. Fast multipole algorithm is a scalable algorithm for N-body simulation whose time complexity is O(N) and has a wide range of applications including electrostatics, fluid simulations, and astrophysics.

- A Fast Algorithm for Particle Simulations
- A short primer on the fast multipole method
- ExaFMM, an open source FMM implementation
- Legion implementation also available upon request.

Implement parallel fast Fourier transform. An efficient implementation of parallel FFT has many applications such as dark matter, plasma, and incompressible fluid simulations (to name just a few!). Oddly, the widely used implementations parallelize 3D FFTs in only one dimension, resulting in limited scalability. A 3D FFT implementation that parallelizes in two dimensions should be clean to express and an interesting computation to map to heterogeneous hardware.

Adaptive mesh refinement is a technique to adjust the resolution depending on the sensitivity or criticality of parts within a simulation. This technique often yields more precise solutions than those that assign resolution uniformly across the mesh. However, the technique also poses challenges in finding the right regions in a mesh to refine and balancing computations between nodes. Writing scalable AMR code should be an interesting project as tasking features, such as asynchronous task scheduling and dynamic task mapping, can help solve some of these challenges.

Implement parallel wavelet compression. Wavelet transforms are one of the most popular time-frequency-transformations and are widely used for data compression, especially image compression; notable applications include JPEG 2000 and DjVu. For good resolution on high-frequency terms, the wavelet compression shows a better compression performance for images that have transient signals.

Implement a two-threshold peak finder algorithm for locating "peaks" in images. This algorithm is currently used for processing data produced by the Linac Coherent Light Source (LCLS) at SLAC. Peak finding is a data processing step that locates regions of charge in the image produced by a detector. The most common peak finding algorithm consists of two steps: First, it finds the initial peaks, the pixels whose intensity is above the higher threshold. Then, it searches for neighboring pixels within a given radius whose intensity is above the lower threshold.

- Peak Finding page at SLAC wiki
- Hit and Peak Finding Algorithms page at SLAC wiki

Today’s X-ray free-electron lasers can generate pulses at 120 Hz, so the pixel-array imaging detectors need to also work at that speed. Furthermore since all the photons are detected in 40 fs, we cannot use the more accurate method of counting each photon on each pixel individually, rather we have to compromise and use the “integrating” approach: each pixel has independent circuitry to count electrons, and the sensor material (silicon) develops a negative charge that is proportional to the number of X-ray photons striking the pixel. It is thus absolutely critical for us to know the gain of the detector, that is, the conversion factor between electronic signal in “counts” and number of photons. (signal counts = gain x photon count, in our definition). Unfortunately, knowledge of the gain has been very difficult to acquire. It differs from pixel to pixel within the pixel array, and is a function of temperature and X-ray energy.

To calibrate the gain field we use a “flood field” source: somehow we rig it up so that several photons will hit each pixel on each image. We don’t know the number of incident photons, and it is not even uniform over the image, but the flux is slowly varying in space, so we can assume that the true flux at pixel A is very similar to the true flux on each of the surrounding pixels. We are assuming Poisson statistics here — the variance in the photon count is equal to the number of photons. The signals we measure are probably in the range of 20 photons per pixel per image. As stated above the critical complication is that the gain varies from pixel to pixel, so for example, the true gain is probably in the range of 8 counts/photon to 30 counts/photon. So we need to determine a gain for each pixel, not a constant value over the whole detector. Also, critically, for each imaging event the detector electronics imparts a “common mode” offset, or DC-offset, such that a random integer number of counts is added to each pixel value. The common mode offset is a random value for each shot (for example it might have a Gaussian distribution with a mean of zero and a standard deviation of 10) but it is constant across all pixels within the shot. But we need to determine it’s unknown value for each shot. So the unknowns to determine are the gain for each pixel, and the common mode correction for each shot. How do we determine these unknowns from the image data?

The basic sensor is a 370 x 190 pixel array. The entire CSPAD detector consists of 32 of these sensor panels. Each has a separate common mode correction for each image, so they should be treated as separate detectors. Assume that we collect a lot of images, say 1000. In the analysis of the data, we should only accept signals that vary smoothly from pixel to pixel where we can assume that the true photon field is locally constant. In other words, Bragg spots and other edge-type features probably have to be rejected on each image.

Port one of the mini-apps developed in the national labs. These mini-apps represent an interesting subset of requirements from the real scientific workloads, so it is worthwhile exploring how well the tasking programming model fits those requirements. Here are some mini-apps that are interesting to port:

- LULESH, an unstructured Lagrangian explicit shock hydrodynamics proxy application.
- CoMD, a classical molecular dynamics proxy application.
- CloverLeaf, a Lagrangian-Eulerian hydrodynamics benchmark.

Write a ray tracer. Choose on your own how much physical phenomena to capture in the ray tracer, but try to make it better than the accompanied reference in terms of image quality.

- Paper about a parallel ray tracer implementation with MPI/OpenMP
- MPI/OpenMP implementation

- Office Hours: TBD
- E-mail: