San Francisco to Hong Kong in 5 minutes
Introduction
This is a follow-on article to CheapFlights, and while the metaphor may be showing stress cracks, please bear with me. In the previous article we made the most of a single-threaded, and hence single-core, program by taking advantage of the embarrassingly parallel nature of moving the camera. The time taken to render any given frame, however, was completely unchanged from running povray directly on a corn. To break the 15-minute barrier (for this particular scene file) we need to employ an HPC-specific technology: MPI.
MPI stands for Message Passing Interface and is typically used as a library from a compiled language (C/C++/Fortran) or an interpreted (strictly speaking, byte-compiled) language such as Python. In this example we will explore a parallel raytracer called Tachyon, which offers MPI as one of its parallel options. MPI gives Tachyon access to the distributed compute and memory of all cores participating in the job. It is then up to the code in Tachyon to decide how to break up the task of rendering a frame so that every core in the job can execute its portion and communicate its results back. That architectural decision has already been made in Tachyon's code, so we are free to run jobs on the barley cluster using as few or as many cores as we like, as long as the number fits within those constraints. The number we choose becomes the size of the Tachyon job we submit to the scheduler.
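At this level an MPI job is simply the same program started many times, once per core, with the MPI library handling the communication between those processes. A purely illustrative sketch of launching such a job by hand, assuming OpenMPI's mpirun, an MPI-enabled tachyon binary on the PATH, and placeholder scene and output file names (the exact Tachyon options may differ between builds):

# Launch 4 Tachyon processes; OpenMPI wires up the communication between them.
# "scene.dat" and "frame.tga" are placeholder names for this illustration.
$ mpirun -np 4 tachyon scene.dat -o frame.tga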
Let's explore what kind of speedup Tachyon can achieve on the barley cluster. This cluster is connected by 10 gigabit Ethernet, which plays an important part, as intermediate results are communicated among all of the nodes participating in the job.
Executive Summary
We explore the possibilities presented by a parallel raytracer (Tachyon) using OpenMPI (an MPI library). Given a sample scene file, a single-core job takes 8.8 minutes to render a single frame. Rendering the same scene file as a 208-core MPI job on the barley cluster takes 3.4 seconds. That is a speedup of 156 times, the same as reducing the flight time from San Francisco to Hong Kong from 13 hours to 5 minutes (13 hours is 780 minutes, and 780 / 156 = 5).
Methodology
Assessing Communication Overhead
For embarrassingly parallel problems, we considered the cardinality of the problem in order to chop it up into pieces that can be run on a cluster of finite size. When MPI is used, one also needs to consider the constraints imposed by the communication between the nodes participating in the job.
Running on a single core
There is no particular requirement in the MPI library to use more than one core, so let us get a baseline time-to-render measurement by running Tachyon on a single core.
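The job is submitted through a small submit script. Here is a minimal sketch of what tachyon.submit might contain, assuming a Grid Engine style scheduler with an OpenMPI parallel environment; the parallel environment name (orte) and the scene/output file names are assumptions, not taken from the original job:

#!/bin/bash
# Hypothetical Grid Engine submit script for a single-core Tachyon run.
#$ -N tachyon        # job name
#$ -cwd              # run in the submission directory
#$ -j y              # merge stdout and stderr
#$ -pe orte 1        # request 1 slot from the (assumed) OpenMPI parallel environment

# NSLOTS is set by the scheduler to the number of slots granted above.
mpirun -np $NSLOTS tachyon scene.dat -o frame.tga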
$ qsub tachyon.submit
1 core:
CPU Information:
  Node 0: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Total CPUs: 1
  Total Speed: 1.000000
. . .
Preprocessing Time: 1.0531 seconds
Rendering Progress: 100% complete
Ray Tracing Time: 533.7124 seconds
Image I/O Time: 0.0754 seconds
Scaling the job up
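Scaling up is mostly a matter of asking the scheduler for more slots. A hedged sketch, assuming the same hypothetical submit script as above with only the slot request changed:

#$ -pe orte 208      # request 208 slots instead of 1
mpirun -np $NSLOTS tachyon scene.dat -o frame.tga

Resubmitting with qsub then reports all 208 CPUs and a dramatically shorter ray tracing time: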
CPU Information:
  Node 0: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 1: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 2: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 3: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 4: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 5: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 6: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 7: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 8: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 9: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  Node 10: 1 CPUs, CPU Speed 1.00, Node Speed 1.00 Name: barley01.stanford.edu
  . . .
  Total CPUs: 208
  Total Speed: 208.000000
Preprocessing Time: 1.5401 seconds
Rendering Progress: 100% complete
Ray Tracing Time: 3.4819 seconds
Image I/O Time: 0.0680 seconds