SFOtoHKGin5min

From FarmShare

Revision as of 21:37, 8 January 2012 by Bishopj (Talk | contribs)
Jump to: navigation, search

Contents

San Francisco to Hong Kong in 5 minutes

Introduction

This is a followon article to CheapFlights, and while the metaphor may be showing stress cracks, please bear with me. In the previous article we made the most of a single threaded, and hence single core program, by taking advantage of the embarrassingly parallel nature of moving the camera. The time taken to render any given frame, however, was completely unchanged from running povray directly on a corn. To break the 15 minutes barrier (for this particular scene file) we need to employee an HPC specific technology - MPI.

MPI stands for Message Passing Interface and is typically used as a library to a compiled language (C/C++/Fortran) or an interpreted (byte compiled actually) language such as Python. In this example we will explore a parallel raytracer called Tachyon which uses MPI as one of the available parallel options. MPI allows Tachyon access to the distributed compute and memory of all cores participating in the job. It is up to the code in Tachyon, then, to decide how to break up the task of rendering a frame such that all of the cores in the job can execute their portion and communicate their results back. This architectural decision has been made in the code of Tachyon, and we are free to run jobs on the barley cluster using as few or many cores as long as the number fits within those constraints. This number we choose becomes the size of our Tachyon job we submit to the scheduler.

Let's explore what kind of speedup Tachyon can achieve on the barley cluster. This cluster is 10 gigabit ethernet connected, which plays an important part, as all of the intermediate processing steps are communicated within all nodes participating in the job.


Executive Summary

We explore the possibilites presented by a parallel raytracer (Tachyon) using OpenMPI (an MPI library). Given a sample scene file, a singe core job takes 8.8 minutes to render a single frame. Rendering the same scene file as a 208 core MPI job on the barley cluster takes 3.4 seconds. This is a speedup of 156 times - the same as reducing the flight time from San Francisco to Hong Kong from 13 hours to 5 minutes.

Methodology

Assessing Communication Overhead (was Assessing Cardinality)

For embarrassingly parallel problems, we considered the cardinality of the problem in order to chop the problem up into pieces which can be run on a cluster of finite size. When MPI is used, one needs to consider the constraints of the communication between the nodes participating in the job.

Running on a Single core

There is no particular requirement in the MPI library to use more than one core. So let us get a baseline time measurement by running Tachyon on a single core. Looking at the qsub script below there is now an additional parameter to qsub (-pe orte 1) and the use of mpirun to launch tachyon.

#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -N tachyon1
#$ -pe orte 1

tmphosts=`mktemp`
awk '{ for (i=0; i < $2; ++i) { print $1} }' $PE_HOSTFILE > $tmphosts

echo "Got $NSLOTS slots"
echo "jobid $JOB_ID"

mpirun -np $NSLOTS -machinefile $tmphosts /mnt/glusterfs/bishopj/tachyon/tachyon/compile/linux-mpi/tachyon +V -aasamples 4 \
 -trans_vmd /mnt/glusterfs/bishopj/tachyon/stmvao-white.dat -o /mnt/glusterfs/$USER/stmvao-white_short.$JOB_ID.bmp \
-format BMP -rescale_lights 0.4 -add_skylight 0.9 -skylight_samples 32 -res 1900 1080


$ qsub tachyon.submit

1 core: CPU Information:


 Node    0:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
Total CPUs: 1
Total Speed: 1.000000


Preprocessing Time: 1.0531 seconds Rendering Progress: 100% complete


 Ray Tracing Time:   533.7124 seconds
Image I/O Time:     0.0754 seconds


Scaling the job up

effieciency per core added. ratio.

CPU Information:
  Node    0:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    1:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    2:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    3:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    4:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    5:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    6:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    7:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Total CPUs: 8
  Total Speed: 8.000000



Preprocessing Time:     1.2570 seconds
Rendering Progress:       100% complete
  Ray Tracing Time:    76.1758 seconds
    Image I/O Time:     0.0695 seconds

If we take the 76 seconds result and calculate how much speedup we observed per core we get 87% (533/76.17)/8 = .87. All of the cores were on the system system, so 87% would indicate the protocol efficiency of MPI as used in Tachyon between cores on the same system. It is not 100%, but 87% is not a bad number.

Let's examine what happens when we cross a physical system boundary and need to communicate over ethernet network:


CPU Information:
  Node    0:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    1:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    2:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    3:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    4:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    5:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    6:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    7:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    8:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    9:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node   10:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node   11:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node   12:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node   13:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node   14:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node   15:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node   16:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   17:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   18:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   19:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   20:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   21:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   22:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   23:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   24:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   25:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   26:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   27:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   28:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   29:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   30:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Node   31:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley03.stanford.edu
  Total CPUs: 32
  Total Speed: 32.000000



Allocating Image Buffer.
Preprocessing Time:     1.3596 seconds
Rendering Progress:       100% complete
  Ray Tracing Time:    21.1898 seconds
    Image I/O Time:     0.0761 seconds

As you can see in the Node list, barley01 and barley03 are being used. In this case, our efficiency per core is: (533/21.19)/32 = .78 or 78%. Approximately a 10% hit by going over 10gigabit ethernet. However, from a speedup per added core standpoint we are still definitely in the money. Every added core results in a significant speedup.

Let us scale up further, to 208 cores:

#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -N tachyon208
#$ -pe orte 208

tmphosts=`mktemp`
awk '{ for (i=0; i < $2; ++i) { print $1} }' $PE_HOSTFILE > $tmphosts

echo "Got $NSLOTS slots"
echo "jobid $JOB_ID"

mpirun -np $NSLOTS -machinefile $tmphosts /mnt/glusterfs/bishopj/tachyon/tachyon/compile/linux-mpi/tachyon +V -aasamples 4 -trans_vmd /mnt/glusterfs/bishopj/tachyon/stmvao-white.dat -o /mnt/glusterfs/$USER/stmvao-white_short.$JOB_ID.bmp -format BMP -rescale_lights 0.4 -add_skylight 0.9 -skylight_samples 32 -res 1900 1080



CPU Information:
  Node    0:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    1:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    2:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    3:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    4:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    5:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    6:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    7:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    8:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node    9:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
  Node   10:  1 CPUs, CPU Speed 1.00, Node Speed   1.00 Name: barley01.stanford.edu
.
.
.
  Total CPUs: 208
  Total Speed: 208.000000

Preprocessing Time:     1.5401 seconds
Rendering Progress:       100% complete            
  Ray Tracing Time:     3.4819 seconds
    Image I/O Time:     0.0680 seconds

Calculating the efficiency, we get (533/3.4)/208 = .75 or 75% (speedup per added core). Observe that at 208 cores, and 3.4 seconds, the standup and tear down time of the processes themselves is starting to play a large role. I would say this scalability experiment is very successful, having reduced 8.8 minutes to 3.4 seconds, a factor of 156 times.

Personal tools
Toolbox
LANGUAGES