To provide a reference point on the performance of the three SCV multiprocessors (i.e., the IBM Blue Gene, the IBM pSeries, and the Intel Pentium 3 Linux Cluster),
a two-dimensional Laplace equation solver code (Powerpoint slides) using finite differences
(for numerical discretization of the PDE) and MPI (for parallelization of the resulting algebraic system) was used to collect
the wall clock times of various runs on these three machines.
Two optimization levels were also used: -O3 and -O5 on
the IBM machines, and -O3 and -O3 plus -ipo on the Cluster with the Intel compiler.
The runs used 1, 4, 9, and 16 processors.
Please note that the results and observations shown below should by no means
be construed as a general trend, since different codes behave differently
on different machines and with different compilers. Nevertheless, they do provide a reference point, and that is the sole purpose of
this page.
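For readers unfamiliar with the kind of solver being timed, the following is a minimal serial sketch of successive over-relaxation (SOR) applied to the five-point finite-difference form of the Laplace equation. It is not the benchmarked code: it omits MPI entirely, and the function name `sor_laplace`, the boundary values (1 on one edge, 0 on the others), the relaxation factor, and the tolerance are all assumptions chosen for illustration.

```python
def sor_laplace(n, omega=1.5, tol=1e-6, max_iter=10_000):
    """Solve Laplace's equation on an n x n interior grid with SOR.

    Assumed boundary values for this sketch: u = 1 along row 0,
    u = 0 on the other three edges.
    Returns the (n+2) x (n+2) grid and the number of sweeps performed.
    """
    # The grid carries one layer of boundary points around the interior.
    u = [[0.0] * (n + 2) for _ in range(n + 2)]
    for j in range(n + 2):
        u[0][j] = 1.0

    for sweep in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                # Five-point stencil: average of the four neighbours.
                avg = 0.25 * (u[i - 1][j] + u[i + 1][j]
                              + u[i][j - 1] + u[i][j + 1])
                # Over-relax: move past the Gauss-Seidel update by factor omega.
                change = omega * (avg - u[i][j])
                u[i][j] += change
                max_change = max(max_change, abs(change))
        if max_change < tol:
            return u, sweep
    return u, max_iter


if __name__ == "__main__":
    grid, sweeps = sor_laplace(9)
    print(f"converged after {sweeps} sweeps, centre value {grid[5][5]:.4f}")
```

In a parallel version, each process would typically own a sub-block of the grid and exchange boundary rows and columns with its neighbours after each sweep; how the benchmarked code does this is not described on this page.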
In the following, two measures of parallel performance, Speedup and
Efficiency, are defined as:
- Speedup(P) = T(1)/T(P), where T(1) is the wallclock time on 1 processor and T(P) is the wallclock time on P processors
- Efficiency(P) = (T(1)/T(P))/P = Speedup(P)/P
The blue line represents the theoretical timings of a code that scales linearly. In that case, T(P) = T(1)/P and hence Speedup(P) = P.
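The two definitions above translate directly into code. The sketch below computes both measures from a table of wallclock times; the timing values are hypothetical placeholders, not measurements from any of the three machines.

```python
def speedup(t1, tp):
    """Speedup(P) = T(1) / T(P)."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Efficiency(P) = Speedup(P) / P."""
    return speedup(t1, tp) / p


if __name__ == "__main__":
    # Hypothetical wallclock times in seconds, keyed by processor count.
    timings = {1: 100.0, 4: 26.0, 9: 12.5, 16: 8.0}
    t1 = timings[1]
    for p in sorted(timings):
        tp = timings[p]
        print(f"P={p:2d}  speedup={speedup(t1, tp):6.2f}  "
              f"efficiency={efficiency(t1, tp, p):5.2f}")
```

An efficiency of 1 corresponds to linear scaling, and a value above 1 indicates superlinear speedup.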
Included below are the Speedup and Efficiency plots for the pSeries, the Blue
Gene, and the Cluster, respectively. The fourth figure shows the
wallclock times versus the number of processors for all three machines at
two different optimization levels.
Figure 1. Speedup and efficiency plots of the SOR program on
the pSeries.
Figure 2. Speedup and efficiency plots of the SOR program on
the Blue Gene.
Figure 3. Speedup and efficiency plots of the SOR program on
the Cluster.
Figure 4. Timings of SOR code with the pSeries, Blue Gene, and Linux Cluster.
Based on the above figures, one observes the following for this code:
- For the pSeries machines, there is no appreciable improvement in
performance by changing the optimization from -O3 to -O5 for
multiprocessor runs. However, significant improvement is observed
for the single processor run between -O3 and -O5.
- On the other hand, a very impressive performance improvement is achieved,
for all processor counts, on the Blue Gene by going from -O3 to -O5.
Because the code's memory requirement is quite small (less than 50
Mbytes), both processors on each node can be used.
This is referred to as the virtual node mode, and you can turn it on
by adding "-mode VN" to the mpirun statement of your PBS script. For
the referenced code, given a fixed number of processors, the timing is
the same whether one or both of the processors of each node are used.
This is excellent, as only half as many nodes are needed for the same
performance (run time).
- For the Cluster, the use of -ipo (for interprocedural optimization)
in addition to -O3 yields NO appreciable performance gain. It should be
noted that, according to Intel's manpage on ifort, there is an equivalent
to -O5 (for IBM machines) called -fast. However, the executable generated
with -fast failed to run, so the combination of -O3 plus -ipo was used
instead.
- The superlinear behavior seen in many of the plots is the result of
the reduced memory usage per processor in multiprocessor runs relative
to that required for a single processor run. This in turn improves
cache utilization, which leads to higher throughput.
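As a rough illustration of the last observation, the sketch below estimates the memory each process holds when the code's data is divided among P processes. It takes the 50 Mbytes upper bound quoted above as the total and assumes the data divides evenly, ignoring halo cells and any replicated arrays; the function name `per_process_mbytes` is hypothetical.

```python
def per_process_mbytes(total_mbytes, p):
    """Memory per process, assuming the data splits evenly over p processes.

    Ignores halo (ghost) cells and replicated data, so this is a lower bound.
    """
    return total_mbytes / p


if __name__ == "__main__":
    TOTAL_MBYTES = 50.0  # upper bound on the code's memory use, from the text
    for p in (1, 4, 9, 16):
        print(f"P={p:2d}  about {per_process_mbytes(TOTAL_MBYTES, p):6.2f} "
              f"Mbytes per process")
```

Whether the per-process share actually fits in cache depends on the cache sizes of each machine, which are not given here.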