Scientific Computing & Visualization

Timing Comparisons of Blue Gene, pSeries, and Intel Pentium 3 Cluster

 

To provide a reference point for the performance of the three SCV multiprocessors (the IBM Blue Gene, the IBM pSeries, and the Intel Pentium 3 Linux Cluster), a two-dimensional Laplace equation solver (Powerpoint slides) was used to collect wall clock times for various runs on these machines. The code uses Finite Difference for the numerical discretization of the PDE and MPI for the parallelization of the resulting algebraic system. Two levels of optimization were used on each machine: -O3 and -O5 on the IBM machines, and -O3 and -O3 plus -ipo on the Cluster with the Intel compiler. Runs were made on 1, 4, 9, and 16 processors.

Please note that the results and observations shown below should by no means be construed as a general trend, since different codes behave differently on different machines and with different compilers. Nevertheless, they do provide a reference point, and that is the sole purpose of this page.
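For readers unfamiliar with the kind of solver being timed, here is a minimal serial sketch of successive over-relaxation (SOR) applied to the finite-difference form of the two-dimensional Laplace equation. It is not the SCV code: the function name `sor_laplace`, the grid size, the relaxation factor, the tolerance, and the boundary condition (u = 1 on one edge, 0 on the others) are all assumptions chosen for illustration, and the MPI parallelization used in the actual runs is omitted.

```python
def sor_laplace(n, omega=1.5, tol=1e-6, max_iter=10_000):
    """Solve Laplace's equation on an n x n interior grid with SOR.

    Hypothetical setup for this sketch: u = 1 on the first boundary row
    and u = 0 on the other three edges. Returns (u, iterations_used).
    """
    # Grid of (n + 2) x (n + 2) points: interior plus one boundary layer.
    u = [[0.0] * (n + 2) for _ in range(n + 2)]
    for j in range(n + 2):
        u[0][j] = 1.0

    for iteration in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                # Five-point finite-difference stencil: average of neighbors.
                average = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
                # Over-relaxed update, done in place.
                new_value = (1.0 - omega) * u[i][j] + omega * average
                max_change = max(max_change, abs(new_value - u[i][j]))
                u[i][j] = new_value
        if max_change < tol:
            return u, iteration
    return u, max_iter


# Small example: a 9 x 9 interior grid.
u, iterations = sor_laplace(9)
print(f"stopped after {iterations} sweeps, center value = {u[5][5]:.4f}")
```

In a parallel version, each process would typically own a sub-block of the grid and exchange boundary values with its neighbors after each sweep; that part is deliberately left out of this sketch.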

In the following, two measures of parallel performance, Speedup and Efficiency, are defined as:

  • Speedup(P) = T(1)/T(P); where T(1) is the wallclock time for 1 processor while T(P) is the wallclock time for P processors
  • Efficiency(P) = T(1)/T(P)/P = Speedup(P)/P

The blue line represents the theoretical timings of a code that scales linearly. In that case, T(P) = T(1)/P and hence Speedup(P) = P.
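The two definitions above can be sketched in a few lines of Python. The timings in the dictionary are hypothetical values made up for illustration; they are not measurements from any of the three machines, and the function names are likewise just for this example.

```python
def speedup(t1, tp):
    """Speedup(P) = T(1) / T(P)."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Efficiency(P) = Speedup(P) / P."""
    return speedup(t1, tp) / p


# Hypothetical wall clock times in seconds, keyed by processor count.
# Illustrative only: NOT measured on the pSeries, Blue Gene, or Cluster.
timings = {1: 100.0, 4: 24.0, 9: 12.5, 16: 8.0}

t1 = timings[1]
for p in sorted(timings):
    tp = timings[p]
    ideal = t1 / p  # linear scaling: T(P) = T(1)/P
    print(f"P={p:2d}  T(P)={tp:6.1f}  ideal={ideal:6.2f}  "
          f"speedup={speedup(t1, tp):5.2f}  "
          f"efficiency={efficiency(t1, tp, p):4.2f}")
```

An efficiency above 1 (as for P=4 in the made-up data) corresponds to a point above the linear-scaling line on a speedup plot.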

Included below are the Speedup and Efficiency plots of the pSeries, the Blue Gene, and the Cluster, respectively. The fourth figure shows the wallclock times versus the number of processors for all three machines at the two levels of optimization.

Figure 1. Speedup and efficiency plots of the SOR program on the pSeries.

Figure 2. Speedup and efficiency plots of the SOR program on the Blue Gene.

Figure 3. Speedup and efficiency plots of the SOR program on the Cluster.

 

Figure 4. Timings of SOR code with the pSeries, Blue Gene, and Linux Cluster.

Based on the above figures, one observes the following for this code:

  1. For the pSeries, changing the optimization from -O3 to -O5 yields no appreciable improvement in performance for multiprocessor runs. However, a significant improvement is observed for the single-processor run.
  2. On the Blue Gene, by contrast, going from -O3 to -O5 yields a very impressive performance improvement for all processor counts. In addition, because the code's memory requirement is quite small (less than 50 Mbytes), both processors on each node can be used. This is referred to as the virtual node mode, and you can turn it on by adding "-mode VN" to the mpirun statement of your PBS script. For the referenced code, given a fixed number of processors, the timing is the same whether one or both of the processors of each node are used. This is excellent, as only half as many nodes are needed for the same performance (run time).
  3. For the Cluster, the use of -ipo (for interprocedural optimization) in addition to -O3 yields no appreciable performance gain. It should be noted that, according to Intel's manpage on ifort, there is an equivalent to -O5 (for IBM machines) called -fast. However, the executable generated with -fast failed to run, so the combination of -O3 plus -ipo was used instead.
  4. The superlinear behavior seen in many of the plots is the result of the reduction in memory usage per processor in multiprocessor runs relative to that required for a single-processor run. This in turn increases cache utilization efficiency, which leads to higher throughput.
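
The cache argument in the last observation can be made concrete with some rough arithmetic. The sketch below divides the code's total memory (taking the 50 Mbytes upper bound quoted above) evenly among P processes and compares each share with a cache size. The 4 Mbyte cache is a hypothetical figure chosen for illustration, not a specification of any SCV machine, and the even split ignores the extra boundary cells each process holds.

```python
def per_process_mbytes(total_mbytes, p):
    """Approximate data held by each of p processes (even split, no halo cells)."""
    return total_mbytes / p


TOTAL_MBYTES = 50.0  # upper bound on the code's memory use, from the text
CACHE_MBYTES = 4.0   # HYPOTHETICAL cache size, for illustration only

for p in (1, 4, 9, 16):
    share = per_process_mbytes(TOTAL_MBYTES, p)
    verdict = "fits in" if share <= CACHE_MBYTES else "exceeds"
    print(f"P={p:2d}: about {share:6.2f} Mbytes per process, "
          f"{verdict} a {CACHE_MBYTES:.0f} Mbyte cache")
```

Under these assumptions the per-process data only drops below the cache size at the largest processor count, which is the kind of transition that can push measured speedup above the linear-scaling line.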
Boston University
 
OIT | CCS | September 16, 2008  