Using the IBM PESSL Library
IBM provides a Parallel Engineering Scientific Subroutine Library (Parallel ESSL) for many
mathematical and matrix operations. Parallel ESSL includes
many linear algebra routines (some of which are from the
BLAS package),
Lapack-like routines, FFT routines and eigen routines (not unlike EISPACK).
Note that in using the ESSL SMP library for
parallel applications on the SP, essentially no work, like data decomposition,
is required of the programmer. On the contrary, the primary focus of the PESSL
is for distributed-memory applications and hence user codes must employ a message-passing
paradigm such as MPI, PVM, MPL. Data decomposition must be performed by the
programmer. In addition, calls to
blacs library routines (see example below) must also be made prior to calling the PESSL routines. Like the ESSL package,
PESSL also has an SMP version. The blacs calls, as well as the PESSL routine calls, are the same whether you
use the distributed-memory PESSL (in IBM's jargon, a signal handling library) or the SMP PESSL.
Below, the example of solving for Ax = b is demonstrated. The same
example is also demonstrated in a companion page using the ESSL SMP library (see
Using the IBM ESSL). Comparing the two example
programs, base on code complexity and scalability, might quickly lead to the conclusion that on an SMP machine ESSL SMP is the more
convenient way. However, note that there are other considerations that may
affect the library selection. For instance,
- Whether the desired functionality exists in either library.
- Portability.
- Parallelization of the remainder of the user code.
How To Run Application Program With PESSL Subroutine Calls On An SMP Node
(an SMP nighthawk-2 node at BU comprises 16 processors)
- Insert the necessary blacs and MPI calls as well as the desired PESSL
calls in your code.
- Compile using appropriate thread-safe compiler script. For example,
hal01% mpxlf90_r -O example -O5 -qsuffix=f=f90 example.f90 -lpesslsmp -lblacssmp
- Run job interactively using 4 processors
hal01% poe example -procs 4
- Run job in batch using 4 processors
hal01% bsub -qQUEUE poe example -procs 4
where QUEUE is one of SP's 2 multi-processor queues (sp-mp8, sp-mp16)
hal01% bqueues for more details.
Example - Solution of Ax = b
The solution procedure consists of two subroutines: DGETRF to factor the matrix
A and DGETRS to solve for x using the factored matrix from DGETRF.
Here is a fortran 90 implementation:
program PESSL_example
include "mpif.h"
integer nprow, npcol, nblk, myrow, mycol, i, j, mi, mj, nprocs
integer icontxt, Iam, ierr, k, info
integer, parameter :: ndim=6000, root=0
integer :: world_group, row_group, row_comm, rowcomm
integer, dimension(:), allocatable :: ipvt, member, desc_a, desc_b
real*8, dimension(:,:), allocatable :: a
real*8, dimension(:), allocatable :: x, b, bp, x0
call MPI_Init(ierr)
call blacs_pinfo(Iam, nprocs)
nprow = sqrt(real(nprocs)) ! assume nprocs = nprow**2
npcol = nprocs/nprow
nblk = ndim/nprow
call blacs_get(0, 0, icontxt)
call blacs_gridinit(icontxt, 'r', nprow, npcol)
call blacs_gridinfo(icontxt, nprow, npcol, myrow, mycol)
call MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, rowcomm, ierr)
! allocate arrays ...
allocate(a(nblk,nblk), x(nblk), b(nblk), bp(nblk), ipvt(nblk))
allocate(member(0:nprow-1), desc_a(9), desc_b(9))
if(Iam .eq. 0) allocate(x0(ndim)) ! on proc 0 only
member = (/ (i*npcol, i=0,nprow-1) /) ! all procs in 1st column
call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)
call MPI_Group_incl(world_group, nprow, member, row_group, ierr)
call MPI_Comm_create(MPI_COMM_WORLD, row_group, row_comm, ierr)
x = (/ (real(j+mycol*nblk), j=1,nblk) /) ! define x
! Initialize local A and compute b = Ax
mi = myrow*nblk; mj = mycol*nblk
forall (i=1:nblk, j=1:nblk) a(i,j) = 1.0+(1.0/(real(i+mi+j+mj)))
! bp = matmul(a,x) ! local b = Ax
do i=1,nblk
bp(i) = 0.0
do j=1,nblk
bp(i) = bp(i) + a(i,j)*x(j)
enddo
enddo
! Sum all local b (rhs) to form the final b
call MPI_Allreduce(bp, b, nblk, MPI_DOUBLE_PRECISION, &
& MPI_SUM, rowcomm, ierr)
! define parameters for PESSL calls
desc_a = (/ 1, icontxt, ndim, ndim, nblk, nblk, 0, 0, &
max(1,numroc(ndim,nblk,myrow,1,nprow)) /)
desc_b = (/ 1, icontxt, ndim, 1, nblk, 1, 0, 0, &
max(1,numroc(ndim,nblk,myrow,1,nprow)) /)
! LU factor and back substitution routine sof PESSL
call pdgetrf(ndim,ndim,a,1,1,desc_a,ipvt,info) ! factor A
call pdgetrs('N',ndim,1,a,1,1,desc_a,ipvt, & ! solve for x
& b,1,1,desc_b,info)
call MPI_Group_rank(row_group, k, ierr) ! see if k is in row_group
if(k .ne. MPI_UNDEFINED) & ! undefine if not
call MPI_Gather( b, nblk, MPI_DOUBLE_PRECISION, &
x0, nblk, MPI_DOUBLE_PRECISION, &
root, row_comm, ierr)
if(Iam .eq. 0) then
write(*,"('final solution'//(5f10.1))")(x0(i),i=1,ndim)
endif
deallocate(a, x, b, bp, desc_a, desc_b, ipvt, member)
call MPI_Finalize(ierr)
end PROGRAM PESSL_example
Some preliminary timing numbers for the above program are listed below:
PESSL PESSL SMP
Procs Wallclock Wallclock
1 145 76
4 142 82
9 96 62
16 75 73
Example - Eigenvalues and Eigenvectors of a Hermitian matrix
The subroutine for solving
eigenvalues and eigenvectors of a Hermitian matrix is called
CHPEV. ZHPEV is the double precision counterpart to CHPEV while
SSPEV and DSPEV are the corresponding single and double-precision eigen solvers
for real symmetric matrices.
To examine the performance characteristics of CHPEV,
a general method for the generation of hermitian matrices for arbitrary
order is desirable. To this end, the following recipe is used in the example code:
- Generate c, a order n real matrix, by a random number generator.
- a = c + transpose(c) (=> a symmetric)
- Generate d, a order n real matrix, by a random number generator.
- b = d - transpose(d) (=> b skew-symmetric)
- h = a + b*i is hermitian
- with h as input, solve for eigenvalues and eigenvectors using CHPEV
The code is available for download.
Note that speedup and efficiency are defined as :
- speedup = T(1) / T(P)
- efficiency = T(1) / T(P) / P
where T(1) and T(P) denote walkclock times for 1 and P processors, respectively.
The timings of a N=4096 case using CHPEV with various processors is
tabulated below:
Procs Wallclock
1 2172
2 1523
4 1230
8 1177
The speedup ratio and efficiency plots of this table are included below:
Note that comparing between the Origin2000 performance with the
SP performance is not entirely appropriate as the Lapack routine
used for the N=4096 runs is CHEEVD which achieves its better
efficiency (over CHEEV) by way of significantly higher demand on
memory usage. There is no similar routine available in ESSL.
|