Scientific Computing & Visualization
Help Contact
About Accounts Computation Visualization Documentation Services

Using the IBM PESSL Library

IBM provides a Parallel Engineering Scientific Subroutine Library (Parallel ESSL) for many mathematical and matrix operations. Parallel ESSL includes many linear algebra routines (some of which are from the BLAS package), Lapack-like routines, FFT routines and eigen routines (not unlike EISPACK).

Note that in using the ESSL SMP library for parallel applications on the SP, essentially no work, like data decomposition, is required of the programmer. On the contrary, the primary focus of the PESSL is for distributed-memory applications and hence user codes must employ a message-passing paradigm such as MPI, PVM, MPL. Data decomposition must be performed by the programmer. In addition, calls to blacs library routines (see example below) must also be made prior to calling the PESSL routines. Like the ESSL package, PESSL also has an SMP version. The blacs calls, as well as the PESSL routine calls, are the same whether you use the distributed-memory PESSL (in IBM's jargon, a signal handling library) or the SMP PESSL.

Below, the example of solving for Ax = b is demonstrated. The same example is also demonstrated in a companion page using the ESSL SMP library (see Using the IBM ESSL). Comparing the two example programs, base on code complexity and scalability, might quickly lead to the conclusion that on an SMP machine ESSL SMP is the more convenient way. However, note that there are other considerations that may affect the library selection. For instance,

  1. Whether the desired functionality exists in either library.
  2. Portability.
  3. Parallelization of the remainder of the user code.

How To Run Application Program With PESSL Subroutine Calls On An SMP Node (an SMP nighthawk-2 node at BU comprises 16 processors)

  1. Insert the necessary blacs and MPI calls as well as the desired PESSL calls in your code.
  2. Compile using appropriate thread-safe compiler script. For example,
    hal01% mpxlf90_r -O example -O5 -qsuffix=f=f90 example.f90 -lpesslsmp -lblacssmp
  3. Run job interactively using 4 processors
    hal01% poe example -procs 4
  4. Run job in batch using 4 processors
    hal01% bsub -qQUEUE poe example -procs 4
    where QUEUE is one of SP's 2 multi-processor queues (sp-mp8, sp-mp16)
    hal01% bqueues for more details.

Example - Solution of Ax = b

The solution procedure consists of two subroutines: DGETRF to factor the matrix A and DGETRS to solve for x using the factored matrix from DGETRF.

Here is a fortran 90 implementation:

program PESSL_example
  include "mpif.h"
  integer nprow, npcol, nblk, myrow, mycol, i, j, mi, mj, nprocs
  integer icontxt, Iam, ierr, k, info
  integer, parameter :: ndim=6000, root=0
  integer :: world_group, row_group, row_comm, rowcomm
  integer, dimension(:), allocatable :: ipvt, member, desc_a, desc_b
  real*8, dimension(:,:), allocatable :: a
  real*8, dimension(:), allocatable :: x, b, bp, x0

  call MPI_Init(ierr)

  call blacs_pinfo(Iam, nprocs)

  nprow = sqrt(real(nprocs))   ! assume nprocs = nprow**2
  npcol = nprocs/nprow
  nblk  = ndim/nprow

  call blacs_get(0, 0, icontxt)
  call blacs_gridinit(icontxt, 'r', nprow, npcol)
  call blacs_gridinfo(icontxt, nprow, npcol, myrow, mycol)

  call MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, rowcomm, ierr)
! allocate arrays ...
  allocate(a(nblk,nblk), x(nblk), b(nblk), bp(nblk), ipvt(nblk))
  allocate(member(0:nprow-1), desc_a(9), desc_b(9))
  if(Iam .eq. 0) allocate(x0(ndim))        ! on proc 0 only
 
  member = (/ (i*npcol, i=0,nprow-1) /)    ! all procs in 1st column
  call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)
  call MPI_Group_incl(world_group, nprow, member, row_group, ierr)
  call MPI_Comm_create(MPI_COMM_WORLD, row_group, row_comm, ierr)

  x = (/ (real(j+mycol*nblk), j=1,nblk) /) ! define x
 
!  Initialize local A and compute b = Ax
  mi = myrow*nblk; mj = mycol*nblk
  forall (i=1:nblk, j=1:nblk) a(i,j) = 1.0+(1.0/(real(i+mi+j+mj)))
! bp = matmul(a,x)       ! local b = Ax
  do i=1,nblk
    bp(i) = 0.0
    do j=1,nblk
      bp(i) = bp(i) + a(i,j)*x(j)
    enddo
  enddo
! Sum all local b (rhs) to form the final b
  call MPI_Allreduce(bp, b, nblk, MPI_DOUBLE_PRECISION, &
     &                     MPI_SUM, rowcomm, ierr)

! define parameters for PESSL calls
  desc_a = (/ 1, icontxt, ndim, ndim, nblk, nblk, 0, 0, &
                   max(1,numroc(ndim,nblk,myrow,1,nprow)) /)
  desc_b = (/ 1, icontxt, ndim,    1, nblk,    1, 0, 0, &
                   max(1,numroc(ndim,nblk,myrow,1,nprow)) /)

! LU factor and back substitution routine sof PESSL
  call pdgetrf(ndim,ndim,a,1,1,desc_a,ipvt,info)  ! factor A
  call pdgetrs('N',ndim,1,a,1,1,desc_a,ipvt,  &   ! solve for x
     &                          b,1,1,desc_b,info)
 
  call MPI_Group_rank(row_group, k, ierr)  ! see if k is in row_group
  if(k .ne. MPI_UNDEFINED) &            ! undefine if not
    call MPI_Gather( b, nblk, MPI_DOUBLE_PRECISION, &
                    x0, nblk, MPI_DOUBLE_PRECISION, &
                    root, row_comm, ierr)

  if(Iam .eq. 0) then
    write(*,"('final solution'//(5f10.1))")(x0(i),i=1,ndim)
  endif

  deallocate(a, x, b, bp, desc_a, desc_b, ipvt, member)

  call MPI_Finalize(ierr)
end PROGRAM PESSL_example

Some preliminary timing numbers for the above program are listed below:

              PESSL      PESSL SMP
Procs    Wallclock      Wallclock
    1        145             76   
    4        142             82
    9         96             62
   16         75             73

Example - Eigenvalues and Eigenvectors of a Hermitian matrix

The subroutine for solving eigenvalues and eigenvectors of a Hermitian matrix is called CHPEV. ZHPEV is the double precision counterpart to CHPEV while SSPEV and DSPEV are the corresponding single and double-precision eigen solvers for real symmetric matrices.

To examine the performance characteristics of CHPEV, a general method for the generation of hermitian matrices for arbitrary order is desirable. To this end, the following recipe is used in the example code:

  • Generate c, a order n real matrix, by a random number generator.
  • a = c + transpose(c) (=> a symmetric)
  • Generate d, a order n real matrix, by a random number generator.
  • b = d - transpose(d) (=> b skew-symmetric)
  • h = a + b*i is hermitian
  • with h as input, solve for eigenvalues and eigenvectors using CHPEV

The code is available for download.

Note that speedup and efficiency are defined as :

  • speedup = T(1) / T(P)
  • efficiency = T(1) / T(P) / P
where T(1) and T(P) denote walkclock times for 1 and P processors, respectively.

The timings of a N=4096 case using CHPEV with various processors is tabulated below:


Procs    Wallclock
    1         2172
    2         1523
    4         1230
    8         1177

The speedup ratio and efficiency plots of this table are included below:

Note that comparing between the Origin2000 performance with the SP performance is not entirely appropriate as the Lapack routine used for the N=4096 runs is CHEEVD which achieves its better efficiency (over CHEEV) by way of significantly higher demand on memory usage. There is no similar routine available in ESSL.

Boston University
Boston University
 
OIT | CCS | September 18, 2007  
Scientific Computing & Visualization Boston University home page Boston University home page