Scientific Computing & Visualization

Using the IBM ESSL Library 

IBM provides the Engineering and Scientific Subroutine Library (ESSL) for many mathematical and matrix operations. ESSL includes many linear algebra routines, some of which are taken from, or make use of, the Basic Linear Algebra Subprograms (BLAS) package (see Appendix A of the IBM ESSL documentation). Only a handful of LAPACK routines, such as SGETRF for factoring a general matrix, are included in ESSL. If you have been using those LAPACK routines on the SGI Origin, porting to the SP couldn't be easier.

ESSL's naming convention for its routines is very similar to LAPACK's, but a matching name does not guarantee a matching interface. If the LAPACK routine you want to use is NOT explicitly listed in Appendix B, check the routine's argument list, even if the routine name in ESSL matches that in LAPACK letter for letter. The arguments are often slightly different from those of the corresponding LAPACK routine.

The multithreaded version of ESSL (see the multiprocessor example below) lets a serial program that calls ESSL gain performance through shared-memory multiprocessing.

Note that a similar parallel package, Parallel ESSL (PESSL), is available for use in a distributed-memory environment, though it is equally applicable in a shared-memory environment.

In contrast to PESSL, in which the programmer is responsible for assigning data to individual processors (i.e., data decomposition), the ESSLSMP version of ESSL requires no extra effort on the part of the programmer. Note, however, that ESSLSMP provides multithreaded versions for only a subset of the serial ESSL library; i.e., not all ESSL routines are available in multithreaded form.

How To Run An Application Program With ESSL Subroutine Calls On A Single Processor

  1. Search the ESSL table of contents for a routine suitable for your applications. These routines are categorized into
    • Linear algebra
    • Matrix algebra
    • Eigenvalues/eigenvectors
    • FFT
    • Sorting and searching
    • Interpolation
    • Numerical quadrature
    • Random number generators
    • Other utilities
  2. Insert call(s) to the desired ESSL routine at the intended locations in your code
  3. Compile with the appropriate compiler script (xlf, xlf90, ...)
  4. Link the ESSL library using -lessl. For example,
    hal01% xlf -o example -O5 example.f -lessl
  5. Run job interactively
    hal01% example
  6. Run job in batch
    hal01% bsub -qQUEUE example
    where QUEUE is one of the SP's two single-processor queues (sp-short, sp-long). For more details on the queues, run
    hal01% bqueues

How To Run An Application Program With ESSL Subroutine Calls On An SMP Node (an SMP Nighthawk-2 node at BU comprises 16 processors)

  1. Search the ESSL table of contents for a routine suitable for your applications.
  2. See if a multithreaded version of this routine is available.
  3. Insert the desired ESSL calls in your code.
  4. Compile with the thread-safe version of the compiler script used above (i.e., the script ending with _r), add the -qsmp flag, and link with the SMP version of the ESSL library. For example,
    hal01% xlf_r -qsmp -o example -O5 example.f -lesslsmp
    Note that -qsmp is needed for the compiler to recognize any directives that may exist in the esslsmp library as well as in your program. Since automatic parallelization is one of the default options with -qsmp, the compiler will also try to parallelize your code where possible. Use "-qsmp=noauto" to suppress auto-parallelization.
  5. Run job interactively
    hal01% setenv XLSMPOPTS parthds=num
    where num is the number of threads. Setting
    hal01% setenv OMP_NUM_THREADS num
    also works.
    hal01% example
  6. Run job in batch
    hal01% setenv XLSMPOPTS parthds=num
    hal01% bsub -qQUEUE example
    where QUEUE is one of the SP's two multi-processor queues (sp-mp8, sp-mp16) and num is the number of processors. For more details on the queues, run
    hal01% bqueues
    Note that OpenMP environment variables take precedence over XLSMPOPTS settings; e.g., setting the number of threads via OMP_NUM_THREADS overrides the parthds setting shown above.
  7. To find out how many threads are in effect during a run, call num_parthds() in your code.
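
As a minimal sketch of step 7, the program below prints the thread count. It assumes the XL Fortran compiler, where num_parthds is supplied as a service function returning a default integer; it is not part of standard Fortran. The program name show_threads is purely illustrative, and the explicit external declaration is one way to give the function a type under implicit none.

```fortran
program show_threads
   implicit none
   ! Assumption: num_parthds is the XL Fortran service function mentioned
   ! in step 7; other compilers will not provide it.
   integer(4), external :: num_parthds

   ! Report the number of threads the run-time will use for parallel work
   print *, 'threads in effect: ', num_parthds()
end program show_threads
```

Compiled with a thread-safe compiler script and -qsmp as in step 4, this should report the value set through XLSMPOPTS or OMP_NUM_THREADS.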

Multithreaded Example - Solution of Ax = b

The solution procedure consists of two subroutines: SGETRF to factor the matrix A and SGETRS to solve for x using the factored matrix from SGETRF. In addition, SGEMV is called to perform the matrix-vector multiplication Ax. These are the single-precision routines; DGETRF, DGETRS, and DGEMV are their double-precision counterparts.

Here is a Fortran 90 implementation:

program ESSL_example
   implicit none
   integer :: i, j, info, n=6000   ! n is square matrix size
   integer, dimension(:), allocatable :: ipvt
   real,    dimension(:), allocatable :: x, b   ! solution and RHS
   real,  dimension(:,:), allocatable :: a      ! square matrix A

   allocate(a(n,n), ipvt(n), x(n), b(n))

 ! define solution x
   x = (/ (j, j=1,n) /)
 ! define square matrix A
   forall (i=1:n, j=1:n) a(i,j) = 1.0+(1.0/real(i+j))

   call sgemv('N',n,n,1.0,a,n,x,1,0.0,b,1)   ! b = Ax; parallel
   call sgetrf(n,n,a,n,ipvt,info)            ! factor A in place; parallel
   if (info /= 0) print *, 'sgetrf returned info = ', info
   call sgetrs('N',n,1,a,n,ipvt,b,n,info)    ! solve; b is overwritten with x; serial

   deallocate(a, ipvt, x, b)
end program ESSL_example

The above code is available for download.

To study the performance, define speedup and efficiency as:

  • speedup = T(1) / T(P)
  • efficiency = speedup / P = T(1) / (P * T(P))
where T(1) and T(P) denote wall-clock times for 1 and P processors, respectively.
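
One portable way to measure a wall-clock time such as T(P) in Fortran 90 is the standard system_clock intrinsic. The sketch below is an assumption about how such a timing could be taken, not necessarily how the timings in this document were obtained. The loop is a placeholder workload standing in for the ESSL calls, and the names time_section, s, and elapsed are illustrative.

```fortran
program time_section
   implicit none
   integer :: count0, count1, rate   ! clock ticks and ticks per second
   integer :: i
   real    :: s, elapsed

   call system_clock(count0, rate)   ! start of the timed section

   ! Placeholder workload; in practice the ESSL calls would go here
   s = 0.0
   do i = 1, 1000000
      s = s + 1.0/real(i)
   end do

   call system_clock(count1)         ! end of the timed section

   ! Elapsed wall-clock time in seconds (clock wrap-around is ignored)
   elapsed = real(count1 - count0) / real(rate)
   print *, 'sum = ', s, '  wall-clock seconds = ', elapsed
end program time_section
```

system_clock reports elapsed time, which is the quantity the speedup definition needs; a CPU-time counter may instead accumulate time across all threads of a multithreaded run.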

The timings of an N=6000 case using DGETRF with various numbers of processors on Hal are tabulated below:

Procs    Wallclock (in seconds)
    1          114
    2           58
    4           30
    8           17
   16           11

The speedup ratio and efficiency plots of this table are included below:
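
The values behind these plots follow directly from the table and the definitions of speedup and efficiency. As a sketch, the short standard Fortran 90 program below recomputes them. The processor counts and timings are copied from the table; the program and variable names are illustrative.

```fortran
program speedup_table
   implicit none
   integer, parameter :: m = 5
   ! Processor counts and wall-clock times (seconds) from the table
   integer, dimension(m), parameter :: procs = (/ 1, 2, 4, 8, 16 /)
   real,    dimension(m), parameter :: t = (/ 114.0, 58.0, 30.0, 17.0, 11.0 /)
   integer :: k
   real    :: speedup, efficiency

   print '(a)', ' Procs   Speedup   Efficiency'
   do k = 1, m
      speedup    = t(1) / t(k)                 ! T(1) / T(P)
      efficiency = speedup / real(procs(k))    ! speedup / P
      print '(i6, f10.2, f13.2)', procs(k), speedup, efficiency
   end do
end program speedup_table
```

For the timings above, this gives a speedup of about 10.4 and an efficiency of about 0.65 on 16 processors.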

OIT | CCS | September 18, 2007  