Using the IBM ESSL Library
IBM provides the Engineering and Scientific Subroutine Library (ESSL) for many
mathematical and matrix operations. ESSL includes
many linear algebra routines, some of which are from, or make use of, the
Basic Linear Algebra Subprograms (BLAS) package
(see Appendix A of the IBM ESSL documentation). Only
a handful of LAPACK routines have been included in ESSL,
such as SGETRF for factoring a general matrix. If you have been
using those LAPACK routines on the SGI Origin, porting to
the SP couldn't be easier. The multithreaded version of ESSL
(see the multiprocessor example below)
lets a serial program that makes calls to
ESSL gain performance through shared-memory multiprocessing.
Note that ESSL uses a naming convention very similar to LAPACK's for its
routines. If the LAPACK routine you want to use is NOT explicitly
listed in Appendix B, it would be prudent to check the routine's
argument list, even if the routine name in ESSL seems to match
that in LAPACK letter for letter. Often, the arguments
differ slightly from those you expect in the corresponding
LAPACK routine.
Note that a similar parallel package, Parallel ESSL (PESSL),
is available for use in a distributed-memory environment, though it is equally applicable in a shared-memory environment.
In contrast to PESSL, in which the programmer is responsible for assigning data
to individual processors (i.e., data decomposition), the ESSLSMP version of ESSL
requires no such effort on the part of the programmer.
Note, however, that ESSLSMP covers only a subset of the serial ESSL
library; i.e., not all ESSL routines are available in multithreaded form.
How To Run An Application Program With ESSL Subroutine Calls On A Single Processor
- Search the ESSL table of contents for a
routine suitable for your application. The routines are grouped into the following categories:
  - Linear algebra
  - Matrix algebra
  - Eigenvalues/eigenvectors
  - FFT
  - Sorting and searching
  - Interpolation
  - Numerical quadrature
  - Random number generators
  - Other utilities
- Insert call(s) to the desired ESSL routine at the intended locations in your code
- Compile with the appropriate compiler script (xlf, xlf90, ...)
- Link with the ESSL library using -lessl. For example,
twister% xlf -o example -O5 example.f -lessl
- Run job interactively
twister% example
- Run job in batch
twister% bsub -qQUEUE example
where QUEUE is one of the SP's two single-processor queues (sp-short, sp-long).
For more details, run
twister% bqueues
How To Run An Application Program With ESSL Subroutine Calls On An SMP Node
(an SMP Nighthawk-2 node at BU comprises 16 processors)
- Search the ESSL table of contents for a
routine suitable for your applications.
- See if a multithreaded version of this routine is available.
- Insert the desired ESSL calls in your code.
- Compile with the thread-safe version of the compiler script used above
(i.e., the script whose name ends with _r) and link with the SMP version
of the ESSL library. The -qsmp flag must also be added. For
example,
twister% xlf_r -qsmp -o example -O5 example.f -lesslsmp
Note that -qsmp is needed for the compiler to recognize any directives
that may exist in the esslsmp library as well as in your program.
Since automatic parallelization is one of the default options
with -qsmp, the compiler will also try to parallelize your code where
possible. Use "-qsmp=noauto" to suppress auto-parallelization.
- Run job interactively
twister% setenv XLSMPOPTS parthds=num
(num is the number of threads; setenv OMP_NUM_THREADS num also works)
twister% example
- Run job in batch
twister% setenv XLSMPOPTS parthds=num
twister% bsub -qQUEUE example
where QUEUE is one of the SP's two multiprocessor queues (sp-mp8, sp-mp16)
and num is the number of processors.
For more details, run
twister% bqueues
Note that OpenMP environment variables take precedence over XLSMPOPTS
settings; e.g., a thread count set via
OMP_NUM_THREADS overrides the parthds setting
shown above.
- To find out how many threads are in effect during
a run, insert a call to
num_parthds() in your code.
Multithreaded Example - Solution of Ax = b
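The solution procedure consists of two subroutines: DGETRF to factor the matrix
A and DGETRS to solve for x using the factored matrix from DGETRF. In addition,
DGEMV is called to perform the matrix-vector multiplication Ax.
Here is a Fortran 90 implementation. Note that this listing declares its arrays as
default real and therefore calls the single-precision counterparts SGEMV, SGETRF,
and SGETRS: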
program ESSL_example
  implicit none
  integer :: i, j, info, n=6000              ! n is square matrix size
  integer, dimension(:), allocatable :: ipvt ! pivot indices from the factorization
  real, dimension(:), allocatable :: x, b    ! solution and RHS
  real, dimension(:,:), allocatable :: a     ! square matrix A
  allocate(a(n,n), ipvt(n), x(n), b(n))
  ! define solution x
  x = (/ (real(j), j=1,n) /)
  ! define square matrix A
  forall (i=1:n, j=1:n) a(i,j) = 1.0+(1.0/real(i+j))
  call sgemv('N',n,n,1.0,a,n,x,1,0.0,b,1)    ! b = Ax; parallel
  call sgetrf(n,n,a,n,ipvt,info)             ! factor A in place; parallel
  if (info /= 0) print *, 'sgetrf returned info = ', info
  call sgetrs('N',n,1,a,n,ipvt,b,n,info)     ! solve for x (overwrites b); serial
  deallocate(a, ipvt, x, b)
end program ESSL_example
The above code is available for download.
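For readers without access to ESSL, the same three steps (form b = Ax, factor A, solve for x) can be traced on a tiny system. The following is a minimal pure-Python sketch, not ESSL's or LAPACK's implementation: the helper names matvec, lu_factor, and lu_solve are hypothetical and only mimic the roles of the GEMV, GETRF, and GETRS calls, using a textbook LU factorization with partial pivoting on an assumed 3x3 test matrix.

```python
def matvec(a, x):
    """Return b = A x (the role played by the GEMV call)."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def lu_factor(a):
    """LU-factor a copy of A with partial pivoting (the role of GETRF).

    Returns (lu, piv): lu packs L (unit diagonal, strictly below) and
    U (on and above the diagonal); piv[k] is the row swapped with row k.
    """
    n = len(a)
    lu = [[float(v) for v in row] for row in a]
    piv = []
    for k in range(n):
        # Choose the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        piv.append(p)
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]          # multiplier, stored in L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b from the packed factors (the role of GETRS)."""
    n = len(lu)
    x = [float(v) for v in b]
    for k, p in enumerate(piv):           # apply the recorded row swaps to b
        x[k], x[p] = x[p], x[k]
    for i in range(n):                    # forward substitution, L y = P b
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):          # back substitution, U x = y
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


# Assumed small test case: pick a known x, form b = A x, then recover x.
a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
x_true = [1.0, 2.0, 3.0]
b = matvec(a, x_true)
lu, piv = lu_factor(a)
x = lu_solve(lu, piv, b)
print(x)
```

As in the Fortran program, the right-hand side is built from a known solution, so the recovered x can be compared directly with x_true to check the factor and solve steps.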
To study the performance, define speedup and efficiency as:
- speedup = T(1) / T(P)
- efficiency = speedup / P = T(1) / (P * T(P))
where T(1) and T(P) denote wall-clock times for 1 and P processors, respectively.
The timings for an N=6000 case using DGETRF on various numbers of processors,
run on Hal, are tabulated below:
Procs   Wallclock (in seconds)
    1   114
    2    58
    4    30
    8    17
   16    11
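The definitions of speedup and efficiency can be applied directly to these timings. The short Python sketch below does so; the timings dictionary is copied from the table, while the function and variable names are illustrative choices, not part of any ESSL tooling.

```python
# Wall-clock timings in seconds, copied from the table: processors -> time
timings = {1: 114, 2: 58, 4: 30, 8: 17, 16: 11}


def speedup(t1, tp):
    """speedup = T(1) / T(P)"""
    return t1 / tp


def efficiency(t1, tp, p):
    """efficiency = T(1) / (P * T(P))"""
    return t1 / (p * tp)


t1 = timings[1]
results = {
    p: (speedup(t1, tp), efficiency(t1, tp, p)) for p, tp in timings.items()
}

print("Procs  Speedup  Efficiency")
for p, (s, e) in results.items():
    print(f"{p:5d}  {s:7.2f}  {e:10.2f}")
```

For these timings the speedup on 16 processors works out to about 10.4, an efficiency of roughly 65%, compared with about 98% on 2 processors.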
The speedup ratio and efficiency plots of this table are included below: