(The Uniform Resource Locator for this World Wide Web page is
"http://scv.bu.edu/documentation/tutorials/F90/")
Fortran 90 and Multiprocessing
Course 3070
- Objectives of this Tutorial
To serve as an outline for a two-hour lecture on Fortran 90 at Boston University,
and as an "in-a-nutshell"
reference guide to Fortran 90 and its extensions.
- What is Fortran 90 ?
Fortran 90 is an extension of Fortran 77. Its primary features are
array-handling capabilities and the formalization of many
extensions that have been introduced beyond the Fortran 77 standard over
the years.
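To make the array-handling point concrete, here is a minimal sketch (the program and variable names are made up for illustration) that computes the same result twice: once with a Fortran 77 style loop and once with a Fortran 90 whole-array expression.

```fortran
program array_syntax_demo
  implicit none
  integer, parameter :: n = 5
  real :: a(n), b(n), c_loop(n), c_array(n)
  integer :: i

  a = (/ (real(i), i = 1, n) /)   ! a = 1, 2, ..., n
  b = 2.0                         ! scalar assigned to every element

  ! Fortran 77 style: explicit element-by-element loop
  do i = 1, n
     c_loop(i) = a(i) + 2.0*b(i)
  end do

  ! Fortran 90 style: one whole-array expression
  c_array = a + 2.0*b

  print *, 'loop result  :', c_loop
  print *, 'array result :', c_array
  print *, 'identical    :', all(c_loop == c_array)
end program array_syntax_demo
```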
- Why Fortran 90 ?
Array handling and other safety and convenience features not found
in the Fortran 77 standard needed to be formalized, for the good of programmers
and vendors alike.
- Should I use Fortran 90 ?
It depends. Fortran 90 is still relatively new and, as a result, f90 code is
less portable than f77 code because f90 compilers
are lacking on some platforms (say, on some workstations). A public-domain
f90 compiler is not yet available for Linux.
- File-suffixes
A Fortran source file may have one of the following suffixes: .for, .f, .F, .f90, .hpf
- ALLOCATABLE -- declare a deferred-shape array whose size is set at run time
- ALLOCATE -- allocate memory for an allocatable array
- CASE -- element of SELECT CASE construct
- CASE DEFAULT -- default case of SELECT CASE construct
- CONTAINS -- Indicates procedures within subprogram/module
- CYCLE -- element of DO statement; continue to next iteration under certain condition
- DEALLOCATE -- return memory allocation to system
- ELSE WHERE -- else branch of WHERE
- END DO -- ends a DO loop
- END FUNCTION -- the end of a function subprogram
- END INTERFACE -- ends an INTERFACE block
- END MODULE -- the end of a module
- END PROGRAM -- the end of a (main) program
- END SELECT -- ends the SELECT CASE construct
- END SUBROUTINE -- the end of a subroutine
- END TYPE -- ends TYPE
- END WHERE -- ends WHERE
- EXIT -- element of DO statement; exits DO loop under certain condition
- INCLUDE -- insert external (source) file
- INTENT -- declare whether a dummy argument is input (IN), output (OUT), or both (INOUT)
- INTERFACE -- define procedure interface
- MODULE -- a program unit whose data and procedures are made available by USE
- OPTIONAL -- declare a subprogram argument as optional
- POINTER -- data type qualifier
- PRESENT -- determine whether an argument of a subroutine is present
- PRIVATE -- data access control
- PUBLIC -- general data access
- SELECT CASE -- decision/branch construct
- SEQUENCE -- data alignment; used in conjunction with derived type
- TARGET -- pointer target
- TYPE -- derived data type defined by user
- USE -- data access construct
- WHERE -- array "if" construct
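Several of the statements listed above can be seen working together in one short program. The following is a sketch only; the module, type, and procedure names (stats_mod, point, scaled_sum) are hypothetical and chosen purely for illustration.

```fortran
module stats_mod
  implicit none
  private                      ! hide everything by default
  public :: point, scaled_sum  ! export only these names

  type point                   ! user-defined (derived) type
     real :: x, y
  end type point

contains

  ! Sum the elements of v, optionally multiplied by factor.
  function scaled_sum(v, factor) result(s)
    real, intent(in)           :: v(:)     ! assumed-shape dummy argument
    real, intent(in), optional :: factor
    real :: s
    s = sum(v)
    if (present(factor)) s = factor*s
  end function scaled_sum

end module stats_mod

program keyword_demo
  use stats_mod
  implicit none
  real, allocatable :: w(:)    ! deferred-shape array, sized at run time
  type(point) :: p
  integer :: i, n, ierr

  n = 6
  allocate(w(n), stat=ierr)
  if (ierr /= 0) stop 'allocation failed'

  w = (/ (real(i) - 3.0, i = 1, n) /)    ! -2, -1, 0, 1, 2, 3

  where (w < 0.0)              ! array "if": act only where the mask is true
     w = 0.0
  elsewhere
     w = 2.0*w
  end where

  print *, 'sum        =', scaled_sum(w)
  print *, 'half sum   =', scaled_sum(w, factor=0.5)

  do i = 1, n
     if (w(i) == 0.0) cycle    ! skip to the next iteration
     if (w(i) > 4.0) exit      ! leave the loop early
     select case (i)
     case (1:4)
        print *, 'i =', i, ' is in 1..4'
     case default
        print *, 'i =', i, ' is above 4'
     end select
  end do

  p = point(1.0, 2.0)          ! structure constructor
  print *, 'point      =', p%x, p%y

  deallocate(w)                ! return the memory to the system
end program keyword_demo
```

Because scaled_sum lives in a module, its interface is explicit wherever the module is used, which is what allows the optional argument and the assumed-shape array to work without a separate INTERFACE block.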
- In the following, arguments written in lower case are optional parameters.
- ALL(MASK,dim) -- true if all values are true
- ANY(MASK,dim) -- true if any value is true
- CSHIFT(ARRAY,SHIFT,dim) -- circular shift
- COUNT(MASK,dim) -- number of true elements in an array
- DOT_PRODUCT(A,B) -- dot product of two rank-one arrays
- EOSHIFT(ARRAY,SHIFT,boundary,dim) -- end-off shift
- MATMUL(A,B) -- matrix product of A and B; the number of columns of A must equal the number of rows of B
- MAXLOC(ARRAY,mask) -- location of element with maximum value
- MAXVAL(ARRAY,dim,mask) -- maximum value of ARRAY
- MERGE(TSOURCE,FSOURCE,MASK) -- combining two arrays using a mask
- MINLOC(ARRAY,mask) -- location of element with minimum value
- MINVAL(ARRAY,dim,mask) -- minimum value of array
- PACK(ARRAY,MASK,vector) -- pack an array into a vector under a mask
- PRODUCT(ARRAY,dim,mask) -- product of array elements
- RESHAPE(SOURCE,SHAPE,pad,order) -- reshape an array
- SHAPE(SOURCE) -- shape of an array or scalar
- SELECTED_INT_KIND(n) -- returns integer kind with range (-10**n, 10**n)
- SELECTED_REAL_KIND(d,n) -- returns real kind with at least d decimal digits of precision and a decimal exponent range of at least n
- SIZE(ARRAY,dim) -- number of elements in an array
- SPREAD(SOURCE,DIM,NCOPIES) -- replicate an array by adding a dimension
- SUM(ARRAY,dim,mask) -- sum array elements
- TRANSPOSE(MATRIX) -- transpose array of rank two
- UNPACK(VECTOR,MASK,FIELD) -- unpack rank-one array into a multidimensional array under a mask
The performance of the above array intrinsics may or may not be better than
that of equivalent explicit do-loops. It depends heavily on the individual function
and the compiler version used.
A table
listing the performance of the above array intrinsics has been compiled.
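As an informal illustration of the array intrinsics, the sketch below applies a number of them to a small integer array. The array contents and the precision request passed to SELECTED_REAL_KIND are arbitrary choices for this example.

```fortran
program intrinsics_demo
  implicit none
  ! Kind with at least 12 decimal digits of precision (arbitrary choice)
  integer, parameter :: dp = selected_real_kind(12)
  integer :: a(2,3), v(6)
  real(dp) :: x(3), y(3)

  ! reshape fills in column-major order:
  ! a(:,1) = (1,2), a(:,2) = (3,4), a(:,3) = (5,6)
  v = (/ 1, 2, 3, 4, 5, 6 /)
  a = reshape(v, (/ 2, 3 /))

  print *, 'shape(a)            =', shape(a)
  print *, 'size(a), size(a,2)  =', size(a), size(a, 2)
  print *, 'sum(a)              =', sum(a)
  print *, 'sum(a, dim=1)       =', sum(a, dim=1)      ! one sum per column
  print *, 'maxval / maxloc     =', maxval(a), maxloc(a)
  print *, 'count(a > 2)        =', count(a > 2)
  print *, 'any / all (a > 2)   =', any(a > 2), all(a > 2)
  print *, 'pack(a, a > 2)      =', pack(a, a > 2)
  print *, 'merge(a, 0, a > 2)  =', merge(a, 0, a > 2)
  print *, 'cshift(v, 2)        =', cshift(v, 2)
  print *, 'eoshift(v, 2)       =', eoshift(v, 2)
  print *, 'transpose(a) shape  =', shape(transpose(a))
  print *, 'matmul shape        =', shape(matmul(a, transpose(a)))

  x = (/ 1.0_dp, 2.0_dp, 3.0_dp /)
  y = (/ 4.0_dp, 5.0_dp, 6.0_dp /)
  print *, 'dot_product(x, y)   =', dot_product(x, y)
end program intrinsics_demo
```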
- To compile example.f90 and produce an executable "example" :
- lego% f90 -o example example.f90
- At the prompt, type f90 -help to get a list of all f90 compiler options.
- If a Makefile is used to compile an
f90 program, take care that modules are compiled before the
subroutines that refer to them. Otherwise, compilation will fail.
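The ordering requirement for modules can be illustrated with a small sketch. The module and program below are shown in one listing, with the module first; the file names mentioned in the comments are hypothetical.

```fortran
! This module would typically live in its own file
! (hypothetical name: constants_mod.f90).
module constants_mod
  implicit none
  real, parameter :: pi = 3.1415927
end module constants_mod

! The main program would live in a second file
! (hypothetical name: main.f90).
program circle_area
  use constants_mod            ! requires the already-compiled module
  implicit none
  real :: r
  r = 2.0
  print *, 'area =', pi*r**2
end program circle_area
```

If the two units are split into separate files, the module file has to be compiled first, because compiling the USE statement needs the module information that the compiler writes out when it processes the module. In a Makefile this is usually expressed by making the object file of the main program depend on the object file of the module.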
- To run a job interactively:
- lego% example
- On the Power Challenge Array (PCA) and Origin2000 (O2K), all
interactive jobs have a CPU time limit of roughly 10 minutes.
- To submit a batch job:
- lego% bsub -q o2k-short example
- See the man page of lsbatch for other batch-related commands.
- lego% bqueues lists all available queues.
- lego% bqueues -l gives a complete list of
all single and multiprocessor queues and their respective time
limits.
- A good summary of the hardware and queues available at Boston University is available on-line.
- Serial code tuning is essential: make
the serial code efficient before any attempt to
parallelize it. This exercise often
produces "cleaner" code, which in turn can help APO (SGI's Auto-Parallelizing Option) parallelize
loops that it would otherwise see as not parallelizable.
Excellent SGI documentation on
performance tuning is available on-line.
- Use the compiler optimization switch -O3 (or, more aggressively, -Ofast) whenever possible.
This turns on:
- Loop Nest Optimization (-lno), which includes loop unrolling, cache prefetch, ...
- Inter-Procedural Analysis (-ipa), which includes inlining, cross-procedure optimization, ...
- Software pipelining
- Link with -lfastm (default is -lm) if you need
sin, cos, ...
- Pay attention to loop indexes. As a rule of thumb, the innermost
loop should correspond to the leftmost index of an array to maintain
stride-one memory-reference.
- Group data into a multidimensional array for more efficient memory access.
x(n),y(n),z(n) ==> p(3,n)
- Avoid subroutine calls, I/O, and unnecessary branching inside
potentially parallelizable loops.
- Avoid a power-of-two (2**n) leading dimension, which could cause cache conflicts.
A(1024,1024) ==> A(1025,1024)
- Use performance tools to help identify and understand cpu-intensive code segments
- ssrun -- to collect performance data
- perfex (This gives instruction counts, cache misses, etc.)
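Two of the tuning rules above, loop ordering for stride-one access and grouping coordinates as p(3,n), can be sketched as follows. The array size and the values stored are arbitrary; this only shows the access patterns and makes no claim about measured speed.

```fortran
program stride_demo
  implicit none
  integer, parameter :: n = 500
  real, allocatable :: a(:,:), p(:,:)
  real :: total
  integer :: i, j

  allocate(a(n,n), p(3,n))
  a = 1.0

  ! Fortran stores arrays column by column, so the leftmost index (i)
  ! runs in the innermost loop to walk through memory with stride one.
  total = 0.0
  do j = 1, n
     do i = 1, n
        total = total + a(i,j)
     end do
  end do
  print *, 'total =', total

  ! Coordinates grouped as p(3,n): x, y and z of point i are adjacent
  ! in memory, unlike three separate arrays x(n), y(n), z(n).
  do i = 1, n
     p(1,i) = real(i)        ! x
     p(2,i) = 2.0*real(i)    ! y
     p(3,i) = 3.0*real(i)    ! z
  end do
  print *, 'last point =', p(:,n)

  deallocate(a, p)
end program stride_demo
```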
There are a number of methods available to help you achieve parallelism
in your code. One method may be more effective than another; which one depends
largely on the characteristics of your code.
Note that it is possible to use more than one
method in different parts of the same code to achieve parallelism.
Parallel Code Compilation and Execution
- There are different ways to compile source code, depending on your objectives:
- Use SGI's parallel mathematical libraries:
lego% f90 -o example -mp example.f90 -lscs_mp
- At the prompt, type f90 -help to get a list of all f90
compiler options.
- If your code is f77 based, you can still link
with -lscs_mp.
- See Intro to SCSL to find out whether the LAPACK routine you are
using is a member of the parallel library. Caveat: just because a routine is in
the library doesn't mean that you will get a great speedup.
Some routines are known to benefit minimally (like the SVD routines);
others, however, scale very well (like LU decomposition).
- Use loop-level parallel directives:
lego% f90 -o example -mp example.f90
- Here, you must include parallel directives in example.f90
for parallel execution to take effect. -mp
alerts the compiler that the source file contains
directives. In addition, -mp causes the mp libraries to
be linked.
- Use apo to automatically parallelize and compile the code:
lego% f90 -o example -apo keep example.f90
- The apo option can also be used with the f77 compiler.
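The tutorial does not say which directive set is meant by "loop-level parallel directives". As one possible illustration, the sketch below assumes OpenMP-style directives; whether a particular f90 release accepts them under -mp is an assumption to be checked against the compiler documentation.

```fortran
program directive_demo
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), y(n)
  real :: total
  integer :: i

  x = 1.0
  y = 2.0
  total = 0.0

  ! The directive lines below look like comments to a compiler that is
  ! not told to honor them, so this program also builds and runs serially.
  ! Assumption: OpenMP-style syntax; the loop index i is private by default.
!$omp parallel do reduction(+:total)
  do i = 1, n
     y(i) = y(i) + 3.0*x(i)
     total = total + y(i)
  end do
!$omp end parallel do

  print *, 'total =', total
end program directive_demo
```

The loop body has no dependence between iterations apart from the running sum, which is why the sum is declared as a reduction. Loops with subroutine calls, I/O, or cross-iteration dependences are much harder to parallelize this way.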
- To run a job interactively with 4 processors:
- lego% setenv MP_SET_NUMTHREADS 4
- lego% example
- Alternatively, you can call a Fortran-callable SGI utility library routine in your code, immediately after the non-executable statements, as follows:
- call mp_set_numthreads(4)
Note that interactive jobs can only be executed
on Tonka (an SGI PowerChallengeArray) and Lego (an SGI Origin2000), and the time limit is roughly 10 minutes per processor. A job that
requires more than 10 minutes should be submitted to one of the batch queues
via bsub.
- To submit a multiprocessor batch job requiring 4 processors to the PCA:
lego% bsub -q pca-mp4 example
- DO NOT provide "-n 4" as described
in the bsub manpage to request 4 processors. Instead, set MP_SET_NUMTHREADS
as for the interactive job, or insert "call mp_set_numthreads(4)" in your
program. Remember to link with the -mp switch, and do not
ask for more processors than the queue's limit.
- pca-mp4 is for jobs that require up to
4 hours per processor, for a total of 16 hours of CPU time on the PCA.
- o2k-mp4 is for jobs that require up to
4 hours per processor, for a total of 16 hours of CPU time on the Origin 2000.
- For more information on available queues and their
corresponding CPU limits, see the
Scientific Computing Facility Technical Summary.
- See the man page of lsbatch for bsub-related commands.
With Boston University's Power Challenge Array,
HPF is available through pghpf, the
driver for The Portland Group's HPF compiler.
At present, we have pghpf 2.4.
To use it, put the following in your .cshrc script:
if ( -d /usr/local/pghpf ) then
setenv PGI /usr/local/pghpf-2.4
set path = ($path $PGI/sgi/bin)
setenv LM_LICENSE_FILE /usr/local/flexlm/licenses/license.dat
endif
For those who have Thinking Machines' CM Fortran codes and would like to convert them to HPF, the
on-line documentation
includes a paper,
"Migrating CM FORTRAN to F90 and HPF", by Meadows and Miles.
There are man pages for the
pghpf compiler and for the individual
HPF library routines.
For the efficiency-conscious, there is a menu-driven profiler,
pgprof, for your applications.
- Examples of source code compilation are as follows:
- For F90 source code:
lego% pghpf -o example -O3 example.f90
lego% f90 -o example -O3 -pghpf example.f90
- For F77 source code:
lego% pghpf -o example -Mautopar example.f
- To run a pghpf job:
- lego% example -pghpf -np 4
- or
- lego% setenv PGHPF_NP 4
- lego% example
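For readers who have not seen HPF source, the following minimal sketch shows what data-distribution directives look like. The array size, the processor arrangement name (procs), and the choice of a BLOCK distribution are illustrative assumptions. The count of 4 is chosen to match the -np 4 run command above.

```fortran
program hpf_demo
  implicit none
  integer, parameter :: n = 16
  real :: a(n), b(n)
  integer :: i
  ! HPF directives are structured comments; a plain Fortran 90 compiler
  ! ignores them and runs the program serially.
!HPF$ PROCESSORS procs(4)
!HPF$ DISTRIBUTE a(BLOCK) ONTO procs
!HPF$ ALIGN b(:) WITH a(:)

  a = (/ (real(i), i = 1, n) /)

  ! Whole-array assignment: each processor can update the block it owns,
  ! and because b is aligned with a, no data has to move between them.
  b = 2.0*a + 1.0

  print *, 'sum(b) =', sum(b)
end program hpf_demo
```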
- Example 1. Allocation and matrix multiply
- Example 2. Array-valued function
- Example 3. Recursion
- Example 4a. Laplace Equation with Jacobi iteration -- F77 version
- Example 4b. Laplace Equation with Gauss Seidel iteration -- F77 version
- Example 4c. Laplace Equation with Jacobi iteration -- F90 version
- There are several examples of HPF code in
/usr/local/examples/hpf/.
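The course examples themselves are not reproduced here. As a rough idea of what a Fortran 90 Jacobi iteration for the Laplace equation (the topic of Example 4c) can look like, here is an independent sketch. The grid size, tolerance, iteration cap, and boundary values are all made up for illustration and are not taken from the course material.

```fortran
program jacobi_demo
  implicit none
  integer, parameter :: n = 20           ! interior points per side (arbitrary)
  integer, parameter :: max_iter = 5000  ! iteration cap (arbitrary)
  real, parameter :: tol = 1.0e-4        ! convergence tolerance (arbitrary)
  real :: u(0:n+1,0:n+1), unew(n,n)
  real :: change
  integer :: iter

  u = 0.0            ! initial guess and three boundaries held at 0
  u(0,:) = 1.0       ! one boundary held at 1 (a made-up test problem)

  do iter = 1, max_iter
     ! Each interior point becomes the average of its four neighbours,
     ! computed from the previous iterate only (Jacobi, not Gauss-Seidel).
     unew = 0.25*( u(0:n-1,1:n) + u(2:n+1,1:n)   &
                 + u(1:n,0:n-1) + u(1:n,2:n+1) )
     change = maxval(abs(unew - u(1:n,1:n)))
     u(1:n,1:n) = unew
     if (change < tol) exit
  end do

  ! If the loop ran to completion without converging, iter is max_iter + 1.
  print *, 'iterations   =', iter
  print *, 'max change   =', change
  print *, 'centre value =', u(n/2, n/2)
end program jacobi_demo
```

The array sections replace the two nested do-loops of a Fortran 77 version, and MAXVAL replaces the explicit search for the largest update.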
There are a number of Fortran 90 and HPF references available.
- From book publishers :
- Fortran 90 Programming by Ellis, Phillips and Lahey, Addison-Wesley, 1994
- Migrating to Fortran 90 by J.F. Kerrigan, O'Reilly & Associates, Inc., 1993
- Fortran 90 Handbook by Adams, Brainerd, et al., McGraw-Hill, 1992
- On the Internet:
- The Fortran Market maintained by Walt Brainerd
- Fortran 90 Tutorial by Michael Metcalf
- Fortran 90 and Computational Science chapter in CSEP's
on-line textbook on scientific computation
- Portland Group's pghpf User's Guide
- Portland Group's pghpf HPF Reference Manual
- High Performance Fortran Forum's HPF Language Specifications. This document is also available in PostScript form at this site.
- HPF web tour by Ian Foster
- HPF chapter in
Designing and Building Parallel Programs by Ian Foster
For more information about this tutorial,
and about Fortran 90, HPF and Multiprocessing,
contact the course coordinator and instructor, Kadin Tseng
(Email: kadin@bu.edu).