Example : Matrix Transpose
Demonstrates the usage of virtual topology.
We will demonstrate the use of virtual topologies by way of a matrix transposition.
The matrix algebra for matrix transpose is:
The parallel algorithm is:
- Select p and q such that the total number of processes,
nprocs = p x q.
- Partition the n x m matrix into a (blocked) p x q matrix
whose elements are themselves matrices of size (n/p) x (m/q).
- Perform a transpose on each of these sub-matrices. These are performed
serially as the entire sub-matrix resides locally on a process. No
inter-process communication is required.
- Formally, the p x q matrix need be transposed to obtain the final
result. However, in reality this step is often not necessary. If
you need to access the element (or sub-matrix) "p,q" of the transposed
matrix, all you need do is to access the element "q,p" which has already
been transposed locally. Depending on what comes next in the calculation,
unnecessary message passing may be avoided.
As an example, we define a 9 x 4 matrix and will use 6 processes. Next,
map it into a 3 x 2 virtual cartesian grid, i.e., p=3, q=2.
Coincidentally, each element of this cartesian grid is in turn a matrix of size
3 x 2, as illustrated in the figure below. For the physical grid, each square
box represents one entry of the matrix. The pair of indices, "i,j", on the first
row gives the global cartesian coordinates while "(p)" is
the process associated with the virtual grid allocated by calling
MPI_Cart_create or MPI_Comm_split.
aij on the second row is the value of the matrix element.
On the right, the 3 x 2 virtual grid is depicted.
Each box in this grid represents one process and contains one 3x2 submatrix.
Finally, another communicator is created for the transposed virtual grid
which has the dimensions of 2 x 3. The element at "1,0" of the
transposed virtual grid, for instance, stores the value sent by the element
at "0,1" of the virtual grid.
Physical Grid |
  |
  |
  |
Matrix Transpose
0,0 (0) 100 | 0,1 (0) 101 |
1,0 (0) 110 | 1,1 (0) 111 |
2,0 (0)
120 | 2,1 (0) 121 |
|
Matrix Transpose
0,2 (1)
102 | 0,3 (1) 103 |
1,2 (1)
112 | 1,3 (1) 113 |
2,2 (1)
122 | 2,3 (1) 123 |
|
|
Virtual Grid
Virtual Grid
| 0,0 (0) | 0,1 (1) |
| 1,0 (2) | 1,1 (3) |
| 2,0 (4) | 2,1 (5) |
|
Matrix Transpose
3,0 (2)
130 | 3,1 (2) 131 |
4,0 (2)
140 | 4,1 (2) 141 |
5,0 (2)
150 | 5,1 (2) 151 |
|
Matrix Transpose
3,2 (3)
132 | 3,3 (3) 133 |
4,2 (3)
142 | 4,3 (3) 143 |
5,2 (3)
152 | 5,3 (3) 153 |
|
  |
|
Matrix Transpose
6,0 (4)
160 | 6,1 (4) 161 |
7,0 (4)
170 | 7,1 (4) 171 |
8,0 (4)
180 | 8,1 (4) 181 |
|
Matrix Transpose
6,2 (5)
162 | 6,3 (5) 163 |
7,2 (5)
172 | 7,3 (5) 173 |
8,2 (5)
182 | 8,3 (5) 183 |
|
  |
Transposed Virtual Grid
Transposed Virtual Grid
| 0,0 (0) | 0,1 (1) | 0,2 (2) |
| 1,0 (3) | 1,1 (4) | 1,2 (5) |
|
program matrix_transpose
implicit none
integer n, m, nv, nl, mv, ml, i, il, iv, j, jl, jv
integer p, ndim, reorder, ierr, grid_comm
integer master, me, Iam, source, dest, tag
parameter (n=9, m=8, nv=3, mv=2, nl=n/nv, ml=m/mv)
parameter (ndim=2, reorder=1)
integer a(nl,ml), at(ml, nl), b(m,n)
include "mpif.h" !! This brings in pre-defined MPI constants, ...
integer dims(ndim), coord(ndim), req
logical period(ndim)
integer status(MPI_STATUS_SIZE)
data master/0/ !! 0 is defined as the master processor
data period/.false.,.false./ !! no cyclic boundary in either index
data tag/0/ !! a tag is not required in this case, set it to zero
data dest/0/ !! results are sent back to master
c**Starts MPI processes ...
call MPI_Init(ierr) !! starts MPI
call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr) !! get current process id
call MPI_Comm_size(MPI_COMM_WORLD, p, ierr) !! get number of processes
c**create cartesian topology for matrix
dims(1) = nv
dims(2) = mv
call MPI_Cart_create(MPI_COMM_WORLD, ndim, dims,
& period, reorder, grid_comm, ierr)
call MPI_Comm_rank(grid_comm, me, ierr)
call MPI_Cart_coords(grid_comm, me, ndim, coord, ierr)
iv = coord(1)
jv = coord(2)
c**define local matrix according to virtual grid coordinates, (iv,jv)
do jl=1,ml
do il=1,nl
i = il + iv*nl
j = jl + jv*ml
a(il,jl) = i*10 + j
enddo
enddo
c**perform transpose on local matrix
do jl=1,ml
do il=1,nl
at(jl,il) = a(il,jl)
enddo
enddo
c**send "at" to Master for asembly and printing
call MPI_Isend(at, ml*nl, MPI_INTEGER, Master, tag,
& grid_comm, req, ierr)
c**Master asembles all local transposes into final matrix and print
if(Iam .eq. Master)call asemble(at,ml,nl,grid_comm,b,m,n,p)
call MPI_Wait(req, status, ierr) !! make sure all sends done
call MPI_Finalize(ierr) !! let MPI finish up ...
end
subroutine asemble(at,ml,nl,comm,b,m,n,p)
implicit none
include "mpif.h"
integer ml, nl, comm, tag, source, m, n, ierr, p, ndim
integer iv, jv, i, j, il, jl, coord(2)
integer b(m,n), at(ml,nl), status(MPI_STATUS_SIZE)
data tag, ndim/0,2/
c**The Master asembles the final (transposed) matrix from local copies and print
do source=0,p-1
call MPI_Cart_coords(comm, source, ndim, coord, ierr)
call MPI_Recv(at,ml*nl, MPI_INTEGER, source, tag, comm,
& status, ierr)
iv = coord(1)
jv = coord(2)
do jl=1,nl
j = jl + iv*nl ! swap iv and jv for transpose
do il=1,ml
i = il + jv*ml
b(i,j) = at(il,jl)
enddo
enddo
enddo
write(*,'(9i5)')((b(i,j),j=1,n),i=1,m)
return
end