Example : Matrix Transpose

Demonstrates the usage of virtual topology.

We will demonstrate the usage of virtual topologies through a matrix transposition example. While the underlining operations of a matrix transpose is very simple mathematically, it is not at all trivial to program it in parallel efficiently by conventional procedures. More importantly here, it enables us to see how virtual topology can be used effectively to assist us in the programming process.

The algorithm we will follow is a somewhat little known fact that a matrix transpose can be performed in the following manner:

  1. Select p and q such that the total number of processes, nprocs = p x q. For simplicity, we will assume that p and q evenly divide n and m, respectively.
  2. Partition the n x m matrix into a (blocked) p x q matrix whose elements are themselves matrices of size (n/p) x (m/q). The elements of these sub-matrices may either be defined, computed or read in from a file.
  3. Perform a transpose on each of these sub-matrices. These are performed serially as the entire sub-matrix resides locally on a process. No interprocess communication is required.
  4. Perform a transpose on the p x q matrix; this means that the contents of the local transposed matrix must be communicated from location "p,q" to location "q,p" via message passing. This operation involves interprocess communications. On completion, the final transposed matrix is obtained.
  5. Note that in reality, the previous step is often not necessary. If you need to access the element (or sub-matrix) "p,q" of the transposed matrix, all you need do is to access the element "q,p" which has already been transposed locally. Depending on what comes next in the calculation, unnecessary message passing may be avoided.

i,j (p)
aij
As an example, we define a 9 x 4 matrix and will use 6 processes. Next, map it into a 3 x 2 virtual cartesian grid, i.e., p=3, q=2. Coincidentally, each element of this cartesian grid is in turn a matrix of size 3 x 2, as illustrated in the figure below. For the physical grid, each square box represents one entry of the matrix. The pair of indices, "i,j", on the first row gives the global cartesian coordinates while "(p)" is the process associated with the virtual grid allocated by calling MPI_Cart_create or MPI_Comm_split. aij on the second row is the value of the matrix element. On the right of the figure, the 3 x 2 virtual grid is depicted. Each box in this grid represents one process and contains one 3x2 submatrix. Finally, another communicator is created for the transposed virtual grid which has the dimensions of 2 x 3. The element at "1,0" of the transposed virtual grid, for instance, stores the value sent by the element at "0,1" of the virtual grid.

Physical Grid

     
Matrix Transpose
0,0 (0)
100
0,1 (0)
101
1,0 (0)
110
1,1 (0)
111
2,0 (0)
120
2,1 (0)
121
Matrix Transpose
0,2 (1)
102
0,3 (1)
103
1,2 (1)
112
1,3 (1)
113
2,2 (1)
122
2,3 (1)
123
Virtual Grid

Virtual Grid

0,0 (0)0,1 (1)
1,0 (2)1,1 (3)
2,0 (4)2,1 (5)
Matrix Transpose
3,0 (2)
130
3,1 (2)
131
4,0 (2)
140
4,1 (2)
141
5,0 (2)
150
5,1 (2)
151
Matrix Transpose
3,2 (3)
132
3,3 (3)
133
4,2 (3)
142
4,3 (3)
143
5,2 (3)
152
5,3 (3)
153
 
Matrix Transpose
6,0 (4)
160
6,1 (4)
161
7,0 (4)
170
7,1 (4)
171
8,0 (4)
180
8,1 (4)
181
Matrix Transpose
6,2 (5)
162
6,3 (5)
163
7,2 (5)
172
7,3 (5)
173
8,2 (5)
182
8,3 (5)
183
  Transposed Virtual Grid

Transposed Virtual Grid

0,0 (0)0,1 (1)0,2 (2)
1,0 (3)1,1 (4)1,2 (5)

      program matrix_transpose
      implicit none
      integer n, m, nv, nl, mv, ml, i, il, iv, j, jl, jv
      integer p, ndim, reorder, ierr, grid_comm
      integer master, me, Iam, source, dest, tag
      parameter (n=9, m=8, nv=3, mv=2, nl=n/nv, ml=m/mv)
      parameter (ndim=2, reorder=1)
      integer a(nl,ml), at(ml, nl), b(m,n)
      integer me2, grid_comm_t
      include "mpif.h"  !! This brings in pre-defined MPI constants, ...
      integer dims(ndim), coord(ndim), period(ndim)
      integer d(ml,nl), map(0:nv*mv-1)

      data master/0/    !! 0 is defined as the master processor
      data period/0,0/  !! no circular shift permited in either direction
      data tag/0/       !! a tag is not required in this case, set it to zero

c**Starts MPI processes ...

      call MPI_Init(ierr)                               !! starts MPI
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)     !! get current process id
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)       !! get number of processes

c**create cartesian topology for matrix
      dims(1) = nv
      dims(2) = mv
      call MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, 
     &       period, reorder, grid_comm, ierr)
      call MPI_Comm_rank(grid_comm, me, ierr)
      call MPI_Cart_coords(grid_comm, me, ndim, coord, ierr)
      iv = coord(1)
      jv = coord(2)

c**define local matrix according to virtual grid coordinates, (iv,jv)
      do jl=1,ml
        do il=1,nl
          i = il + iv*nl
          j = jl + jv*ml
          a(il,jl) = i*10 + j
        enddo
      enddo

c**perform transpose on local matrix
      do jl=1,ml
        do il=1,nl
          at(jl,il) = a(il,jl)
        enddo
      enddo

c**Transpose virtual grid to complete the transposition ...
c**First, create communicators for the transpose
      dims(1) = mv
      dims(2) = nv
      call MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, 
     &       period, reorder, grid_comm_t, ierr)
      call MPI_Comm_rank(grid_comm_t, me2, ierr)
      coord(1) = jv
      coord(2) = iv
      call MPI_Cart_rank(grid_comm_t, coord, dest, ierr)   !! where does the local copy go?

c**next, define mapping between ranks in grid_comm_t and grid_comm with transposition
      call table(mv,nv,map,grid_comm_t,grid_comm)

      source = map(me2)    !! who is to send me stuff?

      call message_passing(at,ml,nl,dest,tag,grid_comm_t,d,source)

      call MPI_Barrier(grid_comm_t,ierr)    !! wait for everyone ...

c**When dust finally settles, send the transposed "d" matrix to the Master for asembly
      call final_asembly(d,ml,nl,grid_comm_t,Master,tag,m,n,p,me)

      call MPI_Finalize(ierr)                      !! let MPI finish up ...

      end
      subroutine asemble(a,ml,nl,b,m,n,iv,jv)
      implicit none
      integer nl, ml, n, m, iv, jv, il, jl, i, j
      integer a(ml,nl), b(m,n)

      do jl=1,nl
        j = jl + (jv-1)*nl
        do il=1,ml
          i = il + (iv-1)*ml
          b(i,j) = a(il,jl)
        enddo
      enddo

      return
      end
      subroutine table(mv,nv,map,comm2,comm)
      implicit none
      include "mpif.h"
      integer mv, nv, comm, comm2, coord(2), proc1, proc2, ierr, pv, qv
      integer map(0:mv*nv-1)

      do pv=0,mv-1
        do qv=0,nv-1
          coord(1)=pv
          coord(2)=qv
          call MPI_Cart_rank(comm2,coord,proc2,ierr)
          coord(1)=qv
          coord(2)=pv
          call MPI_Cart_rank(comm,coord,proc1,ierr)
          map(proc2)=proc1
        enddo
      enddo

      return
      end
      subroutine final_asembly(d,ml,nl,comm,Master,tag,m,n,p,me)
      implicit none
      include "mpif.h"
      integer ml, nl, comm, Master, tag, source, m, n, p, me, ierr
      integer iv, jv, i, j
      integer b(m,n), d(ml,nl), req, status(MPI_STATUS_SIZE)
      integer cc(2,0:5)

      data cc/1,1,1,2,1,3,2,1,2,2,2,3/

      call MPI_Isend(d,ml*nl, MPI_INTEGER, Master, tag, comm,
     &       req, ierr)

c**The Master asembles the final (transposed) matrix from local copies and print
      if(me. eq. Master) then
        do source=0,p-1
          call MPI_Recv(d,ml*nl, MPI_INTEGER, source, tag, comm,
     &       status, ierr)
          iv=cc(1,source)
          jv=cc(2,source)
          call asemble(d,ml,nl,b,m,n,iv,jv)
        enddo
        write(*,'(9i5)')((b(i,j),j=1,n),i=1,m)
      endif

      return
      end

      subroutine message_passing(at,ml,nl,dest,tag,comm,d,source)
      implicit none
      include "mpif.h"
      integer ml, nl, dest, tag, comm, source, ierr
      integer at(ml,nl), d(ml,nl)
      integer req(2), status(MPI_STATUS_SIZE,2)

      call MPI_Isend(at, nl*ml, MPI_INTEGER, dest, tag,
     &     comm, req(1), ierr)
      call MPI_Irecv(d, nl*ml, MPI_INTEGER, source, tag,
     &     comm, req(2), ierr)
      call MPI_Waitall(2, req, status, ierr)

      return
      end