Both processes and threads are independent sequences of execution. The typical difference is that threads (of the same process) run in a shared memory space, while processes run in separate memory spaces.
MPI runs many programs (processes) and makes it easy to set up communication among them.
This program starts all the programs that should run: typically you give it your program and it runs n copies of it.
Here I show how mpirun just runs other programs for you.
! echo '-ls-'; ls -lart; echo
! echo '-date-'; date; echo
! echo '-mpirun-'; mpirun -n 1 ls -lart : -n 1 date; echo
This is a silly example, because the programs being run don't take part in mpirun's communication protocols. We'll learn today how to interact with mpirun to get information about the other processes. When using mpirun you don't have to worry about where your processes are executed; mpirun places them on multiple machines depending on how they are configured. You incorporate the MPI library in your code, allowing it to interact with mpirun and the other instances of your application.
Basically the same answers as to "Why parallelize my code?"
Specific to MPI:
One specification, many implementations:
C:
#include "mpi.h"
C++:
#include "boost/mpi/environment.hpp"
#include "boost/mpi/communicator.hpp"
namespace mpi = boost::mpi;
Fortran:
include 'mpif.h'
Python:
from mpi4py import MPI
MPI_Init -- called before any other MPI library routine. Initializes communication with mpirun.
MPI_Finalize -- called at the end of the MPI part of the code; signals to mpirun that this process is all done.
Typically they are the first and last things called in an application.
Note: Python (mpi4py) takes care of calls to these routines for you when you import the library and when your program ends.
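A minimal C sketch of this structure (error checking omitted; the printf is just a placeholder for your actual work):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    /* Initialize communication with mpirun before any other MPI call */
    MPI_Init(&argc, &argv);

    printf("MPI is up and running\n");

    /* Tell mpirun this process is done with MPI */
    MPI_Finalize();
    return 0;
}
```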
Communication between processes involves:
Point-to-point Communication
Collective Communication - several forms of organized, group communications
C Type | MPI Name |
---|---|
int | MPI_INT |
unsigned int | MPI_UNSIGNED |
float | MPI_FLOAT |
double | MPI_DOUBLE |
char | MPI_CHAR |
Fortran Type | MPI Name |
---|---|
INTEGER | MPI_INTEGER |
REAL | MPI_REAL |
DOUBLE PRECISION | MPI_DOUBLE_PRECISION |
COMPLEX | MPI_COMPLEX |
CHARACTER | MPI_CHARACTER |
Python inspects your data and handles this information for you.
MPI_PACKED, MPI_BYTE -- fine-grained control of data transfer
User-defined types, i.e. structures.
How do we perform this integral in parallel?
$\int\limits_a^b \cos(x)\, dx$
Well, we make it the sum of the integrals over parts of the space ($p$ domains):
$\sum\limits_{i=0}^{p-1}\int\limits_{a_i}^{a_{i+1}} \cos(x)\, dx$
Each integral can be solved numerically using the midpoint rule ($n$ points per domain):
$\sum\limits_{i=0}^{p-1}\sum\limits_{j=0}^{n-1} \cos(a_{ij}) \cdot h$
Where $h$ is the step size and $a_{ij}$ is the position:
$h = (b - a) / (n \cdot p)$
$a_{ij} = a + i \cdot n \cdot h + (j + 0.5) \cdot h$
This is an example of Single Program Multiple Data (SPMD) parallelization. The single program is the integral function. The multiple data are the limits of each partition.
To parallelize this application we'll use:
MPI_Init, MPI_Finalize -- start things up and then shut them down
MPI_Comm_rank -- what is the ID (called rank) of the current process
MPI_Comm_size -- how many processes are running
MPI_Send, MPI_Recv -- send and receive a buffer of information
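Here is a C sketch that puts these six routines together for the cosine integral; the choices of $a = 0$, $b = 1$, $n = 1000$ midpoints per domain, the message tag 0, and having rank 0 collect the partial sums are illustrative, not the course's example code:

```c
#include <math.h>
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size, i, j, n = 1000;          /* n midpoints per domain */
    double a = 0.0, b = 1.0;                 /* integration limits */
    double h, local_sum = 0.0, total, recv;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* ID (rank) of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes, p */

    h = (b - a) / n / size;                  /* step size */
    for (j = 0; j < n; j++)                  /* midpoint rule on this domain */
        local_sum += cos(a + rank * n * h + (j + 0.5) * h) * h;

    if (rank == 0) {                         /* rank 0 collects the partial sums */
        total = local_sum;
        for (i = 1; i < size; i++) {
            MPI_Recv(&recv, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
            total += recv;
        }
        printf("integral = %f (exact: %f)\n", total, sin(b) - sin(a));
    } else {
        MPI_Send(&local_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Only the scalar partial sums travel between processes; each rank does its own share of the arithmetic.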
MPI uses communicators to organize who is getting what. Each communicator is associated with a Group of processes.
MPI_COMM_WORLD is a communicator (an id or object, depending on language) that represents all the processes. It's there by default.
Later we'll discuss making your own Groups and their corresponding Communicators.
What can master be?
Hint: What does master represent, and then what value can it be?
Hint 1: Go through the code as if your ID is 0.
Hint 2: What needs to happen for an MPI_Send to be successful?
You can send messages to yourself. When you send data, a temporary buffer is used under the hood, and this is not necessary when sending to yourself, so it has traditionally been up to the MPI vendor whether or not to actually use a buffer for self-messaging. Historically this meant that self-messaging was flaky, but it is now part of the MPI standard and should be supported universally.
!mpif77 -o intro/basics/F77/example1_1.x intro/basics/F77/example1_1.f
!mpif77 --show
language | MPI compiler |
---|---|
C | mpicc |
Fortran 77 | mpif77 |
Fortran 90 | mpif90 |
C++ | mpicxx, mpic++, mpiCC |
mpirun
-np 4 -- replace '4' with the number of processes you want. MPI will take care of where they run; there's an automatic way of placing them on different nodes and a ton of options for controlling where they end up.
intro/basics/F77/example1_1.x -- your application
!mpirun -np 4 intro/basics/F77/example1_1.x
Python:
mpirun -np 4 python intro/basics/python/example1_1.py
The standard Python on the SCC is missing mpi4py, so you need to load a module to access this library, for example:
module load anaconda/2.2.0
C++:
You need the properly compiled boost library, which on SCC is:
module load boost/1.58.0
Use MPI_Comm_rank and MPI_Comm_size to figure out the process id and the number of processes, then print out a hello message:
Hello world from 0 of 4
Hello world from 1 of 4
Hello world from 2 of 4
Hello world from 3 of 4
You will need:
MPI_Comm_rank to get the id number of the current process
MPI_Comm_size to get the number of processes
Hint: in what order are the results collected?
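If you get stuck, a possible C sketch looks like the following (it is one way to do it, not the course's reference solution):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* ID of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Each process prints independently, so the output order is not guaranteed */
    printf("Hello world from %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Because every process prints on its own, the lines can arrive in any order, which is the point of the hint above.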
Asynchronous or non-blocking communication is possible. Normally when you send or receive, the call waits until the operation is complete. You can use alternative methods that return right away so you can continue to work while the data is sent/received behind the scenes.
Old Method | Non-blocking Method |
---|---|
MPI_Send | MPI_Isend |
MPI_Recv | MPI_Irecv |
You should be careful not to use the buffers involved in these non-blocking communications while the operations are ongoing.
You can use MPI_Probe to see if there is any communication ongoing and MPI_Wait to wait until communication is complete.
With MPI-3, there are now non-blocking collective communications, a somewhat more advanced topic.
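A sketch of the non-blocking pattern, assuming exactly two processes that exchange one double each (the tag and buffer contents are arbitrary choices for this illustration):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other;
    double outgoing, incoming;
    MPI_Request send_req, recv_req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                 /* assumes exactly 2 processes */
    outgoing = (double)rank;

    /* Post the communications; both calls return immediately */
    MPI_Irecv(&incoming, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(&outgoing, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &send_req);

    /* ... do useful work here, but don't touch outgoing or incoming ... */

    /* Block until both operations are complete */
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);

    printf("rank %d received %f\n", rank, incoming);
    MPI_Finalize();
    return 0;
}
```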
Modify example1_3 so that
Speed up $S$ is the ratio of $T_1$, the time for one process, to $T_N$, the time for $N$ processes. If a fraction $f$ of the code cannot be parallelized, then
$ S = \frac{T_1}{T_N} \leq \frac{T_1}{\left(f + \frac{1 - f}{N}\right)T_1} < \frac{1}{f}$
as
$ N \rightarrow \infty $
So if 90% of the code is parallelizable, then at best you can get 10x speed up.
Parallel Efficiency is $\frac{S}{N}$, and is ideally 1.
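For example, with $f = 0.1$ (90% parallelizable) and $N = 16$ processes:
$ S \leq \frac{1}{0.1 + \frac{0.9}{16}} = 6.4, \qquad \frac{S}{N} \leq \frac{6.4}{16} = 0.4 $
so even 16 processes buy at best a 6.4x speed up and a parallel efficiency of 0.4, well short of the 10x ceiling.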
+---------+ +----------+ +----------+ +-----------+
| P0 | | P1 | | P2 | | P3 |
| 0 | | 1 | | 2 | | 3 |
+---------+ +--+-------+ +---------++ +-----------+
| | | |
| | | |
+v-----v+ +v-------v+
| P0 | | P2 |
| 1 | | 5 |
| | | |
+-------+--------+ +------+---------+
| |
| |
| |
+---v--v---+
| P0 |
| 6 |
| |
+----------+
Each implementation has its own method for collective communication, and it's not always the fastest. Collective communications traditionally are blocking, though MPI-3 now has non-blocking versions.
Regardless, the main point is: collective communication is expensive, so use it sparingly.
Process 0 | Process 1* | Process 2 | Process 3 | Operation | Process 0 | Process 1* | Process 2 | Process 3 |
---|---|---|---|---|---|---|---|---|
(SendBuff) | (SendBuff) | (SendBuff) | (SendBuff) | | (RecvBuff) | (RecvBuff) | (RecvBuff) | (RecvBuff) |
 | b | | | MPI_Bcast | b | b | b | b |
a | b | c | d | MPI_Gather | | a,b,c,d | | |
a | b | c | d | MPI_Allgather | a,b,c,d | a,b,c,d | a,b,c,d | a,b,c,d |
 | a,b,c,d | | | MPI_Scatter | a | b | c | d |
a,b,c,d | e,f,g,h | i,j,k,l | m,n,o,p | MPI_Alltoall | a,e,i,m | b,f,j,n | c,g,k,o | d,h,l,p |

The left four columns are the send buffers before the call, the right four are the receive buffers after; * marks the root process for the rooted operations.
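As a sketch, the sum-of-ranks reduction drawn in the tree above can be expressed with a single collective call, MPI_Reduce; the choice of root 0 and the MPI_SUM operation are just for this illustration:

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process contributes its rank; process 0 receives the total.
       With 4 processes this is the 0+1+2+3 = 6 tree shown above. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}
```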
MPI_Wtime() returns a precise value of the time (in seconds) on a particular node. Call it before and after a block of code and use the difference to get timing information for a particular worker.
MPI_Wtick() returns the resolution of the timer.
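A sketch of the usual timing pattern (the dummy loop stands in for real work):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    double t_start, t_end, x = 0.0;
    int i;

    MPI_Init(&argc, &argv);

    t_start = MPI_Wtime();            /* time before the block of interest */
    for (i = 0; i < 100000000; i++)   /* stand-in for real work */
        x += 1.0e-8;
    t_end = MPI_Wtime();              /* time after */

    printf("work took %f s (timer resolution %g s), x = %f\n",
           t_end - t_start, MPI_Wtick(), x);

    MPI_Finalize();
    return 0;
}
```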
MPI_Probe - checks to see if a message is coming. If used with MPI_ANY_SOURCE, the source can be identified from the status.
MPI_Get_count - returns the number of elements coming in the message (useful for dynamic allocation).
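A sketch of how the two combine to receive a message whose length isn't known in advance; the payload of five doubles and the tag 0 are arbitrary:

```c
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, count;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double data[5] = {1, 2, 3, 4, 5};   /* arbitrary payload */
        MPI_Send(data, 5, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Block until some message arrives, from any source */
        MPI_Probe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);

        /* How many doubles are coming? Allocate exactly that much. */
        MPI_Get_count(&status, MPI_DOUBLE, &count);
        double *buf = malloc(count * sizeof(double));

        MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d doubles from rank %d\n", count, status.MPI_SOURCE);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```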
Sometimes you want to break your problem into multiple domains (think a 2D grid partitioned vertically and horizontally).
00000 11111
00000 11111
00000 11111
00000 11111
00000 11111
22222 33333
22222 33333
22222 33333
22222 33333
22222 33333
Then you can make new groups for vertical communication ((0, 2) and (1, 3)) and others for horizontal communication ((0, 1) and (2, 3)). Then for each group you'd make a corresponding communicator. Within these communicators, these processors will get communicator-specific id's of 0 and 1 respectively.
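A sketch of how this 2x2 layout could be set up with MPI_Comm_split (the row/column arithmetic assumes exactly 4 processes arranged as above):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, row, col, row_rank, col_rank;
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    row = rank / 2;   /* 0,1 -> row 0;    2,3 -> row 1 */
    col = rank % 2;   /* 0,2 -> column 0; 1,3 -> column 1 */

    /* Processes that pass the same "color" end up in the same new communicator */
    MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);  /* (0,1) and (2,3) */
    MPI_Comm_split(MPI_COMM_WORLD, col, rank, &col_comm);  /* (0,2) and (1,3) */

    MPI_Comm_rank(row_comm, &row_rank);   /* 0 or 1 within the row */
    MPI_Comm_rank(col_comm, &col_rank);   /* 0 or 1 within the column */

    printf("world rank %d: row rank %d, column rank %d\n",
           rank, row_rank, col_rank);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```

MPI_Comm_split groups together all processes that pass the same color; the key (here the world rank) orders the ranks within each new communicator.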
From there you can organize your code to be communicator independent. A communicator will be an argument, and from it you'll get the number of processes working on that task and the ID of the current process within that communicator's group.
From there you can build more complex geometries that may or may not be reflected in the networking (the process with the next id might be physically near you in the cluster or on the other side of the system).
The details of this are beyond this class, but here's a decent reference.
module load totalview
mpirun -np 4 -tv your_app_compiled_for_debugging
This is an advanced topic, but if you want to send data of compound types (structs), or contiguous or non-contiguous data of the same type, you can define your own MPI types.
For more information, see: http://static.msi.umn.edu/tutorial/scicomp/general/MPI/mpi_data.html
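As a small taste, here is a sketch that wraps four contiguous doubles into a user-defined type with MPI_Type_contiguous and sends one of them between two processes (the count of 4 is arbitrary):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank;
    double quad[4] = {0, 0, 0, 0};
    MPI_Datatype quad_type;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Define and register a new type: 4 contiguous doubles */
    MPI_Type_contiguous(4, MPI_DOUBLE, &quad_type);
    MPI_Type_commit(&quad_type);

    if (rank == 0) {
        quad[0] = 1.0; quad[1] = 2.0; quad[2] = 3.0; quad[3] = 4.0;
        MPI_Send(quad, 1, quad_type, 1, 0, MPI_COMM_WORLD);   /* one "quad" */
    } else if (rank == 1) {
        MPI_Recv(quad, 1, quad_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %f %f %f %f\n", quad[0], quad[1], quad[2], quad[3]);
    }

    MPI_Type_free(&quad_type);
    MPI_Finalize();
    return 0;
}
```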
Matrix Multiplication
Communicate only the relevant parts of A to all processes.
Repeat Practice Problem, but
Check out LLNL's practice problems for different ways to cut up numerical methods so they can be parallelized.