Both processes and threads are independent sequences of execution. The typical difference is that threads (of the same process) run in a shared memory space, while processes run in separate memory spaces.
MPI runs many programs (processes) and makes it easy to set up communication among them.
This program starts all the programs that should run: typically you give it your program and it runs n copies of it.
Here I show how mpirun just runs other programs for you.
! echo '-ls-'; ls -lart; echo
! echo '-date-'; date; echo
! echo '-mpirun-'; mpirun -n 1 ls -lart : -n 1 date; echo
This is a silly example, because the programs being run don't take part in mpirun's communication protocols. We'll learn today how to interact with mpirun to get information about the other processes. When using mpirun you don't have to worry about where your processes are executed; mpirun places them on multiple machines depending on how they are configured. You incorporate the MPI library in your code, allowing it to interact with mpirun and the other instances of your application.
Basically the same answers as to "Why parallelize my code?"
Specific to MPI:
One specification, many implementations:
C:
#include "mpi.h"
C++:
#include "boost/mpi/environment.hpp"
#include "boost/mpi/communicator.hpp"
namespace mpi = boost::mpi;
Fortran:
include 'mpif.h'
Python:
from mpi4py import MPI
MPI_Init -- called before any other MPI library routine. Initializes communication with mpirun.
MPI_Finalize -- called at the end of the MPI part of the code; signals to mpirun that this process is all done.
Typically they are the first and last things called in an application.
Note: Python (mpi4py) takes care of calls to these routines for you when you import the library and when your program ends.
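A minimal C sketch of this structure (error checking omitted; the printf is just a placeholder for your actual work):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    /* Initialize communication with mpirun before any other MPI call */
    MPI_Init(&argc, &argv);

    printf("MPI is up and running\n");

    /* Tell mpirun this process is done with MPI */
    MPI_Finalize();
    return 0;
}
```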
Communication between processes involves:
Point-to-point Communication
Collective Communication - several forms of organized, group communications
C Type | MPI Name |
---|---|
int | MPI_INT |
unsigned int | MPI_UNSIGNED |
float | MPI_FLOAT |
double | MPI_DOUBLE |
char | MPI_CHAR |
Fortran Type | MPI Name |
---|---|
INTEGER | MPI_INTEGER |
REAL | MPI_REAL |
DOUBLE PRECISION | MPI_DOUBLE_PRECISION |
COMPLEX | MPI_COMPLEX |
CHARACTER | MPI_CHARACTER |
Python inspects your data and handles this information for you.
MPI_PACKED, MPI_BYTE -- fine-grained control of data transfer
User-defined types, i.e. structures.
How do we perform this integral in parallel?
$\int\limits_a^b \cos(x)\, dx$
Well, we make it the sum of the integrals over parts of the space ($p$ domains):
$\sum\limits_{i=0}^{p-1}\int\limits_{a_i}^{a_{i+1}} \cos(x)\, dx$
Each integral can be solved numerically using the midpoint rule ($n$ points per domain):
$\sum\limits_{i=0}^{p-1}\sum\limits_{j=0}^{n-1} \cos(a_{ij}) \cdot h$
Where $h$ is the step size and $a_{ij}$ is the position:
$h = (b - a) / (n \cdot p)$
$a_{ij} = a + i \cdot n \cdot h + (j + 0.5) \cdot h$
This is an example of Single Program Multiple Data (SPMD) parallelization. The single program is the integral function. The multiple data are the limits of each partition.
To parallelize this application we'll use:
MPI_Init, MPI_Finalize -- start things up and then shut them down
MPI_Comm_rank -- what is the ID (called rank) of the current process
MPI_Comm_size -- how many processes are running
MPI_Send, MPI_Recv -- send and receive a buffer of information
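Here is a C sketch that puts these six routines together for the cosine integral; the choices of $a = 0$, $b = 1$, $n = 1000$ midpoints per domain, the message tag 0, and having rank 0 collect the partial sums are illustrative, not the course's example code:

```c
#include <math.h>
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size, i, j, n = 1000;          /* n midpoints per domain */
    double a = 0.0, b = 1.0;                 /* integration limits */
    double h, local_sum = 0.0, total, recv;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* ID (rank) of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes, p */

    h = (b - a) / n / size;                  /* step size */
    for (j = 0; j < n; j++)                  /* midpoint rule on this domain */
        local_sum += cos(a + rank * n * h + (j + 0.5) * h) * h;

    if (rank == 0) {                         /* rank 0 collects the partial sums */
        total = local_sum;
        for (i = 1; i < size; i++) {
            MPI_Recv(&recv, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
            total += recv;
        }
        printf("integral = %f (exact: %f)\n", total, sin(b) - sin(a));
    } else {
        MPI_Send(&local_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Only the scalar partial sums travel between processes; each rank does its own share of the arithmetic.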
MPI uses communicators to organize who is getting what. Each communicator is associated with a Group of processes.
MPI_COMM_WORLD is a communicator (an id or object, depending on language) that represents all the processes. It's there by default.
Later we'll discuss making your own Groups and their corresponding Communicators.
What can master be?
Hint: What does master represent, and then what value can it be?
Hint 1: Go through the code as if your ID is 0.
Hint 2: What needs to happen for an MPI_Send to be successful?
You can send messages to yourself. When you send data, a temporary buffer is used under the hood, and this is not necessary when sending to yourself, so it has traditionally been up to the MPI vendor whether or not to actually use a buffer for self-messaging. Historically this meant that self-messaging was flaky, but it is now part of the MPI standard and should be supported universally.
!mpif77 -o intro/basics/F77/example1_1.x intro/basics/F77/example1_1.f
!mpif77 --show
language | MPI compiler |
---|---|
C | mpicc |
Fortran 77 | mpif77 |
Fortran 90 | mpif90 |
C++ | mpicxx, mpic++, mpiCC |
mpirun
-np 4 -- replace '4' with the number of processes you want. MPI will take care of where they run; there's an automatic way of placing them on different nodes and a ton of options for controlling where they end up.
intro/basics/F77/example1_1.x -- your application
!mpirun -np 4 intro/basics/F77/example1_1.x
Python:
mpirun -np 4 python intro/basics/python/example1_1.py
The standard Python on the SCC is missing mpi4py, so you need to load a module to access this library, for example:
module load anaconda/2.2.0
C++:
You need the properly compiled boost library, which on SCC is:
module load boost/1.58.0
Use MPI_Comm_rank and MPI_Comm_size to figure out the process id and the number of processes, then print out a hello message:
Hello world from 0 of 4
Hello world from 1 of 4
Hello world from 2 of 4
Hello world from 3 of 4
You will need:
MPI_Comm_rank to get the id number of the current process
MPI_Comm_size to get the number of processes
Hint: in what order are the results collected?
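If you get stuck, a possible C sketch looks like the following (it is one way to do it, not the course's reference solution):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* ID of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Each process prints independently, so the output order is not guaranteed */
    printf("Hello world from %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Because every process prints on its own, the lines can arrive in any order, which is the point of the hint above.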
Asynchronous or non-blocking communication is possible. Normally when you send or receive, the call waits until the operation is complete. You can use alternative methods that return right away so you can continue to work while the data is sent/received behind the scenes.
Old Method | Non-blocking Method |
---|---|
MPI_Send | MPI_Isend |
MPI_Recv | MPI_Irecv |
You should be careful not to use the buffers involved in these non-blocking communications while the operations are ongoing.
You can use MPI_Probe to see if there is any communication ongoing and MPI_Wait to wait until communication is complete.
With MPI-3, there are now non-blocking collective communications, a somewhat more advanced topic.
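A sketch of the non-blocking pattern, assuming exactly two processes that exchange one double each (the tag and buffer contents are arbitrary choices for this illustration):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other;
    double outgoing, incoming;
    MPI_Request send_req, recv_req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                 /* assumes exactly 2 processes */
    outgoing = (double)rank;

    /* Post the communications; both calls return immediately */
    MPI_Irecv(&incoming, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(&outgoing, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &send_req);

    /* ... do useful work here, but don't touch outgoing or incoming ... */

    /* Block until both operations are complete */
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);

    printf("rank %d received %f\n", rank, incoming);
    MPI_Finalize();
    return 0;
}
```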
Modify example1_3 so that
Speed up $S$ is the ratio of $T_1$, the time for one process, to $T_N$, the time for $N$ processes. If a fraction $f$ of the code cannot be parallelized, then
$ S = \frac{T_1}{T_N} \leq \frac{T_1}{\left(f + \frac{1 - f}{N}\right)T_1} < \frac{1}{f}$
as
$ N \rightarrow \infty $
So if 90% of the code is parallelizable, then at best you can get 10x speed up.
Parallel Efficiency is $\frac{S}{N}$, and is ideally 1.
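For example, with $f = 0.1$ (90% parallelizable) and $N = 16$ processes:
$ S \leq \frac{1}{0.1 + \frac{0.9}{16}} = 6.4, \qquad \frac{S}{N} \leq \frac{6.4}{16} = 0.4 $
so even 16 processes buy at best a 6.4x speed up and a parallel efficiency of 0.4, well short of the 10x ceiling.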
+---------+ +----------+ +----------+ +-----------+
| P0 | | P1 | | P2 | | P3 |
| 0 | | 1 | | 2 | | 3 |
+---------+ +--+-------+ +---------++ +-----------+
| | | |
| | | |
+v-----v+ +v-------v+
| P0 | | P2 |
| 1 | | 5 |
| | | |
+-------+--------+ +------+---------+
| |
| |
| |
+---v--v---+
| P0 |
| 6 |
| |
+----------+
Each implementation has its own method for collective communication, and it's not always the fastest. Collective communications traditionally are blocking, though MPI-3 now has non-blocking versions.
Regardless, the main point is: collective communication is expensive, so use it sparingly.
Process 0 | Process 1* | Process 2 | Process 3 | Operation | Process 0 | Process 1* | Process 2 | Process 3 |
---|---|---|---|---|---|---|---|---|
(SendBuff) | (SendBuff) | (SendBuff) | (SendBuff) | | (RecvBuff) | (RecvBuff) | (RecvBuff) | (RecvBuff) |
 | b | | | MPI_Bcast | b | b | b | b |
a | b | c | d | MPI_Gather | | a,b,c,d | | |
a | b | c | d | MPI_Allgather | a,b,c,d | a,b,c,d | a,b,c,d | a,b,c,d |
 | a,b,c,d | | | MPI_Scatter | a | b | c | d |
a,b,c,d | e,f,g,h | i,j,k,l | m,n,o,p | MPI_Alltoall | a,e,i,m | b,f,j,n | c,g,k,o | d,h,l,p |

The left four columns are the send buffers before the call, the right four are the receive buffers after; * marks the root process for the rooted operations.
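As a sketch, the sum-of-ranks reduction drawn in the tree above can be expressed with a single collective call, MPI_Reduce; the choice of root 0 and the MPI_SUM operation are just for this illustration:

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process contributes its rank; process 0 receives the total.
       With 4 processes this is the 0+1+2+3 = 6 tree shown above. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}
```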
MPI_Wtime() returns a precise value of the time (in seconds) on a particular node. Call it before and after a block of code and use the difference to get timing information for a particular worker.
MPI_Wtick() returns the resolution of the timer.
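A sketch of the usual timing pattern (the dummy loop stands in for real work):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    double t_start, t_end, x = 0.0;
    int i;

    MPI_Init(&argc, &argv);

    t_start = MPI_Wtime();            /* time before the block of interest */
    for (i = 0; i < 100000000; i++)   /* stand-in for real work */
        x += 1.0e-8;
    t_end = MPI_Wtime();              /* time after */

    printf("work took %f s (timer resolution %g s), x = %f\n",
           t_end - t_start, MPI_Wtick(), x);

    MPI_Finalize();
    return 0;
}
```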
MPI_Probe - checks to see if a message is coming. If used with MPI_ANY_SOURCE, the source can be identified from the status.
MPI_Get_count - returns the number of elements coming in the message (useful for dynamic allocation).
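A sketch of how the two combine to receive a message whose length isn't known in advance; the payload of five doubles and the tag 0 are arbitrary:

```c
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, count;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double data[5] = {1, 2, 3, 4, 5};   /* arbitrary payload */
        MPI_Send(data, 5, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Block until some message arrives, from any source */
        MPI_Probe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);

        /* How many doubles are coming? Allocate exactly that much. */
        MPI_Get_count(&status, MPI_DOUBLE, &count);
        double *buf = malloc(count * sizeof(double));

        MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d doubles from rank %d\n", count, status.MPI_SOURCE);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```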
Sometimes you want to break your problem into multiple domains (think a 2D grid partitioned vertically and horizontally).
00000 11111
00000 11111
00000 11111
00000 11111
00000 11111
22222 33333
22222 33333
22222 33333
22222 33333
22222 33333
Then you can make new groups for vertical communication ((0, 2) and (1, 3)) and others for horizontal communication ((0, 1) and (2, 3)). Then for each group you'd make a corresponding communicator. Within these communicators, these processors will get communicator-specific id's of 0 and 1 respectively.
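A sketch of how this 2x2 layout could be set up with MPI_Comm_split (the row/column arithmetic assumes exactly 4 processes arranged as above):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, row, col, row_rank, col_rank;
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    row = rank / 2;   /* 0,1 -> row 0;    2,3 -> row 1 */
    col = rank % 2;   /* 0,2 -> column 0; 1,3 -> column 1 */

    /* Processes that pass the same "color" end up in the same new communicator */
    MPI_Comm_split(MPI_COMM_WORLD, row, rank, &row_comm);  /* (0,1) and (2,3) */
    MPI_Comm_split(MPI_COMM_WORLD, col, rank, &col_comm);  /* (0,2) and (1,3) */

    MPI_Comm_rank(row_comm, &row_rank);   /* 0 or 1 within the row */
    MPI_Comm_rank(col_comm, &col_rank);   /* 0 or 1 within the column */

    printf("world rank %d: row rank %d, column rank %d\n",
           rank, row_rank, col_rank);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```

MPI_Comm_split groups together all processes that pass the same color; the key (here the world rank) orders the ranks within each new communicator.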
From there you can organize your code to be communicator independent. A communicator will be an argument, and from it you'll get the number of processes working on that task and the ID of the current process within that communicator's group.
From there you can build more complex geometries that may or may not be reflected in the networking (the process with the next id might be physically near you in the cluster or on the other side of the system).
The details of this are beyond this class, but here's a decent reference.
module load totalview
mpirun -np 4 -tv your_app_compiled_for_debugging
This is an advanced topic, but if you want to send data of compound types (structs), or contiguous or non-contiguous data of the same type, you can define your own MPI types.
For more information, see: http://static.msi.umn.edu/tutorial/scicomp/general/MPI/mpi_data.html
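As a small taste, here is a sketch that wraps four contiguous doubles into a user-defined type with MPI_Type_contiguous and sends one of them between two processes (the count of 4 is arbitrary):

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank;
    double quad[4] = {0, 0, 0, 0};
    MPI_Datatype quad_type;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Define and register a new type: 4 contiguous doubles */
    MPI_Type_contiguous(4, MPI_DOUBLE, &quad_type);
    MPI_Type_commit(&quad_type);

    if (rank == 0) {
        quad[0] = 1.0; quad[1] = 2.0; quad[2] = 3.0; quad[3] = 4.0;
        MPI_Send(quad, 1, quad_type, 1, 0, MPI_COMM_WORLD);   /* one "quad" */
    } else if (rank == 1) {
        MPI_Recv(quad, 1, quad_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %f %f %f %f\n", quad[0], quad[1], quad[2], quad[3]);
    }

    MPI_Type_free(&quad_type);
    MPI_Finalize();
    return 0;
}
```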
Matrix Multiplication
Communicate only the relevant parts of A to all processes.
Repeat Practice Problem, but
Check out LLNL's practice problems for different ways to cut up numerical methods so they can be parallelized.