ME759 High Performance Computing for Engineering Applications
Parallel Computing with the Message Passing Interface (MPI)
November 6, 2013
© Dan Negrut, 2013 ME759 UW-Madison
“Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.” -- Winston Churchill
Before We Get Started…
Last time:
- Wrap-up of point-to-point communication in MPI: non-blocking flavors
- Collective action: barriers, communication, operations

Today:
- Collective action: operations
- User-defined types in MPI
- Departing thoughts: CUDA, OpenMP, MPI

Miscellaneous:
- No class on Friday; the time slot is set aside for the midterm exam
- Midterm exam is Nov. 25 at 7:15 PM in room 1163ME
- I will travel and miss four office hours: next week and the subsequent week. I am checking my email on a daily basis
- Review session on Monday, Nov. 25 during the regular class slot; attend if you have questions
- Final Project Proposal due at 11:59 PM on Nov. 15
Midterm & Final Project Partitioning
- If you are happy with your Midterm Project, it can become your Final Project
  - No midterm project report is due then
- If you are not happy with your Midterm Project selection: November 15 provides the opportunity to wrap it up and choose a different Final Project
  - The report should be detailed and follow the rules spelled out in the forum posting
- Nov. 15: the Final Project proposal should be uploaded
  - Do so even if you choose to continue your Midterm Project; in this case simply upload a one-liner stating this
  - If changing to a new project, submit a proposal that details the work to be done
- For the SPH default project: the student[s] with the fastest implementation will write a paper with Arman, Dan, and another lab member
MPI_Reduce
[Figure: MPI_Reduce with root = 1. Before the call, each of the five processes holds a triple in its send buffer (inbuf): ABC, DEF, GHI, JKL, MNO. After the call, only the root (rank 1) receives the result of combining the corresponding entries with the reduction operator o; for the first entries this is A o D o G o J o M. The send buffers are left unchanged.] [ICHEC]
Reduce Operation
[Figure: reduce operation across three processes. Input buffers: rank 0 holds (A0, B0, C0), rank 1 holds (A1, B1, C1), rank 2 holds (A2, B2, C2). After the reduce, with rank 0 as the root and addition as the operator, rank 0's output buffer holds (A0+A1+A2, B0+B1+B2, C0+C1+C2).] [A. Siegel]
MPI_Reduce

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);

IN   sendbuf    address of send buffer
OUT  recvbuf    address of receive buffer
IN   count      number of elements in send buffer
IN   datatype   data type of elements in send buffer
IN   op         reduce operation
IN   root       rank of root process
IN   comm       communicator

[A. Siegel]
MPI_Reduce example

MPI_Reduce(sbuf, rbuf, 6, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD)

sbuf (6 entries per process):
  P0:  3   4   2   8  12   1
  P1:  5   2   5   1   7  11
  P2:  2   4   4  10   4   5
  P3:  1   6   9   3   1   1

rbuf on the root (P0) after the call, the element-wise sum:
  P0: 11  16  20  22  24  18
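To make the picture above concrete, here is a minimal runnable sketch of the same reduction. The hard-coded per-rank values mirror the figure; the program structure itself is an assumption rather than code from the slides, and it expects to be launched with exactly 4 ranks.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
    // per-rank send buffers copied from the figure (ranks 0..3)
    int vals[4][6] = { {3, 4, 2,  8, 12,  1},
                       {5, 2, 5,  1,  7, 11},
                       {2, 4, 4, 10,  4,  5},
                       {1, 6, 9,  3,  1,  1} };
    int sbuf[6], rbuf[6], rank, nprocs, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (nprocs != 4) {                        // the figure assumes 4 processes
        if (rank == 0) printf("run with: mpiexec -np 4 ...\n");
        MPI_Finalize();
        return 1;
    }
    for (i = 0; i < 6; ++i) sbuf[i] = vals[rank][i];
    // element-wise sum of the four send buffers, collected on rank 0 only
    MPI_Reduce(sbuf, rbuf, 6, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        for (i = 0; i < 6; ++i) printf("%d ", rbuf[i]);   // 11 16 20 22 24 18
        printf("\n");
    }
    MPI_Finalize();
    return 0;
}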
MPI_Reduce, MPI_Allreduce
MPI_Reduce: the result is collected by the root only
The operation is applied element-wise to each element of the input arrays on each process

  MPI_Reduce(x, r, 10, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
  (input array x, output array r, array size 10, root 0)

MPI_Allreduce: the result is sent out to everyone

  MPI_Allreduce(x, r, 10, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

Credit: Allan Snavely
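A minimal sketch of the difference between the two calls above; the per-rank fill pattern for x is made up for illustration and is not from the slides.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
    int x[10], r[10], rank, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 10; ++i) x[i] = 10 * rank + i;    // arbitrary per-rank data
    // element-wise max over all ranks; only the root (rank 0) gets a valid r
    MPI_Reduce(x, r, 10, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
    // same reduction, but now every rank ends up with the same r
    MPI_Allreduce(x, r, 10, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    printf("rank %d: r[9] = %d\n", rank, r[9]);       // identical on all ranks
    MPI_Finalize();
    return 0;
}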
MPI_Allreduce
[Figure: Allreduce across three processes. Input buffers: rank 0 holds (A0, B0, C0), rank 1 holds (A1, B1, C1), rank 2 holds (A2, B2, C2). After the call, every rank's buffer holds (A0+A1+A2, B0+B1+B2, C0+C1+C2).] [A. Siegel]
MPI_Allreduce

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);

IN   sendbuf    address of send buffer
OUT  recvbuf    address of receive buffer
IN   count      number of elements in send buffer
IN   datatype   data type of elements in send buffer
IN   op         reduce operation
IN   comm       communicator
Example: MPI_Allreduce

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int my_rank, nprocs, gsum, gmax, gmin, data_l;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    data_l = my_rank;
    // three Allreduce calls: global sum, max, and min of the per-rank value
    MPI_Allreduce(&data_l, &gsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(&data_l, &gmax, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    MPI_Allreduce(&data_l, &gmin, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    printf("gsum: %d, gmax: %d gmin:%d\n", gsum, gmax, gmin);
    MPI_Finalize();
    return 0;
}
Example: MPI_Allreduce [Output]

[negrut@euler24 CodeBits]$ mpiexec -np 10 me759.exe
gsum: 45, gmax: 9 gmin:0
gsum: 45, gmax: 9 gmin:0
gsum: 45, gmax: 9 gmin:0
gsum: 45, gmax: 9 gmin:0
gsum: 45, gmax: 9 gmin:0
gsum: 45, gmax: 9 gmin:0
gsum: 45, gmax: 9 gmin:0
gsum: 45, gmax: 9 gmin:0
gsum: 45, gmax: 9 gmin:0
gsum: 45, gmax: 9 gmin:0
[negrut@euler24 CodeBits]$
MPI_SCAN
Performs a prefix reduction on data distributed across a communicator
The operation returns, in the receive buffer of the process with rank i, the reduction of the values in the send buffers of processes with ranks 0,...,i (inclusive)
The type of operations supported, their semantics, and the constraints on send and receive buffers are as for MPI_REDUCE
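As a minimal sketch of these semantics (the per-rank count n_local below is an illustrative value, not from the slides), an inclusive scan with MPI_SUM gives each rank the running total of all contributions up to and including its own:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, n_local, cum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    n_local = rank + 1;                        // pretend rank i owns i+1 items
    // inclusive prefix sum: rank i receives n_0 + n_1 + ... + n_i
    MPI_Scan(&n_local, &cum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: cumulative count = %d\n", rank, cum);
    MPI_Finalize();
    return 0;
}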
MPI_SCAN

[Figure: scan with operator o across five processes, each holding a triple in its send buffer (inbuf): ABC, DEF, GHI, JKL, MNO. After the call, the result buffers hold the running reductions of the corresponding entries; for the first entries, rank 0 gets A, rank 1 gets AoD, rank 2 gets AoDoG, rank 3 gets AoDoGoJ, and rank 4 gets AoDoGoJoM. The partial reductions are computed in parallel.] [ICHEC]
Scan Operation
[Figure: scan operation across three processes with addition. Input buffers: rank 0 holds (A0, B0, C0), rank 1 holds (A1, B1, C1), rank 2 holds (A2, B2, C2). Output buffers after the scan: rank 0 gets (A0, B0, C0), rank 1 gets (A0+A1, B0+B1, C0+C1), rank 2 gets (A0+A1+A2, B0+B1+B2, C0+C1+C2).] [A. Siegel]
MPI_Scan: Prefix reduction
Process i receives the data reduced over processes 0 through i.

MPI_Scan(sbuf, rbuf, 6, MPI_INT, MPI_SUM, MPI_COMM_WORLD)

sbuf (6 entries per process):          rbuf after the call:
  P0:  3   4   2   8  12   1            P0:  3   4   2   8  12   1
  P1:  5   2   5   1   7  11            P1:  8   6   7   9  19  12
  P2:  2   4   4  10   4   5            P2: 10  10  11  19  23  17
  P3:  1   6   9   3   1   1            P3: 11  16  20  22  24  18

[A. Snavely]
MPI_Scan

int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);

IN   sendbuf    address of send buffer
OUT  recvbuf    address of receive buffer
IN   count      number of elements in send buffer
IN   datatype   data type of elements in send buffer
IN   op         reduce operation
IN   comm       communicator

Note: count refers to the total number of elements that will be received into the receive buffer after the operation completes.

[A. Siegel]
#include "mpi.h" #include #include int main(int argc, char **argv){ int myRank, nprocs, i, n; int *result, *data_l; const int dimArray = 2; MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &nprocs); MPI_Comm_rank(MPI_COMM_WORLD, &myRank); data_l = (int *) malloc(dimArray*sizeof(int)); for (i = 0; i < dimArray; ++i) data_l[i] = (i+1)*myRank; for (n = 0; n < nprocs; ++n) { if( myRank == n ) { for(i=0; i 1, the operation returns, in the receive buffer of the process with rank i, the reduction of the values in the send buffers of processes with ranks 0,...,i-1 (inclusive)
The type of operations supported, their semantics, and the constraints on send and receive buffers are as for MPI_REDUCE
MPI_Exscan

int MPI_Exscan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);

IN   sendbuf    address of send buffer
OUT  recvbuf    address of receive buffer
IN   count      number of elements in send buffer
IN   datatype   data type of elements in send buffer
IN   op         reduce operation
IN   comm       communicator

[A. Siegel]
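A common use of the exclusive scan is computing each rank's starting offset into a globally distributed array. The sketch below assumes illustrative per-rank counts (not from the slides); since the receive buffer on rank 0 is left undefined by MPI_Exscan, it is set explicitly:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, n_local, offset = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    n_local = rank + 1;                        // pretend rank i owns i+1 items
    // exclusive prefix sum: rank i receives n_0 + ... + n_{i-1}
    MPI_Exscan(&n_local, &offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) offset = 0;                 // recvbuf is undefined on rank 0
    printf("rank %d: my data starts at global index %d\n", rank, offset);
    MPI_Finalize();
    return 0;
}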
#include "mpi.h" #include #include int main(int argc, char **argv){ int myRank, nprocs,i, n; int *result, *data_l; const int dimArray = 2; MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &nprocs); MPI_Comm_rank(MPI_COMM_WORLD, &myRank); data_l = (int *) malloc(dimArray*sizeof(int)); for (i = 0; i < dimArray; ++i) data_l[i] = (i+1)*myRank; for (n = 0; n < nprocs; ++n){ if( myRank == n ) { for(i=0; i