Lecture 5: Parallel Matrix Algorithms (part 3)
A Simple Parallel Dense Matrix-Matrix Multiplication

Let A = [a_ij] and B = [b_ij] be n × n matrices. Compute C = AB.
• Computational complexity of the sequential algorithm: O(n^3).
• Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Use a Cartesian topology to set up the process grid. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
• Remark: Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p.
• Algorithm:
  – Perform an all-to-all broadcast of the blocks of A in each row of processes.
  – Perform an all-to-all broadcast of the blocks of B in each column of processes.
  – Each process P_{i,j} computes C_{i,j} = Σ_{k=0}^{√p−1} A_{i,k} B_{k,j}.
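As a purely sequential illustration of the block formula C_{i,j} = Σ_k A_{i,k} B_{k,j}, here is a minimal C sketch, assuming n = 4 and p = 4 so the grid is 2 × 2 and each block is 2 × 2 (the names `blocked_matmul` and `block_multiply_add` are illustrative, not from the lecture):

```c
#include <string.h>

#define N 4          /* matrix order n */
#define Q 2          /* sqrt(p): the process grid is Q x Q */
#define BS (N / Q)   /* each block is BS x BS */

/* C-block (bi,bj) += A-block (bi,bk) * B-block (bk,bj); bi, bj, bk are block indices */
static void block_multiply_add(double A[N][N], double B[N][N], double C[N][N],
                               int bi, int bj, int bk) {
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[bi*BS + i][bj*BS + j] +=
                    A[bi*BS + i][bk*BS + k] * B[bk*BS + k][bj*BS + j];
}

/* C = A*B computed block by block: C_{i,j} = sum_k A_{i,k} B_{k,j} */
void blocked_matmul(double A[N][N], double B[N][N], double C[N][N]) {
    memset(C, 0, N * N * sizeof(double));
    for (int bi = 0; bi < Q; bi++)
        for (int bj = 0; bj < Q; bj++)
            for (int bk = 0; bk < Q; bk++)
                block_multiply_add(A, B, C, bi, bj, bk);
}
```

In the parallel algorithm each (bi, bj) pair lives on its own process; here the three block loops simply run one after another.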
Performance Analysis

• √p rows of all-to-all broadcasts, each among a group of √p processes with message size n^2/p; communication time: t_s log √p + t_w (n^2/p)(√p − 1).
• √p columns of all-to-all broadcasts; communication time: t_s log √p + t_w (n^2/p)(√p − 1).
• Computation time: √p × (n/√p)^3 = n^3/p.
• Parallel time: T_P = n^3/p + 2(t_s log √p + t_w (n^2/p)(√p − 1)).
Memory Efficiency of the Simple Parallel Algorithm

• Not memory efficient:
  – Each process P_{i,j} holds 2√p blocks A_{i,k} and B_{k,j}.
  – Each process needs Θ(n^2/√p) memory.
  – Total memory over all the processes is Θ(n^2 × √p), i.e., √p times the memory of the sequential algorithm.
Cannon's Algorithm for Matrix-Matrix Multiplication

Goal: improve the memory efficiency. Let A = [a_ij] and B = [b_ij] be n × n matrices. Compute C = AB.
• Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Use a Cartesian topology to set up the process grid. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
• Remark: Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p.
• The contention-free formula: C_{i,j} = Σ_{k=0}^{√p−1} A_{i,(i+j+k) mod √p} B_{(i+j+k) mod √p, j}.
Cannon's Algorithm

// make the initial alignment
for i, j := 0 to √p − 1 do
  Send block A_{i,j} to process (i, (j − i) mod √p) and block B_{i,j} to process ((i − j) mod √p, j);
endfor;
Each process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};

// compute-and-shift: a sequence of one-step shifts pairs up A_{i,k} and B_{k,j}
// on process P_{i,j}: C_{i,j} = C_{i,j} + A_{i,k} B_{k,j}
for step := 1 to √p − 1 do
  Shift A_{i,j} one step left (with wraparound) and B_{i,j} one step up (with wraparound);
  Each process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};
endfor;

Remark: In the initial alignment, the send operation shifts A_{i,j} to the left (with wraparound) by i steps and B_{i,j} up (with wraparound) by j steps.
Cannonβs Algorithm for 3 Γ 3 Matrices
[Figure: the 3 × 3 block grids of A and B shown initially, after the initial alignment, and after shift steps 1 and 2]
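The alignment and compute-and-shift steps can be simulated sequentially to check that Cannon's pairing really reproduces C = AB. A minimal sketch, assuming a 3 × 3 grid in which each "process" holds a single scalar block (`cannon` is an illustrative name):

```c
#include <string.h>

#define Q 3  /* process grid is Q x Q; each block is one scalar for clarity */

/* Sequential simulation of Cannon's algorithm: "process" (i,j) holds
   scalars a[i][j], b[i][j] and accumulates c[i][j]. */
void cannon(double A0[Q][Q], double B0[Q][Q], double C[Q][Q]) {
    double a[Q][Q], b[Q][Q];
    /* initial alignment: shift row i of A left by i, column j of B up by j */
    for (int i = 0; i < Q; i++)
        for (int j = 0; j < Q; j++) {
            a[i][j] = A0[i][(j + i) % Q];
            b[i][j] = B0[(i + j) % Q][j];
            C[i][j] = 0.0;
        }
    for (int step = 0; step < Q; step++) {
        /* multiply the resident blocks and accumulate */
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++)
                C[i][j] += a[i][j] * b[i][j];
        /* shift A one step left and B one step up (with wraparound) */
        double ta[Q][Q], tb[Q][Q];
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++) {
                ta[i][j] = a[i][(j + 1) % Q];
                tb[i][j] = b[(i + 1) % Q][j];
            }
        memcpy(a, ta, sizeof a);
        memcpy(b, tb, sizeof b);
    }
}
```

After step s, process (i, j) holds A_{i,(i+j+s) mod √p} and B_{(i+j+s) mod √p, j}, which is exactly the contention-free formula.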
Performance Analysis

• In the initial alignment step, the maximum distance over which a block shifts is √p − 1.
  – The circular shift operations in the row and column directions take time t_align = 2(t_s + t_w n^2/p).
• Each of the √p single-step shifts in the compute-and-shift phase takes time t_s + t_w n^2/p.
• Multiplying √p submatrices of size (n/√p) × (n/√p) takes time n^3/p.
• Parallel time: T_P = n^3/p + 2√p(t_s + t_w n^2/p) + 2(t_s + t_w n^2/p).
int MPI_Sendrecv_replace(
    void *buf, int count, MPI_Datatype datatype,
    int dest, int sendtag,
    int source, int recvtag,
    MPI_Comm comm, MPI_Status *status);

• Executes a blocking send and receive. The same buffer is used both for the send and for the receive, so the message sent is replaced by the message received.
• buf [in/out]: initial address of the send and receive buffer.
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid, numprocs, left, right;
    int buffer[10] = {0};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* left/right neighbors on a ring (with wraparound) */
    right = (myid + 1) % numprocs;
    left = (myid - 1 + numprocs) % numprocs;

    /* send buffer to the left neighbor, receive replacement from the right */
    MPI_Sendrecv_replace(buffer, 10, MPI_INT, left, 123, right, 123,
                         MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}
DNS Algorithm

• The algorithm is named after Dekel, Nassimi and Sahni.
• It is based on partitioning intermediate data.
• It performs matrix multiplication in time O(log n) by using O(n^3/log n) processes.

The sequential algorithm for C = A × B:
  for i = 0; i < n; i++
    for j = 0; j < n; j++
      C[i,j] = 0;
      for k = 0; k < n; k++
        C[i,j] = C[i,j] + A[i,k] × B[k,j]

Remark: The algorithm performs n^3 scalar multiplications.
• Assume that n^3 processes are available for multiplying two n × n matrices.
• Then each of the n^3 processes is assigned a single scalar multiplication.
• The additions for all C[i,j] can be carried out simultaneously in log n steps each.
• Arrange the n^3 processes in a three-dimensional n × n × n logical array.
  – The processes are labeled according to their location in the array, and the multiplication A[i,k]B[k,j] is assigned to process P[i,j,k] (0 ≤ i, j, k < n).
  – After each process performs a single multiplication, the contents of P[i,j,0], P[i,j,1], …, P[i,j,n−1] are added to obtain C[i,j].
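Adding n values in log n steps is a pairwise (tree) reduction: at each step, surviving positions absorb a partner at twice the previous stride. A sequential C sketch of that combining pattern (`pairwise_sum` is an illustrative name):

```c
/* Sum v[0..n-1] in ceil(log2(n)) combining steps, mimicking the pairwise
   additions across processes P[i,j,0], ..., P[i,j,n-1]. */
double pairwise_sum(double *v, int n) {
    for (int stride = 1; stride < n; stride *= 2)         /* one parallel step per doubling */
        for (int k = 0; k + stride < n; k += 2 * stride)  /* these pairs combine concurrently */
            v[k] += v[k + stride];
    return v[0];                                          /* total accumulates at position 0 */
}
```

In the parallel setting the inner loop is what runs simultaneously, so only the outer log n steps cost time.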
[Figure: the n × n × n logical process array, with axes labeled i and j]
• The vertical column of processes P[i,j,*] computes the dot product of row A[i,*] and column B[*,j].
• The DNS algorithm has three main communication steps:
  1. moving the rows of A and the columns of B to their respective places,
  2. performing a one-to-all broadcast along the j axis for A and along the i axis for B,
  3. performing an all-to-one reduction along the k axis.
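The data flow after these communication steps can be mimicked sequentially: each of the n^3 "processes" forms one product A[i,k]B[k,j], and the reduction along the k axis produces C[i,j]. A minimal sketch, assuming n = 3 (`dns_multiply` is an illustrative name):

```c
#define DN 3  /* matrix order n; the logical process array is DN x DN x DN */

/* Process P[i][j][k] computes A[i][k]*B[k][j]; summing along k yields C[i][j]. */
void dns_multiply(double A[DN][DN], double B[DN][DN], double C[DN][DN]) {
    double prod[DN][DN][DN];
    for (int i = 0; i < DN; i++)
        for (int j = 0; j < DN; j++)
            for (int k = 0; k < DN; k++)
                prod[i][j][k] = A[i][k] * B[k][j];  /* one scalar multiplication per process */
    for (int i = 0; i < DN; i++)
        for (int j = 0; j < DN; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < DN; k++)            /* all-to-one reduction along the k axis */
                C[i][j] += prod[i][j][k];
        }
}
```

In the real algorithm the inner sum is the log n-step reduction; here it is written as a plain loop since everything runs on one process.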