Lecture 5: Parallel Matrix Algorithms (part 3)


A Simple Parallel Dense Matrix-Matrix Multiplication

Let A = [a_ij]_{nΓ—n} and B = [b_ij]_{nΓ—n} be n Γ— n matrices. Compute C = AB.
β€’ Computational complexity of the sequential algorithm: O(n³).
β€’ Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≀ i, j < √p), each of size (n/√p) Γ— (n/√p).
β€’ Use a Cartesian topology to set up the process grid (a setup sketch follows below). Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
β€’ Remark: computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≀ k < √p.
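A minimal MPI sketch of the grid setup (assuming the number of ranks p is a perfect square; the variable names are illustrative, not from the lecture):

#include <math.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int p, grid_rank, dims[2], periods[2] = {1, 1}, coords[2];
    MPI_Comm grid_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* sqrt(p) x sqrt(p) grid, periodic in both dimensions (wraparound). */
    dims[0] = dims[1] = (int)(sqrt((double)p) + 0.5);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);

    /* (i, j) coordinates of this process in the grid. */
    MPI_Comm_rank(grid_comm, &grid_rank);
    MPI_Cart_coords(grid_comm, grid_rank, 2, coords);

    /* ... distribute blocks A_{i,j} and B_{i,j} according to coords ... */

    MPI_Finalize();
    return 0;
}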


β€’ Algorithm (an MPI sketch follows below):
  – Perform an all-to-all broadcast of the blocks of A in each row of processes.
  – Perform an all-to-all broadcast of the blocks of B in each column of processes.
  – Each process P_{i,j} then computes

        C_{i,j} = Ξ£_{k=0}^{√pβˆ’1} A_{i,k} B_{k,j}


Performance Analysis
β€’ √p rows of all-to-all broadcasts, each among a group of √p processes with message size n²/p. Communication time: t_s log √p + t_w (n²/p)(√p βˆ’ 1).
β€’ √p columns of all-to-all broadcasts. Communication time: t_s log √p + t_w (n²/p)(√p βˆ’ 1).
β€’ Computation time: √p Γ— (n/√p)³ = n³/p.
β€’ Parallel time (a worked form follows below): T_p = n³/p + 2(t_s log √p + t_w (n²/p)(√p βˆ’ 1)).


Memory Efficiency of the Simple Parallel Algorithm
β€’ Not memory efficient:
  – Each process P_{i,j} ends up holding 2√p blocks A_{i,k} and B_{k,j}.
  – Each process needs Θ(n²/√p) memory.
  – Total memory over all the processes is Θ(n² Γ— √p), i.e., √p times the memory of the sequential algorithm (a quick check follows below).
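A quick check of that total: p processes, each holding 2√p blocks of (n/√p)² entries, gives

p \cdot 2\sqrt{p} \cdot \left(\frac{n}{\sqrt{p}}\right)^{2} = 2\, n^{2} \sqrt{p} = \Theta\!\left(n^{2}\sqrt{p}\right)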


Cannon’s Algorithm for Matrix-Matrix Multiplication

Goal: improve the memory efficiency.
Let A = [a_ij]_{nΓ—n} and B = [b_ij]_{nΓ—n} be n Γ— n matrices. Compute C = AB.
β€’ Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≀ i, j < √p), each of size (n/√p) Γ— (n/√p).
β€’ Use a Cartesian topology to set up the process grid. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
β€’ Remark: computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≀ k < √p.
β€’ The contention-free formula (a 3 Γ— 3 check follows below):

      C_{i,j} = Ξ£_{k=0}^{√pβˆ’1} A_{i, (i+j+k) mod √p} B_{(i+j+k) mod √p, j}

Cannon’s Algorithm

// Initial alignment.
for i, j := 0 to √p βˆ’ 1 do
    Send block A_{i,j} to process (i, (j βˆ’ i + √p) mod √p) and
    block B_{i,j} to process ((i βˆ’ j + √p) mod √p, j);
endfor;
Each process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};

// Compute-and-shift: a sequence of one-step shifts pairs up A_{i,k} and B_{k,j}
// on process P_{i,j}, which computes C_{i,j} = C_{i,j} + A_{i,k} B_{k,j}.
for step := 1 to √p βˆ’ 1 do
    Shift A_{i,j} one step left (with wraparound) and B_{i,j} one step up (with wraparound);
    Each process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};
endfor;

Remark: the sends in the initial alignment shift A_{i,j} to the left (with wraparound) by i steps and shift B_{i,j} up (with wraparound) by j steps.
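A hedged MPI sketch of the compute-and-shift phase (the buffer names, the block size bs, and the local_mm helper are assumptions, not from the lecture; the blocks are assumed to be already aligned). MPI_Cart_shift finds the left/right and up/down neighbours on the periodic grid, and MPI_Sendrecv_replace (described below) performs each wraparound shift in place:

#include "mpi.h"

/* Assumed helper: naive bs x bs block multiply-accumulate, c += a * b (row-major). */
static void local_mm(int bs, const double *a, const double *b, double *c)
{
    for (int i = 0; i < bs; i++)
        for (int j = 0; j < bs; j++)
            for (int k = 0; k < bs; k++)
                c[i * bs + j] += a[i * bs + k] * b[k * bs + j];
}

/* Cannon's compute-and-shift loop on a periodic sqrt(p) x sqrt(p) grid.
   a_blk and b_blk hold this process's (already aligned) blocks of A and B. */
void cannon_shift_phase(MPI_Comm grid_comm, int sqrt_p, int bs,
                        double *a_blk, double *b_blk, double *c_blk)
{
    int left, right, up, down;
    MPI_Status status;

    /* Dimension 1 is the column index j (A shifts left); dimension 0 is the row index i (B shifts up). */
    MPI_Cart_shift(grid_comm, 1, -1, &right, &left);
    MPI_Cart_shift(grid_comm, 0, -1, &down, &up);

    local_mm(bs, a_blk, b_blk, c_blk);                 /* multiply the aligned blocks */

    for (int step = 1; step < sqrt_p; step++) {
        /* Shift A one step left and B one step up, with wraparound. */
        MPI_Sendrecv_replace(a_blk, bs * bs, MPI_DOUBLE, left, 0, right, 0,
                             grid_comm, &status);
        MPI_Sendrecv_replace(b_blk, bs * bs, MPI_DOUBLE, up, 1, down, 1,
                             grid_comm, &status);
        local_mm(bs, a_blk, b_blk, c_blk);             /* accumulate the next A_{i,k} B_{k,j} */
    }
}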

Cannon’s Algorithm for 3 Γ— 3 Matrices

[Figure: block layouts of A and B on the 3 Γ— 3 process grid — initial distribution, after the initial alignment, after shift step 1, and after shift step 2.]

Performance Analysis
β€’ In the initial alignment step, the maximum distance over which a block shifts is √p βˆ’ 1.
  – The circular shift operations in the row and column directions take time t_comm = 2(t_s + t_w n²/p).
β€’ Each of the √p single-step shifts in the compute-and-shift phase takes time t_s + t_w n²/p.
β€’ Multiplying the √p submatrices of size (n/√p) Γ— (n/√p) takes time n³/p.
β€’ Parallel time: T_p = n³/p + 2√p (t_s + t_w n²/p) + 2(t_s + t_w n²/p).


int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                         int dest, int sendtag, int source, int recvtag,
                         MPI_Comm comm, MPI_Status *status);

β€’ Executes a blocking send and receive. The same buffer is used both for the send and for the receive, so the message sent is replaced by the message received.
β€’ buf [in/out]: initial address of the send and receive buffer.


#include "mpi.h" #include int main(int argc, char *argv[]) { int myid, numprocs, left, right; int buffer[10]; MPI_Request request; MPI_Status status; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD, &numprocs); MPI_Comm_rank(MPI_COMM_WORLD, &myid); right = (myid + 1) % numprocs; left = myid - 1; if (left < 0) left = numprocs - 1; MPI_Sendrecv_replace(buffer, 10, MPI_INT, left, 123, right, 123, MPI_COMM_WORLD, &status); MPI_Finalize(); return 0;

}


DNS Algorithm
β€’ The algorithm is named after Dekel, Nassimi, and Sahni.
β€’ It is based on partitioning the intermediate data.
β€’ It performs matrix multiplication in time O(log n) by using O(n³/log n) processes.

The sequential algorithm for C = A Γ— B:

    C_ij = 0 for all i, j
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C_ij = C_ij + A_ik Γ— B_kj

Remark: the algorithm performs n³ scalar multiplications.

β€’ Assume that n³ processes are available for multiplying two n Γ— n matrices.
β€’ Then each of the n³ processes is assigned a single scalar multiplication.
β€’ The additions for each C_ij can be carried out simultaneously in log n steps.
β€’ Arrange the n³ processes in a three-dimensional n Γ— n Γ— n logical array.
  – The processes are labeled according to their location in the array, and the multiplication A_ik B_kj is assigned to process P[i,j,k] (0 ≀ i, j, k < n).
  – After each process performs a single multiplication, the contents of P[i,j,0], P[i,j,1], …, P[i,j,nβˆ’1] are added to obtain C_ij.

[Figure: the n Γ— n Γ— n logical process array, with the i and j axes labeled.]

β€’ The vertical column of processes P[i,j,*] computes the dot product of row A_{i,*} and column B_{*,j}.

β€’ The DNS algorithm has three main communication steps:
  1. Move the rows of A and the columns of B to their respective places.
  2. Perform a one-to-all broadcast along the j axis for A and along the i axis for B.
  3. Perform an all-to-one reduction along the k axis (a sketch of this step follows below).
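A hedged sketch of how step 3 might look in MPI for the variant with n³ processes (the 3-D communicator cube_comm, the coords array, and the function name are assumptions, not from the lecture):

#include "mpi.h"

/* All-to-one reduction along the k axis.  coords[] = (i, j, k) of this process
   in the n x n x n logical array, and prod = A[i][k] * B[k][j] was computed
   locally.  Returns C[i][j] on the process with k == 0 (unused elsewhere). */
double reduce_along_k(MPI_Comm cube_comm, const int coords[3], int n, double prod)
{
    MPI_Comm k_comm;
    double c_ij = 0.0;

    /* All processes sharing the same (i, j) form one communicator, ordered by k. */
    MPI_Comm_split(cube_comm, coords[0] * n + coords[1], coords[2], &k_comm);

    /* Tree reduction in about log(n) steps; the sum lands on the k = 0 process. */
    MPI_Reduce(&prod, &c_ij, 1, MPI_DOUBLE, MPI_SUM, 0, k_comm);

    MPI_Comm_free(&k_comm);
    return c_ij;
}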

