Parallel Fast Fourier Transform Massimiliano Guarrasi - [email protected] Carlo Cavazzoni - [email protected] SuperComputing Applications and Innovation Department - CINECA

1

FOURIER Transform

2

Property



∫ g(τ )h(t − τ )dτ ⇔ G( f )H ( f ) −∞





Corr( h,g) =

∫ h(τ )g(t + τ ) ⇔ H ( f )G ( f ) *

−∞





Corr( h,h ) =

∫ h(τ )h(t + τ ) ⇔ H ( f )

2

−∞

3



Discrete Fourier Transform (DFT) In many application contexts the Fourier transform is approximated with a Discrete Fourier Transform (DFT):

tk=Δk/N The last expression is periodic, with period N. It define a between 2 sets of numbers , Hn & hk ( H(fn) = ΔHn )

fn=n/Δ

4

Discrete Fourier Trasform (DFT)

frequencies from 0 to fc (maximum frequency) are mapped in the values with index from 0 to N/2-1, while negative ones are up to -fc mapped with index values ​of N / 2 to N

Scale like N*N 5

Fast Fourier Transform (FFT) The DFT can be calculated very efficiently using the algorithm known as the FFT, which uses symmetry properties of the DFT sum.

6

Fast Fourier Transform exp(2πi/N)

DFT of even terms

DFT of odd terms

7

Fast Fourier Transform Now Iterate:

Fe = Fee + Wk/2 Feo Fo = Foe + Wk/2 Foo You obtain a series for each value of fn

Foeoeooeo..oe = fn Scale like N*logN (binary tree) 8

Multi dimensional FFT

1) For each value of j and k Apply FFT to h(1...N,j,k)

z

2) For each value of i and k Apply FFT to h(i,1...N,k)

y

h(1...N,j,k)

x

3) For each value of i and j Apply FFT to h(i,j,1...N)

9

Parallel FFT Data Distribution P0 P1 P2 P3

3D Grid Distribute data along one coordinate (e.g. Z ) This is know as “Slab Decomposition”

10

Transform along x and y P1

P0

P2

x

P3

y

each processor trasform its own sub-grid along the x and y independently of the other

11

Data redistribution involving x and z P0 P1

z

z

P2 P3

x

y

y x

The data are now distributed along x 12

FFT along z P0

P1

P2

P3

z

each processor transform its own sub-grid along the z dimension independently of the other

13

Data are redistributed, back from x to z P0 P1 P2 P3

y x

x

y

The 3D array now has the original layout, but each element Has been substituted with its FFT.

14

http://www.fftw.org

15

16

17

18

MPI

OpenMP

19

20

21

fftw_plan

22

fftw_plan

23

24

25

26

27

Scalar 3D FFT with FFTW-2

Creating Plans SUBROUTINE THREED( nx, ny, nz ) INCLUDE '../fftw-2.1.3/fortran/fftw_f77.i' INTEGER :: nx, ny, nz INTEGER :: plan_fw INTEGER :: plan_bw COMPLEX*16, ALLOCATABLE :: data_in(:,:,:), data_out(:,:,:) REAL*8 :: scal ! initialize the fft, compute the multipling factors of the transform serie scal = 1.0d0 / DBLE( nx*ny*nz ) CALL fftw3d_f77_create_plan( plan_fw, nx, ny, nz, FFTW_FORWARD, FFTW_ESTIMATE ) CALL fftw3d_f77_create_plan( plan_bw, nx, ny, nz, FFTW_BACKWARD, FFTW_ESTIMATE ) ! Allocate and prepare the data for the FFT ALLOCATE( data_in(nx, ny, nz), data_out(nx, ny, nz) ) CALL setdata(data_in, nx, ny, nz )

fftw_test.f90 28

! Transform and scale the result CALL fftwnd_f77_one( plan_fw, data_in, data_out ) CALL zdscal( nx*ny*nz, scal, data_out, 1 )

Execute DFT

! Output results, at this stage you could compute convolutions and correlations ! as simple product, in O(n) steps . CALL putdata( data_out, nx, ny, nz ) ! BackTransform CALL fftwnd_f77_one( plan_bw, data_out, data_in ) ! Output results CALL putdata( data_in, nx, ny, nz )

Execute backward DFT

! Finalize and deallocate data CALL fftwnd_f77_destroy_plan( plan_fw ) CALL fftwnd_f77_destroy_plan( plan_bw ) DEALLOCATE( data_in, data_out ) RETURN END SUBROUTINE

Destroy Plans fftw_test.f90 29

Parallel 3D FFT with FFTW-2

PROGRAM fftw_test IMPLICIT NONE INCLUDE 'mpif.h' INTEGER :: ierr, nproc, mpime CALL MPI_INIT(IERR) CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPROC,IERR) CALL MPI_COMM_RANK(MPI_COMM_WORLD,MPIME,IERR) CALL THREED( 32, 8, 8, mpime, nproc ) CALL MPI_FINALIZE(IERR) END PROGRAM

fftw_mpi_test.f90 30

Execute backward DFT

Including Variables

SUBROUTINE THREED( nx, ny, nz, mpime, nproc ) IMPLICIT NONE INCLUDE 'mpif.h' INCLUDE '../fftw-2.1.3/fortran/fftw_f77.i' INTEGER :: nx, ny, nz, plan_mpi_fw, plan_mpi_bw INTEGER :: nproc, mpime, ierr, nzl, izl, nyl_at, iyl_at, size_fw, size_bw COMPLEX*16, ALLOCATABLE :: data_in(:), work(:) COMPLEX*16 :: val REAL*8 :: scal CALL myfftw3d_init( plan_mpi_fw, plan_mpi_bw, scal, nx, ny, nz, & nzl, izl, nyl_at, iyl_at, size_fw, size_bw) ALLOCATE( data_in( size_fw ), work( size_fw ) ) CALL setdata3d( data_in, nx, ny, nzl, izl ) CALL fftwnd_f77_mpi( plan_mpi_fw, 1, data_in, work, 1, FFTW_NORMAL_ORDER ) CALL zdscal( nx * ny * nzl, scal, data_in, 1 ) CALL putdata3d( 'fw', data_in, nx, ny, nzl, izl ) CALL fftwnd_f77_mpi( plan_mpi_bw, 1, data_in, work, 1, FFTW_NORMAL_ORDER ) CALL putdata3d( 'bw', data_in, nx, ny, nzl, izl ) CALL fftwnd_f77_mpi_destroy_plan( plan_mpi_fw ) CALL fftwnd_f77_mpi_destroy_plan( plan_mpi_bw ) DEALLOCATE( data_in, work ) RETURN END SUBROUTINE

Creating Plans

Execute DFT

Destroy Plans

fftw_mpi_test.f90 31

SUBROUTINE myfftw3d_init(plan_mpi_fw, plan_mpi_bw, scal, nx, ny, nz, & nzl, izl, nyl_at, iyl_at, size_fw, size_bw) IMPLICIT NONE INCLUDE 'mpif.h' INCLUDE '../fftw-2.1.3/fortran/fftw_f77.i' INTEGER :: nx, ny, nz, plan_mpi_fw, plan_mpi_bw, nzl, izl, nyl_at, iyl_at, & size_fw, nzl_bw, izl_bw, nyl_at_bw, iyl_at_bw, size_bw, ierr, nproc, mpime REAL*8 :: scal CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPROC,IERR) CALL MPI_COMM_RANK(MPI_COMM_WORLD,MPIME,IERR) scal = 1.0d0 / DBLE( nx*ny*nz ) CALL fftw3d_f77_mpi_create_plan( plan_mpi_fw, MPI_COMM_WORLD, & nx, ny, nz, FFTW_FORWARD, FFTW_ESTIMATE ) CALL fftw3d_f77_mpi_create_plan( plan_mpi_bw, MPI_COMM_WORLD, & nx, ny, nz, FFTW_BACKWARD, FFTW_ESTIMATE ) ! la matrice e' affettata per colonne (FORTRAN) CALL fftwnd_f77_mpi_local_sizes( plan_mpi_fw, nzl, izl, nyl_at, iyl_at, size_fw) CALL fftwnd_f77_mpi_local_sizes( plan_mpi_bw, nzl_bw, izl_bw, nyl_at_bw, & iyl_at_bw, size_bw) izl = izl + 1 iyl_at = iyl_at + 1 izl_bw = izl_bw + 1 iyl_at_bw = iyl_at_bw + 1 WRITE(6,*) mpime, ' (fw) : ', nzl , izl, nyl_at, iyl_at, size_fw WRITE(6,*) mpime, ' (bw) : ', nzl_bw, izl_bw, nyl_at_bw, iyl_at_bw, size_bw RETURN END SUBROUTINE

Creating Plans

Slice the matrix 32

SUBROUTINE setdata3d( a, nx, ny, nzl, INTEGER :: nx, ny, nzl, izl COMPLEX*16 :: a( nx*ny*nzl ) COMPLEX*16 :: val data_in = 0.0d0 val = 1.0d0 call fftw3d_mpi_setijk( val, a, 1, call fftw3d_mpi_setijk( val, a, 32, call fftw3d_mpi_setijk( val, a, 1, call fftw3d_mpi_setijk( val, a, 32, RETURN END SUBROUTINE

izl)

1, 1, 1, 1,

1, 1, 8, 8,

nx, nx, nx, nx,

ny, ny, ny, ny,

nzl, nzl, nzl, nzl,

izl izl izl izl

) ) ) )

SUBROUTINE fftw3d_mpi_setijk( val, a, i, j, k, nx, ny, nzl, izl ) IMPLICIT NONE COMPLEX*16 :: val, a(*) INTEGER :: i, j, k, nx, ny, nzl, izl IF( (izl