Fast Electromagnetic Integral-Equation Solvers on Graphics Processing Units

Shaojing Li (1), Ruinan Chang (1), Amir Boag (2), and Vitaliy Lomakin (1)

(1) Department of Electrical and Computer Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA; Tel: +1 (858) 822-4726; Fax: +1 (858) 534-5909; E-mail: [email protected]

(2) School of Electrical Engineering, Tel Aviv University, Ramat Aviv 69978, Israel; Tel: +972-3-6408246; Fax: +972-3-6423508; E-mail: [email protected]

Abstract

A survey of electromagnetic integral-equation solvers implemented on graphics processing units (GPUs) is presented. Several key points for efficient GPU implementations of integral-equation solvers are outlined. Three spatial-interpolation-based algorithms, including the Nonuniform-Grid Interpolation Method (NGIM), the box Adaptive-Integral Method (B-AIM), and the Fast Periodic-Interpolation Method (FPIM), are described to show the basic principles for optimizing GPU-accelerated fast integral-equation algorithms. It is shown that proper implementations of these algorithms lead to very high computational performance, with GPU-CPU speed-ups in the range of 100-300. Critical points for these accomplishments are (i) a proper arrangement of the data structure, (ii) an "on-the-fly" approach, trading excessive memory usage for increased arithmetic operations and data uniformity, and (iii) efficient utilization of the different types of GPU memory. The presented methods and their GPU implementations are geared towards creating efficient electromagnetic integral-equation solvers. They can also find a wide range of applications in a number of other areas of computational physics.

Keywords: Computational electromagnetics; electromagnetic analysis; high performance computing; integral equations; graphics processing units

1. Introduction

Electromagnetic integral-equation solvers are widely used for analyzing and designing devices and systems. The state of the art of integral-equation (IE) electromagnetic solvers has grown tremendously since the development of fast methods, both hierarchical – e.g., the Fast Multipole Method (FMM) [1-3] – and fast Fourier transform (FFT) based methods [4, 5]. These methods reduce the computational and memory cost from O(N^2) to O(N) or O(N log N), where N is the number of spatial degrees of freedom. The development of fast integral-equation methods was accompanied by efforts of parallelization, driven by the need for simulating more-complex and realistic systems. Recent advancements in the development of parallel hardware architectures have opened many exciting opportunities.

Multi-core central processing units (CPUs) and multi-CPU systems have replaced single-core, single-CPU computer configurations. Many scientists and researchers began to investigate the possibility of leveraging the floating-point computing capability of graphics processing units (GPUs). An important driving force of the development of GPUs has been their consumer-market niche of graphics processing, which, in many ways, is similar to scientific computing. For example, the NVIDIA GeForce GTX 680 GPU, at a cost of about $500, has 1536 stream processors with a performance of over 3 TFLOPs in single precision, and a memory bandwidth of more than 200 GBps. This is much more powerful than any existing general-purpose CPU. Multiple GPUs can fit into an inexpensive desktop computing node, offering the performance of a small-to-medium-size traditional CPU cluster. GPU supercomputers are also being built around the world, and three of the world's current top 10 supercomputers are GPU-based [6].


The high arithmetic throughput of GPUs makes them well suited for general-purpose scientific computing. GPU computing has become especially attractive with the introduction of the NVIDIA CUDA framework [7] and the recently established OpenCL standard [8], which allow writing high-level code in a C/C++-like language. GPUs have been used in many fields of scientific computing, such as computational biology, medical imaging, computational fluid dynamics, astrophysics, and micromagnetics [9-13]. In many areas, GPU-implemented computational tools have made qualitative changes in the way problems are approached. For example, in molecular dynamics, widely used open-source packages, such as GROMACS and LAMMPS, have GPU accelerator components that help drastically reduce the simulation time. Another example is a recently presented micromagnetic simulator, FastMag [12], which can solve problems of large size and complexity at high speed. The dramatically improved computing capability allows solving problems not accessible in the past.

GPU computing also has great potential for computational electromagnetics. In computational electromagnetics, GPUs have shown high performance in several numerical tasks. Efficient differential-equation solvers on GPUs, such as the Finite-Difference Time-Domain Method and the Finite-Element Method, have been developed, and also integrated with commercial solvers [14-22]. Efforts in the development of integral-equation methods have mostly been focused on porting existing direct Method-of-Moments (MoM) codes from CPU systems to GPU systems [23-27]. Several projects have also shown GPU implementations of acceleration schemes and accelerated integral-equation solvers on GPUs, including FFT-based approaches [28, 29] and hierarchical multi-level approaches [30-35]. Given the great potential of, and the rapidly growing interest in, GPU computing, a review of the current state and an outline of future opportunities of GPU computing for fast integral-equation methods in electromagnetics is anticipated to facilitate the development of high-performance computational-electromagnetics methods and simulators.

In this paper, we (i) review the current state of the art in GPU computing for electromagnetics; (ii) describe recent advancements in the development of fast methods for integral-equation solvers on GPUs, including hierarchical (O(N) or O(N log N)) methods and FFT-based (O(N log N) or O(N^{3/2})) methods; (iii) present a set of results comparing different methods; and (iv) outline future opportunities for GPU computing in electromagnetics. Three categories of fast methods are addressed: the Nonuniform-Grid Interpolation Method (NGIM) [36-39]; the Box-Based Adaptive-Integral Method (B-AIM), which is a modification of the Adaptive-Integral Method (AIM) [4]; and the Fast Periodic-Interpolation Method (FPIM) [40]. The Nonuniform-Grid Interpolation Method belongs to the family of "tree-codes," so the presented results should give hints for optimizing other algorithms belonging to the same category, such as the Fast Multipole Method (FMM). The GPU-accelerated Nonuniform-Grid Interpolation Method and other tree-codes are expected to have
wide applicability in various areas of electromagnetics and computational physics in general. The Adaptive-Integral Method is also a widely used algorithm for accelerating integral-equation solvers, and is of high importance for many applications. The Fast Periodic-Interpolation Method is a recently introduced technique for accelerating integral-equation solvers for general periodic unit-cell problems. All presented methods are kernel independent, and thus allow addressing problems of several classes by simply changing the integral kernel, i.e., changing the Green's function.

The paper is organized as follows. Section 2 discusses the GPU hardware architecture and software programming environment, and gives guidelines for developing efficient computational algorithms on GPUs. Section 3 briefly summarizes the integral-equation method in electromagnetics, outlines its critical components and bottlenecks, and explores the potential of acceleration techniques. Section 4 shows how direct iterative integral-equation solvers can be efficiently accelerated on GPUs. The direct approaches demonstrated in Section 4 present concepts of GPU acceleration for field evaluations, and serve as a baseline for comparison with fast integral-equation solvers on GPUs. Section 5 shows an efficient implementation of the Nonuniform-Grid Interpolation Method on GPUs. Section 6 presents an efficient implementation of the Adaptive-Integral Method on GPUs. Section 7 discusses implementations of the Fast Periodic-Interpolation Method on GPUs. GPU-CPU accelerations of ~150, with small absolute computational times, are demonstrated for all the acceleration techniques. Section 8 demonstrates results of volume integral-equation (VIE) solvers accelerated by the presented fast techniques and GPUs. Finally, Section 9 summarizes the findings of the presented work.

It should be noted that all the GPU results were obtained with an NVIDIA GeForce GTX 570 GPU. We also tested the codes with the newest GeForce GTX 680 GPU, and the results were between two and 2.5 times faster than those for the GTX 570 for all tested cases.

2. GPU Computing and the CUDA Programming Environment

A GPU typically contains a number of stream multiprocessors (SMs), each working as a single-instruction multiple-data (SIMD) processing unit; e.g., the GeForce GTX 580 has 16 stream multiprocessors (Figure 1). Each stream multiprocessor has 32 stream processors for NVIDIA GPUs (192 single-precision plus 64 double-precision processors for each SMX in the new Kepler generation of NVIDIA GPUs). GPUs have several types of memory, including global memory, shared memory, constant memory, and texture memory. The global memory is accessible by all stream multiprocessors and their stream processors. Each stream multiprocessor has a certain amount of shared memory, which is accessible simultaneously by all its stream processors. Each stream processor has a small number of registers. In addition, there is special-purpose texture memory, which is designed to better accommodate two-dimensional or three-dimensional addressing, and possesses a certain amount of linear filtering/interpolation functionality. The total number of stream processors can be very large (e.g., up to 1536 for NVIDIA GTX 680 GPUs), and the memory access can be very fast (provided it is accomplished via the proper approach). This makes GPUs well suited for carrying out the mathematically heavy operations often required in scientific simulations.

Figure 1. The NVIDIA GPU architecture.

In 2007, NVIDIA released CUDA, the first high-level general-purpose programming language for GPUs [7]. In the CUDA framework, GPUs use execution threads to feed data to the stream processors and to control the code execution. A single thread is not executed alone; instead, 32 threads are bundled together as a warp, and distributed by schedulers to be executed on the stream multiprocessors at the same time. Warps can further be bundled into thread blocks. Threads within the same block may not be active at the same time, but they share memory, which can be accessed with high throughput during the execution.

This unique processor and memory architecture gives GPUs certain advantages and disadvantages. An advantage of GPUs is their fast shared memory, which may have a bandwidth of up to several terabits per second, and can be accessed by threads within a block via a wide bus. Shared memory is therefore important for improving memory-access performance. However, shared memory is only accessible when the corresponding block is active, so every time a block of threads is dispatched by the scheduler to a multiprocessor, the shared memory has to be loaded with content from the global memory, and its content has to be written back after the block finishes its job. This communication between the shared and global memory usually suffers a noticeable latency. This latency can be alleviated by coalesced access, which is triggered when the threads in a warp access a contiguous memory address space, as well as when the L1 cache is used. Moreover, GPU algorithms usually have distinct optimization strategies. For example, many CPU implementations of algorithms tabulate data to reduce the number of arithmetic operations. GPUs generally have relatively small memory, but are very efficient in arithmetic operations. As a result, it may be beneficial to perform arithmetic operations on GPUs "on the fly." This may simultaneously reduce the memory consumption and alleviate the aforementioned impact of the global-memory access latency. Another feature of current GPUs is that they are less efficient in handling complicated branches and flow-control commands. As a result, it may be preferable for the threads in a block to execute non-branching tasks, even if this means padding the execution paths with redundant operations. In addition, GPU-CPU and GPU-GPU communications are much slower than any type of on-chip communication between the processors and the on-chip memories, so inter-device communication should be avoided to the extent possible.
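As a small illustration of the data-arrangement and coalescing points above, the following sketch contrasts an array-of-structures (AoS) layout with a structure-of-arrays (SoA) layout for source data. The struct and kernel names are illustrative only and are not taken from the codes described in this paper; the point is simply that, in the SoA layout, neighboring threads of a warp read neighboring words, which the hardware can coalesce into a few wide memory transactions.

```cuda
#include <cuda_runtime.h>

struct SourceAoS { float x, y, z, q; };   // array-of-structures: each thread strides over 16 bytes

struct SourcesSoA {                        // structure-of-arrays: contiguous per-component arrays
    const float *x, *y, *z;
};

__global__ void readAoS(const SourceAoS* s, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s[i].x + s[i].y + s[i].z;   // strided accesses, poorly coalesced
}

__global__ void readSoA(SourcesSoA s, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s.x[i] + s.y[i] + s.z[i];   // neighboring threads touch neighboring words
}
```

The same consideration drives the choice, used throughout the methods below, of staging such contiguously stored data through shared memory so that a whole block can reuse it.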

3. Integral-Equation Solvers

This paper uses volume integral-equation solvers implemented on GPUs to demonstrate the power of GPU computing for electromagnetics. A field-based volume integral equation for structures involving dielectric materials is [41]

\frac{\mathbf{D}}{\varepsilon_r} - \nabla \iiint_\Omega \nabla' \cdot \left( k_e \mathbf{D} \right) \frac{e^{-j k_0 |\mathbf{r}' - \mathbf{r}|}}{4\pi |\mathbf{r}' - \mathbf{r}|} \, dV' - k_0^2 \iiint_\Omega k_e \mathbf{D} \, \frac{e^{-j k_0 |\mathbf{r}' - \mathbf{r}|}}{4\pi |\mathbf{r}' - \mathbf{r}|} \, dV' = \mathbf{D}^{\mathrm{inc}} ,   (1)

where D is the total electric-flux density, \varepsilon_r is the relative permittivity of the material, k_0 is the free-space wavenumber, and k_e = 1 - 1/\varepsilon_r is the contrast ratio.

The electric-flux density is expanded over a set of basis functions, i.e., \mathbf{D} = \sum_{n=1}^{N} D_n \mathbf{f}_n(\mathbf{r}), where N is the number of basis functions and the D_n are unknown coefficients. A discretized form of the volume integral equation is obtained as

\sum_{n=1}^{N} \left[ \frac{\mathbf{f}_n(\mathbf{r})}{\varepsilon_r} - \nabla \iiint_\Omega \nabla' \cdot \left( k_e \mathbf{f}_n(\mathbf{r}') \right) \frac{e^{-j k_0 |\mathbf{r}' - \mathbf{r}|}}{4\pi |\mathbf{r}' - \mathbf{r}|} \, dV' - k_0^2 \iiint_\Omega k_e \mathbf{f}_n(\mathbf{r}') \frac{e^{-j k_0 |\mathbf{r}' - \mathbf{r}|}}{4\pi |\mathbf{r}' - \mathbf{r}|} \, dV' \right] D_n = \mathbf{D}^{\mathrm{inc}} .   (2)

Equation (2) can be tested – i.e., integrated – with testing functions to result in a system of algebraic equations:

\mathbf{Z} \mathbf{D} = \mathbf{D}^{\mathrm{inc}} ,   (3)

where Z is the impedance matrix, and D^inc is the projection of the incident field on the testing functions. For example, choosing the testing functions to be the same as the basis functions results in the following expressions for the impedance-matrix and incident-field elements:

Z_{mn} = \iiint_\Omega dV \, \mathbf{f}_m(\mathbf{r}) \cdot \frac{\mathbf{f}_n(\mathbf{r})}{\varepsilon_r} + \iiint_\Omega dV \, \nabla \cdot \mathbf{f}_m \iiint_\Omega dV' \, \nabla' \cdot \left( k_e \mathbf{f}_n \right) \frac{e^{-j k_0 |\mathbf{r} - \mathbf{r}'|}}{4\pi |\mathbf{r} - \mathbf{r}'|} - k_0^2 \iiint_\Omega dV \, \mathbf{f}_m \cdot \iiint_\Omega dV' \, k_e \mathbf{f}_n(\mathbf{r}') \frac{e^{-j k_0 |\mathbf{r} - \mathbf{r}'|}}{4\pi |\mathbf{r} - \mathbf{r}'|} ,   (4)

D_m^{\mathrm{inc}} = \iiint_\Omega \mathbf{D}^{\mathrm{inc}} \cdot \mathbf{f}_m(\mathbf{r}) \, dV .

Equations (2)-(4) can be solved iteratively, and the left-hand side of Equation (3) needs to be evaluated repeatedly until convergence is reached.

The integrals in Equation (4) can be evaluated using quadrature rules with a number of quadrature nodes N_q, accompanied by a singularity-extraction procedure. This results in the following representation of the algebraic equation:

\mathbf{P}^T \left( \mathbf{Z}_l + \mathbf{Z}_0 \right) \mathbf{P} \mathbf{D} = \mathbf{D}^{\mathrm{inc}} .   (5)

Here, P is an N_q × N matrix that maps the unknown coefficients, D_n, to the quadrature-node field coefficients, determined via the product Q = PD. The matrix P^T is the transpose of P, and it maps the quadrature-node fields to the testing-function coefficients. The matrix Z_0 describes the local corrections (singularity extractions) in the potential. The matrices Q, P, and Z_0 are sparse, and have non-vanishing entries only for elements corresponding to overlapping basis-testing-function domains. The computation of the matrix-vector products corresponding to these matrices scales as O(N). The matrix Z_l is dense; it represents a mapping from N_q scalar charges to N_q scalar observers, with entries given by e^{-jk|r_m - r_n|}/|r_m - r_n|, and the corresponding transformation,

D_m^l = \sum_{\substack{n=1 \\ n \neq m}}^{N_q} \frac{e^{-j k |\mathbf{r}_m - \mathbf{r}_n|}}{|\mathbf{r}_m - \mathbf{r}_n|} \, Q_n , \quad m = 1, 2, \ldots, N_q ,   (6)


typically is a bottleneck for iterative electromagnetic integral-equation solvers. The computational cost of the transformation in Equation (6) scales as O(N_q^2) if the summation is evaluated directly. The computational cost can be reduced to O(N_q) or O(N_q log N_q) via various acceleration algorithms [4, 42] and using parallelization.

The goal of this paper is to demonstrate how the transformation in Equation (6) and the integral equation in Equation (1) can be rapidly solved on GPUs using the direct summations in Equation (6), with O(N_q^2) computational cost, and using fast algorithms, with O(N_q) or O(N_q log N_q) computational cost.
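To make the operator structure of Equation (5) concrete, the sketch below applies P, the dense kernel of Equation (6), and the local corrections Z_0 in matrix-free form, as done once per iteration of the solver. Here the dense part is evaluated by direct summation, which is exactly the O(N_q^2) step that the fast methods of Sections 5-7 replace. The CSR storage, the function names, and the interleaved coordinate array are illustrative choices, not the data structures of the solvers described in this paper.

```cpp
#include <cmath>
#include <complex>
#include <vector>
using cplx = std::complex<double>;

struct Csr {                           // sparse matrix in CSR form (for P, P^T, Z_0)
    int rows;
    std::vector<int> ptr, col;
    std::vector<cplx> val;
};

// y = A x for a CSR matrix: O(nnz) work
static std::vector<cplx> spmv(const Csr& A, const std::vector<cplx>& x) {
    std::vector<cplx> y(A.rows, cplx(0));
    for (int r = 0; r < A.rows; ++r)
        for (int k = A.ptr[r]; k < A.ptr[r + 1]; ++k)
            y[r] += A.val[k] * x[A.col[k]];
    return y;
}

// One matrix-vector product of Eq. (5): P^T (Z_l + Z_0) P D.
// The dense Z_l part is applied by the direct sum of Eq. (6); a fast method
// (NGIM, B-AIM, FPIM) would replace only this double loop.
std::vector<cplx> applyZ(const Csr& P, const Csr& Pt, const Csr& Z0,
                         const std::vector<double>& xyz,   // quadrature-node coordinates, interleaved x,y,z
                         double k, const std::vector<cplx>& D) {
    std::vector<cplx> Q = spmv(P, D);            // quadrature-node "charges", Q = P D
    std::vector<cplx> F = spmv(Z0, Q);           // sparse local corrections
    int Nq = static_cast<int>(Q.size());
    for (int m = 0; m < Nq; ++m) {               // dense part, Eq. (6): O(Nq^2)
        cplx acc(0);
        for (int n = 0; n < Nq; ++n) {
            if (n == m) continue;
            double dx = xyz[3*m] - xyz[3*n], dy = xyz[3*m+1] - xyz[3*n+1], dz = xyz[3*m+2] - xyz[3*n+2];
            double R = std::sqrt(dx*dx + dy*dy + dz*dz);
            acc += std::exp(cplx(0.0, -k * R)) / R * Q[n];
        }
        F[m] += acc;
    }
    return spmv(Pt, F);                          // back to the testing-function coefficients
}
```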

4. GPU Implementations of O(N_q^2) Iterative Integral-Equation Solvers

When the summation in Equation (6) is directly evaluated, the computational cost scales as O(N_q^2). The goals of this section are to present a baseline for comparing the performance among different codes, and to describe important GPU implementation approaches that are also used in Sections 5-7 for more-complex fast techniques.

One approach to computing the spatial convolution on GPUs is the "explicit matrix" approach. In this approach, the impedance matrix is allocated, filled, and transferred to the GPU device, and the corresponding matrix-vector multiplication is then performed using general-purpose linear-algebra libraries or user-defined CUDA kernels. Literature illustrating this approach includes [23, 24, 26, 43]. In these papers, the "explicit matrix" approach was shown to produce substantial speed-ups, but storing the impedance matrix on the GPU limited the problem size to only about N = 10,000 on the existing GPUs. To allow addressing larger problems, an approach was introduced in which the impedance matrix was transferred to the GPU device block by block, and the matrix-vector products were executed with these blocks. This approach allowed considering matrices of any size, but it suffered from major performance reductions, due to the slow memory transfer between the CPUs and GPUs. A simple solution to this problem is to compute the elements of the impedance matrix "on the fly" [44]. This approach allows considering problems of any size, avoids unnecessary memory operations, and leads to very high performance. In the following paragraphs, this approach is referred to as the "GPU direct method."

The basic principle of the GPU direct method is "one-thread-per-observer" parallelization and "block-by-block accumulation" of fields. One-thread-per-observer is chosen instead of the alternative one-thread-per-interaction approach because the latter would require a substantial amount of shared memory, and the code would be memory-bandwidth limited [44]. The block-by-block traversal ensures that each global-memory access is coalesced, and that the data loaded into the shared memory can be reused by all threads in the same block. Figure 2 shows the thread arrangement and memory-loading scheme, and Figure 3 shows the source code of the device kernel.

The computational times of both the CPU and GPU direct methods are shown in Figure 4. The comparison was between one core of an Intel i7 950 CPU and an NVIDIA GeForce GTX 570 GPU. It was evident that the GPU provided very large speed-ups for the direct method (up to several thousand, compared with the CPU "on-the-fly" calculation). Importantly, this high performance was obtained starting from small values of N (as small as 8K). The high performance stemmed from the simplicity of the problem, which led to a uniform parallel execution path and an arithmetic-intensive workload. Although the computational time increases as O(N^2), the GPU version of the direct method is still useful for moderately large problems. Interestingly, as will be shown when analyzing the fast methods in Sections 5-7, a CPU fast (e.g., O(N log N)) method on a single CPU core does not outperform the GPU direct code until N is as large as one million.
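The kernel below is a minimal sketch in the spirit of Figures 2 and 3: one thread per observer, with the sources staged through shared memory tile by tile, so that global-memory reads are coalesced and the scalar Helmholtz Green's function of Equation (6) is evaluated on the fly. The names, the fixed tile size, and the single-precision data types are illustrative assumptions; this is not a reproduction of the authors' kernel.

```cuda
#include <cuda_runtime.h>

#define TILE 256   // launch with TILE threads per block (illustrative choice)

// Direct evaluation of Eq. (6): one thread accumulates the field at one observer.
__global__ void directFieldKernel(const float4* __restrict__ srcPos,  // x, y, z (w unused padding)
                                  const float2* __restrict__ srcAmp,  // complex source amplitude
                                  const float3* __restrict__ obsPos,  // observer coordinates
                                  float2*       field,                // complex output field
                                  int nSrc, int nObs, float k)
{
    __shared__ float4 sPos[TILE];
    __shared__ float2 sAmp[TILE];

    int m = blockIdx.x * blockDim.x + threadIdx.x;                    // observer index
    float3 r = (m < nObs) ? obsPos[m] : make_float3(0.f, 0.f, 0.f);
    float2 acc = make_float2(0.f, 0.f);

    for (int tile = 0; tile < nSrc; tile += TILE) {
        int j = tile + threadIdx.x;                                   // coalesced load of one source tile
        sPos[threadIdx.x] = (j < nSrc) ? srcPos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        sAmp[threadIdx.x] = (j < nSrc) ? srcAmp[j] : make_float2(0.f, 0.f);
        __syncthreads();

        int count = min(TILE, nSrc - tile);
        for (int n = 0; n < count; ++n) {                             // all threads reuse the shared tile
            float dx = r.x - sPos[n].x, dy = r.y - sPos[n].y, dz = r.z - sPos[n].z;
            float R = sqrtf(dx*dx + dy*dy + dz*dz);
            if (R > 0.f) {                                            // skip the self term
                float s, c;
                sincosf(-k * R, &s, &c);                              // exp(-jkR) computed on the fly
                float g = 1.0f / R;
                acc.x += g * (c * sAmp[n].x - s * sAmp[n].y);         // complex multiply and accumulate
                acc.y += g * (c * sAmp[n].y + s * sAmp[n].x);
            }
        }
        __syncthreads();
    }
    if (m < nObs) field[m] = acc;
}
```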

5. GPU Implementations of Nonuniform-Grid-Interpolation-Method-Based Fast Iterative Integral-Equation Solvers

Although GPUs give high speed-ups for the direct method, the quadratic scaling of the computational time eventually makes the code unusable for solving complex large-scale structures. Multi-level tree-structured fast algorithms have been widely used to reduce the complexity from O(N_q^2) to O(N_q) or O(N_q log N_q) on CPUs. Early attempts to port the Fast Multipole Method to GPUs were devoted to cases of non-oscillatory kernels, e.g., a static Green's function with k = 0 [31]. While showing significant overall speed-ups, these efforts had relatively insignificant speed-ups for the often-dominant translation part of the method. A recent book chapter [45] showed more efficient implementations that can handle problems with up to 10^7 unknowns on an NVIDIA GTX 295 GPU with 1.8 GB of memory. Tree-code and Fast Multipole Methods on multiple-GPU systems were also investigated in [35], in which significant GPU-CPU speed-ups and absolute performance were demonstrated. However, the complex memory-access pattern and the need for evaluating special functions may complicate Fast Multipole Method implementations on GPUs. Electromagnetic Fast Multipole Methods are even more complex, and only a few implementations have been presented [30]. In this paper, we describe the basic principles and procedures of the Nonuniform-Grid Interpolation Method, which works equally well for electromagnetic problems in all frequency regimes, without the need for significant changes.

Figure 2. The thread-execution pattern of the GPU direct method.


Figure 3. The GPU direct method kernel code.

Figure 4. The computational times of the CPU and GPU direct methods.


Compared to other tree-code-like algorithms, the Nonuniform-Grid Interpolation Method has unique properties in geometric adaptivity, kernel independency, and high parallelizability for GPU systems. At the same time, GPU implementations of the Nonuniform-Grid Interpolation Method have many features that can be directly used for GPU implementations of low- and high-frequency Fast Multipole Methods.

The Nonuniform-Grid Interpolation Method algorithm is implemented using a hierarchical domain-decomposition method, which is similar to the Fast Multipole Methods. The computational domain is subdivided into a hierarchy of levels of boxes, which form an octal tree. For each box, near-field and far-field boxes are identified for distances smaller or larger than a predefined distance. The field potentials contributed by sources in the near-field boxes are evaluated via direct superposition. The field potential far from a source distribution is a function with a known asymptotic behavior. This allows smoothing the fast spatial variations of the potential, computing it on a sparse grid, and interpolating to the required observation points (Figure 5). Two sets of grids are constructed, namely the nonuniform grids (NG) and the Cartesian grids (CG). The nonuniform grids are used to represent fields outside a source domain, and are defined for all levels. The Cartesian grids are used to represent fields inside a domain free of sources, and are defined only for levels below a certain interface level [39], which is defined as the level for which the boxes are sufficiently small compared to the wavelength. All boxes at lower levels have a small size, and these levels are referred to as low-frequency levels. All levels above the interface level are referred to as high-frequency levels.

Figure 5. The field-evaluation procedure in the one-level Nonuniform-Grid Interpolation Method. The far field is estimated by interpolations via a nonuniform grid and a Cartesian grid. The near-field evaluation is done directly. The multilevel Nonuniform-Grid Interpolation Method involves additional inter-level accumulation and decomposition processes.

The multilevel Nonuniform-Grid Interpolation Method consists of several sequential stages:

Stage 0 (near-field evaluation): All near-field interactions between the sources in the near-field boxes at the finest level are computed via direct superposition. This step is completely independent of the rest of the algorithm, and can be separately implemented and executed in parallel with all other stages on multi-GPU systems.

Stage 1 (finest-level nonuniform-grid construction): The computation starts with directly computing the field values on the nonuniform grids, serving as sample points, for all source boxes at the finest level.

Stage 2 (aggregation of nonuniform grids/upward pass): The field values at the nonuniform grids of the boxes at coarser levels are computed by aggregation from their child boxes at finer levels. Such aggregation involves local interpolation and common-distance compensation of the amplitude and phases between the corresponding nonuniform grids.

Stage 3 (nonuniform-grid-to-Cartesian-grid transitions, and Cartesian-grid decomposition/downward pass): Field values at the Cartesian grids of an observer box on a specific level come from two contributions. For the boxes at all low-frequency levels below the interface level, the first contribution is from the same-level interaction-list boxes via their nonuniform-grid samples. These boxes have their parent box as a neighbor of the observer box, but exclude those that have already been accounted for in the near-field stage. For the interface-level boxes, the first contribution comes from the same-level interaction-list boxes, as well as all high-frequency boxes whose hierarchical parents are in their corresponding-level interaction list. The second contribution at the Cartesian grids of all low-frequency levels below the interface level comes from the parent box. The fields at the Cartesian grids of an observer box are found via interpolations from the nonuniform-grid samples in the former case, and from the Cartesian-grid samples in the latter case.

Stage 4 (Cartesian grids to observers): The field values at the observation points are obtained by local interpolations from the Cartesian grids at the finest level of the domain subdivision.
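All of these stages operate on sources and observers grouped by boxes. One common way to set up such a grouping — sketched below with illustrative names, not the authors' actual data structures — is to compute a linear box index for every point at the finest level and then sort the point indices by box, so that each CUDA block later finds its box's data in a contiguous, coalesced-readable range.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Linear index of the finest-level box containing point pts[i].
// 'origin' is the corner of the computational domain, 'boxSize' the edge of a
// finest-level box, and nb the number of boxes per dimension (all illustrative).
__global__ void boxIndexKernel(const float3* pts, int n, float3 origin,
                               float boxSize, int nb, int* boxIdx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int ix = min(nb - 1, max(0, (int)((pts[i].x - origin.x) / boxSize)));
    int iy = min(nb - 1, max(0, (int)((pts[i].y - origin.y) / boxSize)));
    int iz = min(nb - 1, max(0, (int)((pts[i].z - origin.z) / boxSize)));
    boxIdx[i] = (iz * nb + iy) * nb + ix;
}

// Sort point indices by box index, so each box owns a contiguous index range.
void groupPointsByBox(const thrust::device_vector<float3>& pts, float3 origin,
                      float boxSize, int nb,
                      thrust::device_vector<int>& boxIdx,
                      thrust::device_vector<int>& perm)
{
    int n = static_cast<int>(pts.size());
    boxIdx.resize(n);
    perm.resize(n);
    thrust::sequence(perm.begin(), perm.end());
    int threads = 256, blocks = (n + threads - 1) / threads;
    boxIndexKernel<<<blocks, threads>>>(thrust::raw_pointer_cast(pts.data()), n,
                                        origin, boxSize, nb,
                                        thrust::raw_pointer_cast(boxIdx.data()));
    thrust::sort_by_key(boxIdx.begin(), boxIdx.end(), perm.begin());
}
```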


Figure 6. The source code for Stage 4 of the Nonuniform­Grid Interpolation Method, showing how fields on actual observers are obtained via interpolations on the GPU.

The source code for Stage 4 of the Nonuniform-Grid Interpolation Method (Figure 6) shows how GPUs obtain the field at the observers by interpolating from the Cartesian grids. All other stages have similar thread and memory arrangements. For each stage, a "one-thread-per-observer" type of parallelization is implemented, which is similar to the GPU direct approach described in Section 4. Here, the "observers" can be either the actual observers where the final field values are computed, or intermediate observers, such as grid samples, depending on the actual tasks done at each stage. The number of threads per block can be chosen by the user, or can be determined by the properties of the hardware, the number of unknowns of the problem, and the source/observer distribution. One or several thread blocks may be launched to handle a certain box if the number of observers is large. Since observers belonging to the same box require the same operations and data, and are handled by the same threads, these threads can share data during the computation. This allows reducing the memory-access count, using the shared memory, and using coalesced access when transferring data from the global to the shared memory.

Field transformations on grid samples are done via interpolations. Linear or cubic Lagrange interpolations are chosen, as they match the architecture of GPUs well. All interpolations, including the computation of the interpolation coefficients, are done on the fly. This reduces the memory consumption and the global-memory read-and-write penalty. We have also tested an approach in which the interpolation coefficients are tabulated in the pre-processing step and used while the simulation is running. This approach is designed for the CPU Nonuniform-Grid Interpolation Method, but on GPUs, the on-the-fly approach is faster in most cases. In the CPU Nonuniform-Grid Interpolation Method, the need for pre-computation of the interpolation coefficients leads to much larger memory consumption. Implementation details of each stage and a stage-by-stage performance breakdown were described in [39]. Here, the overall performance of two versions of the Nonuniform-Grid Interpolation Method is shown.

The computational cost of the Nonuniform-Grid Interpolation Method is O(N_q) in the low-frequency case, O(N_q log N_q) in the high-frequency case, and between O(N_q) and O(N_q log N_q) in the mixed-frequency case.
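As an illustration of the on-the-fly interpolation-weight computation used here (and shown in Figure 6 for Stage 4), the device function below evaluates the four cubic Lagrange weights along one axis of a regular Cartesian grid; the full three-dimensional interpolation is the tensor product of such weights in x, y, and z. The function name and the grid parameterization are assumptions for illustration, not the authors' code.

```cuda
// Cubic Lagrange weights along one axis of a uniform grid with spacing h.
// The point lies between nodes i0+1 and i0+2 of the four-node stencil
// i0 .. i0+3; t in [0,1) is the fractional offset from node i0+1.
__device__ void lagrangeCubicWeights(float coord, float gridOrigin, float h,
                                     int* i0, float w[4])
{
    float u  = (coord - gridOrigin) / h;      // position in grid units
    int   ic = (int)floorf(u);                // left node of the central interval
    *i0 = ic - 1;                             // first node of the 4-point stencil
    float t = u - (float)ic;                  // fractional offset

    // Lagrange basis polynomials for nodes at -1, 0, 1, 2 (in units of h)
    w[0] = -t * (t - 1.0f) * (t - 2.0f) / 6.0f;
    w[1] =  (t + 1.0f) * (t - 1.0f) * (t - 2.0f) / 2.0f;
    w[2] = -(t + 1.0f) * t * (t - 2.0f) / 2.0f;
    w[3] =  (t + 1.0f) * t * (t - 1.0f) / 6.0f;
}
```

Computing these weights in registers, rather than reading tabulated coefficients from global memory, is the trade of arithmetic for memory traffic discussed above.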


Figure 7. The computational time for the Nonuniform-Grid Interpolation Method. The times of the CPU and GPU direct methods are shown for reference.

Figure 7 shows the computational time of the CPU and GPU versions of the Nonuniform-Grid Interpolation Method in the low- and high-frequency regimes. The asymptotic complexity is O(N) for low-frequency problems and O(N log N) for high-frequency problems. The increase of the computational complexity in the high-frequency case is due to the fact that the density of the sample points of the nonuniform grids does not decrease as the distance between source and observer increases. The computational-time results shown in Figure 7 were obtained using an NVIDIA GeForce GTX 570 GPU and an Intel i7-950 CPU, respectively. The GPU-CPU speed-ups were around 150-400 in the range of problem sizes between 2K and 16M. Due to memory limitations, the largest problem size that a single GeForce GTX 570 GPU with 1.2 GB of memory can run is 28M. A single Tesla GPU with 4 GB of memory can handle problems of 64M unknowns for low-frequency applications, and 160M for static applications.

The memory consumption as a function of the problem size is worth mentioning as well. Due to the different implementations of the interpolations in the GPU and CPU versions of the Nonuniform-Grid Interpolation Method, the CPU Nonuniform-Grid Interpolation Method consumes much more memory than the GPU Nonuniform-Grid Interpolation Method. For example, the memory required by the CPU implementation of the Nonuniform-Grid Interpolation Method for a problem of 8M unknowns was 18 GB, which was almost 50 times larger than that of the GPU code. Implementing the on-the-fly approach to reduce the memory consumption in the CPU code would result in a significant increase in the computational time.

6. GPU Implementations of Adaptive-Integral-Method-Based Fast Iterative Integral-Equation Solvers

Another category of fast algorithms uses the FFT to reduce the complexity of the convolution operation between the sources and the Green's function to O(N log N). The most popular FFT-based fast methods are the Adaptive-Integral Method (AIM) and the pre-corrected FFT (pFFT) [4, 5]. They are widely used due to their simplicity, high accuracy, and parallelizability [46].

These two methods adopt similar philosophies to make the FFT-based convolution applicable to nonuniformly distributed sources and observers. They create relatively sparse uniform grids of samples, project the source excitations from the actual sources to the grid samples, calculate the interactions between the source and observer grids, and interpolate the fields at the actual observers from the observer-grid samples. The projections and interpolations introduce errors, which are large when sources and observers are close to each other. Both the Adaptive-Integral Method and the pre-corrected FFT have correction mechanisms for this inaccuracy. For each observer, interactions from sources residing within a certain range of the observer are identified as the near fields, and are supposed to be calculated directly. Other interactions, from sources outside of this near-field domain, are identified as the far fields, and are calculated via grid interactions. Since the interactions between the grids are done through FFTs, the inaccurate near-field components are calculated again and removed from the total field. A detailed description of the algorithms can be found in [4, 5].

The Box-Based Adaptive-Integral Method (B-AIM) algorithm presented here follows a similar philosophy, but has a different approach to arranging the data structures for the projections and corrections. These data structures make the Box-Based Adaptive-Integral Method well suited for parallelization on GPU and CPU platforms. The computational domain is divided into subdomains, referred to as boxes, as shown in Figure 8. The number of boxes can be set by the user or by other criteria, e.g., to result in a set average number of sources or observers per box. Empty boxes are excluded from the computations. Two grids are constructed with respect to the boxes: one for emulating the field generated by the actual sources, referred to as the source grid, and the other for estimating the field results at the actual observers, referred to as the observer grid. In many cases (e.g., for free-space problems), these two grids can be overlapped, and are thus represented by a single array in memory; but in some cases, e.g., periodic problems [40], these two grids are shifted to ensure fast convergence of the Green's function. The grids can be expressed as two N_g-length vectors, I and U, respectively. The algorithm proceeds with the following stages:


1. Projection: Lagrange interpolations are used for projecting the actual sources to the source grid. The interpolation provides convergent results as long as the sources to be projected and the observers to be interpolated do not belong to the same or nearby boxes. This condition holds, as the interactions between sources and observers within the same and nearby boxes are calculated directly in the later near-field correction stage. The result of the projection operation can be expressed as an N × N_g matrix V, so that I = VQ.

2. Grid-interaction calculation: The field generated by the source samples on the source grid is calculated at each observer-grid sample via a convolution, U = G_grid ⊗ I. This is done by convolving the grid-sample matrix with the Green's-function matrix via an FFT. The FFTW library was used for the CPU version, and NVIDIA's CUFFT library was used for the GPU version.

3. Interpolation: In this stage, the fields at the actual observers are found by interpolating from the field values of the observer-grid samples. The interpolation operation is the reciprocal of the projection, so it can be expressed as the transpose of V, so that F = V^T U, where F is the approximation to the fields at the actual observers. Combining the above equations, this coarse estimation can be summarized as

\mathbf{F} = \mathbf{V}^T \, \text{invFFT}\left\{ \text{FFT}\left[\mathbf{G}_{\text{grid}}\right] \odot \text{FFT}\left[\mathbf{V}\mathbf{Q}\right] \right\} .   (7)

4. Near-field correction: The approximate near-field parts of F are substituted for by field values computed directly. This substitution requires a second pass of the previous stages and direct calculation of a portion of Z. This process has O(1) complexity for a single observer, and O(N) complexity overall. This correction can be summarized as

\hat{\mathbf{F}} = \mathbf{F} - \mathbf{V}_{\text{near}}^T \left[ \mathbf{G}_{\text{grid,near}} \otimes \left( \mathbf{V}_{\text{near}} \mathbf{Q} \right) \right] + \mathbf{Z}_{\text{near}} \mathbf{Q} .   (8)

Figure 8. The far-field and near-field interactions in the Box-Based Adaptive-Integral Method.

For classical Adaptive-Integral Method/pre-corrected-FFT implementations, a list of near sources is usually maintained for each observer (or basis function) to accurately calculate the near fields. Projection and interpolation coefficients are often tabulated. This near-field list can be seen as a sparse matrix, the nonzero elements of which depend on the sources' geometrical distribution. In our CPU Box-Based Adaptive-Integral Method code, we also tabulate such interpolation matrices. This is necessary for the reason stated in Section 5, but it requires a large amount of memory for large problems. For GPUs, this approach is not optimal, for several reasons. First, preparing these sparse matrices usually requires multiple complex branches, which are not suitable for GPUs. Second, the large memory consumption limits the largest problem size the GPU can solve, especially when the sparse matrices are very unstructured. Indeed, relatively limited GPU-CPU speed-ups (of around seven times) for conventional Adaptive-Integral Method implementations have been shown [28, 29, 47].

In the Box-Based Adaptive-Integral Method, the domain decomposition is performed similarly to the finest level of the Nonuniform-Grid Interpolation Method (or the Fast Multipole Method). The grids are associated with the boxes, and not with specific observers. The observers are associated with the boxes as well, so the mappings between grids and sources/observers are achieved through a two-stage mapping. This mapping makes sources and observers within the same box have the same near- and far-field source lists, as well as the same projection- and interpolation-grid-point lists. The box-level division of the far and near fields eliminates the need for keeping near-field lists for each observer. Near-field lists are generated in the near-field correction stage from the positions of the boxes on the fly. This approach is similar to the finest level in the tree codes discussed in Section 5. In fact, due to these similarities, our Nonuniform-Grid Interpolation Method and Box-Based Adaptive-Integral Method implementations share the same modules. This approach reduces the memory requirements, allows reusing the data, directly maps boxes and their observers to CUDA blocks and threads, allows using coalesced access of the global memory, allows using the shared memory, and increases the cache-hit ratio during global-memory loading.


Using interpolations allows for fast on-the-fly computations, and allows accounting for various complex Green's functions, e.g., periodic and layered-medium Green's functions. Typically, the average number of sources per box is chosen to be a constant, which means that the number of boxes and the number of grid samples are proportional to the number of sources, N_q. This leads to an O(N_q log N_q) computational complexity of Stage 2. Stages 1, 3, and 4 contain only local operations, so they have an O(N_q) complexity. The overall asymptotic complexity of the algorithm is O(N_q log N_q). The memory complexity of the algorithm is O(N_q).

The computational times of the Box-Based Adaptive-Integral Method are shown in Figure 9 for cubic interpolations with a 10^-4 average L1 error. A single NVIDIA GeForce GTX 570 GPU was compared with one core of an Intel i7-950 CPU. All computational times were obtained using the optimal settings for the CPU and GPU codes, respectively. The times of both the CPU and GPU codes scaled nearly linearly, starting from very small problem sizes. The speed-ups between the GPU and CPU implementations were around 100-200. For example, one field evaluation using the GPU Box-Based Adaptive-Integral Method cost 0.48 s for a problem of one million unknowns, which was 210, 110, and 2.6 × 10^5 times faster than the CPU Box-Based Adaptive-Integral Method, the GPU direct method, and the CPU direct method, respectively. The largest problem that can be solved on GTX 570 cards is 8 million unknowns. We also note that the memory consumption can be considerably further reduced.
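For the grid-interaction stage (Stage 2), the convolution U = G_grid ⊗ I can be carried out with NVIDIA's CUFFT library as sketched below: forward transforms of the (zero-padded) source grid and of the precomputed Green's-function grid, a point-wise spectral multiplication, and an inverse transform. This is a minimal free-space sketch with illustrative names; it omits the zero-padding bookkeeping and the near-field correction of Equation (8).

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Point-wise spectral product: Ihat <- scale * Ghat .* Ihat (complex multiply per sample).
__global__ void spectralMultiply(const cufftComplex* Ghat, cufftComplex* Ihat,
                                 int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cufftComplex g = Ghat[i], v = Ihat[i];
    Ihat[i].x = scale * (g.x * v.x - g.y * v.y);
    Ihat[i].y = scale * (g.x * v.y + g.y * v.x);
}

// U = invFFT{ FFT[G_grid] .* FFT[I] } on an nx x ny x nz (padded) grid.
// Ghat is assumed to already hold FFT[G_grid] (it is reused at every iteration);
// gridIO holds I on input and is overwritten with U on output.
void gridConvolution(cufftHandle plan,            // created once with cufftPlan3d(..., CUFFT_C2C)
                     const cufftComplex* Ghat,
                     cufftComplex* gridIO,
                     int nx, int ny, int nz)
{
    int n = nx * ny * nz;
    cufftExecC2C(plan, gridIO, gridIO, CUFFT_FORWARD);                  // Ihat = FFT[I]
    int threads = 256, blocks = (n + threads - 1) / threads;
    spectralMultiply<<<blocks, threads>>>(Ghat, gridIO, n, 1.0f / n);   // 1/n undoes CUFFT's unnormalized transforms
    cufftExecC2C(plan, gridIO, gridIO, CUFFT_INVERSE);                  // U = invFFT[...]
}
```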

Figure 9. The computational times of the Box-Based Adaptive-Integral Method on CPUs and GPUs.

7. GPU Implementations of Fast-Periodic-Interpolation-Method-Based Fast Iterative Integral-Equation Solvers

Interpolation-based fast algorithms for integral-equation solvers also extend to more complex problems, such as those containing infinite periodic structures. Periodic structures are an important category of problems. They are often solved via integral approaches requiring convolutions of the sources within a principal unit cell with periodic Green's functions (PGFs), which account for the periodic boundary conditions. Most recent efforts on accelerating such periodic solvers have been focused on efficiently representing and evaluating periodic Green's functions. In [40], a fast algorithm, referred to as the Fast Periodic-Interpolation Method (FPIM), was presented. This algorithm reduces the required number of periodic Green's function evaluations to O(1), and reduces the overall computational cost to O(N) for most practical problems. The Fast Periodic-Interpolation Method is kernel independent, and so it works for any one-dimensional, two-dimensional, or three-dimensional problems with one-dimensional, two-dimensional, and three-dimensional periodicities in free space, and possibly in more-complicated environments. The Fast Periodic-Interpolation Method shares common features with the Adaptive-Integral Method and the Nonuniform-Grid Interpolation Method, and thus utilizes the same GPU parallelization approaches.

Figure 10. The projection (source) and interpolation (observer) grids for the Fast Periodic-Interpolation Method.

In periodic problems, one unit cell is usually selected as the observer cell, where the fields from sources in the other identical but shifted cells are calculated. Treating it as the origin, the Fast Periodic-Interpolation Method separates the whole computational domain into a near region, which consists of cells within a certain distance, and a far region, which consists of the infinite set of cells beyond that distance. Fields generated by sources in the near region are called near fields; they are interactions between a finite number of sources and observers, and thus can be calculated by any fast method: we use the Nonuniform-Grid Interpolation Method or the Box-Based Adaptive-Integral Method. Fields generated by sources in the far region are calculated via periodic Green's functions, but not directly from sources to observers. Similar to the Box-Based Adaptive-Integral Method, the Fast Periodic-Interpolation Method also builds grids for projecting the source amplitudes and interpolating the fields to the observers. Two major differences compared to the Box-Based Adaptive-Integral Method are that (i) the Fast Periodic-Interpolation Method uses non-coinciding source and observer grids (Figure 10), and (ii) the Fast Periodic-Interpolation Method does not have the near-field correction stage, as the near fields are separated beforehand and are taken care of by a free-space n-body algorithm, including the Fast Multipole Method, the Nonuniform-Grid Interpolation Method, and the Adaptive-Integral Method. The major reason for using non-overlapping source and observer grids is to allow using a simple Floquet expansion to evaluate the periodic Green's functions between the grids, which is rapidly convergent due to the presence of a transverse separation between the grids. Other methods for periodic Green's function evaluation would not require the grids to be separate [48-50]. Projections from the sources to the source grid and interpolations from the observer grid to the observers are done via Lagrange interpolation. Interactions between the grids are done through FFT-accelerated convolutions, just as in the Nonuniform-Grid Interpolation Method and the Box-Based Adaptive-Integral Method.

The GPU implementation of the Fast Periodic-Interpolation Method is similar to that of the Nonuniform-Grid Interpolation Method and the Box-Based Adaptive-Integral Method. The computational times and their scaling with the problem size can be found in Figure 11, and they have very similar trends to those of the Box-Based Adaptive-Integral Method. It should be noted that only the preprocessing times and the far-field evaluation times are shown for both the CPU and GPU implementations. The near fields are evaluated using either the Box-Based Adaptive-Integral Method or the Nonuniform-Grid Interpolation Method, so their computational times can be found from the previous figures. The CPU and GPU used were the Intel i7-950 and the NVIDIA GeForce GTX 570. Additional periodic-solver results using the GPU-implemented Fast Periodic-Interpolation Method with the Nonuniform-Grid Interpolation Method are presented in Section 8.
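For orientation, the sketch below evaluates one common spectral (Floquet) representation of the free-space Green's function of a doubly periodic array, whose terms decay exponentially with the separation |z| between the source and observer grids; this is the kind of expansion that the grid offset makes rapidly convergent. The truncation order, the normalization convention, and the function name are assumptions chosen for illustration and are not taken from [40].

```cuda
#include <cuComplex.h>
#include <math.h>

// Spectral (Floquet) series for the free-space Green's function of a doubly
// periodic array with periods (a, b) and Floquet phasing (kx0, ky0):
//   Gp(x,y,z) = 1/(a*b) * sum_{m,n} exp(-j(kxm*x + kyn*y)) * exp(-j*kz*|z|) / (2*j*kz),
// with kxm = kx0 + 2*pi*m/a, kyn = ky0 + 2*pi*n/b, kz = sqrt(k0^2 - kxm^2 - kyn^2),
// and the branch Im(kz) <= 0. Terms decay exponentially with |z|, so a small
// truncation M suffices when the grids are separated. All names are illustrative.
__host__ __device__ cuDoubleComplex periodicGreen2D(double x, double y, double z,
                                                    double a, double b,
                                                    double k0, double kx0, double ky0,
                                                    int M)
{
    const double PI = 3.14159265358979323846;
    cuDoubleComplex sum = make_cuDoubleComplex(0.0, 0.0);
    double az = fabs(z);
    for (int m = -M; m <= M; ++m) {
        for (int n = -M; n <= M; ++n) {
            double kx  = kx0 + 2.0 * PI * m / a;
            double ky  = ky0 + 2.0 * PI * n / b;
            double arg = k0 * k0 - kx * kx - ky * ky;
            // kz with Im(kz) <= 0: propagating if arg > 0, evanescent otherwise
            cuDoubleComplex kz = (arg >= 0.0)
                ? make_cuDoubleComplex(sqrt(arg), 0.0)
                : make_cuDoubleComplex(0.0, -sqrt(-arg));
            double phase = -(kx * x + ky * y);                       // exp(-j(kx*x + ky*y))
            cuDoubleComplex lateral = make_cuDoubleComplex(cos(phase), sin(phase));
            cuDoubleComplex vert = (arg >= 0.0)                       // exp(-j*kz*|z|)
                ? make_cuDoubleComplex(cos(kz.x * az), -sin(kz.x * az))
                : make_cuDoubleComplex(exp(kz.y * az), 0.0);          // kz.y = -alpha, so exp(-alpha*|z|)
            cuDoubleComplex denom = cuCmul(make_cuDoubleComplex(0.0, 2.0), kz);  // 2*j*kz
            sum = cuCadd(sum, cuCdiv(cuCmul(lateral, vert), denom));
        }
    }
    return cuCmul(sum, make_cuDoubleComplex(1.0 / (a * b), 0.0));
}
```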

8. Simulation Examples

Figure 11. The computational times for the Fast Periodic-Interpolation Method on CPUs and GPUs.

The Nonuniform-Grid Interpolation Method, the Box-Based Adaptive-Integral Method, and the Fast Periodic-Interpolation Method have been coupled with a volume integral-equation solver. The volume integral equation was implemented via standard procedures: SWG (Schaubert-Wilton-Glisson) basis functions were used for the field representations, and four-point quadrature rules for the tetrahedral elements and surface triangles were used for the volume and surface integrals. The three acceleration methods were implemented on GPUs and CPUs, as described in Sections 4-7. All the differential operators were represented in terms of sparse matrices. The corresponding sparse matrix-vector multiplications (spMV) were implemented on GPUs. Such sparse products constituted a small portion of the total computational time. The matrix equation generated by the volume integral equation was solved iteratively via a GMRES solver.

We present two sets of examples: scattering from objects residing in free space, and scattering from infinitely periodic structures. Each set includes a verification example with a known problem, and a more-complicated scattering example. The simulations were run on a desktop computer with an Intel i7-950 CPU and 24 GB of RAM, and an NVIDIA GeForce GTX 570 GPU with 1.2 GB of global memory.

Figure 12 presents the results for the free-space problems. Figure 12a shows the radar cross section (RCS) of a free-standing layered sphere of diameter D and permittivity 14 − j8 at a wavelength of λ = D/3.8. The RCS obtained via the Mie-scattering approach and via the volume integral equation accelerated by the GPU using the Box-Based Adaptive-Integral Method are shown and compared. The number of Mie-series terms was chosen to achieve full convergence. The results demonstrated good agreement.

Figure 12b demonstrates the GPU-accelerated solver's performance for a problem of scattering from a partial human-body model (the detailed mesh of the human body was provided by Ali Yilmaz, University of Texas, Austin). The permittivity was 15 − j10, and the mesh resolution was 4 mm. The incident field was a plane wave coming from the top, with a wavelength of 0.2 m. The figure shows the real part of the x component of the current. The results were also verified via an alternative volume integral-equation solver. The number of basis functions in this simulation was 2,591,522, and the number of quadrature nodes was over 5,000,000.


Figure 12b. The electrical current distribution along the x axis excited by an incident wave on a human body.

Figure 12a. The RCS of a dielectric sphere.


Figure 13a. The reflection coefficient of a doubly periodic metamaterial structure: a comparison between the volume integral-equation approach (dotted) and the rigorous coupled-wave analysis (RCWA) code (solid line).


Figure 13b. The scattering coefficient for a complex periodic unit cell: a comparison between the volume integral-equation approach (dotted) and the rigorous coupled-wave analysis (RCWA) code (solid line).


Note that at each iteration, six scalar dense matrix-vector multiplications were executed. The simulation involved 176 iterations. The preprocessing time was 24 s, and the total simulation time was 6188 s.

The same GPU-accelerated volume integral-equation solvers work for problems with periodic boundary conditions by replacing the Box-Based Adaptive-Integral Method or the Nonuniform-Grid Interpolation Method kernel by the Fast Periodic-Interpolation Method kernel. Results of such a periodic solver are demonstrated in Figure 13. Figure 13a shows a verification example, computing the reflection coefficient of a doubly periodic array of dielectric cubes illuminated by a normally incident wave along the z axis. Each unit cell comprised a cube of size 1 mm, made of a material with a permittivity of 4. The reflection coefficient is shown as a function of the wavelength. The results were compared with those obtained via the rigorous coupled-wave analysis (RCWA) code [51]. The results obtained via the two methods matched well. One could observe a resonant behavior, which is a typical Wood-anomaly property of periodic gratings [52].

Finally, Figure 13b presents an example of a complex problem involving scattering from a doubly periodic array of unit cells, each comprising multiple split-ring resonators with a permittivity of −4 − j2.5. The number of basis functions was 121,546, and the number of quadrature points was 283,484. The number of iterations required to generate a single data point in this figure was around 11. The total simulation time (including preprocessing) per frequency point was about six minutes. Again, a resonant behavior was observed. It should be noted that, to the best of our knowledge, this is the first volume integral-equation solver that can handle periodic problems of such high computational complexity.

9. Summary

We demonstrated that GPU computing offers exciting opportunities for computational electromagnetics in general, and for integral-equation-based methods in particular. We described several important points that should be kept in mind when developing GPU codes. In particular, GPUs have unique processor and memory arrangements. Efficient implementation of algorithms on GPUs requires taking a different design perspective. For example, performing operations on the fly may often be more beneficial than using memory. Memory accesses should be done in parallel and in a coalesced manner. Shared memory serves as a scratch pad, and is often very efficient in accelerating data-access operations.

Keeping these points in mind, we presented GPU implementations of three fast O(N_q) or O(N_q log N_q) methods for accelerating integral-equation solvers. These methods included the Nonuniform-Grid Interpolation Method, which is a tree-type code similar to the Fast Multipole Method; the Box-Based Adaptive-Integral Method, which is a modification of FFT-based techniques; and the Fast Periodic-Interpolation Method, which is a recently introduced technique for accelerating general periodic unit-cell problems. GPU implementations of all these techniques were shown to be very efficient, with GPU-CPU speed-up rates of 100 to 400 times, and very small absolute computational times.

We also presented GPU-accelerated volume integral-equation solvers incorporating these fast techniques. These solvers are powerful tools for the analysis of complex dielectric structures residing in free space or in a periodic unit cell. On a simple desktop computer with a middle-range GPU (at a total cost of $2500), we can rapidly solve problems involving millions of degrees of freedom. Systems with more memory and higher-end GPUs can allow handling significantly larger problems with potentially higher speeds. Moreover, we have implementations of multi-GPU codes that demonstrate great opportunities for further scaling of the developed codes to GPU-based clusters. CPU clusters matching the performance of the presented GPU-accelerated fast solvers would be much more expensive, would consume much more power, and would require special facilities.

10. Acknowledgements

This work was supported in part by the NSF CIAN ERC and by United States – Israel Binational Science Foundation Grant No. 2008077.

11. References

1. H. W. Cheng, W. Y. Crutchfield, Z. Gimbutas et al., "A Wideband Fast Multipole Method for the Helmholtz Equation in Three Dimensions," Journal of Computational Physics, 216, 1, 2006, pp. 300-325.

2. L. Greengard and V. Rokhlin, "A Fast Algorithm for Particle Simulations," Journal of Computational Physics, 73, 2, 1987, pp. 325-348.

3. V. Rokhlin, "Rapid Solution of Integral Equations of Scattering Theory in Two Dimensions," Journal of Computational Physics, 86, 2, 1990, pp. 414-439.

4. E. Bleszynski, M. Bleszynski, and T. Jaroszewicz, "AIM: Adaptive Integral Method for Solving Large-Scale Electromagnetic Scattering and Radiation Problems," Radio Science, 31, 5, 1996, pp. 1225-1251.

5. J. R. Phillips and J. K. White, "A Precorrected-FFT Method for Electrostatic Analysis of Complicated 3-D Structures," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 16, 10, 1997, pp. 1059-1072.

6. "TOP500 List – June 2012," http://www.top500.org/list/2012/06/100, accessed 8/19/2012.


7. NVIDIA Corporation, CUDA Compute Unified Device Architecture Programming Guide, V4.2, 2012.

8. Khronos Group, "The OpenCL Specification – Khronos Group," 2011.

9. J. Bedorf, E. Gaburov, and S. P. Zwart, "A Sparse Octree Gravitational N-Body Code that Runs Entirely on the GPU Processor," Journal of Computational Physics, 231, 7, 2012, pp. 2825-2839.

10. J. E. Stone, D. J. Hardy, I. S. Ufimtsev et al., "GPU-Accelerated Molecular Modeling Coming of Age," Journal of Molecular Graphics and Modelling, 29, 2, 2010, pp. 116-125.

11. J. E. Stone, J. C. Phillips, P. L. Freddolino et al., "Accelerating Molecular Modeling Applications with Graphics Processors," Journal of Computational Chemistry, 28, 16, 2007, pp. 2618-2640.

12. R. Chang, S. Li, M. Lubarda et al., "FastMag: Fast Micromagnetic Simulator for Complex Magnetic Structures," Journal of Applied Physics, 109, 7, 2011, p. 07D358.

13. S. Li, B. Livshitz, and V. Lomakin, "Graphics Processing Unit Accelerated O(N) Micromagnetic Solver," IEEE Transactions on Magnetics, 46, 6, 2010, pp. 2373-2375.

14. B. Bergen, G. Wellein, et al., "Hierarchical Hybrid Grids: Achieving TERAFLOP Performance on Large Scale Finite Element Simulations," Int. J. Parallel Emerg. Distrib. Syst., 22, 4, 2007, pp. 311-329.

15. C. Cecka, A. J. Lew, and E. Darve, "Assembly of Finite Element Methods on Graphics Processors," International Journal for Numerical Methods in Engineering, 85, 5, 2011, pp. 640-669.

16. S. Dosopoulos, J. D. Gardiner, and J.-F. Lee, "An MPI/GPU Parallelization of an Interior Penalty Discontinuous Galerkin Time Domain Method for Maxwell's Equations," Radio Science, 46, 2011, RS0M05.

17. S. Dosopoulos and J.-F. Lee, "Discontinuous Galerkin Time Domain for Maxwell's Equations on GPUs," 2010 URSI International Symposium on Electromagnetic Theory (EMTS), Berlin, August 16-19, 2010, pp. 989-991.

18. A. Dziekonski, A. Lamecki, and M. Mrozowski, "GPU Acceleration of Multilevel Solvers for Analysis of Microwave Components with Finite Element Method," IEEE Microwave and Wireless Components Letters, 21, 1, 2011, pp. 1-3.

19. D. Goddeke, R. Strzodka, and S. Turek, "Performance and Accuracy of Hardware-Oriented Native-, Emulated- and Mixed-Precision Solvers in FEM Simulations," International Journal on Parallel Emerging Distributed Systems, 22, 4, 2007, pp. 221-256.

20. A. Kakay, E. Westphal, and R. Hertel, "Speedup of FEM Micromagnetic Simulations with Graphical Processing Units," IEEE Transactions on Magnetics, 46, 6, 2010, pp. 2303-2306.

21. D. Komatitsch, G. Erlebacher, D. Goddeke, and D. Michea, "High-Order Finite-Element Seismic Wave Propagation Modeling with MPI on a Large GPU Cluster," Journal of Computational Physics, 229, 20, 2010, pp. 7692-7714.

22. K. Liu, "Graphics Processor Unit (GPU) Acceleration of Time-Domain Finite Element Method (TD-FEM) Algorithm," 3rd IEEE International Symposium on Microwave, Antenna, Propagation and EMC Technologies for Wireless Communications, Beijing, October 27-29, 2009, pp. 928-931.

23. D. De Donno, A. Esposito, G. Monti et al., "Parallel Efficient Method of Moments Exploiting Graphics Processing Units," Microwave and Optical Technology Letters, 52, 11, 2010, pp. 2568-2572.

24. E. Lezar and D. B. Davidson, "GPU-Accelerated Method of Moments by Example: Monostatic Scattering," IEEE Antennas and Propagation Magazine, 52, 6, December 2010, pp. 120-135.


25. A. Noga, T. Topa, and A. Karwowski, "Using GPU Accelerated Version of MoM for Solving Scattering and Radiation Electromagnetic Problems," 19th International Conference on Microwave Radar and Wireless Communications (MIKON), Warsaw, May 21-23, 2012, pp. 233-234.

26. S. Peng and Z. Nie, "Acceleration of the Method of Moments Calculations by Using Graphics Processing Units," IEEE Transactions on Antennas and Propagation, AP-56, 7, 2008, pp. 2130-2133.

27. T. Topa, A. Noga, and A. Karwowski, "Adapting MoM With RWG Basis Functions to GPU Technology Using CUDA," IEEE Antennas and Wireless Propagation Letters, 10, 2011, pp. 480-483.

28. M. A. Francavilla, E. A. Attardo, F. Vipiana et al., "A GPU Acceleration for FFT-Based Fast Solvers for the Integral Equation," Fourth European Conference on Antennas and Propagation (EuCAP), Barcelona, April 12-16, 2010, pp. 1-4.

29. S. Peng and C.-F. Wang, "Hardware Accelerated MoM-PFFT Method Using Graphics Processing Units," IEEE International Symposium on Antennas and Propagation, Spokane, July 3-8, 2011, pp. 3152-3153.

30. M. Cwikla, J. Aronsson, and V. Okhmatovski, "Low-Frequency MLFMA on Graphics Processors," IEEE Antennas and Wireless Propagation Letters, 9, 2010, pp. 8-11.

31. N. A. Gumerov and R. Duraiswami, "Fast Multipole Methods on Graphics Processors," Journal of Computational Physics, 227, 18, 2008, pp. 8290-8313.


32. I. Lashuk, A. Chandramowlishwaran, H. Langston et al., "A Massively Parallel Adaptive Fast-Multipole Method on Heterogeneous Architectures," Conference on High Performance Computing Networking, Storage and Analysis, Portland, Oregon, November 14-20, 2009, pp. 1-12.

33. M. López-Portugués, J. A. López-Fernández, J. Menendez-Canal et al., "Acoustic Scattering Solver Based on Single Level FMM for Multi-GPU Systems," J. Parallel Distrib. Comput., 72, 9, 2012, pp. 1057-1064.

34. K. Xu, D. Z. Ding, Z. H. Fan et al., "Multilevel Fast Multipole Algorithm Enhanced by GPU Parallel Technique for Electromagnetic Scattering Problems," Microwave and Optical Technology Letters, 52, 3, 2010, pp. 502-507.

35. R. Yokota, J. P. Bardhan, M. G. Knepley et al., "Biomolecular Electrostatics Using a Fast Multipole BEM on Up to 512 GPUs and a Billion Unknowns," Computer Physics Communications, 182, 6, 2011, pp. 1272-1283.

36. A. Boag, V. Lomakin, and E. Michielssen, "Nonuniform Grid Time Domain (NGTD) Algorithm for Fast Evaluation of Transient Wave Fields," IEEE Transactions on Antennas and Propagation, AP-54, 7, 2006, pp. 1943-1951.

37. A. Boag, E. Michielssen, and A. Brandt, "Nonuniform Polar Grid Algorithm for Fast Field Evaluation," IEEE Antennas and Wireless Propagation Letters, 1, 2002, pp. 142-145.

38. B. Livshitz, A. Boag, H. N. Bertram et al., "Non-Uniform Grid Algorithm for Fast Magnetostatic Interactions Calculation in Micromagnetics," Journal of Applied Physics, 105, 2009.

39. S. Li, B. Livshitz, and V. Lomakin, "Fast Evaluation of Helmholtz Potential on Graphics Processing Units (GPUs)," Journal of Computational Physics, 229, 22, 2010, pp. 8463-8483.

40. S. Li, D. A. Van Orden, and V. Lomakin, "Fast Periodic Interpolation Method for Periodic Unit Cell Problems," IEEE Transactions on Antennas and Propagation, AP-58, 12, 2010, pp. 4005-4014.

41. A. F. Peterson, S. L. Ray, and R. Mittra, Computational Methods for Electromagnetics, New York, IEEE Press, 1998.

42. L. Greengard, J. Huang, V. Rokhlin et al., "Accelerating Fast Multipole Methods for the Helmholtz Equation at Low Frequencies," IEEE Computational Science and Engineering, 5, 3, 1998, pp. 32-38.

43. T. Killian, D. L. Faircloth, and S. M. Rao, "Acceleration of TM Cylinder EFIE with CUDA," IEEE Antennas and Propagation Society International Symposium, Charleston, June 1-5, 2009, pp. 1-4.

44. NVIDIA Corporation, GPU Computing SDK 4.2, 2012.

45. R. Yokota and L. A. Barba, "Chapter 9 – Treecode and Fast Multipole Method for N-Body Simulation with CUDA," in W. H. Wen-mei (ed.), GPU Computing Gems Emerald Edition, Boston, Morgan Kaufmann, 2011, pp. 113-132.

46. A. E. Yilmaz, J.-M. Jin, and E. Michielssen, "Time Domain Adaptive Integral Method for Surface Integral Equations," IEEE Transactions on Antennas and Propagation, AP-52, 10, 2004, pp. 2692-2708.

47. M. A. Francavilla, F. Vipiana, and G. Vecchi, "FFT-Based Solvers for the EFIE on Graphics Processors," IEEE Antennas and Propagation Society International Symposium, Toronto, Ontario, July 11-17, 2010.

48. F. T. Celepcikay, D. R. Wilton, D. R. Jackson et al., "Choosing Splitting Parameters and Summation Limits in the Numerical Evaluation of 1-D and 2-D Periodic Green's Functions with the Ewald Method," Radio Science, 43, 2008, RS6S01.

49. G. Valerio, P. Baccarelli, P. Burghignoli et al., "Comparative Analysis of Acceleration Techniques for 2-D and 3-D Green's Functions in Periodic Structures Along One and Two Directions," IEEE Transactions on Antennas and Propagation, AP-55, 6, 2007, pp. 1630-1643.

50. D. Van Orden and V. Lomakin, "Rapidly Convergent Representations for 2D and 3D Green's Functions for a Linear Periodic Array of Dipole Sources," IEEE Transactions on Antennas and Propagation, AP-57, 7, 2009, pp. 1973-1984.

51. M. G. Moharam and T. K. Gaylord, "Rigorous Coupled-Wave Analysis of Planar-Grating Diffraction," Journal of the Optical Society of America, 71, 7, 1981, pp. 811-818.

52. R. W. Wood, "XLII. On a Remarkable Case of Uneven Distribution of Light in a Diffraction Grating Spectrum," Philosophical Magazine, 4, 21, 1902.
