
INESC-ID/ISEL/IPL, Lisbon, Portugal
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal
[email protected], [email protected]

ABSTRACT

Floating-point computing with more than one TFLOP of peak performance is already a reality in recent Field-Programmable Gate Arrays (FPGAs). General-Purpose Graphics Processing Units (GPGPUs) and recent many-core CPUs have also taken advantage of recent technological innovations in integrated circuit (IC) design and have dramatically improved their peak performances. In this paper, we compare the trends of these computing architectures for high-performance computing and survey these platforms in the execution of algorithms belonging to different scientific application domains. Trends in peak performance, power consumption and sustained performance for particular applications show that the gap between FPGAs and GPUs/many-core CPUs is widening, moving FPGAs away from high-performance computing with intensive floating-point calculations. FPGAs remain competitive for custom floating-point or fixed-point representations, for smaller input sizes of certain algorithms, for combinational logic problems and for parallel MapReduce problems.

I. INTRODUCTION

Recently, GPUs were adopted as high-performance computing (HPC) platforms offering very high peak floating-point performance: up to five single-precision (SP) TFLOPs and over one double-precision (DP) TFLOP. Today's most powerful GPUs, like the NVIDIA Tesla K20 and K40 [14], compete with several other commercial multi- and many-core processors, also with very high computational power. The most representative of these commercial processors during the last decade are the IBM Cell, with nine SIMD 32-bit floating-point processing elements executing at 3 GHz [27], the 80-tile TFLOPS 32-bit floating-point processor from Intel operating at frequencies up to 5 GHz [30], and the recent Intel Xeon Phi processors [15] with single- and double-precision floating-point capacity. These chips deliver single-precision TFLOPs of peak performance.
Field-Programmable Gate Arrays (FPGAs) were traditionally used for fixed-point digital signal processing, but now offer high floating-point processing capacity, over one TFLOP for SP, making them another candidate for applications requiring high-performance computing. However, the peak performance of a device is different

from its sustained performance while running a particular application. For example, when running a 2D Fast Fourier Transform (FFT), the 80-tile teraflops processor from Intel achieves only 2.73% of its peak performance, that is, 20 GFLOPs. FPGAs operate at lower clock frequencies and have lower peak performances, but they can be hardware-optimized for each specific application and can therefore achieve better performance efficiencies. Another important point is that FPGAs in general achieve higher power efficiencies than GPUs and CPUs.

It is known that a particular application will run differently on different processing devices. How well each application runs can be measured with several design metrics, including performance, power consumption, power efficiency, performance efficiency and cost, among others. In this paper, we analyze the trends in peak performance and power consumption of each device and compare their sustained performances for some sets of scientific applications, looking to identify the most appropriate platform for each particular application domain. Previous works comparing and surveying GPUs, FPGAs and CPUs are in general restricted to a particular set of applications in a specific domain, or compare only two of these platforms, and so fail to give the complete scenario across application domains. In our survey, we compare the analyzed target platforms on a set of application domains that covers the most important classes of applications in high-performance computing.

The paper is organized as follows. In Section 2 we analyze the trend in peak performances. Section 3 reports the sustained performances of the devices for particular application domains. Section 4 concludes the paper.

II. PEAK PERFORMANCES

II-A. Graphics Processing Units

GPUs have seen significant advances driven by graphics processing, and their design has been oriented towards these target applications.
In the past decade, GPUs have entered general-purpose computing, becoming known as General-Purpose GPUs (GPGPUs). Using their large parallel computing capabilities, they can execute other types of workloads, competing with modern general-purpose multi-core central processing units (CPUs).
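As a rough rule of thumb, the peak figures quoted throughout this section are the product of core count, operations per core per cycle and clock frequency. A minimal sketch, where the K40 core count and base clock are public datasheet values used only as illustrative inputs:

```python
def peak_gflops(cores: int, ops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak in GFLOPs: no memory or scheduling limits assumed."""
    return cores * ops_per_cycle * clock_ghz

# Tesla K40: 2880 CUDA cores, one fused multiply-add (2 FLOPs) per
# core per cycle, 745 MHz base clock.
print(f"K40 SP peak: {peak_gflops(2880, 2, 0.745):.0f} GFLOPs")  # → 4291 GFLOPs
```

Real devices only approach this bound when every core retires a fused multiply-add each cycle, which is why sustained performance is treated separately in Section III.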

[Fig. 1 data: peak performance (GFLOPs), single and double precision, for NVIDIA GPUs from the GeForce 6800 Ultra (2004, 130 nm) through the Tesla K40 (2013, 28 nm).]

Fig. 2. Peak performance of Intel CPUs [plot: single and double precision, from Prescott (2004, 90 nm) through the Xeon Phi 7120P (2013, 22 nm)]

Looking at the evolution of several generations of GPU families, we observe a more than linear increase in peak performance, both in SP floating-point (SPFP) and double-precision FP (DPFP). An exhaustive description of the evolution of GPUs is not possible in a few lines, since more than a hundred different models have already been deployed. We have therefore picked the best GPUs, in terms of performance, from some of the most relevant families within a specific year and technology (see Figure 1). The graph shows an increase of more than one TFLOP from family to family. This progress is due not only to the known advantages of newer fabrication technologies, but also to advances in the architectures, since we observe increases within the same technology. The ratio between SP and DP performance was initially around 10×, but the latest generations of GPUs have reduced it to around 4×. Considering the power efficiency (not shown in the graph) as the ratio between peak performance and thermal design power (TDP), we also observe a sustained increase: power efficiencies have grown from 0.5 up to 6 GFLOPs/W for DP and from 0.5 up to 20 GFLOPs/W for SP. This means that GPUs are offering an impressive peak performance at increasing power efficiency. GPUs are also known for their high external memory bandwidths. The GeForce 6800 had a bandwidth of 35 GBytes/s, while the recent K20, K20X and K40 have bandwidths of 208, 250 and 288 GBytes/s, respectively.

II-B. Multi- and Many-Core CPUs

General-purpose CPUs have also experienced an increase in peak performance. We have considered some of the best-known processors from the Intel family of CPUs and annotated their peak performances (see Figure 2). With the recent Xeon Phi 7120P co-processor (it cannot be used as a stand-alone processor), CPUs achieve peak performances of 2416 SP GFLOPs and 1208 DP GFLOPs (in this case, a ratio of two between SP and DP is observed). Intel processors have progressed by increasing the number of cores per CPU.
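The power-efficiency numbers above are simply peak performance divided by TDP, in GFLOPs/W; a sketch, assuming the K40's publicly quoted 235 W TDP and ≈4291 SP GFLOPs peak:

```python
def gflops_per_watt(peak_gflops: float, tdp_watts: float) -> float:
    """Power efficiency as the ratio of peak performance to TDP."""
    return peak_gflops / tdp_watts

# Tesla K40: ~4291 SP GFLOPs peak at a 235 W TDP.
print(round(gflops_per_watt(4291.0, 235.0), 1))  # → 18.3
```

The result, about 18 GFLOPs/W, is consistent with the "up to 20 GFLOPs/W for SP" trend quoted above.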
The power efficiencies of these devices are typically lower than those of GPUs. The first CPUs had around 0.1 GFLOPs/W with 65 nm technology

[Fig. 3 data: LUT and 18×18 integer multiplier counts for the Xilinx FPGAs XC4VSX55, XC5VSX240T, XC6VSX475T, XC7VX980T and XC7V2000T.]

Fig. 1. Peak performance of GPUs
Fig. 3. Number of multipliers for different Xilinx FPGAs

while the recent CPUs with 22 nm technology have increased to around 9 GFLOPs/W for SP, and half of this value for DP. Within this set of Intel processors, it is important to remember the many-core 80-tile processor delivered by Intel in 2008, with a peak performance of 1 SP TFLOP at 1.07 V and an operating frequency of 3.16 GHz, well above all other processors. CPUs and many-core CPUs are also characterized by high memory bandwidths; for example, the recent Xeon Phi 7120P has a bandwidth of 352 GBytes/s, slightly higher than that of a recent GPU.

II-C. Field-Programmable Gate Arrays

The peak floating-point capacity of FPGAs is determined by both multiplier and LUT resources. A look at the number of multipliers and LUTs in each successive Xilinx family shows a more than linear increase (see the evolution of resources in Xilinx FPGAs in Figure 3). With the recent Virtex-7 FPGA families we have, for example, 3600 18×18 multipliers and 612,000 LUTs in a XC7VX980T, and 2160 multipliers and 1,221,600 LUTs in a XC7V2000T. To analyze the trends in peak FP performance of FPGAs, we have considered three different FP units: adder only, multiplier only, and multiply-add. The adder is implemented only with LUTs, but for the multiplier we have considered different combinations of DSP/LUT resources (configurations M0, M1 and M2). The multiply-add is simply the combination of one of these

Table I. Resources and frequency for FP operations [33]

Single-precision:
Config  Oper  DSP48E1  LUT-FF pairs  LUTs  FFs   Freq.
M0      Mult  0        665           629   669   473 MHz
M1      Mult  1        331           283   331   462 MHz
M2      Mult  2        160           128   160   459 MHz
A0      Add   0        500           407   541   472 MHz

Double-precision:
Config  Oper  DSP48E1  LUT-FF pairs  LUTs  FFs   Freq.
M0      Mult  0        2361          2317  2418  318 MHz
M1      Mult  9        439           356   510   378 MHz
M2      Mult  10       371           299   456   389 MHz
A0      Add   0        989           794   1029  436 MHz

[Fig. 4 and Fig. 5 data: single- and double-precision peak performance (GFLOPs) of the M0, M1, M2, A0, M0+A0, M1+A0 and M2+A0 configurations on the XC7VX980T and XC7V2000T.]

Fig. 5. Peak performance of Xilinx FPGAs for different combinations of double-precision floating-point operations


Fig. 4. Peak performance of Xilinx FPGAs for different combinations of single-precision floating-point operations

multipliers and an adder (see the resource utilization of the floating-point operations in Table I [33]). From these figures, we have determined the peak performances of two different FPGAs (XC7VX980T and XC7V2000T) by simply determining how many units of a particular configuration from Table I fit in a device (the total number of DSPs/LUTs divided by the number of DSPs/LUTs required by a single unit; see the results in Figures 4 and 5). Considering only multiplications, the best SPFP peak performance is 1.4 TFLOPs, while for addition only we have 1 TFLOP. The best peak performance is achieved with the combined multiplication/addition configuration, at around 1.6 TFLOPs. In this case, we also observe that the XC7V2000T is better for SPFP arithmetic even though it has only 60% of the total DSPs of a XC7VX980T. For DP, the best peak performance was 671 GFLOPs, achieved with an adder-only configuration on a XC7V2000T FPGA. For the multiplier-only configuration, the performance drops to 168 GFLOPs. A combined multiplier/adder configuration achieves 302 GFLOPs in a XC7VX980T, limited by the number of multipliers. Power efficiencies are above 10 GFLOPs/W, typically higher than those of CPUs and GPUs, and they are increasing with technology generations. For example, Altera's next-generation high-performance FPGAs are expected to support a minimum of 5 TFLOPs of performance by leveraging Intel's 14 nm Tri-Gate process, with up to 100 GFLOPs/W expected from this advanced semiconductor process [1].
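The fitting procedure just described can be sketched in a few lines. The resource figures below come from Table I and the Virtex-7 device counts given earlier; the 2-FLOPs-per-cycle multiply-add and the worst-case clock are our simplifying assumptions, which is why the result (≈280 GFLOPs) is close to, but not exactly, the 302 GFLOPs quoted:

```python
def fpga_peak_gflops(avail_dsp: int, avail_lut: int,
                     unit_dsp: int, unit_lut: int,
                     flops_per_cycle: int, freq_mhz: float) -> float:
    """Peak GFLOPs = (units that fit) x FLOPs/cycle x clock."""
    # A unit fits as long as both its DSP and LUT budgets are met;
    # a zero DSP requirement means only LUTs constrain the count.
    units = min(avail_dsp // unit_dsp if unit_dsp else avail_lut,
                avail_lut // unit_lut)
    return units * flops_per_cycle * freq_mhz / 1000.0

# XC7VX980T (3600 DSP48E1 slices, ~612,000 LUTs) with the
# double-precision M2+A0 pair from Table I: 10 DSPs and 299+794 LUTs
# per unit, 2 FLOPs per cycle at the slower unit clock (389 MHz).
print(round(fpga_peak_gflops(3600, 612000, 10, 299 + 794, 2, 389)))  # → 280
```

The DSP budget binds here (3600 // 10 = 360 units), matching the text's observation that the combined configuration is limited by the number of multipliers.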


Fig. 6. Peak performance of FPGA and CPU relative to GPU for single precision

II-D. Trends in Peak Performances

The peak performance of FPGAs is decreasing relative to GPUs, while that of CPUs is increasing, both in single and double precision. According to [13], the number of applications running on GPUs has increased by 60%. In fact, given the peak performances of GPUs and FPGAs, a GPU can drop to a performance efficiency of 25% and still match the performance of an FPGA at 100% performance efficiency.

To better understand the relative evolution of GPUs, FPGAs and multi/many-core CPUs in terms of peak floating-point performance, we have picked the best of each type of device for different years and compared their peak performances for both single and double precision. We have only considered the Intel family of processors (see the peak performances relative to the GPU in Figure 6). GPUs were always the best performing for SP. In 2011 both FPGAs and CPUs increased their relative performance, but in 2013, while the CPU kept improving, the FPGA decreased again. The FPGA was always above the CPU, but in 2013 they changed positions due to the appearance of many-core CPUs, like the Intel Xeon Phi. It is important to emphasize that FPGAs and CPUs are always one technology step ahead, except in the last year, where both the GPU and the FPGA use 28 nm against the 22 nm used by the CPU. For double precision (Figure 7), we have extrapolated the results for the GPU in the first years (just to serve as a reference). Ignoring the GPU in those first two cases, once again GPUs have dominated the peak performances (except for the very first years, until 2008), and CPUs started to dominate FPGAs from 2011. We also see that in 2013 the peak performances of GPUs and many-core CPUs are very close (a 5% difference).

Peak performance gives us only a theoretical comparison. The sustained performance (the performance achieved when running a particular algorithm) and the power efficiency vary with each particular application, since these processors have different architectural aspects, including the memory architecture, with its external memory bandwidth, and the communication network, that make them more appropriate for specific algorithms or applications. In the following section, we provide a state-of-the-art review of several applications running on these architectures to compare both their sustained performance and power efficiency.

Fig. 7. Peak performance of FPGA and CPU relative to GPU for double precision

III. SUSTAINABLE PERFORMANCES

The relative sustainable performance and power efficiency of each device is highly dependent on the algorithm. To analyze both metrics for each device, we considered a few parallel algorithms from each of the thirteen algorithmic dwarfs proposed by the Department of Computer Science of UC Berkeley [3]. This set covers a very wide range of parallel computing algorithms, namely dense and sparse matrix computation (linear algebra), spectral methods (e.g., FFT-based methods), N-body (particle interaction), structured grids (fluid dynamics), unstructured grids (e.g., the Finite Element Method), MapReduce (Monte Carlo-based methods), combinational logic, graph traversal, dynamic programming, backtrack/branch-and-bound, graphical models and finite state machines. In the following, we briefly analyze the sustainable performance of each HPC device for each of these dwarfs. In our study, we are particularly interested in those representing HPC kernels, namely linear algebra computation, spectral methods, N-body simulations, structured and unstructured grids, and Monte Carlo-based methods [3].

Matrix operations are computationally very demanding, whether in single or double precision, and have a particular and simple communication pattern.
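The sustained GFLOPs figures reported for dense matrix multiplication follow the standard 2n³ operation count over wall-clock time. A minimal host-side sketch (NumPy, single precision; the number printed depends entirely on the machine it runs on):

```python
import time
import numpy as np

def matmul_gflops(n: int, dtype=np.float32) -> float:
    """Sustained GFLOPs of one n x n dense matrix product (2*n^3 FLOPs)."""
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    t0 = time.perf_counter()
    a @ b
    elapsed = time.perf_counter() - t0
    return 2.0 * n ** 3 / elapsed / 1e9

print(f"SP matmul: {matmul_gflops(1024):.1f} sustained GFLOPs")
```

Dividing this measured figure by the device's peak gives the performance efficiency used in the comparisons below.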
Performances on a Tesla K20X GPU running dense matrix multiplication are around 3 TFLOPs for single precision and 1 TFLOP for double precision, which is about 70% of peak performance (its performance efficiency) [23]. The Intel Xeon Phi 7120P achieves performances of 2.2/1 (single/double precision) TFLOPs [16], which corresponds to a performance efficiency of around 92%. In terms of power efficiency, the GPU attains 12/5 (SP/DP) GFLOPs/W and the CPU only around 4.2/3.6 GFLOPs/W. In [21] it is reported that FPGAs are about 1.5× more energy efficient for single-precision matrix multiplication and slightly worse for double precision. In

[7] the authors compare an i7-960 CPU, a GTX480 GPU and a Virtex-6 FPGA running dense matrix multiplication and conclude that the GPU, with a sustained performance of 541 GFLOPs, is about 3× faster than the FPGA and 6× faster than the CPU. Considering power efficiency, both the FPGA and the GPU reported similar figures, three times better than that of the CPU. The sustained performance of a more recent FPGA is about twice this result [17]. In [34], a Virtex-6 FPGA and a GTX480 GPU are compared in the acceleration of Cholesky decomposition. The GPU is always better than the FPGA for both SP and DP, ranging from 2× to 4× faster depending on the matrix size. [18] compares a Virtex-5 FPGA, an Intel E8500 CPU and a Tesla C1060 GPU in the execution of BLAS level-2 operations. The authors conclude that for small matrix sizes the FPGA has performance similar to the GPU, while the GPU overtakes the FPGA for larger matrix sizes (e.g., around 20× for 8k × 8k matrices). The power efficiency is always better with the FPGA, while the efficiency of the GPU decreases considerably with the size of the matrix (two orders of magnitude worse than the FPGA).

The best-known spectral algorithm is the Fast Fourier Transform (FFT). For example, seven parallel 4096-point single-precision floating-point FFTs in a 28 nm FPGA achieve over 500 GFLOPs with around 10 GFLOPs/W [2], but a single FFT implementation stays below 100 SP GFLOPs. In [9] the authors conclude that GPUs are 10× faster than FPGAs. The performance of a recent GTX Titan GPU running FFT reaches around 450 GFLOPs in SP and 220 in DP [25], corresponding to performance efficiencies of 10% and 17%, respectively. Similar results are obtained on many-core CPUs [12], but with a higher SP performance efficiency (around 20%).

Particle interaction problems have higher communication requirements than matrix operations.
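The FFT GFLOPs figures above follow the conventional 5·n·log2(n) operation count per length-n complex transform, the convention used by most FFT benchmark reports. A host-side sketch with a batch of seven 4096-point single-precision transforms, mirroring the FPGA example:

```python
import time
import numpy as np

def fft_gflops(n: int, batch: int = 7) -> float:
    """Sustained GFLOPs of a batch of n-point complex FFTs,
    counted as 5*n*log2(n) FLOPs per transform."""
    x = (np.random.rand(batch, n)
         + 1j * np.random.rand(batch, n)).astype(np.complex64)
    t0 = time.perf_counter()
    np.fft.fft(x, axis=1)
    elapsed = time.perf_counter() - t0
    return batch * 5.0 * n * np.log2(n) / elapsed / 1e9

print(f"SP FFT: {fft_gflops(4096):.2f} sustained GFLOPs")
```

Because the operation count grows only as n·log n while memory traffic grows as n, FFT efficiency is typically far below the matrix-multiplication efficiencies quoted earlier.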
Implementations of N-body on an Intel Xeon Phi B1Q-3115A [32] and a K20 GPU [6] reveal sustained performances of around 800 GFLOPs, not achievable with an FPGA. In [11] the authors conclude that GPUs outperform FPGAs, but performance per watt favors FPGAs (by about 15×).

Finite element modeling and computational fluid dynamics are unstructured grid problems with complex calculation coordination and demanding computations. Lattice Boltzmann Methods (LBM) are well-known methods used for the computational simulation of Newtonian fluid dynamics, and LBM-based simulations are highly parallelizable. K20 GPU implementations have achieved the highest simulation performance per chip, with 500 DP GFLOPs (30% of peak performance) against 394 for the Xeon Phi 5110P (40% of peak performance) [4], [24]. These results are obtained with power efficiencies of around 2 GFLOPs/W. In [19], an implementation of a lattice Boltzmann simulation on a Stratix IV FPGA achieved only 445 SP GFLOPs.

Monte Carlo-based methods are very common in scientific computing (e.g., for financial algorithms). An important unit in the design of a Monte Carlo algorithm is the Random

Number Generator (RNG). Considering a uniform RNG method, FPGAs obtain on the order of 15× and 60× more samples per second than a GPU and a CPU, respectively [28]. For Gaussian and exponential distributions, performance is similar across platforms. In [29], implementations of a quasi-Monte Carlo simulation on a Virtex-4 FPGA, an NVIDIA 8800GTX GPU and a 2.8 GHz Xeon CPU are compared. According to the results, FPGAs outperform CPU-based implementations by one order of magnitude and achieve around 3× speedup over equivalent GPU-based implementations. Power consumption measurements show FPGAs to be 336× more energy efficient than CPUs, and 16× more energy efficient than GPUs. A more recent comparison between the Xeon Phi, an Ivy Bridge CPU and a Tesla K20X running two compute-intensive financial analytics algorithms [20] reveals that for 512k paths the GPU is twice as fast as the Xeon Phi, and the Ivy Bridge is faster than the Xeon Phi (from 1.2× to 1.9×). According to the author, this is due to the heavy memory operations, which take advantage of the cache. In [7] the authors compare an i7-960 CPU, a GTX480 GPU and a Virtex-6 FPGA running Black-Scholes and conclude that the GPU is about 1.4× faster than the FPGA and 22× faster than the CPU. Considering power efficiency, the GPU is about 1.3× more power efficient than the FPGA and almost 40× more than the CPU.

Combinational logic problems like encryption, optimization algorithms or hashing are very well suited to FPGA logic implementation. The same is true for finite state machines, which include compression and cellular automata, in which cases dozens of peta operations/s can be obtained. These performances cannot be achieved on other many-core devices. For example, in [31] different encryption algorithms are targeted at a Tesla C2090 GPU and a ZYNQ XC7Z020 FPGA and compared to a Xeon processor. The GPU outperforms the CPU by 13×. The FPGA provides a throughput speedup of 6-9× over the GPU for 8 KB and 16 KB plaintext blocks.
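As a toy instance of the Monte Carlo/MapReduce pattern discussed above (generate uniform samples, map a predicate over them, reduce by averaging), consider the classic π estimate; it stands in for a path-pricing payoff average, while the cited comparisons of course use far more elaborate RNGs and kernels:

```python
import numpy as np

def monte_carlo_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi as 4x the fraction of uniform points
    falling inside the unit quarter-circle."""
    rng = np.random.default_rng(seed)
    x = rng.random(samples)
    y = rng.random(samples)
    inside = np.count_nonzero(x * x + y * y <= 1.0)  # map + reduce
    return 4.0 * inside / samples

print(monte_carlo_pi(1_000_000))  # ≈ 3.14
```

The structure makes the platform trade-off visible: throughput is bounded by how fast samples can be generated and reduced, which is exactly where the RNG comparisons above favor FPGAs.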
This speedup is due to the capacity of FPGAs for streaming computation. For larger plaintext sizes, the GPU and FPGA provide similar throughput.

This brief analysis allows us to highlight some trends about which platforms should be used for specific applications. Considering the type of workload, we can establish a relation with the type of processor. A sequentially dominated workload clearly belongs on a CPU, since its higher frequencies improve sequential execution. CPUs are also very good for iterative workloads, leaving some space for FPGAs. Parallel-intensive applications require the many-core processing of a GPU, FPGA or many-core CPU. Some parallel-intensive applications should be implemented on a specific platform, while others could be implemented equally well on several platforms. For algebraic operations, spectral analysis, N-body simulations and fluid dynamics, GPUs achieve better performance figures for a larger set of applications, followed by the many-core CPU, the FPGA and finally the CPU. This trend is getting stronger given the faster increase in the peak performance of GPUs. GPUs are also improving on their major disadvantage, power consumption,

almost reaching the numbers of an FPGA. In applications like Monte Carlo-based algorithms, a particular case of MapReduce problems, FPGAs are more efficient in implementing random number generators, and for Monte Carlo-based applications, like some financial computing analyses, FPGAs outperform the other platforms. Finally, combinational logic problems, like RSA encryption, graph problems and image/signal processing without floating-point, are very well suited to FPGA implementation. FPGAs support custom FP precisions, which makes it possible to optimize the size of the arithmetic units. It is also possible to design fused datapaths that permit simplifications in normalization, shifting and routing, and improvements in frequency [1]. These techniques reduce the sustained-performance gap to GPUs and many-core CPUs, but not enough for HPC applications. Worse, considering the peak-performance trend, FPGAs are falling further behind GPUs and many-core CPUs, increasing the number of applications best served by GPUs and CPUs and reducing the number of algorithms whose best performance is achieved on FPGAs. If we move to fixed-point operations, then FPGAs can achieve peta operations per second, not achievable with other platforms.

The HPC industry is moving to hybrid computing models and platforms where CPUs, GPUs, FPGAs and many-core CPUs work together to perform scientific computing applications. In fact, different parts of an algorithm may be sequential, iterative, data-parallel or memory intensive, and different workloads are better suited to different devices. Thus, an algorithm could run faster if its different parts could run on different devices. A new perspective for reconfigurable logic in HPC with promising results has started to appear: reconfigurable GPUs and reconfigurable many-core CPUs. Performance and energy efficiency can be improved with reconfigurable many-core architectures [5].
In the future, we will probably have several reconfigurable GPUs and/or reconfigurable many-core CPUs integrated in a single chip, which would increase performance and power efficiency.

IV. CONCLUSION

This paper reports the peak performances of FPGAs, GPUs, many-core CPUs and CPUs and surveys their sustainable performances and power efficiencies. From the trends in peak performance, we conclude that FPGAs are not keeping pace with the other platforms in the execution of floating-point-intensive applications, which are consequently migrating to these many-core software-based platforms. FPGAs remain competitive when using operands with custom data widths and are still the best option for combinational logic problems, finite state machines and parallel MapReduce problems. GPUs and many-core CPUs are dominating high-performance computing with intensive FP calculations. The HPC industry is moving to hybrid platforms with a mix of CPUs, GPUs, FPGAs and many-core CPUs working together. These solutions potentially reduce power consumption while increasing performance.

ACKNOWLEDGMENT

This work was supported by national funds through FCT, Fundação para a Ciência e a Tecnologia, under projects PEst-OE/EEI/LA0021/2013 and PTDC/EEA-ELC/122098/2010.

V. REFERENCES

[1] Altera, "Achieving One TeraFLOPS with 28-nm FPGAs", White paper WP-01142-1.0, September 2010.
[2] Altera, "Radar Processing: FPGAs or GPUs", White paper WP-01197-2.0, May 2013.
[3] Asanovic, K. and UC Berkeley CSD, "The Parallel Computing Laboratory at U.C. Berkeley: A Research Agenda Based on the Berkeley View", Tech. Rep. UCB/EECS-2008-23, UC Berkeley, 2008.
[4] Bailey, P., Myre, J., Walsh, S., Lilja, D. and Saar, M., "Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors", International Conference on Parallel Processing, 2009, pp. 550-557.
[5] Braak, G.-J. and Corporaal, H., "GPU-CC: A Reconfigurable GPU Architecture with Communicating Cores", International Workshop on Software and Compilers for Embedded Systems, 2013.
[6] Capuzzo, R. and Spera, M., "A performance comparison of different graphics processing units running direct N-body simulations", Computer Physics Communications, vol. 184, no. 11, Nov. 2013, pp. 2528-2539.
[7] Chung, E. et al., "Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?", International Symposium on Microarchitecture, Dec. 2010, pp. 225-236.
[8] Cray Inc., "Cray XD1 Supercomputer", 2006. http://www.cray.com/products/xd1
[9] Duan, B. et al., "Floating-point mixed-radix FFT core generation for FPGA and comparison with GPU and CPU", International Conference on Field-Programmable Technology, December 2011, pp. 1-6.
[10] George, A., Lam, H. and Stitt, G., "Novo-G: At the Forefront of Scalable Reconfigurable Supercomputing", Computing in Science & Engineering, vol. 13, no. 1, 2013, pp. 82-86.
[11] Hamada, T., Benkrid, K., Nitadori, K. and Taiji, M., "A Comparative Study on ASIC, FPGAs, GPUs and General Purpose Processors in the O(N²) Gravitational N-body Simulation", NASA/ESA Conference on Adaptive Hardware and Systems, 2009, pp. 447-452.
[12] http://software.intel.com/en-us/intel-mkl
[13] http://www.hpc.mcgill.ca/downloads/gpu_workshop/clumeq_intro_apps_FINAL.pdf
[14] http://www.nvidia.com/object/tesla-servers.html
[15] http://www.intel.com/content/dam/www/public/us/en/documents/performance-briefs/xeon-phi-product-family-performance-brief.pdf
[16] Intel, "Intel Xeon Phi Product Family Performance", April 2013.
[17] José, W., Silva, A., Neto, H. and Véstias, M., "Analysis of Matrix Multiplication on High Density Virtex-7 FPGA", International Conference on Field Programmable Logic and Applications, 2013, pp. 1-4.
[18] Kestur, S., Davis, J. and Williams, O., "BLAS Comparison on FPGA, CPU and GPU", Annual Symposium on VLSI, 2010, pp. 288-293.
[19] Kono, Y., Sano, K. and Yamamoto, S., "Scalability analysis of tightly-coupled FPGA-cluster for lattice Boltzmann computation", International Conference on Field Programmable Logic and Applications, 2012, pp. 120-127.
[20] Lotze, J., "Benchmarks: Intel Xeon Phi vs. NVIDIA Tesla GPU", Sept. 2013, http://blog.xcelerit.com/intel-xeon-phi-vs-nvidia-tesla-gpu/
[21] Matam, K., Le, H. and Prasanna, V., "Evaluating energy efficiency of floating point matrix multiplication on FPGAs", High Performance Extreme Computing Conference (HPEC), September 2013, pp. 1-6.
[22] NVIDIA, "CUDA Toolkit 5.0: Performance Report", 2013.
[23] NVIDIA, "NVIDIA Tesla K20-K20X GPU Accelerators - Benchmarks", November 2012.
[24] Schifano, S., "Multi- and many-core computing for Physics applications", June 2013.
[25] Smith, R. and Garg, R., "NVIDIA's GeForce GTX Titan Review, Part 2: Titan's Performance Unveiled: Titan's Compute Performance", http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3, 2013.
[26] SRC Computers, Inc., http://www.srccomp.com/
[27] Takahashi, O. et al., "Migration of Cell Broadband Engine from 65nm SOI to 45nm SOI", International Solid-State Circuits Conference, pp. 86-597, Feb. 2008.
[28] Thomas, D., Howes, L. and Luk, W., "A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation", International Symposium on Field-Programmable Gate Arrays, 2009, pp. 63-72.
[29] Tian, X. and Benkrid, K., "High-Performance Quasi-Monte Carlo Financial Simulation: FPGA vs. GPP vs. GPU", ACM Trans. Reconfigurable Technol. Syst., vol. 3, no. 4, Article 26, November 2010, 22 pages.
[30] Vangal, S. et al., "An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS", IEEE Journal of Solid-State Circuits, 43(1):29-41, 2008.
[31] Venugopal, V. and Shila, D., "High throughput implementations of cryptography algorithms on GPU and FPGA", International Instrumentation and Measurement Technology Conference, pp. 723-727, May 2013.
[32] Vladimirov, A. and Karpusenko, V., "Test-driving Intel Xeon Phi coprocessors with a basic N-body simulation", White paper, Colfax International, January 2013.
[33] Xilinx, "LogiCORE IP Floating-Point Operator v6.0", DS816, January 18, 2012.
[34] Yang, D., Sun, J., Lee, J., Liang, G., Jenkins, D., Peterson, G. and Li, H., "Performance Comparison of Cholesky Decomposition on GPUs and FPGAs", Symposium on Application Accelerators in High Performance Computing, 2010.