A GPU Accelerated BiConjugate Gradient Stabilized Solver for Speeding-up Large Scale Model Evaluation

International Journal of Economic Practices and Theories, Vol. 3, No. 3, 2013 (July), e-ISSN 2247–7225 www.ijept.org

by Alexandru Voicu
The Bucharest University of Economic Studies, Faculty of Management, Romania
[email protected]

Abstract. Solving linear systems remains a key activity in economics modelling, making fast and accurate methods for computing their solutions highly desirable. In this paper, a proof-of-concept C++ AMP implementation of an iterative method for solving linear systems, BiConjugate Gradient Stabilized (henceforth BiCGSTAB), is presented. The method relies on matrix and vector operations, which can benefit from parallel implementations. The work contained herein details the process of arriving at both a moderately parallel and a widely parallel implementation. Furthermore, the construction of two typical sparse data containers in a C++ AMP friendly manner is fleshed out. The implementation is evaluated by solving a number of large-scale linear systems to an exact or ε-exact solution.

Keywords: non-stationary iterative method, economics modelling, GPU, C++ AMP.

JEL classification: C63.

1 Introduction

It would be quite difficult to overstate the crucial importance that solving linear systems of equations holds in science in general and in economics in particular. Indeed, one need only consult (Amman et al. 1996) to see that most modelling and analysis methods currently used in economics, from Computable General Equilibrium models and Constrained Optimization to Infinite Horizon Dynamic Programming / Optimal Control problems, require, sooner or later, a system of linear equations to be solved. Moreover, as the cases undergoing analysis become increasingly complex and, by way of consequence, larger, access to fast solvers becomes crucial: a model that cannot be repeatedly evaluated (solved) efficiently is of little use to economists, who cannot afford to spend hours or even days evaluating a single iteration of the model.

Figure 1. Solver performance expressed as speed-up versus a parallel implementation running on an AMD E-350 APU ("AMD Fusion Developer Summit 2012" 2012)

We emphasize that the system of equations portion represents only a part of the aggregate workload, so the impact that speed-ups achieved in this area have on the overall speed of evaluation is directly proportional to the weight that the solve steps hold in the particular test case. However, there is strong evidence that, at least for larger scale problems, the solve step becomes the primary bound on performance. As an example, let us consider a frequently encountered task in economics, that of solving dynamic programming problems through policy iteration (Rust 1996). Within the algorithm, the most expensive step is the policy valuation one, in which the value function $V$ attached to a particular policy $\alpha_k$ is computed by solving the linear system $(I - \beta M_{\alpha_k})\, V_{\alpha_k} = u_{\alpha_k}$, where $u_\alpha$ is a single-period utility function, $M_\alpha$ is the Markov operator and $\beta$ is the discount factor. If we take $S$ as the size of the state space associated with the problem, then $M_\alpha$ is an $S \times S$ matrix. In this context, even for small values $S \le 500$, the resulting linear system is quite formidable and can be rather computationally expensive to solve repeatedly. It thus makes sense to pursue the development and implementation of fast solvers.

In this paper we seek to explore the fertile field of parallel and massively parallel iterative solvers implemented on special-purpose accelerators. We contend that this will constitute one of the main avenues to pursue in the never-ending quest for more detailed, large-scale models. Also, based on the observation that the intersection between the set of economists and that of specialist computer programmers is rather small, as expected given the reduced overlap between the two fields until relatively recently, we present a token implementation using a modern accelerator and a modern compute-centric programming environment, one which is quite easy to deploy in the field.
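To make the shape of the valuation step concrete, the following minimal sketch (ours, not the paper's; dense storage and a naive successive-approximation loop are used purely for exposition, and policy_valuation is an invented name) assembles and solves the system above. In the setting of this paper, the inner loop would be replaced by a sparse BiCGSTAB solve of $(I - \beta M_\alpha) V = u_\alpha$:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // dense S x S, row-major; the paper uses sparse storage

// Policy valuation: solve (I - beta * M_alpha) V = u_alpha for V.
// The fixed-point sweep V <- u_alpha + beta * M_alpha * V converges because
// 0 < beta < 1 and M_alpha is a Markov (row-stochastic) operator; an
// iterative solver such as BiCGSTAB would stand in for this loop in practice.
Vec policy_valuation(const Mat& M_alpha, const Vec& u_alpha, double beta,
                     int max_sweeps = 10000, double tol = 1e-12) {
    const std::size_t S = u_alpha.size();
    Vec V(S, 0.0);
    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        Vec V_next(S);
        double delta = 0.0; // largest change across the state space this sweep
        for (std::size_t i = 0; i < S; ++i) {
            double mv = 0.0;
            for (std::size_t j = 0; j < S; ++j) mv += M_alpha[i][j] * V[j];
            V_next[i] = u_alpha[i] + beta * mv;
            delta = std::max(delta, std::abs(V_next[i] - V[i]));
        }
        V.swap(V_next);
        if (delta < tol) break; // value function attached to the policy has converged
    }
    return V;
}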

2 The BiCGSTAB Method

The BiCGSTAB method (Van der Vorst 1992) belongs to the rather expansive family of Krylov methods (Lanczos 1950). Its development was motivated by the irregular convergence patterns seen in other fast methods, for example Conjugate Gradient Squared (henceforth CGS) (Sonneveld 1989). Moreover, it has no problem handling non-symmetric linear systems. Given spatial constraints we will avoid an in extenso characterization of iterative methods in general, or of related and precursor methods in particular, but rather assume reader familiarity and proceed to outline BiCGSTAB's traits and its computational structure. We direct the reader interested in a general treatment of iterative methods to (Barrett 1994).

Starting from the fundamental CGS residual sequence

$r_i = P_i^2(A)\, r_0,$

BiCGSTAB follows a different update rule,

$r_i = Q_i(A)\, P_i(A)\, r_0,$

with $i$ the iteration count, $A$ the coefficient matrix, $Q_i(A)$ an $i$th degree polynomial describing the steepest descent update and $P_i(A)$ an $i$th degree polynomial acting as a "contraction" operator. In terms of speed of convergence, BiCGSTAB converges roughly as fast as CGS, with some variance. In terms of computational requirements, two (sparse) matrix–vector products and four inner (dot) products are required per iteration. Figure 2 presents the algorithmic structure of the method in pseudo-code (note that in Figure 2 we abusively reuse M to denote a preconditioner, not the Markov operator, as was the case in the introductory portion of the text):

Figure 2: BiCGSTAB pseudo-code (Barrett, 1994)

It has to be noted that whilst endowed with a relatively regular convergence behaviour, the method can break down if the local residual minimization embodied by $Q_i(A)$ stagnates, thus halting the expansion of the Krylov subspace. Moreover, if the coefficient matrix is indefinite, convergence will not be achieved. We do not deal with these cases in the present work, beyond mentioning that refinements aimed at solving the former problem have been proposed in the literature (Gutknecht 1991; Sleijpen & Fokkema 1993), and that left-hand side multiplication by $A^T$ produces a system with a symmetric positive definite coefficient matrix, thus solving the latter, at the expense of slower convergence, since the conditioning of the transformed system is the square of the original condition number.
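Since Figure 2 does not reproduce well here, the following serial sketch of the unpreconditioned method (our own transcription after (Barrett 1994); dense, plain C++ for readability, whereas the implementation discussed later is sparse and C++ AMP based) records the computational structure. Note the two matrix–vector products and the four inner products per iteration, matching the operation count given above:

#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // dense, for exposition only

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

// Unpreconditioned BiCGSTAB for A x = b, starting from x = 0; stops once the
// residual norm falls below tol * ||b||. Breakdown checks (e.g. (r_hat, v)
// or (t, t) vanishing) are omitted for brevity.
Vec bicgstab(const Mat& A, const Vec& b, double tol = 1e-10, int max_it = 1000) {
    const std::size_t n = b.size();
    Vec x(n, 0.0), r = b, r_hat = b, p(n, 0.0), v(n, 0.0), s(n), t(n);
    double rho = 1.0, alpha = 1.0, omega = 1.0;
    const double bnorm = std::sqrt(dot(b, b));
    for (int it = 0; it < max_it; ++it) {
        const double rho_next = dot(r_hat, r);            // inner product 1
        const double beta = (rho_next / rho) * (alpha / omega);
        rho = rho_next;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * (p[i] - omega * v[i]);
        v = matvec(A, p);                                 // matrix-vector product 1
        alpha = rho / dot(r_hat, v);                      // inner product 2
        for (std::size_t i = 0; i < n; ++i) s[i] = r[i] - alpha * v[i];
        t = matvec(A, s);                                 // matrix-vector product 2
        omega = dot(t, s) / dot(t, t);                    // inner products 3 and 4
        for (std::size_t i = 0; i < n; ++i) x[i] += alpha * p[i] + omega * s[i];
        for (std::size_t i = 0; i < n; ++i) r[i] = s[i] - omega * t[i];
        if (std::sqrt(dot(r, r)) <= tol * bnorm) break;   // epsilon-exact solution reached
    }
    return x;
}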

3 The Graphics Processing Unit

The Graphics Processing Unit (henceforth GPU) started its existence as a domain-specific accelerator, aimed at the commodity market and engineered to allow for efficient rendering of three-dimensional graphics used in videogames. We note that graphics accelerators for three-dimensional rendering pre-date the commodity processors considered here, whilst underlining that their cost structure and excessive domain specificity make them irrelevant in the context of the present discussion. Fuelled by the sustained growth of the total addressable market, coupled with ever-increasing demands for more realistic graphics, an accelerated evolution has taken the GPU from a collection of fixed-function blocks to a heterogeneous processor dominated by programmable Arithmetic Logic Units (henceforth ALU), though not completely deprived of fixed-function elements.

A large part of the computational cost of three-dimensional rendering is constituted by dense algebra operations, executed across a high number of data items (conceptually vertices, primitives or pixels) in parallel. This has led the modern GPU to essentially become a wide Single Instruction Multiple Data (henceforth SIMD) array processor, where we use SIMD array processor in the sense defined in (Flynn 1972). Amortising the power and spatial cost of control hardware across a high number of processing elements allows for high computational densities: for example, a current generation AMD HD 7970 GPU can theoretically achieve a throughput of 3.8 TeraFLOPs per second in single-precision computations and 950 GigaFLOPs per second in double-precision computations. These theoretical numbers are all the more notable when taking into account that we are considering a simple add-in board that can be freely bought and plugged into any mainstream computer, as opposed to exotic hardware. The combination of high computational density and general accessibility is not specific solely to last generation GPUs, having been present in prior generations too.

Considering the demand for accessible computational throughput mentioned above, it would be appealing to explain the appearance and subsequent intense development of the so-called General-Purpose Programming of GPUs (henceforth GPGPU) field solely as a levering of the combination of high computational density and low cost outlined in the prior paragraph. However, whilst this has undoubtedly been the raison d'être for the initial interest, the sustained investment of time offered by researchers from differing fields cannot be explained in this simplistic fashion alone. Indeed, we contend that the second key element is the evolution of the Application Programming Interfaces (henceforth API) used for interfacing with and exploiting GPUs. Whereas the first experiments made use of graphics APIs and non-intuitive means of casting general problems into the mould imposed by such a domain-specific interface (Bolz et al. 2003), in recent years compute-centric APIs have been developed for programming GPUs. We consider CUDA (Nickolls et al. 2008), OpenCL (Munshi et al. 2011), and C++ AMP (Gregory & Miller 2012) to be the main GPU compute APIs, and we contend that the surge of interest in GPGPU experienced in the past four years is due to their existence, coupled with the widespread availability of high-throughput and relatively cheap GPUs – it can be argued that the introduction of CUDA established a marketplace that justified the development of OpenCL and subsequently C++ AMP.

At the present time, we conjecture that C++ AMP represents a local optimum in terms of GPU compute APIs, fusing the expressive power of C++ with portability across hardware platforms. The former leads to an increase in genericity and abstraction of the code, both desirable traits for programs that are intended to be re-used and understood by non-specialists as well. Fundamentally, a non-arcane programming model coupled with cheap, high-throughput hardware means that certain tasks that were firmly reserved for significantly more expensive and less accessible computing solutions become realisable for a much higher number of researchers. However, we underline that there exists a consistent differential between the theoretical throughput claimed for GPUs and what can be achieved in practice, and that their architectures do not lend themselves well to implementing a vast array of scientifically relevant computational tasks (Lee et al. 2010; Vuduc et al. 2010).
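To give a flavour of what such a compute-centric API looks like in practice, the fragment below (a generic SAXPY, not taken from the solver) shows the core C++ AMP idioms: array_view objects mediating host–accelerator data transfer and a restrict(amp) lambda executed across a one-dimensional extent. This is a minimal sketch of the programming model, not of the paper's implementation:

#include <amp.h>
#include <vector>
using namespace concurrency;

// y = a * x + y, executed on the default accelerator (e.g. a GPU).
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const int n = static_cast<int>(y.size());
    array_view<const float, 1> xv(n, x); // read-only view, copied to the device on demand
    array_view<float, 1> yv(n, y);       // read-write view
    parallel_for_each(yv.extent, [=](index<1> i) restrict(amp) {
        yv[i] = a * xv[i] + yv[i];
    });
    yv.synchronize();                    // copy the result back to host memory
}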

4 Performance evaluation

(This section is based on the presentation "A Proof-of-Concept C++ AMP Implementation of the BiConjugate Gradient Stabilized Method on AMD CPUs, GPUs and APUs" given by the author at AMD Fusion Developer Summit 12, Bellevue, Washington, United States, 2012.)

The performance of BiCGSTAB (and, indeed, of most iterative solvers) is primarily bound by the throughput of a fundamental primitive, the matrix–vector product. Moreover, a peculiarity of the large systems that we are considering is that the coefficient matrix is quite sparse, with only some of the unknowns being present in a particular equation, as opposed to the entire set. With this insight in mind, and levering C++ AMP, our implementation relies on two objects designed from the bottom up:

• AMP_vector: a class wrapping around a container:
◦ overloaded operators =, +, * and -;
◦ data flows in and out through array_view<T> (Gregory & Miller 2012) interfaces, where T is a primitive data-type;
• AMP_sparse_matrix: a class implementing a matrix in Compressed Row Storage format (Saad 1990; Eijkhout 1992):
◦ sparse matrix–vector product accelerated through C++ AMP, using the algorithm from (Bell & Garland 2008);
◦ norm retrieval and approximate normalization possible.

GPUs are particularly sensitive to irregular data-access patterns, so we paid attention to linearising our traversal. Furthermore, we make use of rapid on-chip memory (LDS) to lower bandwidth pressure: we write the partial products into the LDS instead of exporting them to main RAM, and only then perform the final accumulation, writing out just the result. A sketch of such a sparse matrix–vector kernel is given below.
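The fragment below sketches the Compressed Row Storage layout and a one-thread-per-row C++ AMP kernel in the spirit of the scalar kernel from (Bell & Garland 2008). The container and function names (csr_matrix, spmv) are illustrative rather than the paper's AMP_sparse_matrix, and the LDS staging of partial products described above is omitted for brevity:

#include <amp.h>
#include <vector>
using namespace concurrency;

struct csr_matrix {
    int rows;
    std::vector<int> row_start;   // rows + 1 entries; row i spans [row_start[i], row_start[i+1])
    std::vector<int> col_index;   // column of each stored coefficient (nnz entries)
    std::vector<float> values;    // the non-zero coefficients themselves (nnz entries)
};

// y = A * x, one GPU thread per matrix row.
void spmv(const csr_matrix& A, const std::vector<float>& x, std::vector<float>& y) {
    array_view<const int, 1> rs(static_cast<int>(A.row_start.size()), A.row_start);
    array_view<const int, 1> ci(static_cast<int>(A.col_index.size()), A.col_index);
    array_view<const float, 1> va(static_cast<int>(A.values.size()), A.values);
    array_view<const float, 1> xv(static_cast<int>(x.size()), x);
    array_view<float, 1> yv(static_cast<int>(y.size()), y);
    yv.discard_data(); // y is write-only: skip the host-to-device copy
    parallel_for_each(extent<1>(A.rows), [=](index<1> idx) restrict(amp) {
        const int row = idx[0];
        float acc = 0.0f;
        for (int k = rs[row]; k < rs[row + 1]; ++k)
            acc += va[k] * xv[ci[k]]; // gather: the access pattern on x is irregular
        yv[row] = acc;
    });
    yv.synchronize(); // bring the result back to host memory
}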

Whilst we have successfully used our solution on a number of hardware architectures, we constrain the published results to the following configuration:

• AMD FX-8150 3.6 GHz processor;
• AMD Radeon HD 7970 GPU with 3 GiB of RAM;
• 16 GiB of system RAM;
• Windows 8 RTM;
• Visual Studio 2012 RTM.

The code was compiled with all optimization flags enabled in Visual Studio. Our optimisations focus on exploiting the strengths of the hardware as much as possible:

• moving from std::vector (Josuttis 2012) data containers to concurrency::array (Gregory & Miller 2012) allows us to explicitly pin the data on the GPU and avoid any spurious (and expensive) copies back to CPU-accessible memory;
• fixing the tile size at 64 threads, which is the size of the hardware unit of scheduling for the AMD Radeon HD 7970, ensures that full barrier synchronization is elided;
• adding constructors and assignment operators that take rvalue (Williams 2012) arguments removes any remaining spurious temporary data (see the sketch below).

After implementing this series of optimisations, the final performance situation is presented in Figure 3, across a set of publicly available problems from the Harwell-Boeing collection (Anon n.d.). Except for bcsstk05, all problem instances are quite large and very sparse. For example, mac_econ_fwd500, which originates from a macroeconomic modelling task, has a 206500 × 206500 coefficient matrix, out of which 1273389 coefficients are non-zero. As is obvious, we achieve notable speed-ups versus a high-end multi-core processor, thus opening the gate for the more efficient evaluation of large-scale models.

We express the performance as speed-ups versus a baseline, as opposed to providing full measurements, because we believe this is more compact and thus better suited to the present context. However, full performance characterizations can be made available upon request. Furthermore, our implementation will be made publicly available, with full source code under a permissive license.
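As an illustration of the last bullet in the optimisation list above, the fragment below (a simplified stand-in for AMP_vector; member names are invented) shows how move construction and move assignment let the temporaries produced by overloaded operators hand over their GPU-resident concurrency::array storage instead of copying it:

#include <amp.h>
#include <utility>

// Simplified stand-in for the AMP_vector described above; the real class also
// carries the overloaded =, +, * and - operators and array_view based I/O.
class AMP_vector {
    concurrency::array<float, 1> data_; // storage pinned on the accelerator
public:
    explicit AMP_vector(int n) : data_(n) {}
    // Move operations transfer ownership of the device allocation, so that
    // expression temporaries are not duplicated on the GPU.
    AMP_vector(AMP_vector&& other) : data_(std::move(other.data_)) {}
    AMP_vector& operator=(AMP_vector&& other) {
        data_ = std::move(other.data_);
        return *this;
    }
};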

5 Conclusions

We have presented an early, yet functionally complete implementation of the BiCGSTAB method using C++ AMP to lever the latest GPUs. We contend that such approaches are likely to yield notable benefits to economists dealing with large-scale models or problems. To this end, the code, which shall be made public once finalised, is structured in a modular fashion and is arguably easy to plug into existing frameworks. In future investigations we will seek to evaluate other iterative solvers (there is a large array of such methods), whilst at the same time aiming for further economic specialization. We argue that the initial implementation had to be presented as clearly as possible, without muddying the waters by drawing in other topics, which is why we have avoided customizing the discourse heavily to the economics sub-space. However, with the introduction of the tool complete, we can use it in future publications more intensely geared towards economics.

Acknowledgement

This work was co-financed from the European Social Fund through Sectoral Operational Programme Human Resources Development 2007-2013; project number POSDRU/107/1.5/S/77213 "Ph.D. for a career in interdisciplinary economic research at the European standards".

References

Amman, H.M., Kendrick, D.A. & Rust, J. eds., 1996. Handbook of Computational Economics, Volume 1, 1st ed., North Holland.

Anon, n.d. Matrix Market: Harwell-Boeing Sparse Matrix Collection (Release I). Available at: http://math.nist.gov/MatrixMarket/collections/hb.html [Accessed May 26, 2013].

Barrett, R. et al., 1994. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Society for Industrial and Applied Mathematics.

Bell, N. & Garland, M., 2008. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation.

Bolz, J. et al., 2003. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics, 22(3), pp.917–924.

Eijkhout, V., 1992. LAPACK Working Note 50: Distributed Sparse Data Structures for Linear Algebra Operations.

Flynn, M.J., 1972. Some Computer Organizations and Their Effectiveness. IEEE Transactions on Computers, C-21(9), pp.948–960.

Gregory, K. & Miller, A., 2012. C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++, Microsoft Press.

Gutknecht, M.H., 1991. Variants of BICGSTAB for Matrices with Complex Spectrum.

Josuttis, N.M., 2012. The C++ Standard Library: A Tutorial and Reference, 2nd ed., Addison-Wesley Professional.

Lanczos, C., 1950. An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, United States Government Printing Office.

Lee, V.W. et al., 2010. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Computer Architecture News, 38(3), pp.451–460.

Munshi, A. et al., 2011. OpenCL Programming Guide, Pearson Education.

Nickolls, J. et al., 2008. Scalable Parallel Programming with CUDA. Queue, 6(2), pp.40–53.

Rust, J., 1996. Numerical dynamic programming in economics. Handbook of Computational Economics, 1, pp.619–729.

Saad, Y., 1990. SPARSKIT: A Basic Toolkit for Sparse Matrix Computations. Research Institute for Advanced Computer Science (US).

Sleijpen, G. & Fokkema, D., 1993. BiCGstab(l) for Linear Equations Involving Unsymmetric Matrices with Complex Spectrum.

Sonneveld, P., 1989. CGS, a Fast Lanczos-Type Solver for Nonsymmetric Linear Systems. SIAM Journal on Scientific and Statistical Computing, 10(1), pp.36–52.

Van der Vorst, H.A., 1992. Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems. SIAM Journal on Scientific and Statistical Computing, 13(2), pp.631–644.

Vuduc, R. et al., 2010. On the Limits of GPU Acceleration. In Hot Topics in Parallelism (HotPar). Berkeley, California, USA.

Williams, A., 2012. C++ Concurrency in Action: Practical Multithreading, 1st ed., Manning Publications.