Performance of GPU for Pricing Financial Derivatives: Convertible Bonds *

JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 30, 141-155 (2014)

YUH-DAUH LYUU1, KUO-WEI WEN2 AND YI-CHUN WU3
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, 106 Taiwan
E-mail: {1lyuu; 3r95080}@csie.ntu.edu.tw; [email protected]

Financial derivatives are financial instruments whose payoff is linked to some fundamental financial assets or indices. They are essential tools for speculation and risk management. This paper focuses on the pricing of a common type of derivative: convertible bonds (CBs), which incorporate the features of both bonds and stocks. Chambers and Lu propose a popular two-factor tree model for pricing CBs. This paper assesses the efficiency of their model on both the GPU (graphics processing unit) and the CPU. The GPU code exploits the GPU’s inherently parallel architecture and high memory bandwidth. The numerical results show that the GPU code is up to two orders of magnitude faster than the CPU code. These positive results encourage more use of GPUs on computation-intensive problems in financial engineering, such as pricing derivatives by the tree-based models studied in this paper.

Keywords: convertible bond, default risk, tree model, GPU, CUDA, parallel processing

Received September 28, 2011; revised December 14, 2011; accepted January 13, 2012. Communicated by Chi-Jen Lu.
* The first and second authors were supported in part by the National Science Council of Taiwan under Grant No. 100-2221-E-002-111-MY3. The first author was supported in part by the Excellent Research Projects of National Taiwan University under Grant 99R80300.

1. INTRODUCTION

Financial derivatives are financial instruments whose payoff is based on the values of some underlying assets or indices. Examples include options, forward contracts, futures, convertible bonds (CBs), structured notes, swaps, and mortgage-backed securities [19]. They are essential tools for speculation and risk management. According to the Bank for International Settlements [3], the size of the global over-the-counter (OTC) derivatives market increased steadily from 1999 to 2010, as shown in Fig. 1.

Fig. 1. The global over-the-counter (OTC) derivatives market (1999-2010).

The focus of our paper is the pricing of CBs. CBs are complex and widely used financial instruments that combine the characteristics of both bonds and stocks. Their distinctive feature is that they give the holders the right to convert them into stocks. Pricing CBs should take into account many factors such as the underlying stock’s price, the interest rate, and the credit risk of the issuing company. This paper takes a popular CB pricing model as its case study because its computational structure is representative of almost all such algorithms: the Chambers-Lu tree model [9]. This model has a cubic time complexity.

Traditionally, a pricing model is implemented to run on the CPU (central processing unit), also known as a central processor or a microprocessor. In 1965, Gordon Moore stated that the number of transistors per square inch on integrated circuits would double every 18 months [27]. Moore’s observation, now universally known as Moore’s Law, has held for more than 40 years and is not expected to stop for another decade at least and perhaps much longer. During the same period, CPU speed has also grown exponentially. For example, the Intel CPU Core i7 Extreme Edition 990x released in February 2011 can execute up to 159 billion instructions per second [34]. Nevertheless, a modern CPU can handle only a few threads simultaneously. As the Chambers-Lu tree becomes larger, obtaining the prices takes substantial CPU time. Hence it is important to look for alternative architectures that are able to price CBs much faster than the CPU.

Recently, GPUs (graphics processing units) have emerged as one such candidate because of their massively parallel architecture and high memory bandwidth. They have been extensively used in computation-intensive applications such as microwave engineering and optics, molecular mechanics, 2D/3D image processing, electron tomography, gene analysis, and cryptography; furthermore, the results often show speedups of up to two orders of magnitude over the corresponding CPU implementations [7, 8, 13, 20, 21, 30, 33, 35, 36].

This paper implements Chambers and Lu’s model on the GPU and the CPU. The GPU code’s execution time is then compared with the CPU code’s. Our numerical results show that up to a hundred-fold speedup can be achieved by the GPU. This positive result may encourage more use of GPUs for computation-intensive problems in financial engineering, such as pricing derivatives by the tree-based models studied in this paper.

The rest of this paper is organized as follows. Section 2 introduces CBs. Section 3 describes Chambers and Lu’s model. Section 4 describes how to exploit the GPU’s computational power. Section 5 shows that the GPU can be up to 125 times faster than the CPU. Section 6 discusses our numerical results. Section 7 concludes.

2. CONVERTIBLE BONDS (CBS)

2.1 An Overview of CBs

Like all bonds, CBs are issued with an indenture that includes a coupon rate, a par value, a maturity date, the issue size, and an issue date. CBs often come with a lower coupon rate and yield than pure bonds. The following two terms are important for CBs:

• Conversion ratio: The number of shares into which each CB can be converted.
• Conversion price: The nominal price per share, i.e., the par value divided by the conversion ratio.

The conversion value is defined as the product of the underlying stock’s current market value and the conversion ratio. The conversion value changes with time, and, to rule out arbitrage opportunities, a CB cannot be cheaper than its conversion value while it is convertible. A CB may contain embedded options such as calls and puts. CBs are hybrid securities: they are bonds that can be exchanged for stocks at the discretion of their owners. Table 1 is an example CB that incorporates the features mentioned above [11, 14].

Table 1. A typical CB indenture.
Item                                  Value
Nominal value                         £5,000
Conversion price                      £8.00
Conversion ratio                      625
Coupon                                0%
Underlying stock price                £7.8
Dividend yield of underlying stock    1%
Time to maturity                      1 year
Quoted market price at issue          £103
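As a quick check of the two definitions above, the figures of Table 1 can be plugged in directly. The following is a minimal sketch; the helper names `conversion_price` and `conversion_value` are hypothetical and do not come from the paper. It recovers the £8.00 conversion price and gives a conversion value of £4,875 for the quoted stock price.

```python
def conversion_price(par_value: float, conversion_ratio: float) -> float:
    """Nominal price per share: par value divided by the conversion ratio."""
    return par_value / conversion_ratio


def conversion_value(stock_price: float, conversion_ratio: float) -> float:
    """Market value of the shares obtained by converting one CB."""
    return stock_price * conversion_ratio


# Inputs taken from Table 1.
par_value = 5000.0
ratio = 625
stock_price = 7.8

table1_conversion_price = conversion_price(par_value, ratio)
table1_conversion_value = conversion_value(stock_price, ratio)

print(table1_conversion_price, table1_conversion_value)
```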

2.2 Literature Review

Several methods are available to price CBs. The first is to derive a closed-form solution. For example, Ingersoll presents a closed-form formula to value CBs and studies the optimal conversion strategy for the investors and the optimal call policy for the issuers [17].

The second approach uses partial differential equations (PDEs). Brennan and Schwartz introduce finite-difference schemes to solve the PDE for the CB’s price [5]. They later extend the approach by including stochastic interest rates [6]. McConnell and Schwartz develop a single-factor model based on a finite-difference method to model the stock price [26]. Tsiveriotis and Fernandes take the default risk into account and use two PDEs to price CBs [32].

The tree model is the third numerical method to price CBs. The Cox-Ross-Rubinstein (CRR) model is the first binomial tree model for pricing CBs [12]. Because the CRR model considers neither the interest rate risk nor the default risk, the resulting prices tend to be inaccurate. Later works on tree models for CBs, such as Chambers and Lu [9] and Lyuu and Wang [25], rectify that deficiency by considering more state variables.

CBs can also be priced by Monte Carlo simulation. Longstaff and Schwartz [22], Lvov et al. [23], and Ammann et al. [2] price CBs by simulation. However, the simulation method has some disadvantages: (1) it requires long execution times, (2) it converges slowly, and (3) in its plain form it cannot handle early-exercise features. Although the least-squares Monte Carlo of Longstaff and Schwartz [22] can handle them, it is not competitive in cases for which efficient tree or finite-difference methods are available.

This paper implements the popular Chambers-Lu tree model [9], which is introduced in the next section.

3. CHAMBERS AND LU’S MODEL

3.1 Introduction

The two-factor CB pricing model proposed by Hung and Wang incorporates both interest rate risk and default risk [16]. Chambers and Lu [9] later propose a two-factor CB pricing model that extends Hung and Wang’s model by adding a nontrivial correlation between the short rate and the stock price. A short rate is the interest rate for the shortest duration, i.e., one time period. A short-rate model for the interest rates captures the dynamics of the short rate [24]. Chambers and Lu’s model comprises three trees. The first is the Black-Derman-Toy (BDT) model for the risk-free short rate [4], where the risk-free short rate means the riskless interest rate with the shortest duration. The second tree is the risky short-rate model. The third tree is the Cox-Ross-Rubinstein (CRR) model for the stock price [12].

3.2 The BDT Model

The BDT model can be calibrated to the market term structures, which are the risk-free yields and the short-rate volatilities of zero-coupon bonds of all maturities. Short-rate volatility is the standard deviation of the short rate. A zero-coupon bond does not pay interest before maturity and pays back the par value at maturity. The yield on a bond is the annualized return on that bond. Fig. 2 illustrates the 3-period BDT model. It assumes each period is one year long for simplicity. $R$ is the risk-free short rate between times $t = 0$ and $t = 1$, $R_u$ and $R_d$ are the two risk-free short rates between times $t = 1$ and $t = 2$, and $R_{uu}$, $R_{ud}$, and $R_{dd}$ are the three risk-free short rates between times $t = 2$ and $t = 3$. The probability of the risk-free short rate’s up or down move at each node in the BDT model is 0.5 [4].

Fig. 2. A 3-period BDT model.
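To make the calibration idea concrete for the first two periods of Fig. 2, the sketch below solves for the two short rates at time t = 1 by bisection. It is only an illustration under stated assumptions: one-year periods with annual compounding, equal branch probabilities of 0.5, and the usual BDT spacing ln(R_u/R_d) = 2σ_r√Δt. The function name `calibrate_two_period_bdt`, its parameters, and the use of bisection are hypothetical choices for this example; the paper itself builds the tree with the quadratic-time algorithm of [24].

```python
import math


def calibrate_two_period_bdt(y1, y2, sigma_r, dt=1.0, tol=1e-12):
    """Return (R_d, R_u) so that the tree reprices the two-year zero-coupon bond.

    y1, y2  : annual yields of the one- and two-year zero-coupon bonds
    sigma_r : short-rate volatility (assumed constant)
    """
    target = 1.0 / (1.0 + y2) ** 2               # market value of 1 paid at t = 2
    ratio = math.exp(2.0 * sigma_r * math.sqrt(dt))  # R_u / R_d

    def model_price(r_d):
        r_u = r_d * ratio
        v_u = 1.0 / (1.0 + r_u)                  # value at node R_u, t = 1
        v_d = 1.0 / (1.0 + r_d)                  # value at node R_d, t = 1
        return (v_u / 2.0 + v_d / 2.0) / (1.0 + y1)

    # Assumption: the solution lies between 0% and 100%.
    # model_price decreases as r_d grows, so plain bisection applies.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if model_price(mid) > target:
            lo = mid                             # rate too low, price too high
        else:
            hi = mid
    r_d = (lo + hi) / 2.0
    return r_d, r_d * ratio


# Flat 1% yields and 1% short-rate volatility, as in the paper's experiments.
r_d, r_u = calibrate_two_period_bdt(y1=0.01, y2=0.01, sigma_r=0.01)
print(r_d, r_u)
```

The same matching step would be repeated period by period for longer trees, each time with one more unknown base rate.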


3.3 The Risky Short-Rate Model

Chambers and Lu [9] adopt the approach of Jarrow and Turnbull [18] to construct the risky short-rate model incorporating the default risk. The risky term structure, i.e., the yields on defaultable zero-coupon bonds, is used to infer the probability of default in each period. Jarrow and Turnbull first modify the risk-free short-rate model by adding to each node a defaultable state with a recovery value in the event of default [18]. Then they calculate the default rate for each period from the risky term structure. The short-rate model with defaultable states is illustrated in Fig. 3. In that figure, $\lambda_i$ is the default probability between times $t = i - 1$ and $t = i$ with a recovery rate of $\delta$ once default occurs, where $i = 1, 2$, and each of the upward and downward movements of the risk-free short rate from time $t = i - 1$ to time $t = i$ has probability $(1 - \lambda_i)/2$ when no default occurs. The recovery rate is the percentage of the par value returned to the bondholder. Details on how to extract default rates can be found in [9].

Fig. 3. A 2-period risk-free short-rate model with default added.
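For a single period, the defaultable tree of Fig. 3 reduces to a relation that can be solved by hand: a defaultable zero-coupon bond pays 1 with probability 1 − λ_1 and the recovery δ with probability λ_1, and discounting that expectation at the risk-free short rate must reproduce the price implied by the risky short rate. The sketch below assumes continuous compounding; the function name `one_period_default_prob` is hypothetical. The inputs are the 1% risk-free yield and 1.5% risky yield used in the paper's experiments, and the recovery rate of 0.32 is an assumed value chosen for illustration.

```python
import math


def one_period_default_prob(risk_free_rate, risky_rate, recovery):
    """Solve exp(-R*) = [(1 - lam) + recovery * lam] * exp(-R) for lam."""
    spread_factor = math.exp(-(risky_rate - risk_free_rate))
    return (1.0 - spread_factor) / (1.0 - recovery)


# Assumed inputs for illustration.
lam_1 = one_period_default_prob(risk_free_rate=0.01, risky_rate=0.015, recovery=0.32)

# Consistency check: the tree price equals the price implied by the risky rate.
tree_price = ((1.0 - lam_1) + 0.32 * lam_1) * math.exp(-0.01)
print(lam_1, tree_price, math.exp(-0.015))
```

Later periods have no such closed form because the unknown default probability enters through every node of the period, which is why a numerical root finder is needed there.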

3.4 The CRR Model

The third tree is the CRR model for the stock price depicted in Fig. 4 [12]. The CRR model assumes that the current stock price moves from $S$ to $S_u$ with probability $p$ and to $S_d$ with probability $1 - p$, where $0 < p < 1$ and $d < u$:

$$u = \exp\left(\sigma_s\sqrt{\Delta t}\right),\quad d = \frac{1}{u},\quad S_u = Su,\quad S_d = Sd,\quad p = \frac{\exp(r_f\,\Delta t) - d}{u - d}.$$

Above, $\sigma_s$ and $r_f$ denote the volatility of the stock price and the risk-free short rate, respectively. As Fig. 5 shows, there are four components at each node of the binomial tree model. The first component is the stock price. The next two components are the equity and debt parts of the CB. The former value arises from the contingency that the CB will ultimately become stock. The latter value arises from the contingency that the CB remains a bond.


Fig. 4. A 2-period CRR tree model.
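The CRR quantities behind Fig. 4 translate into a few lines of code. The sketch below is illustrative only: the function names `crr_parameters` and `stock_lattice` are hypothetical, and the stock volatility 0.23, initial price 30, and 1% rate are borrowed from the paper's experiments with a one-year period assumed for simplicity.

```python
import math


def crr_parameters(sigma_s, r_f, dt):
    """Return (u, d, p) of the CRR binomial model."""
    u = math.exp(sigma_s * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r_f * dt) - d) / (u - d)
    return u, d, p


def stock_lattice(s0, u, d, n):
    """Level i holds i + 1 prices; index j counts the down moves so far."""
    return [[s0 * u ** (i - j) * d ** j for j in range(i + 1)] for i in range(n + 1)]


u, d, p = crr_parameters(sigma_s=0.23, r_f=0.01, dt=1.0)
lattice = stock_lattice(s0=30.0, u=u, d=d, n=2)   # the 2-period tree of Fig. 4
for level in lattice:
    print(level)
```

Because d = 1/u, an up move followed by a down move returns to the starting price, so the tree recombines and level i has only i + 1 distinct nodes.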

Fig. 5. The 4 components at each node of the binomial tree model for pricing a CB.

The last component, the rollback value, represents each node’s combined expected value from its child nodes’ values of the equity part and the debt part after discounting. For each node N, the rollback value is

N’s rollback value = N’s equity part + N’s debt part.

For the terminal nodes of the binomial tree, when the conversion value is larger than the par value, the equity part is the conversion value and the debt part is 0; otherwise, the equity part is 0 and the debt part is the par value. For an internal node N1 of the binomial tree, assume its upward and downward children are N2 and N3, respectively. Then N1’s values of the equity part and the debt part are

equity part = (N2’s equity part × p + N3’s equity part × (1 − p)) exp(−risk-free short rate),
debt part = (N2’s debt part × p + N3’s debt part × (1 − p)) exp(−risky short rate).

That is, when rolling back through the tree by backward induction (to be explained later), the equity part is discounted at the risk-free short rate, whereas the debt part is discounted at the risky short rate. During backward induction, two optimality conditions must be checked at each node. First, the issuer may choose to call back the CB when its rollback value exceeds the bond’s call price, at which time the investors are forced to convert. Second, investors may choose to exercise the conversion option when the conversion value exceeds the new rollback value. In summary, the optimal value of the CB for investors at each node is max[min(rollback value, call price), conversion value].

3.5 Chambers and Lu’s Model

Each node of Chambers and Lu’s model has five children: two from the stock price movements, $S_u$ and $S_d$, two from the risk-free short rate’s movements, $R_u$ and $R_d$, and one from the default event, $\delta F$, where $\delta$ is the recovery rate when the default event occurs and $F$ denotes the face value of the risky bond. See Fig. 6 for illustration.

Fig. 6. Branching probabilities at each node in Chambers and Lu’s model.

Chambers and Lu prove that the risk-neutral probabilities for the four non-default children are

$$p_1 = \frac{1}{2}\left[p' + \rho\sqrt{p'(1 - p')}\right],\quad p_2 = \frac{1}{2}\left[p' - \rho\sqrt{p'(1 - p')}\right],$$

$$p_3 = \frac{1}{2}\left[1 - p' - \rho\sqrt{p'(1 - p')}\right],\quad p_4 = \frac{1}{2}\left[1 - p' + \rho\sqrt{p'(1 - p')}\right],$$

$$p' = \frac{\exp(r_t\,\Delta t)/(1 - \lambda_t) - d}{u - d},$$

where $\rho$ is the correlation between the stock price and the risk-free short rate, and $r_t$ and $\lambda_t$ are the risk-free short rate and the default probability for the period. The branching probabilities at each node in Chambers and Lu’s model are also shown in Fig. 6. In that figure, $\lambda_i$ is the default probability between times $i - 1$ and $i$, whereas $p_1$, $p_2$, $p_3$, and $p_4$ are the risk-neutral probabilities for the four non-default children. Chambers and Lu’s model incorporates several state variables: the stock price, the prevailing interest rate at each period, and the default rate. Hence the model should yield more realistic prices for CBs.

3.6 Implementation Methodology

Chambers and Lu’s model involves four tasks: building the BDT model, building the risky short-rate model, building the CRR model, and backward induction. The implementation starts by constructing the BDT model for the risk-free short rate from two inputs: the risk-free yields of zero-coupon bonds of all maturities and a constant short-rate volatility. Referring to Fig. 2, we calculate the discounted values at times $t = 0, 1$ of a payment of one dollar at time $t = 2$. The discounted value refers to the present value of payments that will be received in the future. We obtain the following equations:

$$V = \frac{1}{(1 + \text{two-year zero-coupon bond annual yield})^2},\quad V_u = \frac{1}{1 + R_u},\quad V_d = \frac{1}{1 + R_d},$$

and

$$V = \frac{V_u/2 + V_d/2}{1 + \text{one-year zero-coupon bond annual yield}}, \qquad (1)$$

where $V$ is the discounted value at time $t = 0$, and $V_u$ and $V_d$ are the discounted values at nodes $R_u$ and $R_d$ at time $t = 1$, respectively. The first three equations are about discounting, whereas Eq. (1) carries out backward induction. By the definition of short-rate volatility, we obtain the following equation for the volatility of the risk-free short rate:

$$\mathrm{Var}(\ln R) = E\left[(\ln R)^2\right] - \left(E[\ln R]\right)^2 = \left(\frac{\ln(R_u/R_d)}{2}\right)^2. \qquad (2)$$

Eq. (2) is equivalent to

$$\sigma_r^2\,\Delta t = \mathrm{Var}(\ln R) = \left(\frac{\ln R_u - \ln R_d}{2}\right)^2,\quad\text{so}\quad R_u = R_d\exp\left(2\sigma_r\sqrt{\Delta t}\right),$$

where $\sigma_r$ denotes the volatility of the risk-free short rate. We can solve for $R_u$ and $R_d$ from Eqs. (1) and (2). The values of $R_{uu}$, $R_{ud}$, and $R_{dd}$ can be solved in the same way. We adopt Lyuu’s quadratic-time algorithm to build the BDT model [24].

With the risk-free short-rate model in place, we proceed to solve for the default rate of each period. With Fig. 3 in mind, $\lambda_1$ can be solved from the following equation:

$$\exp(-R_0^*) = \left[1 \times (1 - \lambda_1) + \delta\lambda_1\right]\exp(-R_0),$$

where $R_0$ and $R_0^*$ are the risk-free and risky short rates between times $t = 0$ and $t = 1$, respectively, and $\delta$ is the recovery rate when the default event occurs. By the same method, $\lambda_2$ can be solved. The value of one dollar at time $t = 2$ discounted to time $t = 1$ is

$$P_u = \left[1 \times (1 - \lambda_2) + \delta\lambda_2\right]\exp(-R_u),\quad P_d = \left[1 \times (1 - \lambda_2) + \delta\lambda_2\right]\exp(-R_d).$$

Discounting the values $P_u$ and $P_d$ further to time $t = 0$, we can solve for $\lambda_2$ thus:

$$\exp(-2R_2^*) = \left[\tfrac{1}{2}(1 - \lambda_1)P_u + \tfrac{1}{2}(1 - \lambda_1)P_d + \delta\lambda_1\right]\exp(-R_0),$$

where $R_2^*$ is the risky yield for the two-period maturity. The Newton-Raphson method is applied period by period to calculate the default rate at each time step, which completes the risky short-rate model.

Next we describe how to implement backward induction for pricing a CB on Chambers and Lu’s tree. The programs have to know each node’s child nodes while carrying out backward induction. In a straightforward manner, a 2-dimensional array is allocated whose size in each dimension is $i + 1$ at time period $i$. Each array element corresponds to a node on the tree; hence there are $(i + 1)^2$ nodes at time period $i$. The first index of the array points to a risk-free short rate, and the second index points to a stock price. The first index is increased by 1 with each down movement of the risk-free short rate; it stays constant otherwise. The second index is increased by 1 with each down movement of the stock price; it stays constant otherwise. With this arrangement, each node’s child nodes are easily identified during backward induction. The space complexity of Chambers and Lu’s model is $O(n^2)$, and the time complexity is $O(n^3)$, where $n$ is the number of periods of the tree between now and the maturity date of the bond. The CB’s value at each node at time $t$ can then be stated as

$$\max\left[\min(\text{rollback value}, \text{call price}_t),\ \text{conversion value}_t,\ \text{put value}_t\right],$$

where the put value$_t$ is the value obtained when the bondholder redeems the bond at time $t$. The CB’s price eventually emerges at the tree’s root node.
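The array arrangement and the node-level optimality check can be sketched in a few dozen lines. The following is a deliberately simplified illustration, not Chambers and Lu's full procedure: it keeps a single value per node instead of separate equity and debt parts, discounts everything at the node's risk-free short rate, and ignores coupons and puts. The function names, the argument layout (`rates[i][a]` for the short-rate lattice, `lams[i]` for the default probability of the period starting at time i), and the assignment of p1 to p4 to the children (stock up/rate up, stock up/rate down, stock down/rate up, stock down/rate down) are assumptions made for this example.

```python
import math


def branch_probabilities(r, lam, rho, u, d, dt):
    """Probabilities of the four non-default children (assumed ordering)."""
    p = (math.exp(r * dt) / (1.0 - lam) - d) / (u - d)
    c = rho * math.sqrt(p * (1.0 - p))
    return 0.5 * (p + c), 0.5 * (p - c), 0.5 * (1.0 - p - c), 0.5 * (1.0 - p + c)


def price_cb(n, dt, s0, sigma_s, rates, lams, rho, delta, par, call_price, conv_ratio):
    """Simplified backward induction on an (i + 1) x (i + 1) array per period i."""
    u = math.exp(sigma_s * math.sqrt(dt))
    d = 1.0 / u

    def stock(i, b):
        return s0 * u ** (i - b) * d ** b        # b counts stock down moves

    # Terminal nodes: the holder takes the larger of par and conversion value.
    values = [[max(par, conv_ratio * stock(n, b)) for b in range(n + 1)]
              for _ in range(n + 1)]

    for i in range(n - 1, -1, -1):               # n periods ...
        level = [[0.0] * (i + 1) for _ in range(i + 1)]
        for a in range(i + 1):                   # ... times O(n) rate indices
            r = rates[i][a]
            p1, p2, p3, p4 = branch_probabilities(r, lams[i], rho, u, d, dt)
            discount = math.exp(-r * dt)
            for b in range(i + 1):               # ... times O(n) stock indices
                # A down move adds 1 to the corresponding index.
                survive = (p1 * values[a][b] + p2 * values[a + 1][b]
                           + p3 * values[a][b + 1] + p4 * values[a + 1][b + 1])
                rollback = discount * ((1.0 - lams[i]) * survive
                                       + lams[i] * delta * par)
                conversion = conv_ratio * stock(i, b)
                level[a][b] = max(min(rollback, call_price), conversion)
        values = level                           # only two levels are ever stored
    return values[0][0]


# Toy run: 3 one-year periods, flat 1% short rate, assumed default probabilities.
n = 3
flat_rates = [[0.01] * (i + 1) for i in range(n)]
example_price = price_cb(n, 1.0, 30.0, 0.23, flat_rates, [0.007] * n,
                         rho=0.1, delta=0.32, par=100.0,
                         call_price=105.0, conv_ratio=3)
print(example_price)
```

The three nested loops make the cubic running time visible, and because each level overwrites the previous one, the memory footprint stays quadratic in n.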

4. GPU IMPLEMENTATION

4.1 GPU with CUDA

To exploit the GPU’s massive computational power, our programs are implemented for a GPU supporting CUDA compute capability 1.1, introduced by NVIDIA in 2007. CUDA stands for Compute Unified Device Architecture. It is a parallel computing architecture for issuing and managing computations on the GPU and is accessible to software developers through variants of industry-standard programming languages. Issuing a computation means that the CPU sends the instructions to the GPU. After sending the instructions, the CPU goes on performing other tasks and receives the results from the GPU when the GPU finishes the job. CUDA compute capability 1.1 supports the IEEE-754 standard for single-precision binary floating-point arithmetic. It is available for the GeForce 8 Series GPUs and other advanced GPUs manufactured by NVIDIA. The CUDA application programming interface (API) comprises an extension to the C programming language [28]. Our CUDA programs are written in the C programming language with the CUDA API.

4.2 Utilization of GPU’s Power

The major goal of our CUDA implementation of Chambers and Lu’s model is to greatly reduce the execution time while maintaining accuracy. Two methods are applied to achieve this goal. The first is to make the GPU code as parallel as possible. To achieve a high degree of parallelism, the GPU code is multi-threaded: a thread is created for each node at the same time step because these nodes perform independent calculations. The second method is to make the code fully utilize the GPU’s high memory bandwidth. Table 2 details the memory layers of the GPU. The access speed of each memory layer depends on its memory bandwidth. Specifically, the on-chip memory layers and the cached memory layers are accessed at the highest speed: the constant memory and the texture memory are cached, whereas the shared memory and the registers reside on chip. In comparison, the global memory and the local memory are the slowest for the GPU to access. All variables copied from the CPU to the GPU must reside in the global memory, the constant memory, or the texture memory. To make the most of the GPU’s memory bandwidth, we use the shared memory and the constant memory instead of the global memory to perform the computation.

Table 2. GPU memory layers.
Memory     Location   Cached   Access       Scope
Register   On-Chip    N/A      Read/Write   One Thread
Local      Off-Chip   No       Read/Write   One Thread
Shared     On-Chip    N/A      Read/Write   All Threads in a Block
Global     Off-Chip   No       Read/Write   All Threads + Host
Constant   Off-Chip   Yes      Read         All Threads + Host
Texture    Off-Chip   Yes      Read         All Threads + Host
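The one-thread-per-node idea of Section 4.2 can be illustrated without a GPU by emulating CUDA-style indexing in plain Python. The sketch below is an assumption about how such a mapping could look, not the authors' kernel: the function names and the row-major mapping from a flat thread id to a node (rate index, stock index) are hypothetical, and the default of 192 threads per block is the figure the paper reports for backward induction in Section 5.1.

```python
def launch_config(level, threads_per_block=192):
    """Nodes at a time step and the number of blocks needed to cover them."""
    nodes = (level + 1) ** 2
    blocks = (nodes + threads_per_block - 1) // threads_per_block  # ceiling division
    return nodes, blocks


def node_of_thread(block_idx, thread_idx, threads_per_block, level):
    """Map a (block, thread) pair to a node (rate index, stock index), or None."""
    tid = block_idx * threads_per_block + thread_idx   # flat thread id
    if tid >= (level + 1) ** 2:
        return None       # surplus thread in the last block: nothing to process
    return divmod(tid, level + 1)


# Time step 500 of the tree: one thread per node.
nodes, blocks = launch_config(level=500)
print(nodes, blocks)
print(node_of_thread(block_idx=0, thread_idx=5, threads_per_block=192, level=500))
```

Since the nodes of one time step do not depend on each other, each thread can compute its node's value from the next time step's array without synchronizing with its neighbours; synchronization is needed only between consecutive time steps.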

5. NUMERICAL RESULTS

This section presents our numerical results. Throughout, n denotes the number of periods of the tree. The numerical results show the performance gains of the GPU code. The specifications of the CPU and the GPU used in the experiments are summarized in Table 3. The bus transferring data between the CPU and the GPU is PCI-Express 16x. The CPU code is single-threaded. We measure the execution times of the CPU and GPU codes, in milliseconds, with the high-resolution performance counter of Windows XP.

Table 3. Hardware specifications of the CPU and the GPU.
Index                           Intel Core 2 Quad Q6600    Nvidia GeForce 8800GTS
Clock rate / Power              2.4 GHz / 105 W            500 MHz / 250 W
Technology / Die size           65 nm / 286 mm^2           90 nm / 484 mm^2
Off-chip memory size            4 GB (DDR2)                640 MB (DDR3)
On-chip cache and memory size   Per CPU:                   Per core:
                                1. L1 cache: 32 KB         1. Constant memory cache: 8 KB
                                2. L2 cache: 8 MB          2. Texture memory cache: 8 KB
                                                           3. Shared memory: 16 KB

5.1 Performance of the GPU Implementation with Multiple Threads but without Memory Optimization

The goal of our GPU code here and in the next subsection is to reduce the execution time as much as possible while maintaining accuracy. We let n grow up to 1,000. The execution time of the GPU code stays on the order of $10^4$ ms at most. In this subsection, the GPU implementation of Chambers and Lu’s model uses the slowest memory, the global memory. The parameters for the CB are $S_0 = 30.0$, $\sigma_s = 0.23$, call$_i$ = 105, par = 100, $\delta = 0.32$, conversion ratio = 3, $\rho = 0.1$, risk-free yield = 0.01, risky yield = 0.015, and short-rate volatility = 0.01. The number of registers per core on the GPU is fixed; hence the GPU code has to prevent the total number of registers occupied by the threads of a block from exceeding the number of GPU registers. For the four tasks of Chambers and Lu’s model in our GPU implementation, the numbers of threads used in the numerical experiments for building the BDT model, building the risky short-rate model, building the CRR model, and backward induction are 256, 320, 192, and 192, respectively. Table 4 shows the performance of our GPU implementation with multiple threads but without memory optimization. The pricing results are listed in Table 5. Observe that the speedup approaches 52 at n = 1000.


Table 4. Performance comparison between the CPU code and the GPU code with multiple threads but without memory optimization.
Number of periods   CPU Time (ms)   GPU Time (ms)   Speedup
10                  0.7473          2.0216579       0.36964711
100                 206.535797      98.895983       2.08841442
300                 5544.516602     838.392942      6.613267269
500                 35646.41797     2658.242548     13.40976879
700                 139112.5781     6057.518413     22.96527532
800                 231459.4063     8200.084818     28.22646489
900                 357641.3125     11073.08547     32.29825268
1000                523328.0938     9951.536905     52.58766548

Table 5. The pricing results of the CPU code and the GPU code with multiple threads.
Number of periods   Price on the CPU   Price on the GPU
10                  $103.019394        $103.019363
100                 $93.683701         $93.683640
Longer than 200     $93.411613         $93.411430

Table 6. Performance comparison between the CPU code and the GPU code with multiple threads and memory optimization.
Number of periods   CPU Time (ms)   GPU Time (ms)   Speedup
10                  0.7473          1.442884        0.517921053
100                 206.535797      49.473396       4.174684046
300                 5544.516602     363.301453      15.26147654
500                 35646.41797     1131.153564     31.51333214
700                 139112.5781     2544.294922     54.67627865
800                 231459.4063     3555.336426     65.10197025
900                 357641.3125     4797.348145     74.54979328
1000                523328.0938     4179.777832     125.2047632
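The speedup columns of Tables 4 and 6 are simply the ratio of CPU time to GPU time, which is easy to recompute. The short script below does so from the tabulated timings; the variable and function names are hypothetical. It also makes two features of the data easy to see: at n = 10 both GPU versions are slower than the CPU (speedup below 1), and at n = 1000 the memory-optimized code is roughly 2.4 times faster than the global-memory code.

```python
# Execution times in milliseconds, copied from Tables 4 and 6.
cpu_ms = {10: 0.7473, 100: 206.535797, 300: 5544.516602, 500: 35646.41797,
          700: 139112.5781, 800: 231459.4063, 900: 357641.3125, 1000: 523328.0938}
gpu_global_ms = {10: 2.0216579, 100: 98.895983, 300: 838.392942, 500: 2658.242548,
                 700: 6057.518413, 800: 8200.084818, 900: 11073.08547,
                 1000: 9951.536905}
gpu_optimized_ms = {10: 1.442884, 100: 49.473396, 300: 363.301453, 500: 1131.153564,
                    700: 2544.294922, 800: 3555.336426, 900: 4797.348145,
                    1000: 4179.777832}


def speedup(cpu_time_ms, gpu_time_ms):
    """Speedup of the GPU code over the CPU code for the same tree size."""
    return cpu_time_ms / gpu_time_ms


for n in sorted(cpu_ms):
    s_global = speedup(cpu_ms[n], gpu_global_ms[n])
    s_optimized = speedup(cpu_ms[n], gpu_optimized_ms[n])
    print(f"n={n:5d}  global memory: {s_global:8.3f}  optimized: {s_optimized:8.3f}")
```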

5.2 Performance of the GPU Implementation with Multiple Threads and Memory Optimization

Next, we report the results of the GPU implementation of Chambers and Lu’s model with multiple threads and memory optimization. The parameters are identical to those in section 5.1. Whereas the implementation in section 5.1 uses the global memory, we now use the shared memory and the constant memory to gain a higher speedup. The data generated by the four tasks of Chambers and Lu’s model are assigned to the shared memory to exploit its high speed for reading and writing data. The parameters of the experiments are assigned to the constant memory to exploit its high speed for reading data. The pricing results are identical to those in section 5.1, and the performance results are shown in Table 6. Observe that the speedup approaches 125 at n = 1000.

5.3 Performance of the GPU Implementation with Various Thread Counts

Finally, we assess the influence of the thread count per block on the speedup by repeating the experiments in section 5.2 with 16 to 256 threads per block for the four tasks of Chambers and Lu’s model at n = 500. The results are tabulated in Table 7. Once the number of threads per block reaches a threshold, the execution time levels off: the minimum in Table 7 occurs at 192 threads, and the times for 128 to 256 threads differ by less than 0.02%. Our experiments stop at 256 threads per block because, otherwise, the required number of registers would exceed the number of GPU registers.

Table 7. Performance with various thread counts per block of the GPU code.
Number of threads   Time (ms)
16                  4646.07373
32                  2701.482178
64                  1711.93396
128                 1263.913818
192                 1263.781372
256                 1263.994507

6. SOME REMARKS

6.1 Effects of the Performance Improvement

The GPU code with multiple threads and memory optimization in section 5.2 exploits the GPU’s massive computing power. Up to a hundred-fold speedup is achieved by the GPU as a result. However, two forms of data dependency limit the degree of parallelism of the CUDA implementation. First, recall from section 3.6 that implementing Chambers and Lu’s model consists of four tasks: building the risk-free short-rate model, building the risky short-rate model, building the stock price model, and backward induction. We build the risk-free short-rate model, followed by the risky short-rate model. Then the stock price model is built. Finally, backward induction is carried out after all three models are in place. As these four tasks must be executed sequentially, we can only parallelize each task individually. Second, as these tasks are all tree-based, the values on the nodes need to be derived period by period, so only the nodes within the same period can be processed in parallel.

6.2 Power Consumption

In recent years, how efficiently a chip manages power has become an important issue. Several works have studied the power consumption of the GPU [10, 15, 31, 35]. The power consumption of the GPU is more than twice that of the CPU (250 W versus 105 W in Table 3). NVIDIA proposes a technology, PowerMizer [29], which is an intelligent power management solution for its graphics chips that can effectively extend battery life and reduce wasted power. AMD also introduces a power management technology called AMD PowerPlay [1], which is built into the AMD Radeon HD 3800 Series graphics processors for the optimal balance between performance and power consumption.

7. CONCLUSIONS

Pricing CBs is computationally demanding. This paper focuses on a popular and efficient model to price CBs: the tree model of Chambers and Lu. It is a two-factor tree model that considers the stock price, the risk-free interest rate, and the default risk in pricing CBs. It has a cubic time complexity. To accelerate the computation, Chambers and Lu’s model is implemented on the GPU with NVIDIA’s CUDA, a new parallel computing architecture for performing computations on the GPU, to exploit the GPU’s massive computing power. A speedup of up to 125 over the CPU code is achieved as a result. Helpful programming techniques to achieve that bound are also discussed in some detail.

REFERENCES

1. AMD, “AMD powerplay technology,” http://www.amd.com/us/products/technologies/ati-power-play/Pages/ati-power-play.aspx, Sunnyvale, CA, 2008.
2. M. Ammann, A. Kind, and C. Wilde, “Simulation-based pricing of convertible bonds,” Journal of Empirical Finance, Vol. 15, 2008, pp. 310-331.
3. Bank for International Settlements, Semiannual OTC Derivatives Statistics from End-December 2000 to End-December 2010, http://www.bis.org/statistics/derstats.htm, Basle, Switzerland, 2011.
4. F. Black, E. Derman, and W. Toy, “A one-factor model of interest rates and its application to treasury bond options,” Financial Analysts Journal, Vol. 46, 1990, pp. 33-39.
5. M. J. Brennan and E. S. Schwartz, “Convertible bonds: Valuation and optimal strategies for call and conversion,” Journal of Finance, Vol. 32, 1977, pp. 1699-1715.
6. M. J. Brennan and E. S. Schwartz, “Analyzing convertible bonds,” Journal of Financial and Quantitative Analysis, Vol. 15, 1980, pp. 907-929.
7. T. Brunner, F. Deinzer, T. Feldmann, A. K. Kubias, D. Paulus, S. Paulus, and B. Schreiber, “2D/3D image registration on the GPU,” Pattern Recognition and Image Analysis, Vol. 18, 2008, pp. 381-389.
8. T. T. Cao, K. Tang, A. Mohamed, and T. S. Tan, “Parallel banding algorithm to compute exact distance transform with the GPU,” in Proceedings of ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, 2010, pp. 83-90.
9. D. R. Chambers and Q. Lu, “A tree model for pricing convertible bonds with equity, interest rate, and default risk,” Journal of Derivatives, Vol. 14, 2007, pp. 25-46.
10. S. Collange, D. Defour, and A. Tisserand, “Power consumption of GPUs from a software perspective,” in Proceedings of the 9th International Conference on Computational Science: Part I, 2009, pp. 914-923.
11. K. B. Connolly, Pricing Convertible Bonds, John Wiley and Sons, NY, 1998.
12. J. C. Cox, S. A. Ross, and M. Rubinstein, “Option pricing: A simplified approach,” Journal of Financial Economics, Vol. 7, 1979, pp. 229-263.
13. P. L. Freddolino, D. J. Hardy, J. C. Phillips, K. Schulten, J. E. Stone, and L. G. Trabuco, “Accelerating molecular modeling applications with graphics processors,” Journal of Computational Chemistry, Vol. 28, 2007, pp. 2618-2640.
14. E. G. Haug, The Complete Guide to Option Pricing Formulas, 2nd ed., McGraw-Hill, NY, 2006.
15. M. W. Hung and J. Y. Wang, “Pricing convertible bonds subject to default risk,” Journal of Derivatives, Vol. 10, 2002, pp. 75-87.
16. J. E. Ingersoll, Jr., “A contingent claim valuation of convertible securities,” Journal of Financial Economics, Vol. 4, 1977, pp. 289-321.
17. R. A. Jarrow and S. M. Turnbull, “Pricing derivatives on financial securities subject to credit risk,” Journal of Finance, Vol. 50, 1995, pp. 53-85.
18. R. Kolb and J. A. Overdahl, Financial Derivatives: Pricing and Risk Management, John Wiley and Sons, NY, 2010.
19. C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, “Fast BVH construction on GPUs,” Computer Graphics Forum, Vol. 28, 2009, pp. 375-384.
20. R.-S. Liu, Y.-C. Tsai, and C.-L. Yang, “Parallelization and characterization of GARCH option pricing on GPUs,” in Proceedings of IEEE International Symposium on Workload Characterization, 2010, pp. 1-10.
21. F. A. Longstaff and E. S. Schwartz, “Valuing American options by simulation: A simple least-squares approach,” Review of Financial Studies, Vol. 14, 2001, pp. 113-147.
22. D. Lvov, A. B. Yigitsbasioglu, and N. E. Bachir, “Pricing convertible bonds by simulation,” in Proceedings of Financial Engineering and Applications, 2004, pp. 1-20.
23. Y.-D. Lyuu, Financial Engineering and Computation, Cambridge University Press, Cambridge, 2002.
24. Y.-D. Lyuu and C.-J. Wang, “On the complexity of bivariate lattice with stochastic interest rate models,” Computers and Mathematics with Applications, Vol. 61, 2011, pp. 1107-1121.
25. J. J. McConnell and E. S. Schwartz, “LYON taming,” Journal of Finance, Vol. 41, 1986, pp. 561-577.
26. G. E. Moore, “Cramming more components onto integrated circuits,” Electronics Magazine, Vol. 38, 1965, pp. 114-117.
27. NVIDIA, NVIDIA CUDA Compute Unified Device Architecture, Programming Guide, Version 2.0, NVIDIA, Santa Clara, CA, 2008.
28. NVIDIA, PowerMizer, http://www.nvidia.com/object/feature_powermizer.html, Santa Clara, CA, 2008.
29. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, “GPU computing,” Proceedings of the IEEE, Vol. 96, 2008, pp. 879-899.
30. T. R. W. Scogland, H. Lin, and W. Feng, “A first look at integrated GPUs for green high-performance computing,” Computer Science Research and Development, Vol. 25, 2011, pp. 125-134.
31. K. Tsiveriotis and C. Fernandes, “Valuing convertible bonds with credit risk,” Journal of Fixed Income, Vol. 8, 1998, pp. 95-102.
32. Z. Wang, X. Xu, N. Xiong, L. T. Yang, and W. Zhao, “Analysis of parallel algorithms for energy conservation with GPU,” in Proceedings of IEEE/ACM International Conference on Green Computing and Communications & International Conference on Cyber, Physical and Social Computing, 2010, pp. 155-162.
33. Tom’s Hardware, “Benchmark results: SiSoftware Sandra 2011,” http://www.tomshardware.com/reviews/core-i7-990x-extreme-edition-gulftown,2874-6.html, Culver, CA, 2011.
34. J. Yang, Y. Wang, and Y. Chen, “GPU accelerated molecular dynamics simulation of thermal conductivities,” Journal of Computational Physics, Vol. 221, 2007, pp. 799-804.
35. Q. Zhang and Y. Zhang, “Hierarchical clustering of gene expression profiles with graphics hardware acceleration,” Pattern Recognition Letters, Vol. 27, 2006, pp. 676-681.

Yuh-Dauh Lyuu (呂育道) received his B.S. degree in Information Engineering from National Taiwan University, Taipei, Taiwan, in 1984 and his Ph.D. degree in Computer Science from Harvard University in 1990. He is now a Professor in the Department of Computer Science and Information Engineering, the Department of Finance, and the Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan. He served as head of the Department of Computer Science and Information Engineering from 2008 to 2011. His research interests include design and analysis of algorithms, theory of computation, and financial computation.

Kuo-Wei Wen (文國煒) received his B.S. and M.S. degrees in Computer Science and Information Engineering from National Cheng Kung University, Tainan, Taiwan, in 2003 and 2005, respectively. He is currently a Ph.D. candidate in the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan. His research interests include design and analysis of algorithms and financial computation.

Yi-Chun Wu (吳宜駿) received his B.S. and M.S. degrees in Computer Science and Information Engineering from National Taiwan University, Taipei, Taiwan, in 2006 and 2008, respectively. He is now an Engineer at Yahoo! Inc., Taipei, Taiwan.