Preliminary Result of Parallel double Divide and Conquer

Taro Konda†, Hiroaki Tsuboi†‡, Masami Takata⋆, Masashi Iwasaki†‡ and Yoshimasa Nakamura†‡
† Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto, JAPAN
‡ SORST, JST
⋆ Graduate School of Humanity and Science, Nara Women's University

Abstract  This paper presents a concept for the parallelization of double Divide and Conquer and its preliminary results. double Divide and Conquer was recently proposed for singular value decomposition. It first computes singular values by a compact version of Divide and Conquer; the corresponding singular vectors are then computed by twisted factorization. The speed and accuracy of double Divide and Conquer match or exceed those of standard algorithms such as QR and Divide and Conquer. In addition, double Divide and Conquer is expected to have great parallelism because each step is theoretically parallel and no heavy communication is required. However, no parallel model of double Divide and Conquer has been studied yet. In this paper, a policy for the parallelization is discussed. Then, a parallel implementation with MPI is tested on a distributed-memory parallel computer. It successfully shows high parallelism.

Keywords: parallel algorithm, numerical analysis, singular value decomposition, numerical algorithm, high performance computing, integrable system

1 Introduction

Singular value decomposition (SVD) is one of the most important matrix decompositions in numerical linear algebra. It is applied in many fields such as image processing [1] and data search [2]. A new framework for SVD was recently developed: singular values are computed first, and the corresponding singular vectors are then computed by twisted factorization. MR3 [3, 4, 5] and I–SVD [6, 7, 8] are algorithms of this kind. They achieve a substantial speed-up over past standard algorithms such as QR, although further study of the numerical accuracy of twisted factorization is desired.

Interest in the parallelism of numerical algorithms is growing, due to increased access to parallel computers coupled with the necessity of processing ever larger data sets. The parallelism of these new algorithms appears to be excellent because each piece of the twisted factorization can be executed independently. In practice, however, the total parallelism is limited by the seriality of the singular value computation [9]. A new algorithm, double Divide and Conquer (dDC) [9, 10], improves this situation with the help of twisted factorization: it adopts Divide and Conquer (D&C) to parallelize the singular value computation. It is as fast and accurate as I–SVD, and high parallelism is expected for any type of matrix. The focus of this paper is a preliminary evaluation of a parallel version of dDC, demonstrating its high potential for parallelism.

2 Singular Value Decomposition

An arbitrary m × n rectangular matrix A is converted to an upper bidiagonal matrix B̂ by Householder transformation [11, 12] as follows:

    \[
    A = \hat U \begin{pmatrix} \hat B \\ 0 \end{pmatrix} \hat V^T
    \quad\text{or}\quad
    A = \hat U \begin{pmatrix} 0 & \hat B \end{pmatrix} \hat V^T,
    \tag{1}
    \]

where Û and V̂ are suitable orthogonal matrices. Suppose that the elements of B̂ are

    \[
    \hat B \equiv \begin{pmatrix}
    \hat b_1 & \hat b_2 & & \\
    & \hat b_3 & \ddots & \\
    & & \ddots & \hat b_{2n-2} \\
    & & & \hat b_{2n-1}
    \end{pmatrix},
    \qquad \hat b_j \neq 0 \quad \forall j.
    \tag{2}
    \]

This paper is mainly concerned with SVD of the bidiagonal matrices B̂ in (2).

Let T_s be the positive definite symmetric tridiagonal matrix built from the n × n bidiagonal matrix B̂ as

    \[
    T_s = \hat B^T \hat B.
    \tag{3}
    \]

And let λ_k (k = 1, ..., n) be the eigenvalues of the matrix T_s. Then σ_k, the singular values of B̂, are given by

    \[
    \sigma_k = \sqrt{\lambda_k}.
    \tag{4}
    \]

Let us introduce two bidiagonal matrices,

    \[
    L^{(t)} \equiv \begin{pmatrix}
    1 & & & \\
    e_1^{(t)} & 1 & & \\
    & \ddots & \ddots & \\
    & & e_{n-1}^{(t)} & 1
    \end{pmatrix}
    \quad\text{and}\quad
    R^{(t)} \equiv \begin{pmatrix}
    q_1^{(t)} & 1 & & \\
    & q_2^{(t)} & \ddots & \\
    & & \ddots & 1 \\
    & & & q_n^{(t)}
    \end{pmatrix}.
    \]

Then, a tridiagonal matrix is introduced as the product of these matrices:

    \[
    T^{(t)} \equiv \begin{pmatrix}
    q_1^{(t)} & 1 & & \\
    q_1^{(t)} e_1^{(t)} & q_2^{(t)} + e_1^{(t)} & \ddots & \\
    & \ddots & \ddots & 1 \\
    & & q_{n-1}^{(t)} e_{n-1}^{(t)} & q_n^{(t)} + e_{n-1}^{(t)}
    \end{pmatrix}
    \tag{5}
    \]
    \[
    = L^{(t)} R^{(t)}, \qquad t = 0, 1, \ldots.
    \tag{6}
    \]

The tridiagonal matrix T^(0) is symmetrized by the following lemma.

Lemma 1 [13]. A symmetrization of T^(0) by a similarity transformation is given by

    \[
    T_s = G\, T^{(0)} G^{-1},
    \qquad
    G \equiv \mathrm{diag}\,(g_{1,1}, g_{2,2}, \ldots, g_{n-1,n-1}, 1),
    \qquad
    g_{k,k} \equiv \prod_{j=k}^{n-1} \frac{1}{\hat b_{2j-1} \hat b_{2j}},
    \tag{7}
    \]

where the symmetric matrix T_s is written as

    \[
    T_s = \begin{pmatrix}
    \hat b_1^2 & \hat b_1 \hat b_2 & & \\
    \hat b_1 \hat b_2 & \hat b_2^2 + \hat b_3^2 & \ddots & \\
    & \ddots & \ddots & \hat b_{2n-3} \hat b_{2n-2} \\
    & & \hat b_{2n-3} \hat b_{2n-2} & \hat b_{2n-2}^2 + \hat b_{2n-1}^2
    \end{pmatrix}.
    \tag{8}
    \]
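As a quick check of (3), (4), and (8), take n = 2 with the hypothetical entries b̂_1 = 2, b̂_2 = 1, b̂_3 = 2 (an illustrative example of ours, not from the paper):

    \[
    \hat B = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix},
    \qquad
    T_s = \hat B^T \hat B
        = \begin{pmatrix} \hat b_1^2 & \hat b_1 \hat b_2 \\ \hat b_1 \hat b_2 & \hat b_2^2 + \hat b_3^2 \end{pmatrix}
        = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}.
    \]

The eigenvalues of T_s solve λ² − 9λ + 16 = 0, i.e. λ = (9 ± √17)/2, and the singular values of B̂ are σ_k = √λ_k as in (4); the form of T_s agrees with (8).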

3 double Divide and Conquer

Here, we introduce double Divide and Conquer (dDC) for singular value decomposition. It first computes singular values by a "compact" Divide and Conquer described in section 3.1; we call this compact version Singular Value oriented Divide and Conquer (SVDC). It then computes the corresponding singular vectors by the twisted factorization described in section 3.2. dDC has the following features (a structural sketch follows the list).

1. Speed. The complexity of D&C ranges from O(n²) to O(n³), depending on the frequency of deflations. In contrast, SVDC is dramatically down-sized and achieves stability because it skips the heavy vector updates of D&C. Twisted factorization also costs O(n²), so the overall complexity of dDC is O(n²).

2. Accuracy. Singular values in particular are computed with high accuracy, because dDC inherits the good behavior of D&C. However, dDC has a theoretical issue to be discussed: although dDC computes singular values with high absolute precision, twisted factorization assumes that the singular values are computed with high relative precision. This may cause a problem when the matrix has a tiny singular value; further investigation is needed.

3. Parallelism. Both SVDC and twisted factorization have essentially good parallelism.

4. Memory space. D&C requires O(n²) memory space to temporarily hold the SVDs of submatrices. In contrast, dDC needs only O(n) memory space, because only a few vectors must be kept at hand. Therefore, dDC can be used to solve various problems with huge matrices.

5. Partial computation of singular values and singular vectors. Twisted factorization can compute selected vectors on demand because the computations are independent of each other. The mdLVs stage of I–SVD, by contrast, has to finish computing all singular values before the targeted ones can be found, whereas SVDC can compute just the targets by selecting the secular equations to be solved.
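The two-phase structure above can be summarized in a short sketch. This is our own outline with hypothetical helper routines (svdc and twisted_vector are assumed names, not the authors' code); the point is that the second phase consists of n mutually independent tasks:

    #include <stddef.h>

    /* assumed helpers (hypothetical): */
    void svdc(int n, const double *b, double *sigma);        /* values only */
    void twisted_vector(int n, const double *b, double sigma_k,
                        double *u_k, double *v_k);           /* one vector pair */

    /* minimal structural sketch of dDC */
    void ddc_svd(int n, const double *b,  /* bidiagonal elements of B */
                 double *sigma,           /* out: n singular values   */
                 double *U, double *V)    /* out: singular vectors    */
    {
        svdc(n, b, sigma);                /* phase 1: compact D&C (SVDC) */

        /* phase 2: each iteration is independent, so this loop can be
           distributed over processors with no communication */
        for (int k = 0; k < n; k++)
            twisted_vector(n, b, sigma[k],
                           &U[(size_t)k * n], &V[(size_t)k * n]);
    }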

3.1 Divide and Conquer to Compute Singular Values

Given an n × (n + 1) upper bidiagonal matrix

    \[
    B = \begin{pmatrix}
    b_1 & b_2 & & & \\
    & b_3 & \ddots & & \\
    & & \ddots & b_{2n-2} & \\
    & & & b_{2n-1} & b_{2n}
    \end{pmatrix},
    \tag{9}
    \]

its SVD is

    \[
    B = U \begin{pmatrix} \Sigma & 0 \end{pmatrix} V^T,
    \tag{10}
    \]

where U is an n × n orthogonal matrix whose columns are left singular vectors, V is an (n + 1) × (n + 1) orthogonal matrix whose columns are right singular vectors, Σ is an n × n nonnegative definite diagonal matrix, and 0 is a column of zero elements.

The n × (n + 1) upper bidiagonal matrix B is partitioned into two submatrices as

    \[
    B = \begin{pmatrix}
    B_1 & 0 \\
    b_{2k-1} e_k^T & b_{2k} e_1^T \\
    0 & B_2
    \end{pmatrix}
    \tag{11}
    \]

for a fixed k such that 1 < k < n, where B_1 is a (k − 1) × k lower bidiagonal matrix, B_2 is an (n − k) × (n − k + 1) upper bidiagonal matrix, and e_j is the j-th unit vector of appropriate dimension. The parameter k is usually taken to be ⌊n/2⌋. Now, suppose that the SVD of each B_i is given by

    \[
    B_i = U_i \begin{pmatrix} D_i & 0 \end{pmatrix} \begin{pmatrix} V_i & v_i \end{pmatrix}^T.
    \tag{12}
    \]

Let l_1 be the last row of V_1, ψ_1 the last element of v_1, f_2 the first row of V_2, and ϕ_2 the first element of v_2. By substituting (12) into (11), we then obtain

    \[
    B = \begin{pmatrix}
    0 & U_1 & 0 \\
    1 & 0 & 0 \\
    0 & 0 & U_2
    \end{pmatrix}
    \begin{pmatrix}
    b_{2k-1}\psi_1 & b_{2k-1} l_1 & b_{2k} f_2 & b_{2k}\phi_2 \\
    0 & D_1 & 0 & 0 \\
    0 & 0 & D_2 & 0
    \end{pmatrix}
    \begin{pmatrix}
    v_1 & V_1 & 0 & 0 \\
    0 & 0 & V_2 & v_2
    \end{pmatrix}^T.
    \tag{13}
    \]

If a Givens rotation is applied to annihilate b_{2k}ϕ_2, then we get

    \[
    B = \tilde U \begin{pmatrix} M & 0 \end{pmatrix} \begin{pmatrix} \tilde V & \tilde v \end{pmatrix}^T,
    \tag{14}
    \]

where

    \[
    \tilde U \equiv \begin{pmatrix}
    0 & U_1 & 0 \\
    1 & 0 & 0 \\
    0 & 0 & U_2
    \end{pmatrix},
    \qquad
    M \equiv \begin{pmatrix}
    r_0 & b_{2k-1} l_1 & b_{2k} f_2 \\
    0 & D_1 & 0 \\
    0 & 0 & D_2
    \end{pmatrix},
    \]
    \[
    \tilde V \equiv \begin{pmatrix}
    c_0 v_1 & V_1 & 0 \\
    s_0 v_2 & 0 & V_2
    \end{pmatrix},
    \qquad
    \tilde v \equiv \begin{pmatrix} -s_0 v_1 \\ c_0 v_2 \end{pmatrix},
    \]
    \[
    r_0 = \sqrt{(b_{2k-1}\psi_1)^2 + (b_{2k}\phi_2)^2},
    \qquad
    c_0 = \frac{b_{2k-1}\psi_1}{r_0},
    \qquad
    s_0 = \frac{b_{2k}\phi_2}{r_0}.
    \tag{15}
    \]

Thus the matrix B is reduced to (M 0) by the orthogonal transformations Ũ and (Ṽ ṽ).

The above D&C process can be simplified when only singular values are desired. From (14) and (15), writing the SVD of M as M = U_M Σ V_M^T, B becomes

    \[
    B = \tilde U \begin{pmatrix} M & 0 \end{pmatrix} \begin{pmatrix} \tilde V & \tilde v \end{pmatrix}^T
      = \tilde U \begin{pmatrix} U_M \Sigma V_M^T & 0 \end{pmatrix} \begin{pmatrix} \tilde V & \tilde v \end{pmatrix}^T
      = U \Sigma \begin{pmatrix} \tilde V V_M & \tilde v \end{pmatrix}^T,
    \tag{16}
    \]

with U = Ũ U_M, V = Ṽ V_M, and v = ṽ; thus

    \[
    f = \begin{pmatrix} c_0 \phi_1 & f_1 & 0 \end{pmatrix} V_M,
    \qquad
    l = \begin{pmatrix} s_0 \psi_2 & 0 & l_2 \end{pmatrix} V_M,
    \qquad
    \phi = -s_0 \phi_1,
    \qquad
    \psi = c_0 \psi_2,
    \tag{17}
    \]

where f_1 is the first row of V_1, ϕ_1 the first element of v_1, l_2 the last row of V_2, ψ_2 the last element of v_2, l the last row of V, ψ the last element of v, f the first row of V, and ϕ the first element of v. Because most of the running time of D&C is spent on the vector updates during singular vector computation, SVDC as a whole is faster than the normal D&C.
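In code, the scalar bookkeeping of one SVDC merge step is tiny. The following is our own C rendering of (15) and (17) (a sketch, not the authors' implementation); the rows f and l of the merged V additionally require a multiplication by V_M from the SVD of M, which is omitted here.

    #include <math.h>

    typedef struct {
        double c0, s0, r0;  /* Givens rotation annihilating b_{2k} * phi_2 */
        double phi, psi;    /* first element of v, last element of v       */
    } merge_scalars;

    /* scalar quantities carried through the SVDC recursion, cf. (15), (17) */
    merge_scalars svdc_merge_scalars(double b2km1,  /* b_{2k-1}            */
                                     double b2k,    /* b_{2k}              */
                                     double psi1,   /* last element of v1  */
                                     double phi1,   /* first element of v1 */
                                     double psi2,   /* last element of v2  */
                                     double phi2)   /* first element of v2 */
    {
        merge_scalars m;
        m.r0  = hypot(b2km1 * psi1, b2k * phi2);     /* (15) */
        m.c0  = (b2km1 * psi1) / m.r0;
        m.s0  = (b2k   * phi2) / m.r0;
        m.phi = -m.s0 * phi1;                        /* (17) */
        m.psi =  m.c0 * psi2;
        return m;
    }

Because only these few scalars and the short rows f and l are propagated, no O(n²) vector update occurs at a merge.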

3.2 Twisted Factorization to Compute Singular Vectors

Let us consider the following system:

    \[
    (T_s - \hat\lambda I)\, x^{(k)} = \gamma_k e_k,
    \tag{18}
    \]

where e_k is the k-th unit vector of appropriate dimension (its k-th element is 1) and γ_k is the residual norm of the k-th equation. Suppose that L is a lower bidiagonal matrix and U is an upper bidiagonal matrix such that

    \[
    T_s - \hat\lambda I = L D_+ L^T = U D_- U^T,
    \tag{19}
    \]

where L and U are given by Cholesky decomposition of T_s − λ̂I. Let the i-th elements of the diagonal matrices D_+ and D_− be D_i^+ and D_i^−, and let the i-th subdiagonal element of L and the i-th superdiagonal element of U be L_i and U_i, respectively. The factorizations LD_+L^T and UD_−U^T are computed by the algorithms stdLVvs and rdLVvs [6], respectively. The twisted factorization is then described as follows:

    \[
    T_s - \hat\lambda I = N_k D_k N_k^T,
    \tag{20}
    \]

where N_k is the twist factor [4, 5] defined as

    \[
    N_k = \begin{pmatrix}
    1 & & & & & & \\
    L_1 & 1 & & & & & \\
    & \ddots & \ddots & & & & \\
    & & L_{k-1} & 1 & U_k & & \\
    & & & & 1 & \ddots & \\
    & & & & & \ddots & U_{n-1} \\
    & & & & & & 1
    \end{pmatrix},
    \]

factoring the matrix from the top down and from the bottom up, meeting at row k. And D_k is a diagonal matrix such that

    \[
    D_k = \mathrm{diag}\left(D_1^+, \ldots, D_{k-1}^+, \gamma_k, D_{k+1}^-, \ldots, D_n^-\right).
    \tag{21}
    \]

When λ̂ is close to an eigenvalue of T_s, the singularity of T_s − λ̂I is revealed by choosing the twist index k = arg min_j |γ_j|.
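To make the vector computation concrete, the sketch below implements twisted factorization for a symmetric tridiagonal matrix with diagonal a and off-diagonal c, using plain Cholesky-type recurrences where the paper uses the stdLVvs and rdLVvs routines of [6] for better accuracy. It assumes lam is an already computed (hence slightly perturbed) eigenvalue, so all pivots are nonzero, and it picks the twist index minimizing |γ_k| as in (21).

    #include <math.h>
    #include <stdlib.h>

    void twisted_eigenvector(int n, const double *a, const double *c,
                             double lam, double *x)
    {
        double *Dp = malloc(n * sizeof *Dp), *Dm = malloc(n * sizeof *Dm);
        double *L  = malloc(n * sizeof *L),  *U  = malloc(n * sizeof *U);

        Dp[0] = a[0] - lam;                     /* T - lam*I = L D+ L^T */
        for (int i = 0; i + 1 < n; i++) {
            L[i]      = c[i] / Dp[i];
            Dp[i + 1] = a[i + 1] - lam - L[i] * c[i];
        }
        Dm[n - 1] = a[n - 1] - lam;             /* T - lam*I = U D- U^T */
        for (int i = n - 2; i >= 0; i--) {
            U[i]  = c[i] / Dm[i + 1];
            Dm[i] = a[i] - lam - U[i] * c[i];
        }

        int k = 0; double best = HUGE_VAL;      /* twist index: min |gamma_k| */
        for (int j = 0; j < n; j++) {
            double g = Dp[j] + Dm[j] - (a[j] - lam);
            if (fabs(g) < best) { best = fabs(g); k = j; }
        }

        x[k] = 1.0;                             /* solve N_k^T x = e_k */
        for (int i = k - 1; i >= 0; i--) x[i] = -L[i] * x[i + 1];
        for (int i = k + 1; i < n;  i++) x[i] = -U[i - 1] * x[i - 1];

        double s = 0.0;                         /* normalize */
        for (int i = 0; i < n; i++) s += x[i] * x[i];
        s = sqrt(s);
        for (int i = 0; i < n; i++) x[i] /= s;

        free(Dp); free(Dm); free(L); free(U);
    }

Each call of such a routine is independent of the others, which is exactly the parallelism exploited in section 4.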

4 Preliminary Result of Parallel double Divide and Conquer

Here, we consider a parallel version of dDC. Each subproblem of SVDC and each vector computation of the twisted factorization can be executed independently, so dDC theoretically has ideal parallelism. Since each vector computation by twisted factorization carries an almost equal workload, a simple assignment of tasks is sufficient to achieve high efficiency. SVDC has to be parallelized more carefully, although high efficiency is expected because the heavy communication during the vector updates of D&C is skipped. Figure 1 is a pictorial model of SVDC executed by 4 processors: the tree built by the dividing process.


Table 1: Specification of the test bed.

    Machine    Appro HyperBlade PC Cluster
    CPU        AMD Opteron 1.6GHz (SMP with 2 CPUs × 8)
    Cache      L1D: 64KB, L1I: 64KB, L2: 1024KB
    Memory     2.0GB
    Network    Gigabit Ethernet
    OS         SuSE Linux 8.0 (kernel Linux 2.4.19-SMP)
    Compiler   pgf77 5.1-3 (option -O3)
    BLAS       Optimized by ATLAS


Figure 1: Parallel model with less communication (P = 4).

A leaf of the tree represents a submatrix, and a node represents a merge process of D&C. The number on a leaf or node identifies the processor in charge of that task: computing the SVD of a submatrix at a leaf, or merging submatrices at a node. A line represents a process transition; in particular, an arrowed line between boxes (leaves and nodes) indicates communication between processors.

First, the submatrices at the lowest level (the leaves of the tree) are evenly assigned to all processors in order. Second, each processor computes the SVDs of its assigned submatrices independently. The merge processes then begin on each processor: for each depth of the tree from the bottom, the submatrices are merged in order. The process is paused when the depth becomes greater than P, the number of processors; it is then time to start communication. We call a processor in charge of the left son of a parent node the slave and the one in charge of the right son the master. A slave sends the data Σ_M, f, l, ϕ, ψ needed for the parent's merge to its master and finishes its part of the SVDC process. The master receives the data and merges the two submatrices. This proceeds until the root node is computed. Afterwards, the computed singular values are broadcast to every processor so that the singular vectors can be computed in parallel. Finally, twisted factorization is invoked in parallel: the vectors to be computed are assigned to the processors evenly in order, and the results are gathered to one processor.

This parallel model is a straightforward approach, and although it requires only a small number of communications, its practical efficiency is not high. Once communication begins, the number of idle processors increases step by step. What is worse, the tree is top-heavy: a merge costs more at the upper levels. Parallelizing each merge brings improvement; the details will be discussed in future papers.
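To make the pattern concrete, here is a minimal MPI skeleton of the SVDC tree. This is a sketch under our own assumptions (Node, BUF_MAX, pack, unpack, merge_local, and extract_sigma are hypothetical helpers, not part of the tested implementation); slaves ship the few quantities needed by the parent merge and retire, so half of the active processors go idle in every round, matching the efficiency loss noted above.

    #include <mpi.h>

    #define BUF_MAX 4096                 /* illustrative buffer size        */
    typedef struct Node Node;            /* local SVDC state (hypothetical) */
    int   pack(const Node *s, double *buf);   /* serialize Sigma_M, f, l, phi, psi */
    Node *unpack(const double *buf);
    void  merge_local(Node *dst, Node *src);  /* one D&C merge step */
    void  extract_sigma(const Node *s, double *sigma);

    void svdc_tree(int rank, int nprocs, Node *sub, double *sigma, int n)
    {
        for (int step = 1; step < nprocs; step <<= 1) {
            if (rank & step) {               /* slave: send data, retire */
                double buf[BUF_MAX];
                int len = pack(sub, buf);
                MPI_Send(buf, len, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
                break;
            }
            if (rank + step < nprocs) {      /* master: receive and merge */
                double buf[BUF_MAX];
                MPI_Recv(buf, BUF_MAX, MPI_DOUBLE, rank + step, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                merge_local(sub, unpack(buf));
            }
        }
        /* rank 0 now holds all singular values; broadcast them so each
           rank can compute its share of vectors by twisted factorization */
        if (rank == 0) extract_sigma(sub, sigma);
        MPI_Bcast(sigma, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }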

4.1 Numerical experiments

In this section, we evaluate parallel dDC with respect to parallelism. We compare it with the following two algorithms on a bidiagonal SVD:

• Parallel I–SVD: the I–SVD algorithm whose discretized interval of mdLVs is δ^(t) = 1.0. Inverse Iteration is used to update the computed vectors. Twisted factorization and Inverse Iteration are parallelized.

• QR: the QR algorithm with shifts (PDBDSQR in ScaLAPACK [14]).

Here, dDC uses QR to solve the submatrices and inverse iteration to update the computed vectors.

Table 2: Timing of parallel dDC.

                   Number of processors
                1       2       4       8       16
    n=3,000   5.78    2.93    1.50    0.79    0.48
      E_P    (1.00)  (0.99)  (0.96)  (0.91)  (0.75)
    n=5,000  18.17    9.38    4.76    2.83    1.66
      E_P    (1.00)  (0.97)  (0.95)  (0.80)  (0.68)
    n=7,000  37.86   19.27    9.81    5.07    2.81
      S_P    (1.00)  (1.96)  (3.86)  (7.47) (13.47)
      E_P    (1.00)  (0.98)  (0.96)  (0.93)  (0.84)

Times in seconds [s]. E_P: parallel efficiency with P processors. S_P: speed-up ratio with P processors.
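For reference, the S_P and E_P rows in Tables 2 and 3 follow the usual definitions of speed-up and parallel efficiency; for example, for n = 7,000 on P = 16 processors:

    \[
    S_P = \frac{T_1}{T_P} = \frac{37.86}{2.81} \approx 13.47,
    \qquad
    E_P = \frac{S_P}{P} = \frac{13.47}{16} \approx 0.84.
    \]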

Table 1 shows the specification of the test bed. Tests are carried out on a PC cluster of AMD Opteron processors. The cache memories are L1D: 64KB, L1I: 64KB, and L2: 1024KB. The cluster uses LAPACK with BLAS (Basic Linear Algebra Subprograms) optimized by ATLAS (Automatically Tuned Linear Algebra Software) [15]. We compute the SVD of a bidiagonal matrix whose diagonal elements are 2.001 and whose subdiagonal elements are 2.0. All singular values are well separated from each other, so the deflation of D&C and dDC seldom occurs. Parallel dDC is implemented with MPI [16]. The number of processors ranges from 1 to 16.

Table 2 shows the best times of 10 executions of parallel dDC for matrices of dimension n = 3,000, n = 5,000, and n = 7,000. In the case of n = 7,000, parallel dDC attains a parallel efficiency of 0.84 and a speed-up of 13.47 on 16 processors. For the smaller cases it also shows good performance. We can say that parallel dDC has high scalability in this experiment.

Parallel dDC is then compared with parallel versions of I–SVD and QR. Table 3 shows the best times of 10 executions for dimension n = 3,000. Parallel dDC is more than three times faster than parallel I–SVD on 16 processors. It is also much faster than parallel QR, although QR shows super-linear efficiency due to good use of the cache memory. Figure 2 illustrates the execution times of parallel dDC and parallel I–SVD.

Table 3: Comparison of parallel dDC with parallel I–SVD and QR (n = 3,000).

                      Number of processors
                1     2(2×1)  4(2×2)  8(2×4) 16(2×8)
    dDC        5.78    2.93    1.50    0.79    0.48
      S_P     (1.00)  (1.97)  (3.85)  (7.32) (12.04)
      E_P     (1.00)  (0.99)  (0.96)  (0.91)  (0.75)
    I–SVD      5.53    3.44    2.36    1.80    1.52
      S_P     (1.00)  (1.61)  (2.34)  (3.07)  (3.64)
      E_P     (1.00)  (0.80)  (0.59)  (0.38)  (0.23)
    QR       867.98  433.61  173.44   66.78   34.71
      S_P     (1.00)  (2.00)  (5.00) (13.00) (25.01)
      E_P     (1.00)  (1.00)  (1.25)  (1.62)  (1.56)

Times in seconds [s]. E_P: parallel efficiency with P processors. S_P: speed-up ratio with P processors.

5 Conclusions

A new framework of SVD was recently proposed in which singular values are computed first and the corresponding vectors are then computed by twisted factorization. double Divide and Conquer (dDC) is one of the algorithms built on this framework. The speed and accuracy of dDC match or exceed those of standard algorithms such as QR and normal Divide and Conquer. In addition, dDC is expected to have great parallelism because each step is theoretically parallel and heavy communication is not required. This paper has shown a basic idea for parallelizing dDC, and a parallel implementation with MPI has been tested on a distributed-memory parallel computer. It shows high scalability and is much faster than parallel I–SVD and QR on the test matrix. As future work, the compatibility in precision between the singular value computation of dDC and twisted factorization should be studied. Moreover, comprehensive tests on practical, large-scale applications will be carried out.

Acknowledgment  We would like to thank Dr. Y. Minesaki and Dr. K. Kimura for helping with this research.

 

Figure 2: Execution times of parallel dDC and parallel I-SVD.

References

[1] W. Pratt. Digital Image Processing. Wiley–Interscience Publishing, 1978.

[2] F. Jiang, R. Kannan, M. Littman, and S. Vempala. Efficient singular value decomposition via improved document sampling. Technical Report CS–99–5, Department of Computer Science, Duke University, 1999.

[3] I. Dhillon and B. Parlett. Orthogonal eigenvectors and relative gaps. SIAM J. Matrix Anal. Appl., 25(3):858–899, 2004.

[4] K. Fernando. On computing an eigenvector of a tridiagonal matrix. Part 1: Basic results. SIAM J. Matrix Anal. Appl., 18(4):1013–1034, 1997.

[5] B. Parlett and I. Dhillon. Fernando's solution to Wilkinson's problem: An application of double factorization. Lin. Alg. Appl., 267:247–279, 1997.

[6] M. Iwasaki, S. Sakano, and Y. Nakamura. Accurate twisted factorization of real symmetric tridiagonal matrices and its application to singular value decomposition. Trans. Japan Soc. Indust. Appl. Math., 15(3):461–481, 2005. (in Japanese).

[7] M. Takata, M. Iwasaki, K. Kimura, and Y. Nakamura. An evaluation of singular value computation by the discrete Lotka-Volterra system. In Proc. International Conf. on Parallel and Distributed Processing Techniques and Applications, volume 2, pages 410–416, Las Vegas, USA, 2005.

[8] M. Iwasaki and Y. Nakamura. Accurate computation of singular values in terms of the shifted integrable scheme. Preprint, 2005.

[9] T. Konda, M. Takata, M. Iwasaki, and Y. Nakamura. A new singular value decomposition algorithm suited to parallelization and preliminary results. In Proc. IASTED International Conference on Advances in Computer Science and Technology, pages 79–84, 2006.

[10] T. Konda, M. Takata, M. Iwasaki, and Y. Nakamura. A new SVD algorithm by divide and conquer and twisted factorizations. IPSJ Symposium Series, 2006(1):105–112, 2006. (in Japanese).

[11] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Numer. Anal., 2(2):205–224, 1965.

[12] J. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.

[13] M. Iwasaki and Y. Nakamura. On the convergence of a solution of the discrete Lotka-Volterra system. Inverse Problems, 18(6):1569–1578, 2002.

[14] L. Blackford, J. Choi, A. Cleary, et al. ScaLAPACK Users' Guide. SIAM, 1997.

[15] R. Whaley, A. Petitet, and J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.

[16] W. Gropp, E. Lusk, and A. Skjellum. Using MPI, Second Edition. The MIT Press, 1999.
