Zentrum für Technomathematik
Fachbereich 3 – Mathematik und Informatik

Parallel Algorithms for LQ Optimal Control of Discrete-Time Periodic Linear Systems

Peter Benner    Ralph Byers    Rafael Mayo    Enrique S. Quintana-Ortí    Vicente Hernández

Report 01-02

Berichte aus der Technomathematik    Report 01-02    Februar 2001

Parallel Algorithms for LQ Optimal Control of Discrete-Time Periodic Linear Systems¹

¹ Peter Benner and Enrique S. Quintana-Ortí were partially supported by the DAAD programme Acciones Integradas Hispano-Alemanas. Ralph Byers was partially supported by National Science Foundation awards CCR-9732671, MRI-9977352, by the NSF EPSCoR/K*STAR program through the Center for Advanced Scientific Computing, and by the National Computational Science Alliance under DMS990005N.

Peter Benner

Zentrum für Technomathematik, Fachbereich 3/Mathematik und Informatik, Universität Bremen, 28334 Bremen, Germany. E-mail: benner@math.uni-bremen.de.

Ralph Byers

Dept. of Mathematics, University of Kansas, Lawrence, Kansas 66045, USA. E-mail: byers@math.ukans.edu.

Rafael Mayo, Enrique S. Quintana-Ortí

Depto. de Informática, Universidad Jaume I, 12.080–Castellón, Spain. E-mails: {mayo,quintana}@inf.uji.es.

Vicente Hernández

Depto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, 46.071–Valencia, Spain. E-mail: vhernand@dsic.upv.es.

This paper analyzes the performance of two parallel algorithms for solving the linear-quadratic optimal control problem arising in discrete-time periodic linear systems. The algorithms perform a sequence of orthogonal reordering transformations on formal matrix products associated with the periodic linear system, and then employ the so-called matrix disk function to solve the resulting discrete-time periodic algebraic Riccati equations needed to determine the optimal periodic feedback. We parallelize these solvers using two different approaches, based on a coarse-grain and a medium-grain distribution of the computational load. The experimental results demonstrate the high performance and scalability of the parallel algorithms on a Beowulf cluster.

1. INTRODUCTION

In this paper we analyze the parallel solution of the linear-quadratic (LQ) optimal control problem for periodic control systems on parallel computers with distributed memory. Specifically, we consider the discrete-time linear control system

$$x_{k+1} = A_k x_k + B_k u_k, \quad x_0 = x^0, \qquad y_k = C_k x_k, \qquad k = 0, 1, \ldots, \tag{1}$$

where $A_k \in \mathbb{R}^{n\times n}$, $B_k \in \mathbb{R}^{n\times m}$, and $C_k \in \mathbb{R}^{r\times n}$. The system is said to be periodic if, for some integer period $p$, $A_{k+p} = A_k$, $B_{k+p} = B_k$, and $C_{k+p} = C_k$. The aim in the LQ

optimal control problem is to find a periodic feedback $\{u_k\}_{k=0}^{\infty}$ which minimizes

$$\frac{1}{2} \sum_{i=0}^{\infty} \left( x_i^T Q_i x_i + u_i^T R_i u_i \right)$$
[10, 11]. Here, $Q_k \in \mathbb{R}^{n\times n}$, $Q_k = Q_k^T \geq 0$, $Q_k = Q_{k+p}$, and $R_k \in \mathbb{R}^{m\times m}$, $R_k = R_k^T > 0$, $R_k = R_{k+p}$. Under certain assumptions [10, 11], the unique optimal periodic feedback is given by

$$u_k = -(R_k + B_k^T X_{k+1} B_k)^{-1} B_k^T X_{k+1} A_k\, x_k,$$
where $X_k \in \mathbb{R}^{n\times n}$, $X_k = X_{k+p}$, is the unique symmetric positive semidefinite solution of the discrete-time periodic Riccati equation (DPRE)

$$0 = C_k^T Q_k C_k - X_k + A_k^T X_{k+1} A_k - A_k^T X_{k+1} B_k (R_k + B_k^T X_{k+1} B_k)^{-1} B_k^T X_{k+1} A_k; \tag{2}$$

see [11] for details. In case $p = 1$, the DPRE reduces to the well-known discrete-time algebraic Riccati equation (DARE) [29].

Periodic linear systems naturally arise when performing multirate sampling of continuous linear systems [18]. Large state-space dimension $n$ and/or large period $p$ appear, e.g., in the helicopter ground resonance damping problem and the satellite attitude control problem; see, e.g., [9, 22, 26, 37]. The analysis and design of this class of systems has received considerable attention in recent years (see, e.g., [10, 11, 13, 26, 33, 32, 36, 37]). The need for parallel computing in this area can be seen from the fact that (2) represents a non-linear system with $pn^2$ unknowns. Reliable methods for solving these equations have a computational cost in flops (floating-point arithmetic operations) of $O(pn^3)$.

In this paper we analyze the parallelization of two DPRE solvers, introduced in [5, 6], following two different approaches. First, we present a coarse-grain approach which only requires efficient point-to-point communication routines and a few high-performance serial numerical kernels for well-known linear algebra computations. A version of this coarse-grain algorithm with a computational cost of $O(p^2 n^3)$ flops was reported in [7, 8]. Here we extend the theoretical analysis of the parallel properties of the algorithm and include a second variant, suggested in [6], with a computational cost of $O(p \log_2(p)\, n^3)$ flops. Second, we investigate a medium-grain parallel approach, based on the use of parallel linear algebra libraries; in particular, we employ ScaLAPACK [12] to obtain scalable and portable implementations of the solvers. This approach is applied to both algorithms mentioned above.

The paper is structured as follows. In section 2 we briefly review three numerical DPRE solvers based on a reordering of a product of matrices associated with (2). Coarse-grain and medium-grain parallelizations of the solvers are described and analyzed in sections 3 and 4, respectively. In section 5 we report the performance of the algorithms on a Beowulf cluster of Intel Pentium-II processors. This class of parallel distributed computer systems presents a better price/performance ratio than traditional parallel supercomputers and has recently become a cost-effective, widespread approach for solving large applications [15]. Finally, some concluding remarks are given in section 6.

2. SOLVING DISCRETE-TIME PERIODIC RICCATI EQUATIONS

In order to solve the LQ optimal control problem for discrete-time periodic systems, we need to solve the DPRE (2). Here we consider the $2n \times 2n$ periodic matrix pairs associated with this DPRE,

$$L_k = \begin{bmatrix} A_k & 0 \\ -C_k^T Q_k C_k & I_n \end{bmatrix}, \qquad M_k = \begin{bmatrix} I_n & B_k R_k^{-1} B_k^T \\ 0 & A_k^T \end{bmatrix}, \qquad k = 0, 1, \ldots, p-1,$$

where $I_n$ denotes the identity matrix of order $n$. In case all the $A_k$ are non-singular, the solution matrices $X_k$ of the DPRE in (2) are given via the invariant subspaces of the periodic matrices [23]

$$\Pi_k = M_{k+p-1}^{-1} L_{k+p-1}\, M_{k+p-2}^{-1} L_{k+p-2} \cdots M_k^{-1} L_k, \qquad k = 0, 1, \ldots, p-1, \tag{3}$$

corresponding to the eigenvalues inside the unit disk. Under mild control-theoretic assumptions, the $\Pi_k$ have exactly $n$ of these eigenvalues. If the columns of $[U_k^T,\, V_k^T]^T$, $U_k, V_k \in \mathbb{R}^{n\times n}$, span this invariant subspace, and the $U_k$ are invertible, then $X_k = V_k U_k^{-1}$. Note that these relations still hold in a generalized sense if any of the $A_k$ are singular [6, 11, 23, 35], and all algorithms presented here can still be applied in that case.

The periodic QZ algorithm is a numerically sound DPRE solver which relies on an extension of the generalized Schur vector method [13, 23]. This is a QR-like algorithm with a computational cost of $O(pn^3)$ flops (it has to deal with $p$ eigenproblems). Several experimental studies report the difficulties in parallelizing this type of algorithm on parallel distributed systems (see, e.g., [24]). The algorithms present a fine granularity which introduces performance losses due to communication start-up overhead. Besides, traditional data layouts (column/row block scattered) lead to an unbalanced distribution of the computational load. These drawbacks can partially be solved by using a block Hankel distribution to improve load balancing [24] and multishift techniques to increase the granularity [25, 14]. Nevertheless, the parallelism and scalability of these algorithms are still far from those of traditional matrix factorizations [12].

In this paper we follow a different approach, described in [5, 6], for solving DPREs without explicitly forming the matrix products in (3). The approach relies on the following lemma.

Lemma 1. Consider $Z, Y \in \mathbb{R}^{q \times q}$, with $Y$ invertible, and let

$$\begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix} \begin{bmatrix} Y \\ Z \end{bmatrix} = \begin{bmatrix} R \\ 0 \end{bmatrix} \tag{4}$$

be a QR factorization of $[Y^T,\, Z^T]^T$; then $Q_{22}^{-1} Q_{21} = -Z Y^{-1}$.

The application of this lemma to a $q \times q$ matrix pair requires $40q^3/3$ flops and storage for $6q^2$ real numbers [6]. Hereafter, we denote by $C_{\rm swap}$ and $C_{\rm store}$ the computational and storage costs, respectively, of applying the above swapping procedure to a matrix pair of size $q = 2n$. We next describe three different algorithms, based on the swapping lemma, for solving DPREs [5, 6].
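To make the swapping lemma concrete, the following NumPy sketch (our illustration, not code from [5, 6]; the helper name swap_pair is hypothetical) computes $Q_{21}$ and $Q_{22}$ from a full QR factorization of $[Y^T, Z^T]^T$ and checks the identity $Q_{22}^{-1} Q_{21} = -Z Y^{-1}$:

import numpy as np

def swap_pair(Y, Z):
    # Swapping lemma: QR-factorize [Y; Z] and return (Q22, Q21) such that
    # Q22^{-1} Q21 = -Z Y^{-1} (Y is assumed to be invertible).
    q = Y.shape[0]
    Qfull, _ = np.linalg.qr(np.vstack([Y, Z]), mode="complete")
    Q = Qfull.T                   # orthogonal factor applied from the left
    return Q[q:, q:], Q[q:, :q]   # (Q22, Q21)

# Small numerical check of the lemma on random data.
rng = np.random.default_rng(0)
q = 5
Y, Z = rng.standard_normal((q, q)), rng.standard_normal((q, q))
Q22, Q21 = swap_pair(Y, Z)
assert np.allclose(np.linalg.solve(Q22, Q21), -Z @ np.linalg.inv(Y))

The point of the lemma is that the quotient $Z Y^{-1}$ (inverse on the right) is rewritten as $-Q_{22}^{-1} Q_{21}$ (inverse on the left) using only an orthogonal factorization, without ever forming $Y^{-1}$; this is the "swap" exploited by all the reordering algorithms below.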

2.1. Reordering a matrix product

The basic idea in this first algorithm is to apply the swapping procedure to reorder the matrix products in $\Pi_k$ in order to obtain

$$\Pi_k = \hat{M}_k^{-1} \hat{L}_k = (\bar{M}_k \cdots \bar{M}_{k+p-1})^{-1} (\bar{L}_{k+p-1} \cdots \bar{L}_k), \tag{5}$$

without computing any explicit inverse. We describe the procedure by means of an example of period $p = 4$. First, apply the swapping procedure to $(L_1, M_0)$ to obtain a reordered matrix pair $(L_1^{(1)}, M_0^{(1)})$ such that $L_1 M_0^{-1} = (M_0^{(1)})^{-1} L_1^{(1)}$. Then,

$$\begin{aligned} \Pi_0 &= M_3^{-1} L_3\, M_2^{-1} L_2\, M_1^{-1} L_1\, M_0^{-1} L_0 \\ &= M_3^{-1} L_3\, M_2^{-1} L_2\, M_1^{-1} (M_0^{(1)})^{-1} L_1^{(1)} L_0 \\ &= M_3^{-1} L_3\, M_2^{-1} L_2\, M_{1:0}^{-1} L_{1:0}. \end{aligned}$$
Here, we borrow the colon notation from [6] to indicate, e.g., that $L_{1:0}$ is obtained by collapsing (computing) the matrix product $L_1^{(1)} L_0$. Next, apply the swapping lemma to $(L_2, M_{1:0})$, to obtain a reordered matrix pair $(L_2^{(1)}, M_{1:0}^{(1)})$ such that

$$\begin{aligned} \Pi_0 &= M_3^{-1} L_3\, M_2^{-1} L_2\, M_{1:0}^{-1} L_{1:0} \\ &= M_3^{-1} L_3\, M_2^{-1} (M_{1:0}^{(1)})^{-1} L_2^{(1)} L_{1:0} \\ &= M_3^{-1} L_3\, M_{2:0}^{-1} L_{2:0}. \end{aligned}$$
By applying the swapping procedure one last time, to $(L_3, M_{2:0})$, we obtain the required reordering in (5).

This stage requires applying the swapping procedure $p-1$ times. The solution of each DPRE, $X_k$, is then obtained from the associated matrix pair $(\hat{L}_k, \hat{M}_k)$, using any method that computes the deflating subspace corresponding to the generalized eigenvalues inside the unit disk [28, 30, 31]. In case all $A_k$, $k = 0, 1, \ldots, p-1$, are non-singular, the remaining $p-1$ matrix products can be obtained from $\Pi_k$ using the relation

$$\Pi_{k+1} = (M_k^{-1} L_k)\, \Pi_k\, (M_k^{-1} L_k)^{-1}.$$
Thus, e.g.,

$$\Pi_{k+1} = (M_k^{-1} L_k)\, \Pi_k\, (M_k^{-1} L_k)^{-1} = (M_k^{-1} L_k)\, \hat{M}_k^{-1} \hat{L}_k\, (M_k^{-1} L_k)^{-1},$$
and this matrix can be obtained from $\Pi_k$ by applying the swapping procedure twice. Overall, the computation of $\Pi_k$, $k = 0, 1, \ldots, p-1$, requires $3(p-1)\, C_{\rm comp}$ flops, where $C_{\rm comp} = C_{\rm swap} + 4(2n)^3$, and workspace for $2(p+1)n^2 + C_{\rm store}$ real numbers [6]. The cost of solving the DPRE depends on the method employed to compute the corresponding deflating subspace and will be given in section 2.4.

The previous algorithm requires all $A_k$ to be non-singular. Furthermore, a single moderately ill-conditioned matrix $A_k$ may negatively affect the accuracy of the computed solutions. This algorithm is not considered any further.
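Although this variant is not pursued further, the collapsing step it is built on reappears in all the algorithms below. The following sketch (our illustration only, assuming the hypothetical swap_pair helper from the previous sketch) collapses one formal product $\Pi_0 = M_{p-1}^{-1} L_{p-1} \cdots M_0^{-1} L_0$ into a single pair $(\hat{L}, \hat{M})$ with $\Pi_0 = \hat{M}^{-1} \hat{L}$, using $p-1$ swaps and no explicit inverses:

import numpy as np   # swap_pair is the helper sketched after Lemma 1

def collapse_product(Ls, Ms):
    # Collapse M_{p-1}^{-1} L_{p-1} ... M_0^{-1} L_0 into Mhat^{-1} Lhat.
    Lhat, Mhat = Ls[0].copy(), Ms[0].copy()
    for Li, Mi in zip(Ls[1:], Ms[1:]):
        # Swapping lemma with (Y, Z) = (Mhat, Li): Li Mhat^{-1} = Q22^{-1} (-Q21).
        Q22, Q21 = swap_pair(Mhat, Li)
        Mhat = Q22 @ Mi        # absorb M_i:  (Q22 Mi)^{-1} = Mi^{-1} Q22^{-1}
        Lhat = -Q21 @ Lhat     # absorb the running L factor
    return Lhat, Mhat

# Check against the explicitly inverted product for a random period-3 example.
rng = np.random.default_rng(1)
p, q = 3, 4
Ls = [rng.standard_normal((q, q)) for _ in range(p)]
Ms = [rng.standard_normal((q, q)) for _ in range(p)]
Lhat, Mhat = collapse_product(Ls, Ms)
Pi0 = (np.linalg.inv(Ms[2]) @ Ls[2] @ np.linalg.inv(Ms[1]) @ Ls[1]
       @ np.linalg.inv(Ms[0]) @ Ls[0])
assert np.allclose(np.linalg.solve(Mhat, Lhat), Pi0)

The explicit inverses appear only in the test, never in the collapsing routine; this mirrors the inverse-free character of the algorithms in [5, 6], although the exact sequencing of swaps used there differs.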


2.2. Reordering matrix products

Next we describe an algorithm that deals with the case of singular $A_k$ matrices, at the expense of increasing the computational cost of the algorithm in the previous subsection by a factor of $p$. We illustrate the idea in this algorithm by means of an example of period $p = 4$. Consider the matrices

$$\begin{aligned} \Pi_0 &= M_3^{-1} L_3\, M_2^{-1} L_2\, M_1^{-1} L_1\, M_0^{-1} L_0, \\ \Pi_1 &= M_0^{-1} L_0\, M_3^{-1} L_3\, M_2^{-1} L_2\, M_1^{-1} L_1, \\ \Pi_2 &= M_1^{-1} L_1\, M_0^{-1} L_0\, M_3^{-1} L_3\, M_2^{-1} L_2, \\ \Pi_3 &= M_2^{-1} L_2\, M_1^{-1} L_1\, M_0^{-1} L_0\, M_3^{-1} L_3, \end{aligned}$$

and apply the swapping procedure to the matrix pairs $(L_3, M_2)$, $(L_2, M_1)$, $(L_1, M_0)$, and $(L_0, M_3)$. Then we obtain $(L_3^{(1)}, M_2^{(1)})$, $(L_2^{(1)}, M_1^{(1)})$, $(L_1^{(1)}, M_0^{(1)})$, and $(L_0^{(1)}, M_3^{(1)})$, which satisfy

$$L_3 M_2^{-1} = (M_2^{(1)})^{-1} L_3^{(1)}, \quad L_2 M_1^{-1} = (M_1^{(1)})^{-1} L_2^{(1)}, \quad L_1 M_0^{-1} = (M_0^{(1)})^{-1} L_1^{(1)}, \quad\text{and}\quad L_0 M_3^{-1} = (M_3^{(1)})^{-1} L_0^{(1)}.$$
Therefore,

$$\begin{aligned} \Pi_0 &= M_3^{-1} (M_2^{(1)})^{-1} L_3^{(1)} (M_1^{(1)})^{-1} L_2^{(1)} (M_0^{(1)})^{-1} L_1^{(1)} L_0, \\ \Pi_1 &= M_0^{-1} (M_3^{(1)})^{-1} L_0^{(1)} (M_2^{(1)})^{-1} L_3^{(1)} (M_1^{(1)})^{-1} L_2^{(1)} L_1, \\ \Pi_2 &= M_1^{-1} (M_0^{(1)})^{-1} L_1^{(1)} (M_3^{(1)})^{-1} L_0^{(1)} (M_2^{(1)})^{-1} L_3^{(1)} L_2, \quad\text{and} \\ \Pi_3 &= M_2^{-1} (M_1^{(1)})^{-1} L_2^{(1)} (M_0^{(1)})^{-1} L_1^{(1)} (M_3^{(1)})^{-1} L_0^{(1)} L_3. \end{aligned}$$

Repeating the swapping procedure with the matrix pairs $(L_3^{(1)}, M_1^{(1)})$, $(L_2^{(1)}, M_0^{(1)})$, $(L_0^{(1)}, M_2^{(1)})$, and $(L_1^{(1)}, M_3^{(1)})$, we obtain $(L_3^{(2)}, M_1^{(2)})$, $(L_2^{(2)}, M_0^{(2)})$, $(L_0^{(2)}, M_2^{(2)})$, and $(L_1^{(2)}, M_3^{(2)})$ such that

$$\begin{aligned} \Pi_0 &= M_3^{-1} (M_2^{(1)})^{-1} (M_1^{(2)})^{-1} L_3^{(2)} (M_0^{(2)})^{-1} L_2^{(2)} L_1^{(1)} L_0, \\ \Pi_1 &= M_0^{-1} (M_3^{(1)})^{-1} (M_2^{(2)})^{-1} L_0^{(2)} (M_1^{(2)})^{-1} L_3^{(2)} L_2^{(1)} L_1, \\ \Pi_2 &= M_1^{-1} (M_0^{(1)})^{-1} (M_3^{(2)})^{-1} L_1^{(2)} (M_2^{(2)})^{-1} L_0^{(2)} L_3^{(1)} L_2, \quad\text{and} \\ \Pi_3 &= M_2^{-1} (M_1^{(1)})^{-1} (M_0^{(2)})^{-1} L_2^{(2)} (M_3^{(2)})^{-1} L_1^{(2)} L_0^{(1)} L_3. \end{aligned}$$

A last reordering of the matrix pairs $(L_3^{(2)}, M_0^{(2)})$, $(L_0^{(2)}, M_1^{(2)})$, $(L_1^{(2)}, M_2^{(2)})$, and $(L_2^{(2)}, M_3^{(2)})$ provides the required reordering in (5).

The algorithm can be stated as follows [5]. In the algorithm we denote by $(Y, Z) \leftarrow {\rm swap}(Y, Z)$ the application of the swapping lemma to the matrix pair $(Y, Z)$, where $Y$ and $Z$ are overwritten by $Q_{22}$ and $Q_{21}$, respectively, using the notation of Lemma 1.

Algorithm 2.1
Input: $p$ matrix pairs $(L_k, M_k)$, $k = 0, 1, \ldots, p-1$.
Output: Solution of the $p$ DPREs associated with the matrix pairs.

for $k = 0, 1, \ldots, p-1$
    $\hat{L}_k \leftarrow L_k$,  $\hat{M}_k \leftarrow M_{(k+1) \bmod p}$
end
for $t = 1, 2, \ldots, p-1$
    for $k = 0, 1, \ldots, p-1$
        $(L_{(k+t-1) \bmod p}, M_{(k+p-1) \bmod p}) \leftarrow {\rm swap}(L_{(k+t-1) \bmod p}, M_{(k+p-1) \bmod p})$
        $\hat{M}_{(k+t) \bmod p} \leftarrow M_{(k+p-1) \bmod p}\, \hat{M}_{(k+t) \bmod p}$
        $\hat{L}_k \leftarrow L_{(k+t) \bmod p}\, \hat{L}_k$
    end
end
for $k = 0, 1, \ldots, p-1$
    Solve the DPRE using $(\hat{L}_k, \hat{M}_k)$
end

The algorithm is only composed of QR factorizations and matrix products [20]. The computational cost of the reordering procedure is $p(p-1)\, C_{\rm comp}$ flops, with $p(2n^2 + C_{\rm store})$ for workspace [6].

2.3. Reducing the computational cost

In this subsection we describe an algorithm which reduces the computational cost of the previous algorithm to $O(p \log_2(p)\, n^3)$ flops, while maintaining a similar numerical behavior. We use an example of period $p = 4$ to describe the algorithm. Consider the matrices

$$\begin{aligned} \Pi_0 &= M_3^{-1} L_3\, M_2^{-1} L_2\, M_1^{-1} L_1\, M_0^{-1} L_0, \\ \Pi_1 &= M_2^{-1} L_2\, M_1^{-1} L_1\, M_0^{-1} L_0\, M_3^{-1} L_3, \\ \Pi_2 &= M_1^{-1} L_1\, M_0^{-1} L_0\, M_3^{-1} L_3\, M_2^{-1} L_2, \\ \Pi_3 &= M_0^{-1} L_0\, M_3^{-1} L_3\, M_2^{-1} L_2\, M_1^{-1} L_1, \end{aligned}$$

and apply the swapping procedure to reorder the matrix products $L_3 M_2^{-1}$, $L_2 M_1^{-1}$, $L_1 M_0^{-1}$, and $L_0 M_3^{-1}$ into $(M_2^{(1)})^{-1} L_3^{(1)}$, $(M_1^{(1)})^{-1} L_2^{(1)}$, $(M_0^{(1)})^{-1} L_1^{(1)}$, and $(M_3^{(1)})^{-1} L_0^{(1)}$, respectively. Note that each of these matrix products appears twice in $\Pi_0, \ldots, \Pi_3$. Thus, we obtain the reordered matrix products

$$\begin{aligned} \Pi_0 &= M_3^{-1} (M_2^{(1)})^{-1} L_3^{(1)} L_2\, M_1^{-1} (M_0^{(1)})^{-1} L_1^{(1)} L_0 = M_{3:2}^{-1} L_{3:2}\, M_{1:0}^{-1} L_{1:0}, \\ \Pi_1 &= M_2^{-1} (M_1^{(1)})^{-1} L_2^{(1)} L_1\, M_0^{-1} (M_3^{(1)})^{-1} L_0^{(1)} L_3 = M_{2:1}^{-1} L_{2:1}\, M_{0:3}^{-1} L_{0:3}, \\ \Pi_2 &= M_1^{-1} (M_0^{(1)})^{-1} L_1^{(1)} L_0\, M_3^{-1} (M_2^{(1)})^{-1} L_3^{(1)} L_2 = M_{1:0}^{-1} L_{1:0}\, M_{3:2}^{-1} L_{3:2}, \\ \Pi_3 &= M_0^{-1} (M_3^{(1)})^{-1} L_0^{(1)} L_3\, M_2^{-1} (M_1^{(1)})^{-1} L_2^{(1)} L_1 = M_{0:3}^{-1} L_{0:3}\, M_{2:1}^{-1} L_{2:1}. \end{aligned}$$

Now, we only need to apply the swapping procedure once more, to reorder the matrix products $L_{3:2} M_{1:0}^{-1}$, $L_{2:1} M_{0:3}^{-1}$, $L_{1:0} M_{3:2}^{-1}$, and $L_{0:3} M_{2:1}^{-1}$, and thus obtain the required reordered matrix products in (5).

The algorithm can be stated as follows. For simplicity, we present the algorithm for $p$ equal to a power of 2.

Algorithm 2.2
Input: $p$ matrix pairs $(L_k, M_k)$, $k = 0, 1, \ldots, p-1$.
Output: Solution of the $p$ DPREs associated with the matrix pairs.

for $t = 1, 2, \ldots, \log_2(p)$
    $l \leftarrow 2^{t-1}$
    for $k = 0, 1, \ldots, p-1$
        $(Y, Z) \leftarrow {\rm swap}(L_k, M_{(k+p-l) \bmod p})$
        $\hat{L}_k \leftarrow Y L_{(k+p-l) \bmod p}$
        $\hat{M}_k \leftarrow Z M_k$
    end
    for $k = 0, 1, \ldots, p-1$
        $L_k \leftarrow \hat{L}_k$
        $M_k \leftarrow \hat{M}_k$
    end
end
for $k = 0, 1, \ldots, p-1$
    Solve the DPRE using $(\hat{L}_k, \hat{M}_k)$
end

The computational cost of the reordering procedure in this case is $p \lceil \log_2(p) \rceil\, C_{\rm comp}$ flops, with $2p(2n^2 + C_{\rm store})$ for workspace [6].
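As a rough worked comparison, for $p = 32$ the reordering stage of Algorithm 2.1 performs $p(p-1) = 992$ swap-and-collapse steps, whereas Algorithm 2.2 performs only $p \lceil \log_2(p) \rceil = 32 \cdot 5 = 160$, roughly a six-fold reduction in flops, at the price of doubling the workspace.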

2.4. Computing X_k

At the final stage of the three reordering algorithms described above, it is necessary to solve the DPRE (2). As outlined in section 2, these solutions can be obtained from certain invariant subspaces of the formal matrix products $\Pi_k$. As these subspaces are exactly the deflating subspaces of the matrix pairs $(\hat{L}_k, \hat{M}_k)$, the $X_k$ are computed from the right deflating subspace of $(\hat{L}_k, \hat{M}_k)$ corresponding to the eigenvalues inside the unit disk.

Here we propose to compute the $X_k$ using the so-called matrix disk function [5]. This matrix function can be computed using an inverse-free iteration, composed of QR factorizations and matrix products, which employs the rationale behind the swapping lemma. The inverse-free iteration for the matrix disk function was introduced in [27] and made truly inverse-free in [4]. The iterative procedure can be stated as follows.

Algorithm 3.1
Input: A matrix pair $(L, M)$.
Output: Disk function of the matrix pair.

$j \leftarrow 0$, $L_0 \leftarrow L$, $M_0 \leftarrow M$, $R_0 \leftarrow 0$
repeat
    Compute the QR factorization

    $$\begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix} \begin{bmatrix} M_j \\ -L_j \end{bmatrix} = \begin{bmatrix} R_{j+1} \\ 0 \end{bmatrix}$$
    $L_{j+1} \leftarrow Q_{21} L_j$,  $M_{j+1} \leftarrow Q_{22} M_j$
    $j \leftarrow j + 1$
until $\| R_{j+1} - R_j \|_F < \tau\, \| R_{j+1} \|_F$

In the algorithm, $\tau$ is a user-defined tolerance threshold for the stopping criterion. The disk function of the matrix pair $(L, M)$ is defined from the matrix pair at convergence, denoted by $(L_\infty, M_\infty)$, as ${\rm disk}(L, M) = (L_\infty + M_\infty)^{-1} M_\infty$ [5]. Note that the QR factorization computed at each iteration of the algorithm is exactly the same as the one used in the swapping lemma.

If we apply Algorithm 3.1 to $(L, M) := (\hat{L}_k, \hat{M}_k)$, then $X := X_k$ can be computed directly from ${\rm disk}(L, M)$ without computing a basis for the corresponding deflating subspace explicitly. Partition $L_\infty$ into $n \times n$ blocks as
$$L_\infty = \begin{bmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{bmatrix};$$

then $X$ is obtained by solving
$$\begin{bmatrix} L_{12} \\ L_{22} \end{bmatrix} X = -\begin{bmatrix} L_{11} \\ L_{21} \end{bmatrix};$$
see [5] for details.
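A minimal dense NumPy sketch of this procedure (our illustration only; the function name is hypothetical, the iteration cap and tolerance are arbitrary, and the sign conventions follow the reconstruction above rather than being taken verbatim from [5]) is:

import numpy as np

def dpre_solution_from_pair(L, M, tol=1e-12, maxit=50):
    # Inverse-free disk-function iteration on the 2n x 2n pair (L, M),
    # followed by the least-squares solve for the n x n solution X.
    q = L.shape[0]
    n = q // 2
    Lj, Mj, R_old = L.copy(), M.copy(), None
    for _ in range(maxit):
        # QR factorization of [M_j; -L_j]; the orthogonal factor is applied
        # transposed from the left, so that M_{j+1}^{-1} L_{j+1} = (M_j^{-1} L_j)^2.
        Qfull, Rfull = np.linalg.qr(np.vstack([Mj, -Lj]), mode="complete")
        Q = Qfull.T
        Lj, Mj = Q[q:, :q] @ Lj, Q[q:, q:] @ Mj   # L_{j+1} = Q21 L_j, M_{j+1} = Q22 M_j
        Rj = Rfull[:q, :]
        if R_old is not None and \
           np.linalg.norm(Rj - R_old, "fro") < tol * np.linalg.norm(Rj, "fro"):
            break
        R_old = Rj
    # Partition L_infinity and solve [L12; L22] X = -[L11; L21].
    L11, L12 = Lj[:n, :n], Lj[:n, n:]
    L21, L22 = Lj[n:, :n], Lj[n:, n:]
    X, *_ = np.linalg.lstsq(np.vstack([L12, L22]), -np.vstack([L11, L21]), rcond=None)
    return X

A practical implementation would of course rely on the blocked QR and triangular-solve kernels discussed in the following sections instead of dense NumPy calls, and would monitor convergence more carefully.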

The cost of computing the matrix disk function of a $q \times q$ matrix pair using the inverse-free iteration is $40q^3/3$ flops per iteration. Solving the linear least-squares (LLS) system for $X$ adds $13q^3/3$ flops more.

3. COARSE-GRAIN PARALLEL ALGORITHMS

Our coarse-grain parallel algorithms employ a logical linear array of $p$ processes, $P_0, P_1, \ldots, P_{p-1}$, where $p$ is the period of the system. These parallel algorithms require efficient point-to-point communication routines, high-performance serial numerical kernels for the matrix product and the QR factorization, like those available in BLAS [16] and LAPACK [2], and a serial subspace extraction method based, in our case, on the matrix disk function [5].

Consider first the parallelization of Algorithm 2.1. In this algorithm, each matrix pair $(L_k, M_k)$ is initially stored in a different process $P_k$, $k = 0, 1, \ldots, p-1$. During the swapping stage, each swapping of a matrix pair is carried out locally in the process where it is stored. Thus the coarse-grain parallelism is obtained by performing each iteration of loop $k$ in Algorithm 2.1 (a swap of a matrix pair) in a different process. By the end of this stage, a reordered matrix pair $(\hat{L}_k, \hat{M}_k)$, as in (5), is stored in process $P_k$, and the solution of the corresponding DPRE can be obtained locally.

Parallel Algorithm 4.1
Input: $p$ matrix pairs $(L_k, M_k)$, stored in $P_k$, $k = 0, 1, \ldots, p-1$.
Output: Solution of the $p$ DPREs associated with the matrix pairs, stored in $P_k$, $k = 0, 1, \ldots, p-1$.

In process $P_k$:
    $\hat{L}_k \leftarrow L_k$,  $\hat{M}_k \leftarrow M_{(k+1) \bmod p}$
    Send $M_k$ to $P_{(k+1) \bmod p}$
    Receive $M_{(k+p-1) \bmod p}$ from $P_{(k+p-1) \bmod p}$
    for $t = 1, 2, \ldots, p-1$
        $(L_{(k+t-1) \bmod p}, M_{(k+p-1) \bmod p}) \leftarrow {\rm swap}(L_{(k+t-1) \bmod p}, M_{(k+p-1) \bmod p})$
        Send $L_{(k+t-1) \bmod p}$ to $P_{(k+p-1) \bmod p}$
        $\hat{M}_{(k+t) \bmod p} \leftarrow M_{(k+p-1) \bmod p}\, \hat{M}_{(k+t) \bmod p}$
        Receive $L_{(k+t) \bmod p}$ from $P_{(k+1) \bmod p}$
        Send $\hat{M}_{(k+t) \bmod p}$ to $P_{(k+p-1) \bmod p}$
        $\hat{L}_k \leftarrow L_{(k+t) \bmod p}\, \hat{L}_k$
        Receive $\hat{M}_{(k+t+1) \bmod p}$ from $P_{(k+1) \bmod p}$
    end
    Solve the DPRE using $(\hat{L}_k, \hat{M}_k)$

The algorithm presents a regular, local communication pattern, as the only communications necessary are the left circular shifts of $L_k$ and $\hat{M}_k$.

Assume our system consists of $n_p$ physical processors, $\mathcal{P}_0, \ldots, \mathcal{P}_{n_p-1}$, with $n_p \leq p$. (Using a number of processors larger than $p$ does not produce any benefit in this algorithm.) In the ideal distribution, a group of $\lceil (p - r - 1)/n_p \rceil + 1$ consecutive processes is mapped

onto processor $\mathcal{P}_r$, $r = 0, \ldots, n_p-1$. As the communication pattern only involves neighbour processes, this mapping only requires the communication of $n_p$ matrix pairs between neighbour processors at each iteration of loop $t$. The remaining $p - n_p$ transfers are actually performed inside a processor, and can be implemented as local memory matrix copies.

To derive a theoretical model for our parallel algorithm we use a simplified variant of the LogGP model [1], where the time required to transfer a message of length $l$ between two processors is given by $\alpha + \beta l$. (Basically, the latency and overhead parameters of the LogGP model are combined into $\alpha$, while no distinction is made between the bandwidth, $\beta^{-1}$, for short and long messages.) We also define $\gamma$ as the time required to perform a flop. Finally, for simplicity, we do not distinguish between the cost of sending a square matrix of size $n$ and that of performing a matrix copy between two processes in the same processor. We use $\alpha + \beta n^2$ in both cases, though we recognize that the matrix copy can be much faster.

An upper bound for the parallel execution time of the previous algorithm is given by

$$T_{\max}(n, p, n_p) = \sum_{t=1}^{p-1} \sum_{k=0}^{\lceil p/n_p \rceil - 1} \Big( \gamma \big(C_{\rm comp} + 2(2n)^3\big) + 2\big(\alpha + \beta (2n)^2\big) \Big) = (p-1) \Big\lceil \frac{p}{n_p} \Big\rceil \Big( \gamma \big(C_{\rm comp} + 2(2n)^3\big) + 2\big(\alpha + \beta (2n)^2\big) \Big).$$



The transfer of the matrix pairs can be overlapped with the computation of the matrix products using, e.g., a non-blocking (buffered) communication Send routine; the execution time will then be

$$T_{\rm ovl}(n, p, n_p) = (p-1) \Big\lceil \frac{p}{n_p} \Big\rceil \Big( \gamma\, C_{\rm comp} + 2 \max\big\{ \gamma (2n)^3,\ \alpha + \beta (2n)^2 \big\} \Big).$$

The actual degree of overlap depends on the efficiency of the communication subsystem and the computational performance of the processors. Usually $\beta \gg \gamma$, and from a certain "overlapping" threshold $\hat{n}$ on, communication will be completely overlapped with computation for all $n > \hat{n}$. For a particular architecture, this threshold can be derived as the value of $n$ which satisfies

$$\gamma (2n)^3 \geq \alpha + \beta (2n)^2.$$
In the optimal case communication and computation will be completely overlapped and

$$T_{\rm opt}(n, p, n_p) = (p-1) \Big\lceil \frac{p}{n_p} \Big\rceil\, \gamma \big(C_{\rm comp} + 2(2n)^3\big).$$

The optimal speed-up is then

$$S_{\rm opt}(n, p, n_p) = \frac{T_{\rm opt}(n, p, 1)}{T_{\rm opt}(n, p, n_p)} = \frac{p}{\lceil p/n_p \rceil}.$$
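For instance, for $p = 32$ and $n_p = 8$ this gives $S_{\rm opt} = 32/\lceil 32/8 \rceil = 8$, a perfect speed-up, whereas for $n_p = 12$ the uneven mapping of processes to processors yields $S_{\rm opt} = 32/\lceil 32/12 \rceil = 32/3 \approx 10.7$, noticeably below $n_p$.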

This model will surely deviate from the experimental results obtained in section 5. We point out two reasons for this behavior. First, in general $\alpha$ and $\beta$ depend on the message length. Second, the computational cost of a flop depends on the problem size and the type of operation; e.g., the so-called Level 3 BLAS operations (basically, matrix products) exploit the hierarchical structure of the memory to achieve a lower $\gamma$.

Let us consider now the parallelization of Algorithm 2.2. For simplicity, again we only present the algorithm for a period $p = 2^i$ for some positive integer $i$.

Parallel Algorithm 4.2
Input: $p$ matrix pairs $(L_k, M_k)$, stored in $P_k$, $k = 0, 1, \ldots, p-1$.
Output: Solution of the $p$ DPREs associated with the matrix pairs, stored in $P_k$, $k = 0, 1, \ldots, p-1$.

In process $P_k$:
    Send $M_k$ to $P_{(k+1) \bmod p}$
    Receive $M_{(k+p-1) \bmod p}$ from $P_{(k+p-1) \bmod p}$
    for $t = 1, 2, \ldots, \log_2(p)$
        $l \leftarrow 2^{t-1}$
        $(Y, Z) \leftarrow {\rm swap}(L_k, M_{(k+p-l) \bmod p})$
        Send $L_k$ to $P_{(k+l) \bmod p}$
        $M_k \leftarrow Z M_k$
        Receive $L_{(k+p-l) \bmod p}$ from $P_{(k+p-l) \bmod p}$
        if ($t \neq \log_2(p)$) Send $M_k$ to $P_{(k+2l) \bmod p}$
        $L_k \leftarrow Y L_{(k+p-l) \bmod p}$
        if ($t \neq \log_2(p)$) Receive $M_{(k+p-2l) \bmod p}$ from $P_{(k+p-2l) \bmod p}$
    end
    Solve the DPRE using $(\hat{L}_k, \hat{M}_k)$

The analysis of the execution time of this parallel algorithm, the overlapping between communication and computation, and the maximum attainable speed-up follow closely those of Algorithm 4.1. In the coarse-grain parallel algorithms each $X_k$ is computed on a single processor, so only a serial implementation of Algorithm 3.1 is required.

4. MEDIUM-GRAIN PARALLEL ALGORITHMS

The coarse-grain algorithms lose part of their efficiency when the parallel architecture consists of a number of processors $n_p$ larger than the period $p$ of the system. Specifically, in such a case, there are $n_p - p$ idle processors, and the maximum speed-up attainable using $n_p$ processors is limited by $p$. To overcome this drawback we propose to use a different parallelization scheme, with a finer grain, where all the processors of the system cooperate to compute each of the matrix operations in the algorithm. Another advantage of the medium-grain approach in case $p < n_p$ is that the distribution of each matrix among several processors may allow the solution of larger problems (in $n$) than the coarse-grain approach.

The development of medium-grain parallel algorithms, which work "at the matrix level", is supported by parallel matrix algebra libraries like ScaLAPACK [12] and PLAPACK [34]. Both public libraries rely on the message-passing paradigm and provide parallel routines for basic matrix computations (e.g., matrix-vector products, matrix-matrix products, matrix norms), and for solving linear systems, eigenproblems, and singular value problems. ScaLAPACK closely mimics the functionality of the popular LAPACK [2], and is used as a black box. PLAPACK basically offers parallel routines for the BLAS [16] and provides the user with an environment for easy development of new, user-tailored codes.

In this paper we propose to use the kernels in ScaLAPACK. This library employs BLAS and LAPACK for serial computations, PB-BLAS (parallel block BLAS) for parallel basic matrix algebra computations, and BLACS (basic linear algebra communication subprograms) for communication. The efficiency of the ScaLAPACK kernels depends on the efficiency of the underlying computational BLAS/PB-BLAS routines and the communication library,

BLACS. BLACS can be used on any machine that supports either PVM [19] or MPI [21], thus providing a highly portable environment.

In ScaLAPACK, the user is responsible for distributing the data among the processes. Access to data stored in a different process must be explicitly requested and provided via message-passing. The implementation of ScaLAPACK employs a block-cyclic distribution scheme [12]. The data matrices are mapped onto a logical $p_r \times p_c$ grid of processes. Each process owns a collection of blocks of dimension $m_b \times n_b$, which are locally and contiguously stored in a two-dimensional array in "column-major" order. For scalability purposes, we map each process onto a different physical processor (i.e., $n_p = p_r \times p_c$), and we use a 2-dimensional block-scattered layout for all our matrices. We employ square blocks of size $n_b$ for the layout, with $n_b$ experimentally determined to optimize performance. The parallelization of a numerical algorithm using this library consists of distributing the matrices, and identifying and calling the appropriate parallel routines.

The reordering algorithms for the DPRE perform the following matrix operations based on Lemma 1: the QR factorization; forming $Q_{12}$ and $Q_{22}$ (these matrices can be computed by applying, from the right, the transformations computed in the QR factorization to a matrix of the form $[0_n,\, I_n]$); and the matrix product. ScaLAPACK provides routines PDGEQRF, PDORMQR, and PDGEMM for these purposes. The computation of $X_k$ requires basically the same matrix operations plus the solution of a consistent overdetermined LLS problem at the final stage. This problem can be solved by computing the QR factorization of the coefficient matrix, applying the transformations to the right-hand side matrix, $[L_{11}^T,\, L_{21}^T]^T$, and solving a triangular linear system. The first two steps can be performed using routines PDGEQRF and PDORMQR, while the last step is done using PDTRSM.

Table 1 reports the number of subroutine calls required by the reordering procedure in the DPRE solvers (swapping stage in Algorithms 2.1 and 2.2) and by the solution of the DPRE from the reordered matrix pair (Algorithm 3.1). In the table, "iter" stands for the number of iterations necessary for convergence in the inverse-free iteration for the matrix disk function.

Table 1
Number of calls to different parallel routines from ScaLAPACK required by the reordering procedure in the DPRE solvers and by solving the DPRE from the reordered matrix pair.

              PDGEQRF       PDORMQR             PDGEMM         PDORMQR                         PDTRSM
                            (form Q12, Q22)                    (apply to [L11^T, L21^T]^T)
  Alg. 2.1    p(p-1)        p(p-1)              2p(p-1)        —                               —
  Alg. 2.2    p log2(p)     p log2(p)           2p log2(p)     —                               —
  Alg. 3.1    iter + 1      iter                2 iter         1                               1
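As an aside, the 2-dimensional block-cyclic layout described above amounts to a simple mapping from a global matrix entry to its owner process. The following sketch (our illustration, not ScaLAPACK code; the function name is hypothetical and the first block is assumed to reside in process (0, 0)) mirrors that convention:

def block_cyclic_owner(i, j, mb, nb, p_r, p_c):
    # Owner (process row, process column) of global entry (i, j) under a
    # block-cyclic layout with mb x nb blocks on a p_r x p_c process grid.
    return (i // mb) % p_r, (j // nb) % p_c

# Example: entry (100, 70) with 32 x 32 blocks on a 2 x 2 process grid -> (1, 0).
print(block_cyclic_owner(100, 70, 32, 32, 2, 2))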

In general, predicting execution time is a complex task due to the large number of factors that have an influence on the final results (system software such as compilers, libraries, operating system; layout block size, processor grid size, actual implementation of collective communication routines, etc.). This is also true for ScaLAPACK, where proposed models

Table 2
Performance of the communication libraries.

                            β⁻¹ (Mbps)
  Library       α (μs)      Short messages    Long messages
  MPI/GM API    37.4        213.1             254.4

for the execution time of the routines are oversimplistic and neglect many of these factors; see, e.g., [3, 17]. We therefore do not pursue this goal any further.

5. EXPERIMENTAL RESULTS

Our experiments are performed on a cluster of Intel Pentium-II processors at 300 MHz, with 128 MB of RAM each, using IEEE double-precision floating-point arithmetic ($\varepsilon \approx 2.2 \times 10^{-16}$). An implementation of BLAS specially tuned for this architecture was employed. Performance experiments with the matrix product routine in BLAS (DGEMM) achieved 180 Mflops (millions of flops per second) on one processor; that roughly provides a parameter $\gamma \approx 5.5$ ns.

The cluster consists of 32 nodes connected by a Myrinet multistage interconnection network (see http://www.myri.com for a detailed description). Myrinet provides 1.28 Gbps, full-duplex links, and employs cut-through (wormhole) packet switching with source-based routing. The nodes in our network are connected via two M2M-OCT-SW8 Myrinet crossbar SAN switches. Each node contains an M2M-PCI-32B Myrinet network interface card, with a LANai 4.3 processor, a 33 MHz PCI-bus interface, and 1 MB of RAM.

In our coarse-grain parallel algorithms we employed basic Send and Receive communication routines in an implementation of MPI, specially tuned for the Myrinet, which makes direct use of the GM API. This is a native application programming interface by Myricom for communication over Myrinet which avoids the overhead of using MPI on top of the TCP/IP protocol stack. Our medium-grain parallel algorithms are implemented using ScaLAPACK. The communications in this case are performed using BLACS on top of the MPI/GM API. Table 2 reports the communication performance of these libraries measured using a simple ping-pong test, both for short (20 KB) and long (500 KB) messages.

5.1. Performance of the coarse-grain parallel reordering

In these algorithms we are interested in finding the threshold from where the communications will be overlapped with the computations. For this purpose, we use the data in table 2 and $\gamma \approx 5.5$ ns to determine the theoretical threshold at $n = 9$. In practice, the resolution of the timer that we used did not allow us to obtain accurate measurements for $n < 30$. For $n \geq 30$ and period $p$, the coarse-grain parallel algorithms using $n_p$ processors, $n_p = p$, obtained a perfect speed-up. As expected, for $p > n_p$, the speed-up of the algorithm in practice is $p/\lceil p/n_p \rceil$. The scalability of the algorithm as $p$ is increased with the number of processors ($p/n_p$ and $n$ fixed) is perfect. However, the algorithm is not scalable as $n$ is increased with the number of processors ($n^2/n_p$ and $p$ fixed): the matrix becomes too large to be stored in the memory of a single processor. For more performance results, see [7].

5.2. Performance of the medium-grain parallel reordering

We now evaluate the parallelism and scalability of our medium-grain parallel reordering algorithms.

Figure 1. Execution time (in s.) versus number of processors ($n_p$) of the medium-grain parallelization of Algorithm 2.1 (left) and Algorithm 2.2 (right), for $p = 4$ ($q = 2n = 852$), $p = 16$ ($q = 2n = 428$), and $p = 32$ ($q = 2n = 304$).

Figure 1 reports the execution time for the reordering procedures of the DPRE solvers (Algorithms 2.1 and 2.2), for $p = 4$, 16, and 32. The problem size $q = 2n$ was set to the largest problem size that could be solved on one processor. The results in the figure show an important reduction in the execution time achieved by the medium-grain parallel reordering algorithms.

Figure 2 reports the scalability of the medium-grain algorithms. To analyze the scalability vs. the problem size, $\sqrt{(2n)^2 p/n_p}$ is fixed at 460, and we report the Mflop ratio per node, with $p = 4$, on $n_p = 4$, 8, 16 and 30 processors. There is only a minor decrease in performance, and the parallel algorithms can be considered scalable in $n$. The algorithms, however, are not scalable with $p$: as $p$ is increased while $\sqrt{(2n)^2 p/n_p}$ is kept fixed, the matrix size per processor is reduced and the performance of the computational matrix kernels becomes lower. Table 3 reports the speed-up of the medium-grain reordering algorithms for $p = 4$ and $n = 852$.

5.3. Performance of the DPRE solver

Once the reordering procedure provides the corresponding matrix pair, we only need to use our subspace extraction method, based on the matrix disk function, to obtain the solution of the DPRE. We have already noted that the coarse-grain parallel DPRE solvers only require a serial implementation of the matrix disk function. In case $p \geq n_p$, the algorithm will achieve a theoretical and experimental speed-up of $p/\lceil p/n_p \rceil$. (There is no overhead due to synchronization or communication, as the $X_k$'s can be computed independently.) Otherwise, the attained speed-up will be $p$.

Figure 2. Scalability of the medium-grain algorithms vs. problem size: Mflop ratio per node versus number of processors ($n_p$) for Algorithms 2.1 and 2.2, with $\sqrt{(2n)^2 p/n_p} = 460$.

In the medium-grain parallel solvers all the processors in the system participate in the computation of each $X_k$. Figure 3 reports the execution time of the DPRE solver for a matrix pair of size $q = 2n = 700$, using $n_p = 4$, 8, 12 and 16 processors. The execution time on 1 processor is that of a serial implementation of the algorithm. Ten inverse-free iterations were required in all cases for convergence. The figure also reports the scalability of the solver. In this case we set $n/\sqrt{n_p} = 1000$ and evaluate the Mflop ratio (millions of flops per second) per node achieved by the algorithm using a square-like mesh of processors ($n_p = 2 \times 2$, $3 \times 3$, $4 \times 4$, $5 \times 5$, and $5 \times 6$).

Figure 3. Execution time (left) and scalability (right) of the DPRE solver: execution time (in s.) and Mflop ratio per node versus number of processors ($n_p$).

Table 3 reports the speed-up of the DPRE solver for a problem of size $q = 2n = 700$ (see the column labeled Algorithm 3.1).

A comparison between the DPRE solvers employed by the coarse-grain and the medium-grain approaches is straightforward. The coarse-grain (serial) solver will achieve better performance as long as $p \geq n_p$, as the overhead due to communications does not increase. In case $p < n_p$ we would just need to compare the experimental execution time of the medium-grain DPRE solver with that of the serial solver for $p = 1$.

Table 3
Speed-ups of the medium-grain algorithms.

            n=852, p=4                    n=350
  n_p       Alg. 4.1      Alg. 4.2        Alg. 3.1
  4         2.58          2.47            3.45
  8         4.71          4.19            6.56
  12        5.98          5.49            9.47
  16        8.47          7.25            11.76

6. CONCLUDING REMARKS

We have investigated the performance of two parallel algorithms for the linear-quadratic optimal control of discrete-time periodic systems. Two different approaches are used to parallelize these solvers. The coarse-grain parallel approach is better suited for problems with a period larger than the number of available processors and matrices of moderate dimension. The medium-grain parallel solvers obtain better results for problems with large-scale matrices; the period does not play an essential role in this case. The experimental results show the high performance and scalability of the parallel algorithms on a Beowulf cluster.

REFERENCES

1. A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. LogGP: Incorporating long messages into the LogP model for parallel computation. J. Parallel and Distributed Computing, 44(1):71–79, 1997.
2. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, second edition, 1995.
3. Z. Bai, J. Demmel, J. Dongarra, A. Petitet, H. Robinson, and K. Stanley. The spectral decomposition of nonsymmetric matrices on distributed memory parallel computers. SIAM J. Sci. Comput., 18:1446–1461, 1997.
4. Z. Bai, J. Demmel, and M. Gu. An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems. Numer. Math., 76(3):279–308, 1997.
5. P. Benner. Contributions to the Numerical Solution of Algebraic Riccati Equations and Related Eigenvalue Problems. Logos-Verlag, Berlin, Germany, 1997. Also: Dissertation, Fakultät für Mathematik, TU Chemnitz–Zwickau, 1997.
6. P. Benner and R. Byers. Evaluating products of matrix pencils and collapsing matrix products. Technical report, Mathematics Research Reports 99-12-01, Department of Mathematics, University of Kansas, 1999. To appear in Numerical Linear Algebra with Appl.
7. P. Benner, R. Mayo, E.S. Quintana-Ortí, and V. Hernández. A coarse grain parallel solver for periodic Riccati equations. Technical Report 2000-01, Depto. de Informática, 12080–Castellón, Spain, 2000. In preparation.
8. P. Benner, R. Mayo, E.S. Quintana-Ortí, and V. Hernández. Solving discrete-time periodic Riccati equations on a cluster. In A. Bode, T. Ludwig, and R. Wismüller, editors, Euro-Par 2000 – Parallel Processing, number 1900 in Lecture Notes in Computer Science, pages 824–828. Springer-Verlag, Berlin, Heidelberg, New York, 2000.
9. P. Benner, V. Mehrmann, V. Sima, S. Van Huffel, and A. Varga. SLICOT – a subroutine library in systems and control theory. In B.N. Datta, editor, Applied and Computational Control, Signals, and Circuits, volume 1, chapter 10, pages 499–539. Birkhäuser, Boston, MA, 1999.
10. S. Bittanti and P. Colaneri. Periodic control. In J.G. Webster, editor, Wiley Encyclopedia of Electrical and Electronic Engineering, volume 16, pages 59–74. John Wiley & Sons, Inc., New York, Chichester, 1999.
11. S. Bittanti, P. Colaneri, and G. De Nicolao. The periodic Riccati equation. In S. Bittanti, A.J. Laub, and J.C. Willems, editors, The Riccati Equation, pages 127–162. Springer-Verlag, Berlin, Heidelberg, Germany, 1991.

12. L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.
13. A. Bojanczyk, G.H. Golub, and P. Van Dooren. The periodic Schur decomposition: algorithms and applications. In F.T. Luk, editor, Advanced Signal Processing Algorithms, Architectures, and Implementations III, volume 1770 of Proc. SPIE, pages 31–42, 1992.
14. K. Braman, R. Byers, and R. Mathias. The multi-shift QR-algorithm: Aggressive deflation, maintaining well focused shifts, and level 3 performance. Preprint 99-05-01, Department of Mathematics, University of Kansas, Lawrence, KS 66045-2142, May 1999. Available from http://www.math.ukans.edu/ereports/1999.html.
15. R. Buyya. High Performance Cluster Computing: Architectures &amp; Systems. Prentice-Hall, 1999.
16. J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Software, 16(1):1–17, 1990.
17. J. Dongarra, R. van de Geijn, and D. Walker. Scalability issues affecting the design of a dense linear algebra library. J. Parallel and Distrib. Comput., 22(3):523–537, 1994.
18. B.A. Francis and T.T. Georgiou. Stability theory of linear time-invariant plants with periodic digital controllers. IEEE Trans. Automat. Control, 33:820–832, 1988.
19. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, B. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine – A Users Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge, MA, 1994.
20. G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, second edition, 1989.
21. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1994.
22. M.T. Heath, A.J. Laub, C.C. Paige, and R.C. Ward. Computing the SVD of a product of two matrices. SIAM J. Sci. Statist. Comput., 7:1147–1159, 1987.
23. J.J. Hench and A.J. Laub. Numerical solution of the discrete-time periodic Riccati equation. IEEE Trans. Automat. Control, 39:1197–1210, 1994.
24. G. Henry and R. van de Geijn. Parallelizing the QR algorithm for the unsymmetric algebraic eigenvalue problem: myths and reality. SIAM J. Sci. Comput., 17:870–883, 1997.
25. G. Henry, D.S. Watkins, and J.J. Dongarra. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. LAPACK Working Note 121, University of Tennessee at Knoxville, 1997.
26. W. Johnson, editor. Helicopter Theory. Princeton University Press, Princeton, NJ, 1981.
27. A.N. Malyshev. Parallel algorithm for solving some spectral problems of linear algebra. Linear Algebra Appl., 188/189:489–520, 1993.
28. V. Mehrmann. The Autonomous Linear Quadratic Control Problem, Theory and Numerical Solution. Number 163 in Lecture Notes in Control and Information Sciences. Springer-Verlag, Heidelberg, July 1991.
29. T. Pappas, A.J. Laub, and N.R. Sandell. On the numerical solution of the discrete-time algebraic Riccati equation. IEEE Trans. Automat. Control, AC-25:631–641, 1980.
30. P.H. Petkov, N.D. Christov, and M.M. Konstantinov. Computational Methods for Linear Control Systems. Prentice-Hall, Hertfordshire, UK, 1991.
31. V. Sima. Algorithms for Linear-Quadratic Optimization, volume 200 of Pure and Applied Mathematics. Marcel Dekker, Inc., New York, NY, 1996.
32. J. Sreedhar and P. Van Dooren. Periodic descriptor systems: Solvability and conditionability. IEEE Trans. Automat. Control, 44(2):310–313, 1999.
33. J. Sreedhar and P. Van Dooren. Periodic Schur form and some matrix equations. In U. Helmke, R. Mennicken, and J. Saurer, editors, Systems and Networks: Mathematical Theory and Applications, Proc. Intl. Symposium MTNS '93 held in Regensburg, Germany, August 2–6, 1993, pages 339–362. Akademie Verlag, Berlin, FRG, 1994.
34. R.A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. MIT Press, Cambridge, MA, 1997.
35. P. Van Dooren. Two point boundary value and periodic eigenvalue problems. In O. Gonzalez, editor, Proc. 1999 IEEE Intl. Symp. CACSD, Kohala Coast–Island of Hawai'i, Hawai'i, USA, August 22–27, 1999 (CD-Rom), pages 58–63, 1999.
36. A. Varga. Periodic Lyapunov equations: some applications and new algorithms. Internat. J. Control, 67(1):69–87, 1997.
37. A. Varga and S. Pieters. Gradient-based approach to solve optimal periodic output feedback control problems. Automatica, 34(4):477–481, 1998.

