Parallel Computing 28 (2002) 1023–1037
www.elsevier.com/locate/parco

Architecture for wavelet packet transform based on lifting steps

Francisco Argüello a,*, Juan López b, María A. Trenas b, Emilio L. Zapata b

a Dept. Electrónica y Computación, Universidad de Santiago, 15782 Santiago, Spain
b Dept. Arquitectura de Computadores, Universidad de Málaga, P.O. Box 4114, 29080 Málaga, Spain

Received 1 December 2000; received in revised form 1 September 2001; accepted 1 December 2001

Abstract

We present a parallel architecture capable of computing a wide range of wavelet packets. The architecture is based on the implementation of the wavelet transform by means of its factoring into lifting steps. In order to implement the lifting scheme on the architecture, we first carry out a regularization of the lifting steps, so that they all have a similar structure and complexity. This allows us to efficiently compute a wide range of lifting steps on the same hardware. The architecture consists of a group of multiplier–accumulators that operate in parallel on the data stored in a memory. With simple programming, we can choose the type of filter, the number of stages of the transform, and the form of the wavelet packet tree. In summary, the architecture is designed to efficiently take advantage of the structure of the lifting steps. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Wavelet transform; Wavelet packet transform; Lifting steps; Parallel architecture

1. Introduction

The standard wavelet transform (WT) provides a time–frequency (time-scale) representation of signals. However, this transform does not allow a flexible choice of the regions into which the time–frequency plane is divided. This flexibility is given by the

wavelet packet transform (WPT) [1–4], a generalization of the WT. Many applications, from signal analysis to image or video processing and communications, may take advantage of the flexibility of the WPT.

In practice, the WT and WPT are often implemented using a tree-structured filter bank. Fig. 1(a) shows the tree-structured filter bank of a WT and the time–frequency tiling that it provides. As shown in this figure, in the WT only the output of the low-pass filters progresses to the next stage, while the output of the high-pass filters is already in the form of results. This produces a time–frequency tiling in which the time resolution is lower at low frequencies and higher at high frequencies. On the other hand, the WPT is a set of transformations that admits any type of tree-structured filter bank. As an example, Fig. 1(b) shows one of these trees and the time–frequency tiling that it produces.

The lifting scheme [5,6] is a new approach for constructing biorthogonal wavelets entirely in the spatial domain, as well as other completely new wavelets. Wavelets can be broken down into lifting steps, which, as in the case of quadrature mirror filter (QMF) lattice structures, results in a significant reduction in the number of arithmetic operations needed to compute the transform. More specifically, the advantages of the lifting scheme are as follows [7]:

1. It allows a faster implementation of the WT because it makes optimal use of similarities between the high-pass and low-pass filters to speed up the calculation. The number of flops can be reduced by a factor of two.
2. The lifting scheme allows a fully in situ calculation of the WT, i.e., no auxiliary memory is required. This way, the original image can be replaced with its WT.
3. With the lifting scheme, the inverse WT can be found immediately by reversing the operations of the forward transform. In practice, this comes down to simply changing each + into a −, and vice versa.
4. It can be used in situations where no Fourier transform is available. Typical examples are WTs that map integers to integers; such transforms are important for hardware implementation and for lossless image coding.

The lifting scheme has been explicitly included as a WT computation scheme in the JPEG2000 standard [8].

Fig. 1. Tree-structured filter banks and time–frequency tilings: (a) standard WT, (b) an example of WPT. Observe the Gray permutation that organizes the outputs in frequency order.


Recently, a number of architectures based on the lifting scheme have appeared in the literature. In [9], a sequential architecture is proposed to compute the two-dimensional WT for a wide range of filters. This architecture is made of two processor modules and two memory units to compute and store, respectively, the rows and columns of the transformation. A parallel architecture based on a technique named boundary postprocessing is presented in [10]. The input data sequence is segmented into non-overlapping blocks, which are distributed among processors. Then, each processor transforms its allocated data independently. Next, boundary data is communicated to neighboring blocks. Lastly, the transform is completed for the boundary data samples. An FPGA implementation of this architecture is presented in [11].

In this paper we present a parallel architecture for the WPT based on the lifting scheme. It is configurable to compute a wide range of WPTs, with variation in the type of filters, in the number of stages, and in the form of the tree-structured filter bank. The architecture is folded and requires the input sequence to be stored in a memory beforehand, i.e., the architecture is not designed to provide minimum latency between the input and the output.

The organization of the remainder of the work is as follows: Section 2 gives the basics behind the lifting scheme. In Section 3, we regularize the lifting steps and develop the folded architecture for the standard WT, while Section 4 describes the computation of WPTs over the architecture. In Section 5 we review related work and evaluate the proposed architecture. Finally, conclusions are presented in Section 6.

2. The lifting scheme

In the tree-based scheme of the WPT, each node of the tree consists of a two-band subband filtering with finite filters. Each node can be broken down into a finite sequence of simple filtering steps, which are called lifting steps or ladder structures. The decomposition is essentially a factorization of the polyphase matrix of the wavelet or subband filters into elementary matrices.

As described in [5,6], the lifting scheme consists of three simple phases: the first step, or Lazy wavelet, splits the data into two subsets, even and odd; the second step calculates the wavelet coefficients (high-pass) as the failure to predict the odd set based on the even set; finally, the third step updates the even set using the wavelet coefficients to compute the scaling function coefficients (low-pass). The predict phase ensures polynomial cancellation in the high-pass branch (vanishing moments of the dual wavelet), and the update phase ensures preservation of moments in the low-pass branch (vanishing moments of the primal wavelet). By varying the order, an entire family of transforms can be built.

As a result [6], the synthesis polyphase matrix $P(z)$ of the wavelet and its inverse, the analysis polyphase matrix $P^{-1}(z)$, can be written as

$$P(z) = \left( \prod_{i=1}^{m} \begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix} \right) \begin{bmatrix} f & 0 \\ 0 & 1/f \end{bmatrix}, \qquad (1)$$


$$P(z)^{-1} = \begin{bmatrix} 1/f & 0 \\ 0 & f \end{bmatrix} \prod_{i=m}^{1} \begin{bmatrix} 1 & 0 \\ -t_i(z) & 1 \end{bmatrix} \begin{bmatrix} 1 & -s_i(z) \\ 0 & 1 \end{bmatrix}, \qquad (2)$$

where $m = (n/2) + 1$, $n$ is the size of the filters, $s_i(z)$ and $t_i(z)$ are Laurent polynomials, and $f$ is a non-zero constant. For example, in the case of the Daubechies D4 wavelet, two possible factorizations are

$$P_{D4}^{a}(z) = \begin{bmatrix} 1 & -\sqrt{3} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \frac{\sqrt{3}}{4} + \frac{\sqrt{3}-2}{4} z^{-1} & 1 \end{bmatrix} \begin{bmatrix} 1 & z \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \frac{\sqrt{3}+1}{\sqrt{2}} & 0 \\ 0 & \frac{\sqrt{3}-1}{\sqrt{2}} \end{bmatrix}, \qquad (3)$$

$$P_{D4}^{b}(z) = \begin{bmatrix} 1 & -\frac{1}{\sqrt{3}} z^{-1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \frac{6-3\sqrt{3}}{4} + \frac{\sqrt{3}}{4} z & 1 \end{bmatrix} \begin{bmatrix} 1 & -\frac{1}{3} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \frac{3+\sqrt{3}}{3\sqrt{2}} & 0 \\ 0 & \frac{3-\sqrt{3}}{\sqrt{2}} \end{bmatrix}. \qquad (4)$$

In the case of the Daubechies D6 wavelet, one possible factorization is

$$P_{D6}(z) = \begin{bmatrix} 1 & 0 \\ a & 1 \end{bmatrix} \begin{bmatrix} 1 & b z^{-1} + b' \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ c + c' z & 1 \end{bmatrix} \begin{bmatrix} 1 & d \\ 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 \\ 0 & 1/f \end{bmatrix}, \qquad (5)$$

where [6] $a = 0.4122865950$, $b = 1.5651362796$, $b' = 0.3523876576$, $c = 0.0284590896$, $c' = 0.4921518449$, $d = 0.3896203900$, and $f = 1.9182029462$.

The lifting scheme has been explicitly included as a computation scheme in the JPEG2000 standard [8]. This standard supports two different transformations: one irreversible, which may be used only for lossy coding, and one reversible, which may be used for lossless or lossy coding. The 9–7 filter is included in Part I of the JPEG2000 standard for lossy coding and the 5–3 filter for lossless coding. The default irreversible transform is implemented by means of the 9–7 filter, which may be factorized as follows:

$$P_{9\text{–}7}(z) = \begin{bmatrix} 1 & a(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b(1+z) & 1 \end{bmatrix} \begin{bmatrix} 1 & c(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ d(1+z) & 1 \end{bmatrix} \begin{bmatrix} f & 0 \\ 0 & 1/f \end{bmatrix}, \qquad (6)$$

where the coefficients are given in [8] and [6].

Table 1
Lifting steps of analysis (forward transform) and synthesis (inverse transform) in the Daubechies D6 wavelet. The coefficients are given in [6]. Variables $x_{2n}$, $x_{2n+1}$ represent a couple of even/odd indexed data in the input sequence.

Analysis:
  $a_n^{(0)} = x_{2n}$,  $b_n^{(0)} = x_{2n+1}$
  $b_n^{(1)} = -a\,a_n^{(0)} + b_n^{(0)}$
  $a_n^{(1)} = a_n^{(0)} - b'\,b_n^{(1)} - b\,b_{n-1}^{(1)}$
  $b_n^{(2)} = -c\,a_n^{(1)} - c'\,a_{n+1}^{(1)} + b_n^{(1)}$
  $a_n^{(2)} = a_n^{(1)} - d\,b_n^{(2)}$
  $a_n^{(3)} = \frac{1}{f} a_n^{(2)}$,  $b_n^{(3)} = f\,b_n^{(2)}$

Synthesis:
  $a_n^{(2)} = f\,a_n^{(3)}$,  $b_n^{(2)} = \frac{1}{f} b_n^{(3)}$
  $a_n^{(1)} = a_n^{(2)} + d\,b_n^{(2)}$
  $b_n^{(1)} = c\,a_n^{(1)} + c'\,a_{n+1}^{(1)} + b_n^{(2)}$
  $a_n^{(0)} = a_n^{(1)} + b'\,b_n^{(1)} + b\,b_{n-1}^{(1)}$
  $b_n^{(0)} = a\,a_n^{(0)} + b_n^{(1)}$
  $x_{2n} = a_n^{(0)}$,  $x_{2n+1} = b_n^{(0)}$


On the other hand, the default reversible transform in the JPEG2000 standard is implemented by means of the 5–3 filter. A possible decomposition is

$$P_{5\text{–}3}(z) = \begin{bmatrix} 1 & -0.5(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0.25(1+z) & 1 \end{bmatrix}. \qquad (7)$$

Other factorizations are also possible, as factoring into lifting steps is a highly non-unique process. The equations to implement the WT are easily deduced from its factorizations. For example, the equations obtained from factorization (5) for the Daubechies D6 wavelet are shown in Table 1. Observe that these equations can be calculated in situ, that is, the original data set can be replaced by its WT.
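As an illustration of the in situ calculation and of the sign-flip inversion mentioned in Section 1, the following Python sketch implements the 5–3 decomposition of Eq. (7). It is our illustration, not part of the original paper; the replicated border extension is an assumption (JPEG2000 prescribes a specific symmetric extension that we do not reproduce here).

```python
import numpy as np

def fwd_53(x):
    """One stage of the 5-3 lifting transform of Eq. (7), computed in situ.
    On return, x[0::2] holds the low-pass (s) and x[1::2] the high-pass (d)."""
    x = np.asarray(x, dtype=float).copy()
    e, o = x[0::2], x[1::2]                       # Lazy wavelet: even/odd split (views)
    # Predict: d_n = o_n - 0.5*(e_n + e_{n+1}); replicated border (our assumption)
    o -= 0.5 * (e + np.append(e[1:], e[-1]))
    # Update:  s_n = e_n + 0.25*(d_{n-1} + d_n)
    e += 0.25 * (np.insert(o[:-1], 0, o[0]) + o)
    return x

def inv_53(x):
    """Inverse transform: same steps, reversed order, each + turned into a -."""
    x = np.asarray(x, dtype=float).copy()
    e, o = x[0::2], x[1::2]
    e -= 0.25 * (np.insert(o[:-1], 0, o[0]) + o)  # undo update
    o += 0.5 * (e + np.append(e[1:], e[-1]))      # undo predict
    return x

sig = np.random.rand(16)
assert np.allclose(inv_53(fwd_53(sig)), sig)      # perfect reconstruction
```

Because the even and odd halves are views of the same array, the predict and update steps overwrite the original data, which is precisely the in situ property used later by the architecture.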

3. Architecture

In this section we present the design of a parallel architecture for the standard WT. Fig. 2 shows the basic concepts that are used in the design of our architecture.

Fig. 2. (a) Two-stage WT with three-step lifting filters, (b) three-stage WT with two-step lifting filters, (c) folded architecture, (d) folded architecture with parallelism.


Fig. 2(a) depicts the scheme of a two-stage WT with filters consisting of three lifting steps, while Fig. 2(b) depicts a three-stage WT with filters consisting of two lifting steps. Boxes labelled s and t carry out the arithmetic operations required by the lifting steps (corresponding to the Laurent polynomials s and t), while the memory carries out the data exchanges and the subsampling that are needed between stages.

As shown in Fig. 2(a) and (b), the WT computed through lifting steps presents a general structure with a high degree of regularity. However, this regularity is not maintained if we consider each one of the steps in detail. For example, in the standard WT, the amount of data is halved between stages. The output of the high-pass filters does not progress and is stored in the memory, while the output of the low-pass filters, after being regrouped, is used in the calculation of the next stage. Another irregularity is the great degree of variability present in the Laurent polynomials s and t, not only among different WTs, but also among the different lifting steps of the same WT, as observed in decompositions (1)–(7). This irregularity must be borne in mind in the design of an architecture that computes generic WTs.

Fig. 2(c) shows the proposed scheme of the folded architecture. This architecture is capable of computing the WTs of Fig. 2(a) and (b) in several cycles. Similarly, Fig. 2(d) shows a generic scheme that considers parallelism in the calculation of the lifting steps.

Next, we design the circuits that compute the lifting steps. Most of the lifting steps (see [6]) present Laurent polynomials with expressions $s(z) = a_0 + a_1 z^{-1}$ and $t(z) = b_0 + b_1 z$. Considering these polynomials as patterns for the lifting steps, we obtain the calculation schemes that are shown in Fig. 3.

Fig. 3. Computation schemes for the lifting steps.


Table 2
Basic operations that are carried out by the proposed architecture to compute the lifting steps

  Type s                          Type t
  $a_n + a_0 b_n$                 $b_n + b_0 a_n$
  $a_n + a_1 b_{n-1}$             $b_n + b_1 a_{n+1}$
  $\mathrm{ACC} + a_1 b_{n-1}$    $\mathrm{ACC} + b_1 a_{n+1}$
  $a_n + a(b_n + b_{n-1})$        $b_n + b(a_n + a_{n+1})$
  $f a_n$                         $\frac{1}{f} b_n$

Fig. 4. Circuits that implement the basic operations of Table 2.

However, not all the lifting steps follow the previous scheme exactly. We apply a regularization strategy that allows us to efficiently compute a wide range of lifting steps over the same hardware. The basic operations that we implement are shown in Table 2. All these operations can easily be implemented on a MAC (combined multiplication and accumulation) unit. The resulting circuits are shown in Fig. 4. The operations that we have included in Table 2 have the following justifications:

1. The first operations in this table, $a_n + a_0 b_n$ and $b_n + b_0 a_n$, have been included so that the circuits of Fig. 4 can implement lifting steps from the factorization matrices
$$\begin{bmatrix} 1 & a_0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b_0 & 1 \end{bmatrix}. \qquad (8)$$

2. In a similar manner, the following operations, $a_n + a_1 b_{n-1}$ and $b_n + b_1 a_{n+1}$, allow us to carry out lifting steps from the factorization matrices
$$\begin{bmatrix} 1 & a_1 z^{-1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b_1 z & 1 \end{bmatrix}. \qquad (9)$$

3. The third operations, $\mathrm{ACC} + a_1 b_{n-1}$ and $\mathrm{ACC} + b_1 a_{n+1}$, have been included to be carried out jointly with the first ones, allowing the circuit to calculate in two cycles (as two multiplications are required) the lifting steps from the factorization matrices
$$\begin{bmatrix} 1 & a_0 + a_1 z^{-1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b_0 + b_1 z & 1 \end{bmatrix}. \qquad (10)$$


4. The fourth operations in Table 2, $a_n + a(b_n + b_{n-1})$ and $b_n + b(a_n + a_{n+1})$, allow us to carry out lifting steps from the factorization matrices (present, for example, in the 9–7 filter)
$$\begin{bmatrix} 1 & a(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b(1+z) & 1 \end{bmatrix}. \qquad (11)$$

5. Finally, the last operation enables us to carry out the multiplications by the scale coefficients $f$ and $1/f$, which are necessary in each one of the stages of the wavelet. The corresponding matrix is
$$\begin{bmatrix} f & 0 \\ 0 & 1/f \end{bmatrix}. \qquad (12)$$

Let us describe in detail the operation of the circuits in Fig. 4; a small software model follows the description. The computation of a lifting step consists of a loop of $N/2$ iterations ($N$ is the size of the data sequence). At each iteration, circuits s and t carry out some of the operations (8)–(12). Specifically, consider the case of expression (10). Let $a_n$, $b_n$ be the input data and $A_n$, $B_n$ the output data of circuits s and t in Fig. 4. ACC denotes the value stored at the multiplier/accumulator device. Initially, circuit s stores the value 0 at the cell T ($b_{-1} = 0$). Then, each iteration involves the following operations:

• Reading of $a_n$ and $b_n$ from memory. These data are input to circuit s.
• Computation requires two cycles. In the first cycle, circuit s carries out the operation $\mathrm{ACC} = a_n + a_0 b_n$, and in the second cycle the operation $A_n = \mathrm{ACC} + a_1 b_{n-1}$. On the other hand, $B_n = b_n$.
• Finally, the output data $A_n$ and $B_n$ of circuit s are input to circuit t, while $b_n$ remains stored at the cell T to be used in a later iteration if necessary.

The operation of circuit t begins by storing data $a_0$ and $b_0$ at cell T. Let us suppose $a_{N/2} = 0$ and $b_{N/2} = 0$. Then, each iteration involves the following operations:

• Cells T store data $a_n$ and $b_n$ from a previous iteration, and data $a_{n+1}$ and $b_{n+1}$ are input to circuit t from circuit s.
• Computation requires two cycles. In the first cycle circuit t carries out the operation $\mathrm{ACC} = b_n + b_0 a_n$, and in the second cycle the operation $B_n = \mathrm{ACC} + b_1 a_{n+1}$. On the other hand, $A_n = a_n$.
• The output data $A_n$ and $B_n$ are stored in memory at the same positions occupied by $a_n$ and $b_n$. Data $a_{n+1}$ and $b_{n+1}$ remain stored at the cells T to be used at the next iteration.
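The following software model of circuits s and t is our illustration of the loop above for the two-multiplication case of expression (10); the T and ACC names follow Fig. 4, and the pipelining between the two circuits is flattened into two sequential loops for clarity.

```python
def lifting_step_st(a, b, a0, a1, b0, b1):
    """Model of circuits s and t for one lifting step of type (10):
    A_n = a_n + a0*b_n + a1*b_{n-1};  B_n = b_n + b0*A_n + b1*A_{n+1}.
    a, b are the even/odd halves of the data sequence (length N/2 each)."""
    n = len(a)
    # Circuit s: cell T initially holds 0 (b_{-1} = 0)
    T_s, A = 0.0, []
    for i in range(n):
        acc = a[i] + a0 * b[i]           # first cycle:  ACC = a_n + a0*b_n
        A.append(acc + a1 * T_s)         # second cycle: A_n = ACC + a1*b_{n-1}
        T_s = b[i]                       # b_n stays in cell T for the next iteration
    # Circuit t: consumes the outputs of circuit s; border assumes A_{N/2} = 0
    B = []
    for i in range(n):
        acc = b[i] + b0 * A[i]           # first cycle:  ACC = b_n + b0*A_n
        nxt = A[i + 1] if i + 1 < n else 0.0
        B.append(acc + b1 * nxt)         # second cycle: B_n = ACC + b1*A_{n+1}
    return A, B
```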


As pointed out before, the arithmetic complexity of each lifting step is not uniform. For example, in the factorization (5) of the Daubechies D6 wavelet, one of the steps presents a Laurent polynomial $a$ (it requires only one multiply operation) and another presents a Laurent polynomial $bz^{-1} + b'$ (it requires two multiply operations). In order to obtain a uniform arithmetic complexity for all the lifting steps, and thus an architecture with the same cycle time for all the stages of the transform and a regular data-flow, we introduce a regularization strategy based on a further decomposition of some lifting steps. To this end, we use the following two matrix identities. First, a lifting step may be decomposed as follows:

$$\begin{bmatrix} 1 & 0 \\ a+b & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ a & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b & 1 \end{bmatrix}. \qquad (13)$$

Second, we use the following identity:

$$\begin{bmatrix} 1 & 0 \\ a & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & a \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}. \qquad (14)$$
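Both identities are easy to check numerically; the following lines verify them for arbitrary values of a and b (an illustrative check, not part of the original paper).

```python
import numpy as np

L = lambda x: np.array([[1.0, 0.0], [x, 1.0]])   # lower lifting step
U = lambda x: np.array([[1.0, x], [0.0, 1.0]])   # upper lifting step
P = np.array([[0.0, 1.0], [1.0, 0.0]])           # permutation

a, b = 0.7, -1.3                                 # arbitrary test values
assert np.allclose(L(a + b), L(a) @ L(b))        # identity (13)
assert np.allclose(L(a), P @ U(a) @ P)           # identity (14)
```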

In the last equation, we have inserted the permutation matrix
$$P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},$$
whose implementation over the architecture is very simple, through reading and writing in the memory. A matrix $P$ in initial position implies a change in the reading order of data in the memory, while a matrix $P$ in final position implies a change in the writing order. For example, the re-factoring of the Daubechies D6 wavelet requires four cycles, in which the products given by the following matrices must be carried out:

$$P S_1 T_1 P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & a \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b' & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad (15)$$

$$S_2 T_2 = \begin{bmatrix} 1 & b z^{-1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ c' z & 1 \end{bmatrix}, \qquad (16)$$

$$P S_3 T_3 P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & c \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ d & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad (17)$$

$$S_4 T_4 = \begin{bmatrix} f & 0 \\ 0 & 1/f \end{bmatrix}. \qquad (18)$$
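As a sanity check (ours, not in the paper), one can verify numerically that the product of the four cycle matrices (15)–(18) reproduces the D6 factorization (5) when both are evaluated at a sample point z; the identity holds for any coefficient values.

```python
import numpy as np

a, b, b2, c, c2, d, f = (0.4122865950, 1.5651362796, 0.3523876576,
                         0.0284590896, 0.4921518449, 0.3896203900,
                         1.9182029462)           # D6 coefficients as printed above
L = lambda x: np.array([[1, 0], [x, 1]], dtype=complex)
U = lambda x: np.array([[1, x], [0, 1]], dtype=complex)
P = np.array([[0, 1], [1, 0]], dtype=complex)
D = np.diag([f, 1 / f]).astype(complex)

z = np.exp(0.73j)                                # arbitrary point on |z| = 1
direct = L(a) @ U(b / z + b2) @ L(c + c2 * z) @ U(d) @ D          # Eq. (5)
cycles = (P @ U(a) @ L(b2) @ P) @ (U(b / z) @ L(c2 * z)) \
         @ (P @ U(c) @ L(d) @ P) @ D                              # Eqs. (15)-(18)
assert np.allclose(direct, cycles)
```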

In the case of the Daubechies D4 wavelet, by applying the same technique to Eq. (3) (it could also be applied to Eq. (4)), the following re-factoring in three cycles is obtained:

$$S_1 T_1 = \begin{bmatrix} 1 & -\sqrt{3} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \frac{\sqrt{3}}{4} & 1 \end{bmatrix}, \qquad (19)$$

$$P S_2 T_2 P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & \frac{\sqrt{3}-2}{4} z^{-1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ z & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad (20)$$

$$S_3 T_3 = \begin{bmatrix} \frac{\sqrt{3}+1}{\sqrt{2}} & 0 \\ 0 & \frac{\sqrt{3}-1}{\sqrt{2}} \end{bmatrix}. \qquad (21)$$


In a similar manner, we can compute the 9–7 filter in three cycles (for this wavelet, one multiplication and two additions are carried out in each cycle). Table 3 shows the number of operations (multiplications and additions) that are required by the wavelets D4 and D6, the 9–7 filter, and the 5–3 filter when they are computed by the standard scheme, with lifting steps, and over the architecture that we propose. The data in the first two columns have been taken from [6]. The number of operations required by the proposed architecture has been calculated in the following manner: in the Daubechies D4 and D6 wavelets, the architecture carries out one multiplication and one addition per circuit and cycle, which gives (as we have the circuits s and t) two multiplications and two additions per cycle. Since D4 requires three cycles and D6 requires four, we obtain, in total, 12 and 16 operations, respectively. For the case of the 9–7 filter, the architecture must carry out one multiplication and two additions per circuit and cycle, and there are three cycles; thus a total of 18 operations are required.

In summary, the proposed architecture has uniform arithmetic complexity for all the lifting steps, and so the computation cycle is similar for all the stages of the transform.

Table 3
Number of arithmetic operations (multiplications and additions) required by the wavelets D4 and D6, the 9–7 filter, and the 5–3 filter when they are computed by the standard scheme, with lifting steps, and over the architecture that we propose

               Standard    Lifting    Proposed architecture
  D4           14          9          12
  D6           22          14         16
  9–7 filter   23          14         18
  5–3 filter   –           6          6

Fig. 5. Introduction of parallelism in the architecture.


However, the number of arithmetic operations is slightly higher than in the case of lifting with steps of different computation cycles.

The introduction of parallelism into the circuits of Fig. 4 is relatively simple. The general structure of the parallel circuits s and t is shown in Fig. 5. In type s circuits, the variable $z^{-1}$ in the Laurent polynomial implies that the processors require the input $b_{n-1}$. This introduces an interconnection line from each processor to the following one. Furthermore, due to the border effects, it originates an interconnection line between the last processor and the first one. In a similar manner, in type t circuits, the variable $z$ in the Laurent polynomial implies that the processors require the input $a_{n+1}$. In this case, it introduces an interconnection line from each processor to the previous one. Fig. 6 shows the structure of these parallel circuits.
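A data-parallel model of these circuits is easy to express with vector operations; in the sketch below (ours), np.roll plays the role of the neighbor interconnection lines, and its wrap-around models the line between the last processor and the first one (a circular border treatment, which is our assumption).

```python
import numpy as np

def step_s(a, b, a0, a1):
    """All type-s processors in parallel: A_n = a_n + a0*b_n + a1*b_{n-1}."""
    return a + a0 * b + a1 * np.roll(b, 1), b    # roll(+1): b_{n-1} from the preceding processor

def step_t(a, b, b0, b1):
    """All type-t processors in parallel: B_n = b_n + b0*a_n + b1*a_{n+1}."""
    return a, b + b0 * a + b1 * np.roll(a, -1)   # roll(-1): a_{n+1} from the following processor
```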

4. Computation of wavelet packets

In this section we show how to implement a WPT with a tree of any structure over the architecture. Fig. 7 shows a tree-structured filter bank, where at each octave level $j$ an input sequence $w_{j-1,0}(n)$ is fed into the low-pass and high-pass filters $H_0$ and $H_1$, respectively. The output from the high-pass filter $H_1$ represents the detailed information in the original signal at a given level $j$, which is denoted by $w_{j,1}(n)$, and the output from the low-pass filter $H_0$ represents the remaining (coarse) information in the original signal, which is denoted by $w_{j,0}(n)$.

Fig. 6. Structure of the parallel circuits.


Fig. 7. A complete three-level tree-structured filter bank. Solid lines show a wavelet packet tree for WPT.

Fig. 7 shows a complete three-level tree, where solid lines show a wavelet packet tree for the WPT. The Gray permutation in the right part of the figure reorders the output into frequency order.

Table 4 shows how to compute an eight-data-item WPT over a two-processor system. As can be seen, the calculations have been distributed so that the system is occupied 100% of the time. Given the characteristics of the system, this distribution is almost trivial. Obviously, it is not possible to reach total use of all the system's resources in every case.

Step by step, Table 5 shows the calculation of stages 2 and 3 of this WPT over processor 0. We have specified the data-flow for a Daubechies D4 wavelet, Eqs. (19)–(21). In this table, $S_i$ ($T_i$) indicates the calculation of an operation of type s (t); the two operands in parentheses are input data to the circuit; and P indicates an exchange of the two data items at the input of the corresponding circuit. As can be observed, the generated sequence of operations is the appropriate one for implementation over a pipeline architecture (a software sketch of the tree-driven computation closes this section).

Table 4
Data-flow in the WPT of Fig. 7

          Data                       Stage 1                    Stage 2                    Stage 3
  PE 0    w_{0,0}(0) w_{0,0}(1)      w_{1,0}(0) w_{1,1}(0)      w_{2,0}(0) w_{2,1}(0)      w_{3,0}(0) w_{3,1}(0)
          w_{0,0}(4) w_{0,0}(5)      w_{1,0}(2) w_{1,1}(2)      w_{2,0}(1) w_{2,1}(1)
  PE 1    w_{0,0}(2) w_{0,0}(3)      w_{1,0}(1) w_{1,1}(1)      w_{2,2}(0) w_{2,3}(0)      w_{3,4}(0) w_{3,5}(0)
          w_{0,0}(6) w_{0,0}(7)      w_{1,0}(3) w_{1,1}(3)      w_{2,2}(1) w_{2,3}(1)


Table 5
Detail of the data-flow in stages 2 and 3 of processor 0, considering a Daubechies D4 wavelet, Eqs. (19)–(21)

Stage 2:
  (w_{2,0}(0), w_{2,1}(0)) → S_1 → T_1 → P S_2 → T_2 → P S_3 → T_3 → (w_{2,0}(0), w_{2,1}(0))
  (w_{2,0}(1), w_{2,1}(1)) → S_1 → T_1 → P S_2 → T_2 → P S_3 → T_3 → (w_{2,0}(1), w_{2,1}(1))

Stage 3:
  (w_{3,0}(0), w_{3,1}(0)) → S_1 → T_1 → P S_2 → T_2 → P S_3 → T_3 → (w_{3,0}(0), w_{3,1}(0))

Each permutation P exchanges the two data items of the couple before they enter the corresponding circuit, and the second couple of stage 2 follows the first one through the same operations one slot later.
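As a software summary of this section (our illustration; the tree encoding and the use of the 5–3 node of Eq. (7) as the two-band split are assumptions, not the paper's hardware schedule), the following sketch computes a WPT over an arbitrary binary tree.

```python
import numpy as np

def split_53(x):
    """One two-band lifting node (Eq. (7)): returns (low, high) bands.
    Replicated border handling is our assumption."""
    e, o = x[0::2].copy(), x[1::2].copy()
    o -= 0.5 * (e + np.append(e[1:], e[-1]))          # predict (high-pass)
    e += 0.25 * (np.insert(o[:-1], 0, o[0]) + o)      # update (low-pass)
    return e, o

def wpt(x, tree):
    """Wavelet packet transform over an arbitrary binary tree.
    `tree` is None for a leaf; otherwise a pair (low_subtree, high_subtree).
    Returns the list of leaf bands in natural (not Gray-permuted) order."""
    if tree is None:
        return [x]
    low, high = split_53(x)
    return wpt(low, tree[0]) + wpt(high, tree[1])

# The packet tree of Table 4: both level-1 bands are split again, and the lowest
# band of each is split a third time (bands w30, w31, w21, w34, w35, w23).
tree = (((None, None), None), ((None, None), None))
bands = wpt(np.random.rand(8), tree)
assert [len(b) for b in bands] == [1, 1, 2, 1, 1, 2]
```

Changing the `tree` argument is the software analogue of the simple programming that selects the form of the wavelet packet tree on the architecture.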

5. Related work and evaluation

In the recent literature many architectures for implementing the standard WT have been described; these differ mainly in the way that intermediate results are stored and routed. Implementations include systolic routing networks, distributed memory, RAM-based designs, and architectures that use the minimum number of registers. In most cases, they are folded architectures designed to compute the WT by interleaving computations for different octaves [12]. However, there are very few architectures for computing the WPT described in the literature. Along these lines, in [13] a design for a programmable processor using a two-buffer structure and a multiplier–accumulator is shown, while in [14] we find a pipeline structure for WPT-based speech enhancement. In [15] the types of architectures for wavelets that had been proposed up until 1996 are summarized.

While each of these architectures may offer advantages on an architectural level, none of them utilizes improvements that are available on the algorithmic level. Optimization of this type involves the use of QMF lattice structures [16] and lifting steps [9–11]. In [9], a sequential architecture is proposed which is made of two processor modules to compute, respectively, the rows and columns of a two-dimensional transformation. In addition, it uses two memory modules. The first module generates a data-flow to compute the lifting steps on the rows, while the second one acts similarly on the columns. The parallel architecture presented in [10] is quite different. The basic idea is the distribution of data among processors in a preliminary step, the computation of partial WTs on these data, and then the merging of the partial transformations using additional information from neighboring blocks to obtain the complete WT.

Our parallel architecture uses a strategy similar to the row and column processor [9] in the sense of generating a data-flow among processors. But we first carry out a regularization step in order to obtain lifting steps with the same computational complexity along the stages of the transform. In particular, this computation scheme is highly suitable for the case of the WPT.


Our architecture may efficiently compute the WT of an image as defined in the JPEG2000 standard. As an example, consider an image of size 512 × 512 pixels and 256 gray levels. The computation of the first level in the WT of this image requires a sequence of 1024 one-dimensional transforms. For lossy coding, a 9–7 filter should be used, which requires 12 additions and 6 multiplications per pair of points. For lossless coding, a 5–3 filter should be used, which requires four additions and two shifts per pair of points. We have evaluated the processing section considering 4 units of type s and 4 units of type t. It includes multipliers of size 16 × 16, other arithmetic elements, and a 16-bit data bus, but not memory. Using Synopsys we have obtained an approximate area of 3.966 mm² and a frequency of 89.445 MHz for a 0.35 μm standard-cell technology. The cells occupy 1.218 mm² and the rest corresponds to interconnection elements.
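A back-of-envelope check of these figures (our arithmetic, using only the numbers quoted in this paragraph) gives the operation count for the first decomposition level of the 512 × 512 image:

```python
rows_plus_cols = 512 + 512      # 1024 one-dimensional transforms
pairs_per_line = 512 // 2       # each transform processes 256 pairs of points
ops_97 = 12 + 6                 # additions + multiplications per pair (9-7 filter)
total_ops = rows_plus_cols * pairs_per_line * ops_97
print(total_ops)                # 4718592 operations for the first level
```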

6. Conclusions

The architecture that we have designed computes WPTs using the lifting scheme. This is an alternative to the standard calculation scheme and to the QMF lattice structure. It allows both an efficient calculation of the classic wavelets and the construction of completely new wavelets. In this sense, the calculation of a wavelet on the architecture requires an intermediate number of operations between the standard scheme and the lifting scheme. On the other hand, the architecture is designed to take advantage of the great flexibility provided by lifting steps, enabling us to choose the type of filter, the number of stages of the transform, and the form of the wavelet packet tree.

In order to implement the lifting scheme on the architecture, we first carry out a regularization of the lifting steps, so that they all have a similar structure and complexity. This allows us to efficiently compute them on the same hardware. The resulting folded architecture consists of a group of multiplier–accumulators that operate in parallel on the data stored in a memory. In summary, the characteristics of the architecture are:

1. It has the structure of a digital signal processor, with a memory to store the data and MAC-type (multiplication and accumulation) processing elements. It allows an easy exploitation of parallelism.
2. With simple programming, we can choose the type of filter, the number of stages of the transform, and the form of the wavelet packet tree.
3. The number of flops is smaller than in the standard scheme and all operations are in situ.

Acknowledgements

This work was supported in part by the Xunta de Galicia under contract PGIDT99PXI20602B and by MCYT under contract TIC2001-3694-C02.


References

[1] R.R. Coifman, M.V. Wickerhauser, Entropy-based algorithms for best basis selection, IEEE Trans. Inf. Theory 38 (1992) 713–738.
[2] R.R. Coifman, M.V. Wickerhauser, Adapted wavelet de-noising for medical signals and images, IEEE Eng. Med. Biol. 14 (5) (1995) 578–586.
[3] Z. Xiong, K. Ramchandran, M.T. Orchard, Wavelet packet image coding using space-frequency quantization, IEEE Trans. Image Process. 7 (1998) 160–174.
[4] H. Khalil, A. Jacquin, C. Podilchuk, Constrained wavelet packets for tree-structured video coding algorithms, in: Proceedings of the Data Compression Conference, IEEE, 1999, pp. 354–363.
[5] W. Sweldens, The lifting scheme: A construction of second generation wavelets, SIAM J. Math. Anal. 29 (2) (1997) 511–546.
[6] I. Daubechies, W. Sweldens, Factoring wavelet transforms into lifting steps, J. Fourier Anal. Appl. 4 (3) (1998) 247–269.
[7] W. Sweldens, Wavelets and the lifting scheme: A 5 minute tour, Zeitschrift für Angewandte Mathematik und Mechanik 76 (Suppl. 2) (1996) 41–44.
[8] JPEG2000, ISO/IEC FCD15444, 2000.
[9] K. Andra, C. Chakrabarti, T. Acharya, A VLSI architecture for lifting-based wavelet transform, in: Proceedings of the IEEE Workshop on Signal Processing Systems, 2000, pp. 70–79.
[10] W. Jiang, A. Ortega, Lifting factorization-based wavelet transform architecture design, IEEE Trans. Circ. Syst. Video Technol. 11 (5) (2001) 651–657.
[11] N. Aranki, W. Jiang, A. Ortega, FPGA-based parallel implementation for the lifting discrete wavelet transform, in: Proceedings of Parallel and Distributed Methods for Image Processing, SPIE, vol. 4118, 2000, pp. 96–107.
[12] K.K. Parhi, T. Nishitani, VLSI architectures for discrete wavelet transforms, IEEE Trans. VLSI Syst. 1 (2) (1993) 191–202.
[13] X. Wu, Y. Li, H. Chen, Programmable wavelet transform processor, Electron. Lett. 35 (6) (1999) 449–450.
[14] M.A. Trenas, J. López, F. Argüello, E.L. Zapata, An architecture for wavelet-packet based speech enhancement for hearing aids, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, CDROM, 2000.
[15] C. Chakrabarti, M. Vishwanath, R.M. Owens, Architectures for wavelet transforms: A survey, J. VLSI Signal Process. 14 (1996) 171–192.
[16] T.C. Denk, K.K. Parhi, VLSI architectures for lattice structure based orthonormal discrete wavelet transforms, IEEE Trans. Circ. Syst. II: Analog Digital Signal Process. 44 (2) (1997) 129–132.
