Coarse Grain Reconfigurable Arrays are Signal Processing Engines! Digital Design for FPGA, TKT-1426, Lecture #11


Waqar Hussain, Research Scientist
[email protected]
Department of Computer Systems, Tampere University of Technology, Finland


Electronic Products
Multifunction devices are becoming popular, in addition to being reliable and durable.

Example: Mobile Phone
• The key selling features of a cell phone are size, weight, longer battery life, audio/video streaming and the games running on it
• Adaptability to many communication standards
• Expectations of real-time performance
• No limits to human desire


Embedded Technology

Embedded technology empowers a mobile phone to carry all these features. An embedded system is intended for a specific use and consists of hardware capable of performing a set of different tasks with the help of software.
Example: Embedded System = RISC + Accelerator(s)


Why Coarse Grain Reconfigurable Arrays?

Computationally Intensive Kernels (CIK) need to be accelerated in a Signal Processing System.

Examples of CIKs:
1. FIR Filtering
2. Encoding and Decoding
   a) Viterbi
   b) Reed-Solomon
3. Matrix-Vector Multiplication
4. Fast Fourier Transform


Why Coarse Grain Reconfigurable Arrays?

Question: So why CGRA, why not traditional accelerators?

It is more desirable to use devices that can accelerate several different kernels than traditional accelerators designed to accelerate only a single kernel. Thanks to reconfigurability!


Why Are CGRAs Powerful Engines?

Answer: Because of their structure! CGRAs offer high parallelism and throughput due to their array-based structure. Algorithms containing parallelism are the most suitable for mapping onto a CGRA, and a CGRA can process large streams of data. The basic unit of a CGRA is an ALU, called a Processing Element (PE). Each PE is connected to the other PEs using point-to-point links or a Network-on-Chip (NoC).


CGRA in an Embedded System

An example of an embedded system is RISC + Accelerator(s):
• RISC = COFFEE
• Accelerator = BUTTER
Both COFFEE and BUTTER were designed at the Department of Computer Systems, Tampere University of Technology, Finland.

BUTTER: a general-purpose Coarse Grain Reconfigurable Array (CGRA), which is a matrix of processing elements (PEs). Each PE is capable of performing a set of different tasks, and the PEs are connected to each other using point-to-point interconnections. BUTTER can process many computationally intensive kernels.


Problems with BUTTER!

• BUTTER's presence in the system is expensive if it is not used most of the time
• BUTTER occupies a large number of hardware resources
• A general-purpose CGRA requires a few million FPGA gates


Solution

CREMA: a parameterized general-purpose CGRA template used to generate special-purpose accelerators.


Category of Interconnections


Processing Elements in CREMA

• Two operand registers
• Decoder for operation selection
• Supports integer and floating-point operations
• Blocks with dashed borders are scalable and selectable for instantiation
• LUT for logical operations

Processing Element Template


CREMA-based System

• COFFEE for general-purpose processing
• CREMA-generated accelerator for the CIK
• Network of switched interconnections for faster data transfer between the modules


Applications Mapped on CREMA and BUTTER
• Integer and floating-point matrix-vector multiplication: execution time compared with RISC and DSP
• 2D low-pass image filtering based on an averaging window
• FFT: satisfied the execution-time constraints for SISO and MIMO OFDM applications; resource utilization and execution time were compared with other state-of-the-art implementations
• W-CDMA cell search: execution time compared with a RISC core

In all of the above applications, CREMA as a template-based device required fewer resources for its generated accelerators than BUTTER.


Application Mapping


Number Scaling

Very important so that the signals do not overflow.
Before processing: scale down
After processing: scale up

If x[n] is the input signal and y[n] the output signal, then
scaling down: (x[n] / |max x[n]|) × 2^b
scaling up: (y[n] / 2^b) × |max x[n]|
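A minimal sketch of these two formulas in Python; the helper names and the choice b = 15 (a common value for 16-bit signed samples) are assumptions, not from the lecture:

```python
# Sketch of the number-scaling formulas above, with b fractional bits.

def scale_down(x, max_x, b=15):
    """Map a sample x into the integer range used for processing."""
    return round(x / abs(max_x) * 2**b)

def scale_up(y, max_x, b=15):
    """Map a processed integer result y back to the original range."""
    return y / 2**b * abs(max_x)
```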


Example

Consider a set of numbers S = {-3, -2, -1, 0, 1, 2, 3}. We try to compute -3 × 3 = -9 in 16-bit binary integer representation.

Scaling Down
• S / |max S| = {-1, -0.6667, -0.3333, 0, 0.3333, 0.6667, 1}
• (S / |max S|) × 2^15 = {-3.2768, -2.1845, -1.0923, 0, 1.0923, 2.1845, 3.2768} × 10^4
• -32768 × 32768 = -1.0737 × 10^9
• After the multiplication there is a shift operation: -1.0737 × 10^9 / 2^15 = -32768

Scaling Up
• The answer was -32768
• Both operands were scaled down by |max S| = 3, so (-32768 / 2^15) × 3 × 3 = -9
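A short sketch reproducing the example (b = 15 is assumed, and the full product is kept before the shift as a wide accumulator would):

```python
# Worked example: compute -3 * 3 = -9 with 16-bit scaled integers.
b, max_s = 15, 3
a = round(-3 / max_s * 2**b)          # -32768
c = round( 3 / max_s * 2**b)          #  32768
prod = (a * c) >> b                   # shift after the multiplication: -32768
print(prod / 2**b * max_s * max_s)    # scale up by |max S| for each operand: -9.0
```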


First Order Linear Constant Coefficient Difference Equation

y[n] = x[n-1] + x[n], n=0,1,2,3,…,N-1

[Block diagram: x[n] and a one-sample delayed copy of it (Z^-1) are summed to produce y[n]]
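A small sketch of this difference equation (the function name is assumed, and x[-1] is taken as 0), mirroring the Z^-1 delay element with a single state variable:

```python
# y[n] = x[n] + x[n-1]: each output is the current sample plus the previous one.
def first_order_filter(x):
    y, prev = [], 0          # prev plays the role of the Z^-1 delay element
    for sample in x:
        y.append(sample + prev)
        prev = sample
    return y

print(first_order_filter([1, 2, 3, 4]))   # [1, 3, 5, 7]
```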


Finite Impulse Response Filtering

Transfer Function of the Filter
H(z) = (b(0) + b(1) z^-1 + ... + b(M-1) z^-(M-1)) / (1 + a(1) z^-1 + ... + a(N) z^-N)
There is no feedback, so N = 0 and H(z) = b(0) + b(1) z^-1 + ... + b(M-1) z^-(M-1)

FIR Structure
[Block diagram: tapped delay line of Z^-1 elements; the input x[n] and its delayed copies are multiplied by the coefficients b(0), b(1), b(2), ..., b(M-1) and summed to produce y[n]]
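A minimal sketch of the FIR structure above in direct form (names assumed): the delay line holds the M most recent samples and each output is their coefficient-weighted sum.

```python
# Direct-form FIR filter: y[n] = b[0]*x[n] + b[1]*x[n-1] + ... + b[M-1]*x[n-M+1],
# with samples before n = 0 taken as zero.
def fir_filter(x, b):
    M = len(b)
    delay = [0] * M                       # tapped delay line (the Z^-1 elements)
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]     # shift the new sample in
        y.append(sum(bk * xk for bk, xk in zip(b, delay)))
    return y

# 3-tap moving average as an example set of coefficients
print(fir_filter([3, 6, 9, 12], [1/3, 1/3, 1/3]))   # [1.0, 3.0, 6.0, 9.0]
```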


Polynomial Division

Polynomial division is very important and is used many times in signal processing. Example: the encoding process of Reed-Solomon codes. The best way of doing it in hardware is with a Linear Feedback Shift Register (LFSR).
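A minimal sketch (the example polynomials are my own, not from the lecture) of how an LFSR performs polynomial division over GF(2): after all message bits have been shifted in, the register holds X^(deg g) · m(X) modulo g(X), which is exactly the parity computation used in the encoder below.

```python
# LFSR-based polynomial division over GF(2): the register ends up holding
# the remainder of X^(deg g) * m(X) divided by g(X).
def lfsr_remainder(msg_bits, g_bits):
    """Coefficient lists, highest degree first; g_bits must start with 1 (monic g)."""
    reg = [0] * (len(g_bits) - 1)              # one stage per remainder coefficient
    for bit in msg_bits:
        fb = bit ^ reg[0]                      # feedback = input + leading register bit
        shifted = reg[1:] + [0]                # shift the register by one position
        reg = [r ^ (fb & c) for r, c in zip(shifted, g_bits[1:])]
    return reg

# Example: X^3 * (X^2 + 1) mod (X^3 + X + 1) = X^2, i.e. remainder bits [1, 0, 0]
print(lfsr_remainder([1, 0, 1], [1, 0, 1, 1]))
```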


Reed-Solomon Codes: Encoding in Systematic Form, (7, 3) Example

X^(n-k) m(X) = q(X) g(X) + p(X)
p(X) = X^(n-k) m(X) modulo g(X)
U(X) = p(X) + X^(n-k) m(X)

Message: m = 010 110 111, i.e. m(X) = α^1 + α^3 X + α^5 X^2


Encoding in Systematic Form

p(X) = α^0 + α^2 X + α^4 X^2 + α^6 X^3
U(X) = α^0 + α^2 X + α^4 X^2 + α^6 X^3 + α^1 X^4 + α^3 X^5 + α^5 X^6


Systematic Encoding with an (n-k)-Stage Shift Register

[LFSR diagram: an (n-k) = 4-stage shift register with feedback multiplier taps α^3, α^1, α^0, α^3 (the coefficients of g(X)); the message symbols α^1, α^3, α^5 are shifted in]


Systematic Encoding with an (n-k)-Stage Shift Register

INPUT QUEUE    | CLOCK CYCLE | REGISTER CONTENTS     | FEEDBACK
α^1 α^3 α^5    | 0           | 0    0    0    0      | α^5
α^1 α^3        | 1           | α^1  α^6  α^5  α^1    | α^0
α^1            | 2           | α^3  0    α^2  α^2    | α^4
-              | 3           | α^0  α^2  α^4  α^6    | -


Systematic Encoding with an (n-k)-Stage Shift Register
The message arrives and the LFSR is reset
[LFSR diagram: taps 110, 010, 100, 110 (α^3, α^1, α^0, α^3); register contents 000 000 000 000; input queue 010 110 111]


Systematic Encoding with an (n-k)-Stage Shift Register
1st clock cycle in the LFSR
[LFSR diagram: feedback 111 (α^5); register contents 000 000 000 000; remaining input queue 010 110]


Systematic Encoding with an (n-k)-Stage Shift Register
2nd clock cycle in the LFSR
[LFSR diagram: feedback 100 (α^0); register contents 010 101 111 010; remaining input queue 010]


Systematic Encoding with an (n-k)-Stage Shift Register
3rd clock cycle in the LFSR
[LFSR diagram: feedback 011 (α^4); register contents 110 000 001 001; input queue empty]


Systematic Encoding with an (n-k)-Stage Shift Register
4th clock cycle in the LFSR
[LFSR diagram: input queue empty; register contents 100 001 011 101]


Systematic Encoding with an (n-k)-Stage Shift Register
[LFSR diagram: input queue empty; register contents 100 001 011 101]
The parity bits 100 001 011 101 will come out of the LFSR serially


Systematic Encoding with an (n-k)-Stage Shift Register

U(X) = Σ (n = 0 to 6) u_n X^n
U(X) = α^0 + α^2 X + α^4 X^2 + α^6 X^3 + α^1 X^4 + α^3 X^5 + α^5 X^6
     = (100) + (001) X + (011) X^2 + (101) X^3 + (010) X^4 + (110) X^5 + (111) X^6
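A compact sketch of the whole encoding walk-through in Python. This is not the lecture's own code: GF(8) is built with the primitive polynomial x^3 + x + 1, the generator taps α^3, α^1, α^0, α^3 are the multipliers shown in the LFSR figure, and integers carry the coefficient of α^i in bit i, so the slides' bit string 110 = α^3 is written 0b011 here.

```python
# Systematic RS(7, 3) encoding with a 4-stage LFSR over GF(8).
PRIM = 0b1011                          # primitive polynomial x^3 + x + 1

def gf_mul(a, b):
    """Multiply two GF(8) elements (bit i of the integer = coefficient of alpha^i)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:                 # reduce modulo the primitive polynomial
            a ^= PRIM
    return r

def rs73_parity(msg):
    """msg: message symbols, lowest-degree coefficient first; returns 4 parity symbols."""
    g = [0b011, 0b010, 0b001, 0b011]   # generator taps alpha^3, alpha^1, alpha^0, alpha^3
    reg = [0, 0, 0, 0]                 # the (n-k)-stage shift register
    for m in reversed(msg):            # highest-degree message symbol enters first
        fb = m ^ reg[3]                # feedback = input + last stage (addition is XOR)
        reg = [gf_mul(fb, g[0]),
               reg[0] ^ gf_mul(fb, g[1]),
               reg[1] ^ gf_mul(fb, g[2]),
               reg[2] ^ gf_mul(fb, g[3])]
    return reg

# m(X) = alpha^1 + alpha^3 X + alpha^5 X^2  (the slides' 010, 110, 111)
parity = rs73_parity([0b010, 0b011, 0b111])
print(parity)                          # [1, 4, 6, 5] = alpha^0, alpha^2, alpha^4, alpha^6
```

The final register contents α^0, α^2, α^4, α^6 are exactly the parity symbols 100 001 011 101 of the walk-through above.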


Correlation

The slot timing synchronization in W-CDMA cell search requires several correlation calculations over a window of 256 elements. The correlation can be defined as the sum-of-products of complex input samples (R_i) and coefficients (C_i); mathematically it can be expressed as

F_1 = Σ (i = 0 to 255) R_i · C_i

After each correlation the window shifts by one input sample, so the second correlation can be defined as

F_2 = Σ (i = 0 to 255) R_(i+1) · C_i

and the n-th as

F_n = Σ (i = 0 to 255) R_(i+n-1) · C_i
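A minimal sketch of this sliding correlation (the function name is assumed, and the coefficients are used without conjugation, following the sum-of-products definition above):

```python
# Sliding sum-of-products correlation over a window of len(C) samples
# (256 in the slide): F[n] = sum over i of R[n + i] * C[i].
def sliding_correlation(R, C):
    L = len(C)
    return [sum(R[n + i] * C[i] for i in range(L))
            for n in range(len(R) - L + 1)]
```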


Correlation

Assuming that R_Ri, C_Ri are the real parts and R_Ii, C_Ii are the imaginary parts of R_i and C_i respectively, the first equation can be expanded into its real and imaginary parts as

F_R = Σ (i = 0 to 255) (R_Ri · C_Ri - R_Ii · C_Ii)
F_I = Σ (i = 0 to 255) (R_Ri · C_Ii + R_Ii · C_Ri)

Using CREMA or BUTTER, a context can be designed for this processing; F_Ri and F_Ii can be loaded into the local memory of BUTTER or CREMA.
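The same correlation written with separate real and imaginary arrays, as a sketch of the element-wise multiply-accumulate work a context would perform (names are assumed, not from the lecture):

```python
# Real/imaginary expansion of one correlation window (complex multiply-accumulate).
def correlate_split(R_re, R_im, C_re, C_im):
    F_R = sum(r_re * c_re - r_im * c_im
              for r_re, r_im, c_re, c_im in zip(R_re, R_im, C_re, C_im))
    F_I = sum(r_re * c_im + r_im * c_re
              for r_re, r_im, c_re, c_im in zip(R_re, R_im, C_re, C_im))
    return F_R, F_I
```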


Fast Fourier Transform


FFT Implementation

Radix-2 Butterfly


Radix-4 Butterfly
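A small sketch of what the two butterflies in the figures compute, in the standard decimation-in-time form with the twiddle factors applied to the inputs (textbook FFT material, not code from the lecture):

```python
# Radix-2 DIT butterfly: two inputs, one twiddle factor w.
def butterfly_r2(a, b, w):
    t = w * b
    return a + t, a - t

# Radix-4 DIT butterfly: four inputs, three twiddle factors (W_4 = -j).
def butterfly_r4(x0, x1, x2, x3, w1, w2, w3):
    a, b, c, d = x0, w1 * x1, w2 * x2, w3 * x3
    return (a + b + c + d,
            a - 1j * b - c + 1j * d,
            a - b + c - d,
            a + 1j * b - c - 1j * d)

# Sanity check: the 4-point DFT of an impulse is all ones.
print(butterfly_r4(1, 0, 0, 0, 1, 1, 1))
```

A radix-4 butterfly replaces two radix-2 stages, which is why a 64-point radix-4 FFT needs three stages instead of six.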

FFT Implementation

64-point FFT Radix-2 Structure

Radix-2 vs Radix-4


64-point FFT Radix-4 Structure

Radix-2 FFT Implementation

A single context implements two Radix-2 butterflies


Radix-4 FFT Implementation

Three contexts are used for one Radix-4 butterfly. The first context performs only additions and subtractions.


Radix-4 FFT Implementation

The second context performs the multiplications and the rest of the additions and subtractions. The third context performs the shift operations.


Data Reordering

[Figure: data reordering; the input must be split into x(A), x(B), x(C) and x(D), which produce the outputs X(A), X(B), X(C) and X(D)]

Data Reordering


Performance Comparison

Radix-2 vs Radix-4: the execution performance is almost the same!


Thank You *Questions*
