Coarse Grain Reconfigurable Arrays are Signal Processing Engines!
Digital Design for FPGA, TKT-1426, Lecture #11
Waqar Hussain Research Scientist
[email protected] Department of Computer Systems Tampere University of Technology, Finland
Department of Computer Systems
Electronic Products
Multifunction devices are becoming popular, in addition to being reliable and durable.
Example: Mobile Phone
• The key selling features of a cell phone are size, weight, long battery life, audio/video streaming and the games that run on it
• Adaptability to many communication standards
• Expectations of real-time performance
• No limits to human desire
Embedded Technology
Embedded technology enables a mobile phone to carry all these features. An embedded system is intended for a specific use and consists of hardware capable of performing a set of different tasks with the help of software.
Example: Embedded System = RISC + Accelerator(s)
Why Coarse Grain Reconfigurable Arrays ?
Computationally Intensive Kernels (CIK) need to be accelerated in a Signal Processing System.
Examples of CIKs
1. FIR Filtering
2. Encoding and Decoding
   a) Viterbi
   b) Reed-Solomon
3. Matrix-Vector Multiplication
4. Fast Fourier Transform
Why Coarse Grain Reconfigurable Arrays ?
Question: So why CGRA, why not traditional accelerators?
It is more desirable to use a device that can accelerate different kernels than a traditional accelerator designed to accelerate only a single kernel. This is possible thanks to reconfigurability!
Why CGRAs are Powerful Engines ?
Answer: Due to their structure! CGRAs offer high parallelism and throughput because of their array-based structure. Algorithms containing parallelism are the most suitable for mapping onto a CGRA, which can process large streams of data. The structural unit of a CGRA is an ALU called a Processing Element (PE). Each PE is connected to other PEs using point-to-point interconnections or a Network on Chip (NoC).
CGRA in an Embedded System
An example of an embedded system is RISC + Accelerator(s)
RISC = COFFEE
Accelerator = BUTTER
Both COFFEE and BUTTER were designed at the Department of Computer Systems, Tampere University of Technology, Finland.
BUTTER
A general purpose Coarse Grain Reconfigurable Array (CGRA), which is a matrix of processing elements (PEs). Each PE can perform a set of different tasks, and the PEs are connected to each other using point-to-point interconnections. BUTTER is capable of processing many computationally intensive kernels.
Problems with BUTTER !
• BUTTER’s presence in the system is expensive if it is not used most of the time
• BUTTER occupies a large number of hardware resources
• A general purpose CGRA requires a few million gates of FPGA
Solution
CREMA
A parameterized general purpose CGRA used to generate special purpose accelerators.
Category of Interconnections
Processing Elements in CREMA
• Two operand registers
• Decoder for operation selection
• Supports integer and floating-point operations
• Blocks with dashed border are scalable and selectable for instantiation
• LUT for logical operations
Processing Element Template
CREMA based System
• COFFEE for general purpose processing
• CREMA-generated accelerator for CIKs
• Network of switched interconnections for faster data transfer between modules
Applications Mapped on CREMA and BUTTER
• Integer and floating-point matrix-vector multiplication: execution time compared with RISC and DSP
• 2D low-pass image filtering based on an averaging window
• FFT: satisfied the execution time constraints for SISO and MIMO OFDM applications; resource utilization and execution time were compared with the state of the art
• W-CDMA cell search: execution time compared with a RISC core
In all of the above applications, CREMA, as a template-based device, required fewer resources for its generated accelerator than BUTTER.
Application Mapping
Number Scaling
Scaling is very important so that the signals do not overflow.
before processing >> scale down
after processing >> scale up
If x[n] is the input signal and y[n] is the output signal, then
scaling down = (x[n] / |max x[n]|) x 2^b
scaling up = (y[n] / 2^b) x |max x[n]|
Example
Consider a set of numbers S = {-3, -2, -1, 0, 1, 2, 3}
Trying to compute -3 x 3 = -9 in 16-bit binary integer representation
Scaling Down
• S/|max. S| = {-1, -0.6667, -0.3333, 0, 0.3333, 0.6667, 1}
• S/|max. S| x 2^15 = {-3.2768, -2.1845, -1.0923, 0, 1.0923, 2.1845, 3.2768} x 10^4
• -32768 x 32768 = -1.0737 x 10^9
• After multiplication there is a shift operation: -1.0737 x 10^9 / 2^15 = -32768
Scaling Up
• The answer was -32768
• Both operands were scaled down by |max. S| = 3, so the product is scaled up by 3 twice: (-32768 / 2^15) x 3 x 3 = -9
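The arithmetic of this example can be replayed in a few lines. This is only a sketch: the function names `scale_down` and `scale_up` are made up for illustration, and Python integers are unbounded, so 16-bit overflow is not modelled (in a real signed 16-bit word, +32768 would have to saturate to 32767).

```python
def scale_down(x, max_abs, b):
    """Map a value from [-max_abs, max_abs] onto the integer range [-2**b, 2**b]."""
    return round(x / max_abs * 2**b)


def scale_up(y, max_abs, b):
    """Undo one scale_down: bring a scaled value back to the original range."""
    return y / 2**b * max_abs


S = [-3, -2, -1, 0, 1, 2, 3]
b = 15
max_abs = max(abs(v) for v in S)

scaled = [scale_down(v, max_abs, b) for v in S]

# Multiply the scaled -3 and +3, then shift right by b bits.
product = (scaled[0] * scaled[-1]) >> b

# Both operands were divided by max_abs, so the product is scaled up twice.
result = scale_up(product, max_abs, b) * max_abs

print(scaled)   # [-32768, -21845, -10923, 0, 10923, 21845, 32768]
print(product)  # -32768
print(result)   # -9.0
```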
First Order Linear Constant Coefficient Difference Equation
y[n] = x[n-1] + x[n], n=0,1,2,3,…,N-1
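A minimal sketch of this difference equation, assuming the delay element starts at zero (x[-1] = 0), which the slide does not state. The function name is hypothetical.

```python
def first_order_sum(x):
    """Compute y[n] = x[n-1] + x[n], taking x[-1] = 0 (assumed initial condition)."""
    y = []
    previous = 0  # contents of the Z^-1 delay element
    for sample in x:
        y.append(previous + sample)
        previous = sample
    return y


print(first_order_sum([1, 2, 3, 4]))  # [1, 3, 5, 7]
```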
[Figure: block diagram in which x[n] and its one-sample delayed copy (Z^-1) are added to give y[n]]
Finite Impulse Response Filtering
Transfer Function of the Filter
There is no feedback, so N = 0 (the order of the feedback part of the transfer function is zero).
FIR Structure
[Figure: direct-form FIR structure. x[n] passes through a chain of Z^-1 delay elements; the taps are weighted by b(0), b(1), b(2), ..., b(M-1) and summed to give y[n]]
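The FIR structure corresponds to y[n] = b(0) x[n] + b(1) x[n-1] + ... + b(M-1) x[n-M+1]. The sketch below models the delay chain with a list and assumes that all delay elements start at zero; `fir_filter` is an illustrative name, not something from the lecture.

```python
def fir_filter(x, b):
    """Direct-form FIR filter: y[n] = sum of b[k] * x[n-k], with x[n] = 0 for n < 0."""
    delay_line = [0] * len(b)  # delay_line[k] holds x[n-k]
    y = []
    for sample in x:
        # Shift the new sample in and drop the oldest one.
        delay_line = [sample] + delay_line[:-1]
        y.append(sum(bk * xk for bk, xk in zip(b, delay_line)))
    return y


# An impulse at the input reproduces the coefficients at the output.
print(fir_filter([1, 0, 0, 0], [1, 2, 3]))  # [1, 2, 3, 0]
```

Each output sample needs M multiplications and M-1 additions that are independent of each other, which is the kind of parallelism an array of PEs can exploit.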
Polynomial Division
Polynomial division is very important and is used many times in signal processing.
Example: the encoding process of Reed-Solomon codes
The best way of doing it is by using a Linear Feedback Shift Register (LFSR).
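To illustrate the idea in its simplest setting, the sketch below divides polynomials with binary coefficients (GF(2)) the way an LFSR does: coefficients are shifted in one per clock, and whenever a 1 leaves the top stage the divisor is XORed back in. The binary field and the name `lfsr_remainder` are choices made here for brevity; Reed-Solomon codes work with multi-bit symbols instead.

```python
def lfsr_remainder(bits, divisor):
    """Remainder of a GF(2) polynomial division, computed LFSR-style.

    bits    : dividend coefficients, highest power first (list of 0/1)
    divisor : integer whose bit i is the coefficient of X^i
    """
    degree = divisor.bit_length() - 1
    register = 0
    for bit in bits:
        register = (register << 1) | bit  # shift the next coefficient in
        if register >> degree:            # a 1 fell out of the top stage
            register ^= divisor           # feed it back through the taps
    return register


# (X^5 + X^2 + 1) mod (X^3 + X + 1) = X
print(bin(lfsr_remainder([1, 0, 0, 1, 0, 1], 0b1011)))  # 0b10
```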
Reed Solomon Codes: Encoding in Systematic Form, (7, 3) Example
X^{n-k} m(X) = q(X) g(X) + p(X)
p(X) = X^{n-k} m(X) modulo g(X)
U(X) = p(X) + X^{n-k} m(X)
Message: 010 110 111 = α^1 α^3 α^5
Encoding in Systematic Form
p(X) = α^0 + α^2 X + α^4 X^2 + α^6 X^3
U(X) = α^0 + α^2 X + α^4 X^2 + α^6 X^3 + α^1 X^4 + α^3 X^5 + α^5 X^6
Systematic Encoding with an (n-k)-Stage Shift Register
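[Figure: LFSR encoder with register stages X^0 to X^3, feedback taps α^3, α^1, α^0, α^3 and feedback point X^4; the message symbols α^1 α^3 α^5 wait in the input queue]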
Systematic Encoding with an (n-k)-Stage Shift Register
INPUT QUEUE  | CLOCK CYCLE | REGISTER CONTENTS | FEEDBACK
α^1 α^3 α^5  | 0           | 0   0   0   0     | α^5
α^1 α^3      | 1           | α^1 α^6 α^5 α^1   | α^0
α^1          | 2           | α^3 0   α^2 α^2   | α^4
--           | 3           | α^0 α^2 α^4 α^6   | --
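The register contents can be reproduced with a short simulation. This is a sketch under stated assumptions: GF(8) is built from the primitive polynomial 1 + X + X^3 (chosen because it yields the 3-bit patterns used on these slides, e.g. α^3 = 110), and the helper names are hypothetical.

```python
# EXP[i] = alpha^i, stored as an integer whose bit 0 is the coefficient of 1,
# bit 1 of alpha and bit 2 of alpha^2 (assumed primitive polynomial 1 + X + X^3).
EXP = [1, 2, 4, 3, 6, 7, 5]
LOG = {value: i for i, value in enumerate(EXP)}


def gf_mul(a, b):
    """Multiply two GF(8) symbols using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]


def slide_bits(v):
    """Format a symbol as on the slides: coefficient of 1 first, alpha^2 last."""
    return f"{v & 1}{(v >> 1) & 1}{(v >> 2) & 1}"


def rs_encode_parity(message, taps):
    """Run the (n-k)-stage LFSR; message is given lowest-order symbol first."""
    register = [0] * len(taps)
    for symbol in reversed(message):      # highest-order symbol enters first
        feedback = symbol ^ register[-1]  # addition in GF(8) is XOR
        shifted = [0] + register[:-1]
        register = [s ^ gf_mul(feedback, g) for s, g in zip(shifted, taps)]
    return register


message = [EXP[1], EXP[3], EXP[5]]        # 010 110 111
taps = [EXP[3], EXP[1], EXP[0], EXP[3]]   # 110 010 100 110
parity = rs_encode_parity(message, taps)
print(" ".join(slide_bits(p) for p in parity))  # 100 001 011 101
```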
Systematic Encoding with an (n-k)-Stage Shift Register: message arrives and the LFSR is reset
[Figure: LFSR with tap multipliers 110, 010, 100, 110. Register contents: 000 000 000 000. Input queue: 010 110 111]
Systematic Encoding with an (n-k)-Stage Shift Register: 1st clock cycle in LFSR
[Figure: LFSR with tap multipliers 110, 010, 100, 110. Feedback: 111. Register contents: 000 000 000 000. Input queue: 010 110]
Systematic Encoding with an (n-k)-Stage Shift Register: 2nd clock cycle in LFSR
[Figure: LFSR with tap multipliers 110, 010, 100, 110. Feedback: 100. Register contents: 010 101 111 010. Input queue: 010]
Systematic Encoding with an (n-k)-Stage Shift Register: 3rd clock cycle in LFSR
[Figure: LFSR with tap multipliers 110, 010, 100, 110. Feedback: 011. Register contents: 110 000 001 001. Input queue: empty]
Systematic Encoding with an (n-k)-Stage Shift Register: 4th clock cycle in LFSR
[Figure: LFSR with tap multipliers 110, 010, 100, 110. Feedback: none. Register contents: 100 001 011 101]
Systematic Encoding with an (n-k)-Stage Shift Register
[Figure: LFSR with tap multipliers 110, 010, 100, 110. Feedback: none. Register contents: 100 001 011 101]
The parity bits 100 001 011 101 will come out of the LFSR serially.
Systematic Encoding with an (n-k)-Stage Shift Register
U(X) = \sum_{n=0}^{6} u_n X^n
U(X) = α^0 + α^2 X + α^4 X^2 + α^6 X^3 + α^1 X^4 + α^3 X^5 + α^5 X^6
     = (100) + (001) X + (011) X^2 + (101) X^3 + (010) X^4 + (110) X^5 + (111) X^6
Correlation
The slot timing synchronization in W-CDMA cell search requires several correlation calculations over a window of 256 elements. The correlation can be defined as the sum of products of the complex input samples (R_i) and the coefficients (C_i), which can be expressed mathematically as
After each correlation, the window shifts by one input sample, so the second correlation can be defined as
and the n-th as
Correlation
Assuming that R_{Ri}, C_{Ri} are the real parts and R_{Ii}, C_{Ii} are the imaginary parts of R_i and C_i respectively, the first equation can be expanded into its real and imaginary parts as
Using CREMA or BUTTER, a context can be designed for this processing; F_Ri and F_Ii can be loaded into the local memory of BUTTER or CREMA.
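A sliding-window correlation of this kind can be sketched as follows. Several things here are assumptions rather than the lecture's definitions: the product is taken without conjugation, the window index n is zero-based, and the function name and the tiny two-element window (instead of 256) are purely illustrative.

```python
def correlate(samples, coeffs, n):
    """Correlation for window position n: sum of samples[n + i] * coeffs[i].

    The complex product is expanded into real arithmetic, as it would be
    on an array of real-valued ALUs.
    """
    real = 0.0
    imag = 0.0
    for i, c in enumerate(coeffs):
        r = samples[n + i]
        real += r.real * c.real - r.imag * c.imag
        imag += r.real * c.imag + r.imag * c.real
    return complex(real, imag)


samples = [1 + 2j, 3 - 1j, 0 + 1j]
coeffs = [1 - 1j, 2 + 0j]
print(correlate(samples, coeffs, 0))  # (9-1j)
print(correlate(samples, coeffs, 1))  # (2-2j)
```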
Fast Fourier Transform
FFT Implementation
Radix-2 Butterfly
Radix-4 Butterfly
FFT Implementation
64-point FFT Radix-2 Structure
Radix-2 vs Radix-4
64-point FFT Radix-4 Structure
Radix-2 FFT Implementation
Single Context Two Radix-2 Butterflies
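For reference, a radix-2 butterfly and an FFT built from it can be sketched in software as below. This is a generic recursive decimation-in-time formulation in floating point, not the CREMA context itself; the function names are illustrative.

```python
import cmath


def radix2_butterfly(a, b, w):
    """One radix-2 butterfly: returns (a + w*b, a - w*b)."""
    t = w * b
    return a + t, a - t


def fft_radix2(x):
    """Recursive decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor
        out[k], out[k + n // 2] = radix2_butterfly(even[k], odd[k], w)
    return out
```

A 64-point transform needs log2(64) = 6 stages of 32 butterflies each, and the butterflies within one stage are independent of each other.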
Radix-4 FFT Implementation
Three contexts for one Radix-4 butterfly
The first context performs only additions and subtractions
Radix-4 FFT Implementation
The second context performs the multiplications and the rest of the additions and subtractions
The third context performs the shift operations
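A radix-4 butterfly can be sketched as below. The grouping into an add/subtract stage followed by twiddle multiplications is only loosely inspired by the description above and is not the actual CREMA context mapping. The decimation-in-frequency form, the twiddle parameters `w1` to `w3` and the function name are assumptions, and the fixed-point shift step is omitted because the sketch uses floating-point complex numbers.

```python
def radix4_butterfly(a, b, c, d, w1=1, w2=1, w3=1):
    """Decimation-in-frequency radix-4 butterfly with optional output twiddles."""
    # Additions and subtractions only.
    s0, s1 = a + c, a - c
    s2, s3 = b + d, b - d
    # Multiplying by -j swaps real and imaginary parts; no multiplier is needed.
    minus_j_s3 = complex(s3.imag, -s3.real)
    y0, y2 = s0 + s2, s0 - s2
    y1, y3 = s1 + minus_j_s3, s1 - minus_j_s3
    # Twiddle multiplications.
    return y0, y1 * w1, y2 * w2, y3 * w3


# With unit twiddles the butterfly is a 4-point DFT.
print(radix4_butterfly(1, 2, 3, 4))
```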
Data Reordering
[Figure: data reordering between the inputs x(A), x(B), x(C), x(D) and the outputs X(A), X(B), X(C), X(D)]
Splitting required into x(A), x(B), x(C) and x(D)
Data Reordering
Performance Comparison
Radix-2 vs Radix-4: execution performance is almost the same!
Thank You *Questions*