An Efficient Softcore Multiplier Architecture for Xilinx FPGAs

22nd IEEE Symposium on Computer Arithmetic

Martin Kumm, Shahid Abbas and Peter Zipf University of Kassel, Germany

CONTENTS
1. State-of-the-art
2. Proposed multiplier
3. Results

WHY FPGA SOFTCORE MULTIPLIERS?
- The need for efficient multipliers forced FPGA vendors to embed hard multiplier blocks
- FPGA softcore multipliers are still required:
  - Small word sizes (worse mapping for embedded mults)
  - Large word sizes ("fill gaps")
  - Replace embedded mults on small/low-cost FPGAs

WHY ARE THEY DIFFERENT?

- Research on efficient multipliers has been ongoing for more than 50 years
- Multipliers that are efficient in terms of gates may not be efficient on FPGAs
- FPGA-optimized structures are relatively rare

[Figure: Xilinx slice, 6/7 series]

PREVIOUS WORK
- A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011]
- Two partial products are generated and added using the carry chain
- A compression tree for the already reduced partial products (PPs) is still necessary

[Figure: slice with four LUTs feeding the carry logic, with a full adder highlighted]
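The row structure described on this slide (two partial products summed per row, then a tree over the reduced rows) can be modeled behaviorally. This is a minimal sketch only: it assumes unsigned operands and leaves out the Baugh-Wooley sign handling, and the function name `paired_row_multiply` is hypothetical, not taken from [Parandeh-Afshar 2011].

```python
def paired_row_multiply(a: int, b: int, width: int) -> int:
    """Unsigned behavioral model: each row adds two adjacent partial products."""
    rows = []
    for i in range(0, width, 2):
        pp_low = a * ((b >> i) & 1)          # partial product for bit i of b
        pp_high = a * ((b >> (i + 1)) & 1)   # partial product for bit i+1 of b
        # One addition per row (done on the carry chain in hardware),
        # shifted to the weight of bit i.
        rows.append((pp_low + (pp_high << 1)) << i)
    # The plain sum stands in for the compression tree over the reduced rows.
    return sum(rows)
```

The model produces roughly width/2 rows, which is why a compression tree is still needed afterwards.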

PREVIOUS WORK
- Another idea was discussed in [Brunie 2013]:
  - Decompose the multiplication into small multipliers that fit into single LUTs, e.g., 3x3, 2x3, 1x4
  - Use a compression tree to add the partial results

\[
\begin{aligned}
p = {} & M_1 + 2^3 M_2 + 2^6 M_3 + \ldots \\
       & \ldots + 2^3 M_4 + 2^6 M_5 + 2^9 M_6 + \ldots \\
       & \ldots + 2^6 M_7 + 2^9 M_8 + 2^{12} M_9
\end{aligned}
\]
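The decomposition into small multipliers can be sketched as follows. The sketch assumes unsigned 9-bit operands split into 3-bit chunks, so that nine 3x3 products with weights 2^0 to 2^12 appear as in the sum for p; the function name `tiled_multiply` and the loop order are illustrative and not taken from [Brunie 2013].

```python
def tiled_multiply(a: int, b: int, width: int = 9, tile: int = 3) -> int:
    """Multiply two unsigned `width`-bit integers using only tile x tile products."""
    mask = (1 << tile) - 1
    total = 0
    for i in range(0, width, tile):          # bit offset of the chunk taken from a
        for j in range(0, width, tile):      # bit offset of the chunk taken from b
            a_chunk = (a >> i) & mask
            b_chunk = (b >> j) & mask
            small_product = a_chunk * b_chunk    # small enough for a LUT-based 3x3 multiplier
            total += small_product << (i + j)    # weight 2^(i+j): 2^0, 2^3, ..., 2^12
    # The running sum stands in for the compression tree.
    return total
```

For example, `tiled_multiply(300, 451)` equals `300 * 451`; in hardware the nine weighted terms would go to a compression tree rather than a sequential sum.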

BOOTH RECODING

\[
a \cdot b = \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} a \cdot BE_m \, 2^m
\]

b_{m+1}  b_m  b_{m-1}  |  BE_m  z_m  c_m  s_m
   0      0      0     |    0    1    0    0
   0      0      1     |    1    0    0    0
   0      1      0     |    1    0    0    0
   0      1      1     |    2    0    0    1
   1      0      0     |   -2    0    1    1
   1      0      1     |   -1    0    1    0
   1      1      0     |   -1    0    1    0
   1      1      1     |    0    1    0    0
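The recoding table can be reproduced with a short behavioral sketch. It uses the digit rule BE_m = -2 b_{m+1} + b_m + b_{m-1} from the Booth recoding derivation and reads z, c and s as "zero", "complement" and "shift" flags, which is an interpretation that matches the table values. The function names, the assumption b_{-1} = 0, and the sign extension of an M-bit two's-complement b are assumptions made for this sketch.

```python
def booth_digit(b_next: int, b_cur: int, b_prev: int) -> tuple[int, int, int, int]:
    """Return (BE, z, c, s) for the bit triple (b_{m+1}, b_m, b_{m-1})."""
    be = -2 * b_next + b_cur + b_prev
    z = int(be == 0)        # zero flag: digit is 0
    c = int(be < 0)         # complement flag: digit is negative
    s = int(abs(be) == 2)   # shift flag: digit magnitude is 2
    return be, z, c, s


def booth_multiply(a: int, b: int, M: int) -> int:
    """Multiply a by an M-bit two's-complement b via radix-4 Booth digits."""
    def bit(i: int) -> int:
        # b_{-1} is taken as 0; Python's arithmetic shift sign-extends b for i >= M.
        return (b >> i) & 1 if i >= 0 else 0

    product = 0
    for m in range(0, M + 1, 2):   # even m from 0 to M
        be, _, _, _ = booth_digit(bit(m + 1), bit(m), bit(m - 1))
        product += a * be * (1 << m)
    return product
```

Only about half as many partial-product rows are generated as there are bits in b, because each digit covers two bit positions.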

BOOTH MULTIPLIER

[Figure: Booth multiplier partial-product array from LSB to MSB, including the bits c0, c2, c4 and c6; shown in two builds, the second with constant 1 entries in place of most of the repeated c bits]

PROPOSED ARCHITECTURE

[Figure: proposed slice mapping with four LUTs feeding the carry logic, with a full adder highlighted]

RESULTS
- The number of slices can be precisely predicted:

\[
\#\text{slices}(M, N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}}
\]

- The design was implemented as generic VHDL
- A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs without much additional cost
- Reference circuits (Parandeh-Afshar & LUT-based) were designed with the FloPoCo library [de Dinechin 2012]
- Xilinx Coregen was used as a commercial reference
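The slice prediction, #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1), can be evaluated with a small helper. This is a sketch only: the function name `predicted_slices` is hypothetical, and integer arithmetic is used in place of floating-point ceiling and floor.

```python
def predicted_slices(M: int, N: int) -> int:
    """Evaluate #slices(M, N) = ceil(N/4 + 1) * floor(M/2 + 1)."""
    slices_per_row = (N + 3) // 4 + 1   # ceil(N/4) + 1, same as ceil(N/4 + 1)
    rows = M // 2 + 1                   # floor(M/2) + 1, same as floor(M/2 + 1)
    return slices_per_row * rows
```

For instance, `predicted_slices(16, 16)` gives 5 slices per row times 9 rows, i.e. 45 slices.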

RESULTS VIRTEX 6 COMBINATORIAL, SLICES

[Figure: number of slices (0 to 2,000) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

RESULTS VIRTEX 6 COMBINATORIAL, SLICE REDUCTION

[Figure: slice reduction (%) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed)]

RESULTS VIRTEX 6 COMBINATORIAL, FREQUENCY

[Figure: frequency in MHz (0 to 700) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

RESULTS VIRTEX 6 PIPELINED, SLICES

[Figure: number of slices (0 to 2,000) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

RESULTS VIRTEX 6 PIPELINED, SLICE REDUCTION

[Figure: slice reduction (%) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed)]

RESULTS VIRTEX 6 PIPELINED, FREQUENCY

[Figure: frequency in MHz (0 to 700) vs. input word size N (8 to 64). Curves: 1x4 LUT multiplier, 3x2 LUT multiplier, 3x3 LUT multiplier, Parandeh-Afshar multiplier, Coregen (area), Coregen (speed), proposed]

UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS

[Figure: Altera ALM]

MAYBE POSSIBLE NEXT?

CONCLUSION
- Compared to the best known design, up to
  - 50% of the slices can be saved for the combinatorial multiplier
  - 30% of the slices can be saved for the pipelined multiplier
- Portable to FPGAs providing a 5-input LUT at one full adder input
- "Free addition" supports multiply-accumulate (MAC) operation

THANK YOU!

LITERATURE
[Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, "Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs", FPL 2011
[Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, "Arithmetic Core Generation Using Bit Heaps", FPL 2013
[de Dinechin 2012]: de Dinechin & Pasca, "Designing Custom Arithmetic Data Paths with FloPoCo", IEEE Design & Test of Computers, 2012

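BOOTH RECODING

\[
\begin{aligned}
b &= -b_{M-1} 2^{M-1} + \ldots + b_2 2^2 + b_1 2^1 + b_0 \\
  &= -b_{M-1} 2^{M-1} + \ldots + b_2 2^2 + 2 b_1 2^1 \underbrace{{} - b_1 2^1 + b_0}_{BE_0 = -2 b_1 + b_0} \\
  &= -b_{M-1} 2^{M-1} + \ldots + 2 b_3 2^3 \underbrace{{} - b_3 2^3 + b_2 2^2 + 2 b_1 2^1}_{BE_2 \, 2^2 = (-2 b_3 + b_2 + b_1) 2^2} + BE_0 \\
  &= \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} BE_m \, 2^m
  \quad \text{with} \quad BE_m = -2 b_{m+1} + b_m + b_{m-1}
\end{aligned}
\]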
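As a quick check of the recoding BE_m = -2 b_{m+1} + b_m + b_{m-1}, here is a worked example. It assumes a 4-bit two's-complement operand (M = 4), b_{-1} = 0, and the sign bit repeated for indices at or above M (so b_4 = b_5 = b_3); the value b = -6 is chosen for illustration only.

```latex
\begin{align*}
b &= 1010_2 = -6, \qquad (b_3, b_2, b_1, b_0) = (1, 0, 1, 0) \\
BE_0 &= -2 b_1 + b_0 + b_{-1} = -2 + 0 + 0 = -2 \\
BE_2 &= -2 b_3 + b_2 + b_1 = -2 + 0 + 1 = -1 \\
BE_4 &= -2 b_5 + b_4 + b_3 = -2 + 1 + 1 = 0 \\
\sum_{m \in \{0, 2, 4\}} BE_m \, 2^m &= (-2) \cdot 2^0 + (-1) \cdot 2^2 + 0 \cdot 2^4 = -6
\end{align*}
```

Under these assumptions the top digit BE_4 vanishes for a signed operand, so only the digits for m = 0 and m = 2 contribute.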

WHY ARE THEY DIFFERENT?

[Figure: Altera ALM]

[Figure: Xilinx slice storage elements: LUT inputs D6:1, flip-flop and FF/LAT blocks with D, CE, CK, SR and Q ports and INIT1/INIT0, SRHI/SRLO attributes]