High Performance Elliptic Curve Cryptographic Co-processor by

Jonathan Lutz

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied Science in Electrical and Computer Engineering

Waterloo, Ontario, Canada, 2003

© Jonathan Lutz, 2003

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.


Abstract

In FIPS 186-2, NIST recommends several finite fields to be used in the elliptic curve digital signature algorithm (ECDSA). Of the ten recommended finite fields, five are binary extension fields with degrees ranging from 163 to 571. The fundamental building block of the ECDSA, like any ECC based protocol, is elliptic curve scalar multiplication. This operation is also the most computationally intensive. In many situations it may be desirable to accelerate the elliptic curve scalar multiplication with specialized hardware. In this thesis a high performance elliptic curve processor is developed which is optimized for the NIST binary fields. The architecture is built from the bottom up starting with the field arithmetic units. The architecture uses a field multiplier capable of performing a field multiplication over the extension field with degree 163 in 0.060 µsec. Architectures for squaring and inversion are also presented. The co-processor uses López and Dahab's projective coordinate system and is optimized specifically for Koblitz curves. A prototype of the processor has been implemented for the binary extension field with degree 163 on a Xilinx XCV2000E FPGA. The prototype runs at 66 MHz and performs an elliptic curve scalar multiplication in 0.233 msec on a generic curve and 0.075 msec on a Koblitz curve.


Acknowledgements

This thesis is sponsored in part by Motorola, Inc. I am particularly grateful to Dan Cronin, Jim Dworkin, and Jeff LaVell of Motorola for their continued support throughout the course of this research effort. Additionally, special thanks are due to Professor Anwarul Hasan for his time, guidance, and encouragement. And to my colleagues and friends, Amir and Arash, for the many coffee breaks at Tim Hortons.


To my wife and best friend, Sarah Joy


List of Abbreviations

ASIC     Application Specific Integrated Circuit
CLB      Configurable Logic Block
DSA      Digital Signature Algorithm
ECC      Elliptic Curve Cryptography
ECDSA    Elliptic Curve Digital Signature Algorithm
FPGA     Field Programmable Gate Array
GF       Galois Field
IOB      Input/Output Block
NAF      Non-adjacent Form
NIST     National Institute of Standards and Technology
τ-NAF    τ-adic Non-adjacent Form
SSL      Secure Sockets Layer

Contents

1 Introduction
  1.1 Motivation
  1.2 Scope of the Work and Objectives
  1.3 Thesis Organization
2 Background
  2.1 Mathematical Background
    2.1.1 Groups
    2.1.2 Finite Fields
  2.2 Arithmetic over Binary Finite Fields
    2.2.1 Multiplication
    2.2.2 Inversion
  2.3 Arithmetic over the Elliptic Curve Group
  2.4 Implementation Media
    2.4.1 Field Programmable Gate Arrays
    2.4.2 The Rapid-Prototyping Platform
3 High Performance Finite Field Arithmetic
  3.1 Multiplication
    3.1.1 Algorithm
    3.1.2 Computation of R(x)W(x) mod F(x)
    3.1.3 The Multiplier Data Path
    3.1.4 Choice of Digit Size
  3.2 Squaring
  3.3 Inversion
  3.4 Comparator/Adder
  3.5 Concluding Remarks
4 A Co-processor Architecture for ECC Scalar Multiplication
  4.1 Projective Coordinates
  4.2 Scalar Multiplication using Recoded Integers
    4.2.1 Scalar Multiplication using Binary NAF
    4.2.2 Scalar Multiplication using τ-NAF
    4.2.3 Summary and Analysis
  4.3 Co-processor Architecture
    4.3.1 The Data Path
    4.3.2 The Micro-sequencer
    4.3.3 Top Level Control
    4.3.4 Choice of Field Arithmetic Units
    4.3.5 Usage Model
  4.4 FPGA Prototype
  4.5 Results
5 Concluding Remarks
  5.1 Summary and Contributions
  5.2 Future Work
A Micro-code supporting Curve Arithmetic and Field Inversion
  A.1 Point Addition
    A.1.1 Generic Point Addition
    A.1.2 Koblitz Curve Point Addition
    A.1.3 Efficient Koblitz Curve Point Addition
  A.2 Point Doubling
  A.3 Field Inversion
    A.3.1 Inversion by Square and Multiply
    A.3.2 Inversion by Itoh and Tsujii
  A.4 Coordinate Conversion
  A.5 Copy Routines
    A.5.1 Copy P to Q
    A.5.2 Copy −P to Q
B Tool Related Scripts and Setup Files
  B.1 Synthesis Scripts
    B.1.1 Synthesis Compile Scripts
    B.1.2 Synthesis Constraints Script
  B.2 Place and Route Scripts
    B.2.1 Top Level Place and Route Script
    B.2.2 User Constraints File

List of Tables

2.1 NIST Recommended Finite Fields
3.1 Performance/Cost Trade-off for Multiplication over GF(2^163)
3.2 Comparison of Various Inversion Methods for GF(2^163)
3.3 Performance of Finite Field Operations
4.1 Comparison of Projective Point Systems
4.2 Cost of Scalar Multiplication in terms of Field Operations
4.3 Representation of the Scalar k
4.4 Example Representations of the Scalar
4.5 Performance of Field and Curve Operations
4.6 Performance and Cost Results for Scalar Multiplication
4.7 Comparison of Published Results

List of Figures

2.1 Functionality of a CLB
2.2 Functionality of an IOB
2.3 CLB Organization
3.1 LFSR Based Multiplier
3.2 The Multiplier Data-Path
3.3 Generating x^i W(x) mod F(x)
3.4 Computing R(x)W(x) mod F(x)
3.5 Computation of a Single Bit in R(x)W(x) mod F(x)
3.6 Modified Multiplier Data-Path
3.7 Data-Path of the Squaring Unit
3.8 Data-Path of the Comparator/Adder
4.1 Co-Processor's Hierarchical Control Path
4.2 Co-Processor Data-Path
4.3 Field Element Storage
4.4 32-bit/163-bit Address Map
4.5 Efficient Frobenius Mapping
4.6 Utilization of Finite Field Units for Point Addition
4.7 Utilization of Finite Field Units for Point Doubling

Chapter 1 Introduction

1.1 Motivation

The use of elliptic curves in cryptographic applications was first proposed independently in [17] and [24]. Since then several algorithms have been developed whose strength relies on the difficulty of the discrete logarithm problem over a group of elliptic curve points. Prominent examples include the Elliptic Curve Digital Signature Algorithm (ECDSA) [25], EC ElGamal and EC Diffie-Hellman [14]. In each case the underlying cryptographic primitive is elliptic curve scalar multiplication. This operation is by far the most computationally intensive step in each algorithm. In applications where many clients authenticate to a single server (such as a server supporting SSL [8, 27] or WTLS [1]), the computation of the scalar multiplication becomes the bottleneck which limits throughput. In a scenario such as this it may be desirable to accelerate the elliptic curve scalar multiplication with specialized hardware. In doing so, the scalar multiplications are completed more quickly and the computational burden on the server's main processor is reduced.

Elliptic curve-based cryptosystems are most closely related to algorithms like the Digital Signature Algorithm (DSA) which are based on the discrete logarithm problem. In the DSA, the parameters can be chosen to provide efficient implementations of the algorithm. In the same way, the parameters of ECC based cryptosystems can be selected to optimize the efficiency of the implementation. Unfortunately, the selection of the ECC parameters is not a trivial process and, if chosen incorrectly, may lead to an insecure system. In response to this issue NIST recommends ten finite fields, five of which are binary fields, for use in the ECDSA [25]. For each field a specific curve, along with a method for generating a pseudo-random curve, is supplied. These curves have been intentionally selected for both cryptographic strength and efficient implementation.

Such a recommendation has significant implications for design choices made while implementing elliptic curve cryptographic functions. In standardizing specific fields for use in elliptic curve cryptography (ECC), NIST allows ECC implementations to be heavily optimized for curves over a single finite field. As a result, performance of the algorithm can be maximized and resource utilization, whether it be in code size for software or logic gates for hardware, can be minimized.

1.2 Scope of the Work and Objectives

Presented in this thesis are hardware architectures for multiplication, squaring and inversion over binary finite fields. Each of these architectures is optimized for a specific finite field with the intent that it might be implemented for any of the five NIST recommended binary curves. These finite field arithmetic units are then integrated together along with control logic to create an elliptic curve cryptographic co-processor capable of computing the scalar multiple of an elliptic curve point. While the co-processor supports all curves over a single binary field, it is optimized for the special Koblitz curves [18].

To demonstrate the feasibility and efficiency of both the finite field arithmetic units and the elliptic curve cryptographic co-processor, the latter has been implemented in hardware using a field programmable gate array (FPGA). The design was synthesized, timed and then demonstrated on a physical board holding an FPGA.

The objectives of the work presented in this thesis are twofold: first, to develop high performance hardware finite field arithmetic units with low resource requirements; second, to integrate the arithmetic units into an efficient hardware elliptic curve scalar multiplier.

1.3 Thesis Organization

This thesis is organized as follows. Chapter 2 gives an overview of the basic mathematical concepts used in elliptic curve cryptography. This chapter also provides an introduction to the hardware/software system used to implement the elliptic curve scalar multiplier. Chapter 3 presents efficient hardware architectures for finite field multiplication and squaring. A method for high speed inversion is also discussed. In Chapter 4 a hardware architecture of an elliptic curve scalar multiplier is presented. This architecture uses the multiplication, squaring and inversion methods discussed in Chapter 3. Finally Chapter 5 provides concluding remarks and a summary of the research contributions documented in this thesis.

Chapter 2 Background

The fundamental building block for any elliptic curve-based cryptosystem is elliptic curve scalar multiplication. It is this operation that will be implemented. Provided in this chapter is an overview of the mathematics behind elliptic curve scalar multiplication as well as an introduction to FPGA technology which will be used in the implementation. The chapter is organized as follows. First comes an introduction to concepts in abstract algebra, including groups and fields. Next is an overview of arithmetic over binary finite fields, followed by a discussion of arithmetic over elliptic curve groups. The chapter concludes with a brief description of the implementation media used to prototype the elliptic curve scalar multiplier.

2.1 Mathematical Background

Elliptic curve cryptography is built on two underlying algebraic structures. They are finite groups and finite fields. This first section provides an introduction to these concepts. The definitions and theorems have been gathered from [9], [23] and [29] and are given without proof. These texts as well as [4] and [21] provide further discussion of the mathematics behind elliptic curve cryptography.

2.1.1 Groups

Definition 1. Let G be a set. A binary operation on G is a function that assigns each ordered pair of elements in G an element in G.

Definition 2. An algebraic group (G, ∗) is defined by a nonempty set G and a binary operation ∗. (G, ∗) is said to be a group if the following properties hold:
• Closure: For all elements a, b ∈ G, element (a ∗ b) ∈ G.
• Associativity: For all elements a, b, c ∈ G, (a ∗ b) ∗ c = a ∗ (b ∗ c).
• Identity: There exists an element e ∈ G such that for any element a ∈ G, a ∗ e = e ∗ a = a. The element e is referred to as the group identity.
• Inverse: For every element a ∈ G, there exists an inverse $b = a^{-1}$ which is also an element of G and satisfies a ∗ b = b ∗ a = e.

Definition 3. If for all elements a, b ∈ G, a ∗ b = b ∗ a, then G is a commutative or Abelian group.

Theorem 1. There is a single identity in every group G.

Example: The integers form a group under addition. The group (Z, +) possesses the properties listed in Definition 2 and has the identity e = 0.

Example: The set of non-zero integers under multiplication does not form a group. $(\mathbb{Z}^*, \cdot)$ possesses all the properties of a group except one. Elements 1 and −1 are the only elements whose multiplicative inverse is also in $\mathbb{Z}^*$. Element 2, for example, has inverse $1/2 \notin \mathbb{Z}^*$.

Definition 4. The order of G, denoted as |G|, is the number of elements in the set G.

Definition 5. The order of element g ∈ G, denoted as |g|, is defined to be the smallest positive integer t such that $g^t = e$.

Definition 6. Element g ∈ G is said to be a generator of G if every element in G can be expressed by $g^i$ for some integer i. Then |g| = |G|.

Example: Consider the group defined by the set $G = \mathbb{Z}_5^* = \{1, 2, 3, 4\}$ under multiplication. Then the order of the group is |G| = 4. Since

$$2^0 \bmod 5 = 1, \quad 2^1 \bmod 5 = 2, \quad 2^2 \bmod 5 = 4, \quad 2^3 \bmod 5 = 3, \quad 2^4 \bmod 5 = 1$$

the order of element 2 is 4. And since

$$4^0 \bmod 5 = 1, \quad 4^1 \bmod 5 = 4, \quad 4^2 \bmod 5 = 1$$

the order of element 4 is 2. Note that element 2 is a generator of the group but 4 is not.
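The order computation in this example is easy to check mechanically. The following is a minimal Python sketch, not part of the thesis; the function names `element_order` and `is_generator` are chosen here for illustration, and the code assumes p is prime and g is not a multiple of p.

```python
def element_order(g, p):
    """Return the smallest t >= 1 with g**t == 1 (mod p).

    Assumes p is prime and g is not a multiple of p, so g lies in Z_p^*.
    """
    t, power = 1, g % p
    while power != 1:
        power = (power * g) % p
        t += 1
    return t


def is_generator(g, p):
    """g generates Z_p^* exactly when its order equals |Z_p^*| = p - 1."""
    return element_order(g, p) == p - 1


for g in (2, 4):
    # Prints "2 4 True" and then "4 2 False", matching the example.
    print(g, element_order(g, 5), is_generator(g, 5))
```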

2.1.2 Finite Fields

A finite field can be viewed as a finite set equipped with two binary operations, usually called addition and multiplication, each of which gives a group structure (multiplication on the non-zero elements only). More specifically,

Definition 7. (F, +, ·) is a field if the following properties hold:
• The elements of F form a group under addition.
• The non-zero elements of F form a group under multiplication.
• The addition and multiplication operations are commutative, i.e. a + b = b + a and ab = ba for all a, b ∈ F.
• The multiplication operation can be distributed through the addition operation, i.e. a(b + c) = ab + ac for all a, b, c ∈ F.

Definition 8. A field F with a finite number of elements is a finite field.

Definition 9. The order of a field F is the number of elements in F.

Definition 10. A generator of the non-zero elements of a finite field F is said to be a primitive element or generator of F.

Definition 11. The characteristic of a finite field is the smallest positive integer j such that

$$\underbrace{1 + 1 + \cdots + 1}_{j \text{ times}} \equiv 0.$$

Example: Consider the field GF(7) containing the elements 0, 1, 2, 3, 4, 5 and 6. The order of the field is 7 and the characteristic is also 7 since

$$\underbrace{1 + 1 + 1 + 1 + 1 + 1 + 1}_{7 \text{ times}} \equiv 0 \pmod 7.$$

Element 3 generates GF(7) as shown below.

$$3^0 \equiv 1, \quad 3^1 \equiv 3, \quad 3^2 \equiv 2, \quad 3^3 \equiv 6, \quad 3^4 \equiv 4, \quad 3^5 \equiv 5, \quad 3^6 \equiv 1 \pmod 7$$

Definition 12. A unique finite field exists for every prime-power order (unique in the sense that all fields of a specific prime-power order are isomorphic). These fields are denoted GF($p^m$) where p is prime and m is a positive integer.

In cryptographic applications, two types of fields are commonly used. They are
• Prime Fields: GF(p) where p is large
• Binary Fields: GF($2^m$) where m is large

The architectures described in the following chapters perform arithmetic over binary finite fields. Attention will be focused exclusively on this specific case for the duration of this thesis.

Element Representation: The binary field GF($2^m$) contains $2^m$ elements. Precisely how each element is represented is defined by the basis being used. The two most common representations are polynomial basis and normal basis. The work discussed in this thesis uses polynomial basis.

Let GF(2)[x] denote the set of polynomials over GF(2). Then for any irreducible polynomial

$$F(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_2 x^2 + f_1 x + 1$$

with $f_i \in$ GF(2), GF(2)[x]/F(x) is a finite field with $2^m$ elements [23]. Since the field of order $2^m$ is unique up to isomorphism, the elements of the binary field GF($2^m$) can be uniquely represented by the set of polynomials over GF(2) of degree less than m. Furthermore, field addition is performed by adding two such polynomials over GF(2). Field multiplication is performed by straightforward multiplication of two polynomials and reducing mod F(x). The irreducible polynomial F(x) is often referred to as the reduction polynomial or field polynomial.

Example: Consider the field GF($2^3$) with the irreducible polynomial $F(x) = x^3 + x + 1$. The elements of the field are contained in the set

$$\{0,\ 1,\ x,\ x + 1,\ x^2,\ x^2 + 1,\ x^2 + x,\ x^2 + x + 1\}$$

The element x + 1 generates GF($2^3$) as shown below.

$$\begin{aligned}
(x + 1)^0 &\equiv 1 \pmod{F(x)} \\
(x + 1)^1 &\equiv x + 1 \pmod{F(x)} \\
(x + 1)^2 &\equiv x^2 + 1 \pmod{F(x)} \\
(x + 1)^3 &\equiv x^2 \pmod{F(x)} \\
(x + 1)^4 &\equiv x^2 + x + 1 \pmod{F(x)} \\
(x + 1)^5 &\equiv x \pmod{F(x)} \\
(x + 1)^6 &\equiv x^2 + x \pmod{F(x)} \\
(x + 1)^7 &\equiv 1 \pmod{F(x)}
\end{aligned}$$

The characteristic of the field is two since

$$1 + 1 \equiv 0 \pmod 2.$$
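The GF($2^3$) example can be reproduced with a few lines of Python. This is a minimal sketch under one stated assumption that is not from the thesis: a polynomial is stored as an integer whose bit i holds the coefficient of $x^i$. The helper names (`clmul`, `poly_mod`, `gf_mul`, `powers_of`) are illustrative. Multiplication follows the description above literally: multiply the polynomials over GF(2), then reduce mod F(x).

```python
F = 0b1011  # F(x) = x^3 + x + 1; bit i holds the coefficient of x^i


def clmul(a, b):
    """Multiply two polynomials over GF(2) (carry-less multiplication)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # add (XOR) the current shifted copy of a
        a <<= 1
        b >>= 1
    return result


def poly_mod(a, f):
    """Reduce a modulo f by repeatedly cancelling the leading term of a."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def gf_mul(a, b, f=F):
    """Field multiplication in GF(2)[x]/F(x)."""
    return poly_mod(clmul(a, b), f)


def powers_of(g, f=F):
    """List g^0, g^1, ..., g^(2^m - 2) where m is the degree of f."""
    m = f.bit_length() - 1
    out, acc = [], 1
    for _ in range(2**m - 1):
        out.append(acc)
        acc = gf_mul(acc, g, f)
    return out


# x + 1 is 0b011; the list reads 1, x+1, x^2+1, x^2, x^2+x+1, x, x^2+x.
print([format(e, "03b") for e in powers_of(0b011)])
```

Because the seven powers are all distinct and non-zero, x + 1 is a generator in the sense of Definition 10.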

NIST recommends the fields GF($2^{163}$), GF($2^{233}$), GF($2^{283}$), GF($2^{409}$) and GF($2^{571}$) for use in the Elliptic Curve Digital Signature Algorithm (ECDSA). These fields and corresponding reduction polynomials are listed in Table 2.1. Note that each of the reduction polynomials listed in the table is either a trinomial or a pentanomial. Also, note that the second-highest non-zero term of each polynomial has a relatively small degree when compared to the degree of the whole polynomial. Polynomials were chosen with these properties in order to benefit the resulting implementation of finite field arithmetic.

Table 2.1: NIST Recommended Finite Fields

Field           Reduction Polynomial
GF(2^163)       F(x) = x^163 + x^7 + x^6 + x^3 + 1
GF(2^233)       F(x) = x^233 + x^74 + 1
GF(2^283)       F(x) = x^283 + x^12 + x^7 + x^5 + 1
GF(2^409)       F(x) = x^409 + x^87 + 1
GF(2^571)       F(x) = x^571 + x^10 + x^5 + x^2 + 1

2.2 Arithmetic over Binary Finite Fields

The elements of the binary field GF($2^m$) are interrelated through the operations of addition and multiplication. Since every field element has an additive inverse and every non-zero element has a multiplicative inverse, the subtraction and division operations are also defined. Discussed in this section are basic methods for computing the sum, difference and product of two elements. Also presented is a method for computing the inverse of an element. The inverse, along with a multiplication, is used to implement division.

Addition and Subtraction: If we define the field elements a, b ∈ GF($2^m$) to be the polynomials $A(x) = a_{m-1}x^{m-1} + \cdots + a_1 x + a_0$ and $B(x) = b_{m-1}x^{m-1} + \cdots + b_1 x + b_0$ respectively, then their sum is written

$$S(x) = A(x) + B(x) = \sum_{i=0}^{m-1} (a_i + b_i)x^i. \tag{2.1}$$

Working in a field of characteristic two provides two distinct advantages. First, the bit additions $a_i + b_i$ in (2.1) are performed modulo 2 and translate to an exclusive-OR (XOR) operation. The entire addition is computed by a component-wise XOR operation and does not require a carry chain. The second advantage is that in GF(2) the element 1 is its own additive inverse (i.e. 1 + 1 = 0 or 1 = −1). It can be concluded then that addition and subtraction are equivalent.
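As a small illustration of (2.1), the following Python sketch compares the coefficient-by-coefficient sum with a single bitwise XOR. It assumes, for illustration only, that a field element is stored as an integer whose bit i is the coefficient of $x^i$; the function names are hypothetical.

```python
def gf2m_add(a, b):
    """Field addition in GF(2^m): a component-wise XOR with no carries."""
    return a ^ b


def add_coefficientwise(a, b, m):
    """Equation (2.1) spelled out: add matching coefficients modulo 2."""
    s = 0
    for i in range(m):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        s |= ((a_i + b_i) % 2) << i
    return s


a, b = 0b1101, 0b0111                 # x^3 + x^2 + 1 and x^2 + x + 1
total = gf2m_add(a, b)                # x^3 + x, i.e. 0b1010
assert total == add_coefficientwise(a, b, 4)
assert gf2m_add(total, b) == a        # adding b again undoes the addition
```

The last assertion shows why subtraction needs no separate circuit: adding b a second time removes it.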

2.2.1 Multiplication

The product of field elements a and b is written as

$$P(x) = A(x) \times B(x) \bmod F(x) = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_i b_j x^{i+j} \bmod F(x)$$

where F(x) is the field reduction polynomial. By expanding B(x) and distributing A(x) through its terms we get

$$P(x) = \left( b_{m-1}x^{m-1}A(x) + \cdots + b_1 x A(x) + b_0 A(x) \right) \bmod F(x).$$

By repeatedly grouping multiples of x and factoring out x we get

$$P(x) = \left( \cdots \left( \left( A(x)b_{m-1} \right)x + A(x)b_{m-2} \right)x + \cdots + A(x)b_1 \right)x + A(x)b_0 \bmod F(x). \tag{2.2}$$

Starting with the innermost parenthesis and moving out, Algorithm 1 performs the computation required to compute the right hand side of (2.2). This algorithm can be used to compute the product of a and b.
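A software sketch of the bit-level procedure behind (2.2) is given below in Python. It is an illustration rather than the thesis hardware: elements are assumed to be stored as integers with bit i holding the coefficient of $x^i$, the name `bit_level_mul` is hypothetical, and the default field is GF($2^{163}$) with the NIST reduction polynomial from Table 2.1. Since P(x) has degree below m before each shift, one conditional XOR with F(x) suffices for the reduction step.

```python
M = 163
# F(x) = x^163 + x^7 + x^6 + x^3 + 1 (Table 2.1)
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def bit_level_mul(a, b, f=F, m=M):
    """Compute A(x) * B(x) mod F(x) one bit of B(x) at a time.

    Both a and b must have degree less than m.
    """
    p = 0
    for i in range(m - 1, -1, -1):
        p <<= 1                 # P(x) <- x * P(x)
        if (p >> m) & 1:        # degree reached m, so subtract F(x) once
            p ^= f
        if (b >> i) & 1:        # if b_i == 1 then P(x) <- P(x) + A(x)
            p ^= a
    return p


# x * x^162 = x^163, which reduces to x^7 + x^6 + x^3 + 1.
assert bit_level_mul(0b10, 1 << 162) == 0b11001001
```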

Algorithm 1 Bit-Level Multiplication
Input: A(x), B(x), and F(x)
Output: P(x) = A(x) × B(x) mod F(x)
  P(x) ← 0;
  for i = m − 1 downto 0 do
    P(x) ← xP(x) mod F(x);
    if (b_i == 1) then
      P(x) ← P(x) + A(x);

Many of the faster multiplication algorithms rely on the concept of group-level multiplication. Let g be an integer less than m and let $s = \lceil m/g \rceil$ (note that g is different here from previous usage). If we define the polynomials

$$B_i(x) = \begin{cases}
\displaystyle\sum_{j=0}^{g-1} b_{ig+j}\,x^j & 0 \le i \le s - 2, \\[2ex]
\displaystyle\sum_{j=0}^{(m \bmod g)-1} b_{ig+j}\,x^j & i = s - 1,
\end{cases}$$

then the product of a and b is written

$$P(x) = A(x)\left( x^{(s-1)g}B_{s-1}(x) + \cdots + x^g B_1(x) + B_0(x) \right) \bmod F(x).$$

In the derivation of equation (2.2) multiples of x were repeatedly grouped then factored out. This same grouping and factoring procedure will now be implemented for multiples of $x^g$, arriving at

$$P(x) = \left( \cdots \left( \left( A(x)B_{s-1}(x) \right)x^g + A(x)B_{s-2}(x) \right)x^g + \cdots \right)x^g + A(x)B_0(x) \bmod F(x)$$

which can be computed using Algorithm 2.

Algorithm 2 Group-Level Multiplication
Input: A(x), B(x), and F(x)
Output: P(x) = A(x)B(x) mod F(x)
  P(x) ← B_{s−1}(x)A(x) mod F(x);
  for k = s − 2 downto 0 do
    P(x) ← x^g P(x);
    P(x) ← B_k(x)A(x) + P(x) mod F(x);
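The group-level procedure can be sketched in Python as follows. This is an illustration under assumptions that are not from the thesis: elements are integers with bit i holding the coefficient of $x^i$, the helper names are hypothetical, and the reduction is done by a generic software routine rather than the dedicated logic developed in Chapter 3. The digits $B_k(x)$ are obtained by slicing b into g-bit groups, with the top group naturally shorter.

```python
def clmul(a, b):
    """Multiply two polynomials over GF(2) without reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(a, f):
    """Reduce a modulo f by repeatedly cancelling the leading term of a."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def digits(b, m, g):
    """Split B(x) into s = ceil(m / g) digits B_0, ..., B_{s-1} of g bits."""
    s = -(-m // g)
    mask = (1 << g) - 1
    return [(b >> (i * g)) & mask for i in range(s)]


def group_level_mul(a, b, f, m, g):
    """Compute A(x) * B(x) mod F(x), processing g bits of B(x) per step."""
    B = digits(b, m, g)
    p = poly_mod(clmul(B[-1], a), f)          # P <- B_{s-1} * A mod F
    for k in range(len(B) - 2, -1, -1):
        p <<= g                               # P <- x^g * P
        p = poly_mod(clmul(B[k], a) ^ p, f)   # P <- B_k * A + P mod F
    return p


F163 = (1 << 163) | 0b11001001  # x^163 + x^7 + x^6 + x^3 + 1
assert group_level_mul(0b10, 1 << 162, F163, 163, 4) == 0b11001001
```

The loop runs s − 1 times instead of m times, which is the source of the speed-up: a larger g trades more logic per step for fewer steps.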

2.2.2 Inversion

For any non-zero element a ∈ GF($2^m$) the equality $a^{2^m - 1} \equiv 1$ holds. Since a ≠ 0, dividing both sides by a results in $a^{2^m - 2} \equiv a^{-1}$. Using this equality the inverse, $a^{-1}$, can be computed through successive field squarings and multiplications. In Algorithm 3 the inverse of an element is computed using this method. The primary advantage to this inversion method is the fact that it does not require hardware dedicated specifically to inversion. The field multiplier can be used to perform all required field operations.

Algorithm 3 Inversion by Square and Multiply
Input: Field element a
Output: b ≡ a^(−1)
  b ← a;
  for i = 1 to m − 2 do
    b ← b^2 ∗ a;
  b ← b^2;
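A Python sketch of Algorithm 3 follows. It is illustrative only: the integer encoding (bit i is the coefficient of $x^i$) and the names `gf_mul` and `gf_inv` are assumptions made here, and squaring is done with the general multiplier, as the text suggests is possible. After the loop, b equals $a^{2^{m-1}-1}$, so the final squaring yields $a^{2^m-2}$.

```python
M = 163
F = (1 << 163) | 0b11001001  # x^163 + x^7 + x^6 + x^3 + 1


def gf_mul(a, b, f=F, m=M):
    """Bit-level field multiplication (see Algorithm 1)."""
    p = 0
    for i in range(m - 1, -1, -1):
        p <<= 1
        if (p >> m) & 1:
            p ^= f
        if (b >> i) & 1:
            p ^= a
    return p


def gf_inv(a, f=F, m=M):
    """Return a^(-1) = a^(2^m - 2) by repeated squaring and multiplying."""
    if a == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    b = a
    for _ in range(1, m - 1):                    # i = 1 .. m - 2
        b = gf_mul(gf_mul(b, b, f, m), a, f, m)  # b <- b^2 * a
    return gf_mul(b, b, f, m)                    # b <- b^2


a = 0x123456789ABCDEF
assert gf_mul(a, gf_inv(a)) == 1
```

The cost is m − 1 squarings and m − 2 multiplications, which is why Chapter 3 goes on to consider faster inversion methods.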

2.3 Arithmetic over the Elliptic Curve Group

The field operations discussed in the previous section are used to perform arithmetic over an elliptic curve. This thesis is aimed at the elliptic curve defined by the non-supersingular Weierstrass equation for binary fields. This curve is defined by the equation

$$y^2 + xy = x^3 + \alpha x^2 + \beta \tag{2.3}$$

where the variables x and y are elements of the field GF($2^m$) as are the curve parameters α and β. The points on the curve, defined by the solutions, (x, y), to (2.3) form an additive group when combined with the "point at infinity". This extra point is the group identity and is denoted by the symbol O. By definition, the addition of two elements in a group results in another element of the group. As a result any point on the curve, say P, can be added to itself an arbitrary number of times and the result will also be a point on the curve. So for any positive integer k and point P, adding P to itself k − 1 times results in the point

$$kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}.$$

Given the binary expansion $k = 2^{l-1}k_{l-1} + 2^{l-2}k_{l-2} + \cdots + 2k_1 + k_0$ the scalar multiple kP can be computed by

$$Q = kP = 2^{l-1}k_{l-1}P + 2^{l-2}k_{l-2}P + \cdots + 2k_1 P + k_0 P.$$

By factoring out 2, the result is

$$Q = 2\left( 2^{l-2}k_{l-1}P + 2^{l-3}k_{l-2}P + \cdots + k_1 P \right) + k_0 P.$$

By repeating this operation it is seen that

$$Q = 2\left( \cdots 2\left( 2\left( k_{l-1}P \right) + k_{l-2}P \right) + \cdots + k_1 P \right) + k_0 P$$

which can be computed by the well known (left-to-right) double and add method for scalar multiplication shown in Algorithm 4.

Algorithm 4 Scalar Multiplication by Double and Add Method
Input: Integer k = (k_{l−1}, k_{l−2}, ..., k_1, k_0)_2, Point P
Output: Point Q = kP
  Q ← O;
  if (k_{l−1} == 1) then
    Q ← P;
  for i = l − 2 downto 0 do
    Q ← DOUBLE(Q);
    if (k_i == 1) then
      Q ← ADD(Q, P);
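The control flow of the double and add method does not depend on the group, so it can be sketched in Python with the group operations passed in as functions. Everything named here (`double_and_add` and its parameters) is hypothetical. To keep the sketch self-contained it is exercised on a stand-in group, the integers modulo n under addition, where the scalar multiple kP is simply k·P mod n; an elliptic curve implementation would supply point ADD and DOUBLE instead.

```python
def double_and_add(k, P, l, add, double, identity):
    """Left-to-right double and add over the l-bit scalar k."""
    Q = identity
    if (k >> (l - 1)) & 1:          # leading bit k_{l-1}
        Q = P
    for i in range(l - 2, -1, -1):
        Q = double(Q)
        if (k >> i) & 1:
            Q = add(Q, P)
    return Q


# Stand-in group for illustration: (Z_n, +) with identity 0.
n = 1009
result = double_and_add(
    k=0b101101, P=5, l=6,
    add=lambda a, b: (a + b) % n,
    double=lambda a: (2 * a) % n,
    identity=0,
)
assert result == (45 * 5) % n
```

For an l-bit scalar the loop performs l − 1 doublings and one addition per non-zero bit after the leading one, which is the cost model the recoding techniques of Chapter 4 aim to improve.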

Two basic operations required for elliptic curve scalar multiplication are point ADD and point DOUBLE. The mathematical definitions for these operations are derived from the curve equation in (2.3). Consider the points $P_1$ and $P_2$ represented by the coordinate pairs $(x_1, y_1)$ and $(x_2, y_2)$ respectively. Then the coordinates, $(x_a, y_a)$, of point $P_a = P_1 + P_2$ (or ADD($P_1$, $P_2$)) are computed using the equations

$$\begin{aligned}
x_a &= \left( \frac{y_1 + y_2}{x_1 + x_2} \right)^2 + \frac{y_1 + y_2}{x_1 + x_2} + x_1 + x_2 + \alpha \\
y_a &= \left( \frac{y_1 + y_2}{x_1 + x_2} \right)(x_1 + x_a) + x_a + y_1.
\end{aligned}$$

Similarly the coordinates $(x_d, y_d)$ of point $P_d = 2P_1$ (or DOUBLE($P_1$)) are computed using the equations

$$\begin{aligned}
x_d &= x_1^2 + \frac{\beta}{x_1^2} \\
y_d &= x_1^2 + \left( x_1 + \frac{y_1}{x_1} \right)x_d + x_d.
\end{aligned}$$

So the addition of two points can be computed using two field multiplications, one field squaring, eight field additions and one field inversion. The double of a point can be computed using two field multiplications, one field squaring, six field additions and one field inversion.
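The ADD and DOUBLE equations can be tried out on a toy curve. The Python sketch below makes several assumptions that are not from the thesis: it uses the small field GF($2^4$) with reduction polynomial $x^4 + x + 1$, arbitrarily chosen curve parameters, integers as polynomial encodings, and `None` for the point at infinity O. It also handles the special cases that the affine equations leave implicit ($P_1 = -P_2$, where $-(x, y) = (x, x + y)$, and doubling a point with x = 0).

```python
M, F = 4, 0b10011             # toy field GF(2^4), F(x) = x^4 + x + 1
ALPHA, BETA = 0b1000, 0b1001  # arbitrary parameters; BETA must be non-zero


def mul(a, b):
    """Bit-level field multiplication in GF(2^4)."""
    p = 0
    for i in range(M - 1, -1, -1):
        p <<= 1
        if (p >> M) & 1:
            p ^= F
        if (b >> i) & 1:
            p ^= a
    return p


def inv(a):
    """a^(2^M - 2) by square and multiply; a must be non-zero."""
    b = a
    for _ in range(M - 2):
        b = mul(mul(b, b), a)
    return mul(b, b)


def on_curve(P):
    """Check y^2 + xy = x^3 + ALPHA x^2 + BETA; None stands for O."""
    if P is None:
        return True
    x, y = P
    x2 = mul(x, x)
    return mul(y, y) ^ mul(x, y) == mul(x2, x) ^ mul(ALPHA, x2) ^ BETA


def negate(P):
    return None if P is None else (P[0], P[0] ^ P[1])


def double(P):
    if P is None or P[0] == 0:          # a point with x = 0 is its own negative
        return None
    x1, y1 = P
    x1_sq = mul(x1, x1)
    xd = x1_sq ^ mul(BETA, inv(x1_sq))
    yd = x1_sq ^ mul(x1 ^ mul(y1, inv(x1)), xd) ^ xd
    return (xd, yd)


def add(P1, P2):
    if P1 is None:
        return P2
    if P2 is None:
        return P1
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2:                        # same point, or P2 = -P1
        return double(P1) if y1 == y2 else None
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    xa = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ ALPHA
    ya = mul(lam, x1 ^ xa) ^ xa ^ y1
    return (xa, ya)


points = [(x, y) for x in range(16) for y in range(16) if on_curve((x, y))]
assert all(on_curve(add(P, Q)) for P in points for Q in points)
```

The closing assertion checks closure: every sum of two curve points is again on the curve or is O.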

2.4 Implementation Media

In the end, the goal of this work is to implement the field and group arithmetic described above using hardware. This can be done using two different hardware technologies. They are:
• Application Specific Integrated Circuits (ASICs)
• Field Programmable Gate Arrays (FPGAs)

ASICs are typically used when a design is mass produced or when performance is of the utmost importance. FPGAs, on the other hand, lend themselves nicely to research work where a design is being prototyped. The following attributes of the FPGA design flow are particularly advantageous.
1. Relatively small initial setup cost. A single FPGA is inexpensive when compared to the manufacturing cost of an ASIC design.
2. Simplified implementation flow. In most cases, the FPGA vendor (such as Xilinx or Altera) will provide a fully integrated tool flow. This flow will have been fully tested for compatibility with the FPGA and as a result fewer tool related problems can be expected.
3. Fast turnaround time. An FPGA can be programmed in less than a minute and can also be reprogrammed many times. An ASIC on the other hand may take months to fabricate.
4. Simplified integration. Whether using an ASIC or FPGA design flow, the design must be integrated into a hardware/software system. It is common for FPGAs to be sold within such a system, minimizing the integration task required of the designer.

It makes sense that most other ECC prototypes have been implemented using FPGA technology. By following suit, results can be more easily compared to those of previously reported work. The following section provides an overview of the fundamental principles on which FPGAs are based. Introduced next is the Rapid-Prototyping Platform which includes the FPGA and hardware/software system used to prototype the design discussed in this thesis.

2.4.1 Field Programmable Gate Arrays

An FPGA, or field programmable gate array, is an integrated circuit consisting of

• Configurable Logic Blocks (CLBs),
• Input/Output Blocks (IOBs) and
• programmable interconnect.

Configurable Logic Blocks: A typical Configurable Logic Block (CLB) is composed of both combinational and sequential logic. The combinational logic can be configured to create any of a number of possible boolean functions. Flip-flops are provided to support sequential logic and can be utilized or bypassed depending on the configuration. Figure 2.1 shows an example CLB with 8 inputs and 2 outputs. The blocks F, G and H are programmable functions which can be configured to perform any one of a number of different boolean functions. The functions are typically implemented with either look-up tables (LUTs) or logic gates. The actual number of possible boolean functions depends on the implementation. The multiplexors are used to configure the interconnect inside the CLB.

Figure 2.1: Functionality of a CLB

Input/Output Blocks: The Input/Output Blocks (IOBs) are blocks used to connect internal nets to external pins or pads on the FPGA. These blocks control the direction of the signal and can also register both input and output data. Figure 2.2 shows an example IOB.

Figure 2.2: Functionality of an IOB

Programmable Interconnect: An FPGA is made of many IOBs and CLBs. These blocks can be configured and connected together to achieve complex functionality. The connections between the blocks are made by the programmable interconnect. There are several ways in which the CLBs, IOBs and programmable interconnect are organized. One such organization is the symmetric array method. As shown in Figure 2.3, the CLBs are organized in a two dimensional array with IOBs around the perimeter. The programmable interconnect is routed in between the blocks.

Conﬁguring the FPGA: The conﬁguration of each CLB and IOB as well as the programmable interconnect is deﬁned when a design is loaded into the FPGA. The conﬁguration is typically stored in static RAM cells. This allows the conﬁguration to be preserved through reset of the FPGA while still providing the option of reconﬁguration.


Figure 2.3: CLB Organization

2.4.2 The Rapid-Prototyping Platform

The Rapid-Prototyping Platform (RPP) [6, 7] is a hardware/software system provided to Canadian universities by Canadian Microelectronics Corporation (CMC). The major hardware components included in the system are:

• ARM Integrator/AP,
• ARM Integrator/CM7TDMI and
• ARM Integrator/LM-SCV600E+.

The Integrator/CM7TDMI board contains a fully functional ARM7 core. The Integrator/LM-SCV600E+ board holds a Xilinx XCV2000E FPGA. The chips on these two boards communicate through the Integrator/AP board. The common bus between the ARM7 core and the Virtex FPGA is the AMBA Advanced High-performance Bus (AHB). In this system the ARM7 is the bus master and the design loaded onto the FPGA is the slave. In other words, the ARM7 core initiates bus transactions and the FPGA design responds to them.

The hardware and software design flows of the RPP are thoroughly documented in [6]. Provided here is a brief overview. The hardware flow, the more complicated of the two, is summarized in the following steps.

1. HDL (Hardware Description Language) coding. This is done in either VHDL or Verilog HDL.
2. Functional simulation and verification (Cadence Verilog XL).
3. Synthesis (Synopsys FPGA Compiler II).
4. Place/Route (Xilinx Foundation Software).
5. Static Timing Analysis (Xilinx Foundation Software).
6. Generate the bit file (Xilinx Foundation Software).
7. Download the bit file onto the FPGA.

If the design fails to pass static timing analysis, changes may need to be made to the HDL, in which case all the steps must be performed again. The software side is less complicated.

1. Write the driver code in C using the ARM Firmware Suite provided with the RPP software environment. The firmware suite provides read and write commands used to access address locations on the AHB bus.
2. Compile the code for the ARM7 core.
3. Download the code into memory on the ARM7 core.
4. Execute the code.

Chapter 3

High Performance Finite Field Arithmetic

In order to optimize the curve arithmetic discussed in Section 2.3, the underlying field operations must be implemented in a fast and efficient way. The required field arithmetic operations are addition, multiplication, squaring and inversion. Each of these operations has been implemented in hardware for use in the prototype discussed in Chapter 4. Generally speaking, field multiplication has the greatest effect on the performance of the entire elliptic curve scalar multiplication.¹ For this reason, focus will be primarily on the field multiplier when discussing hardware architectures for field arithmetic.

This chapter is organized as follows. Section 3.1 presents a hardware architecture designed to perform finite field multiplication. In Section 3.2 the ideas presented for multiplication are extended to create a hardware architecture optimized for squaring. Section 3.3 gives a method for inversion due to Itoh and Tsujii. This method does not require any additional hardware but instead uses the multiplication and squaring units described in Sections 3.1 and 3.2. Section 3.4 gives a description of a comparator/adder which both compares and adds finite field elements. Finally, Section 3.5 summarizes results gleaned from a hardware prototype of each arithmetic unit/routine.

¹ Inversion takes much longer than multiplication, but its effect on performance can be greatly reduced through use of projective coordinates. This is discussed in greater detail in Section 4.1.

3.1 Multiplication

Hardware/software architectures for ﬁeld multiplication can be roughly categorized into three groups. Bit Serial multipliers are based on Algorithm 1 on page 14 where each coeﬃcient of operand b is considered in a separate iteration of the for loop. Such an implementation is resource eﬃcient in that it can be implemented using an m-bit LFSR deﬁned by the reduction polynomial F (x) along with an m bit accumulator. The LFSR and accumulator are connected as shown in Figure 3.1. The disadvantage of such an architecture is the number of iterations required of the for loop. In hardware, the m iterations translate to a minimum of m clock cycles. In contrast, Bit Parallel multipliers complete a multiplication in a single iteration. All m-bits of both input operands are considered at the same time and the result is immediately generated. Unfortunately, such a multiplier cannot be implemented in software and may result in a costly design when implemented in hardware. The minimum clock period of such an implementation is also likely to be large. A compromise between these architectures is the Digit Serial multiplier. This multiplier is based on Algorithm 2 on page 15 and considers multiple coeﬃcients of operand b in each iteration.


Figure 3.1: LFSR Based Multiplier


A multiplication is completed in $\lceil m/g \rceil$ iterations and requires fewer resources than the bit parallel method. In [13] a digit serial multiplier is proposed which is based on look-up tables. This method was implemented in software for the field $\mathrm{GF}(2^{163})$ and reported in [16]. To the best of our knowledge, its performance of 0.540 µs for a single field multiplication is the fastest reported result for a software implementation. In this section the possibilities of using this look-up table-based algorithm in hardware will be explored. First to be described in this section is the algorithm used for multiplication. Then presented is a hardware structure designed to compute $R(x)W(x) \bmod F(x)$ where $R(x)$ and $W(x)$ are polynomials with degrees $g - 1$ and $m - 1$ respectively and g 41 had difficulty meeting timing at the target operating frequency of 66 MHz. Instead of spending time redesigning the field multiplier, a maximum digit size of 41 was selected.
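The trade-off between bit serial and digit serial multiplication can be sketched in software. The code below is an illustrative model, not the look-up table architecture of this section: it assumes the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 for GF(2^163), processes operand b most significant digit first, and returns the iteration count alongside the product. The names `clmul`, `reduce` and `mul_digit_serial` are hypothetical helpers.

```python
M = 163
# NIST reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1
F_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2), no reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce(p):
    """Reduce polynomial p modulo F_POLY."""
    while p.bit_length() > M:
        p ^= F_POLY << (p.bit_length() - 1 - M)
    return p


def mul_digit_serial(a, b, g):
    """Multiply a*b mod F(x), consuming g coefficients of b per iteration.

    Returns (product, iterations). g = 1 corresponds to the bit serial case.
    """
    iterations = -(-M // g)          # ceil(m / g)
    mask = (1 << g) - 1
    acc = 0
    for j in reversed(range(iterations)):
        digit = (b >> (j * g)) & mask
        # Shift the accumulator by one digit, add the partial product, reduce.
        acc = reduce((acc << g) ^ clmul(a, digit))
    return acc, iterations
```

For g = 1, 4, 28, 33 and 41 the iteration counts are 163, 41, 6, 5 and 4, matching the cycle counts reported in Table 3.1.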

Table 3.1: Performance/Cost Trade-off for Multiplication over GF(2^163)

Digit Size    Performance (clock cycles)    # LUTs    # Flip Flops
g = 1         163                           677       670
g = 4         41                            854       670
g = 28        6                             3,548     670
g = 33        5                             4,040     670
g = 41        4                             4,728     670

3.2 Squaring

While squaring is a specific case of general multiplication and can be performed by the multiplier, performance can be improved significantly by optimizing the architecture specifically for the case of squaring. The square of an element a represented by $A(x)$ involves two mathematical steps. The first is the polynomial multiplication of $A(x)$ resulting in

$$A^2(x) = a_{m-1}x^{2m-2} + \cdots + a_2 x^4 + a_1 x^2 + a_0.$$

The second is the reduction of this polynomial modulo $F(x)$. If the terms with degree greater than $m - 1$ are separated and $x^{m+1}$ is factored out where possible, the result will be

$$A^2(x) = A_h(x)x^{m+1} + A_l(x)$$

where

$$A_h(x) = a_{m-1}x^{m-3} + \cdots + a_{\left(\frac{m+3}{2}\right)}x^2 + a_{\left(\frac{m+1}{2}\right)}$$

$$A_l(x) = a_{\left(\frac{m-1}{2}\right)}x^{m-1} + \cdots + a_1 x^2 + a_0.$$

The polynomial $A_l(x)$ has degree less than m and does not need to be reduced. The product $A_h(x)x^{m+1}$ may have degree as large as $2m - 2$. The reduction polynomial gives us the equality $x^m = x^d + \cdots + 1$. Multiplying both sides by x, we get $x^{m+1} = x^{d+1} + \cdots + x$. So

$$A_h(x)x^{m+1} = A_h(x)\left(x^{d+1} + \cdots + x\right).$$


This multiplication can be performed using a method similar to the one described in Section 3.1. The same architecture used to compute $R(x)W(x) \bmod F(x)$ in the multiplier is used here to compute $x^{m+1}A_h(x)$. The digit size is set to $g = d + 2$ and the elements of the g-operand mod 2 adder are generated from $A_h(x)$. $A_h(x)$ is in turn generated by expanding $A(x)$ (i.e. inserting zeros between the coefficient bits of $A(x)$). Since the digit size is set to $d + 2$, the multiplication is completed in a single cycle. This method only works if $d + 2 < m$, which is the case for each of the NIST polynomials. Figure 3.7 shows the data flow for the squaring operation. Note that the flow does not include any buffers and so is implemented in pure combinational logic.

Figure 3.7: Data-Path of the Squaring Unit

The prototype of this squaring unit for the field $\mathrm{GF}(2^{163})$ using the NIST reduction polynomial runs at 66 MHz and is capable of performing a squaring operation in a single clock cycle. This implementation requires 330 LUTs and 328 Flip Flops.
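The split of $A^2(x)$ into $A_h(x)x^{m+1} + A_l(x)$ can be checked with a short software model. This is a sketch under stated assumptions: it uses the NIST polynomial x^163 + x^7 + x^6 + x^3 + 1 (so $x^{m+1} = x^8 + x^7 + x^4 + x$), and it uses a generic `reduce` helper for the final reduction instead of modeling the single-cycle hardware data path. All function names are hypothetical.

```python
M = 163
F_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
# x^(m+1) mod F(x), obtained by multiplying x^m = x^7 + x^6 + x^3 + 1 by x.
X_M_PLUS_1 = (F_POLY ^ (1 << M)) << 1


def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2), no reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce(p):
    """Reduce polynomial p modulo F_POLY."""
    while p.bit_length() > M:
        p ^= F_POLY << (p.bit_length() - 1 - M)
    return p


def spread_bits(a):
    """Insert a zero between the coefficient bits of A(x), giving A^2(x)."""
    r, i = 0, 0
    while a:
        if a & 1:
            r |= 1 << (2 * i)
        a >>= 1
        i += 1
    return r


def square(a):
    a2 = spread_bits(a)
    a_low = a2 & ((1 << M) - 1)   # A_l(x): terms of degree < m
    a_high = a2 >> (M + 1)        # A_h(x); the x^m coefficient is zero (m odd)
    return reduce(clmul(a_high, X_M_PLUS_1)) ^ a_low
```

Because m is odd, the expanded polynomial has no $x^m$ term, so shifting by m + 1 loses nothing.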

3.3 Inversion

The inversion method described in Algorithm 3 on page 16 requires $m - 1$ squarings and $m - 2$ multiplications. In order to accurately estimate the cycle performance of the inversion, consideration must be given to the performance of the multiplication and squaring units as well as the time required to load and unload these units. The architecture of the elliptic curve scalar multiplier will be discussed in detail in Chapter 4. For now, it is sufficient to know that the arithmetic units are loaded using two independent m bit data buses and unloaded using a single m bit data bus. The operands are stored in a dual port memory which takes two clock cycles to read from and one cycle to write to. Together, these account for the three cycles required to load and unload any arithmetic unit. Further analysis assumes that these three cycles remain constant for all m.

If $C_s$ and $C_m$ denote the number of clock cycles required to complete a squaring and multiplication respectively, then an inversion can be completed in

$$(C_s + 3)(m - 1) + (C_m + 3)(m - 2)$$

clock cycles. For the field $\mathrm{GF}(2^{163})$ where $C_s = 1$ and $C_m = 4$, this translates to 1775 clock cycles.

Performance can be improved by using Algorithm 6 due to Itoh and Tsujii [15]. This algorithm is derived from the equation

$$a^{-1} \equiv a^{2^m - 2} \equiv \left(a^{2^{m-1} - 1}\right)^2$$

which is true for any element $a \in \mathrm{GF}(2^m)$. From

$$a^{2^t - 1} \equiv \begin{cases} \left(a^{2^{t/2} - 1}\right)^{2^{t/2}} a^{2^{t/2} - 1} & \text{for } t \text{ even,} \\ a\left(a^{2^{t-1} - 1}\right)^2 & \text{for } t \text{ odd,} \end{cases} \tag{3.1}$$

the computation required for the exponentiation $a^{2^{m-1} - 1}$ can be iteratively broken down. Algorithm 6 requires $\lfloor\log_2(m - 1)\rfloor + H(m - 1) - 1$ multiplications and $m - 1$ squarings. Using the notation defined earlier, this translates to

$$(C_s + 3)(m - 1) + (C_m + 3)(\lfloor\log_2(m - 1)\rfloor + H(m - 1) - 1)$$

clock cycles. For $\mathrm{GF}(2^{163})$ this translates to 711 clock cycles.

Algorithm 6 Optimized Inversion by Square and Multiply [15]
Inputs: Field element a, binary representation of m − 1 = (m_{l−1}, . . . , m_1, m_0)_2
Output: b ≡ a^(−1)
    b ← a^(m_{l−1}); e ← 1;
    for i = l − 2 downto 0 do
        b ← b^(2^e) · b; e ← 2e;
        if (m_i == 1) then
            b ← b^2 · a; e ← e + 1;
    b ← b^2;
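Algorithm 6 can be transcribed almost line for line into software. The sketch below assumes the NIST reduction polynomial for GF(2^163) and a simple shift-and-add multiplier (`gf_mul`, a stand-in for the hardware multiplier); it counts multiplications and squarings so that the operation counts quoted above can be confirmed.

```python
M = 163
F_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^163) (software stand-in)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= F_POLY
    return r


def itoh_tsujii_inverse(a):
    """Return (a^-1, multiplications, squarings) following Algorithm 6."""
    bits = bin(M - 1)[2:]          # binary digits of m - 1, MSB first
    b, e = a, 1                    # leading bit is 1, so b = a^(2^1 - 1)
    n_mul = n_sqr = 0
    for bit in bits[1:]:
        t = b
        for _ in range(e):         # t = b^(2^e) by e repeated squarings
            t = gf_mul(t, t)
            n_sqr += 1
        b = gf_mul(t, b)           # b = a^(2^(2e) - 1)
        n_mul += 1
        e *= 2
        if bit == "1":
            b = gf_mul(gf_mul(b, b), a)   # b = a^(2^(e+1) - 1)
            n_sqr += 1
            n_mul += 1
            e += 1
    b = gf_mul(b, b)               # final squaring: a^(2^m - 2)
    n_sqr += 1
    return b, n_mul, n_sqr
```

For m = 163, m − 1 = 162 has binary form 10100010, so the routine performs 7 + 3 − 1 = 9 multiplications and 162 squarings.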


Now, the majority of the time spent for each squaring operation is used to load and unload the squaring unit (three out of the four cycles). Algorithm 6 requires several sequences of repetitive squaring (i.e. computations of the form $x^{2^t}$). These repeated squarings do not require intermediate values to be stored outside the squaring unit. By modifying the squaring unit to support the re-square of an element, most of the memory accesses otherwise required to load and unload the squaring unit are eliminated. In fact, the squaring unit only needs to be loaded and unloaded once for each multiplication. Hence the number of clock cycles is reduced to

$$\bigl(C_s(m - 1) + 3(\lfloor\log_2(m - 1)\rfloor + H(m - 1) - 1)\bigr) + (C_m + 3)(\lfloor\log_2(m - 1)\rfloor + H(m - 1) - 1)$$

clock cycles. For the field $\mathrm{GF}(2^{163})$ with $C_s = 1$ and $C_m = 4$, this results in 252 clock cycles. This is a competitive value since a typical hardware implementation of the Extended Euclidean Algorithm (EEA) is expected to complete an inversion in approximately 2m clock cycles, or 326 cycles for $\mathrm{GF}(2^{163})$. This corresponds to a reduction of 74 clock cycles, or roughly 23%, without requiring hardware dedicated specifically to inversion.

Table 3.2 lists the performance numbers of the previously mentioned inversion methods when implemented over the field $\mathrm{GF}(2^{163})$. The actual time to complete an inversion using the ECC co-processor architecture discussed in Chapter 4 is 259 clock cycles. The 7 extra cycles are due to control related instructions executed in the micro-sequencer.
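The three cycle estimates in this section follow directly from the formulas. The helper functions below are a hypothetical restatement of those formulas, assuming H(·) denotes the Hamming weight and a fixed load/unload overhead of three cycles per operation.

```python
def hamming_weight(n):
    return bin(n).count("1")


def itoh_tsujii_mults(m):
    """floor(log2(m - 1)) + H(m - 1) - 1 multiplications."""
    return (m - 1).bit_length() - 1 + hamming_weight(m - 1) - 1


def cycles_square_multiply(m, cs, cm, io=3):
    """Plain square and multiply: m - 1 squarings, m - 2 multiplications."""
    return (cs + io) * (m - 1) + (cm + io) * (m - 2)


def cycles_itoh_tsujii(m, cs, cm, io=3):
    """Itoh and Tsujii, every squaring loaded and unloaded separately."""
    return (cs + io) * (m - 1) + (cm + io) * itoh_tsujii_mults(m)


def cycles_itoh_tsujii_resquare(m, cs, cm, io=3):
    """Itoh and Tsujii with re-square: one load/unload per multiplication."""
    k = itoh_tsujii_mults(m)
    return cs * (m - 1) + io * k + (cm + io) * k
```

With m = 163, cs = 1 and cm = 4 these evaluate to 1775, 711 and 252 cycles respectively.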

Table 3.2: Comparison of Various Inversion Methods for GF(2^163)

Method                        # Squarings    # Multiplications                  # Cycles
Square & Multiply             m − 1          m − 2                              1775
Itoh & Tsujii                 m − 1          ⌊log2(m − 1)⌋ + H(m − 1) − 1       711
Itoh & Tsujii w/ re-square    m − 1          ⌊log2(m − 1)⌋ + H(m − 1) − 1       252
EEA                           -              -                                  326

3.4 Comparator/Adder

The primary purpose of the Comparator/Adder is to compute the sum of two ﬁeld elements. This is done with an array of m exclusive OR gates. To minimize register usage as well as time to complete the addition, the sum of the two operands is the only value stored in a register. In this way, the sum is available immediately after the operands are loaded into the Comparator/Adder. In other words, it takes zero clock cycles to complete a ﬁnite ﬁeld addition. In addition to computing the sum of two ﬁnite ﬁeld elements, the Comparator/Adder also acts as a comparator. The comparison is performed by taking the logical NOR of all the bits in the sum register. If the result is a one, then the sum is zero and the two operands are equal. If operand a is set to zero, then operand b can be tested for zero. The logic depth for the zero detect circuitry (the m-bit NOR gate) is log2 (m) and is registered before being sent out of the module. Figure 3.8 provides a functional diagram of the Comparator/Adder.
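In software terms, the behaviour of the Comparator/Adder reduces to a bitwise XOR followed by a zero test. The function below is a behavioural sketch only (its name and return convention are assumptions); it does not model registers or timing.

```python
def comparator_adder(a, b):
    """Return (sum, equal) for field elements stored as integer bit masks."""
    total = a ^ b            # m exclusive OR gates
    equal = total == 0       # wide NOR over all bits of the sum register
    return total, equal
```

Calling `comparator_adder(0, b)` tests b for zero, as described above.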


Figure 3.8: Data-Path of the Comparator/Adder

3.5 Concluding Remarks

In this chapter, we have discussed hardware architectures designed to perform finite field addition, multiplication and squaring. Also discussed was an efficient method for inversion which uses the squaring and multiplication units. The performance results associated with these arithmetic units are summarized in Table 3.3.

Table 3.3: Performance of Finite Field Operations

Operation         # Cycles (g = 41)    # Cycles Including Initial and Final Data Movement
Multiplication    4                    7
Squaring          1                    4
Addition          0                    3
Inversion         256                  259

Chapter 4

A Co-processor Architecture for ECC Scalar Multiplication

In the recent past, several articles have proposed various hardware architectures/accelerators for ECC. These elliptic curve cryptographic accelerators can be categorized into three functional groups. They are

1. Accelerators which use general purpose processors to implement curve operations but implement the finite field operations using hardware. References [2] and [32] are examples of this. Both of these implementations support the composite field $\mathrm{GF}(2^{155})$.

2. Accelerators which perform both the curve and field operations in hardware but use a small field size such as $\mathrm{GF}(2^{53})$. Architectures of this type include those proposed in [30] and [10]. In [30], a processor for the field $\mathrm{GF}(2^{168})$ is synthesized, but not implemented. Both works discuss methods to extend their implementation to a larger field size but do not actually do so.

3. Accelerators which perform both curve and field operations in hardware and use fields of cryptographic strength such as $\mathrm{GF}(2^{163})$. Processors in this category include [3, 12, 19, 26, 28].

The work discussed in this chapter falls into category three. The architectures proposed in [26] and [28] were the first reported cryptographic strength elliptic curve co-processors. Montgomery scalar multiplication with an LSD multiplier was used in [28]. In [26] a new field multiplier is developed and demonstrated in an elliptic curve scalar multiplier. In both [19] and [3] parameterized module generation is discussed. To the best of our knowledge the architecture proposed in [12] offers the fastest scalar multiplication using FPGA technology at 0.144 milliseconds. This architecture uses Montgomery scalar multiplication with López and Dahab's projective coordinates. They use a shift and add field multiplier but also compare LSD and Karatsuba multipliers.

In this chapter a hardware architecture for elliptic curve scalar multiplication is proposed. The architecture uses projective coordinates and is optimized for scalar multiplication over the Koblitz curves. The arithmetic routines discussed in Chapter 3 are used to perform the field arithmetic. This architecture has been implemented and demonstrated on an FPGA.

The chapter is organized as follows. Section 4.1 introduces projective coordinates and discusses some of the reasons for using a projective system. Section 4.2 presents two methods for recoding the scalar. They are non-adjacent form (NAF) and τ-adic non-adjacent form (τ-NAF). Then in Section 4.3 the ideas described in 4.1 and 4.2 are implemented in a co-processor architecture for scalar multiplication. The data path and different levels of control are outlined there. Section 4.4 discusses the prototype of the scalar multiplier. Finally, Section 4.5 concludes with results gathered from the prototype.

4.1 Projective Coordinates

Projective coordinates allow the inversion required by each DOUBLE and ADD to be eliminated at the expense of a few extra field multiplications. The benefit is measured by the ratio

$$\frac{\text{Time to Complete Inversion}}{\text{Time to Complete Multiplication}}. \tag{4.1}$$

The inversion algorithm proposed by Itoh and Tsujii [15] will be used and therefore the ratio in (4.1) is guaranteed to be larger than $\lfloor\log_2(m - 1)\rfloor$ and could be larger depending on the efficiency of the squaring operations. Therefore, projective coordinates will provide the best performance for NIST curves.

Several flavors of projective coordinates have been proposed over the last few years. The prominent ones are Standard [22], Jacobian [5, 14] and López & Dahab [20] projective coordinates. If the affine representation of P is denoted as $(x, y)$ and the projective representation of P is denoted as $(X, Y, Z)$, then the relation between affine and projective coordinates for the Standard system is

$$x = \frac{X}{Z} \quad \text{and} \quad y = \frac{Y}{Z}.$$

For Jacobian projective coordinates the relation is

$$x = \frac{X}{Z^2} \quad \text{and} \quad y = \frac{Y}{Z^3}.$$

Finally, for López & Dahab's system, the relation between affine and projective coordinates is

$$x = \frac{X}{Z} \quad \text{and} \quad y = \frac{Y}{Z^2}.$$

For López & Dahab's system the projective equation of the elliptic curve in (2.3) then becomes

$$Y^2 + XYZ = X^3 Z + \alpha X^2 Z^2 + \beta Z^4.$$

It is important to note that when using the left-to-right double and add method for scalar multiplication all point additions are of the form ADD(P, Q). The base point P is never modified and as a result will maintain its affine representation (i.e. $P = (x, y, 1)$). The constant Z coordinate significantly reduces the cost of point addition (from 14 field multiplications down to 10). The addition of two distinct points $(X_1, Y_1, Z_1) + (X_2, Y_2, 1) = (X_a, Y_a, Z_a)$ using mixed coordinates (one projective point and one affine point) is then computed by

$$\begin{aligned}
A &= Y_2 \cdot Z_1^2 + Y_1 & E &= A \cdot C \\
B &= X_2 \cdot Z_1 + X_1 & X_a &= A^2 + D + E \\
C &= Z_1 \cdot B & F &= X_a + X_2 \cdot Z_a \\
D &= B^2 \cdot (C + \alpha \cdot Z_1^2) & G &= X_a + Y_2 \cdot Z_a \\
Z_a &= C^2 & Y_a &= E \cdot F + Z_a \cdot G
\end{aligned} \tag{4.2}$$

Similarly, the double of a point $(X_1, Y_1, Z_1) + (X_1, Y_1, Z_1) = (X_d, Y_d, Z_d)$ is computed by

$$\begin{aligned}
Z_d &= Z_1^2 \cdot X_1^2 \\
X_d &= X_1^4 + \beta \cdot Z_1^4 \\
Y_d &= \beta \cdot Z_1^4 \cdot Z_d + X_d \cdot (\alpha \cdot Z_d + Y_1^2 + \beta \cdot Z_1^4)
\end{aligned} \tag{4.3}$$

In Table 4.1, the number of field operations required for the affine, Standard, Jacobian and López & Dahab coordinate systems are provided. In the table the symbols M, S, A and I denote field multiplication, squaring, addition and inversion respectively.

Table 4.1: Comparison of Projective Point Systems

System            Point Addition         Point Doubling
Affine            2M + 1S + 8A + 1I      3M + 2S + 4A + 1I
Standard          13M + 1S + 7A          7M + 5S + 4A
Jacobian          11M + 4S + 7A          5M + 5S + 4A
López & Dahab     10M + 4S + 8A          5M + 5S + 4A

The projective coordinate system defined by López and Dahab will be used since it offers the best performance for both point addition and point doubling.
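The mixed addition sequence (4.2) and the doubling sequence (4.3) can be cross-checked against the affine formulas of Section 2.3. The sketch below assumes a toy field GF(2^5) with reduction polynomial x^5 + x^2 + 1 and curve parameters α = β = 1, chosen only so the whole curve can be enumerated. Helper names such as `ld_mixed_add`, `ld_double` and `to_projective` are hypothetical, and degenerate inputs (x1 = x2 for addition, x1 = 0 for doubling) are not handled.

```python
M = 5
F_POLY = 0b100101          # toy reduction polynomial x^5 + x^2 + 1 (assumption)
ALPHA, BETA = 1, 1


def mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= F_POLY
    return r


def sqr(a):
    return mul(a, a)


def inv(a):
    r = 1
    for _ in range(2**M - 2):   # a^(2^M - 2), toy-field only
        r = mul(r, a)
    return r


def on_curve(x, y):
    return sqr(y) ^ mul(x, y) == mul(sqr(x), x) ^ mul(ALPHA, sqr(x)) ^ BETA


def affine_add(p, q):
    (x1, y1), (x2, y2) = p, q
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    xa = sqr(lam) ^ lam ^ x1 ^ x2 ^ ALPHA
    return xa, mul(lam, x1 ^ xa) ^ xa ^ y1


def affine_double(p):
    x1, y1 = p
    xd = sqr(x1) ^ mul(BETA, inv(sqr(x1)))
    return xd, sqr(x1) ^ mul(x1 ^ mul(y1, inv(x1)), xd) ^ xd


def ld_mixed_add(p1, q):
    """(X1, Y1, Z1) + (X2, Y2, 1), following the sequence in (4.2)."""
    (X1, Y1, Z1), (X2, Y2) = p1, q
    A = mul(Y2, sqr(Z1)) ^ Y1
    B = mul(X2, Z1) ^ X1
    C = mul(Z1, B)
    D = mul(sqr(B), C ^ mul(ALPHA, sqr(Z1)))
    Za = sqr(C)
    E = mul(A, C)
    Xa = sqr(A) ^ D ^ E
    F = Xa ^ mul(X2, Za)
    G = Xa ^ mul(Y2, Za)
    return Xa, mul(E, F) ^ mul(Za, G), Za


def ld_double(p1):
    """2 * (X1, Y1, Z1), following the sequence in (4.3)."""
    X1, Y1, Z1 = p1
    Zd = mul(sqr(Z1), sqr(X1))
    bz4 = mul(BETA, sqr(sqr(Z1)))
    Xd = sqr(sqr(X1)) ^ bz4
    return Xd, mul(bz4, Zd) ^ mul(Xd, mul(ALPHA, Zd) ^ sqr(Y1) ^ bz4), Zd


def to_projective(p, z):
    """Lift (x, y) to (xz, yz^2, z), an equivalent Lopez-Dahab representation."""
    return mul(p[0], z), mul(p[1], sqr(z)), z


def to_affine(p):
    X, Y, Z = p
    zi = inv(Z)
    return mul(X, zi), mul(Y, sqr(zi))


POINTS = [(x, y) for x in range(2**M) for y in range(2**M) if on_curve(x, y)]
```

Lifting a point with several different Z values before applying the projective routines exercises the formulas with Z1 ≠ 1, which is the case that matters during scalar multiplication.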

4.2 Scalar Multiplication using Recoded Integers

The binary expansion of an integer k is written as $k = \sum_{i=0}^{l-1} k_i 2^i$ where $k_i \in \{0, 1\}$. For the case of elliptic curve scalar multiplication the length l is approximately equal to m, the degree of the extension field. Assuming an average Hamming weight, a scalar multiplication will require approximately $l/2$ point additions and $l - 1$ point doubles. Several recoding methods have been proposed which in effect reduce the number of additions. In this section two methods are discussed: NAF [11, 31] and τ-adic NAF [18, 31].

4.2.1 Scalar Multiplication using Binary NAF

The symbols in the binary expansion are selected from the set $\{0, 1\}$. If this set is increased to $\{0, 1, -1\}$ the expansion is referred to as signed binary (SB) representation. When using this representation, the double and add scalar multiplication method must be slightly modified to handle the −1 symbol (often denoted $\bar{1}$). If the expansion $k_{l-1}2^{l-1} + \cdots + k_1 2 + k_0$ where $k_i \in \{0, 1, \bar{1}\}$ is denoted by $(k_{l-1}, \ldots, k_1, k_0)_{SB}$, then Algorithm 7 computes the scalar multiple of point P.

Algorithm 7 Scalar Multiplication for Signed Binary Representation
Input: Integer k = (k_{l−1}, k_{l−2}, . . . , k_1, k_0)_SB, Point P
Output: Point Q = kP
    Q ← O;
    if (k_{l−1} ≠ 0) then
        Q ← k_{l−1} P;
    for i = l − 2 downto 0 do
        Q ← DOUBLE(Q);
        if (k_i ≠ 0) then
            Q ← ADD(Q, k_i P);

The negative of the point $(x, y)$ is $(x, x + y)$ and can be computed with a single field addition. The signed binary representation is redundant in the sense that any given integer has more than one possible representation. For example, 9 can be represented by $(1001)_{SB}$ as well as $(101\bar{1})_{SB}$. Interest here is in a particular form of this signed binary representation called NAF or non-adjacent form. A signed binary integer is said to be in NAF if there are no adjacent non-zero symbols. The NAF of an integer is unique and it is guaranteed to be no more than one symbol longer than the corresponding binary expansion. The primary advantage gained from NAF is its reduced number of non-zero symbols. The average Hamming weight of a NAF is approximately $l/3$ [31] compared to that of the binary expansion which is $l/2$. As a result, the running time of elliptic curve scalar multiplication when using binary NAF is reduced to $(l + 1)/3$ point additions and l point doubles. This represents a significant reduction in run time. In [31], Solinas provides a straightforward method for computing the NAF of an integer. This method is given here in Algorithm 8.
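The NAF recoding and its use in signed double and add can be sketched in a few lines. In the code below, `naf` follows the recoding method attributed to Solinas, returning the symbols least significant first, and `scalar_multiply` mirrors Algorithm 7 with ordinary integers standing in for curve points (an assumption made purely so the example is self-checking: DOUBLE becomes multiplication by 2 and ADD becomes integer addition).

```python
def naf(k):
    """Non-adjacent form of a positive integer, least significant symbol first."""
    digits = []
    while k > 0:
        if k % 2 == 1:
            d = 2 - (k % 4)      # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits


def scalar_multiply(digits, p):
    """Signed double and add, with integers standing in for curve points."""
    q = 0                        # stand-in for the point at infinity
    for d in reversed(digits):   # most significant symbol first
        q = 2 * q                # DOUBLE(Q)
        if d != 0:
            q = q + d * p        # ADD(Q, +P) or ADD(Q, -P)
    return q
```

For example, `naf(7)` yields the symbols of 8 − 1, so a scalar with three adjacent ones needs only two non-zero symbols.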

Algorithm 8 Generation of Binary NAF [31]
Input: Positive integer k
Output: NAF(k), given by the symbols k_i
    i ← 0;
    while (k > 0) do
        if (k ≡ 1 (mod 2)) then
            k_i ← 2 − (k mod 4);
            k ← k − k_i;
        else
            k_i ← 0;
        k ← k/2;
        i ← i + 1;

4.2.2 Scalar Multiplication using τ-NAF

Anomalous Binary Curves (ABCs), first proposed for cryptographic use in [18], provide an efficient implementation when the scalar is represented as a complex algebraic number. ABCs, often referred to as the Koblitz curves, are defined by

$$y^2 + xy = x^3 + \alpha x^2 + 1 \tag{4.4}$$

with $\alpha = 0$ or $\alpha = 1$. The advantage provided by the Koblitz curves is that the DOUBLE operation in Algorithm 7 can be replaced with a second operation, namely Frobenius mapping, which is easier to perform. If point $(x, y)$ is on a Koblitz curve then it can be easily checked that $(x^2, y^2)$ is also on the same curve. Moreover, these two points are related by the following Frobenius mapping

$$\tau(x, y) = (x^2, y^2)$$

where τ satisfies the quadratic equation

$$\tau^2 + 2 = \mu\tau. \tag{4.5}$$

In (4.5), $\mu = (-1)^{1-\alpha}$ and α is the curve parameter in (4.4) and is 0 or 1 for the Koblitz curves. The integer k can be represented with radix τ using signed representation. In this case, the expansion is written

$$k = \kappa_{l-1}\tau^{l-1} + \cdots + \kappa_1\tau + \kappa_0,$$

where $\kappa_i \in \{0, 1, \bar{1}\}$. Using this representation, Algorithm 7 can be rewritten, replacing the DOUBLE(Q) operation with τQ or a Frobenius mapping of Q. The modified algorithm is shown in Algorithm 9. Since τQ is computed by squaring the coordinates of Q, this suggests a possible speed up over the DOUBLE and ADD method.

Algorithm 9 Scalar Multiplication for τ-adic Integers
Input: Integer k = (κ_{l−1}, κ_{l−2}, . . . , κ_1, κ_0)_τ, Point P
Output: Point Q = kP
    Q ← O;
    if (κ_{l−1} ≠ 0) then
        Q ← κ_{l−1} P;
    for i = l − 2 downto 0 do
        Q ← τQ;
        if (κ_i ≠ 0) then
            Q ← ADD(Q, κ_i P);

This complex representation of the integer can be improved further by computing its non-adjacent form. Solinas proved the existence of such a representation in [31] by providing an algorithm which computes the τ-adic non-adjacent form or τ-NAF of an integer. This algorithm is provided here in Algorithm 10. In most cases, the input to Algorithm 10 will be a binary integer, say k (i.e. $r_0 = k$ and $r_1 = 0$). If k has length l then TNAF(k) will have length 2l, roughly twice the length of NAF(k).

Algorithm 10 Generation of τ-adic NAF [31]
Input: r_0 + r_1 τ where r_0, r_1 ∈ Z
Output: u = TNAF(r_0 + r_1 τ)
    i ← 0;
    while (r_0 ≠ 0 or r_1 ≠ 0) do
        if (r_0 ≡ 1 (mod 2)) then
            u_i ← 2 − (r_0 − 2r_1 mod 4);
            r_0 ← r_0 − u_i;
        else
            u_i ← 0;
        t ← r_0;
        r_0 ← r_1 + µr_0/2;
        r_1 ← −t/2;
        i ← i + 1;
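Algorithm 10 translates directly into integer arithmetic. The sketch below returns the τ-NAF symbols least significant first; the companion `evaluate` function is a hypothetical helper that expands the symbols back into the form a + bτ using τ² = µτ − 2, so the recoding can be checked without any curve arithmetic.

```python
def tnaf(r0, r1, mu):
    """tau-adic NAF of r0 + r1*tau, least significant symbol first."""
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 % 2 == 1:
            u = 2 - ((r0 - 2 * r1) % 4)   # +1 or -1
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # Divide (r0 + r1*tau) by tau; r0 is even at this point.
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits


def evaluate(digits, mu):
    """Horner evaluation in Z[tau]; returns (a, b) meaning a + b*tau."""
    a, b = 0, 0
    for d in reversed(digits):
        # (a + b*tau) * tau + d, using tau^2 = mu*tau - 2
        a, b = -2 * b + d, a + mu * b
    return a, b
```

For instance, with µ = 1 the integer 3 recodes to τ⁵ + τ² − 1, six symbols for a two-bit input, which illustrates the roughly doubled length noted above.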

The length of the representation generated by Algorithm 10 can be reduced by either preprocessing the integer k, as is done in [31], or by post processing the result. A method for post processing the output of Algorithm 10 is presented here.

Remember that $\tau(x, y) = (x^2, y^2)$. Since $z^{2^m} = z$ for all $z \in \mathrm{GF}(2^m)$, it follows that

$$\tau^m(x, y) = (x^{2^m}, y^{2^m}) = (x, y).$$

This relation gives us the general equality

$$(\tau^m - 1)P \equiv O$$

where P is a point on a Koblitz curve. As a result, any integer k expressed with radix τ can be reduced modulo $\tau^m - 1$ without changing the scalar multiple kP. This reduction is performed easily with a few polynomial additions. Consider the τ-adic integer

$$u = u_{2m-1}\tau^{2m-1} + \cdots + u_{m+1}\tau^{m+1} + u_m\tau^m + u_{m-1}\tau^{m-1} + \cdots + u_1\tau + u_0.$$

Factoring out $\tau^m$ wherever possible, the result is

$$u = (u_{2m-1}\tau^{m-1} + \cdots + u_{m+1}\tau + u_m)\tau^m + (u_{m-1}\tau^{m-1} + \cdots + u_1\tau + u_0).$$

Substituting $\tau^m$ with 1 and combining terms results in

$$u = (u_{2m-1} + u_{m-1})\tau^{m-1} + \cdots + (u_{m+1} + u_1)\tau + (u_m + u_0).$$

The output of Algorithm 10 is approximately twice the length of the input but may be slightly larger. Assuming the length of the input to be approximately m symbols, the reduction method must be capable of reducing τ-adic integers with length slightly greater than 2m. Algorithm 11 describes this method for reduction.

Algorithm 11 Reduction mod τ^m
Input: u = u_{l−1} τ^{l−1} + · · · + u_1 τ + u_0 with m ≤ l < 3m
Output: v = REDUCE_TM(u)
    v ← 0;
    if (l > 2m) then
        v ← (u_{l−1} τ^{l−2m−1} + · · · + u_{2m+1} τ + u_{2m});
    if (l > m) then
        v ← v + (u_{2m−1} τ^{m−1} + · · · + u_{m+1} τ + u_m);
    v ← v + (u_{m−1} τ^{m−1} + · · · + u_1 τ + u_0);
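The folding performed by Algorithm 11 amounts to adding together the symbols whose positions agree modulo m. A minimal sketch, assuming the symbols are held in a list with the least significant symbol first (the function name `reduce_tm` and this list layout are illustrative choices):

```python
def reduce_tm(u, m):
    """Fold a tau-adic symbol list modulo tau^m - 1.

    Position i contributes to position i mod m, because tau^m acts as 1 on
    the points of a Koblitz curve over GF(2^m). Works for any input length,
    which covers the case m <= l < 3m of Algorithm 11.
    """
    v = [0] * m
    for i, symbol in enumerate(u):
        v[i % m] += symbol
    return v
```

As the text goes on to note, the folded symbols are plain integer sums, so they may fall outside {0, 1, −1} and need not be non-adjacent.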

Now the result of Algorithm 11 has length m but is no longer in τ-adic NAF form. There may be adjacent non-zero symbols and the symbols are not restricted to the set $\{0, 1, \bar{1}\}$. The input of Algorithm 10 is of the form $r_0 + r_1\tau$ where $r_0, r_1 \in \mathbb{Z}$. The output is the τ-adic representation of the input. For $v \in \mathbb{Z}[\tau]$ we can write

$$v = v_{m-1}\tau^{m-1} + \cdots + v_2\tau^2 + v_1\tau + v_0 = v_{m-1}\tau^{m-1} + \cdots + v_2\tau^2 + \mathrm{TNAF}(v_1\tau + v_0).$$

Now the two least significant symbols of v are in τ-adic NAF. Repeating this procedure for every symbol in v, the entire string can be converted to τ-adic NAF. This process is described in Algorithm 12.

Algorithm 12 Regeneration of τ-adic NAF
Input: v = v_{m−1} τ^{m−1} + · · · + v_1 τ + v_0
Output: w = REGEN_TNAF(v)
    w ← v; i ← 0;
    while (w_j ≠ 0 for some j ≥ i) do
        if (w_i == 0) then
            i ← i + 1;
        else
            t_0 ← w_i; t_1 ← w_{i+1};
            w_i ← 0; w_{i+1} ← 0;
            w ← w + TNAF(t_1 τ + t_0);
            i ← i + 1;

The output of Algorithm 12 is in τ-adic NAF and has a length of approximately m symbols. If the result is larger than m symbols, it is possible to repeat Algorithms 11 and 12 to further reduce the length. Algorithms 10, 11 and 12 have been implemented in C and were used to generate test vectors for the prototype discussed later in this chapter. During testing, it was found that a single pass of these algorithms generates a τ-adic representation with average length of m and a maximum length of m + 5.¹ Like radix 2 NAF, the τ-adic NAF uses the symbol set $\{1, 0, \bar{1}\}$ and has an average Hamming weight of approximately $l/3$ for an l-bit integer [31]. So Algorithm 9 has a running time of $l/3$ point additions and $l - 1$ Frobenius mappings.

4.2.3 Summary and Analysis

A point addition using López & Dahab's projective coordinates requires ten field multiplications, four field squarings and eight field additions. A point double requires five field multiplications, five field squarings and four field additions. Using this information, the run time for scalar multiplication can be written in terms of field operations. Typically scalar multiplication is measured in terms of field multiplications, inversions and squarings, ignoring the cost of addition. In the case of this architecture, field multiplication and squaring are completed quickly enough that the cost of field addition becomes significant. The run times using binary, binary NAF and τ-adic NAF representations are shown in Table 4.2. These values are based on the curve addition and doubling equations defined in (4.2) and (4.3) assuming arbitrary curve parameters α and β and the average Hamming weights discussed in the previous sections. For the case of τ-NAF, a Frobenius mapping is assumed to require three squaring operations. The symbols M, S, A and I correspond to field multiplication, squaring, addition and inversion respectively. In each case it is assumed that the length of the integer is approximately equal to m.

¹ These are empirical rather than analytical results.

Table 4.2: Cost of Scalar Multiplication in terms of Field Operations

Representation   Generic m                              m = 163
Binary           (10M + 7S + 8A)m + I                   1630M + 1141S + 1304A + I
NAF              ((25/3)M + (19/3)S + (20/3)A)m + I     1359M + 1033S + 1087A + I
τ-NAF            ((10/3)M + (13/3)S + (8/3)A)m + I      544M + 706S + 435A + I
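The generic-m entries of Table 4.2 follow directly from the per-operation costs and the average Hamming weights stated above. The sketch below recomputes them with exact fractions; the dictionary and function names are illustrative, and the single inversion I is left out. The m = 163 column of the table is an integer rounding of these values.

```python
from fractions import Fraction

# Field-operation counts per curve operation, as stated in the text.
ADD_COST = {"M": 10, "S": 4, "A": 8}   # point addition
DBL_COST = {"M": 5, "S": 5, "A": 4}    # point double
FRB_COST = {"M": 0, "S": 3, "A": 0}    # Frobenius mapping (three squarings)

def scalar_mult_cost(m, step_cost, density):
    """Cost of m steps (double or Frobenius) plus density*m point additions."""
    return {op: (step_cost[op] + density * ADD_COST[op]) * m for op in "MSA"}

m = 163
binary = scalar_mult_cost(m, DBL_COST, Fraction(1, 2))   # average weight m/2
naf = scalar_mult_cost(m, DBL_COST, Fraction(1, 3))      # average weight m/3
tnaf = scalar_mult_cost(m, FRB_COST, Fraction(1, 3))     # average weight m/3

for name, cost in (("binary", binary), ("NAF", naf), ("tau-NAF", tnaf)):
    print(name, {op: float(v) for op, v in cost.items()})
```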

4.3

Co-processor Architecture

The architecture, which is detailed in this section, consists of several finite field arithmetic units, field element storage and control logic. All logic related to finite field arithmetic is optimized for a specific field size and reduction polynomial. Internal curve computations are performed using López & Dahab's projective coordinate system. While generic curves are supported, the architecture is optimized specifically for the special Koblitz curves. The processor's architecture consists of the data path and two levels of control. The lower level of control is composed of a micro-sequencer which holds the routines required for curve arithmetic such as DOUBLE and ADD. The top level control is implemented using a state machine which parses the scalar and invokes the appropriate routines in the lower level control. This hierarchical control is shown in Figure 4.1.

Figure 4.1: Co-Processor’s Hierarchical Control Path

4.3.1

The Data Path

The data path of the co-processor consists of three finite field arithmetic units as well as space for operand storage. The arithmetic units include a multiplier, an adder, and a squaring unit. Each of these is optimized for a specific field and corresponding field polynomial. In an attempt to minimize time lost to data movement, the adder and multiplier are equipped with dual input ports which allow both operands to be loaded at the same time (the squaring unit requires a single operand and cannot benefit from an extra input bus). Similarly, the field element storage has two output ports used to supply data to the finite field units. In addition to providing field element storage, the storage unit provides the connection between the internal m-bit data path and the 32-bit external world. Figure 4.2 shows how the arithmetic units are connected to the storage unit. The internal m-bit busses connecting the storage and arithmetic units are controlled to perform sequences of field operations. In this way the underlying curve operations DOUBLE and ADD as well as field inversion are performed.

Figure 4.2: Co-Processor Data-Path

Field Element Storage: The field element storage unit provides storage for curve points and parameters as well as temporary values. Parameters required to perform elliptic curve scalar multiplication include the field elements α and β and the coordinates of the base point P. Storage is also required for the coordinates of the scalar multiple Q. The point addition routine developed for this design also requires four temporary storage locations for intermediate values. Figure 4.3 shows how the storage space is organized.

Figure 4.3: Field Element Storage (locations shown: P_x, P_y, α, β, T_0, T_1, T_2, T_3, Q_x, Q_y, Q_z)

The top eight field element storage locations are implemented using 32-bit dual-port RAMs generated by the Xilinx Coregen tool and the bottom three storage locations² are made of register files with 32-bit register widths. The dual 32-bit/m-bit interface support is achieved by instantiating ⌈m/32⌉ dual-port storage blocks (either memories or register files) with 32-bit word widths as shown in Figure 4.4. The figure assumes m = 163. If the 32-bit storage locations in Figure 4.4 are viewed as a matrix then the rows of the matrix hold the m-bit field words. Each 32-bit location is accessible by the 32-bit interface and each m-bit location is accessible by the m-bit interface. For simplicity, the field elements are aligned at 32-byte boundaries.

² These locations are shaded gray in Figures 4.3 and 4.4.

Figure 4.4: 32-bit/163-bit Address Map
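The dual 32-bit/m-bit view of the storage can be illustrated with a small model. This is only a sketch under assumptions: the text gives the 32-bit word width, the ⌈m/32⌉ blocks per element and the 32-byte alignment, while the word order within an element (least significant word first) and all function names below are hypothetical.

```python
M = 163
WORD_BITS = 32
WORDS_PER_ELEMENT = -(-M // WORD_BITS)   # ceil(163 / 32) = 6 storage blocks
ELEMENT_STRIDE_BYTES = 32                # elements aligned at 32-byte boundaries

def word_address(element_index, word_index):
    """Byte address of one 32-bit word as seen from the 32-bit interface."""
    if not 0 <= word_index < WORDS_PER_ELEMENT:
        raise IndexError("word index outside the field element")
    return element_index * ELEMENT_STRIDE_BYTES + 4 * word_index

def to_words(value):
    """Split an m-bit field element into 32-bit words (assumed LSW first)."""
    return [(value >> (WORD_BITS * i)) & 0xFFFFFFFF
            for i in range(WORDS_PER_ELEMENT)]

def from_words(words):
    """Reassemble the m-bit field element seen by the m-bit interface."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

element = (1 << 162) | 0xDEADBEEF
print([hex(w) for w in to_words(element)])
```

With m = 163 the sixth word carries only the top three bits of the element, and 6 words (24 bytes) fit inside each 32-byte slot.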

Computation of τQ: In addition to providing storage, the registers in the bottom three m-bit locations are capable of squaring the resident field element. This is accomplished by connecting the logic required for squaring directly to the output of the storage register. The squared result is then muxed into the input of the storage register and is activated with an enable signal. Figure 4.5 provides a diagram of this connection. This allows the squaring operations required to compute τQ to be performed in parallel. Furthermore, it eliminates the data movement otherwise required if the squaring unit were to be loaded and unloaded for each coordinate of Q. This provides a significant performance improvement when using Koblitz curves.

Figure 4.5: Efficient Frobenius Mapping
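In software terms, the Frobenius mapping described above squares each of the three projective coordinates of Q. The sketch below models this over GF(2^163) with the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1, representing field elements as Python integers. It is a bit-serial model with illustrative names, not the digit-serial hardware of the thesis.

```python
M = 163
# NIST reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_reduce(a):
    """Reduce a binary polynomial modulo POLY."""
    while a.bit_length() > M:
        a ^= POLY << (a.bit_length() - 1 - M)
    return a

def gf_mul(a, b):
    """Carry-less multiplication followed by reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return gf_reduce(r)

def gf_sqr(a):
    return gf_mul(a, a)

def frobenius(point):
    """Frobenius map of a projective point: square X, Y and Z independently."""
    x, y, z = point
    return (gf_sqr(x), gf_sqr(y), gf_sqr(z))

Q = ((1 << 100) | 12345, (1 << 162) | 7, 3)
print(frobenius(Q))
```

Because a^(2^163) = a for every element of GF(2^163), applying `frobenius` 163 times returns the original coordinates, which gives a simple self-check of the model.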

4.3.2

The Micro-sequencer

The micro-sequencer controls the data movement between the field element storage and the finite field arithmetic units. In addition to the fundamental load and store operations, it supports control instructions such as jump and branch. The following list briefly summarizes the instruction set supported by the micro-sequencer.
• ld: Load operand(s) from storage location into specified field arithmetic unit.
• st: Store result from field arithmetic unit into specified storage location.
• j: Jump to specified address in the micro-sequencer.
• jr: Jump to specified micro-sequencer address and push current address onto the program counter stack.
• ret: Return to micro-sequencer address. The address is supplied by the program counter stack.
• bne: Branch if the last field elements loaded into the ALU are NOT equal.
• nop: Increment program counter but do nothing.
• set: Set internal counter to specified value.
• rsq: Resquares the contents of the squaring unit.
• dbnz: Decrement internal counter and branch if the new value of the counter is NOT zero. This opcode also causes the contents of the squaring unit to be resquared.

A two-pass Perl assembler was developed to generate the micro-sequencer bit code. The assembler accepts multiple input files with linked addresses and merges them into one file. This file is then used to generate the bit code. The multiple input file support allows different versions of the ROM code to be efficiently managed. Different implementations of the same micro-sequencer routine can be stored in different files, allowing them to be easily selected at compile time.
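The two-pass idea can be sketched in a few lines: the first pass assigns an address to every instruction and records the labels, the second pass replaces label operands with addresses. The sketch below is an illustration only. The source syntax is assumed to follow the listing layout of Appendix A (instruction, optional label, optional comment), the output is a list of (opcode, operands) tuples rather than bit code, and the real assembler was written in Perl with an encoding that is not given here.

```python
import re

# instruction "(" operands ");" [LABEL] [// comment]
LINE = re.compile(r"^\s*(\w+)\s*\(([^)]*)\);\s*(\w+)?\s*(?://.*)?$")

def parse(source):
    """Return (opcode, operands, label) for every instruction line."""
    program = []
    for raw in source.splitlines():
        match = LINE.match(raw)
        if match:                              # comment-only lines are skipped
            opcode, args, label = match.groups()
            operands = [a.strip() for a in args.split(",") if a.strip()]
            program.append((opcode, operands, label))
    return program

def assemble(*sources):
    """Merge several source files, then resolve labels in two passes."""
    program = parse("\n".join(sources))
    # Pass 1: the address of an instruction is its position in the program.
    labels = {label: addr for addr, (_, _, label) in enumerate(program) if label}
    # Pass 2: replace label operands by their addresses.
    return [(op, [labels.get(x, x) for x in operands])
            for op, operands, _ in program]

FLDINV_SRC = """
set  (CTR1, 162);        FLDINV      // Set the counter
st   (ONE, T1);
ld   (SQR, T1);          FLDINV_L1
st   (SQR, T1);
ld   (MLT, T1, T0);
st   (MLT, T1);
dbnz (CTR1, FLDINV_L1);
ret  ();
"""
for address, instruction in enumerate(assemble(FLDINV_SRC)):
    print(address, instruction)
```

Passing several source strings to `assemble` mimics the multiple-input-file support: a routine in one file can jump to a label defined in another because all labels are collected before any operand is resolved.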

Micro-Sequencer Routines

The micro-sequencer supports the curve arithmetic primitives, field inversion and a few other miscellaneous routines. The list below provides a summary of the routines developed for use in the design.
• POINT ADD (P, Q): Adds the elliptic curve points P and Q where P is represented in affine coordinates and Q is represented using projective coordinates. The result is given in projective coordinates.
• POINT SUB (P, Q): Computes the difference Q − P. P is represented using affine coordinates and Q is represented using projective coordinates. The result is given in projective coordinates. This routine calls the POINT ADD routine.
• POINT DBL (Q): Doubles the elliptic curve point Q. Both Q and the result are in projective coordinates.
• INVERT (X): Computes the inverse of the finite field element X.
• CONVERT (Q): Computes the affine coordinates of an elliptic curve point Q given the point's projective coordinates. This routine calls the INVERT routine.
• COPY P2Q (P, Q): Copies the x and y coordinates of point P to the x and y coordinates of point Q. The z coordinate of point Q is set to 1.
• COPY MP2Q (P, Q): Computes the x and y coordinates of point −P and copies them to the x and y coordinates of point Q. The z coordinate of point Q is set to 1.

Several versions of the POINT ADD routine have been developed. The most generic one supports any curve over the field GF(2^m). In this version, the values of α and β are used when computing the sum of two points. This version also checks whether Q = P, Q = −P or Q = O. The second version of the point addition routine is optimized for a Koblitz curve by assuming α and β are equal to the NIST recommended values. The number of field multiplications required to compute the addition of two points is reduced from 10 to 9. The third version of the routine is optimized for a Koblitz curve and also forgoes the checks of point Q. If the base point P has a large prime order and the integer k is less than this order³, it will never be the case that Q = ±P or Q = O. This final version of the routine is the fastest of the three and is the one used to achieve the results reported at the end of this chapter. The assembly code for each of these routines is included in the appendix.
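The sequence of field operations carried out by the generic POINT ADD routine can be read off the comments of the micro-code in Appendix A. The sketch below transcribes that sequence into Python and checks it against the affine addition formula for curves of the form y^2 + xy = x^3 + a x^2 + b. To keep the check fast it uses a toy field GF(2^7) with the polynomial x^7 + x + 1 and a = b = 1. The toy field, the curve constants and all function names are assumptions made for illustration and are not the thesis parameters.

```python
M = 7
POLY = 0b10000011            # x^7 + x + 1 (toy field, not GF(2^163))
A_COEF, B_COEF = 1, 1        # illustrative curve constants

def mul(a, b):
    """Shift-and-add multiplication in GF(2^7) with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def inv(a):
    """a^(2^M - 2) by repeated multiplication (adequate for a toy field)."""
    r = 1
    for _ in range(2**M - 2):
        r = mul(r, a)
    return r

def on_curve(x, y):
    lhs = mul(y, y) ^ mul(x, y)
    rhs = mul(mul(x, x), x) ^ mul(A_COEF, mul(x, x)) ^ B_COEF
    return lhs == rhs

def affine_add(P, Q):
    """Affine addition, valid when the x coordinates differ."""
    (x1, y1), (x2, y2) = P, Q
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEF
    return x3, mul(lam, x1 ^ x3) ^ x3 ^ y1

def ld_mixed_add(Q, P, a=A_COEF):
    """Q projective (X1, Y1, Z1) plus P affine (x2, y2), as in the micro-code."""
    (X1, Y1, Z1), (x2, y2) = Q, P
    Ap = mul(Z1, Z1)                 # A'  = z1^2
    B = mul(x2, Z1) ^ X1             # B   = x2*z1 + x1
    A = mul(y2, Ap) ^ Y1             # A   = y2*A' + y1
    C = mul(Z1, B)                   # C   = z1*B
    E = mul(A, C)                    # E   = A*C
    D = mul(mul(B, B), C ^ mul(a, Ap))   # D = B^2 * (C + a*A')
    Z3 = mul(C, C)                   # z3  = C^2
    X3 = mul(A, A) ^ D ^ E           # x3  = A^2 + D + E
    F = mul(x2, Z3) ^ X3             # F   = x2*z3 + x3
    G = mul(y2, Z3) ^ X3             # G   = y2*z3 + x3
    return X3, mul(E, F) ^ mul(Z3, G), Z3    # y3 = E*F + z3*G

def to_affine(Q):
    """x = X/Z, y = Y/Z^2, matching the CONVERT routine."""
    X, Y, Z = Q
    zi = inv(Z)
    return mul(X, zi), mul(Y, mul(zi, zi))

POINTS = [(x, y) for x in range(2**M) for y in range(2**M) if on_curve(x, y)]
print(len(POINTS), "affine points found on the toy curve")
```

The function `ld_mixed_add` uses ten multiplications, four squarings and eight additions, in line with the counts given in Section 4.2.3. With a = 1 the product a·A' disappears, which corresponds to the reduction from 10 to 9 multiplications in the Koblitz version.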

4.3.3

Top Level Control

The routines listed above along with the POINT FRB(Q) operation are invoked by the top level state machine. The POINT FRB(Q) routine computes the Frobenius map of the point Q. This operation is not as complex as the other operations and is not implemented in the micro-sequencer. It is invoked by the top level state machine all the same. The state machine parses the scalar k and calls the routines as needed. Since integers in NAF and τ-NAF require use of the symbol −1 (denoted $\bar{1}$), the scalar requires more than just an m-bit register for storage. In the implementation given here, each symbol in the scalar is represented using two bits: one for the magnitude and one for the sign. Table 4.3 provides the corresponding representation. For each bit $k_i$ in the scalar k the magnitude is stored in the register $k_i^{(m)}$ and the sign is stored in register $k_i^{(s)}$. Table 4.4 provides example representations for integers in binary form, NAF, and τ-adic NAF using m = 8.

³ These are fair assumptions since the security of the ECC implementation relies on these properties.

Table 4.3: Representation of the Scalar k

Symbol       Magnitude   Sign
0            0           -
1            1           0
$\bar{1}$    1           1

Table 4.4: Example Representations of the Scalar k

Scalar                                    $k^{(m)}$        $k^{(s)}$
$(01001100)_2$                            $(01001100)_2$   $(00000000)_2$
$(0100\bar{1}010)_{\mathrm{NAF}}$         $(01001010)_2$   $(00001000)_2$
$(0100\bar{1}010)_{\tau\text{-NAF}}$      $(01001010)_2$   $(00001000)_2$

The top level state machine is designed to support binary, NAF and τ-adic NAF representations of the scalar. This effectively requires the state machine to perform Algorithms 4, 7 and 9. By taking advantage of the similarities between these algorithms, the top level state machine can perform this task with the addition of a single mode. This is shown in Algorithm 13. The algorithm is written in terms of the underlying curve and field primitives provided by the micro-sequencer (listed in Section 4.3.2). The first step of Algorithm 13 is to search for the first non-zero bit in $k^{(m)}$. Once found, either P or −P is copied to Q depending on the sign of the non-zero bit. The while loop then iterates over all the remaining bits in the scalar performing "doubles and adds" or "Frobenius mappings and adds" depending on the mode. Since the curve arithmetic is performed using projective coordinates, the result must be converted to affine coordinates at the end of computation.

Algorithm 13 State Machine Algorithm
Inputs: $k^{(m)} = (k^{(m)}_{l-1}, k^{(m)}_{l-2}, \ldots, k^{(m)}_1, k^{(m)}_0)_2$, $k^{(s)} = (k^{(s)}_{l-1}, k^{(s)}_{l-2}, \ldots, k^{(s)}_1, k^{(s)}_0)_2$, point P and mode (NAF or τ-NAF)
Output: Point Q = kP
    i ← l − 1;
    while ($k^{(m)}_i$ == 0) do
        i ← i − 1;
    if ($k^{(s)}_i$ == 1) then
        COPY MP2Q(P, Q);
    else
        COPY P2Q(P, Q);
    i ← i − 1;
    while (i ≥ 0) do
        if (mode == τ-NAF) then
            Q ← POINT FRB(Q);
        else
            Q ← POINT DBL(Q);
        if ($k^{(m)}_i$ == 1) then
            if ($k^{(s)}_i$ == 1) then
                Q ← POINT SUB(Q, P);
            else
                Q ← POINT ADD(Q, P);
        i ← i − 1;
    Q ← CONVERT(Q);
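The control flow of Algorithm 13 and the magnitude/sign encoding of Table 4.3 can be exercised with a small model. In the sketch below the curve group is replaced by the integers under addition (P = 1, doubling is multiplication by 2), so that the returned value can be compared with the scalar directly. This stand-in covers only the binary and NAF modes, because a Frobenius map has no integer analogue. The CONVERT step is omitted, and all function names are illustrative.

```python
def encode(digits_msb_first):
    """Split signed digits into magnitude and sign registers (index 0 = k_0)."""
    lsb_first = digits_msb_first[::-1]
    mag = [1 if d != 0 else 0 for d in lsb_first]
    sign = [1 if d < 0 else 0 for d in lsb_first]
    return mag, sign

def register_string(reg):
    """Print a register most significant bit first, as in Table 4.4."""
    return "".join(str(b) for b in reversed(reg))

def scalar_multiply(mag, sign, P, step, add, neg):
    """Algorithm 13 with the curve primitives passed in as functions.

    step is POINT DBL (or POINT FRB in tau-NAF mode); the scalar must be
    non-zero so that the leading-symbol search terminates.
    """
    i = len(mag) - 1
    while mag[i] == 0:                  # search for the first non-zero symbol
        i -= 1
    Q = neg(P) if sign[i] == 1 else P   # COPY MP2Q or COPY P2Q
    i -= 1
    while i >= 0:
        Q = step(Q)
        if mag[i] == 1:
            Q = add(Q, neg(P)) if sign[i] == 1 else add(Q, P)
        i -= 1
    return Q

# Integer stand-in for the curve group: P = 1, so the result equals k.
group = dict(step=lambda q: 2 * q, add=lambda q, p: q + p, neg=lambda p: -p)

mag, sign = encode([0, 1, 0, 0, -1, 0, 1, 0])      # NAF row of Table 4.4
print(register_string(mag), register_string(sign))
print(scalar_multiply(mag, sign, 1, **group))
```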

4.3.4

Choice of Field Arithmetic Units

The use of redundant arithmetic units, specifically field multipliers, has been suggested in [3] and should be considered when designing an elliptic curve scalar multiplier. It seems the advantage provided remains purely theoretical. This can be seen by examining the top performing ECC multipliers in [12] and [28], both of which use a single field multiplier. The reasons for doing the same for this ECC accelerator are twofold. (1) One of the limiting factors for the performance of the design is data movement. As shown in Figures 4.6 and 4.7 the bus usage for point addition and point doubling is very high (83% and 80% respectively). If another multiplier is added to the design there may not be enough free bus cycles to capitalize on the extra computational power. For the field GF(2^163), the multiplier computes a product in four clock cycles and requires three cycles to load and unload the unit. If a second multiplier is added, then two multiplications can be completed in four cycles but six cycles are required to load and unload the multipliers. (2) Many of the multiplications in point addition and point doubling are dependent on each other and must be performed in sequence. For this reason, the second multiplier may sit idle much of the time. The combination of these observations seems to argue against the use of multiple field multiplication units in the design.

Figure 4.6: Utilization of Finite Field Units for Point Addition

Figure 4.7: Utilization of Finite Field Units for Point Doubling

4.3.5


Usage Model

The following steps should be performed when using the module to compute the scalar multiple of an elliptic curve point.
• Load the base point P.
• Load the magnitude and sign of the scalar.
• Set the mode.
• Start computation.
• Wait for completion.
• Read out the resulting point Q.

During computation the base point P is preserved. If several scalar multiples of P need to be computed, P only needs to be loaded once. The same is true of the curve parameters α and β.

4.4

FPGA Prototype

A prototype of the architecture has been implemented for the field GF(2^163) using the NIST recommended field polynomial. The design was coded using Verilog HDL and synthesized using Synopsys FPGA Compiler II. Xilinx Foundation software was used to place, route and time the netlist. The prototype was designed to run at 66 MHz on a Xilinx Virtex 2000E FPGA.

The resulting design was verified on the Rapid Prototyping Platform (RPP) provided by Canadian Microelectronics Corporation (CMC) [6, 7]. The hardware/software system includes an ARM Integrator/LM-XCV600E+ (board with a Virtex 2000E FPGA) and an ARM Integrator/ARM7TDMI (board with an ARM7 core) connected by the ARM Integrator/AP board. The design was connected to an AHB slave interface which made it directly accessible by the ARM7 core. Stimulated by compiled C code, the core read from and wrote to the prototype. The Integrator/AP's system clock had a maximum frequency of 50 MHz. In order to run our design at 66 MHz it was necessary to use the oscillator generated clock provided with the Integrator/LM-XCV600E+. The data headed to and coming from the design was passed across the two clock domains.

4.5

Results

Table 4.5 shows the performance in clock cycles of the prototype's field and curve operations. These values were gathered using a field multiplier digit size of g = 41. Note that the multiple instantiations of the squaring logic allow the Frobenius mapping of a projective point to be completed in a single cycle. This significantly improves the performance of scalar multiplication when using the Koblitz curves.

Table 4.5: Performance of Field and Curve Operations

Operation            # Cycles (g = 41)
Point Addition       79
Point Subtraction    87
Point Double         68
Frobenius Mapping    1

The prototype of the scalar multiplier has been implemented using several digit sizes in the field multiplier. Table 4.6 reports the area consumption and resulting performance of the architecture given the different digit sizes. Table 4.7 provides a comparison of published performance results for scalar multiplication. The performance of 0.144 ms reported in [12] is the fastest reported scalar multiplication using FPGA technology. The design presented in this thesis provides almost double the performance (0.075 ms) for the specific case of Koblitz curves. The co-processor discussed in this thesis requires approximately half the CLBs used in the co-processor of [12] using the same FPGA. It must be noted that the co-processor presented in [12] is robust in that it supports all fields up to GF(2^256). In applications where support for only a single field size is required, it is overkill to support elliptic curves over many fields. In scenarios such as this, the new elliptic curve co-processor offers an improved, cost-effective solution.

Table 4.6: Performance and Cost Results for Scalar Multiplication

Multiplier Digit Size   # LUTs   # FFs   Binary (ms)   NAF (ms)   τ-NAF (ms)
g = 4                    6,144   1,930   1.107         0.939      0.351
g = 14                   7,362   1,930   0.446         0.386      0.135
g = 19                   7,872   1,930   0.378         0.329      0.113
g = 28                   8,838   1,930   0.309         0.272      0.090
g = 33                   9,329   1,930   0.286         0.252      0.083
g = 41                  10,017   1,930   0.264         0.233      0.075

Table 4.7: Comparison of Published Results

Implementation            Field        FPGA               Scalar Mult. (ms)
S. Okada et al. [26]      GF(2^163)    Altera EPF10K250   45
Leong & Leung [19]        GF(2^155)    Xilinx XCV1000     8.3
M. Bednara et al. [3]     GF(2^191)    Xilinx XCV1000     0.27
Orlando & Paar [28]       GF(2^167)    Xilinx XCV400E     0.210
N. Gura et al. [12]       GF(2^163)    Xilinx XCV2000E    0.144
Our design (g = 14)       GF(2^163)    Xilinx XCV2000E    0.135
Our design (g = 41)       GF(2^163)    Xilinx XCV2000E    0.075
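As a rough consistency check, the g = 41 timings in Table 4.6 can be compared with an estimate built from the cycle counts of Table 4.5, the 66 MHz clock and the average Hamming weights of Section 4.2. The sketch below is a back-of-the-envelope model, not the author's method: it ignores the final coordinate conversion and any control overhead, and it assumes that signed representations use additions and subtractions equally often.

```python
CLOCK_HZ = 66e6
CYCLES = {"add": 79, "sub": 87, "dbl": 68, "frb": 1}   # Table 4.5, g = 41

def estimate_ms(m, step, density, signed):
    """Estimated scalar multiplication time in milliseconds."""
    if signed:
        add_cycles = (CYCLES["add"] + CYCLES["sub"]) / 2   # assumed 50/50 mix
    else:
        add_cycles = CYCLES["add"]
    cycles = m * CYCLES[step] + density * m * add_cycles
    return 1e3 * cycles / CLOCK_HZ

m = 163
estimates = {
    "binary": estimate_ms(m, "dbl", 1 / 2, signed=False),
    "NAF": estimate_ms(m, "dbl", 1 / 3, signed=True),
    "tau-NAF": estimate_ms(m, "frb", 1 / 3, signed=True),
}
reported = {"binary": 0.264, "NAF": 0.233, "tau-NAF": 0.075}   # Table 4.6

for name in estimates:
    print(f"{name}: estimated {estimates[name]:.3f} ms, reported {reported[name]} ms")
```

Under these assumptions the estimates land within a few percent of the reported values, which suggests that the point operations dominate the run time.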

Chapter 5

Concluding Remarks

5.1

Summary and Contributions

In this thesis, the development of an elliptic curve cryptographic co-processor has been discussed. The co-processor takes advantage of multiplication and squaring arithmetic units which are based on the look-up table-based multiplication algorithm proposed in [13]. Field elements are represented with respect to the polynomial basis. While the base point and the resulting scalar multiple are given in affine coordinates, internal arithmetic is performed using projective coordinates. This choice of coordinate system allows the scalar multiple of a point to be computed with a single field inversion, alleviating the need for a highly efficient inversion method. The processor was designed to support signed, unsigned and τ-NAF integer representations. All curves over a specific field are supported, but the architecture is optimized specifically for the Koblitz curves.

The feasibility and efficiency of the co-processor architecture have been demonstrated through a prototype implementation on an FPGA. The prototype has resulted in record performance for elliptic curve scalar multiplication over the field GF(2^163). Contributions achieved in this work are as follows:
• A new high performance, low cost implementation of the field multiplier from [13].
• A new architecture designed for efficient Frobenius mappings through multiple instantiations of squaring logic.
• A high performance implementation of Itoh & Tsujii's inversion method.
• Overall performance for the elliptic curve co-processor is 0.075 ms for a single elliptic curve scalar multiplication.

5.2

Future Work

In the future it is intended to extend ﬁeld support to several ﬁeld sizes. Ideally, the architecture would support all NIST recommended ﬁelds simultaneously. Logic would be reused wherever possible. Extra logic would be limited to certain parts of the squaring and multiplication units which are dependent on the ﬁeld reduction polynomial.

Appendix A

Micro-code supporting Curve Arithmetic and Field Inversion

This appendix includes the assembly code written to support elliptic curve point addition, point doubling, and field inversion, along with a few other operations. Note that there are multiple point addition and inversion routines.

A.1

Point Addition

The following three routines perform elliptic curve point addition. The first is the most generic and supports all curves with arbitrary α and β. The second routine is optimized for the NIST Koblitz curve over GF(2^163). The third routine is also optimized for the NIST Koblitz curve, but forgoes integrity checking of point Q.

A.1.1

Generic Point Addition

    //----------------------------
    // Generic Point Add Routine
    //----------------------------
                                          // Is Q == identity?
    ld   (ADD, QX, ZRO);      PTADD       // Is x1 == 0?
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTADD_L3);                 // (Q!=identity)->cont. with add.
    ld   (ADD, QY, ZRO);                  // Is y1 == 0?
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTADD_L3);                 // (Q!=identity)->cont. with add.
    ld   (ADD, PX, ZRO);                  // x2 + 0
    st   (ADD, QX);                       // Read x2 into location for x1
    ld   (ADD, PY, ZRO);                  // y2 + 0
    st   (ADD, QY);                       // Read y2 into location for y1
    st   (ONE, QZ);                       // Set z1 to a one
    ret  ();                              // Return

                                          // Start the Point Addition
    ld   (MLT, PX, QZ);       PTADD_L3    // Start B' = x2*z1
    ld   (SQR, QZ);                       // Start A' = z1^2
    st   (SQR, T0);                       // Read A'
    st   (MLT, T3);                       // Read B'
    ld   (MLT, T0, PY);                   // Start A'' = y2*A'
    ld   (ADD, T3, QX);                   // Start B = B' + x1

                                          // Is px == qx?
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTADD_L1);                 // If B' != x1 then branch

                                          // Is py == qy?
    st   (MLT, T2);                       // Read A''
    ld   (ADD, T2, QY);                   // Start A'' + y1 ?= 0
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTADD_L2);                 // If A'' != y1 then branch

                                          // Case: P == Q
    jr   (PTDBL);                         // Jump to Point Double Routine
    ret  ();                              // We are done... so return

                                          // Case: P == -Q
    st   (ZRO, QX);           PTADD_L2    // x1 = 0
    st   (ZRO, QY);                       // y1 = 0
    ret  ();                              // Return

                                          // Case: P != Q and P != -Q
    st   (ADD, T1);           PTADD_L1    // Read B
    st   (MLT, T2);                       // Read A''
    ld   (MLT, QZ, T1);                   // Start C = z1*B
    ld   (ADD, T2, QY);                   // Start A = A'' + y1
    st   (ADD, T2);                       // Read A
    st   (MLT, QZ);                       // Read C
    ld   (MLT, QZ, T2);                   // Start E = A*C
    ld   (SQR, T1);                       // Start D' = B^2
    st   (SQR, T1);                       // Read D'
    st   (MLT, T3);                       // Read E
    ld   (MLT, A, T0);                    // Start D'' = a*A'
    st   (MLT, T0);                       // Read D''
    ld   (ADD, QZ, T0);                   // Start D''' = C + D''
    st   (ADD, T0);                       // Read D'''
    ld   (MLT, T1, T0);                   // Start D = D'*D'''
    ld   (SQR, QZ);                       // Start z3 = C^2
    st   (SQR, QZ);                       // Read z3
    st   (MLT, T0);                       // Read D
    ld   (MLT, PX, QZ);                   // Start F' = x2*z3
    ld   (SQR, T2);                       // Start x3' = A^2
    st   (SQR, QX);                       // Read x3'
    st   (MLT, QY);                       // Read F'
    ld   (MLT, PY, QZ);                   // Start G' = y2*z3
    ld   (ADD, QX, T0);                   // Start x3'' = x3' + D
    st   (ADD, QX);                       // Read x3''
    st   (MLT, T1);                       // Read G'
    ld   (ADD, QX, T3);                   // Start x3 = x3'' + E
    st   (ADD, QX);                       // Read x3
    ld   (ADD, QY, QX);                   // Start F = F' + x3
    st   (ADD, QY);                       // Read F
    ld   (MLT, T3, QY);                   // Start y3' = E*F
    ld   (ADD, T1, QX);                   // Start G = G' + x3
    st   (ADD, T1);                       // Read G
    st   (MLT, QY);                       // Read y3'
    ld   (MLT, T1, QZ);                   // Start y3'' = z3*G
    st   (MLT, T1);                       // Read y3''
    ld   (ADD, QY, T1);                   // Start y3 = y3' + y3''
    st   (ADD, QY);                       // Read y3
    ret  ();                              // Return to base

A.1.2

Koblitz Curve Point Addition

    //---------------------------------
    // Koblitz Curve Point Add Routine
    //---------------------------------
                                          // Is Q == identity?
    ld   (ADD, QX, ZRO);      PTADD       // Is x1 == 0?
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTADD_L3);                 // (Q!=identity)->cont. with add.
    ld   (ADD, QY, ZRO);                  // Is y1 == 0?
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTADD_L3);                 // (Q!=identity)->cont. with add.
    ld   (ADD, PX, ZRO);                  // x2 + 0
    st   (ADD, QX);                       // Read x2 into location for x1
    ld   (ADD, PY, ZRO);                  // y2 + 0
    st   (ADD, QY);                       // Read y2 into location for y1
    st   (ONE, QZ);                       // Set z1 to a one
    ret  ();                              // Return

                                          // Start the Point Addition
    ld   (MLT, PX, QZ);       PTADD_L3    // Start B' = x2*z1
    ld   (SQR, QZ);                       // Start A' = z1^2
    st   (SQR, T0);                       // Read A'
    st   (MLT, T3);                       // Read B'
    ld   (MLT, T0, PY);                   // Start A'' = y2*A'
    ld   (ADD, T3, QX);                   // Start B = B' + x1

                                          // Is px == qx?
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTADD_L1);                 // If B' != x1 then branch

                                          // Is py == qy?
    st   (MLT, T2);                       // Read A''
    ld   (ADD, T2, QY);                   // Start A'' + y1 ?= 0
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTADD_L2);                 // If A'' != y1 then branch

                                          // Case: P == Q
    jr   (PTDBL);                         // Jump to Point Double Routine
    ret  ();                              // We are done... so return

                                          // Case: P == -Q
    st   (ZRO, QX);           PTADD_L2    // x1 = 0
    st   (ZRO, QY);                       // y1 = 0
    ret  ();                              // Return

                                          // Case: P != Q and P != -Q
    st   (ADD, T1);           PTADD_L1    // Read B
    st   (MLT, T2);                       // Read A''
    ld   (MLT, QZ, T1);                   // Start C = z1*B
    ld   (ADD, T2, QY);                   // Start A = A'' + y1
    st   (ADD, T2);                       // Read A
    st   (MLT, QZ);                       // Read C
    ld   (MLT, QZ, T2);                   // Start E = A*C
    ld   (SQR, T1);                       // Start D' = B^2
    st   (SQR, T1);                       // Read D'
    st   (MLT, T3);                       // Read E
    ld   (ADD, QZ, T0);                   // Start D'' = C + (aA') but a = 1
    st   (ADD, T0);                       // Read D''
    ld   (MLT, T1, T0);                   // Start D = D'*D''
    ld   (SQR, QZ);                       // Start z3 = C^2
    st   (SQR, QZ);                       // Read z3
    st   (MLT, T0);                       // Read D
    ld   (MLT, PX, QZ);                   // Start F' = x2*z3
    ld   (SQR, T2);                       // Start x3' = A^2
    st   (SQR, QX);                       // Read x3'
    st   (MLT, QY);                       // Read F'
    ld   (MLT, PY, QZ);                   // Start G' = y2*z3
    ld   (ADD, QX, T0);                   // Start x3'' = x3' + D
    st   (ADD, QX);                       // Read x3''
    st   (MLT, T1);                       // Read G'
    ld   (ADD, QX, T3);                   // Start x3 = x3'' + E
    st   (ADD, QX);                       // Read x3
    ld   (ADD, QY, QX);                   // Start F = F' + x3
    st   (ADD, QY);                       // Read F
    ld   (MLT, T3, QY);                   // Start y3' = E*F
    ld   (ADD, T1, QX);                   // Start G = G' + x3
    st   (ADD, T1);                       // Read G
    st   (MLT, QY);                       // Read y3'
    ld   (MLT, T1, QZ);                   // Start y3'' = z3*G
    st   (MLT, T1);                       // Read y3''
    ld   (ADD, QY, T1);                   // Start y3 = y3' + y3''
    st   (ADD, QY);                       // Read y3
    ret  ();                              // Return to base

A.1.3

Efficient Koblitz Curve Point Addition

    //----------------------------------------------------
    // Koblitz Curve Point Add Routine without checking Q
    //----------------------------------------------------
                                          // Start the Point Addition
    ld   (MLT, PX, QZ);       PTADD       // Start B' = x2*z1
    ld   (SQR, QZ);                       // Start A' = z1^2
    st   (SQR, T0);                       // Read A'
    st   (MLT, T3);                       // Read B'
    ld   (MLT, T0, PY);                   // Start A'' = y2*A'
    ld   (ADD, T3, QX);                   // Start B = B' + x1
    st   (ADD, T1);                       // Read B
    st   (MLT, T2);                       // Read A''
    ld   (MLT, QZ, T1);                   // Start C = z1*B
    ld   (ADD, T2, QY);                   // Start A = A'' + y1
    st   (ADD, T2);                       // Read A
    st   (MLT, QZ);                       // Read C
    ld   (MLT, QZ, T2);                   // Start E = A*C
    ld   (SQR, T1);                       // Start D' = B^2
    st   (SQR, T1);                       // Read D'
    st   (MLT, T3);                       // Read E
    ld   (ADD, QZ, T0);                   // Start D'' = C + (aA') but a = 1
    st   (ADD, T0);                       // Read D''
    ld   (MLT, T1, T0);                   // Start D = D'*D''
    ld   (SQR, QZ);                       // Start z3 = C^2
    st   (SQR, QZ);                       // Read z3
    st   (MLT, T0);                       // Read D
    ld   (MLT, PX, QZ);                   // Start F' = x2*z3
    ld   (SQR, T2);                       // Start x3' = A^2
    st   (SQR, QX);                       // Read x3'
    st   (MLT, QY);                       // Read F'
    ld   (MLT, PY, QZ);                   // Start G' = y2*z3
    ld   (ADD, QX, T0);                   // Start x3'' = x3' + D
    st   (ADD, QX);                       // Read x3''
    st   (MLT, T1);                       // Read G'
    ld   (ADD, QX, T3);                   // Start x3 = x3'' + E
    st   (ADD, QX);                       // Read x3
    ld   (ADD, QY, QX);                   // Start F = F' + x3
    st   (ADD, QY);                       // Read F
    ld   (MLT, T3, QY);                   // Start y3' = E*F
    ld   (ADD, T1, QX);                   // Start G = G' + x3
    st   (ADD, T1);                       // Read G
    st   (MLT, QY);                       // Read y3'
    ld   (MLT, T1, QZ);                   // Start y3'' = z3*G
    st   (MLT, T1);                       // Read y3''
    ld   (ADD, QY, T1);                   // Start y3 = y3' + y3''
    st   (ADD, QY);                       // Read y3
    ret  ();                              // Return to base

A.2

Point Doubling

The following routine computes the double of an elliptic curve point.

    //------------------------
    // Point Double Routine
    //------------------------
                                          // Is Q == identity?
    ld   (ADD, QX, ZRO);      PTDBL       // Is x1 == 0?
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTDBL_L1);                 // (Q!=identity)->cont. with add.
    ld   (ADD, QY, ZRO);                  // Is y1 == 0?
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    nop  ();                              // dead cycle
    bne  (ADD, PTDBL_L1);                 // (Q!=identity)->cont. with add.
    ret  ();                              // Return to base

    ld   (SQR, QZ);           PTDBL_L1    // Start z3' = z1^2
    st   (SQR, T0);                       // Read z3'
    ld   (SQR, QX);                       // Start z3'' = x1^2
    st   (SQR, QX);                       // Read z3''
    ld   (MLT, QX, T0);                   // Start z3 = z3'*z3''
    ld   (SQR, T0);                       // Start x3' = z3'^2
    st   (SQR, T0);                       // Read x3'
    st   (MLT, QZ);                       // Read z3
    ld   (MLT, B, T0);                    // Start x3'' = b*x3'
    st   (MLT, T0);                       // Read x3''
    ld   (MLT, T0, QZ);                   // Start y3' = x3''z3
    ld   (SQR, QX);                       // Start x3''' = z3''^2
    st   (SQR, QX);                       // Read x3'''
    st   (MLT, T1);                       // Read y3'
    ld   (ADD, QX, T0);                   // Start x3 = x3''' + x3''
    st   (ADD, QX);                       // Read x3
    ld   (SQR, QY);                       // Start y3'' = y1^2
    st   (SQR, QY);                       // Read y3''
    ld   (ADD, QY, T0);                   // Start y3''' = y3'' + x3''
    st   (ADD, QY);                       // Read y3'''
    ld   (MLT, A, QZ);                    // Start y3^(4) = a * z3
    st   (MLT, T2);                       // Read y3^(4)
    ld   (ADD, QY, T2);                   // Start y3^(5) = y3''' + y3^(4)
    st   (ADD, QY);                       // Read y3^(5)
    ld   (MLT, QX, QY);                   // Start y3^(6) = x3*y3^(5)
    st   (MLT, QY);                       // Read y3^(6)
    ld   (ADD, QY, T1);                   // Start y3 = y3^(6) + y3'
    st   (ADD, QY);                       // Read y3
    ret  ();                              // Return to base

A.3

Field Inversion

The following two routines perform inversion over the field GF(2^163). The second routine relies on the fact that the dbnz opcode also re-squares the contents of the squaring unit.

A.3.1

Inversion by Square and Multiply

    //------------------------
    // Field Inversion
    //------------------------
    set  (CTR1, 162);         FLDINV      // Set the counter
    st   (ONE, T1);
    ld   (SQR, T1);           FLDINV_L1   // Square T1
    st   (SQR, T1);
    ld   (MLT, T1, T0);                   // Multiply by T0
    st   (MLT, T1);
    dbnz (CTR1, FLDINV_L1);               // Repeat 162 times
    ld   (SQR, T1);                       // One more squaring
    st   (SQR, T1);
    ret  ();

A.3.2

Inversion by Itoh and Tsujii

    //----------------------------------
    // Field Inversion by Itoh & Tsujii
    //----------------------------------
    ld   (SQR, T0);           FLDINV      // -- square
    st   (SQR, T1);                       //
    ld   (MLT, T1, T0);                   // -- mult
    st   (MLT, T1);                       // T1 = a^(2^2 - 1)

    ld   (SQR, T1);                       // -- square
    nop  ();                              //
    nop  ();                              //
    rsq  ();                              // -- square
    st   (SQR, T2);                       //
    ld   (MLT, T1, T2);                   // -- mult
    st   (MLT, T1);                       // T1 = a^(2^4 - 1)

    ld   (SQR, T1);                       // -- square
    st   (SQR, T2);                       //
    ld   (MLT, T2, T0);                   // -- mult
    st   (MLT, T1);                       // T1 = a^(2^5 - 1)

    ld   (SQR, T1);                       // -- square
    nop  ();                              //
    nop  ();                              //
    rsq  ();                              // -- square
    rsq  ();                              // -- square
    rsq  ();                              // -- square
    rsq  ();                              // -- square
    st   (SQR, T2);                       //
    ld   (MLT, T1, T2);                   // -- mult
    st   (MLT, T1);                       // T1 = a^(2^10 - 1)

    ld   (SQR, T1);                       // -- square
    nop  ();                              //
    set  (CTR1, 8);                       // -- 9 squarings
    dbnz (CTR1, FLDINV_L1);   FLDINV_L1   //
    st   (SQR, T2);                       //
    ld   (MLT, T1, T2);                   // -- mult
    st   (MLT, T1);                       // T1 = a^(2^20 - 1)

    ld   (SQR, T1);                       // -- square
    nop  ();                              //
    set  (CTR1, 18);                      // -- 19 squarings
    dbnz (CTR1, FLDINV_L2);   FLDINV_L2   //
    st   (SQR, T2);                       //
    ld   (MLT, T1, T2);                   // -- mult
    st   (MLT, T1);                       // T1 = a^(2^40 - 1)

    ld   (SQR, T1);                       // -- square
    nop  ();                              //
    set  (CTR1, 38);                      // -- 39 squarings
    dbnz (CTR1, FLDINV_L3);   FLDINV_L3   //
    st   (SQR, T2);                       //
    ld   (MLT, T1, T2);                   // -- mult
    st   (MLT, T1);                       // T1 = a^(2^80 - 1)

    ld   (SQR, T1);                       // -- square
    st   (SQR, T2);                       //
    ld   (MLT, T2, T0);                   // -- mult
    st   (MLT, T1);                       // T1 = a^(2^81 - 1)

    ld   (SQR, T1);                       // -- square
    nop  ();                              //
    set  (CTR1, 79);                      // -- 80 squarings
    dbnz (CTR1, FLDINV_L4);   FLDINV_L4   //
    st   (SQR, T2);                       //
    ld   (MLT, T1, T2);                   // -- mult
    st   (MLT, T1);                       // T1 = a^(2^162 - 1)

    ld   (SQR, T1);                       // -- square
    st   (SQR, T1);                       // T1 = a^(2^163 - 2) = a^-1
    ret  ();
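The exponent chain followed by the routine above (1, 2, 4, 5, 10, 20, 40, 80, 81, 162, then one final squaring) can be replayed in software. The sketch below does so over GF(2^163) with the NIST reduction polynomial and checks that the result is a multiplicative inverse. It is a bit-serial model with illustrative names and says nothing about the cycle-level behaviour of the micro-code.

```python
M = 163
# NIST reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_mul(a, b):
    """Carry-less multiplication followed by reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    while r.bit_length() > M:
        r ^= POLY << (r.bit_length() - 1 - M)
    return r

def sqr_n(a, n):
    """Square a field element n times."""
    for _ in range(n):
        a = gf_mul(a, a)
    return a

def invert(a):
    """Itoh-Tsujii style inversion; invariant: t = a^(2^e - 1)."""
    t, e = a, 1
    for op in "DDIDDDDID":            # D: e -> 2e, I: e -> e + 1
        if op == "D":
            t = gf_mul(sqr_n(t, e), t)
            e *= 2
        else:
            t = gf_mul(sqr_n(t, 1), a)
            e += 1
    assert e == 162
    return gf_mul(t, t)               # a^(2^163 - 2) = a^-1

a = (1 << 150) | 0xABCDEF
print(gf_mul(a, invert(a)))           # the product of a and its inverse
```

The chain uses nine multiplications and 162 squarings, compared with 162 multiplications for the square-and-multiply routine of Section A.3.1.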

A.4 Coordinate Conversion

The following routine converts a point from its projective representation to its affine representation.
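Before the microcode, a small software model may help fix what the conversion computes. Going by the routine's own comments, the affine coordinates are x = X/Z and y = Y/Z^2, so the work is one inversion, one squaring and two multiplications. The sketch assumes GF(2^163) with reduction polynomial x^163 + x^7 + x^6 + x^3 + 1; `gf_mul`, `gf_inv` and `to_affine` are hypothetical names used only here.

```python
M = 163
# Assumed reduction polynomial: x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_mul(a, b):
    """Multiply two elements of GF(2^163), held as Python ints, modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_inv(a):
    """Inverse as a^(2^163 - 2), by repeated square-and-multiply."""
    t = 1
    for _ in range(M - 1):
        t = gf_mul(gf_mul(t, t), a)
    return gf_mul(t, t)

def to_affine(qx, qy, qz):
    """Model of the conversion: (X, Y, Z) -> (X/Z, Y/Z^2, 1)."""
    t1 = gf_inv(qz)          # compute 1/z1
    qx = gf_mul(t1, qx)      # x1 * (1/z1)
    t0 = gf_mul(t1, t1)      # (1/z1)^2
    qy = gf_mul(t0, qy)      # y1 * (1/z1)^2
    return qx, qy, 1         # z1 set to 1

# Round trip: scale arbitrary affine values by z, then convert back.
x, y, z = 0x1F2E3D4C5B6A, 0x0A1B2C3D4E5F, 0x7777
X, Y = gf_mul(x, z), gf_mul(y, gf_mul(z, z))
assert to_affine(X, Y, z) == (x, y, 1)
```

The round-trip check does not require (x, y) to lie on a curve, since the conversion is purely a rescaling of the coordinates.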

//------------------------
// Convert to Affine
//------------------------
CNVAFF      ld  (ADD, QZ, ZRO);       // Copy z1 to T0
            st  (ADD, T0);            //
            jr  (FLDINV);             // Compute (1/z1)
            ld  (MLT, T1, QX);        // Start x1*(1/z1)
            ld  (SQR, T1);            // Start (1/z1)^2
            st  (SQR, T0);            // Read (1/z1)^2
            st  (MLT, QX);            // Read x1 = x1*(1/z1)
            ld  (MLT, T0, QY);        // Start y1*(1/z1)^2
            st  (MLT, QY);            // Read y1 = y1*(1/z1)^2
            st  (ONE, QZ);            // Set z1 to 1
            ret ();

A.5 Copy Routines

The following two routines are used to initialize the Q register at the beginning of a scalar multiplication. The first loads Q with P and the second loads Q with −P.

A.5.1 Copy P to Q

//------------------------
// Copy P to Q
//------------------------
                                      // Is Q == identity?
CPYP2Q      ld  (ADD, PX, ZRO);       // x2 + 0
            st  (ADD, QX);            // Read x2 into location for x1
            ld  (ADD, PY, ZRO);       // y2 + 0
            st  (ADD, QY);            // Read y2 into location for y1
            st  (ONE, QZ);            // Set z1 to a one
            ret ();                   // Return

A.5.2 Copy −P to Q

//------------------------
// Copy -P to Q
//------------------------
                                      // Is Q == identity?
CPYMP2Q     ld  (ADD, PX, ZRO);       // x2 + 0
            st  (ADD, QX);            // Read x2 into location for x1
            ld  (ADD, PY, PX);        // x2 + y2
            st  (ADD, QY);            // Read x2+y2 into location for y1
            st  (ONE, QZ);            // Set z1 to a one
            ret ();                   // Return
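The second routine stores x2 + y2 as the new y coordinate because, on a binary curve of the form y^2 + xy = x^3 + ax^2 + b, the negative of (x, y) is (x, x + y). The sketch below checks that identity numerically. The curve form, the reduction polynomial x^163 + x^7 + x^6 + x^3 + 1, and the names `gf_mul`, `curve_residual` and `copy_minus_p_to_q` are assumptions for illustration. The coefficient b is chosen so that an arbitrary (x, y) lies on the curve, which avoids relying on any particular standard base point.

```python
M = 163
# Assumed reduction polynomial: x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_mul(a, b):
    """Multiply two elements of GF(2^163), held as Python ints, modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def curve_residual(x, y, a, b):
    """(y^2 + xy) + (x^3 + a*x^2 + b); zero exactly when (x, y) is on the curve."""
    x2 = gf_mul(x, x)
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(x2, x) ^ gf_mul(a, x2) ^ b
    return lhs ^ rhs

def copy_minus_p_to_q(px, py):
    """Model of CPYMP2Q: Q = (PX + 0, PY + PX, 1); field addition is XOR."""
    return px ^ 0, py ^ px, 1

# Pick arbitrary x, y and a, then solve for the b that puts (x, y) on the curve.
px, py, a = 0x1F2E3D4C5B6A79, 0x0A1B2C3D4E5F60, 1
b = curve_residual(px, py, a, 0)
assert curve_residual(px, py, a, b) == 0

qx, qy, qz = copy_minus_p_to_q(px, py)
assert curve_residual(qx, qy, a, b) == 0      # -P is on the same curve
assert copy_minus_p_to_q(qx, qy)[:2] == (px, py)  # negating twice returns P
```

Negation therefore costs a single field addition, which is why loading −P is no more expensive than loading P.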

Appendix B Tool Related Scripts and Setup Files

This appendix includes several tool-related scripts and setup files which were used in the development of the ECC co-processor discussed in this thesis.

B.1 Synthesis Scripts

Listed in this section are two scripts used to synthesize the design. The file synth_compile.fst is the top-level script, which includes synth_constraints.fst. These scripts were written for Synopsys' FPGA Compiler II.

B.1.1 Synthesis Compile Scripts

#
# pmult_compile.fst
#
# This script synthesizes the pmult_top design for
# the Xilinx Virtex E FPGA.
#
# To run the script:
#     fc2_shell -f pmult_compile.fst
#

#
# Define variables
#
set proj pmult_proj
set top AHBAHBTop
set target VIRTEXE
set chip pmult_ahb
set export_dir exports
set report_dir reports

#
# Remove any old version of the project,
# and create the new project
# comment out this section to work on
# an existing project
#
exec rm -rf $proj
create_project -dir . $proj

#
# Setup project variables
#
proj_export_timing_constraint = "yes"
proj_enable_vpp = "yes"

#
# Setup default variables
#
default_clock_frequency = 66

#######################################################
#
# Identify the design source files
#
#######################################################
set SOURCEDIR /secure2/jlutz/kp_unit/design
set AHBDIR    $SOURCEDIR/ahb_if/rtl_v
set PMULTDIR  $SOURCEDIR/pmult/rtl_v
set MULTDIR   $SOURCEDIR/mult/rtl_v
set ALUDIR    $SOURCEDIR/alu/rtl_v

# AHB files
add_file -format Verilog $AHBDIR/pmult_glue.v
add_file -format Verilog $AHBDIR/APBRegs.v
add_file -format Verilog $AHBDIR/APBIntcon.v
add_file -format Verilog $AHBDIR/AHB2APB.v
add_file -format Verilog $AHBDIR/AHBAPBSys.v
add_file -format Verilog $AHBDIR/AHBZBTRAM.v
add_file -format Verilog $AHBDIR/AHBDecoder.v
add_file -format Verilog $AHBDIR/AHBMuxS2M.v
add_file -format Verilog $AHBDIR/AHBAHBTop.v

# Top pmult files
add_file -format Verilog $PMULTDIR/pmult_defines.v
add_file -format Verilog $PMULTDIR/pmult_biu.v
add_file -format Verilog $PMULTDIR/pmult_logic.v
add_file -format Verilog $PMULTDIR/pmult_ptmlt_ctl.v
add_file -format Verilog $PMULTDIR/pmult_ram.v
add_file -format Verilog $PMULTDIR/pmult_q.v

add_file -format Verilog $PMULTDIR/pmult_top.v
add_file -format Verilog $PMULTDIR/pmult_useq.v

# ALU file(s)
add_file -format Verilog $ALUDIR/alu_top.v
add_file -format Verilog $ALUDIR/square_core.v

# Mult files
add_file -format Verilog $MULTDIR/mult_defines.v
add_file -format Verilog $MULTDIR/m_table.v
add_file -format Verilog $MULTDIR/mult_ctrl.v
add_file -format Verilog $MULTDIR/mult_top.v
add_file -format Verilog $MULTDIR/t_table.v

# The Memories
add_file -format Verilog $SOURCEDIR/pmult/user_cell/ram_8x32_d.v
add_file -format Verilog $SOURCEDIR/pmult/user_cell/ram_256x32_s_dist.v
add_file -format EDIF $SOURCEDIR/pmult/user_cell/ram_8x32_d.edn
add_file -format EDIF $SOURCEDIR/pmult/user_cell/ram_256x32_s_dist.edn

#
# Analyze all the source files and display the progress
#
analyze_file -progress

#
# Create a chip targeted for $target with the default part and
# speed grade. The chip will be named $chip. $top indicates
# the top level design.
#
create_chip -progress -target $target -name $chip $top

#
# Set the current chip to add constraints
#
current_chip $chip

#
# Read the constraints file
#
source synth_constraints.fst

#######################################################
#
# Optimize the current chip
#
#######################################################
set opt_chip [format "%s-Optimized" $chip]
optimize_chip -progress -name $opt_chip

#######################################################
#
# Generate Reports
#
#######################################################
# Set current chip
current_chip $opt_chip

# Create the reports directory
exec rm -rf $report_dir
exec mkdir -p $report_dir

# Show any error and warning messages for the chip
list_message > $report_dir/$top.errors_warnings.rpt

# Create a timing report
report_timing > $report_dir/$top.timing.rpt

# Create a few other reports
report_chip -force > $report_dir/$top.chip.rpt
report_project -all > $report_dir/$top.project.rpt

#######################################################
#
# Export Verilog netlist, PPR netlist and constraints
# to $export_dir
#
#######################################################
# create the export directory
exec rm -rf $export_dir
exec mkdir -p $export_dir

# Export synopsys db files
export_chip -dir $export_dir -db

# export edif netlists
export_chip -progress -dir $export_dir

# export verilog netlists
export_chip -progress -dir $export_dir -simulation \
    VERILOG -primitive -timing_constraint

#
# Save and close the project
#
close_project
quit

B.1.2 Synthesis Constraints Script

#
# synth_constraints.fst
#
# This script sets constraints. It is called by the
# synth_compile.fst scripts.
#

set PMULT_TOP /$chip/uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top
set APB_TOP /$chip/uAHBAPBSys

#
# Specify the clock waveform
#
set_clock -period 30 -rise 0 -fall 15 HCLK_PORT
set_clock -period 15 -rise 0 -fall 8 ECMULT_CLK_PORT

# Eliminate the boundaries of the field units. Hopefully this
# will allow synopsys to generate the fastest logic.
set_module_primitive optimize "$PMULT_TOP/pmult_logic/mult_top"
set_module_primitive optimize "$PMULT_TOP/pmult_logic/mult_top/t_table"
set_module_primitive optimize "$PMULT_TOP/pmult_logic/alu_top"
set_module_primitive optimize "$PMULT_TOP/pmult_logic/pmult_ram"
set_module_primitive optimize "$PMULT_TOP/pmult_logic"
set_module_primitive optimize "$PMULT_TOP/pmult_useq"
set_module_primitive optimize "$PMULT_TOP"
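The two `set_clock` lines above are easier to compare with the rest of the flow once the periods are turned into frequencies. The short sketch below does that arithmetic, assuming the period, rise and fall values are in nanoseconds; `clock_summary` is a hypothetical helper, not a tool command.

```python
def clock_summary(period_ns, rise_ns, fall_ns):
    """Return (frequency in MHz, fraction of the period spent high)."""
    freq_mhz = 1000.0 / period_ns
    duty = (fall_ns - rise_ns) / period_ns
    return freq_mhz, duty

# Values copied from the set_clock commands in the constraints script.
clocks = [("HCLK_PORT", 30, 0, 15), ("ECMULT_CLK_PORT", 15, 0, 8)]
for name, period, rise, fall in clocks:
    freq, duty = clock_summary(period, rise, fall)
    print(f"{name}: {freq:.1f} MHz, high for {duty:.1%} of the period")
```

Under that assumption the bus clock is constrained to about 33.3 MHz and the co-processor clock to about 66.7 MHz, which lines up with `default_clock_frequency = 66` in the compile script if that value is read as MHz. The 8 ns fall time gives the faster clock a slightly asymmetric waveform because 15 ns does not halve to a whole number.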

B.2 Place and Route Scripts

Listed in this section are the files used to place and route the design. The first is the top-level script, which takes the design as a synthesized netlist and returns the final bit file. The second is the User Constraints File (UCF), which is used to constrain the design.

B.2.1 Top Level Place and Route Script

#! /bin/csh -f
setenv BASE_NAME AHBAHBTop

#
# Merge the RAM edn files and the synopsys generated edf
# files into one ngd file.
#
ngdbuild -p xcv2000e-6-fg680 \
         -aul \
         -sd ../../../pmult/user_cell \
         -uc $BASE_NAME.ucf \
         -dd . \
         $BASE_NAME.edf \
         $BASE_NAME.ngd

#
# Map the design to gates on the Virtex E FPGA
#
map -p xcv2000e-6-fg680 \
    $BASE_NAME.ngd \
    -o map.ncd \
    $BASE_NAME.pcf

#

# Place and Route the Design
#
par -w \
    -pl 5 \
    -rl 5 \
    map.ncd \
    $BASE_NAME.ncd \
    $BASE_NAME.pcf

#
# Run Static timing analysis
#
trce $BASE_NAME.ncd \
     $BASE_NAME.pcf \
     -e 1000 \
     -o $BASE_NAME.twr

trce $BASE_NAME.ncd \
     $BASE_NAME.pcf \
     -e 100 \
     -skew \
     -o $BASE_NAME-skew.twr

#
# Dump a verilog netlist and SDF file for timed simulation
#
ngdanno -o $BASE_NAME.nga \
        -s 6 \
        -p $BASE_NAME.pcf \
        -report \
        $BASE_NAME.ncd \
        map.ngm

ngd2ver -aka \
        -log $BASE_NAME.ngd2ver \
        -ne \
        -tm $BASE_NAME \
        -verbose -ul -w \
        -sdf_path . \
        $BASE_NAME.nga \
        $BASE_NAME.v

#
# Generate the bit file to be downloaded onto the FPGA
#
bitgen $BASE_NAME.ncd \
       $BASE_NAME.bit \
       -l -m -w \
       -f bitgen.ut

B.2.2 User Constraints File

##############################################################
# Clock Information
##############################################################
NET "HCLK_PORT" TNM_NET = "HCLK_PORT" ;
TIMESPEC TS_HCLK_PORT = PERIOD "HCLK_PORT" 30 ns HIGH 50% ;
NET "ECMULT_CLK_PORT" TNM_NET = "ECMULT_CLK_PORT" ;
TIMESPEC TS_ECMULT_CLK_PORT = PERIOD "ECMULT_CLK_PORT" 15 ns HIGH 50% ;

##############################################################
# Group Information
##############################################################
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/pmult_q/qx_reg_reg"
     TNM = "q_regs" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/pmult_q/qy_reg_reg"
     TNM = "q_regs" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/pmult_q/qz_reg_reg"
     TNM = "q_regs" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/mult_top/a_reg"
     TNM = "ffu_inputs" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/mult_top/b_reg"
     TNM = "ffu_inputs" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/alu_top/a_reg"
     TNM = "ffu_inputs" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/alu_top/b_reg"
     TNM = "ffu_inputs" ;

INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/pmult_biu/ram_read_en_a_reg"
     TNM = "strg_read_ens" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/pmult_biu/ram_read_en_b_reg"
     TNM = "strg_read_ens" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/pmult_biu/q_read_en_a_reg"
     TNM = "strg_read_ens" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/pmult_biu/q_read_en_b_reg"
     TNM = "strg_read_ens" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/mult_top/t_table/t_odata_reg"
     TNM = "mult_t_odata" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/mult_top/a_reg"
     TNM = "mult_a_reg" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/pmult_top/pmult_logic/mult_top/b_reg"
     TNM = "mult_b_reg" ;

INST "uAHBAPBSys/uAPBRegs/pmult_glue/ecm_odata_reg_reg"
     TNM = "ecmult_clk_buffer" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/write_data_reg_reg"
     TNM = "ahb_clk_buffers" ;
INST "uAHBAPBSys/uAPBRegs/ecm_addr_reg"
     TNM = "ahb_clk_buffers" ;
INST "uAHBAPBSys/uAPBRegs/pmult_glue/start_write_*"
     TNM = "ahb_clk_buffers";
INST "uAHBAPBSys/uAPBRegs/pmult_glue/start_read_*"
     TNM = "ahb_clk_buffers";

##############################################################
# Path Information
##############################################################
TIMESPEC TS_strg_rdens_2_ffus = FROM "strg_read_ens" TO "ffu_inputs" 30 ns ;
TIMESPEC TS_strg_2_ffus = FROM "q_regs" TO "ffu_inputs" 30 ns ;
TIMESPEC TS_ttable_a = FROM "mult_a_reg" TO "mult_t_odata" 15 ns ;
TIMESPEC TS_ttable_b = FROM "mult_b_reg" TO "mult_t_odata" 15 ns ;
TIMESPEC TS_hclk2ecclk = FROM "ahb_clk_buffers" TO "ECMULT_CLK_PORT" 60 ns;
TIMESPEC TS_ecclk2hclk = FROM "ecmult_clk_buffer" TO "PADS" 60 ns;
TIMESPEC TS_P2P = FROM PADS TO PADS 30 ns ;
OFFSET = IN 30 ns BEFORE "HCLK_PORT" ;
OFFSET = OUT 30 ns AFTER "HCLK_PORT" ;

##############################################################
# Port Information
##############################################################
NET HADDR LOC=m2;
NET HADDR LOC=m1;
NET HADDR LOC=l4;
NET HADDR LOC=l3;
NET HADDR LOC=l2;
NET HADDR LOC=l1;
NET HADDR LOC=k4;
NET HADDR LOC=k3;
NET HADDR LOC=k2;
NET HADDR LOC=k1;
NET HADDR LOC=j4;
NET HADDR LOC=j3;
NET HADDR LOC=j2;
NET HADDR LOC=j1;
NET HADDR LOC=h4;
NET HADDR LOC=h3;
NET HADDR LOC=h2;
NET HADDR LOC=h1;
NET HADDR LOC=g4;
NET HADDR LOC=g3;
NET HADDR LOC=g2;
NET HADDR LOC=g1;
NET HADDR LOC=f4;
NET HADDR LOC=f3;
NET HADDR LOC=f2;
NET HADDR LOC=f1;
NET HADDR LOC=e3;
NET HADDR LOC=e2;
NET HADDR LOC=e1;
NET HADDR LOC=d3;
NET HADDR LOC=d2;
NET HADDR LOC=d1;

NET HDATA LOC=y1;
NET HDATA LOC=w1;
NET HDATA LOC=ab2;
NET HDATA LOC=aa4;
NET HDATA LOC=aa3;
NET HDATA LOC=w4;
NET HDATA LOC=w3;
NET HDATA LOC=w2;
NET HDATA LOC=v5;
NET HDATA LOC=v4;
NET HDATA LOC=v3;
NET HDATA LOC=v2;
NET HDATA LOC=v1;
NET HDATA LOC=u5;
NET HDATA LOC=u4;
NET HDATA LOC=u3;
NET HDATA LOC=u2;
NET HDATA LOC=u1;
NET HDATA LOC=t4;
NET HDATA LOC=t3;
NET HDATA LOC=t2;
NET HDATA LOC=t1;
NET HDATA LOC=r4;
NET HDATA LOC=r3;
NET HDATA LOC=r2;
NET HDATA LOC=p2;
NET HDATA LOC=p1;
NET HDATA LOC=n4;
NET HDATA LOC=n3;
NET HDATA LOC=n2;
NET HDATA LOC=n1;
NET HDATA LOC=m3;

NET CTRLCLK1 LOC=av10;
NET CTRLCLK1 LOC=au10;
NET CTRLCLK1 LOC=at10;
NET CTRLCLK1 LOC=aw9;
NET CTRLCLK1 LOC=av9;
NET CTRLCLK1 LOC=au9;
NET CTRLCLK1 LOC=at9;
NET CTRLCLK1 LOC=aw8;
NET CTRLCLK1 LOC=av8;
NET CTRLCLK1 LOC=au8;
NET CTRLCLK1 LOC=at8;
NET CTRLCLK1 LOC=aw7;
NET CTRLCLK1 LOC=av7;
NET CTRLCLK1 LOC=au7;
NET CTRLCLK1 LOC=at7;
NET CTRLCLK1 LOC=aw6;
NET CTRLCLK1 LOC=av6;
NET CTRLCLK1 LOC=au6;
NET CTRLCLK1 LOC=at6;
NET PWRDNCLK1 LOC=av19;

NET CTRLCLK2 LOC=av15;
NET CTRLCLK2 LOC=au15;
NET CTRLCLK2 LOC=at15;
NET CTRLCLK2 LOC=aw14;
NET CTRLCLK2 LOC=av14;
NET CTRLCLK2 LOC=au14;
NET CTRLCLK2 LOC=at14;
NET CTRLCLK2 LOC=aw13;
NET CTRLCLK2 LOC=av13;
NET CTRLCLK2 LOC=au13;
NET CTRLCLK2 LOC=at13;
NET CTRLCLK2 LOC=aw12;
NET CTRLCLK2 LOC=av12;
NET CTRLCLK2 LOC=au12;
NET CTRLCLK2 LOC=aw11;
NET CTRLCLK2 LOC=av11;
NET CTRLCLK2 LOC=au11;
NET CTRLCLK2 LOC=at11;
NET CTRLCLK2 LOC=aw10;
NET PWRDNCLK2 LOC=at21;

NET SDATA LOC=aw36;
NET SDATA LOC=av36;
NET SDATA LOC=au36;
NET SDATA LOC=aw35;
NET SDATA LOC=av35;
NET SDATA LOC=aw34;
NET SDATA LOC=av34;
NET SDATA LOC=au34;
NET SDATA LOC=at34;
NET SDATA LOC=aw33;
NET SDATA LOC=av33;
NET SDATA LOC=au33;
NET SDATA LOC=at33;
NET SDATA LOC=aw32;
NET SDATA LOC=av32;
NET SDATA LOC=au32;
NET SDATA LOC=at32;
NET SDATA LOC=aw31;
NET SDATA LOC=av31;
NET SDATA LOC=au31;
NET SDATA LOC=at31;
NET SDATA LOC=aw30;
NET SDATA LOC=av30;
NET SDATA LOC=au30;
NET SDATA LOC=ar4;
NET SDATA LOC=ah1;
NET SDATA LOC=ag2;
NET SDATA LOC=ad3;
NET SDATA LOC=r1;
NET SDATA LOC=p3;
NET SDATA LOC=p4;
NET SDATA LOC=c2;

#NET SADDR LOC=al36; # reserved for expansion
NET SADDR LOC=am39;
NET SADDR LOC=am38;
NET SADDR LOC=am37;
NET SADDR LOC=am36;
NET SADDR LOC=an39;
NET SADDR LOC=an38;
NET SADDR LOC=an37;
NET SADDR LOC=an36;
NET SADDR LOC=ap39;
NET SADDR LOC=ap38;
NET SADDR LOC=ap37;
NET SADDR LOC=ap36;
NET SADDR LOC=ar39;
NET SADDR LOC=ar38;
NET SADDR LOC=ar37;
NET SADDR LOC=ar36;
NET SADDR LOC=at39;
NET SADDR LOC=at38;

NET SMODE LOC=ah37;
NET SnWR LOC=aj38;
NET SnWBYTE LOC=ak39;
NET SnWBYTE LOC=ak38;
NET SnWBYTE LOC=ak37;
NET SnWBYTE LOC=ak36;
NET SnCE LOC=al39;
NET SCLK LOC=al38;
NET SnOE LOC=aj36;
NET SnCKE LOC=aj39;
NET SADVnLD LOC=aj37;

NET SW LOC=au17;
NET SW LOC=at17;
NET SW LOC=ar17;
NET SW LOC=aw16;
NET SW LOC=av16;
NET SW LOC=au16;
NET SW LOC=at16;
NET SW LOC=aw15;
NET LED LOC=b37;
NET LED LOC=at19;
NET LED LOC=aw18;
NET LED LOC=av18;
NET LED LOC=au18;
NET LED LOC=at18;
NET LED LOC=ar18;
NET LED LOC=aw17;
NET LED LOC=av17;
NET nPBUTT LOC=AU19;

NET FnOE LOC=aw29;
NET FnWE LOC=at30;

NET nLMINT LOC=AF38;

NET HDRID LOC=af36;
NET HDRID LOC=ag39; # tie high
NET HDRID LOC=ag38; # tie high
NET HDRID LOC=ag37;

NET RTCK LOC=ac37;
NET TDO LOC=ad39;
NET TCK LOC=ac35;
NET TDI LOC=ad38;

NET HCLK_PORT LOC=a20;
NET HRESETn LOC=ag36;

NET HSIZE LOC=ak4;
NET HSIZE LOC=ak3;
NET HTRANS LOC=ak2;
NET HTRANS LOC=ak1;
NET HRESP LOC=an3;
NET HRESP LOC=an2;
NET HREADY LOC=an1;
NET HWRITE LOC=am4;

#not currently used
NET HBUSREQ LOC=a4;
NET HLOCK LOC=a5;

NET ECMULT_CLK_PORT LOC=aw19;

Bibliography

[1] Wireless Application Protocol - Version 1.0, 1998.
[2] G. B. Agnew, R. C. Mullin, and S. A. Vanstone. An implementation of elliptic curve cryptosystems over F_{2^155}. IEEE Journal on Selected Areas in Communications, 11:804–813, June 1993.
[3] Marcus Bednara, Michael Daldrup, Joachim von zur Gathen, Jamshid Shokrollahi, and Jurgen Teich. Implementation of elliptic curve cryptographic coprocessor over GF(2^m) on an FPGA. In International Parallel and Distributed Processing Symposium: IPDPS Workshops, April 2002.
[4] Ian Blake, Gadiel Seroussi, and Nigel Smart. Elliptic Curves in Cryptography. Cambridge University Press, 1999.
[5] D. Chudnovsky and G. Chudnovsky. Sequences of numbers generated by addition in formal groups and new primality and factoring tests. Advances in Applied Mathematics, 1987.
[6] Canadian Microelectronics Corporation. CMC Rapid-Prototyping Platform: Design Flow Guide, 2002.

[7] Canadian Microelectronics Corporation. CMC Rapid-Prototyping Platform: Installation Guide, 2002.
[8] T. Dierks and C. Allen. The TLS Protocol - Version 1.0. IETF RFC 2246, 1999.
[9] Joseph A. Gallian. Contemporary Abstract Algebra. Houghton Mifflin Company, 1998.
[10] Lijun Gao, Sarvesh Shrivastava, and Gerald E. Sobelman. Elliptic curve scalar multiplier design using FPGAs. In Cryptographic Hardware and Embedded Systems (CHES), 1999.
[11] Daniel M. Gordon. A survey of fast exponentiation methods. J. Algorithms, 27(1):129–146, 1998.
[12] Nils Gura, Sheueling Chang Shantz, Hans Eberle, Sumit Gupta, Vipul Gupta, Daniel Finchelstein, Edouard Goupy, and Douglas Stebila. An end-to-end systems approach to elliptic curve cryptography. In Cryptographic Hardware and Embedded Systems (CHES), 2002.
[13] M. Anwarul Hasan. Look-up table-based large finite field multiplication in memory constrained cryptosystems. IEEE Transactions on Computers, 49(7), July 2000.
[14] IEEE. P1363: Editorial Contribution to Standard for Public Key Cryptography, February 1998.
[15] T. Itoh and S. Tsujii. A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases. Information and Computation, 78(3):171–177, 1988.

[16] Brian King. An improved implementation of elliptic curves over GF(2^n) when using projective point arithmetic. In Selected Areas in Cryptography, 2001.
[17] Neal Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 1987.
[18] Neal Koblitz. CM curves with good cryptographic properties. In Advances in Cryptology, Crypto '91, pages 279–287. Springer-Verlag, 1991.
[19] Philip H. W. Leong and Ivan K. H. Leung. A microcoded elliptic curve processor using FPGA technology. IEEE Transactions on VLSI Systems, 10(5), October 2002.
[20] Julio Lopez and Ricardo Dahab. Improved algorithms for elliptic curve arithmetic in GF(2^n). In Selected Areas in Cryptography, pages 201–212, 1998.
[21] Robert J. McEliece. Finite Fields for Computer Scientists and Engineers. Kluwer Academic Publishers, 1989.
[22] Alfred Menezes. Elliptic Curve Public Key Cryptosystems. Kluwer Academic Publishers, 1993.
[23] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied Cryptography. CRC Press LLC, 1997.
[24] Victor Miller. Uses of elliptic curves in cryptography. In Advances in Cryptology, Crypto '85, 1985.
[25] NIST. FIPS 186-2 draft, Digital Signature Standard (DSS), 2000.

[26] Souichi Okada, Naoya Torii, Kouichi Itoh, and Masahiko Takenaka. Implementation of elliptic curve cryptographic coprocessor over GF(2^m) on an FPGA. In Cryptographic Hardware and Embedded Systems (CHES), pages 25–40. Springer-Verlag, 2000.
[27] OpenSSL. See http://www.openssl.org.
[28] Gerardo Orlando and Christof Paar. A high-performance reconfigurable elliptic curve processor for GF(2^m). In Cryptographic Hardware and Embedded Systems (CHES), 2000.
[29] Arash Reyhani-Masoleh. Low Complexity and Fault Tolerant Arithmetic in Binary Extended Finite Fields. PhD thesis, University of Waterloo, 2001.
[30] Martin Christopher Rosner. Elliptic curve cryptosystems on reconfigurable hardware. Master's thesis, Worcester Polytechnic Institute, 1998.
[31] Jerome A. Solinas. Improved algorithms for arithmetic on anomalous binary curves. In Advances in Cryptology, Crypto '97, 1997.
[32] S. Sutikno, R. Effendi, and A. Surya. Design and implementation of arithmetic processor F_{2^155} for elliptic curve cryptosystems. In IEEE Asia-Pacific Conference on Circuits and Systems, pages 647–650, November 1998.