Crypto@1408Bit – A New Dimension of PKI Hardware Accelerator

Astrid Elbe, Tanja Römer and Norbert Janssen
Infineon Technologies AG, Security and Chipcard ICs, 81609 Munich, Germany
Email: {astrid.elbe/norbert.janssen/tanja.roemer}@infineon.com

Abstract. A new architecture for a very fast and secure public key crypto-coprocessor, Crypto@1408Bit, usable in Smart Card ICs is presented. The key elements of the Crypto@1408Bit architecture are a very fast Look Ahead Algorithm for modular multiplication, a very fast and secure serial-parallel adder, fast and chip-area-efficient carry handling, and a sufficient number of working registers to enable easy programming. With this architecture a new dimension of crypto performance and security against side channel attacks is achieved. Compared to crypto-coprocessors currently available on the Smart Card IC market, Crypto@1408Bit is more than an order of magnitude faster. The security of the crypto-coprocessor relies on hardware and software security features such as dual-rail security logic against differential power attacks, high-security registers for critical operands, and a register length with a buffer of up to 128 bit for randomization of operands.

1. Introduction

Public-key cryptography is used in more and more smart card applications such as banking, Pay TV and advanced GSM applications. The RSA scheme is still the most popular one, although elliptic curves are a very interesting alternative, especially for applications with a limited power budget such as mobile phones. Smart cards are the most secure and popular tokens for storing secret data. Secret keys are generated on the card using a true Random Number Generator [13]. These keys are used, for example, for digital signatures, authentication or key exchange. The key length depends on the application. For typical smart card applications a key length of approximately 1400 bit is recommended for the next 10 years. For other applications, such as ICs for PC security, a key length of at least 2048 bit is mandatory.

All common public key algorithms use modular integer arithmetic. Modular arithmetic with the above-mentioned bit lengths is not efficiently supported by chip card CPUs. In [1] we provided an overview of different architectural concepts [2, 3, 4, 5, 6, 7, 8] for the design of a long integer arithmetic unit which supports RSA and elliptic curves. This paper presents the most important features of the architecture of the crypto-coprocessor Crypto@1408Bit, which is based on a serial-parallel design using the ZDN algorithm (ZDN = Zwei Drittel N, German for "two thirds of N") for modular multiplication, cf. [1, 7]. According to [1], the serial-parallel architecture combined with the ZDN algorithm provides the best ratio of performance to chip area.

The paper is organized as follows: Section 2 summarizes the most important mathematical aspects of RSA and elliptic curve cryptosystems. The architecture of Crypto@1408Bit is described in Section 3. Section 4 gives a detailed introduction to the ZDN look-ahead algorithm for multiplication and modular reduction. The fast and secure hardware implementation of this algorithm in Crypto@1408Bit is described in Section 5. The paper finishes in Section 6 with a comparison with other hardware realizations, highlighting the benefits of our implementation.

2. RSA and Elliptic Curve Cryptosystems

Crypto@1408Bit is optimized for public-key cryptography based on the well-known RSA algorithm as well as on elliptic curves. Both schemes are based on long modular integer arithmetic. For an RSA signature, let M be the hash value of a message, d the secret key and N the public modulus with N = pq, where p and q are secret primes; then the modular exponentiation M^d mod N is calculated as the RSA signature. Crypto@1408Bit calculates this exponentiation as a sequence of modular multiplications and modular squarings. Therefore, the hardware must support modular multiplications with very long operands of up to 1400 bit very efficiently. It would be possible to implement a full modular exponentiation in hardware. Crypto@1408Bit does not provide this, in order to leave more flexibility for the programmer to implement a secure modular exponentiation by using hardware and software countermeasures against side channel attacks.

An alternative to RSA is elliptic curve cryptography over GF(p) or GF(2^d), which is based on multiplications of points on an elliptic curve. Crypto@1408Bit calculates this operation as a sequence of additions of curve points, each of which can be done by a sequence of modular multiplications, modular squarings, modular additions, modular subtractions and inversions. Compared to RSA, the operand length is much smaller for elliptic curves, but the number of parameters involved is larger.
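The decomposition of M^d mod N into modular squarings and multiplications described above can be sketched in software. The following is a minimal illustration only: the function name `mod_exp` and the toy parameters are assumptions made for this example, the key is far too small to be secure, and the sketch contains none of the side-channel countermeasures that the paper leaves to the programmer.

```python
def mod_exp(m: int, d: int, n: int) -> int:
    """Left-to-right square-and-multiply: returns m**d mod n.

    Each loop iteration corresponds to one modular squaring and,
    for every 1 bit of the exponent, one modular multiplication.
    """
    z = 1
    for i in reversed(range(d.bit_length())):
        z = (z * z) % n          # modular squaring
        if (d >> i) & 1:
            z = (z * m) % n      # modular multiplication
    return z


# Toy RSA parameters (hypothetical, for illustration only).
p, q = 61, 53
n = p * q                        # public modulus N = pq
e, d = 17, 2753                  # e * d = 1 mod (p-1)(q-1)
message_hash = 65
signature = mod_exp(message_hash, d, n)
recovered = mod_exp(signature, e, n)
```

Verifying the signature with the public exponent returns the original hash value, which is the property the test below checks.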

3. Architecture of Crypto@1408Bit

As described in Section 2, modular multiplication is the basic operation for RSA and elliptic curve cryptography. This operation has to be supported in a fast and secure way by the crypto-processor Crypto@1408Bit. In [1] different methods and hardware realisations for modular multiplication have been evaluated [2, 3, 4, 5, 6, 7, 8]. A serial-parallel architecture of the arithmetic unit in combination with the ZDN algorithm for modular multiplication [7] provides the best trade-off between high performance, small area and low power consumption.

The crypto-processor architecture has two main parts: the Calculation Unit CU and the Control Unit Ctrl. The Calculation Unit contains register C, modulus register N, accumulator Z, the Crypto Registers CR, shifters for the operands and the Arithmetic Unit AU, which includes a three-operand adder for calculating the result of the modular multiplication. The Control Unit manages the processes inside the Calculation Unit (especially the execution of the ZDN algorithm) and the communication with the Host CPU. Three different operational modes are available, supporting the main crypto-applications:

1. Long Mode for standard RSA calculations
2. Parallel Mode optimised for RSA calculations using the Chinese Remainder Theorem (CRT)
3. Short Mode optimised for elliptic curve cryptography over GF(p) and GF(2^d)

[Block diagram. Calculation Unit CU = CU0 + CU1: Crypto Registers CR (CR0, CR4), working registers C, N and Z (1408 bit), shifters (-3 ... 3 and 0 ... 5), multiplexers with sign selection (-1, 0, +1), and Arithmetic Unit AU = AU0 + AU1 (1408 bit). Control Unit Ctrl: length register LR, Timing Generator, Length Control Unit, Register Control Unit, Arithmetic Control Unit, Decoder and Host Interface (Host Clock, Disable, Bus Status, Reset, Host Bus).]

Figure 1 Register Architecture in the Long Mode

As shown in Figure 1, in the Long Mode the register length and the length of the Arithmetic Unit are 1408 bit. Two Crypto Registers, CR0 and CR4, are available. C, N and Z denote the three Working Registers. The Control Unit Ctrl controls the Calculation Unit.

The Short Mode is suitable for elliptic curve cryptography. In order to minimize power consumption, the lower half of the Arithmetic Unit and of the registers is switched off and only the most significant 576 bits of each register are used. Note, however, that the lower halves of the Crypto Registers CR0, CR2, CR4 and CR6 can still be used to store the multiplier for modular and non-modular multiplications. The mode is changed with a special instruction.

The register architecture of Crypto@1408Bit is strictly accumulator based. This follows naturally from the Look Ahead Algorithm used for modular and non-modular multiplication, described in Section 4. The accumulator is register Z, i.e. the result of every arithmetic instruction is stored in this register. To save the result the programmer can use registers C, N or one of the registers CRi. The registers CRi may also be used to store additional parameters or intermediate results. The large number of CRi registers significantly increases the security and usability of Crypto@1408Bit, since transfers into and out of the coprocessor are minimized. For modular multiplications the N register contains the modulus N. For the programmer, the registers N, C and CRi can be addressed in the same way as ordinary RAM. Note that the registers CRi and N can store arbitrary parameters or data while Crypto@1408Bit is not in use; they therefore serve as additional RAM for the Host CPU. The configuration of the Working Registers C, N and Z and of the Crypto Registers CRi depends on the chosen configuration mode of the crypto-coprocessor.
Because of the special implementation of the Look Ahead Algorithm for the modular multiplication instructions, a so-called “underflow buffer” is necessary. This buffer makes it possible to store the shifted contents of the working registers C and Z. For algorithmic reasons the length of the underflow buffer is 32 bit. Therefore, bits 0 to 31 are reserved during modular multiplications. Because of the underflow buffer and for security reasons (e.g. randomization), the real register length is chosen larger than the maximum operand length, as described above.

In order to perform subtractions with the adder, the operand to be subtracted is added in negated form. Since numbers are stored in two’s complement representation, this simply means complementing the bits and adding 1. The three inputs of the adder are needed for modular arithmetic, where the third operand is the modulus stored in register N, see Figure 1.

As described above, the Control Unit is responsible for managing the processes in the Calculation Unit and the communication with the Host CPU. The Control Unit consists of five main parts: Host Interface, Timing Generator, Length Control Unit, LSB Control Unit and Instruction Decoder. The Host Interface is responsible for the communication with the Host CPU. The Timing Generator generates the internal clock phases. The Length Control Unit and the LSB Control Unit manage the operand length and the position of the LSB, respectively. The Instruction Decoder decodes the instructions. The Crypto@1408Bit crypto-coprocessor has its own instruction set, which has been developed in accordance with the IEEE P1363 cryptographic standard. The Host CPU controls the crypto-coprocessor by writing data and instruction opcodes into Crypto@1408Bit via the Host Interface.
Depending on the decoder part of the Control Unit used for instruction decoding, three classes of instructions can be distinguished: General Control Instructions, Register Control Instructions and Arithmetic Instructions. The Control Unit of Crypto@1408Bit has 52 instructions. All instruction opcodes are one byte long. The instruction set reflects the accumulator-based architecture, which follows naturally from the multiplication algorithm. For programming Crypto@1408Bit, the instruction opcodes can be used directly in the source code, although this is generally not advised because this kind of coding is quite time-consuming for the user. We suggest using the macro-assembler, which provides symbolic names for the instructions. The main instructions are modular multiplication and reduction based on the ZDN Look Ahead Algorithm, which speeds up the modular multiplication by an average factor of 2.7.

4. ZDN Look Ahead Algorithm for Modular Multiplication

Consider a modular multiplication Z = C ⋅ M mod N with modulus N, multiplicand C, multiplier M and result Z. In order to accelerate a standard non-modular multiplication, a Booth Algorithm can be used. Two variants of the Booth Algorithm exist: an algorithm using variable shift length [9] and an algorithm with fixed shifts as described in [10] and [11]. The algorithm using variable shift length is easier to implement in hardware and was therefore chosen for our implementation. This standard Booth Algorithm can be modified to achieve a performance increase by a factor of 3 instead of the factor of 2 of the standard Booth Algorithm. For modular multiplication a reduction step has to be calculated in addition to the multiplication. For this case the ZDN algorithm developed by Holger Sedlak [7] uses a Look Ahead Algorithm for multiplication, based on the modified Booth method, combined with an interleaved modular reduction. The ZDN algorithm determines the Look Ahead parameters, i.e. signs and shift values. The calculation of the ZDN parameters is done in the Instruction Decoder of the Control Unit. The resulting operation is carried out in a 3-operand adder.

4.1. Modified Booth Algorithm for Multiplication over GF(p)

If

$$M = \sum_{m=0}^{L(M)-1} M_m 2^m = \sum_{m=1}^{L(M)} M_{m-1} 2^{m-1} = \sum_{m=L(M)}^{1} M_{L(M)-m} 2^{L(M)-m} \tag{1}$$

is the binary representation of the multiplier, then the product is

$$C \cdot M = \sum_{m=L(M)}^{1} C \cdot M_{L(M)-m} 2^{L(M)-m}. \tag{2}$$

The standard shift-and-add algorithm for the multiplication Z = C ⋅ M is given by

    Z:=0;
    for m:=L(M)-1 downto 0 do
    begin
      Z:=2*Z;                     {Shift}
      if (M[m]=1) then Z:=Z+C;    {Add}
    end

where L(M) denotes the length of the multiplier, which is stored in the length register LR. Note that the multiplier M is scanned from left to right (from the most significant to the least significant bit), one bit at a time. The original Booth method [12] scans two bits of the multiplier M at a time and uses the property

$$[0\,1\,1 \ldots 1\,1\,0 \ldots 0]_2 = \sum_{i=\nu}^{\mu} 2^i = 2^{\mu+1} - 2^{\nu} \tag{3}$$

of binary numbers, where $\mu$ and $\nu$ denote the positions of the most significant and the least significant 1 of the string.

Thus, arithmetic operations are only performed at 01 and 10 borders of the multiplier; see Table 1 for the operations.

    M_m | M_{m-1} | Operation
    ----+---------+-----------
     0  |    1    | Shift; Add
     1  |    0    | Shift; Sub
     0  |    0    | Shift
     1  |    1    | Shift

Table 1 Operations of original Booth method
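The operations of Table 1 can be tried out in a few lines of software. The sketch below is an illustration under stated assumptions, not the hardware algorithm: the function name `booth_multiply` is hypothetical, the multiplier is taken to be non-negative, and one extra leading zero bit M[L(M)] = 0 as well as M[-1] = 0 are assumed so that every 01 and 10 border is seen.

```python
def booth_multiply(c: int, m: int) -> int:
    """Multiply c by a non-negative multiplier m using Table 1.

    The bit pair (M[pos], M[pos-1]) is inspected from left to right;
    an addition or subtraction happens only at 01 and 10 borders.
    """
    def bit(i: int) -> int:
        # Bits below position 0 are defined as 0 (M[-1] = 0).
        return (m >> i) & 1 if i >= 0 else 0

    z = 0
    # Start one position above the top bit so that a leading 0 is seen.
    for pos in range(m.bit_length(), -1, -1):
        z = 2 * z                        # Shift
        pair = (bit(pos), bit(pos - 1))
        if pair == (0, 1):
            z += c                       # Add
        elif pair == (1, 0):
            z -= c                       # Sub
    return z


product = booth_multiply(101, 0b0111110)   # one 1-string: a single Add and a single Sub
```

For a multiplier that consists of one long string of 1s, only two arithmetic operations are needed, which is the saving expressed by property (3).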

Now, the modified Booth Algorithm not only considers two consecutive bits but looks for strings of a distinct form. An α-string, α ∈ {0, 1}, is defined to be a sequence of 0’s and 1’s of one of the following forms:

1. A 0-string with an isolated 1, e.g., 0...010....
2. A 0-string followed by a 1-string, e.g., 0...011....
3. A 1-string with an isolated 0, e.g., 1...101....
4. A 1-string followed by a 0-string, e.g., 1...100....

The modified Booth algorithm is defined to be in Look Ahead state zero (LA=0) when shifting across zeros (cases 1, 2) and in Look Ahead state one (LA=1) otherwise (cases 3, 4). If the result of the multiplication is stored in register Z, the multiplicand in register C and the multiplier M (which is scanned bit by bit) in, e.g., a register CRi, then the following rules apply:

1. If there is an isolated 1 in a 0-string, then add C to Z at this position.
2. If a 0-string is followed by a 1-string, then add C to Z at the position of the last 0. Change of LA.
3. If there is an isolated 0 in a 1-string, then subtract C from Z at this position.
4. If a 1-string is followed by a 0-string, then subtract C from Z at the position of the last 1. Change of LA.
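Rules 1 to 4 amount to a signed-digit recoding of the multiplier. The following sketch illustrates this in software; it is not the hardware Look Ahead logic. The names `recode` and `multiply_recoded` are assumptions for this example, the multiplier is non-negative, bits below position 0 are taken as 0, and the shifter limit that forces a = 0 in the real implementation is ignored.

```python
def recode(m: int) -> dict:
    """Apply rules 1 to 4 to a non-negative multiplier m.

    Returns a mapping {bit position: +1 or -1}: +1 means "add C at this
    position", -1 means "subtract C at this position".
    """
    def bit(i: int) -> int:
        return (m >> i) & 1 if i >= 0 else 0

    digits = {}
    la = 0                                # Look Ahead state, starts in a 0-string
    for i in range(m.bit_length() - 1, -2, -1):
        if bit(i) == la:
            continue                      # still shifting across the current string
        sign = 1 if la == 0 else -1
        if bit(i - 1) == la:
            digits[i] = sign              # rules 1 and 3: isolated bit
        else:
            digits[i + 1] = sign          # rules 2 and 4: act at the last bit of the old string
            la = 1 - la                   # change of LA
    return digits


def multiply_recoded(c: int, m: int) -> int:
    """Form C * M from the recoded digits."""
    return sum(sign * c * 2 ** pos for pos, sign in recode(m).items())


example = recode(0b110011)                # 51 = 64 - 16 + 4 - 1
```

The number of entries in the returned mapping is the number of additions and subtractions the multiplication needs, so long runs of equal bits cost a single operation.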

The multiplier M is scanned from left to right. It is necessary to introduce M[-1]=0 and M[-2]=0. sz denotes the number of bits by which Z is shifted; the sign a ∈ {-1, 1} indicates whether C is added or subtracted. Because of restrictions in the practical implementation (the limited shift width cur_k of the shifter), the value a=0 is needed, too. The so-called Look Ahead Algorithm Look Ahead (a, sz) is given by:

    sz = 0;                 /* initialize shift value */
    a = 1 - 2*LA;           /* compute sign: if LA=0 then a=1, i.e. addition of C;
                               if LA=1 then a=-1, i.e. subtraction of C */
    start:
    if ((M[m-1] != LA) && (M[m-2] != LA)) {
        LA = 1 - LA;        /* switch look ahead status */
    } else {
        if ((m != 0) && (sz != cur_k)) {
            /* keep look ahead status and go ahead */
            m = m - 1;
            sz = sz + 1;
            if (M[m] == LA) goto start;
        } else
            a = 0;          /* a=0, no addition or subtraction */
    }

Summarizing, the Look Ahead Algorithm for multiplication generates a sign a ∈ {-1, 0, 1} and a shift value $s_Z$, which are used to execute the steps

$$Z = Z \cdot 2^{s_Z} \tag{4}$$

$$Z = Z + a \cdot C \tag{5}$$

which are part of the modified Booth Algorithm to calculate Z = C ⋅ M. The shifts are arithmetic shifts realized in hardware with a barrel shifter. It can be shown that the average shift value for multiplication is theoretically 3, assuming independent and identically uniformly distributed multiplier bits, an infinite length of the operands and an infinite shifter. In reality, however, the barrel shifter can only shift k bits at once, which reduces the average shift value from 3 to $3 - \frac{4}{2^k}$.

Further reductions of the shift value result from the interleaving of the multiplication and the modular reduction, which is described below, and from hardware properties.

4.2. ZDN Principle for Modular Reduction over GF(p)

As stated above, a modular multiplication requires an additional reduction step. There are two possible ways to perform this reduction: either reduce after calculating the product M ⋅ C, or interleave reduction and multiplication. Since the length of the intermediate result M ⋅ C is about the sum of the two operand lengths, the first approach is not suitable for cryptographic hardware. Additionally, the first approach is very slow. Thus, a reduction algorithm is needed which can run in parallel to the modified Booth method. To obtain optimal performance of the overall algorithm, the reduction algorithm must have a performance comparable to the multiplication algorithm, i.e., its average shift values should be close to those of the Booth method. The ZDN Algorithm includes a reduction step that fulfils these requirements. The principle of this reduction method is described below.

Consider Z and N as shown in Figure 2 and, for simplicity, ignore C. After shifting Z up by $s_Z$, the goal is to find a shift value $s_N$ for N such that $2^{s_Z} Z$ and $2^{s_N} N$ have about the “same size”. For the ZDN algorithm this means

$$-\frac{1}{3}\, 2^{s_N} N < 2^{s_Z} Z - 2^{s_N} N \le \frac{1}{3}\, 2^{s_N} N. \tag{6}$$

Mathematically this is always possible for exactly one $s_N$. With $s_i := s_Z - s_N$ one can write

$$\frac{2}{3}\, 2^{-s_i} N < Z \le \frac{4}{3}\, 2^{-s_i} N. \tag{7}$$

Note that the shift values $s_Z$ and $s_i$ can be computed in parallel. The ZDN algorithm is therefore based on the comparison of the intermediate result Z with certain multiples of $\frac{2}{3} N$.

Figure 2 Relative positions and shifts of registers Z and N

Now for the reduction: if Z ≥ 0, set the sign b of N to -1, otherwise set b = 1. In the first case the value $2^{s_Z - s_i} N$ is subtracted from $2^{s_Z} Z$; otherwise it is added to $2^{s_Z} Z$. With

$$Z_1 := 2^{s_Z} Z, \tag{8}$$

$$N_1 := 2^{s_N} N = 2^{s_Z - s_i} N, \tag{9}$$

$$Z'_1 := Z_1 + b N_1 \tag{10}$$

and the above choice of $s_i$ and $s_N = s_Z - s_i$ it follows that

$$|Z'_1| \le \frac{1}{3} N_1, \tag{11}$$

which implies that after reduction, the shifted Z is reduced by at least one bit. Summarizing, the “reduction part” of the ZDN Algorithm generates the Look Ahead parameters, sign value b and shift value $s_N$, which are used to calculate

$$N := 2^{s_N} N \tag{12}$$

$$Z := 2^{s_Z} Z + b \cdot N \tag{13}$$

which in turn are needed to perform the modular reduction Z = Z mod N.
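The comparison with multiples of 2/3 N in (6) to (11) can be illustrated by a stand-alone reduction routine. This is a simplified sketch under stated assumptions, not the interleaved hardware algorithm: Z is not shifted ($s_Z = 0$), so the shift k used below plays the role of $s_N = -s_i$; the function name `zdn_reduce` is hypothetical, and |Z| is used in the comparison so that negative intermediate values are handled as well.

```python
def zdn_reduce(z: int, n: int) -> int:
    """Reduce an integer z modulo n > 0 by comparing with multiples of 2/3 * n.

    Each step picks the shift k with (2/3) * 2**k * n < |z| <= (4/3) * 2**k * n
    and adds b * 2**k * n, where b = -1 for z >= 0 and b = +1 otherwise.
    """
    while 3 * abs(z) > 2 * n:                       # |z| still exceeds 2/3 * n
        k = 0
        while 2 * (n << (k + 1)) < 3 * abs(z):      # largest k with 2/3 * 2**k * n < |z|
            k += 1
        b = -1 if z >= 0 else 1                     # sign of the modulus term
        z += b * (n << k)
        assert 3 * abs(z) <= (n << k)               # inequality (11)
    # Now |z| <= 2/3 * n; map a negative remainder into the range 0 .. n-1.
    return z + n if z < 0 else z


remainder = zdn_reduce(100, 7)    # steps: 100 - 16*7 = -12, then -12 + 2*7 = 2
```

Because each step leaves at most one third of the shifted modulus, the intermediate value loses at least one bit per step, and the sign b alternates as needed without a restoring correction.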


4.3. Interleaving of Multiplication and Modular Reduction over GF(p)

In order to perform a modular multiplication, the modified Booth algorithm and the Look Ahead Algorithm for modular reduction described above are combined into the ZDN algorithm. This means that the multiplication and the reduction run in parallel, and the Look Ahead parameters a, $s_Z$, b and $s_N$ are computed simultaneously. The shifting of the registers is parallelized, too. The operations

$$Z := 2^{s_Z} Z + a \cdot C \quad \text{and} \quad Z := Z + b \cdot 2^{s_N} N$$

are combined into a single 3-operand addition

$$Z = 2^{s_Z} Z + a \cdot C + b \cdot 2^{s_N} N. \tag{14}$$

If the shift value $s_Z$ exceeds the most significant bit (MSB) of register Z, then Z is not shifted further to the left. Instead, register C is shifted to the right. In this way an overflow buffer is avoided and only an underflow buffer is needed. A control mechanism ensures that C does not leave the underflow buffer. Note that the shift value $s_Z'$ in Figure 2 is equal to the absolute value of the Look Ahead parameter $s_i$ of the modular reduction step.
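The effect of interleaving can be seen in a deliberately simplified software model of update (14). The sketch below is an assumption-laden illustration, not the ZDN algorithm itself: there is no look-ahead, so the shift is fixed to $s_Z = 1$ and $s_N = 0$, the sign a is just the current multiplier bit, and several b steps may follow one a step where the hardware would merge them into one 3-operand addition. The name `interleaved_modmul` is hypothetical.

```python
def interleaved_modmul(c: int, m: int, n: int):
    """Compute (c * m) mod n with reduction interleaved into the multiplication.

    Assumes 0 <= c < n and m >= 0. Returns the result together with the
    largest absolute intermediate value that occurred.
    """
    z = 0
    peak = 0
    for i in reversed(range(m.bit_length())):
        a = (m >> i) & 1
        z = 2 * z + a * c                 # multiplication part of (14)
        peak = max(peak, abs(z))
        while 3 * abs(z) > 2 * n:         # reduction part: keep |z| <= 2/3 * n
            b = -1 if z > 0 else 1
            z += b * n
    result = z + n if z < 0 else z        # final correction into 0 .. n-1
    return result, peak


result, peak = interleaved_modmul(1234, 5678, 9973)
```

The second return value shows why no double-length intermediate result is needed: with c < n the accumulator never grows beyond a small multiple of the modulus.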

[Figure 3: register diagram. The MSB position of the registers is 351 in Short Mode, 703 in Parallel Mode and 1407 in Long Mode. Z is shifted to Z1 by s'_Z relative to N; C is shifted to C1 by s''_Z towards the underflow buffer, and the LSB of the registers is updated as LSB := LSB - s''_Z.]

Figure 3 Relative positions and shifts of registers Z, C and N

The average shift value of 3 bits shows that both Look Ahead algorithms (for multiplication and for modular reduction) have the same shift value on average. The average shift value of 3 is only achieved under the assumption of independent and identically uniformly distributed bits, an infinite length of operands and infinitely long shifters. In practice, however, shifter and register lengths are limited, and the average shift value decreases to 2.7. The shift value is reduced whenever the maximum possible shift value is larger than the maximum shift length realised in the shifter.

4.4. Modular Multiplication over GF(2^d)

The Look Ahead Algorithm over GF(2^d) differs in some parts from the Look Ahead Algorithms over GF(p). These differences stem from the fact that the elements of GF(2^d) are represented by polynomials over GF(2). The following differences result:

1. Over GF(2^d), addition and subtraction are both given by the XOR operation. Consequently every element is its own additive inverse. Therefore the sign values a and b calculated in the Look Ahead Algorithms are non-negative, i.e. 0 or 1.
2. Equations (1), (2) and (3) of Section 4.1 no longer hold over GF(2^d). This implies that rules 3 and 4 of Section 4.1 cannot be applied. Only shifting over 0’s is allowed. Therefore, the average shift value $E_M(s_Z)$ is reduced to 1.8.
3. The value 2/3 N is not defined over GF(2^d). Therefore, the comparison of Z and N performed during the Look Ahead Algorithm for reduction is replaced by a comparison of the degrees of N(x) and Z(x), the polynomials representing N and Z.
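The three differences listed above can be illustrated with a bit-serial multiplication of binary polynomials. This is a minimal sketch assuming one bit per step (no look-ahead over 0-strings); the function name `gf2_modmul` is hypothetical, polynomials are stored as integers with bit i holding the coefficient of x^i, and the example field GF(2^8) with modulus x^8 + x^4 + x^3 + x + 1 is chosen only because its products are easy to check against published values.

```python
def gf2_modmul(c: int, m: int, n: int) -> int:
    """Multiply the binary polynomials c(x) and m(x) modulo n(x).

    Assumes deg(c) < deg(n). Addition and subtraction are both XOR,
    so the signs a and b are 0 or 1 only.
    """
    d = n.bit_length() - 1           # degree of the modulus polynomial
    z = 0
    for i in reversed(range(m.bit_length())):
        z <<= 1                      # shift: multiply Z(x) by x
        if (m >> i) & 1:
            z ^= c                   # a = 1: add C(x) without carries
        if (z >> d) & 1:             # degree comparison replaces the 2/3 N test
            z ^= n                   # b = 1: subtract N(x), again an XOR
    return z


AES_MODULUS = 0x11B                  # x^8 + x^4 + x^3 + x + 1
product = gf2_modmul(0x57, 0x83, AES_MODULUS)
```

The loop body is the GF(2^d) counterpart of update (14) with the carry chain switched off, which is why the same adder hardware can serve both field types.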

5. Hardware Realisation and Design of Crypto@1408Bit

5.1. Three Operand Adder Serial-Parallel Design of the Calculation Unit

The task of the Calculation Unit is to calculate the product Z = C ⋅ M mod N. The easiest way to do this is a multiplication of two numbers as in school mathematics. If we want to calculate, for example, C ⋅ M with C = 101 and M = 111, then

      111 ⋅ 101

     111        first partial product
      000       second partial product
     1110       subtotal
       111      third partial product
    100011      product

Figure 4 Multiplication of two binary numbers using a serial-parallel method

This means that in the first step the first two partial products are calculated. A partial product is equal to the multiplicand M if the bit of the multiplier C under consideration is a binary 1, and equal to 0 if that bit is a binary 0. In the second step the two partial products are added to each other; in the third step the third partial product is added to the sum of the first two partial products. Each partial product is calculated in a parallel operation. The partial products are combined in a serial operation. For a modular multiplication using the ZDN algorithm, the serial-parallel multiplication has the following flow:

    Initialisation of registers
    Serial operation for combining the partial products:
      Generation of Look Ahead parameters (signs and shift values for C and Z)
      Z := Z + a1 ⋅ C1 + b1 ⋅ N      (parallel operation for a partial product)
      Generation of Look Ahead parameters (signs and shift values for C and Z)
      Z := Z + a2 ⋅ C2 + b2 ⋅ N      (parallel operation for a partial product)
      . . .

Figure 5 Modular multiplication of two binary numbers using a serial-parallel method combined with ZDN algorithm

with a1, a2, b1, b2 ∈ {-1; 0; 1} the sign look-ahead parameters, which are newly calculated in each iteration step.

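For the whole modular exponentiation M^d mod N this means that doubling the operand length makes the calculation slower by only a factor of 4: the number of serial steps per modular multiplication grows linearly with the operand length while each step remains a single parallel addition, and the number of multiplications grows linearly with the exponent length. The serial-parallel architecture combined with the ZDN algorithm is therefore especially advantageous for very long operands (e.g. 1400 bit). The calculation of Z := Z + a1 ⋅ C1 + b1 ⋅ N is performed in parallel, i.e. in one step, for all, say, 1408 bits in the 3-operand adder shown in Figure 6: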

[Figure 6: block diagram with the inputs C, N and Z, an n-bit Half Adder stage (Carry Save) and an n-bit Full Adder stage (Carry Propagate), computing Z := Z + a ⋅ C + b ⋅ N with a, b ∈ {-1; 0; 1}.]

Figure 6 Serial-parallel 3 Operand Adder

Note that for modular multiplication over GF(2^d) the carry in the Full Adder simply has to be switched off. The 3-operand adder is designed in a bit-slice architecture. One bit slice contains several crypto registers, a shifter for Z and C, the half adder and the full adder. Each bit slice generates the signals “Kill”, “Generate” and “Propagate”: Kill=1 means that the carry of the previous bit slice is killed, Generate=1 means that the bit slice under consideration generates a carry, and Propagate=1 means that the carry of the previous bit slice is propagated. The bit slices are combined into blocks of 4 bit, and the 4-bit blocks form larger blocks of 16 bit. These 16-bit blocks are combined into one adder of e.g. 1408 bit in order to perform a 3-operand addition for all bits in parallel, as shown in Figure 7.
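The two adder stages and the Kill, Generate and Propagate signals described above can be modelled bit by bit. The following sketch makes several assumptions: the first stage is modelled as a standard carry-save compression of three operands into two, which may differ in detail from the cell structure of the actual chip; operands are non-negative; the signs a and b and the shifters are left out; and the name `three_operand_add` is hypothetical.

```python
def three_operand_add(z: int, c: int, n: int, width: int) -> int:
    """Add three non-negative operands modulo 2**width in two stages."""
    mask = (1 << width) - 1

    # Stage 1 (carry save): compress three operands into a sum word and a
    # carry word; every bit position works independently of the others.
    s = (z ^ c ^ n) & mask
    t = (((z & c) | (z & n) | (c & n)) << 1) & mask

    # Stage 2 (carry propagate): each bit slice classifies itself.
    generate = s & t                  # slice produces a carry
    propagate = s ^ t                 # slice passes the incoming carry on
    kill = ~(s | t) & mask            # slice absorbs the incoming carry

    carry = 0
    result = 0
    for i in range(width):
        result |= (((propagate >> i) & 1) ^ carry) << i
        if (generate >> i) & 1:
            carry = 1
        elif (kill >> i) & 1:
            carry = 0
        # otherwise Propagate = 1 and the carry is handed on unchanged
    return result


total = three_operand_add(0xDEADBEEF, 0x12345678, 0x0FEDCBA9, 32)
```

Only the loop in the second stage is sequential; this is the carry path that the bypass hierarchy of Section 5.2 shortens.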

Figure 7 4 Bit Block of 3 Operand Adder

5.2. Carry Path Hierarchy for fast Addition

Classical adders such as the carry-save adder or the carry-ripple adder described in [1] have significant disadvantages for long integer arithmetic. The carry-ripple adder is very slow in the worst case, in which the carry propagates through all digits. The carry-save adder is very fast but needs additional registers, and therefore a larger chip area, to store the two additional operands of the redundant representation containing the carry information. The preferred adder for long integers should be able to add two integers in almost constant time, independently of the length of the integers involved, and should have a carry handling method that generates as little overhead in chip area as possible. The serial-parallel adder described below adds integers in almost constant time but avoids the chip-area overhead of the carry-save adder.

The basic structure of the carry path is illustrated in Figures 7 and 8. It consists of the path running through the single bit slices and two bypass circuits bypassing 4-bit or 16-bit blocks. The bypass circuits are controlled by the propagate signals of the corresponding bit slices. The bypass structure is realized in a special dual-rail security logic. Although the bypass circuits speed up the rippling of the carry, it is still not fast enough for the desired frequencies, which are up to 100 MHz in 0.22 µm embedded flash low power CMOS technology. Hence a logic circuit is added which recognizes whether or not a 16-bit bypass is active. A 16-bit bypass is active if the propagate signals of all 16 bit slices of the 16-bit block are 1. If at least one of the 16-bit block bypasses is active, a so-called panic signal is generated. In this case Crypto@1408Bit is halted for the time a carry needs to ripple through almost three 16-bit blocks using the bypass structure of 4-bit and 16-bit blocks. If two neighbouring 16-bit block bypasses are active, a so-called double panic signal is generated. In this case Crypto@1408Bit is halted for the worst-case time the carry needs to ripple through the complete Arithmetic Unit using the bypass structure of 4-bit and 16-bit blocks.
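The panic and double panic conditions described above depend only on the propagate signals of the 16-bit blocks. The sketch below models that decision in software under stated assumptions: it looks at the addition of two words x and y (for example the two outputs of a carry-save stage), the function name `panic_level` and the numeric return codes are hypothetical, and the timing consequences (how long the unit is halted) are not modelled.

```python
BLOCK = 16   # bits per bypass block, as in the text


def panic_level(x: int, y: int, width: int) -> int:
    """Classify an addition of x and y over `width` bits.

    Returns 0 if no 16-bit bypass is active, 1 (panic) if at least one
    is active, and 2 (double panic) if two neighbouring bypasses are active.
    `width` is assumed to be a multiple of 16.
    """
    propagate = x ^ y                      # per-bit propagate signals
    full = (1 << BLOCK) - 1
    # A block bypass is active when all 16 propagate signals of the block are 1.
    active = [((propagate >> low) & full) == full
              for low in range(0, width, BLOCK)]
    if any(first and second for first, second in zip(active, active[1:])):
        return 2
    return 1 if any(active) else 0


level = panic_level(0x0000FFFF, 0x00000000, 1408)   # lowest block propagates completely
```

For random operands a fully propagating 16-bit block is rare, which is why the adder can run at a clock rate set by the common case and pay the extra delay only when one of the two signals is raised. A width of 1408 bit corresponds to 88 such blocks.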

Figure 8 Carry Path Hierarchy of 3 Operand Adder

With the panic signal mechanism of the 3 Operand Adder an Arithmetic Unit can be realized which is able to add 1408 bit numbers in one step in almost constant time and with working frequencies much higher than allowed by the worst case of carry rippling.


Figure 9 shows a layout of Crypto@1408Bit in 0.22 µm embedded flash CMOS technology. As one can see, a very high transistor density was achieved.

Figure 9 Layout of Crypto@1408Bit with 1Bit Length 1408

6. Summary

In this paper we have described a crypto-coprocessor based on a fast and secure serial-parallel architecture for the acceleration of public key cryptography such as RSA and elliptic curve cryptography. The following table compares the performance of crypto-processors currently available on the smart card market [14] with that of our new crypto-processor Crypto@1408Bit.

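    Signatures               | Philips FAMEX in P8WE50XX | STM MAP in ST19KF16 | Hitachi Coprocessor in H8/3114 | Infineon ACE in SLECX322P | Infineon Crypto@1408Bit
    -------------------------+---------------------------+---------------------+--------------------------------+---------------------------+------------------------
    RSA 1024 Bit without CRT | 400ms                     | 380ms@10MHz         | 480ms@10MHz                    | 300ms@15MHz               | 35ms@33MHz
    RSA 2048 Bit with CRT    | 1,1s                      | 780ms@10MHz         | n.a.                           | 615ms@15MHz               | 70ms@33MHz
    RSA 2048 Bit without CRT | 6,4s                      | n.a.                | n.a.                           | 34s@15MHz                 | 3,8ms@33MHz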

Table 1 Performance comparison of different smart card crypto-coprocessors

As shown in Table 1, Crypto@1408Bit is an order of magnitude faster for an RSA 1024 bit signature without CRT. Crypto@1408Bit is optimized for key lengths up to 1400 bit. For a key length of 2048 bit it is approximately 2 times faster than existing coprocessors. As shown in Section 5, the serial-parallel architecture combined with the ZDN algorithm yields the optimum performance per chip area for long integer modular multiplication. The 3-operand adder is realised very area-efficiently in a bit-slice design. In addition, due to the secure design of Crypto@1408Bit, the security against information leakage attacks reaches a completely new dimension. The security of the crypto-processor relies on hardware and software security features. The most important features are dual-rail security logic against differential power attacks, high-security registers for critical operands, and a register length with a buffer of up to 128 bit for randomisation of operands.


7. Bibliography

1. E. Hess, B. Meyer, and N. Janssen, Design of Long Integer Arithmetic Units for Public-Key Algorithms, Eurosmart Security Conference Proceedings (June 2000), pp. 325-334.

2. A. Avizienis, Signed digit representation for fast parallel arithmetic, IRE Transactions on Electronic Computers (1961), pp. 389-400.

3. J.C. Bajard, L.-S. Didier, and P. Kornerup, An RNS montgomery modular multiplication algorithm, IEEE Transactions on Computers 47 (1998), no. 7, pp. 766-776.

4. E. F. Brickell, A fast modular multiplication algorithm with application to two key cryptography, Proceedings of CRYPTO ’82 (D. Chaum, R. L. Rivest, and A. T. Sherman, eds.), Plenum Press, 1983, pp. 51-60.

5. P. L. Montgomery, Modular multiplication without trial division, Mathematics of Computation 44 (1985), no. 170, pp. 519-521.

6. J. K. Omura, A public key cell design for smart card chips, International Symposium on Information Theory and its Application, 1990, pp. 983-985.

7. H. Sedlak, The RSA cryptography processor, Proceedings of EUROCRYPT ’87 (Amsterdam) (D. Chaum and W. L. Price, eds.), LNCS, vol. 304, Springer-Verlag, Berlin, 1988, pp. 95-105.

8. N. Takagi and S. Yajima, Modular multiplication hardware algorithms with a redundant representation and their application to RSA cryptosystems, IEEE Transactions on Computers 41 (1992), no. 7, pp. 887-891.

9. O. L. MacSorley, High-speed arithmetic in binary computers, Proceedings IRE 49 (1961), pp. 67-91.

10. L. P. Rubinfeld, A proof of the modified Booth’s algorithm for multiplication, IEEE Transactions on Computers (1975), pp. 1014-1015.

11. C. V. Freyman, Statistical analysis of certain binary division algorithms, Proceedings IRE 49 (1961), pp. 91-103.

12. A. D. Booth, A signed binary multiplication technique, Quart. J. Mech. Appl. Math. 4 (1951), pp. 236-240.

13. M. Dichtl and N. Janssen, A High Quality Physical Random Number Generator, Eurosmart Security Conference Proceedings (June 2000), pp. 279-287.

14. H. Handschuh and P. Paillier, Smart Card Crypto-Coprocessors for Public-Key Cryptography, Crypto Bytes, 4 (1998), no. 1, pp. 6-11.
