Post on 22-Jun-2020

MRABET Amine

Montgomery Algorithm for Modular Multiplication with Systolic Architecture

LIASD Paris 8

ENIT-TUNIS EL MANAR University

SAS - CMP - Gardanne

SPACE 2016

Plan

1. Introduction to pairings
2. Montgomery Multiplication (CIOS)
3. Architecture
4. Results
5. Conclusion and Perspectives

General Context

This work is part of the hardware implementation of asymmetric cryptography primitives, such as the Optimal-Ate pairing on elliptic curves, elliptic-curve cryptosystems and RSA, which are the best-known methods in asymmetric encryption.

Definition

Let G1 and G2 be two additive groups and let G3 be a multiplicative group.

A pairing is a map e : G1 × G2 → G3 with the following properties:

Non-degeneracy:
for every P ∈ G1 with P ≠ 0, there exists Q ∈ G2 such that e(P, Q) ≠ 1,
and for every Q ∈ G2 with Q ≠ 0, there exists P ∈ G1 such that e(P, Q) ≠ 1.

Bilinearity:
$e(xP, yQ) = e(P, Q)^{xy}$,
$e(xP, yQ)^z = e(yP, zQ)^x = e(zP, xQ)^y = e(P, Q)^{xyz}$.
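The two properties can be checked exhaustively on a toy map. The sketch below is an illustration only: it assumes G1 = G2 = (Z_r, +) and takes G3 to be the subgroup of order r in (Z_q)^*, with the small parameters q = 23, r = 11, g = 2 chosen for the example. It is not an elliptic-curve pairing and offers no security.

```python
# Toy bilinear map for illustration only (no cryptographic value).
# Assumed setting: G1 = G2 = (Z_r, +), G3 = order-r subgroup of (Z_q)^*.
Q_MOD = 23   # small prime q
R_ORD = 11   # prime r dividing q - 1 = 22
GEN = 2      # element of order 11 modulo 23 (2^11 = 2048 = 89*23 + 1)


def e(P, Q):
    """Toy pairing e(P, Q) = g^(P*Q) mod q on Z_r x Z_r."""
    return pow(GEN, (P * Q) % R_ORD, Q_MOD)


def check_bilinearity():
    """Verify e(xP, yQ) = e(P, Q)^(xy) for all P, Q, x, y in Z_r."""
    for P in range(R_ORD):
        for Q in range(R_ORD):
            for x in range(R_ORD):
                for y in range(R_ORD):
                    lhs = e((x * P) % R_ORD, (y * Q) % R_ORD)
                    rhs = pow(e(P, Q), x * y, Q_MOD)
                    if lhs != rhs:
                        return False
    return True


def check_non_degeneracy():
    """Every nonzero P has some Q with e(P, Q) != 1 (the map is symmetric)."""
    return all(
        any(e(P, Q) != 1 for Q in range(R_ORD))
        for P in range(1, R_ORD)
    )


if __name__ == "__main__":
    print("bilinear:", check_bilinearity())
    print("non-degenerate:", check_non_degeneracy())
```

In this toy setting discrete logarithms are trivial, which is exactly why real pairings are built on elliptic curves instead.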

Pairing protocols

The bilinearity of pairings has allowed the construction of new protocols:

Diffie–Hellman key exchange (Joux, 2001)
Identity-based cryptography (Boneh and Franklin)
Short signature schemes (Boneh, Lynn, Shacham)

Pairing protocols: Example of Identity-Based Cryptography

[Figure: a trusted authority, Alice with identity IA and Bob with identity IB.]

s: the secret of the trusted authority.

The public keys are the identities of the users.

The private keys are constructed by the trusted authority and transmitted to the users:

$P_A = s \cdot I_A$, $P_B = s \cdot I_B$

$e(P_A, I_B) = e(I_A, I_B)^s$ and $e(P_B, I_A) = e(I_A, I_B)^s$

Pairing protocols: Example of Identity-Based Cryptography

Encryption of the plaintext message M

Alice wants to send a message to Bob:
She chooses a random integer a,
She retrieves Bob's public key $I_B$,
She computes the pairing value $e(I_B, Q_0)^a$,
She sends to Bob: $[aP,\ M \oplus H_2(e(I_B, Q_0)^a)] = [U, V]$

Pairing protocols: Example of Identity-Based Cryptography

Decryption of the encrypted message

Bob follows these steps:
He contacts the trusted authority to retrieve his private key $P_B = sI_B$,
He recovers the message M by computing $V \oplus H_2(e(P_B, U))$.

This works by the bilinearity of the pairing:
$e(P_B, U) = e(sI_B, aP) = e(I_B, P)^{as} = e(I_B, sP)^a$

This is the mask used by Alice provided that $Q_0 = sP$.
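The encryption and decryption steps can be traced end to end with a toy bilinear map. The sketch below is only an illustration under loud assumptions: the pairing is the insecure map e(x, y) = g^(xy) mod q on Z_r (with q = 2039, r = 1019, g = 4 chosen for the example), the public point P is taken as 1, Q0 is assumed to be sP, and the hash functions `h1` and `h2` are hypothetical stand-ins built from SHA-256 and SHAKE-256.

```python
import hashlib
import secrets

Q_MOD = 2039   # toy prime q = 2r + 1 (assumption for the example)
R_ORD = 1019   # prime order r of the groups
GEN = 4        # element of order r modulo q
P = 1          # public generator of the additive group Z_r


def e(x, y):
    """Toy bilinear map e(x, y) = g^(x*y) mod q (insecure, illustration only)."""
    return pow(GEN, (x * y) % R_ORD, Q_MOD)


def h1(identity):
    """Hypothetical H1: map an identity string to a nonzero element of Z_r."""
    digest = hashlib.sha256(identity.encode()).digest()
    return 1 + int.from_bytes(digest, "big") % (R_ORD - 1)


def h2(value, length):
    """Hypothetical H2: derive a byte mask from a G3 element."""
    return hashlib.shake_256(str(value).encode()).digest(length)


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def setup():
    """Trusted authority: secret s and public value Q0 = sP."""
    s = 1 + secrets.randbelow(R_ORD - 1)
    return s, (s * P) % R_ORD


def extract(s, identity):
    """Private key P_B = s * I_B, computed by the trusted authority."""
    return (s * h1(identity)) % R_ORD


def encrypt(q0, identity, message):
    """Alice: pick random a, send [U, V] = [aP, M xor H2(e(I_B, Q0)^a)]."""
    a = 1 + secrets.randbelow(R_ORD - 1)
    u = (a * P) % R_ORD
    mask = h2(pow(e(h1(identity), q0), a, Q_MOD), len(message))
    return u, xor(message, mask)


def decrypt(private_key, u, v):
    """Bob: recover M = V xor H2(e(P_B, U))."""
    return xor(v, h2(e(private_key, u), len(v)))
```

A round trip would look like `s, q0 = setup()`, then `u, v = encrypt(q0, "bob", b"hi")` and `decrypt(extract(s, "bob"), u, v)`, which returns the original bytes because both sides compute g^(s·a·I_B).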

Different pairings

Weil pairing

$e_W : E(\mathbb{F}_p)[r] \times E(\mathbb{F}_{p^k})/rE(\mathbb{F}_{p^k}) \to \mathbb{F}_{p^k}^*$
$(P, Q) \mapsto (-1)^r \, f_{r,P}(Q) / f_{r,Q}(P)$

Computing it takes a Miller Lite evaluation for $f_{r,P}(Q)$, a Miller Full evaluation for $f_{r,Q}(P)$, an inversion and a multiplication.

Tate pairing

$e_T : E(\mathbb{F}_p)[r] \times E(\mathbb{F}_{p^k})/rE(\mathbb{F}_{p^k}) \to \mathbb{F}_{p^k}^*$
$(P, Q) \mapsto f_{r,P}(Q)^{(p^k - 1)/r}$

The Tate pairing is defined with the same parameters E, Fp, r, k as the Weil pairing.

Computing the Tate pairing takes log2(r) iterations of the Miller algorithm, where r is the order of the subgroups used.

Different pairings

Ate pairing

$G_1 = E[r] \cap \mathrm{Ker}(\pi_p - [1]) = E(\mathbb{F}_p)[r]$, $G_2 = E[r] \cap \mathrm{Ker}(\pi_p - [p])$, where $\pi_p$ is the Frobenius endomorphism.

$e_A : G_1 \times G_2 \to \mathbb{F}_{p^k}^*$
$(P, Q) \mapsto f_{T,Q}(P)^{(p^k - 1)/r}$

The main advantage over the Tate pairing is the smaller number of iterations of the Miller algorithm: log2(T), where T = t − 1 and t is the trace of Frobenius of E(Fp).

The disadvantage of the Ate pairing is that it requires a Miller Full computation.

Different pairings

Twisted Ate pairing

$G_1 = E[r] \cap \mathrm{Ker}(\pi_p - [1]) = E(\mathbb{F}_p)[r]$, $G_2 = E[r] \cap \mathrm{Ker}(\pi_p - [p])$

$e_{TA} : G_1 \times G_2 \to \mathbb{F}_{p^k}^*$
$(P, Q) \mapsto f_{T,P}(Q)^{(p^k - 1)/r}$

The computation uses a Miller Lite execution, which lowers the complexity of the calculations.

Optimal Ate (OATE) pairing

The Optimal Ate pairing improves on the Ate pairing by reducing the number of iterations of the Miller algorithm.

In the case of BN curves, the OATE pairing is built on the Miller function $f_{6t+2,Q}(P)$, where t here denotes the parameter of the BN curve.
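To make the iteration counts concrete, the sketch below compares the Miller loop lengths of the Tate, Ate and Optimal Ate pairings on a BN curve, using the standard BN polynomial parametrization. The curve parameter is named `u` in the code to avoid a clash with the Frobenius trace t (the slides call both t), and the value of `u` is only an example chosen for illustration.

```python
def bn_parameters(u):
    """Standard BN parametrization: field prime p, group order r, trace t."""
    p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1
    r = 36 * u**4 + 36 * u**3 + 18 * u**2 + 6 * u + 1
    t = 6 * u**2 + 1
    return p, r, t


def miller_loop_bits(u):
    """Bit length of the loop parameter for each pairing (about log2 iterations)."""
    p, r, t = bn_parameters(u)
    return {
        "tate": r.bit_length(),                      # loop over r
        "ate": (t - 1).bit_length(),                 # loop over T = t - 1
        "optimal_ate": abs(6 * u + 2).bit_length(),  # loop over 6u + 2
    }


if __name__ == "__main__":
    u = -(2**62 + 2**55 + 1)  # example value, assumed for illustration
    for name, bits in miller_loop_bits(u).items():
        print(f"{name}: about {bits} iterations")
```

For this example value the three lengths are 254, 127 and 65 bits, which shows the reduction from Tate to Ate to Optimal Ate. The script does not check that p and r are prime, only their sizes.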

Basic operations

The basic operations in the finite field:

Addition
Subtraction
Multiplication
Inversion

account for most of the computation time of a pairing.

This is why optimizing these operations matters most.


Reminder: Montgomery algorithm

Conversion between the ordinary domain and the Montgomery domain:

Ordinary domain     Montgomery domain
a                   M(a) = a·R mod p
b                   M(b) = b·R mod p
a·b                 M(a·b) = a·b·R mod p
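The conversion table can be exercised with a short single-precision sketch of Montgomery reduction. It assumes R = 2^k with p odd and p < R; the function names are illustrative and `pow(x, -1, m)` requires Python 3.8 or later.

```python
def montgomery_setup(p, k):
    """Return R = 2^k and p' with p * p' = -1 (mod R); p must be odd."""
    R = 1 << k
    p_prime = (-pow(p, -1, R)) % R
    return R, p_prime


def redc(T, p, k, p_prime):
    """Montgomery reduction: T * R^-1 mod p for 0 <= T < R * p."""
    mask = (1 << k) - 1
    m = ((T & mask) * p_prime) & mask   # m = (T mod R) * p' mod R
    t = (T + m * p) >> k                # exact division by R
    return t - p if t >= p else t


def to_montgomery(a, p, k):
    """M(a) = a * R mod p."""
    return (a << k) % p


def mont_mul(x, y, p, k, p_prime):
    """Product in the Montgomery domain: M(a) * M(b) -> M(a * b)."""
    return redc(x * y, p, k, p_prime)


if __name__ == "__main__":
    p, k = (1 << 61) - 1, 64
    R, p_prime = montgomery_setup(p, k)
    a, b = 123456789, 987654321
    ma, mb = to_montgomery(a, p, k), to_montgomery(b, p, k)
    mab = mont_mul(ma, mb, p, k, p_prime)
    # Leaving the Montgomery domain is one more reduction.
    print(redc(mab, p, k, p_prime) == (a * b) % p)
```

The conversions cost extra work, so the Montgomery domain pays off when many multiplications are chained, as in a pairing or a scalar multiplication.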

The Coarsely Integrated Operand Scanning (CIOS) method [1]

The CIOS method improves the Montgomery algorithm by integrating multiplication and reduction.

How?

Instead of computing the full product a × b and then performing the reduction, it alternates between iterations of multiplication and iterations of reduction.

[1] Çetin Kaya Koç, Tolga Acar and Burton S. Kaliski Jr., "Analyzing and Comparing Montgomery Multiplication Algorithms", IEEE Micro, June 1996.
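The alternation between multiplication and reduction can be sketched in software as follows. This is a word-level model of the CIOS loops from [1], written for operands of s words of w bits; it is not the hardware design of this work, and the comments that tie loop parts to the cell names alpha, alpha_2, beta, gamma and gamma_2 are an assumed correspondence.

```python
def cios_montgomery(a, b, p, w, s):
    """Return a*b*R^-1 mod p with R = 2^(w*s); requires p odd and a, b < p < R."""
    W = 1 << w
    mask = W - 1

    def words(x):
        return [(x >> (w * k)) & mask for k in range(s)]

    A, B, P = words(a), words(b), words(p)
    p0_prime = (-pow(P[0], -1, W)) % W   # -p^-1 mod 2^w, from the low word of p
    t = [0] * (s + 2)

    for i in range(s):
        # Multiplication step: t += a_i * b (assumed to match the alpha cells).
        carry = 0
        for j in range(s):
            cs = t[j] + A[i] * B[j] + carry
            t[j], carry = cs & mask, cs >> w
        # Final carry propagation of the multiplication (assumed alpha_2).
        cs = t[s] + carry
        t[s], t[s + 1] = cs & mask, cs >> w

        # Reduction step: choose m so that the low word cancels (assumed beta).
        m = (t[0] * p0_prime) & mask
        carry = (t[0] + m * P[0]) >> w
        # t = (t + m * p) / 2^w, one word at a time (assumed gamma cells).
        for j in range(1, s):
            cs = t[j] + m * P[j] + carry
            t[j - 1], carry = cs & mask, cs >> w
        # Final carry propagation of the reduction (assumed gamma_2).
        cs = t[s] + carry
        t[s - 1] = cs & mask
        t[s] = t[s + 1] + (cs >> w)

    result = sum(t[k] << (w * k) for k in range(s + 1))
    return result - p if result >= p else result


if __name__ == "__main__":
    p = 2**255 - 19                       # example odd modulus
    a, b = 0x1234567890ABCDEF % p, 3**150 % p
    R = 1 << 256
    expected = (a * b * pow(R, -1, p)) % p
    print(cios_montgomery(a, b, p, 32, 8) == expected)
```

Because the reduction of iteration i only needs the low word produced by the multiplication of the same iteration, the two inner loops can be pipelined, which is what a systolic arrangement exploits.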

What is a systolic architecture?

It is a network composed of a large number of cells. Each cell receives data from its neighboring cells, performs a simple calculation, and then transmits the results, again to neighboring cells.

A systolic architecture relies on very simple elementary cells. As a result, it reduces resource requirements in hardware implementations.

Our contribution in this work is to combine a systolic architecture, which is considered the best solution for FPGA implementations, with the CIOS method of Montgomery modular multiplication.

Coarsely Integrated Operand Scanning

Cutting the CIOS algorithm

[Figure: listing of the CIOS algorithm; the line numbers below refer to it.]

alpha: lines 5 and 6
alpha_2: lines 7, 8 and 9
beta: lines 11 and 12
gamma: lines 14 and 15
gamma_2: lines 16, 17 and 18


CIOS in Systolic for s=8

[Figure: systolic schedule of CIOS for s = 8. For each outer iteration i = 0 ... 7, a row of cells processes the partial products ai·bj for j = 0 ... 7 (multiplication step, inputs ai and bj), followed by the reduction step (inputs m and pj). The cells exchange sum (S) and carry (C) words, and the final output is a × b × R^-1 mod p.]

In this architecture we also have an integration between the different iterations of the loop on i.

In our case, 3 iterations of i can be executed at the same time.

Data Flow

[Figure: data flow for s = 8 over the successive clock cycles. The words a0 ... a7 of A enter one per iteration i. The words of B are grouped into blocks B1 = (b0, b1, b2), B2 = (b3, b4, b5) and B3 = (b6, b7); the words p2 ... p7 of P are grouped into blocks P2 = (p2, p3, p4) and P3 = (p5, p6, p7). Inside each block the words are rotated from one cycle to the next (b0 b1 b2, then b1 b2 b0, then b2 b0 b1), while the sum (S) and carry (C) words pass from cell to cell.]

CIOS in Systolic for s=8

[Figure: schedule of the iterations i = 0 ... 7, each made of a multiplication step and a reduction step, overlapped in time and labeled with the FSM states S0, S1, S2; the output is a × b × R^-1 mod p.]

During the execution of this algorithm, up to three iterations of the loop on i are executed at the same time, which gives a maximum of three alpha cells and three gamma cells running in parallel.

Since the same blocks repeat, we modeled our FSM with 3 states, which allows us to perform the whole multiplication in just 33 cycles:

(8 + 3) × 3 = 33

CIOS in Systolic for s=16

[Figure: systolic schedule of CIOS for s = 16, with the partial products ai·bj for j = 0 ... 15 and the outer iterations i = 0 ... 15, using cells numbered 1 to 6; the output is a × b × R^-1 mod p.]

Partition of B (b0 ... b15) into blocks for cells 1 to 6:

B1: b0 b1 b2
B2: b3 b4 b5
B3: b6 b7 b8
B4: b9 b10 b11
B5: b12 b13 b14
B6: b15

Partition of P (p0 ... p15), with p0 and p1 kept apart (the figure also shows a label P1):

P2: p2 p3 p4
P3: p5 p6 p7
P4: p8 p9 p10
P5: p11 p12 p13
P6: p14 p15

CIOS in Systolic for s=8

[Figure: state sequence using alpha(1), alpha(2), alpha(3), gamma(1), gamma(2), gamma(3), beta, alpha_2 and gamma_2 (also labeled alpha_f and gamma_f), repeated for each increment of i.]

K=256, w=32, s=8
K=512, w=64, s=8
33 clock cycles

CIOS in Systolic for s=16

[Figure: same state sequence with alpha(1) ... alpha(6) and gamma(1) ... gamma(6), beta, alpha_f and gamma_f.]

K=256, w=16, s=16
K=512, w=32, s=16
66 clock cycles

Comparison

s       Cells     Clock cycles
8       6 + 3     33
16      12 + 3    66
32      24 + 3    132
64      48 + 3    264

Word size w (bits) as a function of the operand size K and the number of words s:

                    s=8    s=16   s=32
K=256               32     16     8
K=512               64     32     16
K=1024              128    64     32
Number of cycles    33     66     132
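The numbers in this comparison follow a simple pattern that can be written down as a small helper. The deck only states the formula (8 + 3) × 3 = 33 for s = 8; generalizing it to 3s/8 alpha cells and (s + 3s/8) × 3 cycles is an inference that happens to reproduce the rows above, not a formula given by the authors.

```python
def word_size(K, s):
    """Word size w in bits for a K-bit operand split into s words."""
    if K % s != 0:
        raise ValueError("K must be a multiple of s")
    return K // s


def alpha_cells(s):
    """Number of alpha cells: 3 for s=8, 6 for s=16 (extrapolated as 3*s/8)."""
    return 3 * s // 8


def total_cells(s):
    """Alpha plus gamma cells, plus the three cells alpha_2, beta and gamma_2."""
    return 2 * alpha_cells(s) + 3


def clock_cycles(s, fsm_states=3):
    """Cycle count inferred from (8 + 3) * 3 = 33, assumed to scale with s."""
    return (s + alpha_cells(s)) * fsm_states


if __name__ == "__main__":
    for s in (8, 16, 32, 64):
        print(f"s={s}: {total_cells(s)} cells, {clock_cycles(s)} cycles, "
              f"w={word_size(256, s)} bits for K=256")
```

Under this reading, doubling s doubles the cycle count while halving the word size for a fixed K, which is the trade-off the comparison table expresses.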

The interest of each architecture

The interest of each architecture depends on our needs:

Security level
Resources
Speed
The method used

Architectures: Digital signal processing (DSP)

Modern FPGAs are equipped with hardware extensions for arithmetic calculation.

They perform basic arithmetic calculations: multiplication, addition and subtraction of unsigned integers.

Internal architectures - cells

The arithmetic operations of each cell are designed to make maximum use of the DSPs.

[Figure: block diagrams of the five cells.]

alpha: inputs a[i], b[j], C_In and S_In; one multiplier and two adders; the w least significant bits are registered as S_Out and the w most significant bits as C_Out.
beta: inputs p', P[0] and S_In; two multipliers and one adder; registered outputs m and C_Out.
gamma: inputs m, p[j], C_In and S_In; one multiplier and two adders; the w least significant bits are registered as S_Out and the w most significant bits as C_Out.
alpha_2: inputs C and S_In; one adder; the w least significant bits are registered as S1_Out and the w most significant bits as S2_Out.
gamma_2: inputs S1_In, S2_In and C; two adders; registered outputs S1_Out and S2_Out.
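As a rough software model, each cell can be written as a small function and the five functions chained sequentially to obtain a full Montgomery product. This is only a sketch under assumptions: the arithmetic of each cell is inferred from its port names (a multiply-accumulate split into low and high w-bit words), and the sequential loop below does not reproduce the parallel schedule or the FSM of the hardware.

```python
def split(x, w):
    """Split x into (MSB part, LSB w bits)."""
    return x >> w, x & ((1 << w) - 1)


def alpha(a_i, b_j, c_in, s_in, w):
    """Assumed alpha cell: (C_Out, S_Out) = a[i]*b[j] + C_In + S_In."""
    return split(a_i * b_j + c_in + s_in, w)


def alpha_2(c, s_in, w):
    """Assumed alpha_2 cell: (S1_Out, S2_Out) = low and high words of S_In + C."""
    high, low = split(s_in + c, w)
    return low, high


def beta(s_in, p_prime, p0, w):
    """Assumed beta cell: m = S_In * p' mod 2^w, C_Out = high word of S_In + m*P[0]."""
    m = (s_in * p_prime) & ((1 << w) - 1)
    c_out, _ = split(s_in + m * p0, w)
    return m, c_out


def gamma(m, p_j, c_in, s_in, w):
    """Assumed gamma cell: (C_Out, S_Out) = m*p[j] + C_In + S_In."""
    return split(m * p_j + c_in + s_in, w)


def gamma_2(s1_in, s2_in, c, w):
    """Assumed gamma_2 cell: S1_Out = low word of S1_In + C, S2_Out = S2_In + carry."""
    carry, low = split(s1_in + c, w)
    return low, s2_in + carry


def mont_mul_with_cells(a, b, p, w, s):
    """Sequential composition of the cells; returns a*b*2^(-w*s) mod p (p odd)."""
    mask = (1 << w) - 1
    A = [(a >> (w * k)) & mask for k in range(s)]
    B = [(b >> (w * k)) & mask for k in range(s)]
    P = [(p >> (w * k)) & mask for k in range(s)]
    p_prime = (-pow(P[0], -1, 1 << w)) % (1 << w)
    t = [0] * (s + 2)
    for i in range(s):
        c = 0
        for j in range(s):
            c, t[j] = alpha(A[i], B[j], c, t[j], w)
        t[s], t[s + 1] = alpha_2(c, t[s], w)
        m, c = beta(t[0], p_prime, P[0], w)
        for j in range(1, s):
            c, t[j - 1] = gamma(m, P[j], c, t[j], w)
        t[s - 1], t[s] = gamma_2(t[s], t[s + 1], c, w)
    r = sum(t[k] << (w * k) for k in range(s + 1))
    return r - p if r >= p else r
```

Every cell reduces to one multiplication at most plus additions on w-bit words, which is the kind of operation a DSP block provides.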

Internal architectures - Rotation

[Figure: rotation units, each built around a multiplexer (Mux): one for A (K bits), three for B (two of 3w bits and one of 2w bits) and two for P (3w bits each).]
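The effect of a rotation unit on one block of words can be mimicked in a few lines. The sketch below assumes a rotation by one word per step, matching the order b0 b1 b2, then b1 b2 b0, then b2 b0 b1 seen in the data-flow slides; the function name and the list representation are illustrative.

```python
from collections import deque


def rotation_states(block, steps):
    """Return the successive contents of a block rotated left by one word per step."""
    d = deque(block)
    states = [list(d)]
    for _ in range(steps):
        d.rotate(-1)          # first word moves to the end of the block
        states.append(list(d))
    return states


if __name__ == "__main__":
    for state in rotation_states(["b0", "b1", "b2"], 3):
        print(" ".join(state))
```

A block of three words returns to its initial order after three steps, in line with the three-state FSM described for s = 8.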

Architectures

[Figure: wiring of the processing elements (PE), labeled A to I.
A: alpha(1), B: alpha(2), C: alpha(3)
D: gamma(1), E: gamma(2), F: gamma(3)
G: alpha_2, H: gamma_2, I: beta
The carry and sum inputs of the alpha PEs and of gamma(2) and gamma(3) go through multiplexers (MUX) controlled by sig_state, which select between the PE's own outputs and those of a neighboring PE (or zero for the carry input of alpha(1)). gamma(1) takes m and p[0]; gamma(2) and gamma(3) take m and p[j]; beta takes p' and P[0] and produces m.]


Results

Nexys 4, multiplier and individual cells:

Module              DSP   Frequency (MHz)   Cycles
MMM (s=8, K=256)    31    105.275           33
Alpha               4     291.023           1
Gamma               4     291.023           1
Beta                4     388.350           1
Alpha_2             1     459.918           1
Gamma_2             2     442.811           1

Nexys 4, complete multipliers:

Design               DSP   LUTs   Reg    Occupied slices   Frequency (MHz)   Cycles
MMM (s=8, K=256)     31    809    870    352               105.275           33
MMM (s=16, K=256)    33    846    1123   402               145.892           66
MMM (s=8, K=512)     87    2650   1614   878               64.825            33
MMM (s=16, K=512)    57    1789   2164   798               105.594           66
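Frequency and cycle count can be combined into the time taken by one modular multiplication. The sketch below does that arithmetic for the four complete multipliers; the resulting latencies are derived here and are not figures reported in the slides, and the frequency unit of the second table is assumed to be MHz as in the first.

```python
# (design, frequency in MHz, clock cycles) copied from the results table.
DESIGNS = [
    ("s=8,  K=256", 105.275, 33),
    ("s=16, K=256", 145.892, 66),
    ("s=8,  K=512", 64.825, 33),
    ("s=16, K=512", 105.594, 66),
]


def latency_us(frequency_mhz, cycles):
    """Time of one multiplication in microseconds: cycles / frequency."""
    return cycles / frequency_mhz


def throughput_per_second(frequency_mhz, cycles):
    """Multiplications per second if they are issued back to back."""
    return frequency_mhz * 1e6 / cycles


if __name__ == "__main__":
    for name, freq, cycles in DESIGNS:
        print(f"{name}: {latency_us(freq, cycles):.3f} us per multiplication")
```

With these figures the s = 8 designs come out faster per multiplication than the s = 16 designs for the same K, at the cost of wider words.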


Conclusion

We have implemented the Montgomery multiplication with a systolic architecture that runs in a fixed number of clock cycles.

We designed it to make maximum use of the DSPs on the FPGA board.

We implemented two architectures (s=8 and s=16).

We used these two designs to implement scalar multiplication at the 128-bit security level.

Perspectives

Perform a mixed software/hardware implementation (co-design) of the Optimal-Ate pairing on BN curves in Jacobian coordinates using this multiplication algorithm.

Finalize the hardware implementation of the designs for s=32 and s=64.