Implementing NTRU on a GPU

Jens Hermans, Frederik Vercauteren, Bart Preneel
COSIC, K.U.Leuven
30 July 2009

Transcript of Implementing NTRU on a GPU

Page 1: Implementing NTRU on a GPU

Motivation | NTRU Crash Course | Architecture | Implementation | Results

Implementing NTRU on a GPU

Jens Hermans, Frederik Vercauteren, Bart Preneel

COSIC, K.U.Leuven

30 July 2009

Jens Hermans Frederik Vercauteren, Bart Preneel COSIC, K.U.Leuven Implementing NTRU on a GPU

Page 2: Implementing NTRU on a GPU

1 Motivation
   Why NTRU?
   Why GPU?

2 NTRU Crash Course
   Polynomials
   Operations and Parameter Choices

3 Architecture
   Hardware
   Programming model

4 Implementation
   Optimization for architecture
   General structure

5 Results

Page 3: Implementing NTRU on a GPU

Motivation

Speeding up NTRU Encryption on GPU:

Why NTRU?

Why GPUs?

Page 4: Implementing NTRU on a GPU

Why NTRU?

NTRU Signatures (a.k.a. 'NotTrue')

NTRU Encryption [1]

Under development: IEEE 1363.1 [2]

Security parameters increase

Page 5: Implementing NTRU on a GPU

Why NTRU?

NTRU Encryption:

Central operation ⇒ Convolution

(Not lattice based!)

'Post-quantum' security (?)

⇒ looking good for parallel implementation

Page 6: Implementing NTRU on a GPU

Why GPU?

Page 7: Implementing NTRU on a GPU

Old-style GPU programming

What GPUs are supposed to do:

3D operations

2D operations (textures / shading)

Abuse 2D operations (e.g. custom shader) ⇒ RSA implementation, 2007 ¹

Complicated...

¹ Moss, Page, Smart

Page 8: Implementing NTRU on a GPU

CUDA framework

Warning: sales talk

Your own personal supercomputer for < €500.

Nvidia CUDA Framework [3]:

Run ‘general’ programs on GPU

More complex operations, data types, branching...

Recent GPU required

Theory: 1 TFlop (practice: 200 GFlop)

Page 9: Implementing NTRU on a GPU

CUDA Usage

Usages:

Linear algebra (e.g. CUBLAS)

Simulations (physics, chemistry, engineering...)

Image/video processing

...

Cryptography!

Page 10: Implementing NTRU on a GPU

Crypto on GPU

Current applications:

Ciphers:

RSA ², ECC ³, AES ⁴

Cryptanalysis:

Factoring ⁵

Brute force

Focus: high throughput, not latency

² Moss, Page, Smart / Szerwinski, Güneysu / Fleissner
³ Szerwinski, Güneysu
⁴ Manavski / Harrison, Waldron
⁵ Bernstein, Chen, Cheng, Lange, Yang

Page 12: Implementing NTRU on a GPU

Polynomials

$P(N) = \mathbb{Z}[X]/(X^N - 1)$ and $P_q(N) = \mathbb{Z}_q[X]/(X^N - 1)$

$f \in P(N)$:

$f = \sum_{i=0}^{N-1} f_i X^i$

Multiplication $a = b \star c$ in $P(N)$, cyclic convolution:

$a_k = (b \star c)_k = \sum_{i+j \equiv k \pmod{N}} b_i \cdot c_j$
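
The cyclic convolution defined above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not the GPU code from the talk; the function name `cyclic_convolve` and the optional `modulus` argument are assumptions made for the example.

```python
def cyclic_convolve(b, c, modulus=None):
    """Multiply b and c in Z[X]/(X^N - 1); coefficients are listed lowest degree first."""
    n = len(b)
    if len(c) != n:
        raise ValueError("both polynomials need exactly N coefficients")
    a = [0] * n
    for i, b_i in enumerate(b):
        for j, c_j in enumerate(c):
            # X^i * X^j = X^((i + j) mod N), because X^N = 1 in this ring
            a[(i + j) % n] += b_i * c_j
    if modulus is not None:
        # Optional reduction of every coefficient, as in P_q(N)
        a = [coeff % modulus for coeff in a]
    return a


# Toy example with N = 3: (1 + 2X) * (3 + X^2) = 5 + 6X + X^2 in Z[X]/(X^3 - 1)
product = cyclic_convolve([1, 2, 0], [3, 0, 1])
print(product)
```

The term 2X · X^2 = 2X^3 wraps around to the constant coefficient, which is what distinguishes the cyclic convolution from an ordinary polynomial product.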

Page 13: Implementing NTRU on a GPU

Encryption

Encryption: $e = h \star r + m \bmod q$ (1)

with [2]:

$N = 1171$

$m \in P_p(N)$ ($p = 3$)

$h, e \in P_q(N)$ ($q = 2^{11}$)

Option 1: $r \in P_p(N)$, $\#\{r_i = 1\} = \#\{r_i = -1\} = d_r = 106$

Option 2: $r = r_1 \star r_2 + r_3$ with $r_1, r_2, r_3 \in P_p(N)$
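
Equation (1) with an Option 1 blinding polynomial can be sketched as follows. The sizes are toy values chosen for readability (not the N = 1171 parameter set), `h` is a random stand-in instead of a real public key, and all function names are assumptions made for the example.

```python
import random


def cyclic_convolve(b, c, q):
    """Product of b and c in Z_q[X]/(X^N - 1)."""
    n = len(b)
    a = [0] * n
    for i in range(n):
        for j in range(n):
            a[(i + j) % n] = (a[(i + j) % n] + b[i] * c[j]) % q
    return a


def random_ternary(n, d, rng):
    """Polynomial with exactly d coefficients +1, d coefficients -1 and the rest 0."""
    coeffs = [1] * d + [-1] * d + [0] * (n - 2 * d)
    rng.shuffle(coeffs)
    return coeffs


def encrypt(h, r, m, q):
    """e = h * r + m mod q, equation (1)."""
    hr = cyclic_convolve(h, r, q)
    return [(x + m_k) % q for x, m_k in zip(hr, m)]


# Toy parameters, assumed for illustration only
N, Q, D_R = 11, 32, 3
rng = random.Random(2009)
h = [rng.randrange(Q) for _ in range(N)]        # stand-in for a public key
r = random_ternary(N, D_R, rng)                 # Option 1 blinding polynomial
m = [rng.choice([-1, 0, 1]) for _ in range(N)]  # message in P_p(N), p = 3
e = encrypt(h, r, m, Q)
```

Because `r` only contains +1, -1 and 0, the products `b[i] * c[j]` are really additions and subtractions of coefficients of `h`.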

Page 14: Implementing NTRU on a GPU

Decryption

Decryption:

$a \equiv f \star e \bmod q$ (2)

$m = a \star f_p^{-1} \bmod p$ (3)

with:

$f = 1 + p \star F \Rightarrow f_p^{-1} = 1$

(Same options for F as for r)

... decryption failures!?
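
Why the choice f = 1 + p ⋆ F makes step (3) trivial can be checked numerically: f reduces to 1 modulo p, so multiplying by f leaves a polynomial unchanged modulo p. The sketch below only illustrates this property, not a full decryption; the concrete `F` and `m` are hypothetical values and the helper names are assumptions.

```python
def cyclic_convolve(b, c):
    """Product of b and c in Z[X]/(X^N - 1), without coefficient reduction."""
    n = len(b)
    a = [0] * n
    for i in range(n):
        for j in range(n):
            a[(i + j) % n] += b[i] * c[j]
    return a


def center_mod(values, p):
    """For odd p, reduce each coefficient to the range -(p-1)/2 .. (p-1)/2."""
    half = p // 2
    return [((v + half) % p) - half for v in values]


P = 3
F = [1, 0, -1, 0, 1, -1, 0]    # hypothetical small ternary polynomial
f = [P * F_i for F_i in F]
f[0] += 1                      # f = 1 + p * F

m = [1, -1, 0, 0, 1, 0, -1]    # hypothetical ternary message

# f is congruent to 1 modulo p, so its inverse modulo p is 1 as well
assert center_mod(f, P) == [1, 0, 0, 0, 0, 0, 0]
# f * m = m + p * (F * m), which reduces to m modulo p
assert center_mod(cyclic_convolve(f, m), P) == m
```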

Page 16: Implementing NTRU on a GPU

Processor

Nvidia GTX280:

240 cores, scalar processors

30 multiprocessors (8 cores each)

1.3 GHz

1GB Global Memory

32 & 64-bit integers, FP

Page 17: Implementing NTRU on a GPU

Programming model

(Source: CUDA programming guide)

Page 18: Implementing NTRU on a GPU

Memory types

(Source: CUDA programming guide)

Page 19: Implementing NTRU on a GPU

Points of attention

Memory access ⇔ computation

Coalesced memory access, bank conflicts

Loop structure

Caching

Efficient mod p computation (decryption)

Page 20: Implementing NTRU on a GPU

Convolution

Ordinary polynomials:

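for i = 0 to N − 1 do
    if r_i = +1 then t_k = t_k + h_{(k − i) mod N}
    if r_i = −1 then t_k = t_k − h_{(k − i) mod N}
end for

Product-form polynomials (r = r1 ⋆ r2 + r3) [4], with r+_i and r−_i the positions of the i-th +1 and −1 coefficient:

for i = 0 to d_r − 1 do
    t_k = t_k + h_{(k − r+_i) mod N}
    t_k = t_k − h_{(k − r−_i) mod N}
end for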
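
The two loops above can be compared in a short Python sketch: the first walks over all N coefficients of r, the second only over the stored positions of its +1 and -1 coefficients. The function names and the test data are assumptions made for the example; the code is sequential and makes no attempt to mirror the GPU thread layout.

```python
def convolve_dense(h, r, q):
    """Reference: t = h * r mod q, looping over every coefficient of r."""
    n = len(h)
    t = [0] * n
    for i in range(n):
        for k in range(n):
            t[k] = (t[k] + r[i] * h[(k - i) % n]) % q
    return t


def convolve_sparse(h, plus, minus, q):
    """t = h * r mod q, with r given by the positions of its +1 and -1 coefficients."""
    n = len(h)
    t = [0] * n
    for k in range(n):
        acc = 0
        for i in plus:           # r_i = +1: add the rotated coefficient of h
            acc += h[(k - i) % n]
        for i in minus:          # r_i = -1: subtract it
            acc -= h[(k - i) % n]
        t[k] = acc % q
    return t


Q = 2048
h = [(37 * k * k + 11 * k + 5) % Q for k in range(13)]   # arbitrary test data
r = [0, 1, 0, -1, 0, 0, 1, 0, -1, 0, 0, 0, 0]
plus = [i for i, r_i in enumerate(r) if r_i == 1]
minus = [i for i, r_i in enumerate(r) if r_i == -1]

assert convolve_sparse(h, plus, minus, Q) == convolve_dense(h, r, Q)
```

The sparse variant needs only additions and subtractions, and its inner loops run over 2·d_r positions instead of N.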

Page 21: Implementing NTRU on a GPU

Layout

1 block = 1 encryption

Upload r_b, h_b, m_b

Bit packing

1 thread = 4 × e_i
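
The slide does not say which encoding is used for bit packing. As one possible illustration, the sketch below stores a ternary polynomial at 2 bits per coefficient, 16 coefficients per 32-bit word. The encoding table, the word size and the function names are all assumptions, not the layout used in the talk.

```python
# Hypothetical 2-bit encoding of a ternary coefficient
CODES = {0: 0b00, 1: 0b01, -1: 0b10}
DECODE = {code: value for value, code in CODES.items()}
PER_WORD = 16  # sixteen 2-bit fields fit in one 32-bit word


def pack_ternary(coeffs):
    """Pack coefficients from {-1, 0, 1} into a list of 32-bit words."""
    words = [0] * ((len(coeffs) + PER_WORD - 1) // PER_WORD)
    for k, c in enumerate(coeffs):
        words[k // PER_WORD] |= CODES[c] << (2 * (k % PER_WORD))
    return words


def unpack_ternary(words, n):
    """Recover the first n coefficients from the packed words."""
    return [DECODE[(words[k // PER_WORD] >> (2 * (k % PER_WORD))) & 0b11]
            for k in range(n)]


r = [1, 0, -1, 0, 0, 1, -1] * 5          # 35 coefficients of test data
packed = pack_ternary(r)
assert unpack_ternary(packed, len(r)) == r
print(len(packed), "words for", len(r), "coefficients")
```

With this assumed encoding, a polynomial with N = 1171 coefficients takes 74 words, which reduces the amount of data that has to be uploaded per encryption.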

Page 22: Implementing NTRU on a GPU

Memory access

Figure: Memory access (diagram labels: Thread k, Thread k+1, ... in Block b; arrays r_b, h_b and e_b).

Page 23: Implementing NTRU on a GPU

Product-form polynomials

Product-form encryption

$e = h \star r + m \bmod q$ with $r = r_1 \star r_2 + r_3$

Algorithm:

1 tmp ← r_2 ⋆ h

2 tmp2 ← r_1 ⋆ tmp

3 e ← tmp2 + r_3 ⋆ h + m mod q
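
Expanding r gives h ⋆ r = r_1 ⋆ (r_2 ⋆ h) + r_3 ⋆ h, so the product r_1 ⋆ r_2 never has to be formed. The sketch below checks this ordering of the convolutions against the expanded computation on toy data; the function names and the test polynomials are assumptions made for the example.

```python
def cyclic_convolve(b, c, q):
    """Product of b and c in Z_q[X]/(X^N - 1); zero coefficients of b are skipped."""
    n = len(b)
    a = [0] * n
    for i in range(n):
        if b[i] == 0:
            continue
        for j in range(n):
            a[(i + j) % n] = (a[(i + j) % n] + b[i] * c[j]) % q
    return a


def encrypt_product_form(h, r1, r2, r3, m, q):
    """Three sparse convolutions with h, without ever forming r1 * r2."""
    tmp = cyclic_convolve(r2, h, q)       # step 1
    tmp2 = cyclic_convolve(r1, tmp, q)    # step 2
    r3_h = cyclic_convolve(r3, h, q)      # contribution of r3
    return [(x + y + z) % q for x, y, z in zip(tmp2, r3_h, m)]


def encrypt_expanded(h, r1, r2, r3, m, q):
    """Reference: build r = r1 * r2 + r3 first, then compute e = h * r + m."""
    r12 = cyclic_convolve(r1, r2, q)
    r = [(x + y) % q for x, y in zip(r12, r3)]
    hr = cyclic_convolve(h, r, q)
    return [(x + y) % q for x, y in zip(hr, m)]


Q = 2048
h = [(101 * k + 7) % Q for k in range(7)]     # arbitrary test data
r1 = [1, 0, -1, 0, 0, 0, 0]
r2 = [0, 1, 0, 0, -1, 0, 0]
r3 = [0, 0, 1, 0, 0, -1, 0]
m = [1, 0, -1, 1, 0, 0, -1]

assert encrypt_product_form(h, r1, r2, r3, m, Q) == encrypt_expanded(h, r1, r2, r3, m, Q)
```

Each of the three convolutions in `encrypt_product_form` has a very sparse left operand, which is where the speed-up of the product form comes from.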

Page 25: Implementing NTRU on a GPU

Results

Platform                   Enc/s       Dec/s

(N, q, p) = (1171, 2048, 3)
C Intel Core2 @ 3.00GHz    95          95
CUDA GTX280 (1 op)         571         546
CUDA GTX280 (20000 ops)    24·10^3     24·10^3

(N, q, p) = (1171, 2048, 3), product form
C Intel Core2 @ 3.00GHz    3.22·10^3   -
CUDA GTX280 (1 op)         6.25·10^3   -
CUDA GTX280 (20000 ops)    218·10^3    -

Table: Comparison of NTRU implementations
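
The speed-up of the GPU over the CPU implementation follows directly from the encryption rates in the table. The short calculation below uses only those figures; the dictionary keys and the function name are assumptions made for the example.

```python
# Encryptions per second, copied from the table above
ordinary = {"cpu": 95, "gpu_1_op": 571, "gpu_20000_ops": 24e3}
product_form = {"cpu": 3.22e3, "gpu_1_op": 6.25e3, "gpu_20000_ops": 218e3}


def speedups(row):
    """Ratio of each GPU rate to the CPU rate of the same row group."""
    return {name: rate / row["cpu"] for name, rate in row.items() if name != "cpu"}


for label, row in (("ordinary", ordinary), ("product form", product_form)):
    for name, factor in speedups(row).items():
        print(f"{label:13s} {name:14s} {factor:7.1f}x")
```

For a single operation the GPU is only about 6 times (ordinary) or 2 times (product form) faster than the CPU; with 20000 parallel operations the factors grow to roughly 250 and 68, which matches the focus on throughput over latency.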

Page 26: Implementing NTRU on a GPU

Throughput

Plot axes: number of parallel operations (logarithmic, 10^0 to 10^5) against operations/s (0 to 2.5 × 10^4).

Figure: Operations per second for encryption with ordinary polynomials.

Page 27: Implementing NTRU on a GPU

Comparison

Platform             Enc/s       Dec/s

(N, q, p) = (251, 128, X + 2), product form
FPGA ⁶               193·10^3    -
Palm ⁶               21          11
Palm ⁶               30          16
ARM C ⁶              307         148

(N, q, p) = (167, 128, 3)
FPGA ⁷               18          8.4

(N, q, p) = (787, 587, ?)
C ⁸                  7.66·10^3   4.61·10^3

(N, q, p) = (1171, 2048, 3)
C                    95          95
CUDA (1 op)          571         546
CUDA (20000 ops)     24·10^3     24·10^3

(N, q, p) = (1171, 2048, 3), product form
C                    3.22·10^3   -
CUDA (1 op)          6.25·10^3   -
CUDA (20000 ops)     218·10^3    -

Table: Comparison of NTRU implementations

⁶ Bailey, Coffin, Elbirt, Silverman, Woodbury
⁷ Atıcı, Batina, Fan, Verbauwhede, Yalçın
⁸ EBATS

Page 28: Implementing NTRU on a GPU

Comparison: other algorithms

Bar chart: throughput (axis 0 to 2.5 × 10^5) for NTRU PF, RSA 2048, ECC NIST-224 and NTRU.

Figure: Throughput (enc/s)

Page 29: Implementing NTRU on a GPU

Comparison: other algorithms

Different security levels:

NTRU (1171, 2048, 3): 256-bit
NTRU (167, 128, 3): � 80 bit
RSA 2048 bit: 112-bit
ECC NIST-224: 112-bit

Different amount of data:

NTRU (1171, 2048, 3): 1756 bit
NTRU (167, 128, 3): 250 bit
RSA: 1024/2048 bit

Page 30: Implementing NTRU on a GPU

Main results

NTRU:

Very fast implementation

Fast compared to other ciphers

Total throughput: 218000 enc/s or 47.8 MByte/s

Well suited for GPU
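
The two throughput figures are consistent with each other if each encryption carries the 1756 bit listed for NTRU (1171, 2048, 3) under "Different amount of data". The check below assumes that figure and 1 MByte = 10^6 bytes; the constant names are chosen for the example.

```python
ENC_PER_SECOND = 218_000        # product form, 20000 parallel operations
BITS_PER_ENCRYPTION = 1756      # data per encryption for NTRU (1171, 2048, 3)

bytes_per_second = ENC_PER_SECOND * BITS_PER_ENCRYPTION / 8
mbyte_per_second = bytes_per_second / 1e6

print(f"{mbyte_per_second:.2f} MByte/s")
```

This gives about 47.85 MByte/s, in line with the 47.8 MByte/s quoted on the slide.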

Page 31: Implementing NTRU on a GPU

Final remark: GPUs

+                  -
Computing power    Memory access/transfer
Price              Power consumption
Throughput         Latency
Reprogrammable

Page 32: Implementing NTRU on a GPU

Questions?

Page 33: Implementing NTRU on a GPU

Key references

[1] J. Hoffstein, J. Pipher, and J.H. Silverman. NTRU: A Ring-Based Public Key Cryptosystem. Lecture Notes in Computer Science, pages 267–288, 1998.

[2] W. Whyte, N. Howgrave-Graham, J. Hoffstein, J. Pipher, J.H. Silverman, and P. Hirschhorn. IEEE P1363.1 Draft 10: Draft Standard for Public Key Cryptographic Techniques Based on Hard Problems over Lattices.

[3] Nvidia. Compute Unified Device Architecture Programming Guide, 2007.

[4] J. Hoffstein and J.H. Silverman. Random small Hamming weight products with applications to cryptography. Discrete Applied Mathematics, 130(1):37–49, 2003.

Other references: see paper

Page 34: Implementing NTRU on a GPU

Comparison

Platform                                        Enc/s       Dec/s

(N, q, p) = (251, 128, X + 2), product form
FPGA Xilinx Virtex 1000EFG860 @ 50 MHz          193·10^3    -
Palm Motorola Dragonball @ 20 MHz (C)           21          11
Palm Motorola Dragonball @ 20 MHz (Assembly)    30          16
ARM C ARM7TDMI @ 37 MHz                         307         148

(N, q, p) = (167, 128, 3)
FPGA Xilinx Virtex 1000EFG860 @ 500kHz          18          8.4

(N, q, p) = (787, 587, ?)
C Intel Core2 Duo @ 3GHz                        7669        4613

(N, q, p) = (1171, 2048, 3)
C Intel Core2 Extreme @ 3.00GHz                 95          95
CUDA GTX280 (1 op)                              571         546
CUDA GTX280 (20000 ops)                         24·10^3     24·10^3

(N, q, p) = (1171, 2048, 3), product form
C Intel Core2 Extreme @ 3.00GHz                 3.22·10^3   -
CUDA GTX280 (1 op)                              6.25·10^3   -
CUDA GTX280 (20000 ops)                         218·10^3    -

Table: Comparison of NTRU implementations

Page 35: Implementing NTRU on a GPU

Comparison: other algorithms

Platform                   (N, q, p)       Enc/s       Dec/s
CUDA GTX280 (1 op)                         571         546
CUDA GTX280 (20000 ops)                    24·10^3     24·10^3
CUDA GTX280 (1 op)         Product form    6.25·10^3   6.25·10^3
CUDA GTX280 (20000 ops)    Product form    218·10^3    218·10^3

RSA comparison
CUDA ⁹, Nvidia 8800GTS, 1024 bit: 813
C++ ¹⁰, Core2 @ 1.83GHz, 2048 bit: (6.66·10^3) 168

ECC comparison
C ¹¹, Core2 @ 1.83 GHz, ECC NIST-224: 1.86·10^3

⁹ Szerwinski, Güneysu
¹⁰ Crypto++
¹¹ EBATS
