Sandy2x: New Curve25519 Speed Records
Tung Chou
Technische Universiteit Eindhoven, The Netherlands
October 13, 2015
X25519 and Ed25519
X25519
• ECDH scheme
• public keys and shared secrets are points on the Montgomery curve
  y^2 = x^3 + 486662x^2 + x
  over F_(2^255-19)
• by Bernstein, 2006
Ed25519
• signature scheme
• public keys and (part of) signatures are points on the twisted Edwards curve
  -x^2 + y^2 = 1 - (121665/121666)x^2y^2
  over F_(2^255-19)
• by Bernstein, Duif, Lange, Schwabe, and Yang, 2011
The ECC implementation pyramid
Scalar multiplication
ECC add/double
Finite-field arithmetic
Big-integer or polynomial arithmetic
(slide credit: Peter Schwabe)
The big multiplier
• used in all papers about ECC speeds on Intel microarchitectures
• 64 × 64 → 128-bit multiplication in one instruction (mul)
• (This talk focuses on Sandy Bridge/Ivy Bridge)
The radix-2^51 representation for F_(2^255-19)

f = f0 + f1·2^51 + f2·2^102 + f3·2^153 + f4·2^204
g = g0 + g1·2^51 + g2·2^102 + g3·2^153 + g4·2^204

h0 = f0g0 + 19f1g4 + 19f2g3 + 19f3g2 + 19f4g1
h1 = f0g1 + f1g0 + 19f2g4 + 19f3g3 + 19f4g2
h2 = f0g2 + f1g1 + f2g0 + 19f3g4 + 19f4g3
h3 = f0g3 + f1g2 + f2g1 + f3g0 + 19f4g4
h4 = f0g4 + f1g3 + f2g2 + f3g1 + f4g0

• 25 multiplication instructions + overhead.
• some carries required.
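The five limb-product formulas above can be checked mechanically. The following is a minimal Python sketch (not the talk's code; `mul_51`, `to_int`, and the test limbs are illustrative names) that verifies them against plain big-integer arithmetic modulo 2^255 - 19:

```python
# Sketch: verify the radix-2^51 product formulas for F_(2^255-19).
# A cross term f_i*g_j with i+j >= 5 wraps past 2^255 and picks up
# the factor 19, because 2^255 = 19 (mod p).
p = 2**255 - 19

def mul_51(f, g):
    """Schoolbook product of two 5-limb radix-2^51 elements (unreduced limbs)."""
    h = [0] * 5
    for i in range(5):
        for j in range(5):
            c = 19 if i + j >= 5 else 1
            h[(i + j) % 5] += c * f[i] * g[j]
    return h

def to_int(limbs):
    """Interpret radix-2^51 limbs as an element of F_(2^255-19)."""
    return sum(l << (51 * i) for i, l in enumerate(limbs)) % p

# arbitrary test limbs below 2^51
f = [(7**i * 1234567) % 2**51 for i in range(5)]
g = [(3**i * 7654321) % 2**51 for i in range(5)]
assert to_int(mul_51(f, g)) == to_int(f) * to_int(g) % p
```

Each h_k in the sketch expands to exactly the five terms shown on the slide; the assertion confirms the reduction by 19 is consistent with working modulo p directly.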
A small multiplier
• a 2-way vectorized multiplier
• 32× 32→ 64-bit multiplications in one instruction (vpmuludq)
• usage: (a0·b0, a1·b1) = (a0, a1) × (b0, b1)
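The lane behavior can be modeled in a few lines of Python (a hypothetical model, not real intrinsics; `vpmuludq` here is just a plain function named after the instruction, which per Intel's documentation multiplies the low 32 bits of each 64-bit lane into a full 64-bit product):

```python
# Hypothetical lane-wise model of vpmuludq: each 64-bit lane
# multiplies its low 32 bits, giving a full 64-bit product per lane,
# so one instruction performs two independent 32x32->64 multiplications.
MASK32 = 2**32 - 1

def vpmuludq(a, b):
    """(a0*b0, a1*b1) from the low 32 bits of each 64-bit lane."""
    return tuple((x & MASK32) * (y & MASK32) for x, y in zip(a, b))

print(vpmuludq((0xFFFFFFFF, 12345), (0xFFFFFFFF, 67890)))
# (18446744065119617025, 838102050)
```

Note that the products never overflow: (2^32 - 1)^2 < 2^64, so the 64-bit lanes hold the results exactly.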
The radix-2^25.5 representation for F_(2^255-19)

f = f0 + f1·2^26 + f2·2^51 + f3·2^77 + f4·2^102 + f5·2^128 + f6·2^153 + f7·2^179 + f8·2^204 + f9·2^230
g = g0 + g1·2^26 + g2·2^51 + g3·2^77 + g4·2^102 + g5·2^128 + g6·2^153 + g7·2^179 + g8·2^204 + g9·2^230

h0 = f0g0 + 38f1g9 + 19f2g8 + 38f3g7 + 19f4g6 + 38f5g5 + 19f6g4 + 38f7g3 + 19f8g2 + 38f9g1
h1 = f0g1 + f1g0 + 19f2g9 + 19f3g8 + 19f4g7 + 19f5g6 + 19f6g5 + 19f7g4 + 19f8g3 + 19f9g2
h2 = f0g2 + 2f1g1 + f2g0 + 38f3g9 + 19f4g8 + 38f5g7 + 19f6g6 + 38f7g5 + 19f8g4 + 38f9g3
h3 = f0g3 + f1g2 + f2g1 + f3g0 + 19f4g9 + 19f5g8 + 19f6g7 + 19f7g6 + 19f8g5 + 19f9g4
h4 = f0g4 + 2f1g3 + f2g2 + 2f3g1 + f4g0 + 38f5g9 + 19f6g8 + 38f7g7 + 19f8g6 + 38f9g5
h5 = f0g5 + f1g4 + f2g3 + f3g2 + f4g1 + f5g0 + 19f6g9 + 19f7g8 + 19f8g7 + 19f9g6
h6 = f0g6 + 2f1g5 + f2g4 + 2f3g3 + f4g2 + 2f5g1 + f6g0 + 38f7g9 + 19f8g8 + 38f9g7
h7 = f0g7 + f1g6 + f2g5 + f3g4 + f4g3 + f5g2 + f6g1 + f7g0 + 19f8g9 + 19f9g8
h8 = f0g8 + 2f1g7 + f2g6 + 2f3g5 + f4g4 + 2f5g3 + f6g2 + 2f7g1 + f8g0 + 38f9g9
h9 = f0g9 + f1g8 + f2g7 + f3g6 + f4g5 + f5g4 + f6g3 + f7g2 + f8g1 + f9g0

• 100 multiplication instructions + overhead; 50 per multiplication.
• some carries required.

Sandy2x sets new speed records by using the vectorized multiplier.
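The coefficient pattern has a compact rule: limb i sits at exponent ceil(25.5·i), so when i and j are both odd the exponents sum to one more than the target limb's exponent (factor 2), and a wrap past 2^255 contributes the factor 19 as before. A Python sketch (illustrative names, not the talk's code) that verifies the ten formulas against big-integer arithmetic:

```python
# Sketch: verify the 10-limb radix-2^25.5 product formulas.
# Coefficient of f_i*g_j: 2 if i and j are both odd (exponents sum to
# one more than the target limb's exponent), times 19 on wrap past
# 2^255, since 2^255 = 19 (mod p). Both factors give 38.
from math import ceil

p = 2**255 - 19
e = [ceil(25.5 * i) for i in range(10)]   # 0, 26, 51, 77, ..., 230

def mul_2555(f, g):
    """Schoolbook product of two 10-limb radix-2^25.5 elements."""
    h = [0] * 10
    for i in range(10):
        for j in range(10):
            c = 2 if (i & 1) and (j & 1) else 1
            if i + j >= 10:
                c *= 19
            h[(i + j) % 10] += c * f[i] * g[j]
    return h

def to_int(limbs):
    """Interpret radix-2^25.5 limbs as an element of F_(2^255-19)."""
    return sum(l << e[i] for i, l in enumerate(limbs)) % p

# arbitrary test limbs below 2^25
f = [(11**i * 31415) % 2**25 for i in range(10)]
g = [(13**i * 27182) % 2**25 for i in range(10)]
assert to_int(mul_2555(f, g)) == to_int(f) * to_int(g) % p
```

The half-size limbs are what make each partial product fit a 32 × 32 → 64-bit vpmuludq, at the cost of twice as many limbs.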
Performance results

                                     SB cycles   IB cycles   reference
X25519 public-key generation            54 346      52 169   Sandy2x
                                        61 828      57 612   [A. Moon]
                                       194 165     182 876   [Ed25519]
X25519 shared secret computation       156 995     159 128   Sandy2x
                                       194 036     182 708   [Ed25519]
Ed25519 public-key generation           57 164      54 901   Sandy2x
                                        63 712      59 332   [A. Moon]
                                        64 015      61 099   [Ed25519]
Ed25519 sign                            63 526      59 949   Sandy2x
                                        67 692      62 624   [A. Moon]
                                        72 444      67 284   [Ed25519]
Ed25519 verification                   205 741     198 406   Sandy2x
                                       227 628     204 376   [A. Moon]
                                       222 564     209 060   [Ed25519]

• [A. Moon]: Andrew Moon "floodyberry", https://github.com/floodyberry/ed25519-donna
Why is vectorization better?

Ports
• on each SB/IB core there are 6 ports; Ports 0, 1, and 5 handle arithmetic.
• each instruction is decomposed into micro-operations (µops)
• mul: 2 µops, handled by Port 0 and Port 1.
• vpmuludq: 1 µop, handled by Port 0.
• vpaddq: 1 µop, handled by either Port 1 or Port 5.
• port utilization gives a lower bound on the cycle count
Why is vectorization better?

Using the vectorized multiplier
• 109 vpmuludq + 95 vpaddq
• lower bound: 109 cycles (dominated by Port 0)
• actual cycle count: 112 cycles (56 cycles per multiplication)

Using the serial multiplier
• 25 mul + 4 imul + 20 add + 20 adc
• lower bound: (25 · 2 + 4 + 20 + 20 · 2)/3 = 38 cycles
• actual cycle count is much larger: 52 cycles
• perf stat shows that the core fails to distribute the µops equally over the ports
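Both lower bounds come from simple counting; the following sketch reproduces the arithmetic (illustrative variable names, not from the talk):

```python
# Port-pressure lower bounds: every arithmetic µop must issue on one
# of three ports (0, 1, 5), so total µops / 3 bounds the cycle count
# from below; a µop restricted to a single port bounds it by that
# port's load alone.

# serial multiplier: mul and adc are 2 µops each (per the bound on
# the slide), imul and add are 1 µop each
mul, imul, add, adc = 25, 4, 20, 20
serial_uops = mul * 2 + imul * 1 + add * 1 + adc * 2
print(serial_uops / 3)                           # -> 38.0 cycles

# vectorized multiplier: vpmuludq (1 µop) runs only on Port 0, so
# Port 0 alone needs at least one cycle per vpmuludq
vpmuludq, vpaddq = 109, 95
print(max(vpmuludq, (vpmuludq + vpaddq) / 3))    # -> 109 cycles
```

The serial bound of 38 is far below the measured 52 cycles because the scheduler cannot spread these µops evenly; the vectorized bound of 109 is nearly met at 112.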
Why is vectorization better?

More reasons
• carries take more cycles when using the serial multiplier

                 M−     M
  serial         52     68
  vectorized     56     69.5

  (M− = multiplication alone, M = multiplication with carries; the carries cost
  68 − 52 = 16 cycles serially but only 69.5 − 56 = 13.5 cycles vectorized)

• batched squarings are faster
• instruction interleaving hides the cost of additions/subtractions
• constant-time table lookups are faster with vector instructions
Ending Remarks
The main message of this talk:
• Vectorization should be considered on recent Intel microarchitectures.
Code (for X25519 shared secret computation)
• https://sites.google.com/a/crypto.tw/blueprint/