NEON crypto
Daniel J. Bernstein, Peter Schwabe
September 11, 2012
CHES 2012, Leuven, Belgium

NEON
- Target of this paper: Make cryptography fast on a large class of mobile devices, e.g.,
  Apple iPhone 3GS, Apple iPhone 4, 3rd generation Apple iPod touch (late 2009), Apple iPad 1, Nokia N9, Nokia N900, Palm Pre Plus, Samsung/Google Nexus S, Samsung Galaxy S, ...
- All these devices have an ARM Cortex-A8 CPU with NEON vector instruction set
- Many more devices with NEON:
  HTC Sensation, HTC 7 Mozart, HTC Desire, HTC/Google Nexus One, LG Optimus 7, Motorola Droid Bionic, Nokia Lumia, Samsung/Google Galaxy Nexus, Samsung Galaxy S II and S III, ...
- Those devices have Cortex-A9 and Qualcomm Snapdragon CPUs
- Rest of this talk: Focus on NEON in Cortex-A8

crypto
- Obvious target algorithm: AES with 128-bit key
- Best previous result: Krovetz and Rogaway report 25.4 cycles/byte for implementation by Polyakov, included in OpenSSL
- Not protected against timing attacks
- For constant-time implementation: Bitsliced approach by Käsper and Schwabe (CHES 2009), logical operations on 8 blocks in parallel
- Per round of AES: 167 logical operations (148 in the last round)
- Total of 9 · 167 + 148 = 1651 logical operations
- NEON can do one logical operation per cycle
- Lower bound of 1651 cycles per 8 blocks; 12.898 cycles/byte
- This ignores cost for bitslice transformation, xoring of keystream in CTR mode ...
- Our AES NEON speed: 18.94 cycles/byte, constant time
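
The lower bound quoted for bitsliced AES follows directly from the operation counts on the slide. A minimal sketch of that arithmetic, assuming 16-byte AES blocks and 10 rounds for a 128-bit key (the constant and function names are illustrative, not taken from the paper):

```python
# Figures from the slide; names are hypothetical and chosen for readability.
OPS_PER_ROUND = 167       # logical operations per full bitsliced round
OPS_LAST_ROUND = 148      # logical operations in the last round
FULL_ROUNDS = 9           # AES-128 has 10 rounds: 9 full rounds plus the last one
BLOCKS_IN_PARALLEL = 8    # the bitsliced code handles 8 blocks at once
BLOCK_BYTES = 16          # AES block size in bytes


def aes_lower_bound(ops_per_cycle=1):
    """Return (total logical operations, lower bound in cycles/byte)."""
    total_ops = FULL_ROUNDS * OPS_PER_ROUND + OPS_LAST_ROUND
    bytes_processed = BLOCKS_IN_PARALLEL * BLOCK_BYTES
    return total_ops, total_ops / (ops_per_cycle * bytes_processed)


total_ops, bound = aes_lower_bound()
print(total_ops, round(bound, 3))  # 1651 12.898
```

The measured 18.94 cycles/byte sits above this bound because the bound leaves out the bitslice transformation and the keystream xor, as the slide notes.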

crypto: there’s more!
- Cryptographic primitives required for secure network communication:
  - Symmetric encryption
  - Secret-key authentication
  - Key exchange (Diffie-Hellman)
  - Public-key signatures
- At least 128 bits of security
- Protection against timing attacks
- As fast as possible on ARM Cortex-A8
- Our choice of primitives:
  - Salsa20
  - Poly1305
  - Curve25519
  - Ed25519

Salsa20
- Designed by Bernstein in 2005; recommended in the eSTREAM software portfolio
- Generates random stream in 64-byte blocks, works on 32-bit integers
- Per block: 20 rounds; each round doing 16 add-rotate-xor sequences, such as
      s4 = x0 + x12
      x4 ^= (s4 >>> 25)
- In ARM without NEON: 2 instructions, 1 cycle
- Sounds like total of (20 · 16)/64 = 5 cycles/byte, but:
  - Only 14 integer registers (need at least 17)
  - Latencies cause big trouble
  - Actual implementations were slower than 15 cycles/byte
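
The add-rotate-xor sequence can be written out on plain 32-bit words. The sketch below is an illustration in Python, not the authors' ARM code; it assumes that `>>>` on the slide denotes a 32-bit rotate right, so that rotating right by 25 equals rotating left by 7. The helper names are hypothetical, and the rotation counts 7, 9, 13, 18 are those of the Salsa20 quarter-round.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate the 32-bit word x left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def add_rotate_xor(target, a, b, n):
    """One add-rotate-xor step: target ^= (a + b) <<< n, on 32-bit words."""
    return target ^ rotl32((a + b) & MASK32, n)


def quarterround(y0, y1, y2, y3):
    """Four chained add-rotate-xor steps; each depends on the previous one."""
    y1 = add_rotate_xor(y1, y0, y3, 7)
    y2 = add_rotate_xor(y2, y1, y0, 9)
    y3 = add_rotate_xor(y3, y2, y1, 13)
    y0 = add_rotate_xor(y0, y3, y2, 18)
    return y0, y1, y2, y3


# The slide's example: s4 = x0 + x12; x4 ^= (s4 >>> 25).
# The input words here are arbitrary sample values.
x0, x4, x12 = 0x61707865, 0x00000000, 0x00000001
x4 = add_rotate_xor(x4, x0, x12, 7)

# Naive estimate from the slide: 20 rounds of 16 one-cycle sequences per 64 bytes.
naive_cycles_per_byte = (20 * 16) / 64
```

Each round applies four quarter-rounds of four steps each, which gives the 16 sequences per round counted on the slide. The chained dependencies inside a quarter-round are what make latencies matter in practice.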

Salsa20 on the Cortex-A8
- Add-rotate-xor sequences are 4-way parallel, good for SIMD
- Rotates are not free, cost 3 instructions:
      4x a0 = diag1 + diag0
      4x b0 = a0 << 7
      4x a0 unsigned >>= 25
      diag3 ^= b0
      diag3 ^= a0
- This has 9 cycles latency: Need at least (9 · 20 · 4)/64 = 11.25 cycles/byte
- Blocks are independent, interleave two blocks; need at least 6.875 cycles/byte
- ... interleave three blocks; need at least 6.25 cycles/byte
- The ARM unit is still idle, so interleave ARM with NEON:
  - One block on ARM, two blocks on NEON
  - Bottleneck: decode at most 2 instructions per cycle
- Final result, including overhead: 5.47 cycles/byte
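
The five-instruction sequence above can be emulated lane by lane to see why two shifts and two xors stand in for a rotate. This is a sketch under assumptions: 4-lane vectors are modelled as Python lists of 32-bit words, the helper names are hypothetical, and the factor 4 in the latency bound is read as four dependent vector steps per round.

```python
MASK32 = 0xFFFFFFFF


def add4(u, v):
    return [(a + b) & MASK32 for a, b in zip(u, v)]


def shl4(u, n):
    return [(a << n) & MASK32 for a in u]


def shr4(u, n):
    return [a >> n for a in u]


def xor4(u, v):
    return [a ^ b for a, b in zip(u, v)]


def neon_style_step(diag3, diag1, diag0):
    """Mirror of the slide's sequence on four 32-bit lanes."""
    a0 = add4(diag1, diag0)      # 4x a0 = diag1 + diag0
    b0 = shl4(a0, 7)             # 4x b0 = a0 << 7
    a0 = shr4(a0, 25)            # 4x a0 unsigned >>= 25
    diag3 = xor4(diag3, b0)      # diag3 ^= b0
    diag3 = xor4(diag3, a0)      # diag3 ^= a0
    return diag3


def reference_step(diag3, diag1, diag0):
    """Same step written with an explicit rotate left by 7."""
    def rotl(x, n):
        return ((x << n) | (x >> (32 - n))) & MASK32
    return [d3 ^ rotl((d1 + d0) & MASK32, 7)
            for d3, d1, d0 in zip(diag3, diag1, diag0)]


# Latency bound for a single block, with the slide's figures.
LATENCY_CYCLES, ROUNDS, STEPS_PER_ROUND, BLOCK_BYTES = 9, 20, 4, 64
one_block_bound = LATENCY_CYCLES * ROUNDS * STEPS_PER_ROUND / BLOCK_BYTES  # 11.25
```

The two shifted halves occupy disjoint bit positions, so xoring them into `diag3` one after the other gives the same result as xoring the rotated word once.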

Poly1305
- Designed by Bernstein in 2005
- Secret-key one-time authenticator based on arithmetic in $\mathbb{F}_p$ with $p = 2^{130} - 5$
- Key $k$ and (padded) 16-byte ciphertext blocks $c_1, \dots, c_k$ are in $\mathbb{F}_p$
- Main work: initialize authentication tag $h$ with 0, then compute:
      for i from 1 to k do
          h ← h + c_i
          h ← h · k
      end for
- Per 16 bytes: 1 multiplication, 1 addition in $\mathbb{F}_{2^{130}-5}$
- Some (fast) finalization to produce 16-byte authentication tag
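
The main loop can be sketched with Python's arbitrary-precision integers. This covers only the loop shown on the slide, assuming the blocks have already been padded and converted to integers in $\mathbb{F}_p$; key preparation and the finalization step are left out, and the function and parameter names are hypothetical.

```python
P = 2**130 - 5  # the prime p from the slide


def poly1305_main_loop(key, blocks):
    """Evaluate h <- (h + c_i) * key mod p over all blocks, starting from h = 0.

    `key` and every entry of `blocks` are assumed to be integers in [0, P).
    """
    h = 0
    for c in blocks:
        h = (h + c) % P    # one addition in F_p per 16-byte block
        h = (h * key) % P  # one multiplication in F_p per 16-byte block
    return h


# Small worked example: ((0 + 1) * 3 + 2) * 3 = 15.
print(poly1305_main_loop(3, [1, 2]))  # 15
```

The loop is Horner evaluation of a polynomial in the key, which is why the cost per block is one multiplication and one addition.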

Poly1305 on the Cortex-A8
- Fastest NEON multiplier: Two SIMD 32 × 32 → 64 bit integer multiplications every two cycles
- Multiply-accumulate at the same cost as multiply
- NEON additions lose carry bits; we need a carry-safe (redundant) representation
- Represent an element $A$ of $\mathbb{F}_p$ as $(a_0, a_1, a_2, a_3, a_4)$ with
  $$A = \sum_{i=0}^{4} a_i \cdot 2^{26 \cdot i}$$
- In multiplication of $C = A \cdot B$ obtain coefficients $c_0, c_1, \dots, c_8$
- Reduction: $2^{130} \equiv 5 \pmod{p}$. Hence add $5c_5$ to $c_0$, $5c_6$ to $c_1$, etc.
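
The radix-$2^{26}$ representation and the reduction rule can be checked against ordinary big-integer arithmetic. The following is a sketch, not the NEON implementation: it uses Python integers for the limbs, performs no carry propagation after the multiplication, and all function names are hypothetical.

```python
P = 2**130 - 5
LIMB_BITS = 26
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x):
    """Split x (0 <= x < 2^130) into five 26-bit limbs a_0..a_4."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(5)]


def from_limbs(limbs):
    """Recombine limbs: sum of a_i * 2^(26 i). Limbs may exceed 26 bits."""
    return sum(a << (LIMB_BITS * i) for i, a in enumerate(limbs))


def mul_limbs(a, b):
    """Schoolbook product c_0..c_8, then fold c_5..c_8 back using 2^130 = 5 mod p."""
    c = [0] * 9
    for i in range(5):
        for j in range(5):
            c[i + j] += a[i] * b[j]
    r = c[:5]
    for j in range(4):
        r[j] += 5 * c[j + 5]  # c_{5+j} has weight 2^(130 + 26 j) = 5 * 2^(26 j) mod p
    return r                  # redundant result: limbs are not carried


A = 0x2FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFA   # sample values below 2^130
B = 0x123456789ABCDEF0123456789ABCDEF01
product = from_limbs(mul_limbs(to_limbs(A), to_limbs(B))) % P
assert product == (A * B) % P
```

With 26-bit input limbs every folded coefficient stays below $2^{64}$, which is what lets 64-bit accumulators postpone the carries.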

Poly1305 on the Cortex-A8
- Schoolbook multiplication breaks into 25 32-bit integer multiplications and 16 64-bit additions
- Many of those are parallel, can perform them in SIMD, but
  - this requires quite a bit of shuffling, and
  - latencies in the final carry chain kick in
- Better: Precompute $k^2$
- Compute $((c_0 \cdot k) + c_1) \cdot k$ as $(c_0 \cdot k^2) + (c_1 \cdot k)$
- Always perform two independent multiplications in $\mathbb{F}_p$ together in SIMD
- Final result: 2.20 cycles/byte
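
The $k^2$ rewrite can be checked against the sequential loop. The sketch below assumes the loop form $h \leftarrow (h + c_i) \cdot k$, processes two blocks per step, and falls back to a single multiplication for a leftover block; the function names are hypothetical, and the two multiplications marked as independent are the ones a SIMD implementation could run side by side.

```python
P = 2**130 - 5


def sequential(key, blocks):
    """Reference: one dependent multiplication per block."""
    h = 0
    for c in blocks:
        h = ((h + c) * key) % P
    return h


def two_at_a_time(key, blocks):
    """Process blocks in pairs using a precomputed key^2."""
    key2 = (key * key) % P
    h = 0
    for i in range(len(blocks) // 2):
        c0, c1 = blocks[2 * i], blocks[2 * i + 1]
        m0 = ((h + c0) * key2) % P  # independent multiplication 1
        m1 = (c1 * key) % P         # independent multiplication 2
        h = (m0 + m1) % P
    if len(blocks) % 2:             # odd block count: handle the last block alone
        h = ((h + blocks[-1]) * key) % P
    return h


key = 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF  # arbitrary sample key value
blocks = [(i * 0x9E3779B97F4A7C15 + 1) % P for i in range(7)]
assert two_at_a_time(key, blocks) == sequential(key, blocks)
```

Neither multiplication in a pair waits on the other's result, so the dependency chain per pair is one multiplication plus one addition instead of two multiplications.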

Curve25519 and Ed25519
- Curve25519: ECDH key exchange (Bernstein, PKC 2006)
- Ed25519: Elliptic-curve signatures (Bernstein, Duif, Lange, Schwabe, Yang, CHES 2011)
- Arithmetic on Montgomery curve or birationally equivalent twisted Edwards curve over $\mathbb{F}_{2^{255}-19}$
- Again, use redundant representation: $A = (a_0, \dots, a_9)$, with
  $$A = \sum_{i=0}^{9} a_i \cdot 2^{\lceil 25.5 \cdot i \rceil}$$
- Similar ideas to Poly1305:
  - Efficient reduction through $2^{255} \equiv 19$: add $19c_{10}$ to $c_0$, etc.
  - Whenever possible, perform two independent multiplications or squarings together
- Constant-time conditional swaps (Curve25519) and table lookups (Ed25519) to protect against timing attacks
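
The radix-$2^{25.5}$ representation alternates 26-bit and 25-bit limbs. The sketch below illustrates it with Python integers, together with a branch-free conditional swap. Two points are assumptions of this illustration rather than statements from the slides: products of two odd-indexed limbs are doubled so that the weights line up, and the swap shows only the data flow, since Python integers give no constant-time guarantee. All names are hypothetical.

```python
P = 2**255 - 19
# ceil(25.5 * i) for i = 0..10; the last entry is 255.
EXPONENTS = [(51 * i + 1) // 2 for i in range(11)]


def to_limbs(x):
    """Split x (0 <= x < 2^255) into ten limbs of alternating 26 and 25 bits."""
    return [(x >> EXPONENTS[i]) & ((1 << (EXPONENTS[i + 1] - EXPONENTS[i])) - 1)
            for i in range(10)]


def from_limbs(limbs):
    """Recombine limbs: sum of a_i * 2^ceil(25.5 i)."""
    return sum(a << EXPONENTS[i] for i, a in enumerate(limbs))


def mul_limbs(a, b):
    """Schoolbook product c_0..c_18, folded with 2^255 = 19 mod p (no carries)."""
    c = [0] * 19
    for i in range(10):
        for j in range(10):
            term = a[i] * b[j]
            if i % 2 and j % 2:
                term *= 2  # ceil(25.5 i) + ceil(25.5 j) = ceil(25.5 (i + j)) + 1
            c[i + j] += term
    r = c[:10]
    for j in range(9):
        r[j] += 19 * c[j + 10]  # add 19 * c_10 to c_0, 19 * c_11 to c_1, etc.
    return r


def cswap(a, b, bit):
    """Swap a and b when bit == 1, without branching on bit."""
    mask = -bit          # 0 when bit == 0, all ones when bit == 1
    t = mask & (a ^ b)
    return a ^ t, b ^ t


A = 0x7123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
B = 0x0FEDCBA9876543210FEDCBA9876543210FEDCBA9876543210FEDCBA987654321
assert from_limbs(mul_limbs(to_limbs(A), to_limbs(B))) % P == (A * B) % P
```

The same fold-by-a-small-constant idea as in Poly1305 applies here, with 19 in place of 5 and ten limbs in place of five.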
Curve25519 and Ed25519
I Curve25519: ECDH key exchange (Bernstein, PKC 2006)
I Ed25519: Elliptic-curve signatures (Bernstein, Duif, Lange,Schwabe, Yang, CHES 2011)
I Arithmetic on Montgomery curve or birationally equivalent twistedEdwards curve over F2255−19
I Again, use redundant representation: A = (a0, . . . , a9), with
A =9∑
i=0
ai · 2d25.5·ie
I Similar ideas to Poly1305:I Efficient reduction through 2255 ≡ 19: add 19c10 to c0, etc.I Whenever possible, perform two independent multiplications or
squarings together
I Constant-time conditional swaps (Curve25519) and table lookups(Ed25519) to protect against timing attacks
10
![Page 44: NEON crypto - Peter Schwabe · NEON crypto Daniel J. Bernstein, Peter Schwabe September 11, 2012 CHES 2012, Leuven, Belgium](https://reader031.fdocuments.net/reader031/viewer/2022030309/5b57eed17f8b9a655d8b5bf6/html5/thumbnails/44.jpg)
Results & Outlook
- Secret-key authenticated encryption: 7.67 cycles/byte, > 830 MBit/sec on 800 MHz Cortex-A8
  - Salsa20: 5.47 cycles/byte
  - Poly1305: 2.20 cycles/byte
- Compute shared secret (ECDH): 492417 cycles (> 1600/sec)
- All software is timing-attack resistant
- Also Cortex-A9 and Qualcomm Snapdragon CPUs benefit from the software speedups
- Still required: Microarchitecture-specific optimization for those
- Followup result by Hamburg:
  - Use similar ECC techniques, slightly smaller curve
  - Use more powerful ARM core on Cortex-A9
  - Don't use NEON
  - Compute shared secret (ECDH): 616000 cycles
- Obvious question: How far can we go on Cortex-A9 with NEON?
- Future: target low-power energy-efficient Cortex-A7
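The throughput figures for the 800 MHz Cortex-A8 follow from the cycle counts by simple arithmetic, sketched below in Python. The helper names `mbit_per_sec` and `ops_per_sec` are hypothetical, and the sketch assumes 1 MBit = 10^6 bits.

```python
CLOCK_HZ = 800_000_000  # 800 MHz Cortex-A8, as stated on the slide

def mbit_per_sec(cycles_per_byte, clock_hz=CLOCK_HZ):
    """Throughput in MBit/sec (assuming 10^6 bits per MBit)."""
    bytes_per_sec = clock_hz / cycles_per_byte
    return bytes_per_sec * 8 / 1e6

def ops_per_sec(cycles_per_op, clock_hz=CLOCK_HZ):
    """Operations per second for a fixed cycle count per operation."""
    return clock_hz / cycles_per_op

# Salsa20 plus Poly1305 gives the authenticated-encryption cost per byte.
total_cycles_per_byte = 5.47 + 2.20
assert abs(total_cycles_per_byte - 7.67) < 1e-9
assert mbit_per_sec(total_cycles_per_byte) > 830
assert ops_per_sec(492417) > 1600
```

The same helpers do not apply directly to the 616000-cycle followup result, since that was measured on a Cortex-A9 whose clock rate is not given here.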
![Page 48: NEON crypto - Peter Schwabe · NEON crypto Daniel J. Bernstein, Peter Schwabe September 11, 2012 CHES 2012, Leuven, Belgium](https://reader031.fdocuments.net/reader031/viewer/2022030309/5b57eed17f8b9a655d8b5bf6/html5/thumbnails/48.jpg)
NEON crypto online
- The paper is online at http://cryptojedi.org/papers/#neoncrypto
- NEON AES-128-CTR, Salsa20, Poly1305 now in SUPERCOP: http://bench.cr.yp.to
- We're still speeding up Curve25519, Ed25519 but will include them in SUPERCOP
- All software in the public domain
- Software to be included in the next release of the NaCl library: http://nacl.cr.yp.to