ARMed SPHINCS: Computing a 41 KB signature in 16 KB of RAM
Andreas Hülsing (1), Joost Rijneveld (2), Peter Schwabe (2)
(1) Technische Universiteit Eindhoven, The Netherlands
(2) Radboud University, Nijmegen, The Netherlands
PKC 2016, 2016-03-07

SPHINCS
- SPHINCS: Stateless, practical, hash-based, incredibly nice cryptographic signatures [BHHLNPSW15].
- Post-quantum
  - Hash functions do not fall to Shor's algorithm (but Grover's algorithm halves their effective security level)
- Hash-based schemes: conservative choice
  - One-way functions necessary for signatures [Rom90]
  - Tight security reductions
- Collision resilient

SPHINCS-256
- Large hash tree, height h = 60
  - Split into d = 12 layers of subtrees, each of height h/d = 5; the root of each subtree is signed using an OTS (one-time signature) in the layer above
- Effectively a hypertree of d = 12 layers of Merkle trees [Mer90]
  - Trade signature size for time
- Sign messages using 2^60 leaf nodes
- No need to remember index: stateless [Gol87]
[Figure: schematic of the SPHINCS hypertree]

SPHINCS-256
- 41 KB signatures, 1 KB keys
- OTS: Winternitz OTS variant (WOTS+) [Hul13]
- Hash functions: BLAKE [ANWW13], ChaCha permutation π_ChaCha [Ber08]
- Key expansion function: ChaCha12
- FTS (few-time signature): HORST [BHHLNPSW15]
  - Contains 16-layer Merkle tree (so 2^16 = 65536 leaves)
  - Goal: 32 authentication paths, root node
  - Paths start at (deterministically chosen) 'random' leaves
  - Complete tree takes approx. 2 MB RAM

Platform
- STM32L100C development board
- Cortex-M3, ARMv7-M
- libopencm3 firmware
- 32 MHz, 32-bit architecture
- 16 registers
- 256 KB Flash
- 16 KB RAM

Treehash
- HORST tree is too large: 2 MB!
- Treehash [Mer90]: only remember relevant nodes
  - Maintain a stack: at most log2(n) = 16 nodes (or log2(8) = 3 for a small example tree with 8 leaves)
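
The stack bound can be checked with a small sketch. This is illustrative only and not the authors' implementation: SHA-256 from Python's hashlib stands in for the SPHINCS-256 hash functions, and the names `treehash_root`, `naive_root` and `leaf` are hypothetical.

```python
import hashlib

def H(left: bytes, right: bytes) -> bytes:
    # Stand-in node hash (assumption: SHA-256, not the SPHINCS-256 hash).
    return hashlib.sha256(left + right).digest()

def leaf(i: int) -> bytes:
    # Hypothetical leaf generator; a real signer derives leaves from the key.
    return hashlib.sha256(i.to_bytes(4, "big")).digest()

def treehash_root(num_leaves: int):
    """Return (root, max_stack_size) for a power-of-two number of leaves."""
    stack = []       # entries are (height, node)
    max_stack = 0
    for i in range(num_leaves):
        height, node = 0, leaf(i)
        # Merge with the stack top while both subtrees have equal height.
        while stack and stack[-1][0] == height:
            _, left = stack.pop()
            node = H(left, node)
            height += 1
        stack.append((height, node))
        max_stack = max(max_stack, len(stack))
    assert len(stack) == 1
    return stack[0][1], max_stack

def naive_root(num_leaves: int) -> bytes:
    # Reference computation that keeps every level in memory.
    level = [leaf(i) for i in range(num_leaves)]
    while len(level) > 1:
        level = [H(level[j], level[j + 1]) for j in range(0, len(level), 2)]
    return level[0]

root, max_stack = treehash_root(8)
```

In this sketch the node currently being merged is held in a local variable, so the stack itself never grows beyond log2(n) entries: 3 for 8 leaves, and 16 for the 2^16 leaves of the HORST tree.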

Treehash considerations
- Goal: construct 32 HORST authentication paths
- Identify relevant nodes (320 of 131071)
- Identify relevant rounds (< 320 of 65536)
- Identify relevant nodes in relevant rounds (bitmasks)
  - Observation: sort masks by round index
  - Simply maintain a pointer
  - Iterate while performing treehash
- Output in the appropriate order
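
The bookkeeping above could be sketched as follows. Several details here are assumptions rather than statements from the slides: each of the 32 paths is taken to contribute 10 sibling nodes (levels 0 to 9, which is consistent with 32 x 10 = 320), a node is treated as "completed" in the treehash round that processes its rightmost leaf, and a fixed-seed random generator is a placeholder for the deterministic leaf selection. The names `relevant_masks` and `walk` are hypothetical.

```python
import random

TREE_HEIGHT = 16   # 2^16 leaves, as on the slide
PATH_LEVELS = 10   # assumption: 32 paths x 10 levels = 320 relevant nodes
NUM_PATHS = 32

def relevant_masks(leaf_indices):
    """Map each relevant treehash round to a bitmask of levels to output."""
    masks = {}
    for idx in leaf_indices:
        for level in range(PATH_LEVELS):
            sibling = (idx >> level) ^ 1   # sibling of the path node on this level
            # The sibling is completed when its rightmost leaf is processed.
            round_index = ((sibling + 1) << level) - 1
            masks[round_index] = masks.get(round_index, 0) | (1 << level)
    # Sorting by round lets the signer walk the list with a single pointer.
    return sorted(masks.items())

def walk(masks, num_leaves=1 << TREE_HEIGHT):
    """Simulate the treehash loop, yielding (round, level) for nodes to emit."""
    pointer = 0
    for rnd in range(num_leaves):
        if pointer < len(masks) and masks[pointer][0] == rnd:
            mask = masks[pointer][1]
            for level in range(PATH_LEVELS):
                if mask & (1 << level):
                    yield rnd, level
            pointer += 1

rng = random.Random(2016)  # placeholder for the deterministic leaf selection
leaves = [rng.randrange(1 << TREE_HEIGHT) for _ in range(NUM_PATHS)]
masks = relevant_masks(leaves)
emitted = list(walk(masks))
```

Because the masks are sorted by round index, the treehash loop only compares the current round against one list entry, so the per-round overhead stays constant.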

Streaming
- Cannot store signature → stream out immediately
  - HORST inherently not ordered!
  - Re-arrange on the host
  - Tags (832 bytes total; 64 + 640 + 128)
  - Alternatively: traverse tree on host
- Cannot store expanded key material
  - Interleave ChaCha12 and Treehash
- Streaming message input
  - Blockwise BLAKE-512
  - Stream twice: once for randomness, once for digest
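
A minimal sketch of the two-pass message handling, under stated assumptions: `hashlib.blake2b` is used as a stand-in because Python's hashlib does not provide BLAKE-512, the way the seed and randomness are prefixed to the message is simplified and not taken from the SPHINCS specification, and all function names are hypothetical.

```python
import hashlib

BLOCK = 128  # bytes fed to the hash per step (arbitrary choice for this sketch)

def blocks(message: bytes):
    # Stands in for a host that (re)sends the message block by block.
    for off in range(0, len(message), BLOCK):
        yield message[off:off + BLOCK]

def stream_hash(prefix: bytes, block_source) -> bytes:
    # Absorb the prefix, then each block; only one block is held at a time.
    h = hashlib.blake2b(prefix)
    for block in block_source:
        h.update(block)
    return h.digest()

def randomness_and_digest(secret_seed: bytes, message_stream):
    # Pass 1: derive message-dependent randomness from the secret seed.
    randomness = stream_hash(secret_seed, message_stream())
    # Pass 2: the digest depends on that randomness, so the message
    # has to be streamed a second time.
    digest = stream_hash(randomness, message_stream())
    return randomness, digest

msg = bytes(range(256)) * 5
r, d = randomness_and_digest(b"\x00" * 32, lambda: blocks(msg))
```

The second pass cannot start before the first has finished, which is why a device that cannot buffer the message needs the host to send it twice.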

ChaCha12
- Core computation: ChaCha permutation
  - 685818 calls per signature
  - 65% of all computations
- 48 quarter-rounds of ADD, XOR and ROR
  - Costs 542 cycles
  - Note: slightly improved since proceedings version
- 512-bit state: 16 words of 32 bits each
  - Fit precisely in the 16 registers...
  - ... but we must preserve PC and SP
  - → Reorder to minimize and group stack access
- Rotates on ARMv7 are (almost always) free!
  - eor r6, r6, r11, ROR #29
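
For reference, the quarter-round structure can be sketched in a few lines. This is a plain Python model of the standard ChaCha quarter-round [Ber08], not the authors' optimized assembly; it shows the bare 12-round permutation without a final addition of the input, which is an assumption about what the slides call the ChaCha permutation.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x: int, n: int) -> int:
    # Left-rotate by n bits; equivalent to a right-rotate (ROR) by 32 - n.
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(s, a, b, c, d):
    # Standard ChaCha quarter-round: four ADD / XOR / rotate steps.
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)

COLUMNS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

def chacha_permutation(state, rounds=12):
    """Apply `rounds` ChaCha rounds to 16 words; also count quarter-rounds."""
    s = list(state)
    calls = 0
    for _ in range(rounds // 2):          # one column + one diagonal round
        for a, b, c, d in COLUMNS + DIAGONALS:
            quarter_round(s, a, b, c, d)
            calls += 1
    return s, calls

out, quarter_rounds = chacha_permutation(range(16))
```

Twelve rounds of four quarter-rounds each give the 48 quarter-rounds mentioned above. The 16-word list corresponds to the 512-bit state that the assembly implementation tries to keep in registers.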

Performance
- Works in 16 KB of RAM
  - Uses less than 7 KB
- Recall: 32 MHz clock frequency
- Key generation: 28 205 671 cycles (0.88 seconds)
- Signing: 589 018 151 cycles (18.41 seconds)
- Verification: 16 414 251 cycles (0.51 seconds)
  - (of which approx. 10M spent communicating)
- Note: slightly improved since proceedings version
- On 4-core Haswell: "[..] signs hundreds of messages per second."
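
The timings in seconds follow directly from the cycle counts and the 32 MHz clock. A short check, with variable names chosen for this sketch:

```python
CLOCK_HZ = 32_000_000  # 32 MHz, as stated on the platform slide

sphincs_cycles = {
    "key generation": 28_205_671,
    "signing": 589_018_151,
    "verification": 16_414_251,
}

def seconds(cycles: int, clock_hz: int = CLOCK_HZ) -> float:
    # Wall-clock time is simply cycles divided by clock frequency.
    return cycles / clock_hz

sphincs_seconds = {op: round(seconds(c), 2) for op, c in sphincs_cycles.items()}
```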

Cost of the state
- Implemented XMSS^MT [HRB13], configured similarly
  - BLAKE and ChaCha primitives, 256 bit
  - Two layers, subtrees with 2^10 leaves each
- XMSS^MT: Merkle trees linked with WOTS+
- Stateful: process leaves incrementally
- BDS traversal [BDS08], store partial trees (k = 6)
- Key generation: 8 857 708 189 cycles (276.80 seconds)
- Avg. signing: 19 441 021 cycles (0.61 seconds)
- Verification: 4 961 447 cycles (0.16 seconds)
- Note: slightly improved since proceedings version
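
To put the two schemes side by side, the cycle counts reported in this talk for SPHINCS-256 and XMSS^MT can be compared directly. The dictionary names are chosen for this sketch only:

```python
# Cycle counts as reported on the two performance slides.
sphincs = {"keygen": 28_205_671, "sign": 589_018_151, "verify": 16_414_251}
xmss_mt = {"keygen": 8_857_708_189, "sign": 19_441_021, "verify": 4_961_447}

# Ratio > 1 means SPHINCS-256 is slower than XMSS^MT for that operation.
ratios = {op: sphincs[op] / xmss_mt[op] for op in sphincs}
```

Signing comes out at roughly 30 times the XMSS^MT cost and verification at roughly 3 times, while key generation is more than 300 times cheaper for SPHINCS-256.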

Conclusions
- Stateless is expensive, but not prohibitively so
  - Signing 30x as expensive as XMSS^MT
  - Verification similar to XMSS^MT
  - (Key generation much cheaper)
- Feasible on limited platforms
  - Verification is practical
  - Non-interactive signatures (high latency)
- Further algorithmic improvements desirable
- Code is available (public domain): https://joostrijneveld.nl/papers/armedsphincs/

References
[BHHLNPSW15] Daniel J. Bernstein, Daira Hopwood, Andreas Hülsing, Tanja Lange, Ruben Niederhagen, Louiza Papachristodoulou, Peter Schwabe and Zooko Wilcox-O'Hearn.
  SPHINCS: Stateless, practical, hash-based, incredibly nice cryptographic signatures.
  In Marc Fischlin and Elisabeth Oswald, editors, Advances in Cryptology – EUROCRYPT 2015, volume 9056 of LNCS, pages 368-397. Springer, 2015.
[Rom90] John Rompel.
  One-way functions are necessary and sufficient for secure signatures.
  In Proceedings of the twenty-second annual ACM symposium on theory of computing, pages 387-394. ACM, 1990.
[Mer90] Ralph Merkle.
  A certified digital signature.
  In Gilles Brassard, editor, Advances in Cryptology – CRYPTO '89, volume 435 of LNCS, pages 218-238. Springer-Verlag, 1990.
[Hul13] Andreas Hülsing.
  W-OTS+ – shorter signatures for hash-based signature schemes.
  In Amr Youssef, Abderrahmane Nitaj and Aboul-Ella Hassanien, editors, Progress in Cryptology – AFRICACRYPT 2013, volume 7918 of LNCS, pages 173-188. Springer, 2013.
[ANWW13] Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O'Hearn and Christian Winnerlein.
  BLAKE2: Simpler, smaller, fast as MD5.
  In Michael J. Jacobson Jr., Michael E. Locasto, Payman Mohassel and Reihaneh Safavi-Naini, editors, Applied Cryptography and Network Security, volume 7954 of LNCS, pages 119-135. Springer, 2013.
[Ber08] Daniel J. Bernstein.
  ChaCha, a variant of Salsa20.
  SASC 2008: The State of the Art of Stream Ciphers, 2008.
[Gol87] Oded Goldreich.
  Two remarks concerning the Goldwasser-Micali-Rivest signature scheme.
  In Andrew M. Odlyzko, editor, Advances in Cryptology – CRYPTO '86, volume 263 of LNCS, pages 104-110. Springer-Verlag, 1987.
[HRB13] Andreas Hülsing, Lea Rausch and Johannes Buchmann.
  Optimal Parameters for XMSS^MT.
  In Alfredo Cuzzocrea, Christian Kittl, Dimitris E. Simos, Edgar Weippl and Lida Xu, editors, Security Engineering and Intelligence Informatics, volume 8128 of LNCS, pages 194-208. Springer, 2013.
[BDS08] Johannes Buchmann, Erik Dahmen and Michael Schneider.
  Merkle tree traversal revisited.
  In Johannes Buchmann and Jintai Ding, editors, Post-Quantum Cryptography, volume 5299 of LNCS, pages 63-78. Springer, 2008.