Ingrid Verbauwhede, K.U.Leuven - COSIC 1€¦ · Ingrid Verbauwhede, K.U.Leuven - COSIC 10 KUL...
Transcript of Ingrid Verbauwhede, K.U.Leuven - COSIC 1€¦ · Ingrid Verbauwhede, K.U.Leuven - COSIC 10 KUL...
Ingrid Verbauwhede, K.U.Leuven - COSIC 1
KUL - COSIC Summer School- 1 Šibenik Croatia, June 2016
Hardware Design for Cryptographers
Ingrid Verbauwhede
ingrid.verbauwhede-at-esat.kuleuven.be
KU Leuven, ESAT- COSIC
Computer Security and Industrial Cryptography
Acknowledgements:
Current and former PhD students
KUL - COSIC Summer School- 2 Šibenik Croatia, June 2016
Outline
• Given: cryptographic algorithm
• Request: “efficient” hardware design
• AND “secure” hardware
• This lecture about Efficiency
• The cost of side-channel resistance is addressed
in other lectures
KUL - COSIC Summer School- 3 Šibenik Croatia, June 2016
Design Parameters
Embedded security:
Area, delay, power, energy,
physical security
KUL - COSIC Summer School- 4 Šibenik Croatia, June 2016
Embedded crypto everywhere
IMEC: Human++ project
Ingrid Verbauwhede, K.U.Leuven - COSIC 2
KUL - COSIC Summer School- 5 Šibenik Croatia, June 2016
5
[Source photograph: J. Rabaey: A Brand New Wireless Day]
[Terabits,
Long lasting
Security]
[Lightweight
HW entangled]
[Mbits
Security]
Swarm at the edge of the cloud
KUL - COSIC Summer School- 6 Šibenik Croatia, June 2016
Cyber-physical system
IMEC: NERF - brain stimulant
Deep Brain stimulation
[Sources: J. Rabaey, National Institutes of Health, Neurology journal]
“Networked embedded devices interacting with the environment” E.A. Lee after H. Gill (NSF)
KUL - COSIC Summer School- 7 Šibenik Croatia, June 2016
Embedded Security
Old Model (simplified view): -Attack on channel between communicating parties
-Encryption and cryptographic operations in black boxes
-Protection by strong mathematic algorithms and protocols
KUL - COSIC Summer School- 8 Šibenik Croatia, June 2016
Embedded Security
New Model (also simplified view): -Attack channel and endpoints
-Encryption and cryptographic operations in gray boxes
-Protection by strong mathematic algorithms and protocols
-Protection by secure implementation
Need secure implementations not only algorithms
Ingrid Verbauwhede, K.U.Leuven - COSIC 3
KUL - COSIC Summer School- 9 Šibenik Croatia, June 2016
Embedded Security
NEED BOTH
• Efficient, light-weight Implementation – Within power, area, timing budgets
– Public key: 1024 bits RSA on 8 bit mC and 100 mW
– Public key on a passive RFID tag
• Trustworthy implementation – Resistant to attacks
– Active attacks: probing, power glitches, JTAG scan chain
– Passive attacks: side channel attacks, including power, timing and electromagnetic leaks
KUL - COSIC Summer School- 10 Šibenik Croatia, June 2016
Cost definition
• Area
• Time: throughput versus latency
• Power, Energy
• Physical Security
• NRE (Non Recurring Engineering) cost
KUL - COSIC Summer School- 11 Šibenik Croatia, June 2016
AREA
KUL - COSIC Summer School- 12 Šibenik Croatia, June 2016
Area
• ASIC = Application Specific Integrated Circuit
– Gate count
– Unit = NAND gate = 4 transistors
• FPGA = Field Programmable Gate Array
– Unit is LUT, flip-flops
• Embedded micro-controllers
– Memory size = program size + data size
Ingrid Verbauwhede, K.U.Leuven - COSIC 4
KUL - COSIC Summer School- 13 Šibenik Croatia, June 2016
One Standard cell NAND gate
B
VDD
A
KUL - COSIC Summer School- 14 Šibenik Croatia, June 2016
Two Inverters
Connect in Metal
Share power and ground
Abut cells
VDD
Slide courtesy: J. Rabaey
KUL - COSIC Summer School- 15 Šibenik Croatia, June 2016
Standard Cell Zoom In
layout
vdd
vss
KUL - COSIC Summer School- 16 Šibenik Croatia, June 2016
More levels of metal: top levels not shown
Ingrid Verbauwhede, K.U.Leuven - COSIC 5
KUL - COSIC Summer School- 17 Šibenik Croatia, June 2016
TIME
Clock frequency versus sample frequency
Throughput versus latency
REAL-TIME
KUL - COSIC Summer School- 18 Šibenik Croatia, June 2016
Real-time, throughput, latency
• Throughput = associated with application
– Amount of data processed per time unit
– Video: Gbits/sec, Internet: Gpackets/sec
– Real-time sample rate: HW has to work as fast as
application dictates
• Latency = associated with application
– Delay from input to output
– Measure of reaction speed or turn-around time
– Low latency: brakes of a car, memory encryption
• High throughput and low latency don’t go
together
KUL - COSIC Summer School- 19 Šibenik Croatia, June 2016
Clock Frequency
• Extremely high throughput (Radar or fiber optics)
• One operator (= hardware unit, e.g. adder, shifter, register)
• for each operation (= algorithmic, e.g. addition, multiplication, delay)
clock frequency = sample frequency
• Most designs: time multiplexing
clock frequency = sample frequency
clock frequency
sample frequency = number of clock cycles available for the job
Clock frequency is a property of the hardware!
= 1 / max (longest combinatorial path)
= 1 / (critical path)
KUL - COSIC Summer School- 20 Šibenik Croatia, June 2016
Example: AES variations
• There is only one AES 128 algorithm
• There are multiple AES hardware
implementation options.
• Basic operatIONS:
– Byte-sub: non-linear operation on every byte
– Shift-row: Circular shifting of bytes in each row
– Mix-column: multiplying the round data with a fixed
polynomial
– Add-key: XORing the round data and round key
Ingrid Verbauwhede, K.U.Leuven - COSIC 6
KUL - COSIC Summer School- 21 Šibenik Croatia, June 2016
AES sequential
KUL - COSIC Summer School- 22 Šibenik Croatia, June 2016
AES: one cycle per round
• Sample frequency < clock frequency
• Example: clock is 200 MHz, sample rate is
10MHz, 128 bits per sample
• Requested throughput: min 1.28 Gbits/sec
• Puts HW limit: one AES in 20 clock cycles or less
Solution:
Parallel datapath, sequential execution
10 Msamples/s * 128 bits/sample
= 1.28 Gbits/s
KUL - COSIC Summer School- 23 Šibenik Croatia, June 2016
AES: one cycle per round Solution:
• One 128 bit datapath
• 11 clock cycles to finish AES
(plus load and store clock cycle)
• 200 MHz clock
Throughput: ??
Latency: ??
16 SBOX
Shift Row
Mix column
Key Addition
REGISTER
128
KUL - COSIC Summer School- 24 Šibenik Croatia, June 2016
AES: one cycle per round Solution:
• One 128 bit datapath
• 11 clock cycles to finish AES
(plus load and store clock cycle)
• 200 MHz clock
Throughput: ??
Latency: ??
16 SBOX
Shift Row
Mix column
Key Addition
REGISTER
128
200 Mcycles/s * 128 bits/sample
/ 11 cycles/sample
= 2.33 Gbits/s
11 cycles/sample / 200 Mcycles/s
= 55 ns/sample
Ingrid Verbauwhede, K.U.Leuven - COSIC 7
KUL - COSIC Summer School- 25 Šibenik Croatia, June 2016
Efficiency - adapt HW platform to application
Simple example: Key Schedule for round key
Two options:
• On the “fly” = just in time processing
• Pre-compute and store in memory
Memory AES
Key
Schedule
Key
Schedule
AES Typical for Hardware
1 cycle/round
Typical for Software
Minimum around 10 cycles/byte + bandwidth
128
1408
KUL - COSIC Summer School- 26 Šibenik Croatia, June 2016
Slide credit: Alireza Hodjat
AES Core
• Key schedule in parallel with data path
• 128-bit data and key
• One round implementation with minimum delay
– One cycle per round
– Direct implementation of Byte Substitution phase
• 11 cycles one encryption
Byte Substitution Mix Column Shift Row Key Addition
XO
R M
ix
Co
l
3
Mix
Co
l
4
Mix
Co
l
2
Mix
Co
l
1
S
S
S
…
S
Input
Data
+ +
+
S
S
S
S
+
+
+
+
+
+ +
+
+ +
+
+ +
+
Input
Key
Key scheduling datapath
Round Key
KUL - COSIC Summer School- 27 Šibenik Croatia, June 2016
AES compact: one SBOX
• Goal: low area
• Exercise: (clock is 200MHz)
– Min number of clockcycles?
– Throughput ?
– Latency?
1 SBOX
Shift
Row
Mix
column
Key
Addition
128 REGISTER
8
• Overhead: muxes and control logic KUL - COSIC Summer School- 28 Šibenik Croatia, June 2016
AES compact: one SBOX
• Goal: low area
• Exercise: (clock is 200MHz)
– Min number of clockcycles?
– Throughput ?
– Latency?
1 SBOX
Shift
Row
Mix
column
Key
Addition
128 REGISTER
8
• Overhead: muxes and control logic
128 bits/round * 11 rounds/sample
/ 8 bits/cycle = 176 cycles/sample
128 bits/sample * 200 Mcycles/s
/ 176 cycles/sample = 145 Mbits/s
176 cycles/sample / 200 Mcycles/s
= 880 ns/sample
Ingrid Verbauwhede, K.U.Leuven - COSIC 8
KUL - COSIC Summer School- 29 Šibenik Croatia, June 2016
AES parallel and pipeline
KUL - COSIC Summer School- 30 Šibenik Croatia, June 2016
AES parallel = loop unrolling
• Unroll twice, now 6 CC
• IF same clock THEN
• double throughput
• latency: ?
• (overhead ignored)
= expensive
16 SBOX
Shift Row
Mix column
Key Addition
REGISTER
128
16 SBOX
Shift Row
Mix column
Key Addition
KUL - COSIC Summer School- 31 Šibenik Croatia, June 2016
AES parallel = loop unrolling
• Unroll twice, now 6 CC
• IF same clock THEN
• double throughput
• latency: ?
• (overhead ignored)
= expensive
16 SBOX
Shift Row
Mix column
Key Addition
REGISTER
128
16 SBOX
Shift Row
Mix column
Key Addition
128 bits/sample * 200 Mcycles/s
/ 6 cycles/sample = 4.27 Gbits/s
6 cycles/sample / 200 Mcycles/s
= 30 ns/sample
KUL - COSIC Summer School- 32 Šibenik Croatia, June 2016
AES unroll & pipeline
1 clock cycle per AES
11 data samples in pipeline
200 MHz clock
Throughput: ??
Latency: ??
16 SBOX
Shift Row
Mix column
Key Addition
REGISTER
16 SBOX
Shift Row
Mix column
Key Addition
16 SBOX
Shift Row
Mix column
Key Addition
REGISTER
REGISTER
. . .
Ingrid Verbauwhede, K.U.Leuven - COSIC 9
KUL - COSIC Summer School- 33 Šibenik Croatia, June 2016
AES unroll & pipeline
1 clock cycle per AES
11 data samples in pipeline
200 MHz clock
Throughput: ??
Latency: ??
16 SBOX
Shift Row
Mix column
Key Addition
REGISTER
16 SBOX
Shift Row
Mix column
Key Addition
16 SBOX
Shift Row
Mix column
Key Addition
REGISTER
REGISTER
. . .
128 bits/sample * 200 Mcycles/s
/ 1 cycles/sample = 25.6 Gbits/s
11 cycles/sample / 200 Mcycles/s
= 55 ns
KUL - COSIC Summer School- 34 Šibenik Croatia, June 2016
AES loop enroll AND pipelined
In combination
with AES key
schedule
Round
10
Add
Key … Round
1
Round Key Round Data
S
S S S
S S …
+
+
+
+
+
+
+
… Mix
Column
1
Mix
Column
4
…
XOR
S S S S
+ +
+ +
Slide credit: Alireza Hodjat
KUL - COSIC Summer School- 35 Šibenik Croatia, June 2016
Non feedback modes of operation!!
• Example counter mode of operation
• Does not work with CBC or other feedback
modes of operation
AES
+
1
Key
Initial Seed
Seed Register Counter Register
Pseudo-random number
to be used as key-stream
KUL - COSIC Summer School- 36 Šibenik Croatia, June 2016
POWER
Ingrid Verbauwhede, K.U.Leuven - COSIC 10
KUL - COSIC Summer School- 37 Šibenik Croatia, June 2016
Power density
4004 8008
8080 8085
8086
286 386
486 Pentium® proc
P6
1
10
100
1000
10000
1970 1980 1990 2000 2010 Year
Po
wer
Den
sit
y (
W/c
m2)
Hot Plate
Nuclear
Reactor
Rocket
Nozzle
Power density too high to keep junctions at low temp
For desktops and servers it is a cooling problem
Courtesy, Intel
KUL - COSIC Summer School- 38 Šibenik Croatia, June 2016
Cooling
• Cooling for the very small and the very large
Deep Brain stimulation
[Sources: J. Rabaey, National Institutes of Health, Neurology journal] [San Francisco Chronicle online, source Google]
KUL - COSIC Summer School- 39 Šibenik Croatia, June 2016
Power and Energy are not the same!
• Power = P = I x V (current x voltage) (= Watt)
– instantaneous
– Typically checked for cooling or for peak
performance
• Energy = Power x execution time (= Joule)
– Battery content is expressed in Joules
– Gives idea of how much Joules to get the job done
Low power processor ≠ low energy solution !
KUL - COSIC Summer School- 40 Šibenik Croatia, June 2016
Heat and parallelism
memory processor
M P
C Pmono = CV2f (Watt)
Power
(Heat)
C/4 C/4 C/4 C/4
M/4 P/4 M/4 P/4 M/4 P/4 M/4 P/4 4 (C/4)V2(f/4) = Pmono/4
but since f ~ V
can be even Pmono/43
Reduce power = reduce WASTE !!
TREND: MULTI-CORE!!
Ingrid Verbauwhede, K.U.Leuven - COSIC 11
KUL - COSIC Summer School- 41 Šibenik Croatia, June 2016
Match between algorithm & architecture
Close the gap:
• Dedicated HW: ASIC
• Programmable HW: FPGA
• Custom instructions, hand-
coded assembly
• Compiled code
• JAVA on virtual machine,
compiled on a real machine
Power Cost
???
General
Purpose
Fixed
Platform
Application
ASIC
KUL - COSIC Summer School- 42 Šibenik Croatia, June 2016
[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110
[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[4] gcc, 1 mW/MHz @ 120 Mhz Sparc – assumes 0.25 u CMOS
[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS
648 Mbits/sec Asm Pentium III [3] 41.4 W 0.015 (1/800)
Java [5] Emb. Sparc 450 bits/sec 120 mW 0.0000037 (1/3.000.000)
C Emb. Sparc [4] 133 Kbits/sec 0.0011 (1/10.000)
350 mW
Power
1.32 Gbit/sec FPGA [1]
11 (1/1) 3.84 Gbits/sec 0.18mm CMOS
Figure of Merit
(Gb/s/W)
Throughput AES 128bit key
128bit data
490 mW 2.7 (1/4)
120 mW
Energy – flexibility trade-off
ASM StrongARM [2] 240 mW 0.13 (1/85) 31 Mbit/sec
Intel ISA for AES 32 Gbit/sec 95 W 0.34 (1/33)
KUL - COSIC Summer School- 43 Šibenik Croatia, June 2016
Communication & computation
Back of the envelope
KUL - COSIC Summer School- 44 Šibenik Croatia, June 2016
[source: G. Dolmans IMEC NL]
Red: Transmit Energy - nJ/ bit
Blue: Transmit Power - mWatt
Green: Transmit Energy - nJ/bit/m
Ingrid Verbauwhede, K.U.Leuven - COSIC 12
KUL - COSIC Summer School- 45 Šibenik Croatia, June 2016
1 micro Joule
Transmission:
• 300 bits in BAN
• 11 bits Bluetooth
• 3 bits Zigbee
Encryption:
• 11000 bits AES
• 500 bits SHA3 hash
• 1/5 of one point
multiplication
Ignores receive buget (= listening)
Ignores “overhead” of adding authentication bits, etc.
KUL - COSIC Summer School- 46 Šibenik Croatia, June 2016
46
Mutual Authentication
Symmetric shared key
Reader: K Tag: K
T1
rA
T1← EK(rA//rB)
rB
Check response
= decryption
check rB
compute rA
rB
T2← EK(rB//rA) T2
Tag: two AES encryptions, one transmission over Bluetooth
128 bit Bluetooth + 2 x AES 10 microJoule
KUL - COSIC Summer School- 47 Šibenik Croatia, June 2016
ECC based randomized Schnorr
Reader: y, X = xP Tag: x, Y =yP
T1, T2
r1,r2,
T1 = r1P, T2 = r2Y
v = r1+r2+cx
c c
c-1[vP - T1- y-1T2] =? X
v
Tag: two point multiplications, two transmissions over BAN
Crypto dominates 4 microJoule + 1 microJoule
KUL - COSIC Summer School- 48 Šibenik Croatia, June 2016
Conclusions
• Time has many faces: real-time, throughput, latency, clock frequency, critical path, …
• Power is not same as energy !
• Glitches = wasteful transitions = only occur in static CMOS, not possible in dynamic logic
• Energy – flexibility trade-off
• Depending on the algorithm/protocol and the wireless communication standard, either the crypto computation or the transmission dominates in terms of energy