SCAs against Embedded Crypto Devices · attackschool.di.uminho.pt/slides/slides_fxs1.pdf · 2014. 11....

UCL Crypto Group, Microelectronics Laboratory · SCAs against Embedded Crypto Devices - L1 · F.-X. Standaert, UCL Crypto Group, Université catholique de Louvain · Lecture 1 - Hardware Implementations


SCAs against Embedded Crypto Devices

F.-X. Standaert

UCL Crypto Group, Université catholique de Louvain

Lecture 1 - Hardware Implementations


Outline

- Different types of computing devices
- Two key concepts
- Hardware performance indicators
- Implementation tradeoffs
- Technology scaling
- Design tradeoffs
- FPGAs
- Application to block ciphers
- Further readings


Different types of computing devices

- General purpose computers (e.g. microprocessors)
  - Software-programmed
- Reconfigurable devices (e.g. FPGAs)
- Application Specific Integrated Circuits (e.g. AES)
  - Hard-coded
- Tradeoff: flexibility vs. performance


Sequential logic

- 1 cycle: read from memory, operate, store in memory
- The operation delay $T_{op}$ must be larger than the critical path delay $T_{ph}$ (in seconds)
- Operation frequency $f_{op} = 1/T_{op}$ (in Hz)
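As a small worked example of the relation between the critical path and the clock, the sketch below computes the highest usable operation frequency for a given critical path delay. The function name and the 20 ns value are hypothetical and chosen only for illustration.

```python
def max_operation_frequency(critical_path_s: float) -> float:
    """Upper bound on f_op (in Hz) for a critical path delay T_ph (in seconds)."""
    if critical_path_s <= 0:
        raise ValueError("critical path delay must be positive")
    # T_op must be at least T_ph, so f_op = 1 / T_op is at most 1 / T_ph.
    return 1.0 / critical_path_s


# Hypothetical example: a 20 ns critical path.
f_max = max_operation_frequency(20e-9)
print(f"f_op <= {f_max / 1e6:.1f} MHz")
```

In practice the clock period is chosen with some margin above the critical path, so the value returned here is a ceiling rather than a recommended setting.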


Abstraction levels (for memory & operations)


Hardware performance indicators

- Hardware cost (in gates, transistors or circuit size)
- Operation frequency (in Hz)
- Data throughput (in bit/sec)
- Data latency (in clock cycles)
- Power and energy (in Watts and Joules)
  - Not equivalent, e.g.
    - Power matters for RFID devices
    - Energy matters for battery-supplied devices


Implementation tradeoffs

- $T_{ph} \propto L_D \cdot \frac{C_L \cdot V_{dd}}{I_{on}}$, with:
  - $L_D$ the operation logic depth (in gates)
  - $C_L$ the load capacitance (in Farad)
  - $V_{dd}$ the circuit supply voltage
  - $I_{on}$ the MOSFET drain current in ON state

“$T_{ph}$ decreases with larger $V_{dd}$”


Implementation tradeoffs (II)

- Sources of power and energy consumption
  - $P_{tot} = P_{dyn} + P_{stat}$
  - $P_{dyn} \propto N_{gates} \cdot C_L \cdot V_{dd}^2 \cdot f_{op} \cdot \alpha \, (1 + \beta_{sc})$
    - $\alpha$: activity factor / $\beta_{sc}$: short circuits
  - $P_{stat} \propto I_{leak} \cdot V_{dd}$
    - with $I_{leak}$ increasing with smaller $V_{dd}$

“Minimum energy best trades $P_{dyn}$ and $P_{stat}$” (here with $T_{op} = T_{ph}$)

⇒ ∃ frequency/energy tradeoff
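The frequency/energy tradeoff can be illustrated with a toy numerical model. The sketch below is not taken from the slides: it assumes an alpha-power law $I_{on} \propto (V_{dd} - V_t)^{a}$ for the drain current, holds the leakage current constant as a simplification, and uses arbitrary normalized units. All function names and parameter values (`vt`, `alpha`, `i_leak`) are hypothetical.

```python
def critical_path_delay(vdd, vt=0.3, alpha=1.5, logic_depth=1.0, c_load=1.0):
    """T_ph ~ L_D * C_L * V_dd / I_on, with I_on ~ (V_dd - V_t)^alpha (assumed law)."""
    if vdd <= vt:
        raise ValueError("this toy model only covers V_dd > V_t")
    i_on = (vdd - vt) ** alpha
    return logic_depth * c_load * vdd / i_on


def energy_per_operation(vdd, i_leak=0.05, c_load=1.0):
    """Dynamic plus static energy of one operation, running at T_op = T_ph."""
    e_dyn = c_load * vdd ** 2  # P_dyn * T_op ~ C_L * V_dd^2
    # P_stat * T_op ~ I_leak * V_dd * T_ph (I_leak held constant: a simplification)
    e_stat = i_leak * vdd * critical_path_delay(vdd, c_load=c_load)
    return e_dyn + e_stat


def minimum_energy_point(v_min=0.35, v_max=1.2, steps=86):
    """Grid search for the supply voltage that minimizes the energy per operation."""
    grid = [v_min + i * (v_max - v_min) / (steps - 1) for i in range(steps)]
    return min(grid, key=energy_per_operation)


v_best = minimum_energy_point()
print(f"minimum-energy V_dd in this toy model: {v_best:.2f}")
print(f"delay at that point: {critical_path_delay(v_best):.2f} (normalized)")
```

In this model, lowering $V_{dd}$ reduces the dynamic energy but lengthens the critical path, so leakage is integrated over a longer operation. The minimum sits where the two effects balance, at a supply voltage well below the one giving the highest frequency.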


Technology scaling

- $P_{dyn}$ dominates in old technologies (down to 0.1 µm)
- $P_{stat}$ becomes significant in nanoscale devices
- Inter-device variability also increases with scaling!


Design tradeoffs

- Resource sharing, e.g. with the AES ByteSub
  - Low cost design: 1 S-box, 16 cycles
  - Fast design: 16 S-boxes, 1 cycle
- Low cost implies more control ⇒ less efficient
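The two design points above are the extremes of a range of sharing factors. The following sketch enumerates the intermediate options for the 16-byte AES state. The function name is hypothetical, and the control overhead mentioned in the last bullet is deliberately not modeled.

```python
STATE_BYTES = 16  # the 128-bit AES state holds 16 bytes, one S-box lookup each


def bytesub_cycles(num_sboxes: int) -> int:
    """Cycles needed for one ByteSub layer when num_sboxes S-boxes are shared."""
    if num_sboxes < 1 or STATE_BYTES % num_sboxes != 0:
        raise ValueError("num_sboxes must divide the 16 state bytes")
    return STATE_BYTES // num_sboxes


for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} S-box(es) -> {bytesub_cycles(n):2d} cycle(s)")
```

The product of S-boxes and cycles stays at 16 in this idealized count. A real shared design also pays for multiplexers and a controller, which is why the low cost option is the less efficient one.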


Design tradeoffs (II)

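- Inner pipelining, e.g. with the AES round

Ideally: $f_{op} \times 2$ (usually worse in practice)

Latency: 11 → 22 (cycles)

Throughput? From (128 bits / 11 cycles) · $f_{op}$ to (256 bits / 22 cycles) · $f_{op}$: the number of bits per cycle is unchanged, but $f_{op}$ ideally doubles ⇒ ideally ×2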
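The throughput arithmetic for inner pipelining can be checked with a few lines of code. This is a sketch under the ideal assumption that splitting the round into two stages exactly doubles $f_{op}$. The function name and the 100 MHz base frequency are hypothetical.

```python
def throughput_bits_per_s(block_bits, blocks_in_flight, latency_cycles, f_op_hz):
    """Bits produced per second when several blocks share the pipeline."""
    return block_bits * blocks_in_flight / latency_cycles * f_op_hz


F_OP = 100e6  # hypothetical base operation frequency (Hz)

# Loop architecture without pipeline: one 128-bit block every 11 cycles.
base = throughput_bits_per_s(128, 1, 11, F_OP)

# Two pipeline stages: two blocks in flight, 22 cycles of latency, f_op doubled.
piped = throughput_bits_per_s(128, 2, 22, 2 * F_OP)

print(f"without pipeline: {base / 1e9:.3f} Gbit/sec")
print(f"with pipeline:    {piped / 1e9:.3f} Gbit/sec")
print(f"gain: x{piped / base:.1f}")
```

The gain comes only from the higher frequency. If the two stages are unbalanced, or if register overhead is significant, the real factor falls below 2.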



Design tradeoffs (III)

- Further improvements of the throughput ($f_{op}$ fixed)

Parallelism (left) less efficient than outer pipelining (right)



FPGAs

- “Sea” of programmable logic blocks
- Connected with programmable routing
- Functionality determined by configuration bits
- Different technologies
  - 0.18 µm → 45 nm
- Several manufacturers


FPGAs (II)


FPGAs (III)

- Logic blocks
  - From 3-input Look-Up Tables . . . to 8-bit Arithmetic and Logic Units
  - The granularity of the device influences both the design performance and the configuration time
- Routing blocks
  - Structured according to the interconnect length
  - Major impact on final performance
- Embedded blocks
  - Memories, multipliers, . . .


(How to use) FPGAs (IV)

- Compared to ASICs: fabrication + packaging are replaced by configuration (i.e. sending a programming file to the chip to determine the “gates” functionality)


FPGAs (V)

- Example: Xilinx logic block


Application to block ciphers

- Target FPGA 1 has logic blocks (LB1) made of:
  - Two 4-input LUTs
  - One 1-bit MUX to combine the LUTs
  - Two registers
- Target FPGA 2 has logic blocks (LB2) made of:
  - Four 6-input LUTs
  - Three 1-bit MUXes to combine the LUTs
  - Four registers
- Embedded memory, with each block (MB) made of:
  - 4096-bit RAM memories
  - Dual-ported (i.e. 2 R/W operations per cycle)
  - Configurable (4096 × 1, 2048 × 2, . . . )
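To make the role of the MUX in a logic block concrete, the sketch below models an LB1-style block in software: two 4-input LUTs whose outputs are selected by a fifth input. It assumes that both LUTs see the same four inputs and that the MUX select is an independent signal, which the slide does not state explicitly. All function names are hypothetical.

```python
from itertools import product


def make_lut(func, num_inputs):
    """Configuration bits of a LUT: the truth table of func over num_inputs bits."""
    return [
        func(*[(index >> pos) & 1 for pos in range(num_inputs)]) & 1
        for index in range(1 << num_inputs)
    ]


def lut_read(table, bits):
    """Look up the output bit addressed by the input bits (bit 0 first)."""
    index = sum(bit << pos for pos, bit in enumerate(bits))
    return table[index]


def map_5input_function(func):
    """Shannon expansion on the fifth input yields the two 4-input LUT contents."""
    lut0 = make_lut(lambda a, b, c, d: func(a, b, c, d, 0), 4)
    lut1 = make_lut(lambda a, b, c, d: func(a, b, c, d, 1), 4)
    return lut0, lut1


def lb1_eval(lut0, lut1, bits5):
    """Both LUTs read bits5[:4]; bits5[4] drives the MUX select (assumed wiring)."""
    shared, select = bits5[:4], bits5[4]
    return lut_read(lut1, shared) if select else lut_read(lut0, shared)


def majority5(a, b, c, d, e):
    """Example 5-input Boolean function: majority vote."""
    return int(a + b + c + d + e >= 3)


lut0, lut1 = map_5input_function(majority5)
ok = all(lb1_eval(lut0, lut1, bits) == majority5(*bits)
         for bits in product((0, 1), repeat=5))
print("LB1 model reproduces the 5-input function:", ok)
```

Each 4-input LUT stores 16 configuration bits, so under these assumptions one such block can hold any 5-input Boolean function or two independent 4-input ones.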


S-box implementations

- “Minimum memory” cost (in bits) of S1/S2? { . . . }
- Cost of S1/S2 in LB1/LB2? { . . . }
- Would you use the memory to implement S1/S2?


Block cipher design

- Consider an AES-like cipher with the following round:
- To be implemented in FPGA 1 with S-box S2
- With MixColumn in 256 LUTs and logic depth 2 LUTs
- And the full cipher iterating 11 rounds


Block cipher design

- What is the cost of one round in LUTs?
- Design and evaluate the cost (in LUTs and regs) of:
  - A 1-round loop architecture without pipeline
  - A 1-round loop architecture with maximum pipeline
- What is the latency (in cycles) of these architectures?
- Assume $T_{LUT}$ = 10 ns, what is the throughput achieved by these architectures (in bit/sec)?
  - Is this assumption realistic (physically speaking)?
- “Ideally”, what would happen if we move to a 2-round loop architecture, or a 32-bit loop architecture?
  - { . . . }


Examples

- FPGA implementations of the AES Rijndael

Index  E,D?    Key Sched.   Feedback?  Device      Architecture
1.     E only  on-the-fly   no         Virtex-E    128-bit unrolled
2.     E only  on-the-fly   no         Virtex-E    128-bit loop
3.     E/D     precomputed  yes        Virtex-II   32-bit loop
4.     E/D     precomputed  yes        Spartan-II  8-bit loop
5.     E/D     precomputed  yes        Spartan-II  PicoBlaze

Index  LUTs  Regs.  Slices  RAMBs  Freq.    Throughput
1.     3516  3840   2784    100    92 MHz   11.7 Gbit/sec
2.     3846  2517   2257    0      169 MHz  2 Gbit/sec
3.     288   113    146     3      123 MHz  358 Mbit/sec
4.     -     -      124     2      67 MHz   2.2 Mbit/sec
5.     -     -      119     2      90 MHz   710 Kbit/sec
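The reported throughputs of the first three designs can be related to their clock frequencies with the usual formula, throughput = block size × frequency / cycles per block. The cycle counts used below (1, 11 and 44) are not given in the table. They are assumptions chosen because they are plausible for the named architectures and reproduce the reported figures to within a few percent.

```python
def block_cipher_throughput(f_hz, cycles_per_block, block_bits=128):
    """Throughput in bit/sec for a design finishing one block every cycles_per_block."""
    return block_bits * f_hz / cycles_per_block


# (index, frequency in Hz, ASSUMED cycles per block, reported throughput in bit/sec)
designs = [
    (1, 92e6, 1, 11.7e9),    # unrolled: assumed fully pipelined, one block per cycle
    (2, 169e6, 11, 2e9),     # 128-bit loop: assumed 11 cycles per block
    (3, 123e6, 44, 358e6),   # 32-bit loop: assumed 4 x 11 cycles per block
]

for index, f_hz, cycles, reported in designs:
    estimate = block_cipher_throughput(f_hz, cycles)
    deviation = abs(estimate - reported) / reported
    print(f"design {index}: estimated {estimate / 1e6:8.1f} Mbit/sec, "
          f"reported {reported / 1e6:8.1f} Mbit/sec, deviation {deviation:.1%}")
```

Designs 4 and 5 are left out because their cycle counts depend on the 8-bit datapath and the PicoBlaze program, which the table does not describe.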


Summarizing

- Specialized hardware implementations (ASICs, FPGAs) can be used to reach high performance
- Many different metrics exist (cost, speed, . . . )
- Hardware design optimization (e.g. sharing, pipelining) depends on algorithmic features
- Technology scaling can have a high impact too!


Further readings

- International Technology Roadmap for Semiconductors, http://www.itrs.net/
- F. Rodriguez-Henriquez, F. Saqib, N.A. Diaz, C.K. Koc, Cryptographic Algorithms on Reconfigurable Hardware, Springer, 2007.
- H. Kaeslin, Digital Integrated Circuit Design, Cambridge University Press, 2008.
- J.M. Rabaey, Digital Integrated Circuits: a Design Perspective, second edition, Prentice Hall, 2003.


Thanks