One Attack to Rule Them All: Collision Timing Attack ... · One Attack to Rule Them All: Collision...

1

One Attack to Rule Them All:Collision Timing Attack versus 42 AES ASIC

CoresAmir Moradi, Oliver Mischke, Christof Paar, Fellow, IEEE

Abstract—When complex functions, e.g., substitution boxes of block ciphers, are realized in hardware, timing attributes of theunderlying combinational circuit depend on the input/output changes of the function. These characteristics can be exploited bythe help of a relatively new scheme called fault sensitivity analysis. A collision timing attack which exploits the data-dependenttiming characteristics of combinational circuits is demonstrated in this article. The attack is based on an also recently publishedcorrelation collision attack, which avoids the need for a hypothetical timing model for the underlying combinational circuit torecover the secret materials. The target platforms of our proposed attack are 14 AES ASIC cores of the SASEBO LSI chips inthree different process technologies, 130nm, 90nm, and 65nm. Successfully breaking all cores including the DPA-protected andfault attack protected cores indicates the strength of the attack.

Index Terms—Side-channel attack, fault attack, timing attack, fault sensitivity attack, collision attack, ASIC, AES.

F

1 INTRODUCTION

WHILE modern cryptographic algorithms canmostly be assumed secure from the mathemati-

cal point-of-view, their implementation can in generaleasily be broken by means of so-called side-channelattacks if no special countermeasures are applied.Sensitive information about the internal secret of thecryptographic device, e.g., the used encryption key ina symmetric cipher, leak through side channels. Theyet known side channels include execution time [14],power consumption [15], and electromagnetic radia-tion [10], [30]. Faulty outputs in case of intentionallyinjected faults have also been used in the activebranch of non-invasive physical attacks, e.g., differ-ential fault analysis [5].

At CHES 2010 a new fault attack called Fault Sen-sitivity Analysis (FSA) [18] was proposed, which usesthe fact that the critical path of an AES S-box imple-mentation is data dependent because of the natureof the underlying gates. Using clock glitches and asimple hypothetical (timing) model for the S-box –which can be obtained by simulation knowing thedesign details or by a profiling phase – they showed acomplete break of an AES ASIC implementation usingthe critical path details of 50 ciphertexts.

Another contribution to CHES 2010 was a collisionattack enhanced by correlation [21]. Compared toclassical power analysis attacks, its main feature isthat it does not rely on the knowledge of an under-lying (hypothetical) power model. Instead, it directlycorrelates power traces to each other and – by finding

• A. Moradi, O. Mischke and C. Paar are with the Horst Görtz Institutefor IT-Security, Ruhr University Bochum, Germany. E-mail: {moradi,mischke, cpaar}@crypto.rub.de.

colliding S-box computations – is able to recover therelation between key parts. Using such an attack acomplete break of a masked FPGA implementationof AES has been shown.

The combination and improvement of these twoideas is the main contribution of this article. Wepresent an attack that exploits the timing character-istics of AES S-boxes, but in order to recover thesecret it does not need to know specifically how thesecharacteristics are and how they relate to the inputs.Despite the similar name, the attack presented hereis different to [8], where a collision timing attack oncache misses of an embedded system is presented.The aim of our proposed attack is to exploit thetiming characteristics of combinational functions, e.g.,S-boxes implemented in ASICs, thereby recoveringthe relation between key parts and restricting the keysearch space in a way that the secret key can berevealed knowing a single plaintext-ciphertext pair.

Similarly to [18], we have chosen the SASEBO-Rboard [2] as the evaluation platform. The board canhold different ASICs, and we have analyzed theSASEBO LSI2 [3], both in 130nm and 90nm technol-ogy, as well as the SASEBO LSI3 [4] in 65nm tech-nology. Each of them contains the same 14 differentimplementations of AES, therefore the only differenceis the process technology, which has a big influence onthe timing characteristics. The implementations them-selves differ in the style of the S-box realization and inside-channel countermeasures. One of the AES imple-mentations is also specifically designed to withstandattacks using fault injection, which in theory shouldmake the implementation resistant against attacks likeFSA.

Using the attack presented here we are able to break

2

all 14 AES implementations of each LSI regardless ofthe process technology. It is noteworthy that the effortrequired to mount the attack on different cores variesbecause of their different architecture and mostly theirdifferent S-box design. Nevertheless, the proposedattack can not only extract the desired secret fromunprotected implementations, but it is especially ableto overcome all DPA and fault attack countermeasureswhich are applied in these ASIC designs. In otherwords, randomizing the computation of combina-tional circuits by means of different masking schemesdoes not prevent the relation between its timing char-acteristics and the processed data. Preventing faultyciphertexts by means of one of the most efficient faultdetection schemes, furthermore, cannot prevent anattacker who injects the faults to measures the timingcharacteristics of the circuit.

In the later parts of this article the prerequisites,including a review of fault sensitivity analysis andthe correlation collision attack, are given in Section 2.Our proposed attack, namely collision timing attack,is expressed in Section 3, and the practical evaluationresults on all 130nm, 90nm, and 65nm SASEBO LSI2and LSI3 cores are presented in Section 4. Finally,Section 5 concludes our work.

This article is extending the authors parts of a jointCHES 2011 article [22] from the authors and a teamfrom Department of Informatics, The University ofElectro-Communications, Tokyo, Japan. The indepen-dent research of both teams was merged by requestof the CHES 2011 Program Committee because ofsome overlap of used techniques. For example, bothutilize the ideas of Fault Sensitivity Analysis [18] andCorrelation Collision Attacks [21] although their finalattacks are quite different. Compared to the confer-ence article [22] this publication not only covers theattacks on some unprotected or DPA-protected coresbut especially shows how the proposed fault attackcan be applied to a very sophisticated fault protectionscheme and is finally able to break all 42 cores ofall currently available SASEBO-R LSIs. The attackdescriptions on the unprotected and DPA-protectedcores are extended as well giving more detail likefor example explaining the necessary adaptions toattack a core using the counter mode of operationand giving insights on the applicability of the schemeto decryption circuits. We also highlighted the dif-ficulties we faced during the 6 month of practicalevaluations to allow the reader to better understandthe attack and evaluate our results (see Appendix).During the review phase of this publication anotherjournal article was published by the aforementionedteam from Tokyo [17] giving more details on FSAand how to extend it in a self-template mode to alsosuccessfully attack one of the fault protected SASEBO-R LSIs. We want to stress that both works have beenperformed completely independent from each other.

2 PRELIMINARIES

This section summarizes the underlying attacks,namely the fault sensitivity analysis [18] and corre-lation collision attack [21], which are the basis to theproposed collision timing attack. We also explain howthe timing data, which is required for the attack, canbe captured and finally give some definitions whichare used in the following sections.

2.1 Fault Sensitivity AnalysisUnlike Differential Fault Attacks (DFA) [5], no faultyciphertexts are required by an FSA attack. Instead,the attack works by increasing the fault intensity untila distinguishable characteristic can be observed, e.g.,the first appearance of a faulty output. It was demon-strated that the attack is able to completely breakthe AES_PPRM1 core of the SASEBO LSI2 (130nm)using 200 faulty operations of each randomly selected50 plaintexts. It also could reveal three key bytes ofthe AES_WDDL implementation of the same ASICusing its fault sensitivity leakage obtained from 1200plaintexts.

The presented method of increasing the fault sen-sitivity in [18] is the shortening of clock glitches, i.e.,two normal clock cycles get replaced by a short anda longer one, whereby the length of the short one canbe gradually decreased until a faulty output occursor the fault becomes stable. Since the critical pathof some gates, e.g., AND and OR gates, is data de-pendent, knowing the underlying model for this datadependency helps revealing the secret. For example,by simulation it could be ascertained that the timingdelay of a PPRM S-box correlates to the Hammingweight (HW) of its input.

Since no faulty ciphertexts are required, the attackmight also be applicable to implementations whichapply DFA countermeasures. While AES_WDDLshould in theory be immune against set-up time vi-olation attacks, by creating templates with a knownkey it was shown that at least some bits correlate tothe timing delay which lead to the aforementionedrecovery of three key bytes.

Another difference to DFA attacks is that the faultdoes not need to be restricted to a small sub-space. Incontrast, by for example attacking the last round ofthe AES_PPRM1 implementation, each faulty outputbyte can be independently observed and therefore thesame complete faulty output can be used to attack allkey bytes simultaneously. On the other hand, as statedin [18], while countermeasures like masking are onlyof limited use against DFA attacks, they may have agreat impact on FSA attacks since the critical path isaffected by the random mask bits.

2.2 Correlation Collision AttackThe correlation collision attack was introduced in [21].While remaining a certain noise sensitivity, its major

3

advantage compared to classical power analysis at-tacks is that it neither relies on a hypothetical powermodel nor requires a profiling phase. By enhanc-ing linear collision attacks [7] by the methods ofcorrelation-based DPAs, it is able to overcome side-channel countermeasures as long as a minimal first-order leakage remains.

A linear collision occurs if two instances of com-binational circuits or one instance at two differentpoints in time process(es) the same value, i.e., forAES the 8-bit output and thereby the input of the S-box must be the same. In this case, it is possible todeduce the relation between the attacked key bytessince the rearranging of S(i1 ⊕ k1) = S(i2 ⊕ k2) leadsto ∆ = i1 ⊕ i2 = k1 ⊕ k2, where both i1 and i2 areknown.

The correlation collision attack on AES works simi-larly, but starts by creating sets of mean traces for eachpossible input byte if performing the attack on the firstround. To do this for two input bytes, namely i1 andi2, all traces are sorted based on the correspondinginput byte value, and traces with the same value areaveraged, thereby creating 256 different mean tracesM1(i1) and M2(i2) for each of the two input bytes.Computing the variances for each set of mean traceswill reveal the point in time where the correspondingbytes are processed by the S-box, which is necessaryto align the traces for the attack.

If the previous assumption holds true, i.e., thepower consumption of two S-box computations arehighly similar, comparing pairs of mean sets alsoshows a high similarity between certain mean traces.Therefore, when ∆ = k1 ⊕ k2 and attacking the inputbytes i1 and i2, then M1(i1) ≈ M2(i2 = i1 ⊕ ∆). Thecorrect ∆ can be found by computing the correlationbetween the two sets of mean traces for each of the 256candidates of ∆. This yields a very high correlationcoefficient since no hypothetical model is applied butinstead the averaged real power consumptions areused in correlations.

2.3 How to Measure the Timing

The focus of this work is to analyze the timingcharacteristics of combinational circuits like S-boxes.As shown in Fig. 1, when the input of a combina-tional circuit changes, its output stops toggling aftera certain time (so-called ∆t). The maximum valueof ∆t for different inputs is known as the longestcritical path of the circuit, and defines the maximumfrequency of the clock signal which triggers the flip-flops providing the input and storing the output ofthe considered combinational circuit. Timing charac-teristics of a circuit are therefore defined as a set of ∆t(T =

{∆t1,∆t2, . . . ,∆tn

}), where ∆ti is the minimum

∆t for the given input i.Let us suppose that the target combinational circuit

is a part of a bigger circuit, e.g., a co-processor, which

i:aa 00

i:aa 55

i:aa ff

Fig. 1. Dependency of the timing of combinationalcircuits to the input changes. The gray bars denote astable output signal after the S-box input has changedfrom 0xaa to 0x00, to 0x55, or to 0xff.

io

o = 16´12.6ns

=12.4ns

12.2ns

ac

aa ff

2f 0f 16

o = 0e´

o = 16´ti

Fig. 2. A simplified example on how to use clockglitches to extract a ∆ti

provides some I/O signals for communication. If theoutput signals of the target combinational circuit areaccessible from the outside – which is quite unlikely –one can easily probe the output signals for the giveninput i and measure the time when the output signalsget stable. Otherwise, if the output of the targetcombinational circuit is stored into registers whichare triggered by a clock signal that can be controlledfrom the outside, as shown in Fig. 2, one can steadilyshorten the time interval between the input transitionand the output storage (known as setup time) till anincorrect value is stored into the registers while inputi is repeatedly given to the combinational circuit. Theminimum time interval when the considered registerstores the correct value can be concluded to ∆ti.

Note that this procedure is similar to clock glitches,which are mostly used in DFA to intentionally injectfaults or to skip the execution of an instruction andanalyze the faulty outputs based on the target algo-rithm [5], [6], [27]. However, measuring ∆t in this casedoes not deal with the faulty outputs; once a faultyoutput is detected, ∆ti can be concluded.

It should be noted that, because of the environ-mental noise, it might be required to repeat the sameprocedure and shorten the clock glitch period until theprobability of detecting faulty output gets higher thana threshold. Also, if the target combinational circuit isnot a single-bit function and it is possible to detect

4

which output bit is faulty, one can measure ∆t foreach output bit independently.

Therefore, we define the adversary model and de-fine his capabilities in order to be able to measure∆ti of the target combinational circuit for the giveninput i:

• The adversary can access and control the clocksignal which triggers the registers providing theinput and saving the output of the target combi-national circuit.

• He knows in which clock cycle the target com-binational circuit processes the desired data, e.g.,known or guessed input or output.

• He can control the target device in a way that thesame input value i is repeatedly processed by thetarget combinational circuit during shorteningthe time interval of the clock glitch.

• He is equipped with appropriate instruments toshorten the duration of the clock glitch withsuitable accuracy.

2.4 Definitions

Bitwise Capture: let us define BitCapib,∆t as the result

of a Bernoulli trial whether the output of the targetcombinational circuit at bit b is faulty while processingthe input i and when ∆t is the time interval of theclock glitch. Correspondingly, pib,∆t is defined as theprobability of “success” in independently repeatedBitCapi

b,∆t trials.Capture: defining Capi

∆t =∨b

BitCapib,∆t means that

Capi∆t is the same as the above defined trial regard-

less of a certain output bit. It is meaningful whendifferentiating between different faulty output bits isnot possible, e.g., if a circuit is equipped with a faultdetection scheme and prevents the propagation offaulty results. pi∆t is also defined as the probabilityof “success” in independently repeated Capi

∆t trials.Time: To represent the timing characteristics of

the target combinational circuit, we define T ib =

∆t; pib,∆t ≈ pTH as the time required to computethe corresponding output bit b when input i is given,where pTH is a threshold for the probability and isdefined based on physical characteristics of the targetcircuit and is also based on the maximum probabilityachieved by shortening ∆t. Accordingly, the timerequired to complete the computation of all bits whenprocessing input i is defined as T i = ∆t; pi∆t ≈ pTH .

Depending on the target device, its architecture, andthe role of the target combinational circuit inside thetarget device, it might not be possible to know theinput i processed. However, if the output of the targetcombinational circuit is accessible, one can make allthe above defined terms based on the fault-free outputo, i.e., BitCapo

b,∆t, pob,∆t, Capo∆t, po∆t, T

ob , and T o.

3 COLLISION TIMING ATTACK

For simplicity let us suppose that the target combi-national circuit is an S-box of the last round of anAES encryption, i.e., o = S(i) ⊕ k, where o is thecorresponding output ciphertext byte and k the targetkey byte.

If (bitwise) timing characteristics of an S-box im-plementation, i.e., T o (T o

b), show a diversity of ∆tdepending on output o, one can perform an attackand recover the secret knowing how the secret kcontributes in T o (T o

b). In other words, if the timingcharacteristics of the S-box itself regardless of k andlater key addition (⊕) are known as an extra infor-mation or are obtained by profiling using a circuitsimilar to the target, one can make a hypotheticalfunction to estimate the timing characteristics andexamine its similarity to T o (T o

b) for each key guess. Asimilar approach has been presented in [18], where thetiming characteristics of an AES S-box implementationwere profiled and an attack similar to a correlationpower analysis using a HW model was successfullyperformed. In fact, a set of Capo

∆t for a specific ∆t isused in [18] to mount the attack at the last round ofthe AES encryption.

One may also try using information theoretic tools,e.g., mutual information analysis [11], to overcomethe uncertainty of the leakage model. However, itis necessary to use a suitable leakage model thatcannot be selected without extra knowledge aboutthe (timing) characteristics of the target combinationalfunction, or several different models must be exam-ined to find a suitable one [39]. It is noteworthythat the leakages (Capo

∆t) consist of only two values(“fail” and “success”). This causes probability distri-butions (used in e.g., mutual information analysis)to be represented by only two bins in a histogram,and using other schemes to estimate the probabilitydistributions, e.g., kernel density estimation, in thiscase leads to increasing the noise. Here, when usinghistograms, mutual information will also be close tothe variance of means.

In contrast to a correlation attack or a mutualinformation analysis using a leakage model, we applya correlation collision attack [21] to avoid the necessityof considering any such model. Here the correlationcollision attack compares the timing characteristics T o

(T ob) of two S-box instances generating two output

sets, each of which is later XORed by a secret key byte.Suppose that 1T

o and 2To (or their corresponding

bitwise versions) are the timing characteristics of theS-box when processing o = S(i)⊕k1 and o = S(i)⊕k2

respectively. As stated in [21], the aim of a correlationcollision attack is to find the linear difference betweenk1 and k2, i.e., ∆ = k1 ⊕ k2. This will be done bycomparing 1T

o and 2To for all possible guesses of ∆.

For clarification of the attack scheme see Algorithm 1.This can also be adopted to attack the first round

5

Algorithm 1 Correlation Timing Attack(at the last round of an AES encryption)

Input: 1To :{

∆to=0, . . . ,∆to=255}

; o = S(i)⊕ k1

Input: 2To :{

∆to=0, . . . ,∆to=255}

; o = S(i)⊕ k2

1: for ∆ ∈ {0, 1, . . . , 255} do2: C∆ = Correlation( 1T

o, 2To⊕∆)

3: end for4: return arg max

∆C∆

of the AES encryption. Means, suppose that 1Ti and

2Ti are the timing characteristics of the key addition

followed by the S-box when calculating S(i ⊕ k1)and S(i⊕ k2). This is possible when the MixColumnstransformation of the first round is not following theSubBytes at the same clock cycle. Then, the correlationcollision attack can – exactly as in the previous case– recover ∆ = k1 ⊕ k2.

4 PRACTICAL EVALUATION

We are attacking the AES implementations of threeASIC chips built for the SASEBO-R board, namelythe SASEBO LSI2 (130nm), LSI2 (90nm), and LSI3(65nm). Each chip consists in the same 14 differentcores, which we group in

• unprotected (mainly varying in the style of theS-box implementation),

• DPA protected (from masked-AND gates [37]and a kind of threshold implementation [25] toMDPL [28], WDDL [36], and Pseudo-RSL [31]),

• fault attack protected [33].As stated previously, we are using a similar ap-

proach for fault injection as in [18]. We have first triedto generate the glitchy clock inside the control FPGAwithout using an external function generator, but thewidth of the glitchy clock could only be adjusted inlarge steps (e.g., of around 170ps [9]), which were notsmall enough to meet our desired condition. There-fore, we had to use a programmable digital functiongenerator to externally provide the precise clock fre-quencies. The external clock is fed into the SASEBO-Rcontrol FPGA where it is multiplied by a factor of 32using a Digital Clock Management (DCM) unit. Thisfast clock signal is then used together with some logicto shape the glitchy clock signal. An internal circuitcontrols the clock signal of the LSI to inject the glitchyclock at the preferred instance of time synchronizedto the AES computation of the target core. The blockdiagram of the setup and the timing diagram of thesignals are presented in Fig. 3.

As it is presented in the following, we change thewidth of the glitchy clock in steps of 25ps to 5ps. Also,the multiplication of the clock frequency is necessarybecause of the limitation (maximum frequency of15 MHz) of the function generator we have used,while the frequencies necessary to inject a fault in the

1

Start of Last Round1 START signal goes high 8 clock cycles before the first round of encryptionand goes low 8 clock cycles after the last round [3], [4].

Fig. 3. The block diagram of the experimental setupand the timing diagram of the relevant signals

combinational circuit are up to the range of 300MHz.Also, the DCMs inside the Virtex-II control FPGA ofthe SASEBO-R can, when fed with a low frequencyinput signal, only generate output frequencies up to210 MHz. Since some of the cores, especially of the65nm LSI3, require a higher frequency for fault in-jection, for these cores it was necessary to daisy chaintwo DCMs, one for generating a high frequency signalout of the function generator output and another oneto reach the maximum supported output frequencywhich can only be generated by the DCM using a highfrequency input [40]. We should mention that we keptcore voltage of all LSIs at 1.1 V during the experimentsshown here. The maximum clock frequencies dropdown significantly when decreasing the core voltage,e.g., to 0.6 V. Therefore, the problem mentioned abovedue to the necessity of two daisy chained DCMs canbe avoided, and the width of the glitchy clock canbe modified in larger steps if the adversary is able tomodify the core voltage in the target setup.

In all cores of the LSIs 16 instances of the S-boxare implemented to perform the complete SubBytesoperation in a single clock cycle (see Fig. 4 as thegeneral architecture of the encryption datapath). Allcores – except the one supporting a counter modeand the fault-protected one – realize a round basedarchitecture, i.e., S-boxes and MixColumns are per-formed consecutively during each clock cycle exceptfor the last round where MixColumns is absent [3],[4]. Therefore, extracting the timing characteristicsof the S-boxes in the first 9 rounds is not easilypossible, and one needs to inject faults by shorteningthe width of the clock glitches in the last round,when the target cores only compute the ShiftRowsand SubBytes operations followed by the final keyaddition and the result is stored in registers (similarscheme as used in [18]). In addition, one can see fromthe design architecture of the cores (Fig. 4) that theround key of the last round is already computed inthe previous round and is stored into a register. Theglitchy clock at the last round, hence, does not affectthe key scheduling computations. Note that all theAES cores available in our targeted LSIs support the

6

AddRoundKey

TemporaryKey Register

DataRegister

Kin[127:0]

323232

<<8

S-box

Rconi

32

KeyRegister

32323232

SubBytes

ShiftRows

MxCo32-bitSlice

32-bitSlice

32-bitSlice

3232

3232

32323232Din[127:0]

Kout[127:0]

Dout[127:0]

AddRoundKey

Fig. 4. General architecture of the encryption datapathof the attacked AES cores (taken from [3], [4])

encryption scheme while only two cores support bothencryption and decryption. The results of the attackon the encryption scheme of all cores are first pre-sented; then, we discuss on feasibility of the attackson the decryption modules at the end.

In the following the result of the attacks on differentcores and different LSIs are presented. Because ofthe high number of broken cores, only a subset ofthe performed attacks are presented in detail, givingadditional information about the differences to the notmentioned cores as required.

From the 14 AES cores of each LSI, all cores havebeen successfully broken. These 14 cores could bebroken on each of the 3 LSIs regardless of the processtechnology, giving a total of 42 successfully brokenASIC implementations using our proposed collisiontiming attack.

4.1 Attacking the Unprotected Cores

We start by showing the results of the attack on thefirst AES core of the 130nm chip, namely AES_Comp,whose S-boxes have been made using a compositefield approach [32]. As stated before, 16 separate S-boxinstances have been implemented which are active atthe same time. Therefore, it is not possible to com-pare the timing characteristics of one S-box instancewhen processing e.g., two values with different keybytes, that would be an ideal case for a collisiontiming attack. In contrast, the timing characteristicsof different S-box instances must be compared, whichmay slightly vary because of different placement androuting even when being based on the same netlist.

Since changing the glitchy clock width in oursetup requires reseting the DCM(s), we have collectedBitCapo

b,∆t for a specific ∆t while random plaintextsare given to the core. This was repeated while shorten-ing ∆t by steps of 10ps and finally exploiting the bit-wise timing characteristics T o

b . Figure 5 shows pob,∆t ofthe LSB (i.e., b = 0) for some output byte values of two

6300 5100

0.1

0.7

Δt [ps]

Pro

babi

lity

6300 5100

0.1

0.7

Δt [ps]

Pro

babi

lity

Fig. 5. Some pob=0,∆t curves for S-box instances no.(left) 0 and (right) 4 of AES_Comp (130nm)

0 255 5000

6400

Output value

Δt [p

s]

0 255 5000

6400

Output value

Δt [p

s]

Fig. 6. Bitwise timing characteristics T ob=0 of S-box in-

stances no. (left) 0 and (right) 4 of AES_Comp (130nm)

S-box instances1 extracted from their correspondingbitwise captures (10 000 captures for each ∆t). Also,Fig. 6 presents the bitwise timing characteristics T o

b=0

of these two S-box instances obtained by definingpTH = 0.1 (as can be seen in Fig. 5). The diversityof ∆t for these two S-boxes shows the dependencybetween the timing characteristics and the outputvalues. Performing the attack Algorithm 1 on T o

b=0 ofS-box instance number 0 and all other instances led tothe recovery of all 15 independent relations betweenthe 16 bytes of the last round key; part of the resultand the corresponding figures to show the minimumnumber of required captures are shown in Fig. 7. Theattack works the same considering other output bitsto derive T o

b as well as on other LSI chips. We shouldemphasize that the feasibility of the attack dependsnot only on the number of captures, but also on therange of ∆t and the precession of the steps whenshortening.

Careful study of the timing characteristics shownin Fig. 6 reveals that ∆t is much smaller than theother cases when the S-box input is zero, that is aknown issue since the zero-input power model hasbeen defined [12] to mount CPA attacks on AESS-box leakages. In fact, it is not needed to mountthe collision timing attack in this case, and the keybytes can be recovered observing T o

b of each S-boxinstance separately. However, as it is shown later thisproperty does not hold for the other cores realized bydifferent S-boxes, and mounting our proposed attackis essential to reveal the secrets.

The same S-box architecture is used in two othercores. The AES_Comp_ENC_top core is exactly thesame as the examined one, but it does not provide

1. S-box instance numbers start from 0 and are corresponding tociphertext byte indexes.

7

0 255 −0.3

0.9

Δk

Cor

rela

tion

0 255 −0.3

0.9

Δk

Cor

rela

tion

Fig. 7. Result of the attack on the last round ofAES_Comp (130nm) recovering ∆k between key bytes(up to bottom) (0,1) and (0,2), (left) using 10 000 cap-tures and (right) over the number of captures

the decryption module. The AES_PKG core utilizesthe same architecture as AES_Comp_ENC_top, butprecalculates the rounkeys and stores them into dedi-cated registers. Two other cores, i.e., AES_PPRM1 andAES_PPRM3, have the same architecture and datap-ath. Means, only the encryption is supported, and theroundkeys are computed on the fly. However, the S-boxes have been realized by a low-power approachcalled Positive Polarity Read-Muler (PPRM) [24] in1 and 3 stages respectively in the AES_PPRM1 andAES_PPRM3 cores; the later one is supposed to con-sume less power.

In order to perform the attack on these core thesame procedures as explained above have been re-peated. We have provided a list shown in Table 1 (inAppendix) as a reference for the timing characteristicsand the number of required captures to successfullymount the attack on different cores in different LSIs.Attacking the AES_TBL core, where S-boxes havebeen realized by look-up tables (case statements), isdifferent to the aforementioned cores. We illustratethis case when explaining how to mount the attackon the WDDL and MDPL cores.

The AES_CTR core, which enables the countermode, employs a 4-stage pipeline architecture tospeed up the computations and is depicted in Fig. 8.In order to utilize the pipeline the initialization vector(IV) and four plaintexts are sent to the core in thebeginning. The AES circuit then encrypts the full 128-bit IV to generate pseudo random numbers to XORencrypt the first plaintext and increments the storedIV for each of the following plaintexts. As attackerwe exploit the fact that we have full control overthe IV, i.e., we can completely control which inputis encrypted by the AES circuit. By i) setting theplaintext to all zeros, ii) focusing only on the firstresulting ciphertext where the input IV has not been

Kreg

Dreg

128

GF inverter

isomorphism

128

Din

<<8

Rconi32

128

32 32 32

32 32 32

P0reg

ShiftRows

isomorphism& affine

P1reg

P2reg

MixColumns32

GF inverter

isomorphism

K0reg

isomorphism& affine

K1reg

K2reg

128

S-box

32

32

Kin128

DOreg

Dout128

DI0reg

DI1reg

DI2reg

DI3reg

CTRreg

+1

Kout

Fig. 8. Architecture of the Counter-mode Core (takenfrom [3], [4])

incremented yet, and iii) reseting the core after eachobtained ciphertext to load a new IV, we basicallycreate the same attack scenario as for the other cores.The only difference is that in this case we use the IVas our known plaintext.

The S-box itself is divided into three parts (seeFig. 8), where the part with the longest critical pathis the inversion; therefore, we have selected this partto exploit the timing characteristics and mount theattacks during the computation of the last round.According to Table 4.7 and Table 4.15 of [3], theAES_CTR core has a longer critical path and is slowercompared to the AES_Comp core. Indeed, it comesfrom the integer addition module used to incrementthe IV. However, the encryption core itself is quite fast,and because of its pipeline architecture we expecteda shorter critical path for the inversion unit. Ourpractical experiments confirmed our assumption, and4200ps was the minimum ∆t for fault-free operation ofthe 130nm chip showing the operation frequency to bearound 240MHz (excluding the integer addition unit).Although we initially faced several problems makingfaulty results in the AES_CTR core of the 90nm and65nm chips because of the very short critical path, themounted attacks could successfully reveal the secretssimilarly to the shown results of the AES_Comp core.

4.2 Attacking the DPA-Protected CoresThe first DPA-protected core is AES_MAO in whicha gate-level masking scheme (masked-AND gate [37])is used to realize the masked S-boxes. The capturing

8

0 255

4300

4500

Output value

Δt [p

s]

0 255

4300

4500

Output value

Δt [p

s]

Fig. 9. Bitwise timing characteristics T ob=0 of S-box

instances no. (left) 5 and (right) 6 of AES_MAO (65nm)

0 255 −0.2

0.6

Δk

Cor

rela

tion

Fig. 10. Result of the attack on the last round ofAES_MAO (65nm) recovering ∆k between key bytes(5,6) (left) using 10 000 captures and (right) over thenumber of captures

scenario and the attack scheme are exactly the sameas that of the unprotected cores, e.g., AES_Comp.Figure 9 shows the timing characteristics of two S-box instances of the AES_MAO (65nm) core. It canbe seen that timing characteristics for different (un-masked) outputs still differ though the S-box inputsare masked by means of random values. Conse-quently, it is possible to extract the relation betweenthe key bytes, as depicted in Fig. 10 where the resultsafter obtaining 10 000 captures, when shortening ∆t insteps of 25ps are presented. Comparing the number ofrequired captures when attacking the AES_Comp andAES_MAO cores, one can conclude that the attack onthe masked-AND gates is harder. This is because ofthe influence of the randomness on the dependencyof the timing characteristics of an S-box to its givenunmasked inputs.

The threshold implementation scheme, which isa high-order masking countermeasure to the poweranalysis attacks, is based on a combination of booleanmasking and multi-party computation [26]. In con-trast to other masking schemes when implementedin hardware, e.g., [20], [21], it is expected to preventfirst-order leakage caused by signal glitches [23], [29].This scheme is an algorithmic-level countermeasurewhich needs to fulfill certain properties includingcorrectness, non-completeness, and uniformity [26].Although a core of our targets is called AES_TI [3],[4], it has been made without considering the later twoproperties. This core has been realized by modifyingthe datapath of the AES_Comp_ENC_top core. Thenon-linear gates are provided by only 2-input ANDgates, every signal is represented by four shares,and finally the AND gates are replaced with the 4-

0 255 −0.2

0.5

Δk

Cor

rela

tion

Fig. 11. Result of the attack on the last round ofAES_TI (65nm) recovering ∆k between key bytes (3,5)(left) using 1 000 000 captures and (right) over the num-ber of captures

shared threshold implementation of a 2-input ANDgate which is available at [23], [25]. It indeed makesthis core a kind of gate-level masked implementation.This can be verified by examining the source codeof this core available at [1]. Extracting the timingcharacteristics and mounting the same attack as beforeon this core could also break the implementationand recover the key relations (see Fig. 11 for exam-ple). However, as stated in Table 1 (in Appendix),much more captures are required compared to theAES_MAO core. To the best of our knowledge it is dueto the amount of randomness provided by the AES_TIcore where each single-bit signal is presented by fourshares, i.e., using three random mask bits, while onlyone mask value is used in masked-AND gates (theAES_MAO core).

DPA-resistant logic styles are used in four otherDPA-protected cores. Random Switching Logic(RSL) [34] is a pre-charge logic style combined witha logic-level masking scheme (first theoreticallyproposed at [13] and later used in [38]). Based on thedesign methodology presented in [31], pseudo RSL,which makes use of standard CMOS library, is usedin realization of both AES_PR and AES_WO cores.The design methodology is based on the fact thatconsecutive non-linear parts of the circuit which arerealized by RSL should get enabled one after eachother. It means that the S-box is divided into severalparts which to prevent propagation of glitches areenabled sequentially by means of a control unit.Therefore, timing characteristics of only the lastlogic step of the S-box circuit can be extracted andattacked. Repeating the illustrated attack could revealthe desired secret as the results of the attack on theAES_PR core are shown in Fig. 12. The other pseudoRSL core, i.e., AES_WO, is made for evaluationpurposes. Though its design is not obvious to us,comparing the corresponding ∆t range of these twocores (Table 1) reveals that the enable signals inthe AES_WO core are missing or the appropriatetiming constrains have not been considered for them.Interestingly, the AES_WO core can be attackedthe same as an unprotected core; even the numberof required captures to reveal the key relations is

9

0 255 −0.2

0.4

Δk

Cor

rela

tion

Fig. 12. Result of the attack on the last round ofAES_PR (65nm) recovering ∆k between key bytes(9,13) (left) using 10 000 captures and (right) over thenumber of captures

0 255 −0.3

0.8

Δk

Cor

rela

tion

Fig. 13. Result of the attack on the last round ofAES_WO (90nm) recovering ∆k between key bytes(7,8) (left) using 10 000 captures and (right) over thenumber of captures

comparable to that of the AES_Comp core. The resultof the attacks on this core is depicted in Fig. 13.

Wave Dynamic Differential Logic (WDDL) [36] isa logic style that is proposed to realize the SenseAmplifier Based Logic (SABL) [35] using a standardCMOS library. These logic styles aim at flatteningthe power consumption, a kind of hiding at transis-tor/gate level [19]. There are two types of WDDLsequential circuits, and a version that needs one clockcycle per either pre-charge or evaluation phase is usedto implement the AES_WDDL core. The reason ofsuch constrain is the structure of master-slave WDDLflip-flops [36]. Each round of the encryption functionis therefore executed in two clock cycles while the dat-apath and the architecture of this core are otherwisethe same as that of the AES_Comp_ENC_top core.Exactly the same specifications and features hold forthe AES_MDPL core which is realized using MaskedDual-rail Pre-charge Logic (MDPL) style [28]. MDPLworks similarly to WDDL in addition to a single maskbit which randomly swaps the rails of the completecircuit at every clock cycle.

In contrary to the other cores an injected faultby a clock glitch at the evaluation phase of bothWDDL and MDPL can only lead to a bit flip from1 to 0, not vice versa, because of the pre-dischargephase in which all signals are set to 0. Therefore,bitwise timing characteristics T o

b does not provideany information for those output values o in whichbit b is zero. This issue has been also addressedin [16] where a successful attack is performed onan AES_WDDL core. Our solution is to avoid using

0 255 −0.2

0.7

Δk

Cor

rela

tion

Fig. 14. Result of the attack on the last roundof AES_WDDL (130nm) recovering ∆k between keybytes (1,2) (left) using 10 000 captures and (right) overthe number of captures

0 255 −0.2

0.8

Δk

Cor

rela

tion

Fig. 15. Result of the attack on the last round ofAES_MDPL (65nm) recovering ∆k between key bytes(2,3) (left) using 10 000 captures and (right) over thenumber of captures

bitwise characteristics, and apply the attack on timingcharacteristics T o (not the bitwise one). This way thebitwise timing characteristic of all bits are combined,and T o provides information for all output values.The result of the attack on the AES_WDDL (130nm)and the AES_MDPL (65nm) cores are shown in Fig. 14and Fig. 15 respectively. Interestingly, the number ofcaptures required to successfully mount the attacks isless than the unprotected core cases.

We have seen – for reasons unknown to us – thesame behavior (no information is provided by T o

b

for those o where bit b is zero) when attacking theAES_TBL core, where the S-boxes have been realizedby look-up tables. The same approach, that we haveused to attack the AES_WDDL and the AES_MDPLcores (using T o), was helpful when attacking theAES_TBL core. It should be noted that, to attack theAES_MDPL and AES_WDDL cores, we only used10 000 captures for each ∆t in steps of 5ps. On theother hand, a successful attack on the AES_TBL corerequired around 1 000 000 captures, which might bebecause of marginal differences between the criticalpaths of the circuit realizing the look-up table.

4.3 Attacking the Fault-Protected Core

The scheme presented in [33] has been used to imple-ment the fault-protected core, i.e., AES_FA, of our 3targeted LSIs. The main idea of this scheme is to splitthe round function into two parts where both partscan be active at the same time. One half is used tocompute the next state of the half round while the

10

DregX

ShiftRowsInvShiftRows

128

GF(28)Inverters

affine-1

affine

EncKreg

128

DecKreg

128

Dout

Din

MixCol.InvMixCol.

DregY

Error0 Error1

Switching Box

1

1

=?

=?128

128

128

RoundKey

Register

<<8

S-box

Rconi

32

128

32 32 32

32

32 32 32 32

Kin

Kout

Fig. 16. Architecture of the Fault-Protected Core(taken from [33])

other half is used to verify the previous half-roundcomputation.

An overview of this architecture is given by Fig. 16.The first half-round circuit is comprised of ShiftRowsand SubBytes while the second half includes Mix-Columns and AddRoundkey, all also featuring theirinverse counterparts. This ability to perform both theencryption and decryption operations in each half isutilized for the fault detection. More precisely, afterevery half-round computation, in the next clock cyclethe inverse operation is performed, and the result iscompared to the original state to validate that the in-termediate state has indeed been computed correctly.

In order to clarify the order of operations an exem-plary encryption is depicted in Fig. 17 and is in thefollowing explained in detail. According to Fig. 17(a)both the first and the last roundkey (K0 and K10) arestored in registers2 (EncKreg and DecKreg in Fig. 16).Since an encryption is performed in this example,the first roundkey K0 is XORed with the plaintextand the result is stored as D0 in the state register(DregX in Fig. 16). This state is propagated to both thecomparison register (DregY in Fig. 16) and throughthe ShiftRows and SubBytes functions so that theresult of the first half round (D1X) is available at theinput of the state register.

At the next clock edge (Fig. 17(b)) the initial inputD0 is stored in the comparison register and the inter-mediate result D1X is stored in the state register andfollowing that is propagated through three differentdatapaths at the same time. First, D1X goes throughInverseShiftRows and InverseSubBytes (thereby re-computing the initial input D0) to the comparison cir-cuit, which triggers a fault flag if the computed valuediffers from the stored D0 in the comparison register.

2. Note that as it is explained later, the last roundkey K10 mustbe precomputed prior to the encryption or decryption operationwhen setting a key.

If this flag would be raised the circuit will reset and nociphertext output will be generated. Second, D1X goesthrough MixColumns and AddRoundkey making thecomplete first round result D1 available at the inputof the state register. Third, D1X is also propagated tothe comparison register input since it will be requiredto validate the next computation step.

Similar to the previous clock cycle at the next clockedge (Fig. 17(c)) D1X is stored in the comparisonregister and the intermediate result D1 is stored in thestate register. Following that D1 is again propagatedthrough three different datapaths. In order to checkif D1 has been computed correctly it is sent throughInverseMixColumns and InverseAddRoundkey to thecomparison function which checks for differences tothe stored original D1X value. Simultaneously theresult of the first half of the second round D2X iscomputed by going through ShiftRows and SubBytesand again the original value D1 is also propagated tothe comparison register.

These steps, basically (b) and (c) in Fig. 17, arerepeated until the intermediate result of the last roundD10X has been computed. In Fig. 17(d) the final resultD10 is computed, omitting MixColumns as usual, butin order to check whether an error occured in thekey scheduling the computed last roundkey is alsocompared to the stored one in the DecKreg register.Since the final computation of D10 has to be checkedas well, an additional clock cycle is needed before theciphertext can be issued, as depicted by Fig. 17(e).

Since each computation is checked by its inversecounterpart, not only dynamic but also static faultscan be detected. In theory the performance penaltyof this countermeasure is low, since, as stated in [33],while the double amount of clock cycles are necessaryfor a complete AES computation, at the same timethe critical path is shortened by splitting up of theround function. However, as it is explained later wehave found that because of the comparison circuit,which compares two 128-bit values, the critical pathof the circuit gets quite long and makes the circuitconsiderably slower compared to the other cores (ascan also be seen in Table 4.7 and Table 4.15 of [3]).

In addition, since MixColumns does not immedi-ately follow SubBytes, and they are separated by aregister, in contrary to the other cores one can exploitthe timing characteristics of the S-boxes at the firstround and also extend the attack on the later rounds.In order to mount the attack on the first round thereare two options:

i) If the glitchy clock appears in the first clock cycleof the first round (see Fig. 17(a)), the faulty resultD1X is saved into the state register, which is de-tected in the next clock cycle by the comparisoncircuit when the inverse computation leads tothe wrong D0 (see Fig. 17(b)). When we havetried this on different LSIs, 6600ps, 5500ps, and4800ps were the minimum width of the clock

11

D0

SR

K0

KGen

K10

Plain Text

SB

D1X

(a)

D0

(a)

D1X

ISR

K0

K1

K10

MX

D0

ISB

D1

D0

=?

(b)

D1X

(b)

D1

SR

K0

K1

K10

IMX

D1X

SB

D2X

=?

(c)

D1

D1X

(c)

.�.�.�.�.�.�

D10X

ISR

K0

K10

K10

D9

ISB

D10

D9

=?=?

(d)

D10X

(d)

D10

K0

K10

D10X

D10X

CipherText

K10

=?=?

(e)

(e)

Fig. 17. Example Operation of the Fault-Protected Core (taken from [33])

glitch with which respectively the 130nm, 90nm,and 65nm chips still worked fault free.

ii) In the second clock cycle (Fig. 5(b) of [33])the critical path of InverseShiftRows and In-verseSubBytes followed by the comparison unitis longer than MixColumns and AddRoundkey.Therefore, if the glitchy clock is injected inthe second clock cycle, the fault detection bit(output of the comparison circuit) is evaluated,i.e., is stored in the target register, before thecomputation of the InverseSubBytes or the com-parison itself has been finished. Examining thisin practice showed that the chip reports faultycomputation when the width of the clock glitchis less than 10 700ps, 6700ps, and 6800ps for the130nm, 90nm, and 65nm chips respectively. Infact, it shows that the increase of the critical pathcaused by the comparison circuit strongly affectsthe performance (throughput) of the design, asstated previously.

It is noteworthy that faults in the key scheduling unitare not detected until the described comparison in thelast round (see Fig. 17(e)). While the glitchy clocksmay in theory affect the round key computations,the fact that the critical path of that computation isactually split in half by a pipelining register in the S-boxes of the key scheduling unit [33] causes the faultsto be triggered in the normal datapath way earlier,since there a full S-box computation has to be per-

formed among other steps. Therefore, by selecting thesecond clock cycle to shorten the width of the clockglitch and mount the attack where the critical path ismaximal because of the included InverseShiftRows,InverseSubBytes and the comparison function, we arealso making sure not to affect the key schedulingwhich has a significantly shorter critical path.

In contrary to the other cores, here the fault detec-tion unit provides only one bit indicating the appear-ance of a fault, but from the adversary point of viewit is not possible to distinguish where (in which S-box instances) the fault occurred. Therefore, not onlythe bitwise capturing is infeasible, but also the faultdetection unit makes the capturing a very impreciseestimation process. Two timing characteristics of twoS-box instances of the AES_FA (65nm) core obtainedby pTH = 0.5 and two attack results are shown inFig. 18 and in Fig. 19 respectively. We have used500 000 captures for each clock glitch width whenshortening ∆t by steps of 10ps.

As stated before, the capturing process here is veryinaccurate because of the effect of the fault occurrenceon other S-box instances when considering a specificS-box instance. In other words, when at a specific∆t an S-box instance is always faulty, it preventsextracting any information about the behavior of theother S-box instances at that ∆t. Therefore, we couldonly recover the relation between some key bytes,e.g., between bytes (0,2), (0,4), (0,6), (0,7), and (0,10).

12

0 255 6720

6800

Output value

Δt [p

s]

0 255 6720

6800

Output value

Δt [p

s]

Fig. 18. Timing characteristics T i of S-box instancesno. (left) 0 and (right) 2 of AES_FA (65nm)

0 255

−0.1

0.5

Δk

Cor

rela

tion

0 255

−0.1

0.5

Δk

Cor

rela

tion

Fig. 19. Result of the attack on the first round ofAES_FA (65nm) recovering ∆k between key bytes(left) (0,2), (right) (0,4)

We should emphasize that this behavior is not thesame for all LSIs; different relations between key bytesare recovered when attacking the first round. Oursolution, to completely break the core and recover allthe key bytes, is to extend the attack on the secondround. When the glitchy clock is injected in the secondclock cycle of the second round, i.e., the forth clockcycle of encryption, we collected the result of 50 000trials with random plaintext for a specific ∆t when theprobability of the fault occurrence is about 0.5. Theattack process is divided into 4 parts, targeting each4-byte MixColumns output independently. We startwith the first column when the key bytes (0,5,10,15)of the first round key are required to compute the firstMixColumns output. Since the relation between keybytes e.g., (0,10) has been revealed by the last attackshown, the search space for the key bytes (0,5,10,15)is reduced to 224. Therefore, for each key guess theoutput of the MixColumns is computed, and basedon each MixColumns output byte the 4 captures and 4probabilities using the collected 50 000 trials are gen-erated. Interestingly, it is not required to mount thecollision attack. Instead, a variance check approachas stated in [21] can find the correct key guess. Theidea behind it is that the probabilities pi are differentfor different input values. For a wrong key guess thetrials will be randomly distributed into the captures,and the probabilities have a small variance comparedto the case that the captures have been made by thecorrect key guess. The result of this approach is shownin Fig. 20(a), where the correct key guess is obviouslydistinguishable. Using a 16-core PC to compute allvariance values for the 224 key space lasts roughly onehour. Since there are 4 captures and 4 probabilities pi

each of which corresponds to one MixColumns output

(a) first column (b) second column

(c) third column (d) fourth column

Fig. 20. Results of the variance check approach on thecaptures collected using the glitchy clock at the secondround of AES_FA (65nm)

byte, the variance check over the probabilities can bedone on each of these 4 probabilities independently.So, if the variances of probabilities of one output bytedid not show any distinguishable guess, other outputbytes can be examined. This process is repeated on theother columns using the recovered key bytes and therecovered relations so that the search space sometimesgets smaller than 224; the results are shown in Fig. 20.Although different relations between key bytes arerecovered when attacking different LSIs, the variancecheck approach at the second round could completelyreveal the secret in all LSIs searching in a space up to224.

We should emphasize that the key recovery attack ispossible even omitting the attack on the first round. Inthis case when no information about the key bytes isavailable, each part of the attack on the second round(exactly as explained above) must search in a spaceof 232 which might be a time consuming task usingPCs, but can be extremely accelerated by means ofgraphics processing units (GPU).

4.4 Attacking the Decryption Modules

Only two cores, AES_Comp and AES_FA, include theessential module for decryption while the AES_CTRcore also supports the decryption feature by swappingthe plaintext and ciphertext since the AES in countermode can be seen as a stream cipher. Attacking thedecryption module of the AES_FA core is exactly thesame as the attack on its encryption module. Means,in order to attack the first round of decryption theglitchy clock is injected in the second clock cycle whenthe critical path of SubBytes and ShiftRows followedby the the comparison unit is affected. Though it canbe ignored, the attack reveals the relation between

13

AddRoundKey

TemporaryKey Register

DataRegister

Kin[127:0]

323232

<<8

S-box

Rconi

32

KeyRegister

32323232

InvMxCo

32-bitSlice

32-bitSlice

32-bitSlice

3232

3232

32323232Din[127:0]

Kout[127:0]

Dout[127:0]

AddRoundKey

InvShiftRows

InvSubBytes

Fig. 21. Architecture of the decryption datapath of theAES_Comp core (taken from [3], [4])

some key bytes. Extending the attack on the secondclock cycle of the second round of decryption bymeans of the variance check approach and searchingin a space of up to 232 (in case the attack on the firstround is ignored) all the round keys are revealed. Weomit presenting the result of the attack because of itssimilarity to already presented results in Fig. 19 andFig. 20.

The decryption module of the AES_Comp core isa dedicated circuit which is depicted in Fig. 21. Thefirst clock cycle is the only clock cycle where Inverse-ShiftRows and InverseSubBytes without InverseMix-Columns are in the critical path. The glitchy clock,therefore, must be injected in this clock cycle to extractthe timing characteristics of the S-boxes. However,the output of the first decryption round cannot beviewed by the attacker; he can only check whetherwhole of the decryption output (plaintext) is corrector not (similar result that the AES_FA core provides).Therefore, the timing characteristics of some S-boxescannot be extracted and the attack scheme is similarto that of the AES_FA core. Again the result of theattack are omitted while information about the criticalpath and the number of required captures are givenin Table 1 (in Appendix). We should emphasize thatcontrary to the AES_FA core the attack cannot beextended to the second decryption round since In-verseMixColumns affects the critical path. Therefore,the relation between some key bytes can be recoveredcausing the search space for the full key to shrinksignificantly. In general, the used architecture for thedecryption module (Fig. 21) makes our proposed at-tack hard comparing to the corresponding encryptionmodule of the same core.

5 CONCLUSIONS

We have presented a collision attack which efficientlyutilizes the data dependent timing characteristics ofcombinational circuits to reveal the secrets. While theattack is based on the idea of [18], it is significantly

more powerful since it does not require any knowl-edge about the characteristics of the target combina-tional circuit.

We demonstrated the power of the attack by suc-cessfully breaking all 14 AES implementations of theSASEBO LSI2 and LSI3 in all currently available pro-cess technologies. It is indicated in [18] that whilemasking does not prevent DFA attacks, it may actuallyprovide security against FSA-based attacks becauseof the randomized inputs of the combinational func-tions. However, by breaking all DPA-protected coresof the mentioned ASICs we have shown that ran-domizing countermeasures itself cannot prevent data-dependent timing characteristics of the combinationalcircuit, and they therefore remain vulnerable againstthe attack introduced here.

Finally, our introduced attack is even capable ofextracting the secret key out of an implementationapplying a fault detection scheme by extending theattack on the second round of the cipher and a fewhours of computation. In short, the results shown inthis work imply the need for a special unit in the –especially side-channel protected – designs in order todetect the clock glitches to thwart such kind of attacks.

ACKNOWLEDGMENT

The authors would like to thank Akashi Satoh andthe Research Center for Information Security (RCIS)of Japan for the prompt and kind help in obtainingSASEBOs and cryptographic LSIs.

REFERENCES

[1] Cryptographic Circuits with Logic Level Countermeasuresagainst DPA. Information and Physical Security ResearchGroup, YOKOHAMA National University http://ipsr.ynu.ac.jp/circuit/.

[2] Side-channel Attack Standard Evaluation Board (SASEBO-R). Further information are available via http://www.morita-tech.co.jp/SASEBO/en/board/sasebo-r.html.

[3] ISO/IEC 18033-3 Standard Cryptographic LSI – withSide Channel Attack Countermeasures – Specification,ver 1.0. http://www.morita-tech.co.jp/SASEBO/resources/crypto_lsi/CryptoLSI2_Spec_Ver1.0_English.pdf, 2009.

[4] Standard Cryptographic LSI Specification – Countermea-sures against Side Channel Attacks (65nm) – Specification,ver 0.9. http://www.morita-tech.co.jp/SASEBO/resources/crypto_lsi/CryptoLSI3_Spec_Ver0.9_English.pdf, 2010.

[5] E. Biham and A. Shamir. Differential Fault Analysis of SecretKey Cryptosystems. In CRYPTO 1997, volume 1294 of LNCS,pages 513–525. Springer, 1997.

[6] J. Blömer and J.-P. Seifert. Fault Based Cryptanalysis of theAdvanced Encryption Standard (AES). In FC 2003, volume2742 of LNCS, pages 162–181. Springer, 2003.

[7] A. Bogdanov. Multiple-Differential Side-Channel CollisionAttacks on AES. In CHES 2008, volume 5154 of LNCS, pages30–44. Springer, 2008.

[8] A. Bogdanov, T. Eisenbarth, C. Paar, and M. Wienecke. Dif-ferential Cache-Collision Timing Attacks on AES with Appli-cations to Embedded CPUs. In CT-RSA 2010, volume 5985 ofLNCS, pages 235–251. Springer, 2010.

[9] S. Endo, T. Sugawara, N. Homma, T. Aoki, and A. Satoh.An on-chip glitchy-clock generator and its application to safe-error attack. In COSADE 2011, pages 175–182, 2011.

14

[10] K. Gandolfi, C. Mourtel, and F. Olivier. ElectromagneticAnalysis: Concrete Results. In CHES 2001, volume 2162 ofLNCS, pages 251–261. Springer, 2001.

[11] B. Gierlichs, L. Batina, P. Tuyls, and B. Preneel. MutualInformation Analysis. In CHES 2008, volume 5154 of LNCS,pages 426–442. Springer, 2008.

[12] J. D. Golic and C. Tymen. Multiplicative Masking and PowerAnalysis of AES. In CHES 2002, volume 2523 of LNCS, pages198–212. Springer, 2002.

[13] Y. Ishai, A. Sahai, and D. Wagner. Private Circuits: SecuringHardware against Probing Attacks. In CRYPTO 2003, volume2729 of LNCS, pages 463–481. Springer, 2003.

[14] P. C. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In CRYPTO 1996,volume 1109 of LNCS, pages 104–113. Springer, 1996.

[15] P. C. Kocher, J. Jaffe, and B. Jun. Differential Power Analysis. InCRYPTO 1999, volume 1666 of LNCS, pages 388–397. Springer,1999.

[16] Y. Li, K. Ohta, and K. Sakiyama. Revisit Fault SensitivityAnalysis on WDDL-AES. In HOST 2011, pages 148–153. IEEE,2011.

[17] Y. Li, K. Ohta, and K. Sakiyama. New fault-based side-channelattack using fault sensitivity. IEEE Transactions on InformationForensics and Security, 7(1):88–97, 2012.

[18] Y. Li, K. Sakiyama, S. Gomisawa, T. Fukunaga, J. Takahashi,and K. Ohta. Fault Sensitivity Analysis. In CHES 2010, volume6225 of LNCS, pages 320–334. Springer, 2010.

[19] S. Mangard, E. Oswald, and T. Popp. Power Analysis Attacks:Revealing the Secrets of Smart Cards. Springer, 2007.

[20] S. Mangard, N. Pramstaller, and E. Oswald. SuccessfullyAttacking Masked AES Hardware Implementations. In CHES2005, volume 3659 of LNCS, pages 157–171. Springer, 2005.

[21] A. Moradi, O. Mischke, and T. Eisenbarth. Correlation-Enhanced Power Analysis Collision Attack. In CHES 2010,volume 6225 of LNCS, pages 125–139. Springer, 2010. theextended version: http://eprint.iacr.org/2010/297.

[22] A. Moradi, O. Mischke, C. Paar, Y. Li, K. Ohta, andK. Sakiyama. On the power of fault sensitivity analysis andcollision side-channel attacks in a combined setting. In CHES2011, volume 6917 of Lecture Notes in Computer Science, pages292–311, 2011.

[23] A. Moradi, A. Poschmann, S. Ling, C. Paar, and H. Wang.Pushing the Limits: A Very Compact and a Threshold Im-plementation of AES. In EUROCRYPT 2011, volume 6632 ofLNCS, pages 69–88. Springer, 2011.

[24] S. Morioka and A. Satoh. An Optimized S-Box Circuit Archi-tecture for Low Power AES Design. In CHES 2002, volume2523 of LNCS, pages 172–186. Springer, 2002.

[25] S. Nikova, C. Rechberger, and V. Rijmen. Threshold Implemen-tations Against Side-Channel Attacks and Glitches. In ICICS2006, volume 4307 of LNCS, pages 529–545. Springer, 2006.

[26] S. Nikova, V. Rijmen, and M. Schläffer. Secure HardwareImplementation of Nonlinear Functions in the Presence ofGlitches. Journal of Cryptology, 24(2):292–321, 2011.

[27] G. Piret and J.-J. Quisquater. A Differential Fault AttackTechnique against SPN Structures, with Application to theAES and KHAZAD. In CHES 2003, volume 2779 of LNCS,pages 77–88. Springer, 2003.

[28] T. Popp and S. Mangard. Masked Dual-Rail Pre-charge Logic:DPA-Resistance Without Routing Constraints. In CHES 2005,volume 3659 of LNCS, pages 172–186. Springer, 2005.

[29] A. Poschmann, A. Moradi, K. Khoo, C.-W. Lim, H. Wang, andS. Ling. Side-Channel Resistant Crypto for Less than 2, 300GE. Journal of Cryptology, 24(2):322–345, 2011.

[30] J.-J. Quisquater and D. Samyde. ElectroMagnetic Analysis(EMA): Measures and Counter-Measures for Smart Cards. InE-smart 2001, volume 2140 of LNCS, pages 200–210. Springer,2001.

[31] M. Saeki, D. Suzuki, K. Shimizu, and A. Satoh. A DesignMethodology for a DPA-Resistant Cryptographic LSI with RSLTechniques. In CHES 2009, volume 5747 of LNCS, pages 189–204. Springer, 2009.

[32] A. Satoh, S. Morioka, K. Takano, and S. Munetoh. A Com-pact Rijndael Hardware Architecture with S-Box Optimization.In ASIACRYPT 2001, volume 2248 of LNCS, pages 239–254.Springer, 2001.

[33] A. Satoh, T. Sugawara, N. Homma, and T. Aoki. High-Performance Concurrent Error Detection Scheme for AESHardware. In CHES 2008, volume 5154 of LNCS, pages 100–112. Springer, 2008.

[34] D. Suzuki, M. Saeki, and T. Ichikawa. Random SwitchingLogic: A New Countermeasure against DPA and Second-Order DPA at the Logic Level. IEICE Transactions, 90-A(1):160–168, 2007.

[35] K. Tiri, M. Akmal, and I. Verbauwhede. A Dynamic andDifferential CMOS Logic with Signal Independent Power Con-sumption to Withstand Differential Power Analysis on SmartCards. In ESSCIRC 2002, pages 403–406, 2002.

[36] K. Tiri and I. Verbauwhede. A Logic Level Design Methodol-ogy for a Secure DPA Resistant ASIC or FPGA Implementa-tion. In DATE 2004, pages 246–251. IEEE, 2004.

[37] E. Trichina. Combinational Logic Design for AES SubByteTransformation on Masked Data. Cryptology ePrint Archive,Report 2003/236, 2003. http://eprint.iacr.org/.

[38] E. Trichina, T. Korkishko, and K.-H. Lee. Small Size, LowPower, Side Channel-Immune AES Coprocessor: Design andSynthesis Results. In AES Conference 2004, volume 3373 ofLNCS, pages 113–127. Springer, 2004.

[39] N. Veyrat-Charvillon and F.-X. Standaert. Generic Side-Channel Distinguishers: Improvements and Limitations. InCRYPTO 2011, volume 6841 of LNCS, page 348. Springer, 2011.the extended version: http://eprint.iacr.org/2011/149.

[40] XILINX. Virtex-II Pro and Virtex-II Pro X FPGA User Guide.Technical report, version 4.2, 2007. http://www.xilinx.com/support/documentation/user_guides/ug012.pdf.

15

APPENDIX

In the following some of the difficulties that we facedduring our practical investigations are explained indetail. We also provide a table containing the specifi-cations of all 14 AES cores of the three targeted LSIsincluding the ∆t ranges and the minimum number ofcaptures required to mount the attacks.

Difficulties• As stated in [3] and [4], each core has its own

clock tree which had a strong impact on glitchyclocks. The capacitive and resistive features of theclock line of the LSIs which is supplied by thecontrol FPGA changed the glitchy clock shapeand modified the situation whether the registersare triggered two times, or if one of the posi-tive edges is filtered by the clock tree elements.Therefore, we had to put different capacitors andresistors in the SASEBO-R board to change therising and falling slopes of the clock signal toreach the desired situation.

• When making the clock glitch at the last roundwe saw that some output bytes – and even someoutput bits – get faulty very rapidly comparedto the slow slope of po. For instance, see Fig. 22,where some probability curves arise very sharplyand make their corresponding attacks impossible.The reason for such a behavior is that someregisters do not respond to both rising edges ofthe glitchy clock and are triggered only once.Consequently, the timing characteristics of thoserelated S-boxes could not be extracted. Althoughmodifying the capacitance and resistance of theclock line could improve the situation, in somecases we were not able to solve it completely.However, the revealed relations between the keybytes in all attacked cores of all LSIs allowed fora easy key recovery process by searching in afeasible space.

• Since DCM outputs have an increased jitter whenthey are used to multiply the clock inputs, the

5800 52605670 53900

0.5

0.25

Δt [ps]

Pro

babi

lity

normalrapid

Fig. 22. Average of pob=1 for all 16 S-boxes ofAES_MDPL (65nm)

glitchy clock width also had significant jitter thatmade the capturing process noisy. The situationgot worse when we had to cascade two DCMsfor some cores (when keeping the core voltage as1.1 V), especially in 65nm technology, to reach thedesired ∆t. In this case, the second DCM oftencould not get locked because of the high jitterof the first DCM. So, we had to provide anothercircuit controlled by the PC to automatically resetthe control FPGA (in fact the DCMs) until theDCMs get locked and provide the requested highfrequencies.

• Since we have required high amounts of captures,e.g., up to 1 000 000, for different ∆t values tosuccessfully mount the attacks on some cores, wehave developed a special design for the controlFPGA to speed up the capturing process. Ourcontrol FPGA communicates with the target LSI,makes the glitchy clock on the desired clock cycle,and finally after performing a couple of capturingprocess sends the result back to the PC. In thisway we could efficiently increase the speed of thecapturing up to couple of thousands per second.

• According to [3] and [4], the clock signal of theinterface circuit of the LSIs is separated from thecore clocks. So, the glitchy clock does not appearon the interface circuit which makes the attackseasier. It might be a challenging case when theinterface circuit sees the glitches, and the controlflow of the target core gets infected. However, itseems to be unlikely since the critical path delayof the controller is usually not as long as of thecomplex S-box circuits.

16

TABLE 1Specification of all 14 AES cores of the three targeted LSIs including the ∆t ranges and the minimum

number of captures required to mount the attacks when the core voltage is provided by a 1.1 V power supply

IP core DescriptionLSI2 130nm LSI2 90nm LSI3 65nm

∆t range No. of ∆t range No. of ∆t range No. of[ps] Captures [ps] Captures [ps] Captures

AES_Comp 6450 5320 3650∆ : 10 ∼ 100 ∆ : 10 ∼ 100 ∆ : 10 ∼ 100

Encryption composite field 5000 5130 3370S-box 7250 6000 3800

Decryption ∆ : 25 40 000 ∆ : 20 80 000 ∆ : 20 40 0006650 5700 3500

AES_TBLtable look-up 5475 3960 3570S-box by ∆ : 25 1 000 000 ∆ : 20 1 000 000 ∆ : 10 1 000 000case statement 4900 3550 3420

AES_PPRM1S-box by 11350 6135 53251-stage ∆ : 25 < 100 ∆ : 20 ∼ 100 ∆ : 25 ∼ 100AND-XOR 7775 5555 5000

AES_PPRM3S-box by 6425 5230 36503-stage ∆ : 25 < 100 ∆ : 10 300 ∆ : 10 < 100AND-XOR 5150 5130 3420

AES_Comp composite field 6325 5200 3700S-box, only ∆ : 25 ∼ 100 ∆ : 10 ∼ 100 ∆ : 10 ∼ 100

ENC_top encryption 5100 5130 3410

AES_CTRcomposite field 4120 3310 3280S-box, ∆ : 20 ∼ 100 ∆ : 5 ∼ 100 ∆ : 5 ∼ 100counter mode 3660 3260 3210

AES_FA composite field 10600 6560 6800S-box, ∆ : 10 15 000 ∆ : 5 100 000 ∆ : 10 15 000

Round 1 fault 10440 6510 6670Round 2 detection 10700 50 000 7050 50 000 6980 50 000

AES_PKGcomposite field 6325 5360 3850S-box, precomp. ∆ : 25 ∼ 100 ∆ : 10 ∼ 100 ∆ : 10 ∼ 100roundkeys 5100 5130 3370

AES_MAODPA count. 8475 5900 4500by Masked ∆ : 25 3000 ∆ : 5 5000 ∆ : 25 4000And Operation 6250 5850 4300

AES_MDPLDPA count. 12825 9350 5800by MDPL ∆ : 25 < 100 ∆ : 25 < 100 ∆ : 5 < 100logic style 10850 8050 5260

AES_TIDPA count. 10860 5900 6340by Threshold ∆ : 20 500 000 ∆ : 5 600 000 ∆ : 20 500 000Implementation 9800 5850 5940

AES_WDDLDPA count. 6750 5250 3835by WDDL ∆ : 10 < 100 ∆ : 5 < 100 ∆ : 5 < 100logic style 5730 5150 3675

AES_PRDPA count. 31685 14400 6650by pseudo RSL ∆ : 10 100 000 ∆ : 20 150 000 ∆ : 25 5000logic style 31055 13840 6150

AES_WODPA count. 7575 5910 3900by pseudo RSL ∆ : 25 ∼ 100 ∆ : 10 ∼ 100 ∆ : 25 ∼ 100(evaluation) 6475 5430 3600

One Attack to Rule Them All: Collision Timing Attack ... · One Attack to Rule Them All: Collision...

Documents

Transcript of One Attack to Rule Them All: Collision Timing Attack ... · One Attack to Rule Them All: Collision...