
Hardware Implementation and Side-Channel Analysis of Lapin

Lubos Gaspar1, Gaëtan Leurent1,2, and François-Xavier Standaert1

1 ICTEAM/ELEN/Crypto Group, Université catholique de Louvain, Belgium.
2 Inria, EPI SECRET, Rocquencourt, France.

e-mails: {lubos.gaspar,fstandae}@uclouvain.be, [email protected]

Abstract. Lapin is a new authentication protocol that has been designed for low-cost implementations. In a work from RFIDsec 2012, Bernstein and Lange argued that at similar (mathematical) security levels, Lapin's performances are below the ones of block cipher based authentication. In this paper, we suggest that as soon as physical security (e.g. against side-channel attacks) is taken into account, this criticism can be mitigated. For this purpose, we start by investigating masked hardware implementations of Lapin, and discuss the gains obtained over software ones. Next, we observe that the structure of our implementations significantly differs from block cipher ones (for which most results in side-channel analysis apply), hence raising questions regarding how to evaluate physical security in this case. We then provide first results of side-channel analyses against unprotected and masked Lapin. Despite interesting properties of the masked implementations, our conclusions are still contrasted because of the on-chip randomness requirements of the Lapin protocol. These results give strong incentive to design similar but deterministic protocols, e.g. based on the recently introduced Learning With Rounding assumption.

Keywords: LPN, Ring-LPN, masking, side-channel analysis

1 Introduction

In [9], Heyse et al. proposed the Lapin authentication protocol, based on the hardness of the Ring-LPN problem. The authors describe two different Lapin variants based on a carefully chosen ring R = F2[X]/f(X). In the first variant, the ring is constructed with respect to an irreducible polynomial f(X) over F2, so that the ring becomes a Galois field. In the second variant, the polynomial f(X) is reducible and factors into distinct irreducible factors over F2, leading to improved performances (only this second variant will be considered next).

This work has been funded in part by the ERC project 280141 (acronym CRASH), by the European Commission's 7th Framework Programme project TAMPRES, and by the Belgian Cybercrime Center of Excellence for Training, Research and Education (B-CCENTRE). F.-X. Standaert is an associate researcher of the Belgian Fund for Scientific Research (FNRS-F.R.S.).


The claim that such a protocol could provide better performances than standard solutions using block ciphers gave rise to some debate, as witnessed by the work of Bernstein and Lange [3]. In that work, the authors strongly argued against Lapin, because of its unclear security level and performances that are anyway below the ones of lightweight ciphers. In the present paper, we aim to mitigate these criticisms in light of the interesting properties of Lapin regarding side-channel resistance. Namely, we would like to argue that³ as the (physical) security level against side-channel attacks required by some application increases, Lapin gradually becomes an interesting alternative to the AES. The main reason for this interesting feature is the linearity of its core operations.

For this purpose, we first propose a generic hardware architecture for Lapin, and detail the performance gains that can be obtained from its implementation in an FPGA (compared to previous software implementations of unprotected Lapin and masked AES). Next, we provide a preliminary evaluation of its side-channel properties. Interestingly, the situation of Lapin can be compared to recent investigations of randomness extractors against side-channel attacks [15]. Namely, they can both be masked quite efficiently, while raising questions regarding how to best exploit/evaluate side-channel leakage. As a first step in this direction, we suggest two ways to mount attacks against Lapin: one non-divide-and-conquer DPA-like attack, and one divide-and-conquer collision-like attack, exploiting the correlation between the leakage corresponding to multiple messages.

Overall, these results suggest that Lapin could be a promising candidate for (reasonably) lightweight and physically secure implementations. Yet, and admittedly, a significant drawback remains that it requires the generation of randomness on-chip, which may be an issue both from the performance and the physical security points of view. Like the previous work in [9], we have ignored this part of the problem so far, leading to two important questions for further research. First, how can this randomness be generated efficiently and in a leakage-resilient manner? Second, can we build an authentication protocol similar to Lapin, but deterministic, e.g. using the recently introduced Learning With Rounding assumption [1, 2]?

2 Background

In this section, we recall the Lapin authentication protocol and the masking countermeasure.

2.1 The Lapin protocol

Lapin is a two-round authentication protocol, illustrated in Figure 1. It is defined over the ring R = F2[X]/f(X), where f is a polynomial over F2 of degree n. The initial public parameters are: λ – security level parameter (in bits); π – a mapping {0, 1}^λ → R; τ ∈ (0, 1/2) – Bernoulli distribution parameter; τ′ ∈ (τ, 1/2) – reader acceptance threshold. Besides, the secret key of the tag and reader is defined as

K = (s, s′), with (s, s′) ←$ R². The protocol is executed as follows.

3 Up to some limitations related to the randomness requirements of Lapin – see below.


Public parameters: R, π : {0, 1}^λ → R, τ, τ′, λ.
Secret key: K = (s, s′) ∈ R².

Reader:  (1) c ←$ {0, 1}^λ, send c to the tag.
Tag:     (2) r ←$ R*;  e ←$ Ber^R_τ.
         (3) z := r · (s · π(c) ⊕ s′) ⊕ e, send (r, z) to the reader.
Reader:  (4) if r ∉ R*, reject.
         (5) e′ := z ⊕ r · (s · π(c) ⊕ s′).
         (6) if HW(e′) > n · τ′, reject; else accept.

Fig. 1. Two-round Lapin authentication protocol.

After the tag is detected in the reader's vicinity, the reader randomly generates a challenge c ∈ {0, 1}^λ and sends it to the tag (step 1 in Fig. 1). The λ parameter determines the security level of the protocol (e.g. λ = 80 bits). In the meantime, the tag generates the parameters r and e (step 2). The parameter r is a uniformly chosen element of the ring R*, and e is a low-weight ring element chosen with the Bernoulli distribution over F2 (Ber_τ) with parameter (bias) τ ∈ (0, 1/2) (i.e., Pr[X = 1] = τ if X ← Ber_τ). After receiving the challenge c, the tag maps the challenge to the ring through π, where π satisfies π(c) ⊕ π(c′) ∈ R \ R* ⇔ c = c′. We denote by R* the set of elements in R that have a multiplicative inverse. Subsequently, the tag responds with (r, z = r · K(c) ⊕ e) ∈ R × R (step 3), where K(c) = s · π(c) ⊕ s′ is the session key that depends on the shared secret key K = (s, s′) ∈ R² and the challenge c. The reader accepts if e′ = z ⊕ r · K(c) (computed in step 5) is a polynomial of low weight (step 6). More details on the Lapin protocol and all necessary security proofs can be found in [9].
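To make the tag's computation concrete, the following minimal Python sketch models steps 2 and 3 of Figure 1 over a toy ring, with polynomials over F2 encoded as integers (bit i holds the coefficient of X^i). The helper names, the identity mapping used in place of π and the simplified sampling of r (no invertibility check) are illustrative assumptions, not part of [9].

import random
import secrets

def pmul(a, b):
    # carry-less ("schoolbook") multiplication in F2[X]
    res = 0
    while b:
        if b & 1:
            res ^= a
        a <<= 1
        b >>= 1
    return res

def pmod(a, f):
    # reduction of a modulo f in F2[X]
    df = f.bit_length() - 1
    while a and a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a

def tag_response(c, s, s_prime, f, tau=1/6):
    # steps 2-3 of Fig. 1: pick r and a low-weight e, return (r, z)
    n = f.bit_length() - 1
    r = secrets.randbits(n)                                     # crude stand-in for r <-$ R*
    e = sum(1 << i for i in range(n) if random.random() < tau)  # Ber_tau coefficients
    z = pmod(pmul(r, pmod(pmul(s, c), f) ^ s_prime), f) ^ e     # z = r*(s*pi(c) xor s') xor e
    return r, z

The reader side would recompute e′ = z ⊕ r · (s · π(c) ⊕ s′) with the same helpers and accept if HW(e′) ≤ n · τ′.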

Chinese Remainder Theorem (CRT) representation: In this work, we focus on versions of Lapin over a ring R = F2[X]/f(X) where f(X) factors into distinct irreducible factors over F2. For an element h in the ring F2[X]/f(X), we will denote by h̄ its CRT representation with respect to the factors of f(X) (for simplicity, f(X) will further be denoted only as f). In other words, if f = f1 · f2 · · · fm where all fj are irreducible, then:

h̄ := (h mod f1, . . . , h mod fm),    h̄i := h mod fi.

For the protocol to be implemented efficiently, all public and private values must be transformed to the CRT domain. However, in order to obtain the resulting tag response (r, z), it must be reconstructed from the CRT response (r̄, z̄) as follows:

(r, z) = ( ⊕_{i=1}^{m} r̄i · (f/fi) · [(f/fi)^{-1} mod fi] ,  ⊕_{i=1}^{m} z̄i · (f/fi) · [(f/fi)^{-1} mod fi] ),    (1)

where each factor (f/fi) · [(f/fi)^{-1} mod fi] is a constant that can be precomputed.


Although the constants in the equation can be precomputed, this transformation still involves m multiplications and additions of size n. Since the transformation from and to the CRT representation only uses public values, the tag response (r, z) can be sent to the reader in its CRT representation (r̄, z̄) without decreasing Lapin's security. This way, the computationally expensive transformation can be performed on the reader side.

The challenge mapping π is defined as π(c) = (c, c, c, c, c), i.e. each CRT component is just the challenge padded with zeroes.
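As an illustration of the CRT mapping h ↦ h̄ and of the reconstruction of Equation (1), the following sketch converts a ring element to its CRT representation and back. It reuses the pmul/pmod helpers from the previous sketch; pdiv and pinv (extended Euclidean algorithm in F2[X]) are extra helpers introduced here for illustration only.

def pdiv(a, b):
    # polynomial division in F2[X]: returns (quotient, remainder)
    q, db = 0, b.bit_length() - 1
    while a and a.bit_length() - 1 >= db:
        shift = a.bit_length() - 1 - db
        q ^= 1 << shift
        a ^= b << shift
    return q, a

def pinv(a, f):
    # inverse of a modulo f via the extended Euclidean algorithm (assumes gcd(a, f) = 1)
    t0, t1, r0, r1 = 0, 1, f, a
    while r1:
        q, r = pdiv(r0, r1)
        r0, r1 = r1, r
        t0, t1 = t1, t0 ^ pmul(q, t1)
    return pmod(t0, f)

def to_crt(h, factors):
    # h_bar = (h mod f1, ..., h mod fm)
    return [pmod(h, fi) for fi in factors]

def from_crt(h_bar, factors, f):
    # Equation (1): h = XOR_i  h_bar_i * (f/f_i) * [(f/f_i)^(-1) mod f_i]   (mod f)
    h = 0
    for hi, fi in zip(h_bar, factors):
        cofactor, _ = pdiv(f, fi)                              # f / f_i
        const = pmul(cofactor, pinv(pmod(cofactor, fi), fi))   # precomputable constant
        h ^= pmod(pmul(hi, const), f)
    return h

In the protocol itself this reconstruction is only needed on the reader side, as explained above.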

2.2 The masking countermeasure

Masking is a countermeasure against power analysis attacks based on secret sharing, first proposed by Chari et al. [5] and Goubin et al. [7]. Its main objective is to decrease the correlation between the power consumed by a device and the data being processed, by applying one (or several) random mask(s) to intermediate values. More formally, prior to the execution of the algorithm, all sensitive values (i.e. all key-dependent intermediate results used during the cryptographic computations) must be split into shares. Next, the algorithm is implemented in such a way that the processing is only performed on these shares, which are recombined at the end of the computation to produce the correct output result. Given that the shares are refreshed for each new authentication⁴, masking provides an increase of the side-channel attacks' data complexity that is exponential in their number, under the assumption that the leakage of each share is independent of the others. However, for this exponential security increase to materialize into strong concrete security, it is required that sufficient noise is present in the leakage measurements [18].

Different types of masking schemes have been proposed in the literature. Boolean masking (where the sharing is performed using a bitwise XOR operation) appears as the most natural candidate in our context, since it can take advantage of the linearity of the computations in Lapin. In this context, the split of a sensitive value h into d shares requires the generation of d − 1 random mask values qi, and is defined as follows:

h1 = q1, . . . , h_{d−1} = q_{d−1},    h_d = h ⊕ q1 ⊕ · · · ⊕ q_{d−1}.
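A direct transcription of this sharing in Python (a sketch; the bit size n of the shared value is an assumption of the example):

import secrets
from functools import reduce

def share(h, d, n):
    # d-1 uniformly random masks; the last share carries h masked by all of them
    q = [secrets.randbits(n) for _ in range(d - 1)]
    return q + [reduce(lambda x, y: x ^ y, q, h)]

def unshare(shares):
    # recombination: the XOR of all shares gives back the sensitive value
    return reduce(lambda x, y: x ^ y, shares, 0)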

Based on this sharing, a masked version of Lapin becomes straightforward to implement. The secret keys s, s′ and the low-weight element e are first divided into shares s1, s2, . . . , sd; s′1, s′2, . . . , s′d and e1, e2, . . . , ed, respectively. The final result z is then obtained by recombining the shares z1, z2, . . . , zd that are computed as follows:

4 This implies storing all the shares of the secret key, for which the initial split is assumed to be performed once without leakage (otherwise this initialization can always be the target of simple attacks).


z = (π(c) · s ⊕ s′) · r ⊕ e
  = [π(c) · (s1 ⊕ · · · ⊕ sd) ⊕ (s′1 ⊕ · · · ⊕ s′d)] · r ⊕ (e1 ⊕ · · · ⊕ ed)
  = [(π(c) · s1 ⊕ s′1) · r ⊕ e1] ⊕ · · · ⊕ [(π(c) · sd ⊕ s′d) · r ⊕ ed]
  = z1 ⊕ · · · ⊕ zd,

with zi := (π(c) · si ⊕ s′i) · r ⊕ ei.

Since Lapin is linear, we can compute the shares zi independently, and there is no need for interaction between them, nor for refreshing during the computations, as opposed to masking non-linear gates within the AES [16], for instance. This leads to very efficient implementations.
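The following sketch makes this share-wise computation explicit, reusing pmul/pmod and share/unshare from the previous sketches (π is again taken as the identity, an assumption of the example):

def masked_tag_response(c, s_shares, sp_shares, e_shares, r, f):
    # each z_i = (pi(c) * s_i xor s'_i) * r xor e_i is computed independently
    z_shares = [pmod(pmul(pmod(pmul(si, c), f) ^ spi, r), f) ^ ei
                for si, spi, ei in zip(s_shares, sp_shares, e_shares)]
    # the shares are only recombined on the final, non-sensitive response z
    return unshare(z_shares)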

On the independent leakage assumption: In addition, the linearity of Lapin also allows computing the shares sequentially. This time separation typically reduces the risk of glitches and other hardware effects that are well known to contradict the independence assumption [13]. This is especially interesting in the context of hardware implementations as considered in the next section, since this is the typical context in which glitches can appear [14].

3 Hardware implementation

In this section, we discuss our design choices for (unprotected and masked) Lapin. We present several implementations of a Lapin co-processor and report their area and timing performance. A hardware implementation takes advantage of parallel computing in order to generate sufficient algorithmic noise, which allows significant security gains in practice (and improved performances).

3.1 Generic architecture

In order to implement the Lapin protocol, we use the same parameter values as defined by Heyse et al. [9]. The degree of the polynomial f is chosen as n = 621, the security level parameter as λ = 80 bits, the Bernoulli distribution bias parameters as τ = 1/6 and τ′ = 0.29, and the number of factors of f as m = 5. The five fj polynomials are defined as follows:

f1(X) = X^127 ⊕ X^8 ⊕ X^7 ⊕ X^3 ⊕ 1,
f2(X) = X^126 ⊕ X^9 ⊕ X^6 ⊕ X^5 ⊕ 1,
f3(X) = X^125 ⊕ X^9 ⊕ X^7 ⊕ X^4 ⊕ 1,
f4(X) = X^122 ⊕ X^7 ⊕ X^4 ⊕ X^3 ⊕ 1,
f5(X) = X^121 ⊕ X^8 ⊕ X^5 ⊕ X ⊕ 1.

Assuming that the Lapin protocol is performed on d shares (each of them computed for all m CRT parts), we will denote by hi,j the i-th share of the j-th CRT part of a sensitive variable h.


Next, we propose a flexible architecture that allows splitting the sensitive variables into an arbitrary number of shares. Taking advantage of generic VHDL coding enables the generation of such implementations by re-synthesizing the same code with different parameters. For this purpose, we present the masked Lapin algorithm, the combined polynomial multiplication-reduction algorithm, and the hardware implementation of the complete Lapin core.

Masked Lapin: The generic masked Lapin algorithm is illustrated in Algorithm 1. First, the padded public challenge π(c) is multiplied by the secret key s divided into shares si,j for 1 ≤ i ≤ d, 1 ≤ j ≤ m. Next, the result is added to the secret key s′ divided into shares s′i,j. The sum is then multiplied by the public random tag response rj. Subsequently, the product is added to a low-weight element ei,j. The last step is to sum all resulting shares to form an unmasked tag response zj. Finally, rj and zj are sent back to the reader to finish the authentication process. Note that all computations are performed on all m CRT parts.

Algorithm 1 Masked Lapin algorithm

Input: padded public challenge π(c); public random element r; secret keys s and s′ divided into shares si,j and s′i,j respectively; secret low-weight error element e divided into shares ei,j
Output: response (z, r)

  for j from 1 to m do
    zj ← 0
    for i from 1 to d do
      ti,j ← (π(c) · si,j ⊕ s′i,j) · rj ⊕ ei,j
    end for
    for i from 1 to d do
      zj ← zj ⊕ ti,j
    end for
  end for
  return z

Reduction of a low-weight error element e: Unlike the other parameters in Lapin, the low-weight error element e cannot be generated or pre-stored in CRT representation directly: its m CRT parts must be calculated prior to the other computations. The reduction of such a large element is not straightforward and requires additional hardware resources. In order to simplify this problem, we first write each share of e (a polynomial of n = 621 bits) in Horner form using five polynomials [e^(4), e^(3), e^(2), e^(1), e^(0)] of degree 127 (except for e^(4), of degree 109):


ei,j = ((((e^(4)_i · X^128 ⊕ e^(3)_i) · X^128 ⊕ e^(2)_i) · X^128 ⊕ e^(1)_i) · X^128 ⊕ e^(0)_i) mod fj.    (2)

Next, the polynomial X^128 can be reduced by each characteristic polynomial fj, resulting in constant polynomials gj of degree less than deg(fj). After substitution, Equation 2 becomes:

ei,j = (((e^(4)_i · gj mod fj ⊕ e^(3)_i) · gj mod fj ⊕ e^(2)_i) · gj mod fj ⊕ e^(1)_i) · gj mod fj ⊕ e^(0)_i.    (3)

This way, only four multiplications, four reductions and four additions have to be computed to obtain each ei,j. Moreover, the same hardware as used for performing the computations in Algorithm 1 can be re-used to calculate the ei,j's. Note that since e^(3), e^(2), e^(1) and e^(0) are of degree 127, some extra hardware is still necessary for their reduction.
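A software model of this Horner-style reduction (Equation (3)) is sketched below, reusing pmul/pmod from the earlier sketches; the blocks are assumed to be given most significant first, i.e. [e^(4), e^(3), e^(2), e^(1), e^(0)].

def reduce_error_share(e_blocks, fj):
    # g_j = X^128 mod f_j is the precomputable constant of Eq. (3)
    gj = pmod(1 << 128, fj)
    acc = e_blocks[0]                      # e^(4)
    for blk in e_blocks[1:]:               # e^(3), e^(2), e^(1), e^(0)
        acc = pmod(pmul(acc, gj), fj) ^ blk
    return pmod(acc, fj)                   # final reduction of the degree-127 blocks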

Polynomial multiplication and reduction: Examining the previous algorithms reveals that 6 × d × m polynomial multiplications, reductions and polynomial additions have to be performed to generate a response z. Among those, the most time-consuming operations are the polynomial multiplications and the subsequent reductions of the products. Although the "schoolbook" multiplication algorithm is theoretically slower (O(n²)) than the Karatsuba algorithm (O(n^{log₂3})), it has a very simple structure, and so the resulting implementation is area-efficient and can operate at high clock frequencies. Moreover, the polynomial reduction and multiplication operations can be executed simultaneously in this case, so that no computational time is lost for the reduction step. For this reason, we have implemented a combined polynomial multiplication-reduction based on the schoolbook multiplication, as explained in Algorithm 2.

Two input polynomials ai,j and bi,j of degree at most nj − 1 (represented with bit arrays A[nj − 1 : 0] and B[nj − 1 : 0]) are multiplied together, while the partial products are simultaneously reduced by the characteristic polynomial fj of degree nj (represented with the bit array Fj[nj : 0]). A closer examination of the algorithm shows that B is multiplied by one bit of A at a time. Therefore, if A contains a secret value, Lapin will be vulnerable to a Simple Power Analysis (SPA) attack, where each partial multiplication leaks one bit of the key. For this reason, A must contain only public data. On the contrary, B is processed in larger blocks (according to the datapath size), so we used it to manipulate the sensitive data (which will additionally be protected against DPA thanks to masking).

Lapin architecture: We implemented Lapin as a hardware co-processor core, synthesized using Xilinx ISE 12.4 for Xilinx Virtex-5 XC5VLX50T FPGAs. Our implementation is illustrated in Figure 2. All variables are stored in the data register that is implemented in a dual-port embedded RAM. Random ring elements r and low-weight error elements e have to be generated by a TRNG.


Algorithm 2 Combined polynomial multiplication-reduction

Input: polynomial ai,j, deg(ai,j) ≤ nj − 1, represented as bit array A[nj − 1 : 0]; polynomial bi,j, deg(bi,j) ≤ nj − 1, represented as bit array B[nj − 1 : 0]; characteristic polynomial fj(X) of degree nj, represented as bit array Fj[nj : 0]
Output: c(X) = (ai,j · bi,j) mod fj

  C ← 0
  for i from 1 to nj do
    if A[nj − i] = 1 then
      C ← C ⊕ B
    end if
    if C[nj − 1] = 1 then
      C ← (C ≪ 1) ⊕ Fj[nj − 1 : 0]
    else
      C ← (C ≪ 1)
    end if
  end for
  return C[nj − 1 : 0]
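The sketch below is a software model of this interleaved multiplication and reduction: one bit of the public operand A is processed per iteration, and the running product is kept reduced modulo fj at all times. For clarity it shifts the accumulator before adding B, so the cycle-level ordering differs slightly from the register description of Algorithm 2; polynomials are again encoded as integers.

def mul_reduce(A, B, Fj, nj):
    # interleaved schoolbook multiplication: returns (A * B) mod F_j, with deg(F_j) = nj
    C = 0
    for i in range(nj - 1, -1, -1):        # bits of A, most significant first
        C <<= 1                            # multiply the accumulator by X
        if (C >> nj) & 1:                  # degree reached nj: reduce by F_j
            C ^= Fj
        if (A >> i) & 1:
            C ^= B                         # add B when this bit of A is set
    return C

As discussed above, A holds the public operand (consumed bit-serially) and B the k-bit-wide secret operand in the hardware of Figure 2.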

Three distinctive parts can be identified in the datapath of this Lapin core: the polynomial multiplication logic (shown in blue in Figure 2), the reduction logic (in red) and the addition logic (in green). During a multiplication, a public parameter is stored in the shift register (implemented in logic). By shifting this register, one bit is selected at a time and multiplied with the secret parameter. The resulting partial product is stored in the accumulator (also implemented in a dual-port RAM). Subsequently, this partial product is shifted and added to the next partial product. Whenever the size of this sum exceeds nj bits (nj = deg(fj)), the reduction circuitry is activated in the next clock cycle to reduce the exceeding bit. Once the multiplication is finished, the result can be summed with the next secret parameter stored in the data register. Prior to this addition, all exceeding bits of this parameter are reduced by an auxiliary reduction circuitry. As can be observed from the figure, the multiplication involves shifting the partial products stored in the accumulator. However, if a partial product of size nj is shifted in more than one clock cycle (which is the case when k < 128), the most significant bit of each shifted word must be stored in a carry bit register Cr. This way, a stored carry bit becomes the least significant bit of the next word in the next clock cycle.

3.2 Performance evaluation

Implementation results: In order to investigate the performance trends resulting from different datapath sizes, we implemented our design for different values of the k parameter in Figure 2, namely we considered k = 8-, 16-, 32-, 64- or 128-bit wide architectures, as summarized in Table 1. These results correspond to an unprotected implementation (i.e. d = 1).


[Figure omitted: block diagram of the LAPIN core datapath, with data register, shift register, accumulator, carry bit register Cr, TRNG input and a k-bit datapath, k = 8, 16, 32, 64 or 128.]

Fig. 2. Datapath with multiplication (blue), reduction (red) and addition (green) circuitry.

Thanks to the linear structure of Lapin, however, the masked versions have essentially the same cost: only the memory requirements increase proportionally to the number of shares, in order to store the intermediate results. We observe that if the datapath size is decreased by half, the number of allocated fine-grained FPGA resources does not always decrease accordingly. This can be explained by the fact that narrower datapaths usually require more multiplexers and more complex control logic. Moreover, this extra logic increases the overall datapath delays, resulting in a lower maximal clock frequency (see the fifth column in Table 1).

Table 1. Implementation results: resource usage and timing information.

Datapath (k)  Slices  BRAM 18kb  BRAM 36kb  fmax (MHz)  Clock cycles
                                                        d = 1    d = 2    d = 3
8             213     2          0          125.3       20,977   41,969   62,961
16            232     2          0          127.5       10,489   20,985   31,481
32            311     1          1          127.2        5,245   10,493   15,741
64            330     0          3          130.2        2,623    5,247    7,871
128           451     0          6          140.3        1,332    2,664    3,996

Timing results: The detailed timing characteristics of our implementations are given in Table 2 (see Appendix B), in which each line corresponds to the number of clock cycles required for the computations in one characteristic polynomial domain (fj), and the last line represents the total number of clock cycles for the full Lapin execution, i.e. one tag response for one authentication request.


The left part of the table shows the results if no masking is used (the secret variables are not divided into shares, i.e. d = 1); its right part summarizes the results for three-share computations (i.e. d = 3). For completeness, the timing characteristics are again provided for all the aforementioned datapath sizes. The datapath and the control logic were designed in order to eliminate cycles with no activity (i.e. pipeline bubbles). As a result, decreasing the datapath size by half results in doubling the number of clock cycles in most cases. The only exception is the 128-bit datapath, where some extra dummy cycles were necessary to avoid RAM read/write collisions.

Comparison: The timing comparison of software AES [16], software Lapin [9] and our hardware Lapin is given in Figure 3. In the case of software Lapin, the cycle counts for the masked versions are extrapolated from the unprotected implementation. These results lead to two main observations.

First, we see that masked Lapin implementations indeed become interesting alternatives to AES ones as the number of shares increases. This is caused by the fact that the implementation cost of non-linear operations (which becomes dominant in masked AES implementations) increases quadratically with d, while this increase is only linear in the case of Lapin. Interestingly, the number of shares for which this gain concretely appears is reasonably small, hence close to practical interest. The software figures we have for protected implementations of AES and Lapin on ATMega suggest that Lapin could become more efficient than AES even with d = 2, but the crossing point can move significantly depending on the masking scheme used and on the optimization level of the implementation.

Second, we see that (as usual) specialized hardware implementations allow a significant optimization of the performances of Lapin. Gains already appear in the comparison of 8-bit architectures: computing a tag response in software requires 112,500 clock cycles (which can be decreased to 30,000 clock cycles if precomputation is allowed); our 1-share 8-bit hardware implementation requires only 20,977 clock cycles in this case (without precomputations). These advantages naturally amplify as we consider larger datapath sizes.

# of shares d   AES softw. [16, 8]   Lapin softw. [9]   Lapin 8b hardw.
1               5,100                112,500            20,977
2               286,844              225,016            41,969
3               572,069              337,532            62,961
4               1,003,154            450,048            83,953
5               1,489,539            562,564            104,945
6               2,095,756            675,080            125,937
7               2,779,561            787,596            146,929

Fig. 3. Number of clock cycles vs. number of shares (d) for software AES [16, 8], software Lapin [9] and hardware Lapin. As the number of shares increases, the computation time increases quadratically for the AES and only linearly for both Lapin implementations.


4 Side-channel analysis of Lapin

The previous section suggests that Lapin is an interesting candidate for masking. First, its linearity allows increasing the number of shares for only a linear implementation cost penalty. Second, it also allows manipulating the shares independently, which implies a better chance to fulfill the independent leakage requirement that is crucial for masking to provide its expected security improvements. Third, it is efficiently implemented in hardware with large datapaths, providing the algorithmic noise that is needed for the exponential data complexity increase of masking to materialize into strong security levels. On the other hand, a limitation of this analysis remains that it "only" considers the security orders of the masking schemes (i.e. the minimum number of shares of which the leakage must be exploited to recover key-dependent information). While this is a traditional approach in side-channel analysis, it remains to understand how these security orders translate into actual attack complexities. In particular, since Lapin has a significantly different structure than block ciphers (to which most published higher-order side-channel attacks apply), it is interesting to study attacks that exploit the design of its multiplier. In this section, we consequently suggest several scenarios to analyze/evaluate a Lapin implementation using side-channel information, and we study the efficiency of those attacks depending on the masking order of this implementation. As will be seen, these attacks differ from classical DPA in some interesting respects.

We first point out that we can attack the CRT components independently: each component is computed separately, and we can test a key candidate s̄i against an authentication transcript without knowing the other CRT components. In the following, we describe attacks on a single CRT component.

As a starting point, we consider a non-protected implementation of Lapin, and we study how to apply a standard DPA attack. In order to evaluate power analysis on our implementation of Lapin, we first have to study what parts of the computations are key-dependent, and how this might affect the power consumption. In our analysis, we target the first multiplication s · π(c) in the Lapin protocol, where s is a secret value and c is a challenge that can be set by the attacker. Our architecture for Lapin includes a large accumulator that is updated at each clock cycle, as shown in Figure 2, and we assume that this accumulator will induce a significant leakage dependent on its value a. In order to simplify our analysis, we use a Hamming weight model, i.e. we assume that the power consumption is correlated with HW(a), and we run simulations where the samples are computed as HW(a) + N, with N a Gaussian-distributed random noise. Our attacks will typically exploit the fact that, when computing a · c, the multiplication algorithm updates the accumulator a as (we consider the optimized multiplication with an 80-bit c):

a_0 = 0,    a_{i+1} ← 2 · a_i ⊕ s   if c[80 − i] = 1,
            a_{i+1} ← 2 · a_i       otherwise.


Hence, the value of a after a few cycles of computation is a small multiple of the secret:

a_80 = s · c,    a_i = s · Σ_{j=1}^{i} c[80 − j] · X^{i−j}.
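The simulated traces used in the following experiments can be modelled as below; this is a sketch under the stated Hamming-weight assumption, reusing pmod from the earlier sketches (the reduction step has no effect on the early cycles targeted by the attacks).

import random

def hw(x):
    return bin(x).count("1")

def simulate_trace(s, c_bits, f, sigma, rng=random):
    # accumulator recursion a_{i+1} = 2*a_i xor s when the current challenge bit is 1,
    # each cycle leaking HW(a) plus Gaussian noise of standard deviation sigma
    a, leakage = 0, []
    for bit in c_bits:                     # challenge bits, most significant first
        a = pmod((a << 1) ^ (s if bit else 0), f)
        leakage.append(hw(a) + rng.gauss(0.0, sigma))
    return leakage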

Cautionary note: Assuming Hamming weight leakages is admittedly a simplification of the real measurements used in side-channel attacks. However, it is a reasonable abstraction for preliminary analyses, which has been used in numerous contexts [12]. While the actual complexities provided by these simulated attacks are only meaningful to the extent that true leakages behave similarly, they are usually informative to confirm whether some attack techniques can be successful. This is typically what the following results aim to exhibit, i.e. how side-channel attacks against Lapin differ from standard DPA against block ciphers.

4.1 A first DPA-like attack against unprotected Lapin.

In an unprotected design, the leakage reveals the Hamming weight of multiples of the secret, with a chosen multiplier m_i(c) = Σ_{j=1}^{i} c[80 − j] · X^{i−j}, depending on the challenge c and the cycle i we target. If we exploit several different cycles in a given trace, we can get information about HW(a_i) = HW(s · m_i(c)) for the same c and different values of i. However, the same information can also be obtained by targeting a fixed cycle ι of the computation if we capture several traces and send the appropriate challenges c_j so that m_ι(c_j) = m_j(c).

In a DPA attack, we guess a small part of the key, then predict the value of the leakage for a key guess according to a model, and compare the prediction to the actual measurements in order to rank the key candidates. For a block cipher, the key is usually divided according to the structure of the cipher; for instance, an attack on the AES will target the key bytes independently because each S-box in the first round depends on a single key byte. In the case of Lapin there is no such natural division of the key, but we can study the key bits required to compute some bits of the accumulator.

Recovering a few key bits. For a given t, if m_i(c) is of degree at most t (e.g. if i ≤ t), we can compute the p least significant bits of s · m_i(c) from the p least significant and the t − 1 most significant bits of s. This allows us to build a simple DPA attack: after guessing the key bits, we compute the least significant bits of a_i and we consider the remaining bits as algorithmic noise. We can then compute the correlation between the leaked weight of a and the weight of the predicted bits, and use it to rank the key candidates.

Note that if there is no measurement noise, we can only use 2^t different measures in this attack, because there are only 2^t different polynomials of degree t or lower. Extra measures are only useful to reduce the measurement noise.

We implemented this attack using Pearson's correlation coefficient as a comparison tool [4], and we show an example of results in Figure 8 in Appendix A.


We can see from this example that the information from the measures is not sufficient to recover the secret key bits exactly; there is a cluster of key candidates with the same correlation coefficient. This is due to the algebraic structure of the multiplier; for instance, if we consider two key candidates s and s′ = s · X, our prediction for the least significant bit of the accumulator using the key s will match the second-least significant bit of the accumulator for the candidate s′. Therefore, both candidates will be in the same cluster.

Recovering the full key. As opposed to a typical side-channel attack on a block cipher, we don't have independent parts of the key affecting different parts of the computation. Therefore, we don't attack key parts independently using a divide-and-conquer approach, but we recover key information gradually: if we have a good candidate for n bits of the key, we generate key candidates for n + 1 bits by considering both values for the next key bit. This defines a tree of key candidates, and we explore the tree following the best candidates.

More precisely, we compute a score for the candidates as the ratio of Pearson's correlation coefficient over the expected correlation coefficient for the right key (the square root of the number of predicted bits divided by the total number of bits on the bus over which the Hamming weight is computed). This allows us to compare the quality of key guesses of different lengths: if more key bits are guessed, we can predict more bits, and we expect a better correlation coefficient. The score of a node is computed when its parent is explored, and we select the node with the highest score among all nodes whose score has been computed. In practice, we use a priority queue to store the nodes and to extract the best one efficiently.

If we have 2^t traces, we begin by guessing the t − 1 most significant bits and one least significant bit of the key; this allows us to predict one bit of the accumulator a_t after t cycles. Next, we guess the second-least significant bit of the key, so that we can predict two bits of a_t.
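A skeleton of this best-first exploration is sketched below; score() is a hypothetical callback returning the normalized correlation of a partial key guess, and the leakage handling as well as the initial guess of the t − 1 most significant bits are omitted for brevity.

import heapq

def best_first_key_search(score, key_len, max_nodes=1 << 16):
    # priority queue keyed by -score, so that the best-scoring node is extracted first
    heap = [(-score((b,)), (b,)) for b in (0, 1)]
    heapq.heapify(heap)
    for _ in range(max_nodes):
        if not heap:
            break
        _, bits = heapq.heappop(heap)
        if len(bits) == key_len:
            return bits                    # best full-length candidate reached
        for b in (0, 1):                   # extend the current guess by one key bit
            child = bits + (b,)
            heapq.heappush(heap, (-score(child), child))
    return None                            # give up after exploring max_nodes nodes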

Figure 5 in Appendix A shows the success rate of this DPA-like attack with various parameters. The attack becomes less efficient with a large datapath, because the Hamming weight over a larger bus size introduces more algorithmic noise into the predicted value.

Those experiments clearly show the effect of the algorithmic noise from the unknown bits in the Hamming weight (with variance k/4 for a k-bit datapath), and of the physical noise (with variance σ²). When the physical noise is dominant, i.e. σ² > k/4, we see the data complexity increasing linearly with the variance of the noise, as expected [11]. For instance, our experiments show a data complexity of about 2^7 · σ² to reach a high success rate with k = 8. When k increases, this increases the algorithmic noise, and we have a similar increase of the data complexity if the algorithmic noise is dominant, i.e. when σ² < k/4. We note that this increase is somewhat faster than k/4 because there are more nodes to explore to locate the correct key when k is larger, but we stop after a fixed number of nodes are explored.


4.2 Collision-like attack.

We now describe an attack based on the structure of the operations in Lapin. The main advantage of this attack is that we can eliminate the algorithmic noise due to the Hamming weight with a large datapath by comparing the leakage with two different inputs. This is similar to side-channel collision attacks [17], where two traces are compared to detect specific events.

More precisely, we use the fact that the operation α ↦ α · X has a predictable effect on the Hamming weight of α. We have:

α · X mod f = (α ≪ 1)        if MSB(α) = 0,
α · X mod f = (α ≪ 1) ⊕ f    if MSB(α) = 1,

where ≪ denotes a left shift. Alternatively, we can write it using a rotation ⋘ over deg(f) bits:

α · X mod f = (α ⋘ 1)        if MSB(α) = 0,
α · X mod f = (α ⋘ 1) ⊕ f̄    if MSB(α) = 1,

where f̄ = f ⊕ X^{deg(f)} ⊕ 1 is f without its highest and lowest coefficients. Since the polynomials f used in Lapin are pentanomials, we have HW(f̄) = 3, and we can relate the Hamming weight of α and the Hamming weight of α · X mod f:

HW(α · X mod f) = HW(α)       if MSB(α) = 0,
HW(α · X mod f) = HW(α) + 3   if MSB(α) = 1 and HW((α ⋘ 1) ∧ f̄) = 0,
HW(α · X mod f) = HW(α) + 1   if MSB(α) = 1 and HW((α ⋘ 1) ∧ f̄) = 1,
HW(α · X mod f) = HW(α) − 1   if MSB(α) = 1 and HW((α ⋘ 1) ∧ f̄) = 2,
HW(α · X mod f) = HW(α) − 3   if MSB(α) = 1 and HW((α ⋘ 1) ∧ f̄) = 3.

Therefore, the distribution of HW(α · X) − HW(α) for a random α is the following:

if MSB(α) = 0: HW(α · X) − HW(α) = 0;
if MSB(α) = 1: HW(α · X) − HW(α) = +3 with probability 1/8,
                                   +1 with probability 3/8,
                                   −1 with probability 3/8,
                                   −3 with probability 1/8.
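This distribution is easy to check empirically. The sketch below (reusing pmod and hw from the earlier sketches) samples HW(α · X mod f) − HW(α) for random α and tabulates it by MSB(α); with a pentanomial f it reproduces the 1/8 and 3/8 probabilities derived above.

from collections import Counter
import random

def hw_diff_distribution(f, samples=100000, rng=random):
    # returns counts keyed by (MSB(alpha), HW(alpha*X mod f) - HW(alpha))
    n = f.bit_length() - 1
    cnt = Counter()
    for _ in range(samples):
        a = rng.getrandbits(n)
        diff = hw(pmod(a << 1, f)) - hw(a)
        cnt[(a >> (n - 1)) & 1, diff] += 1
    return cnt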

To exploit this property, we will use two measures such that m_i(c) = m and m_{i′}(c′) = m · X. Then, we can recover the value MSB(m · s) (i.e. a linear equation in s) by comparing HW(m · s) and HW(m · X · s) (we use the analysis above with α = m · s). If there is no noise, we will recover a key bit with only two measures, with probability one.

As opposed to the attack of Section 4.1, this analysis uses the full state of the multiplier, and avoids the algorithmic noise due to the Hamming weight. This makes the attack quite efficient. However, there is also an important limitation:


because the challenge used in Lapin is only 80-bit long, the multiplication m · c only takes 80 cycles, and we can only recover 80 bits from each CRT component of the key with this technique.

If there is some measurement noise, we can remove it either by repeating the measures of HW(m · s) and HW(m · X · s) and averaging them, or by using all the measures in a template attack [6]. We performed simulations with various levels of noise using a template attack, and the rank estimation code from [19] to compute the rank of the full key from the estimated key bit probabilities. We report our results in Appendix A, Figure 6. Those experiments show that with a 128-bit datapath we can recover 80 key bits with very few remaining candidates using only 2^6 traces with a noise variance of 1. Again, the data complexity grows linearly with the noise variance σ², and we need about 2^6 · σ² traces to reduce the key space to a few candidates.

If the datapath width k is smaller than 128, we have to combine 128/k measures to build the full Hamming weight HW(a) in order to perform the attack. If the noise variance is σ², this becomes equivalent to a noise variance of 128/k · σ² for an attack with a 128-bit datapath, and the number of traces required is about 2^6 · σ² · 128/k. As expected, this behavior is opposite to what happens in the attack of Section 4.1, where a larger datapath implies a higher attack complexity because of the extra algorithmic noise.

For the acquisition of the data, one can either extract the two leakages from a single trace at two different points of interest, or use two traces with chosen challenges and extract the leakage from each trace at the same point in time. In order to minimize the number of traces required for the attack, we use all the clock cycles of the multiplier. More precisely, we send the challenge c = 2^79, so that m_i(c) = X^{i−1} and m_{i+1}(c) = X^i = m_i(c) · X.

Order of the attack. This attack exploits information from two different measures, and combines them using the difference operation. Since we use a single pair of challenges for each key bit, we can average the measures, and the information can be extracted from the average leakage values by testing whether it is zero or in {−3, −1, +1, +3}. Therefore, this attack can be seen as a first-order bivariate attack.

4.3 Attack on masked Lapin

We now study how this attack can be applied against a masked implementation of Lapin such as the implementation described in Algorithm 1. In a masked implementation, the multiplication π(c) · s is split into d computations π(c) · s_j, with s = ⊕_{j=1}^{d} s_j. Therefore, we have to combine leakages from each of the d computations to recover information about the secret s.

First, we can see that if there is no noise, it is still easy to recover the key using the attack of Section 4.2. If we send the challenge c = 2^79, we have m_i(c) = X^{i−1} and m_{i+1}(c) = X^i = m_i(c) · X.


[Figure omitted: five joint distributions, labelled 4.1 (MSB(α) = 1, σ(x, y) = 0), 4.2 (MSB(α) = 0, HW((α ⋘ 1) ∧ f̄) = 0, σ(x, y) = 6), 4.3 (MSB(α) = 0, HW((α ⋘ 1) ∧ f̄) = 1, σ(x, y) = 2.25), 4.4 (MSB(α) = 0, HW((α ⋘ 1) ∧ f̄) = 2, σ(x, y) = −2.25) and 4.5 (MSB(α) = 0, HW((α ⋘ 1) ∧ f̄) = 3, σ(x, y) = −6).]

Fig. 4. Possible distributions for (HW(α1 · X) − HW(α1), HW(α2 · X) − HW(α2)); the plotted probabilities take the values 1/16, 2/16, 3/16 and 8/16.

By comparing HW(X^{i−1} · s_j) and HW(X^i · s_j), we can recover MSB(X^{i−1} · s_j), and we can rebuild a bit of s using MSB(X^{i−1} · s) = ⊕_j MSB(X^{i−1} · s_j).

More generally, we study the 2d-dimensional distribution of:

(HW(α_j), HW(α_j · X))_{j=1,...,d},    with α = ⊕_{j=1}^{d} α_j.

Following the analysis of Section 4.2, we combine the measures using a difference operation, and reduce them to d dimensions:

(HW(α_j · X) − HW(α_j))_{j=1,...,d},    with α = ⊕_{j=1}^{d} α_j.

We will later use this analysis with α = m · s and α_j = m · s_j.

We now study the case d = 2 in more detail. Following the analysis of Section 4.2, we expect different distributions depending on the most significant bit of α.

MSB(α) = 1: If MSB(α) = 1, then we have either MSB(α1) = 0 and MSB(α2) = 1, or MSB(α1) = 1 and MSB(α2) = 0. This results in the distribution of Figure 4.1: either HW(α1 · X) − HW(α1) = 0 and HW(α2 · X) − HW(α2) ∈ {−3, −1, +1, +3}, or HW(α1 · X) − HW(α1) ∈ {−3, −1, +1, +3} and HW(α2 · X) − HW(α2) = 0.

MSB(α) = 0: If MSB(α) = 0, then we have either MSB(α1) = 0 and MSB(α2) = 0, or MSB(α1) = 1 and MSB(α2) = 1. The first case gives HW(α1 · X) − HW(α1) = 0 and HW(α2 · X) − HW(α2) = 0. In the second case, we have HW(α1 · X) − HW(α1) ∈ {−3, −1, +1, +3} and HW(α2 · X) − HW(α2) ∈ {−3, −1, +1, +3}, but we need to look at HW((α ⋘ 1) ∧ f̄) in order to predict all the possibilities. The results are shown in Figure 4.
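The joint distributions of Figure 4 can be reproduced empirically with the sketch below (again reusing pmod and hw): for a fixed sensitive value α, fresh 2-sharings α1 ⊕ α2 = α are drawn and the pair of Hamming-weight differences is recorded. Distinguishing MSB(α) = 0 from MSB(α) = 1 then amounts to telling these joint distributions apart, e.g. through their covariance as described below.

import random

def masked_pair_samples(alpha, f, samples=100000, rng=random):
    # sample (HW(a1*X mod f) - HW(a1), HW(a2*X mod f) - HW(a2)) with a1 xor a2 = alpha
    n = f.bit_length() - 1
    out = []
    for _ in range(samples):
        a1 = rng.getrandbits(n)
        a2 = a1 ^ alpha
        out.append((hw(pmod(a1 << 1, f)) - hw(a1),
                    hw(pmod(a2 << 1, f)) - hw(a2)))
    return out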

We can use those distributions to mount a template attack against a masked implementation of Lapin. We use c = 2^79, in order to collect traces with HW(X^{i−1} · s_j) and HW(X^i · s_j), and we recover MSB(X^{i−1} · s) by distinguishing the distributions (i.e. we have α = X^{i−1} · s). We report our simulation results in Figure 7 in Appendix A.


We can see that the data complexity increases roughly like the squared variance σ⁴, which is typical of a second-order attack [5]. In our simulations, the rank of the correct key becomes smaller than 2^10 when the data complexity is about 2^10 · σ⁴.

Order of the attack. This attack exploits information from four different measures, and combines pairs of measures using the difference operation. Then we have to distinguish the distributions of Figure 4, which can be done by computing the covariance and testing whether it is zero or in {−6, −2.25, 2.25, 6}. Therefore, this attack can be seen as a second-order 4-variate attack.

5 Conclusion

The previous results suggest that Lapin has interesting properties for secure and efficient masking, because it can be implemented by manipulating the shares independently. They also exhibit that the exploitation of its leakage does not directly derive from standard DPA as applied in the context of block ciphers. Yet, it is possible to mount attacks against both unprotected and masked Lapin, with similar intuitions regarding the security order as for block ciphers. Admittedly, our side-channel experiments are only a first step, and several problems remain open. Technically, it would certainly be worth investigating other leakage models (e.g. distance-based ones) and actual measurements. Besides, it could be interesting to further study the possible presence of more data-dependent algorithmic noise in an implementation (i.e. capturing more than the main register activity), and how to get rid of it by taking advantage of multiple plaintexts in a collision-like attack. Eventually, and as pointed out in the introduction, the problem of on-chip randomness generation remains an important drawback of Lapin. Analyzing its leakage, or designing deterministic protocols based on the Learning With Rounding assumption, are interesting directions for further research.

References

1. Joël Alwen, Stephan Krenn, Krzysztof Pietrzak, and Daniel Wichs. Learning with Rounding, Revisited – New Reduction, Properties and Applications. In Ran Canetti and Juan A. Garay, editors, CRYPTO (1), volume 8042 of Lecture Notes in Computer Science, pages 57–74. Springer, 2013.

2. Abhishek Banerjee, Chris Peikert, and Alon Rosen. Pseudorandom Functions and Lattices. In David Pointcheval and Thomas Johansson, editors, EUROCRYPT, volume 7237 of Lecture Notes in Computer Science, pages 719–737. Springer, 2012.

3. Daniel J. Bernstein and Tanja Lange. Never Trust a Bunny. In Jaap-Henk Hoepman and Ingrid Verbauwhede, editors, RFIDSec, volume 7739 of Lecture Notes in Computer Science, pages 137–148. Springer, 2012.

4. Eric Brier, Christophe Clavier, and Francis Olivier. Correlation Power Analysis with a Leakage Model. In Joye and Quisquater [10], pages 16–29.


5. Suresh Chari, Charanjit S. Jutla, Josyula R. Rao, and Pankaj Rohatgi. Towards Sound Approaches to Counteract Power-Analysis Attacks. In Michael J. Wiener, editor, CRYPTO, volume 1666 of Lecture Notes in Computer Science, pages 398–412. Springer, 1999.

6. Suresh Chari, Josyula R. Rao, and Pankaj Rohatgi. Template Attacks. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, CHES, volume 2523 of Lecture Notes in Computer Science, pages 13–28. Springer, 2002.

7. Louis Goubin and Jacques Patarin. DES and Differential Power Analysis (The "Duplication" Method). In Çetin Kaya Koç and Christof Paar, editors, CHES, volume 1717 of Lecture Notes in Computer Science, pages 158–172. Springer, 1999.

8. Vincent Grosso, François-Xavier Standaert, and Sebastian Faust. Masking vs. Multiparty Computation: How Large Is the Gap for AES? In Guido Bertoni and Jean-Sébastien Coron, editors, CHES, volume 8086 of Lecture Notes in Computer Science, pages 400–416. Springer, 2013.

9. Stefan Heyse, Eike Kiltz, Vadim Lyubashevsky, Christof Paar, and Krzysztof Pietrzak. Lapin: An Efficient Authentication Protocol Based on Ring-LPN. In Anne Canteaut, editor, FSE, volume 7549 of Lecture Notes in Computer Science, pages 346–365. Springer, 2012.

10. Marc Joye and Jean-Jacques Quisquater, editors. Cryptographic Hardware and Embedded Systems – CHES 2004: 6th International Workshop, Cambridge, MA, USA, August 11-13, 2004. Proceedings, volume 3156 of Lecture Notes in Computer Science. Springer, 2004.

11. Stefan Mangard. Hardware Countermeasures against DPA – A Statistical Analysis of Their Effectiveness. In Tatsuaki Okamoto, editor, CT-RSA, volume 2964 of Lecture Notes in Computer Science, pages 222–235. Springer, 2004.

12. Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power Analysis Attacks – Revealing the Secrets of Smart Cards. Springer, 2007.

13. Stefan Mangard, Thomas Popp, and Berndt M. Gammel. Side-Channel Leakage of Masked CMOS Gates. In Alfred Menezes, editor, CT-RSA, volume 3376 of Lecture Notes in Computer Science, pages 351–365. Springer, 2005.

14. Stefan Mangard, Norbert Pramstaller, and Elisabeth Oswald. Successfully Attacking Masked AES Hardware Implementations. In Josyula R. Rao and Berk Sunar, editors, CHES, volume 3659 of Lecture Notes in Computer Science, pages 157–171. Springer, 2005.

15. Marcel Medwed and François-Xavier Standaert. Extractors against side-channel attacks: weak or strong? J. Cryptographic Engineering, 1(3):231–241, 2011.

16. Matthieu Rivain and Emmanuel Prouff. Provably Secure Higher-Order Masking of AES. In Stefan Mangard and François-Xavier Standaert, editors, CHES, volume 6225 of Lecture Notes in Computer Science, pages 413–427. Springer, 2010.

17. Kai Schramm, Gregor Leander, Patrick Felke, and Christof Paar. A Collision-Attack on AES: Combining Side Channel- and Differential-Attack. In Joye and Quisquater [10], pages 163–175.

18. François-Xavier Standaert, Nicolas Veyrat-Charvillon, Elisabeth Oswald, Benedikt Gierlichs, Marcel Medwed, Markus Kasper, and Stefan Mangard. The World Is Not Enough: Another Look on Second-Order DPA. In Masayuki Abe, editor, ASIACRYPT, volume 6477 of Lecture Notes in Computer Science, pages 112–129. Springer, 2010.

19. Nicolas Veyrat-Charvillon, Benoît Gérard, and François-Xavier Standaert. Security Evaluations beyond Computing Power. In Thomas Johansson and Phong Q. Nguyen, editors, EUROCRYPT, volume 7881 of Lecture Notes in Computer Science, pages 126–141. Springer, 2013.


A Simulation results of the side-channel attacks

[Figure omitted: success rate (in %) as a function of the number of traces (log2), one panel per datapath width (8-, 16-, 32-, 64- and 128-bit), with curves for σ = 0, 0.25, 0.5, 1, 2, 4, 8 and 16.]

Fig. 5. DPA attack success rate for full-key recovery, after exploring 2^16 tree nodes.


[Figure omitted: key rank as a function of the number of traces (log2), one panel per noise level σ = 0.25, 1, 2 and 4.]

Fig. 6. Security graphs for the collision-like attack, with k = 128. We assume that all the clock cycles are used for each trace. Alternatively, the attack can be mounted targeting a single point of interest if the data complexity is multiplied by 80.

[Figure omitted: key rank as a function of the number of traces (log2), one panel per noise level σ = 0.25, 1, 2 and 4.]

Fig. 7. Security graphs for the collision-like attack on a masked Lapin, with k = 128.


Fig. 8. Correlation coefficient for the key candidates, depending on the number of traces. We use t = 7 and p = 3, and don't add any noise to the Hamming weights.

B Additional implementation results

Table 2. Number of clock cycles required for the Lapin calculation.

              One share (d = 1)                      Three shares (d = 3)
       8-bit  16-bit  32-bit  64-bit  128-bit   8-bit   16-bit  32-bit  64-bit  128-bit
f1     4,048   2,024   1,012     506      257  12,144    6,072   3,036   1,518     771
f2     4,160   2,080   1,040     520      264  12,480    6,240   3,120   1,560     792
f3     4,208   2,104   1,052     526      267  12,624    6,312   3,156   1,578     801
f4     4,224   2,112   1,056     528      268  12,672    6,336   3,168   1,584     804
f5     4,336   2,168   1,084     542      275  13,008    6,504   3,252   1,626     825
Total 20,977  10,489   5,245   2,623    1,332  62,961   31,481  15,741   7,871   3,996