Argon and Argon2

Designers: Alex Biryukov, Daniel Dinu, and Dmitry Khovratovich
University of Luxembourg, Luxembourg

https://www.cryptolux.org/index.php/Password

https://github.com/khovratovich/Argon

https://github.com/khovratovich/Argon2

Version 1.1 of Argon
Version 1.0 of Argon2

31st January, 2015

Contents

1 Introduction

2 Argon
  2.1 Specification
    2.1.1 Input
    2.1.2 SubGroups
    2.1.3 ShuffleSlices
  2.2 Recommended parameters
  2.3 Security claims
  2.4 Features
    2.4.1 Main features
    2.4.2 Server relief
    2.4.3 Client-independent update
    2.4.4 Possible future extensions
  2.5 Security analysis
    2.5.1 Avalanche properties
    2.5.2 Invariants
    2.5.3 Collision and preimage attacks
    2.5.4 Tradeoff attacks
  2.6 Design rationale
    2.6.1 SubGroups
    2.6.2 ShuffleSlices
    2.6.3 Permutation F
    2.6.4 No weakness, no patents
  2.7 Tweaks
  2.8 Efficiency analysis
    2.8.1 Modern x86/x64 architecture
    2.8.2 Older CPU
    2.8.3 Other architectures

3 Argon2
  3.1 Specification
    3.1.1 Inputs
    3.1.2 Operation
    3.1.3 Indexing
    3.1.4 Compression function G
  3.2 Features
    3.2.1 Available features
    3.2.2 Server relief
    3.2.3 Client-independent update
    3.2.4 Possible future extensions
  3.3 Design rationale
    3.3.1 Cost estimation
    3.3.2 Mode of operation
    3.3.3 Implementing parallelism
    3.3.4 Compression function design
    3.3.5 User-controlled parameters
  3.4 Security analysis
    3.4.1 Security of single-pass schemes
    3.4.2 Ranking tradeoff attack
    3.4.3 Multi-pass schemes
    3.4.4 Security of Argon2 to generic attacks
    3.4.5 Security of Argon2 to tradeoff attacks
  3.5 Performance
    3.5.1 x86 architecture
  3.6 Applications
  3.7 Other details
    3.7.1 Permutation P
    3.7.2 Proof of Lemma 1

4 Change log
  4.1 v2 and Argon2 – 31st January, 2015
  4.2 v1 – 15th April, 2014
  4.3 v1 – 8th April, 2014

Chapter 1

Introduction

Passwords remain the primary form of authentication on various web services. Passwords are usually stored in a hashed form in a server's database. These databases are quite often captured by adversaries, who then apply dictionary attacks, since passwords tend to have low entropy. Protocol designers use a number of tricks to mitigate these issues. Starting from the late 1970s, a password has been hashed together with a random salt value to prevent detection of identical passwords across different users and services. The hash function computations, which became faster and faster due to Moore's law, have been called multiple times to increase the cost of a password trial for the attacker.

In the meantime, the password crackers migrated to new architectures, such as FPGAs, multiple-core GPUs and dedicated ASIC modules, where the amortized cost of a multiple-iterated hash function is much lower. It was quickly noted that these new environments are great when the computation is almost memoryless, but they experience difficulties when operating on a large amount of memory. The defenders responded by designing memory-hard functions, which require a large amount of memory to be computed. The password hashing scheme scrypt [23] is an instance of such a function.

Password hashing schemes also have other applications. They can be used for key derivation from low-entropy sources. Memory-hard schemes are also welcome in cryptocurrency designs [4] if a creator wants to demotivate the use of GPUs and ASICs for mining and promote the use of standard desktops.

Problems of existing schemes. The design of a memory-hard function has proved to be a tough problem. Since the early 1980s it has been known that many cryptographic problems that seemingly require large memory actually allow for a time-memory tradeoff [18], where the adversary can trade memory for time and do his job on fast hardware with low memory. Applied to password-hashing schemes, this means that password crackers can still be implemented on dedicated hardware, even though at some additional cost. The scrypt function, for example, allows for a simple tradeoff where the time-area product remains almost constant.

Another problem with the existing schemes is their complexity. The same scrypt calls a stack of subprocedures, whose design rationale has not been fully motivated (e.g., scrypt calls SMix, which calls ROMix, which calls BlockMix, which calls Salsa20/8, etc.). It is hard to analyze and, moreover, hard to gain confidence in. Finally, it is not flexible in separating time and memory costs. At the same time, the story of cryptographic competitions [2, 25, 3] has demonstrated that the most secure designs come with simplicity, where every element is well motivated and a cryptanalyst has as few entry points as possible.

The ongoing Password Hashing Competition also highlighted the following problems:

• Should the memory addressing (indexing functions) be input-independent, input-dependent, or hybrid? The first type of scheme, where the memory read locations are known in advance, is immediately vulnerable to time-space tradeoff attacks, since an adversary can precompute the missing block by the time it is needed [10]. In turn, the input-dependent schemes are vulnerable to side-channel attacks [24], as the timing information allows for a much faster password search.

• Is it better to fill more memory but suffer from time-space tradeoffs, or to make more passes over the memory to be more robust? This question was quite difficult to answer due to the absence of generic tradeoff tools, which would analyze the security against tradeoff attacks, and the absence of a unified metric to measure the adversary's costs.

• How should the input-independent addresses be computed? Several seemingly secure options have been attacked [10].

• How large should a single memory block be? Reading smaller randomly placed blocks is slower (in cycles per byte) due to the spatial locality principle of the CPU cache. In turn, larger blocks are difficult to process due to the limited number of long registers.

• If the block is large, how should the internal compression function be chosen? Should it be cryptographically secure or more lightweight, providing only basic mixing of the inputs? Many candidates simply proposed an iterative construction and argued against cryptographically strong transformations.

• Which operations should be used by the compression function with the goal of maximizing the adversary's costs at fixed desktop performance? Should we use recent AES instructions, ARX primitives, integer operations, or floating-point operations?

• How can multiple cores of modern CPUs be exploited when they are available? Parallelizing calls to the hashing function without any interaction is subject to simple tradeoff attacks.

Our solutions. We offer two new hashing schemes called Argon and Argon2. Argon is our original submission to PHC. It is a multipurpose hash function that is optimized for the highest resilience against tradeoff attacks, so that any memory reduction, even a small one, leads to significant time and computational penalties. Argon can be used for password hashing, key derivation, or any other memory-hard computation (e.g., for cryptocurrencies).

Argon2 summarizes the state of the art in the design of memory-hard functions. It is a streamlined and simple design. It aims at the highest memory filling rate and effective use of multiple computing units, while still providing defense against tradeoff attacks. Argon2 is optimized for the x86 architecture and exploits the cache and memory organization of recent Intel and AMD processors. Argon2 has two variants: Argon2d and Argon2i. Argon2d is faster and uses data-dependent memory access, which makes it suitable for cryptocurrencies and applications with no threats from side-channel timing attacks. Argon2i uses data-independent memory access, which is preferred for password hashing and password-based key derivation. Argon2i is slower, as it makes more passes over the memory to protect from tradeoff attacks.

We recommend Argon for applications that aim for the highest tradeoff resilience and want to guarantee prohibitive time and computational penalties on any memory-reducing implementation. According to our cryptanalytic algorithms, an attempt to use half of the requested memory (for instance, 64 MB instead of 128 MB) results in the speed penalty factor of 140 and in the penalty $2^{18}$. The penalty grows exponentially as the available memory decreases, which effectively prevents the adversary from using any smaller amount of memory. Such high computational penalties are a unique feature of Argon.

We recommend Argon2 for applications that aim for high performance. Both versions of Argon2 can fill 1 GB of RAM in a fraction of a second, and smaller amounts even faster. It scales easily to an arbitrary number of parallel computing units. Its design is also optimized for clarity to ease analysis and implementation.

Chapter 2

Argon

2.1 Specification

Argon is a password hashing scheme that implements a memory-hard function with memory and time requirements as parameters. It is designed so that any reduction of available memory imposes a significant penalty on the running time of the algorithm.

Argon is a parametrized scheme with two main parameters:

• Memory size m (m_cost). Argon can occupy any number of kilobytes of memory, and its running time is a linear function of the memory size.

• Number of iterations R (t_cost). It also linearly affects the time needed to compute the output (tag) value.

Note that the overall time is affected by both t_cost and m_cost.

Argon employs a t-byte permutation. In this submission it is the 5-round AES-128 with a fixed key, so t = 16.

Argon employs a cryptographic hash function H. It is Blake2b [6].

2.1.1 Input

The hash function Π takes the following inputs:

• Password P of byte length p, $0 \le p < 2^{32}$.

• Salt S of byte length s, $8 \le s < 2^{32}$.

• Memory size m in kilobytes, $1 \le m < 2^{24}$.

• Iteration number R, $2 \le R < 2^{32}$.

• Secret value K of byte length k, $0 \le k \le 16$.

• Associated data X of byte length x, $0 \le x < 2^{32}$.

• Degree of parallelism d, $1 \le d \le 32$.

• Tag length τ, $4 \le \tau \le 2^{32} - 1$.

and outputs a tag of length τ:

$$\Pi : (P, S, m, R, K, X, \tau) \to \mathrm{Tag} \in \{0, 1\}^{8\tau}.$$

The authentication server places the following string alongside the user's credentials:

$$\mathrm{Storage} \leftarrow \tau \,\|\, m \,\|\, R \,\|\, S \,\|\, X \,\|\, \mathrm{Tag},$$

with the secret K stored in another place.

Initial hashing phase. The values p, s, k, R, m, x, τ are known at the start of hashing and are encoded as 4-byte strings using the little-endian convention. Together with P, S, K, X they are fed into a cryptographic hash function H that produces a 256-bit output I:

$$I = H(p, P, s, S, k, K, x, X, d, m, R, \tau).$$

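The value I is partitioned into four 8-byte blocks $I_0, I_1, I_2, I_3$ (256 bits in total). These blocks are then used repeatedly, together with a counter, to construct $n = 1024m/t$ blocks of t bytes each:

$$A_i = I_{i \bmod 4} \,\|\, C_i,$$

where $C_i$ is the 8-byte encoding of the value i.

The scheme then operates on the blocks $A_i$. It is convenient to view them as a matrix with n/32 rows and 32 columns:

$$A = \begin{pmatrix}
A_0 & A_{n/32} & \cdots & A_{n-n/32} \\
A_1 & A_{n/32+1} & \cdots & A_{n-n/32+1} \\
\vdots & \vdots & & \vdots \\
A_{n/32-1} & A_{n/16-1} & \cdots & A_{n-1}
\end{pmatrix}$$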
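The initial hashing phase and the block construction can be sketched in Python as follows. This is only an illustration under stated assumptions, not the reference implementation: it assumes that I is split into four 8-byte words (so that each $A_i$ has t = 16 bytes), that the counter $C_i$ is encoded little-endian, and that the inputs are fed to Blake2b in the order written in the formula for I. The names `initial_blocks` and `le32` are hypothetical.

```python
import hashlib
import struct

T = 16  # block size t in bytes (t = 16 in this submission)


def le32(value):
    """Encode an integer as a 4-byte little-endian string."""
    return struct.pack("<I", value)


def initial_blocks(P, S, K, X, d, m, R, tau):
    """Return the n = 1024*m/t initial blocks A_i = I_{i mod 4} || C_i."""
    h = hashlib.blake2b(digest_size=32)  # 256-bit output I
    for part in (le32(len(P)), P, le32(len(S)), S, le32(len(K)), K,
                 le32(len(X)), X, le32(d), le32(m), le32(R), le32(tau)):
        h.update(part)
    I = h.digest()
    # Assumption: four 8-byte words I_0..I_3.
    words = [I[8 * j: 8 * j + 8] for j in range(4)]
    n = 1024 * m // T
    # Assumption: C_i is the 8-byte little-endian encoding of i.
    return [words[i % 4] + struct.pack("<Q", i) for i in range(n)]


blocks = initial_blocks(b"password", b"saltsalt", b"", b"", 1, 1, 3, 32)
print(len(blocks), len(blocks[0]))
```

With m = 1 (one kilobyte) the sketch produces 64 blocks of 16 bytes, i.e. two groups of 32 blocks.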

Initial round. In the initial round the permutation F is applied to every block $A_i$. The transformation F is the reduced 5-round AES-128 with the fixed key

K_0 = (0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f).

The MixColumns transformation is present in the last round.

Then the transformation SubGroups is applied. It operates on the rows of A, which are called groups, and is defined in Section 2.1.2.

Main rounds. In the main rounds the transformations ShuffleSlices and SubGroups are applied alternately R times. ShuffleSlices operates on columns, which we call slices of A, and is defined in Section 2.1.3.

Finalization round. The first n/2 blocks of the memory are XORed into a 16-byte block $B_0$, and the second n/2 blocks are XORed into $B_1$. Then the hash function H is applied to $B_0 \| B_1$. We iterate H until τ bytes are generated, denoting the entire hash function by $H'$:

$$\begin{aligned}
B_0 &\leftarrow \bigoplus_{0 \le i < n/2} A_i; \\
B_1 &\leftarrow \bigoplus_{n/2 \le i < n} A_i; \\
B_2 \| B_3 &\leftarrow H(B_0 \| B_1); \\
B_4 \| B_5 &\leftarrow H(B_2 \| B_3); \\
&\cdots \\
\mathrm{Tag} &\leftarrow B_2 \| B_4 \| B_6 \ldots = H'(B_0 \| B_1).
\end{aligned}$$

The pseudocode of the entire algorithm is given in Algorithm 1.
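The finalization round can be illustrated with a short Python sketch. It assumes 16-byte blocks, Blake2b with a 256-bit output as H, and that the last 16-byte chunk is truncated when τ is not a multiple of 16; the truncation rule and the function names are assumptions made for this example.

```python
import hashlib


def xor_blocks(blocks, size=16):
    """XOR a sequence of equally sized byte blocks together."""
    acc = bytearray(size)
    for blk in blocks:
        for i, byte in enumerate(blk):
            acc[i] ^= byte
    return bytes(acc)


def H(data):
    """Blake2b with a 256-bit output."""
    return hashlib.blake2b(data, digest_size=32).digest()


def finalize(A, tau):
    """Compute Tag = H'(B_0 || B_1) of length tau bytes from the memory A."""
    n = len(A)
    B0 = xor_blocks(A[: n // 2])
    B1 = xor_blocks(A[n // 2:])
    state = B0 + B1
    tag = b""
    while len(tag) < tau:
        state = H(state)    # B_{2k} || B_{2k+1} <- H(B_{2k-2} || B_{2k-1})
        tag += state[:16]   # keep B_2, B_4, B_6, ...
    return tag[:tau]        # assumption: truncate the final chunk


memory = [bytes([i]) * 16 for i in range(8)]
print(finalize(memory, 40).hex())
```

Each call to H contributes only its first 16 bytes to the tag, while the full 32-byte output is fed into the next call.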

2.1.2 SubGroups

The SubGroups transformation applies the smaller operation Mix to each group of size 32. Mix takes t-byte blocks $A_0, A_1, \ldots, A_{31}$ as input and processes them as follows:

1. 16 new variables $X_0, X_1, \ldots, X_{15}$ are computed as specific linear combinations of $A_0, A_1, \ldots, A_{31}$.

2. New values are assigned to the $A_i$: $A_i \leftarrow F(A_i \oplus F(X_{\lfloor i/2 \rfloor}))$.

Input: t-byte states A_0, A_1, ..., A_{n−1}
s ← n/32
Initial Round:
for 0 ≤ i < n do
    A_i ← F(A_i)
end
SubGroups:
for 0 ≤ j < n/32 do
    Mix(A_j, A_{j+n/32}, ..., A_{31n/32+j})
end
for 1 ≤ i ≤ R do
    ShuffleSlices:
    for 0 ≤ j < d do
        Shuffle(A_{jn/d}, A_{jn/d+1}, ..., A_{(j+1)n/d−1})
    end
    SubGroups:
    for 0 ≤ j < n/32 do
        Mix(A_j, A_{j+n/32}, ..., A_{31n/32+j})
    end
end
Finalization:
B_0 ← ⊕_{0≤i<n/2} A_i
B_1 ← ⊕_{n/2≤i<n} A_i
Tag ← H′(B_0 || B_1)

Algorithm 1: Pseudocode of the Argon hashing scheme with R iterations

The $X_i$ are computed as linear functions of the $A_i$ (motivation given in Section 2.6.1):

$$\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \\ \vdots \\ X_{15} \end{pmatrix}
= L \cdot
\begin{pmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \\ \vdots \\ A_{31} \end{pmatrix},$$

where the addition is treated as bitwise XOR, and L is a (0, 1)-matrix of size 16 × 32:

L =
00010001000100010001000100010001
01010000010100000101000001010000
10101010000000001010101000000000
01010101010101010000000000000000
00000011000000110000001100000011
00000000001100110000000000110011
00000000000000001100110011001100
00000000000011110000000000001111
00001111000011110000000000000000
00000000000000001111111100000000
01000100010001000100010001000100
00100010001000100010001000100010
00001111000000000000111100000000
00000000111100000000000011110000
11110000111100000000000000000000
10001000100010001000100010001000

(2.1)
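The Mix operation can be sketched in Python as follows. Several things here are assumptions made for illustration: the rows of L are a transcription of Equation (2.1) split into 16 rows of 32 entries, the leftmost entry of a row is taken to multiply $A_0$, and `toy_F` is a hash-based stand-in for the 5-round AES permutation F (it is not a permutation and must not be used in practice).

```python
import hashlib

# Rows of L, transcribed from Equation (2.1) (assumed: 16 rows of 32 entries).
L_ROWS = [
    "0001" * 8,
    "01010000" * 4,
    "1010101000000000" * 2,
    "0101010101010101" + "0" * 16,
    "00000011" * 4,
    "0000000000110011" * 2,
    "0" * 16 + "1100110011001100",
    "0000000000001111" * 2,
    "0000111100001111" + "0" * 16,
    "0" * 16 + "11111111" + "0" * 8,
    "0100" * 8,
    "0010" * 8,
    "0000111100000000" * 2,
    "0000000011110000" * 2,
    "1111000011110000" + "0" * 16,
    "1000" * 8,
]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def toy_F(block):
    """Stand-in for F (NOT the 5-round AES of the specification)."""
    return hashlib.blake2b(block, digest_size=16).digest()


def mix(group, F=toy_F):
    """Apply Mix to a group of 32 blocks of 16 bytes each."""
    assert len(group) == 32
    X = []
    for row in L_ROWS:
        acc = bytes(16)
        for j, bit in enumerate(row):
            if bit == "1":          # assumed: column j multiplies A_j
                acc = xor(acc, group[j])
        X.append(acc)
    FX = [F(x) for x in X]          # 16 "middle" calls to F
    # A_i <- F(A_i xor F(X_{floor(i/2)})): 32 further calls to F
    return [F(xor(group[i], FX[i // 2])) for i in range(32)]


group = [bytes([i]) * 16 for i in range(32)]
print(mix(group)[0].hex())
```

The sketch makes 16 + 32 = 48 calls to F per group of 32 blocks, i.e. 1.5 calls per block.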

2.1.3 ShuffleSlices

The ShuffleSlices transformation operates on slices. For d slices, blocks $A_0, A_1, \ldots, A_{n/d-1}$ form the first slice, then $A_{n/d}, A_{n/d+1}, \ldots, A_{2n/d-1}$ form the second slice, etc. Each slice is shuffled independently of the others with the same algorithm Shuffle. Let us denote the 64-bit subwords of a block B by $B^0, B^1$. The Shuffle algorithm is derived from the RC4 permutation algorithm and operates according to Algorithm 2.

[Figure 2.1: Mix transformation. The inputs $A_0, \ldots, A_{31}$ pass through the linear layer L to give $X_0, \ldots, X_{15}$; each $F(X_j)$ is XORed into the inputs, and F is applied again to produce the new $A_0, \ldots, A_{31}$.]

Input: Slice ⟨B_0, B_1, ..., B_{s−1}⟩
j ← s − 1
for 0 ≤ i ≤ s − 1 do
    j ← (B_j^0 + B_i^0) (mod s)
    Swap(B_i, B_j)
end

Algorithm 2: The Shuffle algorithm.
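A minimal Python sketch of the Shuffle algorithm is given below. It assumes that each block is 16 bytes long and that the subword $B^0$ is the first 8 bytes of the block read as a little-endian integer; the byte order and the function name are assumptions made for this example.

```python
import struct


def shuffle(slice_blocks):
    """Data-dependent, RC4-like permutation of the blocks of one slice."""
    B = list(slice_blocks)
    s = len(B)

    def word0(block):
        # Assumption: B^0 is the first 64-bit subword, little-endian.
        return struct.unpack_from("<Q", block, 0)[0]

    j = s - 1
    for i in range(s):
        j = (word0(B[j]) + word0(B[i])) % s
        B[i], B[j] = B[j], B[i]
    return B


blocks = [struct.pack("<QQ", (7 * i + 3) % 11, i) for i in range(8)]
shuffled = shuffle(blocks)
print([struct.unpack("<QQ", b)[1] for b in shuffled])
```

Because the swap targets depend on the block contents, the resulting permutation is only known once the blocks themselves are known, which is what the tradeoff analysis relies on.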

2.2 Recommended parameters

In this section we provide minimal values of t_cost as a function of m_cost that render Argon secure as a hash function. These values are based on the analysis made in Section 2.5. Future cryptanalysis may reduce the minimal t_cost values.

m_cost            1      10      100      $10^3$   $10^4$   $10^5$   $10^6$
Memory used       1 KB   10 KB   100 KB   1 MB     10 MB    100 MB   1 GB
Minimal t_cost    3      3       3        3        3        3        3

Table 2.1: Minimally secure time and memory parameters.

We recommend the following procedure to choose m_cost and t_cost:

1. Figure out the maximum amount of memory $M_{max}$ that can be used by the authentication application.

2. Figure out the maximum allowed time $T_{max}$ that can be spent by the authentication application.

3. Benchmark Argon on the resulting m_cost and the minimal secure t_cost.

4. If the resulting time is smaller than $T_{max}$, then increase t_cost accordingly; otherwise decrease m_cost until the time spent is appropriate.

The reference implementation allows all positive values of t_cost for testing purposes.
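The procedure above can be sketched as a small search loop. The sketch is illustrative only: `benchmark` is a hypothetical callable that returns the measured running time in seconds for given cost parameters, and halving m_cost is just one possible way to "decrease m_cost"; neither is prescribed by the specification.

```python
def choose_costs(m_max, t_max, benchmark, min_t_cost=3):
    """Pick (m_cost, t_cost) that fit into m_max kilobytes and t_max seconds.

    benchmark(m_cost, t_cost) -> seconds is a hypothetical measurement hook.
    """
    m_cost = m_max
    # Step 4, second branch: shrink memory until the minimal t_cost fits.
    while m_cost > 1 and benchmark(m_cost, min_t_cost) > t_max:
        m_cost = max(1, m_cost // 2)   # assumption: halve at each step
    # Step 4, first branch: spend the remaining time on more iterations.
    t_cost = min_t_cost
    while t_cost < 2**32 - 1 and benchmark(m_cost, t_cost + 1) <= t_max:
        t_cost += 1
    return m_cost, t_cost


def model(m_cost, t_cost):
    """Toy timing model: time proportional to (1.5 R + 2.5) * m."""
    return m_cost * (1.5 * t_cost + 2.5) / 1_000_000


print(choose_costs(100_000, 0.9, model))
print(choose_costs(100_000, 0.2, model))
```

In a real deployment the toy `model` would be replaced by an actual timing of the hashing function on the target machine.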

2.3 Security claims

Collision and forgery resistance. We claim that Argon with an m-bit tag and recommended cost parameters provides m/2 bits of security against collision attacks and m bits of security against forgery attacks.

Time-memory tradeoff resistance. We claim that Argon imposes a significant computational penalty on the adversary even if he uses as much as half of the normal memory amount. Our best tradeoff algorithms achieve the following penalties for R = 3 (Section 2.5):

Attacker's fraction \ Regular memory    128 KB    1 MB        16 MB       128 MB      1 GB
1/2                                     91        112         139         160         180
1/4                                     164       314         $2^{18}$    $2^{26}$    $2^{34}$
1/8                                     6085      $2^{20}$    $2^{31}$    $2^{36}$    $2^{47}$

Side-channel attacks. The optimized implementation of Argon on recent Intel/AMD processors is expected to use the AES instructions, which essentially rules out side-channel attacks on the AES core. The ShuffleSlices transformation is data-dependent and hence potentially depends on the cache behaviour. To partially thwart timing attacks, our optimized implementation selects slices in random order.

2.4 Features

We offer Argon as a password hashing scheme for architectures equipped with recent Intel/AMD processors that support dedicated AES instructions. The key feature of Argon is a large computational penalty imposed on an adversary who wants to save even a small fraction of memory. We aimed to make the structure of Argon as simple as possible, using only XORs, swaps, and (reduced) AES calls when processing the internal state.

2.4.1 Main features

Now we provide an extensive list of features of Argon.

Design rationality. Argon follows a design strategy that has proved itself in the design of cryptographic hash functions and block ciphers. After the memory is filled, the internal state undergoes a sequence of identical rounds. The rounds alternate nonlinear transformations that operate row-wise and provide confusion with block permutations that operate column-wise and provide diffusion. The design of the nonlinear transformations aims to maximize internal diffusion and defeat time-memory tradeoffs. The data-dependent permutation part operates similarly to the well-studied RC4 state permutation.

Tradeoff defense. Thanks to fast diffusion and data-dependent permutation layers, Argon provides strong defense against memory-saving adversaries. Even a memory reduction by a factor of 2 results in an almost prohibitive computational penalty. Our best tradeoff attacks on the fastest set of parameters result in a penalty factor of 50 when 1/2 of the memory is used, and a factor of 150 when 1/4 of the memory is used.

Scalability. Argon is scalable both in time and memory dimensions. Both parameters can be changed independently, provided that a certain amount of time is always needed to fill the memory.

Uniformity. Argon treats all the memory blocks equally, and they undergo almost the same number of transformations and permutations. The scheme operates identically for all state sizes, without any artifacts when the size is not a power of two.

Parallelism. Argon may use up to 32 threads/cores in parallel by delegating slice permutations and group transformations to them.

Server relief. Argon allows the server to shift the majority of the computational burden to the client in case of denial-of-service attacks or other situations. The server ultimately receives a short intermediate value, which undergoes a preimage-resistant function to produce the final tag.

Client-independent updates. Argon offers a mechanism to update the stored tag with increased time and memory cost parameters. The server stores the new tag and both parameter sets. When a client next logs into the server, Argon is called as many times as needed, using the intermediate result as an input to the next call.

Possible extensions. Argon is open to future extensions and tweaks that would extend its capabilities. Some possible extensions are listed in Section 2.4.4.

CPU-friendly. The implementation of Argon benefits from modern CPUs that are equipped with dedicated AES instructions [16] and penalizes any attacker that uses a different architecture. Argon extensively uses memory in a random access pattern, which results in a large latency for GPUs and other memory-unfriendly architectures.

Secret input support. Argon natively supports secret input, which can be called key or pepper [14].

2.4.2 Server relief

The server relief feature allows the server to delegate the most expensive part of the hashing to the client. This can be done as follows (suppose that K and X are not used):

1. The server communicates S, m, R, d, τ to the client.

2. The client performs all computations up to the value of $(B_0, B_1)$.

3. The client communicates $(B_0, B_1)$ to the server.

4. The server computes Tag and stores it together with S, m, d, R.

The tag computation function is preimage-resistant, as it is an iteration of a cryptographic hash function. Therefore, leaking the tag value would not allow the adversary to recover the actual password or the value of B by means other than exhaustive search.

2.4.3 Client-independent update

It is possible to increase the time and memory costs on the server side without requiring the client to re-enter the original password. We just compute

$$\mathrm{Tag}_{new} = \Pi(\mathrm{Tag}_{old}, S, m_{new}, R_{new}, \tau_{new}),$$

replace $\mathrm{Tag}_{old}$ with $\mathrm{Tag}_{new}$ in the hash database, and add the new values of m and R to the entry.
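The bookkeeping behind a client-independent update can be sketched as follows. Since Argon itself is not available in the Python standard library, the sketch uses PBKDF2 as a loudly labelled stand-in for Π; PBKDF2 is not memory-hard and has no memory parameter, so only the iteration count and tag length are tracked. The entry layout and all function names are hypothetical.

```python
import hashlib
import hmac


def pi_stand_in(secret, salt, rounds, tag_len):
    """Stand-in for Pi (PBKDF2-HMAC-SHA256, NOT Argon and NOT memory-hard)."""
    return hashlib.pbkdf2_hmac("sha256", secret, salt, rounds, dklen=tag_len)


def upgrade(entry, new_rounds, new_len):
    """Tag_new = Pi(Tag_old, S, ...); keep every parameter set in the entry."""
    new_tag = pi_stand_in(entry["tag"], entry["salt"], new_rounds, new_len)
    return {
        "salt": entry["salt"],
        "params": entry["params"] + [(new_rounds, new_len)],
        "tag": new_tag,
    }


def verify(entry, password):
    """Re-apply every stored parameter set in order, then compare tags."""
    value = password
    for rounds, tag_len in entry["params"]:
        value = pi_stand_in(value, entry["salt"], rounds, tag_len)
    return hmac.compare_digest(value, entry["tag"])


salt = b"saltsalt"
entry = {"salt": salt, "params": [(10, 32)],
         "tag": pi_stand_in(b"hunter2", salt, 10, 32)}
entry = upgrade(entry, 20, 32)   # no password needed for the upgrade
print(verify(entry, b"hunter2"), verify(entry, b"wrong"))
```

The upgrade step touches only the stored tag, which is why the server can raise the cost of every entry in the database without waiting for users to log in.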

2.4.4 Possible future extensions

Argon can be rather easily tuned to support the following extensions, which are not used now for reasons of design clarity or performance:

Support for other permutations. A larger permutation can be easily plugged into the scheme and would not affect its security. For example, the hardware-friendly permutations of Keccak [9], Spongent [11], or Quark [5] may be used. Also, any cryptographically strong 256-bit hash function may replace Blake2b in the initialization and finalization phases.

2.5 Security analysis

2.5.1 Avalanche properties

Due to the use of a cryptographic hash function, a difference in any input bit is likely to spread over all memory blocks, thus making the avalanche immediate. This has been added as a tweak to the original submission.

2.5.2 Invariants

The SubGroups and ShuffleSlices transformations have several symmetric properties:

• Mix transforms a group of identical blocks (flat group) to a group of identical blocks, and vice versa.

• ShuffleSlices transforms a state of flat groups to a state of flat groups.

These properties are mitigated by the use of a counter in the initialization phase. The flat group condition is a 32-collision on 128 bits, which requires $2^{\frac{31 \cdot 128}{32}} = 2^{124}$ controlled input blocks, whereas the counter leaves only 96 bits to the attacker.

2.5.3 Collision and preimage attacks

Since the 4-round AES has the expected differential probability bounded by $2^{-113}$ [21], we do not expect any differential attack on our scheme, which uses the 5-round AES. The same reasoning applies to linear cryptanalysis. Given the huge internal state compared to regular hash functions, we do not expect any other collision or preimage attack to apply to our scheme.

2.5.4 Tradeoff attacks

In this section we list several possible attacks on Argon, where an attacker aims to compute the hash value using less memory. To compute the actual penalty, we first have to compute the number of operations performed in each layer:

• Initial Round: n calls to F, n memory reads and n memory writes (128-bit values);

• ShuffleSlices: 2n memory reads, 2n memory writes;

• SubGroups: 1.5n calls to F, 5n XORs, n memory reads and n memory writes (128-bit values);

• Finalization: n XORs and n memory reads.

Therefore, R-layer Argon that uses 16n bytes of memory performs (1.5R + 2.5)n calls to F, (5R + 6)n XORs, and (6R + 5)n memory accesses.
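The totals follow from counting how often each layer runs: according to Algorithm 1, SubGroups is applied R + 1 times (once after the initial round and once in each main round) and ShuffleSlices R times. A short check of the arithmetic, using only the per-layer counts listed above:

```latex
\begin{align*}
\#F\text{-calls} &= \underbrace{n}_{\text{initial round}}
                  + \underbrace{1.5n\,(R+1)}_{\text{SubGroups}}
                  = (1.5R + 2.5)\,n, \\
\#\text{XORs}    &= \underbrace{5n\,(R+1)}_{\text{SubGroups}}
                  + \underbrace{n}_{\text{finalization}}
                  = (5R + 6)\,n, \\
\#\text{memory accesses} &= \underbrace{2n}_{\text{initial round}}
                  + \underbrace{4n\,R}_{\text{ShuffleSlices}}
                  + \underbrace{2n\,(R+1)}_{\text{SubGroups}}
                  + \underbrace{n}_{\text{finalization}}
                  = (6R + 5)\,n.
\end{align*}
```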

The total amount of memory, which we denote by M, is also equal to 16n bytes.

Let us denote by $Y^0$ the input state after the initial round (before the first call of SubGroups), by $X^l$ the input to ShuffleSlices in round l, and by $Y^l$ the input to SubGroups in round l. The input to the finalization round is denoted by $X^{R+1}$.

In the further analysis we use the following notation:

• $T_F(A)$, where A is either $X^j$ or $Y^j$, denotes the amortized number of F-calls needed to compute a single block of A according to the adversary's memory use.

In the regular computation we have

$$\begin{aligned}
T_F(Y^0) &= 1; \\
T_F(X^l) &= 1.5 + T_F(Y^{l-1}); \\
T_F(Y^l) &= T_F(X^l); \\
T_F(\mathrm{Tag}) &= n\,T_F(Y^R).
\end{aligned}$$

Storing block permutations

Suppose that an attacker stores the permutations produced by the ShuffleSlices transformation, which requires $\frac{\lg n - 5}{128} M$ memory per round (Section 2.6.2). Let us denote by $\sigma_i$ the permutation over $X^i$ realized by ShuffleSlices. The attack algorithm is as follows:

• Sequentially compute $\sigma_1, \sigma_2, \ldots, \sigma_R$. To compute $\sigma_l$, use the knowledge of $\sigma_1, \sigma_2, \ldots, \sigma_{l-1}$.

• For each group in $Y^R$, compute its output blocks in $X^{R+1}$ and accumulate their XOR.

• Compute the tag.

The permutation $\sigma_l$ is computed slicewise by first storing n/32 slice elements and then computing a permutation over them. This requires n accesses to blocks from $X^l$. In order to get a block from $X^l$ one needs to evaluate the corresponding $F(X_i)$ in the Mix transformation, which in turn takes 8 unknown blocks from $Y^{l-1}$. Therefore,

$$T_F(X^l) = 8\,T_F(Y^{l-1}) + 2. \qquad (2.2)$$

Given the stored $\sigma_i$, we have

$$T_F(Y^i) = T_F(X^i).$$

As a result,

$$T_F(X^l) \approx 1.25 \cdot 8^l,$$

and

$$T_F(\sigma_l) = 1.25\,n \cdot 8^l.$$

The attacker computes $X^{R+1}$ groupwise, so the cost of computing a single group of $X^{R+1}$ is $32\,T_F(X^R) + 48$. Therefore,

$$T_F(X^{R+1}) = T_F(X^R) + 1.5;$$

$$\frac{T_F(\mathrm{Tag})}{n} \approx 1.25 \cdot 8^R + 1.5 + \sum_{i=1}^{R} \frac{T_F(\sigma_i)}{n} = 1.25 \cdot 8^R + 1.5 + \sum_{i=1}^{R} 1.25 \cdot 8^i \approx 2.6 \cdot 8^R,$$

and the penalty is

$$P = \frac{2.6 \cdot 8^R}{1.5R + 2.5} \approx 190 \text{ for } R = 3 \text{ and } 1253 \text{ for } R = 4. \qquad (2.3)$$
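The recurrence (2.2) and the penalty (2.3) are easy to evaluate numerically. The following sketch only restates the formulas from the text, with hypothetical function names, to show how quickly the cost grows with the number of rounds.

```python
def t_x(l):
    """T_F(X^l) from recurrence (2.2), with T_F(Y^0) = 1 and T_F(Y^i) = T_F(X^i)."""
    t = 1.0
    for _ in range(l):
        t = 8 * t + 2
    return t


def penalty(R):
    """Penalty (2.3) for an attacker who stores only the permutations."""
    return 2.6 * 8**R / (1.5 * R + 2.5)


for l in (1, 2, 3):
    # Exact recurrence value next to the approximation 1.25 * 8^l.
    print(l, t_x(l), 1.25 * 8**l)

print(round(penalty(3)), round(penalty(4)))  # 190 1253
```

The exact recurrence gives 10, 82, and 658 for l = 1, 2, 3, which is within a few percent of the approximation $1.25 \cdot 8^l$ used in the text.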

This value should be compared to the fraction of memory used:

$$\frac{M_{reduced}}{M} = R\,\frac{\lg n - 5}{128} = R\,\frac{\lg M - 9}{128},$$

where M is counted in bytes.

Some examples for various n are given in Table 2.2.

Memory total            64 KB      1 MB       16 MB      256 MB     1 GB
n                       $2^{12}$   $2^{16}$   $2^{20}$   $2^{24}$   $2^{26}$
R = 3   Memory used     10 KB      250 KB     5 MB       114 MB     500 MB
        Penalty         190
R = 4   Memory used     13 KB      330 KB     6 MB       150 MB     650 MB
        Penalty         1253

Table 2.2: Computational penalty when storing permutations only.
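The memory fraction needed to store the permutations can be recomputed directly from the formula. The sketch below uses a hypothetical helper name; the two forms of the formula agree because M = 16n implies lg M − 9 = lg n − 5.

```python
from math import log2


def stored_fraction(n, R):
    """Fraction of the memory M = 16n needed to store R permutations."""
    return R * (log2(n) - 5) / 128


for total_kb, n in ((64, 2**12), (1024, 2**16), (16 * 1024, 2**20),
                    (256 * 1024, 2**24), (1024 * 1024, 2**26)):
    for R in (3, 4):
        used_kb = stored_fraction(n, R) * total_kb
        print(f"M = {total_kb} KB, R = {R}: about {used_kb:.1f} KB stored")
```

For R = 3 this gives 10.5 KB out of 64 KB and 114 MB out of 256 MB; the other entries of Table 2.2 appear to be rounded more coarsely.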

Storing all Xi and block permutations

Suppose that an attacker stores all the permutations and as many values $F(X_i)$ as possible in each round. Storing the entire level would cost M/2 memory. Let the attacker spend αM memory on $F(X_i)$, α ≤ 0.5, in the initial round. The attack algorithm is as follows:

• Compute a 2α fraction of the $F(X_i)$ in the initial round and store them.

• Sequentially compute $\sigma_1, \sigma_2, \ldots, \sigma_R$. To compute $\sigma_l$, use the knowledge of $F(X_i)$ and $\sigma_1, \sigma_2, \ldots, \sigma_{l-1}$.

• For each group in $Y^R$, compute its output blocks in $X^{R+1}$ and accumulate their XOR.

• Compute the tag.

Then for blocks of $X^1$ that come from stored groups, we get

$$T'_F(X^1) = 1 + T_F(Y^0);$$

and for the others

$$T''_F(X^1) = 8\,T_F(Y^0) + 2.$$

On average we have

$$T_F(X^1) = (8 - 14\alpha)\,T_F(Y^0) + 2 - 2\alpha = 10 - 16\alpha.$$

For the other rounds Equation (2.2) holds. Therefore, we have

$$T_F(\sigma_l) = (10 - 16\alpha)\,n \cdot 8^{l-1};$$

$$T_F(X^R) \approx (10 - 16\alpha) \cdot 8^{R-1}, \qquad \frac{T_F(\mathrm{Tag})}{n} \approx (20 - 32\alpha)\,8^{R-1},$$

and the penalty is

$$P = \frac{(20 - 32\alpha)\,8^{R-1}}{1.5R + 2.5}. \qquad (2.4)$$

For R = 3 the penalties as a function of α are given in Table 2.3.

α          1/2    1/3    1/4    1/8
Penalty    36     86     110    146

Table 2.3: Computational penalty on an adversary who stores some $F(X_i)$ in αM memory and additionally all the permutations.
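Equation (2.4) can be evaluated for the values of α in Table 2.3 with a few lines of Python. The helper names are hypothetical; the first function also re-derives the average cost 10 − 16α from the two cases above, assuming $T_F(Y^0) = 1$.

```python
def avg_cost_x1(alpha, t_y0=1.0):
    """Average T_F(X^1): a 2*alpha fraction of blocks is cheap, the rest is not."""
    cheap = 1 + t_y0          # blocks from stored groups
    costly = 8 * t_y0 + 2     # all other blocks
    return 2 * alpha * cheap + (1 - 2 * alpha) * costly


def penalty_alpha(alpha, R=3):
    """Penalty (2.4) when alpha*M memory is spent on stored F(X_i)."""
    return (20 - 32 * alpha) * 8 ** (R - 1) / (1.5 * R + 2.5)


for alpha in (1 / 2, 1 / 3, 1 / 4, 1 / 8):
    print(f"alpha = {alpha:.3f}: T_F(X^1) = {avg_cost_x1(alpha):.2f}, "
          f"penalty = {penalty_alpha(alpha):.1f}")
```

The computed penalties are about 36.6, 85.3, 109.7, and 146.3, which agrees with Table 2.3 to within roughly one unit.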

Summary

The optimal strategy for an adversary that has βM memory, β < 1, is as follows:

• If

$$\beta \le R\,\frac{\lg M - 9}{128},$$

then the adversary is recommended to spend the memory entirely on storing the permutations produced by ShuffleSlices. For $\beta = l\,\frac{\lg M - 9}{128}$, $0 \le l \le R$, he gets a penalty of about (Eq. (2.3))

$$P(l) = \frac{2.6 \cdot 8^l\,(n/32)^{R-l}}{1.5R + 2.5},$$

where we assume that to compute a permutation online he has to make n/32 queries to the previous layer.

• If

$$\beta > R\,\frac{\lg M - 9}{128},$$

then the adversary stores all the permutations in $R\,M\,\frac{\lg M - 9}{128}$ memory, and spends the rest, $\alpha M = \left(\beta - R\,\frac{\lg M - 9}{128}\right) M$, on storing $F(X_i)$ in SubGroups. If α ≤ 1/2, we have the penalty (Eq. (2.4))

$$P(\alpha) = \frac{(20 - 32\alpha)\,8^{R-1}}{1.5R + 2.5}.$$

Some examples for different α and R = 3 are given in Table 2.3.


2.6 Design rationale

2.6.1 SubGroups

The SubGroups transformation should have the following properties:

• Every output block must be a nonlinear function of input blocks;

• It should be memory-hard in the sense that a certain number of blocks must be stored.

We decided to have the number of input blocks be a power of 2, i.e. $2^k$. We compute r linear combinations $X_1, X_2, \ldots, X_r$ of them:

$$X_i = \bigoplus_{j} \lambda_{i,j} A_j, \qquad \lambda_{i,j} \in \{0, 1\},$$

and compute

$$A_i \leftarrow F(A_i \oplus F(X_{g(i)})).$$

Here function g maps $\{1, 2, \ldots, 2^k\}$ to $\{1, 2, \ldots, r\}$. The function should be maximally balanced.

The values $X_1, X_2, \ldots, X_r$ should not be computable from each other without knowing a certain number of input blocks. As a result, to compute any output block $A_i$, an adversary either must have stored $F(X_{g(i)})$ or has to recompute it online. The latter case determines the computational penalty on a reduced-memory adversary. The actual penalty results from the following parameters:

• The minimal weight w (diffusion degree) of vectors $\lambda_i = (\lambda_{i,1}, \lambda_{i,2}, \ldots, \lambda_{i,2^k})$ or their linear combinations;

• The proportion $r/2^k$ (group rate) of middle F calls compared to the total number of F calls in the group.

Therefore, an adversary who needs the output $A_i$ either stores r block values per group and needs the input $A_i$, or stores nothing and needs w input blocks to compute one output block. To impose the maximum penalty on the adversary, we should maximize both r and w, but take into account that this means a small penalty to the defender as well.

Choice of linear combinations

The requirement that the $\{X_i\}$ not be computable from each other translates into the following. Let $\lambda_i$ be a vector of values of function $f_i$ of k variables $y_1, y_2, \ldots, y_k$. These functions form a linear code Λ, and the minimal weight of a codeword is w.

We consider Reed-Muller codes RM(d, k), which are value vectors of functions of k boolean variables up to degree d. The minimal codeword weight is $2^{k-d}$.

For instance, d = 1 yields k + 1 linearly independent codewords, which are value vectors of functions

$$f_1 = y_1,\quad f_2 = y_2,\quad \ldots,\quad f_k = y_k,\quad f_{k+1} = 1,$$

which have weight $2^{k-1}$ (except for the constant function $f_{k+1}$, whose weight is $2^k$) and generate the following formulas for $X_i$:

$$X_1 = A_2 \oplus A_4 \oplus \cdots \oplus A_{2^k};$$

$$X_2 = A_3 \oplus A_4 \oplus A_7 \oplus A_8 \oplus \cdots \oplus A_{2^k};$$

$$\cdots$$

$$X_{k+1} = A_1 \oplus A_2 \oplus A_3 \oplus \cdots \oplus A_{2^k}.$$

For d = 2 we have $\binom{k}{2} + k + 1$ linearly independent codewords. They may not have the same weight, but it is easy to find those that all have weight $2^{k-2}$: there are k(2k − 1) of them.

Diffusion properties. The more difficult problem is to select a set Λ of codewords such that each index (i.e. each $A_i$) is covered by the maximal number of codewords. For k = 5 we have at maximum 16 linearly independent codewords of weight 8 and length 32 each. Since the sum of weights is 128, each index i may potentially be non-zero in exactly 4 codewords. However, the bitwise sum of such codewords would be the all-zero vector, so such codewords would be linearly dependent. For linearly independent codewords such minimal number is 3, and one of the possible solutions¹, which we use, is given in Table 2.4. It results in matrix L, given in Equation (2.1).

¹ The reference document of Argon v0 contained an incorrect set of codewords, which was the result of a bug in a codeword generation procedure. The current matrix has rank 16, as verified by the symbolic computation system SAGE.


f0 = y1y2 :(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1)

A3 ⊕A7 ⊕A11 ⊕A15 ⊕A19 ⊕A23 ⊕A27 ⊕A31

f1 = y1(y3 + 1) :(0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0)

A1 ⊕A3 ⊕A9 ⊕A11 ⊕A17 ⊕A19 ⊕A25 ⊕A27

f2 = (y1 + 1)(y4 + 1) :(1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

A0 ⊕A2 ⊕A4 ⊕A6 ⊕A16 ⊕A18 ⊕A20 ⊕A22

f3 = y1y5 :(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

A1 ⊕A3 ⊕A5 ⊕A7 ⊕A9 ⊕A11 ⊕A13 ⊕A15

f4 = y2y3 :(0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1)

A6 ⊕A7 ⊕A14 ⊕A15 ⊕A22 ⊕A23 ⊕A30 ⊕A31

f5 = y2y4 :(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1)

A10 ⊕A11 ⊕A14 ⊕A15 ⊕A26 ⊕A27 ⊕A30 ⊕A31

f6 = (y2 + 1)y5 :(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0)

A16 ⊕A17 ⊕A20 ⊕A21 ⊕A24 ⊕A25 ⊕A28 ⊕A29

f7 = y3y4 :(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)

A12 ⊕A13 ⊕A14 ⊕A15 ⊕A28 ⊕A29 ⊕A30 ⊕A31

f8 = y3y5 :(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

A4 ⊕A5 ⊕A6 ⊕A7 ⊕A12 ⊕A13 ⊕A14 ⊕A15

f9 = (y4 + 1)y5 :(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

A16 ⊕A17 ⊕A18 ⊕A19 ⊕A20 ⊕A21 ⊕A22 ⊕A23

f10 = y1(y2 + 1) :(0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)

A1 ⊕A5 ⊕A9 ⊕A13 ⊕A17 ⊕A21 ⊕A25 ⊕A29

f11 = (y1 + 1)y2 :(0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0)

A2 ⊕A6 ⊕A10 ⊕A14 ⊕A18 ⊕A22 ⊕A26 ⊕A30

f12 = y3(y4 + 1) :(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

A4 ⊕A5 ⊕A6 ⊕A7 ⊕A20 ⊕A21 ⊕A22 ⊕A23

f13 = (y3 + 1)y4 :(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0)

A8 ⊕A9 ⊕A10 ⊕A11 ⊕A24 ⊕A25 ⊕A26 ⊕A27

f14 = (y3 + 1)(y5 + 1) :(1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

A0 ⊕A1 ⊕A2 ⊕A3 ⊕A8 ⊕A9 ⊕A10 ⊕A11

f15 = (y1 + 1)(y2 + 1) :(1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)

A0 ⊕A4 ⊕A8 ⊕A12 ⊕A16 ⊕A20 ⊕A24 ⊕A28

Table 2.4: Linearly independent combinations of 32 variables given by 16 functions of degree 2.
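The properties claimed for Table 2.4 (weight 8 for every codeword, rank 16 over GF(2)) can be checked with a few lines of code. The sketch below copies the index sets from the XOR formulas in the table; the helper names `to_mask` and `gf2_rank` are illustrative, and the rank is computed by plain Gaussian elimination on bitmasks rather than with SAGE.

```python
# Index sets copied from Table 2.4: row i lists the indices j with lambda_{i,j} = 1.
CODEWORDS = [
    [3, 7, 11, 15, 19, 23, 27, 31],    # f0
    [1, 3, 9, 11, 17, 19, 25, 27],     # f1
    [0, 2, 4, 6, 16, 18, 20, 22],      # f2
    [1, 3, 5, 7, 9, 11, 13, 15],       # f3
    [6, 7, 14, 15, 22, 23, 30, 31],    # f4
    [10, 11, 14, 15, 26, 27, 30, 31],  # f5
    [16, 17, 20, 21, 24, 25, 28, 29],  # f6
    [12, 13, 14, 15, 28, 29, 30, 31],  # f7
    [4, 5, 6, 7, 12, 13, 14, 15],      # f8
    [16, 17, 18, 19, 20, 21, 22, 23],  # f9
    [1, 5, 9, 13, 17, 21, 25, 29],     # f10
    [2, 6, 10, 14, 18, 22, 26, 30],    # f11
    [4, 5, 6, 7, 20, 21, 22, 23],      # f12
    [8, 9, 10, 11, 24, 25, 26, 27],    # f13
    [0, 1, 2, 3, 8, 9, 10, 11],        # f14
    [0, 4, 8, 12, 16, 20, 24, 28],     # f15
]


def to_mask(indices):
    """Pack a list of block indices into an integer bitmask."""
    mask = 0
    for j in indices:
        mask |= 1 << j
    return mask


def gf2_rank(rows):
    """Rank over GF(2) of bitmask rows, by elimination on the highest set bit."""
    pivots = {}  # highest set bit -> reduced row that owns this pivot
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                break
            row ^= pivots[high]
    return len(pivots)


weights = [len(c) for c in CODEWORDS]
# coverage[j] = number of codewords in which block index j appears
coverage = [sum(j in c for c in CODEWORDS) for j in range(32)]
rank = gf2_rank([to_mask(c) for c in CODEWORDS])
print("rank:", rank, "weights:", set(weights), "coverage range:", min(coverage), max(coverage))
```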


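| k | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|
| d = 1: Diffusion degree | 2 | 4 | 8 | 16 | 32 |
| d = 1: Independent vectors | 3 | 4 | 5 | 6 | 7 |
| d = 1: Rate | 3/4 | 1/2 | 5/16 | 3/16 | 7/64 |
| d = 2: Diffusion degree | 1 | 2 | 4 | 8 | 16 |
| d = 2: Independent vectors | 4 | 7 | 11 | 16 | 22 |
| d = 2: Rate | 1 | 7/8 | 11/16 | 1/2 | 11/32 |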
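The rows of the table above follow from the Reed-Muller parameters stated earlier: diffusion degree $2^{k-d}$, number of independent vectors $\sum_{j \le d} \binom{k}{j}$, and rate equal to that number divided by $2^k$. A small sketch (the function name `rm_parameters` is illustrative) that regenerates them:

```python
from fractions import Fraction
from math import comb


def rm_parameters(d, k):
    """Return (diffusion degree, independent vectors, group rate) for RM(d, k)."""
    diffusion = 2 ** (k - d)                            # minimal codeword weight
    vectors = sum(comb(k, j) for j in range(d + 1))     # dimension of the code
    rate = Fraction(vectors, 2 ** k)                    # r / 2^k
    return diffusion, vectors, rate


for d in (1, 2):
    for k in range(2, 7):
        print(d, k, rm_parameters(d, k))
```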

2.6.2 ShuffleSlices

The ShuffleSlices transformation aims to interleave blocks of distinct groups. Since the blocks in a group are mixed in the SubGroups transformation, it is sufficient to shuffle the blocks in slices independently. This property also allows the computation to be parallelized. Additionally, if the slices are stored in memory sequentially, then the Shuffle transformation works very fast if the entire slice fits into the CPU cache. As there are d slices, the entire ShuffleSlices transformation may avoid cache misses even if the total memory exceeds the cache size by a factor of d.

The exact transformation was taken from the RC4 algorithm [22], whose internal state S of n integers is updated as follows:

i← i+ 1;

j ← j + S[i];

S[i]↔ S[j],

where all indices and additions are computed modulo n. However, this algorithm is parallelizable:

• Read S[1], S[2], . . . , S[d] in parallel;

• Compute j1, j2, . . . , jd also in parallel (d additions can be done in log2 d time).

• Read S[j1], S[j2], . . . , S[jd] in parallel.

• Make all swaps in parallel.

This approach works until $j_k \le d$ for some k, which is unlikely for $d \ll \sqrt{s}$. To avoid the unnecessary parallelism, we modify the j rule as

j ← S[i] + S[j],

which requires reading S[j] before updating j.
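The difference between the two update rules can be sketched as follows. This is only an illustration under stated assumptions: the state is taken to be a list of integers in the range 0..n−1, both counters start at 0, and exactly n steps are made; none of these details are fixed by the text above, and the function name `shuffle` is hypothetical.

```python
def shuffle(state, modified=True):
    """RC4-style swap loop over a list of integers (all arithmetic modulo n).

    modified=False uses the original RC4 rule  j <- j + S[i];
    modified=True  uses the rule               j <- S[i] + S[j],
    which needs S[j] to be read before j can be updated.
    """
    s = list(state)
    n = len(s)
    i = j = 0  # assumption: both counters start at zero
    for _ in range(n):
        i = (i + 1) % n
        if modified:
            j = (s[i] + s[j]) % n
        else:
            j = (j + s[i]) % n
        s[i], s[j] = s[j], s[i]
    return s


permuted = shuffle(range(64))
```

Because every step is a swap, the output is always a permutation of the input, whichever rule is used.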

It is important to note that the resulting permutation can be stored in $n(\lg n - 5)/8$ bytes of memory, which is the $\frac{\lg M - 9}{128}$ part of the total memory cost M, measured in bytes.
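The two expressions can be related by a short calculation, assuming (as elsewhere in this chapter) that the memory consists of n blocks of 16 bytes each, so that M = 16n and $\lg n = \lg M - 4$:

$$\frac{n(\lg n - 5)}{8} = \frac{(M/16)\,(\lg M - 4 - 5)}{8} = \frac{M(\lg M - 9)}{128}.$$

The term $\lg n - 5 = \lg(n/32)$ is the number of bits needed to address one of the n/32 positions within a slice.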

2.6.3 Permutation F

We wanted F to resemble a randomly chosen permutation with no significant differential or linear properties or other properties that might simplify tradeoff attacks. Since we expect our scheme to run on modern CPUs, the AES cipher or its derivatives is a reasonable choice in terms of performance: it is known that the 10-round AES in the Counter mode runs at 0.6 cycles per byte on Haswell CPUs [16]. We considered having a wider permutation, but our optimized implementations of ShuffleSlices that operate on wider blocks were only marginally faster, whereas the SubGroups transformation becomes far more complicated, and in some cases is even a bottleneck.

Therefore, AES itself is a natural choice. Clearly, all 10 rounds are not needed, whereas the 4-round version is known to hide all the differential properties [21]. We decided to take 5 rounds.

2.6.4 No weakness, no patents

We have not hidden any weaknesses in this scheme.

We are aware of no known patents, patent applications, planned patent applications, and other intellectual-property constraints relevant to use of the scheme. If this information changes, we will promptly announce these changes on the mailing list.


2.7 Tweaks

The current version of Argon tweaks the previous version as follows:

• Number of slices in ShuffleSlices is a tunable parameter to fix the degree of parallelism available to the adversary.

• Number of slices and the tag length are now part of the input to further prevent the adversary from amortizing exhaustive search.

• Associated data X can now be processed as part of the input. It can be an arbitrary string, containing application context, user data, or any other parameters that shall be bound to the hash value.

• All the inputs are hashed with a cryptographic hash function (Blake2b) in order to extract the entropy from the password and the salt as soon as possible and put it into a short string I.

• Due to the use of a hash function, password and salt may have arbitrary (up to $2^{32}$ bytes) length.

• The string I is partitioned in only 4 parts, which fill the first 8 bytes of each 16-byte memory block, whereas the rest is taken by the counter. This has been done for simplicity.

• The Shuffle algorithm now employs a new formula to compute the value of the counter j, which should not allow the algorithm to be parallelized, as is the case for RC4.

• To produce the tag, memory is divided in two, then both parts are XORed into two 128-bit blocks. Both blocks are then input to a variable length hash function (modified Blake2b), which produces tags of arbitrary length.

2.8 Efficiency analysis

2.8.1 Modern x86/x64 architecture

Desktops and servers equipped with modern Intel/AMD CPUs are the primary target platform for Argon. When measuring performance we compute the amortized number of CPU cycles needed to run Argon on a certain amount of memory. An optimized implementation may use the following ideas:

• All the operations are performed on 128-bit blocks, which can be stored in xmm registers.

• The F transformation, which is the 5-round AES-128, can be implemented by as few as 6 assembler instructions. The high latency of the AES round instruction aesenc can be mitigated by pipelining. The structure of SubGroups is friendly to this idea: each group first applies F to 16 internal variables in parallel, and then to 32 memory blocks in parallel. The 5-round AES in the counter mode runs at approximately 0.3 cycles per byte. We report the speed of SubGroups from 2.5 to 3 cycles per byte. About a half of this gap should be attributed to 144 128-bit XOR operations.

• We report the speed of ShuffleSlices from 4 to 8 cycles per byte.

Our implementation for R = 3, which is still far from optimal, achieves the speed from 30 cycles per byte (when less than 32 MBytes are used) to 45 cycles per byte (when 1 GB is used) on a single core of the x64 Sandy Bridge laptop with 4 GB of RAM.

2.8.2 Older CPU

The absence of dedicated AES instructions will certainly decrease the speed of Argon. Since the 5-round AES-128 in the Counter mode can run at about 5 cycles per byte, we expect the SubGroups to become a bottleneck in the implementation, with the speed records getting close to 40-50 cycles per byte.

2.8.3 Other architectures

We expect that Argon is unsuitable for the GPU architecture. Even though it is parallelizable and the SubGroups transformation can be run on separate cores, the following ShuffleSlices transformation would need to obtain outputs from each group. This assumes an active use of memory and thus very high latency of ShuffleSlices on GPUs. As a result, the GPU-based password crackers should significantly suffer in performance so that they would not be cost-efficient.

Figure 2.2: Overview of Argon. Transformation F is the 5-round AES-128.

Chapter 3

Argon2

In this chapter we introduce the next generation of the memory-hard hash function Argon, which we call Argon2. Argon2 is built upon the most recent research in the design and analysis of memory-hard functions. It is suitable for password hashing, password-based key derivation, cryptocurrencies, proofs of work/space, etc.

3.1 Specification

There are two flavors of Argon2: Argon2d and Argon2i. The former uses data-dependent memory access to thwart tradeoff attacks. However, this makes it vulnerable to side-channel attacks, so Argon2d is recommended primarily for cryptocurrencies and backend servers. Argon2i uses data-independent memory access, which is recommended for password hashing and password-based key derivation.

3.1.1 Inputs

Argon2 has two types of inputs: primary inputs and secondary inputs, or parameters. Primary inputs are message P and nonce S, which are password and salt, respectively, for the password hashing. Primary inputs must always be given by the user such that

• Message P may have any length from 0 to $2^{32} - 1$ bytes;

• Nonce S may have any length from 8 to $2^{32} - 1$ bytes (16 bytes is recommended for password hashing).

Secondary inputs have the following restrictions:

• Degree of parallelism p may take any integer value from 1 to 64;

• Tag length τ may be any integer number of bytes from 4 to $2^{32} - 1$;

• Memory size m can be any integer number of kilobytes from 8p to $2^{32} - 1$, but it is rounded down to the nearest multiple of 4p;

• Number of iterations t can be any integer number from 1 to $2^{32} - 1$;

• Version number v is one byte 0x10;

• Secret value K may have any length from 0 to 32 bytes;

• Associated data X may have any length from 0 to $2^{32} - 1$ bytes.
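The restrictions on the secondary inputs, together with the rounding of m, can be sketched as a small validation routine. The function name `validate_params` and the returned tuple are illustrative; the number of columns q = ⌊m/p⌋ and the 4 slices are taken from the later description of the memory layout, with one block per kilobyte.

```python
MAX_U32 = 2 ** 32 - 1


def validate_params(p, tau, m, t):
    """Check the parameter ranges and derive the memory layout.

    Returns (m rounded down to a multiple of 4p, columns per lane q,
    blocks per segment q // 4).
    """
    if not 1 <= p <= 64:
        raise ValueError("degree of parallelism p must be in 1..64")
    if not 4 <= tau <= MAX_U32:
        raise ValueError("tag length must be in 4..2^32-1 bytes")
    if not 8 * p <= m <= MAX_U32:
        raise ValueError("memory size must be in 8p..2^32-1 kilobytes")
    if not 1 <= t <= MAX_U32:
        raise ValueError("number of iterations must be in 1..2^32-1")
    m_rounded = (m // (4 * p)) * (4 * p)  # round down to a multiple of 4p
    q = m_rounded // p                    # columns per lane
    return m_rounded, q, q // 4           # segment length with 4 slices


layout = validate_params(p=4, tau=32, m=1000, t=3)
```

For p = 4 and m = 1000 the memory is rounded down to 992 blocks, which gives 248 columns per lane and 62 blocks per segment.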

Argon2 uses compression function G and hash function H. Here H is the Blake2b hash function, and G is based on its internal permutation. The mode of operation of Argon2 is quite simple when no parallelism is used: function G is iterated m times. At step i a block with index φ(i) < i is taken from the memory (Figure 3.1), where φ(i) is either determined by the previous block in Argon2d, or is a fixed value in Argon2i.

Figure 3.1: Argon2 mode of operation with no parallelism.

3.1.2 Operation

Argon2 follows the extract-then-expand concept. First, it extracts entropy from the message and the nonce by hashing them. All the other parameters are also added to the input. The variable length inputs P and S are prepended with their lengths:

H0 = H(d, τ,m, t, v, 〈P 〉, P, 〈S〉, S, 〈K〉,K, 〈X〉, X).

Here $H_0$ is a 32-byte value.

Argon2 then fills the memory with m 1024-byte blocks. For tunable parallelism with p threads, the memory is organized in a matrix B[i][j] of blocks with p rows (lanes) and $q = \lfloor m/p \rfloor$ columns. Blocks are computed as follows:

$$B[i][0] = G(H_0,\ \underbrace{i}_{4\ \text{bytes}} \,\|\, \underbrace{0}_{4\ \text{bytes}}), \quad 0 \le i < p;$$

$$B[i][1] = G(H_0,\ \underbrace{i}_{4\ \text{bytes}} \,\|\, \underbrace{1}_{4\ \text{bytes}}), \quad 0 \le i < p;$$

$$B[i][j] = G(B[i][j-1],\ B[i'][j']), \quad 0 \le i < p,\ 2 \le j < q,$$

where block index [i′][j′] is determined differently for Argon2d and Argon2i, and G is the compression function. Both will be fully defined in the further text. The inputs to G that are not full blocks are prepended by the necessary number of zeros.

If t > 1, we repeat the procedure; however, the first two columns are now computed in the same way as the others:

$$B[i][0] = G(B[i][q-1],\ B[i'][j']);$$

$$B[i][j] = G(B[i][j-1],\ B[i'][j']).$$

When we have done t iterations over the memory, we compute the final block $B_m$ as the XOR of the last column:

$$B_m = B[0][q-1] \oplus B[1][q-1] \oplus \cdots \oplus B[p-1][q-1].$$

The output tag is produced as follows. The hash function H is applied iteratively to $B_m$, each time outputting the first 32 bytes of the 64-byte hash value, until the total number of output bytes reaches τ.
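The final XOR and the tag extraction can be sketched with Python's `hashlib.blake2b`. The chaining used here is an assumption, since the text does not fix it: each call hashes the full 64-byte output of the previous call, starting from $B_m$, and the concatenated output is truncated to τ bytes. The function names are illustrative.

```python
import hashlib
from functools import reduce

BLOCK_SIZE = 1024  # bytes per memory block


def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def final_block(last_column):
    """XOR of the last block of every lane: B[0][q-1] ^ ... ^ B[p-1][q-1]."""
    return reduce(xor_blocks, last_column)


def make_tag(bm, tag_len):
    """Iterate H, keeping the first 32 bytes of each 64-byte digest.

    Assumption: each iteration hashes the previous full digest.
    """
    out = b""
    value = bm
    while len(out) < tag_len:
        value = hashlib.blake2b(value).digest()  # 64-byte Blake2b output
        out += value[:32]
    return out[:tag_len]


# Toy last column of p = 4 lanes (placeholder contents, not real Argon2 blocks).
column = [bytes([i]) * BLOCK_SIZE for i in range(4)]
bm = final_block(column)
tag = make_tag(bm, 40)
```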

3.1.3 Indexing

Now we explain how the index [i′][j′] of the reference block is computed. First, we determine the set of indices $\mathcal{R}$ that can be referenced for given [i][j]. For that we further partition the memory matrix into l = 4 vertical slices. The intersection of a slice and a lane is a segment of length q/l. Thus segments form their own matrix Q of size p × l. Segments of the same slice are computed in parallel, and may not reference blocks from each other. All other blocks can be referenced. Suppose we are computing a block in a segment Q[r][s]. Then $\mathcal{R}$ includes:

• All blocks of segments Q[r′][∗], where r′ < r and ∗ takes all possible values from 0 to p − 1;

• All blocks of segments Q[r′][∗], where r′ > r and ∗ takes all possible values from 0 to p − 1, if it is the second or later pass over the memory;

• All blocks of segment Q[r][s] (current segment) except for the last one;

• For the first block of the segment, $\mathcal{R}$ does not include the previous block of the same lane.

Figure 3.2: Single-pass Argon2 with p lanes and 4 slices.

Let R be the size of $\mathcal{R}$. The blocks in $\mathcal{R}$ are numbered from 0 to R − 1 according to the following rules:

• Blocks outside of the current segment are enumerated first;

• Segments are enumerated from top to bottom, and then from left to right: Q[0][0], then Q[1][0], then up to Q[p − 1][0], then Q[0][1], then Q[1][1], etc.

• Blocks within a segment are enumerated from the oldest to the newest.

Then Argon2d selects a block from R randomly, and Argon2i – pseudorandomly as follows:

• In Argon2d we take the first 32 bits of block B[i][j − 1] and denote this value by J. Then the value J (mod R) determines the block number from $\mathcal{R}$.

• In Argon2i we run G2, the 2-round compression function G, in the counter mode, where the first input is the all-zero block, and the second input is constructed as

$$\bigl(\underbrace{r}_{4\ \text{bytes}} \,\|\, \underbrace{l}_{4\ \text{bytes}} \,\|\, \underbrace{s}_{4\ \text{bytes}} \,\|\, \underbrace{i}_{4\ \text{bytes}} \,\|\, \underbrace{0}_{1008\ \text{bytes}}\bigr),$$

where r is the pass number, l is the lane, s is the slice, and i is the counter starting in each segment from 0. Then we increase the counter so that each application of G2 gives 256 values of J, which are used to reference available blocks exactly as in Argon2d.
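Both ways of obtaining J can be sketched in a few lines. The byte order is an assumption (little-endian is used here; the text above does not state it), and the function names are illustrative. The second function only builds the 1024-byte input to G2; applying G2 itself is not shown.

```python
import struct

BLOCK_SIZE = 1024  # bytes per memory block


def argon2d_reference(prev_block, num_candidates):
    """Argon2d: J is the first 32 bits of the previous block, reduced modulo R."""
    j = int.from_bytes(prev_block[:4], "little")  # assumption: little-endian
    return j % num_candidates


def argon2i_counter_block(pass_no, lane, slice_no, counter):
    """Argon2i: second input to G2, i.e. r || l || s || i || 1008 zero bytes."""
    header = struct.pack("<4I", pass_no, lane, slice_no, counter)  # 4 x 4 bytes
    return header + bytes(BLOCK_SIZE - len(header))


index = argon2d_reference(b"\x0a\x00\x00\x00" + bytes(1020), 7)  # J = 10, R = 7
block = argon2i_counter_block(0, 1, 2, 3)
```

A 1024-byte output of G2 split into 4-byte words yields the 256 values of J mentioned above.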

3.1.4 Compression function G

Compression function G is built upon the Blake2b round function P (fully defined in Section 3.7.1). P operates on the 128-byte input, which can be viewed as 8 16-byte registers (see details below):

P(A0, A1, . . . , A7) = (B0, B1, . . . , B7).

Compression function G(X, Y) operates on two 1024-byte blocks X and Y. It first computes R = X ⊕ Y. Then R is viewed as an 8 × 8 matrix of 16-byte registers R0, R1, . . . , R63. Then P is first applied rowwise, and then columnwise to get Z:

(Q0, Q1, . . . , Q7)← P(R0, R1, . . . , R7);

(Q8, Q9, . . . , Q15)← P(R8, R9, . . . , R15);

. . .

(Q56, Q57, . . . , Q63)← P(R56, R57, . . . , R63);

(Z0, Z8, Z16, . . . , Z56)← P(Q0, Q8, Q16, . . . , Q56);

(Z1, Z9, Z17, . . . , Z57)← P(Q1, Q9, Q17, . . . , Q57);

. . .

(Z7, Z15, Z23, . . . , Z63)← P(Q7, Q15, Q23, . . . , Q63).

Finally, G outputs Z ⊕R:

$$G:\ (X, Y) \;\to\; R = X \oplus Y \;\xrightarrow{\;P\;}\; Q \;\xrightarrow{\;P\;}\; Z \;\to\; Z \oplus R.$$

Figure 3.3: Argon2 compression function G.
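The row-then-column structure of G can be sketched independently of the permutation P. In the sketch below P is passed in as a parameter `perm` that maps a list of eight 16-byte registers to a list of eight 16-byte registers; the real P is the Blake2b round function of Section 3.7.1 and is not implemented here. The two permutations used at the end are toy stand-ins chosen only to exercise the data flow.

```python
REG = 16           # register size in bytes
BLOCK_SIZE = 1024  # 64 registers per block


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def compress(x, y, perm):
    """Structure of G(X, Y): R = X ^ Y, P on rows, P on columns, output Z ^ R."""
    r = xor_bytes(x, y)
    regs = [r[i:i + REG] for i in range(0, BLOCK_SIZE, REG)]  # R0..R63
    q = [None] * 64
    for row in range(8):                       # rowwise application of P
        q[8 * row:8 * row + 8] = perm(regs[8 * row:8 * row + 8])
    z = [None] * 64
    for col in range(8):                       # columnwise application of P
        out = perm([q[8 * row + col] for row in range(8)])
        for row in range(8):
            z[8 * row + col] = out[row]
    return xor_bytes(b"".join(z), r)


def identity(registers):
    """Toy stand-in for P: leaves the registers unchanged."""
    return list(registers)


def rotate(registers):
    """Toy stand-in for P: rotates the eight registers by one position."""
    return registers[1:] + registers[:1]


x = bytes(range(256)) * 4
y = bytes(BLOCK_SIZE)
zero_out = compress(x, y, identity)
rot_out = compress(x, y, rotate)
```

With the identity stand-in, Z equals R and the output is the all-zero block, which illustrates why the final XOR with R makes the properties of P essential.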

3.2 Features

Argon2 is a multi-purpose family of hashing schemes, which is suitable for password hashing, key derivation, cryptocurrencies and other applications that require provably high memory use. Argon2 is optimized for the x86 architecture, but it does not slow down much on older processors. The key feature of Argon2 is its performance and the ability to use multiple computational cores in a way that prohibits time-memory tradeoffs. Several features are not included in this version, but can be easily added later.

3.2.1 Available features

Now we provide an extensive list of features of Argon2.


Design rationality: Argon2 follows a simple design with three main ideas: generic mode of operation, tradeoff-resilient parallelism with synchronization points, and tradeoff-resistant compression function. All design choices are based on intensive research and explained below.

Performance: Argon2 fills memory very fast. Data-independent version Argon2i securely fills the memory spending about a CPU cycle per byte, and Argon2d is twice as fast. This makes it suitable for applications that need memory-hardness but can not allow much CPU time, like cryptocurrency peer software.

Tradeoff resilience: Despite high performance, Argon2 provides a reasonable level of tradeoff resilience. Our tradeoff attacks previously applied to Catena and Lyra2 show the following. With the default number of passes over memory (1 for Argon2d, 3 for Argon2i), an ASIC-equipped adversary can not decrease the time-area product if the memory is reduced by the factor of 4 or more. Much higher penalties apply if more passes over the memory are made.

Scalability: Argon2 is scalable both in time and memory dimensions. Both parameters can be changed independently provided that a certain amount of time is always needed to fill the memory.

Parallelism: Argon2 may use up to 64 threads in parallel, although in our experiments 8 threads already exhaust the available bandwidth and computing power of the machine.

Server relief: Argon2 allows the server to carry out the majority of computational burden on the client in case of denial-of-service attacks or other situations. The server ultimately receives a short intermediate value, which undergoes a preimage-resistant function to produce the final tag.

Client-independent updates: Argon2 offers a mechanism to update the stored tag with increased time and memory cost parameters. The server stores the new tag and both parameter sets. When a client logs into the server next time, Argon2 is called as many times as needed, using the intermediate result as an input to the next call.

Possible extensions: Argon2 is open to future extensions and tweaks that would allow to extend its capabilities. Some possible extensions are listed in Section 3.2.4.

GPU/FPGA/ASIC-unfriendly: Argon2 is heavily optimized for the x86 architecture, so that implementing it on dedicated cracking hardware should be neither cheaper nor faster. Even specialized ASICs would require significant area and would not allow reduction in the time-area product.

Secret input support: Argon2 natively supports secret input, which can be a key or any other secret string.

Associated data support: Argon2 supports additional input, which is syntactically separated from the message and nonce, such as environment parameters, user data, etc.

3.2.2 Server relief

The server relief feature allows the server to delegate the most expensive part of the hashing to the client. This can be done as follows (suppose that K and X are not used):

1. The server communicates S, m, R, d, τ to the client.

2. The client performs all computations up to the value of (B0, B1);

3. The client communicates (B0, B1) to the server;

4. The server computes Tag and stores it together with S, m, d, R.

The tag computation function is preimage-resistant, as it is an iteration of a cryptographic hash function. Therefore, leaking the tag value would not allow the adversary to recover the actual password nor the value of B by other means than exhaustive search.


3.2.3 Client-independent update

It is possible to increase the time and memory costs on the server side without requiring the client to re-enter the original password. We just compute

$$\mathrm{Tag}_{new} = \Pi(\mathrm{Tag}_{old}, S, m_{new}, R_{new}, \tau_{new}),$$

replace $\mathrm{Tag}_{old}$ with $\mathrm{Tag}_{new}$ in the hash database and add the new values of m and R to the entry.

3.2.4 Possible future extensions

Argon2 can be rather easily tuned to support the following extensions, which are not used now for design clarity or performance issues:

Support for other hash functions: Any cryptographically strong hash function may replace Blake2b for the initial and the final phases. However, if it does not support variable output length natively, some care should be taken when implementing it.

Support for other compression functions and block sizes: Compression function can be changed to any other transformation that fulfills our design criteria. It may use shorter or longer blocks.

Support for Read-Only-Memory: ROM can be easily integrated into Argon2 by simply including it into the area where the blocks are referenced from.

3.3 Design rationale

Argon2 was designed with the following primary goal: to maximize the cost of exhaustive search on non-x86 architectures, so that the switch even to dedicated ASICs would not give significant advantage over doing the exhaustive search on defender’s machine.

3.3.1 Cost estimation

We have investigated several approaches to estimating the real cost of the brute-force. As the first approximation, we calculate the hardware design and production costs separately from the actual running costs. The choice of the platform is determined by adversary’s budget: if it is below 100,000 EUR or so, then GPU and/or FPGA are the reasonable choices, whereas attackers with a budget of a million (governments or their subsidiaries) can afford development and production of the dedicated hardware (ASIC).

If ASICs are not considered, the hashing schemes that use more than a few hundred MB of RAM are almost certainly inefficient on a GPU or an FPGA. The efficiency gain of GPUs (compared to regular desktops) is rather moderate even on low-memory computations, which can be parallelized. If we consider a scheme with a low degree of parallelism and intensive memory use, then the high memory latency of the GPU architecture would make the brute-force much less efficient.

Argon2 protects against all types of adversaries, including those who can afford design and production of dedicated hardware for password cracking or cryptocurrency mining. However, modelling the attack cost is not easy. Our first motivation was to calculate the exact running costs [10] by outlining a hypothetical brute-force chip, figuring out the energy consumption of all components and calculating the total energy-per-hash cost. However, we found out that the choice of the memory type and assumptions on its properties affect the final numbers significantly. For example, taking a low-leakage memory [26] results in the underestimation of the real power consumption, because we do not know exactly how it scales with increasing size (the paper [26] proposes a 32-KB chip). On the other hand, the regular DDR3 as a reference also creates ambiguity, since it is difficult to estimate its idle power consumption [1].

To overcome these problems, we turn to a more robust metric of the time-area product [27, 8]. In the further text, we will be interested in the area requirements by memory, by the Blake2 core [7], and the maximum possible bandwidth existing in commercial products.

• The 50-nm DRAM implementation [15] takes 550 mm² per GByte;

• The Blake2b implementation in the 65-nm process should take about 0.1 mm² (using Blake-512 implementation in [17]);

• The maximum memory bandwidth achieved by modern GPUs is around 400 GB/sec.

3.3.2 Mode of operation

We decided to make a simple mode of operation. It is based on the single-pass, single-thread filling of the memory array B[] with the help of the compression function G:

$$\begin{aligned} B[0] &= H(P, S); \\ \text{for } j \text{ from } 1 \text{ to } t:\quad B[j] &= G\bigl(B[\phi_1(j)], B[\phi_2(j)], \ldots, B[\phi_k(j)]\bigr), \end{aligned} \tag{3.1}$$

where $\phi_i()$ are some indexing functions.

This scheme was extended to implement:

• Tunable parallelism;

• Several passes over memory.
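The basic single-pass, single-thread scheme of Equation (3.1) can be sketched for k = 2 with data-dependent addressing. Everything concrete here is an assumption made for illustration: `toy_h` and `toy_g` are Blake2b-based stand-ins (not the real H input encoding and not the real compression function G), blocks are 64 bytes instead of 1024, and the reference index is the first 4 bytes of the previous block reduced modulo j so that it always points to an already computed block.

```python
import hashlib


def toy_h(*parts):
    """Stand-in for H: Blake2b over the concatenated inputs."""
    return hashlib.blake2b(b"".join(parts)).digest()


def toy_g(a, b):
    """Stand-in for G: NOT the Argon2 compression function, just a mixing hash."""
    return hashlib.blake2b(a + b).digest()


def fill(password, salt, t):
    """B[0] = H(P, S); B[j] = G(B[phi(j)], B[j-1]) for j = 1..t."""
    blocks = [toy_h(password, salt)]
    for j in range(1, t + 1):
        prev = blocks[j - 1]
        ref = int.from_bytes(prev[:4], "little") % j  # data-dependent phi(j) < j
        blocks.append(toy_g(blocks[ref], prev))
    return blocks


memory = fill(b"password", b"somesalt", 16)
```

Because each index depends on the previous block, the addresses cannot be computed in advance, which is the property that the data-dependent flavor relies on.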

Number of referenced blocks. The number k is a tradeoff between performance and bandwidth. First, we set $\phi_k(j) = j - 1$ to maximize the total chain length. Then we decide on the exact value of k. One might expect that this value should be determined by the maximum L1 cache bandwidth. Two read ports at the Haswell CPU should have allowed us to spend half as many cycles on reading as on writing. In practice, however, we do not get such a speed-up with our compression function:

| k | Cycles/block | Bandwidth (GB/sec) |
|---|---|---|
| 2 | 1194 | 4.3 |
| 3 | 1503 | 5.1 |

Thus an extra block decreases the speed but increases the bandwidth. Larger values of k would further reduce the throughput and thus the total amount of memory filled in the fixed time. However, an adversary will respond by creating a custom CPU that will be marginally larger but have k load ports, thereby keeping the time the same. Thus we conclude that the area-time product of attacker’s hardware is maximized by k = 2 or k = 3, depending on the compression function. We choose k = 2 as we aim for faster memory fill and more passes over the memory.

Number of output blocks. We argue that only one output block should be produced per step if we use k = 2. An alternative would be to have some internal state of size t′ ≥ t and to iterate F several times (say, l′) with outputting a t-bit block every round. However, it can be easily seen that if l′t < t′, then an adversary could simply keep the state in the memory instead of the blocks and thus get a free tradeoff. If l′t = t′, then, in turn, an adversary would rather store the two input blocks.

Indexing functions. For the data-dependent addressing we set φ(l) = g(B[l − 1]), where g simply truncates the block and takes the result modulo l − 1. We considered taking the address not from the block B[l − 1] but from the block B[l − 2], which should have allowed to prefetch the block earlier. However, not only is the gain in our implementations limited, but this benefit can also be exploited by the adversary. Indeed, the efficient depth D(q) is now reduced to D(q) − 1, since the adversary has one extra timeslot. Table 3.1 implies that then the adversary would be able to reduce the memory by the factor of 5 without increasing the time-area product (which is a 25% increase in the reduction factor compared to the standard approach).

For the data-independent addressing we use a simple PRNG, in particular the compression function G in the counter mode. Due to its long output, one call (or two consecutive calls) would produce hundreds of addresses, thus minimizing the overhead. This approach does not give provable tradeoff bounds, but instead allows the analysis with the tradeoff algorithms suited for data-dependent addressing.

3.3.3 Implementing parallelism

As modern CPUs have several cores possibly available for hashing, it is tempting to use these cores to increase the bandwidth, the amount of filled memory, and the CPU load. The cores of the recent Intel CPU share the L3 cache and the entire memory, which both have large latencies (100 cycles and more). Therefore, the inter-processor communication should be minimal to avoid delays.


The simplest way to use p parallel cores is to compute and XOR p independent calls to H:

H ′(P, S) = H(P, S, 0)⊕H(P, S, 1)⊕ · · · ⊕H(P, S, p).

If a single call uses m memory units, then p calls use pm units. However, this method admits a trivial tradeoff: an adversary just makes p sequential calls to H using only m memory in total, which keeps the time-area product constant.

We suggest the following solution for p cores: the entire memory is split into p lanes of l equal slices each, which can be viewed as elements of a (p × l)-matrix Q[i][j]. Consider the class of schemes given by Equation (3.1). We modify it as follows:

• p invocations to H run in parallel on the first column Q[∗][0] of the memory matrix. Their indexing functions refer to their own slices only;

• For each column j > 0, p invocations to H continue to run in parallel, but the indexing functions now may refer not only to their own slice, but also to all jp slices of previous columns Q[∗][0], Q[∗][1], . . . , Q[∗][j − 1].

• The last blocks produced in each slice of the last column are XORed.

This idea is easily implemented in software with p threads and l joining points. It is easy to see that the adversary can use less memory when computing the last column, for instance by computing the slices sequentially and storing only the slice which is currently computed. Then his time is multiplied by (1 + (p − 1)/l), whereas the memory use is multiplied by (1 − (p − 1)/(pl)), so the time-area product is modified as

\[ AT_{new} = AT \left(1 - \frac{p-1}{pl}\right)\left(1 + \frac{p-1}{l}\right). \]

For 2 ≤ p, l ≤ 10 this value is always between 1.05 and 3. We have selected l = 4 as this value gives low synchronisation overhead while imposing time-area penalties on the adversary who reduces the memory even by the factor 3/4. We note that values l = 8 or l = 16 could be chosen.
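The range quoted above can be checked numerically. The following sketch simply evaluates the time-area multiplier for all 2 ≤ p, l ≤ 10; the function and variable names are illustrative and not part of the specification.

```python
def time_area_factor(p: int, l: int) -> float:
    """Multiplier of the time-area product for the sequential last-column attack."""
    memory_factor = 1 - (p - 1) / (p * l)  # only one slice of the last column is kept
    time_factor = 1 + (p - 1) / l          # the last column is computed slice by slice
    return memory_factor * time_factor


factors = {(p, l): time_area_factor(p, l)
           for p in range(2, 11) for l in range(2, 11)}
lowest = min(factors, key=factors.get)    # parameters giving the smallest penalty
highest = max(factors, key=factors.get)   # parameters giving the largest penalty

print(lowest, round(factors[lowest], 3))
print(highest, round(factors[highest], 3))
```

The smallest multiplier occurs for few lanes and many slices, the largest for many lanes and few slices, which is the tension that the choice of l has to balance against synchronisation overhead.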

3.3.4 Compression function design

Overview

In contrast to attacks on regular hash functions, the adversary does not control inputs to the compression function G in our scheme. Intuitively, this should relax the cryptographic properties required from the compression function and allow for a faster primitive. To avoid being the bottleneck, the compression function ideally should be on par with the performance of memcpy() or a similar function, which may run at 0.1 cycle per byte or even faster. This is much faster than ordinary stream ciphers or hash functions, but we might not need the strong properties of those primitives.

However, we first have to determine the optimal block size. When we request a block from a random location in the memory, we most likely get a cache miss. The first bytes would arrive at the CPU from RAM within at best 10 ns, which accounts for 30 cycles. In practice, the latency of a single load instruction may reach 100 cycles and more. However, this number can be amortized if we request a large block of sequentially stored bytes. When the first bytes are requested, the CPU stores the next ones in the L1 cache, automatically or using the prefetch instruction. The data from the L1 cache can be loaded as fast as 64 bytes per cycle on the Haswell architecture, though we did not manage to reach this speed in our application.

Therefore, the larger the block is, the higher the throughput is. We have made a series of experiments with a non-cryptographic compression function, which does little beyond simple XOR of its inputs, and achieved the performance of around 0.7 cycles per byte per core with block sizes of 1024 bits and larger.

Design criteria

Let us fix the block size t and figure out the design criteria. It appears that collision/preimage resistance and their weak variants are overkill as design criteria for the compression function F. We recall, however, that the adversary is motivated by reducing the time-area product. Let us consider the following structure of the compression function F(X, Y), where X and Y are input blocks:

• The input blocks of size t are divided into shorter subblocks of length t′ (for instance, 128 bits) X0, X1, X2, . . . and Y0, Y1, Y2, . . ..

• The output block Z is computed subblockwise:

Z0 = G(X0, Y0);

Zi = G(Xi, Yi, Zi−1), i > 0.

This scheme resembles the duplex authenticated encryption mode, which is secure under certain assumptions on G. However, it is totally insecure against tradeoff adversaries, as shown below.

Attack on the iterative compression function. Suppose that an adversary computes Z = F(X, Y) but Y is not stored. Suppose that Y is a tree function of stored elements of depth D. The adversary starts with computing Z0, which requires only Y0. In turn, Y0 = G(X′0, Y′0) for some X′, Y′. Therefore, the adversary computes the tree of the same depth D, but with the function G instead of F. Z1 is then a tree function of depth D + 1, Z2 of depth D + 2, etc. In total, the recomputation takes (D + s)LG time, where s is the number of subblocks and LG is the latency of G. This should be compared to the full-space implementation, which takes time sLG. Therefore, if the memory is reduced by the factor q, then the time-area product is changed as

\[ AT_{new} = \frac{D(q) + s}{sq} AT. \]

Therefore, if

\[ D(q) \le s(q - 1), \qquad (3.2) \]

the adversary wins.

One may think of using the Zm−1[l − 1] as input to computing Z0[l]. Clearly, this changes little in the adversary's strategy, who could simply store all Zm−1, which is feasible for large m. In concrete proposals, s can be 64, 128, 256 and even larger.

We conclude that F with an iterative structure is insecure. We note that this attack also applies to other PHC candidates with an iterative compression function. Table 3.1 and Equation (3.2) suggest that it allows the adversary to reduce the memory by the factor of 12 or even higher while still reducing the area-time product.
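To make the condition of Equation (3.2) concrete, the sketch below plugs the depth penalties D(q) of Table 3.1 into the ratio (D(q) + s)/(sq). The choice s = 64 is one of the values mentioned in the text; the function names are illustrative only.

```python
# Depth penalties D(q) for memory reduction factors q, copied from Table 3.1.
DEPTH_PENALTY = {2: 1.7, 3: 2.5, 4: 3.8, 5: 5.7, 6: 8.2, 7: 11.5,
                 8: 15.7, 9: 20.8, 10: 26.6, 11: 32.8}


def iterative_time_area_ratio(q: int, s: int) -> float:
    """AT_new / AT for an iterative compression function with s subblocks."""
    return (DEPTH_PENALTY[q] + s) / (s * q)


def adversary_wins(q: int, s: int) -> bool:
    """Condition (3.2): D(q) <= s(q - 1)."""
    return DEPTH_PENALTY[q] <= s * (q - 1)


for q in sorted(DEPTH_PENALTY):
    print(q, round(iterative_time_area_ratio(q, 64), 3), adversary_wins(q, 64))
```

For every q listed in Table 3.1 the ratio stays below 1 when s = 64, so the depth penalty is almost entirely absorbed by the s sequential calls to G.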

Our approach. We formulate the following design criteria:

• The compression function must require about t bits of storage (excluding inputs) to compute any output bit.

• Each output byte of F must be a nonlinear function of all input bytes, so that the function has differential probability below a certain level, for example 1/4.

These criteria ensure that the attacker is unable to compute an output bit using only a few input bits or a few stored bits. Moreover, the output bits should not be (almost) linear functions of input bits, as otherwise the function tree would collapse.

We have not found any generic design strategy for such large-block compression functions. It is difficult to maintain diffusion on large memory blocks due to the lack of CPU instructions that interleave many registers at once. A naive approach would be to apply a linear transformation with a certain branch number. However, even if we operate on 16-byte registers, a 1024-byte block would consist of 64 elements. A 64×64-matrix would require 32 XORs per register to implement, which gives a penalty of about 2 cycles per byte.

Instead, we propose to build the compression function on top of a transformation P that already mixes several registers. We apply P in parallel (having a P-box), then shuffle the output registers and apply it again. If P handles p registers, then the compression function may transform a block of p² registers with 2 rounds of P-boxes. We do not have to manually shuffle the data, we just change the inputs to P-boxes. As an example, an implementation of the Blake2b [7] permutation processes 8 128-bit registers, so with 2 rounds of Blake2b we can design a compression function that mixes the 8192-bit block. We stress that this approach is not possible with dedicated AES instructions. Even though they are very fast, they apply only to the 128-bit block, and we still have to diffuse its content across other blocks.
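The diffusion argument can be checked with a small dependency-tracking sketch. It does not implement any real permutation: each P-box is modelled only by the fact that every output register depends on all of its inputs, and the row-then-column arrangement of the p × p registers is an assumed way of "changing the inputs to P-boxes".

```python
def p_box(registers):
    """Model P: each output register depends on every input register of the box."""
    merged = frozenset().union(*registers)
    return [merged for _ in registers]


def two_round_dependencies(p):
    """Track which input registers each of the p*p registers depends on."""
    # Register k starts out depending only on itself.
    state = [[frozenset({row * p + col}) for col in range(p)] for row in range(p)]
    # Round 1: one P-box per row.
    state = [p_box(row) for row in state]
    after_round_1 = [dep for row in state for dep in row]
    # Round 2: one P-box per column (only the choice of inputs changes).
    columns = [p_box([state[row][col] for row in range(p)]) for col in range(p)]
    after_round_2 = [columns[col][row] for row in range(p) for col in range(p)]
    return after_round_1, after_round_2


round_1, round_2 = two_round_dependencies(8)
print(len(round_1[0]), len(round_2[0]))
```

With p = 8 each register depends on 8 inputs after the first round and on all 64 after the second, which is the full-block mixing described above.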

3.3.5 User-controlled parameters

We have made a number of design choices, which we consider optimal for a wide range of applications. Some parameters can be altered, some should be kept as is. We give a user full control over:

• Amount M of memory filled by the algorithm. This value, evidently, depends on the application and the environment. There is no "insecure" value for this parameter, though clearly the more memory the better.

• Number T of passes over the memory. The running time depends linearly on this parameter. We expect that the user chooses this number according to the time constraints on the application. Again, there is no "insecure" value for T.

• Degree d of parallelism. This number determines the number of threads used by an optimized implementation of Argon2. We expect that the user is restricted by the number of CPU cores (or half-cores) that can be devoted to the hash function, and chooses d accordingly (double the number of cores).

• Length of password/message, salt/nonce, and tag (except for some low, insecure values for salt and tag lengths).

We allow the user to choose another compression function G, hash function H, block size b, and number of slices l. However, we do not provide this flexibility in a reference implementation as we guess that the vast majority of the users would prefer as few parameters as possible.

3.4 Security analysis

3.4.1 Security of single-pass schemes

We consider the following types of indexing functions:

• Independent of the password and salt, but possibly dependent on other public parameters (data-independent). Thus the addresses can be calculated by the adversary. Therefore, if the dedicated hardware can handle parallel memory access, the adversary can prefetch the data from the memory. Moreover, if she implements a time-space tradeoff, then the missing blocks can also be precomputed without losing time. In order to maximize the computational penalties, the designers proposed various formulas for indexing functions [13, 20], but several of them were found weak and admitting easy tradeoffs.

• Dependent on the password (data-dependent). A frequent choice is to use some bits of the previous block: φ2(j) = g(B[j − 1]). This choice prevents the adversary from prefetching and precomputing missing data. The adversary figures out what he has to recompute only at the time the element is needed. If an element is recomputed as a tree of F calls of average depth D, then the total processing time is multiplied by D. However, this method is vulnerable to side-channel attacks, as timing information may help to filter out password guesses at an early stage.

• Hybrid schemes, where the first phase uses data-independent addressing, and the next phases use data-dependent addressing. The side-channel attacks become harder to mount, since an adversary still has to run the first phase to filter out passwords. However, this first phase is itself vulnerable to time-space tradeoffs, as mentioned above.

In the case of data-dependent schemes, the adversary can reduce the time-area product if the time penalty due to the recomputation is smaller than the memory reduction factor. The time penalty is determined by the depth D of the recomputation tree, so the adversary wins as long as

D(q) ≤ q.

In contrast, the cracking cost for data-independent schemes, expressed as the time-area product, is easy to reduce thanks to tradeoffs. The total area decreases until the area needed to host multiple cores for recomputation matches the memory area, whereas the total time remains stable until the total bandwidth required by the parallelized recomputations exceeds the architecture capabilities.

Let us elaborate on the first condition. When we follow some tradeoff strategy and reduce the memory by the factor of q, the total number of calls to G increases by the factor C(q). Suppose that the logic for G takes Acore of area (measured, say, in mm²), and the memory amount that we consider (say, 1 GB) takes Amemory of area. The adversary reduces the total area as long as:

C(q)Acore +Amemory/q ≤ Amemory.

The maximum bandwidth Bwmax is a hypothetical upper bound on the memory bandwidth on the adversary's architecture. Suppose that for each call to G an adversary has to load R(q) blocks from the memory on average, where q is the memory reduction factor. Therefore, the adversary can keep the execution time the same as long as

R(q)Bw ≤ Bwmax,

where Bw is the bandwidth achieved by a full-space implementation.

Lemma 1. R(q) = C(q).

This lemma is proved in Section 3.7.2.

3.4.2 Ranking tradeoff attack

To figure out the costs of the ASIC-equipped adversary, we first need to calculate the time-space tradeoffs for our class of hashing schemes. To the best of our knowledge, the first generic tradeoff attacks were reported in [10], and they apply to both data-dependent and data-independent schemes. The idea of the ranking method [10] is as follows. When we generate a memory block X[l], we make a decision whether to store it or not. If we do not store it, we calculate the access complexity of this block, i.e. the number of calls to F needed to compute the block, which is based on the access complexity of X[l − 1] and X[φ(l)]. The detailed strategy is as follows:

1. Select an integer q (for the sake of simplicity let q divide T ).

2. Store X[kq] for all k;

3. Store all ri and all access complexities;

4. Store the T/q highest access complexities. If X[i] refers to a vertex from this top, we store X[i].

The memory reduction is a probabilistic function of q. We reimplemented the algorithm and applied it to the scheme (3.1) where the addresses are randomly generated. We obtained the results in Table 3.1. Each recomputation is a tree of certain depth, also given in the table.

We conclude that for data-dependent one-pass schemes the adversary is always able to reduce the memory by the factor of 4 and still keep the time-area product the same.

| Memory fraction (1/q) | 1/2 | 1/3 | 1/4 | 1/5 | 1/6 | 1/7 | 1/8 | 1/9 | 1/10 | 1/11 |
|---|---|---|---|---|---|---|---|---|---|---|
| Computation & read penalty C(q) | 1.71 | 2.95 | 6.3 | 16.6 | 55 | 206 | 877 | 4423 | 2^14.2 | 2^16.5 |
| Depth penalty D(q) | 1.7 | 2.5 | 3.8 | 5.7 | 8.2 | 11.5 | 15.7 | 20.8 | 26.6 | 32.8 |

Table 3.1: Time and computation penalties for the ranking tradeoff attack for random addresses.
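The conclusion about the factor of 4 follows from comparing the depth penalties of Table 3.1 with the condition D(q) ≤ q. A short sketch of that comparison, with illustrative names:

```python
# Depth penalties D(q) for memory reduction factors q, copied from Table 3.1.
DEPTH_PENALTY = {2: 1.7, 3: 2.5, 4: 3.8, 5: 5.7, 6: 8.2, 7: 11.5,
                 8: 15.7, 9: 20.8, 10: 26.6, 11: 32.8}


def time_area_ratio(q: int) -> float:
    """AT_new / AT when memory shrinks by q and time grows by D(q)."""
    return DEPTH_PENALTY[q] / q


# Reduction factors for which the adversary does not lose: D(q) <= q.
winning_factors = [q for q in sorted(DEPTH_PENALTY) if DEPTH_PENALTY[q] <= q]
print(winning_factors)
```

The condition holds for q = 2, 3 and 4 and fails from q = 5 onwards, where the depth penalty of 5.7 exceeds the memory saving.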

3.4.3 Multi-pass schemes

If the defender has more time than needed to fill the available memory, then he can run several passes on the memory. Also some designers decided to process memory several times to get better time-space tradeoffs. Let us figure out how the adversary's costs are affected in this case.

Suppose we make K passes with T iterations each following the scheme (3.1), so that after the first pass any address in the memory may be used. Then this is equivalent to running a single pass with KT iterations such that φ(j) ≥ j − T. The time-space tradeoff would be the same as in a single pass with T iterations and the additional condition

\[ \phi(j) \ge j - \frac{T}{K}. \]

We have applied the ranking algorithm (Section 3.4.2) and obtained the results in Tables 3.2 and 3.3. We conclude that for the data-dependent schemes using several passes does increase the time-area product for the adversary who uses tradeoffs. Indeed, suppose we run a scheme with memory A with one pass for time T, or on A/2 with 2 passes. If the adversary reduces the memory to A/6 (i.e. by the factor of 6) in the first case, the time grows by the factor of 8.2, so that the time-area product is 1.35AT. However, if in the second setting the memory is reduced to A/6 (i.e. by the factor of 3), the time grows by the factor of 14.3, so that the time-area product is 2.2AT. For other reduction factors the ratio between the two products remains around 2.

Nevertheless, we do not immediately argue for the prevalence of multi-pass schemes, since new tradeoff algorithms may change their relative strength.


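| Memory fraction (1/q) | 1/2 | 1/3 | 1/4 | 1/5 | 1/6 |
|---|---|---|---|---|---|
| 1 pass | 1.7 | 3 | 6.3 | 16.6 | 55 |
| 2 passes | 15 | 410 | 19300 | 2^20 | 2^25 |
| 3 passes | 3423 | 2^22 | 2^32 | | |

Table 3.2: Computation/read penalties for the ranking tradeoff attack.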


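| Memory fraction (1/q) | 1/2 | 1/3 | 1/4 | 1/5 | 1/6 |
|---|---|---|---|---|---|
| 1 pass | 1.7 | 2.5 | 3.8 | 5.7 | 8.2 |
| 2 passes | 5.7 | 14.3 | 28.8 | 49 | 75 |
| 3 passes | 20.7 | 56 | 103 | − | − |

Table 3.3: Depth penalties for the ranking tradeoff attack.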

3.4.4 Security of Argon2 to generic attacks

Now we consider preimage and collision resistance of both versions of Argon2. Variable-length inputs are prepended with their lengths, which shall ensure the absence of equal input strings. Inputs are processed by a cryptographic hash function, so no collisions should occur at this stage.
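The role of the length prefixes can be seen in a small sketch. The 32-bit little-endian length field and the helper name encode_inputs are assumptions made for illustration, not the exact input format of the specification.

```python
import struct


def encode_inputs(*fields: bytes) -> bytes:
    """Concatenate fields, each preceded by its length (assumed 32-bit little-endian)."""
    return b"".join(struct.pack("<I", len(field)) + field for field in fields)


# Plain concatenation cannot tell these two (password, salt) pairs apart ...
ambiguous = (b"ab" + b"c") == (b"a" + b"bc")
# ... whereas the length-prefixed encodings differ.
distinct = encode_inputs(b"ab", b"c") != encode_inputs(b"a", b"bc")
print(ambiguous, distinct)
```

Because every field carries its own length, two different tuples of inputs can never produce the same string fed to the hash function.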

The compression function G is not claimed to be collision-resistant nor preimage-resistant. However, as the attacker has no control over its input, collisions are highly unlikely. We only take care that the starting blocks are not identical by producing the first two blocks with a counter and by forbidding the last block from being referenced as the (pseudo)random one.

Argon2d does not overwrite the memory, hence it is vulnerable to garbage-collector attacks and similar ones, and is not recommended for use in settings where these threats are possible. Argon2i with 3 passes overwrites the memory twice, thus thwarting the memory-leak attacks. Even if the entire working memory of Argon2i is leaked after the hash is computed, the adversary would have to compute two passes over the memory to try the password.

3.4.5 Security of Argon2 to tradeoff attacks

Time and computational penalties for 1-pass Argon2d are given in Table 3.1. It suggests that the adversary can reduce memory by the factor of 4 while keeping the time-area product the same.

Argon2i is more vulnerable to tradeoff attacks due to its data-independent addressing scheme. We apply the ranking algorithm to 3-pass Argon2i to calculate time and computational penalties. Table 3.2 demonstrates that the memory reduction by the factor of 3 already gives the computational penalty of around 2^14. The 2^14 Blake2b cores would take more area than 1 GB of RAM (Section 3.3.1), thus preventing the adversary from further reducing the time-area product. We conclude that the time-area product cost for Argon2i can be reduced by 3 at best.

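| Processor | Threads | Argon2d (1 pass) Cycles/Byte | Argon2d (1 pass) Bandwidth (GB/s) | Argon2i (3 passes) Cycles/Byte | Argon2i (3 passes) Bandwidth (GB/s) |
|---|---|---|---|---|---|
| Core i7-4600U | 2 | 1 | 6.36 | 1.4 | 9.30 |
| Core i7-4600U | 4 | 0.8 | 8.31 | 1.6 | 11.22 |
| Core i7-4600U | 8 | 0.8 | 8.07 | 1.2 | 11.19 |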

Table 3.4: Cycles/byte and bandwidth of Argon2.

3.5 Performance

3.5.1 x86 architecture

To optimize the data load and store from/to memory, the memory that will be processed has to be aligned on a 16-byte boundary when loaded/stored into/from 128-bit registers and on a 32-byte boundary when loaded/stored into/from 256-bit registers. If the memory is not aligned on the specified boundaries, then each memory operation may take one extra CPU cycle, which may cause consistent penalties over many memory accesses.

The results presented are obtained on an Intel i7-4600U (Haswell) clocked at 2.1 GHz with 8 GB of memory clocked at 1600 MHz, running the 64-bit operating system Linux Ubuntu 14.04.1 LTS. The gcc 4.8.2 compiler was used with the following options: -O3 -funroll-loops -unroll-aggressive -Wall -pedantic -march=native. The cycle count value was measured using the rdtscp Intel intrinsics C function, which inlines the RDTSCP assembly instruction that returns the 64-bit Time Stamp Counter (TSC) value. The instruction waits for previous instructions to finish and then is executed, but meanwhile the next instructions may begin before the value is read [19]. Despite this shortcoming, we used this method because it is the most reliable handy method to measure the execution time and it is also widely used in benchmarking other cryptographic operations.

3.6 Applications

Argon2d is optimized for settings where the adversary does not get regular access to system memory or CPU, i.e. he cannot run side-channel attacks based on the timing information, nor can he recover the password much faster using garbage collection [12]. These settings are more typical for backend servers and cryptocurrency mining. For practice we suggest the following settings:

• Cryptocurrency mining that takes 0.1 seconds on a 2 GHz CPU using 1 core: Argon2d with 2 lanes and 250 MB of RAM;

• Backend server authentication that takes 0.5 seconds on a 2 GHz CPU using 4 cores: Argon2d with 8 lanes and 4 GB of RAM.

Argon2i is optimized for more dangerous settings, where the adversary possibly can access the same machine, use its CPU or mount cold-boot attacks. We use three passes to get rid entirely of the password in the memory. We suggest the following settings:

• Key derivation for hard-drive encryption that takes 3 seconds on a 2 GHz CPU using 2 cores: Argon2i with 4 lanes and 6 GB of RAM;

• Frontend server authentication that takes 0.5 seconds on a 2 GHz CPU using 2 cores: Argon2i with 4 lanes and 1 GB of RAM.

3.7 Other details

3.7.1 Permutation P

Permutation P is based on the round function of Blake2b and works as follows. Its 8 16-byte inputs S0, S1, . . . , S7 are viewed as a 4 × 4-matrix of 64-bit words, where Si = (v2i+1||v2i):

\[ \begin{pmatrix} v_0 & v_1 & v_2 & v_3 \\ v_4 & v_5 & v_6 & v_7 \\ v_8 & v_9 & v_{10} & v_{11} \\ v_{12} & v_{13} & v_{14} & v_{15} \end{pmatrix} \]

Then we do

G(v0, v4, v8, v12) G(v1, v5, v9, v13) G(v2, v6, v10, v14) G(v3, v7, v11, v15)

G(v0, v5, v10, v15) G(v1, v6, v11, v12) G(v2, v7, v8, v13) G(v3, v4, v9, v14),

where G applies to (a, b, c, d) as follows:

a← a+ b;

d← (d⊕ a) ≫ 32;

c← c+ d;

b← (b⊕ c) ≫ 24;

a← a+ b;

d← (d⊕ a) ≫ 16;

c← c+ d;

b← (b⊕ c) ≫ 63;

Here + are additions modulo 2^64 and ≫ are 64-bit rotations to the right.
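The description above translates almost line by line into code. The following is a minimal sketch operating directly on the sixteen 64-bit words v0, . . . , v15; the packing of words into the 16-byte registers Si is left out, and the names rotr64 and permutation_P are illustrative.

```python
MASK64 = (1 << 64) - 1


def rotr64(x: int, n: int) -> int:
    """Rotate a 64-bit word to the right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK64


def G(a: int, b: int, c: int, d: int):
    """The quarter-round described above (Blake2b round function without message words)."""
    a = (a + b) & MASK64
    d = rotr64(d ^ a, 32)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 24)
    a = (a + b) & MASK64
    d = rotr64(d ^ a, 16)
    c = (c + d) & MASK64
    b = rotr64(b ^ c, 63)
    return a, b, c, d


# Index quadruples in the order listed in the text: columns first, then diagonals.
COLUMNS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]


def permutation_P(words):
    """Apply P to a list of sixteen 64-bit words v0..v15 and return the new list."""
    v = list(words)
    if len(v) != 16:
        raise ValueError("P operates on exactly 16 words")
    for i, j, k, m in COLUMNS + DIAGONALS:
        v[i], v[j], v[k], v[m] = G(v[i], v[j], v[k], v[m])
    return v
```

Since G uses only additions, XORs and rotations, the all-zero state is mapped to itself, which is a convenient sanity check for an implementation.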

3.7.2 Proof of Lemma 1

Proof. Let Aj be the computational complexity of recomputing M[j]. If M[j] is stored, then Aj = 0. When we have to compute a new block M[i], then the computational complexity Ci of computing M[i] (measured in calls to F) is calculated as

\[ C_i = A_{\phi_2(i)} + 1, \]

and the total computational penalty is calculated as

\[ C(q) = \frac{\sum_{i<T} (A_{\phi_2(i)} + 1)}{T}. \]

Let Rj be the total number of blocks to be read from the memory in order to recompute M[j]. The total bandwidth penalty is calculated as

\[ R(q) = \frac{\sum_{i<T} R_{\phi_2(i)}}{T}. \]

Let us prove that

\[ R_j = A_j + 1 \qquad (3.3) \]

by induction.

• We store M [0], so for j = 0 we have R0 = 1 and A0 = 0.

• If M [j] is stored, then we read it and make no call to F , i.e.

Aj = 0; Rj = 1.

32

• If M[j] is not stored, we have to recompute M[j − 1] and M[φ2(j)]:

\[ A_j = A_{j-1} + A_{\phi_2(j)} + 1 = R_{j-1} - 1 + R_{\phi_2(j)} - 1 + 1 = (R_{j-1} + R_{\phi_2(j)}) - 1 = R_j - 1. \]

The last equation follows from the fact that the total amount of reads for computing M[j] is the sum of necessary reads for M[j − 1] and M[φ2(j)].

Therefore, we get

\[ C(q) = \frac{\sum_{i<T} (A_{\phi_2(i)} + 1)}{T} = \frac{\sum_{i<T} R_{\phi_2(i)}}{T} = R(q). \]
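The induction can also be checked empirically. The sketch below draws random addresses φ2(j) < j, keeps an arbitrary subset of blocks (every third one, purely as an example), and evaluates the recursions for Aj and Rj used in the proof; all names and the choice of stored blocks are illustrative.

```python
import random


def access_costs(phi, stored):
    """Return lists (A, R) following the recursions used in the proof."""
    n = len(phi)
    A = [0] * n  # calls to F needed to recompute M[j]; 0 if stored
    R = [1] * n  # blocks read to recompute M[j]; 1 if stored
    for j in range(1, n):
        if j not in stored:
            A[j] = A[j - 1] + A[phi[j]] + 1
            R[j] = R[j - 1] + R[phi[j]]
    return A, R


rng = random.Random(2015)
T = 200
phi = [0] + [rng.randrange(j) for j in range(1, T)]  # phi[0] is an unused placeholder
stored = {0} | {j for j in range(1, T) if j % 3 == 0}

A, R = access_costs(phi, stored)
total_calls = sum(A[phi[i]] + 1 for i in range(1, T))  # numerator of C(q)
total_reads = sum(R[phi[i]] for i in range(1, T))      # numerator of R(q)
print(total_calls == total_reads)
```

The two numerators coincide for any choice of addresses and stored blocks, as Equation (3.3) predicts, so dividing both by T gives C(q) = R(q).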

Chapter 4

Change log

4.1 v2 and Argon2 – 31st January, 2015

• New scheme – Argon2 – is introduced;

• Argon is tweaked. Tweaks are detailed in Section 2.7.

4.2 v1 – 15th April, 2014

Typos corrected.

4.3 v1 – 8th April, 2014

The version v1 has the following differences with v0:

• New matrix L (Equation (2.1)). The software that generated the matrix in the previous version had a bug, which resulted in two duplicate rows. As a result, an adversary could use 12.5% less memory in the listed tradeoff attacks, whereas the diffusion and collision/preimage resistance properties remained as claimed. The current matrix has been checked by SAGE for the maximal rank.

• Version v0 mistakenly attributed garlic in [14] to the secret part of the input, whereas it must have been called pepper. V1 uses the correct naming.

Bibliography

[1] Micron power calculator. http://www.micron.com/products/support/power-calc.

[2] NIST: AES competition, 1998. http://csrc.nist.gov/archive/aes/index.html.

[3] NIST: SHA-3 competition, 2007. http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.

[4] Litecoin - Open source P2P digital currency, 2011. https://litecoin.org/.

[5] Jean-Philippe Aumasson, Luca Henzen, Willi Meier, and María Naya-Plasencia. Quark: A lightweight hash. In CHES'10, volume 6225 of Lecture Notes in Computer Science, pages 1–15. Springer, 2010. https://131002.net/quark/quark_full.pdf.

[6] Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O'Hearn, and Christian Winnerlein. Blake2 – fast secure hashing. https://blake2.net/.

[7] Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O'Hearn, and Christian Winnerlein. BLAKE2: simpler, smaller, fast as MD5. In ACNS'13, volume 7954 of Lecture Notes in Computer Science, pages 119–135. Springer, 2013.

[8] Daniel J. Bernstein and Tanja Lange. Non-uniform cracks in the concrete: The power of free precomputation. In ASIACRYPT'13, volume 8270 of Lecture Notes in Computer Science, pages 321–340. Springer, 2013.

[9] Guido Bertoni, Joan Daemen, Michael Peeters, and Gilles Van Assche. The Keccak reference, version 3.0, 2011. http://keccak.noekeon.org/Keccak-reference-3.0.pdf.

[10] Alex Biryukov, Dmitry Khovratovich, and Johann Großschädl. Tradeoff cryptanalysis of password hashing schemes. Technical report, 2014. Available at https://www.cryptolux.org/images/5/57/Tradeoffs.pdf.

[11] Andrey Bogdanov, Miroslav Knezevic, Gregor Leander, Deniz Toz, Kerem Varici, and Ingrid Verbauwhede. Spongent: A lightweight hash function. In CHES'11, volume 6917 of Lecture Notes in Computer Science, pages 312–325. Springer, 2011.

[12] Christian Forler, Eik List, Stefan Lucks, and Jakob Wenzel. Overview of the candidates for the password hashing competition – and their resistance against garbage-collector attacks. Cryptology ePrint Archive, Report 2014/881, 2014. http://eprint.iacr.org/.

[13] Christian Forler, Stefan Lucks, and Jakob Wenzel. Catena: A memory-consuming password scrambler. IACR Cryptology ePrint Archive, Report 2013/525. To appear in Asiacrypt'14.

[14] Christian Forler, Stefan Lucks, and Jakob Wenzel. Catena: A memory-consuming password scrambler. Cryptology ePrint Archive, Report 2013/525, 2013. http://eprint.iacr.org/.

[15] Bharan Giridhar, Michael Cieslak, Deepankar Duggal, Ronald G. Dreslinski, Hsing Min Chen, Robert Patti, Betina Hold, Chaitali Chakrabarti, Trevor N. Mudge, and David Blaauw. Exploring DRAM organizations for energy-efficient and resilient exascale memories. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2013), pages 23–35. ACM, 2013.

[16] Shay Gueron. AES-GCM software performance on the current high end CPUs as a performance baseline for CAESAR competition. 2013. http://2013.diac.cr.yp.to/slides/gueron.pdf.

[17] Frank Gurkaynak, Kris Gaj, Beat Muheim, Ekawat Homsirikamol, Christoph Keller, Marcin Rogawski, Hubert Kaeslin, and Jens-Peter Kaps. Lessons learned from designing a 65nm ASIC for evaluating third round SHA-3 candidates. In Third SHA-3 Candidate Conference, March 2012.

[18] Martin E. Hellman. A cryptanalytic time-memory trade-off. Information Theory, IEEE Transactions on, 26(4):401–406, 1980.

[19] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Number 325462-052US. September 2014.

[20] Marcos A. Simplicio Jr, Leonardo C. Almeida, Ewerton R. Andrade, Paulo C. F. dos Santos, and Paulo S. L. M. Barreto. The Lyra2 reference guide, version 2.3.2 (april 4th, 2013). Technical report, 2014. Available at https://password-hashing.net/submissions/specs/Lyra2-v1.pdf.

[21] Liam Keliher and Jiayuan Sui. Exact maximum expected differential and linear probability for 2-round Advanced Encryption Standard (AES). IACR Cryptology ePrint Archive, 2005:321, 2005.

[22] Lars R. Knudsen, Willi Meier, Bart Preneel, Vincent Rijmen, and Sven Verdoolaege. Analysis methods for (alleged) RC4. In Advances in Cryptology – ASIACRYPT'98, pages 327–341. Springer, 1998.

[23] Colin Percival. Stronger key derivation via sequential memory-hard functions. 2009. http://www.tarsnap.com/scrypt/scrypt.pdf.

[24] Thomas Ristenpart, Eran Tromer, Hovav Shacham, and Stefan Savage. Hey, you, get off of my cloud: exploring information leakage in third-party compute clouds. In Proceedings of the 2009 ACM Conference on Computer and Communications Security, CCS 2009, Chicago, Illinois, USA, November 9-13, 2009, pages 199–212, 2009.

[25] Matthew Robshaw and Olivier Billet. New stream cipher designs: the eSTREAM finalists, volume 4986. Springer, 2008.

[26] Bram Rooseleer, Stefan Cosemans, and Wim Dehaene. A 65 nm, 850 MHz, 256 kbit, 4.3 pJ/access, ultra low leakage power memory using dynamic cell stability and a dual swing data link. J. Solid-State Circuits, 47(7):1784–1796, 2012.

[27] Clark D. Thompson. Area-time complexity for VLSI. In STOC’79, pages 81–88. ACM, 1979.
