[IEEE 2010 5th International Design and Test Workshop (IDT) - Abu Dhabi, United Arab Emirates...

Area Efficient-High Throughput Sub-Pipelined Design of the AES in CMOS 180nm

A. Alma’aitah (a) and Zine-Eddine Abid (b)

(a) Department of Electrical and Computer Engineering, Queen’s University, Kingston, Canada
(b) Department of Electrical and Electronics Engineering, ADMC-Higher Colleges of Technology, Abu Dhabi, UAE


Abstract-- In this paper, an efficient hardware implementation of one of the most popular encryption algorithms, the Advanced Encryption Standard (AES), is presented. A modified sub-pipelined structure is proposed that targets high speed and a low power-delay product for the compact AES design with an on-the-fly key expansion unit. By adding 25.8% in hardware complexity relative to existing ASIC designs, the throughput is increased by more than 158% with a better overall power-delay product. Compared to other compact AES implementations, the proposed structure can reach up to 6 Gbit/s with a gate count of about 13k.

1 Introduction

Secure high-speed data transmission is a key enabler for many applications (e.g. optical networks). As a result, architectures that can sustain gigabit-per-second data streams are required. The Advanced Encryption Standard (AES) is an attractive candidate for such applications due to its global acceptance as a robust system and its relatively small hardware footprint. AES is a special case of the Rijndael algorithm [1], a block cipher that works on fixed-length groups of bits called blocks. It takes an input block of a certain size, usually 128 bits, and produces a corresponding output block of the same size. The transformation requires a second input, the secret key. AES supports only a block size of 128 bits and key sizes of 128, 192 and 256 bits, whereas the original Rijndael supports key and block sizes in any multiple of 32 bits, with a minimum of 128 and a maximum of 256 bits.

Because AES is an iterated block cipher with a fixed 128-bit block size, it is well suited to pipelining, where each iteration on a 128-bit block is followed by registers that store the intermediate result (the state) of that iteration. AES ASIC and FPGA designs fall into two main structures: the loop-unrolled structure and the compact loop. The loop-unrolled structure replicates the compact loop, with a different key fed into each replica. Compact loop designs, on the other hand, take a new key for each data iteration cycle. In this paper the compact loop is simulated in CMOS 180nm technology. The loop itself is sub-pipelined into stages by placing registers at locations where the signal propagation delays of the resulting stages are comparable. In optimizing the compact loop there are tradeoffs between hardware complexity, power consumption and throughput. In the new optimized structure, increasing the gate count by 25.8% resulted in a throughput increase of more than 158%. Subsequently, the optimization results are generalized to the loop-unrolled structure.

The rest of this paper is organized as follows. Section two gives a brief introduction to the AES and its four transformation steps. The optimizations at the architectural level are presented in Section three, with a focus on the uniformity of the critical paths of the AES stages and on the modification of the key expansion module to meet the stage delay. The simulation results are shown in Section four, followed by the conclusion and future work in Section five.

2 Advanced Encryption Standard (AES)

The input to the AES is an array of bytes called the initial state block. Since the block size is 128 bits, i.e. 16 bytes, the rectangular array has dimensions 4x4. (In the Rijndael version with variable block size, the number of rows is fixed at four and the number of columns varies; the number of columns is the block size divided by 32 and is denoted Nb.) The cipher key is similarly a rectangular array with four rows. The number of columns of the cipher key, denoted Nk, is equal to the key length divided by 32. In this work the input data is 128 bits with a key length of 128 bits. For more details on the differences between AES and the Rijndael algorithm, the reader can refer to [1].

The input bytes are mapped onto the state bytes in the order a0,0, a1,0... and the bytes of the cipher key are mapped onto the array in the order k0,0, k1,0... as shown in Figure 1. The initial key, the seed, is expanded into several round keys, one of which is added to the data at the end of each cipher round. AES uses a variable number of rounds, depending on the key length. For 128-, 192-, and 256-bit keys there are 10, 12, and 14 rounds respectively (i.e. a 128-bit key is expanded into 10 round keys of 128 bits each, one for each of the 10 rounds). During each round, the following operations are applied to the state: SubBytes, ShiftRows, MixColumns, and AddRoundKey. Figure 2 illustrates the compact design of the AES.

978-1-61284-292-9/10/$26.00 ©2010 IEEE

Figure 1. State block of 128 bits (Nb = 4 columns of 32 bits) and key block of 128 bits (Nk = 4 columns of 32 bits).
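As a rough illustration of the mapping described above, the sketch below fills the 4x4 state column by column (a0,0, a1,0, ...) and derives the round count from the key length. The function names are hypothetical, and the relation rounds = Nk + 6 is an assumption taken from the AES standard rather than from this paper; it reproduces the 10, 12, and 14 rounds listed above.

```python
def bytes_to_state(block):
    """Map 16 input bytes onto a 4x4 state, filling column by column."""
    if len(block) != 16:
        raise ValueError("AES state needs exactly 16 bytes (128 bits)")
    # state[row][col] = block[row + 4*col], i.e. order a0,0, a1,0, a2,0, ...
    return [[block[row + 4 * col] for col in range(4)] for row in range(4)]


def num_rounds(key_bits):
    """Number of AES rounds for a 128-, 192- or 256-bit key."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key length must be 128, 192 or 256 bits")
    nk = key_bits // 32   # number of 32-bit key columns (Nk)
    return nk + 6         # assumption: relation from the AES standard


# Example: bytes 0..15 land in the state one column at a time.
state = bytes_to_state(bytes(range(16)))
```

With this layout the first row of the state holds input bytes 0, 4, 8 and 12, which is the column-wise order shown in Figure 1.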

3 Proposed Design

This section presents module-level modifications to previous compact ASIC implementations of the AES [2-8]. The “equivalent decryption method” is used when the Encryption and Decryption (Enc/Dec) rounds are intended to share hardware (i.e. the same hardware serves both modes). The objective of the system-level modification is to increase the throughput of the design by applying pipelining to the Enc/Dec rounds. With pipelining, the main critical path (the longest propagation path) of the non-pipelined round is divided into sub-critical paths, which reduces the dynamic power caused by unnecessary gate switching when signals arrive at different times along long paths. The new sub-critical paths in the pipeline can be further optimized in the future at the gate and transistor levels. This section describes the sub-pipelined Enc/Dec round and the key expansion module, presents the sub-pipelined round structure together with its hardware complexity, and includes general critical path calculations.

3.1 Enc/Dec Round

The Enc/Dec round is the path in which the state (128 bits of data) is processed by SubBytes, ShiftRows, MixColumns, and the addition of the expanded round key, for either encryption or decryption. In the compact AES structures [2,3,5], a single Enc/Dec round is implemented and used to process the data repeatedly. The data flows through the three round operations (SubBytes, ShiftRows and MixColumns for encryption; Inverse SubBytes, Inverse ShiftRows, and Inverse MixColumns for decryption) and is stored in registers after the key addition stage as an intermediate result. The intermediate data are then processed again by the same operations. One set of data is processed in each clock cycle. This kind of structure is best suited to applications where a low clock frequency is preferred. The compact design can be pictured as a “one loop structure” (shown in Figure 2).

Figure 2. General structure of the Enc/Dec round pipelined with one register after the Add round key step.

Figure 3. Proposed Sub-pipelined structure for one Enc/Dec round.

This structure requires one register after the Add-Round-Key step to store the 128 bits of Enc/Dec data. The loop is repeated 10 times to generate the final encrypted or decrypted data (for a key size of 128 bits). In an ASIC or FPGA implementation of AES, all bytes follow the same processing path; however, the individual bits within a byte pass through logic operations that differ in complexity and length. Some signals therefore propagate faster than others, causing a long fluctuation time until the signals settle, which leads to significant dynamic power consumption and reduces the maximum clock frequency. Saving power, raising the clock frequency, and keeping the area (gate count) relatively small are the main motivations behind the proposed design.

In sub-pipelining, registers are inserted inside the round to divide it into sub-stages with approximately equal delay, so that multiple blocks of data are processed in the stages and stored in the registers simultaneously. The encryption round is divided into three stages, in contrast to four stages for the decryption round, as shown in Figure 3. The stage boundaries are not tied to the boundaries of the round operations (i.e. stage one can span the SubBytes operation together with part of the ShiftRows operation). The first three stages are shared by the decryption and encryption operations, while the fourth stage is used only in decryption mode. The modules that are used only in decryption (dark blocks in Figure 3) are deactivated during encryption. In encryption mode, stage three is connected directly to Register-1, skipping stage four and its associated registers. Since the loop is divided into three stages, three bursts of data can be processed simultaneously, with the same key used three times in each round. Hence, the overall throughput increases, because the new clock period is shorter than that of the structure pipelined with a single register. In decryption there is an additional stage, so four bursts of data can be processed, and the throughput is comparable to that of encryption.
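To make the interleaving of bursts concrete, the following sketch lists which stage and round each burst occupies at a given clock cycle. It is an idealized model with hypothetical names: burst b is assumed to enter stage 1 at cycle b and to advance one stage per clock, and the key expansion start-up and the final output cycles are ignored.

```python
def stage_occupancy(cycle, n_stages):
    """Return {burst: (stage, round)} at a clock cycle (all numbered from 1,
    except cycle and burst, which are numbered from 0).

    Idealized model: burst b enters stage 1 at cycle b, advances one stage per
    clock, and re-enters stage 1 after leaving the last stage.
    """
    placement = {}
    for burst in range(n_stages):
        steps = cycle - burst      # stage-steps completed by this burst
        if steps < 0:
            continue               # burst has not entered the loop yet
        placement[burst] = (steps % n_stages + 1, steps // n_stages + 1)
    return placement


# Encryption uses three stages (three bursts), decryption four.
enc_cycle4 = stage_occupancy(4, n_stages=3)
dec_cycle4 = stage_occupancy(4, n_stages=4)
```

In this model every stage holds exactly one burst once the loop is full, which is why the number of bursts in flight equals the number of stages.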

Table 1. Critical path for the sub-pipelined stages in the Enc/Dec round

| Stage | Component name | Number of components for one byte of input data | Critical path (XOR) | Critical path (AND) | Total number of gates for each component |
|---|---|---|---|---|---|
| 1 | Affine inverse | 1 | 3 | -- | 20 XOR |
| 1 | Map | 1 | 3 | -- | 11 XOR |
| 1 | MUXs | 1 | -- | 2 | 24 NAND |
| 1 | Multiplication in GF(2^4) | 1 | 3 | 1 | 15 XOR, 16 AND |
| 1 | 8-bit XOR | 3 | 1 | -- | 8 XOR |
| 2 | Inverse | 1 | 3 | 2 | 20 XOR, 10 AND |
| 2 | Multiplication in GF(2^4) | 2 | 3 | 1 | 15 XOR, 16 AND |
| 2 | Map inverse | 1 | 3 | -- | 15 XOR |
| 3 | Affine | 1 | 3 | -- | 12 XOR |
| 3 | MUXs | 3 | -- | 2 | 24 NAND |
| 3 | 32-bit MixColumns | 1/4 | 3 | -- | 25 XOR |
| 3 | Add round key | 1 | 1 | -- | 8 XOR |
| 4 | 32-bit Inv MixColumns | 1/4 | 7 | -- | |
| 4 | MUXs | 3 | -- | 2 | 24 NAND |
| 4 | Add round key | 1 | 1 | -- | 8 XOR |

The round is divided into stages on the basis of the critical path. The four critical paths consist mainly of XOR gates. Other gates such as inverters, AND, and NAND gates, which are relatively faster than XOR gates, are also considered when calculating the critical path of every stage, but with less weight than XOR gates. Figure 4 shows the block structure of the path of one byte through the Enc/Dec round, together with the internal components of each stage. The SubBytes design [9] accounts for most of the critical path of the Enc/Dec round: the Inv/SubBytes critical path contributes 28 of the 40-41 gates of the round, and the rest of the critical path consists mainly of Inv/MixColumns, the multiplexers (MUXs) and the Add round key step. The components of the critical path of each stage are presented in Table 1.

Figure 4. One byte path in the proposed Sub-pipelined structure including the value of critical path for each stage
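The critical-path entries of Table 1 can be tallied per stage with a short script. This is only a sketch under stated assumptions: the components of a stage are taken to lie in series, parallel copies of a component are counted once, and the table's AND column is read as the number of non-XOR (AND/NAND) gate levels. The dictionary and function names are hypothetical.

```python
# (XOR levels, AND/NAND levels) on the critical path, transcribed from Table 1.
STAGES = {
    1: {"Affine inverse": (3, 0), "Map": (3, 0), "MUXs": (0, 2),
        "Multiplication in GF(2^4)": (3, 1), "8-bit XOR": (1, 0)},
    2: {"Inverse": (3, 2), "Multiplication in GF(2^4)": (3, 1),
        "Map inverse": (3, 0)},
    3: {"Affine": (3, 0), "MUXs": (0, 2), "32-bit MixColumns": (3, 0),
        "Add round key": (1, 0)},
    4: {"32-bit Inv MixColumns": (7, 0), "MUXs": (0, 2),
        "Add round key": (1, 0)},
}


def stage_depth(stage):
    """Sum the XOR and non-XOR gate levels of one stage (series assumption)."""
    xor_levels = sum(x for x, _ in STAGES[stage].values())
    other_levels = sum(a for _, a in STAGES[stage].values())
    return xor_levels, other_levels


depths = {stage: stage_depth(stage) for stage in STAGES}

# SubBytes spans stages one and two plus the Affine block of stage three.
subbytes_depth = sum(depths[1]) + sum(depths[2]) + sum(STAGES[3]["Affine"])
```

Under these assumptions the SubBytes components add up to 28 gate levels, matching the figure quoted in the text, and the three encryption stages contain 10, 9 and 7 XOR levels respectively.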

As can be seen in Table 1, because of the long critical path of SubBytes, stages one and two lie entirely within the SubBytes operation, and stage three also contains the affine transformation, which is part of the SubBytes operation. The encryption round is divided into three stages because this is the only configuration that results in evenly divided stages. Dividing the round into four or five stages would result in non-uniform critical paths, as can be seen in Table 1, unless the small modules were divided at the gate level.


The registers between the stages add a constant delay to all stages and do not affect the uniformity of the stages’ critical paths. The registers used to implement the sub-pipelining increase the hardware complexity by 25.8% compared to the overall system gate count before sub-pipelining. After sub-pipelining the Enc/Dec round, the key expansion module must be designed to provide the proper expanded key at the proper time: it should output the expanded key after three clock cycles in encryption mode or four clock cycles in decryption mode.

3.2 Key Expansion Unit

The key scheduler module used in the optimized AES crypto processor is a modified on-the-fly scheduler based on the design of [10]. The module schedules the key values in forward and reverse sequence for encryption and decryption respectively. The forward and reverse key scheduling is illustrated in Figure 5, where the dotted lines represent the data path in decryption (reverse scheduling of the keys). In the key expansion module, the four parts of the key are stored in the registers “a”, “b”, “c” and “d” (32-bit registers that initially contain the input key, the seed). For encryption scheduling, the word in register “d” goes through “LEFT ROTATION”, SubBytes, and an XOR with RCON, where RCON is a round constant. The result generates the next four 32-bit words by propagating through four XOR stages with multiplexers (MUXs) between them. For decryption scheduling, the dotted path is followed; it has the same complexity as encryption scheduling. However, the last key, from which reverse scheduling starts, is generated with encryption scheduling.
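The forward and reverse scheduling steps described above can be sketched in software as follows. This is a behavioral model of the standard AES-128 key schedule, not the hardware of [10]: the S-box is computed from the GF(2^8) inverse and affine map instead of the composite-field circuit, and all function names are hypothetical.

```python
def _gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return product


def _sbox_entry(x):
    """Multiplicative inverse (x^254, which maps 0 to 0), then the affine map."""
    inv = 1
    for _ in range(254):
        inv = _gf_mul(inv, x)
    out = inv
    for shift in (1, 2, 3, 4):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def _g(word, rcon):
    """Left rotation by one byte, SubBytes on each byte, XOR with RCON."""
    rotated = ((word << 8) | (word >> 24)) & 0xFFFFFFFF
    substituted = 0
    for shift in (24, 16, 8, 0):
        substituted |= SBOX[(rotated >> shift) & 0xFF] << shift
    return substituted ^ (rcon << 24)


def next_round_key(key, rcon):
    """Forward scheduling: registers (a, b, c, d) to the next round key."""
    a, b, c, d = key
    a ^= _g(d, rcon)
    b ^= a
    c ^= b
    d ^= c
    return (a, b, c, d)


def previous_round_key(key, rcon):
    """Reverse scheduling: undo the XOR chain, then undo the g() term."""
    a, b, c, d = key
    d ^= c
    c ^= b
    b ^= a
    a ^= _g(d, rcon)
    return (a, b, c, d)


# Example seed: the AES-128 test key from the AES standard.
seed = (0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C)
round_key_1 = next_round_key(seed, rcon=0x01)
```

Reverse scheduling needs the same S-box and round constant as forward scheduling, which is consistent with the two modes sharing almost all of their hardware.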

Since the key expansion design is based on the “equivalent decryption method”, the encryption and decryption modes share almost all the hardware. However, an Inverse MixColumns unit is applied to the generated key when the key expansion module operates in decryption mode, as shown in Figure 3 (the module above Register-7). The critical path of the key expansion module in encryption (forward scheduling) is “Register-MUX-SubBytes-XOR-XOR-MUX-XOR-MUX-XOR-MUX-XOR-MUX”. In terms of the number of gates, it is 17 gates plus the SubBytes module delay of 28 gates. In decryption mode, the critical path of the key expansion module is longer: “MUX-XOR-SubBytes-XOR-XOR-MUX-Register-InvMixColumns”. In terms of the number of gates, it is 19 gates plus the SubBytes module. It is worth mentioning that only four SubBytes modules are required to process the 32 bits in the key expansion module, compared to 16 SubBytes modules in the Enc/Dec round.

Figure 5. Key expansion scheduler [10]

3.2.1 Key expansion timing

The critical path of the key expansion is longer than that of the Enc/Dec round. To keep the key expansion module synchronized with the Enc/Dec round, the key expansion is not pipelined and the key expansion registers are controlled by a divided clock signal. If the Enc/Dec round and the key expansion module start processing data at the same time, the first key is generated only after the first burst of data has been processed by the third stage. The first key should be ready when the 1st data burst is being processed by stage “3”, as shown in Figure 6; otherwise, the first key misses the first burst and is applied only to the second and third bursts of data. In order to match the key to the processed data, the key expansion module starts one clock cycle before the Enc/Dec round. This matches the correct key to the data (i.e. the first key is applied to the 1st, 2nd and 3rd bursts of data).

After the expansion of the 10 keys, the expanded keys are stored in registers connected in a loop, as shown in Figure 7. These registers store the expanded keys during the first 10 key expansion cycles, after which the required expanded keys are available in sequence at the input of the XOR gates without any delay. The added hardware complexity of the registers is 6400 gates, compared to the total system complexity of 20704 gates. The increase in power consumption is not significant because, after the keys have been expanded, the key expansion module is deactivated as long as the seed does not change.

Figure 6. Timing of the key expansion clock with the round clock

Figure 7. Cascaded registers for storing the expanded key.

The non-restoring 8-transistor XOR design is chosen for the AES implementation due to its simplicity. The rest of the gates (AND, NAND and inverters) are implemented in traditional complementary CMOS for its efficiency and to restore the outputs of the XOR gates. The 2-input NAND is implemented using four transistors. The AND gate has six transistors (a NAND gate followed by an inverter). The total gate and transistor counts of the AES design before and after sub-pipelining are presented in Table 2.

4 Simulation Results

The simulation results of the sub-pipelined design in decryption mode and in encryption mode (in which the fourth stage is disabled) are shown in Table 3. The difference in power consumption between the two modes is due to the disabling of the fourth stage and its registers during encryption.

Table 2. Hardware complexity for the pipelined and sub-pipelined AES

| Module | Gate count | Transistor count |
|---|---|---|
| SubBytes | 2336 XOR, 928 AND, 382 NAND | 25792 |
| MixColumns | 432 XOR | 4320 |
| Registers | 256 Reg. before sub-pipelining, 832 Reg. after sub-pipelining | 6656 before sub-pipelining, 21632 after sub-pipelining |
| MUX | 1280 NAND | 6144 |
| Add round key | 256 XOR | 2048 |
| Key Expansion | 136 registers, 647 XOR, 232 AND, 792 NAND | 13272 |
| Total | 3671 XOR, 1160 AND, 2454 NAND, 392 Reg. before sub-pipelining or 968 Reg. after sub-pipelining | 59000 (continuous round), 73976 (sub-pipelined) |
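As a quick consistency check, the per-module gate counts of Table 2 can be summed and compared with the Total row. The sketch below simply transcribes the table; the dictionary and function names are hypothetical, and the transistor overhead is computed from the two transistor totals as one possible reading of the hardware cost.

```python
from collections import Counter

# Per-module gate counts transcribed from Table 2 (REG = registers).
MODULES = {
    "SubBytes": {"XOR": 2336, "AND": 928, "NAND": 382},
    "MixColumns": {"XOR": 432},
    "MUX": {"NAND": 1280},
    "Add round key": {"XOR": 256},
    "Key Expansion": {"REG": 136, "XOR": 647, "AND": 232, "NAND": 792},
}
ROUND_REGISTERS = {"before": 256, "after": 832}   # Registers row
TRANSISTORS = {"before": 59000, "after": 73976}   # Total row


def totals(variant):
    """Sum the gate counts of all modules for 'before' or 'after' sub-pipelining."""
    total = Counter()
    for gates in MODULES.values():
        total.update(gates)
    total["REG"] += ROUND_REGISTERS[variant]
    return dict(total)


before = totals("before")
after = totals("after")
added_registers = after["REG"] - before["REG"]
transistor_overhead = TRANSISTORS["after"] / TRANSISTORS["before"] - 1
```

The sums reproduce the Total row of the table (3671 XOR, 1160 AND, 2454 NAND, and 392 or 968 registers). With the table's transistor totals the overhead evaluates to roughly 25.4%, in the neighborhood of the 25.8% quoted in the text, whose exact basis is not spelled out.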

Table 3. CMOS 180nm simulations for the sub-pipelined AES system at 500MHz

| Mode | Power consumption at 500MHz (mW) | Maximum stage delay (ns) | Maximum throughput (Gb/s) |
|---|---|---|---|
| Enc | 84.60 | 1.80 | 6.274 |
| Dec | 94.50 | 1.80 | 6.321 |

In encryption mode, three bursts of data are processed 10 times (10 * (3 * Max-delay)), and the final output then needs three clock cycles to become available (3 * Max-delay). The data processing is delayed by one clock cycle after the initialization time of the system to allow the key expansion unit to produce the next key (in decryption it is delayed by two clock cycles). The throughput in Table 3 is calculated by:

\[ \text{Throughput (Enc)} = \left( 34 \times \text{Max-delay} + \text{initialization time} \right)^{-1} \times 128\ \text{bits} \times 3\ \text{bursts} \]

\[ \text{Throughput (Dec)} = \left( 45 \times \text{Max-delay} + \text{initialization time} \right)^{-1} \times 128\ \text{bits} \times 4\ \text{bursts} \]

The initialization time is negligible compared to the total encryption or decryption processing time. In the simulations presented in Table 3, the initialization time is 2.4ns. Table 4 shows the gate count, average power, and throughput of some previously published ASIC designs for the AES. In Table 5, the non-pipelined structure is compared to the sub-pipelined structure in terms of gate count, maximum throughput and power consumption. The loop-unrolled throughput can be estimated by multiplying the sub-pipelined throughput by 10, which gives around 62Gbit/sec.
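The throughput expressions above can be checked numerically. The sketch below reads the "-1" in the formulas as a reciprocal and uses the 1.8 ns stage delay of Table 3; the function and parameter names are hypothetical.

```python
def throughput_gbps(max_delay_ns, cycles, bursts, init_ns=0.0, block_bits=128):
    """Bits delivered divided by total processing time (bits/ns equals Gb/s)."""
    total_time_ns = max_delay_ns * cycles + init_ns
    return block_bits * bursts / total_time_ns


# Encryption: 34 cycles for 3 bursts; decryption: 45 cycles for 4 bursts.
enc = throughput_gbps(1.8, cycles=34, bursts=3)
dec = throughput_gbps(1.8, cycles=45, bursts=4)

# Same calculation with the 2.4 ns initialization time included.
enc_with_init = throughput_gbps(1.8, cycles=34, bursts=3, init_ns=2.4)
dec_with_init = throughput_gbps(1.8, cycles=45, bursts=4, init_ns=2.4)
```

With the initialization time left out, the results are about 6.27 Gb/s and 6.32 Gb/s, in line with Table 3. Including the 2.4 ns initialization time lowers the estimates to roughly 6.04 Gb/s and 6.14 Gb/s.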

Table 4. Performance of some previous published ASIC AES designs

| Implementation | Technology | Throughput (Mb/s) | Gates | Power |
|---|---|---|---|---|
| Sever [6] | 350nm | 1690 | 149000 | n/a |
| Kim [2] | 180nm | 1640 | 28626 + 128Kb ROM | |
| Kuo [3] | 180nm | 1280 | 173000 | |
| Verbauwhede [4] | 180nm | 1600 | 173000 | 56mW |
| Gurkaynak [5] | 250nm | 2120 | 119000 | |
| Li [7] | 180nm | 3840 | 39980 | n/a |

Table 5. Loop unrolled AES compared to the proposed sub-pipelined loop AES

| Implementation | Mode | Max throughput (Mb/s) | Clock period (ns) | Gate count | Power (mW) |
|---|---|---|---|---|---|
| Non-pipelined | ENC/DEC | 2442 | 5 | 9253 | 38.2 |
| Pipelined | ENC | 6274 | 1.8 | 13093 | 84.6 |
| Pipelined | DEC | 6321 | 1.8 | 13093 | 94.5 |
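One way to compare the rows of Table 5 is to compute the energy spent per processed bit (power divided by throughput) along with the relative throughput gain. Energy per bit is used here as a stand-in for the power-delay product, which the text does not define explicitly, so this is an assumption; the names below are hypothetical.

```python
# Rows transcribed from Table 5: throughput in Mb/s, power in mW.
DESIGNS = {
    "non-pipelined": {"throughput_mbps": 2442, "power_mw": 38.2},
    "pipelined ENC": {"throughput_mbps": 6274, "power_mw": 84.6},
    "pipelined DEC": {"throughput_mbps": 6321, "power_mw": 94.5},
}


def energy_per_bit_pj(design):
    """Power over throughput; mW / (Mb/s) is nJ/bit, so scale by 1000 for pJ/bit."""
    row = DESIGNS[design]
    return row["power_mw"] / row["throughput_mbps"] * 1000.0


def throughput_gain_percent(design, baseline="non-pipelined"):
    """Throughput increase of a design relative to the baseline, in percent."""
    ratio = DESIGNS[design]["throughput_mbps"] / DESIGNS[baseline]["throughput_mbps"]
    return (ratio - 1.0) * 100.0


energy = {name: energy_per_bit_pj(name) for name in DESIGNS}
gain_enc = throughput_gain_percent("pipelined ENC")
gain_dec = throughput_gain_percent("pipelined DEC")
```

With the table's figures, both pipelined modes spend less energy per bit than the non-pipelined design, and the throughput gain is about 157% for encryption and just under 159% for decryption.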

5 Conclusion and Future Work

In this paper, a sub-pipelined ASIC design for the AES in CMOS 180nm is proposed and simulated. The sub-pipelined “one loop structure” is divided into four stages: three stages are shared by the encryption and decryption modes, and the fourth stage is used in decryption mode only. This configuration has nearly the same throughput in both modes and saves power in encryption mode. Optimizations at the gate and transistor levels were applied and can be pursued further in future work, especially for the 65nm CMOS node, where leakage power contributes significantly to the overall power consumption. Using a dual threshold voltage technique in the critical paths of the sub-pipelined stages is another direction for optimization. At a cost of 25.8% extra hardware, the simulation results show a lower power-delay product and a throughput improvement of 158% compared to the non-pipelined structure in the 180nm CMOS node, with a system throughput of more than 6.2Gb/s.

6 Acknowledgment

The authors would like to thank NSERC-Canada and CMC-Canada for the financial support and the Cadence tools used in this project, while the authors were at the University of Western Ontario, London, Canada.

7 References

[1] Daemen, J., Rijmen, V.: The design of Rijndael: AES- The Advanced Encryption Standard. Springer-Verlag Berlin Heidelberg, 2002.

[2] Kim, N.S., Mudge, T., and Brown, R. “A 2.3 Gbit/s fully integrated and synthesizable AES Rijndael core”, Proc. IEEE Custom Integrated Circuits Conference (CICC), San Jose, CA, pp. 193–196, 2003.

[3] H. Kuo and I. Verbauwhede, “Architectural optimization for a 1.82 Gbits/sec VLSI implementation of the AES Rijndael algorithm,” Proc. Cryptographic Hardware and Embedded Systems (CHES) 2001, no. 2162 in LNCS, 2001.

[4] Verbauwhede, I., Schaumont, P., and Kuo, H.: “Design and performance testing of a 2.29-GBit/s Rijndael processor”, IEEE J. Solid-State Circuits, 38, (3), pp. 569–572, 2003.

[5] Gurkaynak, F.K., and Burg, A. et al.: “A 2 Gbit/s balanced AES crypto-chip implementation”, Proc. Great Lakes Symp. on VLSI 2004, pp. 39–44, 2004.

[6] R. Sever, A. Neslin, Y. Tekmen, M. Askar “A High Speed ASIC Implementation of the Rijndael Algorithm”, International Symposium on Circuits and Systems, Volume 2, Issue 23, pp. 541-544, 2004.

[7] H. Li “Efficient and flexible architecture for AES”, IEE Proc. on Circuits, Devices, and Systems, vol. 153, no. 6, 2006.

[8] Xinmiao Zhang and Keshab K. Parhi “High-Speed VLSI Architectures for the AES Algorithm”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, September 2004.

[9] Wolkerstorfer, J., Oswald, E., and Lamberger, M. “An ASIC implementation of the AES Sboxes”, Proc. CT-RSA, (Lect. Notes Comput. Sci., 2002, 2271), pp. 67–78, 2002.

[10] Joon Hyoung Shim, Dae Won Kim, Young Kyu Kang, Taek Won Kwon and Jun Rim Choi “A Rijndael Cryptoprocessor Using Shared On-the-fly Key Scheduler”, IEEE Asia-Pacific Conference on ASIC, pp. 89-92, 2002.
