[IEEE 2011 23rd International Conference on Microelectronics (ICM) - Hammamet, Tunisia...

Abstract—Stage optimization of the hardware implementation of a popular encryption algorithm, the Advanced Encryption Standard (AES), is presented. The optimization, which targets lower power dissipation, is based on applying the Multi-Threshold CMOS (MTCMOS) technique in each of the AES stages. The critical paths are implemented using high performance gates based on high driving current transistors. For the optimized design, the simulation results show about 10% reduction in power consumption compared to non-optimized designs while maintaining the same throughput of 18 Gbit/s.

Index Terms— Advanced Encryption Standard (AES), Multi-threshold CMOS (MTCMOS), Design optimization, Leakage current reduction.

I. INTRODUCTION

The Advanced Encryption Standard (AES) is well suited to applications requiring architectures with gigabit data streams. AES is known for its global acceptance as a robust system with a relatively small hardware footprint. It also provides secure, high-speed, low-power data transmission, enabling many important applications (e.g. optical networks and passive/active RFID tag ICs).

AES is an iterated block cipher with a fixed block size of 128 bits and a variable key length. It is therefore an attractive algorithm for pipelining, since each 128-bit block is processed using the same algorithm flow for several iterations. Using registers to store the output of each iteration and feed it back to the input for the next iteration significantly reduces the hardware complexity. Such a structure is called the compact iteration loop.

The compact iteration loop is simulated in CMOS 65nm technology and the results are presented in this paper. The loop itself is sub-pipelined into stages by placing registers at locations where the signal propagation time in each stage is about the same. In the optimization process of the sub-pipelined compact loop, the critical path of each stage is determined. To reduce the power consumption of the design, this work targets the sub-threshold leakage power and the dynamic power consumed due to the different propagation times of the signals. The sub-threshold leakage current is reduced by implementing the maximum possible portion of the design with low power transistors. The critical path, on the other hand, is implemented with transistors that are faster but have higher leakage current. This implementation not only lowers the leakage power but also brings the propagation times of the critical and non-critical paths closer together.

The remainder of the paper is organized as follows. Section two introduces AES and its four transformation steps. The optimization at the gate and transistor levels is presented in Section three, focusing on the uniformity of the critical paths of the AES stages. Performance evaluation is given in Section four. Section five presents the conclusion.

II. PIPELINED AES ENC/DEC DESIGN

The Advanced Encryption Standard (AES) takes as input an array of bytes called the initial state block [1]. In this work the input block size is 128 bits. A modification, at the module level, of the previous compact ASIC implementations of the AES [2-7] has been proposed [8]. The “equivalent decryption method” design is used, in which the Encryption and Decryption (Enc/Dec) rounds share most of the hardware. The objective of the modification at the system level is to increase the throughput of the design by applying pipelining to the Enc/Dec rounds. By pipelining, the main critical path (longest propagation path) of the non-pipelined round is divided into sub-critical paths, which minimizes the dynamic power resulting from unnecessary gate switching due to different signal arrival times. The new “sub-critical paths” can be efficiently optimized at both the gate and transistor levels. The sub-pipelined Enc/Dec round and key expansion modules are described in this section. The sub-pipelined round structure is presented along with its hardware complexity.

The compact AES structures [2,3,5] that implement the Enc/Dec algorithm utilize one round to process the data repeatedly. The Enc/Dec round is the path where the state (128 bits of data) is processed in the SubBytes, ShiftRows, MixColumns, and expanded round key addition steps for either the encryption or the decryption option. The data flows through the round operations (SubBytes, ShiftRows and MixColumns for encryption; Inverse SubBytes, Inverse ShiftRows, and Inverse MixColumns for decryption) and is then stored in registers after the key addition stage as an intermediate result. The intermediate data is then processed again by the same operations. One set of data is processed in each clock cycle. This kind of structure is best suited when a low clock frequency is preferred. The compact design can be pictured as a “one loop structure” (as shown in Figure 1).

Transistor Level Optimization of Sub-Pipelined AES Design in CMOS 65nm

Abdallah Alma’aitah, Electrical and Computer Engineering Dept., Queen’s University, Kingston, ON, Canada, email: [email protected]

Zine-Eddine Abid, Electrical & Electronics Engineering Dept., HCT-ADMC, Abu Dhabi, UAE, email: [email protected]

978-1-4577-2209-7/11/$26.00 ©2011 IEEE

Figure 1. General structure of the Enc/Dec round pipelined with one register after the Add round key step.

The AES round is divided into three stages for encryption mode and four stages for decryption mode, as shown in Figure 2. The stage boundaries are not limited to the boundaries of the different operations (e.g. stage one can span the SubBytes operation and part of the ShiftRows operation). The first three stages are shared by the decryption and encryption operations, while the fourth stage is used only in decryption mode. The modules that are used for decryption (the yellow blocks in Figure 2) are deactivated during encryption.

Figure 2. Proposed Sub-pipelined structure for one Enc/Dec round.

The time delay of the critical path is used to divide the AES structure into stages. The four critical paths consist mainly of XOR gates. Other gates (inverter, AND, NAND), although faster than XOR gates, are also considered in calculating the critical path. The block structure for the path of one byte in the Enc/Dec round is shown in Figure 3, along with the internal components of each stage. The critical path of the Enc/Dec round is dominated by the SubBytes design [9]. The Inv/SubBytes critical path consists of 28 of the 40 gates of the round; the rest of the critical path is mainly Inv/MixColumns, multiplexers (MUXs), and Add round key.

III. PROPOSED DESIGN

The benefits of CMOS technology scaling in the nanometer regime come with the disruptive consequence of increasing MOS transistor leakage current [10]. This increase not only impacts the overall power consumption of a CMOS system, but also reduces the allowed design margins due to the strong relationship between process variation and leakage power. Therefore, to obtain the maximum benefit from technology scaling, new circuit designs that mitigate this problem are essential. Dynamic power and leakage current reduction techniques are applied to the sub-pipelined AES based on CMOS 65nm transistors. Transistor level optimization will be even more important in CMOS technologies beyond 32nm.

Figure 3. One byte path in the proposed Sub-pipelined structure, including the number of gates in the critical path, for each stage

As CMOS transistors are scaled down, the supply voltage VDD is reduced, so the threshold voltage (Vt) must also be reduced to maintain the drive current. Reducing the threshold voltage, however, increases the sub-threshold leakage current.

In our design, sub-threshold current reduction techniques are used to optimize the CMOS 65nm implementation of the AES. Sub-threshold leakage currents depend exponentially on the device’s threshold voltage and can be changed by orders of magnitude by switching between high and low threshold voltages. Dual threshold voltage transistors are used in the design to implement the “XOR” and “AND/NAND” gates. In many modern processes, multiple threshold voltage devices are readily available to the circuit designer. Multi-performance devices require only an extra mask layer to select between high and low threshold voltages. This provides the ability to choose either fast (but high leakage) or slow (but low leakage) transistor models. Multi-performance devices enable leakage reduction with a possible small performance degradation.
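The exponential dependence of sub-threshold leakage on the threshold voltage can be illustrated with a short numerical sketch. It assumes the textbook first-order model, in which the leakage at zero gate bias scales as exp(-Vt / (n kT/q)). The slope factor n = 1.5, the 300 K temperature, and the threshold values are illustrative assumptions and are not taken from the 65nm library used in this work.

```python
import math

# Physical constants (SI values).
BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_kelvin=300.0):
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def leakage_ratio(vt_low, vt_high, n=1.5, temp_kelvin=300.0):
    """Ratio I_sub(vt_low) / I_sub(vt_high) at zero gate bias.

    First-order model: I_sub is proportional to exp(-Vt / (n * kT/q)).
    All other device parameters are assumed equal, so they cancel out.
    """
    return math.exp((vt_high - vt_low) / (n * thermal_voltage(temp_kelvin)))


if __name__ == "__main__":
    # Hypothetical threshold voltages, chosen only for illustration.
    for vt_low, vt_high in [(0.30, 0.40), (0.30, 0.50)]:
        ratio = leakage_ratio(vt_low, vt_high)
        print(f"Vt {vt_low:.2f} V vs {vt_high:.2f} V: leakage ratio ~ {ratio:.0f}x")
```

Under these assumptions, each additional 100 mV of threshold voltage lowers the leakage by roughly one order of magnitude, which is why a modest threshold difference between the two transistor types matters.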

A. MTCMOS in critical paths

To harvest the benefits of the dual threshold technique [10-13], the gates of the sub-pipelined circuit [8] are implemented using the two types of available transistors. The tradeoff between delay and leakage power is based on the Low Power (LP) transistors (high threshold voltage) and General Purpose (GP) transistors (lower threshold voltage). The critical path within the AES modules is implemented with GP transistors, while non-critical paths are implemented with LP transistors to minimize leakage currents and to give comparable time delays, which reduces the dynamic power. With this method, leakage and dynamic currents can be significantly reduced in both the standby and active modes compared to designs that are implemented with either GP or LP transistors alone.

In order to place the high performance gates in the longest (critical) paths in every stage, the components of the stages in Figure 3 are represented as boxes with all the possible paths. The critical paths in all the stages are implemented using high performance gates. Figure 4 shows the four stages and their internal components. Each component is represented by its possible paths rather than by all of its gates. Circles and diamonds denote the “XOR” and “AND/NAND” gates, respectively. The dark components indicate the use of high performance gates (GP transistors) instead of low power gates. In the AES structure, most of the critical path is dominated by XOR gates. It was observed that an 8-transistor XOR gate does not give a significant reduction in power compared to a CMOS-based XOR gate, owing to the better driving capabilities. The ratio of the number of low power gates to the total number of gates is presented in Table 1. This ratio is low in some stages due to the parallel processing of all input signals.
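The assignment of GP gates to the critical path and LP gates elsewhere can be sketched as a longest-path search over a gate-level graph. The code below is a minimal illustration and not the flow used in this work: the function name, the netlist, and the delay values are hypothetical, only a single longest path is marked, and the slowdown that LP transistors introduce on the remaining gates is not re-timed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def assign_transistor_types(fanin, delay):
    """Label gates on the longest-delay path "GP" and all others "LP".

    fanin: dict mapping each gate to the list of gates that drive it.
    delay: dict mapping each gate to its propagation delay.
    """
    arrival = {}     # latest arrival time at each gate's output
    worst_pred = {}  # driver on the slowest path into each gate
    for gate in TopologicalSorter(fanin).static_order():
        preds = fanin.get(gate, [])
        worst = max(preds, key=arrival.__getitem__, default=None)
        worst_pred[gate] = worst
        arrival[gate] = delay[gate] + (arrival[worst] if worst is not None else 0.0)

    # Walk back from the latest-arriving gate to recover the critical path.
    gate = max(arrival, key=arrival.get)
    critical = set()
    while gate is not None:
        critical.add(gate)
        gate = worst_pred[gate]

    return {g: ("GP" if g in critical else "LP") for g in arrival}


if __name__ == "__main__":
    # Hypothetical five-gate netlist; delays are in arbitrary units.
    example_fanin = {
        "xor1": [],
        "and1": [],
        "xor2": ["xor1", "and1"],
        "xor3": ["xor2"],
        "inv1": ["and1"],
    }
    example_delay = {"xor1": 2.0, "and1": 1.0, "xor2": 2.0, "xor3": 2.0, "inv1": 0.5}
    for name, kind in sorted(assign_transistor_types(example_fanin, example_delay).items()):
        print(name, kind)
```

In this toy netlist the chain of three XOR gates forms the longest path and is marked GP, while the AND gate and the inverter are marked LP. A practical flow would repeat the timing analysis after each assignment, since replacing a gate with its LP version lengthens the paths through it.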

IV. PERFORMANCE EVALUATION

In this section, the simulation results of the optimized design in CMOS 65nm are presented. All the simulations are performed using Cadence SPECTRE. The transistor model library has both Low-Power transistors (with high threshold voltage) and General Purpose transistors (with standard threshold voltage).

The time delay of the four stages in the sub-pipelined design is presented in Table 2. The delay reported for each stage is its worst-case delay (based on the test vectors presented in the Appendix). In the “ALL-GP” design, all four stages and the registers are built using gates with GP transistors; in the “ALL-LP” design, the stages and registers are built using gates with LP transistors. In the Optimized design, the stages are optimized as shown in Figure 4.

Figure 4. Critical/Non-Critical path optimization of the design in [8]

TABLE 1. Ratio of the number of low power gates to the number of all gates in the optimized sub-pipelined structure.

Stage       Ratio of LP gates to all gates
Stage #1    32 %
Stage #2    41 %
Stage #3    18 %
Stage #4    27 %
Registers   66 %
Total       35 %

In encryption mode, one burst of data (128 bits) has to be iterated 10 times in three stages. Therefore, the total delay to encrypt one data vector is 10*(3*maximum stage delay). In decryption, the data burst is processed 11 times in four stages. Before data encryption or decryption starts, a delay of one clock cycle is allowed for the key expansion unit to produce the first extended key (in decryption it is delayed by two clock cycles) [8]. Consequently, the throughput in bit/s can be calculated as:

$$\text{Throughput (Enc)} = \frac{128 \times 3}{(\text{Max-delay} \times 34) + \text{initialization time}}$$

$$\text{Throughput (Dec)} = \frac{128 \times 4}{(\text{Max-delay} \times 46) + \text{initialization time}}$$

The initialization time is negligible compared to the total encryption or decryption process time. The simulation results for the sub-pipelined design in decryption and encryption modes are shown in Figure 5. The power is reduced by an average of 10% while achieving the same throughput as the high performance (GP-only) design. The difference in power consumption between the two modes is due to the disabling of the fourth stage and its registers during encryption. Table 3 shows the gate count and throughput for some previously published ASIC designs. The higher throughput of this work is mainly the result of the 65nm high performance (GP) transistors; however, the optimization provides lower total power consumption while maintaining a high throughput.
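The throughput expressions of this section can be written as a small helper, shown here as a sketch. The function names are hypothetical, and the 1 ns delay in the usage example is a placeholder rather than a value reported in this paper.

```python
BLOCK_BITS = 128  # AES block size used throughout the paper


def throughput_bits_per_s(num_stages, cycles, max_delay_s, init_time_s=0.0):
    """Throughput = BLOCK_BITS * num_stages / (max_delay * cycles + init_time)."""
    return BLOCK_BITS * num_stages / (max_delay_s * cycles + init_time_s)


def enc_throughput(max_delay_s, init_time_s=0.0):
    """Encryption mode: three stages, 34 in the denominator."""
    return throughput_bits_per_s(3, 34, max_delay_s, init_time_s)


def dec_throughput(max_delay_s, init_time_s=0.0):
    """Decryption mode: four stages, 46 in the denominator."""
    return throughput_bits_per_s(4, 46, max_delay_s, init_time_s)


if __name__ == "__main__":
    placeholder_delay = 1e-9  # 1 ns, an assumed value for illustration only
    print(f"Enc: {enc_throughput(placeholder_delay) / 1e9:.2f} Gbit/s")
    print(f"Dec: {dec_throughput(placeholder_delay) / 1e9:.2f} Gbit/s")
```

Because the initialization time appears only as an additive term in the denominator, setting it to zero matches the statement that it is negligible compared to the total processing time.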

TABLE 2. Time delay of AES stages based on three different designs

Stage   Design      Delay (ns)
1       GP          0.398
1       LP          0.943
1       Optimized   0.418
2       GP          0.436
2       LP          0.995
2       Optimized   0.447
3       GP          0.407
3       LP          1.047
3       Optimized   0.428
4       GP          0.490
4       LP          1.143
4       Optimized   0.510
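The stage delays in Table 2 can be compared directly with a few lines of arithmetic. The sketch below uses only the values listed in the table; the dictionary layout and function names are illustrative choices.

```python
# Stage delays in nanoseconds, copied from Table 2.
TABLE2_DELAY_NS = {
    1: {"GP": 0.398, "LP": 0.943, "Optimized": 0.418},
    2: {"GP": 0.436, "LP": 0.995, "Optimized": 0.447},
    3: {"GP": 0.407, "LP": 1.047, "Optimized": 0.428},
    4: {"GP": 0.490, "LP": 1.143, "Optimized": 0.510},
}


def overhead_vs_gp_percent(stage):
    """Extra delay of the optimized stage relative to the ALL-GP stage, in %."""
    row = TABLE2_DELAY_NS[stage]
    return 100.0 * (row["Optimized"] - row["GP"]) / row["GP"]


def speedup_vs_lp(stage):
    """How many times faster the optimized stage is than the ALL-LP stage."""
    row = TABLE2_DELAY_NS[stage]
    return row["LP"] / row["Optimized"]


if __name__ == "__main__":
    for stage in sorted(TABLE2_DELAY_NS):
        print(
            f"Stage {stage}: +{overhead_vs_gp_percent(stage):.1f}% vs GP, "
            f"{speedup_vs_lp(stage):.2f}x faster than LP"
        )
```

From these values, each optimized stage is roughly 2.5% to 5% slower than its ALL-GP counterpart and more than twice as fast as its ALL-LP counterpart.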

Figure 5. Power and time delay for the proposed sub pipelined structure in 65nm CMOS with GP, LP, and optimized stages.

TABLE 3. Throughput comparison of different AES designs.

Implementation          CMOS technology   Throughput (Mb/s)   Gates
Sever [6]               350nm             1690                149000
Kim [2]                 180nm             1640                28626 + 128Kb ROM
Kuo [3]                 180nm             1280                173000
Verbauwhede [4]         180nm             1600                173000
Gurkaynak [5]           250nm             2120                119000
Li [7]                  180nm             3840                39980
Sub-Pipelined [8]       180nm             6140                13093
This work (Optimized)   65nm              18200               13093

V. CONCLUSION

In this paper, a sub-pipelined ASIC AES design is proposed and simulated in CMOS 65nm technology. The critical path of the sub-pipelined structure (of four stages) is examined. Optimizations at the gate and transistor levels are applied to the AES design using the dual threshold voltage technique in the critical paths of the sub-pipelined stages. Simulation results show about 10% lower power dissipation while maintaining the same system throughput of more than 18 Gb/s.

VI. APPENDIX

Input test vectors (128 bits) used to simulate the time delay of each stage of the various (LP, GP, and optimized) AES structures.

128-bit test vectors (hexadecimal)
32 43 F6 A8 88 5A 30 8D 31 31 98 A2 E0 37 07 34
19 A0 9A E9 3D F4 C6 F8 E3 E2 8D 48 BE 2B 2A 08
A4 68 6B 02 9C 9F 5B 6A 7F 35 EA 50 F2 2B 43 49
AA 61 82 68 8F DD D2 32 5F E3 4A 46 03 EF D2 9A
48 67 4D D6 6C 1D E3 5F 4E 9D B1 58 EE 0D 38 E7
E0 C8 D9 85 92 63 B1 B8 7F 63 35 BE E8 C0 50 01
F1 C1 7C 5D 00 92 C8 B5 6F 4C 8B D5 55 EF 32 0C
26 3D E8 FD 0E 41 64 D2 2E B7 72 8B 17 7D A9 25
5A 19 A3 7A 41 49 E0 8C 42 DC 19 04 B1 1F 65 0C
EA 04 65 85 83 45 5D 96 5C 33 98 B0 F0 2D AD C5
EB 59 8B 1B 40 2E A1 C3 F2 38 13 42 1E 84 E7 D2

REFERENCES

[1] J. Daemen, V. Rijmen: The design of Rijndael: AES - The Advanced Encryption Standard. Springer-Verlag Berlin Heidelberg, 2002.

[2] N.S. Kim, T. Mudge, and R. Brown, “A 2.3 Gbit/s fully integrated and synthesizable AES Rijndael core”, Proc. IEEE Custom Integrated Circuits Conference (CICC), San Jose, CA, pp. 193–196, 2003.

[3] H. Kuo and I. Verbauwhede, “Architectural optimization for a 1.82 Gbits/sec VLSI implementation of the AES Rijndael algorithm,” Proc. Cryptographic Hardware and Embedded Systems (CHES) 2001, no. 2162 in LNCS, 2001.

[4] I. Verbauwhede, P. Schaumont, and H. Kuo: ‘Design and performance testing of a 2.29-GBit/s Rijndael processor’, IEEE J. Solid-State Circuits, 38, (3), pp. 569–572, 2003.

[5] Frank K. Gürkaynak et al.: ‘A 2 Gbit/s balanced AES crypto-chip implementation’. Proc. Great Lakes Symp. on VLSI 2004, pp. 39–44, 2004.

[6] R. Sever, A. Neslin, Y. Tekmen, M. Askar, “A High Speed ASIC Implementation of the Rijndael Algorithm,” International Symp. on Circuits & Systems, Volume 2, No 23, pp. 541-544, 2004.

[7] H. Li “Efficient and flexible architecture for AES” IEE proc. on circuits, devices, and Systems, vol. 153, no. 6, 2006.

[8] A. Alma'aitah and Zine-Eddine Abid, "Area efficient-high throughput sub-pipelined design of the AES in CMOS 180nm," 5th International Design and Test Workshop (IDT), pp.31-36, 2010.

[9] Xinmiao Zhang and Keshab K. Parhi “High-Speed VLSI Architectures for the AES Algorithm” IEEE Transactions on VLSI systems, Vol. 12, No. 9, September 2004.

[10] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits,” Proc. IEEE, vol. 91, Issue 2, 2003.

[11] J. Yuan and C. Svensson, “High-Speed CMOS Circuit Technique,” IEEE J. Solid-State Circuits, vol. 24, pp.62–70, 1989.

[12] S. Narendra, D. Blaauw, A. Devgan, and F. Najm, “Leakage issues in IC Design: Trends, Estimation and Avoidance,” Proceedings of the ASP-DAC Asia and South Pacific Design Automation Conference, vol. 1, 2005.

[13] F. Sill, F. Grassert, and D. Timmermann, “Low Power Gate-level Design with Mixed-Vth (MVT) Techniques,” 17th Symposium on Integrated Circuits and Systems Design, pp. 278-282, 2004.
