
Universitatea POLITEHNICA din București

Facultatea de Automatică și Calculatoare,

Departamentul de Calculatoare

LUCRARE DE DISERTAȚIE

Encriptare AES pe arhitecturi de tip consumator

Conducător Științific: Sl.dr.ing. Laura Gheorghe

Autor: ing. Grigore Lupescu

București, 2014


University POLITEHNICA of Bucharest

Faculty of Automatic Control and Computers,

Computer Science and Engineering Department

MASTER THESIS

AES encryption on modern consumer architectures

Scientific Adviser: Sl.dr.ing. Laura Gheorghe

Author: ing. Grigore Lupescu

Bucharest, 2014



Abstract

Specialized cryptographic processors target professional applications and offer both low latency and high throughput at the expense of cost. At the consumer level, a modern SoC embodies several accelerators and vector extensions (e.g. SSE, AES-NI), having a high degree of programmability through multiple APIs (OpenMP, OpenCL, etc.). This work explains how a modern x86 system that encompasses several compute architectures (MIMD/SIMD) might perform well compared to a specialized cryptographic unit at a fraction of the cost. The analyzed algorithm is AES (AES-128, AES-256) and the mode of operation is ECB. The initial test system is built around the SoC AMD A6-5400K (CPU + integrated GPU), coupled with a discrete GPU, an AMD R7 250. Benchmark results compare CPU OpenSSL execution (no AES-NI), CPU AES-NI acceleration, the integrated GPU, the discrete GPU and heterogeneous combinations of the above processing units. Multiple test results are presented and inconsistencies are explained. Finally, based on the initial results, a system composed only of low-end, low-power consumer components is designed, built and tested.


CPU Central Processing Unit
APU Accelerated Processing Unit
SoC System on Chip
RAM Random Access Memory
DMA Direct Memory Access
SIMD Single Instruction Multiple Data

AES Advanced Encryption Standard
ECB Electronic Code Book, AES operation mode
CTR Counter, AES operation mode

ALU Arithmetic Logic Unit
FPU Floating Point Unit
GFLOPS Giga Floating-Point Operations Per Second
GB Gigabytes
MB Megabytes

GPU Graphics Processing Unit
GPGPU General-Purpose GPU
VRAM Video Random Access Memory
PCIe Peripheral Component Interconnect Express
iGPU Integrated Graphics Processing Unit
dGPU Discrete Graphics Processing Unit
CU Compute Unit
VLIW Very Long Instruction Word
LDS Local Data Share
GDS Global Data Share

OpenMP Open Multi-Processing
OpenCL Open Computing Language
CUDA Compute Unified Device Architecture
G++ GNU C++ Compiler
GCC GNU Compiler Collection

AESNI AES New Instructions
SSE Streaming SIMD Extensions
AVX Advanced Vector Extensions


Chapter 1

Introduction

Continuous technological advancements have driven the evolution of processors from a complexity of a couple of thousand transistors up to billions of transistors today. Processors are currently undergoing a transition driven by power and performance limitations, and algorithms must also evolve to best suit the ever-increasing flow of data processing. In 2004, the failure of frequency scaling (driven by physical limitations) forced processor vendors to redesign chips to integrate more processing cores. As a result, the burden of extracting performance has shifted onto software companies [20].

The year 2006 marked the first generation of GPGPUs (SIMD architecture) available to consumers [5]. Since then they have slowly evolved to become more and more generalized, so that their resources can be used for more than just graphics. The trend in software over the past couple of years has been towards exploiting the parallelism of multi-core CPUs (MIMD architecture) and the compute capabilities of GPUs. Programming a multi-core CPU is challenging due to the inherent data sharing, the synchronization required and bandwidth limitations [20]. Programming a GPU is even more difficult, because of the extensive knowledge required about the target architecture, the multiple architectures available, the latency of memory transfers, and hardware limitations regarding branching and work scheduling.

Since 2011 nearly every consumer x86 CPU has a multi-core configuration and an iGPU on the same die. GPUs have become widespread, and more software companies are trying to offload parallel workloads such as compression, image processing, rendering and simulations to them. In the case of a dGPU the main problem is the data-transfer overhead of the PCIe bus. Most iGPUs, because they are on the same die as the CPU, share the same memory controller and can move data to or from the GPU with low latency. The disadvantages of such a solution are the limited memory bandwidth and the limited thermal envelope of the whole CPU+GPU package.

A modern SoC embodies several accelerators and vector extensions, having a high degree of programmability through multiple APIs (OpenMP, OpenCL, CUDA, etc.) [2, 7]. Opposed to these are specialized industrial processing units (networking, cryptographic, signal processing, etc.) which have a low degree of programmability but offer much higher performance in what they were designed for [15]. Certain areas previously limited to industrial applications have become more accessible to common users due to continuous hardware extensions and add-ins. One of these areas is cryptography, which, while previously available through software, can currently be accelerated by multiple threads (multi-core CPU), by using an accelerator like the GPU, or through specialized hardware extensions.

The present work focuses on AES due to the opportunity to test and compare software encryption, hardware encryption, and encryption using a GPU/SIMD accelerator. AES was chosen by NIST to become the standard due to its high speed and low RAM requirements. It is a block cipher based on symmetric keys for the encryption of electronic data [22]. Though AES is used on a global scale for data encryption, it may also serve different purposes such as pseudo-random number generation, mining cryptocurrency, etc. Improving AES performance on today's architectures through software may benefit not only encryption, but a larger set of problems.

Previous work regarding AES encryption shows that high-end GPGPU architectures or high-end CPUs with AESNI achieve good performance [12], [25], [14], [18]. This work focuses solely on modern consumer low-end architectures. Through specialized extensions such as AESNI or vector/SIMD extensions, architectures may accelerate AES encryption by a considerable margin compared to "non-accelerated" versions of AES encryption.

The work starts by presenting the current literature with regards to architecture evolution and opportunities to accelerate AES. Afterwards it explains how a modern x86 system that encompasses several compute architectures (MIMD/SIMD) might perform well compared to a specialized cryptographic unit at a fraction of the cost. The initial test system is built around a modern x86 SoC (CPU+iGPU) coupled with a dGPU and fast RAM/VRAM memory. Benchmark results compare CPU OpenSSL execution (no AES-NI), CPU AES-NI acceleration, the integrated GPU, the discrete GPU and heterogeneous combinations of the above processing units. Multiple test results are presented and inconsistencies are explained.

Finally, based on the initial results, a low-end/low-power/low-cost system is designed, built and tested. Conclusions are drawn with regards to performance measurements for both the initial test system and the proposed solution (usability, cost, performance and complexity of deployment).


Chapter 2

State of the art

Continuous technological advancements have driven the evolution of processors from a complexity of a couple of thousand transistors in the late 1970s up to billions of transistors today. Along this path of evolution several decisions were made that radically changed the structure of today's processors. Notable examples include: the addition of hardware vector extensions, the transition from frequency scaling to core scaling, structural changes that enabled general-purpose GPUs, and auto-tuning based on demand and power consumption.

2.1 Modern CPU/APU architectures

From the consumer's point of view the most widespread architectures are x86/x86_64 and ARM. This work focuses only on the x86/x86_64 architecture. One novel architecture is the GPGPU, which, although rather primitive in functionality compared to a modern CPU, has a SIMD-like structure enabling it to perform a high amount of computation in parallel. Restrictions in the compute model and the complex programming models available for GPGPUs restrict their general use and widespread adoption.

A homogeneous architecture denotes a uniform set of identical cores that commonly share the same memory space (UMA or NUMA). As opposed to homogeneous architectures, heterogeneous architectures place several different compute units on the same die. Examples of compute unit types include: general-purpose cores, programmable vector SIMD units (GPUs), specialized units (cryptographic, video), and field-programmable gate arrays [2, 7]. The need for high-performance computing along with power and heat limitations has driven the demand for heterogeneity in computing systems. It follows that most of today's consumer processing units embody circuitry for the acceleration of various specialized operations (video, image, cryptography).

The CPU is the main unit of control in a system and has evolved over time from a simple, single control/processing unit into a complex SoC integrating more compute capability and system responsibility (heterogeneous compute cores). Today's x86 CPUs are heterogeneous multi-cores (x86 cores and GPU cores) with a complex cache hierarchy of up to 4 levels (L4 cache). Additional logic in a modern CPU covers virtualization, power management, vector extensions (SSE, AVX) and security. CPUs nowadays usually integrate additional controllers such as the memory controller, PCIe controller, etc.

Programming modern CPUs has become increasingly difficult due to the multi-core architecture, the various different compute units present (GPU, video) and the complex memory hierarchy. Achieving high performance requires offloading computations to the correct accelerator with architecture-optimized code. This is very difficult to do because currently there is no standard abstraction layer or broader framework that can do this easily. Current standards such as OpenCL/OpenACC/CUDA require in-depth knowledge of the accelerator architecture.

2.2 Modern GPU architectures

Any data-intensive application that can be mapped to a 2D matrix is a candidate for running on a SIMD-type architecture such as a GPU. In general the command processor reads commands that the host has written to memory-mapped registers in the system-memory address space. The command processor sends hardware-generated interrupts to the host when a command is completed. The GPU's memory controller has direct access to all device memory and host-specified memory. A modern GPU may run any of the following programs: vertex shader (VS), geometry shader (GS), DMA copy (DC), pixel shader (PS) or fragment shader, and compute shader (CS) [3, 5, 6, 16, 9, 17].

This work focuses only on compute shaders, since this is the modern way of programming a GPU. A compute shader is a generic program (compute kernel) that uses an input work-item ID as an index to perform gather reads on one or more sets of input data, arithmetic computation, and scatter writes of one or more sets of output data to memory [3, 6].

From the perspective of the CPU, the GPU has a dedicated VRAM to which data must be moved to and from RAM. If the GPU is discrete then data must travel through the PCIe lanes, as opposed to an integrated GPU where memory may be mapped (the iGPU and CPU share the same RAM). The PCIe bus introduces a significant amount of latency that must be hidden by overlapping data processing with data movement. An iGPU, in contrast, has issues regarding memory bandwidth, since it shares the same memory bus with the CPU.

The memory hierarchy from the perspective of a modern GPU is composed of the following: a parallel data cache or shared memory, a read-only constant cache and a read-only texture cache. The texture cache is not exposed when using compute shaders and any of the current compute languages (OpenCL, CUDA, DirectCompute, OpenACC) [4, 6].

In a modern GPU architecture each SIMD has a cache memory space (16KB-128KB) that enables low-latency communication between work-items within a work-group, i.e. the local data share (LDS) memory. This memory is configured in banks, each with a number of entries (each entry has a number of bytes). In essence the LDS is a very low-latency RAM scratch-pad for temporary data, with at least one order of magnitude higher effective bandwidth than direct, uncached global memory. The high bandwidth of the LDS memory is achieved not only through its proximity to the ALUs, but also through simultaneous access to its memory banks. If, however, more than one access attempt is made to the same bank at the same time, a bank conflict occurs. In this case, for indexed and atomic operations, hardware prevents the attempted concurrent accesses to the same bank by turning them into serial accesses. This decreases the effective bandwidth of the LDS [4, 6, 3].
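As an illustration, the serialization rule can be modeled in a few lines of C++. The bank count of 32 and the word-to-bank mapping used here are typical values assumed for illustration, not figures taken from this text:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Model: LDS word w lives in bank (w % NUM_BANKS). The effective cost of one
// simultaneous access by a group of work-items is the worst-case number of
// requests hitting a single bank, since conflicting requests are serialized.
constexpr std::size_t NUM_BANKS = 32;

std::size_t lds_access_cost(const std::vector<std::size_t>& word_indices) {
    std::array<std::size_t, NUM_BANKS> hits{};   // requests per bank
    for (std::size_t w : word_indices) ++hits[w % NUM_BANKS];
    return *std::max_element(hits.begin(), hits.end());
}
```

With this model, 32 work-items reading consecutive words touch every bank once (cost 1), while 32 work-items reading with a stride of 32 words all hit bank 0 and are fully serialized (cost 32).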

Another important memory found in a GPU is the global data share (GDS) memory, which can be used by work-items on all SIMDs. This memory enables low-latency access for all the processing elements. Modern GPUs also contain circuit logic for implementing barriers, semaphores, and other synchronization primitives.

The data stream of a kernel is architecturally defined as a number of work-groups. However, each GPU vendor defines its own intermediate micro-architectural grouping (e.g. wavefronts for AMD) that is both a data and a control flow construct. For example, in current AMD GPUs a wavefront is defined as a data flow 64 work-items wide with a single instruction control flow [4, 23].

In summary, programming a GPU raises the following issues: the problem needs to be suitable (SIMD-like), data must be transferred to/from CPU RAM, grouping must be efficient for the target architecture (number of work-groups, work-items, used registers) and the cache hierarchy (LDS, GDS, constant memory) must be used efficiently.

2.3 Study of AES encryption

Encryption is the process of transforming plain-text into an unintelligible form also known as cipher-text. The AES algorithm is a symmetric block cipher that can encrypt and decrypt information. It was adopted by NIST in 2001 as a standard for the encryption of electronic data. This standard specifies the Rijndael algorithm, which can process data blocks of 128 bits using cipher keys with lengths of 128, 192, and 256 bits. Though initially designed to handle additional block sizes, the AES standard specifies only 128 bits for the block size. Input (plain-text) and output (cipher-text) sequences are referred to as blocks (128 bits in length for AES). The algorithm may be used with three different key lengths (128, 192, 256), referred to as 'AES-128', 'AES-192', and 'AES-256' [19, 22].

Internally, the AES algorithm's operations are performed on a two-dimensional array of bytes called the State. The State consists of four rows of bytes, each containing Nb bytes, where Nb is the block length divided by 32. All bytes in the AES algorithm are interpreted as finite field elements. They can be added and multiplied, but these operations differ from those used for ordinary numbers. The addition of two elements in a finite field is achieved by "adding" the coefficients for the corresponding powers in the polynomials of the two elements.
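These field operations can be sketched in portable C++; this is the textbook shift-and-add ("xtime") method from the AES standard, shown here for exposition only (the function names are ours):

```cpp
#include <cstdint>

// Addition in GF(2^8): XOR of the polynomial coefficients.
uint8_t gf_add(uint8_t a, uint8_t b) { return a ^ b; }

// Multiply by x, reducing modulo the AES polynomial x^8+x^4+x^3+x+1 (0x11b).
uint8_t xtime(uint8_t a) {
    return static_cast<uint8_t>((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

// General multiplication in GF(2^8) via shift-and-add over the bits of b.
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1) p ^= a;   // add the current power-of-x multiple of a
        a = xtime(a);        // a *= x
        b >>= 1;
    }
    return p;
}
```

The standard's worked example, {57} + {83} = {d4} and {57} · {83} = {c1}, is reproduced by this sketch.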

Since its introduction in 2008, AESNI support has spread greatly and is present in both high-end and low-end CPUs, with software support in several libraries and applications. The following libraries have code written to support AESNI acceleration: PolarSSL, Crypto++, CyaSSL, Crypto API (Linux), IAIK-JCE, Integrated Performance Primitives (IPP), OpenSSL and the Solaris Cryptographic Framework. AESNI is also supported in various applications (latest versions) such as 7-Zip, BitLocker, TrueCrypt, Citrix XenClient and many more. Hardware support is also present in architectures other than x86, such as SPARC T4.

The majority of the literature with regards to AES processing inclines to strongly favour the GPU in comparison to the CPU [12, 18, 14, 25, 26, 13, 21, 16]. Details regarding implementation are usually scarce, with the highlight on the AES algorithm, the chosen API (e.g. OpenCL) and the results in comparison to the CPU [26, 13, 21]. In some, the transfer to/from the GPU is ignored or the CPU code is not parallelized. Such articles may create a false impression that the GPU clearly outperforms the CPU in AES processing. The current work doesn't attempt to make a generalization; instead it shows how different scenarios may impact performance and favour either the CPU or the GPU. Since hardware architectures vary greatly nowadays, a comparison with a widely used cryptographic library (OpenSSL) is also provided for a better view of actual performance [16].

For FPGA systems, simulations and tests have been documented, with mixed performance results compared with today's systems [11, 24].

Internet Protocol Security (IPsec) is a widely used protocol for securing IP communications by authenticating and encrypting each IP packet. The cryptographic algorithms defined for use with IPsec also include AES. Freescale, one of the major vendors of networking processors, integrates in some of its networking products a specialized IP block responsible for high-performance cryptography. The IP is named "SEC" and has several iterations, the latest being SEC 5.0, available in Freescale Layerscape products. This IP is tightly coupled with the whole networking architecture to provide low latency while also supporting a wide range of cryptographic primitives. The SEC 5.0 IP is advertised as being able to process a maximum of 20 Gbps of AES ECB-128, i.e. around 2-3 GB/sec [15]. Other security IPs are also integrated in low-power embedded CPUs by Freescale. For instance, security functions are enabled and accelerated in Freescale's low-power i.MX 6Dual/6Quad by the CAAM (Cryptographic Acceleration and Assurance Module).

With the constant increase in data processing, cryptographic algorithms need to be efficiently implemented in hardware and software. Another problem is the fact that most of today's consumer systems have multiple processing units (CPUs, GPUs). It follows that the main unit of control must correctly select the best available processing unit, or combination of units, for a given task. This work explores such a scenario, of a system with multiple processing units, for AES encryption. AES is used on a global scale for data encryption, but it may also serve different purposes such as pseudo-random number generation, mining cryptocurrency, etc. Improving AES performance on today's architectures through hardware and software may benefit not only encryption, but a much larger set of problems.


Chapter 3

Methodology and implementation

This section describes the hardware selection for the initial test system and the general software implementation. Based on the initial performance results, a solution composed of hardware and software is proposed. The solution will showcase today's hardware capabilities in AES encryption.

3.1 Test system hardware selection

Nowadays multi-core CPUs and GPGPUs are widespread, ranging from inexpensive commodity hardware up to specialized server and high-performance parts. Designing an architecture is a very expensive process for companies. Usually companies will base a new architecture on a previously launched one and tailor it to suit all customer markets (such as less cache, no virtualization). In other words, low-end processing units from the same product family (e.g. Kaveri, Haswell) usually share the same architecture as high-end ones but have certain features disabled (AVX, VTX, AESNI) or hardware units disabled (less cache, disabled cores) [2],[6].

The initial test system was built around latest-generation consumer hardware. The system is built around the SoC AMD A6-5400K, a dual-core x86 with hardware AES-NI instructions. Integrated next to the x86 dual-core module is an HD7540 iGPU of VLIW4 architecture with 3 SIMD units (192 stream cores), with the capability to zero-copy between CPU RAM and GPU memory [2],[7],[8]. The CPU and iGPU share a memory bandwidth of 10.6 GB/s (single channel) or 21.2 GB/s (dual channel). A third processing unit is the dGPU AMD R7 250, installed on the PCIe 16x lane. It is based on the GCN architecture, with 6 SIMD units (384 stream cores) and a bandwidth of 73.6 GB/s. The system is presented in figure 3.1.

Figure 3.1 presents a view of the test system. Blue denotes x86 CPU cores, while red denotes GPU SIMD units. Green denotes how data is transferred from one compute unit to another. Initially all the processed data must reside in CPU RAM. When either the integrated or the discrete GPU processes the data, it must first be copied/mapped to the GPU VRAM. The following acronyms will be used in this paper: CPU (central processing unit), iGPU (integrated graphics processing unit), dGPU (discrete graphics processing unit), SoC (system on chip), APU (accelerated processing unit).


Figure 3.1: Initial system

3.2 Software design

The developed application is a demo of the AES encryption hardware capabilities found in today's processing units. In general, usability has been sacrificed for performance, i.e. the application is for demonstration purposes only. The goal of the application is to showcase accelerated encryption on modern processing units (CPU, iGPU, dGPU). It could thus constitute a base for further improvement of today's AES encryption solutions (applications, frameworks, libraries).

The name of the application is "ACCAES" (ACCelerated AES). The programming language of choice is C++, with the build system handled by CMake. The application can be compiled on both Linux and Windows. For parallelism, ACCAES uses OpenCL for GPUs and OpenMP for CPUs.

3.2.1 ACCAES architecture

The source code is divided into three categories: I/O, MAIN and AES. The main entry point in the code is responsible for initial control of program flow, i.e. it dispatches both I/O functions and AES functions based on user arguments. I/O functions handle file read/write and are executed before and after AES processing. Only one file may be encrypted in a single program run.

The AES code may be further divided based on the target processing units (CPU, GPU, CPU+GPU). At the base of the AES-related code is the class AES_BASE, from which the classes that handle specific hardware inherit (AES_GPU, AES_HWNI, AES_HYBRID, AES_OPENSSL). The class AES_HYBRID handles both the multi-core CPU and multiple GPUs and thus inherits from both AES_GPU and AES_HWNI. For profiling, another class has been defined, AES_PROFILER. Finally, as a performance and correctness reference, ACCAES uses, through the AES_OPENSSL class, functions from the OpenSSL crypto framework. The whole C++ class hierarchy is presented in figure 3.2.
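A minimal C++ sketch of this hierarchy might look as follows; the method name `name()` and the use of virtual inheritance are our illustrative assumptions, as the text specifies only the class names and the inheritance relations:

```cpp
#include <string>

// Common interface for every encryption backend.
class AES_BASE {
public:
    virtual ~AES_BASE() = default;
    virtual std::string name() const = 0;   // illustrative method
};

class AES_HWNI : public virtual AES_BASE {      // CPU path (AES-NI)
public:
    std::string name() const override { return "AES_HWNI"; }
};

class AES_GPU : public virtual AES_BASE {       // OpenCL path (iGPU/dGPU)
public:
    std::string name() const override { return "AES_GPU"; }
};

// The hybrid backend drives both the CPU and the GPU(s); virtual inheritance
// avoids a duplicated AES_BASE sub-object in the diamond.
class AES_HYBRID : public AES_GPU, public AES_HWNI {
public:
    std::string name() const override { return "AES_HYBRID"; }
};

class AES_OPENSSL : public virtual AES_BASE {   // reference implementation
public:
    std::string name() const override { return "AES_OPENSSL"; }
};
```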

AES functions are dispatched from the MAIN function, and only one type of accelerated encryption may run in a single program run, i.e. AES_HWNI, AES_GPU or AES_HYBRID.


Figure 3.2: ACCAES software architecture

Overall the application has few dependencies: OpenMP, OpenCL, Crypto. The modular design also allows the user to remove, with few code modifications, any module that is not used (e.g. AES_GPU). AES_OPENSSL in particular may be deactivated directly from the build system (CMake).

3.2.2 ACCAES parallelism

Prior to any processing, the CPU spawns N threads through OpenMP [1]. N is determined as the sum of the number of CPU cores and the number of GPUs (iGPU or dGPU). Threads are responsible either for processing on the CPU cores using AESNI, or for data transfer, work enqueueing and synchronization. For example, when 4 threads (T1, T2, T3, T4) are spawned, one could process on the CPU with 2 threads (T2, T3) while the rest control accelerators: T1 could handle the dGPU and T4 could handle the iGPU.

Since ECB operates on independent 128-bit blocks, the problem is embarrassingly parallel. The initial plain-text can be broken down into any distribution of chunk sizes that are multiples of 128 bits.

Processing on the CPU is done via OpenMP coupled with AESNI hardware acceleration. Load balancing is done via OpenMP and each core receives new work as it finishes. The application uses C/C++ wrappers to load and store SSE 128-bit variables and execute AESNI assembler instructions. The code is simple and well documented on the Internet in several white papers from both INTEL and AMD.

Processing on the GPU is done using OpenCL running over OpenMP for asynchronous data transfer/enqueueing when the CPU is involved [10],[1],[23]. OpenMP spawns an additional thread for each GPU to be controlled by the CPU, aside from the threads used for actual processing. Platform and device query, along with OpenCL code compilation, is done in the main thread along with data I/O.

The main thread is responsible for all I/O, static work distribution between the CPU and GPU(s), and starting worker threads using OpenMP (both CPU and GPU). Each thread spawned by OpenMP for the GPU(s) will allocate/destroy memory on the device, handle data transfers and enqueue work. On the GPU, each work-item in a wavefront processes a different region uniquely indexed by its global id. The global id can be anything between 0 and (chunk length)/16 - 1, hence each kernel instance will process 128 bits of plain-text. The OpenCL kernels use precomputed S-box and Galois field values stored in constant memory space, and the input plain-text is loaded into video RAM (VRAM) [14]. For each AES type (128-bit/10 rounds, 256-bit/14 rounds) a kernel has been defined. The only arguments sent to the kernel are a pointer to the location of the plain-text in VRAM and a pointer to the encryption key.

In a hybrid CPU-GPU parallel workload, due to the difficulty of dynamically balancing the workload, the workloads for each compute unit are statically defined by the user. The user specifies the chunk size for each compute unit (the sum of all chunk sizes will be the size of the input). To find the optimum slice percentage for each compute unit, some tests need to be performed by the user prior to actual execution. The motivation is that the overhead of dynamic balancing would consist of the scheduling algorithm plus additional latency in transfers to the GPUs. One should also take into account the variation of the initial chunk size and the GPU/CPU RAM/VRAM sizes. Because performance using statically defined slices is satisfactory compared to single-unit execution, there may be little benefit in using a dynamic scheduler.
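A static split of this kind might be implemented as below. The function name and the rounding policy (block-aligning every chunk but the last) are our assumptions; the text states only that the user-defined chunk sizes must sum to the input size:

```cpp
#include <cstddef>
#include <vector>

// Split `total` bytes among compute units according to user-given fractions.
// Every chunk except the last is rounded down to a multiple of the 16-byte
// AES block; the last unit absorbs the remainder.
std::vector<std::size_t> split_static(std::size_t total,
                                      const std::vector<double>& fractions) {
    std::vector<std::size_t> chunks(fractions.size(), 0);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < fractions.size(); ++i) {
        std::size_t c = static_cast<std::size_t>(total * fractions[i]);
        c -= c % 16;                       // keep chunks block-aligned
        chunks[i] = c;
        assigned += c;
    }
    if (!chunks.empty())
        chunks.back() = total - assigned;  // remainder goes to the last unit
    return chunks;
}
```

For example, splitting 1 MB as 50% CPU, 30% dGPU and 20% iGPU yields three block-aligned chunks that sum exactly to the input size.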

3.3 Implementation design

AES has a fixed block size of 128 bits and a key size of 128, 192, or 256 bits. The problem of encryption in ECB is thus embarrassingly parallel with high granularity. AES operates on a 4×4 column-major matrix of bytes, termed the state. Performing AES implies executing a set of 4 operations (AddRoundKey, SubBytes, ShiftRows, MixColumns) a fixed number of times determined by the key size (e.g. 10 rounds for 128-bit keys).
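For illustration, the state layout and the two simplest of these operations can be written in portable, scalar C++. This is expository code only, not the accelerated path used by ACCAES:

```cpp
#include <cstdint>
#include <cstring>

// The AES state: 16 bytes in column-major order, state[4*col + row].
using State = uint8_t[16];

// AddRoundKey: byte-wise XOR of the state with a 128-bit round key.
void add_round_key(State s, const uint8_t rk[16]) {
    for (int i = 0; i < 16; ++i) s[i] ^= rk[i];
}

// ShiftRows: row r of the state is rotated left by r positions.
void shift_rows(State s) {
    State t;
    std::memcpy(t, s, 16);
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            s[4 * col + row] = t[4 * ((col + row) % 4) + row];
}
```

Note that AddRoundKey is its own inverse, which is why decryption reuses the same round keys in reverse order.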

3.3.1 ACCAES implementation design for CPU/MIMD

AESNI instructions are an extension to the x86 instruction set, introduced by INTEL in 2008 to speed up AES encryption/decryption. The set of instructions is: AESENC, AESENCLAST, AESDEC, AESDECLAST, AESKEYGENASSIST, AESIMC, PCLMULQDQ. These instructions may be used to encrypt/decrypt 128-bit chunks of plain-text using either 128, 192 or 256-bit keys. The C/C++ code for using these instructions is straightforward: one first uses AESKEYGENASSIST for key expansion, then repeatedly AESENC/AESDEC (the number of times corresponds to the key length), and finally AESENCLAST/AESDECLAST. All AESNI instructions operate on special 128-bit registers, which require data to be loaded/stored using SSE functions such as _mm_load_si128/_mm_store_si128. Plain-text data also needs to be aligned prior to any copying for cache access to be optimal.
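A minimal sketch of this sequence using the compiler intrinsics follows. The key-expansion folding pattern is the standard one from Intel's AES-NI white paper; unaligned loads are used here for simplicity, whereas the text above notes that aligned loads (`_mm_load_si128`) are preferable in practice:

```cpp
#include <immintrin.h>
#include <cstdint>

// Fold step of the AES-128 key schedule around AESKEYGENASSIST.
__attribute__((target("aes,sse2")))
static __m128i expand_step(__m128i key, __m128i keygened) {
    keygened = _mm_shuffle_epi32(keygened, _MM_SHUFFLE(3, 3, 3, 3));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, keygened);
}

// AESKEYGENASSIST takes its round constant as an immediate, hence the macro.
#define EXPAND(k, rcon) expand_step((k), _mm_aeskeygenassist_si128((k), (rcon)))

// Encrypt one 128-bit block with a 128-bit key: one initial AddRoundKey,
// nine AESENC rounds, one final AESENCLAST, exactly as described above.
__attribute__((target("aes,sse2")))
void aes128_encrypt_block(const uint8_t key[16], const uint8_t in[16],
                          uint8_t out[16]) {
    __m128i rk[11];
    rk[0]  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(key));
    rk[1]  = EXPAND(rk[0], 0x01);  rk[2]  = EXPAND(rk[1], 0x02);
    rk[3]  = EXPAND(rk[2], 0x04);  rk[4]  = EXPAND(rk[3], 0x08);
    rk[5]  = EXPAND(rk[4], 0x10);  rk[6]  = EXPAND(rk[5], 0x20);
    rk[7]  = EXPAND(rk[6], 0x40);  rk[8]  = EXPAND(rk[7], 0x80);
    rk[9]  = EXPAND(rk[8], 0x1b);  rk[10] = EXPAND(rk[9], 0x36);

    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    b = _mm_xor_si128(b, rk[0]);               // initial AddRoundKey
    for (int r = 1; r <= 9; ++r)
        b = _mm_aesenc_si128(b, rk[r]);        // rounds 1..9
    b = _mm_aesenclast_si128(b, rk[10]);       // final round (no MixColumns)
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), b);
}
```

The function-level `target("aes,sse2")` attribute lets this compile without global `-maes`; at run time the caller should first check CPU support (e.g. `__builtin_cpu_supports("aes")` on GCC/Clang).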

Since the code is straightforward (C wrappers for AESNI assembler), the compiler plays a crucial role in code optimization for the target architecture. For example, consider the three most recent versions of the g++ compiler at the time of writing (4.6, 4.7 and 4.8, released in 2011, 2012 and 2013 respectively). Intuition would suggest there would be little performance difference between the generated binaries, since the code refers to AESNI (introduced in 2008), SSE2 operations (introduced in 2001) and basic OpenMP parallelism (introduced in 1998). In reality there is a major difference between the versions in both performance and generated assembly. Performance when compiling with g++ 4.6 was initially around 800 MB/sec for ECB-128, while performance increased to 1100 MB/sec when compiling with g++ 4.7 and then to 1400 MB/sec when using g++ 4.8.

Figure 3.3 shows that there are significant differences in the generated assembly. The differences come from how the compiler optimizes aligned memory accesses. Previous versions (g++ 4.6 and 4.7) issue aligned memory instructions for data movement, whilst the newest version (g++ 4.8) issues unaligned data movement instructions. In the C/C++ code the input data is marked as aligned, hence it would seem that the newest compiler ignores the request.

An x86 CPU with AESNI activated can improve its performance by a margin of 6-14x, which can be further amplified by using all cores; measurements are presented in further sections. It follows that a multicore CPU with AESNI support has a very high AES encryption throughput and may not require offloading AES to another accelerator.

Figure 3.3: Compiler Influence


3.3.2 ACCAES implementation design for GPU/SIMD

On a SIMD architecture such as a GPU, the fastest operations are AddRoundKey and ShiftRows. MixColumns and SubBytes are at least an order of magnitude more time consuming than AddRoundKey and ShiftRows. In a typical implementation similar to what the literature describes, using the A4 4000 iGPU, processing 10 iterations of either AddRoundKey or ShiftRows on a 32MB chunk requires 5ms. By comparison, MixColumns or SubBytes require 50-150ms for the same operation.

ACCAES initial GPU/SIMD implementation

The encryption function receives the offset of the plaintext in VRAM and the expanded key. The key is expanded through AESNI if available. Data resides in global device memory (VRAM for the R7 250 and RAM for the A4 4000). The variable state, of type uchar16, represents the 4x4 AES column-major matrix. It is defined as local memory in an attempt to place it in the device cache for faster access. The function ShiftRows is designed using vector addressing, while AddRoundKey is a simple XOR.
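As a scalar illustration of the ShiftRows permutation on the column-major state (the OpenCL kernel expresses the same permutation as a uchar16 swizzle), here is a minimal C++ sketch; the function name is illustrative. With byte index 4*col + row, row r is rotated left by r columns:

```cpp
#include <cstdint>
#include <cstring>

// Scalar reference for ShiftRows on the column-major AES state,
// where byte index = 4*col + row. Row r is rotated left by r columns.
void shift_rows(uint8_t s[16]) {
    uint8_t t[16];
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            t[4 * col + row] = s[4 * ((col + row) % 4) + row];
    memcpy(s, t, 16);
}
```

Since this is a fixed permutation, a SIMD device can implement it as a single vector shuffle, which is why ShiftRows is among the cheapest AES operations on a GPU.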

For this first iteration of the code, both SubBytes and MixColumns use precomputed tables of the Sbox and the Galois finite field, as described in the literature. The storage space is an array of type constant uchar. The work group size is 128 while the global work size is buf_size/16. At this point the AES kernel is complete, with several optimizations already in place.

ECB is used as the mode of operation, along with AES-128. In this first stage, data is processed by the target device in chunks of 32MB. The flow of execution is thus: while(!done()) { writeData(32MB, &offset); execKernel(32MB, &offset); readData(32MB, &offset); }

This method yields approx 100MB/sec on the AMD A4 4000. The AES kernel took 266ms to process 32MB, while transfers to and from the iGPU took 10ms and 34ms respectively. All calls (read, write, execute) are serial since there is only one command queue. Enqueued compute calls are non-blocking, but that does not impact performance in any way since reads and writes are blocking. The kernel device occupancy is estimated at around 66%. Memory transfers vary between 0.6-3GB/sec. Around 128 bytes of LDS are used per work group.

The AMD R7 250 yields around 300MB/sec (approx 3x compared to the A4 4000). The kernel executes in 88ms while reads and writes are similar to the A4 4000 (33ms, 10ms). Memory transfer rates are thus comparable to the A4 4000, ranging from 0.9GB/sec to 3GB/sec. Kernel occupancy is considerably lower, at 20%.

ACCAES optimized GPU/SIMD kernel implementation

For the operation MixColumns, instead of using precomputed values in the Galois finite field, the idea is to actually compute them. MixColumns is the primary source of diffusion in the AES cipher. Each column is treated as a polynomial over the Galois field and is then multiplied modulo a fixed polynomial. While the addition is a simple operation (XOR), the multiplication is a complicated operation.
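The computed (rather than looked-up) MixColumns can be sketched in scalar C++ with the standard xtime trick: multiplication by 2 in GF(2^8) is a shift followed by conditional reduction with 0x1B, and multiplication by 3 is xtime(x) ^ x. Function names are illustrative; the well-known test column d4 bf 5d 30 maps to 04 66 81 e5:

```cpp
#include <cstdint>

// Multiply by 2 in GF(2^8): shift left, reduce by 0x1B on overflow.
static uint8_t xtime(uint8_t x) {
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}

// MixColumns on one column {a0,a1,a2,a3}: each output byte is
// 2*a_i ^ 3*a_(i+1) ^ a_(i+2) ^ a_(i+3), computed instead of looked up.
void mix_column(uint8_t c[4]) {
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3;
    c[1] = a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3;
    c[2] = a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3);
    c[3] = (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3);
}
```

Listing A.3 shows the OpenCL vector form of the same computation, where the conditional reduction is expressed branch-free with a mask.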

The profiler results show that MixColumns is more complex and time consuming than SubBytes, so it was rewritten to compute values instead of querying precomputed ones. For encryption, SubBytes needs one constant table of 256 bytes (the precomputed Sbox) with 16 random queries per run, while MixColumns needs two constant tables of 256 bytes each (Galois_FF2, Galois_FF3) with 32 random queries per run.

The improvement from the new MixColumns code is around 50% for the iGPU (150MB/sec) and 33% for the dGPU (400MB/sec). Kernel occupancy has risen from 66.67% to 76.19% for the iGPU and from 20% to 40% for the dGPU. Kernel execution time decreased from 266ms to 153ms for the iGPU and from 88ms to 42ms for the dGPU. At this point the slow memory transfers between host memory and device memory start to matter. For the SubBytes operation we move the precomputed values from constant memory into local cache memory: in the AES128_Enc kernel we define local uchar Sbox[256] and assign each value.

Using GPU caches has a positive impact on both the iGPU and the dGPU. Kernel execution time drops to 108ms for the iGPU (from 153ms) and to 34ms for the dGPU (from 42ms). Data transfers for the discrete GPU now exceed the actual computation, which limits any further optimization of kernel execution (34ms execution, 36ms R/W data). Moving data towards GPU caches does not impact kernel occupancy. Local data storage increases for both devices from 128 bytes towards 384 bytes. Throughput increases towards 200MB/sec for the iGPU, while the dGPU sees no further improvement beyond 400MB/sec.

ACCAES optimized GPU/SIMD memory transfers

For any further improvement, data transfer rates need to be improved. Data transfer accounts for nearly 25% of execution time (32ms vs 140ms) on the iGPU and more than 50% for the dGPU (36ms vs 70ms). In the current single channel configuration the iGPU achieves 1.2-3.7GB/sec while the dGPU achieves 0.6-2.4GB/sec. Overlapping can be achieved by using a circular buffer of the form (buf_rw0, buf_rw1, ..., buf_rwN). Buffers are either being read, written or processed. There are 3 command queues: read, write and execute. Synchronization is event based, with barriers through READ_BLOCK or WRITE_BLOCK. Results are presented in figure 3.4.

For the iGPU the major problem is the shared memory bandwidth. The memory controller has to handle both iGPU execution and memory transfers from CPU space into iGPU space. Most transfers are of short duration, yet some get delayed by a large margin and cancel any overall advantage of overlapping execution with memory transfer. Zero-copy through memory mapping should be considered instead for the iGPU.

The dGPU is inefficient because of its fast execution compared to data transfer over the PCIe bus. Even with transfers overlapped with execution, the OpenCL stack is unable to balance well and several data transfer commands take too long.

Figure 3.4: iGPU/dGPU algorithm with and without data/execution overlap, 512MB chunk, circular buffer 8x16MB


Chapter 4

Test system results

The goal of this section is to present how different processing configurations may affect either the CPU, the GPU or both. The tested system is the A6 5400K (2 cores @ 3.6GHz, AESNI, iGPU with 3 CUs) coupled with the dGPU R7 250 (6 CUs). The algorithm used for the GPU is the initial non-optimized version. Conclusions should also apply to the improved GPU implementations, since this section describes the performance impact of various configurations. All tests were done on Ubuntu 12.04 LTS x64 with g++ 4.6 as the compiler. For estimation of execution time, clock_gettime was used on Linux (QueryPerformanceFrequency on Windows). For profiling the GPU, AMD CodeXL was used (driver AMD Catalyst 13.35).

4.1 Single unit processing

The application can offload work either to a single compute unit or to a combination of compute units. As previously stated, a compute unit can be either a CPU or a GPU. Results show the CPU 5400K with AESNI has the best performance among the compute units of the initial test system. Throughput scales in a 2 core configuration with about 90% efficiency. The integrated GPU with 3 CUs (theoretical 400Gflops, bandwidth 10.6GB/sec) yields a modest 155MB/sec with a very low overhead in data transfer. The R7 250 with 6 CUs (theoretical 900Gflops, bandwidth 72GB/sec) yields a better result, 400MB/sec, which correlates with the increase in processing power and memory bandwidth. The standard OpenSSL implementation (running on the CPU) yields the lowest result, about 100MB/sec.

Figure 4.1 presents a comparison between different processing units. The processing units affected by small input are the CPU with AESNI and the discrete GPU. The OY axis represents the approximate MB/sec while OX represents the plain-text size in MB given for encryption.

Transfer latency is very important to the discrete GPU. The PCIe bus introduces a significant overhead in processing corresponding to the total bandwidth. Figure 4.2 presents the PCIe bus influence over the discrete GPU performance. The OY axis represents the approximate MB/sec while OX represents the plain-text size in MB given for encryption. For example, when processing 256MB over PCIe 2.0 4x, which has a 2048MB/sec peak bandwidth, the transfers to and from the device would require 250ms in an ideal case (125ms each way). If we consider the case of a discrete GPU which processes data at a rate of 512MB/sec, then processing 256MB would require 500ms of effective processing plus 250ms of data transfer. This would result in a lower effective throughput of about 341MB/sec. Tests show there is a significant performance gap between PCIe 16x (8GB/sec) and PCIe 4x (2GB/sec). Furthermore, preliminary results show little to no improvement from overlapping data transfer with GPU execution. As a result the performance of a discrete GPU can be severely limited by the PCIe bus.
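This back-of-the-envelope calculation can be captured in a tiny model, assuming the compute phase and the two bus crossings (write and read) are fully serialized; names are illustrative:

```cpp
// Effective throughput when compute and the two-way PCIe transfers are
// fully serialized: total time = compute time + write time + read time.
double effective_mbps(double mb, double proc_mbps, double bus_mbps) {
    double seconds = mb / proc_mbps + 2.0 * mb / bus_mbps;
    return mb / seconds;
}
```

For 256MB at 512MB/sec compute over a 2048MB/sec bus this gives roughly 341MB/sec, which illustrates how even an ideal PCIe 2.0 4x link caps a fast dGPU well below its kernel throughput.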


Figure 4.1: Comparison between different processing units (CPU, dGPU, iGPU), Ox (MB), Oy (MB/sec)

The memory hierarchy is critical in a heterogeneous system where the central point of control is the CPU. Running multiple iterations over the same input shows the impact of the cache and the overall stability of the system throughput. When running multiple iterations over the same input, the CPU AESNI results do not vary by much, whilst the discrete GPU results vary by a fairly large margin. An important role is thus played by the cache, which helps bridge the gap between the fast AESNI processing hardware and the low speed RAM. Theoretical peak bandwidth is not an issue here, as both 10.6GB/sec (single channel) and 21.2GB/sec (dual channel) yield similar results.

Figure 4.2: PCIe bus influence over the dGPU (AMD R7 250) performance, Ox (MB), Oy (MB/sec)

In general, higher frequencies in compute units translate into better performance. Recent x86 SoCs have very fine-grained thermal control: compute units can dynamically overclock or underclock to better fit the current workload.

The 5400K is not designed to dynamically overclock but can be manually overclocked. Overclocking the CPU by 10% (400MHz) results in little to no performance gain. Overclocking the integrated GPU by 30% (250MHz) results in a more significant 20% improvement in performance. The integrated GPU has a good 70% overclocking efficiency, whilst the CPU shows no improvement, with one exception: small data processing.


4.2 Multiple units processing

A hybrid configuration is a work distribution between two or more compute units. Figure 4.3 presents a comparison between different hybrid work split configurations, by considering the throughput in MB/sec (OY axis). In ACCAES, work is statically divided between compute units, based on results from previous performance tests. It is not enough to base the static work distribution on individual results of each processing unit. For instance, the CPU performance degrades with each new processing unit added to the hybrid configuration.

In the case of the APU, both the CPU and the integrated GPU share the same RAM. While there is the benefit of low latency transfers between CPU and iGPU, the downside is the shared memory bus. When running AES on either the CPU or the iGPU alone, results remained fairly consistent between runs. When processing in parallel on both devices, results are very inconsistent. The CPU AESNI is more affected, with a performance hit in the range of 30%-50%, while the iGPU suffers only a 20%-30% decrease. Overall, the performance of the CPU and iGPU working in parallel is slightly better than the CPU running alone.

For instance, the iGPU (AMD HD 7540) alone can manage about 150MB/sec, i.e. 1/6 of the CPU (AMD 5400K) performance. We ran several tests to find the optimal work split for each hybrid configuration presented in Figure 4.3. One example is the CPU (AMD 5400K) and iGPU (AMD HD 7540) configuration, where we concluded that 12% iGPU and 88% CPU is optimal. Any other configuration (for example 25% iGPU and 75% CPU) results in a lower throughput.
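A first-cut static split can be derived from the standalone throughputs. The sketch below is the naive proportional model, which assumes no contention between devices (names are illustrative, not from the ACCAES sources):

```cpp
// Naive static split: give each device a share of the work proportional
// to its standalone throughput (assumes no contention between devices).
double gpu_share(double gpu_mbps, double cpu_mbps) {
    return gpu_mbps / (gpu_mbps + cpu_mbps);
}
```

With 150MB/sec for the iGPU and roughly 900MB/sec for the AESNI CPU, this predicts about 14% for the iGPU; the measured optimum of 12% is slightly lower, consistent with the CPU slowdown observed under shared-memory contention.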

Figure 4.3: Comparison of hybrid workloads, Ox (MB/sec), Oy (configuration, work-split %)

Figure 4.4 describes CPU efficiency in various compute configurations. The OY axis represents the system configuration while OX represents the approximate MB/sec. It can be seen that the CPU performance degrades with each new processing unit added to the hybrid configuration.

AMD CodeXL shows a higher cache miss rate as one leading cause, with efficiency going from 80% (CPU only) down to 60% (CPU + iGPU). This, along with managing the transfer from RAM to VRAM (for iGPU and dGPU), would explain the lower performance. Figure 4.5 describes discrete GPU efficiency in various compute configurations. Results show several inconsistencies with respect to what would be expected. The most notable one is that the discrete GPU achieves a higher performance when the CPU is also processing AES than when the CPU is idling. We attribute these problems to the inconsistent memory hierarchy, which has a significant impact on performance. Figure 4.6 describes integrated GPU efficiency in various compute configurations. Results show that the integrated GPU is the most stable processing unit in our AES implementation.


Figure 4.4: CPU (AMD 5400K) efficiency drop in various compute configurations, Ox (MB/sec), Oy (configuration)

Figure 4.5: dGPU (AMD R7 250) efficiency drop in various compute configurations, Ox (MB/sec), Oy (configuration)

Figure 4.6: iGPU (AMD HD7540) efficiency drop in various compute configurations, Ox (MB/sec), Oy (configuration)


Chapter 5

Proposed encryption system

Previous chapters showed that fast AES encryption can be achieved either by using a CPU with dedicated hardware instructions (AESNI) or by using a GPU. The performance improvement achieved by hybrid CPU-GPU configurations has been shown to be debatable. This chapter focuses on building a low end consumer system for large data AES encryption. The proposed configuration is a sub 100 Euro x86 system which can encrypt large data chunks (>1MB) at rates between 0.5 and 1GB/sec for AES-128/AES-256 ECB.

5.1 Proposed hardware configuration

The base for the entire system is an x86 CPU with AESNI instructions, ideally in a multi-core configuration. Another processing unit such as an iGPU or dGPU could be used for offloading/assisting AES encryption. Low power/small form factor parts are preferred. The system does not require a HDD and should provide standard communication ports (ETH, USB). The OS can boot either from a USB pen-drive or over the network. RAM requirements are low and depend on the OS and other applications running, but a minimum of 2GB is recommended.

Figure 5.1: Proposed encryption system

Both Intel and AMD provide a rather diverse range of CPU products, but focusing on sub 40 Euro parts yields only a handful of them. CPUs in this price range are usually dual-core/quad-core, low-power, and target mini/micro ITX cases. CPUs which belong to the same family (e.g. the Intel Haswell family) share the same underlying architecture, with differences in clock speed, cache size and disabled hardware instructions/units. At the present time the only sub 40 Euro CPUs with AESNI instructions are the AMD Kabini parts (AM1 socket). In the end I selected the AMD Sempron 3850 (quad core @ 1.3GHz) which also has an iGPU (GCN architecture, 128 stream units, 450MHz). Based on previous results, both the CPU with AESNI and the iGPU can be used to achieve a better AES throughput compared to consumer CPUs with no AESNI support. The RAM is 2GB DDR3 1600MHz and the motherboard is of mini ITX form factor. Finally I added a modest mini ITX case with a 60W PSU. The whole configuration at the current time costs around 100 Euro.

5.2 Proposed software configuration

The CPU is x86, hence either Linux or Windows will work, and based on past results there is no major difference in performance between operating systems. Ubuntu 14.04 x86_64 LTS was chosen in order to maintain backwards compatibility with previous tests (for a better comparison); any recent distro should perform similarly. The chosen OS takes anywhere between 400-800MB under normal load conditions, leaving about 128/256MB for VRAM mapping (from RAM) and 512MB of RAM for the accaes application. Chunks will have to be at most 512MB on a system with 2GB of RAM. Previous results show that maximum performance for AESNI is quickly achieved (>32MB), hence the limitation will have little impact on throughput. In the case of a 4GB RAM system I propose the following configuration: 512MB VRAM, 1.5GB OS related, 2GB for accaes.

5.3 Performance measurements of the proposed system

The system was compiled and tested on the Ubuntu 14.04 x64 LTS operating system with g++ 4.8 and GPU Catalyst driver 13.35. Tests consisted of a 500MB file (generated from /dev/urandom) being encrypted with AES-128 ECB. The input file (plain-text) was read from a mounted RAM-disk file-system (based on tmpfs). Using OpenSSL pre-compiled libraries with no AESNI support, the CPU (1 core processing) manages around 20MB/sec. The iGPU manages around 150MB/sec and thus would pair well with a multi-core CPU without AESNI. CPU processing with AESNI shows the disproportion in throughput versus the iGPU: processing on a single thread, the throughput exceeds 450MB/sec, i.e. over three times that of the iGPU. It follows that the extent to which the iGPU may speed up computation is very limited. Previous results on the AMD A6 5400K/A4 4000 have shown that the AESNI performance of the CPU is affected in multi-device processing configurations. This case is no exception, as the results show the iGPU can only speed up computation in configurations of up to 2 cores. In fact it makes little sense to offload AES processing to the iGPU, since merely processing on 3 cores will outperform any configuration of CPU + iGPU, even with 4 cores.

Figure 5.2 shows various work splits between the CPU (1, 2, 3 and 4 cores) and the iGPU. The iGPU's performance is of little importance and hence the work-split has to be very disproportionate. On the other hand, if the CPU did not support AESNI, the iGPU could be used to offload AES encryption.

Figure 5.3 compares optimum split configurations of CPU+iGPU against multi-core CPU AESNI performance. The best configuration is multi-core CPU AESNI with 3 processing threads, which yields 1GB/sec, whilst the best CPU+iGPU configuration is single-thread AESNI with iGPU, which yields around 500MB/sec. In the current configuration (multicore CPU with AESNI support) the iGPU should not be used for offloading AES encryption.


Figure 5.2: Performance limitations of the iGPU, various work splits iGPU+CPU(num_cores), Ox (%), Oy (MB/sec)

Figure 5.3: CPU(num_cores) + iGPU, low performance improvement


Chapter 6

Conclusion

Processors are currently undergoing a transition driven by power and performance limitations, and algorithms must evolve along with them. The current work has shown, for a particular problem (AES encryption), that performance may vary greatly depending on the chosen architecture and software. Limitations and constraints have been found and explained in both the tested architectures and the software stack. In order to achieve maximum performance, all the components of the system must be chosen and adjusted accordingly. It follows that a good understanding of the hardware and software ecosystem is mandatory when targeting performance.

For the chosen problem (AES encryption), CPU AESNI instructions provide a simple way to improve AES performance by a large margin. Though the use of AESNI instructions should be straightforward, as shown, the compiler has an important role when further vectorizing the code. A performance improvement of over 100% has been highlighted when using a newer version of the g++ compiler. This shows that with the ever increasing complexity of software and hardware systems, peak performance will become ever harder to achieve.

Another option to accelerate AES is to use a dGPU, but the performance-to-cost ratio is lower than that of a CPU with AESNI. Adding a dGPU for AES acceleration makes sense when the CPU does not support AESNI; otherwise the performance gain relative to cost is debatable. For CPUs which have a programmable iGPU on the same die, AES throughput does improve regardless of the shared memory bus, but by a small margin in the case of CPUs with AESNI. Since the die space the iGPU occupies in the x86 SoC increases with each generation, its contribution to AES throughput will increase as well. Likewise, in the current generation of x86 SoCs, the CPU and iGPU share the same memory space. In this situation the iGPU might become feasible for processing small chunks of data.

Finally, this work presented a sub 100 Euro, low power (<30W) x86 system with a processing throughput of over 1GB/sec for AES-128 ECB. AES may be offloaded to the iGPU, though this yields lower throughput than an individual CPU core. It follows that fast data encryption using AES is possible on consumer systems that have CPU AES-NI extensions or at least GPU accelerators.

As future work I will focus on heterogeneous unified memory architectures (hUMA), where the communication between the CPU and GPU will be further simplified. New GPU optimizations will also be studied, and the current implementation of AES on the GPU is expected to improve.


Appendix A

Various code listings

A.1 AES-NI C/C++ code, aes_hwni.cpp

/*******************************************************
 * Function: enc256_hwni
 *******************************************************/
void enc256_hwni (uchar* buf_in, uchar* buf_out, __m128i key_exp_128i[])
{
    __m128i cipher_128i;
    unsigned char in_alligned[16] _ALIGNED(16);
    unsigned char out_alligned[16] _ALIGNED(16);

    /* store plaintext in cipher variable, then encrypt */
    memcpy(in_alligned, buf_in, 16);
    cipher_128i = _mm_load_si128((__m128i *) in_alligned);
    cipher_128i = _mm_xor_si128(cipher_128i, key_exp_128i[0]);

    /* then do 13 rounds of aesenc, using the associated key parts */
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[1]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[2]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[3]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[4]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[5]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[6]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[7]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[8]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[9]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[10]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[11]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[12]);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i[13]);

    /* then 1 aesenclast round */
    cipher_128i = _mm_aesenclast_si128(cipher_128i, key_exp_128i[14]);

    /* store back from register & copy to destination */
    _mm_store_si128((__m128i *) out_alligned, cipher_128i);
    memcpy(buf_out, out_alligned, 16);
}

/*******************************************************
 * Function: encrypt
 * Info: do hardware assisted encryption - AES-NI
 *******************************************************/
void AES_HWNI::encrypt()
{
    ..................

    /* for number of threads according to user CLI */
    omp_set_num_threads(cpu_count);

    /* do encryption */
    #pragma omp parallel for private(i) \
        firstprivate(buf_in_mp, buf_out_mp, key_exp_128i)
    for(i=0; i<buf_len_mp; i+=16)
        enc256_hwni(buf_in_mp+i, buf_out_mp+i, key_exp_128i);
    }
}

Listing A.1: AES-256 CPU AES-NI C/C++ code

A.2 GPU Execution overlaps I/O, aes_gpu.cpp

/*******************************************************
 * Function: encrypt_overlap
 * Info: encryption with I/O OVERLAP
 *******************************************************/
void AES_GPU::encrypt_overlap()
{
    int rounds = 0;
    uchar* key_exp = new uchar[15*16];

    if(key_len == 16)
        dev_gpu.dev_kernel_sel = dev_gpu.dev_kernel_enc128;
    else if(key_len == 32)
        dev_gpu.dev_kernel_sel = dev_gpu.dev_kernel_enc256;

    /* expand keys using AESNI CPU acceleration */
    expand_keys(key_exp, &rounds, key, key_len);

    dev_gpu.buffer_keys = clCreateBuffer(
        dev_gpu.dev_context,
        CL_MEM_READ_ONLY,
        rounds* 16* sizeof(char), NULL, NULL);

    /* write keys */
    clEnqueueWriteBuffer(dev_gpu.dev_cmd_queue,
        dev_gpu.buffer_keys,
        CL_FALSE, 0, rounds* 16* sizeof(char), key_exp, 0, 0, 0);

    /* allocate memory INPUT - device RAM/VRAM, OpenCL handled */
    cl_mem buffer_gpu[GPU_BUF_NUM];

    for(int i=0; i<GPU_BUF_NUM; i++)
        buffer_gpu[i] = clCreateBuffer(
            dev_gpu.dev_context,
            CL_MEM_READ_WRITE,
            GPU_BUF_ALLOC_SIZE *sizeof(char), NULL, NULL);

    cl_event event_exec[GPU_BUF_NUM];
    cl_event event_io[GPU_BUF_NUM];

    size_t offset = 0;
    int BUF_i = 0;
    int num_it = buf_len / GPU_BUF_ALLOC_SIZE;

    size_t global_work = GPU_BUF_ALLOC_SIZE / 16;
    size_t threads = 128;

    for(int i=0; i<num_it; i++)
    {
        /* BUFFER INPUT, to be encrypted */
        check(clSetKernelArg(dev_gpu.dev_kernel_sel, 0,
            sizeof(cl_mem), (void*)&buffer_gpu[BUF_i]));
        /* BUFFER KEYS, used in encryption */
        check(clSetKernelArg(dev_gpu.dev_kernel_sel, 1,
            sizeof(cl_mem), (void*)&dev_gpu.buffer_keys));

        if(BUF_i == (GPU_BUF_NUM - 1))
            clEnqueueWriteBuffer(dev_gpu.dev_cmd_queue_iow,
                buffer_gpu[BUF_i], CL_TRUE, 0,
                GPU_BUF_ALLOC_SIZE *sizeof(char),
                buf_in + offset, 0, 0, &event_io[BUF_i]);
        else
            clEnqueueWriteBuffer(dev_gpu.dev_cmd_queue_iow,
                buffer_gpu[BUF_i], CL_FALSE, 0,
                GPU_BUF_ALLOC_SIZE *sizeof(char),
                buf_in + offset, 0, 0, &event_io[BUF_i]);

        check(clEnqueueNDRangeKernel(dev_gpu.dev_cmd_queue,
            dev_gpu.dev_kernel_sel, 1, NULL, &global_work,
            &threads, 1, &event_io[BUF_i], &event_exec[BUF_i]));

        if (i == (num_it - 1))
            clEnqueueReadBuffer(dev_gpu.dev_cmd_queue_io,
                buffer_gpu[BUF_i], CL_TRUE, 0,
                GPU_BUF_ALLOC_SIZE *sizeof(char),
                buf_out + offset, 1, &event_exec[BUF_i], 0);
        else if(BUF_i == (GPU_BUF_NUM - 1))
            clEnqueueReadBuffer(dev_gpu.dev_cmd_queue_io,
                buffer_gpu[BUF_i], CL_TRUE, 0,
                GPU_BUF_ALLOC_SIZE *sizeof(char),
                buf_out + offset, 1, &event_exec[BUF_i], 0);
        else
            clEnqueueReadBuffer(dev_gpu.dev_cmd_queue_io,
                buffer_gpu[BUF_i], CL_FALSE, 0,
                GPU_BUF_ALLOC_SIZE *sizeof(char),
                buf_out + offset, 1, &event_exec[BUF_i], 0);

        offset += GPU_BUF_ALLOC_SIZE;

        BUF_i++;
        BUF_i = BUF_i % GPU_BUF_NUM;
    }

    /* release all OpenCL memory objects */
    for(int i=0; i<GPU_BUF_NUM; i++)
        clReleaseMemObject(buffer_gpu[i]);
    clReleaseMemObject(dev_gpu.buffer_keys);
}

Listing A.2: AES GPU HOST C/C++ code

A.3 Various optimizations GPU, kernel.cl

/*******************************************************
 * Function: MixColumn
 *******************************************************/
inline uchar4 MixColumn (uchar4 state)
{
    uchar4 a;
    uchar4 b;
    uchar4 h;

    a = state;

    b.s0 = state.s0 << 1;
    h.s0 = ((state.s0 & 0x80) == 0x80) ? 0xFF : 0x00;
    b.s0 ^= 0x1B & h.s0;

    b.s1 = state.s1 << 1;
    h.s1 = ((state.s1 & 0x80) == 0x80) ? 0xFF : 0x00;
    b.s1 ^= 0x1B & h.s1;

    b.s2 = state.s2 << 1;
    h.s2 = ((state.s2 & 0x80) == 0x80) ? 0xFF : 0x00;
    b.s2 ^= 0x1B & h.s2;

    b.s3 = state.s3 << 1;
    h.s3 = ((state.s3 & 0x80) == 0x80) ? 0xFF : 0x00;
    b.s3 ^= 0x1B & h.s3;

    state.s0 = b.s0 ^ a.s3 ^ a.s2 ^ b.s1 ^ a.s1;
    state.s1 = b.s1 ^ a.s0 ^ a.s3 ^ b.s2 ^ a.s2;
    state.s2 = b.s2 ^ a.s1 ^ a.s0 ^ b.s3 ^ a.s3;
    state.s3 = b.s3 ^ a.s2 ^ a.s1 ^ b.s0 ^ a.s0;

    return state;
}

/*******************************************************
 * Function: MixColumns
 *******************************************************/
inline uchar16 MixColumns(uchar16 state)
{
    return (uchar16)(
        MixColumn(state.s0123),
        MixColumn(state.s4567),
        MixColumn(state.s89AB),
        MixColumn(state.sCDEF)
    );
}

/*******************************************************
 * Function: AES128_Enc
 * Info: AES128 Encrypt
 *******************************************************/
__kernel void AES128_Enc(__global uchar16* buf, __global uchar16* keys)
{
    int idx = get_global_id(0);
    __local uchar16 state;

    __local uchar SBox[256];
    SBox[  0]=0x63; SBox[  1]=0x7c; SBox[  2]=0x77; SBox[  3]=0x7b;
    SBox[  4]=0xf2; SBox[  5]=0x6b; SBox[  6]=0x6f; SBox[  7]=0xc5;
    ...
    SBox[252]=0xb0; SBox[253]=0x54; SBox[254]=0xbb; SBox[255]=0x16;

    state = buf[idx];
    state = AddRoundKey(state, keys[0]);

    for (int i = 1; i < 10; i++)
    {
        state = SubBytes(state, SBox);
        state = ShiftRows(state);
        state = MixColumns(state);
        state = AddRoundKey(state, keys[i]);
    }

    state = SubBytes(state, SBox);
    state = ShiftRows(state);
    state = AddRoundKey(state, keys[10]);

    buf[idx] = state;
}

Listing A.3: AES GPU TARGET OpenCL code

A.4 Work split CPU-GPU, aes_hybrid.cpp

/* plain_gpu                 plain_cpu
 *     \/                        \/
 *  *-----------------------------*
 *  *    WORK      |     WORK     *
 *  *    GPU       |     CPU      *
 *  *    AES       |     AES      *
 *  *-----------------------------*
 */

/*******************************************************
 * Function: AES_HYBRID
 * Info: static work split amongst CPU and GPGPU
 *******************************************************/
AES_HYBRID::AES_HYBRID(uchar* buf_in,
                       uchar* buf_out,
                       uint buf_len,
                       uchar* key,
                       uint key_len,
                       string aes_method,
                       AES_OPTIONS aes_options) :

    AES_BASE(buf_in,
             buf_out,
             buf_len,
             key,
             key_len,
             aes_method)
{
    /* Space for the GPU to work on */
    uchar* buf_in_gpu = NULL;
    uchar* buf_out_gpu = NULL;
    uint buf_len_gpu = 0;
    uchar* buf_in_hwni = NULL;
    uchar* buf_out_hwni = NULL;
    uint buf_len_hwni = buf_len;

    cl_user_ids = aes_options.cl_device_ids;
    cl_user_splits = aes_options.cl_device_splits;

    DIE(cl_user_ids.size() != cl_user_splits.size(),
        "Num CL devices and num splits are unequal, error from CLI");

    /* offsets are 0 */
    buf_in_gpu = buf_in;
    buf_out_gpu = buf_out;
    buf_len_gpu = 0;

    for (uint i = 0; i < cl_user_ids.size(); i++)
    {
        /* compute the space for this GPU to work on */
        buf_in_gpu += buf_len_gpu;
        buf_out_gpu += buf_len_gpu;
        buf_len_gpu = buf_len * ((double)cl_user_splits[i] / (double)100);
        buf_len_gpu -= (buf_len_gpu % GPU_BUF_ALLOC_SIZE);

        /* CPU has less work */
        buf_len_hwni -= buf_len_gpu;

        /* init and push GPU */
        aes_gpus[i] = new AES_GPU(buf_in_gpu, buf_out_gpu, buf_len_gpu,
            key, key_len, aes_method, cl_user_ids[i]);
    }

    /* compute the remaining space for the CPU to work on */
    buf_in_hwni = buf_in_gpu + buf_len_gpu;
    buf_out_hwni = buf_out_gpu + buf_len_gpu;

    /* init CPU */
    aes_hwni = new AES_HWNI(buf_in_hwni, buf_out_hwni, buf_len_hwni,
        key, key_len, aes_method, aes_options.cpu_count);
}

Listing A.4: AES hybrid processing CPU+GPU work split

