COSC 6339
Accelerators in Big Data
Edgar Gabriel
Fall 2018
Motivation
• Programming models such as MapReduce and Spark
provide a high-level view of parallelism
– not easy for all problems, e.g. recursive algorithms, many
graph problems, etc.
• How to handle problems that do not have inherent
high-level parallelism?
– sequential processing: time to solution takes too long for
large problems
– exploit low-level parallelism: often groups of very few
instructions
• Problem with instruction level parallelism: costs for
exploiting parallelism exceed benefits if using regular
threads/processes/tasks
Historic context: SIMD Instructions
• Same operation executed for multiple data items
• Uses fixed-length registers and partitions the carry chain to
allow utilizing the same functional unit for multiple
operations
– e.g. a 256-bit adder can perform eight 32-bit add
operations simultaneously
• All elements in a register have to be on the same memory
page to avoid page faults within the instruction
Comparison of instructions
• Example: add operation on eight 32-bit integers with and
without SIMD instructions
LOOP: LOAD R2, 0(R4) /* load x(i) */
LOAD R0, 0(R6) /* load y(i) */
ADD R2, R2, R0 /* x(i)+y(i)*/
STORE R2, 0(R4) /* store x(i) */
ADD R4, R4, #4 /* increment x */
ADD R6, R6, #4 /* increment y */
BNEQ R4, R20, LOOP
---------------------
LOAD256 YMM1, 0(R4) /* loads 256 bits of data*/
LOAD256 YMM2, 0(R6) /* ditto */
VADDSP YMM1, YMM1, YMM2 /* AVX ADD operation */
STORE256 YMM1, 0(R4)
Note: these are not the actual Intel assembly instructions and register names
3 instructions required for managing the loop (i.e. not contributing to the actual solution of the problem)
Branch instructions typically lead to processor stalls, since the processor has to wait for the outcome
of the comparison before it can decide which instruction to execute next
SIMD Instructions
• MMX (Multi-Media Extension) – 1996
– Existing 64-bit floating point registers could be used for
eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension) – 1999
– Successor to MMX instructions
– Separate 128-bit registers added for sixteen 8-bit, eight
16-bit, or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 - 2007
– Added support for double precision operations
• AVX (Advanced Vector Extensions) - 2010
– 256-bit registers added
• AVX2 – 2013
– extended most integer instructions to 256-bit registers
• AVX-512 – 2016
– 512-bit registers added
Graphics Processing Units (GPU)
• Hardware in Graphics Units similar to SIMD units
– Works well with data-level parallel problems
– Scatter-gather transfers
– Mask registers
– Large register files
• Using NVIDIA GPUs as an example
Graphics Processing Units (II)
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for GPU
– Unify all forms of GPU parallelism as CUDA thread
– Programming model is “Single Instruction Multiple
Threads”
• GPU hardware handles thread management, not
applications or OS
Example: Vector Addition
• Sequential code:
int main ( int argc, char **argv )
{
    int i, A[N], B[N], C[N];
    for ( i=0; i<N; i++) {
        C[i] = A[i] + B[i];
    }
    return (0);
}
CUDA: replace the loop by N threads,
each executing one element of the vector
add operation
Example: Vector Addition (II)
• CUDA: replace the loop by N threads each executing
one element of the vector add operation
• Question: How does each thread know which elements
to execute?
– threadIdx : each thread has an id which is unique in
the thread block
• of type dim3, which is a
typedef struct {
    int x, y, z;
} dim3;
– blockDim: Total number of threads in the thread block
• a thread block can be 1D, 2D or 3D
Example: Vector Addition (III)
• Initial CUDA kernel:
• This code is limited by the upper limit on the number of
threads in one thread block
– if the vector is longer, we have to create multiple thread blocks
void vecadd ( int *d_A, int *d_B, int* d_C)
{
int i = threadIdx.x;
d_C[i] = d_A[i] + d_B[i];
return;
}
Assuming a 1-D thread block
-> only x-dimension used
How does the compiler know which code to
compile for CPU and which one for GPU?
• Specifier tells compiler where function will be executed
-> compiler can generate code for corresponding processor
• Executed on CPU, called from CPU (default if not specified)
__host__ void func(…)
• CUDA kernel to be executed on GPU, called from CPU
__global__ void func(...);
• CUDA kernel to be executed on GPU, called from GPU
__device__ void func(...);
Example: Vector Addition (IV)
• so the CUDA kernel is in reality:
• Note:
– d_A, d_B, and d_C are in global memory
– int i is in local memory of the thread
__global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
{
int i = threadIdx.x;
d_C[i] = d_A[i] + d_B[i];
return;
}
If you have multiple thread blocks
__global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
d_C[i] = d_A[i] + d_B[i];
return;
}
ID of the thread block that
this thread is part of
Number of threads in a
thread block
Using more than one element per thread
__global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j;
for ( j=i*NUMELEMENTS; j<(i+1)*NUMELEMENTS; j++)
d_C[j] = d_A[j] + d_B[j];
return;
}
Nvidia GT200
• The GT200 is a multi-core chip with a two-level hierarchy
– focuses on high throughput on data parallel workloads
• 1st level of hierarchy: 10 Thread Processing Clusters (TPC)
• 2nd level of hierarchy: each TPC has
– 3 Streaming Multiprocessors (SM) ( an SM corresponds to 1
core in a conventional processor)
– a texture pipeline (used for memory access)
• Global Block Scheduler:
– issues thread blocks to SMs with available capacity
– simple round-robin algorithm that takes resource
availability (e.g. of shared memory) into account
Nvidia GT200
Image Source: David Kanter, “Nvidia GT200: Inside a Parallel Processor”,
http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1
Streaming multi-processor (I)
• Instruction fetch, decode and issue logic
• 8 32-bit ALU units (often referred to as Streaming
Processors (SP), or confusingly called a 'core' by Nvidia)
• 8 branch units: no branch prediction or speculation, branch
delay: 4 cycles
• Can execute up to 8 thread blocks/1024 threads
concurrently
• Each SP has access to 2048 register file entries each with 32
bits
– a double precision number has to utilize two adjacent
registers
– register file can be used by up to 128 threads
concurrently
CUDA Memory Model
CUDA Memory Model (II)
• cudaError_t cudaMalloc(void** devPtr, size_t size)
– Allocates size bytes of device (global) memory pointed to by *devPtr
– Returns cudaSuccess for no error
• cudaError_t cudaMemcpy(void* dst, const void* src,
size_t count, enum cudaMemcpyKind kind)
– Dst = destination memory address
– Src = source memory address
– Count = bytes to copy
– Kind = type of transfer (“HostToDevice”, “DeviceToHost”,
“DeviceToDevice”)
• cudaError_t cudaFree(void* devPtr)
– Frees memory allocated with cudaMalloc
Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo
http://www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf
Example: Vector Addition (V)
int main ( int argc, char ** argv) {
float a[N], b[N], c[N];
float *d_a, *d_b, *d_c;
cudaMalloc( &d_a, N*sizeof(float));
cudaMalloc( &d_b, N*sizeof(float));
cudaMalloc( &d_c, N*sizeof(float));
cudaMemcpy( d_a, a, N*sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy( d_b, b, N*sizeof(float),cudaMemcpyHostToDevice);
dim3 threadsPerBlock(256); // 1-D array of threads
dim3 blocksPerGrid(N/256); // 1-D grid
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);
cudaMemcpy(c, d_c, N*sizeof(float),cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
Nvidia Tesla V100 GPU
• Most recent Nvidia GPU architecture
• Architecture: each V100 contains
– 6 GPU Processing Clusters (GPCs)
– Each GPC has
• 7 Texture Processing Clusters (TPCs)
• 14 Streaming Multiprocessors (SMs)
– Each SM has
• 64 32-bit Floating Point cores
• 64 32-bit Integer cores
• 32 64-bit Floating Point cores
• 8 Tensor Cores
• 4 Texture Units
Nvidia V100 Tensor Cores
• Specifically designed to support neural networks
• Designed to execute
– D = A×B + C for 4x4 matrices
– Operate on FP16 input data with FP32 accumulation
Image source:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Image source:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
NVLink
• GPUs traditionally utilize a PCIe slot for moving data and
instructions from CPU to GPU and between GPUs
– PCIe 3.0 x8: 8 GB/s bandwidth
– PCIe 3.0 x16: 16 GB/s bandwidth
– motherboards are often restricted in the number of PCIe lanes
they manage, i.e. using multiple PCIe cards will reduce the
bandwidth available for each card
• NVLink : high speed connection between multiple GPUs
– higher bandwidth per link (25 GB/sec) than PCIe
– the V100 supports up to 6 NVLinks per GPU
NVLink2: multi-GPU without CPU support
Image source:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
NVLink2 with CPU support
Image source:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
• only supports IBM Power 9 CPUs at the moment
Other V100 enhancements
• In earlier GPUs
– a group of threads (a warp) executed a single instruction
– A single program counter was used in combination with an active
mask that specified which threads of the warp are active at any
given point in time
– Divergent paths (e.g. if-then-else statements) lead to some
threads being inactive
• V100 introduces program counters and call stacks per thread.
– Independent thread scheduling allows the GPU to yield
execution of any thread
– A schedule optimizer dynamically determines how to group
active threads into SIMT units
– Threads can diverge and converge at sub-warp granularity
Google Tensor Processing Unit (TPU)
• Google's DNN ASIC
• Coprocessor on the PCIe bus
• Large software-managed scratchpad
• Scratchpad:
– high-speed memory (similar to a cache)
– content controlled by the application instead of the system
TPU Microarchitecture
• The matrix multiply unit contains a 256×256 array of ALUs that
perform 8-bit multiply-and-add operations, generating 16-bit products
• Accumulators are used for updating partial results
• Weights for the matrix multiply operations are supplied by an
off-chip 8 GiB weight memory through the Weight FIFO
• Intermediate results are held in a 24 MiB on-chip unified buffer
• The host server sends instructions over the PCIe bus to the TPU
• Programmable DMA transfers data to or from host memory
Google TPU Architecture
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
TPU Instruction Set Architecture
TPU Instruction | Function
Read_Host_Memory | Read data from host memory
Read_Weights | Read weights from weight memory
MatrixMultiply/Convolve | Multiply or convolve the data with the weights, accumulate the results
Activate | Apply activation functions
Write_Host_Memory | Write results to host memory
• No program counter
• No branch instructions
• Contain a repeat field
• Very high CPI ( 10 – 20)
TPU microarchitecture
• Goal: hide the cost of the other instructions and keep the Matrix Multiply Unit busy
• Systolic array: 2-D collection of arithmetic units that each independently compute a partial result
• Data arrives at cells from different directions at regular intervals
• Data flows through the array similar to a wave front -> systolic execution
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Systolic execution: Matrix-Vector
Example
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Systolic execution: Matrix-Matrix
Example
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Architecture | No. of instructions per cycle
CPU | a few
CPU w/ vector extensions | tens – hundreds
GPU | tens of thousands
TPU | up to 128K
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
TPU software
• At this point mostly limited to TensorFlow
• Code that is expected to run on the TPU is compiled using
an API that can target GPU, TPU, or CPU