COSC 6339
Accelerators in Big Data
Edgar Gabriel
Fall 2018
Motivation
• Programming models such as MapReduce and Spark
provide a high-level view of parallelism
– not easy for all problems, e.g. recursive algorithms, many
graph problems, etc.
• How to handle problems that do not have inherent
high-level parallelism?
– sequential processing: time to solution takes too long for
large problems
– exploit low-level parallelism: often groups of very few
instructions
• Problem with instruction level parallelism: costs for
exploiting parallelism exceed benefits if using regular
threads/processes/tasks
Historic context: SIMD Instructions
• Same operation executed for multiple data items
• Uses fixed-length registers and partitions the carry chain to
allow utilizing the same functional unit for multiple
operations
– e.g. a 256-bit adder can perform eight 32-bit add
operations simultaneously
• All elements in a register have to be on the same memory
page to avoid page faults within the instruction
Comparison of instructions
• Example: add operation on eight 32-bit integers with and
without SIMD instructions
LOOP: LOAD R2, 0(R4) /* load x(i) */
LOAD R0, 0(R6) /* load y(i) */
ADD R2, R2, R0 /* x(i)+y(i)*/
STORE R2, 0(R4) /* store x(i) */
ADD R4, R4, #4 /* increment x */
ADD R6, R6, #4 /* increment y */
BNEQ R4, R20, LOOP
---------------------
LOAD256 YMM1, 0(R4) /* loads 256 bits of data*/
LOAD256 YMM2, 0(R6) /* ditto */
VADDSP YMM1, YMM1, YMM2 /* AVX ADD operation */
STORE256 YMM1, 0(R4)
Note: these are not the actual Intel assembly instructions and register names
3 instructions required for managing the loop (i.e. not contributing to the actual solution of the problem)
Branch instructions typically lead to processor stalls, since the processor has to wait for the outcome
of the comparison before it can decide which instruction to execute next
SIMD Instructions
• MMX (Multi-Media Extension) – 1996
– Existing 64-bit floating point registers could be used for
eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension) – 1999
– Successor to MMX instructions
– Separate 128-bit registers added for sixteen 8-bit, eight
16-bit, or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 - 2007
– Added support for double precision operations
• AVX (Advanced Vector Extensions) - 2010
– 256-bit registers added
• AVX2 – 2013
– extended most integer instructions to 256-bit registers
• AVX-512 – 2016
– 512-bit registers added
Graphics Processing Units (GPU)
• Hardware in Graphics Units similar to SIMD units
– Works well with data-level parallel problems
– Scatter-gather transfers
– Mask registers
– Large register files
• Using NVIDIA GPUs as an example
Graphics Processing Units (II)
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for GPU
– Unify all forms of GPU parallelism as CUDA thread
– Programming model is “Single Instruction Multiple
Threads”
• GPU hardware handles thread management, not
applications or OS
Example: Vector Addition
• Sequential code:
int main ( int argc, char **argv )
{
    int i, A[N], B[N], C[N];
    for ( i=0; i<N; i++) {
        C[i] = A[i] + B[i];
    }
    return (0);
}
CUDA: replace the loop by N threads,
each executing one element of the vector
add operation
Example: Vector Addition (II)
• CUDA: replace the loop by N threads each executing
one element of the vector add operation
• Question: How does each thread know which elements
to execute?
– threadIdx : each thread has an id which is unique in
the thread block
• of type dim3, which is a
typedef struct {
    int x, y, z;
} dim3;
– blockDim: Total number of threads in the thread block
• a thread block can be 1D, 2D or 3D
Example: Vector Addition (III)
• Initial CUDA kernel:
• This code is limited by the upper limit on the number of
threads in one thread block
– if the vector is longer, we have to create multiple thread blocks
void vecadd ( int *d_A, int *d_B, int* d_C)
{
int i = threadIdx.x;
d_C[i] = d_A[i] + d_B[i];
return;
}
Assuming a 1-D thread block
-> only x-dimension used
How does the compiler know which code to
compile for CPU and which one for GPU?
• Specifier tells compiler where function will be executed
-> compiler can generate code for corresponding processor
• Executed on CPU, called from CPU (default if not specified)
__host__ void func(…)
• CUDA kernel to be executed on GPU, called from CPU
__global__ void func(...);
• CUDA kernel to be executed on GPU, called from GPU
__device__ void func(...);
Example: Vector Addition (IV)
• so the CUDA kernel is in reality:
• Note:
– d_A, d_B, and d_C are in global memory
– int i is in local memory of the thread
__global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
{
int i = threadIdx.x;
d_C[i] = d_A[i] + d_B[i];
return;
}
If you have multiple thread blocks
__global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
d_C[i] = d_A[i] + d_B[i];
return;
}
ID of the thread block that
this thread is part of
Number of threads in a
thread block
Using more than one element per thread
__global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j;
for ( j=i*NUMELEMENTS; j<(i+1)*NUMELEMENTS; j++)
d_C[j] = d_A[j] + d_B[j];
return;
}
Nvidia GT200
• The GT200 is a multi-core chip with a two-level hierarchy
– focuses on high throughput on data parallel workloads
• 1st level of hierarchy: 10 Thread Processing Clusters (TPC)
• 2nd level of hierarchy: each TPC has
– 3 Streaming Multiprocessors (SM) ( an SM corresponds to 1
core in a conventional processor)
– a texture pipeline (used for memory access)
• Global Block Scheduler:
– issues thread blocks to SMs with available capacity
– simple round-robin algorithm that takes resource
availability (e.g. of shared memory) into account
Nvidia GT200
Image Source: David Kanter, “Nvidia GT200: Inside a Parallel Processor”,
http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1
Streaming multi-processor (I)
• Instruction fetch, decode and issue logic
• 8 32-bit ALU units (often referred to as Streaming
Processors (SP), or confusingly called a 'core' by Nvidia)
• 8 branch units: no branch prediction or speculation, branch
delay: 4 cycles
• Can execute up to 8 thread blocks/1024 threads
concurrently
• Each SP has access to 2048 register file entries each with 32
bits
– a double precision number has to utilize two adjacent
registers
– register file can be used by up to 128 threads
concurrently
CUDA Memory Model
CUDA Memory Model (II)
• cudaError_t cudaMalloc(void** devPtr, size_t size)
– Allocates size bytes of device (global) memory pointed to by *devPtr
– Returns cudaSuccess for no error
• cudaError_t cudaMemcpy(void* dst, const void* src,
size_t count, enum cudaMemcpyKind kind)
– Dst = destination memory address
– Src = source memory address
– Count = bytes to copy
– Kind = type of transfer (“HostToDevice”, “DeviceToHost”,
“DeviceToDevice”)
• cudaError_t cudaFree(void* devPtr)
– Frees memory allocated with cudaMalloc
Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo
http://www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf
Example: Vector Addition (V)
int main ( int argc, char ** argv) {
float a[N], b[N], c[N];
float *d_a, *d_b, *d_c;
cudaMalloc( &d_a, N*sizeof(float));
cudaMalloc( &d_b, N*sizeof(float));
cudaMalloc( &d_c, N*sizeof(float));
cudaMemcpy( d_a, a, N*sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy( d_b, b, N*sizeof(float),cudaMemcpyHostToDevice);
dim3 threadsPerBlock(256); // 1-D array of threads
dim3 blocksPerGrid(N/256); // 1-D grid
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);
cudaMemcpy(c, d_c, N*sizeof(float),cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
Nvidia Tesla V100 GPU
• Most recent Nvidia GPU architecture
• Architecture: each V100 contains
– 6 GPU Processing Clusters (GPCs)
– Each GPC has
• 7 Texture Processing Clusters (TPCs)
• 14 Streaming Multiprocessors (SMs)
– Each SM has
• 64 32-bit Floating Point cores
• 64 32-bit Integer cores
• 32 64-bit Floating Point cores
• 8 Tensor Cores
• 4 Texture Units
Nvidia V100 Tensor Cores
• Specifically designed to support neural networks
• Designed to execute
– D = A×B + C for 4x4 matrices
– Operate on FP16 input data with FP32 accumulation
Image source:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Image source:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
NVLink
• GPUs traditionally utilize a PCIe slot for moving data and
instructions from CPU to GPU and between GPUs
– PCIe 3.0 x8: 8 GB/s bandwidth
– PCIe 3.0 x16: 16 GB/s bandwidth
– motherboards are often restricted in the number of PCIe lanes
they manage, i.e. using multiple PCIe cards will reduce the
bandwidth available for each card
• NVLink : high speed connection between multiple GPUs
– higher bandwidth per link (25 GB/sec) than PCIe
– the V100 supports up to 6 NVLinks per GPU
NVLink2: multi-GPU without CPU support
Image source:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
NVLink2 with CPU support
Image source:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
• only supports IBM Power 9 CPUs at the moment
Other V100 enhancements
• In earlier GPUs
– a group of threads (a warp) executed a single instruction
– A single program counter was used in combination with an active
mask that specified which threads of the warp are active at any
given point in time
– Divergent paths (e.g. if-then-else statements) lead to some
threads being inactive
• V100 introduces program counters and call stacks per thread.
– Independent thread scheduling allows the GPU to yield
execution of any thread
– A schedule optimizer dynamically determines how to group
active threads into SIMT units
– Threads can diverge and converge at sub-warp granularity
Google Tensor Processing Unit (TPU)
• Google's DNN ASIC
• Coprocessor on the PCIe bus
• Large software-managed scratchpad
• Scratchpad:
– high-speed memory (similar to a cache)
– content controlled by the application instead of the system
TPU Microarchitecture
• The matrix multiply unit contains a 256×256 array of ALUs that
perform 8-bit multiply-and-add operations, generating 16-bit products
• Accumulators are used for updating partial results
• Weights for the matrix multiply operations are supplied by an
off-chip 8 GiB weight memory through the Weight FIFO
• Intermediate results are held in a 24 MiB on-chip unified buffer
• The host server sends instructions over the PCIe bus to the TPU
• Programmable DMA transfers data to or from host memory
Google TPU Architecture
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
TPU Instruction Set Architecture
TPU Instruction | Function
Read_Host_Memory | Read data from host memory
Read_Weights | Read weights from weight memory
MatrixMultiply/Convolve | Multiply or convolve the data with the weights, accumulate the results
Activate | Apply activation functions
Write_Host_Memory | Write results to host memory
• No program counter
• No branch instructions
• Contain a repeat field
• Very high CPI ( 10 – 20)
TPU microarchitecture
• Goal: hide the cost of the other instructions and keep the Matrix Multiply Unit busy
• Systolic array: 2-D collection of arithmetic units that each independently compute a partial result
• Data arrives at cells from different directions at regular intervals
• Data flows through the array similar to a wave front -> systolic execution
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Systolic execution: Matrix-Vector
Example
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Systolic execution: Matrix-Matrix
Example
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Architecture | No. of instructions per cycle
CPU | a few
CPU w/ vector extensions | tens – hundreds
GPU | tens of thousands
TPU | up to 128K
Image source:
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
TPU software
• At this point mostly limited to TensorFlow
• Code that is expected to run on the TPU is compiled using
an API that can target GPU, TPU, or CPU