[05][CUDA and Fermi Optimization Techniques] hryu optimization

CUDA and Fermi Optimization Techniques Hyungon Ryu, PSG SA

Transcript of [05][CUDA and Fermi Optimization Techniques] hryu optimization

Page 1: [05][cuda 및 fermi 최적화 기술] hryu optimization

CUDA and Fermi

Optimization Techniques

Hyungon Ryu, PSG SA

Page 2: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Agenda

CUDA 101 (General Overview)

Toy of FDM example

Monte Carlo Simulation & Time Series Analysis

General CUDA Optimization Tips

Page 3: [05][cuda 및 fermi 최적화 기술] hryu optimization

CUDA 101

General Overview

Page 4: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA 101

CUDA

Graphics

(GPU)

Parallel

Programming

Page 5: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA Parallel programming

GPU Knowledge

Cg

OpenGL

DirectX

Parallel knowledge

Pthreads / winthreads

MMX, SSE

OpenMP

PVM / MPI

Heterogeneous Knowledge

GPU

Parallel DSP

Parallel ASIC

Parallel FPGA

Cell BE

CUDA: Parallel Computing with GPU

Page 6: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Parallel Model in OpenMP


CPU Program

Fork

Parallel

Join

Page 7: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA parallel Model


CPU Program

Kernel Launch

GPU threads

Page 8: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Saxpy Example : CPU serial

for (int i = 0; i < n; ++i) {

y[i] = a*x[i] + y[i];

}

x

y

i

Page 9: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Example of Saxpy Parallel : OpenMP

# pragma omp parallel shared (n,a,x,y) private (i)

# pragma omp for

for (int i = 0; i < n; ++i) {

y[i] = a*x[i] + y[i];

}

x

y

i

Page 10: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Example of Saxpy Parallel : MPI

for ( i = start ; i <= end; i++)

{

y[i] = a * x[i] + y[i];

}

x

y

i

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD,&rank);

MPI_Comm_size(MPI_COMM_WORLD,&size);

void para_range(int lowest, int highest,int nprocs, int myrank,

int *start, int *end) {

int wk1, wk2;

wk1 = (highest - lowest + 1) / nprocs;

wk2 = (highest - lowest + 1) % nprocs;

*start = myrank * wk1 + lowest + ( (myrank < wk2) ? myrank : wk2);

*end = *start + wk1 - 1;

if(wk2 > myrank) *end = *end + 1;

}
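A minimal sketch (not from the original slides) of how para_range is typically called so that each MPI rank updates only its own contiguous chunk of the saxpy loop; allocation and distribution of x, y, a and n are omitted, and note that para_range returns an inclusive end index:

int start, end;
para_range(0, n - 1, size, rank, &start, &end);   /* chunk [start, end] for this rank */
for (int i = start; i <= end; i++)                /* end is inclusive */
    y[i] = a * x[i] + y[i];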

Page 11: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Example of Saxpy Parallel : SSE

x

y

i

void saxpy_vector(short *z, short *x, short *y, short a, unsigned n) {

__m128i* x_ptr = (__m128i*) x;

__m128i* y_ptr = (__m128i*) y;

__m128i* z_ptr = (__m128i*) z;

__m128i a_vec = _mm_set1_epi16(a);

int i;

for (i = 0; i < n/8; ++i) {

__m128i x_vec = x_ptr[i];

__m128i y_vec = y_ptr[i];

__m128i z_vec = _mm_add_epi16( _mm_mullo_epi16(x_vec, a_vec), y_vec);

z_ptr[i] = z_vec;

}

}

Page 12: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Saxpy Parallel : CUDA

x

y

i

__global__ void Saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

Saxpy<<<N, M>>>(n, 2.0, x, y);

Page 13: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA C extension

Launch the kernel

Function <<< Grid, Block >>> ( parameter);

Additional C standard API for mem control

cudaXXX : cudaMalloc, cudaMemcpy,

cuXXX : cuMalloc, cuMemcpy

cutXXX : cutXXX

For Function

__global__, __device__, __host__, __device__ __host__

For memory

__shared__, __device__, reg/loc

pre-defined variables

blockDim, blockIdx, threadIdx, cudaMemcpyHostToDevice

Pre-defined function

__syncthreads(), __mul24(), etc.
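As a minimal self-contained sketch (not from the original slides) of how these extensions fit together; the kernel name, array size and block size are illustrative only:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *d, int n, float a)        // __global__: GPU kernel, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // pre-defined variables
    if (i < n) d[i] *= a;
}

int main()
{
    const int n = 1024;
    float h[1024], *d;
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    cudaMalloc((void**)&d, n * sizeof(float));                      // cudaXXX memory API
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);                    // Function<<<Grid, Block>>>(parameter)
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[10] = %f\n", h[10]);                                  // expect 20.0
    return 0;
}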

Page 14: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Process of CUDA development


Algorithm

serial Programming

Compile

Debugging

Release

Serial CUDA parallel

Algorithm

serial Programming

Compile

Release

CUDA convert

Debugging

Compile

Parallelize

Optimize/profile

Profile

Page 15: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA is Parallel Computing !!!


Algorithm

serial Programming

Compile

Debugging

Release

Serial MPI parallel CUDA parallel

Algorithm

serial Programming

Compile

Release

parallel Programming

Debugging [totalview]

Optimize

Compile

Profile

Parallelize

MPIrun

Algorithm

serial Programming

Compile

Release

CUDA convert

Debugging

Compile

Parallelize

Optimize/profile

Profile

Page 16: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential


CPU Program

Kernel Launch

Block

thread

SMC

Geometry Controller

SP

DP

SP

SP SP

SP SP

SP SP

I-Cache

MT Issue

C-Cache

SFU SFU

SharedMemory

Texture Unit

Tex L1

SP

DP

SP

SP SP

SP SP

SP SP

I-Cache

MT Issue

C-Cache

SFU SFU

SharedMemory

SP

DP

SP

SP SP

SP SP

SP SP

I-Cache

MT Issue

C-Cache

SFU SFU

SharedMemory

SM

TPC

SP

DP

SP

SP SP

SP SP

SP SP

I-Cache

MT Issue

C-Cache

SFU SFU

SharedMemory

Page 17: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Image Processing Diagram without CUDA

For Loop start

For Loop End

CPU Total

time

Camera

capture

Page 18: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Image Processing Diagram with CUDA

Camera

For Loop start

For Loop End

CPU

Memcpy

capture

CUDA parallel

Total

time

Page 19: [05][cuda 및 fermi 최적화 기술] hryu optimization

1D Heat Equation

CUDA Toy

for undergraduate students

Page 20: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

1D Heat Equation

1D bar

Heat Source

Page 21: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

1D Heat Equation

∂u/∂t = ∂²u/∂x²

Boundary condition: u(0, t) = u(1, t) = 0

Initial condition: u(x, 0) = u₀

Page 22: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Discretization

Forward difference in time, second-order central difference in space:

(u_{i+1,j} − u_{i,j}) / Δt = (u_{i,j+1} − 2 u_{i,j} + u_{i,j−1}) / Δx²

Explicit method:

u_{i+1,j} = r·u_{i,j−1} + (1 − 2r)·u_{i,j} + r·u_{i,j+1},   r = Δt / Δx²

i (time), j (space)

Discretization of ∂u/∂t = ∂²u/∂x² gives the update relation:

u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];

Page 23: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Discretization (stencil)

The point (i+1, j) at the new time level is built from the points (i, j−1), (i, j), (i, j+1) at the old time level with stencil weights r, (1 − 2r), r:

u_{i+1,j} = r·u_{i,j−1} + (1 − 2r)·u_{i,j} + r·u_{i,j+1}

Page 24: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Conceptual Diagram: space index j = 0..13 (vertical), time 0..600 (horizontal). Starting from the initial condition at time 0, each update applies the stencil weights r, (1 − 2r), r to points j−1, j, j+1 of the previous time level, with the boundary condition applied at the ends.

u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];

Page 25: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Explicit Pseudo Code

Parameter and data Initialization

Stencil, boundary/initial condition,

FOR LOOP (time, i)

FOR LOOP (stencil, j)

Update the stencil relation

Results

u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];

Page 26: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CPU code

u[i][j] vs. u[j]

u[i][j]: easy to develop; the whole time history can be visualized.

u[j]: memory-efficient; only the last result is kept.

Page 27: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Main algorithm

for( i =0; i < N; i++) {

for( j =0; j < M; j++){

u[i+1][j] = r * u[i ][j-1]

+ (1-2*r) * u[i ][j]

+ r * u[i ][j+1];

}

}

Time Iteration

Space Iteration

Heat relation

Page 28: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Boundary Condition

N

N-1

Method1

N

N-1

N+1

copy copy copy

Method2

copy

compute

Free boundary

Fixed boundary
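A minimal sketch (not from the original slides) of the two ghost-cell treatments on a 1D array u[0..mesh+1] whose interior points are 1..mesh; the names u_left and u_right are illustrative:

/* Free (reflecting) boundary: mirror the neighbouring interior value into the ghost cells,
   as the CPU code later does with temper[mesh+1] = temper[mesh-1]. */
u[0]        = u[2];
u[mesh + 1] = u[mesh - 1];

/* Fixed boundary: keep the prescribed values in the ghost cells at every step. */
u[0]        = u_left;
u[mesh + 1] = u_right;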

Page 29: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

(Conceptual diagram as before: space index j = 0..13, time 0..600, stencil weights r, (1 − 2r), r applied at each time update.)

How to Parallelize

Parallelize the space stencil; the time stencil cannot be parallelized.

Each thread (core) is dedicated to one space stencil element.

Page 30: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize

Parameter and data Initialization

Stencil, boundary/initial condition,

FOR LOOP (time, i)

FOR LOOP (stencil, j)

Update the stencil relation

Results

Page 31: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Space parallelization

Sequence: Time 0 → Time 1 → Time 2 → Time 3, with communication between consecutive time steps.

Page 32: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CPU code

source

result

CPU

solve

Heat Relation

Boundary Condition

Page 33: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CPU code

do{

time += dt; printf("aaaaaa %f\n",time);

for(i=1; i < mesh+1; i++){

temper[i].new_1 = (double) (1-2*r)*temper[i].old + r*(temper[i-1].old + temper[i+1].old);

printf("\n processing \t %d %f %f \n",i, temper[i].new_1, temper[i].old);

}

temper[mesh+1].new_1 = temper[mesh-1].new_1;

printf("\t print results %d %f %f \n",mesh+1, temper[mesh+1].new_1,temper[mesh-1].new_1 );

for(i=1; i < mesh+2; i++)

temper[i].old = temper[i].new_1; printf("aa\t\t %d %f %f \n",i, temper[i].new_1,temper[i].new_1 );

if((++count % print_step)==0){

printf("hh \t\t\t %10.5lf", time);

for(i=0; i<mesh; i+=2)

printf("df 8.4lf", temper[i].new_1);

if(!(i%2))

printf("fd %8.4lf\n", temper[mesh].new_1);

}

printf("\n\n time print %f\n", time); getchar();

}while(time < end_time);

if((count % print_step)){

printf("bghj %10.5lf", time);

for(i=0; i<mesh; i+=2)

printf("ahg 8.4lf", temper[i].new_1);

printf("nhgf %8.4lf\n", temper[mesh].new_1);

}

Page 34: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-01 Memory Map

CPU side: source → solve (heat relation, boundary condition) → result

GPU side: source → bypass without solving → result

Page 35: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-01 Malloc Template

double* init_GPU_data(struct flow * temper, int mesh)

{

double *u_dev; // for GPU data upload

size_t gpuMemsize=sizeof(double)*(mesh+2) ;

double *u_host; // temporary value

cudaMalloc( (void**)&u_dev, gpuMemsize);cudaErr("malloc u_dev");

u_host = (double *) malloc( gpuMemsize);

for(int i=0;i<mesh;i++){

u_host[i]= temper[i].old;

printf("before %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,

temper[i].old);

}

cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice);cudaErr("memcpy u_dev u_host");

cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev u_host");

for(int i=0;i<mesh;i++){

printf("after %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,

temper[i].old);

}

free(u_host);

return (double *)u_dev;

}

Page 36: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-01 Launch Template

__global__ void functionG(double *u_dev, int meshsize, double r, double bound )

{

int idx = blockIdx.x*blockDim.x+threadIdx.x;

int i = idx+1;

if(idx <meshsize +1 ) {

u_dev[i]= i*0.01;

}

if(idx == 6){

u_dev[idx+1]=u_dev[idx-1]=11;

}

}

void compute_GPU( double * u_dev, double * u_host, double dt, double dx, double r, int mesh,

int print_step, int count, double time, double end_time, double bound )

{

size_t gpuMemsize = sizeof(double)*(mesh+2);

//for( int i=0; i < 6000 ; i++){ //time step

functionG<<<4,5>>>(u_dev, mesh,r, bound);cudaErr2("kernel launch",1,0);

cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev to u_host");

for(int i=0; i<mesh+1; i++){

printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);

}

//}

return;

}

Page 37: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-02 Solving

CPU side: source → solve (heat relation, boundary condition) → result

GPU side: source → solve (heat relation, boundary condition) → result

Page 38: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-02 Malloc part

double* init_GPU_data(struct flow * temper, int mesh)

{

double *u_dev; // for GPU data upload

size_t gpuMemsize=sizeof(double)*(mesh+2) ;

double *u_host; // temporary value

cudaMalloc( (void**)&u_dev, gpuMemsize);cudaErr("malloc u_dev");

u_host = (double *) malloc( gpuMemsize);

for(int i=0;i<mesh;i++){

u_host[i]= temper[i].old;

printf("before %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,

temper[i].old);

}

cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice);cudaErr("memcpy u_dev u_host");

cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev u_host");

for(int i=0;i<mesh;i++){

printf("after %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,

temper[i].old);

}

free(u_host);

return (double *)u_dev;

}

Page 39: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-02 Launch part

__global__ void functionG(double *u_dev, int meshsize, double r, double bound )

{

int idx = blockIdx.x*blockDim.x+threadIdx.x;

int i = idx+1;

if(idx <meshsize +1 ) {

u_dev[i] = (double) (1-2*r)*u_dev[i] + r*(u_dev[i-1] + u_dev[i+1] );

// u_dev[i]= i*0.01;

}

if(idx == 6){

u_dev[idx+1]=u_dev[idx-1]=11;

}

}

void compute_GPU( double * u_dev, double * u_host, double dt, double dx, double r, int mesh,

int print_step, int count, double time, double end_time, double bound )

{

size_t gpuMemsize = sizeof(double)*(mesh+2);

//for( int i=0; i < 6000 ; i++){ //time step

functionG<<<4,5>>>(u_dev, mesh,r, bound);cudaErr2("kernel launch",1,0);

cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev to u_host");

for(int i=0; i<mesh+1; i++){

printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);

}

//}

return;

}
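The commented-out loop above is where the time iteration would go. A minimal sketch (not from the original slides) of one way to drive it from the host with two device buffers, so that every step reads only values of the previous step, as the CPU code does; the kernel name heat_step, the buffer names and the block size of 128 are illustrative:

__global__ void heat_step(const double *u_old, double *u_new, int mesh, double r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;      // interior points 1..mesh
    if (i <= mesh)
        u_new[i] = (1.0 - 2.0 * r) * u_old[i] + r * (u_old[i - 1] + u_old[i + 1]);
}

void run_time_loop(double *u_a, double *u_b, int mesh, double r, int steps)
{
    int block = 128;
    int grid  = (mesh + block - 1) / block;
    for (int s = 0; s < steps; ++s) {                       // the time loop stays on the host
        heat_step<<<grid, block>>>(u_a, u_b, mesh, r);
        // ghost-cell / boundary update for u_b would go here
        double *tmp = u_a; u_a = u_b; u_b = tmp;            // swap old and new buffers
    }
}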

Page 40: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to Optimize?

Page 41: [05][cuda 및 fermi 최적화 기술] hryu optimization

Monte Carlo Simulation

Page 42: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Computation Time

Malliavin MC results

type       | Black-Scholes | 5000    | 10000   | 50000   | 100000  | 200000  | 300000
Price      | 12.8216       | 12.8706 | 12.8721 | 12.8413 | 12.8525 | 12.8559 | 12.8480
Err        | 0.000%        | 0.38%   | 0.39%   | 0.15%   | 0.24%   | 0.27%   | 0.21%
Delta      | 0.5858        | 0.5724  | 0.5887  | 0.5826  | 0.5826  | 0.5829  | 0.5824
Err        | 0             | 2.28%   | -0.51%  | 0.53%   | 0.53%   | 0.49%   | 0.58%
Gamma      | 0.0130        | 0.0112  | 0.0134  | 0.0127  | 0.0127  | 0.0127  | 0.0127
Err        | 0.00%         | 13.39%  | -3.34%  | 2.07%   | 2.26%   | 2.21%   | 2.47%
Total Time | 0:00          | 00:03   | 00:05   | 00:25   | 00:50   | 01:44   | 02:33

time: 100 sec; Target 1: 1 sec (100X); Target 2: 0.001 sec (2000X)

Page 43: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Excel Sheet → Excel VBA → C/C++ dll Link → CUDA Acceleration

Monte Carlo Simulation for Finance

Page 44: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

function MC( S As Double, X As Double, T As Double, R As Double, _

Vol As Double, Q As Double, No As Double, rtype As String) As Double

Simul_No = No

dt = 1 'dt = 1 / 365

For K = 0 To Simul_No - 1

Juga = S

For i = 0 To MaxStep - 1

Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() )

Next

price = Exp(-R * T) * Max(Juga - X, 0)

sum_price = sum_price + price

Next

MC = sum_price / Simul_No

End function

Monte Carlo Code with VBA

Page 45: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Greek computation for Monte Carlo simulation

Malliavin approach

With the Malliavin approach, we can save computation time.

Malliavin Greeks:

∂/∂S₀ E[f(S_T)] = E[ f′(S_T) · ∂S_T/∂S₀ ] = E[ f(S_T) · π ]

where π is the Malliavin weight; e.g. for Delta, π = W_T / (S₀ · σ · T), as used in the code on the following slides.

Malliavin weights

Page 46: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Problem

To compare the accuracy, we compute the Price, Delta and Gamma of a vanilla Call option.

Approach

1. Closed Form solution (VBA,C)

2. Monte Carlo (VBA, C)

3. Malliavin (VBA,C, CUDA v1, v2)

Page 47: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

function Malliavin( S As Double, X As Double, T As Double, R As Double, _

Vol As Double, Q As Double, No As Double, rtype As String) As Double

Simul_No = No

dt = 1 'dt = 1 / 365

For K = 0 To Simul_No - 1

Juga = S

For i = 0 To MaxStep - 1

Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() )

Next

WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol

WT_delta = (WT / (S * Vol * T))

WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol)

price = Exp(-R * T) * Max(Juga - X, 0)

delta = Exp(-R * T) * Max(Juga - X, 0) * WT_delta

gamma = Exp(-R * T) * Max(Juga - X, 0) * WT_gamma

sum_price = sum_price + price

sum_delta = sum_delta + delta

sum_gamma = sum_gamma + Gamma

Next

Malliavin = sum_delta / Simul_No

End function

Malliavin Monte Carlo Code with VBA

Page 48: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

void Malliavin( double S , double X , double T , double R, double Vol, double Q, long No){

long Simul_No = No;

double dt = 1; // dt = 1 / 365

for ( int K = 0; K< Simul_No - 1 ; K++){

Juga = S ;

for (int i = 0; i< MaxStep - 1 ki++){

Juga = Juga * exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * sqrt(dt) * norm() ); // rand with box muller

}

WT = (log(Juga) - log(S) - (R - Q - Vol*Vol / 2) * T) / Vol;

WT_delta = (WT / (S * Vol * T));

WT_gamma = (1 / (Vol * T * S * S)) * (WT*WT / (Vol * T) - WT - 1 / Vol);

price = exp(-R * T) * max(Juga - X, 0);

delta = exp(-R * T) * max(Juga - X, 0) * WT_delta;

gamma = exp(-R * T) * max(Juga - X, 0) * WT_gamma;

sum.price = sum.price + price;

sum.delta = sum.delta + delta;

sum.gamma = sum.gamma + gamma;

}

r.price = sum.price / Simul_No;

r.delta = sum.delta / Simul_No;

r.gamma = sum.gamma / Simul_No;

return;

}

Step1 Malliavin Monte Carlo C language sketch

Page 49: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__global__ void Malliavin_compute( double S , double X , double T , double R, double Vol, double Q, long No){

long Simul_No = No;

double dt = 1; // dt = 1 / 365

for ( int K = 0; K< Simul_No - 1 ; K++){

Juga = S ;

for (int i = 0; i< MaxStep - 1 ki++){

Juga = Juga * exp( (R - Q - Vol *Vol / 2) * dt + Vol * sqrt(dt) * norm(k,i) );

}

WT = (log(Juga) - log(S) - (R - Q - Vol*Vol / 2) * T) / Vol;

WT_delta = (WT / (S * Vol * T));

WT_gamma = (1 / (Vol * T * S *S)) * (WT*WT / (Vol * T) - WT - 1 / Vol);

price = exp(-R * T) * max(Juga - X, 0);

delta = exp(-R * T) * max(Juga - X, 0) * WT_delta;

gamma = exp(-R * T) * max(Juga - X, 0) * WT_gamma;

sum.price = sum.price + price;

sum.delta = sum.delta + delta;

sum.gamma = sum.gamma + gamma;

}

r.price = sum.price / Simul_No;

r.delta = sum.delta / Simul_No;

r.gamma = sum.gamma / Simul_No;

return;

}

Simul_No = Total Sim / (N threads * M blocks)

Real Price = Sum(r.price) / (N*M)

Step2 Sketch (kernel part)

Page 50: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__device__ float * normal(int k, int j, int size_j, float * normal) {

int index = k*size_j + j;

return &normal[index];

}

uniform

normal

stock

option

normal()

RNG_compute2()

RNG_compute1()

Malliavin_compute()

Parallel

cores

data in global memory

Step2 Parallel Memory Map

Page 51: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

#include <stdio.h>

__global__ void RNG_compute(parameter);

__global__ void Malliavin_compute(parameter);

main(){

malloc(); //cpu malloc

cudaMalloc(); //GPU malloc

cudaMemcpy(); // transfer

RNG_compute 1<<<N,M>>> (parameter); // generate RNG (uniform)

RNG_compute2 <<<N,M>>> (parameter); // generate RNG ( BM,Moro)

Malliavin_compute <<<N,M>>> (parameter); // simulation

cudaMemcpy(); //get results

return 0;

}

Step2 Sketch (host part)

Page 52: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__global__

static void RNG_rand48_get_int(uint2 *state, int *res, int num_blocks, uint2 A, uint2 C)

{

const int nThreads = blockDim.x*gridDim.x;

int nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;

uint2 lstate = state[nOutIdx];

int i;

for (i = 0; i < num_blocks; ++i) {

res[nOutIdx] = ( lstate.x >> 17 ) | ( lstate.y << 7);

nOutIdx += nThreads;

lstate = RNG_rand48_iterate_single(lstate, A, C);

}

nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;

state[nOutIdx] = lstate;

}

Step2 Malliavin Monte Carlo CUDA language sketch (rng part1)

Page 53: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__global__ void RNG_compute( int * uniform , float * normal, int length){

int index = blockDim.x*blockIdx.x+threadIdx.x;

__shared__ int s[256]; // size assumed >= blockDim.x

__shared__ float s_r[256];

if( threadIdx.x ==0){

for (int i = 0 ; i<blockDim.x ; i++){

s[i]=uniform[blockDim.x*blockIdx.x + i]; // load uniform

}

}

s_r[threadIdx.x] = (float) moro(s[threadIdx.x]); // moro inversion with parallel

if( threadIdx.x ==0){

for (int i = 0 ; i<blockDim.x ; i++){

normal[blockDim.x*blockIdx.x+i] = s_r[i]; // save normal

}

}

}

Step2 Malliavin Monte Carlo CUDA language sketch (rng part2)

Page 54: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__device__ void BoxMuller(float& u1, float& u2){

float r = sqrtf(-2.0f * logf(u1));

float phi = 2 * PI * u2;

u1 = r * __cosf(phi);

u2 = r * __sinf(phi);

}

Box-Muller vs Moro Inversion

__device__ float Moro( float u ){

// skip the const value

x = u - 0.5;

if (abs(x) < 0.42) {

r = x*x;

r = x * (((a4 * r + a3) * r + a2) * r + a1) / ((((b4 * r + b3) * r + b2) * r + b1) * r + 1);

} else{

if (x > 0) r = log(-log(1 - u));

if (x <= 0) r = log(-log(u));

r = c1 + r * (c2 + r * (c3 + r * (c4 + r * (c5 + r * (c6 + r * (c7 + r * (c8 + r * c9)))))));

if (x <= 0) r = -r;

}

return r;

}

Page 55: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Step3 New approach for Direct LCG and Parallel Memory Map

stock

option

rand_new(seed) → Malliavin_compute()

Parallel

cores

uniform

normal

double rand_new( int next){

next = (a * next + b) % M;

return (double) next * (1.0 / M);

}

moro(u)

data in registers
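A minimal sketch (not from the original slides) of the step-3 idea: each thread keeps its own LCG state in a register, converts the uniform draw to a normal with an inversion routine, and consumes it inside the simulation loop, so no uniform/normal arrays are stored in global memory. The LCG constants, the seed scrambling and the moro_inv() helper are illustrative assumptions:

__device__ double lcg_next(unsigned int &state)            // per-thread LCG; state lives in a register
{
    state = 1664525u * state + 1013904223u;                 // illustrative LCG constants
    return (state + 0.5) * (1.0 / 4294967296.0);            // uniform in (0, 1)
}

__global__ void mc_kernel(double *partial, int paths_per_thread, unsigned int seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int state = seed ^ (tid * 2654435761u);        // de-correlate threads (illustrative)
    double acc = 0.0;
    for (int p = 0; p < paths_per_thread; ++p) {
        double u = lcg_next(state);
        double z = moro_inv(u);                             // hypothetical Moro-inversion helper
        acc += z;                                           // path update / payoff would go here
    }
    partial[tid] = acc;                                     // one partial result per thread
}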

Page 56: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Platform for finance: Excel Sheet → Excel VBA → C/C++ dll Link → Socket communication → Grid Manager → GPU cluster

8 GPUs per node

12-node system (1/2 for backup): 12 * 8 GPUs * 448 cores ≈ 43,000 cores

Page 57: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to Optimize?

Page 58: [05][cuda 및 fermi 최적화 기술] hryu optimization

Time Series Analysis

Page 59: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Multivariate Time-series

S(2)

S(1)

S(3)

S(n)

m

Corr(1,2)

Corr(1,3)

Corr(1,n)

Corr(2,3)

Page 60: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize with CUDA

S(2)

S(1)

S(3)

S(n)

serialize

Inefficient approach for Shared Memory usage

Page 61: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize with CUDA

S(2)

S(1)

S(3)

S(n)

serialize

Efficient approach for Shared Memory usage

A reduction technique is needed for the mean, etc.

Page 62: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize with CUDA

S(2)

S(1)

S(3)

S(n)

serialize

Parallelize

With

multiNode

multiGPU

Good method for multiGPU & large Cluster

Page 63: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

For each pair (i, j), the Pearson Correlation Coefficient:

ρ = Σ (x_k − x̄)(y_k − ȳ) / sqrt( Σ (x_k − x̄)² · Σ (y_k − ȳ)² )

Expanding the sums:

Σ (x_k − x̄)(y_k − ȳ) = Σ x_k y_k − n·x̄·ȳ

Σ (x_k − x̄)² = Σ x_k² − n·x̄²

Σ (y_k − ȳ)² = Σ y_k² − n·ȳ²

so

ρ = ( Σ x_k y_k − n·x̄·ȳ ) / sqrt( (Σ x_k² − n·x̄²)(Σ y_k² − n·ȳ²) )

We can parallelize the summation! After the summation, find the mean.
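A minimal host-side sketch (not from the original slides) of the final step, combining the five accumulated sums into the correlation coefficient; the function and variable names are illustrative:

#include <math.h>

double correlation_from_sums(double sx, double sy, double sxy,
                             double sxx, double syy, int n)
{
    /* sx = Σx, sy = Σy, sxy = Σxy, sxx = Σx², syy = Σy² */
    double mx  = sx / n, my = sy / n;          /* means */
    double cov = sxy - n * mx * my;            /* Σxy − n·x̄·ȳ */
    double vx  = sxx - n * mx * mx;            /* Σx² − n·x̄² */
    double vy  = syy - n * my * my;            /* Σy² − n·ȳ² */
    return cov / sqrt(vx * vy);
}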

Page 64: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize with CUDA : Flow Chart

Input A, B

Find the mean of A, B

start to sum (Ai), (Bi)

Find Cov(A,B), Cov(A,A), Cov(B,B)

start to sum (Ai,Bi), (Ai^2), (Bi^2) with mean

Input A, B

Find the mean of A, B

Start to sum (Ai), (Bi), (Ai,Bi), (Ai^2), (Bi^2)

Method 1 Method 2

Find Cov(A,B), Cov(A,A), Cov(B,B)

Benefit : easy to implement with two functions

Benefit : one-shot sum (speed-up)

R+CUDA project: http://brainarray.mbni.med.umich.edu/Brainarray/rgpgpu/

Page 65: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

For each pair (i, j), Pearson Correlation Coefficient:

FOR (i, j) pair : serial

  FOR k (time-series) : parallel

    compute Σ x_k, Σ y_k, Σ x_k y_k, Σ x_k², Σ y_k²

  reduction for results

compute mean(i), mean(j)

compute cov(i,j), cov(i,i), cov(j,j)

compute corr(i,j)

Page 66: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

For each pair (i, j), Pearson Correlation Coefficient:

FOR (i, j) pair : serial

  FOR k (time-series) : parallel

    FOR shared memory

      compute Σ x_k, Σ y_k, Σ x_k y_k, Σ x_k², Σ y_k²

    reduction for results

compute mean(i), mean(j)

compute cov(i,j), cov(i,i), cov(j,j)

compute corr(i,j)
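A minimal sketch (not from the original slides) of the inner parallel step for one pair: each thread accumulates a partial sum over its slice of the time series, then a tree reduction in shared memory combines the partial sums. Only the Σxy sum is shown (the other four sums follow the same pattern) and the block size of 256 is an assumption:

__global__ void partial_sum_xy(const float *x, const float *y, float *block_sums, int n)
{
    __shared__ float s[256];                         // one slot per thread (blockDim.x == 256 assumed)
    int tid = threadIdx.x;
    int k   = blockIdx.x * blockDim.x + tid;
    float acc = 0.0f;
    for (; k < n; k += gridDim.x * blockDim.x)       // grid-stride loop over the time series
        acc += x[k] * y[k];
    s[tid] = acc;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {   // tree reduction in shared memory
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = s[0];     // one partial result per block
}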

Page 67: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to Optimize?

Page 68: [05][cuda 및 fermi 최적화 기술] hryu optimization

Focus on Optimization

Page 69: [05][cuda 및 fermi 최적화 기술] hryu optimization

System Optimization Tips before starting CUDA programming

Page 70: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Remote Use

Remote Host Client

WDDM

RDP: RDP driver, RDP protocol

Remote Host Client

WDDM

VNC: VNC driver, VNC protocol

CUDA

enabled

CUDA

disabled

intercept the

CUDA driver

snapshot of real screen

Page 71: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

TCC driver

Remote Host Client

WDDM

RDP: RDP driver, RDP protocol

WDM

• Remote Desktop

• Kernel Launch Overhead

• GPU exclusive Mode

• Windows Service for Session0/1

• large single Malloc

CUDA Application

result

GPU

CUDA

enabled

Page 72: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPU exclusive Mode for multiuser

Server

Remote users

Remote users

Remote users

GPU 0

GPU 1

GPU 2

GPU 3

GPU Pool
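A minimal sketch (not from the original slides) of how an application can inspect the compute mode the administrator has set on each GPU in the pool; it only uses cudaGetDeviceCount / cudaGetDeviceProperties, and the printed policy note is illustrative:

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // In exclusive compute mode only one context may use the GPU at a time,
        // so each remote user effectively gets a dedicated device from the pool.
        printf("GPU %d (%s): computeMode = %d\n", d, prop.name, prop.computeMode);
    }
    return 0;
}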

Page 73: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUDirect for GPU cluster

System Pinned Memory

System Pageable Memory

System Pinned Memory share

MPI Communication
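A minimal sketch (not from the original slides) of the pinned-memory idea behind GPUDirect: the staging buffer is allocated as page-locked host memory so the CUDA driver and the MPI/InfiniBand stack can share it; the buffer size and the MPI call pattern are illustrative:

#include <cuda_runtime.h>
#include <mpi.h>

void send_gpu_result(const double *d_data, int n, int dest)
{
    double *h_pinned = NULL;
    cudaMallocHost((void**)&h_pinned, n * sizeof(double));            // pinned (page-locked) host buffer
    cudaMemcpy(h_pinned, d_data, n * sizeof(double),
               cudaMemcpyDeviceToHost);                               // fast DMA copy from the GPU
    MPI_Send(h_pinned, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);       // MPI sends straight from that buffer
    cudaFreeHost(h_pinned);
}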

Page 74: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

FBO on SDI with Quadro

System

Memory

CPU

GPU

Video

Memory

DVI

DVI

NVIDIA Graphics Card

DVI

Third Party

SDI Input

NVIDIA SDI Output

Memory

Controller

System

Memory

CPU

GPU

Video

Memory

DVI

DVI

NVIDIA Graphics Card

DVI

Third Party

SDI Input

NVIDIA SDI Output

Memory

Controller

Direct write to OpenGL Frame Buffer Object

Write to Host memory and then write to GPU memory

Page 75: [05][cuda 및 fermi 최적화 기술] hryu optimization

Conceptual Tips for

CUDA optimization

Page 76: [05][cuda 및 fermi 최적화 기술] hryu optimization

SIMT architecture

Single Instruction Multiple Threads

Page 77: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Abstract overview of CPU Core

ALU, FPU, LSU

L1

inst. cache

Register Bank

inst. fetch

L1

data cache

Registers: RAX, RBX, RCX, RDX

Memory Bank

CPU

Page 78: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Threads on Multicore CPU

CPU

reg

Core

TH

CPU

reg

Core

TH TH

CPU

reg

Core

TH TH

Winthread, pthread

Hyper-threading

TH TH

General Programming

CPU

Page 79: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Threads on Manycore GPU


MP

reg

SP

MP

reg

SP

TH TH TH TH TH TH TH TH TH TH

TH TH TH TH TH

Block Block Block

H/W Multi Processor vs S/W Active Block

Block Block

Page 80: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Overview of WARP schedule

schedule

<<<0,0>>><<<0,1>>>

<<<0,31>>><<<0,32>>>

threads

Register

per threads

8 SP core

warp

<<<0,33>>><<<0,34>>>

Warp 0

MP
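A minimal sketch (not from the original slides) of how a kernel can recover its warp and lane position from the pre-defined variables, assuming the warp size of 32 threads described above:

__global__ void warp_info(int *warp_out, int *lane_out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;      // position within the warp (0..31)
    int warp = threadIdx.x / 32;      // warp index within the block
    warp_out[tid] = warp;
    lane_out[tid] = lane;
}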

Page 81: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Overview of Instruction Fetch

cores

FP operation

ADD

0 1 2 3

Load A, B

Store C

blockIdx.x=0, threadIdx.x=3

4 5 6 7

0 1 2 3 4 5 6 7

Page 82: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Overview of Instruction Fetch

cores

FP operation

ADD

0 1 2 3

Load A, B

Store C

blockIdx.x=0, threadIdx.x=4

4 5 6 7

0 1 2 3 4 5 6 7

Page 83: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Thread schedule within MP : WARP

1024 * 30 : 30K

WARP

B1

B2

Looks like concurrent threads

MP

Page 84: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Occupancy Calculator on CUDA SDK

Page 85: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA profiler on CUDA toolkit

Page 86: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Parallel NSight 1.5 Professional

Page 87: [05][cuda 및 fermi 최적화 기술] hryu optimization

Memory Hierarchy

Page 88: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Managing Memory

Multiprocessor

Host

CPU

Chipset

DRAM

Device

DRAM

Local

Memory

Global

Memory

GPU

Multiprocessor

Multiprocessor

Registers

Shared Memory

Bandwidth

Size

L1 Cache

L2 Cache
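A minimal sketch (not from the original slides) of using this hierarchy: each block stages a tile of global memory into on-chip shared memory, works on it, and writes the result back; the tile size of 256 (one element per thread) and the trivial transform are illustrative:

__global__ void tile_scale(const float *in, float *out, int n, float a)
{
    __shared__ float tile[256];                       // shared memory: small, on-chip, high bandwidth
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global memory: large, off-chip device DRAM
    if (i < n) tile[threadIdx.x] = in[i];             // global -> shared
    __syncthreads();
    if (i < n) out[i] = a * tile[threadIdx.x];        // compute in registers, write back to global
}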

Page 89: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Matrix Size for Best CUBLAS3.2 Performance

[Chart: SGEMM performance (Gflops, 0–650) vs. matrix size, multiples of 96, from 96 x 96 up to 8448 x 8448]

[Chart: DGEMM performance (Gflops, 0–350) vs. matrix size, multiples of 64, from 64 x 64 up to 8384 x 8384]

cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz

Page 90: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

cuBLAS level III

[Chart: DGEMM performance (Gflops, 0–350) vs. matrix size, multiples of 64, from 64 x 64 up to 3456 x 3456]

cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz

Page 91: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Roofline Analysis (Arithmetic Intensity)

Samuel Williams, Andrew Waterman, and David Patterson,

Roofline: An Insightful Visual Performance Model for Multicore Architectures

Page 92: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Perf. on CUDA Application

Gflops

Ns

F/B

Bandwidth of PCI-e slot

Bandwidth of Global Memory
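A short worked example (not from the original slides) of the F/B reasoning, using approximate Fermi-class numbers (Tesla C2050: roughly 144 GB/s global-memory bandwidth, about 1 Tflop/s single-precision peak) as assumptions:

Saxpy, y[i] = a*x[i] + y[i], does 2 flops per element against 12 bytes of traffic (read x[i], read y[i], write y[i], 4 bytes each), so F/B ≈ 2/12 ≈ 0.17 flop/byte.

Attainable rate ≈ min(peak flops, F/B × bandwidth) ≈ 0.17 × 144 GB/s ≈ 24 Gflops, far below the compute peak: saxpy sits under the memory roof, so optimization effort belongs to memory traffic (coalescing, reuse), not arithmetic.

SGEMM performs about 2·N³ flops on about 3·N²·4 bytes of data, so F/B grows with N and large matrices can approach the compute roof instead (see the cuBLAS charts above).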

Page 93: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Tips for Optimization

Consider the algorithm's suitability for parallelism (naïve algorithms often parallelize well)

Consider Occupancy ( SIMT)

Consider Memory Bottleneck

Page 94: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

More Information for CUDA Optimization

CUDA Zone

http://www.nvidia.com/CUDA

Developer Zone

http://developer.nvidia.com

GTC 2010 contents

http://www.nvidia.com/gtc2010-content

CUDA Café (쿠다 카페), the CUDA community in Korea

http://cafe.daum.net/KCUG

Page 95: [05][cuda 및 fermi 최적화 기술] hryu optimization

Thanks

Hyungon Ryu

[email protected]