[05][CUDA and Fermi Optimization Techniques] hryu optimization

CUDA and Fermi Optimization Techniques Hyungon Ryu, PSG SA

Transcript of [05][CUDA and Fermi Optimization Techniques] hryu optimization

Page 1: [05][cuda 및 fermi 최적화 기술] hryu optimization

CUDA and Fermi

Optimization Techniques

Hyungon Ryu, PSG SA

Page 2: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Agenda

CUDA 101 (General Overview)

Toy of FDM example

Monte Carlo Simulation & Time Series Analysis

General CUDA Optimization Tips

Page 3: [05][cuda 및 fermi 최적화 기술] hryu optimization

CUDA 101

General Overview

Page 4: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA 101

CUDA

Graphics

(GPU)

Parallel

Programming

Page 5: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA Parallel programming

GPU Knowledge

Cg

OpenGL

DirectX

Parallel knowledge

Pthreads / winthreads

MMX, SSE

OpenMP

PVM / MPI

Heterogeneous Knowledge

GPU

Parallel DSP

Parallel ASIC

Parallel FPGA

Cell BE

CUDA: Parallel Computing with GPU

Page 6: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Parallel Model in OpenMP


CPU Program

Fork

Parallel

Join

Page 7: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA parallel Model


CPU Program

Kernel Launch

GPU threads

Page 8: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Saxpy Example : CPU serial

for (int i = 0; i < n; ++i) {

y[i] = a*x[i] + y[i];

}

x

y

i

Page 9: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Example of Saxpy Parallel : OpenMP

# pragma omp parallel shared (n,a,x,y) private (i)

# pragma omp for

for (int i = 0; i < n; ++i) {

y[i] = a*x[i] + y[i];

}

x

y

i

Page 10: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Example of Saxpy Parallel : MPI

for ( i = start ; i <= end; i++)

{

y[i] = a * x[i] + y[i];

}

x

y

i

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD,&rank);

MPI_Comm_size(MPI_COMM_WORLD,&size);

void para_range(int lowest, int highest,int nprocs, int myrank,

int *start, int *end) {

int wk1, wk2;

wk1 = (highest - lowest + 1) / nprocs;

wk2 = (highest - lowest + 1) % nprocs;

*start = myrank * wk1 + lowest + ( (myrank < wk2) ? myrank : wk2);

*end = *start + wk1 - 1;

if(wk2 > myrank) *end = *end + 1;

}
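A minimal sketch (not from the original slides) of how para_range is typically called so that each MPI rank updates only its own contiguous chunk of the saxpy loop; allocation and distribution of x, y, a and n are omitted, and note that para_range returns an inclusive end index:

int start, end;
para_range(0, n - 1, size, rank, &start, &end);   /* chunk [start, end] for this rank */
for (int i = start; i <= end; i++)                /* end is inclusive */
    y[i] = a * x[i] + y[i];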

Page 11: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Example of Saxpy Parallel : SSE

x

y

i

void saxpy_vector(short *z, short *x, short *y, short a, unsigned n) {

__m128i* x_ptr = (__m128i*) x;

__m128i* y_ptr = (__m128i*) y;

__m128i* z_ptr = (__m128i*) z;

__m128i a_vec = _mm_set1_epi16(a);

int i;

for (i = 0; i < n/8; ++i) {

__m128i x_vec = x_ptr[i];

__m128i y_vec = y_ptr[i];

__m128i z_vec = _mm_add_epi16( _mm_mullo_epi16(x_vec, a_vec), y_vec);

z_ptr[i] = z_vec;

}

}

Page 12: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Saxpy Parallel : CUDA

x

y

i

__global__ void Saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

Saxpy<<<N, M>>>(n, 2.0, x, y);

Page 13: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA C extension

Launch the kernel

Function <<< Grid, Block >>> ( parameter);

Additional C standard API for mem control

cudaXXX : cudaMalloc, cudaMemcpy,

cuXXX : cuMalloc, cuMemcpy

cutXXX : cutXXX

For Function

__global__, __device__, __host__, __device__ __host__

For memory

__shared__, __device__, reg/loc

pre-defined variables

blockDim, blockIdx, threadIdx, cudaMemcpyHostToDevice

Pre-defined function

__syncthreads(), __mul24(), etc.
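As a minimal self-contained sketch (not from the original slides) of how these extensions fit together; the kernel name, array size and block size are illustrative only:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *d, int n, float a)        // __global__: GPU kernel, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // pre-defined variables
    if (i < n) d[i] *= a;
}

int main()
{
    const int n = 1024;
    float h[1024], *d;
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    cudaMalloc((void**)&d, n * sizeof(float));                      // cudaXXX memory API
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);                    // Function<<<Grid, Block>>>(parameter)
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[10] = %f\n", h[10]);                                  // expect 20.0
    return 0;
}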

Page 14: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Process of CUDA development


Algorithm

serial Programming

Compile

Debugging

Release

Serial CUDA parallel

Algorithm

serial Programming

Compile

Release

CUDA convert

Debugging

Compile

Parallelize

Optimize/profile

Profile

Page 15: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA is Parallel Computing !!!


Algorithm

serial Programming

Compile

Debugging

Release

Serial MPI parallel CUDA parallel

Algorithm

serial Programming

Compile

Release

parallel Programming

Debugging [totalview]

Optimize

Compile

Profile

Parallelize

MPIrun

Algorithm

serial Programming

Compile

Release

CUDA convert

Debugging

Compile

Parallelize

Optimize/profile

Profile

Page 16: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential


CPU Program

Kernel Launch

Block

thread

SMC

Geometry Controller

SP

DP

SP

SP SP

SP SP

SP SP

I-Cache

MT Issue

C-Cache

SFU SFU

SharedMemory

Texture Unit

Tex L1

SP

DP

SP

SP SP

SP SP

SP SP

I-Cache

MT Issue

C-Cache

SFU SFU

SharedMemory

SP

DP

SP

SP SP

SP SP

SP SP

I-Cache

MT Issue

C-Cache

SFU SFU

SharedMemory

SM

TPC

SP

DP

SP

SP SP

SP SP

SP SP

I-Cache

MT Issue

C-Cache

SFU SFU

SharedMemory

Page 17: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Image Processing Diagram without CUDA

For Loop start

For Loop End

CPU Total

time

Camera

capture

Page 18: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Image Processing Diagram with CUDA

Camera

For Loop start

For Loop End

CPU

Memcpy

capture

CUDA parallel

Total

time

Page 19: [05][cuda 및 fermi 최적화 기술] hryu optimization

1D Heat Equation

CUDA Toy

for undergraduate students

Page 20: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

1D Heat Equation

1D bar

Heat Source

Page 21: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

1D Heat Equation

∂u/∂t = ∂²u/∂x²

Boundary condition: u(0, t) = u(1, t) = 0

Initial condition: u(x, 0) = u₀

Page 22: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Discretization

Forward difference in time, second-order central difference in space:

(u_{i+1,j} − u_{i,j}) / Δt = (u_{i,j+1} − 2 u_{i,j} + u_{i,j−1}) / Δx²

Explicit method:

u_{i+1,j} = r·u_{i,j−1} + (1 − 2r)·u_{i,j} + r·u_{i,j+1},   r = Δt / Δx²

i (time), j (space)

Discretization of ∂u/∂t = ∂²u/∂x² gives the update relation:

u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];

Page 23: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Discretization (stencil)

The point (i+1, j) at the new time level is built from the points (i, j−1), (i, j), (i, j+1) at the old time level with stencil weights r, (1 − 2r), r:

u_{i+1,j} = r·u_{i,j−1} + (1 − 2r)·u_{i,j} + r·u_{i,j+1}

Page 24: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Conceptual Diagram: space index j = 0..13 (vertical), time 0..600 (horizontal). Starting from the initial condition at time 0, each update applies the stencil weights r, (1 − 2r), r to points j−1, j, j+1 of the previous time level, with the boundary condition applied at the ends.

u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];

Page 25: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Explicit Pseudo Code

Parameter and data Initialization

Stencil, boundary/initial condition,

FOR LOOP (time, i)

FOR LOOP (stencil, j)

Update the stencil relation

Results

u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];

Page 26: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CPU code

u[i][j] vs. u[j]

u[i][j]: easy to develop; the whole time history can be visualized.

u[j]: memory-efficient; only the last result is kept.

Page 27: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Main algorithm

for( i =0; i < N; i++) {

for( j =0; j < M; j++){

u[i+1][j] = r * u[i ][j-1]

+ (1-2*r) * u[i ][j]

+ r * u[i ][j+1];

}

}

Time Iteration

Space Iteration

Heat relation

Page 28: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Boundary Condition

N

N-1

Method1

N

N-1

N+1

copy copy copy

Method2

copy

compute

Free boundary

Fixed boundary
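A minimal sketch (not from the original slides) of the two ghost-cell treatments on a 1D array u[0..mesh+1] whose interior points are 1..mesh; the names u_left and u_right are illustrative:

/* Free (reflecting) boundary: mirror the neighbouring interior value into the ghost cells,
   as the CPU code later does with temper[mesh+1] = temper[mesh-1]. */
u[0]        = u[2];
u[mesh + 1] = u[mesh - 1];

/* Fixed boundary: keep the prescribed values in the ghost cells at every step. */
u[0]        = u_left;
u[mesh + 1] = u_right;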

Page 29: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

(Conceptual diagram as before: space index j = 0..13, time 0..600, stencil weights r, (1 − 2r), r applied at each time update.)

How to Parallelize

Parallelize the space stencil; the time stencil cannot be parallelized.

Each thread (core) is dedicated to one space stencil element.

Page 30: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize

Parameter and data Initialization

Stencil, boundary/initial condition,

FOR LOOP (time, i)

FOR LOOP (stencil, j)

Update the stencil relation

Results

Page 31: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Space parallelization

Sequence: Time 0 → Time 1 → Time 2 → Time 3, with communication between consecutive time steps.

Page 32: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CPU code

source

result

CPU

solve

Heat Relation

Boundary Condition

Page 33: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CPU code

do{

time += dt; printf("aaaaaa %f\n",time);

for(i=1; i < mesh+1; i++){

temper[i].new_1 = (double) (1-2*r)*temper[i].old + r*(temper[i-1].old + temper[i+1].old);

printf("\n processing \t %d %f %f \n",i, temper[i].new_1, temper[i].old);

}

temper[mesh+1].new_1 = temper[mesh-1].new_1;

printf("\t print results %d %f %f \n",mesh+1, temper[mesh+1].new_1,temper[mesh-1].new_1 );

for(i=1; i < mesh+2; i++)

temper[i].old = temper[i].new_1; printf("aa\t\t %d %f %f \n",i, temper[i].new_1,temper[i].new_1 );

if((++count % print_step)==0){

printf("hh \t\t\t %10.5lf", time);

for(i=0; i<mesh; i+=2)

printf("df 8.4lf", temper[i].new_1);

if(!(i%2))

printf("fd %8.4lf\n", temper[mesh].new_1);

}

printf("\n\n time print %f\n", time); getchar();

}while(time < end_time);

if((count % print_step)){

printf("bghj %10.5lf", time);

for(i=0; i<mesh; i+=2)

printf("ahg 8.4lf", temper[i].new_1);

printf("nhgf %8.4lf\n", temper[mesh].new_1);

}

Page 34: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-01 Memory Map

CPU side: source → solve (heat relation, boundary condition) → result

GPU side: source → bypass without solving → result

Page 35: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-01 Malloc Template

double* init_GPU_data(struct flow * temper, int mesh)

{

double *u_dev; // for GPU data upload

size_t gpuMemsize=sizeof(double)*(mesh+2) ;

double *u_host; // temporary value

cudaMalloc( (void**)&u_dev, gpuMemsize);cudaErr("malloc u_dev");

u_host = (double *) malloc( gpuMemsize);

for(int i=0;i<mesh;i++){

u_host[i]= temper[i].old;

printf("before %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,

temper[i].old);

}

cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice);cudaErr("memcpy u_dev u_host");

cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev u_host");

for(int i=0;i<mesh;i++){

printf("after %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,

temper[i].old);

}

free(u_host);

return (double *)u_dev;

}

Page 36: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-01 Launch Template

__global__ void functionG(double *u_dev, int meshsize, double r, double bound )

{

int idx = blockIdx.x*blockDim.x+threadIdx.x;

int i = idx+1;

if(idx <meshsize +1 ) {

u_dev[i]= i*0.01;

}

if(idx == 6){

u_dev[idx+1]=u_dev[idx-1]=11;

}

}

void compute_GPU( double * u_dev, double * u_host, double dt, double dx, double r, int mesh,

int print_step, int count, double time, double end_time, double bound )

{

size_t gpuMemsize = sizeof(double)*(mesh+2);

//for( int i=0; i < 6000 ; i++){ //time step

functionG<<<4,5>>>(u_dev, mesh,r, bound);cudaErr2("kernel launch",1,0);

cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev to u_host");

for(int i=0; i<mesh+1; i++){

printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);

}

//}

return;

}

Page 37: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-02 Solving

CPU side: source → solve (heat relation, boundary condition) → result

GPU side: source → solve (heat relation, boundary condition) → result

Page 38: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-02 Malloc part

double* init_GPU_data(struct flow * temper, int mesh)

{

double *u_dev; // for GPU data upload

size_t gpuMemsize=sizeof(double)*(mesh+2) ;

double *u_host; // temporary value

cudaMalloc( (void**)&u_dev, gpuMemsize);cudaErr("malloc u_dev");

u_host = (double *) malloc( gpuMemsize);

for(int i=0;i<mesh;i++){

u_host[i]= temper[i].old;

printf("before %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,

temper[i].old);

}

cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice);cudaErr("memcpy u_dev u_host");

cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev u_host");

for(int i=0;i<mesh;i++){

printf("after %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,

temper[i].old);

}

free(u_host);

return (double *)u_dev;

}

Page 39: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUcode-02 Launch part

__global__ void functionG(double *u_dev, int meshsize, double r, double bound )

{

int idx = blockIdx.x*blockDim.x+threadIdx.x;

int i = idx+1;

if(idx <meshsize +1 ) {

u_dev[i] = (double) (1-2*r)*u_dev[i] + r*(u_dev[i-1] + u_dev[i+1] );

// u_dev[i]= i*0.01;

}

if(idx == 6){

u_dev[idx+1]=u_dev[idx-1]=11;

}

}

void compute_GPU( double * u_dev, double * u_host, double dt, double dx, double r, int mesh,

int print_step, int count, double time, double end_time, double bound )

{

size_t gpuMemsize = sizeof(double)*(mesh+2);

//for( int i=0; i < 6000 ; i++){ //time step

functionG<<<4,5>>>(u_dev, mesh,r, bound);cudaErr2("kernel launch",1,0);

cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev to u_host");

for(int i=0; i<mesh+1; i++){

printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);

}

//}

return;

}
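The commented-out loop above is where the time iteration would go. A minimal sketch (not from the original slides) of one way to drive it from the host with two device buffers, so that every step reads only values of the previous step, as the CPU code does; the kernel name heat_step, the buffer names and the block size of 128 are illustrative:

__global__ void heat_step(const double *u_old, double *u_new, int mesh, double r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;      // interior points 1..mesh
    if (i <= mesh)
        u_new[i] = (1.0 - 2.0 * r) * u_old[i] + r * (u_old[i - 1] + u_old[i + 1]);
}

void run_time_loop(double *u_a, double *u_b, int mesh, double r, int steps)
{
    int block = 128;
    int grid  = (mesh + block - 1) / block;
    for (int s = 0; s < steps; ++s) {                       // the time loop stays on the host
        heat_step<<<grid, block>>>(u_a, u_b, mesh, r);
        // ghost-cell / boundary update for u_b would go here
        double *tmp = u_a; u_a = u_b; u_b = tmp;            // swap old and new buffers
    }
}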

Page 40: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to Optimize?

Page 41: [05][cuda 및 fermi 최적화 기술] hryu optimization

Monte Carlo Simulation

Page 42: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Computation Time

Malliavin MC results

type       | Black-Scholes | 5000    | 10000   | 50000   | 100000  | 200000  | 300000
Price      | 12.8216       | 12.8706 | 12.8721 | 12.8413 | 12.8525 | 12.8559 | 12.8480
Err        | 0.000%        | 0.38%   | 0.39%   | 0.15%   | 0.24%   | 0.27%   | 0.21%
Delta      | 0.5858        | 0.5724  | 0.5887  | 0.5826  | 0.5826  | 0.5829  | 0.5824
Err        | 0             | 2.28%   | -0.51%  | 0.53%   | 0.53%   | 0.49%   | 0.58%
Gamma      | 0.0130        | 0.0112  | 0.0134  | 0.0127  | 0.0127  | 0.0127  | 0.0127
Err        | 0.00%         | 13.39%  | -3.34%  | 2.07%   | 2.26%   | 2.21%   | 2.47%
Total Time | 0:00          | 00:03   | 00:05   | 00:25   | 00:50   | 01:44   | 02:33

time: 100 sec; Target 1: 1 sec (100X); Target 2: 0.001 sec (2000X)

Page 43: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Excel Sheet → Excel VBA → C/C++ dll Link → CUDA Acceleration

Monte Carlo Simulation for Finance

Page 44: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

function MC( S As Double, X As Double, T As Double, R As Double, _

Vol As Double, Q As Double, No As Double, rtype As String) As Double

Simul_No = No

dt = 1 'dt = 1 / 365

For K = 0 To Simul_No - 1

Juga = S

For i = 0 To MaxStep - 1

Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() )

Next

price = Exp(-R * T) * Max(Juga - X, 0)

sum_price = sum_price + price

Next

MC = sum_price / Simul_No

End function

Monte Carlo Code with VBA

Page 45: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Greek computation for Monte Carlo simulation

Malliavin approach

With the Malliavin approach, we can save computation time.

Malliavin Greeks:

∂/∂S₀ E[f(S_T)] = E[ f′(S_T) · ∂S_T/∂S₀ ] = E[ f(S_T) · π ]

where π is the Malliavin weight; e.g. for Delta, π = W_T / (S₀ · σ · T), as used in the code on the following slides.

Malliavin weights

Page 46: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Problem

To compare the accuracy, we compute the Price, Delta and Gamma of a vanilla Call option.

Approach

1. Closed Form solution (VBA,C)

2. Monte Carlo (VBA, C)

3. Malliavin (VBA,C, CUDA v1, v2)

Page 47: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

function Malliavin( S As Double, X As Double, T As Double, R As Double, _

Vol As Double, Q As Double, No As Double, rtype As String) As Double

Simul_No = No

dt = 1 'dt = 1 / 365

For K = 0 To Simul_No - 1

Juga = S

For i = 0 To MaxStep - 1

Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() )

Next

WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol

WT_delta = (WT / (S * Vol * T))

WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol)

price = Exp(-R * T) * Max(Juga - X, 0)

delta = Exp(-R * T) * Max(Juga - X, 0) * WT_delta

gamma = Exp(-R * T) * Max(Juga - X, 0) * WT_gamma

sum_price = sum_price + price

sum_delta = sum_delta + delta

sum_gamma = sum_gamma + Gamma

Next

Malliavin = sum_delta / Simul_No

End function

Malliavin Monte Carlo Code with VBA

Page 48: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

void Malliavin( double S , double X , double T , double R, double Vol, double Q, long No){

long Simul_No = No;

double dt = 1; // dt = 1 / 365

for ( int K = 0; K< Simul_No - 1 ; K++){

Juga = S ;

for (int i = 0; i< MaxStep - 1 ki++){

Juga = Juga * exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * sqrt(dt) * norm() ); // rand with box muller

}

WT = (log(Juga) - log(S) - (R - Q - Vol*Vol / 2) * T) / Vol;

WT_delta = (WT / (S * Vol * T));

WT_gamma = (1 / (Vol * T * S * S)) * (WT*WT / (Vol * T) - WT - 1 / Vol);

price = exp(-R * T) * max(Juga - X, 0);

delta = exp(-R * T) * max(Juga - X, 0) * WT_delta;

gamma = exp(-R * T) * max(Juga - X, 0) * WT_gamma;

sum.price = sum.price + price;

sum.delta = sum.delta + delta;

sum.gamma = sum.gamma + gamma;

}

r.price = sum.price / Simul_No;

r.delta = sum.delta / Simul_No;

r.gamma = sum.gamma / Simul_No;

return;

}

Step1 Malliavin Monte Carlo C language sketch

Page 49: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__global__ void Malliavin_compute( double S , double X , double T , double R, double Vol, double Q, long No){

long Simul_No = No;

double dt = 1; // dt = 1 / 365

for ( int K = 0; K< Simul_No - 1 ; K++){

Juga = S ;

for (int i = 0; i< MaxStep - 1 ki++){

Juga = Juga * exp( (R - Q - Vol *Vol / 2) * dt + Vol * sqrt(dt) * norm(k,i) );

}

WT = (log(Juga) - log(S) - (R - Q - Vol*Vol / 2) * T) / Vol;

WT_delta = (WT / (S * Vol * T));

WT_gamma = (1 / (Vol * T * S *S)) * (WT*WT / (Vol * T) - WT - 1 / Vol);

price = exp(-R * T) * max(Juga - X, 0);

delta = exp(-R * T) * max(Juga - X, 0) * WT_delta;

gamma = exp(-R * T) * max(Juga - X, 0) * WT_gamma;

sum.price = sum.price + price;

sum.delta = sum.delta + delta;

sum.gamma = sum.gamma + gamma;

}

r.price = sum.price / Simul_No;

r.delta = sum.delta / Simul_No;

r.gamma = sum.gamma / Simul_No;

return;

}

Simul_No = Total Sim / (N threads * M blocks)

Real Price = Sum(r.price) / (N*M)

Step2 Sketch (kernel part)

Page 50: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__device__ float * normal(int k, int j, int size_j, float * normal) {

int index = k*size_j + j;

return &normal[index];

}

uniform

normal

stock

option

normal()

RNG_compute2()

RNG_compute1()

Malliavin_compute()

Parallel

cores

data in global memory

Step2 Parallel Memory Map

Page 51: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

#include <stdio.h>

__global__ void RNG_compute(parameter);

__global__ void Malliavin_compute(parameter);

main(){

malloc(); //cpu malloc

cudaMalloc(); //GPU malloc

cudaMemcpy(); // transfer

RNG_compute 1<<<N,M>>> (parameter); // generate RNG (uniform)

RNG_compute2 <<<N,M>>> (parameter); // generate RNG ( BM,Moro)

Malliavin_compute <<<N,M>>> (parameter); // simulation

cudaMemcpy(); //get results

return 0;

}

Step2 Sketch (host part)

Page 52: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__global__

static void RNG_rand48_get_int(uint2 *state, int *res, int num_blocks, uint2 A, uint2 C)

{

const int nThreads = blockDim.x*gridDim.x;

int nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;

uint2 lstate = state[nOutIdx];

int i;

for (i = 0; i < num_blocks; ++i) {

res[nOutIdx] = ( lstate.x >> 17 ) | ( lstate.y << 7);

nOutIdx += nThreads;

lstate = RNG_rand48_iterate_single(lstate, A, C);

}

nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;

state[nOutIdx] = lstate;

}

Step2 Malliavin Monte Carlo CUDA language sketch (rng part1)

Page 53: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__global__ void RNG_compute( int * uniform , float * normal, int length){

int index = blockDim.x*blockIdx.x+threadIdx.x;

__shared__ int s[256]; // size assumed >= blockDim.x

__shared__ float s_r[256];

if( threadIdx.x ==0){

for (int i = 0 ; i<blockDim.x ; i++){

s[i]=uniform[blockDim.x*blockIdx.x + i]; // load uniform

}

}

s_r[threadIdx.x] = (float) moro(s[threadIdx.x]); // moro inversion with parallel

if( threadIdx.x ==0){

for (int i = 0 ; i<blockDim.x ; i++){

normal[blockDim.x*blockIdx.x+i] = s_r[i]; // save normal

}

}

}

Step2 Malliavin Monte Carlo CUDA language sketch (rng part2)

Page 54: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

__device__ void BoxMuller(float& u1, float& u2){

float r = sqrtf(-2.0f * logf(u1));

float phi = 2 * PI * u2;

u1 = r * __cosf(phi);

u2 = r * __sinf(phi);

}

Box-Muller vs Moro Inversion

__device__ float Moro( float u ){

// skip the const value

x = u - 0.5;

if (abs(x) < 0.42) {

r = x*x;

r = x * (((a4 * r + a3) * r + a2) * r + a1) / ((((b4 * r + b3) * r + b2) * r + b1) * r + 1);

} else{

if (x > 0) r = log(-log(1 - u));

if (x <= 0) r = log(-log(u));

r = c1 + r * (c2 + r * (c3 + r * (c4 + r * (c5 + r * (c6 + r * (c7 + r * (c8 + r * c9)))))));

if (x <= 0) r = -r;

}

return r;

}

Page 55: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Step3 New approach for Direct LCG and Parallel Memory Map

stock

option

rand_new(seed) → Malliavin_compute()

Parallel

cores

uniform

normal

double rand_new( int next){

next = (a * next + b) % M;

return (double) next * (1.0 / M);

}

moro(u)

data in registers
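A minimal sketch (not from the original slides) of the step-3 idea: each thread keeps its own LCG state in a register, converts the uniform draw to a normal with an inversion routine, and consumes it inside the simulation loop, so no uniform/normal arrays are stored in global memory. The LCG constants, the seed scrambling and the moro_inv() helper are illustrative assumptions:

__device__ double lcg_next(unsigned int &state)            // per-thread LCG; state lives in a register
{
    state = 1664525u * state + 1013904223u;                 // illustrative LCG constants
    return (state + 0.5) * (1.0 / 4294967296.0);            // uniform in (0, 1)
}

__global__ void mc_kernel(double *partial, int paths_per_thread, unsigned int seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int state = seed ^ (tid * 2654435761u);        // de-correlate threads (illustrative)
    double acc = 0.0;
    for (int p = 0; p < paths_per_thread; ++p) {
        double u = lcg_next(state);
        double z = moro_inv(u);                             // hypothetical Moro-inversion helper
        acc += z;                                           // path update / payoff would go here
    }
    partial[tid] = acc;                                     // one partial result per thread
}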

Page 56: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Platform for finance: Excel Sheet → Excel VBA → C/C++ dll Link → Socket communication → Grid Manager → GPU cluster

8 GPUs per node

12-node system (1/2 for backup): 12 * 8 GPUs * 448 cores ≈ 43,000 cores

Page 57: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to Optimize?

Page 58: [05][cuda 및 fermi 최적화 기술] hryu optimization

Time Series Analysis

Page 59: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Multivariate Time-series

S(2)

S(1)

S(3)

S(n)

m

Corr(1,2)

Corr(1,3)

Corr(1,n)

Corr(2,3)

Page 60: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize with CUDA

S(2)

S(1)

S(3)

S(n)

serialize

Inefficient approach for Shared Memory usage

Page 61: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize with CUDA

S(2)

S(1)

S(3)

S(n)

serialize

Efficient approach for Shared Memory usage

A reduction technique is needed for the mean, etc.

Page 62: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize with CUDA

S(2)

S(1)

S(3)

S(n)

serialize

Parallelize

With

multiNode

multiGPU

Good method for multiGPU & large Cluster

Page 63: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

For each pair (i, j), the Pearson Correlation Coefficient:

ρ = Σ (x_k − x̄)(y_k − ȳ) / sqrt( Σ (x_k − x̄)² · Σ (y_k − ȳ)² )

Expanding the sums:

Σ (x_k − x̄)(y_k − ȳ) = Σ x_k y_k − n·x̄·ȳ

Σ (x_k − x̄)² = Σ x_k² − n·x̄²

Σ (y_k − ȳ)² = Σ y_k² − n·ȳ²

so

ρ = ( Σ x_k y_k − n·x̄·ȳ ) / sqrt( (Σ x_k² − n·x̄²)(Σ y_k² − n·ȳ²) )

We can parallelize the summation! After the summation, find the mean.
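A minimal host-side sketch (not from the original slides) of the final step, combining the five accumulated sums into the correlation coefficient; the function and variable names are illustrative:

#include <math.h>

double correlation_from_sums(double sx, double sy, double sxy,
                             double sxx, double syy, int n)
{
    /* sx = Σx, sy = Σy, sxy = Σxy, sxx = Σx², syy = Σy² */
    double mx  = sx / n, my = sy / n;          /* means */
    double cov = sxy - n * mx * my;            /* Σxy − n·x̄·ȳ */
    double vx  = sxx - n * mx * mx;            /* Σx² − n·x̄² */
    double vy  = syy - n * my * my;            /* Σy² − n·ȳ² */
    return cov / sqrt(vx * vy);
}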

Page 64: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to parallelize with CUDA : Flow Chart

Input A, B

Find the mean of A, B

start to sum (Ai), (Bi)

Find Cov(A,B), Cov(A,A), Cov(B,B)

start to sum (Ai,Bi), (Ai^2), (Bi^2) with mean

Input A, B

Find the mean of A, B

Start to sum (Ai), (Bi), (Ai,Bi), (Ai^2), (Bi^2)

Method 1 Method 2

Find Cov(A,B), Cov(A,A), Cov(B,B)

Benefit : easy to implement with two functions

Benefit : one-shot sum (speed-up)

R+CUDA project: http://brainarray.mbni.med.umich.edu/Brainarray/rgpgpu/

Page 65: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

For each pair (i, j), Pearson Correlation Coefficient:

FOR (i, j) pair : serial

  FOR k (time-series) : parallel

    compute Σ x_k, Σ y_k, Σ x_k y_k, Σ x_k², Σ y_k²

  reduction for results

compute mean(i), mean(j)

compute cov(i,j), cov(i,i), cov(j,j)

compute corr(i,j)

Page 66: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

For each pair (i, j), Pearson Correlation Coefficient:

FOR (i, j) pair : serial

  FOR k (time-series) : parallel

    FOR shared memory

      compute Σ x_k, Σ y_k, Σ x_k y_k, Σ x_k², Σ y_k²

    reduction for results

compute mean(i), mean(j)

compute cov(i,j), cov(i,i), cov(j,j)

compute corr(i,j)
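A minimal sketch (not from the original slides) of the inner parallel step for one pair: each thread accumulates a partial sum over its slice of the time series, then a tree reduction in shared memory combines the partial sums. Only the Σxy sum is shown (the other four sums follow the same pattern) and the block size of 256 is an assumption:

__global__ void partial_sum_xy(const float *x, const float *y, float *block_sums, int n)
{
    __shared__ float s[256];                         // one slot per thread (blockDim.x == 256 assumed)
    int tid = threadIdx.x;
    int k   = blockIdx.x * blockDim.x + tid;
    float acc = 0.0f;
    for (; k < n; k += gridDim.x * blockDim.x)       // grid-stride loop over the time series
        acc += x[k] * y[k];
    s[tid] = acc;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {   // tree reduction in shared memory
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = s[0];     // one partial result per block
}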

Page 67: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

How to Optimize?

Page 68: [05][cuda 및 fermi 최적화 기술] hryu optimization

Focus on Optimization

Page 69: [05][cuda 및 fermi 최적화 기술] hryu optimization

System Optimization Tips before starting CUDA programming

Page 70: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Remote Use

Remote Host Client

WDDM

RDP: RDP driver, RDP protocol

Remote Host Client

WDDM

VNC: VNC driver, VNC protocol

CUDA

enabled

CUDA

disabled

intercept the

CUDA driver

snapshot of real screen

Page 71: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

TCC driver

Remote Host Client

WDDM

RDP: RDP driver, RDP protocol

WDM

• Remote Desktop

• Kernel Launch Overhead

• GPU exclusive Mode

• Windows Service for Session0/1

• large single Malloc

CUDA Application

result

GPU

CUDA

enabled

Page 72: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPU exclusive Mode for multiuser

Server

Remote users

Remote users

Remote users

GPU 0

GPU 1

GPU 2

GPU 3

GPU Pool
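A minimal sketch (not from the original slides) of how an application can inspect the compute mode the administrator has set on each GPU in the pool; it only uses cudaGetDeviceCount / cudaGetDeviceProperties, and the printed policy note is illustrative:

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // In exclusive compute mode only one context may use the GPU at a time,
        // so each remote user effectively gets a dedicated device from the pool.
        printf("GPU %d (%s): computeMode = %d\n", d, prop.name, prop.computeMode);
    }
    return 0;
}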

Page 73: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

GPUDirect for GPU cluster

System Pinned Memory

System Pageable Memory

System Pinned Memory share

MPI Communication
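A minimal sketch (not from the original slides) of the pinned-memory idea behind GPUDirect: the staging buffer is allocated as page-locked host memory so the CUDA driver and the MPI/InfiniBand stack can share it; the buffer size and the MPI call pattern are illustrative:

#include <cuda_runtime.h>
#include <mpi.h>

void send_gpu_result(const double *d_data, int n, int dest)
{
    double *h_pinned = NULL;
    cudaMallocHost((void**)&h_pinned, n * sizeof(double));            // pinned (page-locked) host buffer
    cudaMemcpy(h_pinned, d_data, n * sizeof(double),
               cudaMemcpyDeviceToHost);                               // fast DMA copy from the GPU
    MPI_Send(h_pinned, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);       // MPI sends straight from that buffer
    cudaFreeHost(h_pinned);
}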

Page 74: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

FBO on SDI with Quadro

System

Memory

CPU

GPU

Video

Memory

DVI

DVI

NVIDIA Graphics Card

DVI

Third Party

SDI Input

NVIDIA SDI Output

Memory

Controller

System

Memory

CPU

GPU

Video

Memory

DVI

DVI

NVIDIA Graphics Card

DVI

Third Party

SDI Input

NVIDIA SDI Output

Memory

Controller

Direct write to OpenGL Frame Buffer Object

Write to Host memory and then write to GPU memory

Page 75: [05][cuda 및 fermi 최적화 기술] hryu optimization

Conceptual Tips for

CUDA optimization

Page 76: [05][cuda 및 fermi 최적화 기술] hryu optimization

SIMT architecture

Single Instruction Multiple Threads

Page 77: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Abstract overview of CPU Core

ALU, FPU, LSU

L1

inst. cache

Register Bank

inst. fetch

L1

data cache

Registers: RAX, RBX, RCX, RDX

Memory Bank

CPU

Page 78: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Threads on Multicore CPU

CPU

reg

Core

TH

CPU

reg

Core

TH TH

CPU

reg

Core

TH TH

Winthread, pthread

Hyper-threading

TH TH

General Programming

CPU

Page 79: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Threads on Manycore GPU


MP

reg

SP

MP

reg

SP

TH TH TH TH TH TH TH TH TH TH

TH TH TH TH TH

Block Block Block

H/W Multi Processor vs S/W Active Block

Block Block

Page 80: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Overview of WARP schedule

schedule

<<<0,0>>><<<0,1>>>

<<<0,31>>><<<0,32>>>

threads

Register

per threads

8 SP core

warp

<<<0,33>>><<<0,34>>>

Warp 0

MP
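A minimal sketch (not from the original slides) of how a kernel can recover its warp and lane position from the pre-defined variables, assuming the warp size of 32 threads described above:

__global__ void warp_info(int *warp_out, int *lane_out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;      // position within the warp (0..31)
    int warp = threadIdx.x / 32;      // warp index within the block
    warp_out[tid] = warp;
    lane_out[tid] = lane;
}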

Page 81: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Overview of Instruction Fetch

cores

FP operation

ADD

0 1 2 3

Load A, B

Store C

blockIdx.x=0, threadIdx.x=3

4 5 6 7

0 1 2 3 4 5 6 7

Page 82: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Overview of Instruction Fetch

cores

FP operation

ADD

0 1 2 3

Load A, B

Store C

blockIdx.x=0, threadIdx.x=4

4 5 6 7

0 1 2 3 4 5 6 7

Page 83: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Thread schedule within MP : WARP

1024 * 30 : 30K

WARP

B1

B2

Looks like concurrent threads

MP

Page 84: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Occupancy Calculator on CUDA SDK

Page 85: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

CUDA profiler on CUDA toolkit

Page 86: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Parallel NSight 1.5 Professional

Page 87: [05][cuda 및 fermi 최적화 기술] hryu optimization

Memory Hierarchy

Page 88: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Managing Memory

Multiprocessor

Host

CPU

Chipset

DRAM

Device

DRAM

Local

Memory

Global

Memory

GPU

Multiprocessor

Multiprocessor

Registers

Shared Memory

Bandwidth

Size

L1 Cache

L2 Cache
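A minimal sketch (not from the original slides) of using this hierarchy: each block stages a tile of global memory into on-chip shared memory, works on it, and writes the result back; the tile size of 256 (one element per thread) and the trivial transform are illustrative:

__global__ void tile_scale(const float *in, float *out, int n, float a)
{
    __shared__ float tile[256];                       // shared memory: small, on-chip, high bandwidth
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global memory: large, off-chip device DRAM
    if (i < n) tile[threadIdx.x] = in[i];             // global -> shared
    __syncthreads();
    if (i < n) out[i] = a * tile[threadIdx.x];        // compute in registers, write back to global
}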

Page 89: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Matrix Size for Best CUBLAS3.2 Performance

[Chart: SGEMM performance (Gflops, 0–650) vs. matrix size, multiples of 96, from 96 x 96 up to 8448 x 8448]

[Chart: DGEMM performance (Gflops, 0–350) vs. matrix size, multiples of 64, from 64 x 64 up to 8384 x 8384]

cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz

Page 90: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

cuBLAS level III

[Chart: DGEMM performance (Gflops, 0–350) vs. matrix size, multiples of 64, from 64 x 64 up to 3456 x 3456]

cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz

Page 91: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Roofline Analysis (Arithmetic Intensity)

Samuel Williams, Andrew Waterman, and David Patterson,

Roofline: An Insightful Visual Performance Model for Multicore Architectures

Page 92: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Perf. on CUDA Application

Gflops

Ns

F/B

Bandwidth of PCI-e slot

Bandwidth of Global Memory
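A short worked example (not from the original slides) of the F/B reasoning, using approximate Fermi-class numbers (Tesla C2050: roughly 144 GB/s global-memory bandwidth, about 1 Tflop/s single-precision peak) as assumptions:

Saxpy, y[i] = a*x[i] + y[i], does 2 flops per element against 12 bytes of traffic (read x[i], read y[i], write y[i], 4 bytes each), so F/B ≈ 2/12 ≈ 0.17 flop/byte.

Attainable rate ≈ min(peak flops, F/B × bandwidth) ≈ 0.17 × 144 GB/s ≈ 24 Gflops, far below the compute peak: saxpy sits under the memory roof, so optimization effort belongs to memory traffic (coalescing, reuse), not arithmetic.

SGEMM performs about 2·N³ flops on about 3·N²·4 bytes of data, so F/B grows with N and large matrices can approach the compute roof instead (see the cuBLAS charts above).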

Page 93: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

Tips for Optimization

Consider the algorithm's suitability for parallelism (naïve algorithms often parallelize well)

Consider Occupancy ( SIMT)

Consider Memory Bottleneck

Page 94: [05][cuda 및 fermi 최적화 기술] hryu optimization

NVIDIA Confidential

More Information for CUDA Optimization

CUDA Zone

http://www.nvidia.com/CUDA

Developer Zone

http://developer.nvidia.com

GTC 2010 contents

http://www.nvidia.com/gtc2010-content

CUDA Café (쿠다 카페), the CUDA community in Korea

http://cafe.daum.net/KCUG

Page 95: [05][cuda 및 fermi 최적화 기술] hryu optimization

Thanks

Hyungon Ryu

[email protected]