[05][cuda 및 fermi 최적화 기술] hryu optimization
Transcript of [05][cuda 및 fermi 최적화 기술] hryu optimization
CUDA and Fermi
Optimization Techniques
Hyungon Ryu, PSG SA
NVIDIA Confidential
Agenda
CUDA 101 (General Overview)
Toy of FDM example
Monte Carlo Simulation & Time Series Analysis
General CUDA Optimization Tips
CUDA 101
General Overview
NVIDIA Confidential
CUDA 101
CUDA
Graphics
(GPU)
Parallel
Programming
NVIDIA Confidential
CUDA Parallel programming
GPU Knowledge
Cg
OpenGL
DirectX
Parallel knowledge
Pthreads / winthreads
MMX, SSE
OpenMP
PVM / MPI
Hetereogeneous Knowledge
GPU
Parallel DSP
Parallel ASIC
Parallel FPGA
Cell BE
CUDAParallel Computing with GPU
NVIDIA Confidential
Parallel Model in OpenMP
6
CPU Program
Fork
Parallel
Join
NVIDIA Confidential
CUDA parallel Model
7
CPU Program
Kernel Launch
GPU threads
NVIDIA Confidential
Saxpy Example : CPU serial
for (int i = 0; i < n; ++i) {
y[i] = a*x[i] + y[i];
}
x
y
i
NVIDIA Confidential
Example of Saxpy Parallel : OpenMP
# pragma omp parallel shared (n,a,x,y) private (i)
# pragma omp for
for (int i = 0; i < n; ++i) {
y[i] = a*x[i] + y[i];
}
x
y
i
NVIDIA Confidential
Example of Saxpy Parallel : MPIfor ( i = start ; i < end; i++)
{
y[i] = a * x[i] + y[i];
}
x
y
i
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
void para_range(int lowest, int highest,int nprocs, int myrank,
int *start, int *end) {
int wk1, wk2;
wk1 = (highest - lowest + 1) / nprocs;
wk2 = (highest - lowest + 1) % nprocs;
*start = myrank * wk1 + lowest + ( (rank<wk2) ? myrank : wk2);
*end = *start + wk1 - 1;
if(wk2 > rank) *end = *end + 1;
}
NVIDIA Confidential
Example of Saxpy Parallel : SSE
x
y
i
void saxpy_vector(short *z, short *x, short *y, short a, unsigned n) {
__m128i* x_ptr = (__m128i*) x;
__m128i* y_ptr = (__m128i*) y;
__m128i* z_ptr = (__m128i*) z;
__m128i a_vec = _mm_splat_epi16(a);
int i;
for (i = 0; i < n/8; ++i) {
__m128i x_vec = x_ptr[i];
__m128i y_vec = y_ptr[i];
__m128i z_vec = _mm_add_epi16( _mm_mullo_epi16 x_vec,a_vec),y_vec);
z_ptr[i] = z_vec;
}
}
NVIDIA Confidential
Saxpy Parallel : CUDA
x
y
i
{
x[i] = a * x[i] + t * y[i];
}
Saxpy <<<N ,M >>> (n, 2.0, x, y);
NVIDIA Confidential
CUDA C extensionLaunch the kernel
Function <<< Grid, Block >>> ( parameter);
Additional C standard API for mem control
cudaXXX : cudaMalloc, cudaMemcpy,
cuXXX : cuMalloc, cuMemcpy
cutXXX : cutXXX
For Function
__global__, __device__, __host__, __device__ __host__
For memory
__shared__, __device__, reg/loc
pre-defined variables
blockDim, blockIdx, threadIdx, cudaMemcpyHostToDevice
Pre-defined function
__syncthreads(), __mul24(); etc 13
NVIDIA Confidential
Process of CUDA developing
14
Algorithm
serial Programming
Compile
Debugging
Release
Serial CUDA parallel
Algorithm
serial Programming
Compile
Release
CUDA convert
Debugging
Compile
Parallelize
Optimize/profile
Profile
NVIDIA Confidential
CUDA is Parallel Computing !!!
15
Algorithm
serial Programming
Compile
Debugging
Release
Serial MPI parallel CUDA parallel
Algorithm
serial Programming
Compile
Release
parallel Programming
Debugging [totalview]
Optimize
Compile
Profile
Parallelize
MPIrun
Algorithm
serial Programming
Compile
Release
CUDA convert
Debugging
Compile
Parallelize
Optimize/profile
Profile
NVIDIA Confidential
16
CPU Program
Kernel Launch
Block
thread
SMC
Geometry Controller
SP
DP
SP
SP SP
SP SP
SP SP
I-Cache
MT Issue
C-Cache
SFU SFU
SharedMemory
Texture Unit
Tex L1
SP
DP
SP
SP SP
SP SP
SP SP
I-Cache
MT Issue
C-Cache
SFU SFU
SharedMemory
SP
DP
SP
SP SP
SP SP
SP SP
I-Cache
MT Issue
C-Cache
SFU SFU
SharedMemory
SM
TPC
SP
DP
SP
SP SP
SP SP
SP SP
I-Cache
MT Issue
C-Cache
SFU SFU
SharedMemory
NVIDIA Confidential
Image Processing Diagram without CUDA
For Loop start
For Loop End
CPU Total
time
Camera
capture
NVIDIA Confidential
Image Processing Diagram with CUDA
Camera
For Loop start
For Loop End
CPU
Memcpy
capture
CUDA parallel
Total
time
1D Heat Equation
CUDA Toy
for undergraduate student
NVIDIA Confidential
1D Heat Equation
1D bar
Heat Source
NVIDIA Confidential
1D Heat Equation
Boundary condition
Initial condition
2
2
x
u
t
u
0),1(),0( tutu
0)0,( uxu
1D Heat Equation
NVIDIA Confidential
Discretization
2
,1,,1,1, 2
x
uuu
t
uu ijijijijij
Forward difference with second order
ijijijij ruurruu ,1,,11, )21( Explicit method
2/ xtr
i(time), j(space)
2
2
x
u
t
u
descretization
relation
u[i+1][j] = r*u[i+1][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
NVIDIA Confidential
Discretization (stencil)
j+1,i
j,i i+1
j-1,i
r
(1-2*r)
r
ijijijij ruurruu ,1,,11, )21(
NVIDIA Confidential 0 600Time :
13
12
11
10
9
8
7
6
5
4
3
2
1
0
j+1
j j
j-1r
(1-2*r)
r
Time 0 Update
stencil
Boundary Condition
space
Initial Condition
Conceptual Diagram
u[i+1][j] = r*u[i+1][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
NVIDIA Confidential
Explicit Pseudo Code
Parameter and data Initialization
Stencil, boundary/initial condition,
FOR LOOP (time, i)
FOR LOOP (stencil, j)
Update the stencil relation
Results
u[i+1][j] = r*u[i+1][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
NVIDIA Confidential
CPU code
u[i][j] vs. u[j]
u[i][j] easy to develop
Possible to visualize the process
u[j] efficient to use memory
Get the last result
NVIDIA Confidential
Main algorithm
for( i =0; i < N; i++) {
for( j =0; j < M; j++){
u[i+1][j] = r * u[i+1][j-1]
+ (1-2*r) * u[i ][j]
+ r * u[i ][j+1];
}
}
Time Iteration
Space Iteration
Heat relation
NVIDIA Confidential
Boundary Condition
N
N-1
Method1
N
N-1
N+1
copy copy copy
Method2
copy
compute
Free boundary
Fixed boundary
NVIDIA Confidential
0 600Time :
13
12
11
10
9
8
7
6
5
4
3
2
1
0
j+1
j j
j-1r
(1-2*r)
r
Time 0 Update
Parallelize the space stencil
Not possible to parallelize the time stencil.
stencil
How to Parallelize
Each threads(core) dedicate to each space stencil element
NVIDIA Confidential
How to parallelize
Parameter and data Initialization
Stencil, boundary/initial condition,
FOR LOOP (time, i)
FOR LOOP (stencil, j)
Update the stencil relation
Results
NVIDIA Confidential
Space parallelization
SequenceTime
CommTime 0
Time 1
Time 2
Time 3
NVIDIA Confidential
CPUcode
source
result
CPU
solve
Heat Relation
Boundary Condition
NVIDIA Confidential
CPU codedo{
time += dt; printf("aaaaaa %f\n",time);
for(i=1; i < mesh+1; i++){
temper[i].new_1 = (double) (1-2*r)*temper[i].old + r*(temper[i-1].old + temper[i+1].old);
printf("\n processing \t %d %f %f \n",i, temper[i].new_1, temper[i].old);
}
temper[mesh+1].new_1 = temper[mesh-1].new_1;
printf("\t print results %d %f %f \n",mesh+1, temper[mesh+1].new_1,temper[mesh-1].new_1 );
for(i=1; 1 < mesh+2; i++)
temper[i].old = temper[i].new_1; printf("aa\t\t %d %f %f \n",i, temper[i].new_1,temper[i].new_1 );
if((++count % print_step)==0){
printf("hh \t\t\t %10.5lf", time);
for(i=0; i<mesh; i+=2)
printf("df 8.4lf", temper[i].new_1);
if(!(i%2))
printf("fd %8.4lf\n", temper[mesh].new_1);
}
printf("\n\n time print %f\n", time); getchar();
}while(time < end_time);
if((count % print_step)){
printf("bghj %10.5lf", time);
for(i=0; i<mesh; i+=2)
printf("ahg 8.4lf", temper[i].new_1);
printf("nhgf %8.4lf\n", temper[mesh].new_1);
}
NVIDIA Confidential
GPUcode-01 Memory Map
sourcesource
resultresult
CPUGPU
solve Bypass without solving
Heat Relation
Boundary Condition
NVIDIA Confidential
GPUcode-01 Malloc Template
double* init_GPU_data(struct flow * temper, int mesh)
{
double *u_dev; // for GPU data upload
size_t gpuMemsize=sizeof(double)*(mesh+2) ;
double *u_host; // temperal value
cudaMalloc( (void**)&u_dev, gpuMemsize);cudaErr("malloc u_dev");
u_host = (double *) malloc( gpuMemsize);
for(int i=0;i<mesh;i++){
u_host[i]= temper[i].old;
printf("before %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,
temper[i].old);
}
cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice);cudaErr("memcpy u_dev u_host");
cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev u_host");
for(int i=0;i<mesh;i++){
printf("after %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,
temper[i].old);
}
free(u_host);
return (double *)u_dev;
}
NVIDIA Confidential
GPUcode-01 Launch Templatevoid __global__ functionG(double *u_dev, int meshsize, double r, double bound )
{
int idx = blockIdx.x*blockDim.x+threadIdx.x;
int i = idx+1;
if(idx <meshsize +1 ) {
u_dev[i]= i*0.01;
}
if(idx == 6){
u_dev[idx+1]=u_dev[idx-1]=11;
}
}
void compute_GPU( double * u_dev, double * u_host, double dt, double dx, double r, int mesh,
int print_step, int count, double time, double end_time, double bound )
{
size_t gpuMemsize = sizeof(double)*(mesh+2);
//for( int i=0; i < 6000 ; i++){ //time step
functionG<<<4,5>>>(u_dev, mesh,r, bound);cudaErr2("kernel launch",1,0);
cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev to u_host");
for(int i=0; i<mesh+1; i++){
printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);
}
//}
return;
}
NVIDIA Confidential
GPUcode-02 Solving
sourcesource
resultresult
CPUGPU
solve solve
Heat Relation
Boundary Condition
Heat Relation
Boundary Condition
NVIDIA Confidential
GPUcode-02 Malloc part
double* init_GPU_data(struct flow * temper, int mesh)
{
double *u_dev; // for GPU data upload
size_t gpuMemsize=sizeof(double)*(mesh+2) ;
double *u_host; // temperal value
cudaMalloc( (void**)&u_dev, gpuMemsize);cudaErr("malloc u_dev");
u_host = (double *) malloc( gpuMemsize);
for(int i=0;i<mesh;i++){
u_host[i]= temper[i].old;
printf("before %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,
temper[i].old);
}
cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice);cudaErr("memcpy u_dev u_host");
cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev u_host");
for(int i=0;i<mesh;i++){
printf("after %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i,
temper[i].old);
}
free(u_host);
return (double *)u_dev;
}
NVIDIA Confidential
GPUcode-02 Launch partvoid __global__ functionG(double *u_dev, int meshsize, double r, double bound )
{
int idx = blockIdx.x*blockDim.x+threadIdx.x;
int i = idx+1;
if(idx <meshsize +1 ) {
u_dev[i] = (double) (1-2*r)*u_dev[i] + r*(u_dev[i-1] + u_dev[i+1] );
// u_dev[i]= i*0.01;
}
if(idx == 6){
u_dev[idx+1]=u_dev[idx-1]=11;
}
}
void compute_GPU( double * u_dev, double * u_host, double dt, double dx, double r, int mesh,
int print_step, int count, double time, double end_time, double bound )
{
size_t gpuMemsize = sizeof(double)*(mesh+2);
//for( int i=0; i < 6000 ; i++){ //time step
functionG<<<4,5>>>(u_dev, mesh,r, bound);cudaErr2("kernel launch",1,0);
cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);cudaErr("memcpy u_dev to u_host");
for(int i=0; i<mesh+1; i++){
printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);
}
//}
return;
}
NVIDIA Confidential
How to Optimize?
Monte Carlo Simulation
NVIDIA Confidential
Computation Time
Malliavin MC results
type Black-Sholes 5000 10000 50000 100000 200000 300000
Price 12.8216 12.8706 12.8721 12.8413 12.8525 12.8559 12.8480
Err 0.000% 0.38% 0.39% 0.15% 0.24% 0.27% 0.21%
Delta 0.5858 0.5724 0.5887 0.5826 0.5826 0.5829 0.5824
Err 0 2.28% -0.51% 0.53% 0.53% 0.49% 0.58%
Gamma 0.0130 0.0112 0.0134 0.0127 0.0127 0.0127 0.0127
Err 0.00% 13.39% -3.34% 2.07% 2.26% 2.21% 2.47%
Total Time 0:00 00:03 00:05 00:25 00:50 01:44 02:33
time : 100 sectarget1 : > 1 sec (100X)
Target2 : >0.001 sec (2000X)
NVIDIA Confidential
Excel Sheet
Excel VBA
C/C++ dll Link
CUDA Acceleration
Monte Carlo Simulation for Finance
NVIDIA Confidential
function MC( S As Double, X As Double, T As Double, R As Double, _
Vol As Double, Q As Double, No As Double, rtype As String) As Double
Simul_No = No
dt = 1 'dt = 1 / 365
For K = 0 To Simul_No - 1
Juga = S
For i = 0 To MaxStep - 1
Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() )
Next
price = Exp(-R * T) * Max(Juga - X, 0)
sum_price = sum_price + price
Next
MC = sum_price / Simul_No
End function
Monte Carlo Code with VBA
NVIDIA Confidential
Greek computation for Monte Carlo simulation
Malliavin approach
With Malliavin approach, we can save the computation time.
Malliavin Greeks
0 0[ ( )] [ ( )][ ( )]
E f S S E f SE f S
s S
[ ( )] [ ( ) ( )]E f S E f S W Ss
Malliavin weights
NVIDIA Confidential
Problem
To compare the accuracy, we compute the Price, Delta and Gamma of Vanilla
Call option.
Approach
1. Closed Form solution (VBA,C)
2. Monte (VBA, C)
3. Malliavin (VBA,C, CUDA v1, v2)
NVIDIA Confidential
function Malliavin( S As Double, X As Double, T As Double, R As Double, _
Vol As Double, Q As Double, No As Double, rtype As String) As Double
Simul_No = No
dt = 1 'dt = 1 / 365
For K = 0 To Simul_No - 1
Juga = S
For i = 0 To MaxStep - 1
Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() )
Next
WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol
WT_delta = (WT / (S * Vol * T))
WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol)
price = Exp(-R * T) * Max(Juga - X, 0)
delta = Exp(-R * T) * Max(Juga - X, 0) * WT_delta
gamma = Exp(-R * T) * Max(Juga - X, 0) * WT_gamma
sum_price = sum_price + price
sum_delta = sum_delta + delta
sum_gamma = sum_gamma + Gamma
Next
Malliavin = sum_delta / Simul_No
End function
Malliavin Monte Carlo Code with VBA
NVIDIA Confidential
void Malliavin( double S , double X , double T , double R, double Vol, double Q, long No){
long Simul_No = No;
double dt = 1; // dt = 1 / 365
for ( int K = 0; K< Simul_No - 1 ; K++){
Juga = S ;
for (int i = 0; i< MaxStep - 1 ki++){
Juga = Juga * exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * sqrt(dt) * norm() ); // rand with box muller
}
WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol;
WT_delta = (WT / (S * Vol * T));
WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol);
price = Exp(-R * T) * max(Juga - X, 0);
delta = Exp(-R * T) * max(Juga - X, 0) * WT_delta;
gamma = Exp(-R * T) * max(Juga - X, 0) * WT_gamma;
sum.price = sum.price + price;
sum.delta = sum.delta + delta;
sum.gamma = sum.gamma + Gamma ;
}
r.price = sum.delta / Simul_No
r.delta = sum.delta / Simul_No
r.gamma = sum.delta / Simul_No
return 0;
}
Step1 Malliavin Monte Carlo C language sketch
NVIDIA Confidential
Void __global__ Malliavin_compute( double S , double X , double T , double R, double Vol, double Q, long No){
long Simul_No = No;
double dt = 1; // dt = 1 / 365
for ( int K = 0; K< Simul_No - 1 ; K++){
Juga = S ;
for (int i = 0; i< MaxStep - 1 ki++){
Juga = Juga * exp( (R - Q - Vol *Vol / 2) * dt + Vol * sqrt(dt) * norm(k,i) );
}
WT = (log(Juga) - log(S) - (R - Q - 1 / 2 * Vol *Vol) * T) / Vol;
WT_delta = (WT / (S * Vol * T));
WT_gamma = (1 / (Vol * T * S *S)) * (WT*WT / (Vol * T) - WT - 1 / Vol);
price = Exp(-R * T) * max(Juga - X, 0);
delta = Exp(-R * T) * max(Juga - X, 0) * WT_delta;
gamma = Exp(-R * T) * max(Juga - X, 0) * WT_gamma;
sum.price = sum.price + price;
sum.delta = sum.delta + delta;
sum.gamma = sum.gamma + Gamma ;
}
r.price = sum.delta / Simul_No
r.delta = sum.delta / Simul_No
r.gamma = sum.delta / Simul_No
return 0;
}
Simm_No =
Total Sim /(N threads* M blocks)
Real Price =
Sum (r.price ) / (N*M)
Step2 Sketch (kernel part)
NVIDIA Confidential
(float *) __device__ normal(int k, int, j, int size_j, float * normal) {
int index = k*size_j + j;
return &normal[index];
}
uniform
normal
stock
option
normal()
RNG_compute2()
RNG_compute1()
Malliavin_compute()
Parallel
cores
data in global memory
Step2 Parallel Memory Map
NVIDIA Confidential
#include <stdio.h>
__global__ RNG_compute(parameter);
__global__ Malliavin_compute(parameter);
main(){
malloc(); //cpu malloc
cudaMalloc(); //GPU malloc
cudaMemcpy(); // transfer
RNG_compute 1<<<N,M>>> (parameter); // generate RNG (uniform)
RNG_compute2 <<<N,M>>> (parameter); // generate RNG ( BM,Moro)
Malliavin_compute <<<N,M>>> (parameter); // simulation
cudaMemcpy(); //get results
return 0;
}
Step2 Sketch (host part)
NVIDIA Confidential
__global__
static void RNG_rand48_get_int(uint2 *state, int *res, int num_blocks, uint2 A, uint2 C)
{
const int nThreads = blockDim.x*gridDim.x;
int nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
uint2 lstate = state[nOutIdx];
int i;
for (i = 0; i < num_blocks; ++i) {
res[nOutIdx] = ( lstate.x >> 17 ) | ( lstate.y << 7);
nOutIdx += nThreads;
lstate = RNG_rand48_iterate_single(lstate, A, C);
}
nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
state[nOutIdx] = lstate;
}
Step2 Malliavin Monte Carlo CUDA language sketch (rng part1)
NVIDIA Confidential
Void __global__ RNG_compute( int * uniform , float * normal, int length){
int index = blockDim.x*blockIdx.x+threadIdx.x;
__shared__ int s[i];
__shared__ float s_r[i];
if( threadIdx.x ==0){
for (int i = 0 ; i<blockDim.x ; i++){
s[i]=uniform[blockDim.x*blockIdx.x + i]; // load uniform
}
}
s_r[threadIdx.x] = (float) moro(s[threadIdx.x]); // moro inversion with parallel
if( threadIdx.x ==0){
for (int i = 0 ; i<blockDim.x ; i++){
s[blockDim.x*blockIdx.x+i]= s_r[i] ]; // save normal
}
}
}
Step2 Malliavin Monte Carlo CUDA language sketch (rng part2)
NVIDIA Confidential
__device__ void BoxMuller(float& u1, float& u2){
float r = sqrtf(-2.0f * logf(u1));
float phi = 2 * PI * u2;
u1 = r * __cosf(phi);
u2 = r * __sinf(phi);
}
Box-Muller vs Moro Inversion
__device__ Moro( float u ){
// skip the const value
x = u - 0.5;
if (abs(x) < 0.42) {
r = x*x;
r = x * (((a4 * r + a3) * r + a2) * r + a1) / ((((b4 * r + b3) * r + b2) * r + b1) * r + 1);
} else{
if (x > 0) r = log(-log(1 - u));
if (x <= 0) r = log(-log(u));
r = c1 + r * (c2 + r * (c3 + r * (c4 + r * (c5 + r * (c6 + r * (c7 + r * (c8 + r * c9)))))));
if (x <= 0) r = -r;
}f
return r;
}
NVIDIA Confidential
Step3 New approach for Direct LCG and Parallel Memory Map
stock
option
rand_new(seed)Malliavin_compute()
Parallel
cores
uniform
normal
double rand_new( int next){
next = (a * next + b) % M;
return (double) next * (1/M);
}
moro(u)
data in registers
NVIDIA Confidential
Excel Sheet
Excel VBA
C/C++ dll Link
Platform for finance
Socket communication
GPU cluster
Grid Manager
8 GPU per node
12 nodes system (1/2 for backup)12*8 GPU*448 core
= 43,000 core
NVIDIA Confidential
How to Optimize?
Time Series Analysis
NVIDIA Confidential
Multivariate Time-series
S(2)
S(1)
S(3)
S(n)
m
Corr(1,2)
Corr(1,3)
Corr(1,n)
Corr(2,3)
NVIDIA Confidential
How to parallelize with CUDA
S(2)
S(1)
S(3)
S(n)
serialize
Unefficient approach for Shared Memory Usage
NVIDIA Confidential
How to parallelize with CUDA
S(2)
S(1)
S(3)
S(n)
seria
lize
efficient approach for Shared Memory Usage
Need to reduction technique for mean etc.
NVIDIA Confidential
How to parallelize with CUDA
S(2)
S(1)
S(3)
S(n)
seria
lize
Parallelize
With
multiNode
multiGPU
Good method for multiGPU & large Cluster
NVIDIA Confidential
In pair(I,J), Pearson Correlation Coefficient
2 2
( )( )
( ) ( )
i i
i i
x x y y
x x y y
( )( )i i i i i ix x y y x y x y y x xy
i ix y nxy 2 2 2( )i ix x x nx 2 2 2( )i iy y y ny
2 2 2 2( )( )
i i
i i
x y nxy
x nx y ny
We can parallelize the Sumation !!
After sumation, find the mean.
NVIDIA Confidential
How to parallelize with CUDA : Flow Chart
Input A, B
Find the mean of A, B
start to sum (Ai), (Bi)
Find Cov(A,B), Cov(A,A), Cov(B,B)
start to sum (Ai,Bi), (Ai^2), (Bi^2) with mean
Input A, B
Find the mean of A, B
Start to sum (Ai), (Bi), (Ai,Bi), (Ai^2), (Bi^2)
Method 1 Method 2
Find Cov(A,B), Cov(A,A), Cov(B,B)
Benefit : Easy to implementation
with two function
Benefit : oneshot sum (speed-up)
R+CUDA projecthttp://brainarray.mbni.med.umich.edu/Brainarray/rgpgpu/
NVIDIA Confidential
In pair(i,j), Pearson Correlation Coefficient
i ix y2
ix
FOR (i, j) – pair : serial
FOR k (time-series) : parallel
compute
reduction for results
compute mean(i), mean(j),
compute cov(i,j), cov(i,i) ,cov(j,j)
compute corr(i,j)
ix iy2
iy
NVIDIA Confidential
In pair(i,j), Pearson Correlation Coefficient
i ix y2
ix
FOR (i, j) – pair : serial
FOR k (time-series) : parallel
FOR shared memory
compute
reduction for results
compute mean(i), mean(j),
compute cov(i,j), cov(i,i) ,cov(j,j)
compute corr(i,j)
ix iy2
iy
NVIDIA Confidential
How to Optimize?
Focus on Optimization
System Optimization Tips before
start CUDA programming
NVIDIA Confidential
Remote Use
Remote Host Client
WDDM
RDPRDP driver RDP protocol
Remote Host Client
WDDM
VNCVNCVNC protocol
CUDA
enabled
CUDA
disabled
intercept the
CUDA driver
snapshot of
real ccreen
NVIDIA Confidential
TCC driver
Remote Host Client
WDDM
RDPRDP driver RDP protocol
WDM
• Remote Desktop
• Kernel Launch Over Head
• GPU exclusive Mode
• Windows Service for Session0/1
• large single Malloc
CUDA Application
resultGPU
CUDA
enabled
NVIDIA Confidential
GPU exclusive Mode for multiuser
ServerRemote users
Remote users
Remote usersGPU 0
GPU 1
GPU 2
GPU 3
GPU Pool
NVIDIA Confidential
GPUDirect for GPU cluster
System Pinned Memory
System Pageable Memory
System Pinned Memory share
MPI Communication
NVIDIA Confidential
FBO on SDI with Quadro
System
Memory
CPU
GPU
Video
Memory
DVI
DVI
NVIDIA Graphics Card
DVI
Third Party
SDI Input
NVIDIA SDI Output
Memory
Controller
System
Memory
CPU
GPU
Video
Memory
DVI
DVI
NVIDIA Graphics Card
DVI
Third Party
SDI Input
NVIDIA SDI Output
Memory
Controller
Direct write to OpenGL Frame Buffer ObjectWrite to Host memory and to write GPU memory
Conceptual Tips for
CUDA optimization
SIMT architecture
Single Instruction Multiple Threads
NVIDIA Confidential
Abstract overview of CPU Core
ALUFPU LSU
L1
inst. cache
Register Bank
inst. fetch
L1
data cache
RA
X
RB
X
RC
X
RD
X
Memory BankCPU
NVIDIA Confidential
Threads on Multicore CPU
CPU
reg
Core
TH
CPU
reg
Core
TH TH
CPU
reg
Core
TH TH
Winthread, pthread
Hyper-threading
TH TH
General Programming
CPU
NVIDIA Confidential
Threads on Manicore GPU
79
MP
reg
SP
MP
reg
SP
TH TH TH TH TH TH TH TH TH TH
TH TH TH TH TH
Block Block Block
H/W Multi Processor vs S/W Active Block
Block Block
NVIDIA Confidential
Overview of WARP schedule
schedule
<<<0,0>>><<<0,1>>>
<<<0,31>>><<<0,32>>>
threads
Register
per threads
8 SP core
warp
<<<0,33>>><<<0,34>>>
Warp 0
MP
NVIDIA Confidential
Overview of Instruction Fetch
coresFP operation
ADD
0 1 2 3
Load A, B
Store C
blockIdx.x=0, threadIdx.x=3
4 5 6 7
0 1 2 3 4 5 6 7
NVIDIA Confidential
Overview of Instruction Fetch
coresFP operation
ADD
0 1 2 3
Load A, B
Store C
blockIdx.x=0, threadIdx.x=4
4 5 6 7
0 1 2 3 4 5 6 7
NVIDIA Confidential
Thread schedule within MP : WARP
1024 * 30 : 30K
WARP
B1
B2
Look like
Cocurrent
Threads
MP
NVIDIA Confidential
Occupancy Calculator on CUDA SDK
NVIDIA Confidential
CUDA profiler on CUDA toolkit
NVIDIA Confidential
Parallel NSight 1.5 Professional
Memory Hierarchy
NVIDIA Confidential
Managing Memory
Multiprocessor
Host
CPU
ChipsetDRAM
Device
DRAM
Local
Memory
Global
Memory
GPU
Multiprocessor
MultiprocessorRegisters
Shared Memory
Bandwidth
Size
L1 CacheL2 Cache
NVIDIA Confidential
Matrix Size for Best CUBLAS3.2 Performance
050
100150200250300350400450500550600650
96 x
96
384 x
384
672 x
672
960 x
960
1248 x
1248
1536 x
1536
1824 x
1824
2112 x
2112
2400 x
2400
2688 x
2688
2976 x
2976
3264 x
3264
3552 x
3552
3840 x
3840
4128 x
4128
4416 x
4416
4704 x
4704
4992 x
4992
5280 x
5280
5568 x
5568
5856 x
5856
6144 x
6144
6432 x
6432
6720 x
6720
7008 x
7008
7296 x
7296
7584 x
7584
7872 x
7872
8160 x
8160
8448 x
8448
Gflops SGEMM: Multiples of 96
0
50
100
150
200
250
300
350
64
x 6
4
38
4 x
38
4
70
4 x
70
4
10
24
x 1
02
4
13
44
x 1
34
4
16
64
x 1
66
4
19
84
x 1
98
4
23
04
x 2
30
4
26
24
x 2
62
4
29
44
x 2
94
4
32
64
x 3
26
4
3584 x
3584
39
04
x 3
90
4
42
24
x 4
22
4
45
44
x 4
54
4
48
64
x 4
86
4
51
84
x 5
18
4
55
04
x 5
50
4
58
24
x 5
82
4
61
44
x 6
14
4
64
64
x 6
46
4
6784 x
6784
71
04
x 7
10
4
74
24
x 7
42
4
77
44
x 7
74
4
80
64
x 8
06
4
83
84
x 8
38
4
Gflops DGEMM: Multiples of 64
cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi)MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
NVIDIA Confidential
cuBLAS level III
0
50
100
150
200
250
300
350
64
x 6
4
12
8 x
12
8
19
2 x
19
2
25
6 x
25
6
32
0 x
32
0
38
4 x
38
4
44
8 x
44
8
51
2 x
51
2
57
6 x
57
6
64
0 x
64
0
70
4 x
70
4
76
8 x
76
8
83
2 x
83
2
89
6 x
89
6
96
0 x
96
0
10
24
x 1
02
4
10
88
x 1
08
8
11
52
x 1
15
2
1216 x
1216
12
80
x 1
28
0
13
44
x 1
34
4
1408 x
1408
14
72
x 1
47
2
15
36
x 1
53
6
1600 x
1600
16
64
x 1
66
4
17
28
x 1
72
8
1792 x
1792
18
56
x 1
85
6
19
20
x 1
92
0
1984 x
1984
20
48
x 2
04
8
21
12
x 2
11
2
2176 x
2176
22
40
x 2
24
0
23
04
x 2
30
4
23
68
x 2
36
8
24
32
x 2
43
2
24
96
x 2
49
6
25
60
x 2
56
0
26
24
x 2
62
4
26
88
x 2
68
8
27
52
x 2
75
2
28
16
x 2
81
6
28
80
x 2
88
0
29
44
x 2
94
4
30
08
x 3
00
8
30
72
x 3
07
2
31
36
x 3
13
6
32
00
x 3
20
0
32
64
x 3
26
4
33
28
x 3
32
8
33
92
x 3
39
2
34
56
x 3
45
6
GflopsDGEMM: Multiples of 64
cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi)MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
NVIDIA Confidential
Roofline Analysis (Arithmetic Intensity)
Samuel Williams, Andrew Waterman, and Dav id Patterson,
Roofline: An Insightful Visual Performance Model for Multicore Architectures
NVIDIA Confidential
Perf. on CUDA Application
Gflops
Ns
F/B
Bandwidth of PCI-e slot
Bandwidth of Global Memory
NVIDIA Confidential
Tips for Optimization
Consider Algorithm for parallel (naïve algorithms will be good)
Consider Occupancy ( SIMT)
Consider Memory Bottleneck
NVIDIA Confidential
More Information for CUDA Optimization
CUDA Zone
http://www.nvidia.com/CUDA
Developer Zone
http://developer.nvidia.com
GTC 2010 contents
http://www.nvidia.com/gtc2010-content
쿠다 카페 (CUDA café in Korea)
http://cafe.daum.net/KCUG