Terry Spitz
Citi
14 May 2013
Citi | Markets Quantitative Analysis
Optimising Risk Management
Computational Methods and Technologies for Finance
Agenda
1. Increasing Compute Requirements
2. High Performance Hardware
3. Software Optimisation Technologies
4. Writing efficient parallel code
5. Case Study: Optimising Pricing Models
6. Conclusions
Increasing Compute Requirements
Pre-2008
• More trades
• More complex payoffs, more observation dates
• More complex models, more factors, more complex calibration
Post-2008
• More competitive marketplace demands reduced cost per trade
• More stability/accuracy from more steps/trials
• More market data scenarios
• More regulatory testing, for example hedging simulations
In the twilight of Moore's Law, the transitions to multicore processors, GPU computing, and cloud computing are not separate trends, but aspects of a single trend – mainstream computers from desktops to smartphones are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-intensive application will need to harness different kinds of cores, in immense numbers, to get its job done.

The free lunch is over. Now welcome to the hardware jungle.

Herb Sutter (2012)
High Performance Hardware
• NVidia
– GeForce/Quadro (1999-)
– Tesla (2007-)
– Fermi (2010-)
– Kepler (2012-)
• Intel
– Xeon (2004-)
– Larrabee (2006-2009)
– MIC Architecture: Knight’s Ferry / Intel Xeon Phi (2012-)
• Sony Cell (2005-2009)
• FPGA vendors
Hardware cost
Device                 | Intel Xeon on Grid | NVidia Tesla M2090 | Intel Xeon Phi
Cost per year          | $6300/server       | $2000/card         | est. $2000/card
Cores                  | 12x 3GHz           | 512x 1.3GHz        | 62x 1.05GHz
Cost per core per year | $525/core          | $4/core            | est. $31/core
Speed                  | 300 GFlops         | 665 GFlops         | est. 1290 GFlops
Cost per GFlop         | $21/GFlop          | $3/GFlop           | est. $1.6/GFlop
Memory                 | 16GB               | 6GB                | 8GB
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%.

Donald Knuth (1974)
Software Optimisation Technologies
• Optimise serial code
  – Profile and optimise algorithms & code
  – Better compiler: e.g. Intel CC
• Parallelise on CPU
  – SSE/AVX
  – Multicore
  – OpenMP
  – MPI
  – Grid
• Port to GPU
  – CUDA
  – OpenCL
  – C++ AMP
  – Data parallel: Thrust, Microsoft Accelerator
Software Parallelisation APIs
Raw OS   | fork(); / pthread_create(…); / CreateThread(…);
OpenMP   | #pragma omp parallel for
         | for(int i = 0; i < N; i++) { …
TBB/PPL  | parallel_for(0, size, [&](int i) { …
Grid     | Session::sendTaskInput(Message* taskInput);
CUDA     | __global__ void VecAdd(float* A, float* B) {…
         | VecAdd<<<blocksPerGrid, threadsPerBlock>>>(…);
OpenCL   | kernel = clCreateKernel(program, …);
         | clEnqueueNDRangeKernel(queue, kernel, …);
C++ AMP  | parallel_for_each(array.extent, [=](index<2> i) restrict(amp) { …
Thrust   | thrust::transform(rng.begin(), rng.end(), payoffs.begin(), compute_payoff(…));
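For concreteness, here is a complete version of the VecAdd fragment from the table; the bounds check, the output vector C, and the 256-thread block size are illustrative additions, not part of the original snippet:

// CUDA: each thread adds one element; the grid covers all N elements.
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)                    // guard the tail when N % blockDim.x != 0
        C[i] = A[i] + B[i];
}

// Launch (host side):
//   int threadsPerBlock = 256;
//   int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
//   VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);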
Data-parallel Algorithms
Correlated Paths for Monte Carlo http://www.nvidia.com/content/GTC-2010/pdfs/2064_GTC2010.pdf
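The core idea behind the linked talk: correlating a block of independent normals Z with a Cholesky factor L (correlated = L · Z) is just a matrix multiply, which maps naturally onto the GPU. A hedged sketch using cuBLAS (function and buffer names are illustrative, not from the deck):

#include <cublas_v2.h>

// d_chol: nAssets x nAssets lower-triangular L (column-major, on device)
// d_z:    nAssets x nSamples independent normals
// d_corr: nAssets x nSamples correlated output, d_corr = L * Z
void correlate(cublasHandle_t handle, int nAssets, int nSamples,
               const double* d_chol, const double* d_z, double* d_corr)
{
    const double one = 1.0, zero = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                nAssets, nSamples, nAssets,
                &one, d_chol, nAssets, d_z, nAssets,
                &zero, d_corr, nAssets);
}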
Writing efficient GPU code
• Balance memory versus compute bottlenecks
• You need to run > 10,000 threads to keep the GPU busy
• SIMD (Single Instruction, Multiple Data) with few divergent branches/loops (see the branchless payoff sketch after this list)
• Find the parallelism, for example:
– Multiple contracts
– Outer loops, e.g. MC Paths
– Third-party functions, e.g. matrix operations, RNG, parallel_reduce
• Extreme optimisation
– coherent memory access, shared memory, block/tile size, synchronisation, atomics, fast_math, async/overlapped operations, …
• Limitations
– No exceptions, virtual functions, STL/vectors, new/delete, debugging, recursive functions, IEEE compliant maths, unsupported types, complex objects, …
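As an example of avoiding divergent branches (illustrative, not from the deck): a vanilla payoff written with fmax keeps every thread in a warp on the same instruction path, where an if/else would split the warp.

__device__ double callPayoff(double spot, double strike)
{
    return fmax(spot - strike, 0.0);   // branchless: no warp divergence
}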
Porting existing applications
https://developer.nvidia.com/content/assess-parallelize-optimize-deploy
• Assess
  – Identify the hotspots by profiling with one or more realistic data sets
  – Estimate performance improvements, considering strong and weak scaling
• Parallelise
– GPU-accelerated libraries
– OpenACC directives (see the sketch after this list)
– GPU programming languages
• Optimise
• Deploy
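A sketch of the OpenACC-directive option above (the function is illustrative, not from the deck): a single pragma asks the compiler to generate the GPU kernel and the data movement for an independent loop.

// OpenACC: offload a data-parallel loop with one directive
void scalePayoffs(int n, double df, double* pv)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        pv[i] *= df;    // iterations are independent, so they parallelise
}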
The key with GPUs is to ask why you want to use them – are you looking to do something a lot cheaper, or a lot faster? Or are you looking to do something that you cannot feasibly do today?

The assess phase pretty much has to have whatever metric is applicable (time to solution, # solutions/second per watt, solution in under 5 minutes – whatever) firmly in mind.

John Ashley - NVidia
Case Study: Migrating Pricing Models to CUDA
Binomial Tree
BinomialTree::calculateFairValue(…)
    treeResult = tree.calculate();
        applyTerminalCondition();
        for(int i = startBackwardAt; i >= 0; --i)
        {
            _steps[i]->update(*this, spots, fairs);   // ← kernel
            applyExerciseDecision(…);
        }
Binomial Tree Parallel
BinomialTree::calculateFairValue(…)
    treeResult = tree.calculate();
        applyTerminalCondition();
        for(int i = startBackwardAt; i >= 0; --i)
        {
            _steps[i]->update(*this, spotsVector, fairsVector);
            applyExerciseDecision(…);
        }

host:
    device_vector<double> d_spots(numContracts * numSteps);
    device_vector<double> d_fairs(numContracts * numSteps);
    applyTerminalConditions(d_spots, d_fairs);
    binomialTreePricer<<<numContracts>>>(…);
    for(int nContract = 0; …)
        fairValue[nContract] = h_fairs[nContract*numSteps + offset];
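A minimal sketch of what the binomialTreePricer kernel could look like under this layout: one contract per thread (launched, say, as <<<numContracts, 1>>>), rolling each contract's slice of d_fairs back from the terminal layer. The parameters p and disc and the omission of the exercise decision are simplifying assumptions, not the production model.

__global__ void binomialTreePricer(double* d_fairs, int numSteps,
                                   double p, double disc, int numContracts)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;  // one contract per thread
    if (c >= numContracts) return;
    double* fairs = d_fairs + c * numSteps;         // this contract's tree slice
    // roll back from the terminal layer set by applyTerminalConditions
    for (int i = numSteps - 2; i >= 0; --i)
        for (int j = 0; j <= i; ++j)   // ascending j keeps the in-place update valid
            fairs[j] = disc * (p * fairs[j + 1] + (1.0 - p) * fairs[j]);
}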
CUDA Approach
CPU:
1. Prepare data
2. Loop over 'kernel'
3. Summarise results

CUDA:
1. Prepare data
2. Allocate device memory
3. Copy data to device
4. Invoke parallel kernel
5. Copy data from device
6. Summarise results
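In code, the six CUDA-side steps form this generic skeleton (h_in, h_out, and myKernel are placeholders, not names from the deck; myKernel is an assumed __global__ function):

void runOnGpu(const double* h_in, double* h_out, int n)
{
    double *d_in, *d_out;                                // 1. data prepared on host (h_in)
    cudaMalloc(&d_in,  n * sizeof(double));              // 2. allocate device memory
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemcpy(d_in, h_in, n * sizeof(double),
               cudaMemcpyHostToDevice);                  // 3. copy data to device
    myKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);  // 4. invoke parallel kernel
    cudaMemcpy(h_out, d_out, n * sizeof(double),
               cudaMemcpyDeviceToHost);                  // 5. copy data from device
    cudaFree(d_in); cudaFree(d_out);                     // 6. summarise results on host
}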
VarianceSwap Pricer
VarianceSwap::calculateFairValue(…)
  VarianceSwapReplication::calculateFairValue(…)
    calcFutureVariance(…)
      integrandStart = createLogContractIntegrand(…)
      integrator.integrate(…)
        for (i = 0; i < degreeDiv2; ++i)
          calculateEuropean(…)
            engine.calculate(…)
              European::calculateFairValue(…)
                call = new Call(…)
                integration.integrate(…)
                  for (i = 0; i < degree; ++i)
                    double payoff = blackPremium(…)
VarianceSwap Pricer Parallel
VarianceSwap::calculateFairValue(…)
  VarianceSwapReplication::calculateFairValue(…)
    calcFutureVariance(…)
      integrandStart = createLogContractIntegrand(…)
      integrator.integrate(…)
        for (i = 0; i < degreeDiv2; ++i)
          calculateEuropean(…)
            engine.calculate(…)
              European::calculateFairValue(…)
                call = new Call(…)
                integration.integrate(…)
                  for (i = 0; i < degree; ++i)          // ← kernel
                    double payoff = blackPremium(…)

host:
    device_vector<double> bsPayoffs(…);
    device_vector<double> inputs(…);
    calculateBS<<<varSwapDegree, europeanDegree>>>(…);
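A sketch of what the calculateBS kernel might compute, given the <<<varSwapDegree, europeanDegree>>> launch: one block per variance-swap quadrature point, one thread per European quadrature point. The flattened-array layout and the Black call formula here are assumptions:

__global__ void calculateBS(const double* strikes, const double* vols,
                            double fwd, double df, double T, double* bsPayoffs)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one (outer, inner) pair per thread
    double k  = strikes[idx];
    double v  = vols[idx] * sqrt(T);                  // total volatility to expiry
    double d1 = (log(fwd / k) + 0.5 * v * v) / v;
    double d2 = d1 - v;
    bsPayoffs[idx] = df * (fwd * normcdf(d1) - k * normcdf(d2));  // Black call premium
}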
MonteCarlo Pricer
EuropeanMCPricer::calculateFairValue(…) {
    MCPathsGeneratorUtils::generatePaths(…)
        pathGen->generateIndependentNormals(…)
            _RNGFactory->createNewRNG(…)
            RNG->getIndependentNormals(…)
                getIndependentUniforms(variates);
                convertUniformstoNormals(variates);
        pathGen->generateCorrelatedNormals(…)
        pathGen->generatePaths(…)
            _model->diffuse(…)
    for( size_t iTrial = 0; iTrial < trials; ++iTrial) {
        contract->calcContractualFlows( mcFixingSchedule )
        for( size_t i = 0; i < nFlows; ++i) {
            ImplementorUtils::getDefaultImplementor(…)
            calcEngine.calculate( flowImpl, fvRequest, results )
            fairValue += results->getFairValue();
        }
        priceCollector(fairValue);
    }
    result->setFairValue(acc::mean(priceCollector));
    result->set(FairValueStdDev, sqrt(variance(priceCollector)/trials));
}
MonteCarlo Pricer Parallel
EuropeanMCPricer::calculateFairValue(…) {
    MCPathsGeneratorUtils::generatePaths(…)
        pathGen->generateIndependentNormals(…)
            _RNGFactory->createNewRNG(…)
            RNG->getIndependentNormals(…)
                getIndependentUniforms(variates);
                convertUniformstoNormals(variates);
        pathGen->generateCorrelatedNormals(…)
        pathGen->generatePaths(…)
            _model->diffuse(…)
    for( size_t iTrial = 0; iTrial < trials; ++iTrial) {
        contract->calcContractualFlows( mcFixingSchedule )
        for( size_t i = 0; i < nFlows; ++i) {
            ImplementorUtils::getDefaultImplementor(…)
            calcEngine.calculate( flowImpl, fvRequest, results )
            fairValue += results->getFairValue();
        }
        priceCollector(fairValue);
    }
    result->setFairValue(acc::mean(priceCollector));
    result->set(FairValueStdDev, sqrt(variance(priceCollector)/trials));
}
host:
    curandCreateGenerator(…);
    curandGenerateNormalDouble(…);
    thrust::device_vector<double> contractFlows(nFlows * trials);
    doTrialKernel<<<nTrials>>>(…);

kernel:
    pathGen->generateCorrelatedNormal(nTrial, …);
    pathGen->generatePath(nTrial, …);
        _model->diffuse(…);
    contract->calcContractualFlows(nTrial, …);

host:
    thrust::transform(contractFlows, flowPVs, PVs);
    double fairValue = thrust::reduce(PVs) / trials;
    double variance = thrust::inner_product(PVs, PVs, 0.0) / trials
                      - fairValue * fairValue;
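The reduction step above is self-contained enough to run on its own; note the division by trials, which turns the inner product into E[PV²] before subtracting the squared mean (the toy PVs are illustrative):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/inner_product.h>
#include <cstdio>
#include <cmath>

int main()
{
    const int trials = 4;
    thrust::device_vector<double> PVs(trials);
    PVs[0] = 1.0; PVs[1] = 2.0; PVs[2] = 3.0; PVs[3] = 4.0;

    double fairValue = thrust::reduce(PVs.begin(), PVs.end(), 0.0) / trials;
    double variance  = thrust::inner_product(PVs.begin(), PVs.end(),
                                             PVs.begin(), 0.0) / trials
                       - fairValue * fairValue;         // E[X^2] - E[X]^2
    printf("fair value %f, std dev %f\n", fairValue, sqrt(variance / trials));
    return 0;
}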
Copying Object Graphs
C++ Data Marshalling Best Practices - Cliff Woolley, NVIDIA http://on-demand.gputechconf.com/gtc/2012/presentations/S0377-GTC2012-Data-Marshalling-Practices.pdf
• We need data marshalling (serialization) to GPU
– We are moving data from one physical address space to another
– Virtual function tables must be updated
– Possible differences in structure layout
– Want bus transfers to be as efficient as possible
– Want parallel-friendly data organization to benefit the GPU
• C-style struct: cudaMemcpy and you’re done (even for an array)
– Except for fixing alignment
• Trickier cases
  – Virtual functions (device vtable ≠ host vtable)
    • Split off any class containing virtual functions (see the sketch after this list) into:
      – a base class containing only Plain Old Data members
      – a derived class containing the virtual functions
  – Bitfields
  – AoS vs. SoA
  – STL
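A sketch of the POD/virtual split described above (class names and members are illustrative): the base class can be cudaMemcpy'd byte-for-byte, and only the derived class carries a vtable that needs fixing on the device.

struct PathParamsData {          // Plain Old Data: safe to copy byte-wise
    bool   antithetic;
    bool   brownianBridge;
    size_t size;
};

class PathParams : public PathParamsData {  // virtual functions live here
public:
    virtual ~PathParams() {}
    virtual double transform(double u) const { return u; }
};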
Copying Object Graphs - Example
class PathGenerator : public Base {
public:
    __host__ __device__
    shared_ptr<MCPaths> generateCorrelatedNormals(
        shared_ptr<MCPaths> uniformRandomNumbers) const;
private:
    shared_ptr<PathParams> _params;
    ublas::matrix<double>  _cholesky;
};
[Diagram: host object graph — PathGenerator (vtbl, _params, _cholesky) pointing to PathParams (vtbl, _antithetic, _brownianBridge) and to ublas::matrix (_size1, _size2, _data, …)]
Copying Object Graphs - DeviceWriter
class PathGenerator : public Base {
public:
    void prepareDeviceMemory(DeviceWriter& writer) {
        writer.writeObject(this);
        writer.writeObject(_params);
        writer.writeMatrix(&_cholesky);
    }
};
[Diagram: DeviceWriter packs the original PathGenerator → PathParams / ublas::matrix object graph into one contiguous buffer (vtbl, _params, _cholesky, vtbl, _antithetic, _brownianBridge, _size2, _data, _size1, …), mirrored on host and device]
Calling member function on device
file: PathGenerator.cu

#include "cuda_shared_ptr.hpp"
#include "cuda_vector.h"
#include "cuda_matrix.hpp"
#include "MCPaths.h"

PathGenerator::generateCorrelatedNormals(shared_ptr<MCPaths> uniformRandomNumbers)
{
    shared_ptr<thrust::device_vector<double>> correlatedNormals(
        new thrust::device_vector<double>(uniformRandomNumbers->size()));
    DeviceWriter writer;
    prepareDeviceMemory(writer);
    writer.writeObject(uniformRandomNumbers);
    writer.copyToDevice();
    generateCorrelatedNormalKernel<<<1, uniformRandomNumbers->getTrials()>>>(
        writer.getDevicePtr(this),
        writer.getDevicePtr(uniformRandomNumbers),
        thrust::raw_pointer_cast(correlatedNormals->data()));
    return correlatedNormals;
}
file: cuda_shared_ptr.hpp

template<class T>
class shared_ptr {
    __host__ __device__ T* operator->() const { … }
};
Calling virtual function on device
file: PathGenerator.cu

#include "cuda_shared_ptr.hpp"
#include "cuda_vector.h"
#include "cuda_matrix.hpp"
#include "MCPaths.h"

PathGenerator::generatePaths(shared_ptr<MCPaths> correlatedNormals)
{
    shared_ptr<thrust::device_vector<double>> paths(
        new thrust::device_vector<double>(correlatedNormals->size()));
    DeviceWriter writer;
    prepareDeviceMemory(writer);
    writer.writeObject(correlatedNormals);
    writer.addHostvtbl("BSModel", getvtbl(_bsModel));
    writer.addDevicevtbl("BSModel", getDevicevtbl<BSModel>());
    writer.copyToDevice();
    generatePathKernel<<<1, correlatedNormals->getTrials()>>>(…);
    // inside generatePathKernel, the patched vtable lets
    //     _model->diffuse(…)
    // dispatch virtually on the device
}
file: Model.h

class ModelBase {
    __host__ __device__ virtual void diffuse(…) { … }
};
__host__ __device__?

file: PathGenerator.2.h
#ifndef __CUDACC__
shared_ptr<MCPaths> PathGenerator::generateCorrelatedNormals(…) const
{
shared_ptr<MCPaths> dZs_ptr(new MCPaths(…));
MCPaths& dZs = *dZs_ptr;
#else
__device__ void DevicePathGenerator::generateCorrelatedNormal(const int iTrial, …) const
{
MCPathsRaw& dZs = *pdZs;
#endif
const size_t nStateVariables = dZs.getIndices();
const size_t nTimeSteps = dZs.getTimeSteps();
#ifndef __CUDACC__
for( size_t iTrial =0; iTrial< randomNumbers.getTrials(); ++iTrial)
{
#endif
for(size_t index =0; index< nStateVariables; ++index)
{
for(size_t jTime=0; jTime< nTimeSteps; ++jTime)
{
…
dZs(2*iTrial+1,index,jTime) = crng;
Case Study: Conclusions
1. Understand the balance between performance needs, hardware, development complexity/costs, risk
2. To CUDA-enable existing libraries:
– Choose appropriate granularity for parallelisation, review code which will need porting to CUDA
– Simple objects are unchanged
– Move object allocation (new/vector resizing) out of lower level functions
– Add *.cu files
• #include CUDA-enabled STL and Boost headers
• write kernels to dispatch to __device__ member functions
• write host code to prepare/copy objects and call kernel
– Modify/split existing headers
• Use #ifdef __CUDACC__ to wrap device functions/subclasses
• CUDAHOSTDEVICE prefix for shared header functions (see the macro sketch after this list)
– Project files
• Add a build configuration to disable CUDA
3. Issues
– Persisting objects on device between calls (using this paradigm)
– nvcc compiler warnings can only be switched all on or all off
– Debugging errors is painful (Nsight should help)
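A plausible definition of the CUDAHOSTDEVICE prefix mentioned in point 2 (the deck does not show the actual macro, so this is a sketch):

#ifdef __CUDACC__
#define CUDAHOSTDEVICE __host__ __device__   // nvcc: compile for both host and device
#else
#define CUDAHOSTDEVICE                       // plain C++ build: expands to nothing
#endif

// usage in a shared header:
#include <cmath>
CUDAHOSTDEVICE double discount(double rate, double t) { return exp(-rate * t); }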
Citi | Markets Quantitative Analysis
The End