
U.U.D.M. Project Report 2020:11

Degree Project in Mathematics (Examensarbete i matematik), 30 hp
Supervisor: Maciej Klimek
Examiner: Erik Ekström
May 2020

Department of Mathematics, Uppsala University

Financial Applications of Algorithmic Differentiation

Chengbo Wang


Abstract

In this thesis, applications of Adjoint Algorithmic Differentiation to the computation of sensitivities of different financial instruments, such as an American put option and the CVA of a fixed-for-floating interest rate swap, are demonstrated. An introduction to sensitivities, dual numbers, xVA, and the underlying interest rate and credit risk models is put within a generic mathematical framework, and a comparison between the Adjoint Algorithmic Differentiation method and the conventional bumping method is made to show the advantages of Adjoint Algorithmic Differentiation. The implementations demonstrated in this work are partially based on the ForwardDiff package in Julia.


Contents

1 Introduction
  1.1 American-style options and derivatives
    1.1.1 American-style options in MC simulations
    1.1.2 Sensitivities
    1.1.3 xVA and sensitivities

2 Mathematical framework
  2.1 Traditional methods for sensitivities estimation
    2.1.1 FD method
    2.1.2 PD method
  2.2 AD method
    2.2.1 The general setting
    2.2.2 Tangent mode AD
    2.2.3 Adjoint mode AD
    2.2.4 Higher derivatives
  2.3 Dual numbers
    2.3.1 Definition and basic properties
    2.3.2 Higher derivatives with hyper-dual numbers
  2.4 CVA
    2.4.1 Definition and related concepts
    2.4.2 CVA for Interest Rate Swaps
    2.4.3 Interest rate and credit risk models

3 AD Implementations
  3.1 A simple example
    3.1.1 Tangent mode
    3.1.2 Adjoint mode
    3.1.3 Numerical results
    3.1.4 Example revisited with dual numbers
  3.2 Valuation of an American put option and sensitivities
    3.2.1 Longstaff-Schwartz: a LSM approach
    3.2.2 An AAD adaptation
    3.2.3 Implementation in Julia
    3.2.4 Numerical results
  3.3 Valuation of CVA and sensitivities
    3.3.1 AAD implementation within G2++-CIR++ framework
    3.3.2 Numerical results

4 Conclusion


1 Introduction

In the post-crisis decade, one of the foremost missions of every financial institution has been to keep its financial risks at a controllable level, and such risks can be partly dealt with by applying hedging techniques.

Meanwhile, we cannot talk about pricing and hedging complex financial derivatives without Monte Carlo (MC) simulations: given the sheer fact that the practical pricing models implemented by financial institutions are too complex to be treated analytically, MC simulation is often the only computationally practicable solution on the table.

The major difficulty researchers and practitioners face daily is that such simulations are time-consuming, especially when they are employed to calculate the "greeks", or price sensitivities, which are essential for hedging strategies with respect to certain financial derivatives.

For starters, there is the traditional Finite Difference (FD) method, or bumping, which approximates the values of the sensitivities by perturbing the underlying model parameters by small amounts. Its computational cost is significant, since it grows linearly with the number of sensitivities to be computed. Hence a computational efficiency problem arises when the pricing model depends on a large number of parameters.

Then there are alternatives such as the Pathwise Derivative (PD) method. However, this technique fails to compute the derivative when the payoff is not an almost surely continuous function of the parameter, e.g. the payoff of a digital option. Even though such derivatives can still be approximated by the FD method, the repetition of payoff evaluations makes that costly computation-wise.

In this paper we take a close, mathematical look at the Algorithmic Differentiation (AD) method, to see how it performs compared to the existing methods when computing payoffs and sensitivities of financial derivatives.

The remainder of this paper is structured as follows. In this chapter, we go through American options in MC simulations and sensitivities. In the next chapter, we review the FD and PD methods and then introduce the mechanism of the AD method. In Chapter 3, examples are given to show how the AD method works in practice, and we compute several derivative payoffs and the corresponding sensitivities to show, by numerical results, how the AD method performs compared to the others. We draw the conclusions of this paper in the end.

1.1 American-style options and derivatives

1.1.1 American-style options in MC simulations

In the world of finance, options can be sorted by the dates at which they may be exercised. Except for some special options, the vast majority are either European or American style: a European option can be exercised only at the expiration date of the instrument, whereas an American option can be exercised at any time up to and including the expiration date.


For vanilla options of both kinds, the payoff function is $(S-K)^+$ for a call option and $(K-S)^+$ for a put, where $K$ is the strike price and $S$ is the spot price of the underlying asset. Furthermore, in the dividend-free case, an American call option has the same value as a European call option, due to the fact that it is never optimal to exercise an American call before the expiry date $t = T$. Therefore, in this work we concentrate on the basic American put option and its simulations.

In the context of option pricing, the Monte Carlo method was first piloted by Boyle [1] to approximate the value of an option, pointing out the significance of variance reduction. Henceforth various techniques, e.g. common random numbers, antithetic variates, control variates, stratified sampling, and importance sampling, were used to approach a statistically efficient simulation. For American options, unlike the straightforward and efficient European case, the directly inherited extension of nested Monte Carlo simulations is computationally expensive because of early exercise: option values at intermediate times are required.

For American-style options, various pricing algorithms using Snell envelopes and regressions to approximate the continuation values were proposed under the name of the Least-Squares Monte Carlo (LSM) method. Among them are Tsitsiklis and Van Roy [2], Longstaff and Schwartz [3], and Clement, Lamberton, and Protter [4].

In the latter part of this paper, the Longstaff-Schwartz Algorithm (LSA) is adopted for the accessibility and universality of its implementation. The pivotal insight of this approach is that the conditional expectation of the payoff, given the information accessible at each time point, can be estimated from the simulations with least squares. The algorithm can be applied in path-dependent and multivariate scenarios where the FD method fails. Expected continuation values are estimated by regression, so that the option can be exercised optimally along each path. With further discounting and averaging, the option price is obtained.

1.1.2 Sensitivities

Sensitivities, or greeks, without further specification, are first-order partial derivatives: factors that show how fast the price of a derivative reacts to changes in the underlying parameters. Higher-order partial derivatives are treated similarly. The alternative name stems from the fact that most of these factors are denoted by Greek letters. In real-world finance, it goes without saying that sensitivities are powerful when it comes to hedging derivatives. Some sensitivities listed below are computed later in the implementations of MC simulations.

Among the underlying parameters of a common option are the underlying instrument price $S$, the volatility $\sigma$, the time to maturity $T$, and the risk-free rate $r$. Thus we have first-order partial derivatives, e.g. delta ($\Delta$), vega ($\mathcal{V}$), theta ($\Theta$), and rho ($\rho$):

$$\Delta = \frac{\partial V}{\partial S}, \qquad \mathcal{V} = \frac{\partial V}{\partial \sigma}, \qquad \Theta = -\frac{\partial V}{\partial T}, \qquad \rho = \frac{\partial V}{\partial r},$$

where $V$ is the value of the instrument. Furthermore, there are second-order ones, e.g. gamma ($\Gamma$) and vanna:

$$\Gamma = \frac{\partial^2 V}{\partial S^2} = \frac{\partial \Delta}{\partial S}, \qquad \text{vanna} = \frac{\partial^2 V}{\partial S\,\partial\sigma} = \frac{\partial \Delta}{\partial \sigma} = \frac{\partial \mathcal{V}}{\partial S}.$$


By the classic Black-Scholes formula we have an analytic solution for a European option, hence the formulae for European greeks are also explicit. However, for American options this is not the case. In Chapter 2 we discuss ways to estimate these sensitivities in detail.
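As a concrete illustration of the European case, the following minimal Julia sketch (our addition; the parameter values are illustrative) prices a European put with the Black-Scholes formula and recovers the first-order greeks with forward-mode AD in a single gradient call:

```julia
# A minimal sketch (our addition, illustrative parameters): Black-Scholes
# European put in closed form; forward-mode AD recovers the greeks at once.
using ForwardDiff
using SpecialFunctions: erf

Phi(x) = (1 + erf(x / sqrt(2))) / 2        # standard normal CDF

function bs_put(theta)                     # theta = [S, sigma, T]; K, r fixed
    S, sigma, T = theta
    K, r = 95.0, 0.05
    d1 = (log(S / K) + (r + sigma^2 / 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    K * exp(-r * T) * Phi(-d2) - S * Phi(-d1)
end

gradV = ForwardDiff.gradient(bs_put, [100.0, 0.25, 0.5])
# gradV = [∂V/∂S, ∂V/∂σ, ∂V/∂T]; gradV[1] matches the closed form Φ(d1) - 1
```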

1.1.3 xVA and sensitivities

XVA, or Valuation Adjustments, is the generic term for the valuation of different adjustments to derivative contracts, regarding credit, funding, capital costs and so on. Among them are the Credit Valuation Adjustment (CVA), the Debit Valuation Adjustment (DVA), the Funding Valuation Adjustment (FVA), and the Valuation Adjustment for regulatory capital (KVA). Historically speaking, derivative pricing has been founded on the seminal Black-Scholes risk-neutral framework, and CVA was partly adopted, mostly unilaterally, by some tier one banks pre-crisis. After the financial crisis of 2007-09, DVA, FVA, KVA, and other xVA values, representing counterparty risk, have been taken into the components of derivative prices by banks [5].

Since xVA is now actively managed in front offices, the computation of its sensitivities is naturally pivotal. Similar to option sensitivities, a first-order xVA sensitivity is equal to

$$\frac{\partial\,\mathrm{xVA}}{\partial \theta_i},$$

where $\theta_i$ is an input parameter. Inspired by the greeks, it is called a delta when $\theta_i$ is a linear input of market data, e.g. an FX rate, a vega when $\theta_i$ is a volatility input, and a theta when $\theta_i = t$.

Similarly we get second-order sensitivities

$$\frac{\partial^2\,\mathrm{xVA}}{\partial \theta_i\,\partial \theta_j},$$

where $\theta_i$ and $\theta_j$ are two input parameters. It is referred to as a gamma when $\theta_i$ and $\theta_j$ are of the same type and are linear inputs of market data.

In the following chapters, we take a look at the CVA of a vanilla single-currency interest rate swap, one of the basic but essential xVA values, and the computation of its sensitivities in an AAD implementation.

2 Mathematical framework

2.1 Traditional methods for sensitivities estimation

Before we start, consider a parametrized family of random variables $(P(\theta))_{\theta\in\Theta}$, where $\theta$ is a certain parameter of the priced security, $\Theta \subset \mathbb{R}$ is an interval, and $P(\theta)$ is the payoff of the priced security. To begin with, let

$$V(\theta) := \mathbb{E}[P(\theta)] \tag{1}$$

be the price of a particular derivative security [6]. Below we introduce several ways to estimate $V'(\theta)$.


2.1.1 FD method

The Finite Difference method was first introduced in derivative pricing by Schwartz [7] in 1977. One basic approach to estimating the derivative price's sensitivity to changes in the parameter $\theta$, i.e. $V'(\theta)$, is to use the forward-difference ratio

$$\Delta_F := \frac{V(\theta + \varepsilon) - V(\theta)}{\varepsilon},$$

where $\varepsilon$ is positive and close to zero. In practice we usually do not know the exact values of $V(\cdot)$, but we can estimate them by simulating $n$ samples of $P(\theta)$ and averaging, yielding $\bar{P}_n(\theta)$, and likewise $\bar{P}_n(\theta + \varepsilon)$; plugging these in gives the estimated forward-difference ratio

$$\hat{\Delta}_F := \frac{\bar{P}_n(\theta + \varepsilon) - \bar{P}_n(\theta)}{\varepsilon}.$$

If $V$ is twice differentiable at $\theta$, then by Taylor expansion we have

$$V(\theta + \varepsilon) = V(\theta) + V'(\theta)\varepsilon + \frac{1}{2}V''(\theta)\varepsilon^2 + o(\varepsilon^2),$$

and the bias of the new estimator is

$$\mathrm{Bias}(\hat{\Delta}_F) := \mathbb{E}[\hat{\Delta}_F - V'(\theta)] = \frac{1}{2}V''(\theta)\varepsilon + o(\varepsilon). \tag{2}$$

Setting this aside, we further consider a central-difference estimator

$$\hat{\Delta}_C := \frac{\bar{P}_n(\theta + \varepsilon) - \bar{P}_n(\theta - \varepsilon)}{2\varepsilon}$$

as an alternative estimator of $V'(\theta)$. Reviewing the bias of $\hat{\Delta}_C$,

$$\mathrm{Bias}(\hat{\Delta}_C) := \mathbb{E}[\hat{\Delta}_C - V'(\theta)] = o(\varepsilon),$$

we can see it is superior to (2), converging to zero faster as $\varepsilon \to 0$. We should also notice that $\hat{\Delta}_C$ requires a little more work in practice, since we additionally need to estimate $V(\theta - \varepsilon)$.

Now we move on to discuss the variance of $\hat{\Delta}_C$. To begin with, it is reasonable to assume that the pairs $(P_i(\theta + \varepsilon), P_i(\theta - \varepsilon))$, $i = 1, \ldots, n$, are i.i.d. copies of $(P(\theta + \varepsilon), P(\theta - \varepsilon))$. In this case,

$$\mathrm{Var}[\hat{\Delta}_C] = \frac{\mathrm{Var}[P(\theta + \varepsilon) - P(\theta - \varepsilon)]}{4n\varepsilon^2}, \tag{3}$$

which shows that $\mathrm{Var}[P(\theta + \varepsilon) - P(\theta - \varepsilon)]$ is the core quantity to crack. If we simulate $P(\theta + \varepsilon)$ and $P(\theta - \varepsilon)$ independently, marked as case (a), and assume that $\mathrm{Var}[P(\theta)]$ is continuous, then

$$\mathrm{Var}[P(\theta + \varepsilon) - P(\theta - \varepsilon)] = \mathrm{Var}[P(\theta + \varepsilon)] + \mathrm{Var}[P(\theta - \varepsilon)] \to 2\,\mathrm{Var}[P(\theta)], \qquad \varepsilon \to 0.$$

Alternatively, we may simulate $P(\theta + \varepsilon)$ and $P(\theta - \varepsilon)$ using Common Random Numbers (CRN), i.e. from the same sequence $U_1, U_2, \ldots$ of uniform random numbers, so that $P(\theta + \varepsilon)$ and $P(\theta - \varepsilon)$ are strongly correlated. The full benefit of the CRN technique rests on the assumption that $P$ is almost surely continuous in $\theta \in \Theta$: when this fails we are in case (b), and when it holds we are in case (c). In conclusion, (3) results in

$$\mathrm{Var}[\hat{\Delta}_C] = \begin{cases} O(\varepsilon^{-2}), & \text{(a)}\\ O(\varepsilon^{-1}), & \text{(b)}\\ O(1). & \text{(c)} \end{cases}$$

When CRN is used, the variance is thus expected to reduce drastically. Meanwhile, comparing $\mathrm{Bias}(\hat{\Delta}_C)$ and $\mathrm{Var}(\hat{\Delta}_C)$, it is noteworthy that there is a trade-off between bias and variance in the selection of $\varepsilon$ in cases (a) and (b). While pricing American options with the FD method performs well for options on assets with few degrees of freedom, underlying assets of high dimensionality require more efficient numerical approaches.

2.1.2 PD method

Recall from the previous section that $V(\theta) := \mathbb{E}[P(\theta)]$. The Pathwise Derivative method provides another possible estimator by interchanging the order of differentiation and expectation:

$$V'(\theta) = \frac{\partial}{\partial\theta}\mathbb{E}[P(\theta)] = \mathbb{E}\!\left[\frac{\partial P(\theta)}{\partial \theta}\right]. \tag{4}$$

We regard $(P(\theta))_{\theta\in\Theta}$ as a stochastic process on a single probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The term "pathwise" stems from the fact that $P'(\theta)$ is computed path by path, which requires $\mathbb{P}[P'(\theta)\text{ exists}] = 1$. If the interchange in (4) can be justified, then $\partial P(\theta)/\partial\theta$ is an unbiased estimator of $V'(\theta)$. Various sufficient conditions for (4) are provided in Section 7.2.2 of Glasserman [6].

Although the PD method is computationally more efficient than the FD method, its severe drawback is that it is only applicable to a restricted class of problems for which certain regularity conditions are satisfied.
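For a European call under geometric Brownian motion, the interchange in (4) is justified and the pathwise delta has a simple closed form along each path; the sketch below (our addition, with illustrative parameters) implements it:

```julia
# A sketch (our addition) of the pathwise estimator (4) for a European call
# delta under GBM: on each path, ∂P/∂S0 = e^{-rT} · 1{S_T > K} · S_T / S0.
using Statistics, Random

function pd_delta(S0; K = 95.0, sigma = 0.25, T = 0.5, r = 0.05, n = 100_000)
    Z = randn(MersenneTwister(7), n)
    ST = S0 .* exp.((r - sigma^2 / 2) * T .+ sigma * sqrt(T) .* Z)
    exp(-r * T) * mean((ST .> K) .* ST ./ S0)    # unbiased for the call payoff
end

pd_delta(100.0)   # the same recipe fails for a digital payoff (derivative 0 a.e.)
```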

2.2 AD method

2.2.1 The general setting

The Algorithmic Differentiation (or Automatic Differentiation, AD) method computes partial derivatives of functions highly efficiently, and the chain rule is at its very core.

In general, the AD method has two modes: the tangent (or forward, TAD) mode and the adjoint (or backward, AAD) mode. Intuitively speaking, if the first-order derivative of $f(x) = g(h(x))$ is desired, the chain rule yields

$$\frac{\partial f}{\partial x} = \frac{\partial g}{\partial h}\,\frac{\partial h}{\partial x}.$$

In tangent mode, the computation proceeds from the inside out, from $\partial h/\partial x$ to $\partial g/\partial h$. It goes in reverse in adjoint mode: $\partial g/\partial h$ is computed first, then $\partial h/\partial x$.

Now consider a function $F$ mapping $\mathbb{R}^n$ into $\mathbb{R}^m$. Without loss of generality, we restrict the discussion to a scalar output, i.e. $m = 1$. We thus consider multivariate functions of the type

$$F: \mathbb{R}^n \to \mathbb{R}, \qquad y = F(\mathbf{x}), \qquad \mathbf{x} = (x_1, x_2, \ldots, x_n)^{\mathrm{T}},$$

and we assume that $F$ and its implementation, which potentially contains many mathematical steps, are (locally) differentiable. We are interested in generating the vector of all partial derivatives of $y$ w.r.t. $\mathbf{x}$:

$$\nabla F = \nabla F(\mathbf{x}) = \left(\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_n}\right)^{\mathrm{T}} \in \mathbb{R}^n.$$

Given the input vector $\mathbf{x}$, the output scalar $y$ is calculated by a sequence of computer instructions. The execution can thus be represented by a sequence of intermediate variables $u_1, \ldots, u_N$, such that

$$u_i = \begin{cases} x_i, & i = 1, \ldots, n,\\ \Phi_i(\{u_j\}_{j<i}), & i = n+1, \ldots, N-1,\\ \Phi_i(\{u_j\}_{j<i}) = y, & i = N, \end{cases}$$

where the functions $\Phi_i$ represent a composition of one or more elementary or intrinsic operations, e.g. $\Phi_{n+2}(\{u_j\}_{j<n+2}) = u_1 + u_2/u_{n-1} + u_{n-2}^2$.

Moreover, we adopt the notation of Capriotti [8]:

$$\mathbf{v}_1 := (u_{n+1}, u_{n+2}, \ldots, u_{n+k_1})^{\mathrm{T}}, \quad \ldots, \quad \mathbf{v}_L := (u_{N-k_L}, u_{N-k_L+1}, \ldots, u_{N-1})^{\mathrm{T}},$$
$$\mathbf{v}_0 := \mathbf{x} = (x_1, \ldots, x_n)^{\mathrm{T}}, \qquad \mathbf{v}_{L+1} := y.$$

The block sizes $k_1, \ldots, k_L$ are obtained as follows: within the $l$-th block, each $\Phi_{n+k}$ still depends directly on at least one variable outside the block under construction, and a new block starts at the first index $k$ for which $\Phi_{n+k}$ can be written entirely in terms of the variables of the preceding block. In particular,

$$n + \sum_{i=1}^{L} k_i + 1 = N.$$

Now we introduce some important notation for the following discussion. For a fixed $i$ ($i = 1, \ldots, n$),

$$\dot{u} \equiv \frac{\partial u}{\partial x_i}, \qquad \bar{u} \equiv \left(\frac{\partial y}{\partial u}\right)^{\mathrm{T}}, \qquad D_{ij} \equiv \frac{\partial u_i}{\partial u_j}.$$

2.2.2 Tangent mode AD

In tangent mode, by the independent propagation of the gradient's components and the chain rule,

$$\dot{y} = \dot{u}_N = \frac{\partial y}{\partial \mathbf{x}}\,\dot{\mathbf{x}} = \frac{\partial y}{\partial \mathbf{v}_L}\,\frac{\partial \mathbf{v}_L}{\partial \mathbf{v}_{L-1}} \cdots \frac{\partial \mathbf{v}_1}{\partial \mathbf{x}}\,\dot{\mathbf{x}},$$

where by definition

$$\dot{\mathbf{x}} = \frac{\partial \mathbf{x}}{\partial x_i} = \mathbf{e}_i = (0, \ldots, 0, \underbrace{1}_{i\text{-th element}}, 0, \ldots, 0)^{\mathrm{T}},$$

or, in scalars,

$$\dot{u}_N = \sum_{j=1}^{n} D_{N,N-1} \cdots D_{n+1,j}\,\dot{u}_j.$$

This can also be written in matrices:

$$\dot{\mathbf{V}}_{L+1} = \underbrace{\begin{pmatrix} I_{L+1} \\ \partial \mathbf{v}_{L+1}/\partial \mathbf{V}_L^{\mathrm{T}} \end{pmatrix}}_{:=D_{L+1}} \dot{\mathbf{V}}_L = \ldots = D_{L+1} \cdots D_1\,\dot{\mathbf{x}}, \tag{5}$$

where for $k = 0, \ldots, L$:

$$\mathbf{V}_k := (\mathbf{v}_0, \ldots, \mathbf{v}_k)^{\mathrm{T}}, \qquad \mathbf{V}_{L+1} := (\mathbf{v}_0, \ldots, \mathbf{v}_{L+1})^{\mathrm{T}},$$

$I_{k+1}$ is an identity matrix whose dimension equals the length of $\mathbf{V}_k$, and

$$\frac{\partial \mathbf{v}_{k+1}}{\partial \mathbf{V}_k^{\mathrm{T}}} = \left(\frac{\partial \mathbf{v}_{k+1}}{\partial \mathbf{v}_0}, \frac{\partial \mathbf{v}_{k+1}}{\partial \mathbf{v}_1}, \ldots, \frac{\partial \mathbf{v}_{k+1}}{\partial \mathbf{v}_k}\right).$$

As we can see, to compute all desired partial derivatives, the matrix equation (5) needs to be evaluated $n$ times, with $\dot{\mathbf{x}}$ set to the $i$-th basis vector of $\mathbb{R}^n$ in the $i$-th evaluation. Thus the computational cost of the tangent mode is proportional to the number of input variables. The cost is similar to that of the FD method; the calculation, however, is more accurate.

Figure 1: Computational Graph for A Generic Function F in TAD

2.2.3 Adjoint mode AD

Recalling that in Section 2.2.1 we narrowed the scope down to the case $m = 1$, we can see that when $n \gg m$, TAD suffers from requiring a primal sweep and a tangent sweep for each input element of $\mathbf{x}$. On the other hand, the fact that the algorithm scales for free as $m$ increases still makes it perform well in applications where large output sequences are generated from a small input data seed.


Reversely, we have

$$\bar{\mathbf{x}} = \left(\frac{\partial y}{\partial \mathbf{x}}\right)^{\mathrm{T}}\bar{y} = \left(\frac{\partial y}{\partial \mathbf{v}_L}\,\frac{\partial \mathbf{v}_L}{\partial \mathbf{v}_{L-1}} \cdots \frac{\partial \mathbf{v}_1}{\partial \mathbf{x}}\right)^{\mathrm{T}}\bar{y},$$

or, in scalars,

$$\bar{u}_j = D_{n+1,j}^{\mathrm{T}} \cdots D_{N,N-1}^{\mathrm{T}}\,\bar{u}_N = \left(D_{N,N-1} \cdots D_{n+1,j}\right)^{\mathrm{T}},$$

since by definition $\bar{u}_N = \bar{y} = \partial y/\partial y = 1$.

In matrices,

$$\bar{\mathbf{V}}_0 = \underbrace{\begin{pmatrix} I_1, \dfrac{\partial \mathbf{v}_1^{\mathrm{T}}}{\partial \mathbf{V}_0} \end{pmatrix}}_{:=D_1^{\mathrm{T}}}\bar{\mathbf{V}}_1 = D_1^{\mathrm{T}} \cdots D_{L+1}^{\mathrm{T}}\,\bar{\mathbf{V}}_{L+1}.$$

It is noteworthy that executing the adjoint mode requires an initial tangent sweep as a prerequisite, to compute and store all the $D$ matrices. Compared to the tangent mode, the adjoint mode is significantly more efficient when $n$ is large: in analogy with the cost of the tangent mode, the cost of the adjoint mode is of order $1$, the number of outputs.

Figure 2: Computational Graph for A Generic Function F in AAD

2.2.4 Higher derivatives

As we know, there are many second- and even third-order greeks, e.g. gamma, charm, and speed. All of these can be obtained if we continue the differentiation program. Combinations of the two modes lead to computational costs of different orders. For instance, for $p$-th-order derivatives, applying the tangent mode at every stage yields a cost of order $n^p$. Given that $y$ is scalar (as stated in Section 2.2.1, $m = 1$), we should apply the adjoint mode at each stage for higher-order derivatives. For a general case with $m > 1$, however, we can, for instance, obtain a second-order derivative by applying the tangent mode to a first-order adjoint mode, at a computational cost of order $m \cdot n$. We can thus choose the combination case by case for higher-order derivatives.


2.3 Dual numbers

2.3.1 Definition and basic properties

When we scrutinize real-world AD algorithm descriptions and implementations, the core approach is slightly different from what we discussed above. In most computational implementations a technique called operator overloading is used. Mathematically speaking, the main difference lies in that, instead of treating the desired derivatives as compositions within the underlying function, the function itself is lifted to a subset of the algebra $\mathbb{D}_2 := (\mathbb{R}^2, +, \cdot)$, or simply $\mathbb{D}$, of dual numbers [9].

A dual number $d = (a, b)$, first introduced in 1873 by Clifford [10] and described in detail by Kalman [11], is defined as

$$d = a + b\varepsilon, \qquad a, b \in \mathbb{R}, \quad \varepsilon \neq 0, \quad \varepsilon^2 = 0.$$

We can easily verify that $\mathbb{D}$ is an associative and commutative algebra over $\mathbb{R}$. This is analogous to a complex number, which is defined as $a + bi$ with $a, b \in \mathbb{R}$ and $i^2 = -1$. For any dual number $d$, we can also easily find that

$$d^n = \left(a^n,\; n a^{n-1} b\right).$$

By observation, these numbers naturally relate to differentiation if we think of $\varepsilon$ as an infinitesimally small quantity. Indeed, if we let $f(x) = x^n$, then

$$f(d) = f((a, b)) = (a + b\varepsilon)^n = \sum_{k=0}^{n}\binom{n}{k} a^{n-k}(b\varepsilon)^k = a^n + n a^{n-1} b\,\varepsilon = (f(a), f'(a)b),$$

where $f$ is understood as extended to $\mathbb{D}$. This finding is intriguing and suggests that if we extend the definition of a generic real function $f: \mathbb{R} \to \mathbb{R}$ to $f: \mathbb{D} \to \mathbb{D}$, where $f$ is differentiable at $a$, then for any dual number $d$

$$f(d) = f(a) + f'(a)b\varepsilon = (f(a), f'(a)b).$$

Taking further $b = 1$, we have

$$f(d) = (f(a), f'(a)), \tag{6}$$

where the second component is the desired first-order derivative.

This conclusion (6) is even more obvious when we consider a polynomial

$$P(a) = p_0 + p_1 a + p_2 a^2 + \ldots + p_n a^n$$

and extend $a$ to the dual number $d$; then

$$P(d) = P((a, b)) = p_0 + p_1 d + p_2 d^2 + \ldots + p_n d^n$$
$$= \left(p_0 + p_1 a + p_2 a^2 + \ldots + p_n a^n\right) + \left(p_1 b + 2p_2 a b + \ldots + n p_n a^{n-1} b\right)\varepsilon$$
$$= (P(a), P'(a)b).$$


Furthermore, one can derive the extensions of other elementary functions (with $b = 1$):

$$\sin d = (\sin a, \cos a),$$
$$\cos d = (\cos a, -\sin a),$$
$$e^d = (e^a, e^a),$$
$$\log d = (\log a, 1/a), \qquad a > 0.$$

Now we further extend $f$ to an $n$-variate function $g: \mathbb{R}^n \to \mathbb{R}$, and accordingly $g: \mathbb{D}^n \to \mathbb{D}$ [9]. From now on, adopting the notation of Section 2.2.1, we let $b_i = \dot{a}_i$, so that the dual numbers are $d_i = a_i + \dot{a}_i\varepsilon = (a_i, \dot{a}_i)$, where $a_i, \dot{a}_i \in \mathbb{R}$, $\varepsilon \neq 0$, $\varepsilon^2 = 0$, for all $i \in \{1, \ldots, n\}$. The conclusion is then extended to

$$g(d_1, \ldots, d_n) = \left(g(a_1, \ldots, a_n),\; \nabla g(a_1, \ldots, a_n) \cdot \dot{\mathbf{a}}\right), \tag{7}$$

where, following the previous notation,

$$\dot{\mathbf{a}} := (\dot{a}_1, \ldots, \dot{a}_n)^{\mathrm{T}}.$$

Similar to (6), taking $\dot{a}_i = 1$ and $\dot{a}_j = 0$ for $j \neq i$, the second component is the desired partial derivative $\partial g/\partial a_i$, i.e. a component of the Jacobian matrix $J_g(a_1, \ldots, a_n)$. Taking a glimpse at the case $n = 2$, the basic arithmetic is:

$$d_1 \pm d_2 = (a_1 \pm a_2,\; \dot{a}_1 \pm \dot{a}_2),$$
$$d_1 \cdot d_2 = (a_1 a_2,\; a_1\dot{a}_2 + a_2\dot{a}_1),$$
$$\frac{d_1}{d_2} = \left(\frac{a_1}{a_2},\; \frac{a_2\dot{a}_1 - a_1\dot{a}_2}{a_2^2}\right), \qquad a_i, \dot{a}_i \in \mathbb{R}, \; i \in \{1, 2\}.$$

Notice that the dual number $d_2$ only has a multiplicative inverse if $a_2 \neq 0$. The results can be verified directly from the definition.
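This arithmetic is exactly what operator overloading implements in code. Below is a minimal Julia sketch (our addition; a deliberately stripped-down stand-in for the far more general Dual type of ForwardDiff.jl used later), applied to the example function of Section 3.1:

```julia
# A minimal sketch (our addition) of forward-mode AD by operator overloading,
# implementing the dual-number arithmetic above.
struct Dual
    a::Float64   # value
    b::Float64   # derivative part, the coefficient of ε (with ε² = 0)
end

import Base: +, -, *, /, exp
+(x::Dual, y::Dual) = Dual(x.a + y.a, x.b + y.b)
-(x::Dual, y::Dual) = Dual(x.a - y.a, x.b - y.b)
*(x::Dual, y::Dual) = Dual(x.a * y.a, x.a * y.b + y.a * x.b)
/(x::Dual, y::Dual) = Dual(x.a / y.a, (y.a * x.b - x.a * y.b) / y.a^2)
exp(x::Dual) = Dual(exp(x.a), exp(x.a) * x.b)

f(x1, x2) = exp(x1 * x1 + x1 * x2)

# seed x1 with b = 1 to propagate ∂/∂x1: the result is (e², 3e²) ≈ (7.389, 22.167)
f(Dual(1.0, 1.0), Dual(1.0, 0.0))
```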

2.3.2 Higher derivatives with hyper-dual numbers

Dual numbers give us room to think about how to compute higher-order derivatives. In fact, to retrieve an $N$-th partial derivative, instead of dual numbers $a + b\varepsilon$ we now consider so-called hyper-dual numbers [12]:

$$a + b_1\varepsilon_1 + b_2\varepsilon_2 + \ldots + b_N\varepsilon_N,$$

where

$$\varepsilon_1 \neq \varepsilon_2 \neq \ldots \neq \varepsilon_N \neq 0, \qquad \varepsilon_1^2 = \ldots = \varepsilon_N^2 = 0, \qquad \varepsilon_i\varepsilon_j \neq 0 \;\;\forall\, i \neq j \in \{1, \ldots, N\}.$$

The algebra $\mathbb{D}_{N+1} := (\mathbb{R}^{N+1}, +, \cdot)$ has properties similar to those of $\mathbb{D}_2$, and the other mathematical operations can be defined accordingly. Without loss of generality, we look only at the case $N = 2$. Following the definition, we can derive a property similar to (6) using the Taylor series, where $f$ is adapted from a generic real function $f: \mathbb{R} \to \mathbb{R}$ differentiable at $x = a$ and extended to the hyper-dual algebra with the two units $\varepsilon_1, \varepsilon_2$:

$$f(a + b) = f(a) + f'(a)b + \frac{1}{2!}f''(a)b^2 + \frac{1}{3!}f'''(a)b^3 + \ldots$$


By definition,

$$b = b_1\varepsilon_1 + b_2\varepsilon_2, \qquad b^2 = 2b_1 b_2\,\varepsilon_1\varepsilon_2, \qquad b^3 = b^4 = \ldots = 0.$$

Hence

$$f(a + b_1\varepsilon_1 + b_2\varepsilon_2) = f(a) + f'(a)b_1\varepsilon_1 + f'(a)b_2\varepsilon_2 + f''(a)b_1 b_2\,\varepsilon_1\varepsilon_2,$$

where $f$ is differentiable at $x = a$. Again, taking $b_1 = b_2 = 1$, the coefficients of the $\varepsilon$-terms are the desired derivatives:

$$f(a + \varepsilon_1 + \varepsilon_2) = f(a) + f'(a)\varepsilon_1 + f'(a)\varepsilon_2 + f''(a)\varepsilon_1\varepsilon_2.$$

Like conclusion (7) in the last section, we extend $f$ to an $n$-variate function $g: \mathbb{R}^n \to \mathbb{R}$, with hyper-dual arguments accordingly. In multivariate Taylor polynomials,

$$g(\mathbf{a} + \mathbf{b}) = g(\mathbf{a}) + J_g(\mathbf{a})\,\mathbf{b}^{\mathrm{T}} + \frac{1}{2}\,\mathbf{b}\,H_g(\mathbf{a})\,\mathbf{b}^{\mathrm{T}} + \ldots,$$

where $J_g(\mathbf{a})$ is the Jacobian matrix and $H_g(\mathbf{a})$ the Hessian matrix of $g$, and

$$\mathbf{a} = (a_1, \ldots, a_n),$$

$$\mathbf{b} = (\varepsilon_1\;\; \varepsilon_2)\begin{pmatrix} b_{11} & \ldots & b_{1n} \\ b_{21} & \ldots & b_{2n} \end{pmatrix} = \left(\sum_{i=1}^{2}\varepsilon_i b_{i1}, \ldots, \sum_{i=1}^{2}\varepsilon_i b_{in}\right),$$

$$J_g(\mathbf{a}) = \left(\frac{\partial g}{\partial a_1}, \frac{\partial g}{\partial a_2}, \ldots, \frac{\partial g}{\partial a_n}\right) := (g_1, \ldots, g_n), \qquad H_g(\mathbf{a}) = \left(\frac{\partial^2 g}{\partial a_i\,\partial a_j}\right)_{ij} := (g_{ij})_{ij}.$$

Then

$$J_g(\mathbf{a})\,\mathbf{b}^{\mathrm{T}} = \sum_{j=1}^{n} g_j \sum_{i=1}^{2}\varepsilon_i b_{ij},$$

and, using $\varepsilon_1^2 = \varepsilon_2^2 = 0$ together with the symmetry of $H_g$,

$$\frac{1}{2}\,\mathbf{b}\,H_g(\mathbf{a})\,\mathbf{b}^{\mathrm{T}} = \frac{1}{2}\sum_{j,k=1}^{n} g_{jk}\left(\sum_{i=1}^{2}\varepsilon_i b_{ij}\right)\left(\sum_{i=1}^{2}\varepsilon_i b_{ik}\right) = \varepsilon_1\varepsilon_2 \sum_{j,k=1}^{n} b_{1j}\,b_{2k}\,g_{jk}.$$

Therefore

$$g(\mathbf{a} + \mathbf{b}) = g(\mathbf{a}) + \left(\sum_{j=1}^{n} b_{1j}\,g_j\right)\varepsilon_1 + \left(\sum_{j=1}^{n} b_{2j}\,g_j\right)\varepsilon_2 + \left(\sum_{j,k=1}^{n} b_{1j}\,b_{2k}\,g_{jk}\right)\varepsilon_1\varepsilon_2.$$


Taking $b_{ij} = 1$ for all $i, j$,

$$g(\mathbf{a} + \mathbf{b}) = g(\mathbf{a}) + \sum_{j=1}^{n}\frac{\partial g}{\partial a_j}\,\varepsilon_1 + \sum_{j=1}^{n}\frac{\partial g}{\partial a_j}\,\varepsilon_2 + \sum_{j,k=1}^{n}\frac{\partial^2 g}{\partial a_j\,\partial a_k}\,\varepsilon_1\varepsilon_2.$$

Again, the desired second-order partial derivatives can be obtained in this way.
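In practice, ForwardDiff.jl reaches second derivatives by nesting dual numbers rather than by the hyper-dual algebra sketched above, but the effect is the same; a quick check (our addition) on the example function of Section 3.1:

```julia
# A quick check (our addition): second derivatives via nested dual numbers.
using ForwardDiff

g(x) = exp(x[1]^2 + x[1] * x[2])
H = ForwardDiff.hessian(g, [1.0, 1.0])
# H[1, 2] = H[2, 1] = ∂²g/∂x₁∂x₂ = 4e² ≈ 29.556
```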

2.4 CVA

2.4.1 Definition and related concepts

We adopt the notations of Brigo et al [13] and Ghamami et al [14], adjusted for consistency, and first showcase the basic definition. Consider a risk-neutral pricing process $V_t$, defined on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, for a portfolio of transactions with a single counterparty which might default at a random time $\tau$. With a payoff $P(t, T)$ and a filtration $\{\mathcal{G}_t\}_{t\geq 0}$, where $\mathcal{G}_t = \sigma(\{V(s) : 0 \leq s \leq t\})$, we have

$$V(t) = \mathbb{E}[P(t, T)\,|\,\mathcal{G}_t].$$

Also, with a constant recovery rate $R \in [0, 1]$ and discount factors $D(t) = e^{-\int_0^t r_s\,ds}$, where $r$ is the short rate, the unilateral CVA at $t = 0$ is given as

$$\mathrm{CVA} = (1 - R)\,\mathbb{E}\!\left[\mathbf{1}_{\{0 < \tau \leq T\}}\,D(\tau)\,V(\tau)^+\right]. \tag{8}$$

The nature of $\tau$ admits a stochastic intensity $\lambda$, which is also adapted to $\{\mathcal{G}_t\}_{t\geq 0}$, so that

$$\mathbb{P}(\tau > t\,|\,\mathcal{G}) = e^{-\int_0^t \lambda_s\,ds}. \tag{9}$$

From (9), we have

$$\mathbb{P}(\tau > t) = \mathbb{E}\!\left[e^{-\int_0^t \lambda_s\,ds}\right] =: \mathbb{E}[\Lambda(t)]. \tag{10}$$

Combining (10) with the interchangeability of integral and expected value, we can further develop equation (8):

$$\mathrm{CVA} = (1 - R)\,\mathbb{E}\!\left[\mathbf{1}_{\{0 < \tau \leq T\}}\,D(\tau)\,V(\tau)^+\right]$$
$$= (1 - R)\,\mathbb{E}\!\left[\mathbb{E}\!\left[\mathbf{1}_{\{0 < \tau \leq T\}}\,D(\tau)\,V(\tau)^+\,\big|\,\mathcal{G}\right]\right]$$
$$= (1 - R)\,\mathbb{E}\!\left[\int_0^T D(t)\,V(t)^+\,f_{\tau|\mathcal{G}}(t)\,dt\right]$$
$$= (1 - R)\,\mathbb{E}\!\left[\int_0^T D(t)\,V(t)^+\,\lambda(t)\,\Lambda(t)\,dt\right].$$

Note that $\mathrm{EE}(t) := \mathbb{E}[V(t)^+]$ is called the Expected Exposure (EE), and $(1 - R)$ the Loss Given Default (LGD).


2.4.2 CVA for Interest Rate Swaps

Within the scope of xVA sensitivities, there have been works by Capriotti et al [15, 16], Savickas et al [17], and a master thesis from 2015 [18]. The former two groups of authors report successful applications of the PD and AAD methods to this topic, compared against the FD approach. We decided to adopt the framework of the latter to show another implementation of CVA sensitivities. Its specialty is twofold: first, default probabilities are correlated with ever-changing market factors; second, the assumption $V(t) = F(S_1(t), \ldots, S_n(t))$ is no longer applicable for more complex products whose explicit pricing functions are not attainable.

Since we want to calculate the value adjustment sensitivities of an interest rate swap, the model of the interest rate term structure needs to be specified. In the last section we used the symbol $D(t)$ for discounting. The idea is the same, but for pricing a zero-coupon bond we switch to

$$B(t, T) := \mathbb{E}_t\!\left[e^{-\int_t^T r_s\,ds}\right].$$

As we can see, $B(0, T)$ is the discount factor used to calculate the present value of any cash flow.

A fixed-for-floating Interest Rate Swap (IRS) is a contract in which two parties exchange a fixed interest rate payment for a floating one. For simplicity's sake, we assume a swap with start date $T_0$ in which both legs share the same year count fraction $\delta(T_{i-1}, T_i)$, the same intervals $[T_{i-1}, T_i]$, and the same $N_T$ payment dates $T_1, \ldots, T_{N_T}$, where a fixed payment with fixed interest rate $K$ is exchanged for a floating one with rate

$$L(T_{i-1}, T_i) = \frac{1}{\delta(T_{i-1}, T_i)}\left(\frac{B(0, T_{i-1})}{B(0, T_i)} - 1\right),$$

and a notional $N$. The value of the fixed-for-floating IRS is then

$$V_{\mathrm{swap}}(t) = V_{\mathrm{fixed}}(t) - V_{\mathrm{floating}}(t)$$
$$= \delta N K \sum_{i=\min\{k\,:\,k \geq t N_T/T\}}^{N_T} B(t, T_i) - \delta N \sum_{i=\min\{k\,:\,k \geq t N_T/T\}}^{N_T} B(t, T_i)\,L(T_{i-1}, T_i)$$
$$= N \sum_{i}\left(B(t, T_i)(1 + \delta K) - B(t, T_{i-1})\right).$$

The result above suggests that if we can derive the term structure $B(t, T_i)$ from the initial curve $B(0, T_i)$, then $V_{\mathrm{swap}}$ at any time $t$ is within our grasp. In the next section we decide how to do so.
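Concretely, the last line of the valuation above can be coded in a few lines; the sketch below (our addition; the helper name and the flat test curve are hypothetical) evaluates the swap value from a vector of discount factors:

```julia
# A sketch (our addition) of the swap value N Σᵢ (B(t,Tᵢ)(1+δK) − B(t,Tᵢ₋₁))
# from discount factors B, year fraction δ, fixed rate K and notional N.
function swap_value(B::Vector{Float64}, δ::Float64, K::Float64, N::Float64)
    # B[1] corresponds to B(t, T_{i-1}) of the first remaining period
    sum(N * (B[i] * (1 + δ * K) - B[i - 1]) for i in 2:length(B))
end

# hypothetical flat 2% curve, semi-annual payments over 5 years;
# the value is close to zero when K is near the par swap rate
B = [exp(-0.02 * 0.5 * i) for i in 0:10]
swap_value(B, 0.5, 0.0201, 1e6)
```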

2.4.3 Interest rate and credit risk models

In the last section we figured out how to calculate $V$ at any time step, but how to calculate $B(t, T)$ for $t > 0$ remains open. Also, to compute the CVA, variables like $\lambda(t)$ and $r(t)$ are still to be determined. Following the reference paper [18], we adopt the G2++ model as the interest rate model and the CIR++ model as the credit risk model: G2++ determines the processes of $B(t, T)$ and $r(t)$, while CIR++ determines that of $\lambda(t)$. That is, we obtain $B(t, T)$ via

$$\bar{\rho} = f_1(\rho, \sigma, \eta) = \frac{\sigma\rho_{13} + \eta\rho_{23}}{\sqrt{\sigma^2 + \eta^2 + 2\sigma\eta\rho_{12}}},$$

$$V(t, T) = f_2(a, b, \sigma, \eta, \bar{\rho}, t, T) = \frac{\sigma^2}{a^2}\left(T - t + \frac{1}{2a}\left(4e^{-a(T-t)} - e^{-2a(T-t)} - 3\right)\right)$$
$$+ \frac{\eta^2}{b^2}\left(T - t + \frac{1}{2b}\left(4e^{-b(T-t)} - e^{-2b(T-t)} - 3\right)\right)$$
$$+ \frac{2\bar{\rho}\sigma\eta}{ab}\left(T - t + \frac{1}{a}\left(e^{-a(T-t)} - 1\right) + \frac{1}{b}\left(e^{-b(T-t)} - 1\right) - \frac{1}{a + b}\left(e^{-(a+b)(T-t)} - 1\right)\right),$$

$$A(t, T) = f_3(a, b, x(t), y(t), V(t, T), V(0, T), V(0, t))$$
$$= \frac{1}{2}\left[V(t, T) - V(0, T) + V(0, t)\right] - \frac{1 - e^{-a(T-t)}}{a}\,x(t) - \frac{1 - e^{-b(T-t)}}{b}\,y(t),$$

$$B(t, T) = f_4(B(0, T), B(0, t), A(t, T)) = \frac{B(0, T)\,e^{A(t,T)}}{B(0, t)},$$

and the short rate $r$ by

$$r(t) = x(t) + y(t) + f(t), \qquad r(0) = r_0,$$

where $f(t)$ is the instantaneous forward rate observed in the market, and $x$ and $y$ follow

$$dx = -ax\,dt + \sigma\,dW_1, \qquad x(0) = 0,$$
$$dy = -by\,dt + \eta\,dW_2, \qquad y(0) = 0.$$

For the default intensity $\lambda$,

$$\lambda(t) = z(t) + \lambda_0(t) + \gamma(t),$$

where

$$dz = \kappa(\mu - z)\,dt + \nu\sqrt{z}\,dW_3, \qquad z(0) = 0,$$

$\lambda_0(t)$ is bootstrapped from market quotes for CDS spreads, and $\gamma$ is defined by

$$\gamma(t) = -\frac{\kappa\mu\left(e^{t\epsilon} - 1\right)}{2\epsilon + (\kappa + \epsilon)\left(e^{t\epsilon} - 1\right)},$$

with $\epsilon = \sqrt{\kappa^2 + 2\nu^2}$. Furthermore, the three listed SDEs are intertwined via $dW_i\,dW_j = \rho_{ij}\,dt$, and $a, b, \sigma, \eta, \kappa, \mu, \nu, \rho$ are built-in parameters. Since our focus is on the sheer AAD mechanism, for all the detailed models and formulae of this section one may consult the literature of Brigo et al [13, 19, 20].


3 AD Implementations

3.1 A simple example

To see how AD works in both modes, a simple example function $F: \mathbb{R}^2 \to \mathbb{R}$ is given as

$$y = F(x_1, x_2) = e^{x_1^2 + x_1 x_2},$$

and we want to compute the first-order derivatives $\partial y/\partial x_1$ and $\partial y/\partial x_2$. As instructed in the last section, the whole computation is broken down into:

$$u_1 = x_1, \quad u_2 = x_2, \quad \mathbf{x} = (u_1, u_2)^{\mathrm{T}},$$
$$u_3 = \Phi_3(\{u_j\}_{j<3}) = u_1^2, \qquad D_{3,1} = 2u_1, \quad D_{3,2} = 0,$$
$$u_4 = \Phi_4(\{u_j\}_{j<3}) = u_1 u_2, \qquad D_{4,1} = u_2, \quad D_{4,2} = u_1, \quad \mathbf{v}_1 = (u_3, u_4)^{\mathrm{T}},$$
$$u_5 = \Phi_5(\{u_j\}_{2<j<5}) = u_3 + u_4 = \mathbf{v}_2, \qquad D_{5,3} = D_{5,4} = 1,$$
$$u_6 = \Phi_6(\{u_j\}_{4<j<6}) = e^{u_5} = y, \qquad D_{6,5} = e^{u_5} = e^{u_1^2 + u_1 u_2}.$$

3.1.1 Tangent mode

Figure 3: Computational Graph for the example function F

In the tangent mode, we inherit the notation of the last section:

$$\dot{\mathbf{V}}_0 = \begin{pmatrix} \dot{u}_1 \\ \dot{u}_2 \end{pmatrix}, \qquad \dot{\mathbf{V}}_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ D_{3,1} & D_{3,2} \\ D_{4,1} & D_{4,2} \end{pmatrix}\dot{\mathbf{V}}_0 = \begin{pmatrix} I_1 \\ \partial\mathbf{v}_1/\partial\mathbf{x}^{\mathrm{T}} \end{pmatrix}\dot{\mathbf{V}}_0 = D_1\dot{\mathbf{V}}_0,$$

$$\dot{\mathbf{V}}_2 = \begin{pmatrix} I_2 \\ \partial\mathbf{v}_2/\partial\mathbf{V}_1^{\mathrm{T}} \end{pmatrix}\dot{\mathbf{V}}_1 = D_2\dot{\mathbf{V}}_1 = D_2 D_1\dot{\mathbf{V}}_0, \qquad \dot{\mathbf{V}}_3 = \begin{pmatrix} I_3 \\ \partial y/\partial\mathbf{V}_2^{\mathrm{T}} \end{pmatrix}\dot{\mathbf{V}}_2 = D_3\dot{\mathbf{V}}_2 = D_3 D_2 D_1\dot{\mathbf{V}}_0.$$

Now we compute the $D$ matrices:

$$D_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 2u_1 & 0 \\ u_2 & u_1 \end{pmatrix}, \qquad D_2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}, \qquad D_3 = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & e^{u_1^2 + u_1 u_2} \end{pmatrix},$$

and finally we have

$$\dot{\mathbf{V}}_3 = D_3 D_2 D_1\dot{\mathbf{V}}_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 2u_1 & 0 \\ u_2 & u_1 \\ 2u_1 + u_2 & u_1 \\ (2u_1 + u_2)e^{u_1^2 + u_1 u_2} & u_1 e^{u_1^2 + u_1 u_2} \end{pmatrix}\dot{\mathbf{V}}_0.$$

Taking $\dot{\mathbf{V}}_0 = (1, 0)^{\mathrm{T}}$ and $(0, 1)^{\mathrm{T}}$ in the last equation, we immediately read off all the desired first-order derivatives.

3.1.2 Adjoint mode

In the adjoint mode, the order of computation is reversed accordingly:

$$\bar{\mathbf{V}}_2 = D_3^{\mathrm{T}}\bar{\mathbf{V}}_3, \qquad \bar{\mathbf{V}}_1 = D_2^{\mathrm{T}}\bar{\mathbf{V}}_2 = D_2^{\mathrm{T}} D_3^{\mathrm{T}}\bar{\mathbf{V}}_3,$$

$$\bar{\mathbf{V}}_0 = \begin{pmatrix} 1 & 0 & D_{3,1} & D_{4,1} \\ 0 & 1 & D_{3,2} & D_{4,2} \end{pmatrix}\bar{\mathbf{V}}_1 = \begin{pmatrix} I_1, \dfrac{\partial\mathbf{v}_1^{\mathrm{T}}}{\partial\mathbf{x}} \end{pmatrix}\bar{\mathbf{V}}_1 = D_1^{\mathrm{T}}\bar{\mathbf{V}}_1 = D_1^{\mathrm{T}} D_2^{\mathrm{T}} D_3^{\mathrm{T}}\bar{\mathbf{V}}_3.$$

Taking $\bar{\mathbf{V}}_3 = (0, 0, 0, 0, 0, 1)^{\mathrm{T}}$, the desired results are obtained similarly.

3.1.3 Numerical results

To give a direct comparison among the two AD modes and the FD method, we take $x_1 = x_2 = 1$ and compute $\partial y/\partial x_1$:

TAD          AAD          FD
22.1671683   22.1671683   22.1671685

Here, for the FD method, we approximate the result with

$$\frac{\partial y}{\partial x_1} \approx \frac{F(x_1 + \varepsilon, x_2) - F(x_1 - \varepsilon, x_2)}{2\varepsilon},$$

where $\varepsilon = 10^{-8}$.
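The same numbers can be reproduced in a few lines with the ForwardDiff.jl package used later in Section 3.2.3 (a quick sketch of ours, not the thesis code):

```julia
# A quick sketch (our addition) reproducing the table above with ForwardDiff.jl.
using ForwardDiff

F(x) = exp(x[1]^2 + x[1] * x[2])

ForwardDiff.gradient(F, [1.0, 1.0])   # [3e², e²] ≈ [22.1671683, 7.3890561]

ε = 1e-8                              # central finite difference, for comparison
(F([1 + ε, 1.0]) - F([1 - ε, 1.0])) / (2 * ε)
```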

3.1.4 Example revisited with dual numbers

Now we implement the example from Section 3.1 again with dual numbers. In tangent mode, we start with $u_1$ and $u_2$:

$$u_1 = x_1 + \varepsilon(1, 0), \quad u_2 = x_2 + \varepsilon(0, 1), \qquad \dot{u}_1 = (1, 0), \quad \dot{u}_2 = (0, 1),$$
$$u_3 = u_1^2 = x_1^2 + 2x_1\varepsilon(1, 0), \qquad \dot{u}_3 = 2x_1(1, 0),$$
$$u_4 = u_1 u_2 = x_1 x_2 + \varepsilon(x_2, x_1), \qquad \dot{u}_4 = (x_2, x_1),$$
$$u_5 = u_3 + u_4 = x_1(x_1 + x_2) + \varepsilon(2x_1 + x_2, x_1), \qquad \dot{u}_5 = (2x_1 + x_2, x_1),$$
$$u_6 = e^{u_5} = e^{x_1(x_1 + x_2)} + e^{x_1(x_1 + x_2)}\varepsilon(2x_1 + x_2, x_1), \qquad \dot{u}_6 = e^{x_1(x_1 + x_2)}(2x_1 + x_2, x_1).$$

Notice that here $(1, 0)$ and so on are just row vectors, not to be confused with the dual-number notation introduced before. In adjoint mode, the computation is reversed; we first take $\bar{u}_6 = 1$:

$$e^{u_5 + \varepsilon} = e^{u_5} + e^{u_5}\varepsilon, \qquad D_{6,5} = e^{u_5}, \qquad \bar{u}_5 = D_{6,5}\bar{u}_6 = e^{u_5},$$
$$(u_3 + \varepsilon(1, 0)) + (u_4 + \varepsilon(0, 1)) = u_3 + u_4 + \varepsilon(1, 1), \qquad D_{5,3} = D_{5,4} = 1,$$
$$\bar{u}_3 = D_{5,3}\bar{u}_5 = e^{u_3 + u_4}, \qquad \bar{u}_4 = D_{5,4}\bar{u}_5 = e^{u_3 + u_4},$$
$$(u_1 + \varepsilon)^2 = u_1^2 + 2u_1\varepsilon, \qquad D_{3,1} = 2u_1,$$
$$(u_1 + \varepsilon(1, 0))(u_2 + \varepsilon(0, 1)) = u_1 u_2 + \varepsilon(u_2, u_1), \qquad D_{4,1} = u_2, \quad D_{4,2} = u_1,$$
$$\bar{u}_1 = D_{3,1}\bar{u}_3 + D_{4,1}\bar{u}_4 = (2u_1 + u_2)e^{u_1^2 + u_1 u_2},$$
$$\bar{u}_2 = D_{4,2}\bar{u}_4 = u_1 e^{u_1^2 + u_1 u_2}.$$

We can see that the $D$ matrices needed in adjoint mode have already been computed during the tangent sweep. If we run a tangent sweep first, store all the derivatives, and then run an adjoint sweep, we save considerable operation time.

3.2 Valuation of an American put option and sensitivities

3.2.1 Longstaff-Schwartz: a LSM approach

For a further comparison among the different methods, an American put option on a single asset is considered next. The Longstaff-Schwartz (LS) algorithm [3] is adopted, and individual parts of the implementation can be replaced later thanks to its modular structure.

In LS, the stock price process is simulated along paths of geometric Brownian motion. With an initial stock price $S_0$, strike price $K$, volatility $\sigma$, time to maturity $T$, risk-free interest rate $r$, number of paths $N_p$, number of time steps $N_T$, and accumulated random numbers $\bar{Z}_{p,t} = \sum_{i=1}^{t} Z_{p,i}$, we have a function $h: \mathbb{R}^4 \times \mathbb{N}^2 \times \mathbb{R} \to \mathbb{R}$ that computes the stock price $S_{p,t}$ on path $p$ at time step $t$:

$$S_{p,t} = h(S_0, \sigma, T, r, t, N_T, \bar{Z}_{p,t}) = S_0\exp\!\left((r - 0.5\sigma^2)\,t\,T/N_T + \sigma\sqrt{T/N_T}\sum_{i=1}^{t} Z_{p,i}\right).$$

The pricing function $F$ calculates the price of this American put option, following the notation of the last section, $F: \mathbb{R}^5 \times \mathbb{N}^2 \times \mathbb{R}^{N_p \times N_T} \to \mathbb{R}$,

$$V = F(S_0, K, \sigma, T, r, N_p, N_T, (Z_{p,t})),$$

and we further apply the AAD method to $F$ to obtain the gradient of the option price w.r.t. the pricing parameters, $\mathbf{g} = \nabla_{(S_0, \sigma, T)} V$.

Aside from these active inputs and outputs, there are noteworthy intermediate variables: the exercise time $t_p$ for each path, the option value $v_p$ along each path, the index set $I$ of in-the-money paths, and the exercise boundary $b$.


Furthermore, the function $h$ is regarded as an intermediate function of $F$, and so is the function $R$, a regression that computes $b$:

$$b = R(I, (S_{i,t}), (v_i)).$$

Below, the general algorithm of AAD in the LS framework is given in detail; the main body of the pseudocode is similar to Deussen et al [21], with the details customized for the scope of this paper:

Algorithm 1 LS

Input: S0, K, σ, T, r, Np, NT, (Zp,t).
Output: (intermediate:) Sp,t, tp, vp, I, b; (final:) V.

 1: for p = 1 → Np do
 2:   Sp,NT ← h(S0, σ, T, r, NT, NT, Z̄p,NT)    ▷ cash flow initiated when t = NT
 3:   if Sp,NT < K then
 4:     vp ← K − Sp,NT
 5:     tp ← NT
 6:   else
 7:     vp ← 0
 8:   end if
 9: end for
10: for t = NT − 1 → 1 do                       ▷ cash flow traced back by time step
11:   I ← ∅
12:   for p = 1 → Np do
13:     vp ← vp · exp(−rT/NT)                   ▷ discounted over one time step
14:     Sp,t ← h(S0, σ, T, r, t, NT, Z̄p,t)
15:     if Sp,t < K then                        ▷ index of an in-the-money path marked down
16:       I ← I ∪ {p}
17:     end if
18:   end for
19:   b ← R(I, (Si,t), (vi))                    ▷ exercise boundary computed via regression
20:   for all p ∈ I do
21:     if Sp,t < b then
22:       vp ← K − Sp,t
23:       tp ← t                                ▷ exercise time marked down
24:     end if
25:   end for
26: end for
27: V ← 0
28: for p = 1 → Np do
29:   V ← V + vp
30: end for
31: V ← V · exp(−rT/NT)/Np


3.2.2 An AAD adaptation

From Algorithm 1, the option price $V$ computed via LS is

$$V = \frac{1}{N_p}\sum_{p\in I}\left[K e^{-r\frac{T}{N_T}t_p} - S_0\,e^{-\frac{1}{2}\sigma^2\frac{T}{N_T}t_p + \sigma\sqrt{T/N_T}\,\bar{Z}_{p,t_p}}\right]. \tag{11}$$

It is noteworthy that the exercise time $t_p$ depends on $I$, which in turn boils down to a dependence on the input parameters. However, when executed with AD methods, $t_p$ is treated as an independent variable. When differentiating (11), for example to get $\Delta$, we see that by definition

$$\Delta = \bar{S}_0, \qquad \mathcal{V} = \bar{\sigma}, \qquad \Theta = -\bar{T},$$

and further

$$\Delta = \bar{S}_0 = \frac{\partial V}{\partial S_0} = \frac{1}{N_p}\sum_{p\in I}\left[-e^{-\frac{1}{2}\sigma^2\frac{T}{N_T}t_p + \sigma\sqrt{T/N_T}\,\bar{Z}_{p,t_p}}\right].$$

We can see that if we tried to obtain $\Gamma$ by further differentiating $\Delta$ with respect to $S_0$ in AD, zero would be the result.

This is because of the sheer nature of the comparison: it behaves like a Heaviside step function, which is not differentiable at the step for our AD methods. To tackle this differentiation problem, one possible solution is a pathwise adjoint approach [21]. To be more specific, the discontinuity can be addressed if we neglect the paths on which the stock price is close to the exercise boundary and compute the average over all the remaining paths.

It follows that the adjoints of $b$ are zero, so the comparison in Algorithm 1, line 21, can be neglected when differentiating. What is more, to compute local sensitivities for each path, we interchange the time loop and the path loop so that the paths become separate, hence "pathwise".

The pseudocode is valid at Algorithm 2, line 12, given that $N_p$ is independent of the other parameters, so that partial differentiation with respect to any given variable $x$ interchanges with the average over paths:

$$\frac{\partial\left(\sum_p v_p(x)/N_p\right)}{\partial x} = \frac{\sum_p\left(\partial v_p(x)/\partial x\right)}{N_p}.$$

As discussed in the last two chapters, the adjoint function typically contains an initial tangent sweep. This sweep replicates the original function $F$ (in our case LS) and evaluates the first output $v_p$ (in our case, since we want to keep the value of $V$ from LS, $V_2$ is produced as the second output). It also stores all intermediate derivatives needed for the subsequent adjoint sweep.

Below Algorithm 2 shows the core AAD adaptation of LS:

Algorithm 2 LS(AAD)

Input: S0, σ, T; K, r, Np, NT, (Zp,t).
Output: V1, V2, g.

 1: ((tp), V1) ← F(S0, K, σ, T, r, Np, NT, (Zp,t))                        ▷ LS sweep
 2: V2 ← 0, g ← 0                                                        ▷ initialization for AD
 3: for p = 1 → Np do                                                    ▷ TAD sweep
 4:   (Sp,tp, ∂Sp,tp/∂S0, ∂Sp,tp/∂σ, ∂Sp,tp/∂T) ← h(S0 + ε(1,0,0), σ + ε(0,1,0), T + ε(0,0,1), r, tp, NT, Z̄p,tp)
 5:   (vp, ∂vp/∂Sp,tp) ← (K − Sp,tp) · exp(−r tp T/NT)
 6:   v̄ ← 1                                                             ▷ AAD sweep
 7:   V2 ← V2 + vp
 8:   S̄0 ← (∂Sp,tp/∂S0)(∂vp/∂Sp,tp) v̄,  σ̄ ← (∂Sp,tp/∂σ)(∂vp/∂Sp,tp) v̄,  T̄ ← (∂Sp,tp/∂T)(∂vp/∂Sp,tp) v̄
 9:   g ← g + (S̄0, σ̄, −T̄)^T
10: end for
11: V2 ← V2/Np
12: g ← g/Np

3.2.3 Implementation in Julia

So far we have witnessed how dual numbers can help with AD algorithms. They also inspired Revels et al [22] to develop the ForwardDiff.jl package in the Julia language, amongst other AD tools listed on autodiff.org. Considering Julia's operator-overloading and generic-programming-friendly features and its high performance, we choose this very package to execute our algorithms.
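The key practical requirement is that the pricer be written for generic number types, so that ForwardDiff's dual numbers can flow through it. The sketch below (our addition; it prices a plain European put rather than the American one, purely to keep the example short) illustrates the pattern:

```julia
# A sketch (our addition): a generic Monte Carlo put pricer that ForwardDiff
# can differentiate; duals flow through exp, sqrt and max unchanged.
using ForwardDiff, Random

function mc_put(theta; K = 95.0, r = 0.05, Np = 100_000, seed = 1)
    S0, sigma, T = theta
    Z = randn(MersenneTwister(seed), Np)
    payoffs = max.(K .- S0 .* exp.((r - sigma^2 / 2) * T .+ sigma * sqrt(T) .* Z), 0.0)
    exp(-r * T) * sum(payoffs) / Np
end

g = ForwardDiff.gradient(mc_put, [100.0, 0.25, 180 / 365])
# g = [∂V/∂S0, ∂V/∂σ, ∂V/∂T]; the fixed seed plays the role of frozen MC paths
```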

3.2.4 Numerical results

First we take $S_0 = 100$, $K = 95$, $\sigma = 0.25$, $T = 180/365$, $r = 0.05$, $N_p = 5000$, $N_T = 1000$. We also adopt an FD method [23] to benchmark the results of our AAD approach; as we will see, the results of the AAD and FD methods are indiscernible. Before calculating values via the FD method, we need to decide which step value $h$ to use.

Figure 4: FD’s Convergence towards AAD Sensitivities

We plot the order of magnitude of $h$ against the logarithm of the average relative error across all the sensitivities. In Figure 4 we can see that the curve converges around $h = 10^{-8}$, so we can safely choose this as the step value. The relative errors below are computed at this step value, to further appraise how the AAD method performs compared with FD.


        LS           LS(AAD)       FD            relative error
V       3.83010241   3.82912330    3.82832681    2.080520×10⁻⁴
∆                    -0.31708787   -0.31708891   3.279837×10⁻⁶
vega                 24.7818642    24.7818921    1.125822×10⁻⁶
Θ                    4.88642110    4.88640026    4.264898×10⁻⁶

As discussed in Sections 2.2.2 and 2.2.3, the TAD sweep computes differentials in linear time, generally at the same speed as the FD method, and must be repeated to compute all the differentials. The AAD sweep, on the other hand, requires a preliminary TAD sweep, but then computes all the desired differentials in constant time. So AAD can boost the computation, in a trade-off against a high memory requirement. To confirm this, we run tests with different $N_p$ and $N_T$ values.

Figure 5: Timing of American Put Greeks Computation in FD and AAD Methods

As we can see from Figure 5, the AAD approach is much faster than the conventional bumping method. Furthermore, for this LS algorithm, the running time and the number of grid points have a linear relationship.

3.3 Valuation of CVA and sensitivities

In the previous chapters, we have showcased the mechanism of CVA pricing bit by bit. When it comes to implementation, there are still several points we want to clarify.

3.3.1 AAD implementation within G2++-CIR++ framework

Again we apply the Julia package ForwardDiff.jl for the numerical computation. Algorithm 3, for a function $F$, largely follows the framework presented in [18], but we make adaptations for a different underlying interest rate swap and for the detailed realization. We pipe dual numbers into $F$ and automatically obtain the desired derivatives $\mathbf{g}$.

Algorithm 3 G2++-CIR++

Input: ρij, σ, η, B(0,Ti); T, a, b, κ, µ, ν, z0, J, Np, NT, N, Y(p)i, λ0, f(t), δ(t,T).
Output: CVA.

 1: L ← Cholesky decomposition: (ρij) = LL^T
 2: h ← T/(J·NT)
 3: ρ̄ ← f1(ρij, σ, η)
 4: Ti ← i·T/NT, for each i
 5: for t = 0 → J·NT do
 6:   for i = 0 → NT do
 7:     if J·i ≥ t then
 8:       W(t, Ti) ← f2(a, b, σ, η, ρ̄, t, Ti)
 9:     end if
10:   end for
11: end for
12: for p = 1 → Np do
13:   (Z1(p), Z2(p), Z3(p))^T ← L·(Y1(p), Y2(p), Y3(p))^T
14:   x0(p) ← 0, y0(p) ← 0, z0(p) ← 0
15:   r0(p) ← f(0), λ0(p) ← λ0
16:   D0(p) ← 1, Λ0(p) ← 1
17:   for t = 0 → J·NT − 1 do
18:     x(p)_{t+1} ← (1 − a·h)·x(p)_t + σ·√h·Z1(p)
19:     y(p)_{t+1} ← (1 − b·h)·y(p)_t + η·√h·Z2(p)
20:     z(p)_{t+1} ← z(p)_t + κ(µ − z(p)_t)·h + ν·√(h·z(p)_t⁺)·Z3(p)
21:     r(p)_{t+1} ← x(p)_{t+1} + y(p)_{t+1} + f((t+1)·h)
22:     λ(p)_{t+1} ← z(p)_{t+1} + λ0 + γ((t+1)·h)
23:     for j = 1 → t+1 do
24:       D(p)_j ← D(p)_{j−1}·exp(−h·r(p)_j)
25:       Λ(p)_j ← Λ(p)_{j−1}·exp(−h·λ(p)_j)
26:     end for
27:     for i = 0 → NT do
28:       if J·i ≥ t+1 then
29:         A(p)_{(t+1)h,Ti} ← f3(a, b, x(p)_t, y(p)_t, W(t,Ti), W(0,Ti), W(0,t))
30:         B(p)_{(t+1)h,Ti} ← f4(B(0,Ti), B(0,(t+1)h), A(p)_{(t+1)h,Ti})
31:       end if
32:     end for
33:   end for
34:   V(p)_0 ← 0
35:   for t = 0 → J·NT do
36:     for i = 1 → NT do
37:       if J·i ≥ t then
38:         B(p)_{0,Ti} ← B(0,Ti)
39:         V(p)_t ← V(p)_t + N·(B(p)_{t·h,Ti}·(1 + δ(t·h, Ti)·K) − B(p)_{t·h,T(i−1)})
40:       end if
41:     end for
42:   end for
43: end for
44: CVA ← 0
45: for t = 0 → J·NT do
46:   EE(t·h) ← 0
47:   for p = 1 → Np do
48:     EE(t·h) ← EE(t·h) + λ(p)_t·Λ(p)_t·D(p)_t·(V(p)_t)⁺
49:   end for
50:   EE(t·h) ← EE(t·h)/Np
51:   CVA ← CVA + EE(t·h)
52: end for
53: CVA ← CVA·(1 − R)

Here are some notes regarding Algorithm 3:

Figure 6: Computational Graph for Algorithm 3

1) In the original CIR++ process, as demonstrated in Section 2.4.3, the simulation of $z(t)$ comprises square-root terms like $\sqrt{z}$. However, it is clearly not guaranteed that $z(t) > 0$ for all $t$; here we simply substitute these terms with $\sqrt{z^+}$.

2) A zero bond curve $B(0, T_i)$ can be bootstrapped from market quotes for par swap rates, which have a payment frequency of 6 months for the floating leg and of 1 year for the fixed one, given $T$ and the ACT/ACT day count convention, which the year count fraction $\delta(T_{i-1}, T_i)$ also follows.

3) The default intensity $\lambda_0$ can be bootstrapped from market quotes for credit default swaps; for simplicity we take a constant value.

4) We assume that the underlying single interest rate swap has no exchange of notional at either end of the swap, and we choose the fixed swap rate $K$ equal to the par rate so that the present value of the swap equals zero. We set the payment frequency of both swap legs to 6 months and let the fixing dates coincide with the payment dates.

5) The instantaneous forward rate $f(t)$ is calculated at each step from the zero curve $B(0, T)$.

6) For the recovery rate $R$, research such as Emery et al [24] points out that different banks make split choices of $R$, varying from 5% to 70%. Here we take the middle way, 40%.

3.3.2 Numerical results

Par swap rates for SEK were retrieved from the SEB website at 12:27 on April 30, 2020:

Maturity (years)   1     2     3     4     5     6     7     8     9     10
rate               0.13  0.10  0.11  0.13  0.16  0.20  0.24  0.28  0.32  0.36

The initial term structure $B(0, T_i)$ can thus be derived from the table above; the curves are shown in Figure 8. We set the rest of the parameters as below:

T    a     b    κ    µ    ν    z0     J     K     Np    N     λ0
10   0.06  0.5  0.5  0.1  0.1  0.015  1000  0.36  5000  10⁶   0.07

Again, we need to choose a proper step value $h$ for the FD method. As shown in Figure 9, it is safe to pick $h = 10^{-8}$. Then, via Julia, we present the sensitivities with respect to the correlations, along with the vegas ($\mathrm{CVA}_\sigma$ and $\mathrm{CVA}_\eta$); the relative error is again evaluated at the level $h = 10^{-8}$:

θ                         ρ12         ρ13         ρ23         σ           η
value of θ                -0.5        0.1         0.1         0.001       0.01
CVA_θ (FD)                189.230111  340.899337  101.490920  246005.272  -7923.68531
CVA_θ (AAD)               189.230227  340.899005  101.490841  246005.133  -7923.69235
relative error (×10⁻⁷)    6.840729    9.738945    7.783948    5.650285    8.884755

We can see that the AAD method provides results similar to those of bumping. The delta sensitivities $\mathrm{CVA}_{B(0,T_i)}$ are also given in the table below and in Figure 7, and again the FD and AAD approaches yield almost indiscernible numerical values. Running with different values of $N_p$ and $J$, we obtain in Figure 10 a result similar to that of Section 3.2.4: the AAD approach outperforms FD in terms of computing speed.

θ             B(0,T0)      B(0,T1)      B(0,T2)      B(0,T3)      B(0,T4)
value of θ    0.12680780   0.12680780   0.12680780   0.11553511   0.09894217
CVA_θ (FD)    892.572700   -1747.93371  -1889.41081  -2718.34582  -2981.23788
CVA_θ (AAD)   892.577413   -1747.94950  -1890.49105  -2719.72981  -2981.27411

θ             B(0,T5)      B(0,T6)      B(0,T7)      B(0,T8)      B(0,T9)
value of θ    0.10447453   0.10901995   0.11933023   0.128888566  0.14417061
CVA_θ (FD)    -4617.32801  -3219.18589  -6811.19814  -5716.59185  -7825.14825
CVA_θ (AAD)   -4617.39898  -3219.19403  -6811.20057  -5716.60103  -7825.34921

θ             B(0,T10)     B(0,T11)     B(0,T12)     B(0,T13)     B(0,T14)
value of θ    0.15881323   0.17957476   0.19915844   0.21945733   0.23914154
CVA_θ (FD)    -3012.28415  -4792.78118  -2148.19815  -8194.21851  -4291.15159
CVA_θ (AAD)   -3012.29481  -4793.00137  -2148.19701  -8195.20273  -4291.19263

θ             B(0,T15)     B(0,T16)     B(0,T17)     B(0,T18)     B(0,T19)
value of θ    0.259740981  0.27976773   0.30040909   0.32052114   0.34132540
CVA_θ (FD)    -7182.81157  814.873518   -3417.18587  -217.058195  -2815.62943
CVA_θ (AAD)   -7182.83195  814.872814   -3417.19982  -217.058192  -2815.68104

θ             B(0,T20)
value of θ    0.36161007
CVA_θ (FD)    155138.291
CVA_θ (AAD)   155138.890


Figure 7: CVA Delta Sensitivities
Figure 8: Initial Term Structure and Underlying Par Swap Rates
Figure 9: FD's Convergence towards Sensitivities in AAD
Figure 10: Timing of CVA Sensitivities Computation in FD and AAD Methods

4 Conclusion

In this paper, we have presented advantageous applications of the AAD method. Compared to traditional methods like FD, the AAD method shortens the computational time and improves the accuracy.


We first introduced sensitivities for American options and xVA, then provided detailed mathematical frameworks to model and price these financial derivatives. The AD method, alongside other methods like FD and PD, was discussed with concrete examples.

For both implementations of sensitivities computation, we gave three algorithms related to the previous frameworks. It suffices to say that the AAD method outperforms the FD method by approximately one order of magnitude in terms of computational speed. As for accuracy, AAD computes the exact derivatives, while FD can be problematic, since the step size $h$ must be tuned in advance to control the truncation error.

The models for the sensitivities implementations were selected with caution and by design. The LS algorithm for American put option greeks is based on an American Monte Carlo approach, which employs regressions to estimate conditional expectations. Meanwhile, to calculate CVA sensitivities, we bypassed the regression approach by adopting exogenous G2++ and CIR++ interest rate and credit risk models.

Last but not least, we propose further improvements and extensions of this work:

1) For both implementations, we focused only on the mathematical process and did not give much consideration to memory requirements. In real-world sensitivities computation, however, memory deserves a large share of the deliberation: naively applied AD methods can lead to severe consumption of CPU and GPU resources. Techniques such as checkpointing are worth mentioning as ways to further refine the AAD methods above;

2) The following conclusions contain points similar to those of another thesis [18]. We have only applied the models to basic instruments, i.e. a classic American put option with no dividends and a single fixed-for-floating IRS, recalling that for simplicity's sake we made many assumptions for both. Whether the approach remains robust given, for instance, a portfolio of assets of a more exotic nature and with complicated structures remains in doubt, especially for Algorithm 3, where the running time increases significantly with the size of $J \cdot N_p$.

A regression-based model could again be taken into consideration to scale preferably towards linearity in this scenario. Moreover, we have only dabbled in unilateral CVA, which was more of a focus in pre-crisis financial computation, not to mention that the xVA family reaches far beyond it. As noted in previous chapters, we never actually computed any higher-order sensitivities. The work of optimization could also be given a second thought;

3) The two main branches of financial computation also include the work of calibration, which is overlooked in this paper on AAD applications. Nevertheless, AAD implementations can be of great benefit for such numerical analyses.

References

[1] Phelim P Boyle. "Options: A Monte Carlo Approach". In: Journal of Financial Economics 4.3 (1977), pp. 323–338.


[2] John N Tsitsiklis and Benjamin Van Roy. "Regression Methods for Pricing Complex American-style Options". In: IEEE Transactions on Neural Networks 12.4 (2001), pp. 694–703.

[3] Francis A Longstaff and Eduardo S Schwartz. "Valuing American Options by Simulation: a Simple Least-squares Approach". In: The Review of Financial Studies 14.1 (2001), pp. 113–147.

[4] Emmanuelle Clement, Damien Lamberton, and Philip Protter. "An Analysis of a Least Squares Regression Method for American Option Pricing". In: Finance and Stochastics 6.4 (2002), pp. 449–471.

[5] Antoine Savine. Modern Computational Finance: AAD and Parallel Simulations. John Wiley & Sons, 2018, pp. 425–426.

[6] Paul Glasserman. Monte Carlo Methods in Financial Engineering. Vol. 53. Springer Science & Business Media, 2013.

[7] Eduardo S. Schwartz. "The Valuation of Warrants: Implementing a New Approach". In: Journal of Financial Economics 4.1 (Jan. 1977), pp. 79–93.

[8] Luca Capriotti. "Fast Greeks by Algorithmic Differentiation". In: Journal of Computational Finance 14.3 (2011), pp. 3–35.

[9] Philipp HW Hoffmann. "A Hitchhiker's Guide to Automatic Differentiation". In: Numerical Algorithms 72.3 (2016), pp. 775–811.

[10] W.K. Clifford. "Preliminary Sketch of Biquaternions". In: Proceedings of the London Mathematical Society 4 (1871), pp. 381–395.

[11] Dan Kalman. "Doubly Recursive Multivariate Automatic Differentiation". In: Mathematics Magazine 75.3 (2002), pp. 187–202.

[12] Jeffrey Fike and Juan Alonso. "The Development of Hyper-dual Numbers for Exact Second-derivative Calculations". In: 49th AIAA Aerospace Sciences Meeting including the New Horizons Forum and Aerospace Exposition. 2011, p. 886.

[13] Damiano Brigo, Massimo Morini, and Andrea Pallavicini. Counterparty Credit Risk, Collateral and Funding: with Pricing Cases for All Asset Classes. Vol. 478. John Wiley & Sons, 2013.

[14] Samim Ghamami and Lisa R Goldberg. "Stochastic Intensity Models of Wrong Way Risk: Wrong Way CVA Need Not Exceed Independent CVA". In: The Journal of Derivatives 21.3 (2014), pp. 24–35.

[15] Luca Capriotti and Shinghoi Lee. "Adjoint Credit Risk Management". In: Risk Magazine (Aug. 2014).

[16] Luca Capriotti, Yupeng Jiang, and Andrea Macrina. "AAD and Least-Square Monte Carlo: Fast Bermudan-style Options and XVA Greeks". MA thesis. 2017, pp. 35–49.

[17] Vytautas Savickas et al. "Super Fast Greeks: An Application to Counterparty Valuation Adjustments". In: Wilmott 2014.69 (2014), pp. 76–81.

[18] "Computing Sensitivities of CVA Using Adjoint Algorithmic Differentiation". MA thesis. Oxford University, 2015.

[19] Damiano Brigo and Fabio Mercurio. Interest Rate Models – Theory and Practice: with Smile, Inflation and Credit. Springer Science & Business Media, 2007.


[20] Damiano Brigo and Andrea Pallavicini. "Counterparty Risk and Contingent CDS Valuation under Correlation Between Interest-Rates and Default". In: SSRN Electronic Journal 926067 (2006).

[21] Jens Deussen, Viktor Mosenkis, and Uwe Naumann. Fast Estimates of Greeks from American Options: A Case Study in Adjoint Algorithmic Differentiation. Tech. rep. AIB-2018-02, RWTH Aachen University, 2018.

[22] Jarrett Revels, Miles Lubin, and Theodore Papamarkou. "Forward-Mode Automatic Differentiation in Julia". In: arXiv:1607.07892 (2016).

[23] Mark Richardson. "Numerical Methods for Option Pricing". In: University of Oxford, special topic (2009).

[24] Kenneth Emery, Sharon Ou, and J Tennant. "Corporate Default and Recovery Rates, 1920-2009". In: Moody's Investors Service (2010).
