
510 Lecture Notes

Phil Rubin

Under continuous revision.

Last revised: April 26, 2012


Contents

1 Computers
  1.1 Introduction
  1.2 CPU
    1.2.1 Speed
    1.2.2 Size
  1.3 Memory
    1.3.1 Main Memory
    1.3.2 Cache Memory
    1.3.3 Swap Memory
  1.4 Operating System
  1.5 Numbers
    1.5.1 Integers
    1.5.2 Real Numbers
    1.5.3 Machine Accuracy

2 Useful Theorems
  2.1 Intermediate Value Theorem
    2.1.1 Corollary
  2.2 Fundamental Theorem of Calculus
    2.2.1 First Part
    2.2.2 Corollary or Second Part
  2.3 Integration by Parts
  2.4 Rolle's Theorem
  2.5 Mean-Value Theorem
    2.5.1 Derivative Formulation
    2.5.2 Integral Formulation
  2.6 Taylor's Theorem

3 Errors and Uncertainties
  3.1 Introduction
  3.2 Types
    3.2.1 Mistakes/Blunders
    3.2.2 Random Errors
    3.2.3 Systematic Errors
  3.3 Rounding Errors
    3.3.1 Subtraction
    3.3.2 Multiplication
    3.3.3 Repeated Operations
  3.4 Algorithmic Errors

4 Interpolation
  4.1 Introduction
  4.2 Simple, Two-Point Interpolation
  4.3 Interpolating between Equal Increments of the Independent Variable
  4.4 Interpolating between Unequal Increments of the Independent Variable
    4.4.1 Divided Differences
    4.4.2 Lagrange's Interpolations
  4.5 Spline
  4.6 Inverse Interpolation
  4.7 Two-way Interpolation

5 Solving Equations Numerically
  5.1 Transcendental Equations
    5.1.1 The Method of "Regula Falsi"
    5.1.2 The Newton-Raphson Method
    5.1.3 The Method of Iteration
  5.2 Roots of Polynomials
    5.2.1 Graeffe's Root-Squaring Method

6 Differentiation
  6.1 Introduction
  6.2 Forward Difference
    6.2.1 Example
  6.3 Central Difference
  6.4 Quantifying the Uncertainties
  6.5 Second Derivatives

7 Integration
  7.1 Introduction
  7.2 Newton-Cotes Methods
    7.2.1 Trapezoidal Rule
    7.2.2 Simpson's Rule
    7.2.3 Optimizing the Number of Regions
  7.3 Gaussian Quadrature

8 Ordinary Differential Equations
  8.1 First-Order ODE
    8.1.1 Homogeneous Equation
    8.1.2 Non-homogeneous Equation
  8.2 Higher-Order ODE
  8.3 Solving ODEs Numerically
    8.3.1 Order of Accuracy and Truncation Error
    8.3.2 Stability
  8.4 Euler Method
  8.5 The Leap-Frog Method
  8.6 The Runge-Kutta Method
  8.7 The Predictor-Corrector Method
  8.8 The Intrinsic Method

9 Partial Differential Equations
  9.1 Boundary Conditions
  9.2 Classifying PDEs
  9.3 Elliptical Equations
    9.3.1 Analytical Solution
    9.3.2 Numerical Solutions
  9.4 Hyperbolic Equations
    9.4.1 Analytical Solutions
    9.4.2 Numerical Solutions
  9.5 Parabolic Equations

10 Matrices
  10.1 Introduction
  10.2 Classification
  10.3 Matrix Addition
  10.4 Matrix Multiplication
  10.5 Solving Systems of Equations
    10.5.1 Standard Exact Methods
    10.5.2 Iterative Methods
  10.6 Eigenvalue Problems

11 Monte Carlo
  11.1 Random Number Generators
  11.2 Sampling Distributions
  11.3 Numerical Integration
  11.4 The Metropolis Algorithm

12 Fourier Analysis
  12.1 Introduction
  12.2 Fourier Series
  12.3 Fourier Integrals
  12.4 Discrete Fourier Transformation
  12.5 Fast Fourier Transformation

13 Time Series Analysis
  13.1 Introduction
  13.2 White Noise
  13.3 Decomposition
  13.4 Isolating Trends: Smoothing Techniques
    13.4.1 Averaging
    13.4.2 Exponential Smoothing
  13.5 Isolating the Primary Periodicity
  13.6 Isolating Secondary Periodicity
  13.7 Isolating an Offset


List of Figures

5.1 Graphical solution for xe^x = 1.



List of Tables

4.1 Difference Table.
9.1 Common boundary conditions for 2nd-order PDEs.
9.2 Linear, two-dimensional, 2nd-order PDE classification scheme.

Chapter 1

Computers

1.1 Introduction

For our purposes, a computer is a Central Processing Unit (CPU) and storage (memory), all made to work by an operating system (OS). By work we mean move and modify information, represented in the CPU and memory by bits, sequences of zeros and ones. Information is stored in memory and altered and directed by the CPU. The OS is the interface between all of this and the user.

1.2 CPU

A CPU is characterized by two parameters: speed and chip (or bus) size.

1.2.1 Speed

Each CPU operation takes place in steps. A clock synchronizes these steps, so that things taking place in various sections of the CPU are coordinated. Obviously, faster clocks permit more steps per unit time (within the limits of the CPU components to perform their actions during an operation step). More steps in a unit time means faster performance of the operation. However, smart operation algorithms reduce the number of steps required, also reducing the performance time. Different CPUs are differently attuned to various forms of algorithm optimization. Thus, to get the most efficiency, one should know what any given CPU is good at and what it's not good at.

1.2.2 Size

Size refers to the number of bits that can be moved around, typically between memory and CPU, simultaneously. Data transfer requires one or more steps per transaction, so the number of steps required to transfer information is roughly inversely proportional to the chip (or bus) size. Too large a transfer, on the other hand, can cause management and handling problems, so optimization is necessary here, too.


1.3 Memory

The CPU transfers data between three storage areas: main memory, cache memory, and swap memory. Flow is coordinated by a memory controller, which is part of a collection of electronic devices (known as a chip set) that supports the CPU and is located around it on the so-called mother-board.

1.3.1 Main Memory

Most of the information CPUs use is stored in so-called random access memory (RAM), usually located on the mother-board. This is dynamic storage, made of capacitors and metal-oxide-semiconductor field-effect transistors (MOSFETs). It's called dynamic because it must be read and restored (refreshed) continuously, at a rate of roughly half a kilohertz. This technology is very cheap at present (in 1975, the price of RAM was 0.825 cents/bit; in 2010, the price was 2 × 10^−9 cents/bit, a decrease over 35 years of more than 8 orders of magnitude), but it is slow, due to the method of retrieval, which requires a refresh cycle, and it is a heavy user of computer cycles due to refreshing.

1.3.2 Cache Memory

Also on the mother-board, usually closer to the CPU, is static memory. Bits are stored on flip-flops, made up of 8 transistors which form 2 NAND gates. Access is quick, and no refreshing is necessary, but static memory is expensive, particularly relative to dynamic memory.

Computer lore has it that in any operation, of all the information that is needed or processed, only 20% requires access 80% of the time. Operating systems know to put small amounts of frequently utilized data into cache, to speed things up.

1.3.3 Swap Memory

Information rarely accessed (say, no more than once a minute) is stored off the mother-board, typically on magnetic storage units. This storage is referred to as swap.

1.4 Operating System

The operating system manages computer hardware and interfaces with the user through applications. The operations the CPU performs on its own are referred to as the instruction set, a finite number (around a hundred or so) of short operations (roughly a half-dozen steps each). Assembly language programming permits the user direct access to these.

1.5 Numbers

The computers we use operate digitally; that is, data is discrete or discontinuous, as opposed to analog or continuous. This doesn't mean the information being represented is necessarily discrete, but that the representation is. And herein lies one limit to the precision of a computer. A second limit comes from the bus size: the range of numbers (of both integers and reals) and the number of significant figures (reals) are all represented in bits (binary numbers, in base 2) whose quantity can't exceed this size. Operations that approach these limits are susceptible to various sorts of errors.

1.5.1 Integers

Integers, by definition having no decimal places, are not limited in precision, but they are limited in range, to 2^nbits numbers: from 0 to 2^nbits − 1 (unsigned), or from −2^(nbits−1) to 2^(nbits−1) − 1 (signed). Computers reserve the highest-order bit for the sign of signed integers: if it is 0, the number is positive and its value is that of the nbits − 1 other bits; if it is 1, the number is negative and its value is −2^(nbits−1) plus the value of the nbits − 1 remaining bits.

Integer Example  For a hypothetical four-bit machine, nbits = 4:

    number   unsigned   signed
     0000        0         0
     0001        1         1
     0010        2         2
     0011        3         3
     0100        4         4
     0101        5         5
     0110        6         6
     0111        7         7
     1000        8        -8
     1001        9        -7
     1010       10        -6
     1011       11        -5
     1100       12        -4
     1101       13        -3
     1110       14        -2
     1111       15        -1
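A quick way to reproduce the table is the following minimal Python sketch (the helper name signed_value is ours, not part of the notes); it interprets a bit pattern both as unsigned and as signed (two's-complement):

    # Interpret an nbits-long bit pattern as unsigned and as signed (two's complement).
    def signed_value(bits, nbits=4):
        unsigned = int(bits, 2)            # value of all nbits taken as unsigned
        if bits[0] == '1':                 # highest-order bit set: negative number
            return unsigned - 2**nbits     # -2^(nbits-1) plus the remaining bits
        return unsigned

    for pattern in ['0111', '1000', '1111']:
        print(pattern, int(pattern, 2), signed_value(pattern))
    # prints: 0111 7 7,  1000 8 -8,  1111 15 -1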

Recall or understand that a string of eight bits constitutes one byte, and that four bytes (32 bits) form a word. Then, depending on machine architecture, integers may be defined in up to six different ways (each of which may be signed or unsigned), according to the number of bits which compose it: nibble or semioctet (4 bits, used to encode decimal digits); byte, octet, or character (8 bits = 1 byte, used for ASCII characters); halfword or short (16 bits = 2 bytes); word, long, or integer (32 bits = 4 bytes = 1 word); doubleword, long long, or quad (64 bits = 8 bytes = 2 words, used for very large numbers or very high precision); octaword or double quad (128 bits, rare, requiring special compilers). A Boolean or logical value, a 1-bit number (1 = true; 0 = false), is also an integer, but obviously can't be signed.

An interesting note with regard to all of this: the default integer is 32 bits, allowing for a range of 4,294,967,296 values, or 4 Giga-values. It's no accident that many operating systems limit memory access to 4 Gbytes (addresses are integers). Many counters are also integer values. Computer time is often relative to a certain start date, 1 Jan 00:00, 1900 or 1970; the number of seconds since 1900 crosses the 4 Giga-value in 2036, and the number of seconds since 1970 crosses the signed 32-bit limit (2 Giga-values) in 2038.

1.5.2 Real Numbers

The presence of a decimal point distinguishes a real number from an integer, and computers handle the decimal point in either of two ways: fixed-point or floating-point. The first is normally used in business, in particular accounting, applications, where the number of decimal places is usually constrained. Science and engineering computing employs the second almost exclusively. There are no unsigned real numbers.

Fixed-point Notation

Let N be the number of bits available to store numbers, m the number of places to the right of the decimal point and n the number of places to the left of the decimal point, each determined by machine and application; then the following constraint holds: m + n = N − 2. Another bit is used for the sign, and the last bit holds the one's place (2^0) value:

x_fixed = (−1)^s × (b_n 2^n + b_{n−1} 2^{n−1} + … + b_0 2^0 + b_{−1} 2^{−1} + … + b_{−(m−1)} 2^{−(m−1)} + b_{−m} 2^{−m})

where s is the sign bit. Thus, assuming no overflow, the absolute error is 2^{−m−1}, the first term cut off on the right of the decimal point. The absolute error, then, is fixed too; for small numbers it will be relatively large, and relative error is the more important measure of precision.

Floating-point Notation

In floating-point arithmetic, two or three types are typically available: single-precision (4 bytes), double-precision (8 bytes, or two words), and quad-precision (16 bytes, only on 64-bit machines with special compilers and compiler options). Memory allocation depends on the architecture and precision, but assumes this generic form:

x_floating = (−1)^s × mantissa × 2^{exponent − bias}

Again, s is the sign bit. The mantissa is the fractional or decimal part of the number. The exponent is increased by the bias so that it is positive definite. The architecture and precision determine how many bits are allocated to each of the mantissa, exponent, and bias.

In single-precision, 23 bits are allocated for the mantissa, which gives 6 or 7 decimal places (1/2^23). For most 32-bit machines, the exponent is an 8-bit unsigned integer (0:255) and the bias is set to 127, so the range of values goes roughly between 2^{−128} and 2^{127}. In scientific notation, the range of single-precision numbers may legitimately span from about 1.17549435 × 10^{−38} to 3.40282347 × 10^{38}.

Double-precision usually allocates 52 bits to the mantissa, so it permits 16 decimal places (1/2^52). With one bit allocated to the sign, 11 bits remain for the exponent (0:2047), and the bias is 1023. The span of numbers is then from about 2.2250738585072014 × 10^{−308} to 1.7976931348623157 × 10^{308}.
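As a sketch of this layout, the standard Python struct module can be used to pull the sign, biased exponent, and mantissa fields out of a single-precision number (the helper name float_fields is our own illustration, not from the notes):

    import struct

    # Unpack an IEEE single-precision float into sign, biased exponent, and mantissa.
    def float_fields(x):
        bits = struct.unpack('>I', struct.pack('>f', x))[0]   # the 32 raw bits
        sign = bits >> 31
        exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent
        mantissa = bits & 0x7FFFFF          # 23-bit fractional part
        return sign, exponent, mantissa

    s, e, m = float_fields(-6.25)
    # reconstruct the value: (-1)^s * (1 + m/2^23) * 2^(e - 127)
    print(s, e, m, (-1)**s * (1 + m / 2**23) * 2.0**(e - 127))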

Page 12: 510 Lecture Notes - George Mason Universityphysics.gmu.edu/~rubinp/courses/510/510notes.pdf · 2012. 4. 26. · 510 Lecture Notes Phil Rubin Under continuous revision. Last revised:

12 CHAPTER 1. COMPUTERS

1.5.3 Machine Accuracy

As you can see, there are bit limits to any representation of a number, and these translate into inherent limits on the accuracy of any calculation. First of all, decimal limitations imply that real numbers are not continuous. Going out of bounds of the allowable ranges results in the computer telling you that your result is Inf (infinity) or NaN (Not a Number). Calculations within the limits, but requiring additional decimal places, are rounded. To repeat: the representation in memory of every calculation's result includes roundoff errors. The default rounding algorithm is to round to the closest representable number with the limited number of decimal places. In this case, the roundoff error is somewhere between zero and half the value of the least significant decimal place.

If the roundoff error is bigger than the number itself, you encounter a condition known as underflow. This condition is somewhat ameliorated nowadays by a provision called subnormal numbers, in which the precision of the representation is conditionally extended so that your algorithm doesn't blow up; but such subnormal numbers, when they appear in operations with normal numbers, make no difference (that is, they act like 1 in multiplication and 0 in addition, thus returning the normal number regardless of the magnitude of the subnormal number). Operations between two subnormal numbers tend to lead to NaN results.

The difference between a number, say 1, and the next larger number the computer can represent is therefore finite (and not infinitesimal, as would be the case if the representation were continuous). In computer lingo, this difference is called the machine accuracy, labeled ε_M, and it is numerically equal to the reciprocal of 2 to the power of the number of mantissa bits:

ε_M = 2^{−N_mantissa bits}
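A common way to see ε_M directly is to halve a trial ε until 1 + ε is no longer distinguishable from 1; a minimal Python sketch:

    # Find the machine accuracy by halving eps until 1 + eps rounds back to 1.
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    print(eps)   # about 2.22e-16, i.e. 2^-52 for the 52 mantissa bits of double precision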


Chapter 2

Useful Theorems

2.1 Intermediate Value Theorem

Let f(x) be a continuous function on the closed real interval [a, b]. Given a number y and x_1, x_2 ∈ [a, b] such that f(x_1) ≤ y ≤ f(x_2), then ∃ c ∈ [a, b] such that y = f(c).

2.1.1 Corollary

Let f(x) be a continuous function on the finite, closed real interval [a, b]. Then f(x) has maximum and minimum values on [a, b]; i.e., ∃ x_1, x_2 ∈ [a, b] such that f(x_1) ≤ f(x) ≤ f(x_2) for all x ∈ [a, b].

2.2 Fundamental Theorem of Calculus

2.2.1 First Part

Let f(x) be a continuous function on the closed real interval [a, b], and let F(x) be the function defined, for all x ∈ [a, b], by F(x) = ∫_a^x f(t) dt. Then F(x) is continuous and differentiable on [a, b], and F′(x) = f(x) for all x ∈ [a, b].

2.2.2 Corollary or Second Part

Let f(x) be a continuous function on [a, b], and g(x) an antiderivative of f(x), f(x) = g′(x) on [a, b]. Then ∫_a^b f(x) dx = g(b) − g(a).

2.3 Integration by Parts

Let f(x) and g(x) be real-valued functions whose derivatives are continuous on [a, b]. Then

∫_a^b f′(x) g(x) dx = [f(x) g(x)]_{x=a}^{x=b} − ∫_a^b f(x) g′(x) dx.


2.4 Rolle's Theorem

Let f(x) be a continuous function on the closed real interval [a, b] and differentiable on the open interval (a, b). If f(a) = f(b) = 0, then ∃ c ∈ (a, b) such that f′(c) = 0.

2.5 Mean-Value Theorem

2.5.1 Derivative Formulation

Let f(x) be a continuous function on the closed real interval [a, b], where a < b, and differentiable on the open interval (a, b). Then ∃ c ∈ (a, b) such that f′(c) = [f(b) − f(a)] / (b − a).

2.5.2 Integral Formulation

Let g(x) be a non-negative function integrable on the closed real interval [a, b]. If f(x) is a continuous function on [a, b], then ∃ c ∈ (a, b) such that ∫_a^b f(x) g(x) dx = f(c) ∫_a^b g(x) dx.

2.6 Taylor's Theorem

Let k ≥ 1 be an integer and f(x) a real-valued function k times differentiable at x = a. Then there exists a real-valued function R_{k+1}(x) such that

f(x) = f(a) + [f′(a)/1!](x − a) + [f″(a)/2!](x − a)² + ⋯ + [f^{(k)}(a)/k!](x − a)^k + R_{k+1}(x),

where R_{k+1}(x)/(x − a)^k → 0 as x → a. For some real number b between x and a,

R_{k+1}(x) = [f^{(k+1)}(b)/(k + 1)!](x − a)^{k+1}

is the remainder, or approximation or truncation error.
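A small numerical check of the theorem, using f(x) = sin x about a = 0 as our own illustration (all derivatives of sin are bounded by 1, so the remainder bound is simply |x|^{k+1}/(k+1)!):

    import math

    # Degree-5 Taylor polynomial of sin(x) about a = 0, compared with the remainder bound.
    x, k = 0.5, 5
    poly = sum((-1)**n * x**(2*n + 1) / math.factorial(2*n + 1) for n in range(3))
    actual_error = abs(math.sin(x) - poly)
    bound = x**(k + 1) / math.factorial(k + 1)   # |R_{k+1}| <= |x|^{k+1}/(k+1)!
    print(actual_error, bound)                   # the actual error sits below the bound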


Chapter 3

Errors and Uncertainties

3.1 Introduction

Calculations end up with results, just like experiments. And just like experiments, there are limits to the accuracy (proximity to the 'true' value) and precision (exactness without regard to the 'true' value) of the results. In terms of computations, accuracy is a measure of the quality of the algorithm and its implementation, and precision is a measure of the limits imposed by random errors. As with experiments, it's important to quantify these uncertainties; otherwise, the results are meaningless.

We must assume that any number used in a floating-point calculation is inexact, either at input (conversion) or during calculations due to decimal-place limitations (rounding). Consider a floating-point number x. It is rounded, so that its stored value is x(1 + ε), where ε is the relative error. If x is operated on by some function f, then instead of computing f(x), the operation actually computes f[x(1 + ε)], and the absolute error of the operation is (assuming f is differentiable in the neighborhood of x):

f(x + xε) − f(x) ≈ ε x f′(x)

and the relative error is therefore

[f(x + xε) − f(x)] / f(x) ≈ ε x f′(x) / f(x)

Since ε tends to be small, concern doesn't arise unless x happens to be very large, or f(x) ∼ 0. Furthermore, solving expressions in terms of series is a standard computational technique, and a slowly converging series requires many terms, so that even small uncertainties begin to mount up.

3.2 Types

We won't repeat here all that you've gone over in laboratory classes, but concentrate on those uncertainties most relevant to computations.


3.2.1 Mistakes/Blunders

Try to avoid these, and fix them when you discover them.

1. Programming mistakes.

2. Typographical errors in program or data.

3. Running the wrong program.

4. Using wrong data file.

5. Using the wrong algorithm.

3.2.2 Random Errors

Hard to detect in computations, and almost impossible to control.

1. Glitches.

2. Corruptions.

3. Cosmic rays.

4. Round off (if closest-value or truncation procedure followed)

3.2.3 Systematic Errors

These are related to architecture, operating system, algorithms, and so on. They are hard to identify after the fact.

1. Round off (if uni-directional procedure followed).

2. Approximations in algorithm.

Round off errors come from limits on the number of decimal places, imposed by the word and bus size of the computer. When real numbers are involved, the number of machine bits available is much less than the number of decimal places required to precisely represent the numbers. The more operations done with such numbers, the worse the situation gets, and the algorithm can become unstable (giving NaN results or crashing). If the round off error becomes bigger than the numbers involved, then the results are, in the lingo, garbage.

Approximations must often be employed to make the problem soluble (even if not accurately so). For example, an infinite series will be replaced with a finite sum, so that f(x), representing an infinite series, is replaced by f(x) + ε(x, N), where N is the number of terms of the truncated series and ε(x, N), called the algorithmic error, remainder, or truncation error, is the ignored part of the infinite series. Note that the missing piece depends, to a first approximation, on the value at which the series is evaluated and on the number of terms. It is frequently the case that the algorithmic error is small as long as N ≫ x, but this, obviously, depends on the nature of the series.
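As a sketch of this idea (the series and names are our own illustration), replace the infinite exponential series by its first N terms and watch the truncation error shrink once N ≫ x:

    import math

    # Truncation (algorithmic) error of the N-term partial sum of the exponential series.
    def exp_partial(x, N):
        return sum(x**n / math.factorial(n) for n in range(N))

    x = 2.0
    for N in (4, 8, 16):
        print(N, abs(math.exp(x) - exp_partial(x, N)))
    # the ignored tail epsilon(x, N) drops rapidly once N >> x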


3.3 Rounding Errors

3.3.1 Subtraction

z = x − y ⇒ z(1 + ε_z) = x(1 + ε_x) − y(1 + ε_y)

1 + ε_z = (x/z)(1 + ε_x) − (y/z)(1 + ε_y)

Thus, if x ≈ y ≫ z,

ε_z ≈ (x/z)(ε_x − ε_y),

which approaches an undefined, unstable condition as z → 0.
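A quick Python illustration of this subtractive cancellation (the numbers are our own, chosen so that x ≈ y ≫ z):

    # Subtracting two nearly equal numbers amplifies their relative errors.
    x = 1.0 + 1.0e-12      # x and y agree to about twelve decimal places
    y = 1.0
    z = x - y              # the "true" answer is 1.0e-12
    print(z, abs(z - 1.0e-12) / 1.0e-12)   # relative error ~1e-4, not the usual ~1e-16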

3.3.2 Multiplication

z = x × y ⇒ z(1 + ε_z) = x(1 + ε_x) × y(1 + ε_y)

1 + ε_z = (1 + ε_x)(1 + ε_y) ≈ 1 + ε_x + ε_y

ε_z ≈ ε_x + ε_y

Since the ε's can be signed, the magnitude of ε_z may be bigger or smaller than that of either ε_x or ε_y.

3.3.3 Repeated Operations

If we assume closest-value rounding, the uncertainty of each term in an operation will be randomly distributed, and the overall uncertainty after some number of repetitions of the operation can be estimated by considering the effect to be a random walk. In this case, the "wandering" is proportional to the uncertainty of the operation, with a proportionality constant equal to the square root of the number of repetitions of the operation: R = √N σ_r. Operational uncertainties pile up each time the operation is carried out. Operations such as subtraction and multiplication are subject to a cumulative round-off error:

ε_ro ≈ √N ε_z

An algorithm whose operations are repeated many times had better limit the relative uncertainty of each operation to much less than 1/√N, or the accumulated uncertainty will be bigger than the value itself.

If an operation's uncertainties are not only due to random roundoff, the situation can be even worse. Recursive operations may have uncertainties that go as N or even N!.

3.4 Algorithmic Errors

In numerical integration, series summation, and anything of this sort, common in computational physics, the algorithms used can be characterized by a step size h or by the number of steps N. Assuming no rounding errors, the presumption is that if h → 0 or N → ∞, then the result would be exact. But the fact is that h or N has to be finite, and so the result is inexact.

To quantify this, the uncertainty could reasonably be assumed to go as a power of the step size, or as an inverse power of the number of steps:

ε_alg ≈ α h^β    or    ε_alg ≈ α / N^β,

where α and β depend on the details of the algorithm, and may not even be constants until h → 0 or N → ∞.

Let's consider the step-number characterization for now, and recall that for rounding error,

ε_ro ≈ √N ε_z ≈ √N ε_M,

so that the per-operation uncertainty is approximated by the machine accuracy. Since both effects are present in any floating-point operation, we get a total uncertainty:

ε_tot = ε_alg + ε_ro ≈ α / N^β + √N ε_M

Notice the implication: one uncertainty goes inversely as the other, which means that there may be an optimal number of steps for minimizing the total uncertainty. One can see if this is the case by testing the algorithm on a problem with an analytic solution, and plotting (on a log-log scale) the difference between the exact and calculated values against the number of steps. At very large N, the slope of the curve should approach 1/2, while at small N, the slope should be −β and the magnitude proportional to α, so the uncertainty of the algorithm can, in principle, be described quantitatively.


Chapter 4

Interpolation

4.1 Introduction

Often, an experiment results in a collection of ordered-pair data, which, in principle, can be entered as a two-column table of values, one column for the independent variable, say x, and another for the dependent variable, say y. In most physical cases, the relationship between these can be assumed to be smooth and continuous, even though, necessarily, measurements are made and recorded at discrete values of the independent variable x. The first thing to do with data is to visualize it with a y vs. x scatter-plot. You want to see if the assumption of "good behavior" is well-founded, and then, more fundamentally, to see how to treat it.

A reasonable relationship between y and x suggests that it may be summarized as a function, y = f(x), which may be known, determined, or assumed. But let's suppose, for the moment, all you really need from the data is a value for the dependent variable y at a given value of the independent variable x which is not among those recorded in the experiment. If you know or have determined the function relating the two variables, then you can plug in the x of interest and get y. Otherwise, a simple procedure is to draw a smooth curve through the data points and then read off the value of y at x from the graph. Alternatively, you can take the y value of the x closest to the value of interest. Obviously, the uncertainty in either of these approaches is likely to be large, but for some purposes it might be perfectly adequate. A somewhat more quantitative approach, which nonetheless foregoes the heavy computation that can be required to fit the data, is interpolation, which, like fitting, relies on the assumption of continuity between data points and a reasonably strong relationship between points, at least locally, near the region of interest.

Interpolation uses a subset of the data points in the vicinity of the value x of interest. The more points used, the higher the order, n, of the interpolation. Increasing the order does not necessarily increase the accuracy of the interpolation, and, since most interpolation methods require roughly n! operations, the efficiency of the interpolation drastically decreases with increasing order.

Interpolation methods can be used for extrapolation (values beyond the data collected), but doing so has its dangers, unless the form of f(x) is known beyond the data. All the methods suffer from truncation error, which increases with the difference between the value x of interest and its nearest data point.

4.2 Simple, Two-Point Interpolation

If, in the region of interest, the data look to lie approximately on a line, simple, two-point interpolation may be adequate. If y is the value sought for a value x between x_1 and x_2, then that y can be found from:

(y − y_1)/(x − x_1) = (y_2 − y_1)/(x_2 − x_1) = m

y = y_1 + m(x − x_1)

The uncertainty of the interpolated value is proportional to the square of the size of the interval.
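A minimal sketch of the two-point formula (the function name and test values are ours):

    # Simple two-point (linear) interpolation between (x1, y1) and (x2, y2).
    def interp2(x, x1, y1, x2, y2):
        m = (y2 - y1) / (x2 - x1)      # slope of the chord
        return y1 + m * (x - x1)

    print(interp2(2.5, 2.0, 4.0, 3.0, 9.0))   # 6.5, halfway between 4 and 9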

4.3 Interpolating between Equal Increments of the Independent Variable

Let's assume first that measurements were made at equal intervals of the independent variable x, the interval size being δx,

δx ≡ x_1 − x_0 = x_2 − x_1 = ⋯ = x_n − x_{n−1}.

Our approach is to form, on the basis of our data, a difference table [see Table 4.1], where the differences are given by the formula¹

Table 4.1: Difference Table.

    x     y     ∆1      ∆2      ∆3      ∆4      ∆5      ∆6
    x0    y0
    x1    y1    ∆1y0
    x2    y2    ∆1y1    ∆2y0
    x3    y3    ∆1y2    ∆2y1    ∆3y0
    x4    y4    ∆1y3    ∆2y2    ∆3y1    ∆4y0
    x5    y5    ∆1y4    ∆2y3    ∆3y2    ∆4y1    ∆5y0
    x6    y6    ∆1y5    ∆2y4    ∆3y3    ∆4y2    ∆5y1    ∆6y0

∆^r y_n = ∆^{r−1} y_{n+1} − ∆^{r−1} y_n
        = y_{n+r} − r y_{n+r−1} + [r(r − 1)/2!] y_{n+r−2} + … + (−1)^r y_n
        = Σ_{m=0}^{r} (−1)^m (r choose m) y_{n+r−m},                          (4.1)

where (r choose m) is the binomial combinatoric symbol,

(r choose m) ≡ r! / [(r − m)! m!].

¹Note the alternating signs in the series. To check that you've got it right, the sum of the entries in any column equals the difference between the first and last entries of the previous column.

This method works because, usually, the differences at some order become zero or very small. For example, if f(x) is a polynomial of the nth degree, the differences in the ∆^n column are all the same, ensuring that the differences in the ∆^{n+1} column and beyond are all zero.

If x_k and y_k make an ordered pair of values in Table 4.1, in the case that the constant independent-variable interval is δx, then we define

u ≡ (x − x_k)/δx,                                                             (4.2)

for any x not in the table, so that x = x_k + u δx. In terms of u, Newton's interpolation formula for the value of y at x is

y = y_k + u ∆y_k + [u(u − 1)/2!] ∆²y_k + [u(u − 1)(u − 2)/3!] ∆³y_k + …
      + [u(u − 1)(u − 2)⋯(u − r + 1)/r!] ∆^r y_k.                             (4.3)

You will notice that Equation 4.3 picks out differences lying along a diagonal in the difference table, starting with y_k. It should therefore be used for interpolation of values near the beginning of the data set. For interpolation near the end of the set, another formula is better:

y = y_k + u ∆y_{k−1} + [u(u + 1)/2!] ∆²y_{k−2} + [u(u + 1)(u + 2)/3!] ∆³y_{k−3} + …
      + [u(u + 1)(u + 2)⋯(u + r − 1)/r!] ∆^r y_{k−r}.                         (4.4)

In either case, the summation should be continued until the resolution (number of significant figures) you desire has been reached. Unless it is clear that f(x) is a continuous function extending beyond the data range, you should be very careful when extrapolating beyond either end of that range.
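A sketch of Equation 4.3 for equally spaced data, building the table from the first point (k = 0); the function name and test data are our own illustration:

    # Newton forward-difference interpolation (Eq. 4.3) on equally spaced data.
    def newton_forward(xs, ys, x):
        dx = xs[1] - xs[0]
        u = (x - xs[0]) / dx                     # Eq. 4.2 with k = 0
        col, diffs = list(ys), []
        for r in range(1, len(ys)):              # forward differences Delta^r y_0
            col = [col[i + 1] - col[i] for i in range(len(col) - 1)]
            diffs.append(col[0])
        y, coeff = ys[0], 1.0
        for r, d in enumerate(diffs, start=1):
            coeff *= (u - (r - 1)) / r           # u(u-1)...(u-r+1)/r!
            y += coeff * d
        return y

    # quadratic data y = 1 + x^2: the interpolation is exact
    print(newton_forward([0, 1, 2, 3], [1, 2, 5, 10], 1.5))   # 3.25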

4.4 Interpolating between Unequal Increments of the Independent Variable

In the case when δx is not constant, the procedure just outlined is obviously inapplicable. Instead, either of two approaches, "divided differences" or Lagrange's interpolation formula, is used. Neither is very precise, particularly given the amount of effort necessary to implement it.

Again, assume n + 1 ordered pairs of data are collected, (x_0, y_0), …, (x_n, y_n).


4.4.1 Divided Differences

Defining "divided differences" by

[y_j] ≡ y_j,   j = 0, …, n − 1                                                 (4.5)

[y_j, y_{j+1}] ≡ (y_{j+1} − y_j)/(x_{j+1} − x_j)

[y_j, y_{j+2}] ≡ ([y_{j+1}, y_{j+2}] − [y_j, y_{j+1}])/(x_{j+2} − x_j)
             = [(y_{j+2} − y_{j+1})/(x_{j+2} − x_{j+1}) − (y_{j+1} − y_j)/(x_{j+1} − x_j)] / (x_{j+2} − x_j)

[y_j, …, y_{j+k}] ≡ ([y_{j+1}, …, y_{j+k}] − [y_j, …, y_{j+k−1}])/(x_{j+k} − x_j),   (4.6)
    j = 0, …, n − k,   k = 1, …, n,

we create an interpolation polynomial

y = f(x) ≡ a_0 + Σ_{m=1}^{n−1} a_m n_m(x),                                     (4.7)

where

a_m ≡ [y_0, …, y_m]

n_m(x) ≡ ∏_{i=0}^{m−1} (x − x_i)

4.4.2 Lagrange's Interpolations

Define the quantity

X_i ≡ ∏_{j=0, j≠i}^{n−1} (x − x_j)/(x_i − x_j)

Lagrange's interpolation formula is

y = Σ_{i=0}^{n−1} X_i y_i                                                       (4.8)
  = [(x − x_1)(x − x_2)⋯(x − x_{n−1})] / [(x_0 − x_1)(x_0 − x_2)⋯(x_0 − x_{n−1})] y_0
  + [(x − x_0)(x − x_2)⋯(x − x_{n−1})] / [(x_1 − x_0)(x_1 − x_2)⋯(x_1 − x_{n−1})] y_1
  + ⋯
  + [(x − x_0)(x − x_1)⋯(x − x_{n−2})] / [(x_{n−1} − x_0)(x_{n−1} − x_1)⋯(x_{n−1} − x_{n−2})] y_{n−1}.   (4.9)

Generally speaking, only one polynomial of order n − 1 passes through n data points. The uncertainty on the interpolation is proportional to the size of the interval containing the unknown value, raised to the power n.
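A direct sketch of Equations 4.8 and 4.9 (function name and data are ours):

    # Lagrange interpolation (Eqs. 4.8-4.9) for unequally spaced data points.
    def lagrange(xs, ys, x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            Xi = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    Xi *= (x - xj) / (xi - xj)   # the factor X_i
            total += Xi * yi
        return total

    # three points of y = 1 + x^2: the quadratic is recovered exactly
    print(lagrange([0.0, 1.0, 4.0], [1.0, 2.0, 17.0], 2.0))   # 5.0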


4.5 Spline

Spline interpolations employ polynomial interpolation, with the added constraints of continuity up to the second derivative at every interval boundary. Such continuity of slope and curvature makes for a very nice-looking line representation through the data locally. It is not as accurate as a fit.

A standard spline interpolation takes the form of a cubic polynomial (four coefficients) on each interval. If f_i, f′_i, f″_i, and f‴_i are the value and the first, second, and third derivatives, respectively, of this polynomial evaluated at the low end x_i of the interval containing the x of interest, then the spline-interpolated value y(x) for this x, x_i < x < x_{i+1}, is approximated by

y(x) = f_i + (x − x_i) f′_i + [(x − x_i)²/2!] f″_i + [(x − x_i)³/3!] f‴_i

The work is to find f_i and its derivatives for each interval. The continuity constraints give four equations for the four unknowns of the cubic polynomial y = a + bx + cx² + dx³. Thus:

1. The polynomials of adjoining intervals must give the same value at the boundary of the intervals:

   f_{i−1}(x_i) = f_i(x_i)

2. The slope and curvature (first and second derivatives) of the polynomials of adjoining intervals must give the same values at the boundary of the intervals:

   f′_{i−1}(x_i) = f′_i(x_i);   f″_{i−1}(x_i) = f″_i(x_i)

3. A final constraint on the third derivative gives the last equation needed to determine the four unknowns. There is no consensus as to the form of this constraint, but the two most frequently used are

   (a) The third derivative is zero at the end-points of the data set:

       f‴_0(x_0) = f‴_{n−1}(x_{n−1}) = 0

   and

   (b) The third derivative is approximated from the second derivatives:

       f‴_i = [f″_i(x_i) − f″_{i−1}(x_i)] / (x_i − x_{i−1})

These supply four equations for the four unknowns at each interval, but these constraint equations couple intervals, so it's not a straightforward computational problem of solving four simultaneous equations for each interval. Rather, the equations allow for the construction of an (n − 1) × (n − 1) semi-diagonal matrix, which needs to be inverted.
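In practice one rarely sets up and inverts this matrix by hand. Assuming SciPy is available, its cubic-spline interpolator can be used directly; note that its 'natural' boundary condition sets the second (not the third) derivative to zero at the ends, so it is similar in spirit to, but not identical with, constraint 3(a) above:

    import numpy as np
    from scipy.interpolate import CubicSpline

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.sin(x)                               # sample data of our own choosing
    spline = CubicSpline(x, y, bc_type='natural')
    print(spline(2.5), np.sin(2.5))             # spline value vs. the underlying function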


4.6 Inverse Interpolation

To find x given a value for y, you can reverse Equation 4.3 or 4.4 to find u as a function of y and ∆y, and then use Equation 4.2 or a method of successive approximations to find x. Alternatively, you can use Lagrange's formula, Equation 4.9, simply reversing x and y to find x = g(y).

4.7 Two-way Interpolation

If the data have two independent variables (say, volume and temperature, with the dependent variable being pressure), the best approach is to interpolate for each independent variable separately using Newton's formula.


Chapter 5

Solving Equations Numerically

5.1 Transcendental Equations

While no general method for finding the roots of transcendental equations is known, graphing allows an approximate solution. Thus, if we are given the equation xe^x = 1, we may graph y = x and y = e^{−x} on the same plot and read off the intersection point(s), or plot y = xe^x − 1 and note where the curve crosses y = 0 [see Figure 5.1]. The solution with either approach in this case is x ≈ 0.6.


Figure 5.1: Graphical solution for xe^x = 1.

Numerical methods can provide more precise results.

5.1.1 The Method of "Regula Falsi"

Starting with an approximate solution, call it x_a, obtained from graphing or guessing or some other means, we seek a more precise answer close to x_a, but differing from it by some small amount ∆x, given by

∆x ≡ (x_a − x_b) |y_b| / (|y_a| + |y_b|),                                      (5.1)

where x_b is, perhaps, a more precise value, and y_a and y_b are the values of the function for x_a and x_b, respectively. Notice that ∆x is a value related to the difference between the approximate solution and the better guess. If |y_a| = |y_b|, then ∆x = (x_a − x_b)/2; if |y_a| > |y_b|, then ∆x < (x_a − x_b)/2; if |y_a| < |y_b|, then ∆x > (x_a − x_b)/2.

With ∆x determined in this way, the new best guess for x is

x = x_b + ∆x.

And this can be repeated with x → x_b until the precision desired is achieved.

Example  For xe^x = 1, we approximated a solution as x_a = 0.6, for which, writing y = xe^x − 1, y_a = 0.093. We make a perhaps better guess of x_b = 0.55, for which y_b = −0.047. Then ∆x = 0.017, so x = 0.567. We can iterate again with a new x_b = 0.567, y_b = −0.0004, and get a new ∆x = 0.0001, so x = 0.5671. And so forth.

One must be a bit careful, though, in guessing a value to compare to the approximate value that is on the other side of the "true" value. The corrections tend to exaggerate discrepancies in the case of a bad guess. Try, for example, x_b = 0.58 in our example.
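A minimal sketch of the procedure for f(x) = xe^x − 1, following Equation 5.1 (the function names and loop length are our own choices):

    import math

    # Regula falsi for x*exp(x) = 1: new guess is x_b + dx, with dx from Eq. 5.1.
    def f(x):
        return x * math.exp(x) - 1.0

    xa, xb = 0.6, 0.55                  # approximate solution and a nearby guess
    for _ in range(5):
        ya, yb = f(xa), f(xb)
        dx = (xa - xb) * abs(yb) / (abs(ya) + abs(yb))
        xb = xb + dx                    # improved estimate; xa is kept fixed here
        print(xb)
    # converges toward 0.56714...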

5.1.2 The Newton-Raphson Method

An alternative method is available if the derivative of y = f(x) can be readily calculated.

Taking x_a to be an approximate root, then

∆x ≡ −f(x_a)/f′(x_a),                                                           (5.2)

and then an improved approximation is gained, analogously to before, from

x = x_a + ∆x.

The new value becomes the approximation that can be improved with another iteration, and this can go on until you achieve the precision desired. You will find that after a couple of iterations, the value of the derivative changes little, and so it doesn't require continuous recalculation.

Example  Take, again, xe^x = 1, so that f(x) = xe^x − 1 and f′(x) = (1 + x)e^x. From the graph, we approximate x_a = 0.6; thus, f(x_a) = 0.093 and f′(x_a) = 2.915. And so, ∆x = −0.03, and x = 0.57. Another iteration has x_a = 0.57, giving f(x_a) = 0.008 and f′(x_a) = 2.776. This leads to ∆x = −0.003, and x = 0.567. And so forth.
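The same example as a short Python sketch (names ours); the iteration x → x − f(x)/f′(x) is exactly Equation 5.2:

    import math

    # Newton-Raphson for x*exp(x) = 1.
    def f(x):
        return x * math.exp(x) - 1.0

    def fprime(x):
        return (1.0 + x) * math.exp(x)

    x = 0.6                              # starting approximation from the graph
    for _ in range(4):
        x = x - f(x) / fprime(x)
        print(x)
    # 0.568..., 0.56715..., then essentially the root 0.567143...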


5.1.3 The Method of Iteration

A third approach is to rearrange the equation in question into its own iteration equation. Abstractly, if f(x) = 0, then the equation can, perhaps in several ways, be rewritten as x = g(x). The first approximation is plugged into g(x), giving x, which becomes the next approximation. Since g(x) can take multiple forms, it's usually best to start with the simplest possible arrangement. If the value for x seems not to be settling down after a few iterations, try another arrangement.

Example  Beginning with xe^x = 1, we rearrange to get x = e^{−x}. We take the first approximation from the graphical result, x_a = 0.6, and plug this into the right side of the equation, leading to x = 0.55. This becomes the second approximation. Plugging in, we get x = 0.577. Once more, we get x = 0.562. And, again, x = 0.570. And, one last time, x = 0.566. Etc.
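The same iteration in a couple of lines of Python (a sketch; the iteration count is arbitrary):

    import math

    # Method of iteration: rewrite x*exp(x) = 1 as x = exp(-x) and iterate.
    x = 0.6
    for _ in range(20):
        x = math.exp(-x)
    print(x)    # settles near 0.56714, the same root as before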

5.2 Roots of Polynomials

Any of the methods introduced for solving transcendental equations can be used with polynomials, as well. Of these, the Newton-Raphson method is likely to be the most efficient, particularly if any root is required (rather than all roots).

You'll recall from Equation 5.2 that the Newton-Raphson correction to an approximation is minus the ratio of the function to the derivative of the function at the value of the approximation. Before the days of calculators and computers, it was convenient to employ an algorithm similar to the following:

1. Take a polynomial to have the form

   y(x) = c_0 x^n + c_1 x^{n−1} + ⋯ + c_n.

2. Write all coefficients in a line (if a power is missing, put 0).

3. Let h_1 = c_1 + c_0 x_a.

4. Let h_2 = c_2 + h_1 x_a.

5. And so forth, until h_n = c_n + h_{n−1} x_a = y(x_a).

And do the same thing in the case of the derivative y′(x). The ratio of these gives you ∆x, and x = x_a + ∆x. And repeat.
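This tabular scheme is Horner's method; a minimal Python sketch of the h_1, h_2, … recursion (names and test polynomial are ours):

    # Horner evaluation of y(x) = c0*x^n + c1*x^(n-1) + ... + cn at x = xa.
    def horner(coeffs, xa):
        h = coeffs[0]
        for c in coeffs[1:]:
            h = c + h * xa               # h_k = c_k + h_{k-1} * xa
        return h

    # y(x) = x^3 - 2x - 5 evaluated at x = 2 gives -1
    print(horner([1, 0, -2, -5], 2.0))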

5.2.1 Graeffe’s Root-Squaring Method

With a little more effort, it is possible, and has been since well before microelectronics, to find all roots, complex as well as real, simultaneously, without needing an initial approximation.


1. Start with a polynomial of the form

   y(x) = c_0 x^n + c_1 x^{n−1} + ⋯ + c_n = 0.

   Divide through by c_0 to give

   y(x) = x^n + a_1 x^{n−1} + a_2 x^{n−2} + ⋯ + a_n,

   where a_1 = c_1/c_0, a_2 = c_2/c_0, etc.

2. Write all coefficients a_i in a line (if a power is missing, put 0), beginning with 1.

3. Square each coefficient, writing the results under the original list.

4. Create a new set of coefficients b_i as illustrated below:

   1    a_1        a_2         a_3         a_4        ⋯
   1    a_1²       a_2²        a_3²        a_4²       ⋯
        −2a_2      −2a_1a_3    −2a_2a_4    −2a_3a_5   ⋯
                   +2a_4       +2a_1a_5    +2a_2a_6   ⋯
                               −2a_6       −2a_1a_7   ⋯
                                           +2a_8      ⋯
   1    b_1        b_2         b_3         b_4        ⋯
   1    b_1²       b_2²        b_3²        b_4²       ⋯

5. Repeat the procedure with the new coefficients until the doubled "cross-products" (such as 2a_3a_5, one from each side of the column of interest) contribute essentially nothing to the next squared term.

At this point, you'll have n coefficients, say h_1, h_2, …, h_n. Then, by formulas originally derived by Vieta, if x_1, x_2, …, x_n are the n real roots of the polynomial,

|x_1|^{2^s} ≈ h_1

|x_2|^{2^s} ≈ h_2 / h_1

⋮

|x_n|^{2^s} ≈ h_n / h_{n−1},

where s is the number of squaring iterations. Before computers and hand calculators, there were tables of logarithms, so these equations were written:

log |x_1| ≈ (1/2^s) log h_1

log |x_2| ≈ (1/2^s)(log h_2 − log h_1)

log |x_3| ≈ (1/2^s)(log h_3 − log h_2)

⋮

log |x_n| ≈ (1/2^s)(log h_n − log h_{n−1})

Log tables usually went to four decimal places (the inverse tables went to two). Starting with the individual roots, the Newton-Raphson method can be used both to check the results and to attain greater precision.

If two or more roots are degenerate (real and equal), a doubled cross-product will not decrease with iteration but will always equal 1/2 the squared term it is added to. If, otherwise, the doubled cross-products do not become vanishingly small, while the signs of the respective sums alternate, this indicates the presence of complex roots.

Example  We seek the roots of y(x) = x⁴ − 56x³ + 490x² + 11,112x − 117,495 = 0.

   1   −5.600×10^1    4.900×10^2    1.111×10^4    −1.175×10^5
   1    3.136×10^3    2.401×10^5    1.234×10^8     1.381×10^10
       −0.980×10^3   12.440×10^5    1.152×10^8
                     −2.350×10^5
   1    2.156×10^3    1.250×10^6    2.386×10^8     1.381×10^10

   1    4.648×10^6    1.562×10^12   5.693×10^16    1.907×10^20
       −2.500×10^6   −1.029×10^12  −3.452×10^16
                      0.028×10^12
   1    2.148×10^6    5.610×10^11   2.241×10^16    1.907×10^20

   1    4.614×10^12   3.147×10^23   5.022×10^32    3.637×10^40
       −1.122×10^12  −0.963×10^23  −2.140×10^32
                      0.004×10^23
   1    3.492×10^12   2.188×10^23   2.882×10^32    3.637×10^40

   1    1.219×10^25   4.787×10^46   8.306×10^64    1.323×10^81
       −0.044×10^25  −0.201×10^46  −1.591×10^64
   1    1.175×10^25   4.586×10^46   6.715×10^64    1.323×10^81

   1    1.381×10^50   2.103×10^93   4.388×10^129   1.750×10^162
   1    1.904×10^100  4.414×10^186  1.925×10^259   3.062×10^324

Notice that by the third iteration (2³ = 8th power), the first set of doubled cross-products becomes negligible, and by the fifth iteration (2⁵ = 32nd power) the last set becomes negligible. We stop after the next (sixth) iteration, so 2^s = 64, and our solutions are

log |x_1| ≈ 100.2797/64 = 1.5669

log |x_2| ≈ (186.6448 − 100.2797)/64 = 1.3494

log |x_3| ≈ (259.2844 − 186.6448)/64 = 1.1350

log |x_4| ≈ (324.4860 − 259.2844)/64 = 1.0188,


so that |x_1| ≈ 36.89, |x_2| ≈ 22.36, |x_3| ≈ 13.65, and |x_4| ≈ 10.45. By inspection, x_3 < 0, while all the other values are positive.

The sum of these roots should be approximately equal to −a_1 (that is, 56), and it is. The precision of these values can be improved using the Newton-Raphson method.
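As an independent check (assuming NumPy is available), the roots of the example polynomial can be computed directly and compared with the magnitudes found above:

    import numpy as np

    # Roots of x^4 - 56x^3 + 490x^2 + 11112x - 117495, for comparison with
    # the Graeffe estimates |x| of roughly 36.9, 22.4, 13.7 (negative), 10.5.
    print(np.roots([1.0, -56.0, 490.0, 11112.0, -117495.0]))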


Chapter 6

Differentiation

6.1 Introduction

The derivative f′(x) of a function f(x) is defined as:

f′(x) = df/dx |_x ≡ lim_{∆x→0} [f(x + ∆x) − f(x)] / ∆x                          (6.1)

If this is going to be determined numerically, take note of two dangers. The first is the subtractive rounding error that comes from taking small differences between numbers relatively much larger than the difference. The second is that ∆x can run up against machine resolution (decimal place) limitations. Thus, differentiating numerically demands special attention to the uncertainties of the results.

We’ll look at two procedures for calculating derivatives numerically.

6.2 Forward Difference

The Taylor expansion of the function f at x + ∆x is:

f(x + ∆x) = f(x) + ∆x f′(x) + [(∆x)²/2!] f″(x) + [(∆x)³/3!] f⁽³⁾(x) + …         (6.2)

To some approximation (better the smaller ∆x is), we can ignore powers of ∆x above the first, giving f(x + ∆x) ≈ f(x) + ∆x f′(x). We could then numerically compute an approximation to the derivative with a rearrangement of this equation:

f′_NC(x) ≡ [f(x + ∆x) − f(x)] / ∆x                                               (6.3)

where NC denotes numerical calculation. This looks like the definition of the derivative, but without the limit. Because ∆x is finite, such a calculation estimating the derivative excludes all the higher-order terms that we chose to ignore, and these become the uncertainty of this calculation:

ε_{f′_NC} = [(∆x)/2!] f″(x) + [(∆x)²/3!] f⁽³⁾(x) + …                            (6.4)

Obviously, the smaller ∆x the smaller the uncertainty of the algorithm, but the greaterthe likelihood of rounding errors. If the subtraction uncertainty becomes bigger than∆x, then the result becomes meaningless.

6.2.1 Example

f(x) = c_0 + c_2 x²

f′(x) = 2c_2 x

f′_NC(x) = 2c_2 x + c_2 ∆x

Obviously, smaller ∆x makes the computation more accurate, until rounding errors begin to overwhelm the uncertainty and ruin the precision.

6.3 Central Difference

It is marginally better to calculate the difference in the derivative approximation a half-step forward and a half-step back:

f′_NC(x) ≡ [f(x + ∆x/2) − f(x − ∆x/2)] / ∆x                                     (6.5)

The difference of the Taylor series expansions of the numerator terms cancels the odd powers of ∆x in the error, thereby shrinking the uncertainty somewhat. Thus, the uncertainty goes as (∆x)² rather than as ∆x. That is, if f‴(x)(∆x/2)²/3! < f″(x)∆x/2!, then the central-difference algorithm should give a better result.

Note that the central-difference algorithm gives an exact result for our example above.
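A small sketch comparing the two algorithms on f(x) = sin x (our own test function and step size):

    import math

    # Forward versus central difference for f = sin, f' = cos, at x = 1.
    f, x, dx = math.sin, 1.0, 1e-4
    forward = (f(x + dx) - f(x)) / dx
    central = (f(x + dx / 2) - f(x - dx / 2)) / dx
    exact = math.cos(x)
    print(abs(forward - exact), abs(central - exact))   # the central error is far smaller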

6.4 Quantifying the Uncertainties

You'll recall that at one extreme algorithmic errors dominate, while at the other rounding errors dominate. It's best to optimize between the two, and so we might aim for something like ε_ro ≈ ε_alg.

Now, we saw that the forward-difference algorithm has an algorithmic uncertainty that goes as ε^FD_alg ≈ f″(x)∆x/2!, while the uncertainty of the central-difference algorithm goes as ε^CD_alg ≈ f‴(x)(∆x/2)²/3!.


Both algorithms make use of small differences between relatively large numbers, so the rounding error approaches that of the machine accuracy, ε_M = 2^{−N_mantissa bits}. Thus, ε_ro ≈ ε_M/∆x.

The equality requirement then gives us an idea of the appropriate ∆x.

Forward Difference: ε_ro ≈ ε_M/∆x ≈ f″(x)∆x/2! ≈ ε^FD_alg ⇒

(∆x)² ≈ 2ε_M / f″(x)                                                            (6.6)

Central Difference: ε_ro ≈ ε_M/∆x ≈ f‴(x)(∆x/2)²/3! ≈ ε^CD_alg ⇒

(∆x)³ ≈ 24ε_M / f‴(x)                                                           (6.7)

6.5 Second Derivatives

Because a second derivative involves a second subtraction of small differences, it is susceptible to even greater rounding errors than the first derivative. The standard procedure to minimize the effect is to take the difference between first derivatives expanded in the forward and backward directions, respectively.

f''(x) ≈ [f'(x + ∆x/2) − f'(x − ∆x/2)] / ∆x
       ≈ (1/(∆x)²) {[f(x + ∆x) − f(x)] − [f(x) − f(x − ∆x)]}    (6.8)

The uncertainty goes as εf'' ≈ 2f^(4)(x)(∆x/2)²/4!: proportional, like the central-difference algorithm, to (∆x)², but to the fourth derivative.
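A minimal Python sketch of Equation 6.8 (the test function and step size are arbitrary choices, not part of the notes):

import math

def second_diff(f, x, dx):
    # three-point second-derivative estimate, Equation 6.8
    return ((f(x + dx) - f(x)) - (f(x) - f(x - dx))) / dx**2

x, dx = 1.0, 1e-4
print(second_diff(math.sin, x, dx), -math.sin(x))   # both approximately -0.8415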


Chapter 7

Integration

7.1 Introduction

Numerical integration, or calculating the area under a curve (whether a function or a set of data points) with a computer algorithm, amounts to adding up the areas of boxes, and so is often referred to as numerical quadrature. Quadrature is defined as the construction of a square having the same area as some other figure. Plotting the curve on graph paper will make the notion clear. However, few of the numerical methods we discuss amount to forming and summing squares, per se, so the alternative name is a bit misleading in this context.

As with many numerical techniques, the best approach often depends on the details of the problem being addressed. If the integration is to be done under a function, then one probably wants to perform some sort of Gaussian quadrature. The trick here is to minimize the number of function calls, which can be expensive, both in terms of machine cycles and rounding errors. Gaussian quadrature is also used for calculating the area under unequally-spaced data points. Integration under a collection of equally-spaced data points is typically carried out with one or another version of the Newton-Cotes approach. The main concern with any method for integrating under data points is truncation error, which is somewhat related to whether or not the end-points of the data are used.

We start with Riemann’s definition of an integral:

∫_a^b f(x) dx ≡ lim_{∆x→0} Σ_{i=1}^{(b−a)/∆x} ∆x f(x_i)    (7.1)

Numerically, ∆x cannot go to zero (in fact, it can't be smaller than the machine resolution), and so the summation becomes a finite sum of boxes with height f(x_i) and base w_i, known as weights, which are proportional to ∆x:

∫_a^b f(x) dx ≈ Σ_{i=1}^{N} w_i f(x_i)    (7.2)


Here, N is the number of boxes in the interval [a, b] summed.

7.2 Newton-Cotes Methods

All Newton-Cotes integration methods divide the total integration interval into equal-sized (base) subintervals. The integrand is evaluated at equally spaced points x_i. In their most basic forms, the methods amount to integrating a Taylor-series-like expansion of the integrand; what differentiates them is the order of the expansion used. The higher the order, the more precision is obtained (unless there are singularities in the distribution), but the more expensive the calculation in terms of machine cycles. Known singularities should be addressed by partitioning the computation into regions between them.

7.2.1 Trapezoidal Rule

In this method, the curve through the data points is approximated by a piecewise linear fit. That is, a line is computed between each pair of points that correspond to the end-points of each subinterval. These lines then form the tops of trapezoids whose areas will be computed and summed.

Assume there are N data points in the region [a, b] (inclusive) which is to be integrated by summing the areas of N − 1 trapezoids, or else decide that this region is to be divided into N − 1 trapezoids (here, N is the number of trapezoid base corners), the tops of which pass through some number of points. Either way, the base of each trapezoid will be of size

∆x = (b − a)/(N − 1)

and the values of the independent variable used in the computations will go as

x_i = a + (i − 1)∆x,   i = 1, 2, . . . , N

The area of trapezoid i is, then,

I_i = (1/2)(y_{i+1} + y_i)∆x

and the integral from a to b is

I = ∫_a^b f(x) dx ≈ Σ_{i=1}^{N−1} I_i    (7.3)

Note that y_a and y_b appear only once in the sum, while all the other y_i appear twice. In terms of the generic weighting formula, Equation 7.2, f(x_i) = y_i, and

w_i = [∆x/2, ∆x, . . . , ∆x, ∆x/2]    (7.4)
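A minimal Python sketch of the trapezoidal rule built from these weights (the integrand and limits are arbitrary test choices):

import math

def trapezoid(f, a, b, N):
    # trapezoidal rule with N equally spaced points, i.e. N-1 trapezoids (Eqs. 7.3-7.4)
    dx = (b - a) / (N - 1)
    total = 0.5 * (f(a) + f(b))          # end points carry weight dx/2
    for i in range(1, N - 1):            # interior points carry weight dx
        total += f(a + i * dx)
    return total * dx

print(trapezoid(math.sin, 0.0, math.pi, 101))   # ~1.9998; the exact value is 2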


The algorithmic uncertainty is due to truncation error. It goes, to lowest order, as

εtrapalg ∼ (b − a)(∆x)²    (7.5)

but it is also proportional to the second derivative of the true function of the distribution, which means that, for a finite data distribution, it is proportional to the difference between the first derivatives at the endpoints. Therefore, if either or both of these diverge, this approach fails.

7.2.2 Simpson's Rule

In the trapezoid rule, we connected two points with a straight line, regardless of the values of the intermediate points, if any. This is the equivalent of evaluating the functional form of these data points to first order in a Taylor series expansion. Fitting a parabola through three equally spaced points is a way of connecting the points more smoothly; it is equivalent to evaluating the functional form to the second-order term. The integral then becomes the sum of the areas under each parabola. A parabola, of course, has the form f(x) = αx² + βx + γ.

As before, assume there are N data points in the region [a, b] (inclusive), which is to be integrated by summing the areas of (N − 1)/2 regions, each with a rectangular base of width 2∆x and a parabolic top (note that N must be odd). The base spacing is

∆x = (b − a)/(N − 1)

and the values of the independent variable used in the computations will go as

x_i = a + (i − 1)∆x,   i = 1, 2, . . . , N

Integrating f(x) over the three-point interval,

∫_{x_{i−1}}^{x_{i+1}} (αx² + βx + γ) dx ≈ 2∆x f(x_i)

to first order in ∆x. Carrying out the arithmetic to the same order, the same value is reached with

(∆x/3)[f(x_{i−1}) + 4f(x_i) + f(x_{i+1})]

And thus we arrive at Simpson’s rule for numerical integration:

I_i ≡ ∫_{x_i−∆x}^{x_i+∆x} f(x) dx ≈ (∆x/3) y_{i−1} + (4∆x/3) y_i + (∆x/3) y_{i+1}    (7.6)

This is done for adjacent intervals and then summed:

I = Σ_{i=1}^{(N−1)/2} I_i    (7.7)


Since each interval endpoint (except for a and b) is both the end of one region and the beginning of the next, the weights for Simpson's rule are:

w_i = [∆x/3, 4∆x/3, 2∆x/3, 4∆x/3, . . . , 2∆x/3, 4∆x/3, ∆x/3]    (7.8)

The algorithmic uncertainty is due to truncation error. It goes, to lowest order, as

εSimpalg ∼ (b − a)(∆x)⁴    (7.9)

It is also proportional to the fourth derivative of the true function of the distribution.
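A minimal Python sketch of Simpson's rule using these weights (again with an arbitrary test integrand):

import math

def simpson(f, a, b, N):
    # Simpson's rule with N equally spaced points; N must be odd (Eqs. 7.6-7.8)
    if N % 2 == 0:
        raise ValueError("N must be odd")
    dx = (b - a) / (N - 1)
    total = f(a) + f(b)                               # end weights dx/3
    for i in range(1, N - 1):
        total += (4 if i % 2 == 1 else 2) * f(a + i * dx)
    return total * dx / 3.0

print(simpson(math.sin, 0.0, math.pi, 101))   # ~2 to about eight digits, much closer than the trapezoid rule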

7.2.3 Optimizing the Number of Regions

Assuming, again, that rounding error is proportional to the machine accuracy, with the proportionality constant the square root of the number of steps, and that the optimal situation is when algorithmic and rounding errors are approximately the same size, one finds that for

1. the trapezoidal rule with

• single precision: N − 1 ≈ 600-650

• double precision: N − 1 ≈ 10⁶

for a total relative uncertainty of about 10⁻⁵ and 10⁻¹², respectively;

2. Simpson's rule with

• single precision: N − 1 ≈ 35-40

• double precision: N − 1 ≈ 2100-2200

for a total relative uncertainty of about 10⁻⁶ and 10⁻¹³, respectively.

7.3 Gaussian Quadrature

Gaussian quadrature methods use non-equal intervals, and the interval widths can be adjusted within a single integration such that the intervals of slowly varying regions will be wider than those of regions which vary rapidly. Gaussian quadrature can thereby be more precise than Newton-Cotes methods, as long as there are no singularities in either the function being integrated or its derivative. Such singularities can and should be dealt with, however, by creating separate integration ranges with the singular points included in the limits of integration. This affords excellent accuracy, but is harder (maybe very much harder) to program.

Many types of Gaussian quadrature approaches exist: Gauss-Laguerre, Gauss-Hermite, Gauss-Chebyshev, and Gauss-Jacobi, for example. They're all based on the same premise, namely that the function being integrated can be approximated by a polynomial, at least in the region of interest, which is multiplied by a weighting function.


Any specific combination, named for its inventor, works better for some functions than for others. Most of these approaches work for specific integration limits, such as [−1, 1], [0, ∞], or [−∞, ∞].

The general procedure is the same in all cases. First, one decides which approach to employ, thereby defining the polynomial q(x) that will approximate the function, and the weighting function ρ(x) by which q(x) will be multiplied.

Then one decides the order of the polynomial. Gaussian quadrature gives the exact integration if f(x) is a polynomial of order r < 2R, where R is the order of the approximating polynomial, q(x). If, instead of a function, one has N data points, then Gaussian quadrature is exact if r = 2N − 1, but the bigger r is, the more expensive the calculation becomes.

Having chosen the polynomial q(x) = Σ_{i=0}^{R} c_i x^i and its order R, the next step is to determine its coefficients c_i. Its roots are then just the x_i. Finally, one gets the weights, w_i.

This is all feasible due to the following theorem. If q(x) is a polynomial of degree R such that

∫_a^b q(x) ρ(x) x^k dx = 0    (7.10)

where k is any integer on [0, R − 1] and ρ is a weighting function, then, if the x_i are the R roots of q(x), there exists a set of weights w_i such that the integration formula

∫_a^b f(x) ρ(x) dx ≈ Σ_{i=1}^{R} w_i f(x_i)

is exact if f(x) is a polynomial of degree < 2R.

All but one of the R + 1 coefficients can be found from the R equations of the q(x) integral. That last coefficient falls out when finding the roots of q(x). The w_i are found from the R equations formed by setting f(x) = ρ(x)[1, x, x², . . . , x^{R−1}].

Understand and appreciate what this means. We know (think of two points to a line; three points to a parabola) that we can fit an (N − 1)th-order polynomial to N points. This says that, with N points, the quadrature is exact for a polynomial of order 2N − 1. So it's pretty powerful but, as mentioned, very difficult or painful to implement. Understanding the uncertainty of the result is even more difficult. And it only works well for smooth functions.

We'll look only at perhaps the most popular method, called Gauss-Legendre quadrature, in particular the three-point form. In Gauss-Legendre quadrature, the weighting function is particularly simple: ρ(x) = 1. It requires, however, integration limits of [−1, 1], so an integral over an arbitrary span [a, b] has to be transformed with

z = [2x − (b + a)]/(b − a);   dz = [2/(b − a)] dx    (7.11)

Three-point Gauss-Legendre quadrature requires R = 3, so

q(z) = c0 + c1 z + c2 z² + c3 z³


Equation 7.10 (recall, here, ρ = 1) gives us three equations for these four unknowns:

∫_{−1}^{1} q(z) dz = 0  ⇒  2c0 + (2/3)c2 = 0

∫_{−1}^{1} q(z) z dz = 0  ⇒  (2/3)c1 + (2/5)c3 = 0

∫_{−1}^{1} q(z) z² dz = 0  ⇒  (2/3)c0 + (2/5)c2 = 0

The first and third of these can be consistent only with c0 = c2 = 0. The second equation says that c1 = −(3/5)c3. This gives

q(z) = c3 (z³ − (3/5)z)

whose roots are easily seen to be ±√(3/5) and 0.

The weighted sum then becomes

∫_{−1}^{1} f(z) dz ≈ w1 f(−√(3/5)) + w2 f(0) + w3 f(√(3/5))

For this to be exact, the order of f(z) must be < 2R, but to fix the three weights we need only the three conditions f(z) = 1, z, z². Plugging in, we get

2 = w1 + w2 + w3

0 = −√(3/5) w1 + √(3/5) w3

2/3 = (3/5) w1 + (3/5) w3

The second equation implies that w1 = w3, so, from the third equation, w1 = w3 = 5/9, and then, from the first equation, w2 = 8/9.

Thus, what you end up with for three-point Gauss-Legendre quadrature is:

∫_a^b f(x) dx = [(b − a)/2] ∫_{−1}^{1} f(z) dz ≈ [(b − a)/2] [ (5/9) f(−√(3/5)) + (8/9) f(0) + (5/9) f(√(3/5)) ]    (7.12)

As you can see, this is pretty involved, particularly if you want to go to higher R. Also, a reliable determination of the uncertainty is almost impossible.
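A Python sketch of Equation 7.12 (the nodes ±√(3/5) and weights 5/9, 8/9 come from the derivation above; the test integrand is an arbitrary choice):

import math

def gauss_legendre_3(f, a, b):
    # three-point Gauss-Legendre quadrature, Equation 7.12
    z = math.sqrt(3.0 / 5.0)
    nodes = (-z, 0.0, z)
    weights = (5.0/9.0, 8.0/9.0, 5.0/9.0)
    half = 0.5 * (b - a)
    mid = 0.5 * (b + a)
    # map each node from [-1, 1] back to [a, b] and apply the weights
    return half * sum(w * f(mid + half * zi) for w, zi in zip(weights, nodes))

print(gauss_legendre_3(lambda x: x**5 + x**2, 0.0, 1.0))   # 0.5, exact for this degree-5 polynomial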


Chapter 8

Ordinary Differential Equations

The laws of physics are typically expressed in the form of differential equations, which describe the change of one or more physical variables as a result of changes in one or more other physical variables. We may represent in a general fashion such a description involving two variables, x, an independent variable, and y, a dependent variable which may be a function of x:

F( y(x), dy(x)/dx, . . . , d^m y(x)/dx^m, x ) = 0.    (8.1)

Because the highest derivative is the mth, the equation is called an mth-order ordinary differential equation (ODE). Its solution is y(x). In general, this solution of an mth-order ODE has m unspecified constants, which are determined by initial or boundary conditions that specify m additional constraints.

8.1 First-Order ODE

If the dependent variable appears only to the power 1, and if the highest-order derivative is first order, then the ODE is called "linear, first-order" and has the general form:

dy/dx = f(y, x).    (8.2)

This equation gives the slope of the function y(x) in the x-y plane.

The solution of Equation 8.2 (the relationship between x and y which satisfies all values of x and y) takes the general form:

y = ∫ f(y, x) dx.    (8.3)

We can come up with a closed expression y = φ(x) if the integral is analytically soluble, but frequently we cannot, and then we turn to some approximation procedure, a class of which is subsumed under the name "numerical methods."


The function f(y, x) can be separated into parts, one of which contains y and another that doesn't. Since we consider here only linear equations, this separation becomes:

dy/dx − f(x)y = g(x),    (8.4)

where f(x) and g(x) are both arbitrary functions of x. If g(x) = 0, then Equation 8.4 is called "homogeneous"; if g(x) ≠ 0, then Equation 8.4 is called "non-homogeneous."

8.1.1 Homogeneous Equation

Consider the homogeneous equation

dy_h/dx − f(x)y_h = 0.

Separating variables,

dy_h/y_h = f(x) dx,

and integrating,

ln y_h = ∫ f(x) dx + c1,

where c1 is the constant of integration, determined by boundary conditions. This leads to the formal solution:

y_h = c2 e^{∫ f(x) dx},   c2 = ±e^{c1}

This is called the "quadrature solution," because of the integral (quadrature) of f(x). The final solution of the linear, first-order, homogeneous equation requires solving this integral.

Among the more important examples of this type of equation are these three (β and ω are positive constants with dimensions the inverse of those of x, while A, B, and D are constants with dimensions the same as those of y):

Decay:        dy/dx + βy = 0  ⇔  y = A e^{−βx}    (8.5)

Growth:       dy/dx − βy = 0  ⇔  y = B e^{+βx}    (8.6)

Oscillation:  dy/dx ± iωy = 0  ⇔  y = D e^{∓iωx}    (8.7)


Example: Suppose f(x) = −x and y(0) = 1. Then the linear, first-order, homogeneous ODE is dy_h/dx + x y_h = 0, which has the solution

y_h = c2 e^{∫ f(x) dx}
    = c2 e^{−∫ x dx}
    = c2 e^{−x²/2 + c3}
    = c2 e^{c3} e^{−x²/2}
    = c4 e^{−x²/2}.

Since y(0) = 1, c4 = 1, so the analytical solution is y_h = e^{−x²/2}.

8.1.2 Non-homogeneous Equation

If g(x) ≠ 0 in Equation 8.4, we encounter a non-homogeneous equation. An analytical solution begins by defining h(x) ≡ −∫ f(x) dx and multiplying Equation 8.4 through by e^{h(x)}:

e^{h(x)} dy/dx − e^{h(x)} f(x) y = e^{h(x)} g(x).

But notice that

d/dx (e^{h(x)} y) = y d(e^{h(x)})/dx + e^{h(x)} dy/dx
                 = y e^{h(x)} dh(x)/dx + e^{h(x)} dy/dx
                 = −e^{h(x)} f(x) y + e^{h(x)} dy/dx,

so

d/dx (e^{h(x)} y) = e^{h(x)} g(x),

or

d(e^{h(x)} y) = e^{h(x)} g(x) dx.

Integrating:

y e^{h(x)} = ∫ e^{h(x)} g(x) dx + C,

or

y(x) = e^{−h(x)} [ ∫ e^{h(x)} g(x) dx + C ],   h(x) = −∫ f(x) dx,

where C is the constant of integration.


8.2 Higher-Order ODE

We ignore here a number of techniques (such as series) for solving higher-order ODEs in favor of a technique that is particularly amenable to numerical methods.

With an appropriate choice of variables, a higher-order differential equation, which we can write, from Equation 8.1, as

d^m z/dx^m = F( z(x), dz(x)/dx, . . . , d^{m−1}z(x)/dx^{m−1}, x ),    (8.8)

is reducible to a system of first-order ODEs. Introducing

y1(x) = z
y2(x) = dz/dx
...
ym(x) = d^{m−1}z/dx^{m−1},

we generate from Equation 8.8 the first-order ODE system

dy1/dx = y2
dy2/dx = y3
...
dym/dx = F(y1, . . . , ym, x).

Defining the vector ~y ≡ (y1, . . . , ym), the first-order ODE system can be written concisely as

d~y/dx = ~f(~y, x),    (8.9)

where ~f ≡ [f1(~y, x), . . . , fm(~y, x)]. Here, this means f1(~y, x) = y2, etc.

Because the independent variable x appears explicitly in ~f, the system is called "non-autonomous." A non-autonomous system can be made autonomous by adding constraint equations to satisfy the boundary or initial conditions. For now, we consider only cases that have such conditions defined at a particular value of x, say x0. Then Equation 8.9 has the formal solution

~y(x) = ~y(x0) + ∫_{x0}^{x} ~f[~y(x′), x′] dx′,    (8.10)

which (notice that ~y(x) is on both sides of the equation, presupposing you know the answer) is almost always impossible to solve analytically. We will approximate solutions numerically.


8.3 Solving ODEs Numerically

Integration computes the area under a curve. We can approximate this area by chopping it into little rectangular pieces and summing their areas. If we take a small interval along the x-axis, ∆x ≡ x_{n+1} − x_n, then a small portion of the integral in Equation 8.10 is approximated:

∫_{x_n}^{x_{n+1}} f(y(x′), x′) dx′ ≈ ∆x f[y(x_n), x_n].

[Note that we drop the vector notation for the time being; when necessary, we'll bring it back.]

Thus, we substitute for Equation 8.10

y(x_{n+1}) = y(x_n) + ∆x f(y(x_n), x_n),

which is more concisely written

y_{n+1} = y_n + ∆x f_n.    (8.11)

This is a recursion relationship. We cover the entire domain of interest by repeatedly calculating y_n = y(x_n) at discrete values of x given by x_n = x1 + (n − 1)∆x, with f_n = f(y_n, x_n).

Equivalently, we can write difference relations to approximate a derivative:

dy/dx|_n ≈
    (y_{n+1} − y_n)/∆x           forward
    (y_{n+1} − y_{n−1})/(2∆x)    centered
    (y_n − y_{n−1})/∆x           backward        (8.12)

Each sort of difference relation has its applicability, as we will see.

We will introduce a number of techniques for solving ODEs. The appropriate one to use in any given situation is the one that is at the same time efficient, accurate, and stable.

8.3.1 Order of Accuracy and Truncation Error

If we expand y(x) with a Taylor series around xn, we get

y_{n+1} = y_n + ∆x (dy/dx)|_n + ((∆x)²/2) (d²y/dx²)|_n + ((∆x)³/3!) (d³y/dx³)|_n + . . .    (8.13)

An approximation is said to be mth-order accurate if it reproduces every term of the expansion up to and including the term that contains (∆x)^m. This is the given technique's "order of accuracy." Conversely, the order of the first term that is not reproduced is called the "truncation error."


8.3.2 Stability

It is possible that, for example, rounding due to a computer's finite accuracy causes a solution at a given iteration to deviate from the exact value of the recursion relation, Equation 8.11. Let us symbolize this deviation by δy and compare the deviations of successive iterations:

δy_{n+1}/δy_n = γ.    (8.14)

If, as the recursion continues, the magnitude of the deviation grows, then |γ| > 1, indicating that solutions become less accurate with each iteration. The method employed is classified "unstable." On the other hand, if |γ| ≤ 1 (that is, if deviations arise from random rather than systematic sources), the method is considered "stable."

To see, in context, what γ means, expand f(y, x) with respect to y, and evaluate it at a particular iteration n:

f = f_n + δy_n (df/dy)|_n + ((δy_n)²/2) (d²f/dy²)|_n + . . . .

That is, a deviation δy_n, if present, causes the function evaluated at this iteration to differ from its exact value in the recursion relation, Equation 8.11.

What, then, happens to the recursion relation, Equation 8.11, y_{n+1} = y_n + ∆x f_n, if, at each iteration, y_n is inexact by the amount δy_n? Using the expansion above,

y_{n+1} + δy_{n+1} = y_n + δy_n + ∆x ( f_n + δy_n (df/dy)|_n + . . . ).

Since, by Equation 8.11, y_{n+1} = y_n + ∆x f_n, the equation above reduces to

δy_{n+1} = δy_n + ∆x δy_n (df/dy)|_n + . . .
         = δy_n ( 1 + ∆x (df/dy)|_n + . . . ).

Comparing this with Equation 8.14 gives

γ = 1 + ∆x ( (df/dy)|_n + . . . )    (8.15)

Stability requires |γ| ≤ 1.

We've been treating f and y as individual entities when, most generally, we should be treating them as arrays. As such, df/dy → ∂~f/∂~y, a matrix F = F^(1) + F^(2) + . . . with components

F^(1)_ij = ∂f_i/∂y_j

F^(2)_ij = ∂²f_i/∂y_j²

. . .


Of course, f_i represents the components of ~f and y_j represents the components of ~y. The deviation in the iteration result δy → δ~y, so that

δy_i^(n+1) = δy_i^(n) + ∆x Σ_j (∂f_i/∂y_j)|_n δy_j^(n) + . . . .

Then,

δ~y_{n+1} = [I + ∆x F] δ~y_n ≡ Γ δ~y_n.

Stability requires that all eigenvalues of Γ have modulus no greater than unity.

As a general rule, stability is attained if the increment of the independent variable is significantly smaller than any features in the function f. Thus, for example, if there is an oscillatory pattern, then the increment should be much smaller than the period of this oscillation.

We now look at a number of methods and evaluate their accuracy and stability. Remember that one wants the most efficient algorithm that satisfies both these criteria. Different methods are amenable in this way to different sorts of problems. It can be especially difficult to determine stable ranges with the kind of algebraic expressions we arrive at below. You may need to look at extreme conditions (for example, large and small limits for coefficients in various terms of the differential equation or for the independent or dependent variables). It may be possible then to get one or more closed expressions. You then choose the simplest method that offers stable solutions under these conditions.

8.4 Euler Method

The Euler method approximates derivatives with the forward difference of Equation 8.12. Thus,

y_{n+1} = y_n + ∆x (dy/dx)|_n.    (8.16)

Example: Since the instantaneous, one-dimensional acceleration and velocity are, respectively,

a = dv/dt,   v = ds/dt,

differential velocity and position changes become

dv = a dt,   ds = v dt.


The Euler method (with y → v or y → s, and x → t) for solving the differential equations, and thereby determining the evolution of motion, becomes

v(t_{n+1}) = v(t_n) + ∆t (dv/dt)|_{t_n} = v(t_n) + a(t_n)∆t

s(t_{n+1}) = s(t_n) + ∆t (ds/dt)|_{t_n} = s(t_n) + v(t_n)∆t.
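A minimal Python sketch of this Euler iteration (constant acceleration and the initial values are arbitrary choices, not from the notes):

def euler_motion(a, v0, s0, dt, steps):
    # Euler method (Eq. 8.16) applied to dv/dt = a and ds/dt = v
    v, s = v0, s0
    for _ in range(steps):
        v, s = v + a * dt, s + v * dt    # the right-hand side uses the old v, as Euler requires
    return v, s

# free fall from rest, starting at s0 = 100 m, for 1 s in 1000 steps;
# the exact answer is v = -9.8 m/s and s = 95.1 m
print(euler_motion(-9.8, 0.0, 100.0, 1.0e-3, 1000))   # ~(-9.8, 95.105)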

Comparing the Euler approximation (Equation 8.16) to the Taylor expansion (Equation 8.13), we see that the Euler method is 1st-order accurate and, equivalently, that the truncation error is at the (∆x)² term.

The stability of the Euler method is determined by taking γ to 1st order:

γ = 1 + ∆x (df/dy)|_n.

Because stability requires |γ| ≤ 1, for the Euler method to remain stable,

−1 ≤ 1 + ∆x (df/dy)|_n ≤ 1.

In dynamical problems (those most often encountered in physics), the independent variable is time, so ∆x → ∆t > 0. As such, we see that stability requires (df/dy)|_n ≤ 0 and

∆t ≤ −2/(df/dy)|_n.

The range, therefore, of allowable independent-variable increments when using the Euler method to solve dynamical problems is

∆t ∈ ( 0, −2/(df/dy)|_n ].    (8.17)

Among the three important examples listed previously, Equations 8.5-8.7, only the first, the decay equation, satisfies this requirement. The growth and oscillation equations are both unstable under the Euler method: the former because df/dy = β > 0, and the latter because |γ| ≤ 1 ⇒ |γ|² ≤ 1, which, since in this case df/dy = ∓iω, means |γ|² = |1 + ∆t df/dy|² = 1 + ∆t²ω². Because ∆t²ω² > 0, 1 + ∆t²ω² > 1.

Now, in fact, a linear equation, our primary focus here, is frequently soluble analytically. Not so a non-linear equation, which then requires some sort of approximation procedure. In the case of a numerical method, stability may require continuously adjusting the independent-variable increment ∆x.

Example: Consider

dy/dx = −αy²,


where α > 0. It follows that

f(y, x) = −αy²,   df/dy = −2αy.

The stability condition then depends on y:

∆x ≤ 1/(αy).

8.5 The Leap-Frog Method

Where the Euler method approximates a derivative with the forward difference (Equation 8.12), the leap-frog method does so with the centered difference:

dy/dx|_n ≈ (y_{n+1} − y_{n−1})/(2∆x)

y_{n+1} = y_{n−1} + 2∆x f_n,    (8.18)

where Equation 8.9 was used for dy/dx. Recall that f_n = f(y_n, x_n), and notice, therefore, that both y_{n−1} and y_n are needed to calculate y_{n+1}. The benefit is, in effect, that the function f_n is calculated midway between one solution and the next. The interpolation is thus made not from the beginning of an interval, but from a central value of the interval.

Example: Taking, as in the Euler-method example, the definitions of instantaneous, one-dimensional acceleration and velocity, and s(t = 0) = s0, etc., the leap-frog method progresses

v(t_{2n+1}) = v(t_{2n−1}) + 2a(t_{2n})∆t

s(t_{2n+2}) = s(t_{2n}) + 2v(t_{2n+1})∆t,

but where v(t1) = v0 + a(t0)∆t.

To determine the order of accuracy, we expand y_{n−1} in Equation 8.18 around step n:

y_{n+1} = y_n − ∆x (dy/dx)|_n + ((∆x)²/2) (d²y/dx²)|_n − ((∆x)³/3!) (d³y/dx³)|_n + . . . + 2∆x f_n

        = y_n + ∆x (dy/dx)|_n + ((∆x)²/2) (d²y/dx²)|_n − ((∆x)³/3!) (d³y/dx³)|_n + . . .

where Equation 8.2 was used for f_n. In comparison with Equation 8.13, the leap-frog method is seen to agree to the (∆x)² term, and so is 2nd-order accurate. It is more accurate than the Euler method.


As for the stability of the leap-frog method, recall that the stability criterion requires |γ| ≤ 1 in Equation 8.14. But here, instead of just y_{n+1} and y_n, we also have y_{n−1}. Clearly, if

y_{n+1}/y_n = γ,

then

y_n/y_{n−1} = γ,

and, thus,

y_{n+1}/y_{n−1} = γ².

Substituting these relationships into Equation 8.18 and expanding f_n (to 1st order) as before, we find, in terms of γ,

γ² = 1 + 2∆x (∂f/∂y)|_n γ.

This is a quadratic equation in γ with solutions

γ = ∆x (∂f/∂y) ± √[ (∆x ∂f/∂y)² + 1 ].

Multiplying these solutions, you'll find that the product equals −1, meaning that they are negative reciprocals or complex conjugates. The former holds if the partial derivative has a real part, in which case the magnitude of one or the other of the solutions must be greater than 1, and the method is inherently unstable. In the latter case, the partial derivative must be purely imaginary such that the radicand is negative and the modulus of both solutions will be equal to 1. This happens when, for example, the differential equation describes oscillatory behavior (see Equation 8.7) and the independent-variable increment is related to ω such that

∆x ≤ 1/ω.

Under these conditions, the leap-frog method is stable.
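A Python sketch of the leap-frog recursion (Eq. 8.18) applied to the oscillation equation dy/dx = iωy, with ω and ∆x chosen arbitrarily so that ∆x ≤ 1/ω:

def leapfrog_oscillator(omega, dx, steps):
    # leap-frog (Eq. 8.18) for dy/dx = i*omega*y, starting from y(0) = 1
    y_prev = 1.0 + 0.0j                       # y_0
    y_curr = y_prev * (1 + 1j * omega * dx)   # y_1 seeded with a single Euler step
    for _ in range(steps - 1):
        f_n = 1j * omega * y_curr
        y_prev, y_curr = y_curr, y_prev + 2 * dx * f_n
    return y_curr

omega, dx = 1.0, 0.1                          # dx <= 1/omega, so the method is stable
print(abs(leapfrog_oscillator(omega, dx, 100)))   # stays near 1: no spurious growth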

8.6 The Runge-Kutta Method

We might try combining the Euler and leap-frog methods, by using intermediate values to iterate across an interval:

y_{n+1/2} = y_n + (1/2)∆x f_n    (8.19)

y_{n+1} = y_n + ∆x f_{n+1/2},    (8.20)

where the intermediate step (Equation 8.19) is for the purpose of calculation only and not included as a "result." Notice that the intermediate value is an Euler approximation at half the independent-variable increment. The iteration solution (Equation 8.20) is obtained with a leap-frog calculation within a single increment.


Example: Again, we consider linear velocity and acceleration. The intermediate values are:

v(t_{n+1/2}) = v_n + (1/2) a_n ∆t

s(t_{n+1/2}) = s_n + (1/2) v_n ∆t,

leading to the iteration

v(tn+1) = v(tn) + a(tn+1/2)∆t

s(tn+1) = s(tn) + v(tn+1/2)∆t.

That the leap-frog method is employed implies that this method is also 2nd-order accurate; it is in fact known as the "two-step" or "2nd-order Runge-Kutta" method. There is a family of such approaches, each member of which gives a different order of accuracy.
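A minimal Python sketch of the two-step (2nd-order Runge-Kutta) iteration of Equations 8.19-8.20, exercised on the decay equation as an arbitrary test case:

import math

def rk2_step(f, x, y, dx):
    # Eq. 8.19: Euler half-step, then Eq. 8.20: full step using the midpoint value
    y_half = y + 0.5 * dx * f(y, x)
    return y + dx * f(y_half, x + 0.5 * dx)

beta, dx = 1.0, 0.1
f = lambda y, x: -beta * y
x, y = 0.0, 1.0
for _ in range(10):                            # integrate out to x = 1
    y = rk2_step(f, x, y, dx)
    x += dx
print(y, math.exp(-1.0))                       # ~0.3685 versus the exact 0.36788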

We follow the same approach as before to determine the stability of this method, but with the caveat that the partial derivative of the function f_n with respect to the dependent variable must be independent of the independent variable. The derivation results in

y_{n+1}/y_n = γ = 1 + ∆x (∂f/∂y) + (1/2)(∆x ∂f/∂y)².

For the dynamical case of exponential decay, the stability criterion is the same as that under the Euler method, Equation 8.15. The same difficulties remain for exponential growth and oscillations, except that in this case, due to the extension to second order, we have |γ|² = 1 + (1/4)(ω∆t)⁴, which, given ω∆t < 1, is not so different from unity. As such, this method is frequently employed for damped oscillations, where the amplitude of the oscillation decreases exponentially.

We end this survey of solving ODEs numerically with two methods that estimate the integral of Equation 8.10 by chopping the area under the curve into trapezoids rather than rectangles. The finite recursion relation becomes

y_{n+1} = y_n + (1/2)∆x (f_{n+1} + f_n).    (8.21)

In this way, the finite magnitude of the slope of the curve within a given interval is more closely estimated. However, this requires (in f_{n+1}) knowing the solution before finding it, which is possible only in special circumstances.

8.7 The Predictor-Corrector Method

The predictor-corrector method uses the Euler method to "guess" a solution, followed by the trapezoidal rule (Equation 8.21) to adjust it. This is a two-step procedure, but should not be confused with the Runge-Kutta method. It goes like this:

y*_{n+1} = y_n + ∆x f_n    (8.22)

y_{n+1} = y_n + (1/2)∆x (f*_{n+1} + f_n),    (8.23)

where f*_{n+1} = f(y*_{n+1}, x_{n+1}).
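A Python sketch of one predictor-corrector step (Eqs. 8.22 and 8.23), again tried on the decay equation as an arbitrary test:

import math

def pc_step(f, x, y, dx):
    # predictor (Eq. 8.22) followed by the trapezoidal corrector (Eq. 8.23)
    f_n = f(y, x)
    y_star = y + dx * f_n
    f_star = f(y_star, x + dx)
    return y + 0.5 * dx * (f_star + f_n)

x, y, dx = 0.0, 1.0, 0.1
for _ in range(10):
    y = pc_step(lambda y, x: -y, x, y, dx)
    x += dx
print(y, math.exp(-1.0))   # both are approximately 0.368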


The similarity of this method with the Runge-Kutta method extends to its order of accuracy (truncating at 3rd order) and stability.

8.8 The Intrinsic Method

Some linear, 1st-order differential equations, including the important examples Equations 8.5-8.7, are well handled by the trapezoidal rule, Equation 8.21.

In the case of the decay equation (8.5), Equation 8.21 becomes

y_{n+1} = y_n − (1/2)∆x (βy_{n+1} + βy_n).

Solving for y_{n+1} yields the recursion relation,

y_{n+1} = [ (1 − (1/2)β∆x) / (1 + (1/2)β∆x) ] y_n.    (8.24)

With the approach used several times, we can show that the intrinsic method is 2nd-order accurate. Variation leads to γ equaling the ratio in the brackets, which always has magnitude less than one. Thus, this method is always stable for the linear, 1st-order exponential decay problem.

In the case of the linear, 1st-order oscillation problem, the intrinsic method is 2nd-order accurate. It is always stable, since the modulus of the bracketed ratio, with β replaced by ±iω, equals unity.

It can be shown that this method is also stable in the case of the linear, 1st-order growth problem. Unfortunately, the practical use of this method is limited to such simple integrals. Realistic problems are rarely limited to the forms of Equations 8.5-8.7.


Chapter 9

Partial Differential Equations

Whereas ODEs and their solutions involve a single independent variable, partial differential equations (PDEs) involve more than one independent variable. If the function Ψ(x, y, . . .) is the solution of a PDE, then it satisfies an equation of the form

F (x, y, . . . , Ψ, Ψx, Ψy, . . . , Ψxx, Ψxy, Ψyy, . . .) = 0, (9.1)

where

Ψ_x ≡ ∂Ψ/∂x,  Ψ_y ≡ ∂Ψ/∂y, . . . ,  Ψ_xx ≡ ∂²Ψ/∂x²,  Ψ_xy ≡ ∂²Ψ/∂x∂y,  Ψ_yy ≡ ∂²Ψ/∂y², . . .

are partial derivatives. A "system of partial differential equations" comprises a number of PDEs whose solutions may include one or more different functions.

A PDE's "order" is set by the degree of the highest derivative in the equation. A PDE is linear if Ψ and its derivatives appear only to the first power and in no products with one another (that is, if the coefficient of each partial derivative is a function of only the independent variables or a constant).

We focus primarily on 1st- and 2nd-order PDEs.

Few if any ready-made, general solution methods exist for PDEs, but there are many techniques tailored as needed to the problem at hand. Typically, one turns to finite-difference methods.

9.1 Boundary Conditions

Commonly occurring 2nd-order PDEs are subject to boundary or initial conditions that can formally be written

αΨ + βΨ_n = γ,    (9.2)

where α, β, γ, and Ψ, and their derivatives, are all functions of (x, y) and are defined in a domain R bounded by a surface S, and

Ψ_n ≡ ∂Ψ/∂~n

is the derivative normal to the boundary. Table 9.1 summarizes a classification system for such boundary conditions, which depends on the values of α and β.


Table 9.1: Common boundary conditions for 2nd-order PDEs.

  Classification            Conditions
  Dirichlet (first kind)    β = 0; Ψ specified on S
  Neumann (second kind)     α = 0; Ψ_n specified on S
  Cauchy                    two equations: α = 0 in one, and β = 0 in the other
  Robbins (third kind)      α, β ≠ 0

9.2 Classifying PDEs

Consider the linear, 2nd-order PDE in two dimensions:

a ∂²Ψ/∂x² + 2b ∂²Ψ/∂x∂y + c ∂²Ψ/∂y² + d ∂Ψ/∂x + e ∂Ψ/∂y + fΨ + g = 0.    (9.3)

The relationship between the coefficients a, b, and c permits a classification system [see Table 9.2].

Table 9.2: Linear, two-dimensional, 2nd-order PDE classification scheme.

  Condition   PDE Type     Example Equation             Name            Boundary Condition
  b² < ac     Elliptic     ∂²Ψ/∂x² + ∂²Ψ/∂y² = 0        Laplace's Eq'n  Dirichlet or Neumann
  b² > ac     Hyperbolic   ∂²Ψ/∂x² − ∂²Ψ/∂y² = 0        Wave Eq'n       Cauchy
  b² = ac     Parabolic    ∂²Ψ/∂x² − ∂Ψ/∂y = 0          Heat Eq'n       Dirichlet or Neumann

9.3 Elliptical Equations

9.3.1 Analytical Solution

Separation of variables can often lead to a system of decoupled ODEs. Consider the two-dimensional Laplace equation from Table 9.2. We take the form of Ψ to be

Ψ = X(x)Y(y),

and substitute into

∂²Ψ/∂x² + ∂²Ψ/∂y² = 0


and divide through by XY to get:

(1/X) d²X/dx² + (1/Y) d²Y/dy² = 0.

Note that ordinary derivatives replace the partials since X depends only on x and Y depends only on y.

We separate this equation into two ordinary differential equations:

d²X/dx² = −ξ²X

d²Y/dy² = ξ²Y,

where ξ is a so-called "separation constant." The two signed quantities ∓ξ² can also be considered the "eigenvalues" of the ODEs, which would then be considered eigenvalue equations with operators d²/dx² and d²/dy², respectively. And so, defining

Λ_x ≡ d²/dx²,   Λ_y ≡ d²/dy²,

leads to the eigenvalue equations

ΛxX = λxX

ΛyY = λyY,

such that λ_x = −ξ² and λ_y = ξ².

The solutions are exponentials,

X(x) = a_x e^{iξx} + b_x e^{−iξx}

Y(y) = a_y e^{ξy} + b_y e^{−ξy}

or trigonometric functions,

X(x) = A_x sin ξx + B_x cos ξx

Y(y) = A_y sinh ξy + B_y cosh ξy.

The product of these gives a particular solution, and boundary conditions determine a_{x,y} and b_{x,y}, or A_{x,y} and B_{x,y}.

9.3.2 Numerical Solutions

As the analytical solution may suggest, numerical solutions to elliptic equations involve finding eigenvalues and therefore manipulating matrices. We therefore put off dealing with them until we treat that topic in a general way.


9.4 Hyperbolic Equations

9.4.1 Analytical Solutions

The wave equation in Table 9.2 is soluble with the same separation-of-variables technique as was used for the elliptic (Laplace's) equation. In this case, the eigenvalue is degenerate (the same for each equation, in two dimensions). The solutions of the ODEs, which depend on boundary conditions, are again multiplied to yield the PDE solution.

9.4.2 Numerical Solutions

The wave equation in Table 9.2 has been simplified by specifying that the wave velocity is unity in some units. More generally, we take the wave velocity to be c and write the equation

∂²Ψ/∂x² − c² ∂²Ψ/∂y² = 0,

which we rewrite

(∂/∂x + c ∂/∂y)(∂/∂x − c ∂/∂y) Ψ = 0.

We have the freedom here to introduce an auxiliary function φ(x, y): let one of the parenthesized operators acting on Ψ equal φ, and require the other operator acting on φ to vanish. This leads to a system of two 1st-order equations:

∂φ/∂x + c ∂φ/∂y = 0

∂Ψ/∂x − c ∂Ψ/∂y = φ.

There's nothing special about the assignment here; the signs before the velocity could be interchanged, if one prefers. The first equation (the one set to zero) is solved first, and its solution is used to solve the second equation. The similarity of the two equations implies that the accuracy and stability of the two should be the same, and so should those of the combined solution.

The first equation is called an "advection" equation, and is similar in form to the equation of continuity, which we'll discuss shortly. It implies that something is being conserved: all the changes sum to zero. An example is the mass of an incompressible fluid. If, as is typical in one-dimensional flow, x → t and y → s, the equation becomes

∂µ/∂t + c ∂µ/∂s = 0,    (9.4)

where the typical boundary condition is µ(t, s) ≡ µ0(s) at t = 0. We seek a solution to this equation.

In the case of ODEs, the iteration was along a single dimension, that of the single independent variable. PDEs, in contrast, have multiple independent variables, and there must be iterations along each dimension. Our example contains two independent variables, t and s, and so we iterate ∆t along t and ∆s along s. Let us indicate t steps with a superscript n and s steps with a subscript j, and employ Euler's method in t and the leap-frog method in s. That is, use (see Equation 8.12) a forward difference in t and a centered difference in s:

(µ_j^{n+1} − µ_j^n)/∆t + c (µ_{j+1}^n − µ_{j−1}^n)/(2∆s) = 0,

which leads to the iteration relation

µ_j^{n+1} = µ_j^n − (c∆t/(2∆s)) (µ_{j+1}^n − µ_{j−1}^n).    (9.5)

We determine, as with ODEs, the accuracy of this approach with Taylor expansions, but this time in both dimensions. In t, we have

µ_j^{n+1} = µ_j^n + ∆t (∂µ/∂t)|_j^n + ((∆t)²/2) (∂²µ/∂t²)|_j^n + . . . ,

In s, we have

µ_{j+1}^n = µ_j^n + ∆s (∂µ/∂s)|_j^n + ((∆s)²/2) (∂²µ/∂s²)|_j^n + ((∆s)³/3!) (∂³µ/∂s³)|_j^n + . . .

µ_{j−1}^n = µ_j^n − ∆s (∂µ/∂s)|_j^n + ((∆s)²/2) (∂²µ/∂s²)|_j^n − ((∆s)³/3!) (∂³µ/∂s³)|_j^n + . . .

Substituting these into Equation 9.5, and rearranging, yields:

∆t (∂µ/∂t)|_j^n + ((∆t)²/2) (∂²µ/∂t²)|_j^n + . . . = −(c∆t/(2∆s)) ( 2∆s (∂µ/∂s)|_j^n + 2((∆s)³/3!) (∂³µ/∂s³)|_j^n + . . . ).

Clearly, as might be expected given the methods we chose, this approach is 2nd-order truncated and so 1st-order accurate in t, and 3rd-order truncated and so 2nd-order accurate in s.

We expect the stability of this approach to be more sensitive to ∆t than to ∆s. Letting µ(t, s) = τ(t)σ(s), Equation 9.5 takes the form

τ^{n+1} σ_j = τ^n σ_j − (c∆t/(2∆s)) τ^n (σ_{j+1} − σ_{j−1}),

or

τ^{n+1} = τ^n ( 1 − (c∆t/(2∆s)) (σ_{j+1} − σ_{j−1})/σ_j ).

For uncertainties in solutions δτ ,

δτ^{n+1} = ( 1 − (c∆t/(2∆s)) (σ_{j+1} − σ_{j−1})/σ_j ) δτ^n.

Thus,

γ = 1 − (c∆t/(2∆s)) (σ_{j+1} − σ_{j−1})/σ_j.


This, recall, would indicate a stable method if |γ| ≤ 1. In a procedure called a von Neumann analysis, we approximate the spatial behavior with a plane wave, σ(s) = e^{iks}, where k is the so-called wave number, and try to discover some k or set of ks that satisfies the stability criterion. Substituting,

γ = 1 − (c∆t/(2∆s)) (e^{iks_{j+1}} − e^{iks_{j−1}})/e^{iks_j}
  = 1 − (c∆t/(2∆s)) (e^{ik(s_{j+1}−s_j)} − e^{ik(s_{j−1}−s_j)})
  = 1 − (c∆t/(2∆s)) (e^{ik∆s} − e^{−ik∆s})
  = 1 − (c∆t/(2∆s)) (2i sin k∆s)
  = 1 − i (c∆t/∆s) sin k∆s,

so

|γ|² = 1 + (c∆t/∆s)² sin² k∆s,

which is larger than unity for all k and all finite ∆s and ∆t, and so this method is inherently unstable.

Instead, we can replace µ_j^n in Equation 9.5 with the average of its nearest neighbors, µ_j^n → (µ_{j+1}^n + µ_{j−1}^n)/2. This leads to

γ = cos k∆s − i (c∆t/∆s) sin k∆s,

or

|γ|² = cos² k∆s + (c∆t/∆s)² sin² k∆s
     = 1 − sin² k∆s + (c∆t/∆s)² sin² k∆s
     = 1 − sin² k∆s { 1 − (c∆t/∆s)² },

which will be less than or equal to unity for all k if the quantity in the brackets is non-negative, or

1 − (c∆t/∆s)² ≥ 0,

which means

c∆t/∆s ≤ 1,

or

∆s/∆t ≥ c.


In words, this says that ∆s/∆t, the rate of propagation of information (changes) around the space-time grid of the problem, must be at least as large as c, the magnitude of the wave velocity. Another way of saying this is that the time increment ∆t must be smaller than the time it takes the wave to travel the spatial increment, ∆s/c. Equivalently, the spatial increment ∆s must be larger than the distance the wave travels in one time increment, c∆t.

This condition is called a Courant-Friedrichs-Lewy condition, here as applied to a hyperbolic equation.
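A Python sketch of the neighbor-averaged iteration for the advection equation 9.4 (a Lax-type scheme), with an arbitrary Gaussian initial profile, periodic ends, and increments chosen to respect the Courant-Friedrichs-Lewy condition:

import math

def advect(mu, c, ds, dt, nsteps):
    # neighbor-averaged form of Eq. 9.5, with periodic boundary points
    J = len(mu)
    r = c * dt / (2.0 * ds)
    for _ in range(nsteps):
        new = [0.0] * J
        for j in range(J):
            left, right = mu[j - 1], mu[(j + 1) % J]
            new[j] = 0.5 * (right + left) - r * (right - left)
        mu = new
    return mu

c, ds = 1.0, 0.1
dt = 0.05                                   # c*dt/ds = 0.5 <= 1 satisfies the CFL condition
mu0 = [math.exp(-((j * ds - 5.0) ** 2)) for j in range(100)]
print(max(advect(mu0, c, ds, dt, 100)))     # stays bounded (the pulse broadens a little);
                                            # with c*dt/ds > 1 the amplitude blows up instead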

9.5 Parabolic Equations

Examples of this sort are the heat equation, the diffusion equation, and Schrödinger's time-dependent equation.

The simplest treatment is the same as that for the hyperbolic case. Consider the diffusion equation:

∂u/∂t = κ ∂²u/∂x²

Then, again using the Euler and leap-frog methods, we arrive at a two-dimensional iteration relation

u_j^{n+1} = u_j^n + (κ∆t/(∆x)²)(u_{j+1}^n − 2u_j^n + u_{j−1}^n)    (9.6)

Again, stability is assured for |γ| ≤ 1. Employing the von Neumann analysis as for the hyperbolic case, we find

γ = 1 − (4κ∆t/(∆x)²) sin²(k∆x/2)    (9.7)

which is stable for all k if

∆t ≤ (1/2)(∆x)²/κ    (9.8)

So this looks fine, but it's very expensive in terms of computer time: because the allowed ∆t scales with (∆x)², halving the spatial step quadruples the number of time steps required.
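A Python sketch of the explicit scheme of Equation 9.6, with an arbitrary initial spike and ∆t chosen to satisfy Equation 9.8:

def diffuse(u, kappa, dx, dt, nsteps):
    # explicit update of Eq. 9.6 with fixed (Dirichlet) end values
    r = kappa * dt / dx**2               # must satisfy r <= 1/2 (Eq. 9.8)
    for _ in range(nsteps):
        new = u[:]                        # end points stay fixed
        for j in range(1, len(u) - 1):
            new[j] = u[j] + r * (u[j+1] - 2*u[j] + u[j-1])
        u = new
    return u

kappa, dx = 1.0, 0.1
dt = 0.4 * dx**2 / kappa                  # safely below the stability limit of Eq. 9.8
u0 = [0.0] * 21
u0[10] = 1.0                              # initial spike in the middle
print(max(diffuse(u0, kappa, dx, dt, 200)))   # the spike spreads out and decays smoothly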


Chapter 10

Matrices

10.1 Introduction

A matrix is a collection of quantities arranged in rows and columns. Its constituents, or elements, are identifiable by row and column addresses: the (i, j)th element is found at the intersection of row i with column j. This convenient layout and addressing make matrices naturally amenable to computer techniques. Fortunately, it turns out that most problems requiring numerical (as opposed to analytical) solutions can be formulated in terms of matrices.

Methodologically, what's ordinarily involved is algebraic manipulation, such as adding or multiplying matrices, inversion, or determination of a trace or eigenvalues. The ease or difficulty of such procedures depends on the type of matrix involved.

10.2 Classification

Assuming there are m rows and n columns to a matrix, then, if m = n, the matrix is square; otherwise, it is rectangular. If m = 1 and n > 1, the matrix is known as a 1-by-n row vector; if m > 1 and n = 1, it is known as an m-by-1 column vector. A 1-by-1 matrix is commonly referred to as a scalar.

Consider an arbitrary m-by-n matrix A with elements a_ij. If A is square, then the diagonal along which i = j is known as the principal diagonal, and the rest of the matrix not along this diagonal, that is, for i ≠ j, is called off-diagonal. The sum of the elements along the principal diagonal is known as the trace of A:

Tr(A) ≡ Σ_{i=1}^{n} a_ii.    (10.1)

The transpose of A, Aᵀ, is formed by interchanging each element a_ij with element a_ji. The complex conjugate of A, A*, is formed by replacing each element a_ij with its complex conjugate a*_ij. The adjoint, or conjugate transpose, of A, A†, is formed


by interchanging the complex conjugate of each element a_ij, a*_ij, with the complex conjugate of element a_ji, a*_ji.

Further classification is possible by determining characteristics of the elements.

If all the elements of a matrix are zero, then the matrix is called null, 0.

If ~x is an m-by-1 column vector with at least one nonzero element, and if A~x = 0, then A is said to be singular. Also, if the determinant of A is zero, then A is singular. These are equivalent definitions.

If A is square and all off-diagonal elements are zero while the elements of the principal diagonal, a_ii, are finite, then A is said to be diagonal. Diagonalization, the process of transforming a general square matrix into a diagonal matrix, is equivalent to finding the eigenvalues of the matrix.

A diagonal matrix whose elements along the diagonal are all equal to 1 is designated the identity matrix I.

If A is square and all elements are zero except the diagonal, a_ii, and first off-diagonal, a_{i,i±1}, then A is said to be tridiagonal. Such matrices often appear at an intermediate stage of a diagonalization procedure and in 1-dimensional problems.

If A is square and all off-diagonal elements for which i < j are zero, then A is said to be lower triangular.

If A is square and all off-diagonal elements for which i > j are zero, then A is said to be upper triangular.

If A is square and A = Aᵀ, that is, all a_ij = a_ji, then A is said to be symmetric.

If A is square and A = A†, that is, all a_ij = a*_ji, then A is said to be self-adjoint or Hermitian. All eigenvalues of such matrices are real, and therefore physical, so Hermitian matrices show up frequently in physics. The most common physics matrices are, in fact, a special type of Hermitian: real symmetric. In this case, as the name implies, A is symmetric, and all its elements are real. The square matrix A is another special sort of Hermitian, called positive definite, if all of its eigenvalues are greater than zero. There is also a type called non-negative definite, which includes eigenvalues of zero.

A matrix whose elements are mostly zero is referred to as sparse. If such a matrix is square, and the number of finite (non-zero) elements is proportional to n, it may be possible to determine the eigenfunctions of the matrix without storing every element in (computer) memory. This convenience will turn up in simplified models of physical phenomena and in the discretization of partial differential equations.

This does not, of course, exhaust the categorization of matrices, but it's more than enough for our purposes.

10.3 Matrix Addition

If A, B, and C are matrices of the same order m × n, and C is said to be the sum or difference of A and B, C = A ± B, then what is meant is:

c_ij = a_ij ± b_ij,   i = 1, 2, . . . , m;  j = 1, 2, . . . , n


This topic provides an opportunity to learn (or trivially practice) some computer programming. The logic of computer operations and that of matrix manipulations are well matched, so programming for such manipulations is fairly obvious. The storage of elements is in arrays; matrix elements are stored in two-dimensional arrays. Storage, however, is language dependent. In some computer languages, like C, C++, and Pascal, the second index varies faster and the storage is row-by-row. In other languages, such as Fortran, it's the reverse: the first index varies fastest and storage is column-by-column. As a general rule, efficient programming accesses elements in the order in which they are stored. Thus, in the first case, the algorithm for adding or subtracting matrices would look like:

loop over rows
    loop over row elements
    end row element loop
end row loop

while in the second case the algorithm would look like:

loop over columns
    loop over column elements
    end column element loop
end column loop

The time a computer takes to add or subtract matrices is roughly proportional to the total number of matrix elements.

10.4 Matrix Multiplication

If A and B are of the same order, and B is the product of A with a scalar γ, B = γA, then

b_ij = γ a_ij.

If A is of order m × n and B is of order n × p, and C = AB is the product of A and B, where C is of order m × p, then

c_ij = Σ_{k=1}^{n} a_ik b_kj.

If A, B, and C are square matrices, then matrix multiplication among them is associative, A(BC) = (AB)C, and distributive, (A + B)C = AC + BC and C(A + B) = CA + CB, but not generally commutative, AB ≠ BA, a significant fact in understanding quantum systems.

If A and 0 are both square, then A0 = 0A = 0.

If A and B are both square matrices such that AB = BA = I, then A and B are both nonsingular and are inverses of one another, B = A⁻¹ or A = B⁻¹.

Given a matrix A and a scalar γ, the inverse of their product is (γA)⁻¹ = (1/γ)A⁻¹.


The inverse of the product of two matrices is (AB)⁻¹ = B⁻¹A⁻¹, the commuted product of the inverse matrices.

The transpose of a matrix product is the commuted product of the transposed matrices, (AB)ᵀ = BᵀAᵀ, and the adjoint of a matrix product is the commuted product of the adjoint matrices, (AB)† = B†A†. The transpose of an inverse matrix is (A⁻¹)ᵀ = (Aᵀ)⁻¹, the inverse of the transposed matrix.

If A is square and A†A = I, or, equivalently, A† = A⁻¹, then A is said to be unitary. Because the eigenvectors of a Hermitian matrix are orthogonal, a matrix whose columns are these eigenvectors is unitary.

If A is non-singular, then B and C are said to be similar when C = A⁻¹BA.

The power of a matrix, as with the power of a scalar, is the repeated product, Aⁿ = AAA···A, n times. For example, A⁴ = AAAA. It is also the case that, by use of a series expansion, a matrix may be exponentiated:

e^A ≡ Σ_{n=0}^{∞} Aⁿ/n!.

A final kind of matrix multiplication is called the direct product, by which, say, an m × m matrix and an n × n matrix form an mn × mn matrix:

C = A ⊗ B ≡
    [ a11 B   a12 B   · · ·   a1m B ]
    [ a21 B   a22 B   · · ·   a2m B ]
    [  ...     ...    · · ·    ...  ]
    [ am1 B   am2 B   · · ·   amm B ]

That is, each element of A multiplies the entire matrix B, and the resulting blocks are arranged as shown.

Efficient programming of matrix multiplication, which involves looping over the rows of one matrix against the columns of another, and so requires three loops instead of just two, depends on knowing the architecture of the machine you're running on as well as the language used. Generally, the time a computer takes to multiply matrices is proportional to the third power of the average number of operations in each loop.

10.5 Solving Systems of Equations

Given the square matrix A and rectangular matrices X and B (A is order m × m, and X and B are order m × n), we look to solve

AX = B.

If A is non-singular, then the solution for X is found formally by pre-multiplying both sides of the matrix equation by the inverse of A, A⁻¹, to get

X = A⁻¹B.

It is important to see that pre-multiplying (multiplying from the left) and post-multiplying (multiplying from the right) are not equivalent in the case of matrices, since matrix multiplication may not, and often does not, commute.

The inversion, particularly when the order of the matrix (related to the number of unknowns to be solved) is high, can be quite cumbersome.


10.5.1 Standard Exact Methods

The typical approach to solving algebraic equations follows some sort of elimination or reduction associated with the name Gauss. The procedure evolves in two stages. Along the way, all that are used are the elementary row operations:

1. Row interchange, Ri ↔ Rj ,

2. Row replacement with a nonzero multiple of itself, ρRi → Ri,

3. Row replacement with the sum of itself and a multiple of another row, Ri +ρRj → Ri,

where R is a matrix row and ρ is a non-zero real number.

In the first stage (forward elimination), the matrix (or system of equations) is "reduced" to row echelon form, an upper triangular form in which

1. all nonzero rows are above all all-zero rows, and

2. the leading coefficient, or pivot, of a row (required to be 1, in our case) is strictly to the right of the leading coefficient of the row above it.

The second stage reduces the row echelon form into a reduced row echelon (or row canonical) form, which is the same as the row echelon form, but with the following additional characteristics:

1. pivots must be 1, and

2. all entries above each pivot (in the same column) are zero.

If, as a result of these operations, one encounters a degenerate equation with nosolution, the conclusion is that this system has no unique solution. A solution set maybe expressable parametrically. There are other methods for making this determination.

In actual practice, the matrix manipulated with this approach is not A, but a so-called augmented matrix obtained by combining A and B into (A|B). Computer algo-rithms used to find these sorts of solutions logically amount to pre-multiplication of theaugmented matrix by a series of invertible matrices, which themselves may be idealizedas a single matrix resulting from ordered products. The process, with respect to A, isknown as matrix decomposition, or factorization of the matrix. In this case, where therow echelon form is an upper triangular matrix, the product of the invertible matricesis a lower triangular matrix. The result of the first stage, then, is an LU decompositionof A. The second stage transforms this matrix product into the product of an invertiblematrix and a reduced row echelon matrix.

Let's describe an algorithm that transforms the matrix "in place," that is, the elements of the original matrix are replaced by the elements of a row echelon form, which may then be solved by back-substitution (there are other ways):

1. Create 2-dimensional arrays a[i, j] and b[i, j] which store the elements of A and B by row i and column j.


2. Loop over columns of A from left to right (call the current column the pivot column), through the last column of A, locating the diagonal element, which is the pivot, in the pivot row.

3. Divide every element in the pivot row, which includes the corresponding row of B, by the pivot value.

4. Subtract appropriate multiples of the pivot row from each row of A and B below the pivot row to get a zero in every entry below the pivot in the pivot column.

This is followed by the back-substitution. This procedure's stability is assured if the matrix is diagonally dominant,

    |a_{ii}| > \sum_{j \ne i} |a_{ij}|,

that is, if the magnitude of the diagonal element in each row of the matrix is bigger than the sum of the magnitudes of all the other (off-diagonal) elements in the same row, or if the matrix is Hermitian and positive definite.

In the following pseudo-code outline of the algorithm, we incorporate a trick, called partial pivoting, to improve the algorithm's stability by swapping rows to set the largest (absolute value) element in the pivot column as the pivot (thereby reducing incidents of division by small numbers).

loop over the columns of A
    loop over rows of (A|B)
        identify the row that has the largest element in the pivot column
    end row loop
    swap identified row with the row whose index is the same as that of the pivot column
    if the pivot is nonzero, then divide all elements in the (new) pivot row by the pivot value (pivot now equals 1)
    loop over all rows below the pivot row
        subtract the product of the value of the element in the pivot column and the pivot row from the current row (value under the pivot will be zero)
    end row loop
end column loop

Then reverse order, starting with the last row and last column in A, and make each pivot equal to 1. Work back upwards so that all values above and below the unit pivots are zero. The system of equations is then solved.
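
A minimal Python sketch of the forward-elimination stage with partial pivoting, followed by back-substitution, might look as follows. It assumes a single right-hand-side column b, modifies its arguments in place, and does no error handling for singular systems:

def gauss_solve(a, b):
    # Solve a x = b by Gaussian elimination with partial pivoting.
    # a is an n x n list of row lists, b a list of length n.
    n = len(a)
    for col in range(n):
        # Partial pivoting: pick the row (at or below col) with the largest |a[row][col]|.
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        # Divide the pivot row by the pivot value, so the pivot becomes 1.
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        b[col] /= p
        # Subtract multiples of the pivot row to zero the entries below the pivot.
        for row in range(col + 1, n):
            factor = a[row][col]
            a[row] = [arv - factor * acv for arv, acv in zip(a[row], a[col])]
            b[row] -= factor * b[col]
    # Back-substitution (the pivots are already 1).
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
    return x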


10.5.2 Iterative Methods

The decomposition phase of Gaussian elimination uses computer time as N^3, where N is the number of equations in N unknowns. The back-substitution phase goes as N^2 for each column of B. Thus, for really large systems, it pays to have an alternative. If the matrices involved happen to be sparse, then iterative methods are available. We present two, both stationary, in which each iteration consists of the same operation on the current iteration arrays. In both cases, the iteration continues until the changes between iterations are smaller than some tolerance value.

Jacobi Method

This method assigns a single value to each unknown with respect to all other unknowns. It is appropriate, and, not coincidentally, likely to converge, when the largest (absolute) value of each row and column of the matrix representation of the linear equation set is along the diagonal. Convergence means the values change less with each iteration, and is assured if the matrix is strictly diagonally dominant:

    |a_{ii}| > \sum_{j \ne i} |a_{ij}|  (strict row diagonal dominance)

    |a_{jj}| > \sum_{i \ne j} |a_{ij}|  (strict column diagonal dominance).

Formally, the method starts with separating A into three pieces: a diagonal matrix D, a strictly upper triangular matrix U, and a strictly lower triangular matrix L: A = D + U + L. In terms of these, the Jacobi iterative solution is

    X^{(\ell+1)} = D^{-1}\left[ B - (U + L) X^{(\ell)} \right].

At the element level, this logical division into three matrices is unnecessary:

    x^{(\ell+1)}_{ij} = \frac{1}{a_{ii}} \left( b_{ij} - \sum_{k \ne i} a_{ik}\, x^{(\ell)}_{kj} \right),

for i = 1, 2, \ldots, m, j = 1, 2, \ldots, n. Notice that each element requires every other element of its column (except itself), and so individual elements cannot be overwritten: two arrays are required.

In outline, the algorithm can be conceptualized as follows:

loop until convergence criteria met
    loop over rows
        set initial value (x^{(0)}_{ij} = 0)
        loop over columns
            compute sum, excluding element under consideration
        end column loop
        compute iterated value
    end row loop
    substitute new values for old
    check for convergence
end convergence loop
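
A minimal Python sketch of the Jacobi iteration for a single right-hand-side column b (so x is a vector); the tolerance and iteration cap are illustrative choices:

def jacobi(a, b, tol=1e-10, max_iter=1000):
    # Iteratively solve a x = b; a should be (strictly) diagonally dominant.
    n = len(a)
    x = [0.0] * n                 # current iterate
    for _ in range(max_iter):
        x_new = [0.0] * n         # second array: every update uses only old values
        for i in range(n):
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - s) / a[i][i]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new                 # substitute new values for old
    return x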

Gauss-Seidel Method

Values for unknowns are updated as more information becomes known. In the Jacobi method, iterated values are all computed from older values. Ongoing replacement marginally increases the convergence rate and decreases the memory utilization, both effectively speeding up the calculation, at least a little.

In the Gauss-Seidel method, the matrix equation is

    X^{(\ell+1)} = (D + L)^{-1} \left( B - U X^{(\ell)} \right).

The actual calculation is done at the element level:

    x^{(\ell+1)}_{ij} = \frac{1}{a_{ii}} \left( b_{ij} - \sum_{k < i} a_{ik}\, x^{(\ell+1)}_{kj} - \sum_{k > i} a_{ik}\, x^{(\ell)}_{kj} \right),

for i = 1, 2, \ldots, m, j = 1, 2, \ldots, n.

In outline,

loop until convergence criteria met
    loop over rows
        set initial value (x^{(0)}_{ij} = 0)
        loop over columns
            compute sum, using new values where already computed, but excluding element under consideration
        end column loop
        compute iterated value, replacing previous value
    end row loop
    check for convergence
end convergence loop
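
The corresponding Gauss-Seidel sketch differs only in that updated components are used as soon as they are available, so a single array suffices (again one right-hand-side column; tolerance and iteration cap are illustrative):

def gauss_seidel(a, b, tol=1e-10, max_iter=1000):
    n = len(a)
    x = [0.0] * n
    for _ in range(max_iter):
        change = 0.0
        for i in range(n):
            # x[0..i-1] already hold new values; x[i+1..] still hold old ones.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new = (b[i] - s) / a[i][i]
            change = max(change, abs(new - x[i]))
            x[i] = new            # replace the previous value immediately
        if change < tol:
            break
    return x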

10.6 Eigenvalue Problems

If A is an n × n square matrix, and \vec{x} is a column vector with n rows, then the equation

    A\vec{x} = \lambda\vec{x}


is known as an eigenvalue equation, and the \lambda are the eigenvalues of A. The set of eigenvalues \{\lambda_1, \lambda_2, \ldots, \lambda_n\} is called the spectrum of A. The eigenvalue with the largest magnitude gives the spectral radius of A.

Given the n × n identity matrix I,

    A\vec{x} = \lambda\vec{x} = \lambda I \vec{x},

or

    (A - \lambda I)\vec{x} = \vec{0}.

If the determinant of the quantity in parentheses equals zero,

    |A - \lambda I| = 0,

then we have an nth-degree polynomial known as the characteristic equation of A, which has n roots, known as characteristic roots, which are also the eigenvalues \lambda of A. In this case, \vec{x} is a solution of (A - \lambda I)\vec{x} = \vec{0}, and is called a characteristic vector. If |\vec{x}| = 1, or equivalently

    \vec{x}^{\,T}\vec{x} = 1,

then \vec{x} is a normalized eigenvector associated with the eigenvalue \lambda of A.

The sum of the eigenvalues in a spectrum is equal to the trace of the matrix,

    \mathrm{trace}\, A = \sum_{i=1}^{n} \lambda_i,

and the product is equal to the determinant,

    \det A = \prod_{i=1}^{n} \lambda_i.

Matrices similar to one another have the same eigenvalues. So, obviously, a diagonal matrix whose elements are the eigenvalues of some matrix is similar to that matrix. Thus, diagonalization and determining eigenvalues amount to essentially the same thing.

If A is a square matrix and can be transformed to the diagonal matrix D via the transformation matrix S,

    D = S^{-1} A S,

then S is an invertible matrix whose columns are the eigenvectors of A. So, in diagonalizing a matrix, we get the eigenvalues and the eigenvectors. And, conversely, in getting its eigenvalues and associated eigenvectors, one diagonalizes a matrix.

We won't discuss diagonalization algorithms in detail. Canned routines are available to diagonalize real symmetric, Hermitian, tridiagonal, and other special matrices. Computing environments like Mathematica and Maple can handle general matrices.

In the case of real symmetric or Hermitian matrix diagonalization, the approach is typically to transform the original matrix into tridiagonal form first with one routine, and then complete the diagonalization of the tridiagonal matrix with a second routine.
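
For example, assuming NumPy is available, a real symmetric (or Hermitian) matrix can be handed to a canned routine in a couple of lines; this merely illustrates calling such a library, and the small test matrix is arbitrary:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                      # small real symmetric test matrix
eigenvalues, eigenvectors = np.linalg.eigh(A)   # columns of 'eigenvectors' are the normalized eigenvectors
print(eigenvalues)                              # -> [1. 3.]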


Chapter 11

Monte Carlo

Monte Carlo (MC) is a generic name for a group of techniques that rely on random or pseudo-random numbers and statistical techniques to solve a great variety of problems. MC methods permit the sampling of configurations of large, complex systems, the results of which offer a characterization of the whole.

11.1 Random Number Generators

Random number generators (RNGs) are algorithms or devices capable of producing patternless sequences. Historically, physical techniques, such as flipping coins, tossing dice, pulling cards from a shuffled deck, and the like, served this purpose. At present, physical techniques employ analog noise and radioactive decay, for example.

Computers are now frequently used. Nearly all higher-level languages provide a generator in their libraries. Due to the finite resolution of computers, such algorithms do repeat after many trials, and are therefore referred to as pseudo-random number generators (PRNGs).

The most typical computer algorithm relies on the generation of chaotic sequences. Two of the most popular are known as congruential. A linear congruential generator is an iterative method:

    x_{n+1} = (a * x_n + b) % m

or

    x_{n+1} = (a * x_n + b) mod m,

where a and b are large (usually prime) integers, and m is 2 raised to the power of the number of bits the computer employs in integer arithmetic. So, for a 32-bit machine, m = 2^{32}. This is the maximum number of non-repeating numbers that can be generated, although repetition usually commences well before this number of iterations. The symbols % and mod, known as modulo, indicate that the result is the remainder after dividing the quantity to the left of the symbol by the quantity on the right (the modulus).


A slightly simpler algorithm, with correspondingly shorter randomness, is the multiplicative congruential generator:

    x_{n+1} = (a * x_n) % b,

where, again, a and b are large (usually prime) integers.

In both these algorithms, x_1 (or x_0, depending on the language employed) is known as the seed. For a given seed, the resulting sequence, though chaotic, is determined, and so the procedure is deterministic. Keeping track of the seed is a good idea, in case the calculation needs to be repeated. It is common, when different sequences need to be generated, to use the system clock time (a unique value) as the seed.
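
A minimal Python sketch of a linear congruential generator; the constants a and b and the seed below are arbitrary illustrative values, not a recommended production choice:

def lcg(seed, a=1103515245, b=12345, m=2**32):
    # Yield an unending sequence of pseudo-random integers in [0, m).
    x = seed
    while True:
        x = (a * x + b) % m
        yield x

gen = lcg(seed=42)
print([next(gen) for _ in range(3)])   # the same seed always reproduces the same sequence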

11.2 Sampling Distributions

Generators, such as those discussed above, produce numbers distributed evenly between 0 and m or b. Scaling these can produce flat distributions between other given limits, most frequently between 0 and 1. Thus, in the multiplicative case,

    x_{n+1} = ((a * x_n) % b) / b,

distributes evenly between 0 and 1.

It is possible to produce other than flat distributions. Library routines are usually available for different kinds of distributions. But we describe how a few can be generated.

Gaussian  One way to produce random numbers distributed normally, or, equivalently, that follow a Gaussian distribution, is to rely on the Central Limit Theorem: produce a series of, say, 10 generated numbers, and find the average. A series of these averages should form a bell curve.

Exponential decay  Some functions can be inverted without much effort. If the desired distribution is represented by such a function, then a series of outputs of the inverse of the function acting on evenly-distributed, randomly generated numbers generates the distribution of interest.
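
As minimal sketches of these two tricks (using Python's random.random() purely as a stand-in for the flat generator):

import math
import random

def clt_gaussian(n=10):
    # Average n flat deviates; by the Central Limit Theorem the averages are
    # approximately normal, with mean 0.5 and variance 1/(12 n).
    return sum(random.random() for _ in range(n)) / n

def exponential(tau):
    # The inverse of the exponential cumulative distribution, applied to a flat
    # deviate, yields deviates distributed as exp(-x/tau)/tau.
    return -tau * math.log(1.0 - random.random())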

Sinusoidal  If x is the output of a flat generator of random numbers between 0 and 1, then u = 2x - 1 is a flat distribution between -1 and 1. If we generate two numbers, v_1 = x, which is flat between 0 and 1, and v_2 = u, which is flat between -1 and 1, we can calculate a third number, r, such that r^2 = v_1^2 + v_2^2. If r^2 > 1, then start over and regenerate two new numbers, v_1 and v_2; otherwise, the sine of a random, evenly distributed angle between 0 and 2\pi is given by

    \sin\theta = \frac{2 v_1 v_2}{r^2},

and the cosine over the same flat angular distribution is given by

    \cos\theta = \frac{v_2^2 - v_1^2}{r^2}.
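
A minimal Python sketch of this recipe (random.random() again stands in for the flat generator):

import random

def random_sin_cos():
    # Return (sin, cos) of an angle distributed evenly between 0 and 2*pi,
    # following the rejection recipe described above.
    while True:
        v1 = random.random()              # flat between 0 and 1
        v2 = 2.0 * random.random() - 1.0  # flat between -1 and 1
        r2 = v1 * v1 + v2 * v2
        if 0.0 < r2 <= 1.0:               # otherwise start over
            return 2.0 * v1 * v2 / r2, (v2 * v2 - v1 * v1) / r2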


11.3 Numerical Integration

Recalling that integrating is a procedure for determining the area (volume) under a curve (inside a solid), an MC technique known as hit-or-miss may be employed to solve a definite integral. Defining a square (cube) that encompasses the range of the integration, one generates two (or more) random numbers, evaluates the integrand, and counts the frequency with which the generated numbers lie within the boundaries of the integral. The ratio of this count to the total number of generation cycles is related to the area (volume) encompassed by the integral.

For example, if y = f(x), and we want to evaluate

    I = \int_a^b y\, dx,

we create a rectangle with one side at least of length b - a, and the other at least of length y_{max}. We then generate pairs of random numbers, which we can label x and y', respectively. If y' \le f(x) = y, then the point (x, y') lies inside the region bounded by f(x), and is counted as a "hit." If, on the other hand, y' > f(x), we have a "miss." After many recurrences of this procedure, the ratio # hits/(# hits + # misses) equals the area under the curve divided by the area of the rectangle. Therefore, the area under the curve, or, equivalently, the value of the integral, is I = area of rectangle × # hits/(# hits + # misses). This algorithm has rather slow convergence properties, so there are many more efficient procedures available, but this is among the easiest to comprehend.
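
A minimal Python sketch of hit-or-miss integration for a non-negative integrand; the function name, the sample count, and the use of random.uniform are illustrative assumptions:

import random

def hit_or_miss(f, a, b, y_max, n_trials=100000):
    # Estimate the integral of f over [a, b], assuming 0 <= f(x) <= y_max there.
    hits = 0
    for _ in range(n_trials):
        x = random.uniform(a, b)               # random abscissa in [a, b]
        y_prime = random.uniform(0.0, y_max)   # random ordinate in [0, y_max]
        if y_prime <= f(x):                    # point falls under the curve: a "hit"
            hits += 1
    return (b - a) * y_max * hits / n_trials   # rectangle area times hit fraction

# Example: the integral of x*x from 0 to 1 is 1/3.
# print(hit_or_miss(lambda x: x * x, 0.0, 1.0, 1.0))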

For few dimensions, the numerical integration techniques we discussed earlier in the course are more efficient, but as the number of dimensions increases, MC solutions are quicker. For example, the trapezoidal rule converges as N^{-2/D}, where D is the number of dimensions, whereas MC techniques converge as N^{-1/2}. Thus, for D > 4, MC is faster.

11.4 The Metropolis Algorithm

Simulating "reality" with algorithms employing random numbers is often the only way to obtain quantitative results characterizing large, complex systems. Assume such a system is in some state (or condition) i and can change to another state j, and vice versa. If, in spite of such changes, the system is to remain in equilibrium, the change from state i to state j must occur just as frequently as the change from j to i. If p_i (p_j) is the probability of finding the system in state i (j), and T_{i\to j} (T_{j\to i}) is the probability or rate that state i (j) makes a transition to state j (i), then

    p_i T_{i\to j} = p_j T_{j\to i},

which can be rearranged as

    \frac{T_{i\to j}}{T_{j\to i}} = \frac{p_j}{p_i}.

The ratio of state probabilities p_j/p_i is typically given by some theory or known empirically. The modeling of the system, then, requires getting the transition rates


correct. In the use of Monte Carlo techniques, this is done by imagining the ratio to be some sort of distribution. Random numbers are generated and, as in the case of numerical integration, compared to the value of p_j/p_i at each step. In this context, the procedure is called importance sampling. If the probability ratio is greater than the random number, then the transition occurs; otherwise, it does not. Symbolically,

    T_{i\to j} = \begin{cases} 1 & \text{if } p_j > p_i \\ p_j/p_i & \text{if } p_j < p_i. \end{cases}

The simulation comes in the case that p_j < p_i. There is a finite possibility that the transition takes place even if the probability of the new state is less than that of the old. A random number, chosen between 0 and 1, is compared to p_j/p_i; if the random number is smaller than the ratio, the transition occurs.

After many such trials, the modeled behavior of the system should follow the transition rate, and it becomes possible to calculate characteristic properties of the system in terms of averages or expectation values.

If the x_k are a collection of outcomes, of which there are N possibilities, each with an occurrence probability of p_k, then the expectation value of x is

    E(x) = \sum_{k=1}^{N} x_k p_k.

If, instead of a simple outcome, one were interested in some function of such outcomes, f(x), the expected value of f(x) is

    E[f(x)] = \sum_{k=1}^{N} f(x_k)\, p_k.

Note that these sums are over all components, and the function f(x) could be a quite complicated function to calculate for each component. The same amount of time will be spent computing very improbable terms as very probable terms, and many identical things end up being recalculated. The end result is a limit to the feasibility of the approach.

We can improve things by replacing the sum over all possibilities with a sum involving only the more likely terms, which are sampled according to an appropriate probability distribution. And herein lies the power of the Monte Carlo technique.

Let's make things somewhat more concrete. Say a system contains a large number of components, each of which can be in one of two states, identified as -1 or 1. The components are arranged in a rectangular lattice, so that each component (except, of course, those on the boundaries; the way these are handled depends on the boundary conditions imposed) is surrounded by four nearest neighbors: left, right, above, and below. The particular state of an individual component depends on the states of its nearest neighbors; various configurations have different associated probabilities. The probability or rate that a component changes its state is related to these probabilities.

Let's consider a model employing such a lattice, called the Ising Model. It considers the interaction between nearest neighbors resulting in a particular element of the lattice taking on one of two possible configurations or values. One can imagine arrows on a chess board that can point in either of two directions: up or down, u = 1, d = -1. If the theory that determines the ratio of probabilities p_j/p_i is based on statistical mechanics, then the ratio will be:

    \frac{p_j}{p_i} = e^{-\beta(E_j - E_i)},

where E_i is the "energy" of state i, and \beta = 1/(k_B T), with k_B known as Boltzmann's constant. If we symbolize the direction of the pointer by s, then E_i = -J s_i (s_{left} + s_{right} + s_{above} + s_{below}), where J is a parameter with dimensions of energy that characterizes the system. Since systems tend to their lowest energy state, configurations with aligned neighbors are preferred. Since s can have values of ±1 only, an energy change E_j - E_i involving a switch of pointer direction is

    E_j - E_i = 2 J s_i (s_{left} + s_{right} + s_{above} + s_{below}).

If this quantity is zero or negative, then p_j/p_i \ge 1, so the pointer direction changes. If the quantity is positive, then p_j/p_i < 1. In this case, a random number between 0 and 1 is compared to p_j/p_i = e^{-\beta(E_j - E_i)}; if it is smaller, then the pointer changes direction, otherwise it is left alone. A pass through all components in this fashion is known as a Metropolis sweep.

Note that the parameter J\beta is left undefined. It characterizes the system and, when varied, results in different behaviors at equilibrium. It generally takes a number of sweeps before equilibrium is reached. This is especially true if the characteristic parameter J\beta is such that the system is near a "phase transition," that is, where the global alignment of the system changes dramatically.

Good means for determining if equilibrium has been reached are available in the stabilizing of the total energy and the "average alignment." In our case, for an N × N lattice,

    E \sim \sum_{\ell=1}^{N^2} \sum_{k \ne \ell} s_\ell s_k

and

    \langle s \rangle = \frac{1}{N^2} \sum_{\ell} s_\ell.
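
A minimal Python sketch of one Metropolis sweep over an N × N lattice of ±1 pointers; the periodic boundary conditions and the single parameter j_beta (the product J\beta) are illustrative assumptions:

import math
import random

def metropolis_sweep(s, j_beta):
    # s is an N x N list of row lists holding +1 or -1; one sweep visits every site once.
    N = len(s)
    for i in range(N):
        for j in range(N):
            # Sum the four nearest neighbors, with periodic (wrap-around) boundaries.
            neighbors = (s[(i - 1) % N][j] + s[(i + 1) % N][j] +
                         s[i][(j - 1) % N] + s[i][(j + 1) % N])
            d_e_beta = 2.0 * j_beta * s[i][j] * neighbors   # beta * (E_j - E_i)
            if d_e_beta <= 0.0 or random.random() < math.exp(-d_e_beta):
                s[i][j] = -s[i][j]                          # accept the flip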


Chapter 12

Fourier Analysis

12.1 Introduction

Periodic functions do a better job than, say, polynomials of representing periodic or cyclical behavior. A periodic behavior is one that repeats itself with a fixed frequency. Note that this doesn't imply simple periodicity, as there may be multiple characteristic frequencies or the periodicity may be the effect of multiple periodic behaviors.

Fourier analysis decomposes the behavior into series or integrals of single-component functions, each with its own characteristic frequency parameter. In fact, Fourier analysis can be applied to any functional form that has a finite number of discontinuities (piecewise continuous). Periodic functions and functions defined on a finite interval can be represented by a series known as a Fourier series, while non-periodic functions or functions with unbounded domains can be represented by so-called Fourier integrals.

12.2 Fourier Series

Let \nu signify the frequency of some periodic (piecewise continuous) behavior whose phase moves along at the rate v; then the wavelength is \lambda = v/\nu. Fourier analysis says we can represent this behavior with

    f(x) = a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos\!\left( 2\pi n \frac{x}{\lambda} \right) + b_n \sin\!\left( 2\pi n \frac{x}{\lambda} \right) \right]    (12.1)

where a_n and b_n are (real) expansion coefficients computed by

    a_0 = \frac{1}{\lambda} \int_{-\lambda/2}^{\lambda/2} f(x)\, dx    (12.2)

    a_n = \frac{2}{\lambda} \int_{-\lambda/2}^{\lambda/2} f(x) \cos\!\left( 2\pi n \frac{x}{\lambda} \right) dx \qquad (n = 1, 2, \ldots)    (12.3)


    b_n = \frac{2}{\lambda} \int_{-\lambda/2}^{\lambda/2} f(x) \sin\!\left( 2\pi n \frac{x}{\lambda} \right) dx \qquad (n = 1, 2, \ldots)    (12.4)

Recall that sine (odd) and cosine (even) have definite parity, so that if f(x) has definite parity, one set of coefficients can be discarded, because the product of f(x) and sine or cosine would be odd, giving all associated integrals zero values for all n:

f(-x) = -f(x) (f(x) odd): Then f(x)\cos(2\pi n x/\lambda) is odd, so

    a_n = 0 \quad (n = 0, 1, 2, \ldots)
    f(x) = \sum_{n=1}^{\infty} b_n \sin(2\pi n x/\lambda)

f(-x) = +f(x) (f(x) even): Then f(x)\sin(2\pi n x/\lambda) is odd, so

    b_n = 0 \quad (n = 1, 2, \ldots)
    f(x) = a_0 + \sum_{n=1}^{\infty} a_n \cos(2\pi n x/\lambda)

If f(x) has no definite parity, then the expansion is done in a complex series.

    f(x) = \sum_{n=-\infty}^{\infty} c_n e^{2\pi i n x/\lambda}    (12.5)

where c_n is the appropriate combination of a_n and b_n using Euler's identity, e^{i\theta} = \cos\theta + i\sin\theta, and is found with

    c_n = \frac{1}{\lambda} \int_{-\lambda/2}^{\lambda/2} f(x)\, e^{-2\pi i n x/\lambda}\, dx    (12.6)

In either case, the sums are infinite, and so, for all practical purposes, have to be truncated. If f(x) is normalizable (that is, \int_{-\infty}^{\infty} f(x)\, dx exists), then the coefficients become smaller with n: |a_n| \to 0 and |b_n| \to 0 as n \to \infty, so truncation amounts to cutting off the series when the required precision is reached.

12.3 Fourier Integrals

Any normalizable function f(x) can be expanded in a Fourier integral (or inverse Fourier Transform):

    f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} g(k)\, e^{ikx}\, dk    (12.7)

where k = 2\pi/\lambda is the wavenumber. g(k), the Fourier transform of f(x), plays the same role as the expansion coefficients and is found with the Fourier Transform:

    g(k) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x)\, e^{-ikx}\, dx    (12.8)


Periodic behavior often may be easier to represent with functions of time and frequency (or angular frequency) than with functions of position and wavelength. With \omega = 2\pi\nu,

    f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} g(\omega)\, e^{i\omega t}\, d\omega    (12.9)

    g(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt    (12.10)

Coding efficiency may be aided by recognizing the following relationships between, say, f(x) and g(k):

• f(x) real \Leftrightarrow g(-k) = g^*(k)

• f(x) imaginary \Leftrightarrow g(-k) = -g^*(k)

• f(x) even \Leftrightarrow g(k) even

• f(x) odd \Leftrightarrow g(k) odd

• f(x) real and even \Leftrightarrow g(k) real and even

• f(x) real and odd \Leftrightarrow g(k) imaginary and odd

• f(x) imaginary and even \Leftrightarrow g(k) imaginary and even

• f(x) imaginary and odd \Leftrightarrow g(k) real and odd

• f(ax) \Leftrightarrow \frac{1}{|a|} g(k/a); \frac{1}{|b|} f(x/b) \Leftrightarrow g(bk)  (scaling)

• f(x - x_0) \Leftrightarrow g(k)\, e^{-ikx_0}; f(x)\, e^{ik_0 x} \Leftrightarrow g(k - k_0)  (displacement)

12.4 Discrete Fourier Transformation

Say a set of discrete data points, f_n = f(n\Delta t), n = \ldots, -2, -1, 0, 1, 2, \ldots, are collected by sampling some behavior at a regular time increment of \Delta t, called the sampling interval. The sampling rate is the reciprocal of the sampling interval: the smaller \Delta t, the greater the sampling rate. High sample rates provide finer details of the behavior (of course, too much or too fine detail can be as problematic as too little or too coarse). The sampling rate sets the band-width limit of the function f(t) representing the behavior according to Nyquist's Theorem: f(t) is a band-width limited process if the Fourier Transform (Spectrum), Eq. 12.10 with \nu = \omega/2\pi, is negligible beyond a certain critical (so-called Nyquist) frequency:

    \pm\nu_C \equiv \pm\frac{1}{2\Delta t}    (12.11)

Assuming the criterion of Nyquist's Theorem is met, then the sampling theorem tells us that f(t) is completely determined by

    f(t) = \sum_{n=-\infty}^{\infty} f_n\, \frac{\sin[2\pi\nu_C(t - n\Delta t)]}{2\pi\nu_C(t - n\Delta t)}    (12.12)


However, if the behavior is not band-width limited to frequencies less than the Nyquist frequency, the discreteness of the sampling will result in what is called aliasing: frequency components outside the Nyquist frequency cut-off range are mirrored or translated or superposed inappropriately into the range defined by the Nyquist frequency.

Of course, in a real measurement, the data set will include a finite number, N (assumed here to be even), of points, and the sum runs over n = 0, 1, 2, \ldots, N - 1. The discrete frequencies of the analysis are then given by

    \nu_m = \frac{m}{N\Delta t}, \qquad m = -\frac{N}{2}, \ldots, 0, \ldots, \frac{N}{2}    (12.13)

Note that \pm\nu_C = \nu_{m=\pm N/2}. In terms of this, the Fourier Transform for each discrete frequency \nu_m is

    g(\nu_m) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \nu_m t}\, dt
             \approx \sum_{n=0}^{N-1} f_n\, e^{-2\pi i \nu_m n \Delta t}\, \Delta t
             = \Delta t \sum_{n=0}^{N-1} f_n\, e^{-2\pi i n m/N}    (12.14)

The last summation is called the Discrete Fourier Transform of the N data points f_n:

    g_m \equiv \sum_{n=0}^{N-1} f_n\, e^{-2\pi i n m/N}    (12.15)

And so, the discrete values are

    g(\nu_m) = \Delta t\, g_m    (12.16)

Recall that N is assumed even and m = -\frac{N}{2}, \ldots, 0, \ldots, \frac{N}{2}. From the definition, g_{-m} = g_{N-m}, indicating a periodicity (of N) in the transform. Therefore, rather than vary between -N/2 and N/2, m is varied between 0 and N - 1, just as n, somewhat simplifying the calculation. In this convention, zero frequency corresponds to m = 0, positive frequencies 0 < \nu < \nu_C correspond to 1 \le m \le N/2 - 1, and negative frequencies -\nu_C < \nu < 0 correspond to N/2 + 1 \le m \le N - 1. Both \nu = \nu_C and \nu = -\nu_C correspond to m = N/2. The inverse transform is, then,

    f_n = \frac{1}{N} \sum_{m=0}^{N-1} g_m\, e^{2\pi i n m/N}    (12.17)
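
A minimal Python sketch of Eq. 12.15 and its inverse, Eq. 12.17, as direct O(N^2) translations of the sums (no attempt at efficiency):

import cmath

def dft(f):
    # g_m = sum_n f_n exp(-2 pi i n m / N), m = 0, ..., N-1
    N = len(f)
    return [sum(f[n] * cmath.exp(-2j * cmath.pi * n * m / N) for n in range(N))
            for m in range(N)]

def inverse_dft(g):
    # f_n = (1/N) sum_m g_m exp(+2 pi i n m / N)
    N = len(g)
    return [sum(g[m] * cmath.exp(2j * cmath.pi * n * m / N) for m in range(N)) / N
            for n in range(N)]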

12.5 Fast Fourier Transformation

Define

    W_N \equiv e^{-2\pi i/N}    (12.18)


Then the Discrete Fourier Transform, Eq. 12.15, can be written

    g_m = \sum_{n=0}^{N-1} W_N^{nm}\, f_n    (12.19)

which can be seen as multiplication of the vector f_n by an N × N matrix whose (n, m)th element is W_N^{nm}. Since both n and m take N values, the approach just outlined involves some N^2 operations, plus some overhead. It is possible, with an approach called the Fast Fourier Transform, to speed this up dramatically, to ~N log_2 N operations, a savings of a factor of about 100 for N = 1000.

The key to this improvement is noting that W_N^2 = W_{N/2}, which allows the length-N Discrete Fourier Transform to be rewritten as the sum of two length-N/2 Discrete Fourier Transforms (one odd, one even). This was discovered in the early 1940s by Danielson and Lanczos, and developed for computer calculations in the early 1960s by Cooley and Tukey.

    g_m = \sum_{n=0}^{N-1} W_N^{nm}\, f_n
        = \sum_{n=0}^{N-1} e^{-2\pi i n m/N}\, f_n
        = \sum_{j=0}^{N/2-1} e^{-2\pi i (2j) m/N}\, f_{2j} + \sum_{j=0}^{N/2-1} e^{-2\pi i (2j+1) m/N}\, f_{2j+1}
        = \sum_{j=0}^{N/2-1} W_{N/2}^{jm}\, f_{2j} + W_N^m \sum_{j=0}^{N/2-1} W_{N/2}^{jm}\, f_{2j+1}
        \equiv g_m^{(e)} + W_N^m\, g_m^{(o)}    (12.20)

This reduces the length of each DFT by a factor of two; even though m runs from 0 to N - 1, the even and odd components repeat with a cycle of N/2.
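
A minimal recursive Cooley-Tukey sketch in Python, assuming N is a power of two; it recombines the even and odd half-transforms exactly as in Eq. 12.20, using the N/2 periodicity of g^(e) and g^(o):

import cmath

def fft(f):
    # Length-N Discrete Fourier Transform; N is assumed to be a power of two.
    N = len(f)
    if N == 1:
        return list(f)
    g_even = fft(f[0::2])        # length-N/2 transform of the even-indexed points
    g_odd = fft(f[1::2])         # length-N/2 transform of the odd-indexed points
    g = [0j] * N
    for m in range(N // 2):
        w = cmath.exp(-2j * cmath.pi * m / N)     # W_N^m
        g[m] = g_even[m] + w * g_odd[m]
        g[m + N // 2] = g_even[m] - w * g_odd[m]  # uses W_N^(m + N/2) = -W_N^m
    return g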


Chapter 13

Time Series Analysis

13.1 Introduction

Time series analysis is a "simple" signal processing technique, related to Fourier analysis, for identifying offsets, correlations, trends, or cycles in sequential data collected at constant time intervals. The object is to summarize the data with a low-dimensional model and make forecasts. Another way of saying this is that time series analysis determines whether (univariate) data collected at regular time intervals is stationary: the mean, variance, and autocorrelation (for example) are constant. If not, the analysis serves to identify patterns in the data that cause it to stray from stationarity.

13.2 White Noise

White noise is the standard example of a stationary time series. It consists of uncorrelated (random) independent data points, with mean zero and a constant variance. The autocorrelation coefficient is consistent with zero except when the time lag is zero, when it is one, by definition.

The autocorrelation coefficient, R_h = C_h/C_0, is the ratio of the covariance of the data at a particular time lag, C_h, to the variance of the data, C_0. The covariance function is

    C_h = \frac{\sum_{t=0}^{N-1-h} (X_t - \bar{X})(X_{t+h} - \bar{X})}{N}    (13.1)

where h is the time lag and \bar{X} is the mean. A time lag of h = 1 means the difference between the mean and the next nearest neighbor in the second parenthesis; h = 2 means the next-to-next nearest neighbor; etc. An autocorrelation plot graphs the autocorrelation coefficient as a function of h. Significant differences from zero (-1 \le R_h \le 1) anywhere on the plot (except, of course, for h = 0) indicate some coherent effect(s). For white noise, the coefficients should be distributed normally around zero within

    \frac{z_{1-\alpha/2}}{\sqrt{N}},


where z is the normal percent point function (the inverse of the cumulative distribution function), and \alpha is the significance level. Notice how this statistic becomes more discriminatory as \sqrt{N} grows.

Another, perhaps simpler, measure of stationarity is the turning point test. If the data are randomly distributed, three successive data points are equally likely to relate in terms of magnitude in six possible ways:

• X_i \ge X_{i+1} \ge X_{i+2}

• X_i \ge X_{i+2} \ge X_{i+1}

• X_{i+1} \ge X_i \ge X_{i+2}

• X_{i+2} \ge X_i \ge X_{i+1}

• X_{i+2} \ge X_{i+1} \ge X_i

• X_{i+1} \ge X_{i+2} \ge X_i

Notice that in four of the six possibilities, the middle number is a turning point: it is bigger (smaller) than the previous number but smaller (bigger) than the succeeding number. On average, then, (2/3)(N - 2) turning points should be found in an N-member data set (end points can't be turning points). The variance around this mean is approximately 8N/45. A statistical test can be made by counting the number of turning points in the data and comparing to the average expected for white noise relative to the standard deviation \sqrt{8N/45}.
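
A minimal Python sketch of the turning point test; it returns the observed count, the white-noise expectation, and the difference in units of the standard deviation (the 8N/45 variance is taken from the text):

import math

def turning_point_test(x):
    N = len(x)
    turns = sum(1 for i in range(1, N - 1)
                if (x[i] > x[i - 1] and x[i] > x[i + 1]) or
                   (x[i] < x[i - 1] and x[i] < x[i + 1]))
    expected = 2.0 * (N - 2) / 3.0
    sigma = math.sqrt(8.0 * N / 45.0)
    return turns, expected, (turns - expected) / sigma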

13.3 Decomposition

A common approach to a time series that exhibits non-stationarity is to decompose the data into different elements:

Trend: long-term, uni-directional changes in the average value

Primary Periodicity: also called a seasonal effect

Secondary Periodicity: residual periodic effects once the primary cycle has been isolated

Residuals: random or other systematic fluctuations.

Offset: systematic shift from zero.

Having isolated each component, a simple model for describing the data is the sum or product of these. Ideally, removing trends and primary and secondary periodicity should, in principle, leave nothing but randomly distributed residuals. Removing the offset should leave white noise.


13.4 Isolating Trends: Smoothing Techniques

Two common methods for identifying trends are averaging and exponential smoothing.

13.4.1 Averaging

For a set of randomly distributed data, the simple average (or mean) is perhaps the best characterization of the data, as it, by definition, minimizes the mean-squared error of a random distribution. It is not, however, a good estimator if the data, for example, follow a trend. A moving average, the computation of the mean of successive, overlapping subsets of the data, can do better. As long as the trend is not too steep, and there are an odd number of data points in each subset, this procedure does a better job. For an even number of data points, and for steep trends, a moving average of the moving average, the so-called double moving average, does even better. The result will be averages aligned with the middle value of the data subset regardless of the even-oddness of the sample population, and the possibility then to fit the result to a line describing the trend. Subtracting the resulting fit from the data should eliminate (much of) the trend.
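
A minimal sketch of a centered moving average; the window length is an illustrative parameter and should be odd so each average aligns with the middle value of its subset:

def moving_average(x, window=5):
    # Centered moving average over successive, overlapping subsets of the data.
    half = window // 2
    return [sum(x[i - half:i + half + 1]) / window
            for i in range(half, len(x) - half)]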

13.4.2 Exponential Smoothing

Averages, including moving averages, give equal weight to each data point (except the end-points for moving averages). Exponential smoothing is a technique that gives more weight to recent data.

Suppose there are N data X_t, t = 0, 1, 2, \ldots, N - 1. The basic single exponential smoothing formula for t \ge 1, in terms of S, the so-called smoothed observation or the exponentially weighted moving average (EWMA), is

    S_t = \alpha X_t + (1 - \alpha) S_{t-1}    (13.2)

where 0 < \alpha < 1 is the smoothing constant, which is directly related to the weighting. It has to be determined or estimated to minimize the mean-squared error. Trial-and-error will work eventually, but many applications carry out the minimization process, often employing a non-linear optimizing procedure called the Marquardt algorithm. Typical good values of \alpha range between 0.25 and 0.5.

To find S_1, S_0 needs to be fixed. Sometimes it is set to X_0, sometimes to the average of the first few points, and sometimes to an educated guess. Having chosen S_0, the basic formula can be expanded to include all entries:

    S_t = \alpha \sum_{i=0}^{t-1} (1 - \alpha)^i X_{t-i} + (1 - \alpha)^t S_0    (13.3)

In terms of the weight, w = 1 - \alpha, this sum exhibits why this is referred to as exponential weighting (or smoothing). Further, since w < 1, one sees that "older" data receive smaller weights than later data.
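
A minimal Python sketch of Eq. 13.2; the choice S_0 = X_0 is one of the initializations mentioned above:

def single_exponential_smoothing(x, alpha):
    # S_t = alpha*X_t + (1 - alpha)*S_{t-1}, initialized with S_0 = X_0.
    s = [x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t] + (1.0 - alpha) * s[t - 1])
    return s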

Like the single moving average, single exponential smoothing does progressively worse as the trend steepens or the trend drifts with time, and, like the moving average, a second pass does better. For double exponential smoothing, a second parameter, call it \beta, found, as \alpha is, by hit-or-miss or a Marquardt algorithm, leads to a second statistic, b, called the trend factor, which updates the trend through the data and thereby removes the lag in responding to trend changes. The algorithm goes like:

    S_t = \alpha X_t + (1 - \alpha)(S_{t-1} + b_{t-1})    (13.4)
    b_t = \beta (S_t - S_{t-1}) + (1 - \beta) b_{t-1}    (13.5)

Notice that b_t is calculated using S_t and then is used to determine S_{t+1}. b_0 can be chosen as X_1 - X_0, or [(X_1 - X_0) + (X_2 - X_1) + (X_3 - X_2)]/3, or, for any i, (X_i - X_0)/i.
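
A minimal sketch of Eqs. 13.4-13.5; it needs at least two data points and uses b_0 = X_1 - X_0, one of the initializations just mentioned:

def double_exponential_smoothing(x, alpha, beta):
    # Double exponential smoothing with smoothed value S and trend factor b.
    s, b = [x[0]], [x[1] - x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t] + (1.0 - alpha) * (s[t - 1] + b[t - 1]))
        b.append(beta * (s[t] - s[t - 1]) + (1.0 - beta) * b[t - 1])
    return s, b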

13.5 Isolating the Primary Periodicity

A dominant periodicity in the data can be identified with a third round of exponential smoothing. Triple exponential smoothing requires a third constant and provides a third parameter, \gamma and I, respectively. The latter is called the seasonal index. The full procedure is referred to by the name of its developers, the Holt-Winters or HW method. Initialization of the HW method's seasonal index requires at least one cycle's worth of data, broken up into some number, L, of periods. To get the initial trend factor, though, requires period-to-period comparisons from one complete cycle to the next, so actually two cycles of data, or 2L periods, is really the minimum requirement. The basic equations are:

    S_t = \alpha \frac{X_t}{I_{t\%L}} + (1 - \alpha)(S_{t-1} + b_{t-1})    (13.6)
    b_t = \beta (S_t - S_{t-1}) + (1 - \beta) b_{t-1}    (13.7)
    I_t = \gamma \frac{X_t}{S_t} + (1 - \gamma) I_{t\%L}    (13.8)

The initial trend factor is determined by:

    b_0 = \frac{1}{L} \left[ \frac{X_L - X_0}{L} + \frac{X_{L+1} - X_1}{L} + \cdots + \frac{X_{2L-1} - X_{L-1}}{L} \right]    (13.9)

The initial seasonal index is determined by:

    I_i = \frac{L}{N} \sum_{j=0}^{\mathrm{int}(N/L)} \frac{x_{L(j-1)+i}}{A_j} \qquad \forall\, i = 0, 1, \ldots, L - 1    (13.10)

    A_j = \frac{\sum_{i=0}^{L-1} x_{L(j-1)+i}}{L} \qquad \forall\, j = 0, 1, \ldots, \mathrm{int}(N/L)    (13.11)

A_j is the average value of the jth period of the primary cycle, of which there are int(N/L) + 1 in the data.


13.6 Isolating Secondary Periodicity

Fourier analysis.

13.7 Isolating an Offset

A non-zero average of the data indicates an offset. A determination of the variance (standard deviation or error of the mean, assuming the distribution being averaged is randomly distributed) gives an estimate of the extent of the offset.