
17th IMACS World Congress, July 11 - 15, 2005, Paris, France

Scientific Computation, Applied Mathematics and Simulation

Solving Triangular Systems More Accurately and Efficiently

Philippe LANGLOIS and Nicolas LOUVET

University of Perpignan, France.

http://webdali.univ-perp.fr/

Outline and key words

• More precision: why and how ?

◦ motivations, XBLAS, expansions, double-double library

◦ error free transformations, CENA, twice the current precision behavior

• An accurate and efficient new algorithm to solve Tx = b

◦ The corrected substitution algorithm

◦ Experimental results exhibit the

. actual accuracy: twice the current working precision behavior,

. actual speed: twice as fast as the corresponding XBLAS subroutine.

IMACS WC 2005 – Ph. Langlois and N. Louvet. 1

More precision: why ?

The general “rule of thumb” (RoT) for backward stable algorithms:

solution accuracy ≈ condition number of the problem × computing precision

1. IEEE-754 precisions: single (u = 2−23 ≈ 10−7), double (u = 2−52 ≈ 10−15)

2. Condition number for solving a triangular linear system Tx = b:

cond(T, x) = ‖ |T⁻¹| |T| |x| ‖∞ / ‖x‖∞, always ≥ 1.

3. Accuracy of the solution for Tx = b of size n:

‖x − x̂‖∞ / ‖x‖∞ ≲ n × cond(T, x) × u

. LAPACK ERRBD


solution accuracy ≈ condition number of the problem × computing precision

[Figure: “Condition number and relative forward error”, IEEE-754 double precision: relative forward error (10⁻¹⁶ to 1) vs. condition number (1 to 10²⁵); the measured error follows the bound n · u · cond, and all accuracy is lost beyond cond ≈ 1/u.]

More accuracy for ill-conditioned problems?


More accuracy : how ?

1. Design better algorithms!

• difficult and not always possible: problem dependent

• an example: iterative refinement for linear system using the residual r = b−Ax.

. When computing r in precision u: the RoT still holds;

. When computing r in precision u² (2×CWP):

solution accuracy ≈ u, while condition number ≲ u⁻¹.

. Motivation for eXtended and mixed BLAS (Li, Demmel, Bailey et al.; TOMS, 2002).

2. More precision with more bits

• extended internal precision, Fused Multiply and Accumulate, quadruple precision,

expansion libraries, double-double, quad-double, . . .

• generic solution

• lack of availability and portability

• loss of speed efficiency


More bits with software: expansions

Floating point expansion: the rational value x equals the floating point m-tuple (x1, . . . , xm) s.t.

i) x = ∑_{i=1}^{m} xi (the exact sum), ii) |x1| ≥ |x2| ≥ . . . ≥ |xm|, iii) xi and xi+1 do not overlap (*)

[Figure: x drawn as the sum of its non-overlapping components x1, x2, . . . , x6.]

• m: arbitrary or fixed length

• Arbitrary precision: Kahan+Priest (91), Bailey (95), Shewchuk (97), . . .

• Fixed extended precision: double-double u2: Briggs (98), Bailey (00); quad-double u4 :

Bailey+Li+Hida (01)

• Arithmetic overhead (in flops) for double-double: + : 20, × : 25 (or 2 with an FMA), / : 32

• Pros and Cons in Li et al. (TOMS, 02).


More accuracy : now

1. Design better algorithms

2. More precision with more bits

3. Correcting generated rounding errors

. using more bits locally: error free transformations (EFT),

. leading to design better algorithms: more accurate and actually fast.

• Old but good idea!

◦ Gill (1951) for fixed point arithmetic and Runge-Kutta schemes,

◦ Kahan (1965) for compensated summation . . . and many others after,

◦ iterative refinement with residual r in extended precision.

• Sometimes, more bits (together with some extra-work) yield the definitive solution:

accuracy ≈ precision

• The more general case: at a reasonable cost we get a twice the current working precision

behavior. . .


What accuracy for a 2×CWP behavior?

. “Improved RoT” for twice the current working precision behavior:

accuracy of the computed solution ≲ precision + condition number × precision².

So we have the three following regimes in precision u for an accurate solution x:

1. condition number ≤ 1/u: the accurate result x̄ has about the optimal accuracy

|x̄ − x| / |x| ≈ u;

2. 1/u ≤ condition number ≤ 1/u²: the accurate result x̄ now verifies

|x̄ − x| / |x| ≈ cond × u²;

3. and no more accuracy when condition number > 1/u2.


Algorithms satisfying the 2× CWP behavior

• Kahan compensated summation (65)

• A recent paper to appear in SIAM J. Sci. Comput., 2005: T. Ogita, S.M. Rump, S. Oishi.

. Accurate sum and dot product algorithms

• with twice the current working precision behavior,

• and actual performance better than theoretical values and current double-double library.

• The first proof of the “Improved RoT” with a 2×CWP behavior.

◦ Recursive versions (k fold) that exhibit a k×CWP behavior (with proofs).

• Automatic Linear Correction Method CENA: Langlois (BIT,01)

. Automatic identification and computation of a first order approximation of the global forward error that provides a validated corrected result.

◦ “exact” correction (and validation) for linear algorithms: summation, dot product, Horner scheme, triangular system, . . .

• algorithmic differentiation (Linnainmaa, 1983), running error bound (Wilkinson)


Error Free Transformations (and approximate ones)

Error free transformations are properties and algorithms to compute the elementary generated

rounding errors at the CWP.

• Summation: Møller (65), Kahan (65), Knuth (74), Priest(91).

(s, σ) = TwoSum (a, b) is such that a + b = s + σ and s = a⊕ b.

• Product: Veltkamp, Dekker (72), . . .

(p, π) = TwoProd (a, b) is such that a× b = p + π and p = a⊗ b.

• FMA: Boldo+Muller (05)

(p, δ1, δ2) = ErrFMA(a, b, c) is such that p = FMA(a, b, c) and a× b + c = p + δ1 + δ2.

• Division: Langlois+Nativel (98)

(d, δ̂) = ApproxTwoDiv (a, b) is such that a/b = d + δ and d = a ⊘ b, with |δ̂ − δ| ≤ u|δ|.

• Inverse: Pichat (77), . . .

• Square root: Langlois+Nativel (98)


Error free transformation of the sum: two computations

Algorithm 1 fast-two-sum, Dekker (72)

Require: a ∈ F, b ∈ F s.t. |a| ≥ |b|
s := fl (a + b) ; bapp := fl (s − a) ; σ := fl (b − bapp)

Ensure: s, σ

cost: 2 flops and 1 test

Algorithm 2 algorithm without pre-ordering or two-sum, Knuth (74)

Require: a, b ∈ F
s := fl (a + b) ; bapp := fl (s − a) ; aapp := fl (s − bapp) ;
δb := fl (b − bapp) ; δa := fl (a − aapp) ; σ := fl (δa + δb) ;

Ensure: s, σ

cost: 5 flops


An accurate and efficient new algorithm to solve Tx = b

• The corrected substitution algorithm

• Experimental results exhibit the

. actual accuracy: twice the current working precision behavior,

. actual speed: twice as fast as the corresponding XBLAS subroutine.


Rounding errors in the substitution algorithm to solve Tx = b

The substitution algorithm computes x1, x2, . . . , xn, the solution of the lower triangular system Tx = b (with rows ti,1, . . . , ti,i), as

xi = ( bi − ∑_{j=1}^{i−1} ti,j xj ) / ti,i.

Algorithm 3 The substitution algorithm and its generated rounding errors

si,0 = bi
for j from 1 to i − 1 do
  pi,j = ti,j ⊗ xj { generates πi,j }
  si,j = si,j−1 ⊖ pi,j { generates σi,j }
end for
xi = si,i−1 ⊘ ti,i { generates δi }


The corrected substitution algorithm: principle

• The global forward error in the computed x̂ is (∆xi)_{i=1:n}, with

∆xi = xi − x̂i = (1/ti,i) ∑_{j=1}^{i−1} (σi,j − πi,j − ti,j × ∆xj) + δi

• Assuming the correcting terms ∆x̂i have been computed, the corrected solution is
for i from 1 to n do
  x̄i = x̂i ⊕ ∆x̂i
end for

• Corrected result = (result from substitution x) + (correcting term ∆x)

• Theoretical complexity in O(n2):

the whole algorithm costs (27n² + 21n − 2)/2 flops (and fewer with an FMA)


Inlining the correction . . .

Algorithm 4 The corrected substitution algorithm

for i from 1 to n do
  si,0 = bi
  ui,0 = 0
  for j from 1 to i − 1 do
    (pi,j , πi,j) = TwoProd (ti,j , x̂j)
    (si,j , σi,j) = TwoSum (si,j−1, −pi,j)
    ui,j = ui,j−1 ⊖ (ti,j ⊗ ∆x̂j ⊕ πi,j ⊖ σi,j)
  end for
  (x̂i, δ̂i) = ApproxTwoDiv (si,i−1, ti,i)
  ∆x̂i = ui,i−1 ⊘ ti,i ⊕ δ̂i
end for


Introductory example

Let M be such that M ⊕ 1 = M. Consider the lower triangular system

[ 1          ] [x1]   [M]
[−1   1      ] [x2] = [1]
[ 1  −1   1  ] [x3]   [2]

The exact solution x is (M, M + 1, 3), which rounds to fl (x) = (M, M, 3).


The solution is x = (M, M + 1, 3) that rounds to fl (x) = (M, M, 3).

• With the substitution algorithm, x̂3 has no exact digit:

x̂1 = M → M
x̂2 = 1 ⊕ M → M
x̂3 = 2 ⊖ M ⊕ M → 0

• The corrected algorithm yields the most accurate solution w.r.t. the working precision:

x̂1 = M → M → x̄1 = M
x̂2 = 1 ⊕ M → M +(1) → x̄2 = M
x̂3 = 2 ⊖ M ⊕ M → 0 +(1 ⊕ 2) → x̄3 = 3


Numerical experiments

We compare

• dtrsv: IEEE-754 double precision substitution algorithm,

• dtrsv_x: XBLAS substitution with inner computation in double-double (from S. Li's XBLAS reference website),

• dtrsv_cor: our corrected substitution algorithm.

All computations are performed in C language and IEEE-754 double precision.

. Accuracy tests

. Speed efficiency tests


Numerical experiments: testing the accuracy

• We carefully generate 200 arbitrary ill-conditioned triangular systems

◦ of dimension n = 10,

◦ with Skeel condition numbers cond(T, x) varying from 10 to 1040;

◦ xd is the exact solution (rational and Maple) rounded to IEEE-754 double precision,

◦ ‖xd − x̂‖∞/‖xd‖∞ measures the actual accuracy of the computed x̂.

• Theoretical accuracy bound for dtrsv (IEEE-754 double precision substitution) is

‖x − x̂‖∞ / ‖x‖∞ ≲ n × cond(T, x) × u.


Accuracy experiments exhibit a twice the working precision behavior

[Figure: relative forward error (10⁻¹⁶ to 1) vs. condition number (1 to 10⁴⁰) for dtrsv, dtrsv_x and dtrsv_cor, against the bounds u · n · cond (2) and u + u² · n · cond (3), with markers at cond = 1/(n u) and 1/(n u²).]

. dtrsv_x (XBLAS + double-double) and dtrsv_cor (corrected) provide the same accuracy.

. The measured accuracy bound for both dtrsv_x and dtrsv_cor is

‖x̄ − x‖∞ / ‖x‖∞ ≲ n × cond(T, x) × u².

Numerical experiments: testing the speed efficiency

• What is measured?

◦ the overhead to double the accuracy with the ratios dtrsv_cor/dtrsv and dtrsv_x/dtrsv;

◦ the relative efficiency of the accurate algorithms with the ratio dtrsv_x/dtrsv_cor.

• How is it measured?

For every system dimension n varying from 5 to 2000:

◦ we perform 100 runs measuring (100) numbers of cycles (TSC counter for IA-32),

◦ we keep the mean value, the min and the max of the 10 smallest numbers of cycles.


Speed efficiency: measured and theoretical ratios

Intel Celeron: 2.4GHz, 256kB L2 cache - GCC 3.4.1

ratio               minimum   mean   maximum   theoretical
dtrsv_cor/dtrsv        2.17   2.95      9.70          13.5
dtrsv_x/dtrsv          4.02   5.81     19.66          22.5
dtrsv_x/dtrsv_cor      1.81   1.95      2.07          1.67

Pentium 4: 3.0GHz, 1024kB L2 cache - GCC 3.4.1

ratio               minimum   mean   maximum   theoretical
dtrsv_cor/dtrsv        4.50   8.14      9.95          13.5
dtrsv_x/dtrsv          8.82  16.10     19.60          22.5
dtrsv_x/dtrsv_cor      1.87   1.98      2.03          1.67


The corrected algorithm runs twice as fast as the corresponding double-double XBLAS

[Figure: measured time ratios vs. system size (200 to 2000) on the Intel Celeron (2.4GHz, 256kB L2 cache) and the Intel P4 (3.0GHz, 1024kB L2 cache): dtrsv_x / dtrsv_cor stays close to 2 on both machines, while dtrsv_cor / dtrsv and dtrsv_x / dtrsv show the overheads over plain double precision.]


Conclusion

• Need for more precision: there are alternatives to expansions such as double-double

◦ Ogita, Rump and Oishi emphasize the efficiency for specialized applications

◦ Step 1: the CENA method automates the search for a solution of this kind for any algorithm . . .

◦ Step 2: . . . to be optimized by inlining the correction and the EFT.

• The corrected substitution algorithm provides

• actual accuracy as if doubling the working precision,

• actual speed twice as fast as the corresponding XBLAS subroutine.

. Current results: Horner scheme for polynomial evaluation: double, k-fold, proofs, underflow

Research Reports at http://webdali.univ-perp.fr

. Next steps for Tx = b: proof of the “Improved RoT”, FMA . . . and extension to other algorithms.


Dekker’s fast two-sum

[Figure: data flow of Dekker's fast-two-sum: s = fl(x + y), y_app = fl(s − x), and δ = fl(y − y_app) recovers the rounding error of x + y.]


Elementary rounding errors relations, bounds and overheads

Final remark: these overheads are theoretical ratios. Ogita, Rump and Oishi exhibit that implementations are more efficient in practice, e.g., about 5 for × (cache and pipelined units are well managed by modern compilers).


The CENA first principle: a first order analysis of the global rounding error (GFE)

In place of xN = f(X), we compute x̂N = fl (f) (X), with data X = (x1, . . . , xn) ∈ Fⁿ and f a given computable (real) function.

The computation of f̂ introduces (for ◦ as previously, with operands in F)

• (N − n) intermediate variables xk,

• (N − n) elementary errors δk associated to xk

x̂k = fl (x̂i ◦ x̂j) = (x̂i ◦ x̂j) + δk

• Linear approximation of the GFE w.r.t. the elementary errors δk:

x̂N − xN = f̂(X, δ) − f̂(X, 0) = ∑_{k=n+1}^{N} ∂f̂/∂δk (X, δ) · δk − EL = ∆L − EL


The linear approximation of the GFE is not a new idea . . .

S. Linnainmaa (75), M. Iri (90) use it to derive linear (deterministic or probabilistic) bounds.

/ Deterministic bounds are very pessimistic: neither compensation of the elementary rounding errors nor the “without rounding” result can be considered.

/ There is no practical means by which to know the generated errors δv.

M. Iri, in ”History of Automatic Differentiation and Rounding Error Estimation”, SIAM (1991).


We have error free and approximate transformations. . .

. . . so let us compute a first order correcting term

x̄N = fl (x̂N + ∆̂L)

• Correcting factor ∆̂L = fl ( ∑_{k=n+1}^{N} ∂x̂N/∂δk (X, δ) · δk )

• Automatic/Algorithmic Differentiation (AD) + ERE computation

/ Limitation: residual error in the corrected result: x̄N − xN = −(EL + EC)

1. EL: error from the first order approximation

2. EC : finite precision computation in AD, ERE, final correction.

, 1. some (classic) algorithms satisfy EL = 0

2. running error bounds provide a computable ĒC with |x̄ − x| ≤ ĒC


The CENA Method

[Figure: schematic of the CENA method: data x in the data space D maps to the computed ŷ = f̂(x) and to the exact y = f(x) in the result space R; the correction reduces the forward error to a residual error, enclosed in a confidence area of radius the computed bound BEC.]

Linear Algorithm : EL = 0

Definition 1 A linear algorithm on F is an algorithm that only contains the operations {+, −, ×, /, √ } and such that

- every multiplication fl (x̂i × x̂j) satisfies x̂i = xi or x̂j = xj, and

- every division fl (x̂i/x̂j) satisfies x̂j = xj, and

- every square root fl (√x̂i) satisfies x̂i = xi.

Example 1 Summation algorithms, inner product computation, triangular linear system solving, polynomial evaluation with Horner's scheme, . . .


CENA verifies a 2×CWP behavior

Let us consider x̄ = fl (x̂ − ∆̂L), CWP u, and ∆̂L a good approximation of the actual error, i.e., a linear algorithm and a reasonable computing error eN in ∆̂L.

We have x̄ = (x̂ − ∆̂L)(1 + δ1), and ∆̂L = (x̂ − x)(1 + δ2) (in the most favorable case for the correcting term), with |δ1|, |δ2| ≤ u.

Since

|x̂ − x| / |x| ≈ condition number × u,

we have

|x̄ − x| / |x| ≤ u (1 + |x̂ − x| / |x|) ≈ u + condition number × u².


How to use the CENA method ?

Let us have an original program written in F90, Ada, C++ (overloading op.)

1. Include the CENA library with appropriate with or use clauses.

2. Change the original floating point type to the corresponding corrected floating point type, e.g., real (that can be a user-defined type) is changed to a_real. Every arithmetic operator existing for a real variable has an overloaded counterpart for an a_real variable.

3. Define the entries x1, x2, . . . , xn, e.g., calling the CENA subroutine entry(X) .

4. Correct the chosen variables after they have been computed, e.g., calling the CENA

subroutine correct(Z) after Z:=f(X) .

5. The functions VALUE(Z) and BOUND(Z) return the corrected value and the associated

bound of real type.
