Computational Science and Engineering (Int. Master's Program)

Fakultät für Informatik
Technische Universität München

Master’s Thesis

The Fault Tolerant Combination Technique in an Iterative Refinement Framework

Author: Steffen Seckler
1st examiner: Prof. Dr. Hans-Joachim Bungartz
2nd examiner: Prof. Dr. Thomas Huckle
Assistant advisor(s): Christoph Kowitz, M.Sc. (hons)
Thesis handed in on: October 15, 2015


I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references.

November 4, 2015 Steffen Seckler


Acknowledgments

I want to thank Christoph Kowitz for being a very good advisor, who was always there if I needed help or had questions. He provided technical background knowledge to a large extent and was very familiar with the topic. Additionally, I want to thank my two examiners Prof. Dr. Hans-Joachim Bungartz and Prof. Dr. Thomas Huckle for taking the time to correct this work.



Abstract

With petascale computing just around the corner, fault tolerance becomes more and more important. In this work a fault tolerant algorithm for large scale simulations is introduced. It combines the fault tolerant combination technique with the iterative refinement method, making it possible to handle both hard and soft faults. The properties of the algorithm are discussed using the example of Poisson's equation.



Contents

Acknowledgments

Abstract

I. Introduction and Theory

1. Introduction

2. Methods
   2.1. Iterative Refinement
   2.2. Combination Technique
   2.3. Fault Tolerant Combination Technique
   2.4. Combination of FTCT and Iterative Refinement
   2.5. Implementation Details
        2.5.1. Single Solution Technique
        2.5.2. Solvers
        2.5.3. Error Measurement
   2.6. Fault Emulation
        2.6.1. Hard Faults
        2.6.2. Soft Faults
        2.6.3. Multiple Faults
   2.7. Model Problem

II. Simulation Results

3. Verification of Individual Components
   3.1. Iterative Refinement
   3.2. Combination Technique

4. Behavior without Faults
   4.1. General Properties
        4.1.1. Error Specification
        4.1.2. Convergence to Full Grid Solution
        4.1.3. Choosing Realistic Initial Parameters
        4.1.4. Converged Error Size
   4.2. Ruge-Stüben Solver
        4.2.1. Solver Parameters and General Behavior
        4.2.2. Convergence Rate
   4.3. Iterative Method I – Jacobi
        4.3.1. Non-Adaptive Damped Jacobi Method
        4.3.2. Adaptive Damped Jacobi Method
        4.3.3. Optimal Jacobi Iteration Count
   4.4. Iterative Method II – Gauss-Seidel
   4.5. Conclusion

5. Fault Tolerance I – Hard Faults
   5.1. Ruge-Stüben Solver
        5.1.1. Standard Combination Technique
        5.1.2. Fault Tolerant Combination Technique
        5.1.3. Single Solution Technique
   5.2. Iterative Method I – Jacobi
        5.2.1. Standard Combination Technique
        5.2.2. Fault Tolerant Combination Technique
        5.2.3. Single Solution Technique
   5.3. Iterative Method II – Gauss-Seidel
        5.3.1. Standard Combination Technique
        5.3.2. Fault Tolerant Combination Technique
        5.3.3. Single Solution Technique
   5.4. Comparison

6. Fault Tolerance II – Silent Faults
   6.1. Ruge-Stüben Solver
        6.1.1. Single Silent Fault
        6.1.2. Multiple Silent Faults
   6.2. Jacobi
        6.2.1. Single Silent Fault
        6.2.2. Multiple Silent Faults
   6.3. Comparison

III. Conclusion and Outlook

7. Conclusion

8. Outlook

Appendix

A. Additional Graphs

Bibliography


Part I.

Introduction and Theory



1. Introduction

In recent years, the importance of high performance computing has been steadily rising. Some applications required more precise results, for others the simulation domain had to be enlarged, yet others needed an increased number of simulated particles. All of these factors led to a very steep increase in the number of degrees of freedom of simulations. With the same amount of computing power and the same algorithm, a simulation with an increased number of degrees of freedom will take longer to finish. This can be tackled by using a more suitable algorithm, by enhancing the computing power, or by a combination of both. While the computing power of a single chip did indeed increase over this time, that increase could not compensate the demand for computing power. Instead of handling the computation with a single processing unit, multiple interconnected processing units were used and parallel applications were introduced. With this, specialized algorithms and their implementations became more and more important.

Instead of increasing the computing power, one can also look for an algorithm that combines a good approximation with a low cost to produce it: such algorithms reduce the number of degrees of freedom while maintaining similarly accurate simulations. One of these methods was introduced by S. Smolyak in 1963 [25]. A sparse grid [6] is used instead of a regular Cartesian grid to discretize a domain, whereby the number of degrees of freedom is drastically reduced. The reduction becomes especially important for problems in more than two dimensions, since the curse of dimensionality holds for regular grids: the number of unknowns grows exponentially with decreasing discretization width. Using sparse grids will however introduce a non-uniform grid, on which operators are hard to discretize. If a discretization is found, the resulting matrix will mostly be dense and unstructured. In general this results in a system that is hard to solve in parallel.

Using sparse grids thus results in fewer degrees of freedom, but discretizing the problem and solving it in parallel is often cumbersome and sometimes not possible.

The sparse grid combination technique, first introduced in [13] in 1992, combines the reduction of degrees of freedom of a sparse grid discretization with regular grids. It uses multiple independent solutions on different regular grids and combines them to obtain the solution on an underlying sparse grid. Since the solutions on the regular grids are independent, they can be computed simultaneously; the sparse grid combination technique thus provides an additional layer of parallelism.

Another aspect of the ever increasing problem sizes and parallelism is the steady increase of the total required computation time, and with it the probability of the occurrence of errors rises. These errors are handled in various ways. A frequently used approach is to save the state of the simulation every few seconds and, upon the occurrence of an error, to restart from the checkpoint [23]. This however can lead to an overhead of up to 25% and is often infeasible. Other techniques, like the algorithm-based fault tolerance technique, can also be used to handle corrupted data and cause less overhead [20]. However, the requirements towards the MPI implementations are often unrealistic [4].



In this thesis two algorithmic approaches to handle occurring errors are combined. On the one hand, an iterative solver in the form of the iterative refinement technique is used, which can handle both hard and soft faults (see Section 2.6). On the other hand, the fault tolerant combination technique is applied to handle hard faults: some of the solutions required for the combination technique are not calculated. The fault tolerant combination technique makes it possible to find new combination schemes that do not require the missing solutions. The properties of the combined solver are discussed in this thesis.

In Chapter 2 the theory for the individual parts is discussed. Besides the iterative refinement (Section 2.1), the standard combination technique (Section 2.2) and the fault tolerant combination technique (Section 2.3), the combined algorithm (Section 2.4) is introduced. Some implementation details are given (Section 2.5), e.g. the used solvers and an additional pseudo combination technique. The technique for simulating errors is reviewed in Section 2.6. In Section 2.7 the model problem is characterized.

Part II gives a description of the results of the pure algorithm without faults (Chapter 4), with hard faults (Chapter 5) and with soft faults (Chapter 6). The results are given for the three different solvers and the different combination techniques.

Finally, a conclusion and an outlook are given in Part III.



2. Methods

2.1. Iterative Refinement

Iterative refinement is a method developed by the British mathematician James H. Wilkinson [26], which focuses on solving a system of linear equations by using the residual equation repeatedly.

For a matrix A ∈ C^{n×n} and vectors b ∈ C^n and u ∈ C^n, the matrix vector equation Au = b can be solved using the iterative refinement method. The algorithm uses the residual r ∈ C^n of the matrix vector equation, as well as a correction term d ∈ C^n. The method is depicted in Algorithm 2.1. Hereby Step 3 may be solved approximately. In its original purpose the error introduced in this step was due to rounding errors of exact solvers, mostly Gaussian elimination; however, many approximate or iterative solvers can be used instead, e.g. Gauss-Seidel, Jacobi or a multigrid method. Many solvers that are used in an iterative refinement scheme use a factorization of A, since it can be reused in every iteration and the factorization only has to be computed once. Since the method is of iterative nature, a counter m has been introduced to differentiate between the iterations.

Algorithm 2.1 (Iterative Refinement to solve Au = b)

1. Start with an initial guess u_0 and m = 0.

2. Compute r_m = b − A u_m.

3. Solve A d_m = r_m.

4. Update u_{m+1} = u_m + d_m.

5. Go back to Step 2 if not converged.

If Step 3 is solved exactly with no errors and all other steps are computed exactly as well, then the algorithm converges after one step:

u_{m+1} = u_m + d_m                      (Step 4)
        = u_m + A^{−1} r_m               (d_m = A^{−1} r_m, Step 3)
        = u_m + A^{−1} (b − A u_m)       (r_m = b − A u_m, Step 2)
        = u_m + A^{−1} b − A^{−1} A u_m
        = u_m + A^{−1} b − u_m
        = A^{−1} b
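As a sketch of Algorithm 2.1 in code (a minimal NumPy version; the function name, tolerance and iteration cap are illustrative choices, not from the thesis):

```python
import numpy as np

def iterative_refinement(A, b, solve_approx, tol=1e-12, max_iter=50):
    """Iterative refinement (Algorithm 2.1): repeatedly solve the
    residual equation A d = r and apply the correction."""
    u = np.zeros_like(b)                  # Step 1: initial guess u_0 = 0
    for _ in range(max_iter):
        r = b - A @ u                     # Step 2: residual
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break                         # Step 5: converged
        d = solve_approx(A, r)            # Step 3: may be approximate
        u = u + d                         # Step 4: update
    return u
```

With an exact inner solver the loop terminates after a single correction, as in the derivation above; with an approximate inner solver the correction step is simply repeated until the residual is small.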



Since the iterative refinement method improves the stability and accuracy of a solver and converges for a wide range of matrices, it is used in many different areas of application, e.g. image denoising [15], and is implemented in many different software packages, e.g. LAPACK [8]. Software packages have supported iterative refinement only very recently. The main reason for this is that iterative refinement only works at its best if the residual is calculated at a higher precision than the correction term d_m [7]. In this thesis the iterative refinement technique is used as a framework for the combination technique. The combination technique will be used in Step 3 of the algorithm and is described in the following section. Using the combination technique, the correction term is automatically of lower accuracy than the residual calculation, since only a sparse grid representation of the correction term is calculated.

For more information and a detailed convergence analysis of the iterative refinement technique see e.g. [17].

2.2. Combination Technique

A general form of the combination technique is introduced to allow an easy definition of both the standard combination technique and the fault tolerant combination technique in the next section. These two parts are mainly based upon [14, 10, 6, 13].

Let us consider a problem on the unit d-cube with the solution u. This solution is assumed to be continuous on the whole domain, u ∈ V = C([0, 1]^d). This function space is infinite-dimensional and not representable on a computer, unless an analytic expression is known. Since the latter is not the case, the solution will not be represented exactly; instead an approximation in a lower-dimensional function space is sought. The function will be represented using an arbitrary finite-dimensional basis φ_0, ..., φ_N and coefficients c_0, ..., c_N:

u(x) = ∑_{i=0}^{N} c_i φ_i.   (2.1)

The function is then an element of the function space spanned by the basis functions. One possible function space is the space of multi-linear functions. For the definition of this function space a discretization is needed. The discretization can be built using a set Ω of points x_k. An appropriate set of points can be generated using regular grids, i.e. grids for which the points in each dimension have a constant distance h_i. In the one-dimensional case, the discretization grid becomes ω_h := {x_k = k · h : k = 0, ..., N} with N = 1/h. Introducing a discretization level i and dividing the distance of neighboring points by two with each increased level, the discretization width becomes h = 2^{−i}. A one-dimensional discretization grid of level i will be called Ω_i and is defined via Ω_i = ω_{2^{−i}}. For multi-dimensional grids a level vector i = (i_1, ..., i_d) ∈ N^d is introduced. This multi-index describes the regular d-dimensional grid Ω_i using the one-dimensional grids Ω_{i_k}:

Ω_i := Ω_{i_1} × ... × Ω_{i_d}   (2.2)

Using this grid Ω_i, a basis of a function space can be defined as the piecewise multi-linear



Figure 2.1.: Various full grids (top) and sparse grids (bottom). The full grids include the regular grid Ω_[1,3] (top left), as well as the Cartesian grid Ω_[4,4] (top right). The shown sparse grids are of level 4 (function space V^s_4). While the bottom left grid includes levels containing only boundary points, the bottom right does not.

functions ψ_j, which are fully described by their evaluations on the grid points x_k:

ψ_j(x_k) := 1 if k = j,  0 if k ≠ j   (2.3)

The function space that is defined using this basis will be called V_i. An approximation of u on the regular grid Ω_i is called u_i ∈ V_i.

Using the defined function spaces V_i, only regular grids are considered. It is however often useful to consider sparse grids, since they contain far fewer grid points (cf. Figure 2.1). Additionally, problems discretized using sparse grids instead of regular grids lead to smaller matrices, while maintaining similar accuracy. A function space on a sparse grid of level q can be generated using multiple function spaces defined through regular grids:

V^s_q := ∑_{∥i∥_1 ≤ q} V_i   (2.4)

A sparse grid solution u^s_q ∈ V^s_q is a solution which closely approximates the original solution u. Based on (2.4), a combined sparse grid solution u^c_q can then be expressed as a linear combination of solutions u_i that are generated on regular grids:

u^c_q = ∑_{∥i∥_1 ≤ q} c_i u_i.   (2.5)



Using proper coefficients the standard combination technique can be derived. For a general definition of combination techniques, the set of combination solutions is generalized. A general combination technique can then be expressed as a linear mapping of a set of solutions u_i to the combined solution u^c_I, where I ⊂ N^d is a set of multi-indices that describes the solutions necessary for the combination:

u^c_I = ∑_{i ∈ I} c_i u_i   (2.6)

Since this is a linear combination of solutions, the combined solution u^c_I is an element of V^s_I := ∑_{i ∈ I} V_i. However, a proper choice of the combination coefficients c_i has to be made to approximate u with u^c_I. One way of obtaining them is to minimize the functional ∥u − ∑_{i ∈ I} c_i u_i∥, which is known as opticom [16]. This approach will not be used in this work, since it comes with a high complexity: it can be hard to parallelize and can introduce a large additional workload. Instead, we will look at a priori known good coefficients.

Let us therefore introduce the hierarchical surplus space W_i (see e.g. [6]) via

V_i = W_i ⊕ ∑_{j < i} V_j   (2.7)

where j is another multi-index of the same dimension as i and

j < i ⇔ ∃k : j_k < i_k ∧ ∀k′ : j_{k′} ≤ i_{k′}.   (2.8)

The surplus space defines the function space that is added when adding the function space V_i to the existing function space ∑_{j < i} V_j. The contribution of the surplus space W_i to the combined solution u^c_I is ∑_{j ≥ i} c_j, since W_i ⊂ V_j for all j ≥ i. The surplus space W_i should now either contribute (∑_{j ≥ i} c_j = 1) or not contribute (∑_{j ≥ i} c_j = 0) to the solution. Hence the condition

∑_{j ∈ I, j ≥ i} c_j ∈ {0, 1}   (2.9)

can be derived. The constant solution u_0 ∈ V_0 = W_0 should always contribute (∑_{j ∈ I} c_j = 1). Equation (2.9) is known as the inclusion-exclusion principle: every possible function space should be represented at most once. If the same function is representable in the function spaces of multiple combination solutions, the combination coefficients of those solutions have to be chosen such that they either cancel each other or only a single contribution remains.

To find an optimal set of a priori known combination coefficients, an error bound for ∥u − u^c_I∥_2 (see e.g. [14]) is used. This error bound is minimized when

Q({c_i}_{i ∈ I}) := ∑_{i ∈ I↓} 4^{−∥i∥_1} ∑_{j ∈ I, j ≥ i} c_j   (2.10)

is maximized, where I↓ := {i ∈ N^d : ∃j ∈ I s.t. i ≤ j} is the lower index set of I. Introducing the hierarchical coefficient

w_i := ∑_{j ∈ I, j ≥ i} c_j ∈ {0, 1},   (2.11)



using (2.9), (2.10) can be simplified to

Q′(w) := ∑_{i ∈ I↓} 4^{−∥i∥_1} w_i.   (2.12)

Maximizing (2.10) is then equivalent to solving the general coefficient problem (GCP), i.e. the binary integer programming problem of maximizing (2.12). This problem is NP-complete [18]. However, a unique solution can easily be found if I is associated with a lower semi-lattice, i.e. it is closed under the ∧ operator:

∀i, j ∈ I ⇒ i ∧ j ∈ I, with (i ∧ j)_k = min{i_k, j_k}.   (2.13)

For this kind of index set the solution of the GCP is w_i = 1 for each i ∈ I. Once the w_i are known, the c_i can be calculated easily. The coefficients derived from the index set that corresponds to the standard sparse grid of level q, i.e. I_q = {i ∈ N^d : ∥i∥_1 ≤ q}, lead to the so-called standard combination technique:

u^c_q = ∑_{k=0}^{d−1} (−1)^k (d−1 choose k) ∑_{∥i∥_1 = q−k} u_i.   (2.14)

Valid combination techniques can only be found for n ≥ d − 1. Therefore, in this thesis a combination technique of level n will correspond to a sparse grid of level q = n + (d − 1). The combination technique then becomes

u^c_n = ∑_{k=0}^{d−1} (−1)^k (d−1 choose k) ∑_{∥i∥_1 = n+(d−1)−k} u_i.   (2.15)

Additionally, a shift s can be introduced, which allows for fewer, but denser grids:

u^c_{n,s} = ∑_{k=0}^{d−1} (−1)^k (d−1 choose k) ∑_{∥i∥_1 = n+(d−1)−k+s, max(i) ≤ n−k} u_i   (2.16)

The combination technique that forms the d-dimensional solution u^c_{n,s} will be abbreviated as d⟨d⟩l⟨n⟩s⟨s⟩ (dimension, level, shift), such that the two-dimensional solution u^c_{4,0} is generated with the combination technique d2l4s0.

Solutions that arise from grids without any inner grid points do not contribute to the solution, since homogeneous Dirichlet boundary conditions are used; they will be ignored.

In general the combination technique can be compressed to the following algorithm.

Algorithm 2.2 (Combination Technique)

1. Get the necessary level vectors from (2.16).

2. Solve the problem on these levels to obtain the partial solutions.

3. Combine the solutions according to (2.16).
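Step 1 of Algorithm 2.2 can be sketched as follows. This is a hypothetical helper (the function name and the dictionary representation are illustrative, not from the thesis) that enumerates the level vectors of (2.16) together with their combination coefficients; level vectors with a zero component (points only on the boundary) are skipped, as described above:

```python
from itertools import product
from math import comb

def combination_scheme(d, n, s=0):
    """Level vectors and coefficients of the combination technique
    (2.16): levels i with |i|_1 = n + (d-1) - k + s and
    max(i) <= n - k receive the coefficient (-1)^k * C(d-1, k).
    Levels start at 1, omitting boundary-only grids."""
    scheme = {}
    for k in range(d):                        # k = 0, ..., d-1
        coeff = (-1) ** k * comb(d - 1, k)
        target = n + (d - 1) - k + s
        for i in product(range(1, n + 1), repeat=d):
            if sum(i) == target and max(i) <= n - k:
                scheme[i] = coeff
    return scheme
```

For example, `combination_scheme(2, 3)` yields the five grids of the scheme d2l3s0 shown in Figure 2.2 (left): three level vectors with coefficient +1 and two with −1, whose coefficients sum to 1 as required for the constant contribution.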



Figure 2.2.: Grids and weighting factors for the combination technique with n = 3 (2.16). The combination technique with shift 0 (left), as well as the combination technique with shift 1 (right), are shown. The corresponding sparse grid of level 4 is depicted above. Level vectors with points only on the boundary are omitted.

2.3. Fault Tolerant Combination Technique

Based upon the results from the previous Section 2.2, the fault tolerant combination technique can be viewed as the combination technique that fulfills the GCP (2.12) for a given index set I of levels.

In practice one tries to solve the problem on the grids of the index set I_start. But due to hard faults, solutions can only be obtained for an index set I ⊂ I_start. The uncertain nature of this problem does not allow the combination coefficients c_i to be known a priori, since in each iteration different grids could be missing. Instead the combination coefficients have to be calculated once the index set is known. As noted before, the GCP is NP-complete and would take a huge amount of time to solve. However, using index sets associated with lower semi-lattices, a unique solution can easily be found (see Section 2.2). One could therefore reduce the index set I to be closed under the ∧ operator and obtain the coefficients for this new index set I_reduced. Assuming that the starting index set is associated with a lower semi-lattice, different strategies are possible to regain this property:

1. The easiest and most intuitive approach: erase all indices bigger than the erroneous ones. Note: an index j is bigger than another index i ⇔ ∀k ∈ {1, ..., d} : j_k ≥ i_k.

2. Find a subset I_reduced of I such that it is of maximal size and closed under the ∧ operator ((2.13) holds). Note: this subset may not be unique.

3. Recompute solutions which perturb the closure of the index set I_reduced. Mostly solutions of non-maximal indices are recomputed. See e.g. [21].
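Strategy 1 can be sketched in a few lines (a hypothetical helper; representing the index set as a set of tuples is an assumption for illustration):

```python
def remove_supersets(index_set, failed):
    """Strategy 1: drop every index that is bigger than or equal to
    a failed index, where j >= i means j_k >= i_k for all k."""
    def geq(j, i):
        return all(jk >= ik for jk, ik in zip(j, i))
    return {i for i in index_set if not any(geq(i, f) for f in failed)}
```

The surviving set is closed under ∧ again, at the cost of discarding correctly computed solutions above the failed index.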

All of these strategies have their own advantages and disadvantages. Strategy 1 will not give an optimal subset and may ignore information which was already correctly computed. However, it is the fastest and easiest method. Strategy 2 will find the optimal



Figure 2.3.: Different strategies of fault tolerant combination techniques with n = 4 (2.16). After selecting the initial index set I_start (top left), two errors occur, such that the solutions u_{3,2} and u_{1,3} could not be obtained. Strategies 1 (top right) and 2 (bottom left), as well as Strategy 3 used in this work (bottom right), are depicted (see Section 2.3). Red framed grids are removed and ignored, green framed grids are recomputed. The shown numbers depict the combination coefficients of the different grids.

subset. It uses all information; however, computing the necessary subset is a complicated procedure. Strategy 3 requires the recomputation of solutions that, according to Strategy 1, would require erasing other solutions. This provides an index set that is closed as long as the initial index set I_start was closed. The problem sizes of recomputed solutions are comparably small, since the biggest problems will never be re-solved. Therefore the recomputation of the solutions does not take much extra work. A comparison of the three strategies is shown in Figure 2.3.

In this work Strategy 3 will be used. Using an initial index set according to the standard combination technique, it can be ensured that besides the normally used solutions (k = 0, ..., d − 1 in (2.16)) only the solutions with k = d are needed for the combination technique.



All of these solutions will be computed, while only solutions with k = 0 are allowed to be missing in the final index set I. Errors in other solutions will not be allowed and are handled by recomputation.

Instead of actually recomputing these solutions, it was decided that it would be simpler not to allow them to become corrupted. The overall compute time, including the time to solve corrupted solutions again, can then be calculated by adding the required time of a corrupted solution twice (or even thrice). The overall computational time was however not directly used, and thus the recomputation of those solutions becomes unimportant. Furthermore, studies have shown [21] that these recomputations do not take much time, since they are only done for comparably small grids.

2.4. Combination of FTCT and Iterative Refinement

The main focus of this work lies in the combination of the FTCT with iterative refinement, thus merging robustness towards hard faults with robustness towards soft faults. The former is provided by the fault tolerant combination technique, since whole solutions can be neglected, while the latter is yielded through iterative refinement, due to its iterative nature. For a detailed description of the error types refer to Section 2.6.

To join the two methods together, the FTCT is embedded into the iterative refinement scheme. Therefore, in Algorithm 2.1 the residual equation A d_m = r_m (Step 3) is solved not on the full grid, but on many different grids. Afterwards the correction term d_m is obtained using the combination technique. All other steps remain the same, so that the combined algorithm looks as follows:

Algorithm 2.3 (joined algorithm to solve Au = b)

1. Compute r_m = b − A u_m.

2. Solve A d_m = r_m using the (fault tolerant) combination technique.

3. Update u_{m+1} = u_m + d_m.

4. Repeat until converged.

The complete, detailed algorithm is iteratively structured. Each iteration consists of seven steps, which are shown in Figure 2.4.
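The loop body of Algorithm 2.3 can be condensed into a structural sketch. This is an illustration, not thesis code: the callables restrict, solve_on and prolong, and the coefficient dictionary, are placeholders for the components described in this section.

```python
import numpy as np

def combined_step(A, b, u, levels, coeffs, restrict, solve_on, prolong):
    """One iteration of Algorithm 2.3, assuming caller-supplied helpers:
    restrict(r, i) maps the full-grid residual to level vector i,
    solve_on(i, r_i) solves A_i d_i = r_i on that grid, and
    prolong(d_i, i) interpolates the correction back to the full grid."""
    r = b - A @ u                          # full-grid residual
    d = np.zeros_like(u)
    for i in levels:                       # independent -> parallelizable
        r_i = restrict(r, i)               # restriction (2.17)
        d_i = solve_on(i, r_i)             # partial correction term
        d += coeffs[i] * prolong(d_i, i)   # combination (2.6)
    return u + d                           # new approximation u_{m+1}
```

The inner loop over the level vectors is exactly the part that the combination technique parallelizes, and the coefficients may change between iterations when the FTCT reacts to failed grids.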

Each iteration starts with an approximation um of the solution u. In each iterative refine-ment step the residual rm = b−Aum has to be computed first. This calculation is executedon the full grid. Once this is done the residual rm has to be copied and broadcasted, suchthat there exist as many copies of the residuals as there exist level vectors in the combina-tion technique. Each of these copies then has to be restricted, such that the dimensions ofthe residual match the dimensions of the discretization grid of the individual level vectors.For this step full weighting is used (see e.g. [5]). This restriction can be interpreted as amatrix vector multiplication of the restriction matrix Ri with the residual rm

r_{m,i} = R_i r_m.   (2.17)


The restricted residual r_{m,i}, as well as the restriction matrix R_i, depend on the level vector i. Full weighting in the one-dimensional case can be represented by a matrix generated by

(Rr)_k = (r_{2k-1} + 2 r_{2k} + r_{2k+1}) / 4.   (2.18)

Multiple one-dimensional full weighting steps can be concatenated to generate a scheme for arbitrary dimensions and level changes.
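One full-weighting step as in (2.18) can be sketched for the inner points of a 1D grid as follows. This is an illustrative NumPy sketch, not the thesis code; `full_weighting_1d` is a hypothetical helper written for 0-indexed arrays of the 2^l - 1 inner points, which is equivalent to (2.18) in its 1-based indexing.

```python
import numpy as np

def full_weighting_1d(r):
    """One full-weighting restriction step (cf. Eq. 2.18) on the inner
    points of a 1D grid: n = 2^l - 1 fine values -> 2^(l-1) - 1 coarse
    values; coarse point j sits on fine point 2j+1 (0-indexed)."""
    return 0.25 * (r[0:-2:2] + 2.0 * r[1:-1:2] + r[2::2])
```

Note that full weighting reproduces linear functions exactly at the coarse points, which is a quick sanity check for the stencil.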

After a successful restriction, the residual equation A_i d_{m,i} = r_{m,i} has to be solved, where A_i is the discretization matrix of the discussed problem for the combination grid Ω_i and d_{m,i} is the calculated correction term. In this thesis the residual equation is solved using different algorithms (Section 2.5.2). Once solved, a correction term is available for each grid. The calculated correction terms have to be prolongated back to the original full grid. In this thesis (multi-)linear interpolation is used (see e.g. [5]), so another matrix operation is performed:

(Pd)_k = d_{k/2} for k even,  (d_{(k-1)/2} + d_{(k+1)/2}) / 2 for k odd.   (2.19)

Again, multiple one-dimensional prolongation steps can be concatenated. The prolongated correction terms P_i d_{m,i} can then be combined and added to the approximation u_m to generate a new approximation u_{m+1}.

After each iteration it has to be checked whether or not to continue with the algorithm. How this is done is described in Section 2.5.3.

In addition to a stopping criterion, a starting criterion has to be chosen, i.e. an initial guess u_0 has to be defined. The zero solution u_0 = 0 is chosen for this purpose, since it can be represented exactly on any arbitrary grid, in particular on a sparse grid. Why this is important is explained in Section 4.1.2.
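The linear interpolation of (2.19) can be sketched analogously to the restriction. Again a hypothetical helper in 0-indexed form, assuming homogeneous Dirichlet boundaries (zero values outside the inner points):

```python
import numpy as np

def linear_prolongation_1d(d):
    """One linear interpolation step (cf. Eq. 2.19): m coarse inner
    values -> 2m + 1 fine inner values, assuming homogeneous Dirichlet
    boundaries, so the values just outside the inner points are zero."""
    m = d.shape[0]
    fine = np.zeros(2 * m + 1)
    fine[1::2] = d                              # fine points coinciding with coarse points
    padded = np.concatenate(([0.0], d, [0.0]))  # zero Dirichlet boundary values
    fine[0::2] = 0.5 * (padded[:-1] + padded[1:])
    return fine
```

A hat-shaped coarse function is interpolated exactly, which makes the stencil easy to verify by hand.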


0. Start with an approximation u_m of the solution u.

1. Calculate the residual r_m = b - A u_m.

2. Broadcast the residual r_m, such that one has as many copies as there are level vectors required for the combination technique.

3. Restrict the residual r_m to match the dimensions of the needed levels: r_{m,i} = R_i r_m.

4. Solve the residual equation A_i d_{m,i} = r_{m,i} on each grid.

5. Interpolate the correction term d_{m,i} to the full grid: d_{m,i,FG} = P_i d_{m,i}.

6. Combine all correction terms to get the full grid correction d_m using the chosen combination technique.

7. Calculate the new approximation u_{m+1} = u_m + d_m.

Figure 2.4.: A more detailed view of the m-th iteration of the complete algorithm that combines iterative refinement with a combination technique. After completing one cycle, one obtains the new approximation u_{m+1}, which can be used in the next iteration.


2.5. Implementation Details

In this section some implementation details are described. This includes the single solution technique (Subsection 2.5.1), which can be interpreted as a combination technique in which only a single solution is used instead of computing multiple solutions and combining them. In one of the steps of the algorithm, a system of equations has to be solved. The solvers used to perform this task are listed in Subsection 2.5.2 and described in greater detail later in the thesis. The analysis of the algorithm, as well as the introduction of a stopping criterion for the combined algorithm, require the definition of an error (Subsection 2.5.3).

2.5.1. Single Solution Technique

Besides the standard combination technique (Section 2.2) and the fault tolerant combination technique (Section 2.3), another method that can be interpreted as a combination technique has been used. This method, however, only generates a single level vector, for which the problem is solved. This level vector changes after each refinement step of the iterative refinement technique, such that the refinement is done in different levels and directions in each iteration. Therefore an index i of (2.16) with k = 0 is chosen before each iteration; only grids of the highest level are considered. Both a random and a deterministic way of choosing the next index i are implemented:

Random The index i is chosen randomly.

Deterministic The index i is chosen deterministically. The index set with k = 0 is ordered by an index ι, such that ι(i) > ι(j) ⇔ ∃l ∈ {0, . . . , d-1} : (i_l > j_l) ∧ (∀m ∈ {0, . . . , l-1} : i_m = j_m). Starting with the largest ι, the next index i is chosen as the index with the next smallest ι.
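Since the ordering ι defined above is exactly the lexicographic order on the level vectors, the deterministic choice can be sketched as follows. The helper name and the representation of `index_set` as a list of level-vector tuples are assumptions of this sketch:

```python
from itertools import cycle

def deterministic_index_cycle(index_set):
    """Deterministic choice of the next level vector i (Section 2.5.1).
    The ordering iota in the text is the lexicographic order, so the
    k = 0 index set is sorted in descending lexicographic order and
    cycled through, starting with the largest iota."""
    ordered = sorted(index_set, reverse=True)   # tuples compare lexicographically
    return cycle(ordered)
```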

2.5.2. Solvers

Three different solvers are used to solve the residual equation A_i d_{m,i} = r_{m,i} (Step 4 of Figure 2.4). Besides the Ruge-Stüben solver of the pyamg package [2] for Python, the Gauss-Seidel method, as well as the damped Jacobi method, have been used.

The range of solvers has been chosen such that one (almost) exact solver is used besides two iterative ones. The Ruge-Stüben solver hereby takes the place of the exact solver, since it solves a given equation up to a certain accuracy. Its tolerance is chosen to be very small, such that the error introduced by the combination technique dominates the error of this solver. The Gauss-Seidel and Jacobi methods act on an initial guess of the solution of a problem and calculate a better approximation of the solution than the initial guess.
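As an illustration of the iterative inner solvers, a few damped Jacobi sweeps could look as follows. This is a generic sketch, not the thesis code; the Gauss-Seidel and Ruge-Stüben solvers play analogous roles, and the damping factor `omega` is a free parameter here.

```python
import numpy as np

def damped_jacobi(A, b, x0, omega=0.8, steps=4):
    """A few damped Jacobi sweeps: x <- x + omega * D^{-1} (b - A x),
    where D is the diagonal of A.  Starting from the initial guess x0,
    each sweep yields a better approximation of the solution."""
    x = x0.copy()
    d_inv = 1.0 / np.diag(A)
    for _ in range(steps):
        x = x + omega * d_inv * (b - A @ x)
    return x
```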

2.5.3. Error Measurement

To analyze the convergence properties of a method, error measurements are necessary. Therefore the difference between a current solution and a reference solution is measured.


Normally one would choose the reference solution as the solution on the full grid. However, convergence towards this solution cannot be guaranteed. Instead, the distance of the solution u to the solution u_conv, to which the algorithm is converging, is chosen as the error measurement. The absolute error is measured on the full grid as

E_abs = ‖u - u_conv‖_2,   (2.20)

where the standard Frobenius norm is used to obtain a scalar value of the error. To generate a relative error, the absolute error is divided by the Frobenius norm of the right-hand side b of the discretized partial differential equation (here: the right-hand side of Poisson's equation, Section 2.7):

E_rel = ‖u - u_conv‖_2 / ‖b‖_2   (2.21)

This relative error will be used in all measurements and its development is used for analysis. It also allows for the introduction of a stopping criterion for the iterative algorithm: once the calculated error is below a certain threshold, the simulation can be stopped. While this stopping criterion is used here, other criteria to stop iterating exist. One could check the change of the error and stop if it does not change much anymore; in this case the converged solution does not need to be known a priori.
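The relative error (2.21) and the threshold-based stopping check can be sketched as follows. The helper name is hypothetical; note that NumPy's `np.linalg.norm` computes exactly the Frobenius norm for 2D arrays, matching the text.

```python
import numpy as np

def relative_error(u, u_conv, b):
    """Relative error of Eq. (2.21): ||u - u_conv||_2 / ||b||_2,
    evaluated on the full grid (Frobenius norm for 2D arrays)."""
    return np.linalg.norm(u - u_conv) / np.linalg.norm(b)
```

A stopping criterion then reads `relative_error(u, u_conv, b) < threshold`.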

Often a check of the residual is used as well: one stops when the norm of the residual is below a certain value. However, since convergence towards the full grid solution cannot be guaranteed, the residual will not vanish on the full grid. Only a check for a stagnating residual could be implemented as a stopping criterion. If one wants to check residuals that do vanish, one needs to check the residuals on the single combination grids. These residuals will indeed vanish (see Figure 4.1).

2.6. Fault Emulation

In today's high-performance simulations, errors are bound to occur, since the overall required compute time increases steadily. This is a consequence of the increasing interest in ever more detailed simulations to predict results more precisely, which for some applications, such as plasma physics, is of utmost importance. Despite constantly improving algorithmic treatments of these simulations, the overall complexity increases. This is handled by increasingly parallel computations, as well as faster processors. The latter is mostly achieved by continuously shrinking the size of transistors, which can increase the error proneness of the central processing unit (CPU). Additionally, the increasingly parallel computations allow for simulations that need far larger total compute times, even if the elapsed real time stays the same. With an increased compute time, the probability that an error occurs will also increase.

Studies have shown [24], that in 2007 the failure rate per socket, i.e. per processor chip,was roughly 0.1 per year. In the year 2018, the failure rate is expected to increase to at least2000 per year, resulting in a mean time to interrupt of less than 50 minutes on each socket.

All in all, with ever larger machines and computations it becomes more important to handle errors. Currently this is mostly done by saving the state of the simulation after a certain time interval and restarting from such a checkpoint if an error is detected, or by


using redundancy. Errors that can be corrected in such a way are called hard faults (Subsection 2.6.1). Errors that cannot be treated with a checkpoint strategy are called soft or silent faults, since they are not detected (Subsection 2.6.2). Silent errors either have to be detected, for example by using voting schemes [19], or they have to be treated algorithmically, e.g. using iterative solvers. An algorithmic treatment of hard faults is also possible [21]. In this work, the latter is done using the fault tolerant combination technique.

2.6.1. Hard Faults

Hard faults are faults that are noticed by the user, since whole solutions are either not calculated or never returned. Such faults can occur when a process of a parallel application terminates unexpectedly, for example because of a hardware failure. Additionally, soft faults can become hard faults when they are noticed and it is decided that the erroneous computation is to be abandoned. Reasons to do so include errors that are too big to be handled algorithmically, or errors that would cause problems for the solver, e.g. due to too large derivatives.

The points where we allow hard faults to occur in the algorithm are Steps 2 to 5 (Figure 2.4). The hard faults have to be known before the combination technique combines the different correction terms d_m. Hard faults in the other parts of the algorithm have to be treated conventionally, e.g. through restarting from checkpoints or through redundancy, and will thus not be simulated in this work. The probability of errors occurring in these regions is small anyway, since most of the computation time is spent in the restriction, solving and interpolation steps.

The hard faults are simulated by marking some solutions as failed directly before the combination technique is started. The combination technique then handles the failures. While the standard combination technique (Section 2.2) is not able to handle them specifically, the fault tolerant combination technique will handle them by adapting the index set accordingly and by scheduling some computations for recomputation. In this work the latter is not done; instead, failing simulations that would need to be recomputed are not allowed to fail in the first place.

2.6.2. Soft Faults

In contrast to hard faults, soft faults are not detected explicitly. They arise due to bit flips in memory, cache or registers and produce errors in floating point computations, which lead to single wrong floating point numbers. The scale of the error varies depending on the bit that is flipped. While flips in the mantissa of the floating point representation of a number will cause a relative error of at most 50%, a flip in the exponent can change the magnitude of the floating point number completely. Flips in the sign are also possible. These errors are simulated by multiplying an affected floating point number by a value µ, such that the perturbed number u_perturbed can be calculated via

u_perturbed = µ · u_original.   (2.22)

To simulate a large range of possible silent faults, different values for the multiplier µ are assumed, e.g. µ ∈ {-1, 2, 1 × 10^-5, 1 × 10^2, 1 × 10^5}. Bit flips can cause errors that are larger than these; however, it is safe to assume the above, since bigger errors would


introduce noticeable peaks in a solution and could be easily detected. Even for 1 × 10^5 this is the case; however, errors of this size will be simulated for comparison.

A single silent fault will not prevent the convergence of an iterative solver [9]; this also holds true for the iterative refinement technique. However, the combination of iterative refinement and combination technique will not be able to handle all introduced errors in the full grid space, since not all initial errors can be damped by the combined algorithm (see Section 4.1.2). Therefore all silent errors should only produce sparse grid errors. To ensure this, the silent errors are introduced on the coarse grids, i.e. after the solving step (Step 4) and before the interpolation step (Step 5, Figure 2.4). The assumption that only the solve step can introduce errors is not unfounded, since the restriction (Step 3) and the interpolation are not very time demanding.

When it is decided that a certain grid is subject to a soft error, then exactly one error is introduced. This is simulated in two different ways:

Random The error is introduced at a random inner grid point. This mimics the real random behavior, since errors can occur at any point.

Deterministic The error is introduced at one specific inner grid point. This has been done to eliminate one additional random influence and allow for better deductions. The chosen grid point is the innermost grid point (∀i ∈ {1, . . . , d} : x_i = 0.5), since this one exists and is located at the same position on every grid.
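Both fault-injection variants can be sketched as follows. This is an illustrative helper, not the thesis code; the name, signature, and the use of NumPy's `Generator` API are assumptions of this sketch. It assumes a correction term stored as an array of the 2^l - 1 inner points per dimension, so the innermost point x_i = 0.5 sits at index s // 2.

```python
import numpy as np

def inject_soft_fault(d, mu=100.0, rng=None):
    """Introduce one silent fault (Eq. 2.22) into a coarse-grid
    correction term d: a single inner value is multiplied by mu.
    If `rng` is given, a random inner point is picked (random
    variant); otherwise the innermost grid point x_i = 0.5 is used
    (deterministic variant)."""
    d = d.copy()
    if rng is not None:
        idx = tuple(rng.integers(0, s) for s in d.shape)  # random inner point
    else:
        idx = tuple(s // 2 for s in d.shape)              # innermost point
    d[idx] *= mu
    return d
```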

However, not all errors can be simulated like this. Errors occurring within a solver can propagate throughout the solution, which cannot be mimicked without manipulating the solver itself. Instead, the simulated errors mimic the behavior of errors in memory, once the result is stored, or errors in transmitting the data in Step 6.

2.6.3. Multiple Faults

As mentioned before, a single (silent) fault cannot influence the convergence properties of an iterative solver. However, in large scale computations one cannot guarantee that only a single fault will occur. Instead, studies have shown that the time between two faults is best described by a Weibull distribution with a shape parameter of 0.7 [22]. In this work this distribution is not used, since it is easier to draw analytical conclusions from the exponential distribution, which closely resembles the Weibull distribution for shape parameters close to 1.

f(x; λ, k) = (k/λ) (x/λ)^{k-1} e^{-(x/λ)^k} for x ≥ 0, and 0 for x < 0.   Weibull distribution (2.23)

f(x; 1/β) = (1/β) e^{-x/β} for x ≥ 0, and 0 for x < 0.   Exponential distribution (2.24)

For each grid, a time-like random variable is drawn according to the exponential distribution, and if the simulation time exceeds this time, an error is introduced. For hard faults, whole solutions are deleted; for soft faults, a single error is introduced.


In addition to the exponential distribution, a uniform distribution has been used. The latter is implemented in such a way that a solution is marked as defective with a previously specified error probability ε, such that on average the defined fraction of all solutions is erroneous. To compare the two distributions, the parameter λ = 1/β of the exponential distribution is set such that on average the fraction ε of all solutions fails. This means that most of the erroneous solutions are those that take longer to compute. In general these are the solutions that arise from calculations with more degrees of freedom; for the combination technique, most of the affected solutions are those on the grids of the highest level.

The simulation times that determine whether or not a fault occurs are not the actually measured times, since those would be heavily system dependent and, in addition, the implementation was not optimized for speed. Instead, a measure T that is proportional to the complexity of the solver has been introduced:

T = N · m_adapt   (2.25)

The above equation holds, since all used solvers are of linear complexity with regard to the number of unknowns N. The additional modifier m_adapt is used to correctly model the non-linear behavior of the adaptive Jacobi and Gauss-Seidel methods (see Sections 4.3 and 4.4).
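Drawing the time-like variable from the exponential distribution and comparing it against T could be sketched as follows. The normalization by a mean complexity `mean_T` is a simplifying assumption of this sketch, chosen so that a solution of average complexity fails with probability close to ε; the thesis instead calibrates λ = 1/β so that on average the fraction ε of all solutions fails.

```python
import numpy as np

def fails(N, m_adapt, eps, mean_T, rng):
    """Decide whether a single grid suffers a fault.  The solver
    complexity T = N * m_adapt (Eq. 2.25) is compared against a
    time-like variable drawn from an exponential distribution, so
    that P(fail) = 1 - exp(-eps * T / mean_T): bigger, slower
    solutions fail more often."""
    T = N * m_adapt
    t_fault = rng.exponential(mean_T / eps)
    return t_fault < T
```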

2.7. Model Problem

As the model problem, Poisson's equation is used. This elliptic partial differential equation is one of the most frequently used for testing properties of solvers. In physics, it can be used to describe the potential that is generated by either a charge or a mass density distribution. For these two purposes it is generally assumed that the generated potential vanishes at an infinite distance from the charge/mass distribution. This, however, is not practical for computations, since infinite domains cannot be properly simulated. Instead, the domain Ω, on which the problem is defined, is assumed to be finite and certain boundary conditions are imposed. The latter can be realized by either constraining the values of the solution directly, by prescribing the derivative of the solution on the boundary (Neumann boundary condition), or by a mixture of the two. In this work the focus lies on the former, the so-called Dirichlet boundary condition.

∆u(x) = f(x)  ∀x ∈ Ω   Poisson's equation (2.26)
u(x) = g(x)  ∀x ∈ ∂Ω   Dirichlet boundary conditions (2.27)
σ · ∇u(x) = h(x)  ∀x ∈ ∂Ω   Neumann boundary conditions (2.28)

As specified before, the observed domain is chosen as the unit d-cube Ω = [0, 1]^d. As the right-hand side of Poisson's equation, a smooth term that vanishes on the boundary is used. The boundary conditions are selected to be homogeneous Dirichlet boundary conditions.

∆u(x) = ∏_{i=1}^{d} sin²(π x_i)  ∀x ∈ Ω   (2.29)
u(x) = 0  ∀x ∈ ∂Ω   homogeneous Dirichlet boundary conditions (2.30)


Figure 2.5.: Solution of the discretized Poisson equation on a full grid of level 7 (left) and level 4 (right), respectively.

To solve the above equations numerically, different discretization strategies are possible. In this thesis a finite difference method is used. The domain Ω is therefore discretized along a regular grid as in Section 2.2, and Poisson's equation becomes

∑_{j=1}^{d} [u(x_i + h_j) - 2 u(x_i) + u(x_i - h_j)] / h_j²  = f(x_i)  for all inner grid points x_i,   (2.31)
u(x_i) = 0  for all boundary points x_i,   (2.32)

with h_j = e_j ‖h_j‖, where e_j is the unit vector of the j-th dimension and ‖h_j‖ is the discretization width of the grid in that direction. The discretization of the grid suggests defining the solution uniquely through its values on the grid points of the regular grid Ω_i (cf. Section 2.2). The discretized solution u_discretized can then be easily obtained and is shown in Figure 2.5 for two different grid levels. Together with the discretization of the solution, (2.31) and (2.32) define a linear system of equations. Using u_i = u_discretized(x_i) = u(x_i) and b_i = f(x_i), the matrix-vector equation Au = b can be derived. The matrix A then becomes:

A_{mn} = -2 ∑_{j=1}^{d} h_j^{-2}  if m = n,
A_{mn} = h_k^{-2}  if ∃k : |m_k - n_k| = 1 ∧ ∀k' ≠ k : m_{k'} = n_{k'},
A_{mn} = 0  otherwise.   (2.33)

This equation only holds for the inner grid points; for all boundary points homogeneous Dirichlet conditions are imposed, so they can be eliminated from the matrix equation.
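The assembly of A according to (2.33) can be sketched with a dense matrix for small grids. This is an illustrative sketch under the stated assumptions (inner points only, homogeneous Dirichlet boundaries eliminated); the function name is hypothetical, and a production code would presumably use sparse matrices.

```python
import numpy as np
from itertools import product

def poisson_matrix(levels):
    """Assemble the finite-difference matrix A of Eq. (2.33) for the
    inner points of the regular grid with level vector `levels`:
    A[m, m] = -2 * sum_j h_j^-2, and A[m, n] = h_k^-2 for neighbors
    differing by one in exactly one dimension k."""
    shape = tuple(2**l - 1 for l in levels)   # inner points per dimension
    h = [2.0**-l for l in levels]             # grid width per dimension
    N = int(np.prod(shape))
    A = np.zeros((N, N))
    flat = {idx: i for i, idx in enumerate(product(*map(range, shape)))}
    for idx, m in flat.items():
        A[m, m] = -2.0 * sum(hk**-2 for hk in h)
        for k in range(len(shape)):
            for step in (-1, 1):
                nb = list(idx)
                nb[k] += step
                n = flat.get(tuple(nb))
                if n is not None:             # neighbor is an inner point
                    A[m, n] = h[k]**-2
    return A
```

For d = 1 and level 2 this yields the familiar tridiagonal stencil 16 · [-2, 1] on the three inner points.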


Part II.

Simulation Results


3. Verification of Individual Components

To verify the correctness of the individual components, certain tests have been performed; they are presented in this chapter. Testing has been done for both the pure iterative refinement method (Section 3.1) and the pure combination technique (Section 3.2).

3.1. Iterative Refinement

The iterative refinement method has been verified by showing convergence of the method towards the full grid solution. Here the pure iterative refinement method is used, i.e. Algorithm 2.1. The convergence plots for both iterative methods and the multi-grid method are shown in Figure 3.1. In contrast to choosing either the Jacobi method or the Gauss-Seidel method as the solver within the iterative refinement framework, choosing the multi-grid solver allows for convergence to machine precision after just one step of the iterative refinement method. This is to be expected, since the iterative refinement method becomes an exact solver if A d_m = r_m (Step 3 of Algorithm 2.1) is solved exactly.

3.2. Combination Technique

To test the correctness of the combination technique and the prolongation, the combination technique has been applied to the function

u(x) = ∏_{i=1}^{d} sin²(π x_i).

Figure 3.1.: Convergence of the iterative solvers (left) and the multi-grid solver (right) in the pure iterative refinement framework (d2l4). The left plot compares Gauss-Seidel and Jacobi (4 inner steps each); the right plot shows the multi-grid solver (MGS) for different inner accuracies.


Figure 3.2.: Test of the combination technique. For a given full grid solution (top left), the result after applying the combination technique to it (top right) and the error (bottom) are shown. Except for the bottom right picture (d2l5s0), all plots are generated for the combination technique d2l4s0.

Therefore, on each combination grid a solution was calculated according to the above formula. Afterwards the solutions were interpolated and then combined (Steps 5 and 6 of the combined algorithm, Figure 2.4). Figure 3.2 shows the full grid solution, the solution generated with the combination technique, as well as the error between the two (u_combined - u_initial). It can easily be seen that the combined solution deviates slightly from the exact solution. However, the error will decrease with a better approximation of the computational domain through the sparse grid. This can be achieved either by using a higher shift or by an increased level of the combination technique. The latter is depicted below.


4. Behavior without Faults

In this chapter the basic characteristics of the combined algorithm (see Section 2.4) are discussed. No errors are taken into account here, in order to discuss the properties of the algorithm alone. Additionally, a basic terminology is specified to allow a better analysis of the properties. The analysis is split into two parts: the properties that hold for every used inner solver (Section 4.1) and the properties that differ between the used solvers. The latter are separated again, depending on whether the multi-grid method (Section 4.2), the Jacobi method (Section 4.3) or the Gauss-Seidel method is used to solve Step 4 (Figure 2.4). Additionally, the used methods are described in more detail.

4.1. General Properties

In this section those properties of the combined algorithm are discussed that are independent of the method used to solve Step 4 (Figure 2.4). This includes the solution to which the solver converges, as well as some simulation parameters that have to be fulfilled to guarantee that convergence.

4.1.1. Error Specification

In Figure 4.1 one can see that the combined solver does not converge to the full grid solution. This is the case because not all existing error frequencies are representable on any of the combination grids, and thus these errors cannot be removed. This is shown at the bottom of Figure 4.1: while the residual on the full grid remains constant and does not vanish, the residuals on all combination grids decrease and vanish with increasing iteration count.

Instead of the error with respect to the full grid solution, the error towards the converged solution will be regarded in the convergence studies, since it gives further insight into the convergence properties of the solver. The conditions under which the solver actually converges to the full grid solution are discussed in Section 4.1.2.


Figure 4.1.: Error development (top left) and error (top right) after convergence of the combined algorithm using the Ruge-Stüben solver. The development of the residua on the full grid and on the combination grids is shown at the bottom (d2l4s0).

4.1.2. Convergence to Full Grid Solution

Convergence towards the full grid solution can only be achieved when the initial error is fully representable by the underlying sparse grid of the combination technique. This can be ensured by creating the starting solution in the following way:

Algorithm 4.1

1. Create an initial error according to ∏_{i=1}^{d} sin(π k_i x_i), where d is the dimensionality of the problem and k_i are the frequencies in the different dimensions. Other initial errors can be used as well.

2. Use the combination technique to make this error representable in the sparse grid.

3. Add this error to the full grid solution of the problem.


Figure 4.2.: Error development of the combined solver using the multi-grid solver. The initial error is representable with the combination technique; it is generated according to Algorithm 4.1 with the frequencies k1 = k2 = 5. Convergence towards the full grid solution can be achieved (d2l4s0).

Note that one can use an arbitrary error in Step 1. This error can also be generated by calculating the error of an arbitrary initial guess with respect to the exact full grid solution.

Figure 4.2 demonstrates that convergence can be achieved using the above method. If, however, the initial error is not representable by the sparse grid, the solution will not converge towards the full grid solution (cf. Figure 4.1). Figure 4.3 shows the introduced initial error between the different steps of the above generation algorithm.

Figure 4.3.: Errors on the full grid of the standard combination technique d2l4s0. The error is generated according to Algorithm 4.1 with the frequencies k1 = k2 = 5. The full grid error (after Step 1, left) and the sparse grid error (after Step 2, right) are shown.


Figure 4.4.: Error development (left) and converged solution (right) on the grid of the standard combination technique d2l4s0 with initial error frequencies k1 = k2 = 5. Convergence towards neither the full grid nor the sparse grid solution is possible.

4.1.3. Choosing Realistic Initial Parameters

Convergence to the actual solution on the full grid can only be achieved if the initial error is representable on a sparse grid. This can be guaranteed using the algorithm introduced in Section 4.1.2. That algorithm, however, only works when the full grid solution is known. For a normal use case this is not the case, since the solution is not known a priori but is the desired result of the simulation. Thus the algorithm can normally not be applied, and it cannot be guaranteed that the initial error is sparse. If, however, one starts with an initial guess that is sparse, then one will converge to a solution that is close to the full grid solution. The simplest sparse initial guess is the zero solution, which will from here on be used as the initial guess. If one does not choose the initial guess wisely, convergence towards the real solution cannot be guaranteed, as illustrated in Figure 4.4; the bad initial guess there is generated using Algorithm 4.1, but neglecting Step 2.

4.1.4. Converged Error Size

As shown before, the solver only converges to the full grid solution under certain conditions. These conditions cannot, however, be fulfilled in a normal simulation, since the solution would have to be known a priori. Therefore it is of interest to know the error of the converged solution with respect to the full grid solution and how it behaves when increasing the level or the shift of the combination technique. The error is measured after the solution has converged. Figure 4.5 shows the development of the converged error with the level of the discretization. An exponential decrease of the error in the level can be observed. Upon further studying the exact errors, one can conclude that the convergence rate with respect to the grid width h is quadratic (Table 4.1). This quadratic convergence is expected, since sparse grids provide nearly quadratic convergence towards full grid solutions [11]:

‖u_full grid - u_converged‖_2 = O(h_n² log(h_n^{-1})^{d-1})   (4.1)


shift   level 3       level 4       level 5       level 6       level 7       l.6 / l.7
0       2.92 × 10⁻⁴   9.31 × 10⁻⁵   2.41 × 10⁻⁵   6.24 × 10⁻⁶   1.63 × 10⁻⁶   3.83
1       1.10 × 10⁻⁴   2.80 × 10⁻⁵   6.95 × 10⁻⁶   1.77 × 10⁻⁶   4.55 × 10⁻⁷   3.90
2       –             5.67 × 10⁻⁶   1.46 × 10⁻⁶   3.63 × 10⁻⁷   9.31 × 10⁻⁸   3.90

Table 4.1.: Converged error towards the full grid solution.
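The quadratic convergence can be checked directly from the data in Table 4.1: one level up halves h, so an O(h²) error should shrink by roughly a factor of 4, with the sparse grid log factor keeping the observed ratio slightly below that. A small sketch with the shift-0 values copied from the table:

```python
# Converged errors towards the full grid solution, copied from Table 4.1 (shift 0).
errors = {3: 2.92e-4, 4: 9.31e-5, 5: 2.41e-5, 6: 6.24e-6, 7: 1.63e-6}

# For O(h^2) convergence, increasing the level (halving h) should divide the
# error by roughly 4; the sparse grid log factor keeps the ratio slightly below.
ratios = {level: errors[level - 1] / errors[level] for level in range(4, 8)}
print(ratios)
```

The ratio between levels 6 and 7 reproduces the 3.83 listed in the last column of the table.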


Figure 4.5.: Converged error towards the full grid solution for the two dimensional problem.

Instead of increasing the level, one can also increase the shift to gain a better approximation. This is due to a better resemblance of the full grid by the sparse grid. However, it also amounts to larger but fewer grids and thus increases the time needed for the individual solves and decreases the parallelizability. For some simulations very coarse grids are not favorable, and then a shift is recommended.

4.2. Ruge-Stüben Solver

As mentioned in Section 2.5.2, the Ruge-Stüben solver of the pyamg package [2] for Python is used as an exact solver up to a certain accuracy. In this section the behavior of the combined algorithm using this solver is examined. The main focus lies on the convergence speed of the solver.

4.2.1. Solver Parameters and General Behavior

Figure 3.1 uses this exact solver and shows that, without the combination technique, convergence after one step can be achieved, no matter the inner accuracy. To understand what the inner accuracy means here, one has to understand the concept of the used multi-grid solver. In this case the multi-grid solver is in itself an iterative solver,



Figure 4.6.: Error development of the Ruge-Stüben solver for the combined algorithm for different inner accuracies of the multi-grid solver (d2l4s0).

which iterates over many V-cycles of a typical multi-grid algorithm. It stops when the relative residual r_k/r_0 falls below the prescribed inner accuracy. Since the multi-grid solver converges after just one iteration, there is no dependence of the combined algorithm on the inner accuracy. This is depicted in Figure 4.6. The only exception to this is an inner accuracy greater than or equal to 1.0. In this case the multi-grid solver does not iterate at all and no improvement can be expected.

In contrast to the pure iterative refinement scheme, there will not be convergence after just one step if the combined algorithm is used. In one iteration only sparse grid errors can be removed. Due to the interpolation, combination and restriction there still exists an error in the next step, since the remaining errors on the full grid are partially transferred to the sparse grid.

One step of iterative refinement with an exact solver generates the exact solution of a problem (see Section 2.1). Using the combination technique, the sparse grid solution is generated after one step of the combined algorithm. This shows that iterative refinement indeed improves the accuracy of the algorithm in comparison to using just the combination technique.
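The one-step exactness of iterative refinement with an exact inner solver can be illustrated on a toy system (a small dense matrix standing in for the discretized problem; this is a sketch, not the thesis code):

```python
import numpy as np

def iterative_refinement(A, b, solve, steps):
    """Iterative refinement: repeatedly solve for a correction to the residual."""
    u = np.zeros_like(b)
    for _ in range(steps):
        r = b - A @ u      # residual of the current iterate
        d = solve(A, r)    # correction term from the inner solver
        u = u + d
    return u

# Toy system standing in for the discretized PDE.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

# With an exact inner solver, a single refinement step yields the exact solution.
u1 = iterative_refinement(A, b, np.linalg.solve, steps=1)
print(np.allclose(A @ u1, b))
```

In the combined algorithm the inner solve is replaced by the combination technique, so one step yields the sparse grid solution instead of the exact one.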

4.2.2. Convergence Rate

To further analyze the convergence properties of the solver, the initial error is introduced in the same way as in Section 4.1.2 (Algorithm 4.1).

Using a sparse initial error, it is assured that the solution will converge towards the full grid solution. Figure 4.7 shows the error development for different initial frequencies on a two dimensional grid of level 5. Straight lines, and therefore constant convergence rates, can be observed for three of the graphs. These graphs correspond to the higher frequencies [k1, k2] = [16, 1], [8, 2], [4, 4]. Only one grid of the combination technique is able to



Figure 4.7.: Error development for different initial error frequencies. For error frequencies that are only representable on one grid, straight lines are observed. Error frequencies representable on multiple grids lead to buckled curves (d2l5s0).

represent these error frequencies. The convergence rate is determined by restricting the residual to this single smaller grid, solving for the correction term dm on that grid and interpolating the correction term back to the full grid. For these grids the error is divided by a factor of roughly 3 with each iteration step (3.0235 for [16, 1], 3.0909 for [8, 2] and 3.2000 for [4, 4]). Those frequencies correspond to the grids with the levels [5, 1], [4, 2], [3, 3]. Thus the convergence rate on different grids can vary. The same analysis can be done using the two dimensional combination technique of level 4 with zero shift (Figure A.1). There the frequencies [8, 1] and [4, 2] amount to the divisors 3.0952 and 3.3333. This behavior seems to persist no matter which level one chooses, as long as the frequencies are only representable on one grid. Additionally, the convergence speed remains the same: one iteration shrinks the error by roughly two thirds.

The other frequencies from Figure 4.7 are representable on multiple grids, and with that the convergence rates can change. The contributions of the different grids to the residual – and therefore to the error – are depicted in Figure 4.8. One of the reasons for the changing slope of the total residuum is the change of the grid with the main contribution to it. There are, however, many other reasons for changing convergence rates, since one grid does not necessarily represent only a single frequency, and the convergence rates for these may vary as well on a single grid. Furthermore, the combination technique induces an interplay between the different levels. As long as there are errors on the coarser grid levels (e.g. [4, 1]), there will always be an error on the finer levels (here [5, 1] and [4, 2]). The convergence rate is dominated by the grid on which the error vanishes the slowest. The existence of the error on multiple grids also has the effect that the overall convergence rate decreases, and the factor by which the error is divided with each step becomes less than 3. This can be seen in Figure 4.7 for the frequencies [4, 2] and [8, 1]: the divisor decreases to 2.5 ([4, 2]), resp. 1.9 ([8, 1]).



Figure 4.8.: Contributions of the different grids to the total residuum with a starting error frequency of [8, 1]. This shows the Frobenius norm (∑ᵢ uᵢ²)^(1/2) of the residua; thus the sizes are not scaled (d2l5s0).

4.3. Iterative Method I – Jacobi

Step 4 of the combined algorithm (Figure 2.4) can also be solved using an iterative method that only approximates the solution. For this purpose the damped Jacobi method has been implemented, where a solution to Au = b is found by iterating over

u^(k+1) = ωD⁻¹(b − Ru^(k)) + (1 − ω)u^(k),  (4.2)

where D is the diagonal sub-matrix of A, R = A − D and ω is a weighting factor. This section describes the properties of the combined algorithm using this method. Additionally, an adaptive Jacobi method is introduced, which allows for better convergence of the combination technique.
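Iteration (4.2) is straightforward to write down; the following is a minimal NumPy sketch with dense matrices and a 1D Poisson-like test system for brevity (an illustration, not the thesis implementation):

```python
import numpy as np

def damped_jacobi(A, b, omega=2.0 / 3.0, iterations=50):
    """Damped Jacobi (4.2): u <- omega*D^{-1}(b - R u) + (1-omega)*u, R = A - D."""
    D = np.diag(A)                # diagonal of A as a vector
    R = A - np.diag(D)            # off-diagonal remainder
    u = np.zeros_like(b)
    for _ in range(iterations):
        u = omega * (b - R @ u) / D + (1.0 - omega) * u
    return u

# 1D Poisson-like tridiagonal system as a stand-in test problem.
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
u = damped_jacobi(A, b, iterations=2000)
print(np.allclose(A @ u, b, atol=1e-6))
```

With ω = 1 this reduces to the normal (undamped) Jacobi method.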

4.3.1. Non-Adaptive Damped Jacobi Method

The normal Jacobi method, for which one approximates the correction term dm by a certain number of Jacobi iterations, converges only if a certain amount of Jacobi iterations is executed. For a two dimensional grid of level 4, 16 iterations are needed for ω = 1 and 23 iterations are needed for ω = 2/3. This iteration count further increases using grids of higher level. For a weighting factor of ω = 2/3 one can observe (Figure 4.9) that a higher iteration count is needed and that the algorithm converges more slowly for the same amount of Jacobi iterations. This is to be expected, since in general the weighted Jacobi method converges more slowly than the non-weighted method. Only for high error frequencies can faster convergence be observed. For the introduced model problem low frequency errors dominate, and thus the weighted method is not useful.



Figure 4.9.: Error development of the combined algorithm using the damped Jacobi method. Using too few inner Jacobi iterations will lead to divergence (d2l4s0).

The reason for the minimal amount of iterations is that the Jacobi method converges at different speeds on the different grid levels. A larger correction is calculated on the low level grids than on the high level grids. The low level grids contribute with a negative weighting factor and the high level grids with a positive weighting factor to the overall correction. The accumulated correction can thus have a different sign than the contributing correction terms. This, however, should never occur, since the correction would be added in the wrong direction. Some iterations for which this is the case are shown in Figure 4.9. The number of total Jacobi iterations n_tot is hereby the total number of Jacobi iterations performed on one grid.

4.3.2. Adaptive Damped Jacobi Method

As previously mentioned, the convergence speed of the Jacobi method differs on the different grid levels and a minimum iteration count is needed.

Using a constant iteration count on all grids implies that the approximation is better on the coarser grids than on the finer grids. The combination technique, however, introduces an error that corresponds to the difference of the approximations on the levels. It is therefore essential that the approximations on the different levels are not too far apart. The minimum iteration count for which this can be ensured is one for which the Jacobi method almost converges. This minimum count is therefore governed by the amount of iterations that the Jacobi method needs to converge on the finest grid and increases exponentially



Figure 4.10.: Using the combined algorithm (d2s0) with the Jacobi method, a certain minimal amount of undamped Jacobi iterations is necessary. This amount increases exponentially with the level of the combination technique.

with the level of the finest grid (cf. Figure 4.10).

Since the Jacobi method nearly converges on the finest grids, convergence on the coarse grids is ensured. The latter is often more precise than needed. Too much work is done on the coarse grids without increasing the accuracy, since the error is governed by the difference of the approximations. Instead of using a constant amount of iteration steps, another way of handling the different convergence speeds can be introduced by iterating differently often on the different grids.

Since the convergence speed of the Jacobi method for solving Poisson's equation is inversely proportional to the total number of grid points N, one needs twice as many iterations on a grid with twice as many grid points to get the same accuracy [12]. This is done using the modifier m = 2^ξ, where ξ is the difference of the sum of the multi-index i and the minimal sum of all multi-indices of the index set I of the combination technique:

m = 2^ξ  (4.3)

ξ = ∑_k i_k − min_{j∈I} ∑_k j_k  (4.4)

For a two dimensional problem and the standard combination technique, this modifier is either 1 or 2, depending on the level of the grid. This modifier is then used to calculate the needed iteration count n_iterations of a grid by multiplying it with the minimal iteration count n_min:

n_iterations = n_min · m  (4.5)
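The rule (4.3)–(4.5) can be sketched as follows; the function name `iteration_counts` and the example index set for a level-4 standard combination technique are illustrative, not taken from the thesis code:

```python
def iteration_counts(index_set, n_min):
    """Adaptive iteration counts (4.3)-(4.5): scale n_min by m = 2^xi,
    where xi is the level-sum distance to the coarsest grids of the set."""
    min_sum = min(sum(i) for i in index_set)
    counts = {}
    for i in index_set:
        xi = sum(i) - min_sum           # (4.4)
        counts[i] = n_min * 2 ** xi     # (4.3) and (4.5)
    return counts

# Assumed index set of the 2D standard combination technique of level 4:
# the finer grids (level sum 5) get the modifier 2, the coarser ones (sum 4) get 1.
index_set = [(4, 1), (3, 2), (2, 3), (1, 4), (3, 1), (2, 2), (1, 3)]
counts = iteration_counts(index_set, n_min=8)
print(counts)
```

With n_min = 8 the coarser grids perform 8 iterations and the finer grids 16, matching the doubling of the grid point count.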

For the fault tolerant combination technique there are three different levels, and thus the modifier can also be 4. Figure 4.11 shows the convergence properties of the combined algorithm using the adaptive undamped Jacobi method. The maximum number of total



Figure 4.11.: Error development using the adaptive undamped Jacobi method for different minimal iteration counts and a shift of 0 (left), and for a fixed minimal inner iteration count of 1 and varying shift (right), on the two dimensional grid of level 4.

Jacobi iterations n_tot,max that are performed on any grid is used as a cost indicator and represents a good approximation, since most of the work is done on the upper levels. For the studies of the fault tolerance the adaptive damped Jacobi method will be used.

Figure 4.11 also shows that no convergence can be achieved for n_min = 1. In this case one iteration is performed on the grids with the lower level and two iterations are performed on the ones with the higher level. The main problem lies in the bad handling of high frequency errors by the undamped Jacobi method: they are mirrored in each iteration. On the low level grids they are mirrored once, on the high level grids they are mirrored twice. Since in the two dimensional combination technique grids of different levels have different signs, the high frequency errors are added on top of each other. This leads to the observed bad handling of high frequency errors. The error cannot occur, however, if the iteration count on every level is even. This is achieved when one chooses n_min to be even. Theoretically, an odd iteration count on every level would also suffice; however, this is not possible using a multiplier as in (4.5).

Another way to handle the high frequency errors is to use the damped Jacobi method. This method converges for high frequency errors and cannot produce the observed error, since high frequency errors are damped. Figure 4.12 shows the error development of the adaptive damped Jacobi method, where convergence can be observed even for n_min = 1.

4.3.3. Optimal Jacobi Iteration Count

In this section the optimal amount of Jacobi iterations in every iterative refinement step is discussed. For this purpose the simulation has been run for different inner iteration counts, and the inner iteration count for which convergence is achieved fastest is chosen as optimal. Hereby the fastest method is the one which needs the least computational effort to reach convergence. The computational demand is approximated by the amount of total Jacobi iterations (number of iterative refinement steps × number of Jacobi iterations per iterative refinement step). One further approximation has been made by only considering the convergence rate for the first step of the iterative refinement. The convergence rate was then calculated by assuming an exponential convergence in the iteration count



Figure 4.12.: Error development of the combined algorithm using the adaptive damped Jacobi method (ω = 2/3, d2l4s0). Convergence can be achieved even for n_min = 1.

and interpolating. The optimal amount of Jacobi iterations depends heavily on whether the non-adaptive or the adaptive method is used and is described in this section.

Non-Adaptive Jacobi Method

As previously mentioned, the non-adaptive Jacobi method requires a minimum iteration count that grows exponentially with the level of the combination technique. Therefore the optimal iteration count also has to increase at least exponentially with the level. This is, however, infeasible, and thus the non-adaptive method will not be used.

Adaptive Jacobi Method

For the adaptive damped Jacobi method convergence can be observed independently of the amount of Jacobi iterations (Figure 4.12). If fewer Jacobi iterations are used, the convergence rate increases. This is probably due to the fact that the convergence speed of the Jacobi method is not exactly inversely proportional to the problem size. The optimal iteration count does not increase exponentially, and less work is needed using the adaptive Jacobi method.

4.4. Iterative Method II – Gauss-Seidel

In addition to the Jacobi method, another iterative method to solve Step 4 of the combined algorithm (Figure 2.4) has been implemented, namely the Gauss-Seidel method. This method finds a solution to Au = b by iterating over

L∗u^(k+1) = b − Uu^(k),  (4.6)



Figure 4.13.: Optimal inner iteration count of the combined algorithm using the non-adaptive undamped Jacobi solver on the two dimensional problem. The shift of the combination technique is 0. The optimal iteration count increases exponentially with increasing level.


Figure 4.14.: Left: Error development using the Gauss-Seidel method for different minimal iteration counts (d2l4s0).
Right: Minimal inner iteration count using the Gauss-Seidel method for different grids.

where L∗ is the lower triangular component of A and U = A − L∗ is the strictly upper triangular part. This section describes the properties of the combined algorithm using this method.
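Iteration (4.6) amounts to a forward substitution in which newly computed components are used immediately; a minimal dense sketch on the same 1D Poisson-like test system as before (an illustration, not the thesis implementation):

```python
import numpy as np

def gauss_seidel(A, b, iterations=50):
    """Gauss-Seidel (4.6): solve L* u^{k+1} = b - U u^k by forward substitution,
    where L* is the lower triangle (incl. diagonal) and U the strict upper part."""
    n = len(b)
    u = np.zeros_like(b)
    for _ in range(iterations):
        for i in range(n):
            # New values u[:i] are used immediately (forward substitution).
            s = A[i, :i] @ u[:i] + A[i, i + 1:] @ u[i + 1:]
            u[i] = (b[i] - s) / A[i, i]
    return u

# 1D Poisson-like tridiagonal system as a stand-in test problem.
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
u = gauss_seidel(A, b, iterations=500)
print(np.allclose(A @ u, b, atol=1e-8))
```

The immediate reuse of updated components is what gives Gauss-Seidel its better handling of high frequency errors compared to the undamped Jacobi method.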

As for the Jacobi method, when using the normal Gauss-Seidel method with a constant iteration count on each level, a certain minimal iteration count is needed (cf. Figure 4.14).

An adaptive method has been introduced that works in the same way as the one for the Jacobi method (cf. Section 4.3.2). It allows for a similar accuracy of the approximation of the correction term dm and prevents unneeded work. In contrast to the undamped Jacobi method, convergence can be achieved when using a single Gauss-Seidel iteration in each iterative refinement step. The good convergence properties for high frequency errors are responsible for this. Figure 4.15 depicts the error development using the adaptive Gauss-Seidel method. As for the damped Jacobi method, the optimal choice for the minimal inner iteration count is 1.



Figure 4.15.: Left: Error development using the adaptive Gauss-Seidel method for different minimal iteration counts (d2l4s0).
Right: Error development using the adaptive Gauss-Seidel method for a minimal inner iteration count n_min = 1 on the two dimensional grid using a combination technique of level 4 with different shifts.

4.5. Conclusion

Even without (emulated) faults, some methods can lead to divergence when used as a solver in the combined algorithm. Out of the introduced solvers, the Jacobi method and the Gauss-Seidel method show this behavior. They both require a certain minimal iteration count if one uses them in an unmodified version, i.e. with a constant iteration count. Adapting the iteration count to the number of unknowns on the combination grid improves this behavior. Then both iterative methods converge, with the exception of the undamped Jacobi method. Using it, divergence can only occur if one uses an odd amount of iterations on one level and an even amount on another. Taking care that this does not happen, convergence can always be achieved.

Using a solver with only a small error, e.g. the Ruge-Stüben solver, convergence can always be achieved.

In the remainder of this thesis the Ruge-Stüben solver, as well as the adaptive Gauss-Seidel method and the adaptive damped Jacobi method, will be used, since they always lead to convergence if no faults occur.

The combined algorithm will not converge towards the actual full grid solution, but to a solution close to it. With increasing level or shift of the combination technique, this approximation improves.


5. Fault Tolerance I – Hard Faults

In this chapter the fault tolerance of the different combination techniques in combination with iterative refinement is described. Hereby only hard faults are considered; fault tolerance towards soft faults is described in Chapter 6. Hard faults are introduced as described in Section 2.6.1: whole solutions are ignored by the combination techniques. This resembles a hardware failure, where a process does not return a result, or the case where a soft fault is actually detected and the solution is consequently ignored.

This chapter is split according to the different solvers: first, Section 5.1 describes the fault tolerance using the multi-grid method; in Section 5.2 the Jacobi method is considered; and in Section 5.3 the Gauss-Seidel method is described. All of these sections are subdivided according to the different combination techniques. First the standard combination technique (see Section 2.2) is considered, which allows for no correction through the combination technique; next the fault tolerant combination technique (see Section 2.3) is taken into account. It allows for good convergence and adjusts the index set for the combination, as well as the needed weights, according to the failed solutions. The last subsection always describes the single solution technique (see Section 2.5.1), which is not a real combination technique, since it only considers one solution in each time step and does not combine multiple solutions.

Finally, a comparison of the different methods and combination techniques is given in Section 5.4.

5.1. Ruge-Stüben Solver

First, the convergence properties of the combined algorithm using the multi-grid solver have been examined. Starting with the original standard combination technique (Section 5.1.1), which just ignores grids for which an error has been diagnosed, the fault tolerant combination technique (Section 5.1.2) and the single solution technique (Section 5.1.3) have been studied.

5.1.1. Standard Combination Technique

The standard combination technique is not able to handle errors by itself. Hard faults will disable some of the partial solutions. The affected solutions will not be considered when using the combination technique. Using the standard combination technique without iterative refinement will produce wrong results, since the combined solution will be composed of randomly selected partial solutions. However, using the iterative refinement technique, an algorithmic error handling can be introduced, as visible in Figure 5.1.

Depending on whether the errors are distributed equally or exponentially (see Section 2.6.3), a different behavior can be observed. For equally distributed errors convergence can



Figure 5.1.: Sample error development using the multi-grid solver and the standard combination technique with different error rates ϵ. The errors are distributed either equally (left) or exponentially (right) (d2l4s0).


Figure 5.2.: Mean iteration count to reach convergence for the standard combination technique, using the multi-grid method and equally distributed (left), respectively exponentially distributed (right), errors (d2l4s0).

be observed no matter which error rate is chosen. For exponentially distributed errors this is not the case: there, convergence can only be achieved up to an error rate of roughly 45% (d2l4s0). The reason for this is that with equally distributed errors all grids are equally likely to fail, and thus on average almost an equal amount of grids of the two different levels will fail. For exponentially distributed errors, grids of higher levels are more likely to fail, leaving the grids of lower levels in the majority. For the combination technique this means that the effective weighting factor (the sum of the weighting factors of the working grids), which normally should be +1, will mostly be negative, and thus the correction will be added in the wrong direction. For both distributions a higher error rate means slower convergence (Figure 5.2).
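The effective weighting factor argument can be made concrete with a small sketch. The weights listed below follow the usual ±1 pattern of the 2D standard combination technique of level 4 (finer grids +1, coarser grids −1); the helper `effective_weight` is illustrative, not thesis code:

```python
# Assumed combination weights of the 2D standard combination technique of level 4:
# +1 for the finer grids (level sum 5), -1 for the coarser grids (level sum 4).
weights = {(4, 1): 1, (3, 2): 1, (2, 3): 1, (1, 4): 1,
           (3, 1): -1, (2, 2): -1, (1, 3): -1}

def effective_weight(failed):
    """Sum of the weights of the surviving grids (normally +1)."""
    return sum(w for grid, w in weights.items() if grid not in failed)

print(effective_weight(set()))
# If mostly high-level grids fail (exponentially distributed faults), the
# negative low-level weights dominate and the correction flips its sign.
print(effective_weight({(4, 1), (3, 2), (2, 3)}))
```

Without faults the effective weight is +1; once three of the four finer grids fail it drops to −2, which is exactly the sign flip that stalls convergence for exponentially distributed errors.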

5.1.2. Fault Tolerant Combination Technique

The fault tolerant combination technique (FTCT) does not solely rely on the iterative refinement to handle the errors. Instead it adapts the weighting factors to properly handle detected errors. In contrast to the standard combination technique, the effective weighting



Figure 5.3.: Sample error development using the multi-grid method and the fault tolerant combination technique. The errors are distributed either equally (left) or exponentially (right) (d2l4s0).


Figure 5.4.: Mean iteration count to reach convergence for the FTCT, using the multi-grid method and equally distributed (left), respectively exponentially distributed (right), errors (d2l4s0).

factor will always be 1.

In Figure 5.3 error developments for different error rates are shown. For all error rates convergence can be observed. The convergence speed for small error rates is almost identical; only for large error rates does the convergence rate diminish clearly. For exponentially distributed errors, the convergence speed is generally slower than for equally distributed errors. This effect is more visible at high error rates, since then almost all high level grids will fail, leaving high frequency errors untouched. The latter explains the temporarily constant errors in the error development plots for frequently occurring exponentially distributed errors. Figure 5.4 gives a more detailed view of the average amount of iterations to convergence.

5.1.3. Single Solution Technique

Figure 5.5 shows the error development for the single solution technique; here only one solution of the highest level has been used (see Section 2.5.1).

The deterministic and random schemes converge almost equally fast. However there is



Figure 5.5.: Error development for the single solution technique using the multi-grid method. The grids for the left plot are determined using the deterministic scheme; for the right plot the random scheme is used (d2l4s0).

a slight advantage for the deterministic scheme. This results from the fact that there are iterations in the random scheme where no improvement is achieved. This happens when the chosen grid corresponds to error frequencies that have already been eliminated. The deterministic scheme cycles through the grids and chooses each grid equally often. Thus the error in each direction is eliminated similarly fast. In comparison to the proper (standard & fault tolerant) combination techniques, and even though the single solution technique needs more iterations to converge, it actually needs less total work to converge. The reason for this is that the solution only has to be calculated on one grid instead of many grids. This, however, comes at the cost that the single solution technique does not introduce an additional layer of parallelism, like the proper combination techniques do. Additionally, the slightly deformed lines for the deterministic scheme indicate that a better deterministic scheme could exist.
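The round-robin idea behind the deterministic scheme can be sketched with `itertools.cycle`; the grid list here is an assumed example, not the thesis's actual schedule:

```python
from itertools import cycle

# Deterministic scheme sketch: cycle through the grids round-robin, so every
# error direction is treated equally often (assumed grid list for illustration).
grids = [(4, 1), (3, 2), (2, 3), (1, 4)]
chooser = cycle(grids)
schedule = [next(chooser) for _ in range(8)]
print(schedule)  # each grid appears twice in 8 iterations
```

A random scheme would instead draw from `grids` uniformly each step, which is what occasionally picks a grid whose error frequencies are already eliminated.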

5.2. Iterative Method I – Jacobi

In this section the error tolerance of the combined algorithm using the adaptive damped (ω = 2/3) Jacobi method is evaluated. This is done for the different combination techniques and shows different results for each of them.

5.2.1. Standard Combination Technique

Using the standard combination technique, the whole error handling is done by the iterative refinement method. Depending on the distribution type of the errors, certain conditions on the error rate and the number of inner Jacobi iterations have to be imposed.
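The outer iterative refinement loop can be sketched as follows. This is an illustrative stand-in, not the thesis code: the combination-technique solve is replaced by a few damped Jacobi sweeps on a small 1D Poisson system, and all names are assumptions for the example.

```python
def matvec_poisson1d(x):
    """Matrix-vector product with the 1D Poisson matrix tridiag(-1, 2, -1)."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def inexact_solve(r, sweeps=5, omega=2.0 / 3.0):
    """Stand-in for the (possibly faulty) combination-technique solve:
    a few damped Jacobi sweeps on the residual equation A c = r."""
    c = [0.0] * len(r)
    for _ in range(sweeps):
        Ac = matvec_poisson1d(c)
        c = [c[i] + omega * (r[i] - Ac[i]) / 2.0 for i in range(len(r))]
    return c

def iterative_refinement(b, tol=1e-10, max_outer=500):
    """Outer loop: compute the residual, solve for a correction
    approximately, apply it.  A fault in one outer step only perturbs
    the next residual, which is what gives the scheme its tolerance."""
    x = [0.0] * len(b)
    for _ in range(max_outer):
        Ax = matvec_poisson1d(x)
        r = [b[i] - Ax[i] for i in range(len(b))]
        if max(abs(v) for v in r) < tol:
            break
        c = inexact_solve(r)
        x = [x[i] + c[i] for i in range(len(x))]
    return x
```

The accuracy of the inner solver only affects the number of outer iterations, not the final accuracy — the mechanism exploited throughout this chapter.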

Figure 5.6 shows the convergence properties for equally distributed errors (d2l4s0). One can clearly observe that up to a certain threshold the total work needed remains almost independent of the number of inner Jacobi iterations: it does not matter whether one chooses 2, 4, 8 or 16 inner iterations. For values above that threshold, which in this case is 16 iterations, too much work is done. The main reason is that the Jacobi method has already converged and the remaining iterations do not


Figure 5.6.: Sample error development using the standard combination technique with the adaptive damped Jacobi method for different iteration counts and equally distributed errors with an error rate of 0.4. The total iteration count indicates the number of total Jacobi iterations performed on a grid of the highest considered level (d2l4s0).

Figure 5.7.: Sample error development (left) and mean total iteration count to convergence (right) for the standard combination technique using the damped adaptive Jacobi method with equally distributed errors and error rates ε. The inner Jacobi iteration count is 2 (d2l4s0).

help improve the accuracy. The threshold value corresponds approximately to the number of Jacobi iterations needed to converge and will thus increase with rising level.

For the standard combination technique with equally distributed errors and the Jacobi method, an inner iteration count of 2 is chosen. Figure 5.7 shows the convergence properties for this iteration count and different error rates. Convergence can be observed for all error rates, with decreasing convergence speed for high error rates. For error rates below 50% the mean convergence speed remains steady, while the variance increases.

As for the multi-grid solver, a completely different behavior can be observed when


Figure 5.8.: Error development (left) and mean total iteration count to convergence (right) for the standard combination technique using the damped adaptive Jacobi method with exponentially distributed errors. The inner Jacobi iteration count is 2 (d2l4s0).

Figure 5.9.: Error development for the standard combination technique using the damped adaptive Jacobi method with exponentially distributed errors. The error rates are 0.2 (left) and 0.3 (right) (d2l4s0).

using exponentially distributed errors (Figure 5.8). For an inner iteration count of 2, the convergence speed decreases rapidly with increasing error rate, and convergence can only be observed up to error rates of 10%. Convergence can, however, be established for higher error rates if more Jacobi iterations are performed (Figure 5.9). The same upper limit on the error rate of around 45% as for the multi-grid solver still applies, since the Jacobi method converges to the exact solution for high iteration counts.
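The two hard-fault models can be sketched as follows. The exponential model here is an assumption for illustration: it only captures that failure probability grows with a grid's number of unknowns, so the large upper-level grids fail more often; the exact scaling used in the thesis (Section 2.6) may differ.

```python
import random

def uniform_faults(levels, eps, rng):
    """'Equally distributed' hard faults: every component grid fails
    independently with the same probability eps."""
    return {l for l in levels if rng.random() < eps}

def exponential_faults(levels, eps, rng):
    """Illustrative exponential-type fault model: the failure
    probability grows with the number of unknowns (~ 2^{|l|}), so the
    expensive upper-level grids fail far more often.  eps is the
    failure probability of the largest grids; the thesis' exact
    scaling may differ."""
    dof_max = max(2 ** sum(l) for l in levels)
    failed = set()
    for l in levels:
        p = 1.0 - (1.0 - eps) ** (2 ** sum(l) / dof_max)
        if rng.random() < p:
            failed.add(l)
    return failed
```

Under the exponential model, a moderate eps already removes most of the top-diagonal grids, which is consistent with the earlier onset of convergence problems observed for exponentially distributed errors.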

5.2.2. Fault Tolerant Combination Technique

If the errors are uniformly distributed (Figure 5.10), one can observe convergence for all error rates when using the fault tolerant combination technique. The convergence speed only decreases significantly for very high error frequencies.

For low error rates and equally distributed errors, the convergence speed can actually increase. This can be explained by the fact that fewer grids are used and thus the accumulated error, caused by the differently exact approximations on the different grids, is


Figure 5.10.: Error development (left) and mean total iteration count to convergence (right) for the fault tolerant combination technique using the damped adaptive Jacobi method with uniformly distributed errors. The inner Jacobi iteration count is 2 (d2l4s0).

Figure 5.11.: Error development (left) and mean total iteration count to convergence (right) for the fault tolerant combination technique using the damped adaptive Jacobi method with exponentially distributed errors. The inner Jacobi iteration count is 2 (d2l4s0).

smaller. For large error rates, the convergence rate decreases, since most of the upper-level vectors no longer exist.
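The classical 2D combination coefficients behind these experiments can be sketched as follows. This is illustrative (indexing conventions for the combination formula vary between texts); it demonstrates the invariant that any valid coefficient set sums to 1, so that constants are reproduced even when the fault tolerant technique falls back to fewer grids.

```python
def combination_coefficients(n):
    """Coefficients of the classical 2D combination technique of
    level n: u_c = sum_{i+j=n+1} u_(i,j) - sum_{i+j=n} u_(i,j),
    for levels i, j >= 1 (indexing conventions vary)."""
    coeffs = {}
    for i in range(1, n + 1):          # upper diagonal, coefficient +1
        coeffs[(i, n + 1 - i)] = coeffs.get((i, n + 1 - i), 0) + 1
    for i in range(1, n):              # lower diagonal, coefficient -1
        coeffs[(i, n - i)] = coeffs.get((i, n - i), 0) - 1
    return coeffs
```

The simplest fallback when the whole top diagonal is lost is the level n−1 combination, whose coefficients also sum to 1; it uses fewer grids and therefore accumulates a smaller combination error, matching the observation above.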

Figure 5.11 shows the behavior of the fault tolerant combination technique in combination with the damped adaptive Jacobi method for exponentially distributed errors. Here the convergence rate starts to decrease at a lower error rate than for equally distributed errors, since for the same error rate more upper-level vectors are missing.

5.2.3. Single Solution Technique

Using the single solution technique, only one solution is used per iterative refinement step. Figure 5.12 shows the error development of the combined algorithm. Using only one solution instead of a combined solution actually accelerates the convergence. The reason lies in the different convergence speeds of the Jacobi method on the different


Figure 5.12.: Error development of the single solution technique using the damped adaptive Jacobi method. A deterministic (left) and a random (right) scheme have been used to sample the grids on which the Jacobi operations are performed. The Jacobi iteration count is 2 (d2l4s0).

grids, which could not be properly corrected even using an adaptive Jacobi method. The different speeds add an error to the combination technique that decreases the convergence rate. This error cannot occur when only a single correction is used.

For both the deterministic and the random scheme, the convergence rates do not change much as long as only a few inner Jacobi iterations are performed. Beyond a certain point too much work is done and the convergence rate worsens. The deterministic scheme has a slight advantage over the random scheme, since it eliminates the error in each direction equally fast; in the random scheme, one direction can be chosen repeatedly even though the error in that direction has already been eliminated.

5.3. Iterative Method II – Gauss-Seidel

In this section the error tolerance of the combined algorithm using the adaptive Gauss-Seidel method is evaluated, in the same way as for the Jacobi method. Similar results are obtained; this section points out the differences from and similarities to the Jacobi method.

5.3.1. Standard Combination Technique

As for the Jacobi method, using the standard combination technique always leads to convergence (Figures 5.13, 5.14), as long as the errors are uniformly distributed. The convergence rate slows down with an increased number of inner steps; this happens sooner than for the Jacobi method, due to the faster convergence of the Gauss-Seidel method. High error rates result in slower convergence, while low error rates do not change the convergence properties much and only increase the variance of the convergence speed.
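The difference between the two smoothers, which explains why the "too many inner iterations" regime sets in earlier for Gauss-Seidel, can be seen in a minimal dense sketch (illustrative, not the thesis implementation):

```python
def jacobi_sweep(A, b, x, omega=2.0 / 3.0):
    """One damped Jacobi sweep: every update uses the old iterate."""
    n = len(b)
    return [x[i] + omega * (b[i] - sum(A[i][j] * x[j] for j in range(n))) / A[i][i]
            for i in range(n)]

def gauss_seidel_sweep(A, b, x):
    """One Gauss-Seidel sweep: updates use already-updated entries,
    which is why it typically converges faster than Jacobi."""
    n = len(b)
    x = list(x)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x

# 1D Poisson test problem, A = tridiag(-1, 2, -1)
n = 5
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
x_true = [1.0, 2.0, 3.0, 2.0, 1.0]
b = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]
xj, xg = [0.0] * n, [0.0] * n
for _ in range(50):
    xj = jacobi_sweep(A, b, xj)
    xg = gauss_seidel_sweep(A, b, xg)
err_jacobi = max(abs(xj[i] - x_true[i]) for i in range(n))
err_gs = max(abs(xg[i] - x_true[i]) for i in range(n))
```

After the same number of sweeps the Gauss-Seidel error is markedly smaller, so fewer inner iterations suffice before extra work is wasted.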

For exponentially distributed errors (Figure 5.15), results similar to the Jacobi method can be observed: convergence can no longer be obtained for all error rates and numbers of inner Gauss-Seidel operations. For an inner iteration count of 2, the Gauss-Seidel method tolerates a higher error rate (20%) than the Jacobi method (10%). Additionally,


Figure 5.13.: Error development of the combined algorithm using the adaptive Gauss-Seidel method with different iteration counts and the standard combination technique with equally distributed errors and an error rate of 0.2 (d2l4s0).

Figure 5.14.: Error development (left) and mean total iteration count to convergence (right) for the standard combination technique using the adaptive Gauss-Seidel method with uniformly distributed errors. The inner Gauss-Seidel iteration count is 2 (d2l4s0).

not as many inner iterations are needed for the higher error rates. Both facts are based on the higher convergence speed of the Gauss-Seidel method. Error rates above a certain threshold (45%) cannot be handled, since this is not possible even for an exact solver (see Section 5.1.1).

5.3.2. Fault Tolerant Combination Technique

The fault tolerant combination technique converges almost independently of the error rate if equally distributed errors are used. The convergence rate only changes if almost all high-level grids are missing. The variance of the number of iterations the algorithm takes to converge, however, increases with increasing error rate. As for the Jacobi method, a slightly increased convergence speed is possible for low error rates due to fewer combination grids. For high error rates, the convergence slows down. This however happens for


Figure 5.15.: Error development (left) and mean total iteration count to convergence (right) for the standard combination technique using the adaptive Gauss-Seidel method with exponentially distributed errors. The inner Gauss-Seidel iteration count is 2 (d2l4s0).

Figure 5.16.: Error development (left) and mean total iteration count to convergence (right) for the fault tolerant combination technique using the adaptive Gauss-Seidel method with equally distributed errors. The inner Gauss-Seidel iteration count is 2 (d2l4s0).

the Gauss-Seidel method at a lower error rate than for the Jacobi method (cf. Figures 5.10, 5.16).

For exponentially distributed errors (Figure 5.17), an increase in the number of iterations needed to converge can be seen at a similar error rate as for the Jacobi method. This increase is simply due to the fact that the high-level grids are very likely to fail. The high-level grids are, however, also the grids on which the error remains the longest: only once no error remains on the coarse-level grids can the error on the fine-level grids vanish. As for equally distributed errors, the convergence rate of the solver slows down for high error rates; with the Jacobi method this happens at larger error rates than for the Gauss-Seidel method.

5.3.3. Single Solution Technique

As for the Jacobi method, faster convergence than with the standard or fault tolerant combination technique can be achieved by using just one solution. The same reason, namely


Figure 5.17.: Error development (left) and mean total iteration count to convergence (right) for the fault tolerant combination technique using the adaptive Gauss-Seidel method with exponentially distributed errors. The inner Gauss-Seidel iteration count is 2 (d2l4s0).

Figure 5.18.: Error development of the single solution technique using the adaptive Gauss-Seidel method. A deterministic (left) and a random (right) scheme have been used to sample the grids on which the Gauss-Seidel operations are performed. The minimal Gauss-Seidel iteration count is 2 (d2l4s0).

the different convergence speeds on the different levels, is responsible for that. In contrast to the Jacobi method, with the Gauss-Seidel method the work needed increases faster with increasing inner iteration count, so a small inner iteration count is advisable. Even though less work needs to be done, one level of parallelization is given up and no fault tolerance against hard errors is provided. Both can, however, be reestablished by using redundancy, i.e., computing the same solution multiple times.

5.4. Comparison

The iterative refinement scheme provides one layer of fault tolerance, so that convergence can be achieved even with the standard combination technique. Only for high error rates and exponentially distributed errors is convergence impossible. Even though this distribution is the more realistic one, the error rate needed to hinder convergence is very high and unlikely to be reached in a normal simulation.

For larger problems the computation time rises and with it the error rate. In addition to the increasing probability of failed solutions, the number of grids rises, leading to a lower critical error rate above which no convergence can be achieved (cf. Figure A.2).

The standard combination technique is therefore not applicable to larger problems. With the fault tolerant combination technique, convergence can be achieved far more reliably: the solution can be found even for exponentially distributed errors. Only if almost every high-level grid witnesses an error does the convergence speed decrease drastically. Such a condition indicates a very error-prone machine, and one should then consider improving the machine itself.

Apart from the fault tolerant combination technique, convergence can also be achieved with less total work if, in each iterative refinement step, the correction term is calculated on only a single grid (single solution technique). This however decreases the parallelizability of the calculation compared to the two introduced combination techniques and results in a longer total computation time using the multi-grid method. Even though the parallelizability is decreased, the two iterative methods actually take less work to solve Poisson's equation using the single solution technique. However, this method provides no protection against hard faults, unless redundancy is used or a defective iterative refinement step is simply discarded.


6. Fault Tolerance II – Silent Faults

This chapter characterizes the convergence properties of the algorithm under the occurrence of soft faults (Section 2.6.2). These faults are not detected and can only be corrected using algorithmic approaches, in this case iterative refinement. For fault tolerance against hard faults, see Chapter 5.

This chapter has a similar structure to the chapter on hard faults: it is divided into two parts, the first (Section 6.1) considering the multi-grid method and the second (Section 6.2) the iterative methods, in this case the Jacobi method. The third section (Section 6.3) compares the two. In contrast to the hard faults, only one iterative method has been used and the Gauss-Seidel method has not been considered, since, as for the hard faults, very similar results to the Jacobi method are expected.

The individual methods are first examined under a single silent fault. Later, multiple silent faults are introduced as described in Section 2.6.3. Only the exponential distribution is used, since it is more realistic than equally distributed errors, and the results remain very similar no matter which distribution is used (see Section 6.1.2).

Since no hard faults occur, only the standard combination technique is used – the faulttolerant combination technique would produce the same combination.
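A silent fault of "size µ" can be sketched as follows. This is a minimal model of the thesis' fault injection: one entry of a grid solution is multiplied by µ; the bit-level helper is only included to illustrate why µ = −1 corresponds to a sign-bit flip and why very large or tiny µ mimics exponent-bit flips. The index choice is an assumption for the example.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of an IEEE-754 double (bit 63: sign,
    bits 52-62: exponent, bits 0-51: mantissa)."""
    (i,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", i ^ (1 << bit)))
    return y

def inject_silent_fault(u, mu, index=0):
    """Model of a silent fault: multiply one entry of a grid solution
    by the factor mu.  mu = -1 corresponds to a sign-bit flip; very
    large or very small mu mimics exponent-bit flips.  The index used
    here is illustrative."""
    v = list(u)
    v[index] *= mu
    return v
```

For example, flipping the sign bit of 1.0 yields −1.0, while flipping the lowest exponent bit halves the value — exactly the kind of undetected perturbation studied below.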

6.1. Ruge-Stüben Solver

In this section the convergence properties using the multi-grid solver are discussed. First a single silent fault is introduced; later, multiple silent faults are introduced using the exponential distribution. Additionally, exponentially and equally distributed errors are compared to show that both have a similar effect.

Figure 6.1.: Error development using the multi-grid solver and the standard combination technique. A single silent fault of size µ = 2 (left) and µ = 10 (right) has been introduced (d2l4s0).

Figure 6.2.: Error development using the multi-grid solver and the standard combination technique. A single silent fault of size µ = −1 (left) and µ = 10^−5 (right) has been introduced (d2l4s0).

Figure 6.3.: Error development using the multi-grid solver and the standard combination technique. A single silent fault of size µ = 10^3 (left) and µ = 10^5 (right) has been introduced (d2l4s0).

6.1.1. Single Silent Fault

The algorithm always converges if only one silent fault is introduced. The fault is generated with the deterministic scheme (Section 2.6.2). For errors of reasonable size (Figures 6.1, 6.2), the iteration count needed until convergence does not change much, as long as the error does not happen in the first iteration. If the error is introduced at any other iteration, the result is always very similar and shows only a slight increase in the work needed until convergence is reached. It is unimportant whether the bit flip increases the value of the faulty float (µ ∈ {2, 10}), flips its sign (µ = −1) or decreases the value (µ = 10^−5).

All of these failures have in common that they are not easily detected. This changes drastically for higher values of µ, where detection would be possible. A detected fault can be handled in the same way as a hard fault.


Figure 6.4.: Sample error development using the multi-grid solver and the standard combination technique with different error rates. Silent faults of size µ = −1 (top left) and µ = 10^−5 (top right) are introduced. The bottom figure shows the error development for varying error size if one error is introduced on every grid (d2l4s0).

Figure 6.3 shows the error development for large values of µ. As before, an error in the first iteration has the highest impact. Errors that do not occur in the first iteration have a similar effect on the total iteration count, independent of the iteration in which they occur. Higher values of µ lead to a larger amount of work needed. A rough check for silent faults that detects only large errors should therefore be implemented.
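Such a rough check could look as follows (a minimal sketch; the threshold value and the check's placement in the algorithm are assumptions, and in practice the threshold is problem dependent):

```python
def plausibility_check(u, threshold):
    """Crude silent-fault check, as suggested above: accept a solution
    vector only if its largest entry stays below a plausibility
    threshold.  Only faults with a large multiplicator mu are caught;
    small perturbations pass undetected and are left to the iterative
    refinement to correct."""
    return max(abs(v) for v in u) <= threshold
```

A solution flagged by this check can then be treated like a hard fault, i.e., discarded and handled by the fault tolerant combination technique.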

6.1.2. Multiple Silent Faults

While the occurrence of only a single error may be reasonable if the rate at which errors occur is low, multiple errors can happen. They are introduced using the random method (Section 2.6.3). The impact of multiple silent faults on the algorithm is probed for the two-dimensional problem using the combination technique of level 4 without a shift. In this case, convergence can be achieved independently of the error rate if the multiplicator µ is −1 or at most 4 (Figure 6.4). At most one error is added per grid.

If the error-size value µ is above 4, convergence can only be guaranteed if the


Figure 6.5.: Sample error development using the multi-grid solver and the standard combination technique with different error rates. Silent faults of size µ = 10^2 (left) and µ = 10^5 (right) are introduced (d2l4s0).

Figure 6.6.: Mean iterations to convergence using the multi-grid solver and the standard combination technique with µ = 10. The errors are either exponentially or equally distributed. For the calculation of the mean and the standard deviation, 50 samples have been used for each error rate (d2l4s0).

error rate is restricted. For µ = 10^2 this means error rates of up to 10% are allowed, while for µ = 10^5 the error rate has to be less than 5%.

Influence of the Distribution Type of the Errors

To check whether the distribution type of the errors has an effect, the two were compared. Figure 6.6 shows the mean iteration count to convergence for both exponentially and uniformly distributed errors. Similar results are achieved for both types. One can conclude that the effect of a silent fault is almost independent of the grid it is introduced on.


Figure 6.7.: Error development using the adaptive damped Jacobi solver and the standard combination technique. A single silent fault of size µ = −1 (left) and µ = 10 (right) has been introduced (d2l4s0).

6.2. Jacobi

The adaptive damped (ω = 2/3) Jacobi method with an inner iteration count of 2 is used in this section. After investigating the convergence properties for a single silent fault, the convergence properties with multiple silent faults are discussed. As in the previous section, these silent faults are distributed according to the exponential distribution.

6.2.1. Single Silent Fault

In this section the error-proneness of the Jacobi method under the introduction of one silent fault is probed. The fault is introduced via the deterministic scheme (Section 2.6.2). For small error sizes the Jacobi method is less prone to errors than the multi-grid solver.

Figure 6.7 depicts the error development after the introduction of just one silent fault. Values of µ below 10 do not influence the convergence in any visible way; even if the error is introduced after the first iteration, the needed total work stays the same. The

Figure 6.8.: Error development using the adaptive damped Jacobi solver and the standard combination technique. A single silent fault of size µ = 1000 (left) and µ = 10^5 (right) has been introduced (d2l4s0).


Figure 6.9.: Sample error development using the adaptive damped Jacobi method and the standard combination technique. Silent faults of size µ = −1 (left) and µ = 10^−5 (right) have been introduced (d2l4s0).

The reason for the lower error proneness lies in the smaller corrections made in each step; the resulting error is thus smaller.

Upon introduction of larger errors (Figure 6.8), deviations from the fault-free behavior can be observed, and the time to convergence can increase significantly. The exact point of introduction does not influence the result at all: if the multiplicator stays the same and the error is introduced in the same grid and at the same grid point, the total work needed until convergence does not change. As for the Ruge-Stüben solver, convergence is always reached, no matter the error size.
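The single-fault experiment can be sketched compactly. The following is a hypothetical reconstruction (not the thesis code) of the deterministic fault scheme for a 1D Poisson model problem: one grid value is multiplied by µ at a fixed iteration of a damped Jacobi solve. All names and defaults are illustrative.

```python
import numpy as np

def jacobi_with_fault(n=31, mu=1000.0, fault_iter=4, omega=0.8,
                      tol=1e-7, max_iter=50000):
    """Damped Jacobi for -u'' = f on (0,1) with u(0) = u(1) = 0.
    At iteration `fault_iter`, one grid value is multiplied by `mu`
    (a deterministic silent fault). Returns the iteration count
    needed to reach `tol` against the discrete solution."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)
    f = np.sin(np.pi * x)
    # f is an eigenvector of the discrete Laplacian, so the discrete
    # solution is available in closed form for measuring the error.
    lam = 4.0 / h**2 * np.sin(np.pi * h / 2.0)**2
    exact = f / lam
    u = np.zeros(n + 2)                       # boundary values stay zero
    for it in range(1, max_iter + 1):
        # true Jacobi sweep: the right-hand side is fully evaluated
        # before the assignment writes back into u
        u[1:-1] = ((1.0 - omega) * u[1:-1]
                   + omega * 0.5 * (u[:-2] + u[2:] + h**2 * f))
        if it == fault_iter:
            u[1 + n // 2] *= mu               # inject the silent fault
        if np.max(np.abs(u[1:-1] - exact)) < tol:
            return it
    return max_iter
```

Since the damped Jacobi iteration contracts every error mode, the injected error is damped out geometrically and convergence is reached for any fault size, in line with the observation above; only the iteration count changes.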

6.2.2. Multiple Silent Faults

The introduction of multiple faults changes the susceptibility of the algorithm to soft faults. While a single soft fault of small size (µ < 10) or a sign flip barely influences the result, multiple faults of these sizes have a visible effect (Figure 6.9).

Similar to the multi-grid solver, convergence can be achieved if the error size is small enough or if the errors occur infrequently enough. As for the multi-grid solver, convergence is achieved independent of the error rate if µ ≤ 4 or µ = −1 (Figure 6.10). For errors bigger than this, certain bounds on the error frequency have to be fulfilled if convergence is to be reached.

For an error size of µ = 8, error rates of around 40% are tolerated, while the error rate is restricted to below 3% for µ = 100 (Figure 6.11). The tolerable error rate decreases further as the error size increases: for an error size of µ = 1 × 10^5, an error rate of 0.5% already leads to divergence (Figure A.3).

6.3. Comparison

Both iterative and exact methods for solving Step 4 of the algorithm (Figure 2.4) converge for bit flips that occur in the mantissa or the sign, as long as only one silent fault happens within each grid. Only bit flips in the exponent can make the algorithm diverge, and only if the exponent is increased; a decreased exponent still leads to


[Plot: error to combi solution vs. total iteration count; curves for µ = −1, 1e−5, 2, 4, 8 and the fault-free run.]

Figure 6.10.: Sample error development using the adaptive damped Jacobi method and the standard combination technique for different multiplicators µ. One error is introduced in every grid at every iteration (d2l4s0).

[Two plots: error to combi solution vs. total iteration count; curves for ε = 1.0, 0.5, 0.4, 0.3, 0.1 (left) and ε = 0.15, 0.1, 0.05, 0.03, 0.01 (right), plus the fault-free run.]

Figure 6.11.: Sample error development using the adaptive damped Jacobi method and the standard combination technique. Silent faults of size µ = 8 (left) and µ = 1 × 10^2 (right) have been introduced (d2l4s0).

convergence. This behavior holds true even for larger-sized problems (see Figures A.4 and A.5), even though for larger problems more errors are bound to occur, since more grids are involved. These measurements, however, only assume an error rate per combination solution, and only single errors are allowed to happen. These two assumptions will not hold for large problems, since the computation time for each solution increases and thus the error rate rises.

For both the Jacobi method and the multi-grid method, similar results have been obtained: for the two-dimensional combination technique of level 4, the multiplicator µ may be as large as 4 if one error occurs on every grid. When using an iterative method, the correction term shrinks with every iteration, so single errors produce a lower error in the overall solution than with an exact solver. Even so, the maximal allowed error rates for large errors are smaller when using an iterative method. Big errors only allow for small error rates, but they are easy to detect: from a certain error size on, the errors should be detected and the affected grids considered failed, thus turning them into hard faults. These hard faults can then be handled using the fault tolerant combination technique.


Part III.

Conclusion and Outlook


7. Conclusion

By combining iterative refinement with the fault tolerant combination technique, a method to tackle bit flips and hardware faults has been introduced. Additionally, the iterative refinement method allows for more exact solutions than the pure combination technique or computations on sparse grids. The handling of detected faults (hard faults) works far better than the handling of undetected faults (soft faults). For hard faults, the fault tolerant combination technique is able to maintain a correction in the right direction, and only a small amount of extra work is needed. If many hard faults occur, the convergence speed can be impacted significantly, since all important solutions are lost; yet even if only one solution is computed correctly, convergence can still be achieved (single solution technique). Such high error rates are, however, unlikely, and if that problem arises, one should question the hardware components.

Even without the fault tolerant combination technique, convergence using the standard combination technique is possible. This convergence, however, depends on the error rate: for high error rates, convergence can no longer be achieved. Especially for exponentially distributed errors, the critical error rate decreases significantly, and it drops further with increasing problem size. The fault tolerant combination technique is superior to the standard combination technique and guarantees convergence.

The fault tolerant combination technique requires the errors to be detected; soft, undetected faults cannot be handled by it, and convergence can then no longer be guaranteed independent of the size and occurrence frequency of the errors. However, convergence is almost guaranteed if the error is small. This is the case when the bit flip that causes the error occurs in either the mantissa or the sign of the floating point number; only errors in the exponent have a significant effect. This means that only 11 bits of the 64-bit representation and 8 bits of the 32-bit representation (IEEE 754 [1]) are significant. Of these, only about half of the possible flips actually increase the number, so only about 9% of all possible bit flips are of importance (12.5% for the 32-bit representation).
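This bit-level claim is easy to probe directly on the binary64 format. The sketch below is illustrative (not part of the thesis framework); it flips a single bit of a double via its integer representation:

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the IEEE 754 binary64 representation of x.
    Bits 0-51 are the mantissa, 52-62 the exponent, 63 the sign."""
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    (y,) = struct.unpack('<d', struct.pack('<Q', bits ^ (1 << bit)))
    return y

x = 1.5
print(flip_bit(x, 0))    # mantissa flip: ~1.5, essentially harmless
print(flip_bit(x, 63))   # sign flip: -1.5
print(flip_bit(x, 52))   # exponent flip: 0.75, magnitude changes by a power of two
```

Whether an exponent flip increases or decreases the number depends on whether the flipped exponent bit was 0 or 1, which is the origin of the "about half" estimate above.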

Bit flips can have a gigantic impact. It is therefore advisable to check for errors using certain criteria, for example a simple check for continuity. Such checks can, however, only be performed if specific properties of the solution are known. If such an error is detected, the affected solution can be marked, and the fault tolerant combination technique can be used to ignore that solution.
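One possible continuity check, as a sketch with hypothetical names and an arbitrary threshold: flag a 2D grid solution if some value deviates from the average of its four neighbours by far more than is typical for the grid.

```python
import numpy as np

def looks_corrupted(u, threshold=50.0):
    """Flag u (a 2D array) if the largest deviation of an interior point
    from its four-neighbour average exceeds `threshold` times the
    median deviation (a robust scale for a smooth solution)."""
    neighbour_avg = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                            + u[1:-1, :-2] + u[1:-1, 2:])
    dev = np.abs(u[1:-1, 1:-1] - neighbour_avg)
    scale = np.median(dev) + 1e-300           # guard against a zero median
    return bool(np.max(dev) > threshold * scale)
```

A smooth field passes such a check, while the same field with one value scaled by a large multiplicator is flagged; the flagged grid can then be excluded via the fault tolerant combination technique.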

The methods used to handle errors can only do so at certain points in the algorithm. The fault tolerant combination technique only handles errors that either occur on one of the slave processes generating the partial solutions, which are later combined, or in the communication of these solutions. Errors occurring in other steps cannot be handled. Such errors are, however, less likely to arise, since less computation time is spent in these stages: the total computation time is dominated by solving the systems of equations on the child nodes.


Occurrences of silent faults that can be handled algorithmically are restricted further. If they occur in critical algorithm steps, convergence towards the sparse grid solution can no longer be guaranteed. The allowed positions of the silent faults do, however, cover the most compute-intensive parts of the algorithm. For a detailed description of where they are allowed, please refer to Section 2.6.2.

All in all, the introduced algorithm is very well suited for handling hard faults and appropriate for handling the most frequent silent faults, while keeping a high level of parallelism.


8. Outlook

In addition to the work presented, a more efficient implementation of the algorithm would be desirable, both to allow for bigger problem sizes and to measure the actual computation time. This would make it possible to compare the convergence speeds of the different methods. With a more efficient application, real-world problems could be simulated, e.g. gyrokinetic plasma simulations could be performed. Applying the algorithm to other use cases would allow comparing its convergence speed to that of other algorithms and drawing conclusions about the performance of the introduced method. With an increased system size and proper parallelization, real errors could occur and would not have to be simulated. However, various kinds of errors would then arise, and the position of the errors could no longer be controlled. Especially for silent faults, this could destroy the convergence properties of the algorithm.

One way to avoid this problem is to not store an actual full grid solution, but a sparse grid solution instead. Using the sparse grid representation of the overall solution would allow silent faults to occur almost everywhere, since errors in the solution would always remain representable in the sparse grid. However, other problems would appear: the calculation of the residual, as well as the restriction and interpolation steps, can become cumbersome to implement or very compute-intensive.

One additional problem, currently being worked on, is the detection of silent errors (e.g. [3]). With it, silent errors could be detected, and partial solutions could be marked as erroneous and avoided through the fault tolerant combination technique. One simple way to handle large silent faults would be to recompute the residuals on the combination grids and ignore those with high residuals in the combination.
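The suggested residual test could be sketched as follows; this is an illustrative sketch with hypothetical names and an arbitrary outlier factor, not an implementation from the thesis. The residual norm of every partial solution is recomputed, and clear outliers are excluded from the combination.

```python
import numpy as np

def select_trustworthy(residual_norms, factor=10.0):
    """Return a boolean mask over the combination grids: True where the
    recomputed residual norm is within `factor` times the median norm,
    False for suspected silent-fault victims."""
    r = np.asarray(residual_norms, dtype=float)
    return r <= factor * np.median(r)
```

The surviving grids would then be combined as usual, while the flagged ones are treated as hard faults by the fault tolerant combination technique.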

In addition to the fault tolerant combination technique used here, the opticom method [16] could be implemented and compared to it.


Appendix


A. Additional Graphs


[Plot: error vs. total iteration count; curves for initial error frequencies [8,1], [4,2], [4,1], [2,2].]

Figure A.1.: Error development for different initial error frequencies using the combined algorithm and the multi-grid solver (d2l4s0). Addition to Figure 4.7.

[Plot: error to combi solution vs. iteration count; curves for error rates 0.0–0.5, with the convergence and divergence criteria.]

Figure A.2.: Error development for exponentially distributed errors using the combined algorithm and the multi-grid solver. The standard combination technique is used. In comparison to a combination technique of lower level (Figure 5.1), the critical error rate, above which no convergence can be achieved, is higher (d2l5s0). Addition to Chapter 5.


[Plot: error to combi solution vs. total iteration count; curves for ε = 0.01, 0.005, 0.001 and the fault-free run.]

Figure A.3.: Sample error development using the adaptive damped Jacobi method and the standard combination technique. Silent faults of size µ = 1 × 10^5 have been introduced for different error rates ε (d2l4s0). Addition to Figure 6.11.

[Plot: error to combi solution vs. total iteration count; curves for µ = −1, 2, 4, 8, 16 and the error-free run.]

Figure A.4.: Sample error development using the Ruge-Stüben solver and the standard combination technique (d2l6s0). Silent faults of different size µ have been introduced on every grid. As for smaller-sized problems (see Section 6.1.2), the critical error size lies at around µ = 8; no significant change can be observed. Addition to Section 6.1.2.


[Plot: error to combi solution vs. total iteration count; curves for ε = 0.25, 0.2, 0.15, 0.1.]

Figure A.5.: Sample error development using the Ruge-Stüben solver and the standard combination technique (d2l6s0). Silent faults of size µ = 100 have been introduced with different exponential error rates. In comparison to smaller-sized problems (see Section 6.1.2), the critical error rate is higher. Addition to Section 6.1.2.


Bibliography

[1] IEEE standard for binary floating-point arithmetic. Institute of Electrical and Electronics Engineers, New York, 1985. IEEE Standard 754-1985.

[2] W. N. Bell, L. N. Olson, and J. B. Schroder. PyAMG: Algebraic multigrid solvers inPython v2.0, 2011. Release 2.0.

[3] Austin R. Benson, Sven Schmit, and Robert Schreiber. Silent error detection in numerical time-stepping schemes. CoRR, abs/1312.2674, 2013.

[4] Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, and Jack Dongarra. A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI. In Christos Kaklamanis, Theodore Papatheodorou, and Paul G. Spirakis, editors, Euro-Par 2012 Parallel Processing, volume 7484 of Lecture Notes in Computer Science, pages 477–488. Springer Berlin Heidelberg, 2012.

[5] W. Briggs, V. Henson, and S. McCormick. A Multigrid Tutorial, Second Edition. Societyfor Industrial and Applied Mathematics, second edition, 2000.

[6] Hans-Joachim Bungartz and Michael Griebel. Sparse grids. Acta Numerica, 13:1–123,2004.

[7] Alfredo Buttari, Jack Dongarra, Julie Langou, Julien Langou, Piotr Luszczek, andJakub Kurzak. Mixed precision iterative refinement techniques for the solution ofdense linear systems. International Journal of High Performance Computing Applications,21(4):457–466, 2007.

[8] James Demmel, Yozo Hida, E. Jason Riedy, and Xiaoye S. Li. Extra-precise iterative refinement for overdetermined least squares problems. ACM Transactions on Mathematical Software (TOMS), 35(4):28, 2009.

[9] James Elliott, Frank Mueller, Miroslav Stoyanov, and Clayton Webster. Quantifyingthe impact of single bit flips on floating point arithmetic. preprint, 2013.

[10] J. Garcke. Sparse grids in a nutshell. In J. Garcke and M. Griebel, editors, Sparse gridsand applications, volume 88 of Lecture Notes in Computational Science and Engineering,pages 57–80. Springer, 2013. extended version with python code http://garcke.ins.uni-bonn.de/research/pub/sparse_grids_nutshell_code.pdf.

[11] Jochen Garcke. Sparse grids in a nutshell. In Jochen Garcke and Michael Griebel, editors, Sparse Grids and Applications, volume 88 of Lecture Notes in Computational Science and Engineering, pages 57–80. Springer Berlin Heidelberg, 2013.


[12] Richard J. Gonsalves. 3.1 Poisson's equation and relaxation methods. http://www.physics.buffalo.edu/phy410-505/2011/topic3/app1/index.html, Fall 2011.

[13] M. Griebel, M. Schneider, and C. Zenger. A combination technique for the solution of sparse grid problems. In P. de Groen and R. Beauwens, editors, Iterative Methods in Linear Algebra, pages 263–281. IMACS, Elsevier, North Holland, 1992. Also as SFB Bericht 342/19/90 A, Institut für Informatik, TU München, 1990.

[14] B. Harding, M. Hegland, J. Larson, and J. Southern. Scalable and Fault Tolerant Computation with the Sparse Grid Combination Technique. ArXiv e-prints, April 2014.

[15] Lin He. Applications and generalizations of the iterative refinement method. PhD thesis,University of California Los Angeles, 2006.

[16] Markus Hegland, Jochen Garcke, and Vivien Challis. The combination technique and some generalisations. Linear Algebra and its Applications, 420(2–3):249–275, 2007.

[17] N. Higham. 12. Iterative Refinement, chapter 12, pages 231–243. Society for Industrialand Applied Mathematics, 2002.

[18] Richard M. Karp. Reducibility among combinatorial problems. In Raymond E. Miller, James W. Thatcher, and Jean D. Bohlinger, editors, Complexity of Computer Computations, The IBM Research Symposia Series, pages 85–103. Springer US, 1972.

[19] N.G. Leveson, S.S. Cha, J.C. Knight, and T.J. Shimeall. The use of self checks andvoting in software error detection: an empirical study. Software Engineering, IEEETransactions on, 16(4):432–443, Apr 1990.

[20] Franklin T. Luk and Haesun Park. An analysis of algorithm-based fault tolerancetechniques. Journal of Parallel and Distributed Computing, 5(2):172 – 184, 1988.

[21] Alfredo Parra Hinojosa, Christoph Kowitz, Mario Heene, Dirk Pflüger, and Hans-Joachim Bungartz. Towards a fault-tolerant, scalable implementation of GENE. In Proceedings of ICCE 2014, Lecture Notes in Computational Science and Engineering. Springer-Verlag, 2015. Accepted.

[22] B. Schroeder and G.A. Gibson. A large-scale study of failures in high-performancecomputing systems. Dependable and Secure Computing, IEEE Transactions on, 7(4):337–350, Oct 2010.

[23] Bianca Schroeder and Garth A Gibson. Understanding failures in petascale comput-ers. Journal of Physics: Conference Series, 78(1):012022, 2007.

[24] Bianca Schroeder and Garth A Gibson. Understanding failures in petascale comput-ers. Journal of Physics: Conference Series, 78(1):012022, 2007.

[25] S A Smolyak. Quadrature and interpolation formulas for tensor products of certainclasses of functions. Dokl. Akad. Nauk SSSR, (148), 1963.


[26] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Notes on Applied Science, No. 32. HMSO, London, UK, 1963. Also published by Prentice-Hall, Englewood Cliffs, NJ, USA, 1964; translated into Polish as Bledy Zaokraglen w Procesach Algebraicznych by PWW, Warsaw, Poland, 1967, and into German as Rundungsfehler by Springer-Verlag, Berlin, Germany, 1969. Reprinted by Dover Publications, New York, 1994.
