
Transcript of Parareal Acceleration of Matrix Multiplication

Page 1: Parareal Acceleration of Matrix Multiplication

Parareal Acceleration of Matrix Multiplication

Toshiya Takami and Akira Nishida

Kyushu University, Japan

Aug. 30 - Sep. 2, 2011 ParCo2011, Ghent, Belgium

1

Contents

Introduction: Time-domain Decomposition

What is Parareal?

Parareal-in-Time Algorithm as a Perturbation

Application: Series by Matrix-Vector Multiplications

Convergence Property

Speed-up Ratio and Efficiency

Discussion: Applicability to Other Linear Calculations

Conclusion

2

Time Evolution

Time-evolution problems are widely solved in scientific simulations described by discretized differential equations. Parallel techniques are usually applied through domain decomposition in the space direction, where quantities on the surface of each domain must be shared with neighboring domains. On the other hand, efficient parallelism by time-domain decomposition seems difficult because of the severe dependency on the previous state.

3

Time-domain Parallelism

Time-evolution is usually defined by strictly dependent relations, which are difficult to parallelize. "Parareal-in-Time" is one of the time-domain methods that can be used in spite of such strict dependency.

[Figure: Domain Decomposition in the Time Direction. States x_0, x_1, …, x_k at times t = 0, 1, …, k are advanced step by step by the propagators F_k: x_{k+1} = F_k(x_k).]

J.-L. Lions, Y. Maday, and G. Turinici, C. R. Acad. Sci., Sér. I: Math. 332, 661–668 (2001).

4

Page 2: Parareal Acceleration of

What is “Parareal”?

“Its main goal concerns real time problems, hence the proposed terminology of ‘parareal’ algorithm.”
J.-L. Lions, Y. Maday, and G. Turinici, C. R. Acad. Sci., Sér. I: Math. 332, 661–668 (2001).

“Parareal is not the first algorithm to propose the solution of evolution problems in a time-parallel fashion”:
Multiple Shooting Methods with Newton’s Iterations
Space-Time Multigrid Methods
M. J. Gander and S. Vandewalle, SIAM J. Sci. Comput. 29, 556–578 (2007).

5

Parareal-in-Time for Scientific Applications

Since the first proposal by J.-L. Lions et al., various time-evolution problems have been analyzed:

PDE with a fluid-structure coupling: C. Farhat and M. Chandesris, Int. J. Numer. Meth. Engng. 58, 1397–1434 (2003).
Molecular Dynamics with Quantum Interaction: L. Baffico, S. Bernard, Y. Maday, G. Turinici, and G. Zérah, Phys. Rev. E 66, 057701 (2002).
Quantum Control Problem: Y. Maday and G. Turinici, Int. J. Quant. Chem. 93, 223–228 (2003).
Symplectic Integrator: G. Bal and Q. Wu, LNCSE 60, 401–408 (Springer, 2008).

6

Sequence Defined by Parareal-in-Time

Instead of the sequence {x_k} defined by x_{k+1} = F_k(x_k), consider an approximated one {x_k^(r)} defined by the iterative relation

x_{k+1}^(r+1) = G_k(x_k^(r+1)) + F_k(x_k^(r)) − G_k(x_k^(r)),

where G_k(·) is a coarse evolution that approximates the original one F_k(·), and r is the order of the approximation. The exact operators F_k(·) can be calculated in parallel. In the limit r → ∞, it has been shown that the approximate sequence converges to the original one, {x_k^(r)} → {x_k}, which is called the Parareal-in-Time method.

J.-L. Lions, Y. Maday, and G. Turinici, C. R. Acad. Sci., Sér. I: Math. 332, 661–668 (2001).

7
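For linear operators, the iteration above is easy to prototype. The sketch below is my own illustration, not the authors' code; the matrix sizes and the choice of the coarse operator G as the diagonal part of A are made-up. A coarse sweep produces the initial guess, then each parareal correction reuses the fine products A @ X[k], which are independent across k and hence parallelizable.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, eps = 32, 20, 0.01

G = np.diag(rng.uniform(0.5, 1.0, N))         # coarse operator: the diagonal part
F = rng.standard_normal((N, N)) / np.sqrt(N)  # perturbation
A = G + eps * F                               # fine operator (here constant in k)
x0 = rng.standard_normal(N)

# exact sequence x_{k+1} = A x_k
exact = [x0]
for _ in range(K):
    exact.append(A @ exact[-1])

def parareal(r_max):
    """Coarse sweep (r = 0) followed by r_max parareal corrections."""
    X = [x0]
    for _ in range(K):
        X.append(G @ X[-1])
    for _ in range(r_max):
        # the fine products A @ X[k] are independent in k -> parallelizable
        Xn = [x0]
        for k in range(K):
            Xn.append(G @ Xn[-1] + A @ X[k] - G @ X[k])
        X = Xn
    return X

def err(X):
    return np.linalg.norm(X[-1] - exact[-1])

print(err(parareal(1)), err(parareal(3)), err(parareal(K)))
```

The error decreases with r, and after r = K corrections the approximate sequence reproduces the exact one, as the convergence statement above guarantees.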

Parareal-in-Time as a Perturbation (1)

While convergence of the Parareal-in-Time method has already been shown, as for the multiple-shooting method, a simple explanation is given here using a linear problem defined by bounded operators: consider a sequence of vectors {x_k} defined by

x_{k+1} = F_k x_k = [G_k + (F_k − G_k)] x_k,

where we assume G_k and F_k − G_k are bounded. Convergence of the sequence above can then be explained through a perturbation expansion and the spectral radii of the operators.

8

Page 3: Parareal Acceleration of

Parareal-in-Time as a Perturbation (2)

For linear problems, the operator F_j is separated into a coarse operator G_j and its correction F_j − G_j:

x_k = [G_{k−1} + (F_{k−1} − G_{k−1})] x_{k−1} = ∏_{j=0}^{k−1} [G_j + (F_j − G_j)] x_0.

If we introduce {x_k^(r)} as the r-th order approximation of {x_k}:

x_{k+1}^(r+1) = [G_k + (F_k − G_k)] x_k^(r+1)
             = G_k x_k^(r+1) + (F_k − G_k) x_k^(r+1)
             ≈ G_k x_k^(r+1) + (F_k − G_k) x_k^(r).

Thus, the Parareal-in-Time calculations for linear problems can be analyzed as a higher-order perturbation.

9

How to Calculate the Parareal Sequence

The parallel procedure of this calculation is: the corrections F_k − G_k are parallel in k, but the coarse propagations G_k are sequential. The k-direction is processed in parallel first; the r-direction (the order of perturbation) is sequential.

[Figure: schedule of coarse (G) and correction (F − G) applications over the time steps x_0, …, x_k for orders r = 0, 1, 2, 3; each order-r sweep reuses the corrections computed from the order-(r−1) sequence.]

10

Series Defined by Matrix-Vector Multiplication

A linear time-evolution defined by matrix multiplications,

x_k = (G + εF) x_{k−1} = (G + εF)^k x_0,

can be represented as an expanded sum in orders of ε. The third-order approximation is

x_k^(3) = [G^k + ε Σ(terms with one F) + ε² Σ(terms with two F's) + ε³ Σ(terms with three F's)] x_0.

11
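The claimed order in ε can be checked numerically: after r corrections the residual against the exact sequence should scale as ε^(r+1), so doubling ε should grow it by about 2^(r+1). A minimal sketch with made-up matrices, not the slide's N = 1024 setup:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, r = 16, 10, 2
G = np.diag(rng.uniform(0.5, 1.0, N))
F = rng.standard_normal((N, N)) / np.sqrt(N)
x0 = rng.standard_normal(N)

def residual(eps):
    """|x_K^(r) - x_K| for x_{k+1} = (G + eps F) x_k after r corrections."""
    A = G + eps * F
    xK = np.linalg.matrix_power(A, K) @ x0
    X = [x0]
    for _ in range(K):                 # r = 0: coarse sweep
        X.append(G @ X[-1])
    for _ in range(r):                 # parareal corrections
        Xn = [x0]
        for k in range(K):
            Xn.append(G @ Xn[-1] + (A - G) @ X[k])
        X = Xn
    return np.linalg.norm(X[-1] - xK)

ratio = residual(0.02) / residual(0.01)
print(ratio)   # near 2**(r+1) = 8, i.e. the residual is O(eps**(r+1))
```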

Residual Errors from Spectral Radius

Errors can be estimated from the remaining expanded terms:

Σ_{j=r+1}^{k} ε^j Σ(terms with j F's) x_0.

The error from the exact sequence is bounded by

|x_k^(r) − x_k| / (ρ(G)^k |x_0|) ≤ Σ_{j=r+1}^{k} (k choose j) [ε ρ(F)/ρ(G)]^j.

By Stirling's formula, the magnitude of the r-th term is

(k choose r) [ε ρ(F)/ρ(G)]^r ≈ √(k / (2π(k−r)r)) (ek/r)^r [ε ρ(F)/ρ(G)]^r.

12
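As a sanity check on the Stirling estimate, which is valid for r ≪ k, one can compare the exact binomial coefficient with the approximation; the numbers k = 200, r = 10 below are arbitrary. The approximation overshoots by roughly exp(r²/2k), so it is an order-of-magnitude guide rather than a tight value:

```python
import math

k, r = 200, 10                        # arbitrary choice with r << k
exact = math.comb(k, r)
approx = math.sqrt(k / (2 * math.pi * (k - r) * r)) * (math.e * k / r) ** r
ratio = approx / exact
print(ratio)                          # ~exp(r*r/(2*k)) ~ 1.28: same order of magnitude
```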

Page 4: Parareal Acceleration of Matrix Multiplication

Example 1: Real-Symmetric Matrix

Consider a sequence {x_k} of real vectors defined by a real-symmetric matrix G + εF, where G is diagonal and F is a random symmetric matrix with elements satisfying

⟨|F_ii|²⟩ = 1/(2N),  ⟨|F_ij|²⟩ = 1/(4N),

so that the eigenvalues of F lie in [−1, 1]. With N = 1024 and ε = 0.01, we can set ρ(F) ≈ ρ(G) ≈ 1 in this case.

[Figure (a): normalized error vs. length of sequence (0–200) for r = 5, 10, 15, on a log scale from 10⁻¹⁹ to 10⁻⁵; dashed lines show the error estimated from the (r+1)-th term.]

13

Example 2: Unitary Matrix

Unitary time-evolution is often used in quantum mechanics:

x_k = exp[(Δt/iℏ) H] x_{k−1}, approximated as x_k ≈ (I − iεH)^k x_0,

where H is a Hermitian matrix satisfying ⟨|H_ii|²⟩ = ⟨|H_ij|²⟩ = 1/(4N), so that the eigenvalues of H lie in [−1, 1]. With N = 1024 and ε = 0.01, we can set ρ(H) ≈ 1 in this case.

[Figure (b): normalized error vs. length of sequence (0–1000) for r = 10, 20, 30, 40, 50, on a log scale from 10⁻¹³ to 10⁻⁵; dashed lines show the error estimated from the (r+1)-th term.]

14
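The unitary example can be reproduced in miniature. This is a sketch under assumptions: N and K are much smaller than on the slide, and the coarse operator G is taken as the diagonal part of I − iεH, which the slide does not specify:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, eps = 64, 100, 0.01

# Hermitian H with <|H_ii|^2> = <|H_ij|^2> = 1/(4N); eigenvalues lie in ~[-1, 1]
B = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(16 * N)
H = B + B.conj().T
A = np.eye(N) - 1j * eps * H               # fine step I - i*eps*H
G = np.diag(np.diag(A))                    # coarse step: diagonal part (assumption)
x0 = rng.standard_normal(N).astype(complex)
x0 /= np.linalg.norm(x0)

xK = np.linalg.matrix_power(A, K) @ x0     # exact endpoint of the sequence

def parareal_err(r_max):
    """Endpoint error after r_max parareal corrections."""
    X = [x0]
    for _ in range(K):                     # r = 0: coarse sweep only
        X.append(G @ X[-1])
    for _ in range(r_max):
        Xn = [x0]
        for k in range(K):
            Xn.append(G @ Xn[-1] + (A - G) @ X[k])
        X = Xn
    return np.linalg.norm(X[-1] - xK)

print(parareal_err(0), parareal_err(10))   # error drops sharply with the order r
```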

Implementation by MPI

We can implement the Parareal-in-Time iteration on parallel computers by using MPI_Send/MPI_Recv. The original number of calculations, K(Tg + Tf), is compared with the parareal one, K·Tg + Tc(P − 1) + rK(Tg + Tf)/P, using P resources.

[Figure: pipelined schedule over four processes; each process performs a chain of coarse steps G followed by fine corrections F, iterated r times.]

Computational time: K(Tg + Tf) (sequential) vs. K·Tg + Tc(P − 1) + rK(Tg + Tf)/P (parareal).

E. Aubanel, Parallel Computing 37, 172–182 (2011).

15

Estimation of Speed-up Ratio

The speed-up ratio is represented by the function

S(r, K, P) = K(Tg + Tf) / [K·Tg + Tc(P − 1) + rK(Tf + Tg)/P]
           = P / [r + T·P + P(P − 1)t/K],

where T ≡ Tg/(Tg + Tf) and t ≡ Tc/(Tg + Tf).

Properties of the speed-up for K ≫ P ≫ 1: the maximum efficiency, 1/r, is attained for small P; the maximum speed-up approaches the limit 1/T as P grows.

[Figure: speed-up ratio vs. total number of ranks P, showing the limit of speed-up 1/T in the regime K ≫ P ≫ 1 and saturation when K ∼ P.]

16
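The cost model is a one-liner to evaluate; the values of T and t below are illustrative, not the measured ones:

```python
def speedup(r, K, P, T=0.01, t=0.001):
    """S(r, K, P) = P / (r + T*P + P*(P-1)*t/K), the cost model above."""
    return P / (r + T * P + P * (P - 1) * t / K)

# efficiency S/P is bounded by 1/r; the speed-up itself approaches 1/T when K >> P
for P in (16, 128, 1024):
    print(P, speedup(10, 1536, P))
```

With r = 10 and T = 0.01, no choice of P can exceed efficiency 1/10 or speed-up 1/T = 100, which is the behavior sketched in the figure.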

Page 5: Parareal Acceleration of Matrix Multiplication

Parallel Performance on a Cluster Machine

The performance of the Parareal method is studied on a cluster with 768 cores (Westmere, 12 cores × 64 nodes), increasing the order r of the parareal iterations (512-process execution). For the large matrix (N = 8192) the elapsed time is almost linear in r; for the small matrix (N = 1024) it is not.

[Figure (a): elapsed time (sec, log scale 1–1000) vs. parareal iterations r (0–50); horizontal lines mark the normal (non-parallel) method for each matrix size.]

17

Speed-up Ratio

The dashed line represents

S(r, K, P) = P / [r + T·P + P(P − 1)t/K].

The actual speed-up is almost half of the value of S(r, K, P), and the efficiency is very low, bounded by 1/r. For N = 8192 the speed-up is linear; for N = 1024 it saturates.

[Figure (b): speed-up ratio vs. total number of ranks (log-log, 1–1000) for r = 10, 20, 40 and N = 1024, 8192, with the linear-speed-up line for reference.]

[Excerpt from the accompanying IPSJ SIG Technical Report, translated from Japanese:]

Figure 6: Speed comparison of the Parareal-in-Time method and the normal calculation: (a) elapsed time of the r-th order calculation executed with 512 processes; (b) speed-up ratio of the parallel calculation. Black marks denote the small-matrix problem (1024 × 1024), white marks the larger one (8192 × 8192). Squares, circles, and triangles correspond to r = 10, 20, 40, respectively.

Table 1: Speed-up ratios of the parallel calculation

P                 16     32     64     128    256    512    768
N = 1024  r = 10  0.71   0.88   1.27   1.61   1.62   1.86   1.73
          r = 20  0.36   0.61   0.84   1.16   1.38   1.55   1.49
          r = 40  0.19   0.35   0.56   0.80   0.89   1.25   1.31
N = 8192  r = 10  0.69   1.35   2.70   5.41   10.68  21.38  30.97
          r = 20  0.36   0.71   1.42   2.83   5.56   11.20  16.63
          r = 40  0.19   0.37   0.73   1.44   2.80   5.82   8.52

The time ratio T can be estimated by comparing the computational costs of G and F, but the actual execution time is sometimes hard to predict because of various other factors. To decide whether this method is effective or not, the real computational performance, which depends on the data size and so on, must be taken into account correctly.

4.2 Speed-up ratio of the parallel calculation. The elapsed time of matrix operations parallelized by the Parareal-in-Time method was measured on a parallel cluster with 768 cores (12 cores × 64 nodes). The results are shown in Fig. 6; the target is the unitary matrix modeling quantum time-evolution (Sec. 3.2), measured for a complex vector sequence of length K = 1536. Figure 6(a) plots the computation time of a 512-process execution against the order of the Parareal-in-Time method; the horizontal lines show the time of the conventional, non-parallelized method. For N = 8192, the computation time grows from 14.7 s at order r = 5 to 130.1 s at r = 50, almost linearly in r, although this is hard to see in the figure. Since the conventional calculation on one CPU takes 602.5 s, the speed-up ranges from 41.0 down to 4.6. For the small matrix (N = 1024), on the other hand, the speed-up ratio is only about 2.0. Figure 6(b) shows the speed-up ratios for orders r = 10, 20, 40, obtained with flat-MPI distributed parallelization from 16 to 768 processes. For N = 8192, the dotted lines show the value of Eq. (21) with t = 0, i.e. neglecting communication time; the deviation from the measured values is large. Table 1 lists the corresponding speed-up values. Although the speed-up is low compared with the number of parallel ranks, it grows linearly for N = 8192: a fixed-order Parareal-in-Time calculation can be executed with fairly high parallel efficiency for matrices above a certain size. For N = 1024 no linear growth is seen, presumably because the fraction of communication time, corresponding to t in Eq. (21), is not negligible in that case.

4.3 Range of applicability of the Parareal-in-Time method. From the measurements and analysis above, the matrix sizes and time-series lengths to which the Parareal-in-Time method applies efficiently can be roughly classified as in Fig. 7. Ordinary linear libraries are parallelized over multiple threads or processes by exploiting the independence of operations on matrix elements. Since diagonalization of a matrix costs about O(N³) operations, very long time series defined by small matrix operations should be solved through the eigenstate representation. For large matrices the O(N³) cost is too high, so time-direction parallelization by the Parareal-in-Time method should be combined with the usual element-direction parallelization.

Figure 7: Target problem sizes of the parareal-in-time and usual schemes (quadrant labels: Large Matrix / Small Matrix, Long Time Series / Short Time Series; Multi-thread Linear Library, Distributed-parallel Linear Library, Eigenvalue Problem, Parareal-in-Time Acceleration).

18

Applicability to General Iterative Methods

What types of problems can be accelerated by the Parareal? Since the usual row/column parallelism is already used effectively in linear libraries, this algorithm should be applied to:

Long time series: K ≫ P ≫ 1
Large matrices: cost(G) ≪ cost(F)

How about other algorithms?

19

Parareal SOR (preliminary result 1)

In order to solve a linear equation [D + ε(L + U)] x = b, the SOR (successive over-relaxation) method is used:

x_{k+1} = (1 − w) x_k + w (D + εU)^{−1} [b − εL x_k].

The parareal-accelerated versions are:

Model 1: G is defined as an identity operator:

x_{k+1}^(r+1) = x_k^(r+1) + w { (D + εU)^{−1} [b − εL x_k^(r)] − x_k^(r) }.

Model 2: G is defined by ε = 0:

x_{k+1}^(r+1) = (1 − w) x_k^(r+1) + w D^{−1} b + w { (D + εU)^{−1} [b − εL x_k^(r)] − D^{−1} b }
             = (1 − w) x_k^(r+1) + w (D + εU)^{−1} [b − εL x_k^(r)].

20
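Model 1 can be sketched as follows, on a toy problem of my own (the matrix, sizes, and w = 0.2 are made-up, w matching the slide's graphs). By construction, the parareal sequence reproduces the serial SOR sequence exactly once the number of correction sweeps reaches the sequence length:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, eps, w = 30, 40, 0.1, 0.2

D = np.diag(rng.uniform(1.0, 2.0, N))
M = rng.standard_normal((N, N))
L, U = np.tril(M, -1), np.triu(M, 1)   # strictly lower / upper parts
b = rng.standard_normal(N)

def sor_step(x):
    """One damped SOR sweep: x <- (1-w) x + w (D + eps U)^(-1) (b - eps L x)."""
    return (1 - w) * x + w * np.linalg.solve(D + eps * U, b - eps * (L @ x))

# serial reference sequence x_0, ..., x_K
xs = [np.zeros(N)]
for _ in range(K):
    xs.append(sor_step(xs[-1]))

def parareal_sor(r_max):
    """Model 1 (coarse operator G = identity): r_max correction sweeps."""
    X = [np.zeros(N)] * (K + 1)        # r = 0 guess: G^k x_0 = x_0 = 0
    for _ in range(r_max):
        Xn = [np.zeros(N)]
        for k in range(K):
            Xn.append(Xn[-1] + sor_step(X[k]) - X[k])
        X = Xn
    return X

def err(X):
    return np.linalg.norm(X[-1] - xs[-1])

print(err(parareal_sor(5)), err(parareal_sor(K)))   # exact once r_max = K
```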

Page 6: Parareal Acceleration of Matrix Multiplication

Parareal SOR (preliminary result 2)

The parareal iteration converges to the exact sequence only for w < 1, while w is usually chosen in 1 < w < 2 (graphs below: w = 0.2). Is it applicable only to cases of slow convergence?

[Figure: difference vs. iterations on a log scale (10⁻⁸ to 10⁰) for Model 1 (r = 10–50, up to 100 iterations) and Model 2 (r = 10–40, up to 200 iterations).]

21

Conclusion

The Parareal-in-Time method for linear evolution problems was analyzed as a perturbation expansion. The remaining errors were obtained through the spectral radii of the coarse and fine transformations. A simple linear transformation (matrix multiplication) was accelerated by the Parareal-in-Time algorithm: its convergence properties were analyzed, and the measured speed-ups were compared with the theoretical estimate.

22

Further Studies

How to include this algorithm in parallel linear libraries: interfaces, implementations, tuning, ...; hybrid parallel implementations combining the parareal-in-time and the usual row/column parallelisms.

How widely can this algorithm be applied to linear calculations, such as iterative solvers of linear equations (SOR, CG, etc.)?

Application of this method on massively parallel computers such as the K computer, Kobe, Japan.

23

Thank you for your attention.

24