Lecture 11: Parallel Processing of Irregular Computations & Load Balancing


Shantanu Dutt, ECE Dept., UIC

Discrete Event Simulation—Basics with VHDL Descriptions as an Example.

VHDL Dataflow Description of a Circuit:

Library IEEE;
use IEEE.STD_LOGIC_1164.all;

entity ckt1 is
  port(s1, s2 : in bit; Z : out bit);
end entity ckt1;

architecture data_flow of ckt1 is
  signal sbar1, sbar2, x, y : bit;
begin
  sbar1 <= not s1 after 2 ns;
  sbar2 <= not s2 after 2 ns;
  x     <= s1 and sbar2 after 4 ns;
  y     <= s2 and sbar1 after 4 ns;
  Z     <= x or y after 4 ns;
end architecture data_flow;

Discrete Event Simulation—Basics

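The core idea of DES: signal changes are events on a time-ordered event queue; processing an event re-evaluates the affected gates and schedules their output changes after the specified delays. The following Python sketch (my own illustration, not from the slides) simulates the circuit in the VHDL description above in exactly this event-driven manner.

import heapq

# Minimal event-driven (discrete event) simulator for the circuit above:
# sbar1 = not s1 (2 ns), sbar2 = not s2 (2 ns), x = s1 and sbar2 (4 ns),
# y = s2 and sbar1 (4 ns), Z = x or y (4 ns).

GATES = {  # output signal -> (delay_ns, evaluation function, input signals)
    "sbar1": (2, lambda v: int(not v["s1"]), ["s1"]),
    "sbar2": (2, lambda v: int(not v["s2"]), ["s2"]),
    "x":     (4, lambda v: v["s1"] & v["sbar2"], ["s1", "sbar2"]),
    "y":     (4, lambda v: v["s2"] & v["sbar1"], ["s2", "sbar1"]),
    "Z":     (4, lambda v: v["x"] | v["y"], ["x", "y"]),
}

FANOUT = {}                       # signal -> gate outputs driven by it
for out, (_, _, ins) in GATES.items():
    for sig in ins:
        FANOUT.setdefault(sig, []).append(out)

def simulate(stimuli, t_end=30):
    """stimuli: list of (time_ns, signal, value) external events."""
    # steady-state values for s1 = s2 = 0
    val = {"s1": 0, "s2": 0, "sbar1": 1, "sbar2": 1, "x": 0, "y": 0, "Z": 0}
    eq = list(stimuli)            # event queue ordered by simulation time
    heapq.heapify(eq)
    while eq and eq[0][0] <= t_end:
        t, sig, v = heapq.heappop(eq)
        if val[sig] == v:
            continue              # no value change -> no new events scheduled
        val[sig] = v
        print(f"{t:3d} ns: {sig} <= {v}")
        for g in FANOUT.get(sig, []):          # re-evaluate fanout gates
            delay, fn, _ = GATES[g]
            heapq.heappush(eq, (t + delay, g, fn(val)))

simulate([(0, "s1", 1), (10, "s2", 1), (20, "s1", 0)])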

Parallel DES for Logic Simulation

Correctness Issues in Parallel DES

• What happens if inter-processor messages are received out of simulation-time order, either from the same processor or from different processors? In other words, if a msg. w/ simulation time ti is received before a msg. w/ simulation time tj, where ti > tj, then what happens? The sim. time ti and tj msgs. could be coming from the same or from different processors.

• If a proc. “blindly” processes all msgs. as they arrive, this can lead to an incorrect simulation. E.g., in the above example, the sim. time tj msg. can cause an output that affects the input to the process for the sim. time ti msg. So if the earlier-arriving sim. time ti msg. is processed before the later-arriving sim. time tj msg., the former’s simulation output will likely be incorrect.

Correctness Issues in Parallel DES: Solutions

• For each msg. sent from processor Pk targeting a (simulation) process Qr (which is, say, in processor Pq), Pk records the sim. time tq of the latest such msg. When sending the next msg. targeting Qr, Pk also includes that prev. sim. time along with the current one tj.

• So the msg. data looks like Mj = (input value, tj [curr. sim. time], tq [prev. sim. time]).

• The next msg. is Mi = (input value, ti, tj).

• The receiving proc. Pq also records the sim. time of the last msg. received for each input of Qr. If a new msg. meant for that input of Qr carries a prev. sim. time equal to the one Pq has recorded, then that msg. is correct in terms of timing order. Otherwise, Pq will store the msg. but wait for a previous msg. of correct timing order that it has not yet recvd.

• So if, for input A (say) of Qr, the recorded time of the prev. simulation is tq, and msg. Mi = (value, ti, tj) is recvd., it will not be processed. Only after msg. Mj = (value, tj, tq) is recvd. and processed will Mi be processed (since the latest recorded sim. time for i/p A of Qr is then tj).

• With regard to msgs. from multiple processors, Pq will not perform any simulation until it has recvd. timing-correct msgs. (e.g., Mj above) from all procs. supposed to send it msgs. This issue underscores the importance of null msgs., without which the simulation will not proceed further in this approach.
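A rough Python sketch (my own, not from the slides) of the receiver-side bookkeeping just described: each msg. carries (value, curr. sim. time, prev. sim. time), and a msg. for a given input is processed only when its prev. time matches the last sim. time recorded for that input; otherwise it is buffered until the missing predecessor arrives.

from dataclasses import dataclass

@dataclass
class Msg:
    value: int
    curr_time: int   # t_j: sim. time of this msg.
    prev_time: int   # t_q: sim. time of the previous msg. sent to this same input

class InputChannel:
    def __init__(self):
        self.last_time = 0     # sim. time of the last msg. processed on this input
        self.pending = []      # out-of-order msgs. waiting for their predecessors

    def receive(self, msg, process):
        """process(value, time) is the simulation callback for an in-order msg."""
        self.pending.append(msg)
        progress = True
        while progress:        # drain every msg. whose predecessor has now been processed
            progress = False
            for m in list(self.pending):
                if m.prev_time == self.last_time:
                    process(m.value, m.curr_time)
                    self.last_time = m.curr_time
                    self.pending.remove(m)
                    progress = True

# Example: M_i = (1, t_i=20, t_j=10) arrives before M_j = (0, t_j=10, t_q=0).
ch = InputChannel()
log = lambda v, t: print(f"processed value {v} at sim. time {t}")
ch.receive(Msg(1, 20, 10), log)   # buffered: predecessor (time 10) not yet processed
ch.receive(Msg(0, 10, 0), log)    # processes M_j (time 10), then the buffered M_i (time 20)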

Some examples of applications requiring DES

Search Techniques

[Figure: an example graph on vertices A–G, shown once labeled “Graph” and once labeled “BFS” with the BFS visit order 1–7 marked on the vertices]

dfs(v)  /* for basic graph visit, or for soln finding when nodes are partial or full solns */
  v.mark = 1;
  for each (v,u) in E
    if (u.mark != 1) then dfs(u)

Algorithm Depth_First_Search_Soln
  for each v in V
    v.mark = 0;
  if G has partial soln nodes then
    for each v in V
      if v.mark = 0 then dfs(v);
    end for;
  else
    soln_dfs(root);  /* root is a particular node in V from where we can start the solution search */

soln_dfs(v)
/* used when nodes are basic elts of the problem and not partial soln nodes, and a soln. is a path */
  v.mark = 1;
  if path to v is a soln then return(1);
  for each (v,u) in E
    if (u.mark != 1) then
      soln_found = soln_dfs(u);
      if (soln_found = 1) then return(soln_found);
  end for;
  v.mark = 0;  /* can visit v again to form another soln on a different path */
  return(0)

[Figure: DFS (black arcs) and Soln_DFS (black+red arcs) on the example graph A–G, with visit order 1–10; soln found (A, B, E, C, F) that meets some criterion]

Search Techniques—Exhaustive DFS

optimal_soln_dfs(v)
/* used when nodes are basic elts of the problem and not partial soln nodes, and a soln. is a path */
begin
  v.mark = 1;
  if path to v is a soln then
  begin
    if cost < best_cost then
    begin
      best_soln = soln;
      best_cost = cost;
    end
    v.mark = 0;
    return;
  end
  for each (v,u) in E
    if (u.mark != 1) then
    begin
      cost = cost + edge_cost(v,u);  /* cost is a global var. */
      optimal_soln_dfs(u);
      cost = cost - edge_cost(v,u);  /* restore the path cost on backtrack */
    end
  end for;
  v.mark = 0;  /* can visit v again to form another soln on a different path */
end

Algorithm Depth_First_Search_Opt_Soln
  for each v in V
    v.mark = 0;
  best_cost = infinity;
  cost = 0;
  optimal_soln_dfs(root);
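A runnable Python rendering (my own, on a small made-up graph) of the exhaustive optimal-solution DFS above. Here a “soln” is, for illustration, any path from the root that visits every vertex once; the path cost is restored automatically on backtrack because it is passed by value.

import math

E = {   # small undirected example graph: vertex -> {neighbor: edge_cost}
    "A": {"B": 2, "C": 4},
    "B": {"A": 2, "C": 1, "D": 7},
    "C": {"A": 4, "B": 1, "D": 3},
    "D": {"B": 7, "C": 3},
}

best_cost = math.inf
best_soln = None

def optimal_soln_dfs(v, path, cost):
    global best_cost, best_soln
    path.append(v)                      # v.mark = 1 (v is on the current path)
    if len(path) == len(E):             # "path to v is a soln"
        if cost < best_cost:
            best_cost, best_soln = cost, list(path)
    else:
        for u, w in E[v].items():
            if u not in path:           # u.mark != 1
                optimal_soln_dfs(u, path, cost + w)
    path.pop()                          # v.mark = 0: v may appear on other paths

optimal_soln_dfs("A", [], 0)            # Depth_First_Search_Opt_Soln with root = "A"
print(best_soln, best_cost)             # -> ['A', 'B', 'C', 'D'] 6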

[Figure: DFS (black arcs), Soln_DFS (black+red arcs), and Optimal_Soln_DFS (black+red+green arcs) on the example graph A–G; visit order 1–10, then i > 10, i+1, …, i+4; soln found (A, B, E, C, F); best soln. so far (A, C, E, D, F, G)]

Best-First Search

BeFS(root)
begin
  open = {root};  /* open is the list of generated but not yet expanded nodes—partial solns */
  best_soln_cost = infinity;
  while open != nullset do
  begin
    curr = first(open);
    if curr is a soln then
      return(curr);  /* curr is an optimal soln */
    else
      children = Expand_&_est_cost(curr);
      /* generate all children of curr & estimate their costs---cost(u) should be a
         lower bound on the cost of the best soln reachable from u */
    for each child in children do
    begin
      if child is a soln then
        delete all nodes w in open s.t. cost(w) >= cost(child);
      endif
      store child in open in increasing order of cost;
    endfor
  endwhile
end /* BeFS */

Expand_&_est_cost(Y)
begin
  children = nullset;
  for each basic elt x of the problem “reachable” from Y do
  begin
    if x not in Y and feasible then
      child = Y U {x};
      path_cost(child) = path_cost(Y) + cost(Y, x);  /* cost(Y, x) is the cost of reaching x from Y */
      est(child) = lower bound on the cost of the best soln reachable from child;
      cost(child) = path_cost(child) + est(child);
      children = children U {child};
  endfor
end /* Expand_&_est_cost(Y) */

Y = partial soln. = a path from the root to the current “node” (a basic elt. of the problem, e.g., a city in TSP, or a vertex in V0 or V1 in min-cut partitioning). We go from each such “node” u to the next one w that is “reachable” from u in the problem “graph” (which is part of what you have to formulate).
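A generic best-first search skeleton in Python (my own sketch) mirroring the BeFS pseudocode above: “open” is a priority queue of partial solns keyed by cost = path_cost + lower-bound estimate, and the problem-specific pieces (is_soln, expand, est) are supplied by the caller. The “delete all nodes w with cost(w) >= cost(child)” pruning step is omitted for brevity; it only trims open and does not affect correctness.

import heapq, itertools

def befs(root, is_soln, expand, est):
    tie = itertools.count()                          # tie-breaker so the heap never compares nodes
    open_list = [(est(root), next(tie), root, 0)]    # (cost, tie, partial soln Y, path_cost)
    while open_list:
        cost, _, curr, path_cost = heapq.heappop(open_list)
        if is_soln(curr):
            return curr, cost                        # head of open is a soln => optimal (est is a LB)
        for child, step_cost in expand(curr):        # Expand_&_est_cost(curr)
            child_path_cost = path_cost + step_cost
            child_cost = child_path_cost + est(child)
            heapq.heappush(open_list, (child_cost, next(tie), child, child_path_cost))
    return None, float("inf")

# Tiny example: min-cost path from "r" to "g" in a weighted graph, with est = 0
# (so this degenerates to uniform-cost search).
G = {"r": [("a", 2), ("b", 5)], "a": [("g", 9)], "b": [("g", 1)], "g": []}
soln, cost = befs(("r",),
                  is_soln=lambda p: p[-1] == "g",
                  expand=lambda p: [(p + (v,), w) for v, w in G[p[-1]]],
                  est=lambda p: 0)
print(soln, cost)    # -> ('r', 'b', 'g') 6

With est a true lower bound, the first soln popped from open is optimal, which is exactly the property argued on the next slide.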

[Figure: BeFS tree grown from the root; a partial soln Y is expanded into node u and further children with costs 10, 12, 15, 16, 17, 18, 18, 19; expansion order marked (1), (2), (3)]

Best-First Search

Proof of optimality when cost is a LB
• The current set of nodes in “open” represents a complete front of generated nodes, i.e., all remaining un-generated nodes in the search space are descendants of nodes in “open”.
• If the first node curr in “open” is a soln, then cost(curr) <= cost(w) for each w in “open”.
• The cost of any solution node in the search space that is not in “open” and not yet generated is >= the cost of its ancestor in “open”, and thus >= cost(curr). Hence curr is the optimal (min-cost) soln.
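A compact restatement of the two inequalities (notation mine): curr = first(open) is a soln, and a(s) denotes the ancestor in “open” of any not-yet-generated soln s (cost() on a partial soln is a LB on every soln reachable from it).

\[
  \forall\, w \in \text{open}:\quad \mathrm{cost}(curr) \le \mathrm{cost}(w),
\]
\[
  \forall\, \text{ungenerated soln } s:\quad
  \mathrm{cost}(s) \;\ge\; \mathrm{cost}\big(a(s)\big) \;\ge\; \mathrm{cost}(curr)
  \;\Rightarrow\; curr \text{ is a min-cost soln.}
\]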

[Figure: the same BeFS tree as above, with Y = a partial soln. on the open front]

Search techs for a TSP example

[Figure: TSP graph on cities A–F (with edge weights shown); exhaustive search using DFS (w/ backtrack) for finding an optimal solution. The search-tree leaves are the solution nodes, with tour costs 27, 31, 33; a pruned branch is marked x]

Search techs for a TSP example (contd)

[Figure: BeFS for finding an optimal TSP solution; tree nodes are labeled with path_cost + LB estimate (5+15, 8+16, 11+14, 14+9, 20, 21+6, 22+9, 23+8, …), pruned/infeasible branches are marked X, and the optimal tour of cost 27 is reached]

• Lower-bound cost estimate: MST({unvisited cities} U {current city} U {start city})
• This is a LB because the structure used (a spanning tree) generalizes the reqd soln structure: the set of spanning trees is a superset of the set of reqd soln structures (Hamiltonian paths/cycles)
• min(metric M’s values over a set S) <= min(M’s values over a subset S’ of S)
• Similarly for max??
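A rough sketch (my own) of this MST-based lower bound, using Prim’s algorithm over {unvisited cities} U {current city} U {start city}. The edge weights below are illustrative rather than the exact weights of the slides’ figure (which give MST cost 16 and path cost 8 for the partial tour (A, E, F)).

import heapq

W = {  # symmetric edge weights, W[u][v] = cost(u, v)
    "A": {"B": 2, "C": 9, "E": 5, "F": 5},
    "B": {"A": 2, "C": 1, "D": 7},
    "C": {"A": 9, "B": 1, "D": 3, "F": 8},
    "D": {"B": 7, "C": 3, "E": 4},
    "E": {"A": 5, "D": 4, "F": 5},
    "F": {"A": 5, "C": 8, "E": 5},
}

def mst_cost(nodes):
    """Prim's algorithm restricted to the subgraph induced on `nodes`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    in_tree, total = {start}, 0
    frontier = [(w, v) for v, w in W[start].items() if v in nodes]
    heapq.heapify(frontier)
    while frontier and in_tree != nodes:
        w, v = heapq.heappop(frontier)
        if v in in_tree:
            continue
        in_tree.add(v)
        total += w
        for u, wu in W[v].items():
            if u in nodes and u not in in_tree:
                heapq.heappush(frontier, (wu, u))
    return total

def tsp_lower_bound(partial_tour, all_cities, start="A"):
    """path_cost(partial_tour) + MST({unvisited} U {current city} U {start city})."""
    path_cost = sum(W[a][b] for a, b in zip(partial_tour, partial_tour[1:]))
    unvisited = set(all_cities) - set(partial_tour)
    return path_cost + mst_cost(unvisited | {partial_tour[-1], start})

print(tsp_lower_bound(("A", "E", "F"), W.keys()))   # path cost 10 + MST cost 11 = 21 here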

[Figure: the TSP graph on A–F with its edge weights; MST for node (A, E, F) = MST{F, A, B, C, D}, cost = 16; path cost for (A, E, F) = 8. Venn diagram: S = set of all spanning trees of a graph G; S’ ⊂ S = set of all Hamiltonian paths (each visiting a node exactly once) in G]

BFS for 0/1 ILP Solution

• X = {x1, …, xm} are 0/1 vars
• Choose vars xi = 0/1 as the next nodes (branching decisions) in some order (random or heuristic based)

[Figure: B&B tree. Root (no vars expanded); branches x2=0 (solve LP w/ x2=0, cost = cost(LP) = C1) and x2=1 (solve LP w/ x2=1, cost = C2); under x2=1, branches x4=0 (cost = C3) and x4=1 (cost = C4); under x2=1, x4=1, branches x5=0 (cost = C5) and x5=1 (cost = C6); the (x2=1, x4=1, x5=0) node is marked as the optimal soln.
Cost relations: C5 < C3 < C1 < C6; C2 < C1; C4 < C3]

(Stop when a generated child is a soln. node whose cost is at most (1+alpha)*cost(best(open)); alpha is a given sub-optimality fraction.)
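A best-first B&B sketch for a 0/1 ILP along the lines of the tree above (my own illustration). The helper solve_lp_relaxation(fixed) is a hypothetical stand-in for an LP solver: it returns (lp_cost, fractional soln) for the LP relaxation with the variables in `fixed` pinned to 0/1 (lp_cost is a lower bound), or (inf, None) if infeasible. The toy problem at the end solves its LP relaxation greedily so the example runs without an LP library, and the (1+alpha) sub-optimality stopping rule from the note above is included.

import heapq, itertools, math

def branch_and_bound(num_vars, solve_lp_relaxation, alpha=0.0):
    tie = itertools.count()                              # tie-breaker for the heap
    root_cost, _ = solve_lp_relaxation({})
    open_list = [(root_cost, next(tie), {})]             # (LP cost = LB, tie, fixed vars)
    best_cost, best_fixed = math.inf, None
    while open_list:
        cost, _, fixed = heapq.heappop(open_list)
        if cost >= best_cost:                            # pruned by the incumbent
            continue
        if len(fixed) == num_vars:                       # all vars fixed: integral soln
            best_cost, best_fixed = cost, fixed
            # (1+alpha) stopping rule; alpha = 0 returns an optimal soln
            if not open_list or best_cost <= (1 + alpha) * open_list[0][0]:
                break
            continue
        x = next(i for i in range(num_vars) if i not in fixed)   # next var to branch on
        for val in (0, 1):
            child = {**fixed, x: val}
            child_cost, _ = solve_lp_relaxation(child)
            if child_cost < best_cost:
                heapq.heappush(open_list, (child_cost, next(tie), child))
    return best_fixed, best_cost

# Toy problem: minimize 3*x0 + 1*x1 + 2*x2 subject to x0 + x1 + x2 >= 2, x binary.
# Its LP relaxation with some vars fixed can be solved greedily: cover the remaining
# ">= 2" requirement with the cheapest free variables, fractionally if needed.
c = [3.0, 1.0, 2.0]

def toy_lp(fixed):
    need = 2.0 - sum(fixed.values())
    cost, x = sum(c[i] * v for i, v in fixed.items()), dict(fixed)
    for ci, i in sorted((c[i], i) for i in range(3) if i not in fixed):
        take = min(1.0, max(0.0, need))
        x[i], need = take, need - take
        cost += ci * take
    return (cost, x) if need <= 1e-9 else (math.inf, None)

print(branch_and_bound(3, toy_lp))    # -> ({0: 0, 1: 1, 2: 1}, 3.0)

A real implementation would back solve_lp_relaxation with an LP library (e.g., scipy.optimize.linprog) and could also stop branching early whenever an LP relaxation happens to come out integral.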


• For Sp(P) > 1, we need:
  n*texp / ((n/P)*(texp + (P-1)*tacc)) > 1
  ⇒ texp > texp/P + (P-1)*tacc
  ⇒ texp*(P-1)/P > (P-1)*tacc
  ⇒ P < texp/tacc

• For constant efficiency, this is even worse:
  E(P) = Sp(P)/P = T(1)/(Tp(P)*P) = n*texp / (P*(n/P)*(texp + (P-1)*tacc)) = n*texp / (n*texp + n*((P-1)/P)*tacc) = const. C <= 1
  ⇒ 1 + ((P-1)/P)*tacc/texp = 1/C
  ⇒ ((P-1)/P)*tacc/texp = 1/C - 1
  Differentiating both sides wrt P to minimize the expr. (i.e., maximize C), we get (tacc/texp)/P² = 0, which cannot occur for any P.


• Nodes w/ cost >= the cost of the current best global soln. so far are discarded. Note that this can sometimes lead to idling, and at other times non-essential work can be done before such deletion of nodes takes place. Both are overheads of parallel B&B.

• A local best soln. @ head of local open is global opt. if all other processors have terminated by then (their termn. msg. may be in transit in some cases)

Load Balancing

[Figure: generic load balancing among processors. Legend: load info exchange (LIE); load/work transfer]

• Generic Load Balance protocol
‒ Periodic LIEs between subsets of processors (generally neighbors or small extended neighborhoods, e.g., processors at distance k apart for small k)
‒ Followed by work transfers as indicated by the LIE and the work-transfer policy

• Issues to be determined in a LB technique (generally application and parallel-system dependent):
‒ Frequency of LIE
‒ Definition of load
‒ Load-difference threshold, or in general some relative-load condition/criterion, to trigger a work transfer
‒ Donor- or receiver-initiated load/work transfer?
‒ How much and which work to transfer?
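A minimal, self-contained sketch (my own) of one round of such a protocol on a ring of processors: each processor performs a LIE with its two neighbors and, donor-initiated, ships work whenever the load difference exceeds a threshold. “Load” here is simply the local work-queue length.

from collections import deque

P = 6                                  # number of processors (ring topology)
THRESHOLD = 4                          # load-difference threshold triggering a transfer
queues = [deque(range(n)) for n in (12, 1, 7, 0, 9, 3)]   # local work queues

def lie(i):
    """Load info exchange: processor i learns its ring neighbors' loads."""
    left, right = (i - 1) % P, (i + 1) % P
    return {left: len(queues[left]), right: len(queues[right])}

def balance_round():
    for i in range(P):                 # donor-initiated: each proc. checks its neighbors
        info = lie(i)
        for j, load_j in info.items():
            diff = len(queues[i]) - load_j
            if diff > THRESHOLD:       # transfer half the difference to the lighter neighbor
                for _ in range(diff // 2):
                    queues[j].append(queues[i].pop())

print("before:", [len(q) for q in queues])
balance_round()
print("after: ", [len(q) for q in queues])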

Without a numerical load computation, based on rank (a la the AC method): minimizes non-essential work but significantly increases idling due to a large taccess/texp.

Quality Equalizing (QE) Load Balancing Techniques

• Various techniques developed by my former Ph.D. student Prof. Nihar Mahapatra (MSU) and myself over a few years. The refs are:

• N.R. Mahapatra and S. Dutt, ``An efficient delay-optimal distributed termination detection algorithm'', Jour. Parallel and Distr. Computing, Oct. 2007, pp. 1047-1066.

• N.R. Mahapatra and S. Dutt, ``Adaptive Quality Equalizing: High-Performance Load Balancing for Parallel Branch-and-Bound Across Applications and Computing Systems'', Proc. Joint IEEE Parallel Processing Symposium / Symp. on Parallel and Distr. Processing, April 1998.

• N.R. Mahapatra and S. Dutt, ``Random Seeking: A General, Efficient, and Informed Randomized Scheme for Dynamic Load Balancing'', Proc. Tenth IEEE Parallel Processing Symposium, April 1996, pp. 881-885.

• N.R. Mahapatra and S. Dutt, ``New anticipatory load balancing strategies for scalable parallel best-first search'', American Mathematical Society's DIMACS Series on Discrete Mathematics and Theoretical Computer Science, Vol. 22, 1995, pp. 197-232.

• S. Dutt and N.R. Mahapatra, ``Scalable load-balancing strategies for parallel A* algorithms'', Special Issue on Scalability of Parallel Algorithms and Architectures, Journal of Parallel and Distr. Computing, Vol. 22, No. 3, Sept. 1994, pp. 488-505.

• S. Dutt and N.R. Mahapatra, ``Parallel A* algorithms and their performance on hypercube multiprocessors'', Proc. Seventh IEEE Parallel Processing Symposium, 1993, pp. 797-803.

• The donor processor grants very few nodes to the acceptor (e.g., alternating 2-3 nodes starting from the local rank-2 node; see the sketch after the following bullets)

• For high-latency, low-bw platforms like NOWs (n/w of workstations and Beowulf clusters like Argo):

– set s higher (s should be inversely proportional to bw, otherwise n/w saturation can occur)

– decrease the frequency of load info exchange (LIE)

(Alternating-rank nodes in the merged open list are granted for s > 1.) We will see a worst-case analysis later in this regard.
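One plausible reading of the alternating-rank grant policy above, as a Python sketch (my own): from the donor’s cost-sorted open list, up to s nodes are granted at alternating ranks starting from the local rank-2 node, so donor and acceptor end up with interleaved, comparable-quality work. The “merged open list” variant for s > 1 would interleave the donor’s and acceptor’s lists first; that refinement is omitted here.

def grant_nodes(donor_open, s):
    """donor_open: list of (cost, node) sorted by increasing cost (rank 1 first).
    Grant up to s nodes at alternating ranks 2, 4, 6, ... (0-based indices 1, 3, 5, ...)."""
    granted_idx = set(range(1, len(donor_open), 2)[:s])
    granted = [donor_open[i] for i in sorted(granted_idx)]
    kept = [donor_open[i] for i in range(len(donor_open)) if i not in granted_idx]
    return granted, kept

donor_open = [(5, "n1"), (7, "n2"), (8, "n3"), (9, "n4"), (12, "n5"), (15, "n6")]
granted, kept = grant_nodes(donor_open, s=2)
print("granted to acceptor:", granted)   # [(7, 'n2'), (9, 'n4')]
print("kept by donor:      ", kept)      # [(5, 'n1'), (8, 'n3'), (12, 'n5'), (15, 'n6')]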

[E = T(1)/(P*Tp(P)) = W(N)/Wp(P) = W(N)/(W(N) + Wo(N, P)), where N is the problem size]

Scalability Analysis
• Derivation of QE's isoefficiency upper bound of Θ(PDd)

Worst-case assumption (for the worst-case rank difference): each proc. is the worst in its neighborhood, and its neighbor on this path is the best in its own neighborhood

Best node rank wrt Pi,1's = Θ(sd)
Best node rank wrt Pi,2's = Θ(sd), and wrt Pi,1's = Θ(2sd)
…
Best node rank wrt Pi,D-1's = Θ(sd), and wrt Pi,1's = Θ((D-1)sd)

[Figure: chain of processors Pi,1, Pi,2, Pi,3, …, Pi,D-1, Pi,D; a Θ((D-1)sd) rank gap between the proc. w/ the worst node and the proc. w/ the best/opt. node (the opt. soln. in the worst case for isoeff.), w/ only one or a few best procs. holding essential nodes; opt. cost marked]

Scalability Analysis (cont'd)
• Derivation of QE's isoefficiency upper bound of Θ(PDd)

• Taking into account the fact that there are d other such paths of “neighbors” of the 1st path, the rank difference along the d such paths of length about D is also Θ(Dsd) (the Θ(sd) rank gap between neighboring processors on a path encompasses the rank difference w/ the other (d-1) neighbors, one each in the “neighboring” (d-1) paths of length about D) = Θ(Dd) (s = const.).

• After Θ(Dsd) = Θ(Dd) iterations, the proc. w/ the best node produces the optimal solution. In this time, Θ((Dd)²/2) non-essential (NE) work gets done in a group of d neighborhood paths of length about D. This happens across Θ(P/(Dd)) such path groups ⇒ total NE work across P procs. = Θ((P/(Dd))*(Dd)²) = Θ(PDd) NE work or idling.


Θ((3tc + 3ts/2)/texp), to be precise
Θ((2tc + ts)/texp), to be precise
(texp is a constant wrt the arch.)

Rationale: more global load balancing (smaller global rank difference betw. the best and worst qualitatively loaded processors) w/o high commun. overhead.


(costk(1) > costi(3))