Near Optimal Work-Stealing Tree for Highly Irregular Data-Parallel Workloads

Post on 24-Feb-2016

Transcript of Near Optimal Work-Stealing Tree for Highly Irregular Data-Parallel Workloads

1

Near Optimal Work-Stealing Tree for Highly Irregular Data-Parallel Workloads

Aleksandar Prokopec, Martin Odersky

2

Irregular Data-Parallel

3

Uniform workload

(0 until 10000000).reduce(_ + _)

4

Uniform workload

(0 until 10000000).reduce(_ + _)

sum = sum + x

5

Uniform workload

(0 until 10000000).reduce(_ + _)

sum = sum + x

N

cycles

6

Baseline workload

for (i <- 0 until 10000000) {}

N

cycles

7

Irregular workload

8

Irregular workload

N

cycles

9

Irregular workload

for {
  x <- 0 until width
  y <- 0 until height
} image(x, y) = compute(x, y)

N

cycles

10

Irregular workload

for {
  x <- 0 until width
  y <- 0 until height
} image(x, y) = compute(x, y)

N

cycles

11

Workload function

workload(n) – work spent on element n after the data-parallel operation completed

12

Workload function

Could be…

Runtime value dependent

for {
  x <- 0 until width
  y <- 0 until height
} img(x, y) = compute(x, y)

workload(n) – work spent on element n after the data-parallel operation completed

13

Workload function

Could be…

Execution-schedule dependent

for (n <- nodes) n.neighbours += new Node

workload(n) – work spent on element n after the data-parallel operation completed

14

Workload function

Could be…

Totally random

for ((x, y) <- img.indices) img(x, y) = sample(x + random(), y + random())

workload(n) – work spent on element n after the data-parallel operation completed

15

Data-parallel scheduler

Assign loop elements to workers without knowledge about the workload function.

16

Data-parallel scheduler

1. Linear speedup for the baseline workload

Assign loop elements to workers without knowledge about the workload function.

17

Data-parallel scheduler

1. Linear speedup for the baseline workload
2. Optimal speedup for irregular workloads

Assign loop elements to workers without knowledge about the workload function.

18

Static batching

Decides on the worker-element assignment before the data-parallel operation begins.

N

cycles

19

Static batching

Decides on the worker-element assignment before the data-parallel operation begins.

No knowledge → divide uniformly.

Not optimal for even mildly irregular workloads.

N

cycles
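
The weakness of a uniform split can be illustrated with a small sketch. The names `StaticBatching`, `batches`, `workPerWorker` and `skewed` are hypothetical and not from the talk; the workload (the last 10% of elements cost 100 cycles, the rest 1 cycle) is an assumed example of an irregular workload function.

```scala
object StaticBatching {
  /** Splits 0 until n into p contiguous, nearly equal ranges (start inclusive, end exclusive). */
  def batches(n: Int, p: Int): Seq[(Int, Int)] =
    (0 until p).map(i => ((i.toLong * n / p).toInt, ((i + 1).toLong * n / p).toInt))

  /** Total work each worker receives under a given workload function. */
  def workPerWorker(n: Int, p: Int, workload: Int => Long): Seq[Long] =
    batches(n, p).map { case (s, e) => (s until e).map(workload).sum }

  /** Assumed irregular workload: the last 10% of elements cost 100 cycles, the rest 1. */
  def skewed(n: Int): Int => Long =
    i => if (i >= n - n / 10) 100L else 1L
}
```

With `n = 1000` and four workers, `workPerWorker(1000, 4, skewed(1000))` assigns 250 cycles to each of the first three workers and 10150 cycles to the last one, so the operation finishes only slightly faster than a sequential run.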

20

Fixed-size batching

Workload-driven – decides during execution.

N

cycles

progress

21

Fixed-size batching

Workload-driven – decides during execution.

N

cycles

0

22

Fixed-size batching

Workload-driven – decides during execution.

N

cycles

2 T0: CAS

T0

23

Fixed-size batching

Workload-driven – decides during execution.

N

cycles

4 T1: CAS

T0 T1

24

Fixed-size batching

Workload-driven – decides during execution.

N

cycles

6 T0: CAS

T0 T1

25

Fixed-size batching

Workload-driven – decides during execution.

N

cycles

8 T0: CAS

T0 T1

26

Fixed-size batching

Workload-driven – decides during execution.

N

cycles

10 T0: CAS

T0 T1

27

Fixed-size batching

Workload-driven – decides during execution.

N

cycles

12 T0: CAS

T0 T1

28

Fixed-size batching

Workload-driven – decides during execution.

N

cycles

progress

Pros: lightweight
Cons: minimum batch size, contention
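
A minimal sketch of the CAS-driven loop shown on these slides, assuming a single shared progress counter from which every worker claims `batchSize` elements at a time. The object and parameter names are hypothetical; the talk's actual implementation is not shown in this transcript.

```scala
import java.util.concurrent.atomic.AtomicInteger

object FixedSizeBatching {
  /** Runs body(i) for every i in 0 until n, using `workers` threads that claim
    * batches of `batchSize` elements from one shared counter via CAS. */
  def parallelFor(n: Int, workers: Int, batchSize: Int)(body: Int => Unit): Unit = {
    val progress = new AtomicInteger(0) // index of the next unclaimed element
    val threads = (0 until workers).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          var done = false
          while (!done) {
            val p = progress.get
            if (p >= n) done = true
            else {
              val next = math.min(p + batchSize, n)
              // Only one worker wins this CAS for a given p; the losers re-read and retry.
              if (progress.compareAndSet(p, next)) {
                var i = p
                while (i < next) { body(i); i += 1 }
              }
            }
          }
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
```

Every worker hits the same counter once per batch, which is where the contention comes from: a smaller `batchSize` balances irregular work better but increases the number of CAS operations on the shared location.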

29

Fixed-size batching - contention

30

Factoring, GSS, TS

Batch size varies.

N

cycles

progress

Pros: lightweight
Cons: contention
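
As an illustration of a varying batch size, here is a sketch of guided self-scheduling (GSS), in which each request receives ceil(remaining / p) elements. The helper name `batchSizes` is hypothetical, and factoring and trapezoid self-scheduling use different size formulas that are not shown here.

```scala
import scala.annotation.tailrec

object GuidedSelfScheduling {
  /** Batch sizes handed out by GSS for n elements and p workers:
    * each request receives ceil(remaining / p) elements. */
  def batchSizes(n: Int, p: Int): List[Int] = {
    @tailrec
    def loop(remaining: Int, acc: List[Int]): List[Int] =
      if (remaining <= 0) acc.reverse
      else {
        val size = (remaining + p - 1) / p // integer ceiling of remaining / p
        loop(remaining - size, size :: acc)
      }
    loop(n, Nil)
  }
}
```

For 100 elements and 4 workers the first batches are 25, 19, 14 and 11 elements, shrinking to single elements near the end. The batches are still claimed from one shared counter, so the contention drawback remains.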

31

Task-based work-stealing

N

cycles

0..2 2..4 4..8 8..16

32

Task-based work-stealing

N

cycles

0..2 2..4 4..8 8..16

2..4

4..8

8..16

T0 T1 0..2

33

Task-based work-stealing

N

cycles

0..2 2..4 4..8 8..16

2..4

4..8

8..16

T0 T1 0..2

steal – a rare event

34

Task-based work-stealing

N

cycles

0..2 2..4 4..8 8..16

2..4

4..8

8..16

T0 T1 10..12

12..16

8..10 0..2

35

Task-based work-stealing

Pros: can be adaptive - uses stealing information
Cons: heavyweight - minimum batch size much larger

N

cycles

0..2 2..4 4..8 8..16

2..4

4..8

8..16

T0 T1 10..12

12..16

0..2 8..10

36

Task-based work-stealing

N

cycles

0..2 2..4 4..8 8..16

Cannot be stolen after T0 starts processing it
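
The task ranges 0..2, 2..4, 4..8, 8..16 on these slides are consistent with repeated halving, sketched below under that assumption. The name `split` and the `minBatch` parameter are hypothetical. The worker keeps the smallest left part and pushes each right half onto its deque, so the oldest entry is the largest and is the one a thief takes first.

```scala
object TaskSplitting {
  /** Repeatedly halves [start, end) until the part kept by the worker has at most
    * `minBatch` elements. Returns the kept batch and the pushed tasks in push
    * order (oldest and largest first). Ranges are (inclusive start, exclusive end). */
  def split(start: Int, end: Int, minBatch: Int): ((Int, Int), List[(Int, Int)]) = {
    var pushed = List.empty[(Int, Int)]
    var hi = end
    while (hi - start > minBatch) {
      val mid = start + (hi - start) / 2
      pushed = pushed :+ ((mid, hi)) // the right half becomes a stealable task
      hi = mid
    }
    ((start, hi), pushed)
  }
}
```

Calling `split(0, 16, 2)` keeps `(0, 2)` and pushes `(8, 16)`, `(4, 8)`, `(2, 4)`. Once the worker starts on `(0, 2)`, that batch is no longer visible to thieves, which is the limitation the slide points out.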

37

Work-stealing tree

0 0 T0 N

owned

38

Work-stealing tree

0 0 T0 N | 0 50 T0 N

owned owned

T0: CAS

39

Work-stealing tree

0 0 T0 N | 0 50 T0 N | 0 N T0 N …

owned owned completed

T0: CAS T0: CAS

What about stealing?

40

Work-stealing tree

0 0 T0 N | 0 50 T0 N | 0 N T0 N …

owned owned completed

0 -51 T0 N

T0: CAS

T1: CAS

stolen

T0: CAS

41

Work-stealing tree

0 50 T0 N | 0 N T0 N …

owned completed

0 -51 T0 N

T0: CAS

stolen

T0: CAS

0 0 T0 N

owned

T1: CAS

42

Work-stealing tree

0 50 T0 N | 0 N T0 N …

owned completed

0 -51 T0 N

T0: CAS

stolen

0 -51 T0 N

expanded

50 50 T0 M | M M T1 N

T0: CAS

0 0 T0 N

owned

M = (50 + N) / 2

43

Work-stealing tree

0 50 T0 N | 0 N T0 N …

owned completed

0 -51 T0 N

T0: CAS

stolen

0 -51 T0 N

expanded

50 50 T0 M | M M T1 N

T0: CAS

0 0 T0 N

owned

M = (50 + N) / 2

T0 or T1: CAS

44

Work-stealing tree

0 50 T0 N | 0 N T0 N …

owned completed

0 -51 T0 N

T0: CAS

stolen

0 -51 T0 N

expanded

50 50 T0 M | M M T1 N

T0 or T1: CAS

T0: CAS

0 0 T0 N

owned

M = (50 + N) / 2
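
The node transitions on these slides (owned, stolen, expanded) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the field layout (start, progress, owner, end) and the encoding of a stolen node as -(progress) - 1 are inferred from the slide values 50 and -51, and all names are hypothetical.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}

object WorkStealingTreeSketch {
  final class Node(val start: Int, val end: Int, val owner: String) {
    // Non-negative: next unprocessed element. Negative: stolen at position -(value) - 1.
    val progress = new AtomicInteger(start)
    val children = new AtomicReference[Option[(Node, Node)]](None)

    def position: Int = { val p = progress.get; if (p < 0) -p - 1 else p }
    def isStolen: Boolean = progress.get < 0

    /** Owner claims up to `step` elements with a CAS; None if stolen or completed. */
    def advance(step: Int): Option[(Int, Int)] = {
      val p = progress.get
      if (p < 0 || p >= end) None
      else {
        val next = math.min(p + step, end)
        if (progress.compareAndSet(p, next)) Some((p, next)) else advance(step)
      }
    }

    /** Thief marks the node as stolen with a CAS; false if already stolen or completed. */
    def steal(): Boolean = {
      val p = progress.get
      if (p < 0 || p >= end) false
      else if (progress.compareAndSet(p, -p - 1)) true
      else steal()
    }

    /** Any worker may expand a stolen node: the remaining range is split at
      * M = (position + end) / 2 between the original owner and the thief. */
    def expand(thief: String): Option[(Node, Node)] = {
      if (isStolen) {
        val p = position
        val m = (p + end) / 2
        children.compareAndSet(None, Some((new Node(p, m, owner), new Node(m, end, thief))))
      }
      children.get
    }
  }
}
```

With N = 100, a node owned by T0 that has advanced to 50 and is then stolen by T1 holds progress -51, and expanding it yields the children 50..75 for T0 and 75..100 for T1. Only the first expand CAS installs children, so concurrent expansion by T0 and T1 yields the same pair.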

45

Work-stealing tree - contention

50

Work-stealing tree scheduling

1) find either a non-expanded, non-completed node
2) if not found, terminate
3) if not owned, steal and/or expand, and descend
4) advance until node is completed or stolen
5) go to 1)

51

Work-stealing tree scheduling

1) find either a non-expanded, non-completed node
2) if not found, terminate
3) if not owned, steal and/or expand, and descend
4) advance until node is completed or stolen
5) go to 1)

1) find either a non-expanded, non-completed node

52

Choosing the node to steal

Find first, in-order traversal

2 9

5

3

53

Choosing the node to steal

Find first, in-order traversal

2 9

5

3

Catastrophic – a lot of stealing, huge trees

54

Choosing the node to steal

Find first, in-order traversal Find first, random order traversal

2 9

5

3

2 9

5

3

Catastrophic – a lot of stealing, huge trees

55

Choosing the node to steal

Find first, in-order traversal Find first, random order traversal

2 9

5

3

2 9

5

3

Catastrophic – a lot of stealing, huge trees

Works reasonably well.

56

Choosing the node to steal

Find first, in-order traversal Find first, random order traversal Find most elements

2 9

5

3

2 9

5

3

2 9

5

3

Catastrophic – a lot of stealing, huge trees

Works reasonably well.

Generates the fewest nodes. Seems to be best.
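
Two of the strategies above can be compared on a simplified tree in which only leaves carry remaining work. The types `Leaf` and `Inner` and the example counts 2, 9, 5, 3 (borrowed from the slide's figure, with an assumed layout) are hypothetical.

```scala
object StealTarget {
  sealed trait Tree
  final case class Leaf(name: String, remaining: Int) extends Tree
  final case class Inner(left: Tree, right: Tree) extends Tree

  /** "Find first, in-order traversal": the leftmost leaf that still has work. */
  def findFirst(t: Tree): Option[Leaf] = t match {
    case l: Leaf     => if (l.remaining > 0) Some(l) else None
    case Inner(a, b) => findFirst(a).orElse(findFirst(b))
  }

  /** "Find most elements": the leaf with the largest number of remaining elements. */
  def findMost(t: Tree): Option[Leaf] = t match {
    case l: Leaf => if (l.remaining > 0) Some(l) else None
    case Inner(a, b) =>
      (findMost(a), findMost(b)) match {
        case (Some(x), Some(y)) => Some(if (y.remaining > x.remaining) y else x)
        case (x, y)             => x.orElse(y)
      }
  }
}
```

On a tree whose leaves hold 2, 9, 5 and 3 remaining elements, `findFirst` picks the leaf with 2 elements, which is exhausted almost immediately and forces another steal, while `findMost` picks the leaf with 9 elements.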

57

Comparison with fixed-size batching

58

Comparison with fixed-size batching

59

Comparison with task work-stealing

60

Thank you!

Questions?

61

Finding work

62

Other workloads