
Feedback-Driven Pipelining


M. Aater Suleman*

Moinuddin K. Qureshi

Khubaib*

Yale Patt*

*HPS Research Group, The University of Texas at Austin

IBM T.J. Watson Research Center


Background

• To leverage CMPs, programs must be parallelized

• Pipeline parallelism:
– Split each loop iteration into multiple stages
– Each stage can be assigned more than one core, or multiple stages can share a core

• Pipeline parallelism is applicable to a variety of workloads:
– Streaming [Gordon+ ASPLOS'06]
– Recognition, Synthesis and Mining [Bienia+ PACT'08]
– Compression/Decompression [Intel TBB 2009]


Pipeline Parallelism Example

Find the N most similar strings to a given search string.

First, the kernel reads a candidate string. Next, it compares the candidate string with the search string to compute a similarity score. Last, it inserts the candidate string into a heap sorted on similarity; if, after the insertion, the heap has more than N elements, it removes the smallest element from the heap. Once the kernel has iterated through all input strings, the heap contains the N closest strings. This kernel can be implemented as a 3-stage pipeline with stages S1, S2, and S3. Note that stage S2 is scalable because multiple strings can be compared concurrently. However, S3 is non-scalable since only one thread can be allowed to update the shared heap. For simplicity, let's assume that the three stages execute for 5, 20, and 10 time units, respectively, when run as a single thread.

[Figure: pipeline diagram. S1: Read → QUEUE1 → S2: Compare (similarity score against the search string, e.g. abssdfkjedwekjwersafsdfsDFSADFkjwelrk) → QUEUE2 → S3: Insert into an N-entry heap sorted on similarity score.]

(A code sketch of this kernel follows.)
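Below is a minimal, runnable sketch of this kernel's logic. The similarity metric, candidate list, and N are illustrative assumptions; for clarity the three stages run sequentially in one thread, whereas the pipelined version would connect S1, S2, and S3 through QUEUE1 and QUEUE2 and run them on separate cores.

    // S1: Read, S2: Compare, S3: Insert -- shown sequentially for clarity.
    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy similarity score (an assumption): matching characters per position.
    int similarity(const std::string& a, const std::string& b) {
        int score = 0;
        for (std::size_t i = 0; i < std::min(a.size(), b.size()); ++i)
            if (a[i] == b[i]) ++score;
        return score;
    }

    int main() {
        const std::string search = "abssdfkjedwekjwersafsdfsDFSADFkjwelrk";
        const std::vector<std::string> candidates = {
            "abssdfkjedwekjwersafsdfsDFSADFkjwelrk", "zzzzzz", "abssdfkj"};
        const std::size_t N = 2;  // keep the N most similar strings

        // Min-heap keyed on score: the least similar kept string sits on
        // top, so it is the element evicted when the heap exceeds N.
        std::priority_queue<std::pair<int, std::string>,
                            std::vector<std::pair<int, std::string>>,
                            std::greater<>> heap;

        for (const std::string& c : candidates) {   // S1: Read
            int score = similarity(c, search);      // S2: Compare
            heap.push({score, c});                  // S3: Insert
            if (heap.size() > N) heap.pop();        //     evict the smallest
        }
        while (!heap.empty()) {                     // the closest N strings
            std::cout << heap.top().second << " (score "
                      << heap.top().first << ")\n";
            heap.pop();
        }
        return 0;
    }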

Key Problem: Core to Stage Allocation

S1: Read (1 time unit), S2: Compare (4 time units), S3: Insert (1 time unit)

[Timeline figure, 0 to 45 time units, comparing four configurations: NumCores = 1 (all stages on one core); 1 core/stage (NumCores = 3); 2 cores/stage (NumCores = 6); best allocation in steady state (NumCores = 6).]

Allocation impacts both power and performance:
– Assigning too few cores to a stage can reduce performance
– Assigning more cores than needed wastes power
Core-to-stage allocation must be chosen carefully; the worked check below shows why.
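As a sanity check (assuming, purely for illustration, that each stage scales linearly with its core count), let stage $i$ take $t_i$ time units per iteration and receive $c_i$ cores. Steady-state throughput is set by the slowest stage:

$$\text{Throughput} = \min_i \frac{c_i}{t_i}$$

With the stage times above, the even split $(c_1, c_2, c_3) = (2, 2, 2)$ yields $\min(2/1,\ 2/4,\ 2/1) = 0.5$ iterations per time unit, while the skewed split $(1, 4, 1)$ yields $\min(1/1,\ 4/4,\ 1/1) = 1.0$: twice the throughput from the same six cores.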


Best Core-to-Stage Allocation

• Best allocation depends on relative throughput and scalability of each stage

• Scalability and throughput vary with the input set and machine, so profile-based and compile-time solutions are sub-optimal

• Millions of possible allocations even for shallow pipelines

e.g., 8 stages can be allocated to 32 cores in about 2.6M ways (integer allocations)

Brute-force searching of best allocation is impractical
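As a check on the 2.6M figure: assuming every stage receives at least one core, the number of integer allocations of 32 cores to 8 stages is, by stars and bars,

$$\binom{32-1}{8-1} = \binom{31}{7} = 2{,}629{,}575 \approx 2.6\text{M}.$$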


Goal: Automatically find the best core-to-stage allocation at run-time, taking into account the input set, machine configuration, and scalability of stages.


Outline

• Motivation
• Feedback-Driven Pipelining
• Case Study
• Results
• Conclusions


Key Insights

• Pipeline performance is limited by the slowest stage: LIMITER

• LIMITER stage can be identified by measuring the execution time of each stage using existing cycle counters (see the timing sketch after this list)

• Scalability of a stage can be estimated using hill-climbing, i.e., continue to give cores until performance stops increasing

• Non-LIMITER stages can share cores as long as sharing a core does not make them slower than the LIMITER
– Saved cores can be assigned to the LIMITER or switched off to save power
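A minimal sketch of how the per-stage timing and LIMITER identification could look; StageStats, MAX_STAGES, and the helper names are illustrative assumptions, not FDP's published data structures:

    #include <cstdint>
    #include <x86intrin.h>  // __rdtsc(): read the time stamp counter

    constexpr int MAX_STAGES = 16;  // assumed bound, for illustration

    struct StageStats {
        uint64_t cycles  = 0;   // cycles spent executing this stage
        uint64_t samples = 0;   // number of timed iterations
    };
    StageStats stats[MAX_STAGES];

    // Time one execution of a stage body and accumulate the result.
    template <typename F>
    void time_stage(int stage_id, F&& body) {
        uint64_t start = __rdtsc();   // TSC before the stage runs
        body();
        stats[stage_id].cycles  += __rdtsc() - start;
        stats[stage_id].samples += 1;
    }

    // The LIMITER is the stage with the largest average time per
    // iteration after accounting for the cores currently serving it.
    int find_limiter(int num_stages, const int cores_per_stage[]) {
        int limiter = 0;
        double worst = 0.0;
        for (int s = 0; s < num_stages; ++s) {
            if (stats[s].samples == 0) continue;
            double avg = double(stats[s].cycles) / stats[s].samples;
            double per_core = avg / cores_per_stage[s];  // effective time
            if (per_core > worst) { worst = per_core; limiter = s; }
        }
        return limiter;
    }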


Feedback-Driven Pipelining (FDP)

[Flowchart] The FDP control loop, reconstructed in prose (a pseudocode sketch follows):
– Start: assign one core per stage.
– While cores are available: add a core to the current LIMITER.
  – If performance improves, keep the core and repeat.
  – If performance degrades, take one core from the LIMITER and save power.
– When no cores are available: combine the fastest stages on one core.
  – If performance stays the same, keep the combination (the freed core can go to the LIMITER or be switched off).
  – If performance degrades, undo the change (see the detailed backup flowchart at the end).
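The same loop as a C++-flavored sketch. Every helper below (cores_available, limiter_stage, measure_performance, and so on) is a hypothetical placeholder for FDP's internal bookkeeping, not a published API:

    #include <algorithm>

    // Hypothetical hooks (assumptions, to be implemented against the
    // library's real data structures):
    bool   cores_available();
    int    limiter_stage();              // slowest stage, from timing stats
    void   add_core_to(int stage);
    void   remove_core_from(int stage);  // the freed core can be switched off
    void   combine_two_fastest_stages();
    void   undo_combination();
    double measure_performance();        // e.g., iterations per second

    // One FDP decision; per the slides this runs only once every
    // ~2000 iterations, and each allocation is tried only once.
    void fdp_step(double& best_perf) {
        if (cores_available()) {
            int limiter = limiter_stage();
            add_core_to(limiter);                 // try to speed up LIMITER
            double perf = measure_performance();
            if (perf <= best_perf)                // LIMITER did not scale:
                remove_core_from(limiter);        // take the core back, save power
            best_perf = std::max(best_perf, perf);
        } else {
            combine_two_fastest_stages();         // try to free up a core
            double perf = measure_performance();
            if (perf < best_perf)
                undo_combination();               // sharing a core hurt
            else
                best_perf = std::max(best_perf, perf);
        }
    }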


Required Support

• FDP uses instructions that read the Time Stamp Counter (rdtsc)

• Software: modify the worker thread to call FDP library functions:

    FDP_Init()
    While (!DONE)
        stage_id = FDP_InitStage()
        Pop a work quantum
        FDP_BeginStage(stage_id)
        Run stage
        FDP_EndStage(stage_id)
        Push the iteration to the in-queue of the next stage


Performance Considerations

• All required data structures are maintained in software and only use virtual memory

• Training data is collected by reading the cycle counter at the start and end of each stage's execution
– We reduce overhead by sampling only 1/128 iterations (see the sketch below)
– Training can continue seamlessly at all times

• FDP algorithm runs infrequently – once every 2000 iterations

• Each allocation is tried only once to ensure convergence – almost zero-overhead once converged
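A minimal sketch of the 1/128 sampling check. The function names FDP_BeginStage and FDP_EndStage come from the slide above; their bodies here, the counters, and MAX_STAGES are assumptions (a real multi-threaded version would also need per-thread or atomic accumulation):

    #include <cstdint>
    #include <x86intrin.h>  // __rdtsc()

    constexpr int MAX_STAGES = 16;                  // assumed bound
    uint64_t stage_cycles[MAX_STAGES];              // accumulated timed cycles
    uint64_t stage_samples[MAX_STAGES];             // number of timed iterations
    thread_local uint64_t iter_count = 0;           // this worker's iterations
    thread_local uint64_t stage_start[MAX_STAGES];  // TSC at stage entry

    void FDP_BeginStage(int stage_id) {
        if ((iter_count & 127) == 0)                // time 1 of every 128
            stage_start[stage_id] = __rdtsc();
    }

    void FDP_EndStage(int stage_id) {
        if ((iter_count & 127) == 0) {
            stage_cycles[stage_id]  += __rdtsc() - stage_start[stage_id];
            stage_samples[stage_id] += 1;
        }
        ++iter_count;                               // advance once per iteration
    }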


Outline

• Motivation
• Feedback-Driven Pipelining
• Case Study
• Results
• Conclusions


Experimental Methodology

• Measurements taken on an Intel-based 8-core SMP (2xCore2Quad chips)

• Nine pipeline workloads from various domains

• Evaluated configurations:
– FDP
– Profile-based
– Proportional Allocation

• Total execution times measured using the Linux time utility (experiments repeated to reduce randomness due to I/O and the OS)


Case Study I: compress

[Figure: execution timelines of compress. S3 is the LIMITER; FDP gives more cores to S3, then even more cores to S3, and combines stages to free up cores, producing the optimized execution.]


Outline

• Motivation
• Feedback-Driven Pipelining
• Case Study
• Results
• Conclusions


Performance

[Bar chart: Speedup wrt 1 core (0-8) per workload for 1 CorePerStage, Prop Assignment, Profile-Based, and FDP.]

On average, Profile-Based provides a 2.86x speedup and FDP a 4.3x speedup.


Robustness to input set

[Bar chart: Speedup wrt 1 core (0-1.6) on compress-2, compress-3, and Gmean for 1 CorePerStage, Prop Assignment, Profile-Based, and FDP.]

(These input sets are hard to compress.) The S3 stage now takes 80K-140K cycles instead of 2.4M cycles. S5 (writing output to files) also takes ~80K cycles and is non-scalable.


Savings in Active Cores

[Bar chart: Number of Active Cores (0-8) for MCarlo, compress, BScholes, pagemine, image, mtwister, rank, ferret, dedup, and Amean under 1 CorePerStage, Prop Assignment, Profile-Based, and FDP.]

FDP not only improves performance but can save power too!


Scalability to Larger Systems

[Bar chart: Speedup wrt 1 core (0-16) for MCarlo, compress, BScholes, pagemine, image, mtwister, rank, ferret, dedup, and Gmean under Prop Assignment and FDP.]

Larger machine: 16-core system (4x AMD Barcelona). Evaluating Profile-Based is impractical (several thousand configurations).

FDP provides a 6.13x speedup (vs. 4.3x with Prop. Assignment). FDP also saves power (11.5 active cores vs. 16 with Prop. Assignment).


Outline

• Motivation
• Feedback-Driven Pipelining
• Case Study
• Results
• Conclusions


Conclusions

• Pipelined parallelism is applicable to a wide variety of workloads
– Key problem: how many cores to assign to each stage?

• Our insight: performance limited by slowest stage: LIMITER

• Our proposal FDP identifies LIMITER stage at runtime using existing performance counters

• FDP uses a hill-climbing algorithm to estimate stage scalability

• FDP successfully finds the best core-to-stage allocation
– Speedup of 4.3x vs. 2.8x with a practical profile-based approach
– Robust to input set and scalable to larger machines
– Can be used to save power when the LIMITER does not scale


Questions


Related Work

• Flextream – Hormati+ (PACT 2009)
– Does not take stage scalability into account
– Requires dynamic recompilation

• Compile-time tuning of pipeline workloads:
– Navarro+ (PACT 2009, ICS 2009), Liao+ (JS 2005), Gonzalez+ (Parallel Computing 2003)

• Profile-based allocation in domain-specific apps.


Feedback-Driven Pipelining (FDP)

[Detailed backup flowchart, reconstructed in prose.] Same control loop as before, with two refinements:
– Start: assign one core per stage.
– If cores are available, consider adding a core to the current LIMITER; if not, consider combining the fastest stages on one core.
– Before applying either move, check whether the resulting allocation has been seen before; previously tried allocations are not revisited (this ensures convergence).
– After a move, measure performance: if it degrades, undo the change.


FDP for Work Sharing Model

FDP performs similarly to Work Sharing with the best number of threads!


Data Structures for FDP