Automatically Tuning Task-Based Programs for Multi-core Processors

Automatically Tuning Task-Based Programs for Multi-core Processors Jin Zhou Brian Demsky Department of Electrical Engineering and Computer Science University of California, Irvine


Page 1: Automatically Tuning Task-Based Programs for Multi-core Processors

Automatically Tuning Task-Based Programs for Multi-core Processors

Jin Zhou, Brian Demsky

Department of Electrical Engineering and Computer Science

University of California, Irvine

Page 2: Automatically Tuning Task-Based Programs for Multi-core Processors

Motivation

• Recent microprocessor trends
  – Number of cores increased rapidly
  – Architectures vary widely
• Challenges for software development
  – Parallelization is now key for performance
  – Current parallel programming model: threads + locks
    • Hard to develop correct and efficient parallel software
    • Hard to adapt software to changes in architectures

Page 3: Automatically Tuning Task-Based Programs for Multi-core Processors

Goals

• Automatically generate parallel implementation
• Automatically tune parallel implementation

Page 4: Automatically Tuning Task-Based Programs for Multi-core Processors

Overview

[Figure: Bamboo compiler pipeline. Inputs: program, processor specification, and profile data. Implementation Generator → candidate implementations → Simulation-based Evaluator → leading implementations → Implementation Optimizer → tuned implementations → optimized implementation → Code Generator → optimized multi-core binary → multi-core processor.]

Page 5: Automatically Tuning Task-Based Programs for Multi-core Processors

Example

• MonteCarlo Example
  – Partitions problem into several simulations
  – Executes the simulations in parallel
  – Aggregates results of all simulations

Page 6: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Language

• A hybrid language that combines data-flow and Java
  – Programs are composed of tasks
  – Tasks compose with dataflow-like semantics
  – Tasks contain Java-like object-oriented code internally
  – Programs cannot explicitly invoke tasks
  – Runtime automatically invokes tasks

• Supports standard object-oriented constructs including methods and classes

Page 7: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Language

• Flags
  – Capture current role (type state) of object in computation
  – Each flag captures an aspect of the object’s state
  – Change as the object’s role evolves in program
  – Support orthogonal classifications of objects

Page 8: Automatically Tuning Task-Based Programs for Multi-core Processors

// Runs once on the StartupObject: creates one Aggregator and four Simulators.
task startup(StartupObject s in initialstate) {
  Aggregator aggr = new Aggregator(s.args[0]){merge:=true};
  for(int i = 0; i < 4; i++) {
    Simulator sim = new Simulator(aggr){run:=true};
  }
  taskexit(s: initialstate:=false);
}

// Runs on every Simulator whose run flag is set.
task simulate(Simulator sim in run) {
  sim.runSimulate();
  taskexit(sim: run:=false, submit:=true);
}

// Runs on the Aggregator (merge) paired with a Simulator that has submitted results.
task aggregate(Aggregator aggr in merge, Simulator sim in submit) {
  boolean allprocessed = aggr.aggregateResult(sim);
  if (allprocessed)
    taskexit(aggr: merge:=false, finished:=true; sim: submit:=false, finished:=true);
  taskexit(sim: submit:=false, finished:=true);
}

class Aggregator { flag merge; flag finished; … }

class Simulator { flag run; flag submit; flag finished; ... }
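The flag-driven invocation in the task program above can be mimicked in a few lines of Python. This is only an illustrative emulation, not Bamboo's runtime: the `Obj` class, the `run` function, and the sequential dispatch loop are hypothetical names and simplifications, and a counter stands in for `aggregateResult`.

```python
# Hypothetical emulation of the MonteCarlo task program (not the Bamboo runtime).
class Obj:
    def __init__(self, kind, *flags):
        self.kind = kind
        self.flags = set(flags)


def run(num_simulators=4):
    """Repeatedly invoke any task whose parameter objects are in the required flag states."""
    space = [Obj("StartupObject", "initialstate")]  # global flagged object space
    log = []
    aggregated = 0

    def find(kind, flag):
        return next((o for o in space if o.kind == kind and flag in o.flags), None)

    while True:
        s = find("StartupObject", "initialstate")
        if s is not None:  # task startup
            space.append(Obj("Aggregator", "merge"))
            space.extend(Obj("Simulator", "run") for _ in range(num_simulators))
            s.flags.discard("initialstate")
            log.append("startup")
            continue
        sim = find("Simulator", "run")
        if sim is not None:  # task simulate
            sim.flags = {"submit"}
            log.append("simulate")
            continue
        aggr = find("Aggregator", "merge")
        sim = find("Simulator", "submit")
        if aggr is not None and sim is not None:  # task aggregate
            aggregated += 1  # stand-in for aggr.aggregateResult(sim)
            sim.flags = {"finished"}
            if aggregated == num_simulators:
                aggr.flags = {"finished"}
            log.append("aggregate")
            continue
        return log, space  # no task is enabled any more
```

Calling `run()` yields one `startup`, four `simulate`, and four `aggregate` invocations. The real runtime may run the `simulate` invocations in parallel on different cores; this sketch serializes them only to keep the example short.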

Page 9: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

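[Figure: global flagged object space. Runtime initialization creates a new StartupObject in the initialstate state. Legend of flag states: StartupObject (initialstate, finished); Aggregator (merge, finished); Simulator (run, submit, finished).]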

Page 10: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The startup task executes on the StartupObject.]

Page 11: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The startup task sets the flag state of the StartupObject and creates a new Aggregator and four new Simulators.]

Page 12: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. Four simulate tasks execute on the four Simulators.]

Page 13: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The four simulate tasks set the flag states of their Simulators.]

Page 14: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The aggregate task executes on the Aggregator and a Simulator.]

Page 15: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The aggregate task sets the flag states of the objects it executed on.]

Page 16: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The aggregate task executes on the Aggregator and the next Simulator.]

Page 17: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The aggregate task sets the flag states of the objects it executed on.]

Page 18: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The aggregate task executes on the Aggregator and the next Simulator.]

Page 19: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The aggregate task sets the flag states of the objects it executed on.]

Page 20: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The aggregate task executes on the Aggregator and the last Simulator.]

Page 21: Automatically Tuning Task-Based Programs for Multi-core Processors

Bamboo Program Execution

[Figure: global flagged object space. The aggregate task sets the flag states of the objects it executed on.]

Page 22: Automatically Tuning Task-Based Programs for Multi-core Processors

Implementation Generation

[Figure: implementation generation stage of the Bamboo compiler. Inputs: Bamboo program, processor specification, and profile data. The Implementation Generator produces candidate implementations.]

Page 23: Automatically Tuning Task-Based Programs for Multi-core Processors

Implementation Generation

• Dependence Analysis: analyzes data dependence between tasks

• Parallelism Exploration: extracts potential parallelism

• Mapping to Cores: maps the program to a real processor

Page 24: Automatically Tuning Task-Based Programs for Multi-core Processors

Flag State Transition Graph (FSTG)

[Figure: FSTG of the Simulator class. run → submit via simulate (32 Mcyc; 100%); submit → finished via aggregate (2 Mcyc; 100%).]

Page 25: Automatically Tuning Task-Based Programs for Multi-core Processors

Combined Flag State Transition Graph (CFSTG)

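[Figure: CFSTG combining the FSTGs of all classes.
StartupObject: initialstate → finished via startup (3 Mcyc; 100%).
Simulator: run → submit via simulate (32 Mcyc; 100%); submit → finished via aggregate (2 Mcyc; 100%).
Aggregator: merge → merge via aggregate (2 Mcyc; 75%); merge → finished via aggregate (2 Mcyc; 25%).
Edges from startup are labeled with the number of new objects: 1 Aggregator and 4 Simulators.]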

Page 26: Automatically Tuning Task-Based Programs for Multi-core Processors

Initial Mapping

[Figure: the CFSTG of StartupObject, Simulator, and Aggregator (same states, task costs, and new-object counts 1 and 4 as on the CFSTG slide), with core groups marked.]

Page 27: Automatically Tuning Task-Based Programs for Multi-core Processors

Preprocessing Phase

• Identifies strongly connected components (SCC) and merges them into a single core group

• Converts CFSTG into a tree of core groups by replicating core groups as necessary

Page 28: Automatically Tuning Task-Based Programs for Multi-core Processors

Data Locality Rule

• Default rule
• Maximize data locality to improve performance
  – Minimizes inter-core communications
  – Improves cache behavior

[Figure: CFSTG with StartupObject, Aggregator, and Simulator core groups; new-object edges labeled 1 (Aggregator) and 4 (Simulator).]

Page 29: Automatically Tuning Task-Based Programs for Multi-core Processors

Data Parallelization Rule

• To explore potential data parallelism

[Figure: the Simulator core group, originally reached by one edge labeled 4, is replicated into four Simulator core groups, each reached by an edge labeled 1; StartupObject and Aggregator core groups are unchanged.]

Page 30: Automatically Tuning Task-Based Programs for Multi-core Processors

Rate Matching Rule

• If the producer executes multiple times in a cycle, how many consumers are required?
• Match two rates to estimate the number of consumers
  – Peak new object creation rate
  – Object consumption rate

[Figure: a Producer core group (init and produce states) feeding several replicated Consumer core groups (run state).]
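The rate matching rule above can be read as a small calculation. The sketch below is one plausible reading, not the formula from the Bamboo compiler: it assumes the number of consumers is the peak creation rate divided by the per-consumer consumption rate, rounded up. The function name, parameters, and sample numbers are hypothetical.

```python
import math
from fractions import Fraction


def consumers_needed(objects_per_burst, burst_time, consume_time):
    """Estimate how many consumers keep up with one producer.

    objects_per_burst / burst_time is the assumed peak new-object creation rate;
    1 / consume_time is the assumed consumption rate of a single consumer.
    Exact fractions avoid floating-point surprises before rounding up.
    """
    creation_rate = Fraction(objects_per_burst, burst_time)
    consumption_rate = Fraction(1, consume_time)
    return max(1, math.ceil(creation_rate / consumption_rate))
```

For example, a producer that emits one object every 8 Mcyc, feeding consumers that each need 32 Mcyc per object, gives `consumers_needed(1, 8, 32) == 4`.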

Page 31: Automatically Tuning Task-Based Programs for Multi-core Processors

Mapping to Processor

• Constraint: limited cores
• Map CFSTG core groups to physical cores
• Extended CFSTG

[Figure: extended CFSTG with StartupObject, Aggregator, and four Simulator core groups (each new-object edge labeled 1), to be mapped onto Core 1 and Core 2.]

Page 32: Automatically Tuning Task-Based Programs for Multi-core Processors

Mapping to Cores

• One possible mapping

[Figure: the StartupObject, Aggregator, and four Simulator core groups of the extended CFSTG divided between Core 1 and Core 2.]

Page 33: Automatically Tuning Task-Based Programs for Multi-core Processors

Mapping to Cores

• Isomorphic mappings: have same performance
• Backtracking-based search: to generate non-isomorphic implementations

[Figure: two mappings of the extended CFSTG onto Core 1 and Core 2 that differ only in which core is labeled Core 1 and which Core 2.]
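One way to picture a backtracking search that skips isomorphic mappings is to generate every assignment in a canonical form, so that two assignments differing only in core numbering are produced once. The sketch below does this under the assumption that all cores are interchangeable; it is not the search used in the Bamboo compiler, and `canonical_mappings` is a hypothetical name. It does not collapse mappings that differ only by swapping identical replicated core groups.

```python
def canonical_mappings(num_groups, num_cores):
    """Yield core assignments (one core index per core group), one per relabeling class.

    Canonical form: core group i may use a core that an earlier group already
    uses, or the single next unused core. Renumbering cores therefore never
    produces a second copy of the same mapping.
    """
    assignment = []

    def extend(used):
        if len(assignment) == num_groups:
            yield tuple(assignment)
            return
        for core in range(min(used + 1, num_cores)):
            assignment.append(core)
            yield from extend(max(used, core + 1))
            assignment.pop()  # backtrack

    yield from extend(0)
```

With six core groups and two cores this enumerates 32 mappings instead of the 64 raw assignments, because each mapping and its Core 1/Core 2 mirror image are counted once.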

Page 34: Automatically Tuning Task-Based Programs for Multi-core Processors

Implementation Generation

[Figure: evaluation and optimization stages of the Bamboo compiler. Candidate implementations → Simulation-based Evaluator → leading implementations → Implementation Optimizer → tuned implementations → optimized implementation.]

Page 35: Automatically Tuning Task-Based Programs for Multi-core Processors

Simulation-Based Evaluation

• To select the best candidate implementation
• High-level simulation
  – Does NOT actually execute the program
  – Constructs abstract execution trace with similar statistics
  – Compares the execution time or throughput and core usage

[Figure: the simulator models cores, each with a queue of tasks.]

Page 36: Automatically Tuning Task-Based Programs for Multi-core Processors

Simulation-Based Evaluation

• Markov model
  – Built from profile data
  – For each task estimates:
    • The destination state
    • The execution time
    • A count of each type of new objects

[Figure: extended CFSTG annotated with profile data.
StartupObject: initialstate → finished via startup (3 Mcyc; 100%); creates 1 Aggregator and four Simulators (each edge labeled 1).
Simulator: run → submit via simulate (32 Mcyc; 100%); submit → finished via aggregate (2 Mcyc; 100%).
Aggregator: merge → merge via aggregate (2 Mcyc; 75%); merge → finished via aggregate (2 Mcyc; 25%).]
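The profile numbers on this slide are enough to sketch how a Markov model can drive a simulation without executing the program. The code below is an illustration only: the `PROFILE` layout and the `step` and `walk` helpers are hypothetical, and the sketch omits the new-object counts that the slide also lists.

```python
import random

# (class, state) -> list of (task, cost in Mcyc, probability, destination state),
# transcribed from the profile annotations on the slide.
PROFILE = {
    ("StartupObject", "initialstate"): [("startup", 3, 1.0, "finished")],
    ("Simulator", "run"): [("simulate", 32, 1.0, "submit")],
    ("Simulator", "submit"): [("aggregate", 2, 1.0, "finished")],
    ("Aggregator", "merge"): [
        ("aggregate", 2, 0.75, "merge"),
        ("aggregate", 2, 0.25, "finished"),
    ],
}


def step(kind, state, rng):
    """Sample one task invocation for an object of class `kind` in `state`."""
    outcomes = PROFILE[(kind, state)]
    weights = [prob for _, _, prob, _ in outcomes]
    task, cost, _, destination = rng.choices(outcomes, weights=weights, k=1)[0]
    return task, cost, destination


def walk(kind, state, rng):
    """Follow sampled transitions until no task applies; return (total Mcyc, state path)."""
    total, path = 0, [state]
    while (kind, state) in PROFILE:
        _, cost, state = step(kind, state, rng)
        total += cost
        path.append(state)
    return total, path
```

A Simulator always takes the same path, `walk("Simulator", "run", random.Random(0))` returns `(34, ["run", "submit", "finished"])`, while an Aggregator stays in `merge` for a random number of `aggregate` invocations before reaching `finished`.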

Page 37: Automatically Tuning Task-Based Programs for Multi-core Processors

Simulated Execution Trace

[Figure: simulated execution trace on two cores (core 0 and core 1). Each entry gives a time and the objects queued on a core at that time; entries are listed in the order recovered from the slide.]

0 StartupObject(1)
3 Aggregator(1), Simulator(4)
4 Simulator(1)
transfer a Simulator
35 Aggregator(1), Simulator(1), Simulator(2)
36 Simulator(1)
37 transfer a Simulator
Aggregator(1), Simulator(2), Simulator(2)
67 Aggregator(1), Simulator(3), Simulator(1)
99 Aggregator(1), Simulator(4)
101 Aggregator(1), Simulator(3)
103 Aggregator(1), Simulator(2)
105 Aggregator(1), Simulator(1)
107 empty

Annotation on "Aggregator(1), Simulator(4)": 1 Aggregator in the initial state and 4 Simulators in the submit state

Page 38: Automatically Tuning Task-Based Programs for Multi-core Processors

Problem of Exhaustive Searching

• The search space expands quickly
• Exhaustive search is not feasible for complicated applications

Number of CFSTG Core Groups   Number of Cores   Number of Candidates
32                            16                > 6,000
64                            32                > 14,000,000

Page 39: Automatically Tuning Task-Based Programs for Multi-core Processors

Random Search?

• Very low chance to find the best implementation

[Figure: chance to find the best implementation.]

Page 40: Automatically Tuning Task-Based Programs for Multi-core Processors

Developer Optimization Process

• Create an initial implementation
• Evaluate it and identify performance bottlenecks
• Heuristically develop new implementations to remove bottlenecks
• Iteratively repeat evaluation and optimization

Page 41: Automatically Tuning Task-Based Programs for Multi-core Processors

Directed Simulated Annealing (DSA)

[Figure: DSA loop. Randomly generated candidate implementations are evaluated by the high-level simulator, which selects the leading candidate implementations. As-built critical path analysis identifies potential bottlenecks in them, and the implementation generator uses these to produce new candidate implementations for the next round. The loop outputs a tuned candidate implementation.]

Page 42: Automatically Tuning Task-Based Programs for Multi-core Processors

As-Built Critical Path (ABCP)

• Provides post-mortem analysis of project management

[Figure: the extended CFSTG (StartupObject, Aggregator, four Simulator core groups) next to the two-core simulated execution trace (core 0 and core 1, times 0 to 107), on which the as-built critical path is traced.]

Page 43: Automatically Tuning Task-Based Programs for Multi-core Processors

As-Built Critical Path Analysis

• Compute the time when a task invocation’s data dependences are resolved

[Figure: the two-core simulated execution trace (core 0 and core 1, times 0 to 107), annotated with the times at which data dependences are resolved.]

Page 44: Automatically Tuning Task-Based Programs for Multi-core Processors

Waiting Task Optimization

• Waiting tasks:
  – Tasks whose real invocation time is later than the time when all their data dependences are resolved
  – Delayed because of resource conflicts
  – They are bottlenecks; remove them from the ABCP
• Optimization
  – Migrate waiting tasks to spare cores
  – Shorten the ABCP to improve performance
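The definition of a waiting task above translates directly into a comparison of two timestamps per task invocation. The sketch below assumes a trace in which both times are already known; the `Invocation` record, the `waiting_tasks` helper, and the sample times are hypothetical and only loosely follow the two-core trace shown earlier.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    task: str
    core: int
    ready: int  # time when all data dependences are resolved
    start: int  # time when the task was actually invoked


def waiting_tasks(trace):
    """Return invocations that started later than they became ready, longest delay first."""
    delayed = [inv for inv in trace if inv.start > inv.ready]
    return sorted(delayed, key=lambda inv: inv.start - inv.ready, reverse=True)


# Illustrative trace: three simulate invocations share core 0, so two of them wait.
TRACE = [
    Invocation("startup", 0, 0, 0),
    Invocation("simulate#1", 0, 3, 3),
    Invocation("simulate#2", 1, 4, 4),
    Invocation("simulate#3", 0, 3, 35),
    Invocation("simulate#4", 0, 3, 67),
]
```

Here `waiting_tasks(TRACE)` reports `simulate#4` and `simulate#3`, the invocations that a tuner would consider migrating to a spare core.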

Page 45: Automatically Tuning Task-Based Programs for Multi-core Processors

Critical Task Optimization

• There may not exist spare cores to move waiting tasks to
• Identify critical tasks: tasks that produce data that is consumed immediately
• Attempt to execute critical tasks as early as possible
• Migrate other tasks that blocked some critical task to other cores

[Figure: excerpt of the two-core execution trace (core 0 and core 1, times 35 to 101) with two annotations numbered 1 and 2.]

Page 46: Automatically Tuning Task-Based Programs for Multi-core Processors

Code Generator

[Figure: code generation stage of the Bamboo compiler. Optimized implementation → Code Generator → intermediate C code → optimized multi-core binary.]

Page 47: Automatically Tuning Task-Based Programs for Multi-core Processors

Evaluation

• MIT RAW simulator
  – Cycle accurate simulator configured for 16 cores
  – RAW chip: tiled chip, shared memory, on-chip network
• Benchmarks:
  – Series: Java Grande benchmark suite
  – MonteCarlo: Java Grande benchmark suite
  – FilterBank: StreamIt benchmark suite
  – Fractal

Page 48: Automatically Tuning Task-Based Programs for Multi-core Processors

Speedups on 16 cores

Benchmark    Clock Cycles (10^6 cyc)            Speedup to
             1-Core Bamboo   16-Core Bamboo     1-Core Bamboo
Series       26.4            1.8                14.7
Fractal      38.4            3.3                11.6
MonteCarlo   191.7           19.0               10.1
FilterBank   91.2            6.7                13.6

• Successfully generated implementations with good performance

Page 49: Automatically Tuning Task-Based Programs for Multi-core Processors

Comparison to Hand-Written C Code

Benchmark    Clock Cycles (10^6 cyc)                       Speedup to   Overhead of
             1-Core C   1-Core Bamboo   16-Core Bamboo     1-Core C     Bamboo
Series       25.0       26.4            1.8                13.9         5.6%
Fractal      36.2       38.4            3.3                11.0         6.1%
MonteCarlo   138.8      191.7           19.0               7.3          38.1%
FilterBank   71.1       91.2            6.7                10.6         28.3%

• Overhead of Bamboo:
  – Small for Series and Fractal
  – Larger overhead for MonteCarlo and FilterBank:
    • GCC cannot reorder instructions to fill floating-point delay slots for Bamboo implementations due to imprecise alias results
    • Easy to add alias information to facilitate the reordering
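The derived columns in the two tables above follow from the cycle counts. As a check, the sketch below recomputes them under the assumption that speedup is the 1-core cycle count divided by the 16-core Bamboo cycle count, and that overhead is the extra 1-core Bamboo cycles relative to 1-core C. The names `CYCLES` and `metrics` are hypothetical.

```python
# benchmark -> (1-core C, 1-core Bamboo, 16-core Bamboo), in 10^6 cycles, from the tables.
CYCLES = {
    "Series": (25.0, 26.4, 1.8),
    "Fractal": (36.2, 38.4, 3.3),
    "MonteCarlo": (138.8, 191.7, 19.0),
    "FilterBank": (71.1, 91.2, 6.7),
}


def metrics(c_1core, bamboo_1core, bamboo_16core):
    """Recompute the derived table columns from raw cycle counts."""
    return {
        "speedup_to_1core_bamboo": round(bamboo_1core / bamboo_16core, 1),
        "speedup_to_1core_c": round(c_1core / bamboo_16core, 1),
        "overhead_pct": round((bamboo_1core - c_1core) / c_1core * 100, 1),
    }
```

For Series this gives a speedup of 14.7 over 1-core Bamboo, 13.9 over 1-core C, and an overhead of 5.6%, matching the reported values; the other three rows agree as well.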

Page 50: Automatically Tuning Task-Based Programs for Multi-core Processors

Comparison of Estimation and Real Execution

• The simulation estimations are close to the real execution time

Benchmark    1-Core Bamboo Binary                  16-Core Bamboo Binary
             Clock Cycles (10^6 cyc)    Error      Clock Cycles (10^6 cyc)    Error
             Estimation   Real                     Estimation   Real
Series       26.3         26.4          0.38%      1.7          1.8           5.56%
Fractal      38.4         38.4          0%         3.1          3.3           6.06%
MonteCarlo   191.0        191.7         0.37%      18.3         19.0          3.68%
FilterBank   91.2         91.2          0%         6.5          6.7           2.99%
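The Error columns are consistent with a relative error measured against the real cycle count. The sketch below recomputes them under that assumption; the function name is hypothetical.

```python
def relative_error_pct(estimate, real):
    """Estimation error as a percentage of the real cycle count, to two decimals."""
    return round(abs(estimate - real) / real * 100, 2)
```

For example, the 16-core Series row gives `relative_error_pct(1.7, 1.8) == 5.56`, and the 1-core MonteCarlo row gives `relative_error_pct(191.0, 191.7) == 0.37`.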

Page 51: Automatically Tuning Task-Based Programs for Multi-core Processors

Optimality of Directed Simulated Annealing

Page 52: Automatically Tuning Task-Based Programs for Multi-core Processors

Fractal

Page 53: Automatically Tuning Task-Based Programs for Multi-core Processors

MonteCarlo

Page 54: Automatically Tuning Task-Based Programs for Multi-core Processors

FilterBank

Page 55: Automatically Tuning Task-Based Programs for Multi-core Processors

Generality of Synthesized Implementation

• The speedups of both 16-core Bamboo versions are similar

• Successfully generate a sophisticated implementation utilizing pipelining for MonteCarlo

Benchmark    Profile_original, Input_double                Profile_double, Input_double
             Clock Cycles (10^6 cyc)    Speedup            Clock Cycles (10^6 cyc)    Speedup
             1-Core      16-Core                           16-Core
Series       54.2        3.6            15.1               3.6                        15.1
Fractal      76.6        6.5            11.8               6.5                        11.8
MonteCarlo   383.2       37.8           10.1               35.7                       10.7
FilterBank   182.3       13.3           13.7               13.3                       13.7

Page 56: Automatically Tuning Task-Based Programs for Multi-core Processors

Related Work

• Data-flow and streaming languages:
  – Bamboo relaxes typical restrictions in these models to permit:
    • Flexible mutation of data structures
    • Data structures of arbitrarily complex constructs
  – Bamboo supports applications that non-deterministically access data
• Tuple-space language: compiler cannot automatically create multiple instantiations to utilize multiple cores
• Self-tuning libraries: mostly address specific computations

Page 57: Automatically Tuning Task-Based Programs for Multi-core Processors

Conclusion

• We developed a new approach to automatically tune task-based programs for multi-core processors
  – Automatically generate parallel implementations
  – Automatically tune according to specific architecture
• The approach was evaluated on the MIT RAW simulator
  – Successfully generated implementations with good performance
  – Successfully generated a sophisticated implementation utilizing pipelining
• Can be extended to the broader context of traditional programming languages

Page 58: Automatically Tuning Task-Based Programs for Multi-core Processors

Thank you!

Page 59: Automatically Tuning Task-Based Programs for Multi-core Processors

Future Work

• Apply our approach to non-simulated multi-core processors

• Develop a more sophisticated processor specification

• Explore a richer set of applications

Page 60: Automatically Tuning Task-Based Programs for Multi-core Processors

Design Rationale

• Why not dynamic scheduling?
  – Bad scalability over increasing cores
  – Our basic approach makes it easier to adapt to future changes in architectures

Page 61: Automatically Tuning Task-Based Programs for Multi-core Processors

Tree Transform

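[Figure: tree transform. Before: Producer1 and Producer2 both feed a single Consumer core group (edges labeled 1, 4, and 1). After: the Consumer core group is replicated so that the graph becomes a tree.]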

Page 62: Automatically Tuning Task-Based Programs for Multi-core Processors

Tags

• Motivation: consider a video processor example
• Tags group objects together:
  – Tags have types
  – Can create many instances of a tag type
  – Each instance defines a group
• Can bind tag instances to objects
• Tags can specify that task parameters must be in the same group