Software Performance Optimisation Group
Frameworks for domain-specific optimization at run-time
Paul Kelly (Imperial College London)
Joint work with
Kwok Cheung Yeung
Milos Puzovic
September 2005
Where we’re coming from…
I lead the Software Performance Optimisation group within Computer Systems.
Stuff I’d love to talk about another time:
Scalable interactive fluid-flow visualisation
FPGA and GPU accelerators
Bounds-checking for C, links with unchecked code
Is Morton-order layout for 2D arrays competitive?
Efficient algorithms for scalable pointer alias analysis
Domain-specific optimisation frameworks
Instant-access cycle stealing
Proxying in CC-NUMA cache-coherence protocols – adaptive randomisation and combining
[Map: Dept of Computing, Imperial College London – near Hyde Park, the Albert Hall, the Science Museum and the V & A]
Mission statement
Extend optimising compiler technology to challenging contexts beyond the scope of conventional compilers
Component-based software: cross-component optimisation
For example, in distributed systems:
Optimisation across network boundaries
Between different security domains
Maintaining proper semantics in the event of failures
Emergent mission (“mission creep”): design a domain-specific optimisation “plug-in” architecture for the compiler/VM
Abstraction
Most performance improvement opportunities come from adapting components to their context.
Most performance improvement measures break abstraction boundaries.
So the goal of performance programming tool support is to get performance without making a mess of your code.
Optimisations are cross-cutting…
Slogan: Optimisations are features
and features can be separately deployable, separately marketable components, or aspects
How can this be made to work?
Open compilers
Idea: implement optimisation features as compiler passes
Need to design open plug-in architecture for inserting new optimization passes
Some interesting issues in how to design extensible intermediate representation
Feature composition raises research issues:
Interference: can we verify that feature A doesn’t interfere with feature B?
Phase ordering problem – which should come first? Can feature B benefit from feature A’s program analysis?
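The pass plug-in idea can be sketched in a few lines. This is an illustrative toy, not the group’s actual architecture: the “IR” is just a string, each optimisation feature registers as a rewriting pass, and phase ordering is simply registration order – which makes the interference and ordering questions above concrete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy open compiler: optimisation features plug in as passes over an IR.
// Here the IR is a plain String; a real system would expose an
// extensible intermediate representation.
public class OpenCompiler {
    private final List<UnaryOperator<String>> passes = new ArrayList<>();

    // A new optimisation feature installs itself as a pass.
    public void register(UnaryOperator<String> pass) { passes.add(pass); }

    // Run all registered passes in registration order (phase ordering!).
    public String compile(String ir) {
        for (UnaryOperator<String> p : passes) ir = p.apply(ir);
        return ir;
    }

    public static void main(String[] args) {
        OpenCompiler c = new OpenCompiler();
        c.register(ir -> ir.replace("x*1", "x"));   // feature A: drop *1
        c.register(ir -> ir.replace("x+x", "2*x")); // feature B: strength-reduce
        // Feature B only fires because feature A ran first:
        System.out.println(c.compile("y=x*1+x*1")); // y=2*x
    }
}
```

Note that swapping the two registrations changes the result: feature B never sees the pattern feature A exposes, which is exactly the phase-ordering problem.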
Open virtual machines
How about an open optimizing VM?
Fresh issues:
Dynamic installation of optimisation features?
Open access to instrumentation/profiling
Exploit opportunity to use dynamic information as well as static analysis
This talk has three parts:
Motivating example
A framework for deploying optimisations as separately-deployable features, or components
Support for optimisations that integrate static analysis with dynamic information
Project strategy
Implement aggregation optimisation for .Net Remoting
Do it with lower overheads than our Java version
To do it, build general-purpose tools for:
Domain-specific optimisation
Run-time optimisation
Results so far:
“Reflective” dataflow analysis framework
“Optimisations as aspects” framework prototype
Plugin architecture for “domain-specific optimisation features” (DSOFs)
Elementary Remoting aggregation works, with excellent performance
Aggregating remote calls
void m(RemoteObject r, int a)
{
    int x = r.f(a);
    int y = r.g(a, x);
    int z = r.h(a, y);
    System.Console.WriteLine(z);
}
[Diagram: client–server message exchanges over the network for calls f, g, h – without and with aggregation]
Aggregation: a sequence of calls to the same server can be executed in a single message exchange
Reduces the number of messages
Also reduces the amount of data transferred
Common parameters; results passed from one call to another
Non-trivial correctness issues; see Yoshida & Ahern, OOPSLA ’05
[Diagram: six messages without aggregation; two messages with aggregation, with no need to copy x and y back]
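The message-count saving can be sketched as follows. All names here (Aggregator, delay, flush) are hypothetical and not part of any real remoting API: the client queues calls destined for the same server and ships the whole sequence in one request/reply pair, resolving intermediate results server-side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of call aggregation: instead of one
// request/reply pair per remote call, queue calls to the same server
// and execute the whole sequence in a single message exchange.
public class Aggregator {
    private final List<String> delayed = new ArrayList<>();

    // Record a call instead of sending it immediately.
    public void delay(String methodAndArgs) { delayed.add(methodAndArgs); }

    // Force execution: the whole queue travels as one request and the
    // final result comes back in one reply. Returns messages used.
    public int flush() {
        int messages = delayed.isEmpty() ? 0 : 2; // 1 request + 1 reply
        delayed.clear();
        return messages;
    }

    public static void main(String[] args) {
        Aggregator agg = new Aggregator();
        agg.delay("f(a)");   // like the slide's x = r.f(a)
        agg.delay("g(a,x)"); // x resolved server-side, never shipped back
        agg.delay("h(a,y)"); // y resolved server-side
        // Unaggregated: 3 calls x 2 messages = 6; aggregated:
        System.out.println(agg.flush()); // 2
    }
}
```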
Real-world benchmarks…
Simple example: Multi-user Dungeon (from Flanagan’s Java Examples in a Nutshell)
“Look” method:
String mudname = p.getServer().getMudName();
String placename = p.getPlaceName();
String description = p.getDescription();
Vector things = p.getThings();
Vector names = p.getNames();
Vector exits = p.getExits();
Seven aggregated calls. Time taken to execute “look”:

                           Ethernet   ADSL
Without call aggregation:     5.4ms   759.6ms
With call aggregation:        5.8ms   164.9ms
Speedup:                       0.93      4.61
Client: Athlon XP 1800+
Servers: Pentium III 500MHz, 650MHz and dual 700MHz
Linux, Sun JDK 1.4.1_01 (Hotspot)
Network: Ethernet: 10.03 MB/s, ping 0.1ms, DSL: 10.7KB/s, ping 98ms
Mean of 3 trials of 1000 iterations each
Call aggregation: our first implementation
Each fragment carries use/def and liveness info:
Y can be executed before X, but p must be copied
Z cannot be delayed, because p is printed
int m() {
    while (p < N) {
        q += x1.m1(p);
        p = 0;
        p += x2.m2(p);
        System.out.println(p);
    }
    return p;
}
The calls x1.m1 and x2.m2 are possibly remote.

Fragment:   X          Y     Z        B2
Defs:       {q}        {p}   {p}      {}
Uses:       {x1,p,q}   {}    {x2,p}   {p}
[CFG: Fragment W: p<N; Fragment X: q += x1.m1(p); Fragment Y: p = 0; Fragment Z: p += x2.m2(p); println(p); return p]
The “Veneer” virtual JVM intercepts class loading and fragments each method. An interpretive “executor” inspects the local fragment following each remote call.
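The delay/reorder decision rests on a standard def/use interference test. A sketch (not the Veneer implementation; the sets follow the table above) might look like this:

```java
import java.util.Set;

// Sketch of the def/use interference test behind fragment reordering:
// fragment B may run before fragment A only if no definition in one
// fragment touches a variable the other reads or writes
// (no RAW, WAR or WAW dependence).
public class DepCheck {
    static boolean intersects(Set<String> a, Set<String> b) {
        return a.stream().anyMatch(b::contains);
    }

    // True if B can be executed before A without changing the result.
    static boolean canReorder(Set<String> defsA, Set<String> usesA,
                              Set<String> defsB, Set<String> usesB) {
        return !intersects(defsA, usesB)   // RAW
            && !intersects(usesA, defsB)   // WAR
            && !intersects(defsA, defsB);  // WAW
    }

    public static void main(String[] args) {
        // Fragment X: defs {q}, uses {x1,p,q}; Fragment Y: defs {p}, uses {}.
        // Y defines p, which X uses (WAR on p): p must be copied first.
        System.out.println(canReorder(Set.of("q"), Set.of("x1", "p", "q"),
                                      Set.of("p"), Set.of()));   // false
        // After copying p to porig (so X reads porig), reordering is legal:
        System.out.println(canReorder(Set.of("q"), Set.of("x1", "porig", "q"),
                                      Set.of("p"), Set.of()));   // true
    }
}
```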
Call aggregation: our first implementation
At this point, the executor has collected a sequence of delayed remote calls (fragments X and Z)
But execution is now forced by the need to print
Now we can inspect the delayed fragments and construct an optimised execution plan
If x1 and x2 are on the same server, send an aggregate call
If x1 and x2 are on different servers, send an execution plan to x2’s server, telling it to invoke x1.m1(porig) on x1’s server
[CFG after reordering: porig = p; Fragment Y (p = 0) executed first; Fragments X (q += x1.m1(porig)) and Z (p += x2.m2(p)) are delayed; execution is forced at println(p)]
Aggregation with conditional forcing
Runtime optimisation is justified for optimising “heavyweight” operations
In this example aggregation is valid if x > y
If we intercept the fork we can find out whether aggregation will be valid
Original “Veneer” implementation intercepts all control-flow forks in methods containing aggregation opportunities
We need a better analysis, that pays overheads only when a benefit might occur
Deferred DFA – motivating example
[Figure: control-flow fork whose outcome is known at run-time]
Deferred data-flow analysis. Shamik Sharma, Anurag Acharya and Joel Saltz. UC Santa Barbara tech report TRCS98-38
Identifies “lossy”, “predictable” control-flow forks
Rescues data-flow information thrown away by conservative analysis, by deferring the meet operation
Generates data-flow summary functions for the regions between forks
Uses predicted control flow to “stitch” together a summary function for the actual path, using the worklist algorithm
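The “stitching” step can be illustrated with classic gen/kill transfer functions. This is a generic sketch of summary-function composition, not code from the Sharma/Acharya/Saltz system: each region’s effect is f(S) = gen ∪ (S \ kill), and composing f2 after f1 gives gen = gen2 ∪ (gen1 \ kill2) and kill = kill1 ∪ kill2, so a whole predicted path collapses to one summary applied at run-time.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of stitching data-flow summary functions along a predicted
// path. A region's effect on a fact set is the gen/kill transfer
// function f(S) = gen UNION (S MINUS kill).
public class Summary {
    final Set<String> gen, kill;
    Summary(Set<String> gen, Set<String> kill) { this.gen = gen; this.kill = kill; }

    // Summary of executing `this` region, then `next`.
    Summary then(Summary next) {
        Set<String> g = new HashSet<>(gen);
        g.removeAll(next.kill);   // next's kills cancel our gens
        g.addAll(next.gen);
        Set<String> k = new HashSet<>(kill);
        k.addAll(next.kill);
        return new Summary(g, k);
    }

    Set<String> apply(Set<String> in) {
        Set<String> out = new HashSet<>(in);
        out.removeAll(kill);
        out.addAll(gen);
        return out;
    }

    public static void main(String[] args) {
        Summary f1 = new Summary(Set.of("d1"), Set.of("d0"));
        Summary f2 = new Summary(Set.of("d2"), Set.of("d1"));
        // Applying the stitched path summary equals applying f1 then f2:
        System.out.println(f1.then(f2).apply(Set.of("d0")));    // [d2]
        System.out.println(f2.apply(f1.apply(Set.of("d0"))));   // [d2]
    }
}
```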
DSOFs
Domain-specific optimisation features
Need a framework to plug the components into
What does the framework need to achieve?
Cross-cutting
Separately deployable
Query language to select target sites
Static access to dataflow/dependence information
Dynamic access to dataflow/dependence information
Let’s start with AOP…
public class RemoteCallDSOF : Loom.Aspect {
    private OpDomains opDomains = new OpDomains();
    private DelayedCalls delayedCalls = new DelayedCalls();
    private Set delayedCallsDef = new HashedSet();

    public RemoteCallDSOF(DDFAAnalysis analysis) {
        this.opDomains = analysis.getOpDomains();
    }

    [Loom.ConnectionPoint.IncludeAll]
    [Loom.Call(Invoke.Instead)]
    public object AnyMethod(object[] args) {
        OpDomain thisOpDomain = opDomains.getOpDomain(Context.MethodName);
        OpNode opNode = thisOpDomain.OpNode;
        Set opNodeDef = opNode.getDefs();
        Set opDomainUse = thisOpDomain.getUses();
        if ((((Set)(opNodeDef & opDomainUse)).Count > 0)
            || (((Set)(opDomainUse & delayedCallsDef)).Count > 0)) {
            // Dependence detected: force the delayed calls, then invoke directly
            delayedCalls.Execute();
            object ret = Context.Invoke(args);
            return ret;
        } else {
            delayedCalls.Add(Context.MethodName, args);
        }
        if (!opDomains.hasNext()) {
            object ret = delayedCalls.Execute();
            return ret;
        }
        return null;
    }
}
getOpDomain() function stitches together summary functions
thisOpDomain.getUses() function returns all variables that are used within the op-domain
opNode.getDefs() function returns all variables that are defined by op-node
RMI aggregation DSOF, based on the Loom aspect weaver
Static part of pointcut
Dynamic part of pointcut, refers to dataflow properties of control flow that can be predicted from this point
public Double[] vectorAddition(DDFAAnalysis analysis, int size) {
    Double[] v1 = new Double[size];
    Double[] v2 = new Double[size];

    ArrayAdder adder = new ArrayAdder();

    Double[] ret1 = adder.Add(v1, v2);
    Double[] ret2 = adder.Add(ret1, v2);
    Double[] ret3 = adder.Add(ret2, v2);
    Double[] ret4 = adder.Add(ret3, v2);

    return ret4;
}
Includes four consecutive calls to the same remote object
There is a data dependency between the calls
Remote call aggregation benchmark
Remote call aggregation benchmark
ILMethod method = CodeDatabase.GetMethod(new Function(Example.adder));
DDFAAnalysis analysis = new DDFAAnalysis();
analysis.Apply(method);

public Double[] vectorAddition(DDFAAnalysis analysis, int size) {
    Double[] v1 = new Double[size];
    Double[] v2 = new Double[size];

    RemoteCallDSOF opt = new RemoteCallDSOF(analysis);
    IAdder adder = (IAdder) Loom.Weaver.CreateInstance(typeof(ArrayAdder), null, opt);

    Double[] ret1 = adder.Add(v1, v2);
    Double[] ret2 = adder.Add(ret1, v2);
    Double[] ret3 = adder.Add(ret2, v2);
    Double[] ret4 = adder.Add(ret3, v2);

    return ret4;
}
We deploy DSOF using Loom aspect weaver
When adder is created, DSOF is interposed
Slightly clunky…
Aspect intercepts control flow at potential remote call sites
Accesses results of static dataflow analysis
Uses values of variables to determine whether future control flow will allow aggregation
Performance results
Loopback device (3GHz Pentium 4, .Net V1.1)
Modem, ping time 156.2ms (client: 1.2GHz Pentium 4, server: 2.6GHz Pentium 4, .Net V1.1)
Very preliminary results
Vector addition benchmark
Substantial speedup even on a fast loopback connection
By avoiding the interpretive mechanism, overheads are smaller than in our Java implementation
Observations… VM
No change to the VM: not needed for our work so far
Though a more powerful dynamic interposition mechanism (i.e. an aspect weaver) would be good
More ambitiously: access VM’s dataflow analysis?
Access and control the VM’s instrumentation – via a dynamic aspect weaver?
Observations… AOP
What is the function of the aspect weaver here?
Type-safe binary rewriting
Pointcut language goes some way towards providing open access to intermediate representation
We have built a reflective dataflow analysis library to extend this somewhat
Observations… DSI
Our scheme for aggregating remote calls is an example of a “Domain-Specific Interpreter” pattern:
Delay execution of calls
Execution of delayed calls is eventually forced by a dependence
Inspect the list of delayed calls, plan efficient execution
This idea is useful for optimising many APIs:
Example: parallelising VTK (Beckmann, Kelly et al., LCPC ’05)
Example: fusing MPI collective communications (Field, Kelly, Hansen, EuroPar ’02)
Example: data alignment in parallel linear algebra (Beckmann & Kelly, LCR ’98)
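The delay–force–plan shape of the pattern can be sketched generically. Here the delayed “calls” are just integer operators and the “plan” trivially fuses them into a single pass; all names are illustrative, and a real DSI would instead aggregate remote calls or fuse MPI collectives as in the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Sketch of the Domain-Specific Interpreter pattern: API calls are
// recorded instead of executed; when a dependence forces execution,
// the recorded "program" is inspected and an optimised plan is run.
public class Dsi {
    private final List<IntUnaryOperator> delayed = new ArrayList<>();

    // Delay: record the call, execute nothing yet.
    public Dsi call(IntUnaryOperator op) { delayed.add(op); return this; }

    // Force: a dependence needs the value, so build and run a plan.
    public int force(int x) {
        // "Plan": fuse all delayed operations into one composite pass.
        IntUnaryOperator plan = delayed.stream()
            .reduce(IntUnaryOperator.identity(), IntUnaryOperator::andThen);
        delayed.clear();
        return plan.applyAsInt(x);
    }

    public static void main(String[] args) {
        int r = new Dsi().call(v -> v + 1).call(v -> v * 2).force(3);
        System.out.println(r); // (3 + 1) * 2 = 8
    }
}
```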
Observations… other DSOFs
We’re interested in API-specific optimisations
“Anti-pattern” rewriting
Commonly “heavyweight”, so some runtime overhead can be justified
But not all optimisations fit the Domain-Specific Interpreter pattern
E.g. the “SELECT *” anti-pattern:
Find all the uses of the result set
Find all the columns that might actually be used
Rewrite the query to select just the columns needed
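The three steps above can be sketched as a toy rewriter. Everything here is hypothetical: the regex over getString("...") calls stands in for the real dataflow analysis of result-set uses, and the string substitution stands in for proper query rewriting.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy "SELECT *" anti-pattern rewriter: scan the code that consumes
// the result set for the columns actually read, then narrow the
// query to just those columns.
public class SelectStarRewrite {
    static String rewrite(String query, String consumerCode) {
        // Step 1+2: find columns the consumer actually uses
        // (toy version: look for rs.getString("col") calls).
        Set<String> cols = new TreeSet<>();
        Matcher m = Pattern.compile("getString\\(\"(\\w+)\"\\)")
                           .matcher(consumerCode);
        while (m.find()) cols.add(m.group(1));
        // Step 3: rewrite the query to select just those columns.
        return query.replace("*", String.join(", ", cols));
    }

    public static void main(String[] args) {
        String consumer = "rs.getString(\"name\"); rs.getString(\"email\");";
        System.out.println(rewrite("SELECT * FROM users", consumer));
        // SELECT email, name FROM users
    }
}
```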
Conclusions and future directions
Implementation incomplete; needs to be embedded in an aspect language
Can deferred dataflow analysis work interprocedurally?
How would we derive where lp-fork aspects have to be deployed, in order to produce the dataflow data needed by a selected aspect?
Apply optimisation statically where possible
Represent optimisation more abstractly?
Composition metaprogramming:
Optimisation encapsulated as an aspect
Operates on code that composes functions from some API
Exploits component metadata
Software products
Our “Adon” (Adaptive Optimisation for .Net) library is available at http://www.doc.ic.ac.uk/~phjk/Software/Adon/
Adon can be used interactively using the Adon Browser
Or programmatically, for example to apply partial evaluation to specialize a method from your program
Programming with Adon: specialization
Allows us to extract and mess with any method of the running application’s code
// Get the representation for the method Example.Power
ILMethod method = CodeDatabase.GetMethod("Example.Power");

// Create a specialising transformation, specialising the second
// parameter of the transformed method to the integer value 3
SpecialisingTransformation transform = new SpecialisingTransformation();
transform.Specialise(method.Parameters[1], 3);

// Apply the transformation to Example.Power
transform.Apply(method);

// Generate the modified method
MethodInfo dynamicMethod = method.Generate();

// Invoke the new method
Console.Out.WriteLine(dynamicMethod.Invoke(null, new object[] { 2 }));
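For intuition, here is the effect such a specialisation is meant to have, written out by hand in Java. This is not output from the Adon library, just the intended result of fixing the exponent to 3 and unrolling the loop by constant propagation.

```java
// Hand-written illustration of method specialisation: Power with its
// second parameter fixed to 3, the loop fully unrolled.
public class PowerSpec {
    // Original, general method.
    static int power(int x, int n) {
        int r = 1;
        for (int i = 0; i < n; i++) r *= x;
        return r;
    }

    // Specialised version for n == 3: the loop bound is now a
    // constant, so the loop unrolls to three multiplications.
    static int power3(int x) {
        return x * x * x;
    }

    public static void main(String[] args) {
        System.out.println(power(2, 3)); // 8
        System.out.println(power3(2));   // 8 -- same result, no loop
    }
}
```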
The Adon Browser
Example: let’s mess with Bubblesort
The Browser GUI interfaces to the Adon library:
Browse and analyse your app’s bytecode
Apply selected analyses
Apply selected transformations