Software Performance Optimisation Group
Frameworks for domain-specific optimization at run-time
Paul Kelly (Imperial College London)
Joint work with
Kwok Cheung Yeung
Milos Puzovic
September 2005
Where we’re coming from…
I lead the Software Performance Optimisation group within Computer Systems.
Stuff I’d love to talk about another time:
Scalable interactive fluid-flow visualisation
FPGA and GPU accelerators
Bounds-checking for C, links with unchecked code
Is Morton-order layout for 2D arrays competitive?
Efficient algorithms for scalable pointer alias analysis
Domain-specific optimisation frameworks
Instant-access cycle stealing
Proxying in CC-NUMA cache-coherence protocols – adaptive randomisation and combining
[Map: Dept of Computing, Imperial College London – near Hyde Park, the Albert Hall, the Science Museum and the V & A]
Mission statement
Extend optimising compiler technology to challenging contexts beyond the scope of conventional compilers
Component-based software: cross-component optimisation
For example, in distributed systems:
Optimisation across network boundaries
Between different security domains
Maintaining proper semantics in the event of failures
Emergent mission (“mission creep”): design a domain-specific optimisation “plug-in” architecture for the compiler/VM
Abstraction
Most performance improvement opportunities come from adapting components to their context.
Most performance improvement measures break abstraction boundaries.
So the goal of performance programming tool support is to get performance without making a mess of your code.
Optimisations are cross-cutting…
Slogan: Optimisations are features
and features can be separately deployable, separately marketable components, or aspects
How can this be made to work?
Open compilers
Idea: implement optimisation features as compiler passes
Need to design open plug-in architecture for inserting new optimization passes
Some interesting issues in how to design extensible intermediate representation
Feature composition raises research issues:
Interference: can we verify that feature A doesn’t interfere with feature B?
Phase ordering problem – which should come first? Can feature B benefit from feature A’s program analysis?
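The pass plug-in idea can be sketched in a few lines. This is an illustrative toy, not the group’s actual architecture: the “IR” is just a string, each optimisation feature registers as a rewriting pass, and phase ordering is simply registration order – which makes the interference and ordering questions above concrete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy open compiler: optimisation features plug in as passes over an IR.
// Here the IR is a plain String; a real system would expose an
// extensible intermediate representation.
public class OpenCompiler {
    private final List<UnaryOperator<String>> passes = new ArrayList<>();

    // A new optimisation feature installs itself as a pass.
    public void register(UnaryOperator<String> pass) { passes.add(pass); }

    // Run all registered passes in registration order (phase ordering!).
    public String compile(String ir) {
        for (UnaryOperator<String> p : passes) ir = p.apply(ir);
        return ir;
    }

    public static void main(String[] args) {
        OpenCompiler c = new OpenCompiler();
        c.register(ir -> ir.replace("x*1", "x"));   // feature A: drop *1
        c.register(ir -> ir.replace("x+x", "2*x")); // feature B: strength-reduce
        // Feature B only fires because feature A ran first:
        System.out.println(c.compile("y=x*1+x*1")); // y=2*x
    }
}
```

Note that swapping the two registrations changes the result: feature B never sees the pattern feature A exposes, which is exactly the phase-ordering problem.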
Open virtual machines
How about an open optimizing VM?
Fresh issues:
Dynamic installation of optimisation features?
Open access to instrumentation/profiling
Exploit opportunity to use dynamic information as well as static analysis
This talk has three parts:
Motivating example
A framework for deploying optimisations as separately-deployable features, or components
Support for optimisations that integrate static analysis with dynamic information
Project strategy
Implement aggregation optimisation for .Net Remoting
Do it with lower overheads than our Java version
To do it, build general-purpose tools for:
Domain-specific optimisation
Run-time optimisation
Results so far:
“Reflective” dataflow analysis framework
“Optimisations as aspects” framework prototype
Plugin architecture for “domain-specific optimisation features” (DSOFs)
Elementary Remoting aggregation works, with excellent performance
Aggregating remote calls
void m(RemoteObject r, int a)
{
    int x = r.f(a);
    int y = r.g(a, x);
    int z = r.h(a, y);
    System.Console.WriteLine(z);
}
[Diagram: client–server message exchanges over the network for calls f, g, h – without and with aggregation]
Aggregation: a sequence of calls to the same server can be executed in a single message exchange
Reduces the number of messages
Also reduces the amount of data transferred
Common parameters; results passed from one call to another
Non-trivial correctness issues; see Yoshida & Ahern, OOPSLA ’05
[Diagram: six messages without aggregation; two messages with aggregation, with no need to copy x and y back]
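The message-count saving can be sketched as follows. All names here (Aggregator, delay, flush) are hypothetical and not part of any real remoting API: the client queues calls destined for the same server and ships the whole sequence in one request/reply pair, resolving intermediate results server-side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of call aggregation: instead of one
// request/reply pair per remote call, queue calls to the same server
// and execute the whole sequence in a single message exchange.
public class Aggregator {
    private final List<String> delayed = new ArrayList<>();

    // Record a call instead of sending it immediately.
    public void delay(String methodAndArgs) { delayed.add(methodAndArgs); }

    // Force execution: the whole queue travels as one request and the
    // final result comes back in one reply. Returns messages used.
    public int flush() {
        int messages = delayed.isEmpty() ? 0 : 2; // 1 request + 1 reply
        delayed.clear();
        return messages;
    }

    public static void main(String[] args) {
        Aggregator agg = new Aggregator();
        agg.delay("f(a)");   // like the slide's x = r.f(a)
        agg.delay("g(a,x)"); // x resolved server-side, never shipped back
        agg.delay("h(a,y)"); // y resolved server-side
        // Unaggregated: 3 calls x 2 messages = 6; aggregated:
        System.out.println(agg.flush()); // 2
    }
}
```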
Real-world benchmarks…
Simple example: Multi-user Dungeon (from Flanagan’s Java Examples in a Nutshell)
“Look” method:
String mudname = p.getServer().getMudName();
String placename = p.getPlaceName();
String description = p.getDescription();
Vector things = p.getThings();
Vector names = p.getNames();
Vector exits = p.getExits();
Seven aggregated calls. Time taken to execute “look”:

                           Ethernet   ADSL
Without call aggregation:     5.4ms   759.6ms
With call aggregation:        5.8ms   164.9ms
Speedup:                       0.93      4.61
Client: Athlon XP 1800+
Servers: Pentium III 500MHz, 650MHz and dual 700MHz
Linux, Sun JDK 1.4.1_01 (Hotspot)
Network: Ethernet: 10.03 MB/s, ping 0.1ms, DSL: 10.7KB/s, ping 98ms
Mean of 3 trials of 1000 iterations each
Call aggregation: our first implementation
Each fragment carries use/def and liveness info:
Y can be executed before X, but p must be copied
Z cannot be delayed, because p is printed
int m() {
    while (p < N) {
        q += x1.m1(p);
        p = 0;
        p += x2.m2(p);
        System.out.println(p);
    }
    return p;
}
The calls x1.m1 and x2.m2 are possibly remote.

Fragment:   X          Y     Z        B2
Defs:       {q}        {p}   {p}      {}
Uses:       {x1,p,q}   {}    {x2,p}   {p}
[CFG: Fragment W: p<N; Fragment X: q += x1.m1(p); Fragment Y: p = 0; Fragment Z: p += x2.m2(p); println(p); return p]
The “Veneer” virtual JVM intercepts class loading and fragments each method. An interpretive “executor” inspects the local fragment following each remote call.
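The delay/reorder decision rests on a standard def/use interference test. A sketch (not the Veneer implementation; the sets follow the table above) might look like this:

```java
import java.util.Set;

// Sketch of the def/use interference test behind fragment reordering:
// fragment B may run before fragment A only if no definition in one
// fragment touches a variable the other reads or writes
// (no RAW, WAR or WAW dependence).
public class DepCheck {
    static boolean intersects(Set<String> a, Set<String> b) {
        return a.stream().anyMatch(b::contains);
    }

    // True if B can be executed before A without changing the result.
    static boolean canReorder(Set<String> defsA, Set<String> usesA,
                              Set<String> defsB, Set<String> usesB) {
        return !intersects(defsA, usesB)   // RAW
            && !intersects(usesA, defsB)   // WAR
            && !intersects(defsA, defsB);  // WAW
    }

    public static void main(String[] args) {
        // Fragment X: defs {q}, uses {x1,p,q}; Fragment Y: defs {p}, uses {}.
        // Y defines p, which X uses (WAR on p): p must be copied first.
        System.out.println(canReorder(Set.of("q"), Set.of("x1", "p", "q"),
                                      Set.of("p"), Set.of()));   // false
        // After copying p to porig (so X reads porig), reordering is legal:
        System.out.println(canReorder(Set.of("q"), Set.of("x1", "porig", "q"),
                                      Set.of("p"), Set.of()));   // true
    }
}
```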
Call aggregation: our first implementation
At this point, the executor has collected a sequence of delayed remote calls (fragments X and Z)
But execution is now forced by the need to print
Now we can inspect the delayed fragments and construct an optimised execution plan
If x1 and x2 are on the same server, send an aggregate call
If x1 and x2 are on different servers, send an execution plan to x2’s server, telling it to invoke x1.m1(porig) on x1’s server
[CFG after reordering: porig = p; Fragment Y (p = 0) executed first; Fragments X (q += x1.m1(porig)) and Z (p += x2.m2(p)) are delayed; execution is forced at println(p)]
Aggregation with conditional forcing
Runtime optimisation is justified for optimising “heavyweight” operations
In this example aggregation is valid if x > y
If we intercept the fork we can find out whether aggregation will be valid
Original “Veneer” implementation intercepts all control-flow forks in methods containing aggregation opportunities
We need a better analysis, that pays overheads only when a benefit might occur
Deferred DFA – motivating example
[Figure: control-flow fork whose outcome is known at run-time]
Deferred data-flow analysis. Shamik Sharma, Anurag Acharya and Joel Saltz. UC Santa Barbara tech report TRCS98-38
Identifies “lossy”, “predictable” control-flow forks
Rescues data-flow information thrown away by conservative analysis, by deferring the meet operation
Generates data-flow summary functions for the regions between forks
Uses predicted control flow to “stitch” together a summary function for the actual path, using the worklist algorithm
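The “stitching” step can be illustrated with classic gen/kill transfer functions. This is a generic sketch of summary-function composition, not code from the Sharma/Acharya/Saltz system: each region’s effect is f(S) = gen ∪ (S \ kill), and composing f2 after f1 gives gen = gen2 ∪ (gen1 \ kill2) and kill = kill1 ∪ kill2, so a whole predicted path collapses to one summary applied at run-time.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of stitching data-flow summary functions along a predicted
// path. A region's effect on a fact set is the gen/kill transfer
// function f(S) = gen UNION (S MINUS kill).
public class Summary {
    final Set<String> gen, kill;
    Summary(Set<String> gen, Set<String> kill) { this.gen = gen; this.kill = kill; }

    // Summary of executing `this` region, then `next`.
    Summary then(Summary next) {
        Set<String> g = new HashSet<>(gen);
        g.removeAll(next.kill);   // next's kills cancel our gens
        g.addAll(next.gen);
        Set<String> k = new HashSet<>(kill);
        k.addAll(next.kill);
        return new Summary(g, k);
    }

    Set<String> apply(Set<String> in) {
        Set<String> out = new HashSet<>(in);
        out.removeAll(kill);
        out.addAll(gen);
        return out;
    }

    public static void main(String[] args) {
        Summary f1 = new Summary(Set.of("d1"), Set.of("d0"));
        Summary f2 = new Summary(Set.of("d2"), Set.of("d1"));
        // Applying the stitched path summary equals applying f1 then f2:
        System.out.println(f1.then(f2).apply(Set.of("d0")));    // [d2]
        System.out.println(f2.apply(f1.apply(Set.of("d0"))));   // [d2]
    }
}
```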
DSOFs
Domain-specific optimisation features
Need a framework to plug the components into
What does the framework need to achieve?
Cross-cutting
Separately deployable
Query language to select target sites
Static access to dataflow/dependence information
Dynamic access to dataflow/dependence information
Let’s start with AOP…
public class RemoteCallDSOF : Loom.Aspect {
    private OpDomains opDomains = new OpDomains();
    private DelayedCalls delayedCalls = new DelayedCalls();
    private Set delayedCallsDef = new HashedSet();

    public RemoteCallDSOF(DDFAAnalysis analysis) {
        this.opDomains = analysis.getOpDomains();
    }

    [Loom.ConnectionPoint.IncludeAll]
    [Loom.Call(Invoke.Instead)]
    public object AnyMethod(object[] args) {
        OpDomain thisOpDomain = opDomains.getOpDomain(Context.MethodName);
        OpNode opNode = thisOpDomain.OpNode;
        Set opNodeDef = opNode.getDefs();
        Set opDomainUse = thisOpDomain.getUses();
        if ((((Set)(opNodeDef & opDomainUse)).Count > 0)
            || (((Set)(opDomainUse & delayedCallsDef)).Count > 0)) {
            // Dependence detected: force the delayed calls, then invoke directly
            delayedCalls.Execute();
            object ret = Context.Invoke(args);
            return ret;
        } else {
            delayedCalls.Add(Context.MethodName, args);
        }
        if (!opDomains.hasNext()) {
            object ret = delayedCalls.Execute();
            return ret;
        }
        return null;
    }
}
getOpDomain() function stitches together summary functions
thisOpDomain.getUses() function returns all variables that are used within the op-domain
opNode.getDefs() function returns all variables that are defined by op-node
RMI aggregation DSOF, based on the Loom aspect weaver
Static part of pointcut
Dynamic part of pointcut, refers to dataflow properties of control flow that can be predicted from this point
public Double[] vectorAddition(DDFAAnalysis analysis, int size) {
    Double[] v1 = new Double[size];
    Double[] v2 = new Double[size];

    ArrayAdder adder = new ArrayAdder();

    Double[] ret1 = adder.Add(v1, v2);
    Double[] ret2 = adder.Add(ret1, v2);
    Double[] ret3 = adder.Add(ret2, v2);
    Double[] ret4 = adder.Add(ret3, v2);

    return ret4;
}
Includes four consecutive calls to the same remote object
There is a data dependency between the calls
Remote call aggregation benchmark
Remote call aggregation benchmark
ILMethod method = CodeDatabase.GetMethod(new Function(Example.adder));
DDFAAnalysis analysis = new DDFAAnalysis();
analysis.Apply(method);

public Double[] vectorAddition(DDFAAnalysis analysis, int size) {
    Double[] v1 = new Double[size];
    Double[] v2 = new Double[size];

    RemoteCallDSOF opt = new RemoteCallDSOF(analysis);
    IAdder adder = (IAdder) Loom.Weaver.CreateInstance(typeof(ArrayAdder), null, opt);

    Double[] ret1 = adder.Add(v1, v2);
    Double[] ret2 = adder.Add(ret1, v2);
    Double[] ret3 = adder.Add(ret2, v2);
    Double[] ret4 = adder.Add(ret3, v2);

    return ret4;
}
We deploy DSOF using Loom aspect weaver
When adder is created, DSOF is interposed
Slightly clunky…
Aspect intercepts control flow at potential remote call sites
Accesses results of static dataflow analysis
Uses values of variables to determine whether future control flow will allow aggregation
Performance results
Loopback device (3GHz Pentium 4, .Net V1.1)
Modem, ping time 156.2ms (client: 1.2GHz Pentium 4, server: 2.6GHz Pentium 4, .Net V1.1)
Very preliminary results
Vector addition benchmark
Substantial speedup even on a fast loopback connection
By avoiding the interpretive mechanism, overheads are smaller than in our Java implementation
Observations… VM
No change to the VM: not needed for our work so far
Though a more powerful dynamic interposition mechanism (i.e. an aspect weaver) would be good
More ambitiously: access VM’s dataflow analysis?
Access and control the VM’s instrumentation – via a dynamic aspect weaver?
Observations… AOP
What is the function of the aspect weaver here?
Type-safe binary rewriting
Pointcut language goes some way towards providing open access to intermediate representation
We have built a reflective dataflow analysis library to extend this somewhat
Observations… DSI
Our scheme for aggregating remote calls is an example of a “Domain-Specific Interpreter” pattern:
Delay execution of calls
Execution of delayed calls is eventually forced by a dependence
Inspect the list of delayed calls, plan efficient execution
This idea is useful for optimising many APIs:
Example: parallelising VTK (Beckmann, Kelly et al., LCPC ’05)
Example: fusing MPI collective communications (Field, Kelly, Hansen, EuroPar ’02)
Example: data alignment in parallel linear algebra (Beckmann & Kelly, LCR ’98)
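The delay–force–plan shape of the pattern can be sketched generically. Here the delayed “calls” are just integer operators and the “plan” trivially fuses them into a single pass; all names are illustrative, and a real DSI would instead aggregate remote calls or fuse MPI collectives as in the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Sketch of the Domain-Specific Interpreter pattern: API calls are
// recorded instead of executed; when a dependence forces execution,
// the recorded "program" is inspected and an optimised plan is run.
public class Dsi {
    private final List<IntUnaryOperator> delayed = new ArrayList<>();

    // Delay: record the call, execute nothing yet.
    public Dsi call(IntUnaryOperator op) { delayed.add(op); return this; }

    // Force: a dependence needs the value, so build and run a plan.
    public int force(int x) {
        // "Plan": fuse all delayed operations into one composite pass.
        IntUnaryOperator plan = delayed.stream()
            .reduce(IntUnaryOperator.identity(), IntUnaryOperator::andThen);
        delayed.clear();
        return plan.applyAsInt(x);
    }

    public static void main(String[] args) {
        int r = new Dsi().call(v -> v + 1).call(v -> v * 2).force(3);
        System.out.println(r); // (3 + 1) * 2 = 8
    }
}
```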
Observations… other DSOFs
We’re interested in API-specific optimisations
“Anti-pattern” rewriting
Commonly “heavyweight”, so some runtime overhead can be justified
But not all optimisations fit the Domain-Specific Interpreter pattern
E.g. the “SELECT *” anti-pattern:
Find all the uses of the result set
Find all the columns that might actually be used
Rewrite the query to select just the columns needed
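The three steps above can be sketched as a toy rewriter. Everything here is hypothetical: the regex over getString("...") calls stands in for the real dataflow analysis of result-set uses, and the string substitution stands in for proper query rewriting.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy "SELECT *" anti-pattern rewriter: scan the code that consumes
// the result set for the columns actually read, then narrow the
// query to just those columns.
public class SelectStarRewrite {
    static String rewrite(String query, String consumerCode) {
        // Step 1+2: find columns the consumer actually uses
        // (toy version: look for rs.getString("col") calls).
        Set<String> cols = new TreeSet<>();
        Matcher m = Pattern.compile("getString\\(\"(\\w+)\"\\)")
                           .matcher(consumerCode);
        while (m.find()) cols.add(m.group(1));
        // Step 3: rewrite the query to select just those columns.
        return query.replace("*", String.join(", ", cols));
    }

    public static void main(String[] args) {
        String consumer = "rs.getString(\"name\"); rs.getString(\"email\");";
        System.out.println(rewrite("SELECT * FROM users", consumer));
        // SELECT email, name FROM users
    }
}
```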
Conclusions and future directions
Implementation incomplete; needs to be embedded in an aspect language
Can deferred dataflow analysis work interprocedurally?
How would we derive where lp-fork aspects have to be deployed, in order to produce the dataflow data needed by a selected aspect?
Apply optimisation statically where possible
Represent optimisation more abstractly?
Composition metaprogramming:
Optimisation encapsulated as an aspect
Operates on code that composes functions from some API
Exploits component metadata
Software products
Our “Adon” (Adaptive Optimisation for .Net) library is available at http://www.doc.ic.ac.uk/~phjk/Software/Adon/
Adon can be used interactively using the Adon Browser
Or programmatically, for example to apply partial evaluation to specialize a method from your program
Programming with Adon: specialization
Allows us to extract and mess with any method of the running application’s code
// Get the representation for the method Example.Power
ILMethod method = CodeDatabase.GetMethod("Example.Power");

// Create a specialising transformation, specialising the second
// parameter of the transformed method to the integer value 3
SpecialisingTransformation transform = new SpecialisingTransformation();
transform.Specialise(method.Parameters[1], 3);

// Apply the transformation to Example.Power
transform.Apply(method);

// Generate the modified method
MethodInfo dynamicMethod = method.Generate();

// Invoke the new method
Console.Out.WriteLine(dynamicMethod.Invoke(null, new object[] { 2 }));
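For intuition, here is the effect such a specialisation is meant to have, written out by hand in Java. This is not output from the Adon library, just the intended result of fixing the exponent to 3 and unrolling the loop by constant propagation.

```java
// Hand-written illustration of method specialisation: Power with its
// second parameter fixed to 3, the loop fully unrolled.
public class PowerSpec {
    // Original, general method.
    static int power(int x, int n) {
        int r = 1;
        for (int i = 0; i < n; i++) r *= x;
        return r;
    }

    // Specialised version for n == 3: the loop bound is now a
    // constant, so the loop unrolls to three multiplications.
    static int power3(int x) {
        return x * x * x;
    }

    public static void main(String[] args) {
        System.out.println(power(2, 3)); // 8
        System.out.println(power3(2));   // 8 -- same result, no loop
    }
}
```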
The Adon Browser
Example: let’s mess with Bubblesort
The Browser GUI interfaces to the Adon library:
Browse and analyse your app’s bytecode
Apply selected analyses
Apply selected transformations