Parallel Inclusion-based Points-to Analysis


Transcript of Parallel Inclusion-based Points-to Analysis

Page 1: Parallel Inclusion-based Points-to Analysis


Parallel Inclusion-based Points-to Analysis

Mario Méndez-Lojo Augustine Mathew

Keshav Pingali

The University of Texas at Austin (USA)

Page 2: Parallel Inclusion-based Points-to Analysis

Points-to analysis

• Static analysis technique
  – approximates the locations pointed to by a (pointer) variable
  – useful for optimization, verification…
• Dimensions
  – flow sensitivity
    What is the set of locations pointed to by x? x=&a; x=&b;
    flow sensitive: {b}, flow insensitive: {a,b}
  – context sensitivity
• Focus: context-insensitive + flow-insensitive solutions
  – inclusion-based, not unification-based
  – available in modern compilers (gcc, LLVM…)

Page 3: Parallel Inclusion-based Points-to Analysis

Inclusion-based points-to analysis

• First proposed by Andersen [Andersen thesis’94]
• Much research focused on performance improvements
  – heuristics for cycle detection [Fähndrich PLDI’98; Hardekopf PLDI’07]
  – offline preprocessing [Rountev PLDI’00]
  – better ordering [Magno CGO’09]
  – BDD-based representation [Lhoták PLDI’04]
  – …
• What about parallelization?
  – “future work” [Kahlon PLDI’07; Magno CGO’09; …]
  – never done before

Page 4: Parallel Inclusion-based Points-to Analysis

Parallel inclusion-based points-to analysis

• Challenges
  – highly irregular code
    • BDDs, sparse bit vectors, etc.
  – some phases of the algorithm are difficult to parallelize
    • SCC detection/DFS
• Contributions
  1. novel graph formulation
  2. parallelization of Andersen’s algorithm
     • exploits algorithmic structure
     • up to 5x speedup vs. Hardekopf & Lin’s state-of-the-art implementation [PLDI’07]

Page 5: Parallel Inclusion-based Points-to Analysis

Agenda

• inclusion-based points-to analysis → graph formulation
• parallel inclusion-based points-to analysis → efficient parallelization
• parallelization of irregular algorithms

Page 6: Parallel Inclusion-based Points-to Analysis

Agenda

• inclusion-based points-to analysis → graph formulation
• parallel inclusion-based points-to analysis → efficient parallelization
• parallelization of irregular algorithms

Page 7: Parallel Inclusion-based Points-to Analysis

Andersen’s algorithm for C programs

1. Extract pointer assignments: a=&b, a=b, a=*b, *a=b
2. Transform statements into set constraints
3. Iteratively solve the system of constraints (sketch below)

C code   name        constraint
a = &b   address-of  pts(a) ⊇ {b}
a = b    copy        pts(a) ⊇ pts(b)
a = *b   load        ∀v ∈ pts(b): pts(a) ⊇ pts(v)
*a = b   store       ∀v ∈ pts(a): pts(v) ⊇ pts(b)
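As a concrete illustration of steps 1–3 (not the authors' code; the Constraint record and helper names are invented), here is a minimal sequential solver in Java, run on the example from the next slides:

```java
import java.util.*;

// Minimal sequential sketch of Andersen-style constraint solving
// (illustration only: the Constraint record and names are invented).
class AndersenSketch {
    enum Kind { ADDRESS_OF, COPY, LOAD, STORE }

    // ADDRESS_OF: a = &b   COPY: a = b   LOAD: a = *b   STORE: *a = b
    record Constraint(Kind kind, String a, String b) {}

    static Map<String, Set<String>> solve(List<Constraint> cs) {
        Map<String, Set<String>> pts = new HashMap<>();
        for (Constraint c : cs) {                     // one pts set per variable
            pts.computeIfAbsent(c.a(), k -> new HashSet<>());
            pts.computeIfAbsent(c.b(), k -> new HashSet<>());
        }
        boolean changed = true;
        while (changed) {                             // iterate until fixpoint
            changed = false;
            for (Constraint c : cs) {
                switch (c.kind()) {
                    case ADDRESS_OF:                  // pts(a) ⊇ {b}
                        changed |= pts.get(c.a()).add(c.b());
                        break;
                    case COPY:                        // pts(a) ⊇ pts(b)
                        changed |= pts.get(c.a()).addAll(pts.get(c.b()));
                        break;
                    case LOAD:                        // ∀v ∈ pts(b): pts(a) ⊇ pts(v)
                        for (String v : List.copyOf(pts.get(c.b())))
                            changed |= pts.get(c.a())
                                          .addAll(pts.getOrDefault(v, Set.of()));
                        break;
                    case STORE:                       // ∀v ∈ pts(a): pts(v) ⊇ pts(b)
                        for (String v : List.copyOf(pts.get(c.a())))
                            changed |= pts.computeIfAbsent(v, k -> new HashSet<>())
                                          .addAll(pts.get(c.b()));
                        break;
                }
            }
        }
        return pts;
    }

    public static void main(String[] args) {
        List<Constraint> cs = List.of(                // a=&v; *a=b; b=x; x=&w;
            new Constraint(Kind.ADDRESS_OF, "a", "v"),
            new Constraint(Kind.STORE, "a", "b"),
            new Constraint(Kind.COPY, "b", "x"),
            new Constraint(Kind.ADDRESS_OF, "x", "w"));
        System.out.println(solve(cs));  // a={v} b={w} v={w} w={} x={w}
    }
}
```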

Page 8: Parallel Inclusion-based Points-to Analysis


Example

program

a=&v;

*a=b;

b=x;

x=&w;

Page 9: Parallel Inclusion-based Points-to Analysis


Example

program

a=&v;

*a=b;

b=x;

x=&w;

constraints

pts(a) ⊇ {v}
∀v ∈ pts(a): pts(v) ⊇ pts(b)
pts(b) ⊇ pts(x)
pts(x) ⊇ {w}

pts
a {}  b {}  v {}  w {}  x {}

Page 10: Parallel Inclusion-based Points-to Analysis


Example

program

a=&v;

*a=b;

b=x;

x=&w;

constraints

pts(a) ⊇ {v}
∀v ∈ pts(a): pts(v) ⊇ pts(b)
pts(b) ⊇ pts(x)
pts(x) ⊇ {w}

pts
a {v}  b {w}  v {w}  w {}  x {w}

Page 11: Parallel Inclusion-based Points-to Analysis

Constraint representation shortcomings

• Difficult to reason about the algorithm
  – separate representations
    • constraints
    • points-to sets
  – in parallel:
    • which constraints can be processed simultaneously?
    • which points-to sets can be modified simultaneously?
• Cycle collapsing complicates things
  – representative table
• Need a simpler representation

Page 12: Parallel Inclusion-based Points-to Analysis

Proposed graph representation

1. Extract pointer assignments: a=&b, a=b, a=*b, *a=b
2. Create initial constraint graph (sketch after the table)
   – nodes ≡ variables
   – edges ≡ statements
3. Apply graph rewrite rules until fixpoint (next slide)

C code   name        edge
a = &b   address-of  points-to edge a → b
a = b    copy        copy edge a → b
a = *b   load        load edge a → b
*a = b   store       store edge a → b
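A minimal sketch of such a graph in Java (invented class and method names; the typed-edge layout and directions are one plausible reading of the slide's figure, not necessarily the authors' exact encoding):

```java
import java.util.*;

// Sketch of the initial constraint graph: one node per variable, one edge
// kind per statement type, plus points-to edges added while solving.
class ConstraintGraph {
    enum EdgeKind { POINTS_TO, COPY, LOAD, STORE }

    // node -> (edge kind -> successor nodes)
    private final Map<String, EnumMap<EdgeKind, Set<String>>> edges = new HashMap<>();

    /** Adds the edge src -[kind]-> dst; returns false if already present. */
    boolean addEdge(String src, EdgeKind kind, String dst) {
        return edges.computeIfAbsent(src, k -> new EnumMap<>(EdgeKind.class))
                    .computeIfAbsent(kind, k -> new HashSet<>())
                    .add(dst);
    }

    Set<String> successors(String src, EdgeKind kind) {
        return edges.getOrDefault(src, new EnumMap<>(EdgeKind.class))
                    .getOrDefault(kind, Set.of());
    }

    public static void main(String[] args) {
        ConstraintGraph g = new ConstraintGraph();  // a=&v; *a=b; b=x; x=&w;
        g.addEdge("a", EdgeKind.POINTS_TO, "v");    // a = &v
        g.addEdge("a", EdgeKind.STORE, "b");        // *a = b
        g.addEdge("b", EdgeKind.COPY, "x");         // b = x
        g.addEdge("x", EdgeKind.POINTS_TO, "w");    // x = &w
        System.out.println(g.successors("a", EdgeKind.POINTS_TO)); // [v]
    }
}
```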

Page 13: Parallel Inclusion-based Points-to Analysis

Graph rewrite rules

• copy: given a copy edge a → b, for each points-to edge b → v add the points-to edge a → v
  (ensures pts(a) ⊇ pts(b))

Page 14: Parallel Inclusion-based Points-to Analysis

Graph rewrite rules

Example (copy rule: given a copy edge a → b, for each points-to edge b → v add the points-to edge a → v; ensures pts(a) ⊇ pts(b)):

program

b=&v;

a=&x;

a=b;

b=&w;

Page 15: Parallel Inclusion-based Points-to Analysis

Graph rewrite rules

• copy: given a copy edge a → b, for each points-to edge b → v add the points-to edge a → v
  (ensures pts(a) ⊇ pts(b))
• load: given a load edge a → b, for each points-to edge b → v add a copy edge a → v
  (ensures ∀v ∈ pts(b): pts(a) ⊇ pts(v))
• store: given a store edge a → b, for each points-to edge a → v add a copy edge v → b
  (ensures ∀v ∈ pts(a): pts(v) ⊇ pts(b))
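Putting the three rules together, here is a sequential sketch of the fixpoint loop (again with invented names and the edge conventions above; the talk's actual solver is the parallel version described next):

```java
import java.util.*;

// Sequential sketch of the rewrite rules running to fixpoint.
class RewriteSketch {
    static final Map<String, Set<String>> pts   = new HashMap<>(); // points-to edges
    static final Map<String, Set<String>> copy  = new HashMap<>(); // copy edges
    static final Map<String, Set<String>> load  = new HashMap<>(); // load edges
    static final Map<String, Set<String>> store = new HashMap<>(); // store edges

    static Set<String> out(Map<String, Set<String>> g, String n) {
        return g.computeIfAbsent(n, k -> new HashSet<>());
    }

    /** One pass over all rules; returns true if any rule added an edge. */
    static boolean applyRulesOnce() {
        boolean changed = false;
        for (String a : List.copyOf(copy.keySet()))        // copy a→b:
            for (String b : List.copyOf(out(copy, a)))     //   pts(a) ⊇ pts(b)
                changed |= out(pts, a).addAll(out(pts, b));
        for (String a : List.copyOf(load.keySet()))        // load a→b:
            for (String b : List.copyOf(out(load, a)))     //   ∀v ∈ pts(b):
                for (String v : List.copyOf(out(pts, b)))  //   new copy edge a→v
                    changed |= out(copy, a).add(v);
        for (String a : List.copyOf(store.keySet()))       // store a→b:
            for (String b : List.copyOf(out(store, a)))    //   ∀v ∈ pts(a):
                for (String v : List.copyOf(out(pts, a)))  //   new copy edge v→b
                    changed |= out(copy, v).add(b);
        return changed;
    }

    public static void main(String[] args) {
        out(pts, "a").add("v");      // a = &v
        out(store, "a").add("b");    // *a = b
        out(copy, "b").add("x");     // b = x
        out(pts, "x").add("w");      // x = &w
        while (applyRulesOnce()) { } // iterate until fixpoint
        System.out.println(pts);     // a=[v], b=[w], v=[w], x=[w]
    }
}
```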

Page 16: Parallel Inclusion-based Points-to Analysis


Example revisited

program

a=&v;

*a=b;

b=x;

x=&w;

constraints

pts(a) ⊇ {v}
∀v ∈ pts(a): pts(v) ⊇ pts(b)
pts(b) ⊇ pts(x)
pts(x) ⊇ {w}

init:
pts
a {}  b {}  v {}  w {}  x {}

solve:
pts
a {v}  b {w}  v {w}  w {}  x {w}

Page 17: Parallel Inclusion-based Points-to Analysis

Advantages of graph formulation

• Solving process entirely expressed as graph rewriting
• Merging can be easily incorporated
  – equivalent edge
  – new rules
• Leverage existing techniques for parallelizing graph algorithms [next few slides]

Page 18: Parallel Inclusion-based Points-to Analysis

Agenda

• inclusion-based points-to analysis → graph formulation
• parallel inclusion-based points-to analysis → efficient parallelization
• parallelization of irregular algorithms

Page 19: Parallel Inclusion-based Points-to Analysis

Graph algorithms – Galois approach

• Active node
  – node where computation is needed
  – Andersen: a node violating a rule’s invariant
• Activity
  – application of certain code to an active node
  – Andersen: a rewrite rule
• Neighborhood
  – set of nodes/edges read/written by an activity
  – Andersen: the 3 nodes involved in a rule
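For illustration, the pattern can be approximated with plain java.util.concurrent primitives; this is NOT the Galois API, and the termination test is deliberately simplified (it ignores activities still in flight):

```java
import java.util.List;
import java.util.concurrent.*;

// Simplified illustration of the amorphous data-parallel pattern that the
// Galois system automates: threads pull active nodes from a shared worklist,
// run the activity on each, and push any nodes that become active again.
class WorklistSketch {
    interface Activity {
        /** Applies a rewrite rule at node n; returns nodes it activated. */
        List<String> apply(String n);
    }

    static void run(List<String> initiallyActive, Activity activity, int threads)
            throws InterruptedException {
        BlockingQueue<String> worklist = new LinkedBlockingQueue<>(initiallyActive);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++)
            pool.execute(() -> {
                String n;
                while ((n = worklist.poll()) != null)    // grab an active node
                    worklist.addAll(activity.apply(n));  // may activate others
            });
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```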

Page 20: Parallel Inclusion-based Points-to Analysis


Parallelization of graph algorithms

Page 21: Parallel Inclusion-based Points-to Analysis


Parallelization of graph algorithms

Page 22: Parallel Inclusion-based Points-to Analysis


Parallelization of graph algorithms

• Correct parallel execution
  – neighborhoods do not overlap → activities can be executed in parallel
  – baseline conflict detection policy
• Implementation
  – use speculation
  – each node has an associated exclusive abstract lock
  – graph operations → acquire locks on read/written nodes
  – lock already owned → conflict → activity rolled back (sketch below)
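A sketch of what the baseline policy amounts to, using one ReentrantLock per node (invented names; the Galois runtime performs this lock management and rollback automatically):

```java
import java.util.*;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative baseline policy: every node carries an abstract lock; an
// activity try-locks the nodes in its neighborhood, and if any lock is
// already owned, it releases what it holds and is retried later (rollback).
class AbstractLocks {
    static final Map<String, ReentrantLock> locks = new HashMap<>();

    static ReentrantLock lockOf(String node) {
        synchronized (locks) {   // lazily create the per-node abstract lock
            return locks.computeIfAbsent(node, k -> new ReentrantLock());
        }
    }

    /** Runs the rule if all neighborhood locks are free; otherwise rolls back. */
    static boolean tryActivity(List<String> neighborhood, Runnable rule) {
        List<ReentrantLock> held = new ArrayList<>();
        for (String n : neighborhood) {
            ReentrantLock l = lockOf(n);
            if (!l.tryLock()) {                        // conflict detected
                held.forEach(ReentrantLock::unlock);   // roll back: release all
                return false;                          // activity retried later
            }
            held.add(l);
        }
        try { rule.run(); return true; }  // disjoint neighborhoods -> safe
        finally { held.forEach(ReentrantLock::unlock); }
    }
}
```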

Page 23: Parallel Inclusion-based Points-to Analysis

Parallelizing Andersen’s algorithm

• Baseline conflict detection
  – activity acquires 3 locks, processes the rule
  – conflict when rules sharing nodes are processed simultaneously
• Correct but too restrictive

[figure: activity 1 adds p-edge <a,v>; activity 2 adds p-edge <x,v>]

Page 24: Parallel Inclusion-based Points-to Analysis

Optimal conflict detection

• Avoid abstract locks on read nodes
  – edges are never removed from the graph

[figure: activity 1 adds p-edge <a,v>; activity 2 adds p-edge <x,v>]

Page 25: Parallel Inclusion-based Points-to Analysis

Optimal conflict detection

• Avoid abstract locks on read nodes
  – edges are never removed from the graph
• Avoid abstract locks on written nodes
  – edge additions commute with each other
  – the concrete implementation guarantees consistency
• No conflicts → no abstract locks → no rollbacks
  – speculation not necessary!

[figure: activity 1 adds p-edge <b,v>; activity 2 adds p-edge <b,v>]
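The observation in miniature (illustrative, not the paper's data structures): set insertion is commutative and idempotent, so two activities adding the same p-edge produce the same graph in any interleaving, with no locks and no rollbacks:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Two concurrent activities add the p-edge <b,v>; because set insertion
// commutes and is idempotent, the result is identical in either order.
class CommutingAdds {
    // points-to successors of node b, shared by all threads
    static final Set<String> ptsOfB = ConcurrentHashMap.newKeySet();

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> ptsOfB.add("v"));  // activity 1 adds <b,v>
        Thread t2 = new Thread(() -> ptsOfB.add("v"));  // activity 2 adds <b,v>
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(ptsOfB);  // [v] regardless of interleaving
    }
}
```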

Page 26: Parallel Inclusion-based Points-to Analysis


Implementation

• Implemented in Galois system [Kulkarni PLDI’07]

– graph implementations, scheduling policies, etc.
  – conflict detection turned off → no speculation overheads

• Key data structures
  – binary decision diagram (BDD)
    • points-to edges
    • lock-based hash set
  – sparse bit vector
    • copy/store/load edges
    • lock-free linked list

• Download at http://www.ices.utexas.edu/~marioml
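A simplified sketch of an insertion-only lock-free list in the spirit of those edge lists (assumption: the real sparse bit vector keeps sorted word/bits pairs; this version merely prepends, which is safe precisely because edges are never removed):

```java
import java.util.concurrent.atomic.AtomicReference;

// Insertion-only, lock-free edge list: a CAS loop prepends new nodes, and
// since nothing is ever removed, readers never need locks either.
class LockFreeEdgeList {
    static final class Node {
        final int dst; final Node next;
        Node(int dst, Node next) { this.dst = dst; this.next = next; }
    }

    final AtomicReference<Node> head = new AtomicReference<>(null);

    /** Adds an edge to dst; returns false if it was already present. */
    boolean add(int dst) {
        while (true) {
            Node h = head.get();
            for (Node n = h; n != null; n = n.next)   // membership check
                if (n.dst == dst) return false;
            // CAS fails and retries if another thread prepended concurrently
            if (head.compareAndSet(h, new Node(dst, h))) return true;
        }
    }
}
```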

Page 27: Parallel Inclusion-based Points-to Analysis

Results: runtimes

• Intel Xeon machine, 8 cores
  – (our) sequential vs. parallel
  – whole analysis
  – JVM 1.6, 64 bits, Linux 2.6.30
• Input: suite of C programs
  – gcc: 120K vars, 156K stmts
  – tshark: 1500K vars, 1700K stmts
• Low parallelization overheads
  – not more than 30%
• Good scalability
  – ↑ cores → ↓ runtime

[charts: analysis runtime in seconds vs. number of threads (1, 2, 4, 6, 8) for gcc (y-axis 0–1,600 sec.) and tshark (y-axis 0–70,000 sec.)]

Page 28: Parallel Inclusion-based Points-to Analysis

Results: speedups

• Reference (sequential) analysis
  – Hardekopf & Lin [PLDI’07, SAS’07]
  – written in C++
  – state-of-the-art, publicly available implementation
• Xeon machine, 8 cores
  – reference: phase within LLVM 2.6
  – parallel: standalone, JVM 1.6
• Speedup wrt the C++ version
  – whole analysis
  – 2-5x
  – can be > 1 with 1 thread (tshark)
• Sequential phases limit speedup
  – SCC detection, BDD init, etc.

[bar chart: speedup wrt the sequential C++ implementation (0–5x) for gcc, vim, svn, pine, php, mplayer, gimp, linux, and tshark, with 1 thread and 8 threads; annotations for two benchmarks of ≈150K vars/150K stmts report sequential fractions of 7% (1 thread) / 26% (8 threads) and 9% (1 thread) / 32% (8 threads)]

Page 29: Parallel Inclusion-based Points-to Analysis


Conclusions

• Inclusion-based points-to analysis
  – widely used technique
• Contributions
  1. novel graph representation
  2. first parallelization of a points-to algorithm
     • correctness: exploit the graph abstraction
     • efficiency: exploit the algorithm’s structure
• Good results
  – 2-5x speedup wrt the state-of-the-art

Page 30: Parallel Inclusion-based Points-to Analysis


Thank you!

implementation + slides available at http://www.ices.utexas.edu/~marioml