Deterministic Execution of Nondeterministic Shared-Memory Programs


Page 1: Deterministic Execution of Nondeterministic Shared-Memory Programs

Deterministic Execution of Nondeterministic Shared-Memory Programs

Dan Grossman

University of Washington

Dagstuhl Seminar on Design and Validation of Concurrent Systems

August 2009

Page 2: Deterministic Execution of Nondeterministic Shared-Memory Programs


What if…

What if you could run the same multithreaded program on the same inputs twice and know you would get the same results?

• What exactly does that mean?
• Why might you want that?
• How can we do that (semi-efficiently)?

But first:
– Some background on me and “the talks I’m not giving”
– Key terminology and perspectives

• More important than technical details at this event

Page 3: Deterministic Execution of Nondeterministic Shared-Memory Programs


Biography / group names

Me:
• “Programming-languages person”
• Type systems, compilers for a memory-safe C dialect, 2000-2004
• 30% → 80% focus on multithreading, 2005-
• Co-advising 3-4 students with computer architect Luis Ceze, 2007-

Two groups for “marketing purposes”:
• WASP, wasp.cs.washington.edu
• SAMPA, sampa.cs.washington.edu

Page 4: Deterministic Execution of Nondeterministic Shared-Memory Programs


The talk you won’t see

void transferFrom(int amt, Acct other) {
  atomic {
    other.withdraw(amt);
    this.deposit(amt);
  }
}

“Transactions are to shared-memory concurrency as garbage collection is to memory management” [OOPSLA 07]

Semantic problems with nontransactional accesses: worse than locks!
– Fix with stronger guarantees and compiler opts [PLDI07]
– Or static type system, formal semantics, and proof [POPL08]
– Or more dynamic approach adapting to Haskell [submitted]
– …

Prototypes for OCaml, Java, Scheme, and Haskell

Page 5: Deterministic Execution of Nondeterministic Shared-Memory Programs


This talk…

Take an arbitrary C/C++ program with POSIX threads
– Locks, barriers, condition variables, data races, whatever

Compile it funny

Link it against a funny run-time system

Get deterministic behavior
– Well, as deterministic as a sequential C program

Joint work: Luis Ceze, Tom Bergan, Joe Devietti, Owen Anderson

Page 6: Deterministic Execution of Nondeterministic Shared-Memory Programs


Terminology

Essential perspectives, not just definitions

• Parallelism vs. concurrency
– Or different terms if you prefer

• Sequential semantics vs. determinism vs. nondeterminism
– What is an input?

• Level of abstraction
– Which one do you care about?

Page 7: Deterministic Execution of Nondeterministic Shared-Memory Programs


Concurrency

Working “definition”:

Software is concurrent if a primary intellectual challenge is responding to external events from multiple sources in a timely manner.

Examples: operating system, shared hashtable, version control

Key challenge is responsiveness – often leads to threads or asynchrony

Correctness usually requires synchronization (e.g., locks)

Page 8: Deterministic Execution of Nondeterministic Shared-Memory Programs


Parallelism

Working “definition”:

Software is parallel if a primary intellectual challenge is using extra computational resources to do more useful work per unit time.

Examples: scientific computing, most graphics, a lot of servers

Key challenge is Amdahl’s Law
– No sequential bottlenecks, no imbalanced load

When pure fork-join isn’t correct, need synchronization

Page 9: Deterministic Execution of Nondeterministic Shared-Memory Programs


The confusion

• First, this use of terms isn’t standard

• Many systems are both
– And it’s really a matter of degree

• Similar lower-level mechanisms, such as threads and locks
– And similar errors (race conditions, deadlocks, etc.)

• Our work determinizes these lower-level mechanisms, so we determinize concurrent and parallel applications
– But purely parallel ones probably benefit less

Page 10: Deterministic Execution of Nondeterministic Shared-Memory Programs


Terminology

Essential perspectives, not just definitions

• Parallelism vs. concurrency
– Or different terms if you prefer

• Sequential semantics vs. determinism vs. nondeterminism
– What is an input?

• Level of abstraction
– Which one do you care about?

Page 11: Deterministic Execution of Nondeterministic Shared-Memory Programs


Sequential semantics

• Some languages can have results defined purely sequentially, but are designed to have better parallel-performance guarantees (thanks to a cost model)
– Examples: DPJ, Cilk, NESL, …

• For correctness, reason sequentially
• For performance, reason in parallel

• Really designed for parallelism, not concurrency

• Not our work

Page 12: Deterministic Execution of Nondeterministic Shared-Memory Programs


Sequential isn’t always deterministic

[Surprisingly easy to forget this]

int f1() { print("A"); print("B"); return 0; }

int f2() { print("C"); print("D"); return 0; }

int g() { return f1() + f2(); }

Must g() print ABCD?
• Java: yes
• C/C++: no, CDAB allowed, but not ACBD, ACDB, etc.

Page 13: Deterministic Execution of Nondeterministic Shared-Memory Programs


Another example

Dijkstra’s guarded-command conditionals:

if x % 2 == 1 -> y := x - 1

[] x < 10 -> y := 7

[] x >= 10 -> y := 0

fi

We might still expect a particular language implementation (compiler) to be deterministic
– May choose any deterministic result consistent with the nondeterministic semantics
– Presumably doesn’t change choice across executions, but may across compiles (including “butterfly effects”)
– Our work does this
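
As an illustration (mine, not from the slides) of such a deterministic choice: a compiler may refine the guarded command above by testing guards in textual order and taking the first one that holds.

#include <stdio.h>

/* One deterministic refinement of the guarded-command conditional:
   test guards in textual order, take the first true one. For odd x
   below 10 two guards hold; this version always picks the first, but
   always picking the second would be an equally valid deterministic
   choice. */
static int choose_y(int x) {
    if (x % 2 == 1) return x - 1;
    if (x < 10)     return 7;
    return 0;       /* x >= 10 */
}

int main(void) {
    printf("%d %d %d\n", choose_y(3), choose_y(4), choose_y(12)); /* 2 7 0 */
    return 0;
}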

Page 14: Deterministic Execution of Nondeterministic Shared-Memory Programs


Why helpful?

So the programmer gets a deterministic executable, but doesn’t know which one
– Key degree of freedom for automated performance tuning

Still helpful for:
– Whole-program testing and debugging
– Automated replicas
– In general, repeatability and reducing possible executions

Page 15: Deterministic Execution of Nondeterministic Shared-Memory Programs


Define deterministic, part 1

Deterministic: “outputs depend only on inputs”

• That’s right, but that means we must clearly specify what is an input (and an output)
– Can define away anything you want
– Example: All syscall results are inputs, so seeding the pseudorandom number generator with time-of-day is “deterministic”

• We mean what you think we mean
– Inputs: command-line, I/O, syscalls
– Not inputs: cache state, hardware timing, thread scheduler
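
A small hypothetical illustration of how far “define away anything you want” goes: if every value crossing the syscall boundary is logged as an input, then even time-of-day seeding is “deterministic”, since rerunning with the same logged inputs reproduces the run. (The logging shim is illustrative, not part of the actual system.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical shim: the time-of-day result is nondeterministic at
   the OS level, but once recorded as a program input, the run is
   "deterministic" under the inputs-include-syscalls definition. */
static FILE *input_log;

static long logged_time(void) {
    long t = (long)time(NULL);       /* nondeterministic at the OS level */
    fprintf(input_log, "%ld\n", t);  /* ...but recorded as a program input */
    return t;
}

int main(void) {
    input_log = fopen("inputs.log", "w");
    if (!input_log) return 1;
    srand((unsigned)logged_time());  /* "deterministic" given the log */
    printf("%d\n", rand());
    fclose(input_log);
    return 0;
}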

Page 16: Deterministic Execution of Nondeterministic Shared-Memory Programs


Terminology

Essential perspectives, not just definitions

• Parallelism vs. concurrency
– Or different terms if you prefer

• Sequential semantics vs. determinism vs. nondeterminism
– What is an input?

• Level of abstraction
– Which one do you care about?

Page 17: Deterministic Execution of Nondeterministic Shared-Memory Programs


Define deterministic, part 2

“Is it deterministic?” depends crucially on your abstraction level
– Another obvious easy-to-forget thing

Examples:
• File systems
• Memory allocation (Java vs. C)
• Set implemented as a list
• Quantum mechanics

Our work:
• The “language level”: state of logical memory, program output
• Application may care only about a higher level (future work)

Page 18: Deterministic Execution of Nondeterministic Shared-Memory Programs


Okay… how?

Trade-off between complexity and performance:

[Diagram: trade-off curve with axes PERFORMANCE and COMPLEXITY]

Performance:
– Overhead (single-thread slowdown)
– Scalability (minimize extra synchronization, waiting)

Page 19: Deterministic Execution of Nondeterministic Shared-Memory Programs


Starting serial

Determinization is easy!
– Run one thread at a time in round-robin order
– Context-switch after N basic blocks for a deterministic N
• Cannot use a timer; use the compiler and run-time (see the sketch below)
– Races in the source program are irrelevant; locks are still respected

Example with 3 threads running (time moves with arrows):

[Diagram: threads T1, T2, T3 each execute a column of loads and stores; one quantum is one thread’s run of N blocks, and one round is one quantum from each thread in order]
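
A minimal sketch (names are mine, not the actual system’s) of that compiler/run-time split: every basic block subtracts a rough cost from a thread-local counter, so the quantum ends at the same program point on every run, which a hardware timer could never guarantee.

enum { QUANTUM_SIZE = 100000 };   /* N, in rough instruction-cost units */

static _Thread_local int quantum_left = QUANTUM_SIZE;

/* Stub for the determinizing run-time's scheduler hook; the real one
   blocks this thread until its next turn in the round-robin order. */
static void runtime_end_quantum(void) { }

/* The compiler inserts this call in every basic block. */
static void quantum_tick(int block_cost) {
    quantum_left -= block_cost;       /* rough costs keep quanta balanced */
    if (quantum_left <= 0) {
        quantum_left = QUANTUM_SIZE;
        runtime_end_quantum();        /* deterministic switch point */
    }
}

int main(void) {
    for (int i = 0; i < 250000; i++)
        quantum_tick(1);              /* e.g., one cost unit per block */
    return 0;
}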

Page 20: Deterministic Execution of Nondeterministic Shared-Memory Programs


Parallel quanta

• The quanta in a round can start to run in parallel, provided they stop before any communication occurs (see how next)
– So each round has two stages, parallel then serial

[Diagram: T1, T2, T3 run their quanta concurrently; the parallel stage ends with a global barrier, then the serial stage ends and the next round starts]
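
A minimal runnable sketch (thread bodies are placeholders, not the real run-time) of this round structure using POSIX barriers: each round is a parallel stage ended by a global barrier, followed by serial turns in fixed thread order, so every communication point is deterministic.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 3
#define NROUNDS  2

static pthread_barrier_t bar;

static void parallel_quantum(int tid) { (void)tid; /* no shared-state access here */ }
static void serial_turn(int tid)      { printf("serial turn: thread %d\n", tid); }

static void *worker(void *arg) {
    int tid = (int)(long)arg;
    for (int round = 0; round < NROUNDS; round++) {
        parallel_quantum(tid);              /* parallel stage */
        pthread_barrier_wait(&bar);         /* parallel stage ends with global barrier */
        for (int turn = 0; turn < NTHREADS; turn++) {
            if (turn == tid)
                serial_turn(tid);           /* one thread at a time, fixed order */
            pthread_barrier_wait(&bar);     /* wait for this turn to finish */
        }
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}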

Page 21: Deterministic Execution of Nondeterministic Shared-Memory Programs


Is that legal?

– Can produce a different result than serial execution
– In fact, the execution is not necessarily equivalent to any serialization of quanta

But it doesn’t matter as long as we are deterministic! Just need:
• Parallel stages do no communication
• Parallel stages end at deterministic points


Page 22: Deterministic Execution of Nondeterministic Shared-Memory Programs


Performance

Keys to scalability:

1. Run almost everything in the parallel stage

2. Keep quanta balanced
– Assume (1), use rough instruction costs


Page 23: Deterministic Execution of Nondeterministic Shared-Memory Programs


Memory ownership

To avoid communication during the parallel stage:
• Every memory location is “shared” or “owned by 1 thread T”
– A dynamic table is checked and updated during execution
• Can read only memory that is shared or owned-by-you
• Can write only memory owned-by-you
• Locks: just like memory locations + blocking ends the quantum

In our example, perhaps A is shared and B and C are owned by T2 (see the sketch below)

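
A hypothetical sketch of the inserted checks (table layout, granularity, and names are mine): in the parallel stage a load may proceed only if the location is shared or owned-by-self, a store only if owned-by-self; anything else must wait for the serial stage, ending the quantum.

#include <stdint.h>

enum { SHARED = -1 };
#define TABLE_SIZE (1u << 20)

/* Ownership table, one entry per 8-byte location (hash-indexed here
   for brevity; any deterministic granularity is correct). */
static int owner_of[TABLE_SIZE];

static _Thread_local int my_tid;   /* set by the run-time at thread start */

/* Stub: the real run-time blocks here until the serial stage. */
static void wait_for_serial_stage(void) { }

static unsigned slot(const void *addr) {
    return (unsigned)(((uintptr_t)addr >> 3) & (TABLE_SIZE - 1));
}

static void check_load(const void *addr) {
    int o = owner_of[slot(addr)];
    if (o != SHARED && o != my_tid)
        wait_for_serial_stage();   /* owned by another thread: defer */
}

static void check_store(const void *addr) {
    if (owner_of[slot(addr)] != my_tid)
        wait_for_serial_stage();   /* writes require exclusive ownership */
}

int main(void) {
    int x;
    owner_of[slot(&x)] = my_tid;   /* first allocator becomes the owner */
    check_store(&x); x = 42;       /* owned-by-self: stays in the parallel stage */
    check_load(&x);
    return x - 42;
}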

Page 24: Deterministic Execution of Nondeterministic Shared-Memory Programs


Changing ownership

Policy: For each location (any deterministic granularity is correct),
• First owner is the first thread to allocate the location
• On read in the serial stage, if owned-by-other, set to shared
• On write in the serial stage, set to owned-by-self

Correctness:
1. Ownership is immutable in parallel stages (so no communication)
2. Serial-stage changes are deterministic

So many, many policies are correct
– We chose the obvious one for temporal locality + read-sharing
– Must have good locality for scalability!
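
The transition policy itself is tiny; a sketch (operating on one entry of the hypothetical ownership table above), with the key point that these updates run only in the serial stage, in a fixed thread order, so the resulting table is itself deterministic.

enum { SHARED = -1 };

static void on_alloc(int *entry, int tid) { *entry = tid; }  /* first allocator owns */

static void on_serial_read(int *entry, int tid) {
    if (*entry != SHARED && *entry != tid)
        *entry = SHARED;            /* read by a non-owner: becomes shared */
}

static void on_serial_write(int *entry, int tid) {
    *entry = tid;                   /* any write: owned by the writer */
}

int main(void) {
    int entry;                      /* ownership word for one location */
    on_alloc(&entry, 1);            /* thread 1 allocates: owner = 1 */
    on_serial_read(&entry, 2);      /* thread 2 reads: becomes SHARED */
    on_serial_write(&entry, 2);     /* thread 2 writes: owner = 2 */
    return entry - 2;
}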

Page 25: Deterministic Execution of Nondeterministic Shared-Memory Programs


Overhead

Significant overhead:
– All reads/writes consult ownership information
– All basic blocks subtract from a thread-local quantum counter

Reduce via:
– Lots of run-time engineering and data structures (not too much magic, but most important)
– Obvious compiler optimizations like escape analysis and hoisting counter-subtractions
– Specialized compiler optimizations like the Subsequent Access Optimization: don’t recheck the same ownership unless a quantum boundary might intervene
• Correctness of this is a subtle argument and slightly affects the ownership-change policy (deterministically!)
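
A hypothetical before/after of the Subsequent Access Optimization on a read-modify-write (the check functions are stand-ins for the ownership checks sketched earlier): no quantum boundary can intervene inside the block, and the store check (owned-by-self) subsumes the load check, so one check covers both accesses.

static void check_load(const void *addr)  { (void)addr; }  /* stub; see ownership sketch */
static void check_store(const void *addr) { (void)addr; }

/* Without the optimization, every access consults the table: */
static void incr_unoptimized(int *p) {
    check_load(p);  int tmp = *p;
    check_store(p); *p = tmp + 1;
}

/* With it: ownership cannot change between the two accesses, so the
   single (stronger) store check suffices. */
static void incr_optimized(int *p) {
    check_store(p);
    *p = *p + 1;
}

int main(void) {
    int x = 0;
    incr_unoptimized(&x);
    incr_optimized(&x);
    return x - 2;
}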

Page 26: Deterministic Execution of Nondeterministic Shared-Memory Programs


Brittle

Change any line of code, command-line argument, environment variable, etc. and you can get a different deterministic program

We are mostly robust to memory-safety errors, except:
– Bounds errors that corrupt ownership information
– Bounds errors that write to another thread’s allegedly-thread-local data

Page 27: Deterministic Execution of Nondeterministic Shared-Memory Programs


Results

Overhead: Varies a lot, but about 3x at 8 threads

Scalability: Varies a lot, but on average with the PARSEC suite (*):

nondet 8 threads vs. nondet 2 threads = 2.4 (linear = 4)

det 8 threads vs. det 2 threads = 2.0

det 8 threads vs. nondet 2 threads = 0.91 (range 0.41 - 2.75)

“How do you want to spend Moore’s Dividend?”

* subset runnable: no MPI, no C++ exceptions, no 32-bit assumptions

Page 28: Deterministic Execution of Nondeterministic Shared-Memory Programs


Buffering

Actually, ownership is only one approach

The second approach relies on buffering and a commit stage (see the sketch below)
• Even higher overhead (to consult buffers)
• Even better scalability (block only for synchronization & commits)

And a third hybrid approach

Hopefully more details soon
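
A hypothetical sketch of the buffering idea (data structure and names are mine): stores are deferred to a thread-local buffer, loads consult that buffer before shared memory, and buffers commit one thread at a time in a fixed order, so commits are the only blocking points besides synchronization.

#include <stdint.h>

#define BUF_CAP 1024

typedef struct { intptr_t *addr; intptr_t val; } BufferedWrite;

static _Thread_local BufferedWrite buf[BUF_CAP];
static _Thread_local int buf_len;

static void buffered_store(intptr_t *addr, intptr_t val) {
    buf[buf_len].addr = addr;              /* deferred, not yet visible */
    buf[buf_len].val  = val;
    buf_len++;                             /* real code would handle overflow */
}

static intptr_t buffered_load(intptr_t *addr) {
    for (int i = buf_len - 1; i >= 0; i--) /* newest matching write wins */
        if (buf[i].addr == addr) return buf[i].val;
    return *addr;                          /* fall through to shared memory */
}

/* Runs with threads serialized in a fixed, deterministic order. */
static void commit_buffer(void) {
    for (int i = 0; i < buf_len; i++)
        *buf[i].addr = buf[i].val;
    buf_len = 0;
}

int main(void) {
    intptr_t x = 0;
    buffered_store(&x, 7);
    intptr_t seen = buffered_load(&x);     /* 7, served from the buffer */
    commit_buffer();                       /* now x == 7 in shared memory */
    return (int)(seen - x);                /* 0 */
}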

Page 29: Deterministic Execution of Nondeterministic Shared-Memory Programs


Conclusion

The fundamental assumption that nondeterministic shared-memory programs must be run nondeterministically is false

A fun problem to throw principled compiler and run-time optimizations at.

Could dramatically change how we test and debug parallel and concurrent programs

Most-related work:
– Kendo from MIT: done concurrently (in parallel?), requires knowing about data races statically, different approach
– Colleagues in ASPLOS09: hardware support for ownership
– Record & replay systems: we can replay without the record