
The Problem with Threads

Based on work by Edward A. Lee (2006)

Presented by Leeor Peled, June 2010

Seminar in VLSI Architectures (048879)

Asynchronous computing

During this course, we learned how to design asynchronous logic, how to coordinate and time its elements, and how to build async elements, controllers and data paths.

It’s now time to investigate further layers of computing systems and see if we can utilize what we learned there.

[Diagram: abstraction levels, their timing problems, and their async solutions:
  Signal level – wire delays, gate delays – handshake protocols
  RTL/CL level – data dependency – CE's
  SOC level – clock skew – GALS
  SW domain – OS scheduling, interrupts, threads! – ?]

SW Parallelism

Most applications are serial; HW manipulates instruction/memory/data level parallelism: superscaling, OOO, vectorization (SIMD). Dependencies still limit the parallelism, and there is still a high penalty on memory access and IO.

Thread-level parallelism: software manipulation; on a high-latency stall, switch context. Good for multiple tasks (e.g. servers), but can we boost a single app?

Yes. Write concurrent code! But:
  Very hard to develop
  Bug prone
  Few software paradigms / programming models

SW Parallelism (cont.)

Interesting similarity between SW and HW:
  Asynchronous ≈ parallel? Faster, more efficient, but also non-deterministic.
  Various possibilities for the order of occurrence; we must be prepared for each.
  A race condition may occur between threads just like between signals.
So why not use similar methods?


Parallelism examples: Fine-Grain Parallelization
(Taken from Ginosar, “many-cores” slides)

Convert (independent) loop iterations

  for ( i=0; i<10000; i++ ) {
      a[i] = b[i]*c[i];
  }

into parallel tasks

  duplicable task XX(…) 10000 {
      ii = INSTANCE;
      a[ii] = b[ii]*c[ii];
  }

All tasks, or any subset, can be executed in parallel.

Linear Solver: Simulation snap-shots (Taken from Ginosar, “many-cores” slides)

Parallelism examples (cont.)

Unfortunately, not all applications are “embarrassingly parallel”. In reality we employ various “design patterns” that were thoroughly investigated (and are available in libs).

Producer-Consumer model:

  procedure producer() {
      while (true) {
          item = produceItem()
          if (itemCount == BUFFER_SIZE) { sleep() }
          putItemIntoBuffer(item)
          itemCount = itemCount + 1
          if (itemCount == 1) { wakeup(consumer) }
      }
  }

  procedure consumer() {
      while (true) {
          if (itemCount == 0) { sleep() }
          item = removeItemFromBuffer()
          itemCount = itemCount - 1
          if (itemCount == BUFFER_SIZE - 1) { wakeup(producer) }
          consumeItem(item)
      }
  }
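Note that the pseudocode above contains the classic lost-wakeup race: itemCount is tested and updated without mutual exclusion, so a wakeup can be sent between the test and the sleep. In Java, a bounded BlockingQueue packages the counting, blocking and wakeups correctly; a sketch (item values and counts are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumer {
    static final int BUFFER_SIZE = 10;
    static final BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(BUFFER_SIZE);
    static long consumedSum = 0; // read only after join(), which orders the write

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> {
            try {
                for (int item = 0; item < 100; item++)
                    buffer.put(item);             // blocks when full: replaces sleep()/wakeup()
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int n = 0; n < 100; n++)
                    consumedSum += buffer.take(); // blocks when empty
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        producer.start(); consumer.start();
        producer.join(); consumer.join();
        System.out.println(consumedSum); // 4950 (= 0 + 1 + ... + 99)
    }
}
```

All the delicate test-then-sleep logic of the slide lives inside put() and take(), where it is done under a lock.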

Producer-Consumer visualization

http://www.eonclash.com/Tutorials/Multithreading/MartinHarvey1.1/Ch9.html

Looks familiar?

Threads: problem statement

Real workloads must work very hard to sync concurrent code. The following example shows the problem with unprotected access:

  A:            B:
  St [x],1      S = ld [x]
  St [y],1      T = ld [y]
                Print S,T

Serial: functions A and B can be called in any order. Possible outputs are 0,0 and 1,1.
Concurrent: 0,1 is also possible (and what about 1,0?). How would the program react?

Design issues: memory ordering, coherency, consistency, debugability.
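The example can be reproduced in Java with plain, non-volatile fields; a sketch. Under the Java memory model nothing orders the two stores against the two loads, so even the counter-intuitive 1,0 outcome is permitted (declaring x and y volatile would forbid it):

```java
public class StoreLoadRace {
    // Plain (non-volatile) fields: the JMM permits reordering.
    static int x = 0, y = 0;
    static int s, t;

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> { x = 1; y = 1; }); // A: St [x],1 ; St [y],1
        Thread b = new Thread(() -> { s = x; t = y; }); // B: S = ld [x] ; T = ld [y]

        a.start(); b.start();
        a.join(); b.join();
        // Any of 0,0  0,1  1,1 may appear; under the JMM, even 1,0 is legal,
        // since the stores (or the loads) may be reordered.
        System.out.println(s + "," + t);
    }
}
```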

Threads: problem statement (cont.)

Invalid results are bad, but some problems are worse: deadlocks, livelocks.

Example: the observer pattern (in Java). What’s the problem?
(myListeners can be modified by addListener while setValue is iterating over it: unprotected shared access.)

  public class ValueHolder {
      public void addListener(listener) {…}
      public void setValue(newValue) {
          myValue = newValue;
          for (int i = 0; i < myListeners.length; i++) {
              myListeners[i].valueChanged(newValue);
          }
      }
  }

Threads: problem statement (cont.)

First fix: synchronize both methods. What’s the problem?
(valueChanged is now invoked while the ValueHolder lock is held; if a listener’s callback tries to acquire another lock held by a thread that is waiting on this one, the program deadlocks.)

  public class ValueHolder {
      public synchronized void addListener(listener) {…}
      public synchronized void setValue(newValue) {
          myValue = newValue;
          for (int i = 0; i < myListeners.length; i++) {
              myListeners[i].valueChanged(newValue);
          }
      }
  }

Threads: problem statement (cont.)

Second fix: hold the lock only while updating the value and copying the listener list, then call back outside it. What’s the problem?
(No deadlock now, but a listener may still be notified after another thread has removed it; arguing that this is acceptable requires subtle reasoning.)

  public synchronized void addListener(listener) {…}
  public void setValue(newValue) {
      synchronized(this) {
          myValue = newValue;
          listeners = myListeners.clone();
      }
      for (int i = 0; i < listeners.length; i++) {
          listeners[i].valueChanged(newValue);
      }
  }

[Diagram: a listener callback reaching into another synchronizing object]
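For reference, the copy-then-notify idea of the last variant is what java.util.concurrent.CopyOnWriteArrayList provides out of the box; a sketch reusing the slides’ names (the Listener interface and getValue are illustrative additions):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class ValueHolder {
    interface Listener { void valueChanged(int newValue); }

    private final List<Listener> myListeners = new CopyOnWriteArrayList<>();
    private volatile int myValue;

    public void addListener(Listener l) { myListeners.add(l); }

    public void setValue(int newValue) {
        myValue = newValue;
        // Iteration walks an immutable snapshot, so no lock is held while
        // calling out to listeners: the deadlock of the synchronized variant
        // cannot occur, at the cost of copying the list on every add.
        for (Listener l : myListeners) {
            l.valueChanged(newValue);
        }
    }

    public int getValue() { return myValue; }

    public static void main(String[] args) {
        ValueHolder holder = new ValueHolder();
        holder.addListener(v -> System.out.println("changed to " + v));
        holder.setValue(42); // prints "changed to 42"
    }
}
```

The stale-notification caveat of the clone variant still applies; the class only packages the idiom, it does not remove that subtlety.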

Threads: the bleak reality

[Venn diagram: all programmers ⊃ programmers who use threads ⊃ those who want to do it properly ⊃ the ones that are any good]

Threads: current methods

Currently, the only defenses against such problems are:

The technical aspect:
  Analyze software structure using dedicated tools (formal verification): Blast, Intel Thread Checker
  Use protected languages: Cilk, Split-C (also various SW TM flavors) with lock/sync semantics; Guava (private memory space for unsynced objects)
  Use predefined design patterns: transactions (DB), TM

The human aspect:
  Employ experienced programmers
  Apply a strict software design process (code reviews, debug sessions)
  Coding rules (lock acquisition order)

The business aspect: be prepared to recall and compensate often…

Parallel objects: solutions

Lee’s observation: it’s not concurrency that is inherently difficult, it’s the thread model!

Key issue: a thread shares everything, so everything might change under it between two atomic actions.
  Threads may interleave in any way (memory ordering has vast options)
  Any thread can change state visible to all other threads
  Parallel computation with threads can be shown to explode exponentially in the number of outcomes

Long, boring mathematical proof ahead…

But in fact, we usually only need to share a single message or data stream!

[Diagram: between times t0 and t1, shared state A may silently become A’]

Some math

Let:
  N = {0, 1, 2, 3, ...}, B = {0, 1}
  B* : the set of all finite bit sequences
  Bω = (N → B) : the set of all infinite bit sequences
  B** = B* ∪ Bω will represent the state of the computing machine
  Q : the set of (partial) functions (B** → B**)

An imperative machine M = (A, c) is composed of a finite set of atomic “instructions” A ⊆ Q, and a control function c : B** → N that represents how they are sequenced.

A “halt” instruction h ∊ A is defined by: ∀ b ∊ B**, h(b) = b.
A sequential program (of length m) is a function p : N → A, s.t. ∀ n ≥ m, p(n) = h.
The set of all programs is countably infinite (|P| = ℵ0).
An execution of p starts with b0 ∊ B**, and ∀ n ∊ N, b_{n+1} = p(c(b_n))(b_n).

Some math (cont’d)

Now, for multiple threads, we replace the program execution with:
  b_{n+1} = p_i(c(b_n))(b_n), i ∊ {1,2}

Each action is atomic, but for each step the active context i is determined arbitrarily (we assume no simultaneous execution, for simplicity). The correct notation is therefore:
  b_{n+1} = p_{i_n}(c(b_n))(b_n), i_n ∊ {1,2}

Let S : ({1..m} → {1,2}) be the set of context vectors (i_1, i_2, ..., i_m); then |S| = 2^m.

Interleaving thus leads to exponential growth in the number of possible outcomes, even for a given pair of programs and a fixed initial state.
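Spelled out, the counting step is (a sketch of the slide’s argument; Lee’s paper counts the distinct interleavings more carefully):

```latex
S \colon \{1,\dots,m\} \to \{1,2\}, \qquad |S| = 2^{m}
```

so after only m = 30 steps there are already 2^30 > 10^9 candidate schedules, each of which may drive the execution to a different state b_m.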

Further advantages of sequential programs:
  The sequence b_n is well defined.
  The function computed by the program is partially defined: defined for each input leading to a halt.
  p1 and p2 can be compared.

Multithreading makes all of these exponentially harder.

Parallel objects: solutions (cont.)

What other options do we have for activating multiple objects concurrently? Move from object-oriented design to actor-oriented design.

This is also similar to the async logic we discussed: each logical element is in charge of its own input/output. By comparison, the OO equivalent in VLSI would mean that the signals themselves are “responsible” for their own correct transfer.

Let us study the following four actor-oriented models of computation (MoCs):
  Rendezvous
  PN (process networks)
  SR (synchronous/reactive)
  DE (discrete events)

These MoCs are all alternatives of similar computational strength, but one may suit a given design pattern better than another.

Actor oriented design: Rendezvous

Based on the Reo coordination model. Same functionality as before; each actor (producer/consumer/observer) is a process (no more process per dataflow / data object).
Communication is through rendezvous.
  Producers are mutually exclusive (consumers are not)
  2 possible 3-way rendezvous possibilities
  Merge is now the only non-deterministic element
  No deadlocks, no consistency (value-ordering) problems
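In Java, SynchronousQueue behaves as a rendezvous channel: put() and take() each block until the partner arrives, so every transfer is a two-way handshake. A sketch (the single-message exchange is illustrative):

```java
import java.util.concurrent.SynchronousQueue;

public class RendezvousDemo {
    // A SynchronousQueue holds no items: every put() must meet a take(),
    // so each transfer is a rendezvous between the two processes.
    static final SynchronousQueue<String> channel = new SynchronousQueue<>();

    static String exchangeOnce() throws InterruptedException {
        Thread producer = new Thread(() -> {
            try {
                channel.put("token"); // blocks until a consumer is ready
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        producer.start();
        String msg = channel.take(); // blocks until the producer is ready
        producer.join();
        return msg;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(exchangeOnce()); // token
    }
}
```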

Actor oriented design: PN (process networks)

Based on the PN model of concurrency by Kahn & MacQueen (’77).
Communication is through streams: unbounded FIFOs with blocking reads.
Same benefits, plus: queuing allows the observer/consumer to operate at a different speed (unless we explicitly add a dependency), or to delay the observation indefinitely.
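A minimal two-actor Kahn-style network can be sketched with a LinkedBlockingQueue standing in for the (conceptually unbounded) FIFO; all names here are illustrative:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class KahnNetwork {
    static int sum = 0; // read only after join(), which orders the write

    public static void main(String[] args) throws InterruptedException {
        // The stream: an unbounded FIFO channel between the two actors.
        BlockingQueue<Integer> stream = new LinkedBlockingQueue<>();

        // Source actor: writes tokens into the stream at its own pace.
        Thread source = new Thread(() -> {
            try { for (int i = 1; i <= 5; i++) stream.put(i); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Sink actor: blocking reads (take) make the result deterministic
        // regardless of the relative speeds of the two processes.
        Thread sink = new Thread(() -> {
            try { for (int n = 0; n < 5; n++) sum += stream.take(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        source.start(); sink.start();
        source.join(); sink.join();
        System.out.println(sum); // 15
    }
}
```

Whatever scheduling the JVM picks, the sink always consumes the tokens in FIFO order, which is exactly the determinism Kahn networks guarantee.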

Actor oriented design: SR (synchronous/reactive)

Concept based on synchronous languages such as Esterel, SIGNAL and Lustre (mostly used for RT/embedded systems like aircraft control and nuclear plants).
Synchronous: time is an ordered sequence of instants; actual evaluation is assumed to take zero time, giving instant reactions.
Reactive: instants are initiated by environmental events (Harel/Pnueli); “when” is just as important as “what”.
At each clock tick, every signal is evaluated (iteratively if needed) or is absent.
Provides deterministic concurrency; events are ordered. A scheduler picks the order of evaluation (this may even be done at compile time, Edwards ’98). Mutual dependency is handled by iteration.
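A toy version of one synchronous instant, showing the iterate-until-stable idea with two dependent signals (this is only a sketch of fixed-point evaluation, not Esterel semantics; the signal equations are made up):

```java
import java.util.Arrays;

public class SyncReactiveTick {
    // One synchronous instant: evaluate all signal equations repeatedly
    // until no value changes (a naive least-fixed-point scheduler).
    static int[] tick(int x) {
        Integer y = null, z = null; // null plays the role of "absent"
        boolean changed = true;
        while (changed) {
            changed = false;
            // z depends on y, y depends on x; the naive schedule may try
            // z's equation before y is known, so it simply retries.
            if (y != null && (z == null || z != y * 2)) { z = y * 2; changed = true; }
            if (y == null || y != x + 1)               { y = x + 1; changed = true; }
        }
        return new int[]{y, z};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tick(3))); // [4, 8]
    }
}
```

A real SR compiler (Edwards ’98) would topologically sort the equations so each fires exactly once per tick; the iteration here only matters when the static order is unknown or the signals are mutually dependent.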

Actor oriented design: DE (discrete events)

Concept based on VHDL/Verilog or the OPNET network modeler: exact timing specification with rigorous semantics. Each event is timed and processed chronologically.
Merge (and the entire system) is deterministic. Unlike SR, here every evaluation takes a certain time delta, which is more realistic. However, the evaluation order might introduce non-determinism if not defined properly.
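The chronological processing at the heart of DE can be sketched as a timestamp-ordered priority queue (event names and times here are made up):

```java
import java.util.PriorityQueue;

public class DiscreteEventSim {
    static class Event {
        final double time; final String name;
        Event(double time, String name) { this.time = time; this.name = name; }
    }

    static String run() {
        // Global event queue ordered by timestamp: events are popped and
        // processed chronologically, which is what makes DE deterministic.
        PriorityQueue<Event> queue =
            new PriorityQueue<>((a, b) -> Double.compare(a.time, b.time));
        queue.add(new Event(2.0, "B"));
        queue.add(new Event(1.0, "A"));
        queue.add(new Event(1.5, "C"));

        StringBuilder trace = new StringBuilder();
        while (!queue.isEmpty()) {
            trace.append(queue.poll().name); // processed in time order
        }
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(run()); // ACB
    }
}
```

The residual non-determinism the slide mentions shows up when two events carry the same timestamp; real DE simulators break such ties with a rule (e.g. a delta index), which this sketch omits.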

The road ahead

Actor-oriented design is not new; various languages exist:
  CORBA event service (distributed push-pull)
  ROOM and UML-2 (dataflow, Rational, IBM)
  VHDL, Verilog (discrete events, Cadence, Synopsys, ...)
  LabVIEW (structured dataflow, National Instruments)
  Modelica (continuous-time, constraint-based, Linkoping)
  OPNET (discrete events, OPNET Technologies)
  SDL (process networks)
  Occam (rendezvous)
  Simulink (continuous-time, The MathWorks)
  SPW (synchronous dataflow, Cadence, CoWare)

However, most are domain specific, and the few general-purpose ones never caught on:
  Programmers don’t like new syntax
  Adding libs to existing languages is not enough (UML case study?)

Lee’s suggested solution is “coordination languages”: polymorphic objects from other languages, a general type system.

Ptolemy II design environment:
  Actors/components can be defined in C/C++, Java, Matlab, Python, Perl, ...
  Visual editor, abstract syntax
  Varying concurrency models

Models of computation in Ptolemy II:
  CI – push/pull component interaction
  Click – push/pull with method invocation
  CSP – concurrent threads with rendezvous
  Continuous – continuous-time modeling with fixed-point semantics
  CT – continuous-time modeling
  DDF – dynamic dataflow
  DE – discrete-event systems
  DDE – distributed discrete events
  DPN – distributed process networks
  FSM – finite state machines
  DT – discrete time (cycle driven)
  Giotto – synchronous periodic
  GR – 3-D graphics
  PN – process networks
  Rendezvous – extension of CSP
  SDF – synchronous dataflow
  SR – synchronous/reactive
  TM – timed multitasking

Actor oriented design: examples

Two implementations of sequential interleaving based on rendezvous; both are deterministic.
  Barrier allows a rendezvous to occur only when both inputs are ready.
  Buffer can rendezvous with its input OR with its output.
  Commutator chooses one input for rendezvous (round robin).


Conclusions

The bottom line from Lee’s work: instead of working with non-deterministic threads and attempting to prune this non-determinism, we should start with deterministic models and add non-determinism only where needed.

The problem in adopting this is still a lack of cooperation from users (same as with async VLSI design, in fact):
  Only general-purpose languages with no new syntax stand a chance
  A transparent solution (library-based, compiler, HW...) would be simpler to enforce

Can we take something back to the VLSI level?
  Some synchronization schemes can be built in HW (which?)
  Actor-oriented approach: are we there? Design methodology / tools?