Jeremy Denham April 7, 2008. Motivation Background / Previous work Experimentation Results ...


Page 2:

Motivation
Background / Previous work
Experimentation
Results
Questions

Page 3:

Modern processor design trends are primarily concerned with the multi-core design paradigm.
  Still figuring out what to do with them
  Different way of thinking about “shared-memory multiprocessors”
  Distributed apps?
Synchronization will be important.

Page 4:

Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, Mellor-Crummey & Scott 1991.
  Scalable, busy-wait synchronization algorithms
  No memory or interconnect contention
  O(1) remote references per mechanism utilization
  Spin locks and barriers

Page 5: Spin locks

“Spin” on lock by busy-waiting until available.

Typically involves “fetch-and-Φ” operations

Must be atomic!

Page 6: Simple test-and-set lock

“Test-and-set”
  Needs processor support to make it atomic
  “fetch-and-store”: xchg works in x86
Loop until lock is possessed
Expensive!
  Frequently accessed, too
  Networking issues
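The loop-until-possessed idea can be sketched in a few lines. This is a minimal illustration in Python, not the code used in the experiments. Python has no user-level atomic exchange, so the internal mutex `_atomic` is an assumption that stands in for the hardware guarantee an instruction such as x86 xchg provides. The names `TASLock` and `run_tas_demo` are hypothetical.

```python
import threading
import time


class TASLock:
    """Test-and-set spin lock (sketch)."""

    def __init__(self):
        self._held = False
        # Assumption: this mutex emulates hardware atomicity (e.g. x86 xchg).
        self._atomic = threading.Lock()

    def _test_and_set(self):
        # Atomically store True and return the previous value.
        with self._atomic:
            old = self._held
            self._held = True
            return old

    def acquire(self):
        # Every retry touches the shared flag again, which is the costly part.
        while self._test_and_set():
            time.sleep(0)  # yield to other Python threads while spinning

    def release(self):
        # The store must not interleave with an emulated test-and-set.
        with self._atomic:
            self._held = False


def run_tas_demo(threads=4, iterations=50):
    """Hypothetical driver: threads increment a shared counter under the lock."""
    lock = TASLock()
    total = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            total[0] += 1  # critical section
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total[0]
```

All waiters hammer the same flag, so on a real multiprocessor each failed attempt generates interconnect traffic.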

Page 7: Ticket lock

Can reduce fetch-and-Φ ops to one per lock acquisition
FIFO service guarantee
Two counters
  Requests
  Releases
fetch_and_increment request counter
Wait until release counter reflects turn
Still problematic…
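The two-counter scheme is the ticket lock from the Mellor-Crummey & Scott paper. A minimal Python sketch follows. The mutex `_atomic` is an assumption that emulates an atomic fetch_and_increment, and the names `TicketLock`, `next_ticket`, `now_serving`, and `run_ticket_demo` are hypothetical.

```python
import threading
import time


class TicketLock:
    """Ticket lock (sketch): one atomic operation per acquisition, FIFO order."""

    def __init__(self):
        self.next_ticket = 0   # request counter
        self.now_serving = 0   # release counter
        # Assumption: this mutex emulates a hardware fetch_and_increment.
        self._atomic = threading.Lock()

    def _fetch_and_increment(self):
        with self._atomic:
            ticket = self.next_ticket
            self.next_ticket += 1
            return ticket

    def acquire(self):
        my_ticket = self._fetch_and_increment()  # the single fetch-and-Φ op
        # Every waiter polls the same counter: the remaining source of contention.
        while self.now_serving != my_ticket:
            time.sleep(0)
        return my_ticket

    def release(self):
        # Only the current holder writes this counter.
        self.now_serving += 1


def run_ticket_demo(threads=4, iterations=25):
    """Hypothetical driver: record the order in which tickets are served."""
    lock = TicketLock()
    served = []

    def worker():
        for _ in range(iterations):
            ticket = lock.acquire()
            served.append(ticket)  # critical section
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return served
```

The served list comes out in ticket order, which is the FIFO guarantee. The shared `now_serving` counter is why the approach is still problematic: all waiters spin on one location.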

Page 8: Array-based queuing lock

T.E. Anderson
Incoming processes put themselves in the queue
Lock holder hands off the lock to next in queue
Faster than ticket, but more space
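A rough sketch of the array-based queue idea, in which each waiter spins on its own slot and the holder hands the lock to the next slot. It assumes at most `max_threads` threads contend at once. The mutex `_atomic` emulates fetch_and_increment, and all names are hypothetical.

```python
import threading
import time


class ArrayQueueLock:
    """Array-based queuing lock (sketch). Space grows with max_threads."""

    def __init__(self, max_threads):
        self.size = max_threads
        self.has_lock = [False] * max_threads
        self.has_lock[0] = True        # the first arrival may proceed
        self.next_slot = 0
        # Assumption: this mutex emulates a hardware fetch_and_increment.
        self._atomic = threading.Lock()

    def acquire(self):
        with self._atomic:
            my_slot = self.next_slot % self.size
            self.next_slot += 1
        while not self.has_lock[my_slot]:   # spin on a private slot
            time.sleep(0)
        self.has_lock[my_slot] = False      # reset the slot for later reuse
        return my_slot

    def release(self, my_slot):
        # Hand the lock to whoever waits on the next slot.
        self.has_lock[(my_slot + 1) % self.size] = True


def run_array_demo(threads=4, iterations=25):
    """Hypothetical driver: threads increment a shared counter under the lock."""
    lock = ArrayQueueLock(threads)
    total = [0]

    def worker():
        for _ in range(iterations):
            slot = lock.acquire()
            total[0] += 1  # critical section
            lock.release(slot)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total[0]
```

The per-lock array is the extra space cost compared with the two counters of the ticket lock.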

Page 9: MCS lock

FIFO guarantee
Local spinning!
Small constant amount of space
Cache coherence a non-issue

Page 10: MCS lock, mechanism

Each processor allocates a record
  next link
  boolean flag
Adds to queue
Spins locally
Owner passes lock to next user in queue as necessary
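The record-and-queue mechanism can be sketched as follows. This is an illustrative Python rendering of the list-based queue lock, not the ported code. The mutex `_atomic` is an assumption that emulates the atomic fetch_and_store and compare_and_swap on the tail pointer, and the names `QNode`, `MCSLock`, and `run_mcs_demo` are hypothetical.

```python
import threading
import time


class QNode:
    """Per-thread record: a next link and a boolean flag."""

    def __init__(self):
        self.next = None
        self.locked = False


class MCSLock:
    def __init__(self):
        self.tail = None
        # Assumption: this mutex emulates atomic operations on the tail pointer.
        self._atomic = threading.Lock()

    def _fetch_and_store(self, node):
        with self._atomic:
            old, self.tail = self.tail, node
            return old

    def _compare_and_swap(self, expected, new):
        with self._atomic:
            if self.tail is expected:
                self.tail = new
                return True
            return False

    def acquire(self, node):
        node.next = None
        node.locked = True
        pred = self._fetch_and_store(node)   # add own record to the queue
        if pred is not None:
            pred.next = node
            while node.locked:               # spin on own flag only
                time.sleep(0)

    def release(self, node):
        if node.next is None:
            if self._compare_and_swap(node, None):
                return                       # nobody was waiting
            while node.next is None:         # a successor is still linking in
                time.sleep(0)
        node.next.locked = False             # pass the lock to the next record


def run_mcs_demo(threads=4, iterations=50):
    """Hypothetical driver: threads increment a shared counter under the lock."""
    lock = MCSLock()
    total = [0]

    def worker():
        node = QNode()  # one record per thread, reused across acquisitions
        for _ in range(iterations):
            lock.acquire(node)
            total[0] += 1  # critical section
            lock.release(node)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total[0]
```

Each waiter polls only its own `locked` flag, which is the local spinning property. The extra bookkeeping in `release` hints at the procedural overhead that matters on a small number of cores.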

Page 11: Barriers

Mechanism for “phase separation”

Block processes from proceeding until all others have reached a checkpoint

Designed for repetitive use

Page 12: Centralized sense-reversing barrier

“Local” and “global” sense
As processor arrives
  Reverse local sense
  Signal its arrival
  If last, reverse global sense
  Else spin
Lots of spinning…
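The arrival steps map almost line for line onto code. A minimal Python sketch, assuming a fixed number of participating threads. The mutex `_atomic` emulates an atomic decrement of the arrival counter, and the names `SenseReversingBarrier` and `run_barrier_demo` are hypothetical.

```python
import threading
import time


class SenseReversingBarrier:
    """Centralized barrier with local and global sense (sketch)."""

    def __init__(self, parties):
        self.parties = parties
        self.count = parties
        self.sense = False                  # global sense
        self._local = threading.local()     # holds each thread's local sense
        # Assumption: this mutex emulates an atomic fetch_and_decrement.
        self._atomic = threading.Lock()

    def wait(self):
        # Reverse local sense.
        local_sense = not getattr(self._local, "sense", False)
        self._local.sense = local_sense
        # Signal arrival.
        with self._atomic:
            self.count -= 1
            last = self.count == 0
        if last:
            self.count = self.parties       # reset before releasing anyone
            self.sense = local_sense        # reverse global sense
        else:
            while self.sense != local_sense:  # everyone spins on one flag
                time.sleep(0)


def run_barrier_demo(threads=4, phases=5):
    """Hypothetical driver: check that no thread leaves a phase early."""
    barrier = SenseReversingBarrier(threads)
    arrived = [[False] * threads for _ in range(phases)]
    ok = [True] * threads

    def worker(tid):
        for phase in range(phases):
            arrived[phase][tid] = True
            barrier.wait()
            if not all(arrived[phase]):
                ok[tid] = False

    pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return all(ok)
```

Reversing the sense on each use is what makes the barrier safe to reuse immediately for the next phase.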

Page 13: Dissemination barrier

Barrier information is “disseminated” algorithmically
At each synchronization stage k, processor i signals processor (i + 2^k) mod P, where P is the number of processors
Similarly, processor i continues when it is signaled by processor (i - 2^k) mod P
log(P) operations on critical path, P log(P) remote operations
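The signalling pattern is pure arithmetic, so it can be checked without any threads. The sketch below reads the stage offset as 2 to the power k and uses ceil(log2 P) stages. The function names are hypothetical.

```python
def dissemination_schedule(P):
    """For each stage k, list (signals, waits_on) for every processor i."""
    stages = (P - 1).bit_length()  # ceil(log2(P)) for P >= 1
    return [
        [((i + 2 ** k) % P, (i - 2 ** k) % P) for i in range(P)]
        for k in range(stages)
    ]


def everyone_informed(P):
    """Simulate the stages and check every processor has heard of all P arrivals."""
    known = [{i} for i in range(P)]
    for stage in dissemination_schedule(P):
        snapshot = [set(s) for s in known]  # signals in a stage are simultaneous
        for i, (signals, _waits_on) in enumerate(stage):
            known[signals] |= snapshot[i]
    return all(len(s) == P for s in known)


def remote_operations(P):
    """Total signals sent: one per processor per stage."""
    return P * len(dissemination_schedule(P))
```

For example, `dissemination_schedule(4)[1][0]` is `(2, 2)`: in stage 1, processor 0 signals processor 2 and waits for a signal from processor 2. The simulation also covers values of P that are not powers of two.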

Page 14: Tournament barrier

Tree-based approach
Outcome statically determined
“Roles” for each round
  “loser” notifies “winner,” then drops out
  “winner” waits to be notified, participates in next round
  “champion” sets global flag when over
log(P) rounds
Heavy interconnect traffic…
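Because the outcome is statically determined, the role table can be computed up front. The sketch below uses one common pairing rule (in round k, processor i meets the processor 2^(k-1) positions away), which is an assumption and not taken from the slides. The extra labels "bye" (no partner this round) and "dropout" (already eliminated) are likewise assumptions of this sketch, and the function name is hypothetical.

```python
def tournament_roles(P):
    """Return, for each round, a list of (role, partner) indexed by processor."""
    rounds = (P - 1).bit_length()  # ceil(log2(P)) for P >= 1
    table = []
    for k in range(1, rounds + 1):
        stride, half = 2 ** k, 2 ** (k - 1)
        row = []
        for i in range(P):
            if i % stride == 0:
                partner = i + half
                if partner >= P:
                    row.append(("bye", None))          # nobody to wait for
                elif k == rounds:
                    row.append(("champion", partner))  # sets the global flag
                else:
                    row.append(("winner", partner))    # waits, then advances
            elif i % stride == half:
                row.append(("loser", i - half))        # notifies, then drops out
            else:
                row.append(("dropout", None))          # lost in an earlier round
        table.append(row)
    return table
```

With P = 4, round 1 pairs (0, 1) and (2, 3), and in round 2 processor 0 is the champion with processor 2 as its loser. Every processor except the champion loses exactly once, so P - 1 notifications flow up the tree.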

Page 15: Tree-based barrier (Mellor-Crummey & Scott)

Also tree-based
Local spinning
O(P) space for P processors
(2P – 2) network transactions
O(log P) network transactions on critical path

Page 16: Tree-based barrier, mechanism

Use two P-node trees
“child-not-ready” flag for each child present in parent
When all children have signaled arrival, parent signals its parent
When root detects all children have arrived, signals to the group that it can proceed to next barrier.
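The (2P – 2) figure follows from the two trees: every non-root node signals its parent once on arrival and is signaled once on wakeup. A small sketch of that accounting, assuming heap-ordered trees with an arrival fan-in of 4 and a wakeup fan-out of 2. Those arities are an assumption not stated on the slides, and the function name is hypothetical.

```python
def tree_barrier_costs(P, fan_in=4, fan_out=2):
    """Return (total network transactions, transactions on the critical path)."""

    def depth(i, arity):
        # Number of parent links between node i and the root (node 0).
        steps = 0
        while i > 0:
            i = (i - 1) // arity
            steps += 1
        return steps

    arrival_signals = P - 1   # each non-root node signals its arrival parent
    wakeup_signals = P - 1    # each non-root node is woken by its wakeup parent
    critical_path = (max(depth(i, fan_in) for i in range(P))
                     + max(depth(i, fan_out) for i in range(P)))
    return arrival_signals + wakeup_signals, critical_path
```

For P = 16 this gives 30 transactions in total with 6 on the critical path, which is consistent with the O(log P) claim.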

Page 17:

Experiments done on BBN Butterfly 1 and Sequent Symmetry Model B machines
BBN
  Supports up to 256 processor nodes
  8 MHz MC68000
Sequent
  Supports up to 30 processor nodes
  16 MHz Intel 80386
Most concerned with Sequent

Page 21:

Want to extend to multi-core machines
  Scalability of limited usefulness (not that many cores)
  Shared resources
  Core load

Page 22:

Intel Centrino Duo T5200 Processor
  Two cores
  1.60 GHz per core
  2MB L2 Cache
Windows Vista
2GB DDR2 Memory

Page 23:

Evaluate basic and MCS approaches
Simple and complex evaluations
Core pinning
Load ramping

Page 24:

Code porting
  Lots of Linux-specific code
Win32 Thread API
  Esoteric…
  How to pin a thread to a core?
Timing
  Win32 μsec-granularity measurement
Surprisingly archaic C code
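As a rough illustration of the timing side, the sketch below measures the average cost of an acquire/release pair around an empty critical section. It is a Python stand-in, not the Win32 C code: `time.perf_counter_ns` is used as the high-resolution clock, and the function name `time_lock_pairs` is hypothetical. On Win32 the usual candidates would be QueryPerformanceCounter for timing and SetThreadAffinityMask for pinning a thread to a core; the slides do not say which calls were used.

```python
import threading
import time


def time_lock_pairs(lock, iterations=10000):
    """Average microseconds per acquire/release pair, empty critical section."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        lock.acquire()
        lock.release()  # nothing happens while the lock is held
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / iterations / 1000.0


if __name__ == "__main__":
    # Any object with acquire() and release() can be measured this way.
    print(f"{time_lock_pairs(threading.Lock()):.3f} usec per pair")
```

Averaging over many iterations keeps the clock's own granularity from dominating the result.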

Page 25:

Spin lock base code ported
Barriers nearly done
Simple experiments for spin locks done
More complex on the way

Page 26:

Simple spin lock tests
  Simple lock outperforms MCS on:
    ▪ Empty Critical Section
    ▪ Simple FP Critical Section
    ▪ Single core
    ▪ Dual core
More procedural overhead for MCS on small scale
Next steps:
  ▪ More threads!
  ▪ More critical section complexity
