high performance computing


Transcript of high performance computing

Page 1: high performance computing

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Sangyeun Cho

Lei Jin

Presented by Lalit Azad

December 1st, 2008

Page 2: high performance computing

Motivation

• Currently available multicore processors have 2-8 cores
• Trend towards embedding 10-100 cores on one chip
• The ever-widening memory speed gap and limited chip bandwidth worsen programs' dependence on memory performance
• Wire-delay dominance and the distributed nature of the L2 cache lead to non-uniform cache access latencies

Page 3: high performance computing

• Categories of L2 caches: private and shared

Page 4: high performance computing

• Private cache: a cache associated with a specific processor core, which replicates data freely as the processor accesses it
• Advantages: faster access times; low average L2 access latency
• Disadvantages: low effective cache capacity; high miss rate for programs with larger workloads

Page 5: high performance computing

• Shared L2 cache: cache slices aggregated into a single logical cache; each cache slice accepts an exclusive subset of memory blocks
• Advantages: better overall utilization of on-chip caching capacity than private caches; enforcing cache coherence is simpler due to the exclusive separation of memory blocks in the cache
• Disadvantages: longer L2 cache hit latency

[Figure: a 4x4 grid of tiles T0-T15; memory blocks 0-15 are interleaved across the tile caches]

Page 6: high performance computing

Methodology

• Emphasis on gross interleaving of cache lines across the physically distributed L2 cache slices
• L2 cache slice mapping is done at memory page granularity
• Advantage: process scheduling and data mapping decisions are made dynamically and synergistically, with the full flexibility of the OS

Page 7: high performance computing

• Implementation of 3 types of caching policies:
  – Private caching policy
  – Shared caching policy
  – Hybrid caching policy

Page 8: high performance computing

• L2 caching at cache-line granularity
• Formula: S = A mod n
  – S: cache slice number
  – A: memory block number
  – n: number of cache slices
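As an illustration (not from the slides), a minimal C sketch of this mapping, assuming 64-byte cache lines and 16 slices for a 4x4 tile grid:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64   /* assumed cache line size in bytes */
#define NUM_SLICES 16   /* assumed slice count for a 4x4 tile grid */

/* S = A mod n: interleave consecutive memory blocks across slices */
unsigned slice_for_address(uint64_t paddr)
{
    uint64_t block = paddr / LINE_SIZE;     /* A: memory block number */
    return (unsigned)(block % NUM_SLICES);  /* S: cache slice number  */
}

int main(void)
{
    /* consecutive cache lines land on consecutive slices */
    for (uint64_t a = 0; a < 4 * LINE_SIZE; a += LINE_SIZE)
        printf("0x%05llx -> slice %u\n",
               (unsigned long long)a, slice_for_address(a));
    return 0;
}
```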

Page 9: high performance computing

• Advantages:
  a. Improves L2 cache capacity utilization by freely distributing memory blocks across the available cache.
  b. Increases L2 bandwidth by scattering temporally close loads and stores to different cache slices.
• Disadvantages:
  a. Increases on-chip network traffic and effective cache access latency.

Page 10: high performance computing

• L2 caching at memory page granularity

Page 11: high performance computing

• L2 caching at page granularity
• Pages in memory are mapped to cache slices
• Granularity is the memory page (coarser)
• Locality of reference is the key idea
• Mapping is done by the OS
• Formula: S = PPN mod N
  – PPN: physical page number
  – N: number of cache slices
• Advantages: uses the principle of locality of reference for a better L2 hit rate; page granularity avoids high on-chip traffic; leads to faster access and a higher hit ratio
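For contrast with the line-granularity sketch above, a minimal C sketch of the page-level mapping, assuming 4 KB pages and 16 slices; every cache line within a page now maps to the same slice, which is what preserves locality:

```c
#include <stdint.h>

#define PAGE_SIZE  4096  /* assumed 4 KB pages */
#define NUM_SLICES 16    /* assumed slice count */

/* S = PPN mod N: all lines of a page share one slice */
unsigned slice_for_page(uint64_t paddr)
{
    uint64_t ppn = paddr / PAGE_SIZE;     /* PPN: physical page number */
    return (unsigned)(ppn % NUM_SLICES);  /* S: cache slice number     */
}
```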

Page 12: high performance computing

• Working of page allocation
• A congruence group (CGi) contains the physical pages that map to processor Pi
• Each congruence group maintains a free list of available pages
• Private caching maps pages for Pi from CGi only
• Shared caching maps a page for Pi from the whole set of CGi
• Hybrid caching splits the CGi into k groups and maps each Pi to one group

The operating system controls the implementation without additional hardware support.
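A hedged C sketch of how the three policies could draw pages from per-group free lists; the structure and names are illustrative, assuming one congruence group per slice and consecutively numbered clusters:

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SLICES 16  /* assumed: one slice/group per tile */

typedef struct page { struct page *next; uint64_t ppn; } page_t;

/* One free list per congruence group CGi: pages with PPN mod N == i */
static page_t *cg_free_list[NUM_SLICES];

static page_t *pop_from_group(unsigned g)
{
    page_t *p = cg_free_list[g];
    if (p)
        cg_free_list[g] = p->next;
    return p;
}

/* Private policy: core i draws only from CGi, so all of its pages
 * map to its local cache slice. */
page_t *alloc_private(unsigned core)
{
    return pop_from_group(core);
}

/* Shared policy: rotate over all groups, spreading one core's pages
 * across every slice. */
page_t *alloc_shared(void)
{
    static unsigned next;
    for (unsigned tries = 0; tries < NUM_SLICES; tries++) {
        page_t *p = pop_from_group(next);
        next = (next + 1) % NUM_SLICES;
        if (p)
            return p;
    }
    return NULL; /* all groups empty: spilling territory */
}

/* Hybrid policy: core i draws from the k groups of its cluster
 * (consecutive grouping is an assumption). */
page_t *alloc_hybrid(unsigned core, unsigned k)
{
    unsigned base = (core / k) * k; /* first group of core's cluster */
    for (unsigned j = 0; j < k; j++) {
        page_t *p = pop_from_group(base + j);
        if (p)
            return p;
    }
    return NULL;
}
```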

Page 13: high performance computing

Effective cache management
• Page spreading: the operating system allocates cache from neighbouring processors to a processor with a larger working set
• Page spilling: when the total number of available pages in a CGi drops below a specific threshold, the OS allocates pages from a different group
• The OS may sometimes be forced to allocate pages from other groups' free lists
• Each tier-n tile is n steps away from the target tile P

[Figure: the target tile P and its surrounding tier-1 tiles]
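A hedged sketch of the spilling decision, assuming per-group free-page counters and an arbitrary low-water threshold (neither is specified on the slide); free_pages and take_page are hypothetical helpers:

```c
#include <stdint.h>

#define NUM_GROUPS      16   /* assumed: one congruence group per tile */
#define SPILL_THRESHOLD 32   /* assumed low-water mark, in pages */

extern unsigned free_pages[NUM_GROUPS];    /* per-CG free-list sizes */
extern uint64_t take_page(unsigned group); /* hypothetical pop helper */

/* Page spilling: fall back to the fullest other group when the local
 * congruence group runs low on free pages. */
uint64_t alloc_with_spilling(unsigned local_group)
{
    if (free_pages[local_group] > SPILL_THRESHOLD)
        return take_page(local_group);     /* normal local allocation */
    unsigned best = (local_group + 1) % NUM_GROUPS;
    for (unsigned g = 0; g < NUM_GROUPS; g++)
        if (g != local_group && free_pages[g] > free_pages[best])
            best = g;
    return take_page(best);                /* spill to another group */
}
```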

Page 14: high performance computing

Cache Pressure

• Defined in terms of cache misses
• Cache pressure ∝ cache misses
• When cache pressure increases, pages are spread to tiles with lower cache pressure to utilize the cache optimally
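A minimal sketch of choosing a spreading target, assuming cache pressure is tracked as a per-tile miss counter (the slide only states that pressure is proportional to misses):

```c
#include <stddef.h>
#include <stdint.h>

/* Pick the spreading target among a core's tier-1 tiles: the tile
 * with the lowest cache pressure, i.e. the fewest recent misses. */
unsigned pick_spread_target(const uint64_t misses[],
                            const unsigned tier1[], size_t n_tier1)
{
    unsigned best = tier1[0];
    for (size_t i = 1; i < n_tier1; i++)
        if (misses[tier1[i]] < misses[best])
            best = tier1[i];
    return best;
}
```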

Page 15: high performance computing

Home Core

• The home core is chosen as a function of estimated performance (IPC)
• Max(f(M, L, P, Q)) over the C candidate cores gives the "home core", where:
  – M: vector of recent miss rates
  – L: vector of recent network contention levels
  – P: current page allocation information
  – Q: QoS (cache capacity/quota)
  – C: number of caches
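The slide names the inputs but not the function itself, so the following C sketch uses a purely illustrative linear score; only the arg-max structure reflects the slide:

```c
#define NUM_CORES 16  /* assumed 4x4 grid; C = number of caches */

/* Hypothetical score: lower misses, contention, and allocation load
 * and a larger QoS quota are better. Purely illustrative weights. */
static double score(double miss, double contention,
                    double alloc_load, double quota)
{
    return quota - (miss + contention + alloc_load);
}

/* Max(score) over the C candidates gives the home core. */
unsigned home_core(const double M[], const double L[],
                   const double P[], const double Q[])
{
    unsigned best = 0;
    double best_s = score(M[0], L[0], P[0], Q[0]);
    for (unsigned i = 1; i < NUM_CORES; i++) {
        double s = score(M[i], L[i], P[i], Q[i]);
        if (s > best_s) { best_s = s; best = i; }
    }
    return best;
}
```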

Page 16: high performance computing

Virtual Multicore
• Relies on the concept of the hybrid cache
• Processors are grouped according to application usage
• The group is called a cluster
• The processors in one cluster share an L2 cache
• The OS allocates pages in a round-robin fashion within the cluster (see the sketch below)
• An upper bound is kept on the network latency
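A hedged C sketch of the per-cluster round-robin placement, assuming clusters of four consecutive cores on the 4x4 grid (the slide does not fix the cluster size):

```c
#define NUM_SLICES   16  /* assumed 4x4 tile grid */
#define CLUSTER_SIZE 4   /* assumed cluster size */

/* Round-robin page placement within a virtual multicore cluster:
 * pages rotate over the cluster's own slices only, so sharing and
 * coherence traffic stay inside the cluster. */
unsigned next_slice_in_cluster(unsigned cluster_id)
{
    static unsigned cursor[NUM_SLICES / CLUSTER_SIZE];
    unsigned slice = cluster_id * CLUSTER_SIZE + cursor[cluster_id];
    cursor[cluster_id] = (cursor[cluster_id] + 1) % CLUSTER_SIZE;
    return slice;
}
```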

Page 17: high performance computing

Advantages of the virtual multicore

• Maintaining cache coherency becomes easier
• On-chip network bandwidth is optimized
• Due to clustering, communication outside the cluster is reduced to a bare minimum
• The upper bound on network latency can be used to implement OS-enforced spreading in case of cache contention

Page 18: high performance computing

Demand Paging

• Approach: if a memory access is the first access to a page, the page allocator picks an available page and allocates it using the selected allocation policy.
• This is also termed the lazy loading technique.
• It is implemented in UNIX.
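A minimal C sketch of how demand paging would tie into the page allocator, assuming a simple page-table entry; allocate_page_for_core is a hypothetical hook standing in for the active policy (private, shared, or hybrid):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy hook: returns a physical page for the faulting
 * core, which in this scheme also fixes the page's L2 cache slice. */
extern uint64_t allocate_page_for_core(unsigned core);

typedef struct {
    bool     present;  /* has a physical page been assigned yet? */
    uint64_t ppn;      /* physical page number once present      */
} pte_t;

/* Demand (lazy) paging: the physical page, and therefore its cache
 * slice, is chosen only on first touch. */
uint64_t handle_page_fault(pte_t *pte, unsigned faulting_core)
{
    if (!pte->present) {
        pte->ppn = allocate_page_for_core(faulting_core);
        pte->present = true;
    }
    return pte->ppn;
}
```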

Page 19: high performance computing

• Used the SimpleScalar tool set to model a 4x4 mesh multicore processor chip
• No page spilling was ever experienced
• Used single-threaded, multiprogrammed, and parallel workloads:
  a. Single-threaded: a variety of SPEC2k benchmarks, integer programs, and floating-point programs
  b. Multiprogrammed: one core (core 5 in the experiments) runs a target benchmark, while the other processors run a synthetic benchmark that continuously generates memory accesses
  c. Parallel: SPLASH-2 benchmarks

Page 20: high performance computing

Single-threaded performance
• Terms used:
  – PRV: private
  – PRV8: private with an 8 MB cache (instead of 512 KB)
  – SL: shared
  – SP: OS-based page allocation
  – SP-RR: round-robin allocation
  – SP-80: 80% allocated locally, 20% spread across tier-1 cores
  – SP-60: 60% allocated locally, 40% spread across tier-1 cores
  – SP-40: 40% allocated locally, 60% spread across tier-1 cores

Page 21: high performance computing

Single-program performance of different policies

Page 22: high performance computing

Performance on single-threaded workloads

Page 23: high performance computing

Performance sensitivity to network traffic

Page 24: high performance computing

Results

• OS-based cache management can differentiate between the hardware resources seen by programs
• Preference is given to high-priority programs by not allowing other programs to interfere
• A pure hardware-based scheme like SL does not provide this flexibility.

Page 25: high performance computing

Performance sensitivity to network traffic

• Observations:
  – SP40-CS uses a controlled-spreading (CS) method to limit the spreading of data onto cores that already hold the data.
  – SP40-CS achieves the highest performance in all cases except ammp.
  – Performance is limited due to the local cache distribution.

Page 26: high performance computing

• VM performed better than most of the policies
• VM outperformed PRV by 5%
• VM outperformed SL by 21%

[Chart annotations: "No real difference here!", "VM outperforms the rest!"]

Page 27: high performance computing

Future work

• How to achieve performance goals under dynamically changing workloads?
• Multiplexing related OS techniques with each other, e.g. cache block replication and cache block migration
• Accurately monitoring cache performance and behavior

Page 28: high performance computing

Conclusion

• The proposed OS-based page allocation scheme is flexible
• It can mimic different hardware strategies and can control the degree of data placement and data sharing
• Such flexibility leads to more efficient and highly scalable performance


Page 30: high performance computing

• Questions?

Page 31: high performance computing

Thank you!