High Performance Computing
Managing Distributed Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho, Lei Jin
Presented by Lalit Azad
December 1st, 2008
Motivation
• Currently available multicore processors have 2-8 cores
• Trend towards embedding 10-100 cores on one chip
• The ever-widening memory speed gap and limited off-chip bandwidth make program performance increasingly dependent on the on-chip cache
• Wire-delay dominance and the distributed nature of the L2 cache lead to non-uniform cache access latencies
• Categories of L2 caches: private and shared
• Private L2 cache: a cache slice associated with a specific processor core, which replicates data freely as the processor accesses it
– Advantages: faster access times; low average L2 access latency
– Disadvantages: low effective capacity; high miss rate for programs with large working sets
• Shared L2 cache: a logically single cache aggregated from all slices; each cache slice holds an exclusive subset of memory blocks
– Advantages: better overall utilization of on-chip caching capacity than private caches; enforcing cache coherence is simpler because memory blocks are exclusively partitioned across slices
– Disadvantage: longer L2 cache hit latency
[Figure: a 4x4 mesh of tiles T0-T15; memory blocks 0-15 are interleaved across the distributed cache slices]
Methodology
• Emphasis on coarse-grained interleaving of cache lines across the physically distributed L2 cache slices
• L2 cache slice mapping is done at memory page granularity
• Advantage: process scheduling and data mapping decisions are made dynamically and synergistically, with the full flexibility of the OS
• Implements three types of caching policies:
– Private caching policy
– Shared caching policy
– Hybrid caching policy
• L2 caching at cache line granularity
– Formula: S = A mod N, where S is the cache slice number, A is the memory block number, and N is the number of cache slices
– Advantages:
a. Improves L2 cache capacity utilization by freely distributing memory blocks across the available slices
b. Increases L2 bandwidth by scattering temporally close loads and stores to different cache slices
– Disadvantage:
a. Increases on-chip network traffic and effective cache access latency
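The line-granularity mapping S = A mod N above can be sketched as follows (slice count and block numbers are illustrative):

```python
def line_slice(block_number: int, num_slices: int) -> int:
    """Map a memory block to an L2 cache slice by modulo interleaving (S = A mod N)."""
    return block_number % num_slices

# With 16 slices, consecutive blocks round-robin across all slices,
# then wrap: block 16 lands back on slice 0.
slices = [line_slice(a, 16) for a in range(20)]
```

Because consecutive blocks land on different slices, temporally close accesses are spread across the chip, which is exactly why this scheme trades latency for bandwidth.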
• L2 caching at memory page granularity
– Pages in memory are mapped to cache slices at the coarser page granularity
– Locality of reference is the key idea; the mapping is done by the OS
– Formula: S = PPN mod N, where PPN is the physical page number and N is the number of cache slices
– Advantages: exploits locality of reference for a better L2 hit rate; page granularity avoids high on-chip traffic, leading to faster access and a higher hit ratio
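A minimal sketch of the page-granularity mapping S = PPN mod N; the 4 KiB page size is an assumption, as the slide does not state one:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

def page_slice(phys_addr: int, num_slices: int) -> int:
    """S = PPN mod N: every cache line within a page maps to the same slice."""
    ppn = phys_addr // PAGE_SIZE
    return ppn % num_slices

# All accesses within one page hit the same slice, preserving locality:
assert page_slice(0x1000, 16) == page_slice(0x1FFF, 16)
```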
• Working of page allocation
– A congruence group (CGi) holds the physical pages that map to processor Pi
– Each congruence group maintains a free list of available pages
– Private caching maps pages for Pi from Pi's own group
– Shared caching maps a page for Pi from any of the congruence groups
– Hybrid caching splits the congruence groups into clusters of k and maps each Pi to a cluster
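A toy model of the congruence groups and the private/hybrid policies above; the structures and sizes are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

NUM_SLICES = 16
PAGES_PER_GROUP = 4

# CG_i holds the free pages whose PPN mod N == i
free_lists = {i: deque(ppn for ppn in range(NUM_SLICES * PAGES_PER_GROUP)
                       if ppn % NUM_SLICES == i)
              for i in range(NUM_SLICES)}

def allocate_private(core: int) -> int:
    """Private policy: always take a page from the core's own group."""
    return free_lists[core].popleft()

def allocate_hybrid(core: int, cluster_size: int) -> int:
    """Hybrid policy: pick a group inside the core's cluster of k slices,
    here choosing the group with the most free pages."""
    base = (core // cluster_size) * cluster_size
    group = max(range(base, base + cluster_size),
                key=lambda g: len(free_lists[g]))
    return free_lists[group].popleft()
```

The shared policy would simply pick from any group; the key point is that all three policies reduce to a choice of which free list to draw from, which is why no hardware change is needed.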
The operating system controls the mapping entirely in software, without additional hardware support
Effective cache management
• Page spreading: the OS allocates cache capacity of neighbouring processors to a processor with a larger working set
• Page spilling: when the number of available pages in CGi drops below a specific threshold, the OS allocates pages from a different group's free list
• The OS may sometimes be forced to allocate pages from other free lists
• Each tier-n tile is n hops away from the target tile P; tier-1 tiles are its immediate neighbours
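Page spilling as described above can be sketched as follows; the threshold value and the neighbour-selection rule are illustrative assumptions:

```python
SPILL_THRESHOLD = 2  # assumed free-list low-water mark

def allocate_with_spill(core, free_lists, neighbours):
    """Allocate from the core's own group unless it is nearly empty,
    in which case spill to the tier-1 neighbour with the most free pages."""
    if len(free_lists[core]) > SPILL_THRESHOLD:
        return free_lists[core].pop()
    victim = max(neighbours[core], key=lambda n: len(free_lists[n]))
    return free_lists[victim].pop()

# Example: core 0's list is nearly exhausted, so allocation spills to core 1.
free_lists = {0: [10, 26], 1: [1, 17, 33], 2: [2]}
neighbours = {0: [1, 2]}
page = allocate_with_spill(0, free_lists, neighbours)
```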
Cache Pressure
• Defined in terms of cache misses: cache pressure ∝ cache misses
• When cache pressure increases, pages are spread to tiles with lower cache pressure to utilize the cache optimally
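A simple spreading rule following the idea above: send extra pages to the tier-1 tile with the lowest pressure. The rule and the miss counts are illustrative:

```python
def pick_spread_target(pressure, tier1_tiles):
    """Choose the tier-1 tile with the lowest cache pressure (∝ misses)."""
    return min(tier1_tiles, key=lambda t: pressure[t])

pressure = {1: 120, 4: 35, 5: 80}   # misses per interval (made-up numbers)
target = pick_spread_target(pressure, [1, 4, 5])
```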
Home Core
• Chosen as a function of IPC performance: the core maximising a function of the inputs below is the "home core"
– M: vector of recent miss rates
– L: vector of recent network contention levels
– P: current page allocation information
– Q: QoS constraint (cache capacity quota)
– C: number of caches
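The slide names the inputs but not the exact function, so the scoring below is purely an illustrative assumption: prefer cores with low recent miss rate and low contention, subject to remaining quota.

```python
def home_core(miss_rate, contention, quota):
    """Pick the core maximising an (assumed) benefit score over M, L, Q."""
    def score(c):
        # hypothetical score: penalise misses and contention equally;
        # cores with no remaining quota are ineligible
        return -(miss_rate[c] + contention[c]) if quota[c] > 0 else float("-inf")
    return max(range(len(miss_rate)), key=score)

best = home_core([0.5, 0.1, 0.3], [0.2, 0.4, 0.1], [1, 1, 1])
```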
Virtual Multicore
• Relies on the concept of hybrid caching
• Processors are grouped according to application usage; a group is called a cluster
• The processors in one cluster share an L2 cache
• The OS allocates pages in a round-robin fashion within the cluster
• An upper bound is kept on the network latency
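The round-robin allocation within a cluster can be sketched as follows; the class and slice IDs are illustrative:

```python
from itertools import cycle

class Cluster:
    """Cores in one cluster share an L2 region; the OS hands out pages
    round-robin over the cluster's slices."""
    def __init__(self, slice_ids):
        self._next = cycle(slice_ids)

    def allocate_slice(self):
        """Return the slice that should receive the next page."""
        return next(self._next)

cluster = Cluster([4, 5, 6, 7])
order = [cluster.allocate_slice() for _ in range(6)]
```

Round-robin keeps the pages of one application confined to its cluster, which is what bounds the worst-case network latency.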
Advantages of virtual multicore
• Maintaining cache coherence becomes easier
• On-chip network bandwidth is used efficiently
• Due to clustering, communication outside the cluster is reduced to a bare minimum
• The upper bound on network latency can be used by the OS to enforce spreading in case of cache contention
Demand Paging
• Approach: on the first access to a page, the page allocator picks an available physical page and allocates it using the selected allocation policy
• Also termed lazy loading; it is implemented in UNIX systems
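A minimal sketch of allocation-on-first-touch; the class and the stand-in allocator are illustrative assumptions:

```python
class LazyPageTable:
    """Demand paging sketch: a frame is allocated only on the first
    access to a virtual page; later accesses reuse the same frame."""
    def __init__(self, allocate):
        self._allocate = allocate   # policy: vpn -> physical page
        self._table = {}

    def touch(self, vpn):
        if vpn not in self._table:
            self._table[vpn] = self._allocate(vpn)   # first touch: allocate
        return self._table[vpn]

pt = LazyPageTable(allocate=lambda vpn: vpn * 7)     # stand-in allocator
assert pt.touch(3) == pt.touch(3)                    # same frame both times
```

Deferring allocation to first touch is what lets the OS apply whichever slice-mapping policy is in force at the moment the page is actually needed.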
Experimental Setup
• Used the SimpleScalar tool set to model a 4x4 mesh multicore processor chip
• No page spilling was ever experienced
• Used single-threaded, multiprogrammed, and parallel workloads:
a. Single-threaded: a variety of SPEC2k benchmarks, both integer and floating-point programs
b. Multiprogrammed: one core (core 5 in the experiments) runs a target benchmark, while the other cores run a synthetic benchmark that continuously generates memory accesses
c. Parallel: SPLASH-2 benchmarks
Single-threaded Performance
• Terms used:
– PRV: private; PRV8: 8 MB cache size (instead of 512 KB)
– SL: shared
– SP: OS-based page allocation
– SP-RR: round-robin allocation
– SP-80: 80% allocated locally, 20% spread across tier-1 cores
– SP-60: 60% allocated locally, 40% spread across tier-1 cores
– SP-40: 40% allocated locally, 60% spread across tier-1 cores
Single Program performance of different Policies
Performance on single threaded workloads
Performance sensitivity to network traffic
Results
• OS-based cache management can differentiate the hardware resources seen by different programs
• High-priority programs can be given preference by preventing other programs from interfering with them
• A pure hardware-based scheme such as SL does not provide this flexibility
Performance sensitivity to network traffic
• Observations:
– SP40-CS uses a controlled-spreading method to avoid spreading data onto cores that already hold that data
– SP40-CS achieves the highest performance on all benchmarks except ammp
– Performance is limited by the local cache distribution
– VM performed better than most of the policies: it outperformed PRV by 5% and SL by 21%
No real difference here!
VM outperforms the rest!
Future work
• How to achieve performance goals under dynamically changing workloads?
• Combining related OS techniques with each other, e.g. cache block replication and cache block migration
• Accurately monitoring cache performance and its behavior
Conclusion
• The proposed OS-based page allocation scheme is flexible
• It can mimic different hardware strategies and control the degree of data placement and data sharing
• Such flexibility leads to more efficient and more scalable performance
References
• http://en.wikipedia.org/
• http://www.multicoreinfo.com/papers-2006/
Questions?
Thank you!