High Performance Computing
Managing Distributed Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho, Lei Jin
Presented by Lalit Azad
December 1st, 2008
Motivation
• Currently available multicore processors have 2-8 cores
• Trend towards embedding 10-100 cores on one chip
• The ever-widening memory speed gap and limited off-chip bandwidth make program performance increasingly dependent on the on-chip cache
• Wire-delay dominance and the distributed nature of the L2 cache lead to non-uniform cache access latencies
• Categories of L2 caches: private and shared
• Private L2 cache: a cache slice associated with a specific processor core, which replicates data freely as the processor accesses it
– Advantages: faster access times; low average L2 access latency
– Disadvantages: low effective capacity; high miss rate for programs with large working sets
• Shared L2 cache: a logically single cache aggregated from all slices; each cache slice holds an exclusive subset of memory blocks
– Advantages: better overall utilization of on-chip caching capacity than private caches; enforcing cache coherence is simpler because memory blocks are exclusively partitioned across slices
– Disadvantage: longer L2 cache hit latency
[Figure: a 4x4 mesh of tiles T0-T15; memory blocks 0-15 are interleaved across the distributed cache slices]
Methodology
• Emphasis on coarse-grained interleaving of cache lines across the physically distributed L2 cache slices
• L2 cache slice mapping is done at memory page granularity
• Advantage: process scheduling and data mapping decisions are made dynamically and synergistically, with the full flexibility of the OS
• Implements three types of caching policies:
– Private caching policy
– Shared caching policy
– Hybrid caching policy
• L2 caching at cache line granularity
– Formula: S = A mod N, where S is the cache slice number, A is the memory block number, and N is the number of cache slices
– Advantages:
a. Improves L2 cache capacity utilization by freely distributing memory blocks across the available slices
b. Increases L2 bandwidth by scattering temporally close loads and stores to different cache slices
– Disadvantage:
a. Increases on-chip network traffic and effective cache access latency
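The line-granularity mapping S = A mod N above can be sketched as follows (slice count and block numbers are illustrative):

```python
def line_slice(block_number: int, num_slices: int) -> int:
    """Map a memory block to an L2 cache slice by modulo interleaving (S = A mod N)."""
    return block_number % num_slices

# With 16 slices, consecutive blocks round-robin across all slices,
# then wrap: block 16 lands back on slice 0.
slices = [line_slice(a, 16) for a in range(20)]
```

Because consecutive blocks land on different slices, temporally close accesses are spread across the chip, which is exactly why this scheme trades latency for bandwidth.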
• L2 caching at memory page granularity
– Pages in memory are mapped to cache slices at the coarser page granularity
– Locality of reference is the key idea; the mapping is done by the OS
– Formula: S = PPN mod N, where PPN is the physical page number and N is the number of cache slices
– Advantages: exploits locality of reference for a better L2 hit rate; page granularity avoids high on-chip traffic, leading to faster access and a higher hit ratio
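A minimal sketch of the page-granularity mapping S = PPN mod N; the 4 KiB page size is an assumption, as the slide does not state one:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

def page_slice(phys_addr: int, num_slices: int) -> int:
    """S = PPN mod N: every cache line within a page maps to the same slice."""
    ppn = phys_addr // PAGE_SIZE
    return ppn % num_slices

# All accesses within one page hit the same slice, preserving locality:
assert page_slice(0x1000, 16) == page_slice(0x1FFF, 16)
```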
• Working of page allocation
– A congruence group (CGi) holds the physical pages that map to processor Pi
– Each congruence group maintains a free list of available pages
– Private caching maps pages for Pi from Pi's own group
– Shared caching maps a page for Pi from any of the congruence groups
– Hybrid caching splits the congruence groups into clusters of k and maps each Pi to a cluster
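A toy model of the congruence groups and the private/hybrid policies above; the structures and sizes are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

NUM_SLICES = 16
PAGES_PER_GROUP = 4

# CG_i holds the free pages whose PPN mod N == i
free_lists = {i: deque(ppn for ppn in range(NUM_SLICES * PAGES_PER_GROUP)
                       if ppn % NUM_SLICES == i)
              for i in range(NUM_SLICES)}

def allocate_private(core: int) -> int:
    """Private policy: always take a page from the core's own group."""
    return free_lists[core].popleft()

def allocate_hybrid(core: int, cluster_size: int) -> int:
    """Hybrid policy: pick a group inside the core's cluster of k slices,
    here choosing the group with the most free pages."""
    base = (core // cluster_size) * cluster_size
    group = max(range(base, base + cluster_size),
                key=lambda g: len(free_lists[g]))
    return free_lists[group].popleft()
```

The shared policy would simply pick from any group; the key point is that all three policies reduce to a choice of which free list to draw from, which is why no hardware change is needed.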
The operating system controls the mapping entirely in software, without additional hardware support
Effective cache management
• Page spreading: the OS allocates cache capacity of neighbouring processors to a processor with a larger working set
• Page spilling: when the number of available pages in CGi drops below a specific threshold, the OS allocates pages from a different group's free list
• The OS may sometimes be forced to allocate pages from other free lists
• Each tier-n tile is n hops away from the target tile P; tier-1 tiles are its immediate neighbours
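Page spilling as described above can be sketched as follows; the threshold value and the neighbour-selection rule are illustrative assumptions:

```python
SPILL_THRESHOLD = 2  # assumed free-list low-water mark

def allocate_with_spill(core, free_lists, neighbours):
    """Allocate from the core's own group unless it is nearly empty,
    in which case spill to the tier-1 neighbour with the most free pages."""
    if len(free_lists[core]) > SPILL_THRESHOLD:
        return free_lists[core].pop()
    victim = max(neighbours[core], key=lambda n: len(free_lists[n]))
    return free_lists[victim].pop()

# Example: core 0's list is nearly exhausted, so allocation spills to core 1.
free_lists = {0: [10, 26], 1: [1, 17, 33], 2: [2]}
neighbours = {0: [1, 2]}
page = allocate_with_spill(0, free_lists, neighbours)
```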
Cache Pressure
• Defined in terms of cache misses: cache pressure ∝ cache misses
• When cache pressure increases, pages are spread to tiles with lower cache pressure to utilize the cache optimally
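A simple spreading rule following the idea above: send extra pages to the tier-1 tile with the lowest pressure. The rule and the miss counts are illustrative:

```python
def pick_spread_target(pressure, tier1_tiles):
    """Choose the tier-1 tile with the lowest cache pressure (∝ misses)."""
    return min(tier1_tiles, key=lambda t: pressure[t])

pressure = {1: 120, 4: 35, 5: 80}   # misses per interval (made-up numbers)
target = pick_spread_target(pressure, [1, 4, 5])
```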
Home Core
• Chosen as a function of IPC performance: the core maximising a function of the inputs below is the "home core"
– M: vector of recent miss rates
– L: vector of recent network contention levels
– P: current page allocation information
– Q: QoS constraint (cache capacity quota)
– C: number of caches
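The slide names the inputs but not the exact function, so the scoring below is purely an illustrative assumption: prefer cores with low recent miss rate and low contention, subject to remaining quota.

```python
def home_core(miss_rate, contention, quota):
    """Pick the core maximising an (assumed) benefit score over M, L, Q."""
    def score(c):
        # hypothetical score: penalise misses and contention equally;
        # cores with no remaining quota are ineligible
        return -(miss_rate[c] + contention[c]) if quota[c] > 0 else float("-inf")
    return max(range(len(miss_rate)), key=score)

best = home_core([0.5, 0.1, 0.3], [0.2, 0.4, 0.1], [1, 1, 1])
```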
Virtual Multicore
• Relies on the concept of hybrid caching
• Processors are grouped according to application usage; a group is called a cluster
• The processors in one cluster share an L2 cache
• The OS allocates pages in a round-robin fashion within the cluster
• An upper bound is kept on the network latency
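The round-robin allocation within a cluster can be sketched as follows; the class and slice IDs are illustrative:

```python
from itertools import cycle

class Cluster:
    """Cores in one cluster share an L2 region; the OS hands out pages
    round-robin over the cluster's slices."""
    def __init__(self, slice_ids):
        self._next = cycle(slice_ids)

    def allocate_slice(self):
        """Return the slice that should receive the next page."""
        return next(self._next)

cluster = Cluster([4, 5, 6, 7])
order = [cluster.allocate_slice() for _ in range(6)]
```

Round-robin keeps the pages of one application confined to its cluster, which is what bounds the worst-case network latency.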
Advantages of virtual multicore
• Maintaining cache coherence becomes easier
• On-chip network bandwidth is used efficiently
• Due to clustering, communication outside the cluster is reduced to a bare minimum
• The upper bound on network latency can be used by the OS to enforce spreading in case of cache contention
Demand Paging
• Approach: on the first access to a page, the page allocator picks an available physical page and allocates it using the selected allocation policy
• Also termed lazy loading; it is implemented in UNIX systems
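A minimal sketch of allocation-on-first-touch; the class and the stand-in allocator are illustrative assumptions:

```python
class LazyPageTable:
    """Demand paging sketch: a frame is allocated only on the first
    access to a virtual page; later accesses reuse the same frame."""
    def __init__(self, allocate):
        self._allocate = allocate   # policy: vpn -> physical page
        self._table = {}

    def touch(self, vpn):
        if vpn not in self._table:
            self._table[vpn] = self._allocate(vpn)   # first touch: allocate
        return self._table[vpn]

pt = LazyPageTable(allocate=lambda vpn: vpn * 7)     # stand-in allocator
assert pt.touch(3) == pt.touch(3)                    # same frame both times
```

Deferring allocation to first touch is what lets the OS apply whichever slice-mapping policy is in force at the moment the page is actually needed.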
Experimental Setup
• Used the SimpleScalar tool set to model a 4x4 mesh multicore processor chip
• No page spilling was ever experienced
• Used single-threaded, multiprogrammed, and parallel workloads:
a. Single-threaded: a variety of SPEC2k benchmarks, both integer and floating-point programs
b. Multiprogrammed: one core (core 5 in the experiments) runs a target benchmark, while the other cores run a synthetic benchmark that continuously generates memory accesses
c. Parallel: SPLASH-2 benchmarks
Single-threaded Performance
• Terms used:
– PRV: private; PRV8: 8 MB cache size (instead of 512 KB)
– SL: shared
– SP: OS-based page allocation
– SP-RR: round-robin allocation
– SP-80: 80% allocated locally, 20% spread across tier-1 cores
– SP-60: 60% allocated locally, 40% spread across tier-1 cores
– SP-40: 40% allocated locally, 60% spread across tier-1 cores
Single Program performance of different Policies
Performance on single threaded workloads
Performance sensitivity to network traffic
Results
• OS-based cache management can differentiate the hardware resources seen by different programs
• High-priority programs can be given preference by preventing other programs from interfering with them
• A pure hardware-based scheme such as SL does not provide this flexibility
Performance sensitivity to network traffic
• Observations:
– SP40-CS uses a controlled-spreading method to avoid spreading data onto cores that already hold that data
– SP40-CS achieves the highest performance on all benchmarks except ammp
– Performance is limited by the local cache distribution
– VM performed better than most of the policies: it outperformed PRV by 5% and SL by 21%
No real difference here!
VM outperforms the rest!
Future work
• How to achieve performance goals under dynamically changing workloads?
• Combining related OS techniques with each other, e.g. cache block replication and cache block migration
• Accurately monitoring cache performance and its behavior
Conclusion
• The proposed OS-based page allocation scheme is flexible
• It can mimic different hardware strategies and control the degree of data placement and data sharing
• Such flexibility leads to more efficient and more scalable performance
References
• http://en.wikipedia.org/
• http://www.multicoreinfo.com/papers-2006/
Questions?
Thank you!