Exploiting Heterogeneous Architectures

Alex Beutel, John Dickerson, Vagelis Papalexakis
15-740/18-740 Computer Architecture, Fall 2012

In-Class Discussion
Tuesday 10/16/2012

Heterogeneous Hardware Systems

• Multiple-CPU, single-GPU systems
• Asymmetric Multicore Processors (AMP)
– Combination of general-purpose big and small cores
– Trade-off between performance and power consumption
– Usually “on-chip” AMPs
• Single-ISA architectures
– Similar to AMP, but all cores share the same instruction set
– “Small” cores can support in-order execution
– “Big” cores can support out-of-order execution

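Overview

• BIS
– Optimization: speed (remove bottlenecks)
– Hardware: ACMP (single ISA)
– Location: SW/HW
• PIE
– Optimization: speed (ILP and MLP)
– Hardware: ACMP (single ISA)
– Location: Compiler (SW)
• YinYang
– Optimization: speed (async and power)
– Hardware: ACMP (single ISA)
– Location: VM/SW
• SMS
– Optimization: speed (memory)
– Hardware: GPU & CPU
– Location: Mem Controller (HW)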

Outline

1. BIS: José A. Joao et al. “Bottleneck identification and scheduling in multithreaded applications,” in Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’12).

2. YinYang: Ting Cao et al. “The yin and yang of power and performance for asymmetric hardware and managed software,” in Proceedings of the 39th International Symposium on Computer Architecture (ISCA ’12).

3. PIE: Kenzo Van Craeynest et al. “Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE),” in Proceedings of the 39th International Symposium on Computer Architecture (ISCA ’12).

4. SMS: Rachata Ausavarungnirun et al. “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” in Proceedings of the 39th International Symposium on Computer Architecture (ISCA ’12).

Bottleneck Identification and Scheduling in Multithreaded Applications

• Focuses on the problem of removing bottlenecks
– A big problem in many systems: applications can’t scale well to many threads
• Examples of bottlenecks include critical sections, pipeline stages, and barriers
• In ACMPs, previous research shows that “big” cores can be used to handle (serializing) bottlenecks
– Limited fine-grained adaptivity and generality
• Authors propose BIS
– Key insight: the costliest bottlenecks are those that make other threads wait longest
– Involves cooperation of software and hardware to detect bottlenecks
– Accelerates them using one or more “big” cores of the ACMP

Bottlenecks

• Amdahl’s serial portion
• Critical sections
• Barriers
• Pipeline stages

Bottleneck Identification

• Software is used to identify bottlenecks
• Instructions such as BottleneckCall, BottleneckReturn, and BottleneckWait give feedback to BIS
• BIS keeps track of blocks and thread waiting cycles (TWC) (with optimizations)
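The TWC bookkeeping above can be sketched in a few lines of Python. This is a toy software model, not the paper's hardware design: the class `BottleneckTable` and its methods are hypothetical names, and it assumes that every waiting thread adds one to its bottleneck's TWC per cycle.

```python
from collections import defaultdict


class BottleneckTable:
    """Toy model of per-bottleneck thread waiting cycles (TWC)."""

    def __init__(self):
        self.waiters = defaultdict(set)  # bottleneck id -> threads currently waiting
        self.twc = defaultdict(int)      # bottleneck id -> accumulated waiting cycles

    def wait(self, bid, thread_id):
        """A thread starts waiting on bottleneck `bid`."""
        self.waiters[bid].add(thread_id)

    def release(self, bid, thread_id):
        """A thread stops waiting on bottleneck `bid`."""
        self.waiters[bid].discard(thread_id)

    def tick(self, cycles=1):
        """Advance time: each waiting thread adds `cycles` to its bottleneck's TWC."""
        for bid, threads in self.waiters.items():
            self.twc[bid] += cycles * len(threads)


table = BottleneckTable()
table.wait("lock_A", 1)
table.wait("lock_A", 2)
table.wait("barrier_B", 3)
table.tick(10)              # lock_A: 2 waiters * 10, barrier_B: 1 waiter * 10
table.release("lock_A", 2)
table.tick(5)               # lock_A: 1 waiter * 5, barrier_B: 1 waiter * 5
print(dict(table.twc))
```

A bottleneck that blocks more threads accumulates TWC faster, which matches the key insight that the costliest bottlenecks are the ones that make other threads wait longest.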

Scheduling in Multithreaded Applications (with ACMPs)

• Take the N bottlenecks with the highest TWC and accelerate them
• There are many ways to accelerate a bottleneck; the authors focus on assigning it to a bigger core in the ACMP
• Send the worst bottlenecks from small cores to a big core, where they are kept in a Scheduling Buffer
• Many edge cases are handled, such as avoiding false serialization
• Also extends to systems with multiple large cores
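The selection step (pick the N bottlenecks with the highest TWC) could look roughly like the sketch below. The function name, the bottleneck ids, and the TWC values are made up for illustration; the real mechanism is implemented in hardware and also handles the edge cases mentioned above.

```python
import heapq


def pick_bottlenecks_to_accelerate(twc_by_bottleneck, n):
    """Return the ids of the n bottlenecks with the highest TWC, worst first."""
    return heapq.nlargest(n, twc_by_bottleneck, key=twc_by_bottleneck.get)


# Hypothetical TWC values for four bottlenecks.
twc = {"lock_A": 25, "barrier_B": 15, "stage_C": 40, "lock_D": 3}
accelerated = pick_bottlenecks_to_accelerate(twc, n=2)
print(accelerated)  # ['stage_C', 'lock_A']
```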

The Yin/Yang Metaphor

• Hardware: heterogeneous multi-core balances power and performance
– Everyone cares about performance-per-energy (PPE) instead of absolute performance
• Software: move toward managed programming languages with virtual machines, like Java (JVM), C# (.NET), and JavaScript
• Yang of heterogeneous hardware: exposed hardware adds complexity
• Yin of managed languages: the VM handles all that exposed complexity for the programmer
• Yang of VM languages: overhead
• Yin of heterogeneous hardware: small cores can alleviate that overhead problem

Yin and Yang of Power and Performance: Overview

• Virtual machines add substantial overhead in computation time and energy (~40%)
• Java VM-related numbers (~37%):
– 10% garbage collection
– 12% JIT
– 15% executing untouched instructions via the interpreter
• Paper: exploits the GC, JIT, and interpreter tasks by placing each on the right type of core, based on its combination of parallelism, asynchrony, non-criticality, and hardware sensitivity

Yin and Yang of Power and Performance: Overview (contd.)

• Garbage collection – asynchronous, can use many cores, and does not benefit from a high clock rate. Use a low-power core with high memory bandwidth

• JIT – asynchronous, some parallelism, and non-critical. Use a small core, since it is powerful enough

• Interpreter – on the critical path and not asynchronous. Uses the application’s parallelism. Again, generally use low-power cores

YinYang Experimental Evaluation

• Power: they measure the power overhead of VM services; the VM does consume significant power, so it is a good candidate for heterogeneous systems
• Performance-per-energy (PPE): many results are reported in this metric instead of absolute performance
• Moving the JIT and GC to lower-clocked cores increases PPE (by 9-13%)
• GC is very memory-bound, so it does well on low-power cores
• JIT is less memory-bound, but embarrassingly parallel, so it still does well on low-power cores
• Interpreter PPE improvement is less stark, but still there

PIE: Performance Impact Estimation

• A heterogeneous multi-core architecture is one that features big, powerful, power-hungry core(s) and small, weak, energy-efficient core(s).

• How do we map workloads onto the appropriate cores to maximize "speed-per-energy"?

• PIE is a static or dynamic scheduler that takes both memory-level parallelism and instruction-level parallelism into account to predict how well a job will do on different types of cores
– Static: schedule jobs once for the duration of the job
– Dynamic: push parts of jobs to appropriate cores

PIE (contd.)

• Motivation: the common intuition is wrong
– “Compute-heavy jobs should go on the heavyweight 'big' cores, while memory-heavy jobs can do well enough on 'small' cores.”
– Authors run experiments and find that big cores do well on MLP-intensive jobs, while small cores do well on ILP-intensive jobs

More PIE

• One way to schedule is to randomly sample job-core mappings, learn, and choose the best
– High overhead!
• PIE instead tries to estimate a job's performance on core type B while the job is running on core type A
• Estimate is based on an aggregate of:
– Base instruction-level parallelism (ILP) score
– A memory-level parallelism (MLP) component that is a function of the processor's architecture and the observed cache misses from the specific job
• Scheduler takes the estimate of a job's performance on different processor types and decides where to put the job
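To make the estimate-then-schedule idea concrete, here is a deliberately simplified sketch. It is not the model from the paper: it assumes measured CPI splits into a base part and a memory-stall part, and that each part scales by a fixed, hypothetical factor (`ilp_ratio`, `mlp_ratio`) when moving to the other core type. All names and numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Counters observed for one job on the core it currently runs on."""
    cpi_base: float  # cycles per instruction excluding memory stalls
    cpi_mem: float   # cycles per instruction spent waiting on memory


def estimate_cpi_on_other_core(sample, ilp_ratio, mlp_ratio):
    """Estimate CPI on the other core type (assumed linear scaling).

    ilp_ratio: assumed factor applied to the base CPI on the other core.
    mlp_ratio: assumed factor applied to the memory-stall CPI on the other core.
    """
    return sample.cpi_base * ilp_ratio + sample.cpi_mem * mlp_ratio


def choose_job_for_big_core(jobs_on_small, ilp_ratio, mlp_ratio):
    """Pick the job with the largest estimated speedup from moving to the big core."""
    def speedup(item):
        _, s = item
        current = s.cpi_base + s.cpi_mem
        return current / estimate_cpi_on_other_core(s, ilp_ratio, mlp_ratio)

    name, _ = max(jobs_on_small.items(), key=speedup)
    return name


jobs = {
    "compute_bound": Sample(cpi_base=1.0, cpi_mem=0.2),
    "memory_bound": Sample(cpi_base=0.5, cpi_mem=3.0),
}
# Hypothetical ratios: the big core trims base CPI a little and memory stalls a lot.
best = choose_job_for_big_core(jobs, ilp_ratio=0.8, mlp_ratio=0.4)
print(best)
```

With these made-up ratios the memory-bound job gains more from the big core, in line with the finding that big cores do well on MLP-intensive jobs. The point of the sketch is that the decision needs no trial runs on the other core type, only counters from the current one.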

PIE works well

Multiple CPU Single GPU Systems

• Memory becomes a critical resource
– GPU accesses are vastly different from CPU ones
– GPUs generate significantly more requests
– A GPU spawns many different threads
– Increased contention between GPU and CPU
• Need to design a memory controller that
– Schedules the memory accesses
– Ensures fairness
– Is scalable and easy to implement
• Current approaches are not robust to the presence of both GPU and CPUs

Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems

• Proposes a multi-stage approach to application-aware memory scheduling
– Handles interference between bandwidth-demanding apps and non-demanding ones (e.g. GPU and CPU respectively)
– Simplified hardware implementation due to decoupling of the memory controller across multiple stages
• Improves CPU performance without degrading GPU performance
– Authors test on many settings (e.g. CPU only, GPU only, CPU & GPU)
– Compare to existing approaches

Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems (contd.)

• Sophisticated approaches that prioritize memory accesses require overly complex logic
– E.g. content-addressable memories (CAMs)
• SMS uses a three-stage approach:
1. Batch Formation: per-source aggregation of memory requests into batches
2. Batch Scheduler: prioritizes batches coming from latency-critical apps (e.g. CPU ones)
3. DRAM Command Scheduler: FIFO queues per DRAM bank; each batch from Stage 2 is placed on these FIFOs
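The three stages listed above can be mimicked in software as a rough sketch. Everything here is a simplifying assumption rather than the paper's design: requests are `(source, bank)` tuples, a source's batch closes when its next request targets a different bank, and stage 2 simply puts latency-critical sources first. The function names are hypothetical.

```python
from collections import defaultdict, deque


def form_batches(requests):
    """Stage 1: group each source's consecutive same-bank requests into batches."""
    open_batch = {}  # source -> batch currently being built
    batches = []
    for source, bank in requests:
        batch = open_batch.get(source)
        if batch is None or batch["bank"] != bank:
            batch = {"source": source, "bank": bank, "requests": []}
            open_batch[source] = batch
            batches.append(batch)
        batch["requests"].append((source, bank))
    return batches


def schedule_batches(batches, latency_critical):
    """Stage 2: stable sort that moves latency-critical sources to the front."""
    return sorted(batches, key=lambda b: b["source"] not in latency_critical)


def enqueue_to_banks(batches):
    """Stage 3: append each batch's requests to its bank's FIFO queue."""
    fifos = defaultdict(deque)
    for batch in batches:
        fifos[batch["bank"]].extend(batch["requests"])
    return fifos


requests = [("GPU", 0), ("GPU", 0), ("CPU0", 0), ("GPU", 1), ("CPU0", 0)]
batches = form_batches(requests)
ordered = schedule_batches(batches, latency_critical={"CPU0"})
fifos = enqueue_to_banks(ordered)
print(list(fifos[0]))  # CPU0's batch is served before the GPU's batch on bank 0
```

Because each stage only needs simple structures (per-source buffers, a batch ordering, per-bank FIFOs), none of them has to search across all pending requests, which is the intuition behind the simpler hardware.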

SMS works well

Discussion

• Leaving the ISA assumption aside, how can we combine ideas from the first three papers?
– Seems like we can incorporate ILP and MLP in our queuing decisions in BIS
– VM can also become more dynamic
• BIS, PIE, and YinYang papers assume heterogeneous multi-core systems with the same instruction set architecture (ISA). How would these papers change if we assumed different ISAs?
– I'm thinking not fundamentally. It would make the prediction part of PIE more complicated (you'd need some context-aware performance scaling between ISAs), but wouldn't break it.
– Similarly, you'd need some scaling for VM stuff, but it wouldn't break anything there, either.
– Maybe the class has an opinion on this? Do we need to keep ISAs homogeneous across a heterogeneous multi-core? What do we gain or lose from this?
