HSA System Emulation and Performance Evaluation Shih-Hao Hung Performance, Applications, and...

HSA System Emulation and Performance Evaluation

Shih-Hao Hung

Performance, Applications, and Security Lab, National Taiwan University

◆ Single processor with unsatisfactory performance
◆ Hardware acceleration: task partitioning for efficiency
– for I/O
– for network
– for encoding/decoding
– for graphics
◆ Special-purpose processors: programmable/efficient
– Network processors, DSPs, GPUs, ...
◆ Reconfigurable hardware (FPGA): efficient/programmable
◆ Homogeneous multicore: data parallelism
◆ Cloud computing: scalability
◆ Heterogeneous systems: may include any of the above

Evolution of Computing Systems

Shih-Hao Hung, NTU-CSIE

◆ Today, computers are complex and heterogeneous
– New smartphones have 4 to 8 cores and sophisticated software
– Even embedded systems have multiple CPU and GPU cores
– A cloud system consists of a large number of computers
– Mobile cloud computing emphasizes interoperability for smooth and transparent interactions
◆ Good for application developers and makers
– Many powerful and convenient HW/SW kits are available
– Makes it easy to change the world (in your own way)
◆ However, leading-edge systems engineering/research is harder than ever

Complexity in Systems Research

◆ Applications as innovative as possible
◆ Time to market as short as possible
◆ Development skills as low as possible
◆ Performance as fast as possible
◆ Power and energy as efficient as possible
◆ Size as small as possible

How to Produce Leading-Edge Products?

◆ Good in performance and efficiency, but
– Unconventional
– Hard to design and program
– Complex
◆ Overcoming these technology barriers
– Research and innovation skills are needed to solve unconventional problems
– New methodologies and knowledge must be learned to handle the issues
– Design tools and virtualization technology are used to address complexity

Heterogeneous Systems

◆ Tools to reduce difficulties and increase productivity
– Libraries, debuggers, simulators, ...
– Assist the design and verification processes
– Make it easy to search the design space
– Shorten time-to-market
◆ What is missing?
– Experience: exploring the new world is very different from copying designs, reverse engineering, or cost-down (BTW, skilled hands are badly needed now...)
– Virtual platforms: playgrounds that mimic real systems are needed for experimenting with new ideas/designs

Satisfying the Needs for Systems R&D

◆ Virtual platforms have been used for years in HW design
– Have you written any Verilog or VHDL code lately?
– Circuit-level simulators (analog design, SPICE)
– Logic-level simulators, a.k.a. register-transfer-level (RTL)
– Transaction-level modeling (TLM)
– Electronic System Level (ESL)
◆ Unfortunately, these are very, very slow!

Virtual Platforms

Wanted for HW/SW Codesign!

◆ Verification: detailed cycle-by-cycle RTL model
◆ Architecture study:
– Processor pipeline model
– Branch prediction model
– TLB model
– Private cache model
– Cache coherence model
– Memory model
– I/O bus model
– I/O device model

What Are Wanted for HW Design?

◆ Verification: detailed cycle-by-cycle RTL model
◆ Architecture study:
– Processor pipeline model
– Branch prediction model
– TLB model
– Private cache model
– Cache coherence model
– Memory model
– I/O bus model
– I/O device model

Need Everything for HW Design?

◆ System-wide profiling, monitoring, and tracing
– Performance analysis, e.g. hot functions, HW/SW interactions
– Behavior analysis, e.g. security model for malware detection
• Wen-Chieh Wu and Shih-Hao Hung. DroidDolphin: a Dynamic Android Malware Detection Framework Using Big Data and Machine Learning, in Proc. the 2014 Research in Adaptive and Convergent Systems (RACS 2014), Towson, US, October 5-8, 2014.
– Full-system power consumption analysis
– Guidance for real-time programming
◆ Concurrent and parallel programming
– Resolving race conditions for shared resources
– Identification of performance bottlenecks
– Visualizing interprocessor communication and synchronization
– Guidance for heterogeneous computing

What Are Wanted for Software Design?

Parallel Smart Event Tracing

[Figure: block diagram of parallel smart event tracing. Labels: OpenCL Application, PQEMU, GPU Simulator, Host System, Target System, Tracing Control, PI, VPMU, Event Collector, Trace, Analysis Tools, Tracing Engine, Linux Kernel, CPU Emulator, Buffer. Legend: modeling related / tracing related.]

◆ Traditional tracing techniques are ad hoc
– Require HW and/or SW instrumentation → poor portability
• HW instrumentation is nearly impossible for most users
• SW instrumentation may require deep knowledge of the OS, runtime software, and compiler tools
– Intrusiveness: need to remove the overhead of instrumentation
◆ In-Emulation Tracing
– Instrumentation in QEMU works for virtually any popular ISA, OS, and software → high portability
– HW models can be added for HW analysis
– HSA GPU or FPGA can also be added to emulate heterogeneous systems

Advantage for In-Emulation Tracing?

HSAemu

• First functional emulator for HSA

• Created by Prof. Yeh-Ching Chung at NTHU.

• Published recently in a top conference:

Jiun-Hung Ding, Wei-Chung Hsu, Bai-Cheng Jeng, Shih-Hao Hung and Yeh-Ching Chung. HSAemu – A Full System Emulator for HSA Platforms, in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS 2014), New Delhi, India, October 12-17, 2014.

◆ In-Emulation Tracing
◆ Performance optimization for applications
– Find software bottlenecks in single-threaded applications
– Help parallelize applications with OpenCL/Sumatra/…
– Evaluate performance for OpenCL/Sumatra applications
◆ Performance evaluation for systems
– Support early-stage architecture design
– Help define and test the hardware-software interface
– Enable early-stage system software design

Making HSAemu Better?

◆ MCEmu
– Chia-Heng Tu, Shih-Hao Hung, and Tung-Chieh Tsai. 2012. MCEmu: A Framework for Software Development and Performance Analysis of Multicore Systems. ACM Trans. Des. Autom. Electron. Syst. 17, 4, Article 36 (October 2012).
◆ System Evaluation
– Shih-Hao Hung, Chi-Sheng Shih, Tei-Wei Kuo, Chia-Heng Tu, and Che-Wei Chang, A Real-Time, Energy-Efficient System Software Suite for Heterogeneous Multicore Platforms, in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS 2012), Tampere, Finland, October 7-12, 2012.

Moving Old Tricks to HSAemu

The MCEmu Framework

◆ Software development tool

◆ Board support package

◆ Smart event tracing unit

◆ Virtual performance monitoring unit

◆ Parallel simulation framework

[Figure: MCEmu framework block diagram. Labels: Multicore Applications; Software Development Kit (Inter-core communication, Tracing/Profiling Tools, Board Support Package); System Emulator (QEMU) with Main Processor(s), Real-time Clock & Virtual I/O Devices, Smart Event Tracing Unit, Virtual Performance Monitoring Unit, System Bus; Processor/Device Simulators (Special Purpose Processor #1, Special Purpose Processor #2, Device Simulator); Host System (Multicore).]

MCEmu Framework – Virtual Performance Monitoring Unit

[Figure: VPMU block diagram. Labels: Platform emulator; Inst. stream; CPU events, Cache events, Mem. events, Disk events; External architecture models (Pipeline simulator, Cache simulator, Mem. simulator, Disk simulator); Math model; Timing model 1 (fast, rough), Timing model 2, Timing model 3 (slow, accurate); Joint estimators; Performance counters; Estimated cycle count; Power calculator; Estimated Power/Energy; Current voltage status register; Current freq. status register; Model and simulator selection & power setting adjustment; Applications and performance tools. Legend: control path / data path.]

Current freq. status register

◆ VPMU organization for multicore processors

[Figure: per-core VPMU instances for Processor core #1, #2, and #3, each with a VTD and a VPD, containing CPU events, Joint estimators, Performance counters, Estimated cycle count, Power calculator, and Estimated Power/Energy. Shared components: Cache events, Mem. events, Disk events, Coherence cache events, Global clock, System performance counters, System power/energy.]

MCEmu Framework – Virtual Performance Monitoring Unit

MCEmu Framework – Smart Event Tracing Unit

[Figure: SETU block diagram. Labels: Application & OS; Inst. stream; Event registration device; Event filtering engine (Process name, Operating mode, Performance events); Processor core #1, #2, #3, each with VTD and VPD; Mem. events, Disk events, Coherence cache events, Global clock, System performance counters, System power; Trace record buffer; Trace file; convert; Performance visualization tool; Performance tools. Legend: control path / data path.]

Virtual Performance Analyzer

◆ Virtual Performance Analyzer (VPA) supports performance analysis and systems design for Android
– Hook the necessary component simulators to model and monitor performance and power (VPMU)
– Trace HW/SW events with the Smart Event Tracing (SET) engine, driver, and agent
– Run Android/Linux with minimal porting effort and observe with friendly tools
– Users may experiment with optimization tricks, e.g. changing cache sizes, adding crypto accelerators, revising drivers, applying DVFS techniques, etc.

Design for Android Systems

2011 ESWEEK Android Competition 4th Place

Shih-Hao Hung, Tei-Wei Kuo, Chi-Sheng Shih, and Chia-Heng Tu. System-Wide Profiling and Optimization with Virtual Machines, in Proc. 17th Asia and South Pacific Design Automation Conference (ASP-DAC 2012), pp. 395 - 400, Sydney, Australia, Jan. 2012. (EI)

Estimate of Power Consumption w/ VPA

◆ Measured by instrumentation or external power meter – data collection overhead, limited information, usability

◆ VPA – Systematically generated model, fast and accurate enough, no need for actual hardware, deployable in cloud

Shih-Hao Hung, Jen-Hao Chen, Chia-Heng Tu and Jeng-Peng Shieh. Exploring the Design Space for Android Smartphones, in Proc. The Eighth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS-2014), London, United Kingdom, July 2-4, 2014.

Finding Optimal Solutions in Virtual Space

HW: CPU (big.LITTLE), GPU, Cache, Memory, I/O Devices

SW: OS tunables, Applications

Shih-Hao Hung, Jen-Hao Chen, Chia-Heng Tu and Jeng-Peng Shieh. Exploring the Design Space for Android Smartphones, in Proc. The Eighth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS-2014), London, United Kingdom, July 2-4, 2014.

[Figure: plot marking configurations ②, ③, ⑤, and ⑥]

Configuration                    1       2       3       4 (G1)  5       6
Cache size (KB)                  8       8       32      32      32      132
Associativity                    1       4       4       4       2       2
Block size (bytes)               512     32      128     32      32      128
Subblock size (bytes)            64      32      32      32      32      32
Write allocate?                  N       Y       Y       Y       Y       Y
Replacement policy               FIFO    Random  LRU     LRU     LRU     FIFO
Die area (mm^2)                  0.081   0.118   0.258   0.3130  0.348   1.167
Estimated execution time (ms)    80,302  18,582  14,961  15,546  14,169  14,016

(NOTE: Process technology is 65 nm)
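One way to read the table is as a trade-off between die area and estimated execution time. The sketch below is an illustration, not the method used in the cited paper: it copies the two rows from the table and keeps only the configurations that no other configuration beats on both metrics. The function name `pareto_front` is hypothetical.

```python
# Data copied from the table: (configuration, die area in mm^2, estimated time in ms).
CONFIGS = [
    ("1", 0.081, 80302),
    ("2", 0.118, 18582),
    ("3", 0.258, 14961),
    ("4 (G1)", 0.3130, 15546),
    ("5", 0.348, 14169),
    ("6", 1.167, 14016),
]


def pareto_front(configs):
    """Return the names of configurations that no other configuration dominates.

    A configuration is dominated when another one is no worse in both
    die area and execution time and strictly better in at least one.
    """
    front = []
    for name, area, time_ms in configs:
        dominated = any(
            (a <= area and t <= time_ms) and (a < area or t < time_ms)
            for other, a, t in configs
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    print(pareto_front(CONFIGS))
```

With these numbers, configuration 4 (G1) is the only dominated point: configuration 3 is both smaller and faster.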

Cache Simulation for Multicore

Cache Simulator - GEMS

• Detailed memory system simulation model that can simulate a wide variety of memory hierarchies and support many different cache coherence protocols

• Baseline: single-threaded, very slow

Parallel Cache Simulation

[Figure: four processors P1 to P4, each with a private L1 cache]

• Need to figure out the 4Cs:
– Compulsory misses
– Conflict misses
– Capacity misses
– Coherence misses
• The first 3Cs are local to a processor
– Identified by standard cache simulators
• Approximate coherence misses with a parallel method
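The per-processor 3C classification can be sketched as follows. This is a minimal illustration of the textbook approach, not the lab's simulator: a miss is compulsory on the first touch of a block, a conflict miss if a fully associative LRU cache of the same capacity would have hit, and a capacity miss otherwise. The function name and the default geometry (16 KB, 4-way, 32 B lines) are assumptions chosen for illustration.

```python
from collections import OrderedDict


def classify_misses(addresses, cache_bytes=16 * 1024, ways=4, line_bytes=32):
    """Classify each miss of a set-associative LRU cache as one of the 3Cs."""
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
    fully_assoc = OrderedDict()  # reference model: same capacity, one big set
    seen = set()  # blocks touched so far
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for addr in addresses:
        block = addr // line_bytes

        # Update the fully associative LRU reference cache.
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            if len(fully_assoc) >= num_lines:
                fully_assoc.popitem(last=False)  # evict least recently used
            fully_assoc[block] = True

        # Update the set-associative cache under study.
        lru = sets[block % num_sets]
        if block in lru:
            lru.move_to_end(block)
            counts["hit"] += 1
        else:
            if len(lru) >= ways:
                lru.popitem(last=False)
            lru[block] = True
            if block not in seen:
                counts["compulsory"] += 1
            elif fa_hit:
                counts["conflict"] += 1  # only the set mapping caused the miss
            else:
                counts["capacity"] += 1  # even full associativity would miss
        seen.add(block)
    return counts
```

Coherence misses are not covered here because they depend on the accesses of other processors.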

Parallel Cache Simulation Scheme

◆ Simulation speed could be improved by integrating the lab's previous work
– (2012) Hui-Hsin's M.S. thesis on a parallel cache simulator
– (2014) Jen-Jong's M.S. thesis on a cache simulator for HSA

Non-deterministic Communications

• Approximation? The memory access order within a parallel region of a MIMD system is non-deterministic anyway

[Figure: three cases for the reference ranges Ref_{i,p} and Ref_{i,q}. Case 1: no overlap. Case 2: partial overlap. Case 3: total overlap.]

◆ The minimum number of coherence misses occurs when there is no overlap
◆ Easy to calculate
– RAW
– WAR
– WAW

Required Communications

[Figure: Ref_{i,p} and Ref_{i,q} in Case 1: no overlap]

Estimating Optional Communications

• R_{i,j}: read references to cache line i by core j
• W_{i,j}: write references to cache line i by core j
• Ref_{i,j}: the union of R_{i,j} and W_{i,j}
• Range(X): length of the memory reference range, where X is a set of memory references
• L: length of the overlap region
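The definitions above can be made concrete with a small sketch. It assumes, and this is only an assumption since the slide does not say so, that every reference carries a logical timestamp and that Range(X) is the span between the earliest and latest timestamp in X. All function names are hypothetical, and the sketch stops at computing Range and L; it does not reproduce the thesis's estimate of optional communications.

```python
def reference_range(timestamps):
    """Range(X): span between the first and last reference in X (assumed timestamps)."""
    return max(timestamps) - min(timestamps)


def overlap_length(refs_p, refs_q):
    """L: length of the region where the two reference ranges overlap."""
    start = max(min(refs_p), min(refs_q))
    end = min(max(refs_p), max(refs_q))
    return max(0, end - start)


def overlap_case(refs_p, refs_q):
    """Map two reference sets for the same cache line to the three overlap cases."""
    lo_p, hi_p = min(refs_p), max(refs_p)
    lo_q, hi_q = min(refs_q), max(refs_q)
    if hi_p < lo_q or hi_q < lo_p:
        return "no overlap"
    if (lo_p <= lo_q and hi_q <= hi_p) or (lo_q <= lo_p and hi_p <= hi_q):
        return "total overlap"
    return "partial overlap"


if __name__ == "__main__":
    ref_p = [1, 4]  # hypothetical timestamps of core p's references to line i
    ref_q = [3, 8]  # hypothetical timestamps of core q's references to line i
    print(overlap_case(ref_p, ref_q), overlap_length(ref_p, ref_q))
```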

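Jen-Jong Cheng, M.S. thesis, Department of Computer Science and Information Engineering, National Taiwan University, 2014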

System Architecture Overview

◆ System Emulator:
– Insert VPMU for performance profiling
– Coordinate synchronization for each simulator
◆ SSLAB GPU:
– Provide GPU runtime performance information
– Coalesce GPU memory traces
◆ Cache Simulator:
– Perform 3C cache simulation
– Evaluate cache coherence with an analytic model

[Figure: system architecture. Labels: HSA Application, HSA Runtime API, Guest OS; PQEMU (Processors, I/O Device, VPMU); SSLAB GPU (Command Monitor, Translation engine, Execution Engine, VTD, Trace buffer); Cache Simulator (3C Cache Simulation, Analytic model).]

SSLAB GPU emulator

◆ Command Monitor
– Notify VPMU to enable the GPU timing device
◆ Virtual Timing Device
– Calculate GPU local timing
• e.g. GPU CU local time = instruction count × average CPI × CPU frequency / GPU frequency
◆ Memory helper function
– Count instructions at runtime
– Generate memory traces
– Reschedule memory traces
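The Virtual Timing Device formula can be written out directly. The sketch below is a literal transcription with a hypothetical function name; the instruction count, CPI, and clock frequencies in the example are made-up inputs. Dimensionally, instruction count times CPI gives GPU cycles, and scaling by the frequency ratio expresses them in CPU clock cycles.

```python
def gpu_cu_local_time(instruction_count, average_cpi, cpu_freq_hz, gpu_freq_hz):
    """GPU CU local time = instruction count * average CPI * CPU freq / GPU freq."""
    return instruction_count * average_cpi * cpu_freq_hz / gpu_freq_hz


if __name__ == "__main__":
    # Hypothetical inputs: 1000 instructions, CPI of 2, 2.0 GHz CPU, 1.0 GHz GPU.
    print(gpu_cu_local_time(1000, 2.0, 2.0e9, 1.0e9))
```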

[Figure: SSLAB GPU emulator data flow. Labels: HSA API; HSA monitor; notify; VPMU; Task dispatch; HSA CU threads (Global_load, Global_store); Memory access; Instruction counts; VTD; update GPU local time; Trace sender; traces; Cache Simulator.]

Experiments (Jen-Jong Cheng, 2014-07)

• Host System
– 32 Intel Xeon E5-2660 2.2 GHz processors, 16 GB DDR3
– Ubuntu 12.04 (64-bit)
• Virtual platform
– PQEMU-0.13 + SSLAB GPU + Multi2Sim
– ARM Realview-PBX-a9, supports up to 4 cores
• Benchmarks
– AMD OpenCL
– Splash2 benchmarks (CPU benchmarks)
– Srad (OpenCL with shared memory)
• Cache Configuration
– 16 KB cache size, 4-way, 32 B cache line size, 128 cache sets
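The cache configuration is internally consistent: 16 KB divided by 4 ways of 32 B lines gives 128 sets. The short sketch below checks that arithmetic and shows how an address would split into tag, set index, and line offset for this geometry. The function names and the sample address are illustrative only.

```python
CACHE_BYTES = 16 * 1024
WAYS = 4
LINE_BYTES = 32


def cache_sets(cache_bytes, ways, line_bytes):
    """Number of sets = capacity / (associativity * line size)."""
    return cache_bytes // (ways * line_bytes)


def split_address(addr, num_sets=128, line_bytes=LINE_BYTES):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr % line_bytes
    block = addr // line_bytes
    return block // num_sets, block % num_sets, offset


if __name__ == "__main__":
    print(cache_sets(CACHE_BYTES, WAYS, LINE_BYTES))
    print(split_address(0x12345))
```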

Accuracy, Compared to GEMS

• Splash benchmarks with 4 threads on 4 ARM cores
• AAER = Average Absolute Error Rate
• Synchronization is triggered every one thousand memory references.
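The slide names the AAER metric but does not give its formula. A plausible reading, stated here as an assumption, is the mean over benchmarks of the absolute relative error of the estimate against the GEMS reference value. The numbers in the example are invented to show the arithmetic and are not experimental results.

```python
def aaer(estimated, reference):
    """Average Absolute Error Rate, assumed to be mean(|est - ref| / ref)."""
    if len(estimated) != len(reference) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(e - r) / r for e, r in zip(estimated, reference)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical miss counts: estimates vs. reference values for two benchmarks.
    print(aaer([110, 90], [100, 100]))
```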

Example of Cache Misses Analysis

FPGA Accelerators

◆ Intel and FPGA
– http://www.extremetech.com/extreme/184828-intel-unveils-new-xeon-chip-with-integrated-fpga-touts-20x-performance-boost
◆ Video demos from Altera & Xilinx
– https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.highResolutionDisplay.html
– http://www.xilinx.com/products/design-tools/sdx/sdaccel.html

FPGA Acceleration

◆ Potential for a higher performance-to-power ratio than GPUs
◆ Keys:
– Data copies can be done by wires
– Intensive simple integer operations
– Conversion of loops into pipelines
– Can be placed in-line

Connecting an FPGA Simulator to QEMU (1/2)

◆ System Emulator:
• Contains an FPGA device, accessible from Linux and apps
• Transfers FPGA commands and simulation data to the FPGA simulator

Shih-Hao Hung, Tien-Tzong Tzeng, Jyun-De Wu, Min-Yu Tsai, Yi-Chih Lu, Jeng-Peng Shieh, Chia-Heng Tu, Wen-Jen Ho. MobileFBP: Designing portable reconfigurable applications for heterogeneous systems, in Journal of Systems Architecture, Volume 60, Issue 1, January 2014, Pages 40-51. (SCI)

Connecting an FPGA Simulator to QEMU (2/2)

◆ FPGA Simulator:
– Controlling interface implemented with the Verilog Procedural Interface (VPI)
– Data buffer for saving simulation data

Design Hardware Acceleration in Virtual Space

◆ Shorten time to market and correct designs early
– Profile applications: find performance bottlenecks and analyze data flow
– Develop the accelerator and software support in parallel
– Evaluate strategies with co-simulation

[Figure: Virtual Performance Analyzer. In physical space: Machine, Application, Driver, Accelerator. In virtual space: Virtual Machine, Application, Driver, Verilog Simulator.]

Beyond a Single System

Design for Heterogeneous Clouds

[Figure: heterogeneous cloud stack. Apps on Servers: Web Services, Webkit, MapReduce, WebCL/WebGL, OpenCL/OpenGL, Filesystem, User Data. Heterogeneous Cloud Infrastructure: x86, ARM, GPU, and FPGA nodes, Management Facilities, Switching Fabric. Performance & Cost Models.]

◆ Servers as the basic elements in a cloud system
◆ Design and optimize for big data analytics? → In virtual space

MOST Big Data Project, 2013-2014

Accelerating MapReduce

[Figure: MapReduce across Node 1 and Node 2 connected by a Network. Labels: Map on FPGA, Filter on FPGA, Reduce on FPGA, Shuffle, Sort, Reduce, Compression, Decompression.]

◆ Attach FPGA boards to accelerate MapReduce

◆ Filtering data at the source to reduce CPU work for query operations

◆ Develop toolkit and API for applications to utilize FPGA for intensive Map and Reduce computation

◆ Compression/decompression engines to reduce network traffic

◆ RDMA engine to reduce overhead of network protocol

Hardware-Software Co-Design

◆ Development toolkit for accelerating MapReduce applications with FPGA
– Source code analyzer: figures out program structure and adds instrumentation code
– Performance profiler: identifies bottlenecks
– FPGA API: enables programmers to invoke the FPGA for acceleration
– High-Level Language to FPGA Compiler: helps convert HLL to HDL
– FPGA Library: includes commonly used functions
– Virtual Platform: allows programmers to debug and test FPGA acceleration

[Figure: toolkit flow from MapReduce App to New MapReduce App. Labels: Source Code Analyzer, Performance Analyzer, Critical Path, Non-Critical Path, FPGA API, HLL-to-HDL Compiler, FPGA Lib, Virtual Platform.]

◆ Systems research is increasingly challenging, and it is very important to Taiwan's industry

◆ Tightly coupled hardware-software design is key to winning, and it can be done effectively with the right methodologies and tools

◆ Virtualization technologies and tools can help build smarter systems, from mobile to cloud applications

◆ HSA is getting more and more interesting and requires research/innovation skills together with knowledge and tools

◆Lots of opportunities!

Conclusion
