Post on 22-Dec-2015
HSA System Emulation and Performance Evaluation
Shih-Hao Hung
Performance, Applications, and Security Lab, National Taiwan University
◆Single processor with unsatisfying performance
◆Hardware acceleration: Task partitioning for efficiency
– for I/O
– for network
– for encoding/decoding
– for graphics
◆Special-purpose processors: Programmable/Efficient
– Network processors, DSPs, GPUs, ...
◆Reconfigurable hardware (FPGA): Efficient/Programmable
◆Homogeneous multicore: Data parallelism
◆Cloud computing: Scalability
◆Heterogeneous systems: may include any of the above
Evolution of Computing Systems
Shih-Hao Hung, NTU-CSIE
◆Today, computers are complex and heterogeneous
– New smartphones have 4 to 8 cores and sophisticated SW
– Even embedded systems have multiple CPU and GPU cores
– A cloud system consists of a large number of computers
– Mobile cloud computing emphasizes interoperability for smooth and transparent interactions
◆Good for application developers and makers
– Many powerful and convenient HW/SW kits available
– Makes it easy to change the world (in your own way)
◆However, leading-edge systems engineering/research is harder than ever
Complexity in Systems Research
◆Applications as innovative as possible
◆Time to market as short as possible
◆Required development skills as low as possible
◆Performance as high as possible
◆Power and energy use as efficient as possible
◆Size as small as possible
How to Produce Leading-Edge Products?
◆Good in performance and efficiency, but
– Unconventional
– Hard to design and program
– Complex
◆Solving these technology barriers
– Research and innovation skills are needed to solve unconventional problems
– New methodologies and knowledge must be learned to handle the issues
– Design tools and virtualization technology are used to address complexity
Heterogeneous Systems
◆Tools to reduce difficulties and increase productivity
– Libraries, debuggers, simulators, ...
– Assist the design and verification processes
– Make it easy to search the design space
– Shorten time-to-market
◆What is missing?
– Experience: Exploring the new world is very different from copying designs, reverse engineering, or cost-down (BTW, skilled hands are badly needed now...)
– Virtual platforms: Playgrounds which mimic real systems are needed for experimenting with new ideas/designs
Satisfying the Needs for Systems R&D
◆Virtual platforms have been used for years in HW design
– Have you written any Verilog or VHDL code lately?
– Circuit-level simulators (analog design, SPICE)
– Logic-level simulators, a.k.a. register-transfer-level (RTL)
– Transaction-level modeling (TLM)
– Electronic System Level (ESL)
◆Unfortunately, these are very, very slow!
Virtual Platforms
Wanted for HW/SW Codesign!
◆Verification: Detailed cycle-by-cycle RTL model
◆Architecture study:
– Processor pipeline model
– Branch prediction model
– TLB model
– Private cache model
– Cache coherence model
– Memory model
– I/O bus model
– I/O device model
What Are Wanted for HW Design?
◆Verification: Detailed cycle-by-cycle RTL model
◆Architecture study:
– Processor pipeline model
– Branch prediction model
– TLB model
– Private cache model
– Cache coherence model
– Memory model
– I/O bus model
– I/O device model
Need Everything for HW Design?
◆System-wide profiling, monitoring and tracing
– Performance analysis, e.g. hot functions, HW/SW interactions
– Behavior analysis, e.g. security model for malware detection
• Wen-Chieh Wu and Shih-Hao Hung. DroidDolphin: a Dynamic Android Malware Detection Framework Using Big Data and Machine Learning, in Proc. the 2014 Research in Adaptive and Convergent Systems (RACS 2014), Towson, US, October 5-8, 2014.
– Full-system power consumption analysis
– Guidance for real-time programming
◆Concurrent and parallel programming
– Resolving race conditions for shared resources
– Identification of performance bottlenecks
– Visualizing interprocessor communications & synchronization
– Guidance for heterogeneous computing
What Are Wanted for Software Design?
Parallel Smart Event Tracing
[Figure: block diagram of the host system and target system. Labeled components: OpenCL Application, PQEMU, GPU Simulator, CPU Emulator, Linux Kernel, VPMU, PI, Tracing Control Tool, Tracing Engine, Event Collector, Buffer, Trace, Analysis Tools, Disk. Legend: modeling related, tracing related.]
◆Traditional tracing techniques are ad hoc
– Require HW and/or SW instrumentation → poor portability
• HW instrumentation is nearly impossible for most users
• SW instrumentation may require deep knowledge of the OS, runtime software, and compiler tools
– Intrusiveness: Need to remove the overhead of instrumentation
◆In-Emulation Tracing
– Instrumentation in QEMU works for virtually any popular ISA, OS, and software → high portability
– HW models can be added for HW analysis
– HSA GPU or FPGA can also be added to emulate heterogeneous systems
Advantages of In-Emulation Tracing?
HSAemu
• First functional emulator for HSA
• Created by Prof. Yeh-Ching Chung at NTHU.
• Published recently in a top conference:
Jiun-Hung Ding, Wei-Chung Hsu, Bai-Cheng Jeng, Shih-Hao Hung and Yeh-Ching Chung. HSAemu – A Full System Emulator for HSA Platforms, in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS 2014), New Delhi, India, October 12-17, 2014.
◆In-Emulation Tracing
◆Performance optimization for applications
– Find software bottlenecks in single-threaded applications
– Help parallelize applications with OpenCL/Sumatra/…
– Evaluate performance for OpenCL/Sumatra applications
◆Performance evaluation for systems
– Support early-stage architecture design
– Help define and test hardware-software interfaces
– Enable early-stage system software design
Making HSAemu Better?
◆MCEmu
– Chia-Heng Tu, Shih-Hao Hung, and Tung-Chieh Tsai. 2012. MCEmu: A Framework for Software Development and Performance Analysis of Multicore Systems. ACM Trans. Des. Autom. Electron. Syst. 17, 4, Article 36 (October 2012).
◆System Evaluation
– Shih-Hao Hung, Chi-Sheng Shih, Tei-Wei Kuo, Chia-Heng Tu, and Che-Wei Chang, A Real-Time, Energy-Efficient System Software Suite for Heterogeneous Multicore Platforms, in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS 2012), Tampere, Finland, October 7-12, 2012.
Moving Old Tricks to HSAemu
The MCEmu Framework
◆ Software development tool
◆ Board support package
◆ Smart event tracing unit
◆ Virtual performance monitoring unit
◆ Parallel simulation framework
[Figure: MCEmu framework block diagram. Labeled components: Host System (Multicore); System-Level Emulation/Simulation; System Emulator (QEMU) with Main Processor(s), Realtime Clock & Memory System, Virtual I/O Devices, and System Bus; Processor/Device Simulators (Special Purpose Processor #1, Special Purpose Processor #2, Device Simulator); Smart Event Tracing Unit; Virtual Performance Monitoring Unit; Inter-core communication; Board Support Package; System Software (Linux); Tools and Library; Tracing/Profiling Tools; Software Development Kit; Applications (Multicore Applications).]
MCEmu Framework – Virtual Performance Monitoring Unit
[Figure: VPMU block diagram with control and data paths. Labeled components: Applications and performance tools; Platform emulator (inst. stream); CPU, cache, mem., and disk events; Cache simulator, Mem. simulator, Disk simulator, Pipeline simulator; Math model; External architecture models; Timing model 1 (fast, rough), Timing model 2, Timing model 3 (slow, accurate); Joint estimators; Performance counters; Model and simulator selection & power setting adjustment; VTD (performance counter, estimated cycle count); VPD (performance counter, estimated power/energy, power calculator, current voltage status register, current freq. status register).]
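To make the idea of a fast, rough timing model and a power calculator concrete, here is a minimal sketch. It assumes an additive cycle model over event counts and the textbook dynamic-power relation P = C × V² × f. All names and constants (base CPI, miss penalties, effective capacitance) are hypothetical placeholders chosen for illustration; they are not taken from MCEmu.

```python
from dataclasses import dataclass


@dataclass
class EventCounts:
    instructions: int
    cache_misses: int
    mem_accesses: int


# Hypothetical per-event costs in cycles (illustrative assumptions only).
BASE_CPI = 1.0
CACHE_MISS_PENALTY = 10
MEM_ACCESS_PENALTY = 100


def estimate_cycles(ev: EventCounts) -> float:
    """Fast, rough timing model: a weighted sum of event counts."""
    return (ev.instructions * BASE_CPI
            + ev.cache_misses * CACHE_MISS_PENALTY
            + ev.mem_accesses * MEM_ACCESS_PENALTY)


def dynamic_power(c_eff_farads: float, voltage: float, freq_hz: float) -> float:
    """Dynamic power P = C * V^2 * f, using the current voltage and frequency."""
    return c_eff_farads * voltage ** 2 * freq_hz


def estimate_energy(ev: EventCounts, c_eff_farads: float,
                    voltage: float, freq_hz: float) -> float:
    """Energy in joules = power * (estimated cycles / frequency)."""
    seconds = estimate_cycles(ev) / freq_hz
    return dynamic_power(c_eff_farads, voltage, freq_hz) * seconds


ev = EventCounts(instructions=1_000_000, cache_misses=20_000, mem_accesses=1_000)
cycles = estimate_cycles(ev)
```

A slower, more accurate model could replace `estimate_cycles` without changing how the counters are exposed to tools, which is the kind of model selection the diagram describes.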
◆VPMU organization for multicore processors
[Figure: multicore VPMU organization. Processor cores #1 to #3 each have a VTD (performance counter, estimated cycle count) and a VPD (performance counter, estimated power/energy, power calculator). Shared parts: joint estimators; CPU, cache, mem., disk, and coherence cache events; performance counters; system performance counters; global clock; system power/energy.]
MCEmu Framework – Virtual Performance Monitoring Unit
MCEmu Framework – Smart Event Tracing Unit
[Figure: Smart Event Tracing Unit (SETU) with control and data paths. Labeled components: Application & OS (inst. stream, process name, operating mode); Event registration device; Event filtering engine; Performance events; VPMU (processor cores #1 to #3, each with VTD and VPD; system performance counters; mem., disk, and coherence cache events; global clock; system power); Trace record buffer; Trace file; conversion to a performance visualization tool; Performance tools.]
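The event registration and filtering path can be pictured with a small sketch. It assumes events are tagged with a process name and an operating mode, that tools register (process, mode) pairs of interest, and that matching events go into a bounded trace record buffer. The class and field names are hypothetical and are not the MCEmu interfaces.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    process: str      # process name seen by the emulator
    mode: str         # operating mode, e.g. "user" or "kernel"
    kind: str         # performance event type, e.g. "cache_miss"
    timestamp: int    # global clock value


class EventFilter:
    """Keeps only events whose (process, mode) pair was registered."""

    def __init__(self, capacity: int):
        self.rules = set()                    # registered (process, mode) pairs
        self.buffer = deque(maxlen=capacity)  # bounded buffer, oldest dropped first

    def register(self, process: str, mode: str) -> None:
        self.rules.add((process, mode))

    def observe(self, event: Event) -> bool:
        """Record the event if it matches a rule; return whether it was kept."""
        if (event.process, event.mode) in self.rules:
            self.buffer.append(event)
            return True
        return False


flt = EventFilter(capacity=2)
flt.register("browser", "user")
events = [
    Event("browser", "user", "cache_miss", 10),
    Event("browser", "kernel", "disk_read", 11),
    Event("shell", "user", "cache_miss", 12),
    Event("browser", "user", "mem_access", 13),
    Event("browser", "user", "cache_miss", 14),
]
kept = [flt.observe(e) for e in events]
```

Filtering inside the emulator keeps the trace small, and a real tracing unit would flush the buffer to a trace file rather than drop old records as this sketch does.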
◆ Virtual Performance Analyzer (VPA) supports performance analysis and systems design for Android
– Hook necessary component simulators to model and monitor performance & power (VPMU)
– Trace HW/SW events with the Smart Event Tracing (SET) engine, driver, and agent
– Run Android/Linux with minimal porting effort and observe with friendly tools
– Users may start experimenting with optimization tricks, e.g. changing cache sizes, adding crypto accelerators, revising drivers, applying DVFS techniques, etc.
Design for Android Systems
2011 ESWEEK Android Competition 4th Place
Shih-Hao Hung, Tei-Wei Kuo, Chi-Sheng Shih, and Chia-Heng Tu. System-Wide Profiling and Optimization with Virtual Machines, in Proc. 17th Asia and South Pacific Design Automation Conference (ASP-DAC 2012), pp. 395 - 400, Sydney, Australia, Jan. 2012. (EI)
Estimate of Power Consumption w/ VPA
◆ Measured by instrumentation or external power meter: data collection overhead, limited information, limited usability
◆ VPA: Systematically generated model, fast and accurate enough, no need for actual hardware, deployable in the cloud
Shih-Hao Hung, Jen-Hao Chen, Chia-Heng Tu and Jeng-Peng Shieh. Exploring the Design Space for Android Smartphones, in Proc. The Eighth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS-2014), London, United Kingdom, July 2-4, 2014.
Finding Optimal Solutions in Virtual Space
HW: CPU (big.LITTLE), GPU, Cache, Memory, I/O Devices
SW: OS tunables, Applications
[Figure: chart with markers ① to ⑥, corresponding to configurations 1 to 6 in the table below.]

Configuration                   1        2        3        4 (G1)   5        6
Cache size (KB)                 8        8        32       32       32       132
Associativity                   1        4        4        4        2        2
Block size (Bytes)              512      32       128      32       32       128
Subblock size (Bytes)           64       32       32       32       32       32
Write allocate?                 N        Y        Y        Y        Y        Y
Replacement policy              FIFO     Random   LRU      LRU      LRU      FIFO
Die area (mm^2)                 0.081    0.118    0.258    0.3130   0.348    1.167
Estimated execution time (ms)   80,302   18,582   14,961   15,546   14,169   14,016
(NOTE: Processing technology is 65 nm)
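One way to read a table like this is to look for configurations that no other configuration beats on both die area and estimated execution time. The sketch below applies a plain Pareto-front filter to the two rows of numbers above. The Pareto filter is an illustrative technique chosen here, not necessarily the selection method used in the cited work, and the function names are hypothetical.

```python
# (configuration, die area in mm^2, estimated execution time in ms), from the table
configs = [
    ("1", 0.081, 80302),
    ("2", 0.118, 18582),
    ("3", 0.258, 14961),
    ("4 (G1)", 0.3130, 15546),
    ("5", 0.348, 14169),
    ("6", 1.167, 14016),
]


def dominates(a, b):
    """a dominates b if it is no worse on both metrics and better on at least one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    better = a[1] < b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


front = [name for name, _, _ in pareto_front(configs)]
# front == ["1", "2", "3", "5", "6"]
```

Considering only these two metrics, configuration 4 (G1) drops out because configuration 3 is both smaller and faster in the table; other criteria such as power could change that picture.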
Cache Simulator - GEMS
• Detailed memory system simulation model that can simulate a wide variety of memory hierarchies and support many different cache coherence protocols
• Baseline: single-threaded, very slow
Parallel Cache Simulation
[Figure: host with four processors (P1 to P4), each with its own L1 cache.]
• Need to figure out the 4 Cs:
  • Compulsory misses
  • Conflict misses
  • Capacity misses
  • Coherence misses
• The first 3 Cs are within a processor
  • Identified by standard cache simulators
• Approximate coherence misses with a parallel method
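The per-processor part (the first three Cs) can be sketched with the usual textbook classification: a first-ever reference is a compulsory miss, a miss that would also occur in a fully associative LRU cache of the same capacity is a capacity miss, and the remaining misses are conflict misses. The code below is a minimal single-core illustration under those assumptions. It does not model coherence misses and is not the lab's simulator.

```python
from collections import OrderedDict


class LRUSet:
    """A group of cache lines with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()

    def access(self, block):
        """Return True on a hit; update LRU state either way."""
        if block in self.lines:
            self.lines.move_to_end(block)
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[block] = True
        return False


def classify_misses(addresses, num_sets, ways, line_size):
    """Count hits and compulsory/capacity/conflict misses for one core."""
    sets = [LRUSet(ways) for _ in range(num_sets)]
    fully_assoc = LRUSet(num_sets * ways)  # same capacity, a single set
    seen = set()
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for addr in addresses:
        block = addr // line_size
        hit = sets[block % num_sets].access(block)
        fa_hit = fully_assoc.access(block)
        if hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif not fa_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        seen.add(block)
    return counts


# Direct-mapped, 2 sets, 32-byte lines: blocks 0 and 2 fight over set 0.
example = classify_misses([0, 64, 0], num_sets=2, ways=1, line_size=32)
```

Because each core's trace can be classified independently this way, the per-core work parallelizes naturally; only the coherence misses need information from other cores.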
Parallel Cache Simulation Scheme
◆Simulation speed can be improved by integrating the lab's previous work
– (2012) Hui-Hsin's M.S. thesis on a parallel cache simulator
– (2014) Jen-Jong's M.S. thesis on a cache simulator for HSA
Non-deterministic Communications
• Approximation? The memory access order within a parallel region of a MIMD system is non-deterministic anyway
[Figure: timelines of references to cache line i by two cores, Ref_{i,p} and Ref_{i,q}, along a time axis. Case 1: no overlap. Case 2: partial overlap. Case 3: total overlap.]
◆The minimum number of coherence misses occurs when there is no overlap
◆Easy to calculate
– RAW
– WAR
– WAW
Required Communications
[Figure: Case 1 (no overlap): Ref_{i,p} and Ref_{i,q} along a time axis.]
Estimating Optional Communications
• R_{i,j}: read references to cache line i by core j
• W_{i,j}: write references to cache line i by core j
• Ref_{i,j}: the union of R_{i,j} and W_{i,j}
• Range(X): length of the memory reference range, where X is a set of memory references
• L: length of the overlap region
Jen-Jong Cheng, M.S. thesis, Department of Computer Science and Information Engineering, National Taiwan University, 2014
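To illustrate the definitions, the sketch below computes Range(X) and the overlap length L for two cores' references to the same cache line. It assumes each reference carries a timestamp and that Range(X) is the time span from the first to the last reference in X. That reading, and all function names, are assumptions made for illustration rather than the thesis's exact formulation.

```python
def ref_range(timestamps):
    """(start, end) of a non-empty collection of reference timestamps."""
    return min(timestamps), max(timestamps)


def range_length(timestamps):
    """Range(X): time span covered by the references in X."""
    start, end = ref_range(timestamps)
    return end - start


def overlap_length(ts_p, ts_q):
    """L: length of the region where both cores' reference ranges overlap."""
    start_p, end_p = ref_range(ts_p)
    start_q, end_q = ref_range(ts_q)
    return max(0, min(end_p, end_q) - max(start_p, start_q))


def overlap_case(ts_p, ts_q):
    """Name the overlap case; assumes each set spans a non-zero time range."""
    length = overlap_length(ts_p, ts_q)
    if length == 0:
        return "no overlap"
    if length == min(range_length(ts_p), range_length(ts_q)):
        return "total overlap"
    return "partial overlap"


case1 = overlap_case([0, 5, 10], [20, 25])
case2 = overlap_case([0, 10], [5, 15])
case3 = overlap_case([0, 20], [5, 10])
```

With no overlap only the required communications remain, so an estimate of the optional ones would presumably grow with L relative to the two ranges.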
System Architecture Overview
◆System Emulator:
– Insert VPMU for performance profiling
– Coordinate synchronization for each simulator
◆SSLAB GPU:
– Provide GPU runtime performance information
– Coalesce GPU memory traces
◆Cache Simulator:
– Perform 3C cache simulation
– Evaluate cache coherence with an analytic model
[Figure: system architecture diagram. Labeled components: HSA Application, HSA Runtime API, Guest OS, PQEMU, Processors, I/O Device, VPMU, SSLAB GPU, Execution Engine, Translation Engine, Command Monitor, VTD, Trace Buffer, Cache Simulator, 3C Cache Simulation, Analytic Model.]
SSLAB GPU emulator
◆ Command Monitor
– Notify VPMU to enable GPU timing device
◆ Virtual Timing Device
– Calculate GPU local timing
• e.g. GPU CU local time = instruction count × average CPI × CPU frequency / GPU frequency
◆ Memory helper function
– Count instructions at runtime
– Generate memory traces
– Reschedule memory traces
[Figure: data flow in the SSLAB GPU emulator. Labeled components: HSA API, HSA monitor, VPMU, HSA CU threads (Global_load, Global_store), Trace sender, VTD, Cache Simulator. Labeled edges: task dispatch, notify, memory access, traces, instruction counts, update GPU local time.]
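The GPU CU local time formula can be worked through with a short sketch. Multiplying by CPU frequency over GPU frequency appears to convert GPU cycles into the CPU clock domain. That interpretation, the function name, and the sample numbers are assumptions for illustration.

```python
def gpu_cu_local_time(instruction_count, average_cpi, cpu_freq_hz, gpu_freq_hz):
    """GPU CU local time = instruction count * average CPI * CPU freq / GPU freq."""
    gpu_cycles = instruction_count * average_cpi
    return gpu_cycles * cpu_freq_hz / gpu_freq_hz


# Hypothetical numbers: 1000 instructions, CPI of 2, 1 GHz CPU, 500 MHz GPU.
local_time = gpu_cu_local_time(1000, 2.0, 1.0e9, 0.5e9)
# 1000 * 2.0 = 2000 GPU cycles, scaled by 2 to give 4000.0
```

Expressing the result against the CPU clock would let the emulator compare a compute unit's progress directly with the processors' virtual time when it synchronizes the simulators.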
Experiments (Jen-Jong Cheng, 2014-07)
• Host System
– 32 Intel Xeon E5-2660 2.2GHz processors, 16GB DDR3
– Ubuntu 12.04 (64-bit)
• Virtual platform
– PQEMU-0.13 + SSLAB GPU + Multi2Sim
– ARM RealView-PBX-A9, supports up to 4 cores
• Benchmark
– AMD OpenCL
– Splash2 benchmarks (CPU benchmarks)
– Srad (OpenCL with shared memory)
• Cache Configuration
– 16KB cache size, 4-way, 32B cache line size, 128 cache sets
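The cache configuration is internally consistent: 16 KB divided by (4 ways × 32 bytes per line) gives 128 sets. The sketch below checks that arithmetic and shows how a byte address would split into tag, set index, and offset under this geometry. The function name and the sample address are illustrative.

```python
CACHE_SIZE = 16 * 1024   # bytes
WAYS = 4
LINE_SIZE = 32           # bytes

num_sets = CACHE_SIZE // (WAYS * LINE_SIZE)   # 16384 / 128 = 128
offset_bits = LINE_SIZE.bit_length() - 1      # 5 bits select a byte in the line
index_bits = num_sets.bit_length() - 1        # 7 bits select the set


def split_address(addr):
    """Return (tag, set index, byte offset) for a byte address."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


parts = split_address(0x12345)
# parts == (18, 26, 5)
```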
Accuracy, Compared to GEMS
• Splash benchmark with 4 threads on 4 ARM cores
• AAER = Average Absolute Error Rate
• Synchronization is triggered every one thousand memory references.
Jen-Jong Cheng, M.S. thesis, Department of Computer Science and Information Engineering, National Taiwan University, 2014
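The slide does not spell out how AAER is computed. A common reading of an average absolute error rate is the mean of |estimated − reference| / reference across benchmarks, with GEMS as the reference. The sketch below implements that assumed definition; the sample numbers are made up for illustration and are not results from the experiments.

```python
def aaer(estimated, reference):
    """Mean of |estimated - reference| / reference over paired measurements."""
    if len(estimated) != len(reference) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(e - r) / r for e, r in zip(estimated, reference)]
    return sum(errors) / len(errors)


# Made-up miss counts: parallel simulator estimates vs. reference values.
rate = aaer([110, 95, 200], [100, 100, 200])
# Per-benchmark errors are 10%, 5%, and 0%, so the average is about 5%.
```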
FPGA Accelerators
◆Intel and FPGA
– http://www.extremetech.com/extreme/184828-intel-unveils-new-xeon-chip-with-integrated-fpga-touts-20x-performance-boost
◆Video demos from Altera & Xilinx
– https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.highResolutionDisplay.html
– http://www.xilinx.com/products/design-tools/sdx/sdaccel.html
FPGA Acceleration
◆Potential for a higher power-performance ratio than GPUs
◆Keys:
– Data copies can be done by wires
– Intensive simple integer operations
– Conversion of loops into pipelines
– Can be placed in-line
Connecting an FPGA Simulator to QEMU (1/2)
◆System Emulator:
• Contains an FPGA device, accessible from Linux and apps
• Transfers FPGA commands and simulation data to the FPGA simulator
Shih-Hao Hung, Tien-Tzong Tzeng, Jyun-De Wu, Min-Yu Tsai, Yi-Chih Lu, Jeng-Peng Shieh, Chia-Heng Tu, Wen-Jen Ho. MobileFBP: Designing portable reconfigurable applications for heterogeneous systems, in Journal of Systems Architecture, Volume 60, Issue 1, January 2014, Pages 40-51. (SCI)
Connecting an FPGA Simulator to QEMU (2/2)
◆ FPGA Simulator:
– Controlling interface implemented with the Verilog Procedural Interface (VPI)
– Data buffer for saving simulation data
Design Hardware Acceleration in Virtual Space
◆Save time to market and correct designs early
– Profile applications: find performance bottlenecks and analyze data flow
– Develop accelerator and software support in parallel
– Evaluate strategies with co-simulation
[Figure: Virtual Performance Analyzer. In physical space: Machine, Application, Driver, Accelerator. In virtual space: Virtual Machine, Application, Driver, Verilog Simulator.]
Design for Heterogeneous Clouds
[Figure: layered diagram. Apps on Servers: Web Services, Webkit, MapReduce, WebCL/WebGL, OpenCL/OpenGL, Filesystem, User Data. Heterogeneous Cloud Infrastructure: x86, ARM, GPU, and FPGA nodes; Management Facilities; Switching Fabric; Performance & Cost Models.]
◆Servers as the basic elements in a cloud system
◆Design and optimize for big data analytics? In virtual space
MOST Big Data Project, 2013-2014
Accelerating MapReduce
[Figure: MapReduce data flow on Node 1 and Node 2 (Map, Shuffle, Sort, Reduce) connected by a network. FPGA stages: Filter on FPGA, Map on FPGA, Reduce on FPGA, Compression, RDMA, Decompression.]
◆ Attach FPGA boards to accelerate MapReduce
◆ Filter data at the source to reduce CPU work for query operations
◆ Develop a toolkit and API for applications to use the FPGA for intensive Map and Reduce computation
◆ Compression/decompression engines to reduce network traffic
◆ RDMA engine to reduce the overhead of the network protocol
Hardware-Software Co-Design
◆Development toolkit for accelerating MapReduce applications with FPGA
– Source code analyzer: Figures out program structure and adds instrumentation code
– Performance profiler: Identifies bottlenecks
– FPGA API: Enables the programmer to invoke the FPGA for acceleration
– High-Level Language to FPGA Compiler: Helps convert HLL to HDL
– FPGA Library: Includes commonly used functions
– Virtual Platform: Allows the programmer to debug and test FPGA acceleration
[Figure: toolkit flow from MapReduce App to New MapReduce App. Labeled components: Source Code Analyzer, Performance Analyzer, FPGA API, HLL-to-HDL Compiler, FPGA Lib, Virtual Platform. Paths labeled Critical Path and Non-Critical Path.]
◆Systems research is increasingly challenging, and it is very important to Taiwan's industry
◆Tightly coupled hardware-software design is key to winning, and it can be done effectively with the right methodologies and tools
◆Virtualization technologies and tools can help build smarter systems, from mobile to cloud applications
◆HSA is getting more and more interesting and requires research/innovation skills along with knowledge and tools
◆Lots of opportunities!
Conclusion