Platform Coherency and SoC Verification Challenges
PANKAJ SINGH, CHETHAN-RAJ M, PRAKASH RAGHAVENDRA, ANINDYASUNDAR NANDI, DIBYENDU DAS AND TONY TYE
THE 11TH INTERNATIONAL SYSTEM-ON-CHIP (SOC) CONFERENCE, EXHIBIT, AND WORKSHOPS, OCTOBER 2013, IRVINE, CALIFORNIA WWW.SOCCONFERENCE.COM
ACKNOWLEDGEMENTS:
PHIL ROGERS (AMD CORPORATE FELLOW), ROY JU & BEN SANDER (SR FELLOWS), NARENDRA KAMAT, PRAVEEN DONGARA AND LEE HOWES
2 | 11th Intl. SoC Conference| Oct 23rd,24th, 2013
TODAY’S TOPICS
– A New Parallel Computing Platform: Heterogeneous System Architecture – Opportunities, Benefits and Feature Roadmap
– Kaveri Platform Coherency: Shared memory, Platform atomics
– Kaveri Verification Approach
– SoC Verification Challenges and Solutions
A NEW PARALLEL COMPUTING PLATFORM – HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
APU: ACCELERATED PROCESSING UNIT
The APU is a great advance compared to previous platforms: it combines scalar processing on the CPU with parallel processing on the GPU, plus high-bandwidth access to memory.
Challenge: How do we make it even better going forward?
– Easier to program
– Easier to optimize
– Easier to load balance
– Higher performance
– Lower power
[Diagram: CPU core pair alongside GPU SIMD units]
THE HSA OPPORTUNITY ON MODERN APPLICATIONS
PROBLEM: Historically, developers program CPUs (~4M apps, ~20+M* CPU coders), which yields good user experiences. GPU/HW blocks are hard to program and not all workloads accelerate, so GPU computing has remained a significant niche (a few hundred apps, tens of thousands of GPU coders).
SOLUTION: HSA + Libraries = productivity and performance with low power, enabling a wide range of differentiated experiences (projected: a few 100Ks of HSA apps, a few million HSA coders).
[Chart: Developer Return (differentiation in performance, reduced power, features, time to market) vs. Developer Investment (effort, time, new skills)]
*IDC
HSA AND ITS BENEFITS
HSA is an enabler of the APU’s higher performance and power efficiency:
– Our industry-leading APUs speed up applications beyond graphics
– CPU and GPU (APUs) work cooperatively, directly in system memory
– Makes programming the APU as easy as C++
– Improves performance per watt
[Diagram: app-accelerated software applications spanning serial and task-parallel, data-parallel, and graphics workloads]
HSA IS A COMPUTING PLATFORM THAT DRIVES A NEW CLASS OF APPLICATIONS
Ref [1]
HSA EFFICIENCY IMPROVEMENT (AN EXAMPLE)
ENERGY COMPUTATION BREAKDOWN: MOTIONDSP 720P VIDEO CLEAN-UP
Improves power and performance: move the application from the CPU to the GPU, remove data copies, and reduce launch time.
– Simulated removal of memory copies: 1.32x
– Easier to program + remove copies: 1.11 * 2.88 * 1.32 = 4.22x better energy efficiency
[Charts: measured performance (fps) and measured power (W) for CPU vs. CPU+GPU, broken down into CPU cores, NB+GPU, and DRAM]
Ref [1]
HETEROGENEOUS SYSTEM ARCHITECTURE FEATURE ROADMAP
Physical Integration:
– Integrate CPU & GPU in silicon
– Unified Memory Controller
– Common Manufacturing Technology
Optimized Platforms:
– Bi-Directional Power Mgmt between CPU and GPU
– GPU Compute C++ support
– User Mode Scheduling
Architectural Integration:
– Unified Address Space for CPU and GPU
– Fully coherent memory between CPU & GPU
– GPU uses pageable system memory via CPU pointers
System Integration:
– GPU compute context switch
– GPU graphics pre-emption
– Quality of Service
PLATFORM COHERENCY
KAVERI SOC – ENABLING SHARED MEMORY AND PLATFORM ATOMICS
Shared memory accesses between the CPU and GPU happen via ‘system memory’.
– This corresponds to the notion of shared virtual memory (SVM) in OpenCL 2.0, available via the clSVMAlloc() call. With SVM, CPUs and GPUs can share an address space and share a pointer to the same memory location.
– The compiler supports SVM and atomics calls that work across the CPU-GPU boundary.
– System-memory accesses may take one of three paths:
  – If coherence with the CPU is not required: GARLIC path
  – If kernel-granularity coherence with the CPU is required: ONION bus path
  – If instruction-granularity coherence with the CPU is required: bypass L2 via the ONION+ bus (required by atomics)
CONCURRENT STACK PUSH USING ATOMIC COMPARE-AND-EXCHANGE (AN EXAMPLE)
Each CPU thread and each GPU workitem execute the following code concurrently.
The code shows an example implementation of a concurrent stack’s “push” operation.
The “compare_exchange_strong” is an atomic call that ensures only one CPU thread or GPU workitem succeeds in updating the “head” pointer of the stack stored in list[0]:
    do {
        head = list[0];  // redundant because the atomic call updates head on failure
        list[i] = head;
    } while (!atomic_compare_exchange_strong(&list[0], &head, i));
Time Instant         | Workitem i=2       | Workitem i=4
Before ACE           | head=3, list[2]=3  | head=3, list[4]=3
ACE                  | Wins!              | Loses and goes back & retries
After ACE completes  | list[0]=2          | list[0]=2
[Diagram: i=2 and i=4 contest for ACE. List before: 3 (head) -> 5 -> -1. List after i=2 wins: 2 (head) -> 3 -> 5 -> -1]
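As a concrete, hedged illustration, the slide’s push loop can be turned into self-contained C11 code. The index-based layout follows the slide (list[0] holds the head index, list[i] holds the “next” index of node i, -1 marks the end); the function names and the init_demo() values mirroring the 3 -> 5 -> -1 example are ours, not from the deck:

```c
#include <stdatomic.h>

#define N 100

/* list[0] holds the head index; list[i] holds the "next" index of
   node i. -1 marks the end of the list, as on the slide. */
static _Atomic int list[N];

/* Set up the slide's initial list: 3 (head) -> 5 -> -1. */
static void init_demo(void) {
    atomic_store(&list[0], 3);
    atomic_store(&list[3], 5);
    atomic_store(&list[5], -1);
}

/* Push node i: link the new node to the current head, then try to
   swing list[0] to i. On failure the atomic compare-exchange reloads
   head into our local copy, so the loop simply retries. */
static void push(int i) {
    int head = atomic_load(&list[0]);
    do {
        atomic_store(&list[i], head);
    } while (!atomic_compare_exchange_strong(&list[0], &head, i));
}
```

After init_demo() and push(2), the list reads 2 -> 3 -> 5 -> -1, matching the slide’s table for the winning workitem.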
IMPLEMENTING PLATFORM ATOMICS FOR KAVERI
The compiler has implemented these atomics (per the OpenCL 2.0 standard) for Kaveri.
The key issue in implementing these atomics is to make sure that both the CPU and GPU see the shared memory in a “coherent” state.
Coherency is implemented using the ONION+ memory path and GPU ISA instructions, which can invalidate/bypass the L1/L2 caches selectively from the GPU side and snoop to invalidate the CPU caches. This support is provided in the KV (Kaveri) SoC.
OpenCL 2.0 and C11 atomics support various kinds of memory_scope and memory_order.
For example, atomic_load with acquire semantics generates the following code on the GPU side (in Kaveri, L2 is always bypassed for coherent access):
1. load with glc=1 // bypass the L1 cache
2. s_waitcnt 0 // wait for the load to complete
3. buffer_wbinv_vol // invalidate L1 so that any following load reads from memory
Similarly, atomic_store with release semantics generates this GPU ISA:
1. s_waitcnt 0 // wait for any previous memop to complete
2. store with glc=0 // L1 is a write-through cache, so the write goes to memory since L2 is bypassed
3. s_waitcnt 0 // prevent any following memop from moving up
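At the source level, the acquire/release pairs these ISA sequences implement look like ordinary C11 (or OpenCL 2.0) atomics. A minimal CPU-side sketch, with names of our own choosing, is:

```c
#include <stdatomic.h>

static int payload;        /* plain data published by the producer */
static _Atomic int ready;  /* synchronization flag */

/* Store-release: payload must be visible before the flag reads as set.
   On Kaveri's GPU side this maps to the s_waitcnt / store glc=0 sequence. */
static void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Load-acquire: if the flag is set, payload is guaranteed visible.
   On the GPU side this maps to load glc=1 / s_waitcnt / L1 invalidate. */
static int try_consume(int *out) {
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;
        return 1;
    }
    return 0;
}
```

The release store and acquire load form the synchronizes-with edge that the ONION+ cache bypass/invalidation enforces across the CPU-GPU boundary.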
KAVERI SOC VERIFICATION APPROACH
TRADITIONAL VERIFICATION AND SOC CHALLENGE
CPU-BASED VERIFICATION:
– Assembly-based input
– Memory image of x86 machine code is preloaded into the DRAM model
– CPU fetches instructions from DRAM and executes them
GPU-BASED VERIFICATION:
– Higher-level language (C/C++)
– BFM model used across a PCIe-based interface to inject data
– GPU sends requests to DRAM over 2 paths: coherent and non-coherent
[Diagram: CPU, NorthBridge, Graphics model (GFX), SouthBridge BFM, DRAM Model]
SoC Verification Challenge: a layer of complexity due to the HSA coherency environment.
– The SoC GPU needs to be programmed, which requires a host.
– The SoC CPU can be used as the host; however, running the same host software stack results in huge simulation time.
– One approach is a mailbox, but it is inefficient due to the lack of CPU-GPU interaction and longer run time.
– GPU-focused verification is not suitable for CPU-GPU interaction (HSA).
SOC VERIFICATION METHODOLOGY: TEST FLOW
Running driver code on a simulated CPU is impossible due to simulation run-times. Intent Capture is a mechanism that allows existing discrete GPU graphics tests to execute on the CPU in a heterogeneous APU simulation.
– The memory accesses and configuration writes from the test are extracted into C function calls. Intent Capture performs this activity and encapsulates the GPU test into a function called Replay.
– On the CPU side, one thread runs the Replay function while other threads execute the CPU side of the test.
– The composite test (CPU test + generated FusionReplay function) is compiled using cxshell to generate a .sim memory image.
[Flow: OpenCL test -> sp3 shader GPU test -> Intent Capture -> Replay() (driver CPU thread) + CPU test (other threads) -> cxshell -> .sim memory image -> APU RTL sim -> test output -> capture output]
Ref [4]
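The capture-and-replay idea can be illustrated with a toy C sketch (the struct and function names are illustrative, not AMD’s actual tooling): writes observed during a GPU test are recorded as data, then replayed as C calls against a memory model.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_OPS   1024
#define MEM_WORDS 4096

/* One captured operation: a configuration/memory write. */
typedef struct { uint32_t addr; uint32_t data; } captured_op;

static captured_op trace[MAX_OPS];  /* the captured "intent" of the test */
static size_t trace_len;

static uint32_t mem_model[MEM_WORDS];  /* stand-in for the DRAM/CSR model */

/* Capture phase: record each write instead of executing it on hardware. */
static void capture_write(uint32_t addr, uint32_t data) {
    if (trace_len < MAX_OPS)
        trace[trace_len++] = (captured_op){ addr, data };
}

/* Replay phase: one thread applies the captured writes in order,
   standing in for the generated Replay()/FusionReplay function. */
static void replay(void) {
    for (size_t k = 0; k < trace_len; k++)
        mem_model[trace[k].addr % MEM_WORDS] = trace[k].data;
}
```

The real flow additionally compiles the replay function together with the CPU test into a .sim memory image; this sketch only shows the record-then-replay split.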
POWER MANAGEMENT: BAPM
[Charts: APU power split between core power and rest-of-APU power, plus die temperature, for three cases — App1 with high Cac, all cores active at Pbase; App2 with medium Cac, half the cores active; App3 with low Cac, all cores active. The SW/OS view exposes P-states P0/Pbase, P1, SWP0, SWP1; the HW view adds multiple boost P-states Pb0..Pbx.]
ILLUSTRATION WITH CPU-CENTRIC SCENARIO
[Diagram: power monitors for CPU Core1/Core2 calculate CPU power and GPU power monitors for GPU Core1/Core2 calculate GPU power; firmware converts power into temperature estimates; temperature is compared to the limit and voltage/frequency adjusted. If Temp > Limit, reduce the power allocation; if Temp < Limit, increase the power allocation.]
– In a multi-core design, apps running on CPU/GPU cores may consume less power than their allocation.
– Power-efficient algorithms exploit this power headroom for performance.
– The GPU can borrow power credit from the CPU in GPU-centric scenarios, and vice versa.
Ref [2]
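The temperature-versus-limit adjustment described above can be sketched as a simple control loop. The struct, numbers, and step rule are illustrative assumptions, not AMD’s firmware algorithm:

```c
/* Toy BAPM-style allocator: a firmware-estimated temperature is compared
   against the thermal limit, and the power allocation is nudged up or down. */
typedef struct {
    double temp_limit;   /* thermal limit in degrees C */
    double alloc_watts;  /* current power allocation for this entity */
    double step_watts;   /* adjustment granularity per control step */
} bapm_state;

/* One control step: over the limit -> reduce allocation;
   under the limit -> increase allocation (headroom is given back out). */
static void bapm_step(bapm_state *s, double estimated_temp) {
    if (estimated_temp > s->temp_limit)
        s->alloc_watts -= s->step_watts;
    else if (estimated_temp < s->temp_limit)
        s->alloc_watts += s->step_watts;
    if (s->alloc_watts < 0.0)
        s->alloc_watts = 0.0;
}
```

In the real design the freed allocation of one entity becomes boost credit for the other (CPU to GPU or vice versa); this sketch shows only the per-entity adjustment.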
BAPM VERIFICATION APPROACH @ SOC
[Diagram, CPU-centric: NB CAC Manager receiving reports from the CPU power monitors of CPU Core1/Core2 and the GPU power monitors of GPU Core1/Core2, with SMU F/W and multiple boost P-states]
• Developed high- and low-power-consuming CPU patterns based on micro-architecture and power analysis.
• Interleaved high- and low-power patterns in random stimulus.
• Used an irritator to manipulate the credits sent to the CAC manager to hit corner cases like back-to-back boost/throttle.
• Modeled the F/W algorithm using a simple BFM.
• Added a CSR framework to drive reads/writes to the CAC manager.
• A few sanity tests were run with real firmware loaded through a backdoor to check the end-to-end flow.
• Used irritators to model GPU power-credit reporting instead of running GPU applications.
• GPU power monitor verified at the GPU IP level.
Efficient coverage-driven random verification:
– CPU boosted because of the GPU giving away credits, and vice versa
– Crosses of CPU/GPU events and their effect on BAPM
SOC VERIFICATION CHALLENGES & SOLUTION
TEST STIMULUS REUSE AND PORTING TO SOC
Goal: Improve quality, reduce development time.
Tool and flow differences/setup across IP and SoC make stimulus reuse difficult. Approaches:
– Intent Capture and Playback methodology
– Test setup update at the IP level to support test runs with the SoC as a new target
– Using a functional model to simulate the IP [RTL] in an SoC scenario for IP test development and easy porting to SoC
[Diagram: CPU C model/RTL, bus unit, MPMM, and GPU C model with cMemory and MEMIO memory models; the APU test output from CPU-to-GPU accesses (Replay capture output) is compared against the GPU C model test output (DV test capture output)]
A simple HSA SoC test with 1 Rd-Wr in RTL takes about 18 hours, whereas it takes <1 hour on the heterogeneous C model.
IP2SoC script:
– Exports suite and test key, common test options, sim output directories, and reports
– Perf_options.yml: memory config, run_job command-line options (GNB, XNB, UNB), NB/DCT programming options, UNB perf options
– Build/fuse modes: create the job spec [ip2soc –merge], then run/execute the regression
Test setup updates such as configuration changes and test stimulus defines allowed IP tests to be reused.
HW-SW INTERACTION: MODELING AND ABSTRACTION
Goals: Improve quality, reduce development time.
Complex and evolving logic is moving from hardware to firmware for better controllability. Challenges:
– Firmware algorithms are compute-intensive and often developed late in the design cycle.
– The load and execute time of the software is an additional challenge for verification.
Approach:
– Model the relevant section of the software using a BFM with a proper interface to the hardware
– Add sufficient controllability to stress different paths of the BFM model and find coverage
– Adapt stimulus based on coverage of the BFM/state machine
[Diagram: Connected Standby verification approach]
ADAPTIVE STIMULUS
Goals: Improve quality, reduce development time.
Typically, power management transitions kick off after active code execution stops. This results in deeper corner cases associated with thread-level coordination in a multi-core design. Predicting occurrences of deeper phases and targeting them with code/stimulus is difficult.
– Define the power management modes as state machines, each state having granular phases including thread-specific information.
– A dynamic irritator monitors these state transitions, inserts random/directed asynchronous events (such as different sorts of interrupts, probes, and warm reset) and updates a scoreboard.
– Events are generated very close to the relevant points, which provides great controllability.
– The dynamic irritator adapts based on scoreboard statistics, eventually putting more weight on the less frequently covered <state> x <event> buckets.
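A coverage-weighted event picker of this kind can be sketched in C. The state/event counts and the “pick the least-covered bucket” rule are illustrative assumptions; a real irritator would randomize with coverage-derived weights:

```c
#define N_STATES 4
#define N_EVENTS 3

/* Scoreboard: hit counts for each <state> x <event> bucket. */
static unsigned scoreboard[N_STATES][N_EVENTS];

/* Record that event e was injected while the design was in state s. */
static void record_hit(int s, int e) {
    scoreboard[s][e]++;
}

/* For the current state, pick the least-covered event, steering the
   irritator's stimulus toward under-covered <state> x <event> buckets. */
static int pick_event(int s) {
    int best = 0;
    for (int e = 1; e < N_EVENTS; e++)
        if (scoreboard[s][e] < scoreboard[s][best])
            best = e;
    return best;
}
```

Feeding every injected event back through record_hit() makes the stimulus self-balancing: buckets that have been hit often stop being selected until the rest catch up.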
CONSTRAINT RANDOM STIMULUS AND RANDOMIZATION AT SOC
Goals: Improve quality, reduce development time.
A complex SoC requires randomization at different levels:
– Register randomization: IP constraints, SoC constraints, randomization utility, package-level info
– Build/fuse modes: LFBR, BfD, long_init/unfused test
– RandomConfig executable: config values at time t=0, values imported after reset, run command-line options
[Diagram: random initial states of the state machine (S11, S21, S23, ...)]
Ref [3]
OVERCOMING LIMITATIONS OF GATE-LEVEL SIMULATION
Goals: Improve quality, reduce development time.
Challenges with netlist simulation:
– Longer run-times
– Longer debug times
Approach to minimize runtime: compute-intensive RTL and associated verification components are replaced with a less intensive test-vector applicator that applies test vectors directly from an FSDB file:
1. Run RTL simulation, get FSDB
2. Create gatesim files (gatesim.v, forces.v)
3. Build with netlist + gatesim files + TB to drive stimulus from FSDB
4. Run netlist sims (with FSDB dump)
This gives a 10x runtime optimization over the traditional approach.
Approach to minimize debug effort: a Verdi NPI-based methodology to automate debug.
Ref [5]
THANK YOU
REFERENCES
[1] A New Parallel Computing Platform – HSA. CTHPC 2013 keynote speech, Roy Ju, AMD Senior Fellow.
[2] AMD APUs: Dynamic Power Management Techniques. DAC 2013, Praveen Dongara, System Architect.
[3] Wilson Research Group – MGC 2013.
[4] Kaveri DTP. Internal document.
[5] Innovative Approach to Overcome Limitations of Netlist Simulation. SNUG 2013, Prodip K, Pankaj S, Meera M, Narendran K.
GLOSSARY
GPU – Graphics Processing Unit
APU – Accelerated Processing Unit
OpenCL™ – Open Computing Language
TDP – Thermal Design Power: the average thermal dissipation power a design’s infrastructure must be able to cool
AMD Turbo Core Technology – AMD boost mechanism
BAPM – Bidirectional Application Power Management
Cac – Capacitance AC switching: measures the switching activity of a cluster
Pstate – Processor performance state
GARLIC – Graphic Accelerated Reduced Latency Integrated Channel
ONION – On-chip Northbridge to I/O bus (the path used for CPU-coherent GPU accesses in this deck)
FSDB – Fast Signal Database
BACKUP
DYNAMIC FINE-GRAINED POWER TRANSFERS
[Charts: per-core temperature traces (roughly 75.0–100.0) for GPU-centric, balanced, and CPU-centric workloads]
The dynamically calculated temperature of each core and the GPU enables the operating point of each to be dynamically balanced in order to maximize performance within temperature limits. Low activity in one core enables it to be a thermal sink for a more active core.
Ref [2]
Disclaimer
The information presented in this document is for informational purposes only and may contain technical inaccuracies,
omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not
limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases,
product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD
assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this
information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD makes no representations or warranties with respect to the contents hereof and assumes no responsibility for any
inaccuracies, errors or omissions that appear in this information.
AMD specifically disclaims any implied warranties of merchantability or fitness for any particular purpose. In no event will AMD
be liable to any person for any direct, indirect, special or other consequential damages arising from the use of any information
contained herein, even if AMD is expressly advised of the possibility of such damages.
Trademark Attribution
AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United
States and/or other jurisdictions. Open CL and the Open CL logo are trademarks of Apple, Inc. and used by permission or
Khronos. Microsoft, Windows and DirectX are registered trademarks of Microsoft Corporation in the United States and/or other
jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their
respective owners.
©2011 Advanced Micro Devices, Inc. All rights reserved.