Processor Technology Update Final draft - ARM … 5x Camera ... Xeon-E5 2650 V3 Cortex-A57...
1
ARM Processor Technology Update
ARM Cortex®-A72 Processor: Taking Mobile Performance and Efficiency to New Levels
ARM Tech Forum, June 2015
Ian Smythe
Director of Marketing Programs
CPU Group
2
Processing Solutions
for Consumer
Markets
3
Accelerating the Pace of Innovation
2009 to 2014:
Display 5x
Camera 4x
Connectivity 20x
Sensors 3x
Video 34x
CPU 17x
GPU 40x
Memory Bandwidth 16x
By GalaxyOptimus (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
By Creative Tools. Watermark removed by User:Ainali [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
4
ARM®v8-A Architecture: Mobile Leadership in 2015
Asus Pegasus X002
Huawei Honor 4X
Huawei Ascend Y550
Lenovo A858T
Lenovo Lemon K3
Lenovo Sisley S90
Lenovo Vibe X2 Pro
LG Flex 2
Galaxy S6 Edge
HTC Desire 820
HTC Desire 510
Meizu M1 Note
Oppo R5
Oppo 1105
Samsung Galaxy A7
Samsung Galaxy Mega 2
Samsung Galaxy Note 4
Vivo X5Max
Xiaomi Redmi 2
Just some of the ARMv8-A architecture-based phones announced so far
Unsubsidized price estimates* from $100 to $750
*Pricing information from www.gsmarena.com
5
Scales efficiently to significantly higher performance in larger screen devices
Fits even more compute in a smaller footprint with less power
Cortex®-A Processors: Scalable for Large Screen Devices
By Google (Open Source OS Screenshot) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons
6
Cortex-A72 as ‘big’ core increases performance and efficiency
ARM big.LITTLE™: Must-Have for Longer Battery Life
Technology Evolution
big.LITTLE cluster switching, then big.LITTLE MP, then big.LITTLE with Intelligent Power Allocation
7
3.5x performance of Cortex-A15 in smartphone power envelope
Maximizes sustained device performance
75% less energy for same workloads, enabling slimmer and cooler devices
Compelling scalable solutions
Smartphones to large-screen compute solutions
16nm FF+ POP enables high frequency designs to 2.5GHz+
Designed with the system in mind
CoreLink CCI-500 interconnect
Mali-T880 GPU, V550 Video, DP550 Display
MMU-400, NIC-400, ELA-500
ARM Cortex-A72: Highest Performance ARM Cortex CPU
8
Compelling single-threaded performance
Large performance increase across all workloads including integer, memory-intensive, crypto, floating point, etc.
Baseline microarchitecture similar to Cortex-A57
Significant advancements in power efficiency
Re-optimized every logical block from Cortex-A57
Power reduction enables sustained operation at Fmax
Area reduction lowers costs and static power
Feature support for enterprise and mobile SoCs
Cortex-A72: Increased Performance and Reduced Power
9
Cortex-A72: Accelerating Usable Performance
3.5x increase in sustained performance within smartphone power budget
Chart: premium devices, 2014 to 2016 (bar labels: 1.9x, 2.6x)
Cortex-A15, 28nm, 1.6 GHz
Cortex-A57, 20nm, 2.0 GHz
Cortex-A57, 14/16nm, 2.3 GHz
Cortex-A72, 14/16nm, 2.5 GHz
10
Cortex-A72: Reducing Power Consumption
Energy consumed for same mobile workloads
Chart: Cortex-A15, Cortex-A57 and Cortex-A72 (process labels: 28nm, 28nm, 28nm, 28nm, 20nm, 16FF+; frequency labels: 1.6GHz, 2.2GHz max, 2.5GHz max)
50% less energy
75% less energy at target process
At iso-process, 40-60% further reductions on average across multiple workloads
Combined with Cortex-A53:
2GHz max, 1.3 GHz @ equivalent performance
Cortex-A72: 2GHz max, 1.1 GHz @ equivalent performance
11
Intel workloads measured on Dell Venue Pro II. SPEC benchmarks measured using GCC compiler v4.9 with -O3 flag.
Cortex-A72 measured on RTL with realistic memory system with the same compiler settings.
Multi-threaded workloads use 2C4T Core-M CPU and are estimated on a 4C Cortex-A72 configuration with 2MB L2 cache.
Cortex-A72: More performance in constrained envelopes
Compelling Mobile SoCs for smartphone, tablet, and laptop form factors
Chart: normalized performance, Core-M 2 GHz (14FF) vs. Cortex-A72 2.5 GHz est (16nm)
Single-thread: Geekbench ST, SPECint, SPECfp
Multi-thread: Geekbench MT (4T), SPECintRate (4T)
Memory: STREAM Add, STREAM Copy, STREAM Scale, STREAM Triad
Power labels: 4W, <1W
12
Diagram: big cluster and LITTLE cluster, each with an L2 cache, Cache Coherent Interconnect, Interrupt Control
Architecturally Identical Processors
High performance tuned “big” cores
High efficiency tuned “LITTLE” cores
Hardware Coherency
Cache Coherent Interconnect (CCI)
L1 and L2 snooping between clusters
Seamless & Automatic Task Allocation
Global Task Scheduling (big.LITTLE MP)
Heterogeneous Computing
Up to 1.8x higher performance vs. LITTLE-only*
45% to 65% CPU power savings vs. big-only*
big.LITTLE Technology: Right Core for the Right Task
* Measured across a set of common use-cases on a 4xCortex-A57.4xCortex-A53 big.LITTLE device
† Average power across high-end gaming and low-utilisation workloads
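The "right core for the right task" idea can be illustrated with a toy per-task placement rule. This is only a sketch under stated assumptions: the 0-1023 load scale, the two threshold values, and the names `Task` and `place` are hypothetical and are not taken from the slides or from any ARM scheduler implementation.

```python
from dataclasses import dataclass

# Hypothetical thresholds on a hypothetical 0-1023 load scale.
UP_THRESHOLD = 700    # above this, a task on a LITTLE core moves to a big core
DOWN_THRESHOLD = 256  # below this, a task on a big core moves back to LITTLE


@dataclass
class Task:
    name: str
    load: int             # tracked load: 0 (idle) to 1023 (always running)
    on_big: bool = False  # current placement


def place(task: Task) -> str:
    """Return the cluster the task should run on.

    Two thresholds give hysteresis, so a task whose load hovers
    between them does not bounce between clusters.
    """
    if not task.on_big and task.load > UP_THRESHOLD:
        task.on_big = True
    elif task.on_big and task.load < DOWN_THRESHOLD:
        task.on_big = False
    return "big" if task.on_big else "LITTLE"


game = Task("game-render", load=900)
audio = Task("audio-decode", load=100)
print(place(game), place(audio))
```

The gap between the two thresholds is a design choice in this sketch: a wider gap means fewer migrations, at the cost of leaving some tasks on a less suitable core for longer.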
Chart: Relative big.LITTLE Power (Cortex-A57, Cortex-A53, Cortex-A15, Cortex-A7): 35%† lower power
13
The combination of High Performance ‘big’ and High Efficiency ‘LITTLE’ CPUs delivers optimal power efficiency and user experience within the thermal constraints
big.LITTLE: Optimizing for Power Efficiency
Measured Power and Performance during Web Browsing
Chart: Power and Page Load Time for big.LITTLE*, LITTLE-only* and big-only* (lower is better)
*Measurements taken from the same SoC
14
Three Ways big.LITTLE Has Improved in 2015
big.LITTLE Validation Suite
Validation suite simplifies tuning and shortens time to market
Diagram: Testcases, Report Generator
ARM Intelligent Power Allocation (IPA)
Native support for IPA, the new Linux Thermal Framework
Chart: Traditional vs. IPA
ARMv8 Cortex-A CPUs
big.LITTLE devices in 2015 will achieve higher performance efficiency
Chart: big.LITTLE (ARMv8), big.LITTLE (ARMv7) and Cortex-A57 across AnTuTu HTML5, Epic Citadel, Vellamo HTML5, Quadrant CPU, AnTuTu CPU, Octane, AndEBench and WebXPRT
15
Single thread performance is crucial for gaming, video playback and web browsing applications
big.LITTLE software migrates latency sensitive threads to High Performance CPUs to reduce
execution time and deliver an improved mobile user experience
big.LITTLE Delivers a Richer User Experience
b: “big” High Performance CPU
L: “LITTLE” High Efficiency CPU
Chart: big.LITTLE User Experience Improvement, Normalized Application Speed (higher is better), LITTLE-only (4L) vs. big.LITTLE (1b+4L)
Workloads: Angry Birds, Audio Player, Photo Editor, Facebook, Castle Master, Asphalt 8
Data labels: 125%, 138%, 159%, 157%, 140%, 119%
Chart: Web Page Load Time Performance, Normalised Time for 4b+4L, 2b+4L, 1b+4L, 4L, 2L and 1L (data label: 40%)
16
big.LITTLE with Cortex-A72 for Entry to Mid-range
Configurations with High Performance CPUs offer a better user experience and higher power efficiency relative to LITTLE-only
Topologies with Cortex-A72 CPU as big core offer improved user experience at reduced area
Chart: Normalised User Experience for Angry Birds, Temple Run, Video Playback and Asphalt 8, LITTLE-only (SMP) vs. big.LITTLE (1b+4L)
Cortex-A72 with 2MB L2 for 2 cores, 1MB L2 for 1 core
Cortex-A53 1MB L2 for MP4, 512kB L2 for 2 MP2 and Octa-LITTLE 2nd cluster
Diagram: LITTLE-only to big.LITTLE with Cortex-A72
Increasing single-thread performance
Increasing user experience
Increasing energy efficiency
Area labels: 1.0x, 1.09x, 1.3x, 0.98x
17
Standalone Devices Companion Devices Tethered Embedded Deeply Embedded
Embedded OS Rich OS
Always aware, lowest-power High-efficiency performance, constrained power budget
Peripheral Autonomous Compute
ARM at the Heart of the Wearables Market
18
Processing Solutions for
Networking and Infrastructure
Markets
19
Range of SoCs Addressing Infrastructure
Highly Accelerated Balanced Massively Multicore
QorIQ LS2 ThunderX Tile-MX 100 MPSoC
Opteron™ A1100, Stratix® 10, X-Gene™
One Size Does Not Fit All
20
Cortex-A57 Networking Solutions Gather Pace
ARMv8 SoCs in deployments now, many more coming.
Freescale LS2085 and LS2045
Cortex-A57 based 8-core and 4-core complex
SDN Switching, NFV Solutions
Networking applications: Enterprise Routing, Data Center Solutions, OpenFlow switching, Enterprise Switching, Security Appliances/IPS/IDS, DPI, ADC/WAN-Opt
HiSilicon 32-core
First 16nm FinFET ARMv8-A networking chip
32-core ARM Cortex-A57 SoC
Networking applications: Next Generation BTS,
Core Routers, Virtualized appliances, SDN
AMD Hierofalcon, Seattle platforms
21
Enterprise Compute Requirements
Specialised Processing
L1, Content Delivery, Security
Diverse requirements
Trend: Advanced modulation schemes
Need: DSPs, Accelerators
Data Plane Processing
Throughput driven, IO intensive
Deterministic performance
Trend: Higher packet rates
Need: Small Cores at Maximum Efficiency
Control Plane Processing
Fast Event Processing
Complex signalling
Trend: Evolving Software
Need: Efficient, High Compute Performance
MAC Scheduling
Real Time, Latency Driven
Multiple core processing
Trend: More Complexity (LTE-A, 5G)
Need: High Compute, Low Latency Performance
High Bandwidth, Low Latency Interconnect
Wide Range of Implementations from Few to Many Coherent Devices
22
Extensible Architecture for Heterogeneous Multi-core Solutions
Diagram: CoreLink™ CCN-512 Cache Coherent Network
Processors: Cortex-A72 and Cortex-A53 clusters, Cortex CPU or CHI master
Snoop Filter, 1-32MB L3 cache
Memory: four DMC-520 memory controllers, each x72 DDR4-3200
Network Interconnect: NIC-400
Other blocks shown: DSP, ACE, AHB, PCIe, 10-40 GbE, DPI, Crypto, SATA, USB, Flash, SRAM, GPIO
I/O Virtualisation: CoreLink MMU-500
Virtualized Interrupts: GIC-500
Peripheral address space
Heterogeneous processors: CPU, GPU, DSP and accelerators
Up to 4 cores per cluster
Up to 12 coherent clusters
Integrated L3 cache
Up to 24 I/O coherent interfaces for accelerators and I/O
Up to quad-channel DDR3/4 x72
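The configuration limits above imply some simple upper bounds, sketched here as a quick calculation. Two assumptions are not stated on the slide: that x72 means 64 data bits plus 8 ECC bits per channel, and that DDR4-3200 performs 3200 million transfers per second. The function names are hypothetical.

```python
def max_coherent_cores(cores_per_cluster: int = 4, clusters: int = 12) -> int:
    """Upper bound on CPU cores: up to 4 cores in each of up to 12 clusters."""
    return cores_per_cluster * clusters


def peak_dram_bandwidth_gb_s(channels: int = 4,
                             mega_transfers_per_s: int = 3200,
                             data_bits: int = 64) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes only the 64 data bits of each x72 channel carry payload
    (the remaining 8 bits are assumed to be ECC).
    """
    bytes_per_transfer = data_bits // 8
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


print(max_coherent_cores())          # 48
print(peak_dram_bandwidth_gb_s())    # 102.4
```

These are theoretical ceilings only; sustained bandwidth on real silicon would be lower.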
23
Maximizing Throughput Density: per mm2, per Watt
Chart: relative performance (Spec2K6 rate), 20 Thread Workload
Bars: Xeon-E5 2650 V3, Cortex-A57, Cortex-A72, Xeon-E5 2660 V3
Frequency labels: 2.3 GHz, 2.7 GHz, 2.6 GHz
Comparison for equivalent number of threads
Platforms used:
Xeon-E5 2660 10C20T platform (measured)
Xeon-E5 2650 10C20T platform (measured)
GCC compiler v4.9 with -O3 flag
Estimated result on example 20C ARM Cortex platforms with CCN-508, 28MB total L2+L3 cache
Per-core measurements on RTL with relevant memory system
GCC compiler v4.9 with -O3 flag
Scaled to 20T based on modelled and empirical results
Power estimated in 16nm based on ARM internal implementations for entire CPU + interconnect
Frequency label: 2.5 GHz
Power labels: 105W, 105W, <30W, <30W
ARM Solution Benefits:
Less than 1/3rd the power for equivalent performance
Allows more specialized computing or significantly greater thread density in the same power budget
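The "less than 1/3rd the power" claim can be checked against the power labels on this slide (105W and <30W). The sketch below treats 30W as an upper bound for the ARM platforms and assumes equal performance, as the slide states; the constant and function names are hypothetical.

```python
XEON_POWER_W = 105.0            # power label shown on the slide
ARM_POWER_UPPER_BOUND_W = 30.0  # slide shows "<30W", so this is an upper bound


def power_ratio(arm_w: float = ARM_POWER_UPPER_BOUND_W,
                xeon_w: float = XEON_POWER_W) -> float:
    """ARM power as a fraction of Xeon power."""
    return arm_w / xeon_w


def perf_per_watt_gain(arm_w: float = ARM_POWER_UPPER_BOUND_W,
                       xeon_w: float = XEON_POWER_W) -> float:
    """Performance-per-watt advantage, assuming equal performance."""
    return xeon_w / arm_w


print(f"power ratio: {power_ratio():.3f}")            # below 1/3
print(f"perf/W gain: at least {perf_per_watt_gain():.1f}x")
```

Because the ARM figure is an upper bound, the real ratio would be at or below the value computed here.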
Bar labels, in chart order: (10 cores 20 threads), (20 cores 20 threads), (20 cores 20 threads), (10 cores 20 threads)
POP Optimizations
24
Cortex-A72: Ideal for Dense Compute Environments
Cortex-A72 is <20% the size
Single Broadwell CPU + 256K L2 [1]: ~8mm2
Cortex-A72 MP4 + 2MB L2 [3]: ~8mm2
Single Cortex-A72 core [2]: ~1.15mm2
A quad-core Cortex-A72 with 8x the L2 cache RAM is the same size
[1] Source: Estimated from die-shot image provided by Intel at IDF 2014.
[2][3] Source: ARM trial implementations on TSMC 16FF+, using ARM Artisan libraries
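The area figures on this slide can be cross-checked with a few lines of arithmetic. This is only a sanity check using the approximate areas quoted above; the variable names are hypothetical.

```python
BROADWELL_CORE_PLUS_L2_MM2 = 8.0  # single Broadwell CPU + 256K L2 (approx.)
A72_CORE_MM2 = 1.15               # single Cortex-A72 core (approx.)
A72_QUAD_PLUS_L2_MM2 = 8.0        # Cortex-A72 MP4 + 2MB L2 (approx.)

BROADWELL_L2_KB = 256
A72_QUAD_L2_KB = 2 * 1024

# Single-core area as a fraction of the Broadwell core + L2 area.
core_area_fraction = A72_CORE_MM2 / BROADWELL_CORE_PLUS_L2_MM2

# L2 capacity of the quad-core A72 cluster relative to the Broadwell L2.
l2_ratio = A72_QUAD_L2_KB / BROADWELL_L2_KB

print(f"A72 core area fraction: {core_area_fraction:.1%}")  # under 20%
print(f"L2 capacity ratio: {l2_ratio:.0f}x")
```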
25
ARM Ecosystem
ARM
Scalable
ISA
This diagram is a sample representation of the ARM Partner Ecosystem for illustration purposes only
26
Mobile
Cortex-A72 delivers 3.5x performance of Cortex-A15 in the smartphone envelope
Compelling scalable solutions from smartphone to large-screen compute
Designed with the system in mind: CPU, CCI, GPU, Video, MMU, NIC, ELA
Wearables from Cortex-M to Cortex-A
Infrastructure
Cortex-A72 (and Cortex-A57) are ideal for dense, high-throughput computing
Small footprint for greater density on-die for larger core counts
Scalable configurations with larger core counts (40+) using ARM CoreLink CCN products
Deliver maximum throughput per mm2, per watt and per chip
Enterprise-ready feature set and ecosystem
Summary
27
Thank you