John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in:...
description
Transcript of John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in:...
An Introduction to Goblin-Core64
A Massively Parallel Processor Architecture Designed for Complex Data Analytics
John D. Leideljleidel<at>ttu<dot>edu
Overview
• Data Intensive Computing Architectural Challenges– The destruction of cache efficiency using irregular
algorithms
• Goblin-Core64 Architecture Infrastructure Design– Sustainable Exascale performance with data intensive
applications
• Progress and Roadmap– The path forward
DATA INTENSIVE COMPUTING ARCHITECTURAL CHALLENGES
The destruction of cache efficiency using irregular algorithms
What is Big Data?…and how does it relate to HPC?
• Problem spaces outside of traditional HPC are now encountering the same problems that we find in HPC– Complexity– Time to Solution– Scale
• These problems are generally not– Simulating the physical world– Bound by simple floating point performance– As the problem scales, the result set is fixed
• These problems are generally– Sparse in nature– Contain complex [sometimes unconstrained] data types– As the problem scales, the result set scales
• The other side of the HPC coin
Three Drivers to HPC Solutions
5
Time To Solution
Algorithmic Complexity
Scale of Working Set
HPCHPC
HPC
Convergence Criteria for HPC Adoption
• Time + Complexity– Fraud Detection– High Performance Trading Analytics
• Time + Scale– Power Grid Analytics– Graph500 Benchmark
• Complexity + Scale– Epidemiology– Agent-Based Network Analytics
• Time + Complexity + Scale– Grand Challenge Problems– Cyber Analytics
6
Dense Solver Efficiency
Tianhe-2
(Milk
yWay
-2)
Sequoia
- BlueG
ene/Q
K computer
Jaguar
- Cray
XT5-HE
Roadru
nner
Roadru
nner
BlueGen
e/L
BlueGen
e/L
BlueGen
e/L
Earth-Si
mulator
Earth-Si
mulator
Earth-Si
mulator
ASCI W
hite, S
P Power3
ASCI R
ed
ASCI R
ed
ASCI R
ed
ASCI R
ed
SR2201/1
024
Numerical
Wind Tu
nnel
XP/S140
CM-5/10240.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
80.00%
90.00%
100.00%
top500 efficiency
top500 efficiency
Sparse Solver Efficiency
DOE/NNSA/LLNL Sequoia BGQ UV 2000 GraphCREST-MC48 Dingus Matterhorn Gordon0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Graph500 GTEPS/core
Graph500 GTEPS/core
Cache-less Architectures
GOBLIN-CORE64 ARCHITECTURE INFRASTRUCTURE DESIGN
Sustainable Exascale performance with data intensive applications
Goal
Build an architecture that efficiently maps programming model concepts to hardware in order to improve data intensive [sparse] application throughput
The Result: Goblin-Core64• Hierarchical set of architectural modules that provide:
– Native PGAS memory addressing– High efficiency RISC ISA– SIMD capabilities – Architectural notion of “tasks” – Latency hiding techniques
• Single cycle context/task switching– Advanced synchronization techniques
• Ease the burden of barriers and sync points by eliminating spin waits– Memory coalescing//aggregation
• Local requests • Global requests• AMO’s
– Makes use of latest high bandwidth memory technologies• Hybrid Memory Cube
Goblin-Core64 ModulesTask
Reg
Task Unit Task Proc
ALUSIMD
Task Group
MMU
GC64 Socket
Software Managed
Scratchpad
HMC Memory Interface
AMO Unit
Coalesce Unit
SOC
NET
Peripherals Packet Engine
1 2
34
GC64 Module Hierarchy• Task Unit
– Small divisible unit; Register file + control logic• Task Proc
– Multiple Task Units + context switch control logic• Task Group
– Multiple Task Procs + local MMU• GC64 Socket
– Coalesce Unit: coalesces adjacent memory requests into a single payload– AMO Unit: intelligently handles local AMO requests– HMC Unit: HMC packet request engine + SERDES– Software Managed Scratchpad: on-chip memory– Packet Engine: off-chip memory interface
GC64 Scalable Units
………..
GC64 Execution Model• GC64 execution model provides
“pressure-driven”, single cycle context switching between threads/tasks
• Pressure state machine provides fair-sharing of ALU based upon:– Number of outstanding requests– Statistical probability of a register
stall– Number of cycles in current
execution context• Minimum execution is two
instructions– Based upon instruction format
ALU SIMD
Task Unit Mux
Context Switch State Machine
GC64 Unified [PGAS] Addressing• GC64 physical addressing provides block access to:
– Local HMC [main] memory – Local software-managed scratchpad– Globally mapped [remote] memory
• Pointer arithmetic between memory spaces• Obeys all the constraints of paged, virtual memory
Base Physical Address [33:0]CUB [37:33]Reserved [41:38]Socket [49:42]Unused
[63:50]
Local HMC MemoryScratchpadRemote Memory
One of 8 local HMC devicesCUB = 0xFRemote Socket ID
Physical Address Specification
Physical Address Destinations
GC64 ISA• Simple 64-bit instruction format
– Two instructions per payload• Optional immediate payload
• Instruction control block– Specifies immediate values, breakpoints and vector register aliasing
GC64 Vector Register Aliasing• Vector register aliasing
provides access to scalar register file from SIMD unit
• No need for additional vector register file– Increasing the data path, not
the physical storage• Compiler optimizations can be
used to perform complex, irregular operations– Vector-Scalar-Vector arithmetic– Vector Fill– Scatter/Gather
GC64 Potential PerformanceTask
GroupsTask Procs
Task Units
Tasks/Socket
Peak Giga-ops Bandwidth/Socket
2 2 2 4 4 2.384 GB/s
4 4 4 64 32 9.536 GB/s
8 8 4 256 256 38.147 GB/s
16 16 8 2048 1024 152.588 GB/s
32 16 16 8192 4096 305.176 GB/s
32 32 16 16384 8192 610.352 GB/s
64 32 16 32768 16384 1220.703 GB/s
128 32 16 65536 32768 2441.406 GB/s
*256 128 32 1048576 262144 19531.25 GB/s
*256 *256 *256 16777216 524288 39062.50 GB/s
*Max config- SIMD Width = 4- Task Issue Rate = 2- Cycle = 1Ghz
PROGRESS AND ROADMAPThe path forward
GC64 Progress Report• Complete
– GC64 ISA definition– Physical Address Format– Execution Model– HMC Simulator [to be used in the GC64 sim]
• 2 x papers submitted, third paper in progress• First academic publications on Hybrid Memory Cube technology
• In Progress– Architecture specification document– GC64 Simulator– ABI definition– Virtual addressing model– Compiler & Binutils [LLVM]
• Active Research Topics– Memory coalescing & AMO techniques [Spring 2014]– Context switch pressure model– Software managed scratchpad organization– Off-chip network protocol– Thread/task runtime optimization
HMC-Sim Stream Triad Results
Goblin-Core64
• Source code and specification is licensed under a BSD license
• www.gc64.org– Source code– Architecture documentation– Developer documentation
References[1] John D. Leidel. Convey ThreadSim: A Simulation Framework for Latency-Tolerant Architectures. High Performance Computing Symposium: HPC2011, Boston, MA. April 6, 2011.[2] John D. Leidel. Designing Heterogeneous Multithreaded Instruction Sets from the Programming Model Down. 2012 SIAM Conference on Parallel Processing for Scientific Computing, Savannah, Georgia. February 2012.[3] John D. Leidel, Kevin Wadleigh, Joe Bolding, Tony Brewer, and Dean Walker. 2012. CHOMP: A Framework and Instruction Set for Latency Tolerant, Massively Multithreaded Processors. In Proceeedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC ’12). IEEE Computer Society, Washington, DC, USA 232-239. [4] John D. Leidel. Toward a General Purpose Partitioned Global Virtual Address Specification for Heterogeneous Exascale Architectures. 2013 Exascale Applications and Software Conference, Edinburgh, Scotland, UK. April 2013. [5] John D. Leidel, Geoffrey Rogers, Joe Bolding. Toward a Scalable Heterogeneous Runtime System for the Convey MX Architecture. 2013 Workshop on Multithreaded Architectures and Applications, Boston, MA. May 2013. [6] https://code.google.com/p/goblin-core/[7] http://www.hybridmemorycube.org/[8] http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.18.3-memory-FPGA/HC23.18.320-HybridCube-Pawlowski-Micron.pdf[9] John D. Leidel, Yong Chen. A High Fidelity, and Accurate Simulation Framework for Hybrid Memory Cube Devices. 2014 Internal Parallel and Distributed Processing Symposium. Submitted.
BACKUP
What did we learn from CHOMP?• Pros
– We can design tightly coupled ISA’s and runtime models that are extremely efficient• Each instruction becomes precious & necessary
– Code generation is quite natural• Allows the compiler to the best opportunities for optimization
– Latency hiding characteristics function as designed– Single-cycle context switch mechanisms function as designed– AMO’s are increasingly useful
• For more than just memory protection– RISC ISA’s are still providing high performance architectures
• VLSI and JIT’ing is unnecessary overhead/area
What did we learn from CHOMP?• Cons
– NEED MORE BANDWIDTH!• Tests with dense per-thread memory operations could utilize ~4X more bandwidth
with no other changes– Designing for an FPGA has its constraints– The lack of ILP hinders arithmetic throughput in some applications
• Even SpMV kernels can utilize ILP– Paged virtual memory is always expensive
• Not all applications/programmers exploit large pages– Use the native runtime!
• Don’t simply rely on the compiler to generate efficient code for higher-level parallel languages [OpenMP,et.al]. Use the machine-level runtime
– Cache is bad, but coalescing is good• Cache is expensive to implement and often impedes performance• We can occasionally take advantage of spatial locality at access time
GC64 Research• PGAS Addressing
– How do we develop a segmented address directory w/o the use of a TLB? – How do we distribute and maintain this directory while providing applications “virtual memory” security?
• Efficient Synchronization– Building efficient synchronization algorithms using the GC64 ISA mechanisms and available bandwidth is
orthogonal to traditional implementations• Memory Coalescing
– Development of memory request coalescing algorithms for local, remote and AMO-type requests– How do our synchronization techniques play into this?
• Software-Managed Scratchpad– Is there room/desire to add features such as this? – What additional pressure does this put on the programming model/compiler?
• Compilation Techniques– Providing MTA-C style loop transformations– https://sites.google.com/site/parallelizationforllvm/loop-transforms
• Runtime Models– Building a machine-level optimized runtime that is programming model agnostic
• HMC Integration– HMC 1.0 specification is available. Very different than traditional DRAM technology