John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in:...

28
An Introduction to Goblin-Core64 A Massively Parallel Processor Architecture Designed for Complex Data Analytics John D. Leidel jleidel<at>ttu<dot>edu

description

An Introduction to Goblin-Core64 A Massively Parallel Processor Architecture Designed for Complex Data Analytics. John D. Leidel jleidel ttu edu. Overview. Data Intensive Computing Architectural Challenges The destruction of cache efficiency using irregular algorithms - PowerPoint PPT Presentation

Transcript of John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in:...

Page 1: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

An Introduction to Goblin-Core64

A Massively Parallel Processor Architecture Designed for Complex Data Analytics

John D. Leideljleidel<at>ttu<dot>edu

Page 2: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

Overview

• Data Intensive Computing Architectural Challenges– The destruction of cache efficiency using irregular

algorithms

• Goblin-Core64 Architecture Infrastructure Design– Sustainable Exascale performance with data intensive

applications

• Progress and Roadmap– The path forward

Page 3: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

DATA INTENSIVE COMPUTING ARCHITECTURAL CHALLENGES

The destruction of cache efficiency using irregular algorithms

Page 4: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

What is Big Data?…and how does it relate to HPC?

• Problem spaces outside of traditional HPC are now encountering the same problems that we find in HPC– Complexity– Time to Solution– Scale

• These problems are generally not– Simulating the physical world– Bound by simple floating point performance– As the problem scales, the result set is fixed

• These problems are generally– Sparse in nature– Contain complex [sometimes unconstrained] data types– As the problem scales, the result set scales

• The other side of the HPC coin

Page 5: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

Three Drivers to HPC Solutions

5

Time To Solution

Algorithmic Complexity

Scale of Working Set

HPCHPC

HPC

Page 6: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

Convergence Criteria for HPC Adoption

• Time + Complexity– Fraud Detection– High Performance Trading Analytics

• Time + Scale– Power Grid Analytics– Graph500 Benchmark

• Complexity + Scale– Epidemiology– Agent-Based Network Analytics

• Time + Complexity + Scale– Grand Challenge Problems– Cyber Analytics

6

Page 7: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

Dense Solver Efficiency

Tianhe-2

(Milk

yWay

-2)

Sequoia

- BlueG

ene/Q

K computer

Jaguar

- Cray

XT5-HE

Roadru

nner

Roadru

nner

BlueGen

e/L

BlueGen

e/L

BlueGen

e/L

Earth-Si

mulator

Earth-Si

mulator

Earth-Si

mulator

ASCI W

hite, S

P Power3

ASCI R

ed

ASCI R

ed

ASCI R

ed

ASCI R

ed

SR2201/1

024

Numerical

Wind Tu

nnel

XP/S140

CM-5/10240.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

top500 efficiency

top500 efficiency

Page 8: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

Sparse Solver Efficiency

DOE/NNSA/LLNL Sequoia BGQ UV 2000 GraphCREST-MC48 Dingus Matterhorn Gordon0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Graph500 GTEPS/core

Graph500 GTEPS/core

Cache-less Architectures

Page 9: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GOBLIN-CORE64 ARCHITECTURE INFRASTRUCTURE DESIGN

Sustainable Exascale performance with data intensive applications

Page 10: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

Goal

Build an architecture that efficiently maps programming model concepts to hardware in order to improve data intensive [sparse] application throughput

Page 11: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

The Result: Goblin-Core64• Hierarchical set of architectural modules that provide:

– Native PGAS memory addressing– High efficiency RISC ISA– SIMD capabilities – Architectural notion of “tasks” – Latency hiding techniques

• Single cycle context/task switching– Advanced synchronization techniques

• Ease the burden of barriers and sync points by eliminating spin waits– Memory coalescing//aggregation

• Local requests • Global requests• AMO’s

– Makes use of latest high bandwidth memory technologies• Hybrid Memory Cube

Page 12: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

Goblin-Core64 ModulesTask

Reg

Task Unit Task Proc

ALUSIMD

Task Group

MMU

GC64 Socket

Software Managed

Scratchpad

HMC Memory Interface

AMO Unit

Coalesce Unit

SOC

NET

Peripherals Packet Engine

1 2

34

Page 13: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GC64 Module Hierarchy• Task Unit

– Small divisible unit; Register file + control logic• Task Proc

– Multiple Task Units + context switch control logic• Task Group

– Multiple Task Procs + local MMU• GC64 Socket

– Coalesce Unit: coalesces adjacent memory requests into a single payload– AMO Unit: intelligently handles local AMO requests– HMC Unit: HMC packet request engine + SERDES– Software Managed Scratchpad: on-chip memory– Packet Engine: off-chip memory interface

Page 14: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GC64 Scalable Units

………..

Page 15: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GC64 Execution Model• GC64 execution model provides

“pressure-driven”, single cycle context switching between threads/tasks

• Pressure state machine provides fair-sharing of ALU based upon:– Number of outstanding requests– Statistical probability of a register

stall– Number of cycles in current

execution context• Minimum execution is two

instructions– Based upon instruction format

ALU SIMD

Task Unit Mux

Context Switch State Machine

Page 16: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GC64 Unified [PGAS] Addressing• GC64 physical addressing provides block access to:

– Local HMC [main] memory – Local software-managed scratchpad– Globally mapped [remote] memory

• Pointer arithmetic between memory spaces• Obeys all the constraints of paged, virtual memory

Base Physical Address [33:0]CUB [37:33]Reserved [41:38]Socket [49:42]Unused

[63:50]

Local HMC MemoryScratchpadRemote Memory

One of 8 local HMC devicesCUB = 0xFRemote Socket ID

Physical Address Specification

Physical Address Destinations

Page 17: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GC64 ISA• Simple 64-bit instruction format

– Two instructions per payload• Optional immediate payload

• Instruction control block– Specifies immediate values, breakpoints and vector register aliasing

Page 18: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GC64 Vector Register Aliasing• Vector register aliasing

provides access to scalar register file from SIMD unit

• No need for additional vector register file– Increasing the data path, not

the physical storage• Compiler optimizations can be

used to perform complex, irregular operations– Vector-Scalar-Vector arithmetic– Vector Fill– Scatter/Gather

Page 19: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GC64 Potential PerformanceTask

GroupsTask Procs

Task Units

Tasks/Socket

Peak Giga-ops Bandwidth/Socket

2 2 2 4 4 2.384 GB/s

4 4 4 64 32 9.536 GB/s

8 8 4 256 256 38.147 GB/s

16 16 8 2048 1024 152.588 GB/s

32 16 16 8192 4096 305.176 GB/s

32 32 16 16384 8192 610.352 GB/s

64 32 16 32768 16384 1220.703 GB/s

128 32 16 65536 32768 2441.406 GB/s

*256 128 32 1048576 262144 19531.25 GB/s

*256 *256 *256 16777216 524288 39062.50 GB/s

*Max config- SIMD Width = 4- Task Issue Rate = 2- Cycle = 1Ghz

Page 20: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

PROGRESS AND ROADMAPThe path forward

Page 21: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GC64 Progress Report• Complete

– GC64 ISA definition– Physical Address Format– Execution Model– HMC Simulator [to be used in the GC64 sim]

• 2 x papers submitted, third paper in progress• First academic publications on Hybrid Memory Cube technology

• In Progress– Architecture specification document– GC64 Simulator– ABI definition– Virtual addressing model– Compiler & Binutils [LLVM]

• Active Research Topics– Memory coalescing & AMO techniques [Spring 2014]– Context switch pressure model– Software managed scratchpad organization– Off-chip network protocol– Thread/task runtime optimization

Page 22: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

HMC-Sim Stream Triad Results

Page 23: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

Goblin-Core64

• Source code and specification is licensed under a BSD license

• www.gc64.org– Source code– Architecture documentation– Developer documentation

Page 24: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

References[1] John D. Leidel. Convey ThreadSim: A Simulation Framework for Latency-Tolerant Architectures. High Performance Computing Symposium: HPC2011, Boston, MA. April 6, 2011.[2] John D. Leidel. Designing Heterogeneous Multithreaded Instruction Sets from the Programming Model Down. 2012 SIAM Conference on Parallel Processing for Scientific Computing, Savannah, Georgia. February 2012.[3] John D. Leidel, Kevin Wadleigh, Joe Bolding, Tony Brewer, and Dean Walker. 2012. CHOMP: A Framework and Instruction Set for Latency Tolerant, Massively Multithreaded Processors. In Proceeedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC ’12). IEEE Computer Society, Washington, DC, USA 232-239. [4] John D. Leidel. Toward a General Purpose Partitioned Global Virtual Address Specification for Heterogeneous Exascale Architectures. 2013 Exascale Applications and Software Conference, Edinburgh, Scotland, UK. April 2013. [5] John D. Leidel, Geoffrey Rogers, Joe Bolding. Toward a Scalable Heterogeneous Runtime System for the Convey MX Architecture. 2013 Workshop on Multithreaded Architectures and Applications, Boston, MA. May 2013. [6] https://code.google.com/p/goblin-core/[7] http://www.hybridmemorycube.org/[8] http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.18.3-memory-FPGA/HC23.18.320-HybridCube-Pawlowski-Micron.pdf[9] John D. Leidel, Yong Chen. A High Fidelity, and Accurate Simulation Framework for Hybrid Memory Cube Devices. 2014 Internal Parallel and Distributed Processing Symposium. Submitted.

Page 25: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

BACKUP

Page 26: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

What did we learn from CHOMP?• Pros

– We can design tightly coupled ISA’s and runtime models that are extremely efficient• Each instruction becomes precious & necessary

– Code generation is quite natural• Allows the compiler to the best opportunities for optimization

– Latency hiding characteristics function as designed– Single-cycle context switch mechanisms function as designed– AMO’s are increasingly useful

• For more than just memory protection– RISC ISA’s are still providing high performance architectures

• VLSI and JIT’ing is unnecessary overhead/area

Page 27: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

What did we learn from CHOMP?• Cons

– NEED MORE BANDWIDTH!• Tests with dense per-thread memory operations could utilize ~4X more bandwidth

with no other changes– Designing for an FPGA has its constraints– The lack of ILP hinders arithmetic throughput in some applications

• Even SpMV kernels can utilize ILP– Paged virtual memory is always expensive

• Not all applications/programmers exploit large pages– Use the native runtime!

• Don’t simply rely on the compiler to generate efficient code for higher-level parallel languages [OpenMP,et.al]. Use the machine-level runtime

– Cache is bad, but coalescing is good• Cache is expensive to implement and often impedes performance• We can occasionally take advantage of spatial locality at access time

Page 28: John D. Leidel jleidel ttu edu By jovan 1078 SlideShows Follow User 37 Views Presentation posted in: General

GC64 Research• PGAS Addressing

– How do we develop a segmented address directory w/o the use of a TLB? – How do we distribute and maintain this directory while providing applications “virtual memory” security?

• Efficient Synchronization– Building efficient synchronization algorithms using the GC64 ISA mechanisms and available bandwidth is

orthogonal to traditional implementations• Memory Coalescing

– Development of memory request coalescing algorithms for local, remote and AMO-type requests– How do our synchronization techniques play into this?

• Software-Managed Scratchpad– Is there room/desire to add features such as this? – What additional pressure does this put on the programming model/compiler?

• Compilation Techniques– Providing MTA-C style loop transformations– https://sites.google.com/site/parallelizationforllvm/loop-transforms

• Runtime Models– Building a machine-level optimized runtime that is programming model agnostic

• HMC Integration– HMC 1.0 specification is available. Very different than traditional DRAM technology