THE PROGRAMMER’S GUIDE TO REACHING FOR THE CLOUD PHIL ROGERS, CORPORATE FELLOW, AMD
NOV. 11, 2013
2 | APU-13 KEYNOTE | NOVEMBER 11, 2013 | PUBLIC
MODERN CLOUD WORKLOADS ARE HETEROGENEOUS
Video is expected to represent two thirds of mobile data traffic by 2017 ‒ Video is continuously being captured, uploaded, transcoded and streamed ‒ Video processing is inherently parallel … and can be accelerated
Big data growing exponentially with Exabytes of data crawled monthly ‒ Indexing the web and extracting high definition information ‒ Map reduce is a heterogeneous workload
Natural User Interfaces are still in their infancy ‒ Accurate extraction of meaning from gesture and voice ‒ Getting to the fingertips and voice inflections
SCALAR CONTENT WITH A GROWING MIX OF PARALLEL CONTENT
NEED TO SIMULTANEOUSLY INCREASE PERFORMANCE AND REDUCE POWER
FUTURE TECHNOLOGY GROWTH WILL ACCELERATE THE TREND
Rapid growth of Sensor Networks ‒ Drives exponential increase in data
Internet of Everything (IoE) results in explosion of data sources ‒ Another exponential growth in data at local and cloud level
Context Aware Computing is a huge big-data problem ‒ Both local and cloud compute must get faster/lower power
DRIVING FUTURE DEMAND FOR LOCAL AND CLOUD PARALLEL EFFICIENCY
Source: Cisco IBSG, 2013
[Chart: How much value is at stake in the IoE economy? $14.4 trillion total: $9.5 trillion from industry-specific use cases, $4.9 trillion from cross-industry use cases]
[Chart: Rapid growth of the number of things connected to the Internet, 1995–2020, reaching 50B. Eras: "Fixed" Computing (you go to the device) → Mobility/BYOD (the device goes with you) → Internet of Things (age of devices) → Internet of Everything (people, process, data, things)]
HSA APU PROCESSORS OPERATE HARMONIOUSLY AT LOW POWER
Techniques include: ‒ Image Stabilization, Super Resolution, Deblur, Deinterlace, Lighting & Contrast
Enhancements examine pixels from a large number of video frames ‒ Super-resolution based on information from surrounding frames
Algorithms can be run on multiple processors in the APU ‒ CPU, GPU, DSPs, Fixed Function Accelerators ‒ Convolutions, motion estimation, histograms, format conversions, etc. ‒ Processing flows freely between processors for best efficiency
EXAMPLE: VIDEO ENHANCEMENT
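As a concrete illustration of the convolution workloads listed above, here is a minimal grayscale 3×3 convolution in plain Java (a hypothetical `Convolve` helper, not AMD's optimized kernel); the per-pixel inner loop is exactly the data-parallel pattern that maps onto GPU work-items:

```java
public class Convolve {
    // Apply a 3x3 kernel to a grayscale image stored as a row-major float array.
    // Border pixels are left unchanged for simplicity.
    static float[] convolve3x3(float[] src, int w, int h, float[] k) {
        float[] dst = src.clone();
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                float acc = 0f;
                for (int ky = -1; ky <= 1; ky++)
                    for (int kx = -1; kx <= 1; kx++)
                        acc += src[(y + ky) * w + (x + kx)] * k[(ky + 1) * 3 + (kx + 1)];
                dst[y * w + x] = acc;
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        float[] identity = {0, 0, 0, 0, 1, 0, 0, 0, 0}; // identity kernel: output == input
        float[] img = new float[9];
        for (int i = 0; i < 9; i++) img[i] = i;
        float[] out = convolve3x3(img, 3, 3, identity);
        System.out.println("center = " + out[4]); // center pixel preserved
    }
}
```

Each output pixel depends only on its input neighborhood, so every pixel can be computed by an independent work-item, which is why these filters accelerate so well on the GPU.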
HETEROGENEOUS PROCESSORS EVERYWHERE – FROM SMARTPHONES TO SUPERCOMPUTERS
Phone
Tablet
Notebook
Workstation
Dense Server
Supercomputer
THE WORLD'S PROGRAMMERS NOW DEMAND A SINGLE SCALABLE ARCHITECTURE
HOW DOES HSA MAKE THIS ALL WORK?
Enables acceleration of languages like Java, C++ AMP and Python
All processors use the same addresses, and can share data structures in place
Heterogeneous computing can use all of virtual and physical memory
Extends multicore coherency to the GPU and other processors
Pass work quickly between the processors
Enables quality of service
HSA FOUNDATION – BUILDING THE ECOSYSTEM
HSA in 2013
HSA FOUNDATION AT LAUNCH – BORN IN JUNE 2012
Founders
HSA FOUNDATION TODAY – NOVEMBER 2013: A GROWING AND POWERFUL FAMILY
Founders
Promoters
TBA at APU-13
Supporters
Contributors
Universities: NTHU Programming Language Lab, NTHU System Software Lab
HSA FOUNDATION PROGRESS
Membership growing rapidly ‒ 2-3 new members per month ‒ Universities enrolling
Four working groups generating specifications ‒ HSA Programmers Reference Manual published ‒ HSA System Architecture spec going to ratification by the end of the year ‒ Runtime WG and Tools WG will publish early next year
HSA development platforms to ship in early 2014
WHAT AN AMAZING FIRST YEAR
[Diagram: HSA software stack. Applications – OpenCL™ App, Java App, Python App, C++ AMP App – run on their language runtimes (OpenCL Runtime, Java JVM (Sumatra), Fabric Engine RT, and various C++ AMP runtimes), which sit on the HSA Helper Libraries, HSA Core Runtime, HSA Finalizer, and Kernel Fusion Driver (KFD), all targeting HSAIL]
PROGRAMMING LANGUAGES PROLIFERATING ON HSA
Workloads
HIGH EFFICIENCY VIDEO CODEC – HEVC (H.265) VALUE PROPOSITION
30% TO 50% MORE EFFICIENT THAN H.264 AT 1080P RESOLUTION
HEVC VISUAL QUALITY IS SIGNIFICANTLY BETTER THAN H.264 AT ANY GIVEN BIT RATE
[Image comparison: H.264 @ 500 kbps vs. H.265 @ 500 kbps]
4K VIDEO BENEFITS ARE EVEN MORE SIGNIFICANT WITH HEVC
[Images: 4K Ultra HDTV – Sony XBR, $4999; 4K video cameras – GoPro, $399]
HIGH EFFICIENCY VIDEO CODEC – HEVC (H.265)
Source: Cisco VNI Mobile Forecast, 2013
[Chart: mobile traffic in exabytes per month, 2012–2017 (y-axis 0–12 EB). Traffic share by 2017: Mobile Video 66.5%, Mobile Web/Data 24.9%, Mobile M2M 5.1%, Mobile File Sharing 3.5%]
WHY HEVC WILL PROLIFERATE
The next-generation MPEG video encoding standard
Significantly higher efficiency (up to 50% lower bit rates at a given quality) than AVC (H.264)
Highly beneficial for HD video (1080p or below)
Especially beneficial for 4K video
Scales to 8K Ultra High Definition video (up to 8192×4320)
Computationally complex, but by design easier to parallelize than H.264
CLOUD VIDEO PROVIDERS NEED THE HIGHER COMPRESSION FOR QUALITY OF SERVICE
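To make the compression numbers concrete, a quick back-of-the-envelope sketch (illustrative figures, not measured data): at the quoted 30–50% efficiency gain, a 4000 kbps H.264 stream could be delivered at roughly 2800 down to 2000 kbps in HEVC, which is bandwidth a cloud video provider saves on every concurrent viewer:

```java
public class BitrateSavings {
    // Bitrate needed after applying a fractional efficiency gain
    // (e.g. gain = 0.5 means "50% lower bit rate at the same quality").
    static double hevcBitrate(double h264Kbps, double gain) {
        return h264Kbps * (1.0 - gain);
    }

    public static void main(String[] args) {
        double h264 = 4000; // kbps, an illustrative 1080p stream
        System.out.printf("30%% gain: %.0f kbps%n", hevcBitrate(h264, 0.30)); // 2800
        System.out.printf("50%% gain: %.0f kbps%n", hevcBitrate(h264, 0.50)); // 2000
    }
}
```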
ALL STAGES OF HEVC ARE ACCELERATED ON THE APU
Decrypt → Decode and decompress → Scaling and enhancement → Encode and compress → Encrypt
HEVC (H.265) ACCELERATION FOR EFFICIENT CLOUD DEPLOYMENT
ENCODE IS THE HEAVIEST STAGE
Leverage point for compression
Highly parallel
Algorithms improve monthly
Must stay programmable
H.265 ENCODING IS 5 – 10X MORE COMPUTATIONALLY COMPLEX THAN H.264
Pictures can be divided into coding regions (HEVC's coding tree units, the successor to macroblocks) with a much wider range of sizes and shapes
Intra prediction offers 33 angular directions, compared to 8 for H.264
OVERVIEW OF B+ TREES
B+ Trees are a special case of B Trees
Fundamental data structure used in several popular database management systems ‒ SQLite ‒ CouchDB
A B+ Tree … ‒ is a dynamic, multi-level index ‒ is efficient for retrieval of data stored in a block-oriented context
The order (b) of a B+ Tree measures the capacity of its nodes
[Diagram: example B+ tree. Root keys (3 5) and internal keys (2 4 6 7) index leaf nodes holding keys 1–8 with their data records d1–d8]
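To make the structure concrete, here is a minimal sketch of a B+ tree point lookup in Java, modeled loosely on the example tree above (a toy `Node` class, not a database implementation; real B+ trees additionally link leaves for range scans and size nodes to disk blocks):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BPlusSketch {
    // Minimal B+ tree node: internal nodes hold router keys and children;
    // leaves hold keys alongside their data records.
    static class Node {
        List<Integer> keys = new ArrayList<>();
        List<Node> children = new ArrayList<>();  // empty in leaves
        List<String> values = new ArrayList<>();  // parallel to keys in leaves
        boolean isLeaf() { return children.isEmpty(); }
    }

    // Point lookup: follow router keys down to a leaf, then scan the leaf.
    static String search(Node node, int key) {
        while (!node.isLeaf()) {
            int i = 0;
            while (i < node.keys.size() && key >= node.keys.get(i)) i++;
            node = node.children.get(i);
        }
        int idx = node.keys.indexOf(key);
        return idx < 0 ? null : node.values.get(idx);
    }

    // Build a small example tree like the slide's: keys 1..8 -> d1..d8.
    static Node exampleTree() {
        Node root = new Node();
        root.keys.addAll(Arrays.asList(3, 5, 7));
        int[][] leafKeys = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};
        for (int[] ks : leafKeys) {
            Node leaf = new Node();
            for (int k : ks) {
                leaf.keys.add(k);
                leaf.values.add("d" + k);
            }
            root.children.add(leaf);
        }
        return root;
    }

    public static void main(String[] args) {
        Node t = exampleTree();
        System.out.println(search(t, 4));  // d4
        System.out.println(search(t, 9));  // null (absent key)
    }
}
```

Each lookup touches only one node per level, which is what makes B+ trees block-friendly and, as the next slide argues, amenable to running many lookups in parallel.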
APPLICATIONS THAT USE B/B+ TREES
http://www.sqlite.org/famous.html
http://wiki.apache.org/couchdb/CouchDB_in_the_wild
SQLite as primary data store on the client side: Apple Mail, Safari, iPhone, iPod, iTunes; Firefox and Thunderbird; Android, Chrome
CouchDB: multi-data-center key-value store, market-data framework, the Large Hadron Collider
HOW WE ACCELERATE
Utilize coarse-grained parallelism in B+ Tree searches ‒ Perform many queries in parallel ‒ Increase memory bandwidth utilization with parallel reads ‒ Increase throughput (transactions per second for OLTP)
B+ Tree searches on an HSA enabled APU ‒ Allows much larger B+ Trees to be searched, than traditional GPU compute ‒ Eliminates data-copies since CPU and GPU cores can access the same memory
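The coarse-grained parallelism described above can be sketched with a Java parallel stream: many independent point queries run concurrently against one shared, read-only index. For a self-contained sketch, a sorted array probed with binary search stands in for the B+ tree, since each query is likewise an O(log n) walk over shared memory with no copies:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelLookups {
    // Answer a batch of point queries against a shared, read-only sorted index.
    // Each query is independent, so the batch parallelizes trivially and the
    // index is accessed in place -- no per-query data copies.
    static long[] batchSearch(long[] index, long[] queries) {
        return Arrays.stream(queries)
                     .parallel()                              // coarse-grained parallelism
                     .map(q -> Arrays.binarySearch(index, q)) // position, or < 0 if absent
                     .toArray();
    }

    public static void main(String[] args) {
        // A million even keys: 0, 2, 4, ...
        long[] index = IntStream.range(0, 1_000_000).mapToLong(i -> 2L * i).toArray();
        long[] queries = {0, 10, 11, 1_999_998};
        System.out.println(Arrays.toString(batchSearch(index, queries)));
    }
}
```

On an HSA APU the same pattern would dispatch the query batch to GPU work-items over the very same in-memory tree, which is the copy-elimination point the slide makes.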
1M search queries in parallel
Input B+ Tree contains 112 million keys and uses 6GB of memory
Hardware: AMD “Kaveri” APU with Quad Core CPU and 8 GCN Compute Units at 35W TDP
Software: OpenCL on HSA
RESULTS
[Chart: speedup (0–7× on the y-axis) over the CPU baseline, plotted against B+ tree order (8, 16, 32, 64, 128)]
Baseline: 4-core OpenMP + hand-tuned SSE CPU implementation
Results measured in AMD Labs on “Kaveri” APU, 35W TDP, 16GB DRAM
REVERSE TIME MIGRATION (RTM)
A technique for creating images based on sensor data to improve seismic interpretations done by geophysicists
A memory-intensive and highly parallel algorithm
RTM is run on massive data sets
A natural scale-out algorithm
Often run today on 100K node CPU systems
Bringing this to HSA and APU based supercomputing will increase performance for current sensor arrays, and allow more sensors and accuracy in the future.
[Images: marine crews; land crews]
HOWEVER, SPEED OF PROCESSING AND INTERPRETATION IS A CRITICAL BOTTLENECK IN MAKING FULL USE OF ACQUISITION ASSETS
TEXT ANALYTICS – HADOOP TERASORT AND BIG DATA SEARCH
MINING BIG DATA
Multi-stage pipeline of parallel processing stages
Traditional GPU Compute is challenged by copies
APU with HSA accelerates each stage in place ‒ Sort ‒ Compression ‒ Regular expression parsing ‒ CRC generation
Acceleration of large data search scales out across the cluster of APU nodes
[Diagram: Hadoop MapReduce data flow. Input HDFS (with HDFS replication) feeds splits 0–2 into map tasks; map output is sorted, copied, and merged; reduce tasks emit parts 0–1, written to output HDFS (with HDFS replication)]
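The map → sort/copy/merge → reduce flow shown in the MapReduce diagram can be mimicked on a single node with Java streams. This word-count sketch (a stand-in, not the Hadoop API) exhibits the same three phases:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MiniMapReduce {
    // map: split each input "split" into words (the (word, 1) pairs);
    // shuffle/sort: group identical keys together (TreeMap keeps them sorted);
    // reduce: sum the counts per key.
    static Map<String, Long> wordCount(String[] splits) {
        return Arrays.stream(splits)
                     .parallel()                                         // one "map task" per split
                     .flatMap(line -> Arrays.stream(line.split("\\s+"))) // map phase
                     .collect(Collectors.groupingBy(                    // shuffle + reduce
                         w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        String[] splits = {"big data big", "data big"};
        System.out.println(wordCount(splits)); // {big=3, data=2}
    }
}
```

In a real cluster the same sort, compression, regex-parsing, and CRC stages listed above run per node, which is where APU acceleration of each stage in place pays off.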
Programming Languages
PROGRAMMING MODELS EMBRACING HSAIL AND HSA THE RIGHT LEVEL OF ABSTRACTION
UNDER DEVELOPMENT ‒ Java: Project Sumatra, OpenJDK 9 ‒ OpenMP from SuSE ‒ C++ AMP, based on CLANG/LLVM ‒ Python and KL from Fabric Engine
NEXT ‒ DSLs: Halide, Julia, Rust ‒ Fortran ‒ JavaScript ‒ Open Shading Language ‒ R
HSA ENABLES DEVELOPERS TO LEVERAGE HETEROGENEOUS COMPUTE … EASILY & NATURALLY
PREFERRED PROGRAMMING LANGUAGES
‒ Languages: Java, C++, OpenMP, Python *
‒ HSA features used: SVM, coherence, GPU enqueue
‒ Examples: OpenJDK/Sumatra, Fabric Engine

TRANSPARENT CALLS TO POPULAR LIBRARIES
‒ Libraries: OpenCV, SciPy, NumPy, ImageMagick, Bolt, …
‒ HSA features used: arbitrary data structures, SVM, coherence, user-mode queueing
‒ Examples: OpenCV API, Bolt STL library

USING CONVENTIONAL METHODS
‒ Constructs: arbitrary data structures, malloc, function pointers, call-backs, recursion, semaphores, atomics
‒ HSA features used: SVM, coherence, user-mode queueing, GPU enqueue, HSAIL
‒ Examples: linked-list/tree traversal and other complex shared host data structures

* Java 8, C++ AMP, OpenMP 4.0 next-generation standards and extensions
C++ AMP ACCELERATION GOES MULTI-PLATFORM
Herb Sutter Announced C++ AMP for the Windows® Platform at ADS 2011
We very much liked the single source model of development, and decided to extend it to be multi-platform
Today we are announcing C++ AMP is moving beyond Microsoft® Windows to embrace Linux. We will offer this acceleration on both our APUs and our discrete GPUs
We are also bringing Bolt STL Library support to C++ AMP
AVAILABLE IN OPEN SOURCE 1H-2014
[Diagram: C++ AMP compilation flow. The CLANG front-end and LLVM compiler emit LLVM-IR, finalized either to HSAIL for any HSA implementation or to SPIR 1.2 for any OpenCL™+SPIR implementation]
HSA ENABLEMENT OF JAVA
JAVA 8 – HSA ENABLED APARAPI
Java 8 brings the Stream + Lambda API ‒ More natural way of expressing data-parallel algorithms ‒ Initially targeted at multi-core
APARAPI will: ‒ Support Java 8 Lambdas ‒ Dispatch code to HSA-enabled devices at runtime via HSAIL
[Diagram: Java Application → APARAPI + Lambda API → JVM → HSAIL → HSA Finalizer & Runtime → CPU ISA / GPU ISA → CPU and GPU]
JAVA 7 – OpenCL ENABLED APARAPI
AMD-initiated open source project: APIs for data-parallel algorithms ‒ GPU-accelerate Java applications ‒ No need to learn OpenCL™
Active community captured mindshare ‒ ~20 contributors ‒ >7000 downloads ‒ ~150 visits per day
[Diagram: Java Application → APARAPI API → JVM → OpenCL™ → OpenCL™ Compiler & Runtime → CPU ISA / GPU ISA → CPU and GPU]
JAVA 9 – HSA ENABLED JAVA (SUMATRA)
Adds native GPU acceleration to the Java Virtual Machine (JVM)
Developer uses JDK Lambda, Stream API
JVM uses the GRAAL compiler to generate HSAIL
JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics
[Diagram: Java Application → Java JDK Stream + Lambda API → JVM with GRAAL JIT backend → HSAIL → HSA Finalizer & Runtime → CPU ISA / GPU ISA → CPU and GPU]
We will provide HSA-enabled Aparapi on Java 8 to bridge between Aparapi on Java 7 and HSA/Sumatra on Java 9
JAVA DEMO – WELCOME GARY FROST TO THE STAGE
NBODY REVISITED
NBody problem: ‒ Calculate the position of 'N' bodies in 3D space by computing the gravitational effect each has on all of the others and updating its position.
A Java sequential NBody implementation would start with an Object for each Body.
Then we would iterate over all bodies updating the position of each
A pre-Java 8 'parallel' version would not fit so nicely on this slide ;)
public class Body {
    // State of the body
    private float x, y, z, m, vx, vy, vz;

    // Update position relative to the other bodies
    void updatePosition(Body[] bodies) { /* code omitted */ }
}

for (Body b : bodies) {
    b.updatePosition(bodies);
}
JAVA 8’S ‘PROJECT LAMBDA’ SIMPLIFIES PARALLEL PROGRAMMING
Offers an alternate syntax for processing arrays/collections of data
To process a stream in parallel we just tag the stream with the parallel() modifier
In Java 8 a parallel stream executes across all CPU cores.
In Java 9 (Sumatra) a parallel stream executes across all CPU and GPU cores
for (Body b : bodies) { b.updatePosition(bodies); }

Arrays.stream(bodies)                         // wrap array in a stream
      .forEach(b -> b.updatePosition(bodies));

Arrays.stream(bodies)                         // wrap an array in a stream
      .parallel()                             // tag the stream as parallel
      .forEach(b -> b.updatePosition(bodies));
JAVA DEMO
JAVA AND THE CLOUD
Java 8 and Java 9 provide parallel acceleration
Parallel workloads are proliferating in the cloud
Hadoop framework for scale out
HSA APUs provide workload acceleration
THE RIGHT LANGUAGE WITH ACCELERATION ON CLOUD APUS
“THE ROLE OF JAVA™ IN HETEROGENEOUS COMPUTING, AND HOW YOU CAN HELP”
DON’T MISS THE KEYNOTE TOMORROW FROM ORACLE’S NANDINI RAMANI
Programming Tools
ANNOUNCING AMD’S UNIFIED SDK
Access to AMD APU and GPU programmable components
Component installer - choose just what you need
APP SDK 2.9
‒ Web-based sample browser
‒ Supports programming standards: OpenCL™, C++ AMP
‒ Code samples for accelerated open source libraries: OpenCV, OpenNI, Bolt, Aparapi
‒ OpenCL™ source editing plug-in for Visual Studio
‒ Now supports CMake

MEDIA SDK 1.0 BETA
‒ GPU-accelerated video pre/post processing library
‒ Leverages AMD's media encode/decode acceleration blocks
‒ Library for low-latency video encoding
‒ Supports both Windows Store and Classic desktop

Initial release includes: ‒ APP SDK v2.9 ‒ Media SDK 1.0 Beta
ANNOUNCING AMD CodeXL 1.3
AMD's comprehensive heterogeneous developer tool suite, including: ‒ CPU and GPU profiling ‒ GPU kernel debugging ‒ GPU kernel analysis
New features in version 1.3: ‒ Supports Java ‒ Integrated static kernel analysis ‒ Remote debugging/profiling ‒ Supports latest AMD APU and GPU products
CPU PROFILER
‒ Time-based profiling
‒ Analyze call-chain relationships
‒ Java profiling with inline function support
‒ Cache-line utilization profiling
‒ Supports latest AMD processors

GPU PROFILER
‒ OpenCL™ application trace
‒ Profile OpenCL kernels
‒ Timeline visualization of GPU counter data
‒ Kernel Occupancy Viewer
‒ Remote GPU profiling

GPU DEBUGGER
‒ Real-time OpenCL kernel debugging with stepping and variable display
‒ OpenCL and OpenGL API statistics
‒ Object visualization
‒ Remote GPU debugging

STATIC KERNEL ANALYZER
‒ Compile, analyze and disassemble OpenCL kernels
‒ View kernel compilation errors/warnings
‒ Estimate kernel performance
‒ View generated ISA code
‒ View registers
OPEN SOURCE LIBRARIES ACCELERATED BY AMD
OpenCV
‒ Most popular computer vision library
‒ Now with many OpenCL™ accelerated functions

Bolt
‒ C++ template library
‒ Provides GPU off-load for common data-parallel algorithms
‒ Now with cross-OS support and improved performance/functionality

clMath
‒ AMD released APPML as open source to create clMath
‒ Accelerated BLAS and FFT libraries
‒ Accessible from Fortran, C and C++

Aparapi
‒ OpenCL™ accelerated Java 7
‒ Java APIs for data-parallel algorithms (no need to learn OpenCL™)
AMD APUS, HSA – CLIENT TO THE CLOUD
Parallel workloads are booming ‒ Acceleration where the data is ‒ On the client for a snappy user experience ‒ In the cloud for scalable services
HSA enabled APUs in the cloud ‒ Big data analytics ‒ Video processing ‒ Science, imaging, genomics ‒ Unleashing the Java development community
Acceleration at all tiers of the cloud ‒ Data centers, media hubs, cloud periphery
A CONVERGENCE AT THE RIGHT TIME
A SPECIAL GUEST
Gary Campbell, CTO, Infrastructure Technology Strategy, HP
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. and Microsoft and Windows are trademarks of Microsoft Corp. Other names are for informational purposes only and may be trademarks of their respective owners.