I/O Virtualization and System Acceleration in POWER8 · 2017. 2. 16. · IBM Power Systems....

I/O Virtualization and System Acceleration in POWER8

Michael Gschwind

IBM Power Systems

Industry Trends Generate New Opportunities

1

10

100

2004 2006 2008 2010 2012 2014

Price/P

erf

orm

ance

Processors

Semiconductor

Technology

Applications and Services

Firmware, Operating

System and Hypervisor

System Stack

Systems Management &

Cloud Deployment

Systems Acceleration &

HW/SW Optimization

Workload Acceleration

Services Delivery Model

Advanced Memory Tech.

Network & I/O Accel.

System stack innovations are required to

continue Cost/Performance improvements

© 2015 International Business Machines Corporation

System Design ca. 2015

• Consolidation in the Cloud

– Virtual Machine Aggregation

– Processor Virtualization

• I/O Aggregation

– I/O Virtualization

– I/O Isolation

• Maintain Performance growth for critical workloads

– Under Power constraints

– Under Cost constraints

– Under Area constraints


Innovation Through Open Standards

• Consolidation in the Cloud

– Virtual Machine Aggregation

– Processor Virtualization

• I/O Aggregation

– I/O Virtualization

– I/O Isolation

• Maintain Performance growth for critical workloads

– Under Power constraints

– Under Cost constraints

– Under Area constraints

Power Instruction

Set Architecture

I/O Design

Architecture

Coherent Accelerator

Interface Architecture


I/O Design Architecture

• I/O Virtualization

– IO address translation

– Single level translation support contiguous PowerVM partitions

– Hierarchical translation support Linux/KVM partition memory

• Partition data isolation

– Separate I/O address spaces for partitions

• Partition fault isolation

– Fault management domains based on Partitionable Endpoints (PE)


Partitioning and Managing the I/O Space: Partitionable Endpoints


PCIe Host Bridge

Partitionable

Endpoint State

Partitionable

Endpoint State…

PCIe Host Bridge

Partitionable

Endpoint State

Partitionable

Endpoint State…

IOV Adapter

PF VF VF VF VF

Adapter

Port

Port

Po

rt

Po

rt Switch

Adapter Adapter

Adapter Adapter

Port

Port

Po

rt

Po

rt Switch

Port

Port

Po

rt

Po

rt Switch

Partitionable

Endpoints

Managing Partitionable Endpoints

• Many I/O functions tracked using Partitionable Endpoints

– MMIO Load/Store address domains

– DMA I/O bus address domains and TCEs

– Interrupts

– Ordering of transactions per PE

– PE Error and Reset domains

• Enhanced RAS capabilities

– Enhanced I/O Error Handling (EEH)

– Partition isolation

POWER8

…

PCIe

Validate Translate

Interrupts

Memory

PHB PHB


I/O DMA memory translation

• Translate I/O DMA memory addresses to system memory address

– Isolation of partitions

– Create contiguous view of non-contiguous data

– Enable 32 bit devices for 64 bit systems

• Processor chip contains cache of recently accessed tables

– Full tables stored in system memory


translate

validated I/O

bus DMA

address

validate I/O

bus DMA

address

validate the

translated I/O

bus DMA

addressvalidated and translated

I/O bus DMA address

I/O bus DMA address

Partitionable Endpoint Management

• DMA validation

– “Trusted” Requester ID (RID): 16-bit bus/device/function number

– “Untrusted” 64-bit address – received from device driver

• Interrupts tracked in system memory

• PEs impacted by I/O error identified via RID

– SRIOV Physical Function fault PEs of Phys. & all Virt. Adapter Functions

DMA Validation

TCE cache

PE#

PE

#

PE# for operation

Interrupt Tracking

PE#

PE

#

Interrupt

Vectors

and state

Interrupt Source

Controller state

machine

Error State Tracking

Off

ch

ip

(sys

tem

me

mo

ry)

On

ch

ip

PE#

Index

PE#

vector

array

Error message

state machine

PE# Vector

DMA Address RID DMA Address RIDData RID


Coherent Accelerator Processor Interface (CAPI)

• Integrated in every Power8

system

• Builds on a long history of IBM

workload acceleration

• Integrates with big-endian and

little-endian accelerators


Microprocessor Trends

Lo

g (

Perf

orm

an

ce)

Single Thread

More active Transistors, higher frequency



Multi-Core

Hybrid

Special Purpose


Heterogeneous System Challenges

• The 4 ‘P’s of System Design

• Programmer Productivity

• Realize accelerator Performance benefits

• Portability: Investment protection for applications

• Partitioning for multi-user systems: processes, partitions


Workload-optimized acceleration with coherent accelerators

• Attached accelerators

– Accelerate functions that do not fit traditional CPU model

– Heterogeneous System Architecture

• Coherent integration in system architecture

– Data sharing

– Programming

– Performance


Application Acceleration

• Fine-grained data sharing

coherent, shared memory

• Accelerator-initiated data accesses/transfers

coherent, shared memory

• Pointer identity

shared addressing

• Flexible synchronization

symmetric, programmable interfaces


CAPI Acceleration overcomes Device Driver Deceleration

Typical I/O Model Flow:

Flow with Coherent Model:

Total ~13µs for data prep

Total ~0.36µs


Workload-optimized acceleration

• On-chip integrated accelerators (SoC design)

– Compute accelerator (Cell BE)

– Compression (P7+)

– Encryption (P7+)

– Random number generation (P7+)

– …

• SoC design offers highest integration, but…

– Requires new chip design for accelerator

– Long time to market

– Requires very high volumes

Cell BE

POWER7+


CAPI: Coherent Accelerator Processor Interface

• Integrate accelerators into system arch.

– Modular interface

– Third-party high value-add components

• Standardized, layered protocol

– architectural interface

– functional protocol

– PCIe signaling protocol

• Create workload-optimized innovative solutions

– Faster time to market

– Lower bar to entry

– Variety of implementation options

FPGAs, ASICs

Coherence Bus

proxy

PSL

POWER8

* Power Service Layer© 2015 International Business Machines Corporation

CAPI accelerator programming

Coherence Bus

proxy

PSL

Virtual Addressing

• Pointer identity: Pointers reference

same object as the host application

• CAPI accelerators work with same virtual

memory addresses as CPU

• CAPI shares page tables and provides

address translation of host application

• Peer-to-peer programming between

CPU and accelerator with

in-memory data sharingVirtualization and Partitioning

• Address translation supports process

isolation Accelerator has access to

application context (only)

• Address translation supports partition

isolation Accelerator has access to

partition data (only)© 2015 International Business Machines Corporation

CAPI accelerator programming

Coherence Bus

proxy

PSL

Hardware Managed Cache Coherence

• No need for memory pinning

• Data fetched by accelerator based on

accelerator application flow

• Accelerator participates in locks

• Low latency communication


CAPI Accelerator Virtualization

• Dedicated Model

– Accelerator assigned to single process in a partition

– Binding via Operating System (process) and Hypervisor (partition)

• Time-quantum shared programming model

– Protocol-controlled model

• Accelerator-directed shared programming model

– networking model (select context based on incoming data)

Coherence Bus

proxy

PSL

work queues


Coherent acceleration of data sharing

• Offload data synchronization and data

transfers

– No need to invalidate data before

initiating I/O transfers

– Accelerator feeds itself avoid use

of high-function CPU as data mover

• Available translation, transfer and

synchronization bandwidth scales with

parallelism

– As more accelerators are used,

available resources scale up

– Avoid CPU becoming serial

bottleneck0

1

2

3

4

5

6

7

8

9

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16SPEs

no

rmalized

execu

tio

n t

ime

compute

address management

Cost avoided by

system-wide

page tables

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

1 2 4 8 16

SPEs

Tim

e (

no

rma

lize

d)

compute

cache mgmt

Cost avoided by

cache-coherent

transfers

[Gschwind, ICCD 2008]© 2015 International Business Machines Corporation

Coherent accelerator programming flexibility

• Garbage collection accelerator

– Explore boundaries of acceleration

– Study accelerator programmability

• Pointer identity: advanced data structures

– Autonomous traversal of data

• Self-paced memory access

– Simplifies data management

– No callbacks to request data

– Zero-copy data access

• Handle complex access patterns [Cher & Gschwind, VEE 2008]

Normalized area×delay product


CAPI Attached Flash Optimization

• Attach IBM FlashSystem to POWER8 via CAPI

• Read/write commands issued via APIs to eliminate 97% of path length

• Saves 20-30 cores per 1M IOPS

20K

instructions

reduced to

500

Application

Library

POSIX Async

I/O Style API

Shared

Memory Work

Queue

aio_read()

aio_write()

strategy ( )

Pin buffers, Translate,

Map DMA, Start I/O

Application

Read/Write Syscall

Interrupt, unmap,

unpin,Iodone scheduling

Disk and Adapter DD

strategy ( ) iodone ( )

File System

iodone ( )

LVM


CAPI Unlocks the Next Level of Performance for Flash

Identical hardware with 2

different paths to data

FlashSystem

Conventional

I/O CAPI

POWER S822L >5x better IOPS

per HW thread>2x lower latency

0

20,000

40,000

60,000

80,000

100,000

120,000

0

50

100

150

200

250

300

350

400

450

500

IOPS per Hardware Thread Latency (µs)


Load Balancer

500GB Cache

Node

10Gb Uplink

POWER8 Server

Flash Array w/ up

to 56TB

Differentiated NoSQL

(POWER8 + CAPI + FlashSystem)

Infrastructure Attributes

- 192 threads in 4U server drawer

- 56 TB of flash per 2U drawer

- Shared Memory & cache for dynamic tuning

- Elimination of I/O and network overhead

- Cluster solution in a box

Today’s NoSQL in memory (x86)

Infrastructure Requirements

- Large distributed (Scale out)

- Large memory per node

- Networking bandwidth needs

- Load balancing

Power CAPI-attached FlashSystem for NoSQL regains infrastructure control and reigns in the cost to deliver services.

WWW10Gb Uplink

WWW

Backup Nodes

500GB Cache

Node500GB Cache

Node500GB Cache

Node500GB Cache

Node

What CAPI Means for NoSQL Solutions


Summary

• POWER8 delivers advanced virtualization for CPU and I/O

– High-performance I/O virtualization

– Data isolation

– Fault isolation

• POWER8 takes the next step in exploiting system accelerators

– Eliminate overheads inherent in I/O model

– Reduced latency

– Increased throughput

• Collaborative Innovation based on Open Standards

– Both specifications donated to OpenPOWER Foundation by IBM

– Available royalty-free to all OpenPOWER members


I/O Virtualization and System Acceleration in POWER8 · 2017. 2. 16. · IBM Power Systems....

Documents

Transcript of I/O Virtualization and System Acceleration in POWER8 · 2017. 2. 16. · IBM Power Systems....