I/O Virtualization and System Acceleration in POWER8 · 2017. 2. 16. · IBM Power Systems....

26
I/O Virtualization and System Acceleration in POWER8 Michael Gschwind IBM Power Systems

Transcript of I/O Virtualization and System Acceleration in POWER8 · 2017. 2. 16. · IBM Power Systems....

  • I/O Virtualization and System Acceleration in POWER8

    Michael Gschwind

    IBM Power Systems

  • Industry Trends Generate New Opportunities

    1

    10

    100

    2004 2006 2008 2010 2012 2014

    Price/P

    erf

    orm

    ance

    Processors

    Semiconductor

    Technology

    Applications and Services

    Firmware, Operating

    System and Hypervisor

    System Stack

    Systems Management &

    Cloud Deployment

    Systems Acceleration &

    HW/SW Optimization

    Workload Acceleration

    Services Delivery Model

    Advanced Memory Tech.

    Network & I/O Accel.

    System stack innovations are required to

    continue Cost/Performance improvements

    © 2015 International Business Machines Corporation

  • System Design ca. 2015

    • Consolidation in the Cloud

    – Virtual Machine Aggregation

    – Processor Virtualization

    • I/O Aggregation

    – I/O Virtualization

    – I/O Isolation

    • Maintain Performance growth for critical workloads

    – Under Power constraints

    – Under Cost constraints

    – Under Area constraints

    © 2015 International Business Machines Corporation

  • Innovation Through Open Standards

    • Consolidation in the Cloud

    – Virtual Machine Aggregation

    – Processor Virtualization

    • I/O Aggregation

    – I/O Virtualization

    – I/O Isolation

    • Maintain Performance growth for critical workloads

    – Under Power constraints

    – Under Cost constraints

    – Under Area constraints

    Power Instruction

    Set Architecture

    I/O Design

    Architecture

    Coherent Accelerator

    Interface Architecture

    © 2015 International Business Machines Corporation

  • I/O Design Architecture

    • I/O Virtualization

    – IO address translation

    – Single level translation support contiguous PowerVM partitions

    – Hierarchical translation support Linux/KVM partition memory

    • Partition data isolation

    – Separate I/O address spaces for partitions

    • Partition fault isolation

    – Fault management domains based on Partitionable Endpoints (PE)

    © 2015 International Business Machines Corporation

  • Partitioning and Managing the I/O Space: Partitionable Endpoints

    © 2014 International Business Machines Corporation

    PCIe Host Bridge

    Partitionable

    Endpoint State

    Partitionable

    Endpoint State…

    PCIe Host Bridge

    Partitionable

    Endpoint State

    Partitionable

    Endpoint State…

    IOV Adapter

    PF VF VF VF VF

    Adapter

    Port

    Port

    Po

    rt

    Po

    rt Switch

    Adapter Adapter

    Adapter Adapter

    Port

    Port

    Po

    rt

    Po

    rt Switch

    Port

    Port

    Po

    rt

    Po

    rt Switch

    Partitionable

    Endpoints

  • Managing Partitionable Endpoints

    • Many I/O functions tracked using Partitionable Endpoints

    – MMIO Load/Store address domains

    – DMA I/O bus address domains and TCEs

    – Interrupts

    – Ordering of transactions per PE

    – PE Error and Reset domains

    • Enhanced RAS capabilities

    – Enhanced I/O Error Handling (EEH)

    – Partition isolation

    POWER8

    PCIe

    Validate Translate

    Interrupts

    Memory

    PHB PHB

    © 2015 International Business Machines Corporation

  • I/O DMA memory translation

    • Translate I/O DMA memory addresses to system memory address

    – Isolation of partitions

    – Create contiguous view of non-contiguous data

    – Enable 32 bit devices for 64 bit systems

    • Processor chip contains cache of recently accessed tables

    – Full tables stored in system memory

    © 2015 International Business Machines Corporation

    translate

    validated I/O

    bus DMA

    address

    validate I/O

    bus DMA

    address

    validate the

    translated I/O

    bus DMA

    addressvalidated and translated

    I/O bus DMA address

    I/O bus DMA address

  • Partitionable Endpoint Management

    • DMA validation

    – “Trusted” Requester ID (RID): 16-bit bus/device/function number

    – “Untrusted” 64-bit address – received from device driver

    • Interrupts tracked in system memory

    • PEs impacted by I/O error identified via RID

    – SRIOV Physical Function fault PEs of Phys. & all Virt. Adapter Functions

    DMA Validation

    TCE cache

    PE#

    PE

    #

    PE# for operation

    Interrupt Tracking

    PE#

    PE

    #

    Interrupt

    Vectors

    and state

    Interrupt Source

    Controller state

    machine

    Error State Tracking

    Off

    ch

    ip

    (sys

    tem

    me

    mo

    ry)

    On

    ch

    ip

    PE#

    Index

    PE#

    vector

    array

    Error message

    state machine

    PE# Vector

    DMA Address RID DMA Address RIDData RID

    © 2015 International Business Machines Corporation

  • Coherent Accelerator Processor Interface (CAPI)

    • Integrated in every Power8

    system

    • Builds on a long history of IBM

    workload acceleration

    • Integrates with big-endian and

    little-endian accelerators

    © 2015 International Business Machines Corporation

  • Microprocessor Trends

    Lo

    g (

    Perf

    orm

    an

    ce)

    Single Thread

    More active Transistors, higher frequency

    More active Transistors, higher frequency

    More active Transistors, higher frequency

    Multi-Core

    Hybrid

    Special Purpose

    © 2015 International Business Machines Corporation

  • Heterogeneous System Challenges

    • The 4 ‘P’s of System Design

    • Programmer Productivity

    • Realize accelerator Performance benefits

    • Portability: Investment protection for applications

    • Partitioning for multi-user systems: processes, partitions

    © 2015 International Business Machines Corporation

  • Workload-optimized acceleration with coherent accelerators

    • Attached accelerators

    – Accelerate functions that do not fit traditional CPU model

    – Heterogeneous System Architecture

    • Coherent integration in system architecture

    – Data sharing

    – Programming

    – Performance

    © 2015 International Business Machines Corporation

  • Application Acceleration

    • Fine-grained data sharing

    coherent, shared memory

    • Accelerator-initiated data accesses/transfers

    coherent, shared memory

    • Pointer identity

    shared addressing

    • Flexible synchronization

    symmetric, programmable interfaces

    © 2015 International Business Machines Corporation

  • CAPI Acceleration overcomes Device Driver Deceleration

    Typical I/O Model Flow:

    Flow with Coherent Model:

    Total ~13µs for data prep

    Total ~0.36µs

    © 2015 International Business Machines Corporation

  • Workload-optimized acceleration

    • On-chip integrated accelerators (SoC design)

    – Compute accelerator (Cell BE)

    – Compression (P7+)

    – Encryption (P7+)

    – Random number generation (P7+)

    – …

    • SoC design offers highest integration, but…

    – Requires new chip design for accelerator

    – Long time to market

    – Requires very high volumes

    Cell BE

    POWER7+

    © 2015 International Business Machines Corporation

  • CAPI: Coherent Accelerator Processor Interface

    • Integrate accelerators into system arch.

    – Modular interface

    – Third-party high value-add components

    • Standardized, layered protocol

    – architectural interface

    – functional protocol

    – PCIe signaling protocol

    • Create workload-optimized innovative solutions

    – Faster time to market

    – Lower bar to entry

    – Variety of implementation options

    FPGAs, ASICs

    Coherence Bus

    proxy

    PSL

    POWER8

    * Power Service Layer© 2015 International Business Machines Corporation

  • CAPI accelerator programming

    Coherence Bus

    proxy

    PSL

    Virtual Addressing

    • Pointer identity: Pointers reference

    same object as the host application

    • CAPI accelerators work with same virtual

    memory addresses as CPU

    • CAPI shares page tables and provides

    address translation of host application

    • Peer-to-peer programming between

    CPU and accelerator with

    in-memory data sharingVirtualization and Partitioning

    • Address translation supports process

    isolation Accelerator has access to

    application context (only)

    • Address translation supports partition

    isolation Accelerator has access to

    partition data (only)© 2015 International Business Machines Corporation

  • CAPI accelerator programming

    Coherence Bus

    proxy

    PSL

    Hardware Managed Cache Coherence

    • No need for memory pinning

    • Data fetched by accelerator based on

    accelerator application flow

    • Accelerator participates in locks

    • Low latency communication

    © 2015 International Business Machines Corporation

  • CAPI Accelerator Virtualization

    • Dedicated Model

    – Accelerator assigned to single process in a partition

    – Binding via Operating System (process) and Hypervisor (partition)

    • Time-quantum shared programming model

    – Protocol-controlled model

    • Accelerator-directed shared programming model

    – networking model (select context based on incoming data)

    Coherence Bus

    proxy

    PSL

    work queues

    © 2015 International Business Machines Corporation

  • Coherent acceleration of data sharing

    • Offload data synchronization and data

    transfers

    – No need to invalidate data before

    initiating I/O transfers

    – Accelerator feeds itself avoid use

    of high-function CPU as data mover

    • Available translation, transfer and

    synchronization bandwidth scales with

    parallelism

    – As more accelerators are used,

    available resources scale up

    – Avoid CPU becoming serial

    bottleneck0

    1

    2

    3

    4

    5

    6

    7

    8

    9

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16SPEs

    no

    rmalized

    execu

    tio

    n t

    ime

    compute

    address management

    Cost avoided by

    system-wide

    page tables

    0

    0.05

    0.1

    0.15

    0.2

    0.25

    0.3

    0.35

    0.4

    1 2 4 8 16

    SPEs

    Tim

    e (

    no

    rma

    lize

    d)

    compute

    cache mgmt

    Cost avoided by

    cache-coherent

    transfers

    [Gschwind, ICCD 2008]© 2015 International Business Machines Corporation

  • Coherent accelerator programming flexibility

    • Garbage collection accelerator

    – Explore boundaries of acceleration

    – Study accelerator programmability

    • Pointer identity: advanced data structures

    – Autonomous traversal of data

    • Self-paced memory access

    – Simplifies data management

    – No callbacks to request data

    – Zero-copy data access

    • Handle complex access patterns [Cher & Gschwind, VEE 2008]

    Normalized area×delay product

    © 2015 International Business Machines Corporation

  • CAPI Attached Flash Optimization

    • Attach IBM FlashSystem to POWER8 via CAPI

    • Read/write commands issued via APIs to eliminate 97% of path length

    • Saves 20-30 cores per 1M IOPS

    20K

    instructions

    reduced to

    500

    Application

    Library

    POSIX Async

    I/O Style API

    Shared

    Memory Work

    Queue

    aio_read()

    aio_write()

    strategy ( )

    Pin buffers, Translate,

    Map DMA, Start I/O

    Application

    Read/Write Syscall

    Interrupt, unmap,

    unpin,Iodone scheduling

    Disk and Adapter DD

    strategy ( ) iodone ( )

    File System

    iodone ( )

    LVM

    © 2015 International Business Machines Corporation

  • CAPI Unlocks the Next Level of Performance for Flash

    Identical hardware with 2

    different paths to data

    FlashSystem

    Conventional

    I/O CAPI

    POWER S822L >5x better IOPS

    per HW thread>2x lower latency

    0

    20,000

    40,000

    60,000

    80,000

    100,000

    120,000

    0

    50

    100

    150

    200

    250

    300

    350

    400

    450

    500

    IOPS per Hardware Thread Latency (µs)

    © 2015 International Business Machines Corporation

  • Load Balancer

    500GB Cache

    Node

    10Gb Uplink

    POWER8 Server

    Flash Array w/ up

    to 56TB

    Differentiated NoSQL

    (POWER8 + CAPI + FlashSystem)

    Infrastructure Attributes

    - 192 threads in 4U server drawer

    - 56 TB of flash per 2U drawer

    - Shared Memory & cache for dynamic tuning

    - Elimination of I/O and network overhead

    - Cluster solution in a box

    Today’s NoSQL in memory (x86)

    Infrastructure Requirements

    - Large distributed (Scale out)

    - Large memory per node

    - Networking bandwidth needs

    - Load balancing

    Power CAPI-attached FlashSystem for NoSQL regains infrastructure control and reigns in the cost to deliver services.

    WWW10Gb Uplink

    WWW

    Backup Nodes

    500GB Cache

    Node500GB Cache

    Node500GB Cache

    Node500GB Cache

    Node

    What CAPI Means for NoSQL Solutions

    © 2015 International Business Machines Corporation

  • Summary

    • POWER8 delivers advanced virtualization for CPU and I/O

    – High-performance I/O virtualization

    – Data isolation

    – Fault isolation

    • POWER8 takes the next step in exploiting system accelerators

    – Eliminate overheads inherent in I/O model

    – Reduced latency

    – Increased throughput

    • Collaborative Innovation based on Open Standards

    – Both specifications donated to OpenPOWER Foundation by IBM

    – Available royalty-free to all OpenPOWER members

    © 2015 International Business Machines Corporation