
Unlocking GPU potential with JIT
Anastasia Ailamaki, with Periklis Chrysogelos and Panos Sioulas

Transcript of the HPTS presentation.

  • Unlocking GPU potential with JIT

    Anastasia Ailamaki, with Periklis Chrysogelos and Panos Sioulas

  • One hardware does not fit all

    Rethink query engines for accelerator-level parallelism

    [Figure: server hardware evolution over time: single-core/single-CPU, multi-core/single-CPU, multi-core/multi-CPU, multi-core/multi-CPU/multi-GPU]

  • One hardware fits all: The end of an efficient story

    [Figure: execution time and throughput relative to a commercial CPU DBMS as data size grows from below to above GPU memory, comparing CPU, GPU, and (projected) hybrid configurations of a commercial system and a prototype]

    20%-80% throughput loss due to lack of portability

  • Designing query engines for heterogeneous HW

    Decomposition of design space to find sweet spot

    [Figure: design space between hardware-oblivious and hardware-conscious engines, trading performance against portability]

  • OLAP in heterogeneous servers: design space

    Selective obliviousness:

    • intra-operator: performance depends on the μ-arch; tune operators to memory hierarchy specifics
    • intra-device: portability is impacted by specialization; inject target-specific info using codegen
    • inter-device: limited device inter-operability; encapsulate heterogeneity and balance load

    [Figure: query plans with selections and joins placed across CPUs and GPUs] [CIDR2019]

  • Optimizer can produce cross-device plans

    Inter-device: HetExchange [VLDB2019]
    • Decouple data- from control-flow
    • Operators encapsulate trait conversions

    Flow    | Scope       | Trait
    Control | Delegation  | Heterogeneous parallelism
    Control | Routing     | Homogeneous parallelism
    Data    | Transfer    | Data locality
    Data    | Granularity | Execution granularity

    [Figure: the logical plan (scan, filter, aggregate) becomes a physical plan with routers, mem-moves, cpu2gpu/gpu2cpu crossings, and pack/unpack operators around the filter and aggregate]

    A sketch of trait conversions as ordinary operators follows after this slide.
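    To make the trait table concrete: the conversions can be modeled as ordinary operators in a push-based pipeline, which is what keeps relational operators oblivious to heterogeneity. The C++ sketch below is illustrative only; Block, Operator, Cpu2Gpu, MemMove, and their fields are assumed names rather than the Proteus interfaces, and the real cpu2gpu and mem-move operators launch kernels and issue transfers instead of flipping flags.

    // Illustrative sketch (assumed types): trait conversions as plain operators.
    #include <cstddef>

    enum class Device   { CPU, GPU };           // control-flow trait: who executes
    enum class Locality { HostMem, DeviceMem }; // data trait: where the bytes live

    struct Block {                              // unit flowing through a pipeline
        void*       data   = nullptr;
        std::size_t count  = 0;
        Device      device = Device::CPU;
        Locality    where  = Locality::HostMem;
    };

    struct Operator {                           // push-based operator interface
        virtual ~Operator() = default;
        virtual void consume(Block b) = 0;
        Operator* next = nullptr;
    };

    // cpu2gpu converts the control-flow trait: downstream work is delegated to
    // a GPU execution context (a kernel launch in the real engine).
    struct Cpu2Gpu : Operator {
        void consume(Block b) override { b.device = Device::GPU; next->consume(b); }
    };

    // mem-move converts the data-locality trait: bytes are staged into device
    // memory (an asynchronous copy in the real engine).
    struct MemMove : Operator {
        void consume(Block b) override { b.where = Locality::DeviceMem; next->consume(b); }
    };

    // A relational operator such as filter never inspects device or locality.
    struct Filter : Operator {
        void consume(Block b) override { /* evaluate predicate */ next->consume(b); }
    };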

  • Device Boundary Crossings

    • Cross-device pipelined execution
    • Hand over execution to the next device
    • Launch kernels/threads, synchronize, backpressure
    • Only these operators are aware of device heterogeneity

    Encapsulate heterogeneous parallelism

    [Figure: physical plan with the cpu2gpu and gpu2cpu crossings highlighted]
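    As a rough illustration of "launch kernels/threads, synchronize, backpressure", the sketch below shows one way a cpu2gpu crossing could hand blocks to a GPU stream while bounding the number of in-flight blocks. The class name, the in-flight bound, and the omission of the actual kernel launch are assumptions made for brevity; only standard CUDA runtime calls are used.

    #include <cuda_runtime.h>
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>

    class Cpu2GpuCrossing {
        cudaStream_t stream_;
        std::mutex m_;
        std::condition_variable cv_;
        int in_flight_ = 0;
        static constexpr int kMaxInFlight = 4;   // backpressure bound (assumed)

    public:
        Cpu2GpuCrossing()  { cudaStreamCreate(&stream_); }
        ~Cpu2GpuCrossing() { cudaStreamDestroy(stream_); }

        // Called by the producing CPU pipeline for every block.
        void consume(const void* host_block, void* dev_block, std::size_t bytes) {
            {   // block the producer while too many blocks are still on the GPU
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return in_flight_ < kMaxInFlight; });
                ++in_flight_;
            }
            // Stage the block; the real engine would then launch the JITed GPU
            // pipeline kernel on the same stream so copy and compute overlap.
            cudaMemcpyAsync(dev_block, host_block, bytes,
                            cudaMemcpyHostToDevice, stream_);
            cudaLaunchHostFunc(stream_, &Cpu2GpuCrossing::on_done, this);
        }

        // Runs once the GPU has drained this block; releases backpressure.
        static void on_done(void* self) {
            auto* c = static_cast<Cpu2GpuCrossing*>(self);
            std::lock_guard<std::mutex> lk(c->m_);
            --c->in_flight_;
            c->cv_.notify_one();
        }

        void close() { cudaStreamSynchronize(stream_); }  // end-of-stream barrier
    };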

  • Concurrent Execution

    • Horizontal & Vertical parallelism

    • Instantiate pipelines multiple times

    • Routing policies: load-balance, partition, locality

    Encapsulate homogeneous parallelism

    [Figure: physical plan with the router operators highlighted]

    A routing-policy sketch follows after this slide.
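    The routing policies can be pictured as a single exchange operator with a pluggable target-selection rule: spread blocks round-robin for load balance, hash for co-partitioning, or keep a block on the device or socket where it already resides. The sketch below uses assumed names (Router, BlockRef); the real router also manages queues and per-instance affinity.

    #include <atomic>
    #include <cstddef>

    enum class Policy { LoadBalance, Partition, Locality };

    struct BlockRef {
        std::size_t partition_key;   // e.g., hash of the join/group-by key
        int         home_id;         // NUMA node or GPU where the block resides
    };

    class Router {
        Policy policy_;
        std::size_t fanout_;                 // number of downstream instances
        std::atomic<std::size_t> rr_{0};     // round-robin cursor
    public:
        Router(Policy p, std::size_t fanout) : policy_(p), fanout_(fanout) {}

        // Index of the pipeline instance that should consume this block.
        std::size_t route(const BlockRef& b) {
            switch (policy_) {
                case Policy::LoadBalance: return rr_.fetch_add(1) % fanout_;
                case Policy::Partition:   return b.partition_key % fanout_;
                case Policy::Locality:    return static_cast<std::size_t>(b.home_id) % fanout_;
            }
            return 0;
        }
    };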

  • Data Transfers

    • Handle memory transfers/prefetching

    • Hide memory topology

    • Overlap transfers with execution

    Hide memory heterogeneity

    [Figure: physical plan with the mem-move operators highlighted]
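    One way to overlap transfers with execution, as mem-move must, is double buffering on two CUDA streams: block i+1 is copied while block i is being processed. The function name, the two-buffer layout, and the process_on_gpu hook below are assumptions for illustration, not the engine's interface.

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <vector>

    void mem_move(const std::vector<const void*>& host_blocks,
                  std::size_t block_bytes,
                  void* dev_buf[2],   // two pre-allocated device staging buffers
                  void (*process_on_gpu)(void*, std::size_t, cudaStream_t)) {
        cudaStream_t copy, exec;
        cudaStreamCreate(&copy);
        cudaStreamCreate(&exec);
        cudaEvent_t ready[2], consumed[2];
        for (int s = 0; s < 2; ++s) { cudaEventCreate(&ready[s]); cudaEventCreate(&consumed[s]); }

        for (std::size_t i = 0; i < host_blocks.size(); ++i) {
            int slot = static_cast<int>(i % 2);
            // Never overwrite a buffer the GPU may still be reading.
            if (i >= 2) cudaStreamWaitEvent(copy, consumed[slot], 0);
            // Stage block i while the previous block is still being processed.
            cudaMemcpyAsync(dev_buf[slot], host_blocks[i], block_bytes,
                            cudaMemcpyHostToDevice, copy);
            cudaEventRecord(ready[slot], copy);
            // The execution stream waits only for its own block, not for all copies.
            cudaStreamWaitEvent(exec, ready[slot], 0);
            process_on_gpu(dev_buf[slot], block_bytes, exec);
            cudaEventRecord(consumed[slot], exec);
        }
        cudaStreamSynchronize(exec);
        for (int s = 0; s < 2; ++s) { cudaEventDestroy(ready[s]); cudaEventDestroy(consumed[s]); }
        cudaStreamDestroy(copy);
        cudaStreamDestroy(exec);
    }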

  • Execution Granularity

    • Processing: in-registers => tuple-at-a-time

    • Memory transfers: packets => block-at-a-time

    • Transition between execution granularities

    • Create homogeneous (reg. policy) packets

    Transition between execution granularities

    [Figure: physical plan with the pack and unpack operators highlighted]
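    The granularity transition is easiest to see in the shape of the generated code: packets arrive block-at-a-time, but between unpack and pack the JITed loop works tuple-at-a-time with values kept in registers. The sketch below hand-writes what such a pipeline body could look like for the running query SELECT SUM(a) FROM T WHERE b > 42; the Packet layout is an assumption.

    #include <cstddef>
    #include <cstdint>

    struct Packet {                 // block-at-a-time unit used for transfers
        const std::int32_t* a;
        const std::int32_t* b;
        std::size_t         count;
    };

    // One pipeline body: unpack -> filter(b > 42) -> aggregate(SUM(a)).
    std::int64_t run_pipeline(const Packet& in) {
        std::int64_t sum = 0;                        // aggregate state in a register
        for (std::size_t i = 0; i < in.count; ++i) { // unpack: per-tuple loop
            std::int32_t a = in.a[i];                // values stay in registers,
            std::int32_t b = in.b[i];                // not in intermediate buffers
            if (b > 42) sum += a;                    // fused filter + aggregate
        }
        return sum;                                  // packed/routed as a partial result
    }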

  • Heterogeneity-aware plans

    Efficiency & operator portability

    SELECT SUM(a) FROM T WHERE b > 42

    [Figure: HetExchange rewrites the logical plan (scan, filter, aggregate) into a heterogeneity-aware plan (segmenter, routers, mem-move, cpu2gpu/gpu2cpu crossings, unpack, filter, aggregates), which is then JIT-compiled]

  • HetExchange in a JITed engine

    Generate pipelines and instantiate, then run

    [Figure: the plan for SELECT SUM(a) FROM T WHERE b > 42 is split at device crossings and routing points into CPU and GPU pipelines (numbered 1-11); each pipeline is JIT-compiled and run as multiple instances]

    A minimal instantiation sketch follows after this slide.
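    A minimal sketch of "generate pipelines and instantiate": each pipeline between routing points is compiled once and then run as several instances, and a final pipeline merges the per-instance partial aggregates. The run_query driver, the strided block assignment, and the std::function stand-in for a JIT-compiled pipeline body (such as the loop sketched earlier) are assumptions for illustration.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <numeric>
    #include <thread>
    #include <vector>

    std::int64_t run_query(std::size_t num_blocks, unsigned instances,
                           const std::function<std::int64_t(std::size_t)>& pipeline) {
        std::vector<std::int64_t> partials(instances, 0);
        std::vector<std::thread> workers;
        for (unsigned w = 0; w < instances; ++w)
            workers.emplace_back([&, w] {
                // routing point: instance w consumes a strided subset of blocks
                for (std::size_t i = w; i < num_blocks; i += instances)
                    partials[w] += pipeline(i);        // JITed pipeline body per block
            });
        for (auto& t : workers) t.join();
        // final pipeline: merge the per-instance partial aggregates
        return std::accumulate(partials.begin(), partials.end(), std::int64_t{0});
    }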

  • HetExchange in a JITed engine

    [Figure: the generated CPU and GPU pipelines and device crossings, repeated from the previous slide]

  • Device providers

    Inject target-specific info using the JIT infrastructure
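    A device provider can be thought of as the object operators query for target-specific facts at code-generation time, so the same operator lowers to CPU or GPU code. The interface below is an assumed sketch; the method names and the constants in CPUProvider/GPUProvider are illustrative, and the real providers drive LLVM-based code generation rather than returning plain numbers.

    #include <cstddef>
    #include <string>

    struct DeviceProvider {
        virtual ~DeviceProvider() = default;
        virtual std::string target_triple() const = 0;          // fed to the JIT backend
        virtual std::size_t preferred_parallelism() const = 0;  // threads vs. warps * SMs
        virtual std::size_t fast_memory_bytes() const = 0;      // L1 reach vs. scratchpad
        virtual bool        has_scratchpad() const = 0;         // shared memory available?
    };

    struct CPUProvider : DeviceProvider {
        std::string target_triple() const override { return "x86_64-unknown-linux-gnu"; }
        std::size_t preferred_parallelism() const override { return 24; }        // cores (assumed)
        std::size_t fast_memory_bytes() const override { return 32 * 1024; }     // L1d (assumed)
        bool        has_scratchpad() const override { return false; }
    };

    struct GPUProvider : DeviceProvider {
        std::string target_triple() const override { return "nvptx64-nvidia-cuda"; }
        std::size_t preferred_parallelism() const override { return 20 * 2048; } // SMs * threads (assumed)
        std::size_t fast_memory_bytes() const override { return 96 * 1024; }     // scratchpad per SM (assumed)
        bool        has_scratchpad() const override { return true; }
    };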

  • Device-optimized operators

    • Same challenges
    • Similar algorithms
    • Different mappings

    Reuse algorithms, specialize mappings to hardware

    [Figure: join variants (non-partitioned, sort-merge, radix); for the radix join, the partitioning fanout is bounded by L1 and TLB size on the CPU (Boncz et al. [VLDB1999]) and by the scratchpad size on the GPU (Sioulas et al. [ICDE2019]), with probe-side placement in L1 vs. the scratchpad]

  • Hardware-dependent JIT code

    Lower generic description to device-specific code

    [Figure: a device-independent plan (scan, radix-join, scan) is lowered through a CPUProvider or a GPUProvider into device-optimized implementations: the same hardware-aware radix-join algorithm with a CPU mapping or a GPU mapping]

    A fanout-selection sketch follows after this slide.
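    The "same algorithm, different mapping" point can be made concrete with the radix-join fanout: the number of partitioning bits per pass is derived from how much fast memory a worker has, i.e., L1/TLB reach on the CPU versus scratchpad capacity on the GPU. The function and the constants in the usage comment below are illustrative assumptions, not the published tuning rules.

    #include <algorithm>
    #include <cstddef>

    // Radix bits per partitioning pass, given how much fast memory one worker
    // can dedicate to its partition buffers.
    std::size_t radix_bits_per_pass(std::size_t fast_memory_bytes,
                                    std::size_t bytes_per_partition_buffer) {
        std::size_t fanout = std::max<std::size_t>(
            2, fast_memory_bytes / bytes_per_partition_buffer);
        std::size_t bits = 0;
        while ((std::size_t{1} << (bits + 1)) <= fanout) ++bits;  // floor(log2(fanout))
        return bits;
    }

    // CPU mapping: fanout bounded by L1 and TLB reach,
    //   e.g. radix_bits_per_pass(32 * 1024, 512).
    // GPU mapping: fanout bounded by the scratchpad available to a thread block,
    //   e.g. radix_bits_per_pass(96 * 1024, 1024).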

  • Experimental Setup

    • 2x Intel Xeon E5-2650L v3, 12-core @ 1.80GHz, 256GB RAM
    • 2x NVIDIA GeForce GTX 1080, 8GB, PCIe3 x16 per GPU
    • DBMS C/G: state-of-the-art commercial DBMSs
      – DBMS C: CPU-based, vector-at-a-time, SIMD, based on MonetDB/X100
      – DBMS G: GPU-based, JIT engine

  • Performance on CPU-resident data

    Hybrid throughput = 88.5% of (CPU-only + GPU-only), on average

    [Figure: execution time (s) over PCIe for DBMS C, Proteus CPUs, Proteus Hybrid, projected Proteus GPUs, and DBMS G; SSB SF1000, 600GB CSV, working set 92-138GB per query]

  • A glimpse into the future

    • Effect of interconnects and GPU compute power
    • Access to high-throughput networks

  • Adapting access method to query

    Up to 45% speed-up by tuning the access method to the hardware [CIDR2020]

    [Figure: GPU-only execution time (s) for eager, lazy, and semi-lazy access methods on PCIe-GeForce, PCIe-Tesla, and NVLink-Tesla configurations; SSB SF1000, 600GB CSV, working set 92-138GB per query, 2 GPUs per configuration]

  • Towards placing the CPU on the side

    • Shared & limited PCIe buses to NIC/GPU
    • Similar intra-/inter-server bandwidth
    • Direct NIC-GPU access (RDMA)

    Avoid the CPU bottleneck => device-centric OLAP engines

    [Figure: server topology with remote machines and remote storage; most links (PCIe, network) at ~12GBps, alongside a ~32GBps link]

  • JIT unleashes ALP (accelerator-level parallelism)

    • Run on all available devices

    • Relational operators oblivious to heterogeneity

    • Fast: Inject target-specific information through codegen

    • Result: 5x-10x versus CPU-/GPU-specialized systems


    Thank you!