
Unlocking GPU potential with JIT
Anastasia Ailamaki, with Periklis Chrysogelos and Panos Sioulas

Transcript of the HPTS presentation.

  • Unlocking GPU potential with JIT

    Anastasia Ailamaki, with Periklis Chrysogelos and Panos Sioulas

  • One hardware does not fit all

    Rethink query engines for accelerator-level parallelism

    [Figure: server hardware evolution over time: single-core/single-CPU, multi-core/single-CPU, multi-core/multi-CPU, multi-core/multi-CPU/multi-GPU]

  • One hardware fits all: The end of an efficient story

    [Figure: execution time and throughput relative to a commercial CPU DBMS as data size grows from below to above GPU memory, comparing CPU, GPU, and (projected) hybrid configurations of a commercial system and a prototype]

    20%-80% throughput loss due to lack of portability

  • Designing query engines for heterogeneous HW

    Decomposition of design space to find sweet spot

    [Figure: design space between hardware-oblivious and hardware-conscious engines, trading performance against portability]

  • OLAP in heterogeneous servers: design space

    Selective obliviousness:

    • intra-operator: performance depends on the μ-arch; tune operators to memory hierarchy specifics
    • intra-device: portability is impacted by specialization; inject target-specific info using codegen
    • inter-device: limited device inter-operability; encapsulate heterogeneity and balance load

    [Figure: query plans with selections and joins placed across CPUs and GPUs] [CIDR2019]

  • Optimizer can produce cross-device plans

    Inter-device: HetExchange [VLDB2019]
    • Decouple data- from control-flow
    • Operators encapsulate trait conversions

    Flow    | Scope       | Trait
    Control | Delegation  | Heterogeneous parallelism
    Control | Routing     | Homogeneous parallelism
    Data    | Transfer    | Data locality
    Data    | Granularity | Execution granularity

    [Figure: the logical plan (scan, filter, aggregate) becomes a physical plan with routers, mem-moves, cpu2gpu/gpu2cpu crossings, and pack/unpack operators around the filter and aggregate]

    A sketch of trait conversions as ordinary operators follows after this slide.
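    To make the trait table concrete: the conversions can be modeled as ordinary operators in a push-based pipeline, which is what keeps relational operators oblivious to heterogeneity. The C++ sketch below is illustrative only; Block, Operator, Cpu2Gpu, MemMove, and their fields are assumed names rather than the Proteus interfaces, and the real cpu2gpu and mem-move operators launch kernels and issue transfers instead of flipping flags.

    // Illustrative sketch (assumed types): trait conversions as plain operators.
    #include <cstddef>

    enum class Device   { CPU, GPU };           // control-flow trait: who executes
    enum class Locality { HostMem, DeviceMem }; // data trait: where the bytes live

    struct Block {                              // unit flowing through a pipeline
        void*       data   = nullptr;
        std::size_t count  = 0;
        Device      device = Device::CPU;
        Locality    where  = Locality::HostMem;
    };

    struct Operator {                           // push-based operator interface
        virtual ~Operator() = default;
        virtual void consume(Block b) = 0;
        Operator* next = nullptr;
    };

    // cpu2gpu converts the control-flow trait: downstream work is delegated to
    // a GPU execution context (a kernel launch in the real engine).
    struct Cpu2Gpu : Operator {
        void consume(Block b) override { b.device = Device::GPU; next->consume(b); }
    };

    // mem-move converts the data-locality trait: bytes are staged into device
    // memory (an asynchronous copy in the real engine).
    struct MemMove : Operator {
        void consume(Block b) override { b.where = Locality::DeviceMem; next->consume(b); }
    };

    // A relational operator such as filter never inspects device or locality.
    struct Filter : Operator {
        void consume(Block b) override { /* evaluate predicate */ next->consume(b); }
    };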

  • Device Boundary Crossings

    • Cross-device pipelined execution
    • Hand over execution to the next device
    • Launch kernels/threads, synchronize, backpressure
    • Only these operators are aware of device heterogeneity

    Encapsulate heterogeneous parallelism

    [Figure: physical plan with the cpu2gpu and gpu2cpu crossings highlighted]
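    As a rough illustration of "launch kernels/threads, synchronize, backpressure", the sketch below shows one way a cpu2gpu crossing could hand blocks to a GPU stream while bounding the number of in-flight blocks. The class name, the in-flight bound, and the omission of the actual kernel launch are assumptions made for brevity; only standard CUDA runtime calls are used.

    #include <cuda_runtime.h>
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>

    class Cpu2GpuCrossing {
        cudaStream_t stream_;
        std::mutex m_;
        std::condition_variable cv_;
        int in_flight_ = 0;
        static constexpr int kMaxInFlight = 4;   // backpressure bound (assumed)

    public:
        Cpu2GpuCrossing()  { cudaStreamCreate(&stream_); }
        ~Cpu2GpuCrossing() { cudaStreamDestroy(stream_); }

        // Called by the producing CPU pipeline for every block.
        void consume(const void* host_block, void* dev_block, std::size_t bytes) {
            {   // block the producer while too many blocks are still on the GPU
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return in_flight_ < kMaxInFlight; });
                ++in_flight_;
            }
            // Stage the block; the real engine would then launch the JITed GPU
            // pipeline kernel on the same stream so copy and compute overlap.
            cudaMemcpyAsync(dev_block, host_block, bytes,
                            cudaMemcpyHostToDevice, stream_);
            cudaLaunchHostFunc(stream_, &Cpu2GpuCrossing::on_done, this);
        }

        // Runs once the GPU has drained this block; releases backpressure.
        static void on_done(void* self) {
            auto* c = static_cast<Cpu2GpuCrossing*>(self);
            std::lock_guard<std::mutex> lk(c->m_);
            --c->in_flight_;
            c->cv_.notify_one();
        }

        void close() { cudaStreamSynchronize(stream_); }  // end-of-stream barrier
    };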

  • Concurrent Execution

    • Horizontal & Vertical parallelism

    • Instantiate pipelines multiple times

    • Routing policies: load-balance, partition, locality

    Encapsulate homogeneous parallelism

    [Figure: physical plan with the router operators highlighted]

    A routing-policy sketch follows after this slide.
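    The routing policies can be pictured as a single exchange operator with a pluggable target-selection rule: spread blocks round-robin for load balance, hash for co-partitioning, or keep a block on the device or socket where it already resides. The sketch below uses assumed names (Router, BlockRef); the real router also manages queues and per-instance affinity.

    #include <atomic>
    #include <cstddef>

    enum class Policy { LoadBalance, Partition, Locality };

    struct BlockRef {
        std::size_t partition_key;   // e.g., hash of the join/group-by key
        int         home_id;         // NUMA node or GPU where the block resides
    };

    class Router {
        Policy policy_;
        std::size_t fanout_;                 // number of downstream instances
        std::atomic<std::size_t> rr_{0};     // round-robin cursor
    public:
        Router(Policy p, std::size_t fanout) : policy_(p), fanout_(fanout) {}

        // Index of the pipeline instance that should consume this block.
        std::size_t route(const BlockRef& b) {
            switch (policy_) {
                case Policy::LoadBalance: return rr_.fetch_add(1) % fanout_;
                case Policy::Partition:   return b.partition_key % fanout_;
                case Policy::Locality:    return static_cast<std::size_t>(b.home_id) % fanout_;
            }
            return 0;
        }
    };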

  • Data Transfers

    • Handle memory transfers/prefetching

    • Hide memory topology

    • Overlap transfers with execution

    Hide memory heterogeneity

    [Figure: physical plan with the mem-move operators highlighted]
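    One way to overlap transfers with execution, as mem-move must, is double buffering on two CUDA streams: block i+1 is copied while block i is being processed. The function name, the two-buffer layout, and the process_on_gpu hook below are assumptions for illustration, not the engine's interface.

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <vector>

    void mem_move(const std::vector<const void*>& host_blocks,
                  std::size_t block_bytes,
                  void* dev_buf[2],   // two pre-allocated device staging buffers
                  void (*process_on_gpu)(void*, std::size_t, cudaStream_t)) {
        cudaStream_t copy, exec;
        cudaStreamCreate(&copy);
        cudaStreamCreate(&exec);
        cudaEvent_t ready[2], consumed[2];
        for (int s = 0; s < 2; ++s) { cudaEventCreate(&ready[s]); cudaEventCreate(&consumed[s]); }

        for (std::size_t i = 0; i < host_blocks.size(); ++i) {
            int slot = static_cast<int>(i % 2);
            // Never overwrite a buffer the GPU may still be reading.
            if (i >= 2) cudaStreamWaitEvent(copy, consumed[slot], 0);
            // Stage block i while the previous block is still being processed.
            cudaMemcpyAsync(dev_buf[slot], host_blocks[i], block_bytes,
                            cudaMemcpyHostToDevice, copy);
            cudaEventRecord(ready[slot], copy);
            // The execution stream waits only for its own block, not for all copies.
            cudaStreamWaitEvent(exec, ready[slot], 0);
            process_on_gpu(dev_buf[slot], block_bytes, exec);
            cudaEventRecord(consumed[slot], exec);
        }
        cudaStreamSynchronize(exec);
        for (int s = 0; s < 2; ++s) { cudaEventDestroy(ready[s]); cudaEventDestroy(consumed[s]); }
        cudaStreamDestroy(copy);
        cudaStreamDestroy(exec);
    }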

  • Execution Granularity

    • Processing: in-registers => tuple-at-a-time

    • Memory transfers: packets => block-at-a-time

    • Transition between execution granularities

    • Create homogeneous (reg. policy) packets

    Transition between execution granularities

    [Figure: physical plan with the pack and unpack operators highlighted]
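    The granularity transition is easiest to see in the shape of the generated code: packets arrive block-at-a-time, but between unpack and pack the JITed loop works tuple-at-a-time with values kept in registers. The sketch below hand-writes what such a pipeline body could look like for the running query SELECT SUM(a) FROM T WHERE b > 42; the Packet layout is an assumption.

    #include <cstddef>
    #include <cstdint>

    struct Packet {                 // block-at-a-time unit used for transfers
        const std::int32_t* a;
        const std::int32_t* b;
        std::size_t         count;
    };

    // One pipeline body: unpack -> filter(b > 42) -> aggregate(SUM(a)).
    std::int64_t run_pipeline(const Packet& in) {
        std::int64_t sum = 0;                        // aggregate state in a register
        for (std::size_t i = 0; i < in.count; ++i) { // unpack: per-tuple loop
            std::int32_t a = in.a[i];                // values stay in registers,
            std::int32_t b = in.b[i];                // not in intermediate buffers
            if (b > 42) sum += a;                    // fused filter + aggregate
        }
        return sum;                                  // packed/routed as a partial result
    }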

  • Heterogeneity-aware plans

    Efficiency & operator portability

    SELECT SUM(a) FROM T WHERE b > 42

    [Figure: HetExchange rewrites the logical plan (scan, filter, aggregate) into a heterogeneity-aware plan (segmenter, routers, mem-move, cpu2gpu/gpu2cpu crossings, unpack, filter, aggregates), which is then JIT-compiled]

  • HetExchange in a JITed engine

    Generate pipelines and instantiate, then run

    [Figure: the plan for SELECT SUM(a) FROM T WHERE b > 42 is split at device crossings and routing points into CPU and GPU pipelines (numbered 1-11); each pipeline is JIT-compiled and run as multiple instances]

    A minimal instantiation sketch follows after this slide.
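    A minimal sketch of "generate pipelines and instantiate": each pipeline between routing points is compiled once and then run as several instances, and a final pipeline merges the per-instance partial aggregates. The run_query driver, the strided block assignment, and the std::function stand-in for a JIT-compiled pipeline body (such as the loop sketched earlier) are assumptions for illustration.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <numeric>
    #include <thread>
    #include <vector>

    std::int64_t run_query(std::size_t num_blocks, unsigned instances,
                           const std::function<std::int64_t(std::size_t)>& pipeline) {
        std::vector<std::int64_t> partials(instances, 0);
        std::vector<std::thread> workers;
        for (unsigned w = 0; w < instances; ++w)
            workers.emplace_back([&, w] {
                // routing point: instance w consumes a strided subset of blocks
                for (std::size_t i = w; i < num_blocks; i += instances)
                    partials[w] += pipeline(i);        // JITed pipeline body per block
            });
        for (auto& t : workers) t.join();
        // final pipeline: merge the per-instance partial aggregates
        return std::accumulate(partials.begin(), partials.end(), std::int64_t{0});
    }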

  • HetExchange in a JITed engine

    [Figure: the generated CPU and GPU pipelines and device crossings, repeated from the previous slide]

  • Device providers

    Inject target-specific info using the JIT infrastructure
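    A device provider can be thought of as the object operators query for target-specific facts at code-generation time, so the same operator lowers to CPU or GPU code. The interface below is an assumed sketch; the method names and the constants in CPUProvider/GPUProvider are illustrative, and the real providers drive LLVM-based code generation rather than returning plain numbers.

    #include <cstddef>
    #include <string>

    struct DeviceProvider {
        virtual ~DeviceProvider() = default;
        virtual std::string target_triple() const = 0;          // fed to the JIT backend
        virtual std::size_t preferred_parallelism() const = 0;  // threads vs. warps * SMs
        virtual std::size_t fast_memory_bytes() const = 0;      // L1 reach vs. scratchpad
        virtual bool        has_scratchpad() const = 0;         // shared memory available?
    };

    struct CPUProvider : DeviceProvider {
        std::string target_triple() const override { return "x86_64-unknown-linux-gnu"; }
        std::size_t preferred_parallelism() const override { return 24; }        // cores (assumed)
        std::size_t fast_memory_bytes() const override { return 32 * 1024; }     // L1d (assumed)
        bool        has_scratchpad() const override { return false; }
    };

    struct GPUProvider : DeviceProvider {
        std::string target_triple() const override { return "nvptx64-nvidia-cuda"; }
        std::size_t preferred_parallelism() const override { return 20 * 2048; } // SMs * threads (assumed)
        std::size_t fast_memory_bytes() const override { return 96 * 1024; }     // scratchpad per SM (assumed)
        bool        has_scratchpad() const override { return true; }
    };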

  • Device-optimized operators

    • Same challenges
    • Similar algorithms
    • Different mappings

    Reuse algorithms, specialize mappings to hardware

    [Figure: join variants (non-partitioned, sort-merge, radix); for the radix join, the partitioning fanout is bounded by L1 and TLB size on the CPU (Boncz et al. [VLDB1999]) and by the scratchpad size on the GPU (Sioulas et al. [ICDE2019]), with probe-side placement in L1 vs. the scratchpad]

  • Hardware-dependent JIT code

    Lower generic description to device-specific code

    [Figure: a device-independent plan (scan, radix-join, scan) is lowered through a CPUProvider or a GPUProvider into device-optimized implementations: the same hardware-aware radix-join algorithm with a CPU mapping or a GPU mapping]

    A fanout-selection sketch follows after this slide.
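    The "same algorithm, different mapping" point can be made concrete with the radix-join fanout: the number of partitioning bits per pass is derived from how much fast memory a worker has, i.e., L1/TLB reach on the CPU versus scratchpad capacity on the GPU. The function and the constants in the usage comment below are illustrative assumptions, not the published tuning rules.

    #include <algorithm>
    #include <cstddef>

    // Radix bits per partitioning pass, given how much fast memory one worker
    // can dedicate to its partition buffers.
    std::size_t radix_bits_per_pass(std::size_t fast_memory_bytes,
                                    std::size_t bytes_per_partition_buffer) {
        std::size_t fanout = std::max<std::size_t>(
            2, fast_memory_bytes / bytes_per_partition_buffer);
        std::size_t bits = 0;
        while ((std::size_t{1} << (bits + 1)) <= fanout) ++bits;  // floor(log2(fanout))
        return bits;
    }

    // CPU mapping: fanout bounded by L1 and TLB reach,
    //   e.g. radix_bits_per_pass(32 * 1024, 512).
    // GPU mapping: fanout bounded by the scratchpad available to a thread block,
    //   e.g. radix_bits_per_pass(96 * 1024, 1024).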

  • Experimental Setup

    • 2x Intel Xeon E5-2650L v3, 12-core @ 1.80GHz, 256GB RAM
    • 2x NVIDIA GeForce GTX 1080, 8GB, PCIe3 x16 per GPU
    • DBMS C/G: state-of-the-art commercial DBMSs
      – DBMS C: CPU-based, vector-at-a-time, SIMD, based on MonetDB/X100
      – DBMS G: GPU-based, JIT engine

  • Performance on CPU-resident data

    Hybrid throughput = 88.5% of (CPU-only + GPU-only), on average

    [Figure: execution time (s) over PCIe for DBMS C, Proteus CPUs, Proteus Hybrid, projected Proteus GPUs, and DBMS G; SSB SF1000, 600GB CSV, working set 92-138GB per query]

  • A glimpse into the future

    • Effect of interconnects and GPU compute power
    • Access to high-throughput networks

  • Adapting access method to query

    Up to 45% speed-up by tuning the access method to the hardware [CIDR2020]

    [Figure: GPU-only execution time (s) for eager, lazy, and semi-lazy access methods on PCIe-GeForce, PCIe-Tesla, and NVLink-Tesla configurations; SSB SF1000, 600GB CSV, working set 92-138GB per query, 2 GPUs per configuration]

  • Towards placing the CPU on the side

    • Shared & limited PCIe buses to NIC/GPU
    • Similar intra-/inter-server bandwidth
    • Direct NIC-GPU access (RDMA)

    Avoid the CPU bottleneck => device-centric OLAP engines

    [Figure: server topology with remote machines and remote storage; most links (PCIe, network) at ~12GBps, alongside a ~32GBps link]

  • JIT unleashes ALP (accelerator-level parallelism)

    • Run on all available devices

    • Relational operators oblivious to heterogeneity

    • Fast: Inject target-specific information through codegen

    • Result: 5x-10x versus CPU-/GPU-specialized systems


    Thank you!