Accelerated Data Management Systems Through Real-Time Specialization
Anastasia Ailamaki, with Periklis Chrysogelos, Aunn Raza, Panos Sioulas,
and the DIAS team
Agile data management engines
Where's the sweet spot?
[Figure: design-space quadrant contrasting heterogeneity-oblivious vs. heterogeneity-conscious and hardware-oblivious vs. hardware-conscious engines; each axis trades portability against performance, with "+ALP" marking where accelerator-level parallelism pushes each design.]
Analytics on heterogeneous hardware
Device specialization carries portability debt
[Figure: performance vs. hardware features & diversity. Each device-conscious engine (XXX-conscious, YYY-conscious) approaches the HW-promised performance on its target, realizing a specialization benefit over a HW-oblivious engine, but falls behind as hardware diversifies: that gap is its portability debt.]
Analytics on heterogeneous hardware
Device specialization carries portability debt
[Figure: total execution time (s) for join-heavy, 50%-50% mix, and scan-heavy workloads under CPU-only, hybrid, and GPU-only execution.]
Setup: 4 servers, each with 2x Intel Xeon Gold 5118 CPUs, 2x NVIDIA Tesla V100S GPUs, and 2x Mellanox MT27800 100G InfiniBand NICs. 13 queries over CPU-resident data, 96-144 GB working set per query. Join-heavy: SPJ{1-4}Aggr; scan-heavy: Scan-Aggr.
Lifetime of a Query
[Figure: lifetime of a query. The query text passes through query parsing & optimization into a query execution plan; query execution then turns the plan into the result.]

SELECT SUM(T3.c * T4.d)
FROM T1, T2, T3, T4, …
WHERE T1.f < 50 AND T1.a = T2.t1_id AND …

[Figure: the corresponding plan: scans of T1-T4 feed filters (σ), a chain of joins (⨝), a projection (∏), and a final SUM aggregate (Γ). The operators differ in access pattern: scans are sequential, joins random-access, and the logic between them control-heavy.]
Hardware-conscious Analytics?
Traditional CPU-optimized technique → what it relies on:
– radix-(join/group by) → high cache-size-to-thread ratio
– vector-at-a-time → high cache-size-to-thread ratio
– parallelism & inter-socket atomics → efficient inter-socket operations

Won't work on GPUs.
[Figure: plan fragment around the join on T4, annotating joins as random-access and control-heavy, in contrast to the sequential scan feeding them.]
A fast equi-join algorithm: radix-join [Boncz et al., VLDB 1999]
– Partition both inputs; size the partition fanout to the memory hierarchy (TLB + caches; see the sizing sketch below)
– Assumes a sufficient cache-to-thread ratio
[Figure: R and S are radix-partitioned; each R partition's hashtable (h) is then probed with the matching S partition.]

GPU memory hierarchy:
– Low cache-to-thread ratio
– Software- and hardware-managed caches
– But: collaborative thread execution

Think differently for GPUs!
GPU-aware radix-join [ICDE 2019]
– Collaboratively partition per GPU thread block
– Amortize radix-cluster maintenance: rely on the big register files and thread overlapping; avoid random accesses to GPU memory
– Stage partition output in the scratchpad: irregular access patterns go through the scratchpad, writes are coalesced through shared memory, and multiple threads "complete" a cache line (see the CUDA sketch below)

3.6x speedup
[Figure: threads of a block stage input tuples in per-partition scratchpad buffers before writing them out to the partitions.]
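A compressed CUDA sketch of the staging idea, under stated assumptions: the fanout, buffer sizes, block shape, and host-side histogram are illustrative choices, and overflow handling is simplified relative to the published kernel.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstdint>

constexpr int kBlockThreads = 128;           // threads per block (assumption)
constexpr int kRadixBits    = 4;             // 16-way fanout per pass (assumption)
constexpr int kFanout       = 1 << kRadixBits;
constexpr int kBufTuples    = kBlockThreads; // worst case: one round, one partition

// Collaborative partitioning in the spirit of the ICDE 2019 kernel:
// a block stages tuples in per-partition scratchpad buffers and flushes
// each buffer cooperatively, so consecutive threads write consecutive
// global addresses (coalesced) instead of scattering tuples one by one.
__global__ void partition_pass(const uint32_t* __restrict__ in, int n,
                               uint32_t* __restrict__ out,
                               unsigned* part_next) {
    __shared__ uint32_t buf[kFanout][kBufTuples];
    __shared__ int      fill[kFanout];
    __shared__ unsigned base[kFanout];

    // Walk the input in uniform rounds so all threads hit __syncthreads().
    int stride = gridDim.x * blockDim.x;
    int rounds = (n + stride - 1) / stride;
    for (int r = 0; r < rounds; ++r) {
        if (threadIdx.x < kFanout) fill[threadIdx.x] = 0;
        __syncthreads();

        int i = r * stride + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            uint32_t key = in[i];
            int p    = key & (kFanout - 1);      // radix digit of this pass
            int slot = atomicAdd(&fill[p], 1);   // claim a scratchpad slot
            buf[p][slot] = key;                  // cannot overflow: <= kBufTuples
        }
        __syncthreads();

        // Reserve global space per partition, then flush coalesced.
        if (threadIdx.x < kFanout)
            base[threadIdx.x] =
                atomicAdd(&part_next[threadIdx.x], (unsigned)fill[threadIdx.x]);
        __syncthreads();
        for (int p = 0; p < kFanout; ++p)
            if (threadIdx.x < fill[p])
                out[base[p] + threadIdx.x] = buf[p][threadIdx.x];
        __syncthreads();
    }
}

int main() {
    const int n = 1 << 20;
    uint32_t* h_in = (uint32_t*)malloc(n * sizeof(uint32_t));
    for (int i = 0; i < n; ++i) h_in[i] = (uint32_t)rand();

    // Host-side histogram gives each partition's start offset in `out`.
    unsigned h_next[kFanout] = {0};
    for (int i = 0; i < n; ++i) ++h_next[h_in[i] & (kFanout - 1)];
    unsigned sum = 0;
    for (int p = 0; p < kFanout; ++p) {
        unsigned c = h_next[p]; h_next[p] = sum; sum += c;
    }

    uint32_t *d_in, *d_out; unsigned* d_next;
    cudaMalloc((void**)&d_in,  n * sizeof(uint32_t));
    cudaMalloc((void**)&d_out, n * sizeof(uint32_t));
    cudaMalloc((void**)&d_next, sizeof(h_next));
    cudaMemcpy(d_in, h_in, n * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_next, h_next, sizeof(h_next), cudaMemcpyHostToDevice);

    partition_pass<<<64, kBlockThreads>>>(d_in, n, d_out, d_next);
    cudaDeviceSynchronize();
    printf("partitioned %d tuples into %d radix clusters\n", n, kFanout);
    return 0;
}
```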
Accelerator-conscious Analytics
From traditional CPU-optimized techniques to CPU-GPU:
– radix-(join/group by) → tune operators to memory hierarchy specifics
– vector-at-a-time → code fusion & specialization for fast composability
– parallelism & inter-socket atomics → encapsulate heterogeneity and balance load
SQL → ALP-aware code: JIT pipelines [VLDB 2019]

SELECT SUM(a)
FROM T
WHERE b > 42

The logical plan (scan → filter → aggregate) is lowered by HetExchange into a heterogeneity-aware physical plan:
[Figure: physical plan with pipeline ids 1-11. A CPU pipeline scans and routes blocks through a segmenter and a mem-move into a cpu2gpu operator (device crossing); the GPU pipeline unpack → filter → aggregate runs on the device; gpu2cpu (device crossing), a router, and a final aggregate collect the result on the CPU. Each pipeline is JIT-compiled into per-device instances.]
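To make the fusion concrete, here is a hand-written stand-in (not the engine's generated code) for what a JIT-fused GPU pipeline for this query could look like: scan, filter, and SUM collapse into one kernel with no materialized intermediates. The grid/block sizes and the reduction shape are illustrative assumptions.

```cuda
#include <cstdio>
#include <cstdint>

// Hand-written stand-in for JIT output of:
//   SELECT SUM(a) FROM T WHERE b > 42
// Scan, filter, and aggregate are fused into one kernel: no operator
// boundaries, no materialized intermediates, one pass over the columns.
__global__ void fused_pipeline(const int32_t* __restrict__ a,
                               const int32_t* __restrict__ b,
                               int n, unsigned long long* result) {
    long long local = 0;
    // Grid-stride scan: each thread consumes a strided slice of T.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        if (b[i] > 42) local += a[i];          // filter + partial SUM

    __shared__ long long sh[256];              // assumes blockDim.x == 256
    sh[threadIdx.x] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sh[threadIdx.x] += sh[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)                      // one atomic per block
        atomicAdd(result, (unsigned long long)sh[0]);  // sum assumed >= 0
}

int main() {
    const int n = 1 << 20;
    int32_t *a, *b; unsigned long long* result;
    cudaMallocManaged(&a, n * sizeof(int32_t));
    cudaMallocManaged(&b, n * sizeof(int32_t));
    cudaMallocManaged(&result, sizeof(*result));
    for (int i = 0; i < n; ++i) { a[i] = 1; b[i] = i % 100; }
    *result = 0;
    fused_pipeline<<<128, 256>>>(a, b, n, result);
    cudaDeviceSynchronize();
    printf("SUM(a) WHERE b > 42: %llu\n", *result);   // expect ~57% of n
    return 0;
}
```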
JIT Code Generation for OLAP in GPUs [VLDB 2019]
[Figure: the same physical plan, showing which of the eleven pipelines the JIT compiles for the CPU and which for the GPU, with cpu2gpu/gpu2cpu marking the device crossings.]
From SQL to Pipeline Orchestration [VLDB 2019]
Multiple pipeline instances
[Figure: at run time the routers act as routing points: each JIT-compiled pipeline of the scan → filter → aggregate plan (same example query as above) runs as multiple instances on either side of the device crossings.]
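A minimal sketch of the routing-point idea, with made-up names and a fixed 2-CPU/1-GPU worker split: instances pull the next block when they finish one, so load distributes to devices according to their actual speed.

```cuda
#include <cstdio>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Toy routing point: pipeline instances *pull* the next block when they
// finish one, so faster devices automatically take a larger input share.
// Names and the 2-CPU/1-GPU worker split are illustrative assumptions.
struct Router {
    std::atomic<int> next{0};
    const int num_blocks;
    explicit Router(int n) : num_blocks(n) {}
    int pull() {                       // -1 signals end-of-stream
        int b = next.fetch_add(1);
        return b < num_blocks ? b : -1;
    }
};

void pipeline_instance(Router& r, const char* device) {
    // A real instance would run its JIT-compiled pipeline on each block,
    // prefixed by mem-move/cpu2gpu operators for GPU-side instances.
    for (int b; (b = r.pull()) != -1; )
        printf("%s processes block %d\n", device, b);
}

int main() {
    Router router(16);
    std::vector<std::thread> workers;
    workers.emplace_back(pipeline_instance, std::ref(router), "CPU-0");
    workers.emplace_back(pipeline_instance, std::ref(router), "CPU-1");
    workers.emplace_back(pipeline_instance, std::ref(router), "GPU-0");
    for (auto& w : workers) w.join();
    return 0;
}
```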
JIT data flow inspection [VLDB 2019]
– Decouple data flow from control flow
– Encapsulate trait conversions into operators (typed out in the sketch below)
– Inspect flows to load-balance

Distribute load to devices adaptively.

Flow     Scope        Trait
Control  Delegation   Heterogeneous parallelism
Control  Routing      Homogeneous parallelism
Data     Transfer     Data locality
Data     Granularity  Execution granularity

[Figure: the scan → filter → aggregate plan annotated with its trait-conversion operators: router, mem-move, pack/unpack, and cpu2gpu/gpu2cpu.]
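One way to read the table is as a type system over flows: each conversion operator flips exactly one trait and leaves the rest untouched, which is what lets the optimizer compose them freely. The enum/struct names below are ours, for illustration, not the engine's API.

```cuda
#include <cstdio>

// Illustrative typing of the four trait conversions in the table above.
enum class Device { CPU, GPU };
enum class Granularity { Tuple, Block, Segment };

struct FlowTraits {
    Device device;       // heterogeneous parallelism (cpu2gpu / gpu2cpu)
    int    parallelism;  // homogeneous parallelism   (router)
    Device location;     // data locality             (mem-move)
    Granularity grain;   // execution granularity     (segmenter / unpack)
};

FlowTraits cpu2gpu(FlowTraits t)             { t.device = Device::GPU; return t; }
FlowTraits router(FlowTraits t, int dop)     { t.parallelism = dop;    return t; }
FlowTraits mem_move(FlowTraits t, Device to) { t.location = to;        return t; }
FlowTraits unpack(FlowTraits t)              { t.grain = Granularity::Tuple; return t; }

int main() {
    // Lower the scan -> filter -> aggregate flow of the example onto a GPU:
    FlowTraits t{Device::CPU, 1, Device::CPU, Granularity::Block};
    t = router(t, 2);               // split into two pipeline instances
    t = mem_move(t, Device::GPU);   // move blocks into device memory
    t = cpu2gpu(t);                 // cross the device boundary
    t = unpack(t);                  // expose tuples to the JIT pipeline
    printf("flow now on %s at dop=%d\n",
           t.device == Device::GPU ? "GPU" : "CPU", t.parallelism);
    return 0;
}
```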
Abstractions for fast CPU-GPU analytics
Selective obliviousness: intra-operator, intra-device, inter-device [CIDR 2019]
– Operator tuning is μ-architecture specific: tune operators to memory hierarchy specifics
– Portability clashes with specialization: inject target-specific info using codegen
– Limited device inter-operability: encapsulate heterogeneity and balance load
[Figure: CPU and GPU plan fragments (σ, ⨝) illustrating the three scopes.]
Specialized OLTP & OLAP Systems
Data freshness bounded by ETL latency
[Figure: transactions run against the OLTP system; an Extract-Transform-Load pipeline periodically moves data into the OLAP system, so analytics see fresh data only as fast as ETL delivers it.]
Hybrid Transactional and Analytical Processing
OLTP is task-parallel:
– High rate of short-lived transactions
– Mostly point accesses (high data-access locality)
OLAP is data-parallel:
– Few, but long-running queries
– Scans large parts of the database

Align tasks & hardware to improve utilization.
HTAP: chasing 'locality of freshness'
Static OLAP-OLTP assignment:
– Unnecessary tradeoff between interference and performance
– Pre-determined resource assignment based on workload type
– Wasteful data consolidation and synchronization
Real-time, adaptive scheduling of HTAP workloads (a toy policy follows below):
– Specialize to requirements and data/freshness rates
– Workload-based resource assignment
– Pay-as-you-go snapshot updates

Task placement based on resource usage.
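A toy epoch-based policy for the last three bullets; the thresholds, core counts, and statistics are invented for illustration and are not measured behavior of the system.

```cuda
#include <cstdio>
#include <algorithm>

// Toy workload-based resource assignment: every scheduling epoch, cores
// migrate between the OLTP and OLAP pools based on observed demand, and
// snapshot consolidation is paid for only when freshness lag warrants it.
struct EpochStats {
    double oltp_util;      // 0..1 utilization of the OLTP pool
    double olap_util;      // 0..1 utilization of the OLAP pool
    double freshness_lag;  // seconds of unconsolidated OLTP updates
};

int main() {
    int total_cores = 48, oltp_cores = 24;
    EpochStats epochs[] = {{0.95, 0.40, 0.2}, {0.30, 0.98, 1.5}, {0.60, 0.60, 0.1}};
    for (const EpochStats& s : epochs) {
        if (s.oltp_util > 0.9 && s.olap_util < 0.7)        // OLTP starved
            oltp_cores = std::min(total_cores - 4, oltp_cores + 4);
        else if (s.olap_util > 0.9 && s.oltp_util < 0.7)   // OLAP starved
            oltp_cores = std::max(4, oltp_cores - 4);
        bool consolidate = s.freshness_lag > 1.0;  // pay-as-you-go snapshots
        printf("OLTP %2d cores | OLAP %2d cores | consolidate=%s\n",
               oltp_cores, total_cores - oltp_cores, consolidate ? "yes" : "no");
    }
    return 0;
}
```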
Workload Isolation & Fresh Data Throughput
No extreme is good.
[Figure: design space from colocated to separated CPU-GPU execution, trading performance against data freshness. Colocated execution: OLAP and OLTP share devices, so interference ↔ performance. Isolated: independent execution gives isolation, but fresh data arrives only via ETL. Elastic-Compute and Hybrid-Access sit in between, differing in fresh-data access bandwidth. All four points rely on pre-determined resource assignment.]
Workload Isolation & Fresh Data
[Figure: the same design space (colocated, isolated, elastic-compute, hybrid-access, ETL), revisited under adaptive scheduling.]
Task placement & consolidation based on resource usage:
Real-time, adaptive scheduling of HTAP workloads
– Specialize to requirements and the amount of unconsolidated data
– Workload-based resource assignment
– Pay-as-you-go snapshot updates
Caldera: HTAP on CPU-GPU Servers [CIDR 2017]
– Store data in shared memory
– Run OLTP workloads on task-parallel processors
– Run OLAP workloads on data-parallel processors
– On-demand copy-on-write snapshots in shared memory (see the sketch below)

Adaptively scale resources with load.
[Figure: a task-parallel archipelago of CPU cores (OLTP) and a data-parallel archipelago including GPUs (OLAP), all attached to a DRAM-resident in-memory data store.]
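One classic way to get such snapshots, shown here as a POSIX fork()-based sketch (Linux-style, purely illustrative; not necessarily Caldera's exact mechanism): after fork(), the OS copy-on-writes only the pages the OLTP side modifies, so the analytics child reads a consistent, frozen view.

```cuda
#include <cstdio>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork-based copy-on-write snapshot. The parent keeps running transactions
// while the child sees the column frozen at the fork point; the OS copies
// only the pages that the parent actually touches afterwards.
int main() {
    const int n = 1024;
    int* col = (int*)mmap(nullptr, n * sizeof(int), PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    for (int i = 0; i < n; ++i) col[i] = 1;   // initial table state

    pid_t pid = fork();     // snapshot point: all pages become COW
    if (pid == 0) {         // child = OLAP: scans the frozen snapshot
        long long sum = 0;
        for (int i = 0; i < n; ++i) sum += col[i];
        printf("OLAP snapshot sum: %lld\n", sum);   // always n * 1
        _exit(0);
    }
    for (int i = 0; i < n; ++i) col[i] = 2;   // parent = OLTP keeps writing
    waitpid(pid, nullptr, 0);
    printf("OLTP state after updates: first value = %d\n", col[0]);
    return 0;
}
```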
GPU Accesses Fresh Data from CPU Memory [CIDR 2017, CIDR 2020, SIGMOD 2020]
– OLTP generates fresh data in CPU memory, with accesses protected by concurrency control
– OLAP needs to access that fresh data
– Goal: snapshot isolation for OLAP without concurrency-control overheads

[Figure: OLTP reads/writes main memory (DRAM) and storage; the OLAP GPU fetches fresh data from CPU memory over NVLink/PCIe. A sketch of this data path follows below.]
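A minimal CUDA sketch of the data path in the figure: the GPU reads CPU-resident data in place over NVLink/PCIe using zero-copy pinned memory. Snapshotting and concurrency control are elided here; a real system runs the kernel against a consistent snapshot version rather than racing with live writers.

```cuda
#include <cstdio>
#include <cstdint>

// OLAP kernel scanning an OLTP-owned, CPU-resident column in place.
__global__ void scan_sum(const int64_t* __restrict__ col, int n,
                         unsigned long long* sum) {
    long long local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local += col[i];                       // reads travel over the bus
    atomicAdd(sum, (unsigned long long)local);
}

int main() {
    const int n = 1 << 20;
    int64_t* host_col;      // CPU-resident, OLTP-owned column
    cudaHostAlloc((void**)&host_col, n * sizeof(int64_t), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) host_col[i] = 1;       // "fresh" data

    int64_t* dev_view;      // device pointer aliasing the same CPU memory
    cudaHostGetDevicePointer((void**)&dev_view, host_col, 0);

    unsigned long long* sum;
    cudaMallocManaged(&sum, sizeof(*sum));
    *sum = 0;
    scan_sum<<<128, 256>>>(dev_view, n, sum);
    cudaDeviceSynchronize();
    printf("OLAP scan over fresh CPU-resident data: %llu\n", *sum);
    cudaFreeHost(host_col);
    return 0;
}
```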
GPU Accesses Fresh Data from CPU Memory [CIDR 2017, CIDR 2020, SIGMOD 2020]
[Figure: the same data path as above, extended with the control path: the CPU issues the command schedule that drives the GPU's fetches of fresh data.]
Increasing workload complexity
Diverse modern data problems:
– IoT, OCR, ML, NLP, medical, mathematics, etc.
DBMS play catch-up for popular functionality:
– Human effort and long delays
– Oblivious to out-of-DBMS workflows
Vast resource of libraries:
– Authored by domain experts, used by everybody
– Loose library-to-data-source integration and optimization

We need systems that can "learn" new functionality.
The network looks like a single machine
– Similar intra-/inter-server interconnect bandwidth (100G InfiniBand)
– Local memories and NUMA effects across devices
– CPU vs. GPU: memory capacity vs. memory throughput

Heterogeneous, interconnected devices across the CPUs.