Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation



Page 1: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Sponsors: National Science Foundation, LogicBlox Inc., and NVIDIA

Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Haicheng Wu (1), Gregory Diamos (2), Srihari Cadambi (3), Sudhakar Yalamanchili (1)

(1) Georgia Institute of Technology
(2) NVIDIA Research
(3) NEC Laboratories America

Page 2: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Data Warehousing Applications on GPUs

The Opportunity
• Significant potential data parallelism
• If the data fits in GPU memory, speedups of 2x to 27x have been shown [1]

The Challenge
• Need to process 1-50 TB of data [2]
• 15-90% of the total time is spent moving data between CPU and GPU
• Fine-grained computation

[1] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS, 2009.

[2] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

Page 3: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Relational Algebra (RA) Operators

RA operators are the building blocks of DB applications

• Set Intersection
• Set Union
• Set Difference
• Cross Product
• Join
• Select
• Project

Example: Select [Key == 3]

Input:

Key | Value
3 | True, a
3 | False, b
4 | True, a

Output:

Key | Value
3 | True, a
3 | False, b
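The SELECT example can be sketched on the CPU in a few lines of Python. This is only an illustration of the operator's semantics, not the GPU implementation from the talk; the tuple layout and the `select` helper are assumptions made for this sketch.

```python
# Relation from the slide, modeled as (key, value) tuples (layout is an assumption).
relation = [(3, (True, "a")), (3, (False, "b")), (4, (True, "a"))]


def select(rel, predicate):
    """Keep only the tuples for which predicate(key, value) holds."""
    return [(key, value) for (key, value) in rel if predicate(key, value)]


# Select [Key == 3]
result = select(relation, lambda key, value: key == 3)
print(result)  # [(3, (True, 'a')), (3, (False, 'b'))]
```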

Page 4: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Relational Algebra (RA) Operators

RA operators are the building blocks of DB applications

• Set Intersection
• Set Union
• Set Difference
• Cross Product
• Join
• Select
• Project

Example: Join

A:

Key | Value
3 | a
3 | b
4 | a

B:

Key | Value
3 | c
4 | d
5 | e

JOIN (A, B):

Key | Value
3 | a, c
3 | b, c
4 | a, d

New Key = Key(A) ∩ Key(B)

New Value = Value(A) ∪ Value(B)
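The JOIN rule (keys present in both inputs, values combined) can be checked with a small CPU-side sketch. The hash-based `join` helper below is an assumption chosen for brevity; it reproduces the table from the slide but says nothing about how the GPU primitive is implemented.

```python
def join(a, b):
    """Pair every tuple of `a` with every tuple of `b` that shares its key."""
    values_by_key = {}
    for key, value in b:
        values_by_key.setdefault(key, []).append(value)
    # Keys that appear in only one input produce no output tuple.
    return [(key, (va, vb)) for key, va in a for vb in values_by_key.get(key, [])]


A = [(3, "a"), (3, "b"), (4, "a")]
B = [(3, "c"), (4, "d"), (5, "e")]
joined = join(A, B)
print(joined)  # [(3, ('a', 'c')), (3, ('b', 'c')), (4, ('a', 'd'))]
```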

Page 5: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Data Movement in Kernel Execution

[Figure: kernel execution in three steps, (1) Input, (2) Execute, (3) Result, annotated with ~250 GB/s and the labels M, N, T]

Page 6: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Fusion: A Data Movement Optimization

• Increase the granularity of kernel computation
• Reduce data movement throughout the hierarchy
• Inspired by loop fusion
• Compile-time automation
  • Input is an optimized query plan

Page 7: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Fusion

[Figure: Before fusion, Kernel A takes A1 and A2 and produces Temp, and Kernel B takes Temp and A3 and produces Result; A1, A2, A3, Temp, and Result all move between GPU memory and the GPU core. After fusion, a single fused kernel A & B takes A1, A2, and A3 and produces Result; Temp no longer appears in GPU memory.]
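The before/after picture can be made concrete by counting accesses to a toy model of global memory. Everything in this sketch is an assumption made for illustration: the `GlobalMemory` class, the choice of addition for Kernel A and multiplication for Kernel B, and the sequential loops standing in for GPU threads.

```python
class GlobalMemory:
    """Toy stand-in for GPU global memory that counts element reads and writes."""

    def __init__(self, **arrays):
        self.arrays = dict(arrays)
        self.reads = 0
        self.writes = 0

    def load(self, name, i):
        self.reads += 1
        return self.arrays[name][i]

    def store(self, name, i, value):
        self.writes += 1
        self.arrays.setdefault(name, {})[i] = value


def run_unfused(mem, n):
    # Kernel A: Temp = A1 + A2, written back to global memory.
    for i in range(n):
        mem.store("Temp", i, mem.load("A1", i) + mem.load("A2", i))
    # Kernel B: Result = Temp * A3, re-reading Temp from global memory.
    for i in range(n):
        mem.store("Result", i, mem.load("Temp", i) * mem.load("A3", i))


def run_fused(mem, n):
    # Fused kernel: the intermediate stays in a local variable (register stand-in).
    for i in range(n):
        temp = mem.load("A1", i) + mem.load("A2", i)
        mem.store("Result", i, temp * mem.load("A3", i))


n = 4
data = dict(A1=[1, 2, 3, 4], A2=[10, 20, 30, 40], A3=[2, 2, 2, 2])
unfused, fused = GlobalMemory(**data), GlobalMemory(**data)
run_unfused(unfused, n)
run_fused(fused, n)
print(unfused.reads + unfused.writes, fused.reads + fused.writes)  # 24 16
```

In this toy model the unfused version performs 6 accesses per element and the fused version 4, because Temp is never written to or read back from global memory.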

Page 8: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Major Benefits

Reduce data footprint
• Reduction in accesses to global memory
• Access to common data across kernels improves temporal locality
• Reduction in PCIe transfers

Expand the optimization scope of the compiler
• Data reuse
• Increase the textual scope of optimizers

[Figure: Kernel A (inputs A1, A2; output Temp) and Kernel B (inputs Temp, A3; output Result) next to fused kernel A, B (inputs A1, A2, A3; output Result)]

Page 9: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Red Fox Compilation Flow

[Figure: compilation flow with three stages and the data passed between them]
• Language front-end: LogicBlox front-end (input: Datalog queries; output: query plan)
• Translation layer: RA-to-PTX (nvcc + RA-Lib, the RA primitives library), with Kernel Weaver (output: PTX/binary kernel)
• Back-end: runtime

Kernel Weaver: CUDA source-to-source transformation that applies kernel fusion

PTX: Parallel Thread Execution

Page 10: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

RA Implementation: Multi-Stage Algorithms

[Figure: Example of SELECT]

* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP, 2013.

All primitives have the same three stages *

Each stage normally maps to 1 CUDA kernel
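A staged SELECT might look like the following CPU-side sketch. The stage names partition, compute, and gather are taken from the labels on a later slide, and reading them as "the three stages" is an assumption; the chunking scheme and the `multistage_select` helper are also hypothetical.

```python
def multistage_select(keys, predicate, num_partitions):
    # Stage 1, partition: split the input into contiguous chunks,
    # one per simulated CTA.
    chunk = max(1, -(-len(keys) // num_partitions))  # ceiling division
    partitions = [keys[i:i + chunk] for i in range(0, len(keys), chunk)]

    # Stage 2, compute: each partition filters its chunk into a private buffer.
    buffers = [[k for k in part if predicate(k)] for part in partitions]

    # Stage 3, gather: compute each buffer's output offset (a prefix sum),
    # then copy the buffers into one dense result array.
    offsets, total = [], 0
    for buf in buffers:
        offsets.append(total)
        total += len(buf)
    out = [None] * total
    for offset, buf in zip(offsets, buffers):
        out[offset:offset + len(buf)] = buf
    return out


selected = multistage_select([3, 3, 4, 5, 3, 7], lambda k: k == 3, num_partitions=3)
print(selected)  # [3, 3, 3]
```

On a GPU each stage would normally be its own kernel, which is why the intermediate buffers between stages are natural targets for fusion.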

Page 11: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Fusion – Three Steps

1. Opportunity: Find candidates meeting fusion criteria.

2. Feasibility: Choose kernels to fuse according to available resources.

3. Fusion: Fuse the chosen kernels.

Page 12: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Fusion Criteria (1)

Compatible kernel configurations (CTA & thread dimensions)
• Implementations of RA primitives are parametric
• Empirically choose configurations after fusion

[Figure: Kernel A with labels M1, N1, T1 and Kernel B with labels M2, N2, T2 become fused kernel A & B with labels M, N, T]

Page 13: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Fusion Criteria (2)

Dependence restriction
• Thread dependence

[Figure: Kernel A followed by Kernel B; input data have 2 attributes]

• Operations of each thread are independent
• Use registers to communicate

Page 14: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Fusion Criteria (2)

Dependence restriction
• Thread dependence
• CTA (thread block) dependence

[Figure: Kernel A followed by Kernel B]

Page 15: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Fusion Criteria: CTA Dependence

• Threads in the same CTA have dependence
• No dependence between CTAs
• Can be fused

After fusion
• Use shared memory to communicate
• Synchronization is needed

[Figure: Example of 2 back-to-back JOINs]

Page 16: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Fusion Criteria (2)

Dependence restriction
• Thread dependence (can be fused)
• CTA (thread block) dependence (can be fused)
• Kernel dependence

[Figure: Kernel A followed by Kernel B]

Page 17: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Fusion Criteria: Candidates for Fusion

• Only exhibit thread or CTA dependence
• Bounded by operators with kernel dependence
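One way to read these two criteria is as a scan over the query plan that cuts it at every operator with kernel dependence. The sketch below assumes a linear plan (real plans are graphs), assumes each operator already carries a dependence label, and uses SORT only as a hypothetical example of a kernel-dependent operator.

```python
THREAD, CTA, KERNEL = "thread", "cta", "kernel"


def fusion_candidates(plan):
    """Split a linear operator sequence into maximal runs without kernel dependence."""
    groups, current = [], []
    for name, dependence in plan:
        if dependence == KERNEL:
            # A kernel-dependent operator bounds the current candidate group.
            if len(current) > 1:
                groups.append(current)
            current = []
        else:
            current.append(name)
    if len(current) > 1:  # a single operator has nothing to fuse with
        groups.append(current)
    return groups


plan = [
    ("select1", THREAD), ("join1", CTA),
    ("sort1", KERNEL),   # hypothetical kernel-dependent operator
    ("project1", THREAD), ("select2", THREAD),
    ("sort2", KERNEL),
    ("join2", CTA),
]
candidates = fusion_candidates(plan)
print(candidates)  # [['select1', 'join1'], ['project1', 'select2']]
```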

Page 18: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Choosing Operators to Fuse

• Kernel fusion will increase resource usage, e.g., registers
• Greedy heuristic to choose

Dependence graph
1. Topologically sort the operators
2. Incrementally add operators
3. Stop when the estimated usage is larger than the budget
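The three-step heuristic can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The cost model here is an assumption: each operator gets a single hypothetical resource estimate and the estimates are simply summed, which is almost certainly cruder than the estimate used in the actual system.

```python
from graphlib import TopologicalSorter


def choose_fusion_group(predecessors, cost, budget):
    """Greedily add operators in topological order until the budget would be exceeded."""
    # Step 1: topological sort of the dependence graph.
    order = list(TopologicalSorter(predecessors).static_order())
    group, used = [], 0
    for op in order:
        # Step 3: stop when the estimated usage is larger than the budget.
        if used + cost[op] > budget:
            break
        # Step 2: incrementally add the operator.
        group.append(op)
        used += cost[op]
    return group, used


# Hypothetical chain select -> join -> project -> arith with made-up register estimates.
predecessors = {"join": {"select"}, "project": {"join"}, "arith": {"project"}}
cost = {"select": 22, "join": 47, "project": 11, "arith": 10}
group, used = choose_fusion_group(predecessors, cost, budget=75)
print(group, used)  # ['select', 'join'] 69
```

Because the scan stops at the first operator that does not fit, the operators left over can seed the next group; that continuation is omitted here.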

Page 19: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel Weaving and Fusion

• Interweaving and fusing individual stages (CUDA kernels)
• Use registers or shared memory to store temporary results

Page 20: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Fusing Thread Dependent Only Operators

• Unary operators only
• No synchronization required
• Register-based communication

[Figure: Example of fusing 2 SELECTs]
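Fusing two SELECTs can be illustrated on the CPU as follows. The row layout (two attributes per row), the predicates, and both helper functions are assumptions for this sketch; a local variable plays the role of the register that carries the intermediate result between the two operators.

```python
def select_unfused(rows, pred1, pred2):
    temp = [row for row in rows if pred1(row)]   # first SELECT materializes a temporary
    return [row for row in temp if pred2(row)]   # second SELECT re-reads it


def select_fused(rows, pred1, pred2):
    out = []
    for row in rows:            # one simulated thread per row
        keep = pred1(row)       # intermediate result kept in a local (register stand-in)
        if keep and pred2(row):
            out.append(row)
    return out


rows = [(3, 10), (3, 25), (4, 30)]  # (attribute 0, attribute 1)
first = lambda row: row[0] == 3
second = lambda row: row[1] > 20
print(select_fused(rows, first, second))  # [(3, 25)]
```

Each row is handled independently of every other row, so no synchronization is needed between the two fused predicates.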

Page 21: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Fusing CTA and Thread Dependent Operators

• Partition multiple inputs
• Synchronization necessary
• Communication via shared memory

[Figure: Example pattern with Partition, Compute, and Gather stages]
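A CPU-side sketch of this pattern, fusing a JOIN with a following SELECT, is given below. It is an illustration under several assumptions: integer keys, a modulo partitioning scheme chosen for simplicity, a Python list standing in for shared memory, and a comment where a GPU version would need a barrier such as CUDA's `__syncthreads()`.

```python
def fused_join_select(a, b, predicate, num_ctas):
    # Partition: route tuples of BOTH inputs by key so matching keys meet in one CTA.
    parts_a = [[] for _ in range(num_ctas)]
    parts_b = [[] for _ in range(num_ctas)]
    for key, value in a:
        parts_a[key % num_ctas].append((key, value))
    for key, value in b:
        parts_b[key % num_ctas].append((key, value))

    per_cta = []
    for part_a, part_b in zip(parts_a, parts_b):
        # Compute, step 1 (JOIN): results go to a CTA-local buffer
        # (shared memory stand-in).
        shared = [(ka, (va, vb)) for ka, va in part_a for kb, vb in part_b if ka == kb]
        # On a GPU, threads of the CTA would synchronize here before reading `shared`.
        # Compute, step 2 (SELECT): consume the buffer without touching global memory.
        per_cta.append([row for row in shared if predicate(row)])

    # Gather: concatenate the per-CTA results into one output.
    return [row for buf in per_cta for row in buf]


A = [(3, "a"), (3, "b"), (4, "a")]
B = [(3, "c"), (4, "d"), (5, "e")]
out = fused_join_select(A, B, lambda row: row[0] == 3, num_ctas=2)
print(sorted(out))  # [(3, ('a', 'c')), (3, ('b', 'c'))]
```

CTAs never read each other's buffers in this sketch, which mirrors the condition that there is no dependence between CTAs.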

Page 22: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Experimental Environment

CPU | 2 quad-core Xeon E5520 @ 2.27 GHz
Memory | 48 GB
GPU | 1 Tesla C2070 (6 GB GDDR5 memory)
OS | Ubuntu 10.04 Server
GCC | 4.4.3
NVCC | 4.0

• Use micro-benchmarks derived from TPC-H
  • Measure memory allocation, memory access demand, effect of optimization scope, and PCIe traffic
• Full queries from TPC-H

Page 23: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

TPC-H Benchmark Suites

• A popular decision support benchmark suite
• Micro-benchmarks are common patterns from TPC-H
• Baseline: directly using the primitive implementations without fusion
• Optimized: fusing all primitives of each pattern

Page 24: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Small Inputs: PCIe Excluded

[Figure: speedup, fused vs. not fused, for patterns (a) to (e)]

Pattern | Speedup
(a) | 7.89
(b) | 1.42
(c) | 1.58
(d) | 1.11
(e) | 2.45

Average 2.89x speedup

Small inputs (64 MB-1 GB) fitting the GPU memory
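The reported average is consistent with an arithmetic mean of the five per-pattern speedups, as this short check shows. The assignment of values to patterns (a) through (e) follows the order in which the data labels appear in the transcript, which is an assumption; the mean does not depend on it.

```python
# Per-pattern speedups as they appear on the chart.
speedups = {"a": 7.89, "b": 1.42, "c": 1.58, "d": 1.11, "e": 2.45}

average = sum(speedups.values()) / len(speedups)
print(f"{average:.2f}x")  # 2.89x
```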

Page 25: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Small Inputs: Analysis

Memory Allocation

[Figure: memory allocated for patterns (a) to (e), not fused vs. fused; axis labeled "Size (GB)". Data labels, in bytes: 4,429,185,024; 4,643,094,528; 4,257,218,560; 3,758,096,384; 4,697,620,480; 1,610,612,736; 3,133,145,088; 763,363,328; 4,697,620,480; 2,684,354,560]

Memory Access Reduction

Pattern | Reduction
(a) | 75.29%
(b) | 29.46%
(c) | 56.12%
(d) | 13.43%
(e) | 67.69%

Compiler Optimization (Speedup of O3)

[Figure: speedup of O3 for patterns (a) to (e), not fused vs. fused. Data labels: 1.08, 2.45, 2.31, 1.25, 1.00, 1.78, 2.80, 2.90, 1.37, 1.09]

Page 26: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Large Inputs: PCIe Included

[Figure: normalized time, split into PCIe and compute, for the not fused and fused versions of patterns (a) to (e)]

Average 2.22x speedup overall and 2.35x speedup in PCIe

Large inputs (1 GB-1.6 GB) fitting the GPU memory

Page 27: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Resource Usage & Occupancy

Individual primitive

Primitive | PTX Reg # | Shared MEM (Byte) | Occupancy (%)
PROJECT | 11 | 0 | 100
SELECT | 22 | 3848 | 88
JOIN | 47 | 13580 | 38
+/- | 10 | 0 | 100
Multiply | 13 | 0 | 100

After kernel fusion

Pattern | PTX Reg # | Shared MEM (Byte) | Occupancy (%)
(a) | 22 | 2308 | 88
(b) | 55 | 23560 | 33
(c) | 62 | 23048 | 17
(d) | 30 | 4612 | 67
(e) | 27 | 0 | 75

Kernel fusion may increase resource usage and thus decrease occupancy

These two do not negate the other benefits

Page 28: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Real Queries (scale factor = 1)

• TPC-H Q1: 1.25x speedup
• TPC-H Q21: 1.22x speedup

Page 29: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Extensions

Different domains
• Require a multi-stage algorithm
• Dependence classification still applies

Different representations
• PTX, OpenCL, LLVM

Different platforms
• CPU, GPU/CPU hybrid

Page 30: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Conclusions

• Kernel fusion can reduce data transfer and speed up computation for data warehousing applications
• Definition of basic dependences and general criteria for kernel fusion, applicable across multiple application domains
• Quantification of the impact of kernel fusion on different levels of the CPU-GPU memory hierarchy for a range of RA operators
• Proposal and demonstration of the utility of compile-time data movement optimizations based on kernel fusion

Page 31: Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation


Thank You

Questions?
