Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop...

31
Query Compilation of Dataflow Programs for Heterogeneous Platforms Felix Beier & Kai-Uwe Sattler DBIS@TU Ilmenau www.tu-ilmenau.de/dbis

Transcript of Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop...

Page 1: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Query Compilation of Dataflow Programs for Heterogeneous

Platforms

Felix Beier & Kai-Uwe SattlerDBIS@TU Ilmenauwww.tu-ilmenau.de/dbis

Page 2: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Outline1. Requirements of Engineering

Applications2. Dataflow Programming in PipeFlow3. Query Compilation4. Outlook

2

Page 3: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Motivation• Data-intensive engineering applications

– Improved sensing technology– Complex models– „Smart“ devices

• Relying on database technology?– Not really, but Matlab & friends

3

Page 4: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

• Find defects in electronic components• 3D surface measurement (x,y position, z measure)• Resolution: nm (100s MB - 10s TBs per scan)• Static data CAD circuit models (if available)

Nanopositioning Machine

(1) (2) (3)

4

Page 5: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Nanopositioning Machine

5

Page 6: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Camera Head

Stitching Rasterization

Normalization

Separation

Error Detection Error Model

Postprocessing External Info

Element Model

Nanopositioning Machine

6

Page 7: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

• Identification of neurophysiologically active areas• N EEG/MEG sensors, data rate: 600-5000 Hz• Static brain model with K - M vertices, brain atlas• Complex analysis pipeline: signal filtering,

decomposition, matrix inversion, time-based folding• Indexing for brain model & decomposition dictionary

Online Source Localization

(4) (4) (4)(5)

7

Page 8: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

• Blood simulation for medical filters• N particles of different kinds (blood, water, …)• Static scene with dynamic particles• KNN-queries with index-rebuild in each step

Particle Simulation

(6) (6)

8

Page 9: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

DB Technology Inside?• Database technology = providing abstractions

– Data manipulation (query language etc.)– Hardware (storage, storage hierarchies, …)– Scalability (indexing, parallel processing,

MapReduce)• Simplifies development

– Code reuse– Providing correctness and performance

guarantees

9

But, are these still the right abstractions?

Page 10: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

• Very large, spatiotemporal data sets• Large gaps between computer science and

engineering contexts– Domain-specific languages– Tooling ecosystem

• Reinventing the wheel– Data management algorithms– Specializations for parallel hardware– Optimizations

Lessons Learned

10

Page 11: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Requirements• Extensibility by user-defined operations

– Optimization, parallelization, recovery, …• Complex data types

– Spatio-temporal data, time series, matrices, …• Low latency, online processing

– Data stream processing, CEP• Dealing with uncertainty in measurements

and models

11

Page 12: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Dataflow Programming in PipeFlow

Big Dynamic Engineering Data

Page 13: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

• Model flow of data between transformation steps• Inject data management & processing primitives

– Partitioning & merging steps– Parallelization– Mapping to (specialized) hardware– Distributed workload management at cluster-scale– Flow optimizations– Fault tolerance

• Keeping framework usable for engineers– No new languages / paradigms– Reusable programs– Integration of domain-specific data types & libraries

Dataflow Programming

13

Page 14: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

QueryQ

uery

DSMS

Query

DBMS

Query

Source SinkOperator

Store & Process Continuous Query Processing

Publish-Subscribe Pattern

typed pipes

Programming Model

14

Page 15: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

embedded in application

standalone process

distributed processesin clusters

Dataflowspecification

C++ code generation

PipeFlowCompiler

graph checking and rewriting

PipeFlow

15

Page 16: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

�I(t)

�P (t)

�O(t)

�S(t)

λ

τi τp

... ...

Tfuture tuples processed tuples

Input Queue Operator States Output QueueT T

• Encapsulated functionality– Computation

(possibly stateful)– Implementation in domain-

specific languages– Specialization for hardware

platforms (CPU/GPU)– Wrapper for library functions

• Meta-information– Typed input, output and

parameter channels– State handling– Operator location– Profiling information

Operator Model

16

Page 17: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

PipeFlow: Overview• Dataflow specification language inspired by Pig• Specification = sequence of operators connected by

typed „pipes“• Large set of predefined operators (sources, joins,

aggregation, windows, CEP, …)

17

$pipe1 := operator1(…) params;$pipe2 := operator2(…) params;$out := operator3($pipe1, $pipe2, …) params;

• Supported by the PipeFabric engine: C++ library of operator templates & utility functions

Page 18: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Query Compilation

Page 19: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

PipeFlow: Operators & Expressions

• Type-specific template instantiation for operators

• Expressions are compiled into native code

19

typedef Tuple<int, double, MatrixXd> MyTuple;auto op = new Filter<MyTuple>([&](MyTuplePtr tp) { return std::get<0>(*tp) > 10; });

$o := filter($in) by i1 > 10;

Page 20: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Dataflow Parallelization• Providing abstraction for data parallelism

• Partitioning of input data stream: tuple-wise, batch-wise, column-wise

• Execution environment: threads for multi-core CPU, threads for GPU, distributed processes for compute cluster

• Result merging• Supporting user-defined operators!

20

Split Merge

Op1

Opn

Semantics

Page 21: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Parallelization in PipeFlow• Make parallelization explicit but hide the

implementation details by a parallelize operator

21

define calc_statistics ($in) returns $out { $x := myOp($in); $out := mySecondOp($x) ...;};

$res := parallelize($in) on slice(x) do calc_statistics using (mode = "thread", partitions = 10);

Page 22: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Slice, Split & Merge• Physical algebra operators

• Slice := split a single tuple or tuple value into multiple instances, i.e. vertical partitioning, vector/matrix decomposition

• Scatter:= route tuples to subqueries based on PartitionID• Gather:= collect partial results from parallel subqueries• Merge:= combine partial results, i.e. merge streams, join

tuple components or even values, final aggregation

22

Scatter Gather

Op1

Opn

…Slice

ExchangeExchange

Merge

SliceFunc ScatterFn MergeFnGatherFn

Page 23: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

GPU Processing

23

• GPU: vector processor attached to host system

• SIMD / SIMT operations• Input copy -> compute -> output copy

Host SystemPCIe Bus

(7)

Page 24: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

GPU Processing Example

24

+*

V1

V2

c

R

• Vector addition and scalar multiplication• R = (V1 + V2) * c• Generate parallel GPU Kernel• Parallelize for multiple GPUs

Page 25: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

GPU Processing Example

25

• Fine granular parallelism: SIMD• Vectorize on element index• Mapping to thread ID

template< typename E >__global__ void gpuVecAddMul(c,v1,v2,scatter,gather){ int t = calculateThreadID(); // consider grid E e1 = scatter(v1,t); // read v1[t] E e2 = scatter(v2,t); // read v2[t] E r = c * (e1 + e2); gather(r,t); // write to result[t]};

Page 26: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

GPU Processing Example

26

• Coarse granular parallelism: multi-GPU• Partition input vectors• Mapping (disjoint) partitions to each GPU• Collect & merge (partial) results

template< typename E >void multiGpuVecAddMul(c,v1,v2,slice,scatter,proc,gather,merge){ slices = slice(c,v1,v2); // partition input vectors scatter(slices,gpus); // host-to-device copy partitions results = in_parallel_do(gpuVecAddMul(...)); // launch kernels collected = gather(results); // device-to-host copy merge(collected); // combine in result vector and publish};

Page 27: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Query Compilation: Rewriting

27

datatype operation slice scatter gather merge

atomic filter, projection, ..

stream hash, key

union -

atomic aggregate stream hash, key

union post-aggregation

vector,matrix

+, scalar mult.

1-dim decomposition

slice-id union compostion

matrix advanced decomposition slice-id union problem-specific

• Option 1: determine functions based on datatype + operation + X? automatically

• Option 2: user-provided functions

Page 28: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Outlook

Page 29: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

• Modules for domain-specific types– First-class types, e.g., events, signals, images, matrices, tensors,

graphs, …– Library integration, e.g., OpenCV, Pregel, …– Modeling uncertainty

• Aspect-orientation for injecting data management routines– Functional models as monads, arrows– Parallelization primitives – Automatic & manual partitioning– Elastic scaling

• Multi-level optimizations– Rule-based graph optimizations– Domain-specific optimization rules– Machine-specific optimizations for hardware

What’s next?

29

Page 30: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

Discussion

30

Page 31: Query Compilation of Dataflow Programs for Heterogeneous Platforms … · 2014. 10. 7. · Workshop on System Software Support for Big Data 25.09.2014 Felix Beier / 29 Query Compilation

Workshop on System Software

Support for Big Data / 2925.09.2014 Felix Beier

Query Compilation of Dataflow Programs for Heterogeneous Platforms

References(1) http://www.tu-ilmenau.de/en/institute-of-process-

measurement-and-sensor-technology/research/nanopositionier-und-nanomesstechnik/

(2) http://www.tu-ilmenau.de/cc-npmm/projekte/(3) http://www.itwissen.info/definition/lexikon/Chip-chip.html(4) Dinh, C.; Rühle, J.; Bollmann, S.; Haueisen, J.; Güllmar,

D.: A GPU-accelerated Performance Optimized RAP-MUSIC Algorithm for Real-Time Source Localization. In Biomedizinische Technik (Berl.), 2012, 57

(5) http://people.ee.ethz.ch/~cattin/MIA-ETH/02-IntensityBasedRegistration-media/figs/labelled-brain.png

(6) M. Färber. Molecular dynamics simulation, 2013.(7) http://www.nvidia.de/object/tesla_c1060_de.html

31