
Ohio State University Department of Computer Science and Engineering

Data-Centric Transformations on Non-Integer Iteration Spaces

Swarup Kumar Sahoo

Gagan Agrawal

The Ohio State University


Roadmap

• Motivation
• Background
• System Overview
• XQuery, Low and High Level schema and Mapping schema
• Compiler Analysis and Algorithm
• Parallelization
• Experiment
• Summary and Future Work


Motivation

• Declarative and application-specific languages
  – Use high-level abstractions
  – Simplify development of applications
• Use of restructuring transformations
  – Difficult due to these abstractions
• Goal: apply data-centric transformations
  – On integer and non-integer based iteration spaces, while providing high-level abstractions/a virtual view of the underlying datasets.


Background

• Data-centric transformation:
  – Input data is brought into memory/cache in chunks or shackles, and then the program fragments or loop iterations that access this data are executed.
  – Helps improve data locality.
• Integer based iteration space
  – The loop takes integer values with a constant step size between a lower and an upper bound.
• Non-integer based iteration space
  – The loop takes values from a sequence or set of real numbers, strings, or any other data type.
  – Easily expressible in declarative languages


Example: Data-centric transformation

for i := 1 to 3
{
    Count the number of occurrences of i in a list of digits
}


Naïve Strategy

[Figure: the dataset is scanned once for each value of i, updating a single counter that produces the output. Requires 3 scans.]


Data Centric Strategy

[Figure: the dataset is scanned once, updating three counters (Counter1, Counter2, Counter3) that produce the output. Requires just one scan.]
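The two strategies for the digit-counting example can be sketched in Python as follows. This is a minimal illustration, not the code generated by the compiler described in these slides; the function names and the input list are hypothetical.

```python
from typing import Dict, List


def count_naive(digits: List[int], values: range) -> Dict[int, int]:
    """Iteration-centric version: one full scan of the data per loop value."""
    counts = {}
    for i in values:              # outer loop over the iteration space
        counter = 0
        for d in digits:          # full scan of the dataset for this i
            if d == i:
                counter += 1
        counts[i] = counter
    return counts


def count_data_centric(digits: List[int], values: range) -> Dict[int, int]:
    """Data-centric version: one scan; each element updates its own counter."""
    counts = {i: 0 for i in values}   # one counter per iteration
    for d in digits:                  # single scan of the dataset
        if d in counts:               # does any iteration access this element?
            counts[d] += 1
    return counts


# Hypothetical input list, not the dataset drawn on the slide.
digits = [2, 2, 4, 1, 1, 1, 5, 3, 3, 6]
naive_result = count_naive(digits, range(1, 4))
data_centric_result = count_data_centric(digits, range(1, 4))
```

Both functions return the same counts; the difference is that the naïve version reads the list three times while the data-centric version reads it once.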


Example: Data-centric transformation with non-integer iteration space

for each distinct color (green, blue, pink)
{
    Count the number of elements with that color
}


Naïve Strategy

[Figure: the dataset is scanned once for each color, updating a single counter that produces the output. Requires 3 scans.]


Data Centric Strategy

[Figure: the dataset is scanned once; a mapping directs each element to one of three counters (Counter1, Counter2, Counter3) that produce the output. Requires just one scan.]


Previous work and Contributions

• Related Work
  – Data-centric multilevel blocking (Pingali et al., PLDI 1997)
  – Sparse matrix code synthesis from high-level specifications (Pingali et al., SC 2000)
  – Supporting XML-based high-level abstractions on flat-file datasets (LCPC 2003, XIME-P 2004)
• Contributions of this paper
  – An improved data-centric transformation algorithm that works on both integer and non-integer based iteration spaces.
  – Handling of out-of-core computations involving multi-dimensional datasets, without limiting the organization of low-level datasets.
  – Automatic parallelization of the considered class of applications.


System Overview

[Figure: system architecture. Components shown: XQuery Source Code, High-level XML Schema, Mapping Schema, Low-level XML Schema, Compiler, Mapping Service, Low-level Library, Dataset, Cluster with Disk.]


XQuery and XML Schemas

• High-level declarative languages ease application development
  – XQuery is a high-level language for processing XML datasets
  – Derived from database, declarative, and functional languages!
• High-level schema
  – XML is used to provide a virtual view of the dataset
• Low-level schema
  – Reflects the actual physical layout.
• Mapping schema:
  – Describes the mapping between each element of the high-level schema and the low-level schema


Oil Reservoir Simulation

• Support cost-effective oil production
• Simulations on a 3-D grid
• 17 variables and cell locations in the 3-D grid at each time step
• Computation of bypassed regions
  – Expression to determine if a cell is bypassed for a time step
  – Within a spatial region and range of time steps
  – Grid cells that are bypassed for every time step in the range

[Figure: Oil Reservoir management]


High-Level Schema

<xs:element name="data" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence (unique x,y,z,t)>
      <xs:element name="x" type="xs:integer"/>
      <xs:element name="y" type="xs:integer"/>
      <xs:element name="z" type="xs:integer"/>
      <xs:element name="time" type="xs:integer"/>
      <xs:element name="velocity" type="xs:float"/>
      <xs:element name="mom" type="xs:float"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>


High-Level XQuery Code Of Oil Reservoir management

unordered(
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  for $k in ($z1 to $z2)
  let $p := document("OilRes.xml")/data
  where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
    and ($p/time >= $tmin) and ($p/time <= $tmax)
  return
    <info>
      <coord> {$i, $j, $k} </coord>
      <summary> { analyze($p) } </summary>
    </info>
)


Low-Level Schema

<file name="info">
  <sequence>
    <group name="data">
      <attribute name="time">
        <datatype> integer </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [1] </dimension>
        </dataspace>
      </attribute>
      <dataset name="velocity">
        <datatype> float </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [x] </dimension>
        </dataspace>
      </dataset>
      ..............
    </group>
  </sequence>
</file>


Mapping Schema

//high/data/velocity → //low/info/data/velocity
//high/data/time → //low/info/data/time
//high/data/mom → //low/info/data/mom [index(//low/info/data/velocity, 1)]
//high/data/x → //low/coord/x [index(//low/info/data/velocity, 1)]


Modified Oil Reservoir management with non-integer iteration space

let $src := document("Oil.xml")//data/x,y,z
let $coord := distinct-values($src)
unordered(
  for $C in $coord
  let $p := document("OilRes.xml")/data
  where ($p/x = $C/x) and ($p/y = $C/y) and ($p/z = $C/z)
    and ($p/time >= $tmin) and ($p/time <= $tmax)
  return
    <info>
      <coord> {$C/x, $C/y, $C/z} </coord>
      <summary> { analyze($p) } </summary>
    </info>
)


Basic steps in our Data Centric Transformation algorithm

• Mapping function T: Iteration space → High-level data
• Mapping function C: High-level data → Low-level data
• Mapping function M = C · T: Iteration space → Low-level data
• Our goal is to compute M⁻¹ and use the following steps
  – Iterate over each data element in actual storage.
  – Find the iterations of the original loop in which it is accessed, using M⁻¹.
  – Access the required elements of other datasets.
  – Execute the computation corresponding to those iterations.
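As a small illustration of these steps, the sketch below composes T and C into M for a toy 2-D grid and then drives the computation from storage order through the inverse of M. The grid extents, the row-major layout, and the doubling "computation" are all assumptions made for the example; they are not taken from the system described in the slides.

```python
NX, NY = 3, 4  # hypothetical extents of a small 2-D grid


def T(i, j):
    """Iteration (i, j) -> high-level element (x, y)."""
    return (i, j)


def C(x, y):
    """High-level element (x, y) -> low-level offset (row-major layout assumed)."""
    return x * NY + y


def M(i, j):
    """Composition C . T: iteration -> low-level offset."""
    return C(*T(i, j))


def M_inv(offset):
    """Inverse of M: low-level offset -> iteration (i, j)."""
    return divmod(offset, NY)


# Low-level storage: one float per grid cell, laid out in row-major order.
storage = [float(k) for k in range(NX * NY)]

output = {}
for offset, value in enumerate(storage):   # iterate over data in storage order
    i, j = M_inv(offset)                   # which iteration accesses this element?
    output[(i, j)] = value * 2.0           # stand-in for the loop body
```

Because M is invertible here, every stored element maps back to exactly one iteration, so the loop body runs once per iteration while the data is read sequentially.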


Handling non-integer based iteration space with hash-table

• Abstract integer iteration space:
  – Based on the unique sequence number of each element in the actual iteration space.
  – One-to-one correspondence between the actual and the abstract iteration space
    » A hash table can be used to create this mapping
    » The sequence number in the hash table indicates the iteration instance in the abstract iteration space
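A minimal Python sketch of this mapping, using a dictionary as the hash table. The function name and the color values are hypothetical; the point is only that each distinct element of the actual iteration space receives a unique integer sequence number.

```python
def build_abstract_space(actual_iterations):
    """Map each distinct element of the actual iteration space to a
    sequence number, in order of first appearance."""
    seq_of = {}
    for item in actual_iterations:
        if item not in seq_of:
            seq_of[item] = len(seq_of)   # next unused sequence number
    return seq_of


# Hypothetical non-integer iteration space with repeated values.
colors = ["green", "blue", "green", "pink", "blue"]
seq_of = build_abstract_space(colors)

# The abstract integer iteration space is 0 .. len(seq_of) - 1.
abstract_space = range(len(seq_of))
```

The dictionary is one-to-one between distinct values and sequence numbers, so a loop over `abstract_space` visits each actual iteration exactly once.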


Template for Generated Code using hash table

Generated_Query {
    Go through the datasets and create a list of tuples, each denoting an iteration
    Foreach i in the list of tuples {
        apply hash function on i
        If i is not present in hash table, enter i into hash table and store its sequence number and the corresponding output element
    }

    For k = 1, …, NO_OF_CHUNKS {
        Read kth chunk of dataset S1 using HDF5 functions.
        Foreach of the other datasets S2, … , Sn
            access the required chunk of the dataset.
        Foreach data element in the chunks of data {
            compute the iteration instance i.
            apply the hash function and determine the corresponding output element.
            apply the reduction computation and update the output.
        }
    }
}
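The template can be mimicked in a few lines of Python. This is a sketch under simplifying assumptions: in-memory lists stand in for the HDF5 datasets, there are only two datasets (`keys` playing the role of S1 and `values` the role of S2), and the reduction is a sum. All names are hypothetical.

```python
def chunk_starts(length, chunk_size):
    """Yield the start offset of each chunk."""
    return range(0, length, chunk_size)


def generated_query(keys, values, chunk_size):
    # Phase 1: build the hash table of iterations and allocate one
    # output element per distinct iteration.
    seq_of = {}
    output = []
    for key in keys:
        if key not in seq_of:
            seq_of[key] = len(output)   # sequence number of this iteration
            output.append(0.0)          # corresponding output element

    # Phase 2: go through the data chunk by chunk.
    for start in chunk_starts(len(keys), chunk_size):
        key_chunk = keys[start:start + chunk_size]        # chunk of S1
        value_chunk = values[start:start + chunk_size]    # matching chunk of S2
        for key, value in zip(key_chunk, value_chunk):
            slot = seq_of[key]          # iteration instance -> output element
            output[slot] += value       # reduction: accumulate into the output
    return seq_of, output


# Hypothetical data: (x, y) coordinates paired with a measured value.
keys = [(0, 0), (0, 1), (0, 0), (1, 1), (0, 1)]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
seq_of, output = generated_query(keys, values, chunk_size=2)
```

Each chunk is read once, and the hash lookup replaces the search for "which iteration needs this element".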


Handling non-integer based iteration space without hash-table

• Find out the two choices required for construction of the actual iteration space
• Determine the procedure to construct the actual iteration space
• From the high-level schema, select the attributes forming a unique set of tuples (V).
• Consider the set of attributes forming the iteration space as P.
• If P is not a subset of V, we use a hash table.
• Else if P = V, the transformation is done without a hash table.
• Else if P is a proper subset of V, the choice depends on the presence of duplicate tuples.
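The three-way rule on P and V can be written as a small decision function. This is only a sketch of the rule as stated in the bullets: the function name and return strings are hypothetical, and the proper-subset case is reported as undecided because the slides do not say which way the presence of duplicates pushes the choice.

```python
def choose_strategy(P, V):
    """Decide how to handle the iteration space.

    P: attributes forming the iteration space.
    V: attributes forming a unique set of tuples in the high-level schema.
    """
    P, V = set(P), set(V)
    if not P <= V:
        return "hash table"              # P is not a subset of V
    if P == V:
        return "no hash table"           # P equals V
    return "depends on duplicate tuples" # P is a proper subset of V


# V taken from the high-level schema's (unique x,y,z,t) annotation;
# P taken from the modified query, which iterates over distinct (x, y, z).
V = {"x", "y", "z", "time"}
modified_query_choice = choose_strategy({"x", "y", "z"}, V)
```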


Parallelization

• Two obvious ways to parallelize
  – Parallelize the for loop that goes through the different chunks
  – Parallelize the for loop that goes through the data in each chunk
• Choose the method depending on the number of chunks and the chunk size.
• A reduction operation is required to combine values from different processors.
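The first option (parallelizing the loop over chunks) followed by a reduction can be sketched as below. Threads from Python's standard library stand in for the processors of a cluster; this is an assumption made to keep the example self-contained and does not reflect how the system in these slides distributes work. Function names and data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk):
    """Local computation on one chunk: a partial count per value."""
    return Counter(chunk)


def parallel_count(data, chunk_size, workers):
    # Split the dataset into chunks; each chunk is processed independently.
    chunks = [data[s:s + chunk_size] for s in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))

    # Reduction step: combine the partial results from all workers.
    total = Counter()
    for partial in partials:
        total.update(partial)   # Counter.update adds counts together
    return total


data = ["green", "blue", "green", "pink", "blue", "green"]
total = parallel_count(data, chunk_size=2, workers=2)
```

Because the per-chunk results are combined with an associative and commutative operation (addition), the order in which workers finish does not affect the final output.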


Experimental test bed

• HDF5 version 1.6.3 (uses MPI-I/O for parallel I/O)
• Sequential experiments: 700 MHz PIII machine, 1 GB memory, Linux version 2.4.18
• Parallel experiments: Itanium 2 cluster with dual 1.3 GHz Itanium 2 processor nodes, 4 GB RAM, 80 GB hard drive
• Four applications
  – Transaction database analysis
  – Original Oil reservoir simulation
  – Modified Oil reservoir simulation
  – Virtual microscope


Experimental result

                              Virtual Microscope   Oil Reservoir Simulation   Modified Oil Reservoir Simulation   Transaction Database Analysis
With DCT without hash table   1.32                 2.64                       2.08                                -
With DCT using hash table     -                    -                          2.97                                7.57
Without DCT                   10.65                27.13                      23.69                               96.11

Execution time (sec.) using different versions of transformation algorithm
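The speedup of each data-centric version over the untransformed version follows directly from the execution times reported above. The short script below does that arithmetic; the dictionary keys are hypothetical labels, and the numbers are copied from the reported results.

```python
# Execution times in seconds, copied from the reported results.
# None marks a configuration for which no time was reported.
TIMES = {
    "Virtual Microscope": {"dct_no_hash": 1.32, "dct_hash": None, "no_dct": 10.65},
    "Oil Reservoir Simulation": {"dct_no_hash": 2.64, "dct_hash": None, "no_dct": 27.13},
    "Modified Oil Reservoir Simulation": {"dct_no_hash": 2.08, "dct_hash": 2.97, "no_dct": 23.69},
    "Transaction Database Analysis": {"dct_no_hash": None, "dct_hash": 7.57, "no_dct": 96.11},
}


def speedups(times):
    """Speedup of each reported DCT version relative to running without DCT."""
    result = {}
    for app, row in times.items():
        for version in ("dct_no_hash", "dct_hash"):
            if row[version] is not None:
                result[(app, version)] = row["no_dct"] / row[version]
    return result


SPEEDUPS = speedups(TIMES)
for (app, version), factor in SPEEDUPS.items():
    print(f"{app:35s} {version:12s} {factor:6.2f}x")
```

With these numbers, the data-centric versions run roughly 8 to 12.7 times faster than the untransformed versions, and for the Modified Oil Reservoir Simulation the version without a hash table (2.08 s) is faster than the one with a hash table (2.97 s).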


Experimental result

[Figure: Parallel Performance of Virtual Microscope. x-axis: Number of Processors (1, 2, 4, 8); y-axis: Execution Time (sec), 0 to 100.]

[Figure: Parallel Performance of Oil Reservoir Simulation. x-axis: Number of Processors (1, 2, 4, 8); y-axis: Execution Time (sec), 0 to 70.]


Experimental result

[Figure: Parallel Performance of Modified Oil Reservoir Simulation. x-axis: Number of Processors (1, 2, 4, 8); y-axis: Execution Time (sec), 0 to 40.]

[Figure: Parallel Performance of Transaction Database Analysis. x-axis: Number of Processors (1, 2, 4, 8); y-axis: Execution Time (sec), 0 to 70.]


Summary

• Compiler techniques
  – Perform data-centric transformations automatically on integer and non-integer based iteration spaces.
  – A more efficient method, without using a hash table, for data-centric transformation on non-integer based iteration spaces.
  – Support high-level abstractions on complex low-level data formats.
  – Parallelization of the considered class of queries.
• Future Work
  – Experimental results on more applications.
  – Compare performance with manual implementations.
  – Formalize the mapping schema.
  – Extend applicability of the algorithm to a more general class of queries.
