Categorical Skylines for Streaming Data

Categorical Skylines for Streaming Categorical Skylines for Streaming DataData

Center for E-Business TechnologySeoul National University

Seoul, Korea

Nam, Kwang-hyun

Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea

Nikos Sarkas, Gautam Das, Nick Koudas and Anthony K. H. Tunge

Univ. of Toronto, Univ. of Texas at Arlington,and National Univ. of Singapore

SIGMOD 2008

Copyright 2008 by CEBT

ContentsContents

Introduction

Motivation

Background

Efficient skyline maintenance for categorical tuples

Experimental Evaluation

Conclusions

IDS Lab Seminar (User Log Analysis) - 2


IntroductionIntroduction

Current situation

It has become increasingly difficult to process abundant data in order to isolate useful and relevant information

To facilitate the exploration of a data space

One such successful tool is the skyline query

The skyline of a data set

– The subset of tuples which are not dominated on all of their attributes by any other tuple

– The tuples that have a uniquely interesting combination of attribute values that no other tuple can match.



MotivationMotivation

Recent work

Considered the on-demand evaluation of skyline queries on tuples with partially ordered categorical attributes

But in an offline environment

– Inappropriate for a highly dynamic data streaming environment

Objective

To identify and study the problem of maintaining the skyline of streaming data with partially ordered, categorical attributes

To realize two novel techniques

– Constitute the building blocks of STARS (STreaming ARrangement Skyline) to solve the problem.

Database Tuning - 4


BackgroundBackground

A limited-capacity, sliding-window buffer B in memory

To only store the n most recent tuples from the stream

The contents of the buffer constitute a data set denoted by D

The data set is comprised of n tuples with d categorical attributes

The domain Domi of each the attributes is partially ordered

Partial order (an antisymmetric preorder)

A binary relation"≤" over a set P which is reflexive, antisymmetric, and transitive, i.e., for all a, b, and c in P, we have that:

– a ≤ a (reflexivity)

– if a ≤ b and b ≤ a then a = b (antisymmetry)

– if a ≤ b and b ≤ c then a ≤ c (transitivity)

Poset (Partially ordered set)

– A set with a partial order

– Commonly represented as DAG (directed acyclic graph)

Database Tuning - 5



Example 1

Value a and b -> ‘Comparable and a dominates b’

Value b and c -> ‘Incomparable’

Assume tuples comprised of d categorical attributes

Tuple t1 dominates tuple t2 ( t2 < t1 or t1 > t2 )

– If for every attribute Xi, t2.Xi ≤i t1.Xi and there is at least one

attribute Xj such that t2.Xj < t1.Xj

Tuples t1 and t2 do not dominate one another

– Call this situation ‘tied’ ( t1 ~ t2)

– Then, the skyline of data set D is the subset of all tuples that are not dominated by any other tuple

Database Tuning - 6

• Vertex : Domain value• Directed edge : For each pair of comparable values• Vertex : Domain value• Directed edge : For each pair of comparable values



Example 2

Consider three tuples t1, t2 and t3 with two categorical attributes

The domain of both attributes is the poset of Figure 1

Assume t1 = (a,b), t2 = (b,d), and t3 = (b,c)

Skyline maintenance solution

Checking whether a tuple a dominated by the current skyline

Retrieving the tuples in the buffer that are dominated by the outgoing skyline tuple

– Since only these tuples are candidates for entering the skyline

Lemma 1

Let t1,t2∈B be two tuples so that t1 < t2. Then, if t2 arrived after

t1 in the stream, t1 will never be in the skyline of B

Database Tuning - 7

i) t2 < t1 ∵ b < a and d < bii) t2 < t3 ∵ b = b and d < ciii) t1 ~ t3 ∵ b < a, but b ~ c∴ The skyline of this data set consists of tuples t1 and t3.

i) t2 < t1 ∵ b < a and d < bii) t2 < t3 ∵ b = b and d < ciii) t1 ~ t3 ∵ b < a, but b ~ c∴ The skyline of this data set consists of tuples t1 and t3.


Efficient skyline maintenance for categorical Efficient skyline maintenance for categorical tuplestuples

Two novel techniques for realizing the building blocks of the skyline maintenance frame work

Indexing the skybuffer (the relevant part of the buffer for skyline)

– To enable efficiently identify the skybuffer tuples dominated by a query tuple

Organizing the skyline

– To enable rapidly answer whether a query tuple is dominated by the skyline

Database Tuning - 8


Topological sortingTopological sorting

Topological sort

A numbering of the vertices of a DAG

– Every edge from a vertex numbered i to a vertex numbered j satisfies i < j

If for two values a, b, a > b, then a will appear before b.

– For a given value x, all values dominated by x will appear after it in the linear order, while x can never dominate any value that precedes it in the ordering

Definition 1

Let v be a value of a partially-ordered domain. We denote by r(v) the integer corresponding to v’s position in a ceratin topological sort of the domain

Database Tuning - 9


Topological sortingTopological sorting

Lemma 2

Let v1, … , vm be the m values of a partially-ordered domain

Dom. Then vi > vj only if r(vi) < (vj)

Lemma 3

Let t1, t2 be two tuples with d partially-ordered categorical

attributes X1, …, Xd. Then t1 > t2 only if r(t1.Xi) < r(t2.Xi),

1≤i≤d

Lemma 4

Let t1, t2 be two tuples with d partially-ordered categorical

attributes X1, … Xd. If ∃i, j such that r(t1.Xi) < r(t2.Xi) and

r(t1.Xj) > r(t2.Xj), then t1 ~ t2

Database Tuning - 10


Organizing the skybuffer tuples in a Organizing the skybuffer tuples in a gridgrid

One of goals is indexing the skybuffer

To efficiently insert and delete tuples, as well as identify the tuples dominated by a query tuple

Build skybuffer indexing solution around a simple grid

Example 3

Consider a set of tuples with two categorical attributes, the domain of both attributes being the poset of Figures 2. Suppose that the poset values have been mapped to integers according to topological sort (i) in Figure 2. Then, we create the grid by using this topological sort as the grid scales and place the skybuffer tuples in its cell.

Each grid cell corresponds to a unique combination of attribute values and only contains tuples with these exact values

– The cell corresponding to tuples with attribute values (d, e) has been marked with an “x”.



Improving the skybuffer Improving the skybuffer organizationorganization

1. Visiting only relevant cells

One of advantage of the grid-based index

Directly identify and process precisely the cells that contain dominated tuples (directly visit the cells marked with •)

Lemma 5 (Focused Search)

Let t be a tuple with d partially-ordered categorical attributes X1

∈ Domi, …, Xd ∈ Domd. Let dom(t.Xi) be the values in Domi such

that t.Xi ≥ v, v ∈ Domi. Then, t dominates a tuple s if and only if

s ∈ dom(t.X1) x … x dom(t.Xd)

Example

– The query tuple is (d,e) in figure 3. The domain values dominated by d are {d,g}. The domain values dominated by e are {e,g,h}. Then only tuples with values in {d,g} x {e,g,h} are dominated by {d,e}.




2. Controlling the grid granularity

A problem with the grid-based index

The lack of control over granularity of the grid

Example

– If the tuples have 4 attributes and each domain size is about 500, then the grid would be comprised of 62.5 billion cells

Definition 2

Consider a DAG and its vertices. A vertex is a source if it has no incoming edges. Then, the depth of a vertex in a DAG is the length of the longest path from a source to the vertex.

Lemma 6

Let v1, …, vm be the m values of a partially-ordered domain Dom.

Then vi > vj only if depth(vi) < depth(vj)




Example 4

Consider a data set consisting of tuples with 2 partially-ordered categorical attributes, the domain of both of them being the poset of Figure 4(a). Then, we can group domain values that lie on the same depth and create the grid of Figure 4(b). Every cell would contain tuples with the same attribute values. Now, tuples with corresponding attributes that lie at the same depth level are placed in the same cell.

– tuple (d,e) lies in the cell marked with “x” in Figure 4(b), along with other tuples with values in {c,d,e} x {c,d,e}.

Problem

The cells can contain tuples both dominated and not dominated by the query tuple

– A dominance check is required to identify the dominated tuples




3. Poset partitioning

Example 5

Suppose that our data set contains tuples with two partially-ordered categorical attributes, the domain of both being the poset of Figure 5(a). The values of the domain have been grouped. If we order the groups in ascending order of their depth value, breaking ties arbitrarily, we propose a valid topological sort for the poset

Figure 5(b) presents the resulting grid by using as scales the grouping of Figure 5(a)

The cells that contain candidate are marked with •

– This allows us to use focused search in order to directly access relevant cells.


All cells that can potentially contain tuples that are dominated by (d,e) are locatedin the rectangular area.

All cells that can potentially contain tuples that are dominated by (d,e) are locatedin the rectangular area.


Experimental EvaluationExperimental Evaluation

SDC (Stratification by Dominance Classification) Introduced in Stratified computation of skylines with partially-

ordered domains written by Chan et al. in 2005.

Improves over BBS+ in terms of both progressiveness and speed

Suggested in the numerical skyline maintenance solution in offline environment

Performance evaluation of STARS and SDC Goal

– To identify the impact of the buffer size, data dimensionality and domain structure on performance

Measure

– The average time required to process a buffer update

– i.e., A combined tuple arrival and expiration




Performance on synthetic data

STARS outperforms SDC by an order of magnitude

The time required by STARS is in the order of a millisecond or less Rendering its use in real life applications entirely realistic


Effect of buffer size and dimensionality on performance

STARSSDC



Performance on a stream of real, skewed and correlated data

Used DMV data set and three categorical attributes of the “cars” table: Maker/Model, Color, and Year

The result can be attributed to the presence of skew and correlation in the data that leads considerably larger skylines as the buffer size increases


Performance on skewed real data

STARSSDC

3 dimensional tuples



Further evlauation of STARS

Pruning efficency

Measured as the fraction of skyline tuples that need to be examined on average in order to answer a dominance query

Found to be resilient to increases in data dimensionality


Pruning efficiency of arrangement skyline organization

Every depth level has twice the number of values from the level above


All depth levels have the same number of valuesAll depth levels have the same number of values



Demonstrates the potential perfomance benefits by utilizing the techniques of ‘improving the skybuffer organization’

The benefit of increasing the grid granularity is eventually offset( 상쇄 ) by the overhead of visiting many sparse cells

Knee in the curves


Effect of grid granularity on performance



All depth levels have the same number of valuesAll depth levels have the same number of values

3 dimensional tuples


ConclusionsConclusions

Identified and motivated the problem of maintaining the skyline of streaming data with partially ordered, categorical attributes

Realized two novel techniques

To constitute the building blocks of an efficient solution to the problem.

Introduce a lightweight data structure for indexing the tuples in the streaming buffer


SkylineSkyline

Center for E-Business TechnologySeoul National University

Seoul, Korea

Nam, Kwang-hyun

Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea


Introduction of Skyline QueryIntroduction of Skyline Query

Motivation

The amount of information is explosively increasing

– Users often face information overload

How can different attributes be compared?

Solution

Return objects which are not dominated by any other object in all attributes

Intuitive querying

– No overhead of formulation


AABB

Low

Low

High

High

Price

Speed


Categorical Skylines for Categorical Skylines for Streaming DataStreaming Data

Motivation

Existing work is not proper to maintain skylines for streaming data

Skyline maintenance of streaming tuples

Incoming tuple

– Consider whether incoming tuple can become part of skyline

Outgoing tuple

– Consider whether the tuples dominated by outgoing tuple can be candidates for entering skyline

Skyline maintenance solution

– Checking whether a tuple is dominated by the current line

– Retrieving the tuples in the buffer that are dominated by the outgoing sky line tuple



Categorical Skylines for Categorical Skylines for Streaming DataStreaming Data

Proposal

Indexing the skybuffer

– To insert and delete tuples, as well as identify the skybuffer tuples dominated by a query tuple efficiently

– 1. Topological sorting

– 2. Organizaing the skybuffer tuples in a grid

– 3. Improving the skybuffer

Visiting only relevant cells, Controlling the grid granularity, Poset partitioning

Advantages

Optimize query evaluation

Applicable to a wide range of dimensionality

Adapt any size of poset


Categorical Skylines for Streaming Data

Documents

Transcript of Categorical Skylines for Streaming Data