Efficient Skyline Querying with Variable User Preferences on Nominal Attributes

Raymond Chi-Wing Wong1, Ada Wai-Chee Fu2, Jian Pei3, Yip Sing Ho2, Tai Wong2 and Yubao Liu4

The Hong Kong University of Science and Technology1

The Chinese University of Hong Kong2

Simon Fraser University3

Sun Yat-Sen University4

Prepared by Raymond Chi-Wing Wong
Presented by Raymond Chi-Wing Wong

Outline

1. Introduction
   a. Skyline
   b. Contributions
2. Problem Definition
3. Adaptive SFS
4. IPO-Tree
5. Conclusion

1. Introduction

Suppose we want to look for a vacation package.

Package ID   Price   Hotel-class
a            1600    4
b            2400    1
c            3000    5

(3 packages)

We want to have a cheaper package. We want to have a higher hotel-class.

Package a “dominates” package b. We know that
1. Package a has a cheaper price
2. Package a has a higher hotel-class

We want to find the set of packages which are NOT dominated by any other packages, i.e., all of the “best” possible choices. This set is the skyline: {a, c}.
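
The dominance test and the skyline on this slide can be sketched in a few lines of Python. This is a naive illustration only, not the method proposed in the talk; the names `PACKAGES`, `dominates` and `skyline` are made up for the example.

```python
from typing import Dict, List, Tuple

# The three example packages: id -> (price, hotel_class).
PACKAGES: Dict[str, Tuple[int, int]] = {
    "a": (1600, 4),
    "b": (2400, 1),
    "c": (3000, 5),
}


def dominates(p: Tuple[int, int], q: Tuple[int, int]) -> bool:
    """p dominates q: no worse on both attributes, strictly better on one.

    A lower price is better; a higher hotel-class is better.
    """
    no_worse = p[0] <= q[0] and p[1] >= q[1]
    strictly_better = p[0] < q[0] or p[1] > q[1]
    return no_worse and strictly_better


def skyline(points: Dict[str, Tuple[int, int]]) -> List[str]:
    """Ids of the points not dominated by any other point (naive O(n^2) scan)."""
    return sorted(
        pid
        for pid, p in points.items()
        if not any(dominates(q, p) for qid, q in points.items() if qid != pid)
    )


print(skyline(PACKAGES))  # ['a', 'c']
```

Package b drops out because a beats it on both attributes, while a and c each win on one attribute and so neither dominates the other.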

1. Introduction

Package ID   Price   Hotel-class   Hotel-group
a            1600    4             T (Tulips)
b            2400    1             T (Tulips)
c            3000    5             H (Horizon)
d            3600    4             H (Horizon)
e            2400    2             M (Mozilla)
f            3000    3             M (Mozilla)

(6 packages)

We want to have a cheaper package. We want to have a higher hotel-class.

How about Hotel-group? Different customers may have different preferences on Hotel-group.

Suppose a customer has the preference H < T < M (H is preferred to T, and T is preferred to M). The skyline points are packages a and c.

Suppose another customer has the preference H < M < T. The skyline points are packages a, c and e.

In other words, different preferences give different skyline points.
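
To see how the preference changes the result, the sketch below recomputes the skyline of the six packages under both orderings. It is a brute-force illustration under the assumption that `x < y` means "x is preferred to y"; the function names and the `rank` helper are hypothetical.

```python
from typing import Dict, List, Tuple

Package = Tuple[int, int, str]  # (price, hotel_class, hotel_group)

PACKAGES: Dict[str, Package] = {
    "a": (1600, 4, "T"), "b": (2400, 1, "T"),
    "c": (3000, 5, "H"), "d": (3600, 4, "H"),
    "e": (2400, 2, "M"), "f": (3000, 3, "M"),
}


def dominates(p: Package, q: Package, rank: Dict[str, int]) -> bool:
    """p dominates q under a total order on hotel groups (smaller rank is better)."""
    no_worse = p[0] <= q[0] and p[1] >= q[1] and rank[p[2]] <= rank[q[2]]
    strictly = p[0] < q[0] or p[1] > q[1] or rank[p[2]] < rank[q[2]]
    return no_worse and strictly


def skyline(points: Dict[str, Package], order: List[str]) -> List[str]:
    """Brute-force skyline; `order` lists hotel groups from most to least preferred."""
    rank = {group: i for i, group in enumerate(order)}
    return sorted(
        pid
        for pid, p in points.items()
        if not any(dominates(q, p, rank) for qid, q in points.items() if qid != pid)
    )


print(skyline(PACKAGES, ["H", "T", "M"]))  # ['a', 'c']
print(skyline(PACKAGES, ["H", "M", "T"]))  # ['a', 'c', 'e']
```

Under H < T < M package a dominates e (cheaper, higher class, better group); once M is preferred to T, a no longer dominates e and e enters the skyline.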

1. Introduction


Problem: Given a preference on Hotel-group, we want to find the skyline with respect to this preference efficiently.

1. Introduction

Straightforward solution:

Adopt some existing skyline techniques such as SFS (Sort-First Skyline) to compute the skyline on-the-fly when we need to perform a skyline query

It works. However, this solution is not scalable and the results cannot be returned efficiently.

1. Introduction

Full Materialization solution:

Pre-computation: For each possible preference, (1) pre-compute the skyline and (2) store it.
Skyline Query: Return the stored skyline directly.

It works when there is a limited number of preferences.

However, this solution is not scalable when there are a lot of possible preferences. E.g., with three nominal attributes (like Hotel-group), each of which contains 40 possible values, there are 4.1 x 10^9 possible preferences (in our problem setting).

1. Introduction

Semi-Materialization solution:

Pre-computation: For SOME possible preferences, (1) pre-compute the skyline and (2) store it.
Skyline Query: Return the stored skyline directly OR derive it with simple operations.

This gives a good tradeoff between storage consumption and efficiency.

1. Introduction

Adaptive SFS

IPO-Tree (Implicit Preference Order Tree)

Questions:
1. What preferences should be stored?
2. With these preferences, how can we perform a skyline query efficiently?

1. Contributions

Most Existing Work
Assume that each attribute has a fixed ordering (either totally ordered or partially ordered) on the attribute values.

Our Work
Different users can have different preferences (i.e., the ordering on attribute values differs from user to user).
We propose a semi-materialization method, the IPO-tree, to answer the skyline query efficiently.

2. Problem Definition

Usually, a user should NOT have to specify an ordering on all possible values of attribute Hotel-group. Instead, the user only lists a few of the most favorite choices, e.g. M < H < *. We call this an implicit preference.

In M < H < *:
1. A user prefers M to H.
2. A user prefers H to *, where * stands for all possible values in attribute Hotel-group other than “M” and “H” (in this case, “T”). This is the reason why we call it an implicit preference.

Problem: Given an implicit preference on Hotel-group, we want to find the skyline with respect to this preference efficiently.




The implicit preference M < H < * stands for the following binary orders on Hotel-group:

Binary orders = {M<H, M<T, H<T}

Since the user gives only TWO choices, we define the order of this preference to be TWO. We also call this preference a second-order implicit preference.
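
The expansion of an implicit preference into binary orders can be written down directly. The sketch below is illustrative only; `binary_orders` and `HOTEL_GROUPS` are hypothetical names, and each pair `(x, y)` is read as "x is preferred to y".

```python
from typing import List, Set, Tuple


def binary_orders(listed: List[str], domain: Set[str]) -> Set[Tuple[str, str]]:
    """Expand 'v1 < v2 < ... < *' into (better, worse) pairs.

    `listed` holds the values the user named, most preferred first.
    `*` covers every value of `domain` that is not listed.
    """
    unlisted = domain - set(listed)
    orders: Set[Tuple[str, str]] = set()
    for i, better in enumerate(listed):
        # A listed value beats every value listed after it ...
        for worse in listed[i + 1:]:
            orders.add((better, worse))
        # ... and every value hidden behind '*'.
        for worse in unlisted:
            orders.add((better, worse))
    return orders


HOTEL_GROUPS = {"T", "H", "M"}

# Second-order preference M < H < *.
print(sorted(binary_orders(["M", "H"], HOTEL_GROUPS)))
# [('H', 'T'), ('M', 'H'), ('M', 'T')]

# The order of the preference is simply the number of listed values.
print(len(["M", "H"]))  # 2
```

Values hidden behind `*` stay mutually incomparable: a first-order preference M < * yields only M<H and M<T, and says nothing about H versus T.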


Idea of our proposed semi-materialization IPO-tree

1. Store the skylines with respect to the first-order implicit preferences ONLY
2. Find the skyline with respect to an implicit preference of any order from the stored skylines with respect to the first-order implicit preferences

3. Adaptive SFS

Original SFS
Idea:
  Suppose we have a function f
  Each tuple is assigned a score obtained by f
  Sort the tuples in ascending order of the scores
  Process the tuples in this order

Adaptive SFS
  Similar idea
  However, the original score function is based on numeric attributes, NOT nominal attributes
  What we change is the score function

Idea:
1. Pre-computation: first pre-sort the tuples according to this new score function
2. Skyline Query: re-sort the tuples for a skyline query
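
As a reminder of how the original SFS works on numeric attributes, here is a minimal sketch on the three-package example. It assumes the score function f is the plain sum of the attributes and that hotel-class has been turned into a "smaller is better" value (5 minus the class); both choices, and all names in the code, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# id -> (price, reverse_hotel_class); smaller is better on both attributes.
POINTS: Dict[str, Tuple[int, int]] = {
    "a": (1600, 1),
    "b": (2400, 4),
    "c": (3000, 0),
}


def dominates(p: Tuple[int, ...], q: Tuple[int, ...]) -> bool:
    """p dominates q: no larger on every attribute, strictly smaller on one."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


def sfs(points: Dict[str, Tuple[int, int]]) -> List[str]:
    """Sort-First Skyline with the monotone score f(p) = sum of attributes."""
    # If p dominates q then f(p) < f(q), so after sorting a point can only be
    # dominated by points that appear earlier in the list.
    ordered = sorted(points, key=lambda pid: sum(points[pid]))
    window: List[str] = []  # skyline points found so far
    for pid in ordered:
        if not any(dominates(points[w], points[pid]) for w in window):
            window.append(pid)
    return window


print(sfs(POINTS))  # ['a', 'c']
```

Because of the sort, every point that enters the window is final, so the scan never has to remove a point it accepted earlier.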

4. IPO-Tree

Idea of our proposed semi-materialization IPO-tree

1. Store the skylines with respect to the first-order implicit preferences ONLY
2. Find the skyline with respect to an implicit preference of any order from the stored skylines with respect to the first-order implicit preferences

4. IPO-Tree

M < *: SKY1 = {a, c, e, f}

Binary orders: {M < T, M < H}
Here * stands for the values other than “M” (i.e., “H” and “T”).

4. IPO-Tree

M < *: SKY1 = {a, c, e, f}

H < *: SKY2 = {a, c, e}

Binary orders: {H < T, H < M}
Here * stands for the values other than “H” (i.e., “T” and “M”).

f is NOT a skyline point. Why? With the binary order H<M, c dominates f. We say that “H<M” disqualifies f as a skyline point.

4. IPO-Tree

M < *: SKY1 = {a, c, e, f}

H < *: SKY2 = {a, c, e}

M < H < *

Binary orders: {M<H, M<T, H<T}
Here * stands for the values other than “M” and “H” (i.e., “T”).

PSKY1 = the set of data points in SKY1 with value “M” = {e, f}

4. IPO-Tree

M < *: SKY1 = {a, c, e, f}
H < *: SKY2 = {a, c, e}
M < H < *: PSKY1 = {e, f}, SKY3 = {a, c, e, f}

SKY3 = (SKY1 ∩ SKY2) ∪ PSKY1
     = {a, c, e} ∪ {e, f}
     = {a, c, e, f}

Additional binary order! H < * contains the binary order H<M, which is not one of the binary orders of M < H < *. This binary order may disqualify some data points in SKY3 like “f”.

Observation: These points must be in PSKY1, so the union with PSKY1 brings them back.
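
The set arithmetic on this slide can be replayed with Python sets. The sketch takes SKY1 and SKY2 as given (they are the stored first-order skylines); `merge` and `HOTEL_GROUP` are hypothetical names used only for this illustration.

```python
from typing import Dict, Set

SKY1 = {"a", "c", "e", "f"}  # stored skyline for M < *
SKY2 = {"a", "c", "e"}       # stored skyline for H < *

# Hotel-group of every package in the example table.
HOTEL_GROUP: Dict[str, str] = {
    "a": "T", "b": "T", "c": "H", "d": "H", "e": "M", "f": "M",
}


def merge(sky_v1: Set[str], sky_v2: Set[str], v1: str,
          group_of: Dict[str, str]) -> Set[str]:
    """Skyline for 'v1 < v2 < *' from the skylines for 'v1 < *' and 'v2 < *'."""
    # PSKY1: points of the first skyline whose nominal value is v1.
    psky = {p for p in sky_v1 if group_of[p] == v1}
    return (sky_v1 & sky_v2) | psky


SKY3 = merge(SKY1, SKY2, "M", HOTEL_GROUP)
print(sorted(SKY3))  # ['a', 'c', 'e', 'f']
```

The intersection alone would lose f, which SKY2 rejected only because of H<M; the union with PSKY1 = {e, f} restores it.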

4. IPO-Tree

SKY1 (M < *) and SKY2 (H < *) are skylines with respect to first-order preferences. SKY3 = {a, c, e, f} (M < H < *) is the skyline with respect to a second-order preference.

Merging Property: the skyline with respect to v1 < v2 < * can be derived from the skylines with respect to v1 < * and v2 < *.

4. IPO-Tree

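Second-order preference: the skyline with respect to v1 < v2 < * is merged from the skylines with respect to v1 < * and v2 < *.

Third-order preference: the skyline with respect to v1 < v2 < v3 < * is merged from the skylines with respect to v1 < v2 < * (second-order) and v3 < * (first-order).

Fourth-order preference: the skyline with respect to v1 < v2 < v3 < v4 < * is merged from the skylines with respect to v1 < v2 < v3 < * (third-order) and v4 < * (first-order).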
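
The chain above suggests a simple recursion: peel off the last listed value, merge, and repeat until only first-order skylines remain. The sketch below does this on the six-package example and checks the result against a brute-force skyline. The slides define PSKY only for the second-order case; taking it as "points of the prefix skyline whose value is one of the already listed values" is an assumption made here, and all function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

# id -> (price, hotel_class, hotel_group)
PACKAGES: Dict[str, Tuple[int, int, str]] = {
    "a": (1600, 4, "T"), "b": (2400, 1, "T"), "c": (3000, 5, "H"),
    "d": (3600, 4, "H"), "e": (2400, 2, "M"), "f": (3000, 3, "M"),
}
DOMAIN = {"T", "H", "M"}


def brute_force_skyline(listed: List[str]) -> Set[str]:
    """Skyline for 'listed[0] < listed[1] < ... < *' by pairwise comparison."""
    unlisted = DOMAIN - set(listed)
    better = {(v, w) for i, v in enumerate(listed)
              for w in list(listed[i + 1:]) + list(unlisted)}

    def dominates(p, q) -> bool:
        group_better = (p[2], q[2]) in better
        no_worse = p[0] <= q[0] and p[1] >= q[1] and (p[2] == q[2] or group_better)
        return no_worse and (p[0] < q[0] or p[1] > q[1] or group_better)

    return {pid for pid, p in PACKAGES.items()
            if not any(dominates(q, p) for qid, q in PACKAGES.items() if qid != pid)}


# Pre-computation: materialize the first-order skylines only.
FIRST_ORDER: Dict[str, Set[str]] = {v: brute_force_skyline([v]) for v in DOMAIN}


def merged_skyline(listed: List[str]) -> Set[str]:
    """Skyline of any order, built from FIRST_ORDER via the merging property."""
    if len(listed) == 1:
        return FIRST_ORDER[listed[0]]
    prefix = merged_skyline(listed[:-1])
    # Assumed generalization of PSKY: prefix-skyline points with a listed value.
    psky = {p for p in prefix if PACKAGES[p][2] in listed[:-1]}
    return (prefix & FIRST_ORDER[listed[-1]]) | psky


print(sorted(merged_skyline(["M", "H"])))       # ['a', 'c', 'e', 'f']
print(sorted(merged_skyline(["H", "T", "M"])))  # ['a', 'c']
```

On this small table the recursion agrees with the brute-force skyline for every preference of order one to three, which is what the test below checks; it is a sanity check on one example, not a proof.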

5. Empirical Study

Datasets
  Synthetic Dataset: anti-correlated dataset
  Real Dataset (from UCI): Nursery dataset

Default Values (Synthetic)
  No. of tuples = 500K
  No. of numeric dimensions = 3
  No. of nominal dimensions = 2
  No. of values in a nominal dimension = 20
  Order of implicit preference = 3

5. Empirical Study

Variation
  No. of data points
  No. of numeric dimensions
  No. of nominal dimensions
  Cardinality of nominal dimensions
  Order of implicit preference

Comparison
  SFS-D: Original SFS
  SFS-A: Adaptive SFS
  IPO Tree
  IPO Tree-10: IPO Tree which stores the 10 most frequent values for each nominal attribute (for comparison)

5. Empirical Study: Synthetic Data Set

5. Empirical Study: Real Data Set

6. Conclusion

Different customers have different preferences, which lead to different skylines

Skyline Query on Nominal Attributes
  Adaptive SFS algorithm
  IPO-Tree algorithm

Experiments

3. Adaptive SFS

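Hotel-class is replaced by Reverse Hotel-class (5 minus Hotel-class), so that a smaller value is better on every numeric attribute.

Package ID   Price   Hotel-class   Hotel-group
a            1600    4             T (Tulips)
b            2400    1             T (Tulips)
c            3000    5             H (Horizon)
d            3600    4             H (Horizon)
e            2400    2             M (Mozilla)
f            3000    3             M (Mozilla)

Package ID   Price   Reverse Hotel-class   Hotel-group
a            1600    1                     T (Tulips)
b            2400    4                     T (Tulips)
c            3000    0                     H (Horizon)
d            3600    1                     H (Horizon)
e            2400    3                     M (Mozilla)
f            3000    2                     M (Mozilla)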

3. Adaptive SFS

Using some existing algorithms, we can first remove the data points which cannot be in the skyline with respect to any implicit preference. Here b and d are removed, which leaves a, c, e and f.

3. Adaptive SFS

Step 1 (Pre-computation): pre-sort the tuples according to the new score function

Each value in attribute Hotel-group is assigned a SPECIAL value. This special value is set to the total number of possible values in Hotel-group (i.e., 3).

Score of point a is 1600 + 1 + 3 = 1604
Score of point c is 3000 + 0 + 3 = 3003
Score of point e is 2400 + 3 + 3 = 2406
Score of point f is 3000 + 2 + 3 = 3005

Sorted by score:

Package ID   Score
a            1604
e            2406
c            3003
f            3005

3. Adaptive SFS

Step 2 (Skyline Query): re-sort the tuples for a skyline query (e.g., H<T<*)

Value “H” is assigned value 1.
Value “T” is assigned value 2.
All values other than “H” and “T” (i.e., “M”) are still equal to value 3.

Score of point a is 1600 + 1 + 2 = 1603
Score of point c is 3000 + 0 + 1 = 3001

Package ID   Score (Pre-computation)   Score (Skyline Query)
a            1604                      1603
e            2406                      2406
c            3003                      3001
f            3005                      3005

Since the scores of a and c are updated, we need to re-sort a and c. Note that the ordering of all OTHER points, which contain neither “H” nor “T”, remains unchanged.

After the re-sort, we just use the original SFS. With this sorted list, we find the skyline = {a, c}.
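
The two steps of the walkthrough can be put together in a short sketch. It reproduces the scores on the slides, but it is a simplification: the query step re-sorts the whole list instead of repositioning only the tuples whose score changed, and the names `score`, `presort` and `query` are hypothetical.

```python
from typing import Dict, List, Tuple

# Points left after the pre-filter: id -> (price, reverse_hotel_class, group).
POINTS: Dict[str, Tuple[int, int, str]] = {
    "a": (1600, 1, "T"), "c": (3000, 0, "H"),
    "e": (2400, 3, "M"), "f": (3000, 2, "M"),
}
NUM_GROUPS = 3  # special value: number of possible Hotel-group values


def score(point: Tuple[int, int, str], rank: Dict[str, int]) -> int:
    """Numeric attributes plus the rank of the group (special value if unlisted)."""
    price, reverse_class, group = point
    return price + reverse_class + rank.get(group, NUM_GROUPS)


def presort(points: Dict[str, Tuple[int, int, str]]) -> List[str]:
    """Step 1: sort with every group set to the special value."""
    return sorted(points, key=lambda pid: score(points[pid], {}))


def query(points, presorted: List[str], listed: List[str]):
    """Step 2: re-score for 'listed[0] < listed[1] < ... < *', re-sort, run SFS."""
    rank = {group: i + 1 for i, group in enumerate(listed)}
    ordered = sorted(presorted, key=lambda pid: score(points[pid], rank))

    def dominates(p, q) -> bool:
        rp, rq = rank.get(p[2], NUM_GROUPS), rank.get(q[2], NUM_GROUPS)
        group_no_worse = p[2] == q[2] or rp < rq  # unlisted groups are incomparable
        no_worse = p[0] <= q[0] and p[1] <= q[1] and group_no_worse
        return no_worse and (p[0] < q[0] or p[1] < q[1] or rp < rq)

    window: List[str] = []
    for pid in ordered:
        if not any(dominates(points[w], points[pid]) for w in window):
            window.append(pid)
    return ordered, window


presorted = presort(POINTS)
print(presorted)  # ['a', 'e', 'c', 'f']
ordered, sky = query(POINTS, presorted, ["H", "T"])
print(sky)        # ['a', 'c']
```

Only a and c change score (1604 to 1603 and 3003 to 3001); e and f keep theirs, which is why the pre-computed order is a good starting point for the query-time sort.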

4. IPO-Tree

Idea

Pre-computation: Store the skylines with respect to the first-order preferences
Skyline Query: Find the skyline with respect to a preference of any order from the stored skylines with respect to the first-order preferences

e.g. 1  Hotel-Group: M<*              Airline: G<*
e.g. 2  Hotel-Group: M<*              Airline: (no preference)
e.g. 3  Hotel-Group: (no preference)  Airline: G<*

How can we do it efficiently?

We propose an indexing structure called the IPO-tree.

4. IPO-Tree

Package ID   Price   Reverse Hotel-class   Hotel-group   Airline
a            1600    1                     T (Tulips)    G (Gonna)
b            2400    4                     T (Tulips)    G (Gonna)
c            3000    0                     H (Horizon)   G (Gonna)
d            3600    1                     H (Horizon)   R (Redish)
e            2400    3                     M (Mozilla)   R (Redish)
f            3000    2                     M (Mozilla)   W (Wings)

IPO-tree:

root
  Level 1 (Hotel-Group): T<*, H<*, M<*, plus one branch for no preference on Hotel-Group
  Level 2 (Airline), under each level-1 node: G<*, R<*, W<*

Example nodes:
  Hotel-group: T<*              Airline: G<*
  Hotel-group: T<*              Airline: (no preference)
  Hotel-group: (no preference)  Airline: G<*

e.g. three nominal attributes (like Hotel-Group), each of which contains 40 possible values

Full Materialization: there are 4.1 x 10^9 possible preferences (in our problem setting).

Semi-Materialization (IPO-tree): there are 70,644 nodes (which is significantly smaller than 4.1 x 10^9).
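
The slides do not show how the two figures are obtained. One reading that reproduces both of them is sketched below; it rests on two assumptions that are not stated in the source: the full-materialization count covers, per attribute, no preference plus all first-order and second-order preferences, and each IPO-tree node has one child per value plus one "no preference" child on the next level.

```python
VALUES_PER_ATTRIBUTE = 40
NUM_ATTRIBUTES = 3

# Assumption: per attribute a user gives no preference (1 way), a first-order
# preference (40 ways) or a second-order preference (40 * 39 ordered pairs).
prefs_per_attribute = (
    1 + VALUES_PER_ATTRIBUTE + VALUES_PER_ATTRIBUTE * (VALUES_PER_ATTRIBUTE - 1)
)
full_materialization = prefs_per_attribute ** NUM_ATTRIBUTES

# Assumption: one tree level per attribute, 40 value branches plus one
# "no preference" branch under every node, and the root counted as a node.
branching = VALUES_PER_ATTRIBUTE + 1
ipo_tree_nodes = sum(branching ** level for level in range(NUM_ATTRIBUTES + 1))

print(full_materialization)  # 4103684801, i.e. about 4.1 x 10^9
print(ipo_tree_nodes)        # 70644
```

Under this reading the tree stores roughly 58,000 times fewer entries than full materialization, which matches the "significantly smaller" remark on the slide.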

4. IPO-Tree

One nominal attribute
  Merging Property

Multiple nominal attributes
  Consider ONE nominal attribute at a time with the Merging Property
  Fix the ordering of the OTHER nominal attributes
  Then, consider each of the other nominal attributes with the Merging Property

4. IPO-Tree

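Package ID   Price   Hotel-class   Hotel-group   Airline
a            1600    4             T (Tulips)    G (Gonna)
b            2400    1             T (Tulips)    G (Gonna)
c            3000    5             H (Horizon)   G (Gonna)
d            3600    4             H (Horizon)   R (Redish)
e            2400    2             M (Mozilla)   R (Redish)
f            3000    3             M (Mozilla)   W (Wings)

Query: Hotel-Group: M<H<*, Airline: G<R<*

First apply the Merging Property on Hotel-Group (Airline fixed to G<R<*):
  Hotel-Group: M<*, Airline: G<R<*
  Hotel-Group: H<*, Airline: G<R<*

Then apply the Merging Property on Airline:
  Hotel-Group: M<*, Airline: G<R<* is merged from
    Hotel-Group: M<*, Airline: G<*
    Hotel-Group: M<*, Airline: R<*
  Hotel-Group: H<*, Airline: G<R<* is merged from
    Hotel-Group: H<*, Airline: G<*
    Hotel-Group: H<*, Airline: R<*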

4. IPO-Tree

Theorem: Given a user query with an x-th order implicit preference on m’’ nominal attributes, the number of set operations required is O(x^m’’).

e.g. Hotel-Group: M<H<*, Airline: G<R<*

m’’ = 2, x = 2

No. of set operations = O(2^2)
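
A quick way to see where x^m’’ comes from is to enumerate the first-order combinations that a query decomposes into: one listed value is picked per attribute, so there are x choices for each of the m’’ attributes. The sketch below only counts these combinations and does not perform the merges; `first_order_lookups` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, List


def first_order_lookups(query: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """First-order combinations reached when the query is fully decomposed.

    `query` maps each nominal attribute to its listed values, most preferred
    first; every combination picks exactly one listed value per attribute.
    """
    attributes = sorted(query)
    return [
        dict(zip(attributes, combo))
        for combo in product(*(query[attr] for attr in attributes))
    ]


query = {"Hotel-Group": ["M", "H"], "Airline": ["G", "R"]}
lookups = first_order_lookups(query)
print(len(lookups))  # 4, i.e. x^m'' with x = 2 and m'' = 2
for combo in lookups:
    print(combo)
```

Each combination corresponds to one stored first-order skyline, and combining them takes a number of intersections and unions proportional to their count.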
