
From XML Schema to Relations: A Cost-Based Approach to XML Storage

Presented by Xinwan Bian and Danyu Wu

02-21-02


Introduction

Where to save XML documents?
- XML database
- Object-oriented database
- Object-relational database
- Relational database


Difficulties of Saving XML Documents into a Relational Database

XML has a more complex tree structure than flat relational tables

XML contains richer data types

Integration with legacy tables


Different Approaches to schema mappings

Fixed XML-to-relational mappings

Commercial RDBMS utility tools

Bell Laboratories cost-based approach


LegoDB, an XML storage mapping system

Three design principles:
- Cost-based search
- Logical/physical independence
- Reuse of existing technology


The Basic Approach of LegoDB

Create a p-schema for the input XML schema

Obtain cost estimates from the data statistics and the XQuery workload

Explore alternative storage configurations to find the lowest-cost mapping


Architecture of the Mapping Engine

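[Figure: architecture of the mapping engine. Inputs: XML Schema, XML data statistics, XQuery workload. Components: Generate Physical Schema, Physical Schema Transformation, Query/Schema Translation, Query Optimizer. Labels on the flow: PS0, PSi, RSi, cost(SQi). Output: Optimal Configuration. Legend: PSi = physical schema; RSi = relational schema/queries/stats.]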


Questions

Its Advantages?

Its Disadvantages?


Example of P-Schema Creation

(a) Initial XML schema

    type Show =
      show [ @type[String],
             title[String],
             year[Integer],
             reviews[String]*,
             … ]

(b) P-schema

    type Show =
      show [ @type[String],
             title[String],
             year[Integer],
             Reviews*,
             … ]
    type Reviews =
      reviews[String]

(c) Relational tables

    TABLE Show
      ( show_id INT,
        type STRING,
        title STRING,
        year INT )
    TABLE Review
      ( Review_id INT,
        review STRING,
        parent_show INT )


What Is a P-Schema?

Physical schemas (p-schemas) extend XML schemas in two significant ways:
- They contain data statistics
- They can be easily mapped into relational tables


Example of P-Schema with statistics

    type Show =
      show [ @type[ String<#8,#2> ],
             year[ Integer<#4,#1800,#2100,#300> ],
             title[ String<#50,#34798> ],
             Review*<#10> ]
    type Review =
      review[ String<#800> ]

Notation:

    Scalar<#size, #min, #max, #distinct>
    String<#size, #distinct>
    *<#count>
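As a rough illustration of how these annotations could feed a size estimate, the sketch below sums the #size values of the scalar fields of Show and multiplies the Review size by its #count. It assumes that #size is an average width in bytes and that #count is the average number of reviews per show; neither is stated on the slide. The names ScalarStats and row_width are hypothetical and not part of LegoDB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScalarStats:
    """Statistics on a scalar, mirroring Scalar<#size,#min,#max,#distinct>."""
    size: int                       # assumed: average width in bytes
    distinct: Optional[int] = None
    min: Optional[int] = None
    max: Optional[int] = None


# Annotations copied from the Show / Review example above.
SHOW_FIELDS = {
    "type": ScalarStats(size=8, distinct=2),
    "year": ScalarStats(size=4, min=1800, max=2100, distinct=300),
    "title": ScalarStats(size=50, distinct=34798),
}
REVIEW_TEXT = ScalarStats(size=800)
REVIEWS_PER_SHOW = 10  # from Review*<#10>; assumed to be a per-show average


def row_width(fields):
    """Sum of the #size annotations of the columns stored in one row."""
    return sum(stats.size for stats in fields.values())


show_row_bytes = row_width(SHOW_FIELDS)
review_bytes_per_show = REVIEWS_PER_SHOW * REVIEW_TEXT.size
print(show_row_bytes, review_bytes_per_show)  # prints: 62 8000
```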


Stratified Physical Types

    Scalar type      s  ::= Integer | String | Boolean
    Physical scalar  ps ::= s<#size,#min,#max,#distincts>
    Named type       nt ::= X                  (type name)
                          | nt | nt            (choice)
                          | ()                 (empty)
                          | nt{n,m}<#count>    (repetition)
    Optional type    ot ::= nt                 (named type)
                          | s                  (optional scalar)
                          | L[ot]              (optional element)
                          | ot, ot             (optional sequence)
                          | ()                 (empty)
    Physical type    pt ::= nt                 (named type)
                          | ot{0,1}            (optional type)
                          | s                  (scalar)
                          | L[pt]              (element)
                          | pt, pt             (sequence)
                          | ()                 (empty)
    Schema item      si ::= type X = pt        (type declaration)
    Schema              ::= schema Sn = si, si, … end   (schema)


Mapping of p-schema to relations

- Create one relation RT for each type name T
- For each RT, create a key that stores the node id
- For each RT, create a foreign key to every relation RPT such that PT is a parent type of T
- Create a column in RT for each sub-element of T that is a physical type
- If the data type is contained within an optional type, the corresponding column can be null
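The rules above can be sketched as a small generator of CREATE TABLE statements. This is a minimal illustration under stated assumptions, not LegoDB's implementation: the input format (a dictionary of columns per type plus a dictionary of parent types), the function name pschema_to_ddl, and the VARCHAR widths are all hypothetical.

```python
def pschema_to_ddl(types, parents):
    """Emit one CREATE TABLE statement per type name.

    types:   {type name: [(column, sql type, nullable), ...]}
    parents: {type name: [parent type names]}
    """
    statements = []
    for name, columns in types.items():
        cols = [f"{name}_id INT PRIMARY KEY"]        # key that stores the node id
        for col, sql_type, nullable in columns:      # one column per scalar sub-element
            null = "" if nullable else " NOT NULL"   # optional types may be null
            cols.append(f"{col} {sql_type}{null}")
        for parent in parents.get(name, []):         # one foreign key per parent type
            cols.append(f"parent_{parent} INT REFERENCES {parent}({parent}_id)")
        statements.append(
            f"CREATE TABLE {name} (\n  " + ",\n  ".join(cols) + "\n);"
        )
    return statements


# Hypothetical encoding of the Show / Review p-schema.
TYPES = {
    "Show": [("type", "VARCHAR(8)", False),
             ("title", "VARCHAR(50)", False),
             ("year", "INT", False)],
    "Review": [("review", "VARCHAR(800)", False)],
}
PARENTS = {"Review": ["Show"]}

for statement in pschema_to_ddl(TYPES, PARENTS):
    print(statement)
```

Column names such as type and year may need quoting in a real database; they are kept here only to mirror the slide.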


More details of p-schema to relational mappings


Schema Transformations

Advantages of performing transformations at the XML Schema level:
- Much of the XML Schema semantics is not present in a given relational schema
- Rewriting is more natural at the XML level
- The framework is more easily extensible to other non-relational stores


Inlining/Outlining Transformation

One can either associate a type name with a given nested element (outlining) or nest its definition directly within its parent element (inlining).

    Outlined:
      type TV = seasons[Integer], Description, Episode*
      type Description = description[String]

    => Inlined:
      type TV = seasons[Integer], description[String], Episode*
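The two rewritings can be sketched on a toy schema representation. Everything below is an assumption made for illustration: a schema is a dictionary from type names to item lists, an item is either ("elem", name, scalar) or ("ref", type name, occurrence), and inline assumes the child type is referenced only once.

```python
def inline(schema, parent, child):
    """Replace the single-occurrence reference to `child` in `parent` by its body."""
    body = []
    for item in schema[parent]:
        if item == ("ref", child, ""):      # only non-repeated references are inlined
            body.extend(schema[child])
        else:
            body.append(item)
    # Assumes `child` is not referenced from any other type.
    result = {name: items for name, items in schema.items() if name != child}
    result[parent] = body
    return result


def outline(schema, parent, element, new_type):
    """Move the element named `element` out of `parent` into its own named type."""
    body, moved = [], []
    for item in schema[parent]:
        if item[0] == "elem" and item[1] == element:
            moved.append(item)
            body.append(("ref", new_type, ""))
        else:
            body.append(item)
    result = dict(schema)
    result[parent] = body
    result[new_type] = moved
    return result


# The TV example: Episode* stays a reference because it is repeated.
OUTLINED = {
    "TV": [("elem", "seasons", "Integer"),
           ("ref", "Description", ""),
           ("ref", "Episode", "*")],
    "Description": [("elem", "description", "String")],
}
INLINED = inline(OUTLINED, "TV", "Description")
print(INLINED["TV"])
```

Applying outline(INLINED, "TV", "description", "Description") gives back the outlined schema, which reflects that the two transformations are inverses of each other.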


Union Factorization/Distribution Transformation

The first law: (a, (b | c)) == (a, b | a, c)

    type Show =
      show [ @type[String], title[String], year[Integer],
             Aka{1,10}, Review*, (Movie | TV) ]
    type Movie =
      box_office[Integer], video_sales[Integer]
    type TV =
      seasons[Integer], description[String], Episode*

    =>

    type Show =
      show [ ( @type[String], title[String], year[Integer],
               Aka{1,10}, Review*,
               box_office[Integer], video_sales[Integer] )
           | ( @type[String], title[String], year[Integer],
               Aka{1,10}, Review*,
               seasons[Integer], description[String], Episode* ) ]
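A minimal sketch of the first law on a list-based representation follows. The representation is hypothetical: a sequence is a list of item strings whose last entry is ("choice", [alternatives]), and the Movie and TV bodies are already inlined into the alternatives, as in the right-hand side above.

```python
def distribute(seq_items):
    """Apply (a, (b | c)) == (a, b | a, c) to a sequence ending in a choice.

    seq_items: list whose last entry is ("choice", [alternative, ...]),
               each alternative being a list of items.
    Returns ("choice", [...]) where every alternative repeats the common prefix.
    """
    *prefix, last = seq_items
    if not (isinstance(last, tuple) and last[0] == "choice"):
        raise ValueError("expected a trailing choice")
    return ("choice", [prefix + alternative for alternative in last[1]])


MOVIE = ["box_office[Integer]", "video_sales[Integer]"]
TV = ["seasons[Integer]", "description[String]", "Episode*"]
SHOW = ["@type[String]", "title[String]", "year[Integer]",
        "Aka{1,10}", "Review*", ("choice", [MOVIE, TV])]

DISTRIBUTED = distribute(SHOW)
for alternative in DISTRIBUTED[1]:
    print(", ".join(alternative))
```

Factorization is the same law read from right to left: it would pull the longest common prefix of the alternatives back out of the choice.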


Corresponding relational configuration (inlining/outlining example)

    TABLE TV
      ( TV_id INT,
        seasons INT,
        parent_Show INT )
    TABLE Description
      ( Description_id INT,
        description String,
        parent_TV INT )

    =>

    TABLE TV
      ( TV_id INT,
        seasons INT,
        description String,
        parent_Show INT )


Union Factorization/Distribution continues

The second law: a[t1 | t2] == a[t1] | a[t2]

    type Show =
      show [ ( @type[String], title[String], year[Integer],
               Aka{1,10}, Review*,
               box_office[Integer], video_sales[Integer] )
           | ( @type[String], title[String], year[Integer],
               Aka{1,10}, Review*,
               seasons[Integer], description[String], Episode* ) ]

    =>

    type Show = (Show_Part1 | Show_Part2)
    type Show_Part1 =
      show [ @type[String], title[String], year[Integer],
             Aka{1,10}, Review*,
             box_office[Integer], video_sales[Integer] ]
    type Show_Part2 =
      show [ @type[String], title[String], year[Integer],
             Aka{1,10}, Review*,
             seasons[Integer], description[String], Episode* ]


Corresponding relational configurations

    TABLE Show
      ( Show_id INT, type String, title String, year INT )
    TABLE Movie
      ( Movie_id INT, box_office INT, video_sales INT, parent_show INT )
    TABLE TV
      ( TV_id INT, seasons INT, description String, parent_show INT )

    =>

    TABLE Show_Part1
      ( Show_Part1_id INT, type String, title String, year INT,
        box_office INT, video_sales INT )
    TABLE Show_Part2
      ( Show_Part2_id INT, type String, title String, year INT,
        seasons INT, description String )


Wildcard rewritings

'~': any element name can be used
'~!a': any name except "a" can be used

    type Reviews = review[ ~[String]* ]

    =>

    type Reviews = review[ (NYTReview | OtherReview)* ]
    type NYTReview = nyt[String]
    type OtherReview = (~!nyt)[String]
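To make the wildcard semantics concrete, the sketch below matches element names against the patterns '~' and '~!a' and routes each review child to one of the two types produced by the rewriting. The function names and the sample element name suntimes are hypothetical.

```python
def matches(pattern, name):
    """True if element `name` is allowed by `pattern`.

    '~'   matches any element name,
    '~!a' matches any element name except 'a',
    anything else matches only itself.
    """
    if pattern == "~":
        return True
    if pattern.startswith("~!"):
        return name != pattern[2:]
    return name == pattern


def review_type(name):
    """Pick the type (and hence the table) that stores a child of review."""
    if matches("nyt", name):
        return "NYTReview"
    if matches("~!nyt", name):
        return "OtherReview"
    raise ValueError(f"no type accepts element {name!r}")


for element in ("nyt", "suntimes"):
    print(element, "->", review_type(element))
```

Because the two patterns are disjoint, every review lands in exactly one table, and a query that asks only for NYT reviews can skip the OtherReview table.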


XQuery Query Examples

    Q1: FOR $v IN imdb/show
        WHERE $v/year = 1999
        RETURN ($v/title, $v/year, $v/nyt_reviews)

    Q2: FOR $v IN imdb/show
        RETURN $v

    Q3: FOR $v IN imdb/show
        WHERE $v/title = c3
        RETURN $v/description

    Q4: FOR $v IN imdb/show
        RETURN <result>
                 { $v/title, $v/year,
                   (FOR $e IN $v/episode
                    WHERE $e/guest_director = c4
                    RETURN $e) }
               </result>


XQuery Workload Examples

Publish = { Q1: 0.4, Q2: 0.4, Q3: 0.1, Q4: 0.1 }

Lookup = { Q1: 0.1, Q2: 0.1, Q3: 0.4, Q4: 0.4 }
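One plausible way to turn such a workload into a single number is a weighted sum of per-query costs. The sketch below assumes exactly that; the per-query costs are invented placeholders standing in for optimizer estimates, and workload_cost is a hypothetical name.

```python
PUBLISH = {"Q1": 0.4, "Q2": 0.4, "Q3": 0.1, "Q4": 0.1}
LOOKUP = {"Q1": 0.1, "Q2": 0.1, "Q3": 0.4, "Q4": 0.4}

# Invented per-query costs for one storage configuration; a real system
# would obtain these from the relational optimizer.
HYPOTHETICAL_COSTS = {"Q1": 10.0, "Q2": 100.0, "Q3": 2.0, "Q4": 20.0}


def workload_cost(workload, query_cost):
    """Weighted sum of query costs (assumed cost model for a workload)."""
    return sum(weight * query_cost[query] for query, weight in workload.items())


publish_cost = workload_cost(PUBLISH, HYPOTHETICAL_COSTS)
lookup_cost = workload_cost(LOOKUP, HYPOTHETICAL_COSTS)
print(round(publish_cost, 2), round(lookup_cost, 2))
```

With these placeholder numbers the same configuration is far more expensive under Publish than under Lookup, because Publish puts most of its weight on the full-document query Q2.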


Search Algorithm

    Procedure GreedySearch
    Input:  xSchema: XML schema, xWkld: query workload, xStats: data statistics
    Output: pSchema: an efficient physical schema
    begin
        minCost = infinity
        pSchema = GetInitialPhysicalSchema(xSchema)
        cost = GetPSchemaCost(pSchema, xWkld, xStats)
        while (cost < minCost) do
            minCost = cost
            pSchemaList = ApplyTransformations(pSchema)
            for each pSchema' in pSchemaList do
                cost' = GetPSchemaCost(pSchema', xWkld, xStats)
                if cost' < cost then
                    cost = cost'; pSchema = pSchema'
                endif
            endfor
        endwhile
        return pSchema
    end
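The procedure translates almost line for line into Python. The sketch below is illustrative only: a schema is reduced to the set of inlined element names, and toy_cost and inline_one_more are invented stand-ins for GetPSchemaCost and ApplyTransformations with made-up numbers.

```python
import math


def greedy_search(initial, cost_of, neighbors):
    """Greedy descent: keep the cheapest neighbor until no neighbor improves."""
    min_cost = math.inf
    schema = initial
    cost = cost_of(schema)
    while cost < min_cost:
        min_cost = cost
        # Candidates are generated once per round from the current schema.
        for candidate in list(neighbors(schema)):
            candidate_cost = cost_of(candidate)
            if candidate_cost < cost:
                cost, schema = candidate_cost, candidate
    return schema, cost


ELEMENTS = ("description", "aka", "review")  # hypothetical inlinable elements


def inline_one_more(inlined):
    """All schemas reachable by inlining one more element (greedy-so style)."""
    return [inlined | {element} for element in ELEMENTS if element not in inlined]


def toy_cost(inlined):
    """Invented cost: inlining helps for two elements and hurts for the third."""
    cost = 10.0
    if "description" in inlined:
        cost -= 3.0
    if "aka" in inlined:
        cost -= 1.0
    if "review" in inlined:
        cost += 4.0
    return cost


best, best_cost = greedy_search(frozenset(), toy_cost, inline_one_more)
print(sorted(best), best_cost)  # prints: ['aka', 'description'] 6.0
```

Like the procedure itself, this stops at the first configuration with no cheaper neighbor, so the result is a local minimum and not necessarily the global optimum.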


Experimental Settings

Two variations of the greedy search: greedy-so and greedy-si.

Greedy-so:
- Initial physical schema: all elements outlined (except base types)
- During search: inlining transformations are applied

Greedy-si:
- Initial physical schema: all elements inlined (except elements with multiple occurrences)
- During search: outlining transformations are applied


Efficiency of Greedy Search

5 lookup queries and 3 publish queries


Results

For lookup: greedy-so converges to the final configuration much faster.

For publish: the opposite holds; greedy-si converges faster.


Reasons:

The traversals made by lookup queries are localized. The final configuration has only a few inlined elements, so greedy-so reaches it earlier than greedy-si.

The publish queries traverse a larger number of elements. The final configuration has several inlined elements, so greedy-si reaches it earlier than greedy-so.


Sensitivity of configurations to varied workloads

Create a spectrum of workloads that combine the lookup queries and the publish queries in the ratio k : (1-k), where k ∈ [0,1] is the fraction of lookup queries in the particular workload.

Three workloads, corresponding to k = 0.25, 0.50, and 0.75, result in three configurations.
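A mixed workload of this kind could be built as follows. The sketch assumes the weight of each group is spread evenly over its queries, which the slides do not state; the query names L1..L5 and P1..P3 are hypothetical labels for the 5 lookup and 3 publish queries.

```python
def mix_workload(lookup_queries, publish_queries, k):
    """Query weights for a workload mixing lookup and publish in ratio k : (1 - k)."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    # Assumption: each group's share is divided evenly among its queries.
    weights = {q: k / len(lookup_queries) for q in lookup_queries}
    weights.update({q: (1.0 - k) / len(publish_queries) for q in publish_queries})
    return weights


LOOKUP_QUERIES = ["L1", "L2", "L3", "L4", "L5"]
PUBLISH_QUERIES = ["P1", "P2", "P3"]

for k in (0.25, 0.50, 0.75):
    workload = mix_workload(LOOKUP_QUERIES, PUBLISH_QUERIES, k)
    print(k, round(workload["L1"], 3), round(workload["P1"], 3))
```

Each of the three workloads would then be passed to the search to obtain one configuration per value of k.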


Figure 11: Sensitivity to variations in the workload


When inlining is a bad idea for a query

(a) The query does limited, localized traversals and/or does not access all the attributes involved.

(b) The query has highly selective selection predicates.

(c) The query involves a join of attributes that are not structurally adjacent in the XML Schema (e.g. actor and director).


Effectiveness of XML transformations: Union Distribution


Results of the union-transformed configuration

The curves for C[0.25] and C[0.75] overlap with OPT.

C[0.25] and C[0.75] cross at a small angle.

C[All-inlined] performed 2 to 5 times worse than optimal.


Wildcards

Find the NYTimes reviews for shows produced in 1999:


Questions

The optimal mapping in this paper is cost-based. What else needs to be considered?


References

P. Bohannon, J. Freire, P. Roy, and J. Siméon. From XML schema to relations: A cost-based approach to XML storage. Technical report, Bell Laboratories, 2001. Full version.

A. Deutsch, M. Fernandez, and D. Suciu. Storing semi-structured data with STORED. In Proc. of SIGMOD, pp. 431-442, 1999.

D. Florescu and D. Kossmann. A performance evaluation of alternative mapping schemas for storing XML in a relational database. Technical Report 3680, INRIA, 1999.

M. Klettke and H. Meyer. XML and object-relational database systems - enhancing structural mappings based on statistics. In Proc. of WebDB, pp. 63-68, 2000.

A. Schmidt, M. Kersten, M. Windhouwer, and F. Waas. Efficient relational storage and retrieval of XML documents. In Proc. of WebDB, pp. 47-52, 2000.

J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proc. of VLDB, pp. 302-314, 1999.