XML Schema Computations: Schema Compatibility Testing and Subschema Extraction


Thomas Y.T. LEE and David W.L. Cheung

Department of Computer Science, The University of Hong Kong

October 28, 2010, CIKM 2010

Toronto, Canada

Outline

Introduction and motivation

Formal models for XML data and schemas

Schema computational algorithms

Experiments and conclusions


Data interoperability on web services

In order for two web services to be interoperable, the XML schema on the message-receiving end must accept all possible XML messages from the sending end.

- The sending schema must be a subschema of the receiving schema.

[Figure: Web Service A with Schema A and Web Service B with Schema B, each shown with its set of XML instances; a subset relation links the two instance sets.]

W3C XML Schema and data standards

1. W3C XML Schema (XSD) is the most popular schema language for defining data standards.

2. In order for the new version of an XSD to be backward-compatible with the old version, the new version must be a superschema of the old version.

   - The new schema must accept every instance of the old schema.

3. However, a typical e-commerce standard XSD contains thousands of types and elements, which makes manual verification of compatibility hardly possible.

4. When an XSD is too large, how can we extract a smaller subschema just enough for processing by a specific application?

Schema compatibility problems

1. Given two XSDs, how can we verify whether they are equivalent or whether one is a subschema of the other?

2. Given an XSD A, how can we extract a smaller subschema B of A so that B recognizes only a subset of the elements recognized by A?

3. In this research, we have developed formal models for XML data and schemas, as well as algorithms to solve these problems.


Data Tree (DT) to model XML data

A DT is a tree where edges represent elements and nodes represent their contents.

<Quote>
  <Line>
    <Desc>hPhone</Desc>
    <Price>499.9</Price>
  </Line>
  <Line>
    <Desc>iMat</Desc>
    <Price>999.9</Price>
  </Line>
</Quote>

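[Figure: the DT of the XML document above. Each line shows an edge (element name) and the node it leads to.]

n0:ε
  <Quote> -> n1:ε
    <Line> -> n2:ε
      <Desc> -> n4:"hPhone"
      <Price> -> n5:"499.9"
    <Line> -> n3:ε
      <Desc> -> n6:"iMat"
      <Price> -> n7:"999.9"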
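To make the DT definition concrete, here is a minimal Python sketch that builds such a tree from the XML document above. The names `Node`, `build`, and `to_data_tree` are hypothetical and not part of the original work. The sketch assumes that only leaf elements carry a value and that every inner node holds ε, represented here as the empty string.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

EPSILON = ""  # stands for ε: a node with no text content


@dataclass
class Node:
    value: str = EPSILON
    edges: list = field(default_factory=list)  # ordered (element name, child Node) pairs


def build(elem: ET.Element) -> Node:
    """Build the node that the edge labelled elem.tag points to."""
    children = list(elem)
    # Assumption: only leaf elements carry a value; inner nodes get ε.
    value = EPSILON if children else (elem.text or "").strip()
    return Node(value, [(child.tag, build(child)) for child in children])


def to_data_tree(xml_text: str) -> Node:
    root_elem = ET.fromstring(xml_text.strip())
    # n0 is an extra root node whose single outgoing edge is the document element.
    return Node(EPSILON, [(root_elem.tag, build(root_elem))])


QUOTE_XML = """
<Quote>
  <Line><Desc>hPhone</Desc><Price>499.9</Price></Line>
  <Line><Desc>iMat</Desc><Price>999.9</Price></Line>
</Quote>
"""

n0 = to_data_tree(QUOTE_XML)
```

Here `n0.edges[0]` is the `Quote` edge, and the two `Line` edges under it lead to nodes whose `Desc` and `Price` children hold the leaf values.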


Schema Automaton (SA) to model XML schemas

1. An SA is a deterministic finite automaton (DFA) where each state is associated with a regular expression (RE) and a set of values called the value domain (VDom).

2. The DFA, called the vertical language (VLang), defines how the symbols are arranged along the paths from the root to the leaves.

   2.1 Each state represents an XSD data type and each symbol represents an element name.

3. The RE of a state, called the horizontal language (HLang), defines how child elements can be arranged under an XSD data type, i.e., the content model.

4. The value domain defines the set of all possible values an element can contain.

Example SA

[Figure: VLang DFA of the example SA, with transitions q0 -<Quote>-> q1, q0 -<Order>-> q2, q1 -<Line>-> q3, q2 -<Line>-> q4, q3 -<Desc>-> q5, q3 -<Price>-> q6, q4 -<Product>-> q7, q4 -<Qty>-> q8, q7 -<Desc>-> q5, q7 -<Price>-> q6.]

q    HLang(q)           VDom(q)
q0   <Quote>|<Order>    {ε}
q1   <Line>+            {ε}
q2   <Line>+            {ε}
q3   <Desc><Price>      {ε}
q4   <Product><Qty>     {ε}
q5   {ε}                STRINGS
q6   {ε}                DECIMALS
q7   <Desc><Price>      {ε}
q8   {ε}                INTEGERS
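The example SA above can be turned into a small validator to show how VLang, HLang, and VDom interact. This is only an illustrative sketch, not the authors' implementation. The class `SchemaAutomaton`, the tuple encoding of data tree nodes, and the value predicates are all assumptions, and the lexical checks for DECIMALS and INTEGERS are deliberately rough. Each HLang is written as a Python regular expression over the concatenated child tags.

```python
import re
from dataclasses import dataclass


@dataclass
class SchemaAutomaton:
    start: str
    delta: dict  # (state, element name) -> state: the VLang DFA
    hlang: dict  # state -> regex over the concatenated child tags, e.g. "(<Line>)+"
    vdom: dict   # state -> predicate deciding whether a text value is allowed

    def accepts(self, node, state=None) -> bool:
        """node is (value, [(element name, child node), ...]); "" stands for ε."""
        state = self.start if state is None else state
        value, edges = node
        if not self.vdom[state](value):
            return False
        child_word = "".join(f"<{name}>" for name, _ in edges)
        if re.fullmatch(self.hlang[state], child_word) is None:
            return False
        return all(
            (state, name) in self.delta
            and self.accepts(child, self.delta[(state, name)])
            for name, child in edges
        )


def eps(v): return v == ""
def strings(v): return True
def decimals(v): return re.fullmatch(r"[+-]?\d+(\.\d+)?", v) is not None  # rough check
def integers(v): return re.fullmatch(r"[+-]?\d+", v) is not None          # rough check


EXAMPLE_SA = SchemaAutomaton(
    start="q0",
    delta={("q0", "Quote"): "q1", ("q0", "Order"): "q2",
           ("q1", "Line"): "q3", ("q2", "Line"): "q4",
           ("q3", "Desc"): "q5", ("q3", "Price"): "q6",
           ("q4", "Product"): "q7", ("q4", "Qty"): "q8",
           ("q7", "Desc"): "q5", ("q7", "Price"): "q6"},
    hlang={"q0": "<Quote>|<Order>", "q1": "(<Line>)+", "q2": "(<Line>)+",
           "q3": "<Desc><Price>", "q4": "<Product><Qty>", "q7": "<Desc><Price>",
           "q5": "", "q6": "", "q8": ""},
    vdom={"q0": eps, "q1": eps, "q2": eps, "q3": eps, "q4": eps, "q7": eps,
          "q5": strings, "q6": decimals, "q8": integers},
)


def leaf(v): return (v, [])


quote = ("", [("Quote", ("", [
    ("Line", ("", [("Desc", leaf("hPhone")), ("Price", leaf("499.9"))])),
    ("Line", ("", [("Desc", leaf("iMat")), ("Price", leaf("999.9"))])),
]))])
bad_quote = ("", [("Quote", ("", [
    ("Line", ("", [("Desc", leaf("hPhone")), ("Price", leaf("cheap"))])),
]))])
```

With these definitions, `EXAMPLE_SA.accepts(quote)` holds, while `bad_quote` is rejected because "cheap" is outside the value domain of q6.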


Schema compatibility testing

1. Schema equivalence testing and subschema testing.

2. A schema minimization is involved.

   2.1 All useless states (data types) are removed first. A useless state is an inaccessible state or a state which does not recognize any element with a finite number of descendants.

   2.2 The process is like a DFA minimization, but the HLang and VDom of each state are considered when deciding whether two states can be merged.

3. We have proved that two SAs (XSDs) are equivalent iff their minimized forms have isomorphic VLang DFAs and all corresponding HLangs and VDoms are equivalent.

4. We have developed an algorithm to verify whether an SA is a subschema of another SA.

Useless states

[Figure: VLang DFA of an example SA with states q0 to q9 and transitions labelled A, B, and C.]

q    HLang(q)    VDom(q)
q0   A{2,5}BC?   STRINGS
q1   C*          STRINGS
q2   {ε}         INTEGERS
q3   A*          STRINGS
q4   B+          STRINGS
q5   C           STRINGS
q6   A+B*        INTEGERS
q7   A?          STRINGS
q8   B*          STRINGS
q9   {ε}         DECIMALS

1. q7 and q8 are inaccessible.

2. q5 and q6 are irrational because any element they accept must have infinitely many descendants.

3. q9 is useless because it is blocked by irrational states.

4. q4 is useless because it must lead to an irrational state.
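The four cases above can be reproduced with a small fixpoint computation. The sketch below is an assumption-laden simplification, not the authors' algorithm: each HLang is replaced by a list of alternatives, where an alternative is the set of child symbols that one kind of content requires. The toy automaton is hypothetical and only loosely modelled on the slide, since the exact transitions of the figure are not recoverable from the text.

```python
def rational_states(delta, alternatives):
    """Least fixpoint: a state is rational if some alternative needs only rational children."""
    rational = set()
    changed = True
    while changed:
        changed = False
        for q, alts in alternatives.items():
            if q not in rational and any(
                all(delta[(q, a)] in rational for a in alt) for alt in alts
            ):
                rational.add(q)
                changed = True
    return rational


def useful_states(start, delta, alternatives):
    """Rational states reachable from start through viable alternatives only."""
    rational = rational_states(delta, alternatives)
    useful = set()
    stack = [start] if start in rational else []
    while stack:
        q = stack.pop()
        if q in useful:
            continue
        useful.add(q)
        for alt in alternatives[q]:
            targets = [delta[(q, a)] for a in alt]
            if all(t in rational for t in targets):  # skip alternatives that cannot be completed
                stack.extend(targets)
    return useful


# Hypothetical toy automaton (not the exact automaton of the slide).
DELTA = {("q0", "A"): "q1", ("q0", "C"): "q4", ("q4", "B"): "q5",
         ("q5", "C"): "q6", ("q6", "A"): "q5", ("q6", "B"): "q9",
         ("q7", "A"): "q8"}
ALTERNATIVES = {
    "q0": [{"A"}, {"A", "C"}],  # C is optional
    "q1": [set()],              # leaf
    "q4": [{"B"}],              # must lead to q5
    "q5": [{"C"}],              # q5 and q6 require each other forever
    "q6": [{"A"}, {"A", "B"}],
    "q7": [set(), {"A"}],       # never reached from q0
    "q8": [set()],
    "q9": [set()],              # leaf, but only reachable through q6
}

rational = rational_states(DELTA, ALTERNATIVES)
useful = useful_states("q0", DELTA, ALTERNATIVES)
```

In this toy model, q5 and q6 never become rational, q4 fails because its only alternative leads to q5, q9 is cut off behind q6, and q7 and q8 are simply unreachable, so only q0 and q1 remain useful.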


Schema minimization and equivalence

Schema A

[Figure: VLang DFA of Schema A, with transitions q0 -<Quote>-> q1, q0 -<Order>-> q2, q1 -<Line>-> q3, q2 -<Line>-> q4, q3 -<Desc>-> q5, q3 -<Price>-> q6, q4 -<Product>-> q7, q4 -<Qty>-> q8, q7 -<Desc>-> q5, q7 -<Price>-> q6.]

q    HLang(q)           VDom(q)
q0   〈Quote〉|〈Order〉    {ε}
q1   〈Line〉+            {ε}
q2   〈Line〉+            {ε}
q3   〈Desc〉〈Price〉      {ε}
q4   〈Product〉〈Qty〉     {ε}
q5   {ε}                STRS
q6   {ε}                DECS
q7   〈Desc〉〈Price〉      {ε}
q8   {ε}                INTS

1. q3 and q7 can be merged into q9.

2. The two SAs are equivalent.

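Schema B

[Figure: VLang DFA of Schema B, with transitions q0 -<Quote>-> q1, q0 -<Order>-> q2, q1 -<Line>-> q9, q2 -<Line>-> q4, q4 -<Product>-> q9, q4 -<Qty>-> q8, q9 -<Desc>-> q5, q9 -<Price>-> q6.]

q    HLang(q)           VDom(q)
q0   〈Quote〉|〈Order〉    {ε}
q1   〈Line〉+            {ε}
q2   〈Line〉+            {ε}
q9   〈Desc〉〈Price〉      {ε}
q4   〈Product〉〈Qty〉     {ε}
q5   {ε}                STRS
q6   {ε}                DECS
q8   {ε}                INTS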
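The merge of q3 and q7 can be reproduced with a partition-refinement sketch in the style of DFA minimization. This is an illustration under a strong assumption, not the authors' procedure: two HLangs or VDoms are treated as equivalent only when their textual labels are identical, whereas a real implementation would have to compare the languages of the regular expressions. The helper names `canon` and `mergeable_classes` are hypothetical.

```python
def canon(labels):
    """Replace arbitrary hashable labels by small integers, one per distinct label."""
    table = {}
    return {q: table.setdefault(label, len(table)) for q, label in labels.items()}


def mergeable_classes(states, delta):
    """states: q -> (HLang label, VDom label); delta: (state, symbol) -> state."""
    out = {q: [] for q in states}
    for (src, sym), dst in delta.items():
        out[src].append((sym, dst))
    # Initial partition: states with the same HLang and VDom labels share a block.
    ids = canon(states)
    while True:
        # Refine: states stay together only if their successors are in the same blocks.
        signature = {
            q: (ids[q], tuple(sorted((sym, ids[dst]) for sym, dst in out[q])))
            for q in states
        }
        new_ids = canon(signature)
        if len(set(new_ids.values())) == len(set(ids.values())):
            break  # no block was split, so the partition is stable
        ids = new_ids
    classes = {}
    for q, block in ids.items():
        classes.setdefault(block, []).append(q)
    return sorted(classes.values())


SCHEMA_A_STATES = {
    "q0": ("<Quote>|<Order>", "{eps}"), "q1": ("<Line>+", "{eps}"),
    "q2": ("<Line>+", "{eps}"), "q3": ("<Desc><Price>", "{eps}"),
    "q4": ("<Product><Qty>", "{eps}"), "q5": ("{eps}", "STRS"),
    "q6": ("{eps}", "DECS"), "q7": ("<Desc><Price>", "{eps}"),
    "q8": ("{eps}", "INTS"),
}
SCHEMA_A_DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q2",
    ("q1", "Line"): "q3", ("q2", "Line"): "q4",
    ("q3", "Desc"): "q5", ("q3", "Price"): "q6",
    ("q4", "Product"): "q7", ("q4", "Qty"): "q8",
    ("q7", "Desc"): "q5", ("q7", "Price"): "q6",
}

classes = mergeable_classes(SCHEMA_A_STATES, SCHEMA_A_DELTA)
```

On Schema A, q1 and q2 start in the same block because both have HLang 〈Line〉+, but they are split once their 〈Line〉 successors q3 and q4 are seen to differ. Only q3 and q7 end up in a common class.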


Subschema testing

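Schema A

[Figure: VLang DFA of Schema A, with transitions q0 -<Quote>-> q1, q0 -<Order>-> q2, q1 -<Line>-> q9, q2 -<Line>-> q4, q4 -<Product>-> q9, q4 -<Qty>-> q8, q9 -<Desc>-> q5, q9 -<Price>-> q6.]

q    HLang(q)           VDom(q)
q0   〈Quote〉|〈Order〉    {ε}
q1   〈Line〉+            {ε}
q2   〈Line〉+            {ε}
q9   〈Desc〉〈Price〉      {ε}
q4   〈Product〉〈Qty〉     {ε}
q5   {ε}                STRS
q6   {ε}                DECS
q8   {ε}                INTS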

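B is a subschema of A.

1. $\mathrm{HLang}(q_0^B) \subseteq \mathrm{HLang}(q_0^A)$ and $\mathrm{VDom}(q_0^B) = \mathrm{VDom}(q_0^A)$.

2. $\mathrm{HLang}(q_6^B) = \mathrm{HLang}(q_6^A)$ and $\mathrm{VDom}(q_6^B) \subseteq \mathrm{VDom}(q_6^A)$.

3. $\mathrm{HLang}(q_i^B) = \mathrm{HLang}(q_i^A)$ and $\mathrm{VDom}(q_i^B) = \mathrm{VDom}(q_i^A)$, for $i = 1, 5, 9$.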

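Schema B

[Figure: VLang DFA of Schema B, with transitions q0 -<Quote>-> q1, q1 -<Line>-> q9, q9 -<Desc>-> q5, q9 -<Price>-> q6.]

q    HLang(q)         VDom(q)
q0   〈Quote〉          {ε}
q1   〈Line〉+          {ε}
q9   〈Desc〉〈Price〉    {ε}
q5   {ε}              STRS
q6   {ε}              INTS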
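The pairwise conditions on this slide suggest a simultaneous walk over the two VLang DFAs. The following sketch is one possible reading and should not be taken as the authors' algorithm. It assumes that both SAs are free of useless states, it encodes each element name as a single letter (a hypothetical encoding), and it replaces exact regular-expression containment, which is expensive in general, by a bounded check over all words up to a fixed length. The only value-domain inclusion it knows is INTS ⊆ DECS, taken from the example.

```python
import re
from itertools import product

ALPHABET = "QOLDPRT"  # Quote, Order, Line, Desc, Price, pRoduct, qTy (hypothetical encoding)


def hlang_subset(re_b, re_a, max_len=3):
    """Bounded approximation: every word up to max_len in L(re_b) is also in L(re_a)."""
    for n in range(max_len + 1):
        for letters in product(ALPHABET, repeat=n):
            word = "".join(letters)
            if re.fullmatch(re_b, word) and not re.fullmatch(re_a, word):
                return False
    return True


VDOM_INCLUSIONS = {("INTS", "DECS")}  # assumption: the only strict inclusion modelled here


def vdom_subset(vb, va):
    return vb == va or (vb, va) in VDOM_INCLUSIONS


def is_subschema(b, a):
    """Walk B and A in lockstep from their start states and compare paired states."""
    stack, seen = [(b["start"], a["start"])], set()
    while stack:
        qb, qa = stack.pop()
        if (qb, qa) in seen:
            continue
        seen.add((qb, qa))
        if not vdom_subset(b["vdom"][qb], a["vdom"][qa]):
            return False
        if not hlang_subset(b["hlang"][qb], a["hlang"][qa]):
            return False
        for (src, sym), dst_b in b["delta"].items():
            if src == qb:
                dst_a = a["delta"].get((qa, sym))
                if dst_a is None:  # A has no type for a child element that B allows
                    return False
                stack.append((dst_b, dst_a))
    return True


SCHEMA_A = {
    "start": "q0",
    "delta": {("q0", "Q"): "q1", ("q0", "O"): "q2", ("q1", "L"): "q9",
              ("q2", "L"): "q4", ("q4", "R"): "q9", ("q4", "T"): "q8",
              ("q9", "D"): "q5", ("q9", "P"): "q6"},
    "hlang": {"q0": "Q|O", "q1": "L+", "q2": "L+", "q9": "DP", "q4": "RT",
              "q5": "", "q6": "", "q8": ""},
    "vdom": {"q0": "EPS", "q1": "EPS", "q2": "EPS", "q9": "EPS", "q4": "EPS",
             "q5": "STRS", "q6": "DECS", "q8": "INTS"},
}
SCHEMA_B = {
    "start": "q0",
    "delta": {("q0", "Q"): "q1", ("q1", "L"): "q9",
              ("q9", "D"): "q5", ("q9", "P"): "q6"},
    "hlang": {"q0": "Q", "q1": "L+", "q9": "DP", "q5": "", "q6": ""},
    "vdom": {"q0": "EPS", "q1": "EPS", "q9": "EPS", "q5": "STRS", "q6": "INTS"},
}

b_in_a = is_subschema(SCHEMA_B, SCHEMA_A)
a_in_b = is_subschema(SCHEMA_A, SCHEMA_B)
```

Because the containment check is bounded, a positive answer from this sketch is only evidence, not proof. An exact test would compare the automata of the two regular expressions instead.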


Subschema extraction

We have developed the subschema extraction algorithm:

- Given SA (XSD) A and a set of symbols (element names) Z, compute an SA which accepts all instances (XML documents) of A except those containing some symbols not in Z.

[Figure: VLang DFA of SA A, with transitions q0 -<Quote>-> q1, q0 -<Order>-> q7, q1 -<Line>-> q2, q7 -<Line>-> q3, q2 -<Desc>-> q4, q2 -<Price>-> q5, q3 -<Product>-> q2, q3 -<Qty>-> q6.]

q    HLang(q)           VDom(q)
q0   <Quote>|<Order>    {ε}
q1   <Line>+            {ε}
q7   <Line>+            {ε}
q2   <Desc><Price>      {ε}
q3   <Product><Qty>     {ε}
q4   {ε}                STRINGS
q5   {ε}                DECIMALS
q6   {ε}                INTEGERS

- Z = {<Quote>, <Line>, <Desc>, <Price>, <Order>, <Qty>}, where <Product> is excluded.
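To illustrate what extraction with this Z does to the SA above, here is a simplified sketch. It is not the authors' algorithm. It assumes that each HLang can be written as a list of alternatives, each alternative being the set of child symbols that it requires, and the function name `extract_subschema` is hypothetical. The sketch drops every alternative that needs a symbol outside Z and then discards states that can no longer be completed or reached.

```python
def extract_subschema(start, delta, alternatives, allowed):
    # 1. Drop every alternative that needs a symbol outside Z.
    alts = {q: [alt for alt in q_alts if alt <= allowed]
            for q, q_alts in alternatives.items()}
    # 2. A state is live if some remaining alternative needs only live children.
    live, changed = set(), True
    while changed:
        changed = False
        for q, q_alts in alts.items():
            if q not in live and any(
                all(delta[(q, a)] in live for a in alt) for alt in q_alts
            ):
                live.add(q)
                changed = True
    # 3. Keep the states reachable from start through alternatives that are still viable.
    new_alts = {}
    stack = [start] if start in live else []
    while stack:
        q = stack.pop()
        if q in new_alts:
            continue
        new_alts[q] = [alt for alt in alts[q]
                       if all(delta[(q, a)] in live for a in alt)]
        for alt in new_alts[q]:
            stack.extend(delta[(q, a)] for a in alt)
    new_delta = {(q, a): delta[(q, a)]
                 for q, q_alts in new_alts.items() for alt in q_alts for a in alt}
    return new_delta, new_alts


DELTA = {("q0", "Quote"): "q1", ("q0", "Order"): "q7",
         ("q1", "Line"): "q2", ("q7", "Line"): "q3",
         ("q2", "Desc"): "q4", ("q2", "Price"): "q5",
         ("q3", "Product"): "q2", ("q3", "Qty"): "q6"}
ALTERNATIVES = {"q0": [{"Quote"}, {"Order"}], "q1": [{"Line"}], "q7": [{"Line"}],
                "q2": [{"Desc", "Price"}], "q3": [{"Product", "Qty"}],
                "q4": [set()], "q5": [set()], "q6": [set()]}
Z = {"Quote", "Line", "Desc", "Price", "Order", "Qty"}

new_delta, new_alts = extract_subschema("q0", DELTA, ALTERNATIVES, Z)
```

Under this simplified model, excluding <Product> leaves q3 without any valid content, which in turn removes q7 and the <Order> branch of q0. The result keeps only q0, q1, q2, q4, and q5, that is, the <Quote> part of the schema.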


xCBL compatibility testing experiment

1. Data sets: XML Common Business Library (xCBL)

   XSD        file size   no. of files   data types   element names   doc. types
   xCBL 3.0   1.8MB       413            1,290        3,728           42
   xCBL 3.5   2.0MB       496            1,476        4,473           51

2. The subschema testing program has disproved the claim on xCBL.org:

   "The only modifications allowed to xCBL 3.0 documents were the additions of new optional elements and additions to code lists; to maintain interoperability between the two versions. An xCBL 3.0 instance of a document is also a valid instance in xCBL 3.5."

3. xCBL 3.5 is not a superschema of xCBL 3.0.

4. The experiment took only 272ms when the quick RE test was applied.

   - Machine: [email protected], 4GB RAM, Linux OS

Schema size reduction by subschema extraction

1. The subschema extraction program was run to extract different subschemas from xCBL. Each subschema recognizes a different element subset for a specific application, e.g., order, invoice, etc.

2. The schema size was reduced to 6–32% of the original size.

3. The time required by XMLBeans to compile a subschema was reduced to 34–50% of the time originally required.

4. The time to extract such a subschema was only 2–3s.

[Figure: Subschema extraction from xCBL 3.5. For the original schema and the invoice, order, quote, auction, and catalog subschemas, the chart shows the number of element names, types, and element declarations (axis from 0 to 5000) and the XMLBeans compilation time in seconds (axis from 0 to 35).]

Conclusions

1. We have developed:

   - formal models for XML and XSD, and
   - algorithms for schema equivalence testing, subschema testing, and subschema extraction.

2. These algorithms are PSPACE-complete because of comparisons of regular expressions.

   - We have developed a heuristic (quick RE test) to make these algorithms run fast on very large schemas.

3. Our experiments:

   - have shown that xCBL 3.5 is in fact not backward-compatible with xCBL 3.0, and
   - have extracted small subschemas from xCBL for different instance subsets, which substantially reduces the processing time on these subschemas.

4. These models can be extended for other applications:

   - a web service adaptor for legacy systems (text to XML transformation), and
   - a schema inferrer from XML instances.
