XML Schema Computations: Schema Compatibility Testing and Subschema Extraction
Thomas Y.T. LEE and David W.L. Cheung
Department of Computer Science, The University of Hong Kong
October 28, 2010, CIKM 2010
Toronto, Canada
Outline
Introduction and motivation
Formal models for XML data and schemas
Schema computational algorithms
Experiments and conclusions
Data interoperability on web services
In order for two web services to be interoperable, the XML schema on the message-receiving end must accept all possible XML messages from the sending end.
- The sending schema must be a subschema of the receiving schema.
[Figure: Web Services A and B exchange XML instances; the instance sets of Schema A and Schema B are related by set inclusion (⊆).]
W3C XML Schema and data standards
1. W3C XML Schema (XSD) is the most popular schema language to define data standards.
2. In order for the new version of an XSD to be backward-compatible with the old version, the new version must be a superschema of the old version.
   - The new schema must accept every instance of the old schema.
3. However, a typical e-commerce standard XSD contains thousands of types / elements, which makes manual verification of compatibility hardly possible.
4. When an XSD is too large, how can we extract a smaller subschema just enough for processing by a specific application?
Schema compatibility problems
1. Given two XSDs, how can we verify that they are equivalent, or that one is a subschema of the other?
2. Given an XSD A, how can we extract a smaller subschema B of A so that B recognizes only a subset of the elements recognized by A?
3. In this research, we have developed formal models for XML data and schemas, as well as algorithms to solve these problems.
Data Tree (DT) to model XML data
A DT is a tree where edges represent elements and nodes represent their contents.
<Quote>
<Line>
<Desc>hPhone</Desc>
<Price>499.9</Price>
</Line>
<Line>
<Desc>iMat</Desc>
<Price>999.9</Price>
</Line>
</Quote>
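[Figure: the DT for the XML above. Each edge carries an element name; each node n:v holds content v, with ε for no text content.]
n0:ε
  <Quote> n1:ε
    <Line> n2:ε
      <Desc> n4:"hPhone"
      <Price> n5:"499.9"
    <Line> n3:ε
      <Desc> n6:"iMat"
      <Price> n7:"999.9"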
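As a rough illustration of the DT model, the sketch below builds a data tree from the XML above using Python's standard `xml.etree.ElementTree`. The `Node` class and the `build` and `data_tree` functions are names invented for this example, not part of the original work, and the empty string stands in for ε.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """A DT node: its content plus outgoing edges labelled by element names."""
    value: str = ""                             # "" stands in for ε
    edges: list = field(default_factory=list)   # [(element name, child Node)]


def build(elem):
    """Return the DT node that holds the content of XML element `elem`."""
    node = Node(value=(elem.text or "").strip())
    for child in elem:
        node.edges.append((child.tag, build(child)))
    return node


def data_tree(xml_text):
    """Wrap the document element in a root node (n0 in the figure)."""
    doc = ET.fromstring(xml_text)
    return Node(edges=[(doc.tag, build(doc))])


XML = """<Quote>
  <Line><Desc>hPhone</Desc><Price>499.9</Price></Line>
  <Line><Desc>iMat</Desc><Price>999.9</Price></Line>
</Quote>"""

dt = data_tree(XML)
quote = dt.edges[0][1]          # node n1, reached over the <Quote> edge
for name, line in quote.edges:  # the two <Line> edges
    print(name, [(n, c.value) for n, c in line.edges])
```

Whitespace-only text between child elements is stripped here so that interior nodes carry ε; that is a simplification chosen for the example.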
Schema Automaton (SA) to model XML schemas
1. An SA is a deterministic finite automaton (DFA) where each state is associated with a regular expression (RE) and a set of values called the value domain (VDom).
2. The DFA, called the vertical language (VLang), defines how the symbols are arranged along the paths from the root to the leaves.
   2.1 Each state represents an XSD data type and each symbol represents an element name.
3. The RE of a state, called the horizontal language (HLang), defines how child elements can be arranged under an XSD data type, i.e., its content model.
4. The value domain defines the set of all possible values an element can contain.
Example SA
[Figure: VLang DFA of the example SA. Edges: q0 -<Quote>-> q1, q0 -<Order>-> q2, q1 -<Line>-> q3, q2 -<Line>-> q4, q3 -<Desc>-> q5, q3 -<Price>-> q6, q4 -<Product>-> q7, q4 -<Qty>-> q8, q7 -<Desc>-> q5, q7 -<Price>-> q6.]
q    HLang(q)           VDom(q)
q0   <Quote>|<Order>    {ε}
q1   <Line>+            {ε}
q2   <Line>+            {ε}
q3   <Desc><Price>      {ε}
q4   <Product><Qty>     {ε}
q5   {ε}                STRINGS
q6   {ε}                DECIMALS
q7   <Desc><Price>      {ε}
q8   {ε}                INTEGERS
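To make the three components concrete, the sketch below encodes the example SA in Python and checks a data tree against it. It is only an illustration under assumptions: HLangs are written as Python regular expressions over `<Name>` tokens, the VDom predicates use simplified lexical forms for decimals and integers, and `accepts`, `leaf`, and the tuple encoding of DT nodes are names made up for this example.

```python
import re

# VLang DFA: (state, element name) -> next state.
DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q2",
    ("q1", "Line"): "q3", ("q2", "Line"): "q4",
    ("q3", "Desc"): "q5", ("q3", "Price"): "q6",
    ("q4", "Product"): "q7", ("q4", "Qty"): "q8",
    ("q7", "Desc"): "q5", ("q7", "Price"): "q6",
}

# HLang per state as a regex over "<Name>" tokens; "" means "no children".
HLANG = {
    "q0": r"<Quote>|<Order>", "q1": r"(<Line>)+", "q2": r"(<Line>)+",
    "q3": r"<Desc><Price>", "q4": r"<Product><Qty>",
    "q5": r"", "q6": r"", "q7": r"<Desc><Price>", "q8": r"",
}


def is_eps(v):      # VDom {ε}
    return v == ""


def is_string(v):   # VDom STRINGS
    return True


def is_decimal(v):  # VDom DECIMALS (simplified lexical form, an assumption)
    return re.fullmatch(r"[+-]?\d+(\.\d+)?", v) is not None


def is_integer(v):  # VDom INTEGERS (simplified lexical form, an assumption)
    return re.fullmatch(r"[+-]?\d+", v) is not None


VDOM = {
    "q0": is_eps, "q1": is_eps, "q2": is_eps, "q3": is_eps, "q4": is_eps,
    "q5": is_string, "q6": is_decimal, "q7": is_eps, "q8": is_integer,
}


def accepts(node, state="q0"):
    """node = (value, [(element name, child node), ...])."""
    value, edges = node
    if not VDOM[state](value):
        return False
    word = "".join(f"<{name}>" for name, _ in edges)
    if re.fullmatch(HLANG[state], word) is None:
        return False
    # Every symbol used by an HLang above has an outgoing edge in DELTA.
    return all(accepts(child, DELTA[(state, name)]) for name, child in edges)


def leaf(value):
    return (value, [])


def line(desc, price):
    return ("", [("Desc", leaf(desc)), ("Price", leaf(price))])


quote_doc = ("", [("Quote", ("", [("Line", line("hPhone", "499.9")),
                                  ("Line", line("iMat", "999.9"))]))])
order_doc = ("", [("Order", ("", [("Line", line("hPhone", "499.9"))]))])

print(accepts(quote_doc))
print(accepts(order_doc))
```

The second document is rejected because a `<Line>` under `<Order>` lands in q4, whose HLang requires `<Product><Qty>`: the same element name maps to different types depending on its ancestors, which is what the VLang captures.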
Schema compatibility testing
1. Schema equivalence testing and subschema testing.
2. A schema minimization is involved.
   2.1 All useless states (data types) are removed first. A useless state is an inaccessible state or a state which does not recognize any element with a finite number of descendants.
   2.2 The process is like a DFA minimization, but the HLang and VDom of each state are considered when deciding whether two states can be merged.
3. We have proved that two SAs (XSDs) are equivalent iff their minimized forms have isomorphic VLang DFAs and all corresponding HLangs and VDoms are equivalent.
4. We have developed an algorithm to verify whether an SA is a subschema of another SA.
Useless states
[Figure: VLang DFA over states q0 to q9 with edges labelled A, B, and C.]
q    HLang(q)     VDom(q)
q0   A{2,5}BC?    STRINGS
q1   C*           STRINGS
q2   {ε}          INTEGERS
q3   A*           STRINGS
q4   B+           STRINGS
q5   C            STRINGS
q6   A+B*         INTEGERS
q7   A?           STRINGS
q8   B*           STRINGS
q9   {ε}          DECIMALS
1. q7 and q8 are inaccessible.
2. q5 and q6 are irrational because they generate an infinite number of children.
3. q9 is useless because it is blocked by irrational states.
4. q4 is useless because it must lead to an irrational state.
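The useless-state removal described above can be sketched as two fixpoint passes: first find the states that recognize at least one finite subtree, then keep only those reachable from q0 through such states. The code below is an illustration, not the authors' implementation. Two things in it are assumptions: the edge targets in `DELTA` are one reading of the diagram, chosen to agree with the four observations on the slide, and each HLang is abstracted by hand into `REQUIRED`, the set of symbols that must occur in every word.

```python
# Edge targets: an ASSUMED reading of the diagram (not recoverable verbatim).
DELTA = {
    ("q0", "A"): "q1", ("q0", "B"): "q2", ("q0", "C"): "q4",
    ("q1", "C"): "q3", ("q3", "A"): "q1",
    ("q4", "B"): "q5", ("q5", "C"): "q6",
    ("q6", "A"): "q5", ("q6", "B"): "q9",
    ("q7", "A"): "q8", ("q8", "B"): "q7",
}

# Hand-made abstraction of each HLang: alternatives, each given as the set
# of symbols that every word of that alternative must contain.
REQUIRED = {
    "q0": [{"A", "B"}],  # A{2,5}BC?
    "q1": [set()],       # C*
    "q2": [set()],       # {ε}
    "q3": [set()],       # A*
    "q4": [{"B"}],       # B+
    "q5": [{"C"}],       # C
    "q6": [{"A"}],       # A+B*
    "q7": [set()],       # A?
    "q8": [set()],       # B*
    "q9": [set()],       # {ε}
}


def rational_states():
    """States that recognize at least one subtree of finite size."""
    rational, changed = set(), True
    while changed:
        changed = False
        for q, alternatives in REQUIRED.items():
            if q in rational:
                continue
            # q is rational if some alternative needs only rational children.
            if any(all(DELTA.get((q, s)) in rational for s in alt)
                   for alt in alternatives):
                rational.add(q)
                changed = True
    return rational


def useful_states(start="q0"):
    """Rational states reachable from `start` through rational states only."""
    rational = rational_states()
    seen = set()
    todo = [start] if start in rational else []
    while todo:
        q = todo.pop()
        if q in seen:
            continue
        seen.add(q)
        todo.extend(dst for (src, _), dst in DELTA.items()
                    if src == q and dst in rational)
    return seen


print(sorted(rational_states()))
print(sorted(useful_states()))
```

Under these assumptions q4, q5, and q6 never become rational, q9 is rational but reachable only through q6, and q7 and q8 are rational but unreachable, so only q0 to q3 survive.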
Schema minimization and equivalence
Schema A
[Figure: VLang DFA of Schema A. Edges: q0 -<Quote>-> q1, q0 -<Order>-> q2, q1 -<Line>-> q3, q2 -<Line>-> q4, q3 -<Desc>-> q5, q3 -<Price>-> q6, q4 -<Product>-> q7, q4 -<Qty>-> q8, q7 -<Desc>-> q5, q7 -<Price>-> q6.]
q    HLang(q)           VDom(q)
q0   <Quote>|<Order>    {ε}
q1   <Line>+            {ε}
q2   <Line>+            {ε}
q3   <Desc><Price>      {ε}
q4   <Product><Qty>     {ε}
q5   {ε}                STRS
q6   {ε}                DECS
q7   <Desc><Price>      {ε}
q8   {ε}                INTS
1. q3 and q7 can be merged into q9.
2. The two SAs are equivalent.
Schema B
[Figure: VLang DFA of Schema B. Edges: q0 -<Quote>-> q1, q0 -<Order>-> q2, q1 -<Line>-> q9, q2 -<Line>-> q4, q4 -<Product>-> q9, q4 -<Qty>-> q8, q9 -<Desc>-> q5, q9 -<Price>-> q6.]
q    HLang(q)           VDom(q)
q0   <Quote>|<Order>    {ε}
q1   <Line>+            {ε}
q2   <Line>+            {ε}
q9   <Desc><Price>      {ε}
q4   <Product><Qty>     {ε}
q5   {ε}                STRS
q6   {ε}                DECS
q8   {ε}                INTS
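The merging step can be pictured as DFA-style partition refinement in which the initial partition groups states by their (HLang, VDom) pair. The sketch below applies it to Schema A. It is a simplification: HLangs are compared as plain strings, whereas a real implementation would need regular-expression equivalence, and `merge_classes` is a name made up for this example.

```python
STATES = ["q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8"]

DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q2",
    ("q1", "Line"): "q3", ("q2", "Line"): "q4",
    ("q3", "Desc"): "q5", ("q3", "Price"): "q6",
    ("q4", "Product"): "q7", ("q4", "Qty"): "q8",
    ("q7", "Desc"): "q5", ("q7", "Price"): "q6",
}

# (HLang, VDom) per state, compared here by string equality (an assumption).
LABEL = {
    "q0": ("<Quote>|<Order>", "{ε}"),
    "q1": ("<Line>+", "{ε}"),
    "q2": ("<Line>+", "{ε}"),
    "q3": ("<Desc><Price>", "{ε}"),
    "q4": ("<Product><Qty>", "{ε}"),
    "q5": ("{ε}", "STRS"),
    "q6": ("{ε}", "DECS"),
    "q7": ("<Desc><Price>", "{ε}"),
    "q8": ("{ε}", "INTS"),
}


def merge_classes(states, delta, label):
    """Return {state: class id}; states sharing an id can be merged."""
    ids = {}
    # Initial partition: one block per distinct (HLang, VDom) label.
    block = {q: ids.setdefault(label[q], len(ids)) for q in states}
    while True:
        ids, refined = {}, {}
        for q in states:
            # Signature: current block plus the block reached on each symbol.
            succ = tuple(sorted((sym, block[dst])
                                for (src, sym), dst in delta.items()
                                if src == q))
            refined[q] = ids.setdefault((block[q], succ), len(ids))
        if len(set(refined.values())) == len(set(block.values())):
            return refined  # no block was split: the partition is stable
        block = refined


classes = merge_classes(STATES, DELTA, LABEL)
groups = {}
for q, c in classes.items():
    groups.setdefault(c, []).append(q)
print([g for g in groups.values() if len(g) > 1])
```

q1 and q2 share a label but are split in the first round because their `<Line>` edges lead to differently labelled states; q3 and q7 stay together, which matches the merge into q9 above.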
Subschema testing
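Schema A
[Figure: VLang DFA of Schema A. Edges: q0 -<Quote>-> q1, q0 -<Order>-> q2, q1 -<Line>-> q9, q2 -<Line>-> q4, q4 -<Product>-> q9, q4 -<Qty>-> q8, q9 -<Desc>-> q5, q9 -<Price>-> q6.]
q    HLang(q)           VDom(q)
q0   <Quote>|<Order>    {ε}
q1   <Line>+            {ε}
q2   <Line>+            {ε}
q9   <Desc><Price>      {ε}
q4   <Product><Qty>     {ε}
q5   {ε}                STRS
q6   {ε}                DECS
q8   {ε}                INTS
Schema B
[Figure: VLang DFA of Schema B. Edges: q0 -<Quote>-> q1, q1 -<Line>-> q9, q9 -<Desc>-> q5, q9 -<Price>-> q6.]
q    HLang(q)           VDom(q)
q0   <Quote>            {ε}
q1   <Line>+            {ε}
q9   <Desc><Price>      {ε}
q5   {ε}                STRS
q6   {ε}                INTS
B is a subschema of A.
1. $\mathrm{HLang}(q_0^B) \subseteq \mathrm{HLang}(q_0^A)$ and $\mathrm{VDom}(q_0^B) = \mathrm{VDom}(q_0^A)$.
2. $\mathrm{HLang}(q_6^B) = \mathrm{HLang}(q_6^A)$ and $\mathrm{VDom}(q_6^B) \subseteq \mathrm{VDom}(q_6^A)$.
3. $\mathrm{HLang}(q_i^B) = \mathrm{HLang}(q_i^A)$ and $\mathrm{VDom}(q_i^B) = \mathrm{VDom}(q_i^A)$, for $i = 1, 5, 9$.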
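The state-by-state conditions above suggest a parallel walk over the two automata. The sketch below does this for Schemas A and B; it is not the authors' algorithm. It assumes useless states have already been removed, replaces exact regular-expression inclusion with a bounded check over words of at most `max_len` symbols, and hard-codes the single VDom inclusion used on the slide (INTS ⊆ DECS). All function names are hypothetical.

```python
import re
from itertools import product

INNER = "{ε}"
A = {
    "delta": {("q0", "Quote"): "q1", ("q0", "Order"): "q2",
              ("q1", "Line"): "q9", ("q2", "Line"): "q4",
              ("q4", "Product"): "q9", ("q4", "Qty"): "q8",
              ("q9", "Desc"): "q5", ("q9", "Price"): "q6"},
    "hlang": {"q0": r"<Quote>|<Order>", "q1": r"(<Line>)+", "q2": r"(<Line>)+",
              "q9": r"<Desc><Price>", "q4": r"<Product><Qty>",
              "q5": r"", "q6": r"", "q8": r""},
    "vdom": {"q0": INNER, "q1": INNER, "q2": INNER, "q9": INNER, "q4": INNER,
             "q5": "STRS", "q6": "DECS", "q8": "INTS"},
}
B = {
    "delta": {("q0", "Quote"): "q1", ("q1", "Line"): "q9",
              ("q9", "Desc"): "q5", ("q9", "Price"): "q6"},
    "hlang": {"q0": r"<Quote>", "q1": r"(<Line>)+", "q9": r"<Desc><Price>",
              "q5": r"", "q6": r""},
    "vdom": {"q0": INNER, "q1": INNER, "q9": INNER, "q5": "STRS", "q6": "INTS"},
}

VDOM_SUBSET = {("INTS", "DECS")}  # the only strict inclusion assumed here


def words(tokens, max_len):
    """All token sequences of length 0..max_len, joined into strings."""
    for n in range(max_len + 1):
        for combo in product(tokens, repeat=n):
            yield "".join(combo)


def hlang_included(re_b, re_a, tokens, max_len):
    """Bounded check only: short words of re_b must also match re_a."""
    return all(re.fullmatch(re_a, w) is not None
               for w in words(tokens, max_len)
               if re.fullmatch(re_b, w) is not None)


def vdom_included(dom_b, dom_a):
    return dom_b == dom_a or (dom_b, dom_a) in VDOM_SUBSET


def is_subschema(b, a, start="q0", max_len=3):
    """Walk b and a in parallel, comparing HLang and VDom at each state pair."""
    seen, todo = set(), [(start, start)]
    while todo:
        qb, qa = todo.pop()
        if (qb, qa) in seen:
            continue
        seen.add((qb, qa))
        out_b = {s: dst for (src, s), dst in b["delta"].items() if src == qb}
        tokens = [f"<{s}>" for s in out_b]
        if not vdom_included(b["vdom"][qb], a["vdom"][qa]):
            return False
        if not hlang_included(b["hlang"][qb], a["hlang"][qa], tokens, max_len):
            return False
        for s, dst_b in out_b.items():
            dst_a = a["delta"].get((qa, s))
            if dst_a is None:
                return False
            todo.append((dst_b, dst_a))
    return True


print(is_subschema(B, A), is_subschema(A, B))
```

Because the HLang comparison is bounded, a `True` result is evidence rather than proof for schemas whose content models need longer words to tell apart; an exact test would compare the regular languages through automata.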
Subschema extraction
We have developed the subschema extraction algorithm:
- Given an SA (XSD) A and a set of symbols (element names) Z, compute an SA which accepts all instances (XML documents) of A except those containing some symbols not in Z.
[Figure: VLang DFA of the example SA. Edges: q0 -<Quote>-> q1, q0 -<Order>-> q7, q1 -<Line>-> q2, q7 -<Line>-> q3, q2 -<Desc>-> q4, q2 -<Price>-> q5, q3 -<Product>-> q2, q3 -<Qty>-> q6.]
q    HLang(q)           VDom(q)
q0   <Quote>|<Order>    {ε}
q1   <Line>+            {ε}
q7   <Line>+            {ε}
q2   <Desc><Price>      {ε}
q3   <Product><Qty>     {ε}
q4   {ε}                STRINGS
q5   {ε}                DECIMALS
q6   {ε}                INTEGERS
- Z = {<Quote>, <Line>, <Desc>, <Price>, <Order>, <Qty>}, where <Product> is excluded.
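One way to read the definition above is: drop every edge whose symbol is outside Z, then remove the states that have become useless. The sketch below follows that reading on the example SA. It is an assumption-laden illustration rather than the authors' algorithm: HLangs are Python regexes over `<Name>` tokens, emptiness of a restricted HLang is decided by a bounded search over words of at most `max_len` symbols, and the HLang expressions themselves are not rewritten.

```python
import re
from itertools import product

DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q7",
    ("q1", "Line"): "q2", ("q7", "Line"): "q3",
    ("q2", "Desc"): "q4", ("q2", "Price"): "q5",
    ("q3", "Product"): "q2", ("q3", "Qty"): "q6",
}
HLANG = {
    "q0": r"<Quote>|<Order>", "q1": r"(<Line>)+", "q7": r"(<Line>)+",
    "q2": r"<Desc><Price>", "q3": r"<Product><Qty>",
    "q4": r"", "q5": r"", "q6": r"",
}
Z = {"Quote", "Line", "Desc", "Price", "Order", "Qty"}  # <Product> excluded


def words(tokens, max_len):
    """All token sequences of length 0..max_len, joined into strings."""
    for n in range(max_len + 1):
        for combo in product(tokens, repeat=n):
            yield "".join(combo)


def extract(delta, hlang, z, start="q0", max_len=3):
    # Step 1: keep only edges labelled with a symbol in Z.
    kept = {(q, s): dst for (q, s), dst in delta.items() if s in z}
    # Step 2: fixpoint for states that still recognize a finite subtree
    # using only kept edges into states already known to be rational.
    rational, changed = set(), True
    while changed:
        changed = False
        for q in hlang:
            if q in rational:
                continue
            tokens = [f"<{s}>" for (src, s), dst in kept.items()
                      if src == q and dst in rational]
            if any(re.fullmatch(hlang[q], w) is not None
                   for w in words(tokens, max_len)):
                rational.add(q)
                changed = True
    # Step 3: keep what is reachable from the start state.
    reach = set()
    todo = [start] if start in rational else []
    while todo:
        q = todo.pop()
        if q in reach:
            continue
        reach.add(q)
        todo.extend(dst for (src, _), dst in kept.items()
                    if src == q and dst in rational)
    edges = {(q, s): dst for (q, s), dst in kept.items()
             if q in reach and dst in rational}
    return reach, edges


states, edges = extract(DELTA, HLANG, Z)
print(sorted(states))
print(sorted(edges))
```

Under this reading the whole `<Order>` branch disappears even though `<Order>` and `<Qty>` are in Z: every order line requires a `<Product>`, so no instance rooted at `<Order>` avoids the excluded symbol.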
xCBL compatibility testing experiment
1. Data sets: XML Common Business Library (xCBL)
   XSD        file size   no. of files   data types   element names   doc. types
   xCBL 3.0   1.8MB       413            1,290        3,728           42
   xCBL 3.5   2.0MB       496            1,476        4,473           51
2. The subschema testing program has disproved the claim on xCBL.org:
   "The only modifications allowed to xCBL 3.0 documents were the additions of new optional elements and additions to code lists; to maintain interoperability between the two versions. An xCBL 3.0 instance of a document is also a valid instance in xCBL 3.5."
3. xCBL 3.5 is not a superschema of xCBL 3.0.
4. The experiment took only 272ms when the quick RE test was applied.
   - Machine: [email protected], 4GB RAM, Linux OS
Schema size reduction by subschema extraction
1. The subschema extraction program was run to extract different subschemas from xCBL. Each subschema recognizes a different element subset for a specific application, e.g., order, invoice, etc.
2. The schema size was reduced to 6–32% of the original size.
3. The time required by XMLBeans to compile a subschema wasreduced to 34–50% of the time originally required.
4. The time to extract such a subschema was only 2–3s.
[Figure: Subschema extraction from xCBL 3.5. For the original schema and the invoice, order, quote, auction, and catalog subschemas, the plot shows the numbers of element names, types, and element declarations (axis "number", 0 to 5000) and the XMLBeans compilation time (axis "time (second)", 0 to 35).]
Conclusions
1. We have developed:
   - formal models for XML and XSD, and
   - algorithms for schema equivalence and subschema testing, and subschema extraction.
2. These algorithms are PSPACE-complete because of comparisons of regular expressions.
   - We have developed a heuristic (quick RE test) to make these algorithms run fast on very large schemas.
3. Our experiments:
   - have proved that xCBL 3.5 is in fact not backward-compatible with xCBL 3.0, and
   - have extracted small subschemas from xCBL for different instance subsets, which largely reduce processing time on these subschemas.
4. These models can be extended for other applications:
   - web service adaptor for legacy systems (text to XML transformation), and
   - schema inferrer from XML instances.