Post on 16-Dec-2015
Storing and Querying XML Documents Using Relational Databases
Mustafa Ataymatay@wayne.edu
Wayne State UniversityDetroit, MI
February 28, 2006
2/28/2006 Wayne State University 2
Outline of Talk What is XML? HTML vs. XML Problem
Statement Schema-based
Relational Approach
Schema Mapping
Data Mapping Query Mapping Reconstruction Conclusions
2/28/2006 Wayne State University 3
What is XML?
eXtensible Markup Language primarily created by Jon Bosak of
Sun Microsystems officially recommended by W3C
(World Wide Web Consortium) since 1998
a simplified form of SGML (Standard Generalized Markup Language)
2/28/2006 Wayne State University 4
What is XML? (cont.) a meta language
allows you to create and format your own document markups
separates content from format
a method for putting structured data into a text file; these files are
easy to read unambiguous extensible platform-independent
2/28/2006 Wayne State University 5
HTML vs. XML<html> <body>
<table border=1> <tr>
<td><b><font color=red>First Value</td><td><b><font type="Arial" color=blue>Second Value</td><tr><td><b>87</td><td><b>99</td><tr><td><b>45</td><td><b>67</td><tr><td><b>86</td><td><b>84</td></table>
</body></html>
2/28/2006 Wayne State University 6
HTML vs. XML (cont.)<?xml version="1.0"?><!DOCTYPE house SYSTEM "house.dtd">
<house year=“1980”> <entrance>front door</entrance> <entrance>back door</entrance> <rooms> <living_room>
<window>double hung1</window> <window>double hung2</window> <leads_to>kitchen</leads_to> <leads_to>hallway</leads_to> </living_room> <kitchen> <window>double hung1</window> <leads_to>living_room</leads_to> </kitchen> </rooms></house>
2/28/2006 Wayne State University 7
HTML vs. XML (cont.) HTML- uses tags and attributes
- content and formatting can be placed together
- tags and attributes are pre-determined and rigid
- describes what a document looks like
- doesn’t allow user to define content rules
XML- uses tags and attributes- content and format are
separate; formatting is contained in a stylesheet
- allows user to create his/her own set of tags and attributes
- describes the information in a document
- allows user to define content rules (DTD)
2/28/2006 Wayne State University 8
Why Storing and Querying XML?
XML has emerged as the standard for representing and exchanging data on the World Wide Web.
The increasing amount of XML documents requires the need to store and query XML data efficiently.
2/28/2006 Wayne State University 9
A Sample XML Dataset
European Bioinformatics Institute Databases ftp://ftp.ebi.ac.uk/pub/databases/interpro/ match.xml ~ 700MB
2/28/2006 Wayne State University 10
Approaches of Storing and Querying XML Documents
using Native XML repositories Software AG’s Tamino eXcelon’s XIS
using XML-enabled commercial database systems Oracle XML DB DB2 XML Extender Microsoft SQLXML
using RDBMS/ORDBMS to store and query XML documents (Relational Approach)
2/28/2006 Wayne State University 11
Why to store XML in RDBMS? to get advantage of mature RDBMS
technology in efficient storage, indexing and optimization techniques
to enable companies or researchers to store and query XML data using their existing RDBMS system
to enable processing of transformed XML data using both XML and relational queries from a middleware environment
2/28/2006 Wayne State University 12
Relational Approach XML-Publishing
XPERANTO - Carey et al., WebDB’00 SilkRoute – M. Fernandez et al., WWW’00
Schema-less approach Edge – D. Florescu et al., IEEE DEB’99 STORED – A. Deutsch et al., SIGMOD’99
Schema-based approach Basic, Shared and Hybrid inlining –
J. Shanmugasundaram et al., VLDB’99 ODTDMap – M. Atay et al., IS’06
2/28/2006 Wayne State University 13
Schema-based Relational Approach
Schema Mapping XML data model is mapped into the relational
model Data Mapping
XML documents are shredded and composed into tuples to be inserted into the relational database
Query Mapping XML queries are translated into SQL queries
Reverse Data Mapping (Reconstruction) Original XML document is recovered from the
RDBMS
2/28/2006 Wayne State University 14
Schema Mapping
Schema mapping algorithm ODTDMap contains the following steps: Simplifying DTDs Creating and inlining DTD graphs Generating relational schema and the
schema mapping file.
2/28/2006 Wayne State University 15
Sample DTD – univ.dtd
<!DOCTYPE univ [ <!ELEMENT univ (colleges, schools?) > <!ATTLIST univ uName CDATA #REQUIRED> <!ELEMENT colleges (college+) > <!ELEMENT college (dep*) > <!ATTLIST college cName CDATA
#REQUIRED> <!ELEMENT schools (school+) > <!ELEMENT school (dep*) > <!ATTLIST school sName CDATA #REQUIRED> <!ELEMENT dep (tel?, fax?, website?) > <!ATTLIST dep dName CDATA #REQUIRED> <!ELEMENT tel (#PCDATA) > <!ELEMENT fax (#PCDATA) > <!ELEMENT website (#PCDATA) >]>
2/28/2006 Wayne State University 16
Creating DTD Graph<!DOCTYPE univ [ <!ELEMENT univ (colleges, schools?) > <!ATTLIST univ uName CDATA
#REQUIRED> <!ELEMENT colleges (college+) > <!ELEMENT college (dep*) > <!ATTLIST college cName CDATA
#REQUIRED> <!ELEMENT schools (school+) > <!ELEMENT school (dep*) > <!ATTLIST school sName CDATA
#REQUIRED> <!ELEMENT dep (tel?, fax?, website?) > <!ATTLIST dep dName CDATA #REQUIRED> <!ELEMENT tel (#PCDATA) > <!ELEMENT fax (#PCDATA) > <!ELEMENT website (#PCDATA) >]>
2/28/2006 Wayne State University 17
Inlining DTD Graph
2/28/2006 Wayne State University 18
Generating Relational Schema
2/28/2006 Wayne State University 19
Data Mapping Challenging issues of data mapping
Should respect to schema mapping Varying document structure Scalability
We introduced two efficient linear algorithms OXInsert
main memory data mapping algorithm DOM-based
SDM streaming data mapping algorithm SAX-based
2/28/2006 Wayne State University 20
Sample XML document - univ.xml
<?xml version=”1.0” ?><!DOCTYPE univ SYSTEM “univ.dtd”><univ uname=”WSU”> <colleges> <college cname=”Science”> <dep dname=”CS”> <website>www.cs.wayne.edu</website> </dep> </college> <college cname=”Engineering”> <dep dname=”ECE”> <tel>313-5773920</tel> </dep> <dep dname=”IE”></dep> </college> <college cname=”Pharmacy”></college> </colleges></univ>
2/28/2006 Wayne State University 21
XMLTree for univ.xml
<?xml version=”1.0” ?><!DOCTYPE univ SYSTEM “univ.dtd”>1 <univ uname=”WSU”>3 <colleges>4 <college cname=”Science”>6 <dep dname=”CS”>8
<website>www.cs.wayne.edu</website> </dep> </college>9 <college cname=”Engineering”>11 <dep dname=”ECE”>13 <tel>313-5773920</tel> </dep>14 <dep dname=”IE”></dep> </college>16 <college cname=”Pharmacy”></college> </colleges> </univ>
2/28/2006 Wayne State University 22
XMLTree for univ.xml
<?xml version=”1.0” ?><!DOCTYPE univ SYSTEM “univ.dtd”>1 <univ uname=”WSU”>3 <colleges>4 <college cname=”Science”>6 <dep dname=”CS”>8
<website>www.cs.wayne.edu</website> </dep> </college>9 <college cname=”Engineering”>11 <dep dname=”ECE”>13 <tel>313-5773920</tel> </dep>14 <dep dname=”IE”></dep> </college>15 <college cname=”Pharmacy”></college> </colleges> </univ>
2/28/2006 Wayne State University 23
Database state after univ.xml is mapped
2/28/2006 Wayne State University 24
Performance of OXInsert and SDM
Data Mapping
0
20
40
60
80
100
10 20 30 40 50
Size (MB)
Tim
e (sec)
DOM
SAX
2/28/2006 Wayne State University 25
Data Mapping Across Different Schema Mappings
0
500
1000
1500
2000
2500
3000
3500
25 50 75 100 125
Size (MB)
Tim
e (s
ec)
Basic Shared Hybrid
2/28/2006 Wayne State University 26
Query Mapping
We translate simple XPath expressions to SQL
XPath is the core of XML query languages. We identified 3 algorithms for query
mapping Naïve Cluster Containment Join
2/28/2006 Wayne State University 27
Naïve Takes an XPath expression creates a nested SQL
query comprised of SQL queries for each XPath step e.g.
XPath: /univ /colleges /college /dep[@dName=‘CS’]
SQL: Select dep.ID from dep where dep.dName=‘CS’ and dep.parentID in
(Select college.ID from college where college.parentID in
(Select colleges.ID from univ where colleges.parentID in
(Select univ.ID from univ where univ.parentID=0) ) )
2/28/2006 Wayne State University 28
Cluster
A cluster is a sequence of consecutive elements stored in the same table
Takes an XPath expression and creates a nested SQL query comprised of SQL queries for each XPath cluster
e.g. XPath:
/univ /colleges /college /dep[@dName=‘CS’] SQL:
Select dep.ID from dep where dep.dName=‘CS’ and dep.parentID in
(Select college.ID from college where college.parentID in
(Select colleges.ID from univ) )
2/28/2006 Wayne State University 29
Containment Join Relies on the well-formedness of XML
documents Requires the pre-computation of max. ID of
descendants of each element instance (endID) Facilitates efficient evaluation of recursive XML
queries e.g.
XPath: /univ /colleges /college /dep[@dName=‘CS’]
SQL: Select dep.ID from dep, college, univ where dep.dName=‘CS’ and dep.ID>=college.ID and dep.ID<=college.endID and
college.parentID=univ.colleges.ID
2/28/2006 Wayne State University 30
A Recursive Query Example
XPath: /univ //dep
Sub queries of the recursive query /univ /colleges /college /dep /univ /schools /school /dep
Naïve: 8 SQL queries + 6 joins + 1 union Cluster: 6 SQL queries + 4 joins + 1
union Containment Join: 1 SQL query + 1 join
Select dep.ID from dep, univ where dep.ID>=univ.ID and dep.ID<=univ.endID
2/28/2006 Wayne State University 31
Reconstruction In query mapping stage, the elements selected by an
XML query can be returned in one of the following two modes:
Select mode: returns IDs Reconstruct mode: returns XML subtrees
Algorithm Reconstruct reconstructs the XML subtree rooted at a given element
The importance of Reconstruction lies in two aspects: XML subtree reconstruction has great impact on the
query response time in reconstruct mode. It demonstrates that our mapping scheme is lossless
2/28/2006 Wayne State University 32
Conclusions Schema mapping [1,5]
lossless and order preserving processing set-valued XML attributes simple processing of recursion
Data mapping [1,3] We described the first linear-time
schema-based data mapping algorithms We justified their effectiveness on
different schema mapping algorithms
2/28/2006 Wayne State University 33
Conclusions (cont.) Query mapping
We identified 3 algorithms Our CJ algorithm outperforms the only
published recursive query mapping algorithm by Krishnamurthy et al., IEEE ICDE’04
Reconstruction [2] We introduced an efficient reconstruction
algorithm It can be used in relational schema-based
mapping unlike its rivals used in XML-publishing
2/28/2006 Wayne State University 34
Future Work
Extending the schema mapping to XML Schema
Extending the query mapping to XQuery
Introducing DTD/Schema constraints to the proposed mapping scheme
Incorporating access control methods to the proposed mapping scheme
2/28/2006 Wayne State University 35
Acknowledgements
Dr. Shiyong Lu Dr. Farshad Fotouhi Artem Chebotko Dapeng Liu Yezhou Sun
2/28/2006 Wayne State University 36
Publications
1. Mustafa Atay, Artem Chebotko, Dapeng Liu, Shiyong Lu, Farshad Fotouhi, "Efficient Schema-based XML-to-Relational Data Mapping", International Journal of Information Systems, 2006. (to appear)
2. Artem Chebotko, Dapeng Liu, Mustafa Atay, Shiyong Lu and Farshad Fotouhi, “Reconstructing XML Subtrees from Relational Storage of XML Documents”, in Proc. of the 2nd International Workshop on XML Schema and Data Management (XSDM’05), in conjunction with ICDE’2005, Tokyo, Japan, April, 2005
3. Mustafa Atay, Yezhou Sun, Dapeng Liu, Shiyong Lu and Farshad Fotouhi, “Mapping XML Data to Relational Data: DOM-based Approach”, in Proc. of the 8th IASTED International Conference on Internet and Multimedia Systems and Applications (IMSA’2004). Kauai, Hawaii, USA. August, 2004.
4. Shiyong Lu, Yezhou Sun, Mustafa Atay, and Farshad Fotouhi, “On the consistency of XML DTDs”, International Journal of Data and Knowledge Engineering, 2004.
5. Shiyong Lu, Yezhou. Sun, Mustafa Atay, and Farshad Fotouhi, “A new inlining algorithm for mapping XML DTDs to relational schemas”, In Proc. of the First International Workshop on XML Schema and Data Management, in conjuction with the 22nd ACM International Conference on Conceptual Modeling (ER2003), Chicago, Illinois, USA, October 2003.
6. Shiyong Lu, Yezhou Sun, Mustafa Atay, Farshad Fotouhi, "A Sufficient and Necessary Condition for the Consistency of XML DTDs", in Proc. of the First International Workshop on XML Schema and Data Management, in conjunction with the 22nd ACM International Conference on Conceptual Modeling (ER'2003), Chicago, Illinois, USA, October, 2003.