XML Schemas and Queries
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 19, 2023
Readings & Reminders
Reminder: Homework 1 Milestone 2 due tonight @ 11:59PM
Homework 2 pre-release is now posted
XML, DTD, Schema, XPath, XSLT
For next week: Altinel & Franklin paper on XFilter
Sample XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>
</dblp>
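XML Data Model Visualized
[Figure: the sample document drawn as a tree. Under Root are the ?xml processing instruction (p-i) and the dblp element. dblp has two element children, mastersthesis and article. Each carries mdate and key attributes and has child elements (author, title, year, school for the thesis; editor, title, journal, volume, year, ee, ee for the article) whose leaves are text nodes. Legend: root, p-i, element, attribute, text.]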
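The tree view can be reproduced in a few lines of code. Below is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the function name outline and the abbreviated SAMPLE string are illustrative choices, not part of the lecture. ElementTree does not expose the leading ?xml line as a node, so only elements, attributes, and text appear.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample document (declaration omitted).
SAMPLE = """<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
</dblp>"""


def outline(elem, depth=0):
    """Return one line per element, attribute, and text node, indented by depth."""
    pad = "  " * depth
    lines = [f"{pad}element {elem.tag}"]
    for name, value in elem.attrib.items():
        lines.append(f"{pad}  attribute {name} = {value}")
    if elem.text and elem.text.strip():
        lines.append(f"{pad}  text {elem.text.strip()}")
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(SAMPLE))
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of the tree in the figure.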
XML Isn’t Enough on Its Own
It’s too unconstrained for many cases!
  How will we know when we’re getting garbage?
  How will we query?
  How will we understand what we got?
Document Type Definitions (DTDs)
A DTD is an EBNF grammar defining XML structure
  An XML document specifies an associated DTD, plus the root element
  The DTD specifies children of the root (and so on)
A DTD defines special significance for attributes:
  IDs – special attributes that are analogous to keys for elements
  IDREFs – references to IDs
  IDREFS – space-delimited list of IDREFs
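To make the ID / IDREF semantics concrete, here is a rough sketch in Python of the two checks a validating parser performs: every ID value is unique across the whole document, and every reference points at some ID. The attribute names key and cites, the document, and the function check_ids are hypothetical; a real parser learns which attributes are IDs from the DTD's ATTLIST declarations.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "key" plays the role of an ID attribute and
# "cites" the role of an IDREFS attribute.
DOC = """<dblp>
  <article key="a1"><title>First</title></article>
  <article key="a2" cites="a1"><title>Second</title></article>
  <article key="a3" cites="a1 a2"><title>Third</title></article>
</dblp>"""


def check_ids(root, id_attr="key", idrefs_attr="cites"):
    """Return a list of problems: duplicate IDs and dangling references."""
    problems = []
    seen = set()
    # Pass 1: IDs share one global space, regardless of element type.
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is None:
            continue
        if value in seen:
            problems.append(f"duplicate ID {value}")
        seen.add(value)
    # Pass 2: each space-delimited reference must match some ID.
    for elem in root.iter():
        for ref in elem.get(idrefs_attr, "").split():
            if ref not in seen:
                problems.append(f"dangling IDREF {ref}")
    return problems


root = ET.fromstring(DOC)
print(check_ids(root))
```

Note that the set of seen IDs is shared by all element types, which is the "global ID space" that later slides call inconvenient.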
An Example DTD
Example DTD:
  <!ELEMENT dblp ((mastersthesis | article)*)>
  <!ELEMENT mastersthesis (author, title, year, school, committeemember*)>
  <!ATTLIST mastersthesis
      mdate   CDATA #REQUIRED
      key     ID    #REQUIRED
      advisor CDATA #IMPLIED>
  <!ELEMENT author (#PCDATA)>
  …
Example use of DTD in XML file:
  <?xml version="1.0" encoding="ISO-8859-1" ?>
  <!DOCTYPE dblp SYSTEM "my.dtd">
  <dblp>…
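Since a DTD content model is a regular expression over child element names, checking it can be sketched with an ordinary regex engine. The Python below hand-translates two of the ELEMENT declarations from the example; the dictionary CONTENT_MODELS and the function conforms are illustrative assumptions, and this is not a real DTD validator (Python's standard library does not include one).

```python
import re
import xml.etree.ElementTree as ET

# Content models from the example DTD, hand-translated into regular
# expressions over a sequence of child tag names, each followed by a space.
CONTENT_MODELS = {
    "dblp": r"(mastersthesis |article )*",
    "mastersthesis": r"author title year school (committeemember )*",
    "author": r"",  # (#PCDATA): no element children allowed
}


def conforms(elem):
    """Check elem and its descendants; tags without a known model pass."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in elem)
        if re.fullmatch(model, children) is None:
            return False
    return all(conforms(child) for child in elem)


GOOD = ET.fromstring(
    "<dblp><mastersthesis><author>K</author><title>T</title>"
    "<year>1992</year><school>S</school></mastersthesis></dblp>")
BAD = ET.fromstring(
    "<dblp><mastersthesis><title>T</title><author>K</author>"
    "<year>1992</year><school>S</school></mastersthesis></dblp>")

print(conforms(GOOD), conforms(BAD))  # prints: True False
```

The second document fails only because title precedes author: the comma operator in a DTD fixes the order of children.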
DTDs Are Very Limited
DTDs capture grammatical structure, but have some drawbacks:
  Only string scalar types
  Global ID/reference space is inconvenient
  No way of defining OO-like inheritance
XML Schema: DTDs Rethought
Features:
  XML syntax
  Better way of defining keys using XPaths
  Type subclassing
  …
  And, of course, built-in datatypes
Basic Constructs of Schema
Separation of elements (and attributes) from types:
  complexType is a structured type
    It can have sequences or choices
  element and attribute have name and type
    Elements may also have minOccurs and maxOccurs
Subtyping, most commonly using:
  <complexContent> <extension base="prevType"> … </…>
Simple Schema Example
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="mastersthesis" type="ThesisType"/>
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="year" type="xsd:integer"/>
      <xsd:element name="school" type="xsd:string"/>
      <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="mdate" type="xsd:date"/>
    <xsd:attribute name="key" type="xsd:string"/>
    <xsd:attribute name="advisor" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>
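What built-in datatypes buy us over a DTD can be illustrated with a hand-written check of the two typed fields in ThesisType. This Python sketch is only an approximation under stated assumptions: the function check_thesis_types is hypothetical, date.fromisoformat stands in for xsd:date (which also permits a time zone suffix), and int stands in for xsd:integer. A real deployment would use a schema validator instead.

```python
from datetime import date
import xml.etree.ElementTree as ET


def check_thesis_types(thesis):
    """Return type errors for the mdate attribute and the year element."""
    errors = []
    mdate = thesis.get("mdate")
    if mdate is not None:  # attributes are optional unless declared required
        try:
            date.fromisoformat(mdate)
        except ValueError:
            errors.append("mdate is not a date")
    try:
        int(thesis.findtext("year", default=""))
    except ValueError:
        errors.append("year is not an integer")
    return errors


good = ET.fromstring(
    '<mastersthesis mdate="2002-01-03"><year>1992</year></mastersthesis>')
bad = ET.fromstring(
    '<mastersthesis mdate="Jan 3, 2002"><year>ninety-two</year></mastersthesis>')

print(check_thesis_types(good))
print(check_thesis_types(bad))
```

Under the DTD both documents would be accepted, since CDATA and #PCDATA admit any string.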
Embedding XML Schema
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="s1.xsd">
  <grade>a</grade>
</root>

<s1:root xmlns:s1="http://www.schemaValid.com/s1ns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.schemaValid.com/s1ns s1ns.xsd">
  <s1:grade>a</s1:grade>
</s1:root>

But the XML parser is actually free to ignore this – the schema is typically specified “from outside” the document
Manipulating XML
Sometimes:
  Need to restructure an XML document
  Or simply need to retrieve certain parts that satisfy a constraint, e.g.:
    All books
    All books by author XYZ
Document Object Model (DOM) vs. Queries
Build a DOM tree (as we saw earlier) and access it via Java (etc.) DOM Node objects
  DOM objects have methods like “getFirstChild()” and “getNextSibling()”
  This is a common way of traversing the tree
  Can also modify the DOM tree – alter the XML – via insertBefore(), appendChild(), etc.
Alternate approach: a query language
  Define some sort of a template describing traversals from the root of the directed graph
  In XML, the basis of this template is called an XPath
  Can also declare some constraints on the values you want
  The XPath returns a node set of matches
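The two approaches can be compared side by side. This is a small sketch in Python rather than Java: xml.dom.minidom exposes firstChild and nextSibling as attributes instead of getter methods, and ElementTree's findall accepts a limited XPath subset. The three-entry document and the variable names are made up for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = ("<dblp><article><title>A</title></article>"
       "<book><title>B</title></book>"
       "<article><title>C</title></article></dblp>")

# DOM style: walk sibling pointers by hand and test each node.
dom = minidom.parseString(XML)
dom_titles = []
node = dom.documentElement.firstChild
while node is not None:
    if node.nodeType == node.ELEMENT_NODE and node.tagName == "article":
        title = node.getElementsByTagName("title")[0]
        dom_titles.append(title.firstChild.data)
    node = node.nextSibling

# Query style: state the path and let the library do the traversal.
root = ET.fromstring(XML)
query_titles = [t.text for t in root.findall("./article/title")]

print(dom_titles, query_titles)
```

Both lists hold the titles of the two article entries; the query version states what is wanted and leaves the traversal order to the library.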
XPaths
In its simplest form, an XPath is like a path in a file system:
  /mypath/subpath/*/morepath
The XPath returns a node set representing the XML nodes (and their subtrees) at the end of the path
XPaths can have node tests at the end, returning only particular node types, e.g., text(), processing-instruction(), comment(), element(), attribute()
XPath is fundamentally an ordered language: it can query in order-aware fashion, and it returns nodes in order
Some Example XPath Queries
/dblp/mastersthesis/title
/dblp/*/editor
//title
//title/text()
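These four queries can be tried out with Python's xml.etree.ElementTree, which supports a limited XPath subset. The sketch below makes two adaptations that are assumptions of this example rather than part of XPath: paths are written relative to the dblp element (ElementTree rejects absolute paths on an element), and text() is emulated by reading each element's .text. The SAMPLE string is an abbreviated copy of the sample document.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
</dblp>"""

dblp = ET.fromstring(SAMPLE)  # the <dblp> element is the starting point

# /dblp/mastersthesis/title
thesis_titles = dblp.findall("./mastersthesis/title")

# /dblp/*/editor
editors = dblp.findall("./*/editor")

# //title
all_titles = dblp.findall(".//title")

# //title/text()  (emulated: ElementTree has no text() node test)
title_texts = [t.text for t in all_titles]

print(title_texts)
```

Results come back in document order, matching the earlier remark that XPath is an ordered language.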
Context Nodes and Relative Paths
XPath has a notion of a context node: it’s analogous to a current directory
  “.” represents this context node
  “..” represents the parent node
  We can express relative paths: subpath/sub-subpath/../.. gets us back to the context node
By default, the document root is the context node
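A quick way to see context nodes in action is to call a path query on different elements. In this Python sketch (ElementTree's limited XPath subset, with a made-up two-level document), the element on which findall is called acts as the context node.

```python
import xml.etree.ElementTree as ET

dblp = ET.fromstring(
    "<dblp><mastersthesis><title>PRPL</title><year>1992</year>"
    "</mastersthesis></dblp>")
thesis = dblp.find("mastersthesis")

# "." is the context node itself.
self_nodes = thesis.findall(".")

# A relative path is evaluated from the context node.
years = thesis.findall("year")

# Going down one step and back up returns to the context node.
down_and_up = thesis.findall("title/..")

# subpath/sub-subpath/../.. evaluated with dblp as the context node.
round_trip = dblp.findall("mastersthesis/title/../..")

print(years[0].text)
```

ElementTree cannot follow ".." above the element the query started from, so the round trip works only because it never leaves the context node's subtree.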
Predicates – Filtering Operations
A predicate allows us to filter the node set based on selection-like conditions over sub-XPaths:
  /dblp/article[title = "Paper1"]
which is equivalent to:
  /dblp/article[./title/text() = "Paper1"]
because of type coercion. What does this do:
  /dblp/article[@key = "123" and ./title/text() = "Paper1" and ./author/*/element()]
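The predicates can be approximated in Python with ElementTree. Its XPath subset has no "and", so this sketch chains predicates instead, and it emulates the ./author/*/element() existence test with a follow-up path query. The document below is hypothetical test data, chosen so that each condition eliminates one more article.

```python
import xml.etree.ElementTree as ET

DOC = """<dblp>
  <article key="123">
    <title>Paper1</title>
    <author><name><first>A</first></name></author>
  </article>
  <article key="124"><title>Paper1</title></article>
  <article key="125"><title>Paper2</title></article>
</dblp>"""

root = ET.fromstring(DOC)

# /dblp/article[title = "Paper1"]
by_title = root.findall("./article[title='Paper1']")

# [@key = "123" and ./title/text() = "Paper1"], written as chained predicates
by_key_and_title = root.findall("./article[@key='123'][title='Paper1']")

# ... and ./author/*/element(): keep articles where some child of author
# has at least one element child of its own.
full_match = [a for a in by_key_and_title if a.findall("./author/*/*")]

print([a.get("key") for a in by_title])
```

A path used as a predicate is true when it selects at least one node, which is why the last condition is an existence test rather than a comparison.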
Axes: More Complex Traversals
Thus far, we’ve seen XPath expressions that go down the tree (and up one step)
  But we might want to go up, left, right, etc.
  These are expressed with so-called axes:
    self::path-step
    child::path-step
    parent::path-step
    descendant::path-step
    ancestor::path-step
    descendant-or-self::path-step
    ancestor-or-self::path-step
    preceding-sibling::path-step
    following-sibling::path-step
    preceding::path-step
    following::path-step
The previous XPaths we saw were in “abbreviated form”
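ElementTree does not implement these axes, but three of them are easy to sketch by hand once we have a child-to-parent map. The Python helpers below (ancestors, following_siblings, preceding_siblings) are illustrative names; ancestors are returned nearest first, and siblings in document order.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c/><d/><e/></b><f/></a>")

# ElementTree nodes have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """ancestor:: axis, nearest ancestor first."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result


def following_siblings(node):
    """following-sibling:: axis, in document order."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[siblings.index(node) + 1:]


def preceding_siblings(node):
    """preceding-sibling:: axis, in document order."""
    parent = parent_of.get(node)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[:siblings.index(node)]


d = root.find("b/d")
print([n.tag for n in ancestors(d)])           # prints: ['b', 'a']
print([n.tag for n in following_siblings(d)])  # prints: ['e']
print([n.tag for n in preceding_siblings(d)])  # prints: ['c']
```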
Users of XPath
XML Schema uses simple XPaths in defining keys and uniqueness constraints
XLink and XPointer, hyperlinks for XML
XSLT – useful for converting from XML to other representations (e.g., HTML, PDF, SVG)
XQuery – useful for restructuring an XML document or combining multiple documents
  Might well turn into the “glue” between Web Services, etc.
A Functional Language for XML
XSLT is based on a series of templates that match different parts of an XML document
  There’s a policy for what rule or template is applied if more than one matches (it’s not what you’d think!)
  XSLT templates can invoke other templates
  XSLT templates can be nonterminating (beware!)
XSLT templates are based on XPath “match”es, and we can also apply other templates (potentially to “select”ed XPaths)
  Within each template, directly describe what should be output
An XSLT Template
An XSLT stylesheet is an XML document itself
  XML tags create output OR are XSL operations
  All XSL tags are prefixed with the “xsl” namespace
  All non-XSL tags are part of the XML output
Common XSL operations:
  template with a match XPath
  Recursive call to apply-templates, which may also select where it should be applied
Attach to an XML document with a processing instruction:
  <?xml version="1.0" ?>
  <?xml-stylesheet type="text/xsl" href="http://www.com/my.xsl" ?>
An Example XSLT Stylesheet
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/dblp">
    <html>
      <head><title>This is DBLP</title></head>
      <body>
        <xsl:apply-templates />
      </body>
    </html>
  </xsl:template>
  <xsl:template match="inproceedings">
    <h2><xsl:apply-templates select="title" /></h2>
    <p><xsl:apply-templates select="author"/></p>
  </xsl:template>
  …
</xsl:stylesheet>
XSLT Processing Model
List of source nodes → result tree fragment(s)
Start with root
  Find all template rules with matching patterns from root
  Find “best” match according to some heuristics
  Set the current node list to be the set of things it matches
Iterate over each node in the current node list
  Apply the operations of the template
  “Append” the results of the matching template rule to the result tree structure
  Repeat recursively if specified to by apply-templates
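The recursion in this processing model can be mimicked in a few lines of Python. This is a toy under loud assumptions, not an XSLT engine: templates are plain functions looked up by tag name (which stands in for pattern matching and "best match" selection), output is built as a string rather than a result tree, and the built-in rule strips surrounding whitespace for readability. All function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

DOC = """<dblp>
  <inproceedings><author>A. Author</author><title>First Paper</title></inproceedings>
  <inproceedings><author>B. Author</author><title>Second Paper</title></inproceedings>
</dblp>"""


def apply_templates(node, templates):
    """Run the template registered for node's tag, or the built-in rule."""
    template = templates.get(node.tag, builtin_rule)
    return template(node, templates)


def builtin_rule(node, templates):
    """Simplified default rule: emit text, recurse into child elements."""
    parts = [escape((node.text or "").strip())]
    for child in node:
        parts.append(apply_templates(child, templates))
        parts.append(escape((child.tail or "").strip()))
    return "".join(parts)


def dblp_template(node, templates):
    # Like <xsl:apply-templates/> with no select: process every child.
    body = "".join(apply_templates(child, templates) for child in node)
    return f"<html><body>{body}</body></html>"


def inproceedings_template(node, templates):
    # Like <xsl:apply-templates select="title"/> and select="author".
    title = "".join(apply_templates(t, templates) for t in node.findall("title"))
    author = "".join(apply_templates(a, templates) for a in node.findall("author"))
    return f"<h2>{title}</h2><p>{author}</p>"


TEMPLATES = {"dblp": dblp_template, "inproceedings": inproceedings_template}
html = apply_templates(ET.fromstring(DOC), TEMPLATES)
print(html)
```

No template is registered for title or author, so those nodes fall through to the built-in rule, which is how their text reaches the output.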
What If There’s More than One Match?
Eliminate rules of lower precedence due to importing
Break a rule into any | branches and consider separately
Choose rule with highest computed or specified priority
Simple rules for computing priority based on “precision”:
  QName preceded by child/attribute axis specifier: priority 0
  NCName:* preceded by child/attribute axis specifier: priority -0.25
  NodeTest preceded by child/attribute axis specifier: priority -0.5
  else priority 0.5
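The default-priority rules can be written down as a small classifier. The Python sketch below is an approximation: the function default_priority is hypothetical, the NCNAME regex only roughly follows the XML name grammar, and the input is assumed to be a single pattern already split on "|". It also covers processing-instruction('literal'), which XSLT 1.0 groups with the priority 0 case.

```python
import re

NCNAME = r"[A-Za-z_][\w.\-]*"          # rough approximation of an NCName
AXIS = r"(?:@|child::|attribute::)?"   # optional child/attribute axis specifier


def default_priority(pattern):
    """Default priority of one |-free match pattern."""
    p = pattern.strip()
    # prefix:* is a namespace wildcard
    if re.fullmatch(AXIS + NCNAME + r":\*", p):
        return -0.25
    # bare node tests: *, node(), text(), comment(), processing-instruction()
    if re.fullmatch(
            AXIS + r"(?:\*|(?:node|text|comment|processing-instruction)\(\))", p):
        return -0.5
    # processing-instruction with a literal target counts like a name
    if re.fullmatch(
            AXIS + r"processing-instruction\((?:'[^']*'|\"[^\"]*\")\)", p):
        return 0.0
    # a plain (possibly prefixed) name
    if re.fullmatch(AXIS + r"(?:" + NCNAME + r":)?" + NCNAME, p):
        return 0.0
    # anything with more structure: paths, predicates, ...
    return 0.5


for pat in ["title", "@key", "xsl:*", "*", "text()", "/dblp", "article[@key]"]:
    print(pat, default_priority(pat))
```

More specific patterns score higher, so a template matching article[@key] wins over one matching article, which in turn wins over one matching *.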
Other Common Operations
Iteration:
  <xsl:for-each select="path"></xsl:for-each>
Conditionals:
  <xsl:if test="./text() &lt; 'abc'"></xsl:if>
Copying current node and children to the result set:
  <xsl:copy>
    <xsl:apply-templates />
  </xsl:copy>
Creating Output Nodes
Return text/attribute data (this is a default rule):
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>
Create an element from text (attribute is similar):
  <xsl:element name="{text()}">
    <xsl:apply-templates/>
  </xsl:element>
Copy nodes matching a path:
  <xsl:copy-of select="*"/>
Embedding Stylesheets
You can “import” or “include” one stylesheet from another:
  <xsl:import href="http://www.com/my.xsl"/>
  <xsl:include href="http://www.com/my.xsl"/>
“Include”: the rules get the same precedence as in the including stylesheet
“Import”: the rules are given lower precedence
XSLT Summary
A very powerful, template-based transformation language for converting an XML document into another structured document
  Commonly used to convert XML to PDF, SVG, GraphViz DOT format, HTML, WML, …
Primarily useful for presentation of XML or for very simple conversions
But sometimes we need more complex operations when converting data from one source to another
  Joins – combining and correlating information from multiple sources
  Aggregation – computing averages, counts, etc.
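To show what joins and aggregation mean for XML data, here is a short Python sketch over two made-up sources. The second document (citation counts keyed by publication) and all variable names are hypothetical; in practice this is the kind of work a query language is meant to handle.

```python
import xml.etree.ElementTree as ET
from collections import Counter

PUBS = """<dblp>
  <article key="a1"><title>T1</title><year>1997</year></article>
  <article key="a2"><title>T2</title><year>1997</year></article>
  <mastersthesis key="m1"><title>T3</title><year>1992</year></mastersthesis>
</dblp>"""

# Hypothetical second source, correlated with the first through "key".
CITES = """<citations>
  <count key="a1">12</count>
  <count key="m1">3</count>
</citations>"""

pubs = ET.fromstring(PUBS)
cites = ET.fromstring(CITES)

# Aggregation: number of publications per year.
per_year = Counter(pub.findtext("year") for pub in pubs)

# Join: pair each title with its citation count, matching on the key attribute.
count_by_key = {c.get("key"): int(c.text) for c in cites}
joined = [(pub.findtext("title"), count_by_key[pub.get("key")])
          for pub in pubs if pub.get("key") in count_by_key]

print(per_year)
print(joined)
```

The publication a2 has no entry in the second source, so it drops out of the join, just as in a relational inner join.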
XSLT and Alternatives
XSLT is focused on reformatting documents
  Stylesheets are focused around one XML file
  XML file must reference the stylesheet
What if we want to:
  Manage and combine collections of XML documents?
  Make Web service requests for XML?
  “Glue together” different Web service requests?
  Query for keywords within documents, with ranked answers?
This is where XQuery plays a role – see CIS 330 / 550 for details