Page 1: Alphabet Soup: The XML Schemaocean.otr.usm.edu/~w300778/is-doctor/pubpdf/xmlxsd.pdf · 2005-10-31 · Initial XML Schema: The schema definition mirrors the hierarchical structure

Alphabet Soup: The XML Schema

Overview

The first tutorial in this series introduced the core Extensible Markup Language (XML) technologies. The second tutorial described the construction of a well-formed XML document. This tutorial covers the role of the XML schema, the primary elements of a schema, and the relationship between an XML document and an XML schema. In this tutorial, we use Microsoft Internet Explorer, Crimson Editor, and the Topologi Schematron Validator, an XML schema validation tool.

Valid XML Data

In the previous tutorial we looked at the structure and contents of an XML document; however, we did not do anything to make sure the data in the document made sense. Internet Explorer did not encounter any problems displaying the XML document shown in Figure 1 because the data are well formed. The first two elements appear to be valid names; however, the last two elements do not look like names.

Figure 1: Well-formed XML Data
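The gap between "well formed" and "sensible" can be sketched in a few lines of Python. The document below is a hypothetical stand-in for Figure 1 (the actual figure contents are not reproduced here), and is_well_formed is an illustrative helper name. The standard-library parser checks syntax only, much as the browser does.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the document in Figure 1: two plausible names
# followed by two values that do not look like names at all.
NAMES_XML = """<?xml version="1.0"?>
<Names>
  <Name>Homer Simpson</Name>
  <Name>Ned Flanders</Name>
  <Name>12345</Name>
  <Name>@@##!!</Name>
</Names>"""


def is_well_formed(xml_text):
    """Return True if the text parses as XML, whatever the data means."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # The odd "names" parse without complaint.
    print(is_well_formed(NAMES_XML))
    # A mismatched closing tag is a syntax problem and is rejected.
    print(is_well_formed("<Names><Name>Homer</Names>"))
```

A parser of this kind rejects only syntax errors; deciding whether a value is a plausible name is the job of a schema.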

In the case of a list of names, validation may not be a critical issue. When we want to share data among organizations, however, we need to define the structure of the XML document more rigorously and impose restrictions on the values elements may take. The primary mechanism for doing this is an XML schema. This requires all parties involved in the exchange of information to come to a consensus about several issues, including the tag names, the meaning of each tag, the structure of the overall XML document, and the types of values each element may hold. The result of this process is an XML schema.

In the last tutorial, we worked with a simple bank account XML document (Figure 2). We will continue using the same example in this tutorial.

Figure 2: Well-formed Bank Account Data (BankAcct2.xml)

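We derived this XML document’s structure from a simple inverted tree diagram (Figure 3) that describes the overall structure of a bank account. Figure 4 is a simple XML schema based on the structure of the bank account. Line 2 in the XML document (Figure 2) names the XML schema file in a namespace declaration (xmlns:mySchema="bankacct2.xsd"). Note that a namespace declaration by itself does not tell a validator where to find the schema; later in this tutorial we select the schema file explicitly in the validator.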

Figure 3: Bank Account Structure

Bank Account
    Account ID
    Account Holders
        Account Holder (1)
            Holder Name
            Holder Tax ID
        Account Holder (2)
            Holder Name
            Holder Tax ID
        Account Holder (3)
            Holder Name
            Holder Tax ID
    Balance

Figure 4: Initial Bank Account XML Schema (BankAcct2.xsd)

Initial XML Schema: The schema definition mirrors the hierarchical structure of the bank account. Line 1 defines the file as an XML schema using the XML schema standard defined by the World Wide Web Consortium (W3C) in 2001.

Lines 3 through 11 define the first two levels of the hierarchy. Line 3 starts the definition of a bank account element. Line 4 specifies that the element is a complex element. A complex element represents multiple facts clustered together (e.g., an address), multiple occurrences of the same type of information (e.g., family members), or some combination of these. In this case, the bank account consists of an account id, account holders, and a balance. The account id element holds a string value; as a result, it may hold any character, digit, or punctuation symbol. The balance element holds a decimal value; as a result, it may only hold numeric values, including integers and decimals. XML Schema also provides the capability to create new data types. The account holders element is a reference (ref) to an element declared elsewhere in the same XML schema. In this case, the account holders element defines a node in the document hierarchy.

Lines 13 through 19 define the complex data type for the account holders element. Based on this definition, an account holders node must contain at least one account holder node (minOccurs defaults to 1) and may contain any number of them (maxOccurs = “unbounded”).

Lines 21 through 28 define the complex data type for the account holder element. From this definition, each account holder has a name and tax id, both of which are simple string elements.

Line 30 terminates the schema definition so that it is well formed.
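To make the line-by-line description easier to follow, the sketch below embeds a reconstruction of the initial schema and lists what each top-level element declaration contains. The schema text is an assumption: it is inferred from the completed schema in Figure 14 and from the line numbers cited above, not copied from Figure 4. The helper name summarize is illustrative.

```python
import xml.etree.ElementTree as ET

ELEMENT = "{http://www.w3.org/2001/XMLSchema}element"

# Reconstructed initial schema (assumption, see the text above). The layout
# is chosen so the line numbers match the description: 1 opens the schema,
# 3-11 BankAccount, 13-19 AccountHolders, 21-28 AccountHolder, 30 closes it.
INITIAL_XSD = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

<xsd:element name="BankAccount">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="AccountID" type="xsd:string"/>
      <xsd:element ref="AccountHolders"/>
      <xsd:element name="Balance" type="xsd:decimal"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<xsd:element name="AccountHolders">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="AccountHolder" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<xsd:element name="AccountHolder">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="HolderName" type="xsd:string"/>
      <xsd:element name="HolderTaxID" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

</xsd:schema>"""


def summarize(xsd_text):
    """Map each top-level element name to (child, type-or-'ref', maxOccurs)."""
    root = ET.fromstring(xsd_text)
    summary = {}
    # findall() looks at direct children only, i.e. the global declarations.
    for decl in root.findall(ELEMENT):
        children = []
        for child in decl.iter(ELEMENT):
            if child is decl:  # iter() also yields the declaration itself
                continue
            label = child.get("name") or child.get("ref")
            children.append(
                (label, child.get("type", "ref"), child.get("maxOccurs", "1"))
            )
        summary[decl.get("name")] = children
    return summary


if __name__ == "__main__":
    for name, children in summarize(INITIAL_XSD).items():
        print(name, children)
```

The summary shows the same three levels as the tree in Figure 3: the bank account, the account holders collection, and the individual account holder.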

This simple XML schema illustrates the key components of any XML schema. The XML schema provides a way to specify the tags used in the XML document by defining each element in the XML document. Each definition has an opening tag (e.g., line 1) and a closing tag (line 30), or is self-terminating in a manner similar to that used in an XML document (e.g., lines 6, 7, and 8). A complex element represents a repeating set of elements (the set of account holders), or a collection of elements treated as a single unit (e.g., an account holder with a name and tax id). The lowest level of any branch in the tree resolves to a simple element (e.g., account id, balance, holder name, and holder tax id). In most cases, the XML schema organizes the elements as they should appear in the XML document.

The XML schema specifies the tags available, their organization, and their data types. The XML schema does not specify a semantic meaning for each tag; the organization creating the XML schema must do this. However, as was mentioned in the first tutorial, a number of government and public organizations have successfully defined XML schemas for specific business applications.

Validating an XML Document Against an XML Schema: Most applications that support XML do not support validation of the document very well, if at all. In many cases, the application simply fails to respond when presented with an invalid XML document. This is a problem since XML is picky about character case, spacing, and other issues that we humans often overlook. Fortunately, several companies have developed programs to validate XML documents against schemas to ensure the XML documents comply with the standards defined in the schemas. One tool is the Topologi Schematron Validator. Open the Schematron Validator. Figure 5 shows the initial window displayed by the Schematron Validator.

Figure 5: Initial Topologi Schematron Validator Window

Use the Browse button on the left side of the window to navigate to the location containing the XML document and select (single click) the desired XML document (BankAcct2.xml). Use the Browse button on the right side to navigate to the location containing BankAcct2.xsd and select this document. Click Run. In this case, the XML document is valid (Figure 6).

Figure 6: Valid XML Document

Close the Validation Results window. Open BankAcct2.xml in Crimson Editor. Delete the line specifying the account id (<AccountID>90210222</AccountID>) and save the changes. Switch to the Schematron Validator and run the validation again. As shown in Figure 7, the validator detects that a required element, the account id, is not in the XML document and displays an error message.

Figure 7: Missing Element
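The kind of check behind this error can be approximated with the standard library. This is a rough sketch, not how the Schematron Validator works internally. The list of required children is taken from the bank account definition described earlier, and missing_children is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Children the initial schema requires inside <BankAccount>.
REQUIRED_CHILDREN = ["AccountID", "AccountHolders", "Balance"]

# The bank account document with the AccountID line deleted
# (shortened to one account holder to keep the sketch small).
BROKEN_XML = """<?xml version="1.0"?>
<BankAccount xmlns:mySchema="bankacct2.xsd">
  <AccountHolders>
    <AccountHolder>
      <HolderName>H. Simpson</HolderName>
      <HolderTaxID>24512423</HolderTaxID>
    </AccountHolder>
  </AccountHolders>
  <Balance>23234079</Balance>
</BankAccount>"""


def missing_children(xml_text, required=REQUIRED_CHILDREN):
    """Return the required child element names absent from the root."""
    root = ET.fromstring(xml_text)
    present = {child.tag for child in root}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    print(missing_children(BROKEN_XML))
```

The document still parses, so a browser would display it; only a comparison against the expected structure reveals the missing account id.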

Switch to Crimson Editor and undo the change to restore the account id. Add a letter ‘A’ as the first character of the balance value (A23234079). Save the changes. Switch to the Schematron Validator. Run the validation. Once again, the Schematron Validator detects an error.

Figure 8: Data Type Error

Switch to Crimson Editor and undo the change to remove the ‘A’ in the balance value. Put a decimal point after the third digit in the balance value (232.34079). Save the changes. Switch to the Schematron Validator. Run the validation. The validation should complete without errors even though the balance is not a valid currency value.

Switch to Crimson Editor. Put “ZXY” at the start of the account id value (ZXY90210222). Save the changes. Switch to the Schematron Validator. Run the validation. The validation should complete without errors even with the change to the account id.

The previous two changes did not trigger errors because the XML schema does not constrain the values of the account id and balance beyond requiring that they be a string value and decimal value, respectively. We will address these issues separately, starting with the account id.
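A small sketch shows why the built-in types let these values through. The regular expression is our reading of the lexical form of the decimal type in the XML Schema datatypes specification (optional sign, digits, optional decimal point). The function names are illustrative, not part of any library.

```python
import re

# Assumed lexical form of xsd:decimal: optional sign, then digits with an
# optional decimal point (no exponent, no letters).
DECIMAL_LEXICAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def is_xsd_decimal(value):
    """True if the text looks like a decimal number; surrounding spaces are ignored."""
    return DECIMAL_LEXICAL.fullmatch(value.strip()) is not None


def is_xsd_string(value):
    """The string type accepts any character data, so every str passes."""
    return isinstance(value, str)


if __name__ == "__main__":
    for balance in ("23234079", "A23234079", "232.34079"):
        print(balance, is_xsd_decimal(balance))
    print("ZXY90210222", is_xsd_string("ZXY90210222"))
```

Only the balance with the leading letter fails; 232.34079 is a perfectly good decimal, and any text at all is a good string.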

Defining Data Types to Constrain Values

Open BankAcct2.xsd in Crimson Editor. Assume the required format for the account id is exactly eight numeric digits. Modify BankAcct2.xsd as shown in Figure 9 to include the accountIdType definition and use this new data type for the account id element. Save the changes.

Figure 9: Account ID Custom Data Type

The accountIdType specifies that the base for the type is the standard string data type. The pattern limits the value to eight numeric digits. The definition for the account id element uses the new data type.
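The effect of the pattern can be tried outside the validator. The sketch below applies the same expression, \d{8}, with Python's re module; is_valid_account_id is a hypothetical helper. A schema pattern must match the entire value, which is why the sketch uses fullmatch rather than search.

```python
import re

# Same expression as the pattern facet in accountIdType.
ACCOUNT_ID_PATTERN = re.compile(r"\d{8}")


def is_valid_account_id(value):
    """True only when the whole value is exactly eight digits."""
    return ACCOUNT_ID_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for account_id in ("90210222", "ZXY90210222", "0210222"):
        print(account_id, is_valid_account_id(account_id))
```

The three sample values correspond to the original account id, the id with three letters added, and the id with its first digit removed.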

Switch to the Schematron Validator and run the validation. You should get an error message indicating the account id value is not valid given the data type. Switch to Crimson Editor and select BankAcct2.xml. Remove the three characters (‘ZXY’) at the start of the account id value. Save the changes. Switch to the Schematron Validator and run the validation. You should not get an error since the account id matches the rule defined in the XML schema.

Switch to Crimson Editor and select BankAcct2.xml. Remove the first digit from the account id value. Save the changes. Switch to the Schematron Validator and run the validation. You should get an error since the account id is not eight characters long. Switch to Crimson Editor and undo the change to restore the missing digit in the account id.

Switch to BankAcct2.xsd in Crimson Editor. One odd characteristic of the current implementation of the XML schema definition language is the lack of consistent support for a currency data type. Because of this, we need to define a data type to require a currency value for the balance. Modify BankAcct2.xsd as shown in Figure 10 to include the currencyType definition and use this new data type for the balance element. Save the changes.

Figure 10: Balance Custom Data Type

The currencyType specifies that the base for the type is the standard decimal data type. The fractionDigits constraint limits the value to at most two digits to the right of the decimal point.

Switch to the Schematron Validator and run the validation. You should get an error message indicating the balance value is not valid given the data type. Switch to Crimson Editor and select BankAcct2.xml. Remove the last three digits of the balance so the value is 232.34. Save the changes. Switch to the Schematron Validator and run the validation. You should not get an error since the balance matches the rule defined in the XML schema. As a note, the fractionDigits property defines the maximum number of digits to the right of the decimal point.
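The "maximum, not exact" behavior of fractionDigits can be sketched with Python's decimal module. The helper name fits_fraction_digits is hypothetical. One assumption to flag: this sketch ignores trailing zeros (so 232.340 counts as two fraction digits), and a particular validator may treat that case differently.

```python
import re
from decimal import Decimal

# Assumed lexical form of a schema decimal: sign, digits, optional point.
DECIMAL_LEXICAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def fits_fraction_digits(value, max_digits=2):
    """True if value is a decimal with at most max_digits fraction digits."""
    text = value.strip()
    if DECIMAL_LEXICAL.fullmatch(text) is None:
        return False
    # normalize() drops trailing zeros, so 232.340 is treated like 232.34.
    exponent = Decimal(text).normalize().as_tuple().exponent
    return exponent >= -max_digits


if __name__ == "__main__":
    for balance in ("232.34079", "232.34", "232.3", "232"):
        print(balance, fits_fraction_digits(balance))
```

Values with fewer than two fraction digits, including whole numbers, pass the check; only values with more than two are rejected.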

Assume we need to keep track of account open and close dates. Switch to Crimson Editor and select BankAcct2.xsd. As shown in Figure 11, add an open date element to the XML schema immediately after the account id element. Specify that the open date element stores a date value. Add a close date element immediately after the open date element. For the close date element set the minOccurs property to zero (0). Using the minOccurs property, you do not have to enter a close date value until the account holder closes the account. Save the changes.

Figure 11: Account Open and Close Dates
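How minOccurs changes what counts as a valid sequence can be illustrated with a simplified matcher. The sequence table mirrors the bank account definition after the two date elements are added. check_sequence is a hypothetical helper that handles only a flat, ordered list of child names, which is enough for this example but far less than a real schema processor does.

```python
# (element name, minOccurs, maxOccurs) in the order the schema lists them.
BANK_ACCOUNT_SEQUENCE = [
    ("AccountID", 1, 1),
    ("OpenDate", 1, 1),
    ("CloseDate", 0, 1),  # optional until the account is closed
    ("AccountHolders", 1, 1),
    ("Balance", 1, 1),
]


def check_sequence(child_tags, sequence=BANK_ACCOUNT_SEQUENCE):
    """Return None if the child names fit the sequence, else a message."""
    position = 0
    for name, min_occurs, max_occurs in sequence:
        count = 0
        # Consume consecutive occurrences of the expected element.
        while (
            position < len(child_tags)
            and child_tags[position] == name
            and count < max_occurs
        ):
            position += 1
            count += 1
        if count < min_occurs:
            return f"missing required element {name}"
    if position < len(child_tags):
        return f"unexpected element {child_tags[position]}"
    return None


if __name__ == "__main__":
    print(check_sequence(["AccountID", "AccountHolders", "Balance"]))
    print(check_sequence(["AccountID", "OpenDate", "AccountHolders", "Balance"]))
```

A document without an open date fails, while a document without a close date passes, because only the close date has minOccurs set to zero.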

Switch to the Schematron Validator and run the validation. You should get an error message because there is no open date element in the XML document. Switch to Crimson Editor and select BankAcct2.xml. Add an open date of February 14, 2002 as shown in Figure 12. By default, dates must be in the format shown. Save the changes. Switch to the Schematron Validator and run the validation. You should not get an error message.

Figure 12: Specifying a Date in an XML Document

Dates in XML take the form YYYY-MM-DD, where YYYY is the four-digit year, MM is the two-digit month with a leading zero if needed, and DD is the two-digit day of the month with a leading zero if needed.
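The date format can be checked with a short sketch. It covers only the plain YYYY-MM-DD form described above; the schema date type also allows variations such as a time zone suffix, which this sketch deliberately ignores. is_plain_xsd_date is a hypothetical helper name.

```python
import datetime
import re

# Four-digit year, two-digit month, two-digit day, separated by hyphens.
DATE_FORM = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_plain_xsd_date(value):
    """True if value is YYYY-MM-DD and names a real calendar day."""
    match = DATE_FORM.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        datetime.date(year, month, day)  # rejects e.g. February 30
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    for text in ("2002-02-14", "02/14/2002", "2002-2-14", "2002-02-30"):
        print(text, is_plain_xsd_date(text))
```

The pattern enforces the leading zeros, and constructing a date object catches values that have the right shape but do not exist on the calendar.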

Figure 13 and Figure 14 show the completed XML document and XML schema.

Figure 13: XML Document

<?xml version="1.0"?>
<BankAccount xmlns:mySchema="bankacct2.xsd">
  <AccountID>90210222</AccountID>
  <OpenDate>2002-02-14</OpenDate>
  <AccountHolders>
    <AccountHolder>
      <HolderName>H. Simpson</HolderName>
      <HolderTaxID>24512423</HolderTaxID>
    </AccountHolder>
    <AccountHolder>
      <HolderName>M. Szyslak</HolderName>
      <HolderTaxID>53445231</HolderTaxID>
    </AccountHolder>
    <AccountHolder>
      <HolderName>N. Flanders</HolderName>
      <HolderTaxID>11234129</HolderTaxID>
    </AccountHolder>
  </AccountHolders>
  <Balance>232.34</Balance>
</BankAccount>

Figure 14: XML Schema

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:simpleType name="accountIdType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\d{8}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name="currencyType">
    <xsd:restriction base="xsd:decimal">
      <xsd:fractionDigits value="2"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="BankAccount">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="AccountID" type="accountIdType"/>
        <xsd:element name="OpenDate" type="xsd:date"/>
        <xsd:element name="CloseDate" type="xsd:date" minOccurs="0"/>
        <xsd:element ref="AccountHolders"/>
        <xsd:element name="Balance" type="currencyType"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="AccountHolders">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="AccountHolder" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="AccountHolder">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="HolderName" type="xsd:string"/>
        <xsd:element name="HolderTaxID" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>

Summary

This tutorial described the basic elements of an XML schema. It does not provide comprehensive coverage of XML schemas. For additional information about XML schemas, consult the XML Schema specification developed by the W3C and the Microsoft XML 4.0 Parser software development kit (SDK) documentation available from Microsoft. The next tutorial covers construction of XSL stylesheets to process XML documents.

XML Resources

Crimson Editor, www.crimsoneditor.com.

Microsoft Internet Explorer, www.microsoft.com.

Microsoft XML 4.0 Parser Software Development Kit (SDK), www.microsoft.com.

Topologi P/L Schematron Validator, www.topologi.com.

World Wide Web Consortium, www.w3.org.