ApacheCon 2000 Everything you ever wanted to know about XML Parsing

84
Everything you always wanted to know about XML parsing Ted Leung CTO, Enkubator, LLC [email protected]

description

 

Transcript of ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Page 1: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Everything you always wanted to know about XML parsing

Ted LeungCTO, Enkubator, [email protected]

Page 2: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Thank Youn IBMn ASFn Xerces-Dev

Page 3: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 4: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Overviewn Focus on server side XMLn Not document processingn Benefits to servers / e-business

n Model-View separation for *MLn Ubiquitous format for data exchange

n Hope for network effects from data availability

Page 5: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

xml.apache.orgn Xerces XML parser

n Javan C++n Perl

n Xalan XSLT processorn Javan C++ (soon)

n FOPn Cocoon

Page 6: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Xerces-Jn Why Java?

n Unicoden Code motionn Server side cross platform support

n C++ version has lagged Java version

Page 7: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 8: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Basic XML Conceptsn Well formednessn Validity

n DTD’sn Schemas

n Entities

Page 9: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Example: RSS<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS .91//EN""http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">

<channel><title>freshmeat.net</title><link>http://freshmeat.net</link><description>the one-stop-shop for all your Linux software needs</description><language>en</language><rating>(PICS-1.1 "http://www.classify.org/safesurf/" 1 r (SS~~000 1))</rating><copyright>Copyright 1999, Freshmeat.net</copyright><pubDate>Thu, 23 Aug 1999 07:00:00 GMT</pubDate><lastBuildDate>Thu, 23 Aug 1999 16:20:26 GMT</lastBuildDate><docs>http://www.blahblah.org/fm.cdf</docs>

Page 10: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

RSS 2<image><title>freshmeat.net</title><url>http://freshmeat.net/images/fm.mini.jpg</url><link>http://freshmeat.net</link><width>88</width><height>31</height><description>This is the Freshmeat image stupid</description></image>

<item><title>kdbg 1.0beta2</title><link>http://www.freshmeat.net/news/1999/08/23/935449823.html</link><description>This is the Freshmeat image stupid</description></item>

<item><title>HTML-Tree 1.7</title><link>http://www.freshmeat.net/news/1999/08/23/935449856.html</link><description>This is the Freshmeat image stupid</description></item>

Page 11: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

RSS 3<textinput><title>quick finder</title><description>Use the text input below to search freshmeat</description><name>query</name><link>http://core.freshmeat.net/search.php3</link></textinput>

<skipHours><hour>2</hour></skipHours>

<skipDays><day>1</day></skipDays>

</channel></rss>

Page 12: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

RSS DTD<!ELEMENT rss (channel)><!ATTLIST rss

version CDATA #REQUIRED> <!-- must be "0.91"> -->

<!ELEMENT channel (title | description | link | language | item+ | rating? | image? | textinput? | copyright? | pubDate? | lastBuildDate? | docs? |managingEditor? | webMaster? | skipHours? | skipDays?)*><!ELEMENT title (#PCDATA)><!ELEMENT description (#PCDATA)><!ELEMENT link (#PCDATA)><!ELEMENT language (#PCDATA)><!ELEMENT rating (#PCDATA)><!ELEMENT copyright (#PCDATA)><!ELEMENT pubDate (#PCDATA)><!ELEMENT lastBuildDate (#PCDATA)><!ELEMENT docs (#PCDATA)><!ELEMENT managingEditor (#PCDATA)><!ELEMENT webMaster (#PCDATA)><!ELEMENT hour (#PCDATA)><!ELEMENT day (#PCDATA)><!ELEMENT skipHours (hour+)><!ELEMENT skipDays (day+)>

Page 13: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

RSS DTD 2<!ELEMENT item (title | link | description)*><!ELEMENT textinput (title | description | name | link)*><!ELEMENT name (#PCDATA)>

<!ELEMENT image (title | url | link | width? | height? | description?)*><!ELEMENT url (#PCDATA)><!ELEMENT width (#PCDATA)><!ELEMENT height (#PCDATA)>

Page 14: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

XML Application Tasksn Convert XML into application objectn Convert application object to XMLn Read or write XML into a databasen XSLT conversion to XHTML or HTML

Page 15: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

RSS ObjectsLogical View

RSSChannel

+getTitle()+getDescription()+getLink()+getItems()+getImage()+getTextInput()

RSSItem

+getTitle()+getDescription()+getLink()

RSS Image

+getTitle()+getDescription()+getLink()+getURL()+getHeight()+getWidth()

RSSTextInput

+getTitle()+getDescription()+getLInk()+getName()

Page 16: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 17: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Parser API’sn What is the job of a parser API?

n Make the structure and contents of a document available

n Report errors

Page 18: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX APIn Event callback stylen Representation of elements & atts

n must build own stack

n Development modeln Reasonably openn xml-devn http://www.megginson.com/SAX/SAX2

Page 19: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Handler Strategiesn Single Monolithic

n Requires the handler to know entirely about the grammar

n One per application object and multiplexern Each application object has its own

DocumentHandlern The Multiplexer handler is registered with the

parsern Multiplexer is responsible for swapping SAX

handlers as the context changes.

Page 20: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 1 RSS Handlerpublic void startElement(String name,AttributeList atts) throws SAXException {

currentText = new StringBuffer();textStack.push(currentText);if (name.equals("rss") {

elementStack.push(name);} else if (name.equals("channel")) {

elementStack.push(name);currentChannel = new RSSChannel();currentItems = new Vector();currentChannel.setItems(currentItems);currentImage = new RSSImage();currentChannel.setImage(currentImage);currentTextInput = new RSSTextInput();currentChannel.setTextInput(currentTextInput);currentChannel.setSkipHours(new Vector());currentChannel.setSkipDays(new Vector());

} else if (name.equals("item")) {elementStack.push(name);currentItem = new RSSItem();

} else if (name.equals("image")) {elementStack.push(name);

} else if (name.equals("textinput")) {elementStack.push(name);

} else if (name.equals("skipHours")) {elementStack.push(name);

} else if (name.equals("skipDays")) {elementStack.push(name);

} else {}

}

Page 21: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 1 RSS Handler 2public void endElement(String name) throws SAXException {

String stackValue = (String) elementStack.peek();String text = ((StringBuffer)textStack.pop()).toString();if (name.equals("rss")) {

elementStack.pop();} else if (name.equals("channel")) {

elementStack.pop();} else if (name.equals("title")) {

if (stackValue.equals("channel")) {currentChannel.setTitle(text);

} else if (stackValue.equals("image")) {currentImage.setTitle(text);

} else if (stackValue.equals("item")) {currentItem.setTitle(text);

} else if (stackValue.equals("textinput")) {currentTextInput.setTitle(text);

} else {}

...

Page 22: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 1 RSS Handler 3} else if (name.equals("description")) {

if (stackValue.equals("channel")) {currentChannel.setDescription(text);

} else if (stackValue.equals("image")) {currentImage.setDescription(text);

} else if (stackValue.equals("item")) {currentItem.setDescription(text);

} else if (stackValue.equals("textinput")) {currentTextInput.setDescription(text);

} } else if (name.equals("language")) {

currentChannel.setLanguage(text);} else if (name.equals("item")) {

currentItems.add(currentItem);elementStack.pop();

} else if (name.equals("textinput")) {currentChannel.setTextInput(currentTextInput);elementStack.pop();

}}

Page 23: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX1 RSS Handler 4public void characters(char[] ch,int start,int length) throws SAXException {

currentText.append(ch, start, length);

}

Page 24: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 1 RSS Driverpublic class SAX1RSSParse extends Object {

public static void main(String args[]) {Parser p = new SAXParser();DocumentHandler h = new SAX1RSSHandler();p.setDocumentHandler(h);try {

p.parse("file:///Work/ApacheCon/fm0.91_full.rdf");} catch (SAXException se) {

System.out.println("Error during parsing"+se.getMessage());se.printStackTrace();

} catch (IOException e) {System.out.println("I/O Error during parsing+e.getMessage());e.printStackTrace();

}System.out.println(((SAX1RSSHandler)h).getChannel().toString());

}}

Page 25: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 1 Basic Event Handlingn Basic Event Handling

n attribute handling in startElementn characters() for contentn character data handling in endElement

n Multiple characters calls

Page 26: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 1 RSS ErrorHandlerpublic class SAX1RSSErrorHandler implements ErrorHandler {public void warning(SAXParseException ex) throws SAXException {

System.err.println("[Warning] ”+getLocationString(ex)+":”+ex.getMessage());

}

public void error(SAXParseException ex) throws SAXException {System.err.println("[Error] "+getLocationString(ex)+":”+

ex.getMessage());}

public void fatalError(SAXParseException ex) throwsSAXException {

System.err.println("[Fatal Error]"+getLocationString(ex)+":”+ex.getMessage());

throw ex;}

Page 27: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX1 RSS ErrorHandler 2private String getLocationString(SAXParseException ex) {

StringBuffer str = new StringBuffer();

String systemId = ex.getSystemId();

if (systemId != null) {

int index = systemId.lastIndexOf('/');

if (index != -1)

systemId = systemId.substring(index + 1);

str.append(systemId);

}

str.append(':');

str.append(ex.getLineNumber());

str.append(':');

str.append(ex.getColumnNumber());

return str.toString();

}

}

}

Page 28: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

ErrorDriverpublic class SAX1RSSParse extends Object {

public static void main(String args[]) {Parser p = new SAXParser();DocumentHandler h = new SAX1RSSHandler();p.setDocumentHandler(h);ErrorHandler eh = new SAX1RSSErrorHandler();p.setErrorHandler(eh);try {

p.parse("file:///Work/ApacheCon/fm0.91_full.rdf");} catch (SAXException se) {

System.out.println("Error during parsing"+se.getMessage());se.printStackTrace();

} catch (IOException e) {System.out.println("I/O Error during parsing+e.getMessage());e.printStackTrace();

}System.out.println(((SAX1RSSHandler)h).getChannel().toString());

}}

Page 29: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 1 Error Handlingn Error Handling

n warningn error – parser may recovern fatalError – XML 1.0 spec errors

n Locators

Page 30: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Entity Handlingn <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS91//EN"

"http://my.netscape.com/publish/formats/rss-0.91.dtd">

n This will do a network fetch!

public class SAX1RSSEntityHandler implements EntityResolver {public SAX1RSSEntityHandler() {}public InputSource resolveEntity(String publicId,String systemId) throws SAXException, IOException {if (publicId.equals("-//Netscape Communications//DTD RSS

91//EN")) {FileReader r = new FileReader("/usr/share/xml/rss91.dtd");return new InputSource(r);

}return null;

}}

Page 31: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

EntityDriverpublic class SAX1RSSParse extends Object {

public static void main(String args[]) {Parser p = new SAXParser();DocumentHandler h = new SAX1RSSHandler();p.setDocumentHandler(h);ErrorHandler eh = new SAX1RSSErrorHandler();p.setErrorHandler(eh);EntityResolver r = new SAX1RSSEntityResolver();p.setEntityResolver(r);try {

p.parse("file:///Work/ApacheCon/fm0.91_full.rdf");} catch (SAXException se) {

System.out.println("Error during parsing"+se.getMessage());se.printStackTrace();

} catch (IOException e) {System.out.println("I/O Error during parsing+e.getMessage());e.printStackTrace();

}System.out.println(((SAX1RSSHandler)h).getChannel().toString());

}}

Page 32: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

HandlerBase

n The kitchen sinkn Lets you do DocumentHandler,

ErrorHandler, EntityResolver and DTD Handler

n A good way to go if you are only overriding a small number of methods

Page 33: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 2 Beta

n Configurablen DTDn ContentHandler replaces

DocumentHandler (NS Aware)n Attributes replaces AttributeList (NS

Aware)n XMLReader replaces Parsern XMLFiltern DefaultHandler replaces HandlerBase

Page 34: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 2 Betan Optional LexicalHandler

n Entity boundariesn DTD bounariesn CDATA sectionsn Comments

n Optional DeclHandlern ElementDecls and AttributeDeclsn Internal and External Entity Declsn NO parameter entities

n Compatibility Adapters

Page 35: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX 2 Configurablen Uses strings (URI’s) as lookup keysn Features - booleans

n validationn namespaces

n Properties - Objectsn Extensibility by returning an object that

implements additional function

Page 36: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Xerces and Configurablen Standard way to set all parser

configuration settings.n Applies to both SAX and DOM Parsers

Page 37: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Xerces Featuresn Inclusion of general entitiesn Inclusion of parameter entitiesn Dynamic validationn Extra warningsn Allow use of java encoding namesn Lazy evaluating DOMn DOM EntityReference creationn Inclusion of igorable whitespace in DOM

Page 38: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Xerces Propertiesn Name of DOM implementation classn DOM node currently being parsed

Page 39: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 40: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM APIn Object representation of elements &

attsn Development Model

n W3C Recommendations and process

n http:///www.w3.org/DOM

Page 41: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM ArchitectureNode

+getNodeName()+getNodeValue()+getChildNodes()+getFirstChild()+getLastChild()+getNextSibling()+getPreviousSibling()+insertBefore()+appendChild()+replaceChild()+removeChild()

Document

Element

+getAttribute()+setAttribute()+removeAttribute()+getTagName()+getAttributeNode()+setAttributeNode()+removeAttributeNode()+normalize()

Attr

Text CDataSection

EntityReference Entity

ProcessingInstruction

Comment

DocumentType

NotationCharacterData

NodeList NamedNodeMap

Page 42: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM Instance (XML)

<?xml version="1.0" encoding="US-ASCII"?>

<a>

<b>in b</b>

<c>

text in c

<d/>

</c>

</a>

Page 43: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM Tree

Document

Elementa

Elementb

Elementc

Elementd

Text[lf]

Text[lf]text in c[lf]

Text[lf]

Text[lf]

Text[lf]

Textin b

Page 44: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM RSSpublic static void main(String args[]) {

DOMParser p = new DOMParser();

try {p.parse("file:///Work/ApacheCon/fm0.91_full.rdf");System.out.println(dom2Channel(p.getDocument()).toString());

} catch (SAXException se) {System.out.println("Error during parsing "+se.getMessage());se.printStackTrace();

} catch (IOException e) {System.out.println("I/O Error during parsing “+e.getMessage());e.printStackTrace();

}}

Page 45: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM2Channelpublic static RSSChannel dom2Channel(Document d) {

RSSChannel c = new RSSChannel();Element channel = null;NodeList nl = d.getElementsByTagName("channel");channel = (Element) nl.item(0);for (Node n = channel.getFirstChild(); n != null;

n = n.getNextSibling()) {if (n.getNodeType() != Node.ELEMENT_NODE)continue;

Element e = (Element) n;e.normalize();String text = e.getFirstChild().getNodeValue();if (e.getTagName().equals("title")) {c.setTitle(text);

} else if (e.getTagName().equals("description")) {c.setDescription(text);

} else if (e.getTagName().equals("link")) {c.setLink(text);

} else if (e.getTagName().equals("language")) {

...

Page 46: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM2Channel 2…

}nl = channel.getElementsByTagName("item");c.setItems(dom2Items(nl));nl = channel.getElementsByTagName("image");c.setImage(dom2Image(nl.item(0)));nl = channel.getElementsByTagName("textinput");c.setTextInput(dom2TextInput(nl.item(0)));nl = channel.getElementsByTagName("skipHours");c.setSkipHours(dom2SkipHours(nl.item(0)));nl = channel.getElementsByTagName("skipDays");c.setSkipDays(dom2SkipDays(nl.item(0)));return c;

}

Page 47: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM Level 1n Attr

n API for result as objectsn API for result as strings

n No DTD supportn Need Import - copying between doms is

bad

Page 48: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM Level 2n Core / Namespaces

n Import

n Traversaln Eventsn Rangen Stylesheets

Page 49: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Xerces DOM extensionsn Feature controls insertion of ignorable

whitespacen Feature controls creation of

EntityReference Nodesn User data object provided on

implementation class org.apache.xml.dom.NodeImpl

Page 50: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Xerces Lazy DOMn Delays object creation

n a node is accessed - its object is createdn all siblings are also created.

n Reduces time to complete parsing n Reduces memory usagen Good if you only access part of the

DOM

Page 51: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Tricky DOM stuffn item() based access

n use getNextSibling()

n Entity and EntityReference nodes –depends on parser settings

n Ignorable whitespacen Subclassing the DOM

n Use the Document factory class

Page 52: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DOM vs SAXn DOM

n creates a document object treen tree must the be reparsed & converted to

BO’sn Could subclass BO’s from DOM classes

n SAXn event handler based APIn stack managementn no intermediate data structures generated

Page 53: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 54: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Validationn When to use validationn non-validating != wf checking

n Not required to read external entities

n validation dial settings

Page 55: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Namespacesn Purpose

n Syntactic discrimination onlyn They don’t point to anythingn They don’t imply schema composition/combination

n Universal Namesn URI + Local Name

{http://www.mydomain.com}Elementn Declarations

n xmlns “attribute”

n Prefixesn Scopingn Validation and DTDs

Page 56: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Namespace Example<xsl:stylesheet version="1.0"

xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

xmlns:html="http://www.w3.org/TR/xhtml1/strict">

<xsl:template match="/">

<html:html>

<html:head>

<html:title>Title</html:title>

</html:head>

<html:body>

<xsl:apply-templates/>

</html:body>

</xsl:template>

...

</xsl:stylesheet>

Page 57: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Default Namespace Example<xsl:stylesheet version="1.0"

xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

xmlns="http://www.w3.org/TR/xhtml1/strict">

<xsl:template match="/">

<html>

<head>

<title>Title</title>

</head>

<body>

<xsl:apply-templates/>

</body>

</xsl:template>

...

</xsl:stylesheet>

Page 58: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX2 ContentHandlerpublic class NamespaceDriver extends DefaultHandler implements ContentHandler {

public void startPrefixMapping(String prefix,String uri) throws SAXException {

System.out.println("startMapping "+prefix+", uri = "+uri);

}

public void endPrefixMapping(String prefix) throws SAXException {

System.out.println("endMapping "+prefix);

}

public void startElement(String namespaceURI,String localName,String

rawName,Attributes atts) throws SAXException {

System.out.println("Start: nsURI="+namespaceURI+" localName= "+localName+"rawName = "+rawName);

printAttributes(atts);

}

public void endElement(String namespaceURI,String localName,String rawName) throws SAXException {

System.out.println("End: nsURI="+namespaceURI+" localName= "+localName+"rawName = "+rawName);

}

Page 59: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

SAX2 ContentHandler 2public void printAttributes(Attributes atts) {

for (int i = 0; i < atts.getLength(); i++) {

System.out.println("\tAttribute "+i+" URI="+atts.getURI(i)+

" LocalName="+atts.getLocalName(i)+

" Type="+atts.getType(i)+" Value="+atts.getValue(i));

}

}

Page 60: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 61: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

XML Scheman Richer grammar specificationn W3C Working Draftn Structures

n XML Instance Syntaxn Namespacesn Weird content models

n Datatypesn Lots

n Prototype Support in Xerces

Page 62: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

RSS Schema<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE schema PUBLIC "-//W3C//DTD XMLSCHEMA 20000225//EN"

"http://www.w3.org/TR/2000/WD-xmlschema-1-20000225/structures.dtd"><xsd:schema

targetNamespace="http://my.netscape.com/publish/formats/rss-0.91"><xsd:element name="rss"><xsd:complexType>

<xsd:element ref="channel"/><xsd:attribute name="version" fixed="0.91" minOccurs="1"/>

</xsd:complexType></xsd:element>

Page 63: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

RSS Schema 2<xsd:element name="channel"><xsd:complexType>

<xsd:element ref="title"/><xsd:element ref="description"/><xsd:element ref="link"/><xsd:element ref="language"/><xsd:element ref="item" minOccurs="1" maxOccurs="*"/><xsd:element ref="rating" minOccurs="0"/><xsd:element ref="image" minOccurs="0"/><xsd:element ref="textinput" minOccurs="0"/><xsd:element ref="copyright" minOccurs="0"/><xsd:element ref="pubDate" minOccurs="0"/><xsd:element ref="lastBuildDate" minOccurs="0"/><xsd:element ref="docs" minOccurs="0"/><xsd:element ref="managingEditor" minOccurs="0"/><xsd:element ref="webMaster" minOccurs="0"/><xsd:element ref="skipHours" minOccurs="0"/><xsd:element ref="skipDays" minOccurs="0"/>

</xsd:complexType></xsd:element>

Page 64: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

RSS Schema 3...

<xsd:element name="name" type="xsd:string"/><xsd:element name="rating" type="xsd:string"/><xsd:element name="language" type="xsd:string"/><xsd:element name="width" type="xsd:integer"/><xsd:element name="height" type="xsd:integer"/><xsd:element name="copyright" type="xsd:string"/><xsd:element name="pubDate" type="xsd:date"/><xsd:element name="lastBuildDate" type="xsd:date"/><xsd:element name="docs" type="xsd:string"/><xsd:element name="managingEditor" type="xsd:string"/><xsd:element name="webMaster" type="xsd:string"/><xsd:element name="hour" type="xsd:time"/><xsd:element name="day" type="xsd:string"/>

</xsd:schema>

Page 65: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 66: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Grammar Accessn No DOM standardn SAX 2n Common Applications

n Editorsn Stub generators

n Also experimental API in Xerces

Page 67: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DTD Access Examplepublic class DTDDumper implements DeclHandler {

public void elementDecl (String name, String model) throws SAXException {

System.out.println("Element "+name+" has this content model: "+model);

}

public void attributeDecl (String eName, String aName, String type,

String valueDefault, String value) throwsSAXException {

System.out.print("Attribute "+aName+" on Element "+eName+" has type "+type);

if (valueDefault != null)

System.out.print(", it has a default type of "+valueDefault);

if (value != null)

System.out.print(", and a default value of "+value);

System.out.println();

}

Page 68: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DTD Access Example 2public void internalEntityDecl (String name, String value) throwsSAXException {

String peText = name.startsWith("%") ? "Parameter " : "";

System.out.println("Internal "+peText+"Entity "+name+" has value: "+value);

}

public void externalEntityDecl (String name, String publicId, StringsystemId) throws SAXException {

String peText = name.startsWith("%") ? "Parameter " : "";

System.out.print("External "+peText+"Entity "+name);

System.out.print("is available at "+systemId);

if (publicId != null)

System.out.print(" which is known as "+publicId);

System.out.println();

}

Page 69: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

DTD Access Driverpublic static void main(String args[]) {

try {

XMLReader r = new SAXParser();

r.setProperty("http://xml.org/sax/properties/declaration-handler",

new DTDDumper());

r.parse(args[0]);

} catch (Exception e) {

e.printStackTrace();

}

}

Page 70: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 71: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Round Trippingn Use SAX2 - DOM can’t do itn XML Decln Comments (SAX2)n PE Decls

Page 72: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

“Serializing”n Several choices

n XMLSerializern TextSerializern HTMLSerializer

n XHTMLSerializer

n Work as either:n SAX DocumentHandlersn DOM serializers

Page 73: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Serializer Example

DOMParser p = new DOMParser();

p.parse("file:///twl/Work/ApacheCon/fm0.91_full.rdf");

Document d = p.getDocument();

// XML

OutputFormat format = new OutputFormat("xml", "UTF-8", true);

XMLSerializer serializer = new XMLSerializer(System.out, format);

serializer.serialize(d);

// XHTML

format = new OutputFormat("xhtml", "UTF-8", true);

serializer = new XHTMLSerializer(System.out, format);

serializer.serialize(d);

Page 74: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Entity Handlingn Entity / Catalog handling

n Catalog handling is mostly for document processing (SGML) applications

n Pluggable entity handling is important for server applications.n Provides access point to DBMS, LDAP, etc.

n predefined character entitiesn amp, lt, gt, apos, quot

Page 75: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Concurrencyn “Thread safety”

n Xerces allows multiple parser instances, one per thread

n Xerces does not allow multiple threads to enter a single parser instance.

n org.apache.xerces.framework.XMLParser#parseSome()

Page 76: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 77: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Grammar Designn Attributes vs Elements

n performance tradeoff vs programming tradeoff

n At most 1 attribute, many subelements

n ID & IDREFn NMTOKEN vs CDATAn CDATASECTIONsn Ignorable Whitespace

Page 78: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Grammar Design 2n Everything is Stringsn Data modelling

n Relationships as elementsn modelling inheritance

n Plan to use schema datatypesn avoid weird types

Page 79: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 80: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Performance Tipsn skip the DOMn reuse parser instances (code)

n reset() method

n defaulting is slown external anythings are slower, entities

are slowern only validate if you have to, validate

once for all timen use fast encoding (US-ASCII)

Page 81: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture

Page 82: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Xerces Architecturen Use SAX based infrastructure where

possible to avoid reinventing the wheeln Modular / Framework designn Prefer composition for system

components

Page 83: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Xerces Architecture Diagram

SAXParser DOMParser DOM Implementations

Error Handling XMLParser XMLDTDScanner DTDValidator

XMLDocumentScanner XSchemaValidator

Entity Handling

Readers

Page 84: ApacheCon 2000 Everything you ever wanted to know about XML Parsing

Things we don’t don HTML parsing (yet)

n java Tidy http://www3.sympatico.ca/ac.quick/jtidy.html

n Grammar caching (yet)n Parse several documents in a stream