Effective XML
![Page 1: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/1.jpg)
Effective XML
• Elliotte Rusty Harold
• [email protected]
• http://www.cafeconleche.org/
![Page 2: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/2.jpg)
Part 0: Should We Use XML?
![Page 3: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/3.jpg)
The XML Backlash
“With proper mark-up/logic separation, a POJO data model, and a refreshing lack of XML, Apache Wicket makes developing web-apps simple and enjoyable again. Swap the boilerplate, complex debugging and brittle code for powerful, reusable components written with plain Java and HTML.”
-- Apache Wicket
![Page 4: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/4.jpg)
Choose XML
● For data that must be exchanged
● Or extended
● Or stored
![Page 5: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/5.jpg)
Don’t Choose XML for
● Purely local, transient data (e.g. internal method arguments)
● RPC is an edge case
![Page 6: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/6.jpg)
Why Use XML
● Well-defined, well understood
● Secure
● Extensible
● Fast
● Easy
● Robust
● Internationalizable
● Platform independent
● Language independent
● Not executable
● Standard parsers easily available
![Page 7: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/7.jpg)
Avoid
● JSON
● YAML
● Java Properties
● Custom syntax
● Etc.
![Page 8: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/8.jpg)
Why? 2 usually orthogonal reasons
● Mixing Data with Code is Bad
  – Unportable data
  – Opens big security holes
  – This is why you want to use XML instead of Ruby, Python, PHP, etc.
● Weak Parsers
  – Bugs and security holes
  – Not internationalizable
  – This is why you don’t want to use YAML, custom file formats parsed by regular expressions, etc.
![Page 9: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/9.jpg)
Limited Use Cases
● Works for:
  – Lists
  – Maps
  – Sets
  – Simple config files
● Not so well for:
  – Trees
  – Networks
  – Narrative data
  – Annotated data
![Page 10: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/10.jpg)
Choose the right tools:
● XPath, XSLT, XQuery
● E4X, XOM, JDOM
● RELAX NG
● Avoid
  – Regular expressions
  – DOM
  – W3C XSD Schemas
![Page 11: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/11.jpg)
Part I: Syntax
![Page 12: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/12.jpg)
Stay with XML 1.0
• XML 1.1:
  • New name characters
  • C0 control characters
  • C1 control characters
  • NEL
  • Undeclare namespace prefixes
• Incompatible with
  • Most XML parsers
  • W3C and RELAX NG schema languages
  • XOM, JDOM
  • Many browsers
![Page 13: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/13.jpg)
Part II: Structure
![Page 14: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/14.jpg)
The XML Stack
![Page 15: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/15.jpg)
Allow All XML syntax
• CDATA sections
• Entity references
• Processing instructions
• Comments
• Numeric character references
• Document type declarations
• Different ways of representing the same core content; not different information
![Page 16: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/16.jpg)
Distinguish text from markup
• A DocBook element
    <programlisting><![CDATA[<value>
      <double>28657</double>
    </value>]]></programlisting>
• The content is:
    <value>
      <double>28657</double>
    </value>
• This is the same:
    <programlisting>&lt;value&gt;
      &lt;double&gt;28657&lt;/double&gt;
    &lt;/value&gt;</programlisting>
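The equivalence above can be checked with any parser. The following is a minimal sketch, assuming the JDK's built-in JAXP DOM parser; the class name `CdataVersusEscapes` and the helper `rootText` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class CdataVersusEscapes {

    // Parses an XML string and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<programlisting><![CDATA[<value> <double>28657</double></value>]]>"
                + "</programlisting>";
        String withEscapes = "<programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt;"
                + "&lt;/value&gt;</programlisting>";

        // Both spellings carry the same character data once parsed.
        System.out.println(rootText(withCdata));
        System.out.println(rootText(withCdata).equals(rootText(withEscapes)));
    }
}
```

Code that cares whether the author typed a CDATA section or a run of `&lt;` references is depending on syntax sugar rather than on content.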
![Page 17: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/17.jpg)
The reverse problem
• Tools that create XML from strings:
  • Tree-based editors like <Oxygen/> or XML Spy
  • WYSIWYG applications like OpenOffice Writer
  • Programming APIs such as DOM, JDOM, and XOM
• The tool automatically escapes reserved characters like <, >, or &.
• Just because something looks like an XML tag does not mean it is an XML tag.
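A rough illustration of the reverse problem, assuming the JDK's DOM and identity `Transformer`; `EscapedOnOutput` and `serializePara` are hypothetical names, and the `para` element is only an example. The string handed to the API looks like it contains tags, but it is stored as character data and escaped on output.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class EscapedOnOutput {

    // Builds <para>text</para> through DOM and serializes it to a string.
    static String serializePara(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element para = doc.createElement("para");
            para.setTextContent(text); // stored as character data, never parsed as markup
            doc.appendChild(para);

            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The <em> and <strong> below are text, so the serializer escapes them.
        System.out.println(serializePara("Use <em> for emphasis & <strong> for importance"));
    }
}
```

To get real child elements, the program has to create them as elements (for example with `createElement`), not paste tag-shaped strings into text.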
![Page 18: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/18.jpg)
White space matters
• Parsers report all white space in element content, including boundary white space
• An xml:space attribute is for the client application only, not the parser
• White space in attribute values is normalized
• Parsers do not report white space in the prolog, the epilog, the document type declaration, or tags.
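The white space rules can be observed directly. This is a small sketch, assuming the JDK's default DOM parser and documents without a DTD; `WhiteSpaceReport`, `root`, and `childCount` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class WhiteSpaceReport {

    // Parses the XML string and returns its root element.
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Number of child nodes (elements and text nodes) directly under the root.
    static int childCount(String xml) {
        return root(xml).getChildNodes().getLength();
    }

    public static void main(String[] args) {
        // No white space: the only child is the item element.
        System.out.println(childCount("<list><item/></list>"));
        // Pretty-printed: the line breaks and indentation arrive as text nodes too.
        System.out.println(childCount("<list>\n  <item/>\n</list>"));
        // A literal line break inside an attribute value is normalized to a space.
        System.out.println(root("<list note='a\nb'/>").getAttribute("note").equals("a b"));
    }
}
```

Code that walks children by position (first child, second child) therefore breaks as soon as someone reindents the document.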
![Page 19: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/19.jpg)
Make structure explicit through markup
• Bad
    <Transaction>Withdrawal 2003 12 15 200.00</Transaction>
• Better
    <Transaction type="withdrawal">
      <Date>2003-12-15</Date>
      <Amount>200.00</Amount>
    </Transaction>
![Page 20: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/20.jpg)
Store metadata in attributes
• Material the reader doesn’t want to see
  • URLs
  • IDs
  • Styles
  • Revision dates
  • Author’s name
• No substructure
  • Revision tracking
  • Citations
• Single item only
![Page 21: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/21.jpg)
Remember mixed content
• Narrative documents
• Record-like documents
• The RSS problem
    <item>
      <title>Xerlin 1.3 released</title>
      <description>
        Xerlin 1.3, an open source XML Editor written in Java, has been released.
        Users can extend the application via custom editor interfaces for specific DTDs.
        New features in version 1.3 include XML Schema support, WebDAV capabilities,
        and various user interface enhancements. Java 1.2 or later is required.
      </description>
      <link>http://www.cafeconleche.org/#news2003April7</link>
    </item>
![Page 22: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/22.jpg)
What you really want is this:
    <description>
      <p><a href="http://www.xerlin.org"><strong>Xerlin 1.3</strong></a>,
      an open source XML Editor written in Java, has been released.
      Users can extend the application via custom editor interfaces for specific DTDs.
      New features in version 1.3 include:</p>
      <ul>
        <li>XML Schema support</li>
        <li>WebDAV capabilities</li>
        <li>Various user interface enhancements</li>
      </ul>
      <p>Java 1.2 or later is required.</p>
    </description>
![Page 23: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/23.jpg)
What people do is this:
    <description>&lt;p&gt;&lt;a href="http://www.xerlin.org"&gt;&lt;strong&gt;Xerlin 1.3&lt;/strong&gt;&lt;/a&gt;,
    an open source XML Editor written in Java, has been released.
    Users can extend the application via custom editor interfaces for specific DTDs.
    New features in version 1.3 include:&lt;/p&gt;
    &lt;ul&gt;
    &lt;li&gt;XML Schema support&lt;/li&gt;
    &lt;li&gt;WebDAV capabilities&lt;/li&gt;
    &lt;li&gt;Various user interface enhancements&lt;/li&gt;
    &lt;/ul&gt;
    &lt;p&gt;Java 1.2 or later is required.&lt;/p&gt;
    </description>
![Page 24: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/24.jpg)
Part III: Semantics
![Page 25: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/25.jpg)
Include all information in instance documents
• Not all parsers read the DTD
  • Especially browsers
• Beware
  • Default attribute values
  • Parsed entity references
  • XInclude
  • ID type dependence (XPath, DOM, etc.)
![Page 26: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/26.jpg)
Encode binary data using quoted printable and/or Base64
• Quoted printable works well for mostly text
• Base64 for non-text data
• Can you link to the data with a URL instead?
• Can you bundle the data with XML using zip, jar, XOP, or MIME?
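When the bytes do have to travel inside the document, Base64 keeps them within the legal XML character range. A minimal sketch using `java.util.Base64`; the `payload` element, its `encoding` attribute, and the class name `Base64Payload` are assumptions made up for this example.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class Base64Payload {

    // Wraps raw bytes in a (hypothetical) payload element.
    // The Base64 alphabet contains no <, >, or &, so the text needs no escaping.
    static String wrap(byte[] data) {
        return "<payload encoding=\"base64\">"
                + Base64.getEncoder().encodeToString(data)
                + "</payload>";
    }

    // Reads the element back with a real parser and decodes its text content.
    static byte[] unwrap(String xml) {
        try {
            String text = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
            return Base64.getDecoder().decode(text.trim());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes that could not appear literally in XML text: NUL, 0xFF, '<', '&'.
        byte[] raw = {0x00, (byte) 0xFF, '<', '&'};
        String xml = wrap(raw);
        System.out.println(xml);
        System.out.println(Arrays.equals(raw, unwrap(xml)));
    }
}
```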
![Page 27: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/27.jpg)
Use namespaces for modularity and extensibility
• Simple cases can use one default namespace
• http URIs are normally preferred
• DTD validation is tricky
• Code to namespace URIs, not prefixes
• Avoid namespace prefixes in element content and attribute values
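"Code to namespace URIs, not prefixes" might look like the sketch below, which assumes the JDK's DOM parser with namespace awareness switched on. The SVG namespace is used only as a familiar example; `NamespaceLookup` and `countRects` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class NamespaceLookup {

    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Counts rect elements in the SVG namespace, whatever prefix the author chose.
    static int countRects(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagNameNS(SVG_NS, "rect")
                    .getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String defaultNs = "<svg xmlns='http://www.w3.org/2000/svg'><rect/></svg>";
        String prefixed = "<s:svg xmlns:s='http://www.w3.org/2000/svg'><s:rect/></s:svg>";
        String noNamespace = "<svg><rect/></svg>";

        System.out.println(countRects(defaultNs));
        System.out.println(countRects(prefixed));
        System.out.println(countRects(noNamespace));
    }
}
```

Matching on the qualified name `s:rect` or on the bare name `rect` would accept or reject documents for the wrong reason; only the URI plus local name identifies the element.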
![Page 28: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/28.jpg)
Reuse XHTML for generic narrative content
• <!ENTITY % xhtml1 SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  %xhtml1;
• <!ELEMENT description %Block;>
![Page 29: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/29.jpg)
Choose the right schema language for the job
• DTDs
• The W3C XML Schema Language
• RELAX NG
• Schematron
![Page 30: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/30.jpg)
Use only what you need
• You need:
  • Well-formed XML 1.0
  • A parser
• You probably need:
  • Namespaces
• You may not need:
  • DTDs
  • Schemas
  • XInclude
  • SOAP
  • WS-Kitchen-Sink
  • etc.
![Page 31: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/31.jpg)
Always use a parser
• Can’t use regular expressions:
  • Detecting encoding
  • Comments and processing instructions that contain tags
  • CDATA sections
  • Unexpected placement of spaces and line breaks within tags
  • Default attribute values
  • Character and entity references
  • Malformed documents
  • Internal DTD Subset
• Why not?
  • Unfamiliarity with parsers
  • Too slow
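Three of the failure modes listed above (a tag inside a comment, a space inside a tag, an entity reference) fit in one tiny document. The sketch below is illustrative only; the regular expression is a deliberately naive one of the kind people actually write, and `RegexVersusParser` and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RegexVersusParser {

    // Naive extraction: grab whatever sits between <item> and </item>.
    static List<String> itemsWithRegex(String xml) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("<item>(.*?)</item>").matcher(xml);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    // The same question asked of a real parser.
    static List<String> itemsWithParser(String xml) {
        try {
            NodeList items = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("item");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                result.add(items.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<list>\n"
                + "  <!-- <item>draft</item> -->\n"   // commented out, not an element
                + "  <item >A &amp; B</item>\n"       // legal space before the closing >
                + "</list>";
        System.out.println(itemsWithRegex(xml));  // finds only the commented-out draft
        System.out.println(itemsWithParser(xml)); // finds the one real item, unescaped
    }
}
```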
![Page 32: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/32.jpg)
Layer Functionality
[Diagram: book.xml uses XInclude to pull in preface.xml, xmlsyntax.xml, xmlprotocols.xml, and 16 more chapters, producing finished_book.xml. If finished_book.xml is not valid, an error message is printed. If it is valid, separate XSLT transforms produce book.xhtml, book.html, and book.fo, and fop turns book.fo into book.pdf. Chapters 1 to 17 are also transformed individually to XSL-FO and then to PDF with fop, and a SAX program extracts the example source code files.]
![Page 33: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/33.jpg)
Program to standard APIs
• Easier to deploy in Java 1.4/1.5
• Different implementations have different performance characteristics
• SAX is fast
• DOM interoperates
![Page 34: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/34.jpg)
Program to non-standard APIs for ease of development
● JDOM, XOM
● E4X
![Page 35: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/35.jpg)
Read the complete DTD
• Be conservative in what you generate; liberal in what you accept
• Important content from DTD:
  • Default attribute values
  • Namespace declarations
  • Entity references
  • ID types
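Default attribute values are the easiest of these to demonstrate. A small sketch, assuming the JDK's default (non-validating) DOM parser, which reads the internal DTD subset; the `doc` element, its `version` attribute, and the class name `DefaultsFromDtd` are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class DefaultsFromDtd {

    // Returns the named attribute of the root element, or "" if it is absent.
    static String rootAttribute(String xml, String name) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The instance never writes version="1.0"; the DTD supplies it.
        String withDtd = "<!DOCTYPE doc [<!ATTLIST doc version CDATA \"1.0\">]><doc/>";
        // Same element, but the declaration is gone, so the attribute is too.
        String withoutDtd = "<doc/>";

        System.out.println("with DTD:    '" + rootAttribute(withDtd, "version") + "'");
        System.out.println("without DTD: '" + rootAttribute(withoutDtd, "version") + "'");
    }
}
```

The same element yields different data depending on whether the declarations were read, which is why a consumer should read the whole DTD and a producer should not rely on it.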
![Page 36: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/36.jpg)
Navigate with XPath
• More robust against unexpected structure
• Allows optimization by engine
• Easier to code; enhanced programmer productivity
• Might be slower
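A brief sketch of the robustness argument, assuming the JDK's `javax.xml.xpath` API. The transaction markup follows the earlier slide; the `Statement` and `Details` wrappers and the class name `XPathNavigation` are made up to show a structure the code did not anticipate.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathNavigation {

    // Evaluates an XPath expression against an XML string and returns its string value.
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad expression or document", e);
        }
    }

    public static void main(String[] args) {
        String flat = "<Transaction type='withdrawal'>"
                + "<Date>2003-12-15</Date><Amount>200.00</Amount></Transaction>";
        // Same data, wrapped and nested differently, with a comment thrown in.
        String nested = "<Statement><!-- exported --><Transaction type='withdrawal'>"
                + "<Details><Amount>200.00</Amount><Date>2003-12-15</Date></Details>"
                + "</Transaction></Statement>";

        String amount = "//Transaction[@type='withdrawal']//Amount";
        System.out.println(evaluate(flat, amount));
        System.out.println(evaluate(nested, amount));
    }
}
```

Equivalent DOM code written as "second child of the root" would need changes for the second document; the XPath expression does not.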
![Page 37: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/37.jpg)
Validate inside your program with schemas
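One possible shape for in-program validation, using the JDK's `javax.xml.validation` API. The JDK ships only a W3C XML Schema validator, so this sketch uses XSD even though the deck prefers RELAX NG; the schema itself and the class name `SchemaCheck` are assumptions written for the transaction example from the earlier slide.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaCheck {

    // A small illustrative schema for the Transaction element.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='Transaction'>"
          + "<xs:complexType>"
          + "<xs:sequence>"
          + "<xs:element name='Date' type='xs:date'/>"
          + "<xs:element name='Amount' type='xs:decimal'/>"
          + "</xs:sequence>"
          + "<xs:attribute name='type' type='xs:string' use='required'/>"
          + "</xs:complexType>"
          + "</xs:element>"
          + "</xs:schema>";

    static Schema compileSchema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
    }

    // Returns true if the document is valid against the schema above.
    static boolean isValid(String xml) {
        try {
            // With no ErrorHandler set, the validator throws on the first error.
            compileSchema().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<Transaction type='withdrawal'>"
                + "<Date>2003-12-15</Date><Amount>200.00</Amount></Transaction>";
        String bad = "<Transaction type='withdrawal'>"
                + "<Date>2003 12 15</Date><Amount>200.00</Amount></Transaction>";
        System.out.println(isValid(good));
        System.out.println(isValid(bad));
    }
}
```

Validating at the point of input lets the rest of the program assume the structure it needs instead of checking for missing elements at every step.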
![Page 38: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/38.jpg)
Part IV: Implementation
![Page 39: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/39.jpg)
Write documents in Unicode
• Prefer UTF-8
  • Smaller in English
  • ASCII compatible
• Normalization
  • É, ü, ì and so forth
  • NFC
  • ICU
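A quick look at what NFC normalization does. The slide points to ICU; this sketch swaps in the JDK's own `java.text.Normalizer` so that it has no dependencies, and `NfcText` and `toNfc` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NfcText {

    // Converts text to Unicode Normalization Form C (composed characters).
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by a combining acute accent
        String composed = "\u00e9";    // the single precomposed character

        // The two strings look identical on screen but compare unequal until normalized.
        System.out.println(decomposed.equals(composed));
        System.out.println(toNfc(decomposed).equals(composed));

        // NFC text is also shorter once encoded as UTF-8.
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(toNfc(decomposed).getBytes(StandardCharsets.UTF_8).length);
    }
}
```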
![Page 40: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/40.jpg)
Avoid Vendor Lock-in; Beware
• Opaque, binary data used in place of marked up text.
• Over-abbreviated, nonobvious names like F17354 and grgyt
• APIs that hide the XML
• Products that focus on the "Infoset"
• Alternate serializations of XML
• Patented formats
![Page 41: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/41.jpg)
Hang on to your relational database
• For tabular data
• But consider native XML databases going forward
![Page 42: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/42.jpg)
Pick the correct MIME type
• application/xml
• Not text/xml!
• Don't use charset
• application/mathml+xml
• image/svg+xml
• application/xslt+xml
![Page 43: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/43.jpg)
TagSoup Your HTML
![Page 44: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/44.jpg)
Compress if space is a problem
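    // Needs java.io.*, java.util.zip.*, javax.xml.parsers.*, org.w3c.dom.Document,
    // and the Xerces classes org.apache.xml.serialize.OutputFormat and XMLSerializer.

    // output: document is the org.w3c.dom.Document to save
    OutputStream fout = new FileOutputStream("data.xml.gz");
    OutputStream out = new GZIPOutputStream(fout);
    OutputFormat format = new OutputFormat(document);
    XMLSerializer output = new XMLSerializer(out, format);
    output.serialize(document);
    out.close(); // finishes the gzip stream and flushes it to disk

    // input
    InputStream fin = new FileInputStream("data.xml.gz");
    InputStream in = new GZIPInputStream(fin);
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder parser = factory.newDocumentBuilder();
    Document doc = parser.parse(in);
    // work with the document...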
![Page 45: Effective XML](https://reader035.fdocuments.net/reader035/viewer/2022062321/56813d60550346895da73138/html5/thumbnails/45.jpg)
To Learn More
• Effective XML: 50 Specific Ways to Improve Your XML Documents
• Elliotte Rusty Harold
• Addison-Wesley, 2003
• ISBN 0-321-15040-6
• $44.99
• http://cafeconleche.org/books/effectivexml