Introduction to XML. Timothy W. Cole, Thomas G. Habing, University of Illinois at UC. CDP / Colorado Alliance of Research Libraries, 23 October 2002


  • Introduction to XML

    Timothy W. Cole, Thomas G. Habing, University of Illinois at UC

    CDP / Colorado Alliance of Research Libraries, 23 October 2002

  • Presenters

    Tim Cole: Mathematics Librarian & Assoc. Prof. of Library Admin.; PI, Univ. of Illinois OAI Metadata Harvesting

    Tom Habing: Research Programmer, Grainger Engineering Library Information

  • Agenda
      • Introduction to XML
          • What is it?
          • What's it good for?
          • How does it work?
      • The infrastructure of XML
      • Using XML on the Web
      • Implementation issues & costs

  • What is it? Discussion points:
      • First principles: OHCO
      • Example: A simple XML fragment
      • Compare/contrast: SGML, HTML, XHTML
      • A different XML for every community
      • Terminology

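  • Ordered hierarchies of content objects
      • Premise: A text is the sum of its component parts
      • A <Book> could be defined as containing: <FrontMatter>, <Chapter>s, <BackMatter>
      • <FrontMatter> could contain: <BookTitle> <Author>s <PubInfo>
      • A <Chapter> could contain: <ChapterTitle> <Paragraph>s
      • A <…> could contain: <…>s or <…>s or <…>s
      • Components chosen should reflect anticipated use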

  • Ordered hierarchies of content objects
      • OHCO is a useful, albeit imperfect, model
      • Exposes an object's intellectual structure
      • Supports reuse & abstraction of components
      • Better than a bit-mapped page image
      • Better than a model of text as a stream of characters plus formatting instructions
      • Data management system for document-like objects
      • Does not allow overlapping content objects
      • Incomplete; requires infrastructure

  • Content objects in a book
      • Book
          • FrontMatter: BookTitle, Author(s), PubInfo
          • Chapter(s): ChapterTitle, Paragraph(s)
          • BackMatter: References, Index

  • Content objects in a catalog card
      • Card
          • CallNumber
          • MainEntry
          • TitleStatement
          • TitleProper
          • StatementOfResponsibility
          • Imprint
          • SummaryNote
          • AddedEntrySubject(s)
          • AddedEntryPersonalName(s)

  • A simple XML fragment

    (Tags not preserved; text content only) XML Is Easy | Tim Cole | Tom Habing | CDP Press, 2002 | First Was SGML | Once upon a time
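
The markup of this fragment did not survive, so the following is only a sketch of what such a fragment might look like. The tag names are an assumption, borrowed from the "Content objects in a book" slide; only the text content comes from the original. Parsing it with Python's standard library shows the ordered hierarchy directly:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names are assumed (taken from the
# "Content objects in a book" slide); the text content is from the slide.
FRAGMENT = """\
<Book>
  <FrontMatter>
    <BookTitle>XML Is Easy</BookTitle>
    <Author>Tim Cole</Author>
    <Author>Tom Habing</Author>
    <PubInfo>CDP Press, 2002</PubInfo>
  </FrontMatter>
  <Chapter>
    <ChapterTitle>First Was SGML</ChapterTitle>
    <Paragraph>Once upon a time</Paragraph>
  </Chapter>
</Book>
"""

root = ET.fromstring(FRAGMENT)


def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(root)))
```

Each element is a content object, and the nesting and order of the tags carry the structure that a page image or a formatted character stream would not expose.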

  • This is NOT XML

    It was six men of Indostan To learning much inclined, Who went to see the Elephant (Though all of them were blind), That each by observation Might satisfy his mind.

  • XML comes from SGML
      • Standard Generalized Markup Language
      • Based on IBM's GML (Goldfarb, et al.)
      • ISO standard since 1986 (ISO 8879)
      • Used for large-scale document management (Boeing 747 user's manual)
      • Expensive, complex to implement
      • Not Web-friendly (no well-formed SGML)
      • Too many options (e.g., tag minimization)

  • XML, HTML, & XHTML
      • HTML: display-oriented, SGML-based scheme for making Web pages
          • Syntax & allowed elements (semantics) are fixed
      • XML: set of rules for defining markup schemes
          • Element set is fully extensible
          • Syntax is fixed
      • XHTML: HTML modified to be XML-compliant (not just SGML-compliant)

  • Markup languages compared
      • XML syntax is stricter than HTML or SGML
          • Must explicitly close all elements
          • Attribute values must be enclosed in quotes
          • All markup is case-sensitive
      • XML & SGML: no fixed tags, no predefined style
      • XML & SGML are extensible
      • Fixed elements (HTML) vs. rules (XML, SGML)
      • HTML elements describe how to present content
      • XML elements can describe the content itself
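
The three strictness rules can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample strings are made up for illustration and are not from the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative samples (assumed, not from the original slides).
samples = {
    "closed element": "<p>six men of Indostan</p>",
    "unclosed element": "<p>six men of Indostan",
    "quoted attribute": '<p align="center">text</p>',
    "unquoted attribute": "<p align=center>text</p>",
    "matching case": "<P>text</P>",
    "mismatched case": "<P>text</p>",
    "empty element": "<br/>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

An HTML browser of the period would have rendered every one of these samples; an XML parser is required to reject the three that break the rules.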

  • A different XML for every community
      • XML is a set of rules used for defining & encoding intellectual structures
      • XML is extensible & customizable
          • Its greatest strength
          • Its greatest weakness
      • HTML was invented by physicists
      • What if it had been lawyers, or teachers, or bureaucrats, or librarians, or ...?

  • Terminology
      • Document instance
      • Document class
      • Document Type Definition (DTD), or schema
      • Well-formed XML
      • Valid XML
      • Stylesheets
      • XML Transformations
      • Document Object Model (DOM)

  • What's it good for? Discussion points:
      • Smarter documents
      • Full text
      • Metadata
      • Machine-to-machine interactions

  • Smarter documents
      • Standards-based
      • Facilitates
          • Search & discovery
          • Precise, field-specific searching
          • Interoperability & normalization
          • Complex transformations
          • Linking between and within texts
          • Reuse of documents and fragments

  • Smarter documents

  • Full text
      • Electronic Text Center (U of VA Library)
          • Originally SGML, now also XML, eBooks
          • 70,000 texts; 350,000 related images
          • 37,000 visits to collection per day
      • eBook Forum
          • International trade & standards organization
          • Goal: establish specs & stds for epublishing

  • Using XML for full text
      • No inherent presentation information
      • Requires
          • CSS in XML-aware browsers, or
          • XSLT to transform to XHTML, or
          • XSL-FO to reformat for presentation
      • Techniques for including non-text content vary by application
      • XML can be verbose
      • Most standard full-text schemas are complex

  • Metadata
      • XML schemas exist for a range of metadata standards
      • Encoded Archival Description (EAD)
      • MARC 21 XML (also MODS)
      • Metadata Encoding & Transmission Standard (METS)
      • Dublin Core Variants
      • Open Archives Initiative (OAI)
      • National Science Digital Library (NSDL)
      • Resource Description Framework (RDF)

  • Using XML for metadata
      • Consistency in applying schema
      • Optional versus required elements
      • Consistent use of elements
      • Granularity & depth of information
      • XML schemas still evolving
      • Attributes versus elements
      • Mixing namespaces
      • Schema languages
      • Philosophical issues

  • Machine-to-machine interactions
      • Web services
      • Facilitating machine-to-machine communications via XML
      • Simple Object Access Protocol (SOAP)
      • XML Protocol Working Group
      • Semantic Web
      • Abstract representation of data on the Web
      • XML and Databases

  • How does it work?
      • In XML, there's content and there's markup.
      • Markup
          • Elements
          • Attributes
          • Comments
          • Processing instructions
      • Content
          • Entities
          • Encoded (Unicode) characters

  • Elements
      • Elements are markup that enclose content (a start tag and a matching end tag) or are empty (a single empty-element tag)
      • Content models
          • Parsed Character Data Only
          • Child Elements Only
          • Mixed

    Cole, T

  • Attributes
      • Associate a name-value pair with an element
      • Can be used to embellish content or to associate added content with an element

    Cole, T
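
The three content models and the name-value nature of attributes can be seen by inspecting a parsed tree. This is a sketch only: the `document/front/author` names and the `number` attribute are taken from the XPath slide, while `para`, `emph`, and the second author entry are made up for illustration.

```python
import xml.etree.ElementTree as ET

# <para>, <emph>, and the second <author> are hypothetical additions.
DOC = """\
<document>
  <front>
    <author number="1">Cole, T</author>
    <author number="2">Habing, T</author>
  </front>
  <para>Mixed content has <emph>child elements</emph> and text.</para>
</document>
"""

root = ET.fromstring(DOC)


def content_model(element):
    """Classify an element by what it directly contains."""
    has_children = len(element) > 0
    # Text can sit before the first child (.text) or after any child (.tail).
    chunks = [element.text] + [child.tail for child in element]
    has_text = any(chunk and chunk.strip() for chunk in chunks)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "child elements only"
    if has_text:
        return "parsed character data only"
    return "empty"


models = {el.tag: content_model(el) for el in root.iter()}
first_author = root.find("front/author")

for tag, model in models.items():
    print(f"<{tag}>: {model}")
print(first_author.text, first_author.attrib)
```

The attribute travels with the element as a dictionary entry, separate from the element's text content.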

  • Comments
      • Human-readable annotations
      • Can be inserted anywhere after headers
      • Not part of the document structure
      • Usually ignored by XML parsers
      • Do not have to be passed to application

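  • Processing instructions
      • Machine-readable & application-specific
      • Must be passed through by XML parsers
      • The XML declaration looks like a special PI (strictly, the XML specification does not class it as one)
      • If present, the XML declaration must come at the very start of the file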
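
A DOM parser makes the difference between comments and processing instructions visible as distinct node types. The sketch below uses Python's `xml.dom.minidom`; the stylesheet name `style.css` and the comment text are hypothetical.

```python
from xml.dom import minidom, Node

# "style.css" and the comment wording are illustrative assumptions.
DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="style.css"?>
<!-- A comment: a human-readable annotation -->
<document>
  <!-- Comments may also appear inside elements -->
  <author>Cole, T</author>
</document>
"""

dom = minidom.parseString(DOC)


def collect(node, node_type):
    """Return every node of the given type at or below `node`."""
    found = [node] if node.nodeType == node_type else []
    for child in node.childNodes:
        found.extend(collect(child, node_type))
    return found


comments = collect(dom, Node.COMMENT_NODE)
pis = collect(dom, Node.PROCESSING_INSTRUCTION_NODE)

for pi in pis:
    print("PI target:", pi.target, "| data:", pi.data)
for comment in comments:
    print("Comment:", comment.data.strip())
```

Note that the XML declaration on the first line does not show up in the list of processing instructions; only the `xml-stylesheet` instruction does.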

  • Entities
      • Placeholders for internal or external content
      • Placeholder for a single character
      • or string of text
      • or external content (images, audio, etc.)
      • Implementation specifics may vary

    &copyright; is replaced by its declared replacement text; &pic; is replaced by a graphic image

  • Character Encoding Issues
      • XML parsers must accept UTF-8 & UTF-16
      • Also must accept numeric character references: &#nnnn; (decimal) or &#xhhhh; (hexadecimal)
      • MARC-8 encodings must be converted to Unicode for use in XML
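
A short sketch of entity expansion, numeric character references, and the two mandatory encodings, using Python's standard parser. The replacement text declared for `&copyright;` and the sample title are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The entity's replacement text is hypothetical; it is declared in the
# document's internal DTD subset so the parser can expand it.
DOC = """<?xml version="1.0"?>
<!DOCTYPE notice [
  <!ENTITY copyright "Copyright 2002, CDP Press">
]>
<notice>&copyright; &#169; &#xA9;</notice>
"""

notice = ET.fromstring(DOC)
# The entity reference and both numeric references are resolved on parsing.
print(notice.text)

# The same document serialized as UTF-8 and as UTF-16 parses to the same text.
decoded = {}
for encoding in ("utf-8", "utf-16"):
    source = f'<?xml version="1.0" encoding="{encoding}"?><title>\u00c9l\u00e9phant</title>'
    decoded[encoding] = ET.fromstring(source.encode(encoding)).text
print(decoded)
```

`&#169;` and `&#xA9;` name the same code point (the copyright sign) in decimal and hexadecimal, so both resolve to the same character.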

  • The infrastructure of XML: required to make it work
      • DTDs & schemas: defining document classes
      • Reusing & integrating schemas (using namespaces)
      • Stylesheets for presentation & transformation
      • Standards for linking, querying, & pointing
      • Programming standards

  • Defining Document Classes
      • Formal descriptions of document structure
      • Set expectations
      • Maximize reusability
      • Enforce business rules
      • DTDs
      • XML Schema
      • Schematron
      • Relax NG

  • Document Type Definitions (DTD)
      • Legacy from SGML; part of XML standard

  • XML schema language
      • New in XML
      • Uses XML syntax
      • Supports datatyping
      • Richer and more complex

  • Alternatives: Schematron & RelaxNG
      • Schematron
          • Based on XPath (XSLT)
          • Doesn't support datatyping as well
          • Supports additional content models
          • May become an ISO standard
      • RelaxNG
          • Returns some of the power of SGML DTDs to XML (mixed and unordered content)
          • Uses datatyping from the XML Schema spec
          • Does not support inheritance
          • Developed by an OASIS Technical Committee chaired by James Clark

  • Namespaces
      • Qualify element and attribute names
      • Allows modularization of schemas
      • Mix and match elements from multiple schemas in document instances
      • Import or include from one XML Schema into another
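
As a sketch of mixing elements from two vocabularies in one document instance: the record below combines Dublin Core elements with elements from a local schema. The Dublin Core namespace URI is the published one; the local namespace URI, the `record` element, and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set
LOCAL = "http://example.org/ns/local"      # hypothetical local schema

DOC = f"""\
<record xmlns="{LOCAL}" xmlns:dc="{DC}">
  <dc:title>XML Is Easy</dc:title>
  <dc:creator>Tim Cole</dc:creator>
  <title>Shelf label</title>
</record>
"""

root = ET.fromstring(DOC)

# Prefixes used for searching are local to this code; only the URIs matter.
ns = {"dc": DC, "local": LOCAL}
dc_title = root.find("dc:title", ns).text
local_title = root.find("local:title", ns).text

print(root.tag)       # the parser stores names as {namespace-URI}localname
print(dc_title)
print(local_title)
```

The two `title` elements share a local name but stay distinct because each is qualified by its own namespace URI.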

  • XML & Cascading Style Sheets
      • Attach styling instructions directly to XML files

  • XSLT Transforming Stylesheets
      • Language for transforming XML documents
      • Into HTML, text, or other XML documents
      • Supported in new browsers (IE5+, Mozilla; not Opera)
      • Usually applied on the server or in batch mode
      • Valuable for interoperability or reusability


  • XSL-FO (formatting objects)
      • Another styling language
      • Similar to CSS, but includes the power of XSLT to rearrange the document
      • Syntax is entirely XML
      • Not currently supported in browsers (but there are tools for use on the server or in batch mode)

    Author: Cole, T

  • XPath, XPointer, & XLink
      • XPath
          • Allows addressing of parts of an XML document
          • Used in XSLT, XPointer, and XQuery
          • /document/front/author/@number
      • XPointer (working draft)
          • Used as a fragment id in an XML URI reference
          • http://.../some.xml#xpointer(/document/front/author)
      • XLink
          • Creates and describes extended or simple links between resources
          • Used for HTML-style hrefs or imgs, tables of contents, etc.

    Cole, T
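
The path `/document/front/author/@number` can be approximated with the XPath subset in Python's `xml.etree.ElementTree`. This is a sketch under assumptions: the sample document is reconstructed from the path and the "Cole, T" text, and the second author entry is made up. ElementTree does not select attribute nodes directly, so the final `@number` step is done with `.get()`.

```python
import xml.etree.ElementTree as ET

# Reconstructed sample; the second <author> is a hypothetical addition.
DOC = """\
<document>
  <front>
    <author number="1">Cole, T</author>
    <author number="2">Habing, T</author>
  </front>
</document>
"""

root = ET.fromstring(DOC)  # root is the <document> element

# /document/front/author, written relative to the root element.
authors = root.findall("./front/author")

# .../author/@number: read the attribute from each selected element.
numbers = [author.get("number") for author in authors]

# A predicate narrows the selection to one author by attribute value.
second = root.find("./front/author[@number='2']")

print(numbers)
print(second.text)
```

A full XPath 1.0 engine, such as the one inside an XSLT processor, would accept the original path as written and return the attribute nodes themselves.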

  • XQuery (XML query language)
      • Treat an XML document or collection of documents as a database
      • Equivalent to SQL SELECT statements, only for XML
      • Some support in XML databases (but working draft only)

  • Programming standards
      • Platform- and language-neutral interfaces that allow programs and scripts to dynamically access and update the content, structure, and style of XML documents.
      • Document Object Model (DOM)
          • Object-based
          • Better for complex documents
          • High memory usage, slower
          • Documents can be updated
      • Simple API for XML (SAX)
          • Event-based
          • Better for simple documents
          • Low memory usage, faster
          • Documents cannot be updated
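
The contrast between the two interfaces can be sketched with Python's standard `xml.dom.minidom` and `xml.sax` modules. The sample document and the `AuthorCollector` class are illustrative assumptions, not part of the original slides.

```python
import xml.sax
from xml.dom import minidom

# Illustrative sample document (assumed).
DOC = (
    "<document><front>"
    "<author>Cole, T</author><author>Habing, T</author>"
    "</front></document>"
)

# DOM: the whole document is held in memory as a tree, so it can be edited.
dom = minidom.parseString(DOC)
first_author = dom.getElementsByTagName("author")[0]
first_author.setAttribute("number", "1")
updated = dom.documentElement.toxml()


# SAX: the parser reports events in document order and keeps no tree.
class AuthorCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers the text of each <author>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # a list while inside <author>, else None

    def startElement(self, name, attrs):
        if name == "author":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "author":
            self.names.append("".join(self._buffer))
            self._buffer = None


handler = AuthorCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)

print(updated)
print(handler.names)
```

The DOM half changes the document and serializes it again; the SAX half can only observe the stream as it passes, which is why it needs far less memory on large inputs.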

  • Other XML-related standards
      • XBase
      • XForms
      • XML Encryption
      • XML Signature
      • Many more

  • Using XML on the Web
      • Case Studies:
          • Illinois DLI / D-Lib Test Suite Pro