Processing XML Documents

Processing XML Documents SNU IDB Lab.


Page 1: Processing XML Documents

Processing XML Documents
SNU IDB Lab.

Page 2: Processing XML Documents

Processing XML documents

– Processing XML Data
– Document Formatting (XSL & XSLT)

Page 3: Processing XML Documents

Contents : processing XML data

– Concepts
– Writing XML
– Reading XML
– Event processing
– Tree manipulation
– Events or trees?
– Transformation tools

Page 4: Processing XML Documents

Concepts (1/4)

Developing software to generate XML output is a trivial matter. However, reading an XML document can be complicated by a number of issues and features of the language. For example, the DTD may need to be processed, either to add default information or to compare against the document instance in order to validate it.

[Figure labels: marked-up document, rules, XML processor, data, errors, Application]

Page 5: Processing XML Documents

Concepts (2/4)

Programmers wishing to read XML data files need an XML-aware processing module, termed an XML processor.

XML processor
– The XML processor is responsible for making the content of the document available to the application.
– It also detects problems such as file formats that the application cannot process, or URLs that do not point to valid resources.

Page 6: Processing XML Documents

Concepts (3/4)

Two fundamentally different approaches to reading the content of an XML document are known as the 'event-driven' and 'tree-manipulation' techniques.

Event-driven
– The document is processed in strict sequence.
– Each element in the data stream is considered an event trigger, which may precipitate some special action on the part of the application.

Page 7: Processing XML Documents

Concepts (4/4)

Tree-manipulation
– The tree approach provides access to the entire document, allowing its contents to be interrogated and manipulated in any order.

Page 8: Processing XML Documents

Writing XML (1/3)

To produce XML data, it is only necessary to include XML tags in the output strings. However, one decision that has to be made is whether to output line-end codes or to omit them.

In many respects it is simpler and safer to omit line-end codes. But if the XML document is likely to be viewed or edited using tools that are not XML-aware, this approach makes the document very difficult to read.

Page 9: Processing XML Documents

Writing XML (2/3)

Some text editors will only display as much text as will fit on one line in the window.

Although some editors are able to display more text by creating 'soft' line breaks at the right margin, the content is still not very legible.

It would seem more convenient to break the document into separate lines at obvious points in the text. However, the recipient application may then have a problem determining which line-end codes are there purely to make the XML data file more legible.

Page 10: Processing XML Documents

Writing XML (3/3)

Without line-end codes:

    <book><front><title>The Book Title</title><author>J. Smith</author><date>October 1917</date></front><body><chapter><title>First Chapter</title><para>This is the first chapter in the book.</para><para>This is the …….

With line-end codes:

    <book>
    <front>
    <title>The Book Title</title>
    <author>J. Smith</author>
    <date>October 1917</date>
    </front>
    <body>
    <chapter>
    <title>First Chapter</title>
    <para>This is the first chapter in the book.</para>
    <para>This is the ……. …..
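The trade-off between the two output styles can be sketched in Python. This is only an illustration using the standard library's `xml.etree.ElementTree` and `xml.dom.minidom`; the helper name `build_book` and the shortened sample content are assumptions made for the example, not part of the slides.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def build_book():
    """Build a small version of the sample book (hypothetical helper)."""
    book = ET.Element("book")
    front = ET.SubElement(book, "front")
    ET.SubElement(front, "title").text = "The Book Title"
    ET.SubElement(front, "author").text = "J. Smith"
    return book


# Option 1: no line-end codes. Safe for software, hard for people to read.
compact = ET.tostring(build_book(), encoding="unicode")

# Option 2: line-end codes and indentation added purely for legibility.
pretty = minidom.parseString(compact).toprettyxml(indent="  ")

# The recipient now sees extra whitespace-only text inside <front>
# and has to decide whether it is significant.
front_compact = ET.fromstring(compact).find("front")
front_pretty = ET.fromstring(pretty).find("front")
print(repr(front_compact.text))
print(repr(front_pretty.text))
```

In the compact form the `front` element has no text of its own, while in the indented form it carries whitespace that exists only for layout, which is exactly the ambiguity the recipient application has to resolve.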

Page 11: Processing XML Documents

Reading XML (1/4)

[Figure labels: XML document, XML fragment, image, entity manager, XML processor, data, Application]

Page 12: Processing XML Documents

Reading XML (2/4)

The XML processor hides many complications from the application.

The XML processor has at least one sub-unit, termed the entity manager, which is responsible for locating fragments of the document held in entity declarations or in other data files, and for handling the replacement of all references to them.

Page 13: Processing XML Documents

Reading XML (3/4)

The XML processor delivers data to the application, but there are two distinct ways in which this can be done.

(1) Event-driven
– The simplest is to pass the data directly to the application as a stream. The application accepts the data stream and reacts to the markup as it is encountered.
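A minimal sketch of the event-driven style, assuming Python's standard `xml.sax` module as the XML processor. The class name `EventLogger` and the sample fragment are made up for the illustration.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Application code that reacts to markup as it is encountered."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # The parser may deliver one run of text in several chunks.
        self.events.append(("text", content))


handler = EventLogger()
xml.sax.parseString(b"<para>An <em>important</em> word.</para>", handler)

for kind, value in handler.events:
    print(kind, repr(value))
```

The application never sees the document as a whole. It only receives a strictly ordered sequence of start, text and end events.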

Page 14: Processing XML Documents

Reading XML (4/4)

(2) Tree-walking
– The XML processor holds onto the data on the application's behalf, and allows the application to ask questions about the data and to request portions of it.

Grove
– A tree or group of trees can be stored in a data structure.
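The tree-walking style can be sketched with Python's standard `xml.etree.ElementTree`, used here as a stand-in for a tree-building XML processor. The sample document is an assumption for illustration only.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<book>"
    "<chapter><title>First</title></chapter>"
    "<chapter><title>Second</title></chapter>"
    "</book>"
)

chapters = book.findall("chapter")

# Ask a question about the end of the document first ...
last_title = chapters[-1].findtext("title")

# ... then go back and change something near the start.
chapters[0].find("title").text = "Introduction"

# Whole subtrees can also be moved: put the last chapter first.
last = chapters[-1]
book.remove(last)
book.insert(0, last)

titles = [c.findtext("title") for c in book.findall("chapter")]
print(last_title, titles)
```

Because the whole document is held in memory, the application is free to visit and modify nodes in any order, which a single event-driven pass cannot do.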

Page 15: Processing XML Documents

Event processing (1/2)

The simplest method of processing an XML document is to read the content as a stream of data, and to interpret markup as it is encountered.

If out-of-sequence processing is required, such as collecting all the titles in a document for insertion at the start of the document as a table of contents, then a 'two-pass' processor is needed.

In the first pass, the titles are collected. In the second pass, they are inserted where they are required.
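The two-pass idea can be sketched with Python's `xml.sax`. This is one possible arrangement, not a prescribed design: the element names `toc` and `entry`, the class names, and the sample document are all assumptions for the example.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

DOC = (b"<book><chapter><title>One</title><para>a</para></chapter>"
       b"<chapter><title>Two</title></chapter></book>")


class TitleCollector(xml.sax.ContentHandler):
    """First pass: remember the text of every title element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not inside a title"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None


class TocInserter(XMLGenerator):
    """Second pass: copy the document, adding a toc right after <book>."""

    def __init__(self, out, titles):
        super().__init__(out, encoding="utf-8")
        self._titles = titles

    def startElement(self, name, attrs):
        super().startElement(name, attrs)
        if name == "book":
            super().startElement("toc", {})
            for title in self._titles:
                super().startElement("entry", {})
                super().characters(title)
                super().endElement("entry")
            super().endElement("toc")


collector = TitleCollector()
xml.sax.parseString(DOC, collector)  # pass 1: collect the titles

out = io.StringIO()
xml.sax.parseString(DOC, TocInserter(out, collector.titles))  # pass 2: insert
result = out.getvalue()
print(result)
```

The document has to be read twice because, in a single forward pass, the titles are not yet known at the point where the table of contents must be written.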

Page 16: Processing XML Documents

Event processing (2/2)

Simple API for XML (SAX 1.0)
– To reduce the workload of the application developer, and to make it easy to replace one parser with another, a common event-driven interface has been proposed for object-oriented languages such as Java.

Page 17: Processing XML Documents

Tree manipulation (1/3)

Software that holds the entire document in memory needs to organize the content so that it can be easily searched and manipulated.

There is no need for multi-pass parsing when any part of the document can be accessed instantly.

Applications that benefit from this approach include XML-aware editors, pagination engines and hypertext-enabled browsers.

Page 18: Processing XML Documents

Tree manipulation (2/3)

The abstract description of the model for SGML documents is called a grove, and the grove scheme is equally applicable to XML.

The name 'grove' is appropriate because it mainly describes a series of trees.

A grove is a 'directed graph of nodes'.

Each node is an object of a specified type: a package of information that conforms to a pre-defined template.

Page 19: Processing XML Documents

Tree manipulation (3/3)

A property has a name and a value, so it can be compared to an attribute.

A node that describes a person may have a property called 'age', which holds a value representing the age of the individual.

A node must have a type property and a name property, so that it can be identified or referred to.

[Figure: a node shown as a table of properties and property values: type = element, gi = para]

Page 20: Processing XML Documents

Events or trees? (1/3)

Event-driven benefits
– The parser does not have to hold much information about the document in memory.
– The document structure does not have to be managed in memory, either by the parser or, depending on what it needs to do, by the application. This makes parsing very fast.
– The application does not have to do anything special in order to process the document in a simple linear fashion, from start to end.

Page 21: Processing XML Documents

Events or trees? (2/3)

Tree-walking benefits
– With the entire document held in memory, the document structure can be analyzed several times over, quickly and easily.
– The data structure management module may be profitably utilized by the application to manage the document components on its behalf.
– A document that contains errors can be rejected before the application begins to process its contents, thereby eliminating the need for messy roll-back routines.

Page 22: Processing XML Documents

Events or trees? (3/3)

Other considerations
– The memory usage advantage of the event-driven approach may be only theoretical.
– If the application uses an event-driven API, the parser need not build a document tree. If the application uses a tree-walking API, that API can itself use the event-driven API to build its tree model.
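Layering a tree API on top of an event API can be sketched as follows. This assumes Python's `xml.sax` as the event source and `xml.etree.ElementTree.TreeBuilder` as the tree model; the class name `TreeFromEvents` and the sample input are invented for the example.

```python
import xml.sax
import xml.etree.ElementTree as ET


class TreeFromEvents(xml.sax.ContentHandler):
    """Turn a stream of SAX events into an in-memory element tree."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def startElement(self, name, attrs):
        # Copy the SAX attributes into a plain dict for the tree node.
        self.builder.start(name, dict(attrs.items()))

    def endElement(self, name):
        self.builder.end(name)

    def characters(self, content):
        self.builder.data(content)


handler = TreeFromEvents()
xml.sax.parseString(
    b'<list><item id="a">x</item><item id="b">y</item></list>', handler
)
root = handler.builder.close()  # returns the root element of the built tree

print(root.tag, [(item.get("id"), item.text) for item in root])
```

The parser itself never builds a tree here. The tree exists only because the handler chose to build one, which is why the memory advantage of the event-driven approach depends on what the application does with the events.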

Page 23: Processing XML Documents

Transformation tools

When the intent is simply to change an XML document structure into a new structure, there are existing tools.

These tools can usually do much more advanced things, such as changing the order of elements, sorting them, and generating new content automatically.

They can transform an XML document into another XML document, or into an HTML document.

Page 24: Processing XML Documents

Processing XML documents

– Processing XML Data
– Document Formatting (XSL & XSLT)

Page 25: Processing XML Documents

Contents : Document Formatting

– Concepts
– Selecting a style sheet
– XSLT
– Style sheet DTD issues
– XSL

Page 26: Processing XML Documents

Concepts of XSL

XSL stands for Extensible Stylesheet Language.

XML documents are intended to be easily read by both people and software.

People don't want to see documents with tags.

It is necessary to replace the tags with appropriate text styles.

Page 27: Processing XML Documents

Concepts of Style sheets (1/2)

Source document:

    <title>An example of style</title>
    <intro><para>This example shows how important style
    is to material intended to be read.</para></intro>
    <para>This is a <em>normal</em> paragraph.</para>
    <warning><para>Styles are important!</para></warning>

Removal of tags?

    An example of style This example shows how important style is to material intended to be read. This is a normal paragraph. Styles are important!

Style applied:

    An example of style

    This example shows how important style is to material intended to be read.

    This is a normal paragraph.

    Warning: Styles are important!

Page 28: Processing XML Documents

Concepts of Style sheets (2/2)

[Figure labels: authoring, DTD, documents, style sheet, presentation]

Source document:

    <title>This is a title</title>
    <p>This paragraph contains
    a <em>highlighted</em> term.</p>

Presentation:

    This is a title
    This paragraph contains a highlighted term

Page 29: Processing XML Documents

Concepts of DTD and style sheet

A single style sheet may be applied to a number of documents formatted in the same way.

An XML document can be associated with more than one style sheet.

[Figure labels: Authoring, DTD, Documents, Style sheet A, Presentation, Style sheet B, Presentation]

Page 30: Processing XML Documents

Concepts of Styling with XSL

A set of formatting objects.

In this first version, all allowed formatting objects are rectangular.

FO DTD (Formatting Objects DTD)
– Elements such as 'block'
– Attributes such as 'text-align'

Page 31: Processing XML Documents

Concepts of Transforming with XSLT (1/2)

Authoring XML documents directly with the FO DTD would obviously negate the entire philosophy of XML: documents should be self-describing, not self-formatting like HTML.

An XSLT processor takes an existing XML document as input, and generates a new XML document with a new DTD as output.

Page 32: Processing XML Documents

Concepts of Transforming with XSLT (2/2)

[Figure labels: XML document, Source DTD, XSLT style sheet, XSLT processor, New XML document, FO DTD, XSL processor, Presentation]

Source:

    An <emph>emphasized</emph> word.

Template:

    <template match="emph">
      <fo:inline-sequence font-weight="bold">
        <apply-templates/>
      </fo:inline-sequence>
    </template>

Presentation:

    An emphasized word.

Page 33: Processing XML Documents

Selecting a style sheet

An XML processing instruction is used for selecting a style sheet.

    <?xml-stylesheet href="mystyles.xsl"
                     type="text/xsl"
                     title="default" ?>

    <?xml-stylesheet href="myBIGstyles.xsl"
                     type="text/xsl"
                     title="bigger font"
                     alternate="yes" ?>

Page 34: Processing XML Documents

XSLT : general structure (1/3)

Root element – stylesheet or transform

    <stylesheet xmlns="http://www.w3.org/XSL/Transform/1.0">
    <transform xmlns="http://www.w3.org/XSLT/Transform/1.0">

Another namespace – an XSLT style sheet may also contain elements that are not part of stylesheet or transform

    <stylesheet xmlns="http://www.w3.org/XSL/Transform/1.0"
                xmlns:X="………….">
      …… <X:my-element>…</X:my-element> …

Page 35: Processing XML Documents

XSLT : general structure (2/3)

Result namespace – indicates what the output of the XSL processor is

    <stylesheet xmlns="http://www.w3.org/XSL/Transform/1.0"
                xmlns:X="……" result-ns="X">

Id – identifies a style sheet embedded in a larger XML document

    <?xml-stylesheet type="text/xsl" href="#MyStyles" ?>
    <X:book>
      <stylesheet id="MyStyles" …> … </stylesheet>
      …

Page 36: Processing XML Documents

XSLT : general structure (3/3)

Result Version, Result Encoding – specify which version of XML and which character set encoding scheme should be used for the output file

    <stylesheet … result-version="2.0"
                  result-encoding="ISO-8859-1">

Page 37: Processing XML Documents

XSLT : White space

An XSLT processor creates a tree of nodes, including nodes for each text string in and between the markup tags.

Default – all white space is preserved.

Default Space – when 'strip' is applied, it is possible to remove the white space.

    <stylesheet … default-space="strip">
      <preserve-space elements="pre poetry"/>
      …
    </stylesheet>

Page 38: Processing XML Documents

XSLT : Templates

The body of the style sheet consists of at least one transformation rule, as represented by the Template element.

    <template match="para">
      …
    </template>

    <template match="warning/para">
      …
    </template>

Page 39: Processing XML Documents

XSLT : Imports and Inclusions

Multiple style sheets may share some definitions.

    <stylesheet …>
      <import href="tables.xsl"/>
      <import href="colours.xsl"/>
      <template …>…</template>

    <include href="…">…</include>

Imported rules are not considered to be as important as other rules.

The include element can be used anywhere, and included rules are not considered to be less important than other rules.

Page 40: Processing XML Documents

XSLT : Priorities

When more than one complex rule matches the current element, it is necessary to explicitly give one rule a higher priority than the others, using the Priority attribute.

    <template match="chapter//para"><!-- priority = 1 -->
      …
    </template>

    <template match="warning//para" priority="2">
      …
    </template>

If the priority attribute is not used, or not used correctly, an XSLT processor may choose to simply select the last rule.

Page 41: Processing XML Documents

XSLT : Recursive processing

If an animal element existed within the paragraph, and there was no rule for this element, but it could contain the emphasis element, then the emphasized text would not be formatted.

    <para>A <animal><emph>Giraffe</emph></animal> is an animal.</para>

To eliminate this problem, a rule is needed to act as a catch-all, representing the elements not covered by explicit formatting rules.

    <template match="/|*">
      <apply-templates />
    </template>
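The effect of a catch-all rule can be imitated in plain Python. This is an analogue of the idea, not an XSLT processor: the `TEMPLATES` table, the function names, and the choice of `<p>` and `<b>` as output are assumptions for the sketch, and output escaping is deliberately ignored.

```python
import xml.etree.ElementTree as ET


def apply_templates(node):
    """Process all children of node in order, keeping the text between them."""
    parts = [node.text or ""]
    for child in node:
        parts.append(transform(child))
        parts.append(child.tail or "")
    return "".join(parts)


# Explicit rules, keyed by element name (hypothetical rule table).
TEMPLATES = {
    "para": lambda node: "<p>" + apply_templates(node) + "</p>",
    "emph": lambda node: "<b>" + apply_templates(node) + "</b>",
}


def transform(node):
    # Catch-all: an element with no explicit rule just recurses into
    # its children, so nested elements still reach their own rules.
    rule = TEMPLATES.get(node.tag, apply_templates)
    return rule(node)


source = ET.fromstring(
    "<para>A <animal><emph>Giraffe</emph></animal> is an animal.</para>"
)
output = transform(source)
print(output)
```

There is no rule for `animal`, yet the `emph` inside it is still formatted, because the fallback keeps the recursion going instead of stopping at the unknown element.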

Page 42: Processing XML Documents

XSLT : Selective processing

The Apply Templates element can take a Select attribute, which overrides the default action of processing all children. Using XPath patterns, it is possible to select specific children and ignore the rest.

    <template match="names">
      <apply-templates select="name[@type='company']" />
    </template>

The Apply Templates element can be used more than once in a template.

Page 43: Processing XML Documents

XSLT : Output formats

An XSLT transformation tool is expected to write out a new XML document. One way to do this is simply to insert the appropriate elements into the templates.

    <template match="para">
      <html:p><apply-templates/></html:p>
    </template>

Comments and processing instructions can be inserted into the output document using comment and processing-instruction elements.

    <processing-instruction name="ACME">INSERT_TOC</processing-instruction>
    <comment>This is the HTML version</comment>

Page 44: Processing XML Documents

XSLT : Sorting elements

The Sort element is used within the Apply Templates element to sort the elements it selects:

    <list>
      <item sortcode="1">ZZZ</item>
      <item sortcode="3">MMM</item>
      <item sortcode="2">AAA</item>
    </list>

    <template match="list">
      <apply-templates><sort/></apply-templates>
    </template>

To sort on an attribute value instead of the element content:

    <sort select="@sortcode" />
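The two sort variants can be mimicked in Python to see which order each one yields. This is an analogue using `sorted` on an `ElementTree` parse of the same list, not an XSLT run, and the variable names are illustrative. Both keys are compared as strings here.

```python
import xml.etree.ElementTree as ET

items = ET.fromstring(
    '<list>'
    '<item sortcode="1">ZZZ</item>'
    '<item sortcode="3">MMM</item>'
    '<item sortcode="2">AAA</item>'
    '</list>'
).findall("item")

# Analogue of a bare <sort/>: order by the text content of each item.
by_content = [item.text for item in sorted(items, key=lambda i: i.text)]

# Analogue of <sort select="@sortcode"/>: order by the attribute value.
by_code = [item.text for item in sorted(items, key=lambda i: i.get("sortcode"))]

print(by_content)
print(by_code)
```

The source order is left untouched in both cases. Only the order in which the selected items are processed changes.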

Page 45: Processing XML Documents

XSLT : Automatic numbering

In many XML documents, list items are not physically numbered in the text, which makes it easy to insert, move or delete items without having to edit all the items. The style sheet must therefore add the required numbering.

    <template match="section/title">
      <number level="multi" count="chapter|section" format="1.A" />
      <apply-templates/>
    </template>

Output:

    1.A First section of Chapter One
    2.C Third section of Chapter Two
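How a multi-level label such as "2.C" is derived from document position can be sketched in Python. This is a hand-written analogue of the "1.A" format, not the XSLT numbering algorithm; the document structure, the placeholder titles, and the function name `number_sections` are assumptions for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<book>
  <chapter>
    <section><title>First section of Chapter One</title></section>
  </chapter>
  <chapter>
    <section><title>(first)</title></section>
    <section><title>(second)</title></section>
    <section><title>Third section of Chapter Two</title></section>
  </chapter>
</book>"""


def number_sections(book):
    """Label each section title with chapter number and section letter."""
    labelled = []
    for chapter_no, chapter in enumerate(book.findall("chapter"), start=1):
        for index, section in enumerate(chapter.findall("section")):
            letter = chr(ord("A") + index)  # sketch only: A to Z
            labelled.append(f"{chapter_no}.{letter} {section.findtext('title')}")
    return labelled


labels = number_sections(ET.fromstring(DOC))
for label in labels:
    print(label)
```

Because the labels are computed from position, inserting or moving a section changes the numbering automatically without any edit to the source text.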

Page 46: Processing XML Documents

XSLT : Variables and templates (1/3)

A style sheet often contains a number of templates that produce output that is identical, or very similar, and XSLT includes some mechanisms for avoiding such redundancy.

Variable, Value Of

    <variable name="Colour">red</variable>

    <html:h1>The colour is <value-of select="$Colour"/>.</html:h1>

Output:

    The colour is red.

Page 47: Processing XML Documents

XSLT : Variables and templates (2/3)

When the same formatting is required in a number of places, it is possible to simply reuse the same template.

    <template name="CreateHeader">
      <html:h2>*****<apply-templates/>*****</html:h2>
    </template>

    <template match="title">
      <call-template name="CreateHeader" />
    </template>
    <template match="head">
      <call-template name="CreateHeader" />
    </template>

Page 48: Processing XML Documents

XSLT : Variables and templates (3/3)

Such a mechanism is even more useful when the action performed by the named template can be modified, by passing parameters to it that override default values.

    <template name="CreateHeader">
      <param name="Prefix">%%%</param>
      <html:h2><value-of select="$Prefix"/>
        <apply-templates/>*****</html:h2>
    </template>

    <call-template name="CreateHeader">
      <with-param name="Prefix">%%%%%</with-param>
    </call-template>

Output with the default prefix:

    %%%Header*****

Page 49: Processing XML Documents

XSLT : Creating and copying elements (1/2)

An element can be created in the output document using the Element element, with the element name specified using the Name attribute, and an optional namespace specified using the Namespace attribute.

Elements can also be created that are copies of the source element, using the Copy element.

    <template match="third-header-level">
      <element namespace="html" name="h3">
        <apply-templates/>
      </element>
    </template>

Page 50: Processing XML Documents

XSLT : Creating and copying elements (2/2)

Source document elements can also be selected and copied out to the destination document using the Copy Of element, which uses a Select attribute to identify the document fragment or set of elements to be reproduced at the current position.

    <template match="body">
      <body>
        <copy-of select="//h1 | //h2" />
        <apply-templates/>
      </body>
    </template>

Page 51: Processing XML Documents

XSLT : Repeating structures

When creating tabular output from source elements, or some other very regular structure, a technique is available that significantly reduces the number of templates needed, and in so doing improves the clarity of the style sheet.

    <template match="countries">
      <html:table>
        <for-each select="country">
          <html:tr>
            <html:th><apply-templates select="name"/></html:th>
            <for-each select="borders">
              <html:td><apply-templates/></html:td>
            </for-each>

Page 52: Processing XML Documents

XSLT : Conditions (1/2)

When a template transforms a source document element into formatted output, it is possible to vary the output depending on certain conditions.

    <template match="para">
      <html:p>
        <if test="not (position() mod 2 = 0)">
          <attribute name="style">color: red</attribute>
        </if>
        <apply-templates/>
      </html:p>
    </template>

Page 53: Processing XML Documents

XSLT : Conditions (2/2)

When an attribute can take a number of different values, each one producing a different format, the Choose element can be used.

    <template match="para">
      <html:p>
        <choose>
          <when test="@type='normal'">
            <attribute name="style">color:black</attribute>
          </when>
          <otherwise>
            <attribute name="style">color:yellow</attribute>
          </otherwise>
        </choose>
        <apply-templates/>

Page 54: Processing XML Documents

XSLT : Keys

XSLT allows keys to be defined and associated with particular elements.

The Name attribute provides a name for a set of identifiers.

The Match attribute specifies the elements to be included in this set of identifiers, using an XPath pattern.

The Use attribute is an XPath expression that identifies the location of the identifier values.

    <key name="global" match="*" use="@id" />

    <book id="book">
      <chapter id="chap1">…</chapter>
    </book>
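What such a key amounts to can be sketched as a lookup table in Python. This is an analogue of `match="*" use="@id"`, not an XSLT implementation; the helper name `build_key` and the extra chapter without an id are assumptions. Unlike an XSLT key, this simple dictionary keeps only one element per identifier value.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book id="book">'
    '<chapter id="chap1">One</chapter>'
    '<chapter>No id here</chapter>'
    '</book>'
)


def build_key(root, attribute):
    """Index every element that carries the given attribute by its value."""
    return {
        element.get(attribute): element
        for element in root.iter()  # root.iter() visits every element
        if element.get(attribute) is not None
    }


global_key = build_key(book, "id")
print(sorted(global_key))
print(global_key["chap1"].text)
```

Once the index exists, finding the element for an identifier is a single dictionary lookup rather than a search through the whole tree.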

Page 55: Processing XML Documents

Style sheet DTD issues

The XSLT standard includes a DTD that defines the XSLT elements and attributes. But this DTD alone may not be sufficient. The fact that XSLT markup can be mixed with output markup means that a DTD may need to be defined that includes both sets of elements.

The DTD must, of course, also contain the definition for the elements concerned.

However, this problem can be avoided entirely by using the Element and Attribute elements throughout the style sheet.

Page 56: Processing XML Documents

XSL : Representation format

Each formatting object can be represented by an element from the FO (Formatting Objects) DTD, which is defined in an annex to the standard.

An XSL processor is expected to receive input from the XSLT processor, though in some cases an implementation may be able to receive an XML document that conforms to the FO DTD instead, created by other means.

Page 57: Processing XML Documents

XSL : General presentation model (1/3)

XSL creates formatting objects to hold the content to be presented.

Formatting objects create rectangular areas.

Areas are divided into four categories: area-containers, block-areas, line-areas, and inline-areas.

[Figure labels: Area-container, Block-area, Line-area, Inline-area]

Page 58: Processing XML Documents

XSL : General presentation model (2/3)

An area-container has a coordinate system by which embedded objects can be placed, defining the 'top', 'bottom', 'left' and 'right' directions, and is able to contain other area-containers.

Area-containers may also contain block-areas. The placement of block-areas within area-containers depends on the 'writing mode'.

When a block-area is too long to fit in the area-container, another block-area may be created in the next area-container.

[Figure: two area containers, each with its own top, bottom, left and right directions]

Page 59: Processing XML Documents

XSL : General presentation model (3/3)

A block-area may contain more block-areas.

Embedded blocks may be narrower than the enclosing block, in the non-writing-mode direction, using indent properties.

Blocks may contain line-areas, which are adjacent to each other in the line-progression direction.

Line-areas can contain inline-areas, which correspond with XML inline elements.

Inline-areas are drawn from an initial position-point, and it is possible to adjust this point upward or downward in relation to neighboring inline-areas.

Page 60: Processing XML Documents

XSL : CSS-compatible formatting objects and properties

Many of the XSL formatting object types correspond with established CSS display property types.

Many of the formatting options available in XSL are derived from properties provided in CSS.

Page 61: Processing XML Documents

XSL : Block-level objects

The Block element is used to enclose any simple block of text.

Text styles can be defined; margin, border and padding attributes may be added; text may be aligned in different ways; hyphenation can be controlled; and the whole block can be removed from the flow and positioned explicitly.

Graphics between text blocks are represented by Display Graphic elements.

A rule can be drawn horizontally or vertically between text blocks, using the Display Rule element.

    <fo:block>A block of text.</fo:block>
    <fo:block>Another block of text.</fo:block>

Presentation:

    A block of text.

    Another block of text.

Page 62: Processing XML Documents

XSL : Inline objects

Individual characters can be represented by the Character element.

The first object inside a Block element may be a First Line Marker element.

A graphic can be presented within a line of text, using the Inline Graphics element.

Rule lines can be drawn inline, as well as between blocks.

The Inline Sequence element has already been demonstrated.

The current page number can be inserted into the text using the Page Number element.

Page 63: Processing XML Documents

XSL : Lists

The items are a sequence of List Item elements.

    <fo:list-block>
      <fo:list-item>…</fo:list-item>
      <fo:list-item>…</fo:list-item>
    </fo:list-block>

The block may directly contain a sequence of labels followed by contents, using the List Item Label and List Item Body elements.

    <fo:list-item-label><fo:block>LABEL</fo:block></fo:list-i …
    <fo:list-item-body>
      <fo:block>First Block In Content</fo:block>

Page 64: Processing XML Documents

XSL : Tables

When a table has a caption, the main element is called Table and Caption.

The Table element contains the actual table grid. The model follows the HTML approach.

    <fo:table>
      <fo:table-header>…</fo:table-header>
      <fo:table-footer>…</fo:table-footer>
      <fo:table-body>…</fo:table-body>
    </fo:table>

    <fo:table-body>
      <fo:table-row>…</fo:table-row>
    </fo:table-body>

    <fo:table-row>
      <fo:table-cell>…</fo:table-cell>
    </fo:table-row>

Page 65: Processing XML Documents

XSL : Hypertext links

A range of text can be enclosed in a Simple Link element, which is used to provide a mechanism for hypertext linking to another object.

    See <fo:simple-link internal-destination="chap9">Chapter 9</fo:simple-link> for details.

    See <fo:simple-link external-destination="file:///book3.xml">Book 3</fo:simple-link> for details.

Page 66: Processing XML Documents

XSL : Alternative document fragments

When publishing electronically, it is possible to hide and reveal portions of the document depending on user actions.

    <fo:multi-case name="closed" initial="true">
      <fo:block>Heading
        <fo:multi-toggle switch-to="opened"> [+] </fo:multi-toggle>
        …
    <fo:multi-case name="opened">
      <fo:block>Heading
        <fo:multi-toggle switch-to="closed"> [-]

Page 67: Processing XML Documents

XSL : Alternative properties

The Multi Properties element contains a number of initial, empty Multi Property Set elements, each one providing the style to apply under a given circumstance, identified using the State attribute.

    <fo:multi-properties>
      <fo:multi-property-set state="visited" color="#FF0000" />
      <fo:multi-property-set state="active" color="#00FF00" />
      This text to be coloured depending on the state
    </fo:multi-properties>

Page 68: Processing XML Documents

XSL : Floating objects and footnotes

The Float element contains an object, such as a table, and indicates to the pagination engine that it may move the content as appropriate.

    <fo:float><fo:table>…</fo:table></fo:float>

The Footnote element contains the footnote text, which will float to the base of the page, and may also contain a reference to the footnote.

    Here is a reference<fo:footnote>
      <fo:footnote-citation>*</fo:footnote-citation>
      <fo:block>* The footnote</fo:block>
    </fo:footnote> to a footnote

Page 69: Processing XML Documents

XSL : Building pages (1/2)

The Flow element contains all the block-level objects that constitute the main text flow content of the document.

The Page Sequence element may contain the Flow element and any number of objects that are to be repeated in the same place on each page in the sequence, which is termed static content.

A page sequence must include information on how different page master templates are to be used in the sequence.

Page 70: Processing XML Documents

XSL : Building pages (2/2)

[Figure labels: Binding margins, Writing mode, extents]

Page 71: Processing XML Documents

XSL : Hyphenation

The 'hyphenate' property defaults to 'false', but can be set to 'true', so enabling the hyphenation of words.

    <fo:block hyphenate="true"
              hyphenation-char="-"
              hyphenation-push-char-count="2"
              hyphenation-remain-char-count="2">…