Processing of structured documents Spring 2002, Part 1 Helena Ahonen-Myka.



Course organization

581290-5 laudatur course, 3 cu

lectures (in Finnish)
  22.1.-21.2. Tue 12-14, Thu 10-12
  not obligatory

exercise sessions
  29.1.-27.2.
  course assistants: Olli Lahti and Miro Lehtonen (new group Wed 12-14 A318)
  not obligatory


Requirements

Exam (Wed 6.3. at 16-20): 45 points
Project: 15 points
Exercises: 5 extra points
Maximum of points: 60


Outline (preliminary)

1. Descriptions of structure
   context-free grammars
   namespaces, information sets
   (XML DTD,) XML Schema
2. Programming interfaces
   SAX, DOM
   SOAP
3. Traversing documents
   XPath


Outline...

4. Querying structured documents
   XML Query
5. XML Linking
6. XML databases
7. Metadata: RDF
8. Compressing XML data
9. ...


Prerequisites

You should know the basics of XML
  DTD, elements, attributes, syntax
  XSLT (basics), formatting
some programming experience is needed


Group project

Group of 4-5 students
  groups are formed in the exercise sessions in the 2nd week
Task: construct a toy B2B e-commerce application
  a travel agency which sells packages containing hotel nights and concerts
  a hotel (or several)
  a concert ticket office


Group project

Task continues
  a customer can reserve packages using a web page
  a reservation causes a query to the hotels and the ticket offices for the availability of rooms and tickets
  for all the communication and for the storage of all the documents you should use XML


Group project

Try to get some simple implementation to work
  may depend on the support we can offer
you don't have to consider all the real-life problems, like consistency of reservations
concentrate on playing with XML
state of the work is presented in the last exercise sessions (this also applies to students who don't normally attend exercises)


Requirements for project

More instructions follow later...
return a report by 22.3. (as a URL)
The report should include
  (short) requirements analysis
  descriptions of the structure (DTD, Schema)
  other designs, architecture, ...
Some kind of a working prototype
  not necessarily the whole system


1. Structure descriptions

Regular expressions, context-free grammars -> What is XML?

(XML Document type definitions)
namespaces, information sets
XML Schema


Regular expressions

A way to describe sets of strings over an alphabet (of chars, events, elements…)
many uses:
  text searching (e.g. emacs, grep, perl)
  in grammatical formalisms (e.g. XML DTDs)
relevant for document structures: what kind of structural content is allowed for different document components


Regular expressions

A regular expression over alphabet Σ is either
  ∅ (the empty set)
  ε (epsilon; sometimes lambda λ)
  a, where a ∈ Σ
  R | S (choice; sometimes R ∪ S)
  R S (catenation) or
  R* (Kleene closure)
where R and S are regular expressions


Regular expressions

Regular expression E denotes a language (a set of strings) L(E):
  L(∅) = ∅ (empty set)
  L(ε) = {ε} (singleton set of empty string)
  L(a) = {a} (singleton set of a ∈ Σ)
  L(R|S) = L(R) ∪ L(S) = {w | w ∈ L(R) or w ∈ L(S)}
  L(RS) = L(R)L(S) = {xy | x ∈ L(R) and y ∈ L(S)}
  L(R*) = L(R)* = {x1…xn | xk ∈ L(R), k=1,…,n; n ≥ 0}
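
The definition of L(E) can be followed case by case in code. The sketch below is only an illustration, not part of the course material: the tuple encoding of expressions and the function name lang are assumptions, and because L(R*) is usually infinite, the function returns only the strings up to a given length.

```python
# Hypothetical tuple encoding of regular expressions (an assumption of this sketch):
# ("empty",), ("eps",), ("sym", a), ("alt", R, S), ("cat", R, S), ("star", R)

def lang(e, max_len):
    """Return the strings of L(e) whose length is at most max_len."""
    kind = e[0]
    if kind == "empty":            # L(empty set) = {}
        return set()
    if kind == "eps":              # L(epsilon) = {""}
        return {""}
    if kind == "sym":              # L(a) = {a}
        return {e[1]} if len(e[1]) <= max_len else set()
    if kind == "alt":              # L(R|S) = L(R) union L(S)
        return lang(e[1], max_len) | lang(e[2], max_len)
    if kind == "cat":              # L(RS) = {xy | x in L(R), y in L(S)}
        return {x + y
                for x in lang(e[1], max_len)
                for y in lang(e[2], max_len)
                if len(x + y) <= max_len}
    if kind == "star":             # L(R*) = concatenations of 0 or more strings of L(R)
        base = lang(e[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:
            # Extend the newest strings by one more factor; keep only unseen ones.
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")
```

For instance, lang(("cat", ("sym", "a"), ("star", ("sym", "b"))), 3) enumerates the strings of a b* that have at most three characters.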


Example

top-level structure of a document:
  Σ = {title, auth, date, sect}
  title followed by an optional list of authors, followed by an optional date, followed by one or more sections:
    title auth* (date | ε) sect sect*
common abbreviations:
  E? = (E | ε); E+ = E E*
  -> title auth* date? sect+
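
The abbreviated expression can be tried out with an ordinary regular expression library. This is a minimal sketch using Python's re module. Writing each element name followed by a space is an encoding chosen here so that whole names, not single characters, act as alphabet symbols, and matches_model is a hypothetical helper name.

```python
import re

# Content model from the slide: title auth* date? sect+
# Assumption: a sequence of element names is encoded as one string,
# with every name followed by a single space.
CONTENT_MODEL = re.compile(r"title (?:auth )*(?:date )?(?:sect )+")

def matches_model(children):
    """True if the list of element names is in the language of the content model."""
    encoded = "".join(name + " " for name in children)
    return CONTENT_MODEL.fullmatch(encoded) is not None
```

A call such as matches_model(["title", "auth", "auth", "date", "sect"]) checks one candidate top-level structure against the model.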


Context-free grammars

Used widely for syntax specification (programming languages)

G = (V, Σ, P, S)
  V: the alphabet of the grammar G; V = Σ ∪ N
  Σ: the set of terminal symbols
  N = V - Σ: the set of nonterminal symbols
  P: set of productions
  S ∈ N: the start symbol


Productions and derivations

Productions: A -> α, where A ∈ N, α ∈ V*
  e.g. A -> aBa (1)
Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  γ = αAβ, δ = αωβ for some α, β ∈ V*, and
  A -> ω is a production of the grammar
  e.g. AA => AaBa (assuming prod. 1 above)


Language generated by a context-free grammar

γ derives δ, γ =>* δ, if there is a sequence of 0 or more direct derivations that transforms γ to δ
The language generated by a CFG G: L(G) = {w ∈ Σ* | S =>* w}

L(G) is a set of strings: to model structural elements, we consider parse trees
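
Direct derivation and =>* can be illustrated with a few lines of code. The sketch below assumes that every grammar symbol is a single character and that productions are given as (nonterminal, right-hand side) pairs. The function names, and the step bound used to keep the search finite, are assumptions of this example.

```python
def direct_derivations(s, productions):
    """All strings t such that s => t, i.e. one nonterminal occurrence is rewritten."""
    results = set()
    for lhs, rhs in productions:
        for i, symbol in enumerate(s):
            if symbol == lhs:
                # s = alpha A beta  becomes  alpha rhs beta
                results.add(s[:i] + rhs + s[i + 1:])
    return results

def derives(src, dst, productions, max_steps):
    """True if src =>* dst using at most max_steps direct derivations."""
    current = {src}
    for _ in range(max_steps + 1):
        if dst in current:
            return True
        current = {t for s in current for t in direct_derivations(s, productions)}
    return False
```

With the production A -> aBa, direct_derivations("AA", [("A", "aBa")]) yields both AaBa and aBaA, since either occurrence of A may be rewritten.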


Parse trees of a CFG

Aka syntax trees or derivation trees
nodes labeled by symbols of V (or by ε):
  internal nodes by nonterminals, root by start symbol
  leaves using terminal symbols (or ε)
parent with label A can have children labeled by X1,…,Xk only if A -> X1…Xk is a production


CFGs for document structures

Nonterminals represent document structures
  e.g. Ref -> AuthorList Title PublData
       AuthorList -> Author AuthorList
       AuthorList -> ε
problem: obscures the relation of elements (the last Author several hierarchical levels away from Ref)
-> solution: extended CFGs


Extended CFGs (ECFGs)

Like CFGs, but right-hand-sides of productions are regular expressions over V,
  e.g. Ref -> Author* Title PublData
Let γ, δ ∈ V*. String γ derives δ directly, γ => δ, if
  γ = αAβ, δ = αωβ for some α, β ∈ V*, and
  A -> E is a production such that ω ∈ L(E)
  e.g. Ref => Author Author Author Title PublData


Language generated by an ECFG

Defined similarly to CFGs
Theorem: Languages generated by extended and ordinary CFGs are the same


Parse trees of an ECFG

Similar to parse trees of an ordinary CFG, except that…

parent with label A can have children labeled by X1,…,Xk when A -> E is a production such that X1…Xk ∈ L(E)

-> an internal node may have arbitrarily many children (e.g. Authors below a Ref node)
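
The parse tree condition for ECFGs can be checked mechanically. The sketch below is a simplified illustration: trees are (label, children) pairs, the right-hand side of the Ref production is written as a Python regular expression over space-terminated names, and Author, Title and PublData are treated as leaves for brevity, although they are nonterminals in a fuller grammar. The names PRODUCTIONS and is_parse_tree are hypothetical.

```python
import re

# Hypothetical ECFG with one production, Ref -> Author* Title PublData.
# Each name in the regular expression is followed by a space (encoding assumption).
PRODUCTIONS = {
    "Ref": re.compile(r"(?:Author )*Title PublData "),
}

def is_parse_tree(node):
    """node is (label, [children]); labels without a production must be leaves."""
    label, children = node
    if label not in PRODUCTIONS:
        return not children  # treated as a terminal symbol in this sketch
    encoded = "".join(child[0] + " " for child in children)
    # The children X1...Xk must spell a string in L(E) for the production A -> E.
    if PRODUCTIONS[label].fullmatch(encoded) is None:
        return False
    return all(is_parse_tree(child) for child in children)
```

A Ref node with any number of Author children followed by Title and PublData passes the check, which is the "arbitrarily many children" property noted above.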


What is XML?

metalanguage that can be used to define markup languages
gives syntax for defining extended context-free grammars
XML documents that adhere to an ECFG are strings in that language
document types (grammars) - document instances (strings in the language)


XML encoding of structure

An XML document is essentially a parenthesized linear encoding of a parse tree
  corresponds to a preorder walk
  start of inner node (element) A denoted by a start tag <A>, end denoted by end tag </A>
  leaves are strings (or empty elements)
+ certain extensions (especially attributes)
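
The encoding can be sketched as a recursive preorder walk. In this illustration an inner node is a (label, children) pair and a leaf is a plain string. That representation and the function name encode are assumptions, and attributes are left out.

```python
from xml.sax.saxutils import escape

def encode(node):
    """Serialize a parse tree: inner nodes are (label, [children]), leaves are str."""
    if isinstance(node, str):
        return escape(node)          # character data; &, < and > are escaped
    label, children = node
    if not children:
        return f"<{label}/>"         # empty element
    # Preorder: start tag first, then the children left to right, then the end tag.
    return f"<{label}>" + "".join(encode(c) for c in children) + f"</{label}>"
```

For example, encode(("A", [("B", ["text"]), ("C", [])])) visits A, then B and its string leaf, then the empty element C.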


Terminal symbols in practice

Leaves of parse trees are labeled by single characters (symbols of Σ)
too granular in practice: instead terminal symbols which stand for all values of a type
  e.g. #PCDATA in XML for variable length content of data characters
  richer data types in XML schema formalisms


An example DTD

<!DOCTYPE invoice [
<!ELEMENT invoice (orderDate, shipDate, billingAddress, voice*, fax?)>
<!ELEMENT orderDate (#PCDATA)>
<!ELEMENT shipDate (#PCDATA)>
<!ELEMENT billingAddress (name, street, city, state, zip)>
<!ELEMENT voice (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>
]>


And a document:

<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>


XML processing model

A processor (parser)
  reads XML documents
  passes data to an application

XML Specification tells how to read, what to pass
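
As a small illustration of the processing model, the sketch below lets Python's standard xml.etree.ElementTree act as the processor and a few lines of application code read the data it passes on. The document text repeats the invoice example, and the variable names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# The invoice document from the example, held in a string for this sketch.
INVOICE = """<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>"""

# Processor side: the parser reads the text and builds an element tree.
root = ET.fromstring(INVOICE)

# Application side: work with the data that the parser passed on.
order_date = root.findtext("orderDate")
city = root.findtext("billingAddress/city")
child_names = [child.tag for child in root]
```

The application never sees angle brackets or tag syntax, only element names and character data.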


Well-formed XML documents

documents that adhere to the formal requirements (syntax) of the XML specification

if a document is not well-formed, it is not an XML document (and the XML tools do not have to process it)
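
Well-formedness is what a parser checks first. A minimal sketch, assuming Python's standard xml.etree.ElementTree as the (non-validating) processor. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised e.g. for mismatched tags or more than one root element.
        return False
```

For instance, is_well_formed("<a><b></a>") reports the missing end tag of b by returning False.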


Valid documents

a document is a valid XML document if it is well-formed and adheres to the structure defined in the given DTD

an XML processor can be validating or non-validating

sometimes validity is important, sometimes not
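
The difference between well-formed and valid can be made concrete with a hand-written check. Python's standard library has no validating parser, so the sketch below does not read a DTD. The content models of the invoice example are translated by hand into regular expressions over space-terminated child names, and CONTENT_MODELS, is_valid and _check are hypothetical names. It checks element structure only and ignores stray text between child elements, so it is an illustration of the idea rather than a real validator.

```python
import re
import xml.etree.ElementTree as ET

# Content models hand-translated from the invoice DTD (encoding assumption:
# every child element name is followed by a space).
CONTENT_MODELS = {
    "invoice": r"orderDate shipDate billingAddress (?:voice )*(?:fax )?",
    "billingAddress": r"name street city state zip ",
}

def _check(element):
    children = "".join(child.tag + " " for child in element)
    # Elements without an entry are treated as (#PCDATA): no child elements allowed.
    model = CONTENT_MODELS.get(element.tag, "")
    return re.fullmatch(model, children) is not None

def is_valid(text):
    """True if text is well-formed and every element matches its content model."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False  # not well-formed, hence not valid
    return root.tag == "invoice" and all(_check(e) for e in root.iter())
```

A document that swaps voice and fax, or omits shipDate, is still well-formed but fails this check.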