Semi-structured Data

30
Semi-structured Data

description

Semi-structured Data. Facts about the Web. Growing fast Popular Semi-structured data Data is presented for ‘human’-processing Data is often ‘self-describing’ (including name of attributes within the data fields). Vision for Web data. - PowerPoint PPT Presentation

Transcript of Semi-structured Data

Page 1: Semi-structured Data

Semi-structured Data

Page 2: Semi-structured Data

Facts about the Web

• Growing fast

• Popular

• Semi-structured data – Data is presented for ‘human’-processing– Data is often ‘self-describing’ (including name

of attributes within the data fields)

Page 3: Semi-structured Data

Vision for Web data

• Object-like – it can be represented as a collection of objects of the form described by the conceptual data model

• Schemaless – not conformed to any type structure

• Self-describing – necessary for machine readable data

Page 4: Semi-structured Data

Facts about database systems

• Integration of databases with different schemas is often needed

• Sharing information between different databases on the World Wide Web becomes more and more important for business

Page 5: Semi-structured Data

Semi-structured data

• Bridging different data models (relational, object-oriented

Page 6: Semi-structured Data

Semi-structured data representation

• A database of semi-structured data is a graph with – A set of nodes, each is either a leaf or a interior

node; – Each interior node has a set of arcs coming out

from it, connecting it with another node; each arc has a label; and

– A root that does not have an arc entering it. Every node must be reachable from the root.

Page 7: Semi-structured Data

cf

name addressaddress

street city street city

mh

name streetcity

mv

title year

CarrieFisher

Maple H’wood Locust Malibu

MarkHamill

Oak H’wood StarWars

1977

Root

star starmovie

starOf

starOf

starsIn

starsIn

Example of semi-structured data representing a movie and stars

Page 8: Semi-structured Data

Information integration via semi-structured data

• Simple• Semi-structured data

as interface between users of different databases (with different schemas)

Interface

DB1 DB2

User

Application of DB1

Application of DB2

Page 9: Semi-structured Data

XML – Overview

• Simplifying the data exchange between software agents

• Popular thanks to the involvement of W3C (World Wide Web Consortium – independent organization

www.w3c.org)

Page 10: Semi-structured Data

XML – Characteristics

• Simple, open, widely accepted

• HTML-like (tags) but extensible by users (no fixed set of tags)

• No predefined semantics for the tags (because XML is developed not for the displaying purpose)

• Semantics is defined by stylesheet (later)

Page 11: Semi-structured Data

XML Documents

• User-defined tags:<tag> info </tag>

• Properly nested:<tag1>.. <tag2>…</tag1></tag2>is not valid

• Root element: an element contains all other elements• Processing instructions <?command ….?>• Comments <!--- comment --- >• CDATA type• DTD

Page 12: Semi-structured Data

XML element

• Begin with a opening tag of the form

<XML_element_name>

• End with a closing tag

</XML_element_name>

• The text between the beginning tag and the closing tag is called the content of the element

Page 13: Semi-structured Data

XML element

<Star-Movie-Data><Star>

<Name> Carrie Fisher </Name><Address> <Street> 123 Maple St. </Street> <City> Hollywood </City> </Address><Address> <Street> 5 Locus Ln. </Street> <City> Malibu</City> </Address>

</Star> <Star>

<Name> Mark Hamill </Name><Address> <Street> 456 Oak Rd. </Street> <City> Brentwood </City></Address>

</Star> <Movie>

<Title> Star Wars </Title> <Year>1997</Year></Movie></ Star-Movie-Data>

Name elelementStar Elelement

Page 14: Semi-structured Data

XML element

<Star-Movie-Data>

<Star name=“Carrie Fisher”>

….

</Star>

</ Star-Movie-Data>

Attribute Value of the attribute

Page 15: Semi-structured Data

Relationship between XML elements

• Child-parent relationship– Elements nested directly in an element are the

children of this element (Student is a child of PersonList, Name is a child of Student, etc.)

• Ancestor/descendant relationship: important for querying XML documents (extending the child/parent relationship)

Page 16: Semi-structured Data

XML elements & Database Objects• XML elements can be converted into

objects by– considering the tag’s names of the children as

attributes of the objects – Recursive process

<Student StudentID=“123”>

<Name> “XYZ PQR” </Name>

<CrsTaken>

<CrsName>CS582</CrsName>

<Grade>“A”</Grade> </CrsTaken>

</Student>

(#099,

Name: “XYZ PQR”

CrsTaken:

<CrsName>“CS582”</CrsName>

<Grade>“A”</Grade>

)

Partially converted object

Page 17: Semi-structured Data

XML elements & Database Objects

• Differences: Additional text within XML elements

<Student StudentID=“123”>

<Name> “XYZ PQR” </Name>

has taken the following course

<CrsTaken>

Database management system II

<CrsName>CS582</CrsName>

with the grade

<Grade>“A”</Grade> </CrsTaken>

</Student>

Page 18: Semi-structured Data

XML elements & Database Objects

• Differences: XML elements are orderd

<CrsTaken>

<CrsName>“CS582”</CrsName>

<Grade>“A”</Grade>

</CrsTaken>

<CrsTaken>

<Grade>“A”</Grade>

<CrsName>“CS582”</CrsName>

</CrsTaken>

{#901, Grade: “A”, CrsName: “CS582”}

Page 19: Semi-structured Data

XML Attributes

• Can occur within an element (arbitrary many attributes, order unimportant, same attribute only one)

• Allow a more concise representation • Could be replaced by elements • Less powerful than elements (only string value, no

children)• Can be declared to have unique value, good for

integrity constraint enforcement (next slide)

Page 20: Semi-structured Data

XML Attributes

• Can be declared to be the type of ID, IDREF, or IDREFS

• ID: unique value throughout the document

• IDREF: refer to a valid ID declared in the same document

• IDREFS: space-separated list of strings of references to valid IDs

Page 21: Semi-structured Data

Well-formed XML Document

• It has a root element

• Every opening tag is followed by a matching closing tag, elements are properly nested

• Any attribute can occur at most once in a given opening tag, its value must be provided, quoted

Page 22: Semi-structured Data

Document Type Definition

• Set of rules (by the user) for structuring an XML document

• Can be part of the document itself, or can be specified via a URL where the DTD can be found

• A document that conforms to a DTD is said to be valid

• Viewed as a grammar that specifies a legal XML document, based on the tags used in the document

Page 23: Semi-structured Data

DTD Components

• A name – must coincide with the tag of the root element of the document conforming to the DTD

• A set of ELEMENTs – one ELEMENT for each allowed tag, including the root tag

• ATTLIST statements – specifies the allow attributes and their type for each tag

• *, +, ? – like in grammar definition – * : zero or finitely many number – + : at least one– ? : zero or one

Page 24: Semi-structured Data

DTD Components – Element

<!ELEMENT Name definition>

type, element list etc.

Name of the element

definition can be: EMPTY, (#PCDATA), or element list (e1,e2,…,en) where the list (e1,e2,…,en) can be shorted using grammar like notation

Page 25: Semi-structured Data

DTD Components – Element

<!ELEMENT Name(e1,…,en)>

nth – element

1st – element

Name of the element

<!ELEMENT PersonList (Title,Contents)>

<!ELEMENT Contents(Person *)>

Page 26: Semi-structured Data

DTD Components – Element

<!ELEMENT Name EMPTY>

no child for the element Name

<!ELEMENT Name (#PCDATA)>

value of Name is a character string

<!ELEMENT Title EMPTY>

<!ELEMENT Id (#PCDATA)>

Page 27: Semi-structured Data

DTD Components – Attribute List

<!ATTLIST EName Att {Type} Property> where

- Ename – name of an element defined in the DTD

- Att – attribute name allowed to occur in the opening tag of Ename

- {type} – might/might not be there; specify the type of the attribute (CDATA, ID, IDREF, IDREFS)

- Property – either #REQUIRED or #IMPLIED

Page 28: Semi-structured Data

<!DOCTYPE Stars [<!ELEMENT STARS (STAR*)><!ELEMENT STAR(NAME,ADDRESS+,MOVIES)><!ELEMENT NAME (#PCDATA)><!ELEMENT ADDESS (STREET, CITY)><!ELEMENT STREET (#PCDATA)><!ELEMENT CITY (#PCDATA)><!ELEMENT MOVIES (MOVIE*)><!ELEMENT MOVIE (TITLE, YEAR)><!ELEMENT TITLE (#PCDATA)><!ELEMENT YEAR (#PCDATA)>

]>

A simple DTD for the movie and star database (no integrity constraints)

Page 29: Semi-structured Data

<!DOCTYPE Stars-Movies [<!ELEMENT STARS-MOVIES (STAR* MOVIES*)><!ELEMENT STAR(NAME,ADDRESS+)>

<!ATTLIST STAR starID ID starredIn IDREF><!ELEMENT NAME (#PCDATA)><!ELEMENT ADDESS (STREET, CITY)><!ELEMENT STREET (#PCDATA)><!ELEMENT CITY (#PCDATA)><!ELEMENT MOVIE (TITLE, YEAR)>

<!ATTLIST MOVIE movieID ID starsOf IDREF><!ELEMENT TITLE (#PCDATA)><!ELEMENT YEAR (#PCDATA)>

]>A DTD for the movie and star database with attributes and

integrity constraints

Page 30: Semi-structured Data

Homework 5 (Due Oct 23)

• 4.2.3 (Pg 146, complete book) (10pt)

• 4.4.1 (part c, Pg 164, complete book) (10pt)

• 4.5.4 (Pg 172, complete book) (10pt)