Well-designed XML Data
Marcelo Arenas and Leonid Libkin
University of Toronto
Outline
Part 1 - Database Normalization from the 1970s and 1980s.
Part 2 - Classical theory revisited: normalizing XML documents.
Part 3 - Classical theory re-done: new justifications for normalization.
Part 1: Classical Normalization
Design: decide how to represent the information in a particular data model.
• Even for simple application domains there is a large number of ways of representing the data of interest.
We have to design the schema of the database.
• Set of relations.
• Set of attributes for each relation.
• Set of data dependencies.
Designing a Database: An Example
Attributes: number, title, section, room.
Data dependency: every course number is associated with only one title.
Relational schema:
BAD alternative:
  R(number, title, section, room), number → title
GOOD alternative:
  S(number, title), number → title
  T(number, section, room)
Problems with BAD: Update Anomaly
  number   title                     section  room
  CSC258   Computer Organization     1        LP266
  CSC258   Computer Organization     2        GB258
  CSC258   Computer Organization     3        GB248
  CSC434   Database Systems          1        GB248
Title of CSC258 is changed to Computer Organization I.
  number   title                     section  room
  CSC258   Computer Organization I   1        LP266
  CSC258   Computer Organization I   2        GB258
  CSC258   Computer Organization I   3        GB248
  CSC434   Database Systems          1        GB248
The instance stores redundant information.
Deletion Anomaly
  number   title                     section  room
  CSC258   Computer Organization I   1        LP266
  CSC258   Computer Organization I   2        GB258
  CSC258   Computer Organization I   3        GB248
  CSC434   Database Systems          1        GB248
CSC434 is not given in this term.
  number   title                     section  room
  CSC258   Computer Organization I   1        LP266
  CSC258   Computer Organization I   2        GB258
  CSC258   Computer Organization I   3        GB248
Additional effect: all the information about CSC434 was deleted.
Insertion Anomaly
  number   title                     section  room
  CSC258   Computer Organization I   1        LP266
  CSC258   Computer Organization I   2        GB258
  CSC258   Computer Organization I   3        GB248
A new course is created: (CSC336, Numerical Methods)
  number   title                     section  room
  CSC258   Computer Organization I   1        LP266
  CSC258   Computer Organization I   2        GB258
  CSC258   Computer Organization I   3        GB248
  CSC336   Numerical Methods         ?        ?
The instance stores attributes that are not directly related.
Avoiding Update Anomalies
S:
  number   title
  CSC258   Computer Organization
  CSC434   Database Systems
T:
  number   section  room
  CSC258   1        LP266
  CSC258   2        GB258
  CSC258   3        GB248
  CSC434   1        GB248
Title of CSC258 is changed to Computer Organization I.
• The instance does not store redundant information.
CSC434 is not given in this term.
• The title of CSC434 is not removed from the instance.
A new course is created: (CSC336, Numerical Methods)
• No information about sections has to be provided.
• Each relation stores attributes that are directly related.
After the three changes:
S:
  number   title
  CSC258   Computer Organization I
  CSC434   Database Systems
  CSC336   Numerical Methods
T:
  number   section  room
  CSC258   1        LP266
  CSC258   2        GB258
  CSC258   3        GB248
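The BAD and GOOD designs can be compared with a short Python sketch. It is only an illustration: the in-memory representation and the function names decompose and join are assumptions made here, not part of the slides. It splits R into S and T, checks that the natural join recovers R, and counts how many tuples a title change has to touch in each design.

```python
# Illustrative sketch; R uses the course data shown on the slides.
R = [
    ("CSC258", "Computer Organization", 1, "LP266"),
    ("CSC258", "Computer Organization", 2, "GB258"),
    ("CSC258", "Computer Organization", 3, "GB248"),
    ("CSC434", "Database Systems", 1, "GB248"),
]


def decompose(rows):
    """Split R(number, title, section, room) into S(number, title)
    and T(number, section, room)."""
    s = {(number, title) for number, title, _, _ in rows}
    t = {(number, section, room) for number, _, section, room in rows}
    return s, t


def join(s, t):
    """Natural join of S and T on number; lossless when number -> title holds."""
    titles = dict(s)
    return {(n, titles[n], section, room) for n, section, room in t}


S, T = decompose(R)

# Number of tuples that store the title of CSC258 in each design.
touched_in_R = sum(1 for row in R if row[0] == "CSC258")
touched_in_S = sum(1 for row in S if row[0] == "CSC258")
print(touched_in_R, touched_in_S)
```

In the single-relation design the title is stored once per section, while in S it is stored once per course.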
Normalization Theory
Main idea: a normal form defines a condition that a well designed database should satisfy.
Normal form: syntactic condition on the database schema.
• Defined for a class of data dependencies.
Main problems:
• How to test whether a database schema is in a particular normal form.
• How to transform a database schema into an equivalent one satisfying a particular normal form.
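For functional dependencies, the first problem is commonly solved with attribute closure. The sketch below is a minimal illustration under assumptions made here: FDs are given as pairs of frozensets over a single relation, and the names closure and is_bcnf are hypothetical. It tests the BCNF condition (every non-trivial FD has a superkey on its left-hand side) on the course example.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_bcnf(schema, fds):
    """True if every non-trivial FD listed for this relation has a superkey
    as its left-hand side."""
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial dependency
        if not closure(lhs, fds) >= set(schema):
            return False
    return True


number_title = [(frozenset({"number"}), frozenset({"title"}))]

bad = is_bcnf({"number", "title", "section", "room"}, number_title)
good_s = is_bcnf({"number", "title"}, number_title)
good_t = is_bcnf({"number", "section", "room"}, [])
print(bad, good_s, good_t)
```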
Normalization Theory Today
Normalization theory for relational databases was developed in the 70s and 80s.
Why do we need normalization theory today?
• New data models have emerged: XML.
• XML documents can contain redundant information.
Redundant information in XML documents:
• Can be discovered if the user provides semantic information.
• Can be eliminated.
Part 2: XML and Normalization
XML Document:
  <courses>
    <course cno="CSC258">
      <taken_by>
        <student sno="st1">
          <name> Fox </name>
          <grade> B+ </grade>
        </student>
      </taken_by>
    </course>
    <course cno="CSC434">
      …
    </course>
  </courses>
DTD:
  courses ⇒ course*
  course ⇒ @cno
  course ⇒ taken_by
  taken_by ⇒ student*
  student ⇒ @sno
  student ⇒ name, grade
  name ⇒ #PCDATA
  grade ⇒ #PCDATA
XML Databases
XML Schema: (D, Σ)
D:
  courses ⇒ course*
  course ⇒ @cno
  course ⇒ taken_by
  taken_by ⇒ student*
  student ⇒ @sno
  student ⇒ name, grade
Σ: Two students with the same @sno value must have the same name.
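The dependency in Σ can be checked mechanically on a concrete document. The following sketch uses Python's standard xml.etree.ElementTree module; the sample document, including the content of the CSC434 course, and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document; the content of the CSC434 course is assumed.
DOC = """
<courses>
  <course cno="CSC258">
    <taken_by>
      <student sno="st1"><name>Fox</name><grade>B+</grade></student>
    </taken_by>
  </course>
  <course cno="CSC434">
    <taken_by>
      <student sno="st1"><name>Fox</name><grade>A+</grade></student>
    </taken_by>
  </course>
</courses>
"""


def names_by_sno(root):
    """Collect every name stored for each @sno value."""
    names = {}
    for student in root.iter("student"):
        name = (student.findtext("name") or "").strip()
        names.setdefault(student.get("sno"), []).append(name)
    return names


def satisfies_sno_name(root):
    """True if two students with the same @sno always have the same name."""
    return all(len(set(v)) == 1 for v in names_by_sno(root).values())


root = ET.fromstring(DOC)
print(names_by_sno(root), satisfies_sno_name(root))
```

The dictionary returned by names_by_sno also makes the redundancy visible: the same name is stored once per course taken by the student.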
Redundancy in XML
[Figure: XML tree of the courses document. Student “st1” occurs under both course “CSC258” (grade “B+”) and course “CSC434” (grade “A+”), and the name “Fox” is stored with each occurrence. The figure also shows an info subtree holding @sno “st1” and name “Fox” once.]
XML Database Normalization
Original DTD:
  courses ⇒ course*
  course ⇒ @cno
  course ⇒ taken_by
  taken_by ⇒ student*
  student ⇒ @sno
  student ⇒ name, grade
Data dependency:
Two students with the same @sno value must have the same name.
Normalized DTD:
  courses ⇒ course*, info*
  course ⇒ @cno
  course ⇒ taken_by
  taken_by ⇒ student*
  student ⇒ @sno
  student ⇒ grade
  info ⇒ @sno
  info ⇒ name
Data dependency:
@sno is the identifier of info elements.
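The move from the original DTD to the normalized one can be sketched as a document transformation. This is an illustration only, assuming the input already satisfies the dependency on @sno and name; the sample document and the function name normalize are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document conforming to the original DTD (content assumed).
DOC = """<courses>
  <course cno="CSC258"><taken_by>
    <student sno="st1"><name>Fox</name><grade>B+</grade></student>
  </taken_by></course>
  <course cno="CSC434"><taken_by>
    <student sno="st1"><name>Fox</name><grade>A+</grade></student>
  </taken_by></course>
</courses>"""


def normalize(root):
    """Move each student's name into an info element under the root,
    keeping one info element per @sno value."""
    names = {}
    # list(...) so that removing children does not disturb the traversal
    for student in list(root.iter("student")):
        name = student.find("name")
        if name is not None:
            names.setdefault(student.get("sno"), (name.text or "").strip())
            student.remove(name)
    for sno, text in names.items():
        info = ET.SubElement(root, "info", {"sno": sno})
        ET.SubElement(info, "name").text = text
    return root


root = normalize(ET.fromstring(DOC))
print(ET.tostring(root, encoding="unicode"))
```

After the transformation each name is stored once, and the grades stay with the student elements inside each course.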
A “Non-relational” Example
[Figure: XML tree of a DBLP document. The root DBLP has conf children; a conf has a title (“ICDT”) and issue children; an issue has article children; an article has a title, author children (“Dong”, “Jarke”) and a @year attribute. Articles in the same issue repeat the same @year value (“1999”, “2001”).]
XNF: XML Normal Form
It eliminates two types of anomalies.
It was defined for XML functional dependencies:
DBLP.conf.@title → DBLP.conf
DBLP.conf.issue → DBLP.conf.issue.article.@year
Problems to Address
Functional dependencies for XML.
Normal form for XML documents (XNF).
• Generalizes BCNF.
Algorithm for normalizing XML documents.
• Implication problem for functional dependencies.
Framework: Paths in DTDs
Paths(D): all paths in a DTD D, for example:
  courses.course
  courses.course.@cno
  courses.course.student.name
  courses.course.student.name.S
EPaths(D): all paths in a DTD D that end with an element, for example, courses.course.
We distinguish three kinds of elements: attributes (@), strings (S) and element types.
FDs are defined by means of a relational representation of XML documents.
Framework: XML Trees
  v0: courses
    v1: course, @cno = “cs100”
      v2: student, @sno = “123”
        v3: name, S = “Fox”
        v4: grade, S = “B+”
      v5: student, @sno = “456”
        v6: name, S = “Smith”
        v7: grade, S = “A”
    . . . (further course elements)
Tree Tuples
Relational representation: tree tuples - mappings
t : Paths(D) → Vertices ∪ Strings ∪ {⊥}
A tree tuple represents an XML tree:
  v0: courses
    v1: course, @cno = “cs100”
      v2: student
t(courses) = v0
t(courses.course) = v1
t(courses.course.@cno) = “cs100”
t(courses.course.student) = v2
t(p) = ⊥, for the remaining paths
We consider tuples containing a minimal amount of ⊥ values.
XML Tree: set of Tree Tuples
The XML tree with vertices v0 to v7 (course “cs100” with students “123” and “456”) is represented by a set of tree tuples, one per student:
Tuple 1:
  v0: courses
    v1: course, @cno = “cs100”
      v2: student, @sno = “123”
        v3: name, S = “Fox”
        v4: grade, S = “B+”
Tuple 2:
  v0: courses
    v1: course, @cno = “cs100”
      v5: student, @sno = “456”
        v6: name, S = “Smith”
        v7: grade, S = “A”
. . . (tuples for the remaining course elements)
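The decomposition of an XML tree into tree tuples can be sketched in Python. The node representation and the function names below are assumptions made for illustration: each tuple picks one child per element type at every vertex, and a path absent from the resulting dictionary stands for ⊥.

```python
from itertools import product


def node(vid, label, attrs=None, text=None, children=()):
    """Hypothetical tree node: vertex id, element type, attributes, string, children."""
    return {"id": vid, "label": label, "attrs": attrs or {},
            "text": text, "children": list(children)}


def tree_tuples(n, prefix=None):
    """Tree tuples of the subtree rooted at n, as dicts from paths to values."""
    path = n["label"] if prefix is None else f"{prefix}.{n['label']}"
    base = {path: n["id"]}
    for a, v in n["attrs"].items():
        base[f"{path}.@{a}"] = v
    if n["text"] is not None:
        base[f"{path}.S"] = n["text"]
    # Group children by element type; a tuple follows one child per type.
    groups = {}
    for c in n["children"]:
        groups.setdefault(c["label"], []).append(c)
    options = [[t for c in cs for t in tree_tuples(c, path)]
               for cs in groups.values()]
    result = []
    for combo in product(*options):
        t = dict(base)
        for part in combo:
            t.update(part)
        result.append(t)
    return result


TREE = node("v0", "courses", children=[
    node("v1", "course", {"cno": "cs100"}, children=[
        node("v2", "student", {"sno": "123"}, children=[
            node("v3", "name", text="Fox"), node("v4", "grade", text="B+")]),
        node("v5", "student", {"sno": "456"}, children=[
            node("v6", "name", text="Smith"), node("v7", "grade", text="A")]),
    ]),
])

TUPLES = tree_tuples(TREE)
for t in TUPLES:
    print(t)
```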
Functional Dependencies for XML
Expressions of the form: X → Y
defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D).
XML tree T can be tested for satisfaction of X → Y if:
X ∪ Y ⊆ Paths(T) ⊆ Paths(D)
T |= X → Y if for every pair u, v of tree tuples in T:
u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y
FD: Examples
University DTD:
  courses ⇒ course*
  course ⇒ @cno, student*
  student ⇒ @sno, name, grade
Two students with the same @sno value must have the same name:
courses.course.student.@sno → courses.course.student.name.S
Every student can have at most one grade in every course:
{ courses.course, courses.course.student.@sno } →
courses.course.student.grade.S
Implication Problem for FD
Given a DTD D and a set of functional dependencies Σ ∪ {ϕ}:
(D, Σ) |- ϕ ((D, Σ) implies ϕ) if for every XML tree T conforming to D and satisfying Σ, it is the case that T |= ϕ
(D, Σ)+ = { ϕ | (D, Σ) |- ϕ }
Functional dependency ϕ is trivial if it is implied by the DTD alone.
Checking FD Satisfaction
{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S
XML tree:
  v0: courses
    v1: course, @cno = “cs100”
      v2: student, @sno = “123”
        v3: name, S = “Fox”
        v4: grade, S = “B+”
    v5: course, @cno = “cs225”
      v6: student, @sno = “123”
        v7: name, S = “Fox”
        v8: grade, S = “A+”
Tree tuples: one through v1 (grade “B+”) and one through v5 (grade “A+”).
The two tuples differ on courses.course (v1 versus v5), so the FD is satisfied.
Checking FD Satisfaction
{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S
XML tree:
  v0: courses
    v1: course, @cno = “cs100”
      v2: student, @sno = “123”
        v3: name, S = “Fox”
        v4: grade, S = “B+”
      v5: student, @sno = “123”
        v6: name, S = “Fox”
        v7: grade, S = “A+”
Tree tuples: one through v2 (grade “B+”) and one through v5 (grade “A+”).
The two tuples agree on courses.course (v1) and on @sno (“123”) but differ on grade.S, so the FD is violated.
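The satisfaction test for X → Y translates almost directly into code once the tree tuples are available. In this sketch tree tuples are plain dictionaries listing only the relevant paths, None stands for ⊥, and "u.X ≠ ⊥" is read as "no path in X is mapped to ⊥"; these encoding choices and the name satisfies are assumptions made here.

```python
BOTTOM = None  # stands for ⊥

COURSE = "courses.course"
SNO = "courses.course.student.@sno"
NAME = "courses.course.student.name.S"
GRADE = "courses.course.student.grade.S"


def satisfies(tuples, lhs, rhs):
    """T |= X -> Y: tuples that agree on X, with no ⊥ in X, must agree on Y."""
    for u in tuples:
        for v in tuples:
            agree = all(u.get(p) == v.get(p) for p in lhs)
            defined = all(u.get(p) is not BOTTOM for p in lhs)
            if agree and defined and any(u.get(p) != v.get(p) for p in rhs):
                return False
    return True


# Student "123" enrolled in two different courses (vertices v1 and v5).
two_courses = [
    {COURSE: "v1", SNO: "123", NAME: "Fox", GRADE: "B+"},
    {COURSE: "v5", SNO: "123", NAME: "Fox", GRADE: "A+"},
]
# Student "123" listed twice in the same course (vertex v1).
one_course = [
    {COURSE: "v1", SNO: "123", NAME: "Fox", GRADE: "B+"},
    {COURSE: "v1", SNO: "123", NAME: "Fox", GRADE: "A+"},
]

print(satisfies(two_courses, [COURSE, SNO], [GRADE]))
print(satisfies(one_course, [COURSE, SNO], [GRADE]))
```

Comparing vertex identifiers rather than element content is what lets the course path act as a key component in the left-hand side.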
XNF: XML Normal Form
XML specification: a DTD D and a set of functional dependencies Σ.
A Relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key.
(D, Σ) is in XNF if:
For each non-trivial FD X → p.@l or X → p.S in (D, Σ)+, X → p is in (D, Σ)+.
Back to DBLP
DBLP is not in XNF:
DBLP.conf.issue → DBLP.conf.issue.article.@year ∈ (D,Σ)+
DBLP.conf.issue → DBLP.conf.issue.article ∉ (D,Σ)+
Proposed solution is in XNF.
Normalization Algorithm
The algorithm applies two transformations until the schema is in XNF.
If there is an anomalous FD of the form:
DBLP.conf.issue → DBLP.conf.issue.article.@year
then apply the “DBLP example rule”.
Otherwise: choose a minimal anomalous FD and apply the “University example rule”.
Normalizing XML Documents
Remember:
DBLP.conf.issue[q] → DBLP.conf.issue.article.[p]@year
Reasoning About FDs
Part 3: What was Missing? Justification!
What is a good database design?
• Well-known solutions: BCNF, 4NF, …
But what is it that makes a database design good?
• Elimination of update anomalies.
• Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.
Previous work was specific for the relational model.
• Classical problems have to be revisited in the XML context.
Justification of Normal Forms
Problematic to evaluate XML normal forms.
• No XML update language has been standardized.
• No XML query language yet has the same “yardstick” status as relational algebra.
• We do not even know if implication of XML FDs is decidable!
We need a different approach.
• It must be based on some intrinsic characteristics of the data.
• It must be applicable to new data models.
• It must be independent of query/update/constraint issues.
Our approach is based on information theory.
Information Theory
Entropy measures the amount of information provided by a certain event.
Assume that an event can have n different outcomes with probabilities p1, …, pn.
Amount of information gained by knowing that event i occurred:
$\log \frac{1}{p_i}$
Average amount of information gained (entropy):
$\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$
Entropy is maximal if each $p_i = 1/n$:
$\log n$
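These quantities are easy to compute numerically. The sketch below uses base-2 logarithms, which matches the value log 5 ≈ 2.322 used later in the talk; the function names are chosen here for illustration.

```python
import math


def information(p):
    """Information gained by learning that an outcome of probability p occurred."""
    return math.log2(1 / p)


def entropy(probs):
    """Average information; outcomes with probability 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


uniform = [1 / 5] * 5
skewed = [0.5, 0.25, 0.125, 0.125]
certain = [1.0, 0.0, 0.0]

print(entropy(uniform), math.log2(5))
print(entropy(skewed))
print(entropy(certain))
```

The uniform distribution over five outcomes reaches the maximum log 5, while a distribution that puts all its mass on one outcome has entropy 0.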
Entropy and Redundancies
Database schema: R(A,B,C), A → B
Instance I:
  A  B  C
  1  2  3
  1  2  4
Pick a domain properly containing adom(I): {1, …, 6}
Position of C in the first tuple:
  A  B  C
  1  2  _
  1  2  4
• Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4
• Entropy: log 5 ≈ 2.322
Position of B in the first tuple:
  A  B  C
  1  _  3
  1  2  4
• Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2
• Entropy: log 1 = 0
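The two calculations can be reproduced with a simplified sketch. It assumes that a value is allowed at a position exactly when the resulting instance still satisfies the FDs and still has distinct tuples (which is one reading of why P(4) = 0 in the first case), and that the allowed values are equally likely. The function names are hypothetical, and the measure defined in the talk may be more refined than this.

```python
import math


def satisfies_fd(rows, lhs, rhs):
    """Check an FD given as column-index lists lhs -> rhs."""
    seen = {}
    for r in rows:
        key = tuple(r[i] for i in lhs)
        val = tuple(r[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


def position_entropy(rows, row, col, domain, fds):
    """Entropy of a uniform distribution over the values allowed at (row, col)."""
    allowed = []
    for a in domain:
        candidate = [list(r) for r in rows]
        candidate[row][col] = a
        tuples = [tuple(r) for r in candidate]
        if len(set(tuples)) < len(tuples):
            continue  # assumed: a value that duplicates a tuple is excluded
        if all(satisfies_fd(tuples, lhs, rhs) for lhs, rhs in fds):
            allowed.append(a)
    return math.log2(len(allowed))


I = [(1, 2, 3), (1, 2, 4)]   # columns A, B, C
FDS = [([0], [1])]           # A -> B
DOMAIN = range(1, 7)         # {1, ..., 6}

c_entropy = position_entropy(I, 0, 2, DOMAIN, FDS)
b_entropy = position_entropy(I, 0, 1, DOMAIN, FDS)
print(round(c_entropy, 3), b_entropy)
```

The position of B is fully determined by the FD and the other tuple, which is why it carries no information.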
Entropy and Normal Forms
Let Σ be a set of FDs over a schema S.
Theorem: (S,Σ) is in BCNF if and only if for every instance I of (S,Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0).
A similar result holds for 4NF and MVDs.
This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...
Problems with the Measure
The measure cannot distinguish between different types of data dependencies.
It cannot distinguish between different instances of the same schema R(A,B,C), A → B (the measured position is marked _):
  A  B  C
  1  2  3
  1  2  4
  1  _  5
  entropy = 0
  A  B  C
  1  2  3
  1  _  4
  entropy = 0