DASWIS 2001 1 NF-SS: A Normal Form for Semistructured Schemata Xiaoying Wu, Tok Wang Ling, Sin Yeung...

DASWIS 2001 1 NF-SS: A Normal Form for Semistructured Schemata Xiaoying Wu, Tok Wang Ling, Sin Yeung Lee, Mong Li Lee National University of Singapore Gillian Dobbie University of Auckland, New Zealand


DASWIS 2001


NF-SS: A Normal Form for Semistructured Schemata

Xiaoying Wu, Tok Wang Ling, Sin Yeung Lee, Mong Li Lee

National University of Singapore

Gillian Dobbie

University of Auckland, New Zealand


Outline

1. Motivations

2. Semistructured schema and its data tree

3. Integrity constraints for semistructured data

4. NF-SS: Normal Form for Semistructured Schemata

5. Designing semistructured schemata into NF-SS

6. Discussion of the design approach

7. Comparison with related proposals

8. Summary


1. Motivation: Example 1

<!ELEMENT department (course+)>
<!ATTLIST department name ID #REQUIRED>
<!ELEMENT course (student*)>
<!ATTLIST course cid ID #REQUIRED
                 title CDATA #IMPLIED>
<!ELEMENT student (grade?)>
<!ATTLIST student sid ID #REQUIRED
                  name CDATA #REQUIRED
                  age CDATA #IMPLIED>
<!ELEMENT grade (#PCDATA)>

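[Figure: schema tree for Example 1. department (attribute: name) has course+; course (attributes: cid, title) has student*; student (attributes: sid, name, age) has grade?.]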


1. Motivation (cont.)

Redundancy: the name and age of a student are repeated under every course the student takes.

Updating anomalies:

– Insertion
– Rewriting
– Deletion

[Figure: data tree for Example 1. The department named CS has courses cs4221 (title: database design) and cs5220 (title: data Mining). Student s01 (name: Jack, age: 21) appears under both courses, student s02 (name: Tom) also appears, and a grade “A” is recorded.]


1. Motivation: Example 2

<!ELEMENT teacher (ClassRoom*)>
<!ATTLIST teacher tid ID #REQUIRED
                  name CDATA #REQUIRED>
<!ELEMENT ClassRoom (subject*)>
<!ATTLIST ClassRoom room# ID #REQUIRED>
<!ELEMENT subject (time)>
<!ATTLIST subject cid ID #REQUIRED>
<!ELEMENT time EMPTY>
<!ATTLIST time day CDATA #REQUIRED
               hour CDATA #REQUIRED>

[Figure: schema tree for Example 2, the path teacher (tid, name) / ClassRoom (room#) / subject (cid) / time (day, hour).]

Path anomaly:
– The schema does not reflect the integrity constraint tid, day, hour $\rightarrow$ cid, room#.


2. Semistructured Schema and Data tree

A semistructured schema is defined to be D = (E, A, B, P, R, r)

[Figure: the Example 1 schema tree, annotated with the parts of the definition: object types (E), attributes (A), multiplicity symbols, and the root object type (r).]

• E is a finite set of object types in D.

• A is a finite set of attributes, disjoint from E.

• B is a set of basic domain types such as string, integer, Boolean, etc.

• P is a function from E to object type definitions, in which each component carries a multiplicity symbol in {*, +, ?, 1}. e.g.: P(course) = student*

• R is a function from E to the power set of A. e.g.: R(student) = {sid, name, age}

• $r \in E$ is called the object type of the root. e.g.: r = department
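As an illustration only, the tuple D = (E, A, B, P, R, r) can be held in a small record. The sketch below is not part of the original slides: the class name `Schema`, the `check` method, and the encoding of P as lists of (component, multiplicity) pairs are all assumptions made for this example, and grade is treated as an object type because it is an element in the Example 1 DTD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical container for D = (E, A, B, P, R, r)."""
    E: frozenset  # object types
    A: frozenset  # attributes, disjoint from E
    B: frozenset  # basic domain types
    P: dict       # object type -> list of (component, multiplicity)
    R: dict       # object type -> set of attribute names
    r: str        # root object type

    def check(self):
        """Raise ValueError if a basic well-formedness condition fails."""
        if self.E & self.A:
            raise ValueError("E and A must be disjoint")
        if self.r not in self.E:
            raise ValueError("the root must be an object type")
        for obj, components in self.P.items():
            for child, mult in components:
                if mult not in {"*", "+", "?", "1"}:
                    raise ValueError(f"bad multiplicity {mult!r} on {obj}/{child}")
        for obj, attrs in self.R.items():
            if not set(attrs) <= self.A:
                raise ValueError(f"unknown attribute on {obj}")
        return True


# The Example 1 schema, encoded under the assumptions above.
example1 = Schema(
    E=frozenset({"department", "course", "student", "grade"}),
    A=frozenset({"name", "cid", "title", "sid", "age"}),
    B=frozenset({"string", "integer", "boolean"}),
    P={"department": [("course", "+")],
       "course": [("student", "*")],
       "student": [("grade", "?")],
       "grade": []},
    R={"department": {"name"},
       "course": {"cid", "title"},
       "student": {"sid", "name", "age"},
       "grade": set()},
    r="department",
)
```

Calling `example1.check()` confirms that E and A are disjoint, that r is in E, and that every multiplicity symbol is one of *, +, ?, 1.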


2. Semistructured Schema and Data tree (Cont.)

[Figure: data tree for the Example 1 schema. Department CS with courses cs4221 (database design) and cs5220 (data Mining), and students s01 (Jack, age 21) and s02 (Tom).]

A data tree T with respect to a semistructured schema D = (E, A, B, P, R, r) is defined to be a tree T = (V, lab, obj, att, val, root) that represents a database instance.


2. Semistructured Schema and Data tree (Cont.)

[Figure: a data tree (department CS, course cs4221 “database design”, student s02 Tom, a grade “A”) shown beside the Example 1 schema tree.]

• The path of a node n in semistructured schema D is denoted as $\mathrm{Path}_D(n)$. e.g.: $\mathrm{Path}_D$ for student is /department/course/student

• The path of a node v in data tree T is denoted as $\mathrm{Path}_T(v)$. e.g.: $\mathrm{Path}_T$ for student “s02” is /department/course/student

• The target set of node n in T is $T[n] = \{\, v \mid v \in V,\ n \in E \cup A,\ \mathrm{Path}_T(v) = \mathrm{Path}_D(n) \,\}$. e.g.: the target set T[student] includes the student node with sid “s02”, among others.
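The target set definition can be sketched as a tree walk that collects every data-tree node whose path equals a given schema path. This is a minimal illustration, not the authors' implementation: the `Node` class, the `target_set` function, and the choice to model attributes as tagged child nodes are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical data-tree node: a tag, an optional value, and children."""
    tag: str
    value: object = None
    children: list = field(default_factory=list)


def target_set(root, schema_path):
    """Return all nodes whose path from the root equals schema_path."""
    steps = [s for s in schema_path.split("/") if s]
    found = []

    def walk(node, depth):
        # Stop as soon as the tag at this depth does not match the path.
        if depth >= len(steps) or node.tag != steps[depth]:
            return
        if depth == len(steps) - 1:
            found.append(node)
            return
        for child in node.children:
            walk(child, depth + 1)

    walk(root, 0)
    return found


# A small data tree modelled on the Example 1 figure.
s01 = Node("student", children=[Node("sid", "s01"), Node("name", "Jack")])
s02 = Node("student", children=[Node("sid", "s02"), Node("name", "Tom")])
course = Node("course", children=[Node("cid", "cs4221"), s01, s02])
tree = Node("department", children=[Node("name", "CS"), course])

students = target_set(tree, "/department/course/student")
```

Here `students` holds the two student nodes, while the path /department/name selects only the department's name attribute and not the students' names, because the paths differ.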


2. Semistructured Schema and Data tree (Cont.)

Two nodes from two data trees w.r.t. schema D satisfy value equality, written $=_v$, iff

– they are attribute nodes with the same tag and the same value; or
– they are object nodes with the same tag and their children are pairwise value equal.
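Value equality is a recursive definition, so a short recursive function illustrates it. The sketch below rests on assumptions that are not in the slides: the `Node` class is hypothetical, leaf nodes are compared by tag and value, and "pairwise" is read as comparing children position by position in document order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical data-tree node; `value` is used by leaf nodes only."""
    tag: str
    value: object = None
    children: list = field(default_factory=list)


def value_equal(a, b):
    """Return True if nodes a and b are value equal (assumed ordered reading)."""
    if a.tag != b.tag:
        return False
    if not a.children and not b.children:
        # Leaf case: same tag and same value.
        return a.value == b.value
    if len(a.children) != len(b.children):
        return False
    # Object case: children compared pairwise, in order.
    return all(value_equal(x, y) for x, y in zip(a.children, b.children))


jack_a = Node("student", children=[Node("sid", "s01"), Node("name", "Jack"), Node("age", 21)])
jack_b = Node("student", children=[Node("sid", "s01"), Node("name", "Jack"), Node("age", 21)])
tom = Node("student", children=[Node("sid", "s02"), Node("name", "Tom")])
```

With these nodes, `value_equal(jack_a, jack_b)` holds even though the two nodes are distinct objects, which is the situation of student s01 stored under two courses; `value_equal(jack_a, tom)` does not hold.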

[Figure: the Example 1 data tree (department CS; courses cs4221 and cs5220; students s01 and s02), used to illustrate value equality between student nodes.]

Let $T_1$ and $T_2$ be two data trees w.r.t. schema D = (E, A, B, P, R, r), and let $X \in E \cup A$. $T_1$ and $T_2$ agree on X, denoted as $T_1 =_X T_2$, iff the following condition holds: $t_1 \in T_1[X]$, $t_2 \in T_2[X]$, such that $t_1 =_v t_2$.


3. Integrity Constraints for Semistructured Data

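Extended Functional Dependency (EFD)

Let D = (E, A, B, P, R, r) be a semistructured schema, and let $X \subseteq E \cup A$ and $Y \subseteq E \cup A$. That Y is extended functionally dependent on X is denoted as $X \rightarrow Y$. Let S denote a set of data trees that are images of D. S satisfies $X \rightarrow Y$ iff for any data trees $T_1$, $T_2$ in S, if they agree on every component in X, then they agree on Y. That is,

$\forall\, T_1, T_2 \in S:\ (\forall x \in X,\ T_1 =_x T_2) \Rightarrow T_1 =_Y T_2$

Inference rules for EFDs

E1 (reflexivity): if $Y \subseteq X$, then $X \rightarrow Y$, for any $X, Y \subseteq E \cup A$
E2 (augmentation): if $X \rightarrow Y$, then $XZ \rightarrow YZ$, for any $X, Y, Z \subseteq E \cup A$
E3 (transitivity): if $X \rightarrow Y$ and $Y \rightarrow Z$, then $X \rightarrow Z$, for any $X, Y, Z \subseteq E \cup A$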
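Because E1 to E3 have the same shape as Armstrong's axioms for relational functional dependencies, the standard closure computation can be borrowed to test whether one EFD follows from a given set. The sketch below is that borrowed technique, not an algorithm from the slides; the function names and the dotted component names such as `course.cid` are made up for the example.

```python
def closure(components, efds):
    """Return every component derivable from `components` under E1 to E3.

    `efds` is an iterable of (lhs, rhs) pairs of sets.
    """
    result = set(components)          # E1: X determines each of its parts
    changed = True
    while changed:
        changed = False
        for lhs, rhs in efds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)    # E2 and E3 combined
                changed = True
    return result


def implies(efds, lhs, rhs):
    """True if lhs -> rhs can be derived from efds with E1 to E3."""
    return set(rhs) <= closure(lhs, efds)


# Three EFDs in the style of the course/student example.
efds = [
    ({"course.cid"}, {"course.title"}),
    ({"student.sid"}, {"student.name", "student.age"}),
    ({"course.cid", "student.sid"}, {"grade"}),
]
```

For instance, `implies(efds, {"course.cid", "student.sid"}, {"student.name"})` holds by augmentation and transitivity, while `implies(efds, {"course.cid"}, {"student.name"})` does not.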


3. Integrity Constraints for Semistructured Data (Cont.)

Notation: $O_1[@X_1], \ldots, O_i[@X_i], \ldots, O_{n-1}[@X_{n-1}] \rightarrow O_n[@X_n]$

An EFD $X \rightarrow Y$ is a partial EFD if there exists an $X' \subset X$ such that $X' \rightarrow Y$. Otherwise, it is a full EFD.
e.g.: (1) course[@cid], student[@sid] $\rightarrow$ student[@name] is a partial EFD
(2) student[@sid] $\rightarrow$ student[@name] is a full EFD

An EFD $X \rightarrow Y$ is said to be coherent iff /X/Y is a path in D; otherwise it is called an incoherent EFD.

[Figure: the Example 2 schema tree, teacher (tid, name) / ClassRoom (room#) / subject (cid) / time (day, hour).]

e.g.: teacher[@tid], time[@day, @hour] $\rightarrow$ subject[@cid] is an incoherent EFD, since /teacher/time/subject is not a path in the schema.
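A coherence test can be sketched as a check that the object types of an EFD occur in ancestor-to-descendant order in the schema tree. The slides do not say whether each step must be a direct child, so the sketch assumes that a descendant relation is enough; the `CHILDREN` map and both function names are hypothetical.

```python
# Hypothetical child map for the Example 2 schema tree.
CHILDREN = {
    "teacher": ["ClassRoom"],
    "ClassRoom": ["subject"],
    "subject": ["time"],
    "time": [],
}


def is_descendant(children, ancestor, node):
    """True if `node` lies strictly below `ancestor` in the schema tree."""
    stack = list(children.get(ancestor, []))
    while stack:
        current = stack.pop()
        if current == node:
            return True
        stack.extend(children.get(current, []))
    return False


def is_coherent(children, lhs, rhs):
    """lhs: ordered object types on the left side; rhs: object type on the right."""
    chain = list(lhs) + [rhs]
    # Collapse repeats such as student[@sid] -> student[@name].
    chain = [o for i, o in enumerate(chain) if i == 0 or o != chain[i - 1]]
    return all(is_descendant(children, a, b) for a, b in zip(chain, chain[1:]))
```

Under this assumption, `is_coherent(CHILDREN, ["teacher", "time"], "subject")` is false, matching the slide's example, because subject is not below time.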


3. Integrity Constraints for Semistructured Data (Cont.)

Let $X, Y, Z \subseteq E \cup A$. If $X \rightarrow Y$, $Y \rightarrow Z$ and $Y \nrightarrow X$, then Z is transitively extended functionally dependent on X via Y.

e.g.: age is transitively dependent on course via student, since

(1) course[@cid] $\rightarrow$ student[@sid]

(2) student[@sid] $\rightarrow$ student[@age] and

(3) student[@sid] $\nrightarrow$ course[@cid]
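The three conditions can be tested mechanically once a set of EFDs is given. In the sketch below, "Y does not determine X" is approximated as "X cannot be derived from Y using the listed EFDs", which is an assumption of this example and not a statement from the slides; the function names and dotted component names are also hypothetical.

```python
def closure(components, efds):
    """Components derivable from `components` using the listed EFDs."""
    result = set(components)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in efds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_transitive(efds, x, y, z):
    """True if X -> Y and Y -> Z are derivable and Y -> X is not."""
    return (y <= closure(x, efds)
            and z <= closure(y, efds)
            and not x <= closure(y, efds))


# EFDs (1) and (2) of the slide's example.
efds = [
    (frozenset({"course.cid"}), frozenset({"student.sid"})),
    (frozenset({"student.sid"}), frozenset({"student.age"})),
]

age_via_student = is_transitive(
    efds, {"course.cid"}, {"student.sid"}, {"student.age"})
```

`age_via_student` is true: age depends on course only through student, which is the pattern that Rule 1 later removes.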

[Figure: the Example 1 schema tree, department (name) / course+ (cid, title) / student* (sid, name, age) / grade?.]


3. Integrity Constraints for Semistructured Data (Cont.)

Theorem. Let D = (E, A, B, P, R, r) be a semistructured schema and $X, Y, Z \subseteq E \cup A$. If Z is transitively dependent on X via Y, then there exists a data tree of D in which a rewriting anomaly occurs when the values of Z are updated.

[Figure: the Example 1 data tree. Student s01 (name: Jack, age: 21) is stored under both course cs4221 and course cs5220, so an update to the age has to rewrite both copies.]


3. Integrity Constraints for Semistructured Data (Cont.)

Key constraints: based on EFD semantics

Notation: $K_O = O_1[@X_1]/\ldots/O_i[@X_i]/\ldots/O_n[@X_n]/O[@X]$ for a key of an object type O in semistructured schema D, where /O1/…/O is a path in D.

If n equals one, then $K_O$ is called an absolute key. Otherwise it is called a relative key.

[Figure: schema tree book (isbn) / chapter+ (number) / section+ (number).]

Example

• Kbook = book[@isbn]. Kbook is an absolute key.

• Kchapter = book[@isbn]/chapter[@number]. Kchapter is a relative key.

• Ksection = book[@isbn]/chapter[@number]/section[@number]. Ksection is a relative key.
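The key notation is regular enough to parse. The following sketch splits a key expression into its steps and labels it the way the three examples are labelled, a single step being absolute and several steps relative. The functions `parse_key` and `key_kind` are hypothetical helpers written for this illustration.

```python
def parse_key(key):
    """Split 'book[@isbn]/chapter[@number]' into [(object, [attributes]), ...]."""
    steps = []
    for step in key.split("/"):
        name, _, rest = step.partition("[")
        attrs = [a.strip().lstrip("@")
                 for a in rest.rstrip("]").split(",") if a.strip()]
        steps.append((name.strip(), attrs))
    return steps


def key_kind(key):
    """'absolute' for a one-step key, 'relative' otherwise (as in the examples)."""
    return "absolute" if len(parse_key(key)) == 1 else "relative"


k_book = "book[@isbn]"
k_section = "book[@isbn]/chapter[@number]/section[@number]"
```

`key_kind(k_book)` gives "absolute" and `key_kind(k_section)` gives "relative", in line with the slide. A section number identifies a section only within its chapter, which is why the key carries the ancestor steps.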


3. Integrity Constraints for Semistructured Data (Cont.)

Let D be a semistructured schema and O be its root object type. The set of basic dependencies of D, denoted as BD(D), is defined as follows:

• Let X, Y be children of O. Non-trivial extended functional dependencies of the form $X \rightarrow Y$, where X is a key of O or Y is part of a key of O, are in BD(D).

• Let O1 be a sub-object type of O, and let D1 be the schema tree rooted at O1 with $K_O$ added as attribute(s) of O1. Then $BD(D_1) \subseteq BD(D)$.

• No other non-trivial dependency is in BD(D).


4. NF-SS

Let D be a semistructured schema and O be its root object type. D is in Normal Form for Semistructured Schemata (NF-SS) iff

1. O has at least one key.

2. For any non-trivial EFD of the form $X \rightarrow Y$ satisfied by O, where X and Y are attributes of O, either X is a key of O or Y is part of a key of O.

3. For any sub-object type O1 of O:

(a) The schema tree rooted at O1, with $K_O$ added to O1 as its components and everything else unchanged, is in NF-SS.

(b) $K_O \cap K_{O_1} = \emptyset$ or $K_O \subset K_{O_1}$, where $K_O$ and $K_{O_1}$ are the keys of O and O1 respectively.

(c) O1 is not transitively dependent on $K_O$.

4. Any non-trivial EFD in D can be derived from BD(D) by using the inference rules for EFDs.


5. Designing Semistructured Schema into NF-SS

We adopt a restructuring approach to the design.

We propose four heuristic restructuring rules, which involve:
– Decomposing object types.
– Creating new object types.
– Regrouping the components of an object type.

Objective:
– Remove transitive, partial and incoherent EFDs from the given dependency and key constraints.


5. Designing Semistructured Schema into NF-SS (cont.)

Rule 1. (Remove Transitive Dependency by Decomposition)

Given an object type O in a semistructured schema D, if there is some non-prime component(s) Y of O that is transitively dependent on some key of O, i.e., $K_O \rightarrow X$, $X \rightarrow Y$, $X \nrightarrow K_O$ and $X \cap K_O = \emptyset$, then restructure the schema as follows:

1. Duplicate X to form new node(s) Z.
2. Move Y, all the descendants of Y, and their corresponding edges under Z.
3. Make X a foreign key of O, and add a reference edge from the original node X to Z.


5. Designing Semistructured Schema into NF-SS (cont.)

Example 5.1: schema D satisfies the following EFDs:
(1) department[@name] $\rightarrow$ course[@cid]
(2) course[@cid] $\rightarrow$ department
(3) course[@cid] $\rightarrow$ course[@title]
(4) course[@cid] $\rightarrow$ student[@sid]
(5) course[@cid], student[@sid] $\rightarrow$ grade
(6) student[@sid] $\rightarrow$ student[@name, @age]

[Figure: the original Example 1 schema tree (department / course+ / student* / grade?) and the schema restructured by Rule 1, in which student is split into student1 (sid, with grade?) under course and a new student2 holding sid, name and age.]


5. Designing Semistructured Schema into NF-SS (cont.)

Rule 2. (Remove Path Anomaly by Path Splitting)

Given a semistructured schema D, suppose there exists an incoherent EFD $O_1[@X_1], \ldots, O_n[@X_n] \rightarrow Y$, where Y is either an object type or an attribute, and there exists a path P that contains $\{O_1, \ldots, O_n, Y\}$. Path P can be split into two sub-paths P1 and P2, where P1 contains only $\{O_1, \ldots, O_n\}$ and Y, while P2 contains $\{O_1, \ldots, O_n\}$ and (P − Y).


5. Designing Semistructured Schema into NF-SS (cont.)

Example 5.2: schema D satisfies the following EFDs:
(1) teacher[@tid], time $\rightarrow$ ClassRoom
(2) teacher[@tid], time $\rightarrow$ subject

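[Figure: the original Example 2 schema tree, teacher (tid, name) / ClassRoom (room#) / subject (cid) / time (day, hour), and the schema restructured by Rule 2, in which time* (day, hour) sits directly under teacher (tid, name), with ClassRoom (room#) and subject (cid) below time.]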


5. Designing Semistructured Schema into NF-SS (cont.)

Rule 3. (Remove Partial Dependency by Creating a New Object Type)

Given an object type O in a semistructured schema, let X be a set of prime attributes of O, and Y be the set of O's attributes. Let O1 be a sub-object type of O. If $(K_O - X) \rightarrow O_1$ and no proper superset of X satisfies this property, then restructure the schema as follows:

1. $(K_O \cap Y - X)$ becomes the only attribute(s) of O, while O1 remains its sub-object type.
2. Create a new object type O2 that is a direct component of O.
3. Move the rest of the components of O, with all their descendants and corresponding edges, under O2.


5. Designing Semistructured Schema into NF-SS (cont.)

Example 5.3: schema D is shown in Figure (a). It satisfies the following EFDs: {O[@A,@B] $\rightarrow$ D, O[@A,@B] $\rightarrow$ O2, O[@A] $\rightarrow$ O1, O[@A] $\rightarrow$ E}, and the key of O is {A, B}.

[Figure, three panels:
(a) Un-normalized schema, because of the partial EFD O[@A,@B] $\rightarrow$ O1. Applying Rule 3 gives (b).
(b) Un-normalized schema, because of the incoherent EFD O’[@A] $\rightarrow$ E. Applying Rule 2 gives (c).
(c) Normalized schema.]


5. Designing Semistructured Schema into NF-SS (cont.)

Rule 4. (Restructuring to Satisfy Condition 3(b) of the NF-SS Definition)

Given an object type O in a semistructured schema D, let X be a set of O’s attributes and single-valued atomic sub-object types, and let O1 be a complex sub-object type of O. O1 has relative key $K_{O_1}$, but $K_O \not\subset K_{O_1}$ and $K_{O_1} \not\subset K_O$. Let $Y = K_O \cap K_{O_1} \cap X$, with $Y \neq \emptyset$. D is restructured as follows:

1. O1 remains a sub-object type of O.
2. Make Y the components of O.
3. Create a new object type O2 as a child of O; the remaining components of O (excluding Y) become children of O2.


5. Designing Semistructured Schema into NF-SS (cont.)

Example 5.4: schema D in Figure (a) satisfies the EFDs (1) O[@K, @A] $\rightarrow$ O1 and (2) O[@K, @B] $\rightarrow$ O2, and the key of O is {K, A, B}.

[Figure, three panels:
(a) Un-normalized schema, since O1 and O2 are partially dependent on {K, A, B}. Applying Rule 3 for O[@K, @B] $\rightarrow$ O2 gives (b).
(b) Un-normalized schema, since $K_O$ = O’[@K,@A] and $K_{O_3}$ = O’[@K]/O3[@B], such that $K_O \not\subset K_{O_3}$. Applying Rule 4 gives (c).
(c) Normalized schema.]


5. Designing Semistructured Schema into NF-SS (cont.)

Algorithm 1: Restructuring Algorithm
Input: A set S of semistructured schemas, and a set of EFDs for S.
Output: A set of semistructured schemas that are in NF-SS.
Begin
1. for each semistructured schema D in S do
     if D is not in NF-SS then repeat until no further change:
       (1) if there exists a transitive EFD $K_O \rightarrow X$, $X \rightarrow Y$ and $X \nrightarrow K_O$ for an object type O in D:
           Case $X \cap K_O = \emptyset$: apply Rule 1 to remove the transitive EFD.
           Case $X \subset K_O$: apply Rule 3 to remove the transitive EFD.
           Case $X \cap K_O \neq \emptyset$: apply Rule 4 to remove the transitive EFD.
       (2) if there exists an incoherent EFD, then apply Rule 2 to remove it.
2. output S.
End


6. Discussion of Restructuring Approach for Designing

Are the restructuring rules complete? No.
– Covering is not guaranteed.
– Dependency preservation is not guaranteed.

Does the approach give a unique solution? No.
– The result depends on the order in which the dependencies are examined.

The design task can be made easier if more semantics are available.
– In [5], we have proposed another approach for designing semistructured databases using ORA-SS, a semantically rich model.

Nevertheless, the approach gives practical heuristics and provides insights into the normalization task for semistructured databases.


7. Comparison with Related Proposals

The first attempt to define a normal form for semistructured data ([ER’99] S. Y. Lee, M. L. Lee, T. W. Ling, and L. A. Kalinichenko) [3]
– Defines a schema called S3-Graph, which makes no distinction between element nodes and attribute nodes and has no cardinality specification.
– Proposes S3-NF, but omits key constraints, an essential part of database design.
– The decomposition method may not be able to remove some other kinds of anomalies that may exist in a schema, such as partial dependency and path anomaly.

The most recent proposal: XNF (XML Normal Form) ([ER 2001] D. W. Embley and W. Y. Mok) [2]
– It mainly provides algorithms to translate a schema, represented in a conceptual model called CM hypergraphs, into a scheme-tree forest in XNF.
– Like the S3-Graph, a scheme tree does not lend itself to XML definition.
– XNF is not formulated with the concept of a key.
– The algorithms given suffer from inefficiency.
– A large set of results is expected.


8. Summary

A normal form for semistructured schemata
– It incorporates integrity constraints.
– It guarantees no redundancy, and hence no undesirable updating anomalies, for the conforming semistructured databases.
– It gives more reasonable representations of real-world semantics.

Restructuring approach for designing semistructured databases
– A set of heuristic restructuring rules is proposed.
– An algorithm for iteratively restructuring a schema into NF-SS is developed.
– It provides insights into the normalization task for semistructured databases.


References

1. J. Clark and S. DeRose. XML Path Language (XPath). W3C Working Draft, November 1999. http://www.w3.org/TR/xpath.

2. D. W. Embley and W. Y. Mok. Developing XML Documents with Guaranteed “Good” Properties. Proceedings of the 20th International Conference on Conceptual Modeling (ER), 2001.

3. S. Y. Lee, M. L. Lee, T. W. Ling and L. A. Kalinichenko. Designing Good Semi-structured Databases. Proceedings of the 18th International Conference on Conceptual Modeling (ER), 1999.

4. T. W. Ling and L. L. Yan. NF-NR: A Practical Normal Form for Nested Relations. Journal of Systems Integration, Vol. 4, 1994, pp. 309-340.

5. X. Wu, T. W. Ling, M. L. Lee and G. Dobbie. Designing Semistructured Databases Using the ORA-SS Model. Accepted for publication in Proceedings of the 2nd International Conference on Web Information Systems Engineering (WISE), IEEE Computer Society, Kyoto, Japan, December 2001.


Q&A