Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with...

24
Schemaless Semistructured Data Revisited —Reinventing Buneman’s Deterministic Semistructured Data Model— Keishi Tajima School of Informatics, Kyoto University PBF at Edinburgh, Scotland 28 Oct. 2013

Transcript of Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with...

Page 1: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited—Reinventing Buneman’s Deterministic Semistructured Data Model—

Keishi Tajima

School of Informatics, Kyoto University

PBF at Edinburgh, Scotland28 Oct. 2013

Page 2: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Background

Why Semistructured Data in PBF 2013?

One of the topics where Peter Bunemanhas made big contributions.

I was:

• visiting U. Penn DB group from 2000 to 2001, and

• joined their research on efficient archiving of versionhistory of semistructured data.

1

Page 3: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Background

What is Deterministic Semistructured Data Model?

The basic idea of the research was partly based on:

“A Deterministic Model for Semistructured Data”by P. Buneman, A. Deutsch, W.-C. TanWorkshop in conj. with ICDT’99, pp. 14–19, 1999

Their Deterministic Semistructured Data Model:

•Edge-labeled graph.

•Edges outgoing from a node have unique labels.

•Edge labels are also graphs.

2

Page 4: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Standard Model and Their Model

Standard Model

Their Model

3

Page 5: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Goal of this Research

Reinventing the data model through another way.

Contribution:

•Providing additional rationale of the design of the model.

4

Page 6: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Goal of this Research

the same?

Reinventing the data model through another way.

Contribution:

•Providing additional rationale of the design of the model.

•Providing an archive of their discussion (provenance).

5

Page 7: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Roadmap

1. What is important for models for semistructured data?

⇒ uniform treatment of data and metadata

2. What are most important metadata?

⇒ attribute names and key values

3. We extend a standard model following the discussion.

Start: edge-labeled graphs

3.1. Use key values as edge labels.

3.2. Why unique edge labels?

3.3. Why any graphs as edge labels?

6

Page 8: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

What is important in models for semistructured data?

What is semistructured data?

• nested irregular structure

• no predefined schema defining the structure (schemaless)

What is schema?

•metadata annotating data

– for systems to parse data

– for users to know the meaning of data

Standard approach

• embed metadata inside data (self-describing)

Uniform treatment of data and metadata is important.

7

Page 9: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

What are data and metadata?

Let’s see the most classic way of organizing large data:

tables

Two types of tables has been popularly used:

• relational tables

– a set of tuples indexed by column names

•multidimensional tables (or matrices if two-dimensional)

– a multidimensional array indexed by arbitrary values

8

Page 10: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Relational Tables

A relational table with column names

Nutrition Facts

AllergenItem Cal Sugar Fat · · · Milk Wheat Egg

Hamburger 250 5.5 9 · · · -√ √

Cheeseburger 300 6.5 12 · · ·√ √ √

... ... ... ... ... ... ... ...

Gigaburger 540 8.8 29 · · · -√ √

• column names = attribute names = metadata

• Some attribute names have hierarchical structure.(e.g., Allergen.Milk)

9

Page 11: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Multidimensional Tables

A multidimensional table with row/column names

Schuylkill River Trail Mileage Chart

Philadelphia Manayunk · · · TamaquaPhiladelphia — 7 · · · 114.5Manayunk 7 — · · · 107.5

... ... ... ... ...

Tamaqua 114.5 107.5 · · · —

•Both column names and row names are metadata

10

Page 12: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Relational Table or Multidimensional Table?

No clear distinction between them.

Car Sales by Month and State

NY NJ PA · · · CAJan 2010 233 149 183 · · · 258Feb 2010 358 187 170 · · · 286Mar 2010 285 174 191 · · · 225

... ... ... ... ... ...

Dec 2010 169 89 115 · · · 188

•A typical multidimensional table in OLAP.

•Can also be interpreted as a relation.

11

Page 13: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Relational Table or Multidimensional Table?

No clear distinction between them.

Car Sales by Month and State

Month NY NJ PA · · · CAJan 2010 233 149 183 · · · 258Feb 2010 358 187 170 · · · 286Mar 2010 285 174 191 · · · 225

... ... ... ... ... ...

Dec 2010 169 89 115 · · · 188

•A typical multidimensional table in OLAP.

•Can also be interpreted as a relation.

12

Page 14: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Relational Table or Multidimensional Table?

No clear distinction between them.

Car Sales by Month and State

State NY NJ PA · · · CAJan 2010 233 149 183 · · · 258Feb 2010 358 187 170 · · · 286Mar 2010 285 174 191 · · · 225

... ... ... ... ... ...

Dec 2010 169 89 115 · · · 188

•A typical multidimensional table in OLAP.

•Can also be interpreted as a relation.

13

Page 15: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Interpreting Relations as Multidimensional Tables

Item Cal Sugar Fat · · · Milk Wheat EggHamburger 100 5.5 9 · · · -

√ √

Cheeseburger 200 6.5 12 · · ·√ √ √

... ... ... ... ... ... ... ...Gigaburger 540 8.8 29 · · · -

√ √

• attribute names = column names

• key values = row names

Relations can be interpreted as multidimensionaltables if we interpret key values as row names.

14

Page 16: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Interpreting Relations as Multidimensional Tables

Item Cal Sugar Fat · · · Milk Wheat EggHamburger 100 5.5 9 · · · -

√ √

Cheeseburger 200 6.5 12 · · ·√ √ √

... ... ... ... ... ... ... ...Gigaburger 540 8.8 29 · · · -

√ √

Benefit:

• Symmetric treatment of rows and columns.

•More intuitive for human.“cal of Hamburger is 100, sugar of Hamburger is 5.5,. . . , cal of Cheeseburger is 200, . . . ”

15

Page 17: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

What are metadata in tables?

Attribute names and key values most oftenand naturally play roles of metadata.

Why are they distinguished in RDB?

•Columns are static, while rows are dynamic.

•Cells in a column store the same type, while cells in arow may not.

These two are not important in semistructured data.

16

Page 18: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

What should we do with semistructured data?

Let’s use attribute names and key values as metadata.

Benefit of treating key values as metadata:

•Natural way of organizing data.

•We often specify data items by specifying key values, sowhy not?

17

Page 19: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Using Key Values as Labels in Semistructured Data

Representation of tables in ordinary model:

With key values as metadata (row-wise):

18

Page 20: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Using Key Values as Labels in Semistructured Data

Representation of tables in ordinary model:

With key values as metadata (column-wise):

19

Page 21: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Deterministic?

Do we need multiple outgoing edges with the same label?

No, if we use key values as edge labels

Benefit?

Uniform treatment of set-value and single-value

In the ordinary models, it was achieved by carefully designed QLs.

20

Page 22: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Only atomic values as edges?

We have composite keys.

Hierarchical representation of composite keys?

Problems:

•Produces meaningless nodes.

•Forces some order unnecessarily. (data independence!)

•Especially bad when we have keys that are set-valued.

21

Page 23: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

How we represent composite or set-valued keys?

We represent composite or set values by graphs.

⇓We use graphs as edge labels.

Now we have reinvented the data model.

22

Page 24: Schemaless Semistructured Data Revisitedtajima/papers/pbf13slides.pdfWorkshop in conj. with ICDT’99, pp. 14{19, 1999 Their Deterministic Semistructured Data Model: Edge-labeled graph.

Schemaless Semistructured Data Revisited School of InformaticsKyoto University

Summary

1. Uniform treatment of data and metadata is importantin semistructured data

2. attribute names and key values are most importantmetadata

3. We extend a standard model in the following way

3.1. Use key values as edge labels.

3.2. Unique edge labels because:

• uniform treatment of single-value/set-value.

3.3. Allow any graph values as edge labels because:

•we have composite or set-valued keys.

23