Declaratively Producing Data Mash-ups

Sudarshan Murthy¹, David Maier²
¹ Applied Research, Wipro Technologies
² Department of Computer Science, Portland State University
http://www.sixml.org


Page 1: Declaratively Producing Data Mash-ups

Declaratively Producing Data Mash-ups

Sudarshan Murthy¹, David Maier²

¹ Applied Research, Wipro Technologies
² Department of Computer Science, Portland State University

http://www.sixml.org

Page 2: Declaratively Producing Data Mash-ups

Apr 21, 2023 Declaratively Producing Data Mash-ups 2

Mash-ups

• Web applications that combine information from multiple sources [Wikipedia]
  – A mash-up does not need to be a web app

• Data that includes or transcludes content from multiple sources

• In either case, a source is likely only a fragment

• This work is about data mash-ups
  – In this talk, a mash-up is an XML document

Page 3: Declaratively Producing Data Mash-ups


Portland State University Campus Map

• 45 markers, 53 landmarks
  – Marker: Balloon on map
  – Landmark: Building, department, …

• Information from 188 fragments in 58 web pages

• Fragments selected manually

http://sparce.cs.pdx.edu/cmap/

Page 4: Declaratively Producing Data Mash-ups


Portland Metro Food Markets

• 154 markers, 154 landmarks

• 154 fragments harvested programmatically from 4 MS Word documents

• Developed for the Oregon Department of Agriculture

http://sparce.cs.pdx.edu/Declaratively Producing Data Mash-ups/oda-1.1/

Page 5: Declaratively Producing Data Mash-ups

An HTML Review Report


Page 6: Declaratively Producing Data Mash-ups


Problem Areas

• Development
  – Getting data from heterogeneous fragments
  – Might use a DBMS, yet code operators such as sort, join, and aggregate for external data

• Execution
  – When to get external data, how much to get?

• Design: Expressing that
  – A part comes from an external fragment
  – A part is data (such as page number) which cannot be “selected” in the source

Page 7: Declaratively Producing Data Mash-ups


Outline

• Introduction
• The conceptual approach
• Sixml: Condensed mash-ups
• Sixml DOM: Reconstituted mash-ups
• Sixml Navigator: Formatted mash-ups
• Evaluation
• Summary
• Discussion

Page 8: Declaratively Producing Data Mash-ups


Superimposed Information (SI)

• SI is new data and structure overlaid on existing base information

• Mark: A reference to an external fragment

• Benefits
  – Multiple, simultaneous organizations
  – Make new connections among base fragments
  – Preserve context

[Figure: a superimposed layer whose marks point into a base layer of information sources 1 to n. Heterogeneous sources: Word, Excel, PDF, HTML, …]

Page 9: Declaratively Producing Data Mash-ups


The Mash-up Production Process

[Figure: a three-stage pipeline that draws on docs, DBMSs, and services]

• Collect and Classify: Collect marks, add new data and structure
  – Result: Condensed mash-up

• Extract and Combine: Extract data from marks and combine with added data
  – Result: Reconstituted mash-up

• Transform: Format reconstituted data for display and other purposes
  – Result: Formatted mash-up
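The three stages can be sketched in a few lines of Python. This is only an illustration of the data flow, not the authors' implementation: the `Part` class, the mark address `doc-a#chars=45-53`, the `BASE_LAYER` dictionary, and the excerpt text are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Part:
    """One part of a mash-up: added data, or a mark pointing at a fragment."""
    name: str
    value: Optional[str] = None   # data added in the superimposed layer
    mark: Optional[str] = None    # hypothetical address of an external fragment


def collect_and_classify() -> List[Part]:
    """Stage 1: collect marks and add new data (the condensed mash-up)."""
    return [
        Part(name="comment", value="Contradicts prior work"),
        Part(name="excerpt", mark="doc-a#chars=45-53"),
    ]


def extract_and_combine(condensed: List[Part],
                        resolve: Callable[[str], str]) -> Dict[str, str]:
    """Stage 2: extract data from marks and combine it with the added data."""
    reconstituted = {}
    for part in condensed:
        if part.mark is None:
            reconstituted[part.name] = part.value
        else:
            reconstituted[part.name] = resolve(part.mark)
    return reconstituted


def transform(reconstituted: Dict[str, str]) -> str:
    """Stage 3: format the reconstituted data for display."""
    return "\n".join(f"{key}: {value}" for key, value in reconstituted.items())


# Stand-in for the base layer; a real resolver would open the source document.
BASE_LAYER = {"doc-a#chars=45-53": "sample excerpt"}

condensed = collect_and_classify()
reconstituted = extract_and_combine(condensed, BASE_LAYER.__getitem__)
formatted = transform(reconstituted)
print(formatted)
```

Keeping the resolver as a parameter of the second stage mirrors the idea that the condensed form only references fragments, and the fetch happens later.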

Page 10: Declaratively Producing Data Mash-ups


SI, Bi-level Information, Mash-ups

• A condensed mash-up is SI
  – Links mash-up parts to external fragments
  – Relates to mash-up design: Sixml

• A reconstituted mash-up and a formatted mash-up are both bi-level information
  – SI plus reconstituted parts
  – Relates to runtime mash-up manipulation and execution: Sixml DOM and Sixml Navigator


Page 12: Declaratively Producing Data Mash-ups


Sixml

• A mash-up specification language
  – SI represented as XML; Sixml is XML

• A condensed mash-up is encoded as a Sixml document

• A mark association is encoded as an XML element of a type we define
  – Associate marks with six kinds of content
  – Validated using standard schema constructs
  – Uniform and comprehensible serialization

Page 13: Declaratively Producing Data Mash-ups

Sixml Mark Associations

• By default, the text excerpt is assigned at run time, but it is possible to declare that the value should be something other than the excerpt

• Mark association names shown here are the same as the type names, but custom names are possible (with both static and dynamic typing)

<Comment excerpt="">
  Contradicts prior work
</Comment>

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  Contradicts prior work
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  Contradicts prior work
  <sixml:AMark target="excerpt">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

Page 14: Declaratively Producing Data Mash-ups

Sixml Mark Descriptors

<Comment excerpt="" xmlns:sixml="…" xmlns:xsi="…">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor xsi:type="sixml:XPointer">
      <pointer>http://www.w3.org/#element(/1/2)</pointer>
    </sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor xsi:type="sixml:SPARCE">
      <Agent>OfficeAgents.MSWord</Agent>
      <Doc location="c:\abc.doc" />
      <Subdoc startChar="45" endChar="53" />
    </sixml:Descriptor>
  </sixml:EMark>
</Comment>
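Because Sixml is plain XML, a condensed mash-up can be inspected with any XML library. The Python sketch below lists the mark associations of a Comment element using the standard `xml.etree.ElementTree` module. Two things here are assumptions rather than statements from the slides: the reading of TMark, AMark, and EMark as marks on text, an attribute, and the element itself, and the empty Descriptor elements, which stand in for the descriptor contents elided on the slides.

```python
import xml.etree.ElementTree as ET

SIXML_NS = "http://schema.sixml.org"  # namespace URI as shown on the slides

# Descriptor contents are elided on the slides, so they are left empty here.
DOC = """
<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>Contradicts prior work
    <sixml:Descriptor/>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor/>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor/>
  </sixml:EMark>
</Comment>
"""


def mark_associations(element):
    """Return (kind, target) pairs for the mark associations under element."""
    found = []
    for child in element:
        if not child.tag.startswith("{" + SIXML_NS + "}"):
            continue  # an ordinary child element, not a mark association
        kind = child.tag.split("}", 1)[1]           # e.g. "AMark"
        if kind == "AMark":
            target = "@" + child.get("target")      # assumed: an attribute of the parent
        elif kind == "TMark":
            target = "text()"                       # assumed: text content
        else:
            target = element.tag                    # assumed: the element itself
        found.append((kind, target))
    return found


comment = ET.fromstring(DOC)
associations = mark_associations(comment)
print(associations)
```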


Page 16: Declaratively Producing Data Mash-ups


Sixml DOM

• Extends W3C XML DOM to easily manipulate Sixml documents
  – Using DOM can be tedious and inefficient

• Automatic and lazy reconstitution
  – Detects mark associations and interprets attributes such as sixml:valueSource
  – Developer uses only the DOM interface

• Access to descriptors and “context” of external fragments
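Lazy reconstitution can be illustrated with a small wrapper that fetches a marked attribute value only when it is first read, then caches it. This Python sketch is not the Sixml DOM API (which the slides describe as a C# extension of the DOM). The class `LazyElement`, the `status` attribute, the descriptor text `d1`, and the `BASE_LAYER` lookup are all hypothetical.

```python
import xml.etree.ElementTree as ET

SIXML = "{http://schema.sixml.org}"

# Stand-in base layer keyed by descriptor text; "d1" is a made-up descriptor.
BASE_LAYER = {"d1": "sample excerpt"}


class LazyElement:
    """Wraps an element and fills in marked attributes on first access."""

    def __init__(self, element, resolve):
        self._element = element
        self._resolve = resolve   # callable: Descriptor element -> value
        self._cache = {}
        self.fetches = 0          # counts trips to the base layer

    def get_attribute(self, name):
        if name in self._cache:
            return self._cache[name]
        for amark in self._element.findall(SIXML + "AMark"):
            if amark.get("target") == name:
                self.fetches += 1
                value = self._resolve(amark.find(SIXML + "Descriptor"))
                break
        else:
            value = self._element.get(name)   # ordinary attribute, no mark
        self._cache[name] = value
        return value


doc = ET.fromstring(
    '<Comment excerpt="" status="open" xmlns:sixml="http://schema.sixml.org">'
    '<sixml:AMark target="excerpt" sixml:valueSource="true">'
    '<sixml:Descriptor>d1</sixml:Descriptor></sixml:AMark>'
    '</Comment>'
)
comment = LazyElement(doc, lambda descriptor: BASE_LAYER[descriptor.text])
fetches_before = comment.fetches
first = comment.get_attribute("excerpt")    # triggers the only fetch
second = comment.get_attribute("excerpt")   # served from the cache
status = comment.get_attribute("status")    # unmarked attribute, no fetch
print(first, comment.fetches, status)
```

The caller only ever asks for an attribute by name, which is the point of the "developer uses only the DOM interface" bullet.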

Page 17: Declaratively Producing Data Mash-ups


Run-time Representation

<Comment excerpt="" xmlns:sixml="http://schema.sixml.org">
  <sixml:TMark>
    Contradicts prior work
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:TMark>
  <sixml:AMark target="excerpt" sixml:valueSource="true">
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:AMark>
  <sixml:EMark>
    <sixml:Descriptor>…</sixml:Descriptor>
  </sixml:EMark>
</Comment>

[Figure: the plain DOM tree for the Comment element above. TMark, AMark (with @target and @valueSource), and EMark are child elements of Comment, each with its own Descriptor child, alongside the @excerpt attribute.]

Page 18: Declaratively Producing Data Mash-ups

Generating a Sixml DOM Tree

[Figure: the plain DOM tree beside the Sixml DOM tree generated from it. In the Sixml DOM tree, each mark association (TMark, AMark, EMark) carries a Descriptor and a Context.]

• A mark association is “attached” to its target, but is not a child
  – The DOM interface suffices to access the reconstituted mash-up

• Descriptor is not a child

• Value reconstituted

Page 19: Declaratively Producing Data Mash-ups

Context Information

• Information retrieved from the context of an external fragment

• An xsi:type-specific implementation determines (statically or dynamically) what is in context

<sixml:Context>
  <Content>
    <Text>provide ... system</Text>
  </Content>
  <Presentation>
    <FontName>Times New Roman</FontName>
    <FontSize>11</FontSize>
  </Presentation>
  <Placement>
    <Page>3</Page>
  </Placement>
</sixml:Context>

Page 20: Declaratively Producing Data Mash-ups

Programming with Sixml DOM

1. procedure WriteComment(SixmlElement c)
2.   XmlElement ctxt = c.markAssociations[0].Context
3.   XmlNode page = ctxt.getElementsByTagName("Page")[0]
4.   Writeln("Page: ", page.firstChild.nodeValue)
5.   Writeln("Excerpt: ", c.getAttribute("excerpt"))
6.   Writeln("Comment: ", c.firstChild.nodeValue)

• Only Lines 1 and 2 use the Sixml DOM interface

• Lines 2–4 get page number; Line 5 the reconstituted excerpt; and Line 6 the comment text
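For readers without the Sixml DOM at hand, the same three lookups can be tried in Python with the standard `xml.dom.minidom` module. This is a rough analogue under stated assumptions: the input is an already reconstituted document written out as ordinary XML, the excerpt value `sample excerpt` is a placeholder, and the lookup of `sixml:Context` by tag name stands in for `c.markAssociations[0].Context`, which has no counterpart in the plain DOM.

```python
from xml.dom import minidom

# A reconstituted Comment, serialized as ordinary XML for this sketch only.
RECONSTITUTED = (
    '<Comment excerpt="sample excerpt" xmlns:sixml="http://schema.sixml.org">'
    'Contradicts prior work'
    '<sixml:EMark><sixml:Context><Placement><Page>3</Page></Placement>'
    '</sixml:Context></sixml:EMark>'
    '</Comment>'
)


def write_comment(c):
    """Return the three output lines of WriteComment for element c."""
    # Stand-in for c.markAssociations[0].Context in the Sixml DOM.
    ctxt = c.getElementsByTagName("sixml:Context")[0]
    page = ctxt.getElementsByTagName("Page")[0]
    return [
        "Page: " + page.firstChild.nodeValue,
        "Excerpt: " + c.getAttribute("excerpt"),
        "Comment: " + c.firstChild.nodeValue,
    ]


comment = minidom.parseString(RECONSTITUTED).documentElement
lines = write_comment(comment)
print("\n".join(lines))
```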



Page 22: Declaratively Producing Data Mash-ups


Sixml Navigator

• Alternative to the traditional path navigator

• Extends XDM so that Sixml documents can be declaratively queried using existing languages and query processors
  – Also applies to XPath 1.0 and XSLT 1.0

• Performs automatic and lazy reconstitution

Page 23: Declaratively Producing Data Mash-ups


XDM Extensions

• Allow child elements for any kind of node with which a mark may be associated

• Make a mark association a child of its target node

• Represent a mark descriptor and context as children of a mark association

• These extensions allow reuse of existing query languages and processors

Page 24: Declaratively Producing Data Mash-ups

An Extended-XDM Tree

[Figure: the extended-XDM tree for the Comment element. TMark, AMark, and EMark are children of their target nodes, and each has a Descriptor and a Context as children.]

Page 25: Declaratively Producing Data Mash-ups


Queries over Bi-level Information

• With Comment as current node, get the comment text
  ./text()

• Get excerpt of commented region
  ./@excerpt

• Get page number of commented region
  ./sixml:EMark/sixml:Context/Placement/Page

<sixml:Context>
  <Placement>
    <Page>3</Page>
  </Placement>
</sixml:Context>

[Figure: the EMark node with its Descriptor and Context children]
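The three queries can be tried against an ordinary XML rendering of the extended tree. The Python sketch below uses `xml.etree.ElementTree`, whose path support is a limited subset of XPath: it evaluates the element path directly, while `./text()` and `./@excerpt` are expressed through the `.text` property and `.get()`. The document itself is an assumption, a hand-written stand-in for what the navigator would expose, with a placeholder excerpt value.

```python
import xml.etree.ElementTree as ET

NS = {"sixml": "http://schema.sixml.org"}

# Hand-written stand-in for the tree the navigator would expose.
EXTENDED = """
<Comment excerpt="sample excerpt" xmlns:sixml="http://schema.sixml.org">Contradicts prior work<sixml:EMark>
    <sixml:Context>
      <Placement><Page>3</Page></Placement>
    </sixml:Context>
  </sixml:EMark>
</Comment>
"""

comment = ET.fromstring(EXTENDED)

text = comment.text.strip()        # counterpart of ./text()
excerpt = comment.get("excerpt")   # counterpart of ./@excerpt
page = comment.findtext(
    "./sixml:EMark/sixml:Context/Placement/Page", namespaces=NS
)

print(text, excerpt, page, sep=" | ")
```

A full XPath 1.0 processor would accept all three expressions as written on the slide.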


Page 27: Declaratively Producing Data Mash-ups


Implementation and Usage

• Element types for Sixml mark associations defined in XML Schema

• Sixml DOM and Sixml Navigator in C# on the .NET Framework
  – Sixml DOM implemented by extending DOM and by revising DOM
  – Three implementations of Sixml DOM: 2 extensions (MS and Mono), 1 revision (Mono)

• Sixml, Sixml DOM, and Sixml Navigator used in mash-ups for several applications

Page 28: Declaratively Producing Data Mash-ups

Experimental Data

• 8 mash-ups
  – 4 each from 2 apps; different scale factors
  – File size: 200 KB to 26.1 MB
  – #Docs referenced: 18 to 426
  – #Mark associations: 1.9K to over 311K

• 3 traditional XML documents
  – File size: 484 KB to 113.7 MB
  – Tree depth: 4, 8, 16

Page 29: Declaratively Producing Data Mash-ups

Evaluation Summary

• Sixml DOM
  – Saves time over DOM when accessing mark associations
  – When accessing SI, savings decrease as the amount of SI increases
  – It is better to use DOM to access large traditional XML documents

• Sixml Navigator
  – Saves time over traditional navigator for both mark associations and SI


Page 31: Declaratively Producing Data Mash-ups

Summary

• A mash-up has three forms: condensed, reconstituted, and formatted

• Sixml, Sixml DOM, and Sixml Navigator support the three forms, respectively

• Sixml makes it easier to specify mash-ups; Sixml DOM and Navigator provide a more efficient means of manipulating mash-ups

• The XML Schema instance documents and the source code are on www.sixml.org



Page 33: Declaratively Producing Data Mash-ups


Our Mash-up Framework

[Figure: framework diagram. Components, grouped as they appear in the diagram:]

• XSLT and XQuery Processors, XPath Processor, Client Application

• Sixml, Sixml DOM, Sixml Navigator

• SPARCE, Bulk Accessor, Cloaker
  – SPARCE: Reference and retrieve fragments of arbitrary types
  – Bulk Accessor: Efficiently retrieve large number of fragments
  – Cloaker: Hide data to improve query expression and execution

Page 34: Declaratively Producing Data Mash-ups

Bi-level Query Processors

• Sixml Navigator uses Sixml DOM internally: Does not construct extended-XDM trees

• Existing query processors use the Sixml Navigator instead of using the traditional navigator

[Figure: UML class diagram. Boxes: BulkAccessor; XMLContextTransformer with transform(contextInfo); SixmlNavigator with scope; XSLTProcessor with apply(styleSheet); XPathEvaluator with evaluate(expression); XPathNavigator with moveToRoot(), moveToFirstChild(), moveToNextSibling(), moveToPreviousSibling(), moveToParent(); Node Evaluation Context; SixmlNode. Association labels: Produces, Embeds, Source, Uses.]
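The cursor-style interface named in the diagram (moveToFirstChild(), moveToNextSibling(), moveToParent()) is what lets a query processor stay unaware of where nodes come from. The Python sketch below shows that idea on a toy tree of dictionaries, where a subtree given as a callable is materialized only when the cursor first descends into it. It is a hypothetical illustration, not the SixmlNavigator code: the class `SketchNavigator`, the dictionary layout, and `load_context` are all invented for the example.

```python
class SketchNavigator:
    """Cursor over a tree of dicts whose children may be loaded lazily."""

    def __init__(self, root):
        self._path = [root]   # nodes from the root down to the cursor
        self._index = [0]     # position of each node among its siblings

    @property
    def name(self):
        return self._path[-1]["name"]

    def _children(self, node):
        # Lazy step: a callable is replaced by its result on first visit.
        if callable(node.get("children")):
            node["children"] = node["children"]()
        return node.get("children") or []

    def move_to_first_child(self):
        kids = self._children(self._path[-1])
        if not kids:
            return False
        self._path.append(kids[0])
        self._index.append(0)
        return True

    def move_to_next_sibling(self):
        if len(self._path) < 2:
            return False
        siblings = self._children(self._path[-2])
        nxt = self._index[-1] + 1
        if nxt >= len(siblings):
            return False
        self._path[-1] = siblings[nxt]
        self._index[-1] = nxt
        return True

    def move_to_parent(self):
        if len(self._path) < 2:
            return False
        self._path.pop()
        self._index.pop()
        return True


fetched = []   # records each simulated trip to the base layer


def load_context():
    fetched.append("context")
    return [{"name": "Placement", "children": [{"name": "Page"}]}]


tree = {"name": "Comment", "children": [
    {"name": "EMark", "children": [
        {"name": "Descriptor"},
        {"name": "Context", "children": load_context},
    ]},
]}

nav = SketchNavigator(tree)
nav.move_to_first_child()     # EMark
nav.move_to_first_child()     # Descriptor
nav.move_to_next_sibling()    # Context, not yet fetched
before = list(fetched)
nav.move_to_first_child()     # Placement, which triggers the fetch
print(before, fetched, nav.name)
```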

Page 35: Declaratively Producing Data Mash-ups


Mark Creation

[Figure: mark creation. Components: Superimposed Application, Mark Manager, Clipboard, Superimposed Info, Descriptors Repository. Labels: M4, S1.]

<Mark ID="M4">
  <Agent>AcrobatAgents.PDFAgent</Agent>
  <Class>AcrobatPDFTextMark</Class>
  <Address>2|395|439</Address>
  …
  <ContainerID>D6</ContainerID>
</Mark>

Page 36: Declaratively Producing Data Mash-ups


Activation and Context Retrieval

[Figure: activation and context retrieval. Components: Superimposed Application, Mark Manager, Context Manager, Superimposed Info, Base Application, Descriptors Repository, Base Info. Labels: M4, S1.]

<Mark ID="M4">
  <Agent>AcrobatAgents.PDFAgent</Agent>
  <Class>AcrobatPDFTextMark</Class>
  <Address>2|395|439</Address>
  …
  <ContainerID>D6</ContainerID>
</Mark>

Page 37: Declaratively Producing Data Mash-ups

About Context

[Figure: context shown for a PDF mark and a PowerPoint mark]

• Context information is modeled as a hierarchical property set