Challenges in Ubiquitous Data Management Michael Franklin UC Berkeley.

Challenges in Ubiquitous Data Management

Michael Franklin, UC Berkeley

© 2000 Michael J. Franklin

Ubiquitous Computing

“In ten years, billions of people will be using the Web, but a trillion "gizmos" will also be connected to the Web.” Asilomar Rep. on DB Research, Dec. 1998

You’ve heard it before…

Wireless Internet-enabled devices projected to soon outnumber wired Internet devices.

Many computing devices per person: Smartphones, PDAs, Smartcards, badges, wearables, lightswitches, toasters, …


Ubiquitous Connectivity

Tremendous improvements in Internet backbone bandwidth and reductions in diameter.

Broadband connectivity to the home and office (i.e. the “last mile”) is being solved.

Wireless technologies are enabling anytime-anywhere connectivity.


Ubiquitous Data Access

But, ubiquitous computing and connectivity aren’t worth much without ubiquitous data access.

“Fundamentally, the ability to access all information from anywhere and have ONE unified and synchronized information repository is critical to making appliances useful.” Hambrecht and Quist, iWord, 3/99

Ubiquitous data access will put existing data management techniques to the test, in all aspects – searching, location, reliability, consistency, …


Ubiquitous Data – State of the Art

Everyone uses a database system and/or search engine every day, although they may not realize it! (The true test of “ubiquity”.)

The Internet and WWW have become a ubiquitous means of global data dissemination and exchange.

Databases play a crucial but largely invisible role here.

XML and related standards are enabling increasingly sophisticated interoperation.

Wireless access provides anytime-anywhere access and enables location-centric applications.


Scenarios and Requirements

Real “killer apps” have not yet emerged.

Many in industry have begun to refer to a “user experience” rather than a particular app.

Many of these scenarios are quite irritating, e.g. “buy milk now!!!!”

Typical scenarios require three types of functionality:

  Support for mobility – of users and data
  Context awareness – what is the user trying to do?
  Support for collaboration – varied and dynamic groups of people; real-time or asynchronous, …


Demands on Data Management

A key requirement that emerges from all three of these categories is adaptivity:

  movement/availability of data and people
  continually changing contexts
  dynamic groups and interactions

A problem and solution: “user-in-the-loop”:

  people can deal with ambiguity and conflict resolution.
  requires a collaborative and responsive approach to information systems:
    provide fast interactive performance
    quickly respond to user direction.


Mobility

Limited device capabilities:

  storage & CPU, battery power, bandwidth, display, …
  requires adjustment of data delivery to these

Varying and intermittent connectivity:

  requires proxies and smart data staging/pre-staging
  requires global access to data

Location-centric applications:

  “find open drugstores within two miles of my current location.”
  must be able to deal with locations and distances
  servers must track huge numbers of moving objects
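As a rough illustration of the location-centric query quoted above, the following sketch filters a small in-memory list of stores by great-circle distance. The `Store` record, the sample coordinates, and the function names are all hypothetical; a real server tracking huge numbers of moving objects would need a spatial index rather than this linear scan.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass
class Store:
    # Hypothetical record layout, invented for this sketch.
    name: str
    lat: float
    lon: float
    is_open: bool


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def open_stores_nearby(stores, lat, lon, radius_miles=2.0):
    """Return (distance, name) pairs for open stores within the radius, nearest first."""
    hits = [(haversine_miles(lat, lon, s.lat, s.lon), s.name) for s in stores if s.is_open]
    return sorted((d, n) for d, n in hits if d <= radius_miles)


# Made-up sample data: one open store close by, one closed, one open but far away.
STORES = [
    Store("Campus Drugs", 37.8690, -122.2590, True),
    Store("Late Night Rx", 37.8716, -122.2727, False),
    Store("Bay Pharmacy", 37.8044, -122.2712, True),
]

nearby = open_stores_nearby(STORES, 37.8719, -122.2585)
```

Only the open store within the two-mile radius survives the filter; the closed store and the distant one are dropped.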


Context Awareness

System must maintain an internal representation of the users’ needs, tasks, roles, preferences, etc.

  requires “user profiles” and models
  some information can be leveraged from PIM apps

In some scenarios, e.g. “smart spaces”, system must continually monitor and react to changes in the environment:

requires processing streams of data from sensors, logs, etc.

All require inferencing and learning techniques over dirty and incomplete data.


Collaboration

Synchronization and consistency support

collaboration revolves around a set of shared data

requirements range from unmoderated chat rooms to complete ACID transactions

Also need maintenance of history

  to support asynchronous collaborations
  to support changes in group membership
  must be durable and highly-available.


Two On-going Projects

Two projects currently underway to address some of these issues (both part of “Endeavour”).

Data Centers/Dissemination-Based Info Sys

  Profile-based data management
  includes “data recharging”
  collaboration with Stan Zdonik at Brown and Mitch Cherniack at Brandeis

Telegraph

  adaptive query processing over data streams
  with Joe Hellerstein at UC Berkeley


Data Centers Framework

An architecture that combines data delivery techniques for responsive client access.

3 types of nodes:

  Data sources
  Clients
  Information brokers (can add value)

Any data delivery mode can be used.

Network transparency

Dynamic


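Delivery Options

(Figure: taxonomy of data delivery options, split by Pull vs. Push, then Aperiodic vs. Periodic, then Unicast vs. 1-to-n.)

  Pull, Aperiodic, Unicast: request/response
  Pull, Aperiodic, 1-to-n: request/response w/ snoop
  Pull, Periodic, Unicast: polling
  Pull, Periodic, 1-to-n: polling w/ snoop
  Push, Aperiodic, Unicast: Email lists
  Push, Aperiodic, 1-to-n: publish/subscribe
  Push, Periodic, Unicast: Email list digests
  Push, Periodic, 1-to-n: Broadcast disks, publish/subscribe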


Network Transparency

(Figure labels: Clients, Brokers, Sources.)

The type of a link matters only to nodes on each end.


DBIS Example

An example:

(Figure labels: Server DB; 1-to-n push; Proxy cache (three); Unicast pull (three).)

Can vary dynamically


“Data Recharging” for Weakly Connected Devices

Mobile devices require 2 resources: power and data

It is impractical to be continuously connected to fixed sources of these.

Devices cope with disconnection using caching:

  Power cached in rechargeable batteries
  Data cached in hot-synched memory

Recharging the power is easy…

  Anywhere, Anytime, “Hands-off” operation, Flexible connection duration


Data Recharging – Elevator Pitch

Make recharging data as simple as recharging power:

  Anywhere – no need to connect to your home machine,
  Anytime – no prior arrangements necessary,
  “Hands-off” operation – system knows what you need
  Flexible connection duration – the longer you stay connected, the better your device-resident data gets.


Some Questions

How to know where the user will be?

and do we care? (for context – yes; for staging – ??)

How to know what the user wants?

How to prioritize data delivery?

The answer is User Profiles


“Data Recharging” Profiles

Three main components:

1) Content-based specifications of user interests (read “queries”)

2) Specifications of user priorities/requirements: priority ordering, resolution, freshness, dependencies

3) User Context information – where, when, who, what. This info is available in the user’s PIM data!

Profiles must be both specified explicitly and learned automatically.


First cut at Profile Model

Tasks, sub-tasks, and jobs

  Dependencies and alternatives expressed in a tree
  “Values” assigned and manipulated

Two optimization problems:

  Bounded (known) sync time
  Unknown sync time

Bounded case is an instance of the “precedence-constrained knapsack problem”
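To make the bounded case concrete, here is a minimal sketch of a precedence-constrained knapsack: each item has a transfer time and a value, an item may be sent only if its prerequisite is also sent, and the total time must fit the known sync window. The item names, times, and values are invented for illustration, and the exhaustive search is only meant for tiny examples; it is not the optimizer used in the project.

```python
from itertools import combinations

# name: (transfer_time, value, prerequisite or None) -- all hypothetical numbers
ITEMS = {
    "calendar":     (2, 5, None),
    "meeting_docs": (4, 9, "calendar"),
    "email_heads":  (1, 3, None),
    "email_bodies": (5, 6, "email_heads"),
}


def best_recharge(items, budget):
    """Try every subset; keep the most valuable one that respects
    both the time budget and the precedence constraints."""
    best_value, best_set = 0, frozenset()
    names = sorted(items)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            chosen = frozenset(subset)
            # Precedence: an item is allowed only if its prerequisite is chosen too.
            if any(items[n][2] is not None and items[n][2] not in chosen for n in chosen):
                continue
            time = sum(items[n][0] for n in chosen)
            value = sum(items[n][1] for n in chosen)
            if time <= budget and value > best_value:
                best_value, best_set = value, chosen
    return best_value, best_set


value_long, plan_long = best_recharge(ITEMS, budget=7)
value_short, plan_short = best_recharge(ITEMS, budget=3)
```

With a longer sync window the plan can afford the dependent `meeting_docs` item; with a short one it falls back to the two cheap root items. This mirrors the "flexible connection duration" idea: the longer the connection, the better the delivered set.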

The XFilter system allows us to process millions of standing queries over XML documents.


The challenge is to efficiently and quickly match incoming XML documents against the potentially huge set of user profiles.

XFilter – An XML-Based SDI System

(Figure: system overview. Labels: Data Sources, XML Conversion, XML Documents, Filter Engine, User Profiles, Users, Filtered Data.)


Important XPath Features

Parent/Child (‘/’) and Ancestor/Descendant (‘//’):

/catalog/product//msrp

Wildcards (match any single element):

/catalog/*/msrp

Element Node Filters to further refine the nodes:

Filters can contain nested path expressions

//product[price/msrp < 300]/name

Filter applied to product element node
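The three expressions above can be tried out with Python's standard `xml.etree.ElementTree`, as in the sketch below. The sample catalog is invented for illustration. ElementTree implements only a subset of XPath and has no numeric comparisons, so the `[price/msrp < 300]` filter is emulated with an explicit Python test rather than passed through as a path expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this sketch.
CATALOG = """
<catalog>
  <product>
    <name>Pocket Organizer</name>
    <price><msrp>249</msrp></price>
  </product>
  <product>
    <name>Smartphone</name>
    <price><msrp>499</msrp></price>
  </product>
  <bundle>
    <msrp>99</msrp>
  </bundle>
</catalog>
"""

root = ET.fromstring(CATALOG)  # root is the <catalog> element

# /catalog/product//msrp : msrp at any depth below a product child
deep = [m.text for m in root.findall("./product//msrp")]

# /catalog/*/msrp : msrp exactly one (wildcard) level below catalog
wild = [m.text for m in root.findall("./*/msrp")]

# //product[price/msrp < 300]/name : the nested-path filter is applied
# to each product element node, then the name child is returned.
cheap = [
    p.findtext("name")
    for p in root.iter("product")
    if any(float(m.text) < 300 for m in p.findall("./price/msrp"))
]
```

Note how the wildcard query matches only the `bundle` price, because the product prices sit two levels below `catalog`, while the descendant query reaches them at any depth.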


Architecture

(Figure: architecture diagram. Labels: User Profiles (XPath Queries), XPath Parser, Path Nodes, Profile Info, XML Documents, XML Parser (SAX Based), Element Events, Filter Engine, Query Index, Profile Base, Successful Queries, Successful Profiles & Filtered Data.)

/a//b/c//b/d/*/e/c/*/d//e

/a/b[c/d]/e//d/*/*/e/b/e


XML Parsing and Filtering

Event-based XML Parsing using SAX API

  XML documents are converted to a linear sequence of events that drive the execution of the filter

Callback functions are implemented to deal with the different events:

  Start Element
  Element Data
  End Element
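The three callbacks map directly onto the SAX interface in Python's standard library, which gives a quick way to see the linear event sequence. This is a sketch only: the `EventLogger` class and the depth counter are assumptions added for illustration (the depth stands in for the element level a filter would compare against), not part of XFilter itself.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record each SAX callback as (kind, payload, depth)."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.depth = 0  # nesting level of the current element

    def startElement(self, name, attrs):
        # Start Element event
        self.depth += 1
        self.events.append(("start", name, self.depth))

    def characters(self, content):
        # Element Data event (whitespace-only chunks are skipped)
        if content.strip():
            self.events.append(("data", content.strip(), self.depth))

    def endElement(self, name):
        # End Element event
        self.events.append(("end", name, self.depth))
        self.depth -= 1


handler = EventLogger()
xml.sax.parseString(b"<a><b>hi</b><c/></a>", handler)
structure = [e for e in handler.events if e[0] != "data"]
```

The document is never materialized as a tree; the handler sees one event at a time, which is what lets a filter process documents in a single streaming pass.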


Filter Engine

Tricky aspects of the XPath language:

  Checking the order of elements in the queries
  Handling wildcards and descendant operators
  Evaluating filters that are applied to element nodes (nested path expressions)

Solution:

Convert each XPath query into a Finite State Machine (FSM)

A profile is considered to be satisfied when its final state is reached

Index the states of FSMs for efficient evaluation


FSM Representation

Each element node is a state

A state is represented using a Path Node structure:

Contains information to process current state:

  Compare the level of element name in input document with the level value of the path node
  Evaluate the element node filter if there is any
  Locate next path nodes for the state change in the FSM representation
  Calculate the level values of next states using relative distance values (in terms of levels) stored in the path nodes


Handling Multiple Queries

Hash table based on the element names in the queries

Each node contains two lists of path nodes:

Candidate List: Stores the path nodes that represent current state of each query

Wait List: Stores the path nodes that represent the future states

State transition is represented by promoting a path node from the Wait List to the Candidate List

Initial distribution of path nodes has a significant impact on performance

Key insight for scalable Profile Matching: Index the queries instead of the data


Examples

Queries:

  Q1 = /a/b//c
  Q2 = //b/*/c/d
  Q3 = /*/a/c//d
  Q4 = b/d/e
  Q5 = /a/*/*/c//e

Path nodes for each query:

  Path Node   Query Id   Position   Rel Dist   Level
  Q1-1        Q1         1          NA         1
  Q1-2        Q1         2          1          ?
  Q1-3        Q1         3          NA         -1
  Q2-1        Q2         1          NA         -1
  Q2-2        Q2         2          2          ?
  Q2-3        Q2         3          1          ?
  Q3-1        Q3         1          NA         2
  Q3-2        Q3         2          1          ?
  Q3-3        Q3         3          NA         -1
  Q4-1        Q4         1          NA         -1
  Q4-2        Q4         2          1          ?
  Q4-3        Q4         3          1          ?
  Q5-1        Q5         1          NA         1
  Q5-2        Q5         2          3          ?
  Q5-3        Q5         3          NA         -1


Query Index Construction

(Figure: the Element Hash Table. Each element entry holds a Candidate List (CL) and a Wait List (WL); the path nodes Q1-1 through Q5-3 from the example queries are distributed across these lists.)
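The construction can be sketched in a few lines, using the five example queries. The rules encoded here are inferred from the slide's path node fields (Query Id, Position, Rel Dist, Level) and should be read as assumptions: wildcards add to the relative distance instead of creating a node, a `//` or a relative start leaves the level unconstrained (-1), and the first path node of each query starts on the Candidate List while the rest start on the Wait List. `None` stands in for the slide's "NA" and "?". Element node filters (`[...]`) and a wildcard directly after `//` are not handled.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# One location step: an axis ('/', '//', or nothing for a relative start)
# followed by an element name or the '*' wildcard.
STEP = re.compile(r"(//|/)?([^/]+)")


@dataclass
class PathNode:
    query_id: str
    position: int
    element: str
    rel_dist: Optional[int]  # None plays the role of "NA"
    level: Optional[int]     # None plays the role of "?" (set at run time)

    @property
    def label(self) -> str:
        return f"{self.query_id}-{self.position}"


def decompose(query_id: str, query: str) -> list:
    """Turn one path query into path nodes, one per named element."""
    nodes, wildcards, descendant = [], 0, False
    for axis, name in STEP.findall(query):
        if axis != "/":          # '//' or relative start: depth unconstrained
            descendant = True
        if name == "*":          # wildcards only add to the distance
            wildcards += 1
            continue
        if descendant:
            rel_dist, level = None, -1
        elif not nodes:          # first element of an absolute path
            rel_dist, level = None, 1 + wildcards
        else:
            rel_dist, level = 1 + wildcards, None
        nodes.append(PathNode(query_id, len(nodes) + 1, name, rel_dist, level))
        wildcards, descendant = 0, False
    return nodes


def build_index(queries: dict) -> dict:
    """Hash path nodes by element name into candidate (CL) and wait (WL) lists."""
    index = defaultdict(lambda: {"CL": [], "WL": []})
    for query_id, query in queries.items():
        for node in decompose(query_id, query):
            which = "CL" if node.position == 1 else "WL"
            index[node.element][which].append(node.label)
    return index


QUERIES = {
    "Q1": "/a/b//c",
    "Q2": "//b/*/c/d",
    "Q3": "/*/a/c//d",
    "Q4": "b/d/e",
    "Q5": "/a/*/*/c//e",
}
INDEX = build_index(QUERIES)
```

At run time, a Start Element event for element `x` would look only at `INDEX[x]["CL"]`, check each candidate's level, and promote the successor path node from its Wait List to its Candidate List on a match. That promotion step is left out of this sketch.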


Data Centers - Research Agenda

Profile Definition and Maintenance

Update Storage and Preparation

Efficient integration of "recharge" updates with existing cached data.

Recharge, Trickle Charge, Jump Start...

Consistency Guarantees

Global Data Staging

Approaches will be driven by (mostly PIM) applications.


Telegraph: An Adaptive Dataflow Engine

Dataflow because that’s what data does…

  data streaming from sensors
  real-time processing of streams: update stream, click-stream, swipe-stream, …
  siphon data from the “deep web”
  “continuous queries” for dissemination-based apps

Adaptivity due to volatility…

  sensor nets
  wide area internet
  dynamic caching, replication, and staging
  user-in-the-loop interfaces
  mobile users and devices

Joint work at UC Berkeley with Joe Hellerstein


Wide-area + Wrapped sources: Unpredictability

Sources may be unreachable or slow to respond.

Data delivery may be:

  slower than expected
  bursty
  interrupted

Data statistics/cost estimates may be unavailable or unreliable due to poor interfaces or crossing administrative domains.


User-in-the-loop: Unpredictability

Batch processing is inappropriate for many apps.

  especially when searching the Internet

Must provide feedback to the user as quickly as possible.

Data access becomes a cooperative, iterative approach:

  User may correct/redirect query.
  User may refine/change the query.


Mobility & Data Streams: Unpredictability

Mobility

  Location-centric queries
  Moving endpoints change data staging needs

Data Streams/Sensors

  Varying data arrival rates
  Adapting resolutions
  Push vs. Pull


Some Solutions

Adaptive Query Processing

  Query Scrambling – “Reactive Query Execution”
  XJoin – non-blocking, reactive query operator.
  Eddies – Continuous Query Optimization

Risk-Aware Query Planning

  Producing robust plans or partial plans.

Exploiting Alternative Sources

  Mirrors or “not exactly”.

Relaxing Query Semantics

  Partial, Fuzzy, or Alternative answers


Query Scrambling Example

(Figure: query plan trees over sources A, B, C, D, and E. Panel labels: Initial Plan, Reschedule, New Operators.)


XJoin

Traditional Hash Joins block when one input stalls.

(Figure: hash join diagrams. Labels: Hash Join, Build, Probe, Source A, Source B, Hash Table A, Hash Table B.)

Symmetric Hash Join (SHJ) blocks only if both stall.

XJoin partitions data -> small footprint -> full pipelining & bushy plans -> higher adaptability.
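The non-blocking behavior of the symmetric hash join can be seen in a small in-memory sketch. This illustrates plain SHJ only: it keeps both hash tables entirely in memory and does not attempt XJoin's partitioning or spilling to disk. The arrival format, the function name, and the sample tuples are assumptions made for the example.

```python
from collections import defaultdict


def symmetric_hash_join(arrivals, key_a, key_b):
    """Yield (a_tuple, b_tuple) pairs as soon as both sides of a match have arrived.

    arrivals: iterable of ("A" | "B", tuple) in arrival order.
    """
    table = {"A": defaultdict(list), "B": defaultdict(list)}
    key = {"A": key_a, "B": key_b}
    for side, tup in arrivals:
        other = "B" if side == "A" else "A"
        k = key[side](tup)
        table[side][k].append(tup)              # build into this side's table
        for match in table[other].get(k, []):   # probe the other side's table
            yield (tup, match) if side == "A" else (match, tup)


# Source A delivers one tuple and then stalls while B keeps arriving.
ARRIVALS = [
    ("A", (1, "a1")),
    ("B", (1, "b1")),
    ("B", (2, "b2")),
    ("B", (1, "b3")),
    ("A", (2, "a2")),  # A resumes
]

first = lambda t: t[0]  # join on the first field of each tuple
results = list(symmetric_hash_join(ARRIVALS, first, first))
```

Because every tuple is both inserted into its own table and probed against the other, results for key 1 are produced while A is stalled; a traditional build-then-probe hash join would have to wait for all of A first.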


Eddy – Continuous Optimization

Flow-based (“Rivers”)

Tuples are routed via a ticket-based scheme and back-pressure.

Hellerstein and Avnur 99

(Figure: an Eddy connected to inputs R, S, and T and to two join operators, Join ST and Join RS.)
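A heavily simplified sketch of ticket-based routing follows. To stay short it routes tuples through selection predicates instead of joins and leaves out back-pressure entirely, so it is an illustration of the lottery idea under those assumptions, not the Eddy operator itself. An operator is credited a ticket when it is handed a tuple and debited one when it returns the tuple; operators that drop many tuples therefore accumulate tickets and win the routing lottery more often. The function and predicate names are hypothetical.

```python
import random


def eddy_filter(tuples, predicates, seed=0):
    """Route each tuple through all predicates in a lottery-chosen order.

    predicates: dict mapping operator name -> boolean function.
    Returns (surviving tuples, final ticket counts).
    """
    rng = random.Random(seed)
    tickets = {name: 0 for name in predicates}
    out = []
    for tup in tuples:
        pending = set(predicates)  # operators this tuple has not visited yet
        alive = True
        while pending and alive:
            names = sorted(pending)
            # +1 keeps every weight positive so each operator can still be drawn
            weights = [tickets[n] + 1 for n in names]
            op = rng.choices(names, weights=weights)[0]
            tickets[op] += 1            # operator receives a tuple: credit
            if predicates[op](tup):
                tickets[op] -= 1        # tuple returned to the eddy: debit
                pending.remove(op)
            else:
                alive = False           # tuple dropped; operator keeps the ticket
        if alive:
            out.append(tup)
    return out, tickets


PREDICATES = {
    "even": lambda x: x % 2 == 0,
    "small": lambda x: x < 10,
}
survivors, tickets = eddy_filter(range(100), PREDICATES)
```

The result set is the same whatever order the lottery picks; only the amount of work changes. Each dropped tuple leaves exactly one ticket behind, so the ticket counts record which operator has been doing the eliminating.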


Adaptive Approaches

Increased uncertainty argues for increased adaptivity.

  Wide-area nets and admin domains introduce uncertainty.
  Pesky users introduce uncertainty.
  Mobility and streams introduce uncertainty.

Implications for data-intensive Internet services.

(Figure: a spectrum of adaptivity with the points static plans, late binding, reopt., continuous opt., and anarchy. Labels placed along it: current DBMS; Dynamic, Parametric, Competitive, …; Query Scrambling; XJoin; Eddy; ???.)


Conclusions

We need to build more intelligent systems to protect humans from the data flood, but good old systems performance issues still matter too.

No killer app for Ubiquitous Data Access yet; it may be the killer “user experience”.

Scenarios give us a common (and challenging!) set of requirements for data management: Adaptivity, context-awareness, global-scale, …

The Data Centers and Telegraph projects are addressing key data management technologies for supporting ubiquitous access to data.
