Challenges in Ubiquitous Data Management Michael Franklin UC Berkeley.
Challenges in Ubiquitous Data Management
Michael Franklin, UC Berkeley
© 2000 Michael J. Franklin 2
Ubiquitous Computing
“In ten years, billions of people will be using the Web, but a trillion "gizmos" will also be connected to the Web.” Asilomar Rep. on DB Research, Dec. 1998
You’ve heard it before…
Wireless Internet-enabled devices projected to soon outnumber wired Internet devices.
Many computing devices per person: Smartphones, PDAs, Smartcards, badges, wearables, lightswitches, toasters, …
Ubiquitous Connectivity
Tremendous improvements in Internet backbone bandwidth and reductions in diameter.
Broadband connectivity to the home and office (i.e. the “last mile”) is being solved.
Wireless technologies are enabling anytime-anywhere connectivity.
Ubiquitous Data Access
But, ubiquitous computing and connectivity aren’t worth much without ubiquitous data access.
“Fundamentally, the ability to access all information from anywhere and have ONE unified and synchronized information repository is critical to making appliances useful.” Hambrecht and Quist, iWord, 3/99
Ubiquitous data access will put existing data management techniques to the test, in all aspects – searching, location, reliability, consistency, …
Ubiquitous Data – State of the Art
Everyone uses a database system and/or search engine every day, although they may not realize it (the true test of “ubiquity”).
The Internet and WWW have become a ubiquitous means of global data dissemination and exchange.
  Databases play a crucial but largely invisible role here.
  XML and related standards are enabling increasingly sophisticated interoperation.
Wireless access provides anytime-anywhere access and enables location-centric applications.
Scenarios and Requirements
Real “killer apps” have not yet emerged.
  Many in industry have begun to refer to a “user experience” rather than a particular app.
  Many of these scenarios are quite irritating, e.g. “buy milk now!!!!”
Typical scenarios require three types of functionality:
  Support for mobility – of users and data
  Context awareness – what is the user trying to do?
  Support for collaboration – varied and dynamic groups of people; real-time or asynchronous, …
Demands on Data Management
A key requirement that emerges from all three of these categories is adaptivity:
  movement/availability of data and people
  continually changing contexts
  dynamic groups and interactions
A problem and solution: “user-in-the-loop”:
  people can deal with ambiguity and conflict resolution.
  requires a collaborative and responsive approach to information systems:
    provide fast interactive performance
    quickly respond to user direction.
Mobility
Limited device capabilities:
  storage & CPU, battery power, bandwidth, display, …
  requires adjustment of data delivery to these
Varying and intermittent connectivity
  requires proxies and smart data staging/pre-staging
  requires global access to data
Location-centric applications
  “find open drugstores within two miles of my current location.”
  must be able to deal with locations and distances
  servers must track huge numbers of moving objects
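The drugstore query above combines a distance predicate with an attribute predicate. A minimal Python sketch follows, assuming a hypothetical in-memory list of stores with latitude/longitude fields and great-circle (haversine) distance. The store records and field names are invented for illustration; a server tracking huge numbers of moving objects would need a spatial index rather than this linear scan.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def open_stores_within(stores, here, radius_miles=2.0):
    """Names of stores that are open and within radius_miles of `here`."""
    lat, lon = here
    return [
        s["name"]
        for s in stores
        if s["open"]
        and haversine_miles(lat, lon, s["lat"], s["lon"]) <= radius_miles
    ]


# Hypothetical data: two nearby stores (one closed) and one far away.
STORES = [
    {"name": "near-open", "lat": 37.8800, "lon": -122.2700, "open": True},
    {"name": "near-closed", "lat": 37.8700, "lon": -122.2750, "open": False},
    {"name": "far-open", "lat": 37.7749, "lon": -122.4194, "open": True},
]
HERE = (37.8716, -122.2727)
nearby = open_stores_within(STORES, HERE)
```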
Context Awareness
System must maintain an internal representation of the users’ needs, tasks, roles, preferences, etc.
  requires “user profiles” and models
  some information can be leveraged from PIM apps
In some scenarios, e.g. “smart spaces”, system must continually monitor and react to changes in the environment:
requires processing streams of data from sensors, logs, etc.
All require inferencing and learning techniques over dirty and incomplete data.
Collaboration
Synchronization and consistency support
collaboration revolves around a set of shared data
requirements range from unmoderated chat rooms to complete ACID transactions
Also need maintenance of history
  to support asynchronous collaborations
  to support changes in group membership
  must be durable and highly-available.
Two On-going Projects
Two projects currently underway to address some of these issues (both part of “Endeavour”).
Data Centers/Dissemination-Based Info Sys
  Profile-based data management
  includes “data recharging”
  collaboration with Stan Zdonik at Brown and Mitch Cherniack at Brandeis
Telegraph
  adaptive query processing over data streams
  with Joe Hellerstein at UC Berkeley
Data Centers Framework
An architecture that combines data delivery techniques for responsive client access.
3 types of nodes:
  Data sources
  Clients
  Information brokers (can add value)
Any data delivery mode can be used.
  Network transparency
  Dynamic
Delivery Options
Initiation  Timing     Recipients  Examples
Pull        Aperiodic  Unicast     request/response
Pull        Aperiodic  1-to-n      request/response w/ snoop
Pull        Periodic   Unicast     polling
Pull        Periodic   1-to-n      polling w/ snoop
Push        Aperiodic  Unicast     email lists
Push        Aperiodic  1-to-n      publish/subscribe
Push        Periodic   Unicast     email list digests
Push        Periodic   1-to-n      broadcast disks, publish/subscribe
Network Transparency
[Figure: clients, brokers, and sources connected by links.]
The type of a link matters only to nodes on each end.
DBIS Example
An example:
[Figure: a server DB and several proxy caches, connected by a 1-to-n push link and unicast pull links.]
Can vary dynamically
“Data Recharging” for Weakly Connected Devices
Mobile devices require 2 resources: power and data
It is impractical to be continuously connected to fixed sources of these.
Devices cope with disconnection using caching:
  Power cached in rechargeable batteries
  Data cached in hot-synched memory
Recharging the power is easy…
Anywhere, Anytime, “Hands-off” operation, Flexible connection duration
Data Recharging – Elevator Pitch
Make recharging data as simple as recharging power:
  Anywhere – no need to connect to your home machine,
  Anytime – no prior arrangements necessary,
  “Hands-off” operation – system knows what you need
  Flexible connection duration – the longer you stay connected, the better your device-resident data gets.
Some Questions
How to know where the user will be?
  and do we care? (for context – yes, for staging – ??)
How to know what the user wants?
How to prioritize data delivery?
The answer is User Profiles
“Data Recharging” Profiles
Three main components:
1) Content-based specifications of user interests (read “queries”)
2) Specifications of user priorities/requirements: priority ordering, resolution, freshness, dependencies
3) User Context information – where, when, who, what
   This info is available in the user’s PIM data!
Profiles must be both specified explicitly and learned automatically.
First cut at Profile Model
Tasks, sub-tasks, and jobs
  Dependencies and alternatives expressed in a tree
  “Values” assigned and manipulated
Two optimization problems:
  Bounded (known) sync time
  Unknown sync time
Bounded case is an instance of the “precedence-constrained knapsack problem”
The XFilter system allows us to process millions of standing queries over XML documents
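To make the bounded sync-time case concrete, here is a minimal Python sketch of a precedence-constrained knapsack over a profile tree. It assumes one possible reading of the model: an item may be delivered only if its parent is delivered, each item has an integer transfer cost and a non-negative value, and the sync time is the budget. The `Item` class, the item names, and all numbers are hypothetical; this is an illustration of the problem class, not the project's algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    cost: int    # hypothetical transfer time units
    value: int   # hypothetical, non-negative user value
    children: list = field(default_factory=list)


def best_values(node, budget):
    """best[b] = max value from node's subtree using at most b cost units,
    where a child may be chosen only if its parent is chosen."""
    best = [0] * (budget + 1)
    if node.cost > budget:
        return best  # node cannot be sent, so nothing below it can either
    remaining = budget - node.cost
    inner = [0] * (remaining + 1)  # best value obtainable from the children
    for child in node.children:
        child_best = best_values(child, remaining)
        merged = inner[:]
        for r in range(remaining + 1):
            for c in range(r + 1):  # give c units to this child
                merged[r] = max(merged[r], inner[r - c] + child_best[c])
        inner = merged
    for b in range(node.cost, budget + 1):
        best[b] = node.value + inner[b - node.cost]
    return best


# Hypothetical profile: attachments are only useful if the mail is delivered.
ROOT = Item("profile", 0, 0, [
    Item("mail", 2, 5, [Item("attachments", 3, 4)]),
    Item("news", 2, 3),
])
value_with_4_units = best_values(ROOT, 4)[4]
```

With a budget of 4 units the best plan in this toy instance is mail plus news; with 5 units it becomes mail plus attachments. The unknown sync-time case would instead need an ordering that is good for every prefix, which this sketch does not address.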
XFilter – An XML-Based SDI System
The challenge is to efficiently and quickly match incoming XML documents against the potentially huge set of user profiles.
[Figure: data sources, XML conversion, XML documents, filter engine, user profiles, users, filtered data.]
Important XPath Features
Parent/Child (‘/’) and Ancestor/Descendant (‘//’):
/catalog/product//msrp
Wildcards (match any single element):
/catalog/*/msrp
Element Node Filters to further refine the nodes:
Filters can contain nested path expressions
//product[price/msrp < 300]/name
Filter applied to product element node
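The three XPath features above can be tried out with Python's standard library. This is a sketch under stated assumptions: the catalog document is invented, and `xml.etree.ElementTree` supports only a subset of XPath (no `<` comparisons in predicates), so the filter `[price/msrp < 300]` is evaluated by hand in Python rather than by the XPath engine. Paths are written relative to the `catalog` root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog document.
DOC = """
<catalog>
  <product><name>Radio</name><price><msrp>250</msrp></price></product>
  <product><name>TV</name><price><msrp>900</msrp></price></product>
  <accessory><msrp>15</msrp></accessory>
</catalog>
"""
root = ET.fromstring(DOC)  # root is the <catalog> element

# /catalog/product//msrp : msrp at any depth below a product
descendant_msrp = [e.text for e in root.findall("./product//msrp")]

# /catalog/*/msrp : msrp exactly one wildcard level below catalog
wildcard_msrp = [e.text for e in root.findall("./*/msrp")]

# //product[price/msrp < 300]/name : filter applied to the product node.
# The comparison is done in Python because ElementTree cannot express it.
cheap_names = []
for product in root.iter("product"):
    msrp = product.find("price/msrp")
    if msrp is not None and float(msrp.text) < 300:
        cheap_names.extend(n.text for n in product.findall("name"))
```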
Architecture
[Figure: user profiles (XPath queries), XPath parser, path nodes and profile info, filter engine with query index, XML documents, SAX-based XML parser emitting element events, successful queries, profile base, successful profiles and filtered data.]
Sample profiles shown in the figure:
/a//b/c//b/d/*/e/c/*/d//e
/a/b[c/d]/e//d/*/*/e/b/e
XML Parsing and Filtering
Event-based XML Parsing using SAX API
  XML documents are converted to a linear sequence of events that drive the execution of the filter
  Callback functions are implemented to deal with the different events
    Start Element
    Element Data
    End Element
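The three callbacks map directly onto Python's `xml.sax` handler interface. The sketch below only records the event sequence together with the nesting level, which is the information a filter would consume; the `EventLogger` class and `sax_events` helper are hypothetical names, not part of XFilter.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records (kind, payload, level) for each SAX callback."""

    def __init__(self):
        super().__init__()
        self.level = 0
        self.events = []

    def startElement(self, name, attrs):   # Start Element
        self.level += 1
        self.events.append(("start", name, self.level))

    def characters(self, content):         # Element Data (may arrive in chunks)
        if content.strip():
            self.events.append(("data", content.strip(), self.level))

    def endElement(self, name):            # End Element
        self.events.append(("end", name, self.level))
        self.level -= 1


def sax_events(xml_text):
    """Parse a string and return the linear sequence of events."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


events = sax_events("<catalog><product><msrp>250</msrp></product></catalog>")
```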
Filter Engine
Tricky aspects of the XPath language:
  Checking the order of elements in the queries
  Handling wildcards and descendant operators
  Evaluating filters that are applied to element nodes (nested path expressions)
Solution:
Convert each XPath query into a Finite State Machine (FSM)
A profile is considered to be satisfied when its final state is reached
Index the states of FSMs for efficient evaluation
FSM Representation
Each element node is a state
A state is represented using a Path Node structure:
  Contains information to process current state:
    Compare the level of element name in input document with the level value of the path node
    Evaluate the element node filter if there is any
    Locate next path nodes for the state change in the FSM representation
    Calculate the level values of next states using relative distance values (in terms of levels) stored in the path nodes
Handling Multiple Queries
Hash table based on the element names in the queries
Each node contains two lists of path nodes:
Candidate List: Stores the path nodes that represent current state of each query
Wait List: Stores the path nodes that represent the future states
State transition is represented by promoting a path node from the Wait List to the Candidate List
Initial distribution of path nodes has a significant impact on performance
Key insight for scalable Profile Matching: Index the queries instead of the data
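A simplified Python sketch of this query index follows. It is an illustration under assumptions, not the XFilter implementation: filters are omitted, path nodes for two queries are written out by hand, and the wait list is kept implicitly as the not-yet-active tail of each query's node list. Promotions made while an element is open are undone when that element ends.

```python
from collections import defaultdict

# Hand-built path nodes (element, rel_dist, level) for two example queries.
# rel_dist None means "NA"; level -1 means "any level".
QUERIES = {
    "Q1": [("a", None, 1), ("b", 1, None), ("c", None, -1)],   # /a/b//c
    "Q2": [("b", None, -1), ("c", 2, None), ("d", 1, None)],   # //b/*/c/d
}


def match(events):
    """Return ids of queries satisfied by ('start'|'end', name) events."""
    candidates = defaultdict(list)  # element name -> [(query, position, level)]
    for qid, nodes in QUERIES.items():
        name, _, level = nodes[0]
        candidates[name].append((qid, 0, level))  # first node starts in the CL
    matched, depth, undo = set(), 0, []
    for kind, name in events:
        if kind == "start":
            depth += 1
            promoted = []
            for qid, pos, level in list(candidates[name]):  # iterate a snapshot
                if level not in (-1, depth):
                    continue                       # level check failed
                nodes = QUERIES[qid]
                if pos == len(nodes) - 1:
                    matched.add(qid)               # final state reached
                    continue
                next_name, rel_dist, _ = nodes[pos + 1]
                next_level = -1 if rel_dist is None else depth + rel_dist
                entry = (qid, pos + 1, next_level)
                candidates[next_name].append(entry)  # promote to the CL
                promoted.append((next_name, entry))
            undo.append(promoted)
        else:
            for next_name, entry in undo.pop():    # backtrack on end element
                candidates[next_name].remove(entry)
            depth -= 1
    return matched


def events_for(path):
    """Events for a document that is a single nested chain of elements."""
    return [("start", n) for n in path] + [("end", n) for n in reversed(path)]


both = match(events_for(["a", "b", "x", "c", "d"]))
```

Each incoming element costs one hash lookup plus a scan of that element's candidate list, independent of how many queries mention other elements, which is the point of indexing the queries rather than the data.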
Examples
Q1 = /a/b//c
Q2 = //b/*/c/d
Q3 = /*/a/c//d
Q4 = b/d/e
Q5 = /a/*/*/c//e

Path node  Query Id  Position  Rel Dist  Level
Q1-1       Q1        1         NA        1
Q1-2       Q1        2         1         ?
Q1-3       Q1        3         NA        -1
Q2-1       Q2        1         NA        -1
Q2-2       Q2        2         2         ?
Q2-3       Q2        3         1         ?
Q3-1       Q3        1         NA        2
Q3-2       Q3        2         1         ?
Q3-3       Q3        3         NA        -1
Q4-1       Q4        1         NA        -1
Q4-2       Q4        2         1         ?
Q4-3       Q4        3         1         ?
Q5-1       Q5        1         NA        1
Q5-2       Q5        2         3         ?
Q5-3       Q5        3         NA        -1
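The path-node values in these examples follow a pattern that can be sketched in Python. The rules below are inferred from the five example queries only, so treat them as an assumption: wildcards produce no node but add to distances; a node reached through ‘//’ (or the first node of a non-absolute path) gets Rel Dist NA and Level -1; otherwise the first node gets a fixed level and later nodes get a relative distance with the level left open (“?”). Filters and wildcards adjacent to ‘//’ are not handled.

```python
import re
from typing import NamedTuple, Optional


class PathNode(NamedTuple):
    query_id: str
    position: int
    element: str
    rel_dist: Optional[int]  # None stands for "NA"
    level: Optional[int]     # None stands for "?", -1 for "any level"


def decompose(query_id, xpath):
    """Split a simple XPath (no filters) into path nodes."""
    tokens = re.findall(r"//|/|[^/\s]+", xpath)
    absolute = bool(tokens) and tokens[0] == "/"
    nodes, wildcards, descendant = [], 0, False
    for tok in tokens:
        if tok == "//":
            descendant = True
        elif tok == "/":
            continue
        elif tok == "*":
            wildcards += 1              # no node, but one more level to skip
        else:
            if not nodes:               # first named element
                fixed = absolute and not descendant
                rel, level = None, (1 + wildcards if fixed else -1)
            elif descendant:            # reached through '//'
                rel, level = None, -1
            else:                       # reached through '/' and wildcards
                rel, level = 1 + wildcards, None
            nodes.append(PathNode(query_id, len(nodes) + 1, tok, rel, level))
            wildcards, descendant = 0, False
    return nodes


q5_nodes = decompose("Q5", "/a/*/*/c//e")
```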
Query Index Construction
[Figure: element hash table with entries for a, b, c, d, and e; each entry has a Candidate List (CL) and a Wait List (WL) holding the path nodes Q1-1 through Q5-3.]
CL: Candidate List
WL: Wait List
Data Centers - Research Agenda
Profile Definition and Maintenance
Update Storage and Preparation
  Efficient integration of "recharge" updates with existing cached data.
  Recharge, Trickle Charge, Jump Start...
Consistency Guarantees
Global Data Staging
Approaches will be driven by (mostly PIM) applications.
Telegraph: An Adaptive Dataflow Engine
Dataflow because that’s what data does…
  data streaming from sensors
  real-time processing of streams: update stream, click-stream, swipe-stream, …
  siphon data from the “deep web”
  “continuous queries” for dissemination-based apps
Adaptivity due to volatility…
  sensor nets
  wide area internet
  dynamic caching, replication, and staging
  user-in-the-loop interfaces
  mobile users and devices
Joint work at UC Berkeley with Joe Hellerstein
Wide-area + Wrapped sources Unpredictability
Sources may be unreachable or slow to respond.
Data delivery may be:
  slower than expected
  bursty
  interrupted
Data statistics/cost estimates may be unavailable or unreliable due to poor interfaces or crossing administrative domains.
User-in-the-loop Unpredictability
Batch processing is inappropriate for many apps.
  especially when searching the Internet
Must provide feedback to the user as quickly as possible.
Data access becomes a cooperative, iterative approach:
  User may correct/redirect query.
  User may refine/change the query.
Mobility & Data Streams Unpredictability
Mobility
  Location-centric queries
  Moving endpoints change data staging needs
Data Streams/Sensors
  Varying data arrival rates
  Adapting resolutions
  Push vs. Pull
Some Solutions
Adaptive Query Processing
  Query Scrambling - “Reactive Query Execution”
  XJoin – non-blocking, reactive query operator.
  Eddies – Continuous Query Optimization
Risk-Aware Query Planning
  Producing robust plans or partial plans.
Exploiting Alternative Sources
  Mirrors or “not exactly”.
Relaxing Query Semantics
  Partial, Fuzzy, or Alternative answers
Query Scrambling Example
[Figure: query plans over sources A, B, C, D, and E, showing the initial plan, a rescheduled plan, and a plan with new operators.]
XJoin
Traditional Hash Joins block when one input stalls.
[Figure: a hash join with build and probe phases over Source A and Source B using hash table A, next to a join that keeps hash tables for both A and B.]
Symmetric Hash Join (SHJ) blocks only if both stall.
XJoin partitions data -> small footprint -> full pipelining & bushy plans -> higher adaptability.
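The non-blocking behavior of a symmetric hash join can be sketched in a few lines of Python. This is a minimal in-memory illustration, assuming arrivals are presented as a single interleaved stream of (source, tuple) pairs; it does not include XJoin's partitioning of data to disk. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def symmetric_hash_join(arrivals, key_a, key_b):
    """Yield (a_tuple, b_tuple) matches as soon as both sides have arrived.

    arrivals: iterable of ("A", tuple) or ("B", tuple) in arrival order.
    """
    table_a, table_b = defaultdict(list), defaultdict(list)
    for source, tup in arrivals:
        if source == "A":
            k = key_a(tup)
            table_a[k].append(tup)            # build into A's table
            for other in table_b.get(k, []):  # probe B's table
                yield (tup, other)
        else:
            k = key_b(tup)
            table_b[k].append(tup)            # build into B's table
            for other in table_a.get(k, []):  # probe A's table
                yield (other, tup)


# Hypothetical interleaved arrivals; either source may stall at any time.
ARRIVALS = [
    ("A", (1, "x")),
    ("B", (1, "p")),
    ("B", (2, "q")),
    ("A", (2, "y")),
    ("A", (1, "z")),
]
results = list(symmetric_hash_join(ARRIVALS, lambda t: t[0], lambda t: t[0]))
```

Because every arriving tuple is both inserted and probed, output is produced whenever either input delivers data, whereas a traditional hash join cannot probe until the build input has been fully consumed.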
Eddy – Continuous Optimization
Flow-based (“Rivers”)
Tuples are routed via a ticket-based scheme and back-pressure.
Hellerstein and Avnur 99
[Figure: an eddy routing tuples from sources R, S, and T through Join RS and Join ST.]
Adaptive Approaches
Increased uncertainty argues for increased adaptivity.
  Wide-area nets and admin domains introduce uncertainty.
  Pesky users introduce uncertainty.
  Mobility and streams introduce uncertainty.
Implications for data-intensive Internet services.
[Figure: a spectrum of adaptivity labeled static plans, late binding, reopt., continuous opt., and anarchy, with current DBMS; Dynamic, Parametric, Competitive, …; Query Scrambling; XJoin; Eddy; and “???” placed along it.]
Conclusions
We need to build more intelligent systems to protect humans from the data flood, but good old systems performance issues still matter too.
No killer app for Ubiquitous Data Access yet; may be the killer “user experience”
Scenarios give us a common (and challenging!) set of requirements for data management: Adaptivity, context-awareness, global-scale, …
The Data Centers and Telegraph projects are addressing key data management technologies for supporting ubiquitous access to data.