1 ADBIS06, Data ring1 Data Ring: Let us turn the net into a database! Serge Abiteboul, INRIA-Futurs...
-
date post
22-Dec-2015 -
Category
Documents
-
view
221 -
download
1
Transcript of 1 ADBIS06, Data ring1 Data Ring: Let us turn the net into a database! Serge Abiteboul, INRIA-Futurs...
1
ADBIS06, Data ring 1
Data Ring: Let us turn the net into a database!
Serge Abiteboul, INRIA-Futurs
Neoklis Polyzotis, UC Santa-Cruz
3
ADBIS06, Data ring 3
Success stories after the Internet bubble
Google: management of Web pages
Mapquest: management of maps
Amazone: book catalogue
eBay: product catalogue
Napster (emule, bearshare, etc.): music database
Flickr: picture database
Wikipedia: dictionary They are all aboutpublishing some database
4
ADBIS06, Data ring 4
A trend is towards peer-to-peer
P2P:
A large and varying number of computers cooperate to solve some particular task without any centralized authority
Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network
Examples• seti@home: search for extraterrestrial intelligence • kazaa: obtain free music/video over the net• cabal: decryption of 512 bits RSA code • grub: P2P Web search
5
ADBIS06, Data ring 5
A trend is towards peer-to-peer
The Web is switching from centralized servers to communities and syndication
Buzzwords such as Web 2.0, infoware, annotations, interactivity
Some motivations• Social, organizational: no need to have a leader, more flexible
• Technical: performance, availability, no bottleneck
Analogy: • Software development by very structured and controlled groups of
programmers, e.g., Windows
• Open-source software produced by large communities of autonomous developers, e.g., Linux
6
ADBIS06, Data ring 6
Data ring
Information management in a P2P network
Information is heterogeneous, distributed, replicated and dynamic• Data + knowledge + services
Peers are heterogeneous, autonomous and possibly mobile• Heterogeneous storage, bandwidth, computing power • From sensors to PDA to mainframe
Variety of requirements in terms of number of peers, QoS, performance, security, etc.
We will see motivations and examples
7
ADBIS06, Data ring 7
Acknowledgement – people & projects
Xyleme: Sophie Cluet, Guy Ferran & many others
Scalable XML repository
A start-up in 2000
ActiveXML within EC DbGlobe project: Omar Benjelloun, Ioana Manolescu, Tova Milo & many others
Framework for 2P data management
KadoP within EC Edos project: Ioana Manolescu, Nicoleta Preda & many others
P2P scalable XML repository
8
ADBIS06, Data ring 8
Outline
Introduction
Data ring
XML query processing
P2P XML query processing
Logic and algebra for distributed data management
Automatic and distributed management
Other issues in brief
Conclusion
10
ADBIS06, Data ring 10
Functionalities:DBMS, content management and more
Storage, persistence, replication
Indexing, caching, query, update, optimization
Schema management, access control
Fault tolerance, self tuning, monitoring
Semantic enrichment, information retrieval, uncertain data, result ranking, multi-lingual…
Resource discovery, history, provenance
Each functionality may be achieved by a peer or by the network
This may depend on the nature of the application
11
ADBIS06, Data ring 11
The information in a data ring
Data: tuples, collections, documents, relations, objects…
Services: data sources, possibly some computation (e.g. genome)
Knowledge
• Meta-data about resources: attribute/values pairs, annotations
• Ontologies to explain data and metadata
• Mapping between ontologies (for data integration)
• Views (possibly recursive)
Physical data
• Indices
• Materialized views
• Statistics for query optimization
12
ADBIS06, Data ring 12
And now, what is a peer?
A mainframe database
A file system
Web server
A PC
A PDA
A telephone
A sensor
A home appliance
A car
A manufacturing tool
A telecom equipment
A toy
Another data ring
Any connected device or software with information to share
+ a net address + names for its resources
d@p, s@p’
?
13
ADBIS06, Data ring 13
Example: Edos distribution system
A system for the management of Linux distribution
Joint work with Mandriva Software and U. Tel Aviv
Community of open-source developers: thousands
System releases: about 10 000 software packages + meta data• Package name, author; date, signature, documentation
• Collections of packages, dependencies between packages
Functionalities• Query the metadata (also query subscription)
• Retrieve packages
• Update an existing release
• Publish a new release (based on BitTorrent)
14
ADBIS06, Data ring 14
Main parameters of such applications
Number of peers
Number of documents
How volatile the peers are
The query workload
The update workload
The “network distance” between the peers
Edos: peers and documents in thousands, mostly append for updates, peers not too volatile
An extreme: Google search engine in P2P for billions of documents using millions of volatile peers
15
ADBIS06, Data ring 15
Lots of related work and related systems
Structured P2P nets: Pastry, Chord
Content delivery net: Coral, Akamai
XML repositories: Xyleme, DBMonet
Multicast systems: Avalanche, Bullet
File sharing systems: BitTorrent, Kazaa
Pub/Sub systems: Scribe, Hyper
Distributed storage systems: OceanStore, GoogleFS
Data spaces: Franklin, Halevy
16
ADBIS06, Data ring 16
Standards: The golden triangle of distributed content management on the Web
Standard for data exchange• XML, XML Schema…
• Extensible Markup Language
• Labeled ordered trees
• Foundations: tree automata
Query languages• XPATH, XQuery…
Standards for distributed computing: Web services • SOAP, WSDL, UDDI, BPEL…
• Simple Object Access Protocols
• Corba but simpler and on the Web
SOAPWSDL
XML
XqueryXpath
17
ADBIS06, Data ring 17
Information used to live in islands but it is changing
Different formats: relational, metadata, documents, text, DXF• A Web standard for data exchange, XML, is fixing it • XML captures all kinds of information over a wide spectrum • XML comes with a family of emerging standards: XML schema, XSL/T,
Xquery, domain specific schemas…
Different computers, platforms, languages, applications• A standard for Web services, SOAP, is fixing it• SOAP allows ubiquitous computing on the Internet• SOAP comes with a family of emerging standards: WSDL, UDDI
This provides a uniform access to information… …the dream for distributed data management
18
ADBIS06, Data ring 18
Why XML? Because of its information spectrum
Structured Data
Minimal structure
Meta dataHierarchy +
Books Contracts Catalogs Bank accounts
Emails Financial Reports Insurance Policies
Economical Analysis Derivatives Inventory
Political analysis Insurance Claims
Financial News Sports News Resumes
19
ADBIS06, Data ring 19
A standard for information: XML
Labeled ordered trees where leaves are text
Marriage of document and database worlds
Marriage of full text indexing and structure indexing
Is it the ultimate data model? No
Purely syntax – more semantics needed
Is it OK for now? Definitely yes (because it is a standard)
20
ADBIS06, Data ring 20
The main asset of XML (vs. HTML) = typing
Applications need typing
XML typing: DTD and XML schema
Trees
Semantics and structure are in tags and paths• product-table/product/reference• product-table/product/price
product
designation descriptionprice
reference
product-table
21
ADBIS06, Data ring 21
A standard for distributed computing: Web services
Possibility to activate a method on some remote Web server
Exchange information in XML: input and result are in XML
Ubiquitous XML distributed computing infrastructure
2 main applications• E-commerce• Access to remote data
With XML and Web services, it is possible• To get information from virtually anywhere• To provide information to virtually anywhere
22
ADBIS06, Data ring 22
A standard for XML queries: Xquery (soon a recommendation of W3C)
A “logic” for labeled tree – a declarative language
Inspired by SQL: standard for relation data
Inspired by OQL: standard for object databases• Functional as OQL
• Not as clean as OQL
Mixes structure and content• Give me the documents where the word XML appears in title
• Some full-text extension is coming
Also an update language and full-text capabilities
23
ADBIS06, Data ring 23
XML and Xquery
XML focuses on lists• Represent a very set of objects as a list
• In our context, overhead in terms of index maintenance and query processing
• Here we are mostly concerned with a “set-oriented” view of XML
Xquery is very complicated and powerful• OK for large documents on a server
• Not appropriate for huge volumes of small documents on the Web
• Here we are mostly concerned with tree pattern queries (a subset of XQuery) and full-text search
24
ADBIS06, Data ring 24
So to be honest
We take only what we want in the standard• Ignore order
• Ignore complex features of XQuery
And even that is not what we need…
25
ADBIS06, Data ring 25
Thesis
This is fine for efficient query processing of local XML
This is not sufficient for distributed data management and flexible information integration
What made the success of the relational model, i.e., of tables on a server :1. A logic for defining tables
2. An algebra for describing query plans over tables
By analogy, we need for trees in a P2P system1. A logic for defining distributed data and data services
2. An algebra for describing query plans over these
Proposal: ActiveXML and query optimization around it
26
ADBIS06, Data ring 26
What we do next?
XML processing on a server
XML processing in a P2P system
ActiveXML
Then other data ring functions
28
ADBIS06, Data ring 28
Xyleme – in short
1999: Xyleme research project at INRIA
2000: Creation of a spin-off
2006: About 25 people
Technology: a content warehouse built around a very efficient and scalable XML repository
Example: all articles of Le Monde in XML, financial reports in JP-Morgan
29
ADBIS06, Data ring 29
Xyleme Functionalities
Store & IndexStore & Index
Query ProcessingQuery Processing
View and Semantic View and Semantic stemming, integration, classification…stemming, integration, classification…
WebWeb
GU
I , Web
servi ces , repo
rt ing
…
Feed
ing
Exp
loitin
g
30
ADBIS06, Data ring 30
Xyleme Architecture
XMLstore
Index
Loader| Local | Query
Global Query Manager
Global Query Manager
Application Server Tomcat|
Soap
Corba
Name ServerUser ManagerUrl Manager
Notification Mgr
HTTP | Web Service API
ApplicationsIE/Java/C++/.Net
...
Java/C++ API
or
Or Any Platform
XMLstore
Index
Loader| Local | Query
XMLstore
Index
Loader| Local | Query
Client sideClient side
Server sideServer side
31
ADBIS06, Data ring 31
Structural identifiersand indexing
1
2
3 4
5
6
71
2
3
4
5
6
7
X ancestor of Y <=>pre(X) < pre(Y) andpost(X) > post(Y)
X parent of Y <=>X ancestor of Y andlevel(X) = level(Y) - 1Structural IDs = Prefix-Postfix-Level
A
B
D E
C
F
“John” G
0
1 1
22 2
3
LAN
Put(C;[d,p,6,6,1])
Put(“John”;[d,p,3,1,2])
hash(C)
hash(“John)
32
ADBIS06, Data ring 32
Query evaluation based on Holistic twig joins
(d1, 201, 400)
(d1, 224, 201) (d1, 228, 237)
(d1, 228, 237)
A
DC
“John”
34
ADBIS06, Data ring 34
Moving this to a P2P network – KadoP
Advantages: scaling Disadvantages: complexity
Performance• Optimization of parallelism• Avoid bottleneck• Replication
Availability • Replication
Cost• Avoid the cost of server• Share operational cost
Dynamicity• add/remove new data sources
Performance• Cost for complex queries
• Communication cost
Availability• Peers can leave
Consistency maintenance • Difficult to support transaction
Quality• Difficult to guarantee quality
35
ADBIS06, Data ring 35
Distributed hash tables
Typically on a WAN
Peers come and go
Standard interface • Locate(k):: log(n) messages to fing
the peer in charge of key k • Put(k,v)• Get(k): retrieves all the values for k
We tried Pastry, Chord and JXTA
We now use Pastry
Not satisfactory for the system we want to develop
example: locate, put
DHT
put(k;v2)
hash(k)
get(k)
put(k;v1)
put(k;v3)v1,v2,v3
36
ADBIS06, Data ring 36
Indexing in KadoP
Use structured ID as in Xyleme
Publish them in a DHT
Use Holistic twig join
Main issue: communications • WAN vs. LAN
Long posting lists
Optimization techniques• Use only docID [wisconsin]• Ship smallest list• Semi-join techniques • Intensional indexing
DHT
put(C;[d,p,6,6,1])
put(“John”;[d,p,3,1,2])
hash(C)
hash(“John)
DHT
hash(C)
37
ADBIS06, Data ring 37
Such P2P systems scale… sort of…
If we consider • The Internet
• Millions of peers
• A “large” number of documents per peers
• A “reasonable” query load per peer
This is placing too much load on the network• Lots of queries will result in intersecting posting lists on peers that are
remote, e.g. US and Europe
Besides parallelism, we should use replication intensively
To intersect two long posting lists, I should be most likely to find two replicas that are nearby eachother
Very interesting distributed data management problem
39
ADBIS06, Data ring 39
ActiveXML
Recall the standards of distributed data management: XML, Web services, XQuery
ActiveXML = XML documents with embedded Web service calls where service calls are typically in Xquery
Web service provides:
Intensional data: not explicitly here
Dynamic: a new call may provide new data
SOAPWSDL
XML
XqueryXpath
40
ADBIS06, Data ring 40
ActiveXML = XML + embedded service calls(omitting syntactic details)
<resorts state=‘Colorado’> <resort> <name> Aspen </name> <scond> Unisys.com/snow(“Aspen”)
</scond> <hotels ID=AspHotels > …. Yahoo.com/GetHotels(<city name=“Aspen”/>) </hotels> </resort> …</resorts>
May contain callsto any SOAP web serviceto any ActiveXML web services
<depth unit=“meter”>1</depth>
41
ADBIS06, Data ring 41
Not a new idea in databasesNot a new idea on the Web
Mixing calls to data is an old idea• Procedural attributes in relational systems
• Basis of Object Databases
In HTML world • Sun’s JSP, PHP+MySQL
Call to Web services inside documents• Macromedia MX, Apache Jelly
42
ADBIS06, Data ring 42
What exactly to exchange
A parameter of a call contains some service calls
The result of a call contains some service calls
Do we evaluate these calls before transmitting the data or not
Hi Pierre, what is the phone number of Number 10 in the French soccer team?• Find his name in wc2006.com/France/team.xml
then look in fifa.PhoneBook.org
• Look for Zidane in fifa.PhoneBook.org
• 33 4 13 00 00 01
43
ADBIS06, Data ring 43
When to activate the callExplicit pull mode
• Frequency: Daily, weekly, etc.• After some event: e.g., when another service call completed• This aspect of the problem is related to active databases
Implicit pull mode : Lazy• When the data is requested • Difficulty : detect that the result of a particular request may be
affected by a particular call• This is related to deductive databases
Push mode• E.g., based on a query subscription; the web server pushes
information to the client• E.g., synchronization with an external source• This is related to stream and subscription queries
44
ADBIS06, Data ring 44
ActiveXML peer
Peer-to-peer architecture
Each ActiveXML peer • Repository: manages
ActiveXML data with embedded web service calls
• Web client: uses Web services
• Web server: provides (parameterized) queries/updates over the repository as web services
Open source system
SUN’s Java SDK 1.4 • XML parser• XPath processor, XSLT engine
Apache Tomcat 4.0 servlet engine
Apache Axis SOAP toolkit 1.0
X-OQL query processor• persistent DOM repository
JSP-based user interface• JSTL 1.0 standard tag library
see http://activexml.net
ActiveXMLpeerso
ap
45
ADBIS06, Data ring 45
ActiveXML is a logic for distributed tree management… and also an algebra
ActiveXML extended• Generic services: s@any
• Send and receive (work on multicasting)
• Stream and EoS
ActiveXML is restricted: recall we ignore node ordering
Novelties• Stream of trees vs. set of relations for the relational model
• Distribution: exchange of query plans
• Optimization & evaluation are interleaved
46
ADBIS06, Data ring 46
A simple example
paper
author title
“Ullman” “magic set”
//paper[//author contains “Ullman”][//title contains “magic set”]
Q@any
Ring
Q@any
indexQ@local
Q@any
*
d21@p2 d12@p6
Query plan: evaluate: q locally or ship it to p2,p6
47
ADBIS06, Data ring 47
A simple example - continued
Optimization of the Index computation (indexQ)
send send
get get
get get
“Ullman”
send
Author
send
Title
send
“magic set”
send
Author
“Ullman”
send
“magic set”
get get
get
Title get
send
48
ADBIS06, Data ring 48
Distributed query optimization
This provides an algebra for distributed query processing
Allows interleaving query processing and query optimization
Can interact with local optimizers
Based on XML flow and local XML flow processing
Most techniques that I know of can be captured• Horizontal decomposition, semi-join, replication, magic set…
Main issue to guide optimization is the gathering of statistics
50
ADBIS06, Data ring 50
Context
No centralized server
No information administrator (no info manager)
Most users are non-experts• E.g., scientists
Requirements• Ease of deployment (zero-effort)
• Ease of administration (zero-effort)
• Ease of publication (epsilon-effort)
• Ease of exploitation (epsilon-effort)
• Participation in community building notably via annotations
51
ADBIS06, Data ring 51
What should be made automatic
Self-statistics from the monitoring of the Data ring• In particular, define the statistics that are needed*
Self-tuning based on the self-statistics• Choose the most appropriate organization
• Decide to install access structures: indexes, views, etc.
• Control replication of data and services
Self-healing• Recovery from errors
• E.g., replacement of a failing Web service
And automatic file management
52
ADBIS06, Data ring 52
Automatic file management
Lots of data are in file systems (not managed by a database)• Scientific data often
• Personal data
Need to also optimize access to it• Indexing, efficient filtering, replication, etc.
• Algebra for file access
• Concurrency control, versioning, change control, etc.
53
ADBIS06, Data ring 53
Any hope?
Technology exists (database self-tuning, machine learning, etc.)
But self-tuning for databases has advanced very slowly
Why can this work?
1. There is no alternative (for db, this was just a cool gadget)
2. KISS (keep it simple stupid!)
3. The power of heavy parallelism and heavy replication
Suppose 2 indexes give the best performance; you can afford to ask 20 peers to each try 5 to see what works
This is assuming lots of machine have free cycles (true) and bandwidth is generous (not always true)
55
ADBIS06, Data ring 55
Distributed access control
Goal: Control access to ring resources
Access to resources is based on access rights (ACL)
Who is controlling ACLs?
A node manages ACLs for a collection of distributed resources• Easy but against the spirit and possible bottleneck
The network manages access control• Anybody can get the data• The data is published with encryption and signatures; only nodes with
proper access rights can perform reads/writes – More precisely, one can check whether all updates were valid
• Some techniques exist• Verification
56
ADBIS06, Data ring 56
Surveillance
Monitoring on the ring• Updates of a data collection• Some data streams • Web service calls in a Web server• The execution of some application/query • Some RSS feeds
The output is a stream • A main cause of streams
Many applications• Business intelligence • Self-statistics and tracing• Error diagnosis• Technological watch
PubList
Monitorstreams
Outputstream
MonitorXML
collection
updates
streams
57
ADBIS06, Data ring 57
Streams are everywhere
In query processing
KadoP example
Recursive queries
In messaging, monitoring and continuous query (sub)
That is why we use an algebra over streams of trees and not simply an algebra over trees
59
ADBIS06, Data ring 59
Logic for distributed data management• Opinion: Xquery is a language for local XML management
• Language for distributed query management? ActiveXML?
Algebraic foundation of distributed query optimization• Recent proposal: ActiveXML + send/receive
KadoP and P2P (Active) XML indexing• Now being tested and working on optimization
ActiveXML is open-source – see activexml.net
KadoP soon will be – already available upon request
Edos distribution system will soon be (with Mandriva)
Already achieved
60
ADBIS06, Data ring 60
On-going work
P2P XML processing (with Ioana Manolescu and Nicoleta Preda)
Distributed query optimization (with Ioana Manolescu and Alkis Polyzotis)
Self tuning (joint work with Alkis Polyzotis)
Distributed access control (joint work with Bogdan Cautis)
Monitoring of data rings (with Bogdan Marinoiu and DistribCom group in INRIA-Rennes)
Fuzzy XML processing (with Pierre Sennelart)
Verification of ActiveXML systems and data flow (with same)
61
ADBIS06, Data ring 61
Find your own topic
Take an arbitrary problem for data or knowledge management and look at it in the P2P setting with gigabytes of data and thousands of peers
If not yet interesting, consider terabytes of data and millions of peers.