EDA06 - Content warehouses
Content warehouses around XML and Web services
Serge Abiteboul INRIA-Futurs et LRI-Paris 11
Introduction
Joint works – some participants & projects
Xyleme: Sophie Cluet, Guy Ferran & many others
Acware within Edot project: Benjamin Nguyen, Gabriela Ruberg, Gregory Cobena
Active XML within DbGlobe project: Omar Benjelloun, Ioana Manolescu, Tova Milo & many others
KadoP with Edos project: Ioana Manolescu, Nicoleta Preda & many others
Success stories in the time of the Internet bubble: Information management
Google: management of Web pages
Mapquest: management of maps
Amazon: book catalogue
eBay: product catalogue
Napster (eMule, BearShare, etc.): music database
Flickr: picture database
Wikipedia: dictionary
Even in France:
• Meetic: dating database
• Kelkoo: comparative shopping
The trend is towards peer-to-peer infoware
Why?
The Web is switching from centralized servers to communities and syndication
Buzzwords such as Web 2.0 (?)
Infoware: a class of software whose goal is no longer to process information but to manage it globally, because the volumes keep growing
Analogy:
• Software development by very structured and controlled groups of programmers vs.
• open-source software produced by large communities of autonomous developers
Outline
Introduction
Content warehouse
• Concept
• XML and Web services
• Xyleme
Peer-to-Peer content warehouse
• Concept
• Active XML
• KadoP
Conclusion
Content warehouse
Warehouse
Goal: integrated access to heterogeneous, autonomous, distributed sources of information
Main functionalities: acquire, transform, filter, clean and integrate data, support for queries
Warehouse vs. mediation
• Warehouse: information is acquired in advance
• Mediation: information is acquired when needed
Classical tradeoff between updates and queries
Typically mix of both
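The warehouse/mediation distinction can be illustrated with a minimal Python sketch. The `Warehouse` and `Mediator` classes and the idea of a source as a plain function are assumptions made for illustration, not part of any system named in these slides.

```python
from typing import Callable, Dict, List

# Hypothetical source: a function returning the source's current records.
Source = Callable[[], List[str]]


class Warehouse:
    """Acquires information in advance; queries read the local copy."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self.sources = sources
        self.store: List[str] = []

    def refresh(self) -> None:
        # Acquisition step, run before (and independently of) any query.
        self.store = [r for fetch in self.sources.values() for r in fetch()]

    def query(self, keyword: str) -> List[str]:
        return [r for r in self.store if keyword in r]


class Mediator:
    """Acquires information only when a query needs it."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self.sources = sources

    def query(self, keyword: str) -> List[str]:
        return [r for fetch in self.sources.values()
                for r in fetch() if keyword in r]
```

The tradeoff shows up when a source changes: the mediator sees the change at once but pays the acquisition cost on every query, while the warehouse answers from its copy and only catches up at the next `refresh()`.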
Content warehouse
All kinds of content
• Mail, reports, news, web pages, contacts, catalogs, annotations, etc.
• Text, multimedia, etc.
• Little is numerical (unlike OLAP); some may be mixed, e.g., financial reports
Typically found on the Web and not in relational databases
Content vs. data warehouse
(aspect) | Data warehouse | XML warehouse
Data | relational data, numerical values | XML, text
Enrichment | cleaning | cleaning, classification, semantics…
Integration and view | relations, cube | XML
Query | SQL | XQuery, XSLT
Exploitation | OLAP; statistical tools; report generation | browsing; report generation
XML Warehouse
[Figure: operational data sources feed the XML warehouse; applications exploit it]
Import data from many sources
Add value to it without interfering with operational data
Export integrated views of it
Same as a relational warehouse
The basis of content management
Standard for data exchange
• XML, XML Schema…
• Extensible Markup Language
• Labeled ordered trees
• Foundations: tree automata
Query languages
• XPath, XQuery…
• Foundations: tree automata
• Not perfect, but at least they exist
[Figure: XML, XPath, XQuery, SOAP, WSDL]
Functionalities
[Figure: the Web feeds the warehouse (feeding); the warehouse provides store & index, query processing, and view and semantic (stemming, integration, classification…); it is exploited through GUI, Web services, reporting…]
Functionalities: Feeding
Loading from the Web (Internet and Intranet)
• Web search
• Web crawl
• Access Web data via forms or Web services
Plug-ins to load from
• File systems, document management systems
• Databases, LDAP
• Newsgroups, emails
• Other applications
Extraction and transformation
• XSLT or XQuery mappings for XML sources
• XML-izers to load data from other formats
Monitoring of the feeding
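An "XML-izer" for a non-XML source could look like the following sketch, which turns CSV text into an XML tree using only the Python standard library. The function name `xmlize_csv` and the tag names are hypothetical, and the sketch assumes that every CSV column name is a valid XML element name.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xmlize_csv(text: str, root_tag: str = "records",
               row_tag: str = "record") -> ET.Element:
    """Wrap each CSV row in a <record> element, one child per column."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(text)):
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Assumption: column names are valid XML element names.
            ET.SubElement(record, column).text = value
    return root


if __name__ == "__main__":
    tree = xmlize_csv("name,city\nAspen Lodge,Aspen\n")
    print(ET.tostring(tree, encoding="unicode"))
```

Once the data is XML, the XSLT or XQuery mappings used for native XML sources can be applied to it in the same way.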
Functionalities: More feeding
User feeding
• Document editing
• Meta data editing
Publication
API: SOAP and WebDAV
Functionalities: Storage
Storage of (massive volume of) XML (terabytes)
Indexing of (massive volume of) XML
• By structure
• By full-text
• Linguistic support: multi-language, stemming, synonyms, etc.
Very efficient XML query processing
Importance ranking
Monitoring of the warehouse (support for subscriptions)
Access control and security
Versioning, archiving
Recovery
Possibly transaction mechanism
Functionalities: Enrichment
Global organization
• Global schema management
  – Management of collections
• Incorporate domain ontologies and thesauri
• Document classification
• Cleaning by filtering out documents from collections, etc.
Document enrichment
• Concept extraction and tagging
• Cleaning inside the document
• Summarization, etc.
Relationships between documents
• Tables of contents
• Indexes
• Cross referencing, etc.
Functionalities: View & integration
View management
• Document restructuring/mapping
• Schema to schema mapping
Semantic integration
• Manual for complex ones and (semi-) automatic for simple ones
• Tools to analyze a set of schemas
• Tools to integrate them
• Processing for queries on integration view
Management of virtual data in a mediator style
Functionalities: Exploitation
Access to the warehouse
• Browsing
• Querying by keywords, XPath or XQuery
• Temporal queries
Query subscription
Reporting
• Generation of complex reports with pointers to documents, counts, abstracts…
• Organized by collections, content, domains…
By GUI or from programs (Web service-based API)
Xyleme Content warehouse
Xyleme – in short
1999: Xyleme research project at INRIA
2000: Creation of a spin-off
2006: About 40 people
Technology: a content warehouse built around a very efficient and scalable XML repository
Application example: all articles of Le Monde in XML
Xyleme Functionalities
[Figure: the Web feeds Xyleme (feeding); Xyleme provides store & index, query processing, and view and semantic (stemming, integration, classification…); it is exploited through GUI, Web services, reporting…]
Xyleme Architecture
[Figure: Xyleme architecture.
Client side: applications (IE, Java, C++, .Net, ... or any platform) using the HTTP / Web service API or the Java/C++ API.
Server side: application server (Tomcat), SOAP, CORBA, Global Query Manager, Name Server, User Manager, URL Manager, Notification Manager, and several nodes each with XML store, index, loader and local query.]
Structural identifiers and indexing
Structural IDs = Prefix-Postfix-Level
X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y)
X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1
[Figure: example tree with nodes A to G and the text "John", annotated with prefix and postfix numbers (1 to 7) and levels (0 to 3); on a LAN, Put(C;[d,p,6,6,1]) and Put("John";[d,p,3,1,2]) publish the identifiers under hash(C) and hash("John")]
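The prefix-postfix-level scheme can be sketched in a few lines of Python. The tree shape used in the usage example is an assumption (the slide's figure is not recoverable), node labels are assumed unique, and `assign_ids`, `is_ancestor` and `is_parent` are illustrative names.

```python
from typing import Dict, List, NamedTuple, Tuple


class SID(NamedTuple):
    pre: int    # rank in a pre-order traversal
    post: int   # rank in a post-order traversal
    level: int  # depth, root = 0


Tree = Tuple[str, List["Tree"]]  # (label, children)


def assign_ids(tree: Tree) -> Dict[str, SID]:
    """Number every node in one depth-first traversal."""
    ids: Dict[str, SID] = {}
    counters = {"pre": 0, "post": 0}

    def visit(node: Tree, level: int) -> None:
        label, children = node
        counters["pre"] += 1
        pre = counters["pre"]
        for child in children:
            visit(child, level + 1)
        counters["post"] += 1
        ids[label] = SID(pre, counters["post"], level)

    visit(tree, 0)
    return ids


def is_ancestor(x: SID, y: SID) -> bool:
    return x.pre < y.pre and x.post > y.post


def is_parent(x: SID, y: SID) -> bool:
    return is_ancestor(x, y) and x.level == y.level - 1


if __name__ == "__main__":
    ids = assign_ids(("A", [("B", [("D", []), ("E", [])]), ("C", [("F", [])])]))
    print(is_ancestor(ids["A"], ids["F"]), is_parent(ids["C"], ids["F"]))
```

The point of the encoding is that both tests compare only the two identifiers, so structural relationships can be checked on index entries without touching the documents.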
Query evaluation based on Holistic twig joins
[Figure: a twig pattern over A, C, D and "John", matched against streams of structural identifiers such as (d1, 201, 400), (d1, 224, 201), (d1, 228, 237)]
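As a rough illustration, here is a stack-based ancestor/descendant join over two posting lists. This is a simpler binary building block, not the holistic twig join itself, which matches a whole tree pattern in one pass. The posting layout (document, pre, post) is an assumption read off the triples on the slide, and both lists are assumed sorted by (document, pre).

```python
from typing import List, Tuple

Posting = Tuple[str, int, int]  # assumed layout: (document, pre, post)


def contains(a: Posting, d: Posting) -> bool:
    """True if a is an ancestor of d in the same document."""
    return a[0] == d[0] and a[1] < d[1] and a[2] > d[2]


def structural_join(ancestors: List[Posting],
                    descendants: List[Posting]) -> List[Tuple[Posting, Posting]]:
    """Merge two sorted posting lists, keeping nested ancestors on a stack."""
    result: List[Tuple[Posting, Posting]] = []
    stack: List[Posting] = []
    i = 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i][:2] < d[:2]:
            a = ancestors[i]
            while stack and not contains(stack[-1], a):
                stack.pop()  # that subtree ended before a started
            stack.append(a)
            i += 1
        # Drop candidates whose subtree ended before d.
        while stack and not contains(stack[-1], d):
            stack.pop()
        result.extend((a, d) for a in stack)
    return result


if __name__ == "__main__":
    a_list = [("d1", 201, 400)]
    d_list = [("d1", 224, 201), ("d1", 228, 237)]
    for pair in structural_join(a_list, d_list):
        print(pair)
```

Each list is read once, which matters when posting lists are long; the holistic approach extends this idea to several lists at once so that intermediate results that cannot contribute to a full match are not produced.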
Peer-to-peer content warehouse
The golden triangle of distributed content management on the Web
Standard for data exchange
• XML, XML Schema…
• Extensible Markup Language
• Labeled ordered trees
• Foundations: tree automata
Query languages
• XPath, XQuery…
Standards for distributed computing: Web services
• SOAP, WSDL, UDDI…
• Simple Object Access Protocol
• Like CORBA, but simpler and on the Web
[Figure: XML, XPath, XQuery, SOAP, WSDL]
Peer-to-peer
A large and varying number of computers cooperate to solve some particular task without any centralized authority
Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network
Examples
• SETI@home: search for extraterrestrial intelligence
• Kazaa: obtain free music/video over the net
• cabal: decryption of 512-bit RSA code
• grub: P2P Web search
An XML warehouse in P2P
Warehouse: a very centralized system
P2P: an ultra distributed system (no authority)
P2P warehouse: an oxymoron?
No!
A warehouse: from a logical viewpoint
P2P system: from a physical viewpoint
[Figure: four architectures over data sources.
Centralized mediation: a mediator; P2P mediation: a P2P mediator.
Centralized warehouse: one warehouse (logical & physical); P2P warehouse: a logical warehouse over a physical P2P warehouse.]
P2P XML Warehouse
Data sources and peers are distributed, transient and autonomous
Information is distributed and replicated
Nothing is centralized
• Not the control, storage, indexing…
The machines are “cooperating” with some level of trust to provide the functionalities of an XML warehouse
Advantages
Performance
• Optimization of parallelism
• Avoid bottleneck
• Replication
Availability
• Replication
Cost
• Avoid the cost of server
• Share operational cost
Dynamicity
• add/remove new data sources
Better scaling
Disadvantages
Performance
• Cost for complex queries
• Communication cost
Availability
• Peers can leave
Consistency maintenance
• Difficult to support transactions
Quality
• Difficult to guarantee quality
Centralized vs. distributed data management
Relational DBMS | P2P warehouse
Relations | Active XML
Schema and constraints | Ontologies (including XML Schema)
B-tree, hashing, full-text | AXML indexes & global indexes
Disk pages | AXML persistent store
SQL (query & update) | Query & update (XQuery, WebDAV)
ACL | Network access control
Historical DB | Provenance and history
Triggers | Monitoring
(from distributed DBMS) | Replication and partitioning
Approximation, incompleteness | Same, but even more important
(none) | Discovery of data/services
(none) | Multicasting
Two classes of P2P networks
Unstructured P2P networks
Local exchange: mappings relate content on different peers
Queries are propagated (flooding)
SomeWhere, ...
Structured P2P networks
Content is indexed globally and located via the index
Local content, global access
KadoP, ...
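Query propagation by flooding in an unstructured network can be sketched as a breadth-first traversal bounded by a hop count. The peer names, the `ttl` parameter and the representation of the network as adjacency lists are assumptions for illustration; this does not describe how SomeWhere works internally.

```python
from collections import deque
from typing import Dict, List, Set


def flood_query(neighbors: Dict[str, List[str]],
                content: Dict[str, Set[str]],
                start: str, keyword: str, ttl: int) -> List[str]:
    """Return the peers within `ttl` hops of `start` that hold `keyword`."""
    hits: List[str] = []
    seen = {start}
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if keyword in content.get(peer, set()):
            hits.append(peer)
        if remaining == 0:
            continue  # hop budget exhausted, do not forward further
        for other in neighbors.get(peer, []):
            if other not in seen:
                seen.add(other)
                queue.append((other, remaining - 1))
    return hits


if __name__ == "__main__":
    net = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    data = {"c": {"xml"}, "d": {"xml"}}
    print(flood_query(net, data, "a", "xml", ttl=2))
```

With a small hop budget some answers are missed, and with a large one every peer is contacted; a structured network avoids this by locating content through a global index.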
ActiveXML: A framework for distributed data management
Active XML
The standards of distributed data management
Active XML = XML documents with embedded Web service calls where service calls are typically in Xquery
Intensional & Dynamic
This is not a new idea
• Procedural attributes in relational systems
• Basis of Object Databases
• Sun's JSP, PHP+MySQL, Apache Jelly…
[Figure: XML, XPath, XQuery, SOAP, WSDL]
Active XML = XML + embedded service calls (omitting syntactic details)
<resorts state="Colorado">
  <resort>
    <name> Aspen </name>
    <scond> Unisys.com/snow("Aspen") </scond>
    <hotels ID="AspHotels"> …. Yahoo.com/GetHotels(<city name="Aspen"/>) </hotels>
  </resort>
  …
</resorts>
May contain calls
• to any SOAP Web service
• to any AXML Web service (to be defined)
<depth unit=“meter”>1</depth>
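A minimal sketch of how embedded calls might be materialized: each call element is replaced, in place, by the elements its service returns. The `<call service="…" arg="…"/>` syntax is hypothetical (the slide omits the real syntax), and local Python functions stand in for remote Web services.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Hypothetical stand-in for a remote service: argument -> result elements.
Service = Callable[[str], List[ET.Element]]


def snow(resort: str) -> List[ET.Element]:
    """Fake snow-condition service returning a fixed depth."""
    depth = ET.Element("depth", unit="meter")
    depth.text = "1"
    return [depth]


def materialize(node: ET.Element, services: Dict[str, Service]) -> None:
    """Replace every <call> below `node` by the result of its service."""
    # Walk a snapshot in reverse so earlier indexes stay valid while editing.
    for index, child in reversed(list(enumerate(node))):
        if child.tag == "call":
            results = services[child.get("service")](child.get("arg"))
            node.remove(child)
            for offset, result in enumerate(results):
                node.insert(index + offset, result)
        else:
            materialize(child, services)


if __name__ == "__main__":
    doc = ET.fromstring(
        '<resort><name>Aspen</name>'
        '<scond><call service="snow" arg="Aspen"/></scond></resort>')
    materialize(doc, {"snow": snow})
    print(ET.tostring(doc, encoding="unicode"))
```

This sketch evaluates every call eagerly and does not re-examine results that themselves contain calls; choosing which calls to evaluate, and when, is a separate design decision.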
Not a new idea in databases
Not a new idea on the Web
Mixing calls to data is an old idea
• Procedural attributes in relational systems
• Basis of Object Databases
In HTML world
• Sun's JSP, PHP+MySQL
Call to Web services inside documents
• Macromedia MX, Apache Jelly
What exactly to exchange
A parameter of a call contains some service calls
The result of a call contains some service calls
Do we evaluate these calls before transmitting the data or not?
Hi John, what is the phone number of the CEO of INRIA?
• (33 1) 39 66 00 01
• Look up Michel Cosnard in the INRIA directory
• Find his name at www.inria.fr, then look in the directory
When to activate the call
Explicit pull mode
• Frequency: daily, weekly, etc.
• After some event: e.g., when another service call has completed
• This aspect of the problem is related to active databases
Implicit pull mode: lazy
• When the data is requested
• Difficulty: detect that the result of a particular request may be affected by a particular call
• This is related to deductive databases
Push mode
• E.g., based on a query subscription; the web server pushes information to the client
• E.g., synchronization with an external source
• This is related to stream and subscription queries
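The three activation modes can be contrasted in a small sketch. The `ServiceCall` class and its method names are assumptions for illustration and are not the Active XML API; push is approximated here by local callbacks rather than a server notifying remote clients.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class ServiceCall(Generic[T]):
    """A call whose result can be pulled explicitly, pulled lazily, or pushed."""

    def __init__(self, fetch: Callable[[], T]) -> None:
        self._fetch = fetch
        self._value: Optional[T] = None
        self._fresh = False
        self._subscribers: List[Callable[[T], None]] = []

    def refresh(self) -> T:
        """Explicit pull: e.g. run on a schedule or after some event."""
        self._value = self._fetch()
        self._fresh = True
        for notify in self._subscribers:
            notify(self._value)  # push: subscribers receive each new result
        return self._value

    def value(self) -> T:
        """Implicit (lazy) pull: the call runs only when data is requested."""
        return self._value if self._fresh else self.refresh()

    def subscribe(self, callback: Callable[[T], None]) -> None:
        self._subscribers.append(callback)


if __name__ == "__main__":
    readings = iter([1, 2])
    depth = ServiceCall(lambda: next(readings))
    depth.subscribe(lambda v: print("pushed:", v))
    print("lazy:", depth.value())
    print("explicit:", depth.refresh())
```

The sketch sidesteps the hard part of lazy evaluation noted above: deciding, for a given request, which calls can affect its result.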
Active XML peer
Peer-to-peer architecture
Each Active XML peer
• Repository: manages Active XML data with embedded web service calls
• Web client: uses Web services
• Web server: provides (parameterized) queries/updates over the repository as web services
Open source system
Sun's Java SDK 1.4
• XML parser
• XPath processor, XSLT engine
Apache Tomcat 4.0 servlet engine
Apache Axis SOAP toolkit 1.0
X-OQL query processor
• persistent DOM repository
JSP-based user interface
• JSTL 1.0 standard tag library
see http://activexml.net
[Figure: AXML peers communicating via SOAP]
KadoP: a P2P system for sharing content
KadoP model
Data: XML Document; views; Active XML; Web services
Simple semantics: Concepts, namespaces, DTDs, isA, partOf, relatedTo, context documents (for services)
Queries: tree pattern query with join
KadoP
• XML data distributed in the P2P network
• Index is distributed via a DHT
• Goal: Efficient processing of terabytes of XML with no centralized authority
Distributed hash tables
Typically on a WAN
Peers come and go
Small number of messages to "locate" the peer in charge of key k: O(log n)
Standard interface: put, get
We tried Pastry, Chord and JXTA
We now use Pastry
[Figure: DHT; put(k;v1), put(k;v2) and put(k;v3) are routed via hash(k) to the peer in charge of k, and get(k) returns v1, v2, v3]
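The put/get interface can be sketched with a toy, single-process hash ring. This is not Pastry: there is no routing, so the O(log n) message bound is not modeled, and the `ToyDHT` class and peer names are assumptions. It only shows how hashing a key designates the peer in charge and how several puts on one key accumulate.

```python
import bisect
import hashlib
from typing import Dict, List


def ring_position(name: str) -> int:
    """Map a peer name or a key to a position on the hash ring."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class ToyDHT:
    def __init__(self, peer_names: List[str]) -> None:
        self.ring = sorted((ring_position(p), p) for p in peer_names)
        # Per-peer local storage: key -> list of published values.
        self.storage: Dict[str, Dict[str, List[str]]] = {p: {} for p in peer_names}

    def peer_in_charge(self, key: str) -> str:
        """First peer at or after hash(key) on the ring, wrapping around."""
        positions = [pos for pos, _ in self.ring]
        index = bisect.bisect_left(positions, ring_position(key)) % len(self.ring)
        return self.ring[index][1]

    def put(self, key: str, value: str) -> None:
        self.storage[self.peer_in_charge(key)].setdefault(key, []).append(value)

    def get(self, key: str) -> List[str]:
        return list(self.storage[self.peer_in_charge(key)].get(key, []))


if __name__ == "__main__":
    dht = ToyDHT(["peer1", "peer2", "peer3"])
    for value in ("v1", "v2", "v3"):
        dht.put("k", value)
    print(dht.peer_in_charge("k"), dht.get("k"))
```

In a real deployment peers come and go, so the ring changes over time and keys must be handed over; the sketch keeps the set of peers fixed.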
Indexing in KadoP
Use structured ID as in Xyleme
Publish them in a DHT
Use Holistic twig join
Main issue: communications
• WAN vs. LAN
Long posting lists
Optimization techniques
• Use only docID [wisconsin]
• Ship smallest list
• Semi-join techniques
• Intensional indexing
[Figure: DHT; put(C;[d,p,6,6,1]) and put("John";[d,p,3,1,2]) are routed via hash(C) and hash("John")]
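Two of the optimizations listed above, using only document IDs first and starting from the smallest list, can be sketched as a pre-filter that decides which postings are worth shipping. The posting layout (document, pre, post, level) is a simplifying assumption, the index is a plain dictionary rather than a DHT, and the function names are illustrative; this is not KadoP's actual algorithm.

```python
from typing import Dict, List, Set, Tuple

Posting = Tuple[str, int, int, int]  # assumed: (document, pre, post, level)


def candidate_docs(index: Dict[str, List[Posting]], terms: List[str]) -> Set[str]:
    """Documents containing every term, intersecting shortest lists first."""
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    docs = {p[0] for p in index.get(ordered[0], [])}
    for term in ordered[1:]:
        docs &= {p[0] for p in index.get(term, [])}
    return docs


def postings_to_ship(index: Dict[str, List[Posting]],
                     terms: List[str]) -> Dict[str, List[Posting]]:
    """Keep only postings from documents that can still match the query."""
    docs = candidate_docs(index, terms)
    return {t: [p for p in index.get(t, []) if p[0] in docs] for t in terms}


if __name__ == "__main__":
    idx = {"C": [("d1", 6, 6, 1), ("d2", 4, 4, 1)], "John": [("d1", 3, 1, 2)]}
    print(postings_to_ship(idx, ["C", "John"]))
```

Document IDs are much smaller than full structural identifiers, so intersecting them first reduces what has to cross the network before the structural join runs.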
Conclusion
Ongoing work
AXML and distributed data management on the Web
• Opinion: XQuery is a language for local XML management
• Language for distributed query management
• Active XML?
• What else?
Foundation of distributed query optimization
• Recent proposal: AXML + send/receive
KadoP and P2P (Active) XML indexing
• Now being tested; optimization work in progress
ActiveXML is open source; see activexml.net
KadoP soon will be; it is already available upon request
Application: distribution of open-source software (with Mandriva)
Other issues for turning the network into a scalable database
Take an arbitrary problem for data or knowledge management and look at it in the P2P setting with Gigabytes of data
Examples
• Self tuning (joint work with Alkis Polyzotis)
• Semantic integration (lots of work in Gemo)
• Distributed access control (joint work with Bogdan Cautis)
• Monitoring (joint work with DistribCom group in INRIA-Rennes)
Advertisement
Launch of webContent
An RNTL platform
Web data warehouse for monitoring
EADS, Thales, Bongrain, Xyleme, Exalead, NewPhoenix
Looking for young engineers to work on webContent
Thank you