XML Distributed Retrieval

Emiran Curtmola @ UCSD
K. K. Ramakrishnan @ AT&T
Alin Deutsch @ UCSD
Divesh Srivastava @ AT&T

04/19/2023 2

Motivation

Democratization of data creation on the web
▪ Easy to create and publish data

Self-organization in online communities
▪ Easy to form online communities in an ad-hoc fashion
▪ Members create, publish and share data items

Need to query the overall community data collection (all the published data)

“The virtual newspaper” community

Newspapers and the advertised data items about each publisher’s articles:

P1  San Diego, San Francisco, stocks, food, weather
P2  San Diego, gold, New York, food
P3  San Diego, San Francisco, gold, New York, weather
P4  San Diego, fire, gold, stocks, weather
P6  fire, San Francisco, stocks, weather
P7  fire, gold, stocks, weather
P8  fire, gold, stocks, weather

Query Q1: find the articles talking about fire in San Diego
Query Q2: find the articles about San Francisco
Query Q3: find the articles talking about food
Query Q4: find the articles that give the weather in New York

[Figure: publishers P1-P8, each holding its own local data; together these form the community data collection]

How can the community data collection be queried efficiently?

State-of-the-Art in Querying

Topic-based approach

Users create static topics
▪ a topic is a rendezvous point between consumers and publishers
▪ consumers subscribe (query) to topics of interest
▪ publishers classify content into topics

Limitation
▪ Consumer interests cannot be specified at a very fine granularity (that would require too many topics), e.g., “news about fire damage when more than 1,000 people impacted and related to Santa Ana conditions in San Diego county, and information about related government relief efforts underway”

Ad-hoc Querying

Query Q1: find the articles talking about fire in San Diego

[Figure: the local data of publishers P1-P8 is gathered at a central site, which holds the community data collection as global data]

Content-based approach: queries are evaluated on the actual content
▪ E.g., search engines, hosted online communities

Limitations of Centralized Approach

The centralized approach separates publishers from consumers through a centralized authority
▪ Publishers need to give up their data, which goes against the idea of a community of autonomous members
▪ Publishers cannot know who is interested in their data and who accesses it

Insufficient timeliness
▪ Freshness of data depends on crawling frequency

Decentralized Approach: Move Queries Instead of Data

Query Q1: find the articles talking about fire in San Diego

[Figure: Q1 is sent to the publishers P1-P8, each of which keeps its own local data; together these form the community data collection. The table of advertised data items from “The virtual newspaper” community slide is shown again.]

Our Goal for Querying

Data resides with the publisher
▪ Publishers maintain complete control over who accesses their data

Consumers can send ad-hoc queries over the content of the community data collection

Challenges

Distributed nature of the data among publishers
▪ Data is not materialized globally; it resides with each publisher

Large number of decentralized publishers and consumers
▪ Publishers: “whom to tell” among the host of potential consumers?
▪ Consumers: “whom to ask” among the myriad of available publishers?
▪ Avoid flooding the network

Proposal for Query Dissemination

The community setup
▪ Network of logical routers as infrastructure for the community
▪ Publishers connect to this network at the edge

Build an overlay network to act as a distributed index structure
▪ Routers are organized into a network called a query dissemination tree (QDT)

Use the QDT to disseminate queries
▪ Queries are always posed at the root
▪ Queries are forwarded by routers to relevant publishers based on summary information: every node contains a summary of the data stored in its subtrees

A Query Dissemination Tree (QDT)

Only the overlay connections between the nodes of QDT are shown

P1’s advertised set of terms: San Diego, San Francisco, stocks, food, weather

P2’s advertised set of terms: San Diego, gold, New York, food

Node 3’s summary (set of terms): San Diego, San Francisco, stocks, food, weather, gold, New York

[Figure: a QDT with router 1 at the root, routers as internal nodes and publishers P1-P8 at the leaves]

Legend: numbered nodes are routers; P denotes a publisher. A node’s summary is the union of its subtrees’ summaries.
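The rule that a node's summary is the union of its subtrees' summaries can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `build_summaries`, the dictionary layout, and the assumption that router 3 has exactly P1 and P2 below it are hypothetical choices made for the example.

```python
def build_summaries(children, advertised, root):
    """Compute a term-set summary for every node in the subtree of `root`.

    children:   maps a router to the list of its child nodes (hypothetical layout)
    advertised: maps a publisher to the set of terms it advertises
    """
    summaries = {}

    def summarize(node):
        # A publisher contributes its own advertised terms.
        summary = set(advertised.get(node, ()))
        # A router's summary is the union of its subtrees' summaries.
        for child in children.get(node, ()):
            summary |= summarize(child)
        summaries[node] = summary
        return summary

    summarize(root)
    return summaries


advertised = {
    "P1": {"San Diego", "San Francisco", "stocks", "food", "weather"},
    "P2": {"San Diego", "gold", "New York", "food"},
}
# Assumption: router 3 has exactly P1 and P2 below it.
children = {"3": ["P1", "P2"]}
summaries = build_summaries(children, advertised, root="3")
print(sorted(summaries["3"]))
```

Under that assumption the computed summary for router 3 contains the same seven terms as the summary listed on the slide.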

XML Content Descriptors (CDs)

An XML document D is described (imperfectly) by a set of content descriptors, CD(D)

A query Q is also described by a set of CDs, CD(Q)

To estimate whether Q has a match against D, we check CD(Q) ⊆ CD(D)
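The inclusion check can be written directly as a subset test on Python sets. The helper name `may_match` and the sample descriptor set are assumptions made for this sketch. A positive answer is only an estimate, because CDs describe a document imperfectly, so the full query still has to be evaluated on the document itself.

```python
def may_match(cd_query, cd_document):
    """Estimate whether the query can match the document: CD(Q) is a subset of CD(D)."""
    return set(cd_query) <= set(cd_document)


# Hypothetical descriptor set of one document, using plain keywords as CDs.
cd_d = {"San Diego", "fire", "Jupiter", "ReutersNews", "reuters.com"}

print(may_match({"fire", "San Diego"}, cd_d))  # True
print(may_match({"fire", "New York"}, cd_d))   # False
```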

Representing Documents Using CDs

Sample XML article published by P1:

rss
  channel
    editor: Jupiter
    item
      title: ReutersNews
      link: reuters.com
    description: San Diego, fire …

CDs can be

• all simple keywords:
  San Diego, fire, Jupiter, ReutersNews, reuters.com

• keywords with full path from root:
  /rss/channel/description/San Diego
  /rss/channel/description/fire
  /rss/channel/editor/Jupiter
  /rss/channel/item/title/ReutersNews
  /rss/channel/item/link/reuters.com

• keywords with only last tag on path:
  description/San Diego
  description/fire
  editor/Jupiter
  title/ReutersNews
  link/reuters.com

• etc.
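The three kinds of CDs can be derived from a document with Python's standard `xml.etree.ElementTree` module. The sketch below is an illustration under stated assumptions: the XML text is a reconstruction of the sample article, the function names and the `mode` parameter are hypothetical, and keywords are assumed to be comma-separated inside element text (the slides do not say how keywords are tokenized).

```python
import xml.etree.ElementTree as ET

# Reconstruction of the sample article (assumed markup).
SAMPLE = """<rss><channel>
  <editor>Jupiter</editor>
  <item><title>ReutersNews</title><link>reuters.com</link></item>
  <description>San Diego, fire</description>
</channel></rss>"""


def keywords(text):
    # Assumption: keywords are the comma-separated pieces of the element text.
    return [k.strip() for k in (text or "").split(",") if k.strip()]


def content_descriptors(xml_text, mode="keyword"):
    """Return the set of CDs for a document.

    mode: "keyword"   -> plain keywords
          "full_path" -> keywords prefixed with the full path from the root
          "last_tag"  -> keywords prefixed with the last tag on the path
    """
    if mode not in ("keyword", "full_path", "last_tag"):
        raise ValueError("unknown mode: " + mode)
    cds = set()

    def walk(elem, path):
        path = path + [elem.tag]
        for kw in keywords(elem.text):
            if mode == "keyword":
                cds.add(kw)
            elif mode == "full_path":
                cds.add("/" + "/".join(path) + "/" + kw)
            else:
                cds.add(elem.tag + "/" + kw)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return cds


for mode in ("keyword", "full_path", "last_tag"):
    print(mode, sorted(content_descriptors(SAMPLE, mode)))
```

Longer descriptors such as full paths make a query more selective, at the price of larger summaries.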

Query Routing in a QDT

Q3=<food>

[Figure: Q3 is posed at the root of the QDT and forwarded hop by hop only along the branches that lead to P1 and P2]

Only P1 and P2 publish articles about food

At each node, check set inclusion: is the query contained in the node’s summary?

Node summaries are represented as Bloom filters
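Routing by set inclusion can be sketched as a recursive walk that forwards a query only to children whose summary contains all query terms. This is a minimal illustration, not the authors' code: the `Node` class, the function names, and the small tree shape (routers r1 to r4) are hypothetical, and exact Python sets stand in for the Bloom filters. The advertised terms are those of the example community.

```python
class Node:
    """A router (with children) or a publisher (leaf with advertised terms)."""

    def __init__(self, name, children=(), terms=()):
        self.name = name
        self.children = list(children)
        # Summary = own terms plus the union of the children's summaries.
        self.summary = set(terms)
        for child in self.children:
            self.summary |= child.summary


def route(node, query, visited):
    """Forward `query` below `node`; return the publishers that are reached."""
    visited.append(node.name)
    if not node.children:
        # A publisher still has to check the full query on its local data.
        return [node.name]
    reached = []
    for child in node.children:
        if query <= child.summary:  # set inclusion: query into child's summary
            reached += route(child, query, visited)
    return reached


def disseminate(root, query):
    """Pose `query` at the root; return (reached publishers, visited nodes)."""
    visited = []
    reached = route(root, query, visited) if query <= root.summary else []
    return reached, visited


p1 = Node("P1", terms={"San Diego", "San Francisco", "stocks", "food", "weather"})
p2 = Node("P2", terms={"San Diego", "gold", "New York", "food"})
p3 = Node("P3", terms={"San Diego", "San Francisco", "gold", "New York", "weather"})
p4 = Node("P4", terms={"San Diego", "fire", "gold", "stocks", "weather"})
p6 = Node("P6", terms={"fire", "San Francisco", "stocks", "weather"})
p7 = Node("P7", terms={"fire", "gold", "stocks", "weather"})
p8 = Node("P8", terms={"fire", "gold", "stocks", "weather"})

# Hypothetical tree shape; the real QDT in the figure has more routers.
root = Node("r1", children=[
    Node("r2", children=[p1, p2, p3]),
    Node("r3", children=[p4]),
    Node("r4", children=[p6, p7, p8]),
])

print(disseminate(root, {"food"}))
print(disseminate(root, {"fire", "San Diego"}))
```

For Q3 the walk touches only r1, r2, P1 and P2; the subtrees under r3 and r4 are pruned because their summaries do not contain food.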

Traffic Congestion at Top of QDT

The tree topology introduces congestion during query dissemination

[Figure: the QDT while queries are routed; the top of the tree is marked as the bottleneck]

Bottleneck (the load decreases from root to leaves due to filtering)

Routing a query vs. routing a query workload
• non-zero time to process a query at a node

How to relieve the congestion?

Techniques for Load Balancing

Overlaying multiple logical QDTs over the same underlay network
▪ a node belongs to multiple QDTs, but at different levels

Goal: organize the nodes into QDTs such that the distribution of tree levels for a node is uniform across the QDTs

Overlaying Multiple QDTs

[Figure sequence: QDT1, then QDT2, then all four trees QDT1-QDT4 overlaid on the same underlay network of routers and publishers P1-P8; router 1 is marked in each tree]

Query Routing for Multiple QDTs

Partition the community data collection (set of CDs) into blocks

Build one QDT per block
▪ QDTi groups all publishers with CDs in Bi

Routing a query
▪ Terms in the query determine the relevant blocks
▪ Send the query to the corresponding QDT
▪ Check the full query against the publishers’ storage

Block  CDs
B1     San Diego, fire
B2     San Francisco, gold
B3     New York, stocks
B4     food, weather

Example of routing Q3=<food>
▪ Q3 falls in B4, so use QDT4 (the QDT for B4)
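The block lookup can be sketched as a dictionary from terms to blocks. The block contents are taken from the example; the names `BLOCKS`, `TERM_TO_BLOCK` and `relevant_qdts` are hypothetical and only serve this illustration.

```python
# Partition of the example CDs into blocks; QDTi is built for block Bi.
BLOCKS = {
    "B1": {"San Diego", "fire"},
    "B2": {"San Francisco", "gold"},
    "B3": {"New York", "stocks"},
    "B4": {"food", "weather"},
}

# Inverted index: which block does each term belong to?
TERM_TO_BLOCK = {term: block for block, terms in BLOCKS.items() for term in terms}


def relevant_qdts(query):
    """Return the sorted names of the QDTs whose block holds at least one query term."""
    blocks = {TERM_TO_BLOCK[term] for term in query if term in TERM_TO_BLOCK}
    return sorted("QDT" + block[1:] for block in blocks)


print(relevant_qdts({"food"}))               # ['QDT4']
print(relevant_qdts({"fire", "San Diego"}))  # ['QDT1']
print(relevant_qdts({"New York", "weather"}))  # ['QDT3', 'QDT4']
```

A query whose terms fall into several blocks yields several candidate trees, of which only one needs to be used for routing.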

Relieving the Congestion

[Figure: the four trees QDT1-QDT4; Q3=<food> and Q1=<fire, San Diego> are routed on different trees]

Queries Spanning Multiple Blocks

Q4=<New York, weather>
▪ New York is in B3 (QDT3), weather is in B4 (QDT4)

Route Q4 on both trees?
▪ NO: this generates redundant traffic, and therefore more messages
▪ Routing on both trees can touch the same nodes
▪ We show that it suffices to send the query to either one of the trees

Routing Alternatives

Routing Q4=<New York, weather>

[Figure: Q4 routed by <New York> on QDT3 versus Q4 routed by <weather> on QDT4]

Check all the query terms at each publisher!

Ideally, route by the most selective term

In practice this is not possible, so use informed routing
▪ keep track of popular CDs
▪ avoid routing by CDs with low selectivity (popular CDs)
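Informed routing can be sketched as follows: track a small set of popular CDs and route a multi-term query by a term that is not in that set. The sketch is an assumption-laden illustration, not the authors' method: popularity is approximated by the number of example publishers advertising a term, and `popular_terms`, `pick_routing_term` and the `top_k` parameter are hypothetical names.

```python
from collections import Counter

ADVERTISED = {
    "P1": {"San Diego", "San Francisco", "stocks", "food", "weather"},
    "P2": {"San Diego", "gold", "New York", "food"},
    "P3": {"San Diego", "San Francisco", "gold", "New York", "weather"},
    "P4": {"San Diego", "fire", "gold", "stocks", "weather"},
    "P6": {"fire", "San Francisco", "stocks", "weather"},
    "P7": {"fire", "gold", "stocks", "weather"},
    "P8": {"fire", "gold", "stocks", "weather"},
}

TERM_TO_QDT = {
    "San Diego": "QDT1", "fire": "QDT1",
    "San Francisco": "QDT2", "gold": "QDT2",
    "New York": "QDT3", "stocks": "QDT3",
    "food": "QDT4", "weather": "QDT4",
}


def popular_terms(advertised, top_k):
    """Keep only the top_k most widely advertised terms (the tracked state)."""
    counts = Counter(term for terms in advertised.values() for term in terms)
    return {term for term, _ in counts.most_common(top_k)}


def pick_routing_term(query, popular):
    """Prefer a query term that is not known to be popular."""
    candidates = [term for term in query if term not in popular]
    # Fall back to any query term if all of them are popular.
    return sorted(candidates or query)[0]


popular = popular_terms(ADVERTISED, top_k=3)
term = pick_routing_term({"New York", "weather"}, popular)
print(term, "->", TERM_TO_QDT[term])
```

With these example counts, weather is advertised by six publishers and New York by two, so Q4 is routed by New York on QDT3.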

Discussion: The Design Space

How many query dissemination trees?

1 tree for all published terms
▪ Con: traffic congestion in the upper levels of the dissemination tree
▪ Pro: queries routed in the tree are very selective
▪ the more conjuncts, the more selective the query: early pruning of the subtrees to be visited

1 tree per term
▪ Pro: congestion-free
▪ Con: tree maintenance (as many trees as terms)
▪ Con: single-term queries are less selective, so more peers are visited unnecessarily

Our solution: the “sweet spot” is expected to lie between the above extremes

Finding the Sweet Spot

Empirical fact: the upper 2 tree levels in a QDT are the most congested

One solution: cyclical permutation of nodes on the tree levels

Goal: each router appears precisely once in the top 2 levels across the QDTs
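One possible reading of the cyclical permutation is sketched below; it is not taken from the slides. It assumes that all QDTs share the same tree shape, that routers are assigned to tree positions in level order, and that the number of trees is the number of routers divided by the size of the top two levels, rounded down. The function name `cyclic_level_orders` and the router ids are hypothetical.

```python
def cyclic_level_orders(routers, top_size):
    """Return one level-order router assignment per QDT.

    Tree i is obtained by rotating the router list by i * top_size positions,
    so a different group of top_size routers fills the top two levels each time.
    """
    num_trees = len(routers) // top_size
    return [
        routers[i * top_size:] + routers[:i * top_size]
        for i in range(num_trees)
    ]


routers = list(range(1, 13))  # 12 hypothetical router ids
orders = cyclic_level_orders(routers, top_size=4)
tops = [order[:4] for order in orders]
print(tops)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```

In this sketch every router lands in the top two levels of exactly one tree, which spreads the most congested positions evenly over the routers.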

Sweet Spot when 4 QDTs

[Figure: the four trees QDT1-QDT4 built over the same routers and publishers P1-P8, with the routers permuted across the tree levels from one tree to the next; router 1 is marked in each tree]

Experimental Goals

Effect of number of QDTs
▪ find the “sweet spot” for load balancing

Effect of routing strategy (informed routing)
▪ optimize based on query selectivity estimation

Effect of QDT topology
▪ study the overlay organization of the peers

Experimental Setup

10,000-node overlay network simulator
▪ 9,400 publishers and 600 routers

XML Wikipedia dump of 1.1M articles (8.6GB)

Query workload: 50,000 conjunctive queries
▪ each query has 1..10 conjunctive terms
▪ each query has at least one match in the global data collection

QDT topology
▪ Multicast trees, e.g., Scribe (QDTS)
▪ Balanced trees (QDTB)

Measuring the Throughput

Processing load at each node
▪ is a function of the number of messages reaching the node

Peak load: the maximum load over all nodes

Average load: the number of messages in the network divided by the number of routers

The ideal load we can achieve is the average load for the 1-QDT case

New metric: the load reduction
▪ how close the actual peak load (with k QDTs) is to the ideal load

load reduction = (average load for the 1-QDT case) / (peak load for the k-QDT case)
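The metrics can be computed from per-router message counts as in the sketch below. The function names and the message counts are made up for illustration, and the ratio follows the reading that load reduction is the average load of the 1-QDT case divided by the peak load of the k-QDT case.

```python
def peak_load(messages_per_router):
    """Maximum number of messages reaching any single router."""
    return max(messages_per_router.values())


def average_load(messages_per_router):
    """Total number of messages divided by the number of routers."""
    return sum(messages_per_router.values()) / len(messages_per_router)


def load_reduction(one_qdt_counts, k_qdt_counts):
    """Ideal load (1-QDT average) relative to the actual peak load with k QDTs."""
    return average_load(one_qdt_counts) / peak_load(k_qdt_counts)


# Hypothetical message counts per router for the same query workload.
one_qdt = {"r1": 8, "r2": 2, "r3": 2}  # single tree: the root r1 is congested
k_qdts = {"r1": 5, "r2": 4, "r3": 3}   # several trees: load is spread out

print(load_reduction(one_qdt, one_qdt))  # 0.5
print(load_reduction(one_qdt, k_qdts))   # 0.8
```

A value closer to 1 means that the peak load is closer to the ideal load.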

Effect of Number of QDTs

Varying the number of QDTs, we confirm that the number of QDTs given by the cyclical permutation method yields the highest load reduction

The “sweet spot” is well defined

For this number of QDTs the load reduction is near the optimum

Result: the actual peak load is brought very close to the ideal load
▪ near-optimum peak load reduction at 15 QDTs for Scribe-generated topologies

Effect of Routing Strategy

Query selectivity estimation
▪ with only 1-3% of the state, we get 65-75% of the routing benefit

Effect of QDT Topology

Fanout-balanced trees are closest to optimal throughput

Ideal-to-actual peak load reduction ratio:

             QDTS, 15-QDT config.   QDTB, 66-QDT config.
Processing   1.32                   1.18
Forwarding   10.23                  2.3

Summary

Infrastructure for ad-hoc querying in online communities where the publishers keep control over their own data

Ongoing Work

Ranked results
▪ Disseminate only to top-K relevant publishers
▪ Find only top-K matching documents

Support for more expressive XML queries

Simulation

Build prototype

Thank You!

Effect of Number of QDTs

[Figure: plot not preserved in the transcript]

Effect of Routing Strategy

[Figure: plot not preserved in the transcript]

Efficient Representation of Summaries

Naïve solution: keep “exact node summaries” as a complete list of published terms
▪ Con: memory intensive, with arbitrarily large summaries
▪ Con: costly to check set inclusion

How to achieve fast term-inclusion checks? How to represent summaries using little space?

Allow estimates
▪ without false negatives: to avoid incomplete answers
▪ with bounded false positives: to avoid wasting bandwidth

Represent summaries (term sets) using Bloom filters
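A Bloom filter with these properties can be sketched with the standard library alone. The class below is a generic textbook construction, not the authors' implementation: the sizes (`num_bits`, `num_hashes`), the use of SHA-256 to derive bit positions, and the method names are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size term-set summary: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

    def may_contain_all(self, terms):
        """Set-inclusion estimate for a whole query."""
        return all(term in self for term in terms)

    def union(self, other):
        """Summary of a router: bitwise OR of its children's filters."""
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share the same parameters")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


p1, p2 = BloomFilter(), BloomFilter()
for term in ("San Diego", "San Francisco", "stocks", "food", "weather"):
    p1.add(term)
for term in ("San Diego", "gold", "New York", "food"):
    p2.add(term)

router = p1.union(p2)
print(router.may_contain_all({"food"}))              # True
print(router.may_contain_all({"gold", "weather"}))   # True
```

The second answer illustrates why the full query must still be checked at the publishers: the router's summary contains both terms, although in this example neither P1 nor P2 advertises gold and weather together.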
