PIER (Peer-to-Peer Information Exchange and Retrieval)
30 March 07
Neha Singh
Outline
- Motivation
- Introduction
- Architecture
- Join Algorithms
- Experimental Results
- Conclusion
Motivation: What is Very Large?
- A spectrum of scale: single site → clusters → distributed systems (10s – 100s of nodes) → Internet scale (1000s – millions of nodes).
- The database community has focused on the first part of this spectrum, the network community on the latter.
- "The database research community prides itself on scalable technologies. Yet database systems traditionally do not excel on one important scalability dimension: the degree of distribution. This limitation has hampered the impact of database technologies on massively distributed systems like the Internet."
Motivation
- Databases offer powerful query facilities, but have the potential to scale to only a few hundred computers.
- Querying the Internet calls for a widely distributed system that has: query facilities (SQL), fault tolerance, and flexibility (in terms of the nature of the data).
Databases + P2P DHTs
- Databases carry a lot of expensive baggage: ACID transactions, and consistency above all else.
- The need: unite the query processor with DHTs.
- DHTs + relational query processing = PIER: bringing complex queries to DHTs.
Outline
- Motivation
- Introduction
  - What is PIER?
  - Design Principles
- Architecture
- Join Algorithms
- Experimental Results
- Conclusion
What is PIER?
- Peer-to-Peer Information Exchange and Retrieval.
- A query engine that scales up to thousands of participating nodes and can work on various kinds of data.
- It runs on top of a P2P overlay network.
- A step toward distributed query processing at a larger scale, and a way toward massive distribution: querying heterogeneous data.
- Its architecture marries traditional database query processing with recent peer-to-peer technologies.
Design Principles: Relaxed Consistency
- Brewer's CAP theorem states that a distributed data system can have only two of the following three properties: (C)onsistency, (A)vailability, tolerance of network (P)artitions.
- PIER prioritizes A and P and sacrifices C: consistency gives way to availability and partition tolerance.
- I.e., "best-effort results".
Design Principles (continued)
- Organic scaling: the system grows with deployment and does not require a priori allocation.
- Natural habitats for data: data remains in its original format, accessed through a DB interface.
- Standard schemas: achieved through common software.
Initial Design Assumptions
- Overlay network: DHTs are highly scalable and resilient to network failures, but provide only limited functionality. Design challenge: get lots of functionality out of this simple interface.
- Decoupled storage: PIER is just the query engine, with no storage of its own. It queries data in situ and gives up ACID guarantees.
- Why not design a P2P storage manager too? It is not needed for many applications, and it is a hard problem in itself: leave it for others to solve (or not).
Outline
- Motivation
- Introduction
- Architecture
- Join Algorithms
- Experimental Results
- Conclusion
PIER Architecture
[Architecture diagram: user applications (e.g., network monitoring and other user apps) sit on top of PIER. PIER comprises the core relational execution engine, the catalog manager, and the query optimizer. PIER runs over the DHT, which is split into a provider/storage manager layer and an overlay routing layer; the overlay network in turn runs over the physical IP network.]
Query Plan
Declarative queries are compiled into a query plan. Example:

    SELECT R.cpr, R.name, S.address
    FROM   R, S
    WHERE  R.cpr = S.cpr;
Architecture: DHT: Routing
- The DHT is divided into 3 modules: overlay routing, storage manager, and provider.
- Overlay routing API (a code sketch follows):
  - lookup(key) → ipaddr
  - join(landmarkNode)
  - leave()
  - CALLBACK: locationMapChange()
- A very simple interface; any routing algorithm can sit here: CAN, Chord, Pastry, etc.
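A minimal sketch of what this routing interface could look like, purely to make the API above concrete. The Python rendering, snake_case method names, and type hints are assumptions; PIER's actual implementation is not shown on the slides.

from abc import ABC, abstractmethod

class OverlayRouting(ABC):
    """The simple interface any DHT routing layer (CAN, Chord, Pastry, ...) offers."""

    @abstractmethod
    def lookup(self, key: bytes) -> str:
        """Return the IP address of the node currently responsible for key."""

    @abstractmethod
    def join(self, landmark_node: str) -> None:
        """Join the overlay through a known landmark node."""

    @abstractmethod
    def leave(self) -> None:
        """Gracefully leave the overlay."""

    def location_map_change(self) -> None:
        """Callback: invoked when responsibility for part of the key space moves."""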
Architecture: DHT: Storage
- Stores and retrieves records, which consist of key/value pairs. Keys are used to locate items and can be any supported data type or structure.
- Storage manager API (sketched below):
  - store(key, item)
  - retrieve(key) → item
  - remove(key)
- A complex storage system could be used, but PIER uses a simple in-memory storage system.
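A toy stand-in for that in-memory storage manager, assuming the three-call API above; the multi-item-per-key behavior is an illustrative choice, not PIER's code.

class StorageManager:
    """In-memory key/value store; a key may hold multiple items."""
    def __init__(self):
        self._items = {}

    def store(self, key, item):
        self._items.setdefault(key, []).append(item)

    def retrieve(self, key):
        return self._items.get(key, [])

    def remove(self, key):
        self._items.pop(key, None)

sm = StorageManager()
sm.store(b"some-dht-key", {"cpr": 1, "name": "Alice"})
print(sm.retrieve(b"some-dht-key"))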
DHT – Provider (1/2)
- The provider ties the routing and storage manager layers together and provides the DHT interface that PIER uses.
- Each object in the DHT has a namespace, a resourceID, and an instanceID:
  - namespace: the application or group of objects, e.g., a table
  - resourceID: identifies the object, e.g., the primary key or any other attribute
  - instanceID: an integer, used to separate items with the same namespace and resourceID
- DHT key = hash(namespace, resourceID); a sketch of this key derivation follows.
- CAN's mapping of resourceID to object is equivalent to an index.
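An illustrative sketch of the key derivation. The slide only says hash(namespace, resourceID); SHA-1 and the byte encoding here are assumptions, not PIER's documented choices.

import hashlib

def dht_key(namespace: str, resource_id: str) -> bytes:
    """Derive the DHT routing key from an object's namespace and resourceID."""
    return hashlib.sha1(f"{namespace}:{resource_id}".encode()).digest()

# e.g., a tuple of table R, keyed by its primary key value:
key = dht_key("R", "12345")
print(key.hex())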
DHT – Provider (2/2)
[Diagram: table R is a namespace; tuples (1..n) live as resourceID/item pairs on node R1, tuples (n+1..m) on node R2.]
- Provider API (usage sketched below):
  - get(namespace, resourceID) → item
  - put(namespace, resourceID, item, lifetime)
  - renew(namespace, resourceID, instanceID, lifetime) → bool
  - multicast(namespace, resourceID, item)
  - unicast(ns, rid, item)
  - lscan(namespace) → items
  - CALLBACK: newData(ns, item)
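A hypothetical usage sketch of the put/get/lscan calls. LocalProvider is a single-node toy that only mimics the call shapes; real calls would route items across the overlay, and the lifetime handling is omitted.

class LocalProvider:
    """Toy single-node stand-in for the provider API above."""
    def __init__(self):
        self._store = {}

    def put(self, ns, rid, item, lifetime):
        # lifetime: soft state; items expire unless renewed (ignored in this toy)
        self._store.setdefault((ns, rid), []).append(item)

    def get(self, ns, rid):
        return self._store.get((ns, rid), [])

    def lscan(self, ns):
        # all items stored locally under a namespace
        return [i for (n, _), items in self._store.items()
                if n == ns for i in items]

dht = LocalProvider()
dht.put("R", "12345", {"cpr": 12345, "name": "Alice"}, lifetime=30)
print(dht.get("R", "12345"))
print(dht.lscan("R"))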
Architecture: PIER
- PIER consists only of the relational execution engine.
- It executes a pre-optimized query plan.
- A query plan is a box-and-arrow description of how to connect basic operators together: selection, projection, join, group-by/aggregation, and some DHT-specific operators such as rehash.

Query Processor
- How does it work? It performs selection, projection, joins, grouping, and aggregation; multiple operators execute simultaneously, pipelined together; results are produced and queued as quickly as possible.
- How does it modify data? It inserts, updates, and deletes items via the DHT interface.
- How does it select the data to process? Via a dilated-reachable snapshot: the data published by reachable nodes at the time the query arrives.
Outline
- Motivation
- Introduction
- Architecture
- Join Algorithms
- Experimental Results
- Conclusion
Joins: The Core of Query Processing
- Goal: get tuples that have the same value for a particular attribute (or attributes), the join attribute(s), to the same site, then append the matching tuples together.
- The algorithms come from the existing database literature, with minor adaptations to use the DHT.
DHT-Based Distributed Joins
- DHTs are used both as a content-addressable network and for routing tuples by value.
- PIER implements two binary equi-joins and two bandwidth-reducing rewrite schemes.
Joins: Symmetric Hash Join (SHJ)
- Algorithm, at each site:
  - (Scan) Use two lscan calls to retrieve all data stored at that site from the source tables.
  - (Rehash) put a copy of each eligible tuple, with the hash key based on the value of the join attribute.
  - (Listen) Use newData to see the rehashed tuples.
  - (Compute) Run a standard one-site join algorithm on the tuples as they arrive (sketched below).
- The Scan/Rehash steps must run on all sites that store source data; the Listen/Compute steps can run on fewer nodes by choosing the hash key differently.
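A minimal sketch of the Compute step: a pipelined in-memory symmetric hash join, run at each listening node as newData delivers rehashed tuples. The table names R/S and the join attribute cpr follow the example query; the class and its interface are illustrative, not PIER's operators.

from collections import defaultdict

class SymmetricHashJoin:
    """Probe the other table's hash table, then insert into your own."""
    def __init__(self, join_attr="cpr"):
        self.join_attr = join_attr
        self.seen = {"R": defaultdict(list), "S": defaultdict(list)}

    def on_new_data(self, table, tup):
        other = "S" if table == "R" else "R"
        key = tup[self.join_attr]
        for match in self.seen[other][key]:   # probe: join with tuples seen so far
            yield {**match, **tup}
        self.seen[table][key].append(tup)     # build: remember for future probes

shj = SymmetricHashJoin()
list(shj.on_new_data("R", {"cpr": 1, "name": "Alice"}))
print(list(shj.on_new_data("S", {"cpr": 1, "address": "Berkeley"})))
# [{'cpr': 1, 'name': 'Alice', 'address': 'Berkeley'}]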
Joins: Fetch Matches (FM)
- Algorithm, at each site:
  - (Scan) Use lscan to retrieve all data from ONE table.
  - (Get) Based on the value of the join attribute, issue a get for the possible matching tuples from the other table.
- Note: one table (the one we issue the gets for) must already be hashed/stored on the join attribute.
- Big picture: SHJ is put-based; FM is get-based. A sketch follows.
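A minimal sketch of FM, assuming S is already hashed on the join attribute. The dict S_BY_CPR stands in for the DHT gets a real node would issue; all names and values are illustrative.

# S, pre-hashed/stored on the join attribute cpr (stand-in for the DHT)
S_BY_CPR = {1: [{"cpr": 1, "address": "Berkeley"}]}

def fetch_matches(local_r_tuples, join_attr="cpr"):
    for r in local_r_tuples:                      # (Scan) local fragment of R
        for s in S_BY_CPR.get(r[join_attr], []):  # (Get) matching S tuples
            yield {**r, **s}

print(list(fetch_matches([{"cpr": 1, "name": "Alice"},
                          {"cpr": 2, "name": "Bob"}])))
# only the cpr=1 pair joins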
Query Processor – Join Rewriting
- Symmetric semi-join (sketched below):
  - (Project) Project both R and S down to their resourceIDs and join keys.
  - (Small rehash) Perform a SHJ on the two projections.
  - (Compute) Send the results into an FM join for each of the tables.
  - Minimizes initial communication.
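A self-contained sketch of the symmetric semi-join idea: join only the (resourceID, join key) projections, then fetch the full tuples for the survivors, FM-style. The in-memory dicts and the brute-force mini-join are toy stand-ins for the rehash and DHT gets.

# Full tuples, keyed by resourceID (stand-ins for data stored in the DHT)
R = {"r1": {"cpr": 1, "name": "Alice"}, "r2": {"cpr": 2, "name": "Bob"}}
S = {"s1": {"cpr": 1, "address": "Berkeley"}}

# (Project + small rehash + SHJ) mini-join on (resourceID, join key) pairs
matches = [(rid, sid)
           for rid, r in R.items()
           for sid, s in S.items()
           if r["cpr"] == s["cpr"]]

# (Compute) FM-style fetch of the full tuples, only for likely answers
result = [{**R[rid], **S[sid]} for rid, sid in matches]
print(result)  # [{'cpr': 1, 'name': 'Alice', 'address': 'Berkeley'}]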
- Bloom joins (sketched below):
  - (Scan) Create a Bloom filter over a local fragment of the relation.
  - (Put) Publish the filters for R and S.
  - (Multicast) Distribute the filters.
  - (Rehash) Rehash only the tuples that match the other relation's filter.
  - (Compute) Run a SHJ on the surviving tuples.
  - Reduces rehashing.
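A minimal sketch of the Bloom-join filtering step. A Python set stands in for the Bloom filter here: a real Bloom filter is far more compact but admits false positives, so some non-joining tuples may still be rehashed.

R = [{"cpr": 1, "name": "Alice"}, {"cpr": 2, "name": "Bob"}]
S = [{"cpr": 1, "address": "Berkeley"}]

# (Scan/Put/Multicast) S's sites build and distribute a "filter" of join keys
s_filter = {t["cpr"] for t in S}

# (Rehash) only R tuples passing the filter are rehashed for the SHJ
rehashed_r = [t for t in R if t["cpr"] in s_filter]
print(rehashed_r)  # only tuples that can possibly join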
Joins: Additional Strategies
- Bloom filters: can be used to reduce the amount of data rehashed in the SHJ.
- Symmetric semi-join: run a SHJ on the source data projected down to only the hash key and join attributes; use the results of this mini-join as the source for two FM joins that retrieve the other attributes of the tuples likely to be in the answer set.
- Big picture: a tradeoff between bandwidth (extra rehashing) and latency (time to exchange filters).
Outline
- Motivation
- Introduction
- Architecture
- Join Algorithms
- Experimental Results
- Conclusion
Workload Parameters
- CAN configuration: d = 4
- |R| = 10 |S|
- Constants provide 50% selectivity
- f(x,y) is evaluated after the join
- 90% of R tuples match a tuple in S
- Result tuples are 1 KB each
- The symmetric hash join is used

Simulation Setup
- Up to 10,000 nodes
- Network cross-traffic, CPU, and memory utilization are ignored
Scalability: Simulations of One SHJ Join
[Graph comparing a centralized "warehousing" configuration against full parallelization.]
- Result 1 (scalability): the system scales well as long as the number of computation nodes is large enough to avoid network congestion at those nodes.
Join Algorithms (1/2)
- Setup: infinite bandwidth, 1024 data and computation nodes.
- The core join algorithms perform faster than the rewrites:
  - the Bloom filter rewrite adds two multicasts;
  - the semi-join rewrite adds two CAN lookups.
Performance: Join Algorithms
[Graph: average network traffic vs. selectivity of the predicate on relation S (0–100%), for SHJ, FM, SSJ, and BF. Parameters: R + S = 25 GB; n = m = 1024; inbound capacity = 10 Mbps; hop latency = 100 ms.]
- At 40% selectivity, the bottleneck switches from the computation nodes to the query sites.
Soft State: Failure Detection and Recovery
[Graph: average recall vs. failure rate, for varying refresh periods (the time to reinsert lost tuples); 4096 nodes; 15-second failure detection.]
- Average recall decreases as the failure rate increases, and increases as the refresh period decreases.
Some Real-World Results: One SHJ Join on the Millennium Cluster
Outline
- Motivation
- Introduction
- Architecture
- Join Algorithms
- Experimental Results
- Conclusion
Applications
- P2P databases: highly distributed and available data.
- Network monitoring: distributed intrusion detection, fingerprint queries.
Conclusion
- Distributed data needs a distributed query processor.
- DHTs are too simple and databases too complex; PIER occupies a point in the middle of this new design space.
- The primary technical contributions of the paper are architectural and evaluative rather than algorithmic.
- It presents the first serious treatment of scalability issues in a P2P-style relational query engine.
References
- Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, Ion Stoica. Querying the Internet with PIER. VLDB 2003.
- Matthew Harren, Joseph M. Hellerstein, Ryan Huebsch, Boon Thau Loo, Scott Shenker, Ion Stoica. Complex Queries in DHT-based Peer-to-Peer Networks. IPTPS, March 2002.
Thank you
Extra Slides

Transit-Stub Topology (GT-ITM)
- 4 domains, 10 nodes per domain, 3 stubs per node.
- 50 ms, 10 ms, and 2 ms link latencies; 10 Mbps inbound links.
- Trends are similar to the fully connected topology, with somewhat longer end-to-end delays.