P2P Databases


Page 1: P2P Databases

P2P Databases

Page 2: P2P Databases

P2P Today

napster

gnutella

morpheus

kazaa

bearshare

seti@home

folding@home

ebay

limewire

icq

fiorana

mojo nation

jxta

united devices

open cola

uddi

process tree

can

chord

ocean store

farsite

pastry

tapestry

?

grove

netmeeting

freenet

popular power

aim

jabber

bittorrent

edonkey

Page 3: P2P Databases

Object representation and storage

Attributes: Name, Artist, Album, Genre

Objects

Pointer to object

Page 4: P2P Databases

P2P vs. Distributed DBMS

Traditional DDBMS issues:

• Transactions
• Distributed query optimization
• Interoperation of heterogeneous data sources
• Reliability/failure of nodes

Complex features do not scale

Page 5: P2P Databases

P2P vs. Distributed DBMS

Example application: file-sharing

• Simple data model and query language
  – No complex query optimization
  – Easy interoperation
• No guarantee on quality of results
  – Individual site availability unimportant
• Local updates
  – No transactions
  – Network partitions OK

Simple, so amenable to a large-scale network of PCs

Page 6: P2P Databases

Example: file sharing

• Challenge #1: Performance
  – Asking everyone is expensive!
  – If I am smart, I only need to ask one peer
  – How can I be smart?

[Figure: a peer asking “File X?” with the other peers marked “?”]

Page 7: P2P Databases

Search in P2P

• System can control:
  – Connections made by users/topology
  – Data placement
  – Query type
• Tight control: “Structured”
  – Efficient, comprehensive
• Loose control: “Unstructured”
  – Inefficient, not comprehensive, simple, expressive
  – Used in real life

Both are useful to study

Page 8: P2P Databases

Centralized

• Napster model
• Benefits:
  – Efficient search
  – Limited bandwidth usage
  – No per-node state
• Drawbacks:
  – Central point of failure
  – Limited scale

[Figure: peers Bob, Alice, Jane, Judy]

Page 9: P2P Databases

http://www.snocap.com/

Page 10: P2P Databases

Unstructured – Query Flooding

[Figure legend: query source, forward query, processed query, found result, forward response]

Page 11: P2P Databases

Problems with unstructured

• Inefficient
  – Query messages are flooded
  – Even if routing is intelligent, worst-case load is still O(n), where n is the number of nodes in the system
• Not comprehensive
  – If I do not get a result for my query, is it because none exists?
• (Of course, many optimizations are possible…)

Structured systems address these problems
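To make the two problems concrete, here is a minimal sketch of query flooding with a hop limit (TTL). The overlay graph, the TTL values, and the function names are hypothetical and chosen only for illustration; the slides do not specify a particular flooding protocol.

```python
from collections import deque

def flood(graph, source, has_object, ttl):
    """Breadth-first flood; returns (peers holding the object, messages sent)."""
    seen = {source}
    frontier = deque([(source, ttl)])
    hits, messages = [], 0
    while frontier:
        node, hops_left = frontier.popleft()
        if has_object(node):
            hits.append(node)
        if hops_left == 0:
            continue                       # TTL exhausted: do not forward further
        for neighbor in graph[node]:
            messages += 1                  # every forward costs one message
            if neighbor not in seen:       # duplicate queries are dropped on arrival
                seen.add(neighbor)
                frontier.append((neighbor, hops_left - 1))
    return hits, messages

# Hypothetical 6-peer overlay; only peer 5 holds "File X".
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
holds_file = lambda node: node == 5

print(flood(graph, 0, holds_file, ttl=2))  # ([], 6)
print(flood(graph, 0, holds_file, ttl=4))  # ([5], 11)
```

With a TTL of 2 the query returns nothing even though the file exists, which is the "not comprehensive" problem; raising the TTL finds it, but the message count grows with the number of peers and links reached.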

Page 12: P2P Databases

Distributed Hash Table (DHTs)

• Model:
  – Key/Object pair, the key is hashed to get an ID
  – Example:
    • Objects are files
    • The key is the content of the file
    • The ID is the hash of the file contents
• Single operation: Lookup(ID)
  – Input: integer ID
  – Output: the object with the corresponding ID
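The model above can be sketched in a few lines. This is an illustration only: SHA-1, the 32-bit ID width, and the single in-memory dictionary standing in for the whole network are assumptions, not details given in the slides.

```python
import hashlib

def object_id(content: bytes, m: int = 32) -> int:
    """Hash the file contents and keep the top m bits as the integer ID."""
    digest = hashlib.sha1(content).digest()           # 160-bit digest
    return int.from_bytes(digest, "big") >> (160 - m)

store = {}  # stand-in for the whole DHT: ID -> object

def put(content: bytes) -> int:
    """Store an object under the ID derived from its contents."""
    ident = object_id(content)
    store[ident] = content
    return ident

def lookup(ident: int):
    """The single DHT operation: return the object with this ID, if any."""
    return store.get(ident)

file_id = put(b"contents of some file")
print(lookup(file_id) == b"contents of some file")   # True
```

In a real DHT the dictionary is partitioned across nodes, and the rest of the deck is about deciding which node holds which ID and how to reach it.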

Page 13: P2P Databases

Identifiers

• IDs are m-bit integers
• Nodes are also assigned IDs
  – Commonly assigned by hashing a node’s IP address, although there are many problems with this
• An object is stored on the node with the smallest ID greater than or equal to the object’s ID
  – This node is called the successor of the object’s ID
  – IDs are arranged on a circle, so ID 0 follows ID 2^m − 1

Page 14: P2P Databases

Data Placement

[Figure: identifier circle with positions 0–7 (m = 3)]

Nodes: 0, 1, 3

Data: 1, 2, 6
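The placement for this example can be computed with a short sketch. It assumes, as Chord does, that a node whose ID equals the object's ID counts as that object's successor; the function name is illustrative.

```python
import bisect

def successor(node_ids, ident, m):
    """First node at or clockwise after ident on a circle of 2**m IDs."""
    ring = sorted(node_ids)
    i = bisect.bisect_left(ring, ident % (2 ** m))
    return ring[i % len(ring)]     # past the largest ID, wrap to the smallest

nodes = [0, 1, 3]
placement = {obj: successor(nodes, obj, m=3) for obj in (1, 2, 6)}
print(placement)                   # {1: 1, 2: 3, 6: 0}
```

Object 6 wraps around the circle and lands on node 0, which is why the circle ordering matters.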

Page 15: P2P Databases

Connections

[Figure: identifier circle with positions 0–7, showing “finger pointers” from one node]

Finger pointer distances:

• 2^0
• 2^1
• …
• 2^(m−1)
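A finger table built from these distances might look like the following sketch. The 0-based indexing (finger i at distance 2**i) and the helper names are assumptions made for the example.

```python
def finger_table(node, node_ids, m):
    """Finger i points at the successor of node + 2**i (mod 2**m), for i = 0..m-1."""
    ring = sorted(node_ids)

    def successor(ident):
        # first node at or after ident, wrapping to the smallest ID
        return next((n for n in ring if n >= ident), ring[0])

    return [successor((node + 2 ** i) % 2 ** m) for i in range(m)]

print(finger_table(0, [0, 1, 3], m=3))   # [1, 3, 0]
print(finger_table(0, range(8), m=3))    # [1, 2, 4]
```

Each node keeps only m pointers, yet the largest one covers half the circle, which is what makes logarithmic routing possible.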

Page 16: P2P Databases

Query

• Lookup(objectID)
  – objectID is typically the ID of the object you are looking for, but not necessarily
• Approach:
  – Find the predecessor of the object
    • I.e. the node with the largest ID that is smaller than the object ID
  – Return the successor of the predecessor

Page 17: P2P Databases

Query Example

• Say node 0 wants to find the object with ID = 7
• For simplicity, we will assume a node exists at every ID in the space

Page 18: P2P Databases

Query Example

[Figure: identifier circle with a node at every ID 0–7]

Node 0: Lookup(7)

Node 0: FindPred(7)

Page 19: P2P Databases

Query Example

[Figure: identifier circle with a node at every ID 0–7]

Node 4: FindPred(7)

Page 20: P2P Databases

Query Example

[Figure: identifier circle with a node at every ID 0–7]

Node 6: FindPred(7)

Node 6 is predecessor

Return successor node 7
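The walk 0, 4, 6 can be reproduced with a small simulation. This is a sketch of Chord-style routing under simplifying assumptions: all state lives in one process, finger tables are recomputed on demand, and the class and method names are invented for the example.

```python
def in_interval(x, a, b, inclusive_right=False):
    """True if x lies clockwise between a and b on the circle: (a, b) or (a, b]."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b            # the interval wraps past zero

class Ring:
    def __init__(self, node_ids, m):
        self.m, self.nodes = m, sorted(node_ids)

    def successor(self, ident):
        return next((n for n in self.nodes if n >= ident), self.nodes[0])

    def fingers(self, node):
        return [self.successor((node + 2 ** i) % 2 ** self.m) for i in range(self.m)]

    def find_predecessor(self, start, ident):
        node, path = start, [start]
        # stop when ident falls between this node and its immediate successor
        while not in_interval(ident, node, self.fingers(node)[0], inclusive_right=True):
            # otherwise jump to the farthest finger that still precedes ident
            node = next(f for f in reversed(self.fingers(node))
                        if in_interval(f, node, ident))
            path.append(node)
        return node, path

    def lookup(self, start, ident):
        pred, path = self.find_predecessor(start, ident)
        return self.fingers(pred)[0], path

ring = Ring(range(8), m=3)           # a node at every ID, as in the example
print(ring.lookup(0, 7))             # (7, [0, 4, 6])
```

The same code works on a sparse ring: with nodes 0, 1, and 3, a lookup for ID 6 started at node 0 visits nodes 0 and 3 and returns node 0.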

Page 21: P2P Databases

Query characteristics

• With high probability, a query can be answered by contacting O(log N) nodes
  – N total nodes in the network
  – Efficient!
• Also notice: if an object with the ID exists in the network, it will be found
  – Comprehensive!
• State is also O(log N) in size

Page 22: P2P Databases

Query characteristics

• Note that finger pointers are not required for correct operation
  – Only successor pointers are needed
  – But then cost of query increases
    • O(N) in worst case
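The cost difference can be checked numerically. The sketch below counts forwarding hops on a full ring (a node at every one of the 2**m IDs, as in the earlier example), once with finger pointers and once with successor pointers only. The function name and the full-ring simplification are assumptions.

```python
def hops(start, target, m, use_fingers=True):
    """Forwarding hops from start to target's predecessor on a full ring of 2**m nodes.

    Assumes start != target.
    """
    size = 2 ** m
    node, count = start, 0
    while (target - node) % size > 1:          # stop once target is the next node
        gap = (target - node) % size           # clockwise distance still to cover
        # with fingers: the biggest power-of-two jump that lands before the target
        step = 2 ** ((gap - 1).bit_length() - 1) if use_fingers else 1
        node = (node + step) % size
        count += 1
    return count

print(hops(0, 7, m=3))                         # 2  (0 -> 4 -> 6)
print(hops(0, 7, m=3, use_fingers=False))      # 6

m = 10                                         # N = 1024 nodes
print(max(hops(0, t, m) for t in range(1, 2 ** m)))                     # 9
print(max(hops(0, t, m, use_fingers=False) for t in range(1, 2 ** m)))  # 1022
```

With fingers the worst case stays below m = log2(N); with successor pointers alone it grows linearly with N.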

Page 23: P2P Databases

Advantages of Structured?

• Scalability/Efficiency
  – load grows with O(log N)
• Comprehensiveness

Page 24: P2P Databases

Disadvantages? (cont)

• Availability of Data
  – If a node dies suddenly, what happens to the data it was storing?
  – MUST replicate data across multiple nodes
• Query Language
  – How can we express keyword queries efficiently?
  – Many useful applications require different languages

Page 25: P2P Databases

Magnolia

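Current approach: hash each keyword separately and store pointers at h(keyword)

Example: the title “Seven Innovation Myths”

Keywords: Seven, Innovation, Myths

Keyword IDs: h(seven), h(innovation), h(myths)

[Figure: h(title) = 1100100101 for “Seven Innovation Myths”; the entry for “Innovation” stores the pointer 1100100101]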
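The per-keyword scheme can be sketched as a small inverted index. Everything concrete here is an assumption for illustration: SHA-1 as h, the 10-bit ID width (chosen to match the 10-bit value in the figure), and the dictionary standing in for the nodes responsible for each ID.

```python
import hashlib
from collections import defaultdict

M = 10  # ID width in bits (assumption)

def h(text: str) -> int:
    """Hash a keyword or title to an M-bit ID."""
    digest = hashlib.sha1(text.lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (160 - M)

index = defaultdict(set)   # ID -> pointers stored at the node responsible for that ID

def publish(title: str) -> int:
    """Store a pointer to the title under the hash of each of its keywords."""
    title_id = h(title)
    for keyword in title.split():
        index[h(keyword)].add(title_id)
    return title_id

def search(keyword: str) -> set:
    """Return the pointers stored at h(keyword)."""
    return index[h(keyword)]

first = publish("Seven Innovation Myths")
second = publish("Open Innovation")
print(first in search("innovation"), second in search("innovation"))   # True True
```

Every title containing a popular keyword ends up at the same ID, so the node responsible for that ID carries a disproportionate share of storage and query load.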

Page 26: P2P Databases

Resulting Distribution

Page 27: P2P Databases

Prefix hashing

[Figure: an m-bit ID whose first m’ bits form the prefix; “Innovation” maps to prefix 100]

hP(innovation), where hP is an m’-bit hash function

Partitions the network into sibling groups of ~ n/2^m’ nodes each

n = number of nodes, m’ = partitioning factor

For m’ = 12 and n = 1 million, ~ 256 nodes will share the same prefix

Assumption: h is uniformly distributed
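A short sketch of the prefix hash and the group-size arithmetic follows. The use of SHA-1 and the function names are assumptions; the slides only state that hP produces m’ bits and that h is uniformly distributed.

```python
import hashlib

def h_p(keyword: str, m_prime: int) -> int:
    """m'-bit prefix hash: the top m' bits of a SHA-1 digest (hash choice is an assumption)."""
    digest = hashlib.sha1(keyword.lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (160 - m_prime)

def group_of(node_id: int, m: int, m_prime: int) -> int:
    """Sibling group of a node: the m' most significant bits of its m-bit ID."""
    return node_id >> (m - m_prime)

def expected_group_size(n: int, m_prime: int) -> float:
    """With uniformly distributed IDs, n nodes spread evenly over 2**m' prefixes."""
    return n / 2 ** m_prime

print(group_of(0b1001011, m=7, m_prime=3))     # 4, i.e. prefix 100
print(expected_group_size(2 ** 20, 12))        # 256.0
print(expected_group_size(1_000_000, 12))      # 244.140625
```

The figure of roughly 256 siblings corresponds to treating 1 million as 2**20; with exactly 1,000,000 nodes the expectation is about 244.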

Page 28: P2P Databases

Balancing

Innovation → sibling group ID = 100

Balanced over the sibling group

All siblings in a group share the same prefix

Page 29: P2P Databases

Random Sibling

Insert: Keyword → hP → Sibling Group ID

Locate a sibling node via SIFT

Lookup: Keyword

O(1)

Group Broadcast or Multicast

Replies

Page 30: P2P Databases

Advantages

• Good Balancing Properties

Page 31: P2P Databases

Advantages

• Low Traffic Load on nodes for popular queries
• Quick Lookup
• Popularity Ranking of Objects
• Distributed Replication for resilience

Page 32: P2P Databases

Implementing Magnolia

• Developed on top of a Chord clone written in Python
  – If you’re going to write a peer-to-peer app, why not leverage existing modules and libraries?
• Challenge: How do we implement group-based stores and queries without requiring additional network maintenance?

Page 33: P2P Databases

Chord’s Finger Table

• A Chord node maintains a finger table of M IPs pointing to nodes ahead of it in the ring.
  – The pointer at index i is the successor of node ID + 2^(i−1). This lets us reach any node in the network in O(M) hops, i.e. O(log N) for N nodes
• We use the M’ most significant bits in a node’s ID to indicate its group. We want to reach any group in O(M’) hops.
  – Do we need another table?
  – Nope. The last M’ entries in our finger table provide this.
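Why the last M’ fingers are enough can be seen from their offsets: each is a multiple of the group size, so following one changes only the group prefix. The sketch below assumes a node at every ID, 0-based finger offsets 2**i, and hypothetical values M = 7 and M’ = 3; it is not the Magnolia implementation.

```python
def group_fingers(m, m_prime):
    """Offsets of the last m' fingers: 2**(m - m'), ..., 2**(m - 1)."""
    return [2 ** i for i in range(m - m_prime, m)]

def route_to_group(node, target_group, m, m_prime):
    """Greedy walk to target_group using only the last m' fingers (full ring assumed)."""
    shift = m - m_prime
    path = [node]
    while node >> shift != target_group:
        gap = (target_group - (node >> shift)) % 2 ** m_prime   # groups still to cross
        step = 2 ** (gap.bit_length() - 1)                      # largest group jump <= gap
        node = (node + (step << shift)) % 2 ** m
        path.append(node)
    return path

M, M_PRIME = 7, 3
print(group_fingers(M, M_PRIME))              # [16, 32, 64]
print(route_to_group(5, 0b111, M, M_PRIME))   # [5, 69, 101, 117]
```

The walk from group 000 to group 111 takes three hops, one per prefix bit, and every node on the path keeps the same low-order bits.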

Page 34: P2P Databases

Talking to Siblings

• How do we propagate queries through the group?

• Naïve solution: send to our predecessor and successor.

• A better solution: We can send a query throughout the group by treating the sibling group as a tree.

Page 35: P2P Databases

Sibling Tree

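[Figure: sibling tree over the 16 group-local IDs 0–15, rooted at 0; each edge is labeled with the finger offset that reaches the child]

• 0 → 1 (0+1), 8 (0+2^3)
• 1 → 2 (1+1), 5 (1+2^2)
• 2 → 3 (2+1), 4 (2+2^1)
• 5 → 6 (5+1), 7 (5+2^1)
• 8 → 9 (8+1), 12 (8+2^2)
• 9 → 10 (9+1), 11 (9+2^1)
• 12 → 13 (12+1), 14 (12+2^1)
• 14 → 15 (14+2^0)

N/N’ = 16; M/M’ = 4

Every edge can be found in the finger table!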
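The tree in the figure can be regenerated from a simple splitting rule. The rule is inferred from the edge labels and is not stated in the slides: a node responsible for a span of IDs hands the lower part to its immediate successor and the upper part to the node at the largest power-of-two offset inside the span.

```python
def children(node, span):
    """Children of `node`, which is responsible for IDs [node, node + span)."""
    if span <= 1:
        return []                                   # leaf
    if span == 2:
        return [(node + 1, 1)]                      # only the immediate successor remains
    right = 2 ** ((span - 1).bit_length() - 1)      # largest power of two below span
    return [(node + 1, right - 1), (node + right, span - right)]

def edges(node=0, span=16):
    """All (parent, child) edges of the sibling tree, in depth-first order."""
    out = []
    for child, child_span in children(node, span):
        out.append((node, child))
        out.extend(edges(child, child_span))
    return out

tree = edges()
print(tree[:4])                                     # [(0, 1), (1, 2), (2, 3), (2, 4)]
# every offset is a power of two, so every edge is a finger-table entry
print(all((c - p) & (c - p - 1) == 0 for p, c in tree))   # True
```

For a group of 16 this yields the same 15 edges as the figure, and a query reaches every sibling in at most four forwarding steps.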

Page 36: P2P Databases

Sibling Tree Problems

• Problems:
  – Not every possible node will exist
  – Not every node will have results to report
  – The query maker needs to know when the search is done
• But we’re okay!
  – Nodes can determine if a child sub-tree is dead
  – Even if a child node in our sibling table is of a higher ID than expected
    • its sub-tree contains all existing descendants of the expected ID
    • we can predict when a child is in a sibling’s or an ancestor’s tree

Page 37: P2P Databases

Bigger Problems

• What if a pointer in our finger table fails?
  – We either have to find the successor to its ID or fail to query the sub-tree
• What if the lowest ID node isn’t the root of our tree?
  – Some of our edges won’t be in our finger table

Page 38: P2P Databases

Popularity queries

Page 39: P2P Databases

Yulania, Demo

Page 40: P2P Databases

BitTorrent

Page 41: P2P Databases

SplitStream