Index and Distributed Index Methods
Zachary G. Ives, University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
February 19, 2008
Some portions derived from slides by Raghu Ramakrishnan
Publish-Subscribe Model Summarized
XFilter has an elegant model for matching XPaths
  A good deal more complex than HW2, in that it supports wildcards (*) and //
Currently not commonly used
  Partly because XML isn't that widespread
  This may change with the adoption of an XML format called RSS (Rich Site Summary or Really Simple Syndication)
  Many news sites, web logs, mailing lists, etc. use RSS to publish daily articles
  Seems like a good fit for publish-subscribe models!
Finding a Happy Medium
We've seen two approaches:
  Do all the work at the data stores: flood the network with requests
  Do all the work via a central crawler: record profiles and disseminate matches
An alternative, two-step process: build a content index over what's out there, then answer queries against the index
  An index is a key -> value map
  Typically limited in what kinds of queries can be supported
  Most common instance: an index of document keywords
Inverted Indices
A conceptually very simple data structure:
  <keyword, {list of occurrences}>
In its simplest form, each occurrence includes a document pointer (e.g., URI), perhaps a count and/or position
Requires two components, an indexer and a retrieval system
We'll consider the cost of building the index, plus searching the index using a single keyword
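As a sketch, the data structure above is just a map from keyword to a list of (document, count) occurrences, with an indexer (`add`) and a single-keyword retrieval method (`lookup`). The class and method names below are illustrative, not from any particular system:

```java
import java.util.*;

// Minimal in-memory inverted index: keyword -> {list of occurrences},
// where each occurrence holds a document pointer (URI) and a count.
public class InvertedIndex {
    static final class Occurrence {
        final String uri;   // document pointer
        int count;          // number of times the keyword appears
        Occurrence(String uri) { this.uri = uri; }
    }

    // keyword -> (uri -> occurrence record)
    private final Map<String, Map<String, Occurrence>> postings = new HashMap<>();

    // Indexer: record one keyword occurrence in a document.
    public void add(String keyword, String uri) {
        postings.computeIfAbsent(keyword, k -> new LinkedHashMap<>())
                .computeIfAbsent(uri, Occurrence::new)
                .count++;
    }

    // Retrieval: single-keyword lookup returns the occurrence list.
    public Collection<Occurrence> lookup(String keyword) {
        return postings.getOrDefault(keyword, Collections.emptyMap()).values();
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add("fox", "doc1"); idx.add("fox", "doc1"); idx.add("fox", "doc2");
        for (Occurrence o : idx.lookup("fox"))
            System.out.println(o.uri + ":" + o.count);
    }
}
```

A real indexer would also store positions and persist the postings; this sketch only shows the shape of the map.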
How Do We Lay Out an Inverted Index?
Some options:
  Unordered list
  Ordered list
  Tree
  Hash table
Unordered and Ordered Lists
Assume that we have entries such as:
  <keyword, #items, {list of occurrences}>
What does ordering buy us?
Assume that we adopt a model in which we use:
  <keyword, item> <keyword, item>
Do we get any additional benefits?
How about <keyword, {items}>, where we fix the size of the keyword and the number of items?
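One thing ordering clearly buys is binary search: a single-keyword lookup over sorted entries is O(log n) rather than an O(n) scan, and fixing the entry size additionally lets us jump straight to the i-th record by offset. A small sketch using Java's built-in binary search (the keywords are arbitrary examples):

```java
import java.util.Arrays;

// With entries sorted by keyword, lookup is binary search (O(log n))
// instead of a linear scan of an unordered list (O(n)).
public class OrderedLookup {
    public static void main(String[] args) {
        String[] keywords = { "ant", "art", "be", "best", "bob", "dog" };
        // Arrays.binarySearch requires the array to be sorted.
        int pos = Arrays.binarySearch(keywords, "best");
        System.out.println(pos >= 0 ? "found at " + pos : "absent"); // found at 3
        // A miss encodes the insertion point as -(insertionPoint) - 1,
        // which tells us where the keyword *would* go.
        int miss = Arrays.binarySearch(keywords, "bit");
        System.out.println("insertion point for 'bit': " + (-miss - 1)); // 4
    }
}
```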
Tree-Based Indices
Trees have several benefits over lists:
  Potentially logarithmic search time, as with a well-designed sorted list, IF it's balanced
  Ability to handle variable-length records
We've already seen how trees might make a natural way of distributing data, as well
How does a binary search tree fare? Cost of building? Cost of finding an item in it?
B+ Tree: A Flexible, Height-Balanced, High-Fanout Tree
Insert/delete at log_F N cost (F = fanout, N = # leaf pages)
Keep tree height-balanced
Minimum 50% occupancy (except for root)
  Each node contains d <= m <= 2d entries
  d is called the order of the tree
Can search efficiently based on equality (or also range, though we don't need that here)
[Figure: index entries in the internal nodes direct search down to the data entries in the leaves (the "sequence set")]
Example B+ Tree
Data (inverted list ptrs) is at leaves; intermediate nodes have copies of search keys
Search begins at root, and key comparisons direct it to a leaf
Search for be↓, bobcat↓, ...
Based on the search for bobcat↓, we know it is not in the tree!
[Figure: root with key art; internal node with keys best, but, dog; leaves holding a↓ am↓ an↓ ant↓ art↓ be↓ best↓ bit↓ bob↓ but↓ can↓ cry↓ dog↓ dry↓ elf↓ fox↓]
Inserting Data into a B+ Tree
Find correct leaf L
Put data entry onto L
  If L has enough space, done!
  Else, must split L (into L and a new node L2)
    Redistribute entries evenly, copy up middle key
    Insert index entry pointing to L2 into parent of L
This can happen recursively
  To split an index node, redistribute entries evenly, but push up the middle key. (Contrast with leaf splits.)
Splits "grow" the tree; a root split increases its height
  Tree growth: gets wider or one level taller at top
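The leaf-split step above can be sketched in a few lines: redistribute the entries evenly and copy up the first key of the new right-hand leaf. This toy version operates on a plain sorted list standing in for a leaf page:

```java
import java.util.*;

// Sketch of a B+ tree leaf split: when a leaf overflows, move the upper
// half of its entries to a new leaf and *copy up* the first key of the
// new leaf as the separator for the parent. (An index-node split would
// *push up* the middle key instead, removing it from the children.)
public class LeafSplit {
    public static void main(String[] args) {
        // Overflowed leaf, in sorted order (the "and" insertion example).
        List<String> leaf = new ArrayList<>(
            Arrays.asList("a", "am", "an", "and", "ant"));
        int mid = leaf.size() / 2;
        // Upper half moves to the new leaf L2.
        List<String> newLeaf = new ArrayList<>(leaf.subList(mid, leaf.size()));
        leaf.subList(mid, leaf.size()).clear();
        // Separator is copied up: it also remains in the leaf level.
        String separator = newLeaf.get(0);
        System.out.println(leaf + " | " + separator + " | " + newLeaf);
    }
}
```

Running this reproduces the slide's example: the left leaf keeps a↓ am↓, the separator "an" goes to the parent, and the right leaf holds an↓ and↓ ant↓.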
Inserting "and↓" Example: Copy Up
Want to insert into the leftmost leaf; no room, so split & copy up:
  The overflowed leaf a↓ am↓ an↓ ant↓ and↓ splits, and "an" becomes the entry to be inserted in the parent node. (Note that key "an" is copied up and continues to appear in the leaf.)
[Figure: root with key art; internal node with keys best, but, dog; leaves holding a↓ am↓ an↓ ant↓ art↓ be↓ best↓ bit↓ bob↓ but↓ can↓ cry↓ dog↓ dry↓ elf↓ fox↓]
Inserting "and↓" Example: Push Up 1/2
The copied-up key "an" must go into an index node that is already full: need to split the node & push up
[Figure: after the leaf split, the index node holds keys an, art, best, but, dog and overflows]
Inserting "and↓" Example: Push Up 2/2
Key "best" is the entry to be inserted in the parent node. (Note that best is pushed up and only appears once in the index. Contrast this with a leaf split.)
[Figure: new root holds best; the split index nodes hold an, art and but, dog; leaf entries are unchanged]
Copying vs. Splitting, Summarized
Every keyword (search key) appears in at most one intermediate node
  Hence, in splitting an intermediate node, we push up
Every inverted list entry must appear in the leaf
  We may also need it in an intermediate node to define a partition point in the tree
  We must copy up the key of this entry
Note that B+ trees easily accommodate multiple occurrences of a keyword
Virtues of the B+ Tree
B+ trees and other indices are quite efficient:
  Height-balanced; log_F N cost to search
  High fanout (F) means depth is rarely more than 3 or 4
  Almost always better than maintaining a sorted file
  Typically 67% occupancy on average
The Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you are using
  Interface: open a B+ tree; get and put items based on key
  Handles concurrency, caching, etc.
How Do We Distribute a B+ Tree?
We need to host the root at one machine and distribute the rest
What are the implications for scalability?
Consider building the index as well as searching it
Eliminating the Root
Sometimes we don't want a tree-structured system because the higher levels can be a central point of congestion or failure
Two strategies:
  Modified tree structure (e.g., BATON, Jagadish et al.)
  Non-hierarchical structure
A "Flatter" Scheme: Hashing
Start with a hash function with a uniform distribution of values:
  h(name) -> a value (e.g., a 32-bit integer)
Map from values to hash buckets
  Generally using mod (# buckets)
Put items into the buckets
  May have "collisions" and need to chain
[Figure: h(x) values 0, 4, 8, 12, … map mod 4 into buckets 0-3; colliding entries hang off a bucket's overflow chain]
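A minimal sketch of this bucketing scheme, with chaining for collisions (the bucket count and key set are arbitrary):

```java
import java.util.*;

// Hash bucketing sketch: hash the key, then map the hash value to a
// bucket using mod(#buckets); collisions are chained within a bucket.
public class HashBuckets {
    static final int NUM_BUCKETS = 4;
    // Each bucket is a chain (list) of the keys that hashed to it.
    static final List<List<String>> buckets = new ArrayList<>();

    static int bucketOf(String key) {
        // Math.floorMod avoids negative buckets for negative hash codes.
        return Math.floorMod(key.hashCode(), NUM_BUCKETS);
    }

    public static void main(String[] args) {
        for (int i = 0; i < NUM_BUCKETS; i++) buckets.add(new ArrayList<>());
        for (String k : new String[] { "ant", "bob", "cry", "dog", "elf" })
            buckets.get(bucketOf(k)).add(k);
        for (int i = 0; i < NUM_BUCKETS; i++)
            System.out.println("bucket " + i + ": " + buckets.get(i));
    }
}
```

This is exactly the structure that becomes awkward to distribute: changing NUM_BUCKETS forces nearly every key to move, which motivates the consistent hashing discussed below.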
Dividing Hash Tables Across Machines
Simple distribution: allocate some number of hash buckets to various machines
  Can give this information to every client, or provide a central directory
  Can evenly or unevenly distribute buckets
  Lookup is very straightforward
A possible issue is data skew: some ranges of values occur frequently
  Can use dynamic hashing techniques
  Can use a better hash function, e.g., SHA-1 (160-bit key)
Some Issues Not Solved with Conventional Hashing
What if the set of servers holding the inverted index is dynamic?
  Our number of buckets changes
  How much work is required to reorganize the hash table?
Solution: consistent hashing
Consistent Hashing – the Basis of "Structured P2P"
Intuition: we want to build a distributed hash table where the number of buckets stays constant, even if the number of machines changes
  Requires a mapping from hash entries to nodes
  Don't need to re-hash everything if a node joins/leaves
  Only the mapping (and allocation of buckets) needs to change when the number of nodes changes
Many examples: CAN, Pastry, Chord
  For this course, you'll use Pastry
  But Chord is simpler to understand, so we'll look at it
Basic Ideas
We're going to use a giant hash key space
  SHA-1 hash: 20 bytes, or 160 bits
We'll arrange it into a "circular ring" (it wraps around at 2^160 to become 0)
We'll actually map both objects' keys (in our case, keywords) and nodes' IP addresses into the same hash key space:
  "abacus" -> SHA-1 -> k10
  130.140.59.2 -> SHA-1 -> N12
Chord Hashes a Key to its Successor
Nodes and blocks have randomly distributed IDs in the circular hash ID space
Successor: the node with the next highest ID
[Figure: ring with nodes N10, N32, N60, N80, N100; key hashes k10, k11, k30, k33, k40, k52, k65, k70, k99, k112, k120 are each stored at their successor node]
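The successor relation is easy to sketch with a sorted map: a key belongs to the first node whose ID is greater than or equal to the key's hash, wrapping around the ring. The node and key IDs below mirror the figure; a real system would derive them with SHA-1:

```java
import java.util.*;

// Consistent-hashing sketch: nodes and keys share one circular ID space,
// and a key is stored at its successor (first node ID clockwise from it).
public class Ring {
    // Sorted map from node ID to node name; TreeMap gives successor lookup.
    static final TreeMap<Integer, String> nodes = new TreeMap<>();

    static String successor(int keyId) {
        Map.Entry<Integer, String> e = nodes.ceilingEntry(keyId);
        // Wrap around the ring if the key is past the highest node ID.
        return (e != null ? e : nodes.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        for (int id : new int[] { 10, 32, 60, 80, 100 }) nodes.put(id, "N" + id);
        System.out.println(successor(30));   // N32
        System.out.println(successor(70));   // N80
        System.out.println(successor(112));  // wraps around to N10
        // If N32 leaves, only its keys move (to N60); all others stay put.
        nodes.remove(32);
        System.out.println(successor(30));   // N60
    }
}
```

The last two lines show the payoff over mod-based bucketing: a node join or leave moves only the keys in one arc of the ring.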
Basic Lookup: Linear Time
Lookups find the ID's predecessor
Correct if successors are correct
[Figure: ring with nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; the query "Where is k70?" is forwarded from successor to successor until N80 answers "N80"]
"Finger Table" Allows O(log N) Lookups
Goal: shortcut across the ring – binary search
Reasonable lookup latency
[Figure: node N80 keeps fingers 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]
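The finger-table idea can be sketched directly from its definition: node n's i-th finger points at the successor of n + 2^i (mod the ring size), so each hop can at least halve the remaining distance to the target. A toy ring of 2^7 IDs (the node ID and ring size are chosen only for readability; Chord's real ring has 2^160 IDs):

```java
// Finger-table sketch for Chord-style O(log N) lookup: node n keeps a
// pointer ("finger") to the successor of n + 2^i for each i.
public class Fingers {
    public static void main(String[] args) {
        int n = 80;              // this node's ID (toy value)
        int ringBits = 7;        // toy ring: 2^7 = 128 IDs
        int ringSize = 1 << ringBits;
        for (int i = 0; i < ringBits; i++) {
            // Finger targets are 1, 2, 4, ... half the ring away from n.
            int target = (n + (1 << i)) % ringSize;
            System.out.println("finger[" + i + "] -> successor(" + target + ")");
        }
    }
}
```

To route, a node forwards the query to the finger closest to (but not past) the key, which is what turns the linear walk of the previous slide into a binary-search-like hop sequence.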
Node Joins
How does the node know where to go? (Suppose it knows 1 peer)
What would need to happen to maintain connectivity?
What data needs to be shipped around?
[Figure: new node N120 joining a ring of N5, N10, N20, N32, N40, N60, N80, N99, N110]
A Graceful Exit: Node Leaves
What would need to happen to maintain connectivity?
What data needs to be shipped around?
[Figure: ring of N5, N10, N20, N32, N40, N60, N80, N99, N110]
What about Node Failure? Suppose a node just dies?
What techniques have we seen that might help?
Successor Lists Ensure Connectivity
Each node stores r successors, r = 2 log N
Lookup can skip over dead nodes to find objects
[Figure: ring of N5, N10, N20, N32, N40, N60, N80, N99, N110, each annotated with its successor list, e.g. N5 -> (N10, N20, N32), N110 -> (N5, N10, N20)]
Objects are Replicated as Well
When a "dead" peer is detected, repair the successor lists of those that pointed to it
Can take the same scheme and replicate objects on each peer in the successor list
Do we need to change the lookup protocol to find objects if a peer dies?
Would there be a good reason to change the lookup protocol in the presence of replication?
What model of consistency is supported here? Why?
Stepping Back for a Moment: DHTs vs. Gnutella and Napster 1.0
Napster 1.0: central directory; data on peers
Gnutella: no directory; flood peers with requests
Chord, CAN, Pastry: no directory; hashing scheme to look for data
Clearly, Chord, CAN, and Pastry have guarantees about finding items, and they are decentralized
But non-research P2P systems haven't adopted this paradigm: Kazaa, BitTorrent, … still use variations of the Gnutella approach
Why? There must be some drawbacks to DHTs...
Distributed Hash Tables, Summarized
Provide a way of deterministically finding an entity in a distributed system, without a directory, and without worrying about failure
Can also be a way of dividing up work: instead of sending data to a node, might send a task
Note that it's up to the individual nodes to do things like store data on disk (if necessary; e.g., using B+ trees)
Applications of Distributed Hash Tables
To build distributed file systems (CFS, PAST, …)
To distribute "latent semantic indexing" (U. Rochester)
As the basis of distributed data integration (U. Penn, U. Toronto, EPFL) and databases (UC Berkeley)
To archive library content (Stanford)
Distributed Hash Tables and Your Project
If you're building a mini-Google, how might DHTs be useful in:
  Crawling + indexing URIs by keyword?
  Storing and retrieving query results?
The hard parts:
  Coordinating different crawlers to avoid redundancy
  Ranking different sites (often more difficult to distribute)
  What if a search contains 2+ keywords?
(You'll initially get to test out DHTs in Homework 3)
From Chord to Pastry
What we saw were the basic algorithms of the Chord system
Pastry is slightly different:
  It uses a different mapping mechanism than the ring (but one that works similarly)
  It doesn't exactly use a hash table abstraction – instead there's a notion of routing messages
  It allows for replication of data and finds the closest replica
  It's written in Java, not C
… And you'll be using it in your projects!
Pastry API Basics (v 1.4.3_02)
See freepastry.org for details and downloads
Nodes have identifiers that will be hashed: interface rice.p2p.commonapi.Id
  2 main kinds of NodeIdFactories – we'll use socket-based
Nodes are logical entities: can have more than one virtual node
  Several kinds of NodeFactories create virtual Pastry nodes
All Pastry nodes have built-in functionality to manage routing
  Derive from the "common API" class rice.p2p.commonapi.Application
Creating a P2P Network
Example code in DistTutorial.java
Create a Pastry node:

Environment env = new Environment();
PastryNodeFactory d = new SocketPastryNodeFactory(new NodeFactory(keySize), env);
// Need to compute InetSocketAddress of a host to be addr
NodeHandle aKnownNode = ((SocketPastryNodeFactory) d).getNodeHandle(addr);
PastryNode pn = d.newNode(aKnownNode);
MyApp app = new MyApp(pn); // Base class of your application!

No need to call a simulator – this is real!
Pastry Client APIs
Based on a model of routing messages
  Derive your message from class rice.p2p.commonapi.Message
Every node has an Id (NodeId implementation)
Every message gets an Id corresponding to its key
Call endpoint.route(id, msg, hint) (aka routeMsg) to send a message (endpoint is an instance of Endpoint)
  The hint is the starting point, of type NodeHandle
At each intermediate point, Pastry calls a notification: forward(id, msg, nextHop)
At the end, Pastry calls a final notification: deliver(id, msg), aka messageForAppl
IDs
Pastry has mechanisms for creating node IDs itself
Obviously, we need to be able to create IDs for keys
Need to use java.security.MessageDigest:

MessageDigest md = MessageDigest.getInstance("SHA");
byte[] content = myString.getBytes();
md.update(content);
byte[] shaDigest = md.digest();
rice.pastry.Id keyId = new rice.pastry.Id(shaDigest);
How Do We Create a Hash Table (Hash Map/Multiset) Abstraction?
We want the following:
  put(key, value)
  remove(key)
  valueSet = get(key)
How can we use Pastry to do this?
Next Time
We've been looking at data distribution to this point
We saw XML as a means of message passing – one of several means of communication, in some ways analogous to event-based scheduling
We'll talk about remote procedure calls, remote method invocations, and Web services