Tutorial 3 (b tree min heap)
B-Tree Lexicon, Min-Heaps
Kira Radinsky
Min-Heap slides are courtesy of Aya Soffer and David Carmel,
IBM Haifa Research Lab
2 November 2010 236621 Search Engine Technology 2
The Lexicon as a B-Tree
• B-Tree: a balanced tree that is optimized for disk I/O, holding key/value pairs
• Branching is defined by a min-degree parameter t, t > 1
– t is chosen according to the size of a disk block
• Any internal node other than the root has at least t and at most 2t children; the root has either no children, or at least two and at most 2t children
• Any internal node with k children also stores k-1 keys which serve as separator values: separator j is larger than the keys of subtree j and smaller than the keys of subtree j+1
• Leaf nodes, like all nodes, store at most 2t-1 key/value pairs
– When not the root, they store at least t-1 key/value pairs
• Lookup, insertion and deletion operations on a B-Tree take a number of disk accesses linear in its height, which is logarithmic (base t) in the number of keys
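To make the height claim concrete, here is a minimal sketch that computes the largest height a B-Tree can have for a given number of keys. It assumes the usual convention that the root is at depth 0, so a tree of height h holds at least 2t^h - 1 keys (one key in the root, t-1 keys in every other node). The function name `max_btree_height` is made up for this illustration.

```python
def max_btree_height(n_keys: int, t: int) -> int:
    """Largest height h (root at depth 0) of a B-Tree with min-degree t
    that stores n_keys keys, using the bound n_keys >= 2 * t**h - 1."""
    if n_keys < 1 or t < 2:
        raise ValueError("need at least one key and min-degree t >= 2")
    h = 0
    # Grow h while a tree of height h + 1 could still be filled
    # with the minimum number of keys.
    while 2 * t ** (h + 1) - 1 <= n_keys:
        h += 1
    return h


small = max_btree_height(10, 2)            # the t=2 regime of the example slide
large = max_btree_height(10 ** 9, 1000)    # a billion keys, wide nodes
```

With t around 1000, even a billion-key lexicon has height at most 2 under this bound, so a lookup touches at most three nodes.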
B-Tree Lexicon - Example
• t=2
• Each key is associated with a value that contains a DF and a pointer to the postings list (dashed line)
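[Figure: B-Tree lexicon with t=2; each key is shown with its DF]
Root: gets (1), more (2)
Left child: and (3), as (1), bad (2)
Middle child: good (2), is (1), it (2)
Right child: the (1), ugly (2)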
B-Tree Lookup
Looking up the value associated with key x:
1. current_node ← root
2. Let k1<k2<…<km be the keys of current_node
3. if x ∈ {k1,k2,…,km} – we’re done, return associated value
4. else, if current_node is a leaf node, return null
5. else, let j be the smallest index s.t. x<kj (j ← m+1 if x>km);
– current_node ← j’th subtree, and goto 2
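The lookup steps above can be sketched in Python as follows. This is an in-memory illustration only: the `Node` class, its field names and `btree_lookup` are assumptions made for the example, and a real lexicon would read each node from a disk block. The sample tree reuses the keys of the t=2 example, with only the DF stored as the value (the postings pointer is left out).

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Node:
    keys: List[str]                       # sorted keys k1 < k2 < ... < km
    values: List[Any]                     # values[i] belongs to keys[i]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def btree_lookup(root: Node, x: str) -> Optional[Any]:
    """Return the value stored under key x, or None if x is absent."""
    node = root
    while True:
        # Binary search inside the node: j is the first index with keys[j] >= x.
        j = bisect_left(node.keys, x)
        if j < len(node.keys) and node.keys[j] == x:
            return node.values[j]         # step 3: key found in this node
        if not node.children:
            return None                   # step 4: reached a leaf without a match
        node = node.children[j]           # step 5: descend into the j'th subtree


# Sample tree (values are DFs only, for illustration).
lexicon = Node(
    keys=["gets", "more"], values=[1, 2],
    children=[
        Node(keys=["and", "as", "bad"], values=[3, 1, 2]),
        Node(keys=["good", "is", "it"], values=[2, 1, 2]),
        Node(keys=["the", "ugly"], values=[1, 2]),
    ],
)
```

Here `bisect_left` plays the role of "find the smallest j with x < kj": when x is not in the node, the index it returns is exactly the subtree to descend into, including the rightmost one when x is larger than every key.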
Top-r Document Selection
Problem definition: Given a set A of scored documents, select the r documents with the highest scores in A and return them in decreasing relevance order
• Naïve method: sort the set A by score
– If |A|=M, time complexity is O(M log M)
• Better approach: since typically r << M, selecting the r top scores can be done in O(M + r log M) time using a heap:
1. Heapify the set of M scores (about 2M comparisons) so that the top score is at the root
2. Repeatedly extract the heap’s root (r times), each time fixing the heap in O(log M)
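A minimal sketch of the two steps above, using Python's standard `heapq` module. Since `heapq` only provides a min-heap, the sketch negates the scores to get max-heap behaviour; that trick, the function name `top_r_by_max_heap` and the (score, doc_id) tuple layout are assumptions of this example, not part of the slides.

```python
import heapq
from typing import List, Tuple


def top_r_by_max_heap(scored: List[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    """Return the r highest-scoring (score, doc_id) pairs, best first."""
    # Step 1: heapify all M scores in O(M). Scores are negated so that
    # the smallest heap entry corresponds to the highest score.
    heap = [(-score, doc_id) for score, doc_id in scored]
    heapq.heapify(heap)

    # Step 2: extract the root r times, O(log M) each.
    result = []
    for _ in range(min(r, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)
        result.append((-neg_score, doc_id))
    return result


sample = [(0.2, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4")]
best_two = top_r_by_max_heap(sample, 2)
```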
The Heap Data Structure - Reminder
• A binary heap is a nearly complete binary tree (every level is full except possibly the last) with values stored at all leaves and internal nodes, and an ordering rule that requires values to be non-decreasing (alternatively, non-increasing) along each path from a leaf to the root
– Largest/smallest value is at the root
• Heap implemented in an array:
– Root at index 1
– For node at index i, left child is at index 2i and right child at index 2i+1
– Thus the parent of the node at index i is at index ⌊i/2⌋
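The index arithmetic above can be checked with a few lines of Python. To keep the root at index 1 as in the slides, the sketch leaves slot 0 of the list unused; the helper names and the sample values are made up for this illustration.

```python
def parent(i: int) -> int:
    return i // 2          # floor(i / 2)


def left(i: int) -> int:
    return 2 * i


def right(i: int) -> int:
    return 2 * i + 1


def is_max_heap(a: list) -> bool:
    """Check the max-heap ordering rule on a 1-based array (a[0] is unused)."""
    n = len(a) - 1
    # Every non-root node must be no larger than its parent.
    return all(a[parent(i)] >= a[i] for i in range(2, n + 1))


# Slot 0 is a placeholder so that the root sits at index 1.
example = [None, 9, 7, 8, 3, 5, 6, 1]
```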
Binary Heap Stored in an Array
[Figure: the heap drawn as a binary tree, with its array layout below]
Index: 1 2 3 4 5 6 7 8 9 10
Value: 23 17 15 17 8 2 13 4 14 5
Extracting the Top Element
• Remove the largest item r times
• Each time:
– Remove the largest item – the root of the heap
– Replace it with the last element of the heap
– Sift the new root down until order is restored
• Example
– Remove item 23 from the root
– Last item in array, 5 (at location 10), replaces it
– Reinstate heap order - in the worst case 5 will be sifted back down the tree - number of sifts is bounded by log(size of heap)
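The extraction step described above can be sketched as follows, on a 1-based array with slot 0 unused. The function names `sift_down` and `extract_max` are choices made for this example. The usage lines replay the slide's example: 23 is removed, 5 moves to the root and is sifted down.

```python
def sift_down(a: list, i: int, n: int) -> None:
    """Move a[i] down a 1-based max-heap of size n until order is restored."""
    while True:
        largest = i
        l, r = 2 * i, 2 * i + 1
        if l <= n and a[l] > a[largest]:
            largest = l
        if r <= n and a[r] > a[largest]:
            largest = r
        if largest == i:
            return                        # both children are no larger: done
        a[i], a[largest] = a[largest], a[i]
        i = largest                       # continue in the affected subtree


def extract_max(a: list):
    """Remove and return the root of a 1-based max-heap (a[0] is unused)."""
    if len(a) < 2:
        raise IndexError("extract from an empty heap")
    top = a[1]
    last = a.pop()                        # last element of the heap
    if len(a) > 1:
        a[1] = last                       # it replaces the root ...
        sift_down(a, 1, len(a) - 1)       # ... and is sifted down
    return top


heap = [None, 23, 17, 15, 17, 8, 2, 13, 4, 14, 5]
removed = extract_max(heap)
```

After the call, `heap` holds 17 17 15 14 8 2 13 4 5 in positions 1 to 9: the 5 travelled from the root down to a leaf, one level per swap.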
Heap Example (cont.)
To restore order at the top level of the tree, item 17, the larger of the root's two children, must be swapped with 5.
This confines the order violation to the left sub-tree.
[Figure: the heap after 5 replaces the root, in array order: 5 17 15 17 8 2 13 4 14]
The process is repeated until heap order is restored
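Heap Example (cont.)
[Figure: the heap after each step of the sift-down, in array order starting at index 1]
1. 5 replaces the root: 5 17 15 17 8 2 13 4 14
2. 5 is swapped with 17 and moves to index 2: 17 5 15 17 8 2 13 4 14
3. 5 is swapped with 17 and moves to index 4: 17 17 15 5 8 2 13 4 14
4. 5 is swapped with 14 and moves to index 9: 17 17 15 14 8 2 13 4 5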
Top-r Selection Using a Min-Heap
• The selection problem can be solved by a heap that stores the smallest item at the root: min-heap
• A min-heap of r items is held instead of a max-heap of M – memory use drops from O(M) to O(r)
• Process the M scores, storing in the min-heap the r largest values seen so far
– First r values are heapified in O(r) comparisons
– Replace the smallest value in the min-heap (the r-th largest seen so far) whenever a larger value is found
• Sort the r highest values in descending order and return the corresponding documents – O(r log r)
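A minimal sketch of the min-heap procedure above, again with Python's standard `heapq` module (which is a min-heap, so no negation is needed here). The function name `top_r_by_min_heap` and the (score, doc_id) tuple layout are assumptions of this example.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple


def top_r_by_min_heap(scored: Iterable[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    """Return the r highest-scoring (score, doc_id) pairs, best first,
    keeping only r items in memory at any time."""
    if r <= 0:
        return []
    stream = iter(scored)

    # The first r values are heapified in O(r).
    heap = list(islice(stream, r))
    heapq.heapify(heap)

    # Each remaining value is compared against the root, which is the
    # smallest of the r largest values seen so far.
    for item in stream:
        if item > heap[0]:
            heapq.heapreplace(heap, item)   # pop the root, push item: O(log r)

    # Final sort of the r survivors: O(r log r).
    return sorted(heap, reverse=True)


sample_scores = [(0.2, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4")]
top_two = top_r_by_min_heap(sample_scores, 2)
```

Because the input is consumed as a stream, the scores never need to be materialized as a set of size M, which is where the memory saving comes from.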
Min-Heap Processing - Illustration
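[Figure: the scores split into a processed part and an unprocessed part; a min-heap holds the r largest items among the processed scores, and the smallest value is discarded when it is replaced]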
Top-r Selection Using a Min-Heap: Complexity Analysis
• Worst case: the scores are already in increasing order
– Each of the last M-r values is inserted into the heap
– Furthermore, it percolates to the bottom of the heap
– Complexity is O((M-r) log r)
• Average case: the scores arrive in a permutation of size M chosen uniformly at random
– The expected number of the last M-r values that are inserted into the heap is ~ r ln(M/r)
– Each insertion costs O(log r)
– Heap maintenance therefore costs O(r log r log(M/r)), in addition to the one comparison against the root made for each of the last M-r values
• Proof on the board
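The board proof is not reproduced in the transcript. One standard argument for the expected number of insertions, not necessarily the one given in class and assuming all scores are distinct, goes as follows. Let X_i indicate that the i-th value (i > r) enters the heap, which happens exactly when it is among the r largest of the first i values. In a uniformly random permutation each of the first i values is equally likely to hold any rank among them, so this has probability r/i. Here H_n denotes the n-th harmonic number.

```latex
\mathbb{E}\Big[\sum_{i=r+1}^{M} X_i\Big]
  = \sum_{i=r+1}^{M} \Pr[X_i = 1]
  = \sum_{i=r+1}^{M} \frac{r}{i}
  = r\,(H_M - H_r)
  \approx r\,(\ln M - \ln r)
  = r \ln\frac{M}{r}
```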