K08 Grant Application with Summary Statement - Sample 2: Center ...
B-Treescgi.di.uoa.gr/~k08/manolis/2019-2020/B-Trees.pdf · ⌉and = . • B-trees are used for...
Transcript of B-Treescgi.di.uoa.gr/~k08/manolis/2019-2020/B-Trees.pdf · ⌉and = . • B-trees are used for...
B-Trees
Manolis Koubarakis
Data Structures and Programming Techniques
1
The Memory Hierarchy
Data Structures and Programming Techniques
2
External Memory
Main (Internal) Memory
Cache
Registers
CPU
Bigger
Faster
External Memory
• So far we have assumed that our data structures are stored in main memory. However, if the size of a data structure is too big then it will be stored on external memory e.g., on a hard disk.
• Examples: the database of a bank, a database of images, a database of videos etc.
Data Structures and Programming Techniques
3
External Searching
• When we access data on a disk or another external memory device, we perform external searching.
• A disk access can be at least 100,000 to 1,000,000 times longer than a main memory access.
• Thus, for data structures residing on disk, we want to minimize disk accesses.
Data Structures and Programming Techniques
4
(𝑎, 𝑏) Trees
• An (𝒂, 𝒃) tree, where 𝑎 and 𝑏 are integers, such that 2 ≤ 𝑎 ≤
(𝑏+1)
2, is a multi-way search tree 𝑇
with the following additional restrictions:– Size property: Each internal node has at least 𝑎
children, unless it is the root, and at most 𝑏 children. The root can have as few as 2 children.
– Depth property: All external nodes have the same depth.
• A (2,4) tree is an (𝑎, 𝑏) tree with 𝑎 = 2 and 𝑏 =4.
Data Structures and Programming Techniques
5
Example (3,5) Tree
Data Structures and Programming Techniques
6
a b
c f
g h i k l s t u xd e n p
j
m r
Proposition
• The height of an (𝑎, 𝑏) tree storing 𝑛 entries is
𝑂log 𝑛
log 𝑎.
• Proof?
Data Structures and Programming Techniques
7
Proof
• Let 𝑇 be an (𝑎, 𝑏) tree storing 𝑛 entries and let ℎ be the height of 𝑇. We justify the proposition by proving the following bounds on ℎ:
1
log 𝑏log(𝑛 + 1) ≤ ℎ <
1
log 𝑎log
𝑛+1
2+1
• By the size and depth properties, the number 𝑛′′ of external nodes of 𝑇 is at least 𝟐𝒂𝒉−𝟏 and at most 𝒃𝒉.
• To see the upper bound, consider that we can have 1 node at level 0, at most 𝑏 nodes at level 1, at most 𝑏2 nodes at level 2 etc. and at most 𝑏ℎ at level ℎ (these are the external nodes).
• To see the lower bound, consider that we can have 1 node at level 0, 2 nodes at level 1, at least 2𝑎 nodes at level 2, at least 2𝑎2 at level 3 etc. and at least 2𝑎ℎ−1 nodes at level ℎ.
Data Structures and Programming Techniques
8
Proof (cont’d)
• By an earlier proposition for multi-way trees, we have that 𝑛′′ = 𝑛 + 1therefore
2𝑎ℎ−1 ≤ 𝑛 + 1 ≤ 𝑏ℎ
• Taking the logarithm of base 2 of each term, we getℎ − 1 log 𝑎 +1 ≤ log(𝑛 + 1) ≤ ℎ log 𝑏
• The lower bound we want to prove is obvious from the above right inequality.
• The upper bound we want to prove is also easy to see using the left inequality from above:
ℎ log 𝑎 − log 𝑎 + 1 ≤ log(𝑛 + 1)ℎ log 𝑎 ≤ log(𝑛 + 1) + log 𝑎 − 1
ℎ ≤1
log 𝑎log
𝑛 + 1
2+ 1 −
1
log 𝑎
ℎ <1
log 𝑎log
𝑛 + 1
2+ 1
Data Structures and Programming Techniques
9
B-Trees
• In an (𝑎, 𝑏) tree , we can select the parameters 𝑎 and 𝑏 so that each tree node occupies a single disk block or page.
• This gives rise to a well-known external memory data structure called the B-tree.
• A B-tree of order 𝒎 is an (𝑎, 𝑏) tree with 𝑎 = ⌈𝑚
2⌉ and 𝑏 = 𝑚.
• B-trees are used for indexing data stored on external memory.• When we implement a B-tree, we choose the order 𝑚 so that the
(at most) 𝑚 children references and the (at most) 𝑚 − 1 keys stored at a node can all fit into a single block.
• Nodes are at least half-full all the time due to the value of 𝑎.
Data Structures and Programming Techniques
10
Example B-Tree of Order 𝑚 = 5
Data Structures and Programming Techniques
11
a b
c f
g h i k l s t u xd e n p
j
m r
Proposition
• Let 𝑇 be a B-tree of order 𝑚 and height ℎ.
Let 𝑑 = ⌈𝑚
2⌉ and 𝑛 the number of entries in
the tree. Then, the following inequalities hold:
1. 2𝑑ℎ−1 − 1 ≤ 𝑛 ≤ 𝑚ℎ − 1
2. log𝑚(𝑛 + 1) ≤ ℎ ≤ log𝑑(𝑛+1)
2+ 1
• Proof?
Data Structures and Programming Techniques
12
Proof
• Let us prove (1) first. • The upper bound follows from the fact that a B-
tree of order 𝑚 is a multi-way tree and the respective proposition we proved for multi-way trees.
• The lower bound follows from an inequality we used in the proof of the previous proposition that for 𝑎, 𝑏 trees.
• To prove (2), rewrite the inequalities of (1) and then take logarithms with bases 𝑚 and 𝑑 for the respective terms.
Data Structures and Programming Techniques
13
Result
• From the right inequality of (2) in the previous proposition, we have that the height of a B-
tree is 𝑶 𝐥𝐨𝐠𝒅 𝒏 where 𝑑 = ⌈𝑚
2⌉ , as we
would like it for a balanced search tree.
Data Structures and Programming Techniques
14
Insertion into a B-tree
• The general method for insertion in a B-tree is as follows. First, a search is made to see if the new key is in the tree. This search (if the tree is truly new) will terminate in failure at a leaf.
• The new key is then added to the parent of the leaf node. If the node was not previously full, then the insertion is finished.
• When a key is added to a full node, we have an overflow. Then this node splits into two nodes on the same level, except that the median key at position ⌈
𝑚
2⌉ is not put into either of the two new
nodes, but is instead sent up to the tree to be inserted into the parent node.
• When a search is later made through the tree, a comparison with the median key will serve to direct the search into the proper subtree.
Data Structures and Programming Techniques
15
Example
• Let us see an example of insertions into an initially empty B-tree of order 5.
Data Structures and Programming Techniques
16
Insert a
Data Structures and Programming Techniques
17
a
Insert g
Data Structures and Programming Techniques
18
a g
Insert f
Data Structures and Programming Techniques
19
a f g
Insert b
Data Structures and Programming Techniques
20
a b f g
Insert k - Overflow
Data Structures and Programming Techniques
21
a b f g k
Creation of a New Root Node
Data Structures and Programming Techniques
22
a b g k
f
Split
Data Structures and Programming Techniques
23
a b
f
g k
Insert d
Data Structures and Programming Techniques
24
a b d
f
g k
Insert h
Data Structures and Programming Techniques
25
a b d
f
g h k
Insert m
Data Structures and Programming Techniques
26
a b d
f
g h k m
Insert j - Overflow
Data Structures and Programming Techniques
27
a b d
f
g h j k m
Sent j to the Parent Node
Data Structures and Programming Techniques
28
a b d
f
g h k m
j
Split
Data Structures and Programming Techniques
29
a b d
f j
g h k m
Insert e
Data Structures and Programming Techniques
30
a b d e
f j
g h k m
Insert s
Data Structures and Programming Techniques
31
a b d e
f j
g h k m s
Insert i
Data Structures and Programming Techniques
32
a b d e
f j
g h i k m s
Insert r
Data Structures and Programming Techniques
33
a b d e
f j
g h i k m r s
Insert x - Overflow
Data Structures and Programming Techniques
34
a b d e
f j
g h i k m r s x
r is Sent to the Parent Node
Data Structures and Programming Techniques
35
a b d e
f j
g h i k m s x
r
Split
Data Structures and Programming Techniques
36
a b d e
f j r
g h i k m s x
Insert c - Overflow
Data Structures and Programming Techniques
37
a b c d e
f j r
g h i k m s x
c is Sent to the Parent
Data Structures and Programming Techniques
38
a b d e
f j r
g h i k m s x
c
Split
Data Structures and Programming Techniques
39
a b
c f j r
g h i k m s xd e
Insert l
Data Structures and Programming Techniques
40
a b
c f j r
g h i k l m s xd e
Insert n
Data Structures and Programming Techniques
41
a b
c f j r
g h i k l m n s xd e
Insert t
Data Structures and Programming Techniques
42
a b
c f j r
g h i k l m n s t xd e
Insert u
Data Structures and Programming Techniques
43
a b
c f j r
g h i k l m n s t u xd e
Insert p - Overflow
Data Structures and Programming Techniques
44
a b
c f j r
g h i k l m n p s t u xd e
m is Sent to the Parent Node
Data Structures and Programming Techniques
45
a b
c f j r
g h i k l n p s t u xd e
m
Split
Data Structures and Programming Techniques
46
a b
c f j m r
g h i k l s t u xd e n p
Overflow at the Root
Data Structures and Programming Techniques
47
a b
c f j m r
g h i k l s t u xd e n p
j is Sent up to a New Root
Data Structures and Programming Techniques
48
a b
c f m r
g h i k l s t u xd e n p
j
Split
Data Structures and Programming Techniques
49
a b
c f
g h i k l s t u xd e n p
j
m r
Final Tree
Data Structures and Programming Techniques
50
a b
c f
g h i k l s t u xd e n p
j
m r
Deletion from a B-tree
• Let us now see how we delete a key from a B-tree.• If the key to be deleted is in a node with only external
nodes as children, then it can be deleted immediately.• If the key to be deleted is in an internal node with only
internal nodes as children, then its immediate predecessor (or successor) under the natural order of keys is guaranteed to be in a node with only external-node children.
• Hence, we can promote the immediate predecessor or successor into the position occupied by the key to be deleted, and delete the key from the node with only external-node children.
Data Structures and Programming Techniques
51
Deletion from a B-tree (cont’d)
• If the node where the deletion takes place contains more than the minimum number of keys, then one can be deleted with no further action.
• If the node contains the minimum number, then we first look at its two immediate siblings (or in the case of a node on the outside, one sibling).
• If one of these has more than the minimum number for entries, then we can do a transfer operation: one child of the sibling is moved to the node where the deletion takes place, one of the keys of the sibling is moved into the parent node, and a key from the parent node is moved into the node where the deletion takes place.
• If the immediate sibling has only the minimum number of keys then we perform a fusion operation: the current node and its sibling are merged into a new node and a key is moved from the parent into this new node.
• If this fusion step leaves the parent with too few entries, the process propagates upward.
Data Structures and Programming Techniques
52
Example
Data Structures and Programming Techniques
53
a b
c f
g h i k l s t u xd e n p
j
m r
Delete h
Data Structures and Programming Techniques
54
a b
c f
g i k l s t u xd e n p
j
m r
Delete r
Data Structures and Programming Techniques
55
a b
c f
g i k l s t u xd e n p
j
m r
Find the Successor of r
Data Structures and Programming Techniques
56
a b
c f
g i k l s t u xd e n p
j
m r
Promote the Successor of r – Delete the Successor from its Place
Data Structures and Programming Techniques
57
a b
c f
g i k l t u xd e n p
j
m s
Delete p
Data Structures and Programming Techniques
58
a b
c f
g i k l t u xd e n p
j
m s
Transfer
Data Structures and Programming Techniques
59
a b
c f
g i k l u xd e n
j
m
t s
After the Transfer
Data Structures and Programming Techniques
60
a b
c f
g i k l u xd e n s
j
m t
Delete d
Data Structures and Programming Techniques
61
a b
c f
g i k l u xe n s
j
m t
d
Fusion
Data Structures and Programming Techniques
62
a b
f
g i k l u xe n s
j
m t
c
After the Fusion – Underflow at f
Data Structures and Programming Techniques
63
a b c e
f
g i k l u xn s
j
m t
Fusion
Data Structures and Programming Techniques
64
a b c e
f
g i k l u xn s
j
m t
After the Fusion – Delete Root
Data Structures and Programming Techniques
65
a b c e g i k l u xn s
f j m t
Final Tree
Data Structures and Programming Techniques
66
a b c e g i k l u xn s
f j m t
Complexity of Operations in a B-tree
• As we have shown for multi-way trees, the complexity of search, insertion and deletion in a B-tree of order 𝑚 is 𝑂 ℎ𝑡 where 𝑂(𝑡) is the time it takes to implement split, transfer or fusion using the data structure implementing each node of the tree.
• If we count only disk block operations then 𝑂 𝑡 = 𝑂(1). Therefore, the complexity of each operation is 𝑶 𝒉 = 𝑶(𝐥𝐨𝐠 𝒎
𝟐
𝒏).
Data Structures and Programming Techniques
67
B+-trees
• A variation of B-trees called B+-trees is one of the most important indexing structures used in today’s file systems and relational database management systems.
Data Structures and Programming Techniques
68
B+-tree Example
Data Structures and Programming Techniques
69
B+-trees (cont’d)
• B+-trees are similar to B-trees. But in B+-trees, internal nodes store only keys while external nodes at the bottom layer store keys and pointers to values (the 𝑑𝑖’s in the previous slide).
• The external nodes in the bottom layer are ordered and linked so that, not only equality queries (e.g., find employees with salary 10,000), but also range queries can be answered effectively (e.g., find employees with salary between 10,000 and 20,000 euros).
Data Structures and Programming Techniques
70
Readings
• M. T. Goodrich, R. Tamassia and D. Mount. Data Structures and Algorithms in C++. 2nd
edition. John Wiley.
• M. T. Goodrich, R. Tamassia. Δομές Δεδομένων και Αλγόριθμοι σε Java. 5η έκδοση. Εκδόσεις Δίαυλος.– Κεφ. 10.5
• Sartaj Sahni. Δομές Δεδομένων, Αλγόριθμοι και Εφαρμογές στη C++. Εκδόσεις Τζιόλα.
Data Structures and Programming Techniques
71
Readings (cont’d)
• You can also see the following chapter but notice that the data structure called B-tree there is essentially a B+-tree (but without the linking of the external nodes on the bottom layer):
• R. Sedgewick. Αλγόριθμοι σε C.– Κεφ. 16.3
Data Structures and Programming Techniques
72