Datastructuren - Data Structures
liacs.leidenuniv.nl/~hoogeboomhj/dat/ohp/dat-present.pdf
Datastructuren
Data Structures

Fenia Aivaloglou, Hendrik Jan Hoogeboom
Informatica – LIACS, Universiteit Leiden
autumn 2019
Datastructuren
Table of Contents I
1 Basic Data Structures
2 Tree Traversal
3 Binary Search Trees
4 Balancing Binary Trees
5 Priority Queues
6 B-Trees
7 Graphs
8 Hash Tables
9 Data Compression
10 Pattern Matching
Datastructuren
Basic Data Structures
Contents
1 Basic Data Structures
Linear lists
Abstract Data Structures
Advanced C++ programming
Trees and their Representations
Datastructuren
Basic Data Structures
Linear lists
hierarchy of lists
A deque ("double-ended queue") is a linear list for which all insertions and deletions (and usually all accesses) are made at the ends of the list. A deque is therefore more general than a stack or a queue; it has some properties in common with a deck of cards, and it is pronounced the same way. (Knuth, TAoCP vol. 1)
linear list
deque 'deck'
stack (stapel): lifo
queue (rij): fifo
Datastructuren
Basic Data Structures
Linear lists
[figure: positions in a linear list (first … last); operations inspect/change at a position, insert, delete]
Datastructuren
Basic Data Structures
Linear lists
implementation: doubly linked list
[figure: doubly linked list with prv/nxt pointers, first and last pointers, and a variant with a sentinel node]
Datastructuren
Basic Data Structures
Linear lists
Babel
          insert                  remove                  inspect
          back        front       back      front         back  front
  C++     push_back   push_front  pop_back  pop_front     back  front
  Perl    push        unshift     pop       shift         [-1]  [0]
  Python  append      appendleft  pop       popleft       [-1]  [0]

(footnote 1: double-ended queue operations)
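The C++ row of the table can be exercised directly on std::deque; a minimal sketch (the function name dequeDemo is ours, not from the slides):

```cpp
#include <deque>

// Insert at both ends, inspect both ends, remove at both ends.
int dequeDemo() {
    std::deque<int> d;
    d.push_back(2);      // insert back
    d.push_front(1);     // insert front
    d.push_back(3);      // d = 1 2 3
    // inspect: d.front() == 1, d.back() == 3
    d.pop_front();       // remove front
    d.pop_back();        // remove back; d = 2
    return d.front();
}
```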
Datastructuren
Basic Data Structures
Linear lists
singly linked list
stack (stapel)
[figure: singly linked list; top → x1 → x2 → … → xn → Λ]
queue (wacht-rij)
[figure: singly linked list; first → x1 → x2 → … → xn → Λ, with a separate last pointer]
Datastructuren
Basic Data Structures
Linear lists
Programmeermethoden
class stapel { // the stack itself
public:
  stapel ( ) {
    bovenste = NULL; } // construct an empty stack
  ~stapel ( ); // destructor
  void zetopstapel (int); // push
  void haalvanstapel (int&); // pop
  bool isstapelleeg ( ) { // is the stack empty?
    return ( bovenste == NULL );
  }//isstapelleeg
  ...
private: // the head of the list is
  vakje* bovenste; // the top of the stack
};//stapel

void stapel::zetopstapel (int getal) { // push
  vakje* temp = new vakje;
  temp->info = getal;
  temp->volgende = bovenste;
  bovenste = temp;
}//stapel::zetopstapel
Datastructuren
Basic Data Structures
Linear lists
contiguous representation
stack (stapel)
[figure: array holding elements x … x, with a top index]
queue (wacht-rij), cyclic
[figure: circular array with first and last indices, possibly wrapping around]
empty vs. full (?)
Datastructuren
Basic Data Structures
Linear lists
Programmeermethoden
const int MAX = 100;
class stapel { // for at most MAX integers
public:
  stapel ( ) { bovenste = -1; } // constructor
  void zetopstapel (int);
  void haalvanstapel (int&);
  bool isstapelleeg ( ) {
    return ( bovenste == -1 ); }
  ...
private:
  int inhoud[MAX];
  int bovenste; // index of topmost value
};//stapel

void stapel::zetopstapel (int getal) { // note: no overflow check
  bovenste++;
  inhoud[bovenste] = getal;
}//stapel::zetopstapel
Datastructuren
Basic Data Structures
Abstract Data Structures
OOP: object oriented programming
object / class:
– data members
– methods
data encapsulation ⇒ nicer modelling
localization of operations ⇒ easier error finding
information hiding ⇒ avoiding errors
see Programmeermethoden
Datastructuren
Basic Data Structures
Abstract Data Structures
black box
[figure: stack as a black box with operations push, pop, isEmpty, top; the stored data (bottom to top) stays hidden]
[figure: generic data structure as a black box with operations insert, remove, query]
Datastructuren
Basic Data Structures
Abstract Data Structures
data structure:
– specification: domain (elements) and operations
– representation / implementation: the concrete structure
Datastructuren
Basic Data Structures
Abstract Data Structures
ADT – what, not how
Definition
An abstract data type (ADT) is a specification of the values stored in the data structure as well as the description and signatures of the operations that can be performed.
no representation or implementation in the ADT: a "mathematical model"
Datastructuren
Basic Data Structures
Abstract Data Structures
abstract "native" data types:
– float: R
– int: Z
now get used to considering stacks (etc.) the same way
Datastructuren
Basic Data Structures
Abstract Data Structures
structure
structure: unordered | linear | hierarchical | network
ADT:       set       | list   | tree         | graph
Datastructuren
Basic Data Structures
Abstract Data Structures
ADT Stack
Initialize: void → stack<T>. Construct an empty sequence ().
IsEmpty: void → Boolean. Check whether the stack is empty, i.e., contains no elements.
Size: void → Integer. Return the number n of elements, the length of the sequence (x1, . . . , xn).
Top: void → T. Returns the top xn of the sequence (x1, . . . , xn). Undefined on the empty sequence.
Push(x): T → void. Add the given element x to the top of the sequence (x1, . . . , xn), so afterwards the sequence is (x1, . . . , xn, x).
Pop: void → void. Remove the topmost element xn of the sequence (x1, . . . , xn), so afterwards the sequence is (x1, . . . , xn−1). Undefined on the empty sequence.
Datastructuren
Basic Data Structures
Abstract Data Structures
ADT Queue
Initialize: construct an empty sequence ().
IsEmpty: check whether the queue is empty, i.e., contains no elements.
Size: return the number n of elements, the length of the sequence (x1, . . . , xn).
Front: returns the first element x1 of the sequence (x1, . . . , xn). Undefined on the empty sequence.
Enqueue(x): add the given element x to the end/back of the sequence (x1, . . . , xn), so afterwards the sequence is (x1, . . . , xn, x).
DeQueue: removes the first element of the sequence (x1, . . . , xn), so afterwards the sequence is (x2, . . . , xn). Undefined on the empty sequence.
Datastructuren
Basic Data Structures
Abstract Data Structures
other ADTs (and implementation)
Set ⇒ (balanced) binary trees, hash tables
Map
Priority Queue ⇒ binary heap, leftist heap
Graph
Union-Find
Datastructuren
Basic Data Structures
Advanced C++ programming
templates
templated function
template <typename T>
T max(T a, T b) { return a>b ? a : b ; }
templated class
template <typename Typ>
class Stack {
...
private:
vector<Typ> storage;
};
Stack<int> intStack;
Stack<string> stringStack;
Datastructuren
Basic Data Structures
Advanced C++ programming
standard template library
container
iterator
algorithm
Datastructuren
Basic Data Structures
Advanced C++ programming
stl container classes
helper: pair
sequences: contiguous: array (fixed length), vector (flexible length), deque (double ended); linked: forward_list (single), list (double)
adaptors: based on one of the sequences: stack (lifo), queue (fifo); based on binary heap: priority_queue
associative: based on balanced trees: set, map, multiset, multimap
unordered: based on hash table: unordered_set, unordered_map, unordered_multiset, unordered_multimap
Datastructuren
Basic Data Structures
Advanced C++ programming
STL vector of pair

#include <iostream>
#include <string>
#include <vector>
#include <utility>
using namespace std;
using paar = pair<string, unsigned int>; // replacing typedef
int main() {
  vector<paar> club // 'modern' initialization
    { {"Jan", 1}, {"Piet", 6}, {"Katrien", 5}, {"Ramon", 2} };
  for (auto& mem : club) { // range-based for-loop
    cout << mem.first << " ";
  }
  cout << endl;
  return 0;
}
Jan Piet Katrien Ramon
Datastructuren
Basic Data Structures
Advanced C++ programming
STL priority queue

class Comp {
public:
  bool operator() ( const paar& p1, const paar& p2 ) const {
    return p1.second < p2.second;
  }
};
int main() {
vector <paar> club // ’modern’ initialization
{ {"Jan", 1}, {"Piet", 6}, {"Katrien", 5}, {"Ramon", 2} };
using pqtype = priority_queue< paar, vector <paar>, Comp > ;
pqtype pq (club.begin(), club.end() ); // wow! converts into
// priority_queue
while ( !pq.empty() ) {
cout << pq.top().first << " (" << pq.top().second << ") ";
pq.pop();
}
return 0;
}
Piet (6) Katrien (5) Ramon (2) Jan (1)
Datastructuren
Basic Data Structures
Trees and their Representations
trees
Tree
structure: AVL, B-, red-black trees
• number of children
• height
contents: Heap, BST
• relative position of values
Definition (Binary Tree)
A binary tree is: an empty tree (without any nodes), or a node with two children L and R, where L and R are binary trees.
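The recursive definition translates directly into recursive functions on the tree; a sketch with a minimal node type of our own (the slides use knoop/BinKnp):

```cpp
#include <algorithm>

// Minimal node type for illustration; names are ours.
struct Node {
    int info;
    Node *left, *right;
};

// A binary tree is empty, or a node with two binary subtrees;
// both functions follow that recursive shape directly.
int size(const Node* t) {
    if (t == nullptr) return 0;                 // empty tree
    return 1 + size(t->left) + size(t->right);  // this node + both subtrees
}

int height(const Node* t) {
    if (t == nullptr) return -1;                // empty tree has height -1
    return 1 + std::max(height(t->left), height(t->right));
}
```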
Datastructuren
Basic Data Structures
Trees and their Representations
representing binary trees: pointers

template <class T>
class BinKnp {
public:
  // CONSTRUCTOR
  BinKnp ( const T& i,
           BinKnp<T> *l = nullptr, // defaults
           BinKnp<T> *r = nullptr )
    : info(i) // constructor of type T
  { links = l; rechts = r; }
private: // DATA
  T info;
  BinKnp<T> *links, *rechts;
};
Datastructuren
Basic Data Structures
Trees and their Representations
binary search tree vs heap order
[figure: the same kind of values stored as a binary search tree (left) and as a tree in heap order, every parent at least its children (right)]
Datastructuren
Basic Data Structures
Trees and their Representations
AVL-tree and B-tree
[figure: an AVL tree with balance factors at the nodes, and a B-tree storing the keys 10 … 60]
Datastructuren
Basic Data Structures
Trees and their Representations
text compression Huffman & ZLW
[figure: a Huffman code tree with 0/1-labelled edges and leaves a–f, and the numbered dictionary tree built during ZLW compression]
Datastructuren
Basic Data Structures
Trees and their Representations
expression tree
[figure: expression trees built from operators +, ·, ↑, sin, cos and operands π, x and constants]
Datastructuren
Basic Data Structures
Trees and their Representations
full binary tree
Datastructuren
Basic Data Structures
Trees and their Representations
complete binary tree −→ array
[figure: a complete binary tree, nodes numbered 1–12 level by level]

  index: 1  2  3  4  5  6  7  8  9  10 11 12
  value: 33 42 17 8  24 3  3  98 55 10 19 5
Datastructuren
Basic Data Structures
Trees and their Representations
left-child right-sibling
[figure: a general tree with nodes a–h, its left-child right-sibling representation, and the resulting binary tree]
Datastructuren
Basic Data Structures
Trees and their Representations
trie “retrieval”
[figure: a trie ('retrieval' tree) storing a set of words that share prefixes, with $ marking ends of words]
Datastructuren
Tree Traversal
Contents
2 Tree Traversal
Definitions and representation
Recursion
Using a Stack
Using Inorder Threads
Morris Traversal
Link Inversion
Datastructuren
Tree Traversal
Definitions and representation
Traversal
The process of visiting each node (precisely once) in a systematic way:
breadth-first
NLR preorder
LNR inorder
LRN postorder
recursion
(parent pointer)
iterative, with stack
threads
link inversion
Datastructuren
Tree Traversal
Recursion
recursion (binary trees)
recursivetraversal( node )
if (node != nil) then
// pre-visit(node)
recursivetraversal(node.left)
// in-visit(node)
recursivetraversal(node.right)
// post-visit(node)
fi
end // recursivetraversal
pre
in
post
Datastructuren
Tree Traversal
Recursion
Algoritmiek2
class knoop { // a struct would also do
public:
  knoop ( ) { // constructor
    info = 0;
    links = NULL;
    rechts = NULL;
  }
  // but perhaps these should be private
  int info;
  knoop* links;
  knoop* rechts;
}; // knoop
void preorde (knoop* root) {
if ( root != NULL ) {
cout << root->info << endl;
preorde (root->links);
preorde (root->rechts);
} // if
} // preorde
void symmetrisch (knoop* root) {
if ( root != NULL ) {
symmetrisch (root->links);
cout << root->info << endl;
symmetrisch (root->rechts);
} // if
} // symmetrisch
2 yes, we have seen this before
Datastructuren
Tree Traversal
Recursion
pre-traversal( node )
if (node != nil) then
pre-visit(node)
pre-traversal(node.left)
pre-traversal(node.right)
fi
end
[figure: example tree with nodes a–k, numbered in preorder]
NLR = preorder: a b d c e g i h j k f
Datastructuren
Tree Traversal
Recursion
in-traversal( node )
if (node != nil) then
in-traversal(node.left)
in-visit(node)
in-traversal(node.right)
fi
end
[figure: the same tree, numbered in inorder]
LNR = inorder: b d a g i e j h k c f
Datastructuren
Tree Traversal
Recursion
post-traversal( node )
if (node != nil) then
post-traversal(node.left)
post-traversal(node.right)
post-visit(node)
fi
end
[figure: the same tree, numbered in postorder]
LRN = postorder: d b i g j k h e f c a
Datastructuren
Tree Traversal
Using a Stack
generic binary tree traversal
  visit  direction    visit at the next node
  1      down-left    1 (stays at the node, becomes 2 if there is no left child)
  2      down-right   1 (stays at the node, becomes 3 if there is no right child)
  3      up           2 when arriving from a left child, 3 from a right child
[figure: the three visits to a node and the moves between them]
problem: going up to parent
Datastructuren
Tree Traversal
Using a Stack
visit = 1; node = root
while (visit != 3 or not S.isEmpty() )
case visit of
1 : if (node.left != nil) then
S.push(node)
node = node.left
else
visit = 2
fi
2 : if (node.right != nil) then
S.push(node)
node = node.right
visit = 1
else
visit = 3
fi
3 : parent = S.pop()
if (parent.left == node) then
visit = 2
else
visit = 3
fi
node = parent
end//case
end//while
[figure: the generic traversal with a stack; the stack holds the path from the root to the current node]
Datastructuren
Tree Traversal
Using a Stack
which nodes on stack
pre-order: the stack holds right children still to be visited
in-order: the stack holds 'left parents' (ancestors whose left subtree is being traversed)
[figures: the example tree with preorder and inorder numbers and the corresponding stack contents]
Datastructuren
Tree Traversal
Using a Stack
pre-order
iterative-preorder( root )
S : Stack
S.create()
S.push( root )
while ( not S.isEmpty() ) do
node = S.pop()
if (node != nil) then
visit( node )
S.push( node.right )
S.push( node.left )
fi
od
end // iterative-preorder
[figure: snapshot of iterative preorder; visited nodes, the current node, and the nodes on the stack]
Datastructuren
Tree Traversal
Using a Stack
pre-order (2)
iterative-preorder( root )
S : Stack
S.create()
S.push( root )
while ( not S.isEmpty() ) do
node = S.pop()
while (node != nil) do
visit( node )
S.push( node.right )
node = node.left
od
od
end // iterative-preorder [bis]
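A C++ rendering of this second variant, under an assumed minimal Node struct (names ours), collecting the visits in a vector:

```cpp
#include <stack>
#include <vector>

struct Node {
    int info;
    Node *left, *right;
};

// Iterative preorder, second variant: pop a node, then walk down its
// left spine, visiting as we go and pushing right children for later.
std::vector<int> preorder(Node* root) {
    std::vector<int> out;
    std::stack<Node*> s;
    s.push(root);
    while (!s.empty()) {
        Node* node = s.top(); s.pop();
        while (node != nullptr) {
            out.push_back(node->info);   // visit
            s.push(node->right);         // right subtree: later
            node = node->left;           // keep descending left
        }
    }
    return out;
}
```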
Datastructuren
Tree Traversal
Using a Stack
in-order
iterative-inorder( root : Node )
S : Stack
S.create()
// move to first node (left-most)
walkLeft( root, S )
while ( not S.isEmpty() ) do
node = S.pop()
visit( node )
walkLeft( node.right, S )
od
end // iterative-inorder
walkLeft( node : Node, S : Stack)
while (node != nil) do
S.push( node )
node = node.left
od
end // walkLeft
[figure: snapshot of iterative inorder; the stack holds the left path down to the current node]
Datastructuren
Tree Traversal
Using a Stack
in-order (2)
iterative-inorder( root )
S : Stack
S.create()
node = root;
while (node != nil or
not S.isEmpty() ) do
if (node != nil) then
S.push( node )
node = node.left
else
node = S.pop()
visit( node )
node = node.right
fi
od
end // iterative-inorder [bis]
Datastructuren
Tree Traversal
Using a Stack
post-order
iterative-postorder( root )
S : Stack; // contains path from root
S.create();
last = nil
node = root
while (not S.isEmpty() or node != nil) do
if (node != nil) then
S.push(node)
node = node.left
else
peek = S.top()
if (peek.right != nil and last != peek.right) then
// right child exists AND traversing from left, move right
node = peek.right
else
visit(peek)
last = S.pop()
fi
fi
od
end // iterative-postorder
Datastructuren
Tree Traversal
Using a Stack
[figure: snapshot of iterative postorder; the stack holds the complete path from the root to the current node]
Datastructuren
Tree Traversal
Using Inorder Threads
using inorder threads
threads: replace nil-pointers; they explicitly store inorder successors
can be used to perform stack-less traversal
need one bit [boolean] per node to mark thread
Morris-variant: temporary threads, no extra bit
nb. inorder = symmetric
Datastructuren
Tree Traversal
Using Inorder Threads
inorder successor with threads
[figure: inorder successor with threads; either follow the thread from curr, or walk left down from curr's right child; example tree with keys 1–12]
Datastructuren
Tree Traversal
Using Inorder Threads
traversal with symmetric threads
inorder threads
// assuming Root != nil, find first position in inorder
Curr = walkLeft( Root );
while (Curr != nil) do
inOrderVisit( Curr );
if (Curr.IsThread) then
Curr = Curr.right; // to inorder successor
else
Curr = walkLeft (Curr.right)
fi
od
walkLeft( node : Node)
while (node.left != nil) do
node = node.left
od
return node
end // walkLeft
Datastructuren
Tree Traversal
Using Inorder Threads
what about
pre-order traversal with inorder threads
Datastructuren
Tree Traversal
Morris Traversal
Morris: temporary threads
inorder successors lead (back) to 'left parents'
[figure: the example tree traversed in inorder; comparison of stack contents versus temporary threads]
Datastructuren
Tree Traversal
Morris Traversal
Morris: basics
inorder successors lead (back) to 'left parents'
[figure: the example tree with its inorder numbering]
two visits per node:
1 (pre-order): arriving from the parent via a child link (left or right); add a thread to the current node
2 (inorder): arriving from the subtree, via the thread; delete the thread
the algorithm cannot tell threads from child links, so it does not know which visit this is, but it will check!
Datastructuren
Tree Traversal
Morris Traversal
Morris traversal - algorithm
no left subtree: 1st and 2nd visit coincide; go right (by edge or by thread)
new subtree: 1st visit; construct a thread, go left
been there: 2nd visit; delete the thread, go right
[figure: the three cases, with Curr and the predecessor Pred]
Datastructuren
Tree Traversal
Morris Traversal
Morris traversal - pseudo code
morris-algo
Curr = Root;
while (Curr != nil) do
if (Curr.left == nil) then
inOrderVisit( Curr )
Curr = Curr.right
else
// find predecessor
Pred = Curr.left
while (Pred.right != Curr && Pred.right != nil) do
Pred = Pred.right
od
if (Pred.right == nil) then
// no thread: subtree not yet visited
Pred.right = Curr
Curr = Curr.left
else
// been there, remove thread
Pred.right = nil
inOrderVisit( Curr )
Curr = Curr.right
fi
fi
od
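The pseudo code carries over to C++ almost verbatim; a sketch with an assumed Node struct (names ours), collecting the inorder visits in a vector:

```cpp
#include <vector>

struct Node {
    int info;
    Node *left, *right;
};

// Morris traversal: inorder without a stack or recursion, using
// temporary threads from the inorder predecessor back to the node.
std::vector<int> morrisInorder(Node* root) {
    std::vector<int> out;
    Node* curr = root;
    while (curr != nullptr) {
        if (curr->left == nullptr) {
            out.push_back(curr->info);          // visit, then go right
            curr = curr->right;
        } else {
            Node* pred = curr->left;            // find inorder predecessor
            while (pred->right != curr && pred->right != nullptr)
                pred = pred->right;
            if (pred->right == nullptr) {       // 1st visit: construct thread
                pred->right = curr;
                curr = curr->left;
            } else {                            // 2nd visit: delete thread
                pred->right = nullptr;
                out.push_back(curr->info);
                curr = curr->right;
            }
        }
    }
    return out;
}
```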
Datastructuren
Tree Traversal
Morris Traversal
alternative view: tree transformation
[figure: Morris traversal seen as a tree transformation; the left subtree is rotated up until the current root has no left child, and the tree is restored afterwards]
Datastructuren
Tree Traversal
Link Inversion
features
– use the generic traversal
– at each step we know which visit we are making
– no stack: invert the links on the path from the root
– use a bit (tag) on the path to distinguish left/right

a stack of bits? alternatives:
– keep a parent pointer
– a global visit counter (pre-/in-/post-order)
– only a single traversal at a time
Datastructuren
Tree Traversal
Link Inversion
inverted links
[figure: link inversion; the links on the path from the root to curr are inverted, and tag bits record whether the path went left or right, shown for the three visits at a node]
Datastructuren
Binary Search Trees
Contents
3 Binary Search Trees
Introduction
BST use cases
Constructing BSTs
Analysis of trees
ADT Set and Dictionary
Datastructuren
Binary Search Trees
Introduction
binary search tree BST3
[figure: node with key K; keys < K in the left subtree, keys > K in the right subtree]
Definition
A binary search tree is a binary tree such that for each node:
all nodes in its left subtree have smaller values, and
all nodes in its right subtree have larger values
3 BZB (binaire zoekboom), see Algoritmiek
Datastructuren
Binary Search Trees
Introduction
comparables
[figure: comparable key types: strings (chico, groucho, gummo, harpo, marx, zeppo), integers (4, 5, 11, 18, 25, 30), dates (11.6.1509 … 12.7.1543)]
Datastructuren
Binary Search Trees
Introduction
binary search tree BST
worst case search complexity: unsuccessful search in
linear tree: O(n)
optimal tree: O(log2(n)) (complete tree)
Average case behaviour: see later
Datastructuren
Binary Search Trees
Introduction
BST with 31 most common English words
[figure: BST of the 31 most common English words; the top five frequencies are indicated: the 15568, of 9767, and 7638, to 5739, a 5074]
Inserted into the BST in decreasing order of frequency.
Successful search of the BST requires 4.042 comparisons (on avg.)
Datastructuren
Binary Search Trees
Introduction
balanced BST
[figure: the same 31 words stored in a perfectly balanced BST]
Perfectly balanced BST
Successful search requires 4.393 comparisons (on avg.)
Datastructuren
Binary Search Trees
Introduction
optimal BST
[figure: the optimal BST for the 31 words; frequent words sit close to the root]
Optimal tree taking frequencies into account
Successful search requires 3.437 comparisons (on avg.)
source: Knuth TAoCP Vol.3 (Sorting and Searching)
Datastructuren
Binary Search Trees
BST use cases
search value
bool contains( const Comparable & x, Node *t ) const {
if( t == nullptr )
return false;
else if( x < t->element )
return contains( x, t->left );
else if( t->element < x )
return contains( x, t->right );
else
return true; // found
}
call with: contains(v,root);
Datastructuren
Binary Search Trees
BST use cases
find min/max value
BinaryNode * findMin( BinaryNode *t ) const {
if( t == nullptr )
return nullptr;
if( t->left == nullptr )
return t;
return findMin( t->left );
}
BinaryNode * findMax( BinaryNode *t ) const {
if( t != nullptr )
while( t->right != nullptr )
t = t->right;
return t;
}
call with: findMin(root); and findMax(root);
Datastructuren
Binary Search Trees
BST use cases
inorder is sorted
[figure: a BST with its nodes numbered in inorder]
inorder: 8 11 15 20 26 33 34 42 51 57 61
Datastructuren
Binary Search Trees
BST use cases
find k-th element
Augment each node with the size of its subtree
[figure: a BST in which every node also stores the size of its subtree]
Let r be left->size + 1
If k = r: stop! This node has kth item
If k < r: search kth item in left subtree
If k > r: search (k − r)th item in right subtree
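A sketch of this search in C++, with an assumed size-augmented node type (SNode is our name); it presumes 1 ≤ k ≤ size of the tree:

```cpp
#include <cstddef>

// Node augmented with the size of its subtree (names are ours).
struct SNode {
    int info;
    std::size_t size;
    SNode *left, *right;
};

std::size_t sz(const SNode* t) { return t ? t->size : 0; }

// Return the k-th smallest node (1-based); assumes 1 <= k <= sz(t).
const SNode* kth(const SNode* t, std::size_t k) {
    std::size_t r = sz(t->left) + 1;     // rank of t within its own subtree
    if (k == r) return t;                // this node holds the k-th item
    if (k < r)  return kth(t->left, k);  // k-th item is in the left subtree
    return kth(t->right, k - r);         // skip the left subtree and t itself
}
```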
Datastructuren
Binary Search Trees
BST use cases
counting items in [12, 52]
[figure: counting the items in [12, 52] by walking the BST; stored subtree sizes let whole subtrees be counted without visiting them]
Datastructuren
Binary Search Trees
Constructing BSTs
insertion (implementation)
template<class T>
void Node<T>::insert(const T& el, Node<T> * & p) {
if( p == nullptr ) {
p = new Node{el, nullptr, nullptr};
} else if (el < p->data) {
insert(el, p->left);
} else if (el > p->data) {
insert(el, p->right);
} else {
; // Duplicate; do nothing
}
}
call with: insert(el,root);
Datastructuren
Binary Search Trees
Constructing BSTs
deletion “by copying”
[figure: deletion 'by copying'; a node with at most one child is unlinked, a node with two children gets the value of its inorder predecessor, which is then deleted itself]
Datastructuren
Binary Search Trees
Constructing BSTs
deletion (implementation)
void remove( const Comparable & x, Node * & t ) {
  if( t == nullptr ) return;
  if( x < t->element ) remove( x, t->left );
  else if( t->element < x ) remove( x, t->right );
  else if( t->left != nullptr && t->right != nullptr ) {
    Node *pred = findMax( t->left ); // inorder predecessor
    t->element = pred->element;
    remove( t->element, t->left );
  }
  else {
    Node *oldNode = t;
    t = ( t->left != nullptr ) ? t->left : t->right;
    delete oldNode;
  }
}

call with: remove(el,root);
Datastructuren
Binary Search Trees
Analysis of trees
counting trees
[figure: root with a left subtree of i − 1 nodes and a right subtree of n − i nodes]

Unlabeled n-node binary trees:
B_n = Σ_{i=1}^{n} B_{i−1} · B_{n−i}, with B_0 = 1

n-th Catalan number:
B_n = (1/(n+1)) (2n choose n) = (2n)! / ((n+1)! n!) ∼ 4^n / (n^{3/2} √π)

this is also the number of BSTs with given values: there is a unique way to store the values in a given [unlabeled] tree
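The recurrence can be checked numerically; a small C++ sketch (function name ours):

```cpp
#include <vector>

// B_n = number of unlabeled binary trees on n nodes (Catalan numbers),
// via B_n = sum_{i=1..n} B_{i-1} * B_{n-i}, B_0 = 1: the root splits
// the remaining n-1 nodes into a left part (i-1) and a right part (n-i).
std::vector<unsigned long long> catalan(int n) {
    std::vector<unsigned long long> B(n + 1, 0);
    B[0] = 1;                            // the empty tree
    for (int m = 1; m <= n; ++m)
        for (int i = 1; i <= m; ++i)
            B[m] += B[i - 1] * B[m - i];
    return B;
}
```

For n = 4 this gives 14, matching the count of 4-node BSTs later in this chapter.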
Datastructuren
Binary Search Trees
Analysis of trees
internal path length
[figure: tree whose nodes have path lengths 0, 1, 1, 2, 2, 2]
ipl = 0 + 1 + 1 + 2 + 2 + 2 = 8
Path length of node: # edges from root to node
Definition (Internal path length)
ipl = sum of all path lengths to all nodes
Avg # comparisons in successful search: ipl/n + 1
Datastructuren
Binary Search Trees
Analysis of trees
external path length
[figure: the same tree extended with external leaves]
E = 3 + 3 + 3 + 3 + 2 + 3 + 3 = 20
Definition (External path length)
E = sum of all path lengths to the ‘extended’ leaves
Avg # comparisons in unsuccessful search: E/(n+1) (there are n + 1 external leaves)
Relation to ipl: E = ipl + 2n proof: induction
Datastructuren
Binary Search Trees
Analysis of trees
path length extremal trees

optimal (balanced): h levels, n = 2^h − 1 nodes, h = lg(n+1)
worst case (linear)

[figure: a complete tree with levels 0, 1, 2 and a linear tree with path lengths 0, 1, 2, …]

optimal: ipl = Σ_{i=0}^{h−1} i · 2^i = (n+1) lg(n+1) − 2n,  E = h · 2^h
  avg (successful) = ((n+1)/n) lg(n+1) − 1
worst: ipl = Σ_{i=0}^{n−1} i = n(n−1)/2,  E = ipl + 2n = n(n+3)/2
  avg (successful) = (n+1)/2
Datastructuren
Binary Search Trees
Analysis of trees
average tree
intuition: more balance ⇒ more permutations yield that tree
example: 4-node BSTs

[figure: the seven 4-node BST shapes with a left-leaning root, each with the insertion permutations that produce it and its ipl; e.g. the shape rooted at 2 with children 1 and 3 (then 4) arises from 2134, 2314, 2341 and has ipl = 4]

14 BSTs (7 symmetric to the ones shown); 4! = 24 permutations
average ipl: (1/24)(12 × 4 + 4 × 5 + 8 × 6) = 116/24 = 29/6
Datastructuren
Binary Search Trees
Analysis of trees
average ipl BST
I_n : average internal path length of a BST on n nodes

insert a permutation of 1, . . . , n into a BST ⇒ tree structure; we average over all permutations

[figure: permutation 5 2 4 1 3 6 7; the first element 5 becomes the root, and the remaining elements determine the left and right subtrees, here 2 4 1 3 and 6 7]

any k can be the root (= first element of the permutation):
I_n = (n − 1) + (1/n) Σ_{k=1}^{n} (I_{k−1} + I_{n−k})
Datastructuren
Binary Search Trees
Analysis of trees
telescope!
I_n : average internal path length, n nodes

so    I_n = (n − 1) + 2(I_0 + I_1 + · · · + I_{n−1})/n
also  I_{n−1} = (n − 2) + 2(I_0 + I_1 + · · · + I_{n−2})/(n − 1)

subtract:  n I_n − (n − 1) I_{n−1} = 2n − 2 + 2 I_{n−1}
thus       n I_n = (n + 1) I_{n−1} + 2n − 2

divide by n(n + 1) and iterate:
I_n/(n+1) = I_{n−1}/n + 2/(n+1) − 2/(n(n+1))
I_{n−1}/n = I_{n−2}/(n−1) + 2/n − 2/((n−1)n)
. . .
I_1/2 = I_0/1 + 2/2 − 2/(1 · 2)

summing up (the middle terms form a harmonic sum, the last terms telescope):
I_n/(n+1) = I_0/1 + O(ln n) − 2n/(n+1)
Datastructuren
Binary Search Trees
ADT Set and Dictionary
ADT Set
Initialize: construct an empty set.
IsEmpty: check whether the set is empty (∅, contains no elements).
Size: return the number of elements, the cardinality of the set.
IsElement(a): returns whether a given object from the domain belongs to the set, a ∈ A.
Insert(a): add an element to the set (if it is not present, A ∪ {a}).
Delete(a): removes an element from the set (if it is present, A \ {a}).
Efficient implementation of ADT Set possible with BST
Datastructuren
Balancing Binary Trees
Contents
4 Balancing Binary Trees
Tree rotation
AVL Trees
Adding an item to an AVL Tree
Deletion in an AVL Tree
Splay Trees
Datastructuren
Balancing Binary Trees
Tree rotation
single rotation
[figure: single rotation; root p with left child (pivot) q and subtrees T1, T2, T3 becomes root q with right child p, subtree T2 moving across; the rotation works in both directions (root q, pivot p)]
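The single rotation in code; a C++ sketch with an assumed Node struct (names ours), plus an inorder helper to check that the rotation preserves the symmetric order:

```cpp
#include <vector>

struct Node {
    int info;
    Node *left, *right;
};

// Single rotation: pivot q = p->left becomes the new subtree root,
// p its right child; q's old right subtree (T2) is re-attached to p.
Node* rotateRight(Node* p) {
    Node* q = p->left;       // pivot, assumed non-nil
    p->left = q->right;      // T2 moves across
    q->right = p;
    return q;                // new root of the subtree
}

// Inorder walk, to verify that the rotation keeps the BST order.
void inorder(const Node* t, std::vector<int>& out) {
    if (t == nullptr) return;
    inorder(t->left, out);
    out.push_back(t->info);
    inorder(t->right, out);
}
```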
Datastructuren
Balancing Binary Trees
Tree rotation
double rotation
[figure: double rotation; r with left child p and p's right child q (subtrees T1 … T4) is rebalanced by rotating two times with pivot q, lifting q above p and r]
Datastructuren
Balancing Binary Trees
Tree rotation
Day/Stout/Warren
[figure: Day/Stout/Warren; the tree is first rotated into a right-leaning backbone 2, 4, 6, 8, 12, then rebalanced into a complete tree by repeated rotations]
Datastructuren
Balancing Binary Trees
Tree rotation
Day/Stout/Warren Algorithm - createBackBone
rotate(root, pivot) { ... }
createBackBone(root)
tmp = root;
while (tmp != nil) do
if (tmp.left != nil) then
pivot = tmp.left;
rotate(tmp, pivot); // pivot takes tmp's place
tmp = pivot;
else
tmp = tmp.right;
fi
od
Datastructuren
Balancing Binary Trees
Tree rotation
Day/Stout/Warren Algorithm
createCompleteTree(root)
createBackBone(root);
n = number of nodes
m = 2^floor(log(n+1)) - 1;
rotate n-m times at every other node in the backbone
while(m>1) do
m = m/2;
rotate m times at every other node in the backbone
od
Datastructuren
Balancing Binary Trees
AVL Trees
balance factor
[figure: a BST drawn with depth levels 0–4 and node heights; each node is annotated with its balance factor, e.g. 35: −1, 30: −2, 45: +1]
Datastructuren
Balancing Binary Trees
AVL Trees
Definition (AVL Tree)
An AVL tree is a BST where for each node: |balance(node)| ≤ 1
[figure: an AVL tree with keys 1–13 and balance factors at the nodes]
Datastructuren
Balancing Binary Trees
AVL Trees
Fibonacci ‘worst’ AVL tree
[figure: Fibonacci-like 'worst' AVL trees; a minimal tree of height h has subtrees of heights h − 1 and h − 2]

F_h = F_{h−2} + F_{h−1} + 1 ≈ ((1 + √5)/2)^h, thus the worst-case search in an AVL tree grows as O(lg n) in the number of nodes n
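The recurrence for the fewest nodes at a given height can be tabulated; a sketch (function name ours):

```cpp
#include <vector>

// Fewest nodes in an AVL tree of height h (the 'worst' AVL trees):
// F_h = F_{h-1} + F_{h-2} + 1, with F_0 = 1 (single node), F_1 = 2.
std::vector<unsigned long long> minAvlNodes(int h) {
    std::vector<unsigned long long> F(h + 1, 1);
    if (h >= 1) F[1] = 2;
    for (int i = 2; i <= h; ++i)
        F[i] = F[i - 1] + F[i - 2] + 1;
    return F;
}
```

Since the minimal size grows exponentially in h, the height of any AVL tree on n nodes is O(lg n).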
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
Adding in left subtree
a) balance of p goes +1 → 0: ok, stop
b) balance of p goes 0 → −1: ok, continue checking upwards
c) balance of p goes −1 → −2: rebalance (next 2 slides), then stop
[figure: the three cases after adding a node in the left subtree of p]
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
rebalance: LL-case
[figure: LL-case; q goes −1 → −2 with left child p at 0/−1; after a single rotation both p and q have balance 0]
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
Rebalance: LR-cases
[figure: LR-cases; r goes −1 → −2 with left child p at 0/+1 and q in between at 0/±1; after the double rotation q has balance 0, and the balances of p and r depend on the case]
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
example: adding 11
[figure: inserting 11; on the search path the balance factor of 4 becomes +2]
[figure: the tree after rebalancing, now rooted at 7]
imbalance at 4, RR-case so rotate at 4 with pivot=7
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
example: adding 5
[figure: inserting 5; on the search path the balance factor of 4 becomes +2]
[figure: the tree after rebalancing]
imbalance at 4, RL-case so rotate twice with pivot=7
Datastructuren
Balancing Binary Trees
Deletion in an AVL Tree
Deletion: RR cases
[figure: deletion, RR cases; p goes +1 → +2, a single rotation with pivot q repairs the subtree; the resulting balances depend on whether q had balance 0 or +1]
Datastructuren
Balancing Binary Trees
Deletion in an AVL Tree
Delete: RL cases (ε = 0,±1)
[figure: deletion, RL case (ε = 0, ±1); after the double rotation q has balance 0, p gets 0/+1 and r gets 0/−1]
Datastructuren
Balancing Binary Trees
Deletion in an AVL Tree
[figure: deleting a node from an AVL tree; the balance factor of 11 becomes −2, and rebalancing there makes 8 unbalanced (−2) in turn; after two rebalancing steps the tree is an AVL tree again, rooted at 5]
Datastructuren
Balancing Binary Trees
Splay Trees
splay zig-zag (LR)
[figure: splay zig-zag (LR); x with parent p and grandparent g is lifted above both by a double rotation]
Datastructuren
Balancing Binary Trees
Splay Trees
splay zig-zig (LL)
[figure: splay zig-zig (LL); first rotate g with p, then p with x, lifting x two levels]
Datastructuren
Balancing Binary Trees
Splay Trees
splay linear tree
[figure: splaying the deepest node of a linear tree on 1–7; the depth of the search path is roughly halved]
Datastructuren
Priority Queues
Contents
5 Priority Queues
ADT Priority Queue
Binary Heap
Leftist heaps
Pairing Heap (not covered)
Double-ended Priority Queues
Datastructuren
Priority Queues
ADT Priority Queue
dictionary vs. priority queue
Both store a set of (key,value) pairs
{ (’Detra’,17), (’Nova’,84), (’Charlie’,22), (’Henry’,75), (’Elsa’,29) }
both: Insert('Roxanne',29)
dictionary: Delete('Detra'), Find('Elsa') returns 29, Set('Henry',76)
priority queue: FindMax() returns ('Nova',84), DeleteMax()
Datastructuren
Priority Queues
ADT Priority Queue
ADT dictionary / map / associative array
Stores a set of (key,value) pairs
Initialize, IsEmpty, Size
Insert: add (key,value) pair, provided key is not yet present
Delete: deletes (key,value) pair, given the key
Find: returns the value associated to a given key
Set: reassigns a new value to a (existing) given key
usually implemented as a (balanced) binary search tree, or a hash table ("unordered")
Datastructuren
Priority Queues
ADT Priority Queue
ADT priority queue
Initialize: construct an empty queue.
IsEmpty: check whether there are any elements in the queue.
Size: returns the number of elements.
Insert: given a data element with its priority, it is added to the queue.
DeleteMax: returns a data element with maximal priority, and deletes it.
GetMax: returns a data element with maximal priority.
IncreaseKey: given an element and its position in the queue, it is assigned a higher priority.
Meld, or Union: takes two priority queues and returns a new priority queue containing the data elements from both.
Datastructuren
Priority Queues
ADT Priority Queue
min & max queues
max-queue ≥
Initialize, IsEmpty, Size, Insert, DeleteMax, GetMax,IncreaseKey, Meld
min-queue ≤
Initialize, IsEmpty, Size, Insert, DeleteMin, GetMin,DecreaseKey, Meld
do note which ordering is in force (min or max)
entries often carry data as well (not only a priority)
Datastructuren
Priority Queues
ADT Priority Queue
priority queue - use cases
sorting (heapsort)
graph algorithms (Dijkstra shortest path, Prim’s algorithm)
compression (Huffman)
operating systems: task queue, print job queue
discrete event simulation
Datastructuren
Priority Queues
ADT Priority Queue
implementations
              Binary     Leftist    Pairing     Fibonacci   Brodal
GetMax        Θ(1)       Θ(1)       Θ(1)        Θ(1)        Θ(1)
Insert        O(log n)   Θ(log n)   Θ(1)        Θ(1)        Θ(1)
DeleteMax     Θ(log n)   Θ(log n)   O(log n)†   O(log n)†   O(log n)
IncreaseKey   Θ(log n)   Θ(log n)   O(log n)†   Θ(1)†       Θ(1)
Meld          Θ(n)       Θ(log n)   Θ(1)        Θ(1)        Θ(1)

† amortized complexity
“. . . is based on heap ordered trees where [. . . ] nodes may violateheap order.” “The data structure presented is quite complicated.”
Datastructuren
Priority Queues
Binary Heap
binary search tree vs heap order
[Figure: a binary search tree (left) versus a heap (right): in a search tree every key in the left subtree is smaller and every key in the right subtree larger than the node's key; under heap order every parent is at least as large as its children.]
Datastructuren
Priority Queues
Binary Heap
representing binary tree with an array
root at index 1; the children of node i sit at indices 2i and 2i + 1 (its parent at ⌊i/2⌋).
[Figure: a binary tree and its array representation. In binary, a node's index spells the root-to-node path: 1, 10, 11, 100, 101, . . .]

index:  1  2  3  4  5  6  7  8  9 10 11 12
value: 33 42 17  8 24  3  3 98 55 10 19  5
works well for complete binary trees; wastes space when nodes are ‘missing’
Datastructuren
Priority Queues
Binary Heap
binary heap: three levels
functioning: abstract (priority queue)
understanding: binary tree
implementation: array
internal operations (change key at position): bubble up, trickle down
“To add an element to a heap we must perform an up-heap operation(also known as bubble-up, percolate-up, sift-up, trickle-up, swim-up,heapify-up, or cascade-up), . . . ” What’s in a name? [Wikipedia]
Datastructuren
Priority Queues
Binary Heap
increasekey / bubble up
[Figure: IncreaseKey sets the key at index 12 to 71. BubbleUp swaps 71 with its parent 17 (index 6) and then with 55 (index 3), stopping below the root 98.
Array before: 98 57 55 42 24 17 3 8 33 10 19 71 13
Array after:  98 57 71 42 24 55 3 8 33 10 19 17 13]
BubbleUp : swap with parent until heap-ordered
Datastructuren
Priority Queues
Binary Heap
decreasekey / trickle down
[Figure: DecreaseKey sets the root to 37. TrickleDown swaps 37 with its largest child 57 (index 2) and then with 42 (index 4).
Array before: 37 57 55 42 24 17 3 8 33 10 19 5 13
Array after:  57 42 55 37 24 17 3 8 33 10 19 5 13]
TrickleDown : swap with largest child until heap-ordered
Datastructuren
Priority Queues
Binary Heap
Insert to priority queue
[Figure: Insert 29. The new key is appended at index 14 and bubbles up once, swapping with 3 (index 7).
Array before: 98 57 55 42 24 17 3 8 33 10 19 5 13 29
Array after:  98 57 55 42 24 17 29 8 33 10 19 5 13 3]
Insert: add as last, BubbleUp
Datastructuren
Priority Queues
Binary Heap
DeleteMax from priority queue
[Figure: DeleteMax. The maximum 98 is removed; the last element 13 moves to the root and trickles down, swapping with 57, then 42, then 33.
Array after: 57 42 55 33 24 17 3 8 13 10 19 5]
DeleteMax: move last element to root, trickleDown
Datastructuren
Priority Queues
Binary Heap
heapify (1)
[Figure: bottom-up heapify, first phase. Starting array: 33 42 17 8 24 13 3 98 57 10 19 5 55. TrickleDown at index 6 (13 ↔ 55) and index 4 (8 ↔ 98) gives 33 42 17 98 24 55 3 8 57 10 19 5 13.]

TrickleDown each internal node, bottom-up: swap with largest child until heap-ordered
Datastructuren
Priority Queues
Binary Heap
heapify (2)
[Figure: heapify, second phase. TrickleDown at index 3 (17 ↔ 55) and index 2 (42 ↔ 98, then 42 ↔ 57) gives 33 98 55 57 24 17 3 8 42 10 19 5 13.]
Datastructuren
Priority Queues
Binary Heap
heapify (3)
[Figure: heapify, final phase. TrickleDown at the root: 33 swaps with 98, then 57, then 42, giving the heap 98 57 55 42 24 17 3 8 33 10 19 5 13.]
Datastructuren
Priority Queues
Binary Heap
complexity heapify
Lemma: ∑_{d=0}^{h} d·2^d = (h − 1)·2^{h+1} + 2

n levels, N = 2^n − 1 keys

top-down: ∑_{ℓ=0}^{n−1} 2^ℓ·ℓ = (n − 2)·2^n + 2 ≈ N·lg N

bottom-up: ∑_{ℓ=0}^{n−1} 2^ℓ·(n − 1 − ℓ) = (n − 1)·∑_{ℓ=0}^{n−1} 2^ℓ − ∑_{ℓ=0}^{n−1} 2^ℓ·ℓ = 2^n − n − 1

which is O(N)
Datastructuren
Priority Queues
Leftist heaps
leftist heaps
npl(x): nil path length (“bladafstand”), the shortest distance from x down to an external leaf
Definition (Leftist tree)
An (extended) binary tree where for each internal node x,npl(left(x)) ≥ npl(right(x)).
Definition (Leftist heap)
A leftist tree where the priorities satisfy the heap order.
structure vs. node order
Datastructuren
Priority Queues
Leftist heaps
leftist tree (structure)
npl(left(x)) ≥ npl(right(x))
[Figure: example trees with every node labelled by its npl value, illustrating the condition npl(left(x)) ≥ npl(right(x)).]
Datastructuren
Priority Queues
Leftist heaps
basic (internal) operation: ZIP
[Figure: Zip merges two heaps with roots a and b, a ≥ b: a keeps its left subtree, and the heap rooted at b is zipped recursively into a's right subtree.]
Datastructuren
Priority Queues
Leftist heaps
example (step 1: recursive Zipping)
[Figure: step 1, merging the heaps rooted at 38 and 35 by recursively zipping along their right spines.]
Datastructuren
Priority Queues
Leftist heaps
example (step 2: bottom-up swapping)
[Figure: step 2, walking back up the merge path and swapping children wherever npl(left) < npl(right).]
Datastructuren
Priority Queues
Leftist heaps
complexity
Lemma
Let T be a leftist tree with root v such that npl(v) = k. Then
(1) T contains at least 2^k − 1 (internal) nodes, and
(2) the rightmost path in T has exactly k (internal) nodes.

[Figure: a leftist tree with npl(root) = 3; its rightmost path has 3 internal nodes.]
Datastructuren
Priority Queues
Leftist heaps
priority queue operations: Insert
[Figure: Insert 27 = Zip with a one-node heap: the heap rooted at 38 is merged with the single node 27.]
Datastructuren
Priority Queues
Leftist heaps
priority queue operations: DeleteMax
[Figure: DeleteMax: remove the root 38 and Zip its two subtrees, rooted at 37 and 25.]
Datastructuren
Priority Queues
Double-ended Priority Queues
dual structure min-max heap
[Figure: the same items stored twice: in a min-heap (root 3) and in a max-heap (root 15), with pointers between the two copies of each item.]

Pointer from each min-heap item to the same item in the max-heap
Insertion: as in an ordinary heap, but twice: once in each heap
Deletion: find the item in the other heap using the pointer, move the last element to that position and do a normal deletion
Datastructuren
Priority Queues
Double-ended Priority Queues
interval heap
2-92
8-80  11-75
17-69  42-70  44-73  14-39
24-33  23-65  55-60  44-50  54-57  61

each child interval is contained in its parent's: [8,80] ⊆ [2,92]
Datastructuren
Priority Queues
Double-ended Priority Queues
interval heap: insert
[Figure: inserting 80: it joins the single-element leaf (61 becomes 61-80); the new maximum then bubbles up along the max endpoints, past 73 and 75 and stopping below 92, giving 11-80, 44-75 and leaf 61-73.]
Datastructuren
Priority Queues
Double-ended Priority Queues
embedded min&max heap
[Figure: the interval heap read as two embedded heaps: the left endpoints 2, 8, 11, 17, 42, 44, 14, . . . form a min-heap, the right endpoints 92, 80, 75, 69, 70, 73, 39, . . . a max-heap.]
Datastructuren
Priority Queues
Double-ended Priority Queues
Double ended priority queue - use case
wikipedia
One example application of the double-ended priority queue isexternal sorting. In an external sort, there are more elementsthan can be held in the computer’s memory.
Datastructuren
Priority Queues
Double-ended Priority Queues
Quiz⁴

AVL tree (rooted at 71): add 7

Binary min-heap (rooted at 14): add 7

[Figures: the AVL tree and the min-heap on which the insertions are to be performed.]
⁴ A quiz is a brief assessment used in education to measure growth in knowledge, abilities, and/or skills. Wikipedia
Datastructuren
B-Trees
Contents
6 B-Trees
Definition & Insertion
Deleting Keys
Red-Black Trees
Datastructuren
B-Trees
AVL-tree and B-tree
[Figure: an AVL tree on 13 keys, with balance factors at the nodes; below, the corresponding B-tree:]
10 20 25 32 34 40 41 44 46 52 54 58 60
30 38 50 56
42
Datastructuren
B-Trees
multiway search tree
[Figure: a multiway node holds keys K1 K2 . . . Kℓ separating the subtrees T0 T1 . . . Tℓ]

T0 < K1 < T1 < · · · < Kℓ < Tℓ
Datastructuren
B-Trees
Definition & Insertion
B-tree (Bayer & McCreight, 1972)
Definition
A B-tree of order m is a multi-way search tree such that
every node has at most m children (contains at most m − 1 keys),
every node other than the root has at least ⌈m/2⌉ children (contains at least ⌈m/2⌉ − 1 keys),
the root contains at least one key, and
all leaves are on the same level of the tree.
Datastructuren
B-Trees
Definition & Insertion
B-tree of order 5
3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83
13 29 41 49 69 77
61
order m = 5: between ⌈m/2⌉ − 1 = 2 and m − 1 = 4 keys per node.
Datastructuren
B-Trees
Definition & Insertion
adding keys
Add the new key to a leaf.
When at maximal capacity, split leaf, move middle key up.Recurse.
Splits can reach the root. We then obtain a new root with asingle key.
Datastructuren
B-Trees
Definition & Insertion
adding a key (order 5)
10 20 25 32 38 40 41 44 50 56
30 42
+34
32 34 38 40 41
10 20 25 32 34 40 41 44 50 56
30 38 42
Datastructuren
B-Trees
Definition & Insertion
adding more keys (order 5)
10 20 25 32 34 40 41 44 50 56
30 38 42
+58,+60
44 50 56 58 60
10 20 25 32 34 40 41 44 50 58 60
30 38 42 56
Datastructuren
B-Trees
Definition & Insertion
adding even more keys (order 5)
10 20 25 32 34 40 41 44 50 58 60
30 38 42 56
+46,+52,+54
44 46 50 52 54
30 38 42 50 56
10 20 25 32 34 40 41 44 46 52 54 58 60
30 38 50 56
42
Datastructuren
B-Trees
Deleting Keys
deleting keys
For non-leaf nodes: swap the key with its predecessor (so the key to delete moves to a leaf)

Deleting from leaves:

If the leaf drops below minimal capacity, move a key from a sibling with surplus keys up to the parent, and move the separating key from the parent down into the underfull node. If no sibling has surplus keys: merge with a sibling, taking the separating key from the parent, and recurse on the parent.

Because of this recursion, deletion may reach the root, and can collapse a level.
Datastructuren
B-Trees
Deleting Keys
deleting keys (order 5)
OK
10 20 25 32 34 40 41 42
30 38
45
Datastructuren
B-Trees
Deleting Keys
deleting keys (order 5)
10 20 25 32 34 40 41 42
30 38
45
swap predecessor
40 41 ×
30 38
42
Datastructuren
B-Trees
Deleting Keys
deleting keys (order 5)
10 20 25 32
borrow(‘via’ parent)
34 40 41 42
30 38
45
10 20 × 30 34 40 41 42
25 38
45
Datastructuren
B-Trees
Deleting Keys
deleting, ctd (order 5)
10 20 25 32 34 40
underfull:merge brother
41
×
30 38
42
10 20 25 32 34 38 40
30
underfull:merge brother
× 50 56
42
new root 30 42 50 56
Datastructuren
B-Trees
Red-Black Trees
2-4-tree to red-black tree
[Figure: a 2-4-tree with leaves 20 | 37 40 41 | 44 50 and internal keys 30, 42, and the corresponding red-black tree rooted at black 42.]
Datastructuren
B-Trees
Red-Black Trees
2-4-tree vs red-black tree
[Figure: each 2-4-tree node becomes a black node with 0-2 red sons: a 3-key node a b c becomes black b with red sons a and c; a 2-key node a b becomes black b with red son a, or black a with red son b; a 1-key node a stays a single black node.]
Datastructuren
B-Trees
Red-Black Trees
red-black tree
Definition
A red-black tree is a
binary search tree
such that each node is either black or red, where
the root is black,
no red node is the son of another red node,
the number of black nodes on each path from root toextended leaf (NIL-pointers) is the same.
Datastructuren
B-Trees
Red-Black Trees
examples
[Figure: two example red-black trees, one rooted at 42 and one rooted at 40.]
Datastructuren
B-Trees
Red-Black Trees
fun fact
Every AVL-tree can be red-black coloured.

[Figure: an AVL tree annotated with subtree heights, together with a valid red-black colouring.]
Datastructuren
B-Trees
Red-Black Trees
insertion in red-black tree
Insert as red leaf. Red node with red parent then:
If uncle is red: flag-flip. Continue at grandparent.
If uncle is black: rotate (see AVL-trees), Repaint and Stop.
[Figure: red uncle u: a flag-flip recolours p and u black and g red; continue at grandparent g. Black uncle u: rotate so that p tops the subtree with x and g as its children, repaint, and stop.]
If the root has been coloured red, make it black.
Datastructuren
B-Trees
Red-Black Trees
just classical single/double rotation
[Figure: the classical rotations: a single rotation turns the chain 42-30-20 into 30 with children 20 and 42; a double rotation turns the zig-zag 42-30-40 into 40 with children 30 and 42.]
Datastructuren
B-Trees
Red-Black Trees
example: adding key
[Figure: adding key 35 to the tree rooted at 42: 35 enters as a red leaf under 37; repairing the red-red violations (a flag-flip, then a rotation) yields the tree rooted at 40.]
Datastructuren
B-Trees
Red-Black Trees
GNU C++ stl tree.h
“Red-black tree class, designed for use in implementing STLassociative containers (set, multiset, map, and multimap). Theinsertion and deletion algorithms are based on those in Cormen,Leiserson, and Rivest, Introduction to Algorithms (MIT Press,1990), except that . . . ”
Linux: “There are a number of red-black trees in use in the kernel. The anticipatory, deadline, and CFQ I/O schedulers all employ rbtrees to track requests; the packet CD/DVD driver does the same. The high-resolution timer code uses an rbtree to organize outstanding timer requests. The ext3 filesystem tracks directory entries in a red-black tree. Virtual memory areas (VMAs) are tracked with red-black trees, as are epoll file descriptors, cryptographic keys, and network packets in the ‘hierarchical token bucket’ scheduler.” lwn.net/Articles/184495/
Datastructuren
Graphs
Contents
7 Graphs
Definition
Representation
Graph traversal
Disjoint Sets, ADT Union-Find
Minimal Spanning Trees
Shortest Paths
Datastructuren
Graphs
Definition
graph definition
see Algoritmiek!
Definition
A graph is a pair G = (V,E) where:
V is a set of vertices, or nodes
E ⊆ V × V is a set of edges (also called arcs or lines)
directed / undirected
vertices / edges can have labels (string, number)
complexity is measured in |V| and |E|; note |E| ≤ |V|²
Datastructuren
Graphs
Representation
adjacency matrix
[Figure: a directed graph on vertices 1-7]

    1 2 3 4 5 6 7
1   · 1 · · · 1 ·
2   · · 1 · · · ·
3   · · · · 1 1 1
4   · · · · · · ·
5   · · · 1 · 1 ·
6   1 · · · · · 1
7   1 · · · · · ·
Datastructuren
Graphs
Representation
adjacency lists
[Figure: the same graph as adjacency lists]

1 → 2, 6
2 → 3
3 → 5, 6, 7
4 → Λ
5 → 4, 6
6 → 1, 7
7 → 1
Datastructuren
Graphs
Graph traversal
depth first search
Recursive DFS
void DFS(v)
{ visit(v)
  mark(v)
  for each w adjacent to v
  do if w is not marked
       DFS(w)
     fi
  od
}

Iterative DFS
// start with unmarked nodes
S.push(init)                  // S.push((init,init))
while S is not empty
do v = S.pop()                // (p,v) = S.pop()
   if v is not marked
   then mark v
        // add (p,v) to DFS tree (if p != v)
        for all edges (v, w)
        do if w is unmarked then
             S.push(w)        // S.push((v,w))
           fi
        od
   fi
od
Datastructuren
Graphs
Graph traversal
dfs tree (directed)
[Figure: a directed graph on vertices a-g and two possible DFS trees, with the visiting order at the nodes and the non-tree edges labelled back, forward and cross.]

DFS tree with edge classification (tree, back, cross, forward); not unique
Datastructuren
Graphs
Graph traversal
dfs edges
[Figure: edge classification on numbered DFS trees: a directed graph can produce forward, back and cross edges; in an undirected graph only back edges occur besides tree edges.]
Datastructuren
Graphs
Graph traversal
applications of DFS
topological sorting
articulation points
A DFS traversal itself and the forest-like representation of the graph itprovides have proved to be extremely helpful for the development of efficientalgorithms for checking many important properties of graphs. Note thatthe DFS yields two orderings of vertices: the order in which the vertices arereached for the first time (pushed onto the stack) and the order in whichthe vertices become dead ends (popped off the stack). These orders arequalitatively different, and various applications can take advantage of eitherof them. [Levitin, Design & Analysis of Algorithms]
Datastructuren
Graphs
Graph traversal
application: topological sort
[Figure: a DAG on vertices 1-8 and one topological ordering of its vertices.]
Datastructuren
Graphs
Graph traversal
Let G = (V,E) be a directed graph.
Definition
A topological ordering [or sort] of G is an ordering (v1, . . . , vn) ofV , such that if (vi, vj) ∈ E then i < j.
finding a topological sort:
1 pick node without incoming edges
2 remove outgoing edges from that node and go to step 1.
(or use depth-first search)
Datastructuren
Graphs
Graph traversal
application: topological sort
[Figure: the DAG with (pre, post) DFS numbers at every vertex; listing the vertices by decreasing post-number gives a topological sort.]
Datastructuren
Graphs
Graph traversal
application: articulation points
[Figure: an undirected graph on vertices 1-13, with its articulation points marked.]
Datastructuren
Graphs
Graph traversal
articulation points with dfs tree
[Figure: the graph and its DFS tree, with back edges drawn dashed.]

vertex v is an articulation point if
- v is the root and has two or more children, or
- v is not the root and has a subtree in which no node has a back edge reaching above v
Datastructuren
Graphs
Graph traversal
breadth-first search (BFS)
Iterative BFS
// Q is a queue of vertices
// start with unmarked nodes
mark init
Q.enqueue(init)
dist[init] = 0
while Q is not empty
do v = Q.dequeue()
   newdist = dist[v] + 1
   for all edges (v, w)
   do if w is not marked
      then mark w
           dist[w] = newdist
           Q.enqueue(w)
      fi
   od
od
Datastructuren
Graphs
Graph traversal
bfs: ’floodfill’
[Figure: BFS ‘floodfill’ on a grid: every cell is labelled with its distance 0, 1, 2, . . . from the start cell.]
Datastructuren
Graphs
Minimal Spanning Trees
minimal spanning tree
[Figure: a weighted graph on vertices A-H and a minimal spanning tree of it (7 edges, total weight 21).]
Definition (Minimal spanning tree of weighted graph)
A tree containing all nodes of the graph, with minimal total sum ofedge weights
Datastructuren
Graphs
Minimal Spanning Trees
minimal spanning tree Kruskal vs. Prim
[Figure: the spanning tree growing edge by edge: Kruskal repeatedly adds the globally cheapest edge that closes no cycle; Prim grows a single tree outward from one vertex.]
Datastructuren
Graphs
Minimal Spanning Trees
minimal spanning tree - Kruskal
High-level algorithm:
repeat
  consider edge with smallest weight
  if it does not yield a cycle
    add it to the tree
  otherwise discard the edge
until no edges left
Datastructuren
Graphs
Minimal Spanning Trees
spanning tree
[Figure: a partially built spanning tree on vertices 1-10; candidate edges are marked ‘?’.]

only edges that do not cause a cycle may be added — use the union-find ADT
Datastructuren
Graphs
Minimal Spanning Trees
partition of the domain D = {1, 2, . . . , n}; each set has a name, a representative
Initialize: construct the initial partition; each componentconsists of a singleton set {d}, with d ∈ D.
Find: retrieves the name of the component, i.e., Find(u) = Find(v) iff u and v belong to the same set in the partition.
Union: given two elements u and v, the sets they belong toare merged. Has no effects when u and v already belong tothe same set.Usually it is assumed that u, v are representatives, i.e., namesof components, not arbitrary elements.
Datastructuren
Graphs
Minimal Spanning Trees
Union-Find implementation with path-compression
element: 1 2 3 4 5 6 7 8 9 10
parent:  1 2 1 4 5 9 6 5 9 9
size:    2 1 . 1 2 . . . 4 .   (stored at roots only)

[Figure: the corresponding forest, with roots 1, 2, 4, 5 and 9.]
Datastructuren
Graphs
Minimal Spanning Trees
Union-Find implementation with path-compression
[Figure: Union links the root of one tree below the root of the other; path compression lets every node on a Find path point directly to the root.]
Datastructuren
Graphs
Minimal Spanning Trees
minimal spanning tree - Kruskal
detailed algorithm with priority queue and union-find ADTs: Kruskal

KRUSKAL(G):
  A = emptyTree
  PQ = empty
  foreach vertex v:
    MAKE-SET(v)
  foreach edge (u,v):
    PQ.insert( weight(u,v), (u,v) )
  repeat until PQ is empty:
    (u,v) = PQ.DELETE-MIN()
    if (FIND-SET(u) != FIND-SET(v)) then
      A.add((u,v))
      UNION(u,v)
    fi
  return A
Datastructuren
Graphs
Minimal Spanning Trees
Algorithms from the Book
116 union-find Galler and Fischer
109 Knuth-Morris-Pratt pattern matching
 94 Blum, Floyd, Pratt, Rivest, Tarjan median
 89 binary search
 84 Floyd-Warshall all-pairs shortest path
 79 Euclidean algorithm greatest common divisor (GCD)
 73 quicksort Tony Hoare
 59 Huffman coding data compression
 51 Miller-Rabin primality test
 50 Schwartz-Zippel lemma polynomial identity
 46 depth first search
 42 sieve of Eratosthenes primes
 42 Dijkstra shortest path

3.11.'19
Datastructuren
Graphs
Minimal Spanning Trees
Prim
cost[source] = 0               // infinite for other nodes
prev[source] = 0               // code for the root
Q = V                          // all vertices
while Q is not empty
do u is node in Q with minimal cost[u]
   remove u from Q
   for each edge (u,v) with v outside tree
   do if length(u,v) < cost[v]
      then cost[v] = length(u,v)
           prev[v] = u
      fi
   od
od
high-level algorithm:

initialize the tree with an arbitrarily chosen node

repeat until all vertices are connected: link the unconnected node attached by the edge of minimum weight
Datastructuren
Graphs
Minimal Spanning Trees
directed graphs are not supported

[Figures: three-node directed counterexamples on which Prim, respectively Kruskal, fails to find the cheapest set of arcs.]
Datastructuren
Graphs
Shortest Paths
Dijkstra
1  dist[source] = 0            // infinite for other nodes
2  prev[source] = 0            // code for the root
3  PQ = V                      // all nodes
4  while PQ is not empty
5  do u = node in PQ with minimal dist[u]
6     remove u from PQ
7     for each edge (u,v)
8     do newdist = dist[u] + length(u,v)
9        if newdist < dist[v]
10       then dist[v] = newdist
11            prev[v] = u
12       fi
13    od
14 od
finds shortest paths from a fixed source node to all other nodes; also usable for the shortest path from source to a target node
Datastructuren
Graphs
Shortest Paths
distance vs. bottleneck
[Figure: the weighted example graph A-H, labelled per vertex with the shortest-path distance from C (A:2 B:4 C:0 D:6 E:11 F:8 G:10 H:12), and a second copy labelled with bottleneck values.]
Datastructuren
Graphs
Shortest Paths
all pairs distance
L_k(i, j) = min( L_{k−1}(i, j), L_{k−1}(i, k) + L_{k−1}(k, j) )

Floyd-Warshall
// initially dist equals the adjacency matrix
for each edge (i,j)
do prev[i,j] = i
od
for k from 1 to n
do for i from 1 to n
do for j from 1 to n
do if dist[i,k] + dist[k,j] < dist[i,j]
then dist[i,j] = dist[i,k] + dist[k,j]
prev[i,j] = prev[k,j]
fi
od
od
od
Datastructuren
Graphs
Shortest Paths
example Floyd
[Figure: a small weighted directed graph (arc weights 9, 3, 6, 2) used in the Floyd example below.]
Datastructuren
Graphs
Shortest Paths
Floyd
partial result A3, and distances via node 4
A3 =
   0   2   1   6
   3   0   1   4
   4   1   0   5
  −2   0  −1   ·

distances via node 4, i.e. A3[i,4] + A3[4,j]:
   ·   6+0  6−1   6
  4−2   ·   4−1   4
  5−2  5+0   ·    5
  −2    0   −1    ·
Datastructuren
Graphs
Shortest Paths
path reconstruction
Path-reconstruction
Path(u, v)
if prev[u][v] = null then
return []
path = [v]
while u != v do
v = prev[u][v]
path.insert_at_begin(v)
od
return path
Datastructuren
Graphs
Shortest Paths
Warshall// initially conn equals the adjacency matrix
// with additionally 1=true on the diagonal
for k from 1 to n
do for i from 1 to n
do for j from 1 to n
do conn[i,j] = conn[i,j] or ( conn[i,k] and conn[k,j] )
od
od
od
Datastructuren
Hash Tables
Contents
8 Hash Tables
Perfect Hash Function
Open Addressing
Chaining
Choosing a hash function
Datastructuren
Hash Tables
ADT map, dictionary
associative array, [hash-]map, symbol table, or dictionary wiki
is composed of a collection of (key, value) pairs;each possible key appears at most once
                find        insert      delete      ordered
unordered list  n           1           n           no
binary tree     log n / n   log n / n   log n / n   yes
balanced        log n       log n       log n       yes
hash table      n           n           n           no

(hash table: worst case; averages on the next slide)
Datastructuren
Hash Tables
ADT map, dictionary
associative array, [hash-]map, symbol table, or dictionary wiki
is composed of a collection of (key, value) pairs;each possible key appears at most once
                find        insert      delete      ordered
unordered list  n           1           n           no
binary tree     log n / n   log n / n   log n / n   yes
balanced        log n       log n       log n       yes
hash table      1 / n       1 / n       1 / n       no

entries are average / worst case
Datastructuren
Hash Tables
Hashing
Store keys of arbitrary size (usually large domains) in table offixed size (usually small)
Hash table: ADT that performs finds, insertions and deletionsin (on avg) constant time
Used to implement unordered sets and maps (C++ STL, Java), to store passwords, and for checksums (MD5, CRC32)

Hash function calculates a position in the table: h(k) mod TableSize
Collision: attempt to store key k when h(k) is occupied
Collision resolution
perfect hashing: Keys are known a-priori; can avoid collisions
open addressing: Collision resolved by storing key elsewhere
chained hashing: Store multiple keys at the same address(i.e. table entries are linked lists of items with same hash)
Datastructuren
Hash Tables
Perfect Hash Function
Cichelli
h(w) = |w| + v(first(w)) + v(last(w)), with v defined by:

letter  a  b  c  f  g  h  i  l  m  n  p  r  s  t  u  v  w  y
v      11 15  1 15  3 15 13 15 15 13 15 14  6  6 14 10  6 13
Value for other letters: 0
[Figure: the resulting table, positions 2-37, holds the Pascal reserved words (do, end, else, case, downto, goto, to, otherwise, type, while, const, div, and, set, or, of, mod, file, record, packed, not, then, procedure, with, repeat, var, in, array, if, nil, for, begin, until, label, function, program) without collisions.]
h(goto) = |goto|+ v(g) + v(o) = 4 + 3 + 0 = 7
h(const) = |const| + v(c) + v(t) = 5 + 1 + 6 = 12
Datastructuren
Hash Tables
Open Addressing
Open Addressing
If insert causes collision, attempt to store hash elsewhere
Extend hash function with extra parameter i, the number ofthe attempt to store the key
General structure of hash function:h(k, i) = (g(k)− f(i)) mod TableSize
Linear probing: f(i) is linear (in i), i.e. f(i) = i
Quadratic probing: f(i) is quadratic (in i), i.e. f(i) = i2
Double hashing: h(k) = (g(k)− i ∗ f(k)) mod TableSize
Insert, find and remove operations
Insert/Find: probe at h(k, 0), h(k, 1), h(k, 2), ...
Delete: keep tag for each cell: active, deleted, empty
Datastructuren
Hash Tables
Open Addressing
linear
keys 605, 297, 748, 385, 198, 231 and 407
address function g(K) = K mod 11
probe function f(i) = i
hash function h(K, i) = ((K mod 11) − i) mod 11
Datastructuren
Hash Tables
Open Addressing
linear
keys 605, 297, 748, 385, 198, 231 and 407
address function g(K) = K mod 11
probe function f(i) = i
hash function h(K, i) = ((K mod 11) − i) mod 11

[Figure: successive states of the size-11 table while the keys are inserted; a colliding key probes downward, one cell at a time, to the next free position.]
Datastructuren
Hash Tables
Open Addressing
linear step size f(i) = 3i
h(K, i) = ((K mod 10) − 3i) mod 10

[Figure: the keys 65, 32, 43, 55, 72, 19 placed in a size-10 table.]

Neighbors (relative to step size 3): 0 3 6 9 2 5 8 1 4 7 — read in that order, the cluster is 65 43 72 19 32 55
Primary clustering: keys with “nearby” hash in same cluster
Careful! Pick step size coprime to table size, if not, insertscan fail even if table is not full: not all positions are probed
Datastructuren
Hash Tables
Open Addressing
quadratic
keys 605, 297, 748, 385, 198, 231 and 407
address function h(K) = K mod 11
probe function f(i) = ±i²
hash function h(K, i) = (h(K) ± i²) mod 11
probes at h(K) ± 1, h(K) ± 4, h(K) ± 9, . . .
Datastructuren
Hash Tables
Open Addressing
quadratic
keys 605, 297, 748, 385, 198, 231 and 407
address function h(K) = K mod 11
probe function f(i) = ±i²
hash function h(K, i) = (h(K) ± i²) mod 11
probes at h(K) ± 1, h(K) ± 4, h(K) ± 9, . . .

[Figure: table states with quadratic probing of the same keys.]
Secondary clustering: only keys with same hash cluster
Datastructuren
Hash Tables
Open Addressing
double
keys 605, 297, 748, 385, 198, 231 and 407
table size: 11
address function g(K) = K mod 11
probe function p(K) = (K mod 4) + 1
hash function h(K, i) = (g(K) − i · p(K)) mod 11
Datastructuren
Hash Tables
Open Addressing
double
keys 605, 297, 748, 385, 198, 231 and 407
table size: 11
address function g(K) = K mod 11
probe function p(K) = (K mod 4) + 1
hash function h(K, i) = (g(K) − i · p(K)) mod 11

[Figure: table states with double hashing; keys with the same home address follow different probe sequences.]
Minimize clustering: diff. probes even for keys with same hash
Datastructuren
Hash Tables
Open Addressing
expected number of probes at load factor α (ranges shown for α = 0.5-0.8):

            find / successful                  add / unsuccessful
linear      ½(1 + 1/(1−α))          1.5-3      ½(1 + 1/(1−α)²)            2.5-13
quadratic   1 + ln(1/(1−α)) − α/2   1.4-2.2    1/(1−α) − α + ln(1/(1−α))  2.2-5.8
double      (1/α)·ln(1/(1−α))       1.4-2.0    1/(1−α)                    2-5

[Figure: expected probes plotted against load factor α for linear, quadratic and double hashing; all curves blow up as α → 1.]
Datastructuren
Hash Tables
Chaining
chaining
keys 605, 297, 748, 385, 198, 231 and 407
table size: 11
address function h(K) = K mod 11

[Figure: a size-11 table of linked lists; colliding keys are chained together at their home address.]
Datastructuren
Hash Tables
Choosing a hash function
A good hash function h(K) should
be fast to compute, and
evenly and deterministically distribute the keys over the table
depend on all “distinctive bits” of the key K
Techniques:
• extraction: compute address based on selected bits of key
• division: address = key mod TSize, choose TSize carefully
• folding: chop key into parts, combine (add/xor) parts
• mid-squaring: square key and take middle bits
Datastructuren
Hash Tables
Choosing a hash function
MurmurHash
Murmur3_32(key, len, seed)
// integer arithmetic with unsigned 32 bit integers.
c1 := 0xcc9e2d51
c2 := 0x1b873593
r1 := 15
r2 := 13
m := 5
n := 0xe6546b64
hash := seed
for each fourByteChunk of key
k := fourByteChunk
k := k * c1
k := (k << r1) OR (k >> (32-r1))
k := k * c2
hash := hash XOR k
hash := (hash << r2) OR (hash >> (32-r2))
hash := hash * m + n
with any remainingBytesInKey
// (also do endian swapping on big-endian machines)
remainingBytes := remainingBytesInKey * c1
remainingBytes := (remainingBytes << r1) OR (remainingBytes >> (32 - r1))
remainingBytes := remainingBytes * c2
hash := hash XOR remainingBytes
hash := hash XOR len
hash := hash XOR (hash >> 16)
hash := hash * 0x85ebca6b
hash := hash XOR (hash >> 13)
hash := hash * 0xc2b2ae35
hash := hash XOR (hash >> 16)
Datastructuren
Data Compression
Contents
9 Data Compression
Huffman Coding
Lempel-Ziv-Welch
Burrows-Wheeler
Datastructuren
Data Compression
lossless (omkeerbaar)
GIF

[Figure: a small image whose pixels are colour-table indices 0, 1, 2; long runs of equal indices make the image compress well.]
Datastructuren
Data Compression
lossless (omkeerbaar)
Huffman vs. LZW coding
prefix code trie
[Figure: left, a Huffman prefix-code trie with the letters at the leaves and 0/1 on the edges (e ↦ 011, f ↦ 10); right, an LZW dictionary trie with letters on the edges and code numbers in the nodes (cb ↦ 7, aaa ↦ 11).]
Datastructuren
Data Compression
lossless (omkeerbaar)
Huffman vs. LZW coding
[Figure: the same two tries side by side.]

Huffman                          LZW
frequencies given                self-learning
single letter ↦ variable bits    variable string ↦ fixed length
prefix code-tree                 trie
- letters as leaves              - letters along edges
- bits left/right                - code in node
store/send the code              the decoder learns the code too
Datastructuren
Data Compression
va-kan-tie-oord
Morse
[Figure: the Morse code tree: dot = left branch, dash = right branch; the frequent letters E and T sit closest to the root, then I A N M, down to H V F L P J B X C Y Z Q.]

BYOXO: Are you trying to weasel out of our deal?
tks om bv cu 73: thanks, old-man, bon-voyage, see-you, best regards
Datastructuren
Data Compression
Huffman Coding
frequencies a:19, b:8, c:9, d:9, e:7

[Figure: the Shannon-Fano tree (top-level split 19+9 versus 9+8+7) and the Huffman tree for these frequencies.]

Shannon-Fano: 2·(19 + 9 + 9) + 3·(8 + 7) = 119
Huffman: 1·19 + 3·(9 + 9 + 8 + 7) = 118 :)
Datastructuren
Data Compression
Huffman Coding
David Albert Huffman (1925-1999)
photos: 1978, UCSC (maa.org); 1991, Matthew Mulbry (SciAm / huffmancoding.com)
Datastructuren
Data Compression
Huffman Coding
Huffman (1952)
variable length code (bitstring) for single letters: a1, . . . , an ∈ Σ ↦ w1, . . . , wn ∈ {0, 1}*

based on character frequencies (known in advance): f1, . . . , fn

optimal expected code length (for a prefix code): ∑_{i=1}^{n} f_i · |w_i|

code has to be known by the decoder

the ‘old’ Shannon-Fano algorithm does not always produce an optimal code
Datastructuren
Data Compression
Huffman Coding
Huffman
// initialize:
for each input letter: create tree with that letter
                       and its frequency
repeat until one tree left:
  take two trees of minimal frequencies
  join these as children in a new tree,
  with combined (summed) frequency
Datastructuren
Data Compression
Huffman Coding
[Figure: Huffman construction for a:18, b:8, c:12, d:13, e:7, f:21: join e+b = 15, then c+d = 25, then 15+a = 33, then f+25 = 46, and finally 33+46 = 79.]
Datastructuren
Data Compression
Huffman Coding
choices, choices, . . .

a:10, b:5, c:5, d:5

[Figure: two different optimal code trees for these frequencies.]

balanced: 2·10 + 2·5 + 2·5 + 2·5 = 50
skewed: 1·10 + 2·5 + 3·5 + 3·5 = 50
Datastructuren
Data Compression
Lempel-Ziv-Welch

Ziv-Lempel & Welch (1977, 1984)

fixed length code for repeating patterns in the input: x1, . . . , xn ∈ Σ* ↦ w1, . . . , wn ∈ {0, 1}^k

the strings xi and their codes are learned while reading the input

the code is also learned by the decoder and does not have to be transmitted
Datastructuren
Data Compression
Lempel-Ziv-Welch

Ziv-Lempel & Welch - compression

ZLW-compress
initialize dict with codes for single characters
w = "";
while ( not end of input )
do
read next character c
if w+c exists in the dict
w = w+c;
else
add to dict: w+c;
output code(w);
w = c;
fi
od
output code(w)
Datastructuren
Data Compression
Lempel-Ziv-Welch
input abab cbab abaa aa

w    c  w+c in dict?  output  new code
-    a  yes (w := a)
a    b  no            1       4 ↦ ab
b    a  no            2       5 ↦ ba
a    b  yes
ab   c  no            4       6 ↦ abc
c    b  no            3       7 ↦ cb
b    a  yes
ba   b  no            5       8 ↦ bab
b    a  yes
ba   b  yes
bab  a  no            8       9 ↦ baba
a    a  no            1       10 ↦ aa
a    a  yes
aa   a  no            10      11 ↦ aaa
a    ⊥                1

dictionary: 0 (end), 1 a, 2 b, 3 c, 4 ab, 5 ba, 6 abc, 7 cb, 8 bab, 9 baba, 10 aa, 11 aaa
Datastructuren
Data Compression
Lempel-Ziv-Welch

Ziv-Lempel & Welch - decompression

Decoding 1 2 4 3 5 8 1 10 1:

code  text  new code
            1, 2, 3 ↦ a, b, c   (initialization)
1     a                          we learn each new code one step late
2     b     4 ↦ ab               last text + first letter of current text
4     ab    5 ↦ ba
3     c     6 ↦ abc
5     ba    7 ↦ cb
8     bab   8 ↦ bab              the new code arrives too late! it must be of the form last text (ba) + its first letter (b)
1     a     9 ↦ baba
10    aa    10 ↦ aa              too late again
1     a     11 ↦ aaa
Datastructuren
Data Compression
Lempel-Ziv-Welch

Ziv-Lempel & Welch - decompression

ZLW-decompress
initialize dict with codes for single characters
read first code in variable prev and output str(prev)
while( not end of input )
read w;
if w exists in the dict
output str(w);
add to dict: str(prev) + firstchar(str(w));
else
// special case
output str(prev) + firstchar(str(prev));
add to dict: str(prev) + firstchar(str(prev));
fi
prev = w;
od
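The decoder can be sketched in Python as well (the name `lzw_decompress` is my own; the special case is the "code arrives one step too late" situation from the worked example):

```python
def lzw_decompress(codes, alphabet="abc"):
    """LZW decompression; rebuilds the dictionary one step behind the encoder."""
    strings = {i + 1: ch for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    prev = codes[0]
    out = [strings[prev]]
    for w in codes[1:]:
        if w in strings:
            entry = strings[w]
        else:
            # special case: code not learned yet; it must be
            # last text + its own first letter
            entry = strings[prev] + strings[prev][0]
        out.append(entry)
        strings[next_code] = strings[prev] + entry[0]  # learn the delayed code
        next_code += 1
        prev = w
    return "".join(out)

print(lzw_decompress([1, 2, 4, 3, 5, 8, 1, 10, 1]))  # ababcbababaaaa
```

Running the decoder on the encoder's output recovers the original input, which is exactly the round trip the slides promise.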
Datastructuren
Data Compression
Burrows-Wheeler
a little trick
MISSISSIPPI 7→ SSMP-PISSIII
Datastructuren
Data Compression
Burrows-Wheeler
MISSISSIPPI- , rotate

 1  M I S S I S S I P P I -
 2  I S S I S S I P P I - M
 3  S S I S S I P P I - M I
 4  S I S S I P P I - M I S
 5  I S S I P P I - M I S S
 6  S S I P P I - M I S S I
 7  S I P P I - M I S S I S
 8  I P P I - M I S S I S S
 9  P P I - M I S S I S S I
10  P I - M I S S I S S I P
11  I - M I S S I S S I P P
12  - M I S S I S S I P P I

alphabetize, read off the last column

 8  I P P I - M I S S I S S
 5  I S S I P P I - M I S S
 2  I S S I S S I P P I - M
11  I - M I S S I S S I P P
 1  M I S S I S S I P P I -
10  P I - M I S S I S S I P
 9  P P I - M I S S I S S I
 7  S I P P I - M I S S I S
 4  S I S S I P P I - M I S
 6  S S I P P I - M I S S I
 3  S S I S S I P P I - M I
12  - M I S S I S S I P P I
Datastructuren
Data Compression
Burrows-Wheeler
decode

last column (SSMP-PISSIII) next to the first column, each occurrence ranked:

S1  I1
S2  I2
M1  I3
P1  I4
-   M1
P2  P1
I1  P2
S3  S1
S4  S2
I2  S3
I3  S4
I4  -

in every row the last-column symbol immediately precedes the first-column symbol in the text; start after the marker - and follow the chain:

M1 I3 S4 S2 I2 S3 S1 I1 P2 P1 I4, i.e. MISSISSIPPI
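A small Python sketch of both directions (names `bwt` and `inverse_bwt` are my own). Note one assumption that differs from the slide: Python sorts the marker '-' before the letters, so the transformed string comes out as IPSSM-PISSII rather than SSMP-PISSIII; the round trip still works either way:

```python
def bwt(s):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last, sentinel="-"):
    """Naive inversion: repeatedly prepend the last column and re-sort;
    after n rounds the table holds all rotations, and the row ending in
    the sentinel is the original string."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(sentinel))

s = "MISSISSIPPI-"
print(bwt(s))                    # IPSSM-PISSII ('-' sorts first in Python)
print(inverse_bwt(bwt(s)) == s)  # True
```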
Datastructuren
Data Compression
Burrows-Wheeler
Quiz
Add final step for Floyd
A3 =
  0   2   1   6
  3   0   1   4
  4   1   0   5
 −2   0  −1   0
Datastructuren
Pattern Matching
Contents
10 Pattern MatchingKnuth-Morris-PrattAho-CorasickComparing texts
Datastructuren
Pattern Matching
naive
[Figure, six frames: naive matching of P = ABCABABC in T = ABABCABCABABCC. At each shift the characters are compared left to right (↑ = match, × = mismatch); on a mismatch the pattern shifts one position right. Comparisons per shift: 3, 1, 6, 1, 1, and finally 8 for the complete match at position 6.]
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
match pattern against itself

T = . . . ABCABAB? . . .
P = ABCABAB×

[Figure: after the mismatch at pattern position 8, P is slid along its own matched part; restarting at positions 2, 3, 4 and 5 fails immediately, while position 6 aligns the prefix AB with the matched suffix AB, so comparison can resume at pattern position 3.]
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
linear-time algorithm (1970, 1977)
Donald Knuth, Vaughan Pratt, and James H. Morris
failure links
linear time preprocessing
search never backs up in the text
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
[Figure: matching each prefix of the pattern against itself to find, for k = 2, . . . , 7, the longest proper prefix matching the characters just before position k; this yields FLink[k] = 1, 1, 1, 2, 3, 2.]

k        1 2 3 4 5 6 7 8
P[k]     A B C A B A B C
FLink[k] 0 1 1 1 2 3 2 3

at position k: the maximal r < k such that P1 . . . Pr−1 = Pk−r+1 . . . Pk−1

on a mismatch at position k, continue at position FLink[k] (and the same position in the Text)

FLink[k] = 0: next position in Text, first position in Pattern
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
k        1 2 3 4 5 6 7 8
P[k]     A B C A B A B C
FLink[k] 0 1 1 1 2 3 2 3

[Figure: the pattern drawn as a chain of states 0-9: a match advances to the next state and skips to the next letter in the text, a mismatch follows the fail link.]
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
KMP search
// using failure links
Pos = 1 // position in pattern
TPos = 1 // position in text
while ((Pos <= PatLen) and (TPos <= TextLen)) do
if (P[Pos] == Text[TPos]) then
Pos ++;
TPos ++;
else
Pos = FLink[Pos]
if (Pos == 0) then
// start from scratch at next position in text
Pos = 1
TPos ++;
fi
fi
od
if (Pos > PatLen) then
Pattern found in text at position TPos-PatLen
fi
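The search can be sketched in Python, staying close to the 1-indexed pseudocode; the FLink table is the one from the slide for P = ABCABABC (stored at list index k-1), and the function name is my own:

```python
def kmp_search(text, pattern, flink):
    """KMP search with 1-indexed failure links, as in the pseudocode."""
    pos, tpos = 1, 1                      # positions in pattern and text
    while pos <= len(pattern) and tpos <= len(text):
        if pattern[pos - 1] == text[tpos - 1]:
            pos += 1
            tpos += 1
        else:
            pos = flink[pos - 1]          # FLink[k] stored at index k-1
            if pos == 0:                  # start from scratch at next text position
                pos = 1
                tpos += 1
    if pos > len(pattern):
        return tpos - len(pattern)        # 1-indexed start of the match
    return None

flink = [0, 1, 1, 1, 2, 3, 2, 3]          # FLink for ABCABABC (slide table)
print(kmp_search("ABABCABCABABCC", "ABCABABC", flink))  # 6
```

Note that the text position only moves forward: the search never backs up in the text.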
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
computing KMP failure links
k = 1 // position in pattern
FLink[1] = 0
for k = 2 to PatLen do
Fail = FLink[k-1]
while ( (Fail > 0) and (P[Fail] != P[k-1]) ) do
Fail = FLink[Fail]
od
FLink[k] = Fail+1
od
[Figure: extending the table to position k: the character P[Fail] after the candidate prefix is compared with P[k-1].]
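The same computation in Python (function name mine, 1-indexed FLink values stored at list index k-1), reproducing the table for ABCABABC:

```python
def compute_flinks(pattern):
    """1-indexed KMP failure links, following the slide's pseudocode."""
    n = len(pattern)
    flink = [0] * n                       # FLink[1] = 0
    for k in range(2, n + 1):
        fail = flink[k - 2]               # start from FLink[k-1]
        # follow failure links until the prefix can be extended by P[k-1]
        while fail > 0 and pattern[fail - 1] != pattern[k - 2]:
            fail = flink[fail - 1]
        flink[k - 1] = fail + 1
    return flink

print(compute_flinks("ABCABABC"))  # [0, 1, 1, 1, 2, 3, 2, 3]
```

Each iteration only walks down previously computed links, which is what makes the preprocessing linear time overall.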
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
why does this work?

all prefixes that are also a suffix: P1 . . . Pt−1 = Pk−t+1 . . . Pk−1

can be found by following failure links: t0 = FLink[k] and ti = FLink[ti−1]

[Figure: the nested prefix/suffix pairs P1 · · · Pt0−1 = Pk−t0+1 · · · Pk−1 and P1 · · · Pt1−1 = Pk−t1+1 · · · Pk−1.]
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
k         1 2 3 4 5 6 7 8
P[k]      A B C A B A B C
FLink[k]  0 1 1 1 2 3 2 3
FLink′[k] 0 1 1 0 1 3 1 1

improving KMP failure links
for Pos = 2 to PatLen
do if ( P[Pos] == P[FLink[Pos]] )
   then FLink[Pos] = FLink[FLink[Pos]]
   fi
od
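In Python (function name mine, values again stored at index k-1): if the character after the fallback prefix equals P[Pos], that fallback would mismatch immediately again, so we skip straight to its own failure link. Applied to the slide's table this reproduces FLink′:

```python
def improve_flinks(pattern, flink):
    """Path-compress failure links that would mismatch on the same character."""
    flink = list(flink)                   # 1-indexed values at index k-1
    for pos in range(2, len(pattern) + 1):
        if pattern[pos - 1] == pattern[flink[pos - 1] - 1]:
            flink[pos - 1] = flink[flink[pos - 1] - 1]
    return flink

flink = [0, 1, 1, 1, 2, 3, 2, 3]          # FLink for ABCABABC
print(improve_flinks("ABCABABC", flink))  # [0, 1, 1, 0, 1, 3, 1, 1]
```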
Datastructuren
Pattern Matching
Aho-Corasick
{aaa, abc, baa, baba, cb}

[Figure: trie of the five patterns, nodes numbered 1-12, edges labelled a, b, c.]

trie
Datastructuren
Pattern Matching
Aho-Corasick
{aaa, abc, baa, baba, cb}

[Figure: the same trie, now with failure links drawn in.]

failure links
Datastructuren
Pattern Matching
Aho-Corasick
{aaa, abc, baa, baba, cb}

[Figure: the trie with failure links, tracing the search for aaba . . .]

searching aaba . . .
Datastructuren
Pattern Matching
Aho-Corasick
{aaa, abc, baa, baba, cb}

[Figure: constructing the next failure link by following the parent's failure link until the new character can be read.]

construct next failure link
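The figures can be condensed into a small Python sketch (function names mine): build the trie, add failure links breadth-first by following the parent's failure link, then scan the text in a single pass:

```python
from collections import deque

def aho_corasick(patterns):
    """Goto/fail/output tables for the pattern set; state 0 is the trie root."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                        # 1. build the trie
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    queue = deque(goto[0].values())           # 2. failure links, breadth first
    while queue:                              # depth-1 nodes fail to the root
        r = queue.popleft()
        for c, s in goto[r].items():
            queue.append(s)
            f = fail[r]                       # follow the parent's failure links
            while f and c not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(c, 0)
            out[s] |= out[fail[s]]            # inherit patterns ending at fail[s]
    return goto, fail, out

def search(text, patterns):
    """Scan the text once, reporting (0-based start, pattern) for every match."""
    goto, fail, out = aho_corasick(patterns)
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:         # fall back on mismatch
            s = fail[s]
        s = goto[s].get(c, 0)
        hits += [(i - len(p) + 1, p) for p in out[s]]
    return hits

print(sorted(search("ababcbababaaaa", ["aaa", "abc", "baa", "baba", "cb"])))
# [(2, 'abc'), (4, 'cb'), (5, 'baba'), (7, 'baba'), (9, 'baa'), (10, 'aaa'), (11, 'aaa')]
```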
Datastructuren
Pattern Matching
Comparing texts
alignment
enzymes and their amino acids
82 TYHMCQFHCRYVNNHSGEKLYECNERSKAFSCPSHLQCHKRRQIGEKTHEHNQCGKAFPT 60
81 --------------------YECNQCGKAFAQHSSLKCHYRTHIGEKPYECNQCGKAFSK 40
****: .***: * *:** * :****.:* *******..
82 PSHLQYHERTHTGEKPYECHQCGQAFKKCSLLQRHKRTHTGEKPYE-CNQCGKAFAQ- 116
81 HSHLQCHKRTHTGEKPYECNQCGKAFSQHGLLQRHKRTHTGEKPYMNVINMVKPLHNS 98
**** *:***********:***:**.: .*************** : *.: :
Datastructuren
Pattern Matching
Comparing texts
similarity TCAGACGATTG and TCGGAGCTG

TCAG-ACG-ATTG
TC-GGA-GC-T-G

TCAGACGATTG
TCGGA-GCT-G

match G/G, mismatch A/G, insdel (gap) -/G or A/-
Datastructuren
Pattern Matching
Comparing texts
global alignment
TTCAT vs. TGCATCGT
[Figure: dynamic-programming grid with TGCATCGT along the top and TTCAT down the side; horizontal and vertical steps are insdels, diagonal steps are matches or mismatches.]
as shortest path
Datastructuren
Pattern Matching
Comparing texts
global versus local alignment
[Figure: global alignment fills the whole m × n table from (1,1) to (m,n); local alignment additionally allows entries to restart at 0 and takes the maximum over all cells.]
Needleman-Wunsch (1970), Smith-Waterman (1981)
Levenshtein distance (1966)
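The global-alignment grid with unit costs is exactly the Levenshtein distance; a short Python sketch for the slide's example TTCAT vs. TGCATCGT (function name mine):

```python
def levenshtein(s, t):
    """Edit distance via the standard DP table; cell (i, j) is the cheapest
    way to turn s[:i] into t[:j] using match/mismatch and insdel steps."""
    prev = list(range(len(t) + 1))        # row for the empty prefix of s
    for i, sc in enumerate(s, 1):
        cur = [i]                         # deleting i characters of s
        for j, tc in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (sc != tc),  # diagonal: (mis)match
                           prev[j] + 1,               # vertical: insdel
                           cur[j - 1] + 1))           # horizontal: insdel
        prev = cur
    return prev[-1]

print(levenshtein("TTCAT", "TGCATCGT"))  # 4: one mismatch plus three gaps
```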
Datastructuren
Pattern Matching
Comparing texts
end.