Datastructuren - Data Structures
liacs.leidenuniv.nl/~hoogeboomhj/dat/ohp/dat-present.pdf
Datastructuren
Data Structures

Fenia Aivaloglou, Hendrik Jan Hoogeboom
Informatica – LIACS, Universiteit Leiden
autumn 2019
Datastructuren
Table of Contents I
1 Basic Data Structures
2 Tree Traversal
3 Binary Search Trees
4 Balancing Binary Trees
5 Priority Queues
6 B-Trees
7 Graphs
8 Hash Tables
9 Data Compression
10 Pattern Matching
Datastructuren
Basic Data Structures
Contents
1 Basic Data Structures
Linear lists
Abstract Data Structures
Advanced C++ programming
Trees and their Representations
Datastructuren
Basic Data Structures
Linear lists
hierarchy of lists
A deque ("double-ended queue") is a linear list for which all insertions and deletions (and usually all accesses) are made at the ends of the list. A deque is therefore more general than a stack or a queue; it has some properties in common with a deck of cards, and it is pronounced the same way. (Knuth, TAoCP vol. 1)
linear list
deque 'deck'
stack (stapel): lifo
queue (rij): fifo
Datastructuren
Basic Data Structures
Linear lists
[figure: positions in a linear list (first … last); operations inspect/change at a position, insert, delete]
Datastructuren
Basic Data Structures
Linear lists
implementation: doubly linked list
[figure: doubly linked list with prv/nxt pointers, first and last pointers, and a variant with a sentinel node]
Datastructuren
Basic Data Structures
Linear lists
Babel
          insert                  remove                  inspect
          back        front       back      front         back  front
  C++     push_back   push_front  pop_back  pop_front     back  front
  Perl    push        unshift     pop       shift         [-1]  [0]
  Python  append      appendleft  pop       popleft       [-1]  [0]

(footnote 1: double-ended queue operations)
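The C++ row of the table can be exercised directly on std::deque; a minimal sketch (the function name dequeDemo is ours, not from the slides):

```cpp
#include <deque>

// Insert at both ends, inspect both ends, remove at both ends.
int dequeDemo() {
    std::deque<int> d;
    d.push_back(2);      // insert back
    d.push_front(1);     // insert front
    d.push_back(3);      // d = 1 2 3
    // inspect: d.front() == 1, d.back() == 3
    d.pop_front();       // remove front
    d.pop_back();        // remove back; d = 2
    return d.front();
}
```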
Datastructuren
Basic Data Structures
Linear lists
singly linked list
stack (stapel)
[figure: singly linked list; top → x1 → x2 → … → xn → Λ]
queue (wacht-rij)
[figure: singly linked list; first → x1 → x2 → … → xn → Λ, with a separate last pointer]
Datastructuren
Basic Data Structures
Linear lists
Programmeermethoden
class stapel { // the stack itself
public:
  stapel ( ) {
    bovenste = NULL; } // construct an empty stack
  ~stapel ( ); // destructor
  void zetopstapel (int); // push
  void haalvanstapel (int&); // pop
  bool isstapelleeg ( ) { // is the stack empty?
    return ( bovenste == NULL );
  }//isstapelleeg
  ...
private: // the head of the list is
  vakje* bovenste; // the top of the stack
};//stapel

void stapel::zetopstapel (int getal) { // push
  vakje* temp = new vakje;
  temp->info = getal;
  temp->volgende = bovenste;
  bovenste = temp;
}//stapel::zetopstapel
Datastructuren
Basic Data Structures
Linear lists
contiguous representation
stack (stapel)
[figure: array holding elements x … x, with a top index]
queue (wacht-rij), cyclic
[figure: circular array with first and last indices, possibly wrapping around]
empty vs. full (?)
Datastructuren
Basic Data Structures
Linear lists
Programmeermethoden
const int MAX = 100;
class stapel { // for at most MAX integers
public:
  stapel ( ) { bovenste = -1; } // constructor
  void zetopstapel (int);
  void haalvanstapel (int&);
  bool isstapelleeg ( ) {
    return ( bovenste == -1 ); }
  ...
private:
  int inhoud[MAX];
  int bovenste; // index of topmost value
};//stapel

void stapel::zetopstapel (int getal) { // note: no overflow check
  bovenste++;
  inhoud[bovenste] = getal;
}//stapel::zetopstapel
Datastructuren
Basic Data Structures
Abstract Data Structures
OOP: object oriented programming
object / class:
– data members
– methods
data encapsulation ⇒ nicer modelling
localization of operations ⇒ easier error finding
information hiding ⇒ avoiding errors
see Programmeermethoden
Datastructuren
Basic Data Structures
Abstract Data Structures
black box
[figure: stack as a black box with operations push, pop, isEmpty, top; the stored data (bottom to top) stays hidden]
[figure: generic data structure as a black box with operations insert, remove, query]
Datastructuren
Basic Data Structures
Abstract Data Structures
data structure:
– specification: domain (elements) and operations
– representation / implementation: the concrete structure
Datastructuren
Basic Data Structures
Abstract Data Structures
ADT – what, not how
Definition
An abstract data type (ADT) is a specification of the values stored in the data structure as well as the description and signatures of the operations that can be performed.
no representation or implementation in the ADT: a "mathematical model"
Datastructuren
Basic Data Structures
Abstract Data Structures
abstract "native" data types:
– float: R
– int: Z
now get used to considering stacks (etc.) the same way
Datastructuren
Basic Data Structures
Abstract Data Structures
structure
structure: unordered | linear | hierarchical | network
ADT:       set       | list   | tree         | graph
Datastructuren
Basic Data Structures
Abstract Data Structures
ADT Stack
Initialize: void → stack<T>. Construct an empty sequence ().
IsEmpty: void → Boolean. Check whether the stack is empty, i.e., contains no elements.
Size: void → Integer. Return the number n of elements, the length of the sequence (x1, . . . , xn).
Top: void → T. Returns the top xn of the sequence (x1, . . . , xn). Undefined on the empty sequence.
Push(x): T → void. Add the given element x to the top of the sequence (x1, . . . , xn), so afterwards the sequence is (x1, . . . , xn, x).
Pop: void → void. Remove the topmost element xn of the sequence (x1, . . . , xn), so afterwards the sequence is (x1, . . . , xn−1). Undefined on the empty sequence.
Datastructuren
Basic Data Structures
Abstract Data Structures
ADT Queue
Initialize: construct an empty sequence ().
IsEmpty: check whether the queue is empty, i.e., contains no elements.
Size: return the number n of elements, the length of the sequence (x1, . . . , xn).
Front: returns the first element x1 of the sequence (x1, . . . , xn). Undefined on the empty sequence.
Enqueue(x): add the given element x to the end/back of the sequence (x1, . . . , xn), so afterwards the sequence is (x1, . . . , xn, x).
DeQueue: removes the first element of the sequence (x1, . . . , xn), so afterwards the sequence is (x2, . . . , xn). Undefined on the empty sequence.
Datastructuren
Basic Data Structures
Abstract Data Structures
other ADTs (and implementation)
Set ⇒ (balanced) binary trees, hash tables
Map
Priority Queue ⇒ binary heap, leftist heap
Graph
Union-Find
Datastructuren
Basic Data Structures
Advanced C++ programming
templates
templated function
template <typename T>
T max(T a, T b) { return a>b ? a : b ; }
templated class
template <typename Typ>
class Stack {
...
private:
vector<Typ> storage;
};
Stack<int> intStack;
Stack<string> stringStack;
Datastructuren
Basic Data Structures
Advanced C++ programming
standard template library
container
iterator
algorithm
Datastructuren
Basic Data Structures
Advanced C++ programming
stl container classes
helper: pair
sequences: contiguous: array (fixed length), vector (flexible length), deque (double ended); linked: forward_list (single), list (double)
adaptors: based on one of the sequences: stack (lifo), queue (fifo); based on binary heap: priority_queue
associative: based on balanced trees: set, map, multiset, multimap
unordered: based on hash table: unordered_set, unordered_map, unordered_multiset, unordered_multimap
Datastructuren
Basic Data Structures
Advanced C++ programming
STL vector of pair

#include <iostream>
#include <string>
#include <vector>
#include <utility>
using namespace std;
using paar = pair<string, unsigned int>; // replacing typedef
int main() {
  vector<paar> club // 'modern' initialization
    { {"Jan", 1}, {"Piet", 6}, {"Katrien", 5}, {"Ramon", 2} };
  for (auto& mem : club) { // range-based for-loop
    cout << mem.first << " ";
  }
  cout << endl;
  return 0;
}
Jan Piet Katrien Ramon
Datastructuren
Basic Data Structures
Advanced C++ programming
STL priority queue

class Comp {
public:
  bool operator() ( const paar& p1, const paar& p2 ) const {
    return p1.second < p2.second;
  }
};
int main() {
vector <paar> club // ’modern’ initialization
{ {"Jan", 1}, {"Piet", 6}, {"Katrien", 5}, {"Ramon", 2} };
using pqtype = priority_queue< paar, vector <paar>, Comp > ;
pqtype pq (club.begin(), club.end() ); // wow! converts into
// priority_queue
while ( !pq.empty() ) {
cout << pq.top().first << " (" << pq.top().second << ") ";
pq.pop();
}
return 0;
}
Piet (6) Katrien (5) Ramon (2) Jan (1)
Datastructuren
Basic Data Structures
Trees and their Representations
trees
Tree
structure: AVL, B-, red-black trees
• number of children
• height
contents: Heap, BST
• relative position of values
Definition (Binary Tree)
A binary tree is: an empty tree (without any nodes), or a node with two children L and R, where L and R are binary trees.
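The recursive definition translates directly into recursive functions on the tree; a sketch with a minimal node type of our own (the slides use knoop/BinKnp):

```cpp
#include <algorithm>

// Minimal node type for illustration; names are ours.
struct Node {
    int info;
    Node *left, *right;
};

// A binary tree is empty, or a node with two binary subtrees;
// both functions follow that recursive shape directly.
int size(const Node* t) {
    if (t == nullptr) return 0;                 // empty tree
    return 1 + size(t->left) + size(t->right);  // this node + both subtrees
}

int height(const Node* t) {
    if (t == nullptr) return -1;                // empty tree has height -1
    return 1 + std::max(height(t->left), height(t->right));
}
```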
Datastructuren
Basic Data Structures
Trees and their Representations
representing binary trees: pointers

template <class T>
class BinKnp {
public:
  // CONSTRUCTOR
  BinKnp ( const T& i,
           BinKnp<T> *l = nullptr, // defaults
           BinKnp<T> *r = nullptr )
    : info(i) // constructor of type T
  { links = l; rechts = r; }
private: // DATA
  T info;
  BinKnp<T> *links, *rechts;
};
Datastructuren
Basic Data Structures
Trees and their Representations
binary search tree vs heap order
[figure: the same kind of values stored as a binary search tree (left) and as a tree in heap order, every parent at least its children (right)]
Datastructuren
Basic Data Structures
Trees and their Representations
AVL-tree and B-tree
[figure: an AVL tree with balance factors at the nodes, and a B-tree storing the keys 10 … 60]
Datastructuren
Basic Data Structures
Trees and their Representations
text compression Huffman & ZLW
[figure: a Huffman code tree with 0/1-labelled edges and leaves a–f, and the numbered dictionary tree built during ZLW compression]
Datastructuren
Basic Data Structures
Trees and their Representations
expression tree
[figure: expression trees built from operators +, ·, ↑, sin, cos and operands π, x and constants]
Datastructuren
Basic Data Structures
Trees and their Representations
full binary tree
Datastructuren
Basic Data Structures
Trees and their Representations
complete binary tree −→ array
[figure: a complete binary tree, nodes numbered 1–12 level by level]

  index: 1  2  3  4  5  6  7  8  9  10 11 12
  value: 33 42 17 8  24 3  3  98 55 10 19 5
Datastructuren
Basic Data Structures
Trees and their Representations
left-child right-sibling
[figure: a general tree with nodes a–h, its left-child right-sibling representation, and the resulting binary tree]
Datastructuren
Basic Data Structures
Trees and their Representations
trie “retrieval”
[figure: a trie ('retrieval' tree) storing a set of words that share prefixes, with $ marking ends of words]
Datastructuren
Tree Traversal
Contents
2 Tree Traversal
Definitions and representation
Recursion
Using a Stack
Using Inorder Threads
Morris Traversal
Link Inversion
Datastructuren
Tree Traversal
Definitions and representation
Traversal
The process of visiting each node (precisely once) in a systematic way:
breadth-first
NLR preorder
LNR inorder
LRN postorder
recursion
(parent pointer)
iterative, with stack
threads
link inversion
Datastructuren
Tree Traversal
Recursion
recursion (binary trees)
recursivetraversal( node )
if (node != nil) then
// pre-visit(node)
recursivetraversal(node.left)
// in-visit(node)
recursivetraversal(node.right)
// post-visit(node)
fi
end // recursivetraversal
pre
in
post
Datastructuren
Tree Traversal
Recursion
Algoritmiek2
class knoop { // a struct would also do
public:
  knoop ( ) { // constructor
    info = 0;
    links = NULL;
    rechts = NULL;
  }
  // but perhaps these should be private
  int info;
  knoop* links;
  knoop* rechts;
}; // knoop
void preorde (knoop* root) {
if ( root != NULL ) {
cout << root->info << endl;
preorde (root->links);
preorde (root->rechts);
} // if
} // preorde
void symmetrisch (knoop* root) {
if ( root != NULL ) {
symmetrisch (root->links);
cout << root->info << endl;
symmetrisch (root->rechts);
} // if
} // symmetrisch
2 yes, we have seen this before
Datastructuren
Tree Traversal
Recursion
pre-traversal( node )
if (node != nil) then
pre-visit(node)
pre-traversal(node.left)
pre-traversal(node.right)
fi
end
[figure: example tree with nodes a–k, numbered in preorder]
NLR = preorder: a b d c e g i h j k f
Datastructuren
Tree Traversal
Recursion
in-traversal( node )
if (node != nil) then
in-traversal(node.left)
in-visit(node)
in-traversal(node.right)
fi
end
[figure: the same tree, numbered in inorder]
LNR = inorder: b d a g i e j h k c f
Datastructuren
Tree Traversal
Recursion
post-traversal( node )
if (node != nil) then
post-traversal(node.left)
post-traversal(node.right)
post-visit(node)
fi
end
[figure: the same tree, numbered in postorder]
LRN = postorder: d b i g j k h e f c a
Datastructuren
Tree Traversal
Using a Stack
generic binary tree traversal
  visit  direction    visit at the next node
  1      down-left    1 (stays at the node, becomes 2 if there is no left child)
  2      down-right   1 (stays at the node, becomes 3 if there is no right child)
  3      up           2 when arriving from a left child, 3 from a right child
[figure: the three visits to a node and the moves between them]
problem: going up to parent
Datastructuren
Tree Traversal
Using a Stack
visit = 1; node = root
while (visit != 3 or not S.isEmpty() )
case visit of
1 : if (node.left != nil) then
S.push(node)
node = node.left
else
visit = 2
fi
2 : if (node.right != nil) then
S.push(node)
node = node.right
visit = 1
else
visit = 3
fi
3 : parent = S.pop()
if (parent.left == node) then
visit = 2
else
visit = 3
fi
node = parent
end//case
end//while
[figure: the generic traversal with a stack; the stack holds the path from the root to the current node]
Datastructuren
Tree Traversal
Using a Stack
which nodes on stack
pre-order: the stack holds right children still to be visited
in-order: the stack holds 'left parents' (ancestors whose left subtree is being traversed)
[figures: the example tree with preorder and inorder numbers and the corresponding stack contents]
Datastructuren
Tree Traversal
Using a Stack
pre-order
iterative-preorder( root )
S : Stack
S.create()
S.push( root )
while ( not S.isEmpty() ) do
node = S.pop()
if (node != nil) then
visit( node )
S.push( node.right )
S.push( node.left )
fi
od
end // iterative-preorder
[figure: snapshot of iterative preorder; visited nodes, the current node, and the nodes on the stack]
Datastructuren
Tree Traversal
Using a Stack
pre-order (2)
iterative-preorder( root )
S : Stack
S.create()
S.push( root )
while ( not S.isEmpty() ) do
node = S.pop()
while (node != nil) do
visit( node )
S.push( node.right )
node = node.left
od
od
end // iterative-preorder [bis]
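A C++ rendering of this second variant, under an assumed minimal Node struct (names ours), collecting the visits in a vector:

```cpp
#include <stack>
#include <vector>

struct Node {
    int info;
    Node *left, *right;
};

// Iterative preorder, second variant: pop a node, then walk down its
// left spine, visiting as we go and pushing right children for later.
std::vector<int> preorder(Node* root) {
    std::vector<int> out;
    std::stack<Node*> s;
    s.push(root);
    while (!s.empty()) {
        Node* node = s.top(); s.pop();
        while (node != nullptr) {
            out.push_back(node->info);   // visit
            s.push(node->right);         // right subtree: later
            node = node->left;           // keep descending left
        }
    }
    return out;
}
```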
Datastructuren
Tree Traversal
Using a Stack
in-order
iterative-inorder( root : Node )
S : Stack
S.create()
// move to first node (left-most)
walkLeft( root, S )
while ( not S.isEmpty() ) do
node = S.pop()
visit( node )
walkLeft( node.right, S )
od
end // iterative-inorder
walkLeft( node : Node, S : Stack)
while (node != nil) do
S.push( node )
node = node.left
od
end // walkLeft
[figure: snapshot of iterative inorder; the stack holds the left path down to the current node]
Datastructuren
Tree Traversal
Using a Stack
in-order (2)
iterative-inorder( root )
S : Stack
S.create()
node = root;
while (node != nil or
not S.isEmpty() ) do
if (node != nil) then
S.push( node )
node = node.left
else
node = S.pop()
visit( node )
node = node.right
fi
od
end // iterative-inorder [bis]
Datastructuren
Tree Traversal
Using a Stack
post-order
iterative-postorder( root )
S : Stack; // contains path from root
S.create();
last = nil
node = root
while (not S.isEmpty() or node != nil) do
if (node != nil) then
S.push(node)
node = node.left
else
peek = S.top()
if (peek.right != nil and last != peek.right) then
// right child exists AND traversing from left, move right
node = peek.right
else
visit(peek)
last = S.pop()
fi
fi
od
end // iterative-postorder
Datastructuren
Tree Traversal
Using a Stack
[figure: snapshot of iterative postorder; the stack holds the complete path from the root to the current node]
Datastructuren
Tree Traversal
Using Inorder Threads
using inorder threads
threads: replace nil-pointers; they explicitly store inorder successors
can be used to perform stack-less traversal
need one bit [boolean] per node to mark thread
Morris-variant: temporary threads, no extra bit
nb. inorder = symmetric
Datastructuren
Tree Traversal
Using Inorder Threads
inorder successor with threads
[figure: inorder successor with threads; either follow the thread from curr, or walk left down from curr's right child; example tree with keys 1–12]
Datastructuren
Tree Traversal
Using Inorder Threads
traversal with symmetric threads
inorder threads
// assuming Root != nil, find first position in inorder
Curr = walkLeft( Root );
while (Curr != nil) do
inOrderVisit( Curr );
if (Curr.IsThread) then
Curr = Curr.right; // to inorder successor
else
Curr = walkLeft (Curr.right)
fi
od
walkLeft( node : Node)
while (node.left != nil) do
node = node.left
od
return node
end // walkLeft
Datastructuren
Tree Traversal
Using Inorder Threads
what about
pre-order traversal with inorder threads
Datastructuren
Tree Traversal
Morris Traversal
Morris: temporary threads
inorder successors lead (back) to 'left parents'
[figure: the example tree traversed in inorder; comparison of stack contents versus temporary threads]
Datastructuren
Tree Traversal
Morris Traversal
Morris: basics
inorder successors lead (back) to 'left parents'
[figure: the example tree with its inorder numbering]
two visits per node:
1 (pre-order): arriving from the parent via a child link (left or right); add a thread to the current node
2 (inorder): arriving from the subtree, via the thread; delete the thread
the algorithm cannot tell threads from child links, so it does not know which visit this is, but it will check!
Datastructuren
Tree Traversal
Morris Traversal
Morris traversal - algorithm
no left subtree: 1st and 2nd visit coincide; go right (by edge or by thread)
new subtree: 1st visit; construct a thread, go left
been there: 2nd visit; delete the thread, go right
[figure: the three cases, with Curr and the predecessor Pred]
Datastructuren
Tree Traversal
Morris Traversal
Morris traversal - pseudo code
morris-algo
Curr = Root;
while (Curr != nil) do
if (Curr.left == nil) then
inOrderVisit( Curr )
Curr = Curr.right
else
// find predecessor
Pred = Curr.left
while (Pred.right != Curr && Pred.right != nil) do
Pred = Pred.right
od
if (Pred.right == nil) then
// no thread: subtree not yet visited
Pred.right = Curr
Curr = Curr.left
else
// been there, remove thread
Pred.right = nil
inOrderVisit( Curr )
Curr = Curr.right
fi
fi
od
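The pseudo code carries over to C++ almost verbatim; a sketch with an assumed Node struct (names ours), collecting the inorder visits in a vector:

```cpp
#include <vector>

struct Node {
    int info;
    Node *left, *right;
};

// Morris traversal: inorder without a stack or recursion, using
// temporary threads from the inorder predecessor back to the node.
std::vector<int> morrisInorder(Node* root) {
    std::vector<int> out;
    Node* curr = root;
    while (curr != nullptr) {
        if (curr->left == nullptr) {
            out.push_back(curr->info);          // visit, then go right
            curr = curr->right;
        } else {
            Node* pred = curr->left;            // find inorder predecessor
            while (pred->right != curr && pred->right != nullptr)
                pred = pred->right;
            if (pred->right == nullptr) {       // 1st visit: construct thread
                pred->right = curr;
                curr = curr->left;
            } else {                            // 2nd visit: delete thread
                pred->right = nullptr;
                out.push_back(curr->info);
                curr = curr->right;
            }
        }
    }
    return out;
}
```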
Datastructuren
Tree Traversal
Morris Traversal
alternative view: tree transformation
[figure: Morris traversal seen as a tree transformation; the left subtree is rotated up until the current root has no left child, and the tree is restored afterwards]
Datastructuren
Tree Traversal
Link Inversion
features
– use the generic traversal
– at each step we know which visit we are making
– no stack: invert the links on the path from the root
– use a bit (tag) on the path to distinguish left/right

a stack of bits? alternatives:
– keep a parent pointer
– a global visit counter (pre-/in-/post-order)
– only a single traversal at a time
Datastructuren
Tree Traversal
Link Inversion
inverted links
[figure: link inversion; the links on the path from the root to curr are inverted, and tag bits record whether the path went left or right, shown for the three visits at a node]
Datastructuren
Binary Search Trees
Contents
3 Binary Search Trees
Introduction
BST use cases
Constructing BSTs
Analysis of trees
ADT Set and Dictionary
Datastructuren
Binary Search Trees
Introduction
binary search tree BST3
[figure: node with key K; keys < K in the left subtree, keys > K in the right subtree]
Definition
A binary search tree is a binary tree such that for each node:
all nodes in its left subtree have smaller values, and
all nodes in its right subtree have larger values
3 BZB (binaire zoekboom), see Algoritmiek
Datastructuren
Binary Search Trees
Introduction
comparables
[figure: comparable key types: strings (chico, groucho, gummo, harpo, marx, zeppo), integers (4, 5, 11, 18, 25, 30), dates (11.6.1509 … 12.7.1543)]
Datastructuren
Binary Search Trees
Introduction
binary search tree BST
worst case search complexity: unsuccessful search in
linear tree: O(n)
optimal tree: O(log2(n)) (complete tree)
Average case behaviour: see later
Datastructuren
Binary Search Trees
Introduction
BST with 31 most common English words
[figure: BST of the 31 most common English words; the top five frequencies are indicated: the 15568, of 9767, and 7638, to 5739, a 5074]
Inserted into the BST in decreasing order of frequency.
Successful search of the BST requires 4.042 comparisons (on avg.)
Datastructuren
Binary Search Trees
Introduction
balanced BST
[figure: the same 31 words stored in a perfectly balanced BST]
Perfectly balanced BST
Successful search requires 4.393 comparisons (on avg.)
Datastructuren
Binary Search Trees
Introduction
optimal BST
[figure: the optimal BST for the 31 words; frequent words sit close to the root]
Optimal tree taking frequencies into account
Successful search requires 3.437 comparisons (on avg.)
source: Knuth TAoCP Vol.3 (Sorting and Searching)
Datastructuren
Binary Search Trees
BST use cases
search value
bool contains( const Comparable & x, Node *t ) const {
if( t == nullptr )
return false;
else if( x < t->element )
return contains( x, t->left );
else if( t->element < x )
return contains( x, t->right );
else
return true; // found
}
call with: contains(v,root);
Datastructuren
Binary Search Trees
BST use cases
find min/max value
BinaryNode * findMin( BinaryNode *t ) const {
if( t == nullptr )
return nullptr;
if( t->left == nullptr )
return t;
return findMin( t->left );
}
BinaryNode * findMax( BinaryNode *t ) const {
if( t != nullptr )
while( t->right != nullptr )
t = t->right;
return t;
}
call with: findMin(root); and findMax(root);
Datastructuren
Binary Search Trees
BST use cases
inorder is sorted
[figure: a BST with its nodes numbered in inorder]
inorder: 8 11 15 20 26 33 34 42 51 57 61
Datastructuren
Binary Search Trees
BST use cases
find k-th element
Augment each node with the size of its subtree
[figure: a BST in which every node also stores the size of its subtree]
Let r be left->size + 1
If k = r: stop! This node has kth item
If k < r: search kth item in left subtree
If k > r: search (k − r)th item in right subtree
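A sketch of this search in C++, with an assumed size-augmented node type (SNode is our name); it presumes 1 ≤ k ≤ size of the tree:

```cpp
#include <cstddef>

// Node augmented with the size of its subtree (names are ours).
struct SNode {
    int info;
    std::size_t size;
    SNode *left, *right;
};

std::size_t sz(const SNode* t) { return t ? t->size : 0; }

// Return the k-th smallest node (1-based); assumes 1 <= k <= sz(t).
const SNode* kth(const SNode* t, std::size_t k) {
    std::size_t r = sz(t->left) + 1;     // rank of t within its own subtree
    if (k == r) return t;                // this node holds the k-th item
    if (k < r)  return kth(t->left, k);  // k-th item is in the left subtree
    return kth(t->right, k - r);         // skip the left subtree and t itself
}
```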
Datastructuren
Binary Search Trees
BST use cases
counting items in [12, 52]
[figure: counting the items in [12, 52] by walking the BST; stored subtree sizes let whole subtrees be counted without visiting them]
Datastructuren
Binary Search Trees
Constructing BSTs
insertion (implementation)
template<class T>
void Node<T>::insert(const T& el, Node<T> * & p) {
if( p == nullptr ) {
p = new Node{el, nullptr, nullptr};
} else if (el < p->data) {
insert(el, p->left);
} else if (el > p->data) {
insert(el, p->right);
} else {
; // Duplicate; do nothing
}
}
call with: insert(el,root);
Datastructuren
Binary Search Trees
Constructing BSTs
deletion “by copying”
[figure: deletion 'by copying'; a node with at most one child is unlinked, a node with two children gets the value of its inorder predecessor, which is then deleted itself]
Datastructuren
Binary Search Trees
Constructing BSTs
deletion (implementation)
void remove( const Comparable & x, Node * & t ) {
  if( t == nullptr ) return;
  if( x < t->element ) remove( x, t->left );
  else if( t->element < x ) remove( x, t->right );
  else if( t->left != nullptr && t->right != nullptr ) {
    Node *pred = findMax( t->left ); // inorder predecessor
    t->element = pred->element;
    remove( t->element, t->left );
  }
  else {
    Node *oldNode = t;
    t = ( t->left != nullptr ) ? t->left : t->right;
    delete oldNode;
  }
}

call with: remove(el,root);
Datastructuren
Binary Search Trees
Analysis of trees
counting trees
[figure: root with a left subtree of i − 1 nodes and a right subtree of n − i nodes]

Unlabeled n-node binary trees:
B_n = Σ_{i=1}^{n} B_{i−1} · B_{n−i}, with B_0 = 1

n-th Catalan number:
B_n = (1/(n+1)) (2n choose n) = (2n)! / ((n+1)! n!) ∼ 4^n / (n^{3/2} √π)

this is also the number of BSTs with given values: there is a unique way to store the values in a given [unlabeled] tree
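The recurrence can be checked numerically; a small C++ sketch (function name ours):

```cpp
#include <vector>

// B_n = number of unlabeled binary trees on n nodes (Catalan numbers),
// via B_n = sum_{i=1..n} B_{i-1} * B_{n-i}, B_0 = 1: the root splits
// the remaining n-1 nodes into a left part (i-1) and a right part (n-i).
std::vector<unsigned long long> catalan(int n) {
    std::vector<unsigned long long> B(n + 1, 0);
    B[0] = 1;                            // the empty tree
    for (int m = 1; m <= n; ++m)
        for (int i = 1; i <= m; ++i)
            B[m] += B[i - 1] * B[m - i];
    return B;
}
```

For n = 4 this gives 14, matching the count of 4-node BSTs later in this chapter.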
Datastructuren
Binary Search Trees
Analysis of trees
internal path length
[figure: tree whose nodes have path lengths 0, 1, 1, 2, 2, 2]
ipl = 0 + 1 + 1 + 2 + 2 + 2 = 8
Path length of node: # edges from root to node
Definition (Internal path length)
ipl = sum of all path lengths to all nodes
Avg # comparisons in successful search: ipl/n + 1
Datastructuren
Binary Search Trees
Analysis of trees
external path length
[figure: the same tree extended with external leaves]
E = 3 + 3 + 3 + 3 + 2 + 3 + 3 = 20
Definition (External path length)
E = sum of all path lengths to the ‘extended’ leaves
Avg # comparisons in unsuccessful search: E/(n+1) (there are n + 1 external leaves)
Relation to ipl: E = ipl + 2n proof: induction
Datastructuren
Binary Search Trees
Analysis of trees
path length extremal trees

optimal (balanced): h levels, n = 2^h − 1 nodes, h = lg(n+1)
worst case (linear)

[figure: a complete tree with levels 0, 1, 2 and a linear tree with path lengths 0, 1, 2, …]

optimal: ipl = Σ_{i=0}^{h−1} i · 2^i = (n+1) lg(n+1) − 2n,  E = h · 2^h
  avg (successful) = ((n+1)/n) lg(n+1) − 1
worst: ipl = Σ_{i=0}^{n−1} i = n(n−1)/2,  E = ipl + 2n = n(n+3)/2
  avg (successful) = (n+1)/2
Datastructuren
Binary Search Trees
Analysis of trees
average tree
intuition: more balance ⇒ more permutations yield that tree
example: 4-node BSTs

[figure: the seven 4-node BST shapes with a left-leaning root, each with the insertion permutations that produce it and its ipl; e.g. the shape rooted at 2 with children 1 and 3 (then 4) arises from 2134, 2314, 2341 and has ipl = 4]

14 BSTs (7 symmetric to the ones shown); 4! = 24 permutations
average ipl: (1/24)(12 × 4 + 4 × 5 + 8 × 6) = 116/24 = 29/6
Datastructuren
Binary Search Trees
Analysis of trees
average ipl BST
I_n : average internal path length of a BST on n nodes

insert a permutation of 1, . . . , n into a BST ⇒ tree structure; we average over all permutations

[figure: permutation 5 2 4 1 3 6 7; the first element 5 becomes the root, and the remaining elements determine the left and right subtrees, here 2 4 1 3 and 6 7]

any k can be the root (= first element of the permutation):
I_n = (n − 1) + (1/n) Σ_{k=1}^{n} (I_{k−1} + I_{n−k})
Datastructuren
Binary Search Trees
Analysis of trees
telescope!
I_n : average internal path length, n nodes

so    I_n = (n − 1) + 2(I_0 + I_1 + · · · + I_{n−1})/n
also  I_{n−1} = (n − 2) + 2(I_0 + I_1 + · · · + I_{n−2})/(n − 1)

subtract:  n I_n − (n − 1) I_{n−1} = 2n − 2 + 2 I_{n−1}
thus       n I_n = (n + 1) I_{n−1} + 2n − 2

divide by n(n + 1) and iterate:
I_n/(n+1) = I_{n−1}/n + 2/(n+1) − 2/(n(n+1))
I_{n−1}/n = I_{n−2}/(n−1) + 2/n − 2/((n−1)n)
. . .
I_1/2 = I_0/1 + 2/2 − 2/(1 · 2)

summing up (the middle terms form a harmonic sum, the last terms telescope):
I_n/(n+1) = I_0/1 + O(ln n) − 2n/(n+1)
Datastructuren
Binary Search Trees
ADT Set and Dictionary
ADT Set
Initialize: construct an empty set.
IsEmpty: check whether the set is empty (∅, contains no elements).
Size: return the number of elements, the cardinality of the set.
IsElement(a): returns whether a given object from the domain belongs to the set, a ∈ A.
Insert(a): add an element to the set (if it is not present, A ∪ {a}).
Delete(a): removes an element from the set (if it is present, A \ {a}).
Efficient implementation of ADT Set possible with BST
Datastructuren
Balancing Binary Trees
Contents
4 Balancing Binary Trees
Tree rotation
AVL Trees
Adding an item to an AVL Tree
Deletion in an AVL Tree
Splay Trees
Datastructuren
Balancing Binary Trees
Tree rotation
single rotation
[figure: single rotation; root p with left child (pivot) q and subtrees T1, T2, T3 becomes root q with right child p, subtree T2 moving across; the rotation works in both directions (root q, pivot p)]
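The single rotation in code; a C++ sketch with an assumed Node struct (names ours), plus an inorder helper to check that the rotation preserves the symmetric order:

```cpp
#include <vector>

struct Node {
    int info;
    Node *left, *right;
};

// Single rotation: pivot q = p->left becomes the new subtree root,
// p its right child; q's old right subtree (T2) is re-attached to p.
Node* rotateRight(Node* p) {
    Node* q = p->left;       // pivot, assumed non-nil
    p->left = q->right;      // T2 moves across
    q->right = p;
    return q;                // new root of the subtree
}

// Inorder walk, to verify that the rotation keeps the BST order.
void inorder(const Node* t, std::vector<int>& out) {
    if (t == nullptr) return;
    inorder(t->left, out);
    out.push_back(t->info);
    inorder(t->right, out);
}
```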
Datastructuren
Balancing Binary Trees
Tree rotation
double rotation
[figure: double rotation; r with left child p and p's right child q (subtrees T1 … T4) is rebalanced by rotating two times with pivot q, lifting q above p and r]
Datastructuren
Balancing Binary Trees
Tree rotation
Day/Stout/Warren
[figure: Day/Stout/Warren; the tree is first rotated into a right-leaning backbone 2, 4, 6, 8, 12, then rebalanced into a complete tree by repeated rotations]
Datastructuren
Balancing Binary Trees
Tree rotation
Day/Stout/Warren Algorithm - createBackBone
rotate(root, pivot) { ... }
createBackBone(root)
tmp = root;
while (tmp != nil) do
if (tmp.left != nil) then
pivot = tmp.left;
rotate(tmp, pivot); // pivot takes tmp's place
tmp = pivot;
else
tmp = tmp.right;
fi
od
Datastructuren
Balancing Binary Trees
Tree rotation
Day/Stout/Warren Algorithm
createCompleteTree(root)
createBackBone(root);
n = number of nodes
m = 2^floor(log(n+1)) - 1;
rotate n-m times at every other node in the backbone
while(m>1) do
m = m/2;
rotate m times at every other node in the backbone
od
Datastructuren
Balancing Binary Trees
AVL Trees
balance factor
[figure: a BST drawn with depth levels 0–4 and node heights; each node is annotated with its balance factor, e.g. 35: −1, 30: −2, 45: +1]
Datastructuren
Balancing Binary Trees
AVL Trees
Definition (AVL Tree)
An AVL tree is a BST where for each node: |balance(node)| ≤ 1
[figure: an AVL tree with keys 1–13 and balance factors at the nodes]
Datastructuren
Balancing Binary Trees
AVL Trees
Fibonacci ‘worst’ AVL tree
[figure: Fibonacci-like 'worst' AVL trees; a minimal tree of height h has subtrees of heights h − 1 and h − 2]

F_h = F_{h−2} + F_{h−1} + 1 ≈ ((1 + √5)/2)^h, thus the worst-case search in an AVL tree grows as O(lg n) in the number of nodes n
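The recurrence for the fewest nodes at a given height can be tabulated; a sketch (function name ours):

```cpp
#include <vector>

// Fewest nodes in an AVL tree of height h (the 'worst' AVL trees):
// F_h = F_{h-1} + F_{h-2} + 1, with F_0 = 1 (single node), F_1 = 2.
std::vector<unsigned long long> minAvlNodes(int h) {
    std::vector<unsigned long long> F(h + 1, 1);
    if (h >= 1) F[1] = 2;
    for (int i = 2; i <= h; ++i)
        F[i] = F[i - 1] + F[i - 2] + 1;
    return F;
}
```

Since the minimal size grows exponentially in h, the height of any AVL tree on n nodes is O(lg n).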
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
Adding in left subtree
a) balance of p goes +1 → 0: ok, stop
b) balance of p goes 0 → −1: ok, continue checking upwards
c) balance of p goes −1 → −2: rebalance (next 2 slides), then stop
[figure: the three cases after adding a node in the left subtree of p]
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
rebalance: LL-case
[figure: LL-case; q goes −1 → −2 with left child p at 0/−1; after a single rotation both p and q have balance 0]
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
Rebalance: LR-cases
[figure: LR-cases; r goes −1 → −2 with left child p at 0/+1 and q in between at 0/±1; after the double rotation q has balance 0, and the balances of p and r depend on the case]
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
example: adding 11
[figure: inserting 11; on the search path the balance factor of 4 becomes +2]
[figure: the tree after rebalancing, now rooted at 7]
imbalance at 4, RR-case so rotate at 4 with pivot=7
Datastructuren
Balancing Binary Trees
Adding an item to an AVL Tree
example: adding 5
[figure: inserting 5; on the search path the balance factor of 4 becomes +2]
[figure: the tree after rebalancing]
imbalance at 4, RL-case so rotate twice with pivot=7
Datastructuren
Balancing Binary Trees
Deletion in an AVL Tree
Deletion: RR cases
[figure: deletion, RR cases; p goes +1 → +2, a single rotation with pivot q repairs the subtree; the resulting balances depend on whether q had balance 0 or +1]
Datastructuren
Balancing Binary Trees
Deletion in an AVL Tree
Delete: RL cases (ε = 0,±1)
[figure: deletion, RL case (ε = 0, ±1); after the double rotation q has balance 0, p gets 0/+1 and r gets 0/−1]
Datastructuren
Balancing Binary Trees
Deletion in an AVL Tree
[figure: deleting a node from an AVL tree; the balance factor of 11 becomes −2, and rebalancing there makes 8 unbalanced (−2) in turn; after two rebalancing steps the tree is an AVL tree again, rooted at 5]
Datastructuren
Balancing Binary Trees
Splay Trees
splay zig-zag (LR)
[figure: splay zig-zag (LR); x with parent p and grandparent g is lifted above both by a double rotation]
Datastructuren
Balancing Binary Trees
Splay Trees
splay zig-zig (LL)
[figure: splay zig-zig (LL); first rotate g with p, then p with x, lifting x two levels]
Datastructuren
Balancing Binary Trees
Splay Trees
splay linear tree
[figure: splaying the deepest node of a linear tree on 1–7; the depth of the search path is roughly halved]
Datastructuren
Priority Queues
Contents
5 Priority Queues
ADT Priority Queue
Binary Heap
Leftist heaps
Pairing Heap (not covered)
Double-ended Priority Queues
Datastructuren
Priority Queues
ADT Priority Queue
dictionary vs. priority queue
Both store a set of (key,value) pairs
{ (’Detra’,17), (’Nova’,84), (’Charlie’,22), (’Henry’,75), (’Elsa’,29) }
both: Insert('Roxanne',29)
dictionary: Delete('Detra'), Find('Elsa') returns 29, Set('Henry',76)
priority queue: FindMax() returns ('Nova',84), DeleteMax()
Datastructuren
Priority Queues
ADT Priority Queue
ADT dictionary / map / associative array
Stores a set of (key,value) pairs
Initialize, IsEmpty, Size
Insert: add (key,value) pair, provided key is not yet present
Delete: deletes (key,value) pair, given the key
Find: returns the value associated to a given key
Set: reassigns a new value to a (existing) given key
usually implemented as a (balanced) binary search tree, or a hash table ("unordered")
Datastructuren
Priority Queues
ADT Priority Queue
ADT priority queue
Initialize: construct an empty queue.
IsEmpty: check whether there are any elements in the queue.
Size: returns the number of elements.
Insert: given a data element with its priority, it is added to the queue.
DeleteMax: returns a data element with maximal priority, and deletes it.
GetMax: returns a data element with maximal priority.
IncreaseKey: given an element and its position in the queue, it is assigned a higher priority.
Meld, or Union: takes two priority queues and returns a new priority queue containing the data elements from both.
Datastructuren
Priority Queues
ADT Priority Queue
min & max queues
max-queue ≥
Initialize, IsEmpty, Size, Insert, DeleteMax, GetMax,IncreaseKey, Meld
min-queue ≤
Initialize, IsEmpty, Size, Insert, DeleteMin, GetMin,DecreaseKey, Meld
do note which ordering is in force (min or max)
entries often carry data as well (not only a priority)
Datastructuren
Priority Queues
ADT Priority Queue
priority queue - use cases
sorting (heapsort)
graph algorithms (Dijkstra shortest path, Prim’s algorithm)
compression (Huffman)
operating systems: task queue, print job queue
discrete event simulation
Datastructuren
Priority Queues
ADT Priority Queue
implementations
              Binary     Leftist    Pairing     Fibonacci   Brodal
GetMax        Θ(1)       Θ(1)       Θ(1)        Θ(1)        Θ(1)
Insert        O(log n)   Θ(log n)   Θ(1)        Θ(1)        Θ(1)
DeleteMax     Θ(log n)   Θ(log n)   O(log n)†   O(log n)†   O(log n)
IncreaseKey   Θ(log n)   Θ(log n)   O(log n)†   Θ(1)†       Θ(1)
Meld          Θ(n)       Θ(log n)   Θ(1)        Θ(1)        Θ(1)

† amortized complexity
“. . . is based on heap ordered trees where [. . . ] nodes may violateheap order.” “The data structure presented is quite complicated.”
Datastructuren
Priority Queues
Binary Heap
binary search tree vs heap order
[Figure: a binary search tree (left) versus a heap (right): in a search tree every key in the left subtree is smaller and every key in the right subtree larger than the node's key; under heap order every parent is at least as large as its children.]
Datastructuren
Priority Queues
Binary Heap
representing binary tree with an array
root at index 1; the children of node i sit at indices 2i and 2i + 1 (its parent at ⌊i/2⌋).
[Figure: a binary tree and its array representation. In binary, a node's index spells the root-to-node path: 1, 10, 11, 100, 101, . . .]

index:  1  2  3  4  5  6  7  8  9 10 11 12
value: 33 42 17  8 24  3  3 98 55 10 19  5
works well for complete binary trees; wastes space when nodes are ‘missing’
Datastructuren
Priority Queues
Binary Heap
binary heap: three levels
functioning: abstract (priority queue)
understanding: binary tree
implementation: array
internal operations (change key at position): bubble up, trickle down
“To add an element to a heap we must perform an up-heap operation(also known as bubble-up, percolate-up, sift-up, trickle-up, swim-up,heapify-up, or cascade-up), . . . ” What’s in a name? [Wikipedia]
Datastructuren
Priority Queues
Binary Heap
increasekey / bubble up
[Figure: IncreaseKey sets the key at index 12 to 71. BubbleUp swaps 71 with its parent 17 (index 6) and then with 55 (index 3), stopping below the root 98.
Array before: 98 57 55 42 24 17 3 8 33 10 19 71 13
Array after:  98 57 71 42 24 55 3 8 33 10 19 17 13]
BubbleUp : swap with parent until heap-ordered
Datastructuren
Priority Queues
Binary Heap
decreasekey / trickle down
[Figure: DecreaseKey sets the root to 37. TrickleDown swaps 37 with its largest child 57 (index 2) and then with 42 (index 4).
Array before: 37 57 55 42 24 17 3 8 33 10 19 5 13
Array after:  57 42 55 37 24 17 3 8 33 10 19 5 13]
TrickleDown : swap with largest child until heap-ordered
Datastructuren
Priority Queues
Binary Heap
Insert to priority queue
[Figure: Insert 29. The new key is appended at index 14 and bubbles up once, swapping with 3 (index 7).
Array before: 98 57 55 42 24 17 3 8 33 10 19 5 13 29
Array after:  98 57 55 42 24 17 29 8 33 10 19 5 13 3]
Insert: add as last, BubbleUp
Datastructuren
Priority Queues
Binary Heap
DeleteMax from priority queue
[Figure: DeleteMax. The maximum 98 is removed; the last element 13 moves to the root and trickles down, swapping with 57, then 42, then 33.
Array after: 57 42 55 33 24 17 3 8 13 10 19 5]
DeleteMax: move last element to root, trickleDown
Datastructuren
Priority Queues
Binary Heap
heapify (1)
[Figure: bottom-up heapify, first phase. Starting array: 33 42 17 8 24 13 3 98 57 10 19 5 55. TrickleDown at index 6 (13 ↔ 55) and index 4 (8 ↔ 98) gives 33 42 17 98 24 55 3 8 57 10 19 5 13.]

TrickleDown each internal node, bottom-up: swap with largest child until heap-ordered
Datastructuren
Priority Queues
Binary Heap
heapify (2)
[Figure: heapify, second phase. TrickleDown at index 3 (17 ↔ 55) and index 2 (42 ↔ 98, then 42 ↔ 57) gives 33 98 55 57 24 17 3 8 42 10 19 5 13.]
Datastructuren
Priority Queues
Binary Heap
heapify (3)
[Figure: heapify, final phase. TrickleDown at the root: 33 swaps with 98, then 57, then 42, giving the heap 98 57 55 42 24 17 3 8 33 10 19 5 13.]
Datastructuren
Priority Queues
Binary Heap
complexity heapify
Lemma: ∑_{d=0}^{h} d·2^d = (h − 1)·2^{h+1} + 2

n levels, N = 2^n − 1 keys

top-down: ∑_{ℓ=0}^{n−1} 2^ℓ·ℓ = (n − 2)·2^n + 2 ≈ N·lg N

bottom-up: ∑_{ℓ=0}^{n−1} 2^ℓ·(n − 1 − ℓ) = (n − 1)·∑_{ℓ=0}^{n−1} 2^ℓ − ∑_{ℓ=0}^{n−1} 2^ℓ·ℓ = 2^n − n − 1

which is O(N)
Datastructuren
Priority Queues
Leftist heaps
leftist heaps
npl(x): nil path length (“bladafstand”), the shortest distance from x down to an external leaf
Definition (Leftist tree)
An (extended) binary tree where for each internal node x,npl(left(x)) ≥ npl(right(x)).
Definition (Leftist heap)
A leftist tree where the priorities satisfy the heap order.
structure vs. node order
Datastructuren
Priority Queues
Leftist heaps
leftist tree (structure)
npl(left(x)) ≥ npl(right(x))
[Figure: example trees with every node labelled by its npl value, illustrating the condition npl(left(x)) ≥ npl(right(x)).]
Datastructuren
Priority Queues
Leftist heaps
basic (internal) operation: ZIP
[Figure: Zip merges two heaps with roots a and b, a ≥ b: a keeps its left subtree, and the heap rooted at b is zipped recursively into a's right subtree.]
Datastructuren
Priority Queues
Leftist heaps
example (step 1: recursive Zipping)
[Figure: step 1, merging the heaps rooted at 38 and 35 by recursively zipping along their right spines.]
Datastructuren
Priority Queues
Leftist heaps
example (step 2: bottom-up swapping)
[Figure: step 2, walking back up the merge path and swapping children wherever npl(left) < npl(right).]
Datastructuren
Priority Queues
Leftist heaps
complexity
Lemma
Let T be a leftist tree with root v such that npl(v) = k. Then
(1) T contains at least 2^k − 1 (internal) nodes, and
(2) the rightmost path in T has exactly k (internal) nodes.

[Figure: a leftist tree with npl(root) = 3; its rightmost path has 3 internal nodes.]
Datastructuren
Priority Queues
Leftist heaps
priority queue operations: Insert
[Figure: Insert 27 = Zip with a one-node heap: the heap rooted at 38 is merged with the single node 27.]
Datastructuren
Priority Queues
Leftist heaps
priority queue operations: DeleteMax
[Figure: DeleteMax: remove the root 38 and Zip its two subtrees, rooted at 37 and 25.]
Datastructuren
Priority Queues
Double-ended Priority Queues
dual structure min-max heap
[Figure: the same items stored twice: in a min-heap (root 3) and in a max-heap (root 15), with pointers between the two copies of each item.]

Pointer from each min-heap item to the same item in the max-heap
Insertion: as in an ordinary heap, but twice: once in each heap
Deletion: find the item in the other heap using the pointer, move the last element to that position and do a normal deletion
Datastructuren
Priority Queues
Double-ended Priority Queues
interval heap
2-92
8-80  11-75
17-69  42-70  44-73  14-39
24-33  23-65  55-60  44-50  54-57  61

each child interval is contained in its parent's: [8,80] ⊆ [2,92]
Datastructuren
Priority Queues
Double-ended Priority Queues
interval heap: insert
[Figure: inserting 80: it joins the single-element leaf (61 becomes 61-80); the new maximum then bubbles up along the max endpoints, past 73 and 75 and stopping below 92, giving 11-80, 44-75 and leaf 61-73.]
Datastructuren
Priority Queues
Double-ended Priority Queues
embedded min&max heap
[Figure: the interval heap read as two embedded heaps: the left endpoints 2, 8, 11, 17, 42, 44, 14, . . . form a min-heap, the right endpoints 92, 80, 75, 69, 70, 73, 39, . . . a max-heap.]
Datastructuren
Priority Queues
Double-ended Priority Queues
Double ended priority queue - use case
wikipedia
One example application of the double-ended priority queue isexternal sorting. In an external sort, there are more elementsthan can be held in the computer’s memory.
Datastructuren
Priority Queues
Double-ended Priority Queues
Quiz⁴

AVL tree (rooted at 71): add 7

Binary min-heap (rooted at 14): add 7

[Figures: the AVL tree and the min-heap on which the insertions are to be performed.]
⁴ A quiz is a brief assessment used in education to measure growth in knowledge, abilities, and/or skills. Wikipedia
Datastructuren
B-Trees
Contents
6 B-Trees
Definition & Insertion
Deleting Keys
Red-Black Trees
Datastructuren
B-Trees
AVL-tree and B-tree
[Figure: an AVL tree on 13 keys, with balance factors at the nodes; below, the corresponding B-tree:]
10 20 25 32 34 40 41 44 46 52 54 58 60
30 38 50 56
42
Datastructuren
B-Trees
multiway search tree
[Figure: a multiway node holds keys K1 K2 . . . Kℓ separating the subtrees T0 T1 . . . Tℓ]

T0 < K1 < T1 < · · · < Kℓ < Tℓ
Datastructuren
B-Trees
Definition & Insertion
B-tree (Bayer & McCreight, 1972)
Definition
A B-tree of order m is a multi-way search tree such that
every node has at most m children (contains at most m − 1 keys),
every node other than the root has at least ⌈m/2⌉ children (contains at least ⌈m/2⌉ − 1 keys),
the root contains at least one key, and
all leaves are on the same level of the tree.
Datastructuren
B-Trees
Definition & Insertion
B-tree of order 5
3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83
13 29 41 49 69 77
61
order m = 5: between ⌈m/2⌉ − 1 = 2 and m − 1 = 4 keys per node.
Datastructuren
B-Trees
Definition & Insertion
adding keys
Add the new key to a leaf.
When at maximal capacity, split leaf, move middle key up.Recurse.
Splits can reach the root. We then obtain a new root with asingle key.
Datastructuren
B-Trees
Definition & Insertion
adding a key (order 5)
10 20 25 32 38 40 41 44 50 56
30 42
+34
32 34 38 40 41
10 20 25 32 34 40 41 44 50 56
30 38 42
Datastructuren
B-Trees
Definition & Insertion
adding more keys (order 5)
10 20 25 32 34 40 41 44 50 56
30 38 42
+58,+60
44 50 56 58 60
10 20 25 32 34 40 41 44 50 58 60
30 38 42 56
Datastructuren
B-Trees
Definition & Insertion
adding even more keys (order 5)
10 20 25 32 34 40 41 44 50 58 60
30 38 42 56
+46,+52,+54
44 46 50 52 54
30 38 42 50 56
10 20 25 32 34 40 41 44 46 52 54 58 60
30 38 50 56
42
Datastructuren
B-Trees
Deleting Keys
deleting keys
For non-leaf nodes: swap the key with its predecessor (so the key to delete moves to a leaf)

Deleting from leaves:

If the leaf drops below minimal capacity, move a key from a sibling with surplus keys up to the parent, and move the separating key from the parent down into the underfull node. If no sibling has surplus keys: merge with a sibling, taking the separating key from the parent, and recurse on the parent.

Because of this recursion, deletion may reach the root, and can collapse a level.
Datastructuren
B-Trees
Deleting Keys
deleting keys (order 5)
OK
10 20 25 32 34 40 41 42
30 38
45
Datastructuren
B-Trees
Deleting Keys
deleting keys (order 5)
10 20 25 32 34 40 41 42
30 38
45
swap predecessor
40 41 ×
30 38
42
Datastructuren
B-Trees
Deleting Keys
deleting keys (order 5)
10 20 25 32
borrow(‘via’ parent)
34 40 41 42
30 38
45
10 20 × 30 34 40 41 42
25 38
45
Datastructuren
B-Trees
Deleting Keys
deleting, ctd (order 5)
10 20 25 32 34 40
underfull:merge brother
41
×
30 38
42
10 20 25 32 34 38 40
30
underfull:merge brother
× 50 56
42
new root 30 42 50 56
Datastructuren
B-Trees
Red-Black Trees
2-4-tree to red-black tree
[Figure: a 2-4-tree with leaves 20 | 37 40 41 | 44 50 and internal keys 30, 42, and the corresponding red-black tree rooted at black 42.]
Datastructuren
B-Trees
Red-Black Trees
2-4-tree vs red-black tree
[Figure: each 2-4-tree node becomes a black node with 0-2 red sons: a 3-key node a b c becomes black b with red sons a and c; a 2-key node a b becomes black b with red son a, or black a with red son b; a 1-key node a stays a single black node.]
Datastructuren
B-Trees
Red-Black Trees
red-black tree
Definition
A red-black tree is a
binary search tree
such that each node is either black or red, where
the root is black,
no red node is the son of another red node,
the number of black nodes on each path from root toextended leaf (NIL-pointers) is the same.
Datastructuren
B-Trees
Red-Black Trees
examples
[Figure: two example red-black trees, one rooted at 42 and one rooted at 40.]
Datastructuren
B-Trees
Red-Black Trees
fun fact
Every AVL-tree can be red-black coloured.

[Figure: an AVL tree annotated with subtree heights, together with a valid red-black colouring.]
Datastructuren
B-Trees
Red-Black Trees
insertion in red-black tree
Insert as red leaf. Red node with red parent then:
If uncle is red: flag-flip. Continue at grandparent.
If uncle is black: rotate (see AVL-trees), Repaint and Stop.
[Figure: red uncle u: a flag-flip recolours p and u black and g red; continue at grandparent g. Black uncle u: rotate so that p tops the subtree with x and g as its children, repaint, and stop.]
If the root has been coloured red, make it black.
Datastructuren
B-Trees
Red-Black Trees
just classical single/double rotation
[Figure: the classical rotations: a single rotation turns the chain 42-30-20 into 30 with children 20 and 42; a double rotation turns the zig-zag 42-30-40 into 40 with children 30 and 42.]
Datastructuren
B-Trees
Red-Black Trees
example: adding key
[Figure: adding key 35 to the tree rooted at 42: 35 enters as a red leaf under 37; repairing the red-red violations (a flag-flip, then a rotation) yields the tree rooted at 40.]
Datastructuren
B-Trees
Red-Black Trees
GNU C++ stl tree.h
“Red-black tree class, designed for use in implementing STLassociative containers (set, multiset, map, and multimap). Theinsertion and deletion algorithms are based on those in Cormen,Leiserson, and Rivest, Introduction to Algorithms (MIT Press,1990), except that . . . ”
Linux: “There are a number of red-black trees in use in the kernel. The anticipatory, deadline, and CFQ I/O schedulers all employ rbtrees to track requests; the packet CD/DVD driver does the same. The high-resolution timer code uses an rbtree to organize outstanding timer requests. The ext3 filesystem tracks directory entries in a red-black tree. Virtual memory areas (VMAs) are tracked with red-black trees, as are epoll file descriptors, cryptographic keys, and network packets in the ‘hierarchical token bucket’ scheduler.” lwn.net/Articles/184495/
Datastructuren
Graphs
Contents
7 Graphs
Definition
Representation
Graph traversal
Disjoint Sets, ADT Union-Find
Minimal Spanning Trees
Shortest Paths
Datastructuren
Graphs
Definition
graph definition
see Algoritmiek!
Definition
A graph is a pair G = (V,E) where:
V is a set of vertices, or nodes
E ⊆ V × V is a set of edges (also called arcs or lines)
directed / undirected
vertices / edges can have labels (string, number)
complexity is measured in |V| and |E|; note |E| ≤ |V|²
Datastructuren
Graphs
Representation
adjacency matrix
[Figure: a directed graph on vertices 1-7]

    1 2 3 4 5 6 7
1   · 1 · · · 1 ·
2   · · 1 · · · ·
3   · · · · 1 1 1
4   · · · · · · ·
5   · · · 1 · 1 ·
6   1 · · · · · 1
7   1 · · · · · ·
Datastructuren
Graphs
Representation
adjacency lists
[Figure: the same graph as adjacency lists]

1 → 2, 6
2 → 3
3 → 5, 6, 7
4 → Λ
5 → 4, 6
6 → 1, 7
7 → 1
Datastructuren
Graphs
Graph traversal
depth first search
Recursive DFS
void DFS(v)
{ visit(v)
  mark(v)
  for each w adjacent to v
  do if w is not marked
       DFS(w)
     fi
  od
}

Iterative DFS
// start with unmarked nodes
S.push(init)                  // S.push((init,init))
while S is not empty
do v = S.pop()                // (p,v) = S.pop()
   if v is not marked
   then mark v
        // add (p,v) to DFS tree (if p != v)
        for all edges (v, w)
        do if w is unmarked then
             S.push(w)        // S.push((v,w))
           fi
        od
   fi
od
Datastructuren
Graphs
Graph traversal
dfs tree (directed)
[Figure: a directed graph on vertices a-g and two possible DFS trees, with the visiting order at the nodes and the non-tree edges labelled back, forward and cross.]

DFS tree with edge classification (tree, back, cross, forward); not unique
Datastructuren
Graphs
Graph traversal
dfs edges
[Figure: edge classification on numbered DFS trees: a directed graph can produce forward, back and cross edges; in an undirected graph only back edges occur besides tree edges.]
Datastructuren
Graphs
Graph traversal
applications of DFS
topological sorting
articulation points
A DFS traversal itself and the forest-like representation of the graph itprovides have proved to be extremely helpful for the development of efficientalgorithms for checking many important properties of graphs. Note thatthe DFS yields two orderings of vertices: the order in which the vertices arereached for the first time (pushed onto the stack) and the order in whichthe vertices become dead ends (popped off the stack). These orders arequalitatively different, and various applications can take advantage of eitherof them. [Levitin, Design & Analysis of Algorithms]
Datastructuren
Graphs
Graph traversal
application: topological sort
[Figure: a DAG on vertices 1-8 and one topological ordering of its vertices.]
Datastructuren
Graphs
Graph traversal
Let G = (V,E) be a directed graph.
Definition
A topological ordering [or sort] of G is an ordering (v1, . . . , vn) ofV , such that if (vi, vj) ∈ E then i < j.
finding a topological sort:
1 pick node without incoming edges
2 remove outgoing edges from that node and go to step 1.
(or use depth-first search)
Datastructuren
Graphs
Graph traversal
application: topological sort
[Figure: the DAG with (pre, post) DFS numbers at every vertex; listing the vertices by decreasing post-number gives a topological sort.]
Datastructuren
Graphs
Graph traversal
application: articulation points
[Figure: an undirected graph on vertices 1-13, with its articulation points marked.]
Datastructuren
Graphs
Graph traversal
articulation points with dfs tree
[Figure: the graph and its DFS tree, with back edges drawn dashed.]

vertex v is an articulation point if
- v is the root and has two or more children, or
- v is not the root and has a subtree in which no node has a back edge reaching above v
Datastructuren
Graphs
Graph traversal
breadth-first search (BFS)
Iterative BFS
// Q is a queue of vertices
// start with unmarked nodes
mark init
Q.enqueue(init)
dist[init] = 0
while Q is not empty
do v = Q.dequeue()
   newdist = dist[v] + 1
   for all edges (v, w)
   do if w is not marked
      then mark w
           dist[w] = newdist
           Q.enqueue(w)
      fi
   od
od
Datastructuren
Graphs
Graph traversal
bfs: ’floodfill’
[Figure: BFS ‘floodfill’ on a grid: every cell is labelled with its distance 0, 1, 2, . . . from the start cell.]
Datastructuren
Graphs
Minimal Spanning Trees
minimal spanning tree
[Figure: a weighted graph on vertices A-H and a minimal spanning tree of it (7 edges, total weight 21).]
Definition (Minimal spanning tree of weighted graph)
A tree containing all nodes of the graph, with minimal total sum ofedge weights
Datastructuren
Graphs
Minimal Spanning Trees
minimal spanning tree Kruskal vs. Prim
[Figure: the spanning tree growing edge by edge: Kruskal repeatedly adds the globally cheapest edge that closes no cycle; Prim grows a single tree outward from one vertex.]
Datastructuren
Graphs
Minimal Spanning Trees
minimal spanning tree - Kruskal
High-level algorithm:
repeat
  consider edge with smallest weight
  if it does not yield a cycle
    add it to the tree
  otherwise discard the edge
until no edges left
Datastructuren
Graphs
Minimal Spanning Trees
spanning tree
[Figure: a partially built spanning tree on vertices 1-10; candidate edges are marked ‘?’.]

only edges that do not cause a cycle may be added — use the union-find ADT
Datastructuren
Graphs
Minimal Spanning Trees
partition of the domain D = {1, 2, . . . , n}; each set has a name, a representative
Initialize: construct the initial partition; each componentconsists of a singleton set {d}, with d ∈ D.
Find: retrieves the name of the component, i.e., Find(u) = Find(v) iff u and v belong to the same set in the partition.
Union: given two elements u and v, the sets they belong toare merged. Has no effects when u and v already belong tothe same set.Usually it is assumed that u, v are representatives, i.e., namesof components, not arbitrary elements.
Datastructuren
Graphs
Minimal Spanning Trees
Union-Find implementation with path-compression
element: 1 2 3 4 5 6 7 8 9 10
parent:  1 2 1 4 5 9 6 5 9 9
size:    2 1 . 1 2 . . . 4 .   (stored at roots only)

[Figure: the corresponding forest, with roots 1, 2, 4, 5 and 9.]
Datastructuren
Graphs
Minimal Spanning Trees
Union-Find implementation with path-compression
[Figure: Union links the root of one tree below the root of the other; path compression lets every node on a Find path point directly to the root.]
Datastructuren
Graphs
Minimal Spanning Trees
minimal spanning tree - Kruskal
detailed algorithm with priority queue and union-find ADTs: Kruskal

KRUSKAL(G):
  A = emptyTree
  PQ = empty
  foreach vertex v:
    MAKE-SET(v)
  foreach edge (u,v):
    PQ.insert( weight(u,v), (u,v) )
  repeat until PQ is empty:
    (u,v) = PQ.DELETE-MIN()
    if (FIND-SET(u) != FIND-SET(v)) then
      A.add((u,v))
      UNION(u,v)
    fi
  return A
Datastructuren
Graphs
Minimal Spanning Trees
Algorithms from the Book
116 union-find Galler and Fischer
109 Knuth-Morris-Pratt pattern matching
 94 Blum, Floyd, Pratt, Rivest, Tarjan median
 89 binary search
 84 Floyd-Warshall all-pairs shortest path
 79 Euclidean algorithm greatest common divisor (GCD)
 73 quicksort Tony Hoare
 59 Huffman coding data compression
 51 Miller-Rabin primality test
 50 Schwartz-Zippel lemma polynomial identity
 46 depth first search
 42 sieve of Eratosthenes primes
 42 Dijkstra shortest path

3.11.'19
Datastructuren
Graphs
Minimal Spanning Trees
Prim
cost[source] = 0               // infinite for other nodes
prev[source] = 0               // code for the root
Q = V                          // all vertices
while Q is not empty
do u is node in Q with minimal cost[u]
   remove u from Q
   for each edge (u,v) with v outside tree
   do if length(u,v) < cost[v]
      then cost[v] = length(u,v)
           prev[v] = u
      fi
   od
od
high-level algorithm:

initialize the tree with an arbitrarily chosen node

repeat until all vertices are connected: link the unconnected node attached by the edge of minimum weight
Datastructuren
Graphs
Minimal Spanning Trees
directed graphs are not supported

[Figures: three-node directed counterexamples on which Prim, respectively Kruskal, fails to find the cheapest set of arcs.]
Datastructuren
Graphs
Shortest Paths
Dijkstra
1  dist[source] = 0            // infinite for other nodes
2  prev[source] = 0            // code for the root
3  PQ = V                      // all nodes
4  while PQ is not empty
5  do u = node in PQ with minimal dist[u]
6     remove u from PQ
7     for each edge (u,v)
8     do newdist = dist[u] + length(u,v)
9        if newdist < dist[v]
10       then dist[v] = newdist
11            prev[v] = u
12       fi
13    od
14 od
finds shortest paths from a fixed source node to all other nodes; also usable for the shortest path from source to a target node
Datastructuren
Graphs
Shortest Paths
distance vs. bottleneck
[Figure: the weighted example graph A-H, labelled per vertex with the shortest-path distance from C (A:2 B:4 C:0 D:6 E:11 F:8 G:10 H:12), and a second copy labelled with bottleneck values.]
Datastructuren
Graphs
Shortest Paths
all pairs distance
L_k(i, j) = min( L_{k−1}(i, j), L_{k−1}(i, k) + L_{k−1}(k, j) )

Floyd-Warshall
// initially dist equals the adjacency matrix
for each edge (i,j)
do prev[i,j] = i
od
for k from 1 to n
do for i from 1 to n
do for j from 1 to n
do if dist[i,k] + dist[k,j] < dist[i,j]
then dist[i,j] = dist[i,k] + dist[k,j]
prev[i,j] = prev[k,j]
fi
od
od
od
Datastructuren
Graphs
Shortest Paths
example Floyd
[Figure: a small weighted directed graph (arc weights 9, 3, 6, 2) used in the Floyd example below.]
Datastructuren
Graphs
Shortest Paths
Floyd
partial result A3, and distances via node 4
A3 =
   0   2   1   6
   3   0   1   4
   4   1   0   5
  −2   0  −1   ·

distances via node 4, i.e. A3[i,4] + A3[4,j]:
   ·   6+0  6−1   6
  4−2   ·   4−1   4
  5−2  5+0   ·    5
  −2    0   −1    ·
Datastructuren
Graphs
Shortest Paths
path reconstruction
Path-reconstruction
Path(u, v)
if prev[u][v] = null then
return []
path = [v]
while u != v do
v = prev[u][v]
path.insert_at_begin(v)
od
return path
Datastructuren
Graphs
Shortest Paths
Warshall// initially conn equals the adjacency matrix
// with additionally 1=true on the diagonal
for k from 1 to n
do for i from 1 to n
do for j from 1 to n
do conn[i,j] = conn[i,j] or ( conn[i,k] and conn[k,j] )
od
od
od
Datastructuren
Hash Tables
Contents
8 Hash Tables
Perfect Hash Function
Open Addressing
Chaining
Choosing a hash function
Datastructuren
Hash Tables
ADT map, dictionary
associative array, [hash-]map, symbol table, or dictionary wiki
is composed of a collection of (key, value) pairs;each possible key appears at most once
                find        insert      delete      ordered
unordered list  n           1           n           no
binary tree     log n / n   log n / n   log n / n   yes
balanced        log n       log n       log n       yes
hash table      n           n           n           no

(hash table: worst case; averages on the next slide)
Datastructuren
Hash Tables
ADT map, dictionary
associative array, [hash-]map, symbol table, or dictionary wiki
is composed of a collection of (key, value) pairs;each possible key appears at most once
                find        insert      delete      ordered
unordered list  n           1           n           no
binary tree     log n / n   log n / n   log n / n   yes
balanced        log n       log n       log n       yes
hash table      1 / n       1 / n       1 / n       no

entries are average / worst case
Datastructuren
Hash Tables
Hashing
Store keys of arbitrary size (usually large domains) in table offixed size (usually small)
Hash table: ADT that performs finds, insertions and deletionsin (on avg) constant time
Used to implement unordered sets and maps (C++ STL, Java), to store passwords, and for checksums (MD5, CRC32)

Hash function calculates a position in the table: h(k) mod TableSize
Collision: attempt to store key k when h(k) is occupied
Collision resolution
perfect hashing: Keys are known a-priori; can avoid collisions
open addressing: Collision resolved by storing key elsewhere
chained hashing: Store multiple keys at the same address(i.e. table entries are linked lists of items with same hash)
Datastructuren
Hash Tables
Perfect Hash Function
Cichelli
h(w) = |w| + v(first(w)) + v(last(w)), with v defined by:

letter  a  b  c  f  g  h  i  l  m  n  p  r  s  t  u  v  w  y
v      11 15  1 15  3 15 13 15 15 13 15 14  6  6 14 10  6 13
Value for other letters: 0
[Figure: the resulting table, positions 2-37, holds the Pascal reserved words (do, end, else, case, downto, goto, to, otherwise, type, while, const, div, and, set, or, of, mod, file, record, packed, not, then, procedure, with, repeat, var, in, array, if, nil, for, begin, until, label, function, program) without collisions.]
h(goto) = |goto|+ v(g) + v(o) = 4 + 3 + 0 = 7
h(const) = |const| + v(c) + v(t) = 5 + 1 + 6 = 12
Datastructuren
Hash Tables
Open Addressing
Open Addressing
If insert causes collision, attempt to store hash elsewhere
Extend hash function with extra parameter i, the number ofthe attempt to store the key
General structure of hash function:h(k, i) = (g(k)− f(i)) mod TableSize
Linear probing: f(i) is linear (in i), i.e. f(i) = i
Quadratic probing: f(i) is quadratic (in i), i.e. f(i) = i2
Double hashing: h(k) = (g(k)− i ∗ f(k)) mod TableSize
Insert, find and remove operations
Insert/Find: probe at h(k, 0), h(k, 1), h(k, 2), ...
Delete: keep tag for each cell: active, deleted, empty
Datastructuren
Hash Tables
Open Addressing
linear
keys 605, 297, 748, 385, 198, 231 and 407
address function g(K) = K mod 11
probe function f(i) = i
hash function h(K, i) = ((K mod 11) − i) mod 11
Datastructuren
Hash Tables
Open Addressing
linear
keys 605, 297, 748, 385, 198, 231 and 407
address function g(K) = K mod 11
probe function f(i) = i
hash function h(K, i) = ((K mod 11) − i) mod 11

[Figure: successive states of the size-11 table while the keys are inserted; a colliding key probes downward, one cell at a time, to the next free position.]
Datastructuren
Hash Tables
Open Addressing
linear step size f(i) = 3i
h(K, i) = ((K mod 10) − 3i) mod 10

[Figure: the keys 65, 32, 43, 55, 72, 19 placed in a size-10 table.]

Neighbors (relative to step size 3): 0 3 6 9 2 5 8 1 4 7 — read in that order, the cluster is 65 43 72 19 32 55
Primary clustering: keys with “nearby” hash in same cluster
Careful! Pick step size coprime to table size, if not, insertscan fail even if table is not full: not all positions are probed
Datastructuren
Hash Tables
Open Addressing
quadratic
keys 605, 297, 748, 385, 198, 231 and 407
address function h(K) = K mod 11
probe function f(i) = ±i²
hash function h(K, i) = (h(K) ± i²) mod 11
probes at h(K) ± 1, h(K) ± 4, h(K) ± 9, . . .
Datastructuren
Hash Tables
Open Addressing
quadratic
keys 605, 297, 748, 385, 198, 231 and 407
address function h(K) = K mod 11
probe function f(i) = ±i²
hash function h(K, i) = (h(K) ± i²) mod 11
probes at h(K) ± 1, h(K) ± 4, h(K) ± 9, . . .

[Figure: table states with quadratic probing of the same keys.]
Secondary clustering: only keys with same hash cluster
Datastructuren
Hash Tables
Open Addressing
double
keys 605, 297, 748, 385, 198, 231 and 407
table size: 11
address function g(K) = K mod 11
probe function p(K) = (K mod 4) + 1
hash function h(K, i) = (g(K) − i · p(K)) mod 11
Datastructuren
Hash Tables
Open Addressing
double
keys 605, 297, 748, 385, 198, 231 and 407
table size: 11
address function g(K) = K mod 11
probe function p(K) = (K mod 4) + 1
hash function h(K, i) = (g(K) − i · p(K)) mod 11

[Figure: table states with double hashing; keys with the same home address follow different probe sequences.]
Minimize clustering: diff. probes even for keys with same hash
Datastructuren
Hash Tables
Open Addressing
expected number of probes at load factor α (ranges shown for α = 0.5-0.8):

            find / successful                  add / unsuccessful
linear      ½(1 + 1/(1−α))          1.5-3      ½(1 + 1/(1−α)²)            2.5-13
quadratic   1 + ln(1/(1−α)) − α/2   1.4-2.2    1/(1−α) − α + ln(1/(1−α))  2.2-5.8
double      (1/α)·ln(1/(1−α))       1.4-2.0    1/(1−α)                    2-5

[Figure: expected probes plotted against load factor α for linear, quadratic and double hashing; all curves blow up as α → 1.]
Datastructuren
Hash Tables
Chaining
chaining
keys 605, 297, 748, 385, 198, 231 and 407
table size: 11
address function h(K) = K mod 11

[Figure: a size-11 table of linked lists; colliding keys are chained together at their home address.]
Datastructuren
Hash Tables
Choosing a hash function
A good hash function h(K) should
be fast to compute, and
evenly and deterministically distribute the keys over the table
depend on all “distinctive bits” of the key K
Techniques:
• extraction: compute address based on selected bits of key
• division: address = key mod TSize, choose TSize carefully
• folding: chop key into parts, combine (add/xor) parts
• mid-squaring: square key and take middle bits
Datastructuren
Hash Tables
Choosing a hash function
MurmurHash
Murmur3_32(key, len, seed)
// integer arithmetic with unsigned 32 bit integers.
c1 := 0xcc9e2d51
c2 := 0x1b873593
r1 := 15
r2 := 13
m := 5
n := 0xe6546b64
hash := seed
for each fourByteChunk of key
k := fourByteChunk
k := k * c1
k := (k << r1) OR (k >> (32-r1))
k := k * c2
hash := hash XOR k
hash := (hash << r2) OR (hash >> (32-r2))
hash := hash * m + n
with any remainingBytesInKey
// (also do endian swapping on big-endian machines)
remainingBytes := remainingBytesInKey * c1
remainingBytes := (remainingBytes << r1) OR (remainingBytes >> (32 - r1))
remainingBytes := remainingBytes * c2
hash := hash XOR remainingBytes
hash := hash XOR len
hash := hash XOR (hash >> 16)
hash := hash * 0x85ebca6b
hash := hash XOR (hash >> 13)
hash := hash * 0xc2b2ae35
hash := hash XOR (hash >> 16)
Datastructuren
Data Compression
Contents
9 Data Compression
Huffman Coding
Lempel-Ziv-Welch
Burrows-Wheeler
Datastructuren
Data Compression
lossless (omkeerbaar)
GIF

[Figure: a small image whose pixels are colour-table indices 0, 1, 2; long runs of equal indices make the image compress well.]
Datastructuren
Data Compression
lossless (omkeerbaar)
Huffman vs. LZW coding
prefix code trie
[Figure: left, a Huffman prefix-code trie with the letters at the leaves and 0/1 on the edges (e ↦ 011, f ↦ 10); right, an LZW dictionary trie with letters on the edges and code numbers in the nodes (cb ↦ 7, aaa ↦ 11).]
Datastructuren
Data Compression
lossless (omkeerbaar)
Huffman vs. LZW coding
[Figure: the same two tries side by side.]

Huffman                          LZW
frequencies given                self-learning
single letter ↦ variable bits    variable string ↦ fixed length
prefix code-tree                 trie
- letters as leaves              - letters along edges
- bits left/right                - code in node
store/send the code              the decoder learns the code too
Datastructuren
Data Compression
va-kan-tie-oord
Morse
[Figure: the Morse code tree: dot = left branch, dash = right branch; the frequent letters E and T sit closest to the root, then I A N M, down to H V F L P J B X C Y Z Q.]

BYOXO: Are you trying to weasel out of our deal?
tks om bv cu 73: thanks, old-man, bon-voyage, see-you, best regards
Datastructuren
Data Compression
Huffman Coding
frequencies a:19, b:8, c:9, d:9, e:7

[Figure: the Shannon-Fano tree (top-level split 19+9 versus 9+8+7) and the Huffman tree for these frequencies.]

Shannon-Fano: 2·(19 + 9 + 9) + 3·(8 + 7) = 119
Huffman: 1·19 + 3·(9 + 9 + 8 + 7) = 118 :)
Datastructuren
Data Compression
Huffman Coding
David Albert Huffman (1925-1999)
photos: 1978, UCSC (maa.org); 1991, Matthew Mulbry (SciAm / huffmancoding.com)
Datastructuren
Data Compression
Huffman Coding
Huffman (1952)
variable length code (bitstring) for single letters: a1, . . . , an ∈ Σ ↦ w1, . . . , wn ∈ {0, 1}*

based on character frequencies (known in advance): f1, . . . , fn

optimal expected code length (for a prefix code): ∑_{i=1}^{n} f_i · |w_i|

code has to be known by the decoder

the ‘old’ Shannon-Fano algorithm does not always produce an optimal code
Datastructuren
Data Compression
Huffman Coding
Huffman
// initialize:
for each input letter: create tree with that letter
                       and its frequency
repeat until one tree left:
  take two trees of minimal frequencies
  join these as children in a new tree,
  with combined (summed) frequency
Datastructuren
Data Compression
Huffman Coding
[Figure: Huffman construction for a:18, b:8, c:12, d:13, e:7, f:21: join e+b = 15, then c+d = 25, then 15+a = 33, then f+25 = 46, and finally 33+46 = 79.]
Datastructuren
Data Compression
Huffman Coding
choices, choices, . . .

a:10, b:5, c:5, d:5

[Figure: two different optimal code trees for these frequencies.]

balanced: 2·10 + 2·5 + 2·5 + 2·5 = 50
skewed: 1·10 + 2·5 + 3·5 + 3·5 = 50
Datastructuren
Data Compression
Lempel-Ziv-Welch

Ziv-Lempel & Welch (1977, 1984)

fixed length code for repeating patterns in the input: x1, . . . , xn ∈ Σ* ↦ w1, . . . , wn ∈ {0, 1}^k

the strings xi and their codes are learned while reading the input

the code is also learned by the decoder and does not have to be transmitted
Datastructuren
Data Compression
Lempel-Ziv-Welch

Ziv-Lempel & Welch - compression

ZLW-compress
initialize dict with codes for single characters
w = "";
while ( not end of input )
do
read next character c
if w+c exists in the dict
w = w+c;
else
add to dict: w+c;
output code(w);
w = c;
fi
od
output code(w)
Datastructuren
Data Compression
Lempel-Ziv-Welch
input abab cbab abaa aa

w    c  w+c in dict?  output  new code
-    a  yes (w := a)
a    b  no            1       4 ↦ ab
b    a  no            2       5 ↦ ba
a    b  yes
ab   c  no            4       6 ↦ abc
c    b  no            3       7 ↦ cb
b    a  yes
ba   b  no            5       8 ↦ bab
b    a  yes
ba   b  yes
bab  a  no            8       9 ↦ baba
a    a  no            1       10 ↦ aa
a    a  yes
aa   a  no            10      11 ↦ aaa
a    ⊥                1

dictionary: 0 (end), 1 a, 2 b, 3 c, 4 ab, 5 ba, 6 abc, 7 cb, 8 bab, 9 baba, 10 aa, 11 aaa
Datastructuren
Data Compression
Lempel-Ziv-Welch

Ziv-Lempel & Welch - decompression

Decoding 1 2 4 3 5 8 1 10 1:

code  text  new code
            1, 2, 3 ↦ a, b, c   (initialization)
1     a                          we learn each new code one step late
2     b     4 ↦ ab               last text + first letter of current text
4     ab    5 ↦ ba
3     c     6 ↦ abc
5     ba    7 ↦ cb
8     bab   8 ↦ bab              the new code arrives too late! it must be of the form last text (ba) + its first letter (b)
1     a     9 ↦ baba
10    aa    10 ↦ aa              too late again
1     a     11 ↦ aaa
Datastructuren
Data Compression
Lempel-Ziv-Welch

Ziv-Lempel & Welch - decompression

ZLW-decompress
initialize dict with codes for single characters
read first code in variable prev and output str(prev)
while( not end of input )
read w;
if w exists in the dict
output str(w);
add to dict: str(prev) + firstchar(str(w));
else
// special case
output str(prev) + firstchar(str(prev));
add to dict: str(prev) + firstchar(str(prev));
fi
prev = w;
od
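The decoder can be sketched in Python as well (the name `lzw_decompress` is my own; the special case is the "code arrives one step too late" situation from the worked example):

```python
def lzw_decompress(codes, alphabet="abc"):
    """LZW decompression; rebuilds the dictionary one step behind the encoder."""
    strings = {i + 1: ch for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    prev = codes[0]
    out = [strings[prev]]
    for w in codes[1:]:
        if w in strings:
            entry = strings[w]
        else:
            # special case: code not learned yet; it must be
            # last text + its own first letter
            entry = strings[prev] + strings[prev][0]
        out.append(entry)
        strings[next_code] = strings[prev] + entry[0]  # learn the delayed code
        next_code += 1
        prev = w
    return "".join(out)

print(lzw_decompress([1, 2, 4, 3, 5, 8, 1, 10, 1]))  # ababcbababaaaa
```

Running the decoder on the encoder's output recovers the original input, which is exactly the round trip the slides promise.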
Datastructuren
Data Compression
Burrows-Wheeler
a little trick
MISSISSIPPI 7→ SSMP-PISSIII
Datastructuren
Data Compression
Burrows-Wheeler
MISSISSIPPI- , rotate

 1  M I S S I S S I P P I -
 2  I S S I S S I P P I - M
 3  S S I S S I P P I - M I
 4  S I S S I P P I - M I S
 5  I S S I P P I - M I S S
 6  S S I P P I - M I S S I
 7  S I P P I - M I S S I S
 8  I P P I - M I S S I S S
 9  P P I - M I S S I S S I
10  P I - M I S S I S S I P
11  I - M I S S I S S I P P
12  - M I S S I S S I P P I

alphabetize, read off the last column

 8  I P P I - M I S S I S S
 5  I S S I P P I - M I S S
 2  I S S I S S I P P I - M
11  I - M I S S I S S I P P
 1  M I S S I S S I P P I -
10  P I - M I S S I S S I P
 9  P P I - M I S S I S S I
 7  S I P P I - M I S S I S
 4  S I S S I P P I - M I S
 6  S S I P P I - M I S S I
 3  S S I S S I P P I - M I
12  - M I S S I S S I P P I
Datastructuren
Data Compression
Burrows-Wheeler
decode

last column (SSMP-PISSIII) next to the first column, each occurrence ranked:

S1  I1
S2  I2
M1  I3
P1  I4
-   M1
P2  P1
I1  P2
S3  S1
S4  S2
I2  S3
I3  S4
I4  -

in every row the last-column symbol immediately precedes the first-column symbol in the text; start after the marker - and follow the chain:

M1 I3 S4 S2 I2 S3 S1 I1 P2 P1 I4, i.e. MISSISSIPPI
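A small Python sketch of both directions (names `bwt` and `inverse_bwt` are my own). Note one assumption that differs from the slide: Python sorts the marker '-' before the letters, so the transformed string comes out as IPSSM-PISSII rather than SSMP-PISSIII; the round trip still works either way:

```python
def bwt(s):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last, sentinel="-"):
    """Naive inversion: repeatedly prepend the last column and re-sort;
    after n rounds the table holds all rotations, and the row ending in
    the sentinel is the original string."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(sentinel))

s = "MISSISSIPPI-"
print(bwt(s))                    # IPSSM-PISSII ('-' sorts first in Python)
print(inverse_bwt(bwt(s)) == s)  # True
```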
Datastructuren
Data Compression
Burrows-Wheeler
Quiz
Add final step for Floyd
A3 =
  0   2   1   6
  3   0   1   4
  4   1   0   5
 −2   0  −1   0
Datastructuren
Pattern Matching
Contents
10 Pattern MatchingKnuth-Morris-PrattAho-CorasickComparing texts
Datastructuren
Pattern Matching
naive
[Figure, six frames: naive matching of P = ABCABABC in T = ABABCABCABABCC. At each shift the characters are compared left to right (↑ = match, × = mismatch); on a mismatch the pattern shifts one position right. Comparisons per shift: 3, 1, 6, 1, 1, and finally 8 for the complete match at position 6.]
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
match pattern against itself

T = . . . ABCABAB? . . .
P = ABCABAB×

[Figure: after the mismatch at pattern position 8, P is slid along its own matched part; restarting at positions 2, 3, 4 and 5 fails immediately, while position 6 aligns the prefix AB with the matched suffix AB, so comparison can resume at pattern position 3.]
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
linear-time algorithm (1970, 1977)
Donald Knuth, Vaughan Pratt, and James H. Morris
failure links
linear time preprocessing
search never backs up in the text
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
[Figure: matching each prefix of the pattern against itself to find, for k = 2, . . . , 7, the longest proper prefix matching the characters just before position k; this yields FLink[k] = 1, 1, 1, 2, 3, 2.]

k        1 2 3 4 5 6 7 8
P[k]     A B C A B A B C
FLink[k] 0 1 1 1 2 3 2 3

at position k: the maximal r < k such that P1 . . . Pr−1 = Pk−r+1 . . . Pk−1

on a mismatch at position k, continue at position FLink[k] (and the same position in the Text)

FLink[k] = 0: next position in Text, first position in Pattern
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
k        1 2 3 4 5 6 7 8
P[k]     A B C A B A B C
FLink[k] 0 1 1 1 2 3 2 3

[Figure: the pattern drawn as a chain of states 0-9: a match advances to the next state and skips to the next letter in the text, a mismatch follows the fail link.]
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
KMP search
// using failure links
Pos = 1 // position in pattern
TPos = 1 // position in text
while ((Pos <= PatLen) and (TPos <= TextLen)) do
if (P[Pos] == Text[TPos]) then
Pos ++;
TPos ++;
else
Pos = FLink[Pos]
if (Pos == 0) then
// start from scratch at next position in text
Pos = 1
TPos ++;
fi
fi
od
if (Pos > PatLen) then
Pattern found in text at position TPos-PatLen
fi
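The search can be sketched in Python, staying close to the 1-indexed pseudocode; the FLink table is the one from the slide for P = ABCABABC (stored at list index k-1), and the function name is my own:

```python
def kmp_search(text, pattern, flink):
    """KMP search with 1-indexed failure links, as in the pseudocode."""
    pos, tpos = 1, 1                      # positions in pattern and text
    while pos <= len(pattern) and tpos <= len(text):
        if pattern[pos - 1] == text[tpos - 1]:
            pos += 1
            tpos += 1
        else:
            pos = flink[pos - 1]          # FLink[k] stored at index k-1
            if pos == 0:                  # start from scratch at next text position
                pos = 1
                tpos += 1
    if pos > len(pattern):
        return tpos - len(pattern)        # 1-indexed start of the match
    return None

flink = [0, 1, 1, 1, 2, 3, 2, 3]          # FLink for ABCABABC (slide table)
print(kmp_search("ABABCABCABABCC", "ABCABABC", flink))  # 6
```

Note that the text position only moves forward: the search never backs up in the text.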
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
computing KMP failure links
k = 1 // position in pattern
FLink[1] = 0
for k = 2 to PatLen do
Fail = FLink[k-1]
while ( (Fail > 0) and (P[Fail] != P[k-1]) ) do
Fail = FLink[Fail]
od
FLink[k] = Fail+1
od
[Figure: extending the table to position k: the character P[Fail] after the candidate prefix is compared with P[k-1].]
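The same computation in Python (function name mine, 1-indexed FLink values stored at list index k-1), reproducing the table for ABCABABC:

```python
def compute_flinks(pattern):
    """1-indexed KMP failure links, following the slide's pseudocode."""
    n = len(pattern)
    flink = [0] * n                       # FLink[1] = 0
    for k in range(2, n + 1):
        fail = flink[k - 2]               # start from FLink[k-1]
        # follow failure links until the prefix can be extended by P[k-1]
        while fail > 0 and pattern[fail - 1] != pattern[k - 2]:
            fail = flink[fail - 1]
        flink[k - 1] = fail + 1
    return flink

print(compute_flinks("ABCABABC"))  # [0, 1, 1, 1, 2, 3, 2, 3]
```

Each iteration only walks down previously computed links, which is what makes the preprocessing linear time overall.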
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
why does this work?

all prefixes that are also a suffix: P1 . . . Pt−1 = Pk−t+1 . . . Pk−1

can be found by following failure links: t0 = FLink[k] and ti = FLink[ti−1]

[Figure: the nested prefix/suffix pairs P1 · · · Pt0−1 = Pk−t0+1 · · · Pk−1 and P1 · · · Pt1−1 = Pk−t1+1 · · · Pk−1.]
Datastructuren
Pattern Matching
Knuth-Morris-Pratt
k         1 2 3 4 5 6 7 8
P[k]      A B C A B A B C
FLink[k]  0 1 1 1 2 3 2 3
FLink′[k] 0 1 1 0 1 3 1 1

improving KMP failure links
for Pos = 2 to PatLen
do if ( P[Pos] == P[FLink[Pos]] )
   then FLink[Pos] = FLink[FLink[Pos]]
   fi
od
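In Python (function name mine, values again stored at index k-1): if the character after the fallback prefix equals P[Pos], that fallback would mismatch immediately again, so we skip straight to its own failure link. Applied to the slide's table this reproduces FLink′:

```python
def improve_flinks(pattern, flink):
    """Path-compress failure links that would mismatch on the same character."""
    flink = list(flink)                   # 1-indexed values at index k-1
    for pos in range(2, len(pattern) + 1):
        if pattern[pos - 1] == pattern[flink[pos - 1] - 1]:
            flink[pos - 1] = flink[flink[pos - 1] - 1]
    return flink

flink = [0, 1, 1, 1, 2, 3, 2, 3]          # FLink for ABCABABC
print(improve_flinks("ABCABABC", flink))  # [0, 1, 1, 0, 1, 3, 1, 1]
```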
Datastructuren
Pattern Matching
Aho-Corasick
{aaa, abc, baa, baba, cb}

[Figure: trie of the five patterns, nodes numbered 1-12, edges labelled a, b, c.]

trie
Datastructuren
Pattern Matching
Aho-Corasick
{aaa, abc, baa, baba, cb}

[Figure: the same trie, now with failure links drawn in.]

failure links
Datastructuren
Pattern Matching
Aho-Corasick
{aaa, abc, baa, baba, cb}

[Figure: the trie with failure links, tracing the search for aaba . . .]

searching aaba . . .
Datastructuren
Pattern Matching
Aho-Corasick
{aaa, abc, baa, baba, cb}

[Figure: constructing the next failure link by following the parent's failure link until the new character can be read.]

construct next failure link
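The figures can be condensed into a small Python sketch (function names mine): build the trie, add failure links breadth-first by following the parent's failure link, then scan the text in a single pass:

```python
from collections import deque

def aho_corasick(patterns):
    """Goto/fail/output tables for the pattern set; state 0 is the trie root."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                        # 1. build the trie
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    queue = deque(goto[0].values())           # 2. failure links, breadth first
    while queue:                              # depth-1 nodes fail to the root
        r = queue.popleft()
        for c, s in goto[r].items():
            queue.append(s)
            f = fail[r]                       # follow the parent's failure links
            while f and c not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(c, 0)
            out[s] |= out[fail[s]]            # inherit patterns ending at fail[s]
    return goto, fail, out

def search(text, patterns):
    """Scan the text once, reporting (0-based start, pattern) for every match."""
    goto, fail, out = aho_corasick(patterns)
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:         # fall back on mismatch
            s = fail[s]
        s = goto[s].get(c, 0)
        hits += [(i - len(p) + 1, p) for p in out[s]]
    return hits

print(sorted(search("ababcbababaaaa", ["aaa", "abc", "baa", "baba", "cb"])))
# [(2, 'abc'), (4, 'cb'), (5, 'baba'), (7, 'baba'), (9, 'baa'), (10, 'aaa'), (11, 'aaa')]
```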
Datastructuren
Pattern Matching
Comparing texts
alignment
enzymes and their amino acids
82 TYHMCQFHCRYVNNHSGEKLYECNERSKAFSCPSHLQCHKRRQIGEKTHEHNQCGKAFPT 60
81 --------------------YECNQCGKAFAQHSSLKCHYRTHIGEKPYECNQCGKAFSK 40
****: .***: * *:** * :****.:* *******..
82 PSHLQYHERTHTGEKPYECHQCGQAFKKCSLLQRHKRTHTGEKPYE-CNQCGKAFAQ- 116
81 HSHLQCHKRTHTGEKPYECNQCGKAFSQHGLLQRHKRTHTGEKPYMNVINMVKPLHNS 98
**** *:***********:***:**.: .*************** : *.: :
Datastructuren
Pattern Matching
Comparing texts
similarity TCAGACGATTG and TCGGAGCTG

TCAG-ACG-ATTG
TC-GGA-GC-T-G

TCAGACGATTG
TCGGA-GCT-G

match G/G, mismatch A/G, insdel (gap) -/G or A/-
Datastructuren
Pattern Matching
Comparing texts
global alignment
TTCAT vs. TGCATCGT
[Figure: dynamic-programming grid with TGCATCGT along the top and TTCAT down the side; horizontal and vertical steps are insdels, diagonal steps are matches or mismatches.]
as shortest path
Datastructuren
Pattern Matching
Comparing texts
global versus local alignment
[Figure: global alignment fills the whole m × n table from (1,1) to (m,n); local alignment additionally allows entries to restart at 0 and takes the maximum over all cells.]
Needleman-Wunsch (1970), Smith-Waterman (1981)
Levenshtein distance (1966)
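The global-alignment grid with unit costs is exactly the Levenshtein distance; a short Python sketch for the slide's example TTCAT vs. TGCATCGT (function name mine):

```python
def levenshtein(s, t):
    """Edit distance via the standard DP table; cell (i, j) is the cheapest
    way to turn s[:i] into t[:j] using match/mismatch and insdel steps."""
    prev = list(range(len(t) + 1))        # row for the empty prefix of s
    for i, sc in enumerate(s, 1):
        cur = [i]                         # deleting i characters of s
        for j, tc in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (sc != tc),  # diagonal: (mis)match
                           prev[j] + 1,               # vertical: insdel
                           cur[j - 1] + 1))           # horizontal: insdel
        prev = cur
    return prev[-1]

print(levenshtein("TTCAT", "TGCATCGT"))  # 4: one mismatch plus three gaps
```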
Datastructuren
Pattern Matching
Comparing texts
end.