Data Structures



Page 1: Data Structures

CS 124 Course Notes 1 Spring 2002

An algorithm is a recipe or a well-defined procedure for performing a calculation, or in general, for transforming

some input into a desired output. Perhaps the most familiar algorithms are those for adding and multiplying

integers. Here is a multiplication algorithm that is different from the standard algorithm you learned in school: write

the multiplier and multiplicand side by side. Repeat the following operations - divide the first number by 2 (throw

out any fractions) and multiply the second by 2, until the first number is 1. This results in two columns of numbers.

Now cross out all rows in which the first entry is even, and add all entries of the second column that haven’t been

crossed out. The result is the product of the two numbers.

     75     29
     37     58
     18    116   (crossed out)
      9    232
      4    464   (crossed out)
      2    928   (crossed out)
      1   1856
          ----
          2175

    29 × 1001011 (75 in binary) = 29 + 58 + 232 + 1856 = 2175



Figure 1.1: A different multiplication algorithm.




In this course we will ask a number of basic questions about algorithms:

• Does it halt?

The answer for the algorithm given above is clearly yes, provided we are multiplying positive integers. The

reason is that for any integer greater than 1, when we divide it by 2 and throw out the fractional part, we always

get a smaller integer which is greater than or equal to 1. Hence our first number is eventually reduced to 1 and

the process halts.

• Is it correct?

To see that the algorithm correctly computes the product of the integers, observe that if we write a 0 for each

crossed out row, and 1 for each row that is not crossed out, then reading from bottom to top just gives us

the first number in binary. Therefore, the algorithm is just doing standard multiplication, with the multiplier

written in binary.

• Is it fast?

It turns out that the above algorithm is about as fast as the standard algorithm you learned in school. Later in

the course, we will study a faster algorithm for multiplying integers.

• How much memory does it use?

The memory used by this algorithm is also about the same as that of the standard algorithm.
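The halving-and-doubling procedure from Figure 1.1 can be sketched in a few lines of Python. This is only an illustration: the function name `halving_multiply` is our own, and the sketch assumes both inputs are positive integers, as in the halting argument above.

```python
def halving_multiply(a: int, b: int) -> int:
    """Multiply two positive integers by repeated halving and doubling."""
    if a < 1 or b < 1:
        raise ValueError("both arguments must be positive integers")
    total = 0
    while True:
        if a % 2 == 1:
            # First entry is odd: this row is not crossed out, so add it.
            total += b
        if a == 1:
            break
        a //= 2   # divide by 2, throwing out any fraction
        b *= 2    # double the second number
    return total


print(halving_multiply(75, 29))  # 2175, matching Figure 1.1
```

The rows that contribute to `total` are exactly those where the halved number is odd, that is, the 1 bits of the first number in binary.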



The history of algorithms for simple arithmetic is quite fascinating. Although we take these algorithms for

granted, their widespread use is surprisingly recent. The key to good algorithms for arithmetic was the positional

number system (such as the decimal system). Roman numerals (I, II, III, IV, V, VI, etc) are just the wrong data

structure for performing arithmetic efficiently. The positional number system was first invented by the Mayan

Indians in Central America about 2000 years ago. They used a base 20 system, and it is unknown whether they had

invented algorithms for performing arithmetic, since the Spanish conquerors destroyed most of the Mayan books on

science and astronomy.

The decimal system that we use today was invented in India in roughly 600 AD. This positional number system,

together with algorithms for performing arithmetic, was transmitted to Persia around 750 AD, when several important

Indian works were translated into Arabic. Around this time the Persian mathematician Al-Khwarizmi wrote his

Arabic textbook on the subject. The word “algorithm” comes from Al-Khwarizmi’s name. Al-Khwarizmi’s work

was translated into Latin around 1200 AD, and the positional number system was propagated throughout Europe

from 1200 to 1600 AD.

The decimal point was not invented until the 10th century AD, by a Syrian mathematician al-Uqlidisi from

Damascus. His work was soon forgotten, and five centuries passed before decimal fractions were re-invented by the

Persian mathematician al-Kashi.

With the invention of computers in this century, the field of algorithms has seen explosive growth. There are a

number of major successes in this field:

• Parsing algorithms - these form the basis of the field of programming languages.

• Fast Fourier transform - the field of digital signal processing is built upon this algorithm.

• Linear programming - this algorithm is extensively used in resource scheduling.

• Sorting algorithms - until recently, sorting used up the bulk of computer cycles.

• String matching algorithms - these are extensively used in computational biology.

• Number theoretic algorithms - these algorithms make it possible to implement cryptosystems such as the RSA

public key cryptosystem.

• Compression algorithms - these algorithms allow us to transmit data more efficiently over, for example, phone lines.

• Geometric algorithms - displaying images quickly on a screen often makes use of sophisticated algorithmic techniques.


In designing an algorithm, it is often easier and more productive to think of a computer in abstract terms. Of

course, we must carefully choose at what level of abstraction to think. For example, we could think of computer

operations in terms of a high level computer language such as C or Java, or in terms of an assembly language. We

could dip further down, and think of the computer at the level of AND and NOT gates.

For most algorithm design we undertake in this course, it is generally convenient to work at a fairly high level.

We will usually abstract away even the details of the high level programming language, and write our algorithms in

“pseudo-code”, without worrying about implementation details. (Unless, of course, we are dealing with a programming

assignment!) Sometimes we have to be careful that we do not abstract away essential features of the problem.

To illustrate this, let us consider a simple but enlightening example.



1.1 Computing the nth Fibonacci number

Remember the famous sequence of numbers invented in the 13th century by the Italian mathematician Leonardo

Fibonacci? The sequence is represented as $F_0, F_1, F_2, \ldots$, where $F_0 = 0$, $F_1 = 1$, and for all $n \ge 2$, $F_n$ is defined as $F_{n-1} + F_{n-2}$. The first few Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ... The value of $F_{31}$ is greater than a million! It is easy to see that the Fibonacci numbers grow exponentially. As an exercise, try to show that $F_n \ge 2^{n/2}$ for sufficiently large n by a simple induction.

Here is a simple program to compute Fibonacci numbers that slavishly follows the definition.

function F(n: integer): integer

if n = 0 then return 0

else if n = 1 then return 1

else return F(n−1)+F(n−2)

The program is obviously correct. However, it is woefully slow. As it is a recursive algorithm, we can naturally

express its running time on input n with a recurrence equation. In fact, we will simply count the number of addition

operations the program uses, which we denote by T (n). To develop a recurrence equation, we express T (n) in terms

of smaller values of T . We shall see several such recurrence relations in this class.

It is clear that T (0) = 0 and T (1) = 0. Otherwise, for n ≥ 2, we have

T (n) = T (n−1)+T(n−2)+1,

because to compute F(n) we compute F(n−1) and F(n−2) and do one other addition besides. This is (almost) the Fibonacci equation! Hence we can see that the number of addition operations is growing very large; it is at least $2^{n/2}$ for $n \ge 4$.
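To make the recurrence for T(n) concrete, here is a small Python sketch of the naive program that also counts its additions. The name `fib_naive` and the choice to return a pair (value, additions) are our own conventions for illustration.

```python
def fib_naive(n: int):
    """Return (F_n, number of additions) by following the definition directly."""
    if n == 0:
        return 0, 0
    if n == 1:
        return 1, 0
    a, additions_a = fib_naive(n - 1)
    b, additions_b = fib_naive(n - 2)
    # One extra addition combines the two recursive results,
    # mirroring T(n) = T(n-1) + T(n-2) + 1.
    return a + b, additions_a + additions_b + 1


for n in (4, 10, 20):
    value, additions = fib_naive(n)
    print(n, value, additions)
# 4 3 4
# 10 55 88
# 20 6765 10945
```

Already at n = 20 the program performs over ten thousand additions to produce a four-digit answer.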



Can we do better? This is the question we shall always ask of our algorithms. The trouble with the naive

algorithm is the wasteful recursion: the function F is called with the same argument over and over again, exponentially

many times (try to see how many times F(1) is called in the computation of F(5)). A simple trick for improving

performance is to avoid repeated calculations. In this case, this can be easily done by avoiding recursion and just

calculating successive values:

function F(n: integer): integer
    array A[0 . . . n] of integer
    A[0] = 0; A[1] = 1
    for i = 2 to n do:
        A[i] = A[i−1] + A[i−2]
    return A[n]

This algorithm is of course correct. Now, however, we only do n−1 additions.
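A direct Python rendering of the array-based version might look as follows. The name `fib_iter` is ours, and the guard for n = 0 is an addition of this sketch, since an array A[0..0] has no slot A[1] to initialize.

```python
def fib_iter(n: int) -> int:
    """Compute F_n by filling in an array of successive values."""
    if n == 0:
        return 0  # guard added in this sketch: A[1] would not exist
    A = [0] * (n + 1)
    A[0], A[1] = 0, 1
    for i in range(2, n + 1):
        A[i] = A[i - 1] + A[i - 2]  # one addition per iteration, n-1 in total
    return A[n]


print(fib_iter(30))  # 832040
```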



It seems that we have come so far, from exponential to polynomially many operations, that we can stop here.

But in the back of our heads, we should be wondering: can we do even better? Surprisingly, we can. We rewrite our

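equations in matrix notation. Then

\[
\begin{pmatrix} F_1 \\ F_2 \end{pmatrix}
= \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} F_0 \\ F_1 \end{pmatrix}.
\]

Similarly,

\[
\begin{pmatrix} F_2 \\ F_3 \end{pmatrix}
= \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} F_1 \\ F_2 \end{pmatrix}
= \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^{2}
\begin{pmatrix} F_0 \\ F_1 \end{pmatrix},
\]

and in general,

\[
\begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix}
= \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^{n}
\begin{pmatrix} F_0 \\ F_1 \end{pmatrix}.
\]
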

So, in order to compute Fn, it suffices to raise this 2 by 2 matrix to the nth power. Each matrix multiplication

takes 12 arithmetic operations, so the question boils down to the following: how many multiplications does it take

to raise a base (matrix, number, anything) to the nth power? The answer is O(logn). To see why, consider the case

where n > 1 is a power of 2. To raise X to the nth power, we compute $X^{n/2}$ and then square it. Hence the number of

multiplications T (n) satisfies

T (n) = T (n/2)+1,

from which we find T (n) = logn. As an exercise, consider what you have to do when n is not a power of 2.

(Hint: consider the connection with the multiplication algorithm of the first section; there too we repeatedly halved

a number...)
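As a sketch of the matrix method, the following Python code raises the 2 by 2 matrix to the nth power by repeated squaring. The helper names `mat_mul`, `mat_pow` and `fib_matrix` are our own, and the loop over the bits of n is one possible way to handle exponents that are not powers of 2 (so it gives away part of the exercise above).

```python
def mat_mul(X, Y):
    """Multiply two 2x2 matrices: 8 multiplications and 4 additions."""
    (a, b), (c, d) = X
    (e, f), (g, h) = Y
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))


def mat_pow(X, n: int):
    """Raise a 2x2 matrix to the nth power with O(log n) multiplications."""
    result = ((1, 0), (0, 1))  # identity matrix
    while n > 0:
        if n & 1:              # current bit of n is 1: include this power
            result = mat_mul(result, X)
        X = mat_mul(X, X)      # square for the next bit
        n >>= 1
    return result


def fib_matrix(n: int) -> int:
    # Applying the nth power to (F_0, F_1) = (0, 1) picks out its second
    # column, whose top entry is F_n.
    return mat_pow(((0, 1), (1, 1)), n)[0][1]


print([fib_matrix(n) for n in range(11)])
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```

Python integers have arbitrary precision, so the entries stay exact, but each multiplication then costs more than one machine step as n grows.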

So we have reduced the computation time exponentially again, from n− 1 arithmetic operations to O(log n),

a great achievement. Well, not really. We got a little too abstract in our model. In our accounting of the time

requirements for all three methods, we have made a grave and common error: we have been too liberal about what

constitutes an elementary step. In general, we often assume that each arithmetic step takes unit time, because the

numbers involved will be typically small enough that we can reasonably expect them to fit within a computer’s

word. Remember, the number n is only log n bits in length. But in the present case, we are doing arithmetic on huge

numbers, with about n bits, where n is pretty large. When dealing with such huge numbers, if exact computation

is required we have to use sophisticated long integer packages. Such algorithms take O(n) time to add two n-bit

numbers. Hence the complexity of the first two methods was larger than we actually thought: not really $O(F_n)$ and

$O(n)$, but instead $O(nF_n)$ and $O(n^2)$, respectively. The second algorithm is still exponentially faster. What is worse,

the third algorithm involves multiplications of O(n)-bit integers. Let M(n) be the time required to multiply two n-bit

numbers. Then the running time of the third algorithm is in fact O(M(n)).



The comparison between the running times of the second and third algorithms boils down to a most important

and ancient issue: can we multiply two n-bit integers faster than $\Omega(n^2)$? This would be faster than the method we

learn in elementary school or the clever halving method explained in the opening of these notes.

As a final consideration, we might consider the mathematicians’ solution to computing the Fibonacci numbers.

A mathematician would quickly determine that

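\[
F_n = \frac{1}{\sqrt{5}} \left[ \left( \frac{1+\sqrt{5}}{2} \right)^{n} - \left( \frac{1-\sqrt{5}}{2} \right)^{n} \right].
\]
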

Using this, how many operations does it take to compute Fn? Note that this calculation would require floating point

arithmetic. Whether in practice that would lead to a faster or slower algorithm than one using just integer arithmetic

might depend on the computer system on which you run the algorithm.
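For comparison, here is a minimal floating point sketch of the closed form, assuming standard 64-bit floats. The name `fib_binet` is ours. Rounding to the nearest integer recovers the exact value only while the accumulated floating point error stays below 1/2, so this should be read as an illustration for moderate n, not as an exact method.

```python
import math


def fib_binet(n: int) -> int:
    """Approximate F_n from the closed form using floating point arithmetic."""
    sqrt5 = math.sqrt(5.0)
    phi = (1 + sqrt5) / 2
    psi = (1 - sqrt5) / 2
    # Exact in real arithmetic; round() absorbs small floating point error.
    return round((phi ** n - psi ** n) / sqrt5)


print([fib_binet(n) for n in range(11)])
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```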


CS 124 Lecture 2

In order to discuss algorithms effectively, we need to start with a basic set of tools. Here, we explain these tools

and provide a few examples. Rather than spend time honing our use of these tools, we will learn how to use them by

applying them in our studies of actual algorithms.

Induction


The standard form of the induction principle is the following:

If a statement P(n) holds for n = 1, and if for every n ≥ 1 P(n) implies P(n+1), then P holds for all n.

Let us see an example of this:

Claim 2.1 Let $S(n) = \sum_{i=1}^{n} i$. Then $S(n) = \frac{n(n+1)}{2}$.

Proof: The proof is by induction.

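Base Case: We show the statement is true for n = 1. As $S(1) = 1 = \frac{1 \cdot 2}{2}$, the statement holds.

Induction Hypothesis: We assume $S(n) = \frac{n(n+1)}{2}$.

Reduction Step: We show $S(n+1) = \frac{(n+1)(n+2)}{2}$. Note that $S(n+1) = S(n) + n + 1$. Hence

\[
\begin{aligned}
S(n+1) &= S(n) + n + 1 \\
       &= \frac{n(n+1)}{2} + n + 1 \\
       &= (n+1)\left(\frac{n}{2} + 1\right) \\
       &= \frac{(n+1)(n+2)}{2}.
\end{aligned}
\]
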



The proof style is somewhat pedantic, but instructional and easy to read. We break things down to the base case

– showing that the statement holds when n = 1; the induction hypothesis – the statement that P(n) is true; and the

reduction step – showing that P(n) implies P(n+1).

Induction is one of the most fundamental proof techniques. The idea behind induction is simple: take a large

problem (P(n + 1)), and somehow reduce its proof to a proof of a smaller problem (such as P(n); P(n) is smaller

in the sense that n < n+1). If every problem can thereby be broken down to a small number of instances (we keep

reducing down to P(1)), these can be checked easily. We will see this idea of reduction, whereby we reduce solving

a problem to solving an easier problem, over and over again throughout the course.

As one might imagine, there are other forms of induction besides the specific standard form we gave above.

Here’s a different form of induction, called strong induction:

If a statement P(n) holds for n = 1, and if for every n ≥ 1 the truth of P(i) for all i ≤ n implies P(n+1), then P holds

for all n.

Exercise: show that every number has a unique prime factorization using strong induction.



O Notation

When measuring, for example, the number of steps an algorithm takes in the worst case, our result will generally

be some function T (n) of the input size, n. One might imagine that this function may have some complex form, such

as $T(n) = 4n^2 - 3n \log n + n^{2/3} + \log^3 n - 4$. In very rare cases, one might wish to have such an exact form for the

running time, but in general, we are more interested in the rate of growth of T (n) rather than its exact form.

The O notation was developed with this in mind. With the O notation, only the fastest growing term is important,

and constant factors may be ignored. More formally:

Definition 2.2 We say for non-negative functions f (n) and g(n) that f (n) is O(g(n)) if there exist positive constants

c and N such that for all n ≥ N,

f (n) ≤ cg(n).



Let us try some examples. We claim that $2n^3 + 4n^2$ is $O(n^3)$. It suffices to show that $2n^3 + 4n^2 \le 6n^3$ for $n \ge 1$,

by definition. But this is clearly true as $4n^3 \ge 4n^2$ for $n \ge 1$. (Exercise: show that $2n^3 + 4n^2$ is $O(n^4)$.)

We claim $10 \log_2 n$ is $O(\ln n)$. This follows from the fact that $10 \log_2 n \le (10 \log_2 e) \ln n$.

If T (n) is as above, then T (n) is $O(n^2)$. This is a bit harder to prove, because of all the extraneous terms. It is,

however, easy to see; $4n^2$ is clearly the fastest growing term, and we can remove the constant with O notation. Note,

though, that T (n) is $O(n^3)$ as well! The O notation is not tight, but more like a ≤ comparison.



Similarly, there is notation for ≥ and = comparisons.

Definition 2.3 We say for non-negative functions f (n) and g(n) that f (n) is Ω(g(n)) if there exist positive constants

c and N such that for all n ≥ N,

f (n) ≥ cg(n).

We say that f (n) is Θ(g(n)) if both f (n) is O(g(n)) and f (n) is Ω(g(n)).

The O notation has several useful properties that are easy to prove.

Lemma 2.4 If $f_1(n)$ is $O(g_1(n))$ and $f_2(n)$ is $O(g_2(n))$ then $f_1(n) + f_2(n)$ is $O(g_1(n) + g_2(n))$.

Proof: There exist positive constants $c_1, c_2, N_1$, and $N_2$ such that $f_1(n) \le c_1 g_1(n)$ for $n \ge N_1$ and $f_2(n) \le c_2 g_2(n)$ for $n \ge N_2$. Hence $f_1(n) + f_2(n) \le \max\{c_1, c_2\}\,(g_1(n) + g_2(n))$ for $n \ge \max\{N_1, N_2\}$.

Exercise: Prove similar lemmata for f1(n) f2(n). Prove the lemmata when O is replaced by Ω or Θ.



Finally, there is a bit of notation corresponding to <<, when one function is (in some sense) much less than another.

Definition 2.5 We say for non-negative functions f (n) and g(n) that f (n) is o(g(n)) if

\[
\lim_{n \to \infty} \frac{f(n)}{g(n)} = 0.
\]

Also, f (n) is ω(g(n)) if g(n) is o( f (n)).

We emphasize that the O notation is a tool to help us analyze algorithms. It does not always accurately tell us

how fast an algorithm will run in practice. For example, constant factors make a huge difference in practice (imagine

increasing your bank account by a factor of 10), and they are ignored in the O notation. Like any other tool, the O

notation is only useful if used properly and wisely. Use it as a guide, not as the last word, to judging an algorithm.



Recurrence Relations

A recurrence relation defines a function using an expression that includes the function itself. For example, the

Fibonacci numbers are defined by:

F(n) = F(n−1)+F(n−2), F(1) = F(2) = 1.

This function is well-defined, since we can compute a unique value of F(n) for every positive integer n.

Note that recurrence relations are similar in spirit to the idea of induction. The relation defines a function value

F(n) in terms of the function values at smaller arguments (in this case, n− 1 and n− 2), effectively reducing the

problem of computing F(n) to that of computing F at smaller values. Base cases (the values of F(1) and F(2)) need

to be provided.

Finding exact solutions for recurrence relations is not an extremely difficult process; however, we will not

focus on solution methods for them here. Often a natural thing to do is to try to guess a solution, and then prove it

by induction. Alternatively, one can use a symbolic computation program (such as Maple or Mathematica); these

programs can often generate solutions.

We will occasionally use recurrence relations to describe the running times of algorithms. For our purposes, we

often do not need to have an exact solution for the running time, but merely an idea of its asymptotic rate of growth.

For example, the relation

T (n) = 2T (n/2)+2n, T (1) = 1

has the exact solution (for n a power of 2) of $T(n) = 2n \log_2 n + n$. (Exercise: Prove this by induction.) But for our

purposes, it is generally enough to know that the solution is Θ(n log n).
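Before attempting the induction proof, the exact solution can be checked numerically. The sketch below simply evaluates the recurrence for powers of 2 and compares it with the claimed closed form; the function name `T` follows the text, and the range of exponents tested is arbitrary.

```python
def T(n: int) -> int:
    """Evaluate T(n) = 2T(n/2) + 2n with T(1) = 1, for n a power of 2."""
    if n == 1:
        return 1
    return 2 * T(n // 2) + 2 * n


for k in range(11):
    n = 2 ** k
    closed_form = 2 * n * k + n  # 2n log2(n) + n with log2(n) = k
    print(n, T(n), closed_form)
```

A numerical check like this is not a proof, but it is a quick way to catch a wrong guess before investing in the induction.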



The following theorem is extremely useful for such recurrence relations:

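Theorem 2.6 The solution to the recurrence relation $T(n) = aT(n/b) + cn^k$, where $a \ge 1$ and $b \ge 2$ are integers and c and k are positive constants, satisfies:

\[
T(n) \text{ is }
\begin{cases}
\Theta(n^{\log_b a}) & \text{if } a > b^k, \\
\Theta(n^k \log n) & \text{if } a = b^k, \\
\Theta(n^k) & \text{if } a < b^k.
\end{cases}
\]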
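The case analysis in the theorem is mechanical enough to write down as code. The following sketch takes a, b and k and reports which case applies; the function name `master_theorem` and the string format of the result are our own choices, and the three cases are stated as Θ bounds.

```python
import math


def master_theorem(a: int, b: int, k: float) -> str:
    """Classify T(n) = a*T(n/b) + c*n^k according to the theorem's three cases."""
    if a < 1 or b < 2 or k <= 0:
        raise ValueError("need a >= 1, b >= 2 and k > 0")
    bk = b ** k
    if math.isclose(a, bk):
        return f"Theta(n^{k:g} log n)"
    if a > bk:
        # Exponent is log_b(a), shown to three decimal places.
        return f"Theta(n^{math.log(a, b):.3f})"
    return f"Theta(n^{k:g})"


print(master_theorem(2, 2, 1))  # Theta(n^1 log n)
print(master_theorem(4, 2, 1))  # Theta(n^2.000)
print(master_theorem(1, 2, 1))  # Theta(n^1)
```

The first call corresponds to the recurrence T(n) = 2T(n/2) + 2n discussed earlier, for which the theorem gives Θ(n log n).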



Data Structures

We shall regard integers, real numbers, and bits, as well as more complicated objects such as lists and sets, as

primitive data structures. Recall that a list is just an ordered sequence of arbitrary elements.

List q := [x1,x2, . . . ,xn].

x1 is called the head of the list.

xn is called the tail of the list.

n = |q| is the size of the list.

We denote by ∘ the concatenation operation. Thus q ∘ r is the list that results from concatenating the list q with

the list r.

The operations on lists that are especially important for our purposes are:

    head(q)        return(x1)
    push(q,x)      q := [x] ∘ q
    pop(q)         q := [x2, . . . ,xn], return(x1)
    inject(q,x)    q := q ∘ [x]
    eject(q)       q := [x1,x2, . . . ,xn−1], return(xn)
    size(q)        return(n)

The head, pop, and eject operations are not defined for empty lists. Appropriate return values (either an error,

or an empty symbol) can be designed depending on the implementation.

A stack is a list that supports operations head, push, pop.

A queue is a list that supports operations head, inject and pop.

A deque supports all these operations.

Note that we can implement lists either by arrays or using pointers as the usual linked lists. Arrays are often

faster in practice, but they are often more complicated to program (especially if there is no implicit limit on the

number of items). In either case, each of the above operations can be implemented in a constant number of steps.
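As an illustration, the six list operations map directly onto Python's `collections.deque`, which supports constant-time insertion and removal at both ends. The wrapper function names below mirror the notes; using `deque` as the underlying list is an assumption of this sketch, not something the notes prescribe.

```python
from collections import deque


def head(q):
    return q[0]            # first element, without removing it


def push(q, x):
    q.appendleft(x)        # q := [x] followed by q


def pop(q):
    return q.popleft()     # remove and return the first element


def inject(q, x):
    q.append(x)            # q := q followed by [x]


def eject(q):
    return q.pop()         # remove and return the last element


def size(q):
    return len(q)


q = deque()
inject(q, 1)
inject(q, 2)
push(q, 0)
print(list(q), head(q), size(q))  # [0, 1, 2] 0 3
```

As the notes say, head, pop and eject are undefined on an empty list; here they raise `IndexError`.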



Application: Mergesort

For the rest of the lecture, we will review the procedure mergesort. The input is a list of n numbers, and the

output is a list of the given numbers sorted in increasing order. The main data structure used by the algorithm will be

a queue. We will assume that each queue operation takes 1 step, and that each comparison (is x > y?) takes 1 step.

We will show that mergesort takes O(n logn) steps to sort a sequence of n numbers.

The procedure mergesort relies on a function merge which takes as input two sorted (in increasing order) lists

of numbers and outputs a single sorted list containing all the given numbers (with repetition).



function merge (s, t)
    list s, t
    if s = [ ] then return t
    else if t = [ ] then return s
    else if s(1) ≤ t(1) then u := pop(s)
    else u := pop(t)
    return push(u, merge(s, t))
end merge

function mergesort (s)
    list s, q
    q = [ ]
    for x ∈ s
        inject(q, [x])
    rof
    while size(q) ≥ 2
        u := pop(q)
        v := pop(q)
        inject(q, merge(u,v))
    end
    if q = [ ] return [ ]
    else return q(1)
end mergesort



The correctness of the function merge follows from the following fact: the smallest number in the input is either

s(1) or t(1), and must be the first number in the output list. The rest of the output list is just the list obtained by

merging s and t after deleting that smallest number.

Each invocation of function merge takes O(1) steps, not counting the recursive call. Since each recursive invocation of

merge removes an element from either s or t, it follows that function merge halts in O(|s|+ |t|) steps.

Question: Can you design an iterative (rather than recursive) version of merge? How much time does it take?

Which version would be faster in practice– the recursive or the iterative?



Q : [ [7,9], [1,4], [6,16], [2,10]∗ [3,11,12,14], [5,8,13,15] ]

Q : [ [6,16], [2,10]∗ [3,11,12,14], [5,8,13,15], [1,4,7,9] ]

Figure 2.1: One step of the mergesort algorithm.

The iterative algorithm mergesort uses q as a queue of lists. (Note that it is perfectly acceptable to have lists of

lists!) It repeatedly merges together the two lists at the front of the queue, and puts the resulting list at the tail of the queue.


The correctness of the algorithm follows easily from the fact that we start with sorted lists (of length 1 each),

and merge them in pairs to get longer and longer sorted lists, until only one list remains. To analyze the running

time of this algorithm, let us place a special marker ∗ initially at the end of the q. Whenever the marker ∗ reaches the

front of q, and is either the first or the second element of q, we move it back to the end of q. Thus the presence of the

marker ∗ makes no difference to the actual execution of the algorithm. Its only purpose is to partition the execution

of the algorithm into phases: where a phase is the time between two successive visits of the marker ∗ to the end

of the q. Then we claim that the total time per phase is O(n). This is because each phase just consists of pairwise

merges of disjoint lists in the queue. Each such merge takes time proportional to the sum of the lengths of the lists,

and the sum of the lengths of all the lists in q is n. On the other hand, the number of lists is halved in each phase,

and therefore the number of phases is at most logn. Therefore the total running time of mergesort is O(n log n).
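The queue-based algorithm can be sketched in Python as follows, again assuming `collections.deque` as the queue. Note that this sketch swaps in an iterative `merge` in place of the recursive one in the pseudocode, mainly to avoid Python's recursion limit on long lists.

```python
from collections import deque


def merge(s, t):
    """Merge two sorted deques into one sorted deque (consumes s and t)."""
    out = deque()
    while s and t:
        # The smallest remaining element is at the front of s or of t.
        if s[0] <= t[0]:
            out.append(s.popleft())
        else:
            out.append(t.popleft())
    out.extend(s)  # at most one of s, t is still non-empty
    out.extend(t)
    return out


def mergesort(items):
    """Sort by repeatedly merging the two lists at the front of a queue."""
    q = deque(deque([x]) for x in items)  # start with lists of length 1
    while len(q) >= 2:
        u = q.popleft()
        v = q.popleft()
        q.append(merge(u, v))             # merged list goes to the tail
    return list(q[0]) if q else []


print(mergesort([7, 9, 1, 4, 6, 16, 2, 10, 3, 11, 12, 14, 5, 8, 13, 15]))
```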



An alternative analysis of mergesort depends on a recursive, rather than iterative, description. Suppose we have

an operation that takes a list and splits it into two equal-size parts. (We will assume our list size is a power of 2, so

that all sublists we ever obtain have even size or are of length 1.) Then a recursive version of mergesort would do

the following:

function mergesort (s)
    list s, s1, s2
    if size(s) = 1 then return(s)
    split(s, s1, s2)
    s1 = mergesort(s1)
    s2 = mergesort(s2)
    return(merge(s1, s2))
end mergesort

Here split splits the list s into two parts of equal length s1 and s2. The correctness follows easily from induction.

Let T (n) be the number of comparisons mergesort performs on lists of length n. Then T (n) satisfies the

recurrence relation T (n) ≤ 2T (n/2) + n− 1. This follows from the fact that to sort lists of length n we sort two

sublists of length n/2 and then merge them using (at most) n − 1 comparisons. Using our general theorem on

solutions of recurrence relations, we find that T (n) = O(n log n).
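To see the comparison count in action, here is a sketch of the recursive version instrumented to count comparisons. The name `mergesort_count` is ours, and Python list slicing plays the role of split; unlike the pseudocode, the sketch also accepts empty lists and lengths that are not powers of 2.

```python
import math


def mergesort_count(s):
    """Return (sorted copy of s, number of comparisons performed)."""
    if len(s) <= 1:
        return list(s), 0
    mid = len(s) // 2
    left, c_left = mergesort_count(s[:mid])
    right, c_right = mergesort_count(s[mid:])
    merged, i, j = [], 0, 0
    comparisons = c_left + c_right
    while i < len(left) and j < len(right):
        comparisons += 1  # one comparison per element moved in this loop
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


data = [7, 9, 1, 4, 6, 16, 2, 10, 3, 11, 12, 14, 5, 8, 13, 15]
result, comparisons = mergesort_count(data)
n = len(data)
print(result)
print(comparisons, "comparisons; n log2 n =", n * math.log2(n))
```

Each merge of two lists with n elements in total uses at most n − 1 comparisons, which is where the n − 1 term in the recurrence comes from.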

Question: The iterative version of mergesort uses a queue. Implicitly, the recursive version is using a stack. Explain

the implicit stack in the recursive version of mergesort.

Question: Solve the recurrence relation T (n) = 2T (n/2)+n−1 exactly to obtain an upper bound on the number of

comparisons performed by the recursive mergesort variation.


CS124 Lecture 3 Spring 2002

Graphs and modeling

Formulating a simple, precise specification of a computational problem is often a prerequisite to writing a

computer program for solving the problem. Many computational problems are best stated in terms of graphs. A

directed graph G(V,E) consists of a finite set of vertices V and a set of (directed) edges or arcs E . An arc is an

ordered pair of vertices (v,w) and is usually indicated by drawing a line between v and w, with an arrow pointing

towards w. Stated in mathematical terms, a directed graph G(V,E) is just a binary relation E ⊆V ×V on a finite set

V . Undirected graphs may be regarded as special kinds of directed graphs, such that (u,v) ∈ E ↔ (v,u) ∈ E . Thus,

since the directions of the edges are unimportant, an undirected graph G(V,E) consists of a finite set of vertices V ,

and a set of edges E , each of which is an unordered pair of vertices {u,v}.

Graphs model many situations. For example, the vertices of a graph can represent cities, with edges representing

highways that connect them. In this case, each edge might also have an associated length. Alternatively, an edge

might represent a flight from one city to another, and each edge might have a weight which represents the cost of the

flight. A typical problem in this context is to compute shortest paths: given that you wish to travel from city X to

city Y, what is the shortest path (or the cheapest flight schedule)? We will find very efficient algorithms for solving

these problems.

A seemingly similar problem is the traveling salesman problem. Supposing that a traveling salesman wishes to

visit each city exactly once and return to his starting point, in what order should he visit the cities to minimize the

total distance traveled? Unlike the shortest paths problem, however, this problem has no known efficient algorithm.

This is an example of an NP-complete problem, and one we will study towards the end of this course.




A different context in which graphs play a critical modeling role is in networks of pipes or communication

links. These can, in general, be modeled by directed graphs with capacities on the edges. A directed edge from u

to v with capacity c might represent a cable that can carry a flow of at most c calls per unit time from u to v. A

typical problem in this context is the max-flow problem: given a communications network modeled by a directed

graph with capacities on the edges, and two special vertices — a source s and a sink t — what is the maximum rate

at which calls from s to t can be made? There are ingenious techniques for solving these types of flow problems.

In all the cases mentioned above, the vertices and edges of the graph represented something quite concrete such

as cities and highways. Often, graphs will be used to represent more abstract relationships. For example, the vertices

of a graph might represent tasks, and the edges might represent precedence constraints: a directed edge from u to v

says that task u must be completed before v can be started. An important problem in this context is scheduling: in

what order should the tasks be scheduled so that all the precedence constraints are satisfied? There are extremely fast

algorithms for this problem that we will see shortly.



Representing graphs on the computer

One common representation for a graph G(V,E) is the adjacency matrix. Suppose $V = \{1, \ldots, n\}$. The adjacency matrix for G(V,E) is an n×n matrix A, where $a_{i,j} = 1$ if $(i, j) \in E$ and $a_{i,j} = 0$ otherwise.¹ The advantage of the adjacency matrix representation is that it takes constant time (just one memory access) to determine whether or not there is an edge between any two given vertices. In the case that each edge has an associated length or a weight, the adjacency matrix representation can be appropriately modified so entry $a_{i,j}$ contains that length or weight instead of just a 1. The disadvantage of the adjacency matrix representation is that it requires $\Omega(n^2)$ storage, even if the graph has as few as O(n) edges. Moreover, just examining all the entries of the matrix would require $\Omega(n^2)$ steps, thus precluding the possibility of linear time algorithms for graphs with $o(n^2)$ edges (at least in cases where all the matrix entries must be examined).

An alternative representation for a graph G(V,E) is the adjacency list representation. We say that a vertex j is

adjacent to a vertex i if (i, j) ∈ E. The adjacency list for a vertex i is a list of all the vertices adjacent to i (in any

order). To represent the graph, we use an array of size n to represent the vertices of the graph, and the ith element of

the array points to the adjacency list of the ith vertex. The total storage used by an adjacency list representation of a

graph with n vertices and m edges is O(n + m). The adjacency list representation hence avoids the disadvantage of

using more space than necessary. We will use this representation for all our graph algorithms that take linear or near

linear time. A disadvantage of adjacency lists, however, is that determining whether there is an edge from vertex i to

vertex j may take as many as n steps, since there is no systematic shortcut to scanning the adjacency list of vertex i.

For applications where determining if there is an edge between two vertices is the bottleneck, the adjacency matrix

is thus preferable.
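The two representations can be sketched side by side in Python. For convenience this sketch numbers the vertices 0 to n − 1 instead of 1 to n, and the helper names are our own.

```python
def adjacency_matrix(n, edges):
    """Build an n x n 0/1 matrix: A[i][j] = 1 exactly when (i, j) is an edge."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A


def adjacency_list(n, edges):
    """Build an array of n lists: adj[i] holds the vertices adjacent to i."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
    return adj


def has_edge_matrix(A, i, j):
    return A[i][j] == 1    # a single memory access


def has_edge_list(adj, i, j):
    return j in adj[i]     # scans the adjacency list of i


edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = adjacency_matrix(3, edges)
adj = adjacency_list(3, edges)
print(A)    # [[0, 1, 1], [0, 0, 1], [1, 0, 0]]
print(adj)  # [[1, 2], [2], [0]]
```

The matrix always holds n² entries, while the lists hold one entry per edge plus the array of n list heads.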

¹ Generally, we use either n or |V | for the number of nodes in a graph, and m or |E| for the number of edges.



Depth first search

There are two fundamental algorithms for searching a graph: depth first search and breadth first search. To

better understand the need for these procedures, let us imagine the computer’s view of a graph that has been input

into it, in the adjacency list representation. The computer’s view is fundamentally local to a specific vertex: it can

examine each of the edges adjacent to a vertex in turn, by traversing its adjacency list; it can also mark vertices as

visited. One way to think of these operations is to imagine exploring a dark maze with a flashlight and a piece of

chalk. You are allowed to illuminate any corridor of the maze emanating from your current position, and you are

also allowed to use the chalk to mark your current location in the maze as having been visited. The question is how

to find your way around the maze.

We now show how the depth first search allows the computer to find its way around the input graph using just

these primitives. (We will examine breadth first search shortly.)

Depth first search is a technique for exploring a graph using a stack as the basic data structure. We start by

defining a recursive procedure search (the stack is implicit in the recursive calls of search): search is invoked on a

vertex v, and explores all previously unexplored vertices reachable from v.

Procedure search(v)
    vertex v
    explored(v) := 1
    previsit(v)
    for (v,w) ∈ E
        if explored(w) = 0 then search(w)
    rof
    postvisit(v)
end search

Procedure DFS (G(V,E))
    graph G(V,E)
    for each v ∈ V do
        explored(v) := 0
    rof
    for each v ∈ V do
        if explored(v) = 0 then search(v)
    rof
end DFS
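The two procedures above can be sketched in Python roughly as follows. The adjacency-list input adj, the 0-based vertex numbers, and passing previsit and postvisit as optional callbacks are assumptions made for this sketch.

```python
import sys


def dfs(adj, previsit=None, postvisit=None):
    """Run depth first search over every vertex of the graph adj."""
    n = len(adj)
    explored = [False] * n

    def search(v):
        explored[v] = True
        if previsit:
            previsit(v)
        for w in adj[v]:
            if not explored[w]:
                search(w)
        if postvisit:
            postvisit(v)

    # The implicit stack is the recursion stack; its depth can reach n.
    sys.setrecursionlimit(max(sys.getrecursionlimit(), n + 100))
    for v in range(n):
        if not explored[v]:
            search(v)


adj = [[1, 2], [3], [3], [], [0]]
pre_order, post_order = [], []
dfs(adj, previsit=pre_order.append, postvisit=post_order.append)
```

Here vertex 4 is not reachable from vertex 0, so the outer loop restarts the search there.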



By modifying the procedures previsit and postvisit, we can use DFS to solve a number of important problems,

as we shall see. It is easy to see that depth first search takes O(|V |+ |E|) steps (assuming previsit and postvisit take

O(1) time), since it explores from each vertex once, and the exploration involves a constant number of steps per

outgoing edge.

The procedure search defines a tree in a natural way: each time that search discovers a new vertex, say w, we

can incorporate w into the tree by connecting w to the vertex v it was discovered from via the edge (v,w). The

remaining edges of the graph can be classified into three types:

• Forward edges - these go from a vertex to a descendant (other than child) in the DFS tree.

• Back edges - these go from a vertex to an ancestor in the DFS tree.

• Cross edges - these go from “right to left”– there is no ancestral relation.

Question: Explain why if the graph is undirected, there can be no cross edges.

One natural use of previsit and postvisit is for each to keep a counter that is increased each time one of these

routines is accessed; this corresponds naturally to a notion of time. Each routine could assign to each vertex a

preorder number (time) and a postorder number (time) based on the counter. If we think of depth first search as

using an explicit stack, then the previsit number is assigned when the vertex is first placed on the stack, and the

postvisit number is assigned when the vertex is removed from the stack. Note that this implies that the intervals

[preorder(u), postorder(u)] and [preorder(v), postorder(v)] are either disjoint, or one contains the other.
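A small Python sketch of this numbering follows. It uses a single shared clock for both routines, which is one reasonable reading of the counter described above; the names dfs_numbers and nested_or_disjoint are hypothetical.

```python
def dfs_numbers(adj):
    """Return (pre, post) clock values for a DFS over all vertices."""
    n = len(adj)
    pre = [None] * n
    post = [None] * n
    clock = 0

    def search(v):
        nonlocal clock
        pre[v] = clock      # previsit: v is placed on the stack
        clock += 1
        for w in adj[v]:
            if pre[w] is None:
                search(w)
        post[v] = clock     # postvisit: v is removed from the stack
        clock += 1

    for v in range(n):
        if pre[v] is None:
            search(v)
    return pre, post


def nested_or_disjoint(pre, post, u, v):
    """Check that the intervals of u and v are disjoint or nested."""
    a, b = (pre[u], post[u]), (pre[v], post[v])
    disjoint = a[1] < b[0] or b[1] < a[0]
    nested = (a[0] < b[0] and b[1] < a[1]) or (b[0] < a[0] and a[1] < b[1])
    return disjoint or nested


adj = [[1, 2], [3], [3], [], [0]]
pre, post = dfs_numbers(adj)
```

On this example every pair of distinct vertices passes the nested_or_disjoint check, as the stack argument predicts.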



An important property of depth-first search is that the contents of the stack at any time yield a path from the root

to some vertex in the depth first search tree. (Why?) This allows us to prove the following property of the postorder numbering:


Claim 3.1 If (u,v) ∈ E then postorder(u) < postorder(v) ⇐⇒ (u,v) is a back edge.

Proof: If postorder(u) < postorder(v) then v must be pushed on the stack before u. Otherwise, the existence

of edge (u,v) ensures that v must be pushed onto the stack before u can be popped, resulting in postorder(v) <

postorder(u) — contradiction. Furthermore, since v cannot be popped before u, it must still be on the stack when u

is pushed on to it. It follows that v is on the path from the root to u in the depth first search tree, and therefore (u,v)

is a back edge.

The other direction is trivial.

Exercise: What conditions do the preorder and postorder numbers have to satisfy if (u,v) is a forward edge? A

cross edge?

Claim 3.2 G(V,E) has a cycle iff the DFS of G(V,E) yields a back edge.

Proof: If (u,v) is a back edge, then (u,v) together with the path from v to u in the depth first tree form a cycle.

Conversely, for any cycle in G(V,E), consider the vertex assigned the smallest postorder number. Then the

edge leaving this vertex in the cycle must be a back edge by Claim 3.1, since it goes from a lower postorder number

to a higher postorder number.
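Claims 3.1 and 3.2 together suggest a simple cycle test for a directed graph: compute postorder numbers, then look for an edge that goes from a lower postorder number to a higher one. The sketch below assumes an adjacency-list input; the name has_cycle is hypothetical, and treating a self-loop as a back edge (hence the <= comparison) is a choice made here.

```python
def has_cycle(adj):
    """Return True if the directed graph adj contains a cycle."""
    n = len(adj)
    post = [None] * n
    explored = [False] * n
    counter = 0

    def search(v):
        nonlocal counter
        explored[v] = True
        for w in adj[v]:
            if not explored[w]:
                search(w)
        post[v] = counter   # postorder number of v
        counter += 1

    for v in range(n):
        if not explored[v]:
            search(v)
    # By Claim 3.1, an edge (u, v) with post[u] < post[v] is a back edge.
    # Equality can only happen when u == v, that is, for a self-loop.
    return any(post[u] <= post[v] for u in range(n) for v in adj[u])
```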









Graph is explored in preorder ABCDEF. Postorder is DCBAFE. DB is a back edge. AD is a forward edge. EC is a cross edge.







Figure 3.1: A sample depth-first search.

Application of DFS: Topological sort

We now suggest an algorithm for the scheduling problem described previously. We are given a directed graph G(V,E), whose vertices V = {v1, . . . , vn} represent tasks, and whose edges represent precedence constraints: a directed edge from u to v says that task u must be completed before v can be started. The problem of topological sorting asks: in what order should the tasks be scheduled so that all the precedence constraints are satisfied?

Note: The graph must be acyclic for this to be possible. (Why?) Directed acyclic graphs appear so frequently

they are commonly referred to as DAGs.

Claim 3.3 If the tasks are scheduled by decreasing postorder number, then all precedence constraints are satisfied.

Proof: If G is acyclic then the DFS of G produces no back edges by Claim 3.2. Therefore by Claim 3.1,

(u,v)∈G implies postorder(u) > postorder(v). So, if we process the tasks in decreasing order by postorder number,

when task v is processed, all tasks with precedence constraints into v (and therefore higher postorder numbers) must

already have been processed.
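Claim 3.3 translates directly into code: record vertices as they finish and reverse the list. The following Python sketch assumes the graph is a DAG given as adjacency lists; the name topological_order is hypothetical.

```python
def topological_order(adj):
    """Return the vertices of a DAG sorted by decreasing postorder number."""
    n = len(adj)
    explored = [False] * n
    finished = []  # vertices in increasing postorder

    def search(v):
        explored[v] = True
        for w in adj[v]:
            if not explored[w]:
                search(w)
        finished.append(v)

    for v in range(n):
        if not explored[v]:
            search(v)
    return finished[::-1]


# Edge (u, v) means task u must be completed before task v starts.
adj = [[1, 2], [3], [3], [], [0]]
order = topological_order(adj)
```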

There’s another way to think about topologically sorting a DAG. Each DAG has a source, which is a vertex

with no incoming edges. Similarly, each DAG has a sink, which is a vertex with no outgoing edges. (Proving this

is an exercise.) Another way to topologically order the vertices of a DAG is to repeatedly output a source, remove

it from the graph, and repeat until the graph is empty. Why does this work? Similarly, one could repeatedly output

sinks, and this gives the reverse of a valid topological order. Again, why?
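The source-removal idea can be sketched without physically deleting vertices, by tracking how many incoming edges each vertex still has. This is a sketch under the assumption of an adjacency-list input; the function name and the choice to raise an error on cyclic input are made for this example.

```python
from collections import deque


def topological_order_by_sources(adj):
    """Repeatedly output a source and remove it from the graph."""
    n = len(adj)
    indegree = [0] * n
    for u in range(n):
        for v in adj[u]:
            indegree[v] += 1
    sources = deque(v for v in range(n) if indegree[v] == 0)
    order = []
    while sources:
        u = sources.popleft()
        order.append(u)
        # Removing u lowers the indegree of each of its out-neighbors.
        for v in adj[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                sources.append(v)
    if len(order) < n:
        raise ValueError("graph has a cycle, so no source remains")
    return order
```

Each vertex and edge is handled once, so this also runs in O(|V| + |E|) time.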



Strongly Connected Components

Connectivity in undirected graphs is rather straightforward. A graph that is not connected can naturally be

decomposed into several connected components (Figure 3.2). DFS does this handily: each restart of DFS marks a

new connected component.

Figure 3.2: An undirected graph



In directed graphs, what connectivity means is more subtle. In some primitive sense, the directed graph in

Figure 3.3 appears connected, since if it were an undirected graph, it would be connected. But there is no path from

vertex 12 to 6, or from 6 to 1, so saying the graph is connected would be misleading.

We must begin with a meaningful definition of connectivity in directed graphs. Call two vertices u and v of

a directed graph G = (V,E) connected if there is a path from u to v, and one from v to u. This relation between

vertices is reflexive, symmetric, and transitive (check!), so it is an equivalence relation on the vertices. As such, it

partitions V into disjoint sets, called the strongly connected components (SCC’s) of the graph (in Figure 3.3 there

are four SCC’s). Within a strongly connected component, every pair of vertices is connected.

Figure 3.3: A directed graph and its SCC’s



We now imagine shrinking each SCC into a vertex (a supervertex), and draw an edge (a superedge) from SCC

X to SCC Y if there is at least one edge from a vertex in X to a vertex in Y . The resulting directed graph has to be a

directed acyclic graph (DAG) – that is to say, it can have no cycles (see Figure 3.3). The reason is simple: a cycle

containing several SCC’s would merge to a single SCC, since there would be a path between every pair of vertices

in the SCC’s of the cycle. Hence, every directed graph is a DAG of its SCC’s.

This important decomposition theorem allows one to think of connectivity information of a directed graph

in two levels. At the top level we have a DAG, which has a useful, simple structure. For example, as we have

mentioned before, a DAG is guaranteed to have at least one source (a vertex without incoming edges) and a sink

(a vertex without outgoing edges). If we want more details, we could look inside a vertex of the DAG to see the

full-fledged SCC (a strongly connected graph) that lies there.

This decomposition is extremely useful and informative; it is thus very fortunate that we have a very efficient

algorithm, based on DFS, that finds the strongly connected components in linear time! We motivate this algorithm

next. It is based on several interesting and slightly subtle properties of DFS:



Property 1: If DFS is started at a vertex v, then it will get stuck and restarted precisely when all vertices in the SCC

of v, and in all the SCC’s that are reachable from the SCC of v, are visited. Consequently, if DFS is started at a

vertex of a sink SCC (a SCC that has no edges leaving it in the DAG of SCC’s), then it will get stuck after it visits

precisely the vertices of this SCC.

For example, if DFS is started at vertex 11 in Figure 3.3 (a vertex in the only sink SCC in this graph), then it will visit

the six vertices in the sink SCC before getting stuck: vertices 11, 12, 10, 9, 7, 8. Property 1 suggests a way of starting a

decomposition algorithm, by finding the first SCC: start DFS from a vertex in a sink SCC, and, when stuck, output

the vertices that have been visited. They form an SCC!

Of course, this leaves us with two problems: (A) How to guess a vertex in a sink SCC, and (B) how to continue

our algorithm by outputting the next SCC, and so on.



Let us first face Problem (A). It turns out that it will be easier not to look for vertices in a sink SCC, but instead

look for vertices in a source SCC. In particular:

Property 2: The vertex with the highest postorder number in DFS (that is, the vertex where the DFS ends) belongs

to a source SCC.

The proof is by contradiction. If Property 2 were not true, and v is the vertex with the highest post-order

number, then there would be an incoming edge (u,w) with u not in the SCC of v and w in the SCC of v. If u were

searched before v, then u clearly has a higher postorder number. If u were searched after v, then since u does not lie

in v’s SCC, it must not be searched until v is popped from the search stack, so again u must have a higher postorder

number than v.

The reason behind Property 2 is thus not hard to see: if there is an SCC “above” the SCC of the vertex where the

DFS ends, then the DFS should have ended in that SCC (reaching it either by restarting or by backtracking).

Property 2 provides an indirect solution to Problem (A). Consider a graph G and the reverse graph GR —G with

the directions of all edges reversed. GR has precisely the same SCC’s as G (why?). So, if we make a DFS in GR,

then the vertex where we end (the one with the highest post-order) belongs to a source SCC of GR —that is to say, a

sink SCC of G. We have solved Problem (A).



Onwards to Problem (B). How does the algorithm continue after the first sink component is output? The solution

is clear: delete the SCC just output from GR, and make another DFS in the remaining graph. The only problem is,

this would be a quadratic, not linear, algorithm, since we would run an O(m) DFS algorithm for each of up to O(n)

vertices. How can we avoid this extra work? The key observation here is that we do not have to make a new DFS in

the remaining graph:

Property 3: If we make a DFS in a directed graph, and then delete a source SCC of this graph, what remains is a

DFS in the remaining graph (the pre-order and post-order numbers may now not be consecutive, but they will be of

the right relative magnitude).

This is also easy to justify. We just imagine two runs of the DFS algorithm, one with and one without the source

SCC. Consider a transcript recording the steps of the DFS algorithm. It is easy to see that the transcript of both

runs would be the same (assuming they both made the same choices of what edges to follow at what points), except

where the first went through the source SCC.



Property 3 allows us to use induction to continue our SCC algorithm. After we output the first SCC, we can use

the same DFS information from GR to output the second SCC, the third SCC, and so on. The full algorithm can thus

be described as follows:

Step 1: Perform DFS on GR.

Step 2: Perform DFS on G, processing unsearched vertices in the order of decreasing postorder numbers from the

DFS of Step 1. At the beginning and every restart print “New SCC:” When visiting vertex v, print v.

This algorithm is linear-time, since the total work is really just two depth-first searches, each of which is linear time.

Question: (How does one construct GR from G?) If we run this algorithm on Figure 3.3, Step 1 yields the following

order on the vertices (decreasing postorder in GR’s DFS): 7, 9, 10, 12, 11, 8, 3, 6, 2, 5, 4, 1. Step 2 now produces the

following output: New SCC: 7, 8, 10, 9, 11, 12, New SCC: 3, 6, New SCC: 2, 4, 5, New SCC: 1.
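The two steps can be sketched in Python as follows. This is an illustrative sketch, not a tuned implementation: the function name, the adjacency-list input, and returning the components as lists (rather than printing them) are assumptions made here. Building the reverse graph by one pass over the edges is one possible answer to the question above.

```python
def strongly_connected_components(adj):
    """Return the SCCs of a directed graph, sink components first."""
    n = len(adj)
    # Build the reverse graph GR: every edge (u, v) becomes (v, u).
    radj = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            radj[v].append(u)

    def search(graph, v, explored, out):
        explored[v] = True
        for w in graph[v]:
            if not explored[w]:
                search(graph, w, explored, out)
        out.append(v)

    # Step 1: DFS on GR, recording vertices in increasing postorder.
    explored = [False] * n
    finished = []
    for v in range(n):
        if not explored[v]:
            search(radj, v, explored, finished)

    # Step 2: DFS on G, restarting in decreasing postorder from Step 1.
    explored = [False] * n
    components = []
    for v in reversed(finished):
        if not explored[v]:
            component = []  # each restart begins a new SCC
            search(adj, v, explored, component)
            components.append(component)
    return components


# Two SCCs: {0, 1, 2} and the sink component {3, 4}.
adj = [[1], [2], [0, 3], [4], [3]]
components = strongly_connected_components(adj)
```

The vertices inside each returned component appear in no particular order; only the grouping and the order of the components matter.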



Incidentally, there is more sophisticated connectivity information that one can derive from undirected graphs.

An articulation point is a vertex whose deletion increases the number of connected components in the undirected

graph. In Figure 3.2 there are 4 articulation points: 3, 6, 8, and 13. Articulation points divide the graph into biconnected components (the pieces of the graph between articulation points) and bridge edges. Biconnected components are maximal edge sets (of at least 2 edges) such that any two edges in the set lie on a common cycle. For example, the large connected component of the graph in Figure 3.2 contains the biconnected components on edges between vertices 1-2-3-4-5-7-8 and 6-9-10. The remaining edges, 3-6 and 8-11, are bridge edges; removing either one disconnects the

graph. Not coincidentally, this more sophisticated and subtle connectivity information can also be captured by DFS.



Putting It Into Practice

Suppose you are debugging your latest huge software program for a major industrial client. The program has

hundreds of procedures, each of which must be carefully tested for bugs.

You realize that, to save yourself some work, it would be best to analyze the procedures in a particular

order. For instance, if procedure Write_Check() calls Get_Check_Number(), you would probably want to test Get_Check_Number() first. That way, when you look for the bugs in Write_Check(), you do not have to worry about checking (or re-checking) Get_Check_Number(). (Let’s ignore the specious argument that if there are no bugs, you might avoid testing and debugging Get_Check_Number() altogether by starting with Write_Check().)

You can easily generate a list of what procedures each procedure calls with a single pass through the code. So

here’s the problem: given your program, determine what schedule you should give your testing and debugging team,

so that a procedure will be debugged only after everything it calls has been debugged.

Go through the program, creating one vertex for each procedure. Introduce a directed edge from vertex A to vertex B if the procedure A calls B. This directed edge represents the fact that B must be debugged before A. We call this graph the procedure graph. If this graph is acyclic, then the reverse of a topological sort (that is, increasing postorder number) will give you a valid ordering for the debugging.

What if the graph is not acyclic? Then your program uses mutual recursion; that is, there is some chain of

procedures through which a procedure might end up calling itself. For example, this would be the case if procedure

A calls procedure B, procedure B calls procedure C, and procedure C calls procedure A. A topological sort will

detect these cycles, but what we really want is a list of them, since instances of mutual recursion are harder to test

and debug.

In this case, we should use the strongly connected components algorithm on the procedure graph. The SCC

algorithm will find all the cycles, showing all instances of mutual recursion. Moreover, if we collapse the cycles in

the graph, so that instances of mutual recursion are treated as one large super-procedure, then the SCC algorithm

will provide a valid debugging ordering for all the procedures in this modified graph. That is, the SCC algorithm

will topologically sort the underlying SCC DAG.


CS124 Lecture 4 Spring 2002

Breadth-First Search

A searching technique with different properties than DFS is Breadth-First Search (BFS). While DFS used an

implicit stack, BFS uses an explicit queue structure in determining the order in which vertices are searched. Also,

generally one does not restart BFS, because BFS only makes sense in the context of exploring the part of the graph

that is reachable from a particular vertex (s in the algorithm below).

Procedure BFS (G(V,E), s ∈ V)
    graph G(V,E)
    array[|V|] of integers dist
    queue q
    dist[s] := 0
    inject(q,s)
    explored(s) := 1
    while size(q) > 0
        v := pop(q)
        previsit(v)
        for (v,w) ∈ E
            if explored(w) = 0 then
                inject(q,w)
                explored(w) := 1
                dist[w] := dist[v] + 1
    end while
end BFS

BFS runs, of course, in linear time O(|E|), under the assumption that |E| ≥ |V |. The reason is that BFS visits

each edge exactly once, and does a constant amount of work per edge.
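A Python sketch of BFS, under the assumption of an adjacency-list input, might look like this. The name bfs_distances is hypothetical, and the sketch uses a dist value of None to mean "not yet placed on the queue", so each vertex is queued at most once.

```python
from collections import deque


def bfs_distances(adj, s):
    """Return dist where dist[v] is the number of edges on a shortest
    path from s to v, or None if v is not reachable from s."""
    dist = [None] * len(adj)
    dist[s] = 0
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if dist[w] is None:      # w has not been placed on the queue
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


adj = [[1, 2], [2, 3], [3], [], [0]]
dist = bfs_distances(adj, 0)
```

Vertex 4 has an edge into vertex 0 but cannot be reached from it, so its entry stays None.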




Figure 4.1: BFS of a directed graph

Although BFS does not have the same subtle properties of DFS, it does provide useful information. BFS visits

vertices in order of increasing distance from s. In fact, our BFS algorithm above labels each vertex with the distance

from s, or the number of edges in the shortest path from s to the vertex. For example, applied to the graph in

Figure 4.1, this algorithm labels the vertices (by the array dist) as shown.

Why are we sure that the array dist is the shortest-path distance from s? A simple induction proof suffices. It

is certainly true if the distance is zero (this happens only at s). And, if it is true for dist(v) = d, then it can be easily

shown to be true for values of dist equal to d +1 —any vertex that receives this value has an edge from a vertex with

dist d, and from no vertex with lower value of dist. Notice that vertices not reachable from s will not be visited or labeled.




Single-Source Shortest Paths —Nonnegative Lengths

What if each edge (v,w) of our graph has a length, a positive integer denoted length(v,w), and we wish to find

the shortest paths from s to all vertices reachable from it? (What if we are interested only in the shortest path from s

to a specific node t? As it turns out, all algorithms known for this problem have to compute the shortest path from s

to all vertices reachable from it.) BFS offers a possible solution. We can subdivide each edge (u,v) into length(u,v)

edges, by inserting length(u,v)−1 “dummy” nodes, and then apply BFS to the new graph. This algorithm solves

the shortest-path problem in time O(∑_{(u,v)∈E} length(u,v)). Unfortunately, this can be very large: lengths could be

in the thousands or millions. So we need to find a better way.

The problem is that this BFS-based algorithm will spend most of its time visiting “dummy” vertices; only

occasionally will it do something truly interesting, like visit a vertex of the original graph. What we would like to

do is run this algorithm, but only do work for the “interesting” steps.



To do this, we need to generalize BFS. Instead of using a queue, we will use a heap or priority queue of vertices. A heap is a data structure that keeps a set of objects, where each object has an associated value. The

operations a heap H implements include the following:

deletemin(H): remove and return the object with the smallest value

insert(x,y,H): insert a new object x / value y pair into the structure

change(x,y,H): if y is smaller than x’s current value, change the value of object x to y

We will not distinguish between insert and change, since for our purposes, they are essentially equivalent;

changing the value of a vertex will be like re-inserting it. (In all heap implementations we assume that we have an

array of pointers that gives, for each vertex, its position in the heap, if any. This allows us to always have at most

one copy of each vertex in the heap. Furthermore, it makes changes and inserts essentially equivalent operations.)

Each entry in the heap will stand for a projected future “interesting event” of our extended BFS. Each entry will

correspond to a vertex, and its value will be the current projected time at which we will reach the vertex. Another

way to think of this is to imagine that, each time we reach a new vertex, we can send an explorer down each adjacent

edge, and this explorer moves at a rate of 1 unit distance per second. With our heap, we will keep track of when each

vertex is due to be reached for the first time by some explorer. Note that the projected time until we reach a vertex

can decrease, because the new explorers that arise when we reach a newly explored vertex could reach a vertex first

(see node b in Figure 4.2). But one thing is certain: the most imminent future scheduled arrival of an explorer must

happen, because there is no other explorer who can reach any vertex faster. The heap conveniently delivers this most

imminent event to us.



As in all shortest path algorithms we shall see, we maintain two arrays indexed by V . The first array, dist[v],

will eventually contain the true distance of v from s. The other array, prev[v], will contain the last node before v in

the shortest path from s to v. Our algorithm maintains a useful invariant property: at all times dist[v] will contain a

conservative over-estimate of the true shortest distance of v from s. Of course dist[s] is initialized to its true value 0,

and all other dist’s are initialized to ∞, which is a remarkably conservative overestimate. The algorithm is known as

Dijkstra’s algorithm, named after its inventor.

Algorithm Dijkstra (G = (V,E,length); s ∈ V)
    v,w: vertices
    dist: array[V] of integer
    prev: array[V] of vertices
    H: priority heap of V
    H := {s : 0}
    for v ∈ V do
        dist[v] := ∞, prev[v] := nil
    rof
    dist[s] := 0
    while H ≠ ∅
        v := deletemin(H)
        for (v,w) ∈ E
            if dist[w] > dist[v] + length(v,w)
                dist[w] := dist[v] + length(v,w), prev[w] := v, insert(w,dist[w],H)
    end while
end Dijkstra
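The algorithm can be sketched in Python with the standard heapq module. Note one deliberate difference, which is an assumption of this sketch rather than the method described above: heapq has no change operation, so instead of keeping one copy of each vertex in the heap, the sketch pushes a new entry on every improvement and skips stale entries when they are popped. The function name, the (w, length) edge format, and the small test graph are made up for illustration.

```python
import heapq
import math


def dijkstra(adj, s):
    """adj[v] is a list of (w, length) pairs with length >= 0.
    Returns (dist, prev) for shortest paths from s."""
    n = len(adj)
    dist = [math.inf] * n
    prev = [None] * n
    dist[s] = 0
    heap = [(0, s)]
    while heap:
        d, v = heapq.heappop(heap)   # plays the role of deletemin
        if d > dist[v]:
            continue                 # stale entry: a shorter path was found
        for w, length in adj[v]:
            if dist[w] > dist[v] + length:
                dist[w] = dist[v] + length
                prev[w] = v
                heapq.heappush(heap, (dist[w], w))  # insert / change
    return dist, prev


# Hypothetical 4-vertex graph; vertex 2 is improved twice (6, then 5, then 4).
adj = [[(1, 2), (2, 6)], [(2, 3), (3, 1)], [], [(2, 1)]]
dist, prev = dijkstra(adj, 0)
```

With stale entries the heap can hold up to |E| items, which still gives an O(|E| log |V|) bound for this variant.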



Figure 4.2: Shortest paths. Distances from s shown in the figure: s 0, a 2, b 4, c 3, d 6, e 6, f 5.

The algorithm, run on the graph in Figure 4.2, will yield the following heap contents (node: dist/priority pairs) at the beginning of the while loop: {s : 0}, {a : 2, b : 6}, {b : 5, c : 3}, {b : 4, e : 7, f : 5}, {e : 7, f : 5, d : 6}, {e : 6, d : 6}, {e : 6}, {}. The distances from s are shown in Figure 4.2, together with the shortest path tree from s, the rooted tree defined by the pointers prev.



What is the running time of this algorithm? The algorithm involves |E| insert operations and |V | deletemin

operations on H , and so the running time depends on the implementation of the heap H . There are many ways to

implement a heap. Even an unsophisticated implementation as a linked list of node/priority pairs yields an interesting

time bound, O(|V|^2) (see first line of the table below). A binary heap would give O(|E| log |V|).

Which of the two should we prefer? The answer depends on how dense or sparse our graphs are. In all graphs, |E| is between |V| and |V|^2. If it is Ω(|V|^2), then we should use the linked list version. If it is anywhere below |V|^2 / log |V|, we should use binary heaps.

heap implementation   deletemin              insert               |V|×deletemin + |E|×insert

linked list           O(|V|)                 O(1)                 O(|V|^2)

binary heap           O(log |V|)             O(log |V|)           O(|E| log |V|)

d-ary heap            O(d log |V| / log d)   O(log |V| / log d)   O((|V|·d + |E|) log |V| / log d)

Fibonacci heap        O(log |V|)             O(1) amortized       O(|V| log |V| + |E|)

A more sophisticated data structure, the d-ary heap, performs even better. A d-ary heap is just like a binary heap, except that the fan-out of the tree is d, instead of 2. (Here d should be at least 2, however!) Since the depth of any such tree with |V| nodes is log |V| / log d, it is easy to see that inserts take this amount of time. Deletemins take d times that, because deletemins go down the tree, and must look at the children of all vertices visited.

The complexity of this algorithm is a function of d. We must choose d to minimize it. A natural choice is d = |E|/|V|, which is the average degree! (Note that this is the natural choice because it equalizes the two terms of |E| + |V|·d. Alternatively, the “exact” value can be found using calculus.) This yields an algorithm that is good for both sparse and dense graphs. For dense graphs, its running time is O(|V|^2). For graphs with |E| = O(|V|), it is O(|V| log |V|). Finally, for graphs with intermediate density, such as |E| = |V|^(1+δ) for a constant δ > 0 (δ measures the density of the graph), the algorithm is linear!
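To see where these three bounds come from, one can substitute the choice of d into the d-ary heap bound from the table. The short derivation below is a sketch that assumes d is never taken below 2 and treats δ as a fixed constant.

```latex
\begin{align*}
T(d) &= O\!\left((|V|\,d + |E|)\,\frac{\log |V|}{\log d}\right),
  \qquad d = \max\!\left(2, \frac{|E|}{|V|}\right) \\
|E| = |V|^2:&\quad d = |V|,\ \frac{\log |V|}{\log d} = 1
  \;\Rightarrow\; T = O(|V|^2) \\
|E| = O(|V|):&\quad d = O(1)
  \;\Rightarrow\; T = O(|V| \log |V|) \\
|E| = |V|^{1+\delta}:&\quad d = |V|^{\delta},\ \frac{\log |V|}{\log d} = \frac{1}{\delta}
  \;\Rightarrow\; T = O\!\left(\frac{|E|}{\delta}\right) = O(|E|)
\end{align*}
```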

The fastest known implementation of Dijkstra’s algorithm uses a data structure known as a Fibonacci heap,

which we will not cover here. Note that the bounds for the insert operation for Fibonacci heaps are amortized

bounds: certain operations may be expensive, but the average cost over a sequence of operations is constant.



Single-Source Shortest Paths: General Lengths

Our argument of correctness of our shortest path algorithm was based on the “time metaphor”: the most imminent prospective event (arrival of an explorer) must take place, exactly because it is the most imminent. This however would not work if we had negative edges. (Imagine explorers being able to arrive before they left!) If the length of edge (a,b) in Figure 4.2 were −1, the shortest path from s to b would have value 1, not 4, and our simple

algorithm fails. Obviously, with negative lengths we need more involved algorithms, which repeatedly update the

values of dist.

We can describe a general paradigm for constructing shortest path algorithms with arbitrary edge weights. The

algorithms use arrays dist and prev, and again we maintain the invariant that dist is always a conservative overestimate

of the true distance from s. (Again, dist is initialized to ∞ for all nodes, except for s for which it is 0).

The algorithms maintain dist so that it is always a conservative overestimate; they will only update a value

when a suitable path is discovered to show that the overestimate can be lowered. That is, suppose we find a neighbor

w of v, with dist[v] > dist[w] + length(w,v). Then we have found an actual path that shows the distance estimate is

too conservative. We therefore repeatedly apply the following update rule.



procedure update ((w,v))
    edge (w,v)
    if dist[v] > dist[w] + length(w,v) then
        dist[v] := dist[w] + length(w,v), prev[v] := w

A crucial observation is that this procedure is safe, in that it never invalidates our “invariant” that dist is a

conservative overestimate.

The key idea is to consider how these updates along edges should occur. In Dijkstra’s algorithm, the edges are

updated according to the time order of the imaginary explorers. But this only works with positive edge lengths.

A second crucial observation concerns how many updates we have to do. Let a ≠ s be a node, and consider the shortest path from s to a, say s, v1, v2, . . . , vk = a for some k between 1 and n−1. If we perform update first on (s,v1), later on (v1,v2), and so on, and finally on (vk−1,a), then we are sure that dist[a] contains the true distance from s

to a, and that the true shortest path is encoded in prev. (Exercise: Prove this, by induction.) We must thus find a

sequence of updates that guarantee that these edges are updated in this order. We don’t care if these or other edges

are updated several times in between, since all we need is to have a sequence of updates that contains this particular

subsequence. There is a very easy way to guarantee this: update all edges |V |−1 times in a row!



Algorithm Shortest Paths 2 (G = (V,E,length); s ∈ V)
    v,w: vertices
    dist: array[V] of integer
    prev: array[V] of vertices
    i: integer
    for v ∈ V do
        dist[v] := ∞, prev[v] := nil
    rof
    dist[s] := 0
    for i = 1 . . . n−1
        for (w,v) ∈ E update(w,v)
end shortest paths 2

This algorithm solves the general single-source shortest path problem in O(|V | · |E|) time.



Negative Cycles

In fact, there is a further problem that negative edges can cause. Suppose the length of edge (b,a) in Figure 4.2 were changed to −5. Then the graph would have a negative cycle (from a to b and back). On such graphs, it does not

make sense to even ask the shortest path question. What is the shortest path from s to c in the modified graph? The

one that goes directly from s to a to c (cost: 3), or the one that goes from s to a to b to a to c (cost: 1), or the one that

takes the cycle twice (cost: -1)? And so on.

The shortest path problem is ill-posed in graphs with negative cycles. It makes no sense and deserves no

answer. Our algorithm in the previous section works only in the absence of negative cycles. (Where did we assume

no negative cycles in our correctness argument? Answer: When we asserted that a shortest path from s to a exists!)

But it would be useful if our algorithm were able to detect whether there is a negative cycle in the graph, and thus to

report reliably on the meaningfulness of the shortest path answers it provides.

This is easily done. After the |V | − 1 rounds of updates of all edges, do a last update. If any changes occur

during this last round of updates, there is a negative cycle. This must be true, because if there were no negative

cycles, |V |−1 rounds of updates would have been sufficient to find the shortest paths.
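The |V|−1 rounds of updates, followed by the extra detection round, can be sketched in Python as below. The procedure is commonly known as the Bellman-Ford algorithm. The function name, the (w, v, length) edge triples, and the three-part return value are assumptions made for this sketch.

```python
import math


def shortest_paths_general(n, edges, s):
    """edges is a list of (w, v, length) triples; lengths may be negative.
    Returns (dist, prev, has_negative_cycle)."""
    dist = [math.inf] * n
    prev = [None] * n
    dist[s] = 0

    def update(w, v, length):
        """Lower dist[v] if the path through w is shorter; report changes."""
        if dist[w] + length < dist[v]:
            dist[v] = dist[w] + length
            prev[v] = w
            return True
        return False

    for _ in range(n - 1):            # |V| - 1 rounds over all edges
        for w, v, length in edges:
            update(w, v, length)

    # One extra round: any change reveals a negative cycle reachable from s.
    has_negative_cycle = False
    for w, v, length in edges:
        if update(w, v, length):
            has_negative_cycle = True
    return dist, prev, has_negative_cycle
```

When the flag comes back True, the dist values should not be trusted, since the shortest path question is ill-posed on that graph.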



Shortest Paths on DAG’s

There are two subclasses of weighted graphs that automatically exclude the possibility of negative cycles:

graphs with non-negative weights and DAG’s. We have already seen that there is a fast algorithm when the weights

are non-negative. Here we will give a linear algorithm for single-source shortest paths in DAG’s.

Our algorithm is based on the same principle as our algorithm for negative weights. We are trying to find a

sequence of updates, such that all shortest paths are its subsequences. But in a DAG we know that all shortest paths

from s must go in the topological order of the DAG. All we have to do then is first topologically sort the DAG using

a DFS, and then visit all edges coming out of nodes in the topological order. This algorithm solves the general

single-source shortest path problem for DAG’s in O(m) time.
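As a sketch of this idea in Python: topologically sort with a DFS, then update the edges leaving each vertex in that order. The function name and the (w, length) edge format are assumptions of this example, and the input is assumed to be a DAG.

```python
import math


def dag_shortest_paths(adj, s):
    """adj[v] is a list of (w, length) pairs; the graph must be a DAG.
    Lengths may be negative. Returns (dist, prev)."""
    n = len(adj)
    explored = [False] * n
    finished = []  # vertices in increasing postorder

    def search(v):
        explored[v] = True
        for w, _ in adj[v]:
            if not explored[w]:
                search(w)
        finished.append(v)

    for v in range(n):
        if not explored[v]:
            search(v)

    dist = [math.inf] * n
    prev = [None] * n
    dist[s] = 0
    # Decreasing postorder is a topological order of the DAG.
    for v in reversed(finished):
        if dist[v] == math.inf:
            continue                  # v is not reachable from s
        for w, length in adj[v]:
            if dist[w] > dist[v] + length:
                dist[w] = dist[v] + length
                prev[w] = v
    return dist, prev


adj = [[(1, 5), (2, 2)], [(3, -4)], [(1, 1), (3, 3)], []]
dist, prev = dag_shortest_paths(adj, 0)
```

Each edge is updated exactly once, after the final distance of its tail is known.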


CS124 Lecture 5 Spring 2002

Minimum Spanning Trees

A tree is an undirected graph which is connected and acyclic. It is easy to show that an undirected graph G(V,E) that

satisfies any two of the following properties also satisfies the third, and is therefore a tree:

• G(V,E) is connected

• G(V,E) is acyclic

• |E| = |V |−1

(Exercise: Show that any two of the above properties imply the third (use induction).)

A spanning tree in an undirected graph G(V,E) is a subset of edges T ⊆ E that are acyclic and connect all the

vertices in V . It follows from the above conditions that a spanning tree must consist of exactly n− 1 edges. Now

suppose that each edge has a weight associated with it: w : E → Z. Say that the weight of a tree T is the sum of the

weights of its edges: w(T) = ∑_{e∈T} w(e). A minimum spanning tree in a weighted graph G(V,E) is one which has

the smallest weight among all spanning trees in G(V,E).

As an example of why one might want to find a minimum spanning tree, consider someone who has to install

the wiring to network together a large computer system. The requirement is that all machines be able to reach each

other via some sequence of intermediate connections. By representing each machine as a vertex and the cost of

wiring two machines together by a weighted edge, the problem of finding the minimum cost wiring scheme reduces

to the minimum spanning tree problem.

In general, the number of spanning trees in G(V,E) grows exponentially in the number of vertices in G(V,E).

(Exercise: Try to determine the number of different spanning trees for a complete graph on n vertices.) Therefore

it is infeasible to search through all possible spanning trees to find the lightest one. Luckily it is not necessary

to examine all possible spanning trees; minimum spanning trees satisfy a very important property which makes it

possible to efficiently zoom in on the answer.



We shall construct the minimum spanning tree by successively selecting edges to include in the tree. We will

guarantee after the inclusion of each new edge that the selected edges, X , form a subset of some minimum spanning

tree, T . How can we guarantee this if we don’t yet know any minimum spanning tree in the graph? The following

property provides this guarantee:

Cut property: Let X ⊆ T where T is a MST in G(V,E). Let S ⊂ V such that no edge in X crosses between S and V−S; i.e., no edge in X has one endpoint in S and one endpoint in V−S. Among edges crossing between S and V−S, let e be an edge of minimum weight. Then X ∪ {e} ⊆ T′ where T′ is a MST in G(V,E).

The cut property says that we can construct our tree greedily. Our greedy algorithms can simply take the

minimum weight edge across two regions not yet connected. Eventually, if we keep acting in this greedy manner,

we will arrive at the point where we have a minimum spanning tree. Although the idea of acting greedily at each

point may seem quite intuitive, it is very unusual for such a strategy to actually lead to an optimal solution, as we

will see when we examine other problems!

Proof: If e ∈ T we may take T′ = T, so suppose e ∉ T. Adding e into T creates a unique cycle. We will remove a single edge e′ from this unique cycle, thus getting T′ = (T ∪ {e}) − {e′}. It is easy to see that T′ must be a tree: it is connected and has n−1 edges. Furthermore, as we shall show below, it is always possible to select an edge e′ in the cycle such that it crosses between S and V−S. Now, since e is a minimum weight edge crossing between S and V−S, w(e′) ≥ w(e). Therefore w(T′) = w(T) + w(e) − w(e′) ≤ w(T). However, since T is a MST, it follows that T′ is also a MST and w(e) = w(e′). Furthermore, since X has no edge crossing between S and V−S, the removed edge e′ is not in X; it follows that X ⊆ T′ and thus X ∪ {e} ⊆ T′.

How do we know that there is an edge e′ ≠ e in the unique cycle created by adding e into T, such that e′ crosses between S and V−S? This is easy to see, because as we trace the cycle, e crosses between S and V−S, and we must cross back along some other edge to return to the starting point.


In light of this, the basic outline of our minimum spanning tree algorithms is going to be the following:

    X := ∅.
    Repeat until |X| = n−1:
        Pick a set S ⊆ V such that no edge in X crosses between S and V−S.
        Let e be a lightest edge in G(V,E) that crosses between S and V−S.
        X := X ∪ {e}.

The difference between minimum spanning tree algorithms lies in how we pick the set S at each step.


Prim’s algorithm:

In the case of Prim’s algorithm, X consists of a single tree, and the set S is the set of vertices of that tree. One

way to think of the algorithm is that it grows a single tree, adding a new vertex at each step, until it has the minimum

spanning tree. In order to find the lightest edge crossing between S and V − S, Prim’s algorithm maintains a heap

containing all those vertices in V − S which are adjacent to some vertex in S. The priority of a vertex v, according

to which the heap is ordered, is the weight of its lightest edge to a vertex in S. This is reminiscent of Dijkstra’s

algorithm (where distance was used for the heap instead of the edge weight). As in Dijkstra’s algorithm, each vertex

v will also have a parent pointer prev(v) which is the other endpoint of the lightest edge from v to a vertex in S. The

pseudocode for Prim’s algorithm is almost identical to that for Dijkstra’s algorithm:

    Procedure Prim(G(V,E), s)
        v, w: vertices
        dist: array[V] of integer
        prev: array[V] of vertices
        S: set of vertices, initially empty
        H: priority heap of V
        H := {s : 0}
        for v ∈ V do
            dist[v] := ∞, prev[v] := nil
        rof
        dist[s] := 0
        while H ≠ ∅
            v := deletemin(H)
            S := S ∪ {v}
            for (v,w) ∈ E and w ∈ V−S do
                if dist[w] > length(v,w)
                    dist[w] := length(v,w), prev[w] := v, insert(w, dist[w], H)
        end while
    end Prim

Note that each vertex is “inserted” on the heap at most once; other insert operations simply change the value on

the heap. The vertices that are removed from the heap form the set S for the cut property. The set X of edges chosen

to be included in the MST are given by the parent pointers of the vertices in the set S. Since the smallest key in the

heap at any time gives the lightest edge crossing between S and V −S, Prim’s algorithm follows the generic outline

for a MST algorithm presented above, and therefore its correctness follows from the cut property.

The running time of Prim’s algorithm is clearly the same as Dijkstra’s algorithm, since the only change is how

we prioritize nodes in the heap. Thus, if we use d-heaps, the running time of Prim’s algorithm is $O(m \log_{m/n} n)$.
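For concreteness, here is a minimal Python sketch of Prim's algorithm. It is not a literal transcription of the pseudocode: Python's heapq is a binary heap with no operation for changing a key, so instead of updating a vertex's value we push a fresh entry and skip stale entries when they are popped (an assumption of this sketch, sometimes called lazy deletion). The function name and the adjacency-list format are also our own choices.

```python
import heapq

def prim(n, adj, s=0):
    """Minimum spanning tree of a connected undirected graph.

    adj[v] is a list of (w, length) pairs for the edges at v (hypothetical
    input format). Returns (total weight, list of tree edges (prev[v], v)).
    """
    in_S = [False] * n
    dist = [float("inf")] * n
    prev = [None] * n
    dist[s] = 0
    heap = [(0, s)]
    total = 0
    tree_edges = []
    while heap:
        d, v = heapq.heappop(heap)
        if in_S[v]:
            continue  # stale entry: v already joined the tree via a lighter edge
        in_S[v] = True
        total += d
        if prev[v] is not None:
            tree_edges.append((prev[v], v))
        for w, length in adj[v]:
            if not in_S[w] and dist[w] > length:
                dist[w] = length
                prev[w] = v
                heapq.heappush(heap, (length, w))  # stands in for "change the value"
    return total, tree_edges
```

With lazy deletion the heap may hold up to m entries, so this sketch runs in O(m log m) time rather than the d-heap bound quoted above.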


Kruskal’s algorithm:

Kruskal’s algorithm uses a different strategy from Prim’s algorithm. Instead of growing a single tree, Kruskal’s

algorithm attempts to put the lightest edge possible in the tree at each step. Kruskal’s algorithm starts with the edges

sorted in increasing order by weight. Initially X = ∅, and each vertex in the graph is regarded as a trivial tree (with no edges). Each edge in the sorted list is examined in order, and if its endpoints are in the same tree, then the edge is

discarded; otherwise it is included in X and this causes the two trees containing the endpoints of this edge to merge

into a single tree. Note that, by this process, we are implicitly choosing a set S ⊆ V with no edge in X crossing

between S and V −S, so this fits in our basic outline of a minimum spanning tree algorithm.

To implement Kruskal’s algorithm, given a forest of trees, we must decide given two vertices whether they

belong to the same tree. For the purposes of this test, each tree in the forest can be represented by a set consisting of

the vertices in that tree. We also need to be able to update our data structure to reflect the merging of two trees into a

single tree. Thus our data structure will maintain a collection of disjoint sets (disjoint since each vertex is in exactly

one tree), and support the following three operations:

• MAKESET(x): Create a new set containing only the element x.

• FIND(x): Given an element x, which set does it belong to?

• UNION(x,y): replace the set containing x and the set containing y by their union.

The pseudocode for Kruskal’s algorithm follows:

    Function Kruskal(graph G(V,E))
        set X
        X = ∅
        E := sort E by weight
        for u ∈ V
            MAKESET(u)
        rof
        for (u,v) ∈ E (in increasing order) do
            if FIND(u) ≠ FIND(v) do
                X = X ∪ {(u,v)}
                UNION(u,v)
        rof
        return(X)
    end Kruskal
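A minimal Python sketch of the same procedure follows. The function name and the (weight, u, v) edge format are assumptions of this sketch, and the disjoint-set structure is deliberately naive (a parent array with no balancing), just enough to show where MAKESET, FIND and UNION are used.

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v) for an undirected graph on vertices 0..n-1.

    Returns the list X of edges chosen for the spanning tree (or spanning
    forest, if the graph is not connected).
    """
    parent = list(range(n))  # MAKESET(u) for every vertex u

    def find(x):
        # Follow parent pointers up to the root, which names the set.
        while parent[x] != x:
            x = parent[x]
        return x

    X = []
    for weight, u, v in sorted(edges):  # increasing order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                    # endpoints lie in different trees
            X.append((weight, u, v))
            parent[ru] = rv             # UNION: merge the two trees
    return X
```

For example, on the edges (1,0,1), (2,1,2), (3,0,2), (4,2,3), (5,1,3) the edge of weight 3 is discarded because vertices 0 and 2 are already connected.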


The correctness of Kruskal’s algorithm follows from the following argument: Kruskal’s algorithm adds an edge

e into X only if it connects two trees; let S be the set of vertices in one of these two trees. Then e must be the first

edge in the sorted edge list that has one endpoint in S and the other endpoint in V − S, and is therefore the lightest

edge that crosses between S and V −S. Thus the cut property of MST implies the correctness of the algorithm.

The running time of the algorithm, assuming the edges are given in sorted order, is dominated by the set

operations: UNION and FIND. There are n−1 UNION operations (one corresponding to each edge in the spanning

tree), and 2m FIND operations (2 for each edge). Thus the total time of Kruskal’s algorithm is O(m×FIND+n×

UNION). We will soon show that this is O(m log∗ n). Note that, if the edges are not initially given in sorted order,

then to sort them in the obvious way takes O(m logm) time, and this would be the dominant part of the running time

of the algorithm.


Exchange Property

Actually spanning trees satisfy an even stronger property than the cut property — the exchange property. The

exchange property is quite remarkable since it implies that we can “walk” from any spanning tree T to a minimum spanning tree T̂ by a sequence of exchange moves; each such move consists of throwing an edge out of the current tree that is not in T̂, and adding a new edge into the current tree that is in T̂. Moreover, each successive tree in the

“walk” is guaranteed to weigh no more than its predecessor.

Exchange property: Let T and T ′ be spanning trees in G(V,E). Given any e′ ∈ T ′−T , there exists an edge

e ∈ T −T ′ such that (T −e)∪e′ is also a spanning tree.

The proof is quite similar to that of the cut property. Adding e′ into T results in a unique cycle. There must be

some edge in this cycle that is not in T ′ (since otherwise T ′ must have a cycle). Call this edge e. Then deleting e

restores a spanning tree, since connectivity is not affected, and the number of edges is restored to n−1.

To see how one may use this exchange property to “walk” from any spanning tree to a MST: let T be any spanning tree and let T̂ be a MST in G(V,E). Let e′ be the lightest edge that is not in both trees. Perform an exchange using this edge. Since the exchange was done with the lightest such edge, the new tree must be lighter than the old one. Since T̂ is already a MST, it follows that the exchange must have been performed upon T and results in a lighter spanning tree which has more edges in common with T̂ (if there are several edges of the same weight, then the new tree might not be lighter, but it still has more edges in common with T̂).


Figure 5.1: An example of Prim’s algorithm and Kruskal’s algorithm. Which is which?


CS124 Lecture 6 Spring 2002

Disjoint set (Union-Find)

For Kruskal’s algorithm for the minimum spanning tree problem, we found that we needed a data structure for

maintaining a collection of disjoint sets. That is, we need a data structure that can handle the following operations:

• MAKESET(x) - create a new set containing the single element x

• UNION(x,y) - replace two sets containing x and y by their union.

• FIND(x) - return the name of the set containing the element x

Naturally, this data structure is useful in other situations, so we shall consider its implementation in some detail.

Within our data structure, each set is represented by a tree, so that each element points to a parent in the tree.

The root of each tree will point to itself. In fact, we shall use the root of the tree as the name of the set itself; hence

the name of each set is given by a canonical element, namely the root of the associated tree.

It is convenient to add a fourth operation LINK(x,y) to the above, where we require for LINK that x and y are

two roots. LINK changes the parent pointer of one of the roots, say x, and makes it point to y. It returns the root

of the now composite tree y. With this addition, we have UNION(x,y) = LINK(FIND(x),FIND(y)), so the main

problem is to arrange our data structure so that FIND operations are very efficient.



Notice that the time to do a FIND operation on an element corresponds to its depth in the tree. Hence our goal is

to keep the trees short. Two well-known heuristics for keeping trees short in this setting are UNION BY RANK and

PATH COMPRESSION. We start with the UNION BY RANK heuristic. The idea of UNION BY RANK is to ensure

that when we combine two trees, we try to keep the overall depth of the resulting tree small. This is implemented as

follows: the rank of an element x is initialized to 0 by MAKESET. An element’s rank is only updated by the LINK

operation. If x and y have the same rank r, then invoking LINK(x,y) causes the parent pointer of x to be updated to

point to y, and the rank of y is then updated to r + 1. On the other hand, if x and y have different rank, then when

invoking LINK(x,y) the parent pointer of the element with smaller rank is updated to point to the element with larger

rank. The idea is that the rank of the root is associated with the depth of the tree, so this process keeps the depth

small. (Exercise: Try some examples by hand with and without using the UNION BY RANK heuristic.)

The idea of PATH COMPRESSION is that, once we perform a FIND on some element, we should adjust its

parent pointer so that it points directly to the root; that way, if we ever do another FIND on it, we start out much

closer to the root. Note that, until we do a FIND on an element, it might not be worth the effort to update its parent

pointer, since we may never access it at all. Once we access an item, however, we must walk through every pointer

to the root, so modifying the pointers only changes the cost of this walk by a constant factor.


    procedure MAKESET(x)
        p(x) := x
        rank(x) := 0

    function FIND(x)
        if x ≠ p(x) then
            p(x) := FIND(p(x))
        return(p(x))

    function LINK(x,y)
        if rank(x) > rank(y) then x ↔ y
        if rank(x) = rank(y) then rank(y) := rank(y)+1
        p(x) := y
        return(y)

    procedure UNION(x,y)
        LINK(FIND(x),FIND(y))
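The four operations translate almost line for line into Python. The sketch below is one possible rendering; the class name and the use of dictionaries for p and rank are our own choices, and the early return in link when both arguments are the same root is a guard we added that does not appear in the pseudocode.

```python
class DisjointSets:
    """Union-find with UNION BY RANK and PATH COMPRESSION (illustrative sketch)."""

    def __init__(self):
        self.p = {}      # parent pointers
        self.rank = {}   # ranks

    def makeset(self, x):
        self.p[x] = x
        self.rank[x] = 0

    def find(self, x):
        # Path compression: every element on the path ends up pointing at the root.
        if x != self.p[x]:
            self.p[x] = self.find(self.p[x])
        return self.p[x]

    def link(self, x, y):
        # x and y must be roots.
        if x == y:
            return x  # added guard: the two elements are already in one set
        if self.rank[x] > self.rank[y]:
            x, y = y, x  # make y the root of larger (or equal) rank
        if self.rank[x] == self.rank[y]:
            self.rank[y] += 1
        self.p[x] = y
        return y

    def union(self, x, y):
        return self.link(self.find(x), self.find(y))
```

A short usage example: after makeset on 0..3, the calls union(0,1), union(2,3), union(0,2) build a tree of rank 2 rooted at 3, and a subsequent find(0) rewires 0 to point directly at that root.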



In our analysis, we show that any sequence of m UNION and FIND operations on n elements take at most

O((m + n) log∗ n) steps, where log∗ n is the number of times you must iterate the log2 function on n before getting

a number less than or equal to 1. (So log∗ 4 = 2, log∗ 16 = 3, log∗ 65536 = 4.) We should note that this is not the

tightest analysis possible; however, this analysis is already somewhat complex!
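To get a feel for how slowly log∗ grows, the definition above can be computed directly. This is a small sketch of our own (the function name is an assumption), using floating-point log2, which is exact on the powers of two tried here.

```python
from math import log2

def log_star(n):
    """Number of times log2 must be applied to n before the value is <= 1."""
    count = 0
    while n > 1:
        n = log2(n)
        count += 1
    return count
```

This reproduces the values quoted in the text: log_star(4) is 2, log_star(16) is 3, and log_star(65536) is 4.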

Note that we are going to do an amortized analysis here. That is, we are going to consider the cost of the

algorithm over a sequence of steps, instead of considering the cost of a single operation. In fact a single UNION or

FIND operation could require O(logn) operations. (Exercise: Prove this!) Only by considering an entire sequence

of operations at once can we obtain the above bound. Our argument will require some interesting accounting to total the

cost of a sequence of steps.


We first make a few observations about rank.

• if v ≠ p(v) then rank(p(v)) > rank(v)

• whenever p(v) is updated, rank(p(v)) increases

• the number of elements with rank k is at most $n/2^k$

• the number of elements with rank at least k is at most $n/2^{k-1}$

The first two assertions are immediate from the description of the algorithm. The third assertion follows from

the fact that the rank of an element v changes only if LINK(v,w) is executed, rank(v) = rank(w), and v remains

the root of the combined tree; in this case v’s rank is incremented by 1. A simple induction then yields that when

rank(v) is incremented to k, the resulting tree has at least $2^k$ elements. The last assertion then follows from the third assertion, as $\sum_{j=k}^{\infty} n/2^j = n/2^{k-1}$.

Exercise: Show that the maximum rank an item can have is log n.


As soon as an element becomes a non-root, its rank is fixed. Let us divide the (non-root) elements into groups

according to their ranks. Group i contains all elements whose rank r satisfies $\log^* r = i$. For example, elements in group 3 have ranks in the range (4,16], and in general the range of ranks associated with group i is $(k, 2^k]$, where k is a tower of i−1 twos (so k = 4 for group 3 and k = 16 for group 4). For convenience we shall write this more simply by saying group $(k, 2^k]$ to mean the group with these ranks.

It is easy to establish the following assertions about these groups:

• The number of distinct groups is at most log∗ n. (Use the fact that the maximum rank is log n.)

• The number of elements in the group $(k, 2^k]$ is at most $n/2^k$.

Let us assign $2^k$ tokens to each element in group $(k, 2^k]$. The total number of tokens assigned to all elements from that group is then at most $2^k \cdot n/2^k = n$, and the total number of groups is at most log∗ n, so the total number of

tokens given out is n log∗ n. We use these tokens to account for the work done by FIND operations.

Recall that the number of steps for a FIND operation is proportional to the number of pointers that the FIND

operation must follow up the tree. We separate the pointers into two groups, depending on the groups of u and

p(u) = v, as follows:

• Type 1: a pointer is of Type 1 if u and v belong to different groups, or v is the root.

• Type 2: a pointer is of Type 2 if u and v belong to the same group.

We account for the two Types of pointers in two different ways. Type 1 links are “charged” directly to the FIND

operation; Type 2 links are “charged” to u, who “pays” for the operation using one of the tokens. Let us consider

these charges more carefully.


The number of Type 1 links each FIND operation goes through is at most log∗ n, since there are only log∗ n

groups, and the group number increases as we move up the tree.

What about Type 2 links? We charge these links directly back to u, who is supposed to pay for them with a

token. Does u have enough tokens? The point here is that each time a FIND operation goes through an element u,

its parent pointer is changed to the current root of the tree (by PATH COMPRESSION), so the rank of its parent

increases by at least 1. If u is in the group $(k, 2^k]$, then the rank of u’s parent can increase fewer than $2^k$ times before it moves to a higher group. Therefore the $2^k$ tokens we assign to u are sufficient to pay for all FIND operations that

go through u to a parent in the same group.


We now count the total number of steps for m UNION and FIND operations. Clearly LINK requires just O(1)

steps, and since a UNION operation is just a LINK and 2 FIND operations, it suffices to bound the time for at most

2m FIND operations. Each FIND operation is charged at most log∗ n for a total of O(m log∗ n). The total number of tokens used is at most n log∗ n, and each token pays for a constant number of steps. Therefore the total number of

steps is O((m+n) log∗ n).

Let us give a more equation-oriented explanation. The total time spent over the course of m UNION and FIND

operations is just

$$\sum_{\text{all FIND ops}} (\#\text{ links passed through}).$$

We split this sum up into two parts:

$$\sum_{\text{all FIND ops}} (\#\text{ links in same group}) \;+\; \sum_{\text{all FIND ops}} (\#\text{ links in different groups}).$$

(Technically, the case where a link goes to the root should be handled explicitly; however, this is just O(m) links in

total, so we don’t need to worry!) The second term is clearly O(m log∗ n). The first term can be upper bounded by:

$$\sum_{\text{all elements } u} (\#\text{ ranks in the group of } u),$$

because each element u can be charged only once for each rank in its group. (Note here that this is because the links

to the root count in the second sum!) This last sum is bounded above by

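$$\sum_{\text{all groups}} (\#\text{ items in group}) \cdot (\#\text{ ranks in group}) \;\le\; \sum_{\text{groups } (k,\,2^k]} \frac{n}{2^k} \cdot 2^k \;\le\; n \log^* n,$$

where the last step uses the fact that there are at most $\log^* n$ groups.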

This completes the proof.


Figure 6.1: Examples of UNION BY RANK and PATH COMPRESSION.


CS124 Lecture 7

In today’s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we

will see, sometimes it works, and sometimes even when it doesn’t, it can provide a useful result.

Horn Formulae

A simple application of the greedy paradigm solves an important special case of the SAT problem. We have

already seen that 2SAT can be solved in linear time. Now consider SAT instances where in each clause, there is at

most one positive literal. Such formulae are called Horn formulae; for example, this is an instance:

(x ∨ ¬y ∨ ¬z ∨ ¬w) ∧ (¬x ∨ ¬y ∨ ¬w) ∧ (¬x ∨ ¬z ∨ w) ∧ (¬x ∨ y) ∧ (x) ∧ (¬z) ∧ (¬x ∨ ¬y ∨ w).

Given a Horn formula, we can separate its clauses into two parts: the pure negative clauses (those without a

positive literal) and the implications (those with a positive literal). We call clauses with a positive literal implications

because they can be rewritten suggestively as implications; (x ∨ ¬y ∨ ¬z ∨ ¬w) is equivalent to (y ∧ z ∧ w) → x. Note that the trivial clause (x) can be thought of as a trivial implication → x. Hence, in the example above, we have the implications

(y ∧ z ∧ w → x), (x ∧ z → w), (x → y), (→ x), (x ∧ y → w)

and these two pure negative clauses

(¬x ∨ ¬y ∨ ¬w), (¬z).

We can now develop a greedy algorithm. The idea behind the algorithm is that we start with all variables set to

false, and we only set variables to T if an implication forces us to. Recall that an implication is not satisfied if all

variables to the left of the arrow are true and the one to the right is false. This algorithm is greedy, in the sense that

it (greedily) tries to ensure the pure negative clauses are satisfied, and only changes a variable if absolutely forced.



Algorithm Greedy-Horn(φ: CNF formula with at most one positive literal per clause)

Start with the truth assignment t :=FFF· · ·F

while there is an implication that is not satisfied do

make the implied variable T in t

if all pure negatives are satisfied then return t

else return “φ is unsatisfiable”

Once we have the proposed truth assignment, we look at the pure negatives. If there is a pure negative clause

that is not satisfied by the proposed truth assignment, the formula cannot be satisfied. This follows from the fact that

all the pure negative clauses will be satisfied if any of their variables are set to F. If such a clause is unsatisfied, all of

its variables must be set to T. But we only set a variable to T if we are forced to by the implications. If all the pure

negative clauses are satisfied, then we have found a truth assignment.

On the example above, Greedy-Horn first flips x to true, forced by the implication → x. Then y gets forced to

true (from x → y), and similarly w is forced to true. (Why?) Looking at the pure negative clauses, we find that the

first is not satisfied, and hence we conclude the original formula had no truth assignment.
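The run just described can be reproduced with a short Python sketch. The representation is our own assumption: each implication is a pair (premises, head), and each pure negative clause is the set of variables that appear negated in it. The sketch rescans all implications until nothing changes, so it is a simple quadratic version and not the linear-time implementation asked for in the exercise.

```python
def greedy_horn(variables, implications, pure_negatives):
    """Return a satisfying assignment as a dict, or None if unsatisfiable.

    implications: list of (premises, head), meaning AND(premises) -> head;
                  premises may be empty (a trivial implication).
    pure_negatives: list of sets; each set lists the variables negated in
                    one pure negative clause.
    """
    t = {v: False for v in variables}  # start with FFF...F
    changed = True
    while changed:
        changed = False
        for premises, head in implications:
            # An implication is violated when all premises are T and the head is F.
            if not t[head] and all(t[v] for v in premises):
                t[head] = True
                changed = True
    for clause in pure_negatives:
        if all(t[v] for v in clause):  # every literal in the clause is false
            return None
    return t


# The example from the text.
implications = [({"y", "z", "w"}, "x"), ({"x", "z"}, "w"), ({"x"}, "y"),
                (set(), "x"), ({"x", "y"}, "w")]
pure_negatives = [{"x", "y", "w"}, {"z"}]
result = greedy_horn("xyzw", implications, pure_negatives)
```

Here result is None, matching the conclusion above; if the first pure negative clause is dropped, the same call returns the assignment with x, y, w true and z false.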

Exercise: Show that the Greedy-Horn algorithm can be implemented in linear time in the length of the formula (i.e.,

the total number of appearances of all literals).


Huffman Coding

Suppose that you must store the map of a chromosome which consists of a sequence of 130 million symbols of

the form A, C, G, or T. To store the sequence efficiently, you represent each character with just 2 bits: A as 00, C as

01, G as 10, and T as 11. Such a representation is called an encoding. With this encoding, the sequence requires 260

megabits to store.

Suppose, however, that you know that some symbols appear more frequently than others. For example, suppose

A appears 70 million times, C 3 million times, G 20 million times, and T 37 million times. In this case it seems

wasteful to use two bits to represent each A. Perhaps a more elaborate encoding assigning a shorter string to A could

save space.

We restrict ourselves to encodings that satisfy the prefix property: no assigned string is the prefix of another.

This property allows us to avoid backtracking while decoding. For an example without the prefix property, suppose

we represented A as 1 and C as 101. Then when we read a 1, we would not know whether it was an A or the

beginning of a C! Clearly we would like to avoid such problems, so the prefix property is important.

You can picture an encoding with the prefix property as a binary tree. For example, the binary tree below

corresponds to an optimal encoding in the above situation. (There can be more than one optimal encoding! Just flip

the left and right hand sides of the tree.) Here a branch to the left represents a 0, and a branch to the right represents a 1. Therefore A is represented by 1, C by 001, G by 000, and T by 01. This encoding requires only 213 million bits (70·1 + 3·3 + 20·3 + 37·2 = 213), an improvement of about 18% over the 260 million bits of the balanced tree (the encoding 00,01,10,11). (This does not include the bits that might

be necessary to store the form of the encoding!)


Figure 7.1: A Huffman tree. Leaves: A (70), T (37), G (20), C (3); the two branches below each internal node are labeled 0 and 1.


Let us note some properties of the binary trees that represent encodings. The symbols must correspond to leaves;

an internal node that represents a character would violate the prefix property. The code words are thus given by all

root-to-leaf paths. All internal nodes must have exactly two children, as an internal node with only one child could

be deleted to yield a better code. Hence if there are n leaves there are n−1 internal nodes. Also, if we assign

frequencies to the internal nodes, so that the frequencies of an internal node are the sums of the frequencies of the

children, then the total length produced by the encoding is the sum of the frequencies of all nodes except the root. (A

one line proof: each edge corresponds to a bit that is written as many times as the frequency of the node to which it leads.)


One final property allows us to determine how to build the tree: the two symbols with the smallest frequencies

are together at the lowest level of the tree. Otherwise, we could improve the encoding by swapping a more frequently

used character at the lowest level up. (This is not a full proof; feel free to complete one.)

This tells us how to construct the optimum tree greedily. Take the two symbols with the lowest frequency,

delete them from the list of symbols, and replace them with a new meta-character; this new meta-character will lie

directly above the two deleted symbols in the tree. Repeat this process until the whole tree is constructed.
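The greedy construction is short enough to sketch in Python. This is an illustrative version under our own assumptions: symbols are strings, there are at least two of them, a meta-character is represented as a (left, right) tuple, and a counter is used only to break ties between equal frequencies.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """freq: dict mapping symbol (a string) -> frequency, with >= 2 symbols.

    Returns a dict mapping each symbol to its codeword (a string of 0s and 1s).
    """
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Remove the two least frequent items and replace them by a
        # meta-character whose frequency is the sum of the two.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    _, _, tree = heap[0]

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node (meta-character)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: an original symbol
            codes[node] = prefix

    walk(tree, "")
    return codes
```

On the chromosome frequencies from earlier (A 70, C 3, G 20, T 37) this produces codeword lengths 1, 3, 3, 2 for A, C, G, T, for a total of 213 million bits; the exact 0/1 labels may differ from the tree in the text, since flipping branches does not change the cost.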

We can prove by induction that this gives an optimal tree. It works for 2 symbols (base case). We also show

that if it works for n letters, it must also work for n + 1 letters. After deleting the two least frequent symbols and

replacing them with a meta-character, it is as though we have just n symbols. This process yields an optimal tree for

these n symbols (by the inductive hypothesis). Expanding the meta-character back into the two deleted nodes must

now yield an optimal tree, since otherwise we could have found a better tree for the n symbols.


    Initially:      A 60, E 70, I 40, O 50, U 20, Y 30
    After merge 1:  A 60, E 70, I 40, O 50, [UY] 50
    After merge 2:  A 60, E 70, O 50, [I[UY]] 90
    After merge 3:  [OA] 110, E 70, [I[UY]] 90

Figure 7.2: The first few steps of building a Huffman tree.


It is important to realize that when we say that a Huffman tree is optimal, this does not mean that it gives the

best way to compress a string. It only says that we cannot do better by encoding one symbol at a time. By encoding

frequently used blocks of letters (such as, in this section, the block “frequen”) we can obtain much better encodings.

(Note that finding the right blocks of letters can be quite complicated.) Given this, one might expect that Huffman

coding is rarely used. In fact, many compression schemes use Huffman codes at some point as a basic building

block. For example, image and video transmission often use Huffman encoding somewhere along the line.

Exercise: Find a Huffman compressor and another compressor, such as a gzip compressor. Test them on some files.

Which compresses better?

It is straightforward to write code to generate the appropriate tree, and then use this tree to encode and decode

messages. For encoding, we simply build a table with a codeword for each symbol. To decode, we could read bits

in one at a time, and walk down the tree in the appropriate manner. When we reach a leaf, we output the appropriate

symbol and return to the top of the tree.
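A minimal sketch of this table-based encoding and bit-at-a-time decoding, using the four-symbol code from the chromosome example. The nested-dictionary tree and the function names are our own assumptions, and symbols are assumed to be single characters.

```python
CODE = {"A": "1", "C": "001", "G": "000", "T": "01"}  # the encoding from the text

def build_tree(code):
    """Turn a codeword table into a tree of nested dicts keyed by '0' and '1'."""
    root = {}
    for sym, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = sym  # the last bit of the codeword leads to the leaf
    return root

def encode(message, code):
    return "".join(code[s] for s in message)

def decode(bits, root):
    out = []
    node = root
    for bit in bits:
        node = node[bit]           # walk one step down the tree
        if isinstance(node, str):  # reached a leaf: output it, return to the top
            out.append(node)
            node = root
    return "".join(out)
```

For instance, encode("GATTACA", CODE) gives the 13-bit string 0001010110011, and decoding that string with build_tree(CODE) recovers GATTACA.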

In practice, however, if we want to use Huffman coding, there are much faster ways to decode than to explicitly

walk down the tree one bit at a time. Using an explicit tree is slow, for a variety of reasons. Exercise: Think about why.


One approach is to design a system that performs several steps at a time by reading several bits of input and

determining what actions to take according to a big lookup table. For example, we could have a table that represents

the information, “If you are currently at this point in the tree, and the next 8 bits are 00110101, then output AC and

move to this point in the tree.” This lookup table, which might be huge, encapsulates the information needed to

handle eight bits at once. Since computers naturally handle eight bit blocks more easily than single bits, and because

table lookups are faster than following pointers down a Huffman tree, substantial speed gains are possible. Notice

that this gain in speed comes at the expense of the space required for the lookup table.

There are other solutions that work particularly well for very large dictionaries. For example, if you were using

Huffman codes on a library of newspaper articles, you might treat each word as a symbol that can be encoded. In

this case, you would have a lot of symbols! We will not go over these other methods here; a useful paper on the

subject is “On the Implementation of Minimum-Redundancy Prefix Codes,” by Moffat and Turpin. The key to keep

in mind is that while thinking of decoding on the Huffman tree as happening one bit at a time is useful conceptually,

good engineering would use more sophisticated methods to increase efficiency.


The Set Cover Problem

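The inputs to the set cover problem are a finite set $X = \{x_1, \ldots, x_n\}$, and a collection $\mathcal{S}$ of subsets of X such that $\bigcup_{S \in \mathcal{S}} S = X$. The problem is to find the smallest subcollection $\mathcal{T} \subseteq \mathcal{S}$ such that the sets of $\mathcal{T}$ cover X, that is,

$$\bigcup_{T \in \mathcal{T}} T = X.$$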

Notice that such a cover exists, since S is itself a cover.

The greedy heuristic suggests that we build a cover by repeatedly including the set in S that will cover the

maximum number of as yet uncovered elements. In this case, the greedy heuristic does not yield an optimal solution.

Interestingly, however, we can prove that the greedy solution is a good solution, in the sense that it is not too far

from the optimal.
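The greedy heuristic itself takes only a few lines. The following Python sketch is our own illustration (function name and input format are assumptions); ties between equally good sets are broken by taking the first one in the list.

```python
def greedy_set_cover(X, subsets):
    """X: iterable of elements; subsets: list of sets whose union is X.

    Returns the list of chosen subsets, in the order the heuristic picks them.
    """
    uncovered = set(X)
    chosen = []
    while uncovered:
        # Pick the set covering the most elements that are still uncovered.
        best = max(subsets, key=lambda S: len(S & uncovered))
        if not best & uncovered:
            raise ValueError("the subsets do not cover X")
        chosen.append(best)
        uncovered -= best
    return chosen


# A small instance where greedy is not optimal: {1,2,3} and {4,5,6} cover X
# with two sets, but greedy is lured by the four-element set first.
S1, S2, S3 = {1, 2, 3}, {4, 5, 6}, {1, 2, 4, 5}
cover = greedy_set_cover(range(1, 7), [S1, S2, S3])
```

On this instance the heuristic chooses S3 and then needs both S1 and S2, so it returns three sets where two would do.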

This is an example of an approximation algorithm. Loosely speaking, with an approximation algorithm, we

settle for a result that is not the correct answer. Instead, however, we try to prove a guarantee on how close the

algorithm is to the right answer. As we will see later in the course, sometimes this is the best we can hope to do.


Claim 7.1 Let k be the size of the smallest set cover for the instance (X ,S). Then the greedy heuristic finds a set

cover of size at most $k \ln n$.

Proof: Let $Y_i \subseteq X$ be the set of elements that are still not covered after i sets have been chosen with the greedy heuristic. Clearly $Y_0 = X$. We claim that there must be a set $A \in \mathcal{S}$ such that $|A \cap Y_i| \ge |Y_i|/k$. To see this, consider the sets in the optimal set cover of X. These sets cover $Y_i$, and there are only k of them, so one of these sets must cover at least a 1/k fraction of $Y_i$. Hence

$$|Y_{i+1}| \le |Y_i| - |Y_i|/k = (1 - 1/k)\,|Y_i|,$$

and by induction,

$$|Y_i| \le (1 - 1/k)^i\,|Y_0| = n(1 - 1/k)^i < n e^{-i/k},$$

where the last inequality uses the fact that $1 + x \le e^x$ with equality iff x = 0. Hence when $i \ge k \ln n$ we have $|Y_i| < 1$, meaning there are no uncovered elements, and hence the greedy algorithm finds a set cover of size at most $k \ln n$.

Exercise: Show that this bound is tight, up to constant factors. That is, give a family of examples where the set

cover has size k and the greedy algorithm finds a cover of size Ω(k ln n).


CS124 Lecture 8 Spring 2000

Divide and Conquer

We have seen one general paradigm for finding algorithms: the greedy approach. We now consider another

general paradigm, known as divide and conquer.

We have already seen an example of a divide and conquer algorithm: mergesort. The idea behind mergesort is to

take a list, divide it into two smaller sublists, conquer each sublist by sorting it, and then combine the two solutions

for the subproblems into a single solution. These three basic steps – divide, conquer, and combine – lie behind most

divide and conquer algorithms.

With mergesort, we kept dividing the list into halves until there was just one element left. In general, we may

divide the problem into smaller problems in any convenient fashion. Also, in practice it may not be best to keep

dividing until the instances are completely trivial. Instead, it may be wise to divide until the instances are reasonably

small, and then apply an algorithm that is fast on small instances. For example, with mergesort, it might be best to

divide lists until there are only four elements, and then sort these small lists quickly by insertion sort.
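As a rough illustration of this hybrid idea, here is a minimal Python sketch of mergesort that stops dividing at small lists and switches to insertion sort. The cutoff of four follows the example in the text; the name `CUTOFF` and the function names are assumptions for this sketch, and the best cutoff in practice would have to be found by experiment.

```python
CUTOFF = 4  # assumed threshold, taken from the example in the text


def insertion_sort(items):
    """Sort a small list by inserting each element into the sorted prefix."""
    items = list(items)
    for i in range(1, len(items)):
        x = items[i]
        j = i - 1
        while j >= 0 and items[j] > x:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = x
    return items


def merge(left, right):
    """Combine two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def mergesort(items):
    # Divide until the instance is small, then use the simple algorithm.
    if len(items) <= CUTOFF:
        return insertion_sort(items)
    mid = len(items) // 2
    # Conquer each half, then combine.
    return merge(mergesort(items[:mid]), mergesort(items[mid:]))
```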




Suppose we wish to find the minimum and maximum items in a list of numbers. How many comparisons does

it take?

A natural approach is to try a divide and conquer algorithm. Split the list into two sublists of equal size. (Assume

that the initial list size is a power of two.) Find the maxima and minima of the sublists. Two more comparisons then

suffice to find the maximum and minimum of the list.

Hence, if $T(n)$ is the number of comparisons, then $T(n) = 2T(n/2) + 2$. (The $2T(n/2)$ term comes from conquering the two problems into which we divide the original; the $2$ term comes from combining these solutions.) Also, clearly $T(2) = 1$. By induction we find $T(n) = (3n/2) - 2$, for $n$ a power of 2.
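The recurrence can be checked with a short Python sketch that counts comparisons as it goes. The function name `min_max` and its return convention are assumptions for this illustration; the list length is assumed to be a power of two and at least 2, as in the text.

```python
def min_max(items, lo=0, hi=None):
    """Return (minimum, maximum, comparisons used) for items[lo:hi]."""
    if hi is None:
        hi = len(items)
    if hi - lo == 2:
        # Base case T(2) = 1: a single comparison orders the pair.
        a, b = items[lo], items[lo + 1]
        return (a, b, 1) if a < b else (b, a, 1)
    mid = (lo + hi) // 2
    min1, max1, c1 = min_max(items, lo, mid)
    min2, max2, c2 = min_max(items, mid, hi)
    # Combine: one comparison for the minima and one for the maxima.
    smallest = min1 if min1 < min2 else min2
    largest = max1 if max1 > max2 else max2
    return smallest, largest, c1 + c2 + 2


smallest, largest, comparisons = min_max([7, 3, 9, 1, 8, 2, 6, 4])
```

For $n = 8$ the count is $10 = 3 \cdot 8/2 - 2$, matching the closed form.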


Integer Multiplication

The standard multiplication algorithm takes time $\Theta(n^2)$ to multiply together two $n$ digit numbers. This algorithm is so natural that we may think that no algorithm could be better. Here, we will show that better algorithms exist (at least in terms of asymptotic behavior).

Imagine splitting each number $x$ and $y$ into two parts: $x = 10^{n/2} a + b$, $y = 10^{n/2} c + d$. Then

$$xy = 10^{n} ac + 10^{n/2}(ad + bc) + bd.$$

The additions and the multiplications by powers of 10 (which are just shifts!) can all be done in linear time. We have therefore reduced our multiplication problem into four smaller multiplication problems, so the recurrence for the time $T(n)$ to multiply two $n$-digit numbers becomes

$$T(n) = 4T(n/2) + O(n).$$

The $4T(n/2)$ term arises from conquering the smaller problems; the $O(n)$ is the time to combine their solutions into the final solution (using additions and shifts). Unfortunately, when we solve this recurrence, the running time is still $\Theta(n^2)$, so it seems that we have not gained anything.


The key thing to notice here is that four multiplications is too many. Can we somehow reduce it to three? It may not look like it is possible, but it is, using a simple trick. The trick is that we do not need to compute $ad$ and $bc$ separately; we only need their sum $ad + bc$. Now note that

$$(a + b)(c + d) = (ad + bc) + (ac + bd).$$

So if we calculate $ac$, $bd$, and $(a + b)(c + d)$, we can compute $ad + bc$ by subtracting the first two terms from the third! Of course, we have to do a bit more addition, but since the bottleneck to speeding up this multiplication algorithm is the number of smaller multiplications required, that does not matter. The recurrence for $T(n)$ is now

$$T(n) = 3T(n/2) + O(n),$$

and we find that $T(n) = \Theta(n^{\log_2 3}) \approx \Theta(n^{1.59})$, improving on the quadratic algorithm.

If one were to implement this algorithm, it would probably be best not to divide the numbers down to one

digit. The conventional algorithm, because it uses fewer additions, is probably more efficient for small values of

n. Moreover, on a computer, there would be no reason to continue dividing once the length n is so small that the

multiplication can be done in one standard machine multiplication operation!
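The three-multiplication trick can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function name `karatsuba` is chosen here, the inputs are non-negative integers, and the recursion bottoms out at single digits for clarity, whereas (as noted above) a practical version would stop much earlier and use the machine multiply.

```python
def karatsuba(x, y):
    """Multiply non-negative integers x and y using three recursive products."""
    if x < 10 or y < 10:
        return x * y  # small enough for one "machine" multiplication
    half = max(len(str(x)), len(str(y))) // 2
    # Split: x = 10^half * a + b and y = 10^half * c + d.
    a, b = divmod(x, 10 ** half)
    c, d = divmod(y, 10 ** half)
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    # (a + b)(c + d) - ac - bd equals ad + bc, saving one multiplication.
    cross = karatsuba(a + b, c + d) - ac - bd
    return ac * 10 ** (2 * half) + cross * 10 ** half + bd
```

For example, `karatsuba(29, 75)` agrees with the product computed in Figure 1.1.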

It also turns out that using a more complicated algorithm (based on a similar idea) the asymptotic time for multiplication can be made arbitrarily close to linear; that is, for any $\varepsilon > 0$ there is an algorithm that runs in time $O(n^{1+\varepsilon})$.



Strassen’s algorithm

Divide and conquer algorithms can similarly improve the speed of matrix multiplication. Recall that when multiplying two matrices, $A = (a_{ij})$ and $B = (b_{jk})$, the resulting matrix $C = (c_{ik})$ is given by

$$c_{ik} = \sum_{j} a_{ij} b_{jk}.$$

In the case of multiplying together two $n$ by $n$ matrices, this gives us a $\Theta(n^3)$ algorithm; computing each $c_{ik}$ takes $\Theta(n)$ time, and there are $n^2$ entries to compute.

Let us again try to divide up the problem. We can break each matrix into four submatrices, each of size $n/2$ by $n/2$. Multiplying the original matrices can be broken down into eight multiplications of the submatrices, with some additions:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} = \begin{pmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{pmatrix}.$$

Letting $T(n)$ be the time to multiply together two $n$ by $n$ matrices by this algorithm, we have $T(n) = 8T(n/2) + \Theta(n^2)$. Unfortunately, this does not improve the running time; it is still $\Theta(n^3)$.


As in the case of multiplying integers, we have to be a little tricky to speed up matrix multiplication. (Strassen

deserves a great deal of credit for coming up with this trick!) We compute the following seven products:

• P1 = A(F −H)

• P2 = (A+B)H

• P3 = (C +D)E

• P4 = D(G−E)

• P5 = (A+D)(E +H)

• P6 = (B−D)(G+H)

• P7 = (A−C)(E +F)

Then we can find the appropriate terms of the product by addition:

• AE +BG = P5 +P4−P2 +P6

• AF +BH = P1 +P2

• CE +DG = P3 +P4

• CF +DH = P5 +P1−P3 −P7

Now we have $T(n) = 7T(n/2) + \Theta(n^2)$, which gives a running time of $T(n) = \Theta(n^{\log_2 7}) \approx \Theta(n^{2.81})$.
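The seven products and four combinations listed above translate directly into code. The following Python sketch uses plain nested lists and assumes square matrices whose size is a power of two; the helper names `add`, `sub`, and `split` are hypothetical, and the recursion goes all the way down to 1 by 1 blocks only to keep the example short.

```python
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Return the four quadrants of M: top-left, top-right, bottom-left, bottom-right."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(X, Y):
    """Multiply n-by-n matrices (n a power of two) with seven recursive products."""
    if len(X) == 1:
        return [[X[0][0] * Y[0][0]]]
    A, B, C, D = split(X)
    E, F, G, H = split(Y)
    P1 = strassen(A, sub(F, H))
    P2 = strassen(add(A, B), H)
    P3 = strassen(add(C, D), E)
    P4 = strassen(D, sub(G, E))
    P5 = strassen(add(A, D), add(E, H))
    P6 = strassen(sub(B, D), add(G, H))
    P7 = strassen(sub(A, C), add(E, F))
    top_left = add(sub(add(P5, P4), P2), P6)      # AE + BG
    top_right = add(P1, P2)                       # AF + BH
    bottom_left = add(P3, P4)                     # CE + DG
    bottom_right = sub(sub(add(P5, P1), P3), P7)  # CF + DH
    top = [l + r for l, r in zip(top_left, top_right)]
    bottom = [l + r for l, r in zip(bottom_left, bottom_right)]
    return top + bottom
```

A practical implementation would switch to the standard algorithm below some block size rather than recursing to single entries.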

Asymptotically faster algorithms requiring more complex splits exist; however, their overhead generally makes them too slow to be useful in practice. Strassen’s algorithm, however, can improve on the standard matrix multiplication algorithm for reasonably sized matrices, as we will see in our second programming assignment.


CS124 Lecture 9 Spring 2000

9.1 The String reconstruction problem

The greedy approach doesn’t always work, as we have seen. It lacks flexibility; if at some point, it makes a wrong

choice, it becomes stuck.

For example, consider the problem of string reconstruction. Suppose that all the blank spaces and punctuation marks have inadvertently been removed from a text file. You would like to reconstruct the file, using a dictionary. (We will assume that all words in the file are standard English.)

For example, the string might begin “thesearethereasons”. A greedy algorithm would spot that the first two

words were “the” and “sea”, but then it would run into trouble. We could backtrack; we have found that sea is

a mistake, so looking more closely, we might find the first three words “the”, “sear”, and “ether”. Again there is

trouble. In general, we might end up spending exponential time traveling down false trails. (In practice, since

English text strings are so well behaved, we might be able to make this work– but probably not in other contexts,

such as reconstructing DNA sequences!)



This problem has a nice structure, however, that we can take advantage of. The problem can be broken down

into entirely similar subproblems. For example, we can ask whether the strings “theseare” and “thereasons” both

can be reconstructed with a dictionary. If they can, then we can glue the reconstructions together. Notice, however,

that this is not a good problem for divide and conquer. The reason is that we do not know where the right dividing

point is. In the worst case, we could have to try every possible break! The recurrence would be

$$T(n) = \sum_{i=1}^{n-1} \bigl[ T(i) + T(n-i) \bigr].$$

You can check that the solution to this recurrence grows exponentially.

Although divide and conquer directly fails, we still want to make use of the subproblems. The attack we now

develop is called dynamic programming. The way to understand dynamic programming is to see that divide and

conquer fails because we might recalculate the same thing over and over again. (Much like we saw very early on

with the Fibonacci numbers!) If we try divide and conquer, we will repeatedly solve the same subproblems (the case

of small substrings) over and over again. The key will be to avoid the recalculations. To avoid recalculations, we use

a lookup table.


In order for this approach to be effective, we have to think of subproblems as being ordered by size. We solve

the subproblems bottom-up, from the smallest to the largest, until we reach the original problem.

For this dictionary problem, think of the string as being an array s[1 . . . n]. Then there is a natural subproblem for each substring s[i . . . j]. Consider a two dimensional array D(i, j) that will denote whether s[i . . . j] is the concatenation of words from the dictionary. The size of a subproblem is naturally d = j − i.

So now we write a simple loop which solves the subproblems in order of increasing size:

    for d := 0 to n−1 do
        for i := 1 to n−d do
            j := i+d;
            if indict(s[i . . . j]) then D(i, j) := true
            else for k := i to j−1 do
                if D(i, k) and D(k+1, j) then D(i, j) := true;

Here s[i . . . j] is split into the two pieces s[i . . . k] and s[k+1 . . . j], and starting at d = 0 handles one-letter words. This algorithm runs in time $O(n^3)$, assuming each dictionary lookup takes constant time; the three loops each run over at most n values. Pictorially, we can think of the algorithm as filling in the upper diagonal triangle of a two-dimensional array, starting along the main diagonal and moving up, diagonal by diagonal.

We need to add a bit to actually find the words. Let F(i, j) be the position of the end of the first word in s[i . . . j] when this string is a proper concatenation of dictionary words. Initially all F(i, j) should be set to nil. The value for F(i, j) can be set whenever D(i, j) is set to true. Given the F(i, j), we can reconstruct the words simply by finding the words that make up the string in order. Note also that we can use this to improve the running time; as soon as we find a match for the entire string, we can exit the loop and return success! Further optimizations are possible.
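To make the table-filling and the reconstruction step concrete, here is a minimal Python sketch. It departs from the notation above in one respect, which is an assumption of this example: it uses Python's 0-based, half-open slices, so `D[i][j]` refers to `s[i:j]` and `F[i][j]` is the index just past the first word. The function name `reconstruct` and the tiny dictionary are also made up for illustration.

```python
def reconstruct(s, dictionary):
    """Return a list of dictionary words whose concatenation is s, or None."""
    n = len(s)
    # D[i][j]: can s[i:j] be split into dictionary words?
    # F[i][j]: index just past the first word of one such split.
    D = [[False] * (n + 1) for _ in range(n + 1)]
    F = [[None] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):           # subproblems by increasing size
        for i in range(n - length + 1):
            j = i + length
            if s[i:j] in dictionary:
                D[i][j], F[i][j] = True, j
                continue
            for k in range(i + 1, j):        # try every break point
                if D[i][k] and D[k][j]:
                    D[i][j], F[i][j] = True, F[i][k]
                    break
    if not D[0][n]:
        return None
    # Follow F to read off the words in order.
    words, i = [], 0
    while i < n:
        k = F[i][n]
        words.append(s[i:k])
        i = k
    return words


words = reconstruct("thesearethereasons", {"the", "these", "sea", "are", "reasons"})
```

With this small dictionary the greedy choice of “the” followed by “sea” leads nowhere, while the table-based approach recovers the intended split.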

Let us highlight the aspects of the dynamic programming approach we used. First, we used a recursive description based on subproblems: D(i, j) is true if s[i . . . j] is in the dictionary, or if D(i, k) and D(k+1, j) are both true for some k. Second, we built up a table containing the answers of the problems, in some natural bottom-up order. Third, we used this table to find a way to determine the actual solution. Dynamic programming generally involves these three steps.


9.2 Edit distance

A problem that arises in biology is to measure the distance between two strings (of DNA). We will examine the

problem in English; the ideas are the same. There are many possible meanings for the distance between two strings;

here we focus on one natural measure, the edit distance. The edit distance measures the number of editing operations

it would be necessary to perform to transform the first string into the second. The possible operations are as follows:

• Insert: Insert a character into the first string.

• Delete: Delete a character from the first string.

• Replace: Replace a character from the first string with another character.

Another possibility is to not edit a character, when there is a Match. For example, a transformation from activate

to caveat can be represented by


a c t i v a t e

c a v e a t

The top line represents the operation performed. So the a in activate is deleted, and the t is replaced. The e in caveat is explicitly inserted.

The edit distance is the minimal number of edit operations (that is, the number of Inserts, Deletes, or Replaces) necessary to transform one string to the other. Note that Matches do not count. Also, it is possible to have a weighted edit distance, if the different edit operations have different costs. We currently assume all operations have weight 1.


We will show how to compute the edit distance using dynamic programming. Our first step is to define appropriate subproblems. Let us represent our strings by A[1 . . . n] and B[1 . . . m]. Suppose we want to consider what we do with the last character of A. To determine that, we need to know how we might have transformed the first n−1 characters of A. These n−1 characters might have transformed into any number of symbols of B, up to m. Similarly, to compute how we might have transformed the first n−1 characters of A into some part of B, it makes sense to consider how we transformed the first n−2 characters, and so on.

This suggests the following subproblems: we will let D(i, j) represent the edit distance between A[1 . . . i] and B[1 . . . j]. We now need a recursive description of the subproblems in order to use dynamic programming. Here the recurrence is:

$$D(i, j) = \min\bigl[\, D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + I(A[i] \ne B[j]) \,\bigr].$$

In the above, $I(A[i] \ne B[j])$ represents the value 1 if the ith character of A differs from the jth character of B, and 0 if they are the same. We obtain the above expression by considering the possible edit operations available. Suppose our last operation is a Delete, so that we deleted the ith character of A to transform A[1 . . . i] to B[1 . . . j]. Then we must have transformed A[1 . . . i−1] to B[1 . . . j], and hence the edit distance would be D(i−1, j)+1, or the cost of the transformation from A[1 . . . i−1] to B[1 . . . j] plus one for the cost of the final Delete. Similarly, if the last operation is an Insert, the cost would be D(i, j−1)+1.

The other possibility is that the last operation is a Replace of the ith character of A with the jth character of B, or a Match between these two characters. If there is a Match, then the two characters must be the same, and the cost is D(i−1, j−1). If there is a Replace, then the two characters should be different, and the cost is D(i−1, j−1)+1. We combine these two cases in our formula, using $D(i-1, j-1) + I(A[i] \ne B[j])$.

Our recurrence takes the minimum of all these possibilities, expressing the fact that we want the best possible

choice for the final operation!


It is worth noticing that our recursive description does not work when i or j is 0. However, these cases are

trivial. We have

D(i,0) = i,

since the only way to transform the first i characters of A into nothing is to delete them all. Similarly,

D(0, j) = j.

Again, it is helpful to think of the computation of the D(i, j) as filling up a two-dimensional array. Here, we

begin with the first column and first row filled. We can then fill up the rest of the array in various ways: row by row,

column by column, or diagonal by diagonal!

Besides computing the distance, we may want to compute the actual transformation. To do this, when we fill

the array, we may also picture filling the array with pointers. For example, if the minimal distance for D(i, j) was

obtained by a final Delete operation, then the cell (i, j) in the table should have a pointer to (i− 1, j). Note that a

cell can have multiple pointers, if the minimum distance could have been achieved in multiple ways. Now any path

back from (n,m) to (0,0) corresponds to a sequence of operations that yields the minimum distance D(n,m), so the

transformation can be found by following pointers.

The total computation time and space required for this algorithm is O(nm).
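The recurrence and its boundary cases can be written out as a short Python sketch. The function name `edit_distance` is an assumption of this example, and the code fills the table row by row using 0-based string indexing, so `a[i - 1]` plays the role of the ith character of A. It computes only the distance; the pointers described above are omitted to keep the sketch short.

```python
def edit_distance(a, b):
    """Return the unit-cost edit distance between strings a and b."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all i characters
    for j in range(m + 1):
        D[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,             # Delete
                          D[i][j - 1] + 1,             # Insert
                          D[i - 1][j - 1] + mismatch)  # Replace or Match
    return D[n][m]
```

Since each row depends only on the previous one, the space could be reduced to O(m) when only the distance (and not the transformation) is needed.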


9.3 All pairs shortest paths

Let G be a graph with positive edge weights. We want to calculate the shortest paths between every pair of nodes.

One way to do this is to run Dijkstra’s algorithm several times, once for each node. Here we develop a different

dynamic programming solution.

Our subproblems will be shortest paths using only nodes 1 . . . k as intermediate nodes. Of course when k equals

the number of nodes in the graph, n, we will have solved the original problem.

We let the matrix $D_k[i, j]$ represent the length of the shortest path between $i$ and $j$ using intermediate nodes $1 \ldots k$. Initially, we set a matrix $D_0$ with the direct distances between nodes, given by $d_{ij}$. Then $D_k$ is easily computed from the subproblems $D_{k-1}$ as follows:

$$D_k[i, j] = \min\bigl(D_{k-1}[i, j],\; D_{k-1}[i, k] + D_{k-1}[k, j]\bigr).$$

The idea is that the shortest path using intermediate nodes $1 \ldots k$ either completely avoids node $k$, in which case it has the same length as $D_{k-1}[i, j]$; or it goes through $k$, in which case we can glue together the shortest paths found from $i$ to $k$ and $k$ to $j$ using only intermediate nodes $1 \ldots k-1$ to find it.

It might seem that we need at least two matrices to code this, but in fact it can all be done with a single matrix, updated in place inside one loop nest. (Exercise: think about it!)

    D = (d_{ij}), distance array, with weights from all i to all j
    for k = 1 to n do
        for i = 1 to n do
            for j = 1 to n do
                D[i, j] = min(D[i, j], D[i, k] + D[k, j])

Note that again we can keep an auxiliary array to recall the actual paths. We simply keep track of the last intermediate node found on the path from i to j. We reconstruct the path by successively reconstructing intermediate nodes, until we reach the ends.
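A minimal Python sketch of the in-place loop, together with the auxiliary array for path recovery, might look as follows. The names `all_pairs_shortest_paths`, `mid`, and `path` are assumptions for this example; nodes are numbered from 0, missing edges are represented by infinity, and the diagonal of the input is assumed to be zero.

```python
INF = float("inf")


def all_pairs_shortest_paths(d):
    """Return (D, mid): shortest distances and the last intermediate node used."""
    n = len(d)
    D = [row[:] for row in d]               # a single matrix, updated in place
    mid = [[None] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
                    mid[i][j] = k           # remember that the path goes through k
    return D, mid


def path(mid, i, j):
    """Rebuild the node sequence of a shortest path from i to j (assumed to exist)."""
    k = mid[i][j]
    if k is None:
        return [i, j]                       # direct edge
    return path(mid, i, k)[:-1] + path(mid, k, j)


d = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
D, mid = all_pairs_shortest_paths(d)
```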


9.4 Traveling salesman problem

Suppose that you are given $n$ cities and the distances $d_{ij}$ between them. The traveling salesman problem (TSP) is to find the shortest tour that takes you from your home city to all the other cities and back again. As there are $(n-1)!$ possible tours, this can clearly be done in $O(n!)$ time by trying all of them. Of course this is not very efficient.

Since the TSP is NP-complete, we cannot really hope to find a polynomial time algorithm. But dynamic

programming gives us a much better algorithm than trying all the paths.

The key is to define the appropriate subproblem. Suppose that we label our home city by the symbol 1, and other cities are labeled 2, . . . , n. In this case, we use the following: for a subset S of vertices including 1 and at least one other city, let C(S, j) be the length of the shortest path that starts at 1, visits all other nodes in S, and ends at j. Note that our subproblems here look slightly different: instead of finding tours, we are simply finding paths. The important point is that the shortest path from 1 to j through all the vertices in S consists of some shortest path from 1 to a vertex x through the vertices of S − {j}, where x ∈ S − {j}, and the additional edge from x to j.

    for all j ≠ 1 do C({1, j}, j) := d_{1j}
    for s = 3 to n do    % s is the size of the subset
        for all subsets S of {1, . . . , n} of size s containing 1 do
            for all j ∈ S, j ≠ 1 do
                C(S, j) := min_{i ∈ S, i ≠ 1, i ≠ j} [C(S − {j}, i) + d_{ij}]
    opt := min_{j ≠ 1} [C({1, . . . , n}, j) + d_{j1}]

The idea is to build up paths one node at a time, not worrying (at least temporarily) where they will end up. Once we have paths that go through all the vertices, it is easy to check the tours, since they consist of a shortest path through all the vertices plus an additional edge. The algorithm takes time $O(n^2 2^n)$, as there are $O(n 2^n)$ entries in the table (one for each pair of set and city), and each takes $O(n)$ time to fill. Of course we can add in structures so that we can actually find the tour as well. Exercise: Consider how memory-efficient you can make this algorithm.
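The table C(S, j) can be sketched in Python with a dictionary keyed by (subset, endpoint) pairs. This is an illustrative version only: the function name `tsp` is an assumption, city 0 plays the role of the home city 1 of the text, at least two cities are assumed, and no attempt is made at the memory savings the exercise asks about.

```python
from itertools import combinations


def tsp(d):
    """Return the length of a shortest tour for the distance matrix d (n >= 2)."""
    n = len(d)
    C = {}
    # Base case: paths from the home city 0 directly to j.
    for j in range(1, n):
        C[(frozenset([0, j]), j)] = d[0][j]
    for size in range(3, n + 1):                       # size of the subset S
        for rest in combinations(range(1, n), size - 1):
            S = frozenset(rest) | {0}
            for j in rest:
                # The path ends with an edge i -> j, where i is another city of S.
                C[(S, j)] = min(C[(S - {j}, i)] + d[i][j] for i in rest if i != j)
    full = frozenset(range(n))
    # Close the tour by returning to the home city.
    return min(C[(full, j)] + d[j][0] for j in range(1, n))


d = [[0, 10, 15, 20],
     [10, 0, 35, 25],
     [15, 35, 0, 30],
     [20, 25, 30, 0]]
opt = tsp(d)
```

In this small instance the best tour visits the cities in the order 0, 1, 3, 2 and back to 0.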


CS124 Lecture 10 Spring 1999

10.1 The Birthday Paradox

How many people do there need to be in a room before with probability greater than 1/2 some two of them have the same birthday? (Assume birthdays are distributed uniformly at random.)

Surprisingly, only 23. This is easily determined as follows: the probability the first two people have different birthdays is (1 − 1/365). The probability that the third person in the room then has a birthday different from the first two, given the first two people have different birthdays, is (1 − 2/365), and so on. So the probability that all of the first k people have different birthdays is the product of these terms, or

$$\left(1 - \frac{1}{365}\right) \cdot \left(1 - \frac{2}{365}\right) \cdot \left(1 - \frac{3}{365}\right) \cdots \left(1 - \frac{k-1}{365}\right).$$

Determining the right value of k is now a simple exercise.
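One way to carry out this exercise is to multiply the terms numerically until the probability of a shared birthday exceeds 1/2. The short Python sketch below does exactly that; the function name `smallest_group` and its parameters are chosen for this illustration.

```python
def smallest_group(days=365, threshold=0.5):
    """Smallest number of people for which P(some shared birthday) > threshold."""
    p_distinct = 1.0  # probability that the first k people all differ
    k = 0
    while 1.0 - p_distinct <= threshold:
        # The next person must avoid the k birthdays already taken.
        p_distinct *= 1.0 - k / days
        k += 1
    return k
```

Calling `smallest_group()` with the default 365 days reproduces the figure of 23 quoted above.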



10.2 Balls into Bins

Mathematically, the birthday paradox is an example of a more general mathematical question, often formulated in terms of balls and bins. Some number of balls n are thrown into some number of bins m. What does the distribution of balls and bins look like?

The birthday paradox is focused on the first time a ball lands in a bin with another ball. One might also ask how many of the bins are empty, how many balls are in the most full bin, and other sorts of questions.

Let us consider the question of how many bins are empty. Look at the first bin. For it to be empty, it has to be missed by all n balls. Since each ball hits the first bin with probability 1/m, the probability the first bin remains empty is

$$\left(1 - \frac{1}{m}\right)^{n} \approx e^{-n/m}.$$

Since the same argument holds for all bins, on average a fraction $e^{-n/m}$ of the bins will remain empty.

Exercise: How many bins have 1 ball? 2?
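The approximation for the fraction of empty bins is easy to check by simulation. The Python sketch below throws n balls into m bins and compares the observed fraction of empty bins with $e^{-n/m}$; the function name `empty_fraction`, the seed, and the sizes are arbitrary choices for this illustration.

```python
import math
import random


def empty_fraction(n, m, rng):
    """Throw n balls into m bins uniformly at random; return the fraction left empty."""
    hit = [False] * m
    for _ in range(n):
        hit[rng.randrange(m)] = True
    return hit.count(False) / m


rng = random.Random(0)  # fixed seed so the run is repeatable
observed = empty_fraction(10_000, 10_000, rng)
exact = (1 - 1 / 10_000) ** 10_000  # probability that a given bin is missed by every ball
approx = math.exp(-1)               # the estimate e^{-n/m} with n = m
```

With n = m, all three quantities should be close to 1/e, about 0.37.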


10.3 Hash functions

A hash function is a deterministic mapping from one set into another that appears random. For example, mapping people into their birthdays can be thought of as a hash function.

In general, a hash function is a mapping f : {0, . . . , n−1} → {0, . . . , m−1}. Generally n >> m; for example, the number of people in the world is much bigger than the number of possible birthdays. There is a great deal of theory behind designing hash functions that “appear random.” We will not go into that theory here, and instead assume that the hash functions we have available are in fact completely random. In other words, we assume that for each i (0 ≤ i ≤ n−1), the probability that f(i) = j is 1/m (for 0 ≤ j ≤ m−1). Notice that this does not mean that every time we look at f(i), we get a different random answer! The value of f(i) is fixed for all time; it is just equally likely to take on any value in the range.

While such completely random hash functions are unavailable in practice, they generally provide a good rough idea of how hashing schemes perform.

(An aside: in reality, birthdays are not completely random either. Seasonal distributions skew the calculation. How might this affect the birthday paradox?)


10.4 Applications: A Password-checker

We now consider a hashing application. Suppose you are administering a computer system, and you would like to make sure that nobody uses a common password. This protects against hackers, who can often determine if someone is using a common password (such as a name, or a common dictionary word) by gaining access to the encrypted password file and using an exhaustive search. When the user attempts to change their password, you would like to check their password against a dictionary of common passwords as quickly as possible.

One way to do this would be to use a standard search technique, such as binary search, on the dictionary. This approach has two negative features. First, one must store the entire dictionary, which takes memory. Second, on a large dictionary, this approach might be slow. Instead we present a quick and space-efficient scheme based on hashing. The data structure we consider is commonly called a Bloom filter, after the originator.

Choose a table size m. Create a table consisting of m bits, initially all set to 0. Use a hash function on each of the n words in the dictionary, where the range of the hash function is [0, m). If the word hashes to value k, set the kth bit of the table to 1.

When a user attempts to change the password, hash the user’s desired password and check the appropriate entry in the table. If there is a 1 there, reject the password; it could be a common one. Otherwise, accept it. A common password from the dictionary is always rejected. Assuming other strings are hashed to a random location, the probability of rejecting a password that should be accepted is $1 - e^{-n/m}$.

It would seem one would need to choose m to be fairly large in order to make the probability of rejecting a potentially good password small. Space can be used more efficiently by making multiple tables, using a different hash function to set the bits for each table. To check a proposed password now requires more time, since several hash functions must be checked. However, as soon as a single 0 entry is found, the password can be accepted. The probability of rejecting a password that should be accepted when using h tables, each of size m, is then

$$\left(1 - e^{-n/m}\right)^{h}.$$

The total space used is merely hm bits. Notice that the Bloom filter sometimes returns the wrong answer: we may reject a proposed password, even though it is not a common password. This sort of error is probably acceptable, as long as it doesn’t happen so frequently as to bother users. Fortunately this error is one-sided; a common password is never accepted. One must set the parameters m and h appropriately to trade off this error probability against space and time requirements.

For example, consider a dictionary of 100,000 common passwords, each of which is on average 7 characters long. Uncompressed this would be 700,000 bytes. Compression might reduce it substantially, to around 300,000 bytes. Of course, then one has the problems of searching efficiently on a compressed list.

Instead, one could keep a 100,000 byte Bloom filter, consisting of 5 tables of 160,000 bits. The probability of rejecting a reasonable password is just over 2%. The cost for checking a password is at most 5 hashes and 5 lookups into the table.
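The multi-table scheme can be sketched in Python as follows. Several choices here are assumptions made only for illustration and are not part of the notes: the class name `PasswordFilter`, the use of SHA-256 salted with the table index as a stand-in for h independent random hash functions, and storing one byte per table entry instead of one bit for readability.

```python
import hashlib
import math


class PasswordFilter:
    """h tables of m entries each; a word is rejected only if all h entries are set."""

    def __init__(self, words, m, h):
        self.m, self.h = m, h
        # One byte per entry for clarity; a real filter would pack these into bits.
        self.tables = [bytearray(m) for _ in range(h)]
        for word in words:
            for t, k in enumerate(self._positions(word)):
                self.tables[t][k] = 1

    def _positions(self, word):
        # Assumption: salting SHA-256 with the table number imitates h hash functions.
        for t in range(self.h):
            digest = hashlib.sha256(f"{t}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def rejects(self, word):
        # all() stops at the first 0 entry, at which point the password is accepted.
        return all(self.tables[t][k] for t, k in enumerate(self._positions(word)))


def false_reject_probability(n, m, h):
    """The estimate (1 - e^{-n/m})^h from the text."""
    return (1 - math.exp(-n / m)) ** h


common = ["password", "123456", "qwerty"]
checker = PasswordFilter(common, m=64, h=3)
rate = false_reject_probability(100_000, 160_000, 5)
```

Evaluating `false_reject_probability(100_000, 160_000, 5)` gives a value just over 2%, in line with the example above.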


CS 124 Lecture 11

11.1 Applications: Fingerprinting for pattern matching

Suppose we are trying to find a pattern string P in a long document D. How can we do it quickly and efficiently?

Hash the pattern P into say a 16 bit value. Now, run through the file, hashing each set of |P| consecutive characters into a 16 bit value. If we ever get a match for a pattern, we can check to see if it corresponds to an actual pattern match. (In this case, we want to double-check and not report any false matches!) Otherwise we can just move on. We can use more than 16 bits, too; we would like to use enough bits so that we will obtain few false matches.

This scheme is efficient, as long as hashing is efficient. Of course hashing can be a very expensive operation, so in order for this approach to work, we need to be able to hash quickly on average. In fact, a simple hashing technique allows us to do so in constant time per operation!

The easiest way to picture the process is to think of the file as a sequence of digits, and the pattern as a number. Then we move a pointer in the file one character at a time, seeing if the next |P| digits give us a number equal to the number corresponding to the pattern. Each time we read a character in the file, the number we are looking at changes in a natural way: the leftmost digit a is removed, and a new rightmost digit b is inserted. Hence, we update an old number N and obtain a new number N′ by computing

$$N' = 10 \cdot (N - 10^{|P|-1} \cdot a) + b.$$

When dealing with a string, we will be reading characters (bytes) instead of numbers. Also, we will not want to keep the whole pattern as a number. If the pattern is large, then the corresponding number may be too large to do effective comparisons! Instead, we hash all numbers down into say 16 bits, by reducing them modulo some appropriate prime p. We then do all the mathematics (multiplication, addition) modulo p, i.e.

$$N' = [10 \cdot (N - 10^{|P|-1} \cdot a) + b] \bmod p.$$

All operations mod p can be made quite efficient, so each new hash value takes only constant time to compute!

This pattern matching technique is often called fingerprinting. The idea is that the hash of the pattern creates an almost unique identifier for the pattern, like a fingerprint. If we ever find two fingerprints that match, we have a good reason to expect that they must come from the same pattern. Of course, unlike real fingerprints, our hashing-based fingerprints do not actually uniquely identify a pattern, so we still need to check for false matches. But since false matches should be rare, the algorithm is very efficient!

See Figure 11.1 for an example of fingerprinting.



P = 17935, p = 251

D = 6386179357342

P mod p = 114

63861 mod p = 107
38617 mod p = 214
86179 mod p = 86
61793 mod p = 47
17935 mod p = 114
79357 mod p = 41
93573 mod p = 201
35734 mod p = 92
57342 mod p = 114

Figure 11.1: A fingerprinting example. The pattern P is a 5 digit number. Note successive calculations take constant time: 38617 mod p = (((63861 mod p) − (60000 mod p)) · 10 + 7) mod p. Also note that false matches are possible (but unlikely); 57342 = 17935 mod p.
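The example in Figure 11.1 can be reproduced with a short Python sketch of the rolling update. As in the figure, the pattern and document are strings of decimal digits; the function name `find_matches` and its return convention (true match positions plus a count of false matches) are assumptions of this illustration.

```python
def find_matches(pattern, document, p=251):
    """Return (positions of true matches, number of false fingerprint matches)."""
    k = len(pattern)
    if k == 0 or k > len(document):
        return [], 0
    target = int(pattern) % p          # fingerprint of the pattern
    window = int(document[:k]) % p     # fingerprint of the first |P| digits
    top = pow(10, k - 1, p)            # 10^(|P|-1) mod p, used to drop the left digit
    matches, false_matches = [], 0
    for start in range(len(document) - k + 1):
        if window == target:
            # Fingerprints agree, so double-check the actual digits.
            if document[start:start + k] == pattern:
                matches.append(start)
            else:
                false_matches += 1
        if start + k < len(document):
            a = int(document[start])       # digit leaving on the left
            b = int(document[start + k])   # digit entering on the right
            window = (10 * (window - top * a) + b) % p
    return matches, false_matches


result = find_matches("17935", "6386179357342")
```

On the digits of the figure this finds the true match starting at the fifth digit and one false match at the final window, 57342.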

One question remains. How should we choose the prime p? We would like the prime we choose to work well, in that it should have few false matches. The problem is that for every prime, there are certainly some bad patterns and documents. If we choose a prime in advance, then someone can try to set up a document and pattern that will cause a lot of false matches, making our fingerprinting algorithm go very slowly.

A natural approach is to choose the prime p randomly. This way, nobody can set up a bad pattern and document in advance, since they are not sure what prime we will choose.

Let us make this a bit more rigorous. Let $\pi(x)$ represent the number of primes that are less than or equal to $x$. It will be helpful to use the following fact:

Fact: $\dfrac{x}{\ln x} \le \pi(x) \le 1.26\,\dfrac{x}{\ln x}$.

Consider any point in the algorithm, where the pattern and document do not match. If our pattern has length |P|, then at that point we are comparing two numbers that are each less than $10^{|P|}$. In particular, their difference (in absolute value) is less than $10^{|P|}$. What is the probability that a random prime divides this difference? That is, what is the probability that for the random prime we choose, the two numbers corresponding to the pattern and the current |P| digits in the document are equal modulo p?

First, note that there are at most $\log_2 10^{|P|}$ distinct primes that divide the difference, since the difference is at most $10^{|P|}$ (in absolute value), and each distinct prime divisor is at least 2. Hence, if we choose our prime randomly from all primes up to $Z$, the probability we have a false match is at most

$$\frac{\log_2 10^{|P|}}{\pi(Z)}.$$


Now the probability that we have a false match anywhere is at most |D| times the probability that we have a false match in any single location, by the union bound. Hence the probability that we have a false match anywhere is at most

$$\frac{|D| \log_2 10^{|P|}}{\pi(Z)}.$$

Exercise: How big should we make Z in order to make the probability of a false match anywhere in the algorithm less than 1/100?


How could we improve the probability of a false match? One way is to choose from a larger set of primes. Another way is to choose not just one random prime, but several random primes from the primes up to Z. This is like choosing several hash functions in the Bloom filter problem. There is a false match only if there is a false match at every random prime we choose. If we choose k primes (with replacement) from the primes up to Z, the probability of a false match at a specific point is at most

$$\left(\frac{\log_2 10^{|P|}}{\pi(Z)}\right)^{k}.$$


CS124 Lecture 12

12.1 Near duplicate documents1

Suppose we are designing a major search engine. We would like to avoid answering user queries with multiple copies of the same page. That is, there may be several pages with exactly the same text. These duplicates occur for a variety of reasons. Some are mirror sites, some are copies of common pages (such as Unix man pages), some are multiple spam advertisements, etc. Returning just one of the duplicates should be sufficient for the end user; returning all of them will clutter the response page, wasting valuable real estate and frustrating the user. How can we cope with duplicate pages?

Determining exact duplicates has a simple solution, based on hashing. Use the text of each page and an appropriate hash function to hash the text into a 64 bit signature. If two documents have the same signature, it is reasonable to assume that they share the same text. (Why? How often is this assumption wrong? Is it a terrible thing if the assumption turns out to be false?) By comparing signatures on the fly, we can avoid returning duplicates.
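As a rough illustration of the signature idea, here is a minimal Python sketch. The notes do not name a hash function; BLAKE2b truncated to 8 bytes is an assumption made for this example, and the helper names `signature64` and `drop_exact_duplicates` are hypothetical.

```python
import hashlib


def signature64(text: str) -> int:
    """Hash a page's text into a 64 bit signature (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def drop_exact_duplicates(pages):
    """Return the pages in order, keeping only the first page for each signature."""
    seen = set()
    unique = []
    for page in pages:
        sig = signature64(page)
        if sig not in seen:
            seen.add(sig)
            unique.append(page)
    return unique
```

Two different pages are dropped as duplicates only if their signatures collide, which is the rare event the parenthetical questions above ask about.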

This solution works extremely well if we want to catch exact duplicates. What if, however, we want to capture the idea of “near duplicate” documents, or similar documents? For example, consider two mirror sites on the Web. It may be that the documents share the same text, except that the text corresponding to the links on the page is different, with each referring to the correct mirror site. In this case, the two pages will not yield the same signature, although again, we would not want to return both pages to the end user, because they are so similar. As another example, consider two copies of a newspaper article, one with a proper copyright notice added, and one without. We do not need to return both pages to the user. Again, hashing the document appears to be of no help. Finally, consider the case of advertisers who submit slightly modified versions of their ads over and over again, trying to get more or better spots on the response pages sent back to users. We want to stop their nefarious plans!

We will describe a hashing-based scheme used to detect similar documents efficiently. Like the Bloom filter solution for password dictionaries, our solution is highly efficient in terms of space and time. The cost of this efficiency is accuracy; our algorithm will sometimes make mistakes, because it uses randomness.

12.2 Set resemblance

We describe a more general problem that will relate to our document similarity problem.

Consider two sets of numbers, A and B. For concreteness, we will assume that A and B are subsets of 64 bit numbers. We may define the resemblance of A and B as

$$\text{resemblance}(A,B) = R(A,B) = \frac{|A \cap B|}{|A \cup B|}.$$

The resemblance is a real number between 0 and 1. Intuitively, the resemblance accurately captures how close the two sets are. Sets and documents will be related, as we will see later.
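The definition translates directly into code. The following Python sketch computes the resemblance exactly; the function name and the convention for two empty sets are assumptions made for this example, not part of the notes.

```python
def resemblance(a: set, b: set) -> float:
    """Exact resemblance R(A,B) = |A intersect B| / |A union B|."""
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are treated as identical
    return len(a & b) / len(union)
```

For example, `resemblance({1, 2, 3}, {2, 3, 4})` is 2/4 = 0.5, since the sets share two of the four distinct elements.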

¹ This lecture is based on the work of Andrei Broder, who developed these ideas, and convinced Altavista to use them! (The second feat may have been even more difficult than the first.)



How quickly can we determine the resemblance of two sets? If the sets are each of size n, the natural approach (compare each element in A to each element in B) is $O(n^2)$. We can do better by sorting the sets. Still, these approaches are all rather slow, when we consider that we will have many sets to deal with and hence many pairs of sets to consider.

Instead we should consider relaxing the problem. Suppose that we do not need an exact calculation of the resemblance R(A,B). A reasonable estimate or approximation of the resemblance will suffice. Also, since we will be answering a variety of queries over a long period of time, it makes sense to consider algorithms that first do a preprocessing phase, in order to handle the queries much more quickly. That is, we will first do some work, preparing the appropriate data structures and data in a preprocessing phase. The advantage of doing all this work in advance will be that queries regarding resemblance can then be quickly answered.

Our estimation process will require a black box that does the following: it produces an effectively random permutation on the set of 64 bit numbers. What do we mean by a random permutation? Let us consider just the case of four bit numbers, of which there are 16. Suppose we write each number on a card. Generating a random permutation is like shuffling this deck of 16 cards and looking at the order in which the numbers appear after the shuffling. For example, if we find the number 0011 on the first card, then our permutation maps the number 3 to the number 1. We write this as π(3) = 1, where π is a function that represents the permutation.

Suppose we have an efficient implementation of random permutations, which we think of as a black box procedure. That is, when we invoke the black box procedure BB(1,x) on a 64 bit number x, we get out $y = \pi_1(x)$ for some fixed, completely random permutation $\pi_1$. Similarly, if we invoke the black box BB(2,x), we get out $\pi_2(x)$ for some different random permutation $\pi_2$. (In fact in practice we cannot achieve this black box, but we can get close enough that it is useful to think in these terms for analysis.)

Let us use the notation $\pi_1(A)$ to denote the set of elements obtained by computing BB(1,x) for every x in A. Consider the following procedure: we compute the sets $\pi_1(A)$ and $\pi_1(B)$, and record the minimum of each set. When does $\min \pi_1(A) = \min \pi_1(B)$? This happens only when there is some element x satisfying $\pi_1(x) = \min \pi_1(A) = \min \pi_1(B)$. In other words, the element x that maps to the minimum element of $\pi_1(A \cup B)$ has to lie in the intersection $A \cap B$.

If $\pi_1$ is a random permutation, then every element in $A \cup B$ has equal probability of mapping to the minimum element after the permutation is applied. That is, for all x and y in $A \cup B$,

$$\Pr[\pi_1(x) = \min \pi_1(A \cup B)] = \Pr[\pi_1(y) = \min \pi_1(A \cup B)].$$

Thus, for the minimum of $\pi_1(A)$ and $\pi_1(B)$ to be the same, the minimum element must lie in $\pi_1(A \cap B)$ (see Figure 12.1). Hence

$$\Pr[\min \pi_1(A) = \min \pi_1(B)] = \frac{|A \cap B|}{|A \cup B|}.$$

But this is just the resemblance R(A,B)!

This gives us a way to estimate the resemblance. Instead of taking just one permutation, we take many, say 100. For each set A, we preprocess by computing $\min \pi_j(A)$ for j = 1 to 100, and store these values. To estimate the resemblance of two sets A and B, we count how often the minima are the same, and divide by 100. It is like each permutation gives us a coin flip, where the probability of a heads (a match) is exactly the resemblance R(A,B) of the two sets.
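The estimation procedure can be sketched in Python as follows. True random permutations are not available, so this sketch substitutes a keyed 64 bit hash for the black box BB(j,x); that substitution, the choice of BLAKE2b, and the names `bb`, `sketch`, and `estimate_resemblance` are all assumptions made for illustration.

```python
import hashlib


def bb(j: int, x: int) -> int:
    """Stand-in for BB(j, x): a keyed 64 bit hash, not a true permutation."""
    data = x.to_bytes(8, "big")
    key = j.to_bytes(8, "big")
    digest = hashlib.blake2b(data, digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big")


def sketch(s, num_perms=100):
    """Preprocessing: store the minimum of pi_j(S) for j = 1..num_perms."""
    return [min(bb(j, x) for x in s) for j in range(1, num_perms + 1)]


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of positions in which the stored minima agree."""
    matches = sum(1 for u, v in zip(sketch_a, sketch_b) if u == v)
    return matches / len(sketch_a)
```

With A = {0, ..., 999} and B = {500, ..., 1499} the true resemblance is 1/3, and the estimate from 100 positions should typically land within a few hundredths of that value.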




Figure 12.1: If the minimum elements of $\pi_1(A)$ and $\pi_1(B)$ are the same, the minimum element must lie in $\pi_1(A \cap B)$.

Document: Four score and seven years ago, our founding

Shingles: “Four score and seven”, “score and seven years”, “and seven years ago”, “seven years ago our”, “years ago our founding”

Figure 12.2: Shingling: the document is broken up into all segments of k consecutive words; each segment leads to a 64 bit hash value.

12.3 Turning Document Similarity into a Set Resemblance Problem

We now return to the original application. How do we turn document similarity into a set resemblance problem? The key idea is to hash pieces of the document, say every four consecutive words, into 64 bit numbers. This process has been called shingling, and each set of consecutive words is called a shingle. (See Figure 12.2.) Using hashing, the shingles give rise to the resulting numbers for the set resemblance problem, so that for each document D there is a set $S_D$. There are many possible variations and improvements. For example, one could modify the number of bits in a shingle or the method for shingling. Similarly, one could throw out all shingles that are not 0 mod 16, say, in order to reduce the number of shingles per document.
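A minimal Python sketch of shingling follows. Lowercasing the text, splitting on non-word characters, and hashing with BLAKE2b are assumptions made for this example; the notes leave the tokenization and the hash function open.

```python
import hashlib
import re


def shingles(text: str, k: int = 4):
    """All segments of k consecutive words (tokenization rules assumed here)."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def shingle_set(text: str, k: int = 4):
    """The set S_D of 64 bit numbers obtained by hashing each shingle."""
    return {
        int.from_bytes(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big"
        )
        for s in shingles(text, k)
    }
```

On the eight-word phrase "Four score and seven years ago, our founding", `shingles` returns five four-word segments, so the resulting set holds five hash values unless two of them happen to collide.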

This approach obscures some important information in the document, such as the order in which paragraphs appear. However, it seems reasonable to say that if the resulting sets have high resemblance, the documents are reasonably similar.

Once we have the shingles for the document, we associate a document sketch with each document. The sketch of a document with shingle set $S_D$ is a list of, say, 100 numbers: $(\min \pi_1(S_D), \min \pi_2(S_D), \min \pi_3(S_D), \ldots, \min \pi_{100}(S_D))$.

Now we choose a threshold. For example, we might say that two documents are similar if at least 90 out of the 100 entries in the sketch match. Now whenever a user queries the search engine, we check the sketches of the documents we wish to return. If two sketches share at least 90 entries, we only send one of them. (Alternatively, we could catch the duplicates on the crawling side: we check all the documents as we crawl the Web, and whenever two sketches share at least 90 entries, we assume the associated documents are similar, so that we only need to store one of them!)

Recall that our scheme uses random permutations. So, if we set our sketch threshold to 90 out of 100 entries, this does not guarantee that every pair of documents with high resemblance is caught. Also, some pairs of documents that do not have high resemblance may get marked as having high resemblance. How well does this scheme do?

We analyze how well the scheme does with the following argument. For each permutation $\pi_i$, the probability that two documents A and B have the same value in the ith position of the sketch is just the resemblance of the two documents R(A,B) = r. (Here the resemblance R(A,B) of course refers to the resemblance of the sets of numbers obtained by shingling A and B.) Hence, the probability p(r) that at least 90 out of the 100 entries in the sketch match is

$$p(r) = \sum_{k=90}^{100} \binom{100}{k} r^k (1-r)^{100-k}.$$

What does p(r) look like as a function of r? The graph is shown in Figure 12.3. Notice that p(r) stays very small until r approaches 0.9, and then quickly grows towards 1. This is exactly the property we want our scheme to have: if two documents are not similar, we will rarely mistake them for being similar, and if they are similar, we are likely to catch them!

For example, even if the resemblance is 0.8, we will only get 90 matches with probability less than 0.006! When the resemblance is only 0.5, the probability of having 90 entries in the sketch match falls to almost $10^{-18}$! If documents are not alike, we will rarely mistake them as being similar.

If documents are alike, we will most likely catch them. If the resemblance is 0.95, the documents will have 90 or more entries in common in the sketch with probability greater than .988; if the resemblance is 0.96, the probability jumps to over .997.
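The binomial tail p(r) is easy to evaluate numerically, which gives a way to explore other thresholds or sketch sizes. The function name and its default parameters below are assumptions chosen to mirror the 90 out of 100 example.

```python
from math import comb


def prob_flagged(r: float, threshold: int = 90, entries: int = 100) -> float:
    """Probability that at least `threshold` of `entries` positions match
    when each position matches independently with probability r."""
    return sum(
        comb(entries, k) * r ** k * (1 - r) ** (entries - k)
        for k in range(threshold, entries + 1)
    )
```

Evaluating `prob_flagged` on a grid of r values between 0 and 1 reproduces the sharp rise near r = 0.9 described above.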

We are dealing with a very large number of documents: most search engines currently index twenty-five to over one hundred million Web pages! So even though the probability of making a mistake is small, it will happen. The worst that happens, though, is that the search engine fails to index a few pages that it should, and it fails to catch a few duplicates that it should. These problems are not a big deal.













[Plot: probability of 90 or more matches (vertical axis) against resemblance r (horizontal axis, 0 to 1).]




Figure 12.3: Making the threshold for document similarity 90 out of 100 matches in the sketch leads to the following graph relating resemblance to the probability two documents are considered similar. Notice the sharp change in behavior near where the resemblance is 0.90. Essentially, the procedure behaves like a low pass filter.


CS124 Lecture 13

Hopefully the ideas we saw in our hashing problems have convinced you that randomness is a useful tool in the design and analysis of algorithms. Just to make sure, we will consider several more examples of how to use randomness to design algorithms.

13.1 Primality testing

A great deal of modern cryptography is based on the fact that factoring is apparently hard. At least nobody has published a fast way to factor yet. (It is rumored the NSA knows how to factor, and is keeping it a secret. Some of you might well have worked or will work for the NSA, at which point you will be required to keep this secret. Shame on you.) Of course, certain numbers are easy to factor: numbers with small prime factors, for example. So often, for cryptographic purposes, we may want to generate very large prime numbers and multiply them together. How can we find large prime numbers?

We are fortunate to find that prime numbers are pretty dense. That is, there are an awful lot of them. Let π(x) be the number of primes less than or equal to x. Then

$$\pi(x) \approx \frac{x}{\ln x},$$

or more exactly,

$$\lim_{x \to \infty} \frac{\pi(x)}{x / \ln x} = 1.$$

This means that on average about one out of every ln x numbers is prime, if we are looking for primes about the size of x. So if we want to find prime numbers of say 250 digits, we would have to check about $\ln 10^{250} \approx 576$ numbers on average before finding a prime. (We can search smarter, too, throwing out multiples of 2, 3, 5, etc. in order to check fewer numbers.) Hence, all we need is a good method for testing if a number is prime. With such a test, we can generate large primes easily: just keep generating random large numbers, and test them for primality until we find a suitable prime number.

How can we test if a number n is prime? The pedantic way is to try dividing n by all smaller numbers. Alternatively, we can try to divide n by all primes up to $\sqrt{n}$. Of course, both of these approaches are quite slow; when n is about $10^{250}$, the value of $\sqrt{n}$ is still huge! The point is that $10^{250}$ has only 250 (or more generally O(log n)) digits, so we’d like the running time of the algorithm to be based on the size 250, not $10^{250}$!

How can we quickly test if a number is prime? Let’s start by looking at some ways that work pretty well, but have a few problems. We will use the following result from number theory:

Theorem 13.1 If p is a prime and 1 ≤ a < p, then

$$a^{p-1} = 1 \bmod p.$$

Proof: There are two nice proofs for this fact. One uses a simple induction to prove the equivalent statement that $a^p = a \bmod p$. This is clearly true when a = 1. Now

$$(a+1)^p = \sum_{i=0}^{p} \binom{p}{i} a^i.$$

The coefficient $\binom{p}{i}$ is divisible by p, unless i = 0 or i = p. Hence

$$(a+1)^p = a^p + 1 \bmod p = a + 1 \bmod p,$$

where the last step follows by the induction hypothesis.

An alternative proof uses the following idea. Consider the numbers 1, 2, ..., p−1. Multiply them all by a, so now we have a, 2a, ..., (p−1)a. Each of these numbers is distinct mod p, and there are p−1 such numbers, so in fact the sequence a, 2a, ..., (p−1)a is the same as the sequence 1, 2, ..., p−1 when considered modulo p, except for the order. Hence

$$1 \cdot 2 \cdots (p-1) = a \cdot 2a \cdots (p-1)a \bmod p = a^{p-1} \cdot 1 \cdot 2 \cdots (p-1) \bmod p.$$

Thus we have $a^{p-1} = 1 \bmod p$.

This immediately suggests one way to check if a number n is prime. Compute $2^{n-1} \bmod n$. If it is not 1, then n is certainly not prime! Note that we can compute $2^{n-1} \bmod n$ quite efficiently, using our previously discussed methods for exponentiation, which require only O(log n) multiplications! Thus this test is efficient.

But so far this test is just one-way; if n is composite, we may have that $2^{n-1} = 1 \bmod n$, so we cannot assume that n is prime just because it passes the test. For example, $2^{340} = 1 \bmod 341$, and 341 is not prime. Such a number is called a 2-pseudoprime, and unfortunately there are infinitely many of them. (Of course, even though there are infinitely many 2-pseudoprimes, they are not as dense as the primes; that is, there are relatively very few of them. So if we generate a large number n randomly, and see if $2^{n-1} = 1 \bmod n$, we will most likely be right if we then say n is prime if it passes this test. In practice, this might be good enough! This is not a good primality test, however, if an NSA official you know gives you a number to test for primality, and you think they might be trying to fool you. The NSA might be purposely giving you a 2-pseudoprime. They can be tricky that way.)

You might think to try a different base, other than 2. For example, you might choose 3, or a random value of a. Unfortunately, there are infinitely many 3-pseudoprimes. In fact, there are infinitely many composite numbers n such that $a^{n-1} = 1 \bmod n$ for all a that do not share a factor with n. (That is, for all a such that the greatest common divisor of a and n is 1.) Such n are called Carmichael numbers; the smallest such number is 561. So a test based on this approach is destined to fail for some numbers.
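The behavior of this one-way test on 341 and 561 can be checked directly. The sketch below uses Python's built-in three-argument `pow` for modular exponentiation; the function name `passes_fermat` is a hypothetical helper for this example.

```python
from math import gcd


def passes_fermat(n: int, a: int = 2) -> bool:
    """True if a^(n-1) = 1 mod n, i.e. n is not caught by base a."""
    return pow(a, n - 1, n) == 1


# 341 = 11 * 31 fools base 2 but is caught by base 3.
fooled_by_2 = passes_fermat(341, 2)
caught_by_3 = not passes_fermat(341, 3)

# 561 = 3 * 11 * 17 passes for every base that shares no factor with it.
carmichael_561 = all(
    passes_fermat(561, a) for a in range(2, 561) if gcd(a, 561) == 1
)
```

All three flags come out true, which matches the discussion: changing the base helps with 341 but not with a Carmichael number.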

There is a way around this problem, due to Rabin. Let $n - 1 = 2^t u$, with u odd. Suppose we choose a random base a and compute $a^{n-1}$ by first computing $a^u$ and then repeatedly squaring. Along the way, we will check to see for the values $a^u, a^{2u}, \ldots$ whether they have the following property:

$$a^{2^{i-1}u} \neq \pm 1 \bmod n, \qquad a^{2^i u} = 1 \bmod n.$$

That is, suppose we find a non-trivial square root of 1 modulo n. It turns out that only composite numbers have non-trivial square roots; prime numbers don’t. In fact, if we choose a randomly, and n is composite, for at least 3/4 of the values of a, one of two things will happen: we will either find a non-trivial square root of 1 using this process, or we will find that $a^{n-1} \neq 1 \bmod n$. In either case, we know that n is composite!

A value of a for which either $a^{n-1} \neq 1 \bmod n$ or the computation of $a^{n-1}$ yields a non-trivial square root is called a witness to the compositeness of n. We have said that 3/4 of the possible values of a are witnesses (we will not prove this here!). So if we pick a single value of a randomly, and n is composite, we will determine that n is composite with probability at least 3/4. How can we improve the probability of catching when n is composite?

The simplest way is just to repeat the test several times, each time choosing a value of a randomly. (Note that we do not even have to go to the trouble of making sure we try different values of a each time; we can choose values with replacement!) Each time we try this we have a probability of at least 3/4 of catching that n is composite, so if we try the test k times, we will return the wrong answer in the case where n is composite with probability at most $(1/4)^k$. For k = 25, the probability of the algorithm itself making an error is thus at most $(1/2)^{50}$; the probability that a random cosmic ray affected your arithmetic unit is probably higher!

This trick comes up again and again with randomized algorithms. If the probability of catching an error on a single trial is p, the probability of failing to catch an error after t trials is $(1-p)^t$, assuming each trial is independent. By making t sufficiently large, the probability of error can be reduced. Since the probability shrinks exponentially in t, few trials can produce a great deal of security in the answer.
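The witness test and its repetition can be sketched in Python as below. This is one possible rendering of the procedure described in this section, not a vetted cryptographic implementation; the names `is_witness` and `is_probable_prime`, the default of 25 trials, and the handling of small n are assumptions made for the example.

```python
import random


def is_witness(a: int, n: int) -> bool:
    """True if base a proves that odd n > 2 is composite."""
    t, u = 0, n - 1
    while u % 2 == 0:          # write n - 1 = 2^t * u with u odd
        t += 1
        u //= 2
    x = pow(a, u, n)           # a^u mod n
    for _ in range(t):
        y = x * x % n
        if y == 1 and x != 1 and x != n - 1:
            return True        # x is a non-trivial square root of 1
        x = y
    return x != 1              # here x = a^(n-1) mod n


def is_probable_prime(n: int, k: int = 25, rng=random) -> bool:
    """Repeat the witness test k times with bases chosen with replacement."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    return not any(is_witness(rng.randrange(2, n - 1), n) for _ in range(k))
```

A prime is never rejected, while a composite slips through only if all k random bases fail to be witnesses. Unlike the plain test with base 2, this one catches the Carmichael number 561: base 2 exposes a non-trivial square root of 1 along the squaring chain.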


CS 124 Lecture 14

14.1 Cryptography Fundamentals

Cryptography is concerned with the following scenario: two people, Alice and Bob, wish to communicate privately in the presence of an eavesdropper, Eve. In particular, suppose Alice wants to send Bob a message x. (For convenience, we will always assume our message has been converted into a bit string.) Using cryptography, Alice would compute a function e(x), the encoding of x, using some secret key, and transmit e(x) to Bob. Bob receives e(x), and using his own secret key, would compute a function d(e(x)) = x. The function d provides the decoding of the encoding e(x). Eve is presumably unable to recover x from e(x) because she does not have the key; without the key, computing x is either impossible or computationally difficult.

14.1.1 One-Time Pad

A classical cryptographic method is the one-time pad. A one-time pad is a random string of bits r, equal in length to the message x, that Alice and Bob share and keep secret. By random, here we mean that r is equally likely to be any bit string of the right length, |r|. Alice computes e(x) = x ⊕ r; Bob computes d(e(x)) = e(x) ⊕ r = x ⊕ r ⊕ r = x.

The claim is that Eve gets absolutely no information about the message by seeing e(x). More concretely,

Pr(message is x | e(x)) = Pr(message is x);

that is, knowing e(x) gives no more information to Eve than she already had. This is a nice exercise in conditional probability.

Since e(x) provides no information, the one-time pad is completely secure. (Notice that this does not rely on notions of computational difficulty; Eve really obtains no additional information!) There are, however, crucial drawbacks:
• The key r has to be as long as x.

• The key r can only be used once. (To see this, suppose we use the same key r to encode x and y. Then Eve can compute e(x) ⊕ e(y) = x ⊕ y, which might yield useful information!)

• The key r has to be exchanged, by some other means. (Private courier?)
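The one-time pad and the danger of key reuse can be illustrated with a few lines of Python operating on bytes rather than individual bits. The helper name `xor_bytes` and the sample messages are assumptions made for this example.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("the pad must be exactly as long as the message")
    return bytes(u ^ v for u, v in zip(a, b))


x = b"attack at dawn"
y = b"retreat at six"
r = secrets.token_bytes(len(x))   # the shared random pad

e_x = xor_bytes(x, r)             # Alice encodes
decoded = xor_bytes(e_x, r)       # Bob decodes and recovers x

# Reusing r for a second message leaks x XOR y to Eve, with no key needed.
e_y = xor_bytes(y, r)
leak = xor_bytes(e_x, e_y)
```

Here `decoded` equals `x`, and `leak` equals `xor_bytes(x, y)` regardless of which pad was drawn, which is exactly the second drawback in the list.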

14.1.2 DES

The Data Encryption Standard, or DES, is a U.S. government sponsored cryptographic method proposed in 1976. It uses a 56 bit key, again shared by Alice and Bob, and it encodes blocks of 64 bits using a complicated sequence of bit operations.

Many have suspected that the government engineered the DES standard so that they could break it easily, but nobody has shown a simpler method for breaking DES other than trying the $2^{56}$ possible keys. These days, however, trying even this large number of keys can be accomplished in just a few days with specialized hardware. Hence DES is widely considered no longer secure.

14.1.3 RSA

RSA (named after its inventors, Ron Rivest, Adi Shamir, and Len Adleman) was developed around the same time as DES. RSA is an example of public key cryptography. In public key cryptography, Bob has two keys: a public key, $k_e$, known to everyone, and a private key, $k_d$, known only to Bob. If Alice (or anyone else) wants to send a message x to Bob, she encrypts it as e(x) using the public key; Bob then decrypts it using his private key. For this to be secure, the private key must be hard to compute from the public key, and similarly x must be hard to compute from e(x).

The RSA algorithm depends on some number theory and simple algorithms, which we will consider before describing RSA. We will then describe how RSA is efficient and secure.

14.2 Tools for RSA

14.2.1 Primality

For the time being, we will assume that it is possible to generate large prime numbers. In fact, there are simple and efficient randomized algorithms for generating large primes, that we will consider later in the course.


14.2.2 Euclid’s Greatest Common Divisor Algorithm

Definition: The greatest common divisor (or gcd) of integers a, b ≥ 0 is the largest integer d ≥ 0 such that d | a and d | b, where d | a denotes that d divides a.

Example: gcd(360,84) = 12.

One way of computing the gcd is to factor the two numbers, and find the common prime factors (with the right multiplicity). Factoring, however, is a problem for which we do not have general efficient algorithms.

The following algorithm, due to Euclid, avoids factoring. Assume a ≥ b ≥ 0.

function Euclid(a, b)
    if b = 0 return(a)
    return(Euclid(b, a mod b))
end Euclid

Euclid’s algorithm relies on the fact that gcd(a,b) = gcd(b, a mod b). You should prove this as an exercise.
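For readers who want to run the algorithm, here is a short Python version. It is written as a loop rather than a recursion, which is a stylistic choice made for this example; each loop iteration corresponds to one recursive call of the pseudocode.

```python
def euclid(a: int, b: int) -> int:
    """Greatest common divisor of non-negative integers a and b."""
    while b != 0:
        a, b = b, a % b   # replace (a, b) by (b, a mod b)
    return a
```

On the example above, the pairs visited are (360, 84), (84, 24), (24, 12), (12, 0), so `euclid(360, 84)` returns 12.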

We need to check that this algorithm is efficient. We will assume that mod operations are efficient (in fact they can be done in $O(\log^2 a)$ bit operations). How many mod operations must be performed?

To analyze this, we notice that in the recursive calls of Euclid’s algorithm, the numbers always get smaller. For the algorithm to be efficient, we’d like to have only about O(log a) recursive calls. This will require the numbers to shrink by a constant factor after a constant number of rounds. In fact, we can show that the larger number shrinks by a factor of 2 every 2 rounds.

Claim 1: a mod b ≤ a/2.

Proof: The claim is trivially true if b ≤ a/2, since a mod b < b. If b > a/2, then a mod b = a − b ≤ a/2.

Claim 2: On calling Euclid(a,b), after the second recursive call Euclid(a′,b′) has a′ ≤ a/2.

Proof: For the second recursive call, we will have a′ = a mod b, which is at most a/2 by Claim 1.

14.2.3 Extended Euclid’s Algorithm

Euclid’s algorithm can be extended to give not just the greatest common divisor d = gcd(a,b), but also two integers x and y such that ax + by = d. This will prove useful to us subsequently, as we will explain.



function Extended-Euclid(a, b)
    if b = 0 return(a, 1, 0)
    Compute k such that a = bk + (a mod b)
    (d, x, y) = Extended-Euclid(b, a mod b)
    return((d, y, x − ky))
end Extended-Euclid

Claim 3: The Extended Euclid’s algorithm returns the correct answer.

Proof: By induction on a + b. It clearly works if b = 0. (Note the understanding that all numbers divide 0!) If b ≠ 0, then we may assume the recursive call provides the correct answer by induction, as a mod b < a. Hence we have x and y such that bx + (a mod b)y = d. But (a mod b) = a − bk, and hence by substitution we get bx + (a − bk)y = d, or ay + b(x − ky) = d. This shows the algorithm provides the correct output.

Note that the Extended Euclid’s algorithm is clearly efficient, as it requires only a few extra arithmetic operations per recursive call over Euclid’s algorithm.

The Extended Euclid’s algorithm is useful if we wish to compute the inverse of a number. That is, suppose we wish to find $a^{-1} \bmod n$. The number a has a multiplicative inverse modulo n if and only if the gcd of a and n is 1. Moreover, the Extended Euclid’s algorithm gives us that number. Since in this case computing gcd(a,n) gives x, y such that ax + ny = 1, we have that $x = a^{-1} \bmod n$.
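The extended algorithm and its use for modular inverses can be written in Python as follows. The recursion mirrors the pseudocode; the wrapper `mod_inverse` and its error handling are assumptions added for this example.

```python
def extended_euclid(a: int, b: int):
    """Return (d, x, y) with d = gcd(a, b) and a*x + b*y = d."""
    if b == 0:
        return a, 1, 0
    k = a // b                           # a = b*k + (a mod b)
    d, x, y = extended_euclid(b, a % b)
    return d, y, x - k * y


def mod_inverse(a: int, n: int) -> int:
    """Inverse of a modulo n, defined only when gcd(a, n) = 1."""
    d, x, _ = extended_euclid(a, n)
    if d != 1:
        raise ValueError("a has no inverse modulo n")
    return x % n                         # normalize into the range 0..n-1
```

For instance, `mod_inverse(17, 3120)` returns 2753, and indeed 17 · 2753 = 46801 = 15 · 3120 + 1.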

14.2.4 Exponentiation

Suppose we have to compute $x^y \bmod z$, for integers x, y, z. Multiplying x by itself y times is one possibility, but it is too slow. A more efficient approach is to repeatedly square from x, to get $x^2 \bmod z$, $x^4 \bmod z$, $x^8 \bmod z$, ..., $x^{2^{\lfloor \log y \rfloor}} \bmod z$. Now $x^y$ can be computed by multiplying together modulo z the powers that correspond to ones in the binary representation of y.
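Repeated squaring can be sketched in Python as below. Python's built-in `pow(x, y, z)` already does this; the explicit loop, with the hypothetical name `mod_exp`, is shown only to make the binary-representation idea concrete.

```python
def mod_exp(x: int, y: int, z: int) -> int:
    """Compute x^y mod z for y >= 0 and z >= 1 by repeated squaring."""
    result = 1 % z
    power = x % z              # holds x^(2^i) mod z at step i
    while y > 0:
        if y & 1:              # bit i of y is one: include this power
            result = result * power % z
        power = power * power % z
        y >>= 1
    return result
```

The loop runs once per bit of y, so it uses O(log y) multiplications, each on numbers smaller than z.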

14.3 The RSA Protocol

To create a public key, Bob finds two large primes, p and q, of roughly the same size. (Large should be a few hundred decimal digits. Recently, with a lot of work, 512-bit RSA has been broken; this corresponds to n = pq being 512 bits long.) Bob computes n = pq, and also chooses a random integer e such that gcd((p−1)(q−1), e) = 1. (An alternative to choosing e randomly often used in practice is to choose e = 3, in which case p and q cannot equal 1 modulo 3.)

The pair (n,e) is Bob’s public key, which he announces to the world. Bob’s private key is $d = e^{-1} \bmod (p-1)(q-1)$, which can be computed by the Extended Euclid’s algorithm. More specifically, (p,q,d) is Bob’s private key.

Suppose Alice wants to send a message to Bob. We think of the message as being a number x from the range [1,n]. (If the message is too big to be represented by a number this small, it must be broken up into pieces; for example, the message could be broken into bit strings of length ⌊log n⌋.) To encode the message, Alice computes and sends to Bob

$$e(x) = x^e \bmod n.$$

Upon receipt, Bob computes

$$d(e(x)) = (e(x))^d \bmod n.$$

To show that this operation decodes correctly, we must prove:

Claim 4: d(e(x)) = x.

Proof: We use the steps:

$$e(x)^d = x^{de} = x^{1+k(p-1)(q-1)} = x \bmod n.$$

The first equation recalls the definition of e(x). The second uses the fact that $d = e^{-1} \bmod (p-1)(q-1)$, and hence $de = 1 + k(p-1)(q-1)$ for some integer k. The last equality is much less trivial. It will help us to have the following:
Claim 5: (Fermat’s Little Theorem) If p is prime, then for a ≠ 0 mod p, we have $a^{p-1} = 1 \bmod p$.

Proof: Look at the numbers 1, 2, ..., p−1. Suppose we multiply them all by a modulo p, to get a·1 mod p, a·2 mod p, ..., a·(p−1) mod p. We claim that the two sets of numbers are the same! This is because every pair of numbers in the second group is different; this follows since if a·i = a·j mod p, then by multiplying by $a^{-1}$, we must have i = j mod p. But if all the numbers in the second group are different modulo p, since none of them are 0, they must just be 1, 2, ..., p−1. (To get a feel for this, take an example: when p = 7 and a = 5, multiplying a by the numbers 1, 2, 3, 4, 5, 6 yields 5, 3, 1, 6, 4, 2.)

From the above equality of sets of numbers, we conclude

$$1 \cdot 2 \cdots (p-1) = (a \cdot 1) \cdot (a \cdot 2) \cdots (a \cdot (p-1)) \bmod p.$$

Multiplying both sides by $1^{-1}, 2^{-1}, \ldots, (p-1)^{-1}$ we have

$$1 = a^{p-1} \bmod p.$$

This proves Claim 5.

We now return to the end of Claim 4, where we must prove

$$x^{1+k(p-1)(q-1)} = x \bmod n.$$

We first claim that $x^{1+k(p-1)(q-1)} = x \bmod p$. This is clearly true if x = 0 mod p. If x ≠ 0 mod p, then by Fermat’s Little Theorem, $x^{p-1} = 1 \bmod p$, and hence $x^{k(p-1)(q-1)} = 1 \bmod p$, from which we have $x^{1+k(p-1)(q-1)} = x \bmod p$. By the same argument we also have $x^{1+k(p-1)(q-1)} = x \bmod q$. But if a number is equal to x both modulo p and modulo q, it is equal to x modulo n = p·q, since p and q are distinct primes. Hence $x^{1+k(p-1)(q-1)} = x \bmod n$, and Claim 4 is proven.
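A toy run of the protocol makes Claim 4 concrete. The primes 61 and 53 and the exponent 17 are illustrative values chosen for this example and are far too small to be secure; the function names `encode` and `decode` are hypothetical, and the modular inverse uses Python's `pow(e, -1, m)`, available from Python 3.8.

```python
from math import gcd

p, q = 61, 53                  # toy primes, assumed for illustration
n = p * q                      # 3233, the public modulus
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent
assert gcd(e, phi) == 1        # e must be invertible mod (p-1)(q-1)
d = pow(e, -1, phi)            # private exponent


def encode(x: int) -> int:
    """Alice's computation: x^e mod n."""
    return pow(x, e, n)


def decode(c: int) -> int:
    """Bob's computation: c^d mod n."""
    return pow(c, d, n)
```

With these values d comes out to 2753, and `decode(encode(x)) == x` holds for every x from 0 to n − 1, including those x that share a factor with n, as the proof above shows.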

We have shown that the RSA protocol allows for correct encoding and decoding. We also should be convinced it is efficient, since it requires only operations that we know to be efficient, such as Euclid’s algorithm and modular exponentiation. One thing we have not yet asked is why the scheme is secure. That is, why can’t the eavesdropper Eve recover the message x also?

The answer, unfortunately, is that there is no proof that Eve cannot compute x efficiently from e(x). There is simply a belief that this is a hard problem. It is an unproven assumption that there is no efficient algorithm for computing x from e(x). There is the real but unlikely possibility that someone out there can read all messages sent using RSA!

Let us seek some idea of why RSA is believed to be secure. If Eve obtains $e(x) = x^e \bmod n$, what can she do? She could try all possible values of x to try to find the correct one; this clearly takes too long. Or she could try to factor n and compute d. Factoring, however, is a widely known and well studied problem, and nobody has come up with a polynomial time algorithm for the problem. In fact, it is widely believed that no such algorithm exists.

It would be nice if we could make some sort of guarantee. For example, suppose that breaking RSA allowed us to factor n. Then we could say that RSA is as hard as factoring. Unfortunately, this is not the case either. It is possible that RSA could be broken without providing a general factoring algorithm, although it seems that any natural approach for breaking RSA would also provide a way to factor n.


CS124 Lecture 15

15.1 2SAT

We begin by showing yet another possible way to solve the 2SAT problem. Recall that the input to 2SAT is a logical expression that is the conjunction (AND) of a set of clauses, where each clause is the disjunction (OR) of two literals. (A literal is either a Boolean variable or the negation of a Boolean variable.) For example, the following expression is an instance of 2SAT:

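$$(x_1 \vee \overline{x}_2) \wedge (\overline{x}_1 \vee \overline{x}_3) \wedge (x_1 \vee x_2) \wedge (x_4 \vee \overline{x}_3) \wedge (x_4 \vee \overline{x}_1).$$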

A solution to an instance of a 2SAT formula is an assignment of the variables to the values T (true) and F (false) so that all the clauses are satisfied; that is, there is at least one true literal in each clause. For example, the assignment $x_1 = T, x_2 = F, x_3 = F, x_4 = T$ satisfies the 2SAT formula above.

Here is a simple randomized solution to the 2SAT problem. Start with some truth assignment, say by setting allthe variables to false. Find some clause that is not yet satisfied. Randomly choose one the variables in that clause,say by flipping a coin, and change its value. Continue this process, until either all clauses are satisfied or you gettired of flipping coins.

In the example above, when we begin with all variables set to F, the clause (x1 ∨ x2) is not satisfied. So wemight randomly choose to set x1 to be T. In this case this would leave the clause (x4 ∨ x1) unsatisfied, so we wouldhave to flip a variable in the clause, and so on.

Why would this algorithm tend to lead to a solution? Let us suppose that there is a solution, call it S. Supposewe keep track of the number of variables in our current assignment A that match S. Call this number k. We wouldlike to get to the point where k = n, the number of variables in the formula, for then A would match the solution S.How does k evolve over time?

At each step, we choose a clause that is unsatisfied. Hence we know that A and S disagree on the value of atleast one of the variables in this clause– if they agreed, the clause would have to be satisfied! If they disagree on both,then clearly changing either one of the values will increase k. If they disagree on the value one of the two variables,then with probability 1/2 we choose that variable and make increase k by 1; with probability 1/2 we choose the othervariable and decrease k by 1.

Hence, in the worst case, k behaves like a random walk: it either goes up or down by 1, randomly. This leaves us with the following question: if we start k at 0, how many steps does it take (on average, or with high probability) for k to stumble all the way up to n, the number of variables?

We can check that the average number of steps to walk (randomly) from 0 to n is just n^2. In fact, the average time to walk from i to n is n^2 − i^2. Note that the average time T(i) to walk from i to n is given by:

\[
\begin{aligned}
T(n) &= 0 \\
T(i) &= \frac{T(i-1) + T(i+1)}{2} + 1, \qquad i \ge 1 \\
T(0) &= T(1) + 1.
\end{aligned}
\]



These equations completely determine T (i), and our solution satisfies these equations!
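The claim that T(i) = n^2 − i^2 satisfies the three equations can be checked mechanically. The helper below is a small sanity check written for this purpose (its name is an assumption); it multiplies the middle equation by 2 so that everything stays in integers.

```python
def closed_form_satisfies_recurrence(n):
    """Check that T(i) = n^2 - i^2 satisfies T(n) = 0, T(0) = T(1) + 1,
    and T(i) = (T(i-1) + T(i+1)) / 2 + 1 for 1 <= i <= n - 1."""
    T = [n * n - i * i for i in range(n + 1)]
    if T[n] != 0 or T[0] != T[1] + 1:
        return False
    # The middle equation, multiplied through by 2.
    return all(2 * T[i] == T[i - 1] + T[i + 1] + 2 for i in range(1, n))

checked = all(closed_form_satisfies_recurrence(n) for n in range(1, 50))
```

Algebraically, T(i−1) + T(i+1) = 2n^2 − 2i^2 − 2, so adding 2 and halving gives n^2 − i^2 again, which is what the check confirms numerically.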

Hence, on average, we will find a solution in at most n^2 steps. (We might do better: we might not start with all of our variables wrong, or we might have some moves where we must improve the number of matches!)

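We can run our algorithm for, say, 100n^2 steps, and report that no solution was found if none was found. This algorithm might return the wrong answer: there may be a satisfying truth assignment, and we have just been unlucky. But this is very unlikely. By Markov's inequality, each block of 2n^2 steps fails to reach an existing solution with probability at most 1/2, so 100n^2 steps fail with probability at most 2^−50.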


CS124 Lecture 16

An introductory example

Suppose that a company that produces three products wishes to decide the level of production of each so as to

maximize profits. Let x1 be the amount of Product 1 produced in a month, x2 that of Product 2, and x3 that of Product

3. Each unit of Product 1 yields a profit of 100, each unit of Product 2 a profit of 600, and each unit of Product 3 a

profit of 1400. There are limitations on x1, x2, and x3 (besides the obvious one, that x1,x2,x3 ≥ 0). First, x1 cannot

be more than 200, and x2 cannot be more than 300, presumably because of supply limitations. Also, the sum of the

three must be, because of labor constraints, at most 400. Finally, it turns out that Products 2 and 3 use the same

piece of equipment, with Product 3 using three times as much, and hence we have another constraint x2 +3x3 ≤ 600.

What are the best levels of production?

We represent the situation by a linear program, as follows:

max 100x1 + 600x2 + 1400x3

x1 ≤ 200

x2 ≤ 300

x1 + x2 + x3 ≤ 400

x2 + 3x3 ≤ 600

x1,x2,x3 ≥ 0

The set of all feasible solutions of this linear program (that is, all vectors in 3-d space that satisfy all constraints)

is precisely the polyhedron shown in Figure 16.1.

We wish to maximize the linear function 100x1 + 600x2 + 1400x3 over all points of this polyhedron.

Geometrically, the linear equation 100x1 + 600x2 + 1400x3 = c can be represented by a plane parallel to the one determined

by the equation 100x1 +600x2 +1400x3 = 0. This means that we want to find the plane of this type that touches the

polyhedron and is as far towards the positive orthant as possible. Obviously, the optimum solution will be a vertex

(or the optimum solution will not be unique, but a vertex will do). Of course, two other possibilities with linear

programming are that (a) the optimum solution may be infinity, or (b) that there may be no feasible solution at all.
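Since the optimum is attained at a vertex, a tiny LP like this one can be solved by brute force: intersect every triple of constraint planes, discard infeasible points, and keep the best. The sketch below does exactly that with exact rational arithmetic. It is not the simplex method, it is only sensible for very small problems, and all names in it are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Constraints written as rows of A x <= b; the last three rows encode x >= 0.
A = [[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 1, 3], [-1, 0, 0], [0, -1, 0], [0, 0, -1]]
b = [200, 300, 400, 600, 0, 0, 0]
c = [100, 600, 1400]

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(rows, rhs):
    """Solve a 3x3 system by Cramer's rule; None if the planes do not meet in a point."""
    d = det3(rows)
    if d == 0:
        return None
    return [Fraction(det3([[rhs[i] if j == k else rows[i][j] for j in range(3)]
                           for i in range(3)]), d) for k in range(3)]

def best_vertex():
    best = None
    for idx in combinations(range(len(A)), 3):
        x = solve3([A[i] for i in idx], [b[i] for i in idx])
        if x is None:
            continue
        feasible = all(sum(a * xi for a, xi in zip(row, x)) <= bi for row, bi in zip(A, b))
        if feasible:
            val = sum(ci * xi for ci, xi in zip(c, x))
            if best is None or val > best[0]:
                best = (val, x)
    return best

value, point = best_vertex()
```

For this program the search ends at the vertex (0, 300, 100) with profit 320000.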










Figure 16.1: The feasible region

In this case, an optimal solution exists, and moreover we shall show that it is easy to find.

Linear programs

Linear programs, in general, have the following form: there is an objective function that one seeks to optimize,

along with constraints on the variables. The objective function and the constraints are all linear in the variables;

that is, neither contains powers of the variables or products of variables. As we shall see,

a great many problems can be represented by linear programs, and for many problems it is an extremely convenient

representation. So once we explain how to solve linear programs, the question then becomes how to reduce other

problems to linear programming (LP).

There are polynomial time algorithms for solving linear programs. In practice, however, such problems are

solved by the simplex method devised by George Dantzig in 1947. The simplex method starts from a vertex (in this


case the vertex (0,0,0)) and repeatedly looks for a vertex that is adjacent, and has better objective value. That is, it

is a kind of hill-climbing in the vertices of the polytope. When a vertex is found that has no better neighbor, simplex

stops and declares this vertex to be the optimum. For example, in the figure one of the possible paths followed by

simplex is shown. No known variant of the simplex algorithm has been proven to take polynomial time, and most of

the variations used in practice have been shown to take exponential time on some examples. Fortunately, in practice,

bad cases rarely arise, and the simplex algorithm runs extremely quickly. There are now implementations of simplex

that routinely solve linear programs with many thousands of variables and constraints.

Of course, given a linear program, it is possible either that (a) the optimum solution may be infinity, or (b) that

there may be no feasible solution at all. If this is the case, the simplex algorithm will discover it.

Reductions between versions of simplex

A general linear programming problem may involve constraints that are equalities or inequalities in either

direction. Its variables may be nonnegative, or could be unrestricted in sign. And we may be either minimizing

or maximizing a linear function. It turns out that we can easily translate any such version to any other. One

such translation that is particularly useful is from the general form to the one required by simplex: minimization,

nonnegative variables, and equality constraints.

To turn an inequality ∑aixi ≤ b into an equality constraint, we introduce a new variable s (the slack variable for

this inequality), and rewrite this inequality as ∑aixi + s = b,s ≥ 0. Similarly, any inequality ∑aixi ≥ b is rewritten

as ∑aixi − s = b,s ≥ 0; s is now called a surplus variable.

We handle an unrestricted variable x as follows: we introduce two nonnegative variables, x+ and x−, and

replace x by x+ − x− everywhere. The idea is that we let x = x+ − x−, where we may restrict both x+ and x− to be

nonnegative. This way, x can take on any value, but there are only nonnegative variables.

Finally, to turn a maximization problem into a minimization one, we just multiply the objective function by −1.
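The slack/surplus and sign-flip translations can be written down as a short routine. The sketch below is a minimal illustration under stated assumptions: the function name `to_standard_form` and the (coefficients, sense, right-hand side) row format are invented here, all variables are assumed to be nonnegative already, and the splitting of unrestricted variables into x+ − x− is not implemented.

```python
def to_standard_form(c, rows, maximize=True):
    """Convert an LP over nonnegative variables into: minimize obj.x, A x = b, x >= 0.
    rows: list of (coeffs, sense, rhs) with sense in {'<=', '>=', '='}."""
    n_extra = sum(1 for _, sense, _ in rows if sense != '=')
    # Maximizing c.x is the same as minimizing (-c).x; new variables cost nothing.
    obj = [(-ci if maximize else ci) for ci in c] + [0] * n_extra
    A, b, k = [], [], 0
    for coeffs, sense, rhs in rows:
        extra = [0] * n_extra
        if sense == '<=':
            extra[k] = 1        # slack variable:   a.x + s = b
            k += 1
        elif sense == '>=':
            extra[k] = -1       # surplus variable: a.x - s = b
            k += 1
        A.append(list(coeffs) + extra)
        b.append(rhs)
    return obj, A, b

# The introductory production example, which has four <= constraints.
obj, A, b = to_standard_form(
    [100, 600, 1400],
    [([1, 0, 0], '<=', 200), ([0, 1, 0], '<=', 300),
     ([1, 1, 1], '<=', 400), ([0, 1, 3], '<=', 600)])
```

Each inequality gets its own new column, so the production example grows from three variables to seven.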

A production scheduling example

We have the demand estimates for our product for all months of 1997, di : i = 1, . . . ,12, and they are very

uneven, ranging from 440 to 920. We currently have 30 employees, each of whom produces 20 units of the product

each month at a salary of 2,000; we have no stock of the product. How can we handle such fluctuations in demand?

Three ways:


• overtime —but this is expensive since it costs 80% more than regular production, and has limitations, as

workers can only work 30% overtime.

• hire and fire workers —but hiring costs 320, and firing costs 400.

• store the surplus production —but this costs 8 per item per month

This rather involved problem can be formulated and solved as a linear program. As in all such reductions, the

crucial first step is defining the variables:

• Let wi be the number of workers we have in the ith month; initially, w0 = 30.

• Let xi be the production for month i.

• oi is the number of items produced by overtime in month i.

• hi and fi are the number of workers hired/fired in the beginning of month i.

• si is the amount of product stored after the end of month i.

We now must write the constraints:

• xi = 20wi + oi —the amount produced is the one produced by regular production, plus overtime.

• wi = wi−1 + hi − fi,wi ≥ 0 —the changing number of workers.

• si = si−1 +xi−di ≥ 0 —the amount stored in the end of this month is what we started with, plus the production,

minus the demand.

• oi ≤ 6wi —only 30% overtime.

Finally, what is the objective function? It is

min 2000∑wi + 400∑ fi + 320∑hi + 8∑ si + 180∑oi,

where the summations are from i = 1 to 12.

A Communication Network Problem


We have a network whose lines have the bandwidth shown in Figure 16.2. We wish to establish three calls: one

between A and B (call 1), one between B and C (call 2), and one between A and C (call 3). We must give each call

at least 2 units of bandwidth, but possibly more. The link from A to B pays 3 per unit of bandwidth, from B to C

pays 2, and from A to C pays 4. Notice that each call can be routed in two ways (the long and the short path), or by a

combination (for example, two units of bandwidth via the short route, and three via the long route). Suppose we are

a shady network administrator, and our goal is to maximize the network’s income (rather than minimize the overall

cost). How do we route these calls to maximize the network’s income?


Figure 16.2: A communication network

This is also a linear program. We have variables for each call and each path (long or short); for example x1 is

the short path for call 1, and x′2 the long path for call 2. We demand that (1) no edge bandwidth is exceeded, and (2)

each call gets a bandwidth of at least 2.

max 3x1 + 3x′1 + 2x2 + 2x′2 + 4x3 + 4x′3

x1 + x′1 + x2 + x′2 ≤ 10

x1 + x′1 + x3 + x′3 ≤ 12

x2 + x′2 + x3 + x′3 ≤ 8

x1 + x′2 + x′3 ≤ 6


x′1 + x2 + x′3 ≤ 13

x′1 + x′2 + x3 ≤ 11

x1 + x′1 ≥ 2

x2 + x′2 ≥ 2

x3 + x′3 ≥ 2

x1,x′1 . . . ,x′3 ≥ 0

The solution, obtained via simplex in a few milliseconds, is the following: x1 = 0, x′1 = 7, x2 = x′2 = 1.5, x3 = 0.5, x′3 = 4.5.
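As a sanity check, the reported solution can be substituted back into the constraints. The snippet below does this with exact fractions; the key names such as `x1p` (standing for x′1) are just an encoding chosen for this sketch.

```python
from fractions import Fraction as F

# Reported solution; "x1p" stands for x'1 (the long path of call 1), and so on.
x = {"x1": F(0), "x1p": F(7), "x2": F(3, 2), "x2p": F(3, 2), "x3": F(1, 2), "x3p": F(9, 2)}

upper = [  # (variables summed, limit), in the order the constraints are listed
    (("x1", "x1p", "x2", "x2p"), 10),
    (("x1", "x1p", "x3", "x3p"), 12),
    (("x2", "x2p", "x3", "x3p"), 8),
    (("x1", "x2p", "x3p"), 6),
    (("x1p", "x2", "x3p"), 13),
    (("x1p", "x2p", "x3"), 11),
]
lower = [(("x1", "x1p"), 2), (("x2", "x2p"), 2), (("x3", "x3p"), 2)]

def total(names):
    return sum(x[name] for name in names)

feasible = (all(v >= 0 for v in x.values())
            and all(total(ns) <= cap for ns, cap in upper)
            and all(total(ns) >= need for ns, need in lower))
income = 3 * total(("x1", "x1p")) + 2 * total(("x2", "x2p")) + 4 * total(("x3", "x3p"))
tight = [cap for ns, cap in upper if total(ns) == cap]   # constraints met with equality
```

The solution is feasible and earns 47. Five of the six upper-bound constraints hold with equality; only the last one (limit 11) has slack.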

Question: Suppose that we removed the constraints stating that each call should receive at least two units.

Would the optimum change?

Approximate Separation

An interesting last application: Suppose that we have two sets of points in the plane, the black points (xi,yi) :

i = 1, . . . ,m and the white points (xi,yi) : i = m+1, . . . ,m+n. We wish to separate them by a straight line ax+by = c,

so that for all black points ax + by ≤ c, and for all white points ax + by ≥ c. In general, this would be impossible.

Still, we may want to separate them by a line that minimizes the sum of the “displacement errors” (distance from the

boundary) over all misclassified points. Here is the LP that achieves this:

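\[
\begin{array}{rl}
\min & e_1 + e_2 + \cdots + e_m + e_{m+1} + \cdots + e_{m+n} \\
& e_1 \ge a x_1 + b y_1 - c \\
& e_2 \ge a x_2 + b y_2 - c \\
& \quad\vdots \\
& e_m \ge a x_m + b y_m - c \\
& e_{m+1} \ge c - a x_{m+1} - b y_{m+1} \\
& \quad\vdots \\
& e_{m+n} \ge c - a x_{m+n} - b y_{m+n} \\
& e_i \ge 0
\end{array}
\]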
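For a fixed line ax + by = c, the minimization pushes each e_i down to the larger of 0 and the right-hand side of its constraint. The helper below (an illustrative function, not from the notes) evaluates the LP objective for a given candidate line, which is handy for checking a solver's output on small inputs.

```python
def separation_error(a, b, c, black, white):
    """Value of the LP objective for a fixed line a*x + b*y = c.
    Black points should satisfy a*x + b*y <= c, white points a*x + b*y >= c."""
    err_black = sum(max(0, a * x + b * y - c) for x, y in black)
    err_white = sum(max(0, c - a * x - b * y) for x, y in white)
    return err_black + err_white

# Hypothetical data: one black and one white point sit on the wrong side of x = 2.5.
black = [(0, 0), (1, 0), (3, 0)]
white = [(4, 0), (5, 0), (2, 0)]
error = separation_error(1, 0, 2.5, black, white)
```

Here each misplaced point is 0.5 away from the line, so the total error is 1.0.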

Network Flows

Suppose that we are given the network at the top of Figure 16.3, where the numbers indicate capacities, that is, the

amount of flow that can go through the edge in unit time. We wish to find the maximum amount of flow that can go

through this network, from S to T .





Figure 16.3: Max flow (minimum cut, capacity 6)


This problem can also be reduced to linear programming. We have a nonnegative variable for each edge, representing

the flow through this edge. These variables are denoted fSA, fSB, . . . We have two kinds of constraints:

capacity constraints such as fSA ≤ 5 (a total of 9 such constraints, one for each edge), and flow conservation constraints

(one for each node except S and T ), such as fAD + fBD = fDC + fDT (a total of 4 such constraints). We wish

to maximize fSA + fSB, the amount of flow that leaves S, subject to these constraints. It is easy to see that this linear

program is equivalent to the max-flow problem. The simplex method would correctly solve it.

In the case of max-flow, it is very instructive to “simulate” the simplex method, to see what effect its various

iterations would have on the given network. Simplex would start with the all-zero flow, and would try to improve it.

How can it find a small improvement in the flow? Answer: it finds a path from S to T (say, by depth-first search),

and moves flow along this path of total value equal to the minimum capacity of an edge on the path (it can obviously

do no better). This is the first iteration of simplex (see Figure 16.3).

How would simplex continue? It would look for another path from S to T . Since this time we already partially

(or totally) use some of the edges, we should do depth-first search on the edges that have some residual capacity,

above and beyond the flow they already carry. Thus, the edge CT would be ignored, as if it were not there. The

depth-first search would now find the path S−A−D− T , and augment the flow by two more units, as shown in

Figure 16.3.

Next, simplex would again try to find a path from S to T . The path is now S−A−B−D−T (the edges C−T

and A−D are full and are therefore ignored), and we augment the flow as shown in the bottom of Figure 16.3.

Next simplex would again try to find a path. But since edges A−D, C− T , and S−B are full, they must be

ignored, and therefore depth-first search would fail to find a path, after marking the nodes S,A,C as reachable from

S. Simplex then returns the flow shown, of value 6, as maximum.

How can we be sure that it is the maximum? Notice that these reachable nodes define a cut (a set of nodes

containing S but not T ), and the capacity of this cut (the sum of the capacities of the edges going out of this set) is

6, the same as the max-flow value. (It must be the same, since this flow passes through this cut.) The existence of

this cut establishes that the flow is optimum!

There is a complication that we have swept under the rug so far: when we do depth-first search looking for a

path, we use not only the edges that are not completely full, but we must also traverse in the opposite direction all

edges that already have some non-zero flow. This would have the effect of canceling some flow; canceling may be

necessary to achieve optimality, see Figure 16.4. In this figure the only way to augment the current flow is via the

path S−B−A−T , which traverses the edge A−B in the reverse direction (a legal traversal, since A−B is carrying


non-zero flow).

Figure 16.4: Flows may have to be canceled

In general, a path from the source to the sink along which we can increase the flow is called an augmenting

path. We can look for an augmenting path by doing for example a depth first search along the residual network,

which we now describe. For an edge (u,v), let c(u,v) be its capacity, and let f (u,v) be the flow across the edge.

Note that we adopt the following convention: if 4 units flow from u to v, then f (u,v) = 4, and f (v,u) = −4. That is,

we interpret the fact that we could reverse the flow across an edge as being equivalent to a “negative flow”. Then the

residual capacity of an edge (u,v) is just

c(u,v)− f (u,v).

The residual network has the same vertices as the original graph; the edges of the residual network consist of all

weighted edges with strictly positive residual capacity. The idea is then if we find a path from the source to the sink

in the residual network, we have an augmenting path to increase the flow in the original network. As an exercise,

you may want to consider the residual network at each step in Figure 16.3.

Suppose we look for a path in the residual network using depth first search. In the case where the capacities

are integers, we will always be able to push an integral amount of flow along an augmenting path. Hence, if the

maximum flow is f*, the total time to find the maximum flow is O(E f*), since we may have to do an O(E) depth

first search up to f* times. This is not so great: f* can be very large even when the network itself is small.

Note that we do not have to do a depth-first search to find an augmenting path in the residual network. In fact,

using a breadth-first search each time yields an algorithm that provably runs in O(VE^2) time, regardless of whether

or not the capacities are integers. We will not prove this here. There are also other algorithms and approaches to the

max-flow problem that improve on this running time.

To summarize: the max-flow problem can be easily reduced to linear programming and solved by simplex. But

it is easier to understand what simplex would do by following its iterations directly on the network. It repeatedly

finds a path from S to T along edges that are not yet full (have non-zero residual capacity), and also along any reverse

edges with non-zero flow. If an S−T path is found, we augment the flow along this path, and repeat. When a path

cannot be found, the set of nodes reachable from S defines a cut of capacity equal to the max-flow. Thus, the value

of the maximum flow is always equal to the capacity of the minimum cut. This is the important max-flow min-cut

theorem. One direction (that max-flow ≤ min-cut) is easy (think about it: the capacity of any cut is at least the value of any flow); the other

direction is proved by the algorithm just described.
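The augmenting-path procedure summarized above can be sketched as follows. This version searches the residual network breadth-first (the variant mentioned above with the O(VE^2) bound) rather than depth-first, and it returns both the flow value and the set of nodes still reachable from the source when the search fails. The function name, the dictionary-based graph encoding, and the five-edge test network are assumptions made for illustration.

```python
from collections import deque

def max_flow(capacity, s, t):
    """capacity: dict mapping (u, v) -> capacity. Returns (flow value, source side of a min cut)."""
    residual, adj = {}, {}
    for (u, v), cap in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + cap
        residual.setdefault((v, u), 0)      # reverse edge, so flow can be canceled later
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    value = 0
    while True:
        # Breadth-first search over edges with positive residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, []):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value, set(parent)       # no augmenting path: reachable nodes form a cut
        # Recover the path, find its bottleneck, and push that much flow.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        value += bottleneck

# A small illustrative network (not the one in Figure 16.3).
cap = {("S", "A"): 3, ("S", "B"): 2, ("A", "B"): 1, ("A", "T"): 1, ("B", "T"): 3}
value, side = max_flow(cap, "S", "T")
cut_capacity = sum(c for (u, v), c in cap.items() if u in side and v not in side)
```

On this network the flow value and the capacity of the cut defined by the reachable set {S, A} are both 4, as the max-flow min-cut theorem requires.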


CS124 Lecture 17




Figure 17.1: Max flow (minimum cut, capacity 6)



Network Flows

Suppose that we are given the network at the top of Figure 17.1, where the numbers indicate capacities, that is, the amount of flow that can go through the edge in unit time. We wish to find the maximum amount of flow that can go through this network, from S to T.

This problem can also be reduced to linear programming. We have a nonnegative variable for each edge, representing the flow through this edge. These variables are denoted fSA, fSB, . . . We have two kinds of constraints: capacity constraints such as fSA ≤ 5 (a total of 9 such constraints, one for each edge), and flow conservation constraints (one for each node except S and T), such as fAD + fBD = fDC + fDT (a total of 4 such constraints). We wish to maximize fSA + fSB, the amount of flow that leaves S, subject to these constraints. It is easy to see that this linear program is equivalent to the max-flow problem. The simplex method would correctly solve it.


In the case of max-flow, it is very instructive to “simulate” the simplex method, to see what effect its various iterations would have on the given network. Simplex would start with the all-zero flow, and would try to improve it. How can it find a small improvement in the flow? Answer: it finds a path from S to T (say, by depth-first search), and moves flow along this path of total value equal to the minimum capacity of an edge on the path (it can obviously do no better). This is the first iteration of simplex (see Figure 17.1).

How would simplex continue? It would look for another path from S to T. Since this time we already partially (or totally) use some of the edges, we should do depth-first search on the edges that have some residual capacity, above and beyond the flow they already carry. Thus, the edge CT would be ignored, as if it were not there. The depth-first search would now find the path S−A−D−T, and augment the flow by two more units, as shown in Figure 17.1.

Next, simplex would again try to find a path from S to T. The path is now S−A−B−D−T (the edges C−T and A−D are full and are therefore ignored), and we augment the flow as shown in the bottom of Figure 17.1.

Next simplex would again try to find a path. But since edges A−D, C−T, and S−B are full, they must be ignored, and therefore depth-first search would fail to find a path, after marking the nodes S, A, C as reachable from S. Simplex then returns the flow shown, of value 6, as maximum.


How can we be sure that it is the maximum? Notice that these reachable nodes define a cut (a set of nodes containing S but not T), and the capacity of this cut (the sum of the capacities of the edges going out of this set) is 6, the same as the max-flow value. (It must be the same, since this flow passes through this cut.) The existence of this cut establishes that the flow is optimum!

There is a complication that we have swept under the rug so far: when we do depth-first search looking for a path, we use not only the edges that are not completely full, but we must also traverse in the opposite direction all edges that already have some non-zero flow. This would have the effect of canceling some flow; canceling may be necessary to achieve optimality, see Figure 17.2. In this figure the only way to augment the current flow is via the path S−B−A−T, which traverses the edge A−B in the reverse direction (a legal traversal, since A−B is carrying non-zero flow).

Figure 17.2: Flows may have to be canceled


In general, a path from the source to the sink along which we can increase the flow is called an augmenting path. We can look for an augmenting path by doing for example a depth first search along the residual network, which we now describe. For an edge (u,v), let c(u,v) be its capacity, and let f(u,v) be the flow across the edge. Note that we adopt the following convention: if 4 units flow from u to v, then f(u,v) = 4, and f(v,u) = −4. That is, we interpret the fact that we could reverse the flow across an edge as being equivalent to a “negative flow”. Then the residual capacity of an edge (u,v) is just

c(u,v) − f(u,v).

The residual network has the same vertices as the original graph; the edges of the residual network consist of all weighted edges with strictly positive residual capacity. The idea is then if we find a path from the source to the sink in the residual network, we have an augmenting path to increase the flow in the original network. As an exercise, you may want to consider the residual network at each step in Figure 17.1.

Suppose we look for a path in the residual network using depth first search. In the case where the capacities are integers, we will always be able to push an integral amount of flow along an augmenting path. Hence, if the maximum flow is f*, the total time to find the maximum flow is O(E f*), since we may have to do an O(E) depth first search up to f* times. This is not so great.

Note that we do not have to do a depth-first search to find an augmenting path in the residual network. In fact, using a breadth-first search each time yields an algorithm that provably runs in O(VE^2) time, regardless of whether or not the capacities are integers. We will not prove this here. There are also other algorithms and approaches to the max-flow problem that improve on this running time.

To summarize: the max-flow problem can be easily reduced to linear programming and solved by simplex. But it is easier to understand what simplex would do by following its iterations directly on the network. It repeatedly finds a path from S to T along edges that are not yet full (have non-zero residual capacity), and also along any reverse edges with non-zero flow. If an S−T path is found, we augment the flow along this path, and repeat. When a path cannot be found, the set of nodes reachable from S defines a cut of capacity equal to the max-flow. Thus, the value of the maximum flow is always equal to the capacity of the minimum cut. This is the important max-flow min-cut theorem. One direction (that max-flow ≤ min-cut) is easy (think about it: the capacity of any cut is at least the value of any flow); the other direction is proved by the algorithm just described.



As it turns out, the max-flow min-cut theorem is a special case of a more general phenomenon called duality. Basically, duality means that for each maximization problem there is a corresponding minimization problem with the property that any feasible solution of the min problem is greater than or equal to any feasible solution of the max problem. Furthermore, and more importantly, they have the same optimum.

Consider the network shown in Figure 17.3, and the corresponding max-flow problem. We know that it can be written as a linear program as follows:

Figure 17.3: A simple max-flow problem

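\[
\begin{array}{rl}
\max & f_{SA} + f_{SB} \\
& f_{SA} \le 3 \\
& f_{SB} \le 2 \\
& f_{AB} \le 1 \\
& f_{AT} \le 1 \\
& f_{BT} \le 3 \\
& f_{SA} - f_{AB} - f_{AT} = 0 \\
& f_{SB} + f_{AB} - f_{BT} = 0 \\
& f \ge 0
\end{array}
\]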



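Consider now the following linear program:

\[
\begin{array}{rl}
\min & 3y_{SA} + 2y_{SB} + y_{AB} + y_{AT} + 3y_{BT} \\
& y_{SA} + u_A \ge 1 \\
& y_{SB} + u_B \ge 1 \\
& y_{AB} - u_A + u_B \ge 0 \\
& y_{AT} - u_A \ge 0 \\
& y_{BT} - u_B \ge 0 \\
& y \ge 0
\end{array}
\]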


This LP describes the min-cut problem! To see why, suppose that the uA variable is meant to be 1 if A is in the cut with S, and 0 otherwise, and similarly for B (naturally, by the definition of a cut, S will always be with S in the cut, and T will never be with S). Each of the y variables is to be 1 if the corresponding edge contributes to the cut capacity, and 0 otherwise. Then the constraints make sure that these variables behave exactly as they should. For example, the first constraint states that if A is not with S, then SA must be added to the cut. The third one states that if A is with S and B is not (this is the only case in which the sum −uA + uB becomes −1), then AB must contribute to the cut. And so on. Although the y's and u's are free to take values larger than one, they will be “slammed” by the minimization down to 1 or 0.
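With the u variables restricted to 0 or 1, the LP can be evaluated by hand for each of the four possible cuts. The sketch below does this, taking the edge capacities 3, 2, 1, 1, 3 from the coefficients of the LP's objective. Each y is set to the smallest value its constraint allows, namely max(0, u_p − u_q) for the edge from p to q, with u_S = 1 and u_T = 0. The names used are illustrative.

```python
from itertools import product

# Edge capacities, read off the objective of the min-cut LP.
capacity = {("S", "A"): 3, ("S", "B"): 2, ("A", "B"): 1, ("A", "T"): 1, ("B", "T"): 3}

def cut_cost(u_A, u_B):
    """Objective value of the min-cut LP for a 0/1 choice of u_A and u_B."""
    u = {"S": 1, "A": u_A, "B": u_B, "T": 0}
    # An edge p -> q must pay its capacity exactly when p is with S and q is not.
    return sum(c * max(0, u[p] - u[q]) for (p, q), c in capacity.items())

costs = {(a, b): cut_cost(a, b) for a, b in product((0, 1), repeat=2)}
min_cut = min(costs.values())
```

The four cuts cost 5, 6, 4 and 4, so the minimum is 4. That equals the maximum flow of this network: one unit along S−A−T, one along S−A−B−T, and two along S−B−T.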


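Let us now make a remarkable observation: these two programs have strikingly symmetric, dual, structure. This structure is most easily seen by putting the linear programs in matrix form. The first program, which we call the primal (P), we write as:

\[
\begin{array}{rrrrrrl}
\max & 1 & 1 & 0 & 0 & 0 & \\
 & 1 & 0 & 0 & 0 & 0 & \le 3 \\
 & 0 & 1 & 0 & 0 & 0 & \le 2 \\
 & 0 & 0 & 1 & 0 & 0 & \le 1 \\
 & 0 & 0 & 0 & 1 & 0 & \le 1 \\
 & 0 & 0 & 0 & 0 & 1 & \le 3 \\
 & 1 & 0 & -1 & -1 & 0 & = 0 \\
 & 0 & 1 & 1 & 0 & -1 & = 0 \\
 & \ge 0 & \ge 0 & \ge 0 & \ge 0 & \ge 0 &
\end{array}
\]

Here we have removed the actual variable names, and we have included an additional row at the bottom denoting that all the variables are non-negative. (An unrestricted variable will be denoted by unr.)

The second program, which we call the dual (D), we write as:

\[
\begin{array}{rrrrrrrrl}
\min & 3 & 2 & 1 & 1 & 3 & 0 & 0 & \\
 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & \ge 1 \\
 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & \ge 1 \\
 & 0 & 0 & 1 & 0 & 0 & -1 & 1 & \ge 0 \\
 & 0 & 0 & 0 & 1 & 0 & -1 & 0 & \ge 0 \\
 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & \ge 0 \\
 & \ge 0 & \ge 0 & \ge 0 & \ge 0 & \ge 0 & \text{unr} & \text{unr} &
\end{array}
\]

Each variable of P corresponds to a constraint of D, and vice-versa. Equality constraints correspond to unrestricted variables (the u's), and inequality constraints to restricted variables. Minimization becomes maximization. The matrices are transposes of one another, and the roles of right-hand side and objective function are interchanged.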


Such LP's are called dual to each other. It is mechanical, given an LP, to form its dual. Suppose we start with a maximization problem. Change all inequality constraints into ≤ constraints, negating both sides of an equation if necessary. Then

• transpose the coefficient matrix

• invert maximization to minimization

• interchange the roles of the right-hand side and the objective function

• introduce a nonnegative variable for each inequality, and an unrestricted one for each equality

• for each nonnegative variable introduce a ≥ constraint, and for each unrestricted variable introduce an equality constraint.

If we start with a minimization problem, we instead begin by turning all inequality constraints into ≥ constraints, we make the dual a maximization, and we change the last step so that each nonnegative variable corresponds to a ≤ constraint. Note that it is easy to show from this description that the dual of the dual is the original primal problem!
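For the special case of a maximization problem whose constraints are all ≤ inequalities and whose variables are all nonnegative, the recipe reduces to a transpose and a swap of the objective with the right-hand side. The sketch below covers only that case (the function name is an assumption) and applies it to the three-product example from the introduction to linear programming. It then checks that a particular dual-feasible y attains the same value, 320000, as the primal point x = (0, 300, 100).

```python
def dual_of(c, A, b):
    """Primal: max c.x subject to A x <= b, x >= 0.
    Dual:   min b.y subject to A^T y >= c, y >= 0.
    Returns (dual objective, dual matrix, dual right-hand side)."""
    A_transposed = [list(col) for col in zip(*A)]
    return list(b), A_transposed, list(c)

c = [100, 600, 1400]
A = [[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 1, 3]]
b = [200, 300, 400, 600]
d_obj, d_A, d_rhs = dual_of(c, A, b)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x = [0, 300, 100]        # a feasible point of the primal
y = [0, 0, 200, 400]     # one multiplier per primal constraint
primal_feasible = all(dot(row, x) <= bi for row, bi in zip(A, b))
dual_feasible = all(dot(row, y) >= ci for row, ci in zip(d_A, d_rhs))
primal_value, dual_value = dot(c, x), dot(d_obj, y)
```

Because both points are feasible and the two values agree, each certifies that the other is optimal.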


By the max-flow min-cut theorem, the two LP's P and D above have the same optimum. In fact, this is true for general dual LP's! This is the duality theorem, which can be stated as follows (we shall not prove it; the best proof comes from the simplex algorithm, very much as the max-flow min-cut theorem comes from the max-flow algorithm):

If an LP has a bounded optimum, then so does its dual, and the two optimal values coincide.



It is often useful to compose reductions. That is, we can reduce a problem A to B, and B to C, and since C we know how to solve, we end up solving A. A good example is the matching problem.

Suppose that the bipartite graph shown in Figure 17.4 records the compatibility relation between four boys and four girls. We seek a maximum matching, that is, a set of edges that is as large as possible, and in which no two edges share a node. For example, in Figure 17.4 there is a complete matching (a matching that involves all nodes).










Figure 17.4: Reduction from matching to max-flow (all capacities are 1)


To reduce this problem to max-flow, we create a new source and a new sink, connect the source with all boys and all girls with the sink, and direct all edges of the original bipartite graph from the boys to the girls. All edges have capacity one. It is easy to see that the maximum flow in this network corresponds to the maximum matching.
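Because every capacity is 1, an augmenting path in this network alternates between unmatched and matched edges, and the max-flow algorithm specializes to the short routine below. Recursing into `match[girl]` corresponds to traversing a matched edge backwards and canceling its flow. The compatibility lists are hypothetical (the figure's actual edges are not reproduced here), and the function names are illustrative.

```python
def max_matching(compatible):
    """compatible: dict boy -> list of compatible girls. Returns dict girl -> boy."""
    match = {}

    def try_assign(boy, seen):
        # Look for an augmenting path that starts at this boy.
        for girl in compatible[boy]:
            if girl in seen:
                continue
            seen.add(girl)
            # Either the girl is free, or her current partner can be moved elsewhere.
            if girl not in match or try_assign(match[girl], seen):
                match[girl] = boy
                return True
        return False

    for boy in compatible:
        try_assign(boy, set())
    return match

# Hypothetical compatibility relation between four boys and four girls.
compatible = {
    "Al": ["Eve", "Fay"],
    "Bob": ["Eve"],
    "Chet": ["Fay", "Gia"],
    "Dan": ["Gia", "Hana"],
}
matching = max_matching(compatible)
```

On this input Al is first paired with Eve and later moved to Fay so that Bob can be matched, which is exactly a flow cancellation along an augmenting path.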

Well, the situation is slightly more complicated than was stated above: what is easy to see is that the optimum integer-valued flow corresponds to the optimum matching. We would be at a loss interpreting as a matching a flow that ships .7 units along the edge Al-Eve! Fortunately, what the algorithm in the previous section establishes is that if the capacities are integers, then the maximum flow is integer. This is because we only deal with integers throughout the algorithm. Hence integrality comes for free in the max-flow problem.

Unfortunately, max-flow is about the only problem for which integrality comes for free. It is a very difficult problem to find the optimum solution (or any solution) of a general linear program with the additional constraint that (some or all of) the variables be integers. We will see why in forthcoming lectures.



We can represent various situations of conflict in life in terms of matrix games. For example, the game shown below is the rock-paper-scissors game. The Row player chooses a row strategy, the Column player chooses a column strategy, and then Column pays to Row the value at the intersection (if it is negative, Row ends up paying Column).

       r    p    s
  r    0   −1    1
  p    1    0   −1
  s   −1    1    0

Games do not necessarily have to be symmetric (that is, Row and Column have the same strategies, or, in terms of matrices, A = −A^T). For example, in the following fictitious Clinton-Dole game the strategies may be the issues on which a candidate for office may focus (the initials stand for “economy,” “society,” “morality,” and “tax-cut”) and the entries are the number of voters lost by Column.

       m    t
  e    3   −1
  s   −2    1

We want to explore how the two players may play these games “optimally.” It is not clear what this means. For example, in the first game there is no such thing as an optimal “pure” strategy (it very much depends on what your opponent does; similarly in the second game). But suppose that you play this game repeatedly. Then it makes sense to randomize. That is, consider a game given by an m × n matrix G_ij; define a mixed strategy for the row player to be a vector (x1, . . . , xm), such that xi ≥ 0, and ∑_{i=1}^{m} xi = 1. Intuitively, xi is the probability with which Row plays strategy i. Similarly, a mixed strategy for Column is a vector (y1, . . . , yn), such that yj ≥ 0, and ∑_{j=1}^{n} yj = 1.


Suppose that, in the Clinton-Dole game, Row decides to play the mixed strategy (.5, .5). What should Column do? The answer is easy: If the xi's are given, there is a pure strategy (that is, a mixed strategy with all yj's zero except for one) that is optimal. It is found by comparing the n numbers ∑_{i=1}^{m} G_ij xi, for j = 1, . . . , n (in the Clinton-Dole game, Column would compare .5 with 0, and of course choose the smallest; remember, the entries denote what Column pays). That is, if Column knew Row's mixed strategy, s/he would end up paying the smallest among the n outcomes ∑_{i=1}^{m} G_ij xi, for j = 1, . . . , n. On the other hand, Row will seek the mixed strategy that maximizes this minimum; that is,

\[
\max_x \min_j \sum_{i=1}^{m} G_{ij} x_i .
\]

This maximum would be the best possible guarantee about an expected outcome that Row can have by choosing a mixed strategy. Let us call this guarantee z; what Row is trying to do is solve the following LP:

\[
\begin{array}{rl}
\max & z \\
& z - 3x_1 + 2x_2 \le 0 \\
& z + x_1 - x_2 \le 0 \\
& x_1 + x_2 = 1
\end{array}
\]

Symmetrically, it is easy to see that Column would solve the following LP:

\[
\begin{array}{rl}
\min & w \\
& w - 3y_1 + y_2 \ge 0 \\
& w + 2y_1 - y_2 \ge 0 \\
& y_1 + y_2 = 1
\end{array}
\]

The crucial observation now is that these LP's are dual to each other, and hence have the same optimum, call it V.


Let us summarize: By solving an LP, Row can guarantee an expected income of at least V, and by solving the dual LP, Column can guarantee an expected loss of at most the same value. It follows that this is the uniquely defined optimal play (it was not a priori certain that such a play exists). V is called the value of the game. In this case, the optimum mixed strategy for Row is (3/7, 4/7), and for Column (2/7, 5/7), with a value of 1/7 for the Row player.
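The quoted strategies can be checked directly. The sketch below assumes the Clinton-Dole payoff matrix G = [[3, −1], [−2, 1]] (rows e, s; columns m, t), with signs chosen so that the matrix agrees with the comparison of .5 against 0 for Row's strategy (.5, .5). It computes what each player can guarantee against the opponent's best pure reply. The function names are illustrative.

```python
from fractions import Fraction as F

G = [[3, -1], [-2, 1]]   # assumed payoffs: what Column pays Row

def row_guarantee(G, x):
    """Worst case for Row playing mixed strategy x: Column picks the cheapest column."""
    return min(sum(G[i][j] * x[i] for i in range(len(G))) for j in range(len(G[0])))

def column_guarantee(G, y):
    """Worst case for Column playing mixed strategy y: Row picks the most profitable row."""
    return max(sum(G[i][j] * y[j] for j in range(len(G[0]))) for i in range(len(G)))

v_row = row_guarantee(G, [F(3, 7), F(4, 7)])
v_col = column_guarantee(G, [F(2, 7), F(5, 7)])
v_half = row_guarantee(G, [F(1, 2), F(1, 2)])
```

Both guarantees equal 1/7, so neither player can do better. The even split (.5, .5) guarantees Row only 0.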

The existence of mixed strategies that are optimal for both players and achieve the same value is a fundamental result in Game Theory called the min-max theorem. It can be written in equations as follows:

\[
\max_x \min_y \sum x_i y_j G_{ij} \;=\; \min_y \max_x \sum x_i y_j G_{ij}
\]

It is surprising, because the left-hand side, in which Column optimizes last, and therefore has presumably an advantage, should be intuitively smaller than the right-hand side, in which Column decides first. Duality equalizes the two, as it does in max-flow min-cut.


Circuit Evaluation








Figure 17.5: A Boolean circuit


We have seen many interesting and diverse applications of linear programming. In some sense, the next one is the ultimate application. Suppose that we are given a Boolean circuit, that is, a DAG of gates, each of which is either an input gate (indegree zero, and has a value T or F), or an OR gate (indegree two), or an AND gate (indegree two), or a NOT gate (indegree one). One of them is designated as the output gate. We wish to tell if this circuit evaluates (following the laws of Boolean values bottom-up) to T. This is known as the circuit value problem.

There is a very simple and automatic way of translating the circuit value problem into an LP: for each gate g we have a variable x_g. For all gates we have 0 ≤ x_g ≤ 1. If g is a T input gate, we have the equation x_g = 1; if it is F, x_g = 0. If it is an OR gate, say of the gates h and h′, then we have the inequality x_g ≤ x_h + x_h′. If it is an AND gate of h and h′, we have the inequalities x_g ≤ x_h, x_g ≤ x_h′ (notice the difference). For a NOT gate we say x_g = 1 − x_h. Finally, we want to max x_o, where o is the output gate. It is easy to see that the optimum value of x_o will be 1 if the circuit value is T, and 0 if it is F.

This is a rather straightforward reduction to LP, from a problem that may not seem very interesting or hard at first. However, the circuit value problem is in some sense the most general problem solvable in polynomial time! Here is a justification of this statement: after all, a polynomial time algorithm runs on a computer, and the computer is ultimately a Boolean combinational circuit implemented on a chip. Since the algorithm runs in polynomial time, it can be rendered as a circuit consisting of polynomially many superpositions of the computer's circuit. Hence, the fact that the circuit value problem reduces to LP means that all polynomially solvable problems do!

In our next topic, Complexity and NP-completeness, we shall see that a class that contains many hard problems reduces, much the same way, to integer programming.


CS124 NP-Completeness Review

Where We Are Headed

Up to this point, we have generally assumed that if we were given a problem, we could find a way to solve it. Unfortunately, as most of you know, there are many fundamental problems for which we have no efficient algorithms. In fact, by classifying these hard problems, we can show that there is a large class of simple problems for which there is (probably) no efficient algorithm: the NP-complete problems. Moreover, if you could design an efficient algorithm for any one of these problems, you could design an efficient algorithm for all of them! It's an all or none proposition, so if you could solve just one of them, you would become rich and famous overnight. These notes will review the main concepts behind the theory of NP-complete problems.

One might ask why it is important to study what problems we cannot solve, instead of focusing on problems we can solve, especially in an algorithms course. There are several possible responses, but perhaps the best is that if you do not know what is impossible, you might waste a great deal of time trying to solve it, instead of coming to terms with its impossibility and finding suitable alternatives (such as, for example, approximations instead of exact answers).




Polynomial Running Times

The faster the running time, the better. Linear is great, quadratic is all right, cubic is perhaps a bit slow. But how exactly should we classify which problems have efficient algorithms? Where is the cut off point?

The choice computer scientists have made is to group together all problems that are solvable in polynomial time. That is, we define a class of problems P as follows:

Definition: P is the set of all problems Z with a yes-no answer such that there is an algorithm A and a positive integer k such that A solves Z in O(n^k) steps (on inputs of size n).

Let us clarify some points in the definition. The restriction to problems with a yes-no answer is really just a technical convenience. For example, the problem of finding the minimum spanning tree (on a graph with integer weights) can be recast as the problem of answering the following question: is the size of the minimum spanning tree at least j? If you can answer one question, you can answer the other; considering only yes-no problems proves more convenient.


From the definition, all problems with linear, quadratic, or cubic time algorithms are in P. But so are problems with algorithms that require time Θ(n^100). This may seem a little strange; for example, would a problem with an algorithm that runs in time Θ(n^100) really be said to have an efficient solution? But the main point of defining the class P is to separate these problems from those that require exponential time, or Ω(2^(n^ε)) steps (for some ε > 0). Problems that require this much time to solve are clearly asymptotically inefficient, compared with polynomial time algorithms. The class P is also useful because, as we shall see below, it is closed under polynomial time reductions.



Let A and B be two problems whose instances require a “yes” or “no” answer. (For example, 2SAT is such a problem, as is the question of whether a bipartite graph has a perfect matching.) A (polynomial time) reduction from A to B is a polynomial time algorithm R which transforms an input of problem A into an input for problem B. That is, given an input x to problem A, R will produce an input R(x) to problem B, such that the answer to x is yes for problem A if and only if the answer for R(x) is yes for problem B.

This idea of reduction should not seem unfamiliar; all along we have seen the idea of reducing one problem to another. (For example, we recently saw how to reduce the matching problem into the max-flow problem, which could be reduced to linear programming.) The only difference is, right now, for convenience we are only considering yes-no problems.


A reduction from A to B, together with a polynomial time algorithm for B, yields a polynomial time algorithm for A. (See Figure 18.1.) Let us explain this in more detail. For any input x of A of size n, the reduction R takes time p(n), where p is a polynomial, to produce an input R(x) for B. This input R(x) can have size at most p(n), since this is the largest input R could possibly construct in p(n) time! We now submit R(x) as an input to the algorithm for B, which we assume runs in time q(m) on inputs of size m, where q is another polynomial. The algorithm for B gives us the right answer for B on R(x), and hence also the right answer for A on x. The total time taken was at most p(n) + q(p(n)), which is itself just a polynomial in n!
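As a worked instance of this bound (the particular polynomials are hypothetical, chosen only for illustration), suppose the reduction takes p(n) = n^2 steps and the algorithm for B takes q(m) = m^3 steps on inputs of size m:

```latex
\[
  p(n) + q\bigl(p(n)\bigr) \;=\; n^2 + (n^2)^3 \;=\; n^2 + n^6 \;=\; O(n^6).
\]
```

The composed algorithm is slower than either piece, but it is still polynomial, which is all that membership in P asks for.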

This idea of reduction explains why the class P is so useful. If we have a problem A in P, and some other problem B reduces to it, then B is in P as well. Hence we say that P is closed under polynomial time reductions.

If we can reduce A to B, we are essentially establishing that, give or take a polynomial, A is no harder than B. We can write this as

    A ≤ B

where the inequality represents a fact about the complexities of the two problems. If we know that B is easy, then A ≤ B establishes that A is easy.

We can also look at this inequality the other way. If we know that A is hard, then the inequality establishes that B is hard. It is this implication that we will now use, to show that problems are hard. This way of using reductions is very different from the way we have used reductions so far; it is also much more sophisticated and counter-intuitive.



Figure 18.1: Reductions lead to algorithms. The algorithm for A passes its input x through the reduction R, runs the algorithm for B on the input R(x), and returns the output for B as the output for A.


Short Certificates and the Class NP

We will now begin to examine a class of problems that includes several “hard” problems. What we mean by “hard” in this setting is that although nobody has yet shown that there are no polynomial time algorithms to solve these problems, there is overwhelming evidence that this is the case.

Recall that the class P is the class of yes-no problems that can be solved in polynomial time. The new class we define, NP, consists of yes-no problems with a different property: if the answer to the problem is yes, then there is a short certificate that can be checked to show that the answer is correct. A bit more formally, a short certificate must have the following properties:

It must be short: the length of the certificate is no more than polynomial in the length of the input.

It must certify: there is a polynomial time checker (an algorithm!) that takes the input and the short certificate and checks that the certificate is valid.

The idea of the short certificate is the following: a problem is in NP if someone else can convince you in polynomial time that the answer is yes when the answer is yes, and they cannot fool you into thinking the answer is yes when the answer is no.

Let us move from the abstract to some specific problems.

Compositeness: Testing whether a number is composite is in NP, since if somebody wanted to convince you a number is composite, they could give you its factorization (the short certificate). You could then check that the factorization was correct by doing the multiplication, in polynomial time. (Notice you can’t be fooled!)

3SAT: 3SAT is like the 2SAT problem we have seen in the homework, except that there can be up to three literals in each clause. 3SAT is in NP, since if somebody wanted to convince you that a formula is satisfiable, they could give you a satisfying truth assignment (the short certificate). You could then check the proposed truth assignment in polynomial time by plugging it in and checking each clause. (Again, notice you can’t be fooled!)
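The checker for 3SAT can be sketched in a few lines of Python. The encoding is an assumption made for this illustration: a clause is a list of nonzero integers, where k stands for variable k and -k for its negation, and the certificate is a dict from variable numbers to truth values.

```python
def check_certificate(clauses, assignment):
    """Return True if the truth assignment satisfies every clause.

    clauses: list of clauses, each a list of nonzero ints (k or -k).
    assignment: dict mapping each variable number to True or False.
    Runs in time linear in the total number of literals.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# Example formula: (x1 or not x2 or x3) and (not x1 or x2) and (not x3)
formula = [[1, -2, 3], [-1, 2], [-3]]
good = {1: True, 2: True, 3: False}
bad = {1: True, 2: False, 3: False}
```

Here `check_certificate(formula, good)` accepts, while `bad` is rejected because the clause (not x1 or x2) has no true literal; a wrong certificate cannot fool the checker.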

Finally, note that P is a subset of NP. To see why, note that if a problem is in P, we don’t even need a short certificate; someone can convince themselves of the correct answer just by running the polynomial time algorithm!

Now, let us see an example of a problem which does not appear to have short certificates:

not-satisfiable-3SAT: This is like 3SAT, but now the answer is yes if there is no satisfying assignment for the formula. Given a formula with no solution, how can we convince people there is no solution? The obvious way is to list all possible truth assignments, and show that they do not work, but this would not yield a short certificate.



The “hard” problems we will be looking at will be the hardest problems in NP; we call these problems NP-complete. An NP-complete problem will have two properties:

it is in NP

all other problems in NP reduce to it

Thus, our concept of “being the hardest” is based on reductions. If all other problems in NP reduce to a problem, it must be at least as hard as any of them! It may seem surprising that there are problems in NP that have this property.

We will start by proving (well, sketching a proof) that an easily stated problem, circuit SAT, is NP-complete. Once we have one NP-complete problem, it turns out to be much easier to prove that others are NP-complete:

Claim 18.1 Suppose problem A is NP-complete, problem B is in NP, and problem A reduces to problem B. Then problem B is NP-complete.

Intuitively, this must be true because if A reduces to B, then B is at least as hard as A. So as long as B is in NP, and the hardest problems in NP are the NP-complete ones, then B must also be NP-complete.

Slightly more formally, we have to show that every problem in NP reduces to B. But we already know that every problem in NP reduces to A, and A reduces to B. By combining reductions, as in the picture below, we have that every problem in NP reduces to B. So once we have one problem, we can start building up “chains” of NP-complete problems.




Figure 18.2: If C reduces to A, and A reduces to B, then C reduces to B. (Transitivity!)



The problem circuit SAT is defined as follows: given a Boolean circuit and the values of some of its inputs, is there a way to set the rest of its inputs so that the output is T? It is easy to show that circuit SAT is in NP.

Claim 18.2 A problem is in NP if and only if it can be reduced to circuit SAT.

This statement is known as Cook’s theorem, and it is one of the most important results in Computer Science.

One direction is easy. If a problem A can be reduced to circuit SAT, it can easily be shown to be in NP. A short certificate for an input to problem A consists of the short certificate for the circuit that results from running the reduction from A to circuit SAT on the input. Given this short certificate, a polynomial time algorithm could run the reduction on the input to A to get the appropriate circuit, and then use the short certificate to check the circuit.

The other direction is more complicated, so we offer a somewhat informal explanation. Suppose that we have a problem A in NP. We need to show that it reduces to circuit SAT. Since A is in NP, there is a polynomial time algorithm that checks the validity of inputs of A together with the appropriate certificates. But we could program this algorithm on a computer, and this program would really be just a huge Boolean circuit. (After all, computers are just big Boolean circuits themselves!) The input to this circuit is the input to problem A along with a short certificate. Now suppose we are given a specific instance x of A. The question of whether x is a yes instance or no instance is exactly the question of whether there is an appropriate short certificate, which is exactly the same question as asking if there is some way of setting the rest of the inputs to the Boolean circuit so that the answer is T. Hence, the construction of the circuit we described is the sought reduction from A to circuit SAT!


More NP-complete problems

Now that we have proved that circuit SAT is NP-complete, we will build on this to find other NP-complete problems. For example, we will now show that circuit SAT reduces to 3SAT, and since 3SAT is clearly in NP, this shows that 3SAT is NP-complete.

Suppose we are given a circuit C with some input gates unset. We must (quickly, in polynomial time) construct from this circuit a 3SAT formula R(C) which is satisfiable if and only if there is a satisfying assignment of the circuit inputs. In essence, we want to mimic the actions of the circuit with a suitable formula.

The formula R(C) will have one variable for each gate (that is, each input, and each output of an AND, OR, or NOT), and each gate will also lead to certain clauses, as described below:

1. If x is a T input gate, then add the clause (x).

2. If x is an F input gate, then add the clause (¬x).

3. If x is an unknown input gate, then no clauses are added for it.

4. If x is the OR of gates y and z, then add the clauses (¬y ∨ x), (¬z ∨ x), and (¬x ∨ y ∨ z). (It is easy to see that the conjunction of these clauses is equivalent to x ⇔ (y ∨ z).)

5. If x is the AND of gates y and z, then add the clauses (¬x ∨ y), (¬x ∨ z), and (¬y ∨ ¬z ∨ x). (It is easy to see that the conjunction of these clauses is equivalent to x ⇔ (y ∧ z).)

6. If x is the NOT of gate y, then add the clauses (x ∨ y) and (¬x ∨ ¬y). (It is easy to see that the conjunction of these clauses is equivalent to x ⇔ ¬y.)

7. Finally, if gate x is the output gate, add the clause (x), expressing the condition that the output gate should be T.

The conjunction of all of these clauses yields the formula R(C). It should be apparent that this reduction R can be accomplished in polynomial (in fact, in linear) time. To verify it is a valid reduction, we must now show that C has a setting of the unknown input gates that makes the output T if and only if R(C) is satisfiable.

Suppose C has a valid setting. Then we claim R(C) can be satisfied by the truth assignment that gives each variable the same value as the appropriate gate when C is run on this valid setting. This truth assignment must satisfy all the clauses of R(C), since we constructed R(C) to compute the same values as the circuit. Note that the output gate is T for C, and hence the final clause listed above is also satisfied.


Conversely (and this is more subtle!), if there is a valid truth assignment for R(C), then there is a valid setting for the inputs of C that makes the output T. Just set the unknown input gates in the manner prescribed by the truth assignment for R(C). Since R(C) effectively mimics the computation of the circuit, we know the output gate must be T when these inputs are applied.
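The gate-by-gate construction of the formula can be sketched in Python as follows. The encoding is hypothetical: gates are a dict from names to (kind, inputs), with kind "?" for an unknown input, and clauses are lists of signed integers (k for a variable, -k for its negation). The brute-force satisfiability test is included only to try the reduction on tiny circuits; it is exponential and is not part of the reduction.

```python
from itertools import product

def circuit_to_cnf(gates, output):
    """Build the clauses of the formula, one group per gate.

    gates: dict name -> (kind, inputs), kind in {"T","F","?","OR","AND","NOT"}.
    Returns (clauses, index), where index maps gate names to variables 1..n.
    """
    index = {g: i + 1 for i, g in enumerate(gates)}
    clauses = []
    for g, (kind, ins) in gates.items():
        x = index[g]
        if kind == "T":
            clauses.append([x])
        elif kind == "F":
            clauses.append([-x])
        elif kind == "OR":
            y, z = (index[h] for h in ins)
            clauses += [[-y, x], [-z, x], [-x, y, z]]
        elif kind == "AND":
            y, z = (index[h] for h in ins)
            clauses += [[-x, y], [-x, z], [-y, -z, x]]
        elif kind == "NOT":
            y = index[ins[0]]
            clauses += [[x, y], [-x, -y]]
        # kind == "?" (unknown input gate): no clauses are added
    clauses.append([index[output]])   # the output gate should be T
    return clauses, index

def satisfiable(clauses, n):
    """Try all 2^n assignments; for tiny illustrations only."""
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False
```

For example, a circuit computing (unknown input) AND T yields a satisfiable formula, while (unknown input) AND F yields an unsatisfiable one. Each gate contributes at most three clauses, which is why the reduction takes linear time.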


From 3SAT to Integer Linear Programming

We must take a 3SAT formula and convert it to an integer linear program. This reduction is easy. Restrict all variables so that they are either 0 or 1 by including the constraint 0 ≤ x ≤ 1. Now a clause such as (x ∨ ¬y ∨ z) can be turned into a linear constraint by replacing ∨ by +, a literal x by x, and a literal ¬x by (1 − x), and then forcing the whole thing to be at least 1. For example, the above clause becomes x + (1 − y) + z ≥ 1. The clause is clearly satisfied if and only if this constraint is; all terms on the left of the inequality are either 0 or 1, and there is at least one 1 if and only if one of the literals of the clause is true.
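A short Python sketch of this translation follows. The representation is an assumption made for illustration: a clause is a list of signed integers (k for a variable, -k for its negation), and a constraint is returned as a pair (coeffs, rhs) meaning that the sum of coeffs[v] * x_v must be at least rhs, with the constant 1 of each negated literal moved to the right-hand side.

```python
from itertools import product

def clause_to_constraint(clause):
    """Turn a clause into (coeffs, rhs), read as sum(coeffs[v]*x_v) >= rhs."""
    coeffs, rhs = {}, 1
    for lit in clause:
        v = abs(lit)
        if lit > 0:
            coeffs[v] = coeffs.get(v, 0) + 1      # literal x becomes x
        else:
            coeffs[v] = coeffs.get(v, 0) - 1      # literal not-x becomes 1 - x
            rhs -= 1                              # move the constant 1 across
    return coeffs, rhs

def holds(coeffs, rhs, x):
    """Check the linear constraint at a 0/1 point x (dict variable -> 0 or 1)."""
    return sum(c * x[v] for v, c in coeffs.items()) >= rhs
```

For the clause (x ∨ ¬y ∨ z), encoded as [1, -2, 3], this gives x − y + z ≥ 0, which is the constraint x + (1 − y) + z ≥ 1 with the constant moved to the right.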

It is somewhat strange that linear programming can be solved in polynomial time, but when we try to restrict the solutions to be integers, the problem appears not to be solvable in polynomial time (since it is NP-complete).


From 3SAT to Independent Set

In an input to Independent Set we are given a graph G = (V, E) and an integer K. We are asked if there is a set I ⊆ V with |I| ≥ K such that if u, v ∈ I then (u, v) ∉ E. That is, we are asked to find a set of vertices of size at least K such that no two are connected by an edge. The problem is clearly in NP. (Why?)

We reduce 3SAT to Independent Set. That is, given a Boolean formula φ with at most 3 literals in each clause, we must (in polynomial time) come up with a graph G = (V, E) and an integer K so that G has an independent set of size K or more if and only if the formula φ is satisfiable.

The reduction is illustrated in Figure 18.3. For each clause, we have a group of vertices, one for each literal in the clause, connected by all possible edges. Between groups of vertices, we connect two vertices if they correspond to opposite literals (like x and ¬x). We let K be the number of clauses. This completes the reduction, and it is clear that it can be accomplished in polynomial time. We now show there is a satisfying truth assignment for φ if and only if there is an independent set of size at least K.









Figure 18.3: Turning formulae into graphs.


If there is a satisfying truth assignment for φ, then there is at least one true literal in each clause. Pick just one for each clause in any way. The set I of corresponding vertices must give an independent set of size K. This is because we use only one vertex per clause, so the only way I could not be independent is if it included two opposite literals, which is impossible, because the satisfying assignment cannot set two opposite literals to T.

Now suppose G has an independent set I of size K. Since there are K groups, and each group is completely interconnected, there must be one vertex from each group in I. Consider the assignment that sets all literals in I to T, their opposites to F, and any unused variables arbitrarily. It is clear that this is a valid truth assignment (I cannot contain two opposite literals, since they are joined by an edge, so if a literal is set to T, its opposite is set to F). It satisfies φ, because every clause has a literal in I, and that literal is set to T.
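The construction of the graph can be sketched in Python as follows. The encoding is an assumption for illustration: clauses are lists of signed integers (k for a variable, -k for its negation), and a vertex is a pair (clause index, literal). The brute-force independent set search is exponential and is there only to try the reduction on tiny formulas.

```python
from itertools import combinations

def formula_to_graph(clauses):
    """Return (vertices, edges, K) for the Independent Set instance."""
    vertices = [(c, lit) for c, clause in enumerate(clauses) for lit in clause]
    edges = set()
    for u, v in combinations(vertices, 2):
        same_clause = u[0] == v[0]     # a clause's group is fully connected
        opposite = u[1] == -v[1]       # opposite literals are connected
        if same_clause or opposite:
            edges.add((u, v))
    return vertices, edges, len(clauses)

def has_independent_set(vertices, edges, k):
    """Brute force over all k-subsets; for tiny illustrations only."""
    for subset in combinations(vertices, k):
        if all((u, v) not in edges and (v, u) not in edges
               for u, v in combinations(subset, 2)):
            return True
    return False
```

On the satisfiable formula (x1 ∨ x2)(¬x1)(x2) the graph has an independent set of size 3, while on the unsatisfiable formula (x1 ∨ x2)(¬x1)(¬x2) it does not.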


From Independent Set to Vertex Cover and Clique

Let G = (V, E) be a graph. A vertex cover of G is a set C ⊆ V such that all edges in E have at least one endpoint in C. That is, each edge is adjacent to at least one vertex in the vertex cover. The Vertex Cover problem is, given a graph G and a number K, to determine if G has a vertex cover of size at most K.

The reduction from Independent Set to Vertex Cover is immediate from the following observation: C is a vertex cover of G = (V, E) if and only if V − C is an independent set! (For example, suppose I is an independent set, and consider some edge (u, v). Both u and v can’t be in the independent set, so V − I contains either u or v or both, and the edge is covered.) So the reduction is trivial; given an instance (G, K) of Independent Set, we produce the instance (G, |V| − K) of Vertex Cover.

A clique in a graph is a set of fully connected nodes: every possible edge between every pair of the nodes is there. The clique problem asks whether there is a clique of size K or larger in the graph. Again, the reduction from Independent Set is immediate from a simple observation. Let Ḡ be the complement of G, which is the graph with the same nodes as G, but the edges of Ḡ are precisely those edges that are missing from G. Then C is a clique in G = (V, E) if and only if C is an independent set in Ḡ. (See Figure 18.4.)
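Both observations are easy to check mechanically on small graphs. The following Python sketch (function names and the set-of-frozensets edge representation are our own choices for illustration) tests a set for being independent, a vertex cover, or a clique, and builds the complement graph.

```python
from itertools import combinations

def is_independent(S, E):
    """No edge of E joins two vertices of S."""
    return not any(frozenset(p) in E for p in combinations(S, 2))

def is_vertex_cover(C, E):
    """Every edge of E has at least one endpoint in C."""
    return all(e & C for e in E)

def is_clique(S, E):
    """Every pair of vertices of S is joined by an edge of E."""
    return all(frozenset(p) in E for p in combinations(S, 2))

def complement(V, E):
    """Edges of the complement graph: all pairs that are missing from E."""
    return {frozenset(p) for p in combinations(V, 2)} - E

# A path on four vertices: 1 - 2 - 3 - 4
V = {1, 2, 3, 4}
E = {frozenset(p) for p in [(1, 2), (2, 3), (3, 4)]}
```

On the path above, {1, 3} is independent, its complement {2, 4} is a vertex cover, and {1, 3} is a clique in the complement graph; the same three-way equivalence holds for every subset of V.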


Figure 18.4: Independent sets become cliques in the complement.


CS124 Lecture 19

We have defined the class of NP-complete problems, which have the property that if there is a polynomial time algorithm for any one of these problems, there is a polynomial time algorithm for all of them. Unfortunately, nobody has found a polynomial time algorithm for any NP-complete problem, and it is widely believed that it is impossible to do so.

This might seem like a big hint that we should just give up, and go back to solving problems like MAX-FLOW, where we can find a polynomial time solution. Unfortunately, NP-complete problems show up all the time in the real world, and people want solutions to these problems. What can we do?



What Can We Do?

Actually, there is a great deal we can do. Here are just a few possibilities:

NP-completeness refers to the worst case inputs for a problem. Often, inputs are not as bad as those that arise in NP-completeness proofs. For example, although the general SAT problem is hard, we have seen that the cases of 2SAT and Horn formulae have simple polynomial time algorithms.

NP-completeness results often arise because we want an exact answer. If we relax the problem so that we only have to return a good answer, then we might be able to develop a polynomial time algorithm. For example, we have seen that a greedy algorithm provides an approximate answer for the SET COVER problem.

Sometimes we might not be able to make absolute guarantees, but we can develop algorithms that seem to work well in practice, and have arguments suggesting why they should work well. For example, the simplex algorithm for linear programming is exponential in the worst case, but in practice it’s generally the right tool for solving linear programming problems.

So far, all our algorithms have been deterministic; they always run the same way on the same input. Perhaps if we let our algorithm do some things randomly, we can avoid the NP-completeness problem?

Actually, the question of whether one can use randomness to solve an NP-complete problem is still open, though it appears unlikely. (As is, of course, the problem of whether one can solve an NP-complete problem in polynomial time!) However, randomness proves a useful tool when we try to come up with approximation algorithms and heuristics. Also, if one can assume the input comes from a suitable “random distribution”, then often one can develop an algorithm that works well on average.

To begin, we will look at heuristic methods. The amount we can prove about these methods is (as yet) very limited. However, these techniques have had some success in practice, and there are arguments in favor of why they are reasonable things to try for some problems.


Local Search

“Local search” is meant to represent a large class of similar techniques that can be used to find a good solution for a problem. The idea is to think of the solution space as being represented by an undirected graph. That is, each possible solution is a node in the graph. An edge in the graph represents a possible move we can make between solutions.

For example, consider the Number Partition problem from the homework assignment. Each possible solution, or division of the set of numbers into two groups, would be a vertex in the graph of all possible solutions. For our possible moves, we could move between solutions by changing the sign associated with a number. So in this case, in our graph of all possible solutions, we have an edge between any two possible solutions that differ in only one sign. Of course this graph of all possible solutions is huge; there are 2^n possible solutions when there are n numbers in the original problem! We could never hope to even write this graph down. The idea of local search is that we never actually try to write the whole graph down; we just move from one possible solution to a “nearby” possible solution, either for as long as we like, or until we happen to find an optimal solution.


To set up a local search algorithm, we need to have the following:

1. A set of possible solutions, which will be the vertices in our local search graph.

2. A notion of what the neighbors of each vertex in the graph are. For each vertex x, we will call the set of adjacent vertices N(x). The neighbors must satisfy several properties: N(x) must be easy to compute from x (since if we try to move from x we will need to compute the neighbors), if y ∈ N(x) then x ∈ N(y) (so it makes sense to represent neighbors as undirected edges), and N(x) cannot be too big, or more than polynomial in the input size (so that the neighbors of a node are easy to search through).

3. A cost function, from possible solutions to the real numbers.

The most basic local search algorithm (say to minimize the cost function) is easily described:

1. Pick a starting point x.

2. While there is a neighbor y of x with f(y) < f(x), move to it; that is, set x to y and continue.

3. Return the final solution.
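The three steps above can be sketched in Python for the Number Partition problem. This is a minimal illustration under stated assumptions: a solution is a tuple of signs (+1 or -1), the cost is the residue (the absolute value of the signed sum), neighbors differ in exactly one sign, and we move to the first improving neighbor found. The function names are ours.

```python
def residue(nums, signs):
    """Cost function: |sum of s_i * a_i| for the given sign vector."""
    return abs(sum(s * a for s, a in zip(signs, nums)))

def neighbors(signs):
    """All sign vectors that differ from `signs` in exactly one position."""
    for i in range(len(signs)):
        yield signs[:i] + (-signs[i],) + signs[i + 1:]

def local_search(nums, start):
    """Basic local search: move to an improving neighbor until none exists."""
    x = tuple(start)
    while True:
        better = [y for y in neighbors(x) if residue(nums, y) < residue(nums, x)]
        if not better:
            return x            # x is a local optimum
        x = better[0]           # take the first improving neighbor
```

Since the residue strictly decreases with every move, the loop always terminates, but only at a local optimum; on the input [10, 8, 7, 6, 5] starting from all plus signs it happens to reach residue 0 in two moves.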


The Reasoning Behind Local Search

The idea behind local search is clear; if we keep getting better and better solutions, we should end up with a good one. Pictorially, if we “project” the state space down to a two dimensional curve, we are hoping that the picture has a sink, or global optimum, and that we will quickly move toward it. See Figure 19.1.



Figure 19.1: A very nice state space, with a single sink at the global optimum x*.

There are two possible problems with this line of thinking. First, even if the space does look this way, we might not move quickly enough toward the right solution. For example, for the number partition problem from the homework, it might be that each move improves our solution, but only by improving the residue by 1 each time. If we start with a bad solution, it will take a lot of moves to reach the minimum. Generally, however, this is not much of a problem, as long as the cost function is reasonably simple.


The more important problem is that the solution space might not look this way at all. For example, our cost function might not change smoothly when we move from a state to its neighbor. Also, it may be that there are several local optima, in which case our local search algorithm will home in on a local optimum and get stuck. See Figure 19.2.



Figure 19.2: A state space with many local optima besides the global optimum x*; it will be hard to find the best solution.

This second problem, that the solution space might not “look nice”, is crucial, and it underscores the importance of setting up the problem. When we choose the possible moves between solutions, that is, when we construct the mapping that gives us the neighborhood of each node, we are setting up how local search will behave, including how the cost function will change between neighbors, and how many local optima there are. How well local search will work depends tremendously on how smart one is in setting up the right neighborhoods, so that the solution space really does look the way we would like it to.


Examples of Neighborhoods

We have already seen an example of a neighborhood for the homework problem. Here are possible neighborhoods for other problems:

MAX3SAT: A possible neighborhood structure is that two truth assignments are neighbors if they differ in only one variable. A more extensive neighborhood could make two truth assignments neighbors if they differ in at most two variables; this trades increased flexibility for increased size in the neighborhood.

Travelling Salesperson: The k-opt neighborhood of x is given by all tours that differ in at most k edges from x. In practice, using the 3-opt neighborhood seems to perform better than the 2-opt neighborhood, and using 4-opt or larger increases the neighborhood size to a point where it is inefficient.


Lots of Room to Experiment

There are several aspects of local search algorithms that we can vary, and all can have an impact on performance. For example:

1. What are the neighborhoods N(x)?

2. How do we choose an initial starting point?

3. How do we choose a neighbor y to move to? (Do we take the first one we find, a random neighbor that improves f, the neighbor that improves f the most, or do we use other criteria?)

4. What if there are ties?

There are other practical considerations to keep in mind. Can we re-run the algorithm several times? Can we try several of the algorithms on different machines? Issues like these can have a big impact on actual performance. However, perhaps the most important issue is to think of the right neighborhood structure to begin with; if this is right, then other issues are generally secondary, and if this is wrong, you are likely to fail no matter what you do.


Local Search Variations

There are many variations on the local search technique (below, assume the goal is to minimize the cost function):

Hill-climbing: this is the name for the basic variation, where one moves to a vertex of lower (or possibly equal) cost.

Metropolis rule: pick a random neighbor, and if the cost is lower, move there. If the cost is higher, move there with some probability (that is usually set to depend on the cost differential). The idea is that possibly moving to a worse state helps avoid getting trapped at local minima.

Simulated annealing: this method is similar to the Metropolis rule, except that the probability of going to a higher cost neighbor varies with time. This is analogous to a physical system (such as a chemical polymer) being cooled down over time.

Tabu search: this adds some memory to hill climbing. Like with the Metropolis rule and simulated annealing, you can go to worse solutions. A penalty function is added to the cost function to try to prevent cycling and promote searching new areas of the search space.

Parallel search (“go with the winners”): do multiple searches in parallel, occasionally killing off searches that appear less successful and replacing them with copies of searches that appear to be doing better.

Genetic algorithms: this trendy area is actually quite related to local search. An important difference is that instead of keeping one solution at a time, a group of them (called a population) is kept, and the population changes at each step.
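To make the Metropolis rule and simulated annealing concrete, here is a minimal Python sketch. The acceptance probability exp(-delta / T) is one common choice and is an assumption here, as are the function names and the idea of passing the cooling schedule as a list of temperatures; a constant list gives the Metropolis rule, a decreasing one gives simulated annealing.

```python
import math
import random

def accept_probability(delta, temperature):
    """Probability of moving to a neighbor whose cost is higher by `delta`.

    Moves that do not increase the cost are always accepted. The form
    exp(-delta / T) is an assumed, commonly used choice.
    """
    if delta <= 0:
        return 1.0
    return math.exp(-delta / temperature)

def anneal(cost, random_neighbor, x, temperatures, rng=random):
    """Run one step per temperature; return the best solution seen.

    cost: function from solutions to numbers (to be minimized).
    random_neighbor: function (solution, rng) -> a random neighbor.
    """
    best = x
    for t in temperatures:
        y = random_neighbor(x, rng)
        if rng.random() < accept_probability(cost(y) - cost(x), t):
            x = y               # possibly a move to a worse state
        if cost(x) < cost(best):
            best = x
    return best
```

A higher temperature makes uphill moves more likely, so a schedule that starts hot and cools down explores widely at first and behaves more like hill-climbing toward the end.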

It is still quite unclear what exactly each of these techniques adds to the pot. For example, some people swear that genetic algorithms lead to better solutions more quickly than other methods, while others claim that by choosing the right neighborhood function one can do as well with hill climbing. In the years to come, hopefully more will become understood about all of these methods.

If you’re interested, you might try looking for genetic algorithms and simulated annealing in Yahoo. They’re



CS124 Lecture 20

Heuristics can be useful in practice, but sometimes we would like to have guarantees. Approximation algorithms give guarantees. It is worth keeping in mind that approximation algorithms do not always perform as well as heuristic-based algorithms. Other times they provide insight into the problem, so they can help determine good heuristics.

Often when we talk about an approximation algorithm, we give an approximation ratio. The approximation ratio gives the ratio between the value of our solution and the value of the optimal solution. The goal is to obtain an approximation ratio as close to 1 as possible. If the problem involves a minimization, the approximation ratio will be at least 1; if it involves a maximization, the approximation ratio will be at most 1.



Vertex Cover Approximations

In the Vertex Cover problem, we wish to find a set of vertices of minimal size such that every edge is adjacent

to some vertex in the cover. That is, given an undirected graph G = (V,E), we wish to find U ⊆ V such that every

edge e ∈ E has an endpoint in U . We have seen that Vertex Cover is NP-complete.

A natural greedy algorithm for Vertex Cover is to repeatedly choose a vertex with the highest degree, and put it into the cover. When we put the vertex in the cover, we remove the vertex and all its adjacent edges from the graph, and continue. Unfortunately, in this case the greedy algorithm gives us a rather poor approximation, as can be seen with the following example:

Figure 20.1: A bad greedy example, marking the vertices chosen by greedy and the vertices in the min cover.

In the example, all edges are connected to the base level; there are m/2 vertices at the next level, m/3 vertices at the next level, and so on. Each vertex at the base level is connected to one vertex at each other level, and the connections are spread as evenly as possible at each level. A greedy algorithm could always choose a rightmost vertex, whereas the optimal cover consists of the leftmost vertices. This example shows that, in general, the greedy approach could be off by a factor of Ω(log n), where n is the number of vertices.


A better algorithm for vertex cover is the following: repeatedly choose an edge, and throw both of its endpoints

into the cover. Remove both vertices and all their incident edges from the graph, and continue.

It is easy to show that this second algorithm uses at most twice as many vertices as the optimal vertex cover.

The chosen edges share no endpoints, and every vertex cover, including the optimal one, must contain at least one

endpoint of each chosen edge; hence we have merely always thrown two vertices in where we might have gotten away with throwing in 1.

Somewhat surprisingly, this simple algorithm is still essentially the best known approximation algorithm for the vertex

cover problem. That is, no polynomial-time algorithm is known to guarantee a constant factor better than 2.
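The edge-picking algorithm above can be sketched in a few lines of Python. This is a minimal sketch, assuming the graph is given as a list of vertex pairs; the function name `vertex_cover_2approx` is made up for illustration. Skipping an edge that already touches the cover plays the role of deleting covered edges from the graph.

```python
def vertex_cover_2approx(edges):
    """Repeatedly pick an uncovered edge and add both endpoints to the cover."""
    cover = set()
    for u, v in edges:
        # An edge with an endpoint already in the cover has effectively
        # been removed from the graph, so we skip it.
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover


# A star centered at 0 plus one separate edge; {0, 4} is an optimal cover.
edges = [(0, 1), (0, 2), (0, 3), (4, 5)]
cover = vertex_cover_2approx(edges)
```

On this small input the sketch returns four vertices where two would do, which matches the factor of 2 in the analysis.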


Maximum Cut Approximation

We will provide both a randomized and a deterministic approximation algorithm for the MAX CUT problem.

The MAX CUT problem is to divide the vertices in a graph into two disjoint sets so that the number of edges

between vertices in different sets is maximized. This problem is NP-hard. Notice that the MIN CUT problem can

be solved in polynomial time by repeatedly using the max flow-min cut algorithm. (Exercise: Prove this!)

The randomized version of the algorithm is as follows: we divide the vertices into two sets, HEADS and TAILS.

We decide where each vertex goes by flipping a (fair) coin.

What is the probability an edge crosses between the sets of the cut? This will happen only if its two endpoints

lie on different sides, which happens 1/2 of the time. (There are 4 possibilities for the two endpoints – HH,HT,TT,TH

– and two of these put the vertices on different sides.) So, on average, we expect 1/2 the edges in the graph to cross

the cut. Since the most we could have is for all the edges to cross the cut, this random assignment will, on average,

be within a factor of 2 of optimal.
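The expectation argument can be checked exactly on a small graph by averaging the cut size over every possible outcome of the coin flips. The sketch below assumes vertices are numbered 0 to n−1 and the graph has no self-loops; the helper names are illustrative only.

```python
from itertools import product


def cut_size(edges, side):
    """Number of edges whose two endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def average_cut_over_all_flips(n, edges):
    """Average cut size over all 2^n equally likely HEADS/TAILS outcomes."""
    total = 0
    for side in product((0, 1), repeat=n):
        total += cut_size(edges, side)
    return total / 2 ** n


# A 4-cycle with one chord: 5 edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
avg = average_cut_over_all_flips(4, edges)
```

Each edge crosses the cut in exactly half of the outcomes, so the average comes out to half the number of edges regardless of the graph's shape.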


We now examine a deterministic algorithm with the same “approximation ratio”. (In fact, the two algorithms

are intrinsically related, but this is not so easy to see!) The algorithm implements the hill climbing approximation

heuristic. We will split the vertices into sets S1 and S2. Start with all vertices on one side of the cut. Now, if you can

switch a vertex to a different side so that it increases the number of edges across the cut, do so. Repeat this action

until the cut can no longer be improved by this simple switch.

We switch vertices at most |E| times (since each time, the number of edges across the cut increases). Moreover,

when the process finishes we are within a factor of 2 of the optimal, as we shall now show. In fact, when the process

finishes, at least |E|/2 edges lie in the cut.
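The hill climbing procedure just described might look as follows in Python. This is a sketch under stated assumptions: vertices are numbered 0 to n−1, the graph is simple (no self-loops), and the function names are hypothetical. A vertex is switched whenever more of its edges stay on its own side than cross the cut, since switching then strictly increases the cut.

```python
def cut_size(edges, side):
    """Number of edges whose two endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def local_search_max_cut(n, edges):
    """Switch vertices one at a time until no single switch improves the cut."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    side = [0] * n  # start with all vertices on one side of the cut
    improved = True
    while improved:
        improved = False
        for v in range(n):
            same = sum(1 for w in adj[v] if side[w] == side[v])
            across = len(adj[v]) - same
            if same > across:  # switching v gains (same - across) cut edges
                side[v] = 1 - side[v]
                improved = True
    return side


# The complete graph on 4 vertices.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
side = local_search_max_cut(4, edges)
cut = cut_size(edges, side)
```

The loop stops only after a full pass with no switch, which is exactly the local optimality condition used in the analysis that follows.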

We can count the edges in the cut in the following way: consider any vertex v ∈ S1. For every vertex w in S2

that it is connected to by an edge, we add 1/2 to a running sum. We do the same for each vertex in S2. Note that

each edge crossing the cut contributes 1 to the sum: 1/2 for each endpoint of the edge.

Hence the cut C satisfies

\[
C = \frac{1}{2}\left( \sum_{v \in S_1} |\{w : (v,w) \in E,\ w \in S_2\}| + \sum_{v \in S_2} |\{w : (v,w) \in E,\ w \in S_1\}| \right).
\]



Since we are using the local search algorithm, at least half the edges from any vertex v must lie in the set opposite

from v; otherwise, we could switch what side vertex v is on, and improve the cut! Hence, if vertex v has degree δ(v),


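\[
C = \frac{1}{2}\left( \sum_{v \in S_1} |\{w : (v,w) \in E,\ w \in S_2\}| + \sum_{v \in S_2} |\{w : (v,w) \in E,\ w \in S_1\}| \right)
\ge \frac{1}{2}\left( \sum_{v \in S_1} \frac{\delta(v)}{2} + \sum_{v \in S_2} \frac{\delta(v)}{2} \right)
= \frac{1}{4} \sum_{v \in V} \delta(v)
= \frac{1}{2}|E|,
\]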
where the last equality follows from the fact that if we sum the degree of all vertices, we obtain twice the number of

edges, since we have counted each edge twice.

In practice, we might expect that the hill climbing algorithm would do better than just getting a cut within a factor

of 2.


Euclidean Travelling Salesperson Problem

In the Euclidean Travelling Salesperson Problem, we are given n points (cities) in the x-y plane, and we seek

the tour (cycle) of minimum length that travels through all the cities. This problem is NP-hard (showing this is

somewhat difficult).

Our approximation algorithm involves the following steps:

1. Find a minimum spanning tree T for the points.

2. Create a pseudo tour by walking around the tree. The pseudo tour may visit some vertices more than once.

3. Remove repeats from the tour by short-cutting through the repeated vertices. (See Figure 20.2.)
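The three steps above can be sketched as follows. This is an illustrative sketch, not the notes' own code: it assumes the cities are (x, y) tuples, builds the minimum spanning tree with a simple quadratic version of Prim's algorithm, and uses the fact that listing tree vertices in DFS preorder is the same as walking around the tree and short-cutting every repeated vertex. The names `mst_tour` and `tour_length` are made up.

```python
import math


def mst_tour(points):
    """Return (tour, mst_weight): a tour from an MST preorder walk."""
    n = len(points)

    def dist(a, b):
        return math.dist(points[a], points[b])

    # Step 1: Prim's algorithm, starting from vertex 0.
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge into the tree
    parent = [-1] * n
    children = [[] for _ in range(n)]
    best[0] = 0.0
    mst_weight = 0.0
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        mst_weight += best[u]
        if parent[u] >= 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and dist(u, v) < best[v]:
                best[v] = dist(u, v)
                parent[v] = u

    # Steps 2 and 3: the preorder of the tree is the short-cut pseudo tour.
    tour, stack = [], [0]
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    return tour, mst_weight


def tour_length(points, tour):
    """Length of the closed tour that returns to its starting city."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


points = [(0, 0), (0, 1), (1, 1), (1, 0), (2, 0.5)]
tour, mst_weight = mst_tour(points)
length = tour_length(points, tour)
```

The returned MST weight makes it easy to compare the tour length against twice the tree size, the quantity that appears in the analysis.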



[Figure 20.2: Building an approximate tour. Start at X, move in the direction shown, short-cutting repeated vertices. Panels: minimum spanning tree; constructed pseudo tour; constructed tour.]


We now show the following inequalities:

length of tour ≤ length of pseudo tour

≤ 2(size of T)

≤ 2(length of optimal tour)

Short-cutting edges can only decrease the length of the tour, so the tour given by the algorithm is at most the

length of the pseudo tour. The length of our pseudo tour is at most twice the size of the spanning tree, since this

pseudo tour consists of walking through each edge of the tree at most twice. Finally, the length of the optimal tour

is at least the size of the minimum spanning tree, since any tour contains a spanning tree (plus an edge!).

Using a similar idea, one can come up with an approximation algorithm that returns a tour that is within a factor

of 3/2 of the optimal. Also, note that this algorithm will work in any setting where short-cutting is effective. More

specifically, it will work for any instance of the travelling salesperson problem that satisfies the triangle inequality

for distances: that is, if d(x,y) represents the distance between vertices x and y, then d(x,z) ≤ d(x,y)+d(y,z) for all

x,y and z.


MAX-SAT: Applying Randomness

Consider the MAX-SAT problem. What happens if we do the simplest random thing we can think of: we

decide whether each variable should be TRUE or FALSE by flipping a coin?

Theorem 20.1 On average, at least half the clauses will be satisfied if we just flip a coin to decide the value of each

variable. Moreover, if each clause has k literals, then on average a 1 − 2^{-k} fraction of the clauses will be satisfied.

The proof is simple. Look at each clause. If it has k literals in it (on distinct variables), then each literal fails to make

the clause TRUE with probability 1/2, independently of the others. So the probability the clause is not satisfied is 2^{-k},

and the probability it is satisfied is 1 − 2^{-k}, where k is the number of literals in the clause. Since k ≥ 1, this is at least 1/2.
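The theorem can be checked exactly on a small formula by averaging over every truth assignment. The sketch below assumes a clause is a list of nonzero integers, where i stands for the variable xi and −i for its negation (a common convention, not one the notes use), and that no clause repeats a variable.

```python
from itertools import product


def num_satisfied(clauses, assignment):
    """Count clauses with at least one TRUE literal under the assignment."""
    return sum(
        1 for clause in clauses
        if any(assignment[abs(lit)] == (lit > 0) for lit in clause)
    )


def average_satisfied(n, clauses):
    """Average number of satisfied clauses over all 2^n coin-flip outcomes."""
    total = 0
    for bits in product((False, True), repeat=n):
        assignment = dict(zip(range(1, n + 1), bits))
        total += num_satisfied(clauses, assignment)
    return total / 2 ** n


clauses = [[1], [-1, 2], [1, -2, 3], [2, 3, -4, -1]]
avg = average_satisfied(4, clauses)
predicted = sum(1 - 2 ** -len(c) for c in clauses)
```

Here `predicted` adds up 1 − 2^{-k} over the clauses, which is what linearity of expectation gives when clauses have different lengths.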



Linear Programming Relaxation

The next approach we describe, linear programming relaxation, can often be used as a good heuristic, and

in some cases it leads to approximation algorithms with provable guarantees. Again, we will use the MAX-SAT

problem as an example of how to use this technique.

The idea is simple. Most NP-complete problems can be easily described by a natural Integer Programming

problem. (Of course, all NP-complete problems can be transformed into some Integer Programming problem, since

Integer Programming is NP-complete; but what we mean here is in many cases the transformation is quite natural.)

Even though we cannot solve the related Integer Program, if we pretend it is a linear program, then we can solve it,

using (for example) the simplex method. This idea is known as relaxation, since we are relaxing the constraints on

the solution; we are no longer requiring that we get a solution where the variables take on integer values.

If we are extremely lucky, we might find a solution of the linear program where all the variables are integers,

in which case we will have solved our original problem. Usually, we will not. In this case we will have to try to

somehow take the linear programming solution, and modify it into a solution where all the variables take on integer

values. Randomized Rounding is one technique for doing this.



We may formulate MAX-SAT as an integer programming problem in a straightforward way (in fact, we have

seen a similar reduction before, back when we examined reductions; it is repeated here). Suppose the formula

contains variables x1, x2, . . . , xn which must be set to TRUE or FALSE, and clauses C1, C2, . . . , Cm. For each variable

xi we associate a variable yi which should be 1 if the variable is TRUE, and 0 if it is FALSE. For each clause Cj we

have a variable zj which should be 1 if the clause is satisfied and 0 otherwise.

We wish to maximize the number of satisfied clauses s, or

\[
\sum_{j=1}^{m} z_j.
\]

The constraints include that 0 ≤ yi, zj ≤ 1; since this is an integer program, this forces all these variables

to be either 0 or 1. Finally, we need a constraint for each clause saying that its associated variable zj can be 1 if and

only if the clause is actually satisfied. If the clause Cj is (x2 ∨ ¬x4 ∨ x6 ∨ ¬x8), for example, then we need the restriction:

y2 + y6 + (1 − y4) + (1 − y8) ≥ zj.

This forces zj to be 0 unless the clause can be satisfied. In general, we replace xi by yi, ¬xi by 1 − yi, ∨ by +, and set

the whole thing ≥ zj to get the appropriate constraint.

When we solve the linear program, we will get a solution that might have y1 = 0.7 and z1 = 0.6, for instance.

This initially appears to make no sense, since a variable cannot be 0.7 TRUE. But we can still use these values in a

reasonable way. If y1 = 0.7, it suggests that we would prefer to set the variable x1 to TRUE (1). In fact, we could

try just rounding each variable up or down to 0 or 1, and use that as a solution! This would be one way to turn

the non-integer solution into an integer solution. Unfortunately, there are problems with this method. For example,

suppose we have the clause C1 = (x1∨x2∨x3), and y1 = y2 = y3 = 0.4. Then by simple rounding, this clause will not

be TRUE, even though it “seems satisfied” to our linear program (that is, z1 = 1). If we have a lot of these clauses,

regular rounding might perform very poorly.

It turns out that there is an interpretation for 0.7 that suggests a better way than simple rounding. We think of the

0.7 as a probability. That is, we interpret y1 = 0.7 as meaning that x1 would like to be true with probability 0.7.

So we take each variable xi, and independently we set it to 1 with the probability given by yi (and with probability

1− yi we set xi to 0). This process is known as randomized rounding. One reason randomized rounding is useful is

it allows us to prove that the expected number of clauses we satisfy using this rounding is within a constant factor

of the true optimum.
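A short sketch of randomized rounding follows. It assumes the fractional values yi have already been obtained from some linear programming solver (no solver is called here) and are supplied as a dictionary; clauses use the illustrative convention that i means xi and −i means its negation. The function names are hypothetical.

```python
import random


def literal_true_prob(lit, y):
    """Probability a literal is TRUE when x_i is TRUE with probability y[i]."""
    p = y[abs(lit)]
    return p if lit > 0 else 1.0 - p


def clause_satisfied_prob(clause, y):
    """Probability that independent rounding satisfies the clause."""
    all_false = 1.0
    for lit in clause:
        all_false *= 1.0 - literal_true_prob(lit, y)
    return 1.0 - all_false


def randomized_round(y, rng):
    """Set each x_i to TRUE independently with probability y[i]."""
    return {i: rng.random() < p for i, p in y.items()}


# The troublesome example from the text: y1 = y2 = y3 = 0.4.
y = {1: 0.4, 2: 0.4, 3: 0.4}
p = clause_satisfied_prob([1, 2, 3], y)
assignment = randomized_round(y, random.Random(0))
```

For this example simple rounding sets all three variables to FALSE, while randomized rounding satisfies the clause with probability 1 − 0.6^3.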


First, note that whatever the maximum number of clauses s we can satisfy is, the value found by the linear

program, or $\sum_{j=1}^{m} z_j$, is at least as big as s. This is because the linear program could achieve a value of at least s

simply by using as the values for yi the truth assignment that makes satisfying s clauses possible.

Now consider a clause with k variables; for convenience, suppose the clause is just C1 = (x1 ∨ x2 ∨ . . . ∨ xk).

Suppose that when we solve the linear program, we find z1 = β. Then we claim that the probability that this clause

is satisfied after the rounding is at least (1 − 1/e)β. This can be checked (using a bit of sophisticated math), but it

follows by noting (with experiments) that the worst possibility is that y1 = y2 = . . . = yk = β/k. In this case, each xi

is FALSE with probability (1 − β/k), and so C1 ends up being unsatisfied with probability (1 − β/k)^k. Hence the

probability it is satisfied is at least (again using some math) 1 − (1 − β/k)^k ≥ (1 − 1/e)β.
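The two steps credited to "math" can be filled in as follows. This is one standard way to do it, offered as an assumption about the intended argument rather than the notes' own proof; it uses the clause constraint y1 + . . . + yk ≥ β, the AM-GM inequality, and concavity.

```latex
% Probability the clause is unsatisfied, bounded by AM-GM and \sum_i y_i \ge \beta:
\prod_{i=1}^{k} (1 - y_i)
  \le \left(1 - \frac{1}{k}\sum_{i=1}^{k} y_i\right)^{k}
  \le \left(1 - \frac{\beta}{k}\right)^{k}.

% f(\beta) = 1 - (1 - \beta/k)^k is concave on [0,1] with f(0) = 0,
% so f(\beta) \ge \beta f(1):
1 - \left(1 - \frac{\beta}{k}\right)^{k}
  \ge \beta\left(1 - \left(1 - \frac{1}{k}\right)^{k}\right)
  \ge \left(1 - \frac{1}{e}\right)\beta,

% where the last step uses (1 - 1/k)^k \le e^{-1} for every k \ge 1.
```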

Hence the jth clause is satisfied with probability at least (1 − 1/e)zj, so the expected number of satisfied clauses

after randomized rounding is at least $(1 - 1/e)\sum_{j=1}^{m} z_j$. This is within a factor of (1 − 1/e) of our upper bound on the

maximum number of satisfiable clauses, $\sum_{j=1}^{m} z_j$. Hence we expect to get within a constant factor of the maximum.


Combining the Two

Surprisingly, by combining the simple coin flipping algorithm with the randomized rounding algorithm, we can

get an even better algorithm. The idea is that the coin flipping algorithm does best on long clauses, since each literal

in the clause makes it more likely the clause gets set to TRUE. On the other hand, randomized rounding does best

on short clauses; the probability the clause is satisfied, 1 − (1 − β/k)^k, decreases with k. It turns out that if we try

both algorithms, and take the better result, on average we will satisfy at least 3/4 of the maximum number of satisfiable clauses.
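A quick numerical check suggests where the 3/4 comes from. For a clause with k literals and LP value β, coin flipping satisfies it with probability 1 − 2^{-k} ≥ (1 − 2^{-k})β, and randomized rounding with probability at least (1 − (1 − 1/k)^k)β; the better of the two algorithms does at least as well as their average. The sketch below, with made-up function names, tabulates the average of the two coefficients.

```python
def coin_flip_bound(k):
    """Chance that fair coin flips satisfy a clause with k literals."""
    return 1 - 2 ** -k


def rounding_bound(k):
    """Lower bound for randomized rounding on a k-literal clause (beta = 1)."""
    return 1 - (1 - 1 / k) ** k


averages = {
    k: (coin_flip_bound(k) + rounding_bound(k)) / 2
    for k in range(1, 21)
}
worst = min(averages.values())
```

The average is exactly 3/4 for k = 1 and k = 2 and larger for every longer clause, which is consistent with the claim above.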

We also point out that there are even more sophisticated approximation algorithms for MAX-SAT, with better

approximation ratios. However, the algorithms presented here illustrate some very interesting and useful general techniques.


CS 124 Lecture 21

We now consider a natural problem that arises in many applications, particularly in conjunction with suffix

trees, which we will study later. Suppose we have a rooted tree T with n nodes. We would like to be able to answer

questions of the following form: what is the least common ancestor of nodes u and v; that is, what is the common

ancestor of u and v farthest from the root?

In this setting, we will not be answering a single question, but many questions on the same fixed tree T. If

we are given the tree T in advance, we can design an appropriate data structure for answering future queries. Our

algorithm will therefore be measured on several criteria. Of course one important criterion is the query time, or the

time to answer a specific query. However, a second consideration is how much preprocessing time, or time to set up

the data structure, is required to answer the questions. A third related aspect to study is the memory required to store

the data structure.

For example, a trivial algorithm for the problem is to consider each pair of vertices, and compute their least

common ancestor by following both paths toward the root until the first shared vertex is found. Then all the

answers can be stored in a table. There are $\binom{n}{2}$ pairs of vertices, so our table will require Θ(n^2) space. Queries can

be answered by a table lookup, which is constant time. Preprocessing, however, can require Θ(n^3) time.

The problem of designing an appropriate data structure for this is called the Least Common Ancestor (LCA)

Problem. We will show that there is an algorithm for LCA that requires only linear preprocessing time and memory,

but still answers any query in constant time! This result is as efficient as we could hope for.

We will reduce the LCA problem to a seemingly different but in fact quite related problem, called the Range

Minimum Query (RMQ) Problem. The RMQ problem applies to an array A of length n of numbers. We would like

to be able to answer questions of the following form: given two indices i and j, what is the index of the smallest

element in the subarray A[i . . . j]? Again, we may preprocess the array A to derive some alternative data structure to

answer the questions quickly. There is a trivial solution for the RMQ problem entirely analogous to the one above

for the LCA problem.



21.1 Reduction: From LCA to RMQ

How do we convert an LCA problem to an RMQ problem? Note that we must do the conversion in linear time, if we

are going to complete the preprocessing in linear time for the LCA problem.

Linear time suggests that we want to do a tree traversal. In fact, the observation we will use is that the LCA of

nodes u and v is just the shallowest node encountered between visiting u and v during a depth first search of the tree

starting at the root. So let us do a DFS on the tree, and we can record in an array V the nodes we visit. An example

is shown in Figure 1. Notice each node can appear multiple times, but the total length of the array is 2n−1, where n

is the number of nodes in the tree. Each of the n−1 edges yields two values in the array, one when we go down the

edge and one when we go up the edge. The first value is the root. Also, from now on we will refer to each node by

its number in the DFS.

We will also require two further arrays. The Level Array is derived from V ; L[i] is the distance from the root

of V [i]. Adjacent elements in L can only differ by +1 or −1, since adjacent steps in the DFS are connected by an

edge. Finally, R[i] is the representative array; R[i] contains the first index of V that contains the value i. (Actually,

any occurrence of i can be stored in R[i], but we might as well choose a specific one.)

Clearly, to compute LCA(u,v) it suffices to compute RMQ(R[u],R[v]) over the array L. This gives us the index

of the shallowest node between u and v, and the array V can be used to determine the actual node from the index.
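The reduction can be sketched as below. The tree is assumed to be given as a dictionary from each node to its list of children, with nodes already numbered in DFS order; that representation and the function names are illustrative choices. The range minimum here is found by a plain scan, standing in for whatever RMQ structure is used, and the recursive DFS would need to be made iterative for very deep trees.

```python
def euler_tour(children, root=0):
    """Return the visit array V, level array L and representative map R."""
    V, L, R = [], [], {}

    def visit(u, depth):
        R.setdefault(u, len(V))   # first index of V that holds u
        V.append(u)
        L.append(depth)
        for c in children.get(u, []):
            visit(c, depth + 1)
            V.append(u)           # we pass through u again on the way back up
            L.append(depth)

    visit(root, 0)
    return V, L, R


def lca(u, v, V, L, R):
    """LCA via a (naive) range minimum over the level array."""
    lo, hi = sorted((R[u], R[v]))
    k = min(range(lo, hi + 1), key=L.__getitem__)
    return V[k]


# The tree of Figure 1, reconstructed from its V array.
children = {0: [1, 4, 5], 1: [2, 3], 5: [6, 7, 9], 7: [8]}
V, L, R = euler_tour(children)
```

With this tree the arrays agree with those listed in Figure 1, and for instance `lca(8, 9, V, L, R)` is node 5.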



V: 0 1 2 1 3 1 0 4 0 5 6 5 7 8 7 5 9 5 0

L: 0 1 2 1 2 1 0 1 0 1 2 1 2 3 2 1 2 1 0

R: 0 1 2 4 7 9 10 12 13 16

Figure 1: Changing an LCA problem into an RMQ problem. (The tree drawing is omitted; its nodes are numbered 0 to 9 in DFS order.)

21.2 Solutions for RMQ

We first note that we can do better than the naive Θ(n^3) preprocessing time for RMQ on an array A by doing a trivial

dynamic programming, using the recurrence

\[
\mathrm{RMQ}(i,j) = A^{-1}[\min(A[\mathrm{RMQ}(i,j-1)],\ A[j])].
\]

Here we are using convenient notation. Clearly min(A[RMQ(i, j−1)], A[j]) gives the value A[k], where A[k] is the

smallest value in the subarray A[i . . . j]. However, we want the index of this value. We use the notation A^{-1} to

represent that we want the index of this value; note that if multiple indices have this value, we do not particularly

care which index we obtain. Each table entry can be calculated in constant time by building the table in order of

ranges [i, j] of increasing size, leading to preprocessing time Θ(n^2).

In fact, we can reduce our table size and memory using a different dynamic program, and by using a few

additional operations per query. Let us create a table M(i, j) such that $M(i,j) = A^{-1}[\min_{k \in [i,\, i+2^j)} A[k]]$. That is,

M(i, j) contains the location of the minimum value over the 2^j positions starting from i. This table has size O(n log n),

and it can easily be filled in O(n log n) steps by using dynamic programming, based on the fact that M(i, j) can be

determined from M(i, j−1) and M(i + 2^{j−1}, j−1).

How do we use the M(i, j) to compute RMQ(i, j), if j − i + 1 is not a power of 2? We may use two overlapping

intervals that cover the range [i, j] as follows. Let k = ⌊log(j − i + 1)⌋, so that 2^k is the largest power of 2 such that

i + 2^k ≤ j + 1. Then RMQ(i, j) = A^{-1}[min(A[M(i, k)], A[M(j − 2^k + 1, k)])], and this can be computed in constant time

from the M.
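The table M and the two-interval query can be sketched as follows. The sketch uses 0-indexed arrays and stores the table as a list of rows, so the entry written M(i, j) in the text is `M[j][i]` here; those are implementation choices, not part of the notes.

```python
def build_sparse_table(A):
    """M[j][i] = index of the minimum of A[i : i + 2**j]."""
    n = len(A)
    M = [list(range(n))]          # intervals of length 1
    j = 1
    while (1 << j) <= n:
        half = 1 << (j - 1)
        prev = M[j - 1]
        row = []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + half]   # the two halves of the interval
            row.append(a if A[a] <= A[b] else b)
        M.append(row)
        j += 1
    return M


def rmq(A, M, i, j):
    """Index of a minimum element of A[i..j] (inclusive), for i <= j."""
    k = (j - i + 1).bit_length() - 1         # floor(log2(j - i + 1))
    a = M[k][i]                              # covers [i, i + 2^k)
    b = M[k][j - (1 << k) + 1]               # covers (j - 2^k, j]
    return a if A[a] <= A[b] else b


A = [0, 1, 2, 1, 2, 1, 0, 1, 0, 1, 2, 1, 2, 3, 2, 1, 2, 1, 0]
M = build_sparse_table(A)
```

Because the two intervals are allowed to overlap, each query needs only two table lookups and one comparison.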

We have shown that we can achieve preprocessing time and memory size Θ(n log n) while maintaining constant

query time. Interestingly, this method can be enhanced so as to require preprocessing time and memory size

Θ(n log log n) through a recursive construction. (This will be an exercise.) In practice, such a result would probably

be good enough: log log n is quite small for reasonable values of n. By continuing the recursive construction for

further levels, we could even achieve Θ(n log log log n) preprocessing time and memory size, and so on for any fixed

number of logs, while maintaining constant query time. However, this recursive construction would add significant

complexity to an actual program, and it still would not lead us to a linear preprocessing time solution.


21.3 ±1 RMQ

In order to achieve linear preprocessing, we will use an additional fact about the RMQ problem we obtain from

the reduction from LCA. Recall that our RMQ problem is on the Level Array obtained from the LCA problem.

The Level Array has one additional property that we are not yet taking advantage of: each entry differs from the

previous entry by +1 or −1. We can take advantage of this fact to split the RMQ problem into a different set of

small subproblems in such a way that we can avoid some work by doing table look-ups.

The split works as follows: partition A into blocks of size (log n)/2. Let X[1, . . . , 2n/log n] and Y[1, . . . , 2n/log n] be

arrays such that X [i] stores the minimum element in the ith block of A, and Y [i] stores the position in the ith block

where the element X [i] occurs. Now to answer an RMQ query for indices i and j with i < j on the array A, we can

do the following:

1. If i and j are in the same block, we can perform an RMQ on this block. Notice that this requires that each

block be preprocessed.

2. If i and j are in different blocks, we have to compute the following values, and take the minimum of them:

(a) The minimum from position i to the end of i’s block.

(b) The minimum from the beginning of j’s block to position j.

(c) The minimum of all blocks between i’s block and j’s block.

Steps 2a and 2b also require that we preprocess for RMQ queries on each block. Step 2c requires that we perform

an RMQ over the array X. Assuming we have done all this preprocessing, the total query time is still constant.

However, if we preprocess each block in order to do RMQ’s, we have not saved on the running time. We need a

faster way to deal with preprocessing each block.

How can we possibly avoid preprocessing each block separately? We use the following observation. Consider

two arrays X and X ′. Suppose that these two arrays differ by a constant at each position; for example, the arrays

might be 1,2,3,4,3,2 . . . and 3,4,5,6,5,4 . . . . Then the RMQ answers, which give the index of the minimum

element, will be the same for these two arrays. Hence we can “share” the preprocessing used for these two arrays!

Another way to explain this is that in the ±1 RMQ problem, the initial value of the array does not matter; only

the sequence of +1 and −1 values is needed to determine the answer. Now, how many different such sequences

are there? Since there are only (log n)/2 elements in a block, there are only (log n)/2 − 1 values in the sequence of +1

and −1 values. Hence there are only $2^{(\log n)/2 - 1} = \sqrt{n}/2$ possible sequences. This number is so small, we can afford

to compute and store tables for every possible sequence! Even if we use quadratic preprocessing time and memory

per table, these tables would take time $O(\sqrt{n} \log^2 n)$ to preprocess and $O(\sqrt{n} \log^2 n)$ memory. For each block in A, we have to

determine which table to use; this can easily be done in linear time.
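The sharing idea can be sketched as follows. Each block is reduced to a bitmask recording its +1 and −1 steps, and one quadratic in-block table is built per distinct bitmask rather than per block. The block size is passed in as a parameter instead of being fixed at (log n)/2, the block length is included in the key so a short final block is handled safely, and all names are hypothetical.

```python
def block_signature(block):
    """Encode the +1/-1 steps of a block as an integer bitmask."""
    sig = 0
    for t in range(1, len(block)):
        sig = (sig << 1) | (1 if block[t] - block[t - 1] == 1 else 0)
    return sig


def in_block_table(block):
    """table[i][j] = position of the minimum of block[i..j], for i <= j."""
    b = len(block)
    table = [[0] * b for _ in range(b)]
    for i in range(b):
        best = i
        for j in range(i, b):
            if block[j] < block[best]:
                best = j
            table[i][j] = best
    return table


def preprocess_blocks(A, b):
    """Build one table per distinct signature; record each block's key."""
    tables, which = {}, []
    for start in range(0, len(A), b):
        block = A[start:start + b]
        key = (len(block), block_signature(block))
        if key not in tables:
            tables[key] = in_block_table(block)
        which.append(key)
    return tables, which


A = [0, 1, 2, 1, 2, 1, 0, 1, 0, 1, 2, 1, 2, 3, 2, 1, 2, 1, 0]
tables, which = preprocess_blocks(A, 3)
```

On this level array, seven blocks of size 3 end up sharing only three tables, since most blocks have the same up-then-down shape.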


21.4 Back to the standard RMQ

We have shown that ±1 RMQ problems can be solved with linear time preprocessing, and therefore we have a linear

time preprocessing solution for LCA. What about the general RMQ problem? It turns out that we can also reduce

the RMQ problem to the LCA problem in linear time. So we can obtain a linear time solution for the general RMQ

problem, by turning it into an LCA problem, and solving that as a ±1 RMQ problem! The details of this reduction

are omitted here.


CS 124 Lecture 22 Spring 2000

Suffix trees are an old data structure that have become new again, thanks to a recent new linear time algorithm

for constructing suffix trees due to Ukkonen that proves more useful for many applications. Here, we will describe

suffix trees and discuss their classical use, pattern matching.

22.1 Definition

A suffix tree T is built for a string S[1 . . .m]. The tree is rooted and directed with m leaves, which are numbered from

1 to m. Each edge is labeled with a nonempty substring of S. The internal nodes of the tree (other than the root)

all have at least two outgoing edges, and the labels of the outgoing edges of a node begin with different characters. By

following the path from the root to leaf i and concatenating the edge labels, one obtains the suffix S[i . . .m].

An example of a suffix tree for the string xyzxzxy$ is given in Figure 22.1. The figure helps understand some

important points about the suffix tree. First, each internal node has two or more children with different starting

characters along the edges, since otherwise the node could be removed or moved in order to make this the case.

Also, it is important that the last character of the string be a “unique” character, as this guarantees that the suffix tree

as defined actually exists. For example, suppose our string was just xyzxzxy. The suffix tree would remain largely

the same. However, in the not-quite-suffix tree in Figure 22.1 the path for the suffix xy does not end at a leaf,

violating the definition. The problem is that the suffix xy is also a prefix of the string. This problem can be avoided

by terminating the string with a special character that does not appear elsewhere, since then no suffix can also be a

prefix of another suffix. Hence, from now on, we will assume all strings end with a special character $.


It is also worth noting that a more convenient representation of the suffix tree does not actually label the edges

with characters. Instead, these labels can be represented by a pair of indices; labeling an edge [i, j] represents that

the edge label corresponds to characters S[i . . . j]. Besides saving space and ensuring that each edge is conveniently

represented by two numbers, this scheme is important for the linear time algorithm for suffix tree construction.



22.2 Construction algorithm

To see that constructing suffix trees is possible, let us consider a simple O(m^2) algorithm. Before beginning, we

emphasize that in this case, the O notation is being used to hide a potentially substantial constant, that depends on

the size of the alphabet. That is, if our alphabet is Σ, the O notation is hiding some factor dependent on |Σ|.

The goal is to build up the tree, one suffix at a time. We think of the intermediate results we get at each stage as

partial trees, T1,T2, . . . ,Tm. Initially the tree T1 consists of one edge, with label S[1 . . .m]; the end node is labeled with

1. For tree Ti, we modify the tree so that the suffix S[i . . .m] is handled properly. To do this, we start from the root

and follow the path down the tree matching characters from S[i . . .m] as far as possible. This just requires character

comparisons, and the path followed is necessarily unique since no two edges leading out of a node are labeled with

strings that begin with the same character.

(Note, however, that whenever we reach an intermediate node in the tree, we have to look at all the branches

and decide which one, if any, to follow. Since there is at most one branch for each character, there are at most |Σ|

branches; since |Σ| is a constant, this takes only constant time! In practice, one might want to set up a hash table

based on a number assigned to each node and the first character on an edge in order to make finding the right edge

branch out from a node more efficient.)

At some point, no further matches are possible. Note that this cannot happen at a leaf node, because we end our

string with the special character $. Therefore it happens when our character matching is either in the middle

of an edge or at a node. In the first case, we break the edge into two edges by inserting a new node. The edge to

the new node contains the characters that have matched so far along the old edge, and the edge from the new node

contains the remaining characters from the old edge. With this addition, we can now add the remainder of the suffix

S[i . . .m] by adding another edge from the new node. If instead when no further matches are possible we are at a

node, we can simply add a new edge from that node with the remainder of the suffix S[i . . .m]. In both cases, when

we add the new edge, we label the new leaf with the value i. The time to add each suffix is proportional to the length

of the suffix, leading to an O(m^2) algorithm.
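The quadratic construction described above might be sketched as follows. This is an illustrative sketch, not the linear time algorithm: positions are 0-indexed rather than 1-indexed, each edge label is stored as a pair of indices into S (the half-open slice S[start:end]), children are looked up by first character with a dictionary, and the class and function names are invented here.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start = start      # label of the edge into this node is S[start:end]
        self.end = end
        self.children = {}      # first character of an edge label -> child
        self.leaf = leaf        # starting position of the suffix, for leaves


def build_suffix_tree(S):
    """Insert the suffixes S[0:], S[1:], ... one at a time."""
    if not S or S[-1] in S[:-1]:
        raise ValueError("S must end with a character that appears nowhere else")
    m = len(S)
    root = Node()
    for i in range(m):
        node, pos = root, i     # pos = next unmatched character of S[i:]
        while True:
            child = node.children.get(S[pos])
            if child is None:
                # No further match at a node: hang a new leaf edge here.
                node.children[S[pos]] = Node(pos, m, leaf=i)
                break
            k = child.start
            while k < child.end and S[k] == S[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child    # matched the whole edge; keep walking down
                continue
            # No further match in the middle of an edge: split the edge.
            mid = Node(child.start, k)
            node.children[S[child.start]] = mid
            child.start = k
            mid.children[S[k]] = child
            mid.children[S[pos]] = Node(pos, m, leaf=i)
            break
    return root


def leaves(node, S, prefix=""):
    """Yield (leaf number, string spelled from the root) for every leaf."""
    label = prefix + S[node.start:node.end]
    if node.leaf is not None:
        yield node.leaf, label
    for child in node.children.values():
        yield from leaves(child, S, label)


S = "xyzxzxy$"
root = build_suffix_tree(S)
spelled = dict(leaves(root, S))
```

Reading off the string spelled on the way to each leaf gives back every suffix of the example string, one per leaf.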

Although this algorithm is very simple, the quadratic construction time is extremely limiting. Suffix trees are

used, for example, for large pattern matching problems, where the input strings might be DNA strands of thousands

or even millions of characters. Quadratic time will not suffice for these applications.


Fortunately, there are slightly more complex construction algorithms that require only O(m) time. We will not

discuss the algorithm at this point; the details and the subsequent analysis would require a non-trivial amount of

time. A reasonable introduction to the algorithm, however, has been written by Mark Nelson and has appeared in

Dr. Dobb’s Journal. You can currently find it at


22.3 Using suffix trees for pattern matching

Once we have constructed our suffix tree, we can use it to efficiently solve pattern matching problems. There are of

course other methods for pattern matching, but using suffix trees has an interesting advantage. Once the suffix tree

has been constructed, finding all the occurrences of any pattern P[1 . . . n] in the string S takes time O(n + k), where

k is the number of times that the pattern P appears in S. So by incurring a one-time preprocessing charge to

build the suffix tree, we can handle any pattern matching problem after that in time essentially proportional to

the length of the pattern, independent of the length of the original string! This is quite powerful, particularly for

things like DNA databases, where the underlying database is large and fixed but must be able to deal with lots of queries.


Suppose that P lies in the string S; for example, suppose P corresponds to S[i . . . i+n−1]. Then P is the prefix

of the suffix S[i . . .m]. Hence, if we start matching characters in P against the labels in the suffix tree for S, we

will follow part of the path from the root to the leaf vertex labeled i. Hence, to find all occurrences of P in S, start at

the root, and match down the tree as far as possible. This takes time O(n). If P does not match some path in the tree,

then P does not lie in S. If P does match some path in the tree, it matches down to some point z. All the leaves in the

subtree below z correspond to suffixes for which P is a prefix, so the labels on these leaves correspond to locations

that begin an occurrence of P. To find these positions, we just traverse the subtree below z, using for example depth

first search. If there are k leaves, the depth first search takes only O(k) time.
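The matching procedure can be illustrated on a simplified structure. The sketch below deliberately swaps in an uncompressed suffix trie (one node per character, stored as nested dictionaries) so that it stays short; it follows the same two phases, matching P down from the root and then collecting the leaves below the point reached, but it uses quadratic space and does not have the O(n + k) bound of a true suffix tree. Positions are 0-indexed and the names are made up.

```python
def build_suffix_trie(S):
    """Nested-dict trie of all suffixes; the key 'leaf' stores the start position."""
    root = {}
    for i in range(len(S)):
        node = root
        for ch in S[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i        # S ends with a unique character, so this is a leaf
    return root


def find_occurrences(root, P):
    """Start positions of every occurrence of P, in increasing order."""
    node = root
    for ch in P:                # phase 1: match P down from the root
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]   # phase 2: depth first search below that point
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("xyzxzxy$")
hits = find_occurrences(trie, "xy")
```

For the example string, the pattern xy is found at positions 0 and 5, the two suffixes that have xy as a prefix.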


22.4 Representation

An important point about suffix trees: to make sure everything takes linear time, it is important to use the correct

representation. For example, we do not explicitly label each edge with a group of characters, since this could take as

much as Ω(m^2) time to just write down! Instead, each edge is labeled with a pair of indices into S.

For example, an edge labeled [a,b] should be thought of as being labeled by the characters S[a] . . . S[b]. Hence each

edge is just labeled by two numbers, and only linear space is required.


[Figure 22.1: A true suffix tree (top); why we need the $ character (bottom). Tree drawings omitted.]


22.5 Generalized suffix trees

You may want to put a set S1, S2, . . . , Sk of strings in a suffix tree data structure. (Note: we assume each string ends

with the special character $.) The structure in this case is called a generalized suffix tree. There are two primary

differences. First, now each leaf node may contain multiple pairs of numbers. Each pair of numbers identifies a

string Si and a location where the suffix from the root to that leaf starts in Si. Note that multiple strings can have a

suffix that share a leaf node! Second, each edge label must be represented by three numbers: a number i and a pair

[a,b] represent that the characters on the edge label are Si[a] . . .Si[b].

Constructing a generalized suffix tree can easily be done by extending our quadratic time algorithm. Moreover, the linear time algorithm for suffix trees can also be used to build a generalized suffix tree. Hence if m = ∑_{i=1}^{k} |Si|, constructing the generalized suffix tree can be done in O(m) time.
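
As a small illustration of leaves that hold several (string, position) pairs, here is a quadratic-time sketch that again uses an uncompressed trie in place of a real generalized suffix tree. The function names are invented for this example, and edge labels are single characters rather than the three-number form described above.

```python
def build_generalized_trie(strings):
    """Naive generalized suffix trie; every string is terminated with '$'.

    A leaf stores a list of (string number, 1-based start position) pairs
    under the key "leaf". Assumption: '$' does not occur in any input string.
    """
    root = {}
    for sid, s in enumerate(strings, start=1):
        t = s + "$"
        for i in range(len(t)):
            node = root
            for ch in t[i:]:
                node = node.setdefault(ch, {})
            node.setdefault("leaf", []).append((sid, i + 1))
    return root


def leaf_pairs(root, suffix):
    """Return the pairs stored at the leaf for suffix + '$', or [] if absent."""
    node = root
    for ch in suffix + "$":
        if ch not in node:
            return []
        node = node[ch]
    return node.get("leaf", [])


trie = build_generalized_trie(["xab", "ab"])
print(leaf_pairs(trie, "ab"))    # [(1, 2), (2, 1)]
print(leaf_pairs(trie, "xab"))   # [(1, 1)]
```

Here the suffix ab$ occurs in both strings, so one leaf carries two pairs.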


22.6 Longest common extension

Using generalized suffix trees and the LCA algorithm, we can solve a very general problem called the longest common extension problem. Given strings S1 and S2, we wish to pre-process the strings so that we can answer questions of the following form: given a pair (i, j), find the longest substring of S1 that begins at position i that matches a substring of S2 that begins at position j.

We will use linear time pre-processing and linear space, after which we can answer queries in constant time.

The solution is to build a generalized suffix tree for S1 and S2. When we build this tree, we should also compute the string depth of each node. The string depth of a node is simply the number of characters along the edges from the root to that node. Notice that the string depth is not the same as the tree depth. Also, after building the tree, we precompute the information necessary to do LCA queries on the tree.

Given a pair (i, j) we compute the least common ancestor u of the leaf nodes corresponding to the suffix beginning at i in S1 and the suffix beginning at j in S2. The string spelled along the path from the root to u is the longest common extension, and hence the string depth of this node is all we need.
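
The query can be sketched as follows, with several loudly stated simplifications: the tree is built by naive quadratic insertion rather than the linear time algorithm; instead of a generalized suffix tree it uses one suffix tree over S1 + "#" + S2 + "$" (assuming neither "#" nor "$" occurs in the inputs); and the LCA is found by walking parent pointers, not by a constant-time LCA structure. Node, build_suffix_tree, lca and make_lce are names chosen for this sketch.

```python
class Node:
    def __init__(self, start, end, parent):
        self.start, self.end = start, end   # incoming edge label is t[start:end]
        self.parent = parent
        self.children = {}                  # first character of edge -> child
        # string depth: number of characters on the path from the root
        self.depth = 0 if parent is None else parent.depth + (end - start)


def build_suffix_tree(t):
    """Naive insertion of every suffix of t (t must end in a unique character)."""
    root, leaves = Node(0, 0, None), []
    for i in range(len(t)):
        node, j = root, i
        while True:
            child = node.children.get(t[j])
            if child is None:
                break
            k = child.start
            while k < child.end and t[k] == t[j]:
                k += 1
                j += 1
            if k == child.end:              # consumed the whole edge, keep descending
                node = child
                continue
            mid = Node(child.start, k, node)  # mismatch inside the edge: split it
            node.children[t[mid.start]] = mid
            child.start, child.parent = k, mid
            mid.children[t[k]] = child
            node = mid
            break
        leaf = Node(j, len(t), node)
        node.children[t[j]] = leaf
        leaves.append(leaf)                 # leaves[i] is the leaf of suffix i (0-based)
    return root, leaves


def lca(u, v):
    """Naive least common ancestor by walking parent pointers (not O(1))."""
    ancestors = set()
    while u is not None:
        ancestors.add(u)
        u = u.parent
    while v not in ancestors:
        v = v.parent
    return v


def make_lce(s1, s2):
    _, leaves = build_suffix_tree(s1 + "#" + s2 + "$")
    offset = len(s1) + 1                    # where s2 starts inside the combined string

    def lce(i, j):
        """Longest common extension for 1-based i in s1 and j in s2."""
        return lca(leaves[i - 1], leaves[offset + j - 1]).depth

    return lce


lce = make_lce("xabcy", "zabcw")
print(lce(2, 2))   # 3, the common extension is "abc"
```

Replacing lca with a preprocessed constant-time LCA structure and the builder with a linear time construction gives the bounds stated above; the query itself stays the same.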


22.7 Maximal palindromes

A palindrome is a string that reads the same forwards as backwards, such as axbccbxa.

A substring U of a string S is a maximal palindrome if and only if it is a palindrome and extending it one character in both directions either is impossible (U touches an end of S) or yields a string that is not a palindrome. Generally we separate even-length maximal palindromes, or even palindromes for short, and odd-length maximal palindromes (odd palindromes) for convenience. For example, in S = axbccbbbaa, the maximal even palindromes are bccb, bb, and aa. The string bbb is a maximal odd palindrome, and we will skip writing the maximal odd palindromes of length 1. Note that every palindrome is contained in a maximal palindrome.

Here is a simple way to find all even-length maximal palindromes in linear time. (Finding odd-length maximal

palindromes is similar.)

Consider S and S^r, the reversal of S, and let n be the length of S. There is a palindrome of length 2k with the middle just after position q if and only if the string of length k starting from position q+1 of S matches the string of length k starting from position n−q+1 of S^r. In particular, this palindrome will be maximal if k is the length of the longest match from these positions.

Thus, solving the even-length maximal palindrome problem corresponds to computing the longest common extension of (q+1, n−q+1) for all possible q. The data structure can be built and preprocessed in linear time, and each of the linear number of queries can be answered in constant time, so the total time is linear.
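
The reduction can be checked on the example S = axbccbbbaa with a short sketch. Note the assumption: the longest common extension here is computed by direct character comparison, a stand-in for the constant-time suffix tree query, so this version is not linear time overall. The function names are illustrative only.

```python
def lce(a, i, b, j):
    """Longest common extension of a from 1-based position i and b from j,
    computed by direct comparison (a stand-in for the constant-time query)."""
    k = 0
    while i - 1 + k < len(a) and j - 1 + k < len(b) and a[i - 1 + k] == b[j - 1 + k]:
        k += 1
    return k


def maximal_even_palindromes(s):
    """Maximal even-length palindromes of s, one per middle position q."""
    n = len(s)
    reversed_s = s[::-1]
    found = []
    for q in range(1, n):                 # the middle lies just after position q
        k = lce(s, q + 1, reversed_s, n - q + 1)
        if k > 0:
            found.append(s[q - k:q + k])  # the palindrome S[q-k+1 .. q+k]
    return found


print(maximal_even_palindromes("axbccbbbaa"))   # ['bccb', 'bb', 'bb', 'aa']
```

The palindrome bb appears twice because it is maximal at two different middles inside bbb.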