
    15-359: Probability and Computing

    Inequalities

The worst form of inequality is to try to make unequal things equal. (Aristotle)

    1. Introduction

We have already seen several times the need to approximate event probabilities. Recall the birthday problem, for example, or balls and bins calculations where we approximated the probability of the tail of the binomial distribution P(X ≥ k), which is explicitly given by

$$P(X \geq k) \;=\; \sum_{j=k}^{n} \binom{n}{j} p^j (1-p)^{n-j}$$

The need to approximate this type of probability comes up in many applications. The difficulty is that the sum involving binomial coefficients is unwieldy; we'd like to replace it by a single factor that we can use in further analysis.

We'll now develop bounds more systematically, leading to methods that hold for general random variables, not just the binomial. There is a tradition of naming inequalities after their inventors. To be preserved in eponymity, devise an effective probability bound; there is still room for more!

    2. Markov

We'll begin with one of the simplest inequalities, called the Markov inequality after Andrei A. Markov. Markov was actually a student of Chebyshev, whom we'll hear from in a moment. He is best known for initiating the study of sequences of dependent random variables, now known as Markov processes.

In spite of its simplicity, the Markov inequality is a very important bound because it is used as a subroutine to derive more sophisticated and effective bounds.

Let X be a non-negative, discrete random variable, and let c > 0 be a positive constant. We want to derive a bound on the tail probability P(X ≥ c); this is the total probability mass at and beyond the point X = c. Since X is discrete, we can write E[X] = Σ_x x P_X(x), where the sum is over the set of values x ≥ 0 taken by the random variable X. Now, we can bound this expectation from below as

$$E[X] \;=\; \sum_{x} x\, P_X(x) \;\geq\; \sum_{x \geq c} x\, P_X(x) \;\geq\; c \sum_{x \geq c} P_X(x) \;=\; c\, P(X \geq c)$$

Rearranging gives the Markov inequality:

$$P(X \geq c) \;\leq\; \frac{E[X]}{c}$$

2.1. Example

Flip a fair coin n times and let X be the number of heads, so that X ~ Binomial(n, 1/2) and E[X] = n/2. The Markov inequality gives

$$P(X \geq 3n/4) \;\leq\; \frac{n/2}{3n/4} \;=\; \frac{2}{3}$$

a rather weak bound that we will improve below.
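To see the bound in action on this coin-flipping example, here is a minimal Python sketch (not from the original notes; the function names and the choice n = 100 are illustrative) comparing the exact binomial tail P(X ≥ 3n/4) with the Markov bound E[X]/c.

```python
from math import comb

def binomial_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def markov_bound(mean, c):
    """Markov's inequality: P(X >= c) <= E[X] / c for a non-negative X."""
    return mean / c

n, p = 100, 0.5
c = 3 * n // 4                                   # threshold c = 3n/4
print("exact tail   :", binomial_tail(n, p, c))
print("Markov bound :", markov_bound(n * p, c))  # (n/2)/(3n/4) = 2/3
```

The gap between the two printed numbers is exactly what the sharper inequalities below will close.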


    3. Chebyshev

According to Kolmogorov [2], Pafnuty Chebyshev was one of the first mathematicians to make use of random quantities and expected values, and to study rigorous inequalities for random variables that were valid under general conditions. In 1867 he published the paper "On mean values," which presented the inequality commonly associated with his name.

Unlike the Markov inequality, the Chebyshev inequality involves the shape of a distribution through its variance. The idea is to apply the Markov inequality to the deviation of a random variable from its mean.

For a general random variable X we wish to bound the probability of the event {|X − E[X]| > a}. Note that this is the same as the event {(X − E[X])² > a²}. Since Y = f(X) = (X − E[X])² is a non-negative random variable, we can apply the Markov inequality to obtain

$$P\big(|X - E[X]| > a\big) \;=\; P\big((X - E[X])^2 > a^2\big) \;\leq\; \frac{E[(X - E[X])^2]}{a^2} \;=\; \frac{\mathrm{Var}(X)}{a^2}$$

As a special case, it follows that

$$P\big(|X - E[X]| \geq a\, \sigma(X)\big) \;\leq\; \frac{1}{a^2}$$

In particular, for an arbitrary random variable there is no more than a total probability mass of 1/4 two or more standard deviations away from the mean. An alternative formulation is the following:

$$P\big(|X - E[X]| \geq a\, E[X]\big) \;\leq\; \frac{\mathrm{Var}(X)}{a^2\, E[X]^2}$$

    3.1. Example

Let's return to our coin flipping example. Applying the Chebyshev inequality we get

$$P(X \geq 3n/4) \;=\; P(X - n/2 \geq n/4) \;\leq\; P\big(|X - E[X]| \geq \tfrac{1}{2} E[X]\big) \;\leq\; \frac{n/4}{\tfrac{1}{4}(n/2)^2} \;=\; \frac{4}{n}$$

This is better: unlike the Markov inequality, it suggests that the probability is becoming concentrated around the mean.
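As a rough check of this calculation, the following sketch (the helper name estimate_tail, the trial count, and the values of n are assumptions made for illustration) compares the Chebyshev bound 4/n with a Monte Carlo estimate of the tail for a fair coin.

```python
import random

def estimate_tail(n, p, k, trials=20_000):
    """Monte Carlo estimate of P(X >= k), where X counts heads in n biased flips."""
    hits = sum(sum(random.random() < p for _ in range(n)) >= k for _ in range(trials))
    return hits / trials

random.seed(0)
n, p = 100, 0.5
a = n / 4                            # deviation from the mean n/2
cheb = (n * p * (1 - p)) / a**2      # Var(X)/a^2 = (n/4)/(n/4)^2 = 4/n
print("Chebyshev bound :", cheb)
print("simulated tail  :", estimate_tail(n, p, 3 * n // 4))
```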


    4. Chernoff

We now come to one of the most powerful bounding techniques, due to Herman Chernoff. Chernoff is known for many contributions to both practical and theoretical statistics. The so-called Chernoff bounds originated in a paper on statistical hypothesis testing [1], but the resulting bounds are very widely used in computer science.

Chernoff is also known for a method called Chernoff faces for visualizing high dimensional data. Each variable is associated with a particular attribute in a cartoon face; for example, one variable might be associated with eyebrow slant, one with eye spacing, one with mouth shape, etc. The characteristics of different data sets (or points) are then seen at a glance.¹ On a personal note, Herman Chernoff was very kind to me when I was an undergraduate, giving his time for weekly private tutoring in probability and stochastic processes one spring semester. He was extremely generous, and so it's a pleasure for me to see his name remembered so often in computer science.

    Figure 1: Chernoff faces

Recall that we introduced a dependence on the shape of a distribution by squaring the random variable and applying the Markov inequality. Here we exponentiate. By the Markov inequality, if λ > 0, we have that

$$P(X \geq c) \;=\; P\big(e^{\lambda X} \geq e^{\lambda c}\big) \;\leq\; \frac{E[e^{\lambda X}]}{e^{\lambda c}} \;=\; e^{-\lambda c}\, M_X(\lambda)$$

where M_X(λ) = E[e^{λX}] is the moment generating function of X.

¹ It's interesting to note that in his 1973 article on this topic Chernoff says that "At this time the cost of drawing these faces is about 20 to 25 cents per face on the IBM 360-67 at Stanford University using the Calcomp Plotter. Most of this cost is in the computing, and I believe that it should be possible to reduce it considerably."


It is best to consider this inequality in the log domain. Taking logarithms gives

$$\log P(X \geq c) \;\leq\; -\lambda c + \log M_X(\lambda)$$

Now, it's possible to show that log M_X(λ) is a convex function of λ. Since this inequality holds for any λ > 0, we can think of λ as a variational parameter and select the value of λ that gives the tightest possible upper bound. See Figure 2.
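The optimization over λ is easy to carry out numerically. The sketch below (not part of the original notes; the λ grid and helper names are illustrative) evaluates the bound −λc + log M_X(λ) for a sum of independent Bernoulli(1/2) trials in the n = 30, p = 1/2, δ = 1/2 setting of Figure 2 and keeps the tightest value found on the grid.

```python
from math import log, exp

def log_mgf_binomial(lam, n, p):
    """log M_X(lambda) for X = sum of n independent Bernoulli(p) variables."""
    return n * log(1 - p + p * exp(lam))

def chernoff_log_bound(lam, n, p, c):
    """Upper bound on log P(X >= c): -lambda*c + log M_X(lambda), valid for lambda > 0."""
    return -lam * c + log_mgf_binomial(lam, n, p)

n, p, delta = 30, 0.5, 0.5
c = n * p * (1 + delta)              # threshold np(1 + delta) from the Figure 2 setting

# Scan the variational parameter and keep the tightest (smallest) bound on the log-probability.
lams = [0.01 * i for i in range(1, 301)]
best_lam = min(lams, key=lambda lam: chernoff_log_bound(lam, n, p, c))
print("best lambda on grid :", round(best_lam, 2))
print("bound on log P(X>=c):", round(chernoff_log_bound(best_lam, n, p, c), 3))
```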

Figure 2: Chernoff bounds for n = 30 independent Bernoulli trials for the event C = {X : Σᵢ Xᵢ > np(1 + δ)}, with p = 1/2 and δ = 1/2. The plot shows the bound on log P(X ∈ C) as a function of the variational parameter λ.

For X = Σᵢ₌₁ⁿ Xᵢ, a sum of n independent Bernoulli(p) trials, this optimization leads to the bounds

$$P(X - np \geq \Delta) \;\leq\; e^{-2\Delta^2/n} \qquad\text{and}\qquad P(X - np \leq -\Delta) \;\leq\; e^{-2\Delta^2/n}$$

for any Δ > 0. To prove the first of these (the second is similar) we follow the usual protocol: exponentiate and apply the Markov inequality. This gives

$$\begin{aligned}
P(X - np \geq \Delta) &= P\big(e^{\lambda(X - np)} \geq e^{\lambda\Delta}\big) \\
&\leq e^{-\lambda\Delta}\, E\Big[e^{\lambda\left(\sum_{i=1}^{n} X_i - np\right)}\Big] \\
&= e^{-\lambda\Delta}\, E\Big[e^{\lambda \sum_{i=1}^{n} (X_i - p)}\Big] \\
&= e^{-\lambda\Delta}\, E\Big[\prod_{i=1}^{n} e^{\lambda(X_i - p)}\Big] \\
&= e^{-\lambda\Delta} \prod_{i=1}^{n} E\big[e^{\lambda(X_i - p)}\big] \qquad\text{(by independence)} \\
&= e^{-\lambda\Delta} \prod_{i=1}^{n} \Big(p\, e^{\lambda(1-p)} + (1-p)\, e^{-\lambda p}\Big)
\end{aligned}$$

Now, using the following inequality (which we won't prove)

$$p\, e^{\lambda(1-p)} + (1-p)\, e^{-\lambda p} \;\leq\; e^{\lambda^2/8}$$

we get that

$$P(X - np \geq \Delta) \;\leq\; e^{-\lambda\Delta} \prod_{i=1}^{n} \Big(p\, e^{\lambda(1-p)} + (1-p)\, e^{-\lambda p}\Big) \;\leq\; e^{-\lambda\Delta + n\lambda^2/8}$$

This gives the convex function −λΔ + nλ²/8 as a second, weaker upper bound on log P(X − np ≥ Δ). Minimizing over the parameter λ then gives λ = 4Δ/n and the bound

$$P(X - np \geq \Delta) \;\leq\; e^{-2\Delta^2/n}$$
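The unproved inequality above (a form of Hoeffding's lemma) is easy to spot-check numerically. The grid check below is only an illustrative sketch, not a proof; the grid ranges are arbitrary.

```python
from math import exp

def lhs(lam, p):
    """p * e^{lambda(1-p)} + (1-p) * e^{-lambda p}, the quantity bounded in the proof."""
    return p * exp(lam * (1 - p)) + (1 - p) * exp(-lam * p)

def rhs(lam):
    """e^{lambda^2 / 8}, the claimed upper bound."""
    return exp(lam**2 / 8)

# Largest violation of lhs <= rhs over a grid of (lambda, p); should be <= 0.
worst = max(lhs(0.1 * i, 0.05 * j) - rhs(0.1 * i)
            for i in range(-50, 51) for j in range(1, 20))
print("largest lhs - rhs on the grid:", worst)
```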

    4.3. Example

Let's now return again to the simple binomial example. Recall that the Markov inequality gave P(X ≥ 3n/4) ≤ 2/3, and the Chebyshev inequality gave P(X ≥ 3n/4) ≤ 4/n. Now, we apply the Chernoff bound just derived to get

$$P(X \geq 3n/4) \;=\; P(X - n/2 \geq n/4) \;\leq\; e^{-2(n/4)^2/n} \;=\; e^{-n/8}$$

This bound goes to zero exponentially fast in n. This is as we expect intuitively, since the binomial should become concentrated around its mean as n → ∞.
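The following sketch (the values of n and helper name are illustrative) tabulates the exact tail P(X ≥ 3n/4) against the three bounds discussed so far, showing how quickly the Chernoff bound pulls away from the others.

```python
from math import comb, exp

def binomial_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

for n in (20, 40, 80):
    exact = binomial_tail(n, 0.5, 3 * n // 4)
    print(f"n={n:3d}  exact={exact:.2e}  Markov={2/3:.2e}  "
          f"Chebyshev={4/n:.2e}  Chernoff={exp(-n/8):.2e}")
```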


    4.4. Alternative forms of Chernoff bounds

There are other forms of the Chernoff bounds for binomial or Bernoulli trials that we'll state here. If X₁, . . . , Xₙ are independent Bernoulli trials with Xᵢ ~ Bernoulli(pᵢ), let X = Σᵢ₌₁ⁿ Xᵢ. Then

$$P\big(X > (1 + \delta) E[X]\big) \;\leq\; \left(\frac{e^{\delta}}{(1 + \delta)^{(1+\delta)}}\right)^{E[X]}$$

If, moreover, 0 < δ < 1, then

$$P\big(X > (1 + \delta) E[X]\big) \;\leq\; e^{-\delta^2 E[X]/3}$$

$$P\big(X < (1 - \delta) E[X]\big) \;\leq\; e^{-\delta^2 E[X]/2}$$

The bounds are slightly different, but the idea of the proof is the same as we've already seen.
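As a quick illustration of these forms, the sketch below plugs in a made-up vector of success probabilities pᵢ (the specific values and the choice of δ are arbitrary assumptions) and evaluates the two exponential bounds.

```python
from math import exp

# Illustrative success probabilities for 60 independent Bernoulli trials.
ps = [0.1, 0.3, 0.5, 0.7, 0.2, 0.4] * 10
mu = sum(ps)                     # E[X] for X = sum of the indicators
delta = 0.5                      # relative deviation, 0 < delta < 1

upper = exp(-delta**2 * mu / 3)  # bound on P(X > (1 + delta) E[X])
lower = exp(-delta**2 * mu / 2)  # bound on P(X < (1 - delta) E[X])
print(f"E[X] = {mu:.1f},  upper-tail bound = {upper:.3e},  lower-tail bound = {lower:.3e}")
```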

    4.5. Inverse bounds

As an example of the use of Chernoff bounds, we can carry out an inverse calculation to see how much beyond the mean a random variable must be to have a sufficiently small tail probability.

Specifically, for X ~ Binomial(n, 1/2), we can ask how big m has to be so that

$$P\Big(X - \frac{n}{2} > m\Big) \;\leq\; \frac{1}{n}$$

Using the Chernoff bound we have that

$$P\Big(X - \frac{n}{2} > m\Big) \;\leq\; e^{-2m^2/n}$$

In order for the right-hand side to be equal to 1/n, we need that

$$\frac{2 m^2}{n} \;=\; \log n$$

which we solve to get $m = \sqrt{n \log n / 2}$.
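A small sketch of this inverse calculation (the choice n = 1000 and the helper name are illustrative): it computes m = √(n log n / 2) and, for comparison, the exact tail P(X − n/2 > m), which by the bound above should come out no larger than 1/n.

```python
from math import sqrt, log, comb

def binomial_tail_above(n, k):
    """Exact P(X > k) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, j) for j in range(k + 1, n + 1)) / 2**n

n = 1000
m = sqrt(n * log(n) / 2)          # deviation suggested by the Chernoff bound
print("m          :", round(m, 1))
print("target 1/n :", 1 / n)
print("exact tail :", binomial_tail_above(n, int(n / 2 + m)))
```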

    4.6. Application to Algorithms

We now apply the Chernoff bound to the analysis of randomized quicksort. Recall that earlier we showed randomized quicksort has expected running time O(n log n). Here we'll show something much stronger: that with high probability the running time is O(n log n). In particular, we will show the following.

Theorem. Randomized quicksort runs in O(n log n) time with probability at least 1 − 1/n^b for some constant b > 1.

To sketch the proof of this, suppose that we run the algorithm and stop when the depth of the recursion tree is c log n for some constant c. The question we need to ask is: what is the probability there is a leaf that is not a singleton set {sᵢ}?


Call a split (induced by some pivot) good if it breaks up the set S into two pieces S₁ and S₂ with

$$\min(|S_1|, |S_2|) \;\geq\; \tfrac{1}{3}|S|$$

Otherwise the split is bad. What is the probability of getting a good split? Assuming that all elements are distinct, it's easy to see that the probability of a good split is 1/3.

Now, for each good split, the size of the remaining set shrinks to at most 2/3 of its previous size. How many good splits are needed in the worst case to get to a singleton set? A little calculation shows that we require

$$x \;=\; \frac{\log n}{\log(3/2)} \;=\; a \log n$$

good splits. We can next reason about how large the constant c needs to be so that the probability that we get fewer than a log n good splits is small.

Consider a single path from the root to a leaf in the recursion tree. The expected number of good splits along it is (1/3) c log n. By the Chernoff bound, we have that

$$P\Big(\text{number of good splits} < (1 - \delta)\, \tfrac{1}{3} c \log n\Big) \;\leq\; e^{-\frac{1}{3} c \log n \, \delta^2 / 2}$$

We can then choose c sufficiently large so that this right-hand side is no larger than 1/n².

We arranged this so that the probability of having too few good splits on a single path was less than 1/n². Thus, by the union bound, the probability that there is some path from root to a leaf that has too few good splits is less than 1/n. It follows that with probability at least 1 − 1/n, the algorithm finishes with a singleton set {sᵢ} at each leaf in the tree, meaning that the set S is successfully sorted in time O(n log n).

This is a much stronger guarantee on the running time of the algorithm, which explains the effectiveness of randomized quicksort in practice.
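To connect this analysis to running code, here is a simplified randomized quicksort (a list-based sketch with a uniformly chosen pivot, not necessarily the in-place version analyzed above; the names and the constant 3 in the reference value are illustrative) that records the recursion depth, which in practice stays within a small multiple of log₂ n.

```python
import random

def quicksort_depth(items):
    """Randomized quicksort; returns (sorted list, recursion depth reached)."""
    if len(items) <= 1:
        return items, 0
    pivot = random.choice(items)
    left = [x for x in items if x < pivot]
    mid = [x for x in items if x == pivot]
    right = [x for x in items if x > pivot]
    left_sorted, left_depth = quicksort_depth(left)
    right_sorted, right_depth = quicksort_depth(right)
    return left_sorted + mid + right_sorted, 1 + max(left_depth, right_depth)

random.seed(0)
for n in (1_000, 10_000, 100_000):
    data = random.sample(range(10 * n), n)
    _, depth = quicksort_depth(data)
    # n.bit_length() is roughly log2(n); 3*log2(n) serves as an illustrative "c log n" reference.
    print(f"n={n:7d}  recursion depth={depth}  (reference 3*log2(n) ~ {3 * n.bit_length()})")
```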

    4.7. Pointer to recent research

As a final note, in recent research in machine learning we have attempted to extend the scope of Chernoff bounds to non-independent random variables, using some of the machinery of convex optimization [4].


    5. Jensen

Johan Ludwig Jensen of Denmark was self-taught in mathematics, and never had an academic or full research position. Instead, beginning in 1880 he worked for a telephone company to support himself sufficiently in order to be able to work on mathematics in his spare time. (Only much later did telephone workers at such companies as AT&T get paid directly to do mathematics!)

Jensen studied convex functions, and is given credit for a particularly important and useful inequality involving convex functions. A convex function is one, such as f(x) = x² + ax + b, that "holds water." Jensen's inequality is best viewed geometrically. If f(x) and f(y) are two points on the graph of f and we connect them by a chord, this line is traced out by λ f(x) + (1 − λ) f(y) for 0 ≤ λ ≤ 1. By convexity, this line lies above the graph of the function. Thus²,

$$\lambda f(x) + (1 - \lambda) f(y) \;\geq\; f\big(\lambda x + (1 - \lambda) y\big)$$

By induction, it follows easily that

$$\lambda_1 f(x_1) + \cdots + \lambda_n f(x_n) \;\geq\; f(\lambda_1 x_1 + \cdots + \lambda_n x_n)$$

where λᵢ ≥ 0 and Σᵢ₌₁ⁿ λᵢ = 1.

If X is a finite random variable, this shows that for f convex,

$$f(E[X]) \;\leq\; E[f(X)]$$

Similarly, for g a concave function,

$$g(E[X]) \;\geq\; E[g(X)]$$

These inequalities are best remembered by simply drawing a picture of a convex function and a chord between two points on the curve.
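A quick numerical illustration of f(E[X]) ≤ E[f(X)] (the choice of an exponential distribution and of f(x) = x² is arbitrary; any convex f and non-degenerate X would do):

```python
import random

random.seed(1)
xs = [random.expovariate(1.0) for _ in range(100_000)]   # samples of a non-negative r.v.

def f(x):
    """A convex function."""
    return x * x

mean = sum(xs) / len(xs)
print("f(E[X]) ~", f(mean))                          # smaller value
print("E[f(X)] ~", sum(f(x) for x in xs) / len(xs))  # larger value, by Jensen's inequality
```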

    5.1. Example

As a simple use of Jensen's inequality, we note the arithmetic/geometric mean inequality. Using concavity of log(x), if a and b are positive then

$$\tfrac{1}{2} \log(a) + \tfrac{1}{2} \log(b) \;\leq\; \log\Big(\tfrac{1}{2} a + \tfrac{1}{2} b\Big)$$

which implies

$$\sqrt{ab} \;\leq\; \tfrac{1}{2}(a + b)$$

² It appears that Jensen, in fact, only studied the inequality $f\big(\tfrac{x+y}{2}\big) \leq \tfrac{f(x) + f(y)}{2}$.


    5.2. Example: Entropy

If X is a finite r.v. taking values vᵢ with probability pᵢ = P(X = vᵢ), the entropy of X, denoted H(X), is defined as

$$H(X) \;=\; -\sum_{i=1}^{n} p_i \log p_i$$

Clearly H(X) ≥ 0 since log(1/pᵢ) ≥ 0. Now, since log is concave, we have by Jensen's inequality that

$$H(X) \;=\; -\sum_{i} p_i \log p_i \;=\; \sum_{i} p_i \log \frac{1}{p_i} \;\leq\; \log\Big(\sum_{i=1}^{n} p_i \cdot \frac{1}{p_i}\Big) \;=\; \log n$$

    so that the entropy lies in the interval [0, log n].

The entropy is a lower bound on the amount of compression possible for a sequence of characters vᵢ that are independent and identically distributed according to the pmf of X.
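A short sketch of the definition and the bounds just derived (the pmf below is made up for illustration; natural logarithms are used, so the upper bound is log n in nats):

```python
from math import log

def entropy(pmf):
    """H(X) = -sum_i p_i log p_i (natural log), skipping zero-probability values."""
    return -sum(p * log(p) for p in pmf if p > 0)

pmf = [0.5, 0.25, 0.125, 0.125]   # an illustrative distribution on n = 4 values
n = len(pmf)
print("H(X)  =", entropy(pmf))
print("log n =", log(n))          # by Jensen's inequality, H(X) <= log n
```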

    References

[1] Herman Chernoff (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23, 493-507.

    [2] The MacTutor history of mathematics archive (2004). http://www-gap.dcs.st-and.ac.uk/~history.

[3] Rajeev Motwani and Prabhakar Raghavan (1995). Randomized Algorithms. Cambridge University Press.

[4] Pradeep Ravikumar and John Lafferty (2004). Variational Chernoff bounds for graphical models. Uncertainty in Artificial Intelligence.

    [5] Sheldon M. Ross (2002). Probability Models for Computer Science. Harcourt/Academic Press.
