
  • Introduction to Detection Theory

    Reading:

    Ch. 3 in Kay-II.

    Notes by Prof. Don Johnson on detection theory, see http://www.ece.rice.edu/~dhj/courses/elec531/notes5.pdf.

    Ch. 10 in Wasserman.


  • Introduction to Detection Theory (cont.)

    We wish to make a decision on a signal of interest using noisy measurements. Statistical tools enable systematic solutions and optimal design.

    Application areas include:

    Communications,

    Radar and sonar,

    Nondestructive evaluation (NDE) of materials,

    Biomedicine, etc.


  • Example: Radar Detection. We wish to decide on the presence or absence of a target.

  • Introduction to Detection Theory

    We assume a parametric measurement model $p(x\,|\,\theta)$ [or $p(x\,;\,\theta)$, which is the notation that we sometimes use in the classical setting].

    In point estimation theory, we estimated the parameter $\theta$ given the data $x$.

    Suppose now that we choose $\Theta_0$ and $\Theta_1$ that form a partition of the parameter space $\Theta$:

    $$\Theta_0 \cup \Theta_1 = \Theta, \qquad \Theta_0 \cap \Theta_1 = \emptyset.$$

    In detection theory, we wish to identify which hypothesis is true (i.e. make the appropriate decision):

    $$H_0: \theta \in \Theta_0 \quad \text{(null hypothesis)}, \qquad H_1: \theta \in \Theta_1 \quad \text{(alternative hypothesis)}.$$

    Terminology: If $\theta$ can take only two values,

    $$\Theta = \{\theta_0, \theta_1\}, \qquad \Theta_0 = \{\theta_0\}, \qquad \Theta_1 = \{\theta_1\},$$

    we say that the hypotheses are simple. Otherwise, we say that they are composite.

    Composite hypothesis example: $H_0: \theta = 0$ versus $H_1: \theta \in (0, \infty)$.

  • The Decision Rule

    We wish to design a decision rule (function) $\phi(x): \mathcal{X} \to \{0, 1\}$:

    $$\phi(x) = \begin{cases} 1, & \text{decide } H_1,\\ 0, & \text{decide } H_0 \end{cases}$$

    which partitions the data space $\mathcal{X}$ [i.e. the support of $p(x\,|\,\theta)$] into two regions:

    Rule $\phi(x)$: $\quad \mathcal{X}_0 = \{x: \phi(x) = 0\}, \quad \mathcal{X}_1 = \{x: \phi(x) = 1\}.$

    Let us define the probabilities of false alarm and miss:

    $$P_{FA} = E_{x|\theta}[\phi(X)\,|\,\theta] = \int_{\mathcal{X}_1} p(x\,|\,\theta)\, dx \quad \text{for } \theta \in \Theta_0$$

    $$P_M = E_{x|\theta}[1 - \phi(X)\,|\,\theta] = 1 - \int_{\mathcal{X}_1} p(x\,|\,\theta)\, dx = \int_{\mathcal{X}_0} p(x\,|\,\theta)\, dx \quad \text{for } \theta \in \Theta_1.$$

    Then, the probability of detection (correctly deciding $H_1$) is

    $$P_D = 1 - P_M = E_{x|\theta}[\phi(X)\,|\,\theta] = \int_{\mathcal{X}_1} p(x\,|\,\theta)\, dx \quad \text{for } \theta \in \Theta_1.$$
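    To see these definitions in action, here is a minimal Monte Carlo sketch (added for illustration, not part of the original notes): it estimates $P_{FA}$ and $P_D$ for the toy rule $\phi(x) = 1\{x > \gamma\}$ with scalar Gaussian data; the densities and threshold are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 1.0                              # decision threshold (assumed)
theta0, theta1, sigma = 0.0, 2.0, 1.0    # p(x|theta) = N(theta, sigma^2) (assumed)

# Draw data under each hypothesis and apply phi(x) = 1{x > gamma}.
x0 = rng.normal(theta0, sigma, 100_000)  # x ~ p(x | theta in Theta0)
x1 = rng.normal(theta1, sigma, 100_000)  # x ~ p(x | theta in Theta1)

P_FA = np.mean(x0 > gamma)  # estimates the integral of p(x|theta0) over X1
P_D  = np.mean(x1 > gamma)  # estimates the integral of p(x|theta1) over X1
print(f"P_FA ~ {P_FA:.4f}, P_D ~ {P_D:.4f}, P_M ~ {1 - P_D:.4f}")
```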


  • Note: $P_{FA}$ and $P_D$/$P_M$ are generally functions of the parameter $\theta$ (where $\theta \in \Theta_0$ when computing $P_{FA}$ and $\theta \in \Theta_1$ when computing $P_D$/$P_M$).

    More Terminology. Statisticians use the following terminology:

    False alarm                ↔  Type I error
    Miss                       ↔  Type II error
    Probability of detection   ↔  Power
    Probability of false alarm ↔  Significance level.

  • Bayesian Decision-theoretic Detection Theory

    Recall (a slightly generalized version of) the posterior expected loss

    $$\rho(\text{action}\,|\,x) = \int L(\theta, \text{action})\, p(\theta\,|\,x)\, d\theta$$

    that we introduced in handout # 4 when we discussed Bayesian decision theory. Let us now apply this theory to our easy example discussed here: hypothesis testing, where our action space consists of only two choices. We first assign a loss table:

                                true state
        decision rule       θ ∈ Θ1          θ ∈ Θ0
        x ∈ X1              L(1|1) = 0      L(1|0)
        x ∈ X0              L(0|1)          L(0|0) = 0

    with the loss function described by the quantities L(declared | true):

    L(1|0) quantifies the loss due to a false alarm,

    L(0|1) quantifies the loss due to a miss,

    L(1|1) and L(0|0) (losses due to correct decisions) are typically set to zero in real life. Here, we adopt zero losses for correct decisions.

  • Now, our posterior expected loss takes two values:

    $$\rho_0(x) = \int_{\Theta_1} L(0\,|\,1)\, p(\theta\,|\,x)\, d\theta + \int_{\Theta_0} \underbrace{L(0\,|\,0)}_{0}\, p(\theta\,|\,x)\, d\theta \overset{L(0|1)\ \text{is constant}}{=} L(0\,|\,1) \underbrace{\int_{\Theta_1} p(\theta\,|\,x)\, d\theta}_{P[\theta \in \Theta_1 | x]}$$

    and, similarly,

    $$\rho_1(x) = \int_{\Theta_0} L(1\,|\,0)\, p(\theta\,|\,x)\, d\theta \overset{L(1|0)\ \text{is constant}}{=} L(1\,|\,0) \underbrace{\int_{\Theta_0} p(\theta\,|\,x)\, d\theta}_{P[\theta \in \Theta_0 | x]}.$$

    We define the Bayes decision rule as the rule that minimizes the posterior expected loss; this rule corresponds to choosing the data-space partitioning as follows:

    $$\mathcal{X}_1 = \{x: \rho_1(x) \le \rho_0(x)\}$$

  • or

    $$\mathcal{X}_1 = \Big\{x: \frac{P[\theta \in \Theta_1\,|\,x]}{P[\theta \in \Theta_0\,|\,x]} = \frac{\int_{\Theta_1} p(\theta\,|\,x)\, d\theta}{\int_{\Theta_0} p(\theta\,|\,x)\, d\theta} \ge \frac{L(1\,|\,0)}{L(0\,|\,1)}\Big\} \tag{1}$$

    or, equivalently, upon applying the Bayes rule:

    $$\mathcal{X}_1 = \Big\{x: \frac{\int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta}{\int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta} \ge \frac{L(1\,|\,0)}{L(0\,|\,1)}\Big\}. \tag{2}$$

    0-1 loss: For L(1|0) = L(0|1) = 1, we have

                                true state
        decision rule       θ ∈ Θ1          θ ∈ Θ0
        x ∈ X1              L(1|1) = 0      L(1|0) = 1
        x ∈ X0              L(0|1) = 1      L(0|0) = 0

    yielding the following Bayes decision rule, called the maximum a posteriori (MAP) rule:

    $$\mathcal{X}_1 = \Big\{x: \frac{P[\theta \in \Theta_1\,|\,x]}{P[\theta \in \Theta_0\,|\,x]} \ge 1\Big\} \tag{3}$$

    or, equivalently, upon applying the Bayes rule:

    $$\mathcal{X}_1 = \Big\{x: \frac{\int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta}{\int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta} \ge 1\Big\}. \tag{4}$$

  • Simple hypotheses. Let us specialize (1) to the case of simple hypotheses ($\Theta_0 = \{\theta_0\}$, $\Theta_1 = \{\theta_1\}$):

    $$\mathcal{X}_1 = \Big\{x: \underbrace{\frac{p(\theta_1\,|\,x)}{p(\theta_0\,|\,x)}}_{\text{posterior-odds ratio}} \ge \frac{L(1\,|\,0)}{L(0\,|\,1)}\Big\}. \tag{5}$$

    We can rewrite (5) using the Bayes rule:

    $$\mathcal{X}_1 = \Big\{x: \underbrace{\frac{p(x\,|\,\theta_1)}{p(x\,|\,\theta_0)}}_{\text{likelihood ratio}} \ge \frac{\pi_0\, L(1\,|\,0)}{\pi_1\, L(0\,|\,1)}\Big\} \tag{6}$$

    where

    $$\pi_0 = \pi(\theta_0), \qquad \pi_1 = \pi(\theta_1) = 1 - \pi_0$$

    describe the prior probability mass function (pmf) of the binary random variable $\theta$ (recall that $\theta \in \{\theta_0, \theta_1\}$). Hence, for binary simple hypotheses, the prior pmf of $\theta$ is a Bernoulli pmf.
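    As a sketch of rule (6) in action (my own illustrative example; the densities, priors, and losses are assumed): two equal-variance Gaussian hypotheses, compared against the Bayes threshold.

```python
import numpy as np
from scipy.stats import norm

theta0, theta1, sigma = 0.0, 1.0, 1.0  # assumed simple hypotheses
pi0, L10 = 0.7, 1.0                    # assumed prior and losses
pi1, L01 = 1.0 - pi0, 1.0

def decide_H1(x):
    # Likelihood ratio p(x|theta1)/p(x|theta0) against the Bayes threshold in (6).
    lr = norm.pdf(x, theta1, sigma) / norm.pdf(x, theta0, sigma)
    return lr >= (pi0 * L10) / (pi1 * L01)

print(decide_H1(np.array([0.2, 0.9, 2.0])))  # here: [False False  True]
```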


  • Preposterior (Bayes) Risk

    The preposterior (Bayes) risk for rule $\phi(x)$ is

    $$E_{x,\theta}[\text{loss}] = \int_{\mathcal{X}_1} \int_{\Theta_0} L(1\,|\,0)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx + \int_{\mathcal{X}_0} \int_{\Theta_1} L(0\,|\,1)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx.$$

    How do we choose the rule $\phi(x)$ that minimizes the preposterior risk? Write

    $$\begin{aligned}
    &\int_{\mathcal{X}_1} \int_{\Theta_0} L(1\,|\,0)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx + \int_{\mathcal{X}_0} \int_{\Theta_1} L(0\,|\,1)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx\\
    &= \int_{\mathcal{X}_1} \int_{\Theta_0} L(1\,|\,0)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx - \int_{\mathcal{X}_1} \int_{\Theta_1} L(0\,|\,1)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx\\
    &\quad + \underbrace{\int_{\mathcal{X}_0} \int_{\Theta_1} L(0\,|\,1)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx + \int_{\mathcal{X}_1} \int_{\Theta_1} L(0\,|\,1)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx}_{=\ \text{const, not dependent on } \phi(x)}\\
    &= \text{const} + \int_{\mathcal{X}_1} \Big\{L(1\,|\,0) \int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta - L(0\,|\,1) \int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta\Big\}\, dx
    \end{aligned}$$

    implying that $\mathcal{X}_1$ should be chosen as

    $$\mathcal{X}_1 = \Big\{x: L(1\,|\,0) \int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta - L(0\,|\,1) \int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta < 0\Big\}$$

  • which, as expected, is the same as (2), since

    minimizing the posterior expected loss ⟺ minimizing the preposterior risk for every $x$,

    as shown earlier in handout # 4.

    0-1 loss: For the 0-1 loss, i.e. L(1|0) = L(0|1) = 1, the preposterior (Bayes) risk for rule $\phi(x)$ is

    $$E_{x,\theta}[\text{loss}] = \int_{\mathcal{X}_1} \int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx + \int_{\mathcal{X}_0} \int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx \tag{7}$$

    which is simply the average error probability, with averaging performed over the joint probability density or mass function (pdf/pmf) of the data $x$ and parameters $\theta$.

  • Bayesian Decision-theoretic Detection for Simple Hypotheses

    The Bayes decision rule for simple hypotheses is (6):

    $$\underbrace{\Lambda(x)}_{\text{likelihood ratio}} = \frac{p(x\,|\,\theta_1)}{p(x\,|\,\theta_0)}\ \overset{H_1}{\gtrless}\ \frac{\pi_0\, L(1\,|\,0)}{\pi_1\, L(0\,|\,1)} \triangleq \gamma \tag{8}$$

    see also Ch. 3.7 in Kay-II. [Recall that $\Lambda(x)$ is the sufficient statistic for the detection problem, see p. 37 in handout # 1.] Equivalently,

    $$\log \Lambda(x) = \log[p(x\,|\,\theta_1)] - \log[p(x\,|\,\theta_0)]\ \overset{H_1}{\gtrless}\ \log \gamma.$$

    Minimum Average Error Probability Detection: In the familiar 0-1 loss case where L(1|0) = L(0|1) = 1, we know that the preposterior (Bayes) risk is equal to the average error probability, see (7). This average error probability greatly simplifies in the simple-hypothesis testing case:

    $$\text{av. error probability} = \int_{\mathcal{X}_1} \underbrace{L(1\,|\,0)}_{1}\, p(x\,|\,\theta_0)\,\pi_0\, dx + \int_{\mathcal{X}_0} \underbrace{L(0\,|\,1)}_{1}\, p(x\,|\,\theta_1)\,\pi_1\, dx = \pi_0 \underbrace{\int_{\mathcal{X}_1} p(x\,|\,\theta_0)\, dx}_{P_{FA}} + \pi_1 \underbrace{\int_{\mathcal{X}_0} p(x\,|\,\theta_1)\, dx}_{P_M}$$

    where, as before, the averaging is performed over the joint pdf/pmf of the data $x$ and parameters $\theta$, and

    $$\pi_0 = \pi(\theta_0), \qquad \pi_1 = \pi(\theta_1) = 1 - \pi_0 \quad \text{(the Bernoulli pmf)}.$$

    In this case, our Bayes decision rule simplifies to the MAP rule [as expected, see (5) and Ch. 3.6 in Kay-II]:

    $$\underbrace{\frac{p(\theta_1\,|\,x)}{p(\theta_0\,|\,x)}}_{\text{posterior-odds ratio}}\ \overset{H_1}{\gtrless}\ 1 \tag{9}$$

    or, equivalently, upon applying the Bayes rule:

    $$\underbrace{\frac{p(x\,|\,\theta_1)}{p(x\,|\,\theta_0)}}_{\text{likelihood ratio}}\ \overset{H_1}{\gtrless}\ \frac{\pi_0}{\pi_1}. \tag{10}$$

  • which is the same as

    (4), upon substituting the Bernoulli pmf as the prior pmf for $\theta$, and

    (8), upon substituting L(1|0) = L(0|1) = 1.

  • Bayesian Decision-theoretic Detection Theory: Handling Nuisance Parameters

    We apply the same approach as before: integrate the nuisance parameters ($\psi$, say) out!

    Therefore, (1) still holds for testing

    $$H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_1$$

    but $p_{\theta|x}(\theta\,|\,x)$ is computed as follows:

    $$p_{\theta|x}(\theta\,|\,x) = \int p_{\theta,\psi|x}(\theta, \psi\,|\,x)\, d\psi$$

    and, therefore,

    $$\frac{\int_{\Theta_1} \overbrace{\textstyle\int p_{\theta,\psi|x}(\theta, \psi\,|\,x)\, d\psi}^{p(\theta|x)}\, d\theta}{\int_{\Theta_0} \underbrace{\textstyle\int p_{\theta,\psi|x}(\theta, \psi\,|\,x)\, d\psi}_{p(\theta|x)}\, d\theta}\ \overset{H_1}{\gtrless}\ \frac{L(1\,|\,0)}{L(0\,|\,1)}. \tag{11}$$

  • or, equivalently, upon applying the Bayes rule:

    $$\frac{\int_{\Theta_1} \int p_{x|\theta,\psi}(x\,|\,\theta, \psi)\,\pi_{\theta,\psi}(\theta, \psi)\, d\psi\, d\theta}{\int_{\Theta_0} \int p_{x|\theta,\psi}(x\,|\,\theta, \psi)\,\pi_{\theta,\psi}(\theta, \psi)\, d\psi\, d\theta}\ \overset{H_1}{\gtrless}\ \frac{L(1\,|\,0)}{L(0\,|\,1)}. \tag{12}$$

    Now, if $\theta$ and $\psi$ are independent a priori, i.e.

    $$\pi_{\theta,\psi}(\theta, \psi) = \pi_{\theta}(\theta)\, \pi_{\psi}(\psi) \tag{13}$$

    then (12) can be rewritten as

    $$\frac{\int_{\Theta_1} \pi_{\theta}(\theta)\, \overbrace{\textstyle\int p_{x|\theta,\psi}(x\,|\,\theta, \psi)\,\pi_{\psi}(\psi)\, d\psi}^{p(x|\theta)}\, d\theta}{\int_{\Theta_0} \pi_{\theta}(\theta)\, \underbrace{\textstyle\int p_{x|\theta,\psi}(x\,|\,\theta, \psi)\,\pi_{\psi}(\psi)\, d\psi}_{p(x|\theta)}\, d\theta}\ \overset{H_1}{\gtrless}\ \frac{L(1\,|\,0)}{L(0\,|\,1)}. \tag{14}$$

    Simple hypotheses and independent priors for $\theta$ and $\psi$: Let us specialize (11) to the simple hypotheses ($\Theta_0 = \{\theta_0\}$, $\Theta_1 = \{\theta_1\}$):

    $$\frac{\int p_{\theta,\psi|x}(\theta_1, \psi\,|\,x)\, d\psi}{\int p_{\theta,\psi|x}(\theta_0, \psi\,|\,x)\, d\psi}\ \overset{H_1}{\gtrless}\ \frac{L(1\,|\,0)}{L(0\,|\,1)}. \tag{15}$$

    Now, if $\theta$ and $\psi$ are independent a priori, i.e. (13) holds, then

  • we can rewrite (14) [or (15) using the Bayes rule]:

    $$\underbrace{\frac{\int p_{x|\theta,\psi}(x\,|\,\theta_1, \psi)\,\pi_{\psi}(\psi)\, d\psi}{\int p_{x|\theta,\psi}(x\,|\,\theta_0, \psi)\,\pi_{\psi}(\psi)\, d\psi}}_{\text{integrated likelihood ratio}} = \underbrace{\frac{p(x\,|\,\theta_1)}{p(x\,|\,\theta_0)}}_{\text{same as (6)}}\ \overset{H_1}{\gtrless}\ \frac{\pi_0\, L(1\,|\,0)}{\pi_1\, L(0\,|\,1)} \tag{16}$$

    where

    $$\pi_0 = \pi_{\theta}(\theta_0), \qquad \pi_1 = \pi_{\theta}(\theta_1) = 1 - \pi_0.$$

  • Chernoff Bound on Average Error Probability for Simple Hypotheses

    Recall that minimizing the average error probability

    $$\text{av. error probability} = \int_{\mathcal{X}_1} \int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx + \int_{\mathcal{X}_0} \int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx$$

    leads to the MAP decision rule:

    $$\mathcal{X}_1^{\star} = \Big\{x: \int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta - \int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta < 0\Big\}.$$

    In many applications, we may not be able to obtain a simple closed-form expression for the minimum average error

  • probability, but we can bound it as follows:

    $$\begin{aligned}
    \text{min av. error probability} &= \int_{\mathcal{X}_1^{\star}} \int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx + \int_{\mathcal{X}_0^{\star}} \int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx\\
    &\overset{\text{using the def. of } \mathcal{X}_1^{\star}}{=} \int_{\underbrace{\mathcal{X}}_{\text{data space}}} \min\Big\{\int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta,\ \int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta\Big\}\, dx\\
    &\le \int_{\mathcal{X}} \Big[\underbrace{\int_{\Theta_0} p(x\,|\,\theta)\,\pi(\theta)\, d\theta}_{\triangleq\, q_0(x)}\Big]^{\lambda}\, \Big[\underbrace{\int_{\Theta_1} p(x\,|\,\theta)\,\pi(\theta)\, d\theta}_{\triangleq\, q_1(x)}\Big]^{1-\lambda}\, dx = \int_{\mathcal{X}} [q_0(x)]^{\lambda}\, [q_1(x)]^{1-\lambda}\, dx
    \end{aligned}$$

    which is the Chernoff bound on the minimum average error probability. Here, we have used the fact that

    $$\min\{a, b\} \le a^{\lambda}\, b^{1-\lambda}, \qquad \text{for } 0 \le \lambda \le 1,\ a, b \ge 0.$$

    When

    $$x = \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_N \end{bmatrix}$$

  • with

    $x_1, x_2, \ldots, x_N$ conditionally independent, identically distributed (i.i.d.) given $\theta$, and

    simple hypotheses (i.e. $\Theta_0 = \{\theta_0\}$, $\Theta_1 = \{\theta_1\}$),

    then

    $$q_0(x) = p(x\,|\,\theta_0)\,\pi_0 = \pi_0 \prod_{n=1}^{N} p(x_n\,|\,\theta_0), \qquad q_1(x) = p(x\,|\,\theta_1)\,\pi_1 = \pi_1 \prod_{n=1}^{N} p(x_n\,|\,\theta_1)$$

    yielding

    $$\begin{aligned}
    &\text{Chernoff bound for } N \text{ conditionally i.i.d. measurements (given } \theta \text{) and simple hyp.}\\
    &= \int \Big[\pi_0 \prod_{n=1}^{N} p(x_n\,|\,\theta_0)\Big]^{\lambda}\, \Big[\pi_1 \prod_{n=1}^{N} p(x_n\,|\,\theta_1)\Big]^{1-\lambda}\, dx\\
    &= \pi_0^{\lambda}\, \pi_1^{1-\lambda} \prod_{n=1}^{N} \Big\{\int [p(x_n\,|\,\theta_0)]^{\lambda}\, [p(x_n\,|\,\theta_1)]^{1-\lambda}\, dx_n\Big\}\\
    &= \pi_0^{\lambda}\, \pi_1^{1-\lambda}\, \Big\{\int [p(x_1\,|\,\theta_0)]^{\lambda}\, [p(x_1\,|\,\theta_1)]^{1-\lambda}\, dx_1\Big\}^N
    \end{aligned}$$

  • or, in other words,

    $$\frac{1}{N} \log(\text{min av. error probability}) \le \frac{1}{N} \log\big(\pi_0^{\lambda}\, \pi_1^{1-\lambda}\big) + \log \int [p(x_1\,|\,\theta_0)]^{\lambda}\, [p(x_1\,|\,\theta_1)]^{1-\lambda}\, dx_1, \qquad \lambda \in [0, 1].$$

    If $\pi_0 = \pi_1 = 1/2$ (which is almost always the case of interest when evaluating average error probabilities), we can say that, as $N \to \infty$,

    $$\underbrace{\text{min av. error probability}}_{\text{for } N \text{ cond. i.i.d. measurements (given } \theta \text{) and simple hypotheses}} \le f(N)\, \exp\Big(-N \underbrace{\Big\{-\min_{\lambda \in [0,1]} \log \int [p(x_1\,|\,\theta_0)]^{\lambda}\, [p(x_1\,|\,\theta_1)]^{1-\lambda}\, dx_1\Big\}}_{\text{Chernoff information for a single observation}}\Big)$$

    where $f(N)$ is a slowly-varying function compared with the exponential term:

    $$\lim_{N \to \infty} \frac{\log f(N)}{N} = 0.$$

    Note that the Chernoff information in the exponent term of the above expression quantifies the asymptotic behavior of the minimum average error probability.

    We now give a useful result, taken from

  • K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., San Diego, CA: Academic Press, 1990

    for evaluating a class of Chernoff bounds.

    Lemma 1. Consider $p_1(x) = \mathcal{N}(\mu_1, \Sigma_1)$ and $p_2(x) = \mathcal{N}(\mu_2, \Sigma_2)$. Then

    $$\int [p_1(x)]^{\lambda}\, [p_2(x)]^{1-\lambda}\, dx = \exp[-g(\lambda)]$$

    where

    $$g(\lambda) = \frac{\lambda\,(1-\lambda)}{2}\, (\mu_2 - \mu_1)^T\, [\lambda \Sigma_1 + (1-\lambda)\, \Sigma_2]^{-1}\, (\mu_2 - \mu_1) + \frac{1}{2} \log\Big[\frac{|\lambda \Sigma_1 + (1-\lambda)\, \Sigma_2|}{|\Sigma_1|^{\lambda}\, |\Sigma_2|^{1-\lambda}}\Big].$$
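    A short numerical sketch of Lemma 1 (added for illustration; the scalar means and variances are assumed values): evaluate $g(\lambda)$ on a grid and maximize over $\lambda$, which yields the Chernoff information $-\min_{\lambda} \log \int p_1^{\lambda} p_2^{1-\lambda}\, dx = \max_{\lambda} g(\lambda)$.

```python
import numpy as np

def g(lam, mu1, mu2, S1, S2):
    # g(lambda) from Lemma 1, written out for the scalar case.
    S = lam * S1 + (1 - lam) * S2
    quad = lam * (1 - lam) / 2 * (mu2 - mu1) ** 2 / S
    return quad + 0.5 * np.log(S / (S1 ** lam * S2 ** (1 - lam)))

mu1, mu2, S1, S2 = 0.0, 1.0, 1.0, 1.0           # assumed means and variances
lam = np.linspace(1e-3, 1 - 1e-3, 999)
chernoff_info = np.max(g(lam, mu1, mu2, S1, S2))
# Equal variances: maximum at lambda = 1/2, value (mu2-mu1)^2/(8*S1) = 0.125.
print(chernoff_info)
```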


  • Probabilities of False Alarm ($P_{FA}$) and Detection ($P_D$) for Simple Hypotheses

    [Figure: pdfs of the test statistic under $H_0$ and $H_1$, with the decision threshold marked; $P_{FA}$ is the area under the $H_0$ pdf to the right of the threshold, and $P_D$ is the corresponding area under the $H_1$ pdf.]

    $$P_{FA} = P\big[\underbrace{\Lambda(x)}_{\text{test statistic}} > \gamma\ \big|\ \theta = \theta_0\big], \qquad P_D = P\big[\Lambda(x) > \gamma\ \big|\ \theta = \theta_1\big]$$

    where the event $\{\Lambda(x) > \gamma\}$ is the same as $\{x \in \mathcal{X}_1\}$.

    Comments:

    (i) As the region $\mathcal{X}_1$ shrinks (i.e. $\gamma \to \infty$), both of the above probabilities shrink towards zero.

  • (ii) As the region $\mathcal{X}_1$ grows (i.e. $\gamma \to 0$), both of these probabilities grow towards unity.

    (iii) Observations (i) and (ii) do not imply equality between $P_{FA}$ and $P_D$; in most cases, as $\mathcal{X}_1$ grows, $P_D$ grows more rapidly than $P_{FA}$ (i.e. we had better be right more often than we are wrong).

    (iv) However, the perfect case where our rule is always right and never wrong ($P_D = 1$ and $P_{FA} = 0$) cannot occur when the conditional pdfs/pmfs $p(x\,|\,\theta_0)$ and $p(x\,|\,\theta_1)$ overlap.

    (v) Thus, to increase the detection probability $P_D$, we must also allow the false-alarm probability $P_{FA}$ to increase. This behavior

    represents the fundamental tradeoff in hypothesis testing and detection theory, and

    motivates us to introduce a (classical) approach to testing simple hypotheses, pioneered by Neyman and Pearson (to be discussed next).


  • Neyman-Pearson Test for Simple Hypotheses

    Setup:

    Parametric data models $p(x\,;\,\theta_0)$, $p(x\,;\,\theta_1)$,

    Simple hypothesis testing:

    $$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta = \theta_1.$$

    No prior pdf/pmf on $\theta$ is available.

    Goal: Design a test that maximizes the probability of detection

    $$P_D = P[X \in \mathcal{X}_1\,;\,\theta = \theta_1]$$

    (equivalently, minimizes the miss probability $P_M$) under the constraint

    $$P_{FA} = P[X \in \mathcal{X}_1\,;\,\theta = \theta_0] = \alpha.$$

    Here, we consider simple hypotheses; the classical version of testing composite hypotheses is much more complicated. The Bayesian version of testing composite hypotheses is trivial (as


  • is everything else Bayesian, at least conceptually) and we have already seen it.

    Solution. We apply the Lagrange-multiplier approach: maximize

    $$\begin{aligned}
    \mathcal{L} &= P_D + \lambda\, (P_{FA} - \alpha)\\
    &= \int_{\mathcal{X}_1} p(x\,;\,\theta_1)\, dx + \lambda\, \Big[\int_{\mathcal{X}_1} p(x\,;\,\theta_0)\, dx - \alpha\Big]\\
    &= \int_{\mathcal{X}_1} [p(x\,;\,\theta_1) + \lambda\, p(x\,;\,\theta_0)]\, dx - \lambda \alpha.
    \end{aligned}$$

    To maximize $\mathcal{L}$, set

    $$\mathcal{X}_1 = \{x: p(x\,;\,\theta_1) + \lambda\, p(x\,;\,\theta_0) > 0\} = \Big\{x: \frac{p(x\,;\,\theta_1)}{p(x\,;\,\theta_0)} > -\lambda \triangleq \gamma\Big\}.$$

    Again, we find the likelihood ratio:

    $$\Lambda(x) = \frac{p(x\,;\,\theta_1)}{p(x\,;\,\theta_0)}.$$

    Recall our constraint:

    $$\int_{\mathcal{X}_1} p(x\,;\,\theta_0)\, dx = P_{FA} = \alpha.$$


  • If we increase $\gamma$, $P_{FA}$ and $P_D$ go down. Similarly, if we decrease $\gamma$, $P_{FA}$ and $P_D$ go up. Hence, to maximize $P_D$, choose $\gamma$ so that $P_{FA}$ is as big as possible under the constraint.

    Two useful ways for determining the threshold $\gamma$ that achieves a specified false-alarm rate:

    Find $\gamma$ that satisfies

    $$\int_{\{x:\ \Lambda(x) > \gamma\}} p(x\,;\,\theta_0)\, dx = P_{FA} = \alpha$$

    or, expressing $\gamma$ in terms of the pdf/pmf of $\Lambda(x)$ under $H_0$:

    $$\int_{\gamma}^{\infty} p_{\Lambda;\theta}(l\,;\,\theta_0)\, dl = \alpha$$

    or, perhaps, in terms of a monotonic function of $\Lambda(x)$, say $T(x) = \text{monotonic function}(\Lambda(x))$.

    Warning: We have been implicitly assuming that $P_{FA}$ is a continuous function of $\gamma$. Some (not insightful) technical adjustments are needed if this is not the case.
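    When no closed form is available, the threshold can also be set from the empirical distribution of $\Lambda(x)$ [or a monotone statistic $T(x)$] under $H_0$. A minimal sketch of this idea, which I am adding for illustration (the $H_0$ model and statistic are assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.05
# Assumed H0 model: x is a length-10 AWGN record; T(x) = sample mean.
T_H0 = rng.normal(0.0, 1.0, (200_000, 10)).mean(axis=1)

# gamma = (1 - alpha)-quantile of T(X) under H0, so P[T > gamma | H0] ~ alpha.
gamma = np.quantile(T_H0, 1 - alpha)
print(gamma, np.mean(T_H0 > gamma))  # ~0.52 and ~0.05 for these settings
```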

    A way of handling nuisance parameters: We can utilize the integrated (marginal) likelihood ratio (16) under the Neyman-Pearson setup as well.


  • Chernoff-Stein Lemma for Bounding the Miss Probability in Neyman-Pearson Tests of Simple Hypotheses

    Recall the definition of the Kullback-Leibler (K-L) distance $D(p\,\|\,q)$ from one pmf ($p$) to another ($q$):

    $$D(p\,\|\,q) = \sum_{k} p_k \log \frac{p_k}{q_k}.$$

    The complete proof of this lemma for the discrete (pmf) case is given in

    Additional Reading: T.M. Cover and J.A. Thomas, Elements of Information Theory, 2nd ed., New York: Wiley, 2006.


  • Setup for the Chernoff-Stein Lemma

    Assume that $x_1, x_2, \ldots, x_N$ are conditionally i.i.d. given $\theta$.

    We adopt the Neyman-Pearson framework, i.e. obtain a decision threshold to achieve a fixed $P_{FA}$. Let us study the asymptotic $P_M = 1 - P_D$ as the number of observations $N$ gets large.

    To keep $P_{FA}$ constant as $N$ increases, we need to make our decision threshold ($\tau$, say) vary with $N$, i.e.

    $$\tau = \tau_N(P_{FA}).$$

    Now, the miss probability is

    $$P_M = P_M(\tau) = P_M\big(\tau_N(P_{FA})\big).$$


  • Chernoff-Stein Lemma

    The Chernoff-Stein lemma says:

    $$\lim_{P_{FA} \to 0}\ \lim_{N \to \infty}\ \frac{1}{N} \log P_M = -\underbrace{D\big(p(X_n\,|\,\theta_0)\,\|\,p(X_n\,|\,\theta_1)\big)}_{\text{K-L distance for a single observation}}$$

    where the K-L distance between $p(x_n\,|\,\theta_0)$ and $p(x_n\,|\,\theta_1)$,

    $$D\big(p(X_n\,|\,\theta_0)\,\|\,p(X_n\,|\,\theta_1)\big) = E_{p(x_n|\theta_0)}\Big[\log \frac{p(X_n\,|\,\theta_0)}{p(X_n\,|\,\theta_1)}\Big] \overset{\text{discrete (pmf) case}}{=} \sum_{x_n} p(x_n\,|\,\theta_0)\, \log\Big[\frac{p(x_n\,|\,\theta_0)}{p(x_n\,|\,\theta_1)}\Big],$$

    does not depend on the observation index $n$, since the $x_n$ are conditionally i.i.d. given $\theta$.

    Equivalently, we can state that

    $$P_M \overset{N \to +\infty}{\approx} f(N)\, \exp\big[-N\, D\big(p(X_n\,|\,\theta_0)\,\|\,p(X_n\,|\,\theta_1)\big)\big]$$

    as $P_{FA} \to 0$ and $N \to \infty$, where $f(N)$ is a slowly-varying function compared with the exponential term (when $P_{FA} \to 0$ and $N \to \infty$).
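    As a quick numerical sanity check of the rate in the lemma (my addition; the two densities are assumed Gaussian), the K-L distance can be computed by quadrature and compared with the closed form $(\theta_1 - \theta_0)^2/(2\sigma^2)$ that holds for two equal-variance Gaussians:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

theta0, theta1, sigma = 0.0, 1.0, 1.0  # assumed simple hypotheses

def kl_integrand(x):
    p0 = norm.pdf(x, theta0, sigma)
    p1 = norm.pdf(x, theta1, sigma)
    return p0 * np.log(p0 / p1)

D, _ = quad(kl_integrand, -10, 10)     # D(p0 || p1) by numerical integration
print(D, (theta1 - theta0) ** 2 / (2 * sigma ** 2))  # both ~0.5
```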


  • Detection for Simple Hypotheses: Example

    Known positive DC level in additive white Gaussian noise (AWGN), Example 3.2 in Kay-II.

    Consider

    $$H_0: x[n] = w[n], \quad n = 1, 2, \ldots, N \quad \text{versus} \quad H_1: x[n] = A + w[n], \quad n = 1, 2, \ldots, N$$

    where

    $A > 0$ is a known constant,

    $w[n]$ is zero-mean white Gaussian noise with known variance $\sigma^2$, i.e.

    $$w[n] \sim \mathcal{N}(0, \sigma^2).$$

    The above hypothesis-testing formulation is EE-like: noise versus signal plus noise. It is similar to the on-off keying scheme in communications, which gives us an idea to rephrase it so that it fits our formulation on p. 4 (for which we have developed all the theory so far). Here is such an alternative formulation: consider a family of pdfs

    $$p(x\,;\,a) = \frac{1}{(2\pi\sigma^2)^{N/2}}\, \exp\Big[-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x[n] - a)^2\Big] \tag{17}$$

    and the following (equivalent) hypotheses:

    $$H_0: a = 0 \ \text{(off)} \quad \text{versus} \quad H_1: a = A \ \text{(on)}.$$

    Then, the likelihood ratio is

    $$\Lambda(x) = \frac{p(x\,;\,a = A)}{p(x\,;\,a = 0)} = \frac{(2\pi\sigma^2)^{-N/2}\, \exp\big[-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x[n] - A)^2\big]}{(2\pi\sigma^2)^{-N/2}\, \exp\big[-\frac{1}{2\sigma^2} \sum_{n=1}^{N} x[n]^2\big]}.$$

    Now, take the logarithm and, after simple manipulations, reduce our likelihood-ratio test to comparing

    $$T(x) = \bar{x} \triangleq \frac{1}{N} \sum_{n=1}^{N} x[n]$$

    with a threshold $\gamma$. [Here, $T(x)$ is a monotonic function of $\Lambda(x)$.] If $T(x) > \gamma$, accept $H_1$ (i.e. reject $H_0$); otherwise, accept

  • $H_0$ (well, not exactly; we will talk more about this decision on p. 59).

    The choice of $\gamma$ depends on the approach that we take. For the Bayes decision rule, $\gamma$ is a function of $\pi_0$ and $\pi_1$. For the Neyman-Pearson test, $\gamma$ is chosen to achieve (control) a desired $P_{FA}$.

    Bayesian decision-theoretic detection for 0-1 loss (corresponding to minimizing the average error


  • probability):

    $$\log \Lambda(x) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x[n] - A)^2 + \frac{1}{2\sigma^2} \sum_{n=1}^{N} x[n]^2\ \overset{H_1}{\gtrless}\ \log\Big(\frac{\pi_0}{\pi_1}\Big)$$

    $$\frac{1}{2\sigma^2} \sum_{n=1}^{N} (\underbrace{x[n] - x[n]}_{0} + A)\, (x[n] + x[n] - A)\ \overset{H_1}{\gtrless}\ \log\Big(\frac{\pi_0}{\pi_1}\Big)$$

    $$2A\, \Big(\sum_{n=1}^{N} x[n]\Big) - A^2 N\ \overset{H_1}{\gtrless}\ 2\sigma^2\, \log\Big(\frac{\pi_0}{\pi_1}\Big)$$

    $$\Big(\sum_{n=1}^{N} x[n]\Big) - \frac{A N}{2}\ \overset{H_1}{\gtrless}\ \frac{\sigma^2}{A}\, \log\Big(\frac{\pi_0}{\pi_1}\Big) \qquad \text{(since } A > 0\text{)}$$

    and, finally,

    $$\bar{x}\ \overset{H_1}{\gtrless}\ \frac{\sigma^2}{N A}\, \log(\pi_0/\pi_1) + \frac{A}{2} = \gamma$$

    which, for the practically most interesting case of equiprobable hypotheses

    $$\pi_0 = \pi_1 = \tfrac{1}{2} \tag{18}$$

    simplifies to

    $$\bar{x}\ \overset{H_1}{\gtrless}\ \frac{A}{2}$$

    known as the maximum-likelihood test (i.e. the Bayes decision rule for 0-1 loss and a priori equiprobable hypotheses is defined as the maximum-likelihood test). This maximum-likelihood detector does not require the knowledge of the noise variance


  • $\sigma^2$ to declare its decision. However, the knowledge of $\sigma^2$ is key to assessing the detection performance. Interestingly, these observations will carry over to a few maximum-likelihood tests that we will derive in the future.

    Assuming (18), we now derive the minimum average error probability. First, note that $\bar{X}\,|\,a = 0 \sim \mathcal{N}(0, \sigma^2/N)$ and $\bar{X}\,|\,a = A \sim \mathcal{N}(A, \sigma^2/N)$. Then

    $$\begin{aligned}
    \text{min av. error prob.} &= \tfrac{1}{2}\, \underbrace{P\big[\bar{X} > \tfrac{A}{2}\,\big|\,a = 0\big]}_{P_{FA}} + \tfrac{1}{2}\, \underbrace{P\big[\bar{X} \le \tfrac{A}{2}\,\big|\,a = A\big]}_{P_M}\\
    &= \tfrac{1}{2}\, P\Big[\frac{\bar{X}}{\sqrt{\sigma^2/N}} > \frac{A/2}{\sqrt{\sigma^2/N}}\,\Big|\,a = 0\Big] + \tfrac{1}{2}\, P\Big[\underbrace{\frac{\bar{X} - A}{\sqrt{\sigma^2/N}}}_{\substack{\text{standard normal}\\ \text{random var.}}} \le \frac{A/2 - A}{\sqrt{\sigma^2/N}}\,\Big|\,a = A\Big]\\
    &= Q\Big(\sqrt{\frac{N A^2}{4\sigma^2}}\Big).
    \end{aligned}$$

    Neyman-Pearson test: if $T(x) = \bar{x} > \gamma$, decide $H_1$ (i.e. reject $H_0$); otherwise, decide $H_0$ (see also the discussion on p. 59).
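    A simulation sketch of the minimum average error probability result (added for illustration; $A$, $\sigma$, and $N$ are assumed values): compare the Monte Carlo error rate of the test $\bar{x} \gtrless A/2$ with $Q\big(\sqrt{N A^2/(4\sigma^2)}\big)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
A, sigma, N, trials = 1.0, 1.0, 16, 200_000

xbar_H0 = rng.normal(0.0, sigma / np.sqrt(N), trials)  # xbar | a = 0
xbar_H1 = rng.normal(A,   sigma / np.sqrt(N), trials)  # xbar | a = A

err = 0.5 * np.mean(xbar_H0 > A / 2) + 0.5 * np.mean(xbar_H1 <= A / 2)
# norm.sf is the Q function; both values should be ~Q(2) ~ 0.0228 here.
print(err, norm.sf(np.sqrt(N * A**2 / (4 * sigma**2))))
```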

    Performance evaluation: Assuming (17), we have

    $$T(x)\,|\,a \sim \mathcal{N}(a, \sigma^2/N).$$

    Therefore, $T(X)\,|\,a = 0 \sim \mathcal{N}(0, \sigma^2/N)$, implying

    $$P_{FA} = P[T(X) > \gamma\,;\,a = 0] = Q\Big(\frac{\gamma}{\sqrt{\sigma^2/N}}\Big)$$

  • and we obtain the decision threshold as follows:

    $$\gamma = \sqrt{\frac{\sigma^2}{N}}\, Q^{-1}(P_{FA}).$$

    Now, $T(X)\,|\,a = A \sim \mathcal{N}(A, \sigma^2/N)$, implying

    $$P_D = P(T(X) > \gamma\,|\,a = A) = Q\Big(\frac{\gamma - A}{\sqrt{\sigma^2/N}}\Big) = Q\Big(Q^{-1}(P_{FA}) - \frac{A}{\sqrt{\sigma^2/N}}\Big) = Q\Big(Q^{-1}(P_{FA}) - \sqrt{\underbrace{N A^2/\sigma^2}_{\triangleq\, \text{SNR}\, =\, d^2}}\Big).$$

    Given the false-alarm probability $P_{FA}$, the detection probability $P_D$ depends only on the deflection coefficient:

    $$d^2 = \frac{N A^2}{\sigma^2} = \frac{\{E[T(X)\,|\,a = A] - E[T(X)\,|\,a = 0]\}^2}{\text{var}[T(X)\,|\,a = 0]}$$

    which is also (a reasonable definition for) the signal-to-noise ratio (SNR).


  • Receiver Operating Characteristics (ROC)

    $$P_D = Q\big(Q^{-1}(P_{FA}) - d\big).$$

    Comments:

    As we raise the threshold $\gamma$, $P_{FA}$ goes down, but so does $P_D$.

    The ROC should be above the 45° line; otherwise we can do better by flipping a coin.

    Performance improves with $d^2$.
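    A minimal sketch for tracing this ROC (added for illustration; the SNR value is assumed), using `norm.sf` for $Q$ and `norm.isf` for $Q^{-1}$:

```python
import numpy as np
from scipy.stats import norm

d = np.sqrt(10.0)                  # deflection coefficient, d^2 = SNR (assumed)
P_FA = np.logspace(-6, 0, 200)     # grid of false-alarm probabilities
P_D = norm.sf(norm.isf(P_FA) - d)  # P_D = Q(Q^{-1}(P_FA) - d)
print(np.interp(1e-3, P_FA, P_D))  # e.g. P_D at P_FA = 1e-3
```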


  • Typical Ways of Depicting the Detection Performance Under the Neyman-Pearson Setup

    To analyze the performance of a Neyman-Pearson detector, we examine two relationships:

    Between $P_D$ and $P_{FA}$, for a given SNR, called the receiver operating characteristics (ROC).

    Between $P_D$ and SNR, for a given $P_{FA}$.

    Here are examples of the two: see Figs. 3.8 and 3.5 in Kay-II, respectively.


  • Asymptotic (as $N \to \infty$ and $P_{FA} \to 0$) $P_D$ and $P_M$ for a Known DC Level in AWGN

    We apply the Chernoff-Stein lemma, for which we need to compute the following K-L distance:

    $$D\big(p(X_n\,|\,a = 0)\,\|\,p(X_n\,|\,a = A)\big) = E_{p(x_n|a=0)}\Big\{\log\Big[\frac{p(X_n\,|\,a = 0)}{p(X_n\,|\,a = A)}\Big]\Big\}$$

    where $p(x_n\,|\,a = 0) = \mathcal{N}(0, \sigma^2)$ and

    $$\log\Big[\frac{p(x_n\,|\,a = 0)}{p(x_n\,|\,a = A)}\Big] = -\frac{x_n^2}{2\sigma^2} + \frac{(x_n - A)^2}{2\sigma^2} = \frac{1}{2\sigma^2}\, (A^2 - 2 A x_n).$$

    Taking the expectation under $a = 0$ (where $E[x_n] = 0$) gives

    $$D\big(p(X_n\,|\,a = 0)\,\|\,p(X_n\,|\,a = A)\big) = \frac{A^2}{2\sigma^2}$$

    and the Chernoff-Stein lemma predicts the following behavior of the detection probability as $N \to \infty$ and $P_{FA} \to 0$:

    $$P_D \approx 1 - \underbrace{f(N)\, \exp\Big(-N\, \frac{A^2}{2\sigma^2}\Big)}_{P_M}$$


  • where $f(N)$ is a slowly-varying function of $N$ compared with the exponential term. In this case, the exact expression for $P_M$ ($P_D$) is available and consistent with the Chernoff-Stein lemma:

    $$\begin{aligned}
    P_M &= 1 - Q\Big(Q^{-1}(P_{FA}) - \sqrt{N A^2/\sigma^2}\Big) = Q\Big(\sqrt{N A^2/\sigma^2} - Q^{-1}(P_{FA})\Big)\\
    &\overset{N \to +\infty}{\approx} \frac{1}{\big[\sqrt{N A^2/\sigma^2} - Q^{-1}(P_{FA})\big]\, \sqrt{2\pi}}\, \exp\Big\{-\big[\sqrt{N A^2/\sigma^2} - Q^{-1}(P_{FA})\big]^2/2\Big\}\\
    &= \frac{1}{\big[\sqrt{N A^2/\sigma^2} - Q^{-1}(P_{FA})\big]\, \sqrt{2\pi}}\, \exp\Big\{-N A^2/(2\sigma^2) - [Q^{-1}(P_{FA})]^2/2 + Q^{-1}(P_{FA})\, \sqrt{N A^2/\sigma^2}\Big\}\\
    &= \underbrace{\frac{1}{\big[\sqrt{N A^2/\sigma^2} - Q^{-1}(P_{FA})\big]\, \sqrt{2\pi}}\, \exp\Big\{-[Q^{-1}(P_{FA})]^2/2 + Q^{-1}(P_{FA})\, \sqrt{N A^2/\sigma^2}\Big\}}_{\text{slowly varying}}\ \underbrace{\exp\big[-N A^2/(2\sigma^2)\big]}_{\text{as predicted by the Chernoff-Stein lemma}} \tag{21}
    \end{aligned}$$

    where we have used the asymptotic formula (20), i.e. $Q(t) \approx \frac{1}{t\sqrt{2\pi}}\, e^{-t^2/2}$ as $t \to +\infty$. When $P_{FA}$


  • is small and $N$ is large, the first two terms in the above expression (the factor in front of the final exponential) make a slowly-varying function $f(N)$ of $N$. Note that the exponential term in (21) does not depend on the false-alarm probability $P_{FA}$ (or, equivalently, on the choice of the decision threshold). The Chernoff-Stein lemma asserts that this is not a coincidence.

    Comment. For detecting a known DC level in AWGN:

    The slope of the exponential-decay term of $P_M$ is

    $$A^2/(2\sigma^2)$$

    which is different from (larger than, in this case) the slope of the exponential decay of the minimum average error probability:

    $$A^2/(8\sigma^2).$$


  • Decentralized Neyman-Pearson Detection for Simple Hypotheses

    Consider a sensor-network scenario depicted by

    [Figure: $N$ spatially distributed sensor nodes, each sending a local hard decision to a fusion center.]

    Assumption: Observations made at $N$ spatially distributed sensors (nodes) follow the same marginal probabilistic model:

    $$H_i: x_n \sim p(x_n\,|\,\theta_i)$$

    where $n = 1, 2, \ldots, N$ and $i \in \{0, 1\}$ for binary hypotheses. Each node $n$ makes a hard local decision $d_n$ based on its local observation $x_n$ and sends it to the headquarters (fusion center), which collects all the local decisions and makes the final global decision about $H_0$ or $H_1$. This structure is clearly suboptimal: it is easy to construct a better decision strategy in which each node sends its (quantized, in practice) likelihood ratio to the fusion center, rather than the decision only. However, such a strategy would have a higher communication (energy) cost.

    We now go back to the decentralized detection problem. Suppose that each node $n$ makes a local decision $d_n \in \{0, 1\}$, $n = 1, 2, \ldots, N$, and transmits it to the fusion center. Then, the fusion center makes the global decision based on the likelihood ratio formed from the $d_n$'s. The simplest fusion scheme is based on the assumption that the $d_n$'s are conditionally independent given $\theta$ (which may not always be reasonable, but leads to an easy solution). We can now write

    $$p(d_n\,|\,\theta_1) = P_{D,n}^{d_n}\, (1 - P_{D,n})^{1 - d_n} \qquad \text{(Bernoulli pmf)}$$

    where $P_{D,n}$ is the $n$th sensor's local detection probability. Similarly,

    $$p(d_n\,|\,\theta_0) = P_{FA,n}^{d_n}\, (1 - P_{FA,n})^{1 - d_n} \qquad \text{(Bernoulli pmf)}$$

    where $P_{FA,n}$ is the $n$th sensor's local false-alarm


  • probability. Now,

    $$\log \Lambda(d) = \sum_{n=1}^{N} \log\Big[\frac{p(d_n\,|\,\theta_1)}{p(d_n\,|\,\theta_0)}\Big] = \sum_{n=1}^{N} \log\Big[\frac{P_{D,n}^{d_n}\, (1 - P_{D,n})^{1 - d_n}}{P_{FA,n}^{d_n}\, (1 - P_{FA,n})^{1 - d_n}}\Big]\ \overset{H_1}{\gtrless}\ \log \tau.$$

    To further simplify the exposition, we assume that all sensors have identical performance:

    $$P_{D,n} = P_D, \qquad P_{FA,n} = P_{FA}.$$

    Define the number of sensors having $d_n = 1$:

    $$u_1 = \sum_{n=1}^{N} d_n.$$

    Then, the log-likelihood ratio becomes

    $$\log \Lambda(d) = u_1\, \log\Big(\frac{P_D}{P_{FA}}\Big) + (N - u_1)\, \log\Big(\frac{1 - P_D}{1 - P_{FA}}\Big)\ \overset{H_1}{\gtrless}\ \log \tau$$

    or

    $$u_1\, \log\Big[\frac{P_D\, (1 - P_{FA})}{P_{FA}\, (1 - P_D)}\Big]\ \overset{H_1}{\gtrless}\ \log \tau + N\, \log\Big(\frac{1 - P_{FA}}{1 - P_D}\Big). \tag{22}$$


  • Clearly, each node's local decision $d_n$ is meaningful only if $P_D > P_{FA}$, which implies

    $$\frac{P_D\, (1 - P_{FA})}{P_{FA}\, (1 - P_D)} > 1$$

    the logarithm of which is therefore positive, and the decision rule (22) further simplifies to

    $$u_1\ \overset{H_1}{\gtrless}\ \gamma.$$

    The Neyman-Pearson performance analysis of this detector is easy: the random variable $U_1$ is binomial given $\theta$ (i.e. conditional on the hypothesis) and, therefore,

    $$P[U_1 = u_1] = \binom{N}{u_1}\, p^{u_1}\, (1 - p)^{N - u_1}$$

    where $p = P_{FA}$ under $H_0$ and $p = P_D$ under $H_1$. Hence, the global false-alarm probability is

    $$P_{FA,\text{global}} = P[U_1 > \gamma\,|\,\theta_0] = \sum_{u_1 = \lceil \gamma \rceil}^{N} \binom{N}{u_1}\, P_{FA}^{u_1}\, (1 - P_{FA})^{N - u_1}.$$

    Clearly, $P_{FA,\text{global}}$ will be a discontinuous function of $\gamma$ and, therefore, we should choose our $P_{FA,\text{global}}$ specification from the available discrete choices. But, if none of the candidate choices is acceptable, this means that our current system does not satisfy the requirements and a remedial action is needed, e.g. increasing the quantity ($N$) or improving the quality of the sensors (through changing $P_D$ and $P_{FA}$), or both.
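    A sketch of this threshold selection (my illustration; $N$, the per-sensor $P_{FA}$, and the specification are assumed values): tabulate the available $P_{FA,\text{global}}$ values with the binomial tail and pick the smallest integer threshold meeting the specification.

```python
import numpy as np
from scipy.stats import binom

N, pfa_local, spec = 20, 0.1, 0.01
# P[U1 > gamma | H0] for integer thresholds gamma = 0, 1, ..., N.
gammas = np.arange(N + 1)
pfa_global = binom.sf(gammas, N, pfa_local)  # sf(k) = P[U1 > k]
# Smallest gamma whose global false-alarm probability meets the spec:
gamma = gammas[pfa_global <= spec][0]
print(gamma, pfa_global[gamma])
```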


  • Testing Multiple Hypotheses

    Suppose now that we choose $\Theta_0, \Theta_1, \ldots, \Theta_{M-1}$ that form a partition of the parameter space $\Theta$:

    $$\Theta_0 \cup \Theta_1 \cup \cdots \cup \Theta_{M-1} = \Theta, \qquad \Theta_i \cap \Theta_j = \emptyset \quad \text{for } i \ne j.$$

    We wish to distinguish among $M > 2$ hypotheses, i.e. identify which hypothesis is true:

    $$H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_1 \quad \text{versus} \quad \cdots \quad \text{versus} \quad H_{M-1}: \theta \in \Theta_{M-1}$$

    and, consequently, our action space consists of $M$ choices. We design a decision rule $\delta(x): \mathcal{X} \to \{0, 1, \ldots, M-1\}$:

    $$\delta(x) = \begin{cases} 0, & \text{decide } H_0,\\ 1, & \text{decide } H_1,\\ \ \vdots\\ M-1, & \text{decide } H_{M-1} \end{cases}$$

    where $\delta$ partitions the data space $\mathcal{X}$ [i.e. the support of $p(x\,|\,\theta)$] into $M$ regions:

    $$\text{Rule } \delta: \quad \mathcal{X}_0 = \{x: \delta(x) = 0\}, \quad \ldots, \quad \mathcal{X}_{M-1} = \{x: \delta(x) = M-1\}.$$

  • We specify the loss function using $L(i\,|\,m)$, where, typically, the losses due to correct decisions are set to zero:

    $$L(i\,|\,i) = 0, \qquad i = 0, 1, \ldots, M-1.$$

    Here, we adopt zero losses for correct decisions. Now, our posterior expected loss takes $M$ values:

    $$\rho_m(x) = \sum_{i=0}^{M-1} \int_{\Theta_i} L(m\,|\,i)\, p(\theta\,|\,x)\, d\theta = \sum_{i=0}^{M-1} L(m\,|\,i) \int_{\Theta_i} p(\theta\,|\,x)\, d\theta, \qquad m = 0, 1, \ldots, M-1.$$

    Then, the Bayes decision rule $\delta^{\star}$ is defined via the following data-space partitioning:

    $$\mathcal{X}_m^{\star} = \Big\{x: \rho_m(x) = \min_{0 \le l \le M-1} \rho_l(x)\Big\}, \qquad m = 0, 1, \ldots, M-1$$

    or, equivalently, upon applying the Bayes rule:

    $$\begin{aligned}
    \mathcal{X}_m^{\star} &= \Big\{x: m = \arg\min_{0 \le l \le M-1} \sum_{i=0}^{M-1} \int_{\Theta_i} L(l\,|\,i)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\Big\}\\
    &= \Big\{x: m = \arg\min_{0 \le l \le M-1} \sum_{i=0}^{M-1} L(l\,|\,i) \int_{\Theta_i} p(x\,|\,\theta)\,\pi(\theta)\, d\theta\Big\}.
    \end{aligned}$$


  • The preposterior (Bayes) risk for rule $\delta(x)$ is

    $$E_{x,\theta}[\text{loss}] = \sum_{m=0}^{M-1} \sum_{i=0}^{M-1} \int_{\mathcal{X}_m} \int_{\Theta_i} L(m\,|\,i)\, p(x\,|\,\theta)\,\pi(\theta)\, d\theta\, dx = \sum_{m=0}^{M-1} \int_{\mathcal{X}_m} \underbrace{\sum_{i=0}^{M-1} L(m\,|\,i) \int_{\Theta_i} p(x\,|\,\theta)\,\pi(\theta)\, d\theta}_{h_m(x)}\, dx.$$

    Then, for an arbitrary rule $\delta$ with regions $\mathcal{X}_m$ [since $\delta^{\star}$ picks, for each $x$, the $m$ that minimizes $h_m(x)$],

    $$\Big[\sum_{m=0}^{M-1} \int_{\mathcal{X}_m} h_m(x)\, dx\Big] - \Big[\sum_{m=0}^{M-1} \int_{\mathcal{X}_m^{\star}} h_m(x)\, dx\Big] \ge 0$$

    which verifies that the Bayes decision rule $\delta^{\star}$ minimizes the preposterior (Bayes) risk.


  • Special Case: $L(i\,|\,i) = 0$ and $L(m\,|\,i) = 1$ for $i \ne m$ (0-1 loss), implying that $\rho_m(x)$ can be written as

    $$\rho_m(x) = \sum_{i=0,\ i \ne m}^{M-1} \int_{\Theta_i} p(\theta\,|\,x)\, d\theta = \underbrace{\int_{\Theta} p(\theta\,|\,x)\, d\theta}_{=\ \text{const, not a function of } m} - \int_{\Theta_m} p(\theta\,|\,x)\, d\theta$$

    and

    $$\mathcal{X}_m^{\star} = \Big\{x: m = \arg\max_{0 \le l \le M-1} \int_{\Theta_l} p(\theta\,|\,x)\, d\theta\Big\} = \Big\{x: m = \arg\max_{0 \le l \le M-1} P[\theta \in \Theta_l\,|\,x]\Big\} \tag{23}$$

    which is the MAP rule, as expected.

    Simple hypotheses: Let us specialize (23) to simple hypotheses ($\Theta_m = \{\theta_m\}$, $m = 0, 1, \ldots, M-1$):

    $$\mathcal{X}_m^{\star} = \Big\{x: m = \arg\max_{0 \le l \le M-1} p(\theta_l\,|\,x)\Big\}$$

    or, equivalently,

    $$\mathcal{X}_m^{\star} = \Big\{x: m = \arg\max_{0 \le l \le M-1} [\pi_l\, p(x\,|\,\theta_l)]\Big\}, \qquad m = 0, 1, \ldots, M-1$$

    where

    $$\pi_0 = \pi(\theta_0), \quad \ldots, \quad \pi_{M-1} = \pi(\theta_{M-1})$$

    define the prior pmf of the $M$-ary discrete random variable $\theta$ (recall that $\theta \in \{\theta_0, \theta_1, \ldots, \theta_{M-1}\}$). If the $\pi_i$, $i = 0, 1, \ldots, M-1$, are all equal:

    $$\pi_0 = \pi_1 = \cdots = \pi_{M-1} = \frac{1}{M}$$

    the resulting test

    $$\mathcal{X}_m^{\star} = \Big\{x: m = \arg\max_{0 \le l \le M-1} \Big[\frac{1}{M}\, p(x\,|\,\theta_l)\Big]\Big\} = \Big\{x: m = \arg\max_{0 \le l \le M-1} \underbrace{p(x\,|\,\theta_l)}_{\text{likelihood}}\Big\} \tag{24}$$

    is the maximum-likelihood test; this name is easy to justify after inspecting (24) and noting that the computation of the optimal decision region $\mathcal{X}_m^{\star}$ requires the maximization of the likelihood $p(x\,|\,\theta)$ with respect to the parameter $\theta \in \{\theta_0, \theta_1, \ldots, \theta_{M-1}\}$.
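    A small sketch of the ML test (24) (an added illustration with assumed Gaussian likelihoods): classify an i.i.d. record among $M$ candidate means by maximizing the likelihood.

```python
import numpy as np
from scipy.stats import norm

thetas = np.array([-2.0, 0.0, 1.5, 3.0])  # assumed theta_0, ..., theta_{M-1}
sigma = 1.0

def ml_decide(x):
    # arg max_l p(x | theta_l); log-likelihoods avoid underflow for long records.
    loglik = [norm.logpdf(x, th, sigma).sum() for th in thetas]
    return int(np.argmax(loglik))

x = np.random.default_rng(3).normal(1.5, sigma, 50)  # data from hypothesis 2
print(ml_decide(x))  # expected: 2
```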


  • Summary: Bayesian Decision Approach versus Neyman-Pearson Approach

    The Neyman-Pearson approach appears particularly suitable for applications where the null hypothesis can be formulated as absence of signal or, perhaps, absence of statistical difference between two data sets (treatment versus placebo, say).

    In the Neyman-Pearson approach, the null hypothesis is treated very differently from the alternative. (If the null hypothesis is true, we wish to control the false-alarm rate, which is different from our desire to maximize the probability of detection when the alternative is true.) Consequently, our decisions should also be treated differently. If the likelihood ratio is large enough, we decide to accept $H_1$ (or reject $H_0$). However, if the likelihood ratio is not large enough, we decide not to reject $H_0$ because, in this case, it may be that either

    (i) $H_0$ is true or

    (ii) $H_0$ is false but the test has low detection probability (power) (e.g. because the signal level is small compared with the noise or we collected too small a number of observations).


  • The Bayesian decision framework is suitable for communications applications, as it can easily handle multiple hypotheses (unlike the Neyman-Pearson framework).

    0-1 loss: In communications applications, we typically select a 0-1 loss, implying that all hypotheses are treated equally (i.e. we could exchange the roles of the null and alternative hypotheses without any problems). Therefore, interpretations of our decisions are also straightforward. Furthermore, in this case, the Bayes decision rule is also optimal in terms of minimizing the average error probability, which is one of the most popular performance criteria in communications.


  • P Values

    Reporting "accept $H_0$" or "accept $H_1$" is not very informative. Instead, we could vary $P_{FA}$ and examine how our report would change.

    Generally, if $H_1$ is accepted for a certain specified $P_{FA}$, it will also be accepted for any $P_{FA}' > P_{FA}$. Therefore, there exists a smallest $P_{FA}$ at which $H_1$ is accepted. This motivates the introduction of the p value.

    To be more precise (and to be able to handle composite hypotheses), here is a definition of the size of a hypothesis test.

    Definition 1. The size of a hypothesis test described by

    $$\text{Rule } \phi: \quad \mathcal{X}_0 = \{x: \phi(x) = 0\}, \quad \mathcal{X}_1 = \{x: \phi(x) = 1\}$$

    is defined as follows:

    $$\alpha = \max_{\theta \in \Theta_0} P[x \in \mathcal{X}_1\,|\,\theta] = \text{max possible } P_{FA}.$$

    A hypothesis test is said to have level $\alpha$ if its size is less than or equal to $\alpha$. Therefore, a level-$\alpha$ test is guaranteed to have a false-alarm probability less than or equal to $\alpha$.


  • Definition 2. Consider a Neyman-Pearson-type setup where our test

    $$\text{Rule } \phi_{\alpha}: \quad \mathcal{X}_{0,\alpha} = \{x: \phi_{\alpha}(x) = 0\}, \quad \mathcal{X}_{1,\alpha} = \{x: \phi_{\alpha}(x) = 1\} \tag{25}$$

    achieves a specified size $\alpha$, meaning that

    $$\alpha = \text{max possible } P_{FA} = \max_{\theta \in \Theta_0} P[x \in \mathcal{X}_{1,\alpha}\,|\,\theta] \qquad \text{(composite hypotheses)}$$

    or, in the simple-hypothesis case ($\Theta_0 = \{\theta_0\}$, $\Theta_1 = \{\theta_1\}$):

    $$\alpha \underbrace{=}_{P_{FA}} P[x \in \mathcal{X}_{1,\alpha}\,|\,\theta = \theta_0] \qquad \text{(simple hypotheses)}.$$

    We suppose that, for every $\alpha \in (0, 1)$, we have a size-$\alpha$ test with decision regions (25). Then, the p value for this test is the smallest level at which we can declare $H_1$:

    $$\text{p value} = \inf\{\alpha: x \in \mathcal{X}_{1,\alpha}\}.$$

    Informally, the p value is a measure of evidence for supporting $H_1$. For example, p values less than 0.01 are considered very strong evidence supporting $H_1$.

    There are a lot of warnings (and misconceptions) regarding p values. Here are the most important ones.


  • Warning: A large p value is not strong evidence in favor of $H_0$; a large p value can occur for two reasons:

    (i) $H_0$ is true or

    (ii) $H_0$ is false but the test has low detection probability (power).

    Warning: Do not confuse the p value with

    $$P[H_0\,|\,\text{data}] = P[\theta \in \Theta_0\,|\,x]$$

    which is used in Bayesian inference. The p value is not the probability that $H_0$ is true.

    Theorem 1. Suppose that we have a size-$\alpha$ test of the form

    declare $H_1$ if and only if $T(x) \ge c_{\alpha}$.

    Then, the p value for this test is

    $$\text{p value} = \max_{\theta \in \Theta_0} P[T(X) \ge T(x)\,|\,\theta]$$

    where $x$ is the observed value of $X$. For $\Theta_0 = \{\theta_0\}$:

    $$\text{p value} = P[T(X) \ge T(x)\,|\,\theta = \theta_0].$$
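    For the DC-level example from p. 33, Theorem 1 gives an explicit p value: with $T(x) = \bar{x}$ and $T(X) \sim \mathcal{N}(0, \sigma^2/N)$ under $H_0$, the p value is $Q\big(\bar{x}\sqrt{N}/\sigma\big)$. A one-line sketch (the numbers are assumed):

```python
import numpy as np
from scipy.stats import norm

sigma, N = 1.0, 25
xbar = 0.4                                    # observed sample mean (assumed)
p_value = norm.sf(xbar * np.sqrt(N) / sigma)  # P[T(X) >= T(x) | a = 0]
print(p_value)                                # Q(2) ~ 0.023
```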


  • In words, Theorem 1 states that

    the p value is the probability that, under $H_0$, a random data realization $X$ is observed yielding a value of the test statistic $T(X)$ that is greater than or equal to what has actually been observed [i.e. $T(x)$].

    Note: This interpretation requires that we allow the experiment to be repeated many times. This is what Bayesians criticize, by saying that data that have never been observed are used for inference.

    Theorem 2. If the test statistic has a continuous distribution, then, under $H_0: \theta = \theta_0$, the p value has a uniform(0, 1) distribution. Therefore, if we declare $H_1$ (reject $H_0$) when the p value is less than or equal to $\alpha$, the probability of false alarm is $\alpha$.

    In other words, if $H_0$ is true, the p value is like a random draw from a uniform(0, 1) distribution. If $H_1$ is true and if we repeat the experiment many times, the random p values will concentrate closer to zero.


  • Multiple Testing

    We may conduct many hypothesis tests in some applications, e.g.

    bioinformatics and

    sensor networks.

    Here, we perform many (typically binary) tests (one test per node in a sensor network, say). This is different from testing multiple hypotheses, considered on pp. 54-58, where we performed a single test of multiple hypotheses. For a sensor-network related discussion on multiple testing, see

    E.B. Ermis, M. Alanyali, and V. Saligrama, "Search and discovery in an uncertain networked world," IEEE Signal Processing Magazine, vol. 23, pp. 107-118, Jul. 2006.

    Suppose that each test is conducted with false-alarm probability $P_{FA} = \alpha$. For example, in a sensor-network setup, each node conducts a test based on its local data.

    Although the chance of false alarm at each node is only $\alpha$, the chance of at least one falsely alarmed node is much higher, since there are many nodes. Here, we discuss two ways to deal with this problem.


  • Consider $M$ hypothesis tests:

    $$H_{0i} \ \text{versus}\ H_{1i}, \qquad i = 1, 2, \ldots, M$$

    and denote by $p_1, p_2, \ldots, p_M$ the p values for these tests. Then, the Bonferroni method does the following:

    Given the p values $p_1, p_2, \ldots, p_M$, accept $H_{1i}$ if

    $$p_i < \frac{\alpha}{M}.$$

    Alternatively, note that, for independent p values,

    $$P[\min\{p_1, p_2, \ldots, p_M\} > x]\ \text{assuming } H_{0i},\ i = 1, 2, \ldots, M, \quad = (1 - x)^M$$

    yielding the proper p value to be attached to $\min\{p_1, p_2, \ldots, p_M\}$ as

    $$1 - \big(1 - \min\{p_1, p_2, \ldots, p_M\}\big)^M \tag{26}$$

    and, if $\min\{p_1, p_2, \ldots, p_M\}$ is small and $M$ not too large, (26) will be close to $M \min\{p_1, p_2, \ldots, p_M\}$.

    False Discovery Rate: Sometimes it is reasonable to control the false discovery rate (FDR), which we introduce below.

    Suppose that we accept $H_{1i}$ for all $i$ for which

    $$p_i < \text{threshold}$$


  • and let $m_0 + m_1 = m$ be: (number of true $H_{0i}$ hypotheses) + (number of true $H_{1i}$ hypotheses) = total number of hypotheses (nodes, say).

        # of different outcomes    H0 not rejected    H1 declared    total
        H0 true                    U                  V              m0
        H1 true                    T                  S              m1
        total                      m − R              R              m

    Define the false discovery proportion (FDP) as

    $$\text{FDP} = \begin{cases} V/R, & R > 0,\\ 0, & R = 0 \end{cases}$$

    which is simply the proportion of incorrect $H_1$ decisions. Now, define

    $$\text{FDR} = E[\text{FDP}].$$


  • The Benjamini-Hochberg (BH) Method

    (i) Denote the ordered p values by $p_{(1)} < p_{(2)} < \cdots < p_{(m)}$.

    (ii) Define

    $$l_i = \frac{i\, \alpha}{C_m\, m} \qquad \text{and} \qquad R = \max\{i: p_{(i)} < l_i\}$$

    where $C_m$ is defined to be 1 if the p values are independent and $C_m = \sum_{i=1}^{m} (1/i)$ otherwise.

    (iii) Define the BH rejection threshold $T = p_{(R)}$.

    (iv) Accept all $H_{1i}$ for which $p_{(i)} \le T$.
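    A direct sketch of steps (i)-(iv) above (my own implementation of the recipe; the example p values are assumed):

```python
import numpy as np

def bh_accept_H1(p, alpha, independent=True):
    m = len(p)
    Cm = 1.0 if independent else np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)                            # (i) order the p values
    ell = (np.arange(1, m + 1) * alpha) / (Cm * m)   # (ii) l_i = i*alpha/(Cm*m)
    below = np.nonzero(p[order] < ell)[0]
    if below.size == 0:
        return np.zeros(m, dtype=bool)               # no H1i declared
    R = below[-1]                                    # (ii) max i with p_(i) < l_i
    T = p[order][R]                                  # (iii) BH threshold = p_(R)
    return p <= T                                    # (iv) accept H1i with p_i <= T

p = np.array([0.001, 0.008, 0.039, 0.041, 0.27, 0.60])
print(bh_accept_H1(p, alpha=0.05))  # here: first two tests declare H1
```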

    Theorem 4. (formulated and proved by Benjamini and Hochberg) If the above BH method is applied, then, regardless of how many null hypotheses are true and regardless of the distribution of the p values when the null hypothesis is false,

    $$\text{FDR} = E[\text{FDP}] \le \frac{m_0}{m}\, \alpha \le \alpha.$$
