A walk in the black forest - during which I explain the fundamental problem of forensic statistics and discuss some new approaches to solving it

Richard Gill (28 slides)

Transcript

Page 1

A walk in the black forest
Richard Gill

Page 2

• Know nothing about mushrooms, can see if 2 mushrooms belong to the same species or not

• S = set of all conceivable species of mushrooms

• n = 3, 3 = 2 + 1

• Want to estimate (functionals of) p = (ps : s ∈ S)

• S is (probably) huge

• MLE is (species seen twice ↦ 2/3, species seen once ↦ 1/3, “other” ↦ 0)


Page 3

Forget species names (reduce data), then do MLE

• 3 = 2 + 1 is the partition of n = 3 generated by our sample = reduced data

• qk := probability of the kth most common species in the population

• Estimate q = (q1, q2, …) (i.e., ps listed in decreasing order)

• Marginal likelihood = q1² q2 + q2² q1 + q1² q3 + q3² q1 + …

• Maximized by q = (1/2, 1/2, 0, 0, …) (exercise!)

• Many interesting functionals of p and of q = rsort(p) coincide

• But q̂ = rsort(p̂) does not generally hold for the MLEs based on reduced / all data respectively, and many interesting functionals differ too
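As a quick check on the exercise, a minimal sketch in R (R being the language used for the computations later in the talk): the marginal likelihood of the partition 3 = 2 + 1 evaluated at q = (1/2, 1/2) and at the reverse-sorted full-data MLE (2/3, 1/3).

# marginal (pattern) likelihood of the partition 3 = 2 + 1:
# sum over ordered pairs of distinct species (i, j) of q_i^2 * q_j
lik_2plus1 <- function(q) {
  s <- 0
  for (i in seq_along(q)) for (j in seq_along(q)) {
    if (i != j) s <- s + q[i]^2 * q[j]
  }
  s
}
lik_2plus1(c(1/2, 1/2))  # 0.25: the reduced-data MLE wins
lik_2plus1(c(2/3, 1/3))  # 0.2222...: rsort of the full-data MLE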

Page 4

Examples: Acharya, Orlitsky, Pan (2009)

1000 = 6 x 7 + 2 x 6 + 17 x 5 + 51 x 4 + 86 x 3 + 138 x 2 + 123 x 1 (*)

[Figure: estimated species probabilities (log scale, 0.0001 to 0.0050) against species rank 0 to 1000. Legend: Naive estimator = full data MLE of q = rsort(full data MLE of p); Reduced data MLE of q; True q.]

© (2014) Piet Groeneboom (independent replication)

6 hrs, MH-EM; ca. 10^4 outer EM steps, 10^6 inner MH steps to approximate the E-step within each EM step

(*) Noncommutative multiplication: 6 × 7 = 6 species each observed 7 times
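The same bookkeeping in R, as a sanity check on (*):

# expand the spectrum (*) into a vector of species counts
counts <- rep(c(7, 6, 5, 4, 3, 2, 1), times = c(6, 2, 17, 51, 86, 138, 123))
sum(counts)     # 1000 observations
length(counts)  # 423 distinct species observed, of the true 500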

Page 5

Theorem 6 in [12] states that

k̂ ≤ m + (m − 1) / (2 μ_m − 2).   (1)

An important application is to estimate the underlying distribution's support size k. Inequality (1) bounds the support size when the lowest multiplicity, μ_m, is at least 2. The next theorem upper-bounds k when μ_m = 1 and all other multiplicities are at least 2, namely exactly one element appears once, for example as in the pattern 11122334. We call such patterns unique-singleton. We will later use this result to establish the PML distribution of ternary patterns.

Theorem 3: For unique-singleton patterns, k̂ ≤ 2(m − 1). ∎

In particular, all patterns with μ_m > log(m + 1) have PML distribution with support size m. Combined with the lemma, we obtain

Corollary 2: The PML distribution of a quasi-uniform pattern with μ_m > log₂(m + 1) is uniform over m symbols. ∎

For example, the pattern 11111222333 is quasi-uniform and has μ_m = 3 > 2 = log₂(m + 1), hence the corollary yields the previously unknown PML distribution

P_11111222333 = (1/3, 1/3, 1/3).

A pattern ψ is 1-uniform if μ_i − μ_j ≤ 1 for all i, j, namely all multiplicities are within one of each other, as in 1112233. As shown in [13], all 1-uniform patterns have a uniform PML distribution and can thus be evaluated.

As mentioned earlier, the simplest patterns are binary, and their PML distribution was derived in [12], showing in particular that all of them have k̂ = 2. The next simplest patterns are ternary, and have m = 3. Three types of ternary patterns can be addressed by existing results.

1) Uniform (1^T 2^T 3^T). Of these, 123 has P = ( ), and all others have P = (1/3, 1/3, 1/3) [12].

2) 1-uniform (1^T 2^T 3^(T−1) or 1^T 2^(T−1) 3^(T−1)). Of these, 1123 has P = (1/5, 1/5, 1/5, 1/5, 1/5), and all others have P = (1/3, 1/3, 1/3) [13].

3) Skewed (1^T 2 3). Of these, 1123 is 1-uniform and addressed above, and all others have P = …; this result is proved in [15].

It is easy to see that all ternary patterns not covered by these cases have at most one symbol appearing once, for example 111223 and 111122233. For all those, we show that the PML distribution has support size 3.

Theorem 4: All ternary patterns with at most one symbol appearing once have k̂ = 3. ∎

The theorem allows us to compute the PML distribution of all ternary patterns. Some follow by an easy combination of the theorem with Lemma 1.

Corollary 5: P_111223 = P_1112223 = (1/3, 1/3, 1/3).

For more complex patterns, the PML distribution can be obtained by combining the theorem with the Kuhn-Tucker conditions.

Corollary 6: For all ternary patterns with at most one symbol appearing once, the PML distribution is (p1, p2, p3), where p1, p2, p3 are solutions to polynomial equations of the following form:

p1 + p2 + p3 = 1,

Σ μ_{j1} p1^(μ_{j1} − 1) p2^(μ_{j2}) p3^(μ_{j3}) = Σ μ_{j1} p3^(μ_{j1} − 1) p1^(μ_{j2}) p2^(μ_{j3}),

where the summation is over all six permutations (j1, j2, j3) of (1, 2, 3). ∎

For short patterns we can solve the equations in Corollary 6 and derive the PML distribution. An example is the following result.

Corollary 7: P_1111223 = (…). ∎

Combined with previously known results, the three PML distributions in Corollaries 5 and 7 yield the PML distributions of all but one pattern of length up to 7. The only exception is 1112234, which we conjecture to have PML (1/5, 1/5, …, 1/5) but have not been able to prove yet. The PML distributions of these patterns are shown in Table I along with references to where they were shown.

TABLE I: PML distributions of all patterns of length ≤ 7

Canonical patterns | PML distribution | Reference
1 | any distribution | Trivial
11, 111, 1111, … | (1) | Trivial
12, 123, 1234, … | ( ) | Trivial
112, 1122, 1112, 11122, 111122 | (1/2, 1/2) | [12]
11223, 112233, 1112233 | (1/3, 1/3, 1/3) | [13]
111223, 1112223 | (1/3, 1/3, 1/3) | Corollary 5
1123, 1122334 | (1/5, 1/5, …, 1/5) | [12]
11234 | (1/8, 1/8, …, 1/8) | [13]
11123 | (3/5) | [15]
11112 | (0.7887…, 0.2113…) | [12]
111112 | (0.8322…, 0.1678…) | [12]
111123 | (2/3) | [15]
111234 | (1/2) | [15]
112234 | (1/6, 1/6, …, 1/6) | [13]
112345 | (1/13, …, 1/13) | [13]
1111112 | (0.857…, 0.143…) | [12]
1111122 | (2/3, 1/3) | [12]
1112345 | (3/7) | [15]
1111234 | (4/7) | [15]
1111123 | (5/7) | [15]
1111223 | (…) | Corollary 7
1123456 | (1/19, …, 1/19) | [13]
1112234 | (1/5, 1/5, …, 1/5) | Conjectured

IV. PROOFS

For a probability distribution P = (p1, p2, …), let P_i = (p1, …, p_{i−1}, 0, p_{i+1}, …) be the sub-distribution agreeing with P on all probabilities except p_i, which is set to 0. Note that the probabilities in P_i, including q, sum to 1 − p_i, hence if p_i > 0 then P_i is not a distribution but a point inside the probability simplex 𝒫. We let P̄_i be P_i normalized so it is a …

ISIT 2009, Seoul, Korea, June 28 - July 3, 2009

The Maximum Likelihood Probability of Unique-Singleton, Ternary, and Length-7 Patterns

Jayadev Acharya, ECE Department, UCSD. Email: [email protected]
Alon Orlitsky, ECE & CSE Departments, UCSD. Email: [email protected]
Shengjun Pan, CSE Department, UCSD. Email: [email protected]

Abstract: We derive several pattern maximum likelihood (PML) results, among them showing that if a pattern has only one symbol appearing once, its PML support size is at most twice the number of distinct symbols, and that if the pattern is ternary with at most one symbol appearing once, its PML support size is three. We apply these results to extend the set of patterns whose PML distribution is known to all ternary patterns, and to all but one pattern of length up to seven.

I. INTRODUCTION

Estimating the distribution underlying an observed data sample has important applications in a wide range of fields, including statistics, genetics, system design, and compression.

Many of these applications do not require knowing the probability of each element, but just the collection, or multiset, of probabilities. For example, in evaluating the probability that when a coin is flipped twice both sides will be observed, we don't need to know p(heads) and p(tails), but only the multiset {p(heads), p(tails)}. Similarly, to determine the probability that a collection of resources can satisfy certain requests, we don't need to know the probability of requesting the individual resources, just the multiset of these probabilities, regardless of their association with the individual resources. The same holds whenever just the data "statistics" matters.

One of the simplest solutions for estimating this probability multiset uses standard maximum likelihood (SML) to find the distribution maximizing the sample probability, and then ignores the association between the symbols and their probabilities. For example, upon observing the symbols @ ∧ @, SML would estimate their probabilities as p(@) = 2/3 and p(∧) = 1/3, and disassociating symbols from their probabilities, would postulate the probability multiset {2/3, 1/3}.

SML works well when the number of samples is large relative to the underlying support size. But it falls short when the sample size is relatively small. For example, upon observing a sample of 100 distinct symbols, SML would estimate a uniform multiset over 100 elements. Clearly a distribution over a large, possibly infinite number of elements would better explain the data. In general, SML errs in never estimating a support size larger than the number of elements observed, and tends to underestimate probabilities of infrequent symbols.

Several methods have been suggested to overcome these problems. One line of work was begun by Fisher [1], and was followed by Good and Toulmin [2], and Efron and Thisted [3]. Bunge and Fitzpatrick [4] provide a comprehensive survey of many of these techniques.

A related problem, not considered in this paper, estimates the probability of individual symbols for small sample sizes. This problem was considered by Laplace [5], Good and Turing [6], and more recently by McAllester and Schapire [7], Shamir [8], Gemelos and Weissman [9], Jedynak and Khudanpur [10], and Wagner, Viswanath, and Kulkarni [11].

A recent information-theoretically motivated method for the multiset estimation problem was pursued in [12], [13], [14]. It is based on the observation that since we do not care about the association between the elements and their probabilities, we can replace the elements by their order of appearance, called the observation's pattern. For example the pattern of @ ∧ @ is 121, and the pattern of abracadabra is 12314151231.

Slightly modifying SML, this pattern maximum likelihood (PML) method asks for the distribution multiset that maximizes the probability of the observed pattern. For example, the 100 distinct-symbol sample above has pattern 123…100, and this pattern probability is maximized by a distribution over a large, possibly infinite support set, as we would expect. And the probability of the pattern 121 is maximized, to 1/4, by a uniform distribution over two symbols, hence the PML distribution of the pattern 121 is the multiset {1/2, 1/2}.

To evaluate the accuracy of PML we conducted the following experiment. We took a uniform distribution over 500 elements, shown in Figure 1 as the solid (blue) line. We sampled the distribution with replacement 1000 times. In a typical run, of the 500 distribution elements, 6 elements appeared 7 times, 2 appeared 6 times, and so on, and 77 did not appear at all, as shown in the figure. The standard ML estimate, which always agrees with empirical frequency, is shown by the dotted (red) line. It underestimates the distribution's support size by over 77 elements and misses the distribution's uniformity. By contrast, the PML distribution, as approximated by the EM algorithm described in [14] and shown by the dashed (green) line, performs significantly better and postulates essentially the correct distribution.

As shown in the above and other experiments, PML's empirical performance seems promising. In addition, several results have proved its convergence to the underlying distribution [13], yet analytical calculation of the PML distribution for specific patterns appears difficult. So far the PML distribution has been derived for only very simple or short patterns.

Among the simplest patterns are the binary patterns, consisting of just two distinct symbols, for example 11212. A formula for the PML distributions of all binary patterns was derived in [12].

7 = 3 + 2 + 1 + 1

5 = 3 + 1 + 1

1 = 1
n = n (n = 2, 3, …)
n = 1 + 1 + 1 + …

Page 6

frequency:  27 23 16 14 13 12 11 10 9 8 7 6 5  4  3  2   1
replicates:  1  1  2  1  1  1  2  2 1 3 2 6 11 33 71 253 1434

Biometrika (2002), 89, 3, pp. 669-681. © 2002 Biometrika Trust. Printed in Great Britain.

A Poisson model for the coverage problem with a genomic application

BY CHANG XUAN MAO, Interdepartmental Group in Biostatistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720-3860, U.S.A. [email protected]

AND BRUCE G. LINDSAY, Department of Statistics, Pennsylvania State University, University Park, Pennsylvania 16802-2111, U.S.A. [email protected]

SUMMARY

Suppose a population has infinitely many individuals and is partitioned into unknown N disjoint classes. The sample coverage of a random sample from the population is the total proportion of the classes observed in the sample. This paper uses a nonparametric Poisson mixture model to give new understanding and results for inference on the sample coverage. The Poisson mixture model provides a simplified framework for inferring any general abundance-K coverage, the sum of the proportions of those classes that contribute exactly k individuals in the sample for some k in K, with K being a set of nonnegative integers. A new moment-based derivation of the well-known Turing estimators is presented. As an application, a gene-categorisation problem in genomic research is addressed. Since Turing's approach is a moment-based method, maximum likelihood estimation and minimum distance estimation are indicated as alternatives for the coverage problem. Finally, it will be shown that any Turing estimator is asymptotically fully efficient.

Some key words: Digital gene expression; Poisson mixture; Sample coverage; Species.

1. INTRODUCTION

Consider a population composed of infinitely many individuals, which can be considered as an approximation of real populations with finitely many individuals under specific situations, in particular when the number of individuals in a target population is very large. The population has been partitioned into N disjoint classes indexed by i = 1, 2, …, N, with π_i being the proportion of the ith class. The identity of each class and the parameter N are assumed to be unknown prior to the experiment. The unknown π_i's are subject to the constraint Σ_{i=1}^N π_i = 1. A random sample of individuals is taken from the population. Let X_i be the number of individuals from the ith class, called the frequency in the sample. If X_i = 0, then the ith class is not observed in the sample. It will be assumed that these zero frequencies are 'missed' random variables so that N, the 'sample size' of all the X_i's, is unknown. Let n_k be the number of classes with frequency k and s be the number of …


– naive, – mle (figure legend)

LR = 2200

LR = 7400; Good-Turing = 7300

N = 2586; 21 × 10^6 iterations of SA-MH-EM (24 hrs) (right panel: y-axis logarithmic)


(mle has mass 0.3 spread thinly beyond 2000 – for optimisation, taken infinitely thin, infinitely far)

Tomato genes ordered by frequency of expression; qk = probability gene k is being expressed. Very similar to typical Y-STR haplotype data!

[Plots: estimated probability against gene rank 0 to 2000; left panel linear y-axis (0.000 to 0.010), right panel logarithmic y-axis (1e−04 to 1e−02).]

© (2014) Richard Gill. R / C++ using the Rcpp interface.

MH and EM steps alternate; E-step updated by SA = “stochastic approximation”
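As a cross-check, the Good-Turing value quoted above can be recomputed from the frequency table at the top of this page, using the n N1 / (2 N2) recipe that page 14 derives:

# recompute the Good-Turing figure from the tomato frequency spectrum
freq <- c(27, 23, 16, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
repl <- c( 1,  1,  2,  1,  1,  1,  2,  2, 1, 3, 2, 6, 11, 33, 71, 253, 1434)
n  <- sum(freq * repl)   # 2586, matching N above
N1 <- repl[freq == 1]    # 1434 genes seen once
N2 <- repl[freq == 2]    # 253 genes seen twice
n * N1 / (2 * N2)        # about 7330, cf. Good-Turing = 7300 on the plot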

Page 7

The Fundamental Problem of Forensic Statistics (Sparsity, and sometimes Less is More)

Richard Gill

with: Dragi Anevski, Stefan Zohren; Michael Bargpeter; Giulia Cereda

September 12, 2014

Page 8

Data (database): X ∼ Multinomial(n, p), p = (ps : s ∈ S)

#S is huge (all theoretically possible species)
Very many ps are very small, almost all are zero

Conventional task:
Estimate ps, for a certain s such that Xs = 0
Preferably, also inform the “consumer” of the accuracy
At present hardly ever known, hardly ever done

Page 9

Motivation:
Each “species” = a possible DNA profile
We have a database of size n of a sample of DNA profiles
e.g. Y-chromosome 8-locus STR haplotypes (yhrd.org)
e.g. mitochondrial DNA, ...

Some profiles occur many times, some are very rare, most “possible” profiles do not occur in the database at all

Profile s is observed at a crime scene
We have a suspect and his profile matches
We never saw s before, so it must be rare

Likelihood ratio (prosecution : defence) is 1/ps
Evidential value is −log10(ps) bans (one ban = 2.30 nats = 3.32 bits)
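For concreteness, a minimal R sketch of the conversions, with a purely illustrative value ps = 10^-4 (not a number from the talk):

p_s <- 1e-4                # hypothetical match probability
LR  <- 1 / p_s             # likelihood ratio prosecution : defence
-log10(p_s)                # 4 ban
-log10(p_s) * log(10)      # 9.21 nats  (1 ban = log(10) = 2.30 nats)
-log10(p_s) / log10(2)     # 13.29 bits (1 ban = 3.32 bits)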

Page 10

Present day practices
Type 1: don't use any prior information, discard most data
Various ad-hoc “elementary” approaches
No evaluation of accuracy

Type 2: use all prior information, all data
Model (ps : s ∈ S) using population genetics, biology, imagination:
Hierarchy of parametric models of rapidly increasing complexity:
“Present day population equals finite mixture of subpopulations, each following an exact theoretical law”
e.g. Andersen et al. (2013) “DiscLapMix”:
Model selection by AIC, estimation by EM (MLE), plug-in for LR
No evaluation of accuracy

Both types are (more precisely: easily can be) disastrous

Page 11

New approach: Good-Turing
Idea 1: full data = database + crime scene + suspect data
Idea 2: reduce full data cleverly to reduce complexity, to get a better-estimable and better-evaluable LR without losing much discriminatory power

Database & all evidence together = sample of size n, twice increased by +1

Reduce complexity by forgetting names of the species

All data together is reduced to:
partition of n implied by database, together with
partition of n + 1 (add crime-scene species to database)
partition of n + 2 (add suspect species to database)
Database partition = spectrum of database species frequencies = Y = rsort(X) (reverse sort); see the sketch below
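A minimal R sketch of the reduction, with a made-up three-species database (names hypothetical):

x <- c(speciesA = 3, speciesB = 1, speciesC = 1)  # hypothetical database, n = 5
Y <- rev(sort(x[x > 0]))                          # rsort: the partition 5 = 3 + 1 + 1
unname(Y)                                         # 3 1 1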

Page 12

The distribution of the database frequency spectrum Y = rsort(X) only depends on q = rsort(p), the spectrum of true species probabilities (*)

Note: components of X and of p are indexed by s ∈ S; components of Y and of q are indexed by k ∈ N+.

Y/n (database spectrum, expressed as relative frequencies) is for our purposes a lousy estimator of q (i.e., for the functionals of q we are interested in), even more so than X/n is a lousy estimator of p (for those functionals)

(*) More generally, the (joint) laws of (Y, Y+, Y++) under the prosecution and defence hypotheses only depend on q

Page 13

Our aim: estimate the reduced data LR

LR = Σ_{s∈S} (1 − ps)^n ps / Σ_{s∈S} (1 − ps)^n ps²  =  Σ_{k∈N+} (1 − qk)^n qk / Σ_{k∈N+} (1 − qk)^n qk²

or, “estimate” the reduced data conditional LR

LRcond = Σ_{s∈S} I{Xs = 0} ps / Σ_{s∈S} I{Xs = 0} ps²

and report (or at the least, know something about) the precision of the estimate. Important to transform to “bans” (i.e. take negative log base 10) before discussing precision.
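Both sums are one-liners for a given (truncated) q; a sketch in R, with an arbitrary, hypothetical power-law q:

# reduced-data LR as a functional of q
lr_reduced <- function(q, n) sum((1 - q)^n * q) / sum((1 - q)^n * q^2)
q <- (1:1000)^(-1.5); q <- q / sum(q)   # hypothetical spectrum
lr_reduced(q, n = 500)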

Page 14

Key observations
Database spectrum can be reduced to Nx = #{s : Xs = x}, x ∈ N+, or even further, to {(x, Nx) : Nx > 0}

Add superscript n to “impose” dependence on sample size,

(n+1 choose 1) · Σ_{s∈S} (1 − ps)^n ps = E_{n+1}(N1) ≈ E_n(N1)

(n+2 choose 2) · Σ_{s∈S} (1 − ps)^n ps² = E_{n+2}(N2) ≈ E_n(N2)

suggests: estimate LR (or LRcond!) from the database by n N1 / (2 N2)

Alternative: estimate by plug-in of the database MLE of q in the formula for LR as a function of q

But how accurate is L̂R?
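A small simulation sketch of the n N1 / (2 N2) recipe in R, with a hypothetical long-tailed p; the exact reduced-data LR is computed alongside for comparison:

set.seed(1)
gt_lr <- function(counts) {
  n  <- sum(counts)
  N1 <- sum(counts == 1)  # species seen exactly once
  N2 <- sum(counts == 2)  # species seen exactly twice
  n * N1 / (2 * N2)
}
p <- rev(sort(rexp(5000))); p <- p / sum(p)          # hypothetical population
x <- as.vector(rmultinom(1, size = 2000, prob = p))  # database, n = 2000
gt_lr(x)                                             # Good-Turing estimate
sum((1 - p)^2000 * p) / sum((1 - p)^2000 * p^2)      # exact reduced-data LR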

Page 15

Accuracy (1): asymptotic normality; wlog S = N+, p = q

Poissonization trick: pretend n is a realisation of N ∼ Poisson(λ) ⟹ Xs ∼ independent Poisson(λ ps)

(N, N1, N2) = Σ_s (Xs, I{Xs = 1}, I{Xs = 2}) should be asymptotically Gaussian₃ as λ → ∞
Conditional law of (N1, N2) given N = n = λ → ∞ hopefully converges to the corresponding conditional Gaussian₂
Proof: Esty (1983), no N2, p depending on n, √n normalisation; Zhang & Zhang (2009): N&SC, implying p must vary with n to have convergence (& non-degenerate limit) at this rate.
Conjecture: Gaussian limit with slower rate and fixed p possible, under appropriate tail behaviour of p; rate will depend on tail “exponent” (rate of decay of tail)
Formal delta-method calculations give a simple conjectured asymptotic variance estimator depending on n, N1, N2, N3, N4; simulations suggest “it works”
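A simulation sketch of the Poissonization setup (hypothetical power-law p; this only illustrates the setup, not the conjectured rates):

set.seed(2)
p <- (1:2000)^(-1.5); p <- p / sum(p)   # hypothetical heavy-tailed spectrum
lambda <- 1000
sim <- replicate(2000, {
  x <- rpois(length(p), lambda * p)     # X_s independent Poisson(lambda p_s)
  c(N = sum(x), N1 = sum(x == 1), N2 = sum(x == 2))
})
cor(t(sim))          # (N, N1, N2) strongly correlated
hist(sim["N1", ])    # marginal of N1 already looks roughly Gaussian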

Page 16

Accuracy (2): MLE and bootstrap

Problem: Nx gets rapidly unstable as x increases
Good & Turing proposed “smoothing” the higher observed Nx

Alternative: Orlitsky et al. (2004, 2005, ...): estimate q by MLE
Likelihood = Σ_χ Π_k q_{χ(k)}^{Y_k}, sum over bijections χ : N+ → N+ (brute-force sketch at the end of this page)

Anevski, Gill, Zohren (arXiv:1312.1200) prove (almost) root-n L1-consistency by a delicate proof based on simple ideas inspired by Orlitsky (et al.)'s “outline” of “proof”

Must “close” the parameter space to make the MLE exist – allow a positive probability “blob” of zero-probability species
Computation: SA-MH-EM

Proposal: use database MLE to estimate accuracy (bootstrap),and possibly to give alternative estimator of LR
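The likelihood above can be brute-forced for toy cases by enumerating injections of observed frequency ranks into a truncated q; a sketch (factorial cost, so toy sizes only):

pattern_lik <- function(Y, q) {
  # Y: decreasing observed frequencies; q: candidate (truncated) spectrum
  rec <- function(k, used) {
    if (k > length(Y)) return(1)
    s <- 0
    for (j in seq_along(q)) if (!used[j]) {
      used[j] <- TRUE
      s <- s + q[j]^Y[k] * rec(k + 1, used)
      used[j] <- FALSE
    }
    s
  }
  rec(1, rep(FALSE, length(q)))
}
pattern_lik(c(2, 1), c(1/2, 1/2))  # 0.25: the 3 = 2 + 1 example again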

Page 17

Open problems

(1) Get results for these and other interesting functionals of the MLE, e.g., entropy
Conjecture: will see interesting convergence rates depending on the rate of decay of the tail of q
Conjecture: interesting / useful theorems require q to depend on n
Problems of adaptation, coverage probability challenges

(2) Prove “semiparametric” bootstrap works for such functionals

(3) Design a computationally feasible nonparametric Bayesian approach (a confidence region for a likelihood ratio is an oxymoron) and verify it has good frequentist properties; or design hybrid (i.e., “empirical Bayes”) approaches = Bayes with a data-estimated hyperparameter

Page 18

Remarks

Plug-in of the MLE to estimate E(Nx) almost exactly reproduces the observed Nx: the order of size of √(expected) − √(observed) is ≈ ± 1/2

Our approach is all the more useful when the case involves a match of a not-new but still (in the database) uncommon species: the Good-Turing-inspired estimated LR is easy to generalise, but involves Nx for “higher” x, hence better to smooth by replacing with MLE predictions / estimates of expectations

Page 19

Consistency proof

Page 20

• Convergence at rate n^(−(k − 1)/(2k)) if θx ≈ C / x^k

• Convergence at rate n^(−1/2) if θx ≈ A exp(−B x^k)

Estimating a probability mass function with unknown labels

Dragi Anevski, Richard Gill, Stefan Zohren
Lund University, Leiden University, Oxford University

July 25, 2012 (last revised DA, 10:10 am CET)

Abstract

1 The model

1.1 Introduction

Imagine an area inhabited by a population of animals which can be classified by species. Which species actually live in the area (many of them previously unknown to science) is a priori unknown. Let A denote the set of all possible species potentially living in the area. For instance, if animals are identified by their genetic code, then the species' names α are “just” equivalence classes of DNA sequences. The set of all possible DNA sequences is effectively uncountably infinite, and for present purposes so is the set of equivalence classes, each equivalence class defining one “potential” species.

Suppose that animals of species α ∈ A form a fraction θ_α ≥ 0 of the total population of animals. The probabilities θ_α are completely unknown. [To be completed: very brief (verbal) description of model, summary of previous work of Orlitsky et al., main results of the paper.]

1.2 The data: a random partition of n

The basic model studied in this paper assumes that Σ_{α: θ_α > 0} θ_α = 1, but we shall also study an extended model in which it is allowed that Σ_{α: θ_α > 0} θ_α < 1. In either case, the set of species with positive probability is finite or at most countably infinite.

Corollary:

We show that an extended maximum likelihood estimator exists in Appendix A of [3]. We next derive the almost sure consistency of (any) extended maximum likelihood estimator θ̂.

Theorem 1. Let θ̂ = θ̂(n) be (any) extended maximum likelihood estimator. Then for any δ > 0,

P_{n,θ}(‖θ̂ − θ‖₁ > δ) ≤ (1/(√3 n)) e^{π√(2n/3)} e^{−n ε²/2} (1 + o(1)) as n → ∞,

where ε = δ/(8r) and r = r(θ, δ) is such that Σ_{i=r+1}^∞ θ_i ≤ δ/4.

Proof. Now let Q_{θ,δ} be as in the statement of Lemma 1. Then there is an r such that the conclusion of the lemma holds, i.e. for each n there is a set

A = A_n = { sup_{1≤x≤r} |f̂^(n)_x − θ_x| ≤ ε }

such that

P_{n,θ}(A_n) ≥ 1 − 2 e^{−n ε²/2},
sup_{φ ∈ Q_{θ,δ}} P_{n,φ}(A_n) ≤ 2 e^{−n ε²/2}.

For any φ ∈ Q_{θ,δ} we can define the likelihood ratio dP_{n,φ}/dP_{n,θ}. Then for any φ ∈ Q_{θ,δ},

P_{n,θ}( A_n ∩ { dP_{n,φ}/dP_{n,θ} ≥ 1 } )
= ∫_{A_n ∩ { dP_{n,φ}/dP_{n,θ} ≥ 1 }} dP_{n,θ}
≤ ∫_{A_n} (dP_{n,φ}/dP_{n,θ}) dP_{n,θ}
= P_{n,φ}(A_n)
≤ 2 e^{−n ε²/2},

which implies that

P_{n,θ}( dP_{n,φ}/dP_{n,θ} ≥ 1 )
= P_{n,θ}( A_n ∩ { dP_{n,φ}/dP_{n,θ} ≥ 1 } ) − P_{n,θ}(A_n) + P_{n,θ}( A_n ∪ { dP_{n,φ}/dP_{n,θ} ≥ 1 } )
≤ 2 e^{−n ε²/2} − 1 + 2 e^{−n ε²/2} + 1 = 4 e^{−n ε²/2}.

Page 21

Tools

• Dvoretzky-Kiefer-Wolfowitz

• “r-sort” is a contraction mapping w.r.t. the sup norm

• Hardy-Ramanujan: the number of partitions of n grows as p(n) ~ (1/(4n√3)) e^{π√(2n/3)}
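As an aside, the Hardy-Ramanujan growth is easy to inspect numerically; a sketch using the standard partition-counting recursion (nothing here is specific to the paper):

# p(n) by dynamic programming: partitions of n into parts <= k
partition_fn <- function(nmax) {
  P <- matrix(0, nmax + 1, nmax + 1)
  P[1, ] <- 1  # one (empty) partition of 0
  for (n in 1:nmax) for (k in 1:nmax)
    P[n + 1, k + 1] <- P[n + 1, k] + (if (n >= k) P[n - k + 1, k + 1] else 0)
  P[, nmax + 1]
}
p50 <- partition_fn(50)[51]                              # p(50) = 204226
p50 / (exp(pi * sqrt(2 * 50 / 3)) / (4 * 50 * sqrt(3)))  # close to 1 already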

If θ̂ is an extended ML estimator then

dP_{n,θ̂} / dP_{n,θ} ≥ 1.

For a given n = n₁ + … + n_k such that n₁ ≥ … ≥ n_k > 0 (with k varying), there is a finite number p(n) of possibilities for the value of (n₁, …, n_k). The number p(n) is the partition function of n, for which we have the asymptotic formula

p(n) = (1/(4n√3)) e^{π√(2n/3)} (1 + o(1)),

as n → ∞, cf. [23]. For each possibility of (n₁, …, n_k) there is an extended ML estimator (for each possibility we can choose one such) and we let P_n = {θ̂(1), …, θ̂(p(n))} be the set of all such choices of extended ML estimators. Then

P_{n,θ}(θ̂ ∈ Q_{θ,δ}) = Σ_{φ ∈ P_n ∩ Q_{θ,δ}} P_{n,θ}(θ̂ = φ) ≤ Σ_{φ ∈ P_n ∩ Q_{θ,δ}} P_{n,θ}( dP_{n,φ}/dP_{n,θ} ≥ 1 ) ≤ p(n) · 4 e^{−n ε²/2},

which ends the proof. ∎

That θ̂ is consistent in probability is immediate from Theorem 1, and in fact we have almost sure consistency:

Corollary 1. The sequence of maximum likelihood estimators θ̂(n) is strongly consistent in L₁-norm, i.e.

‖θ̂(n) − θ‖₁ → 0 almost surely as n → ∞.

Proof. This follows as a consequence of the bound in Theorem 1, since X_n → 0 almost surely whenever Σ_{n=1}^∞ P(|X_n| > δ) < ∞ for all δ > 0, and here

Σ_{n=1}^∞ (1/(√3 n)) e^{π√(2n/3) − n ε²/2} < ∞.

(i) Assume first that Σ_{i=r+1}^∞ φ_i ≤ δ/2. Then

δ ≤ Σ_{i=1}^r |θ_i − φ_i| + Σ_{i=r+1}^∞ |θ_i − φ_i|
≤ Σ_{i=1}^r |θ_i − φ_i| + Σ_{i=r+1}^∞ θ_i + Σ_{i=r+1}^∞ φ_i
≤ Σ_{i=1}^r |θ_i − φ_i| + δ/4 + δ/2,

which implies (8). (ii) Assume instead that Σ_{i=r+1}^∞ φ_i > δ/2, and write the assumptions as Σ_{i=1}^r θ_i > 1 − δ/4 and Σ_{i=1}^r φ_i = Σ_{i=1}^∞ φ_i − Σ_{i=r+1}^∞ φ_i ≤ 1 − δ/2. Then

Σ_{i=1}^r |θ_i − φ_i| ≥ Σ_{i=1}^r (θ_i − φ_i) > 1 − δ/4 − 1 + δ/2 = δ/4,

which again implies (8). From (8) it follows that for some i ≤ r we have

|θ_i − φ_i| ≥ δ/(4r) =: 2ε = 2ε(δ, θ).   (9)

Note that r, and thus also ε, depends only on θ, and not on φ.

Recall the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [6, 17]: for every ε > 0,

P_θ( sup_{x≥0} |F^(n)(x) − F(x)| ≥ ε ) ≤ 2 e^{−2nε²},   (10)

where F is the cumulative distribution function corresponding to θ, and F^(n) the empirical distribution function based on i.i.d. data from F. Since

{ sup_{x≥0} |F^(n)(x) − F(x)| ≥ ε } ⊇ { sup_{x≥0} |f^(n)_x − θ_x| ≥ 2ε } ⊇ { sup_{x≥1} |f^(n)_x − θ_x| ≥ 2ε },

with f^(n) the empirical probability mass function corresponding to F^(n), equation (10) implies

P_{n,θ}( sup_{x≥1} |f^(n)_x − θ_x| ≥ ε ) = P_θ( sup_{x≥1} |f^(n)_x − θ_x| ≥ ε ) ≤ 2 e^{−nε²/2}.   (11)

r-sort = reverse sort = sort in decreasing order

Page 22

Key lemma: P, Q probability measures; p, q densities

• Find event A depending on P and δ

• P(Aᶜ) ≤ ε

• Q(A) ≤ ε for all Q with d(Q, P) ≥ δ

• Hence P(p/q < 1) ≤ 2ε if d(Q, P) ≥ δ

Application: P, Q are probability distributions of the data, depending on parameters θ, φ respectively. A is the event that Y/n is within δ of θ (sup norm). d is the L1 distance between θ and φ.

Proof: P(p/q < 1) = P({p/q < 1} ∩ Aᶜ) + P({p/q < 1} ∩ A) ≤ P(Aᶜ) + Q(A), since on the event {p/q < 1} the density p is dominated by q, so P({p/q < 1} ∩ A) = ∫_{A ∩ {p < q}} p ≤ ∫_{A ∩ {p < q}} q ≤ Q(A)

Page 23

Preliminary steps
• By Dvoretzky-Kiefer-Wolfowitz, empirical relative frequencies are close to true probabilities, in sup norm, ordering known

• By contraction property, same is true after monotone reordering of empirical

• Apply the key lemma, with the following inputs:

• 1. Naive estimator is close to the truth (with large probability)

• 2. Naive estimator is far from any particular distant non-truth (with large probability)

Page 24

Proof outline
• By the key lemma, the probability that the MLE equals any particular q distant from the (true) p is very small

• By Hardy-Ramanujan, there are not many q to consider

• Therefore the probability that the MLE is far from p is small

• Careful – sup norm on data, L1 norm on parameter space, truncation of parameter vectors ...

Page 25

–Ate Kloosterman

A practical question from the NFI:

It is a Y-chromosome identification case (MH17 disaster). A nephew (in the paternal line) of a missing male individual and the unidentified victim share the same Y-chromosome haplotype. The evidential value of the autosomal DNA evidence for this ID is low (LR = 20). The matching Y-STR haplotype is unique in a Dutch population sample (N = 2085) and in the worldwide YHRD Y-STR database (containing 84 thousand profiles). I know the distribution of the haplotype frequencies in the Dutch database (1826 haplotypes).

Distribution of haplotype frequencies

Database frequency of haplotype:   1    2   3  4  5  6  7  13
Number of haplotypes in database:  1650 130 30 7  5  2  1  1
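As a quick consistency check in R: the table reproduces the numbers in the question, and plugging it into the Good-Turing recipe of page 14 gives the order of magnitude seen on page 28:

freq  <- c(1, 2, 3, 4, 5, 6, 7, 13)
count <- c(1650, 130, 30, 7, 5, 2, 1, 1)
sum(count)                # 1826 distinct haplotypes
n <- sum(freq * count)    # 2085 = database size
N1 <- count[freq == 1]; N2 <- count[freq == 2]
n * N1 / (2 * N2)         # about 13200
log10(n * N1 / (2 * N2))  # about 4.12 ban, cf. page 28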

Page 26

N = 2085; 10^6 × 10^5 (outer × inner) iterations of SA-MH-EM (12 hrs) (y-axis logarithmic). MLE puts mass 0.51 spread thinly beyond 2500 – for optimisation, taken infinitely thin, infinitely far.

© (2014) Richard Gill. R / C++ using the Rcpp interface; outer EM and inner MH steps, E-step updated by SA = “stochastic approximation”

Estimated NL Y-str haplotype probability distribution

[Plot: estimated probability against haplotype rank 0 to 2500, y-axis logarithmic (0.00005 to 0.005). Curves: PML, Naive, Initial (1/(200 √x)).]

Page 27

Observed and expected number of haplotypes, by database frequency

Frequency:  1        2       3      4     5     6
Observed:   1650     130     30     7     5     2
Expected:   1650.49  129.64  29.14  8.99  3.77  1.75

Hanging rootogram

[Plot: hanging rootogram against database frequency; y-axis from −1.0 to 1.0.]

Page 28

Bootstrap s.e. 0.040, Good-Turing asymptotic theory estimated s.e. 0.044
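A rough sketch of such a bootstrap; for simplicity it resamples from the naive plug-in distribution rather than from the PML estimate that the talk proposes, so the number it prints is only indicative:

set.seed(3)
freq  <- c(1, 2, 3, 4, 5, 6, 7, 13)
count <- c(1650, 130, 30, 7, 5, 2, 1, 1)
n     <- sum(freq * count)                # 2085
p_hat <- rep(freq, count) / n             # naive estimate, 1826 atoms
boot <- replicate(1000, {
  x <- as.vector(rmultinom(1, n, p_hat))  # resampled database
  log10(n * sum(x == 1) / (2 * sum(x == 2)))
})
sd(boot)  # bootstrap s.e. of the Good-Turing log10 LR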

[Histogram: bootstrap distribution of the Good-Turing log10 LR; x-axis bootstrap log10 LR from 4.00 to 4.25, y-axis probability density from 0 to 10.]