The fundamental problem of Forensic Statistics

How to assess the evidential value of a rare type match

Giulia Cereda, Université de Lausanne

Richard D. Gill, University of Leiden

The problem

• A crime
• A piece of evidence found at the crime scene (DNA, fingerprint, footprint, handwriting, etc.)
• A suspect (identified independently)
• A match between the suspect’s characteristic and the evidence’s characteristic
• A database which counts the frequency of each characteristic
• The database frequency of the crime (and the suspect) characteristic is 0

Example

• A DNA stain is found on the victim’s body.

• Y-STR profile of type h.

• A suspect is identified, who is also of Y-STR type h.

• The Y-STR reference database does not contain type h.

Small databases

Generalized-Good (GG) method: a nonparametric Good-type estimator based on Good (1953).

DiscLap-method (Andersen et al. 2013)

Explore other methods (Brenner 2010, Roewer 2000, …)

How to evaluate this kind of evidence?

The Likelihood Ratio

E is the evidence to be evaluated

B is the background information

Hp: the suspect left the stain

Hd: someone else left the stain
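For reference, the likelihood ratio for these quantities is usually written in the following standard form. The notation is an assumption based on the definitions of E, B, Hp and Hd above, not a formula taken from the slides:

```latex
\mathrm{LR} \;=\; \frac{\Pr(E \mid H_p, B)}{\Pr(E \mid H_d, B)}
```

A value above 1 supports Hp over Hd, and a value below 1 supports Hd over Hp.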

Many possible choices

THE likelihood ratio does not exist

Typical choice

• E= the particular haplotype of the suspect and of the crime stain

• B=the list of haplotypes in the database

e.g. Discrete Laplace Method

This frequency is not known. It can only be estimated

Uncertainty (e.g. DiscLap method)

A different choice

• E = the number of times the haplotype of the suspect (hs) and the haplotype of the crime stain (hc) occur in the database, and whether or not they are the same haplotype.

• B = the frequencies of frequencies of the database.

Ignore information about the particular haplotype

• D database

Gotham City, 12,13,30,24,10,11,13

Gotham City, 12,13,30,24,10,11,14

Gotham City, 13,12,30,24,10,11,13

Gotham City, 13,13,29,23,10,11,13

Gotham City, 13,13,29,24,10,11,14

Gotham City, 13,13,29,24,11,13,13

Gotham City, 13,13,29,24,11,13,13

Gotham City, 13,13,30,24,10,11,13

Gotham City, 13,13,30,24,10,11,13

Gotham City, 13,13,30,24,10,11,13

Gotham City, 13,13,30,24,10,11,13

D’ database count

Gotham City, 12,13,30,24,10,11,13 1
Gotham City, 12,13,30,24,10,11,14 1
Gotham City, 13,12,30,24,10,11,13 1
Gotham City, 13,13,29,23,10,11,13 1
Gotham City, 13,13,29,24,10,11,14 1
Gotham City, 13,13,29,24,11,13,13 2
Gotham City, 13,13,30,24,10,11,13 4

The frequencies of frequencies

N1 5

N2 1

N3 0

N4 1

Df frequencies of frequencies

Information is discarded

N1 is the number of haplotypes which occur once in D (singletons)

N2 is the number of haplotypes which occur twice in D (doublets), etc.

A database D of size N

Gotham City, 12,13,30,24,10,11,13

Gotham City, 12,13,30,24,10,11,14

Gotham City, 13,12,30,24,10,11,13

Gotham City, 13,13,29,23,10,11,13

Gotham City, 13,13,29,24,10,11,14

Gotham City, 13,13,29,24,11,13,13

Gotham City, 13,13,29,24,11,13,13

Gotham City, 13,13,30,24,10,11,13

Gotham City, 13,13,30,24,10,11,13

Gotham City, 13,13,30,24,10,11,13

Gotham City, 13,13,30,24,10,11,13

can be considered as an i.i.d. sample (Y1, Y2, …, YN) from species {1, 2, …, s} with probabilities (p1, p2, …, ps).

The database count

Gotham City, 12,13,30,24,10,11,13 1

Gotham City, 12,13,30,24,10,11,14 1

Gotham City, 13,12,30,24,10,11,13 1

Gotham City, 13,13,29,23,10,11,13 1

Gotham City, 13,13,29,24,10,11,14 1

Gotham City, 13,13,29,24,11,13,13 2

Gotham City, 13,13,30,24,10,11,13 4

is a realization of the random vector (X1, X2, …, Xs), defined by Xj = #{i | Yi = j}.

The frequencies of frequencies

are given by (N1, N2, …), where Nj = #{i | Xi = j}.

N1 5

N2 1

N3 0

N4 1
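The two reductions above (database to counts Xj, counts to frequencies of frequencies Nj) can be sketched in a few lines of Python. This is a minimal illustration on the Gotham City toy database; the function name `frequencies_of_frequencies` is hypothetical and chosen only for this example.

```python
from collections import Counter

# The toy database D from the slides: one Y-STR haplotype per row.
D = [
    "12,13,30,24,10,11,13",
    "12,13,30,24,10,11,14",
    "13,12,30,24,10,11,13",
    "13,13,29,23,10,11,13",
    "13,13,29,24,10,11,14",
    "13,13,29,24,11,13,13",
    "13,13,29,24,11,13,13",
    "13,13,30,24,10,11,13",
    "13,13,30,24,10,11,13",
    "13,13,30,24,10,11,13",
    "13,13,30,24,10,11,13",
]


def frequencies_of_frequencies(database):
    """Return (counts, fof): counts[h] is X_h, fof[j] is N_j."""
    counts = Counter(database)        # how often each haplotype occurs
    fof = Counter(counts.values())    # how many haplotypes occur j times
    return counts, fof


counts, fof = frequencies_of_frequencies(D)
for j in range(1, max(fof) + 1):
    # A Counter returns 0 for a missing key, so N3 prints as 0.
    print(f"N{j} {fof[j]}")
```

Note that `fof` no longer says which haplotype is the singleton or the doublet: this is exactly the information that is discarded when B is taken to be the frequencies of frequencies.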

• E = the number of times the haplotype of the suspect (hs) and the haplotype of the crime stain (hc) occur in the database, and whether or not they are the same haplotype.

• B = the frequencies of frequencies of the database (Df)

unbiased estimator for the numerator

unbiased estimator for the denominator

It is more sensible to estimate instead of .

is approximately unbiased for .

This suggests using

as an estimator for
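To make the idea of a Good-type estimator concrete, here is a hedged sketch. It assumes the classical Good (1953) style forms: N1/N for the probability that a further observation is of a type not yet seen, and 2·N2/(N(N−1)) for the probability that two further observations share the same unseen type. These forms, the function name `gg_log10_lr` and its parameters are assumptions made for illustration; they are not necessarily the exact estimators used by the authors.

```python
import math


def gg_log10_lr(n1, n2, n):
    """Good-type estimate of log10(LR) from singletons n1, doublets n2, size n.

    Assumed forms (illustrative, not confirmed by the slides):
      numerator   ~ n1 / n                  (suspect's type is new)
      denominator ~ 2 * n2 / (n * (n - 1))  (two draws share one new type)
    """
    if n < 2 or n1 <= 0 or n2 <= 0:
        raise ValueError("need n >= 2 and at least one singleton and one doublet")
    numerator_hat = n1 / n
    denominator_hat = 2 * n2 / (n * (n - 1))
    return math.log10(numerator_hat / denominator_hat)


# Gotham City toy database: N = 11, N1 = 5, N2 = 1.
print(round(gg_log10_lr(n1=5, n2=1, n=11), 3))  # 1.398
```

The estimate depends on the database only through N, N1 and N2, which is why the frequencies of frequencies are sufficient background information for this choice of E.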

How well estimates the true (unknown) ?

Take a big database of size 12,727.

Treat it as the world population. Set C1 = 0, C2 = 0.

Then,

1. Sample a small database of size N = 100 + 1 + 1.

2. If the 101st type is new in the small database (not among the first 100), increase C1 = C1 + 1.

3. If the 101st type is new and also equal to the 102nd, increase C2 = C2 + 1.

4. Repeat steps 1-3 M = 10,000 times.

P1=C1/M, P2=C2/M,
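The Monte Carlo steps above can be sketched as follows. The real database of size 12,727 is not reproduced here, so the sketch runs on a clearly synthetic population; the function name `simulate_p1_p2`, the synthetic type counts, and the choice to sample without replacement are all assumptions made for illustration.

```python
import random


def simulate_p1_p2(population, n=100, m=10_000, seed=0):
    """Estimate P1 = C1/M and P2 = C2/M by repeated subsampling.

    population: list with one type label per individual.
    Assumption: each small database is drawn without replacement.
    """
    rng = random.Random(seed)
    c1 = c2 = 0
    for _ in range(m):
        sample = rng.sample(population, n + 2)   # N = 100 + 1 + 1
        small_db = set(sample[:n])               # the first 100 types
        t101, t102 = sample[n], sample[n + 1]
        if t101 not in small_db:                 # step 2: 101st is new
            c1 += 1
            if t101 == t102:                     # step 3: and equals 102nd
                c2 += 1
    return c1 / m, c2 / m


# Synthetic stand-in for the "world population": a few common types
# and a long tail of rare ones (made-up counts, not real data).
synthetic_population = [
    t for t in range(2000) for _ in range(max(1, 300 // (t + 1)))
]

p1, p2 = simulate_p1_p2(synthetic_population, n=100, m=2000, seed=1)
print("P1 =", p1, "P2 =", p2)
if p2 > 0:
    print("P1 / P2 =", p1 / p2)
```

Because P2 counts a much rarer event than P1, it needs many more replications to be estimated with comparable relative precision, which is one reason M is taken as large as 10,000.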

distribution of over many replications of small databases (size N=100) sampled from a bigger one (size N=12,727) which we pretend is the population.

And from which we obtain a value for 2.603:

We sample 1000 databases of size 100 from the big one, and for each we calculate the estimate :

Performance of the GG-method

We know .

We know .

How well estimates the true (unknown) ?

distribution over many replications of small databases (size N=100) and new haplotype sampled from a bigger one (size N=12,727).

For each database sampled, the true frequency of the new haplotype h is taken equal to its frequency in the big database.

The estimated frequency is calculated using the Discrete Laplace method with default options (iterations, init_y …).

We calculate the distribution of and for each

database and new haplotype sampled.

Performance of the DiscLap-method

Comparing the distribution of

[Figure: log10(Ratio_Andersen) plotted against Index, 0 to 1000]

Comparing the errors of the two methods

DiscLap-method GG-method

[Figure: log10(Ratio_Gill) plotted against Index, 0 to 1000]

[Figure: log10(Ratio_Andersen) for the DiscLap-method and log10(Ratio_Gill) for the GG-method, both on a vertical axis from −1 to 6]

Remarks

Two more levels of uncertainty:

• whether or not the model M that we are assuming for Pr is “correct enough”

• whether or not parameters of Pr in the model M are “correct enough”

Basic uncertainty:

• whether or not the trace comes from the suspect

Maybe DiscLap was never intended to be used for such small databases.

Maybe DiscLap does better for our purpose when used in cleverer ways, targeted to that purpose.

The error in the DiscLap method comes from two levels of uncertainty:
• Population vs DiscLap
• Parameter estimation (within DiscLap)

The GG is a “model-free” method which thus has only one level of uncertainty.

Conclusions

• The situation is more complex than it appears.

• Using more information can give a less accurate LR.

• Assuming less gives a more reliable LR.

You want to discuss? Know more? Collaborate? Give suggestions?

You are welcome!

[email protected]
