The fundamental problem of Forensic Statistics
How to assess the evidential value of a rare type match
Giulia Cereda, Université de Lausanne
Richard D. Gill, University of Leiden
The problem
• A crime
• A piece of evidence found at the crime scene (DNA, fingerprint, footprint, handwriting, etc.)
• A suspect (identified independently)
• A match between the suspect’s characteristic and the evidence’s characteristic
• A database which counts the frequency of each characteristic
• The database frequency of the crime (and the suspect) characteristic is 0
Example
• A DNA stain is found on the victim’s body.
• Y-STR profile of type h.
• A suspect is identified, who is also of Y-STR type h.
• The Y-STR reference database does not contain type h.
Small databases
Generalized-Good. Non parametric Good-type estimator based on Good (1953).
DiscLap-method (Andersen et al. 2013)
Explore other methods (Brenner 2010, Roewer 2000, …)
How to evaluate this kind of evidence?
The Likelihood Ratio
E is the evidence to be evaluated
B is the background information
Hp: the suspect left the stain
Hd: someone else left the stain
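In its usual form, the likelihood ratio compares the probability of the evidence under the two hypotheses, given the background information. Written with the symbols defined above (the notation below is the standard one and is assumed here, not copied from the slide):

```latex
\[
\mathrm{LR} \;=\; \frac{\Pr(E \mid H_p, B)}{\Pr(E \mid H_d, B)}
\]
```

Different choices of E and B therefore lead to different likelihood ratios for the same case.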
Many possible choices
THE likelihood ratio does not exist
Typical choice
• E = the particular haplotype of the suspect and of the crime stain
• B = the list of haplotypes in the database
e.g. Discrete Laplace Method
This frequency is not known. It can only be estimated
Uncertainty
e.g. DiscLap method
A different choice
• E = the number of times the haplotype of the suspect (hs) and the haplotype of the crime stain (hc) occur in the database, and whether or not they are the same haplotype.
• B = the frequencies of frequencies of the database.
Ignore information about the particular haplotype
• D database
Gotham City, 12,13,30,24,10,11,13
Gotham City, 12,13,30,24,10,11,14
Gotham City, 13,12,30,24,10,11,13
Gotham City, 13,13,29,23,10,11,13
Gotham City, 13,13,29,24,10,11,14
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
D’ database count
Gotham City, 12,13,30,24,10,11,13 1
Gotham City, 12,13,30,24,10,11,14 1
Gotham City, 13,12,30,24,10,11,13 1
Gotham City, 13,13,29,23,10,11,13 1
Gotham City, 13,13,29,24,10,11,14 1
Gotham City, 13,13,29,24,11,13,13 2
Gotham City, 13,13,30,24,10,11,13 4
The frequencies of frequencies
N1 5
N2 1
N3 0
N4 1
Df frequencies of frequencies
Information is discarded
N1 is the number of haplotypes which occur once in D (singletons)
N2 is the number of duplets (haplotypes which occur twice in D)
Etc.
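The reduction from D to the database count D’ and then to the frequencies of frequencies Df can be sketched in a few lines of Python. This is only an illustration on the toy Gotham City database; the names `D`, `database_count` and `frequencies_of_frequencies` are chosen here and are not part of the talk.

```python
from collections import Counter

# The 11 Y-STR haplotypes of the toy "Gotham City" database D.
D = [
    (12, 13, 30, 24, 10, 11, 13),
    (12, 13, 30, 24, 10, 11, 14),
    (13, 12, 30, 24, 10, 11, 13),
    (13, 13, 29, 23, 10, 11, 13),
    (13, 13, 29, 24, 10, 11, 14),
    (13, 13, 29, 24, 11, 13, 13),
    (13, 13, 29, 24, 11, 13, 13),
    (13, 13, 30, 24, 10, 11, 13),
    (13, 13, 30, 24, 10, 11, 13),
    (13, 13, 30, 24, 10, 11, 13),
    (13, 13, 30, 24, 10, 11, 13),
]


def database_count(db):
    """D': how many times each distinct haplotype occurs in the database."""
    return Counter(db)


def frequencies_of_frequencies(db):
    """Df: N_j = number of distinct haplotypes that occur exactly j times."""
    return Counter(database_count(db).values())


counts = database_count(D)
freq_of_freq = frequencies_of_frequencies(D)

# Print N1, N2, ... up to the largest observed count (missing j gives 0).
for j in range(1, max(freq_of_freq) + 1):
    print(f"N{j} {freq_of_freq[j]}")
```

Going from D’ to Df forgets which haplotype has which count, which is exactly the information that is discarded.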
A database D of size N
Gotham City, 12,13,30,24,10,11,13
Gotham City, 12,13,30,24,10,11,14
Gotham City, 13,12,30,24,10,11,13
Gotham City, 13,13,29,23,10,11,13
Gotham City, 13,13,29,24,10,11,14
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,29,24,11,13,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
Gotham City, 13,13,30,24,10,11,13
can be considered as an i.i.d. sample (Y1, Y2, …, YN) from species {1, 2, …, s} with
probabilities (p1, p2, …, ps).
The database count
Gotham City, 12,13,30,24,10,11,13 1
Gotham City, 12,13,30,24,10,11,14 1
Gotham City, 13,12,30,24,10,11,13 1
Gotham City, 13,13,29,23,10,11,13 1
Gotham City, 13,13,29,24,10,11,14 1
Gotham City, 13,13,29,24,11,13,13 2
Gotham City, 13,13,30,24,10,11,13 4
is a realization of the r.v. (X1, X2, …, Xs),
defined by Xj = #{i | Yi = j}.
The frequencies of frequencies
is made of (N1, N2, …), where Nj = #{i | Xi = j}
N1 5
N2 1
N3 0
N4 1
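As a point of reference for the Good-type estimators mentioned earlier: the classical Good (1953) estimate of the total probability of all types not yet seen in the database is N1/N, the share of singletons. The sketch below computes that classical quantity only; it is not the Generalized-Good LR estimator of this talk, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def good_unseen_probability(database):
    """Classical Good (1953) estimate N1/N of the probability that the
    next observation is a type absent from the database."""
    n = len(database)
    if n == 0:
        raise ValueError("database must not be empty")
    type_counts = Counter(database)
    # N1 = number of types observed exactly once (singletons).
    n1 = sum(1 for c in type_counts.values() if c == 1)
    return Fraction(n1, n)


# Toy labels with the same frequencies of frequencies as the example:
# five singletons, one type seen twice, one type seen four times (N = 11).
toy = ["a", "b", "c", "d", "e", "f", "f", "g", "g", "g", "g"]
print(good_unseen_probability(toy))  # 5/11
```

The estimate depends on the database only through N1 and N, so it uses no information about the particular haplotypes.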
• E = the number of times the haplotype of the suspect (hs) and the haplotype of the crime stain (hc) occur in the database, and whether or not they are the same haplotype.
• B = the frequencies of frequencies of the database (Df)
unbiased estimator for the numerator
unbiased estimator for the denominator
It is more sensible to estimate […] instead of […].
[…] is approximately unbiased for […].
This suggests using
[…]
as an estimator for […].
How well does […] estimate the true (unknown) […]?
Take a big database of size 12,727.
Consider it as the world population. C1=0, C2=0.
Then,
1. Sample a small database of size N = 100 + 1 + 1.
2. If the 101st type is new with respect to the first 100, increase C1: C1 = C1 + 1.
3. If the 101st is a new type and is equal to the 102nd, increase C2: C2 = C2 + 1.
4. Repeat steps 1-3 M = 10,000 times.
P1 = C1/M, P2 = C2/M.
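The procedure above can be sketched as a small Monte Carlo routine. Two assumptions are made here that the slides do not state: the small database is drawn without replacement from the big one, and the first 100 draws play the role of the database. The function name `estimate_p1_p2` and the synthetic population are hypothetical; the real database of size 12,727 is not reproduced.

```python
import random


def estimate_p1_p2(population, n=100, m=10_000, seed=0):
    """Monte Carlo estimates of
    P1 = Pr(the (n+1)th type is absent from the first n) and
    P2 = Pr(the (n+1)th type is absent from the first n and equals the (n+2)th).
    `population` is a list with one type label per individual."""
    rng = random.Random(seed)
    c1 = c2 = 0
    for _ in range(m):
        # Step 1: draw n + 1 + 1 individuals (assumed: without replacement).
        sample = rng.sample(population, n + 2)
        database = set(sample[:n])
        y_next, y_last = sample[n], sample[n + 1]
        # Step 2: the (n+1)th type is new to the small database.
        if y_next not in database:
            c1 += 1
            # Step 3: it is new and matches the (n+2)th type.
            if y_last == y_next:
                c2 += 1
    return c1 / m, c2 / m


# Synthetic stand-in population (hypothetical type labels, not real data).
population = [i % 3000 for i in range(12_727)]
p1, p2 = estimate_p1_p2(population, n=100, m=2_000)
print(p1, p2)
```

Because the big database is treated as the population, P1 and P2 obtained this way serve as the "true" values against which estimates from small databases are judged.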
Performance of the GG-method
We know […].
We know […].
From these we obtain a value for […]: 2.603.
We sample 1000 databases of size 100 from the big one, and for each we calculate the estimate […].
[Figure: distribution of […] over many replications of small databases (size N=100) sampled from a bigger one (size N=12,727), which we pretend is the population.]
Performance of the DiscLap-method
How well does […] estimate the true (unknown) […]?
Distribution over many replications of small databases (size N=100) and a new haplotype, sampled from a bigger one (size N=12,727).
For each database sampled, the true frequency of the new haplotype h is taken to be its frequency in the big database.
The estimated frequency is calculated using the Discrete Laplace method with default options (iterations, init_y …).
We calculate the distribution of […] and […] for each database and new haplotype sampled.
Comparing the distribution of […]
[Figure: x-axis Index (0 to 1000), y-axis log10(Ratio_Andersen) (0 to 6)]
Comparing the errors of the two methods
DiscLap-method GG-method
[Figure: x-axis Index (0 to 1000), y-axis log10(Ratio_Gill) (0 to 6)]
[Figure: y-axes log10(Ratio_Andersen) and log10(Ratio_Gill) (−1 to 6)]
Comparing the errors of the two methods
DiscLap-method GG-method
Remarks
Two more levels of uncertainty:
• whether or not the model M that we are assuming for Pr is “correct enough”
• whether or not parameters of Pr in the model M are “correct enough”
Basic uncertainty:
• whether or not the trace comes from the suspect
Maybe DiscLap was never intended to be used for such small databases.
Maybe DiscLap does better when used in cleverer ways, targeted at our purpose.
The error in the DiscLap method comes from two levels of uncertainty:
• Population vs DiscLap
• Parameter estimation (within DiscLap)
The GG is a “model-free” method which thus has only one level of uncertainty.
Conclusions
• The situation is more complex than it appears.
• Using more information gives a less accurate LR.
• Assuming less gives a more reliable LR.
References