Audio Music Similarity and Retrieval: Evaluation Power and Stability
ISMIR 2011 · Miami, USA · October 26th
Picture by Michael Shane
Audio Music Similarity and Retrieval:
Evaluation Power and Stability
Julián Urbano @julian_urbano
Diego Martín, Mónica Marrero and Jorge Morato
University Carlos III of Madrid
AMS
retrieve audio clips
musically similar
to a query clip
grand results (MIREX 2009)
I won!
but the difference is not significant…
yeah, it’s not significant!
did you hear?
damn it!
don’t worry about it
shut up… we are!
oh, come on! it’s so close!
Picture by Sara A. Beyer
what does it mean?
proper interpretation of p-values
H0: mean score of system A = mean score of B
H1: mean scores are different
a statistical test returns p<0.01, so we conclude A >> B
it means that if we assume H0 and repeat the experiment, there is a <0.01 probability of having this result again*
*or one even more extreme
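The "assume H0 and repeat the experiment" reading can be made concrete with a paired randomization (sign-flip) test. Under H0 the labels A and B are interchangeable within each query, so flipping the sign of each per-query difference at random simulates repeating the experiment, and the p-value is the share of repetitions at least as extreme as the observed result. This is a minimal sketch with made-up scores; it is not the test used in MIREX, and the function name and data are illustrative assumptions.

```python
import random

def sign_flip_p_value(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for H0: mean score of A = mean score of B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        # Under H0 the labels A and B are exchangeable within each query.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    return extreme / trials

# Hypothetical per-query scores for two systems (illustrative only).
a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51, 0.64, 0.69]
b = [0.50, 0.47, 0.60, 0.45, 0.52, 0.55, 0.61, 0.49, 0.50, 0.58]
p = sign_flip_p_value(a, b)
print(f"p = {p:.4f}")
```

Because A beats B on every one of the ten hypothetical queries, only a tiny share of the random sign patterns is as extreme as the observed one, so the estimated p-value is small.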
MIREX 2010
system A is better than B, but it’s
not statistically significant
we can expect anything
with a different collection
this evaluation is not powerful
MIREX 2009
conclusions about general behavior
A ? B
A > B
A is better than B, and it’s
statistically significant
A >> B
we expect the same:
A is significantly better than B
A >> B
this one is powerful…
…and stable
but these could also happen:
A > B or A < B or A << B
lack of power in MIREX 2010 (A > B)
minor stability conflict (A < B)
major stability conflict (A << B)
it’s all about reliability
on the shoulders of giants
Isaac Newton
Text REtrieval Conference
1% to 14% of comparisons show stability conflicts
~25% differences to ensure <5% conflicts with 50 queries
[Buckley and Voorhees, 2000]
depends on the measure used
no significance testing
improved reliability with pairwise t-tests
virtually no conflicts if >10% differences with significance
[Sanderson and Zobel, 2005]
others were not as good
with many queries, even significance is unreliable
[Voorhees, 2009]
major review: other collections and more recent measures
some measures are much better than others
[Sakai, 2007]
sensitivity
effort
does not mean they should not be used!
Music Similarity and Retrieval
alternative forms of ground truth for SMS
reliable and comprehensive but too expensive
[Typke et al., 2005][Urbano et al., 2010]
more about this in 30 mins
no prefixed relevance scale
specific measure for the task
[Typke et al., 2006]
agreement between judgments by different people
despite high agreement, evaluation does change…
propose to use more queries
[Jones et al., 2007]
cheaper judgments via crowdsourcing seem reliable
[Urbano et al., 2010][Lee, 2010]
many other things
[Urbano, 2011]
it’s actually about the
effort-reliability tradeoff
task
# of queries
relevance judgments
measures
statistical methods
# of systems
system similarity
Picture by Wessex Archaeology
measures
&
judgments
how much information does the user gain?
results as a set
AG@5: Average Gain in the top 5 documents
results as a list
NDCG@5: Normalized Discounted Cumulated Gain
ANDCG@5: Average NDCG across ranks
ADR@5: Average Dynamic Recall
measure used in MIREX (with different name)
more realistic user model
best documents first, and the lower the rank the lower the gain*
*details in the paper
how much information does a result provide?
BROAD relevance judgments
not similar = 0
somewhat similar = 1
very similar = 2
FINE relevance judgments
real-valued, from 0 to 10 or 100
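To make the set-based and list-based views concrete, here is a minimal sketch of AG@5 and NDCG@5 computed from graded judgments. The slides defer the exact definitions to the paper, so the log2 rank discount and the way the ideal ranking is built from a pool of judged clips are assumptions, and the judgments are made up.

```python
import math

def ag_at_k(gains, k=5):
    """Average gain of the top-k results (results treated as a set)."""
    return sum(gains[:k]) / k

def dcg_at_k(gains, k=5):
    # Assumed log2 rank discount: the lower the rank, the lower the gain.
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, judged_pool, k=5):
    """DCG of the ranking divided by the DCG of the best possible ranking."""
    ideal = dcg_at_k(sorted(judged_pool, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical BROAD judgments (0, 1, 2) for one query.
ranking = [2, 0, 1, 2, 0]         # gains of the 5 retrieved clips, in rank order
pool = [2, 2, 2, 1, 1, 0, 0, 0]   # gains of all judged clips for that query
print(ag_at_k(ranking))
print(ndcg_at_k(ranking, pool))
```

AG@5 ignores the order of the five clips, while NDCG@5 rewards placing the very similar clips first, which is the difference between the "set" and "list" user models.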
look at MIREX 2009
largest evaluation until 2011
Picture by Roger Green
power
% of pairwise comparisons that are significant
what's the effect of:
number of queries
relevance judgments
effectiveness measures
all 100 queries set
random sample
5 query subset
evaluation
Broad judgments
Fine judgments
[plots: % significant vs. # queries, one panel for Broad judgments and one for Fine judgments]
repeat 500 times for 5 query subsets to minimize random effects
52,500 system comparisons
repeat another 500 times for 10 query subsets
stratified random sampling with equal priors
balanced across 10 genres:
baroque
blues
classical
country
edance
jazz
metal
rap-hiphop
rock&roll
romantic
all 100 query subset
we simulate possible
evaluation scenarios
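The sampling procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the collection is assumed to hold 10 queries per genre, subsets are spread as evenly as possible over the genres, and `is_significant` is a hypothetical callback standing in for whatever statistical test is applied to a pair of systems.

```python
import random
from itertools import combinations

GENRES = ["baroque", "blues", "classical", "country", "edance",
          "jazz", "metal", "rap-hiphop", "rock&roll", "romantic"]

def stratified_subset(queries_by_genre, size, rng):
    """Draw `size` queries spread as evenly as possible over the genres."""
    genres = list(queries_by_genre)
    base, extra = divmod(size, len(genres))
    lucky = set(rng.sample(genres, extra))  # genres that get one more query
    subset = []
    for genre in genres:
        n = base + (1 if genre in lucky else 0)
        subset.extend(rng.sample(queries_by_genre[genre], n))
    return subset

def percent_significant(systems, subset, is_significant):
    """Share of system pairs whose difference is significant on `subset`."""
    pairs = list(combinations(systems, 2))
    hits = sum(is_significant(a, b, subset) for a, b in pairs)
    return 100.0 * hits / len(pairs)

rng = random.Random(2009)
# Hypothetical collection: 10 queries per genre, 100 in total.
queries_by_genre = {g: [f"{g}-{i}" for i in range(10)] for g in GENRES}
subset = stratified_subset(queries_by_genre, 5, rng)

systems = [f"sys{i}" for i in range(15)]
pairs_per_trial = len(list(combinations(systems, 2)))
print(len(subset), pairs_per_trial, pairs_per_trial * 500)
```

With 15 systems there are 105 pairs per trial, so 500 trials for one subset size give the 52,500 system comparisons quoted on the slide.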
power results (larger is better)
power in MIREX 2009
[plots: % significant comparisons (46 to 64) vs. query set size (40 to 100) for AG, NDCG, ANDCG and ADR; one panel for Broad judgments, one for Fine judgments]
similar logarithmic trend, except for ADR with Fine judgments (expected)
same power with 70% effort!
only 2 significant pairs missed with 70% effort
(probably unstable)
merely using more queries
does not pay off when looking for power
Picture by Dave Hunt
stability
% of pairwise comparisons that are conflicting
what's the effect of:
number of queries
relevance judgments
effectiveness measures
all 100 queries set
baroque
blues
classical
country
edance
jazz
metal
rap-hiphop
rock&roll
romantic
independent samples
5 query subset
evaluation
5 query subset
evaluation
Broad judgments
Fine judgments
[plots: % conflicting vs. # queries, one panel for Broad judgments and one for Fine judgments]
52,500 cross-collection system comparisons
repeat 500 times to minimize random effects
with 100 total queries we can’t go beyond 50
50 query subset
50 query subset
we simulate comparisons
across possible collections
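A cross-collection comparison can be classified with a few lines of Python. This sketch follows the A > B, A < B and A << B cases named earlier in the talk; the function name, the encoding of an outcome as a mean difference plus a significance flag, and the example values are illustrative assumptions.

```python
from collections import Counter

def classify(diff_1, sig_1, diff_2, sig_2):
    """Compare the outcome for one system pair in two independent collections.

    diff_* is mean(A) - mean(B); sig_* says whether the test was significant.
    """
    if not sig_1 and not sig_2:
        return "no conclusion"      # nothing was claimed in either collection
    same_sign = (diff_1 > 0) == (diff_2 > 0)
    if sig_1 and sig_2:
        return "stable" if same_sign else "major conflict"
    return "lack of power" if same_sign else "minor conflict"

# Hypothetical outcomes: (diff in collection 1, significant?, diff in 2, significant?)
outcomes = [
    (0.10, True, 0.08, True),     # A >> B in both collections
    (0.10, True, 0.02, False),    # A >> B, then only A > B
    (0.10, True, -0.01, False),   # A >> B, then A < B
    (0.10, True, -0.09, True),    # A >> B, then A << B
    (0.01, False, -0.02, False),  # not significant in either collection
]
labels = [classify(*o) for o in outcomes]
print(Counter(labels))
```

Only the last three labels describe disagreement about which system is better or how surely; "lack of power" means both collections agree on the sign and one of them simply fails to reach significance.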
stability results (lower is better)
stability in MIREX 2009
[plots: % conflicting comparisons (2 to 22) vs. query subset size (5 to 50) for AG, NDCG, ANDCG and ADR; one panel for Broad judgments, one for Fine judgments]
lack of power in one collection but not in the other
ADR takes longer to converge
converge to <5% for >40 queries (consistent with α=0.05)
merely using more queries
does not pay off when looking for stability
type of conflicts (50 queries)
judgments  measure  conflicts  A>B (power)  A<B (minor)  A<<B (major)
Broad      AG       3.36%      100%         0%           0%
Broad      NDCG     3.77%      99.90%       0.10%        0%
Broad      ANDCG    4.73%      99.96%       0.04%        0%
Broad      ADR      9.03%      99.94%       0.06%        0%
Fine       AG       2.64%      99.86%       0.14%        0%
Fine       NDCG     2.94%      99.74%       0.26%        0%
Fine       ANDCG    4.03%      99.91%       0.09%        0%
Fine       ADR      19.08%     99.50%       0.50%        0%
virtually all conflicts due to lack of power in one collection
no major conflict whatsoever
if significance shows up
it most probably is correct
are we being too conservative?
Milton Friedman
statistics
John Tukey
Frank Wilcoxon
compare two systems
is the difference significant?
t-test, Wilcoxon test, sign test, etc.
they make different assumptions
significance level α
probability of Type I error
(finding a significant difference when there is none)
usually, α=0.05 or α=0.01
5% or 1% of my significant results are just wrong
stability conflict
compare several systems
15 systems = 105 comparisons
experiment-wide significance level = 1-(1-α)^105 = 0.995 (for α=0.05)
we can expect at least one significant comparison to be wrong
instead, compare all systems at once
ANOVA, Friedman test, Kruskal-Wallis, etc.
correct p-values to keep experiment-wide significance level <0.05
Tukey’s HSD, Bonferroni, Scheffe, Duncan, Newman-Keuls, etc.
used in MIREX (with different assumptions)
MIREX 2009
more stability
at the cost of
less power
is it worth it?
what a MIREX participant wants
compare my system with the other 14
comparisons between those 14 are uninteresting
subexperiment: only 14 pairwise comparisons, not 105
get back the power missed by considering the other 91
compare all systems with 1-tailed Wilcoxon tests at α=0.01
experiment-wide significance level = 1-(1-0.01)^105 = 0.652
subexperiment-wide significance level = 1-(1-0.01)^14 = 0.131
should throw out more conflicts too
number of comparisons grows linearly with number of systems
subexperiment-wide significance level = 1-(1-α)^14 = 0.512 (for α=0.05)
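The experiment-wide levels quoted on these slides can be reproduced in a few lines. The formula 1-(1-α)^m is the probability of at least one Type I error across m tests when the tests are treated as independent, which is an assumption of the formula itself; the function and variable names below are illustrative.

```python
from math import comb

def experiment_wide_level(alpha, comparisons):
    """Probability of at least one Type I error in `comparisons` tests,
    assuming the tests are independent."""
    return 1 - (1 - alpha) ** comparisons

systems = 15
all_pairs = comb(systems, 2)   # 105 pairwise comparisons
mine_only = systems - 1        # 14 comparisons that involve my system

for alpha, m in [(0.05, all_pairs), (0.05, mine_only),
                 (0.01, all_pairs), (0.01, mine_only)]:
    level = experiment_wide_level(alpha, m)
    print(f"alpha={alpha}, comparisons={m}: {level:.3f}")
```

The four printed values correspond to the 0.995, 0.512, 0.652 and 0.131 figures on the slides. Restricting attention to the 14 comparisons a participant cares about, and lowering α to 0.01, brings the level down from near certainty to about 13%.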
power results (larger is better)
[plots: % significant comparisons (46 to 64) vs. query set size (40 to 100) for AG, NDCG, ANDCG and ADR; one panel for Broad judgments, one for Fine judgments]
Friedman+Tukey (as in MIREX)
all 1-tailed Wilcoxon comparisons are up to 20% more powerful than Friedman+Tukey
same power with 50% effort!
stability results (lower is better)
[plots: % conflicting comparisons (2 to 22) vs. query subset size (5 to 50) for AG, NDCG, ANDCG and ADR; one panel for Broad judgments, one for Fine judgments]
earlier convergence because of increased power
AG converges again to 3-4%
(A)NDCG converge to 5-6%
type of conflicts (50 queries)
judgments  measure  conflicts  A>B (power)  A<B (minor)  A<<B (major)
Broad      AG       3.68%      96.32%       3.68%        0%
Broad      NDCG     5.05%      96.82%       3.18%        0%
Broad      ANDCG    6.08%      96.84%       3.13%        0.03%
Broad      ADR      5.93%      95.12%       4.88%        0%
Fine       AG       3.32%      98.34%       1.66%        0%
Fine       NDCG     6.58%      96.61%       3.39%        0%
Fine       ANDCG    6.44%      94.94%       5.06%        0%
Fine       ADR      12.48%     90.58%       9.37%        0.05%
again, due to lack of power in one collection
no major conflicts
within known Type III error rates
effort-reliability tradeoff
                    Friedman+Tukey with 100 queries    1-tailed Wilcoxon with 50 queries
judgments  measure  power - conflicts = stable         power - conflicts = stable
Broad      AG       57.14% - 3.64% = 53.50%            55.10% - 3.68% = 51.42%
Broad      NDCG     57.14% - 4.08% = 53.06%            57.01% - 5.05% = 51.96%
Broad      ANDCG    57.14% - 4.19% = 52.95%            57.37% - 6.08% = 51.29%
Broad      ADR      56.19% - 7.13% = 49.06%            57.30% - 5.93% = 51.37%
Fine       AG       54.29% - 3.20% = 51.09%            54.31% - 3.32% = 50.99%
Fine       NDCG     56.19% - 3.04% = 53.15%            57.56% - 6.58% = 50.98%
Fine       ANDCG    56.19% - 2.96% = 53.23%            57.38% - 6.44% = 50.94%
Fine       ADR      56.19% - 19.97% = 36.22%           55.03% - 12.48% = 42.55%
virtually same reliability with half the effort!
Friedman-Tukey requires
too much effort
my point?
Do not attempt to accomplish greater results
by a greater effort of your little understanding,
but by a greater understanding of your little effort.
Walter Russell
if significance shows up it most probably is true
at worst, conflicts are due to lack of power
using more and more queries is pointless
too much effort for the small gain in power and stability
using different similarity scales has little effect
using only one is probably just fine
some effectiveness measures are better than others
they should still be used: they measure different things
but bear in mind their power and stability
some statistical methods are better than others
virtually same reliability with half the effort
Picture by Ronny Welter
reduce the judging effort
more queries in Symbolic Melodic Similarity
reliable low-cost in-house evaluations and Crowdsourcing
other collections, tasks and measures
deeper evaluation cutoffs
not just the top 5 documents: pay attention to ranking
probably more reliable, and certainly more reusable
effect of the number of systems
especially if developed by the same research group
forget about power and worry about effect-size
eventually, significance becomes meaningless
other statistical methods
Multiple Comparisons with a Control (baseline)
guide experimenters in
the interpretation of the results and the
tradeoff between
effort and reliability