Designing Test Collections for Comparing Many Systems

CIKM 2014 slides by Tetsuya Sakai

Transcript of Designing Test Collections for Comparing Many Systems

Page 1: Designing Test Collections for Comparing Many Systems

Designing Test Collections for Comparing Many Systems

Tetsuya Sakai, Waseda University, Japan

@tetsuyasakai

November 4, 2014 @ CIKM 2014, Shanghai

Page 2: Designing Test Collections for Comparing Many Systems

Acknowledgement

This research is part of Waseda University's project "Taxonomising and Evaluating Web Search Engine User Behaviours," supported by Microsoft Research.

THANK YOU!

Page 3: Designing Test Collections for Comparing Many Systems

Takeaways (1)

• Using one‐way ANOVA‐based power analysis, researchers can determine the topic set size n by specifying:

α: Type I error probability
β: Type II error probability
minD: minimum detectable range (the performance difference between the best and the worst systems) for ensuring a statistical power of 1-β
m: number of systems to be compared
σ̂²: estimated variance of each system

• Different measures have different σ̂²s, so researchers should decide on the evaluation measure at the test collection design phase

Page 4: Designing Test Collections for Comparing Many Systems

Takeaways (2)

• Our method can provide different test collection designs (n, pd) that satisfy the same statistical requirement.

n: topic set size
pd: pool depth

• The assessment cost of a pd=100 test collection can be reduced to 18% or less while keeping it statistically equally reliable.

• Our method can be used to compare evaluation measures in terms of practical significance, i.e., judgement cost.

• Our tools and data are available at
http://www.f.waseda.jp/tetsuya/tools.html
http://www.f.waseda.jp/tetsuya/data.html

Page 5: Designing Test Collections for Comparing Many Systems

TALK OUTLINE

1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work

Page 6: Designing Test Collections for Comparing Many Systems

Test collections = standard data sets for evaluation

[Figure: systems are evaluated on Test collection A and Test collection B, each producing evaluation measure values]

Page 7: Designing Test Collections for Comparing Many Systems

An Information Retrieval (IR) test collection

[Figure: an IR test collection consists of a document collection, a topic set, and "qrels" (query relevance sets): for each topic, relevance assessments marking documents as relevant or nonrelevant]

Example topic: CIKM 2014 home page
cikm2014.fudan.edu.cn/: highly relevant
cikmconference.org/: partially relevant
www.cikm2013.org: nonrelevant
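For concreteness, qrels are conventionally distributed as plain-text files, one judged document per line. A hypothetical sketch in the usual TREC qrels layout (topic ID, an unused iteration field, document ID, relevance grade), using the example topic above; the grade encoding (2/1/0) is an assumption:

    1 0 cikm2014.fudan.edu.cn/ 2
    1 0 cikmconference.org/ 1
    1 0 www.cikm2013.org 0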

Page 8: Designing Test Collections for Comparing Many Systems

How IR people build test collections (1)

Organiser: "Okay, let's build a test collection…"

Page 9: Designing Test Collections for Comparing Many Systems

How IR people build test collections (2)

Organiser: "…with maybe n=50 topics (search requests)…"

"Well, n>25 sounds good for statistical significance testing, but why 50? Why not 100? Why not 30?"

[Figure: a stack of topic cards, Topic 1, Topic 2, …]

Page 10: Designing Test Collections for Comparing Many Systems

How IR people build test collections (3)

Organiser: "Okay folks, give me your runs (search results)!"

[Figure: the 50 topics are released, and participants submit their runs]

Page 11: Designing Test Collections for Comparing Many Systems

How IR people build test collections (4)

Organiser: "Pool depth pd=100 looks affordable…"

[Figure: for each of the 50 topics, the top pd=100 documents from each submitted run are merged into the pool for that topic]

The document collection is too large for exhaustive relevance assessments, so only the pooled documents are judged.

Page 12: Designing Test Collections for Comparing Many Systems

How IR people build test collections (5)

[Figure: for each of the 50 topics, the pooled documents (top pd=100 from each run) receive relevance assessments: highly relevant / partially relevant / nonrelevant]

Page 13: Designing Test Collections for Comparing Many Systems

An Information Retrieval (IR) test collection

[Figure: the test collection again: a document collection, a topic set, and qrels, with the example topic "CIKM 2014 home page" (cikm2014.fudan.edu.cn/: highly relevant; cikmconference.org/: partially relevant; www.cikm2013.org: nonrelevant)]

n=50 topics… why?

Pool depth pd=100 (not exhaustive)

Page 14: Designing Test Collections for Comparing Many Systems

TALK OUTLINE

1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work

Page 15: Designing Test Collections for Comparing Many Systems

NHST = null hypothesis significance testing (1)

EXAMPLE: paired t‐test for comparing systems X and Y with n topics

Assumptions: the per-topic score differences dj = xj - yj (j=1,…,n) are i.i.d. normal with mean μX - μY and variance σ².

Null hypothesis: H0: μX = μY (the population means are the same)

Test statistic: t0 = d̄ / √(Vd / n), where d̄ is the sample mean and Vd the sample variance of the differences

Page 16: Designing Test Collections for Comparing Many Systems

NHST = null hypothesis significance testing (2)

EXAMPLE: paired t‐test for comparing systems X and Y with n topics

Null hypothesis: H0: μX = μY

Test statistic: t0 = d̄ / √(Vd / n)

Under H0, t0 obeys a t distribution with n-1 degrees of freedom.
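As an illustration, a minimal Python sketch of this paired t-test, assuming the per-topic scores of X and Y sit in two equal-length arrays (the scores below are made up):

    import numpy as np
    from scipy import stats

    x = np.array([0.42, 0.35, 0.58, 0.61, 0.29])  # hypothetical per-topic scores, system X
    y = np.array([0.40, 0.30, 0.50, 0.65, 0.25])  # hypothetical per-topic scores, system Y

    # t0 from the definitions above: t0 = dbar / sqrt(Vd / n)
    d = x - y
    n = len(d)
    t0 = d.mean() / np.sqrt(d.var(ddof=1) / n)

    # The same statistic (plus a two-sided p-value) via scipy
    t0_scipy, p = stats.ttest_rel(x, y)
    print(t0, t0_scipy, p)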

Page 17: Designing Test Collections for Comparing Many Systems

NHST = null hypothesis significance testing (3)

EXAMPLE: paired t-test for comparing systems X and Y with n topics
Null hypothesis: H0: μX = μY
Under H0, t0 obeys a t distribution with n-1 degrees of freedom.

Given a significance criterion α (=0.05), reject H0 if |t0| >= t(n-1; α).

[Figure: t distribution with n-1 degrees of freedom (n=50); rejection regions beyond ±t(n-1; α)]

"H0 is probably not true, because the chance of observing t0 under H0 is very small."

Page 18: Designing Test Collections for Comparing Many Systems

NHST = null hypothesis significance testing (4)

EXAMPLE: paired t-test for comparing systems X and Y with n topics
Null hypothesis: H0: μX = μY
Given a significance criterion α (=0.05), reject H0 if |t0| >= t(n-1; α).

[Figure: two t-distribution plots (n=50). Left: t0 falls beyond the critical value, so H0 is rejected. Conclusion: X ≠ Y! Right: t0 falls inside the critical values, so H0 is not rejected, and we don't know.]

Page 19: Designing Test Collections for Comparing Many Systems

NHST is not good enough [Cumming12]

• Dichotomous thinking ("different or not different?"). A more important question is "what is the magnitude of the difference?" Another is "how accurate is my estimate?"
• p-values are a little more informative than "significant at α=0.05", but…

[Figure: t distribution (n=50); the p-value is the probability of observing t0 or something more extreme under H0]

Page 20: Designing Test Collections for Comparing Many Systems

The p‐value is not good enough either [Ellis10,Nagata03]

Reject H0 if |t0| >= t(n-1; α), where t0 = √n · d̄/√Vd.

But a large |t0| could mean two things:
(1) The sample effect size (ES) d̄/√Vd, i.e. the difference between X and Y measured in standard deviation units, is large;
(2) The topic set size n is large.

If you increase the sample size n, you can always achieve statistical significance!
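A quick numerical sketch of this point in Python (the effect size value is made up): with the sample ES held fixed, |t0| = √n · ES grows without bound while the critical value stays near 2, so a large enough n always crosses it:

    import numpy as np
    from scipy import stats

    es = 0.1  # hypothetical sample effect size, dbar / sqrt(Vd)
    for n in (10, 100, 1000, 10000):
        t0 = np.sqrt(n) * es                        # |t0| grows with sqrt(n)
        crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)  # two-sided critical value at alpha=0.05
        print(n, round(t0, 2), round(crit, 2), abs(t0) >= crit)

This prints False for n=10 and n=100, but True for n=1000 and n=10000.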

Page 21: Designing Test Collections for Comparing Many Systems

Statistical reform – effect sizes [Cumming12,Okubo12]

• ES: "how much difference is there?"
• The ES for the paired t-test measures the difference in standard deviation units:
  Population ES = (μX - μY) / σ
  Sample ES (an estimate of the above) = d̄ / √Vd

In several research disciplines, such as psychology and medicine, reporting ESs is required! In this study, we determine the topic set size n by ensuring high power 1-β whenever the ES is large.

Page 22: Designing Test Collections for Comparing Many Systems

Statistical reform – confidence intervals

• CIs are much more informative than NHST (point estimate + uncertainty/accuracy)
• Estimation thinking, not dichotomous thinking [Cumming12]

In several research disciplines, such as psychology and medicine, reporting CIs is required [Sakai14SIGIRforum]! See [Sakai14FIT] (Designing Test Collections that Provide Tight Confidence Intervals).

Page 23: Designing Test Collections for Comparing Many Systems

TALK OUTLINE

1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work

Page 24: Designing Test Collections for Comparing Many Systems

Overview of our approaches

• Using the paired t-test to determine n (for m=2 systems)
  INPUT: α, β, minDt (minimum detectable difference for ensuring power=1-β), σ̂t² (estimated variance of the performance difference)
• Using one-way ANOVA to determine n (for m>=2 systems)
  INPUT: α, β, m, minD (minimum detectable range for ensuring power=1-β), σ̂² (estimated variance of system performance)
• Methods for estimating σ̂t² and σ̂²
  INPUT: test collections with runs and an evaluation measure (topic-by-run matrices)

See Appendix

Page 25: Designing Test Collections for Comparing Many Systems

The ANOVA approach (1)

Assume x_ij = μ + a_i + e_ij, where the errors e_ij ~ N(0, σ²) are i.i.d. (homoscedasticity: equal variance), for i=1,…,m (systems) and j=1,…,n (topics).

Let μ_i = μ + a_i, where Σ_i a_i = 0 (a_i is the system effect).

Hypotheses:
H0: a_1 = … = a_m = 0 (no system effect)
H1: a_i ≠ 0 for some i (system effect)

Page 26: Designing Test Collections for Comparing Many Systems

The ANOVA approach (2)   [i=1,…,m (systems); j=1,…,n (topics)]

The total variation ST = Σ_i Σ_j (x_ij - x̄)² can be decomposed into ST = SA + SE, where
SA = n Σ_i (x̄_i - x̄)²  (between-system variation)
SE = Σ_i Σ_j (x_ij - x̄_i)²  (within-system variation)
x̄: sample grand mean; x̄_i: sample mean of system i

Page 27: Designing Test Collections for Comparing Many Systems

The ANOVA approach (3)   [i=1,…,m (systems); j=1,…,n (topics)]

Test statistic: F0 = VA/VE, where VA = SA/φA, VE = SE/φE, φA = m-1, φE = m(n-1). F0 asks: how large is the between-system variance compared to the within-system variance?

Under H0, F0 ~ F distribution with (φA, φE) degrees of freedom. One-way ANOVA rejects H0 if F0 >= F(φA; φE; α).

[Figure: F distribution; the rejection region of probability α lies beyond the critical value F(φA; φE; α)]

Page 28: Designing Test Collections for Comparing Many Systems

The ANOVA approach (4)   [i=1,…,m (systems); j=1,…,n (topics)]

The probability of rejecting H0: under H0, this is exactly α (rejecting an H0 that is true).

[Figure: F distribution; the tail area α beyond F(φA; φE; α) is the probability of rejecting a true H0]

Page 29: Designing Test Collections for Comparing Many Systems

The ANOVA approach (5)   [i=1,…,m (systems); j=1,…,n (topics)]

The probability of rejecting H0: under H1, this is exactly the power 1-β (rejecting an H0 that is false).

Under H1, F0 ~ noncentral F distribution with (φA, φE) degrees of freedom and a noncentrality parameter λ = nΔ, where Δ = Σ_i a_i² / σ² measures the total system effects in variance units.

Page 30: Designing Test Collections for Comparing Many Systems

The ANOVA approach (6)   [i=1,…,m (systems); j=1,…,n (topics)]

The power (the probability of rejecting an H0 that is false):
1-β = Pr{ F0 >= F(φA; φE; α) }, where F0 ~ noncentral F distribution.

For a random variable F' that obeys a noncentral F distribution, Pr{ F' <= w } can be approximated using a normal distribution [Nagata03] (Eqs. 14 and 15 in my paper).

Given α, n, Δ, and m, the power 1-β can be computed. But what we want is: given α, β, Δ, and m, compute n!
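A minimal Python sketch of this power computation, using scipy's exact noncentral F distribution in place of the normal approximation of [Nagata03] (that substitution is mine):

    from scipy import stats

    def anova_power(alpha, n, delta, m):
        """Pr{F0 >= F(phiA; phiE; alpha)} when F0 ~ noncentral F with lambda = n*Delta."""
        phi_a, phi_e = m - 1, m * (n - 1)            # degrees of freedom
        crit = stats.f.ppf(1 - alpha, phi_a, phi_e)  # critical value F(phiA; phiE; alpha)
        return stats.ncf.sf(crit, phi_a, phi_e, n * delta)

    print(anova_power(alpha=0.05, n=50, delta=0.05, m=10))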

Page 31: Designing Test Collections for Comparing Many Systems

The ANOVA approach (7)   [i=1,…,m (systems); j=1,…,n (topics)]

Under H0, we know Δ = 0. But under H1, we only know that Δ ≠ 0, so Δ needs to be specified to guarantee power = 1-β. Let's guarantee power = 1-β (i.e., correctly reject H0 with 100(1-β)% confidence) whenever Δ >= minΔ (the minimum detectable delta).

How shall we set minΔ?

Page 32: Designing Test Collections for Comparing Many Systems

The ANOVA approach (8)   [i=1,…,m (systems); j=1,…,n (topics)]

Let minΔ = minD² / (2σ̂²), where minD is the minimum detectable range that you specify and σ̂² is a variance estimate obtained from past data. Whenever the difference between the best system and the worst system is minD or more, we guarantee power = 1-β.

Given α, n, Δ, and m, the power 1-β can be computed. Using α, n, minD, σ̂², and m, the worst-case power can be computed (see Eq. 17 in my paper).

[Figure: system means μi (ai = μi - μ) spread around the grand mean μ; minD is the gap between the best and the worst system means]

Page 33: Designing Test Collections for Comparing Many Systems

The ANOVA approach (9)   [i=1,…,m (systems); j=1,…,n (topics)]

Given (α, β, minD, σ̂², m), we need λ = nΔ (the noncentrality parameter). λ can be obtained if we use the following approximation [Nagata03]:

Let φE = m(n-1) ≒ ∞; then φA·F0 approximately obeys a noncentral chi-square distribution with φA degrees of freedom and the same noncentrality parameter λ, which gives an approximation to 1-β.

Page 34: Designing Test Collections for Comparing Many Systems

The ANOVA approach (10)   [i=1,…,m (systems); j=1,…,n (topics)]

Given (α, β, minD, σ̂², m), with λ = nΔ (the noncentrality parameter) and φA = m-1:

For noncentral chi-square distributions, use the λ values tabulated in [Nagata03].

[Table: λ values for the noncentral chi-square approximation, indexed by φA and (α, β), from [Nagata03]]

Page 35: Designing Test Collections for Comparing Many Systems

The ANOVA approach (11)   [i=1,…,m (systems); j=1,…,n (topics)]

Given (α, β, minD, σ̂², m):

Obtain n using λ from the table (λ = nΔ), and check whether (α, n, minΔ, m) satisfies the power requirement 1-β; a normal approximation is available (Eqs. 15 and 16 in my paper).

If not, n++ and try the above again.
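Putting pages 31-35 together, a sketch of the whole design loop in Python; it uses the exact noncentral F power in place of the table and the normal approximation, and the worst-case conversion minΔ = minD²/(2σ̂²) from page 32:

    from scipy import stats

    def topic_set_size(alpha, beta, minD, var_hat, m):
        """Smallest n whose worst-case power reaches 1 - beta."""
        min_delta = minD ** 2 / (2 * var_hat)  # worst case: best and worst differ by minD
        n = 2
        while True:
            phi_a, phi_e = m - 1, m * (n - 1)
            crit = stats.f.ppf(1 - alpha, phi_a, phi_e)
            if stats.ncf.sf(crit, phi_a, phi_e, n * min_delta) >= 1 - beta:
                return n
            n += 1  # "if not, n++ and try again"

    # e.g. m=10 systems, minD=0.05, and a hypothetical variance estimate of 0.05:
    print(topic_set_size(0.05, 0.20, 0.05, 0.05, 10))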

Page 36: Designing Test Collections for Comparing Many Systems

Demo: determine n from (α, β, minD, σ̂², m)

http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx

Page 37: Designing Test Collections for Comparing Many Systems

Obtaining σ̂²

A time-honoured method using one-way ANOVA statistics [Okubo12]:
σ̂_A² = (VA - VE)/n: estimate of the population between-system variance
σ̂² = VE: estimate of the population within-system variance

Given multiple sets of ANOVA statistics (test collections + runs), the estimated variances can be pooled to enhance reliability:
σ̂² = Σ_k SE_k / Σ_k φE_k  (pooling over data sets k)
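A sketch of this estimation step for one or more topic-by-run matrices (rows = topics, columns = runs; numpy). Pooling by residual degrees of freedom is the standard pooled-variance computation; its exact correspondence to the slide's equations is an assumption:

    import numpy as np

    def within_system_ss(matrix):
        """Within-system sum of squares SE and its degrees of freedom phiE."""
        n, m = matrix.shape                               # topics x runs
        se = ((matrix - matrix.mean(axis=0)) ** 2).sum()  # deviations from each system mean
        return se, m * (n - 1)

    def pooled_variance(matrices):
        parts = [within_system_ss(mat) for mat in matrices]
        return sum(se for se, _ in parts) / sum(phi for _, phi in parts)

    # e.g. two hypothetical topic-by-run score matrices:
    rng = np.random.default_rng(0)
    print(pooled_variance([rng.random((50, 20)), rng.random((49, 30))]))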

Page 38: Designing Test Collections for Comparing Many Systems

Overview of our approaches

• Using the paired t-test to determine n (for m=2 systems)
  INPUT: α, β, minDt (minimum detectable difference for ensuring power=1-β), σ̂t² (estimated variance of the performance difference)
• Using one-way ANOVA to determine n (for m>=2 systems)
  INPUT: α, β, m, minD (minimum detectable range for ensuring power=1-β), σ̂² (estimated variance of system performance)
• Methods for estimating σ̂t² and σ̂²
  INPUT: test collections with runs and an evaluation measure (topic-by-run matrices)

See Appendix

Page 39: Designing Test Collections for Comparing Many Systems

TALK OUTLINE

1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work

Page 40: Designing Test Collections for Comparing Many Systems

Data for estimating σ̂²

Data        #topics  runs  pd     #docs                   task
TREC03new   50       78    125    528,155 news articles   adhoc news IR
TREC04new   49       78    100    ditto                   adhoc news IR
TREC11w     50       37    25     one billion web pages   adhoc web IR
TREC12w     50       28    20/30  ditto                   adhoc web IR
TREC11wD    50       25    25     ditto                   diversified web IR
TREC12wD    50       20    20/30  ditto                   diversified web IR

We have a topic‐by‐run matrix for each data set and evaluation measure

Page 41: Designing Test Collections for Comparing Many Systems

Evaluation measures

News (l=10, 1000): AP, Q, nDCG, nERR
Adhoc web (l=10): AP, Q, nDCG, nERR
Diversified web (l=10): α-nDCG, nERR-IA, D-nDCG, D#-nDCG

l: measurement depth

Page 42: Designing Test Collections for Comparing Many Systems
Page 43: Designing Test Collections for Comparing Many Systems

t-test vs ANOVA

When m=2:
• minD for ANOVA (a range) reduces to minDt for the t-test (a difference).
• The results are similar, with ANOVA giving slightly larger estimates of n.

Henceforth we discuss ANOVA, as it can also handle m>=3 and we prefer to "err on the side of oversampling" [Ellis10].

Page 44: Designing Test Collections for Comparing Many Systems

[Figure (a2): adhoc/news (l=10), (α, β, minD)=(0.05, 0.20, 0.05); topic set size n (y-axis) vs number of systems m (x-axis) for AP, Q, nDCG, and nERR]

For comparing m=100 systems, Q/nDCG/AP/nERR require 2198/2382/2863/4063 topics.

Page 45: Designing Test Collections for Comparing Many Systems

[Figure (b): adhoc/web (l=10), (α, β, minD)=(0.05, 0.20, 0.05); topic set size n (y-axis) vs number of systems m (x-axis) for AP, Q, nDCG, and nERR]

For comparing m=100 systems, Q/nDCG/AP/nERR require 1240/1291/2801/2921 topics.

Page 46: Designing Test Collections for Comparing Many Systems

[Figure (c): diversity/web (l=10), (α, β, minD)=(0.05, 0.20, 0.05); topic set size n (y-axis) vs number of systems m (x-axis) for α-nDCG, nERR-IA, D-nDCG, and D#-nDCG]

For comparing m=100 systems, D-nDCG/D#-nDCG/α-nDCG/nERR-IA require 1201/1749/2662/2869 topics.

Page 47: Designing Test Collections for Comparing Many Systems

What if we reduce the pool depth pd?

[Figure: the pooling diagram again: n=50 topics, the top pd=100 documents from each run forming the pool for each topic, and relevance assessments (highly relevant / partially relevant / nonrelevant)]

This experiment is for adhoc/news l=1000 (pd=100) only.

Page 48: Designing Test Collections for Comparing Many Systems

σ̂² when pd is reduced

As pd gets smaller:
• the average number of judged documents per topic decreases (naturally);
• the variance σ̂² increases (fewer data points hurt stability).

Re-estimate n for (α, β, minD, new σ̂²).

Page 49: Designing Test Collections for Comparing Many Systems

[Figure: topic set size n (y-axis) vs average number of judged documents per topic (x-axis) for AP, Q, nDCG, and nERR, with points at pd=100, 70, 50, 30, and 10; (α, β, minD)=(0.05, 0.20, 0.05), m=10. pd=100 is the TREC ad hoc pool depth.]

Total cost for AP at pd=100: 731 docs/topic * 652 topics = 476,612 docs.
Total cost for AP with a shallow pool: 96 docs/topic * 879 topics = 84,384 docs, an alternative design with the cost reduced to 18%.

Page 50: Designing Test Collections for Comparing Many Systems

TALK OUTLINE

1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work

Page 51: Designing Test Collections for Comparing Many Systems

Takeaways (1)

• Using one‐way ANOVA‐based power analysis, researchers can determine the topic set size n by specifying:

α: Type I error probability
β: Type II error probability
minD: minimum detectable range (the performance difference between the best and the worst systems) for ensuring a statistical power of 1-β
m: number of systems to be compared
σ̂²: estimated variance of each system

• Different measures have different σ̂²s, so researchers should decide on the evaluation measure at the test collection design phase

Page 52: Designing Test Collections for Comparing Many Systems

Takeaways (2)

• Our method can provide different test collection designs (n, pd) that satisfy the same statistical requirement.

n: topic set size pd: pool depth• The assessment cost of a pd=100 test collection can be reduced to 18% or less while keeping it statistically equally reliable.

• Our method can be used to compare evaluation measures in terms of practical significance = judgment cost.

• Our tools and data are available athttp://www.f.waseda.jp/tetsuya/tools.htmlhttp://www.f.waseda.jp/tetsuya/data.html

Page 53: Designing Test Collections for Comparing Many Systems

Future work

• Investigating the relationship between our power‐based approach and a CI (confidence interval)‐based approach: DONE [Sakai14EVIA]

• Estimating n for various tasks (not just IR) – our methods are applicable to any paired‐data evaluation tasks

• Given a set of statistically equally reliable designs (n, pd), choose the best one based on reusability (can we evaluate new systems fairly?) and assessment cost

Page 54: Designing Test Collections for Comparing Many Systems

References

[Cumming12] Cumming, G.: Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge, 2012.
[Ellis10] Ellis, P.D.: The Essential Guide to Effect Sizes. Cambridge University Press, 2010.
[Nagata03] Nagata, Y.: How to Design the Sample Size. Asakura Shoten, 2003.
[Okubo12] Okubo, M. and Okada, K.: Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval (in Japanese). Keiso Shobo, 2012.
[Sakai14PROMISE] Sakai, T.: Metrics, Statistics, Tests. PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), pp. 116-163, Springer, 2014.
[Sakai14SIGIRforum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1), 2014.
[Sakai14FIT] Sakai, T.: Designing Test Collections that Provide Tight Confidence Intervals, FIT 2014, RD-003, 2014.
[Webber08] Webber, W., Moffat, A. and Zobel, J.: Statistical Power in Retrieval Experimentation. ACM CIKM 2008, pp. 571-580, 2008.

Page 55: Designing Test Collections for Comparing Many Systems

Appendices

• The t-test approach
• Delta over [Webber08]

Page 56: Designing Test Collections for Comparing Many Systems

The t‐test approach (1)

Hypotheses: H0: μX = μY (systems X and Y are equally effective) vs. H1: μX ≠ μY

Assume the per-topic differences dj = xj - yj are i.i.d. N(μX - μY, σ²), where xj and yj are the scores of X and Y on topic j.

Page 57: Designing Test Collections for Comparing Many Systems

The t‐test approach (2)

Test statistic: t0 = d̄ / √(Vd / n)

Under H0, t0 ~ t distribution with φ = n-1 degrees of freedom. The paired t-test rejects H0 if |t0| >= t(φ; α), the two-sided critical t value.

[Figure: t distribution; rejection regions of probability α/2 beyond ±t(φ; α)]

Page 58: Designing Test Collections for Comparing Many Systems

The t‐test approach (3)

The probability of rejecting H0: under H0, this is exactly α (rejecting an H0 that is true).

[Figure: t distribution; the two tail areas of α/2 beyond ±t(φ; α) sum to α]

Page 59: Designing Test Collections for Comparing Many Systems

The t-test approach (4)

The probability of rejecting H0: under H1, this is exactly the power 1-β (rejecting an H0 that is false).

Under H1, t0 ~ noncentral t distribution with φ = n-1 degrees of freedom and a noncentrality parameter λt = √n Δt, where Δt = (μX - μY)/σ is the effect size.

Page 60: Designing Test Collections for Comparing Many Systems

The t-test approach (5)

The power (the probability of rejecting an H0 that is false):
1-β = Pr{ |t0| >= t(φ; α) }, where t0 ~ noncentral t distribution.

For a random variable t' that obeys a noncentral t distribution, Pr{ t' <= w } can be approximated using a normal distribution [Nagata03] (Eqs. 4 and 5 in my paper).

Given α, n, and the effect size Δt, the power 1-β can be computed. But what we want is: given α, β, and Δt, compute n!

Page 61: Designing Test Collections for Comparing Many Systems

The t-test approach (6)

Given α, n, and the effect size Δt, the power 1-β can be computed. But what we want is: given α, β, and Δt, compute n!

Under H0, we know Δt = 0. But under H1, we need to specify Δt to discuss power. So let's correctly reject H0 with 100(1-β)% confidence whenever |Δt| >= minΔt (the minimum detectable effect).

Don't miss a real difference if it is minΔt or larger!

Page 62: Designing Test Collections for Comparing Many Systems

The t‐test approach (7)

Given (α, β, minΔt), the required n can be approximated via the normal distribution, using one-sided critical z values zp [Nagata03].

Check whether that n actually satisfies the power requirement 1-β <= Pr{ |t0| >= t(n-1; α) }, where t0 ~ noncentral t distribution.

If not, n++ and try the above again.
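A sketch of this loop in Python, using scipy's exact noncentral t in place of the normal approximation; the starting-point formula for n is my recollection of the [Nagata03] approximation and should be treated as an assumption (the check-and-increment loop corrects it anyway):

    import math
    from scipy import stats

    def t_power(alpha, n, delta_t):
        """Pr{|t0| >= t(n-1; alpha)} when t0 ~ noncentral t with lambda_t = sqrt(n)*Delta_t."""
        crit = stats.t.ppf(1 - alpha / 2, n - 1)
        nc = math.sqrt(n) * delta_t
        return stats.nct.sf(crit, n - 1, nc) + stats.nct.cdf(-crit, n - 1, nc)

    def t_topic_set_size(alpha, beta, min_delta_t):
        za, zb = stats.norm.ppf(1 - alpha / 2), stats.norm.ppf(1 - beta)
        n = max(2, math.ceil(((za + zb) / min_delta_t) ** 2))  # assumed normal-approximation start
        while t_power(alpha, n, min_delta_t) < 1 - beta:
            n += 1  # "if not, n++ and try again"
        return n

    # e.g. minDt=0.05 with a hypothetical variance estimate of 0.05 (see the next page
    # for the conversion minDelta_t = minDt / sigma_t_hat):
    print(t_topic_set_size(0.05, 0.20, 0.05 / math.sqrt(0.05)))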

Page 63: Designing Test Collections for Comparing Many Systems

The t‐test approach (8)

In practice, instead of setting minΔt (in terms of effect size), set minDt (the minimum detectable difference): guarantee the power whenever |μX - μY| >= minDt, then convert it to the effect size minΔt = minDt / σ̂t. This needs a variance estimate σ̂t² from past data!

Page 64: Designing Test Collections for Comparing Many Systems

Demo: determine n from (α, β, minDt, σ̂t²)

http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeTTEST.xlsx

Page 65: Designing Test Collections for Comparing Many Systems

Delta over [Webber08]

• They addressed the problem of building a test collection incrementally (add a topic, judge, re-estimate the variance…). We ask the direct question: "How many topics do we need to create?"

• They considered the t-test only. We use both the t-test and ANOVA, to handle m (>=2) systems.

• They used heuristics to estimate the variance. We use estimates from ANOVA.

• They considered AP and adhoc IR only. We consider a variety of graded-relevance measures and three different IR tasks.