Nicola Ferro, Allan Hanbury, Jussi Karlgren, Maarten de Rijke, and Giuseppe Santucci

CLEF 2010, 20th Sept. 2010, Padova

A PROMISE for Experimental Evaluation

Multilingual and Multimedia Information Access Systems

Challenges for Experimental Evaluation

• Heterogeneousness and volume of the data

  • much is done to provide realistic document collections

• Diversity of users and tasks

  • evaluation tasks/tracks are often too “monolithic”

• Complexity of the systems

  • systems are usually dealt with as “black boxes”

Experimental Evaluation Needs

• To increase the automation in the evaluation process

  • reduction of the effort necessary for carrying out evaluation

  • increase in the number of experiments conducted, in order to analyse evolving user habits and tasks in depth

• To study systems, component-by-component

  • better understanding of systems’ behaviour, also with respect to different tasks

• To increase the usage of the produced experimental data

  • improving collaboration and user involvement to achieve unforeseen exploitation and enrichment of the experimental data

PROMISE Approach

Evaluation: Labs and Metrics

Maarten de Rijke

UvA

Information access changing

• New breeds of users

• Performing increasingly broad range of tasks within varying domains

• Acting within communities to find information for themselves and to share with others

• Re-orientation of methodology and goals in evaluation of information access systems

Mapping the evaluation landscape

• Generating ground truth from log files

• Generating ground truth from annotations

• Alternative retrieval scenarios and metrics

• Living labs

• Evaluation in the wild

• Ranking analysis

Use Cases – a bridge to application

Jussi Karlgren

SICS

Two legs of evaluation (well … at least two)

• Benchmarking

• Validation

Each with separate craft and practice. How can they communicate?

Use cases – a conduit

• To communicate starting points of evaluation practice, we suggest the formulation of use cases, based on practice in the field.

• Interviews, think tanks, hypothesis-driven as well as empirically driven.

• Contact us! Suggest stakeholders!

IP Search

Allan Hanbury

IRF

IR Evaluation Campaigns today

… are mostly based on

The TREC organisation model, which is based on

The Cranfield Paradigm, which was developed for

You can do a lot with index cards...

The Mundaneum: begun in 1919 in Belgium; by April 1934 there were 15 646 346 index cards (cross-referenced)

Disadvantages of Evaluation Campaign Approach

• Fixed timelines and cyclic nature of events

• Evaluation at system-level only

• Difficulty in comparing systems and elucidating reasons for their performance

• Viewing the campaign as a competition

• Are IR Systems getting better?

  It is not clear from results in published papers that IR systems have improved over the last decade [Fuhr, this morning; Armstrong et al., CIKM 2009]

Search for Innovation

Patent Search is an interesting problem because:

• Very high recall required, but precision should not be sacrificed

• Many types of search done: from narrow to wide

• Searches also in non-patent literature

• Classification required

• Multi-lingual

• Non-text information is important

• Different style used in different parts of patents

Visual Analytics

Giuseppe Santucci

Sapienza Università di Roma

Data!

PROMISE has to manage and explore large and/or complex datasets:

• Topics

• Experiment submissions

• Creation of pools

• Relevance assessment

• Log files

• Measures

• Derived data

• Statistics

• …

And PROMISE foresees a growth of the managed data during the project of about one order of magnitude

Challenges

What are the challenges arising from the management of such datasets?

• Not the storage (even if it requires an engineered database design)

• Not the retrieval (if you just need to retrieve a measure)

Challenges come from effectively using such an immense wealth of data (without being overloaded). It means:

• understanding it

• discovering patterns, insights, and trends

• making decisions

• sharing and reusing results

Rescuing information

In different situations people need to exploit and to use hidden information resting in unexplored large data sets:

• decision-makers

• analysts

• engineers

• emergency response teams

• ...

Several techniques exist devoted to this aim:

• Automatic analysis techniques (e.g., data mining, statistics)

• Manual analysis techniques (e.g., information visualization)

Large and complex datasets require a joint effort:

Visual Analytics

A simple Visual Analytics example

How to visually compare Jack London and Mark Twain books?

VA steps:

• Split the book into several text blocks (e.g., pages, paragraphs)

• Measure, for each text block, a relevant feature (e.g., average sentence length, word usage, etc.)

• Associate the relevant feature with a visual attribute (e.g., color)

• Visualize it
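
The four VA steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used for the slides: the function names, the block size (a fixed number of sentences per block), the naive sentence splitter, and the gray-level scale are all hypothetical choices made for the example.

```python
import re

def split_into_blocks(text, sentences_per_block=3):
    """Step 1: split the text into blocks of a fixed number of sentences.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [sentences[i:i + sentences_per_block]
            for i in range(0, len(sentences), sentences_per_block)]

def average_sentence_length(block):
    """Step 2: mean number of whitespace-separated words per sentence."""
    word_counts = [len(sentence.split()) for sentence in block]
    return sum(word_counts) / len(word_counts)

def to_gray_level(value, lo, hi):
    """Step 3: map a feature value linearly onto a 0-255 gray level."""
    if hi == lo:
        return 0
    clamped = min(max(value, lo), hi)
    return round(255 * (clamped - lo) / (hi - lo))

def block_levels(text, sentences_per_block=3):
    """Steps 1-3 combined: one gray level per text block."""
    blocks = split_into_blocks(text, sentences_per_block)
    if not blocks:
        return []
    features = [average_sentence_length(block) for block in blocks]
    lo, hi = min(features), max(features)
    return [to_gray_level(f, lo, hi) for f in features]

def render_strip(levels):
    """Step 4: draw one character per block, darker for higher values."""
    shades = " ░▒▓█"
    return "".join(shades[level * 4 // 255] for level in levels)

if __name__ == "__main__":
    sample = "A b c. D e. F g h i j k. L."
    print(render_strip(block_levels(sample, sentences_per_block=2)))
```

Running the same pipeline over two books and placing the resulting strips side by side gives a rough visual comparison of where each author uses short or long sentences.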

Jack London vs Mark Twain

Average sentence length: short sentences vs. long sentences

Hapax Legomena (HL, words appearing only once): many HL (rich vocabulary) vs. few HL
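
The hapax legomena feature can be measured with a simple word count. The sketch below is an assumption-laden illustration: it counts hapaxes within a single text block (the slide does not say whether counts are per block or per book), treats a word as a run of lowercase letters and apostrophes, and the function names are hypothetical.

```python
import re
from collections import Counter

def hapax_legomena(block_text):
    """Return the sorted list of words that occur exactly once in the block."""
    words = re.findall(r"[a-z']+", block_text.lower())
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count == 1)

def hapax_ratio(block_text):
    """Share of distinct words in the block that occur exactly once.

    A higher ratio corresponds to "many HL (rich vocabulary)".
    """
    words = re.findall(r"[a-z']+", block_text.lower())
    counts = Counter(words)
    if not counts:
        return 0.0
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(counts)

if __name__ == "__main__":
    print(hapax_legomena("The cat saw the dog"))
    print(hapax_ratio("The cat saw the dog"))
```

Normalising by the number of distinct words is one of several reasonable choices; dividing by the total number of words in the block would be an equally valid variant.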

Visual Analytics@PROMISE!

One of the innovative aspects of PROMISE, acknowledged by the European Commission, is the idea of providing Visual Analytics techniques for exploring the available datasets:

• Specific algorithms

• Suitable visualizations

• Sharing and collaboration mechanisms

The European Vismaster CA project

Where next?

• What can PROMISE deliver to future CLEF labs?

• How will PROMISE contribute to the field as a whole, outside direct CLEF activities?

• How can PROMISE provide experimental infrastructure for other projects?