Evaluation and Replication (the scientist) · 2018-06-21 · Evaluation & Replication...

Evaluation and Replication (the scientist) Trusted Data Analytics Seminar TU Delft, 15 June 2018 Julio Gonzalo (UNED, Madrid, Spain)


Page 1:

Evaluation and Replication (the scientist)

Trusted Data Analytics Seminar

TU Delft, 15 June 2018

Julio Gonzalo (UNED, Madrid, Spain)

Page 2:

Evaluation & Replication

Reproducibility: experimental results can be replicated

Generalizability: results make valid predictions outside the lab

Validity: results are meaningful, unbiased and based on valid measurements

Page 3:

Data Scientists Anonymous

Page 4:

(Textual) Data Science Evaluation

Hard to predict performance

for a new problem

for a similar problem

for the same problem on a different dataset

Even hard to replicate results

Prediction is very difficult, especially about the future

(Niels Bohr, physicist)

Replication is very difficult, especially after the first occurrence

(Stefano Mizzaro, data scientist)

Page 5:

The bridge metaphor in Computer Science

Page 6:

[Figure: axis running from objective to subjective. Labels: Satellite control systems; New Cola Flavor; NLP; IR; RecSys; human data]

Page 7:

Standard Evaluation Methodology

Task formalization, RQs (Algorithms)

Data Selection / Acquisition / Harvesting

Experimental Setting

Analysis

Evaluation metrics, statistical significance

Failure analysis, qualitative analysis

scope & limitations of results
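
One way to make the "statistical significance" step concrete is a paired randomization (sign-flip) test over per-test-case scores. The sketch below is only an illustration under stated assumptions: the function name and the toy scores are hypothetical, and the talk does not prescribe any particular test. It enumerates every sign assignment exactly, so it only suits small test sets.

```python
from itertools import product

def paired_randomization_test(scores_a, scores_b):
    """Exact two-sided sign-flip test on paired per-test-case scores.

    Returns the fraction of sign assignments whose absolute summed
    difference is at least as large as the observed one (the p-value).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to be + or -.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical per-topic scores for two systems (illustrative only).
system_a = [3, 4, 5, 6]
system_b = [2, 3, 4, 5]
p_value = paired_randomization_test(system_a, system_b)
print(p_value)
```

With only four test cases, even a system that wins every time cannot reach p < 0.05, which is one reason the size of the test set belongs in the "scope & limitations" discussion.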

Page 8:

Right or wrong?

Page 9:

Task biases

70% of research is private

70% of publications come from public institutions

[Diagram: PUBLIC vs PRIVATE research compared on INCENTIVE (user satisfaction, attention retention, publishability), BIG (?) DATA, and OUTCOMES: IRRELEVANT (reproducible?) vs IRREPRODUCIBLE (relevant?)]

Page 10:

Metrics define the problem

Page 11:

The case of recommendation

NETFLIX CHALLENGE

predict user ratings

classification metrics

REAL TASK

suggest something the user likes

ranking metrics
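
The gap between the two task formulations can be sketched with toy numbers. Everything below is hypothetical (ratings, predictions, the "liked" threshold of 4, the function names), and RMSE is used here as the rating-prediction error measure: predictor A is closer to the true ratings on average, yet predictor B is the one that puts the liked items on top.

```python
import math

def rmse(true_ratings, predicted):
    """Root mean squared error between true and predicted ratings."""
    n = len(true_ratings)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_ratings, predicted)) / n)

def precision_at_k(true_ratings, predicted, k, liked_threshold=4):
    """Fraction of the k top-ranked items that the user actually likes."""
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    return sum(true_ratings[i] >= liked_threshold for i in ranked[:k]) / k

# Hypothetical data: four items, one user.
true_ratings = [5, 3, 4, 1]
pred_a = [4.0, 4.2, 3.9, 1.0]   # small errors, but the wrong item ranked first
pred_b = [3.0, 1.0, 2.0, 0.0]   # large errors, but a perfect ordering

rmse_a, rmse_b = rmse(true_ratings, pred_a), rmse(true_ratings, pred_b)
p2_a = precision_at_k(true_ratings, pred_a, k=2)
p2_b = precision_at_k(true_ratings, pred_b, k=2)
print(rmse_a, rmse_b, p2_a, p2_b)
```

Under the rating-prediction metric A wins; under the ranking metric B wins. The choice of metric decides which system counts as better.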

Page 12:

Let’s go ranking

Page 13:

Metric Selection Hall of Fame

Use the most popular

Use the simplest

Get creative

Mirror, mirror, who is the prettiest…?

Wait… Do all MAP implementations give the same output?
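
A minimal sketch of how two plausible average precision implementations can disagree on the same ranking. The function and its `total_relevant` parameter are hypothetical and do not reproduce any specific evaluation tool; the only difference between the two calls is whether the sum is normalized by all relevant documents in the collection or only by those that were retrieved.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision of one ranked list of 0/1 relevance judgments.

    If total_relevant is given, normalize by the number of relevant
    documents in the collection; otherwise normalize by the number of
    relevant documents that appear in the ranking.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0

# Hypothetical run: 2 relevant documents retrieved, 4 relevant in the collection.
run = [1, 0, 1, 0]
ap_collection = average_precision(run, total_relevant=4)
ap_retrieved = average_precision(run)
print(ap_collection, ap_retrieved)
```

Other sources of divergence, such as tie handling or truncation depth, are not covered by this sketch.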

Page 14:

Popular: Purity & Inverse Purity
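
For reference, a small sketch of the two metrics using their usual definitions (each cluster is credited with its majority category; inverse purity swaps the roles of clusters and categories). The function names and toy data are hypothetical. The example also shows why popularity alone is a weak argument: each metric can be maximized by a degenerate clustering.

```python
from collections import Counter

def purity(clusters, categories):
    """Weighted average, over clusters, of the share of the majority category.

    clusters and categories are parallel lists: position i holds the
    cluster id and the gold category of item i.
    """
    by_cluster = {}
    for cluster, category in zip(clusters, categories):
        by_cluster.setdefault(cluster, Counter())[category] += 1
    return sum(max(counts.values()) for counts in by_cluster.values()) / len(clusters)

def inverse_purity(clusters, categories):
    """Purity with the roles of clusters and categories exchanged."""
    return purity(categories, clusters)

# Hypothetical gold categories and two degenerate system outputs.
gold = ["a", "a", "b", "b"]
one_cluster_per_item = [0, 1, 2, 3]
single_cluster = [0, 0, 0, 0]

print(purity(one_cluster_per_item, gold), inverse_purity(one_cluster_per_item, gold))
print(purity(single_cluster, gold), inverse_purity(single_cluster, gold))
```

One cluster per item reaches perfect purity, and a single cluster reaches perfect inverse purity, so neither number is informative on its own.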

Page 15:

What about multiple quality dimensions?

Page 16:

Data Harvesting

Wisdom of the crowds… only if there is diversity of opinion, independence and decentralization!

Biases everywhere in our online society:

E.g. Twitter population is not a random sample of citizens

Cognitive biases and techniques to exploit them (we share what we find outrageous) [economic incentives!]

“Big data, small data, right data” (Ricardo Baeza-Yates)

Page 17:

Ground Truth harvesting: RecSys example

Random sampling of user/item scores

Training/Validation/Test split

Strong bias for popular items!
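
The popularity bias can be illustrated with a simulated interaction log. This is a sketch under assumed conditions, not data from the talk: the long-tailed item distribution, the 20% test fraction and all names are hypothetical. Sampling interactions uniformly at random reproduces the skew of the log, so a small share of the catalogue accounts for most of the test cases.

```python
import random

def popular_share(interactions, top_items):
    """Fraction of (user, item) interactions that involve a top item."""
    return sum(item in top_items for _, item in interactions) / len(interactions)

# Hypothetical long-tailed log: item k receives 1000 // (k + 1) ratings.
interactions = [
    (f"u{n}", item)
    for item in range(100)
    for n in range(1000 // (item + 1))
]
top_items = set(range(10))  # the 10 most rated items out of 100

rng = random.Random(0)      # fixed seed so the split is repeatable
test_split = rng.sample(interactions, len(interactions) // 5)

catalogue_share = len(top_items) / 100
test_share = popular_share(test_split, top_items)
print(catalogue_share, test_share)
```

A system that only recommends popular items would look strong on such a split, which is one reason to report results per popularity stratum.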

Page 18:

Data annotation: the entailment case

“President Rajoy's deposition was, according to the judge, unbelievable” ENTAILS “Rajoy was lying”

“President Rajoy's deposition was, according to the judge, unbelievable” NOT ENTAILS “Rajoy was not lying”

Crowdworkers find trick to produce test cases quickly

A negation in the consequent is highly correlated with an invalid entailment

What the algorithms learn

Consequence: entailment resolution systems have been highly overrated for years
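
The annotation shortcut can be made concrete with a baseline that never reads the premise. This is a hypothetical sketch: the negation word list, the function name and the third test pair are invented for illustration. On pairs built with the crowdworkers' trick the baseline looks perfect; on a pair that breaks the pattern it fails.

```python
NEGATIONS = {"not", "no", "never", "nobody", "nothing"}

def negation_baseline(hypothesis):
    """Predict the label from the hypothesis alone: negation means no entailment."""
    tokens = hypothesis.lower().split()
    return "NOT ENTAILS" if any(t in NEGATIONS for t in tokens) else "ENTAILS"

premise = "President Rajoy's deposition was, according to the judge, unbelievable"
trick_pairs = [
    (premise, "Rajoy was lying", "ENTAILS"),
    (premise, "Rajoy was not lying", "NOT ENTAILS"),
]
# Hypothetical counterexample: a negation inside a valid entailment.
counterexample = (
    "The bridge collapsed during the storm",
    "The bridge is no longer standing",
    "ENTAILS",
)

def accuracy(pairs):
    """Share of pairs whose gold label matches the baseline prediction."""
    return sum(negation_baseline(h) == label for _, h, label in pairs) / len(pairs)

print(accuracy(trick_pairs), accuracy([counterexample]))
```

A learned model that picks up the same cue would score well on the benchmark for the same wrong reason.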

Page 19:

A/B testing problems

Weak baselines (“Improvements that don’t add up”)

Difficult to compare with state of the art

Previous approaches difficult to reproduce

System outputs are not usually available

Lab experimentation usually ignores real-world scenarios

When there are benchmarks, there is overfitting

Importance of evaluation campaigns

Page 20:

Analysis problems

sensitivity rather than reliability

finding rather than explaining differences

Averages that hide behavior: across metrics, across classes, across test cases. Example: classification efficacy measured as arithmetic mean of harmonic means per class
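
The "arithmetic mean of harmonic means per class" is what is commonly called macro-averaged F1. A minimal sketch with hypothetical labels and function names shows how the single averaged number conceals a class that the system never predicts.

```python
def f1_per_class(gold, predicted):
    """F1 (harmonic mean of precision and recall) for each gold class."""
    scores = {}
    for cls in sorted(set(gold)):
        tp = sum(g == cls and p == cls for g, p in zip(gold, predicted))
        fp = sum(g != cls and p == cls for g, p in zip(gold, predicted))
        fn = sum(g == cls and p != cls for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[cls] = 2 * precision * recall / total if total else 0.0
    return scores

def macro_f1(gold, predicted):
    """Arithmetic mean of the per-class F1 scores."""
    scores = f1_per_class(gold, predicted)
    return sum(scores.values()) / len(scores)

# Hypothetical test set: 8 items of class "a", 2 of class "b";
# the system predicts "a" for everything.
gold = ["a"] * 8 + ["b"] * 2
predicted = ["a"] * 10

per_class = f1_per_class(gold, predicted)
macro = macro_f1(gold, predicted)
print(per_class, macro)
```

The macro average of about 0.44 says nothing about the fact that class "b" is never found; reporting the per-class scores alongside the average makes that visible.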

Page 21:

Analysis of ML outcomes

Knowledge-based systems vs ignorance-based systems.

Overfitting: what do ML algorithms actually learn?

ML elevates correlation to causality:

input X correlates with output Y → ML outputs Y because of X

Example: “born in January” correlates with “great hockey player”

Bias & second-order bias:

Google “holocaust” search: biased crowds, unpredictable results.

Second-order bias: tag recommendation in Flickr. Without human input, the algorithm cannot improve; it commits harakiri (Baeza-Yates).
