CrowdTruth for medical relation extraction - WAI talk



Crowdsourcing Ground Truth for Relation Extraction in the Medical Domain

Anca Dumitrache, Lora Aroyo, Chris Welty 26th January 2015

Introduction

problem: cognitive computing systems need annotated data for training, testing, evaluation

solution: human annotation through crowdsourcing augmented with machine processing

What's wrong with the gold standard?

● algorithmic performance is measured on test sets vetted by human experts → never perfectly correct

● gold standards are created assuming that for each annotated instance there is a single right answer → does not account for alternative interpretations or sentence clarity

● gold standard quality is measured by inter-annotator agreement → what happens when disagreeing annotators are both right?

The fallacy of the “one truth” assumption that pervades computational semantics

Examples of disagreement

Does each sentence express the TREAT relation?

● ANTIBIOTICS are the first line treatment for indications of TYPHUS.

● Patients with TYPHUS who were given ANTIBIOTICS exhibited several side-effects.

● With ANTIBIOTICS in short supply, DDT was used during World War II to control the insect vectors of TYPHUS.

Examples of disagreement

Does each sentence express the TREAT relation?

● ANTIBIOTICS are the first line treatment for indications of TYPHUS. → agreement 99%

● Patients with TYPHUS who were given ANTIBIOTICS exhibited several side-effects. → agreement 80%

● With ANTIBIOTICS in short supply, DDT was used during World War II to control the insect vectors of TYPHUS. → agreement 60%

Why disagreement happens

● With ANTIBIOTICS in short supply, DDT was used during World War II to control the insect vectors of TYPHUS.

● ANTIBIOTICS are the first line treatment for indications of TYPHUS.

Sources of disagreement include genuinely different readings of a sentence (“cause” vs. “side effect”) and differences in worker quality (spammers vs. high-quality workers).

Triangle of reference

[Diagram: a triangle linking the input sentence, the annotation, and the worker.]

CrowdTruth

● use crowdsourcing tasks to improve machine-generated data

● adaptable to new annotation tasks and new domains

● disagreement is an indicator of quality for: sentences, relations, workers

● capture and interpret disagreement through CrowdTruth metrics

● open source & available as a web service: http://CrowdTruth.org

CrowdTruth for relation extraction

goal: use the CrowdTruth approach to collect a relation extraction gold standard and improve the performance of a relation extraction classifier

issue: how to interpret crowd disagreement for relation extraction?

approach:

● run a relation extraction crowdsourcing task on 900 medical sentences

● measure disagreement with CrowdTruth metrics

● train & evaluate a classifier with CrowdTruth scores

Research questions

1. what is the threshold between a negative and positive relation in crowdsourced data?

2. can CrowdTruth outperform experts in training a relation extraction classifier?

3. can CrowdTruth be used in evaluating a relation extraction classifier?

Relation extraction task

Unit vector — the binary vector recording which relations one worker selected for one sentence (one unit of work).

Sentence vector — the sum of all workers' unit vectors for that sentence.

Sentence-relation score — the cosine similarity between the sentence vector and the vector representing only that relation.
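The formulas themselves are not preserved in the transcript; the following is a minimal Python sketch of these metrics, assuming the standard CrowdTruth definitions (the relation set and judgments are illustrative, not from the talk):

```python
import numpy as np

RELATIONS = ["treats", "causes", "symptom", "other"]  # illustrative relation set

def unit_vector(selected, relations=RELATIONS):
    """Binary vector over the relation set: 1 for each relation
    the worker selected for this sentence."""
    return np.array([1.0 if r in selected else 0.0 for r in relations])

def sentence_vector(unit_vectors):
    """Sum of all workers' unit vectors for one sentence."""
    return np.sum(unit_vectors, axis=0)

def sentence_relation_score(sent_vec, relation, relations=RELATIONS):
    """Cosine similarity between the sentence vector and the vector
    representing only the given relation."""
    rel_vec = np.zeros(len(relations))
    rel_vec[relations.index(relation)] = 1.0
    return float(np.dot(sent_vec, rel_vec) /
                 (np.linalg.norm(sent_vec) * np.linalg.norm(rel_vec)))

# Three hypothetical workers annotate one sentence:
judgments = [{"treats"}, {"treats", "causes"}, {"causes"}]
s = sentence_vector([unit_vector(j) for j in judgments])
print(sentence_relation_score(s, "treats"))  # ~0.71
```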

Experiment setup

1. train a relation extraction classifier with cross-validation for one relation

datasets to compare (construction sketched below):

● baseline: original distant supervision relations

● single annotator: for each sentence-relation pair, randomly sample a binary decision from the sentence-relation vector

● expert annotator: each sentence annotated by 1 medical expert

● CrowdTruth: use the scaled sentence-relation score as a confidence value

2. perform evaluation on manually vetted data
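The exact construction of the single-annotator and CrowdTruth datasets is not spelled out in the slides; one plausible reading, as a sketch (the threshold default and the choice to use the raw score as confidence are assumptions):

```python
import random

def single_annotator_label(positive_votes, num_workers):
    """'single annotator': sample one binary decision uniformly from the
    sentence-relation vector, i.e. return positive with probability
    (# workers who selected the relation) / (# workers)."""
    return random.random() < positive_votes / num_workers

def crowdtruth_example(sen_rel_score, threshold=0.5):
    """'CrowdTruth': threshold the sentence-relation score to get a label
    and keep the score itself as a training confidence (the talk says
    'scaled' score; the exact scaling is not specified)."""
    return sen_rel_score >= threshold, sen_rel_score

print(single_annotator_label(positive_votes=9, num_workers=15))
print(crowdtruth_example(0.71))  # (True, 0.71)
```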

Crowdsourcing results

Top relations:

Relation          Frequency
ASSOCIATED WITH   69.4%
TREATS            45.23%
SYMPTOM           43.56%
CAUSE             41.35%
MANIFESTATION     39.57%

Data cleanup

● tuning parameters:
○ how much disagreement do we accept before labeling a worker as a spammer?
○ what is the sentence-relation score threshold between positive and negative?

● spam removal (one possible disagreement-based filter is sketched below)

● data clustering: relation similarity
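The talk does not give the spam-removal formula; the following is a plausible disagreement-based filter in the CrowdTruth spirit, where a spammer's annotations correlate poorly with everyone else's (both the metric and the threshold are assumptions):

```python
import numpy as np

def worker_agreement(sentences, worker_id):
    """Mean cosine similarity between one worker's unit vector and the
    aggregate vector of all other workers, over every sentence the
    worker annotated. `sentences` is a list of dicts mapping
    worker_id -> unit vector (numpy array)."""
    sims = []
    for vecs in sentences:
        if worker_id not in vecs:
            continue
        mine = vecs[worker_id]
        others = np.sum([v for w, v in vecs.items() if w != worker_id], axis=0)
        denom = np.linalg.norm(mine) * np.linalg.norm(others)
        if denom > 0:
            sims.append(float(np.dot(mine, others) / denom))
    return float(np.mean(sims)) if sims else 0.0

SPAM_THRESHOLD = 0.3  # hypothetical tuning parameter
# workers with worker_agreement(...) < SPAM_THRESHOLD would be filtered out
```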

Relation similarity

A pairwise metric between two relations, expressing how likely one relation is to appear in a sentence given that the other is present. It is based on causal power: for every two relations i and j, the causal power of i over j is

CP(i, j) = ( P(j|i) − P(j|¬i) ) / ( 1 − P(j|¬i) )

where P(j|i) is the probability that relation j appears in a sentence given that relation i does, and P(i) is the probability that relation i appears in a sentence.
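The formula above is reconstructed from the standard (Cheng-style) definition of causal power, since the transcript lost the original equation. Under that assumption it can be estimated from the annotated sentences like this (the example data is hypothetical):

```python
def causal_power(i, j, sentences):
    """Causal power of relation i over j:
    CP(i, j) = (P(j|i) - P(j|not i)) / (1 - P(j|not i)),
    estimated from `sentences`, a list of sets of relations
    observed in each sentence."""
    with_i = [s for s in sentences if i in s]
    without_i = [s for s in sentences if i not in s]
    if not with_i or not without_i:
        return 0.0  # not estimable from this sample
    p_j_given_i = sum(j in s for s in with_i) / len(with_i)
    p_j_given_not_i = sum(j in s for s in without_i) / len(without_i)
    if p_j_given_not_i == 1.0:
        return 0.0  # degenerate: j always present without i
    return (p_j_given_i - p_j_given_not_i) / (1.0 - p_j_given_not_i)

# Hypothetical annotations for four sentences:
sentences = [{"cause", "symptom"}, {"cause"}, {"cause", "symptom"}, {"treats"}]
print(causal_power("cause", "symptom", sentences))  # ~0.67
```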

Relation clustering

Top relations after clustering:

Relation                                                Frequency
CAUSE (cause + symptom + manifestation + side effect)   74.72%
ASSOCIATED WITH                                         69.4%
TREATS                                                  45.23%

Final relation overlap

Building a test set

[Figure: test set construction pipeline] + manual evaluation

X-validation results

Rank correlation between classifier output and CrowdTruth (crowd columns give the sentence-relation score threshold used; both measures are sketched below):

                    baseline  single  expert  crowd 0.1  crowd 0.2  crowd 0.3  crowd 0.4  crowd 0.5  crowd 0.6  crowd 0.7  crowd 0.8  crowd 0.9
Kendall tau         0.62      0.64    0.664   0.656      0.654      0.672      0.662      0.666      0.654      0.651      0.649      0.655
Spearman footrule   0.513     0.501   0.568   0.545      0.54       0.554      0.543      0.53       0.535      0.543      0.536      0.549
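Kendall tau is available in scipy; the Spearman footrule is easy to compute directly. A sketch of how these rank correlations between classifier confidences and CrowdTruth scores might be computed (the footrule normalization is one common choice, not necessarily the talk's; the score lists are hypothetical):

```python
import numpy as np
from scipy.stats import kendalltau

def spearman_footrule(a, b):
    """1 minus the normalized sum of absolute rank differences,
    so 1.0 means identical rankings (one common normalization;
    the talk does not state which was used)."""
    ra = np.argsort(np.argsort(a))   # ranks of a
    rb = np.argsort(np.argsort(b))   # ranks of b
    n = len(a)
    max_dist = n * n // 2            # maximum possible footrule distance
    return 1.0 - np.abs(ra - rb).sum() / max_dist

# Hypothetical classifier confidences vs. CrowdTruth sen-rel scores:
clf = [0.9, 0.3, 0.7, 0.1]
crowd = [0.8, 0.4, 0.6, 0.5]
tau, _ = kendalltau(clf, crowd)
print(tau)                            # ~0.67
print(spearman_footrule(clf, crowd))  # 0.75
```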

Results

1. what is the threshold between a negative and positive relation in crowdsourced data?

● agreement with experts: sen-rel score = 0.5
● best performance: sen-rel score = 0.7

2. can CrowdTruth outperform experts in training a relation extraction classifier?

● not yet, but it comes close

3. can CrowdTruth be used in evaluating a relation extraction classifier?

● yes, for rank correlation methods
● TODO: weighted precision, recall, F1, accuracy