#CrowdTruth: Biomedical Data Mining, Modeling & Semantic Integration (BDM2I 2015) @ISWC2015


Anca Dumitrache, Lora Aroyo, Chris Welty http://CrowdTruth.org

Achieving Expert-Level Annotation Quality with the Crowd

The Case of Medical Relation Extraction

Biomedical Data Mining, Modeling & Semantic Integration @ ISWC2015

#CrowdTruth @anouk_anca @laroyo @cawelty #BDM2I

•  Annotator disagreement is signal, not noise.

•  It is indicative of the variation in human semantic interpretation of signs.

•  It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.

CrowdTruth  http://CrowdTruth.org

•  Goals:
   •  collecting a relation extraction gold standard
   •  improve the performance of a relation extraction classifier

•  Approach:
   •  crowdsource 900 medical sentences
   •  measure disagreement with CrowdTruth metrics
   •  train & evaluate classifier with CrowdTruth score

CrowdTruth for medical relation extraction

http://CrowdTruth.org

RelEx TASK in CrowdFlower

Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1

Is ACUTE FEVER – related to → INFLUENZA AH1N1?

Worker Vector

[figure: each worker's judgment of a sentence is a binary vector over the candidate relations, with a 1 for every relation the worker selected]

Sentence Vector

[figure: the worker vectors collected for a sentence are summed into a sentence vector, e.g. 0 1 1 0 0 4 3 0 0 5 1 0]

Annotation Quality of Expert vs. Crowd Annotations

[figure: annotation quality F1, crowd 0.907 (p = 0.007) vs. expert 0.844]

In the [0.6 - 0.8] threshold range the crowd significantly out-performs the experts, with a maximum of 0.907 F1 at the 0.7 threshold.

RelEx CAUSE Classifier F1 for Crowd vs. Expert Annotations

[figure: classifier F1, crowd 0.642 (p = 0.016) vs. expert 0.638]

The crowd provides training data that is at least as good as, if not better than, the experts'.

Learning Curves

(crowd with pos./neg. threshold at 0.5)

Above 400 sentences the crowd is consistently above the baseline and the single annotator; above 600 sentences the crowd out-performs the experts.

Learning Curves Extended

(crowd with pos./neg. threshold at 0.5)

The crowd consistently performs better than the baseline.
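As a rough illustration of what a learning-curve experiment looks like, the sketch below trains a simple text classifier on growing slices of crowd-labelled sentences and records held-out F1. The features, classifier, and slice sizes are assumptions for illustration; the paper's RelEx classifier is not reproduced here.

```python
# Minimal learning-curve sketch (an assumption about the setup, not the
# authors' RelEx pipeline): train on growing slices of crowd-labelled
# sentences and record F1 on a fixed held-out set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def learning_curve(train_sents, train_labels, test_sents, test_labels,
                   sizes=(200, 400, 600, 800)):
    results = {}
    for n in sizes:
        vec = TfidfVectorizer()
        X_train = vec.fit_transform(train_sents[:n])
        X_test = vec.transform(test_sents)
        clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels[:n])
        results[n] = f1_score(test_labels, clf.predict(X_test))
    return results

# usage (with real data):
# learning_curve(sentences, crowd_labels_at_0_5, test_sentences, test_labels)
```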

# of Workers: Impact on Sentence-Relation Score

# of Workers: Impact on Annotation Quality

only 54 sentences had 15 or more workers

Experts vs. Crowd in Human Annotation: Overall Comparison

•  91% of expert annotations are covered by the crowd

•  expert annotators reach agreement in only 30% of cases

•  the most popular crowd vote covers 95% of this expert annotation agreement

Expert vs. Crowd in Human Annotation: Cost Comparison

                   F1      Cost per sentence
CrowdTruth         0.642   $0.66
Expert Annotator   0.638   $2.00
Single Annotator   0.492   $0.08

Experiments proved that:

•  the crowd performs just as well as medical experts

•  the crowd is also cheaper

•  the crowd is always available

•  using only a few annotators for ground truth is faulty

•  a minimum of 10 workers per sentence is needed for the highest-quality annotations

•  CrowdTruth is a solution to the Clinical NLP challenge: the lack of ground truth for training & benchmarking

http://CrowdTruth.org

#CrowdTruth @anouk_anca @laroyo @cawelty #BDM2I #ISWC2015

CrowdTruth.org

http://data.CrowdTruth.org/medical-relex