D4.5 - Community Evaluation Methodology

January 16, 2017

Deliverable Code: D4.5
Version: 1.0 - Final
Dissemination level: Public

This document contains a formal description of the evaluation methodology and key indicators based on community-driven application scenarios (use cases from four thematic areas), with the aim of validating and assessing the OpenMinTeD framework through community scenarios.

H2020-EINFRA-2014-2015 / H2020-EINFRA-2014-2
Topic: EINFRA-1-2014 Managing, preserving and computing with big research data
Research & Innovation action
Grant Agreement 654021


Document Description

Deliverable title: D4.5 - Community Driven Evaluation Methodology
Work package: WP4 - Community Driven Requirements and Evaluation
WP participating organizations: ARC, University of Manchester, UKP-TUDA, INRA, EMBL, LIBER, OU, EPFL, CNIO, USFD, GESIS, GRNET, Frontiers, UoS
Contractual Delivery Date: 11/2016
Actual Delivery Date: 01/2017
Nature: Report
Version: 1.0 (Final)
Dissemination: Public Deliverable

Preparation slip

From: Miguel Madrid, Martin Krallinger, Nicole Doelker (CNIO), 02/12/2016
Edited by: Martin Krallinger, Nicole Doelker (CNIO), 16/01/2017
Reviewed by: Matt Shardlow (UNIMAN), 19/12/2016; Piotr Przybyła (UNIMAN), 20/12/2016
Approved by: Natalia Manola (ARC), 17/01/2017
For delivery: Mike Hatzopoulos (ARC), 17/01/2017

 

 

Document change record

Issue  Item           Reason for Change  Author             Organization
V0.1   Draft version  Initial version    Miguel Madrid      CNIO
V1.0   First version  Final version      Martin Krallinger  CNIO


Table of Contents

1. Introduction
2. General evaluation strategy description
3. Community use case component evaluation scenarios
   3.1 Theme: Agriculture / Biodiversity
       3.1.1 Use case: AGRIS: Automatic extraction of topics, location and figures from publications [AS-A]
       3.1.2 Use case: Food Safety: Automatic extraction of location information from publications [AS-B]
       3.1.3 Use case: Microbial biodiversity [AS-C]
       3.1.4 Use case: Linking Wheat data with literature [AS-D]
       3.1.5 Use case: Information extraction of mechanisms involved in plant development [AS-E]
   3.2 Theme: Life Sciences
       3.2.1 Use case: Extract metabolites and their properties and modes of action [LS-A]
       3.2.2 Use case: Neuroscience [LS-B]
   3.3 Theme: Social Sciences
       3.3.1 Use case: Facilitation of complex information linking and retrieval from social sciences publications [SS-A]
   3.4 Theme: Scholarly Communication
       3.4.1 Use case: Research Analytics [SC-A]
   3.5 Community use case evaluation scenario actors
   3.6 Evaluation phase I: use-case component centric
       3.6.1 Criteria of assessment
   3.7 Evaluation phase II: use-case workflow centric
       3.7.1 General criteria of assessment
4. References
5. Appendix
   5.1 Questionnaire for evaluation phase I: use-case component centric
   5.2 Evaluation phase II: use-case workflow centric
       5.2.1 Registry service scenario
       5.2.2 Workflow service scenario
       5.2.3 Annotation service scenario


Table of Figures

Figure 1. Evaluation flowchart cycle.
Figure 2. OpenMinTeD Gold Standard dataset characterization aspects.
Figure 3. OpenMinTeD evaluation framework for component assessment (evaluation phase I).
Figure 4. Workflow service architecture (from OpenMinTeD Webinar 2).
Figure 5. OpenMinTeD framework, general evaluation levels.

 

 


Disclaimer

This document contains a description of the OpenMinTeD project findings, work and products. Certain parts of it might be covered by partner Intellectual Property Rights (IPR) rules, so prior to using its content please contact the consortium head for approval.

In   case   you   believe   that   this   document   harms   in   any   way   IPR   held   by   you   as   a   person   or   as   a  representative  of  an  entity,  please  do  notify  us  immediately.  

The authors of this document have taken all available measures to ensure that its content is accurate, consistent and lawful. However, neither the project consortium as a whole nor the individual partners that implicitly or explicitly participated in the creation and publication of this document accept any responsibility for consequences that might arise from the use of its content.

This   publication   has   been  produced  with   the   assistance   of   the   European  Union.   The   content   of   this  publication   is   the   sole   responsibility  of   the  OpenMinTeD  consortium  and   can   in  no  way  be   taken   to  reflect  the  views  of  the  European  Union.  

The  European  Union  is  established  in  accordance  with  the  Treaty  on  European  Union  (Maastricht).  There  are  currently  28  Member  States   of   the   Union.   It   is   based   on   the   European   Communities  and   the   member   states   cooperation   in   the   fields   of   Common  Foreign  and  Security  Policy  and  Justice  and  Home  Affairs.  The  five  main   institutions   of   the   European   Union   are   the   European  Parliament,   the  Council  of  Ministers,   the  European  Commission,  the   Court   of   Justice   and   the   Court   of   Auditors.  (http://europa.eu.int/)  

 

OpenMinTeD  is  a  project  funded  by  the  European  Union  (Grant  Agreement  No  654021).  

   

   

       


Acronyms

TDM    Text and Data Mining
NLP    Natural Language Processing
OMTD   OpenMinTeD
NER    Named Entity Recognition
REST   Representational State Transfer

 


Publishable Summary

This document is the fifth deliverable of Work Package 4, titled "Community Driven Requirements and Evaluation", and aims to provide an overview of the requirements relevant to TDM from the research communities that have been identified as potential end users of the OpenMinTeD project services. The first step in the process of the OpenMinTeD project developing its text- and data-mining powered services, as well as its services to end users and e-infrastructure, was the identification of the actual needs of the communities that currently require these services.

The aim of the OpenMinTeD project is to provide an interoperability framework, which serves as a core for existing tools and resources. The technical evaluation of requirements is critical for the integration and interoperability of all abstraction levels of the framework.

Only  rigorous  evaluation  can  guarantee  the  reliability  and  usability  of  such  a  framework  and  therefore  make  it  a  solid  tool  for  all  user  communities.  

OpenMinTeD   is   born   from   the   need   to   create   and   consolidate   a   European   framework   of   culture,  support,  promotion,  and  training  for  data  and  text  mining.  To  this  end,   it   is  necessary  to  create,  host  and  encourage  the  numerous  communities  involved  in  this  interdisciplinary  field  (suppliers,  scientists,  end   users,   etc.)   through   workshops,   open   tenders,   and   European   and   international   liaisons   for  developing  new  services  for  the  benefit  of  all  members.  

Therefore, the applied evaluation strategy necessarily needs to be community driven, i.e. based on the use of community-driven applications on the OMTD platform.

 

This  deliverable  contains:  

● An introduction
● General evaluation strategy
● Community use case derived evaluation
● Appendix (example questionnaires for evaluation)

 


1. Introduction

The OpenMinTeD project engages different thematic use case communities, namely Life Sciences, Agriculture/Biodiversity, Social Sciences and Scholarly Communication, with the aim of defining real-life application scenarios that should be addressed through the OpenMinTeD infrastructure. These community use cases serve as guiding examples for the implementation and evaluation of the resulting text mining workflows. A general examination of formal aspects underlying the OpenMinTeD community use cases serves as a basis to define key performance indicators to be considered as evaluation criteria covered by the OpenMinTeD evaluation framework. Key performance indicators refer to standard evaluation measures in the case of the use case Gold Standard data, and to structured survey outcomes in the case of the evaluation phase covering the OpenMinTeD infrastructure.

This document provides a comparative analysis of the key performance indicators derived from the community use case evaluation scenario definition, additionally taking into account aspects related to component interoperability as well as technical aspects underlying the use case derived text mining workflows. Based on this comparative examination, the key evaluation indicators implemented in the OpenMinTeD evaluation framework are classified according to three levels of practical relevance: mandatory evaluation indicators, important evaluation indicators and optional evaluation indicators. Mandatory evaluation indicators refer to measurements or validation outcomes that must be accomplished in order to be considered as fulfilling the OpenMinTeD evaluation criteria.

In order to facilitate a standardized assessment of some of the key evaluation indicators, an open community challenge shared task focusing on the technical interoperability and performance of specific components of the use case text mining workflows will be carried out. Moreover, for the design of the evaluation indicators, a profiling of the actual users (actors) of the text mining workflows derived from the community cases was carried out. An evaluation indicator survey, in the form of a structured questionnaire covering quantitative, technical and qualitative key performance measures, served to prioritize the main aspects covered by the evaluation setting considered by the OpenMinTeD evaluation framework.

D4.5 reports on the evaluation strategies of the framework services through the usage and results of the domain-specific applications that implement the concrete scenarios. The main goal of the OpenMinTeD evaluation framework is to examine the technical and functional capabilities of the framework services and workflows, as well as the integration and creation of resources and components needed to generate results from text mining workflows, following the design of the community use case applications.

The content of this deliverable is complementary to D4.3 (OpenMinTeD Functional Specifications), D4.4 (Community Evaluation Scenarios Definition Report), D4.1 (Requirements Methodology), which defines the actual community use cases, the profiling of users and the questionnaires needed to capture the user requirements, and D4.2 (Community Requirements Analysis Report), which covers all use cases in detail, together with the actors and workflows used for extracting the key performance indicators for the community evaluation. This deliverable also introduces the evaluation methodology and questionnaires that formalize this approach.

 

 


2. General evaluation strategy description

The OpenMinTeD evaluation process is structured into two evaluation cycles (or phases) that will be supported by the development of the OpenMinTeD evaluation framework (a multi-step evaluation strategy). During each evaluation phase, a set of validation scenarios will be defined to address the heterogeneity and particularities of (1) the various community use cases on one side, and (2) the end users or OpenMinTeD framework actors on the other side. For each use case and framework user, a set of evaluation criteria, derived from key indicators extracted from the community use cases, will be captured and analyzed. The CNIO partners will be in charge of collecting and characterizing the minimal set of evaluation criteria.

The goal of the evaluation process is to determine to which extent the OpenMinTeD framework supports a given expected evaluation indicator or criterion, and to build a use case centered evaluation infrastructure, the OpenMinTeD evaluation framework. The evaluation indicators include metrics that assess performance as well as annotation quality at the level of the individual components required to extract the type of information demanded by each community use case. These indicators also cover adequacy, quality and usability assessment at the level of text mining workflows and their support by the OpenMinTeD framework through guided use, testing, reporting, cognitive walkthroughs¹ and structured questionnaires. Figure 1 provides a general overview of the evaluation cycle, starting from descriptive community use cases and leading to the actual evaluation of the OpenMinTeD framework through use case text mining workflows.

 Figure  1.  Evaluation  flowchart  cycle.  

 

¹ In a cognitive walkthrough, the usability of the system is evaluated by a group of inspectors who work through a series of tasks and ask a set of questions from the perspective of the user. Its main aim is to assess ease of understanding and learning.


3. Community use case component evaluation scenarios

Each of the nine OpenMinTeD community use cases is characterized by very particular user demands in terms of the underlying document collections of interest, named entity types, relation types, or ontologies and databases used for data integration and entity grounding.

In order to be able to carry out a formal component-level evaluation and to perform a comparative performance analysis across the various use case scenarios, several critical key steps have been identified. Steps 1-3 are prior formal requirements for the evaluation process. The key steps consist of:

1. Gathering, revision and refinement of descriptive use cases
2. Transformation of descriptive use cases into formal annotation workflows
3. Design of text mining pipelines for use case workflows
4. Selection of key component tasks for evaluation purposes
5. Definition of the evaluation setting: Gold Standard corpora (intrinsic/extrinsic, guidelines, metrics)
6. Accessibility and integration of component applications into the OpenMinTeD framework

 

Although  different  formal  evaluation  scenarios  are  needed  to  address  text  mining  demands  associated  with   the   types   of   users   and   tasks   defined   in   the   individual   community   use   cases,   the   OpenMinTeD  evaluation   framework   will   also   identify   and   capture   upper   level   commonalities   encountered   across  multiple  evaluation  scenarios.  

The OpenMinTeD evaluation framework, coordinated by CNIO, will collect key attributes capturing the use case specific Gold Standard datasets used for the evaluation of the key component tasks of the annotation workflows. A standard evaluation setting shall be followed, avoiding any overlap between training and test sets and documenting the characteristics of the data selection. Annotation criteria will be carefully described to exclude evaluation biases. These descriptions will also include baselines wherever possible (if possible these will be quantitative baseline scores, otherwise they will consist of a formal description of the manual procedures that are necessary to obtain the same information). Baselines in this context are needed because, without them, it is hard or impossible to assess the generated results from a perspective outside of each use case field.
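As an illustration of the attributes that such a dataset description could capture, the sketch below shows a minimal, hypothetical characterization record; the field names are illustrative only and do not correspond to a prescribed OpenMinTeD schema.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class GoldStandardCharacterization:
    """Hypothetical record of the Gold Standard aspects described above (see figure 2)."""
    use_case: str             # community use case identifier, e.g. "AS-A"
    corpus_description: str   # document selection / source collection
    entity_types: List[str]   # annotated entity or relation types
    guideline_reference: str  # pointer to the annotation guideline document
    train_size: int           # number of documents in the training set
    test_size: int            # number of documents in the held-out test set
    baseline: str = "manual procedure"  # quantitative baseline score or manual procedure
    metrics: List[str] = field(default_factory=lambda: ["precision", "recall", "F-score"])

    @staticmethod
    def no_overlap(train_ids: Set[str], test_ids: Set[str]) -> bool:
        """Check the requirement that training and test documents do not overlap."""
        return train_ids.isdisjoint(test_ids)
```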

Based on the detailed examination of the use case descriptions and the "Community Evaluation Scenarios Definition Report" (D4.4), the following collection of evaluation scenarios at the component level has been identified:


 Figure  2.  OpenMinTeD  Gold  Standard  dataset  characterization  aspects.  

3.1 Theme: Agriculture / Biodiversity

3.1.1 Use case: AGRIS: Automatic extraction of topics, location and figures from publications [AS-A]

Extracting information about topics, locations, and data types such as images or figures from publications of the agricultural sector.

● Potential component tasks: automatic recognition of mentions of domain-specific topics, locations, viticulture terms and targets, and water pathogens.

● Potential   evaluation   scenarios:   extraction   of   AGROVOC   thesaurus   terms,   using  Agrotagger   and   Stanford   NER.   The   output   of   these   systems   will   be   used   as  baselines,  and  compared  against  the  Gold  Standard  datasets.    

● Component centric evaluation setting: development of Gold Standard evaluation corpora of viticulture documents and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important); a minimal scoring sketch is given after this list.

● Integration validation: REST-based TDM web service.
● Evaluation data: provided by Agroknow, which will also carry out the evaluation.
● Annotation description: a short structured annotation guideline document and characterization description will be prepared and revised by the use case experts for each Gold Standard dataset. This document will cover (as completely as possible) the key aspects highlighted in figure 2.
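To make the comparative evaluation concrete, the following sketch shows how precision, recall and F-score could be computed for extracted mentions against a Gold Standard. It is a minimal illustration that assumes annotations are represented as simple (document, span, type) tuples rather than any particular OpenMinTeD annotation format.

```python
from typing import Set, Tuple

# An annotation is represented here as (document_id, start_offset, end_offset, entity_type).
Annotation = Tuple[str, int, int, str]

def precision_recall_fscore(predicted: Set[Annotation],
                            gold: Set[Annotation]) -> Tuple[float, float, float]:
    """Strict matching: a prediction counts as correct only if the exact
    same tuple appears in the Gold Standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score

# Example: two correct mentions, one spurious prediction, one missed gold mention.
gold = {("doc1", 0, 7, "LOCATION"), ("doc1", 20, 28, "TOPIC"), ("doc2", 5, 12, "TOPIC")}
pred = {("doc1", 0, 7, "LOCATION"), ("doc1", 20, 28, "TOPIC"), ("doc2", 40, 45, "TOPIC")}
print(precision_recall_fscore(pred, gold))  # approximately (0.67, 0.67, 0.67)
```

The same scoring scheme applies to the component-centric evaluation settings of the other use cases below.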


3.1.2 Use case: Food Safety: Automatic extraction of location information from publications [AS-B]

Locating foodborne outbreaks and food alerts from food safety publications.

● Potential component tasks: automatic recognition of mentions of food, foodborne outbreaks, food alerts, food safety issues, food recalls, geographic information (e.g. countries), and extraction of complex events/relationships between geographic locations and foodborne disease outbreaks.

● Component  centric  evaluation  setting:  development  of  Gold  Standard  evaluation  corpora   and   comparative   evaluation   using   standard   metrics:   precision  (mandatory),  recall  (important),  F-­‐score  (important).  

● Integration validation: REST-based TDM web service.
● Evaluation data: provided by Agroknow, which will also carry out the evaluation.
● Annotation guidelines: a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts for each Gold Standard dataset. This document will cover (as completely as possible) the key aspects highlighted in figure 2.

3.1.3 Use case: Microbial biodiversity [AS-C]

Relations between microorganisms, habitats and some properties of the microorganisms, normalized using metagenomics knowledge bases.

● Potential component tasks: automatic recognition of mentions of food, microorganisms, habitats, microorganism phenotypes and microorganism-derived chemical entities; entity grounding (ontology linking/grounding of these entities, e.g. NCBI Taxonomy (microorganism), InChI key (chemical), OntoBiotope (habitat)). Relation types: '[MICROORGANISM] lives in [HABITAT]' and '[MICROORGANISM] produces [CHEMICAL]'.

● Component  centric  evaluation  setting:  development  of  Gold  Standard  evaluation  corpora   and   comparative   evaluation   using   standard   metrics:   precision  (mandatory),  recall  (important),  F-­‐score  (important).  

● Integration validation: REST-based TDM web service.
● Evaluation data: provided by Agroknow/INRA (BioNLP-ST BB task corpus), which will also carry out the evaluation.
● Annotation guidelines: a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts for each Gold Standard dataset. This document will cover (as completely as possible) the key aspects highlighted in figure 2.

 


3.1.4 Use case: Linking Wheat data with literature [AS-D]

Phenotype-related information on plants extracted from scientific articles.

● Potential   component   tasks:   automatic   recognition  of  mentions  of  plants   (taxon  names),  gene/protein  mentions,  plant  genetics,  plant  phenotypes,  gene  markers,  entity  grounding  (ontology  linking  of  these  entities).  

● Component  centric  evaluation  setting:  development  of  Gold  Standard  evaluation  corpora   and   comparative   evaluation   using   standard   metrics:   precision  (mandatory),  recall  (important),  F-­‐score  (important).  

● Integration validation: REST-based TDM web service.
● Evaluation data: evaluation data for the particular key tasks will be provided/revised by INRA.
● Annotation guidelines: a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts for each Gold Standard dataset. This document will cover (as completely as possible) the key aspects highlighted in figure 2.

 

3.1.5 Use case: Information extraction of mechanisms involved in plant development [AS-E]

Help   in   plant   breeding,   plant   reproduction   and   seed   development.   The   information  extracted  is  complementary  to  the  databases.  

● Potential  component  tasks:  automatic  recognition  of  mentions  of  plants  (taxons),  plant   reproduction/anatomy   terms,   plant   seed   development   (development  stages)   terms,   gene/protein   mentions,   relation   extraction   of   associations  between   plant-­‐genes-­‐seed   development/anatomy   terms,   protein-­‐promoter  binding   relations,   entity   and   concept   grounding   (ontology   linking   of   these  entities).  

● Component  centric  evaluation  setting:  development  of  Gold  Standard  evaluation  corpora   and   comparative   evaluation   using   standard   metrics:   precision  (mandatory),  recall  (important),  F-­‐score  (important).  

● Integration validation: explore the use of a REST-based TDM web service.
● Evaluation data: adaptation/exploitation of existing Gold Standard datasets from the BioNLP-ST 2016 SeeDev task, plus additional training data generated by consortium partners for entity normalization and ontology-grounded terms.
● Annotation guidelines: a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts for each Gold Standard dataset. This document will cover (as completely as possible) the key aspects highlighted in figure 2.

 


3.2 Theme: Life Sciences

3.2.1 Use case: Extract metabolites and their properties and modes of action [LS-A]

Help curators of metabolism databases by providing information tuples about metabolites from the published literature.

● Potential component tasks: literature triage (classification and ranking of curation-relevant articles and/or text passages); automatic recognition of mentions of chemical names, chemical classes, chemical structures, organisms/species, organism parts (taxon names), biological activities and biological targets; entity grounding (ontology linking of these entities, e.g. ChEBI, NCBI Taxonomy).

● Component  centric  evaluation  setting:  development  of  Gold  Standard  evaluation  corpora   and   comparative   evaluation   using   standard   metrics:   precision  (mandatory),  recall  (important),  F-­‐score  (important).  

● Integration validation: explore the use of a REST-based TDM web service.
● Evaluation data: adaptation/exploitation of existing Gold Standard datasets from BioCreative, BioNLP-STs, the LINNAEUS corpus and others.
● Annotation guidelines: a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts for each Gold Standard dataset. This document will cover (as completely as possible) the key aspects highlighted in figure 2.

3.2.2 Use case: Neuroscience [LS-B]

A systematic way to curate the literature relevant to neuroscientific data.

● Potential component tasks: automatic detection of figure legends and table legends (optional), recognition of modeling parameters in text, recognition of experimental values, neuroscience keyword extraction, detection of cell types (e.g. neurons by NeuroNER), detection of brain regions, detection of synapses, detection of species (taxa), detection of genes/proteins, detection of chemical substances of relevance for neuroscience, and entity grounding (ontology linking of these entities).

● Component  centric  evaluation  setting:  development  of  Gold  Standard  evaluation  corpora   and   comparative   evaluation   using   standard   metrics:   precision  (mandatory),  recall  (important),  F-­‐score  (important).  

● Integration validation: explore the use of a REST-based TDM web service.
● Evaluation data: adaptation/exploitation of existing Gold Standard datasets from the GENIA corpus and the LINNAEUS corpus, and preparation of additional Gold Standard data by consortium partners.
● Annotation guidelines: a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts for each Gold Standard dataset. This document will cover (as completely as possible) the key aspects highlighted in figure 2.

3.3 Theme: Social Sciences

3.3.1 Use case: Facilitation of complex information linking and retrieval from social sciences publications [SS-A]

Automatic   detection,   disambiguation   and   linking   of   entities   in   Social   Science   text  corpora  to  enhance  indexing  and  searching.  

● Potential  component  tasks:  automatic  recognition  of  named  entity  mentions  and  disambiguation   (overall   12   subtypes   and  4   coarse   types),   keyword  assignment,  variable  mention  detection,  entity  grounding  (ontology  linking  of  these  entities).  

● Component centric evaluation setting: development of Gold Standard evaluation corpora and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important). Complementary evaluation measures can be added; however, as these are the most commonly used metrics and can be applied across use cases, they will be the primary evaluation metrics.

● Integration validation: explore the use of a REST-based TDM web service.
● Evaluation data: the use case responsible partner (GESIS) will be in charge of compiling/constructing suitable Gold Standard corpora/datasets.
● Annotation guidelines: a short structured annotation guideline document and characterization description will be prepared/revised by the use case responsible partner for each Gold Standard dataset. This document will cover (as completely as possible) the key aspects highlighted in figure 2.

3.4 Theme: Scholarly Communication

3.4.1 Use case: Research Analytics [SC-A]

An innovative framework for entity extraction from publications and for automated and extendible multidimensional analysis of all scholarly content.

● Potential component tasks: automatic recognition and disambiguation of entity mentions including projects/project codes, funders/funding information, Protein Data Bank (PDB) codes, extraction of patent citations, topic (thematic information) identification, research trend identification, and automatic recognition of stop words for specific domains (pre-processing task).


● Evaluation data: the use case responsible partners, in collaboration with OpenAIRE data engineers, the marketing department and OMTD end users, will be responsible for the creation of the evaluation data (annotation guidelines and component-centric evaluation datasets).

 

Overall,   three   types   of   workflow   component   services   can   be   distinguished   across   the   different   use  cases:   conventional   components,   adaptation   of   general   components   to   use   case   requirements   and  development  of  new  specialized  components.  

3.5 Community use case evaluation scenario actors

There are several actors (platform user validators) identified from the use cases introduced in D4.2. They can be roughly classified into developers of text mining systems, consumers of text mining systems, generators of text annotations and Gold Standard datasets, and validators. A more detailed list of evaluation actors comprises:

● Content providers (publishers): provide, integrate, support and maintain access to the documents.

● Database curators: carry out literature curation and functional annotations.

● Text, NLP & data mining (TDM) experts: high level of knowledge about text mining and the framework.

● Database, application and service developers (computer scientists, bioinformaticians, other computing fields): provide hardware and software services for creating and improving access to, and the different uses of, the framework.

● Legal experts: provide legal advice on copyright topics, international intellectual property and related laws.

● Researchers (experts in specific and different sciences): improve the annotations using text mining and provide feedback to the framework by evaluating it. Basic to medium level of knowledge about text mining and the framework.

● End users (final users, industry, business): no knowledge of or experience with text mining and the framework is required.

The community evaluation scenario requires the construction, definition, release and iterative refinement (validation cycles and procedures) of material to support the training of validators (both general and actor-specific training material versions). Each use case should have at least one validator, with the aim of having at least two validators in the case of more mature use case scenarios. Note that, as some use cases overlap in terms of the required expertise, validators may take on multiple roles, i.e. work for multiple scenarios. This will enable validators to acquire the minimal skills and knowledge needed to execute and evaluate the tasks underlying each validation scenario.


OpenMinTeD   validator   training   material   will   include   tutorials,   videos   and   written   documentation  specifying  the  task  aim,  example  cases  as  well  as  the  description  of  the  evaluation  task  procedure.  

The   OpenMinTeD   evaluation   framework   is   organized   into   two   evaluation   phases   to   cover   different  important   characteristics   related   to   platform   validation   scenarios.   During   both   phases,   the  OpenMinTeD   evaluation   framework   comprises   task   description   material,   validation   forms   and  evaluation  templates  to  capture  validators’  answers  and  requests  referring  to  validation  criteria.    

For each of the community evaluation scenarios, a validator recruitment strategy will be provided to document the selection process and the assignment of evaluation actors to particular evaluation tasks.

Moreover, evaluation actors (validators) are classified into internal and external validators and platform users. Internal validators are members of the OpenMinTeD consortium, while external validators correspond to users of the platform from outside the consortium. During evaluation phase I only internal validators will be recruited; during the second evaluation phase, internal validators will be engaged first, and in a final stage a selected target group of five external validators will be enrolled to validate the platform.

3.6 Evaluation phase I: use-case component centric

The first evaluation phase will focus on the assessment of the single critical components for the specific use case tasks. A minimal evaluation, covering aspects related to the performance, access and integration of those components, is essential in order to be able to run the text mining workflows associated with a given use case. Therefore, for each community use case we will define a minimal evaluation form capturing key aspects of the evaluation setting, such as Gold Standard datasets/corpora, a structured definition of the component tasks, a formal annotation workflow, the corresponding text mining workflow definition, workflow resource characteristics, test frame, test procedure, test results, absolute evaluation, and comparative evaluation.

The  evaluation  scenarios  of  OpenMinTeD  component-­‐centric  assessment  will  include:  

● For   each   use   case,   we   will   engage   domain   experts   to   revise   the   selection,   annotation   and  documentation  of  the  corresponding  gold  standard  datasets.  

● Key  component  resources  will  be  tested  against  the  gold  standard  datasets  to  determine  their  respective  performance.  

● Functionality   and   usability   of   a   selected   number   of   specific   components   will   be   assessed   by  expert  text  mining  researchers  in  community  shared  tasks.  

● The  interoperability  of  tasks  and  their  integration  into  the  OMTD  framework  will  be  addressed  separately  from  the  functional  evaluation.  

 

The  procedure  for  the  design  of  the  use  case  specific  evaluation  scenarios  will  be  developed  and  tested  on  the  basis  of  one  selected  representative  use  case  that  fulfills  the  following  criteria:    


● It is well defined.
● It requires no (or few) third-party components.
● It contains tasks and components that are of interest for other use cases in more than one domain.

 Figure  3.  OpenMinTeD  evaluation  framework  for  component  assessment  (evaluation  phase  I).  

 

3.6.1 Criteria of assessment

● For each use case, select the use case descriptions and generate a formal definition of a common component task template (initial version), and iteratively refine the template based on feedback from the use case responsible partner to generate the final template.

● For   each   use   case,   generate   a   diagram   representing   the   design   of   the   formal   text   mining  workflows.  

● For each use case text mining workflow, identify, implement and integrate the workflow components and validate their compliance with the OMTD functional specifications.

● For each use case, identify the required evaluation resources, in terms of existing resources versus resources to be implemented, the definition of manual/automated Gold Standard creation steps, and the characterization of the needed lexical resources.


● Evaluate individual components with a focus on (1) quality, (2) technical and (3) functional aspects. This will be associated with the generation of GS data, the implementation of the technical assessment infrastructure, a functional specification compliance check, and an analysis of standards for registration / import to OMTD.

For those component-level tasks that correspond to basic named entity recognition (and grounding) and document concept indexing modules, a technical infrastructure in the form of the OpenMinTeD interoperability and performance evaluation server (named Becalm) has been implemented (figure 3). This service offers the possibility to carry out a technical validation in terms of web service monitoring, component accessibility validation and the evaluation of NER and concept indexing systems at three different levels:

1. Data level: mapping into common formats (BioC XML, JSON, BioC JSON, BeCalm TSV, PubAnnotation).
2. Technical level: stability, status, response time, batch processing, component time slot.
3. Functional specification level: metadata requirements/guidelines and annotation of services through predefined metadata types and controlled vocabularies.

These three levels are important for the integration of components into the OpenMinTeD text mining workflows. Although such systems are initially accessed and tested through simple REST (Representational State Transfer) API applications, we foresee the possibility of converting such services into other, more powerful execution schemas, system distribution types and service packaging to improve the scalability of these services. In addition, the Becalm server, through a community challenge evaluation effort (the TIPS task: Technical Interoperability and Performance of annotation Servers), will evaluate third-party NER components of relevance for various use case scenarios from the thematic areas of agriculture and life sciences.
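A technical-level check of this kind can be illustrated with the following minimal sketch, which probes a REST annotation service for availability, HTTP status and response time; the endpoint path and JSON payload shape are hypothetical and stand in for whichever interface a registered component actually exposes.

```python
import json
import time
import urllib.request

def check_annotation_server(base_url: str, sample_text: str, timeout: float = 30.0) -> dict:
    """Probe a REST annotation server at the technical level: availability,
    HTTP status and response time for a single annotation request."""
    payload = json.dumps({"text": sample_text}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/annotate",  # hypothetical endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            body = response.read()
        elapsed = time.monotonic() - start
        return {"reachable": True, "status": status,
                "response_time_s": round(elapsed, 3), "bytes": len(body)}
    except Exception as error:  # connection refused, timeout, HTTP error, ...
        return {"reachable": False, "error": str(error),
                "response_time_s": round(time.monotonic() - start, 3)}

# Example call (the server URL is a placeholder):
# print(check_annotation_server("http://example.org/tdm-service", "E. coli lives in the gut."))
```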

3.7 Evaluation phase II: use-case workflow centric

OpenMinTeD will offer a suite of services to make the infrastructure components visible and accessible to all:

Registry Service: Validate the support for discovering tools & services, scientific content and language resources within the OMTD registry service. The registry provides a full documentation and registration mechanism for all types of resources available in the infrastructure, implementing widely agreed and used metadata models and covering all facets of documentation, from persistent identification and versioning, technical specifications and software dependencies, to rights of use, location and deployment instructions.

Workflow Service: Provides users with the ability to mix and match text-mining services in workflows, i.e. by using the best-of-breed text mining components for a given task. Its implementation and use delves into component-level interoperability. Its main building blocks are:

 D4.5  -­‐  Community  Evaluation  Methodology    

•  •  •  

Public     Page  20  of  29  

DSL: the OpenMinTeD scripting language is a Domain Specific Language that substitutes the workflow editor.

Engine: a DSL interpreter; a minimal engine executes one component at a time in sequence and substitutes the workflow service.

Adapters: abstract away from different architectures, component life cycle models, and data models.

Data converters: lazily convert data between components as necessary

Components: the minimal execution units of the workflow (adapters, data converters).

Catalog: a very simple JSON-based component catalog substitutes the registry service (a minimal sketch of such a catalog entry follows this list).

Repository: components are packaged as Maven artifacts and stored in a Maven repository (at the moment GATE and DKPro Core); they are obtained automatically when used in a workflow. This substitutes cloud-oriented deployment to some extent.

Remote service: web services, mostly third party NLP tools (figure 4).

Annotation Service: Implements the interoperability specifications for annotations, showcasing common representations and protocols, including quantitative indications of the quality of the automatic (or manual) processing applied to the results.
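As an illustration of the JSON-based component catalog mentioned above, the sketch below loads a single, hypothetical catalog entry and looks it up by identifier; the field names and the Maven-style artifact coordinate are illustrative and do not reflect the actual OMTD metadata schema.

```python
import json

# A minimal, hypothetical catalog with one component entry.
CATALOG_JSON = """
[
  {
    "id": "example-ner-component",
    "description": "Named entity recognizer for species mentions",
    "version": "0.1.0",
    "artifact": "org.example:example-ner:0.1.0",
    "input_format": "text/plain",
    "output_format": "BioC XML",
    "license": "Apache-2.0"
  }
]
"""

def find_component(catalog: list, component_id: str) -> dict:
    """Look up a component entry by its identifier."""
    for entry in catalog:
        if entry["id"] == component_id:
            return entry
    raise KeyError(f"component not found: {component_id}")

catalog = json.loads(CATALOG_JSON)
print(find_component(catalog, "example-ner-component")["artifact"])
```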

These OpenMinTeD framework services (registry, workflow, annotation) will each be assessed by each use case in this scenario, following an approach for the evaluation of workflow services and frameworks that has been used successfully in projects of similar characteristics. We will provide usage instructions to the various types of users/communities, according to which they will test each of the framework services independently. Their experiences will then be collected by means of a structured survey with four evaluation levels:

Full - fully compliant

Partial - partially compliant. E.g. some parts of a product are compliant but not all. This is typically the case if a product is in a state of transition from a non-compliant to a compliant state.

No - not compliant.

N/A - not applicable. This is expected to occur mainly for concrete requirements, if a certain requirement is not applicable for a certain implementation, e.g. a requirement on remote API access for a tool that does not offer a remote API.
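As a small illustration of how answers collected on this four-level scale could be summarized per service, the sketch below tallies a few hypothetical responses (keyed by the requirement identifiers used in the appendix questionnaires) and lists the items needing follow-up; the sample answers are invented for illustration only.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical survey answers: requirement identifier -> compliance level.
SAMPLE_ANSWERS: Dict[str, str] = {
    "RS-1": "Full",
    "RS-2": "Partial",
    "RS-3": "Full",
    "RS-7": "N/A",
}

def summarize_compliance(answers: Dict[str, str]) -> Dict[str, object]:
    """Tally the four compliance levels and list requirements needing follow-up."""
    counts = Counter(answers.values())
    follow_up: List[str] = [req for req, level in answers.items()
                            if level in ("Partial", "No")]
    return {"counts": dict(counts), "follow_up": follow_up}

print(summarize_compliance(SAMPLE_ANSWERS))
# {'counts': {'Full': 2, 'Partial': 1, 'N/A': 1}, 'follow_up': ['RS-2']}
```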

 


 Figure  4.  Workflow  service  architecture  (from  OpenMinTeD  Webinar  2).  

General validation instructions will be distributed to each end user of the evaluation framework to carry out the validation experiments. The instructions will include material such as tutorials and task definitions, including easy-to-follow task completion steps, validation walkthroughs and validation forms to enable the completion of the structured surveys.

The  proposed  evaluation  strategy  was  inspired  by  input  and  feedback  of  WP5  as  well  as  existing  efforts  and  systems:  

● U-­‐Compare   [3-­‐4]:   allows   a   user   to   easily   configure   a  workflow   of   components.   The   user   can  easily   exchange   components   in   their   workflow   to   see   how   exchanging   pre-­‐processing  components  can  make  a  difference  to  their  overall  results  

● Argo [5]: a multi-purpose workflow management system built upon the UIMA standard. Argo provides a user with a library of atomic components, each of which provides a specific form of annotation upon an input text. These annotations can be processed iteratively by combining multiple components into a workflow. Workflows can be evaluated within Argo through the use of dedicated evaluation components, or by exporting annotations in a variety of formats for further processing.

● Becalm [1]: the BioCreative MetaServer (BCMS) was the first distributed prototype framework for requesting, retrieving and unifying biomedical textual annotations, developed under the umbrella of the BioCreative II protein-protein interaction task; since then, several BioCreative tasks have tried to promote the development of online text annotation servers.

● Panacea [2]: a STREP project under EU FP7 that has developed a factory of Language Resources (LRs) in the form of a production line that automates all steps involved in the acquisition, production, maintenance and updating of the LRs required by Machine Translation and other Language Technologies.

● WP5: this evaluation strategy can only be understood in the context of the Interoperability Framework of OpenMinTeD. Therefore, it requires constant input and feedback from this work package and the future task 5.4, "Alignment of service and content provider systems".

 

Workflow  evaluation  will  focus  on  three  points:  

● Technical and functional evaluation: latency times, availability and query limits.
● Metadata and annotations database evaluation: quality and quantity of the annotations and metadata available.
● Workflow manager evaluation: interoperability of components, web services and formats.

 

3.7.1 General criteria of assessment

In order to capture evaluation aspects related to the usability of the OpenMinTeD platform, the evaluation framework must examine how efficiently the platform can:

● Solve generic tasks,
● Be customized,
● Be used in an interactive and collaborative manner.

 

Those  general  evaluation  criteria  should  therefore  enable  the  validation  of  four  characteristics:  

● Generic: evaluate to which degree the OpenMinTeD platform can handle general purpose/domain tasks. Therefore, it will be examined whether it supports a variety of formats and open standards to allow the evaluation of different domains and use cases.

● Customisable: the evaluation framework should determine to which degree the OpenMinTeD platform enables modularization and the exposure of different types of pre- and post-processing components.


● Interactive:   the   evaluation   framework   should   determine   to   which   degree   the   OpenMinTeD  platform   supports   user-­‐interactive   processing   components   and   inspection/intervention   of  automatically  created  text  mining  pipelines  that  can  support  evaluation  of  modules.  

● Collaborative:   the  evaluation   framework  should  determine   to  which  degree   the  OpenMinTeD  platform  supports  multi-­‐user  collaboration  and  simultaneous  modifications  of  e.g.  annotations  or  components  in  the  pipeline.  

 

The individual components should also be evaluated in the context of different actionable workflows. These workflows should be checked with regard to their compliance with the use case demands, and refined or adapted if required during the deployment process. The community should validate the workflows through a formal usability evaluation inspired, whenever possible, by functional specification criteria (figure 5).

 Figure  5.  OpenMinTeD  framework,  general  evaluation  levels.  

 


4. References

[1] http://www.becalm.eu/
[2] http://www.panacea-lr.eu/
[3] Y. Kano, W. A. Baumgartner, L. McCrohon, S. Ananiadou, K. B. Cohen, L. Hunter, and J. Tsujii, "U-Compare: share and compare text mining tools with UIMA," Bioinformatics, vol. 25, no. 15, pp. 1997-1998, 2009.
[4] Y. Kano, M. Miwa, K. B. Cohen, L. Hunter, S. Ananiadou, and J. Tsujii, "U-Compare: a modular NLP workflow construction and evaluation system," IBM J. Res. Dev., vol. 55, no. 3, pp. 11:1-11:10, 2011.
[5] R. Rak, A. Rowley, W. Black, and S. Ananiadou, "Argo: an integrative, interactive, text mining-based workbench supporting curation," Database, vol. 2012, Jan. 2012.


5. Appendix

The appendix comprises sample structured questionnaires from the OpenMinTeD evaluation framework, which should be completed by the particular actors defined earlier in this document. Training material will be provided to the actors before they fill in the questionnaires, to facilitate a correct evaluation. Each question has a unique reference.

5.1 Questionnaire for evaluation phase I: use-case component centric

Template to be filled in for the component evaluation.

Questions  

Kind  of  component  

 

Use  case  

 

Document  selection  (corpora)  

 

Annotation  format  

 

Annotation  editor  

 

Manual/Semi-­‐automatic/Automatic  annotation  

 

Task  

1. Did  the  component  do  its  job  correctly?    

2. Does  any  mismatch  exist  between  the  component  and  its  description  (behavior,   input,  output...)?  

 

3. Could the component be improved in some way?

Evaluation  scenario  


1. Can  the  component  be  used  in  the  scenarios  for  which  it  was  designed?    

2. Could  constraints  exist  for  any  situation  (extreme  or  not)?    

Gold  standard  

Describe  the  gold  standard  used  for  the  evaluation  of  the  component.    

 

By  default:  

● Precision  (mandatory)    

● Recall  (important)    

● F-­‐score  (important)    

Component  access  

1. Does the component have a web API? Benchmark the time latency.
2. Measure the time elapsed between submitting the input and receiving the result.
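A minimal latency benchmark of the kind requested above could look like the sketch below; the `process` callable is a placeholder for any wrapper around the component under test (for instance a web API client) and is not part of the OpenMinTeD API.

```python
import statistics
import time
from typing import Callable, List

def benchmark_latency(process: Callable[[str], object],
                      inputs: List[str],
                      repeats: int = 3) -> dict:
    """Measure the time elapsed between submitting an input to a component
    and receiving its result, over several runs."""
    timings = []
    for _ in range(repeats):
        for text in inputs:
            start = time.monotonic()
            process(text)
            timings.append(time.monotonic() - start)
    return {
        "min_s": min(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
        "runs": len(timings),
    }

# Example with a trivial stand-in component:
dummy_component = lambda text: text.upper()
print(benchmark_latency(dummy_component, ["sample document text"], repeats=5))
```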

 

5.2 Evaluation phase II: use-case workflow centric

The questionnaires are an integral part of evaluation phase II, with the aim of capturing key performance indicators of the OpenMinTeD framework services¹.

¹ https://builds.openminted.eu/job/WP%205.2%20-%20Interoperability%20Specification/eu.openminted.interop$openminted-interoperability-spec/doclinks/1/openminted-interoperability-spec.html#_requirements

5.2.1 Registry service scenario

In this scenario the OpenMinTeD registry is evaluated. The actor has to connect to the registry, explore the different knowledge resources, choose several of them, and use them in a workflow.

Steps

1. Connect to the OpenMinTeD registry.
2. Select one component from the catalog.
3. Check the information of the component.



4. Repeat steps 2 to 3 for several components.

(Expert actors)

5. Create and add a new resource to the registry.
6. Use the new resource in a workflow.
7. Use another resource in that workflow.

 

Questions

Availability

1. RS-1 Are the resources of the registry well described (license, content format, e.g. XML, DOCX, version) and accessible via URL? [Requirements 32, 33, 39]. Full-Partial-No-N/A

2. RS-2 Are the resource categories well tagged? [Requirement 36]. Full-Partial-No-N/A

3. RS-3 Are resources downloadable, or are external web services and knowledge resources well identified, to ensure that the processed data does not leave a particular institution? [Requirement 38]. Full-Partial-No-N/A

Adaptability (expert actors)

1. RS-4 Is it possible to create and add new knowledge resources to the registry? Full-Partial-No-N/A

2. RS-5 Is it possible to adapt knowledge resources of the registry to be used by the workflow? Full-Partial-No-N/A

Interoperability (expert actors)

1. RS-6 Is it possible to aggregate information from different knowledge resources of the registry to be used by the workflow? Full-Partial-No-N/A

Security

1. RS-7 Are the registry resources only reachable by authenticated people and components? Full-Partial-No-N/A

 


5.2.2 Workflow service scenario

In this scenario an example workflow is evaluated. The actor has to create a new workflow, add components and produce an output.

Steps

(Expert  actors)

1. Create a new workflow.
2. Add one resource from the registry.
3. Add one component.
4. Repeat step 3 as many times as needed.
5. Link all components and produce an output.

 

Questions

Generic

1. WS-1 Is the workflow described using a uniform language? [Requirement 18]. Full-Partial-No-N/A

2. WS-2 Are components downloadable, or are external web services and knowledge resources well identified, to ensure that the processed data does not leave a particular institution? [Requirement 28]. Full-Partial-No-N/A

Customisable

1. WS-3 Are the configuration and parameterization options of the components well identified and documented? [Requirement 21]. Full-Partial-No-N/A

Interactive

1. WS-4 Are the components of the workflow well described (license, functionality, input and output formats, input and output language dependence, citation information, version) and accessible via URL? [Requirements 4, 10, 12, 13, 14, 43, 45, 57]. Full-Partial-No-N/A

2. WS-5 Do the components detail all their environmental requirements for execution? [Requirement 5]. Full-Partial-No-N/A

3. WS-6 Are the component categories well tagged? [Requirements 8, 40, 90]. Full-Partial-No-N/A

4. WS-7 Do components handle failures (connection loss, inconsistent state) gracefully? [Requirement 27]. Full-Partial-No-N/A

Collaborative

1. WS-8 Can the workflow be used as a component of another workflow? [Requirement 24]. Full-Partial-No-N/A

2. WS-9 Is it possible to create and add new components and use them inside the workflow? Full-Partial-No-N/A

5.2.3 Annotation service scenario

In this scenario the annotated data is evaluated. The actor has to review the annotated data from a workflow.

Steps

1. Open an annotated output.
2. Analyze the content and draw conclusions.

(Expert  actors)  

3. Use it in another workflow or tool.

Questions

Functionality

1. AS-1 Is it possible to determine the source of an annotation/assigned category? [Requirement 26]. Full-Partial-No-N/A

2. AS-­‐2  Are  all  the  entities  well  annotated?  Full-­‐Partial-­‐No-­‐N/A  

3. AS-­‐3  Can  the  annotations  be  tagged?  Full-­‐Partial-­‐No-­‐N/A  

4. AS-­‐4  Are  the  annotations  well  related?  Full-­‐Partial-­‐No-­‐N/A  

Exportability

1. AS-5 Can another tool import the annotated data? Full-Partial-No-N/A

Reusability

1. AS-­‐6  Could  some  other  workflow  reuse  the  annotated  data?  Full-­‐Partial-­‐No-­‐N/A