Evaluating SME-Elicited Knowledge


Transcript of Evaluating SME-Elicited Knowledge

Page 1: Evaluating SME-Elicited Knowledge


Evaluating SME-Elicited Knowledge

Julie Fitzgerald, Mike Pool, Bob Schrag
Information Extraction and Transport, Inc.

Arlington, Virginia
August 9, 2003

Page 2: Evaluating SME-Elicited Knowledge


Background Information

Evaluations conducted for DARPA’s Rapid Knowledge Formation (RKF) program
- Goal was development of tools for eliciting formalized knowledge from KR-naïve subject matter experts (SMEs)
- Two system integrators
  - Cycorp developed the KRAKEN system: a large knowledge-based system, primarily NL-driven
  - SRI developed the SHAKEN system: a smaller, more modular knowledge-based system, primarily graphical
- Program evaluation was driven by Challenge Problems
  - Y1 (summer 2001, January 2002): Cell Biology
    - Asked: “Can SMEs teach RNA transcription to a knowledge base?”
  - Y2 (September 2002): COA Critiquing
    - Asked: “Can SMEs teach systems to critique a rudimentary military course of action (COA) w.r.t. a number of critiquing criteria?”

Page 3: Evaluating SME-Elicited Knowledge


RKF Evaluation Objectives

- Can subject matter experts (SMEs) author KBs using RKF tools / processes?
- How good is the authored knowledge?
  - How well does it work w.r.t. a performance task (e.g., textbook question answering, military course of action (COA) critiquing)?
  - How general / robust is the system?
  - How much of the knowledge was reused / can be reused?
- How do KBs built by SMEs compare to KBs built by KEs?
- Who (between KEs and SMEs) did what?
- What kinds of knowledge did SMEs build well?
- How long did it take?
- What was enjoyable for SMEs?

Page 4: Evaluating SME-Elicited Knowledge


Overview

- Y1 Cell Biology CP Evaluation
- Y2 COA Critiquing CP Evaluation
- General RKF Evaluation Considerations
  - Meaningfulness of results
    - Evaluation methods
    - Target audience
  - Effects of human users
    - Types of users
    - Evaluation duration
    - User interactions and metrics
- Challenge Problems

Page 5: Evaluating SME-Elicited Knowledge


Overview

- Y1 Cell Biology CP Evaluation
- Y2 COA Critiquing CP Evaluation
- General RKF Evaluation Considerations
  - Meaningfulness of results
    - Evaluation methods
    - Target audience
  - Effects of human users
    - Types of users
    - Evaluation duration
    - User interactions and metrics
- Challenge Problems

Page 6: Evaluating SME-Elicited Knowledge


Evaluation Methods

What was Evaluated?
- Functional Performance: subjective metrics
  - Test questions based on a section from a cell biology textbook (Y1)
  - COA Diagnostic, COA Critiquing (Y2)
- Economics: objective metrics
  - Volume, rates, reuse (a sketch of these metrics follows below)
- Intrinsic KB Quality: subjective and non-metric, for KBs
  - Quality Review Panels
- Intrinsic Tool Quality: subjective and non-metric, from SMEs
- Expert Knowledge
  - Challenge Problem work
  - Questionnaires
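
The objective economics metrics above (volume, rates, reuse) are the most mechanical to compute. As a minimal sketch only, the following Python shows how assertion volume, entry rate, and reuse fraction could be derived from a log of authored assertions; the record fields and the notion of a "reused" assertion here are illustrative assumptions, not the actual RKF instrumentation.

```python
# Minimal sketch (not the RKF tooling): computing objective "economics" metrics --
# assertion volume, entry rate, and reuse fraction -- from a hypothetical log of
# KB-authoring activity. Field names (author, entered_at, reused) are assumptions.
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Assertion:
    author: str          # SME or KE who entered the assertion
    entered_at: datetime # when it was added to the KB
    reused: bool         # True if it reuses existing KB content rather than adding new content

def economics_metrics(assertions: List[Assertion]) -> dict:
    """Volume, entry rate (assertions/hour), and reuse fraction for one KB."""
    if not assertions:
        return {"volume": 0, "rate_per_hour": 0.0, "reuse_fraction": 0.0}
    volume = len(assertions)
    start = min(a.entered_at for a in assertions)
    end = max(a.entered_at for a in assertions)
    hours = max((end - start).total_seconds() / 3600.0, 1e-9)
    reused = sum(1 for a in assertions if a.reused)
    return {
        "volume": volume,
        "rate_per_hour": volume / hours,
        "reuse_fraction": reused / volume,
    }

# Example: three assertions entered over two hours, one of them reusing existing content.
log = [
    Assertion("sme1", datetime(2002, 7, 1, 9, 0), False),
    Assertion("sme1", datetime(2002, 7, 1, 10, 0), True),
    Assertion("sme1", datetime(2002, 7, 1, 11, 0), False),
]
print(economics_metrics(log))  # volume 3, rate 1.5/hour, reuse fraction ~0.33
```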

Page 7: Evaluating SME-Elicited Knowledge


Problems with Evaluation Methods

- The mix of methods makes it difficult to know what conclusions to draw
- Each evaluation evolved to improve evaluation mechanics
- Subjective methods are only as good as the evaluators and the specification of the subjective measures
- Objective evaluations can give a false sense of confidence
  - Across teams, even objective measures such as counts of assertions are not clear cut
- Different KR systems and ontologies make it difficult to compare across systems regarding, e.g.:
  - Number of assertions
  - Reuse statistics
  - Quality of knowledge entered and of answers generated
- Users of mixed skills and abilities

Page 8: Evaluating SME-Elicited Knowledge


Overview

- Y1 Cell Biology CP Evaluation
- Y2 COA Critiquing CP Evaluation
- General RKF Evaluation Considerations
  - Meaningfulness of results
    - Evaluation methods
    - Target audience
  - Effects of human users
    - Types of users
    - Evaluation duration
    - User interactions and metrics
- Challenge Problems

Page 9: Evaluating SME-Elicited Knowledge


Evaluations served two masters

- Report Card for the funding agency that reveals progress and significance
  - Knowledge Acquisition rates, reuse levels
  - SME vs. standard:
    - SME KB vs. textbook / canonical SME answers
    - SME KB vs. KE KBs
    - SME KB vs. SME’s own answers

Page 10: Evaluating SME-Elicited Knowledge


Evaluations served two masters

- Technologists want evaluations too
  - A good evaluation should also be a service to the evaluees and help to focus/refocus development
- Identify and characterize:
  - Accomplishments
  - Shortcomings
  - Limitations
- Diagnose performance (see the sketch below)
  - Characterize question types
  - Detailed scoring criteria
  - Quality Review Panel (QRP) reviews
  - SME questionnaires
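
Diagnosing performance in this sense typically means breaking scores out by question type instead of reporting one aggregate number. The sketch below, with invented question types and scores rather than RKF data, illustrates the kind of breakdown intended.

```python
# Minimal sketch (illustrative, not the RKF scoring harness): diagnosing performance
# by breaking graded results down per question type. Question types and scores are
# made-up placeholders.
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

def scores_by_question_type(results: List[Tuple[str, float]]) -> Dict[str, float]:
    """results: (question_type, score) pairs; returns the mean score per question type."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for qtype, score in results:
        buckets[qtype].append(score)
    return {qtype: mean(scores) for qtype, scores in buckets.items()}

# Hypothetical graded answers from one evaluation run.
graded = [
    ("definition", 0.9), ("definition", 0.8),
    ("process-step", 0.4), ("process-step", 0.5),
    ("comparison", 0.7),
]
for qtype, avg in sorted(scores_by_question_type(graded).items()):
    print(f"{qtype:15s} mean score = {avg:.2f}")
```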

Page 11: Evaluating SME-Elicited Knowledge


Evaluation Usefulness Tension

- Conflict of interests between contracting agencies and researchers
  - Agencies want to see progress
  - Researchers want to do work
- Evaluations take time and resources
  - They may not show progress
  - They take time away from work

Page 12: Evaluating SME-Elicited Knowledge


Overview

- Y1 Cell Biology CP Evaluation
- Y2 COA Critiquing CP Evaluation
- General RKF Evaluation Considerations
  - Meaningfulness of results
    - Target audience
    - Evaluation methods
  - Effects of human users
    - Types of users
    - Evaluation duration
    - User interactions and metrics
- Challenge Problems

Page 13: Evaluating SME-Elicited Knowledge


Importance of User Characterization

- Across the two years, RKF evaluation involved both AI-naïve users and trained KEs
  - In evaluating how well a system enables a user to do something, the nature of the user is important
  - Systems that made sense to KEs were often baffling to SMEs
- RKF did not invest resources into analyzing how the different types of users interacted with the systems
  - These evaluations focused very much on the systems and on what was produced using the systems
  - The interactions were ignored as a target for evaluation, except in SME questionnaires

Page 14: Evaluating SME-Elicited Knowledge


Overview

- Y1 Cell Biology CP Evaluation
- Y2 COA Critiquing CP Evaluation
- General RKF Evaluation Considerations
  - Meaningfulness of results
    - Target audience
    - Evaluation methods
  - Effects of human users
    - Types of users
    - Evaluation duration
    - User interactions and metrics
- Challenge Problems

Page 15: Evaluating SME-Elicited Knowledge


Evaluation Duration

- RKF evaluated experimental systems
  - As such, they were works in progress
- Human users were… only human
- In RKF Year 1, the evaluation period was quite long (over four weeks)
  - As a result, the systems received a good workout, but so did the users
  - Frustration resulted from bugs and from evaluation mechanisms that kept users isolated
  - Productivity dropped off as the summer progressed
  - Patience was non-existent by the end of the summer
- In RKF Year 2, the evaluation was shorter but the task was more complex
  - Users felt there was more they wanted to teach the systems
  - The evaluation was not long enough for the task at hand

Page 16: Evaluating SME-Elicited Knowledge


Overview

- Y1 Cell Biology CP Evaluation
- Y2 COA Critiquing CP Evaluation
- General RKF Evaluation Considerations
  - Meaningfulness of results
    - Target audience
    - Evaluation methods
  - Effects of human users
    - Types of users
    - Evaluation duration
    - User interactions and metrics
- Challenge Problems

Page 17: Evaluating SME-Elicited Knowledge


What is being tested?

- In RKF, the systems were the focus of the evaluation
  - Systems were supposed to be designed to aid users in developing knowledge
  - In Year 1, users were kept apart from technology developers to make the experiment more pure
  - In Year 2, interaction was allowed and encouraged
- User interaction had to be characterized so that we could take its effects into account when evaluating results
  - Sufficient metrics for this purpose have not been stated
- KR is a very creative process
  - The process would need to be teased apart more to trace the contributions of KEs versus SMEs

Page 18: Evaluating SME-Elicited Knowledge


Scientific Validity

- We wanted to isolate variables, e.g., determine which factor(s) led to performance differences
- For this, we needed:
  a) Quantity of data
    - Longer evaluations and/or more users
    - Enough data to help establish that differences are statistically significant (see the sketch below)
  b) To avoid ceiling and floor effects
    - Diversity in kinds and difficulty of performance tasks
    - A sufficient, but not misleading, amount of granularity in scoring
  c) An effort to isolate variables appropriately
    - Identify controls
    - Characterize users systematically
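
Establishing that observed score differences are statistically significant, and spotting ceiling effects, requires only modest machinery once per-question scores exist. The sketch below uses invented scores (not RKF results): a simple permutation test on the difference in mean scores between two conditions, plus a crude check of how many scores sit at the maximum.

```python
# Minimal sketch (illustrative only; the scores are invented, not RKF results):
# a permutation test on the difference in mean question scores between two
# conditions (e.g., SME-built KB vs. KE-built KB), plus a simple ceiling check.
import random

def permutation_test(a, b, trials=10_000, seed=0):
    """Two-sided p-value for the observed difference in means under random relabeling."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        a_perm, b_perm = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(a_perm) / len(a_perm) - sum(b_perm) / len(b_perm))
        if diff >= observed:
            extreme += 1
    return extreme / trials

def ceiling_fraction(scores, max_score=1.0):
    """Fraction of scores at the maximum -- a rough check for ceiling effects."""
    return sum(1 for s in scores if s >= max_score) / len(scores)

# Hypothetical per-question scores (0..1) for two conditions.
sme_kb = [0.7, 0.8, 0.6, 0.9, 0.75, 0.65, 0.8, 0.7]
ke_kb  = [0.8, 0.85, 0.9, 0.8, 0.95, 0.85, 0.9, 0.8]
print("p-value:", permutation_test(sme_kb, ke_kb))
print("ceiling fraction (KE KB):", ceiling_fraction(ke_kb))
```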

Page 19: Evaluating SME-Elicited Knowledge


Overview

- Y1 Cell Biology CP Evaluation
- Y2 COA Critiquing CP Evaluation
- General RKF Evaluation Considerations
  - Meaningfulness of results
    - Target audience
    - Evaluation methods
  - Effects of human users
    - Types of users
    - Evaluation duration
    - User interactions and metrics
- Challenge Problems

Page 20: Evaluating SME-Elicited Knowledge


Challenge Problems

CP Objectives:
- Test technology
  - Feedback for DARPA -- we’re doing what we should be doing
  - Feedback for teams -- here’s where you can improve
- Focus development
  - CPs provide a theme and make collaboration more targeted
  - Possibility of developers (just) “teaching to the test”
- Reflect development
  - Show off what can be shown off

Page 21: Evaluating SME-Elicited Knowledge


Challenge Problem Lessons Learned

- Importance of a strong evaluation focus
  - Gives technology a communal focus
- Competitions are not always necessary
  - They can promote unhealthy levels of competition
  - They put the focus on grades rather than results
- There can be several Challenge Problem foci
  - But you do sacrifice comparability
- Evaluation is a collaborative sport
  - Evaluators need to listen to the tech providers
  - Tech providers must accept a bar set slightly higher than their comfort level

Page 22: Evaluating SME-Elicited Knowledge


Challenge Problem Lessons Learned

- Evaluation methodology needs to be well known
  - Get specs out early and hammer out a consensus
- Dry runs iron out the wrinkles
- Mini-evaluations keep the data coming and allow teams to continually test/improve their systems
- Targeted testing can really focus on particular system components
- Subjectivity can be managed with good criteria

Page 23: Evaluating SME-Elicited Knowledge


For more information…

- IET’s RKF page: www.iet.com/Projects/RKF/
- Y1 Spec: http://www.iet.com/Projects/RKF/TKCP-spec--v2.1.doc
- Y2 Spec: http://www.iet.com/Projects/RKF/COA-CP-spec--v1.2.doc
- Y1 Evaluation Paper: http://www.iet.com/Projects/RKF/PerMIS02.doc
  - Schrag, B., et al., “Experimental Evaluation of Subject Matter Expert-oriented Knowledge Base Authoring Tools,” Measuring the Performance and Intelligence of Systems: Proceedings of the 2002 PerMIS Workshop, August 13-15, 2002, NIST Special Publication 990, pp. 272-279.
- Y2 Evaluation Paper: http://www.iet.com/Projects/RKF/QRP02/KCAP-03-COACritiquing.pdf
  - Pool, M., Murray, K., Fitzgerald, J., Mehrotra, M., Schrag, R., Blythe, J., Kim, J., Chalupsky, H., Miraglia, P., Russ, T., Schneider, D., “Evaluating SME Authored COA Critiquing Knowledge,” submitted to K-CAP, 2003.