Evaluating SME-Elicited Knowledge
1
Evaluating SME-Elicited Knowledge
Julie Fitzgerald, Mike Pool, Bob Schrag
Information Extraction and Transport, Inc.
Arlington, Virginia
August 9, 2003
2
Background Information
Evaluations conducted for DARPA’s Rapid Knowledge Formation (RKF) program
Goal was development of tools for eliciting formalized knowledge from KR-naïve subject matter experts (SMEs)
Two system integrators
  Cycorp developed the KRAKEN system: a large knowledge-based system, primarily NL driven
  SRI developed the SHAKEN system: a smaller, more modular knowledge-based system, primarily graphical
Program evaluation was driven by Challenge Problems
  Y1 (summer 2001, January 2002): Cell Biology
    Asked: “Can SMEs teach RNA transcription to a knowledge base?”
  Y2 (September 2002): COA Critiquing
    Asked: “Can SMEs teach systems to critique a rudimentary military course of action (COA) with respect to a number of critiquing criteria?”
3
RKF Evaluation Objectives
Can subject matter experts (SMEs) author KBs using RKF tools / processes?
How good is the authored knowledge?
  How well does it work w.r.t. a performance task (e.g., textbook question answering, military course of action (COA) critiquing)?
  How general / robust is the system?
  How much of the knowledge was reused / can be reused?
  How do KBs built by SMEs compare to KBs built by KEs?
Who (between KEs and SMEs) did what?
What kinds of knowledge did SMEs build well?
How long did it take?
What was enjoyable for SMEs?
4
Overview
Y1 Cell Biology CP Evaluation
Y2 COA Critiquing CP Evaluation
General RKF Evaluation Considerations
  Meaningfulness of results
    Evaluation methods
    Target audience
  Effects of human users
    Types of users
    Evaluation duration
    User interactions and metrics
Challenge Problems
5
Overview
Y1 Cell Biology CP Evaluation
Y2 COA Critiquing CP Evaluation
General RKF Evaluation Considerations
  Meaningfulness of results
    Evaluation methods
    Target audience
  Effects of human users
    Types of users
    Evaluation duration
    User interactions and metrics
Challenge Problems
6
Evaluation Methods
What was Evaluated?
  Functional Performance
    Subjective metrics
    Test questions based on a section from a cell biology textbook (Y1)
    COA Diagnostic, COA Critiquing (Y2)
  Economics
    Objective metrics
    Volume, rates, reuse (see the sketch below)
  Intrinsic KB Quality
    Subjective and non-metric, for KBs
    Quality Review Panels
  Intrinsic Tool Quality
    Subjective and non-metric, from SMEs
  Expert Knowledge
    Challenge Problem work
    Questionnaires
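The Economics metrics above (volume, rates, reuse) are simple objective counts. As a rough, hypothetical illustration of how they might be tallied, the Python sketch below computes assertion volume, authoring rate, and a reuse ratio from per-session logs; the field names and numbers are invented assumptions, not RKF tooling or data.

# Hypothetical per-session authoring logs; all fields and values are invented
# for illustration, not actual RKF data.
sessions = [
    {"user": "SME-1", "hours": 3.0, "new_assertions": 42, "reused_concepts": 17, "new_concepts": 5},
    {"user": "SME-2", "hours": 2.5, "new_assertions": 30, "reused_concepts": 9, "new_concepts": 11},
]

def economics_metrics(sessions):
    """Tally simple volume, rate, and reuse figures from session logs."""
    total_assertions = sum(s["new_assertions"] for s in sessions)
    total_hours = sum(s["hours"] for s in sessions)
    reused = sum(s["reused_concepts"] for s in sessions)
    introduced = sum(s["new_concepts"] for s in sessions)
    return {
        "volume (assertions)": total_assertions,
        "rate (assertions/hour)": total_assertions / total_hours,
        "reuse ratio": reused / (reused + introduced),
    }

print(economics_metrics(sessions))

Even this simple tally presupposes that teams agree on what counts as one assertion or one reused concept, which is exactly the comparability problem raised on the next slide.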
7
Problems with Evaluation Methods
Mix of methods makes it difficult to know what conclusions to draw
Each evaluation evolved to improve evaluation mechanics
Subjective methods are only as good as the evaluators and the specification of the subjective measures
Objective evaluations can give a false sense of confidence
  Across teams, even objective measures such as counts of assertions are not clear cut
Different KR systems and ontologies make it difficult to compare across systems regarding, e.g.,
  Number of assertions
  Reuse statistics
  Quality of knowledge entered, answers generated
Users of mixed skills and abilities
8
Overview
Y1 Cell Biology CP Evaluation
Y2 COA Critiquing CP Evaluation
General RKF Evaluation Considerations
  Meaningfulness of results
    Evaluation methods
    Target audience
  Effects of human users
    Types of users
    Evaluation duration
    User interactions and metrics
Challenge Problems
9
Evaluations served two masters
Report Card for funding agency that reveals progress and significance
  Knowledge Acquisition rates, reuse levels
  SME vs. standard
    SME KB vs. textbook / canonical SME answers
    SME KB vs. KE KBs
    SME KB vs. SME’s own answers
10
Evaluations served two masters
Technologists want evaluations too
A good evaluation should also be a service to the evaluees and help to focus/refocus development.
Identify and characterize:
  Accomplishments
  Shortcomings
  Limitations
Diagnose performance
  Characterize question types
  Detailed scoring criteria
  Quality Review Panel (QRP) reviews
  SME questionnaires
11
Evaluation Usefulness Tension
Conflict of interests between Contracting Agencies and Researchers
  Agencies want to see progress
  Researchers want to do work
Evaluations take time and resources
  They may not show progress
  They take time away from work
12
Overview
Y1 Cell Biology CP Evaluation
Y2 COA Critiquing CP Evaluation
General RKF Evaluation Considerations
  Meaningfulness of results
    Target audience
    Evaluation methods
  Effects of human users
    Types of users
    Evaluation duration
    User interactions and metrics
Challenge Problems
13
Importance of User Characterization
Across the two years, RKF evaluation involved both AI-naïve users and trained KEs
In evaluating how well a system enables a user to do something, the nature of the user is important.
  Systems that made sense to KEs were often baffling to SMEs.
RKF did not invest resources into analyzing how the different types of users interacted with the systems.
  These evaluations very much focused on the systems and what was produced using the systems.
  Interactions were ignored as a target for evaluation, except in SME questionnaires.
14
Overview
Y1 Cell Biology CP Evaluation
Y2 COA Critiquing CP Evaluation
General RKF Evaluation Considerations
  Meaningfulness of results
    Target audience
    Evaluation methods
  Effects of human users
    Types of users
    Evaluation duration
    User interactions and metrics
Challenge Problems
15
Evaluation Duration
RKF evaluated experimental systems
  As such, they were works in progress.
Human users were… only human
In RKF Year 1, the evaluation period was quite long (over four weeks)
  As a result, the systems received a good workout, but so did the users
  Frustration as a result of bugs and evaluation mechanisms which kept users isolated
  Productivity dropped off as summer progressed
  Patience was non-existent by the end of summer
In RKF Year 2, the evaluation was shorter but the task was more complex
  Users felt like there was more they wanted to teach the systems
  Evaluation was not long enough for the task at hand
16
Overview
Y1 Cell Biology CP Evaluation
Y2 COA Critiquing CP Evaluation
General RKF Evaluation Considerations
  Meaningfulness of results
    Target audience
    Evaluation methods
  Effects of human users
    Types of users
    Evaluation duration
    User interactions and metrics
Challenge Problems
17
What is being tested?
In RKF, the systems were the focus of the evaluation
  Systems were supposed to be designed to aid users in developing knowledge
In Year 1, users were kept apart from technology developers to make the experiment more pure.
In Year 2, interaction was allowed and encouraged.
  User interaction had to be characterized so that we could take its effects into account when evaluating results
  Sufficient metrics for this purpose have not been stated.
KR is a very creative process
  The process would need to be teased apart more to trace the contributions of KEs versus SMEs.
18
Scientific Validity
We wanted to isolate variables, e.g., determine which factor(s) led to performance differences
For this, we needed:
  a) Quantity of data
    Longer evaluations and/or more users
    Enough data to help establish that differences are statistically significant (see the sketch below)
  b) To avoid ceiling and floor effects
    Diversity in kinds and difficulty of performance tasks
    Sufficient, but not misleading, amount of granularity in scoring
  c) An effort to isolate variables appropriately
    Identify controls
    Characterize users systematically
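For point (a) above, one common way to check whether an observed difference (e.g., between scores of SME-built and KE-built KBs) is statistically significant is a permutation test on the difference of group means. The sketch below is generic and the scores are invented; it is not an RKF analysis.

import random

def permutation_test(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # random relabeling of which scores belong to which group
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    return extreme / trials

# Invented example scores (0-100) for two hypothetical user groups.
sme_scores = [62, 70, 58, 75, 66]
ke_scores = [71, 80, 68, 77, 83]
print("p-value:", permutation_test(sme_scores, ke_scores))

With only a handful of users per condition, as in RKF, such a test will rarely reach significance, which is the quantity-of-data point above.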
19
Overview
Y1 Cell Biology CP Evaluation
Y2 COA Critiquing CP Evaluation
General RKF Evaluation Considerations
  Meaningfulness of results
    Target audience
    Evaluation methods
  Effects of human users
    Types of users
    Evaluation duration
    User interactions and metrics
Challenge Problems
20
Challenge Problems
CP Objectives
  Test technology
    Feedback for DARPA -- we’re doing what we should be doing.
    Feedback for teams -- here’s where you can improve.
  Focus development
    CPs provide a theme and make collaboration more targeted
    Possibility of developers (just) “teaching to the test”
  Reflect development
    Show off what can be shown off
21
Challenge Problem Lessons Learned
Importance of strong evaluation focus
  Gives technology a communal focus
Competitions are not always necessary
  Can promote unhealthy levels of competition
  Shifts focus to grades rather than results
There can be several Challenge Problem foci
  But you do sacrifice comparability
Evaluation is a collaborative sport
  Evaluators need to listen to the tech providers
  Tech providers must accept a bar set slightly higher than their comfort level
22
Challenge Problem Lessons Learned
Evaluation methodology needs to be well known
  Get specs out early and hammer out a consensus
  Dry runs iron out the wrinkles
  Mini-evaluations keep the data coming and allow teams to continually test/improve their systems
  Targeted testing can really focus on particular system components
Subjectivity can be managed with good criteria
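One concrete way to check whether “good criteria” are in fact managing subjectivity is to measure agreement between Quality Review Panel raters, for example with Cohen’s kappa. This is a generic sketch, not part of the RKF methodology, and the ratings are invented.

from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    labels = set(counts1) | set(counts2)
    # Chance agreement expected from each rater's marginal label frequencies.
    expected = sum((counts1[label] / n) * (counts2[label] / n) for label in labels)
    return (observed - expected) / (1 - expected)

# Invented QRP-style ratings ("good" / "fair" / "poor") for ten KB entries.
rater_a = ["good", "good", "fair", "poor", "good", "fair", "fair", "good", "poor", "good"]
rater_b = ["good", "fair", "fair", "poor", "good", "good", "fair", "good", "poor", "fair"]
print("kappa:", cohens_kappa(rater_a, rater_b))

Low agreement is a signal that the scoring criteria need tightening before the subjective scores can be trusted.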
23
For more information…
IET’s RKF page: www.iet.com/Projects/RKF/
Y1 Spec: http://www.iet.com/Projects/RKF/TKCP-spec--v2.1.doc
Y2 Spec: http://www.iet.com/Projects/RKF/COA-CP-spec--v1.2.doc
Y1 Evaluation Paper: http://www.iet.com/Projects/RKF/PerMIS02.doc
  Schrag, B. et al., “Experimental Evaluation of Subject Matter Expert-oriented Knowledge Base Authoring Tools,” Measuring the Performance and Intelligence of Systems: Proceedings of the 2002 PerMIS Workshop, August 13-15, 2002, NIST Special Publication 990, pp. 272-279.
Y2 Evaluation Paper: http://www.iet.com/Projects/RKF/QRP02/KCAP-03-COACritiquing.pdf
  Pool, M., Murray, K., Fitzgerald, J., Mehrotra, M., Schrag, R., Blythe, J., Kim, J., Chalupsky, H., Miraglia, P., Russ, T., Schneider, D., “Evaluating SME Authored COA Critiquing Knowledge,” submitted to K-CAP, 2003.