Evaluation
Transcript of Evaluation
evaluation, validation and empirical methods
Alan Dix
http://www.alandix.com/
evaluation
you’ve designed it, but is it right?
different kinds of evaluation
endless arguments: quantitative vs. qualitative
in the lab vs. in the wild
experts vs. real users (vs UG students!)
really need to combine methods
quantitative – what is true & qualitative – why
what is appropriate and possible
purpose

three types of evaluation

type                          purpose               stage
formative                     improve a design      development
summative                     say "this is good"    contractual/sales
investigative / exploratory   gain understanding    research
when does it end?
in a world of perpetual beta ...
real use is the ultimate evaluation
logging, bug reporting, etc.
how do people really use the product?
are some features never used?
studies and experiments
what varies (and what you choose)
individuals / groups (not only UG students!)
tasks / activities
products / systems
principles / theories
prior knowledge and experience
learning and order effects
which are you trying to find out about?
which are 'noise'?
a little story …
BIG ACM sponsored conference
‘good’ empirical paper
looking at collaborative support for a task X
three pieces of software:
A – domain specific software, synchronous
B – generic software, synchronous
C – generic software, asynchronous
[2×2 grid: domain specific vs. generic × synchronous vs. asynchronous, with A, B and C filling three of the four cells]
experiment
reasonable nos. subjects in each condition
quality measures
significant results (p < 0.05):
  domain specific > generic
  asynchronous > synchronous
conclusion: really want async domain specific
[grid repeated, extrapolating from A, B and C to the untested asynchronous domain-specific cell]
what’s wrong with that?
interaction effects
  the gap is interesting to study
  not necessarily end up best
more important … if you blinked at the wrong moment …
NOT independent variables
  three different pieces of software
  like an experiment on 3 people!
  say system B was just bad
[grid with the missing cell marked '?': if system B was simply bad, then B < A and B < C would explain both results]
what went wrong?
borrowed psych method … but method embodies assumptions
  single simple cause, controlled environment
interaction needs ecologically valid experiments
  multiple causes, open situations
what to do? understand assumptions and modify
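To see how a single bad system can masquerade as two separate main effects, here is an illustrative simulation with made-up numbers (the quality scores and group sizes are assumptions for the sketch, not data from the paper):

```python
import random
import statistics

random.seed(1)

# Hypothetical quality scores (0-100) for the three systems.
# Assume only system B is bad; A and C are equally good.
def score(system):
    mean = {"A": 70, "B": 50, "C": 70}[system]
    return random.gauss(mean, 5)

scores = {s: [score(s) for _ in range(20)] for s in "ABC"}
mean = {s: statistics.mean(v) for s, v in scores.items()}

# The "main effect" comparisons the paper made:
domain_specific = mean["A"]                # A is the only domain-specific system
generic = (mean["B"] + mean["C"]) / 2      # B and C are generic
synchronous = (mean["A"] + mean["B"]) / 2  # A and B are synchronous
asynchronous = mean["C"]                   # C is the only asynchronous system

print(domain_specific > generic)   # True - but only because B drags 'generic' down
print(asynchronous > synchronous)  # True - same single cause: B is bad
```

Both comparisons come out "in favour", yet they share one cause: B is a bad piece of software, not evidence about synchrony or domain specificity.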
numbers and statistics
are five users enough?
one of the myths of usability!
from a study by Nielsen and Landauer (1993)
  empirical work, cost–benefit analysis and averages
  many assumptions: simplified model, iterative steps, ...
basic idea: decreasing returns
  each extra user gives less new information
really ... it depends
  for robust statistics – many many more
  for something interesting – one may be enough
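The decreasing-returns idea can be made concrete with the Nielsen–Landauer problem-discovery model: the proportion of problems found by n users is 1 − (1 − λ)^n, where λ is the probability a single user encounters a given problem (about 0.31 averaged over their studies). A minimal sketch:

```python
# Nielsen-Landauer problem-discovery model: proportion of usability
# problems found by n users, assuming each user independently hits
# a given problem with probability lam (~0.31 in their data).
def proportion_found(n, lam=0.31):
    return 1 - (1 - lam) ** n

for n in range(1, 9):
    print(n, round(proportion_found(n), 2))

# With lam = 0.31, five users find roughly 85% of problems - but if
# problems are subtle (small lam), five users find far fewer:
print(round(proportion_found(5, lam=0.05), 2))  # ~0.23
```

The "five users" figure is thus an average over the assumptions of the model, not a universal rule.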
points of comparison
measures:
  average satisfaction 3.2 on a 5 point scale
  time to complete task in range 13.2–27.6 seconds
good or bad?
need a point of comparison
  but what? self, similar system, created or real??
think purpose ...
what constitutes a 'control'? think!!
do I need statistics?
finding some problem to fix – NO
to know:
  how frequently it occurs
  whether most users experience it
  if you've found most problems
– YES
statistics
need a course in itself!
  experimental design
  choosing the right test
  etc., etc., etc.
a few things ...
statistical significance
stat. sig. = likelihood of seeing the effect by chance
  5% (p < 0.05) = 1 in 20 chance
beware many tests and cherry picking!
  10 tests means 50:50 chance of seeing p < 0.05
not necessarily a large effect (i.e. ≠ important)
non-significant = not proven (NOT no effect)
  may simply not be sensitive enough, e.g. too few users
to show no (small) effect, need other methods
  find out about confidence intervals!
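The cherry-picking warning is simple arithmetic: if you run k independent tests when no real effects exist, the chance of at least one spurious p < 0.05 is 1 − 0.95^k (about 40% for 10 tests, passing 50:50 around 14 tests):

```python
# Probability of at least one "significant" result among k independent
# tests at alpha = 0.05, when every null hypothesis is actually true.
def false_alarm_prob(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

print(round(false_alarm_prob(1), 3))   # 0.05
print(round(false_alarm_prob(10), 3))  # ~0.401
print(round(false_alarm_prob(14), 3))  # ~0.512 - now worse than a coin toss
```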
statistical power
how likely effect will show up in experiment
more users means more 'power'
  2× sensitivity needs 4× number of users
manipulate it!
more users (but usually many more)
within subject/group (‘cancels’ individual diffs.)
choice of task (particularly good/bad)
add distracter task
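The "2× sensitivity needs 4× users" rule follows from the standard error of a mean shrinking as 1/√n. A minimal sketch with made-up numbers:

```python
import math

# The standard error of a mean shrinks as 1/sqrt(n): to halve the
# smallest detectable effect (i.e. double sensitivity) you must
# quadruple the number of users.
def standard_error(sd, n):
    return sd / math.sqrt(n)

sd = 10.0  # assumed individual variability
print(standard_error(sd, 20))  # baseline experiment
print(standard_error(sd, 80))  # 4x the users -> half the standard error
```

This is also why within-subject designs help: cancelling individual differences reduces the sd directly, which is cheaper than multiplying users.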
from data to knowledge
types of knowledge
descriptive – explaining what happened
predictive – saying what will happen
  cause ⇒ effect
  where science often ends
synthetic – working out what to do to make what you want happen
  effect ⇒ cause
  design and engineering
generalisation?
can we ever generalise?
every situation is unique, but ...
... to use past experience is to generalise
generalisation ≠ abstraction
  cases, descriptive frameworks, etc.
data ≠ generalisation
  interpolation – maybe
  extrapolation??
generalisation ...
never comes (solely) from data
always comes from the head
requires understanding
mechanism
reduction / reconstruction
  – formal hypothesis testing (may be qualitative too)
  – more scientific precision
wholistic / analytic
  – field studies, ethnographies, 'end to end' experiments
  – more ecological validity
from evaluation to validation
validating work
justification
  – expert opinion
  – previous research
  – new experiments
evaluation
  – experiments
  – user studies
  – peer review
singularity?
  different people, different situations
sampling
generative artefacts
evaluation faces singularity: people, situations
plus ... different designers, different briefs
artefacts: toolkits, devices, interfaces, guidelines, methodologies
(pure) evaluation of generative artefacts is methodologically unsound
  too many to sample
justification vs. validation
different disciplines
  – mathematics: proof = justification
  – medicine: drug trials = evaluation
combine them:
  – look for weakness in justification
  – focus evaluation there
example – scroll arrows ...
Xerox STAR – first commercial GUI
  precursor of Mac, Windows, ...
principled design decisions
which direction for scroll arrows?
  not obvious: moving document or handle?
=> do a user study!
gap in justification => evaluation
unfortunately ...
Apple got the wrong design