
Evaluation of a Design

Evaluation is not marketing

Marketing requires that you make a case for why your design is desirable

In marketing you are like a lawyer arguing the case for your client

It’s common to make an “elevator pitch” to convince potential funders, clients, etc., as well as crafting PR or advertisements for external markets.

Evaluation is honesty

Evaluation requires your honest assessment of the data related to goals

In evaluation you are like a scientist testing a hypothesis

It’s common to produce an internal memo, white paper, etc., as well as publications in professional journals or conferences.


Evaluation requires a “Popperian” approach

Evaluation is honesty

Recall that Karl Popper based his philosophy of science on limiting experiments to those which have the ability to falsify a hypothesis.

This means I cannot grade you on whether or not your evaluation falsifies your hypothesis. I can only grade you on whether or not your evaluation honestly allows for the possibility of failure. Lucky you!

Unfortunately, much of the evaluation scholarship ignores Popper and uses the terminology of “validate” or “verify” rather than “fail to falsify”.


Verification vs Validation

Verification is “internal”: did we build it according to the specs? If the specs say the maximum weight is 10 lbs, you can verify by weighing it.

Validation is “external”: did the specs describe what the user needs? You can validate by having users try it: if they say it’s not too heavy, then it’s validated.

Note that you can fail one without the other: the users might invalidate a design that was nonetheless verified as meeting specs, or vice-versa.


Technical Specifications

Typically there is some set of technical specifications describing the expected product function. Functional testing is a verification activity: did we build the product according to the design specs?

Short-term testing can tell you about specifications such as throughput.

Long-term testing is sometimes required for durability testing, failure analysis, product safety, etc. Even before testing, you can ask yourself what might happen in various “failure modes”, e.g. during a power blackout.
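To make the verification idea concrete, here is a minimal sketch of how a functional spec check might be automated. The spec names and limit values are hypothetical, chosen only to echo the weight and throughput examples above.

```python
# Minimal sketch of functional verification against hypothetical specs.
# The spec limits and measured values below are invented for illustration.

SPECS = {
    "max_weight_lbs": 10.0,     # verify: weigh the product
    "min_throughput_ops": 500,  # verify: short-term load test
}

def verify(measurements: dict) -> dict:
    """Return pass/fail for each spec item, given measured values."""
    return {
        "max_weight_lbs": measurements["weight_lbs"] <= SPECS["max_weight_lbs"],
        "min_throughput_ops": measurements["throughput_ops"] >= SPECS["min_throughput_ops"],
    }

if __name__ == "__main__":
    measured = {"weight_lbs": 9.2, "throughput_ops": 620}
    for item, passed in verify(measured).items():
        print(f"{item}: {'PASS' if passed else 'FAIL'}")
```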


Technical Specifications

Failure Modes and Effects Analysis example

(http://research.me.udel.edu/~jglancey/FailureAnalysis.pdf)
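The linked document walks through a full worked example. As a rough sketch of the usual FMEA bookkeeping, each failure mode is scored for severity, occurrence, and detection and ranked by the resulting Risk Priority Number; the failure modes and scores below are invented for illustration.

```python
# Sketch of a Failure Modes and Effects Analysis (FMEA) worksheet.
# A common convention scores each failure mode for Severity, Occurrence,
# and Detection on 1-10 scales and multiplies them into a Risk Priority
# Number (RPN). The failure modes and scores here are invented examples.

failure_modes = [
    # (failure mode,               severity, occurrence, detection)
    ("power blackout during use",         7,          3,         2),
    ("battery overheats",                 9,          2,         4),
    ("enclosure cracks on drop",          5,          4,         3),
]

# Rank failure modes by RPN, highest risk first.
ranked = sorted(
    ((name, s * o * d) for name, s, o, d in failure_modes),
    key=lambda row: row[1],
    reverse=True,
)

for name, rpn in ranked:
    print(f"RPN {rpn:3d}  {name}")
```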


User Requirements

a. User acceptance testing: This is a validation activity; did we build the right thing? It may work exactly as designed but still be unsatisfactory from the user’s subjective perspective.

b. User-based function testing: validation of the objective effects, not just acceptance: a user may tell you the design is good even if it is ineffective. This is a particular challenge in certain fields like nutrition (I love it but it’s bad for me), educational technologies (I love it but I don’t learn much), exercise (it fools me into thinking I am exercising), etc.


Ethics in User testing

All university research with human subjects must pass the Institutional Review Board (IRB). Most education technology research is approved under the Federal Policy for the Protection of Human Subjects (45 CFR Part 690), section 46.101(b)(1), regarding “Research conducted in established… educational settings.”


Ethics in User testing

• Participation must be arranged through the school teacher and school administration.

• Subjects must be anonymous in your written descriptions.

• No personal information or photos can be obtained without consent.

• If disclosure of students’ responses did somehow occur, there cannot be risk of criminal or civil liability or damage to reputation.

• No students can be denied an educational opportunity simply because you need a control group – this can create a challenge for us.


Basics of Experimental Design

Dependent variable: outcome measured to determine any effect. “Depending on whether or not this works, we will see X” (healthier people, smarter kids, less pollution from the exhaust, etc.).

Independent variable: the “intervention”—often simply the presence/absence of the new technology you designed.


Basics of Experimental Design

Measuring the dependent variable with machines is straightforward. With humans it can be tricky. The following examples are mainly from educational technology design, but they can easily be generalized:

You can apply an “instrument” (paper test) for measurement:
– An achievement test, i.e. a series of problems testing their skills or knowledge
– An attitudes survey, i.e. testing enthusiasm for a discipline, career interest, etc.

You can also measure the dependent variable with data that is already available, such as grades, college admission and retention, or the number of trips to the principal’s office. These are typically used as “baseline” measures.


Basics of Experimental Design

Pre-test/post-test contrast: Like running a car over speed bumps: you have a measure that tells you something is out there, even if you can’t see it.

Pre/post has an advantage over baseline, in that you control the timing. But baseline can often be useful for larger datasets.

Either way, we need to do everything we can to ensure our data is reflecting changes due to the independent variable (the design you introduced; your intervention), and not some other source.


Basics of Experimental Design

Challenges to internal validity call into question the validity of the experiment itself.

Challenges to external validity call into question the validity of generalizing results beyond the specific context.


Challenges to internal validity

History: for example, if the pre-test concerned astronomy and newspapers then featured a newly discovered planet.

Maturation: participants grow older in the long term, get tired in the short term, etc.

Testing: e.g. Claude Steele’s “stereotype threat.”

Instrumentation: for example, if you improve the placement of a microphone between pre and post, or between control and experimental.

Statistical regression: students who scored high on a test will tend to score closer to the average on the next test, simply because some of the high scores were due to luck.


Challenges to internal validity

Selection: if you ask for volunteers, for example, you might accidentally be gathering individuals who are predisposed toward your intervention.

Experimental mortality: the problem of losing participants during the intervention, i.e. some of the participants who took the pre-test and/or experienced the intervention do not complete the experiment. For example, if the worst students are the ones with the most absences, experimental mortality will make your post-test results appear better than they should.


Control/intervention comparison for internal validity

Most internal validity threats can be eliminated by comparison between a control group, which does not get the intervention, and an experimental group, which does.

You give pre/post tests to both the control group and the intervention group. If your hypothesis is correct, the pre-post difference will be greater in the intervention group.

For example, a new planet is featured on TV and kids are suddenly more interested in astronomy. As long as the control group and the intervention group watch TV equally, you can now eliminate that internal validity threat.

But control and intervention must be as similar as possible.


Challenges to external validity

For example, an intervention involving music and learning, conducted in a school for musically gifted children, may produce results which, though internally valid, are only applicable to children gifted in music.

External validity challenges are about generalizing results


Statistical measures

The test scores for any given student will normally fluctuate due to random factors (good sleep, bad mood, etc.).

You have to examine the scores statistically to see whether the improvement is significantly greater than the random changes.

Statistical significance is usually reported as a p-value. The strongest commonly reported level is p < .001: random fluctuation alone would produce such a pre/post difference in only one out of 1000 trials.

p < .01 means on average 1 out of 100 trials; p < .05 means 5 out of 100 trials.

If p is larger than .05, you have to report that “there was positive improvement, but it did not reach statistically significant levels.”
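As a sanity check on what these thresholds mean, here is a small simulation (in Python, purely illustrative, with arbitrary sample size and noise level) of many pre/post “experiments” in which the intervention has no real effect; roughly 5% of them will still come out “significant” at p < .05.

```python
# Sketch: why "p < .05" means "random fluctuation would produce a
# difference this large in about 5 out of 100 trials." We simulate many
# pre/post experiments in which the intervention has NO real effect and
# count how often a paired t-test still reports p < .05.
# (Sample size and noise level are arbitrary illustration values.)

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_students, n_trials, alpha = 30, 2000, 0.05

false_positives = 0
for _ in range(n_trials):
    pre = rng.normal(70, 10, n_students)        # pre-test scores
    post = pre + rng.normal(0, 5, n_students)   # random change, no true effect
    if ttest_rel(post, pre).pvalue < alpha:
        false_positives += 1

print(f"'Significant' results with no real effect: "
      f"{false_positives / n_trials:.1%} (expected about {alpha:.0%})")
```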


Statistical measures

A typical statistical procedure for pre/post comparisons is the “paired t-test.” For each student you subtract the pre score from the post score, then input those differences (usually into software such as SPSS).

Note that this only provides pre/post significance, not a comparison to the absence of the intervention.

This raises anonymity issues, because the tests have to be identified with the student’s name. Alternatively, you can use a code assigned to each student and then discard the names.
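A minimal sketch of that paired comparison, using Python’s SciPy in place of the SPSS mentioned above (SciPy is just an illustrative stand-in, and the scores are invented):

```python
# Minimal sketch of the pre/post "paired t-test" described above.
# Each position in the lists is one student, identified by a code
# rather than a name; all scores are invented.

from scipy.stats import ttest_rel

pre  = [62, 70, 55, 68, 74, 60, 66, 71, 58, 65]   # pre-test scores
post = [68, 75, 60, 70, 80, 63, 72, 74, 61, 70]   # post-test scores, same students

result = ttest_rel(post, pre)   # pairs each student's post with their pre
print(f"mean gain = {sum(p - q for p, q in zip(post, pre)) / len(pre):.1f}")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```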


Statistical Comparison of Control and Experimental

Here we take the differences between pre and post for the experimental group, and the differences between pre and post for the control group, and use the difference between those differences as the data to test for significance.

This is done using the “t-test for two independent groups”, because you cannot pair tests from two different students. The outcome is again reported as a p-value.
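A minimal sketch of this gain-score comparison, again using SciPy as an illustrative stand-in for SPSS, with invented scores:

```python
# Sketch of the control vs. experimental comparison: compute each
# student's pre/post gain, then compare the two groups' gains with a
# t-test for two independent samples. All scores are invented.

from scipy.stats import ttest_ind

exp_pre,  exp_post  = [60, 65, 58, 70, 62], [70, 74, 66, 78, 71]
ctrl_pre, ctrl_post = [61, 66, 59, 69, 63], [64, 68, 61, 72, 65]

exp_gain  = [b - a for a, b in zip(exp_pre, exp_post)]
ctrl_gain = [b - a for a, b in zip(ctrl_pre, ctrl_post)]

result = ttest_ind(exp_gain, ctrl_gain)   # groups contain different students
print(f"experimental mean gain = {sum(exp_gain) / len(exp_gain):.1f}")
print(f"control mean gain      = {sum(ctrl_gain) / len(ctrl_gain):.1f}")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```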

Now you can say that your product is not only effective, but more effective than the product used in the control group! In addition, the use of a control group guards against many of the validity threats.

Drawbacks to the control/experimental comparison:
– It takes more time and resources (not a trivial consideration).
– It raises ethical problems: you don’t want to withhold something that would help students just to prove your stuff works.

So be sure you always provide the intervention to both groups (in other words, let the control group have it after you have done the pre/post testing).