8/8/2019 Validity of Measure Semi
1/77
Measurement in Nursing and Health Research
Validity of Measures
Presented by: Wesam Almagharbeh
Supervised by: Muayyad Ahmad, PhD, RN
Introduction: Measurement
Measurement is the assignment of numbers to represent the amount of an attribute present in an object or person, using specific rules.
L. L. Thurstone: "Whatever exists, exists in some amount and can be measured."
The rules for measuring temperature, weight, and other physical attributes are widely known and accepted.
Rules for measuring many variables, however, have to be invented, e.g., rules for measuring pain, satisfaction, and depression.
Measurement
Measurement rules specify according to what criteria the numeric values are to be assigned to the characteristic of interest.
In measuring attributes, researchers strive to use good, meaningful rules.
With a new instrument, researchers seldom know in advance if their rules are the best possible.
Key Criteria for Evaluating Quantitative Measures
Reliability
Validity
Validity
Validity refers to the extent to which a measure achieves the purpose for which it was intended.
"Validity is a unitary concept. It is the degree to which evidence and theory support the interpretation entailed by proposed use of tests" (AERA & NCME, 1985, 1999).
The type of validity information to be obtained depends upon the aims or purposes for the measure rather than upon the type of measure.
Two Frameworks of Measurement
Norm-referenced measures are employed when the interest is in evaluating a subject's performance relative to the performance of other subjects in some well-defined comparison group. The focus is on the variance between subjects' performances.
Criterion-referenced measures are employed when the interest is in determining a subject's performance relative to, or whether or not the subject has acquired, a predetermined set of target behaviors. The focus is on the variance between subject performance and the predetermined set of behaviors (process and outcome variables).
NORM-REFERENCED VALIDITY PROCEDURES
Four aspects:
Content validity
Face (logical) validity
Construct validity
Criterion-related validity
NORM-REFERENCED MEASURES
Content validity
Its focus is on determining whether or not the items sampled for inclusion in the tool adequately represent the domain of content addressed by the instrument, and on the relevance of the content domain to the proposed interpretation of scores obtained when the measure is employed.
Important for all measures (especially instruments designed to assess cognition).
NORM-REFERENCED MEASURES
Content validity
Procedures: experts judge the specific items in terms of their relevance, sufficiency, and clarity in representing the concepts underlying the measure's development.
When two judges are employed, the content validity index (CVI) is used: the proportion of items given a rating of quite/very relevant by both raters.
When more than two experts rate the items on a measure, the alpha coefficient is used: 0 indicates lack of agreement; 1.00 indicates complete agreement.
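The two-rater CVI described above can be sketched in a few lines. The function name `cvi`, the 4-point relevance scale, and the sample ratings are illustrative assumptions, not part of the source.

```python
# Sketch of the two-rater content validity index (CVI): the proportion of
# items rated "quite/very relevant" by BOTH raters. Ratings here use a
# hypothetical 4-point scale: 1 = not relevant ... 4 = very relevant,
# where 3 and 4 count as quite/very relevant.

def cvi(rater1, rater2, relevant_min=3):
    """Proportion of items rated quite/very relevant by both raters."""
    agree = sum(1 for a, b in zip(rater1, rater2)
                if a >= relevant_min and b >= relevant_min)
    return agree / len(rater1)

# Example: 10 items; 8 of them are rated 3 or 4 by both experts.
r1 = [4, 4, 3, 2, 4, 3, 4, 3, 4, 4]
r2 = [3, 4, 4, 3, 4, 1, 4, 3, 4, 4]
print(cvi(r1, r2))  # 0.8
```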
NORM-REFERENCED MEASURES
Content validity
Content validity depends largely on:
the selection, preparation, and use of experts
the optimal number of experts
NORM-REFERENCED MEASURES
Face (logical) validity
Face validity is not validity in the true sense; it refers only to the appearance of the instrument to the layperson.
When present, it does not provide evidence that the instrument actually measures what it purports to measure.
NORM-REFERENCED MEASURES
Construct validity
Refers to the extent to which an individual, event, or object actually possesses the characteristic being measured by the instrument.
The primary concern is the extent to which relationships among items included in the measure are consistent with the theory and concepts as operationally defined.
The more abstract the concept, the more difficult it is to establish the construct validity of the measure.
Some Methods of Assessing Construct Validity
Contrasted-groups approach
Hypothesis-testing approach
Multitrait-multimethod approach
NORM-REFERENCED MEASURES
Construct validity: contrasted-groups approach
The instrument is administered to groups expected to differ on the critical attribute because of some known characteristic (i.e., groups expected to be extremely high and extremely low in the characteristic being measured).
E.g., fear of labor experiences between primiparas and multiparas.
If there is a significant difference between the mean scores: evidence for construct validity.
If there is no significant difference, three possibilities exist:
(1) the test is unreliable;
(2) the test is reliable, but not a valid measure of the characteristic;
(3) the test constructor's conception of the construct of interest is faulty and needs reformulation.
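A minimal sketch of the contrasted-groups comparison, using the fear-of-labor example above. The group scores are invented, and the Welch-style t statistic is one common choice of significance test (an assumption; the slides do not name a specific test). In practice the statistic would be compared against a critical value for the appropriate degrees of freedom.

```python
# Contrasted-groups sketch: compare mean scores of two groups expected to
# differ on the attribute (primiparas vs. multiparas on a hypothetical
# fear-of-labor scale). All data are invented for illustration.
from statistics import mean, stdev
from math import sqrt

def two_sample_t(high, low):
    """Welch-style t statistic for the difference between group means."""
    return (mean(high) - mean(low)) / sqrt(
        stdev(high) ** 2 / len(high) + stdev(low) ** 2 / len(low))

primipara = [38, 42, 45, 40, 44, 41]   # expected HIGH fear of labor
multipara = [25, 30, 28, 27, 31, 26]   # expected LOW fear of labor

t = two_sample_t(primipara, multipara)
# A large positive t (significant difference) is evidence for construct validity.
print(round(t, 2))
```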
NORM-REFERENCED MEASURES
Construct validity: hypothesis-testing approach
Hypotheses are derived from the theory or conceptual framework underlying the instrument.
The researcher gathers data to test the hypotheses and to determine whether the rationale underlying the instrument's construction is adequate to explain the data collected.
NORM-REFERENCED MEASURES
Construct validity: hypothesis-testing approach
According to theory, construct X is positively related to construct Y.
Instrument A is a measure of construct X; instrument B is a measure of construct Y.
Scores on A and B are correlated positively, as predicted by theory.
Therefore, it is inferred that A and B are valid measures of X and Y.
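The inference above rests on observing a positive correlation between scores on instruments A and B. A minimal Pearson r, with invented instrument scores:

```python
# Pearson product-moment correlation between scores on two instruments.
# The scores below are invented; a strong positive r is the pattern the
# hypothesis-testing approach predicts from theory.
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

a_scores = [10, 14, 12, 18, 20, 16]  # instrument A (construct X)
b_scores = [22, 27, 24, 33, 36, 30]  # instrument B (construct Y)
print(round(pearson_r(a_scores, b_scores), 3))
```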
NORM-REFERENCED MEASURES
Construct validity: multitrait-multimethod approach
Is appropriately employed whenever it is feasible to:
1. Measure two or more different constructs.
2. Use two or more different methodologies to measure each construct.
3. Administer all instruments to every subject at the same time.
4. Assume that performance on each instrument employed is independent, that is, not influenced by, biased by, or a function of performance on any other instrument.
NORM-REFERENCED MEASURES
Construct validity: multitrait-multimethod approach
Conclusions depend largely on the size and pattern of the correlations.
Trait variance is the variability in a set of scores resulting from individual differences in the trait being measured.
Method variance is variance resulting from individual differences in a subject's ability to respond appropriately to the type of measure used.
NORM-REFERENCED MEASURES
Construct validity: multitrait-multimethod approach
The reliability estimates lie on the reliability diagonal.
Convergent validity is shown on the validity diagonal.
The heterotrait-monomethod coefficients should be lower than the values on the validity diagonal (construct validity).
The heterotrait-heteromethod coefficients should be lower than the values on the validity diagonal (discriminant validity).
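The expected pattern can be checked mechanically. This toy matrix (two traits by two methods, all coefficients invented for illustration) sketches the convergent/discriminant comparison described above:

```python
# Toy multitrait-multimethod check: convergent (validity-diagonal)
# correlations should exceed heterotrait-monomethod and
# heterotrait-heteromethod correlations. Traits: anxiety (anx) and
# depression (dep); methods: self-report (self) and observer rating (obs).
# All correlation values are invented.

corr = {
    ("anx_self", "anx_obs"): 0.62,   # same trait, different method (convergent)
    ("dep_self", "dep_obs"): 0.58,   # same trait, different method (convergent)
    ("anx_self", "dep_self"): 0.30,  # different trait, same method
    ("anx_obs",  "dep_obs"): 0.28,   # different trait, same method
    ("anx_self", "dep_obs"): 0.15,   # different trait, different method
    ("dep_self", "anx_obs"): 0.12,   # different trait, different method
}

convergent = [corr[("anx_self", "anx_obs")], corr[("dep_self", "dep_obs")]]
heterotrait = [v for k, v in corr.items()
               if k[0].split("_")[0] != k[1].split("_")[0]]

# Evidence for construct and discriminant validity in this toy matrix:
print(min(convergent) > max(heterotrait))  # True
```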
NORM-REFERENCED MEASURES
Construct validity
Confirmatory factor analysis
NORM-REFERENCED MEASURES
Criterion-related validity
When one wishes to infer from a measure an individual's probable standing on some other variable or criterion, criterion-related validity is of concern.
It is the degree to which the instrument is related to an external criterion.
Check the measure against a relevant criterion.
NORM-REFERENCED MEASURES
Criterion-related validity
Two types of criterion-related validity:
Predictive validity indicates the extent to which an individual's future level of performance on a criterion can be predicted from knowledge of performance on a prior measure.
Concurrent validity refers to the extent to which a measure may be used to estimate an individual's present standing on the criterion.
NORM-REFERENCED MEASURES
Criterion-related validity: predictive validity
Looks at a measure's ability to predict something it should be able to predict.
Test → Criterion
NORM-REFERENCED MEASURES
Criterion-related validity: concurrent validity
E.g., a measure of empowerment should show higher scores for managers and lower scores for their workers.
NORM-REFERENCED MEASURES
Criterion-related validity
The difference between predictive and concurrent validity, then, is the difference in the timing of obtaining measurements on the criterion.
NORM-REFERENCED MEASURES
Criterion-related validity
Activities to obtain evidence for criterion-related validity:
correlational studies of the type and extent of the relationships between scores and external variables
studies of the extent to which scores predict future behavior, performance, or scores on measures obtained at a later point in time
studies of the effectiveness of selection, placement, and/or classification decisions made on the basis of the scores resulting from the measure
studies of differential group predictions or relationships
assessment of validity generalization
NORM-REFERENCED MEASURES
Criterion-related validity
Factors to be considered in planning and interpreting criterion-related studies relate to:
(1) the target population,
(2) the sample,
(3) the criterion,
(4) measurement reliability,
(5) the need for cross-validation.
NORM-REFERENCED ITEM-ANALYSIS PROCEDURES
Item analysis: a procedure used to further assess the validity of a measure by separately evaluating each item to determine whether or not that item discriminates in the same manner in which the overall measure is intended to discriminate.
Three item-analysis procedures are:
(1) item p level
(2) discrimination index
(3) item-response chart
NORM-REFERENCED ITEM-ANALYSIS PROCEDURES
Item p level
The p level (difficulty level) is the proportion of correct responses to an item.
It is determined by counting the number of subjects selecting the correct or desired response to a particular item and then dividing this number by the total number of subjects.
The closer the value of p is to 1.00, the easier the item; the closer p is to zero, the more difficult the item.
p levels between 0.30 and 0.70 are desirable; extremely easy or extremely difficult items have very little power to discriminate or differentiate among subjects.
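The p-level computation above, sketched with invented responses (True = correct):

```python
# Item p level: the proportion of subjects answering an item correctly.
# `responses` holds one True/False entry per subject for a single item.

def p_level(responses):
    """Proportion of subjects answering the item correctly."""
    return sum(responses) / len(responses)

responses = [True, True, False, True, False, True, True, False, True, True]
p = p_level(responses)
print(p)                   # 0.7
print(0.30 <= p <= 0.70)   # within the desirable range -> True
```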
NORM-REFERENCED ITEM-ANALYSIS PROCEDURES
Discrimination index
The discrimination index (D) assesses an item's ability to discriminate.
If performance on a given item is a good predictor of performance on the overall measure, the item is said to be a good discriminator.
NORM-REFERENCED ITEM-ANALYSIS PROCEDURES
Discrimination index
To determine the D value for a given item:
1. Rank all subjects' performance on the measure by using total scores from high to low.
2. Identify those individuals who ranked in the upper 25%.
3. Identify those individuals who ranked in the lower 25%.
4. Place the remaining scores aside.
5. Determine the proportion of respondents in the top 25% who answered the item correctly (Pu).
6. Determine the proportion of respondents in the lower 25% who answered the item correctly (PL).
7. Calculate D by subtracting PL from Pu.
8. Repeat steps 5 through 7 for each item on the measure.
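Steps 1 through 8 can be sketched as follows for a single item. The total scores and item responses are invented; with 12 subjects, the upper and lower 25% are the top and bottom 3 scorers.

```python
# Discrimination index D = Pu - PL for one item.
# `scores` are total test scores; `item_correct` marks whether each subject
# answered this particular item correctly (1) or not (0).

def discrimination_index(scores, item_correct):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = len(scores) // 4                      # size of the upper/lower 25%
    upper, lower = order[:k], order[-k:]
    p_u = sum(item_correct[i] for i in upper) / k
    p_l = sum(item_correct[i] for i in lower) / k
    return p_u - p_l

scores       = [95, 90, 88, 85, 70, 65, 60, 55, 40, 35, 30, 25]
item_correct = [1,  1,  1,  0,  1,  0,  1,  0,  0,  1,  0,  0]

# Pu = 3/3, PL = 1/3, so D = 2/3: a good discriminator (> +0.20).
print(discrimination_index(scores, item_correct))
```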
NORM-REFERENCED ITEM-ANALYSIS PROCEDURES
Discrimination index
D values range from -1.00 to +1.00.
D values greater than +0.20 are desirable for a norm-referenced measure.
A positive D value is desirable and indicates that the item is discriminating in the same manner as the total test.
A negative D value suggests that the item is not discriminating in the same way as the total test.
NORM-REFERENCED ITEM-ANALYSIS PROCEDURES
Item-response chart
Like D, the item-response chart assesses an item's ability to discriminate.
The respondents ranking in the upper and lower 25% are identified as in steps 1 through 4 for determining D.
Responses are cross-tabulated in two categories: high/low scorers and correct/incorrect responses for a given item.
Chi-square: a value as large as or larger than 3.84 for a chi-square with one degree of freedom is significant at the 0.05 level.
A significant value means a significant difference exists in the proportion of high and low scorers who have correct responses. Items that meet this criterion should be retained, while those that do not should be discarded or modified to improve their ability to discriminate.
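The item-response chart reduces to a 2x2 table, so the chi-square can be computed directly with the standard shortcut formula for a 2x2 table; the counts below are invented.

```python
# Chi-square (1 df) for a 2x2 item-response chart:
# rows = high/low scorers, columns = correct/incorrect on the item.

def chi_square_2x2(a, b, c, d):
    """a,b = high scorers correct/incorrect; c,d = low scorers correct/incorrect."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Invented counts: 20 high scorers (16 correct), 20 low scorers (6 correct).
chi2 = chi_square_2x2(16, 4, 6, 14)
print(round(chi2, 2))
print(chi2 >= 3.84)   # significant at the 0.05 level with 1 df -> retain the item
```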
CRITERION-REFERENCED VALIDITY ASSESSMENT
The validity of a criterion-referenced measure can be analyzed to ascertain if the measure functions in a manner consistent with its purposes.
Validity in terms of criterion-referenced interpretations relates to the extent to which scores result in the accurate classification of objects in regard to their domain status.
CRITERION-REFERENCED VALIDITY ASSESSMENT
Three aspects:
Content validity
Construct validity
Criterion-related validity
CRITERION-REFERENCED VALIDITY ASSESSMENT
Content validity
Focuses on the representativeness of a cluster of items in relation to the specified content domain.
For a measure to provide a clear description of domain status, the content domain must be consistent with its domain specifications or objective.
Content validity is a prerequisite for all other types of validity.
The a posteriori content-validity approach in criterion-referenced measurement uses content specialists to assess the quality and representativeness of the items within the test for measuring the content domain.
CRITERION-REFERENCED Validity Assessment by Content Specialists
Specialists should be conversant with the domain treated in the measuring tool.
Two or more content specialists are employed.
An item-objective congruence measure is applied (item level).
If more than one objective is used for a measure, the items that are measures of each objective usually are treated as separate tests when interpreting the results of validity assessments.
CRITERION-REFERENCED Validity Assessment by Content Specialists
Determination of interrater agreement
Average congruency percentage
Validity Assessment by Content Specialists
Determination of interrater agreement
Content specialists are provided with the conceptual definition of the variable(s) to be measured with the set of items.
The content specialists then independently rate the relevance of each item to the specified content domain.
P0 (observed agreement) ≥ 0.80; K (kappa) ≥ 0.25.
The index of content validity (CVI).
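A sketch of the two indices named above: P0 as the observed proportion of agreement and K as Cohen's kappa, a standard chance-corrected agreement index (an assumption here, since the slides do not spell out the formula). Ratings are invented; 1 = relevant, 0 = not relevant.

```python
# Interrater agreement for two content specialists rating item relevance:
# P0 = observed proportion of agreement;
# K  = (P0 - Pc) / (1 - Pc), with Pc the chance agreement expected from
#      each rater's marginal proportions (Cohen's kappa).
from collections import Counter

def p0_and_kappa(r1, r2):
    n = len(r1)
    p0 = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    pc = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(r1) | set(r2))
    return p0, (p0 - pc) / (1 - pc)

r1 = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]   # specialist 1 (invented)
r2 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]   # specialist 2 (invented)
p0, kappa = p0_and_kappa(r1, r2)
print(p0 >= 0.80 and kappa >= 0.25)   # meets both thresholds -> True
```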
Validity Assessment by Content Specialists
Determination of interrater agreement
If P0 and K, or either of these values, is too low, one problem or a combination of two problems could be operating.
First, the items lack homogeneity, are ambiguous, or the domain is not well defined.
E.g., agreement proportions of 0.67 (20 out of 30), 0.50 (15 out of 30), and 0.60 (18 out of 30).
If the majority of the item writers had at least one item that was judged not/somewhat relevant (1 or 2) by the three content specialists, this would be support for a lack of clarity in the domain definition.
Validity Assessment by Content Specialists
Determination of interrater agreement
Second, the problem may be due to the raters: they interpret the rating-scale labels differently or use the rating scale differently.
E.g., agreement proportions of 0.90 (27 out of 30), 0.93 (28 out of 30), and 0.93 (28 out of 30).
Each of the items judged to be unlike the rest had been prepared by one item writer. In this case the flaw is not likely to be in the domain definition as specified, but in the interpretations of one item writer.
Validity Assessment by Content Specialists
Determination of interrater agreement
Refinement of the domain specifications is required in the first case.
If the latter is the problem, the raters are given more explicit directions and guidelines in the use of the scale to reduce the chance of differential use.
A clear and precise domain definition is essential: domain specifications communicate what the results of measurements mean to those people who must interpret them, and what types of items and content should be included in the measure to those people who must construct the items.
CRITERION-REFERENCED Validity Assessment
Average congruency percentage
Content specialists judge the congruence of each item on a measure.
The proportion of items rated congruent by each judge is calculated and converted to a percentage.
Then the mean percentage for all judges is calculated to obtain the average congruency percentage.
E.g., if the percentages of congruent items for the judges are 95, 90, 100, and 100%, the average congruency percentage would be 96.25%.
A value of 90 percent or higher can safely be considered acceptable.
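The worked example above, computed directly:

```python
# Average congruency percentage: the mean of each judge's percentage of
# items rated congruent.

def average_congruency(percentages):
    return sum(percentages) / len(percentages)

judges = [95, 90, 100, 100]               # percent congruent per judge
acp = average_congruency(judges)
print(acp)             # 96.25
print(acp >= 90)       # acceptable -> True
```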
CRITERION-REFERENCED CONSTRUCT VALIDITY
Evidence of content validity is no guarantee that the measure is useful for its intended purpose.
"We may say that a test's results are accurately descriptive of the domain of behaviors it is supposed to measure; it is quite another thing to say that the function to which you wish to put a descriptively valid test is appropriate" (Popham, 1978, p. 159).
The major focus of construct validation is to establish support for the measure's ability to accurately categorize phenomena in accordance with the purpose for which the measure is being used.
CRITERION-REFERENCED CONSTRUCT VALIDITY
Approaches used to assess construct validity:
Experimental methods and the contrasted-groups approach
Decision validity
CONSTRUCT VALIDITY
Experimental methods and the contrasted-groups approach
The basic principles and procedures for these two approaches are the same for criterion-referenced measures as for norm-referenced measures.
CONSTRUCT VALIDITY
Decision validity
(1) A student may be allowed to progress to the next unit of instruction if test results indicate that the preceding unit has been mastered.
(2) A woman in early labor may be allowed to ambulate if the nurse assesses, on pelvic examination, that the fetal head is engaged (as opposed to unengaged) in the pelvis.
(3) A diabetic patient may be allowed to go home if the necessary skills for self-care have been mastered.
CONSTRUCT VALIDITY
Decision validity
The measurements obtained from criterion-referenced measures are often used to make decisions.
"Criterion-referenced tests have emerged as instruments that provide data via which mastery decisions can be made, as opposed to providing the decision itself" (Hashway, 1998, p. 112).
The decision validity of a measure is supported when the set standard(s) or criterion classifies subjects or objects with a high level of confidence.
CONSTRUCT VALIDITY
Decision validity
In most instances, two criterion groups (low and high) are used to test the decision validity of a measure.
E.g., "by summing the percentage who exceed the performance standard and the percentage who did not."
Decision validity can range from 0 to 100%, with high percentages reflecting high decision validity.
Criterion groups for testing the decision validity of a measure also can be created.
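A sketch of the two-criterion-group check, with invented scores and cut-score. Since the slide gives a 0-100% range, decision validity is computed here as the overall percentage of correct classifications across both groups; treat that reading as an assumption.

```python
# Decision validity sketch: subjects known to be high (masters) should score
# at or above the cut-score, and subjects known to be low should score below
# it. Reported as the overall percentage of correct classifications.

def decision_validity(high_scores, low_scores, cut):
    correct = (sum(s >= cut for s in high_scores)
               + sum(s < cut for s in low_scores))
    return 100 * correct / (len(high_scores) + len(low_scores))

masters    = [85, 90, 78, 88, 95, 74]   # expected to meet the standard
nonmasters = [60, 72, 55, 81, 58, 49]   # expected not to meet it

# 10 of 12 subjects are classified correctly at a cut-score of 75.
print(round(decision_validity(masters, nonmasters, cut=75), 1))
```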
CONSTRUCT VALIDITY
Decision validity
Decision validity is influenced by:
the quality of the measure
the appropriateness of the criterion groups
the characteristics of the subjects
the level of performance or cut-score required
CRITERION-REFERENCED Criterion-Related Validity
Criterion-related validity studies of criterion-referenced measures are conducted in the same manner as for norm-referenced measures.
CRITERION-REFERENCED ITEM-ANALYSIS PROCEDURES
Content specialists' ratings hold the most merit for assessing item validities and for determining which items should be retained or discarded.
Empirical item-discrimination indices should be used primarily to detect aberrant items in need of revision or correction.
Empirical Item-Analysis Procedures
Criterion-referenced item-analysis procedures determine the effectiveness of a specific test item in discriminating between subjects who have acquired the target behavior and those who have not.
Two approaches are used for item-analysis procedures:
(1) the criterion-groups technique, which also may be referred to as the uninstructed/instructed-groups approach
(2) the pretreatment/post-treatment measures approach, which in appropriate instances may be called the preinstruction/postinstruction measurements approach
Advantages and disadvantages
The criterion-groups technique is highly practical. One disadvantage is the difficulty of defining criteria for identifying the groups; another is the requirement of equivalence of the groups.
The pretreatment/post-treatment measures approach allows analysis of individual as well as group gains. Its disadvantages are impracticality, the amount of time that may be required, and a potential problem with testing effects.
CRITERION-REFERENCED ITEM-ANALYSIS PROCEDURES
Three item-analysis procedures are:
(1) item-objective or item-subscale congruence
(2) item difficulty
(3) discrimination index
ITEM-ANALYSIS PROCEDURES
Item-objective or item-subscale congruence
Provides an index of the validity of an item based on the ratings of two or more content specialists.
In this method, content specialists are directed to assign a value of +1, 0, or -1 to each item:
If an item definitely measures the objective or subscale, a value of +1 is assigned.
A rating of 0 indicates that the judge is undecided about the item.
The assignment of a -1 rating reflects a definite judgment that the item is not a measure of the objective or subscale.
ITEM-ANALYSIS PROCEDURES
Item-objective or item-subscale congruence
The limits of the index range from -1.00 to +1.00.
An index of +1.00 will occur when perfect positive item-objective or item-subscale congruence exists, that is, when all content specialists assign a +1 to the item for its related objective or subscale and a -1 to the item for all other objectives or subscales that are measured by the tool.
An index of -1.00 represents the worst possible value of the index and occurs when all content specialists assign a -1 to the item for what was expected to be its related objective or subscale and a +1 to the item for all other objectives or subscales.
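The +1/0/-1 ratings can be combined into the index described above. This sketch uses the Rovinelli-Hambleton formulation of the item-objective congruence index, a standard choice but an assumption here, since the slides do not state the formula. `mean_ratings[k]` is the mean specialist rating of one item against objective k.

```python
# Item-objective congruence index (Rovinelli-Hambleton form, assumed):
#   I = N / (2N - 2) * (mean rating on the target objective - grand mean),
# where N is the number of objectives. Objective 0 is the item's intended
# objective in the examples below.

def item_objective_congruence(mean_ratings, target=0):
    n = len(mean_ratings)                 # number of objectives (must be >= 2)
    grand_mean = sum(mean_ratings) / n
    return n / (2 * n - 2) * (mean_ratings[target] - grand_mean)

# Perfect congruence: every specialist gives +1 on the intended objective
# and -1 on every other objective.
print(item_objective_congruence([1.0, -1.0, -1.0]))   # 1.0
# Worst case: the reverse pattern.
print(item_objective_congruence([-1.0, 1.0, 1.0]))    # -1.0
```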
ITEM-ANALYSIS PROCEDURES
Item-objective or item-subscale congruence
The index does not depend on the number of content specialists used or on the number of objectives measured by the test or questionnaire.
However, the tool must include more than one objective or subscale in order for this procedure to be used.
A cut-off score is derived by the test developer; this is done by creating the poorest set of content specialists' ratings that would still be acceptable.
Items below the cut-off score are considered nonvalid and are discarded from the measure or analyzed and revised to improve their validity.
Items above the cut-off score are considered valid.
ITEM-ANALYSIS PROCEDURES
Item difficulty
The purpose is to examine the difficulty level of items and to compare them between criterion groups.
The approaches to calculating item p levels and their interpretation were discussed earlier.
The item p level should be higher for the group known to possess more of the specified trait or attribute than for the group known to possess less.
ITEM-ANALYSIS PROCEDURES
Item discrimination
The focus is on the measurement of performance changes (e.g., pretest/posttest) or differences (e.g., experienced/inexperienced) between the criterion groups.
Also referred to as D.
Item discrimination is directly related to the property of decision validity: items with high positive discrimination indices improve the decision validity of a test.
ITEM-ANALYSIS PROCEDURES
Item discrimination
Criterion-groups difference index (CGDI)
Pretreatment/post-treatment measurements approach indices
ITEM-ANALYSIS PROCEDURES
Item discrimination
The criterion-groups difference index (CGDI) is the proportion of respondents in the group known to possess more of the trait or attribute of interest who answered the item correctly, minus the proportion of respondents in the group known to have less of the trait or attribute who answered it appropriately or correctly.
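The CGDI definition above, computed for one item with invented criterion-group responses (1 = correct):

```python
# CGDI = proportion correct in the "more of the trait" group
#      - proportion correct in the "less of the trait" group.

def cgdi(more_group, less_group):
    p_more = sum(more_group) / len(more_group)
    p_less = sum(less_group) / len(less_group)
    return p_more - p_less

experienced   = [1, 1, 1, 0, 1, 1, 1, 1]   # group known to have MORE of the trait
inexperienced = [0, 1, 0, 0, 1, 0, 0, 1]   # group known to have LESS of the trait

print(cgdi(experienced, inexperienced))  # 0.875 - 0.375 = 0.5
```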
ITEM-ANALYSIS PROCEDURES
Item discrimination
Pretreatment/post-treatment measurements approach: three item-discrimination indices are
(1) pretest/posttest difference
(2) individual gain
(3) net gain
ITEM-ANALYSIS PROCEDURES
Item discrimination
The pretest/posttest difference index (PPDI) is the proportion of respondents who answered the item correctly on the posttest minus the proportion who responded to the item correctly on the pretest.
The individual gain index (IGI) is the proportion of respondents who answered the item incorrectly on the pretest and correctly on the posttest.
The net gain index (NGI) is the proportion of respondents who answered the item incorrectly on both occasions subtracted from the IGI.
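The three indices defined above, computed for one item from paired pre/post responses (True = correct; data invented):

```python
# PPDI, IGI, and NGI for one item, from paired pretest/posttest responses.

def gain_indices(pre, post):
    n = len(pre)
    ppdi = sum(post) / n - sum(pre) / n                       # post % - pre %
    igi = sum((not a) and b for a, b in zip(pre, post)) / n   # wrong -> right
    both_wrong = sum((not a) and (not b) for a, b in zip(pre, post)) / n
    ngi = igi - both_wrong                                    # IGI minus wrong-both
    return ppdi, igi, ngi

pre  = [False, False, True,  False, False, True,  False, False]
post = [True,  True,  True,  True,  False, True,  True,  False]

ppdi, igi, ngi = gain_indices(pre, post)
print(ppdi, igi, ngi)  # 0.5 0.5 0.25
```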
ITEM-ANALYSIS PROCEDURES
Item discrimination
The NGI provides the most conservative estimate of item discrimination and uses the most information.
The range of values for each of the indices discussed above is -1.00 to +1.00, except for the IGI, which has a range of 0 to +1.00.
A high positive value for each of these item-discrimination indices is desirable.