ARCH: Bridging the Divide Between Assessment Practice in Low & High-Stakes Contexts

Andrew F. Barrett & Dr. Theodore W. Frick
Department of Instructional Systems Technology, School of Education, Indiana University Bloomington

ITEM CALIBRATION BURDEN HINDERS CAT ADOPTION
- CAT depends on an item bank & specific information about items established during item calibration.
- Calibration can involve gathering responses from hundreds or thousands of examinees before an item can be used.
- Large motivational differences exist between examinees who participate in item calibration, an often low- or no-stakes context, & actual test examinees (Wise & DeMars, 2006, in Makransky, 2010).

PROBLEMS WITH CAT
- Heavy resource requirements of item calibration can make CAT impractical in all but a few large-scale, high-stakes, and/or highly profitable contexts.
- The long tail of assessment in lower-stakes contexts could benefit from CAT.

VL-CCT CAN REDUCE ITEM CALIBRATION BURDEN
A VL-CCT Solution
- Variable-Length Computerized Classification Testing (VL-CCT) can be accurate & efficient with a potentially less arduous item calibration phase (Thompson, 2007; Rudner, 2009).
- VL-CCT may bring the benefits of CAT to lower-stakes contexts.
- VL-CCT places examinee ability into 2 or more mutually exclusive groups (e.g., master & nonmaster, or A, B, C, D, & F), a more common practice in education than precisely estimating ability.
- Wald's (1947) Sequential Probability Ratio Test (SPRT) requires little, if any, item calibration & has been shown to make accurate classification decisions while increasing test efficiency threefold (Frick, 1989).
- Frick (1992) demonstrated that a calibration phase involving as few as 25 examinees from each classification group enabled efficient classification testing without compromising accuracy.
- Classical Test Theory-based VL-CCT depends on fewer assumptions than Item Response Theory-based VL-CCT.

ARCH FOUNDATIONS
SPRT: The Sequential Probability Ratio Test uses item-bank-level probabilities & responses to randomly selected items to make classification decisions.

EXSPRT-R: The EXSPRT-R (EX stands for EXpert & R stands for Random selection) applies expert systems thinking & item-level probabilities from each classification group to estimate the likelihood that an examinee belongs to a classification. It ends once it is confident in a particular decision.

Beta Density Function: A beta density function is based on the number of correct & incorrect responses from a classification group during the item calibration phase. It can be used to estimate the probability of a correct response from a specific classification group. For example, beta(* | 2, 3) could correspond to 2 correct & 3 incorrect responses from nonmasters to an item. The dashed vertical line [in the poster's figure] represents the expected mean of the beta distribution (.42) & the estimate of the probability that a nonmaster would respond correctly to the item. However, with so little data (only 5 responses) the 95% highest density region is very wide (from about .1 to .75), so we cannot put much confidence in the expected mean. Collecting more data would narrow the highest density region.

ARCH APPROACH TO ITEM CALIBRATION
ARCH (Automatic Racing Calibration Heuristics) uses statistical hypothesis testing to address item calibration. In ARCH, SPRT is pitted against M-EXSPRT-R in a race to make accurate classification decisions about examinees. Initially, only item-bank-level parameter estimates for SPRT are available; M-EXSPRT-R must collect the data it needs during live testing. After each classification, ARCH:
1. Automatically uses responses gathered during online testing to update calibration data for items.
2. Uses heuristics (decision table below) to see if any items are sufficiently calibrated for use with M-EXSPRT-R.
As testing continues, more items become sufficiently calibrated for M-EXSPRT-R, which increases the chances that M-EXSPRT-R will be able to make classification decisions before SPRT. In other words, tests get smarter & shorter as data are collected.
[Figure: Test Length vs. Calibration Data]

QUESTIONS
1. When is an item sufficiently calibrated for use with M-EXSPRT-R?
2. How well do ARCH examinee classification decisions agree with those made using the total test, traditionally calibrated SPRT, & traditionally calibrated EXSPRT-R?
3. How efficient is ARCH in comparison to traditionally calibrated SPRT & EXSPRT-R?

EXAMPLE: Not Fully Calibrated, M-EXSPRT-R Still Beats SPRT
- Start with an item bank calibrated for use with SPRT: P(C|M) = .85, P(C|N) = .40, P(¬C|M) = .15, P(¬C|N) = .60.
- The ARCH approach has sufficiently calibrated items 63, 23, 1, & 38 for use with M-EXSPRT-R.
- After 7 randomly administered items, only M-EXSPRT-R is able to make a decision, despite not being able to use 3/7 of the responses.
- SPRT & M-EXSPRT-R are able to use the responses to items 63 & 23 to calculate the corresponding probability & likelihood ratios.
- Only SPRT can use the item 28 response to update the probability ratio; M-EXSPRT-R does not yet know enough about item 28.
- Items 28, 87, & 11 have not yet met the calibration heuristics criteria & have been neither accepted nor rejected for use by M-EXSPRT-R.

ARCH RESEARCH
Phase I: Test re-enactments via computer simulations using historical test data from a previous study (Frick, 1992). ARCH settings needed for accurate testing, established in phase I, will be used in phase III.
Phase II: Pilot test & calibrate test items created for a new version of the online test used by Indiana University's plagiarism tutorial, available at https://www.indiana.edu/~istd
Phase III: Live testing with the new version of the plagiarism test.
Participants: Phase II & III participants will be recruited from the thousands of individuals who take the Indiana University plagiarism tutorial.

REFERENCES
Frick, T. W. (1989). Bayesian adaptation during computer-based tests and computer-guided practice exercises. Journal of Educational Computing Research, 5(1), 89-114.
Frick, T. W. (1992). Computerized adaptive mastery tests as expert systems. Journal of Educational Computing Research, 8(2), 187-213.
Makransky, G., & Glas, C. A. W. (2010). Unproctored Internet test verification: Using adaptive confirmation testing. Organizational Research Methods. doi:10.1177/1094428110370715
Rudner, L. M. (2009). Scoring and classifying examinees using measurement decision theory. Practical Assessment, Research & Evaluation, 14(8). Retrieved from http://pareonline.net/getvn.asp?v=14&n=8
Stein, C., & Wald, A. (1947). Sequential confidence intervals for the mean of a normal distribution with known variance. The Annals of Mathematical Statistics, 427-433.
Thompson, N. A. (2007). A practitioner's guide for variable-length computerized classification testing. Practical Assessment, Research & Evaluation, 12(1). Retrieved from http://pareonline.net/getvn.asp?v=12&n=1

CONTACT
Andrew F. Barrett, Doctoral Candidate, [email protected], http://Andrew.B4RRETT.com
Dr. Theodore W. Frick, Professor & Chairman, [email protected], https://www.indiana.edu/~tedfrick/
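The beta-density reasoning above can be sketched numerically. This is a minimal sketch, assuming a uniform Beta(1, 1) prior, so 2 correct & 3 incorrect nonmaster responses yield a Beta(3, 4) posterior with mean 3/7 ≈ .43 (the poster prints .42, which may reflect a slightly different prior or rounding); the grid-based highest-density-region estimate is our construction, not taken from the poster:

```python
# Posterior over P(correct | nonmaster) after 2 correct & 3 incorrect
# responses, assuming (our assumption) a uniform Beta(1,1) prior,
# which gives a Beta(3, 4) posterior.
import math

def beta_pdf(x, a, b):
    """Beta density, computed via log-gamma for numerical stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_hdr(a, b, mass=0.95, steps=100_000):
    """Approximate the 95% highest density region by greedily keeping the
    highest-density grid cells until `mass` probability is covered."""
    xs = [(i + 0.5) / steps for i in range(steps)]
    cells = sorted(((beta_pdf(x, a, b), x) for x in xs), reverse=True)
    covered, kept = 0.0, []
    for dens, x in cells:
        covered += dens / steps   # each cell contributes pdf * cell width
        kept.append(x)
        if covered >= mass:
            break
    return min(kept), max(kept)  # unimodal density -> region is an interval

a, b = 1 + 2, 1 + 3          # prior counts + (correct, incorrect) responses
mean = a / (a + b)           # posterior mean, ~0.43
lo, hi = beta_hdr(a, b)      # a wide region, roughly (.1, .75) as on the poster
print(f"mean={mean:.2f}, 95% HDR=({lo:.2f}, {hi:.2f})")
```

With only 5 responses the region spans most of the unit interval; rerunning with larger counts (e.g., 20 correct & 30 incorrect) shows the region narrowing, which is exactly why ARCH keeps collecting responses before trusting an item's estimates.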

ARCH RESEARCH DESIGN (N > 1,600 participants across phases)

Phase I: Computer Simulations with Historical Data
Participants: Examinees (n=104) from a previous study (Frick, 1992) who responded to an 85-item test. Participants came from 2 sections of a graduate course, one undergraduate course, and a few volunteer recruits from IU's main library.
RQ1 (When is an item sufficiently calibrated?)
  Data collection: Stage 1 simulations, i.e., calibration and testing with varying ARCH settings.
  Analysis: MAPSAT APT to identify specific ARCH settings that are likely to achieve sufficient test classification accuracy while hastening deployment of M-EXSPRT-R.
RQ2 (How accurate is ARCH?)
  Data collection: Stage 2 simulations, i.e., calibration and testing using the ARCH settings established in the stage 1 simulations; resulting classification decisions and test lengths recorded.
  Analysis: Pearson's chi-square tests to determine whether ARCH classification decisions deviate significantly from total-test classification decisions.
RQ3 (How efficient is ARCH?)
  Analysis: Repeated-measures one-way ANOVA and post hoc testing to determine whether the mean test lengths of ARCH, SPRT, and EXSPRT-R tests differ significantly.

Phase II: Plagiarism Test Item Calibration
Participants: Volunteers (n>700) from the thousands who take the IU plagiarism tutorial.
Phase II does not address the research questions but, instead, calibrates test items for use in phase III. Responses to all 150 items in the pool will be collected and recorded from (1) at least 15 examinees as part of pilot testing and (2) at least 50 masters and 50 nonmasters, in order to establish the item parameter estimates for use with EXSPRT-R.

Phase III: Plagiarism Test Live Testing
Participants: Volunteers (n>800) from the thousands who take the IU plagiarism tutorial.
RQ1 (When is an item sufficiently calibrated?) & RQ2 (How accurate is ARCH?)
  Data collection: Live testing; examinee item responses will be collected. For each examinee, the number of items required to make a classification decision and the decision itself will be recorded for each testing method.
  Analysis: Pearson's chi-square tests to determine whether ARCH, calibrated using the ARCH settings established in phase I, makes classification decisions that deviate significantly from decisions made by EXSPRT-R calibrated with 50 masters and 50 nonmasters.
RQ3 (How efficient is ARCH?)
  Analysis: Repeated-measures one-way ANOVA and post hoc testing to determine whether the mean test lengths of ARCH, SPRT, and EXSPRT-R tests differ significantly.

SPRT test record (item-bank parameters P(C|M) = .85, P(C|N) = .40; prior probabilities .5/.5, so the initial PR = 1; R = response, C = correct, ¬C = incorrect):

 #   R    Probability of R from:    Probability examinee is a:    PR      Test decision
          Master    Nonmaster       Master    Nonmaster
 1   C    .85       .40             .680      .320                2.125   Continue
 2   ¬C   .15       .60             .347      .653                0.531   Continue
 3   C    .85       .40             .530      .470                1.129   Continue
 4   C    .85       .40             .706      .294                2.399   Continue
 5   C    .85       .40             .836      .164                5.098   Continue
 6   C    .85       .40             .915      .085                10.833  Continue
 7   C    .85       .40             .958      .042                23.019  Continue
 8   C    .85       .40             .980      .020                48.916  Master
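The arithmetic behind this record can be sketched in a few lines. The item-bank parameters come from the example; the error rates alpha = beta = .025 (and hence the Wald decision thresholds) are our assumption for illustration, chosen so the decisions come out as in the table:

```python
# SPRT probability-ratio updates reproducing the test record above.
# P(C|M) = .85 and P(C|N) = .40 are the example's item-bank parameters;
# alpha/beta (and the derived thresholds) are an illustrative assumption.
p_c_m, p_c_n = 0.85, 0.40
alpha = beta = 0.025
upper = (1 - beta) / alpha       # classify Master when PR >= 39.0
lower = beta / (1 - alpha)       # classify Nonmaster when PR <= ~0.026

# The 8 responses from the table: all correct except the second.
responses = [True, False, True, True, True, True, True, True]

pr, prs, decision = 1.0, [], "Continue"  # equal .5/.5 priors -> initial PR = 1
for r in responses:
    # Multiply in the likelihood ratio for a correct or incorrect response.
    pr *= (p_c_m / p_c_n) if r else ((1 - p_c_m) / (1 - p_c_n))
    prs.append(pr)
    if pr >= upper:
        decision = "Master"
        break
    if pr <= lower:
        decision = "Nonmaster"
        break

print([round(v, 3) for v in prs], decision)
# -> [2.125, 0.531, 1.129, 2.399, 5.098, 10.833, 23.019, 48.916] Master
```

The printed values match the PR column of the table, with the Master decision fired by the eighth response.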


M-EXSPRT-R test record (prior probabilities .5/.5, initial PR = 1; i = item administered):

 #   i    R   Probability of R to i from:   Probability examinee is a:   PR     Test decision
              Master    Nonmaster           Master    Nonmaster
 1   63   C   .11       .35                 .239      .761               .314   Continue
 2   23   C   .81       .24                 .515      .485               1.064  Continue
 3   1    C   .08       .53                 .138      .862               0.160  Continue
 4   38   C   .02       .14                 .024      .976               .025   Nonmaster
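The same likelihood-ratio arithmetic at the item level can be sketched as follows. The per-item (P(Ci|M), P(Ci|N)) pairs are read from the record above; the error rates alpha = beta = .05 behind the Nonmaster threshold are our assumption, and exact cumulative ratios differ from the printed PR column by rounding of the poster's intermediate values:

```python
# M-EXSPRT-R style item-level likelihood-ratio updates for the record above.
# Item parameter pairs (P(Ci|M), P(Ci|N)) come from the table; the error
# rates behind the decision thresholds are an illustrative assumption.
items = {63: (0.11, 0.35), 23: (0.81, 0.24), 1: (0.08, 0.53), 38: (0.02, 0.14)}
administered = [(63, "C"), (23, "C"), (1, "C"), (38, "C")]

alpha = beta = 0.05
upper = (1 - beta) / alpha   # Master when LR >= 19
lower = beta / (1 - alpha)   # Nonmaster when LR <= ~0.053

lr, history = 1.0, []
for item, r in administered:
    p_c_m, p_c_n = items[item]
    # Each item contributes its own likelihood ratio, unlike SPRT's
    # single item-bank-level ratio.
    lr *= (p_c_m / p_c_n) if r == "C" else ((1 - p_c_m) / (1 - p_c_n))
    history.append(lr)

decision = "Master" if lr >= upper else "Nonmaster" if lr <= lower else "Continue"
print([round(v, 3) for v in history], decision)
```

The ratios track the table (.314, then roughly 1.06, .160, and .023), and the fourth response pushes the ratio below the Nonmaster threshold, matching the table's final decision.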

ARCH calibration heuristics (decision flow for item i; comparison operators reconstructed from the flattened flowchart):

1. Is ni ≥ nmax? Yes: Reject i. Stop calibration.
2. No: Is P(Ci|M) < P(Ci|N)? Yes: Reject i. Stop calibration.
3. No: Is P(Ci|M) > P(Ci|N)? Yes: Are HDRW of P(Ci|N) ≤ HDRWmax & HDRW of P(Ci|M) ≤ HDRWmax? Yes: Accept i. Stop calibration.
4. Otherwise (any remaining No): No decision. Continue calibration on i.

Where:
  ni = number of times item i has been administered during calibration
  nmax = maximum administrations for any item during calibration
  P(Ci|M) = probability of a correct response to item i given mastery
  P(Ci|N) = probability of a correct response to item i given nonmastery
  HDRW of P(Ci|N) = highest density region width (HDRW) of the estimate of P(Ci|N)
  HDRW of P(Ci|M) = HDRW of the estimate of P(Ci|M)
  HDRWmax = maximum HDRW for an estimate to be considered sufficiently precise
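Using these definitions, the heuristics can be sketched as a single decision function. This is a sketch, not the study's implementation: the uniform Beta(1, 1) prior, the posterior-mean estimates of P(Ci|M) and P(Ci|N), the grid-based HDRW estimate, and the example HDRWmax = 0.2 are all our assumptions:

```python
# Sketch of the ARCH calibration heuristics for a single item i.
# Assumptions: Beta(1,1) prior, posterior-mean point estimates, and a
# grid-based approximation of the highest-density-region width.
import math

def beta_hdrw(correct, incorrect, mass=0.95, steps=50_000):
    """Width of the approximate 95% HDR of a Beta posterior over P(correct)."""
    a, b = 1 + correct, 1 + incorrect
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    xs = [(i + 0.5) / steps for i in range(steps)]
    cells = sorted(((math.exp(log_norm + (a - 1) * math.log(x)
                              + (b - 1) * math.log(1 - x)), x) for x in xs),
                   reverse=True)
    covered, kept = 0.0, []
    for dens, x in cells:        # keep highest-density cells until 95% covered
        covered += dens / steps
        kept.append(x)
        if covered >= mass:
            break
    return max(kept) - min(kept)

def arch_item_status(n_i, n_max, cm, im, cn, inn, hdrw_max=0.2):
    """Apply the decision heuristics above: 'reject', 'accept', or 'continue'.
    cm/im = correct/incorrect counts from masters; cn/inn = from nonmasters."""
    if n_i >= n_max:
        return "reject"                    # step 1: calibration budget exhausted
    p_ci_m = (1 + cm) / (2 + cm + im)      # posterior-mean estimates of
    p_ci_n = (1 + cn) / (2 + cn + inn)     # P(Ci|M) and P(Ci|N)
    if p_ci_m < p_ci_n:
        return "reject"                    # step 2: discriminates the wrong way
    if p_ci_m > p_ci_n and beta_hdrw(cn, inn) <= hdrw_max \
            and beta_hdrw(cm, im) <= hdrw_max:
        return "accept"                    # step 3: sufficiently calibrated
    return "continue"                      # step 4: no decision yet

# Plenty of data & good discrimination -> the item is accepted for M-EXSPRT-R.
print(arch_item_status(100, 400, cm=170, im=30, cn=80, inn=120))
```

A few sparse responses (e.g., 2 correct & 1 incorrect from masters) leave the HDRWs far wider than HDRWmax and yield "continue", which is exactly the state of items 28, 87, & 11 in the example above.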