


Copyright 2001 Psychonomic Society, Inc.

Perception & Psychophysics, 2001, 63 (8), 1314-1329

The psychometric function: II. Bootstrap-based confidence intervals and sampling

FELIX A. WICHMANN and N. JEREMY HILL
University of Oxford, Oxford, England

The psychometric function relates an observer's performance to an independent variable, usually a physical quantity of an experimental stimulus. Even if a model is successfully fit to the data and its goodness of fit is acceptable, experimenters require an estimate of the variability of the parameters to assess whether differences across conditions are significant. Accurate estimates of variability are difficult to obtain, however, given the typically small size of psychophysical data sets: Traditional statistical techniques are only asymptotically correct and can be shown to be unreliable in some common situations. Here and in our companion paper (Wichmann & Hill, 2001), we suggest alternative statistical techniques based on Monte Carlo resampling methods. The present paper's principal topic is the estimation of the variability of fitted parameters and derived quantities, such as thresholds and slopes. First, we outline the basic bootstrap procedure and argue in favor of the parametric, as opposed to the nonparametric, bootstrap. Second, we describe how the bootstrap bridging assumption, on which the validity of the procedure depends, can be tested. Third, we show how one's choice of sampling scheme (the placement of sample points on the stimulus axis) strongly affects the reliability of bootstrap confidence intervals, and we make recommendations on how to sample the psychometric function efficiently. Fourth, we show that, under certain circumstances, the (arbitrary) choice of the distribution function can exert an unwanted influence on the size of the bootstrap confidence intervals obtained, and we make recommendations on how to avoid this influence. Finally, we introduce improved confidence intervals (bias corrected and accelerated) that improve on the parametric and percentile-based bootstrap confidence intervals previously used. Software implementing our methods is available.

The performance of an observer on a psychophysical task is typically summarized by reporting one or more response thresholds (stimulus intensities required to produce a given level of performance) and by a characterization of the rate at which performance improves with increasing stimulus intensity. These measures are derived from a psychometric function, which describes the dependence of an observer's performance on some physical aspect of the stimulus.

Fitting psychometric functions is a variant of the more general problem of modeling data. Modeling data is a three-step process: First, a model is chosen, and the parameters are adjusted to minimize the appropriate error metric or loss function. Second, error estimates of the parameters are derived, and third, the goodness of fit between model and data is assessed. This paper is concerned with the second of these steps, the estimation of variability in fitted parameters and in quantities derived from them. Our companion paper (Wichmann & Hill, 2001) illustrates how to fit psychometric functions while avoiding bias resulting from stimulus-independent lapses, and how to evaluate goodness of fit between model and data.

We advocate the use of Efron's bootstrap method, a particular kind of Monte Carlo technique, for the problem of estimating the variability of parameters, thresholds, and slopes of psychometric functions (Efron, 1979, 1982; Efron & Gong, 1983; Efron & Tibshirani, 1991, 1993). Bootstrap techniques are not without their own assumptions and potential pitfalls. In the course of this paper, we shall discuss these and examine their effect on the estimates of variability we obtain. We describe and examine the use of parametric bootstrap techniques in finding confidence intervals for thresholds and slopes. We then explore the sensitivity of the estimated confidence interval widths to (1) sampling schemes, (2) mismatch of the objective function, and (3) accuracy of the originally fitted parameters. The last of these is particularly important, since it provides a test of the validity of the bridging assumption on which the use of parametric bootstrap techniques relies. Finally, we recommend, on the basis of the theoretical work of others, the use of a technique called bias correction with acceleration (BCa) to obtain stable and accurate confidence interval estimates.

Part of this work was presented at the Computers in Psychology (CiP) Conference in York, England, during April 1998 (Hill & Wichmann, 1998). This research was supported by a Wellcome Trust Mathematical Biology Studentship, a Jubilee Scholarship from St. Hugh's College, Oxford, and a Fellowship by Examination from Magdalen College, Oxford, to F.A.W., and by a grant from the Christopher Welch Trust Fund and a Maplethorpe Scholarship from St. Hugh's College, Oxford, to N.J.H. We are indebted to Andrew Derrington, Karl Gegenfurtner, Bruce Henning, Larry Maloney, Eero Simoncelli, and Stefaan Tibeau for helpful comments and suggestions. This paper benefited considerably from conscientious peer review, and we thank our reviewers, David Foster, Stan Klein, Marjorie Leek, and Bernhard Treutwein, for helping us to improve our manuscript. Software implementing the methods described in this paper is available (MATLAB); contact F.A.W. at the address provided or see http://users.ox.ac.uk/~sruoxfor/psychofit/. Correspondence concerning this article should be addressed to F. A. Wichmann, Max-Planck-Institut für Biologische Kybernetik, Spemannstraße 38, D-72076 Tübingen, Germany (e-mail: felix@tuebingen.mpg.de).

BACKGROUND

The Psychometric Function
Our notation will follow the conventions we have outlined in Wichmann and Hill (2001). A brief summary of terms follows.

Performance on K blocks of a constant-stimuli psychophysical experiment can be expressed using three vectors, each of length K. An x denotes the stimulus values used, an n denotes the numbers of trials performed at each point, and a y denotes the proportion of correct responses (in n-alternative forced-choice [n-AFC] experiments) or positive responses (single-interval or yes/no experiments) on each block. We often use N to refer to the total number of trials in the set: N = ∑ni.

The number of correct responses, yi·ni, in a given block i is assumed to be the sum of random samples from a Bernoulli process with probability of success pi. A psychometric function ψ(x) is the function that relates the stimulus dimension x to the expected performance value p.

A common general form for the psychometric function is

ψ(x; α, β, γ, λ) = γ + (1 − γ − λ) F(x; α, β).    (1)

The shape of the curve is determined by our choice of a functional form for F and by the four parameters α, β, γ, λ, to which we shall refer collectively by using the parameter vector θ. F is typically a sigmoidal function, such as the Weibull, cumulative Gaussian, logistic, or Gumbel. We assume that F describes the underlying psychological mechanism of interest. The parameters γ and λ determine the lower and upper bounds of the curve, which are affected by other factors. In yes/no paradigms, γ is the guess rate and λ the miss rate. In n-AFC paradigms, γ usually reflects chance performance and is fixed at the reciprocal of the number of intervals per trial, and λ reflects the stimulus-independent error rate, or lapse rate (see Wichmann & Hill, 2001, for more details).

When a parameter set has been estimated, we will usually be interested in measurements of the threshold (displacement along the x-axis) and slope of the psychometric function. We calculate thresholds by taking the inverse of F at a specified probability level, usually .5. Slopes are calculated by finding the derivative of F with respect to x, evaluated at a specified threshold. Thus, we shall use the notation threshold0.8, for example, to mean F⁻¹(.8), and slope0.8 to mean dF/dx evaluated at F⁻¹(.8). When we use the terms threshold and slope without a subscript, we mean threshold0.5 and slope0.5. In our 2AFC examples, this will mean the stimulus value and slope of F at the point where performance is approximately 75% correct, although the exact performance level is affected slightly by the (small) value of λ.
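These definitions take only a few lines in code. The sketch below is our own Python illustration (the authors' published software is MATLAB), assuming the Weibull form for F used in the simulations later in the paper; all function names are ours.

```python
import math

def weibull_F(x, alpha, beta):
    """Sigmoid F(x; alpha, beta) = 1 - exp(-(x/alpha)**beta)."""
    return 1.0 - math.exp(-((x / alpha) ** beta))

def weibull_F_inv(p, alpha, beta):
    """Inverse of F: the stimulus value x at which F(x) = p."""
    return alpha * (-math.log(1.0 - p)) ** (1.0 / beta)

def weibull_F_slope(x, alpha, beta):
    """dF/dx, the slope of F at x."""
    return (beta / alpha) * (x / alpha) ** (beta - 1.0) \
        * math.exp(-((x / alpha) ** beta))

def psi(x, alpha, beta, gamma, lam):
    """Equation 1: psi(x) = gamma + (1 - gamma - lambda) F(x)."""
    return gamma + (1.0 - gamma - lam) * weibull_F(x, alpha, beta)

# Generating parameters used throughout the paper's simulations:
alpha, beta, gamma, lam = 10.0, 3.0, 0.5, 0.01

threshold = weibull_F_inv(0.5, alpha, beta)      # threshold_0.5
slope = weibull_F_slope(threshold, alpha, beta)  # slope_0.5

# In 2AFC (gamma = .5), expected performance at threshold_0.5 is
# approximately 75% correct; a nonzero lambda shifts it down slightly.
print(threshold, psi(threshold, alpha, beta, gamma, lam))
```

Note that ψ at threshold0.5 equals γ + (1 − γ − λ)/2, i.e. .745 rather than exactly .75 when λ = .01, which is the small effect of λ mentioned above.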

Where an estimate of a parameter set is required, given a particular data set, we use a maximum-likelihood search algorithm with Bayesian constraints on the parameters, based on our beliefs about their possible values. For example, λ is constrained within the range [0, .06], reflecting our belief that normal, trained observers do not make stimulus-independent errors at high rates. We describe our method in detail in Wichmann and Hill (2001).

Estimates of Variability: Asymptotic Versus Monte Carlo Methods

In order to be able to compare response thresholds or slopes across experimental conditions, experimenters require a measure of their variability, which will depend on the number of experimental trials taken and their placement along the stimulus axis. Thus, a fitting procedure must provide not only parameter estimates, but also error estimates for those parameters. Reporting error estimates on fitted parameters is, unfortunately, not very common in psychophysical studies. Sometimes probit analysis has been used to provide variability estimates (Finney, 1952, 1971). In probit analysis, an iteratively reweighted linear regression is performed on the data once they have undergone transformation through the inverse of a cumulative Gaussian function. Probit analysis relies, however, on asymptotic theory: Maximum-likelihood estimators are asymptotically Gaussian, allowing the standard deviation to be computed from the empirical distribution (Cox & Hinkley, 1974). Asymptotic methods assume that the number of datapoints is large; unfortunately, however, the number of points in a typical psychophysical data set is small (between 4 and 10, with between 20 and 100 trials at each), and in these cases substantial errors have accordingly been found in the probit estimates of variability (Foster & Bischof, 1987, 1991; McKee, Klein, & Teller, 1985). For this reason, asymptotic theory methods are not recommended for estimating variability in most realistic psychophysical settings.

An alternative method, the bootstrap (Efron, 1979, 1982; Efron & Gong, 1983; Efron & Tibshirani, 1991, 1993), has been made possible by the recent sharp increase in the processing speed of desktop computers. The bootstrap method is a Monte Carlo resampling technique relying on a large number of simulated repetitions of the original experiment. It is potentially well suited to the analysis of psychophysical data, because its accuracy does not rely on large numbers of trials, as do methods derived from asymptotic theory (Hinkley, 1988). We apply the bootstrap to the problem of estimating the variability of parameters, thresholds, and slopes of psychometric functions, following Maloney (1990), Foster and Bischof (1987, 1991, 1997), and Treutwein (1995; Treutwein & Strasburger, 1999).

The essence of Monte Carlo techniques is that a large number B of "synthetic" data sets y*1, …, y*B are generated. For each data set y*i, the quantity of interest J (threshold or slope, e.g.) is estimated to give J*i. The process for obtaining J*i is the same as that used to obtain the first estimate J. Thus, if our first estimate was obtained by J = t(θ̂), where θ̂ is the maximum-likelihood parameter estimate from a fit to the original data y, then the simulated estimates J*i will be given by J*i = t(θ*i), where θ*i is the maximum-likelihood parameter estimate from a fit to the simulated data y*i.

Sometimes it is erroneously assumed that the intention is to measure the variability of the underlying J itself. This cannot be the case, however, because repeated computer simulation of the same experiment is no substitute for the real repeated measurements this would require. What Monte Carlo simulations can do is estimate the variability inherent in (1) our sampling, as characterized by the distribution of sample points (x) and the size of the samples (n), and (2) any interaction between our sampling strategy and the process used to estimate J: that is, assuming a model of the observer's variability, fitting a function to obtain θ̂, and applying t(θ̂).

Bootstrap Data Sets: Nonparametric and Parametric Generation

In applying Monte Carlo techniques to psychophysical data, we require, in order to obtain a simulated data set y*i, some system that provides generating probabilities p for the binomial variates y*i1, …, y*iK. These should be the same generating probabilities that we hypothesize to underlie the empirical data set y.

Efron's bootstrap offers such a system. In the nonparametric bootstrap method, we would assume p = y. This is equivalent to resampling, with replacement, the original set of correct and incorrect responses on each block of observations j in y to produce a simulated sample y*ij.

Alternatively, a parametric bootstrap can be performed. In the parametric bootstrap, assumptions are made about the generating model from which the observed data are believed to arise. In the context of estimating the variability of parameters of psychometric functions, the data are generated by a simulated observer whose underlying probabilities of success are determined by the maximum-likelihood fit to the real observer's data [yfit = ψ(x; θ̂)]. Thus, where the nonparametric bootstrap uses y, the parametric bootstrap uses yfit as generating probabilities p for the simulated data sets.
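The distinction can be made concrete with a short sketch. This is our Python illustration (not the authors' MATLAB implementation); the observed proportions y and the smooth values standing in for yfit are made-up numbers for a hypothetical K = 6 experiment.

```python
import random

def simulate_dataset(p_gen, n, rng):
    """Draw one synthetic data set y*: for each block i, the proportion
    correct out of n[i] Bernoulli trials with success probability p_gen[i]."""
    y_star = []
    for p, ni in zip(p_gen, n):
        correct = sum(1 for _ in range(ni) if rng.random() < p)
        y_star.append(correct / ni)
    return y_star

# Hypothetical observed proportions y on K = 6 blocks, 80 trials per block:
y = [0.52, 0.55, 0.64, 0.81, 0.93, 0.97]
n = [80] * 6

# Nonparametric bootstrap: the data themselves supply the
# generating probabilities, p = y.
p_nonparametric = y

# Parametric bootstrap: the maximum-likelihood fit psi(x; theta_hat),
# evaluated at the stimulus values x, supplies the generating
# probabilities. (A smooth, monotonic stand-in for y_fit is used here.)
p_parametric = [0.51, 0.56, 0.66, 0.80, 0.92, 0.96]

rng = random.Random(1)  # seeded for reproducibility
B = 200
boot_nonpar = [simulate_dataset(p_nonparametric, n, rng) for _ in range(B)]
boot_par = [simulate_dataset(p_parametric, n, rng) for _ in range(B)]
```

In a full analysis, each of the B simulated data sets would then be fit by maximum likelihood to obtain θ*i, exactly as for the original data.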

As is frequently the case in statistics, the choice of parametric versus nonparametric analysis concerns how much confidence one has in one's hypothesis about the underlying mechanism that gave rise to the raw data, as against the confidence one has in the raw data's precise numerical values. Choosing the parametric bootstrap for the estimation of variability in psychometric function fitting appears the natural choice, for several reasons. First and foremost, in fitting a parametric model to the data, one has already committed oneself to a parametric analysis. No additional assumptions are required to perform a parametric bootstrap beyond those required for fitting a function to the data: specification of the source of variability (binomial variability) and the model from which the data are most likely to come (parameter vector θ and distribution function F). Second, given the assumption that data from psychophysical experiments are binomially distributed, we expect data to be variable (noisy). The nonparametric bootstrap treats every datapoint as if its exact value reflected the underlying mechanism.¹ The parametric bootstrap, on the other hand, allows the datapoints to be treated as noisy samples from a smooth and monotonic function determined by θ and F.

One consequence of the two different bootstrap regimes is as follows. Assume two observers performing the same psychophysical task at the same stimulus intensities x, and assume that it happens that the maximum-likelihood fits to the two data sets yield identical parameter vectors θ̂. Given such a scenario, the parametric bootstrap returns identical estimates of variability for both observers, since it depends only on x, θ̂, and F. The nonparametric bootstrap's estimates would, on the other hand, depend on the individual differences between the two data sets y1 and y2, something we consider unconvincing: A method for estimating variability in parameters and thresholds should return identical estimates for identical observers performing the identical experiment.²

Treutwein (1995) and Treutwein and Strasburger (1999) used the nonparametric bootstrap, and Maloney (1990) used the parametric bootstrap, to compare bootstrap estimates of variability with real-world variability in the data of repeated psychophysical experiments. All of the above studies found bootstrap estimates to be in agreement with the human data. Keeping in mind that the number of repeats in the above-quoted cases was small, this is nonetheless encouraging, suggesting that bootstrap methods are a valid means of variability estimation for parameters fitted to psychophysical data.

Testing the Bridging Assumption
Asymptotically, that is, for large K and N, θ̂ will converge toward θ, since maximum-likelihood estimation is asymptotically unbiased³ (Cox & Hinkley, 1974; Kendall & Stuart, 1979). For the small K typical of psychophysical experiments, however, we can only hope that our estimated parameter vector θ̂ is "close enough" to the true parameter vector θ for the estimated variability in the parameter vector, obtained by the bootstrap method, to be valid. We call this the bootstrap bridging assumption.

Whether θ̂ is indeed sufficiently close to θ depends in a complex way on the sampling, that is, the number of blocks of trials (K), the numbers of observations at each block of trials (n), and the stimulus intensities (x), relative to the true parameter vector θ. Maloney (1990) summarized these dependencies for a given experimental design by plotting the standard deviation of β̂ as a function of α and β as a contour plot (Maloney, 1990, Figure 3, p. 129). Similar contour plots for the standard deviation of α̂, and for bias in both α̂ and β̂, could be obtained. If, owing to small K, bad sampling, or otherwise, our estimation procedure is inaccurate, the distribution of bootstrap parameter vectors θ*1, …, θ*B, centered around θ̂, will not be centered around the true θ. As a result, the estimates of variability are likely to be incorrect, unless the magnitude of the standard deviation is similar around θ̂ and θ, despite the fact that the points are some distance apart in parameter space.

One way to assess the likely accuracy of bootstrap estimates of variability is to follow Maloney (1990) and to examine the local flatness of the contours around θ̂, our best estimate of θ. If the contours are sufficiently flat, the variability estimates will be similar, assuming that the true θ is somewhere within this flat region. However, the process of local contour estimation is computationally expensive, since it requires a very large number of complete bootstrap runs for each data set.

A much quicker alternative is the following: Having obtained θ̂ and performed a bootstrap using ψ(x; θ̂) as the generating function, we move to eight different points φ1, …, φ8 in α–β space. Eight Monte Carlo simulations are performed, using φ1, …, φ8 as the generating functions, to explore the variability in those parts of the parameter space (only the generating parameters of the bootstrap are changed; x remains the same for all of them). If the contours of variability around θ̂ are sufficiently flat, as we hope they are, confidence intervals at φ1, …, φ8 should be of the same magnitude as those obtained at θ̂. Prudence should lead us to accept the largest of the nine confidence intervals obtained as our estimate of variability.

A decision has to be made as to which eight points in α–β space to use for the new set of simulations. Generally, provided that the psychometric function is at least reasonably well sampled, the contours vary smoothly in the immediate vicinity of θ̂, so that the precise placement of the sample points φ1, …, φ8 is not critical. One suggested and easy way to obtain a set of additional generating parameters is shown in Figure 1.

Figure 1 shows B = 2000 bootstrap parameter pairs as dark filled circles, plotted in α–β space. Simulated data sets were generated from ψgen, with the Weibull as F and θgen = {10, 3, .5, .01}; sampling scheme s7 (triangles; see Figure 2) was used, and N was set to 480 (ni = 80). The large central triangle at (10, 3) marks the generating parameter set; the solid and dashed line segments adjacent to the x- and y-axes mark the 68% and 95% confidence intervals for α and β, respectively.⁴ In the following, we shall use WCI to stand for "width of confidence interval," with a subscript denoting its coverage percentage; that is, WCI68 denotes the width of the 68% confidence interval.⁵ The eight additional generating parameter pairs φ1, …, φ8 are marked by the light triangles. They form a rectangle whose sides have length WCI68 in α and β. Typically, this central rectangular region contains approximately 30%–40% of all α–β pairs and could thus be viewed as a crude joint 35% confidence region for α and β. A coverage percentage of 35% represents a sensible compromise between erroneously accepting the estimate around θ̂, very likely underestimating the true variability, and performing additional bootstrap replications too far in the periphery, where variability, and thus the estimated confidence intervals, become erroneously inflated owing to poor sampling. Recently, one of us (Hill, 2001b) performed Monte Carlo simulations to test the coverage of various bootstrap confidence interval methods and found that the above method, based on 25% of points or more, was a good way of guaranteeing that both sides of a two-tailed confidence interval had at least the desired level of coverage.
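Under our reading of Figure 1, the eight generating parameter sets lie at the corners and edge midpoints of a rectangle of side WCI68 centered on the fitted parameters. The following hypothetical Python sketch (function name and example widths are ours) constructs them:

```python
def sensitivity_points(alpha_hat, beta_hat, wci68_alpha, wci68_beta):
    """Corners and edge midpoints of a rectangle centered on
    (alpha_hat, beta_hat) whose sides have length WCI68 in alpha and
    beta: the eight generating parameter sets phi_1, ..., phi_8 at
    which the bootstrap is repeated during sensitivity analysis."""
    da, db = wci68_alpha / 2.0, wci68_beta / 2.0
    offsets = [(-da, -db), (-da, 0.0), (-da, db),
               (0.0, -db),             (0.0, db),
               (da, -db),  (da, 0.0),  (da, db)]
    return [(alpha_hat + oa, beta_hat + ob) for oa, ob in offsets]

# Fitted (10, 3) with illustrative WCI68 widths of 1.2 in alpha, 1.6 in beta:
phis = sensitivity_points(10.0, 3.0, wci68_alpha=1.2, wci68_beta=1.6)

# A full bootstrap is then run at theta_hat and at each phi_i (the
# stimulus values x are left unchanged), and the widest of the nine
# confidence intervals obtained is reported.
```

The rectangle construction is cheap; the cost lies in the eight additional bootstrap runs it prescribes.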

MONTE CARLO SIMULATIONS

In both our papers, we use only the specific case of the 2AFC paradigm in our examples. Thus, γ is fixed at .5. In our simulations, where we must assume a distribution of true generating probabilities, we always use the Weibull function in conjunction with the same fixed set of generating parameters θgen: αgen = 10, βgen = 3, γgen = .5, λgen = .01. In our investigation of the effects of sampling patterns, we shall always use K = 6 and ni constant, that is, six blocks of trials with the same number of points in each block. The number of observations per point, ni, could be set to 20, 40, 80, or 160, and, with K = 6, this means that the total number of observations N could take the values 120, 240, 480, and 960.

We have introduced these limitations purely for the purposes of illustration, to keep our explanatory variables down to a manageable number. We have found in many other simulations that, in most cases, this restriction does not affect the generality of our conclusions.

Confidence intervals of parameters, thresholds, and slopes that we report in the following were always obtained using a method called bias-corrected and accelerated (BCa) confidence intervals, which we describe later, in our Bootstrap Confidence Intervals section.

The Effects of Sampling Schemes and Number of Trials

One of our aims in this study was to examine the effect of N and one's choice of sample points x on both the size of one's confidence intervals for J and their sensitivity to errors in θ̂.

Seven different sampling schemes were used, each dictating a different distribution of datapoints along the stimulus axis; they are the same schemes as those used and described in Wichmann and Hill (2001), and they are shown in Figure 2. Each horizontal chain of symbols represents one of the schemes, marking the stimulus values at which the six sample points are placed. The different symbol shapes will be used to identify the sampling schemes in our results plots. To provide a frame of reference, the solid curve shows the psychometric function used, that is, 0.5 + 0.5F(x; αgen, βgen), with the .55, .75, and .95 performance levels marked by dotted lines.

As we shall see, even for a fixed number of sample points and a fixed number of trials per point, biases in parameter estimation and goodness-of-fit assessment (companion paper), as well as the width of confidence intervals (this paper), all depend markedly on the distribution of stimulus values x.

Monte Carlo data sets were generated using our seven sampling schemes shown in Figure 2, using the generating parameters θgen, as well as N and K as specified above. A maximum-likelihood fit was performed on each simulated data set to obtain bootstrap parameter vectors θ*, from which we subsequently derived the x values corresponding to threshold0.5 and threshold0.8, as well as to the slope. For each sampling scheme and value of N, a total of nine simulations were performed: one at θgen and eight more at points φ1, …, φ8, as specified in our section on the bootstrap bridging assumption. Thus, each of our 28 conditions (7 sampling schemes × 4 values of N) required 9 × 1000 simulated data sets, for a total of 252,000 simulations, or 1.134 × 10⁸ simulated 2AFC trials.
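As a check on the bookkeeping, the counts in this paragraph can be reproduced in a few lines (our Python sketch; the scheme labels follow Figure 2):

```python
schemes = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]  # 7 sampling schemes
N_values = [120, 240, 480, 960]                       # total trials per data set
runs_per_condition = 9                                # theta_gen plus phi_1..phi_8
B = 1000                                              # simulated data sets per run

conditions = len(schemes) * len(N_values)             # 7 x 4 = 28
datasets = conditions * runs_per_condition * B        # 252,000 simulations
trials = sum(N * runs_per_condition * B               # 1.134e8 2AFC trials
             for _ in schemes for N in N_values)

print(conditions, datasets, trials)
```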

Figure 1. B = 2000 data sets were generated from a two-alternative forced-choice Weibull psychometric function with parameter vector θgen = {10, 3, .5, .01} and then fit using our maximum-likelihood procedure, resulting in 2000 estimated parameter pairs (α̂, β̂), shown as dark circles in α–β parameter space. The location of the generating α and β (10, 3) is marked by the large triangle in the center of the plot. The sampling scheme s7 was used to generate the data sets (see Figure 2 for details), with N = 480. Solid lines mark the 68% confidence interval width (WCI68), separately for α and β; broken lines mark the 95% confidence intervals. The light small triangles show the α–β parameter sets φ1, …, φ8 from which each bootstrap is repeated during sensitivity analysis, while keeping the x-values of the sampling scheme unchanged.

Figures 3, 4, and 5 show the results of the simulations dealing with slope0.5, threshold0.5, and threshold0.8, respectively. The top-left panel (A) of each figure plots the WCI68 of the estimate under consideration as a function of N. Data for all seven sampling schemes are shown, using their respective symbols. The top-right panel (B) plots, as a function of N, the maximal elevation of the WCI68 encountered in the vicinity of θgen, that is, max{WCI68(φ1)/WCI68(θgen), …, WCI68(φ8)/WCI68(θgen)}. The elevation factor is an indication of the sensitivity of our variability estimates to errors in the estimation of θ; the closer it comes to 1.0, the better. The bottom-left panel (C) plots the largest of the WCI68 measurements at θgen and φ1, …, φ8 for a given sampling scheme and N; it is thus the product of the elevation factor plotted in panel B and its respective WCI68 in panel A. This quantity we denote as MWCI68, standing for maximum width of the 68% confidence interval; this is the confidence interval we suggest experimenters should use when they report error bounds. The bottom-right panel (D), finally, shows MWCI68 multiplied by a sweat factor, √(N/120). Ideally, that is, for very large K and N, confidence intervals should be inversely proportional to √N, and multiplication by our sweat factor should remove this dependency. Any change in the sweat-adjusted MWCI68 as a function of N might thus be taken as an indicator of the degree to which sampling scheme, N, and true confidence interval width interact in a way not predicted by asymptotic theory.
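The three quantities behind panels A-C (WCI68, the elevation factor, and MWCI68) can be computed from bootstrap distributions as sketched below. This is our Python illustration using a simple percentile interval for clarity; the intervals actually reported in the paper are BCa intervals, described later, and the function names are ours.

```python
def wci(samples, coverage=0.68):
    """Width of the central `coverage` confidence interval of a
    bootstrap distribution, via simple percentiles of the sorted
    bootstrap estimates."""
    s = sorted(samples)
    lo = s[int(round((0.5 - coverage / 2.0) * (len(s) - 1)))]
    hi = s[int(round((0.5 + coverage / 2.0) * (len(s) - 1)))]
    return hi - lo

def mwci(wci_at_theta_hat, wci_at_phis):
    """MWCI68: the maximum WCI68 over theta_hat and phi_1..phi_8.
    The elevation factor is the worst-case ratio of a phi_i interval
    width (or the central one) to the width at theta_hat."""
    widest = max([wci_at_theta_hat] + list(wci_at_phis))
    elevation = widest / wci_at_theta_hat
    return widest, elevation

# Toy usage: a flat distribution of 101 bootstrap estimates, then
# made-up interval widths from the eight sensitivity-analysis runs.
w0 = wci(list(range(101)))            # central 68% width of 0..100
widest, elev = mwci(2.0, [2.2, 3.0, 1.8, 2.1, 2.4, 1.9, 2.6, 2.0])
```

Reporting the widest of the nine intervals, as the text recommends, guards against underestimating variability when θ̂ is some distance from the true θ.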

Figure 3A shows the WCI68 around the median estimated slope. Clearly, the different sampling schemes have a profound effect on the magnitude of the confidence intervals for slope estimates. For example, in order to ensure that the WCI68 is approximately 0.06, one requires nearly 960 trials if using sampling scheme s1 or s4. Sampling schemes s3, s5, or s7, on the other hand, require only around 200 trials to achieve a similar confidence interval width. The important difference that makes, for example, s3, s5, and s7 more efficient than s1 and s4 is the presence of samples at high predicted performance values (p ≥ .9), where binomial variability is low and where the data thus constrain our maximum-likelihood fit more tightly. Figure 3B illustrates the complex interactions among the different sampling schemes, N, and the stability of the bootstrap estimates of variability, as indicated by the local flatness of the contours around θgen. A perfectly flat local contour would result in a horizontal line at 1. Sampling scheme s6 is well behaved for N ≥ 480, its maximal elevation being around 1.7; for N = 240, however, elevation rises to nearly 2.5. Other schemes, like s1, s4, or s7, never rise above an elevation of 2, regardless of N. It is important to note that the magnitude of the WCI68 at θgen is by no means a good predictor of the stability of the estimate, as indicated by the sensitivity factor. Figure 3C shows the MWCI68, and here the differences in efficiency between sampling schemes are even more apparent than in Figure 3A. Sampling schemes s3, s5, and s7 are clearly superior to all other sampling schemes, particularly for N ≤ 480; these three sampling schemes are the ones that include two sampling points at p ≥ .92.
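The elevation factor and the MWCI68 defined above reduce to two one-line computations. The WCI68 numbers below are made up purely for illustration; in practice each would come from a full bootstrap run at θgen or at one of the perturbed sets φ1, …, φ8.

```python
# Hypothetical WCI68 values, for illustration only: one measured at the bootstrap
# generating parameter set theta_gen, eight more at the perturbed parameter sets
# phi_1 ... phi_8 visited during the sensitivity analysis.
wci_at_theta_gen = 0.061
wci_at_phis = [0.064, 0.059, 0.072, 0.066, 0.080, 0.063, 0.068, 0.061]

# Elevation factor: worst-case inflation of the interval in the vicinity of theta_gen.
elevation = max(w / wci_at_theta_gen for w in wci_at_phis)

# MWCI68: the widest 68% interval encountered at theta_gen and phi_1 ... phi_8,
# i.e., the elevation factor multiplied by the WCI68 at theta_gen (whenever the
# maximum occurs at one of the phi sets).
mwci68 = max([wci_at_theta_gen] + wci_at_phis)

print(round(elevation, 3), mwci68)
```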

Figure 4, showing the equivalent data for the estimates of threshold_0.5, also illustrates that some sampling schemes make much more efficient use of the experimenter's time than others, by providing MWCI68s that are more compact by a factor of 3.2.

Figure 2. A two-alternative forced-choice Weibull psychometric function with parameter vector θ = {10, 3, .5, 0}, on semilogarithmic coordinates. The rows of symbols below the curve mark the x values of the seven different sampling schemes, s1–s7, used throughout the remainder of the paper.

Two aspects of the data shown in Figures 4B and 4C are important. First, the sampling schemes fall into two distinct classes. Five of the seven sampling schemes are almost ideally well behaved, with elevations barely exceeding 1.5 even for small N (and MWCI68 < 3); the remaining two, on the other hand, behave poorly, with elevations in excess of 3 for N ≤ 240 (MWCI68 > 5). The two unstable sampling schemes, s1 and s4, are those that do not include at least a single sample point at p ≥ .95 (see Figure 2). It thus appears crucial to include at least one sample point at p ≥ .95 to make one's estimates of variability robust to small errors in θ, even if the threshold of interest, as in our example, has a p of only approximately .75. Second, both s1 and s4 are prone to lead to a serious underestimation of the true width of the confidence interval if the sensitivity analysis (or bootstrap bridging assumption test) is not carried out: the WCI68 is unstable in the vicinity of θgen, even though the WCI68 is small at θgen itself.

Figure 5 is similar to Figure 4, except that it shows the WCI68 around threshold_0.8. The trends found in Figure 4 are even more exaggerated here. The sampling schemes without high sampling points (s1 and s4) are unstable to the point of being meaningless for N ≤ 480. In addition, their WCI68 is inflated relative to that of the other sampling schemes even at θgen.

Figure 3. (A) The width of the 68% BCa confidence interval (WCI68) of slopes of B = 999 psychometric functions fitted to parametric bootstrap data sets generated from θgen = {10, 3, .5, .01}, as a function of the total number of observations N. (B) The maximal elevation of the WCI68 in the vicinity of θgen, again as a function of N (see the text for details). (C) The maximal width of the 68% BCa confidence interval (MWCI68) in the vicinity of θgen, that is, the product of the respective WCI68 and elevation factors of panels A and B. (D) The MWCI68 scaled by the sweat factor √(N/120), the ideally expected decrease in confidence interval width with increasing N. The seven symbols denote the seven sampling schemes, as in Figure 2.

The results of the Monte Carlo simulations are summarized in Table 1. The columns of Table 1 correspond to the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold_0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold_0.8, and the following four those for slope_0.5. The scheme with the lowest MWCI68 in each row is given the score of 100%; the others in the same row are given proportionally higher scores, indicating their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.
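The row scoring just described can be sketched as follows; the example row is the N = 120, threshold_0.5 row of Table 1, whose entries are themselves already scores, so re-scoring them reproduces the row unchanged.

```python
# Row scoring of Table 1: the scheme with the smallest MWCI68 in a row scores 100;
# every other scheme scores its MWCI68 as a percentage of that minimum.
def score_row(mwci68_values):
    best = min(mwci68_values)
    return [round(100.0 * v / best) for v in mwci68_values]

# The N = 120 row for threshold_0.5 (schemes s1 ... s7).
print(score_row([331, 135, 143, 369, 162, 100, 110]))
```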

Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F^-1(.5) (approximately equal to 75% correct during two-alternative forced-choice).

An inspection of Table 1 reveals that the sampling schemes fall into four categories. By a long way worst are sampling schemes s1 and s4, with mean and median MWCI68 above 210%. Already somewhat superior is s2, particularly for small N, with a median MWCI68 of 180% and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of around 130%. Each of these has at least one sample at p ≥ .95 and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6: s6 is identical to scheme s1, except that one sample point was moved from .75 to .95, yet s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120%, very nearly optimal6 on most estimates. Both of these sampling schemes, s3 and s7, have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F^-1(.8) (approximately equal to 90% correct during two-alternative forced-choice).

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses that could potentially bias the estimates if we were to fix the upper asymptote of the psychometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold_0.5 in order to obtain tight confidence intervals for threshold_0.5), because estimation is done via the whole psychometric function, which in turn is estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
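The point about binomial variability can be made concrete: the standard deviation of an observed proportion correct, √(p(1−p)/n), shrinks rapidly as p approaches 1. A small illustration (n = 60 trials per sample point is an arbitrary choice):

```python
import numpy as np

def prop_sd(p, n):
    """Standard deviation of an observed proportion correct from n binomial trials."""
    return np.sqrt(p * (1.0 - p) / n)

n = 60  # trials per sample point (illustrative)
for p in (0.50, 0.75, 0.92, 0.99):
    print(f"p = {p:.2f}  sd of observed proportion = {prop_sd(p, n):.4f}")
```

The variance is maximal at p = .5 and nearly vanishes at p = .99, which is why high-performance sample points pin down the maximum-likelihood fit so effectively.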

All the simulations reported in this section were performed with the a and b parameters of the Weibull distribution function constrained during fitting: we used Bayesian priors to limit a to be between 0 and 25 and b to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable given the experimental context (cf. Figure 2); it is important to note, however, that allowing a and b to be unconstrained would only amplify the differences between the sampling schemes7 but would not change them qualitatively.

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) have only a minor influence on the estimates of variability. The importance of this cannot be overestimated: a strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question, because, as experimenters, we do not know, and never will know, the true underlying distribution function, or objective function, from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψW(x; θW), using the Weibull as F, with θW = {10, 3, .5, .01} (our "standard" generating function ψgen); (2) ψCG(x; θCG), using the cumulative Gaussian, with θCG = {8.875, 3.278, .5, .01}; (3) ψL(x; θL), using the logistic, with θL = {8.957, 2.014, .5, .01}; and, finally, (4) ψG(x; θG), using the Gumbel, with θG = {10.022, 2.906, .5, .01}. For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of these psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that at most one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.8 Note that this is not trivially true: although several psychometric functions with different distribution functions F may be indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated

Table 1
Summary of Results of Monte Carlo Simulations

                                 N      s1     s2     s3     s4     s5     s6     s7
MWCI68 at x = F^-1(.5)          120    331    135    143    369    162    100    110
                                240    209    163    126    339    133    100    113
                                480    156    179    143    193    158    100    155
                                960    124    141    140    138    152    100    119
MWCI68 at x = F^-1(.8)          120   5725    428    100   1761    180    263    115
                                240   1388    205    100   1257    110    125    109
                                480    385    181    115    478    100    116    137
                                960    215    149    100    291    102    119    101
MWCI68 of dF/dx at x = F^-1(.5) 120    212    305    119    321    139    217    100
                                240    228    263    100    254    128    185    121
                                480    186    213    119    217    100    148    134
                                960    186    175    113    201    100    151    118
Mean                                   779    211    118    485    130    140    119
Standard deviation                    1595     85     17    499     28     49     16
Median                                 214    180    117    306    131    122    117

Note. Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold_0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold_0.8 and slope_0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θgen, as sampled at the points θgen and φ1, …, φ8. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function, ψgen, using sampling scheme s2 with N = 120.
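The near-identity of the four functions in Figure 6A can be checked numerically. The sketch below assumes particular parameterizations not spelled out in the text (mean and standard deviation for the cumulative Gaussian, location and scale for the logistic, and a log-Weibull form for the Gumbel); with those assumptions, the maximal deviation between the curves over the plotted range is on the order of .01.

```python
import numpy as np
from scipy.stats import norm

GAMMA, LAM = 0.5, 0.01  # 2AFC guess rate and lapse rate

def psi(F, x):
    """Psychometric function built from a distribution function F."""
    return GAMMA + (1.0 - GAMMA - LAM) * F(x)

# Distribution functions F with the parameter values quoted in the text; the
# exact parameterizations (and the log-Weibull reading of "Gumbel") are our
# assumptions for this illustration.
F_weibull  = lambda x: 1.0 - np.exp(-(x / 10.0) ** 3.0)
F_cgauss   = lambda x: norm.cdf((x - 8.875) / 3.278)
F_logistic = lambda x: 1.0 / (1.0 + np.exp(-(x - 8.957) / 2.014))
F_gumbel   = lambda x: 1.0 - np.exp(-10 ** (2.906 * (np.log10(x) - np.log10(10.022))))

x = np.linspace(2, 15, 500)
curves = [psi(F, x) for F in (F_weibull, F_cgauss, F_logistic, F_gumbel)]
max_diff = max(np.max(np.abs(c - curves[0])) for c in curves[1:])
print(f"largest deviation from the Weibull curve: {max_diff:.4f}")
```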

Slope, threshold_0.8, and threshold_0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

ψ = ¼ [ψW + ψCG + ψL + ψG]

as the generating function, and fitted psychometric functions, using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions, to each data set. From the fitted psychometric functions, we obtained estimates of threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5, as described previously. All four different values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 values of N). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,9 for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10^9 simulated 2AFC trials).

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F: ψW (Weibull), θ = {10, 3, .5, .01}; ψCG (cumulative Gaussian), θ = {8.9, 3.3, .5, .01}; ψL (logistic), θ = {9.0, 2.0, .5, .01}; and ψG (Gumbel), θ = {10.0, 2.9, .5, .01}. See the text for details. (B) A fit of two psychometric functions with different distribution functions F to the same data set, generated from the mean of the four psychometric functions shown in panel A using sampling scheme s2 with N = 120: ψW (Weibull), θ = {8.4, 3.4, .5, .05}; ψL (logistic), θ = {7.8, 1.0, .5, .05}. See the text for details.

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5), not only were the first two factors, the number of trials N and the sampling scheme, significant (p < .0001), as was expected, but so were the distribution function F and all possible interactions: the three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates. Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold_0.5 and slope_0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.10 The choice of distribution function F does not, despite being a significant factor, have a large effect on the WCI68 for threshold_0.5 and slope_0.5.

The same is not true, however, for the WCI68 of threshold_0.2. Here, the choice of F has an undesirably large effect on the bootstrap estimate of the WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically. Each of the four panels of Figure 7 plots the WCI68 (normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme as a function of the distribution function F.

For threshold_0.5 and slope_0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold_0.2, the choice of F has a profound influence on the WCI68 (e.g., in Figure 7A there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold_0.8. Figure 7C shows the (again, undesirable) interaction between sampling scheme and choice of F: WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles), the choice of F has a marked influence on the WCI68. It was generally the case for threshold_0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose as threshold and slope the measures corresponding to threshold_0.5 and slope_0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like them to be. Second, away from the midpoint of F, estimates of variability are not as independent of the distribution function chosen as one might wish, in particular for lower proportions correct (threshold_0.2 is much more affected by the choice of F than threshold_0.8; cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions (for example, 60% [or 90%] correct in 2AFC) and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions owing to the (arbitrary) choice of a distribution function that happens to result in comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS], Normalized to 100%)

Factor                                            Threshold_0.2  Threshold_0.5  Threshold_0.8  Slope_0.5
Number of trials N                                     76.24         87.30          65.14         72.86
Sampling schemes s1 ... s7                              6.60         10.00          19.93         19.69
Distribution function F                                11.36          0.13           3.87          0.92
Error (numerical variability in bootstrap)              0.34          0.35           0.46          0.34
Interaction of N and sampling schemes s1 ... s7         1.18          0.98           5.97          3.50
Sum of interactions involving F                         4.28          1.24           4.63          2.69
Percentage of SS accounted for without F               84.36         98.63          91.50         96.39

Note. The columns refer to threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5, that is, to approximately 60%, 75%, and 90% correct and to the slope at 75% correct during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.



Figure 7. The four panels show the width of the WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of the WCI68 for a given sampling scheme and N. Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold_0.2. (B) WCI68 for threshold_0.5. (C) WCI68 for threshold_0.8. (D) WCI68 for slope_0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates11 of the variability of a distribution J. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate σ̂; Maloney (1990), in addition to σ̂, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ) are not robust: they are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity J might be seriously in error if only a single bootstrap estimate Ji is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.12
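The breakdown argument is easy to demonstrate: a single wild bootstrap replicate ruins a moment-based spread estimate but barely moves a percentile-based one. The numbers below are synthetic, purely for illustration.

```python
import numpy as np

# 999 well-behaved bootstrap threshold estimates plus one wild outlier, as might
# arise when the maximum-likelihood search gets stuck in a local minimum.
rng = np.random.default_rng(1)
estimates = rng.normal(10.0, 0.5, size=999)
with_outlier = np.append(estimates, 1e6)

print("sd:    ", np.std(estimates), "->", np.std(with_outlier))        # explodes
print("median:", np.median(estimates), "->", np.median(with_outlier))  # barely moves
```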

Nonparametric plug-in estimates are also not without problems: percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12–14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and to be correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distributions of bootstrap estimates θ̂* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates J* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias raised problems particularly for the distribution of the Weibull slope parameter b (N = 210, K = 7). We also found in our simulations that b, and thus the slope s, was skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists that transforms the bootstrap distribution into a normal distribution. Hence, we assume that Φ̂ = m(θ̂) and Φ = m(θ), resulting in

Φ̂ ~ N(Φ − z0 σΦ, σΦ²),    (2)

where

σΦ = 1 + a(Φ − Φ0),    (3)

with Φ0 any reference point on the scale of Φ values. In Equation 2, z0 is the bias-correction term, and a in Equation 3 is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an ε-level confidence interval endpoint of the BCa interval can be calculated as

θ̂BCa[ε] = Ĝ⁻¹( CG( ẑ0 + (ẑ0 + z^(ε)) / (1 − â(ẑ0 + z^(ε))) ) ),    (4)

where CG is the cumulative Gaussian distribution function, Ĝ⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ̂*, ẑ0 is our estimate of the bias, and â is our estimate of the acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an ε-level confidence interval endpoint is simply Ĝ⁻¹(ε) (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equal in performance to standard percentiles, never that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and it was the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for the sensitivity analysis described above.
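For a generic one-dimensional statistic, the BCa recipe of Equation 4 can be sketched as follows. This is a nonparametric illustration with the sample mean, not the paper's parametric bootstrap of fitted psychometric function parameters; ẑ0 is estimated from the bootstrap distribution and â by a jackknife, following Efron and Tibshirani (1993).

```python
import numpy as np
from scipy.stats import norm

def bca_interval(data, statistic, B=2000, alpha=0.159, rng=None):
    """BCa confidence interval for a statistic of a one-dimensional sample:
    bias correction z0 from the bootstrap distribution, acceleration a from
    a jackknife, endpoints via the adjusted percentiles of Equation 4."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    theta_hat = statistic(data)
    boot = np.array([statistic(rng.choice(data, n)) for _ in range(B)])

    z0 = norm.ppf(np.mean(boot < theta_hat))            # bias-correction term
    jack = np.array([statistic(np.delete(data, i)) for i in range(n)])
    d = jack.mean() - jack
    a = np.sum(d ** 3) / (6.0 * np.sum(d ** 2) ** 1.5)  # acceleration term

    def endpoint(eps):
        z = norm.ppf(eps)
        adj = norm.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
        return np.quantile(boot, adj)                   # G^-1 of adjusted level

    return endpoint(alpha), endpoint(1.0 - alpha)

data = np.random.default_rng(7).normal(10.0, 2.0, size=50)
lo, hi = bca_interval(data, np.mean)  # roughly 68% BCa interval for the mean
print(lo, hi)
```

With symmetric, well-behaved data the adjustment is small and the interval nearly matches the plain percentile interval; with skewed bootstrap distributions the two can differ appreciably.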

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as the thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that the variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical, because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if and only if we choose threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold_0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing the goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.
Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.
Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.
Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.
Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.
Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.
Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.
Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.
Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.
Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.
Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology conference, York, U.K.
Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.
Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.
Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.
Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.
McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.
Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.
Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.
Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.
Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference on Visual Perception, Tübingen.
Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.
Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions fitted to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100n-th percentile of the bootstrap distribution α*.
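In code, the percentile method of Note 4 amounts to reading off percentiles of the bootstrap distribution. A minimal numpy sketch; the Gaussian "bootstrap distribution" below is a made-up stand-in for a real distribution of 2,000 bootstrap estimates α*:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a bootstrap distribution of alpha*: 2,000 simulated estimates.
alpha_star = rng.normal(10.0, 0.5, size=2000)

# Bootstrap percentile method: the 95% interval is [alpha*(.025), alpha*(.975)],
# i.e. the 2.5th and 97.5th percentiles of the bootstrap distribution.
lo, hi = np.percentile(alpha_star, [2.5, 97.5])
```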

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ± 1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis when β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

THE PSYCHOMETRIC FUNCTION II 1329

8. Of course, there might be situations where one psychometric function using a particular distribution function provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using BCa confidence intervals, described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120–960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity J, which is derived from a probability distribution F by J = t(F), is to obtain F̂ from empirical data and then use Ĵ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)



assumption on which the use of parametric bootstrap techniques relies. Finally, we recommend, on the basis of the theoretical work of others, the use of a technique called bias correction with acceleration (BCa) to obtain stable and accurate confidence interval estimates.

BACKGROUND

The Psychometric Function
Our notation will follow the conventions we have outlined in Wichmann and Hill (2001). A brief summary of terms follows.

Performance on K blocks of a constant-stimuli psychophysical experiment can be expressed using three vectors, each of length K: an x denotes the stimulus values used, an n denotes the numbers of trials performed at each point, and a y denotes the proportion of correct responses (in n-alternative forced-choice [n-AFC] experiments) or positive responses (single-interval or yes/no experiments) on each block. We often use N to refer to the total number of trials in the set, N = Σ ni.

The number of correct responses, yi ni, in a given block i is assumed to be the sum of random samples from a Bernoulli process with probability of success pi. A psychometric function, ψ(x), is the function that relates the stimulus dimension x to the expected performance value p.

A common general form for the psychometric function is

ψ(x; α, β, γ, λ) = γ + (1 − γ − λ) F(x; α, β).  (1)

The shape of the curve is determined by our choice of a functional form for F and by the four parameters α, β, γ, and λ, to which we shall refer collectively by using the parameter vector θ. F is typically a sigmoidal function, such as the Weibull, cumulative Gaussian, logistic, or Gumbel. We assume that F describes the underlying psychological mechanism of interest. The parameters γ and λ determine the lower and upper bounds of the curve, which are affected by other factors. In yes/no paradigms, γ is the guess rate and λ the miss rate. In n-AFC paradigms, γ usually reflects chance performance and is fixed at the reciprocal of the number of intervals per trial, and λ reflects the stimulus-independent error rate, or lapse rate (see Wichmann & Hill, 2001, for more details).
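Equation 1 is straightforward to express in code. The sketch below (with a Weibull choice of F and the paper's 2AFC generating parameters; the function names are ours, not the authors') is a minimal translation:

```python
import numpy as np

def weibull_cdf(x, alpha, beta):
    """A Weibull choice of F: F(x; alpha, beta) = 1 - exp(-(x/alpha)**beta)."""
    x = np.asarray(x, dtype=float)
    return 1.0 - np.exp(-(x / alpha) ** beta)

def psi(x, alpha, beta, gamma, lam, F=weibull_cdf):
    """Equation 1: psi(x; theta) = gamma + (1 - gamma - lambda) * F(x; alpha, beta)."""
    return gamma + (1.0 - gamma - lam) * F(x, alpha, beta)

# 2AFC with theta_gen = {10, 3, .5, .01}: gamma = .5 (chance), lambda = .01 (lapse rate).
p = psi(10.0, alpha=10.0, beta=3.0, gamma=0.5, lam=0.01)  # expected performance at x = alpha
```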

When a parameter set has been estimated, we will usually be interested in measurements of the threshold (displacement along the x-axis) and slope of the psychometric function. We calculate thresholds by taking the inverse of F at a specified probability level, usually .5. Slopes are calculated by finding the derivative of F with respect to x, evaluated at a specified threshold. Thus, we shall use the notation threshold_0.8, for example, to mean F⁻¹(0.8), and slope_0.8 to mean dF/dx evaluated at F⁻¹(0.8). When we use the terms threshold and slope without a subscript, we mean threshold_0.5 and slope_0.5. In our 2AFC examples, this will mean the stimulus value and slope of F at the point where performance is approximately 75% correct, although the exact performance level is affected slightly by the (small) value of λ.

Where an estimate of a parameter set is required, given a particular data set, we use a maximum-likelihood search algorithm with Bayesian constraints on the parameters, based on our beliefs about their possible values. For example, λ is constrained within the range [0, .06], reflecting our belief that normal, trained observers do not make stimulus-independent errors at high rates. We describe our method in detail in Wichmann and Hill (2001).

Estimates of Variability: Asymptotic Versus Monte Carlo Methods

In order to be able to compare response thresholds or slopes across experimental conditions, experimenters require a measure of their variability, which will depend on the number of experimental trials taken and their placement along the stimulus axis. Thus, a fitting procedure must provide not only parameter estimates but also error estimates for those parameters. Reporting error estimates on fitted parameters is, unfortunately, not very common in psychophysical studies. Sometimes probit analysis has been used to provide variability estimates (Finney, 1952, 1971). In probit analysis, an iteratively reweighted linear regression is performed on the data once they have undergone transformation through the inverse of a cumulative Gaussian function. Probit analysis relies, however, on asymptotic theory: Maximum-likelihood estimators are asymptotically Gaussian, allowing the standard deviation to be computed from the empirical distribution (Cox & Hinkley, 1974). Asymptotic methods assume that the number of datapoints is large; unfortunately, however, the number of points in a typical psychophysical data set is small (between 4 and 10, with between 20 and 100 trials at each), and in these cases substantial errors have accordingly been found in the probit estimates of variability (Foster & Bischof, 1987, 1991; McKee, Klein, & Teller, 1985). For this reason, asymptotic theory methods are not recommended for estimating variability in most realistic psychophysical settings.

An alternative method, the bootstrap (Efron, 1979, 1982; Efron & Gong, 1983; Efron & Tibshirani, 1991, 1993), has been made possible by the recent sharp increase in the processing speed of desktop computers. The bootstrap method is a Monte Carlo resampling technique relying on a large number of simulated repetitions of the original experiment. It is potentially well suited to the analysis of psychophysical data, because its accuracy does not rely on large numbers of trials, as do methods derived from asymptotic theory (Hinkley, 1988). We apply the bootstrap to the problem of estimating the variability of parameters, thresholds, and slopes of psychometric functions, following Maloney (1990), Foster and Bischof (1987, 1991, 1997), and Treutwein (1995; Treutwein & Strasburger, 1999).

The essence of Monte Carlo techniques is that a large number, B, of "synthetic" data sets y*_1, …, y*_B are generated. For each data set y*_i, the quantity of interest, J (threshold or slope, e.g.), is estimated to give J*_i. The process for obtaining J*_i is the same as that used to obtain the first estimate Ĵ. Thus, if our first estimate was obtained by Ĵ = t(θ̂), where θ̂ is the maximum-likelihood parameter estimate from a fit to the original data y, the simulated estimates J*_i will be given by J*_i = t(θ*_i), where θ*_i is the maximum-likelihood parameter estimate from a fit to the simulated data y*_i.
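The loop just described can be sketched end to end. Everything concrete below is an illustrative stand-in, not the authors' procedure: the data set and stimulus placements are invented, a coarse grid search replaces their Bayesian-constrained maximum-likelihood optimizer, γ and λ are held fixed, and the interval is a plain percentile interval rather than BCa:

```python
import numpy as np

rng = np.random.default_rng(1)

def psi(x, alpha, beta, gamma=0.5, lam=0.01):
    # Equation 1 with a Weibull F
    return gamma + (1 - gamma - lam) * (1 - np.exp(-(x / alpha) ** beta))

def fit_ml(x, k, n, alphas, betas):
    """Crude maximum-likelihood fit by grid search over (alpha, beta)."""
    best, best_ll = (alphas[0], betas[0]), -np.inf
    for a in alphas:
        for b in betas:
            p = np.clip(psi(x, a, b), 1e-6, 1 - 1e-6)
            ll = np.sum(k * np.log(p) + (n - k) * np.log(1 - p))
            if ll > best_ll:
                best, best_ll = (a, b), ll
    return best

x = np.array([4.0, 6.0, 8.0, 10.0, 12.0, 16.0])   # made-up stimulus placements
n = np.full(6, 40)                                 # K = 6 blocks, n_i = 40
theta_hat = (10.0, 3.0)                            # pretend ML fit to the original data
alphas = np.linspace(7, 13, 31)
betas = np.linspace(1, 6, 26)

# Parametric bootstrap: B synthetic data sets y_i* from psi(x; theta_hat),
# refit each, and collect J_i* = threshold_0.5 = F^-1(.5).
B = 200
J_star = []
for _ in range(B):
    k = rng.binomial(n, psi(x, *theta_hat))        # simulated observer
    a, b = fit_ml(x, k, n, alphas, betas)
    J_star.append(a * np.log(2) ** (1 / b))

lo, hi = np.percentile(J_star, [16, 84])           # crude 68% percentile interval (not BCa)
```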

Sometimes it is erroneously assumed that the intention is to measure the variability of the underlying J itself. This cannot be the case, however, because repeated computer simulation of the same experiment is no substitute for the real repeated measurements this would require. What Monte Carlo simulations can do is estimate the variability inherent in (1) our sampling, as characterized by the distribution of sample points (x) and the size of the samples (n), and (2) any interaction between our sampling strategy and the process used to estimate J: that is, assuming a model of the observer's variability, fitting a function to obtain θ̂, and applying t(θ̂).

Bootstrap Data Sets: Nonparametric and Parametric Generation

In applying Monte Carlo techniques to psychophysical data, we require, in order to obtain a simulated data set y*_i, some system that provides generating probabilities p for the binomial variates y*_i1, …, y*_iK. These should be the same generating probabilities that we hypothesize to underlie the empirical data set y.

Efron's bootstrap offers such a system. In the nonparametric bootstrap method, we would assume p = y. This is equivalent to resampling, with replacement, the original set of correct and incorrect responses on each block of observations j in y to produce a simulated sample y*_ij.

Alternatively, a parametric bootstrap can be performed. In the parametric bootstrap, assumptions are made about the generating model from which the observed data are believed to arise. In the context of estimating the variability of parameters of psychometric functions, the data are generated by a simulated observer whose underlying probabilities of success are determined by the maximum-likelihood fit to the real observer's data [y_fit = ψ(x; θ̂)]. Thus, where the nonparametric bootstrap uses y, the parametric bootstrap uses y_fit as generating probabilities p for the simulated data sets.
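The two regimes differ only in where the generating probabilities p come from. A minimal sketch (the data values, stimulus placements, and fitted parameters below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

x = np.array([4.0, 6.0, 8.0, 10.0, 12.0, 16.0])           # stimulus values (illustrative)
n = np.full(6, 40)                                         # trials per block
y = np.array([0.525, 0.600, 0.700, 0.825, 0.950, 1.000])   # observed proportions correct

def draw(p, n, rng):
    """One synthetic data set y*: binomial counts at generating probabilities p,
    returned as proportions. Drawing Binomial(n_j, y_j) is exactly resampling the
    n_j recorded responses of block j with replacement."""
    return rng.binomial(n, p) / n

# Nonparametric bootstrap: p = y, the raw data themselves generate the simulations.
y_star_nonpar = draw(y, n, rng)

# Parametric bootstrap: p = y_fit = psi(x; theta_hat), the fitted smooth function.
def psi(x, alpha=10.0, beta=3.0, gamma=0.5, lam=0.01):
    return gamma + (1 - gamma - lam) * (1 - np.exp(-(x / alpha) ** beta))

y_star_par = draw(psi(x), n, rng)
```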

As is frequently the case in statistics, the choice of parametric versus nonparametric analysis concerns how much confidence one has in one's hypothesis about the underlying mechanism that gave rise to the raw data, as against the confidence one has in the raw data's precise numerical values. Choosing the parametric bootstrap for the estimation of variability in psychometric function fitting appears the natural choice, for several reasons. First and foremost, in fitting a parametric model to the data, one has already committed oneself to a parametric analysis. No additional assumptions are required to perform a parametric bootstrap beyond those required for fitting a function to the data: specification of the source of variability (binomial variability) and of the model from which the data are most likely to come (parameter vector θ̂ and distribution function F). Second, given the assumption that data from psychophysical experiments are binomially distributed, we expect data to be variable (noisy). The nonparametric bootstrap treats every datapoint as if its exact value reflected the underlying mechanism.¹ The parametric bootstrap, on the other hand, allows the datapoints to be treated as noisy samples from a smooth and monotonic function determined by θ̂ and F.

One consequence of the two different bootstrap regimes is as follows. Assume two observers performing the same psychophysical task at the same stimulus intensities x, and assume that it happens that the maximum-likelihood fits to the two data sets yield identical parameter vectors θ̂. Given such a scenario, the parametric bootstrap returns identical estimates of variability for both observers, since it depends only on x, θ̂, and F. The nonparametric bootstrap's estimates would, on the other hand, depend on the individual differences between the two data sets y1 and y2, something we consider unconvincing: A method for estimating variability in parameters and thresholds should return identical estimates for identical observers performing the identical experiment.²

Treutwein (1995) and Treutwein and Strasburger (1999) used the nonparametric bootstrap, and Maloney (1990) used the parametric bootstrap, to compare bootstrap estimates of variability with real-world variability in the data of repeated psychophysical experiments. All of the above studies found bootstrap estimates to be in agreement with the human data. Keeping in mind that the number of repeats in the above-quoted cases was small, this is nonetheless encouraging, suggesting that bootstrap methods are a valid means of variability estimation for parameters fitted to psychophysical data.

Testing the Bridging Assumption
Asymptotically (that is, for large K and N ), θ̂ will converge toward θ, since maximum-likelihood estimation is asymptotically unbiased³ (Cox & Hinkley, 1974; Kendall & Stuart, 1979). For the small K typical of psychophysical experiments, however, we can only hope that our estimated parameter vector θ̂ is "close enough" to the true parameter vector θ for the estimated variability in the parameter vector, obtained by the bootstrap method, to be valid. We call this the bootstrap bridging assumption.

Whether θ̂ is indeed sufficiently close to θ depends in a complex way on the sampling: on the number of blocks of trials (K ), the numbers of observations at each block of trials (n), and the stimulus intensities (x), relative to the true parameter vector θ. Maloney (1990) summarized these dependencies for a given experimental design by plotting the standard deviation of β̂ as a function of α and β in a contour plot (Maloney, 1990, Figure 3, p. 129). Similar contour plots for the standard deviation of α̂, and for bias in both α̂ and β̂, could be obtained. If, owing to small K, bad sampling, or otherwise, our estimation procedure is inaccurate, the distribution of bootstrap parameter vectors θ*_1, …, θ*_B (centered around θ̂) will not be centered around the true θ. As a result, the estimates of variability are likely to be incorrect, unless the magnitude of the standard deviation is similar around θ̂ and θ, despite the fact that the two points are some distance apart in parameter space.

One way to assess the likely accuracy of bootstrap estimates of variability is to follow Maloney (1990) and to examine the local flatness of the contours around θ̂, our best estimate of θ. If the contours are sufficiently flat, the variability estimates will be similar, assuming that the true θ is somewhere within this flat region. However, the process of local contour estimation is computationally expensive, since it requires a very large number of complete bootstrap runs for each data set.

A much quicker alternative is the following: Having obtained θ̂ and performed a bootstrap using ψ(x; θ̂) as the generating function, we move to eight different points, f1, …, f8, in α–β space. Eight Monte Carlo simulations are performed, using f1, …, f8 as the generating functions, to explore the variability in those parts of the parameter space (only the generating parameters of the bootstrap are changed; x remains the same for all of them). If the contours of variability around θ̂ are sufficiently flat, as we hope they are, confidence intervals at f1, …, f8 should be of the same magnitude as those obtained at θ̂. Prudence should lead us to accept the largest of the nine confidence intervals obtained as our estimate of variability.

A decision has to be made as to which eight points in α–β space to use for the new set of simulations. Generally, provided that the psychometric function is at least reasonably well sampled, the contours vary smoothly in the immediate vicinity of θ̂, so that the precise placement of the sample points f1, …, f8 is not critical. One suggested and easy way to obtain a set of additional generating parameters is shown in Figure 1.
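The placement shown in Figure 1 can be generated mechanically. Our reading of that figure is that f1, …, f8 are the four corners and four edge midpoints of a rectangle centred on (α̂, β̂) whose side lengths are the WCI68 of α and β; the widths in the usage line are made up:

```python
import itertools

def sensitivity_points(alpha_hat, beta_hat, wci68_alpha, wci68_beta):
    """Eight generating points f1, ..., f8: the corners and edge midpoints of a
    rectangle centred on (alpha_hat, beta_hat) whose side lengths are the
    68% confidence interval widths (WCI68) of alpha and beta."""
    da, db = wci68_alpha / 2.0, wci68_beta / 2.0
    return [(alpha_hat + i * da, beta_hat + j * db)
            for i, j in itertools.product((-1, 0, 1), repeat=2)
            if (i, j) != (0, 0)]   # skip the centre: that run was already done

f = sensitivity_points(10.0, 3.0, 1.2, 1.6)  # WCI68 widths here are invented
```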

Figure 1 shows B = 2,000 bootstrap parameter pairs, as dark filled circles, plotted in α–β space. Simulated data sets were generated from ψgen, with the Weibull as F and θgen = {10, 3, .5, .01}; sampling scheme s7 (triangles; see Figure 2) was used, and N was set to 480 (ni = 80). The large central triangle at (10, 3) marks the generating parameter set; the solid and dashed line segments adjacent to the x- and y-axes mark the 68% and 95% confidence intervals for α and β, respectively.⁴ In the following, we shall use WCI to stand for width of confidence interval, with a subscript denoting its coverage percentage; that is, WCI68 denotes the width of the 68% confidence interval.⁵ The eight additional generating parameter pairs f1, …, f8 are marked by the light triangles. They form a rectangle whose sides have length WCI68 in α and β. Typically, this central rectangular region contains approximately 30%–40% of all α*–β* pairs and could thus be viewed as a crude joint 35% confidence region for α and β. A coverage percentage of 35% represents a sensible compromise between erroneously accepting the estimate around θ̂ (very likely underestimating the true variability) and performing additional bootstrap replications too far in the periphery, where variability, and thus the estimated confidence intervals, become erroneously inflated owing to poor sampling. Recently, one of us (Hill, 2001b) performed Monte Carlo simulations to test the coverage of various bootstrap confidence interval methods and found that the above method, based on 25% of points or more, was a good way of guaranteeing that both sides of a two-tailed confidence interval had at least the desired level of coverage.

MONTE CARLO SIMULATIONS

In both our papers, we use only the specific case of the 2AFC paradigm in our examples. Thus, γ is fixed at .5. In our simulations, where we must assume a distribution of true generating probabilities, we always use the Weibull function, in conjunction with the same fixed set of generating parameters θgen: αgen = 10, βgen = 3, γgen = .5, λgen = .01. In our investigation of the effects of sampling patterns, we shall always use K = 6 and constant ni; that is, six blocks of trials with the same number of points in each block. The number of observations per point, ni, could be set to 20, 40, 80, or 160; with K = 6, this means that the total number of observations N could take the values 120, 240, 480, and 960.

We have introduced these limitations purely for the purposes of illustration, to keep our explanatory variables down to a manageable number. We have found in many other simulations that, in most cases, this can be done without loss of generality in our conclusions.

Confidence intervals of parameters, thresholds, and slopes that we report in the following were always obtained using a method called bias-corrected and accelerated (BCa) confidence intervals, which we describe later, in our Bootstrap Confidence Intervals section.

The Effects of Sampling Schemes and Number of Trials

One of our aims in this study was to examine the effect of N, and of one's choice of sample points x, on both the size of one's confidence intervals for J and their sensitivity to errors in θ̂.

Seven different sampling schemes were used, each dictating a different distribution of datapoints along the stimulus axis; they are the same schemes as those used and described in Wichmann and Hill (2001), and they are shown in Figure 2. Each horizontal chain of symbols represents one of the schemes, marking the stimulus values at which the six sample points are placed. The different symbol shapes will be used to identify the sampling schemes in our results plots. To provide a frame of reference, the solid curve shows the psychometric function used, 0.5 + 0.5 F(x; αgen, βgen), with the .55, .75, and .95 performance levels marked by dotted lines.

As we shall see, even for a fixed number of sample points and a fixed number of trials per point, biases in parameter estimation and goodness-of-fit assessment (companion paper), as well as the width of confidence intervals (this paper), all depend markedly on the distribution of stimulus values x.

Monte Carlo data sets were generated using our seven sampling schemes shown in Figure 2, with the generating parameters θgen, as well as N and K, as specified above. A maximum-likelihood fit was performed on each simulated data set to obtain bootstrap parameter vectors θ*, from which we subsequently derived the x values corresponding to threshold_0.5 and threshold_0.8, as well as the slope. For each sampling scheme and value of N, a total of nine simulations were performed: one at θgen and eight more at points f1, …, f8, as specified in our section on the bootstrap bridging assumption. Thus, each of our 28 conditions (7 sampling schemes × 4 values of N ) required 9 × 1,000 simulated data sets, for a total of 252,000 simulations, or 1.134 × 10⁸ simulated 2AFC trials.
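The bookkeeping in this paragraph is easy to verify:

```python
# 7 sampling schemes x 4 values of N = 28 conditions; each condition runs
# 9 simulations (one at theta_gen plus eight at f1..f8) of 1,000 data sets each.
schemes = 7
n_values = [120, 240, 480, 960]
runs_per_condition = 9 * 1000

simulations = schemes * len(n_values) * runs_per_condition   # 252,000 simulated data sets
trials = schemes * runs_per_condition * sum(n_values)        # 1.134e8 simulated 2AFC trials
```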

[Figure 1: scatter of B = 2,000 bootstrap parameter pairs in α–β space; x-axis, parameter α (8–12); y-axis, parameter β (1–8); legend: 68% confidence interval (WCI68), 95% confidence interval.]

Figure 1. B = 2,000 data sets were generated from a two-alternative forced-choice Weibull psychometric function with parameter vector θgen = {10, 3, .5, .01} and then fit using our maximum-likelihood procedure, resulting in 2,000 estimated parameter pairs (α̂, β̂), shown as dark circles in α–β parameter space. The location of the generating α and β (10, 3) is marked by the large triangle in the center of the plot. Sampling scheme s7 was used to generate the data sets (see Figure 2 for details), with N = 480. Solid lines mark the 68% confidence interval width (WCI68) separately for α and β; broken lines mark the 95% confidence intervals. The light small triangles show the α–β parameter sets f1, …, f8, from which each bootstrap is repeated during sensitivity analysis, while keeping the x-values of the sampling scheme unchanged.

Figures 3, 4, and 5 show the results of the simulations dealing with slope_0.5, threshold_0.5, and threshold_0.8, respectively. The top-left panel (A) of each figure plots the WCI68 of the estimate under consideration as a function of N. Data for all seven sampling schemes are shown, using their respective symbols. The top-right panel (B) plots, as a function of N, the maximal elevation of the WCI68 encountered in the vicinity of θgen; that is, max{WCIf1/WCIθgen, …, WCIf8/WCIθgen}. The elevation factor is an indication of the sensitivity of our variability estimates to errors in the estimation of θ; the closer it comes to 1.0, the better. The bottom-left panel (C) plots the largest of the WCI68 measurements at θgen and f1, …, f8 for a given sampling scheme and N; it is thus the product of the elevation factor plotted in panel B and its respective WCI68 in panel A. This quantity we denote as MWCI68, standing for maximum width of the 68% confidence interval; this is the confidence interval we suggest experimenters should use when they report error bounds. The bottom-right panel (D), finally, shows MWCI68 multiplied by a sweat factor, √(N/120). Ideally (that is, for very large K and N ), confidence intervals should be inversely proportional to √N, and multiplication by our sweat factor should remove this dependency. Any change in the sweat-adjusted MWCI68 as a function of N might thus be taken as an indicator of the degree to which sampling scheme, N, and true confidence interval width interact in a way not predicted by asymptotic theory.

Figure 3A shows the WCI68 around the median estimated slope. Clearly, the different sampling schemes have a profound effect on the magnitude of the confidence intervals for slope estimates. For example, in order to ensure that the WCI68 is approximately 0.06, one requires nearly 960 trials if using sampling scheme s1 or s4. Sampling schemes s3, s5, or s7, on the other hand, require only around 200 trials to achieve similar confidence interval width. The important difference that makes, for example, s3, s5, and s7 more efficient than s1 and s4 is the presence of samples at high predicted performance values (p ≥ .9), where binomial variability is low and thus the data constrain our maximum-likelihood fit more tightly. Figure 3B illustrates the complex interactions between different sampling schemes, N, and the stability of the bootstrap estimates of variability, as indicated by the local flatness of the contours around θgen. A perfectly flat local contour would result in a horizontal line at 1. Sampling scheme s6 is well behaved for N ≥ 480, its maximal elevation being around 1.7; for N ≤ 240, however, elevation rises to nearly 2.5. Other schemes, like s1, s4, or s7, never rise above an elevation of 2, regardless of N. It is important to note that the magnitude of WCI68 at θgen is by no means a good predictor of the stability of the estimate, as indicated by the sensitivity factor. Figure 3C shows the MWCI68, and here the differences in efficiency between sampling schemes are even more apparent than in Figure 3A. Sampling schemes s3, s5, and s7 are clearly superior to all other sampling schemes, particularly for N ≤ 480; these three sampling schemes are the ones that include two sampling points at p ≥ .92.

Figure 4, showing the equivalent data for the estimates of threshold_0.5, also illustrates that some sampling schemes make much more efficient use of the experimenter's time, by providing MWCI68s that are more compact than others by a factor of 3.2.

[Figure 2: a two-alternative forced-choice Weibull psychometric function, proportion correct plotted against signal intensity (4–20, semilogarithmic coordinates), with rows of symbols s1–s7 below the curve marking the sampling schemes.]

Figure 2. A two-alternative forced-choice Weibull psychometric function with parameter vector θ = {10, 3, .5, 0}, on semilogarithmic coordinates. The rows of symbols below the curve mark the x values of the seven different sampling schemes, s1–s7, used throughout the remainder of the paper.

Two aspects of the data shown in Figures 4B and 4C are important. First, the sampling schemes fall into two distinct classes. Five of the seven sampling schemes are almost ideally well behaved, with elevations barely exceeding 1.5 even for small N (and MWCI68 ≤ 3); the remaining two, on the other hand, behave poorly, with elevations in excess of 3 for N ≤ 240 (MWCI68 ≥ 5). The two unstable sampling schemes, s1 and s4, are those that do not include at least a single sample point at p ≥ .95 (see Figure 2). It thus appears crucial to include at least one sample point at p ≥ .95 to make one's estimates of variability robust to small errors in θ̂, even if the threshold of interest, as in our example, has a p of only approximately .75. Second, both s1 and s4 are prone to lead to a serious underestimation of the true width of the confidence interval if the sensitivity analysis (or bootstrap bridging assumption test) is not carried out: The WCI68 is unstable in the vicinity of θgen, even though the WCI68 is small at θgen itself.

Figure 5 is similar to Figure 4, except that it shows the WCI68 around threshold_0.8. The trends found in Figure 4 are even more exaggerated here: The sampling schemes without high sampling points (s1, s4) are unstable to the point of being meaningless for N ≤ 480. In addition, their WCI68 is inflated, relative to that of the other sampling schemes, even at θgen.

[Figure 3: four panels (A–D) plotting, against the total number of observations N (120–960), the slope WCI68, the maximal elevation of the WCI68, the slope MWCI68, and the slope MWCI68 scaled by sqrt(N/120), for sampling schemes s1–s7.]

Figure 3. (A) The width of the 68% BCa confidence interval (WCI68) of the slopes of B = 999 psychometric functions fitted to parametric bootstrap data sets generated from θgen = {10, 3, .5, .01}, as a function of the total number of observations N. (B) The maximal elevation of the WCI68 in the vicinity of θgen, again as a function of N (see the text for details). (C) The maximal width of the 68% BCa confidence interval (MWCI68) in the vicinity of θgen; that is, the product of the respective WCI68 and elevation factors of panels A and B. (D) MWCI68 scaled by the sweat factor sqrt(N/120), the ideally expected decrease in confidence interval width with increasing N. The seven symbols denote the seven sampling schemes, as in Figure 2.

The results of the Monte Carlo simulations are summarized in Table 1. The columns of Table 1 correspond to the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold_0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold_0.8, and the following four those for slope_0.5. The scheme with the lowest MWCI68 in each row is given the score of 100; the others on the same row are given proportionally higher scores, to indicate their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.

An inspection of Table 1 reveals that the sampling schemes fall into four categories. By a long way the worst are sampling schemes s1 and s4, with mean and median MWCI68 above 210%. Already somewhat superior is s2, particularly for small N, with a median MWCI68 of 180% and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of

Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F^-1(0.5) (approximately equal to 75% correct during two-alternative forced-choice).

1322 WICHMANN AND HILL

around 130%. Each of these has at least one sample at p ≥ .95 and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6. Comparing s1 and s6: s6 is identical to scheme s1, except that one sample point was moved from .75 to .95. Still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120%, very nearly optimal6 on most estimates. Both of these sampling schemes, s3 and s7, have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses that could potentially bias the estimates if we were to fix the upper asymptote of the psychometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F^-1(0.8) (approximately equal to 90% correct during two-alternative forced-choice).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold_0.5, in order to obtain tight confidence intervals for threshold_0.5), because estimation is done via the whole psychometric function, which is in turn estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
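The variance argument can be checked with one line of arithmetic: The standard deviation of an observed proportion correct is sqrt(p(1 − p)/n), which is maximal at p = .5 and shrinks rapidly as p approaches 1. A short sketch (our own illustration, not code from the paper):

```python
import math

def binomial_sd(p, n):
    """Standard deviation of an observed proportion correct over n binomial trials."""
    return math.sqrt(p * (1.0 - p) / n)

# expected response variability at several performance levels, n = 50 trials per block
for p in (0.5, 0.75, 0.9, 0.95, 0.99):
    print(f"p = {p:.2f}: sd = {binomial_sd(p, 50):.4f}")
```

Blocks placed near p = .95 therefore pin down the upper part of the fitted function far more tightly than blocks near the midpoint can.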

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting: We used Bayesian priors to limit α to be between 0 and 25 and β to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable given the experimental context (cf. Figure 2). It is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes7 but would not change them qualitatively.
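The constrained fit can be sketched as follows. This is our own minimal illustration, not the authors' code: Flat priors over the stated open intervals are imposed simply by returning an infinite negative log posterior outside them (the function name and the data layout are our assumptions).

```python
import math

def neg_log_posterior(alpha, beta, lam, data,
                      a_bounds=(0.0, 25.0), b_bounds=(0.0, 15.0), l_bounds=(0.0, 0.06)):
    """Binomial negative log-likelihood for a 2AFC Weibull psychometric function,
    with flat Bayesian priors implemented as open box constraints on alpha, beta, lambda."""
    for v, (lo, hi) in ((alpha, a_bounds), (beta, b_bounds), (lam, l_bounds)):
        if not lo < v < hi:               # outside the prior support: posterior is zero
            return math.inf
    nll = 0.0
    for x, k, n in data:                  # stimulus level, correct responses, trials
        f = 1.0 - math.exp(-((x / alpha) ** beta))   # Weibull distribution function
        p = 0.5 + (0.5 - lam) * f                    # gamma = 0.5 for 2AFC
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        nll -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return nll

# hypothetical data: four blocks of 30 trials each
data = [(4, 16, 30), (7, 21, 30), (10, 26, 30), (13, 29, 30)]
print(neg_log_posterior(10.0, 3.0, 0.01, data))   # finite: parameters inside the priors
print(neg_log_posterior(30.0, 3.0, 0.01, data))   # infinite: alpha outside its prior
```

An off-the-shelf minimizer (or a grid search) over the box then performs the constrained maximum-likelihood fit; parameter values outside the box can never be returned.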

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) has only a minor influence on the estimates of variability. The importance of this can hardly be overstated, since a strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question: As experimenters, we do not know, and never will know, the true underlying distribution function or objective function from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψ_W(x; θ_W), using the Weibull as F, with θ_W = {10, 3, .5, .01} (our "standard" generating function, ψ_gen); (2) ψ_CG(x; θ_CG), using the cumulative Gaussian, with θ_CG = {8.875, 3.278, .5, .01}; (3) ψ_L(x; θ_L), using the logistic, with θ_L = {8.957, 2.014, .5, .01}; and, finally, (4) ψ_G(x; θ_G), using the Gumbel, with θ_G = {10.022, 2.906, .5, .01}. For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that, at most, one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.8 Note that this is not trivially true: Although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated

Table 1
Summary of Results of Monte Carlo Simulations

                                   N     s1    s2    s3    s4    s5    s6    s7
MWCI68 at x = F^-1(0.5)           120   331   135   143   369   162   100   110
                                  240   209   163   126   339   133   100   113
                                  480   156   179   143   193   158   100   155
                                  960   124   141   140   138   152   100   119
MWCI68 at x = F^-1(0.8)           120  5725   428   100  1761   180   263   115
                                  240  1388   205   100  1257   110   125   109
                                  480   385   181   115   478   100   116   137
                                  960   215   149   100   291   102   119   101
MWCI68 of dF/dx at x = F^-1(0.5)  120   212   305   119   321   139   217   100
                                  240   228   263   100   254   128   185   121
                                  480   186   213   119   217   100   148   134
                                  960   186   175   113   201   100   151   118
Mean                                    779   211   118   485   130   140   119
Standard deviation                     1595    85    17   499    28    49    16
Median                                  214   180   117   306   131   122   117

Note: Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold_0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold_0.8 and slope_0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θ_gen, as sampled at the points θ_gen and φ1, …, φ8. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function, ψ_gen, using sampling scheme s2 with N = 120.

Slope, threshold_0.8, and threshold_0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.
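The near-equivalence of the four functions in Figure 6A can be checked numerically. The sketch below is our own code (the parameter values are those quoted in the text); it evaluates all four psychometric functions over the plotted stimulus range and reports the largest pairwise discrepancy.

```python
import math

GAMMA, LAM = 0.5, 0.01   # 2AFC guess rate and lapse rate

def psi(F):
    """Wrap a distribution function F into a 2AFC psychometric function."""
    return lambda x: GAMMA + (1.0 - GAMMA - LAM) * F(x)

weibull  = psi(lambda x: 1.0 - math.exp(-((x / 10.0) ** 3.0)))
gaussian = psi(lambda x: 0.5 * (1.0 + math.erf((x - 8.875) / (3.278 * math.sqrt(2)))))
logistic = psi(lambda x: 1.0 / (1.0 + math.exp(-(x - 8.957) / 2.014)))
gumbel   = psi(lambda x: 1.0 - math.exp(-math.exp((x - 10.022) / 2.906)))

xs = [2.0 + 0.1 * i for i in range(121)]          # the signal-intensity range of Figure 6
funcs = (weibull, gaussian, logistic, gumbel)
max_diff = max(abs(f(x) - g(x)) for x in xs for f in funcs for g in funcs)
print(f"largest pairwise discrepancy: {max_diff:.3f}")
```

Over this range, the largest discrepancy is on the order of .03, small compared with the binomial variability of typical blocks of trials.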

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

ψ̄ = (1/4) [ψ_W + ψ_CG + ψ_L + ψ_G]

as the generating function and fitted psychometric functions, using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions, to each data set. From the fitted psychometric functions, we obtained estimates of threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5, as described previously. All four different values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 values of N). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,9 for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10^9 simulated 2AFC trials).

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel), with the parameter values given in the text. (B) A fit of two psychometric functions with different distribution functions F (Weibull, logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A, using sampling scheme s2 with N = 120. See the text for details.

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5), not only were the first two factors, the number of trials N and the sampling scheme, significant (p < .0001), as was expected, but so were the distribution function F and all possible interactions: The three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates: Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold_0.5 and slope_0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.10 The choice of distribution function F does not have, despite being a significant factor, a large effect on WCI68 for threshold_0.5 and slope_0.5.
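The effect-size measure used here is simply each factor's sum of squares as a share of the total. As an illustration (our own sketch, not the authors' analysis code), the one-factor case decomposes the total sum of squares into a between-group and a residual part:

```python
def ss_decomposition(groups):
    """Percentage of total sum of squares accounted for by a single factor
    (between-group SS) versus the residual, for a one-way layout."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    ss_total = sum((v - grand) ** 2 for v in all_vals)
    ss_between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups)
    return 100.0 * ss_between / ss_total, 100.0 * (ss_total - ss_between) / ss_total

# three groups with large separation: the factor explains almost all the variance
between, resid = ss_decomposition([[1.0, 1.1, 0.9], [5.0, 5.1, 4.9], [9.0, 9.1, 8.9]])
print(f"between: {between:.2f}%, residual: {resid:.2f}%")
```

In the paper's three-factor design, the same decomposition is carried out over N, sampling scheme, and F, together with their interactions.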

The same is not true, however, for the WCI68 of threshold_0.2. Here, the choice of F has an undesirably large effect on the bootstrap estimate of WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically. Each of the four panels of Figure 7 plots the WCI68

(normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme as a function of the distribution function F.

For threshold_0.5 and slope_0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold_0.2 the choice of F has a profound influence on WCI68 (e.g., in Figure 7A there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold_0.8. Figure 7C shows the (again, undesirable) interaction between sampling scheme and choice of F: WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles) the choice of F has a marked influence on WCI68. It was generally the case for threshold_0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose as threshold and slope the measures corresponding to threshold_0.5 and slope_0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like it to be. Second, away from the midpoint of F, estimates of variability are not as independent of the distribution function chosen as one might wish, in particular for lower proportions correct (threshold_0.2 is much more affected by the choice of F than threshold_0.8; cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions (for example, 60% [or 90%] correct in 2AFC) and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions owing to the (arbitrary) choice of a distribution function resulting in comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                        Threshold_0.2  Threshold_0.5  Threshold_0.8  Slope_0.5
Number of trials N                                 76.24          87.30          65.14        72.86
Sampling schemes s1-s7                              6.60          10.00          19.93        19.69
Distribution function F                            11.36           0.13           3.87         0.92
Error (numerical variability in bootstrap)          0.34           0.35           0.46         0.34
Interaction of N and sampling schemes s1-s7         1.18           0.98           5.97         3.50
Sum of interactions involving F                     4.28           1.24           4.63         2.69
Percentage of SS accounted for without F           84.36          98.63          91.50        96.39

Note: The columns refer to threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5, that is, to approximately 60%, 75%, and 90% correct and the slope at 75% correct during two-alternative forced-choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.

Figure 7. The four panels show the width of WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of WCI68 for a given sampling scheme and N: Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold_0.2. (B) WCI68 for threshold_0.5. (C) WCI68 for threshold_0.8. (D) WCI68 for slope_0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates11 of the variability of a distribution J. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations s by the plug-in estimate ŝ; Maloney (1990), in addition to ŝ, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as s) are not robust: They are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity J might be seriously in error if only a single bootstrap estimate J*_i is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.12
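The breakdown argument is easy to demonstrate numerically. The sketch below (our own illustration) corrupts one element of a well-behaved set of simulated bootstrap estimates and compares moment-based with nonparametric summaries:

```python
from statistics import mean, median, stdev

samples = [4.1, 4.3, 3.9, 4.0, 4.2, 4.1, 3.8, 4.2, 4.0, 4.1]
corrupted = samples[:-1] + [400.0]   # one wild estimate, e.g. from a local minimum

print(mean(samples), mean(corrupted))      # the mean is dragged far from 4
print(median(samples), median(corrupted))  # the median barely moves
print(stdev(samples), stdev(corrupted))    # s explodes as well
```

This is why a single bootstrap replicate stuck in a local minimum can ruin s yet leaves percentile-based summaries essentially untouched.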

Nonparametric plug-in estimates are also not without problems: Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distributions of bootstrap estimates θ* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates J* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias particularly raised problems for the distribution of the β parameter of the Weibull, b* (N = 210, K = 7). We also found in our simulations that b*, and thus slopes s*, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that Φ = m(θ̂) and Φ* = m(θ*), resulting in

Φ* ~ N(Φ − z0·σ_Φ, σ_Φ²),    (2)

where

σ_Φ = σ_Φ0 + a·(Φ − Φ0),    (3)

and Φ0 is any reference point on the scale of Φ values; z0 is the bias-correction term, and a is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an ε-level confidence interval endpoint of the BCa interval can be calculated as

θ_BCa[ε] = G⁻¹( CG( z0 + (z0 + z_ε) / (1 − a·(z0 + z_ε)) ) ),    (4)

where CG is the cumulative Gaussian distribution function, G⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ*, z_ε = CG⁻¹(ε) is the ε quantile of the standard Gaussian, z0 is our estimate of the bias, and a is our estimate of acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an ε-level confidence interval endpoint is simply G⁻¹(ε) (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equal in performance to standard percentile intervals, but not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for the sensitivity analysis described above.
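The BCa recipe translates into only a few lines of code. The sketch below is our own illustration (function and variable names are ours): It computes the bias-correction term from the fraction of bootstrap replications below the original estimate, the acceleration term from jackknife estimates (following Efron & Tibshirani, 1993, chap. 14), and then reads the interval endpoints off the sorted bootstrap distribution, that is, it applies G⁻¹ to the adjusted levels.

```python
from statistics import NormalDist, mean

def bca_endpoints(theta_hat, boot, jack, levels=(0.159, 0.841)):
    """BCa confidence interval endpoints from a bootstrap distribution `boot`
    of some quantity and jackknife (leave-one-out) estimates `jack`.
    Assumes theta_hat lies strictly inside the range of `boot`."""
    nd = NormalDist()
    # bias-correction term z0: distance of the original estimate from the
    # median of the bootstrap distribution, on the Gaussian scale
    below = sum(t < theta_hat for t in boot) / len(boot)
    z0 = nd.inv_cdf(below)
    # acceleration term a from the jackknife estimates
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0
    srt = sorted(boot)
    endpoints = []
    for eps in levels:
        z_eps = nd.inv_cdf(eps)                # epsilon quantile of the standard Gaussian
        adj = nd.cdf(z0 + (z0 + z_eps) / (1.0 - a * (z0 + z_eps)))
        idx = min(len(srt) - 1, max(0, int(adj * len(srt))))
        endpoints.append(srt[idx])             # G^-1 applied to the adjusted level
    return tuple(endpoints)

# a skewed, deterministic stand-in for a bootstrap distribution
boot = [0.8 + 0.4 * (i / 998.0) ** 1.5 for i in range(999)]
jack = [0.96, 0.98, 0.99, 1.00, 1.00, 1.01, 1.02, 1.03, 1.05, 0.94]
lo, hi = bca_endpoints(1.0, boot, jack)
print(f"68% BCa interval: [{lo:.3f}, {hi:.3f}]")
```

With z0 = 0 and a = 0, the adjusted levels reduce to ε itself, recovering the ordinary percentile interval.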

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and derived measures, such as thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical, because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if, and only if, we choose threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold_0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and, third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology conference, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. G., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference on Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100·nth percentile of the bootstrap distribution α*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ± 1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis if β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations where one psychometric function, using a particular distribution function, provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using BCa confidence intervals, described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity J, which is derived from a probability distribution F by J = t(F), is to obtain F̂ from empirical data and then use Ĵ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing, by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)


1316 WICHMANN AND HILL

old or slope, e.g.) is estimated, to give Ĵ*_i. The process for obtaining Ĵ*_i is the same as that used to obtain the first estimate, Ĵ. Thus, if our first estimate was obtained by Ĵ = t(θ̂), where θ̂ is the maximum-likelihood parameter estimate from a fit to the original data y, the simulated estimates Ĵ*_i will be given by Ĵ*_i = t(θ̂*_i), where θ̂*_i is the maximum-likelihood parameter estimate from a fit to the simulated data y*_i.

Sometimes it is erroneously assumed that the intention is to measure the variability of the underlying J itself. This cannot be the case, however, because repeated computer simulation of the same experiment is no substitute for the real repeated measurements this would require. What Monte Carlo simulations can do is estimate the variability inherent in (1) our sampling, as characterized by the distribution of sample points (x) and the size of the samples (n), and (2) any interaction between our sampling strategy and the process used to estimate J; that is, assuming a model of the observer's variability, fitting a function to obtain θ̂, and applying t(θ̂).

Bootstrap Data Sets: Nonparametric and Parametric Generation

In applying Monte Carlo techniques to psychophysical data, we require, in order to obtain a simulated data set y*_i, some system that provides generating probabilities p for the binomial variates y*_i1, …, y*_iK. These should be the same generating probabilities that we hypothesize to underlie the empirical data set y.

Efron's bootstrap offers such a system. In the nonparametric bootstrap method, we would assume p = y. This is equivalent to resampling, with replacement, the original set of correct and incorrect responses in each block of observations j in y to produce a simulated sample y*_ij.
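In code, generating one nonparametric bootstrap data set amounts to binomial resampling with the observed proportions as generating probabilities. A minimal Python sketch (the block sizes and proportions are invented for illustration, not taken from the paper):

```python
import random

random.seed(0)

# Observed proportions correct in K = 6 blocks of n = 40 trials each
# (illustrative numbers only).
y = [0.525, 0.60, 0.75, 0.85, 0.95, 0.975]
n = [40] * len(y)

def nonparametric_bootstrap_dataset(y, n):
    """Resample each block's correct/incorrect responses with
    replacement; equivalently, draw Binomial(n_j, y_j) successes using
    the observed proportion y_j as the generating probability p_j."""
    out = []
    for p, nj in zip(y, n):
        successes = sum(1 for _ in range(nj) if random.random() < p)
        out.append(successes / nj)
    return out

y_star = nonparametric_bootstrap_dataset(y, n)
print(y_star)
```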

Alternatively, a parametric bootstrap can be performed. In the parametric bootstrap, assumptions are made about the generating model from which the observed data are believed to arise. In the context of estimating the variability of parameters of psychometric functions, the data are generated by a simulated observer whose underlying probabilities of success are determined by the maximum-likelihood fit to the real observer's data [y_fit = ψ(x; θ̂)]. Thus, where the nonparametric bootstrap uses y, the parametric bootstrap uses y_fit as generating probabilities p for the simulated data sets.
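The parametric counterpart differs only in where the generating probabilities come from: the fitted function is evaluated at the stimulus placements, and simulated proportions are drawn from those values. A hedged Python sketch (the stimulus values and "fitted" parameters below are made up for illustration):

```python
import math
import random

random.seed(0)

def weibull_2afc(x, alpha, beta, gamma=0.5, lam=0.01):
    """2AFC psychometric function psi(x) = gamma + (1 - gamma - lam) * F(x),
    with F the Weibull distribution function."""
    F = 1.0 - math.exp(-((x / alpha) ** beta))
    return gamma + (1.0 - gamma - lam) * F

# Stimulus placements and an assumed maximum-likelihood fit
# (illustrative values in the spirit of the paper's generating set).
x = [7.0, 8.5, 10.0, 11.0, 12.0, 14.0]
p_fit = [weibull_2afc(xi, alpha=10.0, beta=3.0) for xi in x]

def parametric_bootstrap_dataset(p_fit, n_per_block=40):
    """Generate y* from the fitted probabilities psi(x; theta-hat),
    not from the raw proportions y as the nonparametric bootstrap would."""
    return [sum(random.random() < p for _ in range(n_per_block)) / n_per_block
            for p in p_fit]

y_star = parametric_bootstrap_dataset(p_fit)
print(y_star)
```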

As is frequently the case in statistics, the choice of parametric versus nonparametric analysis concerns how much confidence one has in one's hypothesis about the underlying mechanism that gave rise to the raw data, as against the confidence one has in the raw data's precise numerical values. Choosing the parametric bootstrap for the estimation of variability in psychometric function fitting appears the natural choice, for several reasons. First and foremost, in fitting a parametric model to the data, one has already committed oneself to a parametric analysis. No additional assumptions are required to perform a parametric bootstrap beyond those required for fitting a function to the data: specification of the source of variability (binomial variability) and of the model from which the data are most likely to come (parameter vector θ and distribution function F). Second, given the assumption that data from psychophysical experiments are binomially distributed, we expect data to be variable (noisy). The nonparametric bootstrap treats every data point as if its exact value reflected the underlying mechanism.1 The parametric bootstrap, on the other hand, allows the data points to be treated as noisy samples from a smooth and monotonic function determined by θ and F.

One consequence of the two different bootstrap regimes is as follows. Assume two observers performing the same psychophysical task at the same stimulus intensities x, and assume that it happens that the maximum-likelihood fits to the two data sets yield identical parameter vectors θ̂. Given such a scenario, the parametric bootstrap returns identical estimates of variability for both observers, since it depends only on x, θ̂, and F. The nonparametric bootstrap's estimates would, on the other hand, depend on the individual differences between the two data sets y1 and y2, something we consider unconvincing. A method for estimating variability in parameters and thresholds should return identical estimates for identical observers performing the identical experiment.2

Treutwein (1995) and Treutwein and Strasburger (1999) used the nonparametric bootstrap, and Maloney (1990) used the parametric bootstrap, to compare bootstrap estimates of variability with real-world variability in the data of repeated psychophysical experiments. All of the above studies found bootstrap estimates to be in agreement with the human data. Keeping in mind that the number of repeats in the above-quoted cases was small, this is nonetheless encouraging, suggesting that bootstrap methods are a valid method of variability estimation for parameters fitted to psychophysical data.

Testing the Bridging Assumption

Asymptotically, that is, for large K and N, θ̂ will converge toward θ, since maximum-likelihood estimation is asymptotically unbiased3 (Cox & Hinkley, 1974; Kendall & Stuart, 1979). For the small K typical of psychophysical experiments, however, we can only hope that our estimated parameter vector θ̂ is "close enough" to the true parameter vector θ for the estimated variability in the parameter vector, obtained by the bootstrap method, to be valid. We call this the bootstrap bridging assumption.

Whether θ̂ is indeed sufficiently close to θ depends in a complex way on the sampling: that is, on the number of blocks of trials (K), the numbers of observations at each block of trials (n), and the stimulus intensities (x), relative to the true parameter vector θ. Maloney (1990) summarized these dependencies for a given experimental design by plotting the standard deviation of β̂ as a function of α and β as a contour plot (Maloney, 1990, Figure 3, p. 129). Similar contour plots for the standard deviation of α̂, and for bias in both α̂ and β̂, could be obtained. If, owing to small K, bad sampling, or otherwise, our estimation procedure is inaccurate, the distribution of bootstrap parameter vectors θ̂*_1, …, θ̂*_B, centered around θ̂, will not be centered around the true θ. As a result, the estimates of variability are likely to be incorrect, unless the magnitude of the standard deviation is similar around θ̂ and θ despite the fact that the points are some distance apart in parameter space.

One way to assess the likely accuracy of bootstrap estimates of variability is to follow Maloney (1990) and examine the local flatness of the contours around θ̂, our best estimate of θ. If the contours are sufficiently flat, the variability estimates will be similar, assuming that the true θ is somewhere within this flat region. However, the process of local contour estimation is computationally expensive, since it requires a very large number of complete bootstrap runs for each data set.

A much quicker alternative is the following. Having obtained θ̂ and performed a bootstrap using ψ(x; θ̂) as the generating function, we move to eight different points φ1, …, φ8 in α–β space. Eight Monte Carlo simulations are performed, using φ1, …, φ8 as the generating functions, to explore the variability in those parts of the parameter space (only the generating parameters of the bootstrap are changed; x remains the same for all of them). If the contours of variability around θ̂ are sufficiently flat, as we hope they are, confidence intervals at φ1, …, φ8 should be of the same magnitude as those obtained at θ̂. Prudence should lead us to accept the largest of the nine confidence intervals obtained as our estimate of variability.

A decision has to be made as to which eight points in α–β space to use for the new set of simulations. Generally, provided that the psychometric function is at least reasonably well sampled, the contours vary smoothly in the immediate vicinity of θ̂, so that the precise placement of the sample points φ1, …, φ8 is not critical. One suggested and easy way to obtain a set of additional generating parameters is shown in Figure 1.

Figure 1 shows B = 2,000 bootstrap parameter pairs as dark filled circles, plotted in α–β space. Simulated data sets were generated from ψ_gen, with the Weibull as F and θ_gen = {10, 3, .5, .01}; sampling scheme s7 (triangles; see Figure 2) was used, and N was set to 480 (n_i = 80). The large central triangle at (10, 3) marks the generating parameter set; the solid and dashed line segments adjacent to the x- and y-axes mark the 68% and 95% confidence intervals for α and β, respectively.4 In the following, we shall use WCI to stand for width of confidence interval, with a subscript denoting its coverage percentage; that is, WCI68 denotes the width of the 68% confidence interval.5 The eight additional generating parameter pairs φ1, …, φ8 are marked by the light triangles. They form a rectangle whose sides have length WCI68 in α and β. Typically, this central rectangular region contains approximately 30–40% of all α–β pairs and could thus be viewed as a crude joint 35% confidence region for α and β. A coverage percentage of 35% represents a sensible compromise between erroneously accepting the estimate around θ̂ (very likely underestimating the true variability) and performing additional bootstrap replications too far in the periphery, where variability, and thus the estimated confidence intervals, become erroneously inflated owing to poor sampling. Recently, one of us (Hill, 2001b) performed Monte Carlo simulations to test the coverage of various bootstrap confidence interval methods and found that the above method, based on 25% of points or more, was a good way of guaranteeing that both sides of a two-tailed confidence interval had at least the desired level of coverage.
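The placement of the eight additional generating pairs can be sketched in a few lines of Python (the rectangle construction follows the description above; the numeric θ̂ and WCI68 values are invented for illustration):

```python
def sensitivity_points(alpha_hat, beta_hat, wci68_alpha, wci68_beta):
    """Eight generating parameter pairs phi_1, ..., phi_8 on the corners
    and edge midpoints of a rectangle centred on (alpha_hat, beta_hat)
    whose sides have length WCI68 in alpha and beta."""
    ha, hb = wci68_alpha / 2.0, wci68_beta / 2.0
    offsets = [(-ha, -hb), (0.0, -hb), (ha, -hb),
               (-ha, 0.0),             (ha, 0.0),
               (-ha, hb),  (0.0, hb),  (ha, hb)]
    return [(alpha_hat + da, beta_hat + db) for da, db in offsets]

# Hypothetical fit and confidence-interval widths:
points = sensitivity_points(10.0, 3.0, wci68_alpha=1.2, wci68_beta=1.6)
print(points)
```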

MONTE CARLO SIMULATIONS

In both our papers, we use only the specific case of the 2AFC paradigm in our examples; thus, γ is fixed at .5. In our simulations, where we must assume a distribution of true generating probabilities, we always use the Weibull function in conjunction with the same fixed set of generating parameters θ_gen: α_gen = 10, β_gen = 3, γ_gen = .5, λ_gen = .01. In our investigation of the effects of sampling patterns, we shall always use K = 6 and n_i constant; that is, six blocks of trials with the same number of observations in each block. The number of observations per point, n_i, could be set to 20, 40, 80, or 160, and, with K = 6, this means that the total number of observations N could take the values 120, 240, 480, and 960.

We have introduced these limitations purely for the purposes of illustration, to keep our explanatory variables down to a manageable number. We have found in many other simulations that, in most cases, these limitations entail no loss of generality in our conclusions.

Confidence intervals of parameters, thresholds, and slopes that we report in the following were always obtained using a method called bias-corrected and accelerated (BCa) confidence intervals, which we describe later, in our Bootstrap Confidence Intervals section.

The Effects of Sampling Schemes and Number of Trials

One of our aims in this study was to examine the effect of N and of one's choice of sample points x on both the size of one's confidence intervals for Ĵ and their sensitivity to errors in θ̂.

Seven different sampling schemes were used, each dictating a different distribution of data points along the stimulus axis; they are the same schemes as those used and described in Wichmann and Hill (2001), and they are shown in Figure 2. Each horizontal chain of symbols represents one of the schemes, marking the stimulus values at which the six sample points are placed. The different symbol shapes will be used to identify the sampling schemes in our results plots. To provide a frame of reference, the solid curve shows the psychometric function used, that is, 0.5 + 0.5F(x; α_gen, β_gen), with the 55%, 75%, and 95% performance levels marked by dotted lines.

As we shall see, even for a fixed number of sample points and a fixed number of trials per point, biases in parameter estimation and goodness-of-fit assessment

1318 WICHMANN AND HILL

(companion paper), as well as the width of confidence intervals (this paper), all depend markedly on the distribution of stimulus values x.

Monte Carlo data sets were generated using our seven sampling schemes shown in Figure 2, using the generating parameters θ_gen, as well as N and K, as specified above. A maximum-likelihood fit was performed on each simulated data set to obtain bootstrap parameter vectors θ̂*, from which we subsequently derived the x values corresponding to threshold0.5 and threshold0.8, as well as the slope. For each sampling scheme and value of N, a total of nine simulations were performed: one at θ_gen, and eight more at points φ1, …, φ8, as specified in our section on the bootstrap bridging assumption. Thus, each of our 28 conditions (7 sampling schemes × 4 values of N) required 9 × 1,000 simulated data sets, for a total of 252,000 simulations, or 1.134 × 10^8 simulated 2AFC trials.
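For the Weibull, the derived quantities have closed forms: threshold_p is simply F⁻¹(p) = α(−ln(1−p))^(1/β), and the slope is the derivative dF/dx evaluated there. A small Python sketch using the paper's generating values α = 10, β = 3 (the function names are ours):

```python
import math

def weibull_inv(p, alpha, beta):
    """x such that F(x; alpha, beta) = p, for the Weibull
    F(x) = 1 - exp(-(x / alpha) ** beta)."""
    return alpha * (-math.log(1.0 - p)) ** (1.0 / beta)

def weibull_slope(x, alpha, beta):
    """dF/dx for the Weibull distribution function."""
    return (beta / alpha) * (x / alpha) ** (beta - 1.0) \
        * math.exp(-((x / alpha) ** beta))

alpha, beta = 10.0, 3.0                # the paper's generating values
t05 = weibull_inv(0.5, alpha, beta)    # threshold at F = 0.5
t08 = weibull_inv(0.8, alpha, beta)    # threshold at F = 0.8
s05 = weibull_slope(t05, alpha, beta)  # slope at threshold0.5
print(round(t05, 3), round(t08, 3), round(s05, 4))
```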

Figures 3, 4, and 5 show the results of the simulations dealing with slope0.5, threshold0.5, and threshold0.8, respectively. The top-left panel (A) of each figure plots the WCI68 of the estimate under consideration as a function of N. Data for all seven sampling schemes are shown, using their respective symbols. The top-right panel (B) plots, as a function of N, the maximal elevation of the WCI68 encountered in the vicinity of θ_gen, that is, max{WCI68(φ1)/WCI68(θ_gen), …, WCI68(φ8)/WCI68(θ_gen)}. The elevation factor is an indication of the sensitivity of our variability estimates to errors in the estimation of θ; the closer it comes to 1.0, the better. The bottom-left panel (C) plots the largest of the WCI68 measurements at θ_gen and φ1, …, φ8 for a given sampling scheme and N; it is thus the product of the elevation factor plotted in panel B and its respective WCI68 in panel A. This quantity we denote as MWCI68, standing for maximum width of the


Figure 1. B = 2,000 data sets were generated from a two-alternative forced-choice Weibull psychometric function with parameter vector θ_gen = {10, 3, .5, .01} and then fit using our maximum-likelihood procedure, resulting in 2,000 estimated parameter pairs (α̂, β̂), shown as dark circles in α–β parameter space. The location of the generating α and β (10, 3) is marked by the large triangle in the center of the plot. The sampling scheme s7 was used to generate the data sets (see Figure 2 for details), with N = 480. Solid lines mark the 68% confidence interval width (WCI68), separately for α and β; broken lines mark the 95% confidence intervals. The light small triangles show the α–β parameter sets φ1, …, φ8 from which each bootstrap is repeated during sensitivity analysis, while keeping the x values of the sampling scheme unchanged.


68% confidence interval; this is the confidence interval we suggest experimenters should use when they report error bounds. The bottom-right panel (D), finally, shows the MWCI68 multiplied by a sweat factor, √(N/120). Ideally, that is, for very large K and N, confidence intervals should be inversely proportional to √N, and multiplication by our sweat factor should remove this dependency. Any change in the sweat-adjusted MWCI68 as a function of N might thus be taken as an indicator of the degree to which sampling scheme, N, and true confidence interval width interact in a way not predicted by asymptotic theory.
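The sweat-factor adjustment is a one-liner; the sketch below (our illustration, with invented widths that follow the ideal 1/√N law exactly) shows that such widths collapse onto a constant after scaling by √(N/120):

```python
import math

def sweat_adjusted(mwci68, N, N_ref=120):
    """Scale a confidence-interval width by the 'sweat factor'
    sqrt(N / N_ref); under ideal 1/sqrt(N) behaviour the adjusted
    widths are constant across N."""
    return mwci68 * math.sqrt(N / N_ref)

# Hypothetical widths that follow the ideal 1/sqrt(N) law exactly:
widths = {120: 0.12,
          240: 0.12 / math.sqrt(2.0),
          480: 0.06,
          960: 0.06 / math.sqrt(2.0)}
adjusted = [sweat_adjusted(w, N) for N, w in widths.items()]
print([round(a, 3) for a in adjusted])
```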

Figure 3A shows the WCI68 around the median estimated slope. Clearly, the different sampling schemes have a profound effect on the magnitude of the confidence intervals for slope estimates. For example, in order to ensure that the WCI68 is approximately 0.06, one requires nearly 960 trials if using sampling scheme s1 or s4. Sampling schemes s3, s5, or s7, on the other hand, require only around 200 trials to achieve similar confidence interval width. The important difference that makes, for example, s3, s5, and s7 more efficient than s1 and s4 is the presence of samples at high predicted performance values (p ≥ .9), where binomial variability is low and the data thus constrain our maximum-likelihood fit more tightly. Figure 3B illustrates the complex interactions between different sampling schemes, N, and the stability of the bootstrap estimates of variability, as indicated by the local flatness of the contours around θ_gen. A perfectly flat local contour would result in a horizontal line at 1. Sampling scheme s6 is well behaved for N ≥ 480, its maximal elevation being around 1.7; for N ≤ 240, however, elevation rises to nearly 2.5. Other schemes, like s1, s4, or s7, never rise above an elevation of 2, regardless of N. It is important to note that the magnitude of WCI68 at θ_gen is by no means a good predictor of the stability of the estimate, as indicated by the sensitivity factor. Figure 3C shows the MWCI68, and here the differences in efficiency between sampling schemes are even more apparent than in Figure 3A. Sampling schemes s3, s5, and s7 are clearly superior to all other sampling schemes, particularly for N ≤ 480; these three sampling schemes are the ones that include two sampling points at p ≥ .92.

Figure 4, showing the equivalent data for the estimates of threshold0.5, also illustrates that some sampling schemes make much more efficient use of the experimenter's time by providing MWCI68s that are more compact than others by a factor of 3.2.

Two aspects of the data shown in Figures 4B and 4C are important. First, the sampling schemes fall into two distinct classes. Five of the seven sampling schemes are almost ideally well behaved, with elevations barely exceeding 1.5 even for small N (and MWCI68 < 3); the remaining two, on the other hand, behave poorly, with elevations in excess of 3 for N ≤ 240 (MWCI68 > 5). The two unstable sampling schemes, s1 and s4, are those that do not include at least a single sample point at p ≥ .95 (see Figure 2). It thus appears crucial to include at least one sample point at p ≥ .95 to make one's estimates of variability robust to small errors in θ̂, even if the threshold of interest, as in our example, has a p of only approximately .75. Second, both s1 and s4 are prone to lead to a serious underestimation of the true width of the confidence interval


Figure 2. A two-alternative forced-choice Weibull psychometric function with parameter vector θ = {10, 3, .5, 0}, on semilogarithmic coordinates. The rows of symbols below the curve mark the x values of the seven different sampling schemes, s1–s7, used throughout the remainder of the paper.


if the sensitivity analysis (or bootstrap bridging assumption test) is not carried out: the WCI68 is unstable in the vicinity of θ_gen, even though the WCI68 is small at θ_gen.

Figure 5 is similar to Figure 4, except that it shows the WCI68 around threshold0.8. The trends found in Figure 4 are even more exaggerated here. The sampling schemes

without high sampling points (s1, s4) are unstable to the point of being meaningless for N ≤ 480. In addition, their WCI68 is inflated relative to that of the other sampling schemes even at θ_gen.

The results of the Monte Carlo simulations are summarized in Table 1. The columns of Table 1 correspond to


Figure 3. (A) The width of the 68% BCa confidence interval (WCI68) of slopes of B = 999 psychometric functions fitted to parametric bootstrap data sets generated from θ_gen = {10, 3, .5, .01}, as a function of the total number of observations N. (B) The maximal elevation of the WCI68 in the vicinity of θ_gen, again as a function of N (see the text for details). (C) The maximal width of the 68% BCa confidence interval (MWCI68) in the vicinity of θ_gen, that is, the product of the respective WCI68 and elevation factors of panels A and B. (D) MWCI68 scaled by the sweat factor √(N/120), the ideally expected decrease in confidence interval width with increasing N. The seven symbols denote the seven sampling schemes, as in Figure 2.


the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold0.8, and the following four those for slope0.5. The scheme with the lowest MWCI68 in each row is given the score of 100; the others on the same row are given proportionally higher scores, to indicate their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.

An inspection of Table 1 reveals that the sampling schemes fall into four categories. By a long way worst are sampling schemes s1 and s4, with mean and median MWCI68 greater than 210. Already somewhat superior is s2, particularly for small N, with a median MWCI68 of 180 and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of


Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(0.5) (approximately equal to 75% correct during two-alternative forced-choice).


around 130. Each of these has at least one sample at p ≥ .95 and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6. Comparing s1 and s6: s6 is identical to scheme s1, except that one sample point was moved from .75 to .95; still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120, very nearly optimal6 on most estimates. Both of these sampling schemes, s3 and s7,

have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses, which could potentially bias the estimates if we were to fix the upper asymptote of the psy-


Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(0.8) (approximately equal to 90% correct during two-alternative forced-choice).


chometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold0.5 in order to obtain tight confidence intervals for threshold0.5), because estimation is done via the whole psychometric function, which in turn is estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
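The point about variance is just the binomial standard error √(p(1−p)/n), which shrinks toward the asymptotes; a short Python check (n = 40 is an arbitrary block size for illustration):

```python
import math

def binomial_sd(p, n):
    """Standard deviation of an observed proportion from n Bernoulli
    trials with success probability p."""
    return math.sqrt(p * (1.0 - p) / n)

# Variability is largest mid-range and smallest near the asymptotes,
# which is why sample points at high performance levels constrain the
# maximum-likelihood fit most tightly.
n = 40
print(round(binomial_sd(0.75, n), 4), round(binomial_sd(0.95, n), 4))
```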

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting: we used Bayesian priors to limit α to be between 0 and 25 and β to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable given the experimental context (cf. Figure 2); it is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes7 but would not change them qualitatively.

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters,

thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) has only a minor influence on the estimates of variability. The importance of this cannot be overestimated, since a strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question, because, as experimenters, we do not know, and never will, the true underlying distribution function or objective function from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψ_W(x; θ_W), using the Weibull as F, with θ_W = {10, 3, .5, .01} (our "standard" generating function ψ_gen); (2) ψ_CG(x; θ_CG), using the cumulative Gaussian, with θ_CG = {8.875, 3.278, .5, .01}; (3) ψ_L(x; θ_L), using the logistic, with θ_L = {8.957, 2.014, .5, .01}; and, finally, (4) ψ_G(x; θ_G), using the Gumbel, with θ_G = {10.022, 2.906, .5, .01}. For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that at most one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.8 Note that this is not trivially true: although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated

Table 1
Summary of Results of Monte Carlo Simulations

                            N      s1     s2     s3     s4     s5     s6     s7
MWCI68 at x = F⁻¹(.5)      120    331    135    143    369    162    100    110
                           240    209    163    126    339    133    100    113
                           480    156    179    143    193    158    100    155
                           960    124    141    140    138    152    100    119
MWCI68 at x = F⁻¹(.8)      120   5725    428    100   1761    180    263    115
                           240   1388    205    100   1257    110    125    109
                           480    385    181    115    478    100    116    137
                           960    215    149    100    291    102    119    101
MWCI68 of dF/dx at         120    212    305    119    321    139    217    100
x = F⁻¹(.5)                240    228    263    100    254    128    185    121
                           480    186    213    119    217    100    148    134
                           960    186    175    113    201    100    151    118
Mean                              779    211    118    485    130    140    119
Standard deviation               1595     85     17    499     28     49     16
Median                            214    180    117    306    131    122    117

Note. Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold0.8 and slope0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θ_gen, as sampled at the points θ_gen and φ1, …, φ8. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψ_gen, using sampling scheme s2 with N = 120.
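The four nearly indistinguishable functions of Figure 6A can be written down directly from the parameter vectors given in the text. A Python sketch (we assume standard textbook forms of the four sigmoids, including the Gumbel in its 1 − exp(−exp(·)) form, which may differ from the authors' exact parameterization):

```python
import math

def F_weibull(x, a=10.0, b=3.0):
    return 1.0 - math.exp(-((x / a) ** b))

def F_gauss(x, mu=8.875, sigma=3.278):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def F_logistic(x, mu=8.957, s=2.014):
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def F_gumbel(x, mu=10.022, s=2.906):
    return 1.0 - math.exp(-math.exp((x - mu) / s))

def psi(F, x, gamma=0.5, lam=0.01):
    # 2AFC psychometric function: gamma + (1 - gamma - lambda) * F(x)
    return gamma + (1.0 - gamma - lam) * F(x)

def psi_bar(x):
    """Mean of the four psychometric functions; a generating function
    that favors no single choice of F."""
    Fs = (F_weibull, F_gauss, F_logistic, F_gumbel)
    return sum(psi(F, x) for F in Fs) / 4.0

# The four functions are nearly indistinguishable mid-range:
vals = [psi(F, 10.0) for F in (F_weibull, F_gauss, F_logistic, F_gumbel)]
print([round(v, 3) for v in vals], round(psi_bar(10.0), 3))
```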

Slope, threshold0.8, and threshold0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

ψ̄ = ¼ (ψ_W + ψ_CG + ψ_L + ψ_G)

as the generating function, and fitted psychometric functions using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions to each data set. From the fitted psychometric functions, we obtained estimates of threshold0.2, threshold0.5, threshold0.8, and slope0.5, as described previously. All four different val-

[Figure 6 appears here. Panel A legend: ψ_W (Weibull), ψ_CG (cumulative Gaussian), ψ_L (logistic), and ψ_G (Gumbel), with the parameter vectors given in the text. Panel B legend: ψ_W (Weibull), θ = {8.4, 3.4, .5, .05}; ψ_L (logistic), θ = {7.8, 1.0, .5, .05}. Axes of both panels: signal intensity (abscissa) versus proportion correct (ordinate).]

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel). See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull, logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A, using sampling scheme s2 with N = 120. See the text for details.


ues of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 values of N). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,9 for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10^9 simulated 2AFC trials).

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling scheme (s1 to s7), and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold0.2, threshold0.5, threshold0.8, and slope0.5), not only were the first two factors, the number of trials N and the sampling scheme, significant (p < .0001), as was expected, but so were the distribution function F and all possible interactions; the three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates. Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold0.5 and slope0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.10 The choice of distribution function F does not have, despite being a significant factor, a large effect on the WCI68 for threshold0.5 and slope0.5.

The same is not true, however, for the WCI68 of threshold_0.2. Here the choice of F has an undesirably large effect on the bootstrap estimate of WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically. Each of the four panels of Figure 7 plots the WCI68

(normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis, as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme, as a function of the distribution function F.

For threshold_0.5 and slope_0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold_0.2 the choice of F has a profound influence on WCI68 (e.g., in Figure 7A there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold_0.8. Figure 7C shows the (again undesirable) interaction between sampling scheme and choice of F: WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles) the choice of F has a marked influence on WCI68. It was generally the case for threshold_0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose threshold and slope measures corresponding to threshold_0.5 and slope_0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like it to be. Second, away from the midpoint of F, estimates of variability are not as independent of the distribution function chosen as one might wish; in particular, for lower proportions correct, the choice of F matters more (threshold_0.2 is much more affected by the choice of F than threshold_0.8; cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions, for example, 60% (or 90%) correct in 2AFC, and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions owing to the (arbitrary) choice of a distribution function that results in comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                        Threshold_0.2  Threshold_0.5  Threshold_0.8  Slope_0.5
Number of trials N                                76.24          87.30          65.14        72.86
Sampling schemes s1-s7                             6.60          10.00          19.93        19.69
Distribution function F                           11.36           0.13           3.87         0.92
Error (numerical variability in bootstrap)         0.34           0.35           0.46         0.34
Interaction of N and sampling schemes s1-s7        1.18           0.98           5.97         3.50
Sum of interactions involving F                    4.28           1.24           4.63         2.69
Percentage of SS accounted for without F          84.36          98.63          91.50        96.39

Note. The columns refer to threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5, that is, to approximately 60%, 75%, and 90% correct and the slope at 75% correct during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.

1326 WICHMANN AND HILL

Figure 7. The four panels show the width of WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of WCI68 for a given sampling scheme and N. Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold_0.2. (B) WCI68 for threshold_0.5. (C) WCI68 for threshold_0.8. (D) WCI68 for slope_0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates¹¹ of the variability of a distribution ϑ̂. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate σ̂; Maloney (1990), in addition to σ̂, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ) are not robust: They are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity ϑ might be seriously in error if only a single bootstrap estimate ϑ̂ᵢ is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.¹²
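The breakdown argument is easy to demonstrate numerically. The following toy sketch (our own illustration, not the authors' code) corrupts a single replicate of a simulated bootstrap distribution and compares a moment-based width estimate with a percentile-based one:

```python
import random
import statistics

random.seed(0)

# A well-behaved bootstrap distribution of some quantity (e.g., a threshold),
# plus a single corrupted replicate, such as a fit stuck in a local minimum.
boot = [random.gauss(10.0, 0.5) for _ in range(1999)]
boot.append(500.0)  # one wild replicate out of 2000

def percentile(sorted_xs, p):
    """Percentile by linear interpolation on already-sorted data."""
    k = p * (len(sorted_xs) - 1)
    lo = int(k)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (k - lo) * (sorted_xs[hi] - sorted_xs[lo])

xs = sorted(boot)
sd = statistics.stdev(boot)                          # moment-based; breakdown 1/n
wci68 = percentile(xs, 0.84) - percentile(xs, 0.16)  # percentile-based width

print("SD-based 68% width:", 2 * sd)       # blown up by the single outlier
print("Percentile-based WCI68:", wci68)    # essentially unaffected
```

With the outlier in place, the standard-deviation-based width is an order of magnitude too wide, while the percentile-based WCI68 stays close to the uncorrupted value of about 1.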

Nonparametric plug-in estimates are also not without problems. Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distributions of bootstrap estimates θ̂ to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates ϑ̂ of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias particularly raised problems for the distribution of the Weibull parameter β̂ (N = 210, K = 7). We also found in our simulations that β̂ (and thus slope estimates ŝ) were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that a monotonically increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that Φ̂ = m(θ̂) and Φ = m(θ), resulting in

Φ̂ ~ N(Φ - z₀σ_Φ, σ_Φ²),    (2)

where

σ_Φ = σ_Φ₀ + a(Φ - Φ₀),    (3)

and Φ₀ is any reference point on the scale of Φ values. In Equation 2, z₀ is the bias-correction term, and a in Equation 3 is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an e-level confidence-interval endpoint of the BCa interval can be calculated as

ϑ̂_BCa[e] = G⁻¹(CG(z₀ + (z₀ + z⁽ᵉ⁾) / (1 - a(z₀ + z⁽ᵉ⁾)))),    (4)

where CG is the cumulative Gaussian distribution function, G⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications ϑ̂, z₀ is our estimate of the bias, a is our estimate of the acceleration, and z⁽ᵉ⁾ = CG⁻¹(e). For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an e-level confidence-interval endpoint is simply G⁻¹[e] (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equal in performance to standard percentiles, not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence-interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for sensitivity analysis, as described above.
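For readers who want to experiment, the BCa recipe of Equations 2-4 can be sketched in a few lines. This is an illustrative implementation following Efron and Tibshirani (1993, chaps. 14 and 22), not the authors' code; `bca_interval` is our own name, and the equal-tailed levels (.16, .84) match the WCI68 coverage used in the paper:

```python
import random
import statistics

def bca_interval(theta_hat, boot, e_lo=0.16, e_hi=0.84, jack=None):
    """BCa interval endpoints from bootstrap replicates `boot` of a quantity
    whose original estimate is `theta_hat`. `jack` may hold leave-one-out
    (jackknife) estimates, from which the acceleration term a is computed."""
    N = statistics.NormalDist()
    xs = sorted(boot)
    # Bias-correction term z0: where theta_hat sits in the bootstrap distribution.
    prop = sum(x < theta_hat for x in xs) / len(xs)
    z0 = N.inv_cdf(prop)  # requires 0 < prop < 1
    # Acceleration term a from the jackknife (0.0 when none is supplied).
    a = 0.0
    if jack:
        m = sum(jack) / len(jack)
        den = 6.0 * sum((m - j) ** 2 for j in jack) ** 1.5
        a = sum((m - j) ** 3 for j in jack) / den if den else 0.0

    def endpoint(e):
        z = N.inv_cdf(e)
        # Corrected percentile level: CG(z0 + (z0 + z) / (1 - a(z0 + z))).
        e_corr = N.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
        return xs[min(int(e_corr * len(xs)), len(xs) - 1)]

    return endpoint(e_lo), endpoint(e_hi)

# With a symmetric, unbiased bootstrap distribution, z0 is near zero and
# BCa reduces to the ordinary percentile interval.
random.seed(1)
boot = [random.gauss(1.0, 0.1) for _ in range(2000)]
lo, hi = bca_interval(1.0, boot)
print(lo, hi)
```

When the bootstrap distribution is skewed or off-center relative to the original estimate, the two correction terms shift the percentile levels asymmetrically, which is exactly the behavior the text motivates.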

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as the thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if and only if we choose threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold_0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology conference, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference on Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions fitted to data from observers who occasionally lapse, that is, display nonstationarity, is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100nth percentile of the bootstrap distribution α*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ±1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis if β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations where one psychometric function using a particular distribution function provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using the BCa confidence intervals described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges of the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity ϑ that is derived from a probability distribution F by ϑ = t(F) is to obtain F̂ from empirical data and then use ϑ̂ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing, owing to their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)


parameter vectors θ̂*₁, …, θ̂*_B, centered around θ̂, will not be centered around the true θ. As a result, the estimates of variability are likely to be incorrect, unless the magnitude of the standard deviation is similar around θ̂ and θ, despite the fact that the two points are some distance apart in parameter space.

One way to assess the likely accuracy of bootstrap estimates of variability is to follow Maloney (1990) and to examine the local flatness of the contours around θ̂, our best estimate of θ. If the contours are sufficiently flat, the variability estimates will be similar, assuming that the true θ is somewhere within this flat region. However, the process of local contour estimation is computationally expensive, since it requires a very large number of complete bootstrap runs for each data set.

A much quicker alternative is the following: Having obtained θ̂ and performed a bootstrap using ψ(x; θ̂) as the generating function, we move to eight different points φ₁, …, φ₈ in α–β space. Eight Monte Carlo simulations are performed, using φ₁, …, φ₈ as the generating functions, to explore the variability in those parts of the parameter space (only the generating parameters of the bootstrap are changed; x remains the same for all of them). If the contours of variability around θ̂ are sufficiently flat, as we hope they are, confidence intervals at φ₁, …, φ₈ should be of the same magnitude as those obtained at θ̂. Prudence should lead us to accept the largest of the nine confidence intervals obtained as our estimate of variability.

A decision has to be made as to which eight points in α–β space to use for the new set of simulations. Generally, provided that the psychometric function is at least reasonably well sampled, the contours vary smoothly in the immediate vicinity of θ̂, so that the precise placement of the sample points φ₁, …, φ₈ is not critical. One suggested and easy way to obtain a set of additional generating parameters is shown in Figure 1.

Figure 1 shows B = 2000 bootstrap parameter pairs as dark filled circles plotted in α–β space. Simulated data sets were generated from ψ_gen, with the Weibull as F and θ_gen = {10, 3, .5, .01}; sampling scheme s7 (triangles; see Figure 2) was used, and N was set to 480 (n_i = 80). The large central triangle at (10, 3) marks the generating parameter set; the solid and dashed line segments adjacent to the x- and y-axes mark the 68% and 95% confidence intervals for α and β, respectively.⁴ In the following, we shall use WCI to stand for width of confidence interval, with a subscript denoting its coverage percentage; that is, WCI68 denotes the width of the 68% confidence interval.⁵ The eight additional generating parameter pairs φ₁, …, φ₈ are marked by the light triangles. They form a rectangle whose sides have length WCI68 in α and β. Typically, this central rectangular region contains approximately 30%-40% of all α–β pairs and could thus be viewed as a crude joint 35% confidence region for α and β. A coverage percentage of 35% represents a sensible compromise between erroneously accepting the estimate around θ̂, very likely underestimating the true variability, and performing additional bootstrap replications too far in the periphery, where variability, and thus the estimated confidence intervals, become erroneously inflated owing to poor sampling. Recently, one of us (Hill, 2001b) performed Monte Carlo simulations to test the coverage of various bootstrap confidence-interval methods and found that the above method, based on 25% of points or more, was a good way of guaranteeing that both sides of a two-tailed confidence interval had at least the desired level of coverage.
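The placement of the eight additional generating points φ₁, …, φ₈ described above can be sketched as follows; this is a hypothetical helper of our own devising, placing the points at the corners and edge midpoints of a rectangle whose side lengths equal the WCI68 of each parameter:

```python
def sensitivity_points(alpha_hat, beta_hat, wci68_alpha, wci68_beta):
    """The eight generating points phi_1..phi_8: corners and edge midpoints
    of a rectangle around (alpha_hat, beta_hat) whose side lengths are the
    WCI68 of alpha and of beta."""
    da, db = wci68_alpha / 2.0, wci68_beta / 2.0
    return [(alpha_hat + i * da, beta_hat + j * db)
            for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0)]

def reported_interval(widths):
    """Prudent choice: the widest of the nine bootstrap confidence intervals
    (the one at theta_hat plus the eight at phi_1..phi_8)."""
    return max(widths)

print(sensitivity_points(10.0, 3.0, 1.0, 1.0))
```

A full sensitivity analysis would rerun the parametric bootstrap at each of the eight points (with x unchanged) and report `reported_interval` over the nine resulting interval widths.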

MONTE CARLO SIMULATIONS

In both our papers, we use only the specific case of the 2AFC paradigm in our examples; thus, γ is fixed at .5. In our simulations, where we must assume a distribution of true generating probabilities, we always use the Weibull function, in conjunction with the same fixed set of generating parameters θ_gen: α_gen = 10, β_gen = 3, γ_gen = .5, λ_gen = .01. In our investigation of the effects of sampling patterns, we shall always use K = 6 and n_i constant, that is, six blocks of trials with the same number of trials in each block. The number of observations per point n_i could be set to 20, 40, 80, or 160, and, with K = 6, this means that the total number of observations N could take the values 120, 240, 480, and 960.
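For concreteness, here is a minimal sketch (our own illustration, not the authors' code) of this generating model, ψ(x) = γ + (1 - γ - λ)F(x; α, β) with a Weibull F, together with simulated blocks of Bernoulli trials. The stimulus placements below are arbitrary and are not one of the paper's sampling schemes s1-s7:

```python
import math
import random

def psi(x, alpha=10.0, beta=3.0, gamma=0.5, lam=0.01):
    """2AFC Weibull psychometric function: gamma + (1 - gamma - lam) * F(x; alpha, beta)."""
    F = 1.0 - math.exp(-((x / alpha) ** beta))
    return gamma + (1.0 - gamma - lam) * F

def simulate_block(x, n_i, **params):
    """Number of correct responses in a block of n_i Bernoulli trials at intensity x."""
    p = psi(x, **params)
    return sum(random.random() < p for _ in range(n_i))

# One simulated data set: K = 6 blocks of n_i = 20 trials each (N = 120).
random.seed(0)
xs = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0]   # illustrative placements only
data = [(x, simulate_block(x, 20), 20) for x in xs]
print(data)
```

Each parametric bootstrap replicate in the paper's simulations is one such data set, generated from the fitted (or generating) parameters and refitted by maximum likelihood.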

We have introduced these limitations purely for the purposes of illustration, to keep our explanatory variables down to a manageable number. We have found in many other simulations that, in most cases, our conclusions hold without loss of generality.

Confidence intervals of parameters, thresholds, and slopes that we report in the following were always obtained using a method called bias-corrected and accelerated (BCa) confidence intervals, which we describe later, in our Bootstrap Confidence Intervals section.

The Effects of Sampling Schemes and Number of Trials

One of our aims in this study was to examine the effect of N and of one's choice of sample points x on both the size of one's confidence intervals for ϑ̂ and their sensitivity to errors in θ̂.

Seven different sampling schemes were used, each dictating a different distribution of datapoints along the stimulus axis; they are the same schemes as those used and described in Wichmann and Hill (2001), and they are shown in Figure 2. Each horizontal chain of symbols represents one of the schemes, marking the stimulus values at which the six sample points are placed. The different symbol shapes will be used to identify the sampling schemes in our results plots. To provide a frame of reference, the solid curve shows the psychometric function used, that is, 0.5 + 0.5F(x; α_gen, β_gen), with the 55%, 75%, and 95% performance levels marked by dotted lines.

As we shall see, even for a fixed number of sample points and a fixed number of trials per point, biases in parameter estimation and goodness-of-fit assessment (companion paper), as well as the width of confidence intervals (this paper), all depend markedly on the distribution of stimulus values x.

Monte Carlo data sets were generated using the seven sampling schemes shown in Figure 2, with the generating parameters θ_gen, as well as N and K, as specified above. A maximum-likelihood fit was performed on each simulated data set to obtain bootstrap parameter vectors θ̂*, from which we subsequently derived the x values corresponding to threshold_0.5 and threshold_0.8, as well as the slope. For each sampling scheme and value of N, a total of nine simulations were performed: one at θ_gen and eight more at the points φ₁, …, φ₈, as specified in our section on the bootstrap bridging assumption. Thus, each of our 28 conditions (7 sampling schemes × 4 values of N) required 9 × 1,000 simulated data sets, for a total of 252,000 simulations, or 1.134 × 10⁸ simulated 2AFC trials.

Figures 3, 4, and 5 show the results of the simulations dealing with slope_0.5, threshold_0.5, and threshold_0.8, respectively. The top-left panel (A) of each figure plots the WCI68 of the estimate under consideration as a function of N. Data for all seven sampling schemes are shown, using their respective symbols. The top-right panel (B) plots, as a function of N, the maximal elevation of the WCI68 encountered in the vicinity of θ_gen, that is, max{WCI_φ₁/WCI_θgen, …, WCI_φ₈/WCI_θgen}. The elevation factor is an indication of the sensitivity of our variability estimates to errors in the estimation of θ; the closer it comes to 1.0, the better. The bottom-left panel (C) plots the largest of the WCI68's measured at θ_gen and φ₁, …, φ₈ for a given sampling scheme and N; it is thus the product of the elevation factor plotted in panel B and its respective WCI68 in panel A. This quantity we denote as MWCI68, standing for maximum width of the 68% confidence interval; this is the confidence interval we suggest experimenters should use when they report error bounds. The bottom-right panel (D), finally, shows MWCI68 multiplied by a sweat factor, √(N/120). Ideally, that is, for very large K and N, confidence intervals should be inversely proportional to √N, and multiplication by our sweat factor should remove this dependency. Any change in the sweat-adjusted MWCI68 as a function of N might thus be taken as an indicator of the degree to which sampling scheme, N, and true confidence-interval width interact in a way not predicted by asymptotic theory.

Figure 1. B = 2000 data sets were generated from a two-alternative forced-choice Weibull psychometric function with parameter vector θ_gen = {10, 3, .5, .01} and then fit using our maximum-likelihood procedure, resulting in 2000 estimated parameter pairs (α̂, β̂), shown as dark circles in α–β parameter space. The location of the generating α and β (10, 3) is marked by the large triangle in the center of the plot. The sampling scheme s7 was used to generate the data sets (see Figure 2 for details), with N = 480. Solid lines mark the 68% confidence-interval width (WCI68), separately for α and β; broken lines mark the 95% confidence intervals. The light small triangles show the α–β parameter sets φ₁, …, φ₈ from which each bootstrap is repeated during sensitivity analysis, while keeping the x values of the sampling scheme unchanged.
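The sweat-factor adjustment shown in panel D amounts to a one-line scaling (the function name is our own):

```python
import math

def sweat_adjusted(mwci68, n, n_ref=120):
    """Scale MWCI68 by sqrt(N / 120); a curve that is flat across N then
    indicates the ideal 1/sqrt(N) shrinkage of confidence intervals."""
    return mwci68 * math.sqrt(n / n_ref)

print(sweat_adjusted(0.1, 480))
```

If quadrupling N from 120 to 480 halves the raw MWCI68, the adjusted values at the two N's coincide, which is the flat behavior asymptotic theory predicts.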

Figure 3A shows the WCI68 around the median estimated slope. Clearly, the different sampling schemes have a profound effect on the magnitude of the confidence intervals for slope estimates. For example, in order to ensure that the WCI68 is approximately 0.06, one requires nearly 960 trials if using sampling scheme s1 or s4. Sampling schemes s3, s5, or s7, on the other hand, require only around 200 trials to achieve a similar confidence-interval width. The important difference that makes, for example, s3, s5, and s7 more efficient than s1 and s4 is the presence of samples at high predicted performance values (p ≥ .9), where binomial variability is low and thus the data constrain our maximum-likelihood fit more tightly. Figure 3B illustrates the complex interactions among the different sampling schemes, N, and the stability of the bootstrap estimates of variability, as indicated by the local flatness of the contours around θ_gen. A perfectly flat local contour would result in the horizontal line at 1. Sampling scheme s6 is well behaved for N ≥ 480, its maximal elevation being around 1.7. For N ≤ 240, however, elevation rises to nearly 2.5. Other schemes, like s1, s4, or s7, never rise above an elevation of 2, regardless of N. It is important to note that the magnitude of WCI68 at θ_gen is by no means a good predictor of the stability of the estimate, as indicated by the sensitivity factor. Figure 3C shows the MWCI68, and here the differences in efficiency between sampling schemes are even more apparent than in Figure 3A. Sampling schemes s3, s5, and s7 are clearly superior to all other sampling schemes, particularly for N ≤ 480; these three sampling schemes are the ones that include two sampling points at p ≥ .92.

Figure 4, showing the equivalent data for the estimates of threshold_0.5, also illustrates that some sampling schemes make much more efficient use of the experimenter's time by providing MWCI68s that are more compact than others, by a factor of 3.2.

Two aspects of the data shown in Figures 4B and 4C are important. First, the sampling schemes fall into two distinct classes. Five of the seven sampling schemes are almost ideally well behaved, with elevations barely exceeding 1.5, even for small N (and MWCI68 ≤ 3); the remaining two, on the other hand, behave poorly, with elevations in excess of 3 for N ≤ 240 (MWCI68 ≥ 5). The two unstable sampling schemes, s1 and s4, are those that do not include at least a single sample point at p ≥ .95 (see Figure 2). It thus appears crucial to include at least one sample point at p ≥ .95 to make one's estimates of variability robust to small errors in θ̂, even if the threshold of interest, as in our example, has a p of only approximately .75. Second, both s1 and s4 are prone to lead to a serious underestimation of the true width of the confidence interval

Figure 2. A two-alternative forced-choice Weibull psychometric function with parameter vector θ = (10, 3, .5, 0), on semilogarithmic coordinates. The rows of symbols below the curve mark the x values of the seven different sampling schemes, s1–s7, used throughout the remainder of the paper.

1320 WICHMANN AND HILL

if the sensitivity analysis (or bootstrap bridging assumption test) is not carried out: the WCI68 is unstable in the vicinity of θgen, even though the WCI68 is small at θgen itself.

Figure 5 is similar to Figure 4, except that it shows the WCI68 around threshold0.8. The trends found in Figure 4 are even more exaggerated here. The sampling schemes without high sampling points (s1 and s4) are unstable to the point of being meaningless for N ≤ 480. In addition, their WCI68 is inflated relative to that of the other sampling schemes, even at θgen.

The results of the Monte Carlo simulations are summarized in Table 1. The columns of Table 1 correspond to

Figure 3. (A) The width of the 68% BCa confidence interval (WCI68) of the slopes of B = 999 psychometric functions fitted to parametric bootstrap data sets generated from θgen = (10, 3, .5, .01), as a function of the total number of observations N. (B) The maximal elevation of the WCI68 in the vicinity of θgen, again as a function of N (see the text for details). (C) The maximal width of the 68% BCa confidence interval (MWCI68) in the vicinity of θgen, that is, the product of the respective WCI68 and elevation factors of panels A and B. (D) The MWCI68 scaled by the "sweat factor," sqrt(N/120), the ideally expected decrease in confidence interval width with increasing N. The seven symbols denote the seven sampling schemes, as in Figure 2.

THE PSYCHOMETRIC FUNCTION II 1321

the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold0.8, and the following four those for the slope0.5. The scheme with the lowest MWCI68 in each row is given the score of 100; the others on the same row are given proportionally higher scores to indicate their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.

An inspection of Table 1 reveals that the sampling schemes fall into four categories. By a long way the worst are sampling schemes s1 and s4, with mean and median MWCI68 above 210. Already somewhat superior is s2, particularly for small N, with a median MWCI68 of 180 and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of

Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(.5) (approximately equal to 75% correct during two-alternative forced choice).


around 130. Each of these has at least one sample at p ≥ .95 and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6: s6 is identical to scheme s1, except that one sample point was moved from .75 to .95. Still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120, that is, very nearly optimal6 on most estimates. Both of these sampling schemes, s3 and s7, have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses, which could potentially bias the estimates if we were to fix the upper asymptote of the psychometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(.8) (approximately equal to 90% correct during two-alternative forced choice).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold0.5 in order to obtain tight confidence intervals for threshold0.5), because estimation is done via the whole psychometric function, which in turn is estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
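For concreteness, the placement question can be made explicit by inverting the generating function of Figure 2: given θgen = (10, 3, .5, 0), one can ask which stimulus intensities correspond to the predicted performance levels discussed above. A sketch (the function names are ours):

```python
import math

# 2AFC Weibull psychometric function with theta_gen = (10, 3, .5, 0) from Figure 2:
# psi(x) = gamma + (1 - gamma - lambda) * F(x), with F(x) = 1 - exp(-(x/alpha)^beta).
ALPHA, BETA, GAMMA, LAM = 10.0, 3.0, 0.5, 0.0

def psi(x):
    return GAMMA + (1.0 - GAMMA - LAM) * (1.0 - math.exp(-((x / ALPHA) ** BETA)))

def psi_inverse(p):
    """Stimulus intensity at which predicted performance equals p."""
    F = (p - GAMMA) / (1.0 - GAMMA - LAM)
    return ALPHA * (-math.log(1.0 - F)) ** (1.0 / BETA)

for p in (0.75, 0.90, 0.95):
    print(f"psi(x) = {p:.2f}  at  x = {psi_inverse(p):.2f}")
```

Running this shows how far above the .75 threshold the high-performance samples must be placed on the stimulus axis.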

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting: we used Bayesian priors to limit α to between 0 and 25 and β to between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable, given the experimental context (cf. Figure 2); it is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes7 but would not change them qualitatively.

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) have only a minor influence on the estimates of variability. The importance of this cannot be overstated, since a strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question: as experimenters, we do not know, and never will, the true underlying distribution function (or objective function) from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψW(x; θW), using the Weibull as F, with θW = (10, 3, .5, .01) (our "standard" generating function ψgen); (2) ψCG(x; θCG), using the cumulative Gaussian, with θCG = (8.875, 3.278, .5, .01); (3) ψL(x; θL), using the logistic, with θL = (8.957, 2.014, .5, .01); and, finally, (4) ψG(x; θG), using the Gumbel, with θG = (10.022, 2.906, .5, .01). For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that at most one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.8 Note that this is not trivially true: although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated
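The near-equivalence of the four functions in Figure 6A is easy to verify numerically. The sketch below wraps each F in the 2AFC form ψ = .5 + (1 − .5 − .01)·F, using the parameter values quoted above, and reports the largest disagreement in predicted proportion correct over the plotted stimulus range (our choice of [2, 14] follows the figure axes):

```python
import math
from statistics import NormalDist

# The four distribution functions F of Figure 6A, with the (alpha, beta)
# pairs quoted in the text.
GAMMA, LAM = 0.5, 0.01
SCALE = 1.0 - GAMMA - LAM

def weibull(x):            # theta_W = (10, 3)
    return 1.0 - math.exp(-((x / 10.0) ** 3.0))

def cum_gauss(x):          # theta_CG = (8.875, 3.278)
    return NormalDist(8.875, 3.278).cdf(x)

def logistic(x):           # theta_L = (8.957, 2.014)
    return 1.0 / (1.0 + math.exp(-(x - 8.957) / 2.014))

def gumbel(x):             # theta_G = (10.022, 2.906)
    return 1.0 - math.exp(-math.exp((x - 10.022) / 2.906))

def psi(F, x):
    return GAMMA + SCALE * F(x)

FS = (weibull, cum_gauss, logistic, gumbel)
xs = [2.0 + 0.1 * i for i in range(121)]          # x in [2, 14]
max_diff = max(max(psi(F, x) for F in FS) - min(psi(F, x) for F in FS)
               for x in xs)
print(f"largest difference in predicted proportion correct: {max_diff:.3f}")
```

On the midrange of the stimulus axis, the four predicted proportions agree to well within typical binomial measurement noise; the largest differences occur toward the tails.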

Table 1
Summary of Results of Monte Carlo Simulations

                                  N      s1     s2     s3     s4     s5     s6     s7
MWCI68 at x = F⁻¹(.5)            120    331    135    143    369    162    100    110
                                 240    209    163    126    339    133    100    113
                                 480    156    179    143    193    158    100    155
                                 960    124    141    140    138    152    100    119
MWCI68 at x = F⁻¹(.8)            120   5725    428    100   1761    180    263    115
                                 240   1388    205    100   1257    110    125    109
                                 480    385    181    115    478    100    116    137
                                 960    215    149    100    291    102    119    101
MWCI68 of dF/dx at x = F⁻¹(.5)   120    212    305    119    321    139    217    100
                                 240    228    263    100    254    128    185    121
                                 480    186    213    119    217    100    148    134
                                 960    186    175    113    201    100    151    118
Mean                                    779    211    118    485    130    140    119
Standard deviation                     1595     85     17    499     28     49     16
Median                                  214    180    117    306    131    122    117

Note. Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold0.8 and slope0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θgen, as sampled at the points θgen and φ1 … φ8. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψgen, using sampling scheme s2 with N = 120.

Slope, threshold0.8, and threshold0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.
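The simulation machinery itself is straightforward to sketch. The toy below runs a parametric bootstrap for a 2AFC Weibull: binomial data are generated from θgen = (10, 3, .5, .01), refitted by maximum likelihood, and a 68% interval for threshold0.5 is read off the bootstrap distribution. It is a deliberately crude stand-in for the machinery described in the text: the fit is a grid search with γ and λ held fixed, the sampling scheme and trial numbers are our own choices, and plain percentiles are used rather than BCa intervals.

```python
import math
import random

GAMMA, LAM = 0.5, 0.01
ALPHA_GEN, BETA_GEN = 10.0, 3.0

def psi(x, alpha, beta):
    F = 1.0 - math.exp(-((x / alpha) ** beta))
    return GAMMA + (1.0 - GAMMA - LAM) * F

def neg_log_likelihood(data, alpha, beta):
    nll = 0.0
    for x, k, n in data:
        p = min(max(psi(x, alpha, beta), 1e-9), 1.0 - 1e-9)
        nll -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return nll

def fit(data):
    """Crude maximum-likelihood fit by grid search over (alpha, beta)."""
    grid_a = [6.0 + 0.2 * i for i in range(41)]     # alpha in [6, 14]
    grid_b = [1.0 + 0.2 * i for i in range(31)]     # beta in [1, 7]
    return min(((a, b) for a in grid_a for b in grid_b),
               key=lambda ab: neg_log_likelihood(data, *ab))

def threshold(alpha, beta):
    """Stimulus level at which the underlying F reaches .5 (threshold_0.5)."""
    return alpha * math.log(2.0) ** (1.0 / beta)

random.seed(1)
xs = [7.0, 8.5, 9.5, 10.5, 11.5, 13.0]   # a hypothetical sampling scheme
n_per_point = 40                          # N = 240 observations in total
B = 100                                   # bootstrap repeats (small, for speed)

boot_thresholds = []
for _ in range(B):
    data = [(x,
             sum(random.random() < psi(x, ALPHA_GEN, BETA_GEN)
                 for _ in range(n_per_point)),
             n_per_point) for x in xs]
    a_hat, b_hat = fit(data)
    boot_thresholds.append(threshold(a_hat, b_hat))

boot_thresholds.sort()
lo, hi = boot_thresholds[int(0.16 * B)], boot_thresholds[int(0.84 * B)]
print(f"68% percentile interval for threshold_0.5: [{lo:.2f}, {hi:.2f}]")
```

The spread of the refitted thresholds is the quantity whose width (WCI68) and sensitivity to the generating parameters the text examines.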

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

ψ = ¼ [ψW + ψCG + ψL + ψG]

as the generating function, and fitted psychometric functions, using each of the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution function F, to each data set. From the fitted psychometric functions, we obtained estimates of threshold0.2, threshold0.5, threshold0.8, and slope0.5, as described previously. All four different values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 values of N). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,9 for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10⁹ simulated 2AFC trials).

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel). See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull, logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A, using sampling scheme s2 with N = 120. See the text for details.

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling scheme (s1 to s7), and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold0.2, threshold0.5, threshold0.8, and slope0.5), not only were the first two factors, the number of trials N and the sampling scheme, significant (p < .0001), as was expected, but so were the distribution function F and all possible interactions: the three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates. Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold0.5 and slope0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.10 The choice of distribution function F does not have, despite being a significant factor, a large effect on the WCI68 for threshold0.5 and slope0.5.

The same is not true, however, for the WCI68 of threshold0.2. Here, the choice of F has an undesirably large effect on the bootstrap estimate of the WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically. Each of the four panels of Figure 7 plots the WCI68 (normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme as a function of the distribution function F.

For threshold0.5 and slope0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold0.2, the choice of F has a profound influence on the WCI68 (e.g., in Figure 7A, there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold0.8. Figure 7C shows the (again, undesirable) interaction between sampling scheme and choice of F: WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles), the choice of F has a marked influence on the WCI68. It was generally the case for threshold0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose as threshold and slope the measures corresponding to threshold0.5 and slope0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like them to be. Second, away from the midpoint of F, estimates of variability are not as independent of the distribution function chosen as one might wish; in particular, for lower proportions correct (threshold0.2 is much more affected by the choice of F than threshold0.8; cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions, for example, 60% (or 90%) correct in 2AFC, and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions merely owing to the (arbitrary) choice of a distribution function that happens to yield comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                          Threshold0.2  Threshold0.5  Threshold0.8  Slope0.5
Number of trials N                                 76.24        87.30        65.14        72.86
Sampling schemes s1 … s7                            6.60        10.00        19.93        19.69
Distribution function F                            11.36         0.13         3.87         0.92
Error (numerical variability in bootstrap)          0.34         0.35         0.46         0.34
Interaction of N and sampling schemes s1 … s7       1.18         0.98         5.97         3.50
Sum of interactions involving F                     4.28         1.24         4.63         2.69
Percentage of SS accounted for without F           84.36        98.63        91.50        96.39

Note. The columns refer to threshold0.2, threshold0.5, threshold0.8, and slope0.5, that is, to approximately 60%, 75%, and 90% correct, and the slope at 75% correct, during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.


Figure 7. The four panels show the width of the WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of the WCI68 for a given sampling scheme and N. Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold0.2. (B) WCI68 for threshold0.5. (C) WCI68 for threshold0.8. (D) WCI68 for slope0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates11 of the variability of a distribution J. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate s; Maloney (1990), in addition to s, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as s) are not robust: they are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers, and the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity J might be seriously in error if only a single bootstrap estimate Ji is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.12
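The breakdown argument is easy to demonstrate: replacing a single value in a set of (hypothetical) bootstrap threshold estimates with one rogue value, of the kind an optimizer stuck in a local minimum might produce, moves the mean arbitrarily far while leaving the median essentially untouched.

```python
import statistics

# Ten hypothetical bootstrap threshold estimates, then the same set with
# one rogue replicate (e.g., a failed maximum-likelihood search).
clean = [8.7, 8.9, 9.0, 9.1, 9.2, 8.8, 9.3, 9.0, 8.9, 9.1]
corrupted = clean[:-1] + [900.0]

print("mean:  ", statistics.mean(clean), "->", statistics.mean(corrupted))
print("median:", statistics.median(clean), "->", statistics.median(corrupted))
```

A single bad replicate out of ten is enough to render the moment-based estimate useless, while the median barely notices.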

Nonparametric plug-in estimates are also not without problems: percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12–14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3; as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distribution of bootstrap estimates θ* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates J* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias particularly raised problems for the distribution of the Weibull parameter β (N = 210, K = 7). We also found in our simulations that β, and thus the slopes s, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that Φ̂ = m(θ̂) and Φ = m(θ), resulting in

Φ̂ ~ N(Φ − z₀σΦ, σΦ²),   (2)

where

σΦ = σΦ₀ + a(Φ − Φ₀),   (3)

and Φ₀ is any reference point on the scale of Φ values. In Equation 2, z₀ is the bias-correction term, and a in Equation 3 is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an e-level confidence interval endpoint of the BCa interval can be calculated as

θ̂BCa[e] = G⁻¹( CG( z₀ + (z₀ + z(e)) / (1 − a(z₀ + z(e))) ) ),   (4)

where CG is the cumulative Gaussian distribution function, G⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ*, z(e) = CG⁻¹(e), z₀ is our estimate of the bias, and a is our estimate of the acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).
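Equation 4 translates almost line for line into code. The sketch below computes BCa endpoints from a list of bootstrap replications; the bias term z₀ is estimated, as is standard, from the fraction of replications falling below the original estimate, while the acceleration a is passed in as a plain number (in practice it would be estimated, e.g., by the jackknife; see Efron & Tibshirani, 1993). The function and variable names are ours.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Gaussian: Z.cdf is CG, Z.inv_cdf is CG^-1

def bca_endpoint(boot, theta_hat, e, a=0.0):
    """e-level BCa endpoint (Equation 4) from bootstrap replications."""
    boot = sorted(boot)
    # Bias-correction term z0 from the proportion of replications below theta_hat.
    prop = sum(b < theta_hat for b in boot) / len(boot)
    z0 = Z.inv_cdf(min(max(prop, 1e-6), 1.0 - 1e-6))
    z_e = Z.inv_cdf(e)
    # Corrected percentile, then read off the inverse bootstrap CDF (G^-1).
    e_corr = Z.cdf(z0 + (z0 + z_e) / (1.0 - a * (z0 + z_e)))
    idx = min(int(e_corr * len(boot)), len(boot) - 1)
    return boot[idx]

# Toy replications: a nearly symmetric, unbiased bootstrap distribution,
# for which BCa (z0 ~ 0, a = 0) reduces to the ordinary percentile method.
boot = [i / 100.0 for i in range(1, 200)]   # replications 0.01 .. 1.99
lo = bca_endpoint(boot, 1.0, 0.16)
hi = bca_endpoint(boot, 1.0, 0.84)
print(f"68% BCa interval: [{lo:.2f}, {hi:.2f}]")
```

With a skewed or biased set of replications, z₀ and a shift the percentiles asymmetrically, which is exactly the correction the text motivates.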

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an e-level confidence interval endpoint is simply G⁻¹(e) (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equal in performance to standard percentiles, not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and it was the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for the sensitivity analysis described above.

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as the thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of data points typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if and only if we choose as threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and, third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at Computers in Psychology, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad. Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference on Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or of the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions fitted to data from observers who occasionally lapse (that is, who display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100·nth percentile of the bootstrap distribution α*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ± 1 standard deviation of a Gaussian, 68% was chosen.

6 Optimal here of course means relative to the sampling schemesexplored

7 In our initial simulations we did not constrain a and b other than tolimit them to be positive (the Weibull function is not defined for negativeparameters) Sampling schemes s1 and s4 were even worse and s3 and s7were relative to the other schemes even more superior under noncon-strained fitting conditions The only substantial difference was perfor-mance of sampling scheme s5 It does very well now but its slope esti-mates were unstable during sensitivity analysis of b was unconstrainedThis is a good example of the importance of (sensible) Bayesian assump-tions constraining the fit We wish to thank Stan Klein for encouraging usto redo our simulations with a and b constrained

8. Of course, there might be situations where one psychometric function, using a particular distribution function, provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using BCa confidence intervals, described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal ψ values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity J, which is derived from a probability distribution F by J = t(F), is to obtain an estimate F̂ from empirical data and then use Ĵ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)



(companion paper), as well as the width of confidence intervals (this paper), all depend markedly on the distribution of stimulus values x.

Monte Carlo data sets were generated using our seven sampling schemes shown in Figure 2, using the generation parameters θgen as well as N and K as specified above. A maximum-likelihood fit was performed on each simulated data set to obtain bootstrap parameter vectors θ*, from which we subsequently derived the x values corresponding to threshold0.5 and threshold0.8, as well as to the slope. For each sampling scheme and value of N, a total of nine simulations were performed: one at θgen and eight more at points φ1, ..., φ8, as specified in our section on the bootstrap bridging assumption. Thus, each of our 28 conditions (7 sampling schemes × 4 values of N) required 9 × 1,000 simulated data sets, for a total of 252,000 simulations, or 1.134 × 10^8 simulated 2AFC trials.
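The generate-and-refit loop can be sketched in a few lines. The code below is an illustration only: the x values, the even split of N across K = 6 blocks, and the coarse grid search (standing in for a proper maximum-likelihood optimizer) are our simplifications, not the actual sampling schemes or fitting machinery used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def weibull_2afc(x, alpha, beta, lam=0.01):
    # psi(x) = gamma + (1 - gamma - lambda) * F(x; alpha, beta), gamma = .5 for 2AFC
    F = 1.0 - np.exp(-(x / alpha) ** beta)
    return 0.5 + (0.5 - lam) * F

# Illustrative sampling scheme: K = 6 stimulus levels, N = 480 trials split evenly
x = np.array([4.0, 6.0, 8.0, 10.0, 12.0, 14.0])
n = np.full(6, 80)
p_gen = weibull_2afc(x, 10.0, 3.0)   # theta_gen = {10, 3, .5, .01}

def fit_ml(k, n, x):
    """Maximum-likelihood fit by brute-force grid search over (alpha, beta);
    a stand-in for a real optimizer, good enough for illustration."""
    best, argbest = -np.inf, None
    for a in np.linspace(5, 15, 101):
        for b in np.linspace(0.5, 8, 76):
            p = np.clip(weibull_2afc(x, a, b), 1e-9, 1 - 1e-9)
            ll = np.sum(k * np.log(p) + (n - k) * np.log(1 - p))  # binomial log-likelihood
            if ll > best:
                best, argbest = ll, (a, b)
    return argbest

# One parametric bootstrap repeat: simulate binomial data at psi(x), then refit
k_star = rng.binomial(n, p_gen)
alpha_star, beta_star = fit_ml(k_star, n, x)
```

Repeating the last two lines B times yields the bootstrap distributions of α* and β* from which thresholds and slopes are derived.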

Figures 3, 4, and 5 show the results of the simulations dealing with slope0.5, threshold0.5, and threshold0.8, respectively. The top-left panel (A) of each figure plots the WCI68 of the estimate under consideration as a function of N; data for all seven sampling schemes are shown, using their respective symbols. The top-right panel (B) plots, as a function of N, the maximal elevation of the WCI68 encountered in the vicinity of θgen, that is, max{WCI68(φ1)/WCI68(θgen), ..., WCI68(φ8)/WCI68(θgen)}. The elevation factor is an indication of the sensitivity of our variability estimates to errors in the estimation of θ; the closer it comes to 1.0, the better. The bottom-left panel (C) plots the largest of the WCI68 measurements at θgen and φ1, ..., φ8 for a given sampling scheme and N; it is thus the product of the elevation factor plotted in panel B and its respective WCI68 in panel A. This quantity we denote as MWCI68, standing for maximum width of the

Figure 1. B = 2,000 data sets were generated from a two-alternative forced-choice Weibull psychometric function with parameter vector θgen = {10, 3, .5, .01} and then fit using our maximum-likelihood procedure, resulting in 2,000 estimated parameter pairs (α, β), shown as dark circles in α-β parameter space. The location of the generating α and β (10, 3) is marked by the large triangle in the center of the plot. The sampling scheme s7 was used to generate the data sets (see Figure 2 for details), with N = 480. Solid lines mark the 68% confidence interval width (WCI68), separately for α and β; broken lines mark the 95% confidence intervals. The light small triangles show the α-β parameter sets φ1, ..., φ8, from which each bootstrap is repeated during sensitivity analysis while keeping the x-values of the sampling scheme unchanged.


68% confidence interval; this is the confidence interval we suggest experimenters should use when they report error bounds. The bottom-right panel (D), finally, shows MWCI68 multiplied by a sweat factor, √(N/120). Ideally, that is, for very large K and N, confidence intervals should be inversely proportional to √N, and multiplication by our sweat factor should remove this dependency. Any change in the sweat-adjusted MWCI68 as a function of N might thus be taken as an indicator of the degree to which sampling scheme, N, and true confidence interval width interact in a way not predicted by asymptotic theory.
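In code, the three derived quantities reduce to a few lines; the WCI68 values below are invented purely to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical WCI68 values for one sampling scheme (numbers invented for
# illustration): at theta_gen and at the eight perturbed points phi_1 ... phi_8.
N = 480
wci_gen = 0.061
wci_phi = np.array([0.064, 0.059, 0.071, 0.066, 0.062, 0.068, 0.060, 0.065])

elevation = np.max(wci_phi / wci_gen)       # panel B; the closer to 1.0, the better
mwci68 = max(wci_gen, wci_phi.max())        # panel C: worst case in the vicinity
                                            # (= elevation * wci_gen when elevation >= 1)
sweat_adjusted = mwci68 * np.sqrt(N / 120)  # panel D: removes the ideal 1/sqrt(N) trend
```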

Figure 3A shows the WCI68 around the median estimated slope. Clearly, the different sampling schemes have a profound effect on the magnitude of the confidence intervals for slope estimates. For example, in order to ensure that the WCI68 is approximately 0.06, one requires nearly 960 trials if using sampling scheme s1 or s4. Sampling schemes s3, s5, or s7, on the other hand, require only around 200 trials to achieve a similar confidence interval width. The important difference that makes, for example, s3, s5, and s7 more efficient than s1 and s4 is the presence of samples at high predicted performance values (p ≥ .9), where binomial variability is low and thus the data constrain our maximum-likelihood fit more tightly. Figure 3B illustrates the complex interactions between different sampling schemes, N, and the stability of the bootstrap estimates of variability, as indicated by the local flatness of the contours around θgen. A perfectly flat local contour would result in a horizontal line at 1. Sampling scheme s6 is well behaved for N ≥ 480, its maximal elevation being around 1.7. For N ≤ 240, however, elevation rises to near

2.5. Other schemes, like s1, s4, or s7, never rise above an elevation of 2, regardless of N. It is important to note that the magnitude of WCI68 at θgen is by no means a good predictor of the stability of the estimate, as indicated by the sensitivity factor. Figure 3C shows the MWCI68, and here the differences in efficiency between sampling schemes are even more apparent than in Figure 3A. Sampling schemes s3, s5, and s7 are clearly superior to all other sampling schemes, particularly for N ≤ 480; these three sampling schemes are the ones that include two sampling points at p ≥ .92.

Figure 4, showing the equivalent data for the estimates of threshold0.5, also illustrates that some sampling schemes make much more efficient use of the experimenter's time, by providing MWCI68s that are more compact than others by a factor of 3.2.

Two aspects of the data shown in Figures 4B and 4C are important. First, the sampling schemes fall into two distinct classes. Five of the seven sampling schemes are almost ideally well behaved, with elevations barely exceeding 1.5 even for small N (and MWCI68 < 3); the remaining two, on the other hand, behave poorly, with elevations in excess of 3 for N ≤ 240 (MWCI68 > 5). The two unstable sampling schemes, s1 and s4, are those that do not include at least a single sample point at p ≥ .95 (see Figure 2). It thus appears crucial to include at least one sample point at p ≥ .95 to make one's estimates of variability robust to small errors in θ, even if the threshold of interest, as in our example, has a p of only approximately .75. Second, both s1 and s4 are prone to lead to a serious underestimation of the true width of the confidence interval

Figure 2. A two-alternative forced-choice Weibull psychometric function with parameter vector θ = {10, 3, .5, 0}, on semilogarithmic coordinates. The rows of symbols below the curve mark the x values of the seven different sampling schemes, s1-s7, used throughout the remainder of the paper.


if the sensitivity analysis (or bootstrap bridging assumption test) is not carried out: The WCI68 is unstable in the vicinity of θgen, even though the WCI68 is small at θgen.

Figure 5 is similar to Figure 4, except that it shows the WCI68 around threshold0.8. The trends found in Figure 4 are even more exaggerated here. The sampling schemes

without high sampling points (s1, s4) are unstable to the point of being meaningless for N ≤ 480. In addition, their WCI68 is inflated relative to that of the other sampling schemes, even at θgen.

The results of the Monte Carlo simulations are summarized in Table 1. The columns of Table 1 correspond to

Figure 3. (A) The width of the 68% BCa confidence interval (WCI68) of slopes of B = 999 psychometric functions fitted to parametric bootstrap data sets generated from θgen = {10, 3, .5, .01}, as a function of the total number of observations N. (B) The maximal elevation of the WCI68 in the vicinity of θgen, again as a function of N (see the text for details). (C) The maximal width of the 68% BCa confidence interval (MWCI68) in the vicinity of θgen, that is, the product of the respective WCI68 and elevation factors of panels A and B. (D) MWCI68 scaled by the sweat factor √(N/120), the ideally expected decrease in confidence interval width with increasing N. The seven symbols denote the seven sampling schemes, as in Figure 2.


the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold0.8, and the following four those for the slope0.5. The scheme with the lowest MWCI68 in each row is given the score of 100%; the others on the same row are given proportionally higher scores, to indicate their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.

An inspection of Table 1 reveals that the sampling schemes fall into four categories. By a long way the worst are sampling schemes s1 and s4, with mean and median MWCI68 greater than 210%. Already somewhat superior is s2, particularly for small N, with a median MWCI68 of 180% and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of

Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F^-1(.5) (approximately equal to 75% correct during two-alternative forced-choice).


around 130%. Each of these has at least one sample at p ≥ .95 and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6. Comparing s1 and s6: s6 is identical to scheme s1, except that one sample point was moved from .75 to .95. Still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120%, very nearly optimal6 on most estimates. Both of these sampling schemes, s3 and s7,

have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses that could potentially bias the estimates if we were to fix the upper asymptote of the psy-

Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F^-1(.8) (approximately equal to 90% correct during two-alternative forced-choice).


chometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold0.5, in order to obtain tight confidence intervals for threshold0.5), because estimation is done via the whole psychometric function, which in turn is estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
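The variance argument is quick to check: the binomial standard error of an observed proportion, √(p(1 − p)/n), is roughly halved at p = .95 relative to p = .75.

```python
import numpy as np

n = 100                     # trials at one stimulus level
for p in (0.75, 0.95):      # mid-range vs. high predicted performance
    se = np.sqrt(p * (1 - p) / n)
    print(f"p = {p:.2f}: binomial SE of observed proportion = {se:.4f}")
# SE roughly halves: 0.0433 at p = .75 vs. 0.0218 at p = .95
```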

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting: We used Bayesian priors to limit α to be between 0 and 25 and β to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable, given the experimental context (cf. Figure 2); it is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes7 but would not change them qualitatively.

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) has only a minor influence on the estimates of variability. The importance of this cannot be overestimated, since a strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question: As experimenters, we do not know, and never will, the true underlying distribution function or objective function from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψW(x; θW), using the Weibull as F and θW = {10, 3, .5, .01} (our "standard" generating function ψgen); (2) ψCG(x; θCG), using the cumulative Gaussian with θCG = {8.875, 3.278, .5, .01}; (3) ψL(x; θL), using the logistic with θL = {8.957, 2.014, .5, .01}; and, finally, (4) ψG(x; θG), using the Gumbel and θG = {10.022, 2.906, .5, .01}. For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that at most one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.8 Note that this is not trivially true: Although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated

Table 1
Summary of Results of Monte Carlo Simulations

                               N      s1     s2     s3     s4     s5     s6     s7
MWCI68 at x = F^-1(.5)       120     331    135    143    369    162    100    110
                             240     209    163    126    339    133    100    113
                             480     156    179    143    193    158    100    155
                             960     124    141    140    138    152    100    119
MWCI68 at x = F^-1(.8)       120    5725    428    100   1761    180    263    115
                             240    1388    205    100   1257    110    125    109
                             480     385    181    115    478    100    116    137
                             960     215    149    100    291    102    119    101
MWCI68 of dF/dx              120     212    305    119    321    139    217    100
  at x = F^-1(.5)            240     228    263    100    254    128    185    121
                             480     186    213    119    217    100    148    134
                             960     186    175    113    201    100    151    118
Mean                                 779    211    118    485    130    140    119
Standard deviation                  1595     85     17    499     28     49     16
Median                               214    180    117    306    131    122    117

Note. Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold0.8 and slope0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θgen, as sampled at the points θgen and φ1, ..., φ8. The MWCI68 values are expressed in percentage relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψgen, using sampling scheme s2 with N = 120.

Slope, threshold0.8, and threshold0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.
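The near-coincidence of the four functions in Figure 6A is easy to verify numerically. The sketch below assumes the common parameterizations of the four distribution functions and the 2AFC form ψ(x) = .5 + (.5 − λ)F(x) with λ = .01; these parameterizations are our assumption, since Equation 1 and the exact forms of F are given in the companion paper.

```python
import numpy as np
from statistics import NormalDist

def psi(F):
    # 2AFC: psi(x) = .5 + (.5 - lam) * F(x), with lam = .01
    return 0.5 + 0.49 * F

x = np.linspace(4.0, 14.0, 201)
phi = np.vectorize(NormalDist().cdf)

F_weibull  = 1 - np.exp(-(x / 10.0) ** 3.0)             # theta_W  = {10, 3}
F_gauss    = phi((x - 8.875) / 3.278)                   # theta_CG = {8.875, 3.278}
F_logistic = 1 / (1 + np.exp(-(x - 8.957) / 2.014))     # theta_L  = {8.957, 2.014}
F_gumbel   = 1 - np.exp(-np.exp((x - 10.022) / 2.906))  # theta_G  = {10.022, 2.906}

curves = np.array([psi(F) for F in (F_weibull, F_gauss, F_logistic, F_gumbel)])
# Largest vertical gap between any two of the four functions over the range
max_gap = np.max(curves.max(axis=0) - curves.min(axis=0))
```

Under these assumptions the maximal gap stays below .03, smaller than the binomial scatter of a typical data point, which is why a data set fit well by one of the four is fit well by all of them.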

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

ψ = 1/4 [ψW + ψCG + ψL + ψG]

as the generating function, and fitted psychometric functions using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions to each data set. From the fitted psychometric functions, we obtained estimates of threshold0.2, threshold0.5, threshold0.8, and slope0.5, as described previously. All four different val-

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel). See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull, logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A using sampling scheme s2 with N = 120. See the text for details. [Fitted parameters in panel B: Weibull = {8.4, 3.4, .5, .05}; logistic = {7.8, 1.0, .5, .05}.]


ues of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 N values). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,9 for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10^9 simulated 2AFC trials).

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold0.2, threshold0.5, threshold0.8, and slope0.5), not only were the first two factors, the number of trials N and the sampling scheme, significant (p < .0001), as was expected, but so were the distribution function F and all possible interactions: The three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates: Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold0.5 and slope0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.10 The choice of distribution function F does not have, despite being a significant factor, a large effect on WCI68 for threshold0.5 and slope0.5.

The same is not true, however, for the WCI68 of threshold0.2. Here, the choice of F has an undesirably large effect on the bootstrap estimate of WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically. Each of the four panels of Figure 7 plots the WCI68

(normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme, as a function of the distribution function F.

For threshold0.5 and slope0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold0.2 the choice of F has a profound influence on WCI68 (e.g., in Figure 7A there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold0.8. Figure 7C shows the (again undesirable) interaction between sampling scheme and choice of F: WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles) the choice of F has a marked influence on WCI68. It was generally the case for threshold0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose as threshold and slope the measures corresponding to threshold0.5 and slope0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like it to be. Second, away from the midpoint of F, estimates of variability are not as independent of the distribution function chosen as one might wish, in particular for lower proportions correct (threshold0.2 is much more affected by the choice of F than is threshold0.8; cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions (for example, 60% [or 90%] correct in 2AFC), and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions owing to the (arbitrary) choice of a distribution function resulting in comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                          Threshold0.2  Threshold0.5  Threshold0.8  Slope0.5
Number of trials N                                   76.24        87.30        65.14       72.86
Sampling schemes s1 ... s7                            6.60        10.00        19.93       19.69
Distribution function F                              11.36         0.13         3.87        0.92
Error (numerical variability in bootstrap)            0.34         0.35         0.46        0.34
Interaction of N and sampling schemes s1 ... s7       1.18         0.98         5.97        3.50
Sum of interactions involving F                       4.28         1.24         4.63        2.69
Percentage of SS accounted for without F             84.36        98.63        91.50       96.39

Note. The columns refer to threshold0.2, threshold0.5, threshold0.8, and slope0.5, that is, to approximately 60%, 75%, and 90% correct and the slope at 75% correct during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.


Figure 7. The four panels show the width of WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of WCI68 for a given sampling scheme and N. Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold0.2. (B) WCI68 for threshold0.5. (C) WCI68 for threshold0.8. (D) WCI68 for slope0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates11 of the variability of a distribution J. For example,

Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate σ̂; Maloney (1990), in addition to σ̂, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence in-


terval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ) are not robust: They are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers, and the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity J might be seriously in error if only a single bootstrap estimate Ji is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.12
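A toy example makes the breakdown argument concrete: one wild bootstrap refit (here an invented outlier at roughly 20 times the typical value) inflates the moment-based spread dramatically while leaving a percentile-based width almost untouched.

```python
import numpy as np

# 99 well-behaved bootstrap estimates near 10, plus one runaway refit
boot = np.concatenate([np.linspace(9.5, 10.5, 99), [200.0]])

# Moment-based spread is wrecked by the single outlier ...
sd = boot.std()
# ... while a percentile-based width (central 68.3% interval) barely moves.
lo, hi = np.percentile(boot, [15.85, 84.15])
```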

Nonparametric plug-in estimates are also not without problems: Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distribution of bootstrap estimates θ* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates J* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias particularly raised problems for the distribution of the β parameter of the Weibull (N = 210, K = 7). We also found in our simulations that β, and thus slopes s, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that φ̂ = m(θ̂) and φ = m(θ), resulting in

φ̂ ~ N(φ − z0 σφ, σφ²),    (2)

where

σφ = 1 + a(φ − φ0),    (3)

and φ0 is any reference point on the scale of φ values. In Equation 2, z0 is the bias-correction term, and a in Equation 3 is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an e-level confidence interval endpoint of the BCa interval can be calculated as

θBCa[e] = G^-1( CG( z0 + (z0 + z_e) / (1 − a(z0 + z_e)) ) ),    (4)

where CG is the cumulative Gaussian distribution function, z_e = CG^-1(e), G^-1 is the inverse of the cumulative distribution function of the bootstrap replications θ*, z0 is our estimate of the bias, and a is our estimate of acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than those introduced by the standard percentile approximation to the true underlying distribution, where an ε-level confidence interval endpoint is simply G⁻¹[ε] (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equally good in performance to standard percentiles, but not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and the least imbalanced (i.e., probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for sensitivity analysis, as described above.
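Concretely, the Equation 4 endpoint can be computed from a set of bootstrap replicates in a few lines. The following is a minimal sketch, not the authors' implementation: the function names are ours, `statistics.NormalDist` plays the role of the cumulative Gaussian CG, and the acceleration term a is assumed to be supplied externally (estimating it properly requires, e.g., a jackknife; see Efron & Tibshirani, 1993, chap. 14).

```python
# Sketch of a BCa confidence-interval endpoint (Equation 4), assuming the
# bias-correction z0 and acceleration a are given; names are illustrative.
from statistics import NormalDist

def bias_correction(boot, theta_hat):
    """z0 = CG^-1(fraction of bootstrap replicates below the original estimate)."""
    b = len(boot)
    frac = sum(x < theta_hat for x in boot) / b
    frac = min(max(frac, 0.5 / b), 1.0 - 0.5 / b)   # keep inv_cdf defined
    return NormalDist().inv_cdf(frac)

def bca_endpoint(boot, z0, a, epsilon):
    """epsilon-level endpoint: G^-1(CG(z0 + (z0 + z_eps) / (1 - a*(z0 + z_eps))))."""
    nd = NormalDist()
    z_eps = nd.inv_cdf(epsilon)                     # standard-normal quantile z^(eps)
    adjusted = nd.cdf(z0 + (z0 + z_eps) / (1.0 - a * (z0 + z_eps)))
    srt = sorted(boot)                              # G^-1 as an empirical quantile
    k = min(len(srt) - 1, max(0, int(round(adjusted * (len(srt) - 1)))))
    return srt[k]
```

With z0 = 0 and a = 0 the BCa endpoint reduces to the ordinary percentile endpoint, which is a useful sanity check on any implementation.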

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as the thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if, and only if, we choose threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold_0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals that improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).
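The parametric bootstrap procedure summarized above can be sketched in a few dozen lines. The code below is a toy illustration under stated assumptions, not the authors' implementation: it uses a 2AFC Weibull model with γ = .5 and λ fixed at .01 (as in the simulations reported here), but it replaces a proper maximum-likelihood optimizer with a coarse grid search, uses a small B, and reads off a simple percentile interval rather than a BCa interval; all names and grid bounds are our own choices.

```python
# Toy parametric bootstrap for a 2AFC Weibull psychometric function:
# simulate binomial data from the fitted parameters, refit each simulated
# data set, and collect the bootstrap distribution of thresholds.
import math
import random

GAMMA, LAPSE = 0.5, 0.01                        # guess rate and fixed lambda

def psi(x, alpha, beta):
    # psi = gamma + (1 - gamma - lambda) * F(x; alpha, beta), Weibull F
    return GAMMA + (1.0 - GAMMA - LAPSE) * (1.0 - math.exp(-((x / alpha) ** beta)))

def neg_log_likelihood(alpha, beta, xs, ks, ns):
    total = 0.0
    for x, k, n in zip(xs, ks, ns):
        p = min(max(psi(x, alpha, beta), 1e-9), 1.0 - 1e-9)
        total -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total

def fit(xs, ks, ns):
    # crude grid search standing in for a real constrained ML optimizer
    grid = [(5.0 + 0.5 * i, 1.0 + 0.4 * j) for i in range(21) for j in range(16)]
    return min(grid, key=lambda ab: neg_log_likelihood(ab[0], ab[1], xs, ks, ns))

def bootstrap_threshold_ci(alpha_hat, beta_hat, xs, ns, B=100, coverage=0.68):
    thresholds = []
    for _ in range(B):
        # simulate k_i ~ Binomial(n_i, psi(x_i; theta_hat)) and refit
        ks = [sum(random.random() < psi(x, alpha_hat, beta_hat) for _ in range(n))
              for x, n in zip(xs, ns)]
        a, b = fit(xs, ks, ns)
        thresholds.append(a * math.log(2.0) ** (1.0 / b))  # F^-1(.5) of the Weibull
    thresholds.sort()
    lo = thresholds[int((0.5 - coverage / 2.0) * (B - 1))]
    hi = thresholds[int((0.5 + coverage / 2.0) * (B - 1))]
    return lo, hi
```

The sensitivity analysis then amounts to rerunning `bootstrap_threshold_ci` with slightly perturbed generating parameters and checking that the interval width does not change markedly.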

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and, third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology conference, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference of Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method: The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100nth percentile of the bootstrap distribution α*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ±1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis if β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations where one psychometric function using a particular distribution function provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using BCa confidence intervals, described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity ϑ, which is derived from a probability distribution F by ϑ = t(F), is to obtain F̂ from empirical data and then use ϑ̂ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)



68% confidence interval; this is the confidence interval we suggest experimenters should use when they report error bounds. The bottom-right panel (D), finally, shows MWCI68 multiplied by a sweat factor, √(N/120). Ideally, that is, for very large K and N, confidence intervals should be inversely proportional to √N, and multiplication by our sweat factor should remove this dependency. Any change in the sweat-adjusted MWCI68 as a function of N might thus be taken as an indicator of the degree to which sampling scheme, N, and true confidence interval width interact in a way not predicted by asymptotic theory.
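The sweat-factor arithmetic is simple enough to state explicitly. In this sketch the MWCI68 widths are made-up numbers chosen to follow the ideal 1/√N law, purely to show the normalization; none of them are values from the paper:

```python
# Sweat-factor normalization: if confidence intervals shrank ideally
# (proportional to 1/sqrt(N)), multiplying the measured width by
# sqrt(N/120) would yield a constant across N.
import math

widths = {120: 0.20, 240: 0.141, 480: 0.10, 960: 0.071}   # hypothetical MWCI68s
adjusted = {n: w * math.sqrt(n / 120.0) for n, w in widths.items()}
```

Any systematic trend in `adjusted` across N would indicate a departure from the asymptotic prediction.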

Figure 3A shows the WCI68 around the median estimated slope. Clearly, the different sampling schemes have a profound effect on the magnitude of the confidence intervals for slope estimates. For example, in order to ensure that the WCI68 is approximately 0.06, one requires nearly 960 trials if using sampling scheme s1 or s4. Sampling schemes s3, s5, or s7, on the other hand, require only around 200 trials to achieve similar confidence interval width. The important difference that makes, for example, s3, s5, and s7 more efficient than s1 and s4 is the presence of samples at high predicted performance values (p ≥ .9), where binomial variability is low and, thus, the data constrain our maximum-likelihood fit more tightly. Figure 3B illustrates the complex interactions between different sampling schemes, N, and the stability of the bootstrap estimates of variability, as indicated by the local flatness of the contours around θgen. A perfectly flat local contour would result in the horizontal line at 1. Sampling scheme s6 is well behaved for N ≥ 480, its maximal elevation being around 1.7; for N ≤ 240, however, elevation rises to nearly 2.5. Other schemes, like s1, s4, or s7, never rise above an elevation of 2, regardless of N. It is important to note that the magnitude of WCI68 at θgen is by no means a good predictor of the stability of the estimate, as indicated by the sensitivity factor. Figure 3C shows the MWCI68, and here the differences in efficiency between sampling schemes are even more apparent than in Figure 3A. Sampling schemes s3, s5, and s7 are clearly superior to all other sampling schemes, particularly for N ≤ 480; these three sampling schemes are the ones that include two sampling points at p ≥ .92.

Figure 4, showing the equivalent data for the estimates of threshold_0.5, also illustrates that some sampling schemes make much more efficient use of the experimenter's time by providing MWCI68s that are more compact than others by a factor of 3.2.

Two aspects of the data shown in Figures 4B and 4C are important. First, the sampling schemes fall into two distinct classes. Five of the seven sampling schemes are almost ideally well behaved, with elevations barely exceeding 1.5 even for small N (and MWCI68 < 3); the remaining two, on the other hand, behave poorly, with elevations in excess of 3 for N ≤ 240 (MWCI68 > 5). The two unstable sampling schemes, s1 and s4, are those that do not include at least a single sample point at p ≥ .95 (see Figure 2). It thus appears crucial to include at least one sample point at p ≥ .95 to make one's estimates of variability robust to small errors in θ̂, even if the threshold of interest, as in our example, has a p of only approximately .75. Second, both s1 and s4 are prone to lead to a serious underestimation of the true width of the confidence interval


Figure 2. A two-alternative forced-choice Weibull psychometric function with parameter vector θ = (10, 3, .5, 0), on semilogarithmic coordinates. The rows of symbols below the curve mark the x values of the seven different sampling schemes, s1-s7, used throughout the remainder of the paper.


if the sensitivity analysis (or bootstrap bridging assumption test) is not carried out: The WCI68 is unstable in the vicinity of θgen, even though WCI68 is small at θgen.
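The elevation statistic used in this sensitivity analysis is easy to state in code. In the sketch below, `wci68` is a stand-in (our name, not the paper's) for a routine that runs a full bootstrap at a given generating parameter vector and returns the 68% interval width; the quadratic surrogate is purely illustrative, since a real run would require thousands of fits.

```python
# Elevation: the worst-case WCI68 over a neighborhood of the generating
# parameters, relative to the WCI68 at the generating parameters themselves.
def elevation(wci68, theta_gen, perturbations):
    base = wci68(theta_gen)
    return max(wci68(t) for t in [theta_gen] + list(perturbations)) / base

# toy surrogate: interval width grows quadratically away from (10, 3)
toy = lambda th: 1.0 + 0.5 * ((th[0] - 10.0) ** 2 + (th[1] - 3.0) ** 2)

neighbors = [(10.5, 3.0), (9.5, 3.0), (10.0, 3.3), (10.0, 2.7)]
```

A flat surrogate would give an elevation of exactly 1; values well above 1 indicate that the bootstrap bridging assumption is suspect for that sampling scheme.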

Figure 5 is similar to Figure 4, except that it shows the WCI68 around threshold_0.8. The trends found in Figure 4 are even more exaggerated here. The sampling schemes without high sampling points (s1, s4) are unstable to the point of being meaningless for N ≤ 480. In addition, their WCI68 is inflated relative to that of the other sampling schemes, even at θgen.

The results of the Monte Carlo simulations are summarized in Table 1. The columns of Table 1 correspond to


Figure 3. (A) The width of the 68% BCa confidence interval (WCI68) of slopes of B = 999 fitted psychometric functions to parametric bootstrap data sets generated from θgen = (10, 3, .5, .01), as a function of the total number of observations N. (B) The maximal elevation of the WCI68 in the vicinity of θgen, again as a function of N (see the text for details). (C) The maximal width of the 68% BCa confidence interval (MWCI68) in the vicinity of θgen, that is, the product of the respective WCI68 and elevation factors of panels A and B. (D) MWCI68 scaled by the sweat factor √(N/120), the ideally expected decrease in confidence interval width with increasing N. The seven symbols denote the seven sampling schemes, as in Figure 2.


the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold_0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold_0.8, and the following four those for the slope_0.5. The scheme with the lowest MWCI68 in each row is given the score of 100%. The others on the same row are given proportionally higher scores to indicate their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.

An inspection of Table 1 reveals that the sampling schemes fall into four categories. By a long way the worst are sampling schemes s1 and s4, with mean and median MWCI68 in excess of 210%. Already somewhat superior is s2, particularly for small N, with a median MWCI68 of 180% and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of


Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(.5) (approximately equal to 75% correct during two-alternative forced choice).


around 130%. Each of these has at least one sample at p ≥ .95 and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6: Comparing s1 and s6, s6 is identical to scheme s1, except that one sample point was moved from .75 to .95. Still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120%, very nearly optimal6 on most estimates. Both of these sampling schemes, s3 and s7, have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses that could potentially bias the estimates if we were to fix the upper asymptote of the psychometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(.8) (approximately equal to 90% correct during two-alternative forced choice).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold_0.5, in order to obtain tight confidence intervals for threshold_0.5), because estimation is done via the whole psychometric function, which, in turn, is estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
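The point about binomial variability can be made numerically: the standard deviation of an observed proportion k/n is √(p(1−p)/n), which shrinks rapidly as p approaches the asymptote. A small illustration (the choice of n = 50 is arbitrary, purely for the example):

```python
# Standard deviation of an observed proportion k/n under a binomial model:
# points near the upper asymptote are far less variable, which is why
# sampling schemes with points at p >= .9 constrain the fit so tightly.
import math

def sd_of_proportion(p, n):
    return math.sqrt(p * (1.0 - p) / n)

for p in (0.55, 0.75, 0.95):
    print(f"p = {p:.2f}: sd = {sd_of_proportion(p, 50):.4f}")
```

The standard deviation at p = .95 is less than half of that at p = .55, so a high-p sample pins down the fitted function far more effectively than a near-chance one.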

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting: We used Bayesian priors to limit α to be between 0 and 25 and β to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable given the experimental context (cf. Figure 2); it is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes7 but would not change them qualitatively.

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires, in addition, that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) has only a minor influence on the estimates of variability. The importance of this cannot be overstated: A strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question, because, as experimenters, we do not know, and never will, the true underlying distribution function or objective function from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψW(x; θW), using the Weibull as F and θW = (10, 3, .5, .01) (our "standard" generating function ψgen); (2) ψCG(x; θCG), using the cumulative Gaussian with θCG = (8.875, 3.278, .5, .01); (3) ψL(x; θL), using the logistic with θL = (8.957, 2.014, .5, .01); and, finally, (4) ψG(x; θG), using the Gumbel and θG = (10.022, 2.906, .5, .01). For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that at most one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.8 Note that this is not trivially true: Although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated
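The near-equivalence claimed for Figure 6A is easy to verify numerically for three of the four sigmoids, using the parameter values quoted above (we omit the Gumbel here, since its parameterization varies between sources, and a wrong guess would spoil the comparison):

```python
# Check that the Weibull, cumulative Gaussian, and logistic 2AFC psychometric
# functions with the quoted parameters agree closely over the plotted range.
from statistics import NormalDist
import math

def to_psi(F):
    # psi = gamma + (1 - gamma - lambda) * F, with gamma = .5 and lambda = .01
    return 0.5 + 0.49 * F

def weibull(x):  return to_psi(1.0 - math.exp(-((x / 10.0) ** 3)))
def gauss(x):    return to_psi(NormalDist(8.875, 3.278).cdf(x))
def logistic(x): return to_psi(1.0 / (1.0 + math.exp(-(x - 8.957) / 2.014)))

xs = [2 + 0.1 * i for i in range(121)]          # signal intensities 2 to 14
max_gap = max(abs(f(x) - g(x))
              for x in xs
              for f, g in [(weibull, gauss), (weibull, logistic), (gauss, logistic)])
print(f"largest pairwise difference in proportion correct: {max_gap:.4f}")
```

The largest pairwise discrepancy is on the order of a percentage point of performance, far below the binomial noise in any realistic data set.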

Table 1
Summary of Results of Monte Carlo Simulations

                                    N      s1     s2     s3     s4     s5     s6     s7
MWCI68 at x = F⁻¹(.5)             120     331    135    143    369    162    100    110
                                  240     209    163    126    339    133    100    113
                                  480     156    179    143    193    158    100    155
                                  960     124    141    140    138    152    100    119
MWCI68 at x = F⁻¹(.8)             120    5725    428    100   1761    180    263    115
                                  240    1388    205    100   1257    110    125    109
                                  480     385    181    115    478    100    116    137
                                  960     215    149    100    291    102    119    101
MWCI68 of dF/dx at x = F⁻¹(.5)    120     212    305    119    321    139    217    100
                                  240     228    263    100    254    128    185    121
                                  480     186    213    119    217    100    148    134
                                  960     186    175    113    201    100    151    118
Mean                                      779    211    118    485    130    140    119
Standard deviation                       1595     85     17    499     28     49     16
Median                                    214    180    117    306    131    122    117

Note. Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold_0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold_0.8 and slope_0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θgen, as sampled at the points θgen and φ1 … φ8. The MWCI68 values are expressed in percentage relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψgen, using sampling scheme s2 with N = 120.

Slope, threshold_0.8, and threshold_0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

ψ̄ = ¼ (ψW + ψCG + ψL + ψG)

as the generating function and fitted psychometric functions using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions to each data set. From the fitted psychometric functions, we obtained estimates of threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5, as described previously. All four different values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 N values). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,9 for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10⁹ simulated 2AFC trials).

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel). See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull, logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A, using sampling scheme s2 with N = 120; the fitted parameter vectors were (8.4, 3.4, .5, .05) for the Weibull and (7.8, 1.0, .5, .05) for the logistic. See the text for details.

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5), not only the first two factors, the number of trials N and the sampling scheme, were, as was expected, significant (p < .0001), but also the distribution function F and all possible interactions. The three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates: Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold_0.5 and slope_0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.10 The choice of distribution function F does not have, despite being a significant factor, a large effect on WCI68 for threshold_0.5 and slope_0.5.

The same is not true, however, for the WCI68 of threshold_0.2. Here, the choice of F has an undesirably large effect on the bootstrap estimate of WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically. Each of the four panels of Figure 7 plots the WCI68 (normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme as a function of the distribution function F.

For threshold_0.5 and slope_0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold_0.2 the choice of F has a profound influence on WCI68 (e.g., in Figure 7A there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold_0.8: Figure 7C shows the (again undesirable) interaction between sampling scheme and choice of F. WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles) the choice of F has a marked influence on WCI68. It was generally the case for threshold_0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose as threshold and slope the measures corresponding to threshold_0.5 and slope_0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like it to be. Second, away from the midpoint of F, estimates of variability are, however, not as independent of the distribution function chosen as one might wish, in particular for lower proportions correct (threshold_0.2 is much more affected by the choice of F than threshold_0.8; cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions, for example, 60% (or 90%) correct in 2AFC, and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions owing to the (arbitrary) choice of a distribution function's resulting in comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                        Threshold0.2  Threshold0.5  Threshold0.8  Slope0.5
Number of trials N                                   76.24         87.30         65.14     72.86
Sampling schemes s1-s7                                6.60         10.00         19.93     19.69
Distribution function F                              11.36          0.13          3.87      0.92
Error (numerical variability in bootstrap)            0.34          0.35          0.46      0.34
Interaction of N and sampling schemes s1-s7           1.18          0.98          5.97      3.50
Sum of interactions involving F                       4.28          1.24          4.63      2.69
Percentage of SS accounted for without F             84.36         98.63         91.50     96.39

Note. The columns refer to threshold0.2, threshold0.5, threshold0.8, and slope0.5, that is, to approximately 60%, 75%, and 90% correct and the slope at 75% correct during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.

1326 WICHMANN AND HILL

[Figure 7]

Figure 7. The four panels show the width of WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of WCI68 for a given sampling scheme and N. Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold0.2. (B) WCI68 for threshold0.5. (C) WCI68 for threshold0.8. (D) WCI68 for slope0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates¹¹ of the variability of a distribution of some quantity ϑ̂. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate σ̂; Maloney (1990), in addition to σ̂, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%.

Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ) are not robust: they are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity ϑ might be seriously in error if only a single bootstrap estimate ϑ̂*ᵢ is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.¹²
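The breakdown argument can be seen in a few lines. This is an illustrative sketch with invented numbers, not data from the paper: a single corrupted value (such as a bootstrap replicate from a stuck search) drags the mean arbitrarily far, while the median barely moves.

```python
import numpy as np

samples = np.array([4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3])
corrupted = samples.copy()
corrupted[0] = 500.0          # one wild estimate, e.g. from a stuck ML search

# the mean (breakdown 1/n) is dragged arbitrarily far ...
print(samples.mean(), corrupted.mean())         # 5.0 vs 66.9
# ... while the median (breakdown 1/2) barely moves
print(np.median(samples), np.median(corrupted)) # 5.0 vs 5.05
```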

Nonparametric plug-in estimates are also not without problems: percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988). Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the bootstrap estimates θ̂* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates ϑ̂* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias particularly raised problems for the distribution of the β parameter of the Weibull (N = 210, K = 7). We also found in our simulations that β̂*, and thus slope estimates ŝ*, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that φ̂ = m(θ̂) and φ = m(θ), resulting in

    φ̂ ~ N(φ − z₀σ_φ, σ_φ²),    (2)

where

    σ_φ = 1 + a(φ − φ₀)    (3)

and φ₀ is any reference point on the scale of φ values. Here, z₀ is the bias-correction term, and a is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an ε-level confidence interval endpoint of the BCa interval can be calculated as

    θ̂_BCa[ε] = G⁻¹(CG(z₀ + (z₀ + z^(ε)) / (1 − a(z₀ + z^(ε))))),    (4)

where CG is the cumulative Gaussian distribution function, G⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ̂*, z^(ε) = CG⁻¹(ε) is the ε-quantile of the standard normal, ẑ₀ is our estimate of the bias, and â is our estimate of the acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).
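As a concrete illustration of Equation 4, the sketch below (our Python/NumPy/SciPy code, not the authors'; the function name and the jackknife recipe for â follow Efron & Tibshirani, 1993, chap. 14) estimates ẑ₀ from the fraction of bootstrap replicates below the original estimate and â from leave-one-out estimates, then picks the corrected percentiles of the bootstrap distribution.

```python
import numpy as np
from scipy.stats import norm

def bca_interval(data, statistic, n_boot=1999, alpha=0.68, rng=None):
    """Sketch of a BCa confidence interval for statistic(data)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    theta_hat = statistic(data)
    # bootstrap replications theta*_i
    boot = np.array([statistic(rng.choice(data, n, replace=True))
                     for _ in range(n_boot)])
    # bias-correction z0: fraction of replicates below the original estimate
    z0 = norm.ppf(np.mean(boot < theta_hat))
    # acceleration a from the jackknife (skewness of leave-one-out estimates)
    jack = np.array([statistic(np.delete(data, i)) for i in range(n)])
    d = jack.mean() - jack
    a = (d ** 3).sum() / (6.0 * (d ** 2).sum() ** 1.5)
    # corrected percentile levels (Equation 4), then read off the quantiles
    lo_e, hi_e = (1 - alpha) / 2, 1 - (1 - alpha) / 2
    def corrected(e):
        z = norm.ppf(e)
        return norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    return np.quantile(boot, [corrected(lo_e), corrected(hi_e)])
```

With `alpha=0.68` this yields a WCI68-style interval; for a symmetric, unbiased bootstrap distribution it reduces to the ordinary percentile interval.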

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an ε-level confidence interval endpoint is simply G⁻¹[ε] (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equal in performance to standard percentiles, but not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for the sensitivity analysis described above.

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if and only if we choose as threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology Conference, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference on Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions to data from observers who occasionally lapse, that is, display nonstationarity, is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for a, for example, was determined simply by [a*(.025), a*(.975)], where a*(n) denotes the 100nth percentile of the bootstrap distribution of a*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ±1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β, other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis when β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations where one psychometric function using a particular distribution function provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using the BCa confidence intervals described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity ϑ, which is derived from a probability distribution F by ϑ = t(F), is to obtain F̂ from empirical data and then use ϑ̂ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing, by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)



if the sensitivity analysis (or bootstrap bridging assumption test) is not carried out: The WCI68 is unstable in the vicinity of θgen, even though the WCI68 is small at θgen itself.

Figure 5 is similar to Figure 4, except that it shows the WCI68 around threshold0.8. The trends found in Figure 4 are even more exaggerated here: The sampling schemes without high sampling points (s1, s4) are unstable to the point of being meaningless for N ≤ 480. In addition, their WCI68 is inflated relative to that of the other sampling schemes, even at θgen.

The results of the Monte Carlo simulations are summarized in Table 1. The columns of Table 1 correspond to

[Figure 3]

Figure 3. (A) The width of the 68% BCa confidence interval (WCI68) of slopes of B = 999 psychometric functions fitted to parametric bootstrap data sets generated from θgen = {10, 3, .5, .01}, as a function of the total number of observations N. (B) The maximal elevation of the WCI68 in the vicinity of θgen, again as a function of N (see the text for details). (C) The maximal width of the 68% BCa confidence interval (MWCI68) in the vicinity of θgen, that is, the product of the respective WCI68 and elevation factors of panels A and B. (D) MWCI68 scaled by the sweat factor sqrt(N/120), the ideally expected decrease in confidence interval width with increasing N. The seven symbols denote the seven sampling schemes, as in Figure 2.


the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold0.8, and the following four those for the slope0.5. The scheme with the lowest MWCI68 in each row is given the score of 100%. The others on the same row are given proportionally higher scores, to indicate their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.

An inspection of Table 1 reveals that the sampling schemes fall into four categories. By a long way worst are sampling schemes s1 and s4, with mean and median MWCI68 values above 210%. Already somewhat superior is s2, particularly for small N, with a median MWCI68 of 180% and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of

[Figure 4]

Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(.5) (approximately equal to 75% correct during two-alternative forced choice).


around 130%. Each of these has at least one sample at p ≥ .95 and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6. Comparing s1 and s6: s6 is identical to scheme s1, except that one sample point was moved from .75 to .95. Still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120%, very nearly optimal⁶ on most estimates. Both of these sampling schemes, s3 and s7, have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses that could potentially bias the estimates if we were to fix the upper asymptote of the psychometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

[Figure 5]

Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(.8) (approximately equal to 90% correct during two-alternative forced choice).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold0.5 in order to obtain tight confidence intervals for threshold0.5), because estimation is done via the whole psychometric function, which in turn is estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
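The variance argument rests on the binomial standard deviation sqrt(p(1−p)/n) of an observed proportion, which is quickly checked; the performance levels and n below are illustrative values of ours, not the paper's.

```python
import numpy as np

p = np.array([0.55, 0.75, 0.95])   # nominal proportions correct (illustrative)
n = 40                             # trials per stimulus placement (illustrative)
sd = np.sqrt(p * (1 - p) / n)      # binomial SD of the observed proportion
print(sd)  # shrinks as p approaches 1, so high-p points pin down the curve
```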

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting: We used Bayesian priors to limit α to be between 0 and 25 and β to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable given the experimental context (cf. Figure 2); it is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes⁷ but would not change them qualitatively.
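To make the overall procedure concrete, here is a minimal parametric bootstrap in Python (our sketch, not the authors' code): simulate binomial data from the generating Weibull with θgen = (10, 3, .5, .01), refit by maximum likelihood with box constraints standing in for the flat priors above, resample from the fitted function, and read off a 68% percentile interval for threshold0.5. The stimulus placements, trials per point, and number of bootstrap repeats are illustrative (the paper uses B = 1999).

```python
import numpy as np
from scipy.optimize import minimize

def psi(x, alpha, beta, lam):
    # 2AFC Weibull psychometric function, guess rate gamma = .5 fixed
    return 0.5 + (0.5 - lam) * (1 - np.exp(-(x / alpha) ** beta))

def fit(x, k, n):
    # maximum-likelihood fit; box constraints stand in for the flat Bayesian priors
    def nll(p):
        a, b, l = p
        pr = np.clip(psi(x, a, b, l), 1e-9, 1 - 1e-9)
        return -(k * np.log(pr) + (n - k) * np.log(1 - pr)).sum()
    res = minimize(nll, x0=(10.0, 3.0, 0.01), method="L-BFGS-B",
                   bounds=[(0.1, 25), (0.1, 15), (0.0, 0.06)])
    return res.x

rng = np.random.default_rng(1)
x = np.array([6.0, 7.5, 8.5, 9.5, 10.5, 12.0])   # hypothetical sampling scheme
n = np.full(x.size, 20)                          # N = 120 trials in total
k_obs = rng.binomial(n, psi(x, 10.0, 3.0, 0.01)) # data from theta_gen
a0, b0, l0 = fit(x, k_obs, n)
t_hat = a0 * (-np.log(0.5)) ** (1.0 / b0)        # x at F = .5, i.e. threshold0.5

# parametric bootstrap: resample from the *fitted* function, refit, collect thresholds
thresholds = []
for _ in range(199):                             # B = 1999 in the paper; fewer for speed
    k_star = rng.binomial(n, psi(x, a0, b0, l0))
    a, b, l = fit(x, k_star, n)
    thresholds.append(a * (-np.log(0.5)) ** (1.0 / b))
lo, hi = np.quantile(thresholds, [0.16, 0.84])
print(f"threshold0.5 = {t_hat:.2f}, WCI68 = [{lo:.2f}, {hi:.2f}]")
```

A BCa version would replace the plain 16th/84th percentiles with the corrected percentiles of Equation 4.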

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires, in addition, that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) has only a minor influence on the estimates of variability. This point is crucial: A strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question, because, as experimenters, we do not know, and never will, the true underlying distribution function or objective function from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψW(x; θW), using the Weibull as F, with θW = {10, 3, .5, .01} (our "standard" generating function ψgen); (2) ψCG(x; θCG), using the cumulative Gaussian, with θCG = {8.875, 3.278, .5, .01}; (3) ψL(x; θL), using the logistic, with θL = {8.957, 2.014, .5, .01}; and, finally, (4) ψG(x; θG), using the Gumbel, with θG = {10.022, 2.906, .5, .01}. For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that at most one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.⁸ Note that this is not trivially true: Although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated
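The near-identity of the four sigmoids can be checked numerically. The sketch below (ours) uses the parameter values listed above with standard textbook parameterizations of each F; the paper's exact parameterizations are defined in its companion paper, so treat these forms as an assumption. With ψ = γ + (1 − γ − λ)F, γ = .5, the four curves agree to within a few percentage points of proportion correct over the plotted range.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(2, 14, 500)
gamma, lam = 0.5, 0.01

def to_2afc(F):
    # psi = gamma + (1 - gamma - lam) * F
    return gamma + (1 - gamma - lam) * F

psi_w  = to_2afc(1 - np.exp(-(x / 10.0) ** 3.0))            # Weibull (10, 3)
psi_cg = to_2afc(norm.cdf(x, loc=8.875, scale=3.278))       # cumulative Gaussian
psi_l  = to_2afc(1 / (1 + np.exp(-(x - 8.957) / 2.014)))    # logistic
psi_g  = to_2afc(1 - np.exp(-np.exp((x - 10.022) / 2.906))) # Gumbel

stack = np.stack([psi_w, psi_cg, psi_l, psi_g])
maxdiff = np.abs(stack[:, None] - stack[None, :]).max()
print(maxdiff)   # largest pairwise discrepancy in proportion correct
```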

Table 1
Summary of Results of Monte Carlo Simulations

                            N     s1     s2     s3     s4     s5     s6     s7
MWCI68 at x = F⁻¹(.5)     120    331    135    143    369    162    100    110
                          240    209    163    126    339    133    100    113
                          480    156    179    143    193    158    100    155
                          960    124    141    140    138    152    100    119
MWCI68 at x = F⁻¹(.8)     120   5725    428    100   1761    180    263    115
                          240   1388    205    100   1257    110    125    109
                          480    385    181    115    478    100    116    137
                          960    215    149    100    291    102    119    101
MWCI68 of dF/dx           120    212    305    119    321    139    217    100
  at x = F⁻¹(.5)          240    228    263    100    254    128    185    121
                          480    186    213    119    217    100    148    134
                          960    186    175    113    201    100    151    118
Mean                             779    211    118    485    130    140    119
Standard deviation              1595     85     17    499     28     49     16
Median                           214    180    117    306    131    122    117

Note. Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold0.8 and slope0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θgen, as sampled at the points θgen and φ1 ... φ8. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψgen, using sampling scheme s2 with N = 120.

Slope, threshold0.8, and threshold0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

    ψ̄ = (1/4)[ψW + ψCG + ψL + ψG]

as the generating function, and we fitted psychometric functions using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution function to each data set. From the fitted psychometric functions, we obtained estimates of threshold0.2, threshold0.5, threshold0.8, and slope0.5, as described previously. All four different values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 values of N). In addition, we repeated the above procedure 40 times, to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,⁹ for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10⁹ simulated 2AFC trials).

[Figure 6]

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel). See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull and logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A, using sampling scheme s2 with N = 120. See the text for details.

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold0.2, threshold0.5, threshold0.8, and slope0.5), not only were the first two factors, the number of trials N and the sampling scheme, significant (p < .0001), as was expected, but so were the distribution function F and all possible interactions: The three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result, in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates. Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold0.5 and slope0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.¹⁰ Despite being a significant factor, the choice of distribution function F thus does not have a large effect on WCI68 for threshold0.5 and slope0.5.
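The effect sizes in Table 2 are percentages of the total sum of squares. For a balanced design, this decomposition can be sketched as follows; the data are synthetic and the factor sizes merely mirror the 4 × 7 × 40 layout of the simulations (none of this is the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
# balanced design: factor A (4 levels, like N) x factor B (7 levels, like scheme), 40 reps
a_levels, b_levels, reps = 4, 7, 40
a_eff = np.array([3.0, 1.0, -1.0, -3.0])
b_eff = np.linspace(-1, 1, b_levels)
y = (a_eff[:, None, None] + b_eff[None, :, None]
     + 0.2 * a_eff[:, None, None] * b_eff[None, :, None]   # interaction term
     + rng.normal(0, 0.5, (a_levels, b_levels, reps)))     # cell noise

grand = y.mean()
ss_total = ((y - grand) ** 2).sum()
ss_a = b_levels * reps * ((y.mean(axis=(1, 2)) - grand) ** 2).sum()   # main effect A
ss_b = a_levels * reps * ((y.mean(axis=(0, 2)) - grand) ** 2).sum()   # main effect B
ss_cells = reps * ((y.mean(axis=2) - grand) ** 2).sum()
ss_ab = ss_cells - ss_a - ss_b                                        # interaction
ss_error = ss_total - ss_cells                                        # within-cell error
for name, ss in [("A", ss_a), ("B", ss_b), ("A x B", ss_ab), ("error", ss_error)]:
    print(f"{name:6s} {100 * ss / ss_total:6.2f}%")  # effect size as % of total SS
```

Because the design is balanced, the four components add up exactly to the total sum of squares, which is what licenses reading each percentage as "variance accounted for".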

The same is not true however for the WCI68 of thresh-old02 Here the choice of F has an undesirably large ef-fect on the bootstrap estimate of WCI68mdashits influence islarger than that of the sampling scheme usedmdashand only8436 of the variance is explained by N samplingscheme and their interaction Figure 7 finally summa-rizes the effect sizes of N sampling scheme and F graph-ically Each of the four panels of Figure 7 plots the WCI68

(normalized by dividing each WCI68 score by the largestmean WCI68) on the y-axis as a function of N on the x-axis the different symbols refer to the different samplingschemes The two symbols shown in each panel corre-spond to the sampling schemes that yielded the smallestand largest mean WCI68 (averaged across F and N) Thegray levels finally code the smallest (black) mean (gray)and largest (white) WCI68 for a given N and samplingscheme as a function of the distribution function F

For threshold0.5 and slope0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold0.2 the choice of F has a profound influence on WCI68 (e.g., in Figure 7A there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold0.8. Figure 7C shows the (again undesirable) interaction between sampling scheme and choice of F: WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles) the choice of F has a marked influence on WCI68. It was generally the case for threshold0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose as threshold and slope the measures corresponding to threshold0.5 and slope0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like it to be. Second, away from the midpoint of F, estimates of variability are, however, not as independent of the distribution function chosen as one might wish; in particular, for lower proportions correct, threshold0.2 is much more affected by the choice of F than is threshold0.8 (cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions (for example, 60% or 90% correct in 2AFC), and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions owing to the (arbitrary) choice of a distribution function resulting in comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                         Threshold0.2  Threshold0.5  Threshold0.8  Slope0.5
Number of trials N                                 76.24         87.30         65.14       72.86
Sampling schemes s1-s7                              6.60         10.00         19.93       19.69
Distribution function F                            11.36          0.13          3.87        0.92
Error (numerical variability in bootstrap)          0.34          0.35          0.46        0.34
Interaction of N and sampling schemes s1-s7         1.18          0.98          5.97        3.50
Sum of interactions involving F                     4.28          1.24          4.63        2.69
Percentage of SS accounted for without F           84.36         98.63         91.50       96.39

Note. The columns refer to threshold0.2, threshold0.5, threshold0.8, and slope0.5; that is, to approximately 60%, 75%, and 90% correct, and the slope at 75% correct, during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.

1326 WICHMANN AND HILL

Figure 7. The four panels show the width of WCI68 as a function of N and of sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of WCI68 for a given sampling scheme and N. Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold0.2. (B) WCI68 for threshold0.5. (C) WCI68 for threshold0.8. (D) WCI68 for slope0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates (see Note 11) of the variability of a distribution. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate; Maloney (1990), in addition, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ) are not robust: They are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity might be seriously in error if only a single bootstrap estimate is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface (see Note 12).
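The fragility of moment-based plug-in estimates is easy to demonstrate. In the sketch below (our own illustration, not the authors' code), a single wild bootstrap replicate, of the kind produced by a search stuck in a local minimum, inflates a mean-and-SD-based 68% width enormously, while the percentile-based width barely moves.

```python
import random

random.seed(0)

# 999 well-behaved bootstrap threshold estimates around 10.0, plus a single
# wild replicate (e.g., from a fit stuck in a local minimum).
replicates = [10.0 + random.gauss(0, 0.5) for _ in range(999)]
replicates.append(500.0)  # one outlier: the mean/SD has a breakdown of 1/n

def moment_width(xs):
    """68% width from a moment-based (mean +/- SD) plug-in estimate."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return 2 * var ** 0.5

def percentile_width(xs):
    """68% width from the 16th and 84th percentiles (robust)."""
    s = sorted(xs)
    return s[int(0.84 * (len(s) - 1))] - s[int(0.16 * (len(s) - 1))]

print(moment_width(replicates))      # blown up by the single outlier
print(percentile_width(replicates))  # close to the outlier-free value
```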

Nonparametric plug-in estimates are also not without problems: Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14 and 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distributions of bootstrap estimates θ* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias raised problems particularly for the distribution of the β parameter of the Weibull (N = 210, K = 7). We also found in our simulations that β*, and thus slopes, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that Φ̂ = m(θ̂) and Φ = m(θ), resulting in

\hat{\Phi} \sim N(\Phi - z_0 \sigma_\Phi, \sigma_\Phi^2),    (2)

where

\sigma_\Phi = \sigma_{\Phi_0} + a(\Phi - \Phi_0),    (3)

and Φ0 is any reference point on the scale of Φ values. In Equation 2, z0 is the bias-correction term, and a, in Equation 3, is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an ε-level confidence interval endpoint of the BCa interval can be calculated as

\hat{q}_{BCa}[\epsilon] = \hat{G}^{-1}\left( CG\left( z_0 + \frac{z_0 + z^{(\epsilon)}}{1 - a\,(z_0 + z^{(\epsilon)})} \right) \right),    (4)

where CG is the cumulative Gaussian distribution function, Ĝ⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications, z0 is our estimate of the bias, and a is our estimate of acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an ε-level confidence interval endpoint is simply Ĝ⁻¹[ε] (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are superior or equal in performance to standard percentiles, not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and it was the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for the sensitivity analysis described above.
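For readers who want to experiment, the BCa recipe of Efron and Tibshirani (1993, chaps. 14 and 22) can be sketched compactly: the bias correction z0 comes from the proportion of bootstrap replicates below the point estimate, the acceleration a from a jackknife, and the endpoints from Equation 4. This is our own illustrative implementation, not the authors' code, applied here to the mean of a skewed sample.

```python
import math
import random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Inverse normal CDF by bisection (adequate for illustration)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def bca_interval(data, statistic, boot_stats, eps=(0.16, 0.84)):
    """BCa endpoints from bootstrap replicates of `statistic` on `data`."""
    theta_hat = statistic(data)
    b = sorted(boot_stats)
    n_b = len(b)
    # Bias correction z0: proportion of replicates below the point estimate.
    z0 = phi_inv(sum(1 for t in b if t < theta_hat) / n_b)
    # Acceleration a from the jackknife.
    jack = [statistic(data[:i] + data[i + 1:]) for i in range(len(data))]
    jm = sum(jack) / len(jack)
    num = sum((jm - j) ** 3 for j in jack)
    den = 6.0 * sum((jm - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0
    endpoints = []
    for e in eps:
        ze = phi_inv(e)
        adj = phi(z0 + (z0 + ze) / (1.0 - a * (z0 + ze)))  # Equation 4
        endpoints.append(b[min(n_b - 1, max(0, int(adj * n_b)))])
    return tuple(endpoints)

random.seed(1)
data = [random.expovariate(1.0) for _ in range(50)]      # skewed sample
mean = lambda xs: sum(xs) / len(xs)
boot = [mean([random.choice(data) for _ in data]) for _ in range(1000)]
lo, hi = bca_interval(data, mean, boot)
print(lo, hi)  # a 68% BCa interval (the WCI68 of the text)
```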

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if and only if we choose threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.
Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.
Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.
Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.
Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.
Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.
Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.
Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.
Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.
Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.
Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology conference, York, UK.
Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.
Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.
Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.
Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.
McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.
Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.
Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.
Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.
Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference on Visual Perception, Tübingen.
Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.
Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions applied to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for a, for example, was determined simply by [a*(.025), a*(.975)], where a*(n) denotes the 100nth percentile of the bootstrap distribution a*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ±1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis when β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations where one psychometric function, using a particular distribution function, provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using the BCa confidence intervals described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal ψ values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity ϑ, which is derived from a probability distribution F by ϑ = t(F), is to obtain F̂ from empirical data and then use ϑ̂ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing, owing to their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)


the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold0.8, and the following four those for the slope0.5. The scheme with the lowest MWCI68 in each row is given the score of 100. The others on the same row are given proportionally higher scores, to indicate their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.
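The row scoring just described is simple to reproduce. The raw MWCI68 values below are hypothetical (chosen only so that the scores come out like part of a row of Table 1); each row value is divided by the row minimum and expressed as a percentage.

```python
# Table 1 scoring: the sampling scheme with the smallest MWCI68 in a row
# scores 100; the others score proportionally higher. Raw values are
# hypothetical, for illustration only.
row = {"s1": 3.31, "s2": 1.35, "s3": 1.43, "s6": 1.00, "s7": 1.10}
best = min(row.values())
scores = {k: round(100 * v / best) for k, v in row.items()}
print(scores)  # {'s1': 331, 's2': 135, 's3': 143, 's6': 100, 's7': 110}
```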

An inspection of Table 1 reveals that the sampling schemes fall into four categories. Worst by a long way are sampling schemes s1 and s4, with mean and median MWCI68 of at least 210. Already somewhat superior is s2, particularly for small N, with a median MWCI68 of 180 and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of

Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F^-1(0.5) (approximately equal to 75% correct during two-alternative forced choice).

around 130. Each of these has at least one sample at p ≥ .95 and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6: s6 is identical to scheme s1, except that one sample point was moved from .75 to .95. Still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120, very nearly optimal (see Note 6) on most estimates. Both of these sampling schemes, s3 and s7,

have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses that could potentially bias the estimates if we were to fix the upper asymptote of the psychometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F^-1(0.8) (approximately equal to 90% correct during two-alternative forced choice).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold0.5, in order to obtain tight confidence intervals for threshold0.5), because estimation is done via the whole psychometric function, which in turn is estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
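The underlying reason is just binomial variance: an observed proportion correct p based on n trials has variance p(1 − p)/n, which shrinks rapidly as p approaches 1. A quick check (our own illustration):

```python
# Binomial variance of an observed proportion correct, p(1-p)/n:
# points near the upper asymptote pin down the fit far more tightly
# than points at mid-performance levels (2AFC range is 0.5 to 1.0).
n = 50
for p in (0.55, 0.75, 0.95):
    print(f"p = {p:.2f}: variance = {p * (1 - p) / n:.5f}")
```

At p = .95 the variance is roughly a quarter of that at p = .75, which is why high-performance sample points constrain the fit so strongly.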

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting: We used Bayesian priors to limit α to be between 0 and 25 and β to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable, given the experimental context (cf. Figure 2); it is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes (see Note 7) but would not change them qualitatively.
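A minimal sketch of such a constrained maximum-likelihood fit follows (our own illustration, not the authors' code): the data set and the coarse grid search are invented, and the flat Bayesian priors are implemented simply as hard box constraints on α and β.

```python
import math

def psi(x, alpha, beta, gamma=0.5, lam=0.01):
    """2AFC Weibull psychometric function."""
    F = 1.0 - math.exp(-((x / alpha) ** beta))
    return gamma + (1.0 - gamma - lam) * F

def neg_log_likelihood(alpha, beta, data):
    nll = 0.0
    for x, k, n in data:  # stimulus level, correct responses, trials
        p = min(max(psi(x, alpha, beta), 1e-9), 1 - 1e-9)
        nll -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return nll

def fit_constrained(data):
    # Flat priors within the bounds act as hard constraints:
    # alpha in (0, 25), beta in (0, 15), endpoints excluded.
    a_grid = [0.1 + 0.1 * i for i in range(249)]
    b_grid = [0.1 + 0.1 * i for i in range(149)]
    best = min((neg_log_likelihood(a, b, data), a, b)
               for a in a_grid for b in b_grid)
    return best[1], best[2]

# Hypothetical data (level, correct, total), roughly from alpha=10, beta=3.
data = [(4, 26, 50), (7, 33, 50), (9, 38, 50),
        (10, 41, 50), (12, 47, 50), (14, 50, 50)]
alpha_hat, beta_hat = fit_constrained(data)
print(alpha_hat, beta_hat)
```

In practice one would use a proper optimizer rather than a grid, but the grid makes the role of the box constraints explicit.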

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) has only a minor influence on the estimates of variability. The importance of this cannot be overestimated, since a strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question: As experimenters, we do not know, and never will, the true underlying distribution function, or objective function, from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψW(x; θW), using the Weibull as F, with θW = {10, 3, .5, .01} (our "standard" generating function ψgen); (2) ψCG(x; θCG), using the cumulative Gaussian, with θCG = {8.875, 3.278, .5, .01}; (3) ψL(x; θL), using the logistic, with θL = {8.957, 2.014, .5, .01}; and, finally, (4) ψG(x; θG), using the Gumbel, with θG = {10.022, 2.906, .5, .01}. For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that at most one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability (see Note 8). Note that this is not trivially true: Although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated

Table 1
Summary of Results of Monte Carlo Simulations

                           N     s1    s2    s3    s4    s5    s6    s7
MWCI68 at x = F^-1(0.5)   120   331   135   143   369   162   100   110
                          240   209   163   126   339   133   100   113
                          480   156   179   143   193   158   100   155
                          960   124   141   140   138   152   100   119
MWCI68 at x = F^-1(0.8)   120  5725   428   100  1761   180   263   115
                          240  1388   205   100  1257   110   125   109
                          480   385   181   115   478   100   116   137
                          960   215   149   100   291   102   119   101
MWCI68 of dF/dx at        120   212   305   119   321   139   217   100
x = F^-1(0.5)             240   228   263   100   254   128   185   121
                          480   186   213   119   217   100   148   134
                          960   186   175   113   201   100   151   118
Mean                            779   211   118   485   130   140   119
Standard deviation             1595    85    17   499    28    49    16
Median                          214   180   117   306   131   122   117

Note. Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold0.8 and slope0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θgen, as sampled at the points θgen and f1, …, f8. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψgen, using sampling scheme s2 with N = 120. Slope, threshold0.8, and threshold0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

\psi = \frac{1}{4}\left[\psi_W + \psi_{CG} + \psi_L + \psi_G\right]

as the generating function, and fitted psychometric functions using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions to each data set. From the fitted psychometric functions, we obtained estimates of threshold0.2, threshold0.5, threshold0.8, and slope0.5, as described previously. All four values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 values of N). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines (see Note 9), for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10^9 simulated 2AFC trials).

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel). See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull, logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A using sampling scheme s2 with N = 120. See the text for details.
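The near-equivalence of the four sigmoids in Figure 6A is easy to verify numerically. The sketch below (our own code) uses the parameter values quoted in the text and the standard 2AFC mapping ψ(x) = γ + (1 − γ − λ)F(x), with γ = .5 and λ = .01; over the plotted range the four functions never differ by more than a few percentage points.

```python
import math

# The four distribution functions of Figure 6A, with the parameter values
# quoted in the text.
def weibull(x):   return 1 - math.exp(-((x / 10.0) ** 3.0))
def cum_gauss(x): return 0.5 * (1 + math.erf((x - 8.875) / (3.278 * math.sqrt(2))))
def logistic(x):  return 1.0 / (1.0 + math.exp(-(x - 8.957) / 2.014))
def gumbel(x):    return 1 - math.exp(-math.exp((x - 10.022) / 2.906))

def psi(F, x, gamma=0.5, lam=0.01):
    """2AFC mapping from distribution function F to proportion correct."""
    return gamma + (1 - gamma - lam) * F(x)

# Largest pairwise separation over the plotted range, x in [2, 14]:
xs = [2 + 0.1 * i for i in range(121)]
funcs = (weibull, cum_gauss, logistic, gumbel)
max_gap = max(abs(psi(f, x) - psi(g, x))
              for x in xs for f in funcs for g in funcs)
print(round(max_gap, 3))  # a few percentage points at most
```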

An analysis of variance was applied to the resulting datawith the number of trials N the sampling schemes s1 to s7and the distribution function F as independent factors(variables) The dependent variables were the confidenceinterval widths (WCI68) each cell contained the WCI68estimates from our 40 repetitions For all four dependentmeasuresmdashthreshold02 threshold05 threshold08 andslope05mdashnot only the first two factors the number oftrials N and the sampling scheme were as was expectedsignificant (p 0001) but also the distribution functionF and all possible interactions The three two-way inter-actions and the three-way interaction were similarly sig-nificant at p 0001 This result in itself however is notnecessarily damaging to the bootstrap method applied topsychophysical data because the significance is broughtabout by the very low (and desirable) variability of ourWCI68 estimates Model R2 is between 995 and 997implying that virtually all the variance in our simulationsis due to N sampling scheme F and interactions thereof

Rather than focusing exclusively on significance inTable 2 we provide information about effect sizemdashnamelythe percentage of the total sum of squares of variation ac-counted for by the different factors and their interactionsFor threshold05 and slope05 (columns 2 and 4) N samplingscheme and their interaction account for 9863 and9639 of the total variance respectively10 The choiceof distribution function F does not have despite being asignificant factor a large effect on WCI68 for thresh-old05 and slope05

The same is not true, however, for the WCI68 of threshold0.2. Here, the choice of F has an undesirably large effect on the bootstrap estimate of WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically. Each of the four panels of Figure 7 plots the WCI68 (normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme as a function of the distribution function F.

For threshold0.5 and slope0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold0.2, the choice of F has a profound influence on WCI68 (e.g., in Figure 7A, there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold0.8. Figure 7C shows the (again, undesirable) interaction between sampling scheme and choice of F: WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles), the choice of F has a marked influence on WCI68. It was generally the case for threshold0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose as threshold and slope the measures corresponding to threshold0.5 and slope0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like it to be. Second, away from the midpoint of F, estimates of variability are, however, not as independent of the distribution function chosen as one might wish; in particular, for lower proportions correct (threshold0.2 is much more affected by the choice of F than threshold0.8; cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions (for example, 60% [or 90%] correct in 2AFC) and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions owing to the (arbitrary) choice of a distribution function's resulting in comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                            Threshold0.2  Threshold0.5  Threshold0.8  Slope0.5
Number of trials N                                       76.24         87.30         65.14     72.86
Sampling schemes s1 ... s7                                6.60         10.00         19.93     19.69
Distribution function F                                  11.36          0.13          3.87      0.92
Error (numerical variability in bootstrap)                0.34          0.35          0.46      0.34
Interaction of N and sampling schemes s1 ... s7           1.18          0.98          5.97      3.50
Sum of interactions involving F                           4.28          1.24          4.63      2.69
Percentage of SS accounted for without F                 84.36         98.63         91.50     96.39

Note. The columns refer to threshold0.2, threshold0.5, threshold0.8, and slope0.5, that is, to approximately 60%, 75%, and 90% correct and to the slope at 75% correct during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.

1326 WICHMANN AND HILL

[Figure 7]
Figure 7. The four panels show the width of WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of WCI68 for a given sampling scheme and N: Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold0.2. (B) WCI68 for threshold0.5. (C) WCI68 for threshold0.8. (D) WCI68 for slope0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates¹¹ of the variability of a distribution Ĵ. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations s by the plug-in estimate ŝ; Maloney (1990), in addition to ŝ, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as s) are not robust: They are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity J might thus be seriously in error if only a single bootstrap estimate Ĵᵢ* is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.¹²
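The breakdown argument is easy to verify numerically; a small illustrative sketch (not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=100)

# Corrupt a single value by an arbitrarily large amount, as when a
# maximum-likelihood search gets stuck in a local minimum.
corrupted = sample.copy()
corrupted[0] = 1e6

# Moment-based estimates (mean, s) have breakdown 1/n:
# one bad sample moves them arbitrarily far.
mean_shift = abs(corrupted.mean() - sample.mean())

# The median has breakdown 1/2 and barely moves.
median_shift = abs(np.median(corrupted) - np.median(sample))
```

Here the mean shifts by roughly 10,000 while the median shifts by at most the gap between two adjacent order statistics.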

Nonparametric plug-in estimates are also not without problems. Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988). Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).
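The percentile method at issue simply reads confidence limits off the empirical quantiles of the bootstrap distribution; a hypothetical sketch (the simulated replication values below are stand-ins for real bootstrap output):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for B bootstrap replications of a fitted parameter.
boot = rng.normal(10.0, 0.5, size=1999)

# 95% percentile interval: the .025 and .975 quantiles of the
# bootstrap distribution.
ci95 = (np.percentile(boot, 2.5), np.percentile(boot, 97.5))

# 68% interval, matching the coverage of the familiar standard error bar
# (plus/minus one standard deviation of a Gaussian).
ci68 = (np.percentile(boot, 16), np.percentile(boot, 84))
```

With replications drawn from a Gaussian of standard deviation 0.5, the 68% percentile interval comes out close to 1.0 wide, as expected.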

In situations in which asymptotic confidence intervals are known to apply and to be correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988, Rasmussen, 1987, 1988, and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distribution of bootstrap estimates θ̂* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates Ĵ* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias raised problems particularly for the distribution of the β parameter of the Weibull, β̂* (N = 210, K = 7). We also found in our simulations that β̂*, and thus slopes ŝ*, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that φ̂ = m(θ̂) and φ = m(θ), resulting in

	φ̂ ~ N(φ − z₀σ_φ, σ_φ²),   (2)

where

	σ_φ = σ_φ₀ + a(φ − φ₀),   (3)

and φ₀ is any reference point on the scale of φ values. In Equation 3, z₀ is the bias correction term, and a in Equation 4 is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an e-level confidence interval endpoint of the BCa interval can be calculated as

	θ̂_BCa[e] = Ĝ⁻¹(CG(ẑ₀ + (ẑ₀ + z^(e)) / (1 − â(ẑ₀ + z^(e))))),   (4)

where CG is the cumulative Gaussian distribution function, z^(e) its 100·e percentile point, Ĝ⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ̂*, ẑ₀ is our estimate of the bias, and â is our estimate of the acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).
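Equation 4 is straightforward to implement. The sketch below is our illustrative code, not the authors': The statistic is a simple mean rather than a psychometric-function fit, ẑ₀ is estimated from the fraction of bootstrap replications below the original estimate, and â by the standard jackknife formula of Efron and Tibshirani (1993, chap. 14):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    # Cumulative Gaussian, Equation 4's CG.
    return 0.5 * (1 + erf(z / sqrt(2)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    # Inverse cumulative Gaussian by bisection (avoids a SciPy dependency).
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2

def bca_interval(data, stat, e_lo=0.16, e_hi=0.84, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    theta_hat = stat(data)
    boot = np.array([stat(rng.choice(data, size=len(data))) for _ in range(B)])
    # Bias correction z0: fraction of replications below the original estimate.
    z0 = norm_ppf(np.mean(boot < theta_hat))
    # Acceleration a from the jackknife.
    jack = np.array([stat(np.delete(data, i)) for i in range(len(data))])
    d = jack.mean() - jack
    a = (d ** 3).sum() / (6 * ((d ** 2).sum()) ** 1.5)
    def endpoint(e):
        ze = norm_ppf(e)
        adj = norm_cdf(z0 + (z0 + ze) / (1 - a * (z0 + ze)))  # Equation 4
        return np.quantile(boot, adj)                          # G^-1
    return endpoint(e_lo), endpoint(e_hi)

rng = np.random.default_rng(2)
data = rng.normal(5.0, 1.0, size=50)
lo, hi = bca_interval(data, np.mean)
```

Note that with ẑ₀ = â = 0, the adjusted levels reduce to e itself, and the endpoints fall back to the ordinary percentile interval Ĝ⁻¹(e).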

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an e-level confidence interval endpoint is simply Ĝ⁻¹(e) (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are superior or equal in performance to standard percentiles, not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for the sensitivity analysis described above.

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as the thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if, and only if, we choose as threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and, third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology conference, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference on Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100nth percentile of the bootstrap distribution α*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ±1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis when β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations in which one psychometric function, using a particular distribution function, provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using the BCa confidence intervals described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges of the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should nonetheless be representative of most typically used psychophysical settings.

11. A straightforward way to estimate a quantity J that is derived from a probability distribution F by J = t(F) is to obtain F̂ from empirical data and then use Ĵ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing, by virtue of their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)


around 130%. Each of these has at least one sample at p ≥ .95 and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by s6. Comparing s1 and s6: s6 is identical to scheme s1, except that one sample point was moved from .75 to .95. Still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120, very nearly optimal⁶ on most estimates. Both of these sampling schemes, s3 and s7, have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses, which could potentially bias the estimates if we were to fix the upper asymptote of the psychometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

[Figure 5]
Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(.8) (approximately equal to 90% correct during two-alternative forced choice).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold0.5, in order to obtain tight confidence intervals for threshold0.5), because estimation is done via the whole psychometric function, which in turn is estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
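The reason is the binomial variance of an observed proportion, p(1 − p)/n, which shrinks toward zero as p approaches 1 (or 0); for instance:

```python
# Variance of an observed proportion correct in a block of n trials is
# p * (1 - p) / n: largest at p = .5, smallest near the asymptotes, which is
# why high-p sample points constrain a maximum-likelihood fit so tightly.
n = 40
var = {p: p * (1 - p) / n for p in (0.5, 0.75, 0.95)}
```

At p = .95 the variance is less than a quarter of its value at p = .5 for the same number of trials.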

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting: We used Bayesian priors to limit α to be between 0 and 25 and β to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable given the experimental context (cf. Figure 2); it is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes⁷ but would not change them qualitatively.
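One simple way to impose such flat priors in a maximum-likelihood fit is to add the log-prior to the log-likelihood, which for flat bounded priors reduces to rejecting parameter values outside the bounds. A minimal sketch (our assumed implementation, not the authors' code):

```python
import numpy as np

# Flat priors on the Weibull parameters, as in the text: alpha in (0, 25),
# beta in (0, 15), lambda in (0, .06), endpoints excluded.
BOUNDS = {"alpha": (0.0, 25.0), "beta": (0.0, 15.0), "lam": (0.0, 0.06)}

def log_prior(alpha, beta, lam):
    ok = all(lo < v < hi
             for v, (lo, hi) in zip((alpha, beta, lam), BOUNDS.values()))
    return 0.0 if ok else -np.inf

def neg_log_posterior(params, xs, n, k, gamma=0.5):
    """Negative log-posterior for 2AFC data: k correct out of n at each x."""
    alpha, beta, lam = params
    lp = log_prior(alpha, beta, lam)
    if not np.isfinite(lp):
        return np.inf                       # the fit never leaves the prior's support
    F = 1 - np.exp(-(xs / alpha) ** beta)   # Weibull distribution function
    p = gamma + (1 - gamma - lam) * F
    return -(k * np.log(p) + (n - k) * np.log(1 - p)).sum() - lp

xs = np.array([4.0, 8.0, 12.0])
n = 20
k = np.array([11, 15, 19])
nll_ok = neg_log_posterior((10.0, 3.0, 0.01), xs, n, k)
nll_out = neg_log_posterior((30.0, 3.0, 0.01), xs, n, k)  # alpha outside prior
```

Any standard minimizer driven by this objective will stay inside the prior's support, which is what stabilized the slope estimates of sampling scheme s5 in Note 7.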

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) has only a minor influence on the estimates of variability. The importance of this cannot be overestimated, since a strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question: As experimenters, we do not know, and never will, the true underlying distribution function or objective function from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψW(x; θW), using the Weibull as F, with θW = {10, 3, .5, .01} (our "standard" generating function ψgen); (2) ψCG(x; θCG), using the cumulative Gaussian, with θCG = {8.875, 3.278, .5, .01}; (3) ψL(x; θL), using the logistic, with θL = {8.957, 2.014, .5, .01}; and, finally, (4) ψG(x; θG), using the Gumbel, with θG = {10.022, 2.906, .5, .01}. For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that at most one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.⁸ Note that this is not trivially true: Although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated

Table 1
Summary of Results of Monte Carlo Simulations

                                    N      s1     s2     s3     s4     s5     s6     s7
MWCI68 at x = F⁻¹(.5)              120    331    135    143    369    162    100    110
                                   240    209    163    126    339    133    100    113
                                   480    156    179    143    193    158    100    155
                                   960    124    141    140    138    152    100    119
MWCI68 at x = F⁻¹(.8)              120   5725    428    100   1761    180    263    115
                                   240   1388    205    100   1257    110    125    109
                                   480    385    181    115    478    100    116    137
                                   960    215    149    100    291    102    119    101
MWCI68 of dF/dx at x = F⁻¹(.5)     120    212    305    119    321    139    217    100
                                   240    228    263    100    254    128    185    121
                                   480    186    213    119    217    100    148    134
                                   960    186    175    113    201    100    151    118
Mean                                      779    211    118    485    130    140    119
Standard deviation                       1595     85     17    499     28     49     16
Median                                    214    180    117    306    131    122    117

Note. Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold0.8 and slope0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θgen, as sampled at the points θgen and φ1 … φ8. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψgen, using sampling scheme s2 with N = 120. Slope, threshold0.8, and threshold0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

	ψ̄ = ¼ (ψW + ψCG + ψL + ψG)

as the generating function and fitted psychometric functions, using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions, to each data set. From the fitted psychometric functions, we obtained estimates of threshold0.2, threshold0.5, threshold0.8, and slope0.5, as described previously.
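The setup just described can be sketched in a few lines: evaluate the four psychometric functions of Figure 6A, confirm that they nearly coincide, and draw binomial data from their mean ψ̄. The parameter decimals below are our reconstruction (the printed values lost their decimal points), and the stimulus placements are illustrative only, not one of the paper's sampling schemes:

```python
import numpy as np
from math import erf

gamma, lam = 0.5, 0.01            # 2AFC guess rate and lapse rate

# The four distribution functions of Figure 6A (parameter values assumed).
def weibull(x):   return 1 - np.exp(-(x / 10.0) ** 3.0)
def cum_gauss(x): return 0.5 * (1 + erf((x - 8.875) / (3.278 * np.sqrt(2))))
def logistic(x):  return 1 / (1 + np.exp(-(x - 8.957) / 2.014))
def gumbel(x):    return 1 - np.exp(-np.exp((x - 10.022) / 2.906))

def psi(F, x):                    # psi(x) = gamma + (1 - gamma - lam) * F(x)
    return gamma + (1 - gamma - lam) * F(x)

def psi_bar(x):                   # mean of the four generating functions
    return np.mean([psi(F, x) for F in (weibull, cum_gauss, logistic, gumbel)])

# The four functions nearly coincide over the stimulus range...
grid = np.linspace(2, 14, 121)
curves = np.array([[psi(F, x) for x in grid]
                   for F in (weibull, cum_gauss, logistic, gumbel)])
max_gap = (curves.max(axis=0) - curves.min(axis=0)).max()

# ...and one simulated data set is drawn from their mean: six blocks of
# n trials each (x placements illustrative).
rng = np.random.default_rng(0)
xs = np.array([4.0, 6.0, 8.0, 10.0, 11.0, 12.0])
n = 20
k_correct = np.array([rng.binomial(n, psi_bar(x)) for x in xs])
```

With these (assumed) parameter values, the four curves never differ by more than a few percentage points of proportion correct, which is what makes the dependence of WCI68 on F reported below so noteworthy.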

2 4 6 8 10 12 144

5

6

7

8

9

1

2 4 6 8 10 12 144

5

6

7

8

9

1

signal intensity

pro

po

rtio

n c

orr

ect

signal intensity

pro

po

rtio

n c

orr

ect

W (Weibull) = 10 3 05 001

CG (cumulative Gaussian) = 89 33 05 001

L (logistic) = 90 20 05 001

G (Gumbel) = 100 29 05 002

W (Weibull) = 84 34 05 005

L (logistic) = 78 10 05 005

A

B

Figure 6 (A) Four two-alternative forced-choice psychometric functions plotted onsemilogarithmic coordinates each has a different distribution function F (Weibull cu-mulative Gaussian logistic and Gumbel) See the text for details (B) A fit of two psy-chometric functions with different distribution functions F (Weibull logistic) to the samedata set generated from the mean of the four psychometric functions shown in panel Ausing sampling scheme s2 with N 5 120 See the text for details

THE PSYCHOMETRIC FUNCTION II 1325

ues of N and our seven sampling schemes were used re-sulting in 112 conditions (4 distribution functions 3 7sampling schemes 3 4 N values) In addition we repeatedthe above procedure 40 times to obtain an estimate of thenumerical variability intrinsic to our bootstrap routines9for a total of 4480 bootstrap repeats affording 8960000psychometric function fits (4032 3 109 simulated 2AFCtrials)

An analysis of variance was applied to the resulting datawith the number of trials N the sampling schemes s1 to s7and the distribution function F as independent factors(variables) The dependent variables were the confidenceinterval widths (WCI68) each cell contained the WCI68estimates from our 40 repetitions For all four dependentmeasuresmdashthreshold02 threshold05 threshold08 andslope05mdashnot only the first two factors the number oftrials N and the sampling scheme were as was expectedsignificant (p 0001) but also the distribution functionF and all possible interactions The three two-way inter-actions and the three-way interaction were similarly sig-nificant at p 0001 This result in itself however is notnecessarily damaging to the bootstrap method applied topsychophysical data because the significance is broughtabout by the very low (and desirable) variability of ourWCI68 estimates Model R2 is between 995 and 997implying that virtually all the variance in our simulationsis due to N sampling scheme F and interactions thereof

Rather than focusing exclusively on significance inTable 2 we provide information about effect sizemdashnamelythe percentage of the total sum of squares of variation ac-counted for by the different factors and their interactionsFor threshold05 and slope05 (columns 2 and 4) N samplingscheme and their interaction account for 9863 and9639 of the total variance respectively10 The choiceof distribution function F does not have despite being asignificant factor a large effect on WCI68 for thresh-old05 and slope05

The same is not true however for the WCI68 of thresh-old02 Here the choice of F has an undesirably large ef-fect on the bootstrap estimate of WCI68mdashits influence islarger than that of the sampling scheme usedmdashand only8436 of the variance is explained by N samplingscheme and their interaction Figure 7 finally summa-rizes the effect sizes of N sampling scheme and F graph-ically Each of the four panels of Figure 7 plots the WCI68

(normalized by dividing each WCI68 score by the largestmean WCI68) on the y-axis as a function of N on the x-axis the different symbols refer to the different samplingschemes The two symbols shown in each panel corre-spond to the sampling schemes that yielded the smallestand largest mean WCI68 (averaged across F and N) Thegray levels finally code the smallest (black) mean (gray)and largest (white) WCI68 for a given N and samplingscheme as a function of the distribution function F

For threshold05 and slope05 (Figure 7B and 7D) esti-mates are virtually unaffected by the choice of F but forthreshold02 the choice of F has a profound influence onWCI68 (eg in Figure 7A there is a difference of nearlya factor of two for sampling scheme s7 [triangles] whenN 5 120) The same is also true albeit to a lesser extentif one is interested in threshold08 Figure 7C shows the(again undesirable) interaction between sampling schemeand choice of F WCI68 estimates for sampling schemes5 (leftward triangles) show little influence of F but forsampling scheme s4 (rightward triangles) the choice of Fhas a marked influence on WCI68 It was generally thecase for threshold08 that those sampling schemes that re-sulted in small confidence intervals (s2 s3 s5 s6 and s7see the previous section) were less affected by F than werethose resulting in large confidence intervals (s1 and s4)

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose as threshold and slope the measures corresponding to threshold0.5 and slope0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like it to be. Second, away from the midpoint of F, estimates of variability are not as independent of the distribution function chosen as one might wish; this holds in particular for lower proportions correct (threshold0.2 is much more affected by the choice of F than threshold0.8; cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions, for example, 60% (or 90%) correct in 2AFC, and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions owing to the (arbitrary) choice of a distribution function that results in comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

| Factor                                      | Threshold0.2 | Threshold0.5 | Threshold0.8 | Slope0.5 |
| Number of trials N                          | 76.24        | 87.30        | 65.14        | 72.86    |
| Sampling schemes s1-s7                      | 6.60         | 10.00        | 19.93        | 19.69    |
| Distribution function F                     | 11.36        | 0.13         | 3.87         | 0.92     |
| Error (numerical variability in bootstrap)  | 0.34         | 0.35         | 0.46         | 0.34     |
| Interaction of N and sampling schemes s1-s7 | 1.18         | 0.98         | 5.97         | 3.50     |
| Sum of interactions involving F             | 4.28         | 1.24         | 4.63         | 2.69     |
| Percentage of SS accounted for without F    | 84.36        | 98.63        | 91.50        | 96.39    |

Note: The columns refer to threshold0.2, threshold0.5, threshold0.8, and slope0.5, that is, to approximately 60%, 75%, and 90% correct and the slope at 75% correct during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.

1326 WICHMANN AND HILL

[Figure 7 appears here; only the caption is reproduced.]

Figure 7. The four panels show the width of WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of WCI68 for a given sampling scheme and N. Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold0.2. (B) WCI68 for threshold0.5. (C) WCI68 for threshold0.8. (D) WCI68 for slope0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates11 of the variability of a distribution of bootstrap estimates ϑ̂*. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate s; Maloney (1990), in addition to s, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ) are not robust: They are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity ϑ might be seriously in error if only a single bootstrap estimate ϑ̂*_i is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.12
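The breakdown argument can be seen in a couple of lines: a single wild bootstrap replicate (e.g., a fit stuck in a local minimum) drags the mean arbitrarily far, while the median barely moves. The numbers below are invented for illustration:

```python
# Breakdown of moment-based vs. order-based estimators: one corrupted
# bootstrap replicate out of ten ruins the mean but leaves the median alone.
import statistics

replicates = [4.9, 5.0, 5.1, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.2]
corrupted = replicates[:-1] + [500.0]   # one failed fit, e.g. a local minimum

mean_clean, mean_bad = statistics.mean(replicates), statistics.mean(corrupted)
med_clean, med_bad = statistics.median(replicates), statistics.median(corrupted)
# mean: 5.02 -> 54.5 (breakdown 1/n); median: 5.0 -> 5.0 (breakdown 1/2)
```
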

Nonparametric plug-in estimates are also not without problems: Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14 and 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the bootstrap estimates θ̂* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates ϑ̂* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias raised problems particularly for the distribution of the β parameter of the Weibull (N = 210, K = 7). We also found in our simulations that β̂*, and thus the slope estimates ŝ*, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that φ̂ = m(θ̂) and φ = m(θ), resulting in

φ̂ ~ N(φ − z₀σ_φ, σ_φ²),    (2)

where

σ_φ = σ_φ₀ + a(φ − φ₀),    (3)

and φ₀ is any reference point on the scale of φ values. In Equation 3, z₀ is the bias-correction term, and a in Equation 4 is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an e-level confidence interval endpoint of the BCa interval can be calculated as

θ̂_BCa[e] = Ĝ⁻¹(Φ(ẑ₀ + (ẑ₀ + z^(e)) / (1 − â(ẑ₀ + z^(e))))),    (4)

where Φ is the cumulative Gaussian distribution function, Ĝ⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ̂*, ẑ₀ is our estimate of the bias, and â is our estimate of the acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).
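Equation 4 is straightforward to apply once the bootstrap replications are in hand. The sketch below is a hypothetical helper, not the authors' code: ẑ₀ is estimated from the proportion of replicates falling below the original estimate, while the acceleration â (normally obtained from a jackknife over the data) is passed in as an argument. With ẑ₀ = 0 and â = 0 the endpoint reduces to the ordinary percentile endpoint.

```python
# Sketch of a BCa interval endpoint (Equation 4). Assumes 0 < (number of
# replicates below theta_hat) < B, so the bias term is finite.
from statistics import NormalDist

def bca_endpoint(replicates, theta_hat, e, a=0.0):
    """e-level endpoint: G^-1( Phi( z0 + (z0 + z_e) / (1 - a*(z0 + z_e)) ) )."""
    phi = NormalDist()                      # standard Gaussian Phi
    b = len(replicates)
    z0 = phi.inv_cdf(sum(r < theta_hat for r in replicates) / b)   # bias term
    z_e = phi.inv_cdf(e)
    adj = phi.cdf(z0 + (z0 + z_e) / (1.0 - a * (z0 + z_e)))
    reps = sorted(replicates)               # G^-1 via a crude percentile rank
    return reps[min(b - 1, max(0, round(adj * b)))]
```

When the bootstrap distribution is unbiased and unskewed, the adjusted level `adj` equals `e` and plain percentiles come back; bias or skew in the replicates shifts the chosen percentile accordingly.
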

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an e-level confidence interval endpoint is simply Ĝ⁻¹[e] (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are superior or equal in performance to standard percentiles, not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable: Its coverage was least affected by variations in sampling scheme and in N, and it was the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for the sensitivity analysis described above.

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as the thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically gathered in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that the variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if, and only if, we choose as threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold0.2), the precise mathematical form of F exerts a noticeable, and undesirable, influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology Conference, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad. Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference of Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or of the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions fitted to data from observers who occasionally lapse, that is, display nonstationarity, is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100nth percentile of the bootstrap distribution α*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ± 1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis when β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations where one psychometric function, using a particular distribution function, provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using the BCa confidence intervals described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity ϑ, which is derived from a probability distribution F by ϑ = t(F), is to obtain F̂ from empirical data and then use ϑ̂ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing, by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)



…chometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold0.5, in order to obtain tight confidence intervals for threshold0.5), because estimation is done via the whole psychometric function, which in turn is estimated from the entire data set. Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).

All the simulations reported in this section were performed with the α and β parameters of the Weibull distribution function constrained during fitting: We used Bayesian priors to limit α to values between 0 and 25 and β to values between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable given the experimental context (cf. Figure 2); it is important to note, however, that allowing α and β to be unconstrained would only amplify the differences between the sampling schemes7 but would not change them qualitatively.
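One simple way to impose such box priors during a maximum-likelihood fit is to assign zero prior probability, and hence an infinite negative log-posterior, outside the permitted parameter ranges. The sketch below is ours, not the authors' code, and assumes the standard Weibull form for F:

```python
# Sketch: box (flat Bayesian) priors on the Weibull parameters during
# maximum-likelihood fitting. Values outside 0 < a < 25, 0 < b < 15 get
# zero prior probability, i.e. an infinite negative log-posterior.
import math

def neg_log_posterior(a, b, lam, data,
                      prior_ok=lambda a, b: 0 < a < 25 and 0 < b < 15):
    if not prior_ok(a, b):
        return math.inf                      # prior excludes these values
    nll = 0.0
    for x, n_correct, n_total in data:       # one row per stimulus level
        F = 1 - math.exp(-((x / a) ** b))    # Weibull distribution function
        p = 0.5 + (0.5 - lam) * F            # 2AFC psychometric function
        nll -= n_correct * math.log(p) + (n_total - n_correct) * math.log(1 - p)
    return nll
```

A minimizer handed this objective can never wander outside the box, which is what stabilized the slope estimates of sampling scheme s5 described in note 7.
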

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) have only a minor influence on the estimates of variability. The importance of this cannot be overstated, since a strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question: As experimenters, we do not know, and never will, the true underlying distribution function or objective function from which the empirical data were generated. The problem is illustrated in Figure 6. Figure 6A shows four different psychometric functions: (1) ψW(x; θW), using the Weibull as F, with θW = (10, 3, .5, .01) (our "standard" generating function ψgen); (2) ψCG(x; θCG), using the cumulative Gaussian, with θCG = (8.875, 3.278, .5, .01); (3) ψL(x; θL), using the logistic, with θL = (8.957, 2.014, .5, .01); and, finally, (4) ψG(x; θG), using the Gumbel, with θG = (10.022, 2.906, .5, .01). For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that at most one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.8 Note that this is not trivially true: Although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated

Table 1
Summary of Results of Monte Carlo Simulations

| Measure                         |   N | s1   | s2  | s3  | s4   | s5  | s6  | s7  |
| MWCI68 at x = F^-1(.5)          | 120 | 331  | 135 | 143 | 369  | 162 | 100 | 110 |
|                                 | 240 | 209  | 163 | 126 | 339  | 133 | 100 | 113 |
|                                 | 480 | 156  | 179 | 143 | 193  | 158 | 100 | 155 |
|                                 | 960 | 124  | 141 | 140 | 138  | 152 | 100 | 119 |
| MWCI68 at x = F^-1(.8)          | 120 | 5725 | 428 | 100 | 1761 | 180 | 263 | 115 |
|                                 | 240 | 1388 | 205 | 100 | 1257 | 110 | 125 | 109 |
|                                 | 480 | 385  | 181 | 115 | 478  | 100 | 116 | 137 |
|                                 | 960 | 215  | 149 | 100 | 291  | 102 | 119 | 101 |
| MWCI68 of dF/dx at x = F^-1(.5) | 120 | 212  | 305 | 119 | 321  | 139 | 217 | 100 |
|                                 | 240 | 228  | 263 | 100 | 254  | 128 | 185 | 121 |
|                                 | 480 | 186  | 213 | 119 | 217  | 100 | 148 | 134 |
|                                 | 960 | 186  | 175 | 113 | 201  | 100 | 151 | 118 |
| Mean                            |     | 779  | 211 | 118 | 485  | 130 | 140 | 119 |
| Standard deviation              |     | 1595 | 85  | 17  | 499  | 28  | 49  | 16  |
| Median                          |     | 214  | 180 | 117 | 306  | 131 | 122 | 117 |

Note: Columns correspond to the seven sampling schemes and are marked by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; the next eight rows similarly contain the MWCI68 at threshold0.8 and for slope0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θgen, as sampled at the points θgen and φ1, …, φ8. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψgen, using sampling scheme s2 with N = 120. Slope, threshold0.8, and threshold0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

ψ = ¼ (ψW + ψCG + ψL + ψG)

as the generating function, and fitted psychometric functions using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions to each data set. From the fitted psychometric functions, we obtained estimates of threshold0.2, threshold0.5, threshold0.8, and slope0.5, as described previously. All four different values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 values of N). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,9 for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10⁹ simulated 2AFC trials).

[Figure 6 appears here; only the caption is reproduced.]

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel). See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull, logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A, using sampling scheme s2 with N = 120. See the text for details.
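How close are the four generating functions of Figure 6A? One can evaluate them directly. The sketch below assumes standard parameterizations for the four sigmoids (Weibull 1 − exp(−(x/a)^b), Gumbel 1 − exp(−exp((x − a)/b)), and so on; the paper's exact algebraic forms may differ in detail) together with the 2AFC mapping ψ(x) = .5 + (1 − .5 − λ)F(x):

```python
# Sketch: the four sigmoids of Figure 6A, under assumed standard
# parameterizations, differ by only a few percent over the stimulus range.
import math

def psi(F, lam=0.01):
    """2AFC psychometric function: guess rate .5, lapse rate lam."""
    return lambda x: 0.5 + (1 - 0.5 - lam) * F(x)

weibull   = psi(lambda x: 1 - math.exp(-((x / 10.0) ** 3)))
cum_gauss = psi(lambda x: 0.5 * (1 + math.erf((x - 8.875) / (3.278 * math.sqrt(2)))))
logistic  = psi(lambda x: 1 / (1 + math.exp(-(x - 8.957) / 2.014)))
gumbel    = psi(lambda x: 1 - math.exp(-math.exp((x - 10.022) / 2.906)))

funcs = [weibull, cum_gauss, logistic, gumbel]
xs = [2 + 0.1 * i for i in range(121)]        # stimulus range of Figure 6
max_gap = max(abs(f(x) - g(x)) for x in xs for f in funcs for g in funcs)
# max_gap is on the order of a few percent, well below the binomial noise
# of a typical psychophysical data set
```
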

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold0.2, threshold0.5, threshold0.8, and slope0.5), not only were the first two factors, the number of trials N and the sampling scheme, significant (p < .0001), as was expected, but so were the distribution function F and all possible interactions: The three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates: Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance inTable 2 we provide information about effect sizemdashnamelythe percentage of the total sum of squares of variation ac-counted for by the different factors and their interactionsFor threshold05 and slope05 (columns 2 and 4) N samplingscheme and their interaction account for 9863 and9639 of the total variance respectively10 The choiceof distribution function F does not have despite being asignificant factor a large effect on WCI68 for thresh-old05 and slope05

The same is not true however for the WCI68 of thresh-old02 Here the choice of F has an undesirably large ef-fect on the bootstrap estimate of WCI68mdashits influence islarger than that of the sampling scheme usedmdashand only8436 of the variance is explained by N samplingscheme and their interaction Figure 7 finally summa-rizes the effect sizes of N sampling scheme and F graph-ically Each of the four panels of Figure 7 plots the WCI68

(normalized by dividing each WCI68 score by the largestmean WCI68) on the y-axis as a function of N on the x-axis the different symbols refer to the different samplingschemes The two symbols shown in each panel corre-spond to the sampling schemes that yielded the smallestand largest mean WCI68 (averaged across F and N) Thegray levels finally code the smallest (black) mean (gray)and largest (white) WCI68 for a given N and samplingscheme as a function of the distribution function F

For threshold05 and slope05 (Figure 7B and 7D) esti-mates are virtually unaffected by the choice of F but forthreshold02 the choice of F has a profound influence onWCI68 (eg in Figure 7A there is a difference of nearlya factor of two for sampling scheme s7 [triangles] whenN 5 120) The same is also true albeit to a lesser extentif one is interested in threshold08 Figure 7C shows the(again undesirable) interaction between sampling schemeand choice of F WCI68 estimates for sampling schemes5 (leftward triangles) show little influence of F but forsampling scheme s4 (rightward triangles) the choice of Fhas a marked influence on WCI68 It was generally thecase for threshold08 that those sampling schemes that re-sulted in small confidence intervals (s2 s3 s5 s6 and s7see the previous section) were less affected by F than werethose resulting in large confidence intervals (s1 and s4)

Two main conclusions can be drawn from these simu-lations First in the absence of any other constraints ex-perimenters should choose as threshold and slope mea-sures corresponding to threshold05 and slope05 becauseonly then are the main factors influencing the estimates ofvariability the number and placement of stimuli as wewould like it to be Second away from the midpoint of Festimates of variability are however not as independentof the distribution function chosen as one might wishmdashin particular for lower proportions correct (threshold02 ismuch more affected by the choice of F than threshold08cf Figures 7A and 7C) If very low (or perhaps very high)response thresholds must be used when comparing ex-perimental conditionsmdashfor example 60 (or 90) cor-rect in 2AFCmdashand only small differences exist betweenthe different experimental conditions this requires the ex-ploration of a number of distribution functions F to avoidfinding significant differences between conditions owingto the (arbitrary) choice of a distribution functionrsquos result-ing in comparatively narrow confidence intervals

Table 2 Summary of Analysis of Variance Effect Size (Sum of Squares [s5] Normalized to 100)

Factor Threshold02 Threshold05 Threshold08 Slope05

Number of trials N 7624 8730 6514 7286Sampling schemes s1 s7 660 1000 1993 1969Distribution function F 1136 013 387 092Error (numerical variability in bootstrap) 034 035 046 034Interaction of N and sampling schemes s1 s7 118 098 597 350Sum of interactions involving F 428 124 463 269Percentage of SS accounted for without F 8436 9863 915 9639

NotemdashThe columns refer to threshold02 threshold05 threshold08 and slope05mdashthat is to approximately 6075 and 90 correct and the slope at 75 correct during two-alternative forced choice respectively Rowscorrespond to the independent variables their interactions and summary statistics See the text for details

1326 WICHMANN AND HILL

total number of observations N

thre

sho

ld (F

05

) WC

I 68-1

thre

sho

ld (F

02

) WC

I 68-1

thre

sho

ld (F

08

) WC

I 68

-1

slo

pe

WC

I 68

total number of observations N

120 240 480 9600

02

04

06

08

1

12

14

120 240 480 9600

02

04

06

08

1

12

14

120 240 480 9600

02

04

06

08

1

12

14

120 240 480 9600

02

04

06

08

1

12

14

C D

BA

Color Code

largest WCI68mean WCI68minimal WCI68

mean WCI68 as a function of N

range of WCI68rsquos for a givensampling scheme as a functionof F

Symbols refer to the sampling scheme

Figure 7 The four panels show the width of WCI68 as a function of N and sampling scheme as well as of the distribution functionF The symbols refer to the different sampling schemes as in Figure 2 gray levels code the influence of the distribution function F onthe width of WCI68 for a given sampling scheme and N Black symbols show the smallest WCI68 obtained middle gray the mean WCI68across the four distribution functions used and white the largest WCI68 (A) WCI68 for threshold02 (B) WCI68 for threshold05 (C)WCI68 for threshold08 (D) WCI68 for slope05

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates11 of the variability of a distribution ϑ̂*. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate σ̂; Maloney (1990), in addition to σ̂, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ) are not robust: They are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity ϑ might be seriously in error if only a single bootstrap estimate ϑ̂*_i is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.12
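The fragility of moment-based summaries is easy to demonstrate numerically. The Python sketch below (our illustration, not code from the paper) builds a mock bootstrap distribution in which a single replicate has gone badly wrong, as can happen when the search sticks in a local minimum: the standard deviation is ruined, while a percentile-based 68% width is essentially unchanged.

```python
import random
import statistics

random.seed(0)

# Mock bootstrap distribution of a threshold estimate: most replicates
# cluster around 10, but one replicate has gone badly wrong.
boot = [random.gauss(10.0, 0.5) for _ in range(1999)]
boot.append(1e6)  # a single rogue replicate (breakdown of 1/n in action)

def percentile(xs, p):
    """Plug-in percentile: the value below which a fraction p of xs fall."""
    xs = sorted(xs)
    return xs[min(int(p * len(xs)), len(xs) - 1)]

sd = statistics.stdev(boot)  # moment-based summary: dominated by the outlier
w68 = percentile(boot, 0.84) - percentile(boot, 0.16)  # percentile-based width

print(f"moment-based SD: {sd:.1f}, percentile 68% width: {w68:.2f}")
```

With these numbers the standard deviation lands in the tens of thousands while the percentile-based width stays near 1, which is why nonparametric plug-in estimates tolerate such rogue replicates.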

Nonparametric plug-in estimates are also not without problems. Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12–14 and 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distribution of bootstrap estimates θ̂* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates ϑ̂* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias particularly raised problems for the distribution of the β parameter of the Weibull, β̂ (N = 210, K = 7). We also found in our simulations that β̂, and thus slopes ŝ, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that Φ̂ = m(θ̂) and Φ = m(θ), resulting in

Φ̂ ~ N(Φ − z₀σ_Φ, σ_Φ²),  (2)

where

σ_Φ = 1 + a(Φ − Φ₀),  (3)

and Φ₀ is any reference point on the scale of Φ values. In Equation 2, z₀ is the bias-correction term, and a in Equation 3 is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an e-level confidence interval endpoint of the BCa interval can be calculated as

θ̂_BCa[e] = Ĝ⁻¹( CG( ẑ₀ + (ẑ₀ + CG⁻¹(e)) / (1 − â(ẑ₀ + CG⁻¹(e))) ) ),  (4)

where CG is the cumulative Gaussian distribution function, Ĝ⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ̂*, ẑ₀ is our estimate of the bias, and â is our estimate of the acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).
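Equation 4 translates into code almost line by line. The sketch below is our own minimal Python illustration of the BCa recipe, with ẑ₀ taken from the bootstrap distribution and â from a jackknife as in Efron and Tibshirani (1993); the mean of a small sample stands in for a fitted psychometric threshold, and all names are ours.

```python
from statistics import NormalDist, mean
import random

norm = NormalDist()  # standard cumulative Gaussian: CG and its inverse

def bca_interval(data, stat, level=0.683, B=2000, rng=random.Random(1)):
    """Sketch of a BCa interval for a scalar statistic `stat` of `data`."""
    n = len(data)
    theta_hat = stat(data)

    # Bootstrap replications theta*_i, sorted so percentiles can be indexed.
    boot = sorted(stat([rng.choice(data) for _ in range(n)]) for _ in range(B))

    # Bias-correction z0: where the original estimate sits in the
    # bootstrap distribution (z0 = 0 if it sits at the median).
    prop = sum(t < theta_hat for t in boot) / B
    z0 = norm.inv_cdf(prop)

    # Acceleration a from the jackknife values of the statistic.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def endpoint(e):
        z = norm.inv_cdf(e)
        adj = norm.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
        return boot[min(int(adj * B), B - 1)]

    return endpoint((1 - level) / 2), endpoint(1 - (1 - level) / 2)

# Usage: 68.3% BCa interval for the mean of a small, skewed sample.
sample = [1.1, 1.3, 1.2, 2.9, 1.4, 1.6, 1.2, 3.5, 1.3, 1.8]
lo, hi = bca_interval(sample, lambda d: sum(d) / len(d))
```

The percentile method would instead index the sorted replications at e directly; the ẑ₀ and â terms shift those indices to compensate for bias and skew in the bootstrap distribution.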

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an e-level confidence interval endpoint is simply Ĝ⁻¹[e] (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equally good in performance to standard percentiles, but not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and the least imbalanced (i.e., probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for sensitivity analysis, as described above.
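The kind of coverage test Hill ran can be illustrated in miniature (our toy version, with the mean of a Gaussian sample standing in for a psychometric function fit): draw many data sets from a known population, form a percentile bootstrap interval for each, and count how often the true value lands inside, below, or above the interval.

```python
import random
import statistics

rng = random.Random(7)
TRUE_MEAN, N, B, LEVEL = 0.0, 20, 400, 0.683
TRIALS = 200

below = above = inside = 0
for _ in range(TRIALS):
    data = [rng.gauss(TRUE_MEAN, 1.0) for _ in range(N)]
    # Percentile bootstrap interval for the mean of this data set.
    boot = sorted(statistics.fmean(rng.choices(data, k=N)) for _ in range(B))
    lo = boot[int((1 - LEVEL) / 2 * B)]
    hi = boot[int((1 + LEVEL) / 2 * B)]
    if TRUE_MEAN < lo:
        below += 1
    elif TRUE_MEAN > hi:
        above += 1
    else:
        inside += 1

coverage = inside / TRIALS               # ideally close to LEVEL
imbalance = abs(below - above) / TRIALS  # ideally close to 0
```

A reliable method keeps coverage near the nominal level and imbalance near zero across sampling schemes and N; the simple percentile method typically undercovers slightly, which is the kind of defect the BCa correction targets.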

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and the derived measures, such as thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical, because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if and only if we choose threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold_0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals that improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.
Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.
Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.
Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.
Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.
Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.
Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.
Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.
Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.
Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.
Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at Computers in Psychology, York, UK.
Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.
Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.
Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.
Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.
McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.
Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.
Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad. Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.
Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.
Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference of Visual Perception, Tübingen.
Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.
Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions to data from observers who occasionally lapse, that is, display nonstationarity, is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for a, for example, was determined simply by [â*(.025), â*(.975)], where â*(n) denotes the 100nth percentile of the bootstrap distribution â*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ± 1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis if β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations where one psychometric function using a particular distribution function provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using BCa confidence intervals, described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120–960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity ϑ, which is derived from a probability distribution F by ϑ = t(F), is to obtain F̂ from empirical data and then use ϑ̂ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)



from one of such similar psychometric functions during the bootstrap procedure. Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψ_gen, using sampling scheme s2 with N = 120.

Slope, threshold_0.8, and threshold_0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations using

ψ = ¼ [ψ_W + ψ_CG + ψ_L + ψ_G]

as the generating function and fitted psychometric functions, using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions, to each data set. From the fitted psychometric functions, we obtained estimates of threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5, as described previously.

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates (proportion correct as a function of signal intensity); each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel). See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull, logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A, using sampling scheme s2 with N = 120. See the text for details.

All four different values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 N values). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,9 for a total of 4,480 bootstrap repeats, affording 8,960,000 psychometric function fits (4.032 × 10⁹ simulated 2AFC trials).
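The generating function is simply the pointwise mean of four 2AFC psychometric functions that share the guessing rate γ = .5 but differ in F. A Python sketch of this construction; the α, β, and λ values below are illustrative placeholders, not the exact parameter values used for Figure 6:

```python
import math

# 2AFC psychometric function: psi(x) = gamma + (1 - gamma - lambda) * F(x),
# with gamma = 0.5 (guessing rate) and a small lapse rate lambda.
# The alpha/beta values below are placeholders for illustration only.
GAMMA, LAPSE = 0.5, 0.01

def weibull(x, a=10.0, b=3.0):
    return 1.0 - math.exp(-((x / a) ** b))

def cum_gauss(x, mu=9.0, sigma=3.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def logistic(x, mu=9.0, s=2.0):
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def gumbel(x, mu=10.0, s=3.0):  # log-Weibull form of the Gumbel
    return 1.0 - math.exp(-math.exp((x - mu) / s))

def psi(F, x):
    return GAMMA + (1.0 - GAMMA - LAPSE) * F(x)

def psi_gen(x):
    """Generating function: the mean of the four psychometric functions."""
    return sum(psi(F, x) for F in (weibull, cum_gauss, logistic, gumbel)) / 4.0
```

Because ψ_gen is an average, none of the four candidate distribution functions fits the generated data exactly, which is what makes the comparison across F informative.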

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold_0.2, threshold_0.5, threshold_0.8, and slope_0.5), not only the first two factors, the number of trials N and the sampling scheme, were, as was expected, significant (p < .0001), but also the distribution function F and all possible interactions. The three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates. Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold_0.5 and slope_0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.10 The choice of distribution function F does not have, despite being a significant factor, a large effect on WCI68 for threshold_0.5 and slope_0.5.
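The normalization in Table 2 is straightforward to reproduce: each effect's sum of squares is expressed as a percentage of the total. A sketch using the threshold_0.5 column of Table 2:

```python
# Sum-of-squares effect sizes from the threshold_0.5 column of Table 2,
# already normalized so that they total 100%.
ss = {
    "number of trials N": 87.30,
    "sampling scheme": 10.00,
    "distribution function F": 0.13,
    "error (bootstrap variability)": 0.35,
    "N x sampling scheme": 0.98,
    "interactions involving F": 1.24,
}

total = sum(ss.values())  # 100.00 by construction
pct = {k: 100.0 * v / total for k, v in ss.items()}

# Percentage of SS accounted for without any term involving F, as in the
# table's last row (note that this sum includes the small error term).
without_F = (pct["number of trials N"] + pct["sampling scheme"]
             + pct["N x sampling scheme"] + pct["error (bootstrap variability)"])
```

Summing the non-F rows this way reproduces the table's 98.63% for threshold_0.5; repeating it column by column reproduces the 84.36%, 91.50%, and 96.39% figures as well.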

The same is not true, however, for the WCI68 of threshold_0.2. Here the choice of F has an undesirably large effect on the bootstrap estimate of WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically. Each of the four panels of Figure 7 plots the WCI68 (normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme, as a function of the distribution function F.

For threshold_0.5 and slope_0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold_0.2 the choice of F has a profound influence on WCI68 (e.g., in Figure 7A there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold_0.8. Figure 7C shows the (again undesirable) interaction between sampling scheme and choice of F: WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles) the choice of F has a marked influence on WCI68. It was generally the case for threshold_0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).


Hil l N J (2001a May) An investigation of bootstrap interval coverageand sampling efficiency in psychometric functions Poster presented atthe Annual Meeting of the Vision Sciences Society Sarasota FL

Hil l N J (2001b) Testing hypotheses about psychometric functionsAn investigation of some confidence interval methods their validityand their use in assessing optimal sampling strategies Forthcomingdoctoral dissertation Oxford University

Hil l N J amp Wichma nn F A (1998 April) A bootstrap method fortesting hypotheses concerning psychometric functions Paper presentedat the Computers in Psychology York UK

Hinkl ey D V (1988) Bootstrap methods Journal of the Royal Statis-tical Society B 50 321-337

Kendal l M K amp St uar t A (1979) The advanced theory of statisticsVol 2 Inference and relationship New York Macmillan

La m C F Mil l s J H amp Dubno J R (1996) Placement of obser-vations for the efficient estimation of a psychometric function Jour-nal of the Acoustical Society of America 99 3689-3693

Mal oney L T (1990) Confidence intervals for the parameters of psy-chometric functions Perception amp Psychophysics 47 127-134

McKee S P Kl ein S A amp Tel l er D Y (1985) Statistical proper-ties of forced-choice psychometric functions Implications of probitanalysis Perception amp Psychophysics 37 286-298

Ra smussen J L (1987) Estimating correlation coefficients Bootstrapand parametric approaches Psychological Bulletin 101 136-139

Rasmussen J L (1988) Bootstrap confidence intervals Good or badComments on Efron (1988) and Strube (1988) and further evaluationPsychological Bulletin 104 297-299

St r u be M J (1988) Bootstrap type I error rates for the correlation co-efficient An examination of alternate procedures Psychological Bul-letin 104 290-292

Tr eu t wein B (1995 August) Error estimates for the parameters ofpsychometric functions from a single session Poster presented at theEuropean Conference of Visual Perception Tuumlbingen

Tr eu t wein B amp St r asbu r ger H (1999 September) Assessing thevariability of psychometric functions Paper presented at the Euro-pean Mathematical Psychology Meeting Mannheim

Wichmann F A amp Hil l N J (2001) The psychometric function I Fit-ting sampling and goodness of fit Perception amp Psychophysics 631293-1313

NOTES

1 Under different circumstances and in the absence of a model of thenoise andor the process from which the data stem this is frequently thebest one can do

2 This is true of course only if we have reason to believe that ourmodel actually is a good model of the process under study This importantissue is taken up in the section on goodness of fit in our companion paper

3 This holds only if our model is correct Maximum-likelihood pa-rameter estimation for two-parameter psychometric functions to data fromobservers who occasionally lapsemdashthat is display nonstationaritymdashis as-ymptotically biased as we show in our companion paper together witha method to overcome such bias (Wichmann amp Hill 2001)

4 Confidence intervals here are computed by the bootstrap percentilemethod The 95 confidence interval for a for example was deter-mined simply by [a(025) a(975)] where a(n) denotes the 100nth per-centile of the bootstrap distribution a

5 Because this is the approximate coverage of the familiarity stan-dard error bar denoting onersquos original estimate 6 1 standard deviationof a Gaussian 68 was chosen

6 Optimal here of course means relative to the sampling schemesexplored

7 In our initial simulations we did not constrain a and b other than tolimit them to be positive (the Weibull function is not defined for negativeparameters) Sampling schemes s1 and s4 were even worse and s3 and s7were relative to the other schemes even more superior under noncon-strained fitting conditions The only substantial difference was perfor-mance of sampling scheme s5 It does very well now but its slope esti-mates were unstable during sensitivity analysis of b was unconstrainedThis is a good example of the importance of (sensible) Bayesian assump-tions constraining the fit We wish to thank Stan Klein for encouraging usto redo our simulations with a and b constrained

8 Of course there might be situations where one psychometric func-tion using a particular distribution function provides a significantly bet-

THE PSYCHOMETRIC FUNCTION II 1329

ter fit to a given data set than do others using different distribution func-tions Differences in bootstrap estimates of variability in such cases arenot worrisome The appropriate estimates of variability are those of thebest-fitting function

9 WCI68 estimates were obtained using BCa confidence intervalsdescribed in the next section

10 Clearly the above-reported effect sizes are tied to the ranges inthe factors explored N spanned a comparatively large range of 120ndash960observations or a factor of 8 whereas all of our sampling schemes wereldquoreasonablerdquo inclusion of ldquounrealisticrdquo or ldquounusualrdquo sampling schemesmdashfor example all x values such that nominal y values are below 055mdashwould have increased the percentage of variation accounted for by sam-pling scheme Taken together N and sampling scheme should be repre-sentative of most typically used psychophysical settings however

11 A straightforward way to estimate a quantity J which is derivedfrom a probability distribution F by J 5 t(F ) is to obtain F from em-pirical data and then use J 5 t(F ) as an estimate This is called a plug-in estimate

12 Foster and Bischof (1987) report problems with local minima whichthey overcame by discarding bootstrap estimates that were larger than 20times the stimulus range (42 of their datapoints had to be removed)Nonparametric estimates naturally avoid having to perform posthoc datasmoothing by their resilience to such infrequent but extreme outliers

(Manuscript received June 10 1998 revision accepted for publication February 27 2001)

Page 12: Wichmann (2001) The psychometric function. Ii. …wexler.free.fr/library/files/wichmann (2001) the...parameters and derived quantities, such as thresholds and slopes. First, we outline

THE PSYCHOMETRIC FUNCTION II 1325

All four values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 values of N). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines (Note 9), for a total of 4,480 bootstrap repeats affording 8,960,000 psychometric function fits (4.032 × 10^9 simulated 2AFC trials).
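The bookkeeping behind these counts can be verified in a few lines. One assumption is made here that is not stated in this passage: each repeat is taken to comprise 2,000 fits (e.g., 1,999 bootstrap fits plus one fit to the original simulated data set), which is what the quoted totals imply.

```python
# Sanity-check the simulation bookkeeping quoted in the text.
n_values = [120, 240, 480, 960]          # the four values of N
conditions = 4 * 7 * len(n_values)       # distribution functions x sampling schemes x N
repeats = conditions * 40                # 40 repetitions per condition
fits_per_repeat = 2000                   # assumption: 1,999 bootstrap fits + 1 original fit
fits = repeats * fits_per_repeat
# mean N across the four values is 450 trials per fit
trials = repeats * (sum(n_values) // len(n_values)) * fits_per_repeat

print(conditions, repeats, fits, trials)  # 112 4480 8960000 4032000000
```

The mean-N shortcut works because the design is balanced: each value of N occurs in the same number of conditions.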

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold0.2, threshold0.5, threshold0.8, and slope0.5), not only were the first two factors, the number of trials N and the sampling scheme, significant (p < .0001), as was expected, but so were the distribution function F and all possible interactions: the three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates. Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold0.5 and slope0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively (Note 10). The choice of distribution function F does not have, despite being a significant factor, a large effect on WCI68 for threshold0.5 and slope0.5.

The same is not true, however, for the WCI68 of threshold0.2. Here the choice of F has an undesirably large effect on the bootstrap estimate of WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically. Each of the four panels of Figure 7 plots the WCI68 (normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme as a function of the distribution function F.

For threshold0.5 and slope0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold0.2 the choice of F has a profound influence on WCI68 (e.g., in Figure 7A there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold0.8. Figure 7C shows the (again undesirable) interaction between sampling scheme and choice of F: WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles) the choice of F has a marked influence on WCI68. It was generally the case for threshold0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose as threshold and slope the measures corresponding to threshold0.5 and slope0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like it to be. Second, away from the midpoint of F, estimates of variability are not as independent of the distribution function chosen as one might wish; in particular, this holds for lower proportions correct (threshold0.2 is much more affected by the choice of F than threshold0.8; cf. Figures 7A and 7C). If very low (or perhaps very high) response thresholds must be used when comparing experimental conditions (for example, 60% [or 90%] correct in 2AFC), and only small differences exist between the different experimental conditions, this requires the exploration of a number of distribution functions F, to avoid finding significant differences between conditions owing to the (arbitrary) choice of a distribution function resulting in comparatively narrow confidence intervals.
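The second point has a simple geometric root: two distribution functions matched in threshold and slope at F = .5 agree there by construction, yet their inverses pull apart toward the tails, so any quantity defined at F = .2 inherits a dependence on the form of F. A minimal sketch with a logistic and a cumulative Gaussian (illustrative parameter values and helper names only, not the functions fitted in this study):

```python
from statistics import NormalDist
import math

m, s = 10.0, 1.0                        # logistic midpoint and spread
# match the cumulative Gaussian's midpoint slope: 1/(4s) = 1/(sigma*sqrt(2*pi))
sigma = 4 * s / math.sqrt(2 * math.pi)

def logistic_inv(p):
    """Stimulus value at which the logistic F reaches p."""
    return m + s * math.log(p / (1 - p))

def gauss_inv(p):
    """Stimulus value at which the matched cumulative Gaussian reaches p."""
    return m + sigma * NormalDist().inv_cdf(p)

for p in (0.2, 0.5, 0.8):
    # identical at p = .5 by construction; they diverge in the tails
    print(p, round(logistic_inv(p), 3), round(gauss_inv(p), 3))
```

The divergence grows the further p lies from .5, which is the intuition behind the sensitivity of threshold0.2 to the choice of F.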

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                             Threshold0.2  Threshold0.5  Threshold0.8  Slope0.5
Number of trials N                                        76.24         87.30         65.14     72.86
Sampling schemes s1 ... s7                                 6.60         10.00         19.93     19.69
Distribution function F                                   11.36          0.13          3.87      0.92
Error (numerical variability in bootstrap)                 0.34          0.35          0.46      0.34
Interaction of N and sampling schemes s1 ... s7            1.18          0.98          5.97      3.50
Sum of interactions involving F                            4.28          1.24          4.63      2.69
Percentage of SS accounted for without F                  84.36         98.63         91.50     96.39

Note. The columns refer to threshold0.2, threshold0.5, threshold0.8, and slope0.5, that is, to approximately 60%, 75%, and 90% correct, and to the slope at 75% correct, during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.
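Effect sizes of the kind shown in Table 2 are simply shares of the total sum of squares in a balanced factorial ANOVA. A toy decomposition for two factors and their interaction (synthetic numbers standing in for WCI68 scores, not the study's simulation output):

```python
import numpy as np

rng = np.random.default_rng(0)
a_levels, b_levels, reps = 4, 7, 40        # e.g., N x sampling scheme, 40 repetitions
# synthetic cell data: additive factor effects plus noise
y = (rng.normal(0, 0.05, (a_levels, b_levels, reps))
     + np.linspace(0, 1, a_levels)[:, None, None]      # strong factor A effect
     + np.linspace(0, 0.3, b_levels)[None, :, None])   # weaker factor B effect

grand = y.mean()
ma = y.mean(axis=(1, 2))                   # factor A marginal means
mb = y.mean(axis=(0, 2))                   # factor B marginal means
cell = y.mean(axis=2)                      # cell means

ss_total = ((y - grand) ** 2).sum()
ss_a = b_levels * reps * ((ma - grand) ** 2).sum()
ss_b = a_levels * reps * ((mb - grand) ** 2).sum()
ss_ab = reps * ((cell - ma[:, None] - mb[None, :] + grand) ** 2).sum()
ss_err = ss_total - ss_a - ss_b - ss_ab    # within-cell (error) sum of squares

pct = 100 * np.array([ss_a, ss_b, ss_ab, ss_err]) / ss_total   # shares, as in Table 2
```

By construction the four shares sum to 100%, which is exactly the normalization used in the table.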


[Figure 7 appears here: four panels plotting normalized WCI68 (ordinate, 0.2 to 1.4) against the total number of observations N (abscissa: 120, 240, 480, 960).]

Figure 7. The four panels show the width of WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of WCI68 for a given sampling scheme and N. Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold0.2. (B) WCI68 for threshold0.5. (C) WCI68 for threshold0.8. (D) WCI68 for slope0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates (Note 11) of the variability of a distribution J*. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate σ̂; Maloney (1990), in addition to σ̂, uses a comparable nonparametric estimate obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ) are not robust: they are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity J might be seriously in error if only a single bootstrap estimate J*i is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface (Note 12).
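The breakdown argument is easy to demonstrate numerically. In the sketch below, the numbers are synthetic and stand in for a bootstrap distribution of estimates in which one replication has gone badly wrong (e.g., a stuck search):

```python
import numpy as np

rng = np.random.default_rng(42)
boot = rng.normal(10.0, 1.0, 1999)      # a well-behaved bootstrap distribution
bad = np.append(boot, 1e6)              # plus one runaway estimate

# the moment-based spread is wrecked by the single outlier ...
sd_clean, sd_bad = boot.std(), bad.std()

# ... whereas a percentile-based 68.3% width barely moves
def width683(x):
    lo, hi = np.percentile(x, [15.85, 84.15])
    return hi - lo

w_clean, w_bad = width683(boot), width683(bad)
```

One replication out of 2,000 inflates the standard deviation by orders of magnitude, while the percentile-based width shifts only by roughly one order statistic.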

Nonparametric plug-in estimates are also not without problems: percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distribution of bootstrap estimates θ̂* to be biased and skewed estimators of the generating values θ̂. The same applies to the bootstrap distributions of estimates J* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias particularly raised problems for the distribution of the β parameter of the Weibull, β̂* (N = 210, K = 7). We also found in our simulations that β̂*, and thus the slopes ŝ*, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that \hat{\Phi} = m(\hat{\theta}) and \Phi = m(\theta), resulting in

    \hat{\Phi} \sim N(\Phi - z_0 \sigma_\Phi, \; \sigma_\Phi^2),    (2)

where

    \sigma_\Phi = \sigma_{\Phi_0} + a(\Phi - \Phi_0),    (3)

and \Phi_0 is any reference point on the scale of \Phi values. In Equation 2, z_0 is the bias-correction term, and a in Equation 3 is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an e-level confidence interval endpoint of the BCa interval can be calculated as

    \hat{\theta}_{BCa}[e] = G^{-1}\left( CG\left( z_0 + \frac{z_0 + z^{(e)}}{1 - a(z_0 + z^{(e)})} \right) \right),    (4)

where CG is the cumulative Gaussian distribution function, G^{-1} is the inverse of the cumulative distribution function of the bootstrap replications θ̂*, z_0 is our estimate of the bias, and a is our estimate of the acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).
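For a single statistic, Equation 4 is straightforward to implement. The sketch below follows the standard recipe (Efron & Tibshirani, 1993, chap. 14): z0 is estimated from the fraction of bootstrap replications below the point estimate, and a from a jackknife. The statistic used here (a sample mean) and all variable names are illustrative stand-ins, not the psychometric-function setting of this paper:

```python
import numpy as np
from statistics import NormalDist

def bca_endpoint(theta_hat, boot, jack, e):
    """e-level BCa endpoint (Equation 4) for a single statistic."""
    nd = NormalDist()
    z0 = nd.inv_cdf((boot < theta_hat).mean())            # bias-correction term
    d = jack.mean() - jack                                # jackknife deviations
    a = (d ** 3).sum() / (6.0 * (d ** 2).sum() ** 1.5)    # acceleration term
    ze = nd.inv_cdf(e)                                    # z^(e)
    adj = nd.cdf(z0 + (z0 + ze) / (1.0 - a * (z0 + ze)))  # CG(...) in Equation 4
    return np.quantile(boot, adj)                         # G^-1 of the adjusted level

rng = np.random.default_rng(7)
x = rng.exponential(1.0, 50)                              # a skewed sample
theta_hat = x.mean()
boot = rng.choice(x, (1999, x.size)).mean(axis=1)         # bootstrap replications
jack = np.array([np.delete(x, i).mean() for i in range(x.size)])
lo = bca_endpoint(theta_hat, boot, jack, 0.159)           # endpoints of a ~68% interval
hi = bca_endpoint(theta_hat, boot, jack, 0.841)
```

When z0 = a = 0, the adjusted level reduces to e and the endpoint collapses to the ordinary percentile estimate, which makes the correction explicit.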

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an e-level confidence interval endpoint is simply G⁻¹(e) (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equal in performance to standard percentiles, but not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and it was the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for sensitivity analysis, as described above.
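Coverage testing of the kind just described needs nothing more than an outer Monte Carlo loop: simulate data from known parameters, build the interval, and count how often it contains the truth. A toy version for the percentile method on a sample mean (not the psychometric-function setting, and with far fewer runs than a real study would use):

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, n, B, runs = 0.0, 15, 999, 300
hits = 0
for _ in range(runs):
    x = rng.normal(true_mu, 1.0, n)               # one simulated experiment
    boot = rng.choice(x, (B, n)).mean(axis=1)     # bootstrap replications of the mean
    lo, hi = np.percentile(boot, [15.85, 84.15])  # percentile interval, nominal 68.3%
    hits += lo <= true_mu <= hi
coverage = hits / runs                            # compare with the nominal .683
```

Empirical coverage below the nominal level, or markedly unequal miss rates above and below the interval, are exactly the failure modes such simulations are designed to expose.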

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures such as the thresholds and slopes of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if and only if we choose as threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology conference, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference on Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function I: Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or of the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions applied to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100nth percentile of the bootstrap distribution α*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ± 1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: it does very well now, but its slope estimates were unstable during sensitivity analysis if β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations where one psychometric function using a particular distribution function provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: the appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using BCa confidence intervals, described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity J, which is derived from a probability distribution F by J = t(F), is to obtain F̂ from empirical data and then use Ĵ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing, by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)

Page 13: Wichmann (2001) The psychometric function. Ii. …wexler.free.fr/library/files/wichmann (2001) the...parameters and derived quantities, such as thresholds and slopes. First, we outline

1326 WICHMANN AND HILL

total number of observations N

thre

sho

ld (F

05

) WC

I 68-1

thre

sho

ld (F

02

) WC

I 68-1

thre

sho

ld (F

08

) WC

I 68

-1

slo

pe

WC

I 68

total number of observations N

120 240 480 9600

02

04

06

08

1

12

14

120 240 480 9600

02

04

06

08

1

12

14

120 240 480 9600

02

04

06

08

1

12

14

120 240 480 9600

02

04

06

08

1

12

14

C D

BA

Color Code

largest WCI68mean WCI68minimal WCI68

mean WCI68 as a function of N

range of WCI68rsquos for a givensampling scheme as a functionof F

Symbols refer to the sampling scheme

Figure 7 The four panels show the width of WCI68 as a function of N and sampling scheme as well as of the distribution functionF The symbols refer to the different sampling schemes as in Figure 2 gray levels code the influence of the distribution function F onthe width of WCI68 for a given sampling scheme and N Black symbols show the smallest WCI68 obtained middle gray the mean WCI68across the four distribution functions used and white the largest WCI68 (A) WCI68 for threshold02 (B) WCI68 for threshold05 (C)WCI68 for threshold08 (D) WCI68 for slope05

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates¹¹ of the variability of a distribution J. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate σ̂; Maloney (1990), in addition to σ̂, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however. Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as σ) are not robust: they are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity J might be seriously in error if only a single bootstrap estimate J*ᵢ is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.¹²
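The breakdown argument is easy to demonstrate numerically. The sketch below (all numbers invented) contrasts moment-based and nonparametric summaries of a set of bootstrap replicates when a single replicate goes wrong:

```python
import statistics

# Invented bootstrap replicates of a threshold estimate; in the contaminated
# set, one replicate is the kind of extreme outlier a stuck maximum-likelihood
# search can produce.
clean = [1.02, 0.97, 1.05, 0.99, 1.01, 0.98, 1.03, 1.00]
contaminated = clean[:-1] + [50.0]          # one bad replicate out of eight

# Moment-based summaries (breakdown 1/n): a single outlier drags them
# arbitrarily far from the bulk of the data.
print(statistics.mean(clean), statistics.mean(contaminated))
print(statistics.stdev(clean), statistics.stdev(contaminated))

# The median (breakdown 1/2) barely moves.
print(statistics.median(clean), statistics.median(contaminated))
```

With one replicate in eight corrupted, the mean and standard deviation can be made arbitrarily large, while the median shifts by about one grid step of the data.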

Nonparametric plug-in estimates are also not without problems. Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12-14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distribution of bootstrap estimates θ̂* to be biased and skewed estimators of the generating values θ̂. The same applies to the bootstrap distributions of estimates J* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias particularly raised problems for the distribution of the β parameter of the Weibull, β̂* (N = 210, K = 7). We also found in our simulations that β̂*, and thus slopes ŝ*, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that Φ̂ = m(θ̂) and Φ̂* = m(θ̂*), resulting in

    Φ̂* ~ N(Φ̂ − z₀κ_Φ̂, κ_Φ̂²),    (2)

where

    κ_Φ = κ_Φ₀ + a(Φ − Φ₀),    (3)

and Φ₀ is any reference point on the scale of Φ values. Here, z₀ is the bias-correction term, and a is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an e-level confidence interval endpoint of the BCa interval can be calculated as

    θ̂_BCa[e] = Ĝ⁻¹(CG(z₀ + (z₀ + z⁽ᵉ⁾) / (1 − a(z₀ + z⁽ᵉ⁾)))),    (4)

where CG is the cumulative Gaussian distribution function, Ĝ⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ̂*, z⁽ᵉ⁾ = CG⁻¹(e), z₀ is our estimate of the bias, and a is our estimate of the acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).
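Equation 4 is mechanical to apply once z₀ and a have been estimated. The Python sketch below (made-up replicates; the acceleration term is computed from the standard jackknife formula of Efron and Tibshirani, 1993) shows the computation for a single fitted parameter:

```python
import random
from statistics import NormalDist, mean

def bca_interval(theta_hat, boot, jack, e=0.159):
    """Sketch of the BCa endpoints (Equation 4) from bootstrap replicates
    `boot` and leave-one-out jackknife estimates `jack`; e and 1 - e are the
    nominal tail levels (e = 0.159 gives a ~68% interval, the WCI68)."""
    nd = NormalDist()
    B = len(boot)
    # Bias-correction term z0: where the original estimate falls within the
    # bootstrap distribution.
    z0 = nd.inv_cdf(sum(t < theta_hat for t in boot) / B)
    # Acceleration term a from the jackknife estimates.
    jm = mean(jack)
    a = (sum((jm - t) ** 3 for t in jack)
         / (6 * sum((jm - t) ** 2 for t in jack) ** 1.5))
    def endpoint(level):
        z = nd.inv_cdf(level)
        corrected = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        s = sorted(boot)                        # G^-1: empirical quantile
        return s[min(int(corrected * B), B - 1)]
    return endpoint(e), endpoint(1 - e)

# Invented example: 2,000 Gaussian bootstrap replicates around theta_hat = 1.0
# and ten fictitious jackknife values.
random.seed(0)
boot = [random.gauss(1.0, 0.1) for _ in range(2000)]
jack = [1.0 + 0.01 * i for i in range(10)]
lo, hi = bca_interval(1.0, boot, jack)
```

When the bootstrap distribution is unbiased and unskewed, z₀ and a are near zero and the endpoints reduce to ordinary percentiles; with bias or skew present, the endpoints shift accordingly.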

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, in which an e-level confidence interval endpoint is simply Ĝ⁻¹[e] (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equal in performance to standard percentiles, not that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for the sensitivity analysis described above.
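Coverage tests of the kind just described can be sketched in a few lines: simulate many experiments from a known generating value, compute an interval for each, and tally how often the true value falls below, inside, or above the interval (for a balanced method, the two tail counts should be roughly equal). A toy version with a simple percentile interval for a binomial proportion, all numbers invented:

```python
import random
random.seed(1)

def percentile_interval(data, stat, B=300, e=0.159):
    """Nonparametric percentile interval: the e-level endpoint is simply
    the e-quantile of the bootstrap replications, G^-1[e]."""
    reps = sorted(stat([random.choice(data) for _ in data]) for _ in range(B))
    return reps[int(e * B)], reps[min(int((1 - e) * B), B - 1)]

true_p = 0.75                         # known generating value
miss_low = miss_high = inside = 0
for _ in range(150):                  # 150 simulated experiments of 40 trials
    data = [1 if random.random() < true_p else 0 for _ in range(40)]
    lo, hi = percentile_interval(data, lambda d: sum(d) / len(d))
    if true_p < lo:
        miss_low += 1                 # true value below the interval
    elif true_p > hi:
        miss_high += 1                # true value above the interval
    else:
        inside += 1
print("coverage:", inside / 150, "imbalance:", abs(miss_low - miss_high))
```

The observed coverage can then be compared against the nominal 68.2%, and the tail counts against each other; swapping in a BCa interval for the percentile interval makes the same harness a direct test of Equation 4.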

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as thresholds and slopes, of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N if and only if we choose threshold and slope values around the midpoint of the distribution function F; particularly for low thresholds (threshold0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).
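As an illustration of the parametric bootstrap recommended above, the sketch below simulates binomial data from a fitted two-parameter Weibull psychometric function and refits each simulated data set; all parameter values and the sampling scheme are invented, and the maximum-likelihood refit is reduced to a one-dimensional grid search over α for brevity:

```python
import math, random

def psi(x, alpha, beta, gamma=0.5, lam=0.01):
    """2AFC Weibull psychometric function (guess rate gamma, lapse rate lam)."""
    return gamma + (1 - gamma - lam) * (1 - math.exp(-((x / alpha) ** beta)))

# Invented fitted parameters and sampling scheme (stimulus levels x, trials n).
alpha_hat, beta_hat = 1.0, 3.0
x = [0.6, 0.8, 1.0, 1.2, 1.5]
n = [40] * len(x)

def refit_alpha(k):
    """Crude ML refit of alpha with beta held fixed; a real fit would
    optimize both parameters (and constrain them sensibly)."""
    def nll(a):
        total = 0.0
        for xi, ni, ki in zip(x, n, k):
            p = min(max(psi(xi, a, beta_hat), 1e-9), 1 - 1e-9)
            total -= ki * math.log(p) + (ni - ki) * math.log(1 - p)
        return total
    return min((0.5 + 0.005 * i for i in range(201)), key=nll)

# Parametric bootstrap: binomial resampling from the fitted function, then a
# refit; variability is read off the distribution of refitted estimates.
random.seed(2)
boot_alphas = sorted(
    refit_alpha([sum(random.random() < psi(xi, alpha_hat, beta_hat)
                     for _ in range(ni)) for xi, ni in zip(x, n)])
    for _ in range(500))
lo, hi = boot_alphas[79], boot_alphas[420]   # ~15.9% and ~84.1% quantiles
print("WCI68 for alpha:", lo, hi)
```

A sensitivity analysis in the sense described above would repeat this with the generating parameters perturbed within their own confidence region and verify that the interval endpoints barely move.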

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at Computers in Psychology, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference of Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions applied to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for α, for example, was determined simply by [α̂*(.025), α̂*(.975)], where α̂*(n) denotes the 100nth percentile of the bootstrap distribution α̂*.

5. Because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ± 1 standard deviation of a Gaussian, 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis if β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations where one psychometric function, using a particular distribution function, provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using the BCa confidence intervals described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity J that is derived from a probability distribution F by J = t(F) is to obtain F̂ from empirical data and then use Ĵ = t(F̂) as an estimate. This is called a plug-in estimate.

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (42 of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing, by virtue of their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)

Page 14: Wichmann (2001) The psychometric function. Ii. …wexler.free.fr/library/files/wichmann (2001) the...parameters and derived quantities, such as thresholds and slopes. First, we outline

THE PSYCHOMETRIC FUNCTION II 1327

terval of 683 Neither kind of plug-in estimate is guar-anteed to be reliable however Moment-based estimatesof a distributionrsquos central tendency (such as the mean) orvariability (such as s) are not robust they are very sen-sitive to outliers because a change in a single sample canhave an arbitrarily large effect on the estimate (The esti-mator is said to have a breakdown of 1n because that isthe proportion of the data set that can have such an effectNonparametric estimates are usually much less sensitiveto outliers and the median for example has a breakdownof 12 as opposed to 1n for the mean) A moment-basedestimate of a quantity J might be seriously in error if onlya single bootstrap Ji estimate is wrong by a largeamount Large errors can and do occur occasionallymdashforexample when the maximum-likelihood search algo-rithm gets stuck in a local minimum on its error surface12

Nonparametric plug-in estimates are also not withoutproblems Percentile-based bootstrap confidence intervalsare sometimes significantly biased and converge slowlyto the true confidence intervals (Efron amp Tibshirani 1993chaps 12ndash14 22) In the psychological literature thisproblem was critically noted by Rasmussen (1987 1988)

Methods to improve convergence accuracy and avoidbias have received the most theoretical attention in thestudy of the bootstrap (Efron 1987 1988 Efron amp Tib-shirani 1993 Hall 1988 Hinkley 1988 Strube 1988 cfFoster amp Bischof 1991 p 158)

In situations in which asymptotic confidence intervalsare known to apply and are correct bias-corrected andaccelerated (BCa) confidence intervals have been demon-strated to show faster convergence and increased accuracyover ordinary percentile-based methods while retainingthe desirable property of robustness (see in particularEfron amp Tibshirani 1993 p 183 Table 142 and p 184Figure 143 as well as Efron 1988 Rasmussen 19871988 and Strube 1988)

BCa confidence intervals are necessary because thedistribution of sampling points x along the stimulus axismay cause the distribution of bootstrap estimates q to bebiased and skewed estimators of the generating values qThe same applies to the bootstrap distributions of estimatesJ of any quantity of interest be it thresholds slopes orwhatever Maloney (1990) found that skew and bias par-ticularly raised problems for the distribution of the b pa-rameter of the Weibull b (N 5 210 K 5 7) We alsofound in our simulations that bmdashand thus slopes smdashwere skewed and biased for N smaller than 480 evenusing the best of our sampling schemes The BCa methodattempts to correct both bias and skew by assuming thatan increasing transformation m exists to transform thebootstrap distribution into a normal distribution Hencewe assume that F 5 m(q ) and F5 m(q) resulting in

(2)

where

(3)

and kF0any reference point on the scale of F values In

Equation 3 z0 is the bias correction term and a in Equa-tion 4 is the acceleration term Assuming Equation 2 to becorrect it has been shown that an e-level confidence in-terval endpoint of the BCa interval can be calculated as

(4)

where CG is the cumulative Gaussian distribution func-tion G 1 is the inverse of the cumulative distributionfunction of the bootstrap replications q z0 is our esti-mate of the bias and a is our estimate of acceleration Fordetails on how to calculate the bias and the accelerationterm for a single fitted parameter see Efron and Tibshi-rani (1993 chap 22) and also Davison and Hinkley (1997chap 5) the extension to two or more fitted parametersis provided by Hill (2001b)

For large classes of problems it has been shown thatEquation 2 is approximately correct and that the error inthe confidence intervals obtained from Equation 4 issmaller than those introduced by the standard percentileapproximation to the true underlying distribution wherean e-level confidence interval endpoint is simply G[e](Efron amp Tibshirani 1993) Although we cannot offer aformal proof that this is also true for the bias and skewsometimes found in bootstrap estimates from fits to psy-chophysical data to our knowledge it has been shownonly that BCa confidence intervals are either superior orequally good in performance to standard percentiles butnot that they perform significantly worse Hill (2001a2001b) recently performed Monte Carlo simulations totest the coverage of various confidence interval methodsincluding the BCa and other bootstrap methods as wellas asymptotic methods from probit analysis In generalthe BCa method was the most reliable in that its cover-age was least affected by variations in sampling schemeand in N and the least imbalanced (ie probabilities ofcovering the true parameter value in the upper and lowerparts of the confidence interval were roughly equal) TheBCa method by itself was found to produce confidenceintervals that are a little too narrow this underlines theneed for sensitivity analysis as described above

CONCLUSIONS

In this paper we have given an account of the procedureswe use to estimate the variability of fitted parameters and thederived measures such as thresholds and slopes of psy-chometric functions

First we recommend the use of Efronrsquos parametric boot-strap technique because traditional asymptotic methodshave been found to be unsuitable given the small numberof datapoints typically taken in psychophysical experi-ments Second we have introduced a practicable test ofthe bootstrap bridging assumption or sensitivity analysisthat must be applied every time bootstrap-derived vari-

ˆ ˆ ˆˆ

ˆ ˆ

( )

( )q e

e

eBCaG z

z z

a z z[ ] = + +

+( )aelig

egraveccedilccedil

ouml

oslashdividedivide

aelig

egrave

ccedilccedil

ouml

oslash

dividedivide

10

0

01CG

k k aF F F F= + ( )0 0

ˆ~ F F

F( )

kN z0 1

1328 WICHMANN AND HILL

ability estimates are obtained to ensure that variability es-timates do not change markedly with small variations inthe bootstrap generating functionrsquos parameters This is crit-ical because the fitted parameters q are almost certain todeviate at least slightly from the (unknown) underlyingparameters q Third we explored the influence of differ-ent sampling schemes (x) on both the size of onersquos con-fidence intervals and their sensitivity to errors in q Weconclude that only sampling schemes including at leastone sample at p $ 95 yield reliable bootstrap confi-dence intervals Fourth we have shown that the size ofbootstrap confidence intervals is mainly influenced by xand N if and only if we choose as threshold and slope val-ues around the midpoint of the distribution function Fparticularly for low thresholds (threshold02) the precisemathematical form of F exerts a noticeable and undesir-able influence on the size of bootstrap confidence inter-vals Finally we have reported the use of BCa confidenceintervals that improve on parametric and percentile-basedbootstrap confidence intervals whose bias and slow con-vergence had previously been noted (Rasmussen 1987)

With this and our companion paper (Wichmann amp Hill2001) we have covered the three central aspects of mod-eling experimental data first parameter estimation sec-ond obtaining error estimates on these parameters andthird assessing goodness of fit between model and data

REFERENCES

Cox D R amp Hinkl ey D V (1974) Theoretical statistics LondonChapman and Hall

Davison A C amp Hin kl ey D V (1997) Bootstrap methods and theirapplication Cambridge Cambridge University Press

Efr on B (1979) Bootstrap methods Another look at the jackknifeAnnals of Statistics 7 1-26

Efr on B (1982) The jackknife the bootstrap and other resampling plans(CBMS-NSF Regional Conference Series in Applied Mathematics)Philadelphia Society for Industrial and Applied Mathematics

Efr on B (1987) Better bootstrap confidence intervals Journal of theAmerican Statistical Association 82 171-200

Efr on B (1988) Bootstrap confidence intervals Good or bad Psy-chological Bulletin 104 293-296

Efr on B amp Gong G (1983) A leisurely look at the bootstrap thejackknife and cross-validation American Statistician 37 36-48

Ef r on B amp Tibsh ir an i R (1991) Statistical data analysis in the com-puter age Science 253 390-395

Ef r on B amp Tibshir an i R J (1993) An introduction to the bootstrapNew York Chapman and Hall

Finn ey D J (1952) Probit analysis (2nd ed) Cambridge CambridgeUniversity Press

Finn ey D J (1971) Probit analysis (3rd ed) Cambridge CambridgeUniversity Press

Fost er D H amp Bischof W F (1987) Bootstrap variance estimatorsfor the parameters of small-sample sensory-performance functionsBiological Cybernetics 57 341-347

Fost er D H amp Bisch of W F (1991) Thresholds from psychometricfunctions Superiority of bootstrap to incremental and probit varianceestimators Psychological Bulletin 109 152-159

Fost er D H amp Bischof W F (1997) Bootstrap estimates of the statis-tical accuracy of thresholds obtained from psychometric functions Spa-tial Vision 11 135-139

Hal l P (1988) Theoretical comparison of bootstrap confidence inter-vals (with discussion) Annals of Statistics 16 927-953

Hil l N J (2001a May) An investigation of bootstrap interval coverageand sampling efficiency in psychometric functions Poster presented atthe Annual Meeting of the Vision Sciences Society Sarasota FL

Hil l N J (2001b) Testing hypotheses about psychometric functionsAn investigation of some confidence interval methods their validityand their use in assessing optimal sampling strategies Forthcomingdoctoral dissertation Oxford University

Hil l N J amp Wichma nn F A (1998 April) A bootstrap method fortesting hypotheses concerning psychometric functions Paper presentedat the Computers in Psychology York UK

Hinkl ey D V (1988) Bootstrap methods Journal of the Royal Statis-tical Society B 50 321-337

Kendal l M K amp St uar t A (1979) The advanced theory of statisticsVol 2 Inference and relationship New York Macmillan

La m C F Mil l s J H amp Dubno J R (1996) Placement of obser-vations for the efficient estimation of a psychometric function Jour-nal of the Acoustical Society of America 99 3689-3693

Mal oney L T (1990) Confidence intervals for the parameters of psy-chometric functions Perception amp Psychophysics 47 127-134

McKee S P Kl ein S A amp Tel l er D Y (1985) Statistical proper-ties of forced-choice psychometric functions Implications of probitanalysis Perception amp Psychophysics 37 286-298

Ra smussen J L (1987) Estimating correlation coefficients Bootstrapand parametric approaches Psychological Bulletin 101 136-139

Rasmussen J L (1988) Bootstrap confidence intervals Good or badComments on Efron (1988) and Strube (1988) and further evaluationPsychological Bulletin 104 297-299

St r u be M J (1988) Bootstrap type I error rates for the correlation co-efficient An examination of alternate procedures Psychological Bul-letin 104 290-292

Tr eu t wein B (1995 August) Error estimates for the parameters ofpsychometric functions from a single session Poster presented at theEuropean Conference of Visual Perception Tuumlbingen

Tr eu t wein B amp St r asbu r ger H (1999 September) Assessing thevariability of psychometric functions Paper presented at the Euro-pean Mathematical Psychology Meeting Mannheim

Wichmann F A amp Hil l N J (2001) The psychometric function I Fit-ting sampling and goodness of fit Perception amp Psychophysics 631293-1313

NOTES

1 Under different circumstances and in the absence of a model of thenoise andor the process from which the data stem this is frequently thebest one can do

2 This is true of course only if we have reason to believe that ourmodel actually is a good model of the process under study This importantissue is taken up in the section on goodness of fit in our companion paper

3 This holds only if our model is correct Maximum-likelihood pa-rameter estimation for two-parameter psychometric functions to data fromobservers who occasionally lapsemdashthat is display nonstationaritymdashis as-ymptotically biased as we show in our companion paper together witha method to overcome such bias (Wichmann amp Hill 2001)

4 Confidence intervals here are computed by the bootstrap percentilemethod The 95 confidence interval for a for example was deter-mined simply by [a(025) a(975)] where a(n) denotes the 100nth per-centile of the bootstrap distribution a

5 Because this is the approximate coverage of the familiarity stan-dard error bar denoting onersquos original estimate 6 1 standard deviationof a Gaussian 68 was chosen

6 Optimal here of course means relative to the sampling schemesexplored

7 In our initial simulations we did not constrain a and b other than tolimit them to be positive (the Weibull function is not defined for negativeparameters) Sampling schemes s1 and s4 were even worse and s3 and s7were relative to the other schemes even more superior under noncon-strained fitting conditions The only substantial difference was perfor-mance of sampling scheme s5 It does very well now but its slope esti-mates were unstable during sensitivity analysis of b was unconstrainedThis is a good example of the importance of (sensible) Bayesian assump-tions constraining the fit We wish to thank Stan Klein for encouraging usto redo our simulations with a and b constrained

8 Of course there might be situations where one psychometric func-tion using a particular distribution function provides a significantly bet-

THE PSYCHOMETRIC FUNCTION II 1329

ter fit to a given data set than do others using different distribution func-tions Differences in bootstrap estimates of variability in such cases arenot worrisome The appropriate estimates of variability are those of thebest-fitting function

9 WCI68 estimates were obtained using BCa confidence intervalsdescribed in the next section

10 Clearly the above-reported effect sizes are tied to the ranges inthe factors explored N spanned a comparatively large range of 120ndash960observations or a factor of 8 whereas all of our sampling schemes wereldquoreasonablerdquo inclusion of ldquounrealisticrdquo or ldquounusualrdquo sampling schemesmdashfor example all x values such that nominal y values are below 055mdashwould have increased the percentage of variation accounted for by sam-pling scheme Taken together N and sampling scheme should be repre-sentative of most typically used psychophysical settings however

11 A straightforward way to estimate a quantity J which is derivedfrom a probability distribution F by J 5 t(F ) is to obtain F from em-pirical data and then use J 5 t(F ) as an estimate This is called a plug-in estimate

12 Foster and Bischof (1987) report problems with local minima whichthey overcame by discarding bootstrap estimates that were larger than 20times the stimulus range (42 of their datapoints had to be removed)Nonparametric estimates naturally avoid having to perform posthoc datasmoothing by their resilience to such infrequent but extreme outliers

(Manuscript received June 10 1998 revision accepted for publication February 27 2001)

Page 15: Wichmann (2001) The psychometric function. Ii. …wexler.free.fr/library/files/wichmann (2001) the...parameters and derived quantities, such as thresholds and slopes. First, we outline

1328 WICHMANN AND HILL

ability estimates are obtained to ensure that variability es-timates do not change markedly with small variations inthe bootstrap generating functionrsquos parameters This is crit-ical because the fitted parameters q are almost certain todeviate at least slightly from the (unknown) underlyingparameters q Third we explored the influence of differ-ent sampling schemes (x) on both the size of onersquos con-fidence intervals and their sensitivity to errors in q Weconclude that only sampling schemes including at leastone sample at p $ 95 yield reliable bootstrap confi-dence intervals Fourth we have shown that the size ofbootstrap confidence intervals is mainly influenced by xand N if and only if we choose as threshold and slope val-ues around the midpoint of the distribution function Fparticularly for low thresholds (threshold02) the precisemathematical form of F exerts a noticeable and undesir-able influence on the size of bootstrap confidence inter-vals Finally we have reported the use of BCa confidenceintervals that improve on parametric and percentile-basedbootstrap confidence intervals whose bias and slow con-vergence had previously been noted (Rasmussen 1987)

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and, third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at the Computers in Psychology Conference, York, UK.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad? Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference on Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function I: Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances, and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct. Maximum-likelihood parameter estimation for two-parameter psychometric functions applied to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method. The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100nth percentile of the bootstrap distribution α*.
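The percentile rule in this note amounts to a single line of code. In the sketch below, the bootstrap distribution is a hypothetical stand-in (simulated Gaussian values), not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a bootstrap distribution of 2,000 estimates of a parameter alpha.
alpha_star = rng.normal(loc=10.0, scale=0.8, size=2000)

# 95% bootstrap percentile interval: [alpha*(.025), alpha*(.975)],
# i.e., the 2.5th and 97.5th percentiles of the bootstrap distribution.
lo, hi = np.percentile(alpha_star, [2.5, 97.5])
```

The same rule with the 16th and 84th percentiles yields a 68% percentile interval.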

5. Because this is the approximate coverage of the familiar standard error bar (denoting one's original estimate ± 1 standard deviation of a Gaussian), 68% was chosen.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis when β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

THE PSYCHOMETRIC FUNCTION II 1329

8. Of course, there might be situations where one psychometric function using a particular distribution function provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using BCa confidence intervals, described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges in the factors explored: N spanned a comparatively large range of 120-960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below .55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity J, which is derived from a probability distribution F by J = t(F), is to obtain an estimate F̂ of F from empirical data and then use Ĵ = t(F̂) as the estimate. This is called a plug-in estimate.
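The plug-in principle described in this note can be illustrated in a few lines. The choice of functional (the median) and of generating distribution below are hypothetical, chosen only to make the idea concrete:

```python
import numpy as np

rng = np.random.default_rng(2)

def t(sample):
    """The functional of interest, J = t(F); here: the median of the distribution."""
    return float(np.median(sample))

# Data drawn from an unknown F (an exponential with scale 2, for illustration).
data = rng.exponential(scale=2.0, size=1000)

# Plug-in estimate: replace F by the empirical distribution F-hat,
# then apply the same functional t to it.
J_hat = t(data)

# For comparison, the true median of this F is 2 * ln 2, about 1.386.
```

The bootstrap applies exactly this principle to the sampling distribution of an estimator: F is replaced by F̂ (or by a fitted parametric model), and quantities such as standard errors are read off the resampled distribution.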

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (42 of their data points had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)
