
Learning rational temporal eye movement strategies

David Hoppe a,b and Constantin A. Rothkopf a,b,c,1

a Cognitive Science Centre, Technical University Darmstadt, 64283 Darmstadt, Germany; b Institute of Psychology, Technical University Darmstadt, 64283 Darmstadt, Germany; and c Frankfurt Institute for Advanced Studies, Goethe University, 60438 Frankfurt, Germany

Edited by Michael E. Goldberg, Columbia University College of Physicians, New York, NY, and approved June 1, 2016 (received for review January 24, 2016)

During active behavior humans redirect their gaze several times every second within the visual environment. Where we look within static images is highly efficient, as quantified by computational models of human gaze shifts in visual search and face recognition tasks. However, when we shift gaze is mostly unknown despite its fundamental importance for survival in a dynamic world. It has been suggested that during naturalistic visuomotor behavior gaze deployment is coordinated with task-relevant events, often predictive of future events, and studies in sportsmen suggest that timing of eye movements is learned. Here we establish that humans efficiently learn to adjust the timing of eye movements in response to environmental regularities when monitoring locations in the visual scene to detect probabilistically occurring events. To detect the events humans adopt strategies that can be understood through a computational model that includes perceptual and acting uncertainties, a minimal processing time, and, crucially, the intrinsic costs of gaze behavior. Thus, subjects traded off event detection rate with behavioral costs of carrying out eye movements. Remarkably, based on this rational bounded actor model the time course of learning the gaze strategies is fully explained by an optimal Bayesian learner with humans' characteristic uncertainty in time estimation, the well-known scalar law of biological timing. Taken together, these findings establish that the human visual system is highly efficient in learning temporal regularities in the environment and that it can use these regularities to control the timing of eye movements to detect behaviorally relevant events.

eye movements | computational modeling | decision making | learning | visual attention

We live in a dynamic and ever-changing environment. To avoid missing crucial events, we need to constantly use new sensory information to monitor our surroundings. However, the fraction of the visual environment that can be perceived at a given moment is limited by the placement of the eyes and the arrangement of the receptor cells within the eyes (1). Thus, continuously monitoring environmental locations, even when we know which regions in space contain relevant information, is unfeasible. Instead, we actively explore by targeting the visual apparatus toward regions of interest using proper movements of the eyes, head, and body (2–5). This constitutes a fundamental computational problem requiring humans to decide sequentially when to look where. Solving this problem arguably has been crucial to our survival, from the early human hunters pursuing a herd of prey and avoiding predators to the modern human navigating a crowded sidewalk and crossing a busy road.

So far, the emphasis of most studies on eye movements has been on spatial gaze selection. Its exquisite adaptability can be appreciated by considering the different factors influencing gaze, including low-level image features (6), scene gist (7) and scene semantics (8), task constraints (9), and extrinsic rewards (10); also, combinations of factors have been investigated (11). Studies have cast doubt on the optimality of spatial gaze selection (12–14), but computational models of the spatial selection have in part established that humans are close to optimal in targeting locations in the visual scene, as in visual search for visible (15) and invisible targets (16) as well as face recognition (17) and pattern classification (18). Although these studies provide insights into where humans look in a visual scene, the ability to manage complex tasks cannot be explained solely by optimal spatial gaze selection. Even if relevant spatial locations in the scene have been determined, these regions must be monitored over time to detect behaviorally relevant changes and thus keep information updated for action selection in accordance with the current state of the environment. Successfully dealing with the dynamic environment therefore requires intelligently distributing the limited visual resources over time. For example, where should we look if the task is to detect an event occurring at one of several possible locations? In the beginning, if the temporal regularities in a new environment are unknown, it may be best to look quickly at all relevant locations equally long. However, with more and more observations of events and their durations, we may be able to use the learned temporal regularities. To keep the overall uncertainty as low as possible, fast-changing regions (e.g., streets with cars) should be attended more frequently than slow-changing regions (e.g., walkways with pedestrians).

Although little is known about how the dynamics in the environment influence human gaze behavior, empirical evidence suggests sophisticated temporal control of eye movements, as in object manipulation (19, 20) and natural behavior (4, 5). A common observation is that gaze is often predictive both of physical events (21) and events caused by other people (22), and studies comparing novices' and experienced sportsmen's gaze coordination suggest that temporal gaze selection is learned (23). However, the temporal course of these eye movements and how they are influenced by the event durations in the environment has not been investigated so far. We argue that understanding human eye movements in natural dynamic settings can only be achieved by incorporating the temporal control of the visual apparatus. Also, modeling of temporal eye movement allocation is scarce, with notable exceptions (24), but whereas ideal observers (15–17) model behavior after learning has completed and usually exclude intrinsic costs, reinforcement learning models based on Markov decision processes (16, 24) can be applied to situations involving perceptual uncertainty at best only approximately.

Significance

In a dynamic world humans not only have to decide where to look but also when to direct their gaze to potentially informative locations in the visual scene. Little is known about how timing of eye movements is related to environmental regularities and how gaze strategies are learned. Here we present behavioral data establishing that humans learn to adjust their temporal eye movements efficiently. Our computational model shows how established properties of the visual system determine the timing of gaze. Surprisingly, a Bayesian learner only incorporating the scalar law of biological timing can fully explain the course of learning these strategies. Thus, humans use temporal regularities learned from observations to adjust the scheduling of eye movements in a nearly optimal way.

Author contributions: D.H. and C.A.R. designed research; D.H. performed research; D.H. and C.A.R. analyzed data; and D.H. and C.A.R. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

1 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1601305113/-/DCSupplemental.


Here we present behavioral data and results from a computational model establishing several properties of human active visual strategies specific to temporally varying stimuli. In particular, we address the question of what factors specifically need to be accounted for to predict human gaze timing behavior in response to environmental events, and how learning proceeds. To this end, we created a temporal event detection (TED) task to investigate how humans adjust their eye movement strategies when learning temporal regularities. Our task enables investigating the unknown temporal aspects of eye movement strategies by isolating them from spatial selection and the complexity that arises from natural images. Our results are the first to our knowledge to show that human eye movements are consistently related to temporal regularities in the environment. We further provide a computational model explaining the observed gaze behavior as an interaction of perceptual and action uncertainties as well as intrinsic costs and biological constraints. Surprisingly, based on this model alone, the learning progress is accounted for without any additional assumptions, only based on the scalar law of biological timing, allowing crucial insights into how observers alter their behavior due to task demands and environmental regularities.

Results
Behavioral Results. The TED task required participants to complete 30 experimental blocks, each consisting of 20 consecutive trials. Subjects were instructed to detect a single event on each trial and press a button to indicate the detection of the event, which could occur at one of two possible locations with equal probability (Fig. 1A). This event consisted of a little dot at the center of the two locations that disappeared for a certain duration. The visual separation of the two locations was chosen such that subjects could not detect an event extrafoveally (i.e., when fixating the other location). In each block, event durations were constant but unknown to the subjects in the beginning. They were generated probabilistically from three distributions over event durations, termed short (S), medium (M), and long (L). The rationale of this experimental design was that there is no single optimal strategy for all blocks in the experiment; rather, subjects have to adapt their strategies to the event durations in the respective block (Fig. 1B). Thus, over the course of 20 consecutive trials per block, subjects could learn the event durations by observing events.

We investigated how participants changed their switching behavior over trials within individual blocks as a consequence of learning the constant event durations within a single block. Fig. 2A shows two example trials for participant MK, both from the same block (block LS: long event duration at the left region and short event duration at the right region). In the first trial, when event durations were unknown (Fig. 2A, Upper), MK switched evenly between the regions to detect the event (t(7) = 0.48, P = 0.64). However, at the end of this block, in trial 20, we observed a clear change in the switching pattern (SP) (Fig. 2A, Lower). There was a shift in the mean fixation duration over the course of the block (Fig. 2A, Right). The region associated with the short event durations was fixated longer in the last trial of the block (t(4) = 2.75, P = 0.04). Hence, the subject changed the gaze SP over trials in this particular block in response to the event durations at the two locations.

Mean fixation durations aggregated over all participants are shown in Fig. 2B. Steady-state behavior across subjects was reached within the first 5 to 10 trials within a block, suggesting that participants were able to quickly learn the event durations. Behavioral changes were consistent across conditions. In conditions with different event durations (SM, SL, and ML), regions with shorter event durations were fixated longer in later trials of the block (all differences were highly significant with P < 10^−5).


Fig. 1. Experimental procedure for the TED task and implications for eye movement strategies. (A) Single-trial structure and stimulus generation. Participants switched their gaze between two regions marked by gray circles to detect a single event. The event consisted of the little dot within one of the circles disappearing (depicted as occurring on the left side in this trial). Participants confirmed the detection of an event by indicating the region (left or right) through a button press. The event duration (how long the dot was invisible) was drawn from one of three probability distributions (short, medium, and long). Overall, six different conditions resulting from the combination of three event durations and two spatial locations were used in the experiment (SS, MM, LL, SM, SL, and ML). A single block consisted of 20 consecutive trials with event durations drawn from the same distributions. For example, in a block LS all events occurring at the left region were long events with a mean duration of 1.5 s, and all events occurring at the right region were short events with a mean duration of 0.15 s. A single event for the condition LS was generated as follows (Lower Left). First, the start time for the event was drawn from a uniform distribution between 2 and 8 s. Next, the region where the event occurred was either left or right with equal probability. Finally, the duration of the event was drawn dependent on where the event occurred. Given that in the example the condition was LS and the event was determined to occur at the left region, the event duration was drawn from a Gaussian distribution with mean 1.5 s (long event). (B) Eye movement strategies for different conditions. The relationship between SPs and the probability of missing the event in a single trial is shown for two conditions. For condition MM (Upper) the probability of missing an event is lower for faster switching (short fixation durations at both regions). For condition LS (long events at the left region and short events at the right region) the probability of missing an event can be decreased by fixating longer on the region with short events (i.e., the right region).


Participants' gaze behavior was not only influenced by whether the event durations were different but was also guided by the size of the difference (Fig. 2C, Left). Greater differences in event durations led to greater differences in the fixation durations (linear regression, r² = 0.93, P < 0.002). Moreover, the overall mean fixation duration aggregated over both regions was affected by the overall mean event durations (Fig. 2C, Right). Participants, in general, switched faster between the regions if the mean event duration was small (linear regression, r² = 0.88, P < 0.005). Blocks with long overall event durations (e.g., LL) yielded longer mean fixation durations compared with blocks with short event durations (e.g., SS). Taken together, these results establish that participants quickly adapted their eye movement strategies to the task requirements given by the timing of events.

However, further inspection of the behavioral data suggests nontrivial aspects of the gaze-switching strategies. The conditions in our TED task differed in task difficulty depending on the event durations (Fig. 2D). The shorter the event durations in a specific condition, the more difficult was the task.


Fig. 2. Behavioral results. (A) Sample SP for participant MK in a single block (Left) and mean fixation duration for each region (Right). SPs are shown for two trials: before learning the event durations (trial 1) and after learning the event durations (trial 20). (B) Mean fixation duration aggregated for all participants and all blocks, grouped by condition (error bars correspond to SEM). Mean fixation duration is shown in seconds (y axis) and is on the same scale for all conditions (see also Fig. S1). Trial number within the block is shown on the x axis. Colors of the lines refer to the event duration used in that condition. (C) Mean fixation duration difference as a function of event duration difference (Left). Each data point corresponds to a specific condition. Mean fixation duration as a function of mean event duration (Right). Again, each data point is associated with one of the conditions. (D) Percentage of detected events for all conditions. Values in square brackets correspond to the 95% credibility interval for the differences in proportion. Different conditions are shown on the x axis. (E) Difference in detection performance between the first and last trial in each block. Different conditions are shown on the x axis. (F) Behavioral changes from trial 1 to trial 20 (arrows) for each condition. Color density depicts the probability of missing an event for every possible SP (see also Fig. S2). Each plot corresponds to a single condition. Conditions are arranged as in B.


Improvement over the course of a block differed between conditions (Fig. 2E), and participants only improved in detection accuracy in conditions associated with a high task difficulty. In contrast, participants showed a slight decrease in performance in the other conditions. This can be represented as arrows pointing from the observed switching behavior in the first trial to the behavior in the last trial within blocks in the action space (Fig. 2F). Overall, several aspects remain unclear: the different rates of change in switching frequencies across trials in a block as well as the different directions and end points of behavioral changes observed in the performance space.

Bounded Actor Model. We hypothesized that multiple processes are involved in generating the observed behavior, based on the behavioral results in the TED task and taking into account known facts about human time constraints in visual processing, time perception, eye movement variability, and intrinsic costs. Starting from an ideal observer (25) we developed a bounded actor model by augmenting it with computational components modeling these processes (Fig. 3A; see Supporting Information for a derivation of the ideal observer and further model components).

The ideal observer chooses SPs that maximize performance in the task. The gaze behavior suggested by the ideal observer corresponds to the SP with the lowest probability of missing the event (Fig. 2F). We extend this model by including perceptual uncertainty, which limits the accuracy of estimating event durations from observations. An extensive literature has reported how time intervals are perceived and reproduced (26–28). The central finding is the so-called scalar law of biological timing (26, 29), a linear relationship between the mean and the SD of estimates in the interval timing task. We included perceptual uncertainty in the ideal observer model based on the values found in the literature.

The second extension of the model accounts for eye movement variability, which limits the ability of humans to accurately execute planned eye movements and has been reported frequently in the literature. Although many factors may contribute to this variability in action outcomes, humans have been shown to take this variability into account (30). Further evidence for this comes from experiments that have shown that in a reproduction task the impact of prior knowledge on behavior increased according to the uncertainty in time estimation (31) and from studies that extended an ideal observer to an ideal actor by including behavioral timing variability (32). To describe the variability of fixation durations we used an exponential Gaussian distribution (33), because it is a common choice for describing fixation durations (34–37). The parameters were estimated from our data using maximum likelihood.
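This estimation step is straightforward with standard tools. The following is a minimal sketch, assuming scipy's exponnorm parameterization (shape K = τ/σ, loc = μ, scale = σ) and synthetic stand-in data; it is not the authors' fitting code.

```python
# Sketch: maximum-likelihood fit of an ex-Gaussian to fixation durations.
# scipy's exponnorm uses shape K = tau / sigma, loc = mu, scale = sigma,
# so the Gaussian part is N(mu, sigma^2) and the exponential mean is
# tau = K * sigma. The data below are synthetic stand-ins for measurements.
from scipy import stats

# Synthetic "measured" fixation durations in seconds.
durations = stats.exponnorm.rvs(2.0, loc=0.25, scale=0.05, size=2000,
                                random_state=2)

K, mu, sigma = stats.exponnorm.fit(durations)
tau = K * sigma
print(f"mu = {mu:.3f} s, sigma = {sigma:.3f} s, tau = {tau:.3f} s; "
      f"mean fixation = {mu + tau:.3f} s")
```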

Although behavioral costs in biological systems are a fundamental fact, very little is known about such costs for eye movements, because research has so far focused on extrinsic costs (10, 11, 24). We included costs in our model by hypothesizing that switching more frequently is associated with increased effort. In a similar task that did not reward faster switching, participants showed fixation durations of about 1 s (38). To our knowledge a concrete shape of these costs has not been investigated yet, and in the present study we used a sigmoid function (Fig. 3B, Right) for which the parameters were estimated using least squares from the measured data.

Finally, the model has so far neglected that all neural and motor processes involved take time. The time required for visual processing is highly dependent on the task at hand. It can be very fast (39); however, in the context of an alternative choice task processing is done within the first 150 ms (40) and does not improve with familiarity of the stimulus (41). In our study, the demands for the decision were much lower because discriminating between the two states of a region (ongoing event or no event) is rather simple. Still, there are some additional constraints that limit the velocity of switching: the eye–brain lag and the time for planning and initiating a saccade (42). In addition, prior research suggests that in general saccades are triggered after processing of the current fixation has reached some degree of completeness (43). We estimated the mean processing time (Fig. 3B, Left) from our behavioral data using least squares.

To investigate which model should be favored in explaining the observed gaze-switching behavior, we fitted the model leaving out one component at a time and computed the Akaike information criterion (AIC) for each of these models. The results are shown in Fig. 3C. The conditional probability for each model was calculated by computing the Akaike weights from the AIC values as suggested in ref. 44. We found that gaze switches were more than 10,000 times more probable under the full model compared with the second-best model (Table S1 and Fig. S3). In addition, models lacking a single component deviate severely from the observed behavior (Fig. S4). This strongly suggests that the temporal course of eye movements cannot be explained solely on the basis of an ideal observer. Instead, multiple characteristics of perceptual uncertainty, behavioral variability, intrinsic costs, and biological constraints have to be included.

[Figure 3 graphic: (A) preferred regions of the action space (left vs. right fixation duration) for the Ideal Observer, the Ideal Actor with acting uncertainty, the addition of saccade costs, and the Bounded Actor; (B) fitted processing time (295 ms, Left) and switching costs (σ² = 285 ms, Right) as functions of fixation duration; (C) ΔAIC for models lacking one component (cost, processing time, acting uncertainty, perceptual uncertainty) relative to the full model M.]

Fig. 3. Computational model. (A) Schematic visualization describing how the different model components influence the preferred SPs for a single condition (MM). Images show which regions in the action space are preferred by the respective models. Each location in the 2D space corresponds to a single SP (see also Fig. 1B). White dots are the mean fixation durations of our subjects aggregated over individual blocks. The optimal strategy is marked with a black cross for each model. The Ideal Observer only takes the probability of missing an event into account. The Ideal Actor also accounts for variability in the execution of eye movement plans, thus suppressing long fixations (because they increase the risk of missing the event). Therefore, strategies using shorter fixation durations are favored. Next, additive costs for faster switching yield a preference for longer fixation durations. Hence, less value is given to very fast SPs. Finally, the Bounded Actor model excludes very short fixation durations, which are biologically impossible, because they do not allow sufficient time for processing visual information. (B) Model components for processing time and switching costs after parameter estimation. Parameters were estimated through least squares using the last 10 trials of each block. (C) Model comparison. For the absence of each component the model was fitted by least squares. Differences in AIC with respect to the full model are shown.

Bayesian Learner Model. All models so far have assumed full knowledge of the event durations, as is common in ideal observer models. This neglects the fact that event durations were a priori unknown to the participants at the beginning of each block of trials and needed to be inferred from observations. In our TED task information about the event durations cannot be accessed directly but must be inferred from different types of censored and uncensored observations (Fig. 4A). Using a Bayesian learner we simulated the subjects' mean estimate and uncertainty of the event duration for each trial in a block using Markov chain Monte Carlo techniques (Fig. 4B). We used a single Gaussian distribution fitted to the true event durations for the three conditions to describe subjects' initial belief before the first observation within a block. Our Bayesian learner enables us to simulate the bounded actor model for every trial of a block using the mean estimates (Fig. 4C). The results show that our bounded actor together with the ideal Bayesian learner is sufficient to recover the main characteristics of the behavioral data. Crucially, the proposed learner does not introduce any further assumptions or parameters. This means that the behavioral changes we observed can be explained solely by changes in the estimates of the event durations. As with the steady-state case, all included factors contributed uniquely to the model. Omitting a single component led to severe deviation from the human data (Fig. S4).

[Figure 4 graphic: (A) schematic of fully observed, missed, and partially observed events; (B) mean estimates and true event durations (short 0.15 s, medium 0.75 s, long 1.5 s) over the 20 trials of a block; (C) mean fixation duration (s) per trial, human data vs. ideal learner, for each condition.]

Fig. 4. Bayesian learner model. (A) Different types of observations in the TED task. Dashed lines depict SPs and rectangles denote events. (B) Mean estimates and uncertainty (SD) for the event durations over the course of a block. Prior belief over event durations before the first observation is shown on the left side. (C) Human data and model predictions for the Bayesian learner model. Fitted parameters from the full model were used. Dashed lines show the mean fixation duration of the bounded actor model over the course of the blocks. For each trial the mean estimate suggested by the ideal learner was used.

Discussion
The current study investigated how humans learn to adjust their eye movement strategies to detect relevant events in their environment. The minimal computational model matching key properties of observed gaze behavior includes perceptual uncertainty about duration, action variability in generated gaze switches, a minimal visual processing time, and, finally, intrinsic costs for switching gaze. The bounded actor model explains seemingly suboptimal behavior such as the decrease in performance in easier task conditions as a trade-off between detection probability and faster but more costly eye movement switching. With respect to this bounded actor model, the experimentally measured switching times at the end of the blocks were close to optimal. Whereas ideal observer models (15–17) assume that the behaviorally relevant uncertainties are all known to the observer, here we also modeled the learning progress. The speed at which learning proceeded was remarkable; subjects were close to their final gaze-switching frequencies after only 5 to 10 trials on average, suggesting that the observed gaze behavior was not idiosyncratic to the TED task. Based on the bounded actor model we proposed a Bayesian optimal learner, which only incorporates the inherent perceptual uncertainty about time passage on the basis of the scalar law of biological timing (26, 28, 29). Remarkably, this source of uncertainty was sufficient to fully account for the learning progress. This suggests that the human visual system is highly efficient in learning temporal regularities of events in the environment and that it can use these to direct gaze to locations in which behaviorally relevant events will occur.

Taken together, this study provides further evidence for the importance of the behavioral goal in perceptual strategies (2, 9, 24), because low-level visual feature saliency cannot account for the gaze switches (45). Instead, cognitive inferences on the basis of successive observations and detections of visual events lead to changes in gaze strategies. At the beginning of a block, participants spent equal observation times between targets to learn about the two event durations and progressively looked longer at the location with the shorter event duration. This implies that internal representations of event durations were likely used to adjust the perceptual strategy of looking for future events. A further important feature of the TED task and the presented model is that, contrary to previous studies (15–17), the informativeness of spatial locations changes moment by moment and not fixation by fixation, which corresponds more to naturalistic settings. Under such circumstances, optimality of behavior is not exclusively evaluated with respect to a single, next gaze shift, but instead is driven by a complex sequence of behavioral decisions based on uncertain and noisy sensory measurements. In this respect, our TED task is much closer to natural vision than experiments with static displays. However, the spatial locations in our task were fixed and the visual stimuli were simple geometric shapes. Naturalistic vision involves stimuli of rich spatiotemporal statistics (46) and semantic content across the entire visual field. How the results observed in this study transfer to subjects learning to attend in more naturalistic scenarios will require further research. The challenge under such circumstances is to quantify relative uncertainties as well as costs and benefits of all contributing factors. There is nothing that in principle precludes applying the framework developed here to such situations, because of its generality. Clearly, several factors indicate that participants did not depart dramatically from eye movement strategies in more normal viewing environments, and thus that natural behavior transferred well to the TED task. First, subjects did not receive extensive training (Materials and Methods). Second, they showed only minimal changes in learning behavior across blocks over the duration of the entire experiment (Fig. S5). Third, intersaccadic interval statistics were very close to the ones observed in naturalistic settings according to previous literature (34–37). Overall, the concepts identified here contribute uniquely to our understanding of human visual behavior and should inform our ideas about neuronal mechanisms underlying attentional shifts (47), conceptualized as sequential decisions. The results of this study may similarly be relevant for the understanding of learning optimal sensory strategies in other modalities, and the methodologies can fruitfully be applied there, too.

Materials and Methods
Subjects. Ten undergraduate subjects (four females) took part in the experiment in exchange for course credit. The participants' ages ranged from 18 to 30 y. All subjects had normal or corrected-to-normal vision. Data from two subjects were not included in the analysis due to insufficient quality of the eye movement data. Participants completed two to four shorter training blocks (10 trials per block) before the experiment to get used to the setting and the task (on average 5 min). The number of training blocks, however, had no influence on task performance. All experimental procedures were approved by the ethics committee of the Darmstadt University of Technology. Informed consent was obtained from all participants.

Stimuli and Analysis. The two regions (distance 35°) were presented on a Flatron W2242TE monitor (1,680 × 1,050 resolution and 60-Hz refresh rate). Each region was marked with a gray circle (0.52° in diameter). Contrast of the target dot was adjusted to prevent extrafoveal vision. Eye movement data were collected using SMI Eye Tracking Glasses (60-Hz sample rate). Calibration was done using a three-point calibration. For detection of saccades, fixations, and blinks, dispersion-threshold identification was used.
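For reference, a minimal dispersion-threshold identification (I-DT) routine in the spirit of the method named above might look as follows; the dispersion and duration thresholds are illustrative assumptions, not the values used in the study. At 60 Hz, min_samples = 6 corresponds to a 100-ms minimum fixation.

```python
# Sketch: dispersion-threshold identification (I-DT) of fixations. gaze is
# an (N, 2) array of x/y gaze positions (e.g., in degrees) sampled at 60 Hz.
import numpy as np

def idt_fixations(gaze, max_dispersion=1.0, min_samples=6):
    """Return (start, end) sample-index pairs of fixations (end exclusive)."""
    fixations, start, n = [], 0, len(gaze)
    while start + min_samples <= n:
        end = start + min_samples
        window = gaze[start:end]
        # Dispersion = (max x - min x) + (max y - min y) over the window.
        if np.ptp(window[:, 0]) + np.ptp(window[:, 1]) <= max_dispersion:
            # Grow the window while dispersion stays under the threshold.
            while end < n and (np.ptp(gaze[start:end + 1, 0]) +
                               np.ptp(gaze[start:end + 1, 1])) <= max_dispersion:
                end += 1
            fixations.append((start, end))
            start = end
        else:
            start += 1
    return fixations
```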

ACKNOWLEDGMENTS. We thank R. Fleming, M. Hayhoe, F. Jäkel, andM. Lengyel for useful comments on the manuscript.

1. Land MF, Nilsson D (2002) Animal Eyes (Oxford Univ Press, Oxford).
2. Yarbus A (1967) Eye Movements and Vision (Plenum, New York).
3. Findlay JM, Gilchrist ID (2003) Active Vision: The Psychology of Looking and Seeing (Oxford Univ Press, Oxford).
4. Hayhoe M, Ballard D (2005) Eye movements in natural behavior. Trends Cogn Sci 9(4):188–194.
5. Land M, Tatler B (2009) Looking and Acting: Vision and Eye Movements in Natural Behaviour (Oxford Univ Press, Oxford).
6. Borji A, Sihite DN, Itti L (2013) What stands out in a scene? A study of human explicit saliency judgment. Vision Res 91(0):62–77.
7. Torralba A, Oliva A, Castelhano MS, Henderson JM (2006) Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychol Rev 113(4):766–786.
8. Henderson JM (2003) Human gaze control during real-world scene perception. Trends Cogn Sci 7(11):498–504.
9. Rothkopf CA, Ballard DH, Hayhoe MM (2007) Task and context determine where you look. J Vis 7(14):1–20.
10. Navalpakkam V, Koch C, Rangel A, Perona P (2010) Optimal reward harvesting in complex perceptual environments. Proc Natl Acad Sci USA 107(11):5232–5237.
11. Schütz AC, Trommershäuser J, Gegenfurtner KR (2012) Dynamic integration of information about salience and value for saccadic eye movements. Proc Natl Acad Sci USA 109(19):7547–7552.
12. Morvan C, Maloney LT (2012) Human visual search does not maximize the post-saccadic probability of identifying targets. PLOS Comput Biol 8(2):e1002342.
13. Morvan C, Maloney L (2009) Suboptimal selection of initial saccade in a visual search task. J Vis 9(8):444.
14. Clarke AD, Hunt AR (2016) Failure of intuition when choosing whether to invest in a single goal or split resources between two goals. Psychol Sci 27(1):64–74.
15. Najemnik J, Geisler WS (2005) Optimal eye movement strategies in visual search. Nature 434(7031):387–391.
16. Chukoskie L, Snider J, Mozer MC, Krauzlis RJ, Sejnowski TJ (2013) Learning where to look for a hidden target. Proc Natl Acad Sci USA 110(Suppl 2):10438–10445.
17. Peterson MF, Eckstein MP (2012) Looking just below the eyes is optimal across face recognition tasks. Proc Natl Acad Sci USA 109(48):E3314–E3323.
18. Yang SC, Lengyel M, Wolpert DM (2016) Active sensing in the categorization of visual patterns. eLife 5:e12215.
19. Ballard DH, Hayhoe MM, Pelz JB (1995) Memory representations in natural tasks. J Cogn Neurosci 7(1):66–80.
20. Johansson RS, Westling G, Bäckström A, Flanagan JR (2001) Eye–hand coordination in object manipulation. J Neurosci 21(17):6917–6932.
21. Diaz G, Cooper J, Rothkopf C, Hayhoe M (2013) Saccades to future ball location reveal memory-based prediction in a virtual-reality interception task. J Vis 13(1):20.
22. Flanagan JR, Johansson RS (2003) Action plans used in action observation. Nature 424(6950):769–771.
23. Land MF, McLeod P (2000) From eye movements to actions: How batsmen hit the ball. Nat Neurosci 3(12):1340–1345.
24. Hayhoe M, Ballard D (2014) Modeling task control of eye movements. Curr Biol 24(13):R622–R628.
25. Geisler WS (1989) Sequential ideal-observer analysis of visual discriminations. Psychol Rev 96(2):267–314.
26. Merchant H, Harrington DL, Meck WH (2013) Neural basis of the perception and estimation of time. Annu Rev Neurosci 36:313–336.
27. Ivry RB, Schlerf JE (2008) Dedicated and intrinsic models of time perception. Trends Cogn Sci 12(7):273–280.
28. Wittmann M (2013) The inner sense of time: How the brain creates a representation of duration. Nat Rev Neurosci 14(3):217–223.
29. Gibbon J (1977) Scalar expectancy theory and Weber's law in animal timing. Psychol Rev 84(3):279.
30. Hudson TE, Maloney LT, Landy MS (2008) Optimal compensation for temporal uncertainty in movement planning. PLOS Comput Biol 4(7):e1000130.
31. Jazayeri M, Shadlen MN (2010) Temporal context calibrates interval timing. Nat Neurosci 13(8):1020–1026.
32. Sims CR, Jacobs RA, Knill DC (2011) Adaptive allocation of vision under competing task demands. J Neurosci 31(3):928–943.
33. Ratcliff R (1979) Group reaction time distributions and an analysis of distribution statistics. Psychol Bull 86(3):446–461.
34. Laubrock J, Cajar A, Engbert R (2013) Control of fixation duration during scene viewing by interaction of foveal and peripheral processing. J Vis 13(12):11.
35. Staub A (2011) The effect of lexical predictability on distributions of eye fixation durations. Psychon Bull Rev 18(2):371–376.
36. Luke SG, Nuthmann A, Henderson JM (2013) Eye movement control in scene viewing and reading: Evidence from the stimulus onset delay paradigm. J Exp Psychol Hum Percept Perform 39(1):10–15.
37. Palmer EM, Horowitz TS, Torralba A, Wolfe JM (2011) What are the shapes of response time distributions in visual search? J Exp Psychol Hum Percept Perform 37(1):58–71.
38. Schütz AC, Lossin F, Kerzel D (2013) Temporal stimulus properties that attract gaze to the periphery and repel gaze from fixation. J Vis 13(5):6.
39. Stanford TR, Shankar S, Massoglia DP, Costello MG, Salinas E (2010) Perceptual decision making in less than 30 milliseconds. Nat Neurosci 13(3):379–385.
40. Thorpe S, Fize D, Marlot C (1996) Speed of processing in the human visual system. Nature 381(6582):520–522.
41. Fabre-Thorpe M, Delorme A, Marlot C, Thorpe S (2001) A limit to the speed of processing in ultra-rapid visual categorization of novel natural scenes. J Cogn Neurosci 13(2):171–180.
42. Trukenbrod HA, Engbert R (2014) ICAT: A computational model for the adaptive control of fixation durations. Psychon Bull Rev 21(4):907–934.
43. Remington RW, Wu SC, Pashler H (2011) What determines saccade timing in sequences of coordinated eye and hand movements? Psychon Bull Rev 18(3):538–543.
44. Wagenmakers EJ, Farrell S (2004) AIC model selection using Akaike weights. Psychon Bull Rev 11(1):192–196.
45. Itti L, Koch C (2001) Computational modelling of visual attention. Nat Rev Neurosci 2(3):194–203.
46. Simoncelli EP, Olshausen BA (2001) Natural image statistics and neural representation. Annu Rev Neurosci 24(1):1193–1216.
47. Gottlieb J (2012) Attention, learning, and the value of information. Neuron 76(2):281–295.
48. Burnham KP, Anderson DR (2004) Multimodel inference. Sociol Methods Res 33(2):261–304.


Supporting Information

Hoppe and Rothkopf 10.1073/pnas.1601305113

Statistics of the TED Task
Events for the TED task were generated according to the graphical model presented in Fig. S2A. An event was defined as a tuple (s, d, q) with the generative process

s ∼ U(a, b),   [S1]

q ∼ B(1, p),   [S2]

θ = μL if q = 0; μR if q = 1,   [S3]

d | q ∼ N(θ, σ²),   [S4]

where the random variable s denotes the event's starting time within a trial, which was distributed uniformly between a = 2 s and b = 8 s. The region q where the event occurs (q = 0: left; q = 1: right) was drawn from a binomial distribution and the event's duration d from a Gaussian distribution. The SD of the event durations σ was fixed to a small value (100 ms). Events were equally probable at both regions (p = 0.5). The mean of the event duration depended on the region where the event occurred. The mean event durations for the right region (μR) and for the left region (μL) were fixed over the course of a block according to the respective condition. For example, in a block belonging to condition LM (long–medium) μL and μR were set to 1.5 s and 0.75 s, respectively. Because the goal in each trial was to detect the event, we established the relationship between SPs and detection probability.
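This generative process is concrete enough to sketch in a few lines. The following is a minimal simulation of Eqs. S1–S4 under the parameter values stated above; all function and variable names are ours, and the clipping of negative durations is an added assumption.

```python
# Sketch of the generative process in Eqs. S1-S4, assuming a = 2 s, b = 8 s,
# sigma = 100 ms, p = 0.5, and the condition means S = 0.15 s, M = 0.75 s,
# L = 1.5 s given in the text.
import numpy as np

rng = np.random.default_rng(0)
MEANS = {"S": 0.15, "M": 0.75, "L": 1.5}    # mean event durations (s)

def sample_event(condition="LS", a=2.0, b=8.0, sigma=0.1):
    """Draw one event tuple (s, d, q) for a block of the given condition."""
    s = rng.uniform(a, b)                   # start time, Eq. S1
    q = rng.integers(0, 2)                  # region (0 = left, 1 = right), Eq. S2
    theta = MEANS[condition[q]]             # region-dependent mean, Eq. S3
    d = max(rng.normal(theta, sigma), 0.0)  # duration, Eq. S4 (clipped at zero)
    return s, d, q

print(sample_event("LS"))  # e.g., a long event on the left or short on the right
```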

Therefore, we first derived the probability that an event ends at time e at region q based on the generative process. This corresponds to the joint distribution p(e, q), which can be factorized into p(e, q) = p(e|q)p(q). For any event the ending time is the sum of the starting time and the duration. The probability of an event's ending at a specific time e given the region q can be computed from the sum of s and d, which results in the convolution of the respective probability density functions

p(e|q) = p(s) ∗ p(d|q) = (1/(b − a)) [ψθ((b − e)/σ) − ψθ((a − e)/σ)].   [S5]

The last identity follows from the convolution of a Gaussian and a uniform distribution, where ψθ denotes the cumulative distribution of a normally distributed variable centered at θ.

We formalized SPs as a tuple (fL, δ, fR), where fL and fR are the fixation durations for the left and right region, respectively, and δ is the time needed to move focus from one position to the other (i.e., the duration of a saccade). Using our results for the distribution of event ending times we derived the probability of missing an event when performing an SP (Fig. S2B). For a single region (e.g., R), the time between two consecutive fixations is fL + 2δ. Events both starting and ending within this time range at region R are regarded as missed. The probability of missing an event at region R while fixating L can therefore be computed as

p(event missed | fL, δ, fR, q = R) = p(e ≤ fL + 2δ | q = R) = ∫₀^(fL+2δ) p(e | q = R) de.   [S6]

The probability of missing an event at region L can be computed analogously. The event-generating process implies that events are mutually exclusive at the two regions. Therefore, we can compute the probability of missing an event when performing the SP (fL, δ, fR) by marginalizing over the regions

p(event missed | fL, δ, fR) = Σ_{q ∈ {L, R}} p(event missed | fL, δ, fR, q) p(q).   [S7]

The derived probability of missing an event only covers the sequence of two fixations. Thus, the total time covered is fL + fR + 2δ. To compute the probability of missing an event over the course of the entire trial and to compare different SPs, we divided by the total time covered by the SP. As a result we get the expected probability of missing an event in a single time step Δt associated with a certain SP:

P(event missed | fL, δ, fR)/Δt = P(event missed | fL, δ, fR) / (fL + fR + 2δ).   [S8]
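A Monte-Carlo counterpart to Eqs. S6–S8 offers a useful sanity check on these expressions. The sketch below simulates trials under a strictly periodic SP aligned to trial onset and counts missed events; the 30-ms saccade duration δ and the alignment assumption are ours, and the estimate is a per-trial miss probability rather than the per-time-step quantity of Eq. S8.

```python
# Monte-Carlo estimate of the probability that an event is missed under a
# fixed switching pattern (f_L, delta, f_R). Gaze is assumed to alternate
# periodically from trial onset: left fixation, saccade, right fixation,
# saccade, and so on. Event parameters follow Eqs. S1-S4.
import numpy as np

rng = np.random.default_rng(1)

def miss_rate(f_L, f_R, mu_L, mu_R, delta=0.03, sigma=0.1,
              a=2.0, b=8.0, n_trials=50_000):
    cycle = f_L + f_R + 2 * delta
    missed = 0
    for _ in range(n_trials):
        s = rng.uniform(a, b)                        # event start (Eq. S1)
        q = rng.integers(0, 2)                       # region (Eq. S2)
        d = max(rng.normal(mu_L if q == 0 else mu_R, sigma), 0.0)
        # Fixation window for the event's region within one switching cycle.
        lo, hi = (0.0, f_L) if q == 0 else (f_L + delta, f_L + delta + f_R)
        phase = s % cycle
        # The event [phase, phase + d) is detected if it overlaps the window
        # in this cycle or the next (two consecutive copies decide overlap).
        seen = any(phase < hi + k * cycle and phase + d > lo + k * cycle
                   for k in (0.0, 1.0))
        missed += not seen
    return missed / n_trials

print(miss_rate(0.4, 0.4, mu_L=1.5, mu_R=0.15))      # condition LS
```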

Computational Models of Gaze Switching Behavior
Based on the statistical relationships derived from the event-generating process we now propose different models for determining the optimal SP assuming that the event durations are learned. The time needed to perform a saccade is small compared with the fixation durations and by and large determined by its amplitude, as reflected by the main sequence relation. Thus, δ was treated as a fixed value and omitted from the equations. Further, the probability of missing the event when performing a certain SP can be expressed as a task-related cost function cmiss(fL, fR).

We started by deriving an ideal observer that chooses SPs fL, fR solely by minimizing cmiss:

SP_IO = argmin_{fL, fR} cmiss(fL, fR).   [S9]

Ideal observer models are frequently used to investigate human behavior involving perceptual inferences. However, a perceptual problem may have multiple ideal observer models depending on which quantities are assumed to be unknown. Ideal observers usually do not consider costs but provide a general solution to an inference problem. Therefore, we extended our ideal observer to obtain more realistic models.

First, by using the true event durations d the ideal observer assumes perfect knowledge about the generative process. In particular, the mean durations for both regions (μL and μR) are assumed to be known. This is clearly not the case in the beginning of the blocks, and even after learning, perceptual uncertainty and, crucially, the scalar property prevent complete knowledge. Although the former is irrelevant for the case of learned event durations, the latter is not. Building on findings regarding the scalar property we augmented the ideal observer by including signal-dependent Gaussian noise on the true variance of event durations σ². As a result the distribution d|q becomes N(θ, σ² + wf·θ), where wf is the Weber fraction.

Second, the ideal observer only considers task performance for determining optimal SPs. It implicitly assumes that SPs do not differ with respect to other relevant metrics. However, switching at a very high frequency consumes more metabolic energy and can be associated with further internal costs. Therefore, we hypothesized


that human behavior is the result of trading off task performance and intrinsic costs:

SP_{IO+Cost}(α, τ) = argmin_{fL, fR} [cmiss(fL, fR) + α · ccost(fL, fR, τ)],   [S10]

where α quantifies eye movement costs in units of cmiss. To our knowledge the exact shape of these costs is unknown. We hypothesize that eye movement costs are proportional to the switching frequency and therefore to the sum of the fixation durations. We used the positive range of a bell curve as a cost function:

ccost(fL, fR, τ) = exp(−(fL + fR)²/τ²),   [S11]

where τ controls how quickly costs increase with switching frequency.

where τ controls how quickly costs increase with switchingfrequency.So far, all models have assumed the SPs to be deterministic.

However, humans clearly are not capable of performing an eyemovement pattern without variability. In addition, they have shownto take motor variability into account when choosing actions, andhence they can be described as ideal actors. The ideal actor modeltakes into account that there is variation when targeting a certainSP fL, fR. This property can be considered by no longer treatingfixation durations as fixed but as samples drawn from some dis-tribution. We chose an exponential Gaussian, centered at thetargeted durations fL and fR:

pðf jfX Þ= ex-Gaussianðζ, η,ωÞ [S12]

for X in (L,R), whose expected value E½f $= ζ+ η is equal to therespective target fixation duration fX . Given fX the shape pa-rameters of the distribution were determined using fX = ζ+ ηtogether with ζ=w0 +w1η. The regression parameters w0 and w1as well as ω were estimated from the data. The expected value ofcmissðfL, fRÞ is then computed by marginalizing out the variabilityof fL, fR:

cactmissðfL, fRÞ=Z

f0

Z

f1

pðf0jfLÞpðf1jfRÞcmissðf0, f1Þ. [S13]
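The double integral in Eq. S13 can be approximated by Monte Carlo sampling. The sketch below follows the SI parameterization (fX = ζ + η with ζ = w0 + w1·η); the numerical values of w0, w1, and ω are placeholders, not the fitted parameters.

```python
# Monte-Carlo approximation of Eq. S13: expected miss cost under ex-Gaussian
# variability of executed fixation durations. c_miss is any miss-cost
# function, e.g., the simulator sketched earlier; w0, w1, omega values are
# placeholders for the quantities estimated from data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def ex_gauss_sample(target, w0, w1, omega, size):
    # SI parameterization: target = zeta + eta with zeta = w0 + w1 * eta,
    # hence eta = (target - w0) / (1 + w1).
    eta = (target - w0) / (1.0 + w1)
    zeta = target - eta
    return stats.exponnorm.rvs(eta / omega, loc=zeta, scale=omega,
                               size=size, random_state=rng)

def expected_miss_cost(f_L, f_R, c_miss, w0=0.05, w1=0.5, omega=0.05, n=2000):
    """Monte-Carlo estimate of c^act_miss(f_L, f_R) in Eq. S13."""
    f0 = ex_gauss_sample(f_L, w0, w1, omega, n)
    f1 = ex_gauss_sample(f_R, w0, w1, omega, n)
    return float(np.mean([c_miss(x, y) for x, y in zip(f0, f1)]))
```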

Finally, SPs are constrained by the abilities of the motor system and the speed of information processing. As a consequence, targeted fixation durations have a lower bound. To account for this constraint, we propose a threshold using a cumulative Gaussian:

p(fX) = ψ(β, γ²),   [S14]

where the mean β was estimated from the data and γ² was fixed to a small value. Combining all components yields the bounded actor:

SP_BoundedActor(α, τ, β, w0, w1, ω) = argmin_{fL, fR} [ p(fL) p(fR) ∫∫ p(f0 | fL) p(f1 | fR) cmiss(f0, f1) df0 df1 + α · ccost(fL, fR, τ) ].   [S15]
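On a discretized action space, the minimizations in Eqs. S10 and S15 reduce to a grid search. The sketch below implements the Eq. S10 trade-off with illustrative α and τ; the full bounded actor would substitute the expectation of Eq. S13 for cmiss and apply the threshold of Eq. S14.

```python
# Grid-search sketch of the optimization in Eq. S10: trade off the miss
# probability against the switching cost of Eq. S11. alpha and tau are
# illustrative values, not the fitted parameters.
import numpy as np

def optimal_sp(c_miss, alpha=0.3, tau=0.6, grid=np.linspace(0.1, 1.5, 57)):
    best, best_cost = None, np.inf
    for f_L in grid:
        for f_R in grid:
            cost = c_miss(f_L, f_R) + alpha * np.exp(-(f_L + f_R)**2 / tau**2)
            if cost < best_cost:
                best, best_cost = (f_L, f_R), cost
    return best

# Example usage, plugging in the Monte-Carlo miss rate sketched earlier:
# print(optimal_sp(lambda fL, fR: miss_rate(fL, fR, mu_L=1.5, mu_R=0.15)))
```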

Parameter Estimation
Parameters were estimated by least squares using the aggregated data from trials 10–20, because switching behavior was stable across subjects during this period. AIC values for the models were computed according to ref. 48 as

AIC = n log(SSE/n) + 2(p + 1) + C,   [S16]

where n is the number of data points, p the number of free parameters, and SSE is the sum of squared errors. The constant C contains all additive components that are shared between the models. Because we use the difference of AIC values between the models for model comparison, these constant values can be ignored, because they cancel out. Akaike weights were computed according to ref. 44 as

wi(AIC) = exp(−(1/2) Δi(AIC)) / Σ_{k=1}^{K} exp(−(1/2) Δk(AIC)),   [S17]

where i refers to the model the weight is computed for and Δi(AIC) is the difference in AIC between the ith model and the model with the minimum AIC. The Akaike weight of a model is a measure of how likely it is that the respective model is the best among the set of models. The probability ratio between two competing models i and j was computed as the quotient of their respective weights, wi/wj.
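Eqs. S16 and S17 translate directly into code; in the sketch below, the SSE values, n, and parameter counts are invented for illustration.

```python
# Sketch of Eqs. S16 and S17: AIC for least-squares fits and Akaike weights.
import numpy as np

def aic(sse, n, p):
    # Eq. S16; the shared additive constant C is dropped since it cancels.
    return n * np.log(sse / n) + 2 * (p + 1)

aics = {"full": aic(0.20, 90, 6), "no cost": aic(0.45, 90, 5)}
deltas = {m: a - min(aics.values()) for m, a in aics.items()}
raw = {m: np.exp(-0.5 * d) for m, d in deltas.items()}
weights = {m: w / sum(raw.values()) for m, w in raw.items()}   # Eq. S17
print(weights, "probability ratio:", weights["full"] / weights["no cost"])
```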

Lower Bound Model FitOur model predicted a single optimal fixation duration for eachaggregated line in Fig. S3D (in the range of trials 10–20). TheSSE was computed as the variation around this predicted valueand then used to obtain the AIC. Differences in AIC were usedto compare different models. However, we used the AIC solelyfor the purpose of model comparison, not as an absolute mea-sure of model fit. The reason for this is that the AIC is not well-suited for an evaluation of absolute model fit because there is nofixed lower bound. If a model were so good that it perfectlypredicted the dependent variable, the error-sum-of-squareswould approach zero, and the natural logarithm of this value andthereby the AIC would approach negative infinity (Eq. S16).Further, the AIC by itself is highly dependent on the units ofmeasurement, which influence the error-sum-of-squares.Here we present an absolute lower bound on AIC for the data

presented in our experiment. We computed the SSE for a model that estimates the best-fitting parameter for each aggregated line in Fig. S3D. For each line, the parameter that minimizes the SSE is the mean. The lower bound for the data we collected is SSE = 0.09 and AIC = −603.15, respectively. We used n = 90 data points and p = 9 parameters (one for each line separately).

Note that this is an overly optimistic bound for two reasons.

First, the estimates for fixation durations underlying the lower bound are independently chosen to best represent each individual optimal strategy. For the full model, however, these predictions are not independent, because each estimated parameter simultaneously affects the optimal fixation durations for all conditions. This dependency puts a constraint on the full model that is not considered by this lower bound. Second, the lower bound assumes the sample mean of the data to be equal to the optimal fixation duration and only quantifies the variation around that mean. This neglects the variability across blocks and participants. In contrast, the full model uses all available data to estimate biologically plausible parameters. Using these parameters, the optimal fixation duration is predicted by the dynamics of the full model rather than directly estimated from the data. Although this is necessary to provide generalizability and indispensable to prevent overfitting, it naturally introduces a further source of variability. This is reflected in a higher SSE of the full model compared with the lower bound.
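As a sketch, the lower bound described above amounts to fitting each aggregated line by its own sample mean; `lines` is an assumed data layout, one array of aggregated fixation durations per line in Fig. S3D:

```python
import numpy as np

def lower_bound_aic(lines):
    """Lower-bound AIC: fit each aggregated line by its own sample mean.

    `lines`: one 1-D array of aggregated fixation durations per line
    (9 lines and 90 data points in total in the experiment reported here).
    """
    n = sum(len(y) for y in lines)
    p = len(lines)                                    # one mean per line
    sse = sum(float(np.sum((y - y.mean()) ** 2)) for y in lines)
    return n * np.log(sse / n) + 2 * (p + 1)          # Eq. S16 without C
```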

Bayesian Learner
All models so far have assumed perfect knowledge of the event durations despite perceptual uncertainty. They are therefore only applicable to the steady-state case, because event durations were unknown at the beginning of a block and had to be learned over the course of the block. Learning was formulated as updating the model's belief about the event durations through observations (Fig. 4A). From a Bayesian perspective this can be seen as using the observation to transform a prior distribution into a posterior distribution:

$$p(\theta \mid o) = \frac{p(o \mid \theta)\, p(\theta)}{p(o)}, \qquad [S18]$$

where θ denotes the mean of the event duration and o a sequence of observations. The updated belief, the posterior distribution p(θ | o), is computed as the product of the prior p(θ) and the likelihood p(o | θ). The denominator normalizes the posterior to sum to one.

In the TED task there are three types of observations. First, if

an event starts at a region while the respective region is fixated and is perceived for the whole event duration, the observation is defined as fully observed. Second, if the region is not fixated at all while the event occurs, the observation is defined as missed. Third, if only a fraction of the event is observed (the event is already occurring when the region is fixated), the observation is defined as partially observed. Hereby, we assume that if a region is fixated while an event occurs, participants continue fixating the respective region until the event has finished. Our data show that this is indeed the case. The different types of observations lead to

different types of censoring. The resulting likelihood can be computed as

$$p(o \mid \theta) = \prod_{\text{fully observed}} p(o \mid \theta) \cdot \prod_{\text{missed}} F(t_E - t_L) \cdot \prod_{\text{partially observed}} \left[ F(t_E - t_L) - F(t_E - t_S) \right], \qquad [S19,\ S20]$$

where p(o | θ) is Gaussian and F is the cumulative distribution function of θ. The starting and ending times of the event are denoted by t_S and t_E, respectively. The time when the region of interest is fixated is denoted by t_L.
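The censored likelihood of Eqs. S19 and S20 can be sketched as follows. The event-duration SD σ and the observation data layout are assumptions for illustration (the text only states that p(o | θ) is Gaussian), and the two CDF terms in the partially observed case follow Eq. S20 as printed:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(theta, sigma, observations):
    """Censored log-likelihood of Eqs. S19-S20 for a belief mean theta."""
    F = lambda x: norm.cdf(x, loc=theta, scale=sigma)  # CDF used in Eqs. S19-S20
    ll = 0.0
    for o in observations:
        if o["type"] == "full":
            # fully observed: Gaussian density of the observed duration
            ll += norm.logpdf(o["duration"], loc=theta, scale=sigma)
        elif o["type"] == "missed":
            # missed event: censored term F(t_E - t_L)
            ll += np.log(max(F(o["t_E"] - o["t_L"]), 1e-300))
        elif o["type"] == "partial":
            # partially observed: F(t_E - t_L) - F(t_E - t_S)
            prob = F(o["t_E"] - o["t_L"]) - F(o["t_E"] - o["t_S"])
            ll += np.log(max(abs(prob), 1e-300))  # abs() guards the term order
    return ll
```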

To extend our models from the steady-state case to all trials, the progression of uncertainty over the course of the blocks was simulated using the proposed Bayesian learner (Fig. 4B). We computed the mean estimate and the uncertainty of the event duration for each trial in a block as follows. We repeatedly simulated events and determined the type of the resulting observation using the mean SP for each stage of a block. We then used Markov chain Monte Carlo methods to draw samples from the posterior belief about the true event durations for each trial. For each trial, the prior distribution for the event duration is equivalent to the posterior distribution of the previous trial. For the first trial, i.e., the initial belief before the first observation, we used a Gaussian distribution fitted to the (true) mean event durations of the three conditions. We controlled the subjects' prior belief by having them perform a few training trials during which they became familiar with the overall range of event durations.
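A minimal sketch of this trial-by-trial update: where the paper draws posterior samples with MCMC, the version below substitutes a discrete grid over θ for simplicity, reusing the `log_likelihood` sketch above; the grid resolution and all names are illustrative:

```python
import numpy as np

def update_belief(log_prior, theta_grid, sigma, observation):
    """One prior -> posterior step (Eq. S18) on a grid over theta."""
    log_post = log_prior + np.array(
        [log_likelihood(th, sigma, [observation]) for th in theta_grid])
    log_post -= log_post.max()              # numerical stabilization
    post = np.exp(log_post)
    return np.log(post / post.sum())        # normalized log posterior

def belief_summary(log_post, theta_grid):
    """Mean estimate and uncertainty of the event duration for a trial."""
    p = np.exp(log_post)
    mean = float(np.sum(p * theta_grid))
    sd = float(np.sqrt(np.sum(p * (theta_grid - mean) ** 2)))
    return mean, sd
```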


Fig. S1. Variability of fixation durations. (A) Distributions of fixation durations across conditions (event durations 0.15 s, 0.75 s, and 1.5 s). For each condition histograms and fitted exponential Gaussian distributions are shown for both regions. (B) Scalar law of biological timing. (Upper) The relationship between the mean fixation durations in a trial and the respective SD observed in our experiment (r² = 0.88). The solid line refers to the result of a fitted linear regression. (Lower) The linear relationship of the parameters of the exponential Gaussian distribution (r² = 0.78), where μ denotes the mean of the Gaussian and τ denotes the rate of the exponential distribution. Each point represents the best parameter estimates for a single distribution shown in A. (C) Quantifying oculomotor fatigue. Saccade durations (ms) over the course of trials aggregated over all blocks (Upper, r² = 0.16). Saccade durations (ms) over the course of blocks aggregated over all trials within the respective blocks (Lower, r² = 0.05).


Fig. S2. Derivation of the statistics in the TED task. (A) Probabilistic graphical model of the event-generating process. (B) Probability of missing an event. The condition for the shown segment is LM (i.e., long event durations at the left region and medium event durations at the right region). The probability of missing the event (orange area) is shown for an arbitrary SP (dashed line), where the left region is fixated for a duration f_L and the right region for a duration f_R. Events are missed if the event's region is not fixated for the entire duration of the event. For long event durations the offset of the probability of missing an event (purple lines) is greater than for short event durations. As a consequence, the time between two fixations can be greater if the event duration is long without missing events.

Fig. S3. SPs (color map) predicted by the fitted model and observed human data for steady-state behavior (trials 15–20 of each block). [Axes: fixation duration left vs. fixation duration right; condition levels S, M, and L.]


Fig. S4. Learning of gaze switching over blocks for all conditions according to different models. For each model (except D), a single component was removed from the full model. The remaining parameters were estimated from the data using only the steady-state behavior (trials 10–20). (A) Without eye movement costs. (B) Without minimal processing time. (C) Without acting uncertainty. (D) Full model with all parameters fitted. [Axes: trial vs. fixation duration (seconds).]


Fig. S5. Learning effects over the course of the experiments. (A) Mean performance (percentage correct) aggregated over all participants, grouped by the position of the block within the experiment. Error bars correspond to the sample SD. Different conditions (SS, SM, SL, MM, ML, LL) are shown as separate lines. (B) Histogram of regression slopes. For each participant and each condition, a linear regression between the mean performance in the blocks and the position of the respective blocks in the experiment was computed. The distribution of regression slopes indicates no improvement in the task over the course of the experiment.

Table S1. Model comparison for all fitted models

Model                               p   SSE    AIC       ΔAIC     w(AIC)        w_min/w
FM                                  6   0.15   −561.56   0        0.99          1.0
FM without perceptual uncertainty   6   0.18   −543.11   18.44    9.9 × 10^−5   1.0 × 10^4
FM without cost                     4   0.65   −434.06   127.50   2.1 × 10^−28  4.8 × 10^27
FM without processing time          5   1.04   −389.14   172.42   3.6 × 10^−38  2.8 × 10^37
FM without acting uncertainty       4   1.21   −377.95   183.61   1.4 × 10^−40  7.4 × 10^39

Differences in AIC (ΔAIC), Akaike weights [w(AIC)], and probability ratios (w_min/w) were computed with respect to the full model (FM). The number of data points for all models was 90. p, number of fitted parameters.
