

SemantiCode: Using Content Similarity and Database-driven Matching to Code Wearable Eyetracker Gaze Data

Daniel F. Pontillo, Thomas B. Kinsman, & Jeff B. Pelz* Multidisciplinary Vision Research Lab, Carlson Center for Imaging Science, Rochester Institute of Technology

{dfp7615, btk1526, *pelz}@cis.rit.edu

Abstract

Laboratory eyetrackers, constrained to a fixed display and static (or accurately tracked) observer, facilitate automated analysis of fixation data. Development of wearable eyetrackers has extended environments and tasks that can be studied at the expense of automated analysis.

Wearable eyetrackers provide 2D point-of-regard (POR) in scene-camera coordinates, but the researcher is typically interested in some high-level semantic property (e.g., object identity, region, or material) surrounding individual fixation points. The synthesis of POR into fixations and semantic information remains a labor-intensive manual task, limiting the application of wearable eyetracking.

We describe a system that segments POR videos into fixations and allows users to train a database-driven, object-recognition system. A correctly trained library results in a very accurate and semi-automated translation of raw POR data into a sequence of objects, regions or materials.

Keywords: semantic coding, eyetracking, gaze data analysis

1 Introduction

Eye tracking has a well-established history of revealing valuable information about visual perception and more broadly about cognitive processes [Buswell 1935; Yarbus 1967; Mackworth and Morandi 1967; Just and Carpenter 1976]. Within this field of research, the objective is often to examine how an observer visually engages with the content or layout of an environment. When the observer’s head is stationary (or accurately tracked) and the stimuli are static (or their motion over time is recorded), commercial systems exist that are capable of automatically extracting gaze behavior in scene coordinates. Outside the laboratory, where observers are free to move through dynamic environments, the lack of constraints precludes the use of most existing automatic methods.

A variety of solutions have been proposed and implemented to overcome this issue. One approach, ‘FixTag’ [Munn and Pelz 2009], utilizes ray tracing to estimate fixation on 3D volumes of interest. In this scheme, a calibrated scene camera is used to track features across frames, allowing for the extraction of 3D camera movement. With this, points in a 2D image plane can be mapped onto the scene camera’s intrinsic 3D coordinate system. This allows for accurate ray tracing from a known origin relative to the scene camera. While this method has been shown to be accurate, it has limitations. Critically, it requires an accurate and complete a priori map of the environment to relate object identities with fixated volumes of interest. In addition, all data collection must be completed with a carefully calibrated scene camera, and the algorithm is computationally intensive. Another proposed method is based on Simultaneous Localization and Mapping (SLAM) algorithms originally developed for mobile robotics applications [Thrun and Leonard 2008]. Like FixTag, current implementations of SLAM-based analyses require that the environment be completely mapped before analysis begins, and are brittle to scene layout changes, precluding their use in novel and/or dynamic environments.
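For reference, the mapping from a 2D image-plane point to a 3D viewing ray in scene-camera coordinates can be sketched as below. This is not the FixTag implementation; it is a minimal pinhole-camera illustration in Python, assuming known intrinsics (focal lengths fx, fy; principal point cx, cy) and ignoring lens distortion.

    import numpy as np

    def pixel_to_ray(u, v, fx, fy, cx, cy):
        """Back-project a pixel (u, v) to a unit viewing ray in camera coordinates.

        Assumes an ideal pinhole model; a real system would undistort the point first.
        """
        K = np.array([[fx, 0.0, cx],
                      [0.0, fy, cy],
                      [0.0, 0.0, 1.0]])
        direction = np.linalg.inv(K) @ np.array([u, v, 1.0])
        return direction / np.linalg.norm(direction)

    # Example: a POR at pixel (320, 240) in a 640x480 scene camera
    ray = pixel_to_ray(320, 240, fx=500.0, fy=500.0, cx=320.0, cy=240.0)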

Our initial impetus for this research was the need for a tool to aid the coding of gaze data from mobile shoppers interacting with products. Because the environment changes every time a product is purchased (or the shopper picks up a product to inspect it), neither FixTag nor SLAM-based solutions were viable. Another application of the tool is in a geoscience research project, in which multiple observers explore a large number of sites. While the background in each scene is static, it isn’t practical to survey each site horizon-to-horizon, and because the scenes include an active instructor and other observers, existing solutions were not suitable for this case.

Figure 1 shows sample frames from the geosciences-project gaze video recorded in an open, natural scene, which contains many irregular objects and other observers. Note that even if it were possible to extract volumes of interest and camera motions within this environment, there would be no mechanism for mapping fixations within volumes into their semantic identities because of the dynamic properties of the scene.

Figure 1 Sample frames from gaze video captured in an outdoor scene

2 Semantic-based Coding

Our goal in developing the SemantiCode tool was to replace location-based coding with semantic-based coding. In 2D location-based coding, the identity of each fixation is defined by the (X,Y) coordinate of the fixation in a scene plane. The 2D scene plane can be extended for dynamic cases such as web pages, providing that any scene motion (i.e., scrolling) is captured for analysis. In 3D location-based coding, fixations are defined by the (X,Y,Z) coordinate of the fixation in scene space, provided that the space is mapped and all objects of interest are placed within the map.

By contrast, in semantic-based coding, a fixation’s identity can be determined independent of its location in a 2D or 3D scene. Rather than basing identity on location, semantic-based coding uses the tools of object recognition to infer a semantic identity for each fixation. A wide range of spectral, spatial, and temporal features can be used in this recognition step. Note that while identity can be determined independent of location in semantic-based coding, location can be retained as a feature in identifying a fixation by integrating location data. Alternatively, a ‘relative location’ feature can be included by incorporating the semantic-based features of the region surrounding the fixated object.
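As a rough illustration of the distinction, a coded fixation under semantic-based coding might be represented as below. The field names are hypothetical rather than SemantiCode’s actual data model; location appears only as an optional auxiliary feature, not as the basis of identity.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class CodedFixation:
        video_name: str
        frame_number: int
        semantic_id: str                                 # e.g. "Distant terrain", "Shampoo A"
        por_xy: Optional[Tuple[float, float]] = None     # optional location feature
        match_score: Optional[float] = None              # similarity to best exemplar

    fix = CodedFixation("site03.avi", 1482, "Distant terrain", por_xy=(312.0, 186.5))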

Fundamental to the design of the SemantiCode Tool is the concept of database training. Training occurs at two levels; the system is first trained by manually coding gaze videos. As each fixation is coded, the features at fixation are captured and stored along with the image region as an exemplar of the semantic identifier. Higher-level training can occur via relative weighting of multiple features, as described in Section 8.
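A minimal sketch of this first level of training, assuming a library structured as a mapping from semantic identifier to a list of exemplars, each holding the image patch and its color histogram. The function name and binning are illustrative, not the tool’s actual internals.

    from collections import defaultdict
    import numpy as np

    # library: semantic identifier -> list of exemplars (patch + features)
    library = defaultdict(list)

    def add_exemplar(library, name, patch, n_bins=8):
        """Store an image patch and its RGB histogram under a semantic identifier."""
        hist, _ = np.histogramdd(patch.reshape(-1, 3),
                                 bins=(n_bins, n_bins, n_bins),
                                 range=((0, 256),) * 3)
        library[name].append({"patch": patch, "hist": hist.ravel()})

    # Example: code a 64x64 fixated region as "Lake"
    patch = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
    add_exemplar(library, "Lake", patch)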

3 Software Overview

SemantiCode was designed as a tool to optimize and optionally automate the process of coding without sacrificing adaptability, robustness, or an immediate mechanism for manual overrides. The software relies on user interaction and feedback, yet in most cases this interaction requires very little training. One major design consideration was scalable operative complexity; this is crucial for research groups who employ undergraduates and other short-term researchers, as it obviates the need for an extended period of operator training. To this end, the graphical user interface (GUI) gives users manual control over every parameter and phase of video analysis and coding, while simultaneously offering default settings and semi-automation that should be applicable to most situations. Assuming previous users have trained a strong library of objects and exemplars, the coding task can be as simple as pressing one key to progress through fixations, resulting in a table of data that associates each fixation with the semantic identity of the fixated region. The training process requires some straightforward manual configuration before this type of usage is possible, but depending on the variety of objects of interest, this can still be achieved in much less time and with significantly less effort than previous manual processes have required.

4 Graphical User Interface

When the user runs SemantiCode for the first time, the first step is to import a video that has been run through POR analysis software. (The examples here were done with Positive Science Yarbus 2.0 software [www.positivescience.com]). Any video with an accompanying text file listing the POR for each video frame can be used. The POR location and time code of each frame are used to automatically segment the video into estimated fixations. Once this is finished, the first fixation frame appears, and coding can proceed.
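The POR file format differs between trackers; the loader below assumes, purely for illustration, a plain text file with one “frame x y” line per video frame rather than the exact Yarbus 2.0 output.

    def load_por(path):
        """Read per-frame point-of-regard data: one 'frame x y' line per video frame."""
        por = []
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) < 3:
                    continue  # skip blank or malformed lines
                frame, x, y = int(parts[0]), float(parts[1]), float(parts[2])
                por.append((frame, x, y))
        return por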

Figure 2 The SemantiCode GUI as it appears after the user has loaded a video and tagged a number of fixations. This usage example represents a scenario wherein a library has just been built. The area on the left side of the interface contains all of the fixation viewer components, while the area on the right is generally devoted to coding, library management, and the display of the top matches from the active library.

An existing algorithm for automatic extraction of fixations [Munn 2009; Rothkopf and Pelz 2004] was modified and embedded within the SemantiCode system. Temporal and spatial constraints on the fixation extraction can be adjusted by the experimenter via the Fixation Definition Adjustment subgui seen in Figure 3. The user is also presented with statistics about the fixations as calculated from the currently selected video. The average fixation duration and the rate of fixations per second can be useful indicators of how well the automatic segmentation has worked for the current video [Salvucci and Goldberg 2000].
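The embedded algorithm follows Munn [2009] and Rothkopf and Pelz [2004]; the sketch below is instead a generic dispersion- and duration-threshold segmentation in the spirit of Salvucci and Goldberg [2000], with the spatial and temporal constraints exposed as adjustable parameters and the summary statistics mentioned above computed from the result. Parameter values are illustrative.

    import numpy as np

    def _dispersion(xs, ys):
        return (xs.max() - xs.min()) + (ys.max() - ys.min())

    def segment_fixations(por, fps=30.0, max_dispersion_px=25.0, min_duration_s=0.1):
        """Dispersion-threshold fixation segmentation (I-DT style).

        por: list of (frame, x, y); returns (start_index, end_index) pairs.
        """
        xs = np.array([p[1] for p in por], dtype=float)
        ys = np.array([p[2] for p in por], dtype=float)
        min_len = max(1, int(round(min_duration_s * fps)))
        fixations = []
        i, n = 0, len(por)
        while i + min_len <= n:
            j = i + min_len
            if _dispersion(xs[i:j], ys[i:j]) > max_dispersion_px:
                i += 1
                continue
            # grow the window while the spatial constraint still holds
            while j < n and _dispersion(xs[i:j + 1], ys[i:j + 1]) <= max_dispersion_px:
                j += 1
            fixations.append((i, j - 1))
            i = j
        return fixations

    def fixation_stats(fixations, fps=30.0, n_frames=None):
        """Average fixation duration (s) and fixations per second of video."""
        durations = [(end - start + 1) / fps for start, end in fixations]
        mean_duration = sum(durations) / len(durations) if durations else 0.0
        rate = len(fixations) / (n_frames / fps) if n_frames else float("nan")
        return mean_duration, rate

    por = [(i, 100.0, 120.0) for i in range(30)]   # 1 s of stable gaze at 30 fps
    print(segment_fixations(por))                  # -> [(0, 29)]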

Figure 3 Fixation Definition Adjustment subgui allows the user to shift the constraints on what may be considered a fixation in order to produce more or fewer fixations.

5 Fixation Analysis

A single frame, extracted from the temporal center of the active fixation in the gaze video, is displayed on the left of the main GUI. Within the frame, a blue box indicates the pixel region considered relevant in all further steps. Beneath the frame, that region’s semantic identifier is shown, if one exists, along with a text display of the progress that has been made in coding the currently selected video. An intuitive control panel allows switching between fixation frames, videos, and projects. Users can navigate fixations manually with a drop-down fixation selector, with the next/previous buttons, or with the right and left arrow keys on the keyboard.

6 Object Coding

The primary purpose of SemantiCode is the attachment of semantic identification information to a set of pixel locations in an eye tracking video. Thus, the actual coding of fixations is a critical functionality in the software. The first time the software is used with video of a new environment, coding begins manually. Users add new objects to the active library by typing in an identifier for the fixated region in the active frame, which can be selected as either 64x64 or 128x128 pixels surrounding the point of regard. With each added object, the image and its histogram are stored in the active library under the specified name. Once a sufficient number of objects have been added to describe the elements of interest in the environment, the user can continue coding by selecting the most appropriate member of the object list. As each frame is tagged with a name, the frame number, video name, and semantic identifier are stored and displayed as coded frames.
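Extracting the fixated region can be sketched as cropping a square patch centered on the POR and clipped at the frame border. The function and clipping behavior are assumptions for illustration.

    import numpy as np

    def extract_patch(frame, por_x, por_y, size=64):
        """Return a size x size pixel region centered on the POR, clipped to the frame."""
        half = size // 2
        h, w = frame.shape[:2]
        x0 = int(max(0, min(round(por_x) - half, w - size)))
        y0 = int(max(0, min(round(por_y) - half, h - size)))
        return frame[y0:y0 + size, x0:x0 + size]

    frame = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder scene frame
    patch = extract_patch(frame, por_x=601.7, por_y=12.3, size=128)
    assert patch.shape[:2] == (128, 128)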

After coding each fixation (either manually or by accepting SemantiCode’s match), data about the fixation and the video from which it was extracted are written to an output data file. With this, statistical analyses can easily be run on the newly added semantic information for each coded fixation.
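A minimal sketch of the per-fixation output record, assuming a comma-separated file; the exact columns SemantiCode writes are not documented here, so these fields are illustrative.

    import csv

    def append_coded_fixation(path, video_name, fixation_index, start_frame,
                              end_frame, semantic_id, match_score):
        """Append one coded fixation to a CSV file for later statistical analysis."""
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([video_name, fixation_index, start_frame,
                                    end_frame, semantic_id, match_score])

    append_coded_fixation("coded_fixations.csv", "site03.avi", 17, 1482, 1499,
                          "Distant terrain", 0.82)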

7 Building a Library

The data structure that underlies SemantiCode is referred to as a library. A library is simply a collection of semantic identifiers, each containing one or more exemplar images collected through the act of coding. When a user runs the software for the first time, an unpopulated default library is automatically created. Users can immediately start adding to this library, which is a persistent data structure that is automatically imported for every subsequent session of the software. The user can create a new blank library, copy an existing library into a new one, merge two or more libraries into one, and delete unwanted libraries.

Alternatively, users can import a pre-existing library, or merge several libraries into one before ever coding a single object. This portability is a major feature, as it means that theoretically for a given environment, manual object coding must only be done once. All subsequent coding tasks, regardless of the user or the location of the software, can be based on a pre-built library of exemplars and object images.
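Merging libraries amounts to pooling exemplar lists per semantic identifier, and persistence can be as simple as serializing that structure to disk. A sketch, with pickle standing in for whatever storage format the tool actually uses.

    import pickle
    from collections import defaultdict

    def merge_libraries(*libraries):
        """Combine several libraries; exemplars under the same identifier are pooled."""
        merged = defaultdict(list)
        for lib in libraries:
            for name, exemplars in lib.items():
                merged[name].extend(exemplars)
        return merged

    def save_library(library, path):
        with open(path, "wb") as f:
            pickle.dump(dict(library), f)

    def load_library(path):
        with open(path, "rb") as f:
            return defaultdict(list, pickle.load(f))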

8 Semantic Recognition

Computer vision usually attempts either to find the location of a known object (“where?”) or to identify an object at a known location (“what?”). In the case of eyetracking, the fixation location is given, so the primary question is, “What is the fixated object, region, or material?” To answer this, the region surrounding the calculated POR is characterized by one or more features. Those features are then compared to the features stored in a library to answer the question posed above.

As our initial method, we used the color-histogram intersection method introduced by Swain and Ballard [1990], in which the count of pixels in each bin of image I’s histogram is compared to the count in the same bin of model M’s histogram:

H(I,M) = \frac{\sum_{j=1}^{n} \min(I_j, M_j)}{\sum_{j=1}^{n} M_j}    (Eq. 1)

where I_j represents the jth bin of the histogram at fixation and M_j is the jth bin of a model’s histogram from the library of exemplars to test against. The denominator is the sum of the model’s histogram, a normalization constant computed once. H(I,M) represents the fractional match value [0, 1] between the fixated region and a model in the library. This has the desirable quality that background colors not present in the object are ignored. The intersection increases only when the same colors are present, and only up to the amount expected in the model. This approach is robust to changes in orientation and scale because it relies only on the color intensity values within the two images being compared. It is also computationally efficient, requiring only n comparisons, n additions, and one division.
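Eq. 1 translates directly into a few lines of array code. The sketch below assumes both histograms were computed over the same binning; normalization is by the model histogram, as in the equation.

    import numpy as np

    def histogram_intersection(image_hist, model_hist):
        """Swain and Ballard [1990] histogram intersection, Eq. 1.

        Returns the fractional match H(I, M) in [0, 1]:
        sum_j min(I_j, M_j) / sum_j M_j.
        """
        image_hist = np.asarray(image_hist, dtype=float).ravel()
        model_hist = np.asarray(model_hist, dtype=float).ravel()
        return np.minimum(image_hist, model_hist).sum() / model_hist.sum()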

The representation of 3D content by 2D views is elegantly handled by the design of the library. Each semantic identifier can contain an arbitrary number of exemplars from any view or scale. Consequently, multiple perspectives are added to the library as they are required. The library is thus adaptively modified to meet the needs of the coding task.

Future work will involve extended feature generation and selection, including alternative and adaptive histogram techniques, and the use of machine-learning algorithms for enhanced object discrimination.

Figure 4 The Examine subgui for a region called “Distant Terrain.” The GUI displays the exemplar and image for each fixation tagged with this name.

Since the current algorithm is not affected by shape or spatial configuration, it is not necessary to segment the region of interest from its background. As a result, irregular environments and observer movement do not degrade performance. Even more compelling is the capacity of this algorithm to accurately match materials and other mass nouns that may not take the form of discrete objects. The ability to automatically identify materials along with objects helps to address a larger issue in the machine-vision field about the salience of uniform material regions.

These factors make the Swain and Ballard [1990] color-histogram method an attractive choice for a highly adaptable and robust form of assisted semantic coding. Testing with just RGB histogram intersections shows great promise. In its current implementation, each time a new fixation frame is shown, SemantiCode matches its histogram against every object in the currently active library, ranks them, and displays the top ten objects in the right panel; for the highest-ranking object, the top three exemplars are also shown.
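Ranking the active library against a new fixation can then be sketched as an exhaustive comparison that scores each identifier by its best-matching exemplar and keeps the ten highest. This mirrors the behavior described above, assuming the library structure from the earlier sketches (identifier mapped to a list of exemplars, each holding a precomputed histogram); names are illustrative.

    import numpy as np

    def rank_library(fixation_hist, library, top_n=10):
        """Score every identifier by its best exemplar match; return the top_n, best first."""
        fixation_hist = np.asarray(fixation_hist, dtype=float).ravel()
        scores = []
        for name, exemplars in library.items():
            best = max(
                np.minimum(fixation_hist, ex["hist"]).sum() / ex["hist"].sum()
                for ex in exemplars
            )
            scores.append((name, best))
        scores.sort(key=lambda item: item[1], reverse=True)
        return scores[:top_n]

    # top_matches = rank_library(fixation_hist, library)   # e.g. [("Lake", 0.81), ...]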

Table 1 shows the results of preliminary tests in a challenging outdoor environment similar to that depicted in Figure 1. Five regions were identified for analysis: Midground terrain, Lighter terrain, Distant terrain, Horizon, and Lake. After initializing the library by coding the first nine fixations within each region, the color-histogram match scores for the tenth fixation in each region were calculated. Recall that SemantiCode performs an exhaustive search of all histograms. Table 1 contains the peak histogram match within each category. In the current implementation, SemantiCode presents the top ten matches to the experimenter. Hitting a single key accepts the top match; any of the next nine can be accepted instead by using the numeric keypad, as seen in Figure 2.

Table 1 Peak histogram match (see text)

                      Midground   Lighter    Distant
                      terrain     terrain    terrain   Horizon   Lake
Midground terrain        81%        52%        26%       38%      55%
Lighter terrain          34%        77%        72%       54%      65%
Distant terrain          45%        65%        82%       58%      71%
Horizon                  14%        30%        39%       60%      55%
Lake                     14%        61%        72%       65%      81%

The next version will allow the experimenter to implement automatic coding when the feature matches are unambiguous. For example, if the top match exceeds a predefined accept parameter (e.g., 80%), and no other matches are closer than a conflict parameter (e.g., 10%) of the top match, the fixation would be coded without experimenter intervention. If either constraint is not met, SemantiCode would revert to suggesting codes and waiting for verification. Table 1 shows that even in the challenging case of a low-contrast outdoor scene with similar spectral signatures, three of the five semantic categories would be coded correctly without user intervention, even with only nine exemplars per region. Note that in this case the semantic label ‘Horizon’ spans two distinct regions, making it a challenge to match. Still, the correct label is the second highest match.
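The acceptance rule described above (accept the top match if it clears an accept threshold and leads the runner-up by at least a conflict margin) can be sketched directly; the parameter values are the examples given in the text, and the function name is illustrative.

    def auto_accept(ranked, accept=0.80, conflict=0.10):
        """Return the top label if the match is unambiguous, otherwise None.

        ranked: list of (name, score) sorted best-first, e.g. from rank_library().
        """
        if not ranked:
            return None
        top_name, top_score = ranked[0]
        if top_score < accept:
            return None
        if len(ranked) > 1 and (top_score - ranked[1][1]) < conflict:
            return None  # runner-up too close: fall back to suggesting codes
        return top_name

    # A clear case: top match 0.86, runner-up 0.55 -> accepted automatically
    print(auto_accept([("Shampoo A", 0.86), ("Shampoo B", 0.55)]))  # -> "Shampoo A"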

To test SemantiCode’s ability to work in various environments, it was also evaluated in a consumer-shopping environment. Six regions were identified for analysis: four shampoos and two personal hygiene products. Histogram matches were calculated as described for the outdoor environment. The indoor environment was less challenging – after training, all six semantic categories could be coded correctly without user intervention with top matches ranging from 74% to 85%.

In the near future, additional image-matching algorithms will be evaluated within the SemantiCode application for their effectiveness in different scene circumstances. Using the results from these evaluations, it will be possible to select the most effective matching approaches for a given scenario.

Match scores can be computed as weighted combinations of outputs from a number of image matching algorithms. Weights, dynamically adjusted by the reinforcement introduced by the experimenter’s manual coding, would allow a given library to be highly tuned to the detection of content that may otherwise be too indistinct for any individual matching technique.
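One way such a scheme could look, assuming a simple reinforcement-style weight update rather than any specific method chosen by the authors: each matching algorithm contributes a score, the combined score is a weighted sum, and a matcher’s weight is nudged up whenever its own top-ranked label agrees with the experimenter’s manual code.

    def combined_score(scores, weights):
        """Weighted combination of per-algorithm match scores for one candidate label.

        scores, weights: dicts keyed by matcher name, e.g. {"rgb_hist": 0.8, "edge_hist": 0.4}.
        """
        total_w = sum(weights.values())
        return sum(weights[m] * s for m, s in scores.items()) / total_w

    def reinforce(weights, per_matcher_top_label, true_label, lr=0.05):
        """Nudge up matchers whose own top-ranked label matched the manual code."""
        for m, predicted in per_matcher_top_label.items():
            if predicted == true_label:
                weights[m] += lr
        total = sum(weights.values())      # renormalize so weights stay comparable
        for m in weights:
            weights[m] /= total
        return weights

    w = {"rgb_hist": 0.5, "edge_hist": 0.5}
    print(combined_score({"rgb_hist": 0.8, "edge_hist": 0.4}, w))        # 0.6
    reinforce(w, {"rgb_hist": "Lake", "edge_hist": "Horizon"}, "Lake")   # boosts rgb_hist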

9 Conclusion

SemantiCode offers a significant improvement over previous approaches to streamlining the coding of eyetracking data. The immediate benefit is a dramatic increase in the efficiency of video coding, and further gains are anticipated with the semi-automated coding described above.

With future improvements and extensibility, SemantiCode promises to become a valuable tool to support attaching semantic identifiers to image content. It will be possible to tune SemantiCode to virtually any environment. By combining the power of database-driven identification with unique matching techniques, it will only be limited by the degree to which it is appropriately trained. It is thus promising both as a tool for evaluating which algorithms are useful in different experimental scenarios, and as an improved practical coding system with which to analyze research data.

10 Acknowledgments

This work was made possible by the generous support of Procter & Gamble and NSF Grant 0909588.

11 Proprietary Information/Conflict of Interest

Invention disclosure and provisional patent protection for the described tools are in process.

References

BUSWELL, G.T. 1935. How People Look at Pictures: A Study of the Psychology of Perception in Art. The University of Chicago Press, Chicago.

JUST, M. A. AND CARPENTER, P. A. 1976. Eye fixations and cognitive processes. Cognitive Psychology, 8, 441-480.

MACKWORTH, N.H. AND MORANDI, A. 1967. The gaze selects informative details within pictures, Perception and Psychophysics, 2, 547–552.

MUNN, S.M. AND PELZ, J.B. 2009. FixTag: An algorithm for identifying and tagging fixations to simplify the analysis of data collected by portable eye trackers. Transactions on Applied Perception, Special Issue on APGV, in press.

ROTHKOPF, C.A. AND PELZ, J.B. 2004. Head movement estimation for wearable eye tracker. In Proceedings of the 2004 Symposium on Eye Tracking Research & Applications (San Antonio, Texas, March 22-24, 2004). ETRA '04. ACM, New York, NY, 123-130.

SALVUCCI, D.D. AND GOLDBERG, J.H. 2000. Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications (Palm Beach Gardens, Florida, November 6-8, 2000). ETRA '00. ACM, New York, NY, 71-78.

SWAIN, M.J. AND BALLARD, D.H. 1990. Indexing via color histograms. In Proceedings of the Third International Conference on Computer Vision.

THRUN, S. AND LEONARD, J. 2008. Simultaneous localization and mapping. In SICILIANO, B. AND KHATIB, O. (Eds.), Springer Handbook of Robotics. Springer, Berlin.

YARBUS, A.L. 1967. Eye Movements and Vision. Plenum Press, New York.
